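Preface
This article tests some of the features described on the official Hudi website. All test data was generated directly by the following code: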
from faker import Faker


def fake_data(faker: Faker, row_num: int):
    """Write row_num fake student rows to a CSV file."""
    file_name = f'/Users/gavin/Desktop/tmp/student_{row_num}_rows.csv'
    with open(file=file_name, mode='w') as file:
        # Header row; the column name "adress" is kept because all later tests use it
        file.write("id,name,age,adress,partition_path\n")
        for _ in range(row_num):
            # Use the faker passed in as a parameter rather than the global my_faker
            file.write(
                f'{faker.iana_id()},{faker.name()},{faker.random_int(min=15, max=25)},{faker.address()},{faker.day_of_week()}\n')


if __name__ == '__main__':
    my_faker = Faker(locale='zh_CN')
    fake_data(my_faker, 100000)
Sample test data:
id | name | age | adress | partition_path |
---|---|---|---|---|
7548525 | 谭娜 | 15 | 黑龙江省广州市白云姚路w座 391301 | Sunday |
5615440 | 金亮 | 19 | 陕西省巢湖县西峰张街N座 711897 | Tuesday |
3887721 | 刘倩 | 21 | 贵州省敏县清浦深圳路A座 116469 | Thursday |
Command to start pyspark with Hudi loaded:
pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
A closer look at Hudi basics
File Layouts
Copy-on-Write
Besides the Parquet files, a Hudi table has a .hoodie folder under the table's root directory that stores the table's metadata. For example:
gavin@GavindeMacBook-Pro hudi_tables % tree -a student_for_pre_validate
student_for_pre_validate
├── .hoodie  # Hudi table metadata, including commit information, marker information, etc.
│   ├── .20220317111613163.commit.requested.crc
│   ├── .20220317111613163.inflight.crc
│   ├── .aux
│   │   └── .bootstrap  # Holds the files used by the bootstrap operation, which converts an existing table into a Hudi table; no bootstrap was run here, so it has no content
│   │       ├── .fileids
│   │       └── .partitions
│   ├── .hoodie.properties.crc
│   ├── .temp
│   │   └── 20220317111613163
│   │       ├── .MARKERS.type.crc
│   │       ├── .MARKERS0.crc
│   │       ├── MARKERS.type
│   │       └── MARKERS0
│   ├── 20220317111613163.commit.requested
│   ├── 20220317111613163.inflight
│   ├── archived  # Directory for archived instants. As writes to the Hudi table continue, the number of instants on the timeline keeps growing; to reduce the load on timeline operations, instants are archived at commit time and removed from the timeline. Our instant count has not yet reached the default of 30, so no archive files have been produced
│   └── hoodie.properties
├── Friday  # partition data
│   ├── ..hoodie_partition_metadata.crc
│   ├── .65792147-0976-4433-91a1-cb9867326bdf-0_0-30-30_20220317111613163.parquet.crc
│   ├── .hoodie_partition_metadata
│   └── 65792147-0976-4433-91a1-cb9867326bdf-0_0-30-30_20220317111613163.parquet
└── Wednesday  # partition data
    ├── ..hoodie_partition_metadata.crc
    ├── .4454a7c0-4e4c-4ef6-b790-e066dd2fc8ca-0_1-30-31_20220317111613163.parquet.crc
    ├── .hoodie_partition_metadata
    └── 4454a7c0-4e4c-4ef6-b790-e066dd2fc8ca-0_1-30-31_20220317111613163.parquet
10 directories, 18 files
gavin@GavindeMacBook-Pro hudi_tables %
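The timeline file names under .hoodie follow a pattern that is easy to pick apart by hand. The following is a minimal sketch, not part of Hudi: the `Instant` tuple and `parse_instant_file` helper are hypothetical names, and the rule that a bare `<timestamp>.inflight` belongs to a commit action is an assumption based only on the listings in this article.

```python
from datetime import datetime
from typing import NamedTuple


class Instant(NamedTuple):
    timestamp: datetime
    action: str
    state: str


def parse_instant_file(name: str) -> Instant:
    """Split a .hoodie timeline file name into (timestamp, action, state)."""
    ts, *rest = name.split(".")
    if rest == ["inflight"]:
        # Assumption: "<ts>.inflight" with no action name is an inflight commit
        action, state = "commit", "inflight"
    elif len(rest) == 1:
        # "<ts>.commit" or "<ts>.replacecommit": the instant has completed
        action, state = rest[0], "completed"
    else:
        # "<ts>.<action>.<state>", e.g. "<ts>.commit.requested"
        action, state = rest[0], rest[1]
    # Instant times are 17 digits: yyyyMMddHHmmss plus milliseconds
    return Instant(datetime.strptime(ts, "%Y%m%d%H%M%S%f"), action, state)


for file_name in ("20220317111613163.commit.requested",
                  "20220317111613163.inflight"):
    print(parse_instant_file(file_name))
```

Run against the listing above, this shows instant 20220317111613163 only in the requested and inflight states, with no completed commit file next to them.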
Merge-on-Read
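See the article "Apache Hudi 从入门到放弃(2): MOR表的文件结构分析" (roughly, "Apache Hudi from Getting Started to Giving Up (2): Analysis of the File Layout of MOR Tables")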
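Information in the commit file
Conclusions:
- Every Parquet file is assigned a fileId when it is created. The fileId is the prefix of the Parquet file name and is also recorded in the commit file. Later modifications only change the timestamp part at the end of the file name; the fileId prefix stays the same
- For every commit, the commit file records "numWrites", "numDeletes", "numUpdateWrites" and "numInserts" for each fileId, along with the file size and other basic information
- The commit file records the mapping between each fileId and its actual file
- The commit file records the table's schema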
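To make the naming scheme concrete, here is a small sketch that splits a base file name into its three underscore-separated parts. The helper `parse_base_file_name` and the field names are hypothetical, chosen for illustration; the middle part is simply labeled `write_token` here.

```python
from typing import NamedTuple


class BaseFileName(NamedTuple):
    file_id: str
    write_token: str
    instant_time: str


def parse_base_file_name(path: str) -> BaseFileName:
    """Split '<fileId>_<writeToken>_<instantTime>.parquet' into its parts."""
    name = path.rsplit("/", 1)[-1]          # drop the partition directory, if any
    suffix = ".parquet"
    if not name.endswith(suffix):
        raise ValueError(f"not a parquet base file: {path}")
    # The fileId contains hyphens but no underscores, so a plain split is enough
    file_id, write_token, instant_time = name[: -len(suffix)].split("_")
    return BaseFileName(file_id, write_token, instant_time)


print(parse_base_file_name(
    "Thursday/9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0_0-29-41_20220316171316850.parquet"))
```

Two versions of the same file group parse to the same `file_id` and differ in `instant_time`, which is the behavior described in the first conclusion.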
Sample data
vi 20220316171316850.commit:
{
  "partitionToWriteStats" : {
    "Thursday" : [ {
      "fileId" : "9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0",
      "path" : "Thursday/9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0_0-29-41_20220316171316850.parquet",
      "prevCommit" : "null",
      "numWrites" : 461,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 461,
      "totalWriteBytes" : 451097,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Thursday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 451097,
      "minEventTime" : null,
      "maxEventTime" : null
    },
    ···
    ···
    ···
    {
      "fileId" : "1efa72c3-a714-46e2-bb91-5019fa6e7ede-0",
      "path" : "Saturday/1efa72c3-a714-46e2-bb91-5019fa6e7ede-0_224-53-265_20220316171316850.parquet",
      "prevCommit" : "null",
      "numWrites" : 210,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 210,
      "totalWriteBytes" : 443162,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Saturday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 443162,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"student_record\",\"namespace\":\"hoodie.student\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"age\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"adress\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition_path\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  },
  "operationType" : "UPSERT",
  "fileIdAndRelativePaths" : {
    "111e0979-9006-441d-9af2-ac9656be4500-0" : "Sunday/111e0979-9006-441d-9af2-ac9656be4500-0_120-47-161_20220316171316850.parquet",
    ...
    ...
    ...
    "a13bf769-7dcb-4aa7-a26f-fd701aa07eaf-0" : "Monday/a13bf769-7dcb-4aa7-a26f-fd701aa07eaf-0_33-47-74_20220316171316850.parquet",
    "9fa086af-e28e-4a3f-9a31-06b658ad514b-0" : "Thursday/9fa086af-e28e-4a3f-9a31-06b658ad514b-0_15-41-56_20220316171316850.parquet"
  },
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalRecordsDeleted" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 36958,
  "totalUpsertTime" : 0,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "Thursday", "Monday", "Friday", "Sunday", "Wednesday", "Tuesday", "Saturday" ]
}
After one upsert:
vi 20220316171648081.commit
{
  "partitionToWriteStats" : {
    "Thursday" : [ {
      "fileId" : "5540e2fd-bc18-42db-a831-f72a6d7eb603-0",
      "path" : "Thursday/5540e2fd-bc18-42db-a831-f72a6d7eb603-0_0-29-492_20220316171648081.parquet",
      "prevCommit" : "20220316171316850",
      "numWrites" : 459,
      "numDeletes" : 0,
      "numUpdateWrites" : 1,
      "numInserts" : 0,
      "totalWriteBytes" : 450943,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Thursday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 450943,
      "minEventTime" : null,
      "maxEventTime" : null
    },
    ···
    ···
    ···
    {
      "fileId" : "d02425d8-0216-4a3b-9810-b613d80cd60f-0",
      "path" : "Saturday/d02425d8-0216-4a3b-9810-b613d80cd60f-0_433-53-925_20220316171648081.parquet",
      "prevCommit" : "null",
      "numWrites" : 84,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 84,
      "totalWriteBytes" : 439040,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Saturday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 439040,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"student_record\",\"namespace\":\"hoodie.student\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"age\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"adress\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition_path\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  },
  "operationType" : "UPSERT",
  "fileIdAndRelativePaths" : {
    "0d288e1e-f593-4782-95de-0583c4cd286b-0" : "Saturday/0d288e1e-f593-4782-95de-0583c4cd286b-0_415-53-907_20220316171648081.parquet",
    ···
    ···
    ···
    "cc32046a-55b1-4b2b-be93-3225e42154b7-0" : "Saturday/cc32046a-55b1-4b2b-be93-3225e42154b7-0_211-53-703_20220316171648081.parquet"
  },
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "writePartitionPaths" : [ "Thursday", "Monday", "Friday", "Sunday", "Wednesday", "Tuesday", "Saturday" ],
  "totalRecordsDeleted" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 31116,
  "totalUpsertTime" : 37426,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  }
}
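Since a commit file is plain JSON, its per-file counters can be rolled up with the standard library. The sketch below is illustrative only: `summarize_commit` is a hypothetical helper, and `SAMPLE` is a heavily trimmed stand-in that reuses two entries from the second commit, with the Saturday entry placed under a "Saturday" key according to its `partitionPath`.

```python
import json
from collections import defaultdict

COUNTERS = ("numInserts", "numUpdateWrites", "numDeletes")


def summarize_commit(commit: dict) -> dict:
    """Aggregate file count and record counters per partition."""
    summary = defaultdict(lambda: {"files": 0, **{c: 0 for c in COUNTERS}})
    for partition, stats in commit["partitionToWriteStats"].items():
        for stat in stats:
            row = summary[partition]
            row["files"] += 1
            for counter in COUNTERS:
                row[counter] += stat[counter]
    return dict(summary)


# Trimmed sample; a real commit file carries many more fields and entries
SAMPLE = json.loads("""
{
  "partitionToWriteStats": {
    "Thursday": [{"fileId": "5540e2fd-bc18-42db-a831-f72a6d7eb603-0",
                  "numWrites": 459, "numDeletes": 0,
                  "numUpdateWrites": 1, "numInserts": 0}],
    "Saturday": [{"fileId": "d02425d8-0216-4a3b-9810-b613d80cd60f-0",
                  "numWrites": 84, "numDeletes": 0,
                  "numUpdateWrites": 0, "numInserts": 84}]
  },
  "operationType": "UPSERT"
}
""")

for partition, row in summarize_commit(SAMPLE).items():
    print(partition, row)
```

For a real table, the same function can be fed `json.load(f)` on a completed `.commit` file from the .hoodie directory.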
How data files change during an upsert
Conclusion: after an upsert, Hudi adds a new version of the data file that contains both the existing records and the new records; the previous version of the file is left unchanged
Test code
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext
    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_3_rows.csv'
    # No schema inference, so every CSV column is read as a string
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }
    # The write operation is not set, so it defaults to upsert
    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)
The existing data:
gavin@GavindeMacBook-Pro tmp % cat student_5_rows.csv
id,name,age,adress,partition_path
4169306,邵晶,17,广西壮族自治区楠县花溪张路H座 932045,Saturday
4345298,陈海燕,15,内蒙古自治区郑州市浔阳石家庄街d座 725757,Wednesday
1759335,杨波,16,贵州省上海县平山程街s座 255034,Thursday
3141294,毛秀兰,17,浙江省海燕县东城石家庄街O座 459489,Saturday
2580276,王凤兰,22,宁夏回族自治区兴安盟县永川唐路A座 437666,Wednesday
The upsert data:
gavin@GavindeMacBook-Pro tmp % cat student_3_rows.csv
id,name,age,adress,partition_path
7548525,谭娜,15,黑龙江省广州市白云姚路w座 391301,Sunday
5615440,金亮,19,陕西省巢湖县西峰张街N座 711897,Tuesday
3887721,刘倩,21,贵州省敏县清浦深圳路A座 116469,Thursday
After the upsert, new data is added to the "Thursday" partition
# Before the upsert
gavin@GavindeMacBook-Pro Thursday % ll
total 856
-rw-r--r-- 1 gavin wheel 435628 Mar 16 11:14 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet
gavin@GavindeMacBook-Pro Thursday % pwd
/tmp/hudi_base_path/Thursday
# After the upsert, a new parquet file has been added to the partition
gavin@GavindeMacBook-Pro Thursday % ll
total 1712
-rw-r--r-- 1 gavin wheel 435628 Mar 16 11:14 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet
-rw-r--r-- 1 gavin wheel 435051 Mar 16 11:21 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316112130171.parquet
gavin@GavindeMacBook-Pro Thursday %
Inspecting the contents of the Parquet files
# File from before the upsert
>>> spark.read.parquet('/tmp/hudi_base_path/Thursday/53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet').show()
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|age| adress|partition_path|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
| 20220316111454901|20220316111454901...| 1759335| Thursday|53188680-ecdf-4b0...|1759335|杨波| 16|贵州省上海县平山程街s座 255034| Thursday|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
# File generated by the upsert
>>> spark.read.parquet('/tmp/hudi_base_path/Thursday/53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316112130171.parquet').show()
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|age| adress|partition_path|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
| 20220316111454901|20220316111454901...| 1759335| Thursday|53188680-ecdf-4b0...|1759335|杨波| 16|贵州省上海县平山程街s座 255034| Thursday|
| 20220316112130171|20220316112130171...| 3887721| Thursday|53188680-ecdf-4b0...|3887721|刘倩| 21|贵州省敏县清浦深圳路A座 116469| Thursday|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+------------------------------+--------------+
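Effect of manually deleting a historical version's data file
Conclusions: 1. Deleting the data file of a historical version does not affect the data of other versions; 2. Querying the version whose data file was deleted does not raise an error, but the result is missing the deleted portion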
gavin@GavindeMacBook-Pro .hoodie % ll
total 40
-rw-r--r-- 1 gavin wheel 3735 Mar 16 11:14 20220316111454901.commit
-rw-r--r-- 1 gavin wheel 0 Mar 16 11:14 20220316111454901.commit.requested
-rw-r--r-- 1 gavin wheel 2486 Mar 16 11:14 20220316111454901.inflight
-rw-r--r-- 1 gavin wheel 3730 Mar 16 11:21 20220316112130171.commit
-rw-r--r-- 1 gavin wheel 0 Mar 16 11:21 20220316112130171.commit.requested
-rw-r--r-- 1 gavin wheel 2478 Mar 16 11:21 20220316112130171.inflight
drwxr-xr-x 2 gavin wheel 64 Mar 16 11:14 archived
-rw-r--r-- 1 gavin wheel 593 Mar 16 11:14 hoodie.properties
gavin@GavindeMacBook-Pro .hoodie %
# Row count as of "Mar 16 11:21"
>>> spark.read.format('hudi').option('as.of.instant','20220316112130171').load('/tmp/hudi_base_path').count()
8
# Row count as of "Mar 16 11:14"
>>> spark.read.format('hudi').option('as.of.instant','20220316111454901').load('/tmp/hudi_base_path').count()
5
>>>
gavin@GavindeMacBook-Pro Thursday % ll
total 1712
-rw-r--r-- 1 gavin wheel 435628 Mar 16 11:14 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet
-rw-r--r-- 1 gavin wheel 435051 Mar 16 11:21 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316112130171.parquet
gavin@GavindeMacBook-Pro Thursday % rm 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet
# The command above deleted the data file of the "Mar 16 11:14" historical version
gavin@GavindeMacBook-Pro Thursday % ll
total 856
-rw-r--r-- 1 gavin wheel 435051 Mar 16 11:21 53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316112130171.parquet
gavin@GavindeMacBook-Pro Thursday %
# Deleting the historical version does not affect queries on the latest version
>>> spark.read.format('hudi').option('as.of.instant','20220316112130171').load('/tmp/hudi_base_path').show()
+-------------------+--------------------+------------------+----------------------+--------------------+-------+------+---+------------------------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id| name|age| adress|partition_path|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+------+---+------------------------------------+--------------+
| 20220316112130171|20220316112130171...| 7548525| Sunday|c297b5a9-128f-488...|7548525| 谭娜| 15| 黑龙江省广州市白云姚路w座 391301| Sunday|
| 20220316111454901|20220316111454901...| 4345298| Wednesday|d78b45e2-0d97-470...|4345298|陈海燕| 15|内蒙古自治区郑州市浔阳石家庄街d座...| Wednesday|
| 20220316111454901|20220316111454901...| 2580276| Wednesday|d78b45e2-0d97-470...|2580276|王凤兰| 22|宁夏回族自治区兴安盟县永川唐路A座...| Wednesday|
| 20220316111454901|20220316111454901...| 3141294| Saturday|8868d778-2ffd-461...|3141294|毛秀兰| 17| 浙江省海燕县东城石家庄街O座 45...| Saturday|
| 20220316111454901|20220316111454901...| 4169306| Saturday|8868d778-2ffd-461...|4169306| 邵晶| 17| 广西壮族自治区楠县花溪张路H座 9...| Saturday|
| 20220316112130171|20220316112130171...| 5615440| Tuesday|13fc6b03-48f1-414...|5615440| 金亮| 19| 陕西省巢湖县西峰张街N座 711897| Tuesday|
| 20220316111454901|20220316111454901...| 1759335| Thursday|53188680-ecdf-4b0...|1759335| 杨波| 16| 贵州省上海县平山程街s座 255034| Thursday|
| 20220316112130171|20220316112130171...| 3887721| Thursday|53188680-ecdf-4b0...|3887721| 刘倩| 21| 贵州省敏县清浦深圳路A座 116469| Thursday|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+------+---+------------------------------------+--------------+
>>> spark.read.format('hudi').option('as.of.instant','20220316112130171').load('/tmp/hudi_base_path').count()
8
>>>
# But when querying the version whose data file was deleted, the deleted portion is missing
>>> spark.read.format('hudi').option('as.of.instant','20220316111454901').load('/tmp/hudi_base_path').count()
4
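One way to reason about these counts is a simplified model of how a point-in-time read picks files: for each fileId, take the newest base file whose instant is not later than the requested instant. The sketch below implements only that model over file names; `visible_files` is a hypothetical helper and this is not Hudi's actual file-system view code.

```python
SUFFIX = ".parquet"


def visible_files(file_names, as_of):
    """Per fileId, pick the newest base file with instant <= as_of."""
    latest = {}
    for name in file_names:
        file_id, _write_token, tail = name.split("_")
        instant = tail[: -len(SUFFIX)]
        # Instants are fixed-width digit strings, so string order is time order
        if instant <= as_of and instant > latest.get(file_id, ("", ""))[0]:
            latest[file_id] = (instant, name)
    return {file_id: name for file_id, (_, name) in latest.items()}


OLD = "53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316111454901.parquet"
NEW = "53188680-ecdf-4b06-9e59-b59c33ab37fd-0_0-29-31_20220316112130171.parquet"

# Both versions on disk: each instant resolves to its own file
print(visible_files([OLD, NEW], "20220316112130171"))
print(visible_files([OLD, NEW], "20220316111454901"))
# OLD removed by hand: nothing qualifies for the earlier instant
print(visible_files([NEW], "20220316111454901"))
```

Under this model the later instant never needs the old file, while the earlier instant has no file left in the Thursday partition after the deletion, which is consistent with the count dropping from 5 to 4 (Thursday held one record at that instant).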
Processing before data is inserted
Verifying precombine.field
Conclusion: before the data is actually written, if the incoming batch contains records with the same key, Hudi compares their "precombine.field" values and inserts the record with the largest value as the new data
Test code
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext
    tableName = "student_mor"
    basePath = "file:///tmp/hudi_test/student_precombine_validate"
    csv_path = '/Users/gavin/Desktop/tmp/student_3_rows.csv'
    # No schema inference, so every CSV column (including age) is read as a string
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        # Records sharing the same id are pre-combined on this field
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }
    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)
Parameters involved
-
hoodie.datasource.write.precombine.field
Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Default Value: ts (Optional)
Config Param: READ_PRE_COMBINE_FIELD
Test data (existing)
id | name | age | adress | partition_path |
---|---|---|---|---|
7548525 | 谭娜 | 15 | 黑龙江省广州市白云姚路w座 391301 | Sunday |
5615440 | 金亮 | 19 | 陕西省巢湖县西峰张街N座 711897 | Tuesday |
3887721 | 刘倩 | 21 | 贵州省敏县清浦深圳路A座 116469 | Thursday |
Test data (incremental)
id | name | age | adress | partition_path |
---|---|---|---|---|
5615440 | 金亮 | 25 | 陕西省巢湖县西峰张街N座 711897 | Tuesday |
5615440 | 金亮 | 27 | 陕西省巢湖县西峰张街N座 711897 | Tuesday |
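Query results
After upserting the incremental data, the "age" value for "金亮" in the table changed from 19 to 27, not 25 (this also demonstrates the uniqueness of the record key, since the update is applied by record key)
======== The table contains [3] rows in total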
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+--------------------------------+--------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| id|name|age| adress|partition_path|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+--------------------------------+--------------+
| 20220321114056501|20220321114056501...| 3887721| Thursday|9efb9ef5-3339-4f3...|3887721|刘倩| 21| 贵州省敏县清浦深圳路A座 116469| Thursday|
| 20220321114056501|20220321114056501...| 7548525| Sunday|579317c9-c569-457...|7548525|谭娜| 15|黑龙江省广州市白云姚路w座 391301| Sunday|
| 20220321114824576|20220321114824576...| 5615440| Tuesday|131332e9-874d-435...|5615440|金亮| 27| 陕西省巢湖县西峰张街N座 711897| Tuesday|
+-------------------+--------------------+------------------+----------------------+--------------------+-------+----+---+--------------------------------+--------------+
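The pre-combine rule can be mimicked in a few lines of plain Python. This is a sketch of the rule as described in the configuration text (largest precombine value per key wins), not Hudi's implementation; `precombine` is a hypothetical helper and its tie handling (first record wins) is an arbitrary choice. Because the CSV is read without schema inference, `age` is a string column here (the schema stored in the commit file shown earlier lists it as string), so the sketch compares strings as well.

```python
def precombine(records, key_field, precombine_field):
    """Keep, for each key, the record with the largest precombine value."""
    winners = {}
    for record in records:
        key = record[key_field]
        current = winners.get(key)
        if current is None or record[precombine_field] > current[precombine_field]:
            winners[key] = record
    return list(winners.values())


# The incremental batch from the test above
batch = [
    {"id": "5615440", "name": "金亮", "age": "25"},
    {"id": "5615440", "name": "金亮", "age": "27"},
]
print(precombine(batch, "id", "age"))
```

One caveat of string comparison: "9" sorts after "10", so a string-typed precombine field would presumably only behave like a numeric one while all values have the same number of digits, as they do with ages between 15 and 27 here.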
Hudi features in practice
PS: all of the following tests were run on "Copy-on-Write" tables
Controlling the number of small files and file size during upsert
**Conclusion:** Hudi tries to keep Parquet file sizes between "hoodie.parquet.small.file.limit" and "hoodie.parquet.max.file.size", but files are not grown toward the maximum size; the behavior looks more like Hudi first makes sure files reach the small-file limit
Parameters involved
-
hoodie.parquet.max.file.size
Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.
Default Value: 125829120 (Optional)
Config Param: PARQUET_MAX_FILE_SIZE
-
hoodie.parquet.small.file.limit
During upsert operation, we opportunistically expand existing small files on storage, instead of writing new files, to keep number of files to an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a small file. By default, treat any file <= 100MB as a small file.
Default Value: 104857600 (Optional)
Config Param: PARQUET_SMALL_FILE_LIMIT
import os
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext
    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_30000_rows.csv'
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    # csv_df.show()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        # 'hoodie.datasource.write.operation': 'insert',  # when not set, the default is upsert
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
        'hoodie.parquet.max.file.size': 1024 * 1024 * 13,    # 13 MiB target file size
        'hoodie.parquet.small.file.limit': 1024 * 1024 * 1   # files below 1 MiB count as small
    }
    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)
Test results
gavin@GavindeMacBook-Pro hudi_base_path % du -sh ./*
872K ./Friday
872K ./Monday
868K ./Saturday
872K ./Sunday
868K ./Thursday
868K ./Tuesday
876K ./Wednesday
gavin@GavindeMacBook-Pro hudi_base_path % cd Tuesday
gavin@GavindeMacBook-Pro Tuesday % ll
total 1704
-rw-r--r-- 1 gavin wheel 872301 Mar 16 13:35 4af17600-ed3d-4765-9d7a-0fd87ef19afc-0_5-41-46_20220316133513897.parquet
gavin@GavindeMacBook-Pro Tuesday % du -sh ./*
852K ./4af17600-ed3d-4765-9d7a-0fd87ef19afc-0_5-41-46_20220316133513897.parquet
gavin@GavindeMacBook-Pro Tuesday %
# After first deleting all existing data and reloading 100,000 rows, with the size range set to "1024*1024*1 ~ 1024*1024*13":
gavin@GavindeMacBook-Pro Tuesday % ll
total 2568
-rw-r--r-- 1 gavin wheel 845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r-- 1 gavin wheel 465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
# upsert 50,000 rows
gavin@GavindeMacBook-Pro Tuesday % ll -rt
total 5544
-rw-r--r-- 1 gavin wheel 465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
-rw-r--r-- 1 gavin wheel 845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r-- 1 gavin wheel 465589 Mar 16 13:55 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_8-41-64_20220316135452779.parquet
-rw-r--r-- 1 gavin wheel 1055128 Mar 16 13:55 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_9-41-65_20220316135452779.parquet
# upsert 30,000 rows
gavin@GavindeMacBook-Pro Tuesday % ll -rt
total 8792
-rw-r--r-- 1 gavin wheel 465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
-rw-r--r-- 1 gavin wheel 845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r-- 1 gavin wheel 465589 Mar 16 13:55 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_8-41-64_20220316135452779.parquet
-rw-r--r-- 1 gavin wheel 1055128 Mar 16 13:55 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_9-41-65_20220316135452779.parquet
-rw-r--r-- 1 gavin wheel 603980 Mar 16 14:02 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_13-41-69_20220316140250440.parquet
-rw-r--r-- 1 gavin wheel 1055175 Mar 16 14:02 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_7-41-63_20220316140250440.parquet
gavin@GavindeMacBook-Pro Tuesday %
# upsert 300,000 rows
gavin@GavindeMacBook-Pro Tuesday % ll -rt
total 14504
-rw-r--r-- 1 gavin wheel 465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
-rw-r--r-- 1 gavin wheel 845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r-- 1 gavin wheel 465589 Mar 16 13:55 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_8-41-64_20220316135452779.parquet
-rw-r--r-- 1 gavin wheel 1055128 Mar 16 13:55 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_9-41-65_20220316135452779.parquet
-rw-r--r-- 1 gavin wheel 603980 Mar 16 14:02 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_13-41-69_20220316140250440.parquet
-rw-r--r-- 1 gavin wheel 1055175 Mar 16 14:02 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_7-41-63_20220316140250440.parquet
-rw-r--r-- 1 gavin wheel 1055760 Mar 16 14:16 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_11-41-77_20220316141641791.parquet
-rw-r--r-- 1 gavin wheel 1864068 Mar 16 14:16 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_10-41-76_20220316141641791.parquet
gavin@GavindeMacBook-Pro Tuesday %
# upsert 3,000,000 rows
gavin@GavindeMacBook-Pro Tuesday % ll -rt
total 47680
-rw-r--r-- 1 gavin wheel 465912 Mar 16 13:52 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_11-41-52_20220316135246436.parquet
-rw-r--r-- 1 gavin wheel 845757 Mar 16 13:52 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_10-41-51_20220316135246436.parquet
-rw-r--r-- 1 gavin wheel 465589 Mar 16 13:55 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_8-41-64_20220316135452779.parquet
-rw-r--r-- 1 gavin wheel 1055128 Mar 16 13:55 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_9-41-65_20220316135452779.parquet
-rw-r--r-- 1 gavin wheel 603980 Mar 16 14:02 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_13-41-69_20220316140250440.parquet
-rw-r--r-- 1 gavin wheel 1055175 Mar 16 14:02 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_7-41-63_20220316140250440.parquet
-rw-r--r-- 1 gavin wheel 1055760 Mar 16 14:16 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_11-41-77_20220316141641791.parquet
-rw-r--r-- 1 gavin wheel 1864068 Mar 16 14:16 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_10-41-76_20220316141641791.parquet
-rw-r--r-- 1 gavin wheel 1059910 Mar 16 14:35 3eb68abd-8cb2-4d56-bc10-12f04d14bd40-0_11-41-89_20220316143328401.parquet
-rw-r--r-- 1 gavin wheel 1873786 Mar 16 14:35 dd4aa735-475a-46f2-946f-e6bfe3dc933f-0_10-41-88_20220316143328401.parquet
-rw-r--r-- 1 gavin wheel 3595929 Mar 16 14:36 33eb95a1-db3f-425a-ac55-8dba4555b911-0_23-41-101_20220316143328401.parquet
-rw-r--r-- 1 gavin wheel 9385079 Mar 16 14:36 8762d8e1-1868-4c60-9515-a9df1214f328-0_22-41-100_20220316143328401.parquet
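To read the listings against the two thresholds, the sketch below classifies the Tuesday file sizes right after the initial 100,000-row load and after the final upsert. `classify` is a hypothetical helper; treating "small" as strictly below the limit follows the wording "file size limit below which", and the boundary case is an assumption. File IDs are shortened to their first block for readability.

```python
SMALL_FILE_LIMIT = 1024 * 1024 * 1   # hoodie.parquet.small.file.limit used in the test
MAX_FILE_SIZE = 1024 * 1024 * 13     # hoodie.parquet.max.file.size used in the test


def classify(size_bytes: int) -> str:
    """Place a file size relative to the two configured thresholds."""
    if size_bytes < SMALL_FILE_LIMIT:
        return "small"        # candidate for receiving new inserts on upsert
    if size_bytes <= MAX_FILE_SIZE:
        return "in-range"
    return "oversized"


# Latest file version per file group, sizes taken from the listings above
after_initial_load = {"3eb68abd": 845757, "dd4aa735": 465912}
after_last_upsert = {"3eb68abd": 1059910, "dd4aa735": 1873786,
                     "33eb95a1": 3595929, "8762d8e1": 9385079}

for label, files in (("initial load", after_initial_load),
                     ("last upsert", after_last_upsert)):
    for file_id, size in files.items():
        print(f"{label:>12}  {file_id}  {size:>8}  {classify(size)}")
```

Both initial files fall under the 1 MiB limit, while after the last upsert every latest file version sits between the two thresholds and none comes close to the 13 MiB target, which matches the conclusion at the top of this section.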
Testing the Clustering feature
Conclusion: with inline clustering enabled and "hoodie.clustering.inline.max.commits" set, clustering is triggered once the number of commits reaches that value
Parameters involved
-
hoodie.clustering.inline
Turn on inline clustering - clustering will be run after each write operation is complete
Default Value: false (Optional)
Config Param: INLINE_CLUSTERING
Since Version: 0.7.0
-
hoodie.clustering.inline.max.commits
Config to control frequency of clustering planning
Default Value: 4 (Optional)
Config Param: INLINE_CLUSTERING_MAX_COMMITS
Since Version: 0.7.0
-
hoodie.clustering.plan.strategy.target.file.max.bytes
Each group can produce ‘N’ (CLUSTERING_MAX_GROUP_SIZE/CLUSTERING_TARGET_FILE_SIZE) output file groups
Default Value: 1073741824 (Optional)
Config Param: PLAN_STRATEGY_TARGET_FILE_MAX_BYTES
Since Version: 0.7.0
-
hoodie.clustering.plan.strategy.small.file.limit
Files smaller than the size specified here are candidates for clustering
Default Value: 629145600 (Optional)
Config Param: PLAN_STRATEGY_SMALL_FILE_LIMIT
Since Version: 0.7.0
-
hoodie.clustering.plan.strategy.sort.columns
Columns to sort the data by when clustering
Default Value: N/A (required)
Config Param: PLAN_STRATEGY_SORT_COLUMNS
Since Version: 0.7.0
Test code
import os
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext
    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_100000_rows.csv'
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options_for_clustering = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
        'hoodie.clustering.inline': 'true',
        'hoodie.clustering.inline.max.commits': 3,
        'hoodie.clustering.plan.strategy.target.file.max.bytes': 1024 * 1024 * 10,  # 10M
        'hoodie.clustering.plan.strategy.small.file.limit': 1024 * 500 * 1,  # 500K
        'hoodie.clustering.plan.strategy.sort.columns': 'id',
        'hoodie.parquet.max.file.size': 1024 * 450 * 1,  # 450K
        'hoodie.parquet.small.file.limit': 0
    }
    csv_df.write.format("hudi"). \
        options(**hudi_options_for_clustering). \
        mode("append"). \
        save(basePath)
Commit files
gavin@GavindeMacBook-Pro /tmp % ll hudi_base_path/.hoodie
total 2640
-rw-r--r-- 1 gavin wheel 200993 Mar 16 17:13 20220316171316850.commit
-rw-r--r-- 1 gavin wheel 0 Mar 16 17:13 20220316171316850.commit.requested
-rw-r--r-- 1 gavin wheel 5100 Mar 16 17:13 20220316171316850.inflight
-rw-r--r-- 1 gavin wheel 308305 Mar 16 17:15 20220316171506014.commit
-rw-r--r-- 1 gavin wheel 0 Mar 16 17:15 20220316171506014.commit.requested
-rw-r--r-- 1 gavin wheel 92341 Mar 16 17:15 20220316171506014.inflight
-rw-r--r-- 1 gavin wheel 389328 Mar 16 17:17 20220316171648081.commit
-rw-r--r-- 1 gavin wheel 0 Mar 16 17:16 20220316171648081.commit.requested
-rw-r--r-- 1 gavin wheel 157952 Mar 16 17:17 20220316171648081.inflight
-rw-r--r-- 1 gavin wheel 60710 Mar 16 17:18 20220316171751882.replacecommit
-rw-r--r-- 1 gavin wheel 0 Mar 16 17:17 20220316171751882.replacecommit.inflight
-rw-r--r-- 1 gavin wheel 114687 Mar 16 17:17 20220316171751882.replacecommit.requested
drwxr-xr-x 2 gavin wheel 64 Mar 16 17:13 archived
-rw-r--r-- 1 gavin wheel 593 Mar 16 17:13 hoodie.properties
gavin@GavindeMacBook-Pro /tmp %
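The listing shows three completed commits followed by a replacecommit, which fits `hoodie.clustering.inline.max.commits` being 3. The sketch below counts completed commits since the last completed replacecommit from file names alone. It is a simplified model of the trigger condition, not Hudi's scheduling code; `commits_since_last_replace` is a hypothetical helper, and it assumes every replacecommit on this timeline comes from clustering.

```python
MAX_COMMITS = 3  # hoodie.clustering.inline.max.commits used in the test

TIMELINE = [
    "20220316171316850.commit",
    "20220316171316850.commit.requested",
    "20220316171316850.inflight",
    "20220316171506014.commit",
    "20220316171506014.commit.requested",
    "20220316171506014.inflight",
    "20220316171648081.commit",
    "20220316171648081.commit.requested",
    "20220316171648081.inflight",
    "20220316171751882.replacecommit",
    "20220316171751882.replacecommit.inflight",
    "20220316171751882.replacecommit.requested",
    "archived",
    "hoodie.properties",
]


def commits_since_last_replace(timeline_files):
    """Count completed commits after the most recent completed replacecommit."""
    completed = sorted(
        tuple(name.split("."))
        for name in timeline_files
        if name.count(".") == 1
        and name.split(".")[1] in ("commit", "replacecommit")
    )
    count = 0
    for _instant, action in completed:
        count = 0 if action == "replacecommit" else count + 1
    return count


before_clustering = [f for f in TIMELINE if not f.startswith("20220316171751882")]
print(commits_since_last_replace(before_clustering) >= MAX_COMMITS)
print(commits_since_last_replace(TIMELINE))
```

In this model the counter reaches the threshold right after the third commit and resets once the replacecommit completes.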
For the first 3 commits, the Parquet files produced by each commit stay between 400K and 500K ("hoodie.parquet.max.file.size" is set to 450K);
gavin@GavindeMacBook-Pro Friday % ll -rt
total 139600
# First commit
-rw-r--r-- 1 gavin wheel 449713 Mar 16 17:13 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_68-47-109_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451069 Mar 16 17:13 6ffa4af8-cf3f-41e5-9225-3f82cabd3416-0_70-47-111_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 449926 Mar 16 17:13 e65fbef1-5499-41cc-b956-236f1f070e4d-0_65-47-106_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450787 Mar 16 17:13 5609e5fd-8eef-41ef-8fc4-3c6c711a6199-0_67-47-108_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 449952 Mar 16 17:13 9ee1e33f-d28f-443d-88cb-59e841e927a8-0_69-47-110_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 452176 Mar 16 17:13 732e4ddd-ed9a-4c6b-9f59-50ce32f9706d-0_64-47-105_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450516 Mar 16 17:13 dbfcfef3-ddde-41d9-82e4-12b961962c7b-0_66-47-107_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451607 Mar 16 17:13 873f0e77-75d2-41f6-b5cd-e3fa6e7d69b8-0_71-47-112_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451974 Mar 16 17:13 c500a137-757a-4355-90fb-0a38e17b215c-0_72-47-113_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450860 Mar 16 17:13 070bf517-195c-48a0-b0f1-423a4a482592-0_74-47-115_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450590 Mar 16 17:13 1eb137ab-c52e-47f2-ac4b-a25ad2fe5aae-0_73-47-114_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 452988 Mar 16 17:13 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_75-47-116_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450829 Mar 16 17:13 20f468b9-afe3-49d8-905b-712c9f9fd441-0_77-47-118_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451479 Mar 16 17:13 4a173caf-18dc-420c-9d61-4dc6ab366845-0_76-47-117_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 449907 Mar 16 17:13 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_81-47-122_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451221 Mar 16 17:13 cb6fac3b-83bc-483b-92c6-c8aca859c4bc-0_84-47-125_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450260 Mar 16 17:13 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_83-47-124_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451619 Mar 16 17:13 034163f0-823c-42f0-b109-6282d7dab628-0_79-47-120_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450229 Mar 16 17:13 bc5e69a0-133e-42ab-bca5-887d7ed200e8-0_82-47-123_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451089 Mar 16 17:13 f19e5203-af28-4af3-9bf3-de75f4ac9494-0_80-47-121_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450510 Mar 16 17:13 3fc10e56-cf07-447f-a209-22f5e92b4351-0_78-47-119_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450287 Mar 16 17:13 2d6b1e2b-6336-4be5-be19-cc272c3fa62c-0_89-47-130_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450477 Mar 16 17:13 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_87-47-128_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451361 Mar 16 17:13 48f19ef7-7062-4368-9def-9b25d1578ac0-0_85-47-126_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 449939 Mar 16 17:13 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_88-47-129_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450747 Mar 16 17:13 81d27725-d4f2-47ee-a4c6-553c467df7a3-0_86-47-127_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 452180 Mar 16 17:13 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_94-47-135_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 452128 Mar 16 17:13 1716c74e-a8af-41bd-8bea-ecd521758ea6-0_93-47-134_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450150 Mar 16 17:13 4a4f5e05-77e4-4787-859f-fbd3fc27ead8-0_92-47-133_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451382 Mar 16 17:13 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_91-47-132_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 446216 Mar 16 17:13 0a1acb0e-a35f-4f8e-a3a9-d03c02423ac6-0_95-47-136_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451804 Mar 16 17:13 1f63cc38-5827-4fde-8d42-4b51e3907cdb-0_90-47-131_20220316171316850.parquet
# Second commit
-rw-r--r-- 1 gavin wheel 450421 Mar 16 17:15 81d27725-d4f2-47ee-a4c6-553c467df7a3-0_35-47-304_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449626 Mar 16 17:15 9ee1e33f-d28f-443d-88cb-59e841e927a8-0_36-47-305_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451791 Mar 16 17:15 1716c74e-a8af-41bd-8bea-ecd521758ea6-0_38-47-307_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450152 Mar 16 17:15 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_37-47-306_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449382 Mar 16 17:15 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_44-47-313_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451295 Mar 16 17:15 873f0e77-75d2-41f6-b5cd-e3fa6e7d69b8-0_40-47-309_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451052 Mar 16 17:15 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_48-47-317_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450520 Mar 16 17:15 070bf517-195c-48a0-b0f1-423a4a482592-0_47-47-316_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449577 Mar 16 17:15 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_39-47-308_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450254 Mar 16 17:15 1eb137ab-c52e-47f2-ac4b-a25ad2fe5aae-0_43-47-312_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450888 Mar 16 17:15 cb6fac3b-83bc-483b-92c6-c8aca859c4bc-0_45-47-314_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451649 Mar 16 17:15 c500a137-757a-4355-90fb-0a38e17b215c-0_46-47-315_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449829 Mar 16 17:15 4a4f5e05-77e4-4787-859f-fbd3fc27ead8-0_42-47-311_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451847 Mar 16 17:15 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_41-47-310_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450199 Mar 16 17:15 dbfcfef3-ddde-41d9-82e4-12b961962c7b-0_50-47-319_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451284 Mar 16 17:15 034163f0-823c-42f0-b109-6282d7dab628-0_49-47-318_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451149 Mar 16 17:15 4a173caf-18dc-420c-9d61-4dc6ab366845-0_55-47-324_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449894 Mar 16 17:15 bc5e69a0-133e-42ab-bca5-887d7ed200e8-0_57-47-326_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450497 Mar 16 17:15 20f468b9-afe3-49d8-905b-712c9f9fd441-0_54-47-323_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449934 Mar 16 17:15 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_52-47-321_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 452662 Mar 16 17:15 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_56-47-325_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449616 Mar 16 17:15 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_51-47-320_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451849 Mar 16 17:15 732e4ddd-ed9a-4c6b-9f59-50ce32f9706d-0_53-47-322_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451479 Mar 16 17:15 1f63cc38-5827-4fde-8d42-4b51e3907cdb-0_58-47-327_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451489 Mar 16 17:15 2aeac070-d67e-4ca3-a186-b5d9c383876e-0_184-53-453_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449996 Mar 16 17:15 2b447f6d-d0fc-4d2e-a0d7-243aa46aacac-0_186-53-455_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450542 Mar 16 17:15 121be9f5-0774-426b-b061-f93817e8568e-0_185-53-454_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450542 Mar 16 17:15 8b9f11a8-b108-462c-b45a-3cef7766d61d-0_187-53-456_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451778 Mar 16 17:15 b7f8ab04-fee2-4455-88a3-53c44a1a8299-0_188-53-457_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450995 Mar 16 17:15 1144170e-b154-4d85-8eed-866393cf2ed4-0_189-53-458_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451001 Mar 16 17:15 6ef97619-31f9-4f8e-b240-98fefed9fa41-0_191-53-460_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451192 Mar 16 17:15 95d724df-ec57-42c8-9de3-9c0f3b0888b3-0_196-53-465_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451287 Mar 16 17:15 20dc678a-c2fe-4156-bb61-f04cb269f248-0_192-53-461_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 452000 Mar 16 17:15 6d5fbff6-ff69-4a3a-9534-407b19154730-0_194-53-463_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450895 Mar 16 17:15 2bc61fdd-e343-4f9b-babd-161478d227a8-0_193-53-462_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450804 Mar 16 17:15 d07f11e8-b78e-4643-aef9-86903d89866d-0_198-53-467_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450956 Mar 16 17:15 1b8ede99-4f6c-43dc-8709-52ac0e307fc0-0_195-53-464_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451451 Mar 16 17:15 2a2e1a62-2325-4389-9ba7-60c9dff21491-0_197-53-466_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450494 Mar 16 17:15 9e70261a-ecf2-4706-a8cd-861e3f02786c-0_190-53-459_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450380 Mar 16 17:15 99b51786-5ec4-4c96-8892-47accd2882db-0_199-53-468_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451301 Mar 16 17:15 09bd1d0d-2d12-4ce3-abd0-50d0c8687e2b-0_200-53-469_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451434 Mar 16 17:15 148bc858-2bec-44fe-891f-60f6165dc17e-0_201-53-470_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450780 Mar 16 17:15 db3e9783-4674-4d73-8fe9-abcd47f19218-0_202-53-471_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451296 Mar 16 17:15 88d248a8-8f77-4ede-8d78-ef953afb8fc2-0_211-53-480_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450226 Mar 16 17:15 6bc3c636-6bf7-4672-84b7-8010c0a26cd6-0_203-53-472_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450595 Mar 16 17:15 59028aa4-a91f-4c82-9d34-d25fee9af494-0_206-53-475_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451978 Mar 16 17:15 20089b65-91f7-43d5-b7d6-d54029ed92db-0_205-53-474_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449974 Mar 16 17:15 443d5fbb-ff22-4fa2-a6f0-98a0f0eaea29-0_212-53-481_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451370 Mar 16 17:15 b44f53a6-e46f-4ae3-9a57-91c8b9cf3692-0_209-53-478_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450341 Mar 16 17:15 32dcd3d4-5ce0-41d1-b7dd-1c9e1ac55fd9-0_204-53-473_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451524 Mar 16 17:15 9ecda712-650f-497a-ae46-1f81462342ee-0_208-53-477_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451285 Mar 16 17:15 261b1fda-52df-466f-8858-2d167b7d8216-0_210-53-479_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450942 Mar 16 17:15 7da056d3-9c71-4731-89f4-cf6bb37d4a5b-0_207-53-476_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451893 Mar 16 17:15 db51b6eb-6107-4121-9328-eb78d950aaf5-0_213-53-482_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451526 Mar 16 17:15 bf3784e7-2a00-4cb3-a9d1-9c49fe59b91d-0_214-53-483_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 442060 Mar 16 17:15 c1eef0f6-fb4a-4fb8-86b1-ad5836668fac-0_215-53-484_20220316171506014.parquet
# Third commit
-rw-r--r-- 1 gavin wheel 451663 Mar 16 17:17 6d5fbff6-ff69-4a3a-9534-407b19154730-0_56-47-548_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450951 Mar 16 17:17 261b1fda-52df-466f-8858-2d167b7d8216-0_57-47-549_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450197 Mar 16 17:17 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_55-47-547_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450950 Mar 16 17:17 20dc678a-c2fe-4156-bb61-f04cb269f248-0_59-47-551_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449619 Mar 16 17:17 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_58-47-550_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449429 Mar 16 17:17 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_60-47-552_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451134 Mar 16 17:17 2a2e1a62-2325-4389-9ba7-60c9dff21491-0_61-47-553_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449658 Mar 16 17:17 443d5fbb-ff22-4fa2-a6f0-98a0f0eaea29-0_62-47-554_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450563 Mar 16 17:17 070bf517-195c-48a0-b0f1-423a4a482592-0_63-47-555_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451027 Mar 16 17:17 48f19ef7-7062-4368-9def-9b25d1578ac0-0_64-47-556_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450568 Mar 16 17:17 2bc61fdd-e343-4f9b-babd-161478d227a8-0_65-47-557_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449592 Mar 16 17:17 e65fbef1-5499-41cc-b956-236f1f070e4d-0_68-47-560_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449664 Mar 16 17:17 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_66-47-558_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451183 Mar 16 17:17 4a173caf-18dc-420c-9d61-4dc6ab366845-0_67-47-559_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 445898 Mar 16 17:17 0a1acb0e-a35f-4f8e-a3a9-d03c02423ac6-0_69-47-561_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449662 Mar 16 17:17 2b447f6d-d0fc-4d2e-a0d7-243aa46aacac-0_70-47-562_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451641 Mar 16 17:17 20089b65-91f7-43d5-b7d6-d54029ed92db-0_71-47-563_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451900 Mar 16 17:17 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_73-47-565_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450669 Mar 16 17:17 6ef97619-31f9-4f8e-b240-98fefed9fa41-0_72-47-564_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449892 Mar 16 17:17 6bc3c636-6bf7-4672-84b7-8010c0a26cd6-0_74-47-566_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450468 Mar 16 17:17 d07f11e8-b78e-4643-aef9-86903d89866d-0_76-47-568_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451699 Mar 16 17:17 c500a137-757a-4355-90fb-0a38e17b215c-0_75-47-567_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450769 Mar 16 17:17 f19e5203-af28-4af3-9bf3-de75f4ac9494-0_78-47-570_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451093 Mar 16 17:17 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_77-47-569_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451117 Mar 16 17:17 148bc858-2bec-44fe-891f-60f6165dc17e-0_79-47-571_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449980 Mar 16 17:17 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_80-47-572_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450023 Mar 16 17:17 32dcd3d4-5ce0-41d1-b7dd-1c9e1ac55fd9-0_82-47-574_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450684 Mar 16 17:17 1144170e-b154-4d85-8eed-866393cf2ed4-0_81-47-573_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450622 Mar 16 17:17 1b8ede99-4f6c-43dc-8709-52ac0e307fc0-0_83-47-575_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450188 Mar 16 17:17 3fc10e56-cf07-447f-a209-22f5e92b4351-0_85-47-577_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 452714 Mar 16 17:17 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_84-47-576_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451258 Mar 16 17:17 60d615c4-0355-44aa-8692-93dcac902bad-0_278-53-770_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451940 Mar 16 17:17 7cb0f09b-e54d-4c78-9306-53e044676a94-0_277-53-769_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450629 Mar 16 17:17 e923dad4-a72c-49a0-8885-0982008ceccf-0_276-53-768_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451596 Mar 16 17:17 d3e720e3-88fa-446b-a522-d44cf7497674-0_281-53-773_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450251 Mar 16 17:17 6f5d416f-e8ff-456c-a0fc-75c7a3a32308-0_279-53-771_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451586 Mar 16 17:17 64d0a424-7744-4423-b86d-3fce04a5046b-0_283-53-775_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450225 Mar 16 17:17 29ae854e-45a1-4222-98c1-2d0acc1c8884-0_282-53-774_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451972 Mar 16 17:17 06f51fc0-1078-4d3a-ae1f-67684917eb1b-0_280-53-772_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451680 Mar 16 17:17 e095fe81-189b-4f5d-8395-7239667ad2d8-0_286-53-778_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451413 Mar 16 17:17 7261ddee-42c5-4abc-8c4f-8f226072b826-0_285-53-777_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451844 Mar 16 17:17 0bcbbdc8-1d2e-4526-88f0-11d17cfff835-0_284-53-776_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451977 Mar 16 17:17 7eddebc9-6828-4053-ad7d-0831daa000ae-0_289-53-781_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451308 Mar 16 17:17 65982416-73c0-41b0-972c-4c4355f3b235-0_290-53-782_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451078 Mar 16 17:17 e62f224c-8d1f-4349-ad3f-81886fba230d-0_288-53-780_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450278 Mar 16 17:17 53764539-435e-4f6f-a7e4-d48fe7966389-0_287-53-779_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451599 Mar 16 17:17 139b4d5c-ed80-42ff-8f97-671f13390edb-0_291-53-783_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 452320 Mar 16 17:17 98d687a0-c93d-4326-b4a0-6f2540bd9aeb-0_292-53-784_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450597 Mar 16 17:17 285cb019-5481-4556-8a96-bc8248028778-0_295-53-787_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451878 Mar 16 17:17 1a7a27ab-8670-4b9c-bfde-3a1dba0669d8-0_293-53-785_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449657 Mar 16 17:17 fecc9a85-371d-493c-83f2-35b5849ee0cb-0_294-53-786_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450069 Mar 16 17:17 c3cc410b-13c3-4f00-ac53-fec9a8f307a1-0_297-53-789_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451344 Mar 16 17:17 ebf3f3b4-6e1a-4bb7-be85-0d2ee31126e5-0_298-53-790_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 452456 Mar 16 17:17 a77ec88d-1ccb-40a5-b8a8-eaf214cfa6a9-0_296-53-788_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450826 Mar 16 17:17 574b5510-586a-49a5-a2ac-b75ebe90b87d-0_301-53-793_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450752 Mar 16 17:17 bc2d6241-343c-4c47-9125-2f63b269117e-0_300-53-792_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450909 Mar 16 17:17 4b4abae6-0dfd-4761-871f-2224ddccb1ec-0_303-53-795_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450745 Mar 16 17:17 e7068c6a-7a2d-43ae-b030-59bebd68f36b-0_304-53-796_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451571 Mar 16 17:17 5bdc45fa-88d9-47b4-9597-e717ae7dbc48-0_302-53-794_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450517 Mar 16 17:17 41f629d8-189b-4867-af6f-bb91effe9f74-0_299-53-791_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450574 Mar 16 17:17 16c7513f-7f79-48d5-84ff-e784f2d1e795-0_305-53-797_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450647 Mar 16 17:17 40b4c081-baab-430c-a9e3-e93d0527c923-0_306-53-798_20220316171648081.parquet
# Clustering action that runs right after the third commit; it corresponds to the "replacecommit" on the timeline under the .hoodie folder
-rw-r--r-- 1 gavin wheel 654082 Mar 16 17:18 ee1bf9af-1636-4863-8b3a-7a7a15861573-0_9-78-2745_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 667860 Mar 16 17:18 a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_7-78-2743_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 699511 Mar 16 17:18 a4a3dfe1-d87d-473c-93b2-713b79aef185-0_6-78-2742_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 742546 Mar 16 17:18 3e74d509-e720-416b-a1c7-9380e5e4a830-0_8-78-2744_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 737640 Mar 16 17:18 9a0124e9-5e16-4ea5-8b7b-27aea81f4d92-0_5-78-2741_20220316171751882.parquet
gavin@GavindeMacBook-Pro Friday %
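Reading the sizes off a long listing by eye is error-prone, so the per-commit file counts and size ranges can also be summarized with a short script. The following is a minimal sketch, not part of the original test. It assumes base files are named `<fileId>_<writeToken>_<instantTime>.parquet`, which is inferred from the listing above, and the partition path in the `__main__` block is a hypothetical location that should be adjusted to the table under test.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed base file layout: <fileId>_<writeToken>_<instantTime>.parquet
BASE_FILE_RE = re.compile(
    r"^(?P<file_id>.+)_(?P<write_token>\d+-\d+-\d+)_(?P<instant>\d+)\.parquet$"
)


def summarize_by_instant(partition_dir):
    """Return {instant_time: (file_count, min_bytes, max_bytes)} for one partition."""
    sizes = defaultdict(list)
    for path in Path(partition_dir).iterdir():
        match = BASE_FILE_RE.match(path.name)
        if match:  # .crc files and partition metadata do not match and are skipped
            sizes[match.group("instant")].append(path.stat().st_size)
    return {
        instant: (len(values), min(values), max(values))
        for instant, values in sorted(sizes.items())
    }


if __name__ == "__main__":
    # Hypothetical partition path; adjust it to the table being inspected.
    partition = Path("/tmp/hudi_base_path/Friday")
    if partition.is_dir():
        for instant, (count, smallest, largest) in summarize_by_instant(partition).items():
            print(f"{instant}: {count} files, {smallest}-{largest} bytes")
```

Grouping by the instant time in the file name separates the three write commits from the clustering instant, whose files are noticeably larger than the 450K target.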
Cleaning
**Conclusion:** By default, the cleaner is triggered automatically after every upsert.
Parameters involved
When enabled, the cleaner table service is invoked immediately after each commit, to delete older file slices. It's recommended to enable this, to ensure metadata and data storage growth is bounded.
Default Value: true (Optional)
Config Param: AUTO_CLEAN
Cleaning policy to be used. The cleaner service deletes older file slices to reclaim space. By default, the cleaner spares the file slices written by the last N commits, determined by hoodie.cleaner.commits.retained. Long-running query plans may often refer to older file slices and will break if those are cleaned before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time.
Default Value: KEEP_LATEST_COMMITS (Optional)
Config Param: CLEANER_POLICY
Number of commits to retain, without cleaning. This will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries.
Default Value: 10 (Optional)
Config Param: CLEANER_COMMITS_RETAINED
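The three settings above can be passed as ordinary write options. The sketch below collects them in a dict using the default values quoted above and works through the "num_of_commits * time_between_commits" arithmetic from the last description. The keys `hoodie.clean.automatic` and `hoodie.cleaner.policy` are stated here as assumptions, since the quoted text only names the config params. The two-minute commit interval and the helper `approx_retention_window` are hypothetical and only serve the illustration.

```python
from datetime import timedelta

# Cleaner-related write options; the values are the defaults quoted above.
# The first two keys are assumed names for AUTO_CLEAN and CLEANER_POLICY.
cleaning_options = {
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "10",
}


def approx_retention_window(commits_retained, time_between_commits):
    """Rough retention window: num_of_commits * time_between_commits."""
    return commits_retained * time_between_commits


if __name__ == "__main__":
    # Assumed cadence: one commit roughly every two minutes.
    window = approx_retention_window(
        int(cleaning_options["hoodie.cleaner.commits.retained"]),
        timedelta(minutes=2),
    )
    print(window)  # 0:20:00
```

Under these assumptions, a query that runs longer than about twenty minutes could reference file slices that have already been cleaned, which is the risk the policy description warns about.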
Test code
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # Build a local session with the Hudi bundle and spark-avro jars on the classpath.
    builder = SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_100000_rows.csv'

    # Load the generated test data; the first CSV row holds the column names.
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')

    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
        'hoodie.cleaner.commits.retained': 1  # keep only one historical version so the effect is easy to observe
    }

    # Append to the existing table; the cleaner runs after the commit.
    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)
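Besides comparing directory listings, one way to check whether the cleaner ran is to count completed instants per action on the timeline. The sketch below is a rough helper, not part of the original test. It assumes completed clean instants follow the same `<instantTime>.<action>` naming as the commit and replacecommit files shown in the `.hoodie` listings in this article. The function name `completed_actions` and the `.hoodie` path are hypothetical.

```python
from collections import Counter
from pathlib import Path


def completed_actions(hoodie_dir):
    """Count completed timeline instants per action (commit, clean, replacecommit, ...)."""
    counts = Counter()
    for path in Path(hoodie_dir).iterdir():
        parts = path.name.split(".")
        # Completed instants are assumed to look like "<instantTime>.<action>".
        # Files with a ".requested"/".inflight" state, hidden .crc files,
        # hoodie.properties and directories are skipped.
        if (
            path.is_file()
            and len(parts) == 2
            and parts[0].isdigit()
            and parts[1] != "inflight"
        ):
            counts[parts[1]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical metadata folder for the table written by the test code.
    hoodie_dir = Path("/tmp/hudi_base_path/.hoodie")
    if hoodie_dir.is_dir():
        for action, count in sorted(completed_actions(hoodie_dir).items()):
            print(f"{action}: {count}")
```

If a `clean` entry shows up after the write, the cleaner was invoked as part of that commit.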
Test results
gavin@GavindeMacBook-Pro Friday % ll -rt # before running the code
total 139600
-rw-r--r-- 1 gavin wheel 449713 Mar 16 17:13 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_68-47-109_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451069 Mar 16 17:13 6ffa4af8-cf3f-41e5-9225-3f82cabd3416-0_70-47-111_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 449926 Mar 16 17:13 e65fbef1-5499-41cc-b956-236f1f070e4d-0_65-47-106_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450787 Mar 16 17:13 5609e5fd-8eef-41ef-8fc4-3c6c711a6199-0_67-47-108_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 449952 Mar 16 17:13 9ee1e33f-d28f-443d-88cb-59e841e927a8-0_69-47-110_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 452176 Mar 16 17:13 732e4ddd-ed9a-4c6b-9f59-50ce32f9706d-0_64-47-105_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450516 Mar 16 17:13 dbfcfef3-ddde-41d9-82e4-12b961962c7b-0_66-47-107_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451607 Mar 16 17:13 873f0e77-75d2-41f6-b5cd-e3fa6e7d69b8-0_71-47-112_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451974 Mar 16 17:13 c500a137-757a-4355-90fb-0a38e17b215c-0_72-47-113_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450860 Mar 16 17:13 070bf517-195c-48a0-b0f1-423a4a482592-0_74-47-115_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450590 Mar 16 17:13 1eb137ab-c52e-47f2-ac4b-a25ad2fe5aae-0_73-47-114_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 452988 Mar 16 17:13 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_75-47-116_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450829 Mar 16 17:13 20f468b9-afe3-49d8-905b-712c9f9fd441-0_77-47-118_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451479 Mar 16 17:13 4a173caf-18dc-420c-9d61-4dc6ab366845-0_76-47-117_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 449907 Mar 16 17:13 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_81-47-122_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451221 Mar 16 17:13 cb6fac3b-83bc-483b-92c6-c8aca859c4bc-0_84-47-125_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450260 Mar 16 17:13 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_83-47-124_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451619 Mar 16 17:13 034163f0-823c-42f0-b109-6282d7dab628-0_79-47-120_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450229 Mar 16 17:13 bc5e69a0-133e-42ab-bca5-887d7ed200e8-0_82-47-123_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451089 Mar 16 17:13 f19e5203-af28-4af3-9bf3-de75f4ac9494-0_80-47-121_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450510 Mar 16 17:13 3fc10e56-cf07-447f-a209-22f5e92b4351-0_78-47-119_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450287 Mar 16 17:13 2d6b1e2b-6336-4be5-be19-cc272c3fa62c-0_89-47-130_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450477 Mar 16 17:13 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_87-47-128_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451361 Mar 16 17:13 48f19ef7-7062-4368-9def-9b25d1578ac0-0_85-47-126_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 449939 Mar 16 17:13 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_88-47-129_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450747 Mar 16 17:13 81d27725-d4f2-47ee-a4c6-553c467df7a3-0_86-47-127_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 452180 Mar 16 17:13 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_94-47-135_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 452128 Mar 16 17:13 1716c74e-a8af-41bd-8bea-ecd521758ea6-0_93-47-134_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450150 Mar 16 17:13 4a4f5e05-77e4-4787-859f-fbd3fc27ead8-0_92-47-133_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451382 Mar 16 17:13 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_91-47-132_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 446216 Mar 16 17:13 0a1acb0e-a35f-4f8e-a3a9-d03c02423ac6-0_95-47-136_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 451804 Mar 16 17:13 1f63cc38-5827-4fde-8d42-4b51e3907cdb-0_90-47-131_20220316171316850.parquet
-rw-r--r-- 1 gavin wheel 450421 Mar 16 17:15 81d27725-d4f2-47ee-a4c6-553c467df7a3-0_35-47-304_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449626 Mar 16 17:15 9ee1e33f-d28f-443d-88cb-59e841e927a8-0_36-47-305_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451791 Mar 16 17:15 1716c74e-a8af-41bd-8bea-ecd521758ea6-0_38-47-307_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450152 Mar 16 17:15 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_37-47-306_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449382 Mar 16 17:15 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_44-47-313_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451295 Mar 16 17:15 873f0e77-75d2-41f6-b5cd-e3fa6e7d69b8-0_40-47-309_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451052 Mar 16 17:15 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_48-47-317_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450520 Mar 16 17:15 070bf517-195c-48a0-b0f1-423a4a482592-0_47-47-316_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449577 Mar 16 17:15 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_39-47-308_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450254 Mar 16 17:15 1eb137ab-c52e-47f2-ac4b-a25ad2fe5aae-0_43-47-312_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450888 Mar 16 17:15 cb6fac3b-83bc-483b-92c6-c8aca859c4bc-0_45-47-314_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451649 Mar 16 17:15 c500a137-757a-4355-90fb-0a38e17b215c-0_46-47-315_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449829 Mar 16 17:15 4a4f5e05-77e4-4787-859f-fbd3fc27ead8-0_42-47-311_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451847 Mar 16 17:15 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_41-47-310_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450199 Mar 16 17:15 dbfcfef3-ddde-41d9-82e4-12b961962c7b-0_50-47-319_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451284 Mar 16 17:15 034163f0-823c-42f0-b109-6282d7dab628-0_49-47-318_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451149 Mar 16 17:15 4a173caf-18dc-420c-9d61-4dc6ab366845-0_55-47-324_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449894 Mar 16 17:15 bc5e69a0-133e-42ab-bca5-887d7ed200e8-0_57-47-326_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450497 Mar 16 17:15 20f468b9-afe3-49d8-905b-712c9f9fd441-0_54-47-323_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449934 Mar 16 17:15 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_52-47-321_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 452662 Mar 16 17:15 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_56-47-325_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449616 Mar 16 17:15 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_51-47-320_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451849 Mar 16 17:15 732e4ddd-ed9a-4c6b-9f59-50ce32f9706d-0_53-47-322_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451479 Mar 16 17:15 1f63cc38-5827-4fde-8d42-4b51e3907cdb-0_58-47-327_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451489 Mar 16 17:15 2aeac070-d67e-4ca3-a186-b5d9c383876e-0_184-53-453_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449996 Mar 16 17:15 2b447f6d-d0fc-4d2e-a0d7-243aa46aacac-0_186-53-455_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450542 Mar 16 17:15 121be9f5-0774-426b-b061-f93817e8568e-0_185-53-454_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450542 Mar 16 17:15 8b9f11a8-b108-462c-b45a-3cef7766d61d-0_187-53-456_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451778 Mar 16 17:15 b7f8ab04-fee2-4455-88a3-53c44a1a8299-0_188-53-457_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450995 Mar 16 17:15 1144170e-b154-4d85-8eed-866393cf2ed4-0_189-53-458_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451001 Mar 16 17:15 6ef97619-31f9-4f8e-b240-98fefed9fa41-0_191-53-460_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451192 Mar 16 17:15 95d724df-ec57-42c8-9de3-9c0f3b0888b3-0_196-53-465_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451287 Mar 16 17:15 20dc678a-c2fe-4156-bb61-f04cb269f248-0_192-53-461_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 452000 Mar 16 17:15 6d5fbff6-ff69-4a3a-9534-407b19154730-0_194-53-463_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450895 Mar 16 17:15 2bc61fdd-e343-4f9b-babd-161478d227a8-0_193-53-462_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450804 Mar 16 17:15 d07f11e8-b78e-4643-aef9-86903d89866d-0_198-53-467_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450956 Mar 16 17:15 1b8ede99-4f6c-43dc-8709-52ac0e307fc0-0_195-53-464_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451451 Mar 16 17:15 2a2e1a62-2325-4389-9ba7-60c9dff21491-0_197-53-466_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450494 Mar 16 17:15 9e70261a-ecf2-4706-a8cd-861e3f02786c-0_190-53-459_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450380 Mar 16 17:15 99b51786-5ec4-4c96-8892-47accd2882db-0_199-53-468_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451301 Mar 16 17:15 09bd1d0d-2d12-4ce3-abd0-50d0c8687e2b-0_200-53-469_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451434 Mar 16 17:15 148bc858-2bec-44fe-891f-60f6165dc17e-0_201-53-470_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450780 Mar 16 17:15 db3e9783-4674-4d73-8fe9-abcd47f19218-0_202-53-471_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451296 Mar 16 17:15 88d248a8-8f77-4ede-8d78-ef953afb8fc2-0_211-53-480_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450226 Mar 16 17:15 6bc3c636-6bf7-4672-84b7-8010c0a26cd6-0_203-53-472_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450595 Mar 16 17:15 59028aa4-a91f-4c82-9d34-d25fee9af494-0_206-53-475_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451978 Mar 16 17:15 20089b65-91f7-43d5-b7d6-d54029ed92db-0_205-53-474_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 449974 Mar 16 17:15 443d5fbb-ff22-4fa2-a6f0-98a0f0eaea29-0_212-53-481_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451370 Mar 16 17:15 b44f53a6-e46f-4ae3-9a57-91c8b9cf3692-0_209-53-478_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450341 Mar 16 17:15 32dcd3d4-5ce0-41d1-b7dd-1c9e1ac55fd9-0_204-53-473_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451524 Mar 16 17:15 9ecda712-650f-497a-ae46-1f81462342ee-0_208-53-477_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451285 Mar 16 17:15 261b1fda-52df-466f-8858-2d167b7d8216-0_210-53-479_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 450942 Mar 16 17:15 7da056d3-9c71-4731-89f4-cf6bb37d4a5b-0_207-53-476_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451893 Mar 16 17:15 db51b6eb-6107-4121-9328-eb78d950aaf5-0_213-53-482_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451526 Mar 16 17:15 bf3784e7-2a00-4cb3-a9d1-9c49fe59b91d-0_214-53-483_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 442060 Mar 16 17:15 c1eef0f6-fb4a-4fb8-86b1-ad5836668fac-0_215-53-484_20220316171506014.parquet
-rw-r--r-- 1 gavin wheel 451663 Mar 16 17:17 6d5fbff6-ff69-4a3a-9534-407b19154730-0_56-47-548_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450951 Mar 16 17:17 261b1fda-52df-466f-8858-2d167b7d8216-0_57-47-549_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450197 Mar 16 17:17 5496b225-ef42-4a2d-a21b-b6706dab97a2-0_55-47-547_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450950 Mar 16 17:17 20dc678a-c2fe-4156-bb61-f04cb269f248-0_59-47-551_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449619 Mar 16 17:17 8bc280d8-81e7-49f4-a49f-2a0767a827a3-0_58-47-550_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449429 Mar 16 17:17 aeeab49f-b7ae-4d42-bfc8-30c5815a19f2-0_60-47-552_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451134 Mar 16 17:17 2a2e1a62-2325-4389-9ba7-60c9dff21491-0_61-47-553_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449658 Mar 16 17:17 443d5fbb-ff22-4fa2-a6f0-98a0f0eaea29-0_62-47-554_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450563 Mar 16 17:17 070bf517-195c-48a0-b0f1-423a4a482592-0_63-47-555_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451027 Mar 16 17:17 48f19ef7-7062-4368-9def-9b25d1578ac0-0_64-47-556_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450568 Mar 16 17:17 2bc61fdd-e343-4f9b-babd-161478d227a8-0_65-47-557_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449592 Mar 16 17:17 e65fbef1-5499-41cc-b956-236f1f070e4d-0_68-47-560_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449664 Mar 16 17:17 85d7ef1b-7473-43f8-87bb-1d7d3f088c2c-0_66-47-558_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451183 Mar 16 17:17 4a173caf-18dc-420c-9d61-4dc6ab366845-0_67-47-559_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 445898 Mar 16 17:17 0a1acb0e-a35f-4f8e-a3a9-d03c02423ac6-0_69-47-561_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449662 Mar 16 17:17 2b447f6d-d0fc-4d2e-a0d7-243aa46aacac-0_70-47-562_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451641 Mar 16 17:17 20089b65-91f7-43d5-b7d6-d54029ed92db-0_71-47-563_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451900 Mar 16 17:17 452e8da7-63b4-48d0-83d1-7033d0040ab4-0_73-47-565_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450669 Mar 16 17:17 6ef97619-31f9-4f8e-b240-98fefed9fa41-0_72-47-564_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449892 Mar 16 17:17 6bc3c636-6bf7-4672-84b7-8010c0a26cd6-0_74-47-566_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450468 Mar 16 17:17 d07f11e8-b78e-4643-aef9-86903d89866d-0_76-47-568_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451699 Mar 16 17:17 c500a137-757a-4355-90fb-0a38e17b215c-0_75-47-567_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450769 Mar 16 17:17 f19e5203-af28-4af3-9bf3-de75f4ac9494-0_78-47-570_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451093 Mar 16 17:17 1f8c21ef-ff50-428e-b7f7-e8d3b0f1ea71-0_77-47-569_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451117 Mar 16 17:17 148bc858-2bec-44fe-891f-60f6165dc17e-0_79-47-571_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449980 Mar 16 17:17 9aa6b1df-f18b-43bb-93c8-4ce1a726c1bb-0_80-47-572_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450023 Mar 16 17:17 32dcd3d4-5ce0-41d1-b7dd-1c9e1ac55fd9-0_82-47-574_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450684 Mar 16 17:17 1144170e-b154-4d85-8eed-866393cf2ed4-0_81-47-573_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450622 Mar 16 17:17 1b8ede99-4f6c-43dc-8709-52ac0e307fc0-0_83-47-575_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450188 Mar 16 17:17 3fc10e56-cf07-447f-a209-22f5e92b4351-0_85-47-577_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 452714 Mar 16 17:17 2c026223-4f5c-4634-aaf9-32a98f7a275f-0_84-47-576_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451258 Mar 16 17:17 60d615c4-0355-44aa-8692-93dcac902bad-0_278-53-770_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451940 Mar 16 17:17 7cb0f09b-e54d-4c78-9306-53e044676a94-0_277-53-769_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450629 Mar 16 17:17 e923dad4-a72c-49a0-8885-0982008ceccf-0_276-53-768_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451596 Mar 16 17:17 d3e720e3-88fa-446b-a522-d44cf7497674-0_281-53-773_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450251 Mar 16 17:17 6f5d416f-e8ff-456c-a0fc-75c7a3a32308-0_279-53-771_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451586 Mar 16 17:17 64d0a424-7744-4423-b86d-3fce04a5046b-0_283-53-775_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450225 Mar 16 17:17 29ae854e-45a1-4222-98c1-2d0acc1c8884-0_282-53-774_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451972 Mar 16 17:17 06f51fc0-1078-4d3a-ae1f-67684917eb1b-0_280-53-772_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451680 Mar 16 17:17 e095fe81-189b-4f5d-8395-7239667ad2d8-0_286-53-778_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451413 Mar 16 17:17 7261ddee-42c5-4abc-8c4f-8f226072b826-0_285-53-777_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451844 Mar 16 17:17 0bcbbdc8-1d2e-4526-88f0-11d17cfff835-0_284-53-776_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451977 Mar 16 17:17 7eddebc9-6828-4053-ad7d-0831daa000ae-0_289-53-781_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451308 Mar 16 17:17 65982416-73c0-41b0-972c-4c4355f3b235-0_290-53-782_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451078 Mar 16 17:17 e62f224c-8d1f-4349-ad3f-81886fba230d-0_288-53-780_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450278 Mar 16 17:17 53764539-435e-4f6f-a7e4-d48fe7966389-0_287-53-779_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451599 Mar 16 17:17 139b4d5c-ed80-42ff-8f97-671f13390edb-0_291-53-783_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 452320 Mar 16 17:17 98d687a0-c93d-4326-b4a0-6f2540bd9aeb-0_292-53-784_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450597 Mar 16 17:17 285cb019-5481-4556-8a96-bc8248028778-0_295-53-787_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451878 Mar 16 17:17 1a7a27ab-8670-4b9c-bfde-3a1dba0669d8-0_293-53-785_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 449657 Mar 16 17:17 fecc9a85-371d-493c-83f2-35b5849ee0cb-0_294-53-786_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450069 Mar 16 17:17 c3cc410b-13c3-4f00-ac53-fec9a8f307a1-0_297-53-789_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451344 Mar 16 17:17 ebf3f3b4-6e1a-4bb7-be85-0d2ee31126e5-0_298-53-790_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 452456 Mar 16 17:17 a77ec88d-1ccb-40a5-b8a8-eaf214cfa6a9-0_296-53-788_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450826 Mar 16 17:17 574b5510-586a-49a5-a2ac-b75ebe90b87d-0_301-53-793_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450752 Mar 16 17:17 bc2d6241-343c-4c47-9125-2f63b269117e-0_300-53-792_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450909 Mar 16 17:17 4b4abae6-0dfd-4761-871f-2224ddccb1ec-0_303-53-795_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450745 Mar 16 17:17 e7068c6a-7a2d-43ae-b030-59bebd68f36b-0_304-53-796_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 451571 Mar 16 17:17 5bdc45fa-88d9-47b4-9597-e717ae7dbc48-0_302-53-794_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450517 Mar 16 17:17 41f629d8-189b-4867-af6f-bb91effe9f74-0_299-53-791_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450574 Mar 16 17:17 16c7513f-7f79-48d5-84ff-e784f2d1e795-0_305-53-797_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 450647 Mar 16 17:17 40b4c081-baab-430c-a9e3-e93d0527c923-0_306-53-798_20220316171648081.parquet
-rw-r--r-- 1 gavin wheel 654082 Mar 16 17:18 ee1bf9af-1636-4863-8b3a-7a7a15861573-0_9-78-2745_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 667860 Mar 16 17:18 a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_7-78-2743_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 699511 Mar 16 17:18 a4a3dfe1-d87d-473c-93b2-713b79aef185-0_6-78-2742_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 742546 Mar 16 17:18 3e74d509-e720-416b-a1c7-9380e5e4a830-0_8-78-2744_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 737640 Mar 16 17:18 9a0124e9-5e16-4ea5-8b7b-27aea81f4d92-0_5-78-2741_20220316171751882.parquet
gavin@GavindeMacBook-Pro Friday %
gavin@GavindeMacBook-Pro Friday %
gavin@GavindeMacBook-Pro Friday % ll -rt # After running the code, only one historical version of the data is kept
total 14624
-rw-r--r-- 1 gavin wheel 654082 Mar 16 17:18 ee1bf9af-1636-4863-8b3a-7a7a15861573-0_9-78-2745_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 667860 Mar 16 17:18 a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_7-78-2743_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 699511 Mar 16 17:18 a4a3dfe1-d87d-473c-93b2-713b79aef185-0_6-78-2742_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 742546 Mar 16 17:18 3e74d509-e720-416b-a1c7-9380e5e4a830-0_8-78-2744_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 737640 Mar 16 17:18 9a0124e9-5e16-4ea5-8b7b-27aea81f4d92-0_5-78-2741_20220316171751882.parquet
-rw-r--r-- 1 gavin wheel 699397 Mar 17 10:37 a4a3dfe1-d87d-473c-93b2-713b79aef185-0_13-41-92_20220317103652699.parquet
-rw-r--r-- 1 gavin wheel 667672 Mar 17 10:37 a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_10-41-89_20220317103652699.parquet
-rw-r--r-- 1 gavin wheel 742453 Mar 17 10:37 3e74d509-e720-416b-a1c7-9380e5e4a830-0_14-41-93_20220317103652699.parquet
-rw-r--r-- 1 gavin wheel 737512 Mar 17 10:37 9a0124e9-5e16-4ea5-8b7b-27aea81f4d92-0_11-41-90_20220317103652699.parquet
-rw-r--r-- 1 gavin wheel 1080742 Mar 17 10:37 ee1bf9af-1636-4863-8b3a-7a7a15861573-0_12-41-91_20220317103652699.parquet
gavin@GavindeMacBook-Pro Friday %
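To check how many versions each file group still has after cleaning, the file names in the listing can be grouped by file ID. The following is a minimal sketch. It assumes the name layout visible in the listing, `<fileId>_<writeToken>_<instantTime>.parquet`, and the helper name `group_versions` is made up for this illustration.

```python
from collections import defaultdict


def group_versions(file_names):
    """Map each file ID to the sorted instant times found for it."""
    versions = defaultdict(list)
    for name in file_names:
        if not name.endswith(".parquet"):
            continue  # skip .crc files and partition metadata
        stem = name[: -len(".parquet")]
        # Assumed layout: <fileId>_<writeToken>_<instantTime>
        file_id, _write_token, instant_time = stem.rsplit("_", 2)
        versions[file_id].append(instant_time)
    return {file_id: sorted(times) for file_id, times in versions.items()}


# Two of the file groups from the listing above, after cleaning
files = [
    "ee1bf9af-1636-4863-8b3a-7a7a15861573-0_9-78-2745_20220316171751882.parquet",
    "ee1bf9af-1636-4863-8b3a-7a7a15861573-0_12-41-91_20220317103652699.parquet",
    "a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_7-78-2743_20220316171751882.parquet",
    "a4e54744-4277-4ce5-96e4-fa2da7010b1f-0_10-41-89_20220317103652699.parquet",
]

for file_id, instants in group_versions(files).items():
    print(file_id, len(instants), instants)
```

Each file ID in the cleaned listing appears with exactly two instants: the latest version plus one retained historical version.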
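Data Quality
Conclusion: in overwrite mode, if the incoming data does not meet expectations, the write fails with 「At least one pre-commit validation Failed」, so the data can be validated before it is committed. (In append mode the same code failed immediately with 「java.util.ConcurrentModificationException」; I have not yet worked out why append mode raises this error.)
Configuration involved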
-
Comma separated list of class names that can be invoked to validate commit
Default Value: (Optional)
Config Param: VALIDATOR_CLASS_NAMES
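Test code
The existing data contains no record with age 17, while the new data contains one such record. The validation query 「select count(*) from {tableName} where age=17」 is run against the incoming data with an expected result of 「0」, so the write is expected to be rejected.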
from pyspark.sql import SparkSession

if __name__ == '__main__':
    # Build a session with the Hudi bundle and spark-avro jars on the classpath
    builder = SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student_for_pre_validate"
    basePath = "file:///tmp/hudi_tables/student_for_pre_validate"
    csv_path = '/Users/gavin/Desktop/tmp/student_2_rows.csv'

    # Load the incremental data (two rows, one of them with age 17)
    csv_df = spark.read.csv(path=csv_path, header=True)
    csv_df.printSchema()
    csv_df.show()
    print(f'csv_df.count(): [{csv_df.count()}]')

    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2,
        # Validator class names are case-sensitive. Note that the Hudi docs pair
        # 'single.value.sql.queries' with SqlQuerySingleResultPreCommitValidator;
        # the class below is the one used in this test.
        'hoodie.precommit.validators': 'org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator',
        # Format: <query>#<expected result>
        'hoodie.precommit.validators.single.value.sql.queries': f'select count(*) from {tableName} where age=17#0'
    }

    # The pre-commit validators run before the commit is finalized
    csv_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("overwrite"). \
        save(basePath)
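The value passed to 「hoodie.precommit.validators.single.value.sql.queries」 packs a query and its expected result into one string, separated by `#`. As far as I understand the Hudi documentation, several such pairs can be joined with `;`. The sketch below only illustrates that string format in plain Python. It is not Hudi's own parsing code, and the function name `parse_single_value_queries` is hypothetical.

```python
def parse_single_value_queries(config_value):
    """Split 'query1#expected1;query2#expected2' into (query, expected) pairs.

    Illustrative only: mimics the documented string format, not Hudi internals.
    """
    pairs = []
    for entry in config_value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # The expected result follows the last '#'
        query, _, expected = entry.rpartition("#")
        if not query:
            raise ValueError(f"missing '#<expected result>' in: {entry!r}")
        pairs.append((query.strip(), expected.strip()))
    return pairs


table_name = "student_for_pre_validate"
value = f"select count(*) from {table_name} where age=17#0"
for query, expected in parse_single_value_queries(value):
    print(f"query={query!r} expected={expected!r}")
```

Here the single pair means: the count of rows with age 17 must equal 0, otherwise the commit is rejected.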
Test data: existing data
id | name | age | adress | partition_path |
---|---|---|---|---|
6070762 | 林婷 | 16 | 江苏省凯县魏都刘街G座 217662 | Saturday |
4566846 | 汤斌 | 15 | 上海市志强市清城辽阳路k座 407334 | Tuesday |
1120433 | 刘宁 | 22 | 黑龙江省马鞍山县龙潭傅路F座 707735 | Wednesday |
305942 | 李凯 | 19 | 重庆市欣市合川姚路K座 936317 | Monday |
1604502 | 冉秀芳 | 25 | 江苏省阜新市沈北新陆街c座 997546 | Wednesday |
Test data: incremental data
id | name | age | adress | partition_path |
---|---|---|---|---|
6031576 | 艾璐 | 19 | 北京市静市西夏韩路M座 566903 | Wednesday |
3565711 | 刘霞 | 17 | 四川省石家庄市滨城杨路w座 549721 | Friday |
Result of running the code
Error: At least one pre-commit validation Failed
py4j.protocol.Py4JJavaError: An error occurred while calling o52.save.
: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220317111613163
at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:63)
at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:160)
at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:217)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:277)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieValidationException: At least one pre-commit validation Failed
at org.apache.hudi.client.utils.SparkValidatorUtils.runValidators(SparkValidatorUtils.java:94)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.runPrecommitValidators(BaseSparkCommitActionExecutor.java:399)
at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.commitOnAutoCommit(BaseCommitActionExecutor.java:140)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.updateIndexAndCommitIfNeeded(BaseSparkCommitActionExecutor.java:265)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:180)
at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:82)
at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:56)
... 39 more
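In a trace this long, the useful line is the last 「Caused by:」 entry. A small helper can pull it out of the error text. This is a sketch that works on plain text only. It assumes the text comes from something like `str(error)` on the Py4J exception, and the name `root_cause` is made up here.

```python
def root_cause(trace_text):
    """Return the innermost 'Caused by:' message of a Java stack trace.

    Falls back to the first non-empty line when no 'Caused by:' line exists.
    """
    prefix = "Caused by:"
    lines = [line.strip() for line in trace_text.splitlines() if line.strip()]
    causes = [line[len(prefix):].strip() for line in lines if line.startswith(prefix)]
    if causes:
        return causes[-1]
    return lines[0] if lines else ""


# Abbreviated copy of the trace shown above
trace = """py4j.protocol.Py4JJavaError: An error occurred while calling o52.save.
: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220317111613163
at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:63)
Caused by: org.apache.hudi.exception.HoodieValidationException: At least one pre-commit validation Failed
at org.apache.hudi.client.utils.SparkValidatorUtils.runValidators(SparkValidatorUtils.java:94)
... 39 more
"""

cause = root_cause(trace)
print(cause)
print("validation failure:", "HoodieValidationException" in cause)
```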
SQL CREATE TABLE statement for Hudi
Simply set 「STORED AS INPUTFORMAT」 in the SQL to 「org.apache.hudi.hadoop.HoodieParquetInputFormat」; everything else stays the same as for a regular table.
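As an illustration, the following sketch assembles such a Hive DDL string in Python. Everything except the input format class is an assumption made for this example: the function name `build_hudi_hive_ddl`, the column types, and the use of the standard Hive Parquet SerDe and output format classes. The location reuses the base path from the test above. A real table would normally also expose Hudi's metadata columns, which are left out here.

```python
def build_hudi_hive_ddl(table, columns, partition_column, location):
    """Build a Hive CREATE TABLE statement that reads a Hudi table's parquet files."""
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE `{table}` (\n  {column_sql}\n)\n"
        f"PARTITIONED BY (`{partition_column}` string)\n"
        # Assumed: the usual Hive Parquet SerDe and output format
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        # The one Hudi-specific setting
        "STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'\n"
        f"LOCATION '{location}'"
    )


ddl = build_hudi_hive_ddl(
    table="student_for_pre_validate",
    # Column types are guesses; the CSV was read without schema inference
    columns=[("id", "string"), ("name", "string"), ("age", "string"), ("adress", "string")],
    partition_column="partition_path",
    location="file:///tmp/hudi_tables/student_for_pre_validate",
)
print(ddl)
```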
Other parameters for reference
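Several of the file-sizing parameters in this section interact: 「hoodie.copyonwrite.record.size.estimate」, 「hoodie.record.size.estimation.threshold」, 「hoodie.parquet.small.file.limit」 and 「hoodie.parquet.max.file.size」. The sketch below restates the arithmetic described for them in plain Python. It is a simplified model of those descriptions, not Hudi's implementation. The function names are hypothetical, and the 100 MB small-file limit is an assumed value.

```python
def estimate_record_size(file_size, record_count, small_file_limit,
                         threshold=1.0, default_estimate=1024):
    """Estimate bytes per record.

    Uses the default estimate (1 KB) until a file is larger than
    threshold * small_file_limit, then switches to file_size / record_count.
    """
    if record_count > 0 and file_size > threshold * small_file_limit:
        return file_size / record_count
    return default_estimate


def remaining_records(current_file_size, max_file_size, record_size):
    """How many more records fit: (max size - current size) / record size."""
    free_bytes = max(0, max_file_size - current_file_size)
    return int(free_bytes // record_size)


MAX_FILE_SIZE = 120 * 1024 * 1024      # default from the text: 120 MB
SMALL_FILE_LIMIT = 100 * 1024 * 1024   # assumed value for this example

# A ~450 KB file like those in the listings is far below the threshold,
# so the 1 KB default estimate is used.
size = estimate_record_size(450_000, 4_000, SMALL_FILE_LIMIT)
print("estimated bytes per record:", size)
print("records that still fit:", remaining_records(450_000, MAX_FILE_SIZE, size))
```

Because the record size is an average-based estimate, the resulting file size only approximates the configured maximum.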
-
「hoodie.clean.automatic」
-
「hoodie.clean.async」
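Default false: whether files are cleaned asynchronously. Asynchronous cleaning works by starting a background thread, which is invoked when the client performs an upsert.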
-
「hoodie.cleaner.policy」
Default HoodieCleaningPolicy.KEEP_LATEST_COMMITS: the data cleaning policy. Two policies are available: KEEP_LATEST_FILE_VERSIONS and KEEP_LATEST_COMMITS.
-
「hoodie.cleaner.commits.retained」
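Default 10: takes effect under the KEEP_LATEST_COMMITS policy. The number of fileId file versions to retain is derived from the number of commits. Because it is counted in commits, this value must not exceed hoodie.keep.min.commits (the minimum number of commits whose metadata is kept).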
-
「hoodie.cleaner.fileversions.retained」
Default 3: takes effect under the KEEP_LATEST_FILE_VERSIONS policy. The number of fileId file versions to retain is counted in file versions.
-
「hoodie.parquet.small.file.limit」:
-
「hoodie.copyonwrite.record.size.estimate」:
-
「hoodie.record.size.estimation.threshold」:
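Default 1: at the very beginning, when the parquet files hold no data, the default of 1 KB is used as the estimated size of one record. Once a fileId's file size exceeds (hoodie.record.size.estimation.threshold * hoodie.parquet.small.file.limit), the record size is instead computed as (fileId file size / total number of records in the file), so this parameter acts as a weight.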
-
「hoodie.parquet.max.file.size」:
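Default 120 * 1024 * 1024 (120 MB): the maximum file size. During bucketing, the number of records that can still be inserted into a file is computed as (this size minus the current fileId file size) divided by the estimated record size. Because the record size is an estimated average, the resulting file size can only approximate the configured maximum.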
-
「hoodie.copyonwrite.insert.split.size」:
Default 500000: controls exactly how many records a single fileId file holds. This requires that automatic bucketing, hoodie.copyonwrite.insert.auto.split, is turned off.
-
「hoodie.copyonwrite.insert.auto.split」: