Kiril Aleksovski — 04/28/2024, 9:33 PM
daft_df = daft.read_parquet(trip_data).sort(col("pickup_datetime"))
daft_df.write_parquet(output + "tripdata_sort_daft.parquet")
Trip data parquet file in question:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet
num_columns: 24
num_rows: 18479031
num_row_groups: 9
format_version: 2.6
serialized_size: 34549
Here is a plot with timings from other libraries: [plot attached]

jay — 04/28/2024, 9:33 PM
jay — 04/28/2024, 9:34 PM
jay — 04/28/2024, 9:36 PM
jay — 04/28/2024, 9:38 PM
Kiril Aleksovski — 04/28/2024, 9:40 PM
jay — 04/28/2024, 9:54 PM
Having num_row_groups: 9 would allow the optimization I linked above to split the parquet file into multiple partitions, letting Daft read/decode those partitions across multiple cores.
Your parquet file at https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet seems to only have 1 row group though! Did you perhaps link a different PQ file?
Kiril Aleksovski — 04/28/2024, 11:13 PM
jay — 04/28/2024, 11:14 PM
Sammy Sidhu — 04/29/2024, 4:11 AM
Kiril Aleksovski — 04/29/2024, 9:17 AM
Kiril Aleksovski — 04/29/2024, 9:28 AM
Sammy Sidhu — 04/29/2024, 6:04 PM
Sammy Sidhu — 04/29/2024, 6:04 PM