# general
k
@jay @Sammy Sidhu What did you do between these versions?! 👍 Or did I do something wrong? 😁 From time to time I play with my own benchmarks across different versions of some DataFrame libraries (Daft, Polars, DuckDB, DataFusion...). A few days back I noticed a drastic improvement in Daft, around 55% on one of the queries, which sorts a parquet file by a column and writes it to disk:
import daft
from daft import col

daft_df = daft.read_parquet(trip_data).sort(col("pickup_datetime"))  # trip_data: path/URL to the source parquet file
daft_df.write_parquet(output + "tripdata_sort_daft.parquet")  # output: destination directory prefix
Trip data parquet file in question: https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet
num_columns: 24
num_rows: 18479031
num_row_groups: 9
format_version: 2.6
serialized_size: 34549
Here is a plot with the other libraries' timings:
🔥 3
👀 4
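For reference, the file stats above can be pulled with pyarrow; the local filename below is just a placeholder for wherever the downloaded trip-data file lives:
import pyarrow.parquet as pq

# Prints the file-level metadata quoted above: num_columns, num_rows,
# num_row_groups, format_version and serialized_size.
meta = pq.ParquetFile("fhvhv_tripdata_2023-01.parquet").metadata
print(meta)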
j
Heh 😛
I'd have to dig through our changelog to figure out… But that's a pretty drastic difference! Is this workload working entirely off of local disk? What kind of hardware are you running?
Also love it, thanks for running the benchmarks! I've had a hunch that we were actually really performant vs these other local data engines, which have been getting a ton of attention, but our marketing hasn't been as good as theirs 😓
πŸ‘ 3
Here's one PR (released in v0.2.13) that could be responsible for the speedup: https://github.com/Eventual-Inc/Daft/pull/1799
We actually wrote a blogpost about it too: https://blog.getdaft.io/p/adversarial-file-reading-from-10000
πŸ‘ 1
k
24 GB ARM
j
> num_columns: 24
> num_rows: 18479031
> num_row_groups: 9
> format_version: 2.6
> serialized_size: 34549
Having num_row_groups: 9 would allow the optimization I linked above to split the parquet file into multiple partitions, allowing Daft to read/decode those partitions across multiple cores. Your parquet file at https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet seems to only have 1 rowgroup though! Did you perhaps link a different PQ file?
🙌 1
k
Yep, sorry 😁 I forgot to mention that the description above is of my own version of the file; I experimented with different parquet file variations and found this one to be the most performant with a BigQuery load job. It looks like this works fine in general 🤔
🎉 2
j
Makes sense! Having more rowgroups will definitely help with performance since Parquet reading/decoding parallelization is really only possible over rowgroups.
πŸ‘ 1
s
@Kiril Aleksovski this is amazing! You should totally do a write-up of this 🙂
🔥 1
👍 1
k
@Sammy Sidhu And do a guest blog?! 😇
🔥 2
@jay And don't worry about attention for Daft, just continue the good work and stay true to what you are doing and everything will follow I guess 🙌
➕ 1
❤️ 2
s
@Kiril Aleksovski that sounds like a great idea actually 🙂
👍 1
Let's come up with some bullet points for what we'd like to cover 🙂