# general
k
@jay @Sammy Sidhu What did you do between these versions?! 👍 Or did I do something wrong? 😁 From time to time I play with my own benchmarks across different versions of some DataFrame libraries (Daft, Polars, DuckDB, DataFusion...). A few days back I noticed a drastic improvement in Daft, around 55% on one of the queries, which sorts a parquet file by a column and writes it to disk:
import daft
from daft import col

daft_df = daft.read_parquet(trip_data).sort(col("pickup_datetime"))  # trip_data: path/URL to the source parquet file
daft_df.write_parquet(output + "tripdata_sort_daft.parquet")  # output: destination directory prefix
Trip data parquet file in question: https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet
num_columns: 24
num_rows: 18479031
num_row_groups: 9
format_version: 2.6
serialized_size: 34549
Here is a plot with the other libraries' timings:
🔥 3
👀 4
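For reference, the file stats above can be pulled with pyarrow; the local filename below is just a placeholder for wherever the downloaded trip-data file lives:
import pyarrow.parquet as pq

# Prints the file-level metadata quoted above: num_columns, num_rows,
# num_row_groups, format_version and serialized_size.
meta = pq.ParquetFile("fhvhv_tripdata_2023-01.parquet").metadata
print(meta)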
j
Heh 😛
I'd have to dig through our changelog to figure out… But that's a pretty drastic difference! Is this workload working entirely off of local disk? What kind of hardware are you running?
Also love it, thanks for running the benchmarks! I've had a hunch that we were actually really performant vs these other local data engines, which have been getting a ton of attention, but our marketing hasn't been as good as theirs 😓
πŸ‘ 3
Here's one PR (released in v0.2.13) that could be responsible for the speedup: https://github.com/Eventual-Inc/Daft/pull/1799
We actually wrote a blogpost about it too: https://blog.getdaft.io/p/adversarial-file-reading-from-10000
πŸ‘ 1
k
24 GB ARM
j
> num_columns: 24
> num_rows: 18479031
> num_row_groups: 9
> format_version: 2.6
> serialized_size: 34549
Having num_row_groups: 9 would allow the optimization I linked above to split the parquet file into multiple partitions, allowing Daft to read/decode those partitions across multiple cores. Your parquet file at https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet seems to only have 1 rowgroup though! Did you perhaps link a different PQ file?
🙌 1
k
Yep, sorry 😁 I forgot to mention that the description above is of my own version of the file; I experimented with different parquet file variations and found this one to be the most performant with a BigQuery load job. It looks like this works fine in general 🤔
🎉 2
j
Makes sense! Having more rowgroups will definitely help with performance since Parquet reading/decoding parallelization is really only possible over rowgroups.
πŸ‘ 1
s
@Kiril Aleksovski this is amazing! You should totally do a write-up of this 🙂
🔥 1
👍 1
k
@Sammy Sidhu And do a guest blog?! 😇
🔥 2
@jay And don't worry about attention for Daft, just continue the good work and stay true to what you are doing and everything will follow I guess 🙌
➕ 1
❤️ 2
s
@Kiril Aleksovski that sounds like a great idea actually 🙂
👍 1
Let's come up with some bullet points for what we'd like to cover 🙂