Community for the Daft project and all things distributed data

Distributed Data Community

image.png

On a whim I decided to run Daft on the Polars TPC-H benchmark repo and was initially horrified :smiling_face_with_tear:.
• Then I realized there was a bug in Polars’ Parquet file generation code!
• Polars was somehow generating extremely fragmented Parquet files with tiny rowgroups: <https://github.com/pola-rs/tpch/issues/123>
• And it seems our Daft Parquet reader reeeeally sucked at reading these fragmented rowgroups. That’s probably a bug we need to fix, or an optimization we need to make...
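
If anyone wants to check whether their own Parquet files have the same tiny-rowgroup problem, here is a rough sketch that reads the rowgroup sizes from the file metadata with pyarrow. The helper names `row_group_sizes` and `summarize`, the 10,000-row "tiny" cutoff, and the `lineitem.parquet` path are all made up for illustration; this is not how Daft or the Polars benchmark repo inspects files.

```python
from statistics import median


def row_group_sizes(path):
    """Return the number of rows in each rowgroup of a Parquet file."""
    # pyarrow is a third-party dependency; it is imported lazily so that
    # summarize() below stays usable without it.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    return [meta.row_group(i).num_rows for i in range(meta.num_row_groups)]


def summarize(sizes, tiny_threshold=10_000):
    """Summarize rowgroup sizes.

    tiny_threshold is an arbitrary cutoff chosen for this sketch,
    not an official definition of a "tiny" rowgroup.
    """
    if not sizes:
        return {"row_groups": 0, "total_rows": 0, "median_rows": 0, "tiny": 0}
    return {
        "row_groups": len(sizes),
        "total_rows": sum(sizes),
        "median_rows": median(sizes),
        "tiny": sum(1 for n in sizes if n < tiny_threshold),
    }


# Usage (hypothetical path):
# print(summarize(row_group_sizes("lineitem.parquet")))
```

A file where nearly every rowgroup lands in the `tiny` bucket is probably fragmented in the way described above.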
---

But anyhow, I ran this on my M2 MacBook Air and the results actually look really good!

I think we’ve accidentally made a really fast local data engine that can also run distributed…