On a whim I decided to run Daft on the Polars TPC-H benchmark repo and was initially horrified 🥲.
• Then I realized there was a bug in Polars’ Parquet file generation code!
• Polars was somehow generating extremely fragmented Parquet files with tiny row groups:
https://github.com/pola-rs/tpch/issues/123
• And it seems our Daft Parquet reader reeeeally sucked at reading these fragmented row groups. That’s probably a bug or a missing optimization on our side that we need to fix...
---
But anyhow, I ran this on my M2 MacBook Air and the results actually look really good!
I think we’ve accidentally made a really fast local data engine that can also run distributed…