Those files are from Polars’ TPC-H benchmarks (https://github.com/pola-rs/tpch), generated with a scale factor of 10.
Polars wrote 20,000 row groups for the lineitem file, which slows reads down by about 2x for Polars and DuckDB, but for Daft it slowed reads down by roughly 40x.
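(For reference, a quick way to double-check the row group count with pyarrow; the local path here is just a placeholder for wherever the generated file lives:)

```python
import pyarrow.parquet as pq

# Hypothetical path to the SF10 lineitem file from the Polars TPC-H generator
meta = pq.ParquetFile("lineitem.parquet").metadata
print(meta.num_row_groups)                   # should show ~20,000 row groups
print(meta.num_rows / meta.num_row_groups)   # average rows per row group
```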
@Sammy Sidhu had some hunches, and I haven’t done much exploration yet, but some interesting things to explore might be:
• how many ScanTasks we’re spawning (and any associated overhead of having too many ScanTasks)
• we could also run the local Parquet reading function in isolation from the rest of Daft to see where the slowdown is occurring (rough sketch below)
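A rough sketch of that second idea: time a plain end-to-end read of the same file through Daft versus pyarrow reading it directly. The path is a placeholder, and using `daft.read_parquet` + `count_rows` as the "full Daft path" is my assumption of a reasonable first probe, not a pinned-down methodology.

```python
import time

import daft
import pyarrow.parquet as pq

PATH = "lineitem.parquet"  # hypothetical local path to the SF10 lineitem file

# Full Daft read, materialized, so it includes ScanTask planning + execution
t0 = time.perf_counter()
df = daft.read_parquet(PATH).collect()
print(f"daft:    {time.perf_counter() - t0:.2f}s, {df.count_rows()} rows")

# Baseline: pyarrow reading the same file directly, no Daft involved
t0 = time.perf_counter()
tbl = pq.read_table(PATH)
print(f"pyarrow: {time.perf_counter() - t0:.2f}s, {tbl.num_rows} rows")
```

If the pyarrow baseline is also slow on this file, the 20,000 row groups are the main culprit; if only the Daft number blows up, that points back at the ScanTask overhead question above.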