Those files are from Polars’ TPC-H benchmarks (https://github.com/pola-rs/tpch), generated with a scale factor of 10.
Polars wrote 20,000 row groups for the lineitem file, which slows reads down by about 2x for Polars and DuckDB, but for Daft it slowed reads down by roughly 40x.
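(For reference, a quick way to double-check the row group count with pyarrow; the local path here is just a placeholder for wherever the generated file lives:)

```python
import pyarrow.parquet as pq

# Hypothetical path to the SF10 lineitem file from the Polars TPC-H generator
meta = pq.ParquetFile("lineitem.parquet").metadata
print(meta.num_row_groups)                   # should show ~20,000 row groups
print(meta.num_rows / meta.num_row_groups)   # average rows per row group
```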
@Sammy Sidhu had some hunches, and I haven’t done much exploration yet, but some interesting things to explore might be:
• how many ScanTasks we’re spawning (and any associated overhead of having too many ScanTasks)
• we could also run the local Parquet reading function in isolation from the rest of Daft to see where the slowdown is occurring (rough sketch below)
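A rough sketch of that second idea: time a plain end-to-end read of the same file through Daft versus pyarrow reading it directly. The path is a placeholder, and using `daft.read_parquet` + `count_rows` as the "full Daft path" is my assumption of a reasonable first probe, not a pinned-down methodology.

```python
import time

import daft
import pyarrow.parquet as pq

PATH = "lineitem.parquet"  # hypothetical local path to the SF10 lineitem file

# Full Daft read, materialized, so it includes ScanTask planning + execution
t0 = time.perf_counter()
df = daft.read_parquet(PATH).collect()
print(f"daft:    {time.perf_counter() - t0:.2f}s, {df.count_rows()} rows")

# Baseline: pyarrow reading the same file directly, no Daft involved
t0 = time.perf_counter()
tbl = pq.read_table(PATH)
print(f"pyarrow: {time.perf_counter() - t0:.2f}s, {tbl.num_rows} rows")
```

If the pyarrow baseline is also slow on this file, the 20,000 row groups are the main culprit; if only the Daft number blows up, that points back at the ScanTask overhead question above.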