Hey all qq does Daft have join based filtering on read Distributed Data Community #general

# general

Jake Waller

09/29/2024, 1:06 PM

Hey all - qq, does Daft have join-based filtering on read?

jay

09/29/2024, 4:46 PM

Are you referring to Dynamic Partition Pruning (DPP), like Spark has?

Jake Waller

09/29/2024, 4:48 PM

When I was reading the Apache DataFusion paper, they mentioned an issue submitted this year highlighting the use of join-based filtering on read, so I was wondering, as you’ve done a lot of work on the IO side, whether it was something Daft had.

jay

09/29/2024, 5:15 PM

We do have it as part of an AQE rule, but AQE isn’t a feature we’ve focused on yet! Interestingly, the work isn’t on the IO side: our IO already does a lot of good filter pushdown. The work is more about the engine knowing (using statistics or sampling) which side of a join it wants to materialize first, and then using the materialized data to construct a filter to push down! @Sammy Sidhu might be able to add more here
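
To make the "which side to materialize first" decision concrete, here is a minimal sketch in plain Python. It is not Daft's implementation: `SideStats`, `pick_build_side`, and the `max_build_rows` threshold are hypothetical names invented for illustration, and the row estimates stand in for whatever statistics or sampling an engine actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SideStats:
    """Hypothetical per-side statistics (e.g. from file metadata or sampling)."""
    name: str
    estimated_rows: int


def pick_build_side(left: SideStats, right: SideStats,
                    max_build_rows: int = 100_000) -> Optional[SideStats]:
    """Return the side worth materializing first, or None if neither is small enough.

    Assumption: a smaller side yields a smaller, cheaper filter to push down.
    """
    smaller = left if left.estimated_rows <= right.estimated_rows else right
    if smaller.estimated_rows > max_build_rows:
        # Both sides are large: a key filter would be too big to pay off.
        return None
    return smaller


choice = pick_build_side(SideStats("left", 500), SideStats("right", 50_000_000))
print(choice.name if choice else "no pruning")
```

The threshold reflects a design trade-off: materializing a side only helps if the resulting filter is cheap to build and selective enough to skip a meaningful share of the other side's data.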

Jake Waller

09/30/2024, 7:54 AM

What does AQE stand for here? Also, that’s super interesting. I assume it would show up as part of the query plan as well? I didn’t realise it wouldn’t be part of the IO; I’ll see if I can read up on it more!

jay

09/30/2024, 7:57 AM

Adaptive Query Execution! It wouldn’t show up as part of the initial plan, but as the plan executes (and at various stages of execution) Daft gets more up-to-date information about the data coming out of each stage. For example, if it runs the left side of a join, it can say: oh, this is actually pretty small, let me convert the data into a filter and push it down the right side. That’s DPP (Dynamic Partition Pruning) in a nutshell!
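
The runtime flow described in this message can be sketched in plain Python. This is an illustration under stated assumptions, not Daft's engine code: `scan` and `join_with_dpp` are hypothetical helpers, rows are modeled as dicts, and the `predicate` argument stands in for a filter pushed down into a real file scan.

```python
def scan(rows, predicate=None):
    """Stand-in for a file scan; `predicate` models a pushed-down filter."""
    return [r for r in rows if predicate is None or predicate(r)]


def join_with_dpp(left_rows, right_rows, key):
    """Inner join that prunes the right side using keys from the left side."""
    # 1. Materialize the (small) left side and collect its distinct join keys.
    keys = {row[key] for row in left_rows}

    # 2. Convert those keys into a filter and push it into the right-side scan.
    right_filtered = scan(right_rows, lambda row: row[key] in keys)

    # 3. Run an ordinary hash join on the pruned right side.
    by_key = {}
    for row in right_filtered:
        by_key.setdefault(row[key], []).append(row)
    joined = [{**l, **r} for l in left_rows for r in by_key.get(l[key], [])]
    return joined, len(right_filtered)


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": i, "value": i * 10} for i in range(1000)]

joined, rows_kept = join_with_dpp(left, right, "id")
print(rows_kept)  # rows surviving the pushed-down filter
print(joined)
```

In this toy version the filter is applied row by row; in a real engine the benefit comes from the scan skipping whole partitions or row groups whose statistics cannot match the key set.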
