# general
j
@Kyle BTW quick follow-up:
daft.set_execution_config(parquet_split_row_groups_max_files=100)
You can use this flag to increase the number of files for which Daft will attempt to split reads. This is especially useful if you have larger Parquet files (e.g. your 10G ones). It will increase the time Daft takes to generate the query plan, but you should see the total number of partitions increase, and Daft will split each file into multiple partitions
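A minimal sketch of how the flag from the snippet above would be applied in practice. The `data/*.parquet` path is a hypothetical placeholder, and this is a config sketch rather than a runnable benchmark:

```python
import daft

# Raise the cap on how many files Daft will attempt to split
# into multiple partitions by row group (the chat mentions a
# default of 10, which is low for thousands of files).
daft.set_execution_config(parquet_split_row_groups_max_files=100)

# Hypothetical path: a glob of large Parquet files.
df = daft.read_parquet("data/*.parquet")

# With splitting enabled for more files, the query plan should
# contain more partitions than the raw file count.
print(df.num_partitions())
```

Note that the config must be set before the `read_parquet` call, since splitting happens at plan-generation time.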
k
Great, thanks!!
j
Yeah it currently defaults to 10….
which maybe we should consider increasing
k
Cool! I think my bigger problem is that it's a little hard to understand which config param does what, and which scenarios it can handle when we tweak those parameters. For my scale of thousands of files, I think the default threshold is unlikely to be sufficient.
j
Yeah the config parameters are going to be more of a crutch. Ideally query engines should be able to do more intelligent things here 😕
k
Makes sense! 😄 My fiddling with the parameters never worked out well haha
j
Yeah we’ll think about it a bit more too. With some of the upcoming architectural changes to Daft this should hopefully be much less of an issue as we move towards a more streaming execution model
k
Nice! Looking forward haha