If I find that my script is reading the parquet files in units that are too small, e.g. the parquet is 1 GB but the reads/writes are coming through in KB (I'm guessing due to the auto streaming?), how can I adjust it back to a bigger chunk? I'm getting many errors due to throttling on the number of requests 🥲
jay
09/10/2024, 4:43 PM
Daft should perform coalescing of your read sizes into larger chunks of about a few MB by default!
How are you checking that it is reading kb at a time?
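For reference, a minimal sketch of the execution-config knobs that I believe control this coalescing of small scan work into larger chunks (parameter names should be verified against your Daft version; the bucket/path is a placeholder):

```python
import daft

# Assumption: scan_tasks_min_size_bytes / scan_tasks_max_size_bytes are the
# execution-config settings that control how small scan tasks get coalesced
# into larger ones before reading. Verify against your Daft version.
daft.set_execution_config(
    scan_tasks_min_size_bytes=96 * 1024 * 1024,   # coalesce tasks smaller than ~96 MB
    scan_tasks_max_size_bytes=384 * 1024 * 1024,  # stop coalescing once tasks reach ~384 MB
)

# Hypothetical path, for illustration only.
df = daft.read_parquet("s3://my-bucket/my-1gb-file.parquet")
df.show()
```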
Kyle
09/10/2024, 11:46 PM
I had to reach out to another team running the S3 service, and they said that my chunks were too small and were in KB. Not too sure how they checked that. Which parameters in the execution config (or S3 config) would influence this?
jay
09/11/2024, 12:11 AM
Are you running against AWS S3 or your own S3 service? The Daft readers are optimized against AWS, so they might require a bit of tuning when running against your own service!
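A hedged sketch of what that tuning can look like when pointing Daft at a non-AWS S3 service while also softening request-rate throttling. The endpoint and path are placeholders, and the parameter names should be checked against your Daft version:

```python
import daft
from daft.io import IOConfig, S3Config

io_config = IOConfig(
    s3=S3Config(
        endpoint_url="https://s3.internal.example.com",  # hypothetical internal S3 endpoint
        region_name="us-east-1",
        max_connections=8,              # fewer concurrent connections -> fewer simultaneous requests
        num_tries=10,                   # retry throttled requests more times
        retry_initial_backoff_ms=1000,  # back off longer between retries
    )
)

# Hypothetical path, for illustration only.
df = daft.read_parquet("s3://my-bucket/my-1gb-file.parquet", io_config=io_config)
df.show()
```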
Kyle
09/11/2024, 12:16 AM
I see.. thank you!
Kyle
09/11/2024, 12:19 AM
They recommended that I specify multipart_chunksize for S3. Would there be an equivalent config in Daft?