Community for the Daft project and all things distributed data

Distributed Data Community

More coverage on Daft that I missed yesterday too! https://www.linkedin.com/posts/daniel-beach-6ab8b4132_daft-polars-dataengineering-activity-7208470406247182336-I64y?utm_source=share&utm_medium=member_desktop

Looks like the experience was nicer than with Polars too (we worked out of the box with S3 where Polars failed, and we were faster than the Polars+PyArrow solution) :heart:

I am forever surprised by the somewhat poor integration that Polars has with S3. You would assume that, with so many datasets sitting in remote stores, this would be one of the first things to be incorporated and focused on.

They had really poor integrations until very recently actually… tbh I think it’s when we started putting out really strong S3 read benchmarks for Daft that Ritchie started focusing on it

IIRC it was on their roadmap for a while, but the original implementation was a bit of a mess and took some finesse to get into a better state

That makes sense. Interestingly for us it was the other way around — we had really good S3 readers but had to put in additional work to get our local readers up to speed!