This message was deleted Distributed Data Community #daft-dev


# daft-dev

Slackbot

03/10/2024, 11:13 PM

This message was deleted.

jay

03/11/2024, 12:37 AM

Cc @Clark Zinzow You’ve actually hit on some of the bigger architectural improvements we have on our internal roadmap:
1. Removal of Ray dataplane dependency (so we can run on something like lambdas)
2. Cost-based (+dynamic) optimizations

Would love to get your further thoughts there. Also, we’re hiring heavily in this area, so if you know anyone who might be a good fit to help us further this vision, let us know 😊

Ismael Ghalimi

03/11/2024, 12:54 AM

@Clark Zinzow Great to meet you. Many thanks for your work on Ray Data, we love it! I can't wait to see what you're going to do with Daft. If I knew such candidates, I would probably hire them myself, so don't expect much from us on that front 😉

Regarding Lambdas, we should really have a conversation about goals and objectives. If you use them in a stateless manner, they won't give you much beyond acceleration of downloads from S3. This is nice, but a lot more can be done there. If you use them in a stateful manner like we or BoilingData do, you can do incredible things. And now that stateful execution is officially supported by AWS, the sky is the limit. But you need your cost-based optimizer to take their cost into account, because they're 11x more expensive than EC2, and therefore should only be used for elasticity, not scalability, as far as compute is concerned. But they can totally be used for accelerating downloads.

Another aspect that is slightly related: companies like yours should help us lobby AWS to add support for Parquet output when doing a `SELECT` from S3. Right now, it only supports CSV output, which is not really useful. Having Parquet output would allow super efficient filter pushdown.

Regarding cost-based optimizations, we should have a direct conversation. There is a lot to cover there...
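To make the "elasticity, not scalability" point concrete, here is a minimal Python sketch of how a cost-based planner might account for Lambda pricing. It is an illustration only: the 11x multiplier is the figure quoted in the message above, while `plan_compute`, `cheaper_to_burst`, and the notion of abstract "compute units" are hypothetical names and simplifications, not part of Daft or any AWS API.

```python
from dataclasses import dataclass

# Figure quoted in the thread: Lambda compute is ~11x the price of EC2.
# Treated here as an assumption, not a measured value.
LAMBDA_COST_MULTIPLIER = 11.0


@dataclass
class Plan:
    ec2_units: float     # compute units served by already-provisioned EC2
    lambda_units: float  # overflow units served by Lambda
    cost: float          # total cost, in units of "one EC2 compute unit"


def plan_compute(demand_units: float, ec2_capacity_units: float,
                 multiplier: float = LAMBDA_COST_MULTIPLIER) -> Plan:
    """Fill EC2 capacity first and send only the overflow to Lambda."""
    ec2_units = min(demand_units, ec2_capacity_units)
    lambda_units = max(0.0, demand_units - ec2_capacity_units)
    cost = ec2_units + lambda_units * multiplier
    return Plan(ec2_units, lambda_units, cost)


def cheaper_to_burst(duty_cycle: float,
                     multiplier: float = LAMBDA_COST_MULTIPLIER) -> bool:
    """Compare serving a peak with Lambda against provisioning EC2 for it.

    duty_cycle is the fraction of time the extra capacity is actually used.
    EC2 provisioned for the peak is paid for the whole period (cost 1.0);
    Lambda is paid only while running (cost multiplier * duty_cycle).
    """
    return multiplier * duty_cycle < 1.0


if __name__ == "__main__":
    print(plan_compute(demand_units=100, ec2_capacity_units=80))
    print(cheaper_to_burst(0.05), cheaper_to_burst(0.5))
```

Under these assumptions, Lambda only wins for capacity that is needed less than about 1/11 of the time (roughly 9%), which is one way to read the claim that it suits short bursts rather than steady-state scale.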
