@Clark Zinzow Great to meet you. Many thanks for your work on Ray Data, we love it! I can't wait to see what you're going to do with Daft.
If I knew such candidates, I would probably hire them myself, so don't expect much from us on that front 😉
Regarding Lambdas, we should really have a conversation about goals and objectives. If you use them in a stateless manner, they won't give you much beyond accelerating downloads from S3. That is nice, but a lot more can be done there. If you use them in a stateful manner, like we or BoilingData do, you can do incredible things. And now that stateful execution is officially supported by AWS, the sky is the limit. But your cost-based optimizer needs to take their cost into account: they're 11x more expensive than EC2, so they should only be used for elasticity, not scalability, as far as compute is concerned. They can totally be used for accelerating downloads, though.
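To make the elasticity-versus-scalability point concrete, here is a rough sketch of the trade-off a cost model would have to weigh. Everything in it is an illustrative assumption except the 11x ratio quoted above: the function names are hypothetical, costs are normalized so one unit of EC2 compute per time slot costs 1.0, and EC2 capacity is assumed to be paid for in every slot whether it is used or not.

```python
# Assumption: Lambda costs ~11x EC2 per unit of compute (ratio from the message above).
LAMBDA_COST_RATIO = 11.0


def plan_cost(demand, ec2_capacity, ec2_unit_cost=1.0, lambda_ratio=LAMBDA_COST_RATIO):
    """Total cost of serving `demand` (compute units needed per time slot).

    EC2 capacity is paid for in every slot, used or not.
    Anything above that capacity overflows to Lambda at the higher rate.
    """
    ec2_cost = ec2_capacity * ec2_unit_cost * len(demand)
    overflow = sum(max(0, d - ec2_capacity) for d in demand)
    lambda_cost = overflow * ec2_unit_cost * lambda_ratio
    return ec2_cost + lambda_cost


def best_ec2_capacity(demand):
    """Pick the EC2 capacity (among observed demand levels, or zero) with the lowest cost."""
    candidates = sorted(set(demand) | {0})
    return min(candidates, key=lambda c: plan_cost(demand, c))


# Spiky workload: one burst slot out of 24. Cheapest is a small EC2 baseline plus Lambda for the burst.
spiky = [10] * 23 + [100]
print(best_ec2_capacity(spiky), plan_cost(spiky, 10), plan_cost(spiky, 100))

# Steady workload: Lambda never pays off, so EC2 covers everything.
steady = [100] * 24
print(best_ec2_capacity(steady))
```

Under these assumptions, Lambda only wins when it absorbs short bursts above a baseline; for sustained load, the 11x premium dominates.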
Another aspect, slightly related, is that companies like yours should help us lobby AWS to add support for Parquet output when doing a SELECT from S3. Right now, it only supports CSV or JSON output, which is not really useful. Having Parquet output would allow super efficient filter pushdown.
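For context, this is roughly what that limitation looks like from the client side with boto3's `select_object_content`. It is only a sketch: the helper names are hypothetical, and the bucket, key, and query in the usage note are placeholders. The object can be read as Parquet, but the filtered rows come back as text that has to be parsed again.

```python
def build_select_request(bucket, key, sql):
    """Build keyword arguments for boto3's S3 `select_object_content` call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # Parquet is accepted as an input format...
        "InputSerialization": {"Parquet": {}},
        # ...but results are serialized as text (CSV here), not Parquet.
        "OutputSerialization": {"CSV": {}},
    }


def run_select(request):
    """Execute the request and return the raw CSV bytes of the matching rows."""
    import boto3  # imported lazily so the request builder works without boto3 installed

    s3 = boto3.client("s3")
    response = s3.select_object_content(**request)
    chunks = []
    for event in response["Payload"]:  # event stream of Records, Stats, End, ...
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks)
```

Usage would be something like `run_select(build_select_request("my-bucket", "data/part-0.parquet", "SELECT s.id FROM s3object s WHERE s.amount > 100"))`, with the names swapped for real ones. The CSV round trip on the way back is the part that Parquet output would remove.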
Regarding cost-based optimizations, we should have a direct conversation. There is a lot to cover there...