I am new to daft and considering using daft as an alternativ Distributed Data Community #general

Phil Chen

07/20/2024, 4:35 PM

I am new to Daft and am considering it as an alternative to EMR on EKS for my ETL tasks. Does Daft support AWS Glue? How well does the library integrate with the AWS ecosystem? What are the major benefits of using Daft over EMR for such tasks in an AWS environment? Any documents and examples would be appreciated.

jay

07/21/2024, 6:08 PM

Welcome! We work with AWS and Glue in a few ways:
1. You can run Ray-on-Glue and use Daft on it to run in distributed mode
2. You can read Glue Iceberg tables
3. Our S3 readers are tuned extremely well to read from AWS S3

One nice ergonomic thing is that everything was built statically for S3, so there’s no figuring out installations of jars or anything. Daft is plug-and-play with AWS and S3
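
A minimal sketch of points 2 and 3 above, assuming Daft and PyIceberg are installed and AWS credentials are available in the environment. The helper names, the bucket path, and the database and table names are hypothetical. The Glue catalog is loaded through PyIceberg with only `type: glue` set, so the region is taken from the standard AWS configuration.

```python
def glue_table_identifier(database: str, table: str) -> str:
    """Build the 'database.table' identifier that PyIceberg expects."""
    return f"{database}.{table}"


def read_s3_parquet(path: str, region: str):
    """Read Parquet files from S3 into a lazy Daft DataFrame."""
    # Third-party imports are kept inside the functions so the helpers
    # above can be used without Daft installed.
    import daft
    from daft.io import IOConfig, S3Config

    io_config = IOConfig(s3=S3Config(region_name=region))
    return daft.read_parquet(path, io_config=io_config)


def read_glue_iceberg(database: str, table: str):
    """Load an Iceberg table registered in AWS Glue and read it with Daft."""
    import daft
    from pyiceberg.catalog import load_catalog

    # "glue" is just a local name for the catalog; the type selects AWS Glue.
    catalog = load_catalog("glue", **{"type": "glue"})
    iceberg_table = catalog.load_table(glue_table_identifier(database, table))
    return daft.read_iceberg(iceberg_table)
```

For example, `read_s3_parquet("s3://my-bucket/events/*.parquet", "us-east-1").show()` would preview the files under a hypothetical bucket.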

jay

07/21/2024, 6:08 PM

https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/aws.html

jay

07/21/2024, 6:09 PM

Would love any feedback as you start on your Daft workloads!

Phil Chen

07/27/2024, 2:56 AM

Thanks Jay for the info. I just got a chance to play with Daft. I installed it locally. As you said, it is pretty easy to load Parquet files from the S3 object store. I ran Daft in local mode and on a local Ray cluster, and both were pretty straightforward. What I am trying to do, instead of reading S3 files directly, is point Daft at AWS Glue so I can read from Glue databases and tables. I also want to be able to write against the Glue catalog, so that I can create Glue database tables much like Spark can, or like Athena CTAS. I would like to do this without using Ray on Glue; I’d like to have my own Ray cluster running on an EKS cluster. I cannot find any documentation or examples on how to do that.
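
A rough sketch of pointing Daft at a self-managed Ray cluster, such as one running on EKS, rather than Ray on Glue. It assumes the Ray head node exposes the Ray Client server on its default port 10001 and is reachable from where the script runs. The host name is hypothetical.

```python
def ray_client_address(host: str, port: int = 10001) -> str:
    """Build a Ray Client address; 10001 is Ray's default client server port."""
    return f"ray://{host}:{port}"


def use_remote_ray(host: str, port: int = 10001) -> None:
    """Tell Daft to execute queries on an existing Ray cluster."""
    # Imported here so the address helper works without Daft installed.
    import daft

    # Call this once, before building any DataFrames.
    daft.context.set_runner_ray(address=ray_client_address(host, port))
```

After `use_remote_ray("my-ray-head.example.internal")`, subsequent Daft queries in that process would run on the remote cluster instead of the local machine.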

jay

07/27/2024, 3:08 AM

All good points. Do you know what kind of tables you want to use? (Iceberg, Delta, etc.)

Phil Chen

07/27/2024, 3:20 AM

Our use case requires reading from and writing to both traditional Hive tables and Iceberg tables.
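
One possible workaround for the Hive side, sketched under assumptions the thread does not confirm: look up the table's S3 location in the Glue Data Catalog with boto3, then read the files with Daft. It assumes the Hive table is stored as Parquet, and partition columns that exist only in directory names may not appear as columns. For Iceberg, appends can go through a PyIceberg table object. All helper names here are hypothetical.

```python
def table_location(glue_response: dict) -> str:
    """Extract the S3 location from a Glue get_table response."""
    return glue_response["Table"]["StorageDescriptor"]["Location"]


def parquet_glob(location: str) -> str:
    """Turn a table location into a recursive glob over Parquet files."""
    return location.rstrip("/") + "/**/*.parquet"


def read_glue_hive_table(database: str, table: str):
    """Read a Parquet-backed Hive table registered in Glue with Daft."""
    import boto3
    import daft

    glue = boto3.client("glue")
    response = glue.get_table(DatabaseName=database, Name=table)
    return daft.read_parquet(parquet_glob(table_location(response)))


def append_to_glue_iceberg(df, database: str, table: str):
    """Append a Daft DataFrame to an existing Iceberg table in Glue."""
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("glue", **{"type": "glue"})
    iceberg_table = catalog.load_table(f"{database}.{table}")
    return df.write_iceberg(iceberg_table, mode="append")
```

This sketch does not create new Glue tables; the Iceberg table must already exist in the catalog before appending.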
