Distributed Data Community #general

Zac Steer

09/05/2024, 12:22 PM

Hey guys, I came across daft recently and find the project really exciting. I’ve also seen the ballista project in the past and had some questions on how they compare. More details in thread.

Zac Steer

09/05/2024, 12:22 PM

Ballista is essentially distributed datafusion with a sql + dataframe interface

Zac Steer

09/05/2024, 12:22 PM

It advertises some of the same benefits of daft (e.g. rust + arrow based, can handle larger than memory datasets w/distributed execution)

Zac Steer

09/05/2024, 12:23 PM

Being neither a daft nor ballista expert, my initial impression is ballista aims to just be “distributed datafusion” whereas daft aims to be “the best distributed dataframe”. E.g. daft has more integrations (iceberg), better docs, simpler user experience.

Zac Steer

09/05/2024, 12:23 PM

That being said, I’m curious what the team thinks about it and if y’all looked at ballista at some point but decided it wasn’t a good fit

jay

09/05/2024, 7:46 PM

Hey @Zac Steer! Ballista really felt like a demo project (and still does…) to demonstrate datafusion. We haven’t seen anyone use or test it in production. It’s very hard to get distributed execution right, and it feels like most of the investment from that team is now just going to be put into putting Datafusion into Spark via the Comet project.

whereas daft aims to be “the best distributed dataframe”.

In the long run, I think we just want to be the best offline batch data engine across analytics, data processing (ETL) and ML!

Zac Steer

09/05/2024, 8:18 PM

Sweet :) thanks Jay. As a follow up, what are the team’s thoughts on spark + comet? Do you guys view it as a competing product to daft or do you guys view them as separate tools for separate use cases?

Zac Steer

09/05/2024, 8:40 PM

Found this post from Andy essentially confirming your take on ballista

🙌 1

jay

09/05/2024, 9:09 PM

I think Comet is basically a pure performance increase for Spark! So if folks are already happy with Spark but just need it to be faster, then it’s a potential solution there. Viewed from that lens, it’s a direct competitor to Daft as a plug-in replacement for Spark. But Daft actually provides quite a bit more than just Spark:
1. Efficient dataloading into ML training
2. Great local story (no need to spin up a cluster locally… And you get DuckDB-like performance)
3. Better Python UDFs for running Python code/models (avoiding the JVM)

Zac Steer

09/05/2024, 11:04 PM

Gotcha. Thank you. On 1, are you mainly referring to the to_torch_* and to_ray_dataset utilities? 2 makes sense. 3 makes sense. I’m not sure if comet supports udfs right now. If/when they do, would be interesting to compare how they work compared to daft udfs

jay

09/05/2024, 11:07 PM

1. Yes and also just…

df.with_column("data", df["urls"].url.download()).iter_rows()

haha. It’s really good 😛
2. 👍
3. On some things they will always be hamstrung by Spark. For example: requesting a GPU to run your models, stateful initialization of models, etc.
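To make the "stateful initialization" point concrete, here is a minimal plain-Python sketch. It does not use Daft's or Spark's UDF API; `FakeModel`, `run_stateless` and `run_stateful` are hypothetical names made up for illustration. The idea is only that an expensive model load should happen once per worker and then be reused across batches, rather than once per batch.

```python
from typing import List


class FakeModel:
    """Hypothetical stand-in for a model that is expensive to load."""

    loads = 0  # counts how many times the "weights" were loaded

    def __init__(self) -> None:
        # Imagine reading weights from disk or moving them to a GPU here.
        FakeModel.loads += 1

    def __call__(self, batch: List[str]) -> List[int]:
        # Toy "inference": the length of each input string.
        return [len(text) for text in batch]


def run_stateless(batches: List[List[str]]) -> List[List[int]]:
    # Rebuilds the model for every batch, paying the load cost each time.
    return [FakeModel()(batch) for batch in batches]


def run_stateful(batches: List[List[str]]) -> List[List[int]]:
    # Builds the model once and reuses it across all batches.
    model = FakeModel()
    return [model(batch) for batch in batches]
```

Both functions return the same results; the difference is how often `FakeModel.__init__` runs, which is the cost that stateful initialization is meant to avoid.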

😎 1

Zac Steer

09/05/2024, 11:55 PM

Nice. Thanks for all the knowledge

Kyle

09/06/2024, 12:13 AM

@jay do you use the iter_rows here just to trigger the execution? Does the result get cached if I don't explicitly state for it to persist?

jay

09/06/2024, 12:30 AM

iter_rows will lazily materialize partitions and basically stream you results. There isn’t any caching per se, but there is buffering: we hold a number of result partitions in the buffer (configurable!) and return you results row-at-a-time over those partitions. When there is more space in the buffer, we then trigger more compute to fetch another partition.
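The buffering behaviour described above can be sketched in plain Python. This is an illustration under stated assumptions, not Daft's implementation: `iter_rows_buffered`, `partition_tasks` and `buffer_size` are hypothetical names, and partitions are computed synchronously here, whereas a real engine would likely compute them in the background.

```python
from collections import deque
from typing import Callable, Deque, Iterator, List

Row = dict
Partition = List[Row]


def iter_rows_buffered(
    partition_tasks: List[Callable[[], Partition]],
    buffer_size: int = 2,
) -> Iterator[Row]:
    """Yield rows one at a time, keeping at most `buffer_size`
    computed partitions waiting in the buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")

    pending = iter(partition_tasks)       # partitions not yet computed
    buffer: Deque[Partition] = deque()    # computed partitions not yet read

    def fill() -> None:
        # Trigger more compute only while the buffer has free slots.
        while len(buffer) < buffer_size:
            task = next(pending, None)
            if task is None:
                return
            buffer.append(task())

    fill()
    while buffer:
        partition = buffer.popleft()  # the partition being read leaves the buffer
        fill()                        # a slot is free, so fetch another partition
        yield from partition          # hand rows back one at a time
```

Because this is a generator, nothing is computed until the first row is requested, and stopping iteration early means the remaining partitions are never computed. That is the sense in which results are streamed and buffered rather than cached.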

Kyle

09/06/2024, 12:31 AM

@jay Super cool! Thank you!!

Zac Steer

09/15/2024, 7:18 PM

Funny timing :) Andy and the datafusion community are looking at using ray to do distributed datafusion instead of ballista! Specifically, they’re looking at using datafusion ray-sql. It’s an early prototype that essentially takes a query and runs it across ray nodes, where each node uses datafusion to process its portion of the query. Ballista does the same, taking a query and distributing it across cluster nodes, but uses its own hand-rolled cluster scheduling and executor logic.

This all comes from Andy taking a look at ballista this weekend, and it looks like he thought it was too much to maintain and that using ray was much simpler.

Here’s the discord discussion where Andy starts talking about it: https://discord.com/channels/885562378132000778/1234886350398947419/1284220026894811258

Here’s the formal github issue where they’ll discuss further: https://github.com/apache/datafusion-python/issues/872

Interested to see where it goes. Could be a potential area for daft and datafusion to collaborate. At a minimum, looks like a good sign that more projects are adopting ray. Cc @jay

👍 1

jay

09/15/2024, 7:23 PM

Interesting! Yeah, a naive Ray solution scales well into the TB range, but we’ve found it to be limiting for some of our larger workloads. Cool to see that others in the community are now starting to pick up on this though.

😎 1
