Hey guys, I came across daft recently and find the...
# general
z
Hey guys, I came across daft recently and find the project really exciting. I’ve also seen the ballista project in the past and had some questions on how they compare. More details in thread.
Ballista is essentially distributed datafusion with a sql + dataframe interface
It advertises some of the same benefits of daft (e.g. rust + arrow based, can handle larger than memory datasets w/distributed execution)
Being neither a daft nor ballista expert, my initial impression is ballista aims to just be “distributed datafusion” whereas daft aims to be “the best distributed dataframe”. E.g. daft has more integrations (iceberg), better docs, simpler user experience.
That being said, I’m curious what the team thinks about it and if y’all looked at ballista at some point but decided it wasn’t a good fit
j
Hey @Zac Steer! Ballista really felt like a demo project (and still does…) to demonstrate datafusion. We haven’t seen anyone use or test it in production. It’s very hard to get distributed execution right, and it feels like most of the investment from that team is now just going to be put into putting Datafusion into Spark via the Comet project.
whereas daft aims to be “the best distributed dataframe”.
In the long run, I think we just want to be the best offline batch data engine across analytics, data processing (ETL) and ML!
z
Sweet :) thanks Jay. As a follow up, what are the team’s thoughts on spark + comet? Do you guys view it as a competing product to daft or do you guys view them as separate tools for separate use cases?
Found this post from Andy essentially confirming your take on ballista
🙌 1
j
I think Comet is basically a pure performance increase for Spark! So if folks are already happy with Spark but just need it to be faster, then it’s a potential solution there. Viewed from that lens, it’s a direct competitor to Daft as a plug-in replacement for Spark. But Daft actually provides quite a bit more than just Spark: 1. Efficient dataloading into ML training 2. Great local story (no need to spin up a cluster locally… And you get DuckDB-like performance) 3. Better Python UDFs for running Python code/models (avoiding the JVM)
z
Gotcha. Thank you. On 1, are you mainly referring to the to_torch_* and to_ray_dataset utilities? 2 makes sense. 3 makes sense. I’m not sure if comet supports udfs right now. If/when they do, would be interesting to compare how they work compared to daft udfs
j
1. Yes and also just…
df.with_column("data", df["urls"].url.download()).iter_rows()
haha. It’s really good 😛 2. 👍 3. Some things they will always be hamstrung by Spark. For example, requesting for a GPU to run your models. Stateful initialization of models. etc etc
😎 1
z
Nice. Thanks for all the knowledge
k
@jay do you use the iter_rows here just to trigger the execution? Does the result get cached if I don't explicitly state for it to persist?
j
iter_rows
will lazily materialize partitions and basically stream you results. There isn’t any caching per-se, but there is buffering — we hold
N
number of result partitions in the buffer (configurable!) and return you results row-at-a-time over those partitions. When there is more space in the buffer, we then trigger more compute to fetch another partition.
k
@jay Super cool! Thank you!!
z
Funny timing :) Andy and the datafusion community are looking at using ray to do distributed datafusion instead of ballista! Specifically, they’re looking at using datafusion ray-sql. It’s an early prototype that essentially takes a query and runs it across ray nodes, where each node uses datafusion to process its portion of the query. Ballista does the same, taking a query and distributing it across cluster nodes, but uses it own hand-rolled cluster scheduling and executor logic. This all comes from Andy taking a look at ballista this weekend and looks like he thought it was too much to maintain/using ray was much simpler. Here’s the discord discussion where Andy starts talking about it: https://discord.com/channels/885562378132000778/1234886350398947419/1284220026894811258 Here’s the formal github issue where they’ll discuss further: https://github.com/apache/datafusion-python/issues/872 Interested to see where it goes. Could be a potential area for daft and datafusion to collaborate. At a minimum, looks like a good sign that more projects are adopting ray. Cc @jay
👍 1
j
Interesting! Yeah a naive Ray solution scales well into the TB range, but we’ve found it to be limiting for some of our larger workloads Cool to see that others in the community are now starting to pick up on this though
😎 1