# daft-dev
j
We’ve been asked about Window, but haven’t looked at pivot yet!
i
Got it. My understanding is that `pivot` and `unpivot` are much simpler than `window`. I would probably start with the former. Also, `window` probably requires better temporal functions before it can be implemented properly. Regarding `pivot` and `unpivot`, I would recommend the Ibis API, which is much richer than Dask's but does not add much structural complexity. I would also use the "pivot" and "unpivot" terminology, which is quite conventional in the SQL world. "pivot_wider" and "pivot_longer" are good alternatives respectively (and they're certainly more explicit), but I would avoid Dask's "pivot_table" and "melt", which are quite confusing in my opinion.
🚀 1
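To make the terminology concrete, here is a minimal sketch of pivot/unpivot semantics on plain Python lists of dicts. The function names and parameters (`names_from`, `values_from`, `names_to`, `values_to`) are hypothetical and only mirror the Ibis-style API shape discussed above; this is not Daft's or Ibis' actual implementation.

```python
def pivot(rows, names_from, values_from, id_col):
    """Widen: one output row per id, one column per distinct value of names_from."""
    out = {}
    for r in rows:
        out.setdefault(r[id_col], {id_col: r[id_col]})[r[names_from]] = r[values_from]
    return list(out.values())


def unpivot(rows, id_col, value_cols, names_to="name", values_to="value"):
    """Lengthen: one output row per (id, value column) pair."""
    return [
        {id_col: r[id_col], names_to: c, values_to: r[c]}
        for r in rows
        for c in value_cols
    ]


long = [
    {"city": "NYC", "metric": "temp", "value": 20},
    {"city": "NYC", "metric": "rain", "value": 5},
    {"city": "SF", "metric": "temp", "value": 15},
    {"city": "SF", "metric": "rain", "value": 1},
]

wide = pivot(long, names_from="metric", values_from="value", id_col="city")
# wide == [{"city": "NYC", "temp": 20, "rain": 5},
#          {"city": "SF", "temp": 15, "rain": 1}]

back = unpivot(wide, id_col="city", value_cols=["temp", "rain"],
               names_to="metric", values_to="value")
```

The two operations are inverses here: unpivoting the wide table reproduces the original long one.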
j
Exciting :) I’ll check with the team tomorrow to see what they think
i
Awesome. Thanks a lot for that! Parity with Ibis with regards to DataFrame transforms would be a huge milestone for the project.
👍 1
j
Do you happen to have an existing checklist of Ibis functionality that you compared Daft against?
i
Yes, but it's probably incomplete, because we're very new to Daft, and our master list is missing a few things on the Ibis side. We started from Ibis' Operation Support Matrix of 330 nodes, added 44 that were missing from the official documentation, then did mappings to Daft and Ray. Here is what the matrix looks like. When we contribute our Ibis bindings to the Daft project, we'll provide a CSV or a JSON for all this metadata.
👍 1
j
Perfect thanks! Looking forward to discussing more of this soon
i
Likewise. As mentioned in my post to the #C0537HWJT6C channel, I don't think that getting parity with Ibis with respect to scalar functions should be a priority. There are a great many of them, and you can't realistically support them all in native Rust anytime soon, especially the geospatial functions (there is a Rust project trying to do that, but it will take one or two years for them to get anywhere close to what DuckDB does today). These scalar functions do not require any distribution, and you could get them directly through a simple integration with DuckDB. Even if that requires a memory copy, the cost of that copy is probably marginal compared to the cost of executing the functions themselves, which usually do fairly complex things on geospatial structures. And supporting their 50+ file formats is a massive undertaking that you really don't want to bother with. The one area that probably requires careful consideration is time series analysis with `window`. There, I would focus on scenarios that allow trivial distribution, where data is partitioned on dimensions included in the `window`'s groupby dimensions. That way, you don't have to worry about window boundaries spilling across nodes. This won't cover 100% of the use cases you might find out there, but it's enough for the vast majority of them, and it makes the overall implementation an order of magnitude simpler. I would also ignore the `CumulateWindowingTVF`, `HopWindowingTVF`, and `TumbleWindowingTVF` windowing functions, which are very specific to streaming databases like Apache Flink. They're super cool, but they raise many more questions than you probably want to tackle right now.
🔥 1
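A minimal sketch of the partitioned-window strategy described above, in plain Python (all names hypothetical): rows are hash-partitioned on the window's group-by key, and a per-key running sum is then computed entirely within each partition, so no window frame ever crosses a partition (node) boundary.

```python
from collections import defaultdict


def hash_partition(rows, key, n_partitions):
    """Assign each row to a partition by hashing its group-by key, so all
    rows sharing a key land in exactly one partition."""
    parts = [[] for _ in range(n_partitions)]
    for r in rows:
        parts[hash(r[key]) % n_partitions].append(r)
    return parts


def running_sum_over(rows, key, order_by, value, out_col="running"):
    """Per-key cumulative sum, ordered by order_by -- a classic window function.
    Correct on any partition that holds every row for its keys."""
    totals = defaultdict(float)
    result = []
    for r in sorted(rows, key=lambda r: (r[key], r[order_by])):
        totals[r[key]] += r[value]
        result.append({**r, out_col: totals[r[key]]})
    return result


rows = [
    {"sym": "A", "t": 1, "px": 10.0},
    {"sym": "A", "t": 2, "px": 12.0},
    {"sym": "B", "t": 1, "px": 7.0},
    {"sym": "B", "t": 2, "px": 8.0},
]

# Each "node" processes its partitions independently; the union of the
# per-partition results is correct because every key lives in one partition.
out = [
    r
    for part in hash_partition(rows, "sym", 2)
    for r in running_sum_over(part, "sym", "t", "px")
]
```

This is exactly the "trivial distribution" case: no shuffle or boundary exchange is needed between partitions, which is what makes the implementation so much simpler than a general distributed window.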