# daft-dev
j
We’ve been asked about Window, but haven’t looked at pivot yet!
i
Got it. My understanding is that `pivot` and `unpivot` are much simpler than `window`. I would probably start with the former. Also, `window` probably requires better temporal functions before it can be implemented properly. Regarding `pivot` and `unpivot`, I would recommend the Ibis API, which is much richer than Dask's but does not add much structural complexity. I would also use the "pivot" and "unpivot" terminology, which is quite conventional in the SQL world. "pivot_wider" and "pivot_longer" are good alternatives respectively (and they're certainly more explicit), but I would avoid Dask's "pivot_table" and "melt", which are quite confusing in my opinion.
🚀 1
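To make the terminology concrete, here is a minimal sketch of pivot/unpivot semantics on plain Python lists of dicts. The function names and parameters (`names_from`, `values_from`, `names_to`, `values_to`) are hypothetical and only mirror the Ibis-style API shape discussed above; this is not Daft's or Ibis' actual implementation.

```python
def pivot(rows, names_from, values_from, id_col):
    """Widen: one output row per id, one column per distinct value of names_from."""
    out = {}
    for r in rows:
        out.setdefault(r[id_col], {id_col: r[id_col]})[r[names_from]] = r[values_from]
    return list(out.values())


def unpivot(rows, id_col, value_cols, names_to="name", values_to="value"):
    """Lengthen: one output row per (id, value column) pair."""
    return [
        {id_col: r[id_col], names_to: c, values_to: r[c]}
        for r in rows
        for c in value_cols
    ]


long = [
    {"city": "NYC", "metric": "temp", "value": 20},
    {"city": "NYC", "metric": "rain", "value": 5},
    {"city": "SF", "metric": "temp", "value": 15},
    {"city": "SF", "metric": "rain", "value": 1},
]

wide = pivot(long, names_from="metric", values_from="value", id_col="city")
# wide == [{"city": "NYC", "temp": 20, "rain": 5},
#          {"city": "SF", "temp": 15, "rain": 1}]

back = unpivot(wide, id_col="city", value_cols=["temp", "rain"],
               names_to="metric", values_to="value")
```

The two operations are inverses here: unpivoting the wide table reproduces the original long one.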
j
Exciting :) I’ll check with the team tomorrow to see what they think
i
Awesome. Thanks a lot for that! Parity with Ibis with regards to DataFrame transforms would be a huge milestone for the project.
👍 1
j
Do you happen to have an existing checklist of Ibis functionality that you compared Daft against?
i
Yes, but it's probably incomplete, because we're very new to Daft, and our master list is missing a few things on the Ibis side. We started from Ibis' Operation Support Matrix of 330 nodes, added 44 that were missing from the official documentation, then did mappings to Daft and Ray. Here is what the matrix looks like. When we contribute our Ibis bindings to the Daft project, we'll provide a CSV or a JSON for all this metadata.
👍 1
j
Perfect thanks! Looking forward to discussing more of this soon
i
Likewise. As mentioned in my post to the #C0537HWJT6C channel, I don't think that getting parity with Ibis with respect to scalar functions should be a priority. There are a great many of them, and you can't realistically support them all in native Rust anytime soon, especially the geospatial functions (there is a Rust project trying to do that, but it will take one or two years for them to get anywhere close to what DuckDB does today). These scalar functions do not require any distribution, and you could get them directly through a simple integration with DuckDB. Even if that requires a memory copy, the cost of that copy is probably marginal compared to the cost of executing the functions themselves, which usually do fairly complex things on geospatial structures. And supporting their 50+ file formats is a massive undertaking that you really don't want to bother with. The one area that probably requires careful consideration is time series analysis with `window`. There, I would focus on scenarios that allow trivial distribution, where data is partitioned on dimensions included in the `window`'s groupby dimensions. That way, you don't have to worry about window boundaries spilling across nodes. This won't cover 100% of the use cases you might find out there, but it's enough for the vast majority of them, and it makes the overall implementation an order of magnitude simpler. I would also ignore the `CumulateWindowingTVF`, `HopWindowingTVF`, and `TumbleWindowingTVF` windowing functions, which are very specific to streaming databases like Apache Flink. They're super cool, but they raise many more questions than you probably want to tackle right now.
🔥 1
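A minimal sketch of the partitioned-window strategy described above, in plain Python (all names hypothetical): rows are hash-partitioned on the window's group-by key, and a per-key running sum is then computed entirely within each partition, so no window frame ever crosses a partition (node) boundary.

```python
from collections import defaultdict


def hash_partition(rows, key, n_partitions):
    """Assign each row to a partition by hashing its group-by key, so all
    rows sharing a key land in exactly one partition."""
    parts = [[] for _ in range(n_partitions)]
    for r in rows:
        parts[hash(r[key]) % n_partitions].append(r)
    return parts


def running_sum_over(rows, key, order_by, value, out_col="running"):
    """Per-key cumulative sum, ordered by order_by -- a classic window function.
    Correct on any partition that holds every row for its keys."""
    totals = defaultdict(float)
    result = []
    for r in sorted(rows, key=lambda r: (r[key], r[order_by])):
        totals[r[key]] += r[value]
        result.append({**r, out_col: totals[r[key]]})
    return result


rows = [
    {"sym": "A", "t": 1, "px": 10.0},
    {"sym": "A", "t": 2, "px": 12.0},
    {"sym": "B", "t": 1, "px": 7.0},
    {"sym": "B", "t": 2, "px": 8.0},
]

# Each "node" processes its partitions independently; the union of the
# per-partition results is correct because every key lives in one partition.
out = [
    r
    for part in hash_partition(rows, "sym", 2)
    for r in running_sum_over(part, "sym", "t", "px")
]
```

This is exactly the "trivial distribution" case: no shuffle or boundary exchange is needed between partitions, which is what makes the implementation so much simpler than a general distributed window.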