Slackbot
03/10/2024, 8:23 PM
jay
03/10/2024, 8:36 PM
Ismael Ghalimi
03/10/2024, 8:47 PM
`pivot` and `unpivot` are much simpler than `window`. I would probably start with the former. Also, `window` probably requires better temporal functions before it can be implemented properly. Regarding `pivot` and `unpivot`, I would recommend the Ibis API, which is much richer than Dask's, but does not add much structural complexity. I would also use the "pivot" and "unpivot" terminology, which is quite conventional in the SQL world. "pivot_wider" and "pivot_longer" are good alternatives respectively (and they're certainly more explicit), but I would avoid Dask's "pivot_table" and "melt", which are quite confusing in my opinion.
jay
03/10/2024, 8:48 PM
Ismael Ghalimi
03/10/2024, 8:48 PM
jay
03/10/2024, 8:49 PM
Ismael Ghalimi
03/10/2024, 8:54 PM
jay
03/10/2024, 8:56 PM
Ismael Ghalimi
03/10/2024, 9:05 PM
`window`. There, I would focus on scenarios where you can do trivial distribution, whereby data is partitioned against dimensions that are included in the `window`'s groupby dimensions. That way, you don't have to worry about window boundaries spilling across nodes. This won't cover 100% of the use cases that you might find out there, but it's enough for the vast majority of them, and it makes the overall implementation an order of magnitude simpler.
I would also ignore `CumulateWindowingTVF`, `HopWindowingTVF`, and `TumbleWindowingTVF` windowing functions, which are very specific to streaming databases like Apache Flink. They're super cool, but they raise many more questions that you probably don't want to tackle right now.