Slackbot
03/10/2024, 8:23 PM
jay
03/10/2024, 8:36 PM
Ismael Ghalimi
03/10/2024, 8:47 PM
pivot and unpivot are much simpler than window. I would probably start with the former. Also, window probably requires better temporal functions before it can be implemented properly. Regarding pivot and unpivot, I would recommend the Ibis API, which is much richer than Dask's, but does not add much structural complexity. I would also use the "pivot" and "unpivot" terminology, which is quite conventional in the SQL world. "pivot_wider" and "pivot_longer" are good alternatives respectively (and they're certainly more explicit), but I would avoid Dask's "pivot_table" and "melt", which are quite confusing in my opinion.
jay
03/10/2024, 8:48 PM
Ismael Ghalimi
03/10/2024, 8:48 PM
jay
03/10/2024, 8:49 PM
Ismael Ghalimi
03/10/2024, 8:54 PM
jay
03/10/2024, 8:56 PM
Ismael Ghalimi
03/10/2024, 9:05 PM
window. There, I would focus on scenarios where distribution is trivial, whereby data is partitioned on dimensions that are included in the `window`'s groupby dimensions. That way, you don't have to worry about window boundaries spilling across nodes. This won't cover 100% of the use cases that you might find out there, but it's enough for the vast majority of them, and it makes the overall implementation an order of magnitude simpler.
I would also ignore the CumulateWindowingTVF, HopWindowingTVF, and TumbleWindowingTVF windowing functions, which are very specific to streaming systems like Apache Flink. They're super cool, but they raise many more questions that you probably don't want to tackle right now.
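To make the "trivial distribution" case concrete, here is a minimal pure-Python sketch, not tied to any particular engine. It assumes rows are hash-partitioned on the same dimension the window groups by, so every group lives entirely on one node and each node can compute its window locally. The names `partition_by` and `running_sum`, and the sample columns, are hypothetical and only for illustration.

```python
from collections import defaultdict


def partition_by(rows, key, n_nodes):
    """Hash-partition rows on `key`, so each group lands on exactly one node."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[hash(row[key]) % n_nodes].append(row)
    return parts


def running_sum(rows, key, order, value):
    """Windowed cumulative sum of `value` per `key` group, ordered by `order`."""
    totals = defaultdict(int)
    out = []
    for row in sorted(rows, key=lambda r: (r[key], r[order])):
        totals[row[key]] += row[value]
        out.append({**row, "running": totals[row[key]]})
    return out


# Hypothetical sample data.
rows = [
    {"region": "east", "day": 2, "sales": 5},
    {"region": "west", "day": 1, "sales": 7},
    {"region": "east", "day": 1, "sales": 10},
    {"region": "north", "day": 1, "sales": 3},
    {"region": "west", "day": 2, "sales": 1},
]

# Single-node reference result.
single = running_sum(rows, "region", "day", "sales")

# "Distributed" result: each node computes its window independently,
# and the results are simply concatenated, with no cross-node exchange.
distributed = [
    r
    for part in partition_by(rows, "region", n_nodes=3)
    for r in running_sum(part, "region", "day", "sales")
]
distributed.sort(key=lambda r: (r["region"], r["day"]))
```

Because the partition key matches the window's groupby dimension, `distributed` matches `single` without any node needing rows from another. If the data were partitioned on a different column, a group could be split across nodes and the per-node running sums would be wrong, which is the boundary-spilling problem mentioned above.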