David Blum
08/01/2024, 12:39 AM
jay
08/01/2024, 5:28 PM
> No harm, but I don’t believe my use case benefits significantly from the shuffling and regrouping.
Is this because your dataframe is already partitioned by the timestamp?
> Do Daft UDFs support sliding windows?
This is tricky because Daft is distributed and splits data into partitions. Per-partition sliding windows would be easy to implement (in fact, today when you run a UDF you get access to the entire partition of data, so you can access it however you want), but doing this globally is going to be quite tricky. We would probably need to “pad” each partition with data from the partitions that come before and after it (with WINDOW_SIZE/2 rows from each) and then run the operations, I think.
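To make the padding idea concrete, here is a minimal pure-Python sketch, not Daft code. It assumes partitions are plain lists of floats in timestamp order and that the window is a centered rolling mean; `centered_rolling_mean` and `padded_partition_rolling_mean` are hypothetical helper names, and nothing here reflects how Daft would actually implement it.

```python
from typing import List

WINDOW_SIZE = 5  # hypothetical window width (odd, so the window is centered)


def centered_rolling_mean(values: List[float], window: int) -> List[float]:
    """Centered rolling mean; the window is truncated at the ends of `values`."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        out.append(sum(chunk) / len(chunk))
    return out


def padded_partition_rolling_mean(
    partitions: List[List[float]], window: int
) -> List[List[float]]:
    """Pad each partition with window // 2 neighbouring rows, smooth, then trim."""
    half = window // 2
    results = []
    for k, part in enumerate(partitions):
        # Rows borrowed from the data before and after this partition.
        # Flattening all neighbours is a simplification for the sketch; a real
        # system would fetch only the rows it needs.
        before = [v for p in partitions[:k] for v in p][-half:] if half else []
        after = [v for p in partitions[k + 1:] for v in p][:half]
        padded = before + part + after
        smoothed = centered_rolling_mean(padded, window)
        # Drop the padded rows so each partition keeps its original length.
        results.append(smoothed[len(before):len(before) + len(part)])
    return results


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
per_partition = padded_partition_rolling_mean(partitions, WINDOW_SIZE)
```

With the padding in place, concatenating `per_partition` gives the same values as running `centered_rolling_mean` over the whole unpartitioned series, which is the "global" behaviour described above.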
> So … perhaps your team would be interested in adding some optimized timeseries ffill()/bfill()/mean_fill()/smoothing() methods?
Could you elaborate or provide examples of what that looks like? I think per-partition operations would be quite simple to add, but anything that spans partitions will also be tricky.
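As a rough illustration of why the cross-partition case is harder, here is a pure-Python sketch of a forward fill. It assumes partitions are plain lists in timestamp order with `None` marking missing values; `ffill_partition` and `ffill_across_partitions` are hypothetical names, not existing Daft methods.

```python
from typing import List, Optional, Tuple

Value = Optional[float]


def ffill_partition(values: List[Value], carry: Value = None) -> Tuple[List[Value], Value]:
    """Forward-fill one partition, starting from `carry` (last value seen earlier)."""
    out = []
    last = carry
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out, last


def ffill_across_partitions(partitions: List[List[Value]]) -> List[List[Value]]:
    """Forward-fill every partition, carrying the last valid value across boundaries."""
    results = []
    carry = None
    for part in partitions:
        filled, carry = ffill_partition(part, carry)
        results.append(filled)
    return results


partitions = [[1.0, None], [None, 2.0, None], [None]]
local_only = [ffill_partition(p)[0] for p in partitions]  # leading gaps stay None
global_fill = ffill_across_partitions(partitions)         # gaps filled across boundaries
```

The per-partition version needs no communication between partitions, but it leaves leading gaps unfilled. The global version here walks the partitions sequentially; a distributed version would presumably first collect each partition's last valid value and then fill in a second pass.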