Community for the Daft project and all things distributed data

Distributed Data Community

Hey, I find it a little confusing that `single_partition_pipeline` vs `fanout_pipeline` and `reduce_pipeline` vs `reduce_and_fanout` are implemented the same way. Why do we have different function names? Waiting for your reply.

Hi! <@U07D2NZ2493> There's a couple of factors!

`reduce_pipeline` and `reduce_and_fanout` require `spread` strategies and a `list[inputs]` (one for each input partition). This hints to the scheduler that it should spread the function invocations across the cluster, since they are reduces. The default behavior is to schedule the function as close as possible to the data, but for a reduce we do not want that.

`single_partition_pipeline` and `fanout_pipeline` typically take only the number of args that the op requires (independent of the number of partitions), and these are scheduled where the data lives.
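
To make the two placement behaviors concrete, here's a toy sketch in plain Python. It is not Daft's or Ray's scheduler; the node names and the `place_by_locality` / `place_by_spread` helpers are hypothetical and only model the idea of "run next to the data" vs "spread across the cluster".

```python
from itertools import cycle

# Hypothetical three-node cluster, for illustration only.
NODES = ["node-a", "node-b", "node-c"]


def place_by_locality(input_locations):
    """Default-style placement: run each task on the node that holds its input."""
    return list(input_locations)


def place_by_spread(num_tasks, nodes=NODES):
    """Spread-style placement: round-robin across nodes, ignoring data location."""
    ring = cycle(nodes)
    return [next(ring) for _ in range(num_tasks)]


# Three single-partition style tasks whose inputs all live on node-a.
map_placement = place_by_locality(["node-a", "node-a", "node-a"])

# Three reduce style tasks. Each takes one input per partition, so in this toy
# model there is no single "closest" node and the work is spread out instead.
reduce_placement = place_by_spread(3)

print(map_placement)
print(reduce_placement)
```

With locality-only placement all three tasks pile onto the node holding the data, while the spread placement uses every node once.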

The reason some of them share an impl but keep different names is to aid profiling. Ray's profiler captures functions at the name level, so separate names show up as separate entries!
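
Here's a small sketch of the naming point, using Python's built-in `cProfile` as a stand-in (this is not Ray's profiler, and `_pipeline_impl` is a hypothetical shared helper). Two wrappers with identical bodies still get their own rows in the profile because stats are keyed by function name.

```python
import cProfile
import pstats


def _pipeline_impl(values):
    # Hypothetical shared implementation used by both wrappers.
    return sum(v * v for v in values)


def single_partition_pipeline(values):
    return _pipeline_impl(values)


def fanout_pipeline(values):
    # Same body as single_partition_pipeline, different name.
    return _pipeline_impl(values)


profiler = cProfile.Profile()
profiler.enable()
single_partition_pipeline(range(1000))
fanout_pipeline(range(1000))
fanout_pipeline(range(1000))
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (filename, lineno, func_name) -> (primitive calls, total calls, ...)
calls_by_name = {key[2]: entry[1] for key, entry in stats.stats.items()}

print(calls_by_name["single_partition_pipeline"])
print(calls_by_name["fanout_pipeline"])
```

If both wrappers were collapsed into one function, the profile could only report a single combined entry and you would lose the per-op breakdown.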