Sometimes the simplest solutions yield outsized benefits:
@Clark Zinzow just merged a 60-line PR (#1950) to improve the scheduling locality of ScanTasks on the Ray runner. The fix was simply to spread the work of reading files as evenly as possible across the cluster.
The result? Higher resource utilization, more stable Ray clusters, and much better I/O throughput and overall performance!
(Coming soon to you in the next Daft release 😛)
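
For anyone curious what "spreading out the work" means in practice, here is a toy sketch in plain Python. It is not the code from the PR and does not use Daft or Ray at all; the `pack` and `spread` helpers, the task names, and the node names are all made up for illustration. It just contrasts filling one node first with handing out scan tasks round-robin.

```python
from collections import Counter
from itertools import cycle


def pack(tasks, nodes, slots_per_node):
    """Hypothetical 'packing' placement: fill one node before using the next."""
    placement = {}
    for i, task in enumerate(tasks):
        # Integer division picks the node; clamp so overflow lands on the last node.
        node = nodes[min(i // slots_per_node, len(nodes) - 1)]
        placement[task] = node
    return placement


def spread(tasks, nodes):
    """Hypothetical 'spreading' placement: hand out tasks round-robin."""
    node_cycle = cycle(nodes)
    return {task: next(node_cycle) for task in tasks}


# Illustrative inputs only: 8 file-reading tasks on a 4-node cluster.
tasks = [f"scan-{i}" for i in range(8)]
nodes = ["node-a", "node-b", "node-c", "node-d"]

packed = Counter(pack(tasks, nodes, slots_per_node=8).values())
spread_out = Counter(spread(tasks, nodes).values())

print("packed:", dict(packed))
print("spread:", dict(spread_out))
```

With packing, a single node ends up doing all the reads while the others sit idle; with spreading, every node's network and disk bandwidth gets used. Ray itself offers a `scheduling_strategy="SPREAD"` option for tasks, though whether the PR relies on it is not something this post states.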