Community for the Daft project and all things distributed data

Distributed Data Community

Hey folks, a quick question about `limit()` vs `show()` just to double-check my understanding :slightly_smiling_face:
As a Dask user I'm used to calling `df.head(n)` to see the first n rows and getting back an eager pandas DataFrame that I can keep working with, usually for interactive testing/demos in a notebook. In Daft, `df.show()` is eager but returns `None`, whereas `df.limit()` also gives n rows but is lazy. Is the thinking here to discourage using `df.show()` as part of pipelines/transformations, and instead recommend `limit` with an explicit `collect` when we only want to materialize `n` rows of a dataframe and manipulate that sample further?

There is actually a subtle behavior difference between `show` and `limit`:

`.show()` is optimized for lower latency. It tries to avoid sending out too much work at once, iteratively materializing partitions one at a time until we have enough data (internally, we call this an “eager” limit). That's because it is meant for visualization when working iteratively in a notebook.

`.limit()` is optimized for higher throughput, and will send out as much work as possible to saturate the cluster when executed. This is a much better fit for something like `.limit(100000)`.
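To make the difference concrete, here's a toy model of the two strategies in plain Python. This is only a sketch and not Daft's actual implementation: `make_partitions`, `eager_limit`, and `throughput_limit` are hypothetical names made up for illustration, a "partition" is just a function that returns rows when called, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical stand-in for a partition: calling it "materializes" its rows.
Partition = Callable[[], List[int]]


def make_partitions(num_partitions: int, rows_per_partition: int,
                    log: List[int]) -> List[Partition]:
    """Build deferred partitions that record their index in `log` when loaded."""
    def make(i: int) -> Partition:
        def load() -> List[int]:
            log.append(i)  # track which partitions actually ran
            start = i * rows_per_partition
            return list(range(start, start + rows_per_partition))
        return load
    return [make(i) for i in range(num_partitions)]


def eager_limit(partitions: List[Partition], n: int) -> List[int]:
    """Latency-oriented: load one partition at a time, stop once n rows exist."""
    rows: List[int] = []
    for load in partitions:
        if len(rows) >= n:
            break
        rows.extend(load())
    return rows[:n]


def throughput_limit(partitions: List[Partition], n: int) -> List[int]:
    """Throughput-oriented: dispatch every partition at once, then trim to n."""
    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(lambda load: load(), partitions))
    rows = [row for chunk in chunks for row in chunk]
    return rows[:n]


if __name__ == "__main__":
    show_log: List[int] = []
    eager_limit(make_partitions(8, 10, show_log), 5)
    print("partitions loaded, eager:", len(show_log))

    limit_log: List[int] = []
    throughput_limit(make_partitions(8, 10, limit_log), 5)
    print("partitions loaded, throughput:", len(limit_log))
```

In this model, asking for a handful of rows touches only the first partition with the eager strategy, while the throughput strategy runs every partition up front. That extra work is wasted for a small preview, but for a large `n` it avoids paying for one sequential round trip per partition.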

Nice, that’s helpful to know! Thanks <@U042126MG49> :grin: