# daft-dev
a
Hey folks, a quick question about `limit()` vs `show()`, just to double-check my understanding 🙂 As a Dask user I'm used to calling `df.head(n)` to see the first n rows and having that return a pandas df (eager) that I can further work with, usually for interactive testing/demo in a notebook. In Daft, `df.show()` is eager but returns a `NoneType`, whereas `df.limit()` also returns n rows but is lazy. Is the thinking here to discourage using `df.show()` as part of pipelines/transformations and instead recommend using `limit` with an explicit `collect` in cases where we want to only materialize `n` rows of a dataframe and manipulate that sample further?
j
There is actually a subtle behavior difference between `show` and `limit`. `.show()` is optimized for lower latency: it tries to avoid sending out too much work at once, iteratively materializing partitions one at a time until we have enough data (internally, we call this an “eager” limit). This is because it is meant for visualization when working iteratively in a notebook. `.limit()` is optimized for higher throughput, and will send out as much work as possible to saturate the cluster when executed. That is a much better fit for something like `.limit(100000)`.
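To make the difference concrete, here is a rough toy model in plain Python. It is not Daft's implementation and uses no Daft APIs: the names `make_partitions`, `eager_limit` and `throughput_limit` are hypothetical, and "materializing everything at once" is a deliberate exaggeration of "send out as much work as possible". It only shows how many partitions each strategy touches to produce the same rows.

```python
from typing import Callable, List

# A lazy partition: calling it "materializes" (computes) its rows.
Partition = Callable[[], List[int]]


def make_partitions(num_partitions: int, rows_per_partition: int,
                    log: List[int]) -> List[Partition]:
    """Build lazy partitions that record their index in `log` when materialized."""
    def make(i: int) -> Partition:
        def materialize() -> List[int]:
            log.append(i)
            start = i * rows_per_partition
            return list(range(start, start + rows_per_partition))
        return materialize
    return [make(i) for i in range(num_partitions)]


def eager_limit(partitions: List[Partition], n: int) -> List[int]:
    """show()-like strategy: materialize one partition at a time, stop at n rows."""
    rows: List[int] = []
    for materialize in partitions:
        if len(rows) >= n:
            break
        rows.extend(materialize())
    return rows[:n]


def throughput_limit(partitions: List[Partition], n: int) -> List[int]:
    """limit()-like strategy (exaggerated): run all partitions, then truncate."""
    materialized = [materialize() for materialize in partitions]
    rows = [row for part in materialized for row in part]
    return rows[:n]


show_log: List[int] = []
limit_log: List[int] = []
show_rows = eager_limit(make_partitions(4, 10, show_log), 5)
limit_rows = throughput_limit(make_partitions(4, 10, limit_log), 5)

print(show_rows == limit_rows)  # same rows either way
print(show_log)                 # partitions touched by the eager strategy
print(limit_log)                # partitions touched by the throughput strategy
```

For a small `n` the eager strategy returns after touching a single partition, which is what you want in a notebook. For a large `n` the one-at-a-time loop becomes the bottleneck, and dispatching work across all partitions up front wins.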
a
Nice, that’s helpful to know! Thanks @jay 😁