# daft-dev
v
@jay I checked the memory usage of the `to_pylist` function using both `iter_rows` (streaming) and materializing everything at once with `result.to_pylist()` for issue #2578, and here are my findings. The memory usage is fairly similar in both approaches: the streaming approach uses ~790 MB and the materialize-everything approach uses ~795 MB. This was tested on Parquet data with 100 partitions. I also tried it with 1000 partitions; even with the increase in the number of partitions, memory usage did not grow much in either approach.
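(For context, a minimal sketch of the two approaches being compared; the actual benchmark script isn't shown in this thread, so the data path and variable names below are illustrative.)

```python
import daft

# Illustrative setup: read the same partitioned Parquet data used in the benchmark.
df = daft.read_parquet("data/*.parquet")

# Approach 1 (streaming): pull rows one at a time via iter_rows().
streamed_rows = [row for row in df.iter_rows()]

# Approach 2 (materialize everything at once): the to_pylist() API discussed in #2578.
materialized_rows = df.to_pylist()
```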
j
I would expect peak memory usage to be the same, but I’m guessing during the materialization process it should be more stable!
v
Could you please let me know which approach we should use?
I think the streaming approach would be better, since we can materialize the intermediate results as they are produced.
j
You should be able to get memray to produce a graph of memory usage over time! I also commented on your PR again: we're unnecessarily calling `.collect()` now that you're correctly using `.iter_rows()`, which results in an unnecessary intermediate materialization.
So that could also be why you're seeing them both use the same amount of memory.
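(A minimal sketch of how the memory-over-time graph could be captured with memray's Python `Tracker` API, assuming a standalone benchmark script; file names here are illustrative, and a report can be rendered from the capture file afterwards, e.g. with `memray flamegraph`.)

```python
import daft
from memray import Tracker

df = daft.read_parquet("data/*.parquet")

# Record allocations while streaming rows; afterwards render a report from the
# capture file, e.g. `memray flamegraph iter_rows.bin`.
with Tracker("iter_rows.bin"):
    # No prior .collect() call here, so there is no extra intermediate
    # materialization before iter_rows() starts streaming.
    rows = list(df.iter_rows())
```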
v
After removing `.collect()`, `iter_rows()` is using ~650 MB.
👍 1
I have removed it and pushed the changes. Let me know if any other changes are needed.
j
I think it should be good to go
thanks!
Do you mind also posting a screenshot of the memray results to the PR?
v
Sure
Added the memray memory usage for both approaches to the PR.