# daft-dev
v
@jay I checked the memory usage of the `to_pylist` function using both `iter_rows` (streaming) and materializing everything at once with `result.to_pylist()` for issue #2578, and here are my findings. The memory usage is fairly similar in both approaches: the streaming approach uses ~790 MB and the materialize-everything approach uses ~795 MB. This was tested on Parquet data with 100 partitions. I also tried it with 1000 partitions; even with the increase in the number of partitions, memory usage did not grow much in either approach.
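(For context, a minimal sketch of the two approaches being compared; the actual benchmark script isn't shown in this thread, so the data path and variable names below are illustrative.)

```python
import daft

# Illustrative setup: read the same partitioned Parquet data used in the benchmark.
df = daft.read_parquet("data/*.parquet")

# Approach 1 (streaming): pull rows one at a time via iter_rows().
streamed_rows = [row for row in df.iter_rows()]

# Approach 2 (materialize everything at once): the to_pylist() API discussed in #2578.
materialized_rows = df.to_pylist()
```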
j
I would expect peak memory usage to be the same, but I’m guessing during the materialization process it should be more stable!
v
Could you please let me know which approach we should use?
I think the streaming approach would be better, since we can materialize the intermediate results as they are produced.
j
You should be able to get memray to produce a graph of memory usage over time! I also commented on your PR again: we're unnecessarily calling `.collect()` now that you're correctly using `.iter_rows()`, which results in an unnecessary intermediate materialization.
So that could also be why you're seeing them both use the same amount of memory.
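(A minimal sketch of how the memory-over-time graph could be captured with memray's Python `Tracker` API, assuming a standalone benchmark script; file names here are illustrative, and a report can be rendered from the capture file afterwards, e.g. with `memray flamegraph`.)

```python
import daft
from memray import Tracker

df = daft.read_parquet("data/*.parquet")

# Record allocations while streaming rows; afterwards render a report from the
# capture file, e.g. `memray flamegraph iter_rows.bin`.
with Tracker("iter_rows.bin"):
    # No prior .collect() call here, so there is no extra intermediate
    # materialization before iter_rows() starts streaming.
    rows = list(df.iter_rows())
```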
v
After removing `.collect()`, `iter_rows()` is using ~650 MB.
👍 1
I have removed it and pushed the changes. Let me know if any other changes are needed.
j
I think it should be good to go
thanks!
Do you mind also posting a screenshot of the memray results to the PR?
v
Sure
Added the memray memory usage for both approaches to the PR.