vignesh ravi
09/13/2024, 3:11 PM
to_pylist function using both iter_rows and materializing everything at once using result.to_pylist() for the issue (#2578), and here are my findings. In both approaches, the memory usage is fairly similar.
With the streaming approach, the code uses ~790 MB, and with the other approach, it uses ~795 MB. This was tested on Parquet data with 100 partitions. I also tried this with 1000 partitions. Even with the increase in the number of partitions, the memory usage did not increase much in either approach.
jay
09/13/2024, 4:33 PM
vignesh ravi
09/13/2024, 4:37 PM
vignesh ravi
09/13/2024, 4:41 PM
jay
09/13/2024, 4:42 PM
.collect() now that you’re correctly using .iter_rows(), which would result in an unnecessary intermediate materialization
jay
09/13/2024, 4:42 PM
vignesh ravi
09/13/2024, 4:52 PM
.collect, iter_rows() is using ~650 MB.
vignesh ravi
09/13/2024, 4:54 PM
jay
09/13/2024, 4:54 PM
jay
09/13/2024, 4:54 PM
jay
09/13/2024, 4:54 PM
vignesh ravi
09/13/2024, 4:55 PM
vignesh ravi
09/13/2024, 5:00 PM