# general
k
If we want to apply different distributed operations on a column of strings with hugely varying lengths, and we have a column giving the length of each string, would it be better to partition so that each partition holds roughly the same total string length, or to use the usual partitioning by row count?
j
It shouldn’t matter too much unless there is a ton of data skew (e.g. you’re partitioning by date and for some reason more recent data has a ton of text). In that case you could be running into OOM issues where some partitions are much bigger than others!
k
I am guessing that may be the case, because I do expect some rows to have huge amounts of text whereas most do not. The row with the most text has about 90M characters, which I think shouldn't cause an OOM by itself, but it could be extremely problematic if each partition is expected to have 100K or more rows.
j
Hmm yes that might indeed be problematic 😝 Try it and let us know how it goes
k
Okay! Any recommendation on how to implement a cumsum in Daft?
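Roughly what I have in mind is below: a sketch where I pull just the (small) id + length columns into pandas and do the running sum there, since I don't know of a built-in cumulative sum in Daft. Column names like `doc_id` / `text_len`, the path, and the partition count are all placeholders for my actual setup:

```python
import daft
import pandas as pd

NUM_PARTITIONS = 64  # placeholder; would tune this to the cluster

df = daft.read_parquet("data/*.parquet")  # placeholder path

# Pull only the small id + length columns to the driver and do the cumsum in pandas.
lengths = df.select("doc_id", "text_len").to_pandas()
lengths["cum_chars"] = lengths["text_len"].cumsum()

# Assign each row to a bucket so that every bucket holds roughly the same total characters.
chars_per_bucket = lengths["cum_chars"].iloc[-1] / NUM_PARTITIONS
lengths["bucket"] = (
    (lengths["cum_chars"] // chars_per_bucket).clip(upper=NUM_PARTITIONS - 1).astype("int64")
)

# Join the bucket assignments back and hash-partition on the bucket column.
buckets = daft.from_pandas(lengths[["doc_id", "bucket"]])
df = df.join(buckets, on="doc_id").repartition(NUM_PARTITIONS, "bucket")
```

The idea is that each bucket ends up with roughly total_chars / NUM_PARTITIONS characters, no matter how many rows it contains.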
Oh, I think the first error I faced is due to timeouts, which occur whenever I try to re-run the snippets without restarting my kernel (so I'm assuming it's something to do with the planning stage?). I stop getting that first type of error if I consistently restart my kernel and rerun the commands from the start.
That didn't work either; I think it already fails when reading the data in, so the repartitioning never gets a chance to run. I did see that one parquet file is 2.X GB, which is significantly bigger than the rest. But wouldn't an oversized parquet be split into more appropriately sized chunks automatically on read?
j
Hmm, is this all Hugging Face reads? We can try to reproduce it on our end if so
We do this best-effort based on parquet rowgroups, but it isn’t always possible depending on how the data is laid out
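If you want to sanity-check how that big file is laid out, a quick way (plain pyarrow, independent of Daft; the path is a placeholder) is to print its rowgroup sizes:

```python
import pyarrow.parquet as pq

# Placeholder path: point this at the ~2GB file once you have it locally
meta = pq.ParquetFile("data/big_file.parquet").metadata

print(f"{meta.num_row_groups} rowgroups, {meta.num_rows} rows total")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"  rowgroup {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e6:.1f} MB uncompressed")
```

If it turns out to be one giant rowgroup, that would explain why the split can't happen on read.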
We should set up some time to chat more about your use case 🫡 🫡 Seems like you’ve been pushing the I/O limits quite a bit
k
Yes it is
I tried downloading everything locally first and then reading the parquets from the local paths, but it also OOM'd
but interestingly, reading them from local with pandas succeeded haha
it's about 21GB in total
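For reference, this is roughly the shape of what I ran on the local copies (paths are placeholders and I've stripped out the actual processing):

```python
import glob

import daft
import pandas as pd

paths = sorted(glob.glob("/local/dataset/*.parquet"))  # placeholder local paths

# Daft over all files at once -> OOMs on my machine
df = daft.read_parquet(paths)
df.collect()

# pandas, one file at a time, gets through the same ~21GB
pdf = pd.concat(pd.read_parquet(p) for p in paths)
```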
j
Great, will take a look tomorrow and share findings. We’re probably doing a much more parallel and aggressive read than pandas, I’m guessing.
k
Great thanks!!