# general
k
If we want to apply different distributed operations on a column of strings with hugely varying lengths, and we have a column giving the length of each string, would it be better to partition so that each partition holds roughly the same total string length, or to use the usual partitioning by row count?
j
It shouldn’t matter too much unless there is a ton of data skew (e.g. you’re partitioning by date and for some reason more recent data has a ton of text). In that case you could be running into OOM issues where some partitions are much bigger than others!
k
I am guessing that may be the case, because I do expect some rows to have huge amounts of text whereas most do not. The row with the most text has about 90M characters, which I think shouldn't cause an OOM by itself, but it could be extremely problematic if each partition is expected to have 100K or more rows.
j
Hmm yes that might indeed be problematic 😝 Try it and let us know how it goes
k
Okay! Any recommendation on how to implement a cumsum in Daft?
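Roughly what I have in mind is below: a sketch where I pull just the (small) id + length columns into pandas and do the running sum there, since I don't know of a built-in cumulative sum in Daft. Column names like `doc_id` / `text_len`, the path, and the partition count are all placeholders for my actual setup:

```python
import daft
import pandas as pd

NUM_PARTITIONS = 64  # placeholder; would tune this to the cluster

df = daft.read_parquet("data/*.parquet")  # placeholder path

# Pull only the small id + length columns to the driver and do the cumsum in pandas.
lengths = df.select("doc_id", "text_len").to_pandas()
lengths["cum_chars"] = lengths["text_len"].cumsum()

# Assign each row to a bucket so that every bucket holds roughly the same total characters.
chars_per_bucket = lengths["cum_chars"].iloc[-1] / NUM_PARTITIONS
lengths["bucket"] = (
    (lengths["cum_chars"] // chars_per_bucket).clip(upper=NUM_PARTITIONS - 1).astype("int64")
)

# Join the bucket assignments back and hash-partition on the bucket column.
buckets = daft.from_pandas(lengths[["doc_id", "bucket"]])
df = df.join(buckets, on="doc_id").repartition(NUM_PARTITIONS, "bucket")
```

The idea is that each bucket ends up with roughly total_chars / NUM_PARTITIONS characters, no matter how many rows it contains.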
Oh, I think the first error I faced is due to timeouts, which occur whenever I try to re-run the snippets without restarting my kernel (so I'm assuming it's something to do with the planning stage?). I stop getting that first type of error if I consistently restart my kernel and rerun the commands from the start.
That didn't work either; I think it already fails when reading the data in, so the repartitioning never gets a chance to run. I did see that one parquet file is 2.X GB, which is significantly bigger than the rest. But wouldn't an oversized parquet be split into more appropriately sized chunks automatically on read?
j
Hmm, is this all Hugging Face reads? We can try to reproduce it on our end if so
We do this best-effort based on parquet rowgroups, but it isn’t always possible depending on how the data is laid out
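If you want to sanity-check how that big file is laid out, a quick way (plain pyarrow, independent of Daft; the path is a placeholder) is to print its rowgroup sizes:

```python
import pyarrow.parquet as pq

# Placeholder path: point this at the ~2GB file once you have it locally
meta = pq.ParquetFile("data/big_file.parquet").metadata

print(f"{meta.num_row_groups} rowgroups, {meta.num_rows} rows total")
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"  rowgroup {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e6:.1f} MB uncompressed")
```

If it turns out to be one giant rowgroup, that would explain why the split can't happen on read.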
We should set up some time to chat more about your use case 🫡 🫡 Seems like you’ve been pushing the I/O limits quite a bit
k
Yes it is
I tried downloading everything locally first and then reading the parquets from the local paths, but it also OOM'd
but interestingly, reading them from local with pandas succeeded haha
it's about 21GB in total
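For reference, this is roughly the shape of what I ran on the local copies (paths are placeholders and I've stripped out the actual processing):

```python
import glob

import daft
import pandas as pd

paths = sorted(glob.glob("/local/dataset/*.parquet"))  # placeholder local paths

# Daft over all files at once -> OOMs on my machine
df = daft.read_parquet(paths)
df.collect()

# pandas, one file at a time, gets through the same ~21GB
pdf = pd.concat(pd.read_parquet(p) for p in paths)
```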
j
Great, will take a look tomorrow and share findings. We’re probably doing a much more parallel and aggressive read than pandas, I’m guessing.
k
Great thanks!!