# general
k
Anyone seen this error before? I suspect it's a problem with the encoding for a non-English piece of text but I'm not sure what it would be
j
Oof, that’s not pretty. It has to do with your string data being so large that it doesn’t fit into Arrow. This is surprising, though; we allocate a lot of space and should be resilient to really large data lengths. What might be happening is that our arrow2 logic is falling back to using int32 offsets (which can only represent up to 2GB of string data per chunk). Cc @Raunak Bhagat to take a look, and @Sammy Sidhu as well for thoughts on string memory representation
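(A minimal pyarrow sketch of the offset-width difference described above; this illustrates general Arrow behavior, not Daft's actual internals.)
```python
# Sketch only: Arrow's plain `string` type stores offsets as int32, so a single
# array/chunk can only address ~2GB of string bytes; `large_string` uses int64
# offsets and effectively lifts that limit.
import pyarrow as pa

data = ["some", "utf-8", "strings"]

# Plain string array: int32 offsets -> at most ~2GB of string data per chunk.
small = pa.array(data, type=pa.string())

# Large string array: int64 offsets -> per-chunk size no longer the bottleneck.
large = small.cast(pa.large_string())

print(small.type)  # string
print(large.type)  # large_string
```
Whether and when Daft falls back between these two is an assumption here; the point is just that int32 offsets cap a chunk at 2^31 bytes of string data.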
👍 1
k
It's also surprising to me because it OOM-ed on a big bundle of ~200GB, while the PySpark executor seems to be able to handle it with 96GB. I guess in this case the 2GB limit could be the problem? On the other hand, 2GB for a single chunk already sounds huge, and I did see that each parquet partition is under 2GB on the file system, although I'm not sure how big it becomes after the read.
Not sure if it's related, but there's an easily reproducible overflow I get with this command. The file is only 293MB, but I'm guessing the compression ratio is super high because of the amount of repetition within the file.
```python
import daft

daft.read_parquet("https://huggingface.co/datasets/liwu/MNBVC/resolve/refs%2Fconvert%2Fparquet/co_ann_report/partial-train/0000.parquet").show()
```
j
Oo, that’s super helpful. Yes, my guess is that it’s a compression-bomb kind of thing 😝
The panic you showed isn’t an OOM; it’s just that the data doesn’t fit into our Arrow representation (easily fixable). The larger problem is that we may need a better representation, like Umbra-style strings or Arrow StringView, to handle these highly repetitive columns… Hmm.
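(A rough way to sanity-check the compression-bomb theory, as a sketch only: compare compressed vs. uncompressed column-chunk sizes in the file's parquet metadata. This assumes the repro file has been downloaded locally; the local path is hypothetical, and "uncompressed" here is still the encoded on-disk size, so the in-memory strings can be even bigger.)
```python
# Sketch: estimate how much the ~293MB repro file expands when decompressed,
# by summing per-column-chunk sizes from the parquet footer metadata.
# "0000.parquet" is a placeholder for a locally downloaded copy of the file.
import pyarrow.parquet as pq

meta = pq.ParquetFile("0000.parquet").metadata

compressed = 0
uncompressed = 0
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        compressed += chunk.total_compressed_size
        uncompressed += chunk.total_uncompressed_size

print(f"compressed:   {compressed / 1e6:.1f} MB")
print(f"uncompressed: {uncompressed / 1e6:.1f} MB")
print(f"ratio:        {uncompressed / compressed:.1f}x")
```
A very high ratio here would back up the "highly repetitive column" guess and explain why a 293MB file blows past the 2GB-per-chunk offset limit.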
k
Oh that sounds optimistic haha. I've been wondering why this particular job keeps crashing my workers...
Would this error be related as well? It's for a different dataset though.
j
Yes, quite likely related.