# general
k
Are there any possible factors that would cause a Parquet write to not be distributed in Ray? I'm seeing 100% CPU usage and very high memory on the head node, but the workers aren't really doing anything during my final write.
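For reference, this is roughly the shape of the script, a minimal sketch assuming Daft on the Ray runner; the cluster address, input/output paths and filter column are hypothetical:

```python
import daft

# Assumption: connect Daft to an existing Ray cluster (address is hypothetical).
daft.context.set_runner_ray(address="ray://head-node:10001")

# Hypothetical input path and filter, for illustration only.
df = daft.read_parquet("/mnt/shared/input/*.parquet")
df = df.where(df["value"] > 0)

# This write is the step where the head node pegs CPU and memory
# while the workers sit mostly idle.
df.write_parquet("/mnt/shared/output/")
```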
I still have no concrete idea why this happened, but since it kept adding tasks without completing any of them, I eventually killed the job. Maybe it's because my tasks were OOM-ing in the workers too much 🤔
j
Could you give us some more idea about your workload?
• How many workers?
• If you can provide us the `plan`, that would be helpful too!
• How many partitions?
• How many parquet files?
• Where are you writing this data to? Is it AWS S3, or some other S3 implementation?
k
@jay 30 workers, and the original data size is 2 TB with about a billion rows. Haven't printed the plan yet, but there are filters, a dedup and an anti-join. All print statements get through, including the final df count after the filters, dedup and anti-join; it only crashes during the write, and nothing shows up in the destination folder. The destination is a local path mounted so that all nodes can access it (reads and writes work fine for the same script on smaller GB-sized test datasets). The object store memory also rises to crazy levels, around 200%; on occasions like these sometimes the job dies and sometimes it survives. Writing just the IDs of the final table consistently works, but writing the full table does not. Raising the object store memory % makes individual workers OOM more frequently. I didn't specify the number of partitions or the number of parquet files, so those should be defaults. According to the logs, the driver ended the process.
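One thing worth trying (a sketch, not a confirmed fix): explicitly set the partitioning before the write so the write tasks stay spread across workers instead of relying on the defaults; the partition count and path below are guesses:

```python
# Sketch only: control the number of partitions going into the write.
# With ~2 TB / ~1B rows, the default partitioning may yield partitions that are
# too large for the object store; 2000 here is a guess, not a recommendation.
df = df.into_partitions(2000)

# Writing just the IDs reportedly succeeds (e.g. df.select("id").write_parquet(...));
# the full-width write is what fails.
written = df.write_parquet("/mnt/shared/output/")  # hypothetical path
written.show()
```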
Each worker has 200 GB of memory and 20 vCores.
j
Hmm, that’s odd. Would love to take a look at the plan so we can give some suggestions. When you get the chance, can you run `df.explain(show_all=True)`?
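For reference, a minimal way to capture that, plus the partition count asked about above, assuming the same `df` (`num_partitions()` is my assumption about the API here):

```python
# Prints the full logical and physical plan (explain(show_all=True) is the
# call mentioned above).
df.explain(show_all=True)

# Also handy to include in the report: how many partitions the planner sees.
print(df.num_partitions())
```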