# general
k
I'm currently running a Daft job on Ray, and it seems the job largely finished after an hour. However, there are still a few stragglers running the final writing step, and after an extra hour the results folder is still the same size and has the same number of parquets. Any idea why that may be the case?
j
Hi! Can you share how you know there are some stragglers?
k
There are 200+ hashjoin-writefile tasks, and 8 of them have just been running for the past 2 hours with nothing new in the logs or in the results folder
The rest of the maps, reduces, aggregates, etc. all completed within the first hour of the 3 hours the job has been running
The writing task is supposed to just write out the IDs of the rows that are kept after the joins and filtering, so I wouldn't expect it to be a big table
Within the results folder most of the parquets are well formed, except for three that are still suffixed with a UUID and are either empty or half the size of the rest of the parquets
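A quick way to spot those unfinished outputs is to scan the results folder for anything that doesn't end in `.parquet` (a minimal sketch, assuming completed files end in `.parquet` while in-flight temp files carry a trailing UUID after that extension, as described above; the folder path is a placeholder):

```python
from pathlib import Path

def find_incomplete(results_dir: str) -> list[Path]:
    """Return files that look like unfinished writes.

    Assumption: finished parquets are named `*.parquet`, while half-written
    temp files have an extra UUID appended after the extension, so they no
    longer end in `.parquet`.
    """
    return [
        p
        for p in Path(results_dir).iterdir()
        if p.is_file() and not p.name.endswith(".parquet")
    ]
```

This only flags candidates by filename; whether a flagged file is truly truncated still has to be checked by size or by attempting to read it.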
j
Could you share your plan? You can get it by running df.explain(show_all=True). I'll try to reproduce it. My guess is that the writer (we currently use PyArrow) is being unreliable. What version of PyArrow are you using?
k
In between, the df gets materialized, so I didn't try to run explain on the final df. The manual fix I applied was to kill the straggler workers, and eventually the job finished
j
Hmm, interesting. When you say stragglers, do you mean Ray tasks that you're looking at on the dashboard?
k
Yes, I mean Ray tasks that are taking way longer than they should, with no errors, and don't complete