Is there any way to specify the amount of resources I should Distributed Data Community #general

Kyle

09/23/2024, 12:17 AM

Is there any way to specify the amount of resources I should reserve for the join like we have for UDFs?

jay

09/23/2024, 12:30 AM

Only UDFs support resource requests — how are you intending to provide requests to joins?

Kyle

09/23/2024, 12:33 AM

I was thinking of maybe the planning config, the execution config, or the join function itself. I thought it might help with some of the spilling happening in the joins now, because requesting more resources seems to have helped the UDFs handle my data, which has high variance in memory size across rows.

jay

09/23/2024, 1:21 AM

Yeah we don’t really have resource requests available for joins… It’s especially tricky because there are multiple kinds of join algorithms, each of which can be optimized differently and has very different memory characteristics. If you are struggling with data skew, you might be able to try a sort_merge join!

jay

09/23/2024, 1:21 AM

.join(strategy="sort_merge")
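For readers unfamiliar with the idea behind a sort-merge join, here is a minimal conceptual sketch in plain Python. It is not Daft's implementation; the function name `sort_merge_inner_join` and the `(key, value)` row layout are assumptions made for illustration. Both sides are sorted by key and then walked in step, so no hash table of one whole side has to be held in memory.

```python
def sort_merge_inner_join(left, right):
    """Inner-join two lists of (key, value) pairs by sorting and merging.

    Illustrative only: a real engine would sort and merge partitions,
    not in-memory Python lists.
    """
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


joined = sort_merge_inner_join(
    [(2, "b"), (1, "a"), (2, "c")],
    [(2, "x"), (3, "y"), (2, "z")],
)
```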

Kyle

09/23/2024, 1:29 AM

Okay!! Let me try that! Thanks!

Kyle

09/23/2024, 1:34 AM

Oh, it only works for inner joins. How about for antijoins? 😅 I have a list of IDs of the duplicated rows which I want to drop from my dataframe.

jay

09/23/2024, 1:55 AM

Hmm cc @Kevin Wang who’s been working on antijoins

Kevin Wang

09/23/2024, 6:33 PM

We have two strategies for anti joins: hash and broadcast. If your right side is small enough, we automatically use a broadcast join, but the limit is pretty low, so if your right side table can fit in memory on a single machine, feel free to try

strategy="broadcast"

and see if that helps!
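As a rough illustration of what a broadcast anti join does conceptually, the sketch below uses plain Python rather than Daft itself. The function name `broadcast_anti_join`, the `id` column, and the list-of-partitions layout are assumptions for the example. The small right side is turned into a lookup that every left partition filters against, so the left side never needs to be shuffled.

```python
def broadcast_anti_join(left_partitions, right_ids):
    """Keep only left rows whose "id" does NOT appear in right_ids.

    Illustrative only: in a distributed engine the lookup would be
    copied ("broadcast") to each worker that processes a partition.
    """
    lookup = set(right_ids)
    return [
        [row for row in partition if row["id"] not in lookup]
        for partition in left_partitions
    ]


# Hypothetical data: two partitions of rows, and the IDs of rows to drop.
partitions = [
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    [{"id": 3, "v": "c"}, {"id": 2, "v": "d"}],
]
deduplicated = broadcast_anti_join(partitions, right_ids=[2])
```

The trade-off is that each concurrent task holds its own copy of the right side, which is why the right side's size matters so much for this strategy.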

Kyle

09/23/2024, 10:52 PM

I've tried the broadcast strategy, but it consistently spills, so I think it is replicating the data too much. The right-side table is fairly manageable for my machine (16 GB on a 200 GB machine), but I still get tons of spills and OOMs. I'm guessing that the single-core Ray resource requests mean that the memory is being split over 20 cores, which leaves less than 10 GB per core. After trying out resource requests on the UDFs, I realised that they reduced this problem a lot, with huge speedups on the runs with UDFs. Not sure if something similar would work for the joins (or maybe at least the joins which don't require shuffles) 🤔

jay

09/23/2024, 11:29 PM

> I’m guessing that the single-core Ray resource requests mean that the memory is being split over 20 cores, which leaves less than 10 GB per core.

I see. And I’m guessing if you’re running the broadcast antijoin it’s likely running 20 broadcast joins in parallel, each taking up >16GB of memory.
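The arithmetic behind that guess can be written out as a back-of-the-envelope sketch. The numbers come from the thread (200 GB machine, 20 cores, 16 GB right side); the function name and the assumption that each parallel task holds one full copy of the right side are illustrative, not a description of how Daft or Ray actually account for memory.

```python
def broadcast_memory_estimate(machine_gb, cores, right_side_gb):
    """Rough memory budget for parallel broadcast joins.

    Assumes one task per core and one full copy of the right side
    per task (an assumption for illustration).
    """
    per_core_gb = machine_gb / cores           # even share of memory per core
    peak_gb = cores * right_side_gb            # all tasks holding a copy at once
    max_parallel = int(machine_gb // right_side_gb)  # copies that fit at all
    return per_core_gb, peak_gb, max_parallel


per_core_gb, peak_gb, max_parallel = broadcast_memory_estimate(
    machine_gb=200, cores=20, right_side_gb=16
)
print(f"{per_core_gb:.0f} GB per core, {peak_gb} GB peak, "
      f"at most {max_parallel} copies fit")
```

Under these assumptions the right side alone would need 320 GB across 20 tasks on a 200 GB machine, and that is before counting the left partitions and the join output, which would be consistent with the spilling and OOMs described.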

jay

09/23/2024, 11:30 PM

We do have some heuristics around how much memory to request from Ray for a broadcast join, but they could be “too naive” for this case.
