# general
k
Is there any good way to estimate the amount of memory in each runner and number of runners I should provide for my daft jobs?
j
Hey @Kyle! Are you running Daft on a single machine? If so, then you shouldn't need to configure anything in terms of number of runners. We run a single Python multithreading backend that is shared across all dataframe executions.
k
I'm trying to run it on a Ray cluster and need to figure out what cluster configuration would be appropriate, so I came here to see if you guys have any tips on that 😅
j
Ah I see! Makes sense
Generally speaking:
1. Choose machine types with lots of RAM per CPU.
2. Choose machines with NVMe SSDs (helps a lot with making spilling faster when your datasets get big).
3. Depending on the size of your dataset, I'd provision ~2x the size of your dataset in total cluster memory and go from there.
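To make the ~2x rule of thumb concrete, here's a quick back-of-the-envelope sketch (the dataset and worker sizes below are just illustrative assumptions, not recommendations):

```python
# Rough cluster-sizing arithmetic for the ~2x rule of thumb above.
# All numbers here are hypothetical examples.
dataset_size_gb = 500                            # assumed dataset size
target_cluster_memory_gb = 2 * dataset_size_gb   # ~2x the dataset size
worker_memory_gb = 256                           # e.g. a worker with 256 GiB of RAM
num_workers = -(-target_cluster_memory_gb // worker_memory_gb)  # ceiling division
print(num_workers)  # -> 4 workers for this example
```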
k
Thanks!! Would it be better to have more workers or fewer, and how much would you allocate to the head?
j
Usually bigger machines = better. We recommend not running any tasks on the head node.
More machines will give you better read throughput from the cloud, but fewer (and bigger) machines will give you better overall workload performance when performing shuffles.
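For reference, pointing Daft at an existing Ray cluster looks roughly like this (a minimal sketch; the address and path are placeholders for your own setup):

```python
import daft

# Minimal sketch: connect Daft's Ray runner to an existing cluster.
# Replace the placeholder address with your head node's Ray Client address.
daft.context.set_runner_ray(address="ray://<head-node-ip>:10001")

# From here, dataframe work is scheduled as Ray tasks across the workers.
df = daft.read_parquet("s3://my-bucket/my-dataset/")  # hypothetical path
```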
k
Also, I am getting a DaftCoreException that says the target array cannot contain nulls. How can I discard that row, or keep it somewhere as a log?
j
A good compromise is to use something like the r5d.8xlarge.
Can you send us more details in a separate thread?
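In the meantime, one way to drop (or set aside) the offending rows is to filter on nulls. A rough sketch, assuming the problem column is named "target" (your column name will differ):

```python
import daft

df = daft.read_parquet("s3://my-bucket/my-dataset/")  # hypothetical path

# Set aside the rows with nulls, e.g. to write them out as a log of bad records.
bad_rows = df.where(df["target"].is_null())

# Continue the job with only the rows where the column is populated.
clean = df.where(df["target"].not_null())
```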
k
Yes okay!