# general
k
Is there any good way to estimate the amount of memory in each runner and number of runners I should provide for my daft jobs?
j
Hey @Kyle! Are you running Daft on a single machine? If so, then you shouldn't need to configure anything in terms of number of runners. We run a single Python multithreading backend that is shared across all dataframe executions.
k
I'm trying to run it on a Ray cluster and need to figure out what cluster configuration would be appropriate, so I came here to see if you guys have any tips on that 😅
j
Ah I see! Makes sense
Generally speaking:
1. Choose machine types with lots of RAM per CPU.
2. Choose machines with NVMe SSDs (helps a lot with making spilling faster when your datasets get big).
3. Depending on the size of your dataset, I'd provision ~2x the size of your dataset in total cluster memory and go from there.
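To make the ~2x rule of thumb concrete, here's a quick back-of-the-envelope sketch (the dataset and worker sizes below are just illustrative assumptions, not recommendations):

```python
# Rough cluster-sizing arithmetic for the ~2x rule of thumb above.
# All numbers here are hypothetical examples.
dataset_size_gb = 500                            # assumed dataset size
target_cluster_memory_gb = 2 * dataset_size_gb   # ~2x the dataset size
worker_memory_gb = 256                           # e.g. a worker with 256 GiB of RAM
num_workers = -(-target_cluster_memory_gb // worker_memory_gb)  # ceiling division
print(num_workers)  # -> 4 workers for this example
```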
k
Thanks!! Would it be better to have more workers or fewer, and how much would you allocate to the head?
j
Usually bigger machines = better. We recommend not running any tasks on the head node.
More machines will give you better read throughput from the cloud, but fewer (and bigger) machines will give you better overall workload performance when performing shuffles.
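For reference, pointing Daft at an existing Ray cluster looks roughly like this (a minimal sketch; the address and path are placeholders for your own setup):

```python
import daft

# Minimal sketch: connect Daft's Ray runner to an existing cluster.
# Replace the placeholder address with your head node's Ray Client address.
daft.context.set_runner_ray(address="ray://<head-node-ip>:10001")

# From here, dataframe work is scheduled as Ray tasks across the workers.
df = daft.read_parquet("s3://my-bucket/my-dataset/")  # hypothetical path
```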
k
Also, I am getting a DaftCoreException that says the target array cannot contain nulls. How can I discard that row, or keep it somewhere as a log?
j
A good compromise is to use something like the r5d.8xlarge.
Can you send us more details in a separate thread?
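In the meantime, one way to drop (or set aside) the offending rows is to filter on nulls. A rough sketch, assuming the problem column is named "target" (your column name will differ):

```python
import daft

df = daft.read_parquet("s3://my-bucket/my-dataset/")  # hypothetical path

# Set aside the rows with nulls, e.g. to write them out as a log of bad records.
bad_rows = df.where(df["target"].is_null())

# Continue the job with only the rows where the column is populated.
clean = df.where(df["target"].not_null())
```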
k
Yes okay!