Nicolay Gerold
05/08/2024, 3:37 AM

jay
05/08/2024, 5:43 AM
df.explain()
and it will show you each step!
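A minimal sketch of what that looks like (assuming a small in-memory dataframe built with daft.from_pydict; the column names and filter are purely illustrative):

import daft
from daft import col

# Build a small lazy dataframe; nothing executes yet.
df = daft.from_pydict({"x": [1, 2, 3], "y": [10, 20, 30]})
df = df.where(col("x") > 1).select(col("x") + col("y"))

# Print the plan Daft has built up, step by step.
df.explain()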
That being said, there are some arguments to be made for running Daft in an orchestration engine.
1. Intermediate materializations: if your Daft workflow performs steps 1, 2, and 3, but you have consumers that need the data from both step 2 AND step 3, then it makes sense to introduce an intermediate materialization point after step 2
2. Data versioning: you can tie your dataset versions to the ID of your run on the orchestration engine
My recommendations if you decide to use an orchestration engine:
• Try to minimize data transfer between steps and do heavy filtering. A lot of your execution runtime will likely come from reading/writing data from persistent storage.
• I would store data as files (preferably Parquet), and you can just create one “folder” per run of your workflow, e.g. /<workflow_execution_id>/step_<n>/<output_name>
This makes it very easy to inspect the outputs of your workflow and perform any necessary debugging. This way, instead of passing the actual data between steps, you can just pass a folder URL and recreate the dataframe by reading from storage in the subsequent step (see the sketch after this list)
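A rough sketch of that pattern, assuming Parquet in object storage (the bucket, workflow_execution_id, filters, and column names are all hypothetical; daft.read_parquet and df.write_parquet are the calls I would reach for here):

import daft
from daft import col

# Hypothetical run identifier, e.g. handed to you by the orchestration engine.
workflow_execution_id = "2024-05-08-run-42"
base = f"s3://my-bucket/{workflow_execution_id}"  # hypothetical bucket/prefix

# Step 1: read, filter heavily, and materialize the intermediate output.
step_1_out = f"{base}/step_1/filtered"
df = daft.read_parquet("s3://my-bucket/raw/*.parquet")   # hypothetical source
df = df.where(col("event_type") == "purchase")           # hypothetical filter
df.write_parquet(step_1_out)

# Step 2 (possibly a separate task): receive only the folder URL, not the data,
# and recreate the dataframe by reading it back from storage.
df2 = daft.read_parquet(step_1_out)
df2 = df2.with_column("total", col("price") * col("quantity"))  # hypothetical columns
df2.write_parquet(f"{base}/step_2/totals")

The only thing the orchestration engine passes between the two tasks is the folder path, which also makes each run's outputs easy to inspect when debugging.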
Generally you’ll want to be very intentional in deciding data materialization boundaries. If you do need to materialize data, make a strong case as to why you need to do it! Note also that breaking workloads up into steps with intermediate materializations can prevent optimizations that Daft would otherwise be able to perform for you.
Having more steps doesn’t always mean a better pipeline 😉

Nicolay Gerold
05/08/2024, 6:22 AM

jay
05/08/2024, 5:03 PM
df = df.select(col("x") + col("y"))
The original columns x and y will be dropped from the dataframe once the new column is calculated!
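If you want to keep the originals alongside the derived value, a small sketch (the alias "x_plus_y" is arbitrary) is to select them explicitly, or to use with_column instead:

import daft
from daft import col

df = daft.from_pydict({"x": [1, 2], "y": [3, 4]})

# Keep x and y by selecting them alongside the new expression...
kept = df.select(col("x"), col("y"), (col("x") + col("y")).alias("x_plus_y"))

# ...or add the derived column without dropping anything.
kept2 = df.with_column("x_plus_y", col("x") + col("y"))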