# daft-dev
c
```python
def to_partition_tasks(self, psets: dict[str, list[PartitionT]]) -> physical_plan.MaterializedPhysicalPlan:
    return physical_plan.materialize(self._scheduler.to_partition_tasks(psets))
```
could you please walk through this code snippet? I'm stuck on this logic... help me @Colin Ho @jay
j
Yup, so there are two types of generators that will be produced from the physical plan:
1. `InProgressPhysicalPlan`
2. `MaterializedPhysicalPlan`
`InProgressPhysicalPlan` is a generator that can emit:
• `None`: indicating that the plan is waiting on more work to be done before it can proceed
• `PartitionTaskBuilder`: indicating that this is a pipeline of instructions that can be further appended to (allowing us to fuse operations such as Project -> Filter -> …)
• `PartitionTask`: indicating that this is a task that has been "built", and that the runner should run this task.
`MaterializedPhysicalPlan` is very similar, but only emits:
• `None`
• `PartitionTask`
• `MaterializedResult`: indicates that this is a final result that is completed, and the runner should store this in the final set of partitions
Essentially the `materialize(plan: InProgressPhysicalPlan) -> MaterializedPhysicalPlan` function will just finalize any `PartitionTaskBuilder`s that were emitted, and emit those as `PartitionTask`s. This way the runner can just run any `PartitionTask`s that are emitted from this plan.
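To make that concrete, here is a minimal sketch of what a `materialize` step could look like. The class and method names (`PartitionTaskBuilder.finalize`, etc.) are simplified stand-ins for illustration, not Daft's actual API:

```python
from typing import Iterator, Union

# Hypothetical stand-ins for Daft's classes, named for illustration only.
class PartitionTask:
    """A fully built task that the runner can execute."""

class PartitionTaskBuilder:
    """A pipeline of fused instructions that can still be appended to."""

    def finalize(self) -> PartitionTask:
        # "Builds" the accumulated pipeline into a runnable task.
        return PartitionTask()

def materialize(
    plan: Iterator[Union[None, PartitionTaskBuilder, PartitionTask]],
) -> Iterator[Union[None, PartitionTask]]:
    """Finalize any builders so the downstream runner only sees runnable tasks."""
    for step in plan:
        if isinstance(step, PartitionTaskBuilder):
            yield step.finalize()  # no more fusing possible past this point
        else:
            yield step  # None or an already-built PartitionTask passes through
```

The key idea is that `materialize` is itself just another generator wrapping the in-progress plan, converting builders into tasks as they stream past.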
c
it seems that `materialize` can also generate `MaterializedResult[PartitionT]`?
j
From the runner’s point of view, it has this black-box `MaterializedPhysicalPlan` that it can now call `next()` on. It will receive either:
• `None` (this is the plan’s way of saying "please do more work so I can proceed")
• a `PartitionTask` (this is the plan’s way of saying "please run this task")
• a `MaterializedResult`: indicates that this is a final result that is completed, and the runner should store this in the final set of partitions
• or a `StopIteration` (this is the plan’s way of saying there is no more work left)
Oh yes, sorry. `MaterializedResult[PartitionT]` is the plan’s way of saying "this is a FINAL result". Let me amend my previous messages to reflect this.
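The runner’s loop over that black-box plan can be sketched roughly like this. Again, the class names and the `run()` method are hypothetical placeholders, not Daft’s real interfaces:

```python
class MaterializedResult:
    """Hypothetical stand-in for a finished partition the runner should keep."""

    def __init__(self, partition):
        self.partition = partition

class PartitionTask:
    """Hypothetical stand-in for a runnable task."""

    def run(self):
        pass  # a real runner would dispatch this for (possibly async) execution

def run_plan(plan):
    """Drive a materialized plan to completion: dispatch tasks, collect results."""
    final_partitions = []
    while True:
        try:
            step = next(plan)
        except StopIteration:
            break  # plan says: no more work left
        if step is None:
            # Plan is waiting on in-flight tasks; a real runner would
            # poll/await its pending task queue here before calling next() again.
            continue
        if isinstance(step, MaterializedResult):
            final_partitions.append(step.partition)  # FINAL result: store it
        else:
            step.run()  # a PartitionTask: run it
    return final_partitions
```

This covers all four outcomes of `next()` described above: `None`, `PartitionTask`, `MaterializedResult`, and `StopIteration`.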
In this manner, each Python generator has very fine-grained control around how it wants to do execution. For example, our joins can say "please run these `PartitionTask`s from the left/right children", keep a buffer of the left/right tasks, and then start emitting new `PartitionTaskBuilder`s to kickstart the downstream generators as results become available.
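A toy sketch of that buffering pattern (this is not Daft’s actual join implementation, just an illustration of a generator pulling from two child generators and emitting downstream work as pairs become ready):

```python
def join_like(left, right):
    """Buffer non-None outputs from two child plans; emit a downstream
    "builder" per matched pair as soon as both sides have something ready."""
    left_buf, right_buf = [], []
    left_done = right_done = False
    while not (left_done and right_done):
        if not left_done:
            try:
                step = next(left)
                if step is not None:
                    left_buf.append(step)
            except StopIteration:
                left_done = True
        if not right_done:
            try:
                step = next(right)
                if step is not None:
                    right_buf.append(step)
            except StopIteration:
                right_done = True
        # Kick off downstream work eagerly, without waiting for either
        # child generator to be fully exhausted.
        while left_buf and right_buf:
            yield (left_buf.pop(0), right_buf.pop(0))
```

Note how a `None` from a child simply means "nothing ready yet", so the join buffers what it has and keeps pulling.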
c
I will compare the code to understand what you said, thx
`None` (this is the plan’s way of saying "please do more work so I can proceed"): how do we do more work so the plan can resume generating partition tasks?
@jay
j
The plan is waiting on previous work that was submitted, so it doesn’t have to do anything. At that point, the runner should have pending work in its queue that it is still working on (previous `PartitionTask`s that it received).
c
thx @jay, I have finally grasped the process of generating partition tasks, it took me a whole day, hh
j
Yeah that’s the worst part of our code 😅
c
the code is clean, what we need is a doc