and i notice that the format of `MicroPartition` is differen Distributed Data Community #daft-dev

Chuanlei Ni

07/24/2024, 1:34 AM

And I notice that the format of `MicroPartition` is different from Arrow. I'd like to understand why Arrow isn't used directly as the local format; Modin and Ray Data both use Arrow as their local format. @Colin Ho @jay thx

jay

07/24/2024, 1:37 AM

Yes! We do this because we have our own types and internal representations of data. For example, we allow columns of Python objects. Also, having our own implementations of nested types makes them much easier to work with.

jay

07/24/2024, 1:37 AM

This also lets us innovate without relying on Arrow, for example by implementing Umbra strings.

Chuanlei Ni

07/24/2024, 1:38 AM

Understood. But then we cannot use DataFusion or other existing libraries for operators.

Chuanlei Ni

07/24/2024, 1:44 AM

btw, what are `umbra strings`?

Chuanlei Ni

07/24/2024, 1:50 AM

And when we store data in the Plasma store, can `Table` support in-place computation? @jay

jay

07/24/2024, 4:15 AM

Umbra strings: https://cedardb.com/blog/german_strings/
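For readers who don't want to click through: the linked post describes a fixed 16-byte string slot, where short strings (up to 12 bytes) live inline next to a 4-byte length, and long strings keep the length, a 4-byte prefix, and an 8-byte reference to the full data. A minimal sketch of that layout follows. The names `encode`, `decode`, `prefix_differs`, and the `heap` buffer are hypothetical, and a plain offset into a `bytearray` stands in for the real pointer; this is not Daft's or Umbra's implementation.

```python
import struct

INLINE_MAX = 12  # payload bytes that fit beside the 4-byte length in a 16-byte slot


def encode(value: bytes, heap: bytearray) -> bytes:
    """Pack value into a fixed 16-byte slot, spilling long payloads to heap."""
    n = len(value)
    if n <= INLINE_MAX:
        # Short string: 4-byte length + up to 12 payload bytes, zero padded.
        return struct.pack("<I12s", n, value)
    # Long string: 4-byte length + 4-byte prefix + 8-byte offset into heap.
    offset = len(heap)
    heap.extend(value)
    return struct.pack("<I4sQ", n, value[:4], offset)


def decode(slot: bytes, heap: bytearray) -> bytes:
    """Recover the original bytes from a slot produced by encode."""
    (n,) = struct.unpack_from("<I", slot)
    if n <= INLINE_MAX:
        return slot[4:4 + n]
    (offset,) = struct.unpack_from("<Q", slot, 8)
    return bytes(heap[offset:offset + n])


def prefix_differs(a: bytes, b: bytes) -> bool:
    """Return True if two slots certainly hold different strings.

    Bytes 0..8 hold the length and the first four payload bytes in both
    layouts, so this check never has to follow the reference into heap.
    A False result means a full comparison is still needed.
    """
    return a[:8] != b[:8]
```

The point of the layout is that many comparisons can be decided from the first 8 bytes alone, without dereferencing anything.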

> while we store data in plasma store, can `Table` support in-place computation?

We don’t do in-place computation; it’s really difficult to reason about in distributed data processing. Instead, we are moving to a streaming-based model for memory stability.
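To illustrate the general idea (not Daft's actual executor), here is a small sketch of a streaming pipeline built from Python generators. Each stage produces a new batch instead of mutating its input, and only one batch per stage is alive at a time, which is what keeps memory use flat. `Batch`, `read_batches`, `double`, and `total` are hypothetical names chosen for this example.

```python
from typing import Iterable, Iterator, List

Batch = List[int]  # stand-in for a small chunk of rows


def read_batches(values: Iterable[int], batch_size: int) -> Iterator[Batch]:
    """Yield fixed-size batches lazily instead of materializing everything."""
    batch: Batch = []
    for v in values:
        batch.append(v)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def double(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Produce a new batch per input batch; the input is never mutated."""
    for batch in batches:
        yield [v * 2 for v in batch]


def total(batches: Iterable[Batch]) -> int:
    """Consume batches one at a time and reduce them to a single number."""
    return sum(sum(batch) for batch in batches)
```

Usage: `total(double(read_batches(range(10), 4)))` pulls batches through the pipeline on demand, so the full input is never held in memory at once.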

jay

07/24/2024, 4:16 AM

> we cannot use datafusion or other existing library for operators.

Not really. It’s pretty easy to export Arrow-shaped data in a zero-copy way if other libraries need to work with our data. Our primitives are still Arrow-based!
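Daft's export path isn't shown in this thread, so as a generic illustration of what "zero-copy" means, here is a standard-library sketch: a consumer gets a view over the producer's buffer rather than a copy of it. `array` and `memoryview` stand in for a column buffer and the consuming library; the variable names are made up for this example.

```python
from array import array

# A column of 64-bit integers stored in one contiguous buffer.
column = array("q", [1, 2, 3, 4])

# memoryview exposes the same buffer via the buffer protocol: no bytes are copied.
view = memoryview(column)

# Slicing a memoryview yields another view over the same memory.
tail = view[2:]

# A write through the original is visible through both views,
# which demonstrates that the memory is shared rather than duplicated.
column[3] = 40
```

Because the consumer sees the same memory, handing data to another library costs no extra allocation, regardless of how large the column is.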
