What is the proper way of instantiating an empty dataframe with a pyarrow schema? (Distributed Data Community, #general)

# general

Everett Kleven

09/23/2024, 11:20 PM

Hey all, what is the proper way of instantiating an empty dataframe with a pyarrow schema?

Kevin Wang

09/23/2024, 11:24 PM

What would you need an empty dataframe for? What you can generally do is create an empty pyarrow table using

pa_table = pa.Table.from_pylist([], schema=schema)

and then you can do

daft.from_arrow(pa_table)

Kevin Wang

09/23/2024, 11:24 PM

https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_functions/daft.from_arrow.html#daft.from_arrow

Everett Kleven

09/23/2024, 11:39 PM

Hey Kevin, I am creating a few classes with dataframes as properties. The solution I found was actually in the arrow docs with

from typing import ClassVar, Optional

import daft
import pyarrow as pa
from daft import DataFrame

base_schema = pa.schema(
    [
        pa.field(
            "id",
            pa.string(),
            nullable=False,
            metadata={"description": "Unique identifier for the record"},
        ),
        pa.field(
            "created_at",
            pa.timestamp("ns", tz="UTC"),
            nullable=False,
            metadata={"description": "Creation timestamp"},
        ),
        pa.field(
            "updated_at",
            pa.timestamp("ns", tz="UTC"),
            nullable=False,
            metadata={"description": "Last update timestamp"},
        ),
        pa.field(
            "inserted_at",
            pa.timestamp("ns", tz="UTC"),
            nullable=False,
            metadata={"description": "Insertion timestamp into the database"},
        ),
    ]
)

class BaseDF:
    schema: ClassVar[pa.Schema] = base_schema
    df: Optional[DataFrame] = daft.from_arrow(base_schema.empty_table())

    @classmethod
    def validate_schema(cls, df: DataFrame) -> DataFrame:
        # DataFrame.schema() is a method in Daft, so it has to be called.
        if not df.schema().to_pyarrow_schema().equals(cls.schema):
            raise ValueError(f"DataFrame schema does not match the {cls.__name__} schema.")
        return df

Everett Kleven

09/23/2024, 11:41 PM

I am defining all of my schemas with arrow for interop.

Kevin Wang

09/23/2024, 11:44 PM

I see. Is the dataframe only there for validating the schema, or are you also storing data in it? If the former, you can also store a daft.Schema. You can also go from pyarrow to daft schema using daft.Schema.from_pyarrow_schema

Everett Kleven

09/23/2024, 11:45 PM

So the plan is to store real data there, just looking to make sure each class is built with the right types/schema. I initially started with

self.schema: daft.Schema

but since I am integrating with LanceDB and Iceberg, I figured Arrow should be the source of truth

👍 2

Everett Kleven

09/23/2024, 11:46 PM

As you pointed out, it's easy enough to go between the two
