Hey all, What is the proper way of instantiating a...
# general
e
Hey all, what is the proper way of instantiating an empty dataframe with a pyarrow schema?
k
What would you need an empty dataframe for? What you can generally do is create an empty pyarrow table using
pa_table = pa.Table.from_pylist([], schema=schema)
and then you can do
daft.from_arrow(pa_table)
e
Hey Kevin, I am creating a few classes with dataframes as properties. The solution I found was actually in the arrow docs with
from typing import ClassVar, Optional

import daft
import pyarrow as pa
from daft import DataFrame

base_schema = pa.schema(
    [
        pa.field(
            "id",
            pa.string(),
            nullable=False,
            metadata={"description": "Unique identifier for the record"},
        ),
        pa.field(
            "created_at",
            pa.timestamp("ns", tz="UTC"),
            nullable=False,
            metadata={"description": "Creation timestamp"},
        ),
        pa.field(
            "updated_at",
            pa.timestamp("ns", tz="UTC"),
            nullable=False,
            metadata={"description": "Last update timestamp"},
        ),
        pa.field(
            "inserted_at",
            pa.timestamp("ns", tz="UTC"),
            nullable=False,
            metadata={"description": "Insertion timestamp into the database"},
        ),
    ]
)

class BaseDF:
    schema: ClassVar[pa.Schema] = base_schema
    df: Optional[DataFrame] = daft.from_arrow(base_schema.empty_table())

    @classmethod
    def validate_schema(cls, df: DataFrame) -> DataFrame:
        # DataFrame.schema() is a method in Daft, so it has to be called.
        if not df.schema().to_pyarrow_schema().equals(cls.schema):
            raise ValueError(f"DataFrame schema does not match the {cls.__name__} schema.")
        return df
I am defining all of my schemas with arrow for interop.
k
I see. Is the dataframe only there for validating the schema, or are you also storing data in it? If the former, you can also store a daft.Schema. You can also go from pyarrow to daft schema using daft.Schema.from_pyarrow_schema
e
So the plan is to store real data there, just looking to make sure each class is built with the right types/schema. I initially started with
self.schema: daft.Schema
but since I am integrating with LanceDB and Iceberg, I figured Arrow should be the source of truth
As you pointed out, it's easy enough to go between the two.