sherlockbeard
03/29/2024, 8:23 PM

jay
03/29/2024, 8:28 PM
field_id (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L459) was populated because Iceberg relies on that pretty heavily for schema evolution! I’m actually not sure if DeltaLake does so as well (cc @Clark Zinzow)
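A minimal pyarrow sketch of how field_id can end up populated in Parquet files (the file and column names here are purely illustrative, not Daft's actual writer): pyarrow propagates a field's PARQUET:field_id metadata entry into the Parquet schema's field_id.

import pyarrow as pa
import pyarrow.parquet as pq

# Attach explicit field IDs via the PARQUET:field_id metadata key;
# pyarrow writes these into the Parquet schema's field_id slots.
schema = pa.schema(
    [
        pa.field("id", pa.int64(), metadata={b"PARQUET:field_id": b"1"}),
        pa.field("name", pa.string(), metadata={b"PARQUET:field_id": b"2"}),
    ]
)
table = pa.table({"id": [1, 2], "name": ["a", "b"]}, schema=schema)
pq.write_table(table, "example.parquet")

# Read back the Arrow schema; the field IDs round-trip via the same metadata key.
print(pq.read_schema("example.parquet"))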
Otherwise, I think the same pattern that we did for Iceberg should also work with DeltaLake (write files -> collect metadata -> convert to DeltaLake metadata -> commit the metadata to DeltaLake).
The main work that needs to be done is to figure out if these operations can be mapped to the DeltaLake Python SDK.

sherlockbeard
03/29/2024, 9:00 PM

jay
03/29/2024, 9:05 PM
The write_table API you linked to actually does both data and metadata writes!
In Daft, the flow looks like:
1. Write files (Daft will do this in a distributed fashion, without the deltalake package)
2. Collect metadata
3. Convert to deltalake metadata
4. Commit the write to deltalake (this is a metadata-only operation)
Thus we’d actually just use the metadata writing capabilities in Step 4. Probably this portion I think: https://github.com/delta-io/delta-rs/blob/e58df28d2589dd79f689c68ae2cb6489e0a633fc/python/deltalake/writer.py#L550-L558
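A rough sketch of what Steps 3-4 could look like, assuming the AddAction dataclass from deltalake.writer at the pinned commit above; the written_files list, the unpartitioned layout, and the commented commit call are assumptions to be checked against that writer.py.

import time
from deltalake.writer import AddAction  # present in writer.py at the linked commit (assumption)

# Hypothetical output of Steps 1-2: paths and sizes of Parquet files Daft already wrote.
written_files = [("part-00000.parquet", 1024), ("part-00001.parquet", 2048)]

# Step 3: convert the collected file metadata into Delta "add" actions.
add_actions = [
    AddAction(
        path=path,
        size=size,
        partition_values={},                     # unpartitioned table in this sketch
        modification_time=int(time.time() * 1000),
        data_change=True,
        stats="{}",                              # a real write would include per-file column stats
    )
    for path, size in written_files
]

# Step 4: the linked writer.py lines commit these actions through the table's
# internal transaction API, roughly:
#   table._table.create_write_transaction(add_actions, "append", [], schema, None)
#   table.update_incremental()
# (signature taken from the pinned writer.py; treat it as an assumption)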
jay
03/29/2024, 9:08 PM

Clark Zinzow
03/29/2024, 10:20 PM

sherlockbeard
04/01/2024, 3:45 PM

jay
04/01/2024, 3:48 PM
df.schema() -> Schema object
Right now you can do it manually like so:
import pyarrow as pa

# df is an existing daft.DataFrame; convert each Daft field to a pyarrow field
fields = [f for f in df.schema()]
pyarrow_fields = [pa.field(f.name, f.dtype.to_arrow_dtype()) for f in fields]
pyarrow_schema = pa.schema(pyarrow_fields)
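As a quick sanity check (illustrative), the result is an ordinary pyarrow.Schema, so anything that expects one will accept it:

# Build an empty pyarrow Table from the converted schema and inspect it
empty = pyarrow_schema.empty_table()
print(empty.schema)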