# daft-dev
s
I was looking into how to add write_delta in Daft. From what I understand about iceberg_write, Daft starts by creating a commit, then distributes the data collection and file writes in the execution step and collects the metadata. It saves the file metadata in the commit (of course, this is just a summary 🤧). For delta_lake, the basic steps should be the same, I think, but the file_visitor will be different for stats and format. I believe the current parquet file write should work for basic Delta Lake without column mapping and other advanced features? Are there any other pitfalls I should look out for when doing this?
👀 1
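For context on the file_visitor piece mentioned above: delta-rs' PyArrow path collects per-file metadata through pyarrow's `ds.write_dataset(..., file_visitor=...)` callback. Below is a minimal sketch of that mechanism; the `WrittenFile` attributes (`path`, `metadata`, `size`) come from pyarrow, while the stats dictionary is only a simplified stand-in for what Delta actually records.

```python
import json
import pyarrow.dataset as ds

# Sketch of a Delta-style file_visitor for ds.write_dataset. The WrittenFile
# attributes (path, metadata, size) are pyarrow's; the stats payload here is a
# simplified placeholder (real Delta stats also include per-column
# min/max/null counts).
collected_files = []

def file_visitor(written_file):
    parquet_meta = written_file.metadata  # pyarrow.parquet.FileMetaData
    stats = {"numRecords": parquet_meta.num_rows}
    collected_files.append(
        {
            "path": written_file.path,
            "size": written_file.size,
            "stats": json.dumps(stats),
        }
    )

# Example usage:
# ds.write_dataset(arrow_table, "/tmp/delta_table", format="parquet",
#                  file_visitor=file_visitor)
```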
j
Hi @sherlockbeard! This is very exciting stuff 😛 I think the current parquet file write should work for Delta Lake. The one thing we had to do for writing Iceberg was to ensure that the Parquet `field_id` (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L459) was populated, because Iceberg relies on that pretty heavily for schema evolution! I'm actually not sure if Delta Lake does so as well (cc @Clark Zinzow). Otherwise, I think the same pattern that we did for Iceberg should also work with Delta Lake (write files -> collect metadata -> convert to Delta Lake metadata -> commit the metadata to Delta Lake). The main work that needs to be done is to figure out whether these operations can be mapped to the Delta Lake Python SDK.
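For reference, here is a minimal sketch of getting pyarrow's Parquet writer to populate `field_id`: the writer picks it up from a `PARQUET:field_id` entry in each field's metadata. This only illustrates the mechanism, it is not necessarily what Daft does internally, and whether Delta Lake needs it at all is the open question above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# pyarrow's Parquet writer reads the "PARQUET:field_id" key from each field's
# metadata and writes it as the Parquet field_id.
schema = pa.schema(
    [
        pa.field("id", pa.int64(), metadata={b"PARQUET:field_id": b"1"}),
        pa.field("name", pa.string(), metadata={b"PARQUET:field_id": b"2"}),
    ]
)
table = pa.table({"id": [1, 2], "name": ["a", "b"]}, schema=schema)
pq.write_table(table, "/tmp/with_field_ids.parquet")

# Inspect the written Parquet schema; the two fields should now carry ids 1 and 2.
print(pq.read_metadata("/tmp/with_field_ids.parquet").schema)
```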
s
Currently, I think it can be mapped to the Python SDK: https://github.com/delta-io/delta-rs/blob/e58df28d2589dd79f689c68ae2cb6489e0a633fc/python/deltalake/writer.py#L335C3-L335C30. But it looks like they are planning to move away from the PyArrow engine to the Rust engine: https://github.com/delta-io/delta-rs/blob/e58df28d2589dd79f689c68ae2cb6489e0a633fc/python/deltalake/writer.py#L209-L211. And with the Rust engine, the Python side just hands all of the work off to the Rust layer. 🫠
j
The `write_table` API you linked to actually does both data and metadata writes! In Daft, the flow looks like:
1. Write files (Daft will do this in a distributed fashion, without the deltalake package)
2. Collect metadata
3. Convert to deltalake metadata
4. Commit the write to deltalake (this is a metadata-only operation)
Thus we'd actually just use the metadata writing capabilities in Step 4. Probably this portion, I think: https://github.com/delta-io/delta-rs/blob/e58df28d2589dd79f689c68ae2cb6489e0a633fc/python/deltalake/writer.py#L550-L558
👍 1
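A rough sketch of what Step 4 could look like from Daft's side, assuming the internal `create_write_transaction` call that writer.py makes at the linked lines; the exact signature is an assumption and may differ across delta-rs versions.

```python
from deltalake import DeltaTable

def commit_to_delta(table_uri, add_actions, pyarrow_schema, mode="append"):
    # Metadata-only commit: register the files Daft already wrote.
    # `create_write_transaction` is delta-rs internal API at the linked commit,
    # so this argument order is an assumption.
    table = DeltaTable(table_uri)
    table._table.create_write_transaction(
        add_actions,     # list[deltalake.writer.AddAction] built in Step 3
        mode,            # "append" or "overwrite"
        [],              # partition columns
        pyarrow_schema,  # pyarrow schema of the written data
        None,            # partition filters (only relevant for overwrite)
    )
```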
And for Step 3 (convert to deltalake metadata), I think most of the necessary logic is here: https://github.com/delta-io/delta-rs/blob/e58df28d2589dd79f689c68ae2cb6489e0a633fc/python/deltalake/writer.py#L405C13-L426
👍 1
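And a hedged sketch of Step 3, turning the per-file metadata Daft collects into `deltalake.writer.AddAction` entries, mirroring the linked visitor logic; the helper name and the minimal stats payload are illustrative only.

```python
import json
from datetime import datetime

from deltalake.writer import AddAction

def to_add_action(path: str, size_bytes: int, num_rows: int) -> AddAction:
    # Real Delta stats also carry per-column minValues / maxValues / nullCount;
    # only the record count is shown here for brevity.
    return AddAction(
        path,
        size_bytes,
        {},                                      # partition values
        int(datetime.now().timestamp() * 1000),  # modification time (ms)
        True,                                    # data_change
        json.dumps({"numRecords": num_rows}),
    )
```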
c
@sherlockbeard I'm super excited that you're looking at taking this on! I added support for Delta Lake reads, and was going to chime in here, but it looks like @jay has already hit the nail on the head on how we expect the Delta Lake write flow to work.
🙌 1
s
Any idea how I can convert a Daft schema to a pyarrow schema without using .to_arrow()? That would trigger a collect.
j
Hello! Yes, this is something we should add to our `df.schema() -> Schema` object. Right now you can do it manually like so:
```python
import pyarrow as pa

# Build a pyarrow schema from the Daft schema without triggering a collect.
fields = [f for f in df.schema()]
pyarrow_fields = [pa.field(f.name, f.dtype.to_arrow_dtype()) for f in fields]
pyarrow_schema = pa.schema(pyarrow_fields)
```
👍 1