# daft-dev
s
I was looking into how to add write_delta in Daft. From what I understand about iceberg_write, Daft starts by creating a commit, then distributes the data collection and file writes in the execution step and collects the metadata. It saves the file metadata in the commit (of course, this is just a summary 🤧). For delta_lake, the basic steps should be the same, I think, but the file_visitor will be different for stats and format. I believe the current parquet file write should work for basic Delta Lake without column mapping and other advanced features? Are there any other pitfalls I should look out for when doing this?
👀 1
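For context on the file_visitor piece mentioned above: delta-rs' PyArrow path collects per-file metadata through pyarrow's `ds.write_dataset(..., file_visitor=...)` callback. Below is a minimal sketch of that mechanism; the `WrittenFile` attributes (`path`, `metadata`, `size`) come from pyarrow, while the stats dictionary is only a simplified stand-in for what Delta actually records.

```python
import json
import pyarrow.dataset as ds

# Sketch of a Delta-style file_visitor for ds.write_dataset. The WrittenFile
# attributes (path, metadata, size) are pyarrow's; the stats payload here is a
# simplified placeholder (real Delta stats also include per-column
# min/max/null counts).
collected_files = []

def file_visitor(written_file):
    parquet_meta = written_file.metadata  # pyarrow.parquet.FileMetaData
    stats = {"numRecords": parquet_meta.num_rows}
    collected_files.append(
        {
            "path": written_file.path,
            "size": written_file.size,
            "stats": json.dumps(stats),
        }
    )

# Example usage:
# ds.write_dataset(arrow_table, "/tmp/delta_table", format="parquet",
#                  file_visitor=file_visitor)
```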
j
Hi @sherlockbeard! This is very exciting stuff 😛 I think the current parquet file write should work for Delta Lake. The one thing we had to do for writing Iceberg was to ensure that the Parquet `field_id` (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L459) was populated, because Iceberg relies on that pretty heavily for schema evolution! I'm actually not sure if Delta Lake does so as well (cc @Clark Zinzow). Otherwise, I think the same pattern that we did for Iceberg should also work with Delta Lake (write files -> collect metadata -> convert to Delta Lake metadata -> commit the metadata to Delta Lake). The main work that needs to be done is to figure out whether these operations can be mapped to the Delta Lake Python SDK.
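For reference, here is a minimal sketch of getting pyarrow's Parquet writer to populate `field_id`: the writer picks it up from a `PARQUET:field_id` entry in each field's metadata. This only illustrates the mechanism, it is not necessarily what Daft does internally, and whether Delta Lake needs it at all is the open question above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# pyarrow's Parquet writer reads the "PARQUET:field_id" key from each field's
# metadata and writes it as the Parquet field_id.
schema = pa.schema(
    [
        pa.field("id", pa.int64(), metadata={b"PARQUET:field_id": b"1"}),
        pa.field("name", pa.string(), metadata={b"PARQUET:field_id": b"2"}),
    ]
)
table = pa.table({"id": [1, 2], "name": ["a", "b"]}, schema=schema)
pq.write_table(table, "/tmp/with_field_ids.parquet")

# Inspect the written Parquet schema; the two fields should now carry ids 1 and 2.
print(pq.read_metadata("/tmp/with_field_ids.parquet").schema)
```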
s
Currently, I think it can be mapped to the Python SDK: https://github.com/delta-io/delta-rs/blob/e58df28d2589dd79f689c68ae2cb6489e0a633fc/python/deltalake/writer.py#L335C3-L335C30. But it looks like they are planning to move away from the PyArrow engine to the Rust engine: https://github.com/delta-io/delta-rs/blob/e58df28d2589dd79f689c68ae2cb6489e0a633fc/python/deltalake/writer.py#L209-L211. And with the Rust engine, the Python side just hands all of the work off to the Rust layer. 🫠
j
The `write_table` API you linked to actually does both data and metadata writes! In Daft, the flow looks like:
1. Write files (Daft will do this in a distributed fashion, without the deltalake package)
2. Collect metadata
3. Convert to deltalake metadata
4. Commit the write to deltalake (this is a metadata-only operation)
Thus we'd actually just use the metadata writing capabilities in Step 4. Probably this portion, I think: https://github.com/delta-io/delta-rs/blob/e58df28d2589dd79f689c68ae2cb6489e0a633fc/python/deltalake/writer.py#L550-L558
👍 1
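A rough sketch of what Step 4 could look like from Daft's side, assuming the internal `create_write_transaction` call that writer.py makes at the linked lines; the exact signature is an assumption and may differ across delta-rs versions.

```python
from deltalake import DeltaTable

def commit_to_delta(table_uri, add_actions, pyarrow_schema, mode="append"):
    # Metadata-only commit: register the files Daft already wrote.
    # `create_write_transaction` is delta-rs internal API at the linked commit,
    # so this argument order is an assumption.
    table = DeltaTable(table_uri)
    table._table.create_write_transaction(
        add_actions,     # list[deltalake.writer.AddAction] built in Step 3
        mode,            # "append" or "overwrite"
        [],              # partition columns
        pyarrow_schema,  # pyarrow schema of the written data
        None,            # partition filters (only relevant for overwrite)
    )
```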
And for Step 3 (convert to deltalake metadata), I think most of the necessary logic is here: https://github.com/delta-io/delta-rs/blob/e58df28d2589dd79f689c68ae2cb6489e0a633fc/python/deltalake/writer.py#L405C13-L426
👍 1
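And a hedged sketch of Step 3, turning the per-file metadata Daft collects into `deltalake.writer.AddAction` entries, mirroring the linked visitor logic; the helper name and the minimal stats payload are illustrative only.

```python
import json
from datetime import datetime

from deltalake.writer import AddAction

def to_add_action(path: str, size_bytes: int, num_rows: int) -> AddAction:
    # Real Delta stats also carry per-column minValues / maxValues / nullCount;
    # only the record count is shown here for brevity.
    return AddAction(
        path,
        size_bytes,
        {},                                      # partition values
        int(datetime.now().timestamp() * 1000),  # modification time (ms)
        True,                                    # data_change
        json.dumps({"numRecords": num_rows}),
    )
```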
c
@sherlockbeard I'm super excited that you're looking at taking this on! I added support for Delta Lake reads, and was going to chime in here, but it looks like @jay has already hit the nail on the head on how we expect the Delta Lake write flow to work.
🙌 1
s
Any idea how I can convert a Daft schema to a pyarrow schema without using .to_arrow()? That would trigger a collect.
j
Hello! Yes, this is something we should add to our `df.schema() -> Schema` object. Right now you can do it manually like so:
```python
import pyarrow as pa

# Build a pyarrow schema from the Daft schema without triggering a collect.
fields = [f for f in df.schema()]
pyarrow_fields = [pa.field(f.name, f.dtype.to_arrow_dtype()) for f in fields]
pyarrow_schema = pa.schema(pyarrow_fields)
```
👍 1