Question for the Daft team — when using daft.read_iceberg(…) to read a Parquet-backed Iceberg table that has a partition spec and is catalogued in a Hive Metastore, will Daft create a DataFrame partitioned similarly to the source table?
I get the error below, which suggests Daft is tripping over the table's partition spec — the KeyError: None raised in iceberg_scan.py looks like file.spec_id is None when the scan tries to look it up in self._table.specs(). (I'm using ray 2.8.1, Python 3.8, pyiceberg 0.4.0, pyarrow 15.0.0, thrift 0.16.0, getdaft 0.3.2, pandas 1.5.3.)
The offending code is:
df = daft.read_iceberg(tbl_iceberg)
print(f"num_daft_partitions: {df.num_partitions()}")
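For context, tbl_iceberg is loaded from the Hive Metastore via pyiceberg before being handed to Daft — roughly like this (the thrift URI and table identifier below are placeholders, not my real values):

import daft
from pyiceberg.catalog import load_catalog

# Connect to the Hive Metastore; URI is a placeholder
catalog = load_catalog("hive", **{"type": "hive", "uri": "thrift://metastore-host:9083"})

# Load the partitioned Iceberg table; identifier is a placeholder
tbl_iceberg = catalog.load_table("my_db.my_table")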
———-
Traceback (most recent call last):
File "huggingface_ts.py", line 756, in <module>
print(f"num_daft_partitions: {df.num_partitions()}")
File "/tmp/ray/session_2024-09-11_10-50-57_989820_8/runtime_resources/pip/ccd6546b9db45e126cc6cf4dab015ec053a14fb1/virtualenv/lib/python3.8/site-packages/daft/dataframe/dataframe.py", line 194, in num_partitions
return self.__builder.optimize().to_physical_plan_scheduler(daft_execution_config).num_partitions()
File "/tmp/ray/session_2024-09-11_10-50-57_989820_8/runtime_resources/pip/ccd6546b9db45e126cc6cf4dab015ec053a14fb1/virtualenv/lib/python3.8/site-packages/daft/logical/builder.py", line 67, in to_physical_plan_scheduler
return PhysicalPlanScheduler.from_logical_plan_builder(
File "/tmp/ray/session_2024-09-11_10-50-57_989820_8/runtime_resources/pip/ccd6546b9db45e126cc6cf4dab015ec053a14fb1/virtualenv/lib/python3.8/site-packages/daft/plan_scheduler/physical_plan_scheduler.py", line 30, in from_logical_plan_builder
scheduler = _PhysicalPlanScheduler.from_logical_plan_builder(builder._builder, daft_execution_config)
File "/tmp/ray/session_2024-09-11_10-50-57_989820_8/runtime_resources/pip/ccd6546b9db45e126cc6cf4dab015ec053a14fb1/virtualenv/lib/python3.8/site-packages/daft/iceberg/iceberg_scan.py", line 173, in to_scan_tasks
pspec = self._iceberg_record_to_partition_spec(self._table.specs()[file.spec_id], file.partition)
KeyError: None