Question for the Daft team — when using daft.read_iceberg(…) to read a Parquet-backed Iceberg table that has a partition spec and is catalogued in a Hive Metastore, will Daft create a DataFrame partitioned similarly to the source table?
I get the error below, which suggests Daft is tripping over the table's partition spec — the KeyError: None raised in iceberg_scan.py looks like file.spec_id is None when the scan tries to look it up in self._table.specs(). (I'm using ray 2.8.1, Python 3.8, pyiceberg 0.4.0, pyarrow 15.0.0, thrift 0.16.0, getdaft 0.3.2, pandas 1.5.3.)
The offending code is:
df = daft.read_iceberg(tbl_iceberg)
print(f"num_daft_partitions: {df.num_partitions()}")
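For context, tbl_iceberg is loaded from the Hive Metastore via pyiceberg before being handed to Daft — roughly like this (the thrift URI and table identifier below are placeholders, not my real values):

import daft
from pyiceberg.catalog import load_catalog

# Connect to the Hive Metastore; URI is a placeholder
catalog = load_catalog("hive", **{"type": "hive", "uri": "thrift://metastore-host:9083"})

# Load the partitioned Iceberg table; identifier is a placeholder
tbl_iceberg = catalog.load_table("my_db.my_table")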
———-
Traceback (most recent call last):
File "huggingface_ts.py", line 756, in <module>
print(f"num_daft_partitions: {df.num_partitions()}")
File "/tmp/ray/session_2024-09-11_10-50-57_989820_8/runtime_resources/pip/ccd6546b9db45e126cc6cf4dab015ec053a14fb1/virtualenv/lib/python3.8/site-packages/daft/dataframe/dataframe.py", line 194, in num_partitions
return self.__builder.optimize().to_physical_plan_scheduler(daft_execution_config).num_partitions()
File "/tmp/ray/session_2024-09-11_10-50-57_989820_8/runtime_resources/pip/ccd6546b9db45e126cc6cf4dab015ec053a14fb1/virtualenv/lib/python3.8/site-packages/daft/logical/builder.py", line 67, in to_physical_plan_scheduler
return PhysicalPlanScheduler.from_logical_plan_builder(
File "/tmp/ray/session_2024-09-11_10-50-57_989820_8/runtime_resources/pip/ccd6546b9db45e126cc6cf4dab015ec053a14fb1/virtualenv/lib/python3.8/site-packages/daft/plan_scheduler/physical_plan_scheduler.py", line 30, in from_logical_plan_builder
scheduler = _PhysicalPlanScheduler.from_logical_plan_builder(builder._builder, daft_execution_config)
File "/tmp/ray/session_2024-09-11_10-50-57_989820_8/runtime_resources/pip/ccd6546b9db45e126cc6cf4dab015ec053a14fb1/virtualenv/lib/python3.8/site-packages/daft/iceberg/iceberg_scan.py", line 173, in to_scan_tasks
pspec = self._iceberg_record_to_partition_spec(self._table.specs()[file.spec_id], file.partition)
KeyError: None