# daft-dev
d
The source table was created in PySpark 3.4 using DataFrameWriterV2 and has parquet format_version = 1
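For context, a write along these lines would produce such a table. This is a minimal sketch only: the catalog, namespace, table, and column names are placeholders, it assumes the Iceberg Spark runtime and a catalog are already configured in the Spark session, and mapping "format_version = 1" onto the Iceberg "format-version" table property is an assumption.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("iceberg-writer").getOrCreate()

# Placeholder data; "sn" is the identity partition column that shows up in the spec output later in this thread.
df = spark.createDataFrame([("A123", 1), ("B456", 2)], ["sn", "value"])

(
    df.writeTo("my_catalog.db.events")         # DataFrameWriterV2 entry point
      .using("iceberg")
      .partitionedBy(col("sn"))                # identity partitioning on sn
      .tableProperty("format-version", "1")    # assumed meaning of "format_version = 1"
      .createOrReplace()
)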
k
Huh, this is interesting. Could you run
print(tbl_iceberg.specs())
for me?
Also, has the partitioning ever been changed for this table?
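For reference, tbl_iceberg in these snippets is presumably a pyiceberg Table handle; a minimal sketch of how it might be loaded, with a placeholder catalog name and table identifier:
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")              # reads config from ~/.pyiceberg.yaml or environment variables
tbl_iceberg = catalog.load_table("db.events")

# specs() returns a mapping of spec_id -> PartitionSpec known to the table
print(tbl_iceberg.specs())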
d
The partitioning has never been changed. Here is what I get with the print command: {0: PartitionSpec(PartitionField(source_id=1, field_id=1000, transform=IdentityTransform(), name='sn'), spec_id=0)}
k
I see. Could you run one more command for me?
print([task.file for task in tbl_iceberg.scan().plan_files()])
I'm just curious why the file apparently does not have a spec ID.
d
There is a lot of output including some underlying data that unfortunately I can’t share. What are you looking for?
spec_id=None
k
Ah, all good. Just looking for the spec IDs, actually. Are all of them None, or are some 0?
d
Many instances of: partition=Record[sn='…']
Just a sec, looking over the output
k
You could maybe try
print([task.file.spec_id for task in tbl_iceberg.scan().plan_files()])
to just print out the spec IDs
d
Yes, every instance of spec ID is None
k
Huh, that's interesting. Would you be able to try a later version of pyiceberg?
d
@Kevin Wang I greatly appreciate your help debugging this. I'll be driving for the next hour but will happily run any more commands as soon as I get to my destination.
Sure, although I settled on pyiceberg 0.4.0 because of a dependency conflict. But let's see what happens …
k
Sounds good, thanks for reporting this!
I'll dig a little more to see what else could have caused this and will let you know if I find something.
d
Amazing, thank you!
Yeah, there is a pydantic version conflict between ray 2.8.1 and later versions of pyiceberg, and I'm stuck with ray 2.8.1 for my ray cluster. I could, however, run a later version of ray locally to test out a more recent pyiceberg version. I'll try this later tonight.
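A quick way to confirm what the test environment actually contains is to print the installed package versions; a minimal sketch (only ray 2.8.1 and pyiceberg 0.4.0 are known from this thread, the rest are just the packages involved in the conflict):
from importlib.metadata import version, PackageNotFoundError

# Print the versions of the packages relevant to the conflict described above.
for pkg in ("pyiceberg", "ray", "pydantic", "pyspark"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")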