# daft-dev
d
The source table was created in PySpark 3.4 using DataFrameWriterV2 and has parquet format_version = 1
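For context, a write along these lines would produce such a table. This is a minimal sketch only: the catalog, namespace, table, and column names are placeholders, it assumes the Iceberg Spark runtime and a catalog are already configured in the Spark session, and mapping "format_version = 1" onto the Iceberg "format-version" table property is an assumption.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("iceberg-writer").getOrCreate()

# Placeholder data; "sn" is the identity partition column that shows up in the spec output later in this thread.
df = spark.createDataFrame([("A123", 1), ("B456", 2)], ["sn", "value"])

(
    df.writeTo("my_catalog.db.events")         # DataFrameWriterV2 entry point
      .using("iceberg")
      .partitionedBy(col("sn"))                # identity partitioning on sn
      .tableProperty("format-version", "1")    # assumed meaning of "format_version = 1"
      .createOrReplace()
)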
k
Huh, this is interesting. Could you run
print(tbl_iceberg.specs())
for me?
Also, has the partitioning ever been changed for this table?
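For reference, tbl_iceberg in these snippets is presumably a pyiceberg Table handle; a minimal sketch of how it might be loaded, with a placeholder catalog name and table identifier:
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")              # reads config from ~/.pyiceberg.yaml or environment variables
tbl_iceberg = catalog.load_table("db.events")

# specs() returns a mapping of spec_id -> PartitionSpec known to the table
print(tbl_iceberg.specs())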
d
The partitioning has never been changed. Here is what I get with the print command: {0: PartitionSpec(PartitionField(source_id=1, field_id=1000, transform=IdentityTransform(), name='sn'), spec_id=0)}
k
I see. Could you run one more command for me?
print([task.file for task in tbl_iceberg.scan().plan_files()])
I'm just curious why the file apparently does not have a spec ID.
d
There is a lot of output including some underlying data that unfortunately I can’t share. What are you looking for?
spec_id=None
k
Ah, all good. Just looking for the spec IDs, actually. Are all of them None, or are some 0?
d
Many instances of: partition=Record[sn='…']
Just a sec, looking over the output
k
You could maybe try
print([task.file.spec_id for task in tbl_iceberg.scan().plan_files()])
to just print out the spec IDs
d
Yes, every instance of spec ID is None
k
Huh, that's interesting. Would you be able to try a later version of pyiceberg?
d
@Kevin Wang I greatly appreciate your help debugging this. I'll be driving for the next hour but will happily run any more commands as soon as I get to my destination.
Sure, although I settled on pyiceberg 0.4.0 because of a dependency conflict. But let's see what happens …
k
Sounds good, thanks for reporting this!
I'll dig a little more to see what else could have caused this and will let you know if I find something.
d
Amazing, thank you!
Yeah, there is a pydantic version conflict between ray 2.8.1 and later versions of pyiceberg, and I'm stuck with ray 2.8.1 for my ray cluster. I could, however, run a later version of ray locally to test out a more recent pyiceberg version. I'll try this later tonight.
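A quick way to confirm what the test environment actually contains is to print the installed package versions; a minimal sketch (only ray 2.8.1 and pyiceberg 0.4.0 are known from this thread, the rest are just the packages involved in the conflict):
from importlib.metadata import version, PackageNotFoundError

# Print the versions of the packages relevant to the conflict described above.
for pkg in ("pyiceberg", "ray", "pydantic", "pyspark"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")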