# daft-dev
a
Hi team, I am facing an error while loading a Parquet file. I was able to load it through the Ray API:
```python
%%time
data_series_mau = ray.data.read_parquet(file_path_mau_users)
```
But when trying to do the same with Daft:
```python
%%time
data_series_mau = daft.read_parquet(file_path_mau_users)
```
I got the error below:
```
ValueError: DaftError::External File: s3://abcd/ is not a valid parquet file. Has incorrect footer: [58, 91, 93, 125]
```
cc @Rishabh Agarwal Jain
j
Hmm, interesting. Could you determine whether that specific file with the obfuscated URL (s3://abcd/) is indeed a Parquet file? An easy way to check is to download it and then try to read it with something like pandas!
(This error suggests that that particular file is not a Parquet file.) There may be a difference in behavior between Ray Data and Daft, where they might choose to ignore files with certain names when reading Parquet, etc.
a
```python
import pandas

data_series_mau_pd = pandas.read_parquet(file_path_mau_users)
data_series_mau_pd.shape
```
I'm able to read it through pandas.
j
Oh! I was thinking specifically of that file (s3://abcd/) from the error message, not `file_path_mau_users`. Or are they equivalent?
BTW, this is because every valid Parquet file should end with `PAR1`, which is just a magical sequence of bytes that lets us know this is a Parquet file 😛 My guess is that `file_path_mau_users` is a folder of some sort, and some of the files in there are not Parquet files, but Daft is erroneously reading those files and treating them as Parquet. Let me know!
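If you want to check that yourself, here's a minimal sketch of verifying the footer bytes directly (assuming you've downloaded the suspect file to a hypothetical local path):
```python
# Minimal sketch: a valid Parquet file ends with the 4 magic bytes b"PAR1".
# "suspect_file.parquet" is a hypothetical local copy of the file from the error message.
with open("suspect_file.parquet", "rb") as f:
    f.seek(-4, 2)  # seek to 4 bytes before end-of-file
    footer = f.read(4)

print(footer == b"PAR1")  # True for a valid Parquet file
```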
a
Yes, basically I save the data from a Spark DataFrame to Parquet in an S3 location from Databricks:
```python
file_path_mau_users = "s3://abcd/"
```
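For context, the write on the Databricks side presumably looks something like this (a sketch; `df` and the exact write options are assumptions, not from this thread):
```python
# Hypothetical PySpark write from Databricks producing the s3://abcd/ layout.
# Spark writes part-*.snappy.parquet data files into the directory, along with
# marker files such as _SUCCESS (and, on Databricks, _committed_* / _started_*).
df.write.mode("overwrite").parquet("s3://abcd/")
```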
j
Got it. Are there any non-Parquet files in that location? By default Daft should treat all files in there as Parquet, but attempt to skip files named _metadata and _success, which Spark writes out. Another option would be to be more selective, if you know the extension of the Parquet files:
```python
daft.read_parquet("s3://abcd/*.parquet")
```
a
Let me try and get back to you!
It's working, thanks!
j
Ahhh, good to hear! I'm very curious to know what file was in there that caused the issue though 😛 @Rishabh Agarwal Jain @Akshat Suwalka at some point, do you mind running an `aws s3 ls s3://abcd/` to list the files? I'm curious to see which files don't match the `*.parquet` extension. We currently allow-list the names `["_metadata", "_common_metadata", "_success"]` and also skip files that end with the `.crc` suffix, because to our knowledge those are some of the random files that Spark will write into the folder, but there might be more!
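For reference, the skipping behavior described above amounts to something like this minimal sketch (an illustration of the rule, not Daft's actual implementation):
```python
# Sketch of the filename filtering described above (illustrative, not Daft's real code).
IGNORED_NAMES = {"_metadata", "_common_metadata", "_success"}  # matched case-insensitively
IGNORED_SUFFIXES = (".crc",)

def is_probably_data_file(path: str) -> bool:
    """Return False for marker/checksum files that should not be read as Parquet."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name.lower() not in IGNORED_NAMES and not name.lower().endswith(IGNORED_SUFFIXES)
```
Note that under this rule, a name like _committed_1234 or _started_1234 would still be treated as a data file.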
a
```
_committed_1234 - 318B
_started_1234 - 0B
_SUCCESS - 0B
part-00000-tid-1234-1234c.snappy.parquet - 63MB
part-00000-tid-123456-1234c.snappy.parquet - 57MB
part-00000-tid-123478-1234c.snappy.parquet - 13MB
```
j
Ah… it's most definitely the `_committed_1234` file that's causing issues then. Hmm. cc @Colin Ho for thoughts here too on whether we should be ignoring those files as well.
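A side note on the original error: the four footer bytes [58, 91, 93, 125] decode to ':[]}', the tail end of a JSON document, which fits a _committed_* commit-log file being misread as Parquet.
```python
# Decoding the footer bytes reported in the original error message:
print(bytes([58, 91, 93, 125]).decode("ascii"))  # prints :[]}  (looks like the end of JSON)
```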