# daft-dev
a
Hi team, I am facing an error while loading a Parquet file. I was able to load it through the Ray API:
```python
%%time
data_series_mau = ray.data.read_parquet(file_path_mau_users)
```
But when trying to do the same with Daft:
```python
%%time
data_series_mau = daft.read_parquet(file_path_mau_users)
```
I got the error below:
```
ValueError: DaftError::External File: s3://abcd/ is not a valid parquet file. Has incorrect footer: [58, 91, 93, 125]
```
cc @Rishabh Agarwal Jain
j
Hmm, interesting. Could you determine whether that specific file with the obfuscated URL (s3://abcd/) is indeed a Parquet file? An easy way to check is to download it and then try to read it with something like pandas!
(This error suggests that that particular file is not a Parquet file.) There may be a difference in behavior between Ray Data and Daft, where they might choose to ignore files with certain names when reading Parquet, etc.
a
```python
import pandas

data_series_mau_pd = pandas.read_parquet(file_path_mau_users)
data_series_mau_pd.shape
```
I'm able to read it through pandas.
j
Oh! I was thinking specifically of that file (s3://abcd/) from the error message, not `file_path_mau_users`. Or are they equivalent?
BTW, this is because every valid Parquet file should end with `PAR1`, which is just a magical sequence of bytes that lets us know this is a Parquet file 😛 My guess is that `file_path_mau_users` is a folder of some sort, and some of the files in there are not Parquet files, but Daft is erroneously reading those files and treating them as Parquet. Let me know!
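If you want to check that yourself, here's a minimal sketch of verifying the footer bytes directly (assuming you've downloaded the suspect file to a hypothetical local path):
```python
# Minimal sketch: a valid Parquet file ends with the 4 magic bytes b"PAR1".
# "suspect_file.parquet" is a hypothetical local copy of the file from the error message.
with open("suspect_file.parquet", "rb") as f:
    f.seek(-4, 2)  # seek to 4 bytes before end-of-file
    footer = f.read(4)

print(footer == b"PAR1")  # True for a valid Parquet file
```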
a
Yes, basically I save the data from a Spark DataFrame to Parquet in an S3 location from Databricks:
```python
file_path_mau_users = "s3://abcd/"
```
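For context, the write on the Databricks side presumably looks something like this (a sketch; `df` and the exact write options are assumptions, not from this thread):
```python
# Hypothetical PySpark write from Databricks producing the s3://abcd/ layout.
# Spark writes part-*.snappy.parquet data files into the directory, along with
# marker files such as _SUCCESS (and, on Databricks, _committed_* / _started_*).
df.write.mode("overwrite").parquet("s3://abcd/")
```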
j
Got it. Are there any non-Parquet files in that location? By default Daft should treat all files in there as Parquet, but attempt to skip files named _metadata and _success, which Spark writes out. Another option would be to be more selective, if you know the extension of the Parquet files:
```python
daft.read_parquet("s3://abcd/*.parquet")
```
a
Let me try and get back to you!
It's working, thanks!
j
Ahhh, good to hear! I'm very curious to know what file was in there that caused the issue though 😛 @Rishabh Agarwal Jain @Akshat Suwalka at some point, do you mind running an `aws s3 ls s3://abcd/` to list the files? I'm curious to see which files don't match the `*.parquet` extension. We currently allow-list the names `["_metadata", "_common_metadata", "_success"]` and also skip files that end with the `.crc` suffix, because to our knowledge those are some of the random files that Spark will write into the folder, but there might be more!
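For reference, the skipping behavior described above amounts to something like this minimal sketch (an illustration of the rule, not Daft's actual implementation):
```python
# Sketch of the filename filtering described above (illustrative, not Daft's real code).
IGNORED_NAMES = {"_metadata", "_common_metadata", "_success"}  # matched case-insensitively
IGNORED_SUFFIXES = (".crc",)

def is_probably_data_file(path: str) -> bool:
    """Return False for marker/checksum files that should not be read as Parquet."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name.lower() not in IGNORED_NAMES and not name.lower().endswith(IGNORED_SUFFIXES)
```
Note that under this rule, a name like _committed_1234 or _started_1234 would still be treated as a data file.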
a
```
_committed_1234 - 318B
_started_1234 - 0B
_SUCCESS - 0B
part-00000-tid-1234-1234c.snappy.parquet - 63MB
part-00000-tid-123456-1234c.snappy.parquet - 57MB
part-00000-tid-123478-1234c.snappy.parquet - 13MB
```
j
Ah… it's most definitely the `_committed_1234` file that's causing issues then. Hmm. cc @Colin Ho for thoughts here too on whether we should be ignoring those files as well.
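A side note on the original error: the four footer bytes [58, 91, 93, 125] decode to ':[]}', the tail end of a JSON document, which fits a _committed_* commit-log file being misread as Parquet.
```python
# Decoding the footer bytes reported in the original error message:
print(bytes([58, 91, 93, 125]).decode("ascii"))  # prints :[]}  (looks like the end of JSON)
```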