Akshat Suwalka — 04/15/2024, 1:13 PM

%%time
data_series_mau = ray.data.read_parquet(file_path_mau_users)

But when trying to do the same with Daft:

%%time
data_series_mau = daft.read_parquet(file_path_mau_users)

I got the error below:

ValueError: DaftError::External File: s3://abcd/ is not a valid parquet file. Has incorrect footer: [58, 91, 93, 125]

cc @Rishabh Agarwal Jain
jay — 04/15/2024, 6:23 PM
[message not captured]

jay — 04/15/2024, 6:24 PM
[message not captured]

Akshat Suwalka — 04/16/2024, 10:25 AM

data_series_mau_pd = pandas.read_parquet(file_path_mau_users)
data_series_mau_pd.shape

Able to read it through pandas.

jay — 04/16/2024, 4:06 PM
… file_path_mau_users. Or are they equivalent?

jay — 04/17/2024, 3:37 AM
… PAR1, which is just a magical sequence of bytes that lets us know this is a Parquet file 😛

My guess is that file_path_mau_users is a folder of some sort, and some of the files in there are not Parquet files, but Daft is erroneously reading those files and treating them as Parquet. Let me know!
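[Editor's note] As an aside on that error message: the reported footer bytes [58, 91, 93, 125] decode to the ASCII text `:[]}`, which looks like the tail of a JSON document rather than Parquet's PAR1 magic. A minimal sketch of the check jay describes (a hypothetical helper for illustration, not Daft's actual implementation):

```python
# A valid Parquet file both starts and ends with the 4-byte magic b"PAR1"
# (the trailing magic is preceded by the footer metadata and its length).
PARQUET_MAGIC = b"PAR1"

def looks_like_parquet(path: str) -> bool:
    """Cheap sanity check: does the file carry Parquet's magic bytes?"""
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, 2)          # seek to 4 bytes before end-of-file
        footer = f.read(4)
    return header == PARQUET_MAGIC and footer == PARQUET_MAGIC

# The footer bytes from the error decode to ":[]}" -- the tail of a JSON
# document, consistent with a Spark bookkeeping file, not a Parquet file.
print(bytes([58, 91, 93, 125]).decode("ascii"))  # -> :[]}
```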
Akshat Suwalka — 04/18/2024, 2:43 AM
[message not captured]

Akshat Suwalka — 04/18/2024, 2:45 AM
[message not captured]

jay — 04/18/2024, 2:51 AM

daft.read_parquet("s3://abcd/*.parquet")

Akshat Suwalka — 04/18/2024, 4:07 AM
[message not captured]

Akshat Suwalka — 04/18/2024, 2:07 PM
[message not captured]

jay — 04/18/2024, 5:11 PM

… aws s3 ls s3://abcd/ to list the files? I'm curious to see which files don't match the *.parquet extension.

We currently allow-list ["_metadata", "_common_metadata", "_success"], and also files that end with [".crc"] suffixes, because to our knowledge those are some of the random files that Spark will write into the folder — but there might be more!
Akshat Suwalka — 04/19/2024, 7:43 AM

_committed_1234 - 318 B
_started_1234 - 0 B
_SUCCESS - 0 B
part-00000-tid-1234-1234c.snappy.parquet - 63 MB
part-00000-tid-123456-1234c.snappy.parquet - 57 MB
part-00000-tid-123478-1234c.snappy.parquet - 13 MB
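[Editor's note] Based on that listing, the `_committed_*` and `_started_*` files are Spark/Databricks transaction markers that aren't in the allow-list jay mentioned. A sketch of the kind of filtering being discussed (the prefix rules here are assumed from this thread, not taken from Daft's actual source):

```python
# Skip known non-data bookkeeping files when scanning a Parquet folder.
# The exact names and ".crc" rule mirror jay's description; the
# "_committed_" / "_started_" prefixes are the extra cases from this thread.
IGNORED_NAMES = {"_metadata", "_common_metadata", "_success"}
IGNORED_PREFIXES = ("_committed_", "_started_")
IGNORED_SUFFIXES = (".crc",)

def is_data_file(name: str) -> bool:
    lower = name.lower()
    return not (
        lower in IGNORED_NAMES
        or lower.startswith(IGNORED_PREFIXES)
        or lower.endswith(IGNORED_SUFFIXES)
    )

# The folder listing from this thread:
listing = [
    "_committed_1234",
    "_started_1234",
    "_SUCCESS",
    "part-00000-tid-1234-1234c.snappy.parquet",
    "part-00000-tid-123456-1234c.snappy.parquet",
    "part-00000-tid-123478-1234c.snappy.parquet",
]
data_files = [name for name in listing if is_data_file(name)]
print(data_files)  # only the three .snappy.parquet part files remain
```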
jay — 04/19/2024, 8:08 PM

… the _committed_1234 file that's causing issues then. Hmm.

cc @Colin Ho for thoughts here too on whether we should be ignoring those files as well.