# general
k
If my wildcard happens to include some badly formatted parquets is there a way to catch them and maybe issue a warning but exclude them during the read instead of failing
j
This is quite challenging because we don’t actually perform per-file reads (each file can be split into many reads to be performed in a distributed way). There are some possibilities here:
1. The files aren’t supposed to be Parquet files at all (e.g. Spark writes a bunch of marker files like
_SUCCESS
when it finishes writing Parquet files); in this case we provide a list of file prefixes to ignore
2. The files are Parquet files, but were somehow corrupted or written incorrectly
Which case are you thinking of handling?
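A minimal sketch of the prefix/suffix filtering idea for case 1, assuming Python and a local directory; the `parquet_paths` helper and `IGNORE_PREFIXES` tuple are hypothetical names, not a real reader API:

```python
# Hypothetical sketch: drop writer marker/temp files (e.g. _SUCCESS, _metadata)
# by prefix, and keep only *.parquet paths, before handing the list to a reader.
import glob
import os

IGNORE_PREFIXES = ("_", ".")  # common marker/hidden-file prefixes

def parquet_paths(directory):
    """Return paths under `directory` that look like real Parquet data files."""
    paths = []
    for path in glob.glob(os.path.join(directory, "*")):
        name = os.path.basename(path)
        if name.startswith(IGNORE_PREFIXES):
            continue  # skip marker/temp files like _SUCCESS
        if not name.endswith(".parquet"):
            continue  # skip anything that isn't named like a Parquet file
        paths.append(path)
    return sorted(paths)
```

This only filters by name, so it helps with marker files but not with genuinely corrupted `.parquet` files.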
k
Both actually... I somehow have some .compact files left after the streaming write and also somehow some actual parquet files which are just malformed and only 4kb
j
Yeah, unfortunately there isn’t a magic pill here; the tool you used to run your compaction seems to be badly behaved 😬 You should at least be able to get around the
.compact
files by specifying a glob like
/*.parquet
?
Otherwise, your compaction tool really needs to output a manifest of “written files”
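For the malformed-file case, a cheap client-side pre-check is possible before kicking off the read: a structurally sound Parquet file starts and ends with the 4-byte magic `PAR1`. This is a sketch, assuming Python and local paths; `valid_parquet_files` is a hypothetical helper, not a reader API, and it only catches truncated or empty files, not all corruption:

```python
# Sketch: warn about and exclude files that are missing Parquet's "PAR1"
# magic bytes at either end, before passing the list to a distributed read.
import os
import warnings

MAGIC = b"PAR1"

def valid_parquet_files(paths):
    """Return the subset of `paths` that pass a cheap Parquet sanity check."""
    good = []
    for path in paths:
        # Minimum plausible size: leading magic + footer length + trailing magic.
        if os.path.getsize(path) < 12:
            warnings.warn(f"Skipping too-small Parquet file: {path}")
            continue
        with open(path, "rb") as f:
            head = f.read(4)
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
        if head == MAGIC and tail == MAGIC:
            good.append(path)
        else:
            warnings.warn(f"Skipping malformed Parquet file: {path}")
    return good
```

This would let a long job warn on and exclude the bad 4 kB files up front instead of failing partway through.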
k
Makes sense... it's just sad that after a long processing time the job dies because of one bad file 🥲🥲 I'll check the files more carefully before running the processing. Thank you!!
j
You’re welcome!