Maurice Weber
05/14/2024, 8:17 PM
OSError: When completing multiple part upload for key 'path/to/output_file.parquet' in bucket 'XXXX': AWS Error UNKNOWN (HTTP status 400) during CompleteMultipartUpload operation: Unable to parse ExceptionName: InvalidPart Message: One or more of the specified parts could not be found.
• In the second use case I run MinhashLSH on 10k parquet files (~1.4TB on disk). The pipeline consists of (1) computing minhash signatures, (2) a groupby on the signatures, and (3) distributed connected components. I'm using a Ray cluster with 5 nodes here, each with 200 cores. In this case, warnings and then errors are thrown during reading -- at first, I get something that looks like a warning:
Encountered error while streaming bytes into parquet reader. This may show up as a Thrift Error Downstream: Cached error: Unable to read data from file
<s3://bucket/datasets/input_file.parquet>: error reading a body from connection: end of file before message length reached
and then, some time later the following gets thrown, again interrupting the workflow:
daft.exceptions.DaftCoreException: DaftError::ArrowError External format error: File out of specification: underlying IO error: Cached error: Unable to read data from file <s3://bucket/datasets/input_file.parquet>: error reading a body from connection: end of file before message length reached
do you have any hints on how I can deal with such errors?
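
For readers skimming the thread, here is a minimal sketch of what stages (1) and (2) of the pipeline described above can look like with Daft on Ray. The column names (doc_id, text), the toy compute_minhash helper, and the exact expression calls are illustrative assumptions, not the actual workload code:

```python
import hashlib

import daft
from daft import col

# Assumptions: a Ray cluster is already running and the input parquet files
# have `doc_id` and `text` columns. This sketches stages (1) and (2) only;
# stage (3), distributed connected components, runs over the candidate groups.
daft.context.set_runner_ray()


def compute_minhash(text: str, num_hashes: int = 16) -> list[int]:
    # Toy minhash over 5-character shingles; a real pipeline would use proper
    # shingling, many more hash functions, and LSH banding.
    shingles = {text[i : i + 5] for i in range(max(1, len(text) - 4))}
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles
        )
        for seed in range(num_hashes)
    ]


@daft.udf(return_dtype=daft.DataType.list(daft.DataType.uint64()))
def minhash_signature(texts):
    # (1) compute a minhash signature per document
    return [compute_minhash(t or "") for t in texts.to_pylist()]


df = daft.read_parquet("s3://bucket/datasets/*.parquet")
df = df.with_column("signature", minhash_signature(col("text")))

# (2) group documents that share a signature value to get duplicate candidates
candidates = (
    df.explode(col("signature"))
    .groupby("signature")
    .agg(col("doc_id").agg_list())
)
```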

jay
05/14/2024, 10:10 PM

Sammy Sidhu
05/14/2024, 10:19 PM

Sammy Sidhu
05/14/2024, 10:29 PM

Sammy Sidhu
05/14/2024, 10:29 PM
You can bump the num_tries number from 5 to something like 15 or 20 in the S3Config to combat the heavy load.
https://www.getdaft.io/projects/docs/en/stable/api_docs/doc_gen/io_configs/daft.io.S3Config.html#daft-io-s3config
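
For concreteness, a minimal sketch of what that looks like; the R2 endpoint, credentials, and the max_connections value below are placeholders, and passing the IOConfig per read is just one way to wire it up:

```python
import daft
from daft.io import IOConfig, S3Config

# R2 is reached through its S3-compatible API; endpoint and keys are placeholders.
s3_config = S3Config(
    endpoint_url="https://<account_id>.r2.cloudflarestorage.com",
    key_id="<access_key_id>",
    access_key="<secret_access_key>",
    num_tries=20,        # bumped from 5, as suggested above
    max_connections=8,   # optionally lower per-executor concurrency to ease load
)
io_config = IOConfig(s3=s3_config)

df = daft.read_parquet("s3://bucket/datasets/*.parquet", io_config=io_config)
```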

Maurice Weber
05/15/2024, 7:45 PM
For workload 1:
• We actually use pyarrow's parquet writer to write out each partition, and it does look like it is coming from there.
• "I think we need to tune some of the s3 concurrency parameters there to not overwhelm the r2 connections." Are you referring to parameters aside from the num_tries and max connections set in the S3Config?
• "What version of pyarrow are you using here?" I'm on version 14.0.2.
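
For reference on the writer side, here is a sketch of a per-partition write with pyarrow's parquet writer against an R2 endpoint (the bucket, path, and credentials are placeholders, and this is not the exact write path used in the workload):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# pyarrow (14.0.2 here) talks to R2 through its S3-compatible API.
r2 = fs.S3FileSystem(
    endpoint_override="https://<account_id>.r2.cloudflarestorage.com",
    access_key="<access_key_id>",
    secret_key="<secret_access_key>",
)

table = pa.table({"id": [1, 2, 3], "text": ["a", "b", "c"]})

# Larger partitions are uploaded as multipart uploads under the hood, which is
# where a missing/invalid part surfaces as the CompleteMultipartUpload error above.
pq.write_table(table, "bucket/path/to/output_file.parquet", filesystem=r2)
```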

For the second use case:
• "Are you reading data from R2 again or from S3?" I'm reading from R2 (we don't use S3 at all).
• "It may be that R2 is much more flakey than S3. And if that is the case, we can add more retry mechanisms around stream interruption." How is stream interruption handled currently? Is there any way we can retry or drop the reading task if a stream got interrupted?
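
On that last question: Daft's internal handling isn't shown in this thread, but one coarse application-level workaround is to wrap the materialization in a retry loop and re-run it on transient read failures, for example:

```python
import time

import daft
from daft.exceptions import DaftCoreException


def collect_with_retries(df: daft.DataFrame, max_attempts: int = 3, backoff_s: float = 5.0):
    """Re-run a materialization if a read stream gets interrupted.

    Coarse workaround: it retries the whole query rather than the single
    failed read task, so it only makes sense for idempotent reads.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return df.collect()
        except DaftCoreException as exc:
            transient = "error reading a body from connection" in str(exc)
            if attempt == max_attempts or not transient:
                raise
            time.sleep(backoff_s * attempt)


# usage (path is a placeholder)
# df = daft.read_parquet("s3://bucket/datasets/*.parquet", io_config=io_config)
# result = collect_with_retries(df)
```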