# general
k
I'm trying to download this repo using the new hf path, and managed to succeed for a bunch of repos, but this repo consistently gives an error (although for different parquets occasionally). Path I'm using: `hf://datasets/AlgorithmicResearchGroup/arxiv_research_code/`

```
DaftCoreException: DaftError::External Cached error: Unable to open file https://huggingface.co/api/datasets/AlgorithmicResearchGroup/arxiv_research_code/parquet/default/train/78.parquet: reqwest::Error { kind: Status(400), url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("huggingface.co")), port: None, path: "/api/datasets/AlgorithmicResearchGroup/arxiv_research_code/parquet/default/train/78.parquet", query: None, fragment: None } }
```

Any ideas why this may be failing?
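For reference, this is roughly what I'm running (a minimal sketch; assumes a recent Daft version where `daft.read_parquet` accepts `hf://` paths, and `out/` is a hypothetical local destination):

```python
import daft

# Read the auto-converted parquet files for the dataset via Daft's hf:// scheme.
df = daft.read_parquet("hf://datasets/AlgorithmicResearchGroup/arxiv_research_code/")

# Materialize locally; "out/" is a placeholder output directory.
df.write_parquet("out/")
```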
j
Is this flaky? I.e. if we threw more retries at it, would it go through?
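One crude way to test that hypothesis from the user side (a hypothetical helper, not a Daft API; it just re-runs the whole read if anything fails):

```python
import time

def with_retries(fn, attempts=5, base_delay=1.0):
    """Hypothetical helper: call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)

# Example: wrap the whole read + write so a transient 400 triggers a fresh attempt.
# with_retries(lambda: daft.read_parquet(PATH).write_parquet("out/"))
```

If it succeeds under retries it's flaky; if it fails all attempts on the same repo, something about this dataset is different.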
k
I tried quite a few more times, but it's consistently failing. After restarting, I'm now getting this instead:

```
DaftCoreException: DaftError::ArrowError External format error: Operation would exceed memory use threshold
```
Some of the parquets would get written before the error popped up. The row count across the parquets that did get written was about 70% of the total row count I'd expect on success.
k
Hi @Kyle, I was able to successfully download the same dataset and couldn't reproduce your first error. The second error you're seeing is likely due to the memory constraints of your machine: even though the dataset is 21.6GB on disk, parquet is compressed and encoded, so I believe the decoded in-memory size is more than 60GB.
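If the goal is just a local copy of the files rather than a decoded DataFrame, one way to sidestep the in-memory decode entirely (a sketch, assuming `huggingface_hub` is installed; note this fetches the original repo files, which may differ from the auto-converted parquet the `hf://` path serves):

```python
from huggingface_hub import snapshot_download

# Download the dataset repo's raw files into the local HF cache.
# No parquet decoding happens, so memory use stays flat regardless of decoded size.
local_dir = snapshot_download(
    repo_id="AlgorithmicResearchGroup/arxiv_research_code",
    repo_type="dataset",
)
print(local_dir)
```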