# daft-dev
c
Having trouble thinking of a good way to generate ids for the RedPajama dataset. They don't provide ids in their JSON files; it seems like they generate them at runtime. Their ids are of the form "(file path)/(row number within file)", such as `2023-06/0000/en_head.json.gz/137`. For benchmarking purposes, I would like to read the JSONs, add the id column, and save them to Parquet. However, in order to get the file path and generate row numbers, I would need to read the JSONs one by one instead of as a glob. This seems extremely slow, taking 2-3 seconds for each of 10,000 files. Is there a better way to do this? Should I just give up on having ids in their format and use a monotonically increasing id instead?
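For reference, a rough sketch of the per-file approach, assuming the files are gzipped line-delimited JSON; the dataset root and output naming are placeholders, but the id construction matches the format above:

```python
import gzip
import json
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder dataset root; the real layout is an assumption.
ROOT = pathlib.Path("redpajama")

def convert_one(json_path: pathlib.Path) -> None:
    # The id format is "(file path)/(row number within file)", so track
    # each row's position while streaming the gzipped JSONL file.
    rel = json_path.relative_to(ROOT).as_posix()
    rows = []
    with gzip.open(json_path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            record["id"] = f"{rel}/{i}"
            rows.append(record)
    out_path = json_path.with_name(json_path.name.replace(".json.gz", ".parquet"))
    pq.write_table(pa.Table.from_pylist(rows), out_path)

if __name__ == "__main__":
    # This is the slow serial loop: one file at a time.
    for path in sorted(ROOT.rglob("*.json.gz")):
        convert_one(path)
```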
Going to try using the `parallel` command to see if I can do it more quickly.
ETA 1 hour 🙂
Never mind, was doing it wrong, need to restart
j
To get this right, what functionality would you need? Perhaps something like a `daft.read_json(preserve_filepath=True, append_rownum=True)` which would pass in the filepath and row number within a file or something?
Or I guess a UDF would work today as well for the row_numbers, but not the filepath
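Something roughly like this for the row numbers; I'm guessing at the exact `daft.udf` decorator and signature here, and note the numbering would only be per partition, not global across files:

```python
import daft
from daft import col

# Sketch only: the decorator/signature is my guess at the UDF API, and the
# count resets per partition rather than running globally across files.
@daft.udf(return_dtype=daft.DataType.int64())
def row_number(text: daft.Series):
    # Emit 0..n-1 for however many rows this UDF call sees.
    return list(range(len(text)))

# Placeholder glob; "text" is just a guess at a column name to count against.
df = daft.read_json("redpajama/**/*.json.gz")
df = df.with_column("row_num", row_number(col("text")))
```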
c
Maybe it would append the file path and row number as separate columns?
j
Yeah that’s what I’m thinking…
c
In any case I optimized the script so it now runs a lot in parallel
Should complete much quicker
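Roughly what the parallelized driver looks like, assuming the `convert_one` function from the earlier sketch lives in a module; the worker count is just a guess, tune it to the machine:

```python
import pathlib
from concurrent.futures import ProcessPoolExecutor

from convert import convert_one  # the per-file function sketched above

ROOT = pathlib.Path("redpajama")

if __name__ == "__main__":
    files = sorted(ROOT.rglob("*.json.gz"))
    # Each file is independent, so fan the conversions out across processes.
    with ProcessPoolExecutor(max_workers=16) as pool:
        list(pool.map(convert_one, files, chunksize=8))
```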