# daft-dev
c
Having trouble thinking of a good way to generate ids for the RedPajama dataset. They don't provide ids in their JSON files; it seems like they generate them at runtime. Their ids are of the form "(file path)/(row number within file)", such as `2023-06/0000/en_head.json.gz/137`. For benchmarking purposes, I would like to read the JSONs, add the id column, and save them to Parquet. However, in order to get the file path and generate row numbers, I would need to read the JSONs one by one instead of as a glob. This seems extremely slow, taking 2-3 seconds for each of 10,000 files. Is there a better way to do this? Should I just give up on having ids in their format and use a monotonically increasing id instead?
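For reference, a rough sketch of the per-file approach, assuming the files are gzipped line-delimited JSON; the dataset root and output naming are placeholders, but the id construction matches the format above:

```python
import gzip
import json
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder dataset root; the real layout is an assumption.
ROOT = pathlib.Path("redpajama")

def convert_one(json_path: pathlib.Path) -> None:
    # The id format is "(file path)/(row number within file)", so track
    # each row's position while streaming the gzipped JSONL file.
    rel = json_path.relative_to(ROOT).as_posix()
    rows = []
    with gzip.open(json_path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            record["id"] = f"{rel}/{i}"
            rows.append(record)
    out_path = json_path.with_name(json_path.name.replace(".json.gz", ".parquet"))
    pq.write_table(pa.Table.from_pylist(rows), out_path)

if __name__ == "__main__":
    # This is the slow serial loop: one file at a time.
    for path in sorted(ROOT.rglob("*.json.gz")):
        convert_one(path)
```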
Going to try using the `parallel` command to see if I can do it more quickly.
ETA 1 hour 🙂
Never mind, was doing it wrong, need to restart
j
To get this right, what functionality would you need? Perhaps something like a `daft.read_json(preserve_filepath=True, append_rownum=True)` which would pass in the filepath and row number within a file or something?
Or I guess a UDF would work today as well for the row_numbers, but not the filepath
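Something roughly like this for the row numbers; I'm guessing at the exact `daft.udf` decorator and signature here, and note the numbering would only be per partition, not global across files:

```python
import daft
from daft import col

# Sketch only: the decorator/signature is my guess at the UDF API, and the
# count resets per partition rather than running globally across files.
@daft.udf(return_dtype=daft.DataType.int64())
def row_number(text: daft.Series):
    # Emit 0..n-1 for however many rows this UDF call sees.
    return list(range(len(text)))

# Placeholder glob; "text" is just a guess at a column name to count against.
df = daft.read_json("redpajama/**/*.json.gz")
df = df.with_column("row_num", row_number(col("text")))
```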
c
Maybe it would append the file path and row number as separate columns?
j
Yeah that’s what I’m thinking…
c
In any case I optimized the script so it now runs a lot in parallel
Should complete much quicker
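Roughly what the parallelized driver looks like, assuming the `convert_one` function from the earlier sketch lives in a module; the worker count is just a guess, tune it to the machine:

```python
import pathlib
from concurrent.futures import ProcessPoolExecutor

from convert import convert_one  # the per-file function sketched above

ROOT = pathlib.Path("redpajama")

if __name__ == "__main__":
    files = sorted(ROOT.rglob("*.json.gz"))
    # Each file is independent, so fan the conversions out across processes.
    with ProcessPoolExecutor(max_workers=16) as pool:
        list(pool.map(convert_one, files, chunksize=8))
```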