Conor Kennedy
08/26/2024, 10:29 PM
…2023-06/0000/en_head.json.gz/137.
For benchmarking purposes, I would like to read the JSONs, add the id column, and save them to Parquet. However, in order to get the file path and generate row numbers, I would need to read the JSONs one by one instead of as a glob. This seems extremely slow, taking 2-3 seconds per file across 10,000 files. Is there a better way to do this? Should I just give up on having ids in their format and use a monotonically increasing id instead?

Conor Kennedy
08/26/2024, 10:48 PM
…the `parallel` command to see if I can do it more quickly.
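The per-file approach described above (read each JSON-lines file, tag every row with its file path and row number, then write Parquet) can be sketched in plain Python. This is an illustrative stand-in, not Daft's API: the function names are made up, the Parquet write step is only noted in a comment, and the process-pool fan-out is one possible way to attack the 2-3 s/file bottleneck.

```python
# Hypothetical sketch: build ids of the form "<file path>/<row number>"
# while reading newline-delimited JSON files one by one.
import json
from concurrent.futures import ProcessPoolExecutor


def add_ids(path: str) -> list[dict]:
    """Read one JSON-lines file and attach an id to each row."""
    records = []
    with open(path) as f:
        for rownum, line in enumerate(f):
            record = json.loads(line)
            # id format from the thread: file path + "/" + row number
            record["id"] = f"{path}/{rownum}"
            records.append(record)
    # In the real pipeline each file's records would be written to
    # Parquet here (e.g. via pyarrow.parquet.write_table).
    return records


def convert_all(paths: list[str]) -> list[list[dict]]:
    # Each file is independent, so the per-file work can be fanned out
    # across processes instead of handled serially.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(add_ids, paths))
```

Parallelizing across processes (or with GNU `parallel` at the shell level, as mentioned above) helps with throughput, but each file is still opened and parsed individually.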
Conor Kennedy
08/26/2024, 11:01 PM
…

Conor Kennedy
08/26/2024, 11:04 PM
…

jay
08/26/2024, 11:18 PM
…`daft.read_json(preserve_filepath=True, append_rownum=True)`, which would pass in the filepath and row number within a file or something?
jay
08/26/2024, 11:18 PM
…

Conor Kennedy
08/26/2024, 11:20 PM
…

jay
08/26/2024, 11:21 PM
…

Conor Kennedy
08/26/2024, 11:21 PM
…

Conor Kennedy
08/26/2024, 11:21 PM
…