Cory Grinstead
05/31/2024, 3:14 PM
ScanTask. When we do this, we lose all of the FileMetadata that we obtained.
So when the task is materialized, we need to go fetch that again. Instead, we can just make that data serializable, attach it to the task, then the physical planner can skip those extra steps.
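A minimal sketch of that idea, not the actual Daft or parquet2 code: the simplified FileMetadata fields, the JSON encoding, and the fetch_metadata and materialize helpers are all hypothetical names chosen for illustration. It only shows how carrying serialized metadata on the task lets materialization skip the second fetch.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

# Counts how often the (pretend) remote metadata fetch runs.
FETCHES = {"count": 0}


@dataclass
class FileMetadata:
    """Hypothetical, heavily simplified stand-in for parquet file metadata."""

    num_rows: int
    row_group_sizes: List[int]

    def to_bytes(self) -> bytes:
        # JSON is an assumption here; any stable encoding would do.
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "FileMetadata":
        return cls(**json.loads(raw.decode("utf-8")))


@dataclass
class ScanTask:
    """Hypothetical task that optionally carries serialized metadata."""

    uri: str
    metadata: Optional[bytes] = None


def fetch_metadata(uri: str) -> FileMetadata:
    """Stand-in for reading the parquet footer from storage."""
    FETCHES["count"] += 1
    return FileMetadata(num_rows=100, row_group_sizes=[60, 40])


def materialize(task: ScanTask) -> FileMetadata:
    """Reuse attached metadata when present, otherwise fetch it again."""
    if task.metadata is not None:
        return FileMetadata.from_bytes(task.metadata)
    return fetch_metadata(task.uri)


if __name__ == "__main__":
    meta = fetch_metadata("s3://bucket/a.parquet")
    task = ScanTask(uri="s3://bucket/a.parquet", metadata=meta.to_bytes())
    materialize(task)  # served from the attached bytes, no extra fetch
    print(FETCHES["count"])
```

With the metadata attached, the only fetch is the one done at planning time; a task built without it falls back to fetching again.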
But first, we need to make the parquet FileMetadata (de)serializable:
https://github.com/Eventual-Inc/parquet2/pull/2
Sammy Sidhu
05/31/2024, 4:01 PM
sammy/owned-page-stream on the parquet2 fork. But it probably makes sense to just stack those commits and your PR on top of main now since parquet2 is no longer maintained.
Cory Grinstead
05/31/2024, 6:01 PM
since parquet2 is no longer maintained, we could just move both of those (arrow2 and parquet2) into the daft repo. (polars did this after arrow2 was archived)
Cory Grinstead
06/03/2024, 4:36 PM
parquet2 to make it serializable, one thing we discussed was potentially moving the arrow2 and parquet2 forks into crates inside daft.
@jay mentioned that this was something that y'all were already considering.
Cory Grinstead
06/03/2024, 4:44 PM
Sammy Sidhu
06/03/2024, 4:45 PM
Sammy Sidhu
06/03/2024, 4:46 PM
Cory Grinstead
06/04/2024, 2:06 PM
Cory Grinstead
06/05/2024, 6:56 PM
Sammy Sidhu
06/06/2024, 6:59 PM
Cory Grinstead
06/06/2024, 7:20 PM
Colin Ho
06/07/2024, 6:10 PM
Cory Grinstead
06/07/2024, 8:54 PM
Sammy Sidhu
06/07/2024, 8:55 PM
Cory Grinstead
06/11/2024, 8:00 PM
Cory Grinstead
06/11/2024, 8:00 PM
Cory Grinstead
06/11/2024, 8:15 PM
Cory Grinstead
06/11/2024, 9:12 PM
Sammy Sidhu
06/11/2024, 9:33 PM
Cory Grinstead
06/11/2024, 9:51 PM
metadata[] isn't maintaining the same ordering as the uris[], so it's attempting to get the wrong metadata for the files.
Cory Grinstead
06/11/2024, 10:04 PM
jay
06/11/2024, 10:07 PM
Sammy Sidhu
06/11/2024, 10:07 PM
parquet_metadata to different files which would be incorrect! When we glob, we typically only read 1 parquet metadata to infer the schema. But the rest of the globs and parquet metadata fetches actually occur here where we split row groups. Note: if we have more than a certain number of files, we skip row group splitting!
Cory Grinstead
06/12/2024, 5:01 PM