# daft-dev
c
So I think I have a plan forward to remove the additional metadata scans. It appears we already have all of the parquet metadata we need during logical planning, but we then convert it to a serializable `ScanTask`. When we do this, we lose all of the `FileMetadata` that we obtained, so when the task is materialized, we need to go fetch that again. Instead, we can just make that data serializable and attach it to the task; then the physical planner can skip those extra steps. But first, we need to make the parquet `FileMetadata` (de)serializable: https://github.com/Eventual-Inc/parquet2/pull/2
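Roughly, the idea is something like this minimal sketch, assuming hypothetical type and field names rather than Daft's actual `ScanTask` definition:

```rust
use serde::{Deserialize, Serialize};

// Stand-in for parquet2's FileMetadata once it can derive these traits;
// the real struct carries schema, row group, and column chunk info.
#[derive(Serialize, Deserialize)]
struct ParquetFileMetadata {
    num_rows: usize,
}

// Hypothetical scan task that carries the footer metadata it was
// planned with, so the executor never has to re-fetch it.
#[derive(Serialize, Deserialize)]
struct ScanTask {
    uri: String,
    parquet_metadata: Option<ParquetFileMetadata>,
}

fn materialize(task: &ScanTask) {
    if let Some(_meta) = &task.parquet_metadata {
        // Use the cached footer: no extra metadata round trip.
    } else {
        // Fall back to fetching and parsing the footer from `task.uri`.
    }
}
```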
s
Nice! Just left a review. A note though: Daft currently points to the branch `sammy/owned-page-stream` on the `parquet2` fork. But it probably makes sense to just stack those commits and your PR on top of main now, since `parquet2` is no longer maintained.
c
> since `parquet2` is no longer maintained.

We could just move both of those (arrow2 and parquet2) into the daft repo. (Polars did this after arrow2 was archived.)
@Sammy Sidhu, I was speaking with @jay on Friday about the parquet functionality. Since we need to modify a significant portion of `parquet2` to make it serializable, one thing we discussed was potentially moving the `arrow2` and `parquet2` forks into crates inside `daft`. @jay mentioned that this was something y'all were already considering.
I'd be happy to tackle this first if this is a direction you want to go. (I think it'll make the metadata work much easier.)
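For concreteness, the vendored layout might look something like this; the workspace member paths are purely illustrative, not Daft's actual structure:

```toml
# Hypothetical Cargo workspace after moving the forks in-tree.
[workspace]
members = [
    "src/daft-core",
    "src/arrow2",    # vendored from the Eventual-Inc/arrow2 fork
    "src/parquet2",  # vendored from the Eventual-Inc/parquet2 fork
]
```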
s
Yeah, that makes a lot of sense. I tried pulling in arrow2 a few months back, but realized it was more work than expected, so I dropped it.
So full steam ahead on pulling it in if you think it would help!
c
lol I now realize why you didn't want to deal with it
🤣 1
@Sammy Sidhu The PR is passing CI now and should be ready for review: https://github.com/Eventual-Inc/Daft/pull/2341
🔥 1
s
@Cory Grinstead amazing! @Colin Ho and I can take a look asap
c
I also have this stacked and ready to go after that one. https://github.com/Eventual-Inc/Daft/pull/2346
c
Hey @Cory Grinstead, just approved and merged the first PR. The second one also looks good, but could you give it a rebase first?
✅ 1
c
🤔 So simply (de)serializing the metadata and using that when materializing was unexpectedly slower than re-reading the metadata for each task (likely due to the massive number of row groups in this edge case). Will have to think on this one a bit more. 🤔 🤔
s
Hmm, what if we only package the relevant row group metadata for each scan task rather than the whole parquet metadata?
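Something like this rough sketch, with stand-in types (parquet2's real metadata structs carry much more than this):

```rust
// Keep only the row groups a scan task will actually read, so the
// serialized task stays small even for files with thousands of row groups.
#[derive(Clone)]
struct RowGroupMetaData {
    num_rows: usize,
    byte_range: (u64, u64),
}

struct FileMetadata {
    row_groups: Vec<RowGroupMetaData>,
}

/// Slice out just the row groups assigned to one scan task.
fn metadata_for_task(file_meta: &FileMetadata, assigned: &[usize]) -> FileMetadata {
    FileMetadata {
        row_groups: assigned
            .iter()
            .map(|&i| file_meta.row_groups[i].clone())
            .collect(),
    }
}
```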
c
🔥 🔥 I think I finally got it!!
🙌 1
🔥 1
25 sec down to 11 sec 🔥
@Sammy Sidhu @Clark Zinzow Here's a PR. https://github.com/Eventual-Inc/Daft/pull/2358
It seems like this broke the globbing, but I'm not sure why... 🤔
s
amazing @Cory Grinstead! Cutting the time in half! Let me take a look 🙂
c
I think what's happening is that the `metadata[]` isn't maintaining the same ordering as the `uris[]`, so it's attempting to get the wrong metadata for the files.
oh nvm, I figured it out. I'm only grabbing the metadata from the first file.
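For reference, the correct pairing is roughly this (a generic sketch, not the actual planner code):

```rust
// Pair each globbed uri with its own footer metadata. The bug was the
// equivalent of always taking metadata[0]; zipping keeps order aligned.
fn pair_uris_with_metadata<M>(uris: Vec<String>, metadata: Vec<M>) -> Vec<(String, M)> {
    assert_eq!(
        uris.len(),
        metadata.len(),
        "expected one metadata entry per uri, in glob order"
    );
    uris.into_iter().zip(metadata).collect()
}
```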
j
Halved! 🔥 🔥 🔥
s
Just left a review! Main comment: I believe that this would pass the same `parquet_metadata` to different files, which would be incorrect! When we glob, we typically only read one parquet metadata to infer the schema, but the rest of the globs and parquet metadata fetches actually occur here, where we split row groups. Note: if we have more than a certain number of files, we skip row group splitting!
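Sketched out with illustrative names and an illustrative threshold (not the actual Daft code), the flow is roughly:

```rust
const MAX_FILES_FOR_ROW_GROUP_SPLITTING: usize = 100; // illustrative threshold

struct Footer; // stand-in for a parsed parquet footer

fn read_footer(_uri: &str) -> Footer {
    Footer // stand-in for an actual footer fetch + parse
}

fn plan_glob_scan(uris: &[String]) {
    // Schema inference only reads the FIRST file's footer...
    let _schema_source = read_footer(&uris[0]);

    // ...so reusing that one footer for every file is what goes wrong.
    // The remaining footers are only fetched here, during row group
    // splitting, and splitting is skipped past the file-count threshold.
    if uris.len() <= MAX_FILES_FOR_ROW_GROUP_SPLITTING {
        for uri in uris {
            let _footer = read_footer(uri); // each file gets its own footer
            // ...split this file's row groups into separate scan tasks...
        }
    }
    // Otherwise: one scan task per file, footers fetched at execution time.
}
```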
c
I could use some help getting this over the finish line. I think it's like 90% of the way there, but I'm not well versed in the Ray runner yet, so I'm having some trouble figuring out why it's failing/timing out there.