# daft-dev
c
So I think I have a plan forward to remove the additional metadata scans. It appears we already have all of the parquet metadata we need during logical planning, but we then convert it to a serializable `ScanTask`. When we do this, we lose all of the `FileMetadata` that we obtained, so when the task is materialized, we need to go fetch that again. Instead, we can just make that data serializable and attach it to the task; then the physical planner can skip those extra steps. But first, we need to make the parquet `FileMetadata` (de)serializable: https://github.com/Eventual-Inc/parquet2/pull/2
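Roughly, the idea is something like this minimal sketch, assuming hypothetical type and field names rather than Daft's actual `ScanTask` definition:

```rust
use serde::{Deserialize, Serialize};

// Stand-in for parquet2's FileMetadata once it can derive these traits;
// the real struct carries schema, row group, and column chunk info.
#[derive(Serialize, Deserialize)]
struct ParquetFileMetadata {
    num_rows: usize,
}

// Hypothetical scan task that carries the footer metadata it was
// planned with, so the executor never has to re-fetch it.
#[derive(Serialize, Deserialize)]
struct ScanTask {
    uri: String,
    parquet_metadata: Option<ParquetFileMetadata>,
}

fn materialize(task: &ScanTask) {
    if let Some(_meta) = &task.parquet_metadata {
        // Use the cached footer: no extra metadata round trip.
    } else {
        // Fall back to fetching and parsing the footer from `task.uri`.
    }
}
```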
s
Nice! Just left a review. A note though: Daft currently points to the branch `sammy/owned-page-stream` on the `parquet2` fork. But it probably makes sense to just stack those commits and your PR on top of main now, since `parquet2` is no longer maintained.
c
> since `parquet2` is no longer maintained.

We could just move both of those (arrow2 and parquet2) into the daft repo. (Polars did this after arrow2 was archived.)
@Sammy Sidhu, I was speaking with @jay on Friday about the parquet functionality. Since we need to modify a significant portion of `parquet2` to make it serializable, one thing we discussed was potentially moving the `arrow2` and `parquet2` forks into crates inside `daft`. @jay mentioned that this was something y'all were already considering.
I'd be happy to tackle this first if this is a direction you want to go. (I think it'll make the metadata work much easier.)
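For concreteness, the vendored layout might look something like this; the workspace member paths are purely illustrative, not Daft's actual structure:

```toml
# Hypothetical Cargo workspace after moving the forks in-tree.
[workspace]
members = [
    "src/daft-core",
    "src/arrow2",    # vendored from the Eventual-Inc/arrow2 fork
    "src/parquet2",  # vendored from the Eventual-Inc/parquet2 fork
]
```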
s
Yeah, that makes a lot of sense. I tried pulling in arrow2 a few months back, but realized it was more work than expected, so I dropped it.
So full steam ahead on pulling it in if you think it would help!
c
lol I now realize why you didn't want to deal with it
🤣 1
@Sammy Sidhu The PR is passing CI now and should be ready for review: https://github.com/Eventual-Inc/Daft/pull/2341
🔥 1
s
@Cory Grinstead amazing! @Colin Ho and I can take a look asap
c
I also have this stacked and ready to go after that one. https://github.com/Eventual-Inc/Daft/pull/2346
c
Hey @Cory Grinstead, just approved and merged the first PR. The second one also looks good, but could you give it a rebase first?
✅ 1
c
🤔 So simply (de)serializing the metadata and using that when materializing was unexpectedly slower than re-reading the metadata for each task (likely due to the massive number of row groups in this edge case). Will have to think on this one a bit more. 🤔 🤔
s
Hmm, what if we only package the relevant row group metadata for each scan task rather than the whole parquet metadata?
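Something like this rough sketch, with stand-in types (parquet2's real metadata structs carry much more than this):

```rust
// Keep only the row groups a scan task will actually read, so the
// serialized task stays small even for files with thousands of row groups.
#[derive(Clone)]
struct RowGroupMetaData {
    num_rows: usize,
    byte_range: (u64, u64),
}

struct FileMetadata {
    row_groups: Vec<RowGroupMetaData>,
}

/// Slice out just the row groups assigned to one scan task.
fn metadata_for_task(file_meta: &FileMetadata, assigned: &[usize]) -> FileMetadata {
    FileMetadata {
        row_groups: assigned
            .iter()
            .map(|&i| file_meta.row_groups[i].clone())
            .collect(),
    }
}
```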
c
🔥 🔥 I think I finally got it!!
🙌 1
🔥 1
25 sec down to 11 sec 🔥
@Sammy Sidhu @Clark Zinzow Here's a PR. https://github.com/Eventual-Inc/Daft/pull/2358
It seems like this broke the globbing, but I'm not sure why... 🤔
s
amazing @Cory Grinstead! Cutting the time in half! Let me take a look 🙂
c
I think what's happening is that the `metadata[]` isn't maintaining the same ordering as the `uris[]`, so it's attempting to get the wrong metadata for the files.
oh nvm, I figured it out. I'm only grabbing the metadata from the first file.
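For reference, the correct pairing is roughly this (a generic sketch, not the actual planner code):

```rust
// Pair each globbed uri with its own footer metadata. The bug was the
// equivalent of always taking metadata[0]; zipping keeps order aligned.
fn pair_uris_with_metadata<M>(uris: Vec<String>, metadata: Vec<M>) -> Vec<(String, M)> {
    assert_eq!(
        uris.len(),
        metadata.len(),
        "expected one metadata entry per uri, in glob order"
    );
    uris.into_iter().zip(metadata).collect()
}
```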
j
Halved! 🔥 🔥 🔥
s
Just left a review! Main comment: I believe that this would pass the same `parquet_metadata` to different files, which would be incorrect! When we glob, we typically only read one parquet metadata to infer the schema, but the rest of the globs and parquet metadata fetches actually occur here, where we split row groups. Note: if we have more than a certain number of files, we skip row group splitting!
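Sketched out with illustrative names and an illustrative threshold (not the actual Daft code), the flow is roughly:

```rust
const MAX_FILES_FOR_ROW_GROUP_SPLITTING: usize = 100; // illustrative threshold

struct Footer; // stand-in for a parsed parquet footer

fn read_footer(_uri: &str) -> Footer {
    Footer // stand-in for an actual footer fetch + parse
}

fn plan_glob_scan(uris: &[String]) {
    // Schema inference only reads the FIRST file's footer...
    let _schema_source = read_footer(&uris[0]);

    // ...so reusing that one footer for every file is what goes wrong.
    // The remaining footers are only fetched here, during row group
    // splitting, and splitting is skipped past the file-count threshold.
    if uris.len() <= MAX_FILES_FOR_ROW_GROUP_SPLITTING {
        for uri in uris {
            let _footer = read_footer(uri); // each file gets its own footer
            // ...split this file's row groups into separate scan tasks...
        }
    }
    // Otherwise: one scan task per file, footers fetched at execution time.
}
```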
c
I could use some help getting this over the finish line. I think it's like 90% of the way there, but I'm not well versed in the Ray runner yet, so I'm having some trouble figuring out why it's failing/timing out there.