# daft-dev
c
I ran a flamegraph on the parquet readers. It looks like we are spending a significant amount of time in `_platform_memmove` calls for this code snippet:
```rust
// Default IO config, a shared IO client, and a runtime handle for the read.
let file = "Daft/lineitem.parquet";

let io_config = IOConfig::default();
let io_client = Arc::new(IOClient::new(io_config.into())?);
let runtime_handle = daft_io::get_runtime(true)?;

// Read the whole file into a single table (all optional arguments left as None).
let table = read_parquet(
    file,
    None,
    None,
    None,
    None,
    None,
    io_client,
    None,
    runtime_handle,
    Default::default(),
)?;
```
j
Fun… is this running on the “malformed” Polars data with thousands of row groups?
c
yeah
j
BTW how does this runtime compare to the “rest of Daft”? I’m curious if the problem might be further up the stack than the Rust-level `read_parquet` call. E.g. how much % does the Rust `read_parquet()` call take up in an end-to-end `daft.read_parquet("…").collect()` call through the dataframe API?
c
Hmm, interesting point. I'll see if I can get a flamegraph of the entire stack. I don't have much experience profiling Python scripts, but I'm sure it can't be too difficult.
j
Re: `_platform_memmove`: I guess Polars probably doesn’t have the overhead of needing to concat all the arrays on read since they do array chunking and Daft doesn’t, so that’s definitely a potential culprit…
We use https://github.com/benfred/py-spy with the `--native` flag, but I think you need to run it on a Linux box for this to work
🙌 1
c
Re: `_platform_memmove`: yeah, I was thinking the same thing. It looks like we create a table/batch for every row group, then concat them all at the end into a single table. I was thinking we could potentially be a bit more intelligent while reading: instead of creating a table for every row group, we could do some batching at the reader, so `N` row groups would be processed as a single table.
🙌 1
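Roughly what I have in mind — untested sketch, and `Table`, `read_row_group`, `concat_tables`, and `num_row_groups` are stand-in names rather than the actual Daft internals:
```rust
// Untested sketch: accumulate N row groups at a time and concat per batch,
// instead of one Table per row group plus one giant concat at the end.
// `Table`, `read_row_group`, `concat_tables`, and `num_row_groups` are
// placeholders, not the real Daft names.
const ROW_GROUPS_PER_BATCH: usize = 8; // tuning knob

let mut batches: Vec<Table> = Vec::new();
let mut pending: Vec<Table> = Vec::with_capacity(ROW_GROUPS_PER_BATCH);

for rg in 0..num_row_groups {
    pending.push(read_row_group(rg)?);
    if pending.len() == ROW_GROUPS_PER_BATCH {
        // one small concat per batch of N row groups
        batches.push(concat_tables(&pending)?);
        pending.clear();
    }
}
if !pending.is_empty() {
    // flush the remainder
    batches.push(concat_tables(&pending)?);
}
```
With thousands of row groups that would turn thousands of tiny tables into roughly `num_row_groups / N` larger ones before any final concat has to touch them.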
I just came across memray which seems to support profiling & flamegraphs out of the box on OSX
I also noticed that we get the (file) metadata for every row group instead of just once.
🔥 2
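i.e. something like this instead of re-reading the footer inside the per-row-group loop — sketch only, `read_file_metadata` and `read_row_group` are made-up names and the field accesses are just assumptions:
```rust
// Sketch: read the parquet footer/metadata once up front and reuse it for
// every row group, instead of re-fetching it per row group.
// `read_file_metadata`, `read_row_group`, and the field accesses below are
// placeholders.
let metadata = read_file_metadata(&file)?; // single footer read

let tables = metadata
    .row_groups
    .iter()
    .map(|rg| read_row_group(&file, rg, &metadata.schema))
    .collect::<Result<Vec<_>, _>>()?;
```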
j
That’s going to hurt…
c
I think that should be a pretty easy fix though. I'll see if I can get a PR together for it this weekend.
❤️ 2
j
Did you manage to get e2e profiling of a `daft.read_parquet` working? 😮
c
yeah I did with `memray`, but it doesn't capture the native symbols as well as `cargo flamegraph`
```python
import daft

print(daft.read_parquet('../Daft/lineitem.parquet').collect())
```
👍 1
Although slower, it's much more memory efficient than Polars:
Daft peak memory usage: 7.4 GiB
Polars peak memory usage: 17.7 GiB
j
That is indeed surprising, given that we do a bunch of concats
```
Start time: 2024-05-23 16:56:01.040000
End time: 2024-05-23 16:56:22.031000
```
Hmm ok so Daft took about 21 seconds to read the entire file into memory
c
Here's Polars for reference.
👍 1
here's another one that shows the flamegraph per thread.
So I spent a bit more time on this over the past couple of days. I think the `concat` is definitely the biggest performance killer. There are a few other smaller optimizations that come to mind, but I think if we could add support for "chunked" arrays, it'd greatly increase performance. Unfortunately, I don't think I have enough familiarity with the core codebase yet to make these changes.
🙌 1
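For context, the "chunked" idea is basically letting a column hold several array chunks (one per row group, or per batch of row groups) instead of one contiguous buffer, so nothing needs to be copied or concatenated at read time. A minimal sketch of the shape of it, built on the arrow2 `Array` trait that Daft already uses — this is not how Daft's actual column types are structured today:
```rust
use std::sync::Arc;

use arrow2::array::Array;

// Sketch of a chunked column: each entry is the array produced by one row
// group (or one batch of row groups). Reading just appends a chunk; no
// memmove of the underlying buffers is needed.
struct ChunkedColumn {
    chunks: Vec<Arc<dyn Array>>,
}

impl ChunkedColumn {
    fn push_chunk(&mut self, chunk: Arc<dyn Array>) {
        self.chunks.push(chunk);
    }

    // Total length across all chunks; kernels iterate chunk by chunk instead
    // of assuming one contiguous array.
    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }
}
```
Downstream ops would then iterate chunk by chunk, or lazily concat only when a contiguous buffer is truly required, which is essentially what Polars' ChunkedArray does.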