# daft-dev
c
I ran a flamegraph on the parquet readers. It looks like we are spending a significant amount of time in `_platform_memmove` calls for this code snippet:
```rust
// Default IO config, a shared IO client, and a runtime handle for the read.
let file = "Daft/lineitem.parquet";

let io_config = IOConfig::default();
let io_client = Arc::new(IOClient::new(io_config.into())?);
let runtime_handle = daft_io::get_runtime(true)?;

// Read the whole file into a single table (all optional arguments left as None).
let table = read_parquet(
    file,
    None,
    None,
    None,
    None,
    None,
    io_client,
    None,
    runtime_handle,
    Default::default(),
)?;
```
j
Fun… is this running on the “malformed” Polars data with thousands of row groups?
c
yeah
j
BTW how does this runtime compare to the “rest of Daft”? I’m curious if the problem might be further up the stack than the Rust-level `read_parquet` call. E.g. how much % does the Rust `read_parquet()` call take up in an end-to-end `daft.read_parquet("…").collect()` call through the dataframe API?
c
Hmm, interesting point. I'll see if I can get a flamegraph of the entire stack. I don't have much experience profiling Python scripts, but I'm sure it can't be too difficult.
j
Re: `_platform_memmove`: I guess Polars probably doesn’t have the overhead of needing to concat all the arrays on read since they do array chunking and Daft doesn’t, so that’s definitely a potential culprit…
We use https://github.com/benfred/py-spy with the `--native` flag, but I think you need to run it on a Linux box for this to work
🙌 1
c
Re: `_platform_memmove`: yeah, I was thinking the same thing. It looks like we create a table/batch for every row group, then concat them all at the end into a single table. I was thinking we could potentially be a bit more intelligent while reading: instead of creating a table for every row group, we could do some batching at the reader, so `N` row groups would be processed as a single table.
🙌 1
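Roughly what I have in mind — untested sketch, and `Table`, `read_row_group`, `concat_tables`, and `num_row_groups` are stand-in names rather than the actual Daft internals:
```rust
// Untested sketch: accumulate N row groups at a time and concat per batch,
// instead of one Table per row group plus one giant concat at the end.
// `Table`, `read_row_group`, `concat_tables`, and `num_row_groups` are
// placeholders, not the real Daft names.
const ROW_GROUPS_PER_BATCH: usize = 8; // tuning knob

let mut batches: Vec<Table> = Vec::new();
let mut pending: Vec<Table> = Vec::with_capacity(ROW_GROUPS_PER_BATCH);

for rg in 0..num_row_groups {
    pending.push(read_row_group(rg)?);
    if pending.len() == ROW_GROUPS_PER_BATCH {
        // one small concat per batch of N row groups
        batches.push(concat_tables(&pending)?);
        pending.clear();
    }
}
if !pending.is_empty() {
    // flush the remainder
    batches.push(concat_tables(&pending)?);
}
```
With thousands of row groups that would turn thousands of tiny tables into roughly `num_row_groups / N` larger ones before any final concat has to touch them.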
I just came across memray which seems to support profiling & flamegraphs out of the box on OSX
I also noticed that we get the (file) metadata for every row group instead of just once.
🔥 2
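i.e. something like this instead of re-reading the footer inside the per-row-group loop — sketch only, `read_file_metadata` and `read_row_group` are made-up names and the field accesses are just assumptions:
```rust
// Sketch: read the parquet footer/metadata once up front and reuse it for
// every row group, instead of re-fetching it per row group.
// `read_file_metadata`, `read_row_group`, and the field accesses below are
// placeholders.
let metadata = read_file_metadata(&file)?; // single footer read

let tables = metadata
    .row_groups
    .iter()
    .map(|rg| read_row_group(&file, rg, &metadata.schema))
    .collect::<Result<Vec<_>, _>>()?;
```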
j
That’s going to hurt…
c
I think that should be a pretty easy fix though. I'll see if I can get a PR together for it this weekend.
❤️ 2
j
Did you manage to get e2e profiling of a `daft.read_parquet` working? 😮
c
yeah I did with `memray`, but it doesn't capture the native symbols as well as `cargo flamegraph`
```python
import daft

print(daft.read_parquet('../Daft/lineitem.parquet').collect())
```
👍 1
Although slower, it's much more memory efficient than Polars:
Daft peak memory usage: 7.4 GiB
Polars peak memory usage: 17.7 GiB
j
That is indeed surprising, given that we do a bunch of concats
```
Start time: 2024-05-23 16:56:01.040000
End time: 2024-05-23 16:56:22.031000
```
Hmm ok so Daft took about 21 seconds to read the entire file into memory
c
Here's Polars for reference.
👍 1
here's another one that shows the flamegraph per thread.
So I spent a bit more time on this over the past couple of days. I think the `concat` is definitely the biggest performance killer. There are a few other smaller optimizations that come to mind, but I think if we could add support for "chunked" arrays, it'd greatly increase performance. Unfortunately, I don't think I have enough familiarity with the core codebase yet to make these changes.
🙌 1
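For context, the "chunked" idea is basically letting a column hold several array chunks (one per row group, or per batch of row groups) instead of one contiguous buffer, so nothing needs to be copied or concatenated at read time. A minimal sketch of the shape of it, built on the arrow2 `Array` trait that Daft already uses — this is not how Daft's actual column types are structured today:
```rust
use std::sync::Arc;

use arrow2::array::Array;

// Sketch of a chunked column: each entry is the array produced by one row
// group (or one batch of row groups). Reading just appends a chunk; no
// memmove of the underlying buffers is needed.
struct ChunkedColumn {
    chunks: Vec<Arc<dyn Array>>,
}

impl ChunkedColumn {
    fn push_chunk(&mut self, chunk: Arc<dyn Array>) {
        self.chunks.push(chunk);
    }

    // Total length across all chunks; kernels iterate chunk by chunk instead
    // of assuming one contiguous array.
    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }
}
```
Downstream ops would then iterate chunk by chunk, or lazily concat only when a contiguous buffer is truly required, which is essentially what Polars' ChunkedArray does.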