Hi guys, I'm currently trying to add approximate aggregations Distributed Data Community #daft-dev

Maxime Petitjean

03/21/2024, 4:37 PM

Hi guys, I'm currently trying to add approximate aggregations to daft (starting with quantiles). I would like to integrate a sketch library like https://github.com/mheffner/rust-sketches-ddsketch. I started by adding a new `product` aggregation as an exercise. For that I just followed the `sum` aggregation, and everything is working fine.

The approximate aggregations are more difficult, because the final result and the intermediate results will have different datatypes. For example, an `approx_quantile` aggregation will return a number (Float64), but the intermediate result (between the map and reduce phases) will be a sketch (sketches will be merged in the reduce phase).

Should I create a new `Sketch`/`DDSketch` datatype, or should I use the `Binary` datatype? Should I create different functions in `Series` and `DataArray`, one returning the sketch (`approx_sketch`) and some returning approximate statistics (`approx_quantile`, ...), or should I only add statistics functions and keep `approx_sketch` private?

🎉 2
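To make the question concrete, here is a minimal pure-Python sketch of the two-phase flow described above: each partition is summarized into a sketch and serialized to bytes (the "Binary" option), the reduce phase merges the sketches, and only the final step yields a float. `ToySketch`, `map_partition`, and `reduce_sketches` are hypothetical names invented for illustration; the log-bucketing is a simplified DDSketch-style scheme for positive numbers, not the code of rust-sketches-ddsketch or of daft.

```python
import json
import math
from collections import Counter


class ToySketch:
    """Toy log-bucketed sketch for positive numbers (illustrative only)."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha  # target relative accuracy
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()  # bucket index -> count
        self.count = 0

    def add(self, value):
        # Bucket i covers the interval (gamma**(i-1), gamma**i].
        index = math.ceil(math.log(value, self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def merge(self, other):
        # Merging is just summing the counts of matching buckets.
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q):
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # Representative value of the bucket (within alpha of any member).
                return 2 * self.gamma**index / (self.gamma + 1)
        raise ValueError("empty sketch")

    def to_bytes(self):
        payload = {
            "alpha": self.alpha,
            "buckets": {str(i): c for i, c in self.buckets.items()},
        }
        return json.dumps(payload).encode("utf-8")

    @classmethod
    def from_bytes(cls, blob):
        payload = json.loads(blob.decode("utf-8"))
        sketch = cls(payload["alpha"])
        for index, count in payload["buckets"].items():
            sketch.buckets[int(index)] = count
        sketch.count = sum(sketch.buckets.values())
        return sketch


def map_partition(values, alpha=0.01):
    """Map phase: one partition -> one serialized sketch (bytes)."""
    sketch = ToySketch(alpha)
    for value in values:
        sketch.add(value)
    return sketch.to_bytes()


def reduce_sketches(blobs):
    """Reduce phase: merge all serialized sketches into one sketch."""
    merged = ToySketch.from_bytes(blobs[0])
    for blob in blobs[1:]:
        merged.merge(ToySketch.from_bytes(blob))
    return merged


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
blobs = [map_partition(p) for p in partitions]
merged = reduce_sketches(blobs)
median = merged.quantile(0.5)  # final result is a plain float
print(median)
```

The point of the example is the type change across phases: bytes between map and reduce, a float only at the very end.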

jay

03/21/2024, 4:42 PM

If I understand correctly:

1. Map phase - returns `Sketch` structure
2. Reduce phase - returns `Float64`

For the map stage, I think you could define some serialization on the Sketch structs and use the Binary type. Otherwise, if you can decompose it into its primitive components, you can consider using a `Struct` type as well, and re-compose it on the other side!

As for the Series methods, our public methods are mostly for testing in Python. I'd suggest exposing the quantile methods there!
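As a rough illustration of the Struct option, the snippet below decomposes a bucketed sketch into primitive fields (one float and two integer lists), merges two such records, and re-composes the bucket map on the other side. The field names (`alpha`, `indices`, `counts`) and the helpers `decompose`, `recompose`, and `merge_fields` are assumptions made up for this example; they are not the layout used by daft or by any sketch library.

```python
from collections import Counter


def decompose(alpha, buckets):
    """Flatten a sketch into primitive fields, the shape a Struct value could carry."""
    indices = sorted(buckets)
    return {
        "alpha": alpha,
        "indices": indices,
        "counts": [buckets[i] for i in indices],
    }


def recompose(fields):
    """Rebuild (alpha, bucket map) from the primitive fields."""
    return fields["alpha"], dict(zip(fields["indices"], fields["counts"]))


def merge_fields(left, right):
    """Merge two decomposed sketches by summing counts per bucket index."""
    if left["alpha"] != right["alpha"]:
        raise ValueError("sketches must share the same accuracy parameter")
    totals = Counter(dict(zip(left["indices"], left["counts"])))
    totals.update(dict(zip(right["indices"], right["counts"])))
    return decompose(left["alpha"], totals)


left = decompose(0.01, {0: 1, 35: 2})
right = decompose(0.01, {35: 3, 70: 1})
merged_fields = merge_fields(left, right)
round_trip = recompose(left)
print(merged_fields)
```

Compared with an opaque Binary blob, a decomposed form like this keeps the intermediate data inspectable, at the cost of tying the column schema to the sketch's internals.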

jay

03/21/2024, 5:20 PM

Also happy to hop on a call at any point to help out here! Might be easier coordinating technical details in a call 🙂

Maxime Petitjean

03/25/2024, 5:39 PM

I made some progress. I would expose "sketch" functions and maybe add a new `Sketch` datatype, because it is possible to compute multiple things from a single sketch. For now I added an `approx_sketch` aggregation, which aggregates values into a sketch, and a `sketch_quantile` function, which computes a quantile from a sketch. Here is a (working!) example:

from daft import from_pydict, col

df = from_pydict(
    {
        "numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "groups": ["g1", "g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2", "g2"],
    }
)

df = df.groupby("groups").agg([col("numbers").approx_sketch().alias("sketches")])
df = (
    df.with_column("first_quartile", col("sketches").sketch_quantile(0.25))
    .with_column("median", col("sketches").sketch_quantile(0.5))
    .with_column("third_quartile", col("sketches").sketch_quantile(0.75))
    .select("groups", "first_quartile", "median", "third_quartile")
    .sort("groups")
)

print(df.collect())

In this example, you can see that I'm computing three different quantiles from the same sketch.

Maxime Petitjean

03/25/2024, 5:39 PM

The example output:

╭────────┬───────────────────┬────────────────────┬───────────────────╮
│ groups ┆ first_quartile    ┆ median             ┆ third_quartile    │
│ ---    ┆ ---               ┆ ---                ┆ ---               │
│ Utf8   ┆ Float64           ┆ Float64            ┆ Float64           │
╞════════╪═══════════════════╪════════════════════╪═══════════════════╡
│ g1     ┆ 1.993661701417351 ┆ 2.9742334234767167 ┆ 4.014835333028612 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ g2     ┆ 7.028793021534831 ┆ 7.924973703917148  ┆ 8.935418643763665 │
╰────────┴───────────────────┴────────────────────┴───────────────────╯
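As a quick sanity check on how approximate these numbers are, the snippet below compares the values from the table with exact quantiles of the input data. The exact-quantile definition used here (take the element at index `int(q * (n - 1))` of the sorted values) is an assumption chosen for illustration and may not match the rank rule the sketch library uses.

```python
# Values copied from the output table above.
reported = {
    "g1": (1.993661701417351, 2.9742334234767167, 4.014835333028612),
    "g2": (7.028793021534831, 7.924973703917148, 8.935418643763665),
}
# Input data from the example.
data = {"g1": [1, 2, 3, 4, 5], "g2": [6, 7, 8, 9, 10]}
quantiles = (0.25, 0.5, 0.75)


def exact_quantile(values, q):
    """Exact quantile under the assumed lower-rank rule."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


relative_errors = {}
for group, estimates in reported.items():
    for q, estimate in zip(quantiles, estimates):
        exact = exact_quantile(data[group], q)
        relative_errors[(group, q)] = abs(estimate - exact) / exact

max_error = max(relative_errors.values())
print(f"largest relative error: {max_error:.4%}")
```

Under that assumed definition, every reported value lies within 1% of the exact quantile.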

Maxime Petitjean

04/03/2024, 1:55 PM

I have just created a draft PR (https://github.com/Eventual-Inc/Daft/pull/2076) which should be a good base for discussion on this topic!

👀 1

jay

04/04/2024, 9:34 PM

I took a first pass at the PR. Great job!! 🎉

1. Had some questions about the user-facing APIs which I think we should address and agree on. Ideally I'd like to keep the details of Sketch away from the users, who would be mostly concerned with "approx_quantiles" itself.
2. I see your concerns around needing to use the BinaryArray for data representation of the Sketch object right now. I'll chat with the rest of the team to see if there's a better solution here!
