# daft-dev
m
Hi guys, I'm currently trying to add approximate aggregations to daft (starting with quantiles). I would like to integrate a sketch library like https://github.com/mheffner/rust-sketches-ddsketch. I started by adding a new `product` aggregation as an exercise. For that I just followed the `sum` aggregation and everything is working fine. Now for the approximate aggregations, it's more difficult. The final result and the intermediate results will have different datatypes. For example, an `approx_quantile` aggregation will return a number (Float64), but an intermediate result (between the map and reduce phases) will be a sketch (sketches will be merged in the reduce phase). Should I create a new `Sketch` or `DDSketch` datatype, or should I use the `Binary` datatype? Should I create different functions in `Series` and `DataArray`, one returning the sketch (`approx_sketch`) and some returning approximate statistics (`approx_quantile`, ...), or should I only add statistics functions and keep `approx_sketch` private?
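To make the type mismatch concrete, here is a minimal pure-Python sketch of the two phases. `ToySketch`, `map_phase`, `reduce_phase` and `finalize` are hypothetical names for illustration only; this is a simplified DDSketch-style structure for positive numbers, not the rust-sketches-ddsketch implementation and not Daft's API. The 1% relative accuracy (`alpha=0.01`) is also an assumption.

```python
import math
from collections import Counter


class ToySketch:
    """Hypothetical, simplified DDSketch-style sketch (positive values only)."""

    def __init__(self, alpha=0.01):
        # gamma controls the bucket width; alpha is the relative accuracy.
        self.gamma = (1 + alpha) / (1 - alpha)
        self.counts = Counter()

    def add(self, x):
        # Bucket i covers the interval (gamma**(i-1), gamma**i].
        self.counts[math.ceil(math.log(x, self.gamma))] += 1

    def merge(self, other):
        # Merging is just adding bucket counts, which is why sketches
        # are a convenient intermediate result.
        self.counts.update(other.counts)

    def quantile(self, q):
        n = sum(self.counts.values())
        rank = q * (n - 1)
        seen = 0
        for idx in sorted(self.counts):
            seen += self.counts[idx]
            if seen > rank:
                # Representative value of the bucket.
                return 2 * self.gamma**idx / (self.gamma + 1)
        return None  # empty sketch


def map_phase(partition):
    """Intermediate result: one sketch per partition."""
    sketch = ToySketch()
    for value in partition:
        sketch.add(value)
    return sketch


def reduce_phase(sketches):
    """Still a sketch: merge all partial sketches into one."""
    merged = ToySketch()
    for sketch in sketches:
        merged.merge(sketch)
    return merged


def finalize(sketch, q):
    """Final result: a plain float."""
    return sketch.quantile(q)


partitions = [[1, 2, 3], [4, 5]]
merged = reduce_phase([map_phase(p) for p in partitions])
median = finalize(merged, 0.5)
```

The point of the split is that `map_phase` and `reduce_phase` both produce a sketch, and only `finalize` produces the Float64 that users see.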
🎉 2
j
If I understand correctly:
1. Map phase - returns a `Sketch` structure
2. Reduce phase - returns `Float64`

For the map stage, I think you could define some serialization on the Sketch structs and use the Binary type. Otherwise, if you can decompose it into its primitive components, you can consider using a `Struct` type as well, and re-compose it on the other side! As for the Series methods, our public methods are mostly for testing in Python. I'd suggest exposing the quantile methods there!
Also happy to hop on a call at any point to help out here! Might be easier coordinating technical details in a call 🙂
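The two representations suggested above (an opaque serialized payload for a Binary column, or primitive fields for a Struct column) could look roughly like this. This is only a sketch under assumptions: the sketch state is modelled as an accuracy parameter plus a bucket-index to count mapping, JSON is an arbitrary choice of encoding, and all four function names are hypothetical rather than anything from Daft or the DDSketch crate.

```python
import json


def sketch_to_bytes(alpha, counts):
    """Option 1: serialize the whole sketch into one opaque binary value."""
    payload = {"alpha": alpha, "buckets": sorted(counts.items())}
    return json.dumps(payload).encode("utf-8")


def sketch_from_bytes(blob):
    """Rebuild the sketch state on the reduce side."""
    payload = json.loads(blob.decode("utf-8"))
    return payload["alpha"], {int(i): int(c) for i, c in payload["buckets"]}


def sketch_to_struct(alpha, counts):
    """Option 2: decompose into primitive fields (one per struct field)."""
    indices = sorted(counts)
    return {
        "alpha": alpha,
        "indices": indices,
        "counts": [counts[i] for i in indices],
    }


def sketch_from_struct(row):
    """Re-compose the sketch state from its primitive fields."""
    return row["alpha"], dict(zip(row["indices"], row["counts"]))


# Hypothetical sketch state: bucket index -> number of values in the bucket.
state = {-3: 1, 0: 1, 35: 2}

blob = sketch_to_bytes(0.01, state)
row = sketch_to_struct(0.01, state)
```

The binary form keeps the column schema simple but hides the contents from the engine; the struct form keeps every field typed and inspectable, at the cost of a more complex schema.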
m
I made some progress and I would expose "sketch" functions and maybe add a new `Sketch` datatype because it is possible to compute multiple things on a sketch. For now I added an `approx_sketch` aggregation which aggregates values in a sketch and a `sketch_quantile` which computes a quantile from a sketch. Here is a (working!) example:
from daft import from_pydict, col

df = from_pydict(
    {
        "numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "groups": ["g1", "g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2", "g2"],
    }
)

df = df.groupby("groups").agg([col("numbers").approx_sketch().alias("sketches")])
df = (
    df.with_column("first_quartile", col("sketches").sketch_quantile(0.25))
    .with_column("median", col("sketches").sketch_quantile(0.5))
    .with_column("third_quartile", col("sketches").sketch_quantile(0.75))
    .select("groups", "first_quartile", "median", "third_quartile")
    .sort("groups")
)

print(df.collect())
In this example, you can see that I'm computing 3 different quantiles from the same sketch.
The example output:
╭────────┬───────────────────┬────────────────────┬───────────────────╮
│ groups ┆ first_quartile    ┆ median             ┆ third_quartile    │
│ ---    ┆ ---               ┆ ---                ┆ ---               │
│ Utf8   ┆ Float64           ┆ Float64            ┆ Float64           │
╞════════╪═══════════════════╪════════════════════╪═══════════════════╡
│ g1     ┆ 1.993661701417351 ┆ 2.9742334234767167 ┆ 4.014835333028612 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ g2     ┆ 7.028793021534831 ┆ 7.924973703917148  ┆ 8.935418643763665 │
╰────────┴───────────────────┴────────────────────┴───────────────────╯
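As a quick sanity check on that output, the reported values can be compared against exact quantiles of the input. Two assumptions are baked in here and are not confirmed by the thread: the exact reference is taken as the sorted element at index `int(q * (n - 1))`, and the sketch is assumed to target roughly 1% relative accuracy.

```python
# Input data from the example above.
groups = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [6, 7, 8, 9, 10],
}

# Approximate values copied from the printed table.
reported = {
    "g1": {0.25: 1.993661701417351, 0.5: 2.9742334234767167, 0.75: 4.014835333028612},
    "g2": {0.25: 7.028793021534831, 0.5: 7.924973703917148, 0.75: 8.935418643763665},
}


def exact_quantile(values, q):
    # Assumed convention: pick the element at rank q * (n - 1), rounded down.
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


relative_errors = {}
for name, values in groups.items():
    for q, approx in reported[name].items():
        exact = exact_quantile(values, q)
        relative_errors[(name, q)] = abs(approx - exact) / exact

# Largest deviation across all groups and quantiles.
worst = max(relative_errors.values())
print(f"worst relative error: {worst:.4%}")
```

Under those assumptions, every reported quantile stays within 1% of the exact value, which is the kind of guarantee a relative-error sketch is meant to give.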
I have just created a draft PR (https://github.com/Eventual-Inc/Daft/pull/2076) which should be a good base for discussion on this topic!
👀 1
j
I took a first pass at the PR. Great job!! 🎉
1. Had some questions about the user-facing APIs which I think we should address and agree on. Ideally I'd like to keep the details of Sketch away from the users, who would be mostly concerned with "approx_quantiles" itself.
2. I see your concerns around needing to use the BinaryArray for data representation of the Sketch object right now. I'll chat with the rest of the team to see if there's a better solution here!