Maxime Petitjean
03/21/2024, 4:37 PM
product aggregation as an exercise. For that, I just followed the sum aggregation, and everything is working fine.
Now for the approximate aggregations, it's more difficult. The final result and the intermediate results will have different datatypes. For example, an approx_quantile aggregation will return a number (Float64), but an intermediate result (between the map and reduce phases) will be a sketch (sketches will be merged in the reduce phase).
Should I create a new Sketch or DDSketch datatype, or should I use the Binary datatype?
Should I create different functions in Series and DataArray, one returning the sketch (approx_sketch) and some returning approximate statistics (approx_quantile, ...), or should I only add statistics functions and keep approx_sketch private?
jay
03/21/2024, 4:42 PM
Sketch structure
2. Reduce phase - returns Float64
For the map stage, I think you could define some serialization on the Sketch structs and use the Binary type. Otherwise, if you can decompose it into its primitive components, you can consider using a Struct type as well, and re-compose it on the other side!
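The map/reduce split described above can be sketched in plain Python. This is only a toy under stated assumptions: a DDSketch-style histogram with logarithmic buckets, positive inputs only, and JSON as the byte encoding. TinySketch, map_phase, and reduce_phase are hypothetical names for illustration, not Daft's API or its actual implementation.

```python
import json
import math
from collections import Counter


class TinySketch:
    """Toy DDSketch-style quantile sketch (hypothetical, positive values only)."""

    def __init__(self, alpha=0.01, buckets=None, count=0):
        self.alpha = alpha                      # target relative accuracy
        self.gamma = (1 + alpha) / (1 - alpha)  # bucket growth factor
        self.buckets = Counter(buckets or {})   # bucket index -> number of values
        self.count = count

    def add(self, value):
        # Bucket i holds the values in (gamma**(i - 1), gamma**i].
        key = math.ceil(math.log(value, self.gamma))
        self.buckets[key] += 1
        self.count += 1

    def merge(self, other):
        # Assumes both sketches share the same alpha; Counter.update adds counts.
        self.buckets.update(other.buckets)
        self.count += other.count
        return self

    def quantile(self, q):
        if self.count == 0:
            raise ValueError("empty sketch")
        rank = q * (self.count - 1)
        seen = 0
        for key in sorted(self.buckets):
            seen += self.buckets[key]
            if seen > rank:
                # Representative value of the bucket, within alpha of any member.
                return 2 * self.gamma ** key / (self.gamma + 1)

    def to_bytes(self):
        # Intermediate result: serialized so it could live in a binary column.
        payload = {"alpha": self.alpha, "count": self.count, "buckets": dict(self.buckets)}
        return json.dumps(payload).encode("utf-8")

    @classmethod
    def from_bytes(cls, data):
        payload = json.loads(data.decode("utf-8"))
        # JSON turns integer keys into strings, so convert them back.
        buckets = {int(k): v for k, v in payload["buckets"].items()}
        return cls(payload["alpha"], buckets, payload["count"])


def map_phase(partition):
    """Map: one sketch per partition, returned as bytes."""
    sketch = TinySketch()
    for value in partition:
        sketch.add(value)
    return sketch.to_bytes()


def reduce_phase(blobs, q):
    """Reduce: merge the partial sketches and return a single float."""
    merged = TinySketch()
    for blob in blobs:
        merged.merge(TinySketch.from_bytes(blob))
    return merged.quantile(q)


partitions = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
blobs = [map_phase(p) for p in partitions]
median = reduce_phase(blobs, 0.5)
print(median)
```

In this toy, the alpha, count, and buckets fields are the primitive components that a struct representation would carry instead of an opaque byte string.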
As for the Series methods, our public methods are mostly for testing in Python. I'd suggest exposing the quantile methods there!
jay
03/21/2024, 5:20 PM
Maxime Petitjean
03/25/2024, 5:39 PM
Sketch datatype because it is possible to compute multiple things on a sketch.
For now I added an approx_sketch aggregation, which aggregates values into a sketch, and a sketch_quantile function, which computes a quantile from a sketch.
Here is a (working!) example:
from daft import from_pydict, col

df = from_pydict(
    {
        "numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "groups": ["g1", "g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2", "g2"],
    }
)
# Aggregate the values of each group into one sketch.
df = df.groupby("groups").agg([col("numbers").approx_sketch().alias("sketches")])
# Compute three quantiles from the same sketch column.
df = (
    df.with_column("first_quartile", col("sketches").sketch_quantile(0.25))
    .with_column("median", col("sketches").sketch_quantile(0.5))
    .with_column("third_quartile", col("sketches").sketch_quantile(0.75))
    .select("groups", "first_quartile", "median", "third_quartile")
    .sort("groups")
)
print(df.collect())
In this example, you can see that I'm computing 3 different quantiles from the same sketch.
Maxime Petitjean
03/25/2024, 5:39 PM
╭────────┬───────────────────┬────────────────────┬───────────────────╮
│ groups ┆ first_quartile    ┆ median             ┆ third_quartile    │
│ ---    ┆ ---               ┆ ---                ┆ ---               │
│ Utf8   ┆ Float64           ┆ Float64            ┆ Float64           │
╞════════╪═══════════════════╪════════════════════╪═══════════════════╡
│ g1     ┆ 1.993661701417351 ┆ 2.9742334234767167 ┆ 4.014835333028612 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ g2     ┆ 7.028793021534831 ┆ 7.924973703917148  ┆ 8.935418643763665 │
╰────────┴───────────────────┴────────────────────┴───────────────────╯
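To see how approximate these values are, the reported quantiles can be compared against exact ones with a short standard-library check. The exact-quantile definition is an assumption here (linear interpolation over the sorted values, via statistics.quantiles with method="inclusive"); other definitions would give slightly different reference values. The reported numbers are copied from the output above.

```python
from statistics import quantiles

# Input data from the example, split by group.
groups = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [6, 7, 8, 9, 10],
}

# Quantiles reported in the output above (first quartile, median, third quartile).
reported = {
    "g1": [1.993661701417351, 2.9742334234767167, 4.014835333028612],
    "g2": [7.028793021534831, 7.924973703917148, 8.935418643763665],
}

relative_errors = {}
for name, values in groups.items():
    # Exact quartiles under the assumed "inclusive" interpolation method.
    exact = quantiles(values, n=4, method="inclusive")
    relative_errors[name] = [
        abs(approx - truth) / truth for approx, truth in zip(reported[name], exact)
    ]

max_error = max(err for errs in relative_errors.values() for err in errs)
print(f"largest relative error: {max_error:.4%}")
```

Under that assumption, every reported value is within 1% of the exact quantile on this data.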
Maxime Petitjean
04/03/2024, 1:55 PM
jay
04/04/2024, 9:34 PM