# general
k
If I suspect that there are duplicates within individual partitions, what would be a good way to call distinct without shuffling, while still ensuring non-duplication within each partition?
j
Hmm you want a per-partition distinct?
If your data is already partitioned a certain way, groupby + distinct is going to be cheap because Daft can elide the shuffle that the groupby would otherwise need. You should validate the plan to make sure we're behaving as expected, though
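(A minimal sketch of what validating the plan could look like, assuming a hypothetical Parquet path and a plain `distinct()` dedup; `explain(show_all=True)` prints the plans so you can check whether a repartition/shuffle actually shows up.)

```python
import daft

df = daft.read_parquet("s3://bucket/data/")  # hypothetical path

# Dedup the dataframe; whether this needs a shuffle depends on how the
# data is already partitioned, so check the plan rather than assuming.
deduped = df.distinct()

# Print the unoptimized, optimized, and physical plans to confirm there
# is no unexpected repartition step.
deduped.explain(show_all=True)
```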
k
It's only partitioned by file, not by any column. Is there any way I could use that as my partition? I tried iter_partitions, but is there any way to get the row count from the MicroPartition and also use distinct on the MicroPartition?
j
Oh… I see, hmm. Currently the only workaround I am aware of would be to use a UDF (which today, if you don't specify `batch_size`, will give you columns covering the entire partition); a sketch of that workaround follows below. We try as much as possible to abstract away the concept of a partition from our users, because it really should just be an implementation detail. To our users, Daft should (hopefully) just look like one big dataframe. That's of course not the case right now because of the current architecture, but hopefully in the next few months that will change as we move towards a more streaming-based architecture. In this case, what if we gave you the ability to add the filename as a new column? https://github.com/Eventual-Inc/Daft/issues/2808
You would then be able to do a groupby on the filename. I wonder if we could even somehow figure out that the data is indeed partitioned by filename. Just added that to the issue as well.
• Approx count distinct aggregations: `df["text"].approx_count_distinct()` (NOTE: this is much faster and more efficient than an exact count distinct, which we don't even have an implementation for because it's so inefficient)
• I think we can expose a distinct on the groupby as well. Let us know what API could be good here
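(A minimal sketch of the UDF workaround mentioned above, assuming the column being deduplicated is named `text` and that the UDF sees one whole partition per call when no `batch_size` is given. It returns a per-row boolean flag, so the output length matches the input, and the flag is then used as a filter.)

```python
import daft
from daft import DataType

@daft.udf(return_dtype=DataType.bool())
def is_first_occurrence(texts):
    # With no batch_size on the decorator, `texts` covers the entire partition.
    seen = set()
    flags = []
    for value in texts.to_pylist():
        flags.append(value not in seen)
        seen.add(value)
    return flags

df = daft.read_parquet("s3://bucket/data/")  # hypothetical path
# Keep only the first copy of each value within each partition (no shuffle needed).
per_partition_deduped = df.where(is_first_occurrence(df["text"]))
```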
k
Oh cool yeah that would be good!
j
So probably something like:
daft.read_parquet("s3://...", filename_column="filenames").groupby("filenames").agg(daft.col("text").approx_count_distinct())
k
looks good!
j
Nice, added to the issue, thanks! This seems quite useful in general 😛
k
Thanks! 😄
If I do have a groupby column I could use instead, how would I go about running the by-group distinct on the grouped DF?
j
Technically, you could just groupby and then grab the keys only 😅 😅
I.e. have the column you’re trying to dedup as a key as well
A great way of doing this could be to first run a hash on that column and then do the groupby on the hash + filename?
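(A rough sketch of both suggestions, assuming a hypothetical `filenames` column per the issue above, a `text` column to dedup, and that `Expression.hash()` and the `any_value()` aggregation are available in your Daft version; treat the exact names as assumptions.)

```python
import daft

df = daft.read_parquet("s3://bucket/data/")  # hypothetical; assumes a "filenames" column exists

# Variant 1: make the dedup column a groupby key itself and keep only the keys.
deduped_keys = (
    df.groupby("filenames", "text")
    .agg(df["text"].count().alias("n"))  # any aggregation works; we only want the keys
    .select("filenames", "text")
)

# Variant 2: hash the column first and group on (hash, filename).
deduped_by_hash = (
    df.with_column("text_hash", df["text"].hash())
    .groupby("text_hash", "filenames")
    .agg(df["text"].any_value())  # keep one representative value per group
)
```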
k
Okay thanks!!