# general
k
If I suspect that there are duplicates within individual partitions, what would be a good way to call distinct without shuffling, while still ensuring non-duplication within each partition?
j
Hmm you want a per-partition distinct?
If your data is already partitioned a certain way, groupby + distinct is going to be cheap because Daft can elide the shuffle that the groupby would otherwise need. You should validate the plan to make sure we're behaving as expected, though
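(A minimal sketch of what validating the plan could look like, assuming a hypothetical Parquet path and a plain `distinct()` dedup; `explain(show_all=True)` prints the plans so you can check whether a repartition/shuffle actually shows up.)

```python
import daft

df = daft.read_parquet("s3://bucket/data/")  # hypothetical path

# Dedup the dataframe; whether this needs a shuffle depends on how the
# data is already partitioned, so check the plan rather than assuming.
deduped = df.distinct()

# Print the unoptimized, optimized, and physical plans to confirm there
# is no unexpected repartition step.
deduped.explain(show_all=True)
```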
k
It's only partitioned by file, not by any column. Is there any way I could use that as my partition? I tried iter_partitions, but is there any way to get the row count from the MicroPartition and also use distinct on the MicroPartition?
j
Oh… I see, hmm. Currently the only workaround I am aware of would be to use a UDF (which today, if you don't specify `batch_size`, will give you columns covering the entire partition); a sketch of that workaround follows below. We try as much as possible to abstract away the concept of a partition from our users, because it really should just be an implementation detail. To our users, Daft should (hopefully) just look like one big dataframe. That's of course not the case right now because of the current architecture, but hopefully in the next few months that will change as we move towards a more streaming-based architecture. In this case, what if we gave you the ability to add the filename as a new column? https://github.com/Eventual-Inc/Daft/issues/2808
You would then be able to do a groupby on the filename. I wonder if we could even somehow figure out that the data is indeed partitioned by filename. Just added that to the issue as well.
• Approx count distinct aggregations: `df["text"].approx_count_distinct()` (NOTE: this is much faster and more efficient than an exact count distinct, which we don't even have an implementation for because it's so inefficient)
• I think we can expose a distinct on the groupby as well. Let us know what API could be good here
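(A minimal sketch of the UDF workaround mentioned above, assuming the column being deduplicated is named `text` and that the UDF sees one whole partition per call when no `batch_size` is given. It returns a per-row boolean flag, so the output length matches the input, and the flag is then used as a filter.)

```python
import daft
from daft import DataType

@daft.udf(return_dtype=DataType.bool())
def is_first_occurrence(texts):
    # With no batch_size on the decorator, `texts` covers the entire partition.
    seen = set()
    flags = []
    for value in texts.to_pylist():
        flags.append(value not in seen)
        seen.add(value)
    return flags

df = daft.read_parquet("s3://bucket/data/")  # hypothetical path
# Keep only the first copy of each value within each partition (no shuffle needed).
per_partition_deduped = df.where(is_first_occurrence(df["text"]))
```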
k
Oh cool yeah that would be good!
j
So probably something like:
daft.read_parquet("s3://...", filename_column="filenames").groupby("filenames").agg(daft.col("text").approx_count_distinct())
k
looks good!
j
Nice, added to the issue, thanks! This seems quite useful in general 😛
k
Thanks! 😄
If I do have a groupby column I could use instead, how would I go about running the by-group distinct on the grouped DF?
j
Technically, you could just groupby and then grab the keys only 😅 😅
I.e. have the column you’re trying to dedup as a key as well
A great way of doing this could be to first run a hash on that column and then do the groupby on the hash + filename?
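(A rough sketch of both suggestions, assuming a hypothetical `filenames` column per the issue above, a `text` column to dedup, and that `Expression.hash()` and the `any_value()` aggregation are available in your Daft version; treat the exact names as assumptions.)

```python
import daft

df = daft.read_parquet("s3://bucket/data/")  # hypothetical; assumes a "filenames" column exists

# Variant 1: make the dedup column a groupby key itself and keep only the keys.
deduped_keys = (
    df.groupby("filenames", "text")
    .agg(df["text"].count().alias("n"))  # any aggregation works; we only want the keys
    .select("filenames", "text")
)

# Variant 2: hash the column first and group on (hash, filename).
deduped_by_hash = (
    df.with_column("text_hash", df["text"].hash())
    .groupby("text_hash", "filenames")
    .agg(df["text"].any_value())  # keep one representative value per group
)
```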
k
Okay thanks!!