Raunak Bhagat
08/22/2024, 6:38 PM
count method with some existing `CountMode`s. For `count_distinct`, would it be better if instead of a new expression, I just created a new `CountMode`?
I.e., something like `CountMode::Distinct`? The current `CountMode`s are `CountMode::All`, `CountMode::Valid`, and `CountMode::Null`.
Kevin Wang
08/22/2024, 7:02 PM
`COUNT(DISTINCT col)`) it would make sense to add it to count, but if we were to be similar to pyspark, they have a specific `countDistinct` function. My opinion is we should have a separate function since counting distinct is functionally pretty different from the other count modes, but I could go both ways.
Raunak Bhagat
08/22/2024, 7:03 PM
count(count_mode = ...)
count_distinct()
count_distinct_approx()
Kevin Wang
08/22/2024, 7:06 PM
`approx_percentiles`. We could rename that expression and set the original name as an alias with a deprecation warning if we wanted to change it.
jay
08/22/2024, 7:20 PM
Desmond Cheong
08/22/2024, 7:45 PM
`approx_count_distinct` is the more conventional name
Raunak Bhagat
08/27/2024, 4:54 AM
count_approx_distinct.
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.countApproxDistinct.html?highlight=approx#pyspark.RDD.countApproxDistinct
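
For reference, the rename-plus-alias idea from the 7:06 PM message (keep the old name working, but have it emit a deprecation warning) could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: `deprecated_alias` is a hypothetical helper, the two function names are just candidates floated in this thread, and the body is a placeholder exact count rather than a real approximate algorithm. It is not the project's actual API.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Hypothetical helper: build a wrapper that warns, then forwards to new_func."""

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not this wrapper
        )
        return new_func(*args, **kwargs)

    # functools.wraps copied the new name; restore the old one on the alias.
    wrapper.__name__ = old_name
    return wrapper


def approx_count_distinct(values):
    # Placeholder body (assumption): an exact distinct count that skips None,
    # standing in for a sketch-based estimate.
    return len({v for v in values if v is not None})


# The old spelling keeps working but warns on every call.
count_distinct_approx = deprecated_alias(approx_count_distinct, "count_distinct_approx")
```

Calling `count_distinct_approx([1, 1, 2, None])` returns the same result as `approx_count_distinct` and raises a `DeprecationWarning`, so existing callers keep working while being pointed at the new name.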