Kyle
09/26/2024, 8:13 AM
jay
09/26/2024, 8:43 AM
jay
09/26/2024, 8:44 AM
Kyle
09/26/2024, 8:52 AM
jay
09/26/2024, 8:59 AM
batch_size will give you columns on the entire partition).
We try as much as possible to abstract away the concept of a partition from our users, because it really should just be an implementation detail. To our users, Daft should (hopefully) just look like one big dataframe. That’s of course not the case yet because of the current architecture, but hopefully that will change in the next few months as we move towards a more streaming-based architecture.
In this case, what if we gave you the ability to add the filename as a new column? https://github.com/Eventual-Inc/Daft/issues/2808
jay
09/26/2024, 9:04 AM
df["text"].approx_count_distinct() (NOTE: this is much faster and more efficient than an exact count, which we don’t even have an implementation for because it’s so inefficient)
• I think we can expose a distinct on the groupby as well. Let us know what API would be good here
Kyle
09/26/2024, 9:07 AM
jay
09/26/2024, 9:07 AM
daft.read_parquet("s3://...", filename_column="filenames").groupby("filenames").agg(df["text"].count_distinct_approx())
Kyle
09/26/2024, 9:07 AM
jay
09/26/2024, 9:08 AM
Kyle
09/26/2024, 9:08 AM
Kyle
09/26/2024, 10:34 AM
jay
09/26/2024, 4:27 PM
jay
09/26/2024, 4:27 PM
jay
09/26/2024, 4:28 PM
Kyle
09/26/2024, 11:31 PM
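For anyone reading the thread later, here is a rough pure-Python sketch of the idea discussed above: tag each row with the file it came from, group by that filename, and keep a small fixed-size summary per group instead of the full set of values. It uses a k-minimum-values estimator purely as an illustration. The thread does not say which algorithm Daft uses, so this is not Daft's implementation, and `KMVSketch`, `approx_distinct_per_file`, and the file names are all hypothetical.

```python
import hashlib
import heapq
from collections import defaultdict

HASH_SPACE = 2 ** 64  # size of the 64-bit hash range used below


def _hash64(value: str) -> int:
    """Map a string to a roughly uniform 64-bit integer."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class KMVSketch:
    """Hypothetical k-minimum-values sketch: keeps only the k smallest hashes."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is the largest kept hash
        self._members = set()  # the hashes currently kept, for duplicate checks

    def add(self, value: str) -> None:
        h = _hash64(value)
        if h in self._members:
            return  # already counted
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> int:
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen: the count is exact
            # (ignoring 64-bit hash collisions).
            return len(self._heap)
        kth_smallest = -self._heap[0]
        return (self.k - 1) * HASH_SPACE // kth_smallest


def approx_distinct_per_file(rows, k: int = 1024) -> dict:
    """rows is an iterable of (filename, text) pairs; returns filename -> estimate."""
    sketches = defaultdict(lambda: KMVSketch(k))
    for filename, text in rows:
        sketches[filename].add(text)
    return {name: sketch.estimate() for name, sketch in sketches.items()}


rows = [
    ("a.parquet", "x"),
    ("a.parquet", "y"),
    ("a.parquet", "x"),
    ("b.parquet", "z"),
]
print(approx_distinct_per_file(rows))  # {'a.parquet': 2, 'b.parquet': 1}
```

Memory per group is bounded by `k` no matter how many rows a file has, which is the general reason an approximate distinct count scales better than an exact one. The trade-off is a relative error of roughly `1 / sqrt(k)` once a group holds more than `k` distinct values.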