jay
08/13/2024, 11:11 PM
df.count() ?
https://github.com/Eventual-Inc/Daft/issues/1996
Some users have said that it’s probably more expected behavior to have that return an int of the row count of the dataframe, instead of the current behavior
Sammy Sidhu
08/13/2024, 11:25 PM
Kevin Wang
08/15/2024, 9:01 PM
count, could we consider making the aggregation expression col("a").count() be a count on the number of non-null rows in column "a"? And if someone wants a count on the number of rows, we could provide count("*") now that we have support for wildcards. This would be more consistent with Spark and SQL
jay
08/15/2024, 10:05 PM
> making the aggregation expression col("a").count() be a count on the number of non-null rows in column “a”?
I think this is already the current behavior? Could be wrong
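For reference, the SQL semantics this proposal is being compared against can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in, not Daft itself; the table name df and its contents are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for SQL semantics (not Daft).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (a INTEGER)")  # hypothetical table, named after the thread's "df"
conn.executemany("INSERT INTO df (a) VALUES (?)", [(1,), (None,), (3,)])

# COUNT(a) skips NULLs in column a; COUNT(*) counts every row.
non_null_count, row_count = conn.execute(
    "SELECT COUNT(a), COUNT(*) FROM df"
).fetchone()

print(non_null_count, row_count)
```

With one NULL among three rows, COUNT(a) and COUNT(*) differ by one, which is the distinction between col("a").count() and count("*") being discussed.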
jay
08/15/2024, 10:06 PM
> And if someone wants a count on the number of rows, we could provide count("*") now that we have support for wildcards
And yeah I think df.count() and df.count("*") should essentially be aliases
Kevin Wang
08/15/2024, 10:06 PM
Kevin Wang
08/15/2024, 10:21 PM
DataFrame.count and GroupedDataFrame.count as well
2. DataFrame.count has the same behavior except when you give it no parameters or "*", in which case it returns an integer instead of a dataframe
jay
08/15/2024, 10:37 PM
DataFrame.count("*", mode="non-null") mean though?
2. I think having it return 2 different things might be more confusing. The current behavior here feels fine to me (“feels like SQL”). If a user wants an integer they can go ahead and use count_rows() which will return an int I think
Kevin Wang
08/15/2024, 10:39 PM
count to have the behavior of count_rows()?
Kevin Wang
08/15/2024, 10:58 PM
DataFrame.count to have?
Cory Grinstead
08/16/2024, 12:49 AM
df.count() to me is the same as df.count_rows() or select count(*) from df. IMO if you want to do an aggregate, then you should have to specify, either through .agg or an implicit aggregation via select (which we don't yet support).
• df.agg(count('*'))
• df.select(count('*'))
• df.select(col('a').count())
• df.agg(col('a').count())
but then again, count(*) usually has special handling in SQL that means count_rows instead of a count on all rows.
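Both sides of the int-versus-dataframe question have a SQL analogue, which a rough sketch can show. Again this uses Python's sqlite3 as a stand-in rather than Daft, with a made-up table df: count(*) counts rows even when every column in a row is NULL, and the query itself returns a one-row result that the caller unwraps to get a plain integer.

```python
import sqlite3

# In-memory SQLite database used only to illustrate SQL behavior (not Daft).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (a INTEGER, b TEXT)")  # hypothetical table for illustration
conn.executemany(
    "INSERT INTO df (a, b) VALUES (?, ?)",
    [(1, "x"), (None, None), (None, None)],
)

# count(*) is a row count: rows whose columns are all NULL are still counted.
result_rows = conn.execute("SELECT count(*) FROM df").fetchall()

# The query result is a one-row, one-column relation, not a bare integer.
# Getting an int (the count_rows()-style answer) is a separate unwrapping step.
scalar_count = result_rows[0][0]

print(result_rows, scalar_count)
```

In these terms, a count that returns a dataframe mirrors result_rows, while a count that returns an int mirrors scalar_count.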