# daft-dev
j
Also folks, any thoughts on making a breaking change to `df.count()`? https://github.com/Eventual-Inc/Daft/issues/1996 Some users have said that it’s probably more expected behavior to have that return an `int` of the row count of the dataframe, instead of the current behavior
s
Do it!
k
If we want to make some breaking changes to `count`, could we consider making the aggregation expression `col("a").count()` be a count of the non-null rows in column "a"? And if someone wants a count of the number of rows, we could provide `count("*")` now that we have support for wildcards. This would be more consistent with Spark and SQL.
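For reference, the Spark/SQL convention this proposal mirrors: `COUNT(*)` counts every row, while `COUNT(a)` skips NULLs. A minimal pure-Python sketch of those semantics (the data is made up for illustration; this is not Daft code):

```python
# SQL-style count semantics over plain Python rows (None stands in for NULL).
rows = [{"a": 1}, {"a": None}, {"a": 3}]

count_star = len(rows)                                # COUNT(*): all rows
count_a = sum(1 for r in rows if r["a"] is not None)  # COUNT(a): non-null only

print(count_star, count_a)  # 3 2
```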
j
> making the aggregation expression `col("a").count()` be a count on the number of non-null rows in column “a”?

I think this is already the current behavior? Could be wrong

> And if someone wants a count on the number of rows, we could provide `count("*")` now that we have support for wildcards

And yeah I think `df.count()` and `df.count("*")` should essentially be aliases
k
ah you're right we have a setting for all, non-null, or null count
🙌 1
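A toy single-column sketch of the three modes being referred to, to make the distinction concrete (the mode names here are illustrative, not Daft's actual API):

```python
# Toy single-column count supporting the three modes mentioned above.
# Mode names ("all", "valid", "null") are illustrative, not Daft's API.
def count_column(values, mode="valid"):
    if mode == "all":
        return len(values)
    if mode == "valid":  # non-null count
        return sum(v is not None for v in values)
    if mode == "null":
        return sum(v is None for v in values)
    raise ValueError(f"unknown count mode: {mode!r}")

a = [1, None, 3, None]
print(count_column(a, "all"), count_column(a, "valid"), count_column(a, "null"))  # 4 2 2
```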
Here's what I am thinking then:
1. we add the count mode parameter to `DataFrame.count` and `GroupedDataFrame.count` as well
2. `DataFrame.count` has the same behavior except when you give it no parameters or "*", in which case it returns an integer instead of a dataframe
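If it helps make the proposal concrete, here is a rough sketch of the dispatch being described, with a dict-of-lists standing in for a DataFrame (hypothetical code for discussion, not Daft's implementation):

```python
# Hypothetical sketch of the proposed DataFrame.count behavior.
# `df` is a dict-of-lists standing in for a DataFrame; the returned dict
# stands in for the single-row aggregate DataFrame.
def count(df, *cols, mode="valid"):
    if not cols or cols == ("*",):
        # No arguments (or "*"): return a plain int row count.
        return len(next(iter(df.values()), []))
    # Otherwise: per-column count under the given mode, like today.
    def count_one(values):
        if mode == "all":
            return len(values)
        if mode == "valid":
            return sum(v is not None for v in values)
        return sum(v is None for v in values)  # mode == "null"
    return {c: count_one(df[c]) for c in cols}

df = {"a": [1, None, 3], "b": ["x", "y", None]}
print(count(df))            # 3
print(count(df, "a", "b"))  # {'a': 2, 'b': 2}
```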
j
1. Maybe… but what would `DataFrame.count("*", mode="non-null")` mean though?
2. I think having it return 2 different things might be more confusing. The current behavior here feels fine to me (“feels like SQL”). If a user wants an integer they can go ahead and use `count_rows()`, which will return an int I think
k
1. I'm thinking that would just give a non-null count of each of the columns
2. I think we want to change `count` to have the behavior of `count_rows()`?

@Sammy Sidhu @Cory Grinstead I saw that both of you said this was a confusion point. What behavior do you expect `DataFrame.count` to have?
c
`df.count()` to me is the same as `df.count_rows()` or `select count(*) from df`. IMO if you want to do an aggregate, then you should have to specify, either through `.agg` or an implicit aggregation via `select` (which we don't yet support):

```
df.agg(count('*'))
df.select(count('*'))
df.select(col('a').count())
df.agg(col('a').count())
```

but then again, `count(*)` usually has special handling in SQL that means count_rows instead of a count on all rows.
👍 1
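To illustrate the special handling of `count(*)` mentioned above: in SQL, `COUNT(*)` counts rows themselves, so a row whose every column is NULL still counts, whereas a per-column count would skip it. A quick pure-Python illustration (made-up data, not Daft code):

```python
# COUNT(*) counts rows directly; it is not a per-column count over all columns.
rows = [
    {"a": 1,    "b": "x"},
    {"a": None, "b": None},  # all-NULL row: still counted by COUNT(*)
]

count_star = len(rows)  # COUNT(*) semantics: 2
per_column = {
    c: sum(1 for r in rows if r[c] is not None) for c in ("a", "b")
}  # per-column non-null counts: {'a': 1, 'b': 1}

print(count_star, per_column)
```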