# daft-dev
j
Also folks, any thoughts on making a breaking change to `df.count()`? https://github.com/Eventual-Inc/Daft/issues/1996 Some users have said that it’s probably more expected behavior to have that return an `int` of the row count of the dataframe, instead of the current behavior
s
Do it!
k
If we want to make some breaking changes to `count`, could we consider making the aggregation expression `col("a").count()` be a count of the non-null rows in column "a"? And if someone wants a count of the number of rows, we could provide `count("*")` now that we have support for wildcards. This would be more consistent with Spark and SQL.
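For reference, the Spark/SQL convention this proposal mirrors: `COUNT(*)` counts every row, while `COUNT(a)` skips NULLs. A minimal pure-Python sketch of those semantics (the data is made up for illustration; this is not Daft code):

```python
# SQL-style count semantics over plain Python rows (None stands in for NULL).
rows = [{"a": 1}, {"a": None}, {"a": 3}]

count_star = len(rows)                                # COUNT(*): all rows
count_a = sum(1 for r in rows if r["a"] is not None)  # COUNT(a): non-null only

print(count_star, count_a)  # 3 2
```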
j
> making the aggregation expression `col("a").count()` be a count on the number of non-null rows in column “a”?

I think this is already the current behavior? Could be wrong

> And if someone wants a count on the number of rows, we could provide `count("*")` now that we have support for wildcards

And yeah I think `df.count()` and `df.count("*")` should essentially be aliases
k
ah you're right we have a setting for all, non-null, or null count
🙌 1
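A toy single-column sketch of the three modes being referred to, to make the distinction concrete (the mode names here are illustrative, not Daft's actual API):

```python
# Toy single-column count supporting the three modes mentioned above.
# Mode names ("all", "valid", "null") are illustrative, not Daft's API.
def count_column(values, mode="valid"):
    if mode == "all":
        return len(values)
    if mode == "valid":  # non-null count
        return sum(v is not None for v in values)
    if mode == "null":
        return sum(v is None for v in values)
    raise ValueError(f"unknown count mode: {mode!r}")

a = [1, None, 3, None]
print(count_column(a, "all"), count_column(a, "valid"), count_column(a, "null"))  # 4 2 2
```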
Here's what I am thinking then:
1. we add the count mode parameter to `DataFrame.count` and `GroupedDataFrame.count` as well
2. `DataFrame.count` has the same behavior except when you give it no parameters or "*", in which case it returns an integer instead of a dataframe
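If it helps make the proposal concrete, here is a rough sketch of the dispatch being described, with a dict-of-lists standing in for a DataFrame (hypothetical code for discussion, not Daft's implementation):

```python
# Hypothetical sketch of the proposed DataFrame.count behavior.
# `df` is a dict-of-lists standing in for a DataFrame; the returned dict
# stands in for the single-row aggregate DataFrame.
def count(df, *cols, mode="valid"):
    if not cols or cols == ("*",):
        # No arguments (or "*"): return a plain int row count.
        return len(next(iter(df.values()), []))
    # Otherwise: per-column count under the given mode, like today.
    def count_one(values):
        if mode == "all":
            return len(values)
        if mode == "valid":
            return sum(v is not None for v in values)
        return sum(v is None for v in values)  # mode == "null"
    return {c: count_one(df[c]) for c in cols}

df = {"a": [1, None, 3], "b": ["x", "y", None]}
print(count(df))            # 3
print(count(df, "a", "b"))  # {'a': 2, 'b': 2}
```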
j
1. Maybe… but what would `DataFrame.count("*", mode="non-null")` mean though?
2. I think having it return 2 different things might be more confusing. The current behavior here feels fine to me (“feels like SQL”). If a user wants an integer they can go ahead and use `count_rows()`, which will return an int I think
k
1. I'm thinking that would just give a non-null count of each of the columns
2. I think we want to change `count` to have the behavior of `count_rows()`?

@Sammy Sidhu @Cory Grinstead I saw that both of you said this was a confusion point. What behavior do you expect `DataFrame.count` to have?
c
`df.count()` to me is the same as `df.count_rows()` or `select count(*) from df`. IMO if you want to do an aggregate, then you should have to specify, either through `.agg` or an implicit aggregation via `select` (which we don't yet support):

```
df.agg(count('*'))
df.select(count('*'))
df.select(col('a').count())
df.agg(col('a').count())
```

but then again, `count(*)` usually has special handling in SQL that means count_rows instead of a count on all rows.
👍 1
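To illustrate the special handling of `count(*)` mentioned above: in SQL, `COUNT(*)` counts rows themselves, so a row whose every column is NULL still counts, whereas a per-column count would skip it. A quick pure-Python illustration (made-up data, not Daft code):

```python
# COUNT(*) counts rows directly; it is not a per-column count over all columns.
rows = [
    {"a": 1,    "b": "x"},
    {"a": None, "b": None},  # all-NULL row: still counted by COUNT(*)
]

count_star = len(rows)  # COUNT(*) semantics: 2
per_column = {
    c: sum(1 for r in rows if r[c] is not None) for c in ("a", "b")
}  # per-column non-null counts: {'a': 1, 'b': 1}

print(count_star, per_column)
```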