API syntax question I m trying to add a new column to an exi Distributed Data Community #general

avril

04/04/2024, 3:20 PM

API/syntax question -- I'm trying to add a new column to an existing df with fresh contents (i.e. not computed from existing columns). In pandas I would do something like

df["dogs"] = [Dog("ruffles"), Dog("waffles"), Dog("doofus")]

to add the column dogs to df.
• AFAICT df.where only accepts Expressions and so is not suitable for this (?)
• The pandas syntax of direct column assignment is not supported (probably for good reason, I've always found it messy/risky) 😅
• So the 'easiest' way I came up with was to do a join. This works but seems a bit overkill in terms of complexity.

Am I overlooking something here or is this the recommended approach atm?

jay

04/04/2024, 3:27 PM

Are these contents just data that you've generated on your local machine?

Since Daft is distributed and partitioned, a simple column append operation is actually a little more complex, since it involves having to split that column correctly along partition boundaries. Oftentimes we don't even know the length of the dataframe until materialization, so that's another interesting obstacle to overcome.

If your column has the same data across all rows, you could do

with_column("foo", daft.lit(1))

Otherwise, a join would be the most canonical way of doing this right now, but if you have other ideas on how to enable this, let us know!
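To make the partitioning obstacle concrete, here is a toy, library-free sketch of what splitting a locally generated column across partitions involves. It is not how Daft works internally; split_along_partitions and partition_lengths are hypothetical names, and the sketch assumes the per-partition row counts are already known, which, as noted above, is often not the case before materialization.

```python
from itertools import islice


def split_along_partitions(values, partition_lengths):
    """Split a local list into chunks matching each partition's row count."""
    if sum(partition_lengths) != len(values):
        raise ValueError("column length does not match total row count")
    it = iter(values)
    # Consume exactly n values for each partition, in partition order.
    return [list(islice(it, n)) for n in partition_lengths]


# Hypothetical layout: three partitions holding 2, 0 and 1 rows.
partition_lengths = [2, 0, 1]
chunks = split_along_partitions(["ruffles", "waffles", "doofus"], partition_lengths)
```

Even in this simplified form, the split depends on knowing every partition's length and on the row order being stable, neither of which a lazily evaluated, distributed dataframe guarantees up front.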

avril

04/04/2024, 3:44 PM

Thanks @jay! I had my head stuck in pandas land 🤦‍♀️ 🙂 In Dask you can do

df["new_col"] = contents

where contents is either a scalar (i.e. like your lit() suggestion) or a Dask array. But with the Dask array you'd still need to do some work to align partitions, so it's indeed not straightforward.
