# general
a
API/syntax question -- I'm trying to add a new column to an existing df with fresh contents (i.e. not computed from existing columns). In pandas I would do something like
df["dogs"] = [Dog("ruffles"), Dog("waffles"), Dog("doofus")]
to add col `dogs` to `df`.
• AFAICT `df.where` only accepts `Expressions` and so is not suitable for this (?)
• the pandas syntax of direct column assignment is not supported (probably for good reason, I've always found it messy/risky) 😅
• so the 'easiest' way I came up with was to do a `join` - this works but seems a bit overkill in terms of complexity
Am I overlooking something here or is this the recommended approach atm?
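For readers following along, here is a rough sketch of the idea behind the join approach, written in plain Python rather than with any Daft calls. The `Dog` class, the `id` key column, the `owner` column, and the `join_on_key` helper are all hypothetical and only illustrate how a shared key lets fresh values be attached to existing rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dog:
    name: str


def join_on_key(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on `key` (hypothetical helper)."""
    # Index the right side by key so each left row can look up its match.
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two rows; the key column appears once.
            joined.append({**row, **match})
    return joined


# Existing "dataframe": assumed to already carry a row key named "id".
owners = [
    {"id": 0, "owner": "ann"},
    {"id": 1, "owner": "bob"},
    {"id": 2, "owner": "cy"},
]

# Fresh contents, given the same key so they can be matched up by value
# instead of by row position.
dogs = [Dog("ruffles"), Dog("waffles"), Dog("doofus")]
dog_rows = [{"id": i, "dogs": dog} for i, dog in enumerate(dogs)]

result = join_on_key(owners, dog_rows, "id")
```

Because rows are matched by key and not by position, this does not depend on how either side is ordered or partitioned, which is presumably why a join works where direct assignment does not.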
j
Are these contents just data that you’ve generated on your local machine? Since Daft is distributed and partitioned, a simple column append operation is actually a little more complex, since it involves having to split that column correctly along partition boundaries. Oftentimes we don’t even know the length of the dataframe until materialization, so that’s another interesting obstacle to overcome. If your column has the same data across all rows, you could do `with_column("foo", daft.lit(1))`. Otherwise, a join would be the most canonical way of doing this right now, but if you have other ideas on how to enable this, let us know!
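To make the partitioning obstacle concrete, here is a minimal plain-Python sketch (not Daft internals) of what a positional column append would have to do. The `split_by_partition_lengths` helper and the partition sizes are hypothetical; the point is that the split cannot be computed until every partition's length is known.

```python
def split_by_partition_lengths(values, partition_lengths):
    """Cut a local list into chunks matching each partition's row count."""
    if sum(partition_lengths) != len(values):
        raise ValueError(
            f"column has {len(values)} values but the partitions "
            f"hold {sum(partition_lengths)} rows"
        )
    chunks = []
    start = 0
    for length in partition_lengths:
        chunks.append(values[start:start + length])
        start += length
    return chunks


# Assumed layout: three rows spread over two partitions of sizes 2 and 1.
# In a lazy, distributed dataframe these sizes may only be known after
# the data has been materialized.
partition_lengths = [2, 1]
chunks = split_by_partition_lengths(
    ["ruffles", "waffles", "doofus"], partition_lengths
)
```

A constant column sidesteps all of this because the same value can be filled into each partition independently, whatever its length.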
a
Thanks @jay! I had my head stuck in pandas land 🤦‍♀️ 🙂 In Dask you can do `df["new_col"] = contents` where `contents` is either a scalar (i.e. like your `lit()` suggestion) or a Dask array. But with the Dask array you'd still need to do some work to align partitions, so it's indeed not straightforward.