question about selecting literals. in sql, you ca...
# daft-dev
c
question about selecting literals. in sql, you can usually select literals without a
from
Copy code
> select 1;
> select upper('hello');
I'd expect this to be the equivalent of selecting from an empty dataframe
Copy code
> daft.from_pydict({}).select(daft.lit(1)).collect()
> daft.from_pydict({}).select(daft.lit('hello').str.upper()).collect()
however this applies the schema change, but does not add the data.
Copy code
╭─────────╮
│ literal │
│ ---     │
│ Utf8    │
╞═════════╡
╰─────────╯

(No data to display: Materialized dataframe has no rows)
is this expected, or a bug?
j
Yeah I think the way it currently works is that the
lit
gets broadcasted across the existing dataframe…. So if you have an empty dataframe you get another empty dataframe in return
I’m guessing
select 1;
and
select 1 from my_table
in SQL might give the same distinction?
(Where the latter would also return an empty result set of
my_table
was empty)
c
yeah,
select 1
and
select 1 from my_table
will both return the same result
@jay I definitely think there's a bug in the broadcast logic here. for example:
Copy code
df = daft.from_pydict({
    'a': [1,2,3], 
    'b': ['a', 'b', 'c']
}).select(daft.lit(1)).collect()
I'd expect this to return a single record, but it instead broadcasts it across
n
rows and gives me
literal: [1,1,1]
j
Yeah that’s the current behavior. The way it works is that it performs a broadcast of the literal column against the rest of the dataframe. I don’t think we have a way today of detecting a literal-only projection and special-casing that
Under the hood, this is running a
.eval_expression(lit(1))
on every partition in the distributed dataframe This can either: • [CURRENT BEHAVIOR] Broadcast
lit(1)
to the length of every partition (hence performing a global broadcast) • Return a new partition with just 1 value per-partition The second option would unfortunately mean that our new dataframe would have N_PARTITION number of rows, which is quite weird Alternatively we can detect any projections with only literals and make special-case those to produce only 1 partition with 1 value.
c
Ok, I submitted a PR with special handling for a projection of all literals https://github.com/Eventual-Inc/Daft/pull/2669/files
this should enable sql support for
select 1
and
select 1 from my_table
d
hmmm fwiw in spark and duckdb,
select 1 from my_table
produces a
1
for every row in the table
I bet postgres does the same (edit: it does)
s
Also seeing duckdb does the same thing: https://codapi.org/duckdb/
c
🤔 I swear i just ran this query earlier and it produced a single row.
so maybe it just needs to be a special case for only if it's an empty dataframe.. IE
select <literal>
but not
select <literal> from <tbl>