# general
p
Hi, I ran into a Daft crash in sql_scan.py:221 with TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'. The code at that line is: range_size = (max_val - min_val) / num_scan_tasks. This happens when the table is empty. Should max_val and min_val default to 0 in this case, with the bound returned as 0? In any case, it shouldn't throw an exception, since empty tables are a valid use case. The collect should just return an empty Daft DataFrame, right?
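For illustration, here's a minimal sketch of the kind of guard I mean — this is not the actual sql_scan.py code, just the idea, reusing the names from the line above:
# Hypothetical guard, not the real Daft internals: an empty table gives
# min_val/max_val of None, so skip the range math instead of crashing
if min_val is None or max_val is None:
    range_size = 0  # treat the empty table as a single, empty scan task
else:
    range_size = (max_val - min_val) / num_scan_tasks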
d
Any chance you could provide a sample query?
p
The query is pretty simple: SELECT * FROM mydb.myschema.mytable with mytable contains an id column which is used as partitition_col and num of partitions set to 1. for daft.read_sql. The database is MS SQL server and I am using SqlAlchemy. I verified the table indeed empty.
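Roughly, the call looks like this (the connection string and names are simplified, not my real setup):
import daft
from sqlalchemy import create_engine

def create_conn():
    # Simplified; the real URL points at our MS SQL Server instance
    return create_engine("mssql+pyodbc://user:pass@host/mydb?driver=ODBC+Driver+17+for+SQL+Server").connect()

df = daft.read_sql(
    "SELECT * FROM mydb.myschema.mytable",
    create_conn,
    partition_col="id",
    num_partitions=1,
)
df.collect()  # crashes here because mytable is empty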
c
Yeah, we should be able to return an empty DataFrame. And in the case where the min/max are None, we should default to just reading with 1 partition. However, it's interesting that Daft is still calculating partition bounds even though num_partitions=1. Could you try removing it and see if it works?
p
Yeah, I can remove both partition_col and num_partitions.
Sorry, the num_partitions I passed was actually 2, not 1, which explains why there are 2 scan tasks.
That also explains why Daft is calculating bounds, but we will probably need to guard against empty tables regardless of what value num_partitions is set to.
👍 1
@Colin Ho will this issue be addressed in the upcoming release?
d
I took a quick look, seems like a small fix. I can send out a PR shortly that'll make Monday's release.
🙌 1
Capping the number of scan tasks so it doesn't exceed the number of rows in the table: https://github.com/Eventual-Inc/Daft/pull/2885
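Conceptually the cap is just something like this (illustrative names, not the actual code from the PR):
# Never create more scan tasks than there are rows (and always at least one)
num_scan_tasks = max(1, min(requested_num_partitions, total_rows))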
🙏 1
p
Hi @Desmond Cheong, just wanted to give a quick update: I verified the fix and it is working. Thanks!
d
Awesome, glad to hear it 🙂
p
Now I'm having another crash. Not sure if this is a bug. This time it is crashing at daft/table/micropartition.py#117. It complains about a schema mismatch in the builtin concat method. The exact error message:
daft.exceptions.DaftCoreException: DaftError::SchemaMismatch MicroPartition concat requires all schemas to match
Column Name | Type
path | Null
-------------------------
Column Name | Type
path | Utf8
My guess is that this is also related to num_partitions: if some partitions are empty, no files are written to the destination for them, so the paths will be Null for those partitions. When Daft tries to concat the paths together, it throws the above exception.
d
Hmm yeah, that sounds like what might be happening. The SQL database also doesn't return proper type information if there are no rows, which doesn't help. I'll take a look. As a temporary workaround, could you filter out tables with no rows before the concat?
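Something roughly like this, for example — the output path and variable names are just illustrative:
# Hypothetical workaround: skip the write for tables that turn out to be empty,
# so the concat never sees a Null-typed "path" column from an empty partition
if df.count_rows() > 0:
    results = df.write_parquet("s3://my-bucket/output/")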
c
Another workaround would be to pass in a 'schema hint', example:
schema = {
    "path": daft.DataType.string(),  # This should override the "path" column to be a String
}
df = daft.read_sql("SELECT * FROM TABLE", create_conn, schema=schema)
which should help override the column type for 'path' to always be a String
p
In my query I provided the schema, so I am not sure if this "Schema" mismatch refers to the database schema. My table doesn't have a "path" column either, so I assume this is purely a Daft thing. When Daft's write_parquet is called, it writes one or more files to the S3 bucket and records the files written in the "path" column of the DataFrame that write_parquet returns. Each micropartition has its own Daft DataFrame, and the final DataFrame returned to the user by write_parquet is the concatenation of the DataFrames from each micropartition.
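To illustrate what I mean (the data and output path here are made up):
import daft

df = daft.from_pydict({"a": [1, 2, 3]})
result = df.write_parquet("output/")  # in my case this is an s3:// path
result.show()  # the returned DataFrame has a "path" column listing the files written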
c
Ooohhh I see, so it is a write_parquet issue then. Got it.
I actually got a simple reproduction of this: 😅
import daft

df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}).into_partitions(4).write_parquet("z")
print(df)
The issue here is that we need to make the column type for "path" a String even if no files were written; this should be a simple fix.
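Roughly, the idea is something like this (illustrative only, not the actual Daft internals):
import daft

# Illustrative only: an empty write result whose "path" column is explicitly a string
# concats cleanly with results from partitions that did write files, instead of
# showing up with a Null-typed column
empty = daft.from_pydict({"path": []})
empty = empty.with_column("path", daft.col("path").cast(daft.DataType.string()))
print(empty.schema())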
👍 1
🙌 2
p
Needless to say, this fix might need to be replicated to the other write_xxxx calls, unless the underlying code is shared across all the write variants, in which case we only need to fix a single place.
👍 1
c
Forgot to mention, but reading / writing empty partitions should be fixed in the latest release!
p
That is great. Thank you @Colin Ho!