# general
p
Hi, I ran into a Daft crash in sql_scan.py:221 with TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'. The code at that line is: range_size = (max_val - min_val) / num_scan_tasks. This happens when the table is empty. Should max_val and min_val default to 0 in this case, with the bound returned as 0? In any case, it shouldn't throw an exception, since empty tables are a valid use case. The collect should just return an empty Daft DataFrame, right?
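For illustration, here's a minimal sketch of the kind of guard I mean — this is not the actual sql_scan.py code, just the idea, reusing the names from the line above:
# Hypothetical guard, not the real Daft internals: an empty table gives
# min_val/max_val of None, so skip the range math instead of crashing
if min_val is None or max_val is None:
    range_size = 0  # treat the empty table as a single, empty scan task
else:
    range_size = (max_val - min_val) / num_scan_tasks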
d
Any chance you could provide a sample query?
p
The query is pretty simple: SELECT * FROM mydb.myschema.mytable with mytable contains an id column which is used as partitition_col and num of partitions set to 1. for daft.read_sql. The database is MS SQL server and I am using SqlAlchemy. I verified the table indeed empty.
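Roughly, the call looks like this (the connection string and names are simplified, not my real setup):
import daft
from sqlalchemy import create_engine

def create_conn():
    # Simplified; the real URL points at our MS SQL Server instance
    return create_engine("mssql+pyodbc://user:pass@host/mydb?driver=ODBC+Driver+17+for+SQL+Server").connect()

df = daft.read_sql(
    "SELECT * FROM mydb.myschema.mytable",
    create_conn,
    partition_col="id",
    num_partitions=1,
)
df.collect()  # crashes here because mytable is empty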
c
Yeah, we should be able to return an empty DataFrame. And in the case where the min/max are None, we should default to just reading with 1 partition. However, it's interesting that Daft is still calculating partition bounds even though num_partitions=1. Could you try removing it and see if it works?
p
Yeah, I can remove both partition_col and num_partitions.
Sorry, the num_partitions I passed was actually 2, not 1, which explains why there are 2 scan tasks.
That also explains why Daft is calculating bounds, but we will probably need to guard against empty tables regardless of what value num_partitions is set to.
👍 1
@Colin Ho will this issue be addressed in the upcoming release?
d
I took a quick look, seems like a small fix. I can send out a PR shortly that'll make Monday's release.
🙌 1
Capping the number of scan tasks so it doesn't exceed the number of rows in the table: https://github.com/Eventual-Inc/Daft/pull/2885
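Conceptually the cap is just something like this (illustrative names, not the actual code from the PR):
# Never create more scan tasks than there are rows (and always at least one)
num_scan_tasks = max(1, min(requested_num_partitions, total_rows))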
🙏 1
p
Hi @Desmond Cheong, just wanted to give a quick update: I verified the fix and it is working. Thanks!
d
Awesome, glad to hear it 🙂
p
Now I'm having another crash. Not sure if this is a bug. This time it is crashing at daft/table/micropartition.py#117. It complains about a schema mismatch in the builtin concat method. The exact error message:
daft.exceptions.DaftCoreException: DaftError::SchemaMismatch MicroPartition concat requires all schemas to match
Column Name | Type
path | Null
-------------------------
Column Name | Type
path | Utf8
My guess is that this is also related to num_partitions: if some partitions are empty, no files are written to the destination for them, so the paths will be Null for those partitions. When Daft tries to concat the paths together, it throws the above exception.
d
Hmm yeah, that sounds like what might be happening. The SQL database also doesn't return proper type information if there are no rows, which doesn't help. I'll take a look. As a temporary workaround, could you filter out tables with no rows before the concat?
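Something roughly like this, for example — the output path and variable names are just illustrative:
# Hypothetical workaround: skip the write for tables that turn out to be empty,
# so the concat never sees a Null-typed "path" column from an empty partition
if df.count_rows() > 0:
    results = df.write_parquet("s3://my-bucket/output/")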
c
Another workaround would be to pass in a 'schema hint', example:
schema = {
    "path": daft.DataType.string(),  # This should override the "path" column to be a String
}
df = daft.read_sql("SELECT * FROM TABLE", create_conn, schema=schema)
which should help override the column type for 'path' to always be a String
p
In my query I provided the schema, so I am not sure if this "Schema" mismatch refers to the database schema. My table doesn't have a "path" column either, so I assume this is purely a Daft thing. When Daft's write_parquet is called, it writes one or more files to the S3 bucket and records the files written in the "path" column of the DataFrame that write_parquet returns. Each micropartition has its own Daft DataFrame, and the final DataFrame returned to the user by write_parquet is the concatenation of the DataFrames from each micropartition.
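To illustrate what I mean (the data and output path here are made up):
import daft

df = daft.from_pydict({"a": [1, 2, 3]})
result = df.write_parquet("output/")  # in my case this is an s3:// path
result.show()  # the returned DataFrame has a "path" column listing the files written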
c
Ooohhh I see, so it is a write_parquet issue then. Got it.
I actually got a simple reproduction of this: 😅
import daft

df = daft.from_pydict({"foo": [1, 2, 3], "bar": ["a", "b", "c"]}).into_partitions(4).write_parquet("z")
print(df)
The issue here is that we need to make the column type for "path" a String even if no files were written; this should be a simple fix.
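Roughly, the idea is something like this (illustrative only, not the actual Daft internals):
import daft

# Illustrative only: an empty write result whose "path" column is explicitly a string
# concats cleanly with results from partitions that did write files, instead of
# showing up with a Null-typed column
empty = daft.from_pydict({"path": []})
empty = empty.with_column("path", daft.col("path").cast(daft.DataType.string()))
print(empty.schema())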
👍 1
🙌 2
p
Needless to say, this fix might need to be replicated to the other write_xxxx calls, unless the underlying code is shared across all the write variants, in which case we only need to fix a single place.
👍 1
c
Forgot to mention, but reading / writing empty partitions should be fixed in the latest release!
p
That is great. Thank you @Colin Ho!