# general
s
Is there a best way to write to a database and do some upserts from a Daft dataframe?
j
Haha, that’s a very general question, and it depends on what database you’re targeting!
s
I see. Let's say our database is Postgres / SQL Server, and maybe we're on AWS RDS.
j
We currently support writing out to Parquet, so you can run an ingestion into those databases from the Parquet files. That being said, I wonder if it may be possible to leverage something like ADBC or JDBC. Upserts are even more interesting, because they involve not just an append but some kind of upsert API to specify criteria for updating/deleting rows.
Do you have an API in mind?
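For reference, a minimal sketch of that Parquet-then-ingest path, assuming a Postgres target and the `adbc-driver-postgresql` package; the paths, connection URI, and table name are placeholders:

```python
import daft
import pyarrow.parquet as pq
import adbc_driver_postgresql.dbapi as pg  # assumption: pip install adbc-driver-postgresql

df = daft.read_parquet("s3://bucket/source/*.parquet")  # hypothetical source data

# Step 1: write out to Parquet, which Daft supports natively
df.write_parquet("/tmp/staging/")

# Step 2: bulk-append the staged files into Postgres via ADBC
conn = pg.connect("postgresql://user:pass@host:5432/db")  # placeholder URI
with conn.cursor() as cur:
    table = pq.read_table("/tmp/staging/")
    cur.adbc_ingest("target_table", table, mode="append")  # append only, no upsert
conn.commit()
conn.close()
```

Note that `adbc_ingest` only appends or creates tables; the upsert criteria discussed above would still have to live in SQL on the database side.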
s
Because I want to replace our ETL stack (SSIS); currently our main target is a database. I think we can use ADBC or JDBC, or maybe ODBC itself via pyodbc.
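As a rough sketch of what the pyodbc route could look like for upserts against SQL Server, using a `MERGE` statement (the DSN, table, and columns here are made up; Postgres would use `INSERT ... ON CONFLICT` instead):

```python
import pyodbc  # assumption: an ODBC driver for SQL Server is installed

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myhost;DATABASE=mydb;UID=user;PWD=pass"  # placeholder credentials
)
cur = conn.cursor()
cur.fast_executemany = True  # batch the parameterized statement efficiently

rows = [(1, "a"), (2, "b")]  # hypothetical (id, val) batch
cur.executemany(
    """
    MERGE target_table AS t
    USING (SELECT ? AS id, ? AS val) AS s
        ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET val = s.val
    WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);
    """,
    rows,
)
conn.commit()
```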
j
Do you require upserts? Or just appends?
s
Both, depending on the change request. Is Daft production grade?
I'm considering Daft over Polars, since later we plan to do distributed processing.
j
Yes, we’re more scalable and faster than Polars, but we're a younger project so we may be behind Polars in some features (e.g. upsert APIs), and we primarily work with object stores (S3). Could you make an issue on the Daft repository? Would love to get your thoughts on a suitable API!
s
Yeah.. maybe I can make an API to communicate with a SQL database, with some help from ChatGPT too haha. I'll think about it first. Looking at the Daft documentation, Daft can read SQL via ConnectorX or SQLAlchemy; I think writing to SQL could work the same way, yeah?
j
Yes, we actually read SQL in parallel 😎 Using the user-provided query, we can shard it and perform a parallel read across the distributed cluster. We just need to make a good API to figure out the writing story here (appends and upserts). Every Python API I’ve seen for upserts has been very ugly haha.
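That parallel read looks roughly like this with `daft.read_sql`; the query, connection string, and partition column below are placeholders:

```python
import daft

# Daft shards the user-provided query on partition_col and
# reads the resulting ranges in parallel across the cluster
df = daft.read_sql(
    "SELECT * FROM orders",                 # hypothetical query
    "postgresql://user:pass@host:5432/db",  # placeholder connection string
    partition_col="order_id",               # numeric/temporal column to shard on
    num_partitions=8,
)
print(df.collect())
```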
s
I've benchmarked the Dask / pandas / Polars to_sql functions; they can't support an upsert process, just fail / replace / append.. Is that right?
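For context, that limitation is visible right in pandas' `to_sql` signature: `if_exists` only accepts `"fail"`, `"replace"`, or `"append"`. A minimal example with a placeholder SQLite database:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")  # placeholder database
df = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})

# if_exists only accepts "fail", "replace", or "append" -- no upsert mode
df.to_sql("target_table", engine, if_exists="append", index=False)
```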
j
I believe so, yes. Most databases would support appends through JDBC or ODBC, I think.