Hi, I’m trying to read data into a daft dataframe ...
# general
n
Hi, I’m trying to read data into a daft dataframe from the result of running a SQL query against a SQL Server instance. The salient parts of the code are:
Copy code
import ray
import daft

ray.init()

daft.read_sql("<sql query>", "mssql://<user>:<password>@<server>:<port>/<db>", parallelism=<parallelism>, partition_col="<col>")
Which almost immediately returns
Copy code
RuntimeError: Timed out in bb8
If it really is a timeout, there doesn’t seem to be a way to configure it as part of the read_sql call. I’ve verified that the server is up and the creds are correct, because I can use pyodbc + Ray Dataset’s read_sql to read the data.
🙏 1
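(For reference, the working pyodbc + Ray Data check described above would look roughly like the sketch below; this is not the exact code from the thread, the connection-string values and table are placeholders, and the driver name depends on what is installed.)
Copy code
# Rough sketch of the pyodbc + Ray Data check mentioned above (placeholders throughout).
import pyodbc
import ray

ray.init()

def create_connection():
    # DSN-less pyodbc connection string; the driver name must match an installed ODBC driver.
    return pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<server>,<port>;DATABASE=<db>;UID=<user>;PWD=<password>"
    )

ds = ray.data.read_sql("SELECT * FROM <table>", create_connection)
print(ds.take(5))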
Copy code
Traceback (most recent call last):
  File "/tmp/ray/session_2024-05-24_09-29-07_153171_8/runtime_resources/working_dir_files/_ray_pkg_b59887b0d434edc1/daft_test.py", line 44, in <module>
    daft_df = daft.read_sql(sql, db_connection_string)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/api_annotations.py", line 43, in _wrap
    return timed_func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/analytics.py", line 215, in tracked_fn
    result = fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/io/_sql.py", line 84, in read_sql
    sql_operator = SQLScanOperator(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_scan.py", line 47, in __init__
    self._schema = self._attempt_schema_read()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_scan.py", line 112, in _attempt_schema_read
    pa_table = self.conn.read(self.sql, limit=1)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_connection.py", line 63, in read
    return self._execute_sql_query(self._construct_sql_query(sql, projection, predicate))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_connection.py", line 101, in _execute_sql_query
    return self._execute_sql_query_with_connectorx(sql)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_connection.py", line 116, in _execute_sql_query_with_connectorx
    raise RuntimeError(f"Failed to execute sql: {sql} with url: {self.conn}, error: {e}") from e
...
error: Timed out in bb8
daft version: 0.2.21, ray version: 2.23.0, connectorx version: 0.3.3
j
Hmm interesting! Looks like there are a bunch of issues online about connectorx compatibility with SQL Server: https://stackoverflow.com/questions/70879465/connecting-ms-sql-server-using-connectorx-in-python It seems you could try manually specifying the driver so it doesn’t go through ODBC:
?driver=SQL+Server&trusted_connection={trusted_conn}
Let me know if that works?
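(For concreteness, with the URI form from the original call that would look roughly like the sketch below; same placeholders as above, and trusted_connection=no is just an assumption.)
Copy code
import daft

# Sketch: the same connectorx-style URI as before, with the driver hint appended as query parameters.
uri = "mssql://<user>:<password>@<server>:<port>/<db>?driver=SQL+Server&trusted_connection=no"
df = daft.read_sql("<sql query>", uri)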
n
thanks! Adding driver=SQL+Server also results in the same error 😕
j
:( are you able to get “just” connectorx working without Daft? I’m looking at the example in that stackoverflow post:
Copy code
import connectorx

username = 'my_username'
password = '1234'
server = 'SERVER1'
database = 'db1'
trusted_conn = 'no'  # or 'yes'

query = "SELECT * FROM table"

mssql_url = f'mssql://{username}:{password}@{server}/{database}?driver=SQL+Server&trusted_connection={trusted_conn}'

df = connectorx.read_sql(mssql_url, query)
n
Curiously, that makes no progress and hangs at the read_sql call on Ray. Will try locally.
And we are back to
RuntimeError: Timed out in bb8
The stalled connection was from encrypt=true, which was presumably a red herring.
j
Aha… I’m guessing connectorx just really doesn’t work with your setup of SQL Server :( I’m currently out on a lunch run, but I can show you how to use SQLAlchemy instead! It’s our “fallback option”, which will be slower but should work as long as you can provide Daft with a callback that constructs a working SQLAlchemy connection object.
Seems to be an ongoing thing with connectorx + mssql: https://github.com/sfu-db/connector-x/issues/233 Oh the joys of OSS 😝
n
Thanks so much! Will try SQL alchemy
j
No probs! Keep us posted. I’m excited to see how your experience is with Daft, especially since you’re a Ray user 😁
(Btw I see you work at FlatIron Health — I think y’all do very cool work and actually considered applying back in 2018!) Would love to chat more about how you use Ray and Daft at some point. Happy to connect you with some of the Ray experts on our team too
n
Thank you, and I'll keep you posted!
j
Found the example on our docs: https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_functions/daft.read_sql.html#daft.read_sql
Copy code
import daft
import sqlalchemy

def create_conn():
    return sqlalchemy.create_engine("mssql+pyodbc://<user>:<password>@<server>:<port>/<db>").connect()

df = daft.read_sql("SELECT * FROM my_table", create_conn)
I think you’ll need to specify mssql+pyodbc:// so that SQLAlchemy will use pyodbc for the connection under the hood. Let me know if that works for you!
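(As a quick sanity check on the driver side, assuming pyodbc is installed, pyodbc can list the ODBC drivers it sees; the driver name used in the connection URL has to match one of these.)
Copy code
import pyodbc

# Prints the installed ODBC driver names, e.g. "ODBC Driver 18 for SQL Server".
print(pyodbc.drivers())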
c
Hey @Neeraj Krishnan! I'm Colin; I worked on the read_sql functionality and just wanted to check in and see if using SQLAlchemy worked for you?
n
Hi Colin, and thanks for following up! Connecting via SQLAlchemy does work. Query execution fails, however, and this is what I see:
Copy code
UserWarning: Failed to execute the query with limit 1: Failed to execute sql: SELECT *
FROM (<original query>) AS subquery
LIMIT 1
which is expected, since SQL Server does not support LIMIT (it uses TOP). However, after stating
Attempting to read the entire table
we still get:
Copy code
Incorrect syntax near 'LIMIT
...
[SQL: SELECT *
FROM (SELECT *
FROM (<original query>) AS subquery
LIMIT 1) AS subquery]
using daft 0.2.21 (https://github.com/Eventual-Inc/Daft/blob/v0.2.21/daft/sql/sql_connection.py), where the second query construction is supposed to leave LIMIT out of the query. But the sql arg passed to it is the shadowed sql variable, which already has the wrapping SELECT * and LIMIT added to it (a simplified sketch of that shadowing is below, after the connection snippet). Also, if it's of interest to someone later:
Copy code
connection_url = sqlalchemy.engine.URL.create(
    "mssql+pyodbc",
    username=user,
    password=password,
    host=host,
    port=1433,
    database=database,
    query={
        "driver": "ODBC Driver 18 for SQL Server",
    },
)
def create_conn():
    return sqlalchemy.create_engine(connection_url).connect()

with ODBC drivers installed per
https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-ver15&tabs=ubuntu18-install%2Calpine17-install%2Cdebian8-install%2Credhat7-13-install%2Crhel7-offline
is what worked connection-wise.
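(To make the shadowing concrete, here is a simplified, hypothetical sketch of the query construction described above; it is not Daft's actual code, just the shape of the bug.)
Copy code
# Hypothetical sketch of the shadowing described above (not Daft's actual code).
def construct_queries(sql: str) -> tuple[str, str]:
    limited = f"SELECT * FROM ({sql}) AS subquery LIMIT 1"
    # Bug: rebinding `sql` to the wrapped query means the "no limit" fallback
    # below still contains the LIMIT clause.
    sql = limited
    fallback = f"SELECT * FROM ({sql}) AS subquery"
    return limited, fallback

print(construct_queries("<original query>")[1])
# SELECT * FROM (SELECT * FROM (<original query>) AS subquery LIMIT 1) AS subquery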
c
Thanks so much for the detailed logs and code examples! I was able to reproduce the error, which is due to Daft manually constructing queries using LIMIT in version 0.2.21. We've actually updated Daft starting from version 0.2.22 to use a separate SQL parsing library called SQLGlot to construct our SQL queries, which should make us compatible across more dialects. However, I do need to make a small fix to allow it to work with SQL Server (essentially, SQLGlot recognizes the SQL Server dialect as tsql instead of mssql), which should be ready in the next release! https://github.com/Eventual-Inc/Daft/pull/2330 @jay could you take a look?
👍 1
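(For illustration, a small sketch of the dialect difference described above, using SQLGlot directly rather than anything Daft-specific; the query is a placeholder.)
Copy code
import sqlglot

# Transpile a LIMIT query into SQL Server's T-SQL dialect ("tsql" in SQLGlot),
# which should render the limit as TOP instead of LIMIT.
print(sqlglot.transpile("SELECT * FROM my_table LIMIT 1", read="mysql", write="tsql")[0])
# Expected output along the lines of: SELECT TOP 1 * FROM my_table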
j
Looks good! Just stamped it
n
thank you!
j
Thanks @Neeraj Krishnan 🙂 We’ll provide an update again when the new version of Daft is released
👍 1
c
Hey @Neeraj Krishnan! The latest version of Daft (0.2.26) was just released; the new fixes should be ready to try out!
n
Thank you!
This works! Thanks so much for the quick fix.
❤️ 1
c
Awesome! glad to hear