Hi, I’m trying to read data into a daft dataframe ...
# general
n
Hi, I’m trying to read data into a daft dataframe from the result of running a SQL query against a SQL Server instance. The salient parts of the code are:
Copy code
import ray
import daft

ray.init()

daft.read_sql("<sql query>", "mssql://<user>:<password>@<server>:<port>/<db>", parallelism=<parallelism>, partition_col="<col>")
Which almost immediately returns
Copy code
RuntimeError: Timed out in bb8
If it really is a timeout, there doesn’t seem to be a way to configure it as part of the read_sql call. I’ve verified that the server is up and the creds are correct, because I can use pyodbc + Ray Dataset’s read_sql to read the data.
🙏 1
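(For reference, the working pyodbc + Ray Data check described above would look roughly like the sketch below; this is not the exact code from the thread, the connection-string values and table are placeholders, and the driver name depends on what is installed.)
Copy code
# Rough sketch of the pyodbc + Ray Data check mentioned above (placeholders throughout).
import pyodbc
import ray

ray.init()

def create_connection():
    # DSN-less pyodbc connection string; the driver name must match an installed ODBC driver.
    return pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<server>,<port>;DATABASE=<db>;UID=<user>;PWD=<password>"
    )

ds = ray.data.read_sql("SELECT * FROM <table>", create_connection)
print(ds.take(5))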
Copy code
Traceback (most recent call last):
  File "/tmp/ray/session_2024-05-24_09-29-07_153171_8/runtime_resources/working_dir_files/_ray_pkg_b59887b0d434edc1/daft_test.py", line 44, in <module>
    daft_df = daft.read_sql(sql, db_connection_string)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/api_annotations.py", line 43, in _wrap
    return timed_func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/analytics.py", line 215, in tracked_fn
    result = fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/io/_sql.py", line 84, in read_sql
    sql_operator = SQLScanOperator(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_scan.py", line 47, in __init__
    self._schema = self._attempt_schema_read()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_scan.py", line 112, in _attempt_schema_read
    pa_table = self.conn.read(self.sql, limit=1)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_connection.py", line 63, in read
    return self._execute_sql_query(self._construct_sql_query(sql, projection, predicate))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_connection.py", line 101, in _execute_sql_query
    return self._execute_sql_query_with_connectorx(sql)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/daft/sql/sql_connection.py", line 116, in _execute_sql_query_with_connectorx
    raise RuntimeError(f"Failed to execute sql: {sql} with url: {self.conn}, error: {e}") from e
...
error: Timed out in bb8
daft version: 0.2.21, ray version: 2.23.0, connectorx version: 0.3.3
j
Hmm interesting! Looks like there are a bunch of issues online about connectorx compatibility with SQL Server: https://stackoverflow.com/questions/70879465/connecting-ms-sql-server-using-connectorx-in-python It seems you could try manually specifying the driver so it doesn’t go through ODBC:
?driver=SQL+Server&trusted_connection={trusted_conn}
Let me know if that works?
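(For concreteness, with the URI form from the original call that would look roughly like the sketch below; same placeholders as above, and trusted_connection=no is just an assumption.)
Copy code
import daft

# Sketch: the same connectorx-style URI as before, with the driver hint appended as query parameters.
uri = "mssql://<user>:<password>@<server>:<port>/<db>?driver=SQL+Server&trusted_connection=no"
df = daft.read_sql("<sql query>", uri)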
n
thanks! Adding driver=SQL+Server also results in the same error 😕
j
:( are you able to get “just” connectorx working without Daft? I’m looking at the example in that stackoverflow post:
Copy code
import connectorx

username = 'my_username'
password = '1234'
server = 'SERVER1'
database = 'db1'
trusted_conn = 'no'  # or 'yes'

query = "SELECT * FROM table"

mssql_url = f'mssql://{username}:{password}@{server}/{database}?driver=SQL+Server&trusted_connection={trusted_conn}'

df = connectorx.read_sql(mssql_url, query)
n
Curiously, that makes no progress and hangs at the read_sql call on Ray. Will try locally.
And we are back to
RuntimeError: Timed out in bb8
The stalled connection was from encrypt=true, which was presumably a red herring.
j
Aha… I’m guessing connectorx just really doesn’t work with your setup of SQL Server :( I’m currently out on a lunch run, but I can show you how to use SQLAlchemy instead! It’s our “fallback option”, which will be slower but should work as long as you can provide Daft with a callback that constructs a working SQLAlchemy connection object.
Seems to be an ongoing thing with connectorx + mssql: https://github.com/sfu-db/connector-x/issues/233 Oh the joys of OSS 😝
n
Thanks so much! Will try SQL alchemy
j
No probs! Keep us posted. I’m excited to see how your experience is with Daft, especially since you’re a Ray user 😁
(Btw I see you work at FlatIron Health — I think y’all do very cool work and actually considered applying back in 2018!) Would love to chat more about how you use Ray and Daft at some point. Happy to connect you with some of the Ray experts on our team too
n
Thank you, and I'll keep you posted!
j
Found the example on our docs: https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/io_functions/daft.read_sql.html#daft.read_sql
Copy code
import daft
import sqlalchemy

def create_conn():
    return sqlalchemy.create_engine("mssql+pyodbc://<user>:<password>@<server>:<port>/<db>").connect()

df = daft.read_sql("SELECT * FROM my_table", create_conn)
I think you’ll need to specify mssql+pyodbc:// so that SQLAlchemy will use pyodbc for the connection under the hood. Let me know if that works for you!
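(As a quick sanity check on the driver side, assuming pyodbc is installed, pyodbc can list the ODBC drivers it sees; the driver name used in the connection URL has to match one of these.)
Copy code
import pyodbc

# Prints the installed ODBC driver names, e.g. "ODBC Driver 18 for SQL Server".
print(pyodbc.drivers())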
c
Hey @Neeraj Krishnan! I'm Colin; I worked on the read_sql functionality and just wanted to check in and see if using SQLAlchemy worked for you?
n
Hi Colin, and thanks for following up! Connecting via SQLAlchemy does work. Query execution fails, however, and this is what I see:
Copy code
UserWarning: Failed to execute the query with limit 1: Failed to execute sql: SELECT *
FROM (<original query>) AS subquery
LIMIT 1
which is expected, since SQL Server does not support LIMIT (it uses TOP). However, after stating
Attempting to read the entire table
we still get:
Copy code
Incorrect syntax near 'LIMIT
...
[SQL: SELECT *
FROM (SELECT *
FROM (<original query>) AS subquery
LIMIT 1) AS subquery]
using daft 0.2.21 (https://github.com/Eventual-Inc/Daft/blob/v0.2.21/daft/sql/sql_connection.py), where the second query construction is supposed to leave LIMIT out of the query. But the sql arg passed to it is the shadowed sql variable, which already has the wrapping SELECT * and LIMIT added to it (a simplified sketch of that shadowing is below, after the connection snippet). Also, if it's of interest to someone later:
Copy code
connection_url = sqlalchemy.engine.URL.create(
    "mssql+pyodbc",
    username=user,
    password=password,
    host=host,
    port=1433,
    database=database,
    query={
        "driver": "ODBC Driver 18 for SQL Server",
    },
)
def create_conn():
    return sqlalchemy.create_engine(connection_url).connect()

with ODBC drivers installed per
https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-ver15&tabs=ubuntu18-install%2Calpine17-install%2Cdebian8-install%2Credhat7-13-install%2Crhel7-offline
is what worked connection-wise.
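(To make the shadowing concrete, here is a simplified, hypothetical sketch of the query construction described above; it is not Daft's actual code, just the shape of the bug.)
Copy code
# Hypothetical sketch of the shadowing described above (not Daft's actual code).
def construct_queries(sql: str) -> tuple[str, str]:
    limited = f"SELECT * FROM ({sql}) AS subquery LIMIT 1"
    # Bug: rebinding `sql` to the wrapped query means the "no limit" fallback
    # below still contains the LIMIT clause.
    sql = limited
    fallback = f"SELECT * FROM ({sql}) AS subquery"
    return limited, fallback

print(construct_queries("<original query>")[1])
# SELECT * FROM (SELECT * FROM (<original query>) AS subquery LIMIT 1) AS subquery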
c
Thanks so much for the detailed logs and code examples! I was able to reproduce the error, which is due to Daft manually constructing queries using LIMIT in version 0.2.21. We've actually updated Daft starting from version 0.2.22 to use a separate SQL parsing library called SQLGlot to construct our SQL queries, which should make us compatible across more dialects. However, I do need to make a small fix to allow it to work with SQL Server (essentially, SQLGlot recognizes the SQL Server dialect as tsql instead of mssql), which should be ready in the next release! https://github.com/Eventual-Inc/Daft/pull/2330 @jay could you take a look?
👍 1
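(For illustration, a small sketch of the dialect difference described above, using SQLGlot directly rather than anything Daft-specific; the query is a placeholder.)
Copy code
import sqlglot

# Transpile a LIMIT query into SQL Server's T-SQL dialect ("tsql" in SQLGlot),
# which should render the limit as TOP instead of LIMIT.
print(sqlglot.transpile("SELECT * FROM my_table LIMIT 1", read="mysql", write="tsql")[0])
# Expected output along the lines of: SELECT TOP 1 * FROM my_table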
j
Looks good! Just stamped it
n
thank you!
j
Thanks @Neeraj Krishnan 🙂 We’ll provide an update again when the new version of Daft is released
👍 1
c
Hey @Neeraj Krishnan! The latest version of Daft (0.2.26) was just released; the new fixes should be ready to try out!
n
Thank you!
This works! Thanks so much for the quick fix.
❤️ 1
c
Awesome! glad to hear