# general
d
I am trying Daft with Unity Catalog but getting this error: No module named 'unitycatalog' on
from daft.unity_catalog import UnityCatalog
Maybe I'm missing something, but the documentation is not clear on whether I need to install something else. I would be grateful for some help here.
j
Whoops, yes you’ll have to do
pip install getdaft[unity]
This will ensure that
pip install unitycatalog
is done for you
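Once that's installed, a minimal sketch of the connection looks roughly like this (the endpoint, token, and table name are placeholders you'd swap for your own):
import daft
from daft.unity_catalog import UnityCatalog

# Placeholder workspace details -- replace with your own endpoint and token
unity = UnityCatalog(
    endpoint="https://<your-workspace>.cloud.databricks.com",
    token="<your-databricks-token>",
)

# Load a table registered in Unity Catalog and read it as a Daft DataFrame
unity_table = unity.load_table("catalog.schema.table")
df = daft.read_deltalake(unity_table)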
d
Thank you very much. The documentation needs to be updated 🙂
j
We rely on a Python package that we generated to access Unity Catalog. Note also that Unity Catalog support is very new (it was open-sourced just a week ago by Databricks), so this support is very much in beta
Absolutely! Feel free to make a contribution to Daft: https://github.com/Eventual-Inc/Daft/blob/main/docs/source/user_guide/integrations/unity-catalog.rst Or you can make an issue we’ll get right on it
d
Yes I was surprised to see you guys have support for something that was just announced last week. Great job
j
Are you attempting to access Unity Catalog hosted by Databricks? I tried to do so last week but it seems their internal version of unity catalog currently doesn’t work with some of the APIs… Specifically they won’t vend credentials to us, and will throw an error: https://github.com/unitycatalog/unitycatalog/issues/2
Seems like this is something that Databricks themselves have to fix though
d
Ok, good to know. It's sad that it's not working though. Yes, that's exactly what I was going to try
@jay it seems to be working now. I was able to query the catalog
j
Oh shiiiii
😛 very cool!
Wanna make a blogpost or something on linkedin about it? That’s pretty hype
d
If I get to the point of being able to query a table successfully, I will surely do that
j
Awesome. Let me know how we can help.
BTW we just hopped off the phone with some of our Databricks partners — looks like the Databricks implementation of Unity Catalog might still have issues around credentials vending. @Daniel Antwi if you have any issues that look like this, let us know! There’s a Databricks private preview form that we can send to you to have your account become enabled for credentials vending that will fix the issue.
d
Hi Jay, I think I might have missed your message. I couldn't see the issue link you sent me. However, I got this error message, exactly like the one you sent me earlier:
{'error_code': 'UNAUTHENTICATED',
 'message': "Request to generate access credential for table 'redacted' from outside of Databricks Unity Catalog enabled compute environment is denied for security. Please contact Databricks support for integrations with Unity Catalog.",
 'details': [{'@type': 'type.googleapis.com/google…
Please send me the private preview form
j
d
Thank you very much Jay
Hi @jay I finally got credential vending enabled on my Databricks workspace. However, this is the new error I am getting at df=daft.read_deltalake(unity_table)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[9], line 15
     13 unity_table = unity.load_table("neutron-dev.bssm_lakehouse_dev1.hack_signins")
     14 print(unity_table)
---> 15 df=daft.read_deltalake(unity_table)

    [... skipping hidden 2 frame]

File ~/mssql/changestreamapp/.venv/lib/python3.10/site-packages/daft/io/_delta_lake.py:87, in read_deltalake(table, io_config, _multithreaded_io)
     83 else:
     84     raise ValueError(
     85         f"table argument must be a table URI string, DataCatalogTable or UnityCatalogTable instance, but got: {type(table)}, {table}"
     86     )
---> 87 delta_lake_operator = DeltaLakeScanOperator(table_uri, storage_config=storage_config)
     89 handle = ScanOperatorHandle.from_python_scan_operator(delta_lake_operator)
     90 builder = LogicalPlanBuilder.from_tabular_scan(scan_operator=handle)

File ~/mssql/changestreamapp/.venv/lib/python3.10/site-packages/daft/delta_lake/delta_lake_scan.py:63, in DeltaLakeScanOperator.__init__(self, table_uri, storage_config)
     56         if deltalake_sdk_io_config.s3.region_name is None:
     57             deltalake_sdk_io_config = deltalake_sdk_io_config.replace(
     58                 s3=deltalake_sdk_io_config.s3.replace(
     59                     region_name=s3_config_from_env.region_name,
     60                 )
     61             )
...
    301     without_files=without_files,
    302     log_buffer_size=log_buffer_size,
    303 )

OSError: Generic S3 error: Received redirect without LOCATION, this normally indicates an incorrectly configured region
j
Ah, I haven't even been approved yet 😧 it's so difficult for me to test out Daft against Databricks' Unity Catalog at the moment. This should be pretty easy to fix. Do you know what region your bucket is in? You should be able to do:
# Replace with the region your AWS S3 bucket is in
MY_REGION = "us-west-2"

daft.set_planning_config(
    default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config(region_name=MY_REGION)
    ),
)
We could make this easier, but I'm not sure if Databricks' Unity Catalog gives the correct region when passing the table info. Would you be open to running some code for me to see?
import unitycatalog

token = "xxx"  # your databricks token
endpoint = "xxx"  # your databricks endpoint
table_name = "x.y.z"  # your table name

client = unitycatalog.Unitycatalog(
    base_url=endpoint.rstrip("/") + "/api/2.1/unity-catalog/",
    default_headers={"Authorization": f"Bearer {token}"},
)

table_info = client.tables.retrieve(table_name)

# Print what info databricks UC provides us
print(table_info)
print(table_info.properties)
I’m curious if Unity Catalog is giving us the region_name of the bucket. If so, then it will be very easy for us to forward the correct region_name when reading the deltalake table.
d
Sure, I am available to help in any way I can
Here is the response from the script
TableInfo(catalog_name='neutron-dev', columns=[Column(comment=None, name='_id', nullable=True, partition_index=None, position=0, type_interval_type=None, type_json='{"name":"_id","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='AzureId', nullable=True, partition_index=None, position=1, type_interval_type=None, type_json='{"name":"AzureId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='accountNodeId', nullable=True, partition_index=None, position=2, type_interval_type=None, type_json='{"name":"accountNodeId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='CreatedDateTime', nullable=True, partition_index=None, position=3, type_interval_type=None, type_json='{"name":"CreatedDateTime","type":"timestamp","nullable":true,"metadata":{}}', type_name='TIMESTAMP', type_precision=0, type_scale=0, type_text='timestamp'), Column(comment=None, name='AppId', nullable=True, partition_index=None, position=4, type_interval_type=None, type_json='{"name":"AppId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='AppName', nullable=True, partition_index=None, position=5, type_interval_type=None, type_json='{"name":"AppName","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='UserId', nullable=True, partition_index=None, position=6, type_interval_type=None, type_json='{"name":"UserId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='UserPrincipal', nullable=True, partition_index=None, position=7, type_interval_type=None, type_json='{"name":"UserPrincipal","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='UserDisplayName', nullable=True, partition_index=None, position=8, type_interval_type=None, type_json='{"name":"UserDisplayName","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='Location', nullable=True, partition_index=None, position=9, type_interval_type=None, type_json='{"name":"Location","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='DeviceDetail', nullable=True, partition_index=None, position=10, type_interval_type=None, type_json='{"name":"DeviceDetail","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='IsFailed', nullable=True, partition_index=None, position=11, type_interval_type=None, type_json='{"name":"IsFailed","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='ClientAppUsed', nullable=True, partition_index=None, position=12, type_interval_type=None, type_json='{"name":"ClientAppUsed","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, 
name='CorrelationId', nullable=True, partition_index=None, position=13, type_interval_type=None, type_json='{"name":"CorrelationId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='RiskState', nullable=True, partition_index=None, position=14, type_interval_type=None, type_json='{"name":"RiskState","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='Status', nullable=True, partition_index=None, position=15, type_interval_type=None, type_json='{"name":"Status","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='document_id', nullable=True, partition_index=None, position=16, type_interval_type=None, type_json='{"name":"document_id","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='record_timestamp', nullable=True, partition_index=None, position=17, type_interval_type=None, type_json='{"name":"record_timestamp","type":"timestamp","nullable":true,"metadata":{}}', type_name='TIMESTAMP', type_precision=0, type_scale=0, type_text='timestamp')], comment=None, created_at=1721671757256, data_source_format='DELTA', name='hack_signins', properties={'delta.lastCommitTimestamp': '1721671751000', 'delta.lastUpdateVersion': '0', 'delta.minWriterVersion': '7', 'delta.enableDeletionVectors': 'true', 'delta.minReaderVersion': '3', 'delta.feature.deletionVectors': 'supported'}, schema_name='bssm_lakehouse_dev1', storage_location='s3://bn-neutron-dev-eu-central-1-catalog/__unitystorage/catalogs/214cf227-302f-4390-a269-b37c3f46edf1/tables/702bc0b7-ebed-4009-9068-ed9a6d7a922d', table_id='702bc0b7-ebed-4009-9068-ed9a6d7a922d', table_type='MANAGED', updated_at=1721671757256, owner='dantwi@barracuda.com', securable_kind='TABLE_DELTA', enable_auto_maintenance='INHERIT', enable_predictive_optimization='INHERIT', properties_pairs={'properties': {'delta.lastCommitTimestamp': '1721671751000', 'delta.lastUpdateVersion': '0', 'delta.minWriterVersion': '7', 'delta.enableDeletionVectors': 'true', 'delta.minReaderVersion': '3', 'delta.feature.deletionVectors': 'supported'}}, generation=0, metastore_id='924cdcca-40e3-4175-b7cb-a558ccef3fe4', full_name='neutron-dev.bssm_lakehouse_dev1.hack_signins', data_access_configuration_id='00000000-0000-0000-0000-000000000000', created_by='dantwi@barracuda.com', updated_by='dantwi@barracuda.com', delta_runtime_properties_kvpairs={}, securable_type='TABLE', effective_auto_maintenance_flag={'value': 'ENABLE', 'inherited_from_type': 'METASTORE', 'inherited_from_name': 'metastore-eu-central-1'}, effective_predictive_optimization_flag={'value': 'ENABLE', 'inherited_from_type': 'METASTORE', 'inherited_from_name': 'metastore-eu-central-1'}, browse_only=False)
{'delta.lastCommitTimestamp': '1721671751000', 'delta.lastUpdateVersion': '0', 'delta.minWriterVersion': '7', 'delta.enableDeletionVectors': 'true', 'delta.minReaderVersion': '3', 'delta.feature.deletionVectors': 'supported'}
My region is eu-central-1
I got the same error even after adding
# Replace with the region your AWS S3 bucket is in
MY_REGION = "eu-central-1"

daft.set_planning_config(
    default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config(region_name=MY_REGION)
    ),
)
j
Hmm ok let me give it a shot later today
Looks like Databricks' Unity Catalog unfortunately doesn't give any info about the bucket region. We should tell Databricks to do that 🤣
d
Lol, interesting. We should do that
So how are you getting access to S3 to read the Delta tables, even if they provide the region?
j
The flow looks like:
1. Obtain S3 credentials and region, either from the current environment or from Unity Catalog.
2. Translate these into a format that the deltalake SDK understands, and access the table metadata using the deltalake SDK to retrieve the files that we need to read.
3. Perform distributed reads on those files using Daft's parquet readers and the credentials obtained in (1). (See the sketch below.)
My guess is actually that something is going wrong in step (2). Unity Catalog doesn't provide the region name, so we attempt to fall back onto the region detected from your environment. I'm guessing that your environment doesn't have a region configured, so it is just None, and the deltalake SDK is unfortunately pretty bad at performing the appropriate retries to get the correct region name when accessing S3 itself.
Could you try this too? (Make sure to redact any private variables)
print(daft.io.S3Config.from_env())
I’ll take a look at our code to see if there is anything fishy going on
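For illustration, a rough sketch of that flow done by hand, using the deltalake SDK and Daft's parquet reader directly (the table URI and region are placeholders, and this is only an approximation of what Daft does internally):
import daft
from daft.io import IOConfig, S3Config
from deltalake import DeltaTable

# Step (1): placeholder table URI and region; S3 credentials come from the environment here
table_uri = "s3://my-bucket/path/to/delta-table"
region = "eu-central-1"

# Step (2): let the deltalake SDK read the table metadata and list the data files
dt = DeltaTable(table_uri, storage_options={"AWS_REGION": region})
parquet_files = dt.file_uris()

# Step (3): read those files with Daft's parquet reader, passing the region explicitly
io_config = IOConfig(s3=S3Config(region_name=region))
df = daft.read_parquet(parquet_files, io_config=io_config)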
🙏 1
d
So here is the result
S3Config
    region_name: Some("us-east-1")
    endpoint_url: None
    key_id: Some("xxxxx")
    session_token: None,
    access_key: Some("xxxxxx")
    credentials_provider: None
    buffer_time: None
    max_connections: 8,
    retry_initial_backoff_ms: 1000,
    connect_timeout_ms: 30000,
    read_timeout_ms: 30000,
    num_tries: 25,
    retry_mode: Some("adaptive"),
    anonymous: false,
    use_ssl: true,
    verify_ssl: true,
    check_hostname_ssl: true
    requester_pays: false
    force_virtual_addressing: false
The region from the output is showing us-east-1 though
Ok, I changed the AWS region in my environment variables and there's no error now, but no data is displayed. It's able to show the schema with no data
j
Yes correct, the Daft dataframe is lazy — you have to run
.show()
to get it to run and show you some data 🙂
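For example, continuing from the table loaded earlier in the thread:
df = daft.read_deltalake(unity_table)
df.show()  # triggers execution and prints a preview of the rows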
d
Wonderful!!!!!! it's showing data now
So the region seems to be coming from the environment. Do you know why this is not working?
# Replace with the region your AWS S3 bucket is in
MY_REGION = "eu-central-1"

daft.set_planning_config(
    default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config(region_name=MY_REGION)
    ),
)
j
I think we automatically detect your region, but because nothing is set it defaults to us-east-1. I think the ordering in which we select the region might be wrong… maybe we need to fix the order so that we preferentially select the region specified on the planning context
👍 1
Ah yes, in our code we have some logic that overrides the
io_config
if we see that the table is a Unity Catalog table. We assumed that the Unity Catalog table would be “well-behaved”, but I guess it provides “bad” regions so this logic here might be a little too naive: https://github.com/Eventual-Inc/Daft/commit/395ebe8f406488b11eb65d71c0ff01ba76a61d76#diff-1e32b47add26cea48d942edc0[…]e76ccfbe83e7b8439de9065R79-R82
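Until that ordering is fixed, the workaround that ended up working above, i.e. setting the region in the environment before reading, looks roughly like this (the region is the one from this thread, and unity_table is assumed to be loaded from Unity Catalog as in the earlier snippet):
import os
import daft

# Set the bucket region via the standard AWS environment variable before reading,
# so Daft's S3 config picks it up when Unity Catalog does not supply one
os.environ["AWS_REGION"] = "eu-central-1"

# This is the call where the region error originally surfaced
df = daft.read_deltalake(unity_table)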
d
Thank you very much Jay, that is very much appreciated 🙂