Daniel Antwi
06/22/2024, 1:04 AM
from daft.unity_catalog import UnityCatalog
Maybe I am missing something, but the documentation is not clear on whether I need to install something else. I would be grateful for some help here.
jay
06/22/2024, 1:05 AM
pip install getdaft[unity]
This will ensure that pip install unitycatalog is done for you.
Daniel Antwi
06/22/2024, 1:06 AMjay
06/22/2024, 1:06 AMjay
06/22/2024, 1:06 AMDaniel Antwi
06/22/2024, 1:07 AMjay
06/22/2024, 1:07 AMjay
06/22/2024, 1:08 AMDaniel Antwi
06/22/2024, 1:09 AMDaniel Antwi
06/22/2024, 1:19 AMjay
06/22/2024, 1:23 AMjay
06/22/2024, 1:23 AMjay
06/22/2024, 1:24 AMDaniel Antwi
06/22/2024, 1:32 AMjay
06/22/2024, 2:05 AMjay
06/24/2024, 11:07 PMDaniel Antwi
07/02/2024, 2:18 PM
{'error_code': 'UNAUTHENTICATED',
'message': "Request to generate access credential for table 'redacted' from outside of Databricks Unity Catalog enabled compute environment is denied for security. Please contact Databricks support for integrations with Unity Catalog.",
'details': [{'@type': 'type.googleapis.com/google…
Please send me the private preview form.
jay
07/02/2024, 6:31 PMDaniel Antwi
07/05/2024, 1:34 PMDaniel Antwi
07/24/2024, 2:35 AM
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
Cell In[9], line 15
13 unity_table = unity.load_table("neutron-dev.bssm_lakehouse_dev1.hack_signins")
14 print(unity_table)
---> 15 df=daft.read_deltalake(unity_table)
[... skipping hidden 2 frame]
File ~/mssql/changestreamapp/.venv/lib/python3.10/site-packages/daft/io/_delta_lake.py:87, in read_deltalake(table, io_config, _multithreaded_io)
83 else:
84 raise ValueError(
85 f"table argument must be a table URI string, DataCatalogTable or UnityCatalogTable instance, but got: {type(table)}, {table}"
86 )
---> 87 delta_lake_operator = DeltaLakeScanOperator(table_uri, storage_config=storage_config)
89 handle = ScanOperatorHandle.from_python_scan_operator(delta_lake_operator)
90 builder = LogicalPlanBuilder.from_tabular_scan(scan_operator=handle)
File ~/mssql/changestreamapp/.venv/lib/python3.10/site-packages/daft/delta_lake/delta_lake_scan.py:63, in DeltaLakeScanOperator.__init__(self, table_uri, storage_config)
56 if deltalake_sdk_io_config.s3.region_name is None:
57 deltalake_sdk_io_config = deltalake_sdk_io_config.replace(
58 s3=deltalake_sdk_io_config.s3.replace(
59 region_name=s3_config_from_env.region_name,
60 )
61 )
...
301 without_files=without_files,
302 log_buffer_size=log_buffer_size,
303 )
OSError: Generic S3 error: Received redirect without LOCATION, this normally indicates an incorrectly configured region
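The frames above show the region being filled in from the environment only when the S3 config has none. A minimal sketch of that fallback pattern follows. It uses a hypothetical `S3Settings` dataclass and illustrative region values, not Daft's real `S3Config`, and only illustrates how a region picked up from the environment can end up being used without being set explicitly:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class S3Settings:
    """Hypothetical stand-in for an S3 config; not Daft's real S3Config."""

    region_name: Optional[str] = None


def with_env_region(config: S3Settings, env_config: S3Settings) -> S3Settings:
    """Fill in the region from the environment only if none was set explicitly."""
    if config.region_name is None:
        return replace(config, region_name=env_config.region_name)
    return config


# Illustrative values only: the environment says one region, the caller another.
env = S3Settings(region_name="us-east-1")
print(with_env_region(S3Settings(), env))
print(with_env_region(S3Settings(region_name="eu-central-1"), env))
```

If the config that reaches this step has no region, whatever the environment reports is used, even when the bucket lives elsewhere.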
jay
07/24/2024, 4:25 AM
import daft

# Replace with the region your AWS S3 bucket is in
MY_REGION = "us-west-2"
daft.set_planning_config(
    default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config(region_name=MY_REGION)
    ),
)
We could make this easier, but I'm not sure whether Databricks' Unity Catalog returns the correct region when it passes back the table info. Would you be open to running some code for me to check?
import unitycatalog

token = "xxx"  # your Databricks token
endpoint = "xxx"  # your Databricks endpoint
table_name = "x.y.z"  # your table name

client = unitycatalog.Unitycatalog(
    base_url=endpoint.rstrip("/") + "/api/2.1/unity-catalog/",
    default_headers={"Authorization": f"Bearer {token}"},
)
table_info = client.tables.retrieve(table_name)

# Print what info Databricks UC provides us
print(table_info)
print(table_info.properties)
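If the response includes a `storage_location` (an `s3://` URI for S3-backed tables), the bucket name can be pulled out with the standard library, which makes it easier to check that bucket's region separately. This is only a sketch: `bucket_from_storage_location` is a hypothetical helper, and the URI below is a made-up placeholder rather than a real table location.

```python
from urllib.parse import urlparse


def bucket_from_storage_location(storage_location: str) -> str:
    """Return the bucket name from an S3-style URI such as s3://bucket/path."""
    parsed = urlparse(storage_location)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"not an S3 URI: {storage_location!r}")
    return parsed.netloc


# Placeholder URI for illustration; substitute table_info.storage_location.
example = "s3://my-example-bucket/__unitystorage/catalogs/abc/tables/def"
print(bucket_from_storage_location(example))
```

The bucket name alone does not prove which region it is in, so it is best treated as an input to a region lookup rather than as the answer.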
jay
07/24/2024, 4:25 AMDaniel Antwi
07/24/2024, 4:18 PMDaniel Antwi
07/24/2024, 4:21 PMTableInfo(catalog_name='neutron-dev', columns=[Column(comment=None, name='_id', nullable=True, partition_index=None, position=0, type_interval_type=None, type_json='{"name":"_id","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='AzureId', nullable=True, partition_index=None, position=1, type_interval_type=None, type_json='{"name":"AzureId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='accountNodeId', nullable=True, partition_index=None, position=2, type_interval_type=None, type_json='{"name":"accountNodeId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='CreatedDateTime', nullable=True, partition_index=None, position=3, type_interval_type=None, type_json='{"name":"CreatedDateTime","type":"timestamp","nullable":true,"metadata":{}}', type_name='TIMESTAMP', type_precision=0, type_scale=0, type_text='timestamp'), Column(comment=None, name='AppId', nullable=True, partition_index=None, position=4, type_interval_type=None, type_json='{"name":"AppId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='AppName', nullable=True, partition_index=None, position=5, type_interval_type=None, type_json='{"name":"AppName","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='UserId', nullable=True, partition_index=None, position=6, type_interval_type=None, type_json='{"name":"UserId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='UserPrincipal', nullable=True, 
partition_index=None, position=7, type_interval_type=None, type_json='{"name":"UserPrincipal","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='UserDisplayName', nullable=True, partition_index=None, position=8, type_interval_type=None, type_json='{"name":"UserDisplayName","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='Location', nullable=True, partition_index=None, position=9, type_interval_type=None, type_json='{"name":"Location","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='DeviceDetail', nullable=True, partition_index=None, position=10, type_interval_type=None, type_json='{"name":"DeviceDetail","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='IsFailed', nullable=True, partition_index=None, position=11, type_interval_type=None, type_json='{"name":"IsFailed","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='ClientAppUsed', nullable=True, partition_index=None, position=12, type_interval_type=None, type_json='{"name":"ClientAppUsed","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='CorrelationId', nullable=True, partition_index=None, position=13, type_interval_type=None, type_json='{"name":"CorrelationId","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='RiskState', nullable=True, partition_index=None, position=14, type_interval_type=None, 
type_json='{"name":"RiskState","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='Status', nullable=True, partition_index=None, position=15, type_interval_type=None, type_json='{"name":"Status","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='document_id', nullable=True, partition_index=None, position=16, type_interval_type=None, type_json='{"name":"document_id","type":"string","nullable":true,"metadata":{}}', type_name='STRING', type_precision=0, type_scale=0, type_text='string'), Column(comment=None, name='record_timestamp', nullable=True, partition_index=None, position=17, type_interval_type=None, type_json='{"name":"record_timestamp","type":"timestamp","nullable":true,"metadata":{}}', type_name='TIMESTAMP', type_precision=0, type_scale=0, type_text='timestamp')], comment=None, created_at=1721671757256, data_source_format='DELTA', name='hack_signins', properties={'delta.lastCommitTimestamp': '1721671751000', 'delta.lastUpdateVersion': '0', 'delta.minWriterVersion': '7', 'delta.enableDeletionVectors': 'true', 'delta.minReaderVersion': '3', 'delta.feature.deletionVectors': 'supported'}, schema_name='bssm_lakehouse_dev1', storage_location='s3://bn-neutron-dev-eu-central-1-catalog/__unitystorage/catalogs/214cf227-302f-4390-a269-b37c3f46edf1/tables/702bc0b7-ebed-4009-9068-ed9a6d7a922d', table_id='702bc0b7-ebed-4009-9068-ed9a6d7a922d', table_type='MANAGED', updated_at=1721671757256, owner='dantwi@barracuda.com', securable_kind='TABLE_DELTA', enable_auto_maintenance='INHERIT', enable_predictive_optimization='INHERIT', properties_pairs={'properties': {'delta.lastCommitTimestamp': '1721671751000', 'delta.lastUpdateVersion': '0', 'delta.minWriterVersion': '7', 'delta.enableDeletionVectors': 'true', 'delta.minReaderVersion': '3', 
'delta.feature.deletionVectors': 'supported'}}, generation=0, metastore_id='924cdcca-40e3-4175-b7cb-a558ccef3fe4', full_name='neutron-dev.bssm_lakehouse_dev1.hack_signins', data_access_configuration_id='00000000-0000-0000-0000-000000000000', created_by='dantwi@barracuda.com', updated_by='dantwi@barracuda.com', delta_runtime_properties_kvpairs={}, securable_type='TABLE', effective_auto_maintenance_flag={'value': 'ENABLE', 'inherited_from_type': 'METASTORE', 'inherited_from_name': 'metastore-eu-central-1'}, effective_predictive_optimization_flag={'value': 'ENABLE', 'inherited_from_type': 'METASTORE', 'inherited_from_name': 'metastore-eu-central-1'}, browse_only=False)
{'delta.lastCommitTimestamp': '1721671751000', 'delta.lastUpdateVersion': '0', 'delta.minWriterVersion': '7', 'delta.enableDeletionVectors': 'true', 'delta.minReaderVersion': '3', 'delta.feature.deletionVectors': 'supported'}
My region is eu-central-1.
Daniel Antwi
07/24/2024, 4:29 PM
# Replace with the region your AWS S3 bucket is in
MY_REGION = "eu-central-1"
daft.set_planning_config(
    default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config(region_name=MY_REGION)
    ),
)
jay
07/24/2024, 4:30 PMjay
07/24/2024, 4:30 PMDaniel Antwi
07/24/2024, 4:31 PMDaniel Antwi
07/24/2024, 4:43 PMjay
07/24/2024, 4:56 PM
print(daft.io.S3Config.from_env())
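If the printed region is not the one you expect, it can help to see whether a region is set through environment variables. The sketch below checks the two standard AWS variables. Which sources Daft actually consults, and in what order, is an assumption here; the region may also come from the shared AWS config file, which this does not inspect. `region_from_env` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional, Tuple

# Standard AWS region variables; the lookup order here is an assumption.
REGION_ENV_VARS = ("AWS_REGION", "AWS_DEFAULT_REGION")


def region_from_env(env: Mapping[str, str] = os.environ) -> Optional[Tuple[str, str]]:
    """Return (variable name, value) for the first region variable that is set."""
    for name in REGION_ENV_VARS:
        value = env.get(name)
        if value:
            return name, value
    return None


# Prints None when neither variable is set in the current process.
print(region_from_env())
```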
jay
07/24/2024, 4:56 PMDaniel Antwi
07/24/2024, 6:09 PM
S3Config
region_name: Some("us-east-1")
endpoint_url: None
key_id: Some("xxxxx")
session_token: None,
access_key: Some("xxxxxx")
credentials_provider: None
buffer_time: None
max_connections: 8,
retry_initial_backoff_ms: 1000,
connect_timeout_ms: 30000,
read_timeout_ms: 30000,
num_tries: 25,
retry_mode: Some("adaptive"),
anonymous: false,
use_ssl: true,
verify_ssl: true,
check_hostname_ssl: true
requester_pays: false
force_virtual_addressing: false
The region in the output is showing us-east-1, though.
Daniel Antwi
07/24/2024, 6:39 PMjay
07/24/2024, 6:39 PM
.show() to get it to run and show you some data 🙂
Daniel Antwi
07/24/2024, 6:42 PMDaniel Antwi
07/24/2024, 6:43 PM
# Replace with the region your AWS S3 bucket is in
MY_REGION = "eu-central-1"
daft.set_planning_config(
    default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config(region_name=MY_REGION)
    ),
)
jay
07/25/2024, 4:38 PMjay
07/25/2024, 11:16 PM
io_config if we see that the table is a Unity Catalog table. We assumed that the Unity Catalog table would be “well-behaved”, but I guess it provides “bad” regions, so this logic here might be a little too naive:
https://github.com/Eventual-Inc/Daft/commit/395ebe8f406488b11eb65d71c0ff01ba76a61d76#diff-1e32b47add26cea48d942edc0[…]e76ccfbe83e7b8439de9065R79-R82
jay
07/25/2024, 11:20 PMDaniel Antwi
07/29/2024, 4:14 PM