Zac Steer
09/19/2024, 9:19 PM
pyarrow.fs.GCSFileSystem daft infers from the gs uri protocol
jay
09/19/2024, 10:24 PM
Zac Steer
09/19/2024, 10:54 PM
Zac Steer
09/19/2024, 10:58 PM
jay
09/19/2024, 10:59 PM
Zac Steer
09/19/2024, 10:59 PM
jay
09/19/2024, 11:00 PM
Zac Steer
09/19/2024, 11:00 PM
df.write_parquet(gs://bucket/prefix) to df.write_parquet(s3://bucket/prefix)
jay
09/19/2024, 11:01 PM
jay
09/19/2024, 11:01 PM
jay
09/19/2024, 11:06 PM
import daft
from daft.io import IOConfig, S3Config

# Create a custom IOConfig so that s3:// paths
# hit GCS instead of AWS.
# GCP_REGION_NAME, HMAC_ACCESS_KEY and HMAC_SECRET_KEY are placeholders for your own values.
io_config = IOConfig(
    s3=S3Config(
        endpoint_url="https://storage.googleapis.com",
        region_name=GCP_REGION_NAME,
        key_id=HMAC_ACCESS_KEY,
        access_key=HMAC_SECRET_KEY,
        # This defaults to 64 -- lower it to be less aggressive if you're running on Ray
        max_connections=4,
    ),
)
daft.set_planning_config(default_io_config=io_config)
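With a config like the one above set as the default, existing gs:// paths would need to be rewritten to s3:// before they are passed to Daft. A minimal sketch of that rewrite follows; gs_to_s3 is a hypothetical helper name used for illustration and is not part of Daft.

```python
from urllib.parse import urlsplit


def gs_to_s3(uri: str) -> str:
    """Rewrite gs://bucket/key to s3://bucket/key; return any other URI unchanged.

    Hypothetical helper for illustration only, not a Daft API.
    """
    if urlsplit(uri).scheme != "gs":
        return uri
    # Keep bucket and key exactly as written; swap only the scheme.
    return "s3://" + uri[len("gs://"):]


# Intended use, assuming the S3-compatible IOConfig is already the default:
#   df.write_parquet(gs_to_s3("gs://bucket/prefix"))
```

Only the scheme is swapped, so bucket names and object keys stay the same on both sides.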
The tricky bit is you have to generate these HMAC keys. Here’s a blog that talks about how to do it: https://dzlab.github.io/gcp/2022/02/26/gs-with-s3-sdk/
jay
09/19/2024, 11:09 PM
Zac Steer
09/19/2024, 11:37 PM
Zac Steer
09/19/2024, 11:39 PM
jay
09/19/2024, 11:39 PM
jay
09/19/2024, 11:40 PM
Zac Steer
09/19/2024, 11:41 PM
Zac Steer
09/19/2024, 11:45 PM
ny_times_fs://bucket/prefix
jay
09/19/2024, 11:47 PM
{"ny_times_fs://": MyCustomPyArrowFS}
Zac Steer
09/19/2024, 11:50 PM
Zac Steer
09/19/2024, 11:51 PM
ny_times_fs://path/shocking_news.article
Zac Steer
09/19/2024, 11:59 PM
read_arrow where users can provide a function that returns arrow given a file uri and a filesystem arg that handles filesystem level operations?
Zac Steer
09/20/2024, 12:01 AM
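
The prefix-to-filesystem mapping floated earlier in the thread ({"ny_times_fs://": MyCustomPyArrowFS}) can be sketched as a plain lookup. This only illustrates the idea and is not an existing Daft API: resolve_filesystem is a hypothetical name, and the MyCustomPyArrowFS class here is an empty stand-in for what would presumably be a pyarrow.fs.FileSystem subclass or instance.

```python
from typing import Dict, Optional, Tuple


class MyCustomPyArrowFS:
    """Empty stand-in for a user-defined filesystem (hypothetical)."""


def resolve_filesystem(
    uri: str, registry: Dict[str, object]
) -> Tuple[Optional[object], str]:
    """Return (filesystem, path) for the longest registered prefix matching uri.

    Falls back to (None, uri) so the caller can apply its default handling.
    """
    # Try longer prefixes first so a more specific entry wins over a shorter one.
    for prefix in sorted(registry, key=len, reverse=True):
        if uri.startswith(prefix):
            return registry[prefix], uri[len(prefix):]
    return None, uri


registry = {"ny_times_fs://": MyCustomPyArrowFS()}
fs, path = resolve_filesystem("ny_times_fs://path/shocking_news.article", registry)
```

URIs with no registered prefix come back unchanged with None as the filesystem, which leaves room for the usual protocol-based inference.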