# general
z
Is it possible to pass your own FileSystem to `write_parquet`? I have a use case where I want to write to GCS, but can't use the standard `pyarrow.fs.GcsFileSystem` that Daft infers from the `gs://` URI protocol.
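A minimal sketch of the call in question, assuming the write goes through the `gs://` path (the bucket, prefix, and data here are placeholders):
```python
import daft

df = daft.from_pydict({"x": [1, 2, 3]})

# Daft infers a GCS filesystem from the gs:// protocol and tries to import
# pyarrow.fs.GcsFileSystem to back the write, which fails in an environment
# where that class is unavailable.
df.write_parquet("gs://my-bucket/prefix")  # hypothetical bucket/prefix
```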
j
Do you have a custom implementation that you need Daft to use? We started abstracting this away from users because Daft is moving towards more native (written by us in Rust) readers and writers. I think we’re open to exposing it, but note that this might not be compatible with the direction our writers are moving towards in the near future (we’ll be writing a lot of that logic ourselves to get it to be performant/fault tolerant)
z
Gotcha. My pyarrow package just doesn't have `fs.GcsFileSystem` (for a number of reasons, kinda out of my control), so currently Daft just throws an error when it infers the filesystem and tries to import it. The future world where you rip out pyarrow `write_dataset` in favor of a custom Rust writer is great. I would still need to be able to pass in custom credentials like in the previous thread, but it seems like that wouldn't change.
👍 1
I could definitely see users who have some custom filesystem they want to read and write to though
j
Got it. Have you tried the S3 endpoint for GCS by the way? That might work better 🤞
z
I tried, but it kept hitting AWS
👀 2
j
Ok haha, that sounds like a bug… Let me verify on our end that it should work (we've done it before!)
z
Try as in I switched
`df.write_parquet("gs://bucket/prefix")`
to
`df.write_parquet("s3://bucket/prefix")`
j
Oh! hahaha
Yeah you have to add some configs to make it work. Let me reshare that in a sec
❤️ 1
Here try this:
```python
import daft
from daft.io import IOConfig, S3Config

# Create a custom IOConfig so that s3:// paths
# point at GCS instead of AWS
io_config = IOConfig(
    s3=S3Config(
        endpoint_url="https://storage.googleapis.com",
        region_name=GCP_REGION_NAME,
        key_id=HMAC_ACCESS_KEY,
        access_key=HMAC_SECRET_KEY,
        # This defaults to 64 -- lower it to be less aggressive if you're running on Ray
        max_connections=4,
    ),
)

daft.set_planning_config(default_io_config=io_config)
```
The tricky bit is you have to generate these HMAC keys. Here’s a blog that talks about how to do it: https://dzlab.github.io/gcp/2022/02/26/gs-with-s3-sdk/
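For completeness, a sketch of how the subsequent write could look once that config is registered (continuing from the snippet above; the bucket and prefix are placeholders, not from the original thread):
```python
# With the IOConfig registered via daft.set_planning_config, s3:// paths
# are routed to GCS through its S3-compatible endpoint.
df = daft.from_pydict({"x": [1, 2, 3]})
df.write_parquet("s3://my-gcs-bucket/prefix")  # hypothetical bucket/prefix
```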
Also question: is the reason you don’t have your own pa.GcsFileSystem because you need to support proxies when accessing GCS? We’ve been thinking about adding proxy support
z
Thank you! I'll give it a shot tomorrow. I have to go through some hoops to generate HMAC keys.
On the proxy question, not really. Not sure how much I can legally talk about why
👍 1
j
Ah got it 😛
We could expose a parameter to pass in a pyarrow Filesystem as a workaround for now, but note that longer term we’ll probably be running our own Rust writers instead, which might require configuration on your end to work with your infrastructure setup
z
Let me try the hmac route before we go too far into that. That seems promising
🙌 1
On a separate note, I could definitely see users who have their own filesystems needing to be able to read and write with Daft, e.g. `ny_times_fs://bucket/prefix`
j
Yes… longer-term we will likely maintain an escape hatch somehow for this to work (but make very few guarantees around performance or reliability, since that is all 3rd-party code). Perhaps some kind of mapping of `{"ny_times_fs://": MyCustomPyArrowFS}`
z
Yea. You could let users set some kind of package-level registry, or just let them specify the filesystem as an arg in the `read_*` / `write_*` method they're calling.
Could also see users that have their own file type, e.g. `ny_times_fs://path/shocking_news.article`
I think reading custom files and filesystems is probably more important than writing, at least to start… Maybe have a `read_arrow` where users can provide a function that returns Arrow data given a file URI, plus a filesystem arg that handles filesystem-level operations?
Daft would take the returned Arrow values and use them to construct the DataFrame. (A rough sketch of that idea follows below.)
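To make that proposal concrete, here is a rough sketch of what such a `read_arrow` entry point could look like — this is not a real Daft API, and the function name, signature, reader, and path are all hypothetical:
```python
from typing import Callable

import daft
import pyarrow as pa
import pyarrow.fs as pafs
import pyarrow.parquet as pq

def read_arrow(
    uri: str,
    reader: Callable[[str, pafs.FileSystem], pa.Table],
    filesystem: pafs.FileSystem,
) -> daft.DataFrame:
    """Hypothetical entry point: the user supplies a function that turns a
    (uri, filesystem) pair into an Arrow table, and Daft builds a DataFrame
    from the result."""
    table = reader(uri, filesystem)
    return daft.from_arrow(table)

# Example user-supplied reader for a plain Parquet file
def my_reader(uri: str, fs: pafs.FileSystem) -> pa.Table:
    with fs.open_input_file(uri) as f:
        return pq.read_table(f)

df = read_arrow("/tmp/shocking_news.parquet", my_reader, pafs.LocalFileSystem())
```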