# general
z
Is it possible to pass your own FileSystem to `write_parquet`? I have a use case where I want to write to GCS, but can't use the standard `pyarrow.fs.GcsFileSystem` that Daft infers from the `gs://` URI protocol.
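A minimal sketch of the call in question, assuming the write goes through the `gs://` path (the bucket, prefix, and data here are placeholders):
```python
import daft

df = daft.from_pydict({"x": [1, 2, 3]})

# Daft infers a GCS filesystem from the gs:// protocol and tries to import
# pyarrow.fs.GcsFileSystem to back the write, which fails in an environment
# where that class is unavailable.
df.write_parquet("gs://my-bucket/prefix")  # hypothetical bucket/prefix
```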
j
Do you have a custom implementation that you need Daft to use? We started abstracting this away from users because Daft is moving towards more native (written by us in Rust) readers and writers. I think we’re open to exposing it, but note that this might not be compatible with the direction our writers are moving towards in the near future (we’ll be writing a lot of that logic ourselves to get it to be performant/fault tolerant)
z
Gotcha. My pyarrow package just doesn't have `fs.GcsFileSystem` (for a number of reasons, kinda out of my control), so currently Daft just throws an error when it infers the filesystem and tries to import it. The future world where you rip out pyarrow `write_dataset` in favor of a custom Rust writer is great. I would still need to be able to pass in custom credentials like in the previous thread, but it seems like that wouldn't change.
👍 1
I could definitely see users who have some custom filesystem they want to read and write to though
j
Got it. Have you tried the S3 endpoint for GCS by the way? That might work better 🤞
z
I tried, but it kept hitting AWS
👀 2
j
Ok haha, that sounds like a bug… Let me verify on our end that it should work (we've done it before!)
z
Try as in I switched
`df.write_parquet("gs://bucket/prefix")`
to
`df.write_parquet("s3://bucket/prefix")`
j
Oh! hahaha
Yeah you have to add some configs to make it work. Let me reshare that in a sec
❤️ 1
Here try this:
```python
import daft
from daft.io import IOConfig, S3Config

# Create a custom IOConfig so that s3:// paths
# point at GCS instead of AWS
io_config = IOConfig(
    s3=S3Config(
        endpoint_url="https://storage.googleapis.com",
        region_name=GCP_REGION_NAME,
        key_id=HMAC_ACCESS_KEY,
        access_key=HMAC_SECRET_KEY,
        # This defaults to 64 -- lower it to be less aggressive if you're running on Ray
        max_connections=4,
    ),
)

daft.set_planning_config(default_io_config=io_config)
```
The tricky bit is you have to generate these HMAC keys. Here’s a blog that talks about how to do it: https://dzlab.github.io/gcp/2022/02/26/gs-with-s3-sdk/
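For completeness, a sketch of how the subsequent write could look once that config is registered (continuing from the snippet above; the bucket and prefix are placeholders, not from the original thread):
```python
# With the IOConfig registered via daft.set_planning_config, s3:// paths
# are routed to GCS through its S3-compatible endpoint.
df = daft.from_pydict({"x": [1, 2, 3]})
df.write_parquet("s3://my-gcs-bucket/prefix")  # hypothetical bucket/prefix
```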
Also question: is the reason you don’t have your own pa.GcsFileSystem because you need to support proxies when accessing GCS? We’ve been thinking about adding proxy support
z
Thank you! I'll give it a shot tomorrow. I have to go through some hoops to generate HMAC keys.
On the proxy question, not really. Not sure how much I can legally talk about why
👍 1
j
Ah got it 😛
We could expose a parameter to pass in a pyarrow Filesystem as a workaround for now, but note that longer term we’ll probably be running our own Rust writers instead, which might require configuration on your end to work with your infrastructure setup
z
Let me try the hmac route before we go too far into that. That seems promising
🙌 1
On a separate note, I could definitely see users who have their own filesystems needing to be able to read and write with Daft, e.g. `ny_times_fs://bucket/prefix`
j
Yes… longer-term we will likely maintain an escape hatch somehow for this to work (but make very few guarantees around performance or reliability, since that is all 3rd-party code). Perhaps some kind of mapping of `{"ny_times_fs://": MyCustomPyArrowFS}`
z
Yea. You could let users set some kind of package-level registry, or just let them specify the filesystem as an arg in the `read_*` / `write_*` method they're calling.
Could also see users that have their own file type, e.g. `ny_times_fs://path/shocking_news.article`
I think reading custom files and filesystems is probably more important than writing, at least to start… Maybe have a `read_arrow` where users can provide a function that returns Arrow data given a file URI, plus a filesystem arg that handles filesystem-level operations?
Daft would take the returned Arrow values and use them to construct the DataFrame. (A rough sketch of that idea follows below.)
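To make that proposal concrete, here is a rough sketch of what such a `read_arrow` entry point could look like — this is not a real Daft API, and the function name, signature, reader, and path are all hypothetical:
```python
from typing import Callable

import daft
import pyarrow as pa
import pyarrow.fs as pafs
import pyarrow.parquet as pq

def read_arrow(
    uri: str,
    reader: Callable[[str, pafs.FileSystem], pa.Table],
    filesystem: pafs.FileSystem,
) -> daft.DataFrame:
    """Hypothetical entry point: the user supplies a function that turns a
    (uri, filesystem) pair into an Arrow table, and Daft builds a DataFrame
    from the result."""
    table = reader(uri, filesystem)
    return daft.from_arrow(table)

# Example user-supplied reader for a plain Parquet file
def my_reader(uri: str, fs: pafs.FileSystem) -> pa.Table:
    with fs.open_input_file(uri) as f:
        return pq.read_table(f)

df = read_arrow("/tmp/shocking_news.parquet", my_reader, pafs.LocalFileSystem())
```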