# general
j
Hello! You can try the following, let us know how well it performs for you:
import pickle

import daft

df = daft.read_csv(…)  # load the CSV that lists the pickle URLs
df = df.with_column(
    "my_objs",
    df["urls"].url.download().apply(
        # url.download() yields raw bytes, so use pickle.loads (not pickle.load)
        lambda bytes_data: pickle.loads(bytes_data),
        return_dtype=daft.DataType.python(),
    ),
)
👍 1
In general though, if the bottleneck in this process is loading the pickle rather than the actual downloads, you may not see as dramatic of a speedup, since we can’t speed up Python itself. Let me know how this goes!
s
Yes! You should be able to do daft.from_glob_paths(path).with_column("binary", col("path").url.download()), then use a UDF to unpickle the data!
👍 1
Oh, I see @jay beat me to it! Yeah, you should be able to glob the paths on S3 or load the full list of paths from a CSV.
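A minimal sketch of that glob + UDF approach, assuming the @daft.udf decorator and Series.to_pylist(); the bucket path and the unpickle helper are hypothetical, and exact API details may vary between Daft versions:
import pickle

import daft
from daft import col

@daft.udf(return_dtype=daft.DataType.python())
def unpickle(binary_col):
    # binary_col arrives as a daft.Series of downloaded bytes; unpickle each element
    return [pickle.loads(b) for b in binary_col.to_pylist()]

df = daft.from_glob_paths("s3://my-bucket/objects/*.pkl")  # hypothetical bucket path
df = df.with_column("binary", col("path").url.download())  # download each object
df = df.with_column("my_objs", unpickle(df["binary"]))     # unpickle via the UDF
This keeps the downloads inside Daft's native url.download(), with only the pickle.loads step running as plain Python.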
f
Thanks guys. Performance is not a bottleneck, but I asked as an aside. My goal is to unify all the IO operations and replace the many different tools in our codebase (s3fs, PyArrow, Ray Data, ...) with one tool that can read and write a wide range of objects. I'll try your suggestions. I hope Daft's development continues to grow and that it gets more adoption. I just started looking into it and it looks very promising.
🔥 2
j
Let us know if you have any features/APIs in mind that would be useful!
👍 1