# general
j
Hello! You can try the following, let us know how well it performs for you:
import pickle

import daft

df = daft.read_csv(…)  # load the CSV that lists the pickle URLs
df = df.with_column(
    "my_objs",
    df["urls"].url.download().apply(
        # url.download() yields raw bytes, so use pickle.loads (not pickle.load)
        lambda bytes_data: pickle.loads(bytes_data),
        return_dtype=daft.DataType.python(),
    ),
)
👍 1
In general though, if the bottleneck in this process is loading the pickle rather than the actual downloads, you may not see as dramatic of a speedup, since we can’t speed up Python itself. Let me know how this goes!
s
Yes! You should be able to do daft.from_glob_paths(path).with_column("binary", col("path").url.download()), then use a UDF to unpickle the data!
👍 1
Oh, I see @jay beat me to it! Yeah, you should be able to glob the paths on S3 or load the full list of paths from a CSV.
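A minimal sketch of that glob + UDF approach, assuming the @daft.udf decorator and Series.to_pylist(); the bucket path and the unpickle helper are hypothetical, and exact API details may vary between Daft versions:
import pickle

import daft
from daft import col

@daft.udf(return_dtype=daft.DataType.python())
def unpickle(binary_col):
    # binary_col arrives as a daft.Series of downloaded bytes; unpickle each element
    return [pickle.loads(b) for b in binary_col.to_pylist()]

df = daft.from_glob_paths("s3://my-bucket/objects/*.pkl")  # hypothetical bucket path
df = df.with_column("binary", col("path").url.download())  # download each object
df = df.with_column("my_objs", unpickle(df["binary"]))     # unpickle via the UDF
This keeps the downloads inside Daft's native url.download(), with only the pickle.loads step running as plain Python.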
f
Thanks guys. Performance is not a bottleneck, but I asked as an aside. My goal is to unify all the IO operations and replace the many different tools in our codebase (s3fs, PyArrow, Ray Data, ...) with one tool that can read and write a wide range of objects. I'll try your suggestions. I hope Daft's development continues to grow and that it gets more adoption. I just started looking into it and it looks very promising.
🔥 2
j
Let us know if you have any features/APIs in mind that would be useful!
👍 1