# daft-dev
s
@Colin Ho I cleaned up the separation of `block_on` into `block_on_io_pool` and `block_on_current_thread` in this PR. Should be ready for review! https://github.com/Eventual-Inc/Daft/pull/2687
🐐 1
c
Approved! Also, I tested the URL download workload, and it works perfectly: cc @jay
```python
import daft
from daft.context import set_execution_config

# Cap the morsel size so the native executor doesn't launch all downloads at once.
set_execution_config(enable_native_executor=True, default_morsel_size=100)

df = daft.read_parquet(
    "s3://daft-public-datasets/imagenet/sample-100k-deltalake/0-df403275-efee-4409-9ed6-8c25c4200655-0.parquet"
)

# Build each image URL from the folder/filename columns, then download,
# decode, and resize in one expression.
df = df.with_column(
    "images",
    (
        "s3://daft-public-datasets/imagenet/sample-100k-deltalake/images/"
        + df["folder"]
        + "/"
        + df["filename"]
        + ".jpeg"
    )
    .url.download()
    .image.decode()
    .image.resize(128, 128),
)

NUM_ROWS_TO_DATALOAD = 1000
for i, row in enumerate(df):
    print(row)
    if i >= NUM_ROWS_TO_DATALOAD:
        break
```
Though to make it work, you still need to configure `default_morsel_size`; otherwise it'll try to run all 100k URL downloads concurrently. I guess the next step would be to make the morsel sizing smarter.
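The throttling idea above can be sketched in plain Python: instead of launching every download at once, split the work into fixed-size morsels and only run one morsel's worth concurrently. This is a minimal illustration, not Daft's implementation; the names `fetch` and `download_in_morsels` are hypothetical.

```python
# Hypothetical sketch of morsel-based throttling (not Daft's actual code).
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> str:
    # Stand-in for a real URL download.
    return f"bytes-of-{url}"


def download_in_morsels(urls, morsel_size=100):
    results = []
    for start in range(0, len(urls), morsel_size):
        # At most morsel_size downloads are in flight at a time.
        morsel = urls[start:start + morsel_size]
        with ThreadPoolExecutor(max_workers=len(morsel)) as pool:
            results.extend(pool.map(fetch, morsel))
    return results


urls = [f"img-{i}.jpeg" for i in range(250)]
out = download_in_morsels(urls, morsel_size=100)
print(len(out))  # 250
```

With `morsel_size=100`, the 250 URLs are processed as three batches (100, 100, 50), bounding peak concurrency while preserving output order.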
j
Damn sick.
This is SUPER exciting