Sammy Sidhu
08/30/2024, 6:08 AM
Split block_on into block_on_io_pool and block_on_current_thread in this PR.
Should be ready for review!
https://github.com/Eventual-Inc/Daft/pull/2687
Colin Ho
08/30/2024, 5:59 PM
import daft
from daft.context import set_execution_config
set_execution_config(enable_native_executor=True, default_morsel_size=100)
df = daft.read_parquet(
    "s3://daft-public-datasets/imagenet/sample-100k-deltalake/0-df403275-efee-4409-9ed6-8c25c4200655-0.parquet"
)
df = df.with_column(
    "images",
    (
        "s3://daft-public-datasets/imagenet/sample-100k-deltalake/images/"
        + df["folder"]
        + "/"
        + df["filename"]
        + ".jpeg"
    )
    .url.download()
    .image.decode()
    .image.resize(128, 128),
)
NUM_ROWS_TO_DATALOAD = 1000
for i, row in enumerate(df):
    print(row)
    if i >= NUM_ROWS_TO_DATALOAD:
        break
Though to make it work, you still need to configure the default_morsel_size, otherwise it'll try to run all 100k URL downloads concurrently. I guess the next step would be to make the morsel sizing smarter.
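To illustrate the idea (just a conceptual sketch, not how the native executor actually schedules work; download, run, and the example URLs are made-up placeholders): pulling work in fixed-size morsels means only one batch of downloads is awaited at a time, so at most morsel_size requests are in flight instead of all 100k.

import asyncio

async def download(url: str) -> bytes:
    # Stand-in for a real HTTP/S3 fetch.
    await asyncio.sleep(0.01)
    return b""

async def run(urls: list[str], morsel_size: int = 100) -> list[bytes]:
    results: list[bytes] = []
    # Downloads within a morsel run concurrently, but only one morsel is
    # awaited at a time, so at most `morsel_size` requests are outstanding.
    for start in range(0, len(urls), morsel_size):
        morsel = urls[start : start + morsel_size]
        results.extend(await asyncio.gather(*(download(u) for u in morsel)))
    return results

asyncio.run(run([f"https://example.com/{i}.jpeg" for i in range(1_000)]))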
jay
08/30/2024, 7:38 PM
jay
08/30/2024, 7:38 PM