Conor Kennedy
08/19/2024, 11:40 PM:

import daft
from daft import col
from daft.io import IOConfig

io_config = IOConfig(
    s3=daft.io.S3Config.from_env()
)
dedup = daft.read_parquet("s3://eventual-dev-benchmarking-fixtures/redpajama-parquet/dedupe1", io_config=io_config)
cc_sizes = dedup.groupby("original_url", "original_date_download").agg(col("url").count().alias("component_size"))
cc_sizes.sort(by="component_size", desc=True).explain(True)

It seems to be doing a lot of sequential requests to AWS. (More info in thread)

Conor Kennedy
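A standalone sketch, not from the thread, of why "a lot of sequential requests" dominates wall-clock time: when each object-store round trip is issued only after the previous one finishes, total time grows linearly with the request count, while overlapping the waits collapses it toward a single round trip. The latency and request count below are assumptions standing in for per-row-group S3 GETs; no real AWS calls are made.

```python
import time
from concurrent.futures import ThreadPoolExecutor

REQUEST_LATENCY = 0.05  # hypothetical per-request latency (seconds), standing in for one S3 GET
NUM_REQUESTS = 20       # hypothetical request count, e.g. one range request per Parquet row group

def fake_s3_get(_):
    # Simulate a network round trip with a sleep; no real AWS call is made.
    time.sleep(REQUEST_LATENCY)
    return b"data"

# Sequential: total time grows linearly with the number of requests.
start = time.perf_counter()
for i in range(NUM_REQUESTS):
    fake_s3_get(i)
sequential = time.perf_counter() - start

# Concurrent: overlapping the waits collapses total time toward one round trip.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=NUM_REQUESTS) as pool:
    list(pool.map(fake_s3_get, range(NUM_REQUESTS)))
concurrent = time.perf_counter() - start

print("sequential: %.2fs  concurrent: %.2fs" % (sequential, concurrent))
```

With these numbers the sequential loop takes roughly NUM_REQUESTS times longer than the concurrent one, which matches the shape of the regression being discussed.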
08/19/2024, 11:41 PM:
Running git bisect on this issue traces it back to this commit: https://github.com/Eventual-Inc/Daft/commit/c5f2d4aac4ce86effeb3f1efe6199ad26a91f8f3

Conor Kennedy
08/19/2024, 11:43 PM: [message content not captured in export]

Conor Kennedy
08/19/2024, 11:44 PM: [message content not captured in export]

Conor Kennedy
08/19/2024, 11:53 PM: [message content not captured in export]

Conor Kennedy
08/19/2024, 11:56 PM: [message content not captured in export]

Conor Kennedy
08/20/2024, 12:21 AM: [message content not captured in export]

Cory Grinstead
08/20/2024, 12:44 AM: [message content not captured in export]

Conor Kennedy
08/20/2024, 12:44 AM: [message content not captured in export]

Sammy Sidhu
08/20/2024, 6:27 AM: [message content not captured in export]

Conor Kennedy
08/20/2024, 9:42 PM:
.collect to only 1.9s. Hopefully this doesn't mean I broke anything!

Conor Kennedy
08/20/2024, 10:05 PM: [message content not captured in export]

Conor Kennedy
08/20/2024, 10:09 PM: [message content not captured in export]

Conor Kennedy
08/20/2024, 10:10 PM: [message content not captured in export]

Conor Kennedy
08/21/2024, 6:41 PM: [message content not captured in export]

Conor Kennedy
08/21/2024, 6:44 PM: [message content not captured in export]

Conor Kennedy
08/21/2024, 6:44 PM:

import daft
from daft.io import IOConfig, S3Config
import time

io_config = IOConfig(s3=S3Config.from_env())

# FILE = "s3://daft-public-data/testing_data/bad-polars-lineitem.parquet"
FILE = "s3://eventual-dev-benchmarking-fixtures/redpajama-parquet/dedupe1"

total = 0
for i in range(10):
    start = time.perf_counter()
    df = daft.read_parquet(FILE, io_config=io_config).count("*")
    df.collect()
    dur = time.perf_counter() - start
    print("Time for iteration %i: %.3fs" % (i + 1, dur))
    total += dur
print("Total time: %.3fs" % total)
print("Avg time: %.3fs" % (total / 10))

Conor Kennedy
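As a tangent, the timing loop above can be factored into a small reusable helper that also reports min and stdev (min is often more robust than mean for noisy I/O benchmarks). `bench` and the stand-in workload are illustrative, not part of Daft or the thread.

```python
import time
import statistics

def bench(fn, iters=10):
    """Run fn() iters times and return the per-iteration durations in seconds."""
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return times

# Stand-in workload; in the thread this would be the daft read + count + collect.
def workload():
    sum(x * x for x in range(100_000))

times = bench(workload, iters=5)
print("min %.3fs  mean %.3fs  stdev %.3fs"
      % (min(times), statistics.mean(times), statistics.stdev(times)))
```

Keeping per-iteration samples rather than a running total makes it easy to spot warm-up effects (e.g. a slow first iteration from connection setup or caching), which matters when comparing commits.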
08/21/2024, 6:44 PM: [message content not captured in export]

Conor Kennedy
08/21/2024, 6:46 PM: [message content not captured in export]

Sammy Sidhu
08/21/2024, 7:00 PM: [message content not captured in export]

Sammy Sidhu
08/21/2024, 7:00 PM: [message content not captured in export]

Conor Kennedy
08/21/2024, 7:00 PM: [message content not captured in export]