# daft-dev
question from @Anil Pillai Is there a perf benefit to using Ray Dataset groupby+map_groups vs. Daft groupby+map_groups? What is the best practice you recommend?
```python
import pandas as pd
import ray

def normalize_variety(group: pd.DataFrame) -> pd.DataFrame:
    # Scale each numeric column to [-1, 1] within its group.
    # Note: drop(columns=...) is needed here; a bare drop("variety")
    # would try to drop a row labeled "variety".
    for feature in group.drop(columns="variety").columns:
        group[feature] = group[feature] / group[feature].abs().max()
    return group

ds = (
    ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet")
    .groupby("variety")
    .map_groups(normalize_variety, batch_format="pandas")
)
--- I wouldn’t expect a huge difference in performance between Daft and Ray Data here, though I think Daft is faster just by virtue of having better-tuned local execution than the pandas that Ray Data uses, and also a faster client for reading Parquet from S3. Generally our recommendation is:
• If you just need simple last-mile preprocessing before piping data into distributed ML training, Ray Data is a good solution
• However, if you need a fully-fledged data processing/analytics tool (e.g. one that supports joins, groupbys, sorts…) then you should use Daft