Peter
04/07/2024, 9:28 AM
I have a dataframe with 518 numerical columns and 106,574 rows. I am applying z-score standardization to each column, since I am working on a nearest-neighbor search algorithm and want to eliminate bias caused by different vector magnitudes.
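(For reference, the z-score of a value x in a column with mean μ and standard deviation σ is z = (x - μ) / σ; after the transform every column has zero mean and unit variance, so no single feature dominates the nearest-neighbor distances.)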
On screenshot #1 I am materializing results via `to_pandas`, by first limiting the dataframe to the top 5 rows via `limit(5)`. This takes 24.0 seconds to execute, as VSCode indicates.
On screenshot #2 I am materializing results via `to_pandas` without any limit. This takes 23.2 seconds to execute.
I would have expected the approach where I `limit(5)` first to perform significantly better; instead, it performs slightly worse. Do you have an idea why this might be the case? Maybe the query optimizer does not consider the presence of `limit` when there are `with_column` statements involving UDFs?
Also a note on the use of a UDF for the z-score standardization: I had to use it as a last resort, as I could not find a Daft-native implementation of variance or standard deviation. Any suggestions on optimizing things there?
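One thing I have not tried yet: emulating std with plain mean aggregations, via var(x) = E[x^2] - E[x]^2. A rough, untested sketch (it assumes `agg` accepts a list of expressions like this):
```python
from daft import col

feature_names = [c for c in features.column_names if c != "track_id"]

# One pass over the data to get E[x] and E[x^2] for every feature column.
stats = features.agg(
    *[col(c).mean().alias(f"{c}_mean") for c in feature_names],
    *[(col(c) * col(c)).mean().alias(f"{c}_sqmean") for c in feature_names],
).to_pydict()

def zscore(c: str):
    mean = stats[f"{c}_mean"][0]
    std = (stats[f"{c}_sqmean"][0] - mean**2) ** 0.5  # var = E[x^2] - E[x]^2
    return ((col(c) - mean) / std).alias(c)

features = features.select("track_id", *[zscore(c) for c in feature_names])
```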
jay
04/07/2024, 7:53 PM
Try `df.explain(True)`.
My guess is that the scan is probably taking most of the time, and limiting the number of rows doesn't actually limit the scan enough (depending on the file format).

Peter
04/07/2024, 9:23 PM
```python
import daft
import numpy as np
from functools import reduce

# Z-score standardize one column: (x - mean) / std.
@daft.udf(return_dtype=daft.DataType.float32())
def standardize(series: daft.Series) -> daft.Series:
    arr: np.ndarray = series.to_arrow().to_numpy()
    standardized = (arr - arr.mean()) / arr.std()
    return daft.Series.from_numpy(standardized)

# `features` was created earlier (see below: daft.from_pandas(features_df)).
feature_names = [c for c in features.column_names if c != "track_id"]
features = reduce(
    lambda df, c: df.with_column(c, standardize(df[c])), feature_names, features
)
features.limit(5).explain(show_all=True)
```
jay
04/07/2024, 9:37 PM

jay
04/07/2024, 9:38 PM
If you try a simple expression instead of the UDF (e.g. `+ 1`), do you see a similar behavior?

Peter
04/07/2024, 9:54 PM
Yes, same behavior with `daft.col(c) + 1`.
I attached screenshot #1 as a minimal reproduction.
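In code, the repro boils down to this (a sketch reconstructing the screenshot, not a verbatim copy):
```python
# Same reduce pattern, with the UDF swapped for a trivial expression;
# materializing with limit(5) is still just as slow.
features = reduce(
    lambda df, c: df.with_column(c, daft.col(c) + 1), feature_names, features
)
features.limit(5).to_pandas()
```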
I believe the issue I am facing comes from the way I instantiate the Daft DataFrame: I use `daft.from_pandas(features_df)`, so yes, the input dataframe is already in memory. For some reason I expected some magical Daft lazification to kick in after `daft.from_pandas(features_df)`, though.
Unfortunately, I am forced to read the CSV with Pandas initially, because it has hierarchical headers (screenshot #2; I also attached a tiny subset of the file) and `daft.read_csv` lacks the advanced parsing options that `pd.read_csv` has. Reading such a CSV with Daft just raises an error, as screenshot #3 shows.
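For reference, the loading path currently looks roughly like this (the file name and `header=[0, 1, 2]` are placeholders, not the exact values):
```python
import daft
import pandas as pd

# Parse the hierarchical (multi-row) header with pandas, since daft.read_csv cannot.
features_df = pd.read_csv("features.csv", header=[0, 1, 2])

# Flatten the MultiIndex columns into plain string names before handing off to Daft.
features_df.columns = ["_".join(map(str, tup)).strip("_") for tup in features_df.columns]

features = daft.from_pandas(features_df)
```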
jay
04/08/2024, 7:00 PM

Peter
04/08/2024, 8:15 PM

jay
04/09/2024, 5:50 PM

Peter
04/09/2024, 5:54 PM

Kevin Wang
04/11/2024, 12:32 AM
There's overhead incurred every time `with_column` is called, which makes it costly for a dataframe with a lot of columns.
I'm working on fixing it right now, but a temporary workaround would be to call `features.select` once with all of the desired columns specified, instead of using reduce. Something like this:
```python
features = features.select("track_id", *[standardize(col(name)) for name in feature_names])
```
Peter
04/11/2024, 5:36 AM

Kevin Wang
04/12/2024, 6:13 PM

Peter
04/12/2024, 6:41 PM
It is indeed much faster when using `select` instead of the reduction, as the screenshots indicate.
Interestingly, it is the `reduce(lambda df, c: df.with_column(c, standardize(df[c])), feature_names, features)` line that takes really long, and not the expression where I explicitly materialize via `.limit(5).to_pandas()`.
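A quick way to see that split is to time the two steps separately, e.g.:
```python
import time

t0 = time.perf_counter()
features = reduce(
    lambda df, c: df.with_column(c, standardize(df[c])), feature_names, features
)
t1 = time.perf_counter()

preview = features.limit(5).to_pandas()
t2 = time.perf_counter()

# Building the (lazy) plan dominates; materialization is comparatively quick.
print(f"plan construction: {t1 - t0:.1f}s, materialization: {t2 - t1:.1f}s")
```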
Kevin Wang
04/12/2024, 6:42 PM

Kevin Wang
04/24/2024, 11:45 PM
This change improves the performance of `with_column`, as well as adds an additional `with_columns` (plural) DataFrame operation, which you might also find useful. These changes will be included in our next release!

Peter
04/25/2024, 10:02 AM
`with_columns` will be super useful. Out of curiosity, how do the semantics and performance of `with_columns` compare to those of `select`?

Kevin Wang
04/25/2024, 10:09 AM
`with_columns` will be more efficient than `select`. As for semantics, it will take in a dictionary of name: expression instead of a list of columns like `select`.
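Based on that description, the standardization above could presumably collapse into a single call once the release lands (a sketch; the exact API may differ):
```python
# Hypothetical usage of the upcoming with_columns, assuming it takes
# a dict mapping column name -> expression as described above.
features = features.with_columns(
    {name: standardize(col(name)) for name in feature_names}
)
```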