Slackbot
03/04/2024, 9:11 PM
jay
03/04/2024, 9:13 PM
jay
03/04/2024, 9:19 PM
Rishabh Agarwal Jain
03/04/2024, 9:28 PM
Sammy Sidhu
03/04/2024, 9:57 PM
Rishabh Agarwal Jain
03/04/2024, 10:15 PM
Sammy Sidhu
03/04/2024, 11:07 PM
Ammar Alrashed
03/05/2024, 8:49 PM
Sammy Sidhu
03/05/2024, 8:59 PM
Rishabh Agarwal Jain
03/27/2024, 9:05 PM
Introduction
ETL is a crucial step in the data preprocessing pipeline, transforming raw data into meaningful features that enhance the performance of machine learning models. When dealing with large datasets, performing feature engineering at scale is essential for efficient data processing and extracting valuable insights.
Components of Feature Engineering:
Feature engineering at scale involves three main types of computations:
• Reads and Writes: Operations for reading data from various sources, such as files, databases, or distributed file systems, and writing the transformed data to the desired destination.
• SQL Queries (super important): Utilizing SQL queries for efficient data manipulation, allowing complex transformations and aggregations for generating informative features.
• Dataframe Operations: Working with tabular data structures, such as dataframes, to efficiently manipulate structured data. This includes operations like filtering, grouping, merging, and other transformations, enabling feature engineering at scale.
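To make the three kinds of computation concrete, here is a minimal single-machine sketch using only the Python standard library. It is not the target framework: `RAW_CSV`, `build_features`, and `write_features` are hypothetical names, and an in-memory SQLite database stands in for a real query engine.

```python
from __future__ import annotations

import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from files or a warehouse.
RAW_CSV = """user_id,amount
1,10.0
1,30.0
2,5.0
"""


def build_features(raw_csv: str) -> list[tuple[int, float, int]]:
    """Read rows, aggregate them with SQL, and return one feature row per user."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))  # read step
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(int(r["user_id"]), float(r["amount"])) for r in rows],
    )
    # SQL step: per-user aggregates used as features.
    features = conn.execute(
        "SELECT user_id, AVG(amount), COUNT(*) FROM events "
        "GROUP BY user_id ORDER BY user_id"
    ).fetchall()
    conn.close()
    return features


def write_features(features: list[tuple[int, float, int]]) -> str:
    """Write step: serialise the feature rows back to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["user_id", "avg_amount", "n_events"])
    writer.writerows(features)
    return out.getvalue()
```

Calling `write_features(build_features(RAW_CSV))` runs the read, the SQL aggregation, and the write in sequence. At scale, each of those steps would be distributed rather than run in one process.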
Requirements
Core Capabilities
• Read and write Parquet-format data from different sources, including Iceberg, Delta, and Redshift.
• Support for executing SQL queries for data manipulation.
• Support for dataframe operations, including:
◦ Statistical computations.
◦ SQL operations such as joins, group by, map, and sort.
◦ Calculating correlations and covariance among columns.
◦ Custom UDFs like map_with_pandas
• Fault tolerance to handle failures (including node failures) and ensure robustness.
• GPU support for accelerating computations on compatible hardware.
• Integration with external catalogues such as Glue to access and manage metadata.
• Ability to operate on larger-than-memory datasets by employing techniques like:
◦ Pipelining of operations for efficient processing.
◦ Disk spilling to manage data that exceeds available memory.
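As a rough illustration of how covariance and correlation can be computed on data that does not fit in memory, the sketch below streams two columns in fixed-size chunks and keeps only running sums. The names `chunks` and `streaming_cov_corr` are hypothetical, and the sum-of-products formula is chosen for brevity; a production implementation would likely use a numerically more stable update.

```python
from __future__ import annotations

import math
from typing import Iterable, Iterator

Pair = tuple[float, float]


def chunks(pairs: Iterable[Pair], size: int) -> Iterator[list[Pair]]:
    """Yield fixed-size batches so only one batch is held in memory at a time."""
    batch: list[Pair] = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def streaming_cov_corr(pairs: Iterable[Pair], chunk_size: int = 2) -> tuple[float, float]:
    """Return (sample covariance, Pearson correlation) of two columns in one pass."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for batch in chunks(pairs, chunk_size):
        for x, y in batch:
            # Only these six running totals survive between chunks.
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
    if n < 2:
        raise ValueError("need at least two rows")
    cov = (sxy - sx * sy / n) / (n - 1)
    var_x = (sxx - sx * sx / n) / (n - 1)
    var_y = (syy - sy * sy / n) / (n - 1)
    if var_x <= 0 or var_y <= 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov, cov / math.sqrt(var_x * var_y)
```

Because the running totals are plain sums, partial results from different workers can be added together, which is what makes this kind of statistic straightforward to distribute.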
Usability
• Provide an API that feels familiar to PySpark users, enabling a seamless transition.
• Support remote execution, allowing connection to a remote cluster from any location.
• Offer full autoscaling support, eliminating the need for users to specify the number or size of workers manually.
• Enable conversion to Ray dataframe and integration with other libraries such as Dask and Modin.
• Ensure compatibility with notebooks, including features like type hints, schema inference, and autocomplete.
Extensibility
• Provide a Python API that allows users to add new features and customisations to the framework.
• Built on top of Ray, leveraging its integrations with other libraries like TensorFlow and PyTorch.
• Designed with extensibility in mind, allowing for future extensions to support streaming use cases utilising core streaming primitives.
Performance
• Operate in distributed mode, efficiently utilising full resources across a cluster.
• Implement lazy execution, deferring computation until an action is called, optimising resource utilisation.
• Utilise push-based execution to enable parallel execution of tasks, improving overall performance.
• Implement pipelining of operations to optimise data processing workflows.
• Incorporate adaptive query execution, dynamically optimising query plans for the next retry in case of failure, ensuring efficient execution even in the presence of errors.
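The lazy-execution and pipelining requirements above can be sketched with a toy class. `LazyFrame` here is hypothetical and not the API of any real library: transformations only append to a plan, and nothing runs until `collect()` is called, at which point rows flow through chained iterators one at a time.

```python
from __future__ import annotations

from typing import Any, Callable, Iterable


class LazyFrame:
    """Hypothetical toy class: records operations and runs them only on collect()."""

    def __init__(self, rows: Iterable[Any], plan: tuple = ()) -> None:
        self._rows = rows
        self._plan = tuple(plan)

    def filter(self, pred: Callable[[Any], bool], label: str = "filter") -> "LazyFrame":
        # No work happens here; the step is only appended to the plan.
        return LazyFrame(self._rows, self._plan + ((label, "filter", pred),))

    def map(self, fn: Callable[[Any], Any], label: str = "map") -> "LazyFrame":
        return LazyFrame(self._rows, self._plan + ((label, "map", fn),))

    def explain(self) -> str:
        """Return a human-readable description of the recorded plan."""
        steps = ["scan"] + [f"{kind}({label})" for label, kind, _ in self._plan]
        return " -> ".join(steps)

    def collect(self) -> list:
        """Action: chain the steps as iterators so rows are processed in a pipeline."""
        stream = iter(self._rows)
        for _, kind, fn in self._plan:
            stream = filter(fn, stream) if kind == "filter" else map(fn, stream)
        return list(stream)
```

For example, `LazyFrame([1, 2, 3, 4]).filter(lambda x: x % 2 == 0, "even").map(lambda x: x * 2, "double")` does no work until `collect()` is called. Its `explain()` output shows the recorded steps in order.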
Debuggability
• Provide Python-friendly error messages to facilitate easier debugging and troubleshooting.
• Generate explainable query plans that help users understand how the data processing steps are executed.
• Offer actionable errors that guide users on resolving issues encountered during feature engineering tasks.
Sammy Sidhu
04/02/2024, 3:29 AM
jay
04/02/2024, 4:49 PM
Some operations: correlations, covariance (shouldn’t be difficult to add)
• Not a deal breaker