# general
s
This message was deleted.
❤️ 2
🔥 3
👏 1
j
We really appreciate the message! I’m curious if you have a list of requirements? We’d be happy to work on them.
A big part of making those improvements happen was working with the community and users (many of whom are in this channel 😁). Would love to chat as you go through your internal benchmarking and tests so that we can fix anything that pops up.
💯 1
r
Sure @jay, will share the benchmark findings here. We also have a list of requirements; let me check with the team. I will circle back on this.
🔥 2
s
@Rishabh Agarwal Jain Thanks for the kind words! As jay said, would love to hop on a call to learn more! Also, if you're interested, we'd be open to giving a talk at your company if you think it would help adoption 🙂
r
That’s actually a great idea @Sammy Sidhu. A talk from your side will definitely give adoption a boost, and both the DS users and the platform engineers will be able to share their thoughts more openly and directly. I will DM you on this. Thanks for this 🙂
s
Great! Will keep an eye out for your DM! 🙂
a
@Rishabh Agarwal Jain interestingly, it seems we went down a similar path. However, we did not go with RayDP because we didn’t see much support or contribution activity. We chose Daft instead. We are currently onboarding the data science team and so far they are happy with Daft. In our tests it has higher performance and, amazingly, uses fewer resources.
❤️ 1
s
Thanks @Ammar Alrashed!
r
@jay @Sammy Sidhu Sharing the requirements doc that we created for ETL on Ray.
Introduction
ETL is a crucial step in the data preprocessing pipeline, transforming raw data into meaningful features that enhance the performance of machine learning models. When dealing with large datasets, performing feature engineering at scale is essential for efficient data processing and extracting valuable insights.
Components of Feature Engineering:
Feature engineering at scale involves three main types of computations:
• Reads and Writes: Operations for reading data from various sources, such as files, databases, or distributed file systems, and writing the transformed data to the desired destination.
• SQL Queries (super important): Utilizing SQL queries for efficient data manipulation, allowing complex transformations and aggregations for generating informative features.
• Dataframe Operations: Working with tabular data structures, such as dataframes, to efficiently manipulate structured data. This includes operations like filtering, grouping, merging, and other transformations, enabling feature engineering at scale.
Requirements
Core Capabilities
• Read and write data from different sources in Parquet format, including Iceberg, Delta, and Redshift.
• Support for executing SQL queries for data manipulation.
• Support for dataframe operations, including:
  ◦ Statistical computations.
  ◦ SQL operations such as joins, group by, map, and sort.
  ◦ Calculating correlations and covariance among columns.
  ◦ Custom UDFs such as map_with_pandas.
• Fault tolerance to handle failures (including node failures) and ensure robustness.
• GPU support for accelerating computations on compatible hardware.
• Integration with external catalogues such as Glue to access and manage metadata.
• Ability to operate on larger-than-memory datasets by employing techniques like:
  ◦ Pipelining of operations for efficient processing.
  ◦ Disk spilling to manage data that exceeds available memory.
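To make the dataframe-operation requirements above concrete, here is a minimal PySpark sketch of the kind of pipeline we run today (paths and column names are illustrative); whatever engine we adopt on Ray needs an equivalent for each of these steps:
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Reads: raw events stored as Parquet (illustrative path).
events = spark.read.parquet("s3://warehouse/events/")

# SQL queries: complex transformations and aggregations.
events.createOrReplaceTempView("events")
daily = spark.sql("""
    SELECT user_id, event_date, COUNT(*) AS n_events
    FROM events
    GROUP BY user_id, event_date
""")

# Dataframe operations: join, group by, sort.
users = spark.read.parquet("s3://warehouse/users/")
joined = daily.join(users, on="user_id", how="inner")
summary = (
    joined.groupBy("country")
          .agg(F.avg("n_events").alias("avg_events"))
          .orderBy("avg_events")
)

# Statistics: correlation between two columns.
corr = joined.stat.corr("n_events", "account_age_days")

# Custom UDF over pandas batches (the map_with_pandas-style hook we need).
def add_features(batches):
    for pdf in batches:
        pdf["n_events_sq"] = pdf["n_events"] ** 2
        yield pdf

featurized = joined.mapInPandas(
    add_features, schema=joined.schema.add("n_events_sq", "double")
)

# Writes: persist the features back to Parquet.
featurized.write.mode("overwrite").parquet("s3://warehouse/features/")
```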
Usability
• Provide an API familiar to PySpark users, enabling a seamless transition.
• Support remote execution, allowing connection to a remote cluster from any location.
• Offer full autoscaling support, eliminating the need for users to specify the number or size of workers manually.
• Enable conversion to Ray dataframes and integration with other libraries such as Dask and Modin.
• Ensure compatibility with notebooks, including features like type hints, schema inference, and autocomplete.
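For the remote-execution and Ray-interop points above, this is roughly what we would expect the workflow to look like with Daft. A minimal sketch, with the Daft calls (set_runner_ray, to_ray_dataset) written from memory and flagged in the comments, so the exact names and signatures may differ:
```python
import daft  # Daft API names below are from memory and may differ by version

# Point Daft at a remote Ray cluster instead of the local machine
# (assumed call and address parameter; check daft.context for the exact signature).
daft.context.set_runner_ray(address="ray://head-node.internal:10001")

# The query itself is written exactly as it would be for a local run.
df = daft.read_parquet("s3://warehouse/events/")
df = df.where(df["n_events"] > 10)

# Hand the result to Ray-native libraries downstream (assumed conversion API).
ray_ds = df.to_ray_dataset()
```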
Extensibility
• Provide a Python API that allows users to add new features and customisations to the framework.
• Built on top of Ray, leveraging its integrations with other libraries like TensorFlow and PyTorch.
• Designed with extensibility in mind, allowing for future extensions to support streaming use cases utilising core streaming primitives.
Performance
• Operate in distributed mode, efficiently utilising full resources across a cluster.
• Implement lazy execution, deferring computation until an action is called, optimising resource utilisation.
• Utilise push-based execution to enable parallel execution of tasks, improving overall performance.
• Implement pipelining of operations to optimise data processing workflows.
• Incorporate adaptive query execution, dynamically optimising query plans for the next retry in case of failure, ensuring efficient execution even in the presence of errors.
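As an illustration of the lazy-execution requirement, this is the behaviour we rely on in PySpark today (a minimal sketch with an illustrative path): nothing runs until an action is called, which lets the engine optimise the whole plan first.
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# No data is read or shuffled here: these calls only build a logical plan.
plan = (
    spark.read.parquet("s3://warehouse/events/")          # illustrative path
         .filter(F.col("event_date") >= "2023-01-01")
         .groupBy("user_id")
         .agg(F.count("*").alias("n_events"))
)

# Only the action triggers execution, so the engine can optimise the full plan
# (predicate pushdown, pipelined stages) before scheduling any work.
result = plan.collect()
```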
Debuggability
• Provide Python-friendly error messages to facilitate easier debugging and troubleshooting.
• Generate explainable query plans that help users understand how the data processing steps are executed.
• Offer actionable errors that guide users on resolving issues encountered during feature engineering tasks.
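For the explainable-plans point, the PySpark equivalent we are used to looks like this (a minimal sketch with an illustrative path); we would want something comparable from the new engine:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://warehouse/events/")  # illustrative path

# Prints the parsed, analysed, optimised, and physical plans so users can see
# how their feature-engineering steps will actually be executed.
df.filter("n_events > 10").explain(mode="extended")
```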
🔥 3
s
Thanks @Rishabh Agarwal Jain! This is great! Let me know if you're still interested in us giving a talk about daft!
👍 1
j
(Moving our private discussions from last week here as well.) Yes, SQL support is a deal breaker. I have divided my product users into different user personas:
1. Research Scientists: They don’t need it.
2. MLEs: They need it.
3. Data Scientists and Analysts: They need it.
In an enterprise environment, these three personas work together and code is shared across teams. For a given project, there are always 5-10 Spark SQL operations in the ETL pipeline, which cannot be replaced from a user-adoption point of view.
Some operations: correlations, covariance (shouldn’t be difficult to add; the PySpark equivalents are sketched below)
• Not a deal breaker
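For reference, these are the PySpark calls users reach for today (a minimal sketch with made-up columns), so matching their semantics should be enough:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)], ["clicks", "purchases"]
)

# Pearson correlation and sample covariance between two columns.
print(df.stat.corr("clicks", "purchases"))
print(df.stat.cov("clicks", "purchases"))
```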