Zac Steer
09/05/2024, 12:22 PM
Zac Steer
09/05/2024, 12:22 PM
Zac Steer
09/05/2024, 12:22 PM
Zac Steer
09/05/2024, 12:23 PM
Zac Steer
09/05/2024, 12:23 PM
jay
09/05/2024, 7:46 PM
whereas daft aims to be “the best distributed dataframe”. In the long run, I think we just want to be the best offline batch data engine across analytics, data processing (ETL) and ML!
Zac Steer
09/05/2024, 8:18 PM
Zac Steer
09/05/2024, 8:40 PM
jay
09/05/2024, 9:09 PM
Zac Steer
09/05/2024, 11:04 PM
jay
09/05/2024, 11:07 PM
1. df.with_column("data", df["urls"].url.download()).iter_rows() haha. It’s really good 😛
2. 👍
3. For some things, they will always be hamstrung by Spark. For example: requesting a GPU to run your models, stateful initialization of models, etc.
Zac Steer
09/05/2024, 11:55 PM
Kyle
09/06/2024, 12:13 AM
jay
09/06/2024, 12:30 AM
iter_rows will lazily materialize partitions and basically stream results to you.
There isn’t any caching per se, but there is buffering: we hold N result partitions in the buffer (configurable!) and return results to you one row at a time over those partitions. When there is more space in the buffer, we then trigger more compute to fetch another partition.
Kyle
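To make jay's description of the iter_rows buffering concrete, here is a rough single-threaded sketch in plain Python. It is not Daft's implementation, only a toy model of the idea: `iter_rows_buffered`, `partition_tasks`, `buffer_size` and `make_task` are hypothetical names invented for this illustration, and a real engine would presumably compute partitions concurrently rather than inline.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Iterator, List

Row = Dict[str, int]
Partition = List[Row]


def iter_rows_buffered(
    partition_tasks: Iterable[Callable[[], Partition]],
    buffer_size: int = 2,
) -> Iterator[Row]:
    """Yield rows one at a time while holding at most `buffer_size` partitions.

    Each element of `partition_tasks` is a zero-argument callable that computes
    one partition when called (a stand-in for "triggering more compute").
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")

    tasks = iter(partition_tasks)
    buffer: Deque[Partition] = deque()

    def fill() -> None:
        # Compute further partitions only while the buffer has free slots.
        while len(buffer) < buffer_size:
            task = next(tasks, None)
            if task is None:
                return  # no partitions left to compute
            buffer.append(task())

    fill()
    while buffer:
        # Stream the rows of the oldest buffered partition.
        for row in buffer[0]:
            yield row
        # That partition is fully consumed: free its slot, then refill.
        buffer.popleft()
        fill()


# Hypothetical demo: record which partitions have been computed so far.
computed: List[int] = []


def make_task(i: int) -> Callable[[], Partition]:
    def task() -> Partition:
        computed.append(i)
        return [{"partition": i, "row": j} for j in range(2)]

    return task


rows = iter_rows_buffered((make_task(i) for i in range(4)), buffer_size=2)
first_row = next(rows)
# Only the first two partitions have been computed at this point.
print(first_row, computed)
```

Nothing is computed until the first row is requested, and a new partition is only computed once an older one has been fully consumed and its slot in the buffer is free.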
09/06/2024, 12:31 AM
Zac Steer
09/15/2024, 7:18 PM
jay
09/15/2024, 7:23 PM