# daft-dev
i
We just did a very quick and very dirty proof of concept consisting of creating a Daft dataframe, adding a column to it computed with DuckDB, then doing further transformations on it from Daft, with Daft running on top of Ray. Everything is done with zero copy through Arrow, and we did the integration directly from Daft, without having to go down to Ray. Next, we will do this integration through Ibis, writing an Ibis expression instead of a SQL query for the column produced by DuckDB. After that, we'll bind Ray as an Ibis backend, implementing the full pipeline as an Ibis pipeline, with a nested Ibis expression for the computed column (two totally separate Ibis execution contexts). We'll also do that with one column computed with DuckDB and another computed with Polars. Of course, all this is a bit of a kludge. A better approach would be to do all that integration within Daft itself, but that's something that we should probably design together.
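For reference, a minimal sketch of that first round-trip (column names and data are illustrative, the Ray runner setup is omitted, and the exact calls we used may differ slightly):

```python
# Sketch of the PoC flow: Daft -> Arrow -> DuckDB (computed column) -> Arrow -> Daft.
# Column names/data are illustrative; Ray runner setup is omitted.
import daft
import duckdb

# 1. Build a Daft dataframe.
df = daft.from_pydict({"id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})

# 2. Hand the data to DuckDB as Arrow (zero copy) and compute an extra column in SQL.
arrow_table = df.to_arrow()
enriched = duckdb.sql(
    "SELECT *, price * 1.2 AS price_with_tax FROM arrow_table"
).arrow()

# 3. Come back into Daft and keep transforming with Daft expressions.
df2 = daft.from_arrow(enriched).with_column(
    "expensive", daft.col("price_with_tax") > 25
)
df2.show()
```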
❤️ 1
j
Very cool work! 😍 We could absolutely start with a Python-side UDF-based integration like what you currently have, and polars/duckdb could be optional Python dependencies on Daft, maybe installed via `getdaft[ibis]` or something similar. Having it natively integrated into Daft in Rust could be nice to avoid overheads from the GIL preventing parallelism, which would only be an issue for the Daft+Python multithreaded runner. cc @Sammy Sidhu
p
Hey guys, joining the discussion here, as I am working closely with @Ismael Ghalimi to develop all these proofs of concept. First of all, Daft is a super powerful project; we very much appreciate all the effort going into it, as well as the fact that you are making it accessible to the whole data community through your permissive license. Kudos! The `getdaft[ibis]` flavor of installation sounds intriguing. Yet, we should keep in mind that `ibis-framework` itself makes extensive use of extras; that is, if I wanted to add Ibis with the Polars and DuckDB backends to my project, I would need to `pip install 'ibis-framework[duckdb,polars]'`. In other words, we might want to consider a pip extras structure like `getdaft[ibis,duckdb,polars]`, where the actual backends are listed explicitly by the user so that they remain in full, fine-grained control of their `site-packages`. It would be great to stay closely connected to ensure that we are aligned on approaches at the design level, so that ultimately we can contribute back to Daft as much as we can 😁
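Just to illustrate the shape of it, such an extras layout could look something like the sketch below (purely hypothetical; the extra names and dependency lists are illustrative, not a proposal for Daft's actual packaging):

```python
# Hypothetical sketch of an extras layout in getdaft's setup.py.
# Extra names and dependency lists are illustrative only.
from setuptools import setup

setup(
    name="getdaft",
    # ... the rest of Daft's existing packaging metadata ...
    extras_require={
        "ibis": ["ibis-framework"],
        "duckdb": ["duckdb"],
        "polars": ["polars"],
    },
)
```

Users would then opt in explicitly, e.g. `pip install "getdaft[ibis,duckdb,polars]"`, mirroring how `ibis-framework` handles its own backends.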
👍 1
j
Welcome @Peter! Yeah woof that sounds like a ton of dependencies. Had some brief discussion with the rest of the team today. A couple of points that came up:
1. Using kernels from external deps like polars/duckdb will likely just be a temporary measure to unblock functionality until we can build out our own kernels. We'd love to have control over concurrency and memory, which might get weird if we use another framework under the hood that potentially does its own concurrency/memory management as well.
2. We're actually fairly bullish about most of the Ibis scalar operations being fairly easy to implement without needing a polars/duckdb dependency. Most of the scalar ops likely delegate to something like Rust's `std::string::String` methods, Rust's `regex` crate, or similar 🙂; there are many Rust crates that will be super helpful here. On our end, something we could do better is to redesign our expressions API to make it easier to extend (https://github.com/Eventual-Inc/Daft/issues/1806). Happy to chat more about what that could look like! We think that with some strategic use of macros, we can get into better shape for knocking out a ton of these scalar ops using just Rust crates.
3. The hard part is actually making sure those scalar operations work well, with good test coverage and documentation. Something that can be done pretty effectively here is to write good test harnesses and fixtures to compare our behavior with the expected Ibis behavior on other backends. Conceptually this could look something like:
```python
import pytest

class ScalarOpTestCase:
    duckdb_op: ...   # reference op, run through DuckDB
    daft_expr: ...   # equivalent Daft expression under test

    @pytest.mark.parametrize(...)  # parametrized over input fixtures
    def test_scalar_op(self):
        expected = run_duckdb_op(self.duckdb_op)
        result = run_daft_expr(self.daft_expr)
        assert expected == result
```
Once we have solid test-cases, writing the code to link up the necessary functionality should be fairly mechanical. Would love to collaborate closely!
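To make that a bit more concrete, here is a rough sketch of one such comparison test, using Ibis on the DuckDB backend as the reference and a string-length kernel as the op under test (the op, data, and exact Daft expression methods are illustrative and may differ from what ships today):

```python
# Rough sketch: compare a Daft scalar op against the same op run through Ibis/DuckDB.
# Op, data, and the assumed Daft expression method are illustrative; a real harness
# would be parametrized over many ops and fixture tables.
import daft
import ibis
import pytest

CASES = [
    {"s": ["hello", "daft", ""]},
    {"s": ["a", "bb", "ccc"]},
]

@pytest.mark.parametrize("data", CASES)
def test_str_length_matches_duckdb(data):
    # Reference behavior: the Ibis expression executed on the DuckDB backend.
    con = ibis.duckdb.connect()
    expected = con.execute(ibis.memtable(data).s.length()).tolist()

    # Behavior under test: the (assumed) equivalent Daft expression.
    result = daft.from_pydict(data).select(daft.col("s").str.length()).to_pydict()["s"]

    assert list(result) == list(expected)
```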
👏 1
i
@jay For most scalar functions related to things like numerical, temporal, or textual values, I totally agree with you that native support is the best option. That being said, I would not underestimate the effort involved, especially with respect to simply defining the API. There, I would recommend that you embrace the Ibis API as aggressively as possible. It's not better than any other API, but it's the closest thing to a lingua franca that I can think of. And I recommend that you allow yourselves to change the API that you have right now in order to make it as close to the Ibis API as you can.

The other thing to consider is APIs like geospatial. There, you positively cannot re-implement the API easily. You could integrate the same dependencies that DuckDB has integrated with, but that's a lot of work, and they're in C++. You won't find any Rust equivalent anytime soon. Now that Overture Maps has been released with an open source license (they sponsored DuckDB's work), the need for good geospatial APIs is exploding. Eventually, you'll face the same problem with graph structures, but the challenge will be an order of magnitude greater.

Yet another thing is window functions. There, you have a ton of work ahead of you to do it right, especially if you want to distribute them across partitions that are misaligned relative to the grouping dimensions. But if you keep things aligned, you can get these functions for free from DuckDB and Polars (a small sketch of this idea follows at the end of this message).

Finally, the multi-engine architecture is something that you cannot avoid if you want to provide support for GPUs. Now that cuDF provides full support for Pandas, I like to believe that you really want to benefit from all this goodness. And this GPU support will become absolutely critical when you decide to add support for graph structures.

In other words, I would really recommend that you take a step back, look at the full picture, and decide whether you really want a monolithic engine that does it all in Rust, or whether you want to go down the composable path. If you go for the former, you won't have rich datatypes (graph, geospatial, temporal) for 3 to 5 years. If you go for the latter, you could get them by the summer. Today, there is an over-representation of AI workloads related to images and movies, but in the future, graph, geospatial, and temporal data will become more and more critical, and the main value of Daft will be to support them all from a single platform. Furthermore, I don't believe that Daft's value is to be found in totally unstructured content like images or movies. These don't really need to live in the database. But graphs, geospatial geometries, and time series really do. This is where Daft can really shine and make a huge difference. One way or another, we'll support the project. It is the best foundation for what we need.

Regarding the unit testing platform, I recommend that you do it against DuckDB and Polars through Ibis. Together, they cover most features defined by Ibis, and provide more coverage than any other backend. They're also very easy to import. And you might want to import Ibis' unit tests for these backends. That way, you don't have to build and maintain the collection yourself. Unlike any other backend, you're in a fantastic position right now, in the sense that you have very few scalar functions yet, so you can afford to do things the best possible way from day one.
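To illustrate the windowing point above, here is a small sketch of the idea, independent of Daft's internals (names and data are illustrative): if every partition already holds all rows for its grouping keys, a window function evaluated per partition through DuckDB over Arrow data gives the same answer as a global evaluation.

```python
# Sketch of "windowing for free per partition": if a partition holds all rows for
# its grouping keys, evaluating the window function on that partition alone is
# globally correct. Names and data are illustrative.
import duckdb
import pyarrow as pa

def add_running_total(part_tbl: pa.Table) -> pa.Table:
    # Zero-copy hand-off to DuckDB via Arrow, window function in SQL, Arrow back out.
    return duckdb.sql(
        "SELECT *, SUM(amount) OVER (PARTITION BY user_id ORDER BY ts) AS running_total "
        "FROM part_tbl"
    ).arrow()

# One partition that is aligned with the grouping key (all rows for users 1 and 2).
part = pa.table({"user_id": [1, 1, 2], "ts": [1, 2, 1], "amount": [10, 5, 7]})
print(add_running_total(part))
```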
👍 1
j
Thanks, yes that adds a lot more color. To summarize, I added ✅ on points where I think we agree, and ❓ on points that we would love to explore further with you to understand the use-cases and landscape of tooling.

✅ Embracing the Ibis API (and we really appreciate your expertise in this area as well!)
✅ Scalar numerical/temporal/text functions are easy to implement in Daft using native support
❓ Functions for more complex modalities of data (geospatial, graphs, windowed data) would be more difficult to add, and this is where Daft may want to consider a dependency on other libraries which already have functionality around this. Breaking this down further:
• My suspicion here is that most scalar `[per-row]` functionality (e.g. euclidean distance in geospatial) isn't difficult to add natively either
• However, working with these complex types may require `[per-partition]` or `[per-dataframe]` functionality
◦ `[per-partition]`: It could make sense to have 3rd-party dependencies that can do this on our `MicroPartition` abstraction
◦ `[per-dataframe]`: We will have to implement many ourselves because they require global operations (e.g. windowing, outer joins, etc.)
✅ Testing via Ibis and DuckDB/Polars: this makes a lot of sense, and is maybe where we should start investing effort early to set the foundation for Ibis feature coverage
i
I love all the ✅s! Regarding the one ❓, I would not underestimate the work required on the geospatial side. Just adding support for the geospatial datatypes, binding the dozens of functions, and adding support for the 50+ file formats took a super-talented DuckDB engineer (we know him really well) many, many months, tens of thousands of lines of code, and modifications to almost every layer of the DuckDB architecture. This was a massive undertaking...
j
Absolutely 👍 yeah we’re aware geospatial is its own beast… @Clark Zinzow used to work at Descartes Labs. Many other libraries end up having some geospatial variant or extension (e.g. geopandas/PostGIS). It’s a big chunk of effort.
👍 1
k
@jay Strongly agree that you should keep Daft as free as possible from other libraries, and not let faster time-to-market lure you into rushed architectural decisions that you might regret later... Of course interoperability is a must, but one should be careful. Maybe something like: https://arrow.apache.org/docs/python/interchange_protocol.html
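For instance, a small sketch of what that protocol buys you (pandas and pyarrow are just illustrative producer/consumer here): any library exposing `__dataframe__` can be consumed without a hard dependency on its internals.

```python
# Small sketch of the dataframe interchange protocol linked above (illustrative only).
import pandas as pd
import pyarrow.interchange

src = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# pyarrow consumes src purely through src.__dataframe__(), not through pandas APIs.
tbl = pyarrow.interchange.from_dataframe(src)
print(tbl)
```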
🙌 1