Slackbot
03/11/2024, 12:45 AM
Ismael Ghalimi
03/11/2024, 6:58 PM
jay
03/11/2024, 7:41 PM
getdaft[ibis] or something similar.
Having it natively integrated into Daft in Rust could be nice to avoid overhead from the GIL preventing parallelism, which would only be an issue for the Daft + Python multithreaded runner.
cc @Sammy Sidhu
Peter
03/12/2024, 4:42 AM
getdaft[ibis] flavor of installation sounds intriguing. Yet, we should keep in mind that ibis-framework itself makes use of extras extensively. That is, if I wanted to add Ibis with the Polars and DuckDB backends to my project, I would need to pip install ibis-framework[duckdb,polars].
In other words, we might want to consider a pip extras structure like getdaft[ibis,duckdb,polars], where the actual backends are listed explicitly by the user so that they can remain in full, fine-grained control of their site-packages.
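A minimal sketch of how such an extras structure could be modelled, assuming a setuptools-style mapping from extra name to requirement list. The `EXTRAS_REQUIRE` table and the `requirements_for` helper are hypothetical illustrations, not Daft's actual packaging metadata; only the `ibis-framework[duckdb]` and `ibis-framework[polars]` requirement strings come from the message above.

```python
from typing import Dict, Iterable, List

# Hypothetical extras table (an assumption for illustration only):
# each getdaft extra forwards to the matching ibis-framework extra.
EXTRAS_REQUIRE: Dict[str, List[str]] = {
    "ibis": ["ibis-framework"],
    "duckdb": ["ibis-framework[duckdb]"],
    "polars": ["ibis-framework[polars]"],
}


def requirements_for(extras: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated requirements pulled in by the given extras."""
    selected = set()
    for extra in extras:
        if extra not in EXTRAS_REQUIRE:
            raise KeyError(f"unknown extra: {extra!r}")
        selected.update(EXTRAS_REQUIRE[extra])
    return sorted(selected)
```

With this layout, asking for `["ibis", "duckdb", "polars"]` would pull in only the backends the user named, which is the explicit control described above.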
It would be great to stay closely connected to ensure that we are aligned on approaches on the design level, so that ultimately we can contribute back to Daft as much as we can 😁
jay
03/12/2024, 5:04 AM
std::string::String methods, Rust’s regex crate or similar 🙂. There are also many Rust crates that will be super helpful here. On our end, something we could do better is to redesign our expressions API to make it easier to extend (https://github.com/Eventual-Inc/Daft/issues/1806). Happy to chat more about what that could look like! We think that with some strategic usage of macros, we might be able to get into better shape for knocking out a ton of these scalar ops using just Rust crates.
3. The hard part is actually making sure those scalar operations work well, with good test coverage and documentation. Something that can be done pretty effectively here could be to write good test harnesses and fixtures to compare our behavior with the expected Ibis behavior on other backends. Conceptually this could look something like:
class ScalarOpTestCase:
    duckdb_op: ...
    daft_expr: ...

    @pytest.mark.parametrize(...)
    def test_scalar_op(self):
        expected = run_duckdb_op(self.duckdb_op)
        result = run_daft_expr(self.daft_expr)
        assert expected == result
Once we have solid test-cases, writing the code to link up the necessary functionality should be fairly mechanical.
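To make the harness idea concrete, here is a small self-contained sketch of the same pattern. It is only an illustration under stated assumptions: plain Python callables stand in for both the DuckDB-backed Ibis op and the Daft expression, and the names `reference_op`, `candidate_op`, `run_case` and `CASES` are hypothetical, not part of Daft, Ibis or DuckDB.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass(frozen=True)
class ScalarOpTestCase:
    """One scalar op expressed twice: reference backend vs. backend under test."""

    name: str
    reference_op: Callable[[Any], Any]  # stand-in for the DuckDB-backed Ibis op
    candidate_op: Callable[[Any], Any]  # stand-in for the Daft expression
    inputs: Sequence[Any]


def run_case(case: ScalarOpTestCase) -> List[str]:
    """Return readable mismatch messages; an empty list means the case passed."""
    mismatches = []
    for value in case.inputs:
        expected = case.reference_op(value)
        result = case.candidate_op(value)
        if expected != result:
            mismatches.append(
                f"{case.name}({value!r}): expected {expected!r}, got {result!r}"
            )
    return mismatches


# Hypothetical fixtures: pure-Python callables play the role of both backends.
CASES = [
    ScalarOpTestCase("upper", str.upper, lambda s: s.upper(), ["daft", "Ibis", ""]),
    ScalarOpTestCase("length", len, lambda s: sum(1 for _ in s), ["daft", ""]),
    # Deliberately divergent: the candidate strips trailing whitespace only.
    ScalarOpTestCase("strip", str.strip, str.rstrip, [" a ", "b"]),
]

FAILURES = {case.name: run_case(case) for case in CASES}
```

In a real suite the two callables would presumably execute the op against DuckDB and Daft respectively, and `CASES` could feed a parametrized pytest test, so that each new scalar op costs one extra fixture entry.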
Would love to collaborate closely!
Ismael Ghalimi
03/12/2024, 1:48 PM
jay
03/12/2024, 6:21 PM
[per-row] functionality here (e.g. Euclidean distance in geospatial) isn’t difficult to add natively either
• However, working with these complex types may require [per-partition] or [per-dataframe] functionality
◦ `[per-partition]`: It could make sense to have 3rd-party dependencies that can do this on our MicroPartition abstraction
◦ `[per-dataframe]`: We will have to implement many ourselves because it requires global operations (e.g. windowing, outer joins, etc.)
✅ Testing via Ibis and DuckDB/Polars: this makes a lot of sense, and is maybe where we should start investing effort early to set the foundation for Ibis feature coverage
Ismael Ghalimi
03/12/2024, 6:24 PM
jay
03/12/2024, 6:26 PM
Kiril Aleksovski
04/04/2024, 7:39 PM