# general
j
General question here - can anyone recommend any particularly good books/blogs etc. on Distributed Data (pretty generic, I know)? Looking to deepen my knowledge and understanding of how large-scale distributed services work behind the scenes and the issues they encounter. Also interested in understanding why there seems to be such a significant push toward Rust in a large number of these open source projects (Apache Ballista, for example).
j
Books:
1. DDIA (Designing Data-Intensive Applications)
2. https://andygrove.io/how-query-engines-work/
3. @Clark Zinzow probably has some recs 😛
As for why Rust is getting popular… My hypotheses:
• Rust + Python is incredible (really good language bindings)
• High learning curve, but I would argue it's actually easier to pick up than C++ if coming from Python
• Performance + memory safety = important for data processing
a
I know you asked for books but Andy Pavlo’s courses are my go-to…
All courses - https://db.cs.cmu.edu/courses/
Latest iteration of AdvancedDB - https://15721.courses.cs.cmu.edu/spring2024/schedule.html
Recorded lectures - https://www.youtube.com/@CMUDatabaseGroup
❤️ 2
j
Courses are great too - thanks for sharing!
@jay When you say really good language bindings - what specifically makes the Python/Rust bindings so much better than those for other languages?
j
Here’s the crate that binds Python + Rust: https://pyo3.rs/v0.21.2/ Some things that stand out to me:
• The mapping of Rust types to Python types is really clean. Rust `Vec<i64>` can even map cleanly to numpy arrays in Python (via the companion rust-numpy crate) 🤯
• The ergonomics around lifetimes and the lifetime of acquiring the GIL/making calls to the Python interpreter are really nice. This makes calling Python code from Rust much safer (if you have the `py` object, then you have the GIL for the duration of that object’s lifetime).
• The syntax and learning curve just felt way better than what I used to have to do in pybind for C++
👍 1
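To make the `py` point above concrete, here is a rough, std-only sketch of the token pattern. It is not PyO3's actual implementation: `FAKE_GIL`, `Token`, `with_gil`, `call_interpreter`, and `acquisitions` are all hypothetical stand-ins, with a plain `Mutex` playing the role of the interpreter lock. The idea it illustrates is that a function can demand a zero-sized token as proof the lock is held, and the borrow checker keeps that token from escaping the locked region.

```rust
use std::marker::PhantomData;
use std::sync::Mutex;

// Hypothetical stand-in for the interpreter lock (NOT PyO3's real GIL handling).
// The u64 counts how many times the lock has been acquired.
static FAKE_GIL: Mutex<u64> = Mutex::new(0);

/// Zero-sized proof that the lock is held for the lifetime `'py`.
/// Hypothetical analogue of the `py` object mentioned in the message.
#[derive(Clone, Copy)]
struct Token<'py>(PhantomData<&'py ()>);

/// Acquires the lock, hands the closure a token, and releases the lock
/// when the closure returns. Because `R` is chosen outside the `for<'py>`
/// binder, the result type cannot mention `'py`, so the token cannot be
/// returned from the closure.
fn with_gil<F, R>(f: F) -> R
where
    F: for<'py> FnOnce(Token<'py>) -> R,
{
    let mut guard = FAKE_GIL.lock().unwrap();
    *guard += 1;
    // `guard` stays alive until the end of this function, so the lock is
    // held for the whole duration of the closure call.
    f(Token(PhantomData))
}

/// Any function that needs the lock asks for a token; without one it
/// simply cannot be called. Here it just sums the values as a placeholder
/// for "talking to the interpreter".
fn call_interpreter(_py: Token<'_>, values: &[i64]) -> i64 {
    values.iter().sum()
}

/// Number of times the lock has been taken so far.
/// Must be called outside `with_gil`, since the mutex is not reentrant.
fn acquisitions() -> u64 {
    *FAKE_GIL.lock().unwrap()
}
```

With these definitions, `with_gil(|py| call_interpreter(py, &[1, 2, 3]))` evaluates to 6, while trying to smuggle the token out with `with_gil(|py| py)` is rejected at compile time.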
c
@Jake Waller +1 to @jay's and @Amogh Akshintala's links! In addition, I'd recommend:
• Database Internals: A Deep Dive into How Distributed Data Systems Work - great book on the storage data structures and distributed systems considerations behind distributed data systems (website, Amazon link)
• Principles of Distributed Database Systems - great overview of many different aspects of distributed database systems, with a good bibliography of relevant papers (Springer link)
• There are also loads of great papers linked to by Andy Pavlo's class that I would recommend! E.g. the papers for Spark, SparkSQL, Velox, Snowflake, F1, Morsel-driven parallelism, etc.
👍 2
j
Thanks Clark! Will add all of these to my reading list