Distributed Data Community #daft-dev

Rushikesh Padia

07/17/2024, 11:57 PM

Hello Team, I'm currently looking into migrating my PySpark job to Daft and evaluating whether it fits my use case. I've gone through the docs on the website but didn't find any detailed resource on the Daft query optimizer or join algorithms. Is there a detailed research paper on Daft, or any other publication on its internals? I've also gone through the benchmark results and they look really amazing, but I couldn't find an explanation of why Daft is faster than Spark. Is there any trade-off? I would appreciate any resources on this. Thanks in advance!

jay

07/18/2024, 1:30 AM

Hi @Rushikesh Padia! We don’t currently have any resources on our query optimizer or join algorithms. Happy to answer any questions you may have here though!
• The types of joins we support are listed under the strategy keyword arg: https://www.getdaft.io/projects/docs/en/latest/api_docs/doc_gen/dataframe_methods/daft.DataFrame.join.html

Daft’s speedups over Spark come from:
• Vectorized execution
• Much faster/optimized reads from cloud storage (specifically AWS S3), written in async Rust
• Lower overhead wrt JVM

jay

07/18/2024, 1:30 AM

WRT trade-offs, you’ll find that Spark has more functionality (e.g. SQL support and a larger set of functions), but we’re constantly adding to Daft’s suite of functions 😄

Rushikesh Padia

07/18/2024, 6:46 PM

Thanks Jay! 😄
