A couple more questions:
1. How will distributed mode work? Do we need to do anything to make it work in a distributed setup?
2. How should larger data sets be handled? I am running in a Jupyter notebook, but the kernel keeps restarting (running with 2 vCPUs and 8 GB of RAM).
3. Does this use DataFusion?