Vincent Gromakowski (05/21/2024, 6:45 PM):
jay (05/21/2024, 6:45 PM):
Vincent Gromakowski (05/21/2024, 6:46 PM):
jay (05/21/2024, 8:17 PM):
Vincent Gromakowski (05/21/2024, 8:19 PM):
jay (05/21/2024, 8:22 PM):
Give the "Enforcing Fine Grained Access Control with Spark and Iceberg at AWS" talk a watch when it becomes available!
jay (05/21/2024, 8:23 PM):
> allowing user code and potential adversaries to easily access the memory space of the compute nodes
Yeah, this was our primary question for the LinkedIn speaker, because technically speaking the compute node still has access to their raw data in their approach. His response was that their primary concern was around policy enforcement rather than true "security", so they could afford this.
Kiril Aleksovski (05/21/2024, 8:35 PM):
jay (05/21/2024, 11:41 PM):
Vincent Gromakowski (05/22/2024, 7:07 AM):
Amogh Akshintala (05/22/2024, 11:02 AM):
> the model of entrusting the query engine to enforce filtering does not work well on engines like Apache Spark that are designed to run arbitrary procedural code directly within the query engine worker processes
This is the entire reason for the Spark Connect project, right? Isolate user code from the system entirely…
> Honestly this might be a good case for a managed Daft data warehouse, where the storage service can take care of both optimized high-throughput I/O as well as data masking/filtering, using existing components of Daft to execute the I/O
This is likely the best way to proceed IMO. The filters would be available on the table definition in the catalog, and then you apply them via the managed I/O subsystem… If you do decide to implement a managed service, I'd recommend never offering a "single-user cluster" deployment model (where a single user has root access on the cluster). That mode is essentially an open can of worms and will make implementing anything security- and isolation-related much, much harder… Isolate everything user-provided from the start, and that way you never need to redesign the whole thing later…
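As a rough illustration of the Spark Connect point above: with Spark Connect the client builds DataFrame operations locally and ships only an unresolved logical plan to the server over gRPC, so this client process never lives inside the driver. This is a minimal sketch, not anything from the thread; the endpoint and table names are placeholders.

```python
# Minimal sketch (placeholder endpoint and table name): the DataFrame calls below
# are serialized into a logical plan and sent over gRPC to a Spark Connect server;
# the client process is decoupled from the driver/worker JVMs.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .remote("sc://spark-connect-endpoint:15002")  # placeholder Spark Connect URL
    .getOrCreate()
)

events = (
    spark.read.table("iceberg_catalog.db.events")  # placeholder Iceberg table
    .where(col("region") == "eu-west-1")
    .select("user_id", "event_type")
)
events.show()
```

Note that UDFs registered from the client still execute on the workers, so this isolates the client from the driver rather than removing user code from the cluster entirely.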
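And a purely hypothetical sketch of the managed I/O idea discussed above: row filters and column masks attached to the table definition in the catalog get applied inside the managed read path, before anything reaches user code. The policy shape and the read_table helper are invented for illustration; they are not an existing Daft or catalog API.

```python
# Hypothetical sketch: the policy dict stands in for filters/masks stored on the
# table definition in a catalog; read_table() is an imaginary managed I/O entry
# point that enforces them with ordinary Daft operations before returning data.
import daft
from daft import col, lit

policy = {
    "row_filter": col("region") == "eu-west-1",             # row-level filter
    "allowed_columns": ["user_id", "event_type", "email"],   # column-level pruning
    "masked_columns": {"email": lit("***redacted***")},      # column masking
}

def read_table(path: str, policy: dict) -> daft.DataFrame:
    """Read a table and enforce the catalog-defined policy inside the I/O layer."""
    df = daft.read_parquet(path)
    df = df.where(policy["row_filter"])             # drop rows the user may not see
    df = df.select(*policy["allowed_columns"])      # drop columns the user may not see
    for name, replacement in policy["masked_columns"].items():
        df = df.with_column(name, replacement)      # overwrite sensitive values
    return df

# The caller only ever receives the filtered/masked DataFrame.
df = read_table("s3://bucket/events/", policy)
```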
Vincent Gromakowski (05/22/2024, 11:57 AM):
Amogh Akshintala (05/22/2024, 12:01 PM):
Vincent Gromakowski (05/22/2024, 12:06 PM):
Amogh Akshintala (05/22/2024, 12:15 PM):
jay (05/22/2024, 4:11 PM):
jay (05/22/2024, 4:13 PM):
Amogh Akshintala (05/22/2024, 4:31 PM):
jay (05/22/2024, 6:43 PM):
Vincent Gromakowski (05/22/2024, 6:56 PM):
Amogh Akshintala (05/22/2024, 7:41 PM):
Menno Hamburg (05/30/2024, 10:10 PM):
jay (05/30/2024, 10:50 PM):
> Which means that when spilling, the data is already filtered
Correct, spilling occurs on intermediate materialized outputs (e.g. at a shuffle boundary before a global sort). In Daft our steps are fused/pipelined, which means that
Scan -> Filter -> Project
is fused into a single step, and spilling is applied only at the output of that fused step.
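To make the fusing concrete, a small sketch with placeholder paths and column names: the scan, filter, and projection below would be pipelined into one step, while the global sort introduces the shuffle boundary whose materialized output is where spilling can kick in.

```python
# Sketch (placeholder path/columns): Scan -> Filter -> Project is fused into a
# single pipelined step; the global sort is a shuffle boundary, and its
# materialized intermediate output is where spilling would apply.
import daft
from daft import col

df = (
    daft.read_parquet("s3://bucket/events/")   # Scan
    .where(col("event_type") == "click")       # Filter (fused with the scan)
    .select("user_id", "ts")                   # Project (also fused)
    .sort("ts")                                # global sort -> shuffle boundary
)

df.explain(show_all=True)  # inspect how the planner groups these steps
```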