# daft-dev
v
Is there any plan to implement fine-grained access control in Daft? It’s a hot topic in the analytics market right now and implementation can be tricky; it would be good to think about it early so we’re not blocked later...
🎉 1
j
Hi @Vincent Gromakowski — do you have examples of implementations here? Are you talking about things like row/column-level access control?
v
yes column and row
j
I think most of the burden is placed on the storage/catalog rather than the query engine. i.e. Daft by itself probably won’t be able to do this; we’d need to integrate with a system that performs this for us (I think). I’m curious how you handle it with today’s tooling!
v
CLS and RLS can only happen in the compute engine, because you need to read the data and then filter it. Look at this session for details on how we do it at AWS with Spark: https://app.events.ringcentral.com/events/iceberg-summit/replay/cut_74084c10-055e-470f-957d-3e3161d21149
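To make the read-then-filter point concrete, here is a minimal plain-Python sketch (not Daft code; the policy shapes are hypothetical): row-level security (RLS) drops rows after the read, and column-level security (CLS) projects away restricted columns. Note the engine necessarily sees the raw data first.

```python
# Conceptual sketch (plain Python, not Daft): the engine reads raw data,
# then applies RLS and CLS before results reach the user. Policies here
# are hypothetical illustrations.

def apply_rls(rows, predicate):
    """Row-level security: keep only rows the predicate allows."""
    return [row for row in rows if predicate(row)]

def apply_cls(rows, allowed_columns):
    """Column-level security: drop columns the user may not see."""
    return [{k: v for k, v in row.items() if k in allowed_columns}
            for row in rows]

# Data as it comes off storage -- the compute engine sees all of it.
raw = [
    {"user": "alice", "region": "EU", "ssn": "123-45-6789", "amount": 10},
    {"user": "bob",   "region": "US", "ssn": "987-65-4321", "amount": 20},
]

# Policy for an analyst restricted to EU rows, with no access to `ssn`.
visible = apply_cls(apply_rls(raw, lambda r: r["region"] == "EU"),
                    allowed_columns={"user", "region", "amount"})
print(visible)  # [{'user': 'alice', 'region': 'EU', 'amount': 10}]
```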
j
^ Yes, but we’d have to interact with a system that provides information about those filters. At the LinkedIn Big Data meetup where we gave a talk, one of the LinkedIn speakers described how they “shimmed” table access at the query engine level to hit a view instead. Those views have filters/column-pruning applied on them! I’ll give the “Enforcing Fine Grained Access Control with Spark and Iceberg at AWS” talk a watch when it becomes available!
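A toy sketch of that “shim” idea (names are purely illustrative, not LinkedIn’s actual system): when a query references a table, the access layer silently resolves it to a policy view whose definition already bakes in the row filters and column pruning.

```python
# Hypothetical table-to-view shim. Names are illustrative only.

POLICY_VIEWS = {
    # table name -> secure view the user is actually allowed to read
    "sales.transactions": "sales.transactions_eu_only_masked",
}

def resolve_table(name: str) -> str:
    """Return the policy view for a table, or the table itself if unrestricted."""
    return POLICY_VIEWS.get(name, name)

print(resolve_table("sales.transactions"))  # sales.transactions_eu_only_masked
print(resolve_table("sales.public_stats"))  # sales.public_stats
```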
> allowing user code and potential adversaries to easily access the memory space of the compute nodes
Yeah, this was our primary question for the LinkedIn speaker, because technically speaking the compute node still has access to the raw data in their approach. His response was that their primary concern was around policy enforcement rather than true “security”, so they could afford this.
k
Agree with @jay here, the query engine shouldn't be entrusted with enforcing security. Here is an excerpt from a recent Google paper on BigLake (“BigLake: BigQuery’s Evolution toward a Multi-Cloud Lakehouse”, SIGMOD-Companion ’24, June 9–15, 2024, Santiago, Chile): https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/b6e4bce752acf2f0eb27cf56bcca6ffdfc3db780.pdf
> "The current status quo in open-source analytics places the responsibility of enforcing the fine-grained access controls with the query engines. This leads to two downsides: (1) security policies such as data masking and row-level filtering are tied to a SQL dialect, requiring duplication of governance policies across multiple query engines (2) the model of entrusting the query engine to enforce filtering do not work well on engines like Apache Spark that are designed to run arbitrary procedural code directly within the query engine worker processes."
j
^ The biggest issue with the above, though, is that query engines then have to go through a central bottleneck of a service that controls I/O. Traditionally these engines have direct access to the raw data (S3, HDFS), and for Daft at least that’s a big win because we have our own optimized I/O. Honestly this might be a good case for a managed Daft data warehouse, where the storage service takes care of both optimized high-throughput I/O and data masking/filtering, using existing components of Daft to execute the I/O.
👍 1
v
There is no ideal solution, but using a data access layer that enforces the filtering is not efficient at all:
• It adds extra hops and the associated performance (and sustainability) challenges (data transfer, latency, scalability....). You now have 2 computes: your compute engine and the data access compute. In BQ’s case, it’s the Read API.
• It adds extra costs. You always need to pay for compute to access the data, even if your engine could read it directly. Additionally, your compute is “waiting” for the other compute to read the data, meaning you are wasting resources during the I/O phase. In BQ’s case, you pay for BQ slots.
• It breaks some performance optimizations in your engine, because your engine is no longer aware of the predicates used by the data access layer. For example, no more filter derivation in a join based on the row-level security filtering.
a
> the model of entrusting the query engine to enforce filtering do not work well on engines like Apache Spark that are designed to run arbitrary procedural code directly within the query engine worker processes
This is the entire reason for the SparkConnect project right? Isolate user code from the system entirely…
> Honestly this might be a good case for a managed Daft data warehouse, where the storage service can take care of both optimized high throughput I/O as well as data masking/filtering, using existing components of Daft to execute the I/O
This is likely the best way to proceed IMO. The filters would be available on the table definition in the catalog and then you apply them via the managed I/O subsystem… If you do decide to implement a managed service, I’d recommend never offering a “single-user cluster” deployment model (where a single user has root access on the cluster). This mode is essentially an open can of worms and will make implementing anything security and isolation related much much harder… Isolate everything user-provided from the start and that way you never need to redesign the whole thing later…
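A rough sketch of that flow (the catalog schema here is hypothetical, not an actual Daft or catalog API): the table definition carries its RLS predicate and allowed columns, and the managed I/O layer enforces both during the scan, before any bytes reach user code.

```python
# Hypothetical catalog-driven enforcement during a managed scan.
from dataclasses import dataclass, field

@dataclass
class TableDef:
    path: str
    rls_predicate: callable                 # row -> bool, stored in the catalog
    allowed_columns: set = field(default_factory=set)

def managed_scan(table, read_fn):
    """Read via the managed I/O layer, enforcing the catalog's policy."""
    for row in read_fn(table.path):
        if table.rls_predicate(row):         # row-level filter at scan time
            yield {k: v for k, v in row.items() if k in table.allowed_columns}

# Toy "storage" plus a policy-bearing table definition.
def fake_reader(path):
    yield {"id": 1, "region": "EU", "ssn": "x"}
    yield {"id": 2, "region": "US", "ssn": "y"}

table = TableDef("s3://bucket/t", lambda r: r["region"] == "EU", {"id", "region"})
print(list(managed_scan(table, fake_reader)))  # [{'id': 1, 'region': 'EU'}]
```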
v
Last time I checked, Spark Connect had some serious limitations (lack of RDD support, differences in API, lack of expressions)
a
Hmm, I don’t think the differences in API and especially the lack of RDD are what I’d consider limitations per se.. RDDs are a bad idea that need to die IMO. Not all expressions being implemented is indeed a limitation, but that’s WIP IIUC..
v
If the best way is via a managed service (and managed I/O), how do OSS users handle it?
a
Good question… I mean, you would implement it by adding support into the OSS release; the filters would come from the catalog and get applied during the scan. But there’s nothing stopping the OSS deployment from leaking the input files to malicious users if it’s not deployed thoughtfully… Wonder if sandboxing the scan and the I/O would be an option? Use a lightweight namespacing construct (LXC?) to isolate the executor process (and any I/O processes) such that the raw (temporary) files they download are not visible via the filesystem to anyone else?
j
Yeah, realistically I think if the requirement here is to completely sandbox raw data access away from the user, it will require some kind of data access service or abstraction that is likely to be closed-source. Either that, or we’d have to build an OSS solution that users can deploy (something like an OSS catalog implementation that also handles data I/O); it wouldn’t be in Daft itself, though Daft should be able to read from it! Another interesting idea I was toying with the other day was making use of encryption within Parquet 😅. Parquet actually does support encryption keys, so perhaps some system could vend keys to the query engine. This would be expensive though, and not quite as flexible, since encryption is applied at the column-chunk level I believe.
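A toy sketch of the key-vending idea (no real cryptography here, and the roles, keys, and columns are all hypothetical): a key service only releases a column’s decryption key if the caller’s role is entitled to that column, so an engine without the key simply can’t decode the restricted column chunks.

```python
# Hypothetical per-column key vending with a role check. Illustrative only.

COLUMN_KEYS = {"amount": b"key-amount", "ssn": b"key-ssn"}
ROLE_GRANTS = {"analyst": {"amount"}, "admin": {"amount", "ssn"}}

def vend_key(role: str, column: str) -> bytes:
    """Release a column key only if the role is granted access to the column."""
    if column not in ROLE_GRANTS.get(role, set()):
        raise PermissionError(f"{role} may not read column {column!r}")
    return COLUMN_KEYS[column]

print(vend_key("analyst", "amount"))  # b'key-amount'
# vend_key("analyst", "ssn") would raise PermissionError
```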
@Amogh Akshintala indeed sandboxing scans and I/O is exactly what we need to do. The question is whether we need to do this remotely (for maximum security), or maybe like you said we can do it locally somehow through a different process with elevated permissions. It really depends on the security requirements I think!
a
Yeah very much depends on the security model. I’ve seen both in play - in deployment modes with good isolation to start with, local sandboxing is enough; where there isn’t as strong an isolation story (the single user deployment mode I mentioned earlier) you end up having to rely on an external service to do the I/O and filter the data on behalf of the client.
🙌 1
j
> you end up having to rely on an external service to do the I/O and filter the data on behalf of the client. Are you thinking of Databricks DBFS? Or is this more Unity Catalog territory
v
Neither; you need a compute layer to apply the filtering
👍 1
a
What Vincent said… The filters to evaluate are stored as table attributes in UC (and Iceberg I presume), but applying the filters on the data has to be done by some compute layer… If you don’t have strong isolation guarantees in your compute offering, you’ll have to run a separate compute service (that is well isolated) to scan and filter the data down to only the set that the user has access to and send only the data that survives to the poorly isolated cluster as input to the query…
👍 1
m
General question: let’s say that based on policies you alter the logical plan, and before executing you somehow validate that the plan is untampered. You then proceed to execute on this plan, including the security predicates (filters, masking functions). Would disk spilling reveal sensitive information? AFAIK the process is: remote I/O -> memory -> apply expressions -> discover you are running OOM -> spill already-processed data to disk. Which means that when spilling, the data is already filtered, or is this incorrect? Of course, if you use the OSS version you can compile your own version which skips some sensitive steps, but one challenge at a time.
j
> Which means that when spilling, the data is already filtered
Correct, spilling occurs on intermediate materialized outputs (e.g. at a shuffle boundary before a global sort). In Daft our steps are fused/pipelined, which means that
Scan -> Filter -> Project
is fused into a single step, and spilling is applied only at the output of that fused step.
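A small plain-Python sketch of why the spilled data is already filtered (this is a conceptual illustration, not Daft internals): Scan, Filter, and Project run as one fused stage over each batch, so the only thing that can ever be materialized, and therefore spilled, is the stage’s output.

```python
# Fused Scan -> Filter -> Project stage; nothing is materialized mid-stage.

def fused_step(scan_batches, predicate, columns):
    """One fused pipeline stage applied batch-by-batch."""
    for batch in scan_batches:
        filtered = [row for row in batch if predicate(row)]       # Filter
        yield [{c: row[c] for c in columns} for row in filtered]  # Project

spill = []  # stand-in for spilling materialized outputs to disk

batches = [[{"id": 1, "secret": "a"}, {"id": 2, "secret": "b"}]]
for out in fused_step(batches, lambda r: r["id"] == 1, ["id"]):
    spill.append(out)  # by the time we "spill", RLS/CLS are already applied

print(spill)  # [[{'id': 1}]]
```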
🙏 1