# daft-dev
v
Is there any plan to implement fine-grained access control in Daft? It’s a hot topic in the analytics market right now and implementation can be tricky; it would be good to think about it early so we’re not blocked later...
🎉 1
j
Hi @Vincent Gromakowski — do you have examples of implementations here? Are you talking about things like row/column-level access control?
v
yes column and row
j
I think most of the burden is placed on the storage/catalog rather than the query engine. i.e. Daft by itself probably won’t be able to do this; we’d need to integrate with a system that performs this for us (I think). I’m curious how you handle it with today’s tooling!
v
CLS and RLS can only happen in the compute engine, because you need to read the data and then filter it. Look at this session for details on how we do it at AWS with Spark: https://app.events.ringcentral.com/events/iceberg-summit/replay/cut_74084c10-055e-470f-957d-3e3161d21149
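To make the read-then-filter point concrete, here is a minimal plain-Python sketch (not Daft code; the policy shapes are hypothetical): row-level security (RLS) drops rows after the read, and column-level security (CLS) projects away restricted columns. Note the engine necessarily sees the raw data first.

```python
# Conceptual sketch (plain Python, not Daft): the engine reads raw data,
# then applies RLS and CLS before results reach the user. Policies here
# are hypothetical illustrations.

def apply_rls(rows, predicate):
    """Row-level security: keep only rows the predicate allows."""
    return [row for row in rows if predicate(row)]

def apply_cls(rows, allowed_columns):
    """Column-level security: drop columns the user may not see."""
    return [{k: v for k, v in row.items() if k in allowed_columns}
            for row in rows]

# Data as it comes off storage -- the compute engine sees all of it.
raw = [
    {"user": "alice", "region": "EU", "ssn": "123-45-6789", "amount": 10},
    {"user": "bob",   "region": "US", "ssn": "987-65-4321", "amount": 20},
]

# Policy for an analyst restricted to EU rows, with no access to `ssn`.
visible = apply_cls(apply_rls(raw, lambda r: r["region"] == "EU"),
                    allowed_columns={"user", "region", "amount"})
print(visible)  # [{'user': 'alice', 'region': 'EU', 'amount': 10}]
```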
j
^ Yes, but we’d have to interact with a system that provides information about those filters. At the LinkedIn Big Data meetup where we gave a talk, one of the LinkedIn speakers described how they “shimmed” table access at the query engine level to hit a view instead. Those views have filters/column-pruning applied on them! I’ll give the “Enforcing Fine Grained Access Control with Spark and Iceberg at AWS” talk a watch when it becomes available!
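A toy sketch of that “shim” idea (names are purely illustrative, not LinkedIn’s actual system): when a query references a table, the access layer silently resolves it to a policy view whose definition already bakes in the row filters and column pruning.

```python
# Hypothetical table-to-view shim. Names are illustrative only.

POLICY_VIEWS = {
    # table name -> secure view the user is actually allowed to read
    "sales.transactions": "sales.transactions_eu_only_masked",
}

def resolve_table(name: str) -> str:
    """Return the policy view for a table, or the table itself if unrestricted."""
    return POLICY_VIEWS.get(name, name)

print(resolve_table("sales.transactions"))  # sales.transactions_eu_only_masked
print(resolve_table("sales.public_stats"))  # sales.public_stats
```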
> allowing user code and potential adversaries to easily access the memory space of the compute nodes
Yeah, this was our primary question for the LinkedIn speaker, because technically speaking the compute node still has access to the raw data in their approach. His response was that their primary concern was around policy enforcement rather than true “security”, so they could afford this.
k
Agree with @jay here, the query engine shouldn't be entrusted with enforcing security. Here is an excerpt from a recent Google paper on BigLake (“BigLake: BigQuery’s Evolution toward a Multi-Cloud Lakehouse”, SIGMOD-Companion ’24, June 9–15, 2024, Santiago, Chile): https://storage.googleapis.com/gweb-research2023-media/pubtools/pdf/b6e4bce752acf2f0eb27cf56bcca6ffdfc3db780.pdf
> "The current status quo in open-source analytics places the responsibility of enforcing the fine-grained access controls with the query engines. This leads to two downsides: (1) security policies such as data masking and row-level filtering are tied to a SQL dialect, requiring duplication of governance policies across multiple query engines (2) the model of entrusting the query engine to enforce filtering do not work well on engines like Apache Spark that are designed to run arbitrary procedural code directly within the query engine worker processes."
j
^ The biggest issue with the above, though, is that query engines then have to go through a central bottleneck of a service that controls I/O. Traditionally these engines have direct access to the raw data (S3, HDFS), and for Daft at least that’s a big win because we have our own optimized I/O. Honestly this might be a good case for a managed Daft data warehouse, where the storage service takes care of both optimized high-throughput I/O and data masking/filtering, using existing components of Daft to execute the I/O.
👍 1
v
There is no ideal solution, but using a data access layer that enforces the filtering is not efficient at all:
• It adds extra hops and the associated performance (and sustainability) challenges (data transfer, latency, scalability....). You now have 2 computes: your compute engine and the data access compute. In BQ’s case, it’s the Read API.
• It adds extra costs. You always need to pay for compute to access the data, even if your engine could read it directly. Additionally, your compute is “waiting” for the other compute to read the data, meaning you are wasting resources during the I/O phase. In BQ’s case, you pay for BQ slots.
• It breaks some performance optimizations in your engine, because your engine is no longer aware of the predicates used by the data access layer. For example, no more filter derivation in a join based on the row-level security filtering.
a
> the model of entrusting the query engine to enforce filtering do not work well on engines like Apache Spark that are designed to run arbitrary procedural code directly within the query engine worker processes
This is the entire reason for the SparkConnect project right? Isolate user code from the system entirely…
> Honestly this might be a good case for a managed Daft data warehouse, where the storage service can take care of both optimized high throughput I/O as well as data masking/filtering, using existing components of Daft to execute the I/O
This is likely the best way to proceed IMO. The filters would be available on the table definition in the catalog and then you apply them via the managed I/O subsystem… If you do decide to implement a managed service, I’d recommend never offering a “single-user cluster” deployment model (where a single user has root access on the cluster). This mode is essentially an open can of worms and will make implementing anything security and isolation related much much harder… Isolate everything user-provided from the start and that way you never need to redesign the whole thing later…
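A rough sketch of that flow (the catalog schema here is hypothetical, not an actual Daft or catalog API): the table definition carries its RLS predicate and allowed columns, and the managed I/O layer enforces both during the scan, before any bytes reach user code.

```python
# Hypothetical catalog-driven enforcement during a managed scan.
from dataclasses import dataclass, field

@dataclass
class TableDef:
    path: str
    rls_predicate: callable                 # row -> bool, stored in the catalog
    allowed_columns: set = field(default_factory=set)

def managed_scan(table, read_fn):
    """Read via the managed I/O layer, enforcing the catalog's policy."""
    for row in read_fn(table.path):
        if table.rls_predicate(row):         # row-level filter at scan time
            yield {k: v for k, v in row.items() if k in table.allowed_columns}

# Toy "storage" plus a policy-bearing table definition.
def fake_reader(path):
    yield {"id": 1, "region": "EU", "ssn": "x"}
    yield {"id": 2, "region": "US", "ssn": "y"}

table = TableDef("s3://bucket/t", lambda r: r["region"] == "EU", {"id", "region"})
print(list(managed_scan(table, fake_reader)))  # [{'id': 1, 'region': 'EU'}]
```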
v
Last time I checked, Spark Connect had some serious limitations (lack of RDD support, differences in API, lack of expressions)
a
Hmm, I don’t think the differences in API and especially the lack of RDD are what I’d consider limitations per se.. RDDs are a bad idea that need to die IMO. Not all expressions being implemented is indeed a limitation, but that’s WIP IIUC..
v
If the best way is via a managed service (and managed I/O), how do OSS users handle it?
a
Good question… I mean, you would implement it by adding support into the OSS release; the filters would come from the catalog and get applied during the scan. But there’s nothing stopping the OSS deployment from leaking the input files to malicious users if it’s not deployed thoughtfully… Wonder if sandboxing the scan and the I/O would be an option? Use a lightweight namespacing construct (LXC?) to isolate the executor process (and any I/O processes) such that the raw (temporary) files they download are not visible via the filesystem to anyone else?
j
Yeah, realistically I think if the requirement here is to completely sandbox raw data access away from the user, it will require some kind of data access service or abstraction that is likely to be closed-source. Either that, or we’d have to build an OSS solution that users can deploy (something like an OSS catalog implementation that also handles data I/O); it wouldn’t be in Daft itself, though Daft should be able to read from it! Another interesting idea I was toying with the other day was making use of encryption within Parquet 😅. Parquet actually does support encryption keys, so perhaps some system could vend keys to the query engine. This would be expensive though, and not quite as flexible, since encryption is applied at the column-chunk level I believe.
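A toy sketch of the key-vending idea (no real cryptography here, and the roles, keys, and columns are all hypothetical): a key service only releases a column’s decryption key if the caller’s role is entitled to that column, so an engine without the key simply can’t decode the restricted column chunks.

```python
# Hypothetical per-column key vending with a role check. Illustrative only.

COLUMN_KEYS = {"amount": b"key-amount", "ssn": b"key-ssn"}
ROLE_GRANTS = {"analyst": {"amount"}, "admin": {"amount", "ssn"}}

def vend_key(role: str, column: str) -> bytes:
    """Release a column key only if the role is granted access to the column."""
    if column not in ROLE_GRANTS.get(role, set()):
        raise PermissionError(f"{role} may not read column {column!r}")
    return COLUMN_KEYS[column]

print(vend_key("analyst", "amount"))  # b'key-amount'
# vend_key("analyst", "ssn") would raise PermissionError
```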
@Amogh Akshintala indeed sandboxing scans and I/O is exactly what we need to do. The question is whether we need to do this remotely (for maximum security), or maybe like you said we can do it locally somehow through a different process with elevated permissions. It really depends on the security requirements I think!
a
Yeah very much depends on the security model. I’ve seen both in play - in deployment modes with good isolation to start with, local sandboxing is enough; where there isn’t as strong an isolation story (the single user deployment mode I mentioned earlier) you end up having to rely on an external service to do the I/O and filter the data on behalf of the client.
🙌 1
j
> you end up having to rely on an external service to do the I/O and filter the data on behalf of the client. Are you thinking of Databricks DBFS? Or is this more Unity Catalog territory
v
Neither; you need a compute layer to apply the filtering
👍 1
a
What Vincent said… The filters to evaluate are stored as table attributes in UC (and Iceberg I presume), but applying the filters on the data has to be done by some compute layer… If you don’t have strong isolation guarantees in your compute offering, you’ll have to run a separate compute service (that is well isolated) to scan and filter the data down to only the set that the user has access to and send only the data that survives to the poorly isolated cluster as input to the query…
👍 1
m
General question: let’s say that based on policies you alter the logical plan, and before executing you somehow validate that the plan is untampered. You then proceed to execute on this plan, including the security predicates (filters, masking functions). Would disk spilling reveal sensitive information? AFAIK the process is: remote I/O -> memory -> apply expressions -> discover you are running OOM -> spill already-processed data to disk. Which means that when spilling, the data is already filtered, or is this incorrect? Of course, if you use the OSS version you can compile your own version which skips some sensitive steps, but one challenge at a time.
j
> Which means that when spilling, the data is already filtered
Correct, spilling occurs on intermediate materialized outputs (e.g. at a shuffle boundary before a global sort). In Daft our steps are fused/pipelined, which means that
Scan -> Filter -> Project
is fused into a single step, and spilling is applied only at the output of that fused step.
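A small plain-Python sketch of why the spilled data is already filtered (this is a conceptual illustration, not Daft internals): Scan, Filter, and Project run as one fused stage over each batch, so the only thing that can ever be materialized, and therefore spilled, is the stage’s output.

```python
# Fused Scan -> Filter -> Project stage; nothing is materialized mid-stage.

def fused_step(scan_batches, predicate, columns):
    """One fused pipeline stage applied batch-by-batch."""
    for batch in scan_batches:
        filtered = [row for row in batch if predicate(row)]       # Filter
        yield [{c: row[c] for c in columns} for row in filtered]  # Project

spill = []  # stand-in for spilling materialized outputs to disk

batches = [[{"id": 1, "secret": "a"}, {"id": 2, "secret": "b"}]]
for out in fused_step(batches, lambda r: r["id"] == 1, ["id"]):
    spill.append(out)  # by the time we "spill", RLS/CLS are already applied

print(spill)  # [[{'id': 1}]]
```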
🙏 1