# Advanced RAG - Access Control for your data corpus

## Importance in Advanced RAG Applications

### Dataset-Level Access Controls:

Access to specific data sources can be controlled at the row level within a collection, enabling fine-grained security. In a RAG system, users may need access to specific documents — such as PDFs, Word documents, or HTML pages — to generate relevant responses.

**Lookup Table Definition** — maps user permissions to specific documents or rows in the database. Each user is linked to a set of documents they can access, and the system enforces those restrictions. For example, a user can be restricted to PDFs or HTML pages related to their department or project.

**Row-Level Security** — restricts each user to only the rows of data they are authorized to view. This prevents unauthorized data access and supports compliance with privacy policies and legal requirements.

Combining role-based access control (RBAC) with dataset-level restrictions lets administrators keep data secure while allowing users to interact with the RAG system effectively.
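
The lookup-table and row-level ideas above can be illustrated with a minimal sketch. This is not the Dataworkz implementation; the names `ACCESS_LOOKUP` and `visible_rows` and the sample file names are hypothetical and exist only for illustration.

```python
# Hypothetical lookup table: document name -> access level.
ACCESS_LOOKUP = {
    "hr_policy.pdf": "Private",
    "partner_guide.html": "Partner",
    "press_release.pdf": "Public",
}

def visible_rows(rows, allowed_levels):
    """Keep only rows whose document maps to an allowed access level.

    Rows for documents missing from the lookup table are dropped
    (deny by default).
    """
    return [
        row for row in rows
        if ACCESS_LOOKUP.get(row["file_name"]) in allowed_levels
    ]

rows = [
    {"file_name": "hr_policy.pdf", "text": "..."},
    {"file_name": "press_release.pdf", "text": "..."},
    {"file_name": "unlisted.pdf", "text": "..."},
]
print(visible_rows(rows, {"Public", "Partner"}))
```

Dropping rows that have no lookup entry is a design choice in this sketch: an unlabelled document is treated as restricted rather than public.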

### Step 1: Creating the Access Control Table

* Create a table in a relational database such as Snowflake or PostgreSQL that records file information, access level, and file type.

<figure><img src="https://5638239-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FbF1ZCyeKJI9Zuib6qdjY%2Fuploads%2FaV6gkk4rESlz0YuO9yb8%2FScreenshot%202024-10-03%20at%201.01.48%20PM.png?alt=media&#x26;token=4a9789ee-2f05-4b32-a5bb-af9c882c2919" alt=""><figcaption></figcaption></figure>
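
As a rough sketch of what such a table might look like, the snippet below creates one in an in-memory SQLite database, which stands in for Snowflake or PostgreSQL here. The table name `user_access`, the column names, and the sample rows (including the access levels assigned to each file) are assumptions; match them to the schema shown in the screenshot above.

```python
import sqlite3

# In-memory SQLite stands in for Snowflake or PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_access (
        file_name    TEXT PRIMARY KEY,  -- later used as the lookup key
        access_level TEXT NOT NULL,     -- e.g. Private, Partner, Public
        file_type    TEXT NOT NULL      -- e.g. pdf, html, docx
    )
    """
)
# Sample rows; the access levels here are illustrative, not real assignments.
conn.executemany(
    "INSERT INTO user_access (file_name, access_level, file_type) VALUES (?, ?, ?)",
    [
        ("NVIDIA_2024_10k.pdf", "Public", "pdf"),
        ("partner_pricing.pdf", "Partner", "pdf"),
    ],
)
conn.commit()

access_rows = conn.execute(
    "SELECT file_name, access_level FROM user_access ORDER BY file_name"
).fetchall()
print(access_rows)
```

Making `file_name` the primary key keeps each file mapped to exactly one access level, which is what a key-based lookup in Step 2 relies on.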

### Step 2: Creating a Lookup Table from the Access Control Table

* In Dataworkz, create a lookup table using the access control table from Step 1.

<figure><img src="https://5638239-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FbF1ZCyeKJI9Zuib6qdjY%2Fuploads%2FtrigfiDRCteUEwHqJSsa%2FScreenshot%202024-10-03%20at%201.14.53%20PM.png?alt=media&#x26;token=778f8c78-bb64-449b-8baa-37cbef012146" alt=""><figcaption></figcaption></figure>

* Give the lookup table a name and select the source table. Set the file name column as the lookup key.

<figure><img src="https://5638239-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FbF1ZCyeKJI9Zuib6qdjY%2Fuploads%2FthyqwPY3RfbwMOmNKxlz%2FScreenshot%202024-10-03%20at%201.20.05%20PM.png?alt=media&#x26;token=c6698702-6b37-4452-8437-41825535e931" alt=""><figcaption></figcaption></figure>

* Save the lookup table.

### Step 3: Using the VLookup Function for Access Control Levels

* Navigate to the dataset created during ingestion for your PDF files.
* Apply the following transformation functions:

  1. Use the **Split Value** function on the `pdf_name` column, splitting by `/` and taking the first value from the right. Store the result in a new column named `file_name`. For example, `pdf_name = 's3a://dworkz-self-service-lake/demo/pdf_data/fin_bench/NVIDIA_2024_10k.pdf'` — splitting by `/` and taking the rightmost value yields `NVIDIA_2024_10k.pdf`.
  2. Apply a **VLookup** on the new `file_name` column using the `USER_ACCESS_LOOKUP` table you created.

  <figure><img src="https://5638239-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FbF1ZCyeKJI9Zuib6qdjY%2Fuploads%2FeOJyKlMsGNzWZqJ99IQl%2FScreenshot%202024-10-03%20at%201.44.09%20PM.png?alt=media&#x26;token=5f9c384f-6538-416b-a00c-3219e6bd1498" alt=""><figcaption></figcaption></figure>

  After these transformations, the dataset has two new columns: `file_name` and `access_level`.
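
The two transformations in this step are applied through the Dataworkz UI. The Python below only mirrors their logic, under the assumption that the lookup returns the `access_level` for a matching `file_name`. The helper `split_rightmost` and the dictionary contents are hypothetical.

```python
def split_rightmost(value: str, sep: str = "/") -> str:
    """Return the first segment from the right after splitting on sep."""
    return value.rsplit(sep, 1)[-1]

# Hypothetical contents of the USER_ACCESS_LOOKUP table (file_name -> access_level).
USER_ACCESS_LOOKUP = {"NVIDIA_2024_10k.pdf": "Public"}

row = {
    "pdf_name": "s3a://dworkz-self-service-lake/demo/pdf_data/fin_bench/NVIDIA_2024_10k.pdf"
}
# Split Value equivalent: keep the rightmost path segment.
row["file_name"] = split_rightmost(row["pdf_name"])
# VLookup equivalent: None when the file has no entry in the lookup table.
row["access_level"] = USER_ACCESS_LOOKUP.get(row["file_name"])
print(row["file_name"], row["access_level"])
```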

### Step 4: Chunking/Embedding Dataflow

* Run the standard embedding steps from the [Build Vector Embeddings from Scratch](https://docs.dataworkz.com/product-docs/getting-started/advanced-rag-quickstart/create-vector-embeddings/how-to-build-the-vector-embeddings-from-scratch) guide.
* Save this as a dataflow named `user_access_control`.
* Set the target to a MongoDB collection, then execute the dataflow.

### Step 5: Creating the Q\&A with ACL Filters Defined

* Follow the standard steps in the [RAG Application Quickstart](https://docs.dataworkz.com/product-docs/getting-started/advanced-rag-quickstart), selecting the **Use existing Dataset** option when prompted.
* Select the MongoDB dataset created from the `user_access_control` dataflow.
* Select the `access_level` column as a filter in the source configuration and enable **auto extraction of filters from query**.

<figure><img src="https://5638239-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FbF1ZCyeKJI9Zuib6qdjY%2Fuploads%2F2U9eWDC7H2Jy30bFCgjQ%2FScreenshot%202024-10-03%20at%202.06.28%20PM.png?alt=media&#x26;token=7bd8083f-521e-40ea-8e55-d20caad4c5f7" alt=""><figcaption></figcaption></figure>

### Embedding ACL into Your Application:

* Follow the steps in the [Embed RAG into Your App](https://docs.dataworkz.com/product-docs/getting-started/embed-rag-into-your-app) guide.
* To integrate Access Control Lists (ACL) and user roles, pass the user's role as part of each API call. A standard API request includes:

  * **User's Query**
  * **System ID** — the identifier of the Q\&A application in Dataworkz
  * **LLM Provider ID** — the identifier of the language model configured for the Q\&A system
  * **Access Control** — the role associated with the user making the request

  For example, if Dataworkz has three defined data types (*Employees*, *Customers*, and *Partners*) and an employee uses the application, the role corresponding to *Employees* is sent as part of the request.
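
The four request elements listed above could be assembled as in the sketch below. Every key name (`query`, `system_id`, `llm_provider_id`, `access_control`) and the helper `build_request` are placeholders chosen for illustration, not the documented Dataworkz request schema. Consult the Dataworkz API reference for the actual endpoint, parameter names, and authentication.

```python
import json

def build_request(query: str, system_id: str, llm_provider_id: str, role: str) -> str:
    """Serialize the four request elements into a JSON body.

    All key names are placeholders, not the documented Dataworkz schema.
    """
    if not role:
        raise ValueError("a role is required so access control can be applied")
    payload = {
        "query": query,
        "system_id": system_id,
        "llm_provider_id": llm_provider_id,
        "access_control": role,
    }
    return json.dumps(payload)

body = build_request(
    query="What was revenue in fiscal 2024?",
    system_id="<system-id>",
    llm_provider_id="<llm-provider-id>",
    role="Employee",
)
print(body)
```

Rejecting an empty role in the client is one way to avoid sending a request with no access restriction at all.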

### How the Access Control List Works with Roles:

Three user roles — *Executive*, *Partner*, and *Outside User* — map to the following access levels:

* **Executive** — access to Private, Partner, and Public data.
* **Partner** — access to Partner and Public data.
* **Outside User** — access to Public data only.

When a user submits a query, the application assigns the appropriate role based on that user's access level and includes it in the API request to Dataworkz.
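
The role-to-access-level mapping above can be written down as a small table in application code. The sketch below also shows how such a mapping could translate into a MongoDB-style `$in` filter on the `access_level` column. How Dataworkz applies the filter internally is not described here, so treat `ROLE_ACCESS` and `access_filter` as hypothetical names in an illustrative sketch.

```python
# Role -> access levels, taken from the list above.
ROLE_ACCESS = {
    "Executive": ["Private", "Partner", "Public"],
    "Partner": ["Partner", "Public"],
    "Outside User": ["Public"],
}

def access_filter(role: str) -> dict:
    """Build a MongoDB-style filter on access_level for the given role.

    Unknown roles get an empty list, which matches no documents.
    """
    return {"access_level": {"$in": ROLE_ACCESS.get(role, [])}}

print(access_filter("Partner"))
```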
