Advanced RAG - Access Control for your data corpus
Access to specific data sources can be controlled at the row level within a collection, enabling fine-grained security. In a RAG system, users may need access to particular documents, such as PDF files, Word documents, or HTML pages, to generate relevant responses.
Lookup Table Definition: This is where user permissions are mapped to specific documents or rows in the database. Each user is linked to a set of documents they can access, and the system ensures that they can only interact with those approved resources. For example, a specific user could be restricted to viewing only PDFs or HTML related to their department or project.
Row-Level Security: By applying row-level security controls, the RAG system can restrict access to only the rows of data that a user is authorized to view. This prevents unauthorized data access and supports compliance with privacy policies or legal requirements.
By combining role-based access control (RBAC) with dataset-level restrictions, administrators can ensure that data remains secure, while still allowing users to interact with the RAG system effectively. This approach is essential for RAG systems that operate in environments with sensitive or proprietary data.
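As a rough illustration of the lookup-table idea, the Python sketch below maps users to the documents they may read and drops retrieved rows that come from any other document. The names USER_DOCUMENTS and filter_rows_for_user, the sample users, and the row layout are all hypothetical and not part of Dataworkz, where this behavior is configured through the product rather than written by hand.

```python
# Hypothetical lookup table: user -> set of documents the user may access.
USER_DOCUMENTS = {
    "alice": {"finance_report.pdf", "hr_policy.html"},
    "bob": {"hr_policy.html"},
}


def filter_rows_for_user(user, rows):
    """Return only the rows whose source document the user may access."""
    allowed = USER_DOCUMENTS.get(user, set())  # unknown users get nothing
    return [row for row in rows if row["file_name"] in allowed]


# Hypothetical rows, one per retrieved chunk.
rows = [
    {"file_name": "finance_report.pdf", "text": "Q4 revenue summary"},
    {"file_name": "hr_policy.html", "text": "Leave policy"},
]

visible_to_bob = filter_rows_for_user("bob", rows)
```

Defaulting unknown users to an empty set means a missing lookup entry denies access instead of granting it.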
Create a table in a relational database such as Snowflake or PostgreSQL that records, for each file, its name, the access level it should have, and its file type.
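A minimal sketch of such a table is shown below. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for PostgreSQL or Snowflake. The table name file_access_control, the column names, and the sample rows (including the access levels assigned to them) are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL or Snowflake in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE file_access_control (
        file_name    TEXT PRIMARY KEY,  -- later used as the lookup key
        access_level TEXT NOT NULL,     -- e.g. Private, Partner, Public
        file_type    TEXT NOT NULL      -- e.g. pdf, docx, html
    )
    """
)
# Hypothetical sample rows; the access levels here are made up.
conn.executemany(
    "INSERT INTO file_access_control VALUES (?, ?, ?)",
    [
        ("NVIDIA_2024_10k.pdf", "Public", "pdf"),
        ("partner_pricing.pdf", "Partner", "pdf"),
    ],
)
conn.commit()

access_rows = conn.execute(
    "SELECT file_name, access_level FROM file_access_control ORDER BY file_name"
).fetchall()
```

Storing the bare file name (not the full storage path) keeps the key aligned with the value used later as the lookup key.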
In Dataworkz, create a lookup table from the access control table created in the previous step.
Give the lookup table a name, and select the source table for the lookup. Make sure to select the name of the file as the lookup key.
Save the Lookup Table
Go to the dataset created during ingestion for your PDF files.
Perform the following transformation functions on this data:
Use the Split Value function on the pdf_name column: split by '/', take the first value from the right, and store it in a new column named 'file_name'. For example, given pdf_name = 's3a://dworkz-self-service-lake/demo/pdf_data/fin_bench/NVIDIA_2024_10k.pdf', splitting by '/' keeps only 'NVIDIA_2024_10k.pdf', which is stored in the new column.
Perform a VLookup on the new 'file_name' column, using the USER_ACCESS_LOOKUP table we have created.
After these two steps, the dataset has two new columns: file_name (from the split) and access_level (from the lookup).
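The effect of the two transformations can be sketched in plain Python as follows. This is only an approximation of what the Split Value and VLookup functions do inside Dataworkz. The function add_access_columns and the dictionary contents are hypothetical, and the 'Public' level assigned to the sample file is made up.

```python
# Hypothetical contents of the lookup table, keyed by file name.
USER_ACCESS_LOOKUP = {"NVIDIA_2024_10k.pdf": "Public"}


def add_access_columns(row):
    """Mimic the Split Value and VLookup steps for one dataset row."""
    # Split on '/' and keep the first value from the right.
    file_name = row["pdf_name"].rsplit("/", 1)[-1]
    # Look up the access level; None marks files missing from the lookup table.
    access_level = USER_ACCESS_LOOKUP.get(file_name)
    return {**row, "file_name": file_name, "access_level": access_level}


row = {"pdf_name": "s3a://dworkz-self-service-lake/demo/pdf_data/fin_bench/NVIDIA_2024_10k.pdf"}
enriched = add_access_columns(row)
```

Files that are absent from the lookup table end up with no access level in this sketch, so it is worth checking for such rows before building embeddings.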
Run the standard embedding steps described at https://docs.dataworkz.com/product-docs/getting-started/advanced-rag-quickstart/create-vector-embeddings/how-to-build-the-vector-embeddings-from-scratch
Save this as a dataflow; we will call it 'user_access_control'.
Execute this dataflow, and set the target as a MongoDB collection.
Follow the standard quickstart steps, making sure to select the 'Use existing Dataset' option when prompted: https://docs.dataworkz.com/product-docs/getting-started/advanced-rag-quickstart
Select the MongoDB dataset created from the 'user_access_control' dataflow.
Make sure to select the 'access_level' column as a filter in the source configuration segment and check 'enable auto extraction of filters from query'.
Follow the steps in the Embed RAG into Your Apps section: https://docs.dataworkz.com/product-docs/getting-started/embed-rag-into-your-app
To integrate Access Control Lists (ACL) and user roles into the application, the role associated with the user's query should be passed as part of the API call. A standard API request would include the following information:
User's Query
System ID: The identifier of the specific Q&A application created in Dataworkz.
LLM Provider ID: The identifier of the Language Model configured for the selected Q&A system.
Access Control: The role associated with the user accessing the system.
For example, if an employee is using the application, and Dataworkz has defined three data types—Employees, Customers, and Partners—for specific datasets, the role corresponding to "Employee" would be sent as part of the request.
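To make the request fields listed above concrete, the sketch below assembles them into a single payload. It deliberately stops short of calling the API: the key names (query, system_id, llm_provider_id, access_control), the placeholder IDs, and the function build_request_payload are assumptions for illustration only, and the actual request format should be taken from the Dataworkz API documentation.

```python
import json


def build_request_payload(query, system_id, llm_provider_id, role):
    """Bundle the four request fields into one dictionary.

    The key names are hypothetical placeholders, not the real API schema.
    """
    return {
        "query": query,
        "system_id": system_id,
        "llm_provider_id": llm_provider_id,
        "access_control": role,
    }


payload = build_request_payload(
    query="What was total revenue last year?",
    system_id="<your-system-id>",
    llm_provider_id="<your-llm-provider-id>",
    role="Employee",
)
body = json.dumps(payload)  # serialized form an HTTP client could send
```

The role should be set by the application's server side from the authenticated session, not taken from user input, so that a user cannot claim a broader role.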
To maintain simplicity and align with the established "Access Level" types outlined in the documentation, three user roles will be created: Executive, Partner, and Outside User. Each of these roles will be associated with specific access levels as follows:
Executive: Access to Private, Partner, and Public data.
Partner: Access to Partner and Public data.
Outside User: Access to Public data only.
During the final request to the Dataworkz API, the application will assign the appropriate user role based on the access level of the user initiating the query from the front end.
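The role-to-access-level mapping above could be represented in the application roughly as follows. The names ROLE_ACCESS_LEVELS, access_levels_for, and can_read are hypothetical, and this sketch only models the mapping itself, not how Dataworkz applies the filter internally.

```python
# Access levels each role may read, as listed above.
ROLE_ACCESS_LEVELS = {
    "Executive": ["Private", "Partner", "Public"],
    "Partner": ["Partner", "Public"],
    "Outside User": ["Public"],
}


def access_levels_for(role):
    """Return the access levels for a role; unknown roles get none."""
    return ROLE_ACCESS_LEVELS.get(role, [])


def can_read(role, access_level):
    """Check whether a role may read data tagged with an access level."""
    return access_level in access_levels_for(role)
```

For example, can_read("Partner", "Private") evaluates to False, while can_read("Executive", "Private") evaluates to True.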