
Advanced RAG - Access Control for your data corpus


Last updated 7 months ago

Importance in Advanced RAG Applications

Dataset-Level Access Controls:

Access to specific data sources can be controlled at the row level within a collection, enabling fine-grained security. In a RAG system, users may need access to particular documents, such as PDF files, Word documents, or HTML pages, to generate relevant responses.

Lookup Table Definition: This is where user permissions are mapped to specific documents or rows in the database. Each user is linked to a set of documents they can access, and the system ensures that they can only interact with those approved resources. For example, a specific user could be restricted to viewing only the PDFs or HTML pages related to their department or project.

Row-Level Security: By applying row-level security controls, the RAG system can restrict access to only the rows of data that a user is authorized to view. This prevents unauthorized data access and supports compliance with privacy policies or legal requirements.

By combining role-based access control (RBAC) with dataset-level restrictions, administrators can ensure that data remains secure, while still allowing users to interact with the RAG system effectively. This approach is essential for RAG systems that operate in environments with sensitive or proprietary data.

Step 1: Creating the Access Control Table

  • Create a table in a relational database such as Snowflake or PostgreSQL that records, for each file, the file name, the level of access the file should have, and the file type.
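As a rough illustration of what such a table could look like, the sketch below builds one with Python's built-in sqlite3 module as a stand-in for Snowflake or PostgreSQL. The table name user_access, the column name file_type, and the sample rows are assumptions made for this example; file_name and access_level follow the column names used later on this page, and the three access levels are the ones listed in the roles section.

```python
import sqlite3

# In-memory SQLite is only a stand-in for Snowflake or PostgreSQL.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per ingested file.
conn.execute(
    """
    CREATE TABLE user_access (
        file_name    TEXT PRIMARY KEY,
        access_level TEXT NOT NULL
            CHECK (access_level IN ('Private', 'Partner', 'Public')),
        file_type    TEXT NOT NULL
    )
    """
)

# Illustrative rows only; real values come from your own document corpus.
rows = [
    ("NVIDIA_2024_10k.pdf", "Public", "pdf"),
    ("partner_pricing.pdf", "Partner", "pdf"),
    ("board_minutes.docx", "Private", "docx"),
]
conn.executemany("INSERT INTO user_access VALUES (?, ?, ?)", rows)
conn.commit()

access_table = conn.execute(
    "SELECT file_name, access_level, file_type FROM user_access ORDER BY file_name"
).fetchall()
for row in access_table:
    print(row)
```

Making file_name unique matters because Step 2 uses it as the lookup key.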

Step 2: Creating a Lookup Table from the Access Control Table

  • In Dataworkz, create a lookup table from the access control table created in Step 1.

  • Give the lookup table a name, and select the source table for the lookup. Make sure to select the file name column as the lookup key.

  • Save the lookup table.

Step 3: Using the VLookup function for Access Control Levels

  • Go to the dataset created during ingestion for your PDF files.

  • Apply the following transformation functions to this data:

    1. Use the Split Value function on the pdf_name column: split by '/' and take the first value from the right, then store the result in a new column named 'file_name'. For example, if pdf_name = 's3a://dworkz-self-service-lake/demo/pdf_data/fin_bench/NVIDIA_2024_10k.pdf', splitting by '/' keeps only 'NVIDIA_2024_10k.pdf', which is stored in the new column.

    2. Perform a VLookup on the new 'file_name' column, using the lookup table created in Step 2 (named USER_ACCESS_LOOKUP in this example).

    These two new columns will hold the access_level and the file_name information.
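The two transformations above can be sketched in plain Python to show what they compute. This is not how Dataworkz implements Split Value or VLookup; the dictionary below is a hypothetical stand-in for the USER_ACCESS_LOOKUP table, and the 'Public' level assigned to the sample file is an assumption for illustration.

```python
from typing import Optional

# Hypothetical stand-in for the USER_ACCESS_LOOKUP table, keyed by file name.
USER_ACCESS_LOOKUP = {
    "NVIDIA_2024_10k.pdf": "Public",
}


def extract_file_name(pdf_name: str) -> str:
    """Split on '/' and keep the first value from the right."""
    return pdf_name.rsplit("/", 1)[-1]


def lookup_access_level(file_name: str) -> Optional[str]:
    """Return the access level for a file, or None when no row matches."""
    return USER_ACCESS_LOOKUP.get(file_name)


def add_access_columns(row: dict) -> dict:
    """Return a copy of the row with file_name and access_level added."""
    file_name = extract_file_name(row["pdf_name"])
    return {
        **row,
        "file_name": file_name,
        "access_level": lookup_access_level(file_name),
    }


sample = {
    "pdf_name": "s3a://dworkz-self-service-lake/demo/pdf_data/fin_bench/NVIDIA_2024_10k.pdf"
}
enriched = add_access_columns(sample)
print(enriched["file_name"], enriched["access_level"])
```

A file with no matching row gets no access level in this sketch, so it is worth checking the transformed dataset for empty access_level values before embedding.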

Step 4: Chunking/Embedding dataflow

  • Save these transformations as a dataflow. This example names it 'user_access_control'.

  • Execute this dataflow, and set the target as a MongoDB collection.

Step 5: Creating the Q&A with the ACL Filters defined

  • Select the MongoDB dataset created by the 'user_access_control' dataflow.

  • In the source configuration segment, select the 'access_level' column as a filter and check 'enable auto extraction of filters from query'.

Embedding ACL into your Application:

  • To integrate Access Control Lists (ACL) and user roles into the application, the role associated with the user's query should be passed as part of the API call. A standard API request would include the following information:

    • User's Query

    • System ID: The identifier of the specific Q&A application created in Dataworkz.

    • LLM Provider ID: The identifier of the Language Model configured for the selected Q&A system.

    • Access Control: The role associated with the user accessing the system.

    For example, if an employee is using the application, and Dataworkz has defined three data types (Employees, Customers, and Partners) for specific datasets, the role corresponding to "Employee" would be sent as part of the request.
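The request contents listed above could be assembled along these lines. This is a minimal sketch only: the JSON field names (query, system_id, llm_provider_id, access_control) and the placeholder IDs are hypothetical and are not taken from the Dataworkz API, so check the RAG Apps API reference for the actual endpoint and parameter names before using it.

```python
import json


def build_qna_request(query: str, system_id: str, llm_provider_id: str, role: str) -> str:
    """Assemble the four pieces of information as a JSON body.

    All field names here are assumptions made for illustration.
    """
    if not role:
        # Without a role there is nothing for the ACL filter to match on.
        raise ValueError("a role is required so the access control filter can be applied")
    payload = {
        "query": query,
        "system_id": system_id,
        "llm_provider_id": llm_provider_id,
        "access_control": role,
    }
    return json.dumps(payload)


# Placeholder identifiers; real values come from your Dataworkz account.
request_body = build_qna_request(
    query="What was total revenue in fiscal 2024?",
    system_id="my-system-id",
    llm_provider_id="my-llm-provider-id",
    role="Employee",
)
print(request_body)
```

The role should be resolved on the server side from the authenticated user rather than accepted from the browser, so that a user cannot request a broader role than they hold.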

How the Access Control List Works with Roles:

To maintain simplicity and align with the established "Access Level" types outlined in the documentation, three user roles will be created: Executive, Partner, and Outside User. Each of these roles will be associated with specific access levels as follows:

  • Executive: Access to Private, Partner, and Public data.

  • Partner: Access to Partner and Public data.

  • Outside User: Access to Public data only.

During the final request to the Dataworkz API, the application will assign the appropriate user role based on the access level of the user initiating the query from the front end.
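The role hierarchy above amounts to a mapping from each role to the set of access levels it may see. The sketch below expresses that mapping in Python and applies it to a few illustrative chunks. In Dataworkz the filtering is done by the Q&A system through the 'access_level' filter, so this code only demonstrates the intended semantics; the function names and sample chunks are assumptions.

```python
# Role-to-access-level mapping as described in the list above.
ROLE_ACCESS = {
    "Executive": {"Private", "Partner", "Public"},
    "Partner": {"Partner", "Public"},
    "Outside User": {"Public"},
}


def allowed_levels(role: str) -> set:
    """Return the access levels a role may see; unknown roles see nothing."""
    return ROLE_ACCESS.get(role, set())


def filter_chunks(chunks: list, role: str) -> list:
    """Keep only chunks whose access_level is permitted for the role."""
    levels = allowed_levels(role)
    return [chunk for chunk in chunks if chunk["access_level"] in levels]


# Illustrative chunks only.
chunks = [
    {"file_name": "board_minutes.docx", "access_level": "Private"},
    {"file_name": "partner_pricing.pdf", "access_level": "Partner"},
    {"file_name": "NVIDIA_2024_10k.pdf", "access_level": "Public"},
]

for role in ("Executive", "Partner", "Outside User"):
    visible = [chunk["file_name"] for chunk in filter_chunks(chunks, role)]
    print(role, visible)
```

Returning an empty set for an unrecognized role makes the sketch fail closed: a misspelled or missing role yields no documents instead of all of them.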

Run the standard embedding steps from: https://docs.dataworkz.com/product-docs/getting-started/advanced-rag-quickstart/create-vector-embeddings/how-to-build-the-vector-embeddings-from-scratch

Follow the standard steps in the following guide, making sure to select the 'Use existing Dataset' option when prompted: https://docs.dataworkz.com/product-docs/getting-started/advanced-rag-quickstart

Follow the steps in the Embed RAG into your App section: https://docs.dataworkz.com/product-docs/getting-started/embed-rag-into-your-app