
Ingestion

Data ingestion is a crucial step in building Retrieval-Augmented Generation (RAG) apps. It refers to the process of bringing external data into your app, where it can be used by the model to generate relevant responses. In RAG apps, data is the foundation that informs the language model, allowing it to produce dynamic, context-aware outputs based on real-time information.

What is Data Ingestion?

Data ingestion involves selecting, uploading, or linking external data sources (e.g., unstructured data) to your RAG app. The process ensures that the app has access to up-to-date and relevant data, which the model will use during its generation process.

Once the data is ingested, the app processes it by creating vector embeddings, which are mathematical representations of the data in a high-dimensional space. These embeddings allow the model to efficiently retrieve the most relevant information when generating answers to user queries.
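The idea of comparing embeddings to retrieve relevant information can be sketched with a toy example. The bag-of-words "embedding" below is only illustrative — real RAG apps use a trained embedding model producing dense high-dimensional vectors — but the comparison step (cosine similarity between a query vector and document vectors) works the same way:

```python
import math
from collections import Counter

# Toy "embedding": a bag-of-words count vector over a fixed vocabulary.
# Real embedding models produce dense learned vectors; this only
# illustrates mapping text to vectors so they can be compared.
VOCAB = ["cat", "dog", "pet", "car", "engine"]

def embed(text: str) -> list[float]:
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

doc_a = embed("the cat is a pet")
doc_b = embed("the car has an engine")
query = embed("my dog is a pet")

# The query about pets scores higher against the pet document
# than against the car document, so doc_a would be retrieved.
assert cosine(query, doc_a) > cosine(query, doc_b)
```

Because similarity is computed between vectors rather than raw strings, the model can surface documents that are semantically related to a query even when they share few exact words (with a real embedding model, unlike the toy one above).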

Key Concepts in Data Ingestion:

  1. Data Sources:

    • PDFs: Upload PDF files directly from your local system. This is ideal for working with documents, articles, or research papers.

    • Blob Storage: Connect to cloud storage services like AWS S3, Azure Blob Storage, or Google Cloud Storage to import large datasets or documents.

    • URL Crawling: Automatically ingest data from specified URLs or web pages. This method is great for continuously updated data or large volumes of online content.

  2. Data Processing:

    • Once data is uploaded or connected, the system automatically handles the extraction, parsing, and transformation of the data. It can then be used to create vector embeddings that represent the data in a format the model can use to retrieve information efficiently.

    • Vector Embeddings: An embedding model transforms raw data (e.g., text or documents) into numerical representations (vectors), which allow the app to compare and retrieve the most relevant information for a user query.

  3. Vector Storage:

    • After generating the embeddings, the data is stored in a vector database or storage system. This allows the app to perform fast, efficient searches to find the most relevant pieces of data during generation, ensuring responses are both accurate and contextually relevant.
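The storage-and-search step above can be sketched as a minimal in-memory vector store. Production RAG apps use a dedicated vector database for this; the class and method names below (`VectorStore`, `add`, `search`) are illustrative stand-ins, not a Dataworkz API:

```python
import math

# Minimal in-memory vector store sketch: keep (text, vector) pairs
# and return the top-k texts ranked by cosine similarity to a query.
class VectorStore:
    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self.items.append((text, vector))

    def search(self, query: list[float], k: int = 1) -> list[str]:
        def cos(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self.items, key=lambda it: cos(query, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = VectorStore()
# Hypothetical documents with pre-computed 2-d embeddings.
store.add("Refund policy: 30 days", [0.9, 0.1])
store.add("Shipping times: 3-5 days", [0.1, 0.9])

# A query vector near [0.9, 0.1] retrieves the refund document.
assert store.search([0.8, 0.2], k=1) == ["Refund policy: 30 days"]
```

Real vector databases replace the linear scan in `search` with approximate nearest-neighbor indexes, which is what makes retrieval fast even over millions of embeddings.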

Benefits of Proper Data Ingestion:

  • Context-Aware Responses: By integrating diverse and up-to-date data sources, your RAG app can generate responses based on the latest available information.

  • Customization: You can choose the data sources that best suit your application needs, from specific documents to large-scale datasets or real-time web data.

  • Efficiency: Proper ingestion and embedding processing allow for quick, relevant data retrieval during real-time interactions, resulting in faster and more accurate model outputs.

Data Ingestion Flow:

  1. Choose Your Source: Decide whether you’ll upload a local file, connect to cloud storage (blob storage), or crawl data from a URL.

  2. Upload or Connect: Bring your data into the system by uploading, linking, or crawling your selected sources.

  3. Automatic Embedding: Once the data is ingested, vector embeddings can be created and stored for efficient retrieval.
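The three-step flow above can be summarized in pseudocode. Every name here (`load_source`, `embed`, the `report.pdf` path, the stand-in embedding) is hypothetical — in practice the Dataworkz UI performs these steps for you — but it shows how source selection, ingestion, and embedding fit together:

```python
# Sketch of the ingestion flow: choose a source, bring in raw text,
# then create embeddings for retrieval. Names are illustrative only.
def load_source(kind: str, location: str) -> str:
    # Steps 1-2: a real app would read a PDF, a blob store, or crawl a URL.
    sources = {"file": "Local document text", "url": "Crawled page text"}
    return sources[kind]

def embed(text: str) -> list[float]:
    # Step 3: stand-in for a real embedding model (here, word lengths).
    return [float(len(word)) for word in text.split()]

raw = load_source("file", "report.pdf")   # hypothetical local file
vector = embed(raw)                        # ready to store for retrieval
```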

