Vectorization

Vectorization is the process of transforming unstructured data, such as text documents or raw content, into a structured format that can be efficiently processed by machine learning models. In the context of Retrieval-Augmented Generation (RAG) apps, this typically involves chunking and embedding to create vector representations of the data. These vectors allow the system to perform tasks like semantic search, matching user queries to the most relevant data, and generating meaningful responses based on that data.

Unstructured Data to Structured Dataset: The Vectorization Process

When you ingest unstructured data—such as PDFs, text documents, or web content—it’s typically in raw form that needs to be processed for further analysis or use. Vectorization is the method of converting this data into a structured dataset that the system can efficiently search, retrieve, and generate responses from. This transformation happens in two key stages: chunking and embedding.

Step 1: Chunking the Data

Chunking is the process of breaking large volumes of text into smaller, more manageable pieces, or "chunks." This is particularly important for unstructured data such as long documents, PDFs, or web pages. The purpose of chunking is to divide the data into smaller segments that are easier to process and retrieve from.

  • Why Chunking?

    • Large documents may contain multiple topics or concepts, making it difficult for the model to understand the entire document as a single entity.

    • Chunking allows the system to break down the data into logical, coherent sections—such as paragraphs or sentences—that are more easily interpreted by the embedding model.

    • Smaller chunks improve the accuracy of semantic matching since each chunk can be analyzed independently, making it more likely that relevant data is retrieved during a query.

  • How Chunking Works:

    • A document (e.g., a research paper or an article) might be divided into chunks based on certain delimiters, such as paragraphs, sentences, or sections.

    • Each chunk serves as a discrete unit of information that can later be represented as a vector.

    • The chunking process can be configured based on the type of data or use case, with flexibility in how large or small each chunk should be.
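
To make the chunking step concrete, here is a minimal sketch of a paragraph-based chunker with a size cap. The function name, the paragraph delimiter, and the 500-character default are illustrative assumptions, not Dataworkz settings or defaults.

```python
# Illustrative chunker: split on paragraph breaks, then merge small
# paragraphs together until a chunk would exceed max_chars.
# The delimiter and size cap are assumptions for this sketch.

def chunk_text(text, max_chars=500, delimiter="\n\n"):
    """Split text on paragraph breaks, merging pieces up to max_chars."""
    chunks, current = [], ""
    for para in text.split(delimiter):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current} {para}".strip() if current else para
    if current:
        chunks.append(current)
    return chunks

doc = ("First paragraph about topic A.\n\n"
       "Second paragraph about topic B.\n\n"
       "Third paragraph.")
for i, chunk in enumerate(chunk_text(doc, max_chars=60)):
    print(i, chunk)
```

In practice the delimiter and chunk size are tuned per data type, as described above: a legal contract might chunk by section, while chat transcripts might chunk by message.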

Step 2: Embedding the Chunks

Once the data is chunked, each chunk is passed through an embedding model to create a vector representation. Embedding models are designed to convert raw text data into numerical vectors that capture the semantic meaning of the text. These vectors are placed into a high-dimensional space where similar pieces of data are close to each other.

  • Why Embedding?

    • Embedding transforms raw text into a form that machine learning models can process, allowing for the generation of contextually relevant outputs.

    • The vector representation captures the meaning, context, and relationships between words in a chunk of text. Even if the exact wording differs between chunks or queries, text with similar meaning maps to nearby vectors, so semantically related content can still be matched.

  • How Embedding Works:

    • The chunked pieces of data are processed by an embedding model (e.g., an OpenAI text-embedding model, a BERT-based model, or a custom model).

    • The embedding model converts each chunk into a vector—a list of numerical values—that represents the chunk’s semantic meaning.

    • These embeddings are stored in a vector database, where they can be retrieved during the RAG app's operation.
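
The shape of the embedding step can be sketched with a stand-in model: the toy function below hashes words into a fixed-size vector and normalizes it. This is not how a real embedding model works (trained models capture semantics, not word counts); it only illustrates that each chunk becomes one fixed-length list of numbers. The function name and dimension are assumptions for this sketch.

```python
# Toy stand-in for an embedding model: hash each word into one of `dim`
# buckets, count occurrences, and L2-normalize. A real model (BERT-style
# or a hosted embedding API) would produce semantically meaningful vectors.
import hashlib
import math

def embed(text, dim=8):
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

chunks = ["vector databases store embeddings",
          "embeddings capture semantic meaning"]
vectors = [embed(c) for c in chunks]
print(len(vectors), len(vectors[0]))  # → 2 8 (one 8-dimensional vector per chunk)
```

Whatever model is used, the invariant is the same: every chunk maps to a vector of identical dimension, which is what allows them to be compared in one vector space.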

Step 3: Storing Vector Representations

After the chunks are embedded, the resulting vectors are stored in a vector database or a vector storage system. This database is optimized for fast, efficient retrieval of the most relevant vectors based on similarity searches.

  • Vector Storage:

    • A vector database indexes the embeddings, allowing for quick lookups when a query is made.

    • The database supports similarity search, meaning that when a query is processed, it finds the closest matching vector representations and retrieves the corresponding chunks of data.

    • This retrieval is key for enabling the model to generate responses based on relevant and contextually accurate data.
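
The retrieval logic a vector database performs can be sketched as a linear scan with cosine similarity. Real vector databases add indexing (e.g., approximate nearest-neighbor structures) so the search stays fast at scale; the function names and example vectors below are assumptions for illustration only.

```python
# Minimal similarity search over stored vectors using cosine similarity.
# A real vector database indexes the vectors for fast lookup; this linear
# scan only shows the retrieval logic.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, store, k=2):
    """store: list of (chunk_text, vector) pairs; returns best matches."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

store = [("chunk about pricing",  [1.0, 0.0, 0.0]),
         ("chunk about refunds",  [0.0, 1.0, 0.0]),
         ("chunk about shipping", [0.0, 0.0, 1.0])]
print(top_k([0.9, 0.1, 0.0], store, k=1))  # → ['chunk about pricing']
```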

Example: From Unstructured Data to Vector Representation

  1. Ingest Raw Data: A PDF document is uploaded into the RAG app. This document is then structured into a dataset, broken up by page.

  2. Chunking: The document is chunked into smaller, more manageable units, such as paragraphs or sections. For instance, the introduction, conclusion, and body sections might each be treated as separate chunks.

  3. Embedding: Each chunk is passed through an embedding model, which transforms it into a vector representation—numerical values that capture the meaning and context of each chunk.

  4. Vector Storage: The resulting vectors are stored in a vector database. This allows for efficient retrieval based on similarity to the user’s query.
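
The four steps above can be sketched end to end with stand-in components: a paragraph splitter for chunking, a toy hash-based embedding, an in-memory list as the "vector store", and a cosine-similarity query. Every name and parameter here is illustrative; a real deployment would use a trained embedding model and a proper vector database.

```python
# End-to-end sketch: ingest -> chunk -> embed -> store -> query.
# All components are simplified stand-ins for illustration.
import hashlib
import math

def embed(text, dim=64):
    """Toy hash embedding; real systems use a trained model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # vectors are pre-normalized

# 1-2. Ingest and chunk (paragraphs as chunks)
document = ("Vectorization converts text into vectors.\n\n"
            "A vector database retrieves chunks by similarity.")
chunks = [p.strip() for p in document.split("\n\n") if p.strip()]

# 3-4. Embed each chunk and store (text, vector) pairs
store = [(c, embed(c)) for c in chunks]

# Query time: embed the query the same way, return the closest chunk
query = embed("vector database retrieves chunks")
best = max(store, key=lambda item: cosine(query, item[1]))
print(best[0])
```

The essential point is that the query is embedded with the same model as the chunks, so "closest vector" corresponds to "most semantically relevant chunk".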

Key Benefits of Vectorization:

  • Enhanced Search: By using vector representations, the system can efficiently search through large datasets and retrieve the most relevant chunks based on meaning, not just keyword matching.

  • Scalability: Chunking and embedding allow the system to handle vast amounts of unstructured data while maintaining fast response times during queries.

  • Contextual Relevance: The vectorization process ensures that the model retrieves the most contextually relevant data, enhancing the quality of the generated responses.
