
Embedding Models

An embedding model is a machine learning model that transforms raw data (such as text, images, or other types of information) into numerical representations called embeddings. These embeddings are high-dimensional vectors that capture the semantic meaning of the data, so that downstream systems can compare and process complex inputs numerically.

In the context of Retrieval-Augmented Generation (RAG) apps, embedding models play a crucial role in enabling the system to efficiently retrieve relevant information from large datasets. By converting the data into embeddings, the system can quickly compare and match the most pertinent pieces of information to a query, improving the quality and accuracy of the generated responses.

Key Concepts in Embedding Models:

  1. What Are Embeddings?

    • Embeddings are mathematical representations of data in a high-dimensional space. Each piece of data (such as a sentence, document, or image) is converted into a vector (a list of numbers) that captures its semantic meaning.

    • For example, two sentences with similar meanings will have embeddings that are numerically close in the vector space, even if the exact words are different. This allows the model to understand relationships between words, sentences, and concepts.

  2. Why Are Embeddings Important?

    • Embeddings allow models to perform tasks like semantic search, where the goal is to retrieve relevant information based on meaning rather than exact keyword matches.

    • They are also key in natural language processing (NLP) tasks, including text generation, summarization, and classification, as they capture the nuances of language in a way that traditional keyword-based models cannot.

  3. How Embedding Models Work:

    • Embedding models are trained on vast amounts of data to learn how to map raw input (e.g., text) into meaningful vector representations.

    • Once trained, these models can convert new, unseen data (e.g., a user query or a document) into embeddings. These embeddings can then be compared to other embeddings in the system to find the most relevant information.

  4. Types of Embedding Models:

    • Pre-Trained Models: Many embedding models are pre-trained on massive datasets. Examples include OpenAI's embedding models, BERT, and other Transformer-based models. These models can generate high-quality embeddings for a wide range of text-based tasks.

    • Custom Embedding Models: In some cases, you may need to train your own embedding models on domain-specific data. This is useful if you're working with specialized knowledge or proprietary data.

  5. Embedding in the RAG Context:

    • In a RAG app, embedding models are used to transform large volumes of ingested data into embeddings. Once the data is transformed, these embeddings are stored in a vector database or storage system.

    • When a user submits a query to the system, the app generates an embedding for the query and compares it to the stored embeddings. The most relevant data points are retrieved and passed to the language model, which uses them to generate a response.
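
To make "numerically close in the vector space" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embeddings. The three-dimensional vectors are hand-made for illustration and are not output from any real embedding model. The example sentences in the comments are likewise hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumption): a real embedding model would produce
# hundreds or thousands of dimensions, not three.
car = [0.9, 0.1, 0.0]       # e.g. "The car won't start."
vehicle = [0.8, 0.2, 0.1]   # e.g. "My vehicle fails to turn on."
recipe = [0.0, 0.2, 0.9]    # e.g. "Add two cups of flour."

sim_related = cosine_similarity(car, vehicle)
sim_unrelated = cosine_similarity(car, recipe)

print(f"related sentences:   {sim_related:.3f}")
print(f"unrelated sentences: {sim_unrelated:.3f}")
```

The two vectors standing in for sentences with similar meaning score close to 1, while the unrelated pair scores close to 0. A retrieval system ranks stored embeddings by a score like this one.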

Key Benefits of Embedding Models:

  • Efficient Data Retrieval: By converting data into embeddings, the system can quickly search and retrieve relevant information, even from large datasets.

  • Improved Semantic Understanding: Embedding models allow the system to understand the meaning behind words and phrases, not just their exact form.

  • Scalability: Embedding models can handle large volumes of data, enabling scalable and responsive applications, especially in RAG apps where vast datasets need to be searched quickly.

Example of How Embedding Models Are Used in RAG Apps:

  1. Data Ingestion: Raw data (e.g., a PDF document or a URL) is uploaded into the system and converted into a structured file containing the source text.

  2. Embedding Creation: The embedding model processes the data and creates embeddings that represent the semantic content of the document.

  3. Query Processing: When a user submits a query, an embedding is generated for that query.

  4. Data Retrieval: The system compares the query’s embedding with the stored embeddings from the ingested data. The most relevant matches are selected.

  5. Text Generation: The relevant data is passed to the language model, which generates a contextually aware response.
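
The five steps above can be sketched in a few lines of Python. This is an illustrative outline only, not the Dataworkz implementation. `toy_embed` is a hypothetical stand-in that counts words over a fixed vocabulary, so it matches keywords rather than meaning, and a real embedding model would replace it. The in-memory `index` list stands in for a vector database, and the sample chunks, `retrieve`, and `build_prompt` are assumptions made for the example.

```python
import math
import re
from collections import Counter

def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: word counts over a vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # all-zero vectors match nothing

# 1. Data ingestion: assume the source text was already extracted into chunks.
chunks = [
    "Invoices are sent on the first day of each month.",
    "Password resets require a verified email address.",
    "Support is available by chat on weekdays.",
]

# 2. Embedding creation: embed each chunk and store (vector, chunk) pairs.
vocabulary = sorted({w for c in chunks for w in re.findall(r"[a-z]+", c.lower())})
index = [(toy_embed(c, vocabulary), c) for c in chunks]

def retrieve(query, top_k=1):
    # 3. Query processing: embed the query with the same function as the chunks.
    query_vec = toy_embed(query, vocabulary)
    # 4. Data retrieval: rank stored embeddings by similarity to the query.
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

def build_prompt(query, context_chunks):
    # 5. Text generation: this prompt would be sent to a language model.
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

query = "How do I reset my password?"
top_chunks = retrieve(query)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

The query and the documents must be embedded by the same model, otherwise their vectors are not comparable. The sketch enforces this by using one function and one vocabulary for both.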


Last updated 1 month ago