
How to Build Vector Embeddings from Scratch

This is a more in-depth explanation of how to create the pre-designated dataflows. It is intended for advanced users who want to tune RAG application performance by changing the dataset structure.

To create an AI application, we need chunks and embeddings. Start from the dataset that was created during pre-processing: open the dataset page from the list of headers on the Dataworkz home screen, find the collection the dataset is stored in, and click on it. Then click Transform in the upper-right corner of the screen.

Follow these steps to create vector embeddings for your RAG application from the structured dataset that was created during pre-processing.

Calculate Text Length:

  • Action Applied on Text: Calculate the length of the text and add it to a new column named ‘text_length’.
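Outside the Dataworkz UI, this step can be sketched in a few lines of pandas. The column names mirror the guide; the sample rows are illustrative, not real data:

```python
import pandas as pd

# Illustrative sample rows; the real dataset comes from pre-processing.
df = pd.DataFrame({"text": ["short", "a much longer passage of text"]})

# Equivalent of the 'Calculate Text Length' step:
# character length of each text value, stored in a new column.
df["text_length"] = df["text"].str.len()
```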

Filter Data:

  • Condition: summary_data = 'false' AND text_length > 200 AND text IS NOT NULL

  • Action: Apply this filter to the dataset.
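A pandas sketch of the same filter condition, again with illustrative sample rows (the column names follow the guide):

```python
import pandas as pd

# Illustrative rows; real data comes from the pre-processed dataset.
df = pd.DataFrame({
    "summary_data": ["false", "false", "true"],
    "text_length": [250, 120, 300],
    "text": ["long enough body text", "too short", "a summary row"],
})

# Equivalent of:
# summary_data = 'false' AND text_length > 200 AND text IS NOT NULL
mask = (
    (df["summary_data"] == "false")
    & (df["text_length"] > 200)
    & df["text"].notna()
)
filtered = df[mask]
```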

Split Value:

  • Action Applied on Optional Headings: Enter # in the ‘Split by’ text box. Choose ‘RIGHT’ in the ‘Select Value from’ dropdown and ‘FIRST’ in the ‘Choose value’ dropdown. Enable the ‘Create new column’ slider to store the transformed data in a new column. This splits the ‘optional_headings’ value on ‘#’ from the RIGHT, keeps the FIRST value from the right side of the split, and stores it in a new column named ‘meta_data’.
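One way to read the RIGHT/FIRST setting is "the first segment counting from the right", i.e. the text after the last ‘#’. A plain-Python sketch of that reading (the interpretation itself is an assumption, not confirmed Dataworkz behaviour):

```python
def split_right_first(value: str, delim: str = "#") -> str:
    """Split on delim from the RIGHT and keep the FIRST value from that
    side, i.e. the segment after the last occurrence of the delimiter."""
    return value.rsplit(delim, 1)[-1]
```

For example, `split_right_first("Guide#Install#Step 1")` yields `"Step 1"`, which would be stored in the new ‘meta_data’ column.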

Chunk Text:

  • Action Applied on Chunking: Apply chunking on the source column ‘text’ using the chunk delimiter WORD_TEXT_SPLITTER (350 words with an 80-word overlap). Save the result as ‘text_chunks’.
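The WORD_TEXT_SPLITTER behaviour (350-word windows with an 80-word overlap) can be approximated in plain Python. This is a sketch of the windowing logic, not the exact Dataworkz implementation:

```python
def chunk_words(text: str, size: int = 350, overlap: int = 80) -> list[str]:
    """Approximate a word-based splitter: fixed-size word windows,
    each overlapping the previous window by `overlap` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks
```

Each element of the returned list corresponds to one row of ‘text_chunks’.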

Add Suffix to Meta Data:

  • Action Applied on Meta Data: Append ‘text_chunks’ as a suffix to ‘meta_data’ and store the result in a new column named ‘embeddings_input’.

Apply Embeddings:

  • Action Applied on Embeddings: Apply embeddings on the source column ‘embeddings_input’ using the embedding model all-mpnet-base-v2.
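The last two steps together can be sketched as follows. The concatenation is plain Python; the embedding call (shown commented so the sketch stays self-contained) assumes the sentence-transformers package, which hosts the all-mpnet-base-v2 model:

```python
def build_embeddings_input(meta_data: list[str],
                           text_chunks: list[str],
                           sep: str = " ") -> list[str]:
    """Suffix each chunk onto its metadata to form the embedding input.
    The separator is an illustrative assumption."""
    return [f"{m}{sep}{c}" for m, c in zip(meta_data, text_chunks)]

rows = build_embeddings_input(["Setup"], ["install the agent on the host"])

# With sentence-transformers installed, the embedding step would look like:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-mpnet-base-v2")
#   vectors = model.encode(rows)  # one 768-dimensional vector per row
```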

Execution and Storage

Before executing this dataflow, click the save icon next to Execute and give the dataflow a name. The saved dataflow can then be reused on any pre-processed dataset without repeating these steps. You can now execute the dataflow and store the resulting dataset in your MongoDB collection. These steps create the dataset needed for a Retrieval-Augmented Generation (RAG) application in Dataworkz.

The only step strictly required for a RAG application is Apply Embeddings. The other steps improve the quality of your application's results but are not necessary for it to function.

