How to Build Vector Embeddings from Scratch
This is a more in-depth explanation of how to create the pre-designated dataflows. It is intended for advanced users who want to explore RAG application performance by changing the dataset structure.
To create an AI application, we need chunks and embeddings. Start from the dataset that was created during preprocessing: open the dataset page from the list of headers on the Dataworkz home screen, find the collection in which you stored the dataset, and click on it. Then click Transform in the upper-right corner of the screen.
Follow these steps to create vector embeddings for your RAG application from the structured dataset that was created during preprocessing.
Calculate Text Length:
Action Applied on Text: Calculate the length of the text and add it to a new column named ‘text_length’.
Filter Data:
Condition: summary_data = 'false' AND text_length > 200 AND text IS NOT NULL
Action: Apply this filter to the dataset.
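As a rough illustration of what the two steps above do to the data, here is a minimal Python sketch that is not part of Dataworkz itself. It assumes each row is a plain dictionary, that text length is measured in characters, and that summary_data is stored as the string 'false'; the helper names add_text_length and keep_row are hypothetical.

```python
def add_text_length(rows):
    """Return copies of the rows with a 'text_length' column added.

    Assumption: length is counted in characters; rows with no text get None.
    """
    result = []
    for row in rows:
        text = row.get("text")
        length = len(text) if text is not None else None
        result.append({**row, "text_length": length})
    return result


def keep_row(row):
    """Mirror: summary_data = 'false' AND text_length > 200 AND text IS NOT NULL."""
    return (
        row.get("summary_data") == "false"
        and row.get("text") is not None
        and row["text_length"] > 200
    )


# Illustrative rows only; real rows come from the preprocessed dataset.
rows = [
    {"text": "a" * 250, "summary_data": "false"},
    {"text": "too short", "summary_data": "false"},
    {"text": "b" * 300, "summary_data": "true"},
    {"text": None, "summary_data": "false"},
]

filtered = [row for row in add_text_length(rows) if keep_row(row)]
print(len(filtered), "row(s) kept")
```

Only the first sample row passes all three conditions: the others are too short, flagged as summary data, or have no text.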
Split Value:
Action Applied on Optional Headings: Enter # in the ‘Split by’ text box. Choose ‘RIGHT’ in the ‘Select Value from’ dropdown and ‘FIRST’ in the ‘Choose value’ dropdown. Enable the ‘Create new column’ slider so that the transformed data is written to a new column. This splits each ‘optional_headings’ value on ‘#’, takes the first value counting from the right (that is, the last segment), and stores it in a new column named ‘meta_data’.
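The effect of the RIGHT / FIRST selection can be sketched in Python as follows. This is only an approximation of the transform: the function name split_value_right_first is hypothetical, and the handling of values that contain no ‘#’ (returned unchanged here) and of missing values is an assumption.

```python
def split_value_right_first(value, delimiter="#"):
    """Split on the delimiter and return the first piece counted from the right.

    Assumptions: a value without the delimiter is returned unchanged,
    and a missing value stays missing.
    """
    if value is None:
        return None
    # rsplit with maxsplit=1 splits once, starting from the right-hand side.
    return value.rsplit(delimiter, 1)[-1]


# Hypothetical heading path, for illustration only.
meta_data = split_value_right_first("Guide#Setup#Install")
print(meta_data)
```

For the hypothetical heading path Guide#Setup#Install, the value kept for ‘meta_data’ is the most specific heading, Install.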
Chunk Text:
Action Applied on Chunking: Apply chunking on the source column ‘text’ using the chunk delimiter WORD_TEXT_SPLITTER (350 words with an 80-word overlap). Save the result as ‘text_chunks’.
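To make the 350-word window and 80-word overlap concrete, the following is a minimal sketch of a word-based splitter. It is not the Dataworkz WORD_TEXT_SPLITTER implementation; the function name word_text_split, the whitespace tokenization, and the handling of the final short chunk are all assumptions.

```python
def word_text_split(text, chunk_size=350, overlap=80):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words. Assumption: words are
    whitespace-separated tokens, and the last chunk may be shorter.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each new chunk starts 270 words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 1000-word text, for illustration only.
sample_text = " ".join(f"w{i}" for i in range(1000))
chunks = word_text_split(sample_text)
print(len(chunks), "chunks")
```

With these settings each chunk begins 270 words after the previous one, so the last 80 words of one chunk reappear as the first 80 words of the next. The overlap helps keep sentences that straddle a chunk boundary retrievable from either side.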
Add Suffix to Meta Data:
Action Applied on Meta Data: Append the value of ‘text_chunks’ to ‘meta_data’ as a suffix and store the result in a new column named ‘embeddings_input’.
Apply Embeddings:
Action Applied on Embeddings: Apply embeddings on the source column ‘embeddings_input’ using the embedding model all-mpnet-base-v2.
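The last two steps can be approximated outside Dataworkz with the open-source sentence-transformers library, which distributes the all-mpnet-base-v2 model. The sketch below is illustrative only: the helper names build_embeddings_input and embed_texts are hypothetical, the single-space separator between ‘meta_data’ and ‘text_chunks’ is an assumption, and how Dataworkz invokes the model internally is not documented here.

```python
def build_embeddings_input(meta_data, text_chunk, separator=" "):
    """Concatenate the heading and the chunk text.

    Assumption: a single space separates the two; rows with no heading
    fall back to the chunk text alone.
    """
    if not meta_data:
        return text_chunk
    return f"{meta_data}{separator}{text_chunk}"


def embed_texts(texts, model_name="all-mpnet-base-v2"):
    """Encode texts into vectors (768 dimensions for all-mpnet-base-v2).

    Requires `pip install sentence-transformers`; the import is kept local
    so the rest of the sketch runs without the library installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts)


# Illustrative records only.
records = [
    {"meta_data": "Install", "text_chunks": "Run the installer and restart."},
    {"meta_data": None, "text_chunks": "A chunk without a heading."},
]
inputs = [build_embeddings_input(r["meta_data"], r["text_chunks"]) for r in records]
print(inputs)
# vectors = embed_texts(inputs)  # downloads the model on first use
```

Prefixing each chunk with its heading gives the embedding model some context about where the chunk came from, which is the purpose of the ‘meta_data’ column created earlier.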
Execution and Storage
Before executing this dataflow, click the file save icon next to Execute and give the dataflow a name. The saved dataflow can then be reused on any preprocessed dataset without repeating these steps. You can now execute it and store the resulting dataset in your MongoDB collection. These steps create the dataset needed for a Retrieval-Augmented Generation (RAG) application in Dataworkz.
The only required step for any RAG application is the Apply Embeddings step. The other steps improve the end results of your applications, but the application will function without them.