What You Will Need (Prerequisites)

This section gives a brief overview of what a new Dataworkz user will need in order to create AI applications.

To get started with Dataworkz, you will need a few essential components and configurations. Here’s a breakdown of the requirements to ensure a smooth setup:

1. Configuring LLMs in Dataworkz

Configuring your Large Language Model (LLM) is the first step before creating a RAG application or deploying an Agent in Dataworkz. Think of this as telling the system which brain to use for reasoning and generation.

When you open the LLM Configuration section, you’ll see a catalog of models you’ve already connected, along with details such as type, provider, and whether they are set as default. From here, you can manage existing models or add a new one.

Adding a New LLM

Click Add New LLM to open the configuration panel. This is where you define the connection between Dataworkz and your model provider.

The configuration starts with a simple choice:

  • Do you want a Generative model, capable of answering questions and writing content?

  • Or an Extractive model, specialized in pulling structured values directly from documents?

Most RAG setups will use Generative.

Once you choose the type, you’ll select the Deployment Type. Dataworkz supports multiple providers, including OpenAI, Azure OpenAI, Amazon Bedrock, Gemini, and hosted models. This flexibility lets you use whatever infrastructure your enterprise already trusts.

Selecting the Model

After picking a provider, you’ll see a dropdown of all available models. For example, if you select OpenAI, you may find options like gpt-4, gpt-4o, or gpt-3.5-turbo. For Groq-hosted deployments, you might choose llama-3.3-70b-versatile.

This is also where you’ll give your model a friendly name (like support_gpt4o_prod) to help your team identify it later.

Next, paste in your API key. Dataworkz never ships with built-in keys—you must bring your own from your LLM provider.
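Because you bring your own key, it’s good practice to keep it in an environment variable rather than in files or scripts. A minimal sketch (the variable name `OPENAI_API_KEY` follows the OpenAI convention; adjust it for your provider — this helper is ours, not a Dataworkz API):

```python
import os

def load_provider_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the LLM provider API key from the environment.

    Fails fast with a clear message instead of letting a missing key
    surface later as an opaque authentication error.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before configuring the LLM."
        )
    return key
```

You can then paste the value into the Dataworkz configuration panel, or feed it to any provisioning script you maintain.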

Fine-Tuning Behavior

Beyond basic connectivity, you’ll also control how the model behaves:

  • Max Tokens & Response Length: These set the "thinking space" and output size. For large enterprise documents, higher values ensure context isn’t cut short.

  • Temperature: Governs creativity. Lower values like 0.1 make responses factual and repeatable; higher values make them more exploratory.

  • Penalties: Adjust how much the model repeats itself or introduces new concepts.

If your use case requires image understanding (like reading PDFs or interpreting diagrams), you can enable Visual Language Model (VLM). This tells Dataworkz that the model can process multimodal input.
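Taken together, these knobs form a single configuration for the model. A hypothetical sketch of such a configuration object (field names and defaults are illustrative, not the Dataworkz API):

```python
from dataclasses import dataclass

@dataclass
class LLMConfig:
    """Illustrative container for the tuning knobs described above."""
    name: str                       # friendly name shown in the catalog
    model: str                      # provider model identifier
    max_tokens: int = 4000          # "thinking space" for the request
    response_length: int = 500      # cap on generated output size
    temperature: float = 0.1        # low = factual/repeatable, high = exploratory
    frequency_penalty: float = 0.0  # discourages the model repeating itself
    presence_penalty: float = 0.0   # encourages introducing new concepts
    vlm_enabled: bool = False       # True for multimodal (image/PDF) input

# Example: a production model that also reads diagrams
cfg = LLMConfig(name="support_gpt4o_prod", model="gpt-4o", vlm_enabled=True)
```

Keeping the settings in one place like this makes it easy to compare configurations across environments before entering them in the UI.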

Testing and Saving

Before finalizing, click Test Connection. This quickly validates your API key and ensures the model is reachable. If the test succeeds, save the configuration.

Your new model now appears in the list, where you can:

  • Mark it as the default LLM.

  • Edit parameters later if business needs change.

  • Remove it if the key expires or the provider is no longer required.

Example Setup: Customer Support RAG

Imagine you’re building a RAG app to handle customer policy questions. A good LLM configuration might be:

  • Model: gpt-4o (OpenAI, production).

  • Temperature: 0.1 (ensures answers are accurate, not creative).

  • Tokens: 8000 (supports large context from policy PDFs).

  • Response length: 600 (keeps answers concise).

  • VLM: Enabled (if policy guides include diagrams).

This setup ensures that when a customer asks, “Can I return a damaged product bought last year?”, the Agent can retrieve the rule from documents and answer reliably, without inventing details.
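Expressed as plain settings, the walkthrough above might look like the following (values are taken from this example; the keys are illustrative, not the Dataworkz API):

```python
# Customer-support RAG: favor accuracy and large document context
support_rag_llm = {
    "model": "gpt-4o",       # OpenAI, production
    "temperature": 0.1,      # accurate, not creative
    "max_tokens": 8000,      # supports large context from policy PDFs
    "response_length": 600,  # keeps answers concise
    "vlm_enabled": True,     # policy guides include diagrams
}
```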

2. Storage Setup

Dataworkz gives you flexible options for storing and managing data, so you can tailor your setup to your use case. Whether you’re working with documents for RAG, structured enterprise databases, or graph data, you’ll find integrations already available in the platform.

Default Storage

When you first sign up, Dataworkz provides you with a default workspace that comes with two built-in storage options:

  • S3-compatible object storage: This is where you can upload files such as PDFs, Word docs, or PowerPoint decks. It’s commonly used for RAG ingestion pipelines. Note that direct uploads from your local machine are limited to 1 MB per file. For larger uploads, you’ll want to connect your own external storage.

  • MongoDB (Vector storage): By default, embeddings, text chunks, and metadata are stored in MongoDB. This acts as your vector database, allowing semantic search across ingested datasets.

These defaults are designed for quick experimentation, but most production projects will connect external enterprise-grade storage.
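Because direct uploads to the default storage are capped at 1 MB per file, it can help to pre-check file sizes before an ingestion run so you know which files must go through connected external storage. A hedged sketch (the helper is ours, and we assume the cap is 1 MiB — confirm the exact limit with your workspace):

```python
import os

# Assumed cap on direct uploads from a local machine (1 MiB)
MAX_DIRECT_UPLOAD_BYTES = 1 * 1024 * 1024

def needs_external_storage(path: str) -> bool:
    """Return True if the file exceeds the direct-upload limit
    and should be routed through connected external storage."""
    return os.path.getsize(path) > MAX_DIRECT_UPLOAD_BYTES
```

Running this over a folder of PDFs before ingestion avoids discovering the limit one failed upload at a time.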

Supported Storage Options

From the Databases panel in Dataworkz, you can configure a variety of storage systems. These are grouped by category:

  • Vector Databases: Pinecone, OpenSearch, Weaviate (coming soon).

  • NoSQL Databases: MongoDB, Couchbase, Aerospike, Datastax.

  • Relational Databases: Oracle, Microsoft SQL Server, MySQL, MariaDB, DB2, Postgres.

  • Graph Database: Neo4j.

This range allows you to either stick with the defaults or integrate directly with your organization’s data infrastructure.

Free-Tier Limitations

If you’re starting on the free tier, keep the following in mind:

  • File uploads are capped at 1 MB when done directly from your machine.

  • You can crawl up to three HTML pages per month via the web crawler. This is ideal for quick tests but limited for ongoing ingestion.

  • Vector storage is provisioned via MongoDB only. External DB connections are available once you move beyond the free tier.

Data Privacy and Best Practices

While the default workspace is a convenient way to get started, we strongly recommend setting up your own dedicated vector and object storage (such as S3, GCS, or your enterprise DB of choice) before uploading sensitive or proprietary data.

Once external storage is configured, you can:

  • Migrate existing datasets out of the default workspace.

  • Adjust workspace and collection settings for stricter access control.

  • Ensure compliance with internal data governance policies.

This approach provides the best balance of ease-of-use for prototyping and security for production workloads.
