HTML/SharePoint Ingestion

Normal Settings

These are the basic settings you'll need to configure for your HTML/SharePoint ingestion.

  1. Webpage URL Settings

    • Webpage URL: Enter the URL of the specific webpage you want to ingest.

    • Protected URL: If the URL is behind authentication, provide the credentials or access tokens as necessary.

  2. Crawl Settings

    • Crawl Sub-domain: Enable this option if you want to include subdomains linked to the main URL during the crawl.

    • Javascript Enabled for the Website: Toggle this setting to enable or disable JavaScript rendering on the website during the crawl.

  3. Include Path

    • Inclusion Pattern: Specify the path or pattern of URLs you want to include in the crawl. This could be specific folders or file types you want to prioritize (see the filtering sketch after this list).

  4. Exclude Path

    • Exclusion Pattern: Specify the paths or patterns of URLs you want to exclude from the crawl. This helps filter out irrelevant content.

  5. Use Sitemap.xml

    • Enable this option to use the sitemap.xml file from the website to guide the crawl process. This file provides a list of URLs for efficient crawling.

  6. Use robots.txt

    • Enable this option to respect the website’s robots.txt file, which controls web crawlers’ access to various parts of the website.
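
The filtering sketch below is a minimal illustration, not the Dataworkz implementation, of how these settings typically combine: robots.txt is consulted first, then exclusion patterns, then inclusion patterns. It uses only the Python standard library; the glob-style pattern syntax and the docs.example.com URLs are assumptions for illustration.

```python
# Minimal sketch (not Dataworkz internals) of URL filtering during a crawl.
# Glob-style patterns and the example URLs are illustrative assumptions.
from fnmatch import fnmatch
from urllib import robotparser

INCLUDE = ["https://docs.example.com/guides/*"]   # hypothetical Inclusion Pattern
EXCLUDE = ["*/archive/*", "*.zip"]                # hypothetical Exclusion Pattern

robots = robotparser.RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])  # inline stand-in for the site's robots.txt

def should_crawl(url: str) -> bool:
    if not robots.can_fetch("*", url):             # honor "Use robots.txt"
        return False
    if any(fnmatch(url, pat) for pat in EXCLUDE):  # Exclude Path takes precedence
        return False
    # When inclusion patterns are set, the URL must match at least one of them
    return any(fnmatch(url, pat) for pat in INCLUDE) if INCLUDE else True

print(should_crawl("https://docs.example.com/guides/setup.html"))        # True
print(should_crawl("https://docs.example.com/guides/archive/old.html"))  # False
```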


Advanced Settings

These advanced options give you more control over the crawling and processing behavior of the ingestion.

  1. Crawler Settings

    • Max Crawling Depth: Set the maximum depth level for the crawl. This controls how many layers deep the crawl will go from the starting URL (see the crawl-loop sketch after this list).

    • Max Pages: Define the maximum number of pages the crawler will process.

    • Max Concurrency: Set the number of simultaneous connections or threads the crawler can use. Higher values can speed up the crawling process, but may increase load on the server.

    • Browser Caching: Enable caching to store assets locally during the crawl, making subsequent crawls faster.

  2. Rate Limiting and Dynamic Content Handling

    • Enable Rate Limiting: When enabled, the crawler throttles how quickly it makes requests to avoid overloading the target server.

    • Wait for Dynamic Content (Seconds): Specify the time (in seconds) the crawler should wait for dynamic content (e.g., JavaScript-rendered elements) to load before processing the page.

  3. HTML Processing Settings

    • Remove HTML Element(s): Specify which HTML elements (such as ads, headers, footers) you want to remove during the crawl to focus only on relevant content (see the cleanup sketch at the end of this page).

    • Expand Clickable Elements: Enable this option to automatically click and expand elements such as dropdowns or "Read More" links before ingestion.

  4. Additional Content Handling

    • Remove Cookie Warning: Enable this setting to automatically remove cookie consent banners or pop-ups from the pages during the crawl.

    • Save PDF/DOCX Files: Enable this option if you want to save any PDF or DOCX files found during the crawl for later processing.
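
To make these knobs concrete, here is a minimal breadth-first crawl loop, a sketch rather than the product's actual crawler, showing where Max Crawling Depth, Max Pages, and a rate-limit delay each take effect. It assumes the third-party requests and beautifulsoup4 packages; Max Concurrency (a thread pool) and the dynamic-content wait (a headless browser) are omitted to keep the sketch short.

```python
# Minimal sketch (not Dataworkz internals) of a depth- and page-limited crawl.
# The limits and the delay value are illustrative assumptions.
import time
from collections import deque
from urllib.parse import urljoin

import requests                # third-party: pip install requests
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

MAX_DEPTH = 2        # "Max Crawling Depth": link levels from the starting URL
MAX_PAGES = 50       # "Max Pages": total page budget for the crawl
REQUEST_DELAY = 1.0  # crude stand-in for "Enable Rate Limiting" (seconds)

def crawl(start_url: str) -> list[str]:
    seen, fetched = {start_url}, []
    queue = deque([(start_url, 0)])             # (url, depth from start)
    while queue and len(fetched) < MAX_PAGES:   # stop at the page budget
        url, depth = queue.popleft()
        time.sleep(REQUEST_DELAY)               # pace requests to the server
        html = requests.get(url, timeout=10).text
        fetched.append(url)
        if depth >= MAX_DEPTH:                  # don't expand past the depth limit
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])      # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched
```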


By using these settings, you can fine-tune your web content ingestion process, ensuring you capture the right data for chunking, embedding, or further analysis.
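
Finally, as a rough illustration of the HTML processing settings above, the sketch below strips unwanted elements and cookie banners before extracting text; the tag list and CSS selectors are assumptions, not the settings' actual defaults.

```python
# Minimal sketch (not Dataworkz internals) of "Remove HTML Element(s)"
# and "Remove Cookie Warning". Tag names and selectors are assumptions.
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

REMOVE_TAGS = ["header", "footer", "nav", "script", "style"]
COOKIE_SELECTORS = ["#cookie-banner", ".cookie-consent"]  # hypothetical selectors

def clean_page(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(REMOVE_TAGS):       # drop boilerplate elements wholesale
        tag.decompose()
    for selector in COOKIE_SELECTORS:   # drop cookie consent pop-ups
        for node in soup.select(selector):
            node.decompose()
    # The cleaned text is what would feed chunking and embedding downstream
    return soup.get_text(separator="\n", strip=True)
```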
