HTML/SharePoint Ingestion
Normal Settings
These are the basic settings you'll need to configure for your HTML/SharePoint ingestion.
Webpage URL Settings
Webpage URL: Enter the URL of the specific webpage you want to ingest.
Protected URL: If the URL is behind authentication, provide the credentials or access tokens as necessary.
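Conceptually, fetching a protected URL just means attaching credentials to each request. Below is a minimal sketch using Python's requests library; the URL, header scheme, and token are placeholders, not the ingestion tool's actual mechanism.

```python
import requests

session = requests.Session()
# Placeholder credential: substitute whatever your site requires
# (bearer token, basic auth, session cookie, ...).
session.headers["Authorization"] = "Bearer <your-access-token>"

response = session.get("https://intranet.example.com/docs/page.html", timeout=30)
response.raise_for_status()  # fails loudly on 401/403 if the credential is wrong
html = response.text
```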
Crawl Settings
Crawl Sub-domain: Enable this option if you want to include subdomains linked to the main URL during the crawl.
JavaScript Enabled for the Website: Toggle this setting to enable or disable JavaScript rendering during the crawl.
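Whether a discovered link stays in scope depends on the sub-domain toggle. A minimal sketch of that check using only Python's standard library (the function and exact matching rule are illustrative, not the crawler's actual implementation):

```python
from urllib.parse import urlparse

def in_scope(link: str, root_url: str, crawl_subdomains: bool) -> bool:
    """Decide whether a discovered link belongs to this crawl."""
    root_host = urlparse(root_url).hostname or ""
    link_host = urlparse(link).hostname or ""
    if crawl_subdomains:
        # docs.example.com counts as part of example.com when enabled.
        return link_host == root_host or link_host.endswith("." + root_host)
    return link_host == root_host

in_scope("https://docs.example.com/a", "https://example.com", True)   # True
in_scope("https://docs.example.com/a", "https://example.com", False)  # False
```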
Include Path
Inclusion Pattern: Specify the paths or patterns of URLs you want to include in the crawl, for example specific folders or file types you want to prioritize (a combined inclusion/exclusion sketch follows the Exclude Path setting below).
Exclude Path
Exclusion Pattern: Specify the paths or patterns of URLs you want to exclude from the crawl. This helps filter out irrelevant content.
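Together, the two patterns act as a filter over every discovered URL: a URL is kept only if it matches an inclusion pattern and no exclusion pattern. A sketch with illustrative regular expressions (your real patterns will differ):

```python
import re

INCLUDE = [re.compile(r"/docs/"), re.compile(r"\.html?$")]         # illustrative
EXCLUDE = [re.compile(r"/archive/"), re.compile(r"\?sessionid=")]  # illustrative

def url_passes(url: str) -> bool:
    """Keep a URL only if it matches an include and no exclude pattern."""
    included = any(p.search(url) for p in INCLUDE)
    excluded = any(p.search(url) for p in EXCLUDE)
    return included and not excluded

url_passes("https://example.com/docs/intro.html")    # True
url_passes("https://example.com/archive/old.html")   # False (excluded)
```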
Use sitemap.xml
Enable this option to use the sitemap.xml file from the website to guide the crawl process. This file provides a list of URLs for efficient crawling.
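All sitemap.xml entries live in a standard XML namespace, so extracting the URL list is straightforward. A sketch using only the standard library:

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"  # standard sitemap namespace

def sitemap_urls(sitemap_url: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap.xml file."""
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    # Note: a sitemap *index* lists further sitemap files in its <loc>
    # entries, which would need to be fetched and parsed in turn.
    return [loc.text.strip() for loc in tree.iter(NS + "loc") if loc.text]

seeds = sitemap_urls("https://example.com/sitemap.xml")
```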
Use robots.txt
Enable this option to respect the website’s robots.txt file, which controls web crawlers’ access to various parts of the website.
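Python's standard library ships a parser for exactly this; here is a sketch of the check a polite crawler performs before each fetch (the user-agent string is a placeholder):

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the file

# Skip any URL the site disallows for our user agent.
if robots.can_fetch("MyIngestionBot/1.0", "https://example.com/private/report"):
    pass  # safe to crawl this URL
```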
Advanced Settings
These advanced options give you more control over the crawling and processing behavior of the ingestion.
Crawler Settings
Max Crawling Depth: Set the maximum depth level for the crawl. This controls how many layers deep the crawl will go from the starting URL.
Max Pages: Define the maximum number of pages the crawler will process.
Max Concurrency: Set the number of simultaneous connections or threads the crawler can use. Higher values can speed up the crawl but may increase load on the target server.
Browser Caching: Enable caching to store assets locally during the crawl, making subsequent crawls faster.
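The sketch below shows how these limits might fit together in a breadth-first crawl: depth bounds how many link levels are expanded, the page budget caps total fetches, and a thread pool enforces the concurrency ceiling. `fetch` and `extract_links` are hypothetical callables standing in for the crawler's internals.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(start_url, fetch, extract_links,
          max_depth=3, max_pages=100, max_concurrency=5):
    """Breadth-first crawl bounded by depth, page count, and concurrency."""
    pages = {}              # url -> html for every page actually fetched
    seen = {start_url}
    frontier = [start_url]  # all URLs at the current depth
    depth = 0
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        while frontier and depth <= max_depth and len(pages) < max_pages:
            # Max Pages: trim the level so the budget is never exceeded.
            frontier = frontier[: max_pages - len(pages)]
            # Max Concurrency: at most max_concurrency fetches run at once.
            level = list(zip(frontier, pool.map(fetch, frontier)))
            pages.update(level)
            # Max Crawling Depth: expand links one more level, if allowed.
            next_frontier = []
            for _, html in level:
                for link in extract_links(html):
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
            depth += 1
    return pages
```

Browser Caching sits underneath `fetch`: with caching enabled, unchanged assets are served locally on repeat crawls instead of being re-downloaded.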
Rate Limiting and Dynamic Content Handling
Enable Rate Limiting: If enabled, this setting controls how quickly the crawler makes requests to avoid overloading the target server.
Wait for Dynamic Content (Seconds): Specify the time (in seconds) the crawler should wait for dynamic content (e.g., JavaScript-rendered elements) to load before processing the page.
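A sketch of both behaviors together, using Playwright as a stand-in headless browser (the actual renderer behind these settings isn't specified here, and `min_interval` and the function are illustrative):

```python
import time
from playwright.sync_api import sync_playwright

def fetch_rendered(urls, min_interval=1.0, wait_seconds=3):
    """Fetch JavaScript-rendered pages with a simple request-interval rate limit."""
    results = {}
    last_request = 0.0
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for url in urls:
            # Rate limiting: keep at least min_interval seconds between requests.
            elapsed = time.monotonic() - last_request
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            last_request = time.monotonic()
            page.goto(url)
            # Wait for Dynamic Content: give JS-rendered elements time to load.
            page.wait_for_timeout(wait_seconds * 1000)
            results[url] = page.content()
        browser.close()
    return results
```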
HTML Processing Settings
Remove HTML Element(s): Specify which HTML elements (such as ads, headers, footers) you want to remove during the crawl to focus only on relevant content.
Expand Clickable Elements: Enable this option to automatically click and expand elements such as dropdowns or "Read More" links before ingestion.
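Element removal is a post-processing step over the fetched HTML; expanding clickable elements, by contrast, has to happen in the browser before the HTML is captured (e.g. clicking "Read More" links in a headless session). A sketch of the removal step using BeautifulSoup, with illustrative selectors:

```python
from bs4 import BeautifulSoup

# Illustrative selectors: configure whatever matches your site's
# ads, headers, footers, and other noise.
REMOVE_SELECTORS = ["header", "footer", "nav", ".ad-banner"]

def strip_elements(html: str) -> str:
    """Drop unwanted elements so only relevant content is ingested."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in REMOVE_SELECTORS:
        for element in soup.select(selector):
            element.decompose()  # removes the element and all its children
    return str(soup)
```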
Additional Content Handling
Remove Cookie Warning: Enable this setting to automatically remove cookie consent banners or pop-ups from the pages during the crawl.
Save PDF/DOCX Files: Enable this option if you want to save any PDF or DOCX files found during the crawl for later processing.
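Cookie-banner removal works like the element stripping shown above, just targeted at consent pop-ups. Saving documents is a matter of recognizing file links and writing the bytes to disk; a sketch that keys off the URL extension (the extension list and output directory are illustrative):

```python
import os
import urllib.request
from urllib.parse import urlparse

SAVE_EXTENSIONS = (".pdf", ".docx")  # illustrative; extend as needed

def maybe_save_document(url: str, out_dir: str = "downloads"):
    """Download a linked file if it looks like a PDF or DOCX."""
    path = urlparse(url).path
    if not path.lower().endswith(SAVE_EXTENSIONS):
        return None  # not a document link; let the HTML pipeline handle it
    os.makedirs(out_dir, exist_ok=True)
    dest = os.path.join(out_dir, os.path.basename(path))
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as fh:
        fh.write(resp.read())
    return dest  # saved for later processing
```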
By using these settings, you can fine-tune your web content ingestion process, ensuring you capture the right data for chunking, embedding, or further analysis.