
Direct Your Agent to the Content it Needs to Know

OneAI · Mar 26, 2024 · 3 min read

Agents can process and understand information from PDFs, text files, and websites to build their knowledge base. Once a document is successfully uploaded, the agent can answer questions based on the content of that document.

To add more data to the agent, go to the Studio and select the agent you want to add data to, or create a new agent. Then click the Knowledge tab in the left menu and click the 'Add New Document' button to open the Add Data window.

To add knowledge items to the agent from a website, and to get the best out of it, there are a handful of important configuration parameters to understand:

Scraping a page

This involves extracting the textual content from a given URL and incorporating it into the agent's knowledge base. It's important to note that text depicted in images, rather than as selectable text, cannot be scraped.
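Conceptually, scraping boils down to fetching the URL and keeping only its visible text. The snippet below is a minimal sketch of that idea using `requests` and `BeautifulSoup`; it is not the agent's actual scraper, and because text rendered inside images never appears in the HTML, it never reaches the output either.

```python
# Minimal sketch of page scraping: fetch a URL and keep only its visible text.
# Illustration of the concept, not the agent's internal implementation.
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style tags so only human-readable text remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Text rendered inside images is not in the HTML, so it can never appear here.
    return " ".join(soup.get_text(separator=" ").split())

print(scrape_page("https://en.wikipedia.org/wiki/Web_scraping")[:300])
```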

Crawling a page

This process entails gathering all the links present on a page so that the linked pages can be scraped in turn.
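A rough sketch of that link-gathering step, again using `requests` and `BeautifulSoup` purely for illustration:

```python
# Sketch of the link-gathering step of a crawl: collect every absolute URL
# a page points to. Illustrative only.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def gather_links(url: str) -> set[str]:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        absolute = urljoin(url, a["href"])  # resolve relative links against the page URL
        if absolute.startswith(("http://", "https://")):  # skip mailto:, javascript:, etc.
            links.add(absolute)
    return links
```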

Page limit

Specifies the total number of pages you wish to scrape.

Maximum link depth

Determines how deep the crawl goes: the initial page is depth 1, links found on that page are depth 2, and so on.
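Together, the page limit and the maximum link depth are what keep a crawl bounded. The sketch below shows how the two limits might interact; the parameter names mirror the settings above, but the code is only an illustration and reuses the `gather_links()` sketch from earlier.

```python
# Sketch of a crawl bounded by both a page limit and a maximum link depth.
# Depth 1 is the initial page, depth 2 is anything it links to, and so on.
from collections import deque

def crawl(start_url: str, page_limit: int = 50, max_link_depth: int = 2) -> list[str]:
    visited, order = set(), []
    queue = deque([(start_url, 1)])           # (url, depth); the start page is depth 1
    while queue and len(order) < page_limit:  # stop once the page limit is reached
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)                     # this page would be scraped here
        if depth < max_link_depth:            # only follow links while under the depth cap
            for link in gather_links(url):    # gather_links() from the sketch above
                queue.append((link, depth + 1))
    return order
```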

Only Crawl Domain

Restricts the crawl to links that are within the same domain as the initial URL.
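In practice this is typically just a hostname comparison between each discovered link and the starting URL, roughly:

```python
# Sketch of the "Only Crawl Domain" check: keep a link only if its hostname
# matches the hostname of the initial URL.
from urllib.parse import urlparse

def same_domain(link: str, start_url: str) -> bool:
    return urlparse(link).netloc == urlparse(start_url).netloc

print(same_domain("https://www.mywebsite.com/about", "https://www.mywebsite.com/"))  # True
print(same_domain("https://en.wikipedia.org/wiki/A", "https://www.mywebsite.com/"))  # False
```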

Crawl

Decides whether the page should be crawled. If this option is not selected, the agent will only scrape the page.

Crawler Mode

Selects the method by which page content is extracted (options include direct, proxy, or render).
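The three modes differ mainly in how the HTML is fetched: a plain HTTP request, the same request routed through a proxy, or a full browser render for JavaScript-heavy pages. The sketch below only illustrates that distinction; the proxy address and the use of Playwright for rendering are assumptions for the example, not details of the product.

```python
# Rough illustration of what the three crawler modes amount to.
# The proxy URL and the use of Playwright are assumptions for this sketch only.
import requests

def fetch(url: str, mode: str = "direct") -> str:
    if mode == "direct":
        return requests.get(url, timeout=10).text
    if mode == "proxy":
        proxies = {"https": "http://my-proxy.example:8080"}  # hypothetical proxy address
        return requests.get(url, timeout=10, proxies=proxies).text
    if mode == "render":
        # JavaScript-heavy pages need a real browser, e.g. via Playwright.
        from playwright.sync_api import sync_playwright
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            html = page.content()
            browser.close()
        return html
    raise ValueError(f"unknown crawler mode: {mode}")
```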

HTML Extraction Mode

Determines how the text content is pulled from the HTML.
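For example, one mode might keep every piece of visible text while another keeps only the main article body. The sketch below illustrates that distinction; the mode names are made up for the example and are not the product's actual options.

```python
# Illustration of two ways text could be pulled from the same HTML.
# The mode names are invented for this sketch; they are not the product's options.
from bs4 import BeautifulSoup

def extract_text(html: str, mode: str = "full") -> str:
    soup = BeautifulSoup(html, "html.parser")
    if mode == "main":
        # Keep only the main/article region if the page marks one up.
        main = soup.find("main") or soup.find("article") or soup
        return " ".join(main.get_text(separator=" ").split())
    # "full": everything that renders as text, navigation and footers included.
    return " ".join(soup.get_text(separator=" ").split())
```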

Blocklist

This option lets you exclude pages whose URLs match a certain pattern from scraping.

You can write the pattern as a Wildcard expression or a Regex expression; any link that matches it will not be scraped.
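Both styles describe the same thing, a URL pattern to skip. Since wildcards can be translated into regexes, a blocklist check might look roughly like this:

```python
# Sketch of a blocklist check: a URL is skipped if it matches any entry.
# Wildcard entries are matched with fnmatch; regex entries are used as-is.
import fnmatch
import re

def is_blocked(url: str, wildcard_rules: list[str], regex_rules: list[str]) -> bool:
    wildcard_hit = any(fnmatch.fnmatch(url, rule) for rule in wildcard_rules)
    regex_hit = any(re.search(rule, url) for rule in regex_rules)
    return wildcard_hit or regex_hit

print(is_blocked("https://www.mywebsite.com/blog/post-1",
                 wildcard_rules=["https://www.mywebsite.com/blog/*"],
                 regex_rules=[]))  # True
```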

Understanding and configuring these parameters correctly can greatly enhance the effectiveness of the content extraction process.

Let's break down a few scenarios:

Scenario 1

Add a single page like a Wikipedia article to the agent's knowledge without crawling.

Actions

Add the specific article URL directly to the agent's knowledge.  

Ensure the crawl option is unchecked.

Scenario 2

Add a specific article and all referenced links mentioned in the article, even if they are from different domains.

Actions

Add the specific article URL to the agent's knowledge.  

Check the crawl option and set the maximum link depth to 2.  

Ensure the "Only scan domain" option is unchecked.

Scenario 3

Exclude pages from the /blog section on the website from scraping.

Actions

Add the wildcard expression `https://www.mywebsite.com/blog/*` to the blocklist.

Alternatively, use the regex expression `^https://www\.mywebsite\.com/blog/.*` in the blocklist.
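Either pattern can be sanity-checked against a sample URL before it is added, for instance:

```python
# Quick check that both Scenario 3 patterns match a /blog URL.
import fnmatch
import re

url = "https://www.mywebsite.com/blog/some-post"
print(fnmatch.fnmatch(url, "https://www.mywebsite.com/blog/*"))      # True (wildcard form)
print(bool(re.match(r"^https://www\.mywebsite\.com/blog/.*", url)))  # True (regex form)
```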

Scenario 4

Scrape all website pages except those from the /blog section.

Actions

Add the wildcard expression `!https://www.mywebsite.com/blog/*` to the blocklist.  

Alternatively, use the regex expression `^(?!https://www\.mywebsite\.com/blog/).*$` in the blocklist.
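The regex form relies on a negative lookahead: it matches every URL that does not start with the /blog prefix, for example:

```python
# The Scenario 4 regex matches any URL that does NOT start with the /blog prefix.
import re

pattern = re.compile(r"^(?!https://www\.mywebsite\.com/blog/).*$")
print(bool(pattern.match("https://www.mywebsite.com/about")))        # True  (not under /blog)
print(bool(pattern.match("https://www.mywebsite.com/blog/post-1")))  # False (under /blog)
```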
