Node Inputs

  • crawl_url: The website URL where the crawl begins. Only pages on this domain or its subdomains will be crawled (an example configuration combining these inputs follows this list).
  • crawl_instructions: Optionally describe the type of content you are interested in or provide keywords/filters. For example, “Crawl only blogs,” “Crawl only in Spanish,” or “Crawl tech news articles”.
  • max_num_records: Optionally set the maximum number of records to return. The default is 10, and you can request up to 100.
  • outputs: Choose the information you want from the crawl. You can select from URLs, scraped content, descriptions, tags, titles, and the creation or last-updated date for each website.
  • use_cache: Optionally decide to use cached results from previous runs with the same parameters to save time and resources.
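
The sketch below pulls these inputs together into one example configuration. The parameter names come from the list above; the dictionary shape and the sample values are only an illustration, not the platform's exact invocation syntax.

```python
# Hypothetical input configuration for the Advanced Website Crawler node.
# Parameter names match the inputs documented above; the dict shape and the
# example values are assumptions for illustration only.
crawler_inputs = {
    "crawl_url": "https://example.com/blog",   # crawl stays on this domain and its subdomains
    "crawl_instructions": "Crawl only blog posts about machine learning",
    "max_num_records": 25,                     # default is 10, maximum is 100
    "outputs": ["URLs", "Titles", "Scraped Content", "Dates"],
    "use_cache": True,                         # reuse cached results from identical previous runs
}
```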

Node Output

  • URLs: A list of URLs that the crawler found based on your instructions.
  • Scraped Content: The text content gathered from each of the crawled websites.
  • Descriptions: A summary description for each of the crawled websites.
  • Tags: Keywords indicating the type of content found on each website.
  • Titles: The title of each website that has been crawled.
  • Dates: The date when each website was created or last updated (see the example record shape after this list).
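
As a rough illustration, the selected outputs can be thought of as lining up per crawled page. The field names below follow the list above; the list-of-records structure is an assumption, not the node's documented return format.

```python
# Hypothetical shape of the node's combined output: one record per crawled
# page, containing only the outputs you selected. The structure is a sketch,
# not the platform's actual return format.
example_results = [
    {
        "url": "https://example.com/blog/intro-to-crawling",
        "title": "An Introduction to Web Crawling",
        "description": "A beginner-friendly overview of how web crawlers work.",
        "tags": ["web crawling", "tutorial"],
        "scraped_content": "Web crawlers visit pages by following links...",
        "date": "2024-03-18",  # created or last-updated date
    },
    # ...up to max_num_records records
]
```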

Node Functionality

The Advanced Website Crawler automates collecting and scraping web pages. Given a starting URL along with specific instructions and criteria, the node recursively crawls the site, following links and scraping content until it reaches the specified record limit.
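
The node's internal implementation is not exposed, but the following minimal Python sketch shows the kind of bounded, same-domain, breadth-first crawl it automates, using the third-party requests and beautifulsoup4 packages. Treat it as an illustration of the crawl-and-scrape loop rather than the node's actual code.

```python
# Minimal sketch of a bounded, same-domain crawl similar in spirit to what
# this node automates. Purely illustrative; the node's real implementation
# is not documented here.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url: str, max_num_records: int = 10) -> list[dict]:
    domain = urlparse(start_url).netloc
    queue, seen, records = deque([start_url]), {start_url}, []

    while queue and len(records) < max_num_records:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip unreachable or failing pages

        soup = BeautifulSoup(response.text, "html.parser")
        records.append({
            "url": url,
            "title": soup.title.string.strip() if soup.title and soup.title.string else "",
            "scraped_content": soup.get_text(" ", strip=True),
        })

        # Follow only links that stay on the starting domain or its subdomains.
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            host = urlparse(next_url).netloc
            if (host == domain or host.endswith("." + domain)) and next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)

    return records
```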

The node uses AI to filter and fetch only the content that matches your defined criteria, making it a powerful tool for collecting and analyzing data from the web. It handles the complexities of web navigation and data extraction, so you get usable data without having to manually search for and collate it.
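
The AI filtering itself happens inside the node, so the snippet below is only a crude, keyword-based stand-in. It shows where instruction-driven filtering fits into the pipeline (crawl first, then keep records matching the criteria), not how the node's AI actually scores relevance.

```python
# Crude stand-in for the node's AI filtering: keep only records whose text
# mentions words from the free-text instructions. The real node uses AI for
# this step; the function and sample data here are purely illustrative.
def filter_records(records: list[dict], crawl_instructions: str, max_num_records: int = 10) -> list[dict]:
    keywords = [word.lower() for word in crawl_instructions.split() if len(word) > 3]
    matching = [
        record for record in records
        if any(kw in record.get("scraped_content", "").lower() for kw in keywords)
    ]
    return matching[:max_num_records]


sample_records = [
    {"url": "https://example.com/gpu-roundup", "scraped_content": "This week's tech news covers new GPU launches..."},
    {"url": "https://example.com/summer-recipes", "scraped_content": "Five easy recipes for warm evenings..."},
]

# Keeps only the first record, whose text matches the "tech news" instruction.
selected = filter_records(sample_records, "Crawl tech news articles")
```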

When To Use

The Advanced Website Crawler node is well suited to scenarios where you need to collect information from websites without manual intervention. Use this node when:

  • You need to gather articles, blogs, or specific content from a website or a set of websites.
  • You’re conducting market research or competitive analysis and need up-to-date content from industry-related websites.
  • You are building datasets for machine learning models that require large amounts of data from the web.
  • You wish to monitor certain websites for new content or updates regularly.

This node simplifies web scraping by giving you a straightforward way to specify what type of content you are interested in and how much of it you need. Whether for research, data analysis, or monitoring, the Advanced Website Crawler streamlines your data collection from the web.