Node Inputs

  • URL

    • Type: text
    • Placeholder: Ex. https://www.gumloop.com/
    • Description: Enter the website URL you want to crawl. Note: JavaScript-generated content is not yet supported.
    • Usage: This is the starting point for the web crawl. It should be the full address of the webpage you wish to crawl.
  • Depth

    • Type: number
    • Placeholder: 1
    • Description: The number of layers to traverse in the crawl. The default is 1, which crawls only the links on the initial page. A depth of 2 also crawls the pages those links point to, and the maximum depth of 3 adds a third layer of linked pages.
    • Usage: Use this to control how deeply the node crawls. Higher depth values return more links but take longer and consume more resources.

Node Output

  • URL List
    • Type: List of text
    • Description: A list of every URL discovered during the crawl, starting from the initial URL provided.

Node Functionality

This node operates as a web crawler. When triggered, it starts from the specified initial URL and collects every URL linked on that page. If the depth is greater than 1, the node follows those links and collects new URLs from the connected pages, up to the specified depth.
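The behavior described above can be sketched as a depth-limited breadth-first traversal. This is not Gumloop's actual implementation, just a minimal illustration of the depth semantics; `get_links` and the example link graph are hypothetical stand-ins for fetching and parsing a real page:

```python
from collections import deque

def crawl(start_url, get_links, depth=1):
    """Depth-limited breadth-first crawl (illustrative sketch).

    `get_links(url)` returns the URLs linked from `url`; in a real
    crawler it would fetch the page and parse its anchor tags.
    depth=1 collects the start page's links; depth=2 follows those
    links and collects their links too, and so on.
    """
    seen = {start_url}
    found = [start_url]
    frontier = deque([(start_url, 0)])
    while frontier:
        url, layer = frontier.popleft()
        if layer >= depth:
            continue  # do not expand pages beyond the requested depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                found.append(link)
                frontier.append((link, layer + 1))
    return found

# Hypothetical link graph standing in for real pages.
links = {
    "https://example.com/":  ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
}
get = lambda u: links.get(u, [])

print(crawl("https://example.com/", get, depth=1))
# depth=1: the start page plus its direct links
print(crawl("https://example.com/", get, depth=2))
# depth=2: also reaches /c through /a
```

With depth=1 only the start page is expanded; raising the depth to 2 expands the pages found in the first layer as well, which is why higher depths take longer.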

When To Use

Use the Website Crawler node when you need to:

  • Gather a list of links present on a specific website.
  • Explore the structure of a website by identifying how different pages are interconnected.
  • Perform a shallow or deep audit of the links present on a website for SEO or general web analysis purposes.
  • Map out the link landscape of a competitor’s or your own website to understand what content is being referenced internally or externally.