This robot pulls all page URLs and metadata (such as the last modified date) from any valid sitemap file containing `<urlset>` tags.
A Sitemap URL Set is a type of XML file that websites use to list their publicly accessible pages. These files often end in `.xml` or `.xml.gz` and are typically referenced in the site's `robots.txt`.
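If you want to locate a site's sitemaps yourself, the `robots.txt` convention makes that easy to script. Below is a minimal Python sketch using only the standard library; `https://example.com` is a placeholder, and `find_sitemaps` is an illustrative helper, not part of the robot:

```python
# Sketch: discover sitemap URLs declared in a site's robots.txt.
from urllib.request import urlopen

def find_sitemaps(base_url: str) -> list[str]:
    """Return every URL declared on a 'Sitemap:' line in robots.txt."""
    with urlopen(f"{base_url}/robots.txt") as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return [
        line.split(":", 1)[1].strip()
        for line in text.splitlines()
        if line.lower().startswith("sitemap:")
    ]

print(find_sitemaps("https://example.com"))  # e.g. ['https://example.com/sitemap.xml']
```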
Example structure:

```xml
<urlset>
  <url>
    <loc>https://example.com/product/123</loc>
    <lastmod>2025-07-25</lastmod>
  </url>
</urlset>
```
This robot extracts every `<loc>` (URL) and, if available, the `<lastmod>` (last modified date).
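For reference, the extraction itself is straightforward to reproduce. The sketch below is a standard-library Python approximation, not the robot's actual implementation; `parse_sitemap` and `local` are illustrative names:

```python
# Sketch: pull (loc, lastmod) pairs out of a <urlset> sitemap.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

def local(tag: str) -> str:
    """Drop any XML namespace, e.g. '{http://...}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]

def parse_sitemap(url: str) -> list[tuple[str, str | None]]:
    """Return (loc, lastmod) for every <url> entry; lastmod may be None."""
    with urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    pages = []
    for el in root.iter():
        if local(el.tag) != "url":
            continue
        loc = lastmod = None
        for child in el:
            if local(child.tag) == "loc":
                loc = (child.text or "").strip()
            elif local(child.tag) == "lastmod":
                lastmod = (child.text or "").strip()
        if loc:
            pages.append((loc, lastmod))
    return pages
```

Matching on local tag names keeps the sketch working whether or not the sitemap declares the usual `http://www.sitemaps.org/schemas/sitemap/0.9` namespace, which the simplified example above omits.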
From any valid sitemap file using `<urlset>` and `<url>` tags, the robot retrieves:

- `<loc>` (page URLs)
- `<lastmod>` (last modified dates, if present)

📈 Website change monitoring
Automatically detect when new pages are added or existing ones are updated—without crawling the full site.
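One way to implement this pattern is to keep the previous run's results and diff them against the current run. The sketch below assumes `{url: lastmod}` dicts, such as those built from the `parse_sitemap` sketch earlier:

```python
# Sketch: diff two sitemap snapshots to find new, updated, and removed pages.
def diff_snapshots(old: dict[str, str | None], new: dict[str, str | None]):
    added = [u for u in new if u not in old]
    updated = [u for u in new if u in old and new[u] != old[u]]
    removed = [u for u in old if u not in new]
    return added, updated, removed

old = {"https://example.com/a": "2025-07-01"}
new = {"https://example.com/a": "2025-07-25", "https://example.com/b": "2025-07-25"}
print(diff_snapshots(old, new))
# (['https://example.com/b'], ['https://example.com/a'], [])
```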
🔍 SEO audits & indexing checks
Export all indexed URLs to review crawlability, detect broken links, or clean up outdated content.
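A broken-link pass over an exported URL list can be as simple as the sketch below, which sends a `HEAD` request per URL (note that some servers reject `HEAD`, so a `GET` fallback may be needed in practice):

```python
# Sketch: flag exported URLs that no longer answer with HTTP 200.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_urls(urls: list[str]) -> list[tuple[str, object]]:
    """Return (url, status or error) for every URL that didn't return 200."""
    broken = []
    for url in urls:
        try:
            with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
                if resp.status == 200:
                    continue
                status = resp.status
        except HTTPError as e:
            status = e.code          # e.g. 404, 410
        except URLError as e:
            status = str(e.reason)   # e.g. DNS failure, timeout
        broken.append((url, status))
    return broken
```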
🕵️ Competitive research
Track which product, blog, or landing pages your competitors are publishing or updating over time.
🧠 Content planning
Map out published URLs and update patterns for your editorial calendar or content migration projects.
Supported formats: `.xml` and `.xml.gz`. Any sitemap in either format that uses `<urlset>` and `<url>` tags will work.
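Because gzip files start with a fixed two-byte signature, handling both formats in code takes only a small check. A minimal sketch (the helper name is illustrative):

```python
# Sketch: read a sitemap transparently, whether plain .xml or gzipped .xml.gz.
import gzip
from urllib.request import urlopen

def read_sitemap_bytes(url: str) -> bytes:
    """Fetch a sitemap and decompress it if it is a gzip file."""
    with urlopen(url) as resp:
        data = resp.read()
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes
        data = gzip.decompress(data)
    return data
```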
**Does it work on any website?** Yes, as long as the sitemap is publicly accessible, you can extract from it.

**Can I get notified when new URLs appear?** Absolutely. Use monitoring + Zapier to get notifications or auto-send new URLs to your internal tools.

**Can I scrape the pages themselves?** Yes! Use this output as input for deeper scraping robots to extract content from each page.