# Diffbot

Diffbot is an AI-powered web scraping and data extraction platform. Unlike traditional scrapers, it uses machine learning and computer vision to understand web page layouts and automatically transform unstructured content (such as articles, products, or profiles) into structured, machine-readable data. Its crawler visits sites to extract this data for clients, who use Diffbot's API for applications like market research, content aggregation, and building knowledge graphs.

## What is Diffbot?

Diffbot is an AI-powered platform that specializes in web scraping and knowledge extraction. Using machine learning and computer vision, its crawler can parse and understand web content much as a human reader does. Instead of relying on predefined rules, Diffbot's technology visually recognizes common page components (e.g., articles, product listings) and automatically extracts structured data from them. It identifies itself in server logs with the user-agent string `Diffbot`. Its key feature is the ability to convert unstructured web pages into clean, organized data for analysis.

## Why is Diffbot crawling my site?

Diffbot crawls your website to extract and structure data for one of its clients. These clients use Diffbot's API for a variety of purposes, including market research, competitive intelligence, and application development. The bot targets specific types of content, such as product information, news articles, business profiles, or job listings. The frequency of its visits depends on client demand and on how often your content is updated. While Diffbot is a legitimate commercial service, its crawling is not always explicitly authorized by the website owner, though it aims to follow standard crawling protocols.

## What is the purpose of Diffbot?

The core purpose of Diffbot is to transform the unstructured content of the web into structured, machine-readable data at scale. Its clients use this data for a range of applications, including competitive intelligence, product monitoring, content aggregation, and powering AI applications. While website owners do not directly benefit from being crawled, Diffbot's technology contributes to the broader web ecosystem by making information more programmatically accessible and usable. The company aims to maintain reasonable crawl rates to minimize the impact on server performance.

## How do I block Diffbot?

To prevent Diffbot from scraping your website, add a disallow rule to your `robots.txt` file. This is the standard method for managing access by legitimate web crawlers. Add the following lines to your `robots.txt` file to block Diffbot:

```
User-agent: Diffbot
Disallow: /
```
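
A `robots.txt` rule is advisory: it works because well-behaved crawlers such as Diffbot choose to honor it. If you want to enforce the block yourself, you can also reject matching requests at the web server or application layer. The sketch below is illustrative only and not taken from Diffbot's documentation; it assumes a Python WSGI application and simply rejects any request whose `User-Agent` header contains the substring `Diffbot`, so adapt the match and the response to your own stack.

```python
# Minimal sketch: WSGI middleware that returns 403 for requests whose
# User-Agent contains "Diffbot". The substring match, port, and demo app
# are assumptions for illustration, not part of Diffbot's documentation.
from wsgiref.simple_server import make_server

BLOCKED_AGENTS = ("Diffbot",)  # user-agent substrings to reject, case-insensitive


def block_crawlers(app):
    """Wrap a WSGI app and refuse requests from blocked user agents."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(name.lower() in user_agent.lower() for name in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Placeholder application standing in for your real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human visitors!"]


if __name__ == "__main__":
    with make_server("", 8000, block_crawlers(hello_app)) as server:
        server.serve_forever()
```

Most web servers and CDNs (nginx, Apache, Cloudflare, and others) offer equivalent user-agent rules if you prefer to block at that layer rather than in application code.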