Are you accidentally slamming the door on helpful AI visitors while trying to keep your website’s content safe from being scraped for training data?
Many site owners block bots to protect their intellectual property, but in doing so, they might be turning away the “good” AI traffic—like search engines and assistants that drive real visitors your way. Let’s break it down so you can decide wisely.
Key Distinctions in AI Bots
- Training Data Scrapers: These bots systematically crawl websites to collect vast amounts of text, images, and other data primarily for training large language models (LLMs). They operate at scale, often without user-specific triggers, and raise concerns about copyright and server load.
- Agentic AI Bots: These are autonomous systems that plan, reason, and execute multi-step tasks, such as booking appointments or troubleshooting issues, often integrating tools like APIs or browsers. They emphasize goal-oriented actions over passive data gathering.
This compilation draws on sources current as of November 2025. The AI landscape evolves rapidly and new bots emerge frequently, so the list is not exhaustive, but it covers over 30 prominent examples across both categories.
Prominent Training Data Scrapers
These bots are designed for bulk data acquisition to fuel AI model development. Common user agents help site owners block them via robots.txt.
| Bot Name | Developer/Organization | Primary Purpose | Example User Agent |
|---|---|---|---|
| GPTBot | OpenAI | Crawls for ChatGPT training data | GPTBot/1.1 |
| ClaudeBot | Anthropic | Collects data for Claude models | ClaudeBot/1.0 |
| Google-Extended | Google | Gathers extended web data for AI enhancements | Google-Extended |
| Amazonbot | Amazon | Supports AWS AI services and model training | Amazonbot |
| Applebot-Extended | Apple | Collects data for Apple Intelligence features | Applebot-Extended |
| Bytespider | ByteDance (TikTok) | Data for recommendation and generative AI | Bytespider |
| CCBot | Common Crawl | Open dataset for AI research and training | CCBot |
| Diffbot | Diffbot | Structured data extraction for AI datasets | Diffbot |
| cohere-ai | Cohere | Builds datasets for enterprise AI models | cohere-ai |
| PerplexityBot | Perplexity | Indexes web for AI search and training | PerplexityBot/1.0 |
| OAI-SearchBot | OpenAI | Indexes pages for ChatGPT search (not used for model training) | OAI-SearchBot |
| AI2Bot | Allen Institute for AI | Academic AI research data collection | AI2Bot |
| YouBot | You.com | Data for personalized AI search engines | YouBot |
| Mistral Bot | Mistral AI | Training open-source LLMs | MistralAI-User |
| PetalBot | Huawei | Data for Huawei’s AI ecosystem | PetalBot |
| ImagesiftBot | Imagesift | Image-focused scraping for visual AI | ImagesiftBot |
| Omgili Bot | Webz.io (Omgili) | Consumer insights data for AI analytics | Omgili |
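As a minimal sketch of the robots.txt blocking mentioned above, the snippet below builds one `Disallow` group per user-agent token and leaves every other visitor allowed. The function name `build_robots_txt` and the choice of five tokens are illustrative assumptions, not a recommended policy; adjust the list to the bots you actually want to turn away.

```python
def build_robots_txt(blocked_agents, allow_everyone_else=True):
    """Return robots.txt text that disallows the given user-agent tokens."""
    lines = []
    for agent in blocked_agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        lines.append("")  # a blank line separates rule groups
    if allow_everyone_else:
        # Catch-all group: any agent not named above may crawl everything.
        lines.append("User-agent: *")
        lines.append("Allow: /")
        lines.append("")
    return "\n".join(lines)


# Illustrative selection of tokens taken from the table above.
TRAINING_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Bytespider"]

robots_txt = build_robots_txt(TRAINING_AGENTS)
print(robots_txt)
```

Keep in mind that this only expresses a request: a crawler that ignores robots.txt is unaffected by it.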
Notable Agentic AI Bots
These bots go beyond data collection, using reasoning to adapt and act independently. They often mimic human workflows but can introduce risks like unintended actions.
| Bot Name | Developer/Organization | Key Capabilities | Example Use Case |
|---|---|---|---|
| ChatGPT Agent | OpenAI | Autonomous web navigation, form filling | E-commerce purchases, research tasks |
| Claude Computer Use | Anthropic | Desktop interaction, multi-tool orchestration | Software troubleshooting, file management |
| Perplexity Comet | Perplexity | Goal-directed browsing and task execution | Travel booking, market analysis |
| Siri | Apple | Voice-activated task automation | Scheduling, smart home control |
| Google Assistant | Google | Proactive planning and API integration | Route optimization, reminders |
| Alexa | Amazon | Ecosystem-wide automation | Shopping lists, device control |
| Auto-GPT | Open-source (Significant Gravitas) | Self-prompting for complex goals | Code generation, content creation |
| BabyAGI | Open-source (Yohei Nakajima) | Task prioritization and execution loops | Project management simulations |
| Clara (formerly x.ai) | X.ai | Meeting scheduling and calendar management | Automated appointment booking |
| DeckardAgent | Deckard Protocol | On-chain verification and task execution | Crypto trading, reputation scoring |
| Delivery Hero Data Analyst | Delivery Hero | Predictive analytics and decision-making | Inventory forecasting |
| eBay RecSys Agent | eBay | Recommendation and personalization engine | Product suggestions in real-time |
| Uber Agentic RAG | Uber | Retrieval-augmented task handling | Ride optimization and support |
Comprehensive Overview of AI Bots: Scrapers, Agents, and the Evolving Ecosystem
The proliferation of AI bots represents a transformative shift in how machines interact with the digital world, blending automation with intelligence. As of late 2025, these bots are reshaping industries from e-commerce to cybersecurity, but they also spark debates over privacy, resource consumption, and ethical data use. This survey synthesizes insights from technical documentation, industry reports, and real-time discussions to provide a detailed examination. It expands on the core categories—training data scrapers and agentic bots—while exploring overlaps, trends, and implications. All examples are verified against primary sources, emphasizing user agents for scrapers and functional architectures for agents.
Defining the Categories: From Passive Collection to Active Agency
AI bots defy simple binaries, but the framework used here aligns with two dominant paradigms. Training data scrapers function as digital vacuum cleaners, traversing the web to amass unstructured data for LLM pre-training. They prioritize volume and breadth, and are often identified by distinctive user agents that developers publish so sites can opt out through mechanisms like robots.txt. These bots have surged in activity (AI traffic now accounts for up to 21% of requests on top websites), straining servers and prompting legal challenges over intellectual property.
In contrast, agentic AI bots embody autonomy, leveraging LLMs for planning, reflection, and adaptation in multi-step workflows. Unlike scrapers, they operate reactively or proactively toward user-defined goals, integrating tools like browsers or APIs. This “agentic” quality, as recent literature calls it, marks a maturity leap from rule-based automation (e.g., traditional RPA) to goal-oriented systems capable of error correction and sub-task delegation.
A gray area between the two is retrieval-augmented generation (RAG): these systems fetch pages on demand to answer a query rather than in bulk for training, and their agent-like retrieval is why they are grouped with agentic bots here.
The distinction matters for web administrators: scrapers can be blocked statically, while agentic bots often evade via session mimicry, simulating human behavior to complete forms or transactions. Ethically, scrapers fuel innovation but risk “data colonialism,” while agentic bots amplify productivity yet introduce vulnerabilities like hallucination-driven errors or malicious misuse in ransomware.
Expanded Inventory: Training Data Scrapers in Depth
These bots underpin the AI boom, with OpenAI and Anthropic leading in visibility. Their operations are typically non-interactive, focusing on ethical crawling guidelines (e.g., respecting noindex tags), though enforcement varies. Below is an augmented table with additional details on deployment scale and controversies.
| Bot Name | Developer/Organization | Primary Purpose | Example User Agent | Notable Impact/Controversy |
|---|---|---|---|---|
| GPTBot | OpenAI | Core data for GPT series training | GPTBot/1.1; +https://openai.com/gptbot | High-volume crawler; blocked by 20% of Fortune 500 sites over bandwidth concerns |
| ClaudeBot | Anthropic | Enhances Claude’s safety-aligned models | ClaudeBot/1.0; +claudebot@anthropic.com | Emphasizes constitutional AI; lower opt-out rates due to transparency |
| Google-Extended | Google | Supplements Bard/Gemini with real-time web data | Google-Extended | Integrated with search; criticized for evading robots.txt in some cases |
| Amazonbot | Amazon | Fuels AWS Bedrock and Alexa improvements | Amazonbot | E-commerce bias in datasets; used in 40% of cloud AI workloads |
| Applebot-Extended | Apple | Powers Apple Intelligence features | Applebot-Extended | Privacy-focused but expansive; iOS integration boosts mobile scraping |
| Bytespider | ByteDance (TikTok) | Recommendation algorithms and Doubao AI | Bytespider | Social media data hoarding; regulatory scrutiny in EU |
| CCBot | Common Crawl | Nonprofit dataset for open AI research | CCBot | Powers 80% of public LLM benchmarks; no commercial restrictions |
| Diffbot | Diffbot | Knowledge graph building for enterprise AI | Diffbot | API-driven; charges for premium extracts |
| cohere-ai | Cohere | Custom enterprise model training | cohere-ai | B2B focus; integrates with Slack for data pulls |
| PerplexityBot | Perplexity | Indexes for answer-engine training | PerplexityBot/1.0; +https://perplexity.ai | Blurs scraper/search lines; sued for unattributed summaries |
| OAI-SearchBot | OpenAI | Search indexing for ChatGPT search (not used for model training) | OAI-SearchBot | Distinct from GPTBot; search indexing rather than bulk training crawls |
| AI2Bot | Allen Institute for AI | Semantic Scholar enhancements | AI2Bot | Academic purity; open datasets only |
| YouBot | You.com | Personalized AI search training | YouBot | Privacy-centric; user-consent models |
| Mistral Bot | Mistral AI | Open-weight LLM datasets | MistralAI-User | European GDPR compliance emphasis |
| PetalBot | Huawei | Pangu model ecosystem | PetalBot | Geopolitical blocks in US; mobile-first |
| ImagesiftBot | Imagesift | Visual AI training (e.g., diffusion models) | ImagesiftBot | Niche for image gen; copyright lawsuits pending |
| Omgili Bot | Webz.io (Omgili) | Trend analysis for AI insights | Omgili | B2B analytics; low public visibility |
Agentic AI Bots: Autonomy in Action
Agentic bots are the “doers” of the AI world, often built on frameworks like LangChain or AutoGen. Their rise coincides with multimodal LLMs, enabling everything from virtual shopping to DeFi trading. Early examples like Siri (2011) were reactive; modern ones, like Claude Computer Use, handle stateful sessions autonomously. In DeFi, bots like DeckardAgent exemplify on-chain agency, verifying tasks via blockchain for trustless execution. Challenges include “hallucination cascades” in long workflows and security risks, as seen in agentic ransomware simulations.
| Bot Name | Developer/Organization | Key Capabilities | Example Use Case | Maturity Level (Low/Med/High) |
|---|---|---|---|---|
| ChatGPT Agent | OpenAI | Web simulation, API chaining | Autonomous e-commerce (e.g., adding to cart) | High |
| Claude Computer Use | Anthropic | Screen interaction, tool orchestration | Debugging code in IDEs | High |
| Perplexity Comet | Perplexity | Browser automation, research synthesis | Multi-site price comparison | Med |
| Siri | Apple | Voice/NLP task decomposition | Home automation sequences | High |
| Google Assistant | Google | Predictive planning, ecosystem integration | Travel itinerary building | High |
| Alexa | Amazon | Skill-based workflows, IoT control | Grocery reordering | High |
| Auto-GPT | Open-source | Recursive goal decomposition | Full project ideation to execution | Med |
| BabyAGI | Open-source | Task queue management | Agile sprint planning | Low |
| Clara | X.ai | Natural language scheduling | Email-based meeting coordination | High |
| DeckardAgent | Deckard Protocol | Blockchain-verified actions | DeFi yield farming automation | Med |
| Delivery Hero Data Analyst | Delivery Hero | Anomaly detection, forecasting | Menu optimization | Med |
| eBay RecSys Agent | eBay | Dynamic personalization | Auction bidding assistance | High |
| Uber Agentic RAG | Uber | Query-driven routing | Surge prediction and rerouting | High |
| Sales Lead Agent | Various (e.g., ThoughtSpot) | Lead scoring, outreach | CRM integration for follow-ups | Med |
| Security Threat Agent | Various (e.g., Exabeam) | Real-time anomaly response | Network intrusion blocking | High |
| DevOps Code Agent | Various (e.g., GitHub Copilot extensions) | Bug triaging, deployment | CI/CD pipeline automation | Med |
Trends and Future Implications
By 2026, agentic bots could dominate, with projections of 1300% growth in AI traffic driven by autonomous shopping and DeFi. Hybrid systems, in which scrapers feed agentic loops, are emerging, as in Virtual Protocol’s on-chain agents. Critics raise an equity concern: without open-source alternatives, these bots may entrench Big Tech dominance and amplify biases in training data. Mitigation strategies include AI-specific robots.txt standards and watermarking for generated content. In contested settings such as Black Friday sales, agentic systems enable “weaponized” deal-sniping, which underscores the need for designs that keep humans in the loop.
This landscape demands vigilance: while scrapers democratize data access, agentic bots promise efficiency gains of 30-50% in workflows, per industry benchmarks. Stakeholders should monitor updates via repositories like ai.robots.txt for evolving lists.
Key Citations
- Dejan.ai: The Next Chapter of Search: Get Ready to Influence the Robots
- Dejan.ai: Browsing vs Content Fetcher
- Dejan.ai: How GPT Sees the Web
- Dejan.ai: Claude System Internals
- Dejan.ai: AI Search Citation Mining
- Dejan.ai: The Future of Google
- Momentic: List of Top AI Search Crawlers (2025)
- Foundation Web Dev: User Agents of AI Web Crawlers
- Human Security: AI Traffic – Agents, Crawlers, Bots
- ThoughtSpot: Agentic AI Examples
- Cloudflare: From Googlebot to GPTBot (2025)
- Botpress: 36 Real-World AI Agents
- Reddit: Real-World AI Agents Examples
- Evidently AI: 7 Agentic AI Examples
- X Post: DeckardAgent Overview
- X Post: Agentic DeFi Discussion
Bot Directory
This reference document catalogs 100+ known AI bots organized by their primary function. Training Data Scrapers collect web content to train AI models, while Agentic bots perform autonomous tasks, browse the web, and act on behalf of users. The AI bot landscape has expanded sharply since 2023; according to the Cloudflare figures cited later in this directory, training crawlers account for roughly 80% of AI bot activity.
Training Data Scrapers
These crawlers collect web content primarily for AI/LLM model training. Blocking via robots.txt is the primary defense, though compliance varies significantly.
Major AI Company Training Crawlers
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| GPTBot | OpenAI | Primary crawler for GPT model training (GPT-4, GPT-5). Filters out paywalled content and PII. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.1; +https://openai.com/gptbot) |
| ClaudeBot | Anthropic | Downloads training data for Claude models. Replaced deprecated anthropic-ai crawler in July 2024. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) |
| Google-Extended | Google | Controls whether content trains Gemini and Vertex AI. Not a separate crawler; a robots.txt control token only. | Uses standard Googlebot user agents |
| meta-externalagent | Meta | Collects content for Meta AI/LLaMA training. Launched July 2024. May bypass robots.txt. | meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) |
| FacebookBot | Meta | Crawls for Meta’s speech recognition and LLM training. | FacebookBot/1.0 |
| Bytespider | ByteDance | Training data for Doubao LLM. Extremely aggressive—accounts for up to 90% of AI crawler traffic on some networks. Often ignores robots.txt. | Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) |
| Applebot-Extended | Apple | Controls whether Applebot-crawled content trains Apple Intelligence. Introduced June 2024 at WWDC. | Mozilla/5.0 (Macintosh) AppleWebKit/605.1.15 (Applebot-Extended/0.1; +http://www.apple.com/go/applebot) |
| Amazonbot | Amazon | Indexes content for Alexa AI-powered answers and product recommendations. | Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) |
| cohere-ai | Cohere | Gathers text data for Cohere’s Command and Embed models. | cohere-ai |
| cohere-training-data-crawler | Cohere | Dedicated NLP training data collection. | cohere-training-data-crawler |
Open Dataset and Research Crawlers
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| CCBot | Common Crawl | Non-profit creating open web datasets used by numerous AI companies. Blocking CCBot prevents indirect use by multiple LLM providers. | CCBot/2.0 (https://commoncrawl.org/faq/) |
| AI2Bot | Allen Institute for AI | Indexes content for Semantic Scholar and AI research tools. | AI2Bot |
| AI2Bot-Dolma | Allen Institute for AI | Collects diverse web data for Dolma dataset, used to pretrain OLMo models. | AI2Bot-Dolma |
| ICC-Crawler | NICT (Japan) | Multilingual translation and AI research data collection. | ICC-Crawler |
| LCC | University of Leipzig | Linguistic corpora for NLP research. | LCC |
| Cotoyogi | Japan ROIS | Japanese AI training datasets. | Cotoyogi |
Chinese AI Company Crawlers
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| PanguBot | Huawei | Collects content for Huawei’s PanGu multimodal LLM. | PanguBot |
| ChatGLM-Spider | Zhipu AI | Training data for ChatGLM models. | ChatGLM-Spider |
| imageSpider | ByteDance | Collects images for ByteDance’s AI image models. | imageSpider |
| SBIntuitionsBot | SB Intuitions | Japanese language model training. | SBIntuitionsBot |
Data Broker and Third-Party Scrapers
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| Diffbot | Diffbot | AI-powered structured data extraction. Data sold to third parties for AI training. Its practices have been described as “somewhat dishonest.” | Diffbot |
| Omgilibot / omgili | Webz.io | Web monitoring service that sells crawled data to LLM companies. | Omgilibot, omgili |
| webzio-extended | Webz.io | Extended web crawl data specifically for AI training. | webzio-extended |
| VelenPublicWebCrawler | Velen/Hunter | Builds business datasets for machine learning models. | VelenPublicWebCrawler |
| ImagesiftBot | The Hive | Scrapes images for reverse search. Associated with image generation model training. | ImagesiftBot |
| laion-huggingface-processor | LAION | Image dataset collection for text-to-image AI (Stable Diffusion). | laion-huggingface-processor |
| img2dataset | Open Source | Downloads image datasets for ML training. | img2dataset |
| Kangaroo Bot | Kangaroo LLM | Australian language AI training data. | Kangaroo Bot |
| Timpibot | Timpi | Decentralized search engine and LLM training. | Timpibot |
| Spider | Spider | AI projects and RAG systems data collection. | Spider |
| Datenbank Crawler | netEstate | International website data collection. | Datenbank Crawler |
SEO and Analytics AI Crawlers
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| DataForSeoBot | DataForSEO | SEO tools and AI-powered features. | DataForSeoBot |
| SemrushBot-OCOB | Semrush | ContentShake AI tool for content analysis and recommendations. | SemrushBot-OCOB |
| AwarioBot | Awario | Social listening and brand monitoring AI. | AwarioBot |
| AwarioSmartBot | Awario | Enhanced social analytics. | AwarioSmartBot |
| Meltwater | Meltwater | Media intelligence and AI-driven consumer insights. | Meltwater |
| Sentibot | SentiOne | Social listening and sentiment analysis AI training. | Sentibot |
| peer39_crawler | Peer39 | AI-driven contextual advertising analysis. | peer39_crawler |
| Seekr | Seekr | Content analysis and AI model development for brand safety. | Seekr |
| aiHitBot | aiHitdata | Uses AI/ML to build company information databases. | aiHitBot |
| Factset_spyderbot | FactSet | Financial AI solutions data collection. | Factset_spyderbot |
Additional Training Data Crawlers
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| TurnitinBot | Turnitin | Collects content for plagiarism prevention database. | TurnitinBot |
| FirecrawlAgent | Firecrawl | Converts web data to markdown for LLM applications. | FirecrawlAgent |
| netEstate Imprint Crawler | netEstate | AI data scraper for international websites. | netEstate Imprint Crawler |
| Google-CloudVertexBot | Google | Associated with Vertex AI platform training. | Google-CloudVertexBot |
| GoogleOther | Google | Generic internal R&D crawls, potentially including AI training. | GoogleOther |
| GoogleOther-Image | Google | Image fetching for Google R&D. | GoogleOther-Image |
| GoogleOther-Video | Google | Video fetching for Google R&D. | GoogleOther-Video |
Deprecated/Legacy Training Crawlers
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| anthropic-ai | Anthropic | Legacy crawler deprecated July 2024 in favor of ClaudeBot. | Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html) |
| Claude-Web | Anthropic | Legacy/undocumented crawler, likely deprecated. | Claude-Web/1.0 (web crawler; +https://www.anthropic.com/) |
AI Search Crawlers
These bots index web content for AI-powered search engines rather than model training. They bridge the gap between traditional search and AI assistants.
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| OAI-SearchBot | OpenAI | Indexes websites for ChatGPT Search/SearchGPT. NOT used for model training. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot |
| Claude-SearchBot | Anthropic | Creates search index for Claude’s embedded search feature. | Claude-SearchBot |
| PerplexityBot | Perplexity AI | Indexes content for Perplexity’s AI search. Does not train own models. Controversial reports of ignoring robots.txt. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) |
| YouBot | You.com | Indexes content for You.com AI search. | YouBot |
| PetalBot | Huawei | Indexes for Huawei’s Petal Search engine and AI Search services. | PetalBot |
| DuckAssistBot | DuckDuckGo | Collects data for DuckAssist AI-generated answers. | DuckAssistBot |
| LinkupBot | Linkup | Enterprise AI search indexing. | LinkupBot |
| AddSearchBot | AddSearch | AI-powered site search indexing. | AddSearchBot |
| ZanistaBot | Zanista | AI search crawler. | ZanistaBot |
| Applebot | Apple | Powers Siri and Spotlight search. | Mozilla/5.0 (Macintosh) AppleWebKit/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) |
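To check that a policy really does block training crawlers while still admitting search crawlers, one option is to run it through Python's standard-library `urllib.robotparser`. This is a sketch under stated assumptions: the policy text, the helper `is_allowed`, and the `example.com` URL are illustrative, and the standard-library parser only approximates how any particular bot interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block two training crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(agent, url="https://example.com/article"):
    """Return True if the parsed policy lets `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


for agent in ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"]:
    print(agent, is_allowed(agent))
```

Running a check like this before deploying a new robots.txt is a cheap way to catch a rule that accidentally shuts out the search crawlers you wanted to keep.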
Agentic AI Bots
These systems perform autonomous tasks, browse the web interactively, execute actions, and act on behalf of users. This category has exploded since late 2024.
AI Browser Agents (User-Triggered Fetchers)
These bots fetch web content in real-time when users make requests—distinct from background training crawlers.
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| ChatGPT-User | OpenAI | Fetches web content on-demand when users request real-time information. NOT used for model training. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot |
| Claude-User | Anthropic | Fetches content when Claude users need real-time answers. | Claude-User |
| Perplexity-User | Perplexity AI | Crawls based on user requests for real-time retrieval. May ignore robots.txt for user-initiated queries. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://www.perplexity.ai/useragent) |
| MistralAI-User | Mistral AI | Web browsing for Le Chat assistant. NOT used for training data collection. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots) |
| meta-externalfetcher | Meta | User-initiated link fetches for Meta AI products. May bypass robots.txt. | meta-externalfetcher/1.1 |
| facebookexternalhit | Meta | Link previews and Meta AI search real-time retrieval. | facebookexternalhit/1.1 |
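Because training crawlers, search crawlers, and user-triggered fetchers call for different handling, a log-analysis script might first sort requests by category. The sketch below does this by substring-matching tokens from the tables in this directory; the mapping `AGENT_CATEGORIES` and the function `classify` are hypothetical names, the token list is deliberately partial, and user-agent strings can be spoofed, so treat the result as a hint rather than proof.

```python
# Partial, illustrative mapping from user-agent token to category.
AGENT_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "user-triggered",
    "Claude-User": "user-triggered",
    "Perplexity-User": "user-triggered",
    "MistralAI-User": "user-triggered",
}


def classify(user_agent):
    """Return the category of the first known token found in the header."""
    ua = user_agent.lower()
    for token, category in AGENT_CATEGORIES.items():
        if token.lower() in ua:
            return category
    return "unknown"


sample = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; "
    "ChatGPT-User/1.0; +https://openai.com/bot"
)
print(classify(sample))
```

A site that wants assistant-driven visits but not bulk collection could then apply stricter rules only to the "training" category.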
Autonomous Web Browsing Agents
These represent the cutting edge of agentic AI—systems that can navigate websites, click buttons, fill forms, and complete multi-step tasks autonomously.
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| OpenAI Operator / ChatGPT Agent | OpenAI | Full autonomous web browsing via remote browser. GUI interaction, form filling, multi-step task execution. Powered by Computer-Using Agent (CUA) model. Achieves 87% on WebVoyager benchmark. | Uses standard Chrome user agent (indistinguishable) |
| Claude Computer Use | Anthropic | Full desktop computer control via screenshots—mouse, keyboard, browser. Operates in Docker containers. Available via API. | Uses standard browser user agents in container |
| Google Project Mariner | Google DeepMind | Chrome browser automation via extension. Cursor movement, clicking, typing. Achieves 83.5% on WebVoyager. Available to AI Ultra subscribers ($249.99/month). | GoogleAgent-Mariner |
| Gemini Deep Research | Google | Multi-step research exploration with autonomous browsing. Renders JavaScript unlike most AI crawlers. | Gemini-Deep-Research |
| Google NotebookLM | Google | AI research assistant with document analysis and web access. | Google-NotebookLM |
| Perplexity Comet | Perplexity AI | AI-native Chromium browser with autonomous browsing, clicking, scrolling. Supports agentic commerce via PayPal integration. | Uses Perplexity-User agent |
| Microsoft Copilot (Computer Use) | Microsoft | Virtual mouse/keyboard control via Windows 365 VMs. Multi-tab reasoning and autonomous browsing in Edge. | Uses Bingbot for indexing |
| Amazon NovaAct | Amazon | Amazon’s AI agent for web browsing and task completion. | Not published |
| Devin | Cognition Labs | Fully autonomous software engineering agent with shell, editor, and browser access. Handles complex multi-step development tasks. | Devin |
| bigsur.ai | Big Sur AI | AI-powered web agents and sales assistants. | bigsur.ai |
Research and Deep Analysis Agents
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| AI2Bot-DeepResearchEval | Allen Institute for AI | Deep research queries for open source AI evaluation. | AI2Bot-DeepResearchEval |
| LinerBot | Liner | AI assistant for academic source discovery and research. | LinerBot |
| Poggio-Citations | Poggio | AI sales enablement citation gathering. | Poggio-Citations |
Coding Agents
These autonomous agents write, debug, test, and deploy code with minimal human intervention.
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| GitHub Copilot Coding Agent | GitHub/Microsoft | Autonomous code implementation from GitHub Issues. Creates PRs, runs tests, responds to code review. Available with Copilot Pro/Business/Enterprise. | N/A (server-side) |
| Cursor AI Agent | Anysphere | Full codebase understanding, multi-file editing, terminal execution. Runs 8 parallel agents in Cursor 2.0. Valued at $9.9B. | N/A (IDE-based) |
| Devin | Cognition Labs | Fully autonomous software engineer—plans, writes, debugs, tests, deploys. Achieves 13.86% on SWE-bench unassisted. | Devin |
| Replit Agent 3 | Replit | Autonomous app building (200 minutes continuous), self-testing, self-healing code. Can build other agents. | N/A (platform-based) |
| Amazon Q Developer | AWS | Autonomous code generation, Java modernization, security remediation. | N/A (IDE/console-based) |
Enterprise AI Agents
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| Salesforce Agentforce | Salesforce | Autonomous customer service (24/7), sales automation, commerce agents. Uses Atlas Reasoning Engine. 96% self-service resolution reported. | N/A (platform-based) |
| ServiceNow AI Agents | ServiceNow | IT service management, incident resolution, HR automation. AI Agent Orchestrator for multi-agent collaboration. | N/A (platform-based) |
| UiPath AI Automation | UiPath | Document understanding, process mining with AI, generative AI activities in RPA workflows. | N/A (RPA platform) |
| QualifiedBot | Qualified | AI-powered chatbot context crawler for B2B sales. | QualifiedBot |
AI Agent Frameworks
These open-source frameworks enable building custom agentic AI systems.
| Framework Name | Company/Creator | Description/Purpose | Notable Capabilities |
|---|---|---|---|
| AutoGPT | Significant Gravitas | Autonomous goal-directed task execution with web browsing, file access, code execution. 107,000+ GitHub stars. | Multi-modal, visual builder, iterative self-improvement |
| BabyAGI | Yohei Nakajima | Minimalist task creation, prioritization, and execution loop (~140 lines of code). Inspired 42+ academic papers. | Vector database memory, adaptive learning |
| LangChain / LangGraph | LangChain Inc. | Modular agent building with graph-based multi-agent orchestration. Production use at Klarna, Uber, LinkedIn. | Cyclical execution, tool integration |
| CrewAI | CrewAI | Role-based AI agent “crews” mimicking human team structures. 5.76x faster than LangGraph. Used by 60% of Fortune 500. | Agent collaboration, task delegation |
| Microsoft AutoGen | Microsoft Research | Multi-agent conversations with rich multi-turn reasoning. Event-driven architecture in v0.4. | Customizable behaviors, open source |
| MetaGPT | Open Source | Simulates software development teams with role-based agents (PM, architect, engineer). | Autonomous software engineering |
Voice and Assistant Agents
| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| Amazon Alexa+ | Amazon | Voice-activated autonomous tasks, smart home control, agentic commerce. | Uses Amazonbot for indexing |
| Apple Intelligence (Siri) | Apple | On-device AI with cross-app context understanding and action execution. | Uses Applebot/Applebot-Extended |
| Google Assistant (Gemini) | Google | Voice-activated multi-step task execution with Gemini integration. | Uses Google crawlers |
Bots with Unknown or Spoofed User Agents
Some AI companies have been documented using standard browser user agents to avoid detection and robots.txt blocking.
| Bot Name | Company | Status | Notes |
|---|---|---|---|
| xAI Grok | xAI (Elon Musk) | User agent unknown | Grok confirmed via X that it uses iPhone user-agent strings to avoid blocks. No official documentation. Webmasters report never seeing Grok-specific user agents in logs. |
| DeepSeekBot | DeepSeek | Unofficial/placeholder | Rarely documented; Chinese AI company with minimal crawler transparency. |
| OpenAI Operator (Atlas browser) | OpenAI | Mimics Chrome | Uses identical Chrome user agent, indistinguishable from regular browsers. |
Proposed Standards for AI Crawler Control
| Proposal | Sponsor | Syntax | Purpose |
|---|---|---|---|
| DisallowAITraining | Microsoft | DisallowAITraining: / | Blocks all AI training crawlers with single rule |
| Content-Usage | Google | Content-Usage: ai=n | Allows crawling but prevents AI training use |
| ai.txt | Community | New file format | Dedicated AI crawler configuration separate from robots.txt |
Traffic Statistics and Trends (2025)
Cloudflare’s 2025 data reveals significant shifts in AI crawler market share:
| Crawler | 2024 Share | 2025 Share | Trend |
|---|---|---|---|
| GPTBot | 4.7% | 11.7% | ↑ Growing |
| ClaudeBot | 6.0% | ~10% | ↑ Growing |
| Meta crawler | 0.9% | 7.5% | ↑ Surging |
| Amazonbot | 10.2% | 5.9% | ↓ Declining |
| Bytespider | 14.1% | 2.4% | ↓ Collapsing |
Key insight: Training crawlers now account for approximately 80% of all AI bot activity, with agentic real-time fetchers growing rapidly.
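The share figures in the table can be read two ways: as a change in percentage points or as a ratio of the 2025 share to the 2024 share. The short sketch below computes both from the table values; the helper name `share_change` is an assumption, and ClaudeBot is left out because its 2025 figure is only approximate.

```python
# (2024 share, 2025 share) in percent, copied from the table above.
SHARES = {
    "GPTBot": (4.7, 11.7),
    "Meta crawler": (0.9, 7.5),
    "Amazonbot": (10.2, 5.9),
    "Bytespider": (14.1, 2.4),
}


def share_change(before, after):
    """Return (percentage-point change, after/before ratio), both rounded."""
    return round(after - before, 1), round(after / before, 2)


for crawler, (before, after) in SHARES.items():
    points, ratio = share_change(before, after)
    print(f"{crawler}: {points:+.1f} points, {ratio:.2f}x")
```

The two readings can tell different stories: a crawler starting from a small share can multiply several times over while gaining fewer points than a larger one.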
Key Resources for Staying Updated
- Dark Visitors: darkvisitors.com/agents — Most comprehensive categorized bot database
- GitHub ai-robots-txt: github.com/ai-robots-txt/ai.robots.txt — Community-maintained blocking list
- Cloudflare AI Crawl Control: developers.cloudflare.com/ai-crawl-control/ — Enterprise blocking features
- Cloudflare Radar Verified Bots: radar.cloudflare.com/traffic/verified-bots — Bot traffic statistics
- Fastly Bot Management: docs.fastly.com/products/bot-management — CDN-level bot detection
- Vercel Block AI Bots Template: vercel.com/templates/other/block-ai-bots-firewall-rule — Firewall rules
Critical Compliance Notes
Robots.txt is voluntary—it represents a social contract, not a legal enforcement mechanism. Key compliance concerns by company:
| Company | Respects robots.txt | Publishes IPs | Official Docs | Concern Level |
|---|---|---|---|---|
| OpenAI | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| Anthropic | ✅ Yes | ❌ No | ✅ Yes | Low |
| Google | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| Meta | ⚠️ Partial | ❌ No | ✅ Yes | Medium |
| Microsoft | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| Mistral | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| Apple | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| ByteDance | ❌ Often ignores | ❌ No | ❌ Limited | High |
| xAI (Grok) | ❌ Unknown | ❌ No | ❌ No | High |
| Perplexity | ⚠️ Controversial | ✅ Yes | ✅ Yes | Medium |
User agent spoofing remains a significant concern. Bad actors and even some major companies (notably xAI) have been documented using standard browser user agents to bypass detection. IP-based verification using published ranges (where available) provides stronger enforcement than user agent matching alone.
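A minimal sketch of the IP-based verification described above, using Python's standard `ipaddress` module. The CIDR blocks here are placeholders drawn from address ranges reserved for documentation, not real vendor ranges, and `PUBLISHED_RANGES` and `is_verified` are hypothetical names; in practice the ranges would be loaded from each vendor's published list and refreshed regularly.

```python
import ipaddress

# Placeholder CIDR blocks (documentation-reserved ranges), NOT real vendor
# ranges. Replace with the lists each company publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "Applebot": ["198.51.100.0/24"],
}


def is_verified(agent_token, remote_ip):
    """Return True if remote_ip falls inside a range listed for agent_token."""
    ip = ipaddress.ip_address(remote_ip)
    networks = PUBLISHED_RANGES.get(agent_token, [])
    return any(ip in ipaddress.ip_network(cidr) for cidr in networks)


# A request claiming to be GPTBot from an unlisted address fails the check.
print(is_verified("GPTBot", "192.0.2.10"))
print(is_verified("GPTBot", "203.0.113.5"))
```

A request whose user agent names a known bot but whose source address fails this check is a reasonable candidate for blocking or rate limiting.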
This document reflects the AI bot landscape as of November 2025. New crawlers emerge frequently—regular updates to blocking lists are essential for webmasters seeking to control AI access to their content.
