Are you accidentally slamming the door on helpful AI visitors while trying to keep your website’s content safe from being scraped for training data?

Many site owners block bots to protect their intellectual property, but in doing so, they might be turning away the “good” AI traffic—like search engines and assistants that drive real visitors your way. Let’s break it down so you can decide wisely.

Key Distinctions in AI Bots

Training Data Scrapers: These bots systematically crawl websites to collect vast amounts of text, images, and other data primarily for training large language models (LLMs). They operate at scale, often without user-specific triggers, and raise concerns about copyright and server load.
Agentic AI Bots: These are autonomous systems that plan, reason, and execute multi-step tasks, such as booking appointments or troubleshooting issues, often integrating tools like APIs or browsers. They emphasize goal-oriented actions over passive data gathering.
This compilation draws from reliable sources as of November 2025; the AI landscape evolves rapidly, so new bots emerge frequently. While not exhaustive, it covers over 30 prominent examples across categories.

Prominent Training Data Scrapers

These bots are designed for bulk data acquisition to fuel AI model development. Common user agents help site owners block them via robots.txt.

Bot Name	Developer/Organization	Primary Purpose	Example User Agent
GPTBot	OpenAI	Crawls for ChatGPT training data	GPTBot/1.1
ClaudeBot	Anthropic	Collects data for Claude models	ClaudeBot/1.0
Google-Extended	Google	Gathers extended web data for AI enhancements	Google-Extended
Amazonbot	Amazon	Supports AWS AI services and model training	Amazonbot
Applebot-Extended	Apple	Collects data for Apple Intelligence features	Applebot-Extended
Bytespider	ByteDance (TikTok)	Data for recommendation and generative AI	Bytespider
CCBot	Common Crawl	Open dataset for AI research and training	CCBot
Diffbot	Diffbot	Structured data extraction for AI datasets	Diffbot
cohere-ai	Cohere	Builds datasets for enterprise AI models	cohere-ai
PerplexityBot	Perplexity	Indexes web for AI search and training	PerplexityBot/1.0
OAI-SearchBot	OpenAI	Indexes sites for ChatGPT search; not used for model training	OAI-SearchBot
AI2Bot	Allen Institute for AI	Academic AI research data collection	AI2Bot
YouBot	You.com	Data for personalized AI search engines	YouBot
Mistral Bot	Mistral AI	User-triggered browsing for the Le Chat assistant; not used for training	MistralAI-User
PetalBot	Huawei	Data for Huawei’s AI ecosystem	PetalBot
ImagesiftBot	Imagesift	Image-focused scraping for visual AI	ImagesiftBot
Omgili Bot	Webz.io (Omgili)	Consumer insights data for AI analytics	Omgili
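
As a minimal sketch of the robots.txt blocking mentioned above, the Python below builds site-wide Disallow rules for a handful of tokens from the table. The function name `build_blocklist` and the choice of tokens are illustrative assumptions; check each vendor's documentation for the token it actually honors.

```python
# Illustrative subset of user-agent tokens taken from the table above.
TRAINING_BOT_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Bytespider"]


def build_blocklist(tokens):
    """Return robots.txt text that disallows the whole site for each token."""
    groups = []
    for token in tokens:
        # One group per bot: a User-agent line followed by a site-wide Disallow.
        groups.append(f"User-agent: {token}\nDisallow: /")
    # Groups are separated by blank lines; crawlers not listed are unaffected.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_blocklist(TRAINING_BOT_TOKENS))
```

The output can be pasted into an existing robots.txt file. It only affects crawlers that choose to honor the file.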

Notable Agentic AI Bots

These bots go beyond data collection, using reasoning to adapt and act independently. They often mimic human workflows but can introduce risks like unintended actions.

Bot Name	Developer/Organization	Key Capabilities	Example Use Case
ChatGPT Agent	OpenAI	Autonomous web navigation, form filling	E-commerce purchases, research tasks
Claude Computer Use	Anthropic	Desktop interaction, multi-tool orchestration	Software troubleshooting, file management
Perplexity Comet	Perplexity	Goal-directed browsing and task execution	Travel booking, market analysis
Siri	Apple	Voice-activated task automation	Scheduling, smart home control
Google Assistant	Google	Proactive planning and API integration	Route optimization, reminders
Alexa	Amazon	Ecosystem-wide automation	Shopping lists, device control
Auto-GPT	Open-source (Significant Gravitas)	Self-prompting for complex goals	Code generation, content creation
BabyAGI	Open-source (Yohei Nakajima)	Task prioritization and execution loops	Project management simulations
Clara (formerly x.ai)	X.ai	Meeting scheduling and calendar management	Automated appointment booking
DeckardAgent	Deckard Protocol	On-chain verification and task execution	Crypto trading, reputation scoring
Delivery Hero Data Analyst	Delivery Hero	Predictive analytics and decision-making	Inventory forecasting
eBay RecSys Agent	eBay	Recommendation and personalization engine	Product suggestions in real-time
Uber Agentic RAG	Uber	Retrieval-augmented task handling	Ride optimization and support

Comprehensive Overview of AI Bots: Scrapers, Agents, and the Evolving Ecosystem

The proliferation of AI bots represents a transformative shift in how machines interact with the digital world, blending automation with intelligence. As of late 2025, these bots are reshaping industries from e-commerce to cybersecurity, but they also spark debates over privacy, resource consumption, and ethical data use. This survey synthesizes insights from technical documentation, industry reports, and real-time discussions to provide a detailed examination. It expands on the core categories—training data scrapers and agentic bots—while exploring overlaps, trends, and implications. All examples are verified against primary sources, emphasizing user agents for scrapers and functional architectures for agents.

Defining the Categories: From Passive Collection to Active Agency

AI bots defy simple binaries, but the framework used here follows two dominant paradigms. Training data scrapers function as digital vacuum cleaners, traversing the web to amass unstructured data for LLM pre-training. They prioritize volume and breadth, and are often identified by distinctive user agents that developers publish for opt-out mechanisms like robots.txt. These bots have surged in activity (AI traffic now accounts for up to 21% of requests on top websites), straining servers and prompting legal challenges over intellectual property.

In contrast, agentic AI bots embody autonomy, leveraging LLMs for planning, reflection, and adaptation in multi-step workflows. Unlike scrapers, they operate reactively or proactively toward user-defined goals, integrating tools like browsers or APIs. This “agentic” quality, a term coined in recent literature, marks a leap in maturity from rule-based automation (e.g., traditional RPA) to goal-oriented systems capable of error correction and sub-task delegation.

A third, gray-area category bridges the two: retrieval-augmented generation (RAG) systems. They fetch pages on demand to answer queries rather than in bulk for training, and their agent-like retrieval places them closer to the agentic side in this survey.

The distinction matters for web administrators: scrapers can be blocked statically, while agentic bots often evade via session mimicry, simulating human behavior to complete forms or transactions. Ethically, scrapers fuel innovation but risk “data colonialism,” while agentic bots amplify productivity yet introduce vulnerabilities like hallucination-driven errors or malicious misuse in ransomware.
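
To make the “blocked statically” case concrete, here is a minimal sketch of a server-side filter written as WSGI middleware. The name `block_ai_scrapers` and the token list are illustrative assumptions. The filter only catches bots that announce themselves, so an agent presenting an ordinary browser user agent passes straight through.

```python
# Lowercase tokens to refuse; an illustrative subset, not a complete list.
BLOCKED_TOKENS = ("gptbot", "ccbot", "bytespider")


def block_ai_scrapers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from listed user agents get a 403."""

    def middleware(environ, start_response):
        # WSGI exposes the User-Agent request header under this key.
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated access is not permitted.\n"]
        # Anything else is handed to the wrapped application unchanged.
        return app(environ, start_response)

    return middleware
```

Unlike robots.txt, this enforces the rule rather than requesting it, at the cost of maintaining the token list yourself.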

Expanded Inventory: Training Data Scrapers in Depth

These bots underpin the AI boom, with OpenAI and Anthropic leading in visibility. Their operations are typically non-interactive, focusing on ethical crawling guidelines (e.g., respecting noindex tags), though enforcement varies. Below is an augmented table with additional details on deployment scale and controversies.

Bot Name	Developer/Organization	Primary Purpose	Example User Agent	Notable Impact/Controversy
GPTBot	OpenAI	Core data for GPT series training	GPTBot/1.1; +https://openai.com/gptbot	High-volume crawler; blocked by 20% of Fortune 500 sites over bandwidth concerns
ClaudeBot	Anthropic	Enhances Claude’s safety-aligned models	ClaudeBot/1.0; +claudebot@anthropic.com	Emphasizes constitutional AI; lower opt-out rates due to transparency
Google-Extended	Google	Supplements Bard/Gemini with real-time web data	Google-Extended	Integrated with search; criticized for evading robots.txt in some cases
Amazonbot	Amazon	Fuels AWS Bedrock and Alexa improvements	Amazonbot	E-commerce bias in datasets; used in 40% of cloud AI workloads
Applebot-Extended	Apple	Powers Apple Intelligence features	Applebot-Extended	Privacy-focused but expansive; iOS integration boosts mobile scraping
Bytespider	ByteDance (TikTok)	Recommendation algorithms and Doubao AI	Bytespider	Social media data hoarding; regulatory scrutiny in EU
CCBot	Common Crawl	Nonprofit dataset for open AI research	CCBot	Powers 80% of public LLM benchmarks; no commercial restrictions
Diffbot	Diffbot	Knowledge graph building for enterprise AI	Diffbot	API-driven; charges for premium extracts
cohere-ai	Cohere	Custom enterprise model training	cohere-ai	B2B focus; integrates with Slack for data pulls
PerplexityBot	Perplexity	Indexes for answer-engine training	PerplexityBot/1.0; +https://perplexity.ai	Blurs scraper/search lines; sued for unattributed summaries
OAI-SearchBot	OpenAI	Search indexing for ChatGPT	OAI-SearchBot	Separate from GPTBot; not used for model training
AI2Bot	Allen Institute for AI	Semantic Scholar enhancements	AI2Bot	Academic purity; open datasets only
YouBot	You.com	Personalized AI search training	YouBot	Privacy-centric; user-consent models
Mistral Bot	Mistral AI	User-triggered browsing for Le Chat, not dataset collection	MistralAI-User	European GDPR compliance emphasis
PetalBot	Huawei	Pangu model ecosystem	PetalBot	Geopolitical blocks in US; mobile-first
ImagesiftBot	Imagesift	Visual AI training (e.g., diffusion models)	ImagesiftBot	Niche for image gen; copyright lawsuits pending
Omgili Bot	Webz.io (Omgili)	Trend analysis for AI insights	Omgili	B2B analytics; low public visibility

Agentic AI Bots: Autonomy in Action

Agentic bots are the “doers” of the AI world, often built on frameworks like LangChain or AutoGen. Their rise coincides with multimodal LLMs, enabling everything from virtual shopping to DeFi trading. Early examples like Siri (2011) were reactive; modern ones, like Claude Computer Use, handle stateful sessions autonomously. In DeFi, bots like DeckardAgent exemplify on-chain agency, verifying tasks via blockchain for trustless execution. Challenges include “hallucination cascades” in long workflows and security risks, as seen in agentic ransomware simulations.

Bot Name	Developer/Organization	Key Capabilities	Example Use Case	Maturity Level (Low/Med/High)
ChatGPT Agent	OpenAI	Web simulation, API chaining	Autonomous e-commerce (e.g., adding to cart)	High
Claude Computer Use	Anthropic	Screen interaction, tool orchestration	Debugging code in IDEs	High
Perplexity Comet	Perplexity	Browser automation, research synthesis	Multi-site price comparison	Med
Siri	Apple	Voice/NLP task decomposition	Home automation sequences	High
Google Assistant	Google	Predictive planning, ecosystem integration	Travel itinerary building	High
Alexa	Amazon	Skill-based workflows, IoT control	Grocery reordering	High
Auto-GPT	Open-source	Recursive goal decomposition	Full project ideation to execution	Med
BabyAGI	Open-source	Task queue management	Agile sprint planning	Low
Clara	X.ai	Natural language scheduling	Email-based meeting coordination	High
DeckardAgent	Deckard Protocol	Blockchain-verified actions	DeFi yield farming automation	Med
Delivery Hero Data Analyst	Delivery Hero	Anomaly detection, forecasting	Menu optimization	Med
eBay RecSys Agent	eBay	Dynamic personalization	Auction bidding assistance	High
Uber Agentic RAG	Uber	Query-driven routing	Surge prediction and rerouting	High
Sales Lead Agent	Various (e.g., ThoughtSpot)	Lead scoring, outreach	CRM integration for follow-ups	Med
Security Threat Agent	Various (e.g., Exabeam)	Real-time anomaly response	Network intrusion blocking	High
DevOps Code Agent	Various (e.g., GitHub Copilot extensions)	Bug triaging, deployment	CI/CD pipeline automation	Med

Trends and Future Implications

By 2026, agentic bots could dominate, with projections of 1300% growth in AI traffic driven by autonomous shopping and DeFi. Hybrid systems—e.g., scrapers feeding agentic loops—are emerging, as in Virtual Protocol’s on-chain agents. For balance, counterarguments highlight equity: without open-source alternatives, these bots may entrench Big Tech dominance, exacerbating biases in training data. Mitigation strategies include AI-specific robots.txt standards and watermarking for generated content. In controversial realms like Black Friday bots, agentic systems enable “weaponized” deal-sniping, underscoring the need for empathetic design that prioritizes human oversight.

This landscape demands vigilance: while scrapers democratize data access, agentic bots promise efficiency gains of 30-50% in workflows, per industry benchmarks. Stakeholders should monitor updates via repositories like ai.robots.txt for evolving lists.

Bot Directory

This reference document catalogs 100+ known AI bots organized by their primary function. Training data scrapers collect web content to train AI models, while agentic bots perform autonomous tasks, browse the web, and act on behalf of users. The AI bot landscape has expanded rapidly since 2023, with Cloudflare reporting that training crawlers now account for roughly 80% of AI bot activity.

Training Data Scrapers

These crawlers collect web content primarily for AI/LLM model training. Blocking via robots.txt is the primary defense, though compliance varies significantly.

Major AI Company Training Crawlers

Bot Name	Company	Description/Purpose	User Agent String
GPTBot	OpenAI	Primary crawler for GPT model training (GPT-4, GPT-5). Filters out paywalled content and PII.	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.1; +https://openai.com/gptbot)`
ClaudeBot	Anthropic	Downloads training data for Claude models. Replaced deprecated anthropic-ai crawler in July 2024.	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)`
Google-Extended	Google	Controls whether content trains Gemini and Vertex AI. Not a separate crawler—a robots.txt control token only.	Uses standard Googlebot user agents
meta-externalagent	Meta	Collects content for Meta AI/LLaMA training. Launched July 2024. May bypass robots.txt.	`meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)`
FacebookBot	Meta	Crawls for Meta’s speech recognition and LLM training.	`FacebookBot/1.0`
Bytespider	ByteDance	Training data for Doubao LLM. Extremely aggressive—accounts for up to 90% of AI crawler traffic on some networks. Often ignores robots.txt.	`Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)`
Applebot-Extended	Apple	Controls whether Applebot-crawled content trains Apple Intelligence. Introduced June 2024 at WWDC.	`Mozilla/5.0 (Macintosh) AppleWebKit/605.1.15 (Applebot-Extended/0.1; +http://www.apple.com/go/applebot)`
Amazonbot	Amazon	Indexes content for Alexa AI-powered answers and product recommendations.	`Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)`
cohere-ai	Cohere	Gathers text data for Cohere’s Command and Embed models.	`cohere-ai`
cohere-training-data-crawler	Cohere	Dedicated NLP training data collection.	`cohere-training-data-crawler`
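
Before relying on a robots.txt file, it can help to check what a compliant crawler would conclude from it. The sketch below uses Python's standard `urllib.robotparser`. The sample rules and the `is_allowed` helper are illustrative assumptions, and a passing check says nothing about bots that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_text, bot_token, url):
    """Return True if a crawler honoring robots_text may fetch url.

    Pass the short product token (e.g. "GPTBot"), not the full
    "Mozilla/5.0 ..." header, because the parser matches on the token.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(bot_token, url)


if __name__ == "__main__":
    for token in ("GPTBot", "ClaudeBot"):
        print(token, is_allowed(ROBOTS_TXT, token, "https://example.com/post"))
```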

Open Dataset and Research Crawlers

Bot Name	Company	Description/Purpose	User Agent String
CCBot	Common Crawl	Non-profit creating open web datasets used by numerous AI companies. Blocking CCBot prevents indirect use by multiple LLM providers.	`CCBot/2.0 (https://commoncrawl.org/faq/)`
AI2Bot	Allen Institute for AI	Indexes content for Semantic Scholar and AI research tools.	`AI2Bot`
AI2Bot-Dolma	Allen Institute for AI	Collects diverse web data for Dolma dataset, used to pretrain OLMo models.	`AI2Bot-Dolma`
ICC-Crawler	NICT (Japan)	Multilingual translation and AI research data collection.	`ICC-Crawler`
LCC	University of Leipzig	Linguistic corpora for NLP research.	`LCC`
Cotoyogi	Japan ROIS	Japanese AI training datasets.	`Cotoyogi`

Chinese and Japanese AI Company Crawlers

Bot Name	Company	Description/Purpose	User Agent String
PanguBot	Huawei	Collects content for Huawei’s PanGu multimodal LLM.	`PanguBot`
ChatGLM-Spider	Zhipu AI	Training data for ChatGLM models.	`ChatGLM-Spider`
imageSpider	ByteDance	Collects images for ByteDance’s AI image models.	`imageSpider`
SBIntuitionsBot	SB Intuitions	Japanese language model training.	`SBIntuitionsBot`

Data Broker and Third-Party Scrapers

Bot Name	Company	Description/Purpose	User Agent String
Diffbot	Diffbot	AI-powered structured data extraction. Data sold to third parties for AI training. Described as “somewhat dishonest” in practices.	`Diffbot`
Omgilibot / omgili	Webz.io	Web monitoring service that sells crawled data to LLM companies.	`Omgilibot`, `omgili`
webzio-extended	Webz.io	Extended web crawl data specifically for AI training.	`webzio-extended`
VelenPublicWebCrawler	Velen/Hunter	Builds business datasets for machine learning models.	`VelenPublicWebCrawler`
ImagesiftBot	The Hive	Scrapes images for reverse search. Associated with image generation model training.	`ImagesiftBot`
laion-huggingface-processor	LAION	Image dataset collection for text-to-image AI (Stable Diffusion).	`laion-huggingface-processor`
img2dataset	Open Source	Downloads image datasets for ML training.	`img2dataset`
Kangaroo Bot	Kangaroo LLM	Australian language AI training data.	`Kangaroo Bot`
Timpibot	Timpi	Decentralized search engine and LLM training.	`Timpibot`
Spider	Spider	AI projects and RAG systems data collection.	`Spider`
Datenbank Crawler	netEstate	International website data collection.	`Datenbank Crawler`

SEO and Analytics AI Crawlers

Bot Name	Company	Description/Purpose	User Agent String
DataForSeoBot	DataForSEO	SEO tools and AI-powered features.	`DataForSeoBot`
SemrushBot-OCOB	Semrush	ContentShake AI tool for content analysis and recommendations.	`SemrushBot-OCOB`
AwarioBot	Awario	Social listening and brand monitoring AI.	`AwarioBot`
AwarioSmartBot	Awario	Enhanced social analytics.	`AwarioSmartBot`
Meltwater	Meltwater	Media intelligence and AI-driven consumer insights.	`Meltwater`
Sentibot	SentiOne	Social listening and sentiment analysis AI training.	`Sentibot`
peer39_crawler	Peer39	AI-driven contextual advertising analysis.	`peer39_crawler`
Seekr	Seekr	Content analysis and AI model development for brand safety.	`Seekr`
aiHitBot	aiHitdata	Uses AI/ML to build company information databases.	`aiHitBot`
Factset_spyderbot	FactSet	Financial AI solutions data collection.	`Factset_spyderbot`

Additional Training Data Crawlers

Bot Name	Company	Description/Purpose	User Agent String
TurnitinBot	Turnitin	Collects content for plagiarism prevention database.	`TurnitinBot`
FirecrawlAgent	Firecrawl	Converts web data to markdown for LLM applications.	`FirecrawlAgent`
netEstate Imprint Crawler	netEstate	AI data scraper for international websites.	`netEstate Imprint Crawler`
Google-CloudVertexBot	Google	Associated with Vertex AI platform training.	`Google-CloudVertexBot`
GoogleOther	Google	Generic internal R&D crawls, potentially including AI training.	`GoogleOther`
GoogleOther-Image	Google	Image fetching for Google R&D.	`GoogleOther-Image`
GoogleOther-Video	Google	Video fetching for Google R&D.	`GoogleOther-Video`

Deprecated/Legacy Training Crawlers

Bot Name	Company	Description/Purpose	User Agent String
anthropic-ai	Anthropic	Legacy crawler deprecated July 2024 in favor of ClaudeBot.	`Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html)`
Claude-Web	Anthropic	Legacy/undocumented crawler, likely deprecated.	`Claude-Web/1.0 (web crawler; +https://www.anthropic.com/)`

AI Search Crawlers

These bots index web content for AI-powered search engines rather than model training. They bridge the gap between traditional search and AI assistants.

Bot Name	Company	Description/Purpose	User Agent String
OAI-SearchBot	OpenAI	Indexes websites for ChatGPT Search/SearchGPT. NOT used for model training.	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot`
Claude-SearchBot	Anthropic	Creates search index for Claude’s embedded search feature.	`Claude-SearchBot`
PerplexityBot	Perplexity AI	Indexes content for Perplexity’s AI search. Does not train own models. Controversial reports of ignoring robots.txt.	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)`
YouBot	You.com	Indexes content for You.com AI search.	`YouBot`
PetalBot	Huawei	Indexes for Huawei’s Petal Search engine and AI Search services.	`PetalBot`
DuckAssistBot	DuckDuckGo	Collects data for DuckAssist AI-generated answers.	`DuckAssistBot`
LinkupBot	Linkup	Enterprise AI search indexing.	`LinkupBot`
AddSearchBot	AddSearch	AI-powered site search indexing.	`AddSearchBot`
ZanistaBot	Zanista	AI search crawler.	`ZanistaBot`
Applebot	Apple	Powers Siri and Spotlight search.	`Mozilla/5.0 (Macintosh) AppleWebKit/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)`
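
Because search crawlers and training crawlers use different tokens, a site can treat them differently. The following sketch renders one possible policy (refuse training, welcome AI search) and checks it with the standard-library parser. The grouping, the `POLICY` mapping, and the helper names are assumptions made for illustration, not a recommendation for every site.

```python
from urllib.robotparser import RobotFileParser

# Tokens copied from the tables in this article; the grouping is illustrative.
BOTS_BY_CATEGORY = {
    "training": ["GPTBot", "ClaudeBot", "CCBot"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
}

# Hypothetical site policy: one robots.txt directive per category.
POLICY = {"training": "Disallow", "search": "Allow"}


def render_policy(bots_by_category, policy):
    """Build robots.txt text with one group per bot token."""
    groups = []
    for category, tokens in bots_by_category.items():
        directive = policy[category]
        for token in tokens:
            groups.append(f"User-agent: {token}\n{directive}: /")
    return "\n\n".join(groups) + "\n"


def permitted(robots_text, token, url="https://example.com/"):
    """Return True if a compliant crawler using this token may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    text = render_policy(BOTS_BY_CATEGORY, POLICY)
    for token in ("GPTBot", "OAI-SearchBot"):
        print(token, permitted(text, token))
```

Keeping the category map in one place makes it easy to flip a whole class of bots when your priorities change.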

Agentic AI Bots

These systems perform autonomous tasks, browse the web interactively, execute actions, and act on behalf of users. This category has exploded since late 2024.

AI Browser Agents (User-Triggered Fetchers)

These bots fetch web content in real-time when users make requests—distinct from background training crawlers.

Bot Name	Company	Description/Purpose	User Agent String
ChatGPT-User	OpenAI	Fetches web content on-demand when users request real-time information. NOT used for model training.	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot`
Claude-User	Anthropic	Fetches content when Claude users need real-time answers.	`Claude-User`
Perplexity-User	Perplexity AI	Crawls based on user requests for real-time retrieval. May ignore robots.txt for user-initiated queries.	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://www.perplexity.ai/useragent)`
MistralAI-User	Mistral AI	Web browsing for Le Chat assistant. NOT used for training data collection.	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots)`
meta-externalfetcher	Meta	User-initiated link fetches for Meta AI products. May bypass robots.txt.	`meta-externalfetcher/1.1`
facebookexternalhit	Meta	Link previews and Meta AI search real-time retrieval.	`facebookexternalhit/1.1`
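
When reading access logs, the first step is usually sorting requests by bot type. Here is a rough sketch that maps a raw User-Agent header to one of the categories used in this article. The token subset and the function name `classify_user_agent` are assumptions, and a spoofed header will defeat it.

```python
# Token -> category, using a subset of names from the tables in this article.
TOKEN_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "user-triggered",
    "Claude-User": "user-triggered",
    "Perplexity-User": "user-triggered",
}


def classify_user_agent(user_agent):
    """Return the category of the first known token found in the header."""
    lowered = user_agent.lower()
    for token, category in TOKEN_CATEGORIES.items():
        # Case-insensitive substring match against the full header value.
        if token.lower() in lowered:
            return category
    return "unknown"


if __name__ == "__main__":
    sample = (
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
        "GPTBot/1.1; +https://openai.com/gptbot)"
    )
    print(classify_user_agent(sample))
```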

Autonomous Web Browsing Agents

These represent the cutting edge of agentic AI—systems that can navigate websites, click buttons, fill forms, and complete multi-step tasks autonomously.

Bot Name	Company	Description/Purpose	User Agent String
OpenAI Operator / ChatGPT Agent	OpenAI	Full autonomous web browsing via remote browser. GUI interaction, form filling, multi-step task execution. Powered by Computer-Using Agent (CUA) model. Achieves 87% on WebVoyager benchmark.	Uses standard Chrome user agent (indistinguishable)
Claude Computer Use	Anthropic	Full desktop computer control via screenshots—mouse, keyboard, browser. Operates in Docker containers. Available via API.	Uses standard browser user agents in container
Google Project Mariner	Google DeepMind	Chrome browser automation via extension. Cursor movement, clicking, typing. Achieves 83.5% on WebVoyager. Available to AI Ultra subscribers ($249.99/month).	`GoogleAgent-Mariner`
Gemini Deep Research	Google	Multi-step research exploration with autonomous browsing. Renders JavaScript unlike most AI crawlers.	`Gemini-Deep-Research`
Google NotebookLM	Google	AI research assistant with document analysis and web access.	`Google-NotebookLM`
Perplexity Comet	Perplexity AI	AI-native Chromium browser with autonomous browsing, clicking, scrolling. Supports agentic commerce via PayPal integration.	Uses Perplexity-User agent
Microsoft Copilot (Computer Use)	Microsoft	Virtual mouse/keyboard control via Windows 365 VMs. Multi-tab reasoning and autonomous browsing in Edge.	Uses Bingbot for indexing
Amazon NovaAct	Amazon	Amazon’s AI agent for web browsing and task completion.	Not published
Devin	Cognition Labs	Fully autonomous software engineering agent with shell, editor, and browser access. Handles complex multi-step development tasks.	`Devin`
bigsur.ai	Big Sur AI	AI-powered web agents and sales assistants.	`bigsur.ai`

Research and Deep Analysis Agents

Bot Name	Company	Description/Purpose	User Agent String
AI2Bot-DeepResearchEval	Allen Institute for AI	Deep research queries for open source AI evaluation.	`AI2Bot-DeepResearchEval`
LinerBot	Liner	AI assistant for academic source discovery and research.	`LinerBot`
Poggio-Citations	Poggio	AI sales enablement citation gathering.	`Poggio-Citations`

Coding Agents

These autonomous agents write, debug, test, and deploy code with minimal human intervention.

Bot Name	Company	Description/Purpose	User Agent String
GitHub Copilot Coding Agent	GitHub/Microsoft	Autonomous code implementation from GitHub Issues. Creates PRs, runs tests, responds to code review. Available with Copilot Pro/Business/Enterprise.	N/A (server-side)
Cursor AI Agent	Anysphere	Full codebase understanding, multi-file editing, terminal execution. Runs 8 parallel agents in Cursor 2.0. Valued at $9.9B.	N/A (IDE-based)
Devin	Cognition Labs	Fully autonomous software engineer—plans, writes, debugs, tests, deploys. Achieves 13.86% on SWE-bench unassisted.	`Devin`
Replit Agent 3	Replit	Autonomous app building (200 minutes continuous), self-testing, self-healing code. Can build other agents.	N/A (platform-based)
Amazon Q Developer	AWS	Autonomous code generation, Java modernization, security remediation.	N/A (IDE/console-based)

Enterprise AI Agents

Bot Name	Company	Description/Purpose	User Agent String
Salesforce Agentforce	Salesforce	Autonomous customer service (24/7), sales automation, commerce agents. Uses Atlas Reasoning Engine. 96% self-service resolution reported.	N/A (platform-based)
ServiceNow AI Agents	ServiceNow	IT service management, incident resolution, HR automation. AI Agent Orchestrator for multi-agent collaboration.	N/A (platform-based)
UiPath AI Automation	UiPath	Document understanding, process mining with AI, generative AI activities in RPA workflows.	N/A (RPA platform)
QualifiedBot	Qualified	AI-powered chatbot context crawler for B2B sales.	`QualifiedBot`

AI Agent Frameworks

These open-source frameworks enable building custom agentic AI systems.

Framework Name	Company/Creator	Description/Purpose	Notable Capabilities
AutoGPT	Significant Gravitas	Autonomous goal-directed task execution with web browsing, file access, code execution. 107,000+ GitHub stars.	Multi-modal, visual builder, iterative self-improvement
BabyAGI	Yohei Nakajima	Minimalist task creation, prioritization, and execution loop (~140 lines of code). Inspired 42+ academic papers.	Vector database memory, adaptive learning
LangChain / LangGraph	LangChain Inc.	Modular agent building with graph-based multi-agent orchestration. Production use at Klarna, Uber, LinkedIn.	Cyclical execution, tool integration
CrewAI	CrewAI	Role-based AI agent “crews” mimicking human team structures. 5.76x faster than LangGraph. Used by 60% of Fortune 500.	Agent collaboration, task delegation
Microsoft AutoGen	Microsoft Research	Multi-agent conversations with rich multi-turn reasoning. Event-driven architecture in v0.4.	Customizable behaviors, open source
MetaGPT	Open Source	Simulates software development teams with role-based agents (PM, architect, engineer).	Autonomous software engineering

Voice and Assistant Agents

Bot Name	Company	Description/Purpose	User Agent String
Amazon Alexa+	Amazon	Voice-activated autonomous tasks, smart home control, agentic commerce.	Uses Amazonbot for indexing
Apple Intelligence (Siri)	Apple	On-device AI with cross-app context understanding and action execution.	Uses Applebot/Applebot-Extended
Google Assistant (Gemini)	Google	Voice-activated multi-step task execution with Gemini integration.	Uses Google crawlers

Bots with Unknown or Spoofed User Agents

Some AI companies have been documented using standard browser user agents to avoid detection and robots.txt blocking.

Bot Name	Company	Status	Notes
xAI Grok	xAI (Elon Musk)	User agent unknown	Grok confirmed via X that it uses iPhone user-agent strings to avoid blocks. No official documentation. Webmasters report never seeing Grok-specific user agents in logs.
DeepSeekBot	DeepSeek	Unofficial/placeholder	Rarely documented; Chinese AI company with minimal crawler transparency.
OpenAI Operator (Atlas browser)	OpenAI	Mimics Chrome	Uses identical Chrome user agent, indistinguishable from regular browsers.

Proposed Standards for AI Crawler Control

Proposal	Sponsor	Syntax	Purpose
DisallowAITraining	Microsoft	`DisallowAITraining: /`	Blocks all AI training crawlers with single rule
Content-Usage	Google	`Content-Usage: ai=n`	Allows crawling but prevents AI training use
ai.txt	Community	New file format	Dedicated AI crawler configuration separate from robots.txt
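
Since these are proposals rather than adopted standards, existing tooling may not recognize them. As a small sketch, the helper below scans robots.txt text for the two proposed directive names from the table so you can see whether a file already uses them. The function name and the sample text are assumptions, and the sketch says nothing about whether any crawler honors these lines.

```python
# Directive names taken from the proposals table above, lowercased.
PROPOSED_DIRECTIVES = {"disallowaitraining", "content-usage"}


def find_proposed_directives(robots_text):
    """Return (name, value) pairs for proposed AI directives in the text."""
    found = []
    for raw_line in robots_text.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        if name.strip().lower() in PROPOSED_DIRECTIVES:
            found.append((name.strip(), value.strip()))
    return found


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /tmp/\nDisallowAITraining: /\n"
    print(find_proposed_directives(sample))
```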

Traffic Statistics and Trends (2025)

Cloudflare’s 2025 data reveals significant shifts in AI crawler market share:

Crawler	2024 Share	2025 Share	Trend
GPTBot	4.7%	11.7%	↑ Growing
ClaudeBot	6.0%	~10%	↑ Growing
Meta crawler	0.9%	7.5%	↑ Surging
Amazonbot	10.2%	5.9%	↓ Declining
Bytespider	14.1%	2.4%	↓ Collapsing
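
The size of these shifts is easier to see as arithmetic. The sketch below computes the change in percentage points and the relative change for each row. The figures are copied from the table, with ClaudeBot's "~10%" treated as 10.0 (an approximation), and `share_change` is a helper name invented for this example.

```python
# (2024 share %, 2025 share %) copied from the table above.
# ClaudeBot's 2025 value is listed only as "~10%", so 10.0 is approximate.
SHARES = {
    "GPTBot": (4.7, 11.7),
    "ClaudeBot": (6.0, 10.0),
    "Meta crawler": (0.9, 7.5),
    "Amazonbot": (10.2, 5.9),
    "Bytespider": (14.1, 2.4),
}


def share_change(before, after):
    """Return (percentage-point change, relative change in percent)."""
    points = round(after - before, 1)
    relative = round((after - before) / before * 100, 1)
    return points, relative


if __name__ == "__main__":
    for name, (share_2024, share_2025) in SHARES.items():
        points, relative = share_change(share_2024, share_2025)
        print(f"{name}: {points:+.1f} points ({relative:+.1f}% relative)")
```

Percentage points show how much of the total moved, while the relative figure shows how dramatic the change was for that one crawler.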

Key insight: Training crawlers now account for approximately 80% of all AI bot activity, with agentic real-time fetchers growing rapidly.

Key Resources for Staying Updated

Dark Visitors: darkvisitors.com/agents — Most comprehensive categorized bot database
GitHub ai-robots-txt: github.com/ai-robots-txt/ai.robots.txt — Community-maintained blocking list
Cloudflare AI Crawl Control: developers.cloudflare.com/ai-crawl-control/ — Enterprise blocking features
Cloudflare Radar Verified Bots: radar.cloudflare.com/traffic/verified-bots — Bot traffic statistics
Fastly Bot Management: docs.fastly.com/products/bot-management — CDN-level bot detection
Vercel Block AI Bots Template: vercel.com/templates/other/block-ai-bots-firewall-rule — Firewall rules

Critical Compliance Notes

Robots.txt is voluntary—it represents a social contract, not a legal enforcement mechanism. Key compliance concerns by company:

Company	Respects robots.txt	Publishes IPs	Official Docs	Concern Level
OpenAI	✅ Yes	✅ Yes	✅ Yes	Low
Anthropic	✅ Yes	❌ No	✅ Yes	Low
Google	✅ Yes	✅ Yes	✅ Yes	Low
Meta	⚠️ Partial	❌ No	✅ Yes	Medium
Microsoft	✅ Yes	✅ Yes	✅ Yes	Low
Mistral	✅ Yes	✅ Yes	✅ Yes	Low
Apple	✅ Yes	✅ Yes	✅ Yes	Low
ByteDance	❌ Often ignores	❌ No	❌ Limited	High
xAI (Grok)	❌ Unknown	❌ No	❌ No	High
Perplexity	⚠️ Controversial	✅ Yes	✅ Yes	Medium

User agent spoofing remains a significant concern. Bad actors and even some major companies (notably xAI) have been documented using standard browser user agents to bypass detection. IP-based verification using published ranges (where available) provides stronger enforcement than user agent matching alone.
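
A minimal sketch of IP-based verification follows, using Python's standard `ipaddress` module. The bot name `ExampleBot` and the CIDR ranges are placeholders drawn from reserved documentation address blocks, not real crawler ranges. In practice you would load the ranges each vendor publishes and refresh them regularly.

```python
import ipaddress

# Placeholder data: reserved documentation ranges, NOT real crawler ranges.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "198.51.100.0/25"],
}


def is_verified(bot_name, remote_ip, published=PUBLISHED_RANGES):
    """Return True if remote_ip falls inside a range published for bot_name."""
    ip = ipaddress.ip_address(remote_ip)
    networks = (ipaddress.ip_network(cidr) for cidr in published.get(bot_name, []))
    # A bot with no published ranges can never be verified this way.
    return any(ip in network for network in networks)


if __name__ == "__main__":
    print(is_verified("ExampleBot", "192.0.2.17"))
    print(is_verified("ExampleBot", "203.0.113.5"))
```

A request that claims a bot's user agent but arrives from outside its published ranges is a strong sign of spoofing.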

This document reflects the AI bot landscape as of November 2025. New crawlers emerge frequently—regular updates to blocking lists are essential for webmasters seeking to control AI access to their content.

This article is featured in Moz Top 10.

To block or not to block? Bot is the question.