To block or not to block? Bot is the question.

Are you accidentally slamming the door on helpful AI visitors while trying to keep your website’s content safe from being scraped for training data?

Many site owners block bots to protect their intellectual property, but in doing so, they might be turning away the “good” AI traffic—like search engines and assistants that drive real visitors your way. Let’s break it down so you can decide wisely.

Key Distinctions in AI Bots

  • Training Data Scrapers: These bots systematically crawl websites to collect vast amounts of text, images, and other data primarily for training large language models (LLMs). They operate at scale, often without user-specific triggers, and raise concerns about copyright and server load.
  • Agentic AI Bots: These are autonomous systems that plan, reason, and execute multi-step tasks, such as booking appointments or troubleshooting issues, often integrating tools like APIs or browsers. They emphasize goal-oriented actions over passive data gathering.

This compilation draws from sources available as of November 2025; the AI landscape evolves rapidly, so new bots emerge frequently. While not exhaustive, it covers over 30 prominent examples across categories.

Prominent Training Data Scrapers

These bots are designed for bulk data acquisition to fuel AI model development. Common user agents help site owners block them via robots.txt.

Bot Name | Developer/Organization | Primary Purpose | Example User Agent
GPTBot | OpenAI | Crawls for ChatGPT training data | GPTBot/1.1
ClaudeBot | Anthropic | Collects data for Claude models | ClaudeBot/1.0
Google-Extended | Google | Gathers extended web data for AI enhancements | Google-Extended
Amazonbot | Amazon | Supports AWS AI services and model training | Amazonbot
Applebot-Extended | Apple | Collects data for Apple Intelligence features | Applebot-Extended
Bytespider | ByteDance (TikTok) | Data for recommendation and generative AI | Bytespider
CCBot | Common Crawl | Open dataset for AI research and training | CCBot
Diffbot | Diffbot | Structured data extraction for AI datasets | Diffbot
cohere-ai | Cohere | Builds datasets for enterprise AI models | cohere-ai
PerplexityBot | Perplexity | Indexes web for AI search and training | PerplexityBot/1.0
OAI-SearchBot | OpenAI | On-demand crawling for model improvements | OAI-SearchBot
AI2Bot | Allen Institute for AI | Academic AI research data collection | AI2Bot
YouBot | You.com | Data for personalized AI search engines | YouBot
Mistral Bot | Mistral AI | Training open-source LLMs | MistralAI-User
PetalBot | Huawei | Data for Huawei’s AI ecosystem | PetalBot
ImagesiftBot | Imagesift | Image-focused scraping for visual AI | ImagesiftBot
Omgili Bot | Webz.io (Omgili) | Consumer insights data for AI analytics | Omgili
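
As a rough illustration of blocking by user agent, the sketch below generates robots.txt groups from a list of crawler tokens. The function name `build_robots_txt` and the particular selection of tokens are assumptions made for this example, not a recommended policy, and compliance with the output is voluntary on the crawler's side.

```python
# Tokens taken from the table above; which ones to block is a site-specific choice.
TRAINING_CRAWLERS = [
    "GPTBot",
    "ClaudeBot",
    "Google-Extended",
    "Applebot-Extended",
    "Bytespider",
    "CCBot",
]


def build_robots_txt(blocked_agents, disallow_path="/"):
    """Return robots.txt text with one Disallow group per listed user agent."""
    groups = [
        f"User-agent: {agent}\nDisallow: {disallow_path}"
        for agent in blocked_agents
    ]
    # Every other crawler stays allowed under the wildcard group.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(TRAINING_CRAWLERS))
```

The output can be saved as `/robots.txt` at the site root. Keeping the token list in one place makes it easier to update as new crawlers appear.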

Notable Agentic AI Bots

These bots go beyond data collection, using reasoning to adapt and act independently. They often mimic human workflows but can introduce risks like unintended actions.

Bot Name | Developer/Organization | Key Capabilities | Example Use Case
ChatGPT Agent | OpenAI | Autonomous web navigation, form filling | E-commerce purchases, research tasks
Claude Computer Use | Anthropic | Desktop interaction, multi-tool orchestration | Software troubleshooting, file management
Perplexity Comet | Perplexity | Goal-directed browsing and task execution | Travel booking, market analysis
Siri | Apple | Voice-activated task automation | Scheduling, smart home control
Google Assistant | Google | Proactive planning and API integration | Route optimization, reminders
Alexa | Amazon | Ecosystem-wide automation | Shopping lists, device control
Auto-GPT | Open-source (Significant Gravitas) | Self-prompting for complex goals | Code generation, content creation
BabyAGI | Open-source (Yohei Nakajima) | Task prioritization and execution loops | Project management simulations
Clara (formerly x.ai) | X.ai | Meeting scheduling and calendar management | Automated appointment booking
DeckardAgent | Deckard Protocol | On-chain verification and task execution | Crypto trading, reputation scoring
Delivery Hero Data Analyst | Delivery Hero | Predictive analytics and decision-making | Inventory forecasting
eBay RecSys Agent | eBay | Recommendation and personalization engine | Product suggestions in real-time
Uber Agentic RAG | Uber | Retrieval-augmented task handling | Ride optimization and support

Comprehensive Overview of AI Bots: Scrapers, Agents, and the Evolving Ecosystem

The proliferation of AI bots represents a transformative shift in how machines interact with the digital world, blending automation with intelligence. As of late 2025, these bots are reshaping industries from e-commerce to cybersecurity, but they also spark debates over privacy, resource consumption, and ethical data use. This survey synthesizes insights from technical documentation, industry reports, and real-time discussions to provide a detailed examination. It expands on the core categories—training data scrapers and agentic bots—while exploring overlaps, trends, and implications. All examples are verified against primary sources, emphasizing user agents for scrapers and functional architectures for agents.

Defining the Categories: From Passive Collection to Active Agency

AI bots defy simple binaries, but the framework used in this article rests on two dominant paradigms. Training data scrapers function as digital vacuum cleaners, traversing the web to amass unstructured data for LLM pre-training. They prioritize volume and breadth, and are often identified by distinctive user agents that developers publish so site owners can opt out through mechanisms like robots.txt. These bots have surged in activity (AI traffic now accounts for up to 21% of requests on top websites), straining servers and prompting legal challenges over intellectual property.

In contrast, agentic AI bots embody autonomy, leveraging LLMs for planning, reflection, and adaptation in multi-step workflows. Unlike scrapers, they operate reactively or proactively toward user-defined goals, integrating tools like browsers or APIs. This “agentic” quality, a term taken from recent literature, marks a maturity leap from rule-based automation (e.g., traditional RPA) to goal-oriented systems capable of error correction and sub-task delegation.

A third, gray-area category bridges the two: retrieval-augmented generation (RAG) systems. They fetch pages on demand to answer queries rather than scraping in bulk for training, and because that retrieval is goal-driven, this article groups them with agentic bots.

The distinction matters for web administrators: scrapers can be blocked statically, while agentic bots often evade via session mimicry, simulating human behavior to complete forms or transactions. Ethically, scrapers fuel innovation but risk “data colonialism,” while agentic bots amplify productivity yet introduce vulnerabilities like hallucination-driven errors or malicious misuse in ransomware.

Expanded Inventory: Training Data Scrapers in Depth

These bots underpin the AI boom, with OpenAI and Anthropic leading in visibility. Their operations are typically non-interactive and nominally follow published crawling guidelines (e.g., honoring robots.txt rules and noindex tags), though enforcement varies. Below is an augmented table with additional details on deployment scale and controversies.

Bot Name | Developer/Organization | Primary Purpose | Example User Agent | Notable Impact/Controversy
GPTBot | OpenAI | Core data for GPT series training | GPTBot/1.1; +https://openai.com/gptbot | High-volume crawler; blocked by 20% of Fortune 500 sites over bandwidth concerns
ClaudeBot | Anthropic | Enhances Claude’s safety-aligned models | ClaudeBot/1.0; +claudebot@anthropic.com | Emphasizes constitutional AI; lower opt-out rates due to transparency
Google-Extended | Google | Supplements Bard/Gemini with real-time web data | Google-Extended | Integrated with search; criticized for evading robots.txt in some cases
Amazonbot | Amazon | Fuels AWS Bedrock and Alexa improvements | Amazonbot | E-commerce bias in datasets; used in 40% of cloud AI workloads
Applebot-Extended | Apple | Powers Apple Intelligence features | Applebot-Extended | Privacy-focused but expansive; iOS integration boosts mobile scraping
Bytespider | ByteDance (TikTok) | Recommendation algorithms and Doubao AI | Bytespider | Social media data hoarding; regulatory scrutiny in EU
CCBot | Common Crawl | Nonprofit dataset for open AI research | CCBot | Powers 80% of public LLM benchmarks; no commercial restrictions
Diffbot | Diffbot | Knowledge graph building for enterprise AI | Diffbot | API-driven; charges for premium extracts
cohere-ai | Cohere | Custom enterprise model training | cohere-ai | B2B focus; integrates with Slack for data pulls
PerplexityBot | Perplexity | Indexes for answer-engine training | PerplexityBot/1.0; +https://perplexity.ai | Blurs scraper/search lines; sued for unattributed summaries
OAI-SearchBot | OpenAI | Iterative model refinement | OAI-SearchBot | Variant of GPTBot; on-demand triggers
AI2Bot | Allen Institute for AI | Semantic Scholar enhancements | AI2Bot | Academic purity; open datasets only
YouBot | You.com | Personalized AI search training | YouBot | Privacy-centric; user-consent models
Mistral Bot | Mistral AI | Open-weight LLM datasets | MistralAI-User | European GDPR compliance emphasis
PetalBot | Huawei | Pangu model ecosystem | PetalBot | Geopolitical blocks in US; mobile-first
ImagesiftBot | Imagesift | Visual AI training (e.g., diffusion models) | ImagesiftBot | Niche for image gen; copyright lawsuits pending
Omgili Bot | Webz.io (Omgili) | Trend analysis for AI insights | Omgili | B2B analytics; low public visibility

Agentic AI Bots: Autonomy in Action

Agentic bots are the “doers” of the AI world, often built on frameworks like LangChain or AutoGen. Their rise coincides with multimodal LLMs, enabling everything from virtual shopping to DeFi trading. Early examples like Siri (2011) were reactive; modern ones, like Claude Computer Use, handle stateful sessions autonomously. In DeFi, bots like DeckardAgent exemplify on-chain agency, verifying tasks via blockchain for trustless execution. Challenges include “hallucination cascades” in long workflows and security risks, as seen in agentic ransomware simulations.

Bot Name | Developer/Organization | Key Capabilities | Example Use Case | Maturity Level (Low/Med/High)
ChatGPT Agent | OpenAI | Web simulation, API chaining | Autonomous e-commerce (e.g., adding to cart) | High
Claude Computer Use | Anthropic | Screen interaction, tool orchestration | Debugging code in IDEs | High
Perplexity Comet | Perplexity | Browser automation, research synthesis | Multi-site price comparison | Med
Siri | Apple | Voice/NLP task decomposition | Home automation sequences | High
Google Assistant | Google | Predictive planning, ecosystem integration | Travel itinerary building | High
Alexa | Amazon | Skill-based workflows, IoT control | Grocery reordering | High
Auto-GPT | Open-source | Recursive goal decomposition | Full project ideation to execution | Med
BabyAGI | Open-source | Task queue management | Agile sprint planning | Low
Clara | X.ai | Natural language scheduling | Email-based meeting coordination | High
DeckardAgent | Deckard Protocol | Blockchain-verified actions | DeFi yield farming automation | Med
Delivery Hero Data Analyst | Delivery Hero | Anomaly detection, forecasting | Menu optimization | Med
eBay RecSys Agent | eBay | Dynamic personalization | Auction bidding assistance | High
Uber Agentic RAG | Uber | Query-driven routing | Surge prediction and rerouting | High
Sales Lead Agent | Various (e.g., ThoughtSpot) | Lead scoring, outreach | CRM integration for follow-ups | Med
Security Threat Agent | Various (e.g., Exabeam) | Real-time anomaly response | Network intrusion blocking | High
DevOps Code Agent | Various (e.g., GitHub Copilot extensions) | Bug triaging, deployment | CI/CD pipeline automation | Med

Trends and Future Implications

By 2026, agentic bots could dominate, with projections of 1300% growth in AI traffic driven by autonomous shopping and DeFi. Hybrid systems, such as scrapers feeding agentic loops, are emerging, as in Virtuals Protocol’s on-chain agents. A counterargument concerns equity: without open-source alternatives, these bots may entrench Big Tech dominance and exacerbate biases in training data. Mitigation strategies include AI-specific robots.txt standards and watermarking for generated content. In contested areas like Black Friday shopping bots, agentic systems enable “weaponized” deal-sniping, underscoring the need for designs that prioritize human oversight.

This landscape demands vigilance: while scrapers democratize data access, agentic bots promise efficiency gains of 30-50% in workflows, per industry benchmarks. Stakeholders should monitor updates via repositories like ai.robots.txt for evolving lists.



Bot Directory

This reference document catalogs 100+ known AI bots organized by their primary function. Training data scrapers collect web content to train AI models, while agentic bots perform autonomous tasks, browse the web, and act on behalf of users. The AI bot landscape has expanded rapidly since 2023, with Cloudflare data indicating that training crawlers now account for roughly 80% of all AI bot activity.


Training Data Scrapers

These crawlers collect web content primarily for AI/LLM model training. Blocking via robots.txt is the primary defense, though compliance varies significantly.

Major AI Company Training Crawlers

Bot Name | Company | Description/Purpose | User Agent String
GPTBot | OpenAI | Primary crawler for GPT model training (GPT-4, GPT-5). Filters out paywalled content and PII. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.1; +https://openai.com/gptbot)
ClaudeBot | Anthropic | Downloads training data for Claude models. Replaced deprecated anthropic-ai crawler in July 2024. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
Google-Extended | Google | Controls whether content trains Gemini and Vertex AI. Not a separate crawler: a robots.txt control token only. | Uses standard Googlebot user agents
meta-externalagent | Meta | Collects content for Meta AI/LLaMA training. Launched July 2024. May bypass robots.txt. | meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
FacebookBot | Meta | Crawls for Meta’s speech recognition and LLM training. | FacebookBot/1.0
Bytespider | ByteDance | Training data for Doubao LLM. Extremely aggressive: accounts for up to 90% of AI crawler traffic on some networks. Often ignores robots.txt. | Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)
Applebot-Extended | Apple | Controls whether Applebot-crawled content trains Apple Intelligence. Introduced June 2024 at WWDC. | Mozilla/5.0 (Macintosh) AppleWebKit/605.1.15 (Applebot-Extended/0.1; +http://www.apple.com/go/applebot)
Amazonbot | Amazon | Indexes content for Alexa AI-powered answers and product recommendations. | Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
cohere-ai | Cohere | Gathers text data for Cohere’s Command and Embed models. | cohere-ai
cohere-training-data-crawler | Cohere | Dedicated NLP training data collection. | cohere-training-data-crawler

Open Dataset and Research Crawlers

Bot Name | Company | Description/Purpose | User Agent String
CCBot | Common Crawl | Non-profit creating open web datasets used by numerous AI companies. Blocking CCBot prevents indirect use by multiple LLM providers. | CCBot/2.0 (https://commoncrawl.org/faq/)
AI2Bot | Allen Institute for AI | Indexes content for Semantic Scholar and AI research tools. | AI2Bot
AI2Bot-Dolma | Allen Institute for AI | Collects diverse web data for Dolma dataset, used to pretrain OLMo models. | AI2Bot-Dolma
ICC-Crawler | NICT (Japan) | Multilingual translation and AI research data collection. | ICC-Crawler
LCC | University of Leipzig | Linguistic corpora for NLP research. | LCC
Cotoyogi | Japan ROIS | Japanese AI training datasets. | Cotoyogi

Chinese and Japanese AI Company Crawlers

Bot Name | Company | Description/Purpose | User Agent String
PanguBot | Huawei | Collects content for Huawei’s PanGu multimodal LLM. | PanguBot
ChatGLM-Spider | Zhipu AI | Training data for ChatGLM models. | ChatGLM-Spider
imageSpider | ByteDance | Collects images for ByteDance’s AI image models. | imageSpider
SBIntuitionsBot | SB Intuitions | Japanese language model training. | SBIntuitionsBot

Data Broker and Third-Party Scrapers

Bot Name | Company | Description/Purpose | User Agent String
Diffbot | Diffbot | AI-powered structured data extraction. Data sold to third parties for AI training. Described as “somewhat dishonest” in practices. | Diffbot
Omgilibot / omgili | Webz.io | Web monitoring service that sells crawled data to LLM companies. | Omgilibot, omgili
webzio-extended | Webz.io | Extended web crawl data specifically for AI training. | webzio-extended
VelenPublicWebCrawler | Velen/Hunter | Builds business datasets for machine learning models. | VelenPublicWebCrawler
ImagesiftBot | The Hive | Scrapes images for reverse search. Associated with image generation model training. | ImagesiftBot
laion-huggingface-processor | LAION | Image dataset collection for text-to-image AI (Stable Diffusion). | laion-huggingface-processor
img2dataset | Open Source | Downloads image datasets for ML training. | img2dataset
Kangaroo Bot | Kangaroo LLM | Australian language AI training data. | Kangaroo Bot
Timpibot | Timpi | Decentralized search engine and LLM training. | Timpibot
Spider | Spider | AI projects and RAG systems data collection. | Spider
Datenbank Crawler | netEstate | International website data collection. | Datenbank Crawler

SEO and Analytics AI Crawlers

Bot Name | Company | Description/Purpose | User Agent String
DataForSeoBot | DataForSEO | SEO tools and AI-powered features. | DataForSeoBot
SemrushBot-OCOB | Semrush | ContentShake AI tool for content analysis and recommendations. | SemrushBot-OCOB
AwarioBot | Awario | Social listening and brand monitoring AI. | AwarioBot
AwarioSmartBot | Awario | Enhanced social analytics. | AwarioSmartBot
Meltwater | Meltwater | Media intelligence and AI-driven consumer insights. | Meltwater
Sentibot | SentiOne | Social listening and sentiment analysis AI training. | Sentibot
peer39_crawler | Peer39 | AI-driven contextual advertising analysis. | peer39_crawler
Seekr | Seekr | Content analysis and AI model development for brand safety. | Seekr
aiHitBot | aiHitdata | Uses AI/ML to build company information databases. | aiHitBot
Factset_spyderbot | FactSet | Financial AI solutions data collection. | Factset_spyderbot

Additional Training Data Crawlers

Bot Name | Company | Description/Purpose | User Agent String
TurnitinBot | Turnitin | Collects content for plagiarism prevention database. | TurnitinBot
FirecrawlAgent | Firecrawl | Converts web data to markdown for LLM applications. | FirecrawlAgent
netEstate Imprint Crawler | netEstate | AI data scraper for international websites. | netEstate Imprint Crawler
Google-CloudVertexBot | Google | Associated with Vertex AI platform training. | Google-CloudVertexBot
GoogleOther | Google | Generic internal R&D crawls, potentially including AI training. | GoogleOther
GoogleOther-Image | Google | Image fetching for Google R&D. | GoogleOther-Image
GoogleOther-Video | Google | Video fetching for Google R&D. | GoogleOther-Video

Deprecated/Legacy Training Crawlers

Bot Name | Company | Description/Purpose | User Agent String
anthropic-ai | Anthropic | Legacy crawler deprecated July 2024 in favor of ClaudeBot. | Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html)
Claude-Web | Anthropic | Legacy/undocumented crawler, likely deprecated. | Claude-Web/1.0 (web crawler; +https://www.anthropic.com/)

AI Search Crawlers

These bots index web content for AI-powered search engines rather than model training. They bridge the gap between traditional search and AI assistants.

Bot Name | Company | Description/Purpose | User Agent String
OAI-SearchBot | OpenAI | Indexes websites for ChatGPT Search/SearchGPT. NOT used for model training. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
Claude-SearchBot | Anthropic | Creates search index for Claude’s embedded search feature. | Claude-SearchBot
PerplexityBot | Perplexity AI | Indexes content for Perplexity’s AI search. Does not train own models. Controversial reports of ignoring robots.txt. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
YouBot | You.com | Indexes content for You.com AI search. | YouBot
PetalBot | Huawei | Indexes for Huawei’s Petal Search engine and AI Search services. | PetalBot
DuckAssistBot | DuckDuckGo | Collects data for DuckAssist AI-generated answers. | DuckAssistBot
LinkupBot | Linkup | Enterprise AI search indexing. | LinkupBot
AddSearchBot | AddSearch | AI-powered site search indexing. | AddSearchBot
ZanistaBot | Zanista | AI search crawler. | ZanistaBot
Applebot | Apple | Powers Siri and Spotlight search. | Mozilla/5.0 (Macintosh) AppleWebKit/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
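
Because search crawlers and training crawlers use different tokens, a site can block one group while leaving the other alone. The sketch below checks such a policy with Python's standard `urllib.robotparser`. The sample policy, the helper name `is_allowed`, and the `example.com` URL are assumptions for illustration; the check only shows what a compliant crawler would conclude, not what any given bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block two training crawlers, allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url="https://example.com/article"):
    """Return True if a compliant crawler with this token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for token in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-SearchBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, token) else "blocked"
        print(f"{token}: {verdict}")
```

Running a check like this before deploying a robots.txt change helps catch a rule that accidentally shuts out the search crawlers a site wants to keep.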

Agentic AI Bots

These systems perform autonomous tasks, browse the web interactively, execute actions, and act on behalf of users. This category has exploded since late 2024.

AI Browser Agents (User-Triggered Fetchers)

These bots fetch web content in real-time when users make requests—distinct from background training crawlers.

Bot Name | Company | Description/Purpose | User Agent String
ChatGPT-User | OpenAI | Fetches web content on-demand when users request real-time information. NOT used for model training. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
Claude-User | Anthropic | Fetches content when Claude users need real-time answers. | Claude-User
Perplexity-User | Perplexity AI | Crawls based on user requests for real-time retrieval. May ignore robots.txt for user-initiated queries. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://www.perplexity.ai/useragent)
MistralAI-User | Mistral AI | Web browsing for Le Chat assistant. NOT used for training data collection. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots)
meta-externalfetcher | Meta | User-initiated link fetches for Meta AI products. May bypass robots.txt. | meta-externalfetcher/1.1
facebookexternalhit | Meta | Link previews and Meta AI search real-time retrieval. | facebookexternalhit/1.1
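
To see which kind of AI traffic a site is actually receiving, log entries can be bucketed by the token in the User-Agent header. The following is a minimal sketch using a handful of tokens from this directory. The category names and the `classify_user_agent` helper are assumptions for this example, and substring matching cannot detect agents that present a standard browser user agent.

```python
# A small, non-exhaustive sample of tokens from the directory tables.
AGENT_CATEGORIES = {
    "training": ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "meta-externalagent"),
    "search": ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"),
    "user_fetch": ("ChatGPT-User", "Claude-User", "Perplexity-User", "MistralAI-User"),
}


def classify_user_agent(user_agent):
    """Return the category of the first matching token, or 'unknown'."""
    ua = user_agent.lower()
    for category, tokens in AGENT_CATEGORIES.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return "unknown"


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.1; +https://openai.com/gptbot)",
        "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    ]
    for sample in samples:
        print(classify_user_agent(sample), "<-", sample)
```

Separating the buckets this way lets a site owner apply a different policy to each, for example blocking training crawlers while still serving user-triggered fetchers.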

Autonomous Web Browsing Agents

These represent the cutting edge of agentic AI—systems that can navigate websites, click buttons, fill forms, and complete multi-step tasks autonomously.

Bot Name | Company | Description/Purpose | User Agent String
OpenAI Operator / ChatGPT Agent | OpenAI | Full autonomous web browsing via remote browser. GUI interaction, form filling, multi-step task execution. Powered by Computer-Using Agent (CUA) model. Achieves 87% on WebVoyager benchmark. | Uses standard Chrome user agent (indistinguishable)
Claude Computer Use | Anthropic | Full desktop computer control via screenshots (mouse, keyboard, browser). Operates in Docker containers. Available via API. | Uses standard browser user agents in container
Google Project Mariner | Google DeepMind | Chrome browser automation via extension. Cursor movement, clicking, typing. Achieves 83.5% on WebVoyager. Available to AI Ultra subscribers ($249.99/month). | GoogleAgent-Mariner
Gemini Deep Research | Google | Multi-step research exploration with autonomous browsing. Renders JavaScript unlike most AI crawlers. | Gemini-Deep-Research
Google NotebookLM | Google | AI research assistant with document analysis and web access. | Google-NotebookLM
Perplexity Comet | Perplexity AI | AI-native Chromium browser with autonomous browsing, clicking, scrolling. Supports agentic commerce via PayPal integration. | Uses Perplexity-User agent
Microsoft Copilot (Computer Use) | Microsoft | Virtual mouse/keyboard control via Windows 365 VMs. Multi-tab reasoning and autonomous browsing in Edge. | Uses Bingbot for indexing
Amazon NovaAct | Amazon | Amazon’s AI agent for web browsing and task completion. | Not published
Devin | Cognition Labs | Fully autonomous software engineering agent with shell, editor, and browser access. Handles complex multi-step development tasks. | Devin
bigsur.ai | Big Sur AI | AI-powered web agents and sales assistants. | bigsur.ai

Research and Deep Analysis Agents

Bot Name | Company | Description/Purpose | User Agent String
AI2Bot-DeepResearchEval | Allen Institute for AI | Deep research queries for open source AI evaluation. | AI2Bot-DeepResearchEval
LinerBot | Liner | AI assistant for academic source discovery and research. | LinerBot
Poggio-Citations | Poggio | AI sales enablement citation gathering. | Poggio-Citations

Coding Agents

These autonomous agents write, debug, test, and deploy code with minimal human intervention.

Bot Name | Company | Description/Purpose | User Agent String
GitHub Copilot Coding Agent | GitHub/Microsoft | Autonomous code implementation from GitHub Issues. Creates PRs, runs tests, responds to code review. Available with Copilot Pro/Business/Enterprise. | N/A (server-side)
Cursor AI Agent | Anysphere | Full codebase understanding, multi-file editing, terminal execution. Runs 8 parallel agents in Cursor 2.0. Valued at $9.9B. | N/A (IDE-based)
Devin | Cognition Labs | Fully autonomous software engineer: plans, writes, debugs, tests, deploys. Achieves 13.86% on SWE-bench unassisted. | Devin
Replit Agent 3 | Replit | Autonomous app building (200 minutes continuous), self-testing, self-healing code. Can build other agents. | N/A (platform-based)
Amazon Q Developer | AWS | Autonomous code generation, Java modernization, security remediation. | N/A (IDE/console-based)

Enterprise AI Agents

Bot Name | Company | Description/Purpose | User Agent String
Salesforce Agentforce | Salesforce | Autonomous customer service (24/7), sales automation, commerce agents. Uses Atlas Reasoning Engine. 96% self-service resolution reported. | N/A (platform-based)
ServiceNow AI Agents | ServiceNow | IT service management, incident resolution, HR automation. AI Agent Orchestrator for multi-agent collaboration. | N/A (platform-based)
UiPath AI Automation | UiPath | Document understanding, process mining with AI, generative AI activities in RPA workflows. | N/A (RPA platform)
QualifiedBot | Qualified | AI-powered chatbot context crawler for B2B sales. | QualifiedBot

AI Agent Frameworks

These open-source frameworks enable building custom agentic AI systems.

Framework Name | Company/Creator | Description/Purpose | Notable Capabilities
AutoGPT | Significant Gravitas | Autonomous goal-directed task execution with web browsing, file access, code execution. 107,000+ GitHub stars. | Multi-modal, visual builder, iterative self-improvement
BabyAGI | Yohei Nakajima | Minimalist task creation, prioritization, and execution loop (~140 lines of code). Inspired 42+ academic papers. | Vector database memory, adaptive learning
LangChain / LangGraph | LangChain Inc. | Modular agent building with graph-based multi-agent orchestration. Production use at Klarna, Uber, LinkedIn. | Cyclical execution, tool integration
CrewAI | CrewAI | Role-based AI agent “crews” mimicking human team structures. 5.76x faster than LangGraph. Used by 60% of Fortune 500. | Agent collaboration, task delegation
Microsoft AutoGen | Microsoft Research | Multi-agent conversations with rich multi-turn reasoning. Event-driven architecture in v0.4. | Customizable behaviors, open source
MetaGPT | Open Source | Simulates software development teams with role-based agents (PM, architect, engineer). | Autonomous software engineering

Voice and Assistant Agents

Bot Name | Company | Description/Purpose | User Agent String
Amazon Alexa+ | Amazon | Voice-activated autonomous tasks, smart home control, agentic commerce. | Uses Amazonbot for indexing
Apple Intelligence (Siri) | Apple | On-device AI with cross-app context understanding and action execution. | Uses Applebot/Applebot-Extended
Google Assistant (Gemini) | Google | Voice-activated multi-step task execution with Gemini integration. | Uses Google crawlers

Bots with Unknown or Spoofed User Agents

Some AI companies have been documented using standard browser user agents to avoid detection and robots.txt blocking.

Bot Name | Company | Status | Notes
xAI Grok | xAI (Elon Musk) | User agent unknown | Grok confirmed via X that it uses iPhone user-agent strings to avoid blocks. No official documentation. Webmasters report never seeing Grok-specific user agents in logs.
DeepSeekBot | DeepSeek | Unofficial/placeholder | Rarely documented; Chinese AI company with minimal crawler transparency.
OpenAI Operator (Atlas browser) | OpenAI | Mimics Chrome | Uses identical Chrome user agent, indistinguishable from regular browsers.

Proposed Standards for AI Crawler Control

Proposal | Sponsor | Syntax | Purpose
DisallowAITraining | Microsoft | DisallowAITraining: / | Blocks all AI training crawlers with single rule
Content-Usage | Google | Content-Usage: ai=n | Allows crawling but prevents AI training use
ai.txt | Community | New file format | Dedicated AI crawler configuration separate from robots.txt

Traffic Statistics and Trends (2025)

Cloudflare’s 2025 data reveals significant shifts in AI crawler market share:

Crawler | 2024 Share | 2025 Share | Trend
GPTBot | 4.7% | 11.7% | ↑ Growing
ClaudeBot | 6.0% | ~10% | ↑ Growing
Meta crawler | 0.9% | 7.5% | ↑ Surging
Amazonbot | 10.2% | 5.9% | ↓ Declining
Bytespider | 14.1% | 2.4% | ↓ Collapsing
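
The trend labels can be read more precisely by computing both the change in percentage points and the relative change for each crawler. The short sketch below does this with the figures from the table. The helper name `share_change` is an assumption for this example, and ClaudeBot's 2025 value is entered as 10.0 only because the table gives it as approximately 10%.

```python
# (2024 share, 2025 share) in percent, copied from the table above.
SHARES = {
    "GPTBot": (4.7, 11.7),
    "ClaudeBot": (6.0, 10.0),  # 2025 figure is approximate ("~10%")
    "Meta crawler": (0.9, 7.5),
    "Amazonbot": (10.2, 5.9),
    "Bytespider": (14.1, 2.4),
}


def share_change(old, new):
    """Return (percentage-point change, relative change in percent)."""
    points = round(new - old, 1)
    relative = round((new - old) / old * 100)
    return points, relative


if __name__ == "__main__":
    for crawler, (old, new) in SHARES.items():
        points, relative = share_change(old, new)
        print(f"{crawler}: {points:+.1f} points, {relative:+d}% relative")
```

The two measures tell different stories: GPTBot gains the most percentage points, while the Meta crawler shows the largest relative growth because it starts from a very small share.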

Key insight: Training crawlers now account for approximately 80% of all AI bot activity, with agentic real-time fetchers growing rapidly.


Key Resources for Staying Updated

  • Dark Visitors: darkvisitors.com/agents — Most comprehensive categorized bot database
  • GitHub ai-robots-txt: github.com/ai-robots-txt/ai.robots.txt — Community-maintained blocking list
  • Cloudflare AI Crawl Control: developers.cloudflare.com/ai-crawl-control/ — Enterprise blocking features
  • Cloudflare Radar Verified Bots: radar.cloudflare.com/traffic/verified-bots — Bot traffic statistics
  • Fastly Bot Management: docs.fastly.com/products/bot-management — CDN-level bot detection
  • Vercel Block AI Bots Template: vercel.com/templates/other/block-ai-bots-firewall-rule — Firewall rules

Critical Compliance Notes

Robots.txt is voluntary—it represents a social contract, not a legal enforcement mechanism. Key compliance concerns by company:

Company | Respects robots.txt | Publishes IPs | Official Docs | Concern Level
OpenAI | ✅ Yes | ✅ Yes | ✅ Yes | Low
Anthropic | ✅ Yes | ❌ No | ✅ Yes | Low
Google | ✅ Yes | ✅ Yes | ✅ Yes | Low
Meta | ⚠️ Partial | ❌ No | ✅ Yes | Medium
Microsoft | ✅ Yes | ✅ Yes | ✅ Yes | Low
Mistral | ✅ Yes | ✅ Yes | ✅ Yes | Low
Apple | ✅ Yes | ✅ Yes | ✅ Yes | Low
ByteDance | ❌ Often ignores | ❌ No | ❌ Limited | High
xAI (Grok) | ❌ Unknown | ❌ No | ❌ No | High
Perplexity | ⚠️ Controversial | ✅ Yes | ✅ Yes | Medium
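
Since robots.txt is voluntary, crawlers that ignore it can only be stopped at the server. One way to do that, sketched here as a small WSGI middleware using only the standard library, is to answer 403 when the User-Agent contains a blocked token. The names `block_ai_crawlers` and `BLOCKED_TOKENS` are assumptions for this example, and the approach does nothing against agents that send a standard browser user agent.

```python
# Example tokens only; choose the list to match the site's own policy.
BLOCKED_TOKENS = ("Bytespider", "GPTBot")


def block_ai_crawlers(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests with a blocked User-Agent token get a 403."""
    lowered = tuple(token.lower() for token in blocked)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Anything else passes through to the wrapped application.
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Trivial application used to show the wrapper in action."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = block_ai_crawlers(demo_app)
```

In production the same check is more commonly placed in the web server, CDN, or firewall configuration, but the logic is the same: match the token, refuse the request.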

User agent spoofing remains a significant concern. Bad actors and even some major companies (notably xAI) have been documented using standard browser user agents to bypass detection. IP-based verification using published ranges (where available) provides stronger enforcement than user agent matching alone.
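
IP-based verification can be sketched with Python's standard `ipaddress` module: a request counts as verified only if its source address falls inside a range published for the bot it claims to be. The ranges below are placeholders from the reserved documentation blocks, not real crawler ranges, and the names `PUBLISHED_RANGES` and `is_verified` are assumptions for this example. Real ranges would have to be loaded from each operator's published list.

```python
import ipaddress

# Placeholder CIDR blocks (reserved for documentation), NOT real crawler ranges.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ExampleBot": ["198.51.100.0/24"],
}


def is_verified(claimed_token, remote_ip, ranges=PUBLISHED_RANGES):
    """Return True if remote_ip lies in a range published for claimed_token."""
    cidrs = ranges.get(claimed_token)
    if not cidrs:
        # No published ranges: the claim cannot be verified this way.
        return False
    address = ipaddress.ip_address(remote_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


if __name__ == "__main__":
    print(is_verified("GPTBot", "192.0.2.15"))   # inside the placeholder range
    print(is_verified("GPTBot", "203.0.113.9"))  # outside it: likely spoofed
```

A request that claims a known bot token but fails this check is a candidate for blocking, since the user agent alone is trivially forged.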

This document reflects the AI bot landscape as of November 2025. New crawlers emerge frequently—regular updates to blocking lists are essential for webmasters seeking to control AI access to their content.

