An overview of AI bots, distinguishing between training data scrapers used for LLM development and agentic bots designed for autonomous, goal-oriented tasks.
Many website owners are blocking AI bots to protect their content from being scraped. But in doing so, they might accidentally turn away helpful AI visitors that drive real traffic to their sites.
To manage this, it helps to understand the difference between training data scrapers and agentic AI. Training scrapers act like digital vacuum cleaners. They passively gather massive amounts of text and images to train large language models, or LLMs. These are bots like OpenAI's GPTBot or ByteDance's Bytespider, which website owners can typically block using standard server settings.
On the other side are agentic bots and AI search assistants. These are active systems that do not just collect data; they perform tasks. They can navigate websites, make purchases, and search the web in real-time to answer user questions. Blocking them means missing out on search engine visibility and actual customers.
This is creating a new challenge for web administrators. While many training scrapers respect standard blocking rules, some bots bypass these limits by mimicking human behavior. Managing website traffic now requires a careful balance, protecting intellectual property without slamming the door on the helpful AI agents that bring value to your business.
Many site owners block bots to protect their intellectual property, but in doing so, they might be turning away the “good” AI traffic—like search engines and assistants that drive real visitors your way. Let’s break it down so you can decide wisely.
These bots are designed for bulk data acquisition to fuel AI model development. Common user agents help site owners block them via robots.txt.

| Bot Name | Developer/Organization | Primary Purpose | Example User Agent |
|---|---|---|---|
| GPTBot | OpenAI | Crawls for ChatGPT training data | GPTBot/1.1 |
| ClaudeBot | Anthropic | Collects data for Claude models | ClaudeBot/1.0 |
| Google-Extended | Google | Gathers extended web data for AI enhancements | Google-Extended |
| Amazonbot | Amazon | Supports AWS AI services and model training | Amazonbot |
| Applebot-Extended | Apple | Collects data for Apple Intelligence features | Applebot-Extended |
| Bytespider | ByteDance (TikTok) | Data for recommendation and generative AI | Bytespider |
| CCBot | Common Crawl | Open dataset for AI research and training | CCBot |
| Diffbot | Diffbot | Structured data extraction for AI datasets | Diffbot |
| cohere-ai | Cohere | Builds datasets for enterprise AI models | cohere-ai |
| PerplexityBot | Perplexity | Indexes web for AI search and training | PerplexityBot/1.0 |
| OAI-SearchBot | OpenAI | On-demand crawling for model improvements | OAI-SearchBot |
| AI2Bot | Allen Institute for AI | Academic AI research data collection | AI2Bot |
| YouBot | You.com | Data for personalized AI search engines | YouBot |
| Mistral Bot | Mistral AI | Training open-source LLMs | MistralAI-User |
| PetalBot | Huawei | Data for Huawei’s AI ecosystem | PetalBot |
| ImagesiftBot | Imagesift | Image-focused scraping for visual AI | ImagesiftBot |
| Omgili Bot | Webz.io (Omgili) | Consumer insights data for AI analytics | Omgili |

These bots go beyond data collection, using reasoning to adapt and act independently. They often mimic human workflows but can introduce risks like unintended actions.

| Bot Name | Developer/Organization | Key Capabilities | Example Use Case |
|---|---|---|---|
| ChatGPT Agent | OpenAI | Autonomous web navigation, form filling | E-commerce purchases, research tasks |
| Claude Computer Use | Anthropic | Desktop interaction, multi-tool orchestration | Software troubleshooting, file management |
| Perplexity Comet | Perplexity | Goal-directed browsing and task execution | Travel booking, market analysis |
| Siri | Apple | Voice-activated task automation | Scheduling, smart home control |
| Google Assistant | Google | Proactive planning and API integration | Route optimization, reminders |
| Alexa | Amazon | Ecosystem-wide automation | Shopping lists, device control |
| Auto-GPT | Open-source (Significant Gravitas) | Self-prompting for complex goals | Code generation, content creation |
| BabyAGI | Open-source (Yohei Nakajima) | Task prioritization and execution loops | Project management simulations |
| Clara (formerly x.ai) | X.ai | Meeting scheduling and calendar management | Automated appointment booking |
| DeckardAgent | Deckard Protocol | On-chain verification and task execution | Crypto trading, reputation scoring |
| Delivery Hero Data Analyst | Delivery Hero | Predictive analytics and decision-making | Inventory forecasting |
| eBay RecSys Agent | eBay | Recommendation and personalization engine | Product suggestions in real-time |
| Uber Agentic RAG | Uber | Retrieval-augmented task handling | Ride optimization and support |

The proliferation of AI bots represents a transformative shift in how machines interact with the digital world, blending automation with intelligence. As of late 2025, these bots are reshaping industries from e-commerce to cybersecurity, but they also spark debates over privacy, resource consumption, and ethical data use. This survey synthesizes insights from technical documentation, industry reports, and real-time discussions to provide a detailed examination. It expands on the two core categories (training data scrapers and agentic bots) while exploring overlaps, trends, and implications. All examples are verified against primary sources, emphasizing user agents for scrapers and functional architectures for agents.
AI bots defy simple binaries, but the two categories used here map onto two dominant paradigms. Training data scrapers function as digital vacuum cleaners, traversing the web to amass unstructured data for LLM pre-training. They prioritize volume and breadth, and are often identified by distinctive user agents that developers publish so sites can opt out through mechanisms like robots.txt. These bots have surged in activity (AI traffic now accounts for up to 21% of requests on top websites), straining servers and prompting legal challenges over intellectual property.
In contrast, agentic AI bots embody autonomy, leveraging LLMs for planning, reflection, and adaptation in multi-step workflows. Unlike scrapers, they operate reactively or proactively toward user-defined goals, integrating tools like browsers or APIs. This “agentic” quality, a term from recent literature, marks a leap in maturity from rule-based automation (e.g., traditional RPA) to goal-oriented systems capable of error correction and sub-task delegation.
A third gray area, retrieval-augmented generation (RAG) systems, bridges the two: they scrape on demand to answer queries rather than in bulk for training, but their agent-like retrieval puts them closer to the agentic category here.
The distinction matters for web administrators: scrapers can be blocked statically, while agentic bots often evade via session mimicry, simulating human behavior to complete forms or transactions. Ethically, scrapers fuel innovation but risk “data colonialism,” while agentic bots amplify productivity yet introduce vulnerabilities like hallucination-driven errors or malicious misuse in ransomware.
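As a rough illustration of static blocking, the sketch below builds a robots.txt that disallows a few training scrapers while leaving search and user-triggered fetchers alone, then checks the result with Python's standard `urllib.robotparser`. The user-agent tokens come from the tables in this article. The split between blocked and allowed bots, the helper name `build_robots_txt`, and the `example.com` URL are assumptions made for the example, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the tables in this article. Which ones to block is an
# illustrative policy choice, not a recommendation.
BLOCKED_TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Google-Extended"]
ALLOWED_ASSISTANT_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]


def build_robots_txt(blocked, allowed):
    """Return robots.txt text that disallows `blocked` and allows `allowed`."""
    lines = []
    for token in blocked:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    for token in allowed:
        lines += [f"User-agent: {token}", "Allow: /", ""]
    # Every other client keeps default access.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(BLOCKED_TRAINING_BOTS, ALLOWED_ASSISTANT_BOTS)

# Interpret the rules the way a compliant crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/article"  # hypothetical page
print("GPTBot allowed:", parser.can_fetch("GPTBot", url))
print("OAI-SearchBot allowed:", parser.can_fetch("OAI-SearchBot", url))
```

This only governs bots that choose to honor robots.txt. Agents that browse with an ordinary browser user agent never match any of these groups, which is why the static approach does not reach them.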
These bots underpin the AI boom, with OpenAI and Anthropic leading in visibility. Their operations are typically non-interactive, focusing on ethical crawling guidelines (e.g., respecting noindex tags), though enforcement varies. Below is an augmented table with additional details on deployment scale and controversies.

| Bot Name | Developer/Organization | Primary Purpose | Example User Agent | Notable Impact/Controversy |
|---|---|---|---|---|
| GPTBot | OpenAI | Core data for GPT series training | GPTBot/1.1; +https://openai.com/gptbot | High-volume crawler; blocked by 20% of Fortune 500 sites over bandwidth concerns |
| ClaudeBot | Anthropic | Enhances Claude’s safety-aligned models | ClaudeBot/1.0; +claudebot@anthropic.com | Emphasizes constitutional AI; lower opt-out rates due to transparency |
| Google-Extended | Google | Supplements Bard/Gemini with real-time web data | Google-Extended | Integrated with search; criticized for evading robots.txt in some cases |
| Amazonbot | Amazon | Fuels AWS Bedrock and Alexa improvements | Amazonbot | E-commerce bias in datasets; used in 40% of cloud AI workloads |
| Applebot-Extended | Apple | Powers Apple Intelligence features | Applebot-Extended | Privacy-focused but expansive; iOS integration boosts mobile scraping |
| Bytespider | ByteDance (TikTok) | Recommendation algorithms and Doubao AI | Bytespider | Social media data hoarding; regulatory scrutiny in EU |
| CCBot | Common Crawl | Nonprofit dataset for open AI research | CCBot | Powers 80% of public LLM benchmarks; no commercial restrictions |
| Diffbot | Diffbot | Knowledge graph building for enterprise AI | Diffbot | API-driven; charges for premium extracts |
| cohere-ai | Cohere | Custom enterprise model training | cohere-ai | B2B focus; integrates with Slack for data pulls |
| PerplexityBot | Perplexity | Indexes for answer-engine training | PerplexityBot/1.0; +https://perplexity.ai | Blurs scraper/search lines; sued for unattributed summaries |
| OAI-SearchBot | OpenAI | Iterative model refinement | OAI-SearchBot | Variant of GPTBot; on-demand triggers |
| AI2Bot | Allen Institute for AI | Semantic Scholar enhancements | AI2Bot | Academic purity; open datasets only |
| YouBot | You.com | Personalized AI search training | YouBot | Privacy-centric; user-consent models |
| Mistral Bot | Mistral AI | Open-weight LLM datasets | MistralAI-User | European GDPR compliance emphasis |
| PetalBot | Huawei | Pangu model ecosystem | PetalBot | Geopolitical blocks in US; mobile-first |
| ImagesiftBot | Imagesift | Visual AI training (e.g., diffusion models) | ImagesiftBot | Niche for image gen; copyright lawsuits pending |
| Omgili Bot | Webz.io (Omgili) | Trend analysis for AI insights | Omgili | B2B analytics; low public visibility |

Agentic bots are the “doers” of the AI world, often built on frameworks like LangChain or AutoGen. Their rise coincides with multimodal LLMs, enabling everything from virtual shopping to DeFi trading. Early examples like Siri (2011) were reactive; modern ones, like Claude Computer Use, handle stateful sessions autonomously. In DeFi, bots like DeckardAgent exemplify on-chain agency, verifying tasks via blockchain for trustless execution. Challenges include “hallucination cascades” in long workflows and security risks, as seen in agentic ransomware simulations.

| Bot Name | Developer/Organization | Key Capabilities | Example Use Case | Maturity Level (Low/Med/High) |
|---|---|---|---|---|
| ChatGPT Agent | OpenAI | Web simulation, API chaining | Autonomous e-commerce (e.g., adding to cart) | High |
| Claude Computer Use | Anthropic | Screen interaction, tool orchestration | Debugging code in IDEs | High |
| Perplexity Comet | Perplexity | Browser automation, research synthesis | Multi-site price comparison | Med |
| Siri | Apple | Voice/NLP task decomposition | Home automation sequences | High |
| Google Assistant | Google | Predictive planning, ecosystem integration | Travel itinerary building | High |
| Alexa | Amazon | Skill-based workflows, IoT control | Grocery reordering | High |
| Auto-GPT | Open-source | Recursive goal decomposition | Full project ideation to execution | Med |
| BabyAGI | Open-source | Task queue management | Agile sprint planning | Low |
| Clara | X.ai | Natural language scheduling | Email-based meeting coordination | High |
| DeckardAgent | Deckard Protocol | Blockchain-verified actions | DeFi yield farming automation | Med |
| Delivery Hero Data Analyst | Delivery Hero | Anomaly detection, forecasting | Menu optimization | Med |
| eBay RecSys Agent | eBay | Dynamic personalization | Auction bidding assistance | High |
| Uber Agentic RAG | Uber | Query-driven routing | Surge prediction and rerouting | High |
| Sales Lead Agent | Various (e.g., ThoughtSpot) | Lead scoring, outreach | CRM integration for follow-ups | Med |
| Security Threat Agent | Various (e.g., Exabeam) | Real-time anomaly response | Network intrusion blocking | High |
| DevOps Code Agent | Various (e.g., GitHub Copilot extensions) | Bug triaging, deployment | CI/CD pipeline automation | Med |

By 2026, agentic bots could dominate, with projections of 1300% growth in AI traffic driven by autonomous shopping and DeFi. Hybrid systems (e.g., scrapers feeding agentic loops) are emerging, as in Virtual Protocol’s on-chain agents. For balance, counterarguments highlight equity: without open-source alternatives, these bots may entrench Big Tech dominance, exacerbating biases in training data. Mitigation strategies include AI-specific robots.txt standards and watermarking for generated content. In controversial realms like Black Friday bots, agentic systems enable “weaponized” deal-sniping, underscoring the need for empathetic design that prioritizes human oversight.
This landscape demands vigilance: while scrapers democratize data access, agentic bots promise efficiency gains of 30-50% in workflows, per industry benchmarks. Stakeholders should monitor updates via repositories like ai.robots.txt for evolving lists.
This reference document catalogs 100+ known AI bots organized by their primary function. Training Data Scrapers collect web content to train AI models, while Agentic bots perform autonomous tasks, browse the web, and act on behalf of users. The AI bot landscape has exploded since 2023, with Cloudflare reporting that AI crawler traffic now accounts for over 80% of all bot activity on many networks.
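To make the functional grouping concrete, here is a minimal sketch that sorts an incoming `User-Agent` header into one of the categories used in this document by simple token matching. The token lists are a small subset copied from the tables in this catalog. The category keys (`training`, `search`, `user_fetch`) and the function name `classify_user_agent` are hypothetical names chosen for the example.

```python
# Category -> user-agent tokens, copied from the tables in this document.
# The lists are deliberately partial and will go stale as vendors add bots.
BOT_CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "meta-externalagent"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "DuckAssistBot"],
    "user_fetch": ["ChatGPT-User", "Claude-User", "Perplexity-User", "MistralAI-User"],
}


def classify_user_agent(user_agent: str) -> str:
    """Return the category of the first matching token, or 'unknown'."""
    ua = user_agent.lower()
    for category, tokens in BOT_CATEGORIES.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return "unknown"


samples = [
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]
for ua in samples:
    print(classify_user_agent(ua))
```

Anything that presents a plain browser user agent falls into `unknown`, so this kind of matching only separates bots that identify themselves.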
These crawlers collect web content primarily for AI/LLM model training. Blocking via robots.txt is the primary defense, though compliance varies significantly.

| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| GPTBot | OpenAI | | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.1; +https://openai.com/gptbot) |
| ClaudeBot | Anthropic | Downloads training data for Claude models. Replaced deprecated anthropic-ai crawler in July 2024. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) |
| Google-Extended | Google | Controls whether content trains Gemini and Vertex AI. Not a separate crawler; a robots.txt control token only. | Uses standard Googlebot user agents |
| meta-externalagent | Meta | Collects content for Meta AI/LLaMA training. Launched July 2024. May bypass robots.txt. | meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) |
| FacebookBot | Meta | Crawls for Meta’s speech recognition and LLM training. | FacebookBot/1.0 |
| Bytespider | ByteDance | Training data for Doubao LLM. Extremely aggressive; accounts for up to 90% of AI crawler traffic on some networks. Often ignores robots.txt. | Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) |
| Applebot-Extended | Apple | Controls whether Applebot-crawled content trains Apple Intelligence. Introduced June 2024 at WWDC. | Mozilla/5.0 (Macintosh) AppleWebKit/605.1.15 (Applebot-Extended/0.1; +http://www.apple.com/go/applebot) |
| Amazonbot | Amazon | Indexes content for Alexa AI-powered answers and product recommendations. | Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) |
| cohere-ai | Cohere | Gathers text data for Cohere’s Command and Embed models. | cohere-ai |
| cohere-training-data-crawler | Cohere | Dedicated NLP training data collection. | cohere-training-data-crawler |
| CCBot | Common Crawl | | CCBot/2.0 (https://commoncrawl.org/faq/) |
| AI2Bot | Allen Institute for AI | Indexes content for Semantic Scholar and AI research tools. | AI2Bot |
| AI2Bot-Dolma | Allen Institute for AI | Collects diverse web data for Dolma dataset, used to pretrain OLMo models. | AI2Bot-Dolma |
| ICC-Crawler | NICT (Japan) | Multilingual translation and AI research data collection. | ICC-Crawler |
| LCC | University of Leipzig | Linguistic corpora for NLP research. | LCC |
| Cotoyogi | Japan ROIS | Japanese AI training datasets. | Cotoyogi |
| PanguBot | | | PanguBot |
| ChatGLM-Spider | Zhipu AI | Training data for ChatGLM models. | ChatGLM-Spider |
| imageSpider | ByteDance | Collects images for ByteDance’s AI image models. | imageSpider |
| SBIntuitionsBot | SB Intuitions | Japanese language model training. | SBIntuitionsBot |
| Diffbot | Diffbot | | Diffbot |
| Omgilibot / omgili | Webz.io | Web monitoring service that sells crawled data to LLM companies. | Omgilibot, omgili |
| webzio-extended | Webz.io | Extended web crawl data specifically for AI training. | webzio-extended |
| VelenPublicWebCrawler | Velen/Hunter | Builds business datasets for machine learning models. | VelenPublicWebCrawler |
| ImagesiftBot | The Hive | Scrapes images for reverse search. Associated with image generation model training. | ImagesiftBot |
| laion-huggingface-processor | LAION | Image dataset collection for text-to-image AI (Stable Diffusion). | laion-huggingface-processor |
| img2dataset | Open Source | Downloads image datasets for ML training. | img2dataset |
| Kangaroo Bot | Kangaroo LLM | Australian language AI training data. | Kangaroo Bot |
| Timpibot | Timpi | Decentralized search engine and LLM training. | Timpibot |
| Spider | Spider | AI projects and RAG systems data collection. | Spider |
| Datenbank Crawler | netEstate | International website data collection. | Datenbank Crawler |
| DataForSeoBot | | | DataForSeoBot |
| SemrushBot-OCOB | Semrush | ContentShake AI tool for content analysis and recommendations. | SemrushBot-OCOB |
| AwarioBot | Awario | Social listening and brand monitoring AI. | AwarioBot |
| AwarioSmartBot | Awario | Enhanced social analytics. | AwarioSmartBot |
| Meltwater | Meltwater | Media intelligence and AI-driven consumer insights. | Meltwater |
| Sentibot | SentiOne | Social listening and sentiment analysis AI training. | Sentibot |
| peer39_crawler | Peer39 | AI-driven contextual advertising analysis. | peer39_crawler |
| Seekr | Seekr | Content analysis and AI model development for brand safety. | Seekr |
| aiHitBot | aiHitdata | Uses AI/ML to build company information databases. | aiHitBot |
| Factset_spyderbot | FactSet | Financial AI solutions data collection. | Factset_spyderbot |
| TurnitinBot | | | TurnitinBot |
| FirecrawlAgent | Firecrawl | Converts web data to markdown for LLM applications. | FirecrawlAgent |
| netEstate Imprint Crawler | netEstate | AI data scraper for international websites. | netEstate Imprint Crawler |
| Google-CloudVertexBot | Google | Associated with Vertex AI platform training. | Google-CloudVertexBot |
| GoogleOther | Google | Generic internal R&D crawls, potentially including AI training. | GoogleOther |
| GoogleOther-Image | Google | Image fetching for Google R&D. | GoogleOther-Image |
| GoogleOther-Video | Google | Video fetching for Google R&D. | GoogleOther-Video |
| anthropic-ai | Anthropic | | Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html) |
| Claude-Web | Anthropic | Legacy/undocumented crawler, likely deprecated. | Claude-Web/1.0 (web crawler; +https://www.anthropic.com/) |

These bots index web content for AI-powered search engines rather than model training. They bridge the gap between traditional search and AI assistants.

| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| OAI-SearchBot | OpenAI | Indexes websites for ChatGPT Search/SearchGPT. NOT used for model training. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot |
| Claude-SearchBot | Anthropic | Creates search index for Claude’s embedded search feature. | Claude-SearchBot |
| PerplexityBot | Perplexity AI | Indexes content for Perplexity’s AI search. Does not train own models. Controversial reports of ignoring robots.txt. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) |
| YouBot | You.com | Indexes content for You.com AI search. | YouBot |
| PetalBot | Huawei | Indexes for Huawei’s Petal Search engine and AI Search services. | PetalBot |
| DuckAssistBot | DuckDuckGo | Collects data for DuckAssist AI-generated answers. | DuckAssistBot |
| LinkupBot | Linkup | Enterprise AI search indexing. | LinkupBot |
| AddSearchBot | AddSearch | AI-powered site search indexing. | AddSearchBot |
| ZanistaBot | Zanista | AI search crawler. | ZanistaBot |
| Applebot | Apple | Powers Siri and Spotlight search. | Mozilla/5.0 (Macintosh) AppleWebKit/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) |

These systems perform autonomous tasks, browse the web interactively, execute actions, and act on behalf of users. This category has exploded since late 2024.
These bots fetch web content in real-time when users make requests—distinct from background training crawlers.

| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| ChatGPT-User | OpenAI | Fetches web content on-demand when users request real-time information. NOT used for model training. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot |
| Claude-User | Anthropic | Fetches content when Claude users need real-time answers. | Claude-User |
| Perplexity-User | Perplexity AI | Crawls based on user requests for real-time retrieval. May ignore robots.txt for user-initiated queries. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://www.perplexity.ai/useragent) |
| MistralAI-User | Mistral AI | Web browsing for Le Chat assistant. NOT used for training data collection. | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots) |
| meta-externalfetcher | Meta | User-initiated link fetches for Meta AI products. May bypass robots.txt. | meta-externalfetcher/1.1 |
| facebookexternalhit | Meta | Link previews and Meta AI search real-time retrieval. | facebookexternalhit/1.1 |

These represent the cutting edge of agentic AI: systems that can navigate websites, click buttons, fill forms, and complete multi-step tasks autonomously.

| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| OpenAI Operator / ChatGPT Agent | OpenAI | Full autonomous web browsing via remote browser. GUI interaction, form filling, multi-step task execution. Powered by Computer-Using Agent (CUA) model. Achieves 87% on WebVoyager benchmark. | Uses standard Chrome user agent (indistinguishable) |
| Claude Computer Use | Anthropic | Full desktop computer control via screenshots (mouse, keyboard, browser). Operates in Docker containers. Available via API. | Uses standard browser user agents in container |
| Google Project Mariner | Google DeepMind | Chrome browser automation via extension. Cursor movement, clicking, typing. Achieves 83.5% on WebVoyager. Available to AI Ultra subscribers ($249.99/month). | GoogleAgent-Mariner |
| Gemini Deep Research | Google | Multi-step research exploration with autonomous browsing. Renders JavaScript unlike most AI crawlers. | Gemini-Deep-Research |
| Google NotebookLM | Google | AI research assistant with document analysis and web access. | Google-NotebookLM |
| Perplexity Comet | Perplexity AI | AI-native Chromium browser with autonomous browsing, clicking, scrolling. Supports agentic commerce via PayPal integration. | Uses Perplexity-User agent |
| Microsoft Copilot (Computer Use) | Microsoft | Virtual mouse/keyboard control via Windows 365 VMs. Multi-tab reasoning and autonomous browsing in Edge. | Uses Bingbot for indexing |
| Amazon NovaAct | Amazon | Amazon’s AI agent for web browsing and task completion. | Not published |
| Devin | Cognition Labs | Fully autonomous software engineering agent with shell, editor, and browser access. Handles complex multi-step development tasks. | Devin |
| bigsur.ai | Big Sur AI | AI-powered web agents and sales assistants. | bigsur.ai |
| AI2Bot-DeepResearchEval | | | AI2Bot-DeepResearchEval |
| LinerBot | Liner | AI assistant for academic source discovery and research. | LinerBot |
| Poggio-Citations | Poggio | AI sales enablement citation gathering. | Poggio-Citations |

These autonomous agents write, debug, test, and deploy code with minimal human intervention.

| Bot Name | Company | Description/Purpose | User Agent String |
|---|---|---|---|
| GitHub Copilot Coding Agent | GitHub/Microsoft | Autonomous code implementation from GitHub Issues. Creates PRs, runs tests, responds to code review. Available with Copilot Pro/Business/Enterprise. | N/A (server-side) |
| Cursor AI Agent | Anysphere | Full codebase understanding, multi-file editing, terminal execution. Runs 8 parallel agents in Cursor 2.0. Valued at $9.9B. | N/A (IDE-based) |
| Devin | Cognition Labs | Fully autonomous software engineer: plans, writes, debugs, tests, deploys. Achieves 13.86% on SWE-bench unassisted. | Devin |
| Replit Agent 3 | Replit | Autonomous app building (200 minutes continuous), self-testing, self-healing code. Can build other agents. | N/A (platform-based) |
| Amazon Q Developer | AWS | Autonomous code generation, Java modernization, security remediation. | N/A (IDE/console-based) |
| QualifiedBot | | | QualifiedBot |

These open-source frameworks enable building custom agentic AI systems.

| Framework Name | Company/Creator | Description/Purpose | Notable Capabilities |
|---|---|---|---|
| AutoGPT | Significant Gravitas | Autonomous goal-directed task execution with web browsing, file access, code execution. 107,000+ GitHub stars. | Multi-modal, visual builder, iterative self-improvement |
| BabyAGI | Yohei Nakajima | Minimalist task creation, prioritization, and execution loop (~140 lines of code). Inspired 42+ academic papers. | Vector database memory, adaptive learning |
| LangChain / LangGraph | LangChain Inc. | Modular agent building with graph-based multi-agent orchestration. Production use at Klarna, Uber, LinkedIn. | Cyclical execution, tool integration |
| CrewAI | CrewAI | Role-based AI agent “crews” mimicking human team structures. 5.76x faster than LangGraph. Used by 60% of Fortune 500. | Agent collaboration, task delegation |
| Microsoft AutoGen | Microsoft Research | Multi-agent conversations with rich multi-turn reasoning. Event-driven architecture in v0.4. | Customizable behaviors, open source |
| MetaGPT | Open Source | Simulates software development teams with role-based agents (PM, architect, engineer). | Autonomous software engineering |

Some AI companies have been documented using standard browser user agents to avoid detection and robots.txt blocking.

| Bot Name | Company | Status | Notes |
|---|---|---|---|
| xAI Grok | xAI (Elon Musk) | User agent unknown | Grok confirmed via X that it uses iPhone user-agent strings to avoid blocks. No official documentation. Webmasters report never seeing Grok-specific user agents in logs. |
| DeepSeekBot | DeepSeek | Unofficial/placeholder | Rarely documented; Chinese AI company with minimal crawler transparency. |
| OpenAI Operator (Atlas browser) | OpenAI | Mimics Chrome | Uses identical Chrome user agent, indistinguishable from regular browsers. |

| Mechanism | Source | Syntax | Purpose |
|---|---|---|---|
| | | DisallowAITraining: / | Blocks all AI training crawlers with single rule |
| Content-Usage | Google | Content-Usage: ai=n | Allows crawling but prevents AI training use |
| ai.txt | Community | New file format | Dedicated AI crawler configuration separate from robots.txt |

Cloudflare’s 2025 data reveals significant shifts in AI crawler market share:

| Crawler | 2024 Share | 2025 Share | Trend |
|---|---|---|---|
| GPTBot | 4.7% | 11.7% | ↑ Growing |
| ClaudeBot | 6.0% | ~10% | ↑ Growing |
| Meta crawler | 0.9% | 7.5% | ↑ Surging |
| Amazonbot | 10.2% | 5.9% | ↓ Declining |
| Bytespider | 14.1% | 2.4% | ↓ Collapsing |

Key insight: Training crawlers now account for approximately 80% of all AI bot activity, with agentic real-time fetchers growing rapidly.
Robots.txt is voluntary—it represents a social contract, not a legal enforcement mechanism. Key compliance concerns by company:

| Company | Respects robots.txt | Publishes IPs | Official Docs | Concern Level |
|---|---|---|---|---|
| OpenAI | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| Anthropic | ✅ Yes | ❌ No | ✅ Yes | Low |
| Google | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| Meta | ⚠️ Partial | ❌ No | ✅ Yes | Medium |
| Microsoft | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| Mistral | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| Apple | ✅ Yes | ✅ Yes | ✅ Yes | Low |
| ByteDance | ❌ Often ignores | ❌ No | ❌ Limited | High |
| xAI (Grok) | ❌ Unknown | ❌ No | ❌ No | High |
| Perplexity | ⚠️ Controversial | ✅ Yes | ✅ Yes | Medium |

User agent spoofing remains a significant concern. Bad actors and even some major companies (notably xAI) have been documented using standard browser user agents to bypass detection. IP-based verification using published ranges (where available) provides stronger enforcement than user agent matching alone.
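A minimal sketch of the IP check, using Python's standard `ipaddress` module: a request counts as a verified bot only when its user agent names a known crawler and its source address falls inside that crawler's published ranges. The CIDR blocks below are reserved documentation networks used as placeholders, not real crawler addresses. The `PUBLISHED_RANGES` mapping and the function name `is_verified_bot` are assumptions for the example, and a real deployment would load each vendor's current list.

```python
import ipaddress

# Hypothetical published ranges. These are reserved documentation networks
# (RFC 5737), NOT real crawler IPs; load the vendor's current list instead.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def is_verified_bot(user_agent: str, remote_ip: str) -> bool:
    """True only if the UA names a known bot AND the IP is in its ranges."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for token, cidrs in PUBLISHED_RANGES.items():
        if token.lower() in ua:
            return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False  # unknown user agent: nothing to verify against


claimed = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.1; +https://openai.com/gptbot)"
print(is_verified_bot(claimed, "192.0.2.10"))   # inside the placeholder range
print(is_verified_bot(claimed, "203.0.113.5"))  # same UA, address outside it
```

A request that claims to be GPTBot from outside the listed ranges is likely spoofed. The reverse case, a bot hiding behind a browser user agent, is not caught by this check and needs behavioral signals instead.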
This document reflects the AI bot landscape as of November 2025. New crawlers emerge frequently—regular updates to blocking lists are essential for webmasters seeking to control AI access to their content.
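One way to keep a blocking list current is to compare the tokens already disallowed in robots.txt against a reference list and report what is missing. The sketch below does that with a deliberately small parser that only recognizes full-site `Disallow: /` rules. The function names and the sample inputs are hypothetical, and the reference list would come from whichever catalog you maintain.

```python
def blocked_tokens(robots_txt: str) -> set[str]:
    """Collect user-agent tokens that have a full-site 'Disallow: /' rule."""
    blocked, current = set(), []
    in_rules = False  # True once the current group's rule lines have started
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(current)
    return blocked


def missing_from_blocklist(robots_txt: str, reference_tokens: list[str]) -> list[str]:
    """Return reference tokens that robots.txt does not yet fully disallow."""
    have = {token.lower() for token in blocked_tokens(robots_txt)}
    return [token for token in reference_tokens if token.lower() not in have]


current_rules = "User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
reference = ["GPTBot", "ClaudeBot", "CCBot", "meta-externalagent"]
print(missing_from_blocklist(current_rules, reference))
```

Running this on a schedule against an updated reference list gives a short to-do list of crawlers that have appeared since the rules were last edited.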
This article is featured in Moz Top 10.