
# A Token Count Analysis of 45,000 Real-World URLs
We recently analyzed 44,684 web pages and measured their content length using Gemini’s token counter. The results reveal fascinating insights about the true scale of web content—and why it matters for AI applications.
| Metric | Value |
|---|---|
| Total Pages Analyzed | 44,684 |
| Page Content Tokens | 464,854,727 |
| Total Tokens (all) | 541,062,817 |
The median web page contains roughly 3,200 tokens—equivalent to about 2,400 words or approximately 5 pages of text. However, the average is significantly higher at 10,400 tokens, indicating a strong right-skew from lengthy documents.
| Metric | Tokens |
|---|---|
| Median | 3,201 |
| Average | 10,403 |
| 25th percentile | 1,396 |
| 75th percentile | 8,207 |
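The gap between mean and median is the classic signature of a right-skewed distribution: a handful of huge pages drags the average up while the median barely moves. A minimal illustration with synthetic numbers (not the real dataset):

```python
from statistics import mean, median

# Synthetic token counts: many typical pages plus a few huge outliers.
pages = [3_000] * 95 + [500_000] * 5  # 95 typical pages, 5 giant ones

print(median(pages))  # 3000.0 — unmoved by the outliers
print(mean(pages))    # 27850.0 — dragged up ~9x by just 5% of the pages
```

Here 5% of pages are enough to push the average to nine times the median, the same mechanism behind the 10,403 vs. 3,201 gap above.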
## Distribution Breakdown
Half of all web pages fall between 1,000 and 5,000 tokens. This represents the “typical” article, blog post, or informational page.
| Token Range | Pages | Percentage |
|---|---|---|
| 0 – 1,000 | 6,229 | 13.9% |
| 1,000 – 5,000 | 22,299 | 49.9% |
| 5,000 – 10,000 | 6,629 | 14.8% |
| 10,000 – 50,000 | 8,048 | 18.0% |
| 50,000 – 100,000 | 806 | 1.8% |
| 100,000 – 500,000 | 657 | 1.5% |
| 500,000+ | 16 | 0.04% |
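A distribution like the table above can be tallied with a short helper; the bucket boundaries below mirror the table, but the function names are ours, not from the analysis pipeline:

```python
from collections import Counter

# Token-range buckets from the table above: (low, high-exclusive, label).
BUCKETS = [
    (0, 1_000, "0 – 1,000"),
    (1_000, 5_000, "1,000 – 5,000"),
    (5_000, 10_000, "5,000 – 10,000"),
    (10_000, 50_000, "10,000 – 50,000"),
    (50_000, 100_000, "50,000 – 100,000"),
    (100_000, 500_000, "100,000 – 500,000"),
]

def bucket_label(tokens: int) -> str:
    """Map a page's token count to its distribution bucket."""
    for lo, hi, label in BUCKETS:
        if lo <= tokens < hi:
            return label
    return "500,000+"

def distribution(token_counts):
    """Count pages per bucket, as in the table above."""
    return Counter(bucket_label(t) for t in token_counts)
```

For example, `distribution([500, 3_201, 12_000])` yields one page in each of the first, second, and fourth buckets.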
Nearly 1 in 5 pages (18%) contain between 10,000 and 50,000 tokens—these are longer articles, comprehensive guides, or pages with significant supplementary content.
## The Long Tail
Percentile analysis reveals the extreme outliers:
| Percentile | Tokens |
|---|---|
| 90th | 21,839 |
| 95th | 35,852 |
| 99th | 141,410 |
| Maximum | 3,004,502 |
The top 1% of pages exceed 140,000 tokens—more than 200 pages of text at the ~480 words-per-page rate used above. These are typically:
- Full PDF documents (research papers, reports)
- Documentation sites
- Long-form educational content
- Scraped book chapters
The largest page in our dataset contained over 3 million tokens—roughly 2.25 million words, or the text of more than 20 full-length novels.
## Implications for AI Systems

### Context Window Considerations
With major LLMs offering context windows from 32K to 2M tokens, our findings suggest:
- 95% of web pages fit comfortably in a 128K context window
- The median page (3,201 tokens) leaves ample room for multi-page retrieval
- Only 0.04% of pages exceed typical context limits
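These limits suggest a simple pre-flight check before putting a page into a prompt. A sketch, assuming a 128K-token window; the reserve and overhead numbers are hypothetical placeholders:

```python
CONTEXT_WINDOW = 128_000   # e.g., a 128K-token model
RESPONSE_RESERVE = 4_000   # headroom reserved for the model's output

def fits_in_context(page_tokens: int, prompt_overhead: int = 500) -> bool:
    """True if the page plus prompt scaffolding fits within the window."""
    return page_tokens + prompt_overhead + RESPONSE_RESERVE <= CONTEXT_WINDOW

fits_in_context(3_201)    # True  (the median page)
fits_in_context(141_410)  # False (a 99th-percentile page)
```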
### RAG System Design
For Retrieval-Augmented Generation systems:
- Chunk wisely: The median page is ~3K tokens—consider this when designing chunk sizes
- Handle outliers: The 99th percentile is 44x the median. Long-form content needs different treatment
- Budget for variety: A 10-document retrieval could range from ~14K tokens (ten 25th-percentile pages) to 350K+ tokens (ten 95th-percentile pages)
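The variance point above can be made concrete with a greedy packing sketch. The helper is hypothetical, and for simplicity it skips oversize pages rather than truncating them:

```python
def retrieval_budget(page_token_counts, budget=100_000):
    """Greedily pack retrieved pages into a token budget.

    Pages are considered in the given (relevance) order; a page that
    would overflow the budget is skipped, not truncated.
    Returns (kept_page_sizes, tokens_used).
    """
    kept, used = [], 0
    for tokens in page_token_counts:
        if used + tokens <= budget:
            kept.append(tokens)
            used += tokens
    return kept, used

# Ten median-sized pages fit easily:
retrieval_budget([3_201] * 10)            # keeps all ten, 32,010 tokens used
# A 99th-percentile page exceeds the whole budget and is skipped:
retrieval_budget([141_410] + [3_201] * 10)
```

A production system would more likely truncate or summarize oversize pages, but the budget arithmetic is the same.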
## Methodology Notes
- Pages were processed using Gemini’s `url_context` tool
- Token counts reflect the model’s native tokenization
- Sample includes a diverse mix of content types: articles, academic papers, product pages, documentation, and PDFs
- Zero-token pages (5 total) represent failed fetches or blocked content
While the typical page sits around 3,000 tokens, the distribution has a remarkably long tail. AI systems consuming web content need to account for this variance—both for context management and cost optimization.
For practical applications:
- Design for the median (3K tokens) but handle the 99th percentile (140K tokens)
- Expect high variance between sources
- Budget conservatively—average costs run roughly 3x median costs (10,403 vs. 3,201 tokens) due to outliers
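The last point is simple arithmetic. A sketch with a purely hypothetical per-token price, showing how median-based budgeting understates spend:

```python
# Hypothetical price for illustration only; real pricing varies by model.
PRICE_PER_MILLION_TOKENS = 0.50  # USD

def monthly_cost(pages_per_month: int, tokens_per_page: float) -> float:
    """Estimated input-token cost for processing pages at a flat rate."""
    total_tokens = pages_per_month * tokens_per_page
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Budgeting on the median understates real spend by ~3x:
monthly_cost(100_000, 3_201)   # ~$160/month (median-based estimate)
monthly_cost(100_000, 10_403)  # ~$520/month (average-based estimate)
```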
## What Did People Guess?
Before publishing this analysis, I ran a poll on LinkedIn asking people to predict the average page size in tokens:
| Guess | Votes | Percentage |
|---|---|---|
| 100 | 27 | 21% |
| 1,000 | 50 | 38% |
| 10,000 | 45 | 34% |
| 100,000 | 9 | 7% |
131 people voted. The most popular answer was 1,000 tokens (38%), followed closely by 10,000 tokens (34%). The actual answer? 10,403 tokens on average.
Only a third of respondents got it right. The majority underestimated—perhaps expecting a page of text to be shorter than it actually is when tokenized. What’s interesting is that the median (3,201 tokens) would have made “1,000” a more defensible answer, but averages get skewed heavily by those outlier documents.
The 7% who guessed 100,000 weren’t entirely wrong either—they just described the 99th percentile rather than the average.
