Investigating Token Limits in Modern LLM Platforms


Large language model providers have significantly expanded context windows (OpenAI’s 128K-token GPT-4, Anthropic’s 100K-200K Claude, Google’s 1M-2M-token Gemini, etc.), but real-world usage shows that these limits come with hidden truncation and prioritization strategies. Developers report that although you can supply extremely long inputs, models often do not preserve every token equally. In practice, older or “middle” context content may be compressed, ignored, or dropped to enforce limits and maintain quality. Below we break down how different platforms handle their context windows and the tradeoffs observed in practice.

OpenAI GPT-4’s 128K Context Window

OpenAI’s GPT-4 Turbo and the newer “o1” preview reasoning model both advertise a 128,000-token context window (far larger than the earlier 8K and 32K GPT-4 variants). However, developers have found that you can’t simply stuff 128K tokens of content into a prompt and expect flawless recall of every detail.

Empirical tests by Greg Kamradt and others showed GPT-4’s retrieval accuracy drops once prompts exceed roughly 70K tokens. In one “needle-in-a-haystack” experiment, GPT-4 could perfectly recall a fact buried in a long document up to ~64K tokens, but beyond ~73K tokens the model started missing information located in the middle of the prompt. Notably, facts at the very beginning of the prompt remained retrievable even in 100K+ token tests, whereas facts in the middle were often “lost in the middle”.
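
For readers who want to probe this themselves, the sketch below shows the shape of such a needle-in-a-haystack test. It is not Kamradt’s actual harness: the model id, filler corpus, needle, and scoring are placeholders to swap for your own.

```python
# Sketch of a "needle in a haystack" recall test (not Kamradt's exact harness).
# Assumes the openai Python SDK and any long, unrelated filler text of your own.
from openai import OpenAI

client = OpenAI()
NEEDLE = "The secret launch code is 7-42-19."
FILLER = open("filler_corpus.txt").read()  # placeholder filler document

def build_prompt(depth: float, total_chars: int) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    haystack = FILLER[:total_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + "\n" + NEEDLE + "\n" + haystack[cut:]

def recall_ok(depth: float, total_chars: int) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder; use whichever long-context model you test
        messages=[
            {"role": "user", "content": build_prompt(depth, total_chars)},
            {"role": "user", "content": "What is the secret launch code?"},
        ],
    )
    return "7-42-19" in resp.choices[0].message.content

# Sweep the needle's position and the haystack size to map where recall degrades.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(depth, recall_ok(depth, total_chars=400_000))  # ~100K tokens of filler
```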

This suggests that GPT-4 does not allocate uniform attention across the entire 128K window: early and late tokens seem to have an advantage, while mid-context tokens can be overlooked as the prompt grows. This “middle context fading” aligns with the “Lost in the Middle” phenomenon reported by researchers (Liu et al. 2023), where transformer models tend to focus on the first and last parts of a long sequence and lose fidelity on the rest.

OpenAI’s documentation hints that the full 128K may effectively be constrained by internal needs. The Chat Completions API requires reserving tokens for the model’s output and reasoning process, so developers cannot use all 128K solely for input.

For example, the o1 preview is said to allow up to ~32K tokens for output. If you fill the context to the brim with input, the model will truncate or refuse to continue because it has no room left to generate an answer. The o1 models also use an internal chain of thought (hidden reasoning) that consumes tokens during processing. The system discards those hidden “reasoning tokens” after each step to free up space, carrying over only the user input and the visible answer between turns.

This design prevents the context from exploding every time the model thinks, but it also means some of the model’s own thought process is intentionally not remembered beyond each response.

OpenAI advises developers to “ensure there's enough space in the context window for reasoning tokens”, noting that complex queries may use “from a few hundred to tens of thousands” of tokens internally for step-by-step reasoning.

In practice, this reduces the usable portion of the 128K window for user content - e.g. if a query triggers 20K tokens of reasoning, those come out of the available budget.
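
A rough budgeting sketch makes that arithmetic concrete. The reservation numbers below are illustrative placeholders, not OpenAI’s official accounting.

```python
# Illustrative token budgeting for a 128K-context model.
# These reservations are example values, not official OpenAI figures.
CONTEXT_WINDOW = 128_000
MAX_OUTPUT = 32_000          # reserved for the visible completion
EXPECTED_REASONING = 20_000  # hidden reasoning tokens for a complex query (varies widely)

usable_input = CONTEXT_WINDOW - MAX_OUTPUT - EXPECTED_REASONING
print(usable_input)  # 76000 tokens left for prompt content in this scenario
```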

Recency Bias and Positional Prioritization in GPT-4

Recency bias is another built-in factor: GPT-4 (especially in chat form) heavily prioritizes the latest instructions and messages. OpenAI’s instruct-tuning and transformer architecture naturally weight the end of the prompt sequence when predicting the next token.

Developers have even adjusted their prompting techniques to account for this - for instance, placing crucial instructions at the end of a long prompt (after a long reference text) to ensure the model follows them. If those instructions are at the very start and followed by tens of thousands of tokens of other data, GPT-4 might “forget” or ignore the instruction by the time it generates an answer.
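
A minimal sketch of that ordering with the OpenAI chat API; the model id, file name, and instruction are placeholders.

```python
# Sketch: put the task instruction AFTER the long reference text so it sits in the
# recency-favored tail of the prompt. Model id and document are placeholders.
from openai import OpenAI

client = OpenAI()
long_reference = open("reference.txt").read()  # tens of thousands of tokens

resp = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "You are a careful analyst."},
        {"role": "user", "content": long_reference},
        # Instruction last, where attention to recent tokens is strongest:
        {"role": "user", "content": "Using only the document above, list every deadline it mentions."},
    ],
)
print(resp.choices[0].message.content)
```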

One Hacker News user who experimented with long-context GPT-4 noted that the official prompt guide recommended putting any instructions after lengthy reference text, because the model tended to disregard instructions placed at the top of a huge input.

All of this indicates OpenAI likely employs positional encodings or training tweaks that favor recent tokens (and perhaps the initial system prompt) over the deep middle of the context.

In summary, while GPT-4’s 128K window is a major leap, developers effectively treat it as “memory with caveats”: dumping 100K tokens in will technically succeed, but the model may only reliably utilize the first ~50-70K and the last few thousand tokens without degradation.

The closer you get to the max limit, the more likely the model will miss or “forget” details. One developer summed it up: using more than ~50% of the context length tends to yield more hallucinations or omissions, based on their GPT-4 testing.
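
That heuristic is easy to turn into a guard. The sketch below counts tokens with tiktoken and warns once a prompt crosses 50% of the window; the threshold is the rule of thumb above, not an official limit.

```python
# Warn when a prompt crosses ~50% of the context window (a community heuristic,
# not an official OpenAI limit).
import tiktoken

CONTEXT_WINDOW = 128_000
SAFE_FRACTION = 0.5

def check_budget(prompt: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(prompt))
    if n_tokens > CONTEXT_WINDOW * SAFE_FRACTION:
        print(f"Warning: {n_tokens} tokens exceeds {SAFE_FRACTION:.0%} of the window; "
              "expect degraded recall, or switch to retrieval/summarization.")
    return n_tokens
```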

Anthropic Claude’s 100K–200K Context Window

Anthropic’s Claude 2 introduced a 100,000-token context window (roughly 75,000 words), and newer versions (Claude 2.1 and later) have been reported to handle up to 200K tokens in one prompt. Anthropic pitches this huge window as allowing a user to input hundreds of pages of technical documentation or even an entire book for analysis.

In practice, Claude’s long context exhibits similar behavior to GPT-4’s: not all tokens are treated equally, and very large prompts can confuse or dilute the model’s focus.

Real-world testing of Claude 2.1 at the 200K-token scale shows clear evidence of the “memory prioritization” effect. In Greg Kamradt’s long-context recall experiments (the same “needle in a haystack” tests), Claude could recall facts located at the start or end of the document more reliably than facts buried in the middle.

In his public results, facts in the first few percent of the text and in the second half were often retrieved correctly, but Claude struggled with information placed around the mid-section of the 200K prompt. This mirrors GPT-4’s pattern, implying Claude also exhibits a form of recency/primacy bias when processing extremely long contexts.

The takeaway from Kamradt’s test was: “Position matters – facts at the very beginning and in the second half of the document seem to be recalled better”.

Anthropic has not published technical details on how Claude’s 100K context is implemented, but these results hint at an internal strategy like segmented attention or a decaying weight on older tokens. One early user of Claude noted, “it’s evident they’re employing clever techniques to make it work, albeit with some imperfections” – referring to things like special position encodings or adaptive focus that support large contexts without extreme slowdowns, though not all tokens are remembered equally.

From a developer’s perspective, Claude enforces the context limit with a hard cutoff. If you exceed the 100K token cap in Claude’s chat interface, it will refuse to ingest more content. Users have reported hitting this cap after about 40 messages in a session, requiring them to start a new chat.

Unlike ChatGPT, which sometimes silently drops older content, Claude explicitly notifies users when the limit is reached. There is no built-in summarization or overflow mechanism. The responsibility to shorten or summarize lies entirely with the user.

Some users have adopted manual coping strategies: for example, periodically asking Claude to summarize the conversation so far, then continuing in a new chat with that summary as the new context. Anthropic encourages this, suggesting users break tasks into smaller, chunked sessions.
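
A minimal sketch of that summarize-and-restart pattern with the Anthropic Python SDK; the model id and word limit are placeholder choices, not Anthropic recommendations.

```python
# Sketch of the manual "summarize, then start a fresh session" strategy.
# Model id is a placeholder; swap in whichever Claude version you use.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # placeholder model id

def summarize_history(history: list[dict]) -> str:
    """Ask Claude to compress the conversation so far into a short briefing."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=history + [{
            "role": "user",
            "content": "Summarize everything important in this conversation so far "
                       "in under 500 words, so a new session can continue the work.",
        }],
    )
    return resp.content[0].text

def start_fresh_session(summary: str, next_question: str) -> str:
    """Begin a new chat seeded only with the summary instead of the full transcript."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"Context from the previous session:\n{summary}"},
            {"role": "assistant", "content": "Understood, I have the context."},
            {"role": "user", "content": next_question},
        ],
    )
    return resp.content[0].text
```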

One expert summarized the reality: “There is no task that truly needs to break Claude’s context window that cannot be separated into simpler ones… The best approach is to start over with a new ‘genie’ and pass along needed info via summary.”

Prompt Order and Limitations of Claude’s Long Context

Anthropic’s documentation and prompt guidelines also hint at how they manage long inputs. As mentioned earlier, users discovered that Claude’s recommended prompt structure for long documents is to place the instructions last, after the long content.

This implies Claude, like GPT-4, has a recency bias or gives more attention to the tail end of the prompt. If a user provides a 100K-token document followed by “Now answer question X about the above,” Claude performs better than if the question is placed at the top and followed by 100K tokens of reference material.

If the instruction comes first, the model can get “lost” and output a generic summary or fail to address the question in context. This prompt-order dependency is a clear sign that Claude prioritizes newer tokens, and that older parts of the context may be compressed or deprioritized relative to recent content.
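
A minimal sketch of that document-first, question-last ordering with the Anthropic Python SDK; the model id, file name, and question are placeholders.

```python
# Sketch: long document first, instruction last, matching the recommended prompt order.
# Model id and document are placeholders.
import anthropic

client = anthropic.Anthropic()
long_document = open("report.txt").read()  # up to ~100K tokens of reference text

resp = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"<document>\n{long_document}\n</document>\n\n"
                   "Now answer a question about the document above: "
                   "what risks does it identify, and where?",
    }],
)
print(resp.content[0].text)
```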

It’s also worth noting that Claude’s long context was a first-of-its-kind feature, and early developer feedback revealed both excitement and some technical rough edges. For example, some users encountered a bug where Claude’s conversation unexpectedly “reset” in long chats - it began replying as if it had forgotten recent inputs and remembered only older ones.

One developer speculated the system was trying to drop the oldest half of the context to save space but mistakenly dropped the newest half due to a bug. These reports highlight how Claude’s system actively manages which tokens to keep, and that errors in that management can cause strange or confusing model behavior.

In summary, Claude accepts very large inputs, but there’s no guarantee that every detail will persist across the context. Anthropic has explicitly cautioned developers not to assume that all 100K tokens will be faithfully used during generation.

The longer the input, the more developers must treat Claude as having a limited “attention budget”. Important facts should be reinforced or placed strategically, while less critical details may be ignored or distilled by the model.

As one summary of Kamradt’s test put it: “No guarantees – your facts are not guaranteed to be retrieved. Don’t bake the assumption they will [be] into your application.”

In practice, it’s more reliable to feed Claude the most relevant chunks via retrieval techniques, instead of dumping entire massive texts. Even if Claude can technically accept large inputs, smart context selection yields better results.
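
As an illustration of “select, don’t dump”, here is a deliberately crude retrieval sketch that scores chunks by keyword overlap with the question. A production setup would use an embedding-based retriever instead, but the shape of the approach is the same.

```python
# Pick the few chunks most relevant to the question instead of sending the whole text.
# Keyword overlap is a crude stand-in for a real embedding-based retriever.
def chunk(text: str, size: int = 2000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_chunks(document: str, question: str, k: int = 5) -> list[str]:
    q_terms = set(question.lower().split())
    scored = [(len(q_terms & set(c.lower().split())), c) for c in chunk(document)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

# Send only these top-scoring chunks (with the question placed last) to the model.
```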

Google’s Gemini with 1M–2M Token Windows

Google’s Gemini (DeepMind) models have pushed context lengths further than any other platform, with Gemini Pro models offering a 1 million token window and a 2 million token version announced. This scale is unprecedented - enough to hold an entire library of documents in one prompt.

Google has cited examples like analyzing 19 hours of audio transcripts in a single prompt, equivalent to ~2 million tokens. But in real-world usage, this promise is more aspirational than reliably achievable at present.

In Google’s AI Studio, developers encountered hard and soft limits well below the advertised 1M. One user testing Gemini Pro reported “Out of tokens” errors consistently appearing around 450K–500K tokens - even though the model was supposed to handle 1M. In other words, Gemini broke down around 50% of its theoretical maximum.

They also noted this wasn’t a one-off: “This isn’t just an edge case – it’s repeatable in various sessions”. Such behavior undermines the expected value of long context, particularly if developers plan their architecture around the full capacity. Google staff even responded and attempted to reproduce the issue at 600K tokens - suggesting Google is aware of these constraints and still refining system behavior.

This suggests the full 1M may only be possible under specific conditions or backend optimizations. To prevent crashing or excessive latency, AI Studio may proactively throw errors before hitting true upper limits.

Beyond token errors, users also reported that Gemini output quality degrades significantly well before 1M. One developer noted that at 300K–400K tokens, the model began returning incoherent or messy answers - forgetting the question, mixing up references, or producing vague summaries.

The conclusion? “Just extending to 2M tokens won’t help unless the model actually uses that space effectively.” The cost also matters. Google has publicly acknowledged that “as the context window grows, so does the potential for input cost.” Self-attention operations grow quadratically with sequence length, making massive contexts extremely expensive to compute.
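
A quick back-of-envelope calculation shows why: because attention cost scales with the square of sequence length, moving from 128K to 1M or 2M tokens multiplies the attention compute far more than the 8-16x growth in input size.

```python
# Relative self-attention cost versus a 128K-token baseline (quadratic in length).
BASELINE = 128_000
for tokens in (128_000, 1_000_000, 2_000_000):
    print(f"{tokens:>9} tokens -> ~{(tokens / BASELINE) ** 2:.0f}x the attention compute")
# 128K -> ~1x, 1M -> ~61x, 2M -> ~244x
```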

To address this, Google introduced context caching in the Gemini API - a feature that avoids resending repeated input tokens across calls. This lowers cost for large static prompts, but it doesn’t expand the model’s actual usable capacity.
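
A sketch of how that caching flow looks with the google-generativeai Python SDK. The class and parameter names below reflect one version of the SDK and the model id is a placeholder, so treat this as the shape of the feature rather than a definitive recipe.

```python
# Sketch of Gemini context caching: upload static content once, then reference the
# cache in later calls instead of resending those tokens. Names follow one version
# of the google-generativeai SDK and may differ in yours.
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")  # placeholder
big_doc = open("transcripts.txt").read()  # large, static reference material

cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",  # placeholder model id
    contents=[big_doc],
    ttl=datetime.timedelta(minutes=30),
)

model = genai.GenerativeModel.from_cached_content(cached_content=cache)
resp = model.generate_content("Summarize the key decisions in the cached transcripts.")
print(resp.text)
```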

Practically, developers are still encouraged to use retrieval-augmented generation (RAG) or summarization strategies rather than sending full dumps of 1M+ tokens on every request.
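
One common pattern is a map-reduce style summarization pass, sketched below: summarize manageable chunks first, then combine the partial summaries. The chunk size and model id are illustrative choices, not Google recommendations.

```python
# Map-reduce summarization as an alternative to one giant prompt.
# Chunk size and model id are illustrative; assumes genai is already configured.
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model id

def summarize_in_stages(document: str, chunk_chars: int = 200_000) -> str:
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    # Map: summarize each chunk independently.
    partials = [
        model.generate_content(f"Summarize the key points of this section:\n{c}").text
        for c in chunks
    ]
    # Reduce: merge the partial summaries into one answer.
    return model.generate_content(
        "Combine these section summaries into one coherent summary:\n\n" + "\n\n".join(partials)
    ).text
```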

Community forums and Google’s own responses support this: one forum user asked whether 2M tokens were really usable, and another responded bluntly that most tasks “won’t benefit from 2M” due to performance limits and high cost, especially given Gemini's sketchy performance at 1M.

Under the hood, Google likely uses advanced architectures such as sparse attention, landmark tokens, or hierarchical transformers to support long contexts. But public technical details remain sparse.

In mid-2024, Google opened a waitlist for a 2M-token Gemini 1.5 Pro, and by late 2024 the 1M window was available more broadly. More recently, Gemini 2.5 launched with 1M tokens by default while 2M was still listed as “coming soon” - implying Google is rolling out full capacity in phases to manage reliability risks, with the full 2M rollout still pending.

The delay reflects the engineering complexity of processing 2M tokens in one go. A 2M-token input is roughly the size of an 8,000-page book - and handling that without coherence loss or latency issues remains a challenge.

It’s possible that internally, Gemini chunks or summarizes segments of input using a sliding window or hierarchical memory - a technique seen in long-context research prototypes. If so, not all 2M tokens may be simultaneously attended to by the model.

As one forum user pointed out: “AI Studio says 2M tokens, but in reality, things get unreliable long before that.” This reinforces a growing understanding across developers that advertised capacity ≠ effective capacity.

Final Thoughts on Gemini’s Context Window

In summary, Google’s Gemini has broken new ground with million-token context windows, but developers have quickly run into practical usage limits.

You often can’t actually feed a full 1M tokens and expect flawless results - errors may occur around 450K–500K tokens, or the model might produce garbled and incoherent answers when overloaded with dense input.

Google is actively refining the platform - addressing “out of tokens” errors and releasing features like context caching - but they also caution developers to use long context wisely.

The best practice remains: only feed what’s needed. One forum participant put it succinctly: “Focus on effectiveness, not on chasing the longest possible context window.”

In many cases, a shorter, more focused prompt will yield better results than overwhelming the model with a massive one-shot payload.

That said, Gemini’s large context window remains valuable in special scenarios - like multi-document legal analysis or hours-long transcripts - where chunking would create more confusion than clarity.

As with OpenAI and Anthropic, the burden is on developers to manage context wisely: structure prompts intentionally, use retrieval-augmented generation, apply interim summarization, and monitor when you’re nearing critical context thresholds.