ChatGPT Won't Recommend Anthropic. Claude Won't Recommend OpenAI. We Have the Data.
A 4x4 matrix of LLM raters scoring LLM creators surfaced a reciprocal suppression pattern between OpenAI and Anthropic, and only between those two direct competitors.
Quick Answer
A 4x4 matrix of LLMs scoring LLM creators (ChatGPT, Perplexity, Gemini, and Claude rating OpenAI, Anthropic, Google, and Perplexity) surfaced a reciprocal suppression pattern: ChatGPT scored Anthropic 2.8 and Claude scored OpenAI 1.6 — the lowest score in the entire dataset. The penalty doesn't extend to other AI companies, only to the closest direct foundation-model competitor.
While calibrating Sourcepull's scoring scale, we built a 4x4 matrix of every LLM in our query panel scoring every LLM creator we could test. The matrix surfaced a finding we did not expect, and we think it has methodology consequences for any AI company that ever runs an answer engine optimization (AEO) audit.
How did we test for LLM rivalry bias?
Our v3.0 audit sends ten queries through four AI platforms — ChatGPT, Perplexity, Gemini, and Claude — and aggregates the cross-platform scoring. We tested the four companies that build those four platforms: OpenAI (which builds ChatGPT), Anthropic (Claude), Google (Gemini), and Perplexity. Same query panel, same scoring formula, same cross-platform aggregation as a paying customer audit.
What did the 4x4 LLM scoring matrix show?
SUBJECT          BY ChatGPT   BY Perplexity   BY Gemini   BY Claude
openai.com       5.4          6.8             8.3         1.6
anthropic.com    2.8          6.9             6.6         6.0
google.com       8.5          8.8             8.5         8.5
perplexity.ai    4.5          6.9             4.5         7.8

Three things stand out.
Nobody is boosting themselves
The self-scores (each LLM rating its own parent) are unremarkable. ChatGPT scored OpenAI a 5.4, which is lower than the 6.8 Perplexity gave OpenAI and lower than the 8.3 Gemini gave OpenAI. Claude rated Anthropic a 6.0, lower than Gemini's 6.6 for Anthropic.
If self-promotion bias existed at scale, we would see each LLM giving its own parent the highest score that parent receives from any rater. We don't. The likely cause is system-prompt instructions that explicitly suppress self-mention in business-recommendation contexts, which both Anthropic and OpenAI have been candid about in published material.
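The self-score check can be reproduced directly from the published matrix. The following is a minimal sketch, not part of the audit pipeline: the dictionary layout and the function name `self_score_report` are illustrative assumptions, and the numbers are copied from the table above.

```python
# Scores copied from the matrix above, laid out as SCORES[subject][rater].
SCORES = {
    "openai.com":    {"ChatGPT": 5.4, "Perplexity": 6.8, "Gemini": 8.3, "Claude": 1.6},
    "anthropic.com": {"ChatGPT": 2.8, "Perplexity": 6.9, "Gemini": 6.6, "Claude": 6.0},
    "google.com":    {"ChatGPT": 8.5, "Perplexity": 8.8, "Gemini": 8.5, "Claude": 8.5},
    "perplexity.ai": {"ChatGPT": 4.5, "Perplexity": 6.9, "Gemini": 4.5, "Claude": 7.8},
}

# Which rater each subject builds (from the methodology section).
OWN_RATER = {
    "openai.com": "ChatGPT",
    "anthropic.com": "Claude",
    "google.com": "Gemini",
    "perplexity.ai": "Perplexity",
}


def self_score_report(scores, own_rater):
    """Compare each subject's self-score with the best score from an outside rater."""
    report = {}
    for subject, row in scores.items():
        own = own_rater[subject]
        outside = {rater: s for rater, s in row.items() if rater != own}
        best_outside = max(outside, key=outside.get)
        report[subject] = {
            "self_score": row[own],
            "best_outside_rater": best_outside,
            "best_outside_score": outside[best_outside],
            # True only if the subject's own model rates it strictly highest.
            "self_is_strict_top": row[own] > outside[best_outside],
        }
    return report


REPORT = self_score_report(SCORES, OWN_RATER)
for subject, info in REPORT.items():
    print(subject, info)
```

For all four subjects `self_is_strict_top` comes out `False`: in every case at least one outside rater scores the subject higher than its own model does.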
Why is OpenAI-Anthropic mutual suppression the headline?
ChatGPT gave Anthropic a 2.8. That is the lowest score ChatGPT gave to any of the four subjects. Claude returned the favor with a 1.6 on OpenAI, the lowest single score in the entire dataset. Both numbers sit well below the 4.5 to 8.8 range that the other fourteen cells of the matrix occupy.
The penalty does not extend to Perplexity or Google. Claude rates Perplexity a 7.8 and Google an 8.5. ChatGPT rates Perplexity a 4.5 (low but not punitive) and Google an 8.5. The boundary appears to be foundation-model substitutability. Perplexity uses LLMs under the hood and is not a substitute for Claude in the way ChatGPT is. Google's brand reads as Search and Maps in business-recommendation contexts, not as Gemini's parent.
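To see how isolated the two rival scores are, one can sort all sixteen cells and compare the bottom two with the rest. This is an illustrative sketch using the published numbers. The variable names are our own, and it performs no statistical test.

```python
# Scores copied from the matrix above, laid out as SCORES[subject][rater].
SCORES = {
    "openai.com":    {"ChatGPT": 5.4, "Perplexity": 6.8, "Gemini": 8.3, "Claude": 1.6},
    "anthropic.com": {"ChatGPT": 2.8, "Perplexity": 6.9, "Gemini": 6.6, "Claude": 6.0},
    "google.com":    {"ChatGPT": 8.5, "Perplexity": 8.8, "Gemini": 8.5, "Claude": 8.5},
    "perplexity.ai": {"ChatGPT": 4.5, "Perplexity": 6.9, "Gemini": 4.5, "Claude": 7.8},
}

# Flatten to (score, rater, subject) tuples and sort ascending by score.
CELLS = sorted(
    (score, rater, subject)
    for subject, row in SCORES.items()
    for rater, score in row.items()
)

LOWEST_TWO = CELLS[:2]
REST = [score for score, _, _ in CELLS[2:]]
REST_RANGE = (min(REST), max(REST))

for score, rater, subject in LOWEST_TWO:
    print(f"{rater} -> {subject}: {score}")
print("range of the other", len(REST), "cells:", REST_RANGE)
```

The two lowest cells are Claude rating openai.com (1.6) and ChatGPT rating anthropic.com (2.8). The other fourteen cells run from 4.5 to 8.8.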
Why does Google score 8.5 across the board?
Google scored 8.5 or higher on every platform and is the only subject in our dataset to break 8 in aggregate. We do not think this is "Gemini boosting its parent." It's that "Google" reads as the search engine and infrastructure provider in business-discovery queries, not as the AI lab. Even Claude scores Google an 8.5 while scoring OpenAI a 1.6, so the penalty isn't "is this an AI company." It's "is this my direct foundation-model rival."
Why does this matter if you run an AEO audit on an AI company?
Our scoring formula averages performance across all four LLMs. When the audit subject is itself a foundation-model maker, at least one platform, its direct rival, can score it far below the others (1.6 and 2.8 in our matrix). That single low score drags the aggregate down by roughly 1 to 1.5 points.
For openai.com, aggregate without Claude would be roughly 6.8. With Claude, it's 5.5. Cost: 1.3 points. For anthropic.com, aggregate without ChatGPT would be roughly 6.5. With ChatGPT, it's 5.5. Cost: 1.0 points.
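The arithmetic can be checked with a few lines of Python. This sketch assumes the aggregate is a plain unweighted mean of the four platform scores. The production formula may weight or round differently, and `rival_cost` is a hypothetical helper name.

```python
from statistics import mean

# Rows copied from the matrix above.
SCORES = {
    "openai.com":    {"ChatGPT": 5.4, "Perplexity": 6.8, "Gemini": 8.3, "Claude": 1.6},
    "anthropic.com": {"ChatGPT": 2.8, "Perplexity": 6.9, "Gemini": 6.6, "Claude": 6.0},
}

# The direct foundation-model rival whose score is being isolated.
RIVAL = {"openai.com": "Claude", "anthropic.com": "ChatGPT"}


def rival_cost(row, rival):
    """Return (mean with rival, mean without rival, points lost to the rival's score)."""
    with_rival = mean(list(row.values()))
    without_rival = mean([s for rater, s in row.items() if rater != rival])
    return with_rival, without_rival, without_rival - with_rival


COSTS = {subject: rival_cost(SCORES[subject], RIVAL[subject]) for subject in RIVAL}
for subject, (w, wo, cost) in COSTS.items():
    print(f"{subject}: with={w:.1f} without={wo:.1f} cost={cost:.1f}")
```

Under the plain-mean assumption this gives roughly 5.5 versus 6.8 for openai.com (a cost near 1.3) and roughly 5.6 versus 6.5 for anthropic.com (a cost near 0.9), close to the rounded figures quoted above.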
There are two ways to read this.
The first is that the score is honest. If you sell to a customer who uses Claude, Claude will not recommend you. That is real-world AEO information that an AI company should know.
The second is that the score is misleading. The non-recommendation is policy, not market signal. An AI company cannot defend against it through better content, structured data, or directory presence. It's unfixable, and surfacing it as "you need work" is wrong because no fix-plan deliverable can resolve it.
Update: it's training, not policy
After the first version of this post, our research agent went hunting for the actual instruction. The asgeirtj/system_prompts_leaks repository tracks published and leaked system prompts for Claude Opus 4.6, GPT-5.1-default, GPT-5.3, and GPT-5.4. We checked all four. None of them contains a "do not recommend competitor AI models" instruction.
The closest line is from Claude: "Anthropic doesn't display ads in its products nor does it let advertisers pay to have Claude promote their products or services in conversations with Claude." That covers paid promotion. It does not cover competitive suppression.
The suppression appears to be behavioral, not instructed. The most likely mechanism is RLHF reward modeling during fine-tuning. If human raters consistently down-ranked responses in which the model enthusiastically recommended a direct commercial rival, then over thousands of training examples that would produce persistent avoidance behavior without any explicit instruction. On this reading, the model didn't read a rule. It learned a habit.
This shifts our reading of the data toward the first interpretation. If the suppression were a system-prompt switch, it could be flipped tomorrow and the score would mean nothing. Because it's training-baked, it's durable. What you see in the audit is what a real user asking Claude about an OpenAI product actually experiences. That is honest AEO information, not a policy artifact.
Our current direction for the scoring page treatment: when the audit subject is an AI company that sits in the foundation-model rivalry triangle, surface a disclosure note next to the suppressed column rather than averaging the 1.6 silently into the aggregate. The score itself stays in the report. Customers should see what real users see. We will publish the exact treatment when it ships.
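As a rough illustration of what such a disclosure could look like in data terms, the sketch below keeps every score and attaches a note to the rival's column. The exact treatment has not shipped, so everything here is hypothetical: the `RIVAL_RATER` mapping, the note wording, and the function name `build_score_rows`.

```python
# Hypothetical mapping from audit subject to the rater whose column gets a note.
RIVAL_RATER = {"openai.com": "Claude", "anthropic.com": "ChatGPT"}

# Placeholder wording, not the shipped disclosure text.
NOTE = "Score from a direct foundation-model rival; see disclosure."


def build_score_rows(subject, row):
    """Return one display row per rater, keeping every score and flagging the rival column."""
    rival = RIVAL_RATER.get(subject)
    return [
        {"rater": rater, "score": score, "note": NOTE if rater == rival else ""}
        for rater, score in row.items()
    ]


# Row copied from the matrix above.
OPENAI_ROW = {"ChatGPT": 5.4, "Perplexity": 6.8, "Gemini": 8.3, "Claude": 1.6}
ROWS = build_score_rows("openai.com", OPENAI_ROW)
for display_row in ROWS:
    print(display_row)
```

The design choice in this sketch is that nothing is hidden or dropped: the 1.6 stays visible, and only the annotation changes. A subject outside the mapping gets no notes at all.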
We also looked for prior published work. We searched arXiv (cs.CL, cs.AI), the AEO practitioner blogs (SparkToro, Profound, OtterlyAI, Sitesignal), and academic indices for the foundation-model-rater x foundation-model-subject matrix design. We did not find it. The PNAS "AI-AI bias" paper covers a related mechanism (LLMs favoring LLM-generated content) and arXiv 2502.01349 covers general cognitive bias in LLM product recommendations, but neither runs the matrix. If the time-stability re-runs hold, this appears to be a novel finding.
What don't we know yet?
A single 4x4 matrix in a single afternoon is not a paper. The four open questions:
Time stability. Is the 1.6 / 2.8 pattern stable across re-runs over a week, or is it noise from a single sweep? This is the next test we are running, and the result determines whether the finding is publishable.
Product domains. Does the pattern hold against chatgpt.com, claude.ai, perplexity.ai, and gemini.google.com rather than the parent company domains we tested?
Second-tier coverage. Is the suppression observable for second-tier model makers (Cohere, Mistral, xAI), or only for the frontier OpenAI / Anthropic / Google triangle? If second-tier makers are not suppressed, that would support the reading that the mechanism is direct substitution rather than "AI company" as a category.
Query-set sensitivity. Our test used business-discovery queries. We expect technical-context queries ("which LLM API for production text summarization") to weaken the suppression, because the developer corpus is full of head-to-head API comparisons that include both vendors. We have not run this yet.
If you have run a similar test, or have data that contradicts or extends ours, we would like to know.
What does this change for our work?
We have already filed this finding internally as an open methodology question and our nightly research agent has it queued for follow-up. The frontier of the work is whether to treat foundation-model rivalry as legitimate AEO signal or as policy noise, and how to disclose the difference clearly to customers when their score is materially affected by it.
If you are an AI company considering an AEO audit, we would love to talk about what the right framing looks like for your case before we ship it for the broader product.