ChatGPT Won't Recommend Anthropic. Claude Won't Recommend OpenAI. We Have the Data.
A 4x4 matrix of LLM raters scoring LLM creators surfaced a reciprocal suppression pattern specific to OpenAI and Anthropic. The pattern doesn't extend to other AI companies; it shows up only between the closest direct competitors.
While calibrating Sourcepull's scoring scale, we built a 4x4 matrix of every LLM in our query panel scoring every LLM creator we could test. The matrix surfaced a finding we did not expect, and we think it has methodology consequences for any AI company that ever runs an AEO audit.
The setup
Our v3.0 audit sends ten queries through four AI platforms — ChatGPT, Perplexity, Gemini, and Claude — and aggregates the cross-platform scoring. We tested the four companies that build those four platforms: OpenAI (which builds ChatGPT), Anthropic (Claude), Google (Gemini), and Perplexity. Same query panel, same scoring formula, same cross-platform aggregation as a paying customer audit.
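For readers who want the mechanics concrete, the loop looks roughly like this. It is a minimal sketch rather than our production pipeline: query_platform and score_response are hypothetical stand-ins for the real platform calls and the real scoring formula, and QUERIES stands in for the ten-query panel.

```python
# Minimal sketch of the audit loop, not the production pipeline. query_platform()
# and score_response() are hypothetical stand-ins for the real calls and scoring
# formula; QUERIES stands in for the ten-query panel.
from statistics import mean

PLATFORMS = ["ChatGPT", "Perplexity", "Gemini", "Claude"]
QUERIES = ["placeholder business-discovery query"]  # ten queries in the real panel

def query_platform(platform: str, query: str) -> str:
    """Placeholder: return the platform's answer text for one query."""
    return ""

def score_response(answer: str, subject_domain: str) -> float:
    """Placeholder: 0-10 score for how the subject shows up in one answer."""
    return 0.0

def audit(subject_domain: str) -> dict[str, float]:
    """Per-platform score = mean over the query panel; aggregate = mean of the four platforms."""
    per_platform = {
        p: mean(score_response(query_platform(p, q), subject_domain) for q in QUERIES)
        for p in PLATFORMS
    }
    per_platform["aggregate"] = round(mean(per_platform.values()), 1)
    return per_platform
```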
The matrix
Subject         By ChatGPT   By Perplexity   By Gemini   By Claude
openai.com         5.4           6.8            8.3         1.6
anthropic.com      2.8           6.9            6.6         6.0
google.com         8.5           8.8            8.5         8.5
perplexity.ai      4.5           6.9            4.5         7.8
Three things stand out.
Nobody is boosting themselves
The diagonal — each LLM rating its own parent — is unremarkable. ChatGPT scored OpenAI a 5.4, which is lower than the 6.8 Perplexity gave OpenAI and lower than the 8.3 Gemini gave OpenAI. Claude rated Anthropic a 6.0, lower than Gemini's 6.6 for Anthropic.
If self-promotion bias existed at scale, we would see each LLM scoring its own parent at the top of the column. We don't. The likely cause is system-prompt instructions that explicitly suppress self-mention in business-recommendation contexts, which both Anthropic and OpenAI have been candid about in published material.
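For anyone who wants to recheck the diagonal claim, here is the matrix as data. The scores are copied straight from the table above, the rater-to-parent mapping is the obvious one, and the two checks mirror the comparisons in the previous two paragraphs: where each rater ranks its own parent within its own column, and how its score for its parent compares with what the other raters gave that same parent.

```python
# The matrix from the table above, plus two checks of the diagonal.
SCORES = {  # subject domain -> {rater: score}
    "openai.com":    {"ChatGPT": 5.4, "Perplexity": 6.8, "Gemini": 8.3, "Claude": 1.6},
    "anthropic.com": {"ChatGPT": 2.8, "Perplexity": 6.9, "Gemini": 6.6, "Claude": 6.0},
    "google.com":    {"ChatGPT": 8.5, "Perplexity": 8.8, "Gemini": 8.5, "Claude": 8.5},
    "perplexity.ai": {"ChatGPT": 4.5, "Perplexity": 6.9, "Gemini": 4.5, "Claude": 7.8},
}
PARENT = {"ChatGPT": "openai.com", "Perplexity": "perplexity.ai",
          "Gemini": "google.com", "Claude": "anthropic.com"}

for rater, parent in PARENT.items():
    column = {subject: row[rater] for subject, row in SCORES.items()}
    rank = 1 + sum(1 for v in column.values() if v > column[parent])
    best_other = max(score for r, score in SCORES[parent].items() if r != rater)
    print(f"{rater} -> {parent}: scored {column[parent]}, "
          f"rank {rank}/4 in its own column, best score from another rater {best_other}")
```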
The OpenAI-Anthropic mutual suppression is the headline
ChatGPT gave Anthropic a 2.8. That is the lowest score ChatGPT gave any of the four subjects, its own parent included. Claude returned the favor with a 1.6 on OpenAI, the lowest single score in the entire dataset. Both numbers sit well below everything else in the matrix; the next-lowest score anywhere in it is 4.5.
The penalty does not extend to Perplexity or Google. Claude rates Perplexity a 7.8 and Google an 8.5. ChatGPT rates Perplexity a 4.5 (low but not punitive) and Google an 8.5. The boundary appears to be foundation-model substitutability. Perplexity uses LLMs under the hood and is not a substitute for Claude in the way ChatGPT is. Google's brand reads as Search and Maps in business-recommendation contexts, not as Gemini's parent.
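The same point as raw numbers, copied from the table: the two rival-pair cells against every other cell.

```python
# The two rival-pair cells versus every other cell in the matrix.
rival_pair = {"Claude -> openai.com": 1.6, "ChatGPT -> anthropic.com": 2.8}
everything_else = [5.4, 6.8, 8.3,        # openai.com by ChatGPT, Perplexity, Gemini
                   6.9, 6.6, 6.0,        # anthropic.com by Perplexity, Gemini, Claude
                   8.5, 8.8, 8.5, 8.5,   # google.com by all four
                   4.5, 6.9, 4.5, 7.8]   # perplexity.ai by all four
print(rival_pair, "vs everything else in", (min(everything_else), max(everything_else)))
# -> {'Claude -> openai.com': 1.6, 'ChatGPT -> anthropic.com': 2.8} vs everything else in (4.5, 8.8)
```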
Google as a category-collision artifact
Google scored 8.5 or higher on every platform and is the only subject in our dataset to break 8 in aggregate. We do not think this is "Gemini boosting its parent." It's that "Google" reads as the search engine and infrastructure provider in business-discovery queries, not as the AI lab. Even Claude scores Google an 8.5 while scoring OpenAI a 1.6, so the penalty isn't "is this an AI company." It's "is this my direct foundation-model rival."
Why this matters if you run an AEO audit on an AI company
Our scoring formula averages performance across all four LLMs. When the audit subject is itself a foundation-model maker, the platform built by its direct rival will score it far below the rest of the panel. That single low score drags the aggregate down by 1 to 1.5 points.
For openai.com, aggregate without Claude would be roughly 6.8. With Claude, it's 5.5. Cost: 1.3 points. For anthropic.com, aggregate without ChatGPT would be roughly 6.5. With ChatGPT, it's 5.5. Cost: 1.0 points.
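Here is that arithmetic recomputed from the rounded values in the table. Because the table shows one decimal, the recomputed figures can drift a tenth from the numbers above, which presumably come from unrounded per-platform scores.

```python
# Recomputes the with/without-rival aggregates from the rounded table values.
# Expect small drift (about a tenth) against the figures quoted in the text.
from statistics import mean

rows = {
    "openai.com":    ({"ChatGPT": 5.4, "Perplexity": 6.8, "Gemini": 8.3, "Claude": 1.6}, "Claude"),
    "anthropic.com": ({"ChatGPT": 2.8, "Perplexity": 6.9, "Gemini": 6.6, "Claude": 6.0}, "ChatGPT"),
}

for subject, (scores, rival) in rows.items():
    full = mean(scores.values())
    without_rival = mean(v for r, v in scores.items() if r != rival)
    print(f"{subject}: with {rival} {full:.1f}, without {without_rival:.1f}, "
          f"cost {without_rival - full:.1f}")
```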
There are two ways to read this.
The first is that the score is honest. If you sell to a customer who uses Claude, Claude will not recommend you. That is real-world AEO information that an AI company should know.
The second is that the score is misleading. The non-recommendation is policy, not market signal. An AI company cannot defend against it through better content, structured data, or directory presence. It's unfixable, and surfacing it as "you need work" is wrong because no fix-plan deliverable can resolve it.
We are leaning toward the second interpretation for our scoring page treatment. A future version of the audit may include a rival-suppression flag on AI-company subjects that discloses and down-weights the direct rival's column. We have not decided yet, and we are open to feedback.
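If we do ship the flag, one candidate shape is a disclosed down-weighting pass over the rival's column. The sketch below is illustrative only, not a decided design; the RIVALS mapping and the 0.25 weight are assumptions for the example.

```python
# One possible shape for a rival-suppression flag: disclose the rival column and
# down-weight it in the aggregate. Illustrative only; RIVALS and the 0.25 weight
# are assumptions, not a shipped Sourcepull feature.
RIVALS = {"openai.com": "Claude", "anthropic.com": "ChatGPT"}

def aggregate_with_flag(subject: str, per_platform: dict[str, float],
                        rival_weight: float = 0.25) -> tuple[float, str | None]:
    """Weighted mean over platforms; the direct rival's column gets rival_weight."""
    rival = RIVALS.get(subject)
    weights = {p: rival_weight if p == rival else 1.0 for p in per_platform}
    score = sum(per_platform[p] * weights[p] for p in per_platform) / sum(weights.values())
    flag = f"rival suppression: {rival} column down-weighted to {rival_weight}x" if rival else None
    return round(score, 1), flag

# Example on the openai.com row from the table:
# aggregate_with_flag("openai.com", {"ChatGPT": 5.4, "Perplexity": 6.8, "Gemini": 8.3, "Claude": 1.6})
# -> (6.4, "rival suppression: Claude column down-weighted to 0.25x")
```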
What we don't know yet
A single 4x4 matrix in a single afternoon is not a paper. The follow-up questions that would settle the finding:
Does the pattern hold against the product domains — chatgpt.com, claude.ai, perplexity.ai, gemini.google.com — rather than the parent companies?
Is the pattern stable across multiple days? AI assistant behavior shifts in measurable ways within weeks.
Is the suppression observable for second-tier model makers — Cohere, Mistral, DeepMind, xAI — or only for the frontier OpenAI–Anthropic–Google triangle?
Does the suppression weaken or invert in technical-context queries ("which LLM API for production") versus the business-discovery queries we used here?
If you have run a similar test, or have data that contradicts or extends ours, we would like to know.
What this changes for our work
We have already filed this finding internally as an open methodology question, and our nightly research agent has it queued for follow-up. The unresolved part is whether to treat foundation-model rivalry as a legitimate AEO signal or as policy noise, and how to disclose the difference clearly to customers whose scores are materially affected by it.
If you are an AI company considering an AEO audit, we would love to talk about what the right framing looks like for your case before we ship it for the broader product.
Curious how your domain scores against the same audit pipeline this study used? The free signal check takes about a minute.