We Ran Our AI Visibility Audit On Eight Famous Brands. Only One Passed.
Wikipedia, Apple, Microsoft, OpenAI, Anthropic, Perplexity, Facebook, and Google walked into our scoring engine. Seven walked out with grades that surprised us, along with one bug of our own that we had to fix in public.
When you build a scoring system, the calibration question that keeps you up at night is whether you can recognize a top performer when you see one. We don't audit Apple. We audit the dental clinic in Burlington and the AI startup in Toronto. So when our scoreboard's highest-ever score sat at 6.6 out of 10, the question we couldn't answer was simple. Is that a real ceiling, or is our scale just compressed?
We ran our v3.0 signal-check pipeline against eight of the most-recognized brands on the internet. The results forced us to revise three things about our own methodology in public, and surfaced a finding about how AI assistants treat each other that we did not expect.
The setup
Same pipeline that runs for paying customers, no special handling. We sent each brand's domain through the standard ten-query panel across ChatGPT, Perplexity, Gemini, and Claude, in production, with an owner bypass on rate limits because we were running eight audits back to back. Nothing else was customized.
The eight: facebook.com, wikipedia.org, apple.com, microsoft.com, openai.com, anthropic.com, google.com, perplexity.ai.
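For readers who want the shape of the run rather than just the numbers, here is a minimal sketch of the batch loop described above. The pipeline object, the query-panel function, and the way rate-limit bypass is passed are all assumptions for illustration, not Sourcepull's actual code.

```python
# Hypothetical sketch of the batch run: every domain goes through the same
# ten-query panel on all four platforms. Names and signatures are illustrative.
DOMAINS = [
    "facebook.com", "wikipedia.org", "apple.com", "microsoft.com",
    "openai.com", "anthropic.com", "google.com", "perplexity.ai",
]
PLATFORMS = ["chatgpt", "perplexity", "gemini", "claude"]

def run_batch(pipeline, query_panel, bypass_rate_limits=True):
    """Run the standard panel for each domain; nothing customized per brand."""
    results = {}
    for domain in DOMAINS:
        for platform in PLATFORMS:
            for query in query_panel(domain):  # ten queries per domain
                answer = pipeline.ask(platform, query,
                                      bypass_rate_limits=bypass_rate_limits)
                results.setdefault(domain, []).append((platform, query, answer))
    return results  # 8 domains x 4 platforms x 10 queries = 320 calls
```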
The scores
DOMAIN          SCORE
google.com      8.6
facebook.com    6.6
wikipedia.org   5.9
perplexity.ai   5.9
apple.com       5.7
microsoft.com   5.7
openai.com      5.5
anthropic.com   5.5

Google is the only domain that broke 8. Everything else clusters between 5.5 and 6.6. Wikipedia, the actual default citation source for most AI assistants, scored 5.9. Apple, one of the most-discussed companies on earth, got 5.7.
That clustering is signal. It tells us our scale's middle band is reachable by global brands, that the top requires near-perfect performance, and that even the giants have measurable AEO weaknesses. It also tells us our customers' 3.5 scores are not as far behind as the gap looks.
Why Google scored so much higher than everyone else
Two reasons, and the second is more interesting than it sounds.
First, Google is named in the first or second position in nearly every business-discovery query. ChatGPT cites it. Perplexity cites it. Even Claude cites it. The scoring formula rewards first-position appearances heavily — a top-position citation is worth roughly three times a passing mention — and Google captured that bonus on every platform.
Second, when our AI panel asks a query like "Best search engine in California," the brand classification reads "Google" as the search engine, not as Gemini's parent company. So Google benefits from being tested against a category it dominates, while OpenAI gets tested against a category where the field is contested.
This is honest and also a structural artifact. We did not de-weight or normalize for it. A real-world business owner running an AEO audit faces the same structural reality. Their score reflects how they perform in the actual queries their actual customers run, not in some hypothetically fair comparison.
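To make the first reason concrete, here is a minimal sketch of position-weighted mention scoring. The only figure taken from this article is the roughly 3x gap between a top-position citation and a passing mention; the function name, thresholds, and exact weights are our own assumptions.

```python
# Sketch of position weighting, assuming a simple per-query lookup.
def mention_score(position: int | None) -> float:
    """Score one brand's appearance in a single AI answer."""
    if position is None:
        return 0.0   # brand never appeared in the answer
    if position <= 2:
        return 3.0   # named first or second: full citation bonus (~3x a mention)
    return 1.0       # passing mention further down the answer
```

A brand that lands in the top slot on every platform, as Google did, collects that bonus in every row of the aggregate, which is most of the distance between 8.6 and the 5.5 to 6.6 cluster.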
The thing we did not expect
The lowest score in the entire dataset was Claude rating openai.com at 1.6 out of 10. That looked like a parser bug at first. It is not a bug. It is, as far as we can tell, system-level behavior.
We built a 4x4 matrix of every LLM in our panel rating every LLM creator we could test:
SUBJECT         BY ChatGPT   BY Perplexity   BY Gemini   BY Claude
openai.com      5.4          6.8             8.3         1.6
anthropic.com   2.8          6.9             6.6         6.0
google.com      8.5          8.8             8.5         8.5
perplexity.ai   4.5          6.9             4.5         7.8

The pattern: ChatGPT (built by OpenAI) gives Anthropic a 2.8, the lowest competitor rating ChatGPT gives anyone. Claude (built by Anthropic) gives OpenAI a 1.6, the lowest score in the entire matrix. The penalty is reciprocal and specific. It does not extend to Perplexity or Google, which sit in adjacent categories.
We have a separate, longer write-up of this finding at /research/llm-rival-recommendation-bias. Short version: the two closest foundation-model rivals appear to systematically suppress each other in business-discovery contexts. Our cross-platform aggregation formula then averages across all four LLMs, which means any AI company we audit pays a 1 to 1.5 point penalty for who their direct competitor happens to be — a penalty the company cannot fix through better content, schema, or directories.
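To see where the 1 to 1.5 point figure comes from, here is the arithmetic on the openai.com row from the matrix above, assuming the cross-platform aggregation is a plain equal-weight average (the article does not state the exact formula).

```python
# openai.com ratings from the matrix above
ratings = {"chatgpt": 5.4, "perplexity": 6.8, "gemini": 8.3, "claude": 1.6}

with_rival = sum(ratings.values()) / len(ratings)                        # ~5.5
without_rival = sum(v for k, v in ratings.items() if k != "claude") / 3  # ~6.8

penalty = without_rival - with_rival                                     # ~1.3
print(f"aggregate {with_rival:.2f}, ex-rival {without_rival:.2f}, penalty {penalty:.2f}")
```

Under that assumption the single 1.6 costs OpenAI about 1.3 points of aggregate score; the same calculation on the anthropic.com row, where ChatGPT's 2.8 is the outlier, gives roughly 0.9 points.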
What we found wrong with our own work
Six of the eight brands we tested fell through our industry classifier and ended up being audited as if they were local hireable services. Apple got asked "Best consumer electronics in California" and "Who are the top consumer electronics professionals in California?" Wikipedia got asked "Who should I hire for online encyclopedia in your area?" These are nonsense queries for global brands.
Our SaaS taxonomy fix from earlier in the month closed the gap for our own category but missed a long list of industries that aren't local services. AI research, AI search, search engines, encyclopedias, consumer electronics, social networks. That gap is now patched, with about forty new keyword routes added across Beauty and Personal Care, Technology and SaaS, Retail, AI products, and civic tech. The Beauty and Personal Care industry had zero keywords routing to it before this patch, which means we had been silently mis-classifying every hair salon we audited. We also fixed a collision where "Software engineer" was matching "engineer" and routing to Construction and Trades.
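For a sense of what that collision fix looks like, here is a hedged sketch of keyword-route classification where longer, more specific routes win over generic ones. The route names echo the industries mentioned above, but the keywords, data structure, and fallback are illustrative assumptions, not the production classifier.

```python
# Illustrative keyword routing: specific routes must beat generic substrings,
# otherwise "software engineer" falls through to "engineer" -> Construction.
KEYWORD_ROUTES = [
    ("software engineer", "Technology and SaaS"),
    ("ai research",       "AI products"),
    ("hair salon",        "Beauty and Personal Care"),
    ("engineer",          "Construction and Trades"),
]

def classify(business_description: str) -> str:
    text = business_description.lower()
    # Longest keyword first, so specific routes win over generic ones.
    for keyword, industry in sorted(KEYWORD_ROUTES, key=lambda r: -len(r[0])):
        if keyword in text:
            return industry
    return "Local services"  # the fallback that had been swallowing global brands

# classify("independent software engineer") -> "Technology and SaaS"
# (before the fix, the bare "engineer" route matched first)
```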
The patch went out in the same session we ran this study. Production redeployed about three minutes after the commit landed.
What this means for our scoring page
A 6.6 out of 10 is not "needs work." A 6.6 in a dataset where Wikipedia scores 5.9 is "category leader, with room to refine." Our verdict copy was written before we had any high-end calibration data, and the thresholds are now wrong. We are revising the labels, not the math, in the next product update.
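Purely as an illustration of "labels, not math," a revised verdict mapping might look something like the sketch below. None of these cut-offs are final or official; the only anchors taken from this study are that 8.6 sits alone at the top and that 5.5 to 6.6 is where global brands landed.

```python
def verdict(score: float) -> str:
    # Illustrative cut-offs only; the scoring math itself is unchanged.
    if score >= 8.0:
        return "Top of the field"                        # google.com territory
    if score >= 5.5:
        return "Category leader, with room to refine"    # the 5.5-6.6 brand band
    if score >= 3.5:
        return "Competitive base, specific gaps to close"
    return "Needs work"
```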
If you have run a Sourcepull audit and your score is in the 4 to 6 range, your performance is closer to Apple's than it is to a non-existent domain's. That doesn't mean stop working. There is real room to improve. But the panic-button framing the dashboard sometimes uses is overstated.
The methodology principle
The reason we are publishing this is not to be self-deprecating. It's to make a point about how scoring systems should be built.
Any scoring system needs a calibration set with known exemplars at both ends. We had eight examples of bad-end performance (our own scoreboard) and zero examples of good-end performance. That meant we could tell a customer they were doing badly relative to our floor, but we could not tell them what doing well actually looked like. We had a half-calibrated scale.
After this study, we have one. Google is what 8.6 looks like. Wikipedia is what 5.9 looks like. The middle band is real and our customers live in it.
What we are doing next
Three things, in order.
The first is re-running the one customer-style row we found that genuinely had been audited under the wrong template family. Now that the classifier is patched, we want to compare the new score against the old one.
The second is a follow-up study on the rival-suppression finding. Three follow-up questions: does the same pattern hold against the product domains rather than the parent companies, does it hold across multiple days, and is it observable in published system-prompt research from either company.
The third is the verdict-copy revision. Real change, small ship.
If you want to see how your business scores against the same panel that just told Wikipedia it has room to grow, run the free signal check yourself. It takes about a minute, and the calibration goalposts now exist.