Analysis · 6 min read · 2026-05-27

The AI Citation Bias That No Schema Fix Can Change

Every AI visibility strategy assumes that if you fix the right things -- schema, directories, entity consistency -- your score improves across all platforms. Apply the fix. Collect the improvement. Repeat.

In April 2026, we ran a test that complicates that picture. We used Sourcepull to audit four foundation-model companies against each other: we ran signal checks on OpenAI, Anthropic, Google, and Perplexity, then observed how each platform rated its direct competitors. The result is a 4x4 matrix of AI platforms scoring subjects that include their closest rivals.

The pattern it revealed is not about schema or directories. It is about something baked into how the models were trained.

The 4x4 matrix

In our 2026-04-25 investigation (edge-cases/llm-cross-recommendation-bias-2026-04-25.md), we ran live signal checks against each of the four foundation-model companies whose products power our audit platform, then recorded per-platform scores across all four raters.

| Subject | ChatGPT | Perplexity | Gemini | Claude | Aggregate |
|---|---|---|---|---|---|
| openai.com | 5.4 | 6.8 | 8.3 | **1.6** | 5.5 |
| anthropic.com | **2.8** | 6.9 | 6.6 | 6.0 | 5.5 |
| google.com | 8.5 | 8.8 | 8.5 | 8.5 | 8.6 |
| perplexity.ai | 4.5 | 6.9 | 4.5 | 7.8 | 5.9 |

Two cells stand out. Claude scored OpenAI 1.6/10 -- the lowest score in the entire matrix. ChatGPT scored Anthropic 2.8/10 -- the lowest score ChatGPT assigned to any subject. These two are the closest head-to-head rivals in the foundation-model space, and in this run neither would recommend the other.

What makes this a pattern rather than noise: neither platform applies the same penalty to indirect competitors. Claude rated Google 8.5 and Perplexity 7.8. ChatGPT rated Google 8.5 and Perplexity 4.5, which is its second-lowest score but still 1.7 points above the 2.8 it gave Anthropic. The suppression is specific -- it targets the direct substitutable rival, not AI companies generally.
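The comparison can be reproduced from the matrix with a few lines of Python. This is a minimal sketch, not our audit code: the `OWN_SUBJECT` and `DIRECT_RIVAL` mappings and the `rival_gap` helper are names introduced here for illustration, and the choice to average the two indirect competitors is an assumption.

```python
# Scores copied from the matrix: SCORES[subject][rater], on a 0-10 scale.
SCORES = {
    "openai.com":    {"ChatGPT": 5.4, "Perplexity": 6.8, "Gemini": 8.3, "Claude": 1.6},
    "anthropic.com": {"ChatGPT": 2.8, "Perplexity": 6.9, "Gemini": 6.6, "Claude": 6.0},
    "google.com":    {"ChatGPT": 8.5, "Perplexity": 8.8, "Gemini": 8.5, "Claude": 8.5},
    "perplexity.ai": {"ChatGPT": 4.5, "Perplexity": 6.9, "Gemini": 4.5, "Claude": 7.8},
}

# Illustrative assumption: each rater's own company and its direct rival.
OWN_SUBJECT = {"ChatGPT": "openai.com", "Claude": "anthropic.com"}
DIRECT_RIVAL = {"ChatGPT": "anthropic.com", "Claude": "openai.com"}


def rival_gap(rater: str) -> float:
    """Mean score given to indirect competitors minus the score given to the direct rival."""
    own = OWN_SUBJECT[rater]
    rival = DIRECT_RIVAL[rater]
    indirect = [
        row[rater]
        for subject, row in SCORES.items()
        if subject not in (own, rival)
    ]
    return round(sum(indirect) / len(indirect) - SCORES[rival][rater], 2)


for rater in DIRECT_RIVAL:
    print(f"{rater}: direct rival scored {rival_gap(rater)} points below indirect competitors")
```

On these numbers the gap is about 3.7 points for ChatGPT and about 6.55 points for Claude. A single run of four subjects is a small sample, so the gap is best read as a flag for follow-up rather than a measurement.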

One methodological caveat: our industry classifier placed OpenAI and Anthropic in the local-service category rather than technology, which produced some off-topic business-discovery queries. The awareness and evaluation queries ran correctly. The relative pattern -- mutual suppression between the two direct rivals -- survives the miscategorization because every subject received the same miscategorized queries, holding the error constant.

Why it happens and why it cannot be fixed

The first hypothesis we tested: this behavior is instructed via system prompts. In our Scout follow-up on 2026-04-26, we searched available Claude and GPT-5 system prompt disclosures for explicit competitor-restriction language. Neither contains it. Anthropic's published documentation includes an anti-advertising note ("Anthropic doesn't let advertisers pay to have Claude promote products") but no instruction to suppress competitors. OpenAI's December 2025 model spec has no equivalent.

The mechanism is almost certainly RLHF -- reinforcement learning from human feedback. We cannot observe the training data directly, but the likely path is this: during fine-tuning, human raters consistently downgrade responses where a model enthusiastically recommends a direct commercial rival. Over thousands of training examples, the model learns to avoid endorsing its closest substitutes without any explicit instruction. On this reading, the behavior is a byproduct of the fine-tuning process, not a runtime policy.

This is why AEO tactics cannot fix it. There is no schema property, no directory listing, no content page that will cause Claude to confidently recommend OpenAI to a user asking "what AI tools should I use?" The bias lives in the training layer, not in anything the model retrieves. You cannot reach it with anything crawlable.

In our 2026-04-26 research, we reviewed arXiv, SEO/AEO practitioner archives, and AI safety literature for prior documentation of this pattern. We found papers on general LLM recommendation biases (anchoring, familiarity, popularity) but nothing studying the specific foundation-model-rating-its-direct-rival scenario. As of that investigation, this behavioral pattern appears to be undocumented in published research.

What Perplexity's numbers reveal

The other finding in the matrix is worth noting separately: Perplexity scored every subject between 6.8 and 8.8, a range of 2.0 points across all four subjects. Every other rater showed a wider spread -- Gemini ranged 4.0 points, ChatGPT 5.7 points, and Claude 6.9 points.
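The spread per rater is simply the highest score minus the lowest score in that rater's column. A short sketch, with the matrix values copied in and `score_range` as an illustrative helper name:

```python
# Scores copied from the matrix: SCORES[subject][rater], on a 0-10 scale.
SCORES = {
    "openai.com":    {"ChatGPT": 5.4, "Perplexity": 6.8, "Gemini": 8.3, "Claude": 1.6},
    "anthropic.com": {"ChatGPT": 2.8, "Perplexity": 6.9, "Gemini": 6.6, "Claude": 6.0},
    "google.com":    {"ChatGPT": 8.5, "Perplexity": 8.8, "Gemini": 8.5, "Claude": 8.5},
    "perplexity.ai": {"ChatGPT": 4.5, "Perplexity": 6.9, "Gemini": 4.5, "Claude": 7.8},
}
RATERS = ["ChatGPT", "Perplexity", "Gemini", "Claude"]


def score_range(rater: str) -> float:
    """Highest minus lowest score the rater assigned across all four subjects."""
    column = [row[rater] for row in SCORES.values()]
    return round(max(column) - min(column), 1)


ranges = {rater: score_range(rater) for rater in RATERS}

# Print raters from narrowest to widest spread.
for rater, spread in sorted(ranges.items(), key=lambda item: item[1]):
    print(f"{rater}: {spread}")
```

Sorted this way, Perplexity comes first at 2.0 and Claude last at 6.9.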

The most plausible explanation is structural. Perplexity is a citation engine. Its product value comes from broad, inclusive retrieval -- finding and surfacing useful sources. A model trained to be a thorough citation finder is less likely to develop strong suppression behaviors against any category of sources. Where ChatGPT and Claude may hesitate to recommend a competitor or apply caution to an unfamiliar entity, Perplexity defaults to inclusion.

We see this pattern reflected in audit data beyond the matrix. For businesses that are invisible or near-invisible on ChatGPT and Claude, Perplexity frequently produces the first non-zero citation scores. The platform is more forgiving at earlier stages of entity establishment. If you are building AI presence from near zero, Perplexity is often the fastest platform to show early movement.

What this means for most businesses

For local service businesses -- contractors, accountants, healthcare providers, legal services -- the foundation-model rivalry is irrelevant. You are not Anthropic's commercial substitute. You are not in the suppression zone.

But the matrix reveals a principle that does apply broadly: AI platforms have recommendation tendencies baked into training that are not driven by content quality, schema, or directory presence. A low score on a specific platform may reflect that platform's particular training-level biases for your category or competitive context, not a correctable AEO gap.

The practical version of this shows up in how differently platforms behave for niche business categories. A contractor with strong Gemini scores and minimal Perplexity scores is not necessarily doing Perplexity AEO wrong. Perplexity's B-series category citations for home services draw from contractor-specific directories that differ from Gemini's sources, so the gap may be a structural source-pool difference as much as a fixable absence.

This is why per-platform breakdowns matter more than aggregate scores for diagnosis. A 4.0 aggregate built from strong Perplexity and weak ChatGPT requires a different fix sequence than a 4.0 built from the reverse. Without the per-platform breakdown, you are solving for an average that may point you at the wrong lever.
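To make the point concrete, here is a small sketch with two invented score profiles. The numbers are hypothetical, not audit results, and treating the aggregate as an unweighted mean is an assumption made only for this illustration.

```python
from statistics import mean

# Hypothetical per-platform scores for two businesses (not real audit data).
profile_a = {"ChatGPT": 1.5, "Perplexity": 7.0, "Gemini": 4.0, "Claude": 3.5}
profile_b = {"ChatGPT": 7.0, "Perplexity": 1.5, "Gemini": 4.0, "Claude": 3.5}


def diagnose(profile: dict[str, float]) -> tuple[float, str]:
    """Return (aggregate, weakest platform), assuming an unweighted mean aggregate."""
    aggregate = round(mean(profile.values()), 1)
    weakest = min(profile, key=profile.get)
    return aggregate, weakest


for name, profile in (("A", profile_a), ("B", profile_b)):
    aggregate, weakest = diagnose(profile)
    print(f"Profile {name}: aggregate {aggregate}, start with {weakest}")
```

Both profiles come out at 4.0, yet the first points at ChatGPT and the second at Perplexity as the platform to investigate first. The aggregate alone cannot tell them apart.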

Where the finding matters most directly

If you are building a product that competes with a foundation-model company -- an AI assistant, a conversational search product, a code generation tool -- at least one platform may structurally underperform for your brand. Not because of your AEO. Because you are its direct substitute.

The diagnostic question in that case is different from a standard audit: is the low score on that specific platform a suppression pattern, or an absence of entity signal? Suppression looks like the AI acknowledging your brand exists but consistently omitting or downgrading it in recommendation contexts. Absence looks like the model treating your brand as unverified or unknown.
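One rough way to operationalize that distinction is to compare how often a platform recognizes the brand with how often it recommends it. The sketch below is a heuristic under stated assumptions: the `PlatformObservation` fields, the `classify` function, and both thresholds are hypothetical and would need tuning against real query logs.

```python
from dataclasses import dataclass


@dataclass
class PlatformObservation:
    """Hypothetical tallies from one platform's query run."""
    awareness_hits: int        # awareness queries where the brand was described as known
    awareness_total: int
    recommendation_hits: int   # recommendation queries where the brand was named
    recommendation_total: int


def classify(obs: PlatformObservation,
             known_threshold: float = 0.5,
             recommend_threshold: float = 0.2) -> str:
    """Label a low-scoring platform as 'absence', 'suppression', or 'no clear gap'.

    Both thresholds are illustrative assumptions, not calibrated values.
    """
    if obs.awareness_total <= 0 or obs.recommendation_total <= 0:
        raise ValueError("need at least one query of each type")
    known_rate = obs.awareness_hits / obs.awareness_total
    recommend_rate = obs.recommendation_hits / obs.recommendation_total
    if known_rate < known_threshold:
        return "absence"       # the model mostly treats the brand as unknown
    if recommend_rate < recommend_threshold:
        return "suppression"   # known, but rarely surfaced in recommendations
    return "no clear gap"


print(classify(PlatformObservation(9, 10, 1, 10)))  # recognized but rarely recommended
print(classify(PlatformObservation(2, 10, 0, 10)))  # rarely recognized at all
```

The first call returns "suppression" and the second "absence". Absence is checked first because a brand the model does not recognize cannot meaningfully be suppressed.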

Signal Check at sourcepull.ca runs the full 4-platform query set and shows per-platform citation data alongside accuracy classification. For most businesses, the platform with the lowest score has a correctable gap. For AI-space products, that per-platform breakdown is where you would first see whether the low-scoring platform happens to be the one whose parent company is your closest competitor.
