What Happens When You Block GPTBot
Many businesses blocked GPTBot last year and never thought about it again. A developer added a line to robots.txt after reading about AI content scraping. A website template shipped with it pre-blocked. A plugin toggled it on automatically "for privacy."
The result: invisible to ChatGPT.
What GPTBot actually is
GPTBot is OpenAI's crawler for collecting training data — text that gets incorporated into future ChatGPT models. OpenAI also publishes separate user agents for live retrieval: OAI-SearchBot supports ChatGPT's search feature, and ChatGPT-User fetches pages during a user's browsing request.
These are distinct functions, but in practice they fail together: the templates and plugins that disallow GPTBot usually disallow OpenAI's retrieval agents in the same rule, and a blanket disallow blocks all of them.
When your robots.txt file includes a rule that disallows GPTBot, you're telling OpenAI's infrastructure: do not read this site. And it won't.
The training data problem
ChatGPT's knowledge of your business comes from two places: what was in its training data when the model was built, and what it can retrieve in real time.
If GPTBot has never crawled your site, you have no footprint in the training data. The model may have no awareness of your business beyond what's mentioned elsewhere on the web — reviews, directory listings, or third-party articles.
For well-established businesses, this sometimes doesn't matter much. A dentist with 200 Google reviews, a complete GBP, and strong Yelp coverage has signal from many other sources. But for newer businesses, niche operators, or anyone without rich third-party coverage, blocking GPTBot is blocking the primary way ChatGPT learns you exist.
Training data inclusion doesn't guarantee a citation. But exclusion makes it much harder for a model to confidently recommend you from memory.
The real-time retrieval problem is more urgent
This is where the damage gets harder to undo.
When ChatGPT has browsing enabled — the default in most current versions — and someone asks a question, the model may run a live web search to supplement its answer. This retrieval process respects robots.txt.
If you've blocked GPTBot and a user asks "best commercial electrician in Mississauga," ChatGPT can't read your site during that query. It might surface your GBP data or your Yelp profile, but your actual service pages — where you've done the work of describing what you do, where you work, and who you serve — are completely off the table.
Perplexity has the same issue. PerplexityBot also respects robots.txt. Since Perplexity does a live web search on every query, blocking that crawler probably costs more than any other single misconfiguration. It's a closed door at query time, every time.
Other AI crawlers you might be blocking
GPTBot is the most commonly blocked, but it isn't alone. In Signal Check audits, we regularly see sites blocking:
**PerplexityBot** — Perplexity's crawler. Because Perplexity retrieves live during every answer, blocking this is the highest-cost mistake in practice.
**ClaudeBot** — Anthropic's crawler for training Claude. Blocking it cuts off training data collection for Claude models.
**Google-Extended** — Google's robots.txt control for Gemini training. It isn't a separate crawler: Googlebot does the fetching, and a Google-Extended disallow tells Google not to use your content to train Gemini. Blocking Google-Extended doesn't affect your Google rankings — but it does affect your Gemini visibility.
The most common culprit is a blanket robots.txt rule that pairs `User-agent: *` with `Disallow: /`. That blocks every compliant crawler — search bots, AI bots, everything. We've seen it on sites where the owner had no idea it was there.
When blocking is actually reasonable
There are legitimate reasons to block AI crawlers:
You don't want your content in training data. Writers, news organizations, and content businesses have reasonable concerns about their work being used without compensation.
Your site contains sensitive information. Medical records, private client databases, or proprietary content shouldn't be crawled by anyone.
Legal obligations apply. Some industries have data usage restrictions that may extend to AI training.
If any of these apply, blocking is defensible. But most local businesses — plumbers, dentists, law firms, contractors — don't have these concerns. They blocked AI crawlers because a template added it, a developer did it reflexively, or someone saw a headline about AI scraping and didn't think through the downstream effect on recommendations.
How to check your robots.txt right now
Go to `yourdomain.com/robots.txt` in your browser and read it. Look for:
```
User-agent: GPTBot
Disallow: /
```
or any wildcard rule like:
```
User-agent: *
Disallow: /
```
That wildcard blocks every crawler that respects robots.txt — including all of the AI agents listed above.
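If you'd rather check programmatically, Python's standard library can evaluate robots.txt rules directly. A minimal sketch, using a hypothetical robots.txt and domain:

```python
# Check which AI user agents a robots.txt blocks.
# The robots.txt content and example.com URL below are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

AI_BOTS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for bot in AI_BOTS:
    # can_fetch() applies the same matching rules compliant crawlers use
    allowed = parser.can_fetch(bot, "https://example.com/")
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

With the rules above, only GPTBot is blocked site-wide; the other agents fall under the wildcard group, which restricts just `/private/`. Point the same parser at your own robots.txt to audit your configuration.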
If you see those rules and didn't put them there intentionally, remove them or replace the blanket disallow with a more specific path. If there are directories you want private, disallow those paths only. Don't shut out the entire site when you mean to protect one folder.
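As a sketch of that narrower approach, a robots.txt like this protects a single private directory (the `/client-portal/` path here is a made-up example) while leaving the rest of the site open to every compliant crawler:

```
User-agent: *
Disallow: /client-portal/
```

Everything outside that folder stays crawlable by search bots and AI bots alike.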
What to do instead of blanket blocking
If your concern is AI training data usage specifically, OpenAI, Anthropic, and Google have separate opt-out mechanisms that signal "do not train on this" without blocking real-time retrieval. These are worth looking into if training data is the issue, because they let you protect the concern without sacrificing recommendations.
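As one illustration of how that can look in robots.txt (agent names as documented at the time of writing — check each vendor's current crawler docs before relying on them): OpenAI describes GPTBot as its training crawler and OAI-SearchBot as its search agent, and Google describes Google-Extended as a training-only control that doesn't affect Googlebot. A configuration like this opts out of training while leaving live retrieval open:

```
# Opt out of model training
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Leave search/retrieval agents unblocked
User-agent: OAI-SearchBot
Allow: /
```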
For most local businesses, the right posture is no blocking at all. Let GPTBot, PerplexityBot, ClaudeBot, and GoogleExtended read your site. Give them something worth reading: clear service pages, LocalBusiness schema, consistent NAP data, and content that explicitly describes what you do and where.
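The LocalBusiness schema mentioned above can be a single JSON-LD block in your page's `<head>`. A minimal sketch with placeholder business details — swap in your real name, phone, address, and URL:

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "name": "Example Electric Co.",
  "telephone": "+1-905-555-0100",
  "url": "https://example.com",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "123 Example Ave",
    "addressLocality": "Mississauga",
    "addressRegion": "ON",
    "postalCode": "L5B 0A1",
    "addressCountry": "CA"
  },
  "areaServed": "Mississauga"
}
</script>
```

This gives crawlers a machine-readable statement of what you do and where, matching the NAP data on the rest of your site.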
The businesses winning in AI search are the ones making it easy to be crawled and cited. Blocking is the opposite of that — and it's often the only thing standing between a well-optimized site and a low visibility score.
This comes up in almost every audit
When we run a Signal Check and a business has low AI visibility despite good reviews and a solid website, robots.txt is one of the first things we look at. It's often a one-line fix that immediately reopens your site to every major AI crawler.
If you haven't checked your robots.txt recently, a Signal Check flags it in the technical section — it shows exactly which crawlers you're blocking and what to change.
See how your business scores on AI platforms.
Check your score — free