---
title: Same question, different AI, different answers. Models agree 4% of the time
description: GolOps research — 798,000+ comparisons across 8 AI systems. Average agreement on the #1 brand is 43.3%, full consensus 4.0%. Each model builds its shortlist differently, and visibility on one does not carry over to another.
date: 2026-04-21T00:00:00Z
lastmod: 2026-06-02T00:00:00Z
published: true
categories: [research, llm]
author: golops
---

A user asks ChatGPT, Claude, and Gemini the same question and expects roughly the same answer. In practice, the models name different brands. Agreement is the exception, not the rule.

GolOps measured the field. 798,000+ comparisons across 8 major AI systems, 44,088 visibility reports, 8,902 unique choice scenarios. For each query we recorded the #1 brand from every model, then counted what share of models landed on the same first place.

| Metric | Value |
|---|---|
| Comparisons analyzed | 798,000+ |
| Visibility reports | 44,088 |
| Unique scenarios | 8,902 |
| AI systems compared | 8 |

*Data window: August 2025 — March 2026*

## Key findings

**43.3%: average agreement.** On average, fewer than half the models land on the same #1 brand, so visibility on ChatGPT is a weak predictor of visibility on Claude or Gemini. Independent measurements point the same way: pairwise overlap on top brands is only 36–55% (according to [BrightEdge](https://www.brightedge.com/resources/weekly-ai-search-insights/where-ai-engines-agree-on-brands)).

**4.0%: full consensus.** In only 4 queries out of 100 do all 8 models name the same brand. This happens almost exclusively in categories with a single clear leader.

**60%: the divergence zone.** In roughly 60% of queries (59.7%), model agreement on the top brand is below 50%. This is not a measurement error. The likely cause is structural: the systems train on different data, so they answer differently.

**20%: average pairwise agreement.** Any two models, on average, match on the top brand in just one query out of five. In practice, each model is a separate channel with its own audience.

## The distribution of agreement

Break all queries down by how much the models agree on the #1 brand, and the picture skews toward divergence:

| Agreement level | Share of queries | Queries | Reading |
|---|---|---|---|
| 0–25% | 14.6% | 116,621 | High divergence |
| 25–50% | 45.1% | 359,732 | Low agreement |
| 50–75% | 28.0% | 223,267 | Moderate agreement |
| 75–99% | 8.3% | 66,466 | Good agreement |
| 100% | 4.0% | 31,558 | Full consensus |

More than half of all queries (59.7%, roughly 60%) fall below 50% agreement. Full consensus, at 4.0%, is rare and usually appears only where one brand dominates its category outright. The takeaway for a brand: visibility on one system does not transfer to the others automatically, because each model builds its shortlist by its own rules. Citation data points the same way: an analysis of 82,619 prompts over 17 weeks found that the three major platforms have very little in common in which sources they cite (according to [SISTRIX](https://www.sistrix.com/blog/ai-citation-drift-how-stable-are-sources-in-ai-search-results/)).
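The shares in the distribution table can be recomputed from the query counts alone. A minimal check in Python, using only the counts listed in the table (the dictionary keys are just labels for the bands):

```python
# Query counts per agreement band, copied from the table above.
band_counts = {
    "0-25%": 116_621,
    "25-50%": 359_732,
    "50-75%": 223_267,
    "75-99%": 66_466,
    "100%": 31_558,
}

total = sum(band_counts.values())

# Share of all queries in each band, in percent, rounded to one decimal.
shares = {band: round(100 * n / total, 1) for band, n in band_counts.items()}

# Queries where fewer than half the models agree: the two lowest bands.
below_half = round(
    100 * (band_counts["0-25%"] + band_counts["25-50%"]) / total, 1
)

print(total)       # 797644
print(shares)
print(below_half)  # 59.7
```

The band counts sum to 797,644, and the two lowest bands together come to 59.7%, which the text rounds to 60%.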

## Who agrees with whom

Pairwise agreement across the eight systems averages 20%. The maximum is 35%, reached by Claude paired with DeepSeek and by Claude paired with Grok; the minimum is 10%, for Meta AI paired with Perplexity.

| Model | Pairwise agreement with the other models (range) |
|---|---|
| Claude | up to 35% (with DeepSeek and Grok) |
| DeepSeek | up to 35% (with Claude) |
| Grok | up to 35% (with Claude) |
| ChatGPT (OpenAI) | 17–27% |
| Gemini | 12–26% |
| Google AIO | 12–20% |
| Meta AI | 10–23% |
| Perplexity | 10–17% |

Some models gravitate toward each other, forming implicit clusters: Claude, DeepSeek, and Grok agree noticeably more than average. At the other pole, Perplexity matches the rest in just 10–17% of queries and Meta AI in 10–23%. Visibility on those two platforms reaches an audience that does not see what the other systems show, so Perplexity and Meta AI are separate channels, and you have to work each one separately. The difference in model behavior is consistent with outside data: an analysis of 17.2 million citations found that models cite sources differently (according to [Yext](https://www.yext.com/research/ai-citation-behavior-across-models)).
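One way to arrive at a pairwise figure is to count, for every pair of models, the share of queries where both answered and both named the same #1 brand. The sketch below assumes that definition, which the report does not spell out. The function name and the sample brands (Acme, Borealis, Zenith) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Optional


def pairwise_agreement(
    queries: list[dict[str, Optional[str]]],
) -> dict[tuple[str, str], float]:
    """Share of co-answered queries on which each pair of models matches.

    Assumption: a pair is only counted on queries where both models
    returned a brand (None marks "no recommendation").
    """
    matches: dict[tuple[str, str], int] = defaultdict(int)
    seen: dict[tuple[str, str], int] = defaultdict(int)
    for top_brands in queries:
        # Sort model names so each pair always has the same key order.
        answered = sorted(m for m, b in top_brands.items() if b is not None)
        for a, b in combinations(answered, 2):
            seen[(a, b)] += 1
            if top_brands[a] == top_brands[b]:
                matches[(a, b)] += 1
    return {pair: matches[pair] / seen[pair] for pair in seen}


# Hypothetical data: two queries, three models.
sample = [
    {"Claude": "Acme", "DeepSeek": "Acme", "Perplexity": "Zenith"},
    {"Claude": "Acme", "DeepSeek": "Borealis", "Perplexity": None},
]
result = pairwise_agreement(sample)
print(result[("Claude", "DeepSeek")])    # 0.5
print(result[("Claude", "Perplexity")])  # 0.0
```

Averaging the values of the returned dictionary would give a single overall pairwise figure; whether the report weights pairs equally is not stated.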

## Who even answers

Not all models are equally willing to recommend a brand. Meta AI returns a recommendation in 95.0% of queries; Google AI Overviews, in only 56.5%.

| # | Model | Share of queries with a brand recommendation |
|---|---|---|
| 1 | Meta AI | 95.0% |
| 2 | ChatGPT (OpenAI) | 85.4% |
| 3 | Grok | 83.0% |
| 4 | Gemini | 82.2% |
| 5 | DeepSeek | 80.9% |
| 6 | Claude | 79.9% |
| 7 | Perplexity | 79.4% |
| 8 | Google AIO | 56.5% |

Google AIO is the most selective system: it gives a recommendation in fewer than six queries out of ten. If a model rarely answers in your category, measurement for that model runs in two steps. First you establish whether the model returns a recommendation in your category's scenarios at all, and only then where your brand ranks.
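The two-step check can be sketched as two small functions: one for how often a model answers at all, and one for how often a given brand takes first place when it does. This is an illustration under assumed data shapes, not the GolOps pipeline; the function names and sample brands are hypothetical.

```python
from typing import Optional

# One query: model name -> #1 brand, or None if the model gave no recommendation.
Query = dict[str, Optional[str]]


def response_rate(queries: list[Query], model: str) -> Optional[float]:
    """Step 1: share of queries in which the model recommended any brand."""
    asked = [q for q in queries if model in q]
    if not asked:
        return None
    return sum(q[model] is not None for q in asked) / len(asked)


def first_place_share(
    queries: list[Query], model: str, brand: str
) -> Optional[float]:
    """Step 2: among the model's answers, share where the brand is #1."""
    answered = [q[model] for q in queries if q.get(model) is not None]
    if not answered:
        return None
    return answered.count(brand) / len(answered)


# Hypothetical category with four scenarios.
category = [
    {"Meta AI": "Acme", "Google AIO": None},
    {"Meta AI": "Borealis", "Google AIO": "Acme"},
    {"Meta AI": "Acme", "Google AIO": None},
    {"Meta AI": "Zenith", "Google AIO": None},
]

print(response_rate(category, "Google AIO"))              # 0.25
print(first_place_share(category, "Google AIO", "Acme"))  # 1.0
print(first_place_share(category, "Meta AI", "Acme"))     # 0.5
```

In this toy data the 100% first-place share on Google AIO rests on a single answer, which is why the response rate is worth reading first.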

## Which questions diverge most

The type of query decides how far the models drift. Comparison queries ("Nike vs Adidas") produce the highest agreement, 50.4%. General and "best in category" queries have the largest share of high-divergence cases (15.0% and 14.8%), and those are exactly where brands have the most room.

| Query type | Agreement | High divergence (&lt;25%) |
|---|---|---|
| Comparison | 50.4% | 10.8% |
| "How to" | 45.3% | 13.4% |
| "Alternatives to X" | 44.1% | 11.4% |
| "Best in category" | 43.4% | 14.8% |
| Recommendation | 43.1% | 14.4% |
| General | 42.2% | 15.0% |

The logic is direct: a comparison query sets the context, leaving models little room to interpret. An open recommendation leaves that room wide open. That is why the "best in category" and general band is where the opportunity sits: with high divergence the leader is not locked in, and the shortlist slot is not nailed down by competitors.

## Eight models, eight answers

In the extreme case, the same query yields eight different brands across eight models. This is not confined to the edges of the sample: it recurs reliably in general scenarios and even in comparison scenarios, the type with the highest average agreement.

| Query: "best payroll and HR platform for a fast-growing remote startup" | #1 brand |
|---|---|
| ChatGPT (OpenAI) | Gusto |
| Claude | Rippling |
| Gemini | Deel |
| Google AIO | ADP Workforce Now |
| Grok | BambooHR |
| DeepSeek | Paychex Flex |
| Meta AI | Workday |
| Perplexity | HiBob |

Eight systems, eight different leaders, zero overlap. The same kind of split reproduces in finance ("will I be approved with a 550 credit score") and industrial ("compare integrated concrete solutions for complex infrastructure projects") scenarios. One question, eight shortlists, a different brand at the top of each.
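Worked through in numbers, the payroll example sits at the floor of the scale. The snippet below takes the eight answers from the table and assumes that agreement is the share of models naming the most common #1 brand; that reading of the metric is an assumption.

```python
from collections import Counter
from itertools import combinations

# #1 brand per model for the payroll and HR query, from the table above.
top_brands = {
    "ChatGPT (OpenAI)": "Gusto",
    "Claude": "Rippling",
    "Gemini": "Deel",
    "Google AIO": "ADP Workforce Now",
    "Grok": "BambooHR",
    "DeepSeek": "Paychex Flex",
    "Meta AI": "Workday",
    "Perplexity": "HiBob",
}

# Agreement: how many models name the most common brand, over all models.
_, top_count = Counter(top_brands.values()).most_common(1)[0]
agreement = top_count / len(top_brands)

# Pairwise view: every unordered pair of models, and how many pairs match.
pairs = list(combinations(top_brands, 2))
matching_pairs = sum(top_brands[a] == top_brands[b] for a, b in pairs)

print(agreement)       # 0.125
print(len(pairs))      # 28
print(matching_pairs)  # 0
```

With eight distinct brands, the most common one is named by a single model, so agreement is 1/8 = 12.5%, and none of the 28 model pairs match.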

## When the models agree

That 4.0% of full consensus — 31,558 queries — is almost always built the same way. All 8 models name one brand where:

- a single brand dominates the category outright;
- the query is narrow and specific;
- the category is clearly defined, with few alternatives.

These look like queries for a password manager for cross-device team sharing, a CI/CD tool for a small engineering team, or a video conferencing tool for enterprise meetings — places where one player is consistently treated as the category leader. Full consensus is achievable, but it is not a goal: it just means the category is already taken. The real work happens in the divergence band, where the slot is still open.

## Methodology

What underpins the numbers:

- **798,000+ valid comparisons**: for each query we recorded the #1 brand from every model, then computed the share of models landing on the same first place.
- **44,088 visibility reports**: each holds responses from up to 8 AI systems on a single set of queries.
- **8,902 unique scenarios**: queries across industries, types, and phrasings.
- **8 AI systems**: ChatGPT, Claude, Gemini, Google AI Overviews, Grok, DeepSeek, Meta AI, Perplexity.
- **Quality filter**: the sample includes only queries where at least 5 models gave a valid brand recommendation, so that every agreement share is computed over enough models to be meaningful.
- **Collection window**: August 2025 to March 2026.
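The per-query measurement described in the list can be sketched in a few lines. This is a minimal illustration, not the GolOps implementation: it assumes agreement is the share of responding models that name the most common #1 brand, that `None` marks a model with no valid recommendation, and that brand names are already normalized. The function name and sample brands are hypothetical.

```python
from collections import Counter
from typing import Optional

MIN_VALID_MODELS = 5  # quality filter: at least 5 models must name a brand


def agreement_share(top_brands: dict[str, Optional[str]]) -> Optional[float]:
    """Share of responding models that name the most common #1 brand.

    Returns None when the query fails the quality filter.
    """
    valid = [brand for brand in top_brands.values() if brand is not None]
    if len(valid) < MIN_VALID_MODELS:
        return None
    _, top_count = Counter(valid).most_common(1)[0]
    return top_count / len(valid)


# Hypothetical query: six models answered, three of them agree on "Acme".
example = {
    "ChatGPT": "Acme", "Claude": "Acme", "Gemini": "Acme",
    "Grok": "Borealis", "DeepSeek": "Zenith", "Perplexity": "Borealis",
    "Google AIO": None, "Meta AI": None,
}
share = agreement_share(example)
print(share)  # 0.5
```

Dividing by the number of responding models rather than by all eight is a design choice made for this sketch; the report does not say which denominator it uses.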

Independent research backs the picture. An analysis of 567,000 LLM recommendations found that different models hold their own stable brand preferences with low overlap. Separately, language models have been shown to systematically favor global brands over local ones, with a country-of-origin effect that compounds across models trained on different data.

## What it means in practice

Picture two companies in the same category. The first measures itself by one model — say ChatGPT, because that is what its team uses. It sees first place there, considers the job done, and feels safe. The second watches all eight systems at once. And it sees what the first cannot: while its brand leads in ChatGPT, the shortlist in Claude is forming without it, the recommendation in Gemini is going to a competitor, and a buyer who asks Perplexity gets an answer with no mention of the brand at all. A single blended "AI visibility" score would average those eight realities into one reassuring number — and hide the seven shortlists being assembled without the company in them.

That gap is where GolOps works. We measure a brand's position in the field of choice through the Choice Control Index (per system, not as one blended number), attribute the first place to specific scenarios and sources, and translate the measurement into a prioritized plan. The Strategic Pilot delivers the first cycle in 10–12 weeks; the Command Center keeps the observation loop running continuously across seven AI systems. The case comes down to plain arithmetic: at 43.3% average agreement and 20% average pairwise agreement, a first place on one system is, more often than not, not a first place on another. The stakes are rising: on Gartner's forecast, AI agents will intermediate 90% of B2B purchasing by 2028, and Semrush already reports AI-channel conversion running 4.4× higher than organic search. Every quarter without a measurement across all eight is seven shortlists assembled without you.

**Why the models diverge at all starts before the answer — on the layer where AI rewrites the user's query its own way:**

[**How AI rewrites your query before it searches**](/en/publications/ai-query-rewriting)

**And even where a brand makes the shortlist, holding the slot is a separate job:**

[**The half-life of AI citations. How fast you stop being cited**](/en/publications/ai-citation-half-life)

[Request an index diagnostic →](https://golops.io/en/position) · [Discuss a pilot →](https://golops.io/en/pilot)
