---
title: The anatomy of an AI citation. What makes a page worth citing
description: GolOps research on 1,465 AI-cited pages across 950 domains and 28,000+ citations. 68% carry schema markup, FAQ markup is associated with 45% more citations, light markup beats heavy. The measurable anatomy of a page a model pulls into its answer.
date: 2025-12-02T00:00:00Z
lastmod: 2026-06-02T00:00:00Z
published: true
categories: [research, llm]
author: golops
---

When a language model assembles an answer, it pulls fragments from a narrow set of pages. The choice is not random. Those pages share an anatomy, a measurable set of features that recur from one citation to the next. Advertising budgets do not move it.

GolOps broke that anatomy down into parts. We examined 1,465 pages across 950 domains that ChatGPT, Perplexity, and Gemini cite in live answers, a sample drawn from 28,000+ actual citations. For each page we extracted its schema markup, content structure, and technical metadata, then compared each feature against web averages from the HTTP Archive.

| Metric | Value |
|---|---|
| Cited pages analyzed | 1,465 |
| Domains in sample | 950 |
| Citations analyzed | 28,000+ |
| Share of pages with schema markup | 68% |

*This is a correlational picture. It describes what cited pages look like, not why a model chose them. Where the sample is small or a confound is present, we say so.*

## Key findings

**68%: the entry ticket, not the edge.** Roughly two-thirds of cited pages carry schema markup, against ~38.5% across the web. Structured data is the norm in the sample, so it is the cost of admission, not a lead over the pages next to you in the field.

**+45%: the FAQ markup effect.** Pages with FAQPage markup average 45% more citations than pages with no FAQ signal. It is the one markup type that, inside the sample, correlates with citation frequency.

**Light markup beats heavy.** Pages with a light schema implementation are cited more often than pages with bulky markup. Past a modest threshold, extra markup delivers diminishing, and eventually negative, returns.

**~2,290 words: the depth average.** A cited page is, on average, about three times as long as a typical web page. Substance outweighs any single formal feature.

## The FAQ markup effect

The strongest signal in the dataset. Pages with FAQPage markup average 36.9 citations against 25.4 for pages with no FAQ signal — a 45% gap.

| Page FAQ signal | Avg citations | Sample |
|---|---|---|
| FAQPage markup + FAQ content | 36.9 | n=23 |
| FAQ content only, no markup | 27.2 | n=161 |
| No FAQ signal | 25.4 | n=269 |

The middle position of pages with FAQ content but no markup suggests the markup adds signal beyond the format itself. The caveat is mandatory: only 23 pages carry FAQ markup, and those pages also tend to be substantially longer, which partly explains the lift. This is an early, actionable signal, not a proven cause. Within the sample, FAQPage is the only markup type that tracks with higher citation frequency. Other types appear more often on cited pages than on the web at large, but do not predict citation volume among cited pages.
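The gaps in the table can be recomputed directly. The sketch below is a minimal arithmetic check, assuming the published averages are used as-is; the dictionary keys and the `relative_lift` helper are illustrative names, not part of the GolOps dataset.

```python
# Average citations per page, copied from the FAQ signal table above.
avg_citations = {
    "faq_markup_and_content": 36.9,  # n=23
    "faq_content_only": 27.2,        # n=161
    "no_faq_signal": 25.4,           # n=269
}


def relative_lift(value: float, baseline: float) -> float:
    """Return how far value sits above baseline, as a percentage."""
    return (value / baseline - 1) * 100


baseline = avg_citations["no_faq_signal"]
markup_lift = relative_lift(avg_citations["faq_markup_and_content"], baseline)
content_lift = relative_lift(avg_citations["faq_content_only"], baseline)

print(f"FAQPage markup + FAQ content: +{markup_lift:.1f}%")
print(f"FAQ content only:             +{content_lift:.1f}%")
```

On these inputs the markup group sits about 45% above the baseline and the content-only group about 7% above it, which is the "middle position" described above.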

## What markup lives on cited pages

68% of cited pages carry schema markup, nearly double the web average (~38.5%, Web Almanac 2024). Schema markup is structured data that machines read directly ([Google's documentation](https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data)). Breaking it down by type and comparing each type against the web at large shows which formats AI puts to work disproportionately often.

| Markup type | Lift vs web | On cited / on web |
|---|---|---|
| Person | 9.4× | 18.9% / 2.0% |
| ImageObject | 8.9× | 21.4% / 2.4% |
| NewsArticle | 8.7× | 10.4% / 1.2% |
| SoftwareApplication | 8.0× | 2.4% / 0.3% |
| Service | 6.5× | 1.3% / 0.2% |
| BreadcrumbList | 5.2× | 37.7% / 7.3% |
| WebPage | 5.1× | 29.3% / 5.8% |
| BlogPosting | 4.8× | 8.1% / 1.7% |
| ItemList | 4.4× | 4.4% / 1.0% |
| WebSite | 4.3× | 33.0% / 7.7% |
| Organization | 4.1× | 31.5% / 7.6% |
| Article | 3.8× | 24.4% / 6.5% |

The standouts over the web are Person (author attribution), ImageObject, and NewsArticle, each appearing roughly nine times as often on cited pages as across the web. But over-representation describes the kind of page AI cites, not a direct causal effect. This lines up with independent data: a [Search Engine Land analysis](https://searchengineland.com/how-to-get-cited-by-ai-seo-insights-from-8000-ai-citations-455284) of 8,000 citations found that blogs and news dominate, while vendor product pages are pulled into answers less than 3% of the time. We broke down that gap between what AI crawls and what it cites in [**AI crawls your product pages. It cites your blog.**](/en/publications/page-type-citation-gap). Of all the types measured, only FAQPage tracks with higher citation frequency inside the sample.

## Light markup beats heavy

This is where the "more markup, more signal" intuition breaks. Pages with a light schema implementation are cited most. Past a modest threshold, each additional layer of structured data delivers diminishing returns, and in the top tiers, negative ones.

| Markup tier | Avg citations | Sample | Avg words |
|---|---|---|---|
| No markup | 24.1 | n=146 | 1,843.4 |
| Light | 30.5 | n=135 | 2,551.6 |
| Medium | 26.6 | n=89 | 2,310.0 |
| Rich | 24.8 | n=72 | 2,646.6 |
| Very rich | 23.7 | n=12 | 2,478.6 |

The light tier earns the most, at 30.5 citations per page on average. Focus beats thoroughness. The very rich tier rests on only 12 pages, so its figure is the least stable in the table. Whether light markup helps because the model prefers it or because it simply tends to sit on strong pages, we do not yet know. The practical takeaway holds either way: mark up the extractable answer and the basic structure, rather than piling on fields for completeness.
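To make "mark up the extractable answer" concrete, here is a minimal sketch that builds a small FAQPage JSON-LD block. The `@type` and property names follow the schema.org FAQPage vocabulary; the helper name `build_faq_jsonld` and the sample question and answer are hypothetical placeholders, and nothing here is GolOps tooling.

```python
import json


def build_faq_jsonld(qa_pairs):
    """Build a minimal schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in qa_pairs
        ],
    }


# Placeholder content: replace with the questions the page actually answers.
qa_pairs = [
    ("What does the product do?", "It monitors uptime for web applications."),
]

faq = build_faq_jsonld(qa_pairs)

# JSON-LD is embedded in the page inside a script tag of this type.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(faq, indent=2)
    + "\n</script>"
)
print(script_tag)
```

The block carries only the question, the answer, and the types needed to identify them. That is the "light" end of the range: the extractable answer is marked up and nothing is added for completeness.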

## The citation blueprint

Ten page features compared: the top 10% most-cited pages against the bottom 50%. The gaps are narrower than you might expect. These are internal properties of the page, not external signals like backlinks or domain authority.

| Feature | Top 10% | Bottom 50% | Delta |
|---|---|---|---|
| Has any schema | 80.0% | 65.6% | +14.4 pp |
| Article schema | 37.8% | 23.3% | +14.5 pp |
| FAQ schema | 11.1% | 5.3% | +5.8 pp |
| Person schema | 17.8% | 19.4% | −1.6 pp |
| Word count | 2,521.1 | 2,304.7 | +216.4 |
| Total headings | 33.7 | 31.4 | +2.3 |
| List items | 146.9 | 120.3 | +26.6 |
| Has tables | 40.0% | 28.2% | +11.8 pp |
| Has FAQ content | 42.2% | 38.3% | +3.9 pp |
| Has how-to content | 73.3% | 70.0% | +3.3 pp |

Schema presence separates the tiers most clearly: any schema (+14.4 points) and Article schema (+14.5 points), with FAQ schema a smaller step (+5.8 points). Tables are the one content feature with a comparable gap (+11.8 points). The other content-structure features (headings, FAQ patterns, instructional content) are almost identical between top and bottom pages. The baseline quality among cited pages is already high, so the gap tracks formal markup signals more than content depth on its own.

## The most-cited pages

The most-cited URLs in the sample. They show both the pattern and the exceptions: most carry focused markup, but several of the leaders carry none at all.

| Domain | Citations | Words | Markup types |
|---|---|---|---|
| softwarefinder.com | 218 | 2,937 | Corporation |
| rankmyagent.com | 174 | 1,461 | FAQPage · RealEstateAgent · ItemList |
| collegenet.com | 123 | 808 | WebPage · BreadcrumbList · VideoObject |
| dotcom-monitor.com | 111 | 5,785 | BreadcrumbList · Person · WebSite + 4 more |
| runnersworld.com | 82 | 3,834 | NewsArticle · ItemList |
| g-co.agency | 80 | 2,558 | None |
| iiba.org | 80 | 2,806 | None |
| milanote.com | 79 | 1,111 | HowTo |
| offers.hubspot.com | 75 | 553 | None |
| dash.dropbox.com | 75 | 1,474 | MobileApplication · SoftwareApplication · Organization + 2 more |
| nokia.com | 72 | 1,771 | BreadcrumbList |
| ehrinpractice.com | 72 | 1,832 | None |
| skyquestt.com | 71 | 2,993 | WebPage · ItemList |
| readycontacts.com | 70 | 1,857 | Person · Article |

Markup is a frequent feature of a leader, not a required one. Four of the pages listed above are cited 70 or more times each with no schema at all. That is the gap between "having markup" and "being chosen": the first is common among cited pages, the second is settled by substance and format.

## What this adds up to

Pulled into a single portrait of a cited page, the features give four conclusions.

**Schema is infrastructure, not an edge.** 68% of cited pages carry structured data, nearly double the web average. But most markup types do not predict citation volume; they describe the kind of page AI already takes. Having schema among cited pages is normal; it does not create the difference inside the field.

**FAQ markup is the exception, with caveats.** +45% in citations over pages with no FAQ signal. But the sample is small (n=23), and those pages run longer. A real association, not yet a proven cause.

**Focused markup beats comprehensive markup.** A light implementation (1–20 fields) earns the most. Heavy implementations show diminishing returns. Markup complexity does not help; content quality might.

**Content depth is the likely foundation.** A cited page averages 2,289.6 words — three times the typical web page. Between the top 10% and the bottom 50%, structural differences are modest. Substance outweighs any single formal signal.
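The 2,289.6-word figure can be cross-checked against the markup tier table. The sketch below assumes each tier's "Avg words" and "Avg citations" are means over that tier's stated n. Under that assumption, the page-weighted mean across the five tiers reproduces the figure, over the 454 pages that table covers.

```python
# (tier, avg citations, pages, avg words), copied from the markup tier table.
tiers = [
    ("No markup", 24.1, 146, 1843.4),
    ("Light",     30.5, 135, 2551.6),
    ("Medium",    26.6,  89, 2310.0),
    ("Rich",      24.8,  72, 2646.6),
    ("Very rich", 23.7,  12, 2478.6),
]

total_pages = sum(pages for _, _, pages, _ in tiers)

# Weight each tier's average by its page count to recover the overall mean.
mean_words = sum(pages * words for _, _, pages, words in tiers) / total_pages
mean_citations = sum(pages * cites for _, cites, pages, _ in tiers) / total_pages

print(f"Pages in tier table:     {total_pages}")
print(f"Mean words per page:     {mean_words:.1f}")
print(f"Mean citations per page: {mean_citations:.1f}")
```

That the weighted mean lands on 2,289.6 indicates the headline depth figure is an average rather than a median.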

These conclusions line up with what GolOps has already recorded in a wider measurement of the citation field: listicle headlines carry a lift of about 1.2×, comparison and instructional headlines around 1.1×, a year in the headline another ~1.1× as a freshness signal, a brand mention on the page up to 1.5×. The anatomy of a page and the format of its headline work as two layers of one manageable signal.

## Methodology

What underpins the numbers:

- **1,465 pages** — top-cited URLs from the GolOps observation system, selected from 28,000+ citations across 950 domains. Each URL was crawled live to extract JSON-LD markup, content characteristics (word count, headings, lists, tables, FAQ patterns), and technical metadata.
- **Web averages** — benchmarks from the HTTP Archive / Web Almanac 2024.
- **Sample profile** — skews toward B2B, SaaS, and DTC brands; findings are most accurate for those verticals.
- **Small FAQ sample (n=23)** — the FAQ markup finding remains an early, actionable signal rather than a proven cause. A larger sample will sharpen the estimate.
- **No non-cited control group** — we compare cited pages to web averages, not to comparable non-cited pages. Some of the differences may reflect page quality rather than features AI selects for.
- **Presence, not quality** — markup shares record whether structured data exists, not whether it is correctly implemented.
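For readers who want to run a similar presence check on their own pages, the JSON-LD extraction step can be sketched with the Python standard library alone. This is an illustrative approximation, not the GolOps crawler: the class and function names are hypothetical, it reads only `application/ld+json` script blocks (not Microdata or RDFa), and it counts nested types such as an author `Person` inside an `Article`.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Malformed blocks are skipped, not repaired.


def schema_types(html: str) -> set:
    """Return every @type value found anywhere in the page's JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    found = set()
    stack = list(parser.blocks)
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            node_type = node.get("@type")
            if isinstance(node_type, str):
                found.add(node_type)
            elif isinstance(node_type, list):
                found.update(t for t in node_type if isinstance(t, str))
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return found


sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "author": {"@type": "Person", "name": "Example Author"}}
</script>
</head><body><p>Body text.</p></body></html>
"""
print(sorted(schema_types(sample_html)))
```

As with the figures above, this records whether a type is present, not whether the markup is valid or complete.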

## What to do with this

If you do only one thing, make the content deeper. Of all the measurable signals, content depth (an average of ~2,290 words, about three times a typical page) outweighs any single formal marker. Schema markup and an FAQ block help at the margin: they add signal, but they do not substitute for the substance of the page and do not create the edge on their own. The anatomy of a page is knowable and fixable, but only once you measure which of your pages the models actually pull into their answers. Without that measurement the fixes run blind: you pile on markup fields where the text decides the outcome, or rewrite text where markup is the gap.

This is the measurement GolOps takes under management. We fix a company's position in the field of choice through the Choice Control Index, attribute it to the specific pages and signals that shape it, and translate that into a prioritized fix plan: the Strategic Pilot closes the first cycle in 10–12 weeks, and the Command Center keeps the loop running continuously across seven AI systems. The cost of delay is countable. Gartner forecasts 90% of B2B procurement under autonomous AI agents by 2028, and Semrush already shows AI-channel conversion running 4.4× higher than organic search: every quarter without a managed anatomy of citation is a quarter of answers assembled without your page inside them.

**Page anatomy is just one layer of the manageable signal. Adjacent measurements:**

[**The llms.txt Effect: 37,894 Domains, Zero Citation Advantage**](/en/publications/llms-txt-effect)

[**The half-life of AI citations. How fast you stop being cited**](/en/publications/ai-citation-half-life)

[Request an index diagnostic →](https://golops.io/en/position) · [Discuss a pilot →](https://golops.io/en/pilot)
