The anatomy of a machine readable page

Creating truly machine readable content goes beyond adding meta tags and schema markup. It requires a thoughtful approach to structure, semantics, and metadata that helps AI systems understand not just what your content says, but what it means.

This post breaks down the key elements of a truly machine readable page and shows how they work together. We’ll look at practical examples and compare poorly structured content with its optimised counterpart.

What makes content machine readable?

Machine readability comes from four key elements working together:

  • Technical accessibility: Ensuring AI systems can access and crawl your content
  • Semantic structure: Using HTML elements that describe their content’s meaning
  • Metadata: Providing context about your content
  • Structured data: Adding machine readable annotations

Let’s look at each element in detail.

Technical accessibility fundamentals

Before focusing on structure and semantics, ensure AI systems can actually access your content. Check your robots.txt file (found at yourdomain.com/robots.txt) for rules that block common AI crawlers such as GPTBot (OpenAI’s crawler) or CCBot (Common Crawl’s crawler):

# Option 1: allowing AI crawlers (recommended)
User-agent: GPTBot
Allow: /

User-agent: CCBot
Allow: /

# Option 2: blocking an AI crawler (not recommended if you want AI visibility)
# Use this instead of, not alongside, the Allow group for the same crawler
User-agent: GPTBot
Disallow: /
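
To check rules like these without reading them by eye, a short script can help. The following is a minimal sketch using Python’s standard urllib.robotparser module; the function name crawler_access, the SAMPLE rules, and the example.com URL are illustrative assumptions rather than part of any particular site’s setup.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the robots.txt examples in this post.
AI_CRAWLERS = ("GPTBot", "CCBot")


def crawler_access(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Report whether each AI crawler may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


# Hypothetical rules: GPTBot is blocked, CCBot is allowed.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Allow: /
"""

print(crawler_access(SAMPLE))  # {'GPTBot': False, 'CCBot': True}
```

To test a live site instead, RobotFileParser also offers set_url() and read(), which fetch the file over the network.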

Also check for other obstacles that keep crawlers from reaching or correctly indexing your content:

  • JavaScript-only navigation that crawlers can’t follow
  • Login walls or paywalls
  • Broken canonical tags
  • Server errors or extremely slow loading times

Semantic structure fundamentals

Semantic HTML uses elements that describe their content’s meaning rather than just its appearance. Consider these two approaches to marking up a product name:

<!-- Poor structure: meaningless div -->
<div class="product-name">Acme Widget Pro</div>

<!-- Good structure: semantic meaning -->
<h1 class="product-name" itemprop="name">Acme Widget Pro</h1>

The second example tells machines this is a primary heading. The first tells them only that it’s a generic container. The itemprop attribute is part of HTML’s microdata specification and names the property an element represents. It only takes effect inside an ancestor element that declares itemscope (usually with an itemtype such as https://schema.org/Product), so the heading is read as a product name only once that wrapper is in place. You can learn more about microdata attributes like itemprop, itemscope, and itemtype in the MDN Web Docs.

Key semantic elements to prioritise:

  • <article> for complete, self-contained content
  • <section> for thematic grouping of content
  • <nav> for navigation blocks
  • <aside> for tangentially related content
  • <header> and <footer> for introductory and closing content
  • <main> for the primary content area
  • Heading elements (<h1> through <h6>) in a logical hierarchy
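
As a rough way to see how much of a page relies on these elements, the sketch below counts semantic tags against generic <div> containers using Python’s standard html.parser module. The names TagCounter and semantic_summary are hypothetical, and the raw counts are only a starting point for an audit, not a quality score.

```python
from collections import Counter
from html.parser import HTMLParser

# The semantic elements listed above, plus the six heading levels.
SEMANTIC_TAGS = {
    "article", "section", "nav", "aside", "header", "footer", "main",
    "h1", "h2", "h3", "h4", "h5", "h6",
}


class TagCounter(HTMLParser):
    """Counts every opening tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def semantic_summary(html: str) -> dict:
    """Return how many semantic elements and generic <div>s the markup uses."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    semantic = sum(n for tag, n in counter.counts.items() if tag in SEMANTIC_TAGS)
    return {"semantic": semantic, "generic": counter.counts["div"]}


# Hypothetical div-only markup for illustration.
POOR = """
<div>
  <div class="title">How to choose a widget</div>
  <div class="content"><p>Text about widgets...</p></div>
</div>
"""

print(semantic_summary(POOR))  # {'semantic': 0, 'generic': 3}
```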

Essential metadata

Metadata provides context about your content. Here’s a minimal but effective example:

<head>
  <!-- Standard metadata for search engines -->
  <title>Product name: A clear, specific title</title>
  <meta name="description" content="A precise description of this specific page">
  
  <!-- Open Graph metadata for social sharing and AI systems -->
  <meta property="og:type" content="article">
  <meta property="og:title" content="Same as your main title">
  <meta property="og:description" content="Same as your meta description">
  
  <!-- Canonical URL to specify the primary version (always include, even when pointing to current URL) -->
  <link rel="canonical" href="https://example.com/current-page">
</head>

The og: prefix indicates Open Graph metadata, originally created by Facebook for rich social sharing and now also available to other systems that parse pages, AI systems among them, as an additional source of content context. You can use different text in Open Graph tags than in standard meta tags (for example, to optimise social sharing), but do so only when the variation serves a clear purpose for your audience.

Keep metadata:

  • Specific to the current page
  • Consistent across different meta tags (unless you have a specific reason to vary them)
  • Clear and precise
  • Free of keyword stuffing

Structured data implementation

Structured data adds machine readable annotations to your content using schema.org, a collaborative vocabulary for structured data founded by Google, Microsoft, Yahoo, and Yandex. JSON-LD (JavaScript Object Notation for Linked Data) is the format Google recommends for implementing these schemas:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Your article title",
  "description": "Your article description",
  "author": {
    "@type": "Person",
    "name": "Author Name"
  },
  "datePublished": "2025-08-19",
  "publisher": {
    "@type": "Organization",
    "name": "Your Site Name",
    "logo": {
      "@type": "ImageObject",
      "url": "https://example.com/logo.png"
    }
  }
}
</script>

Choose the most specific schema type for your content and include every property that the platform consuming the markup documents as required for that type.
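
A quick script can catch missing properties before you reach for a full validator. The sketch below pulls JSON-LD blocks out of a page and compares them with a checklist. The EXPECTED_PROPERTIES table is an assumption made for illustration, not an official list of required properties, and the code only handles a single string @type per block.

```python
import json
from html.parser import HTMLParser

# Hypothetical checklist for illustration; consult the documentation of the
# platform you are targeting for the properties it actually requires.
EXPECTED_PROPERTIES = {"Article": ("headline", "author", "datePublished")}


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of application/ld+json script blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._buffer = None


def missing_properties(html: str) -> dict:
    """Map each block's @type to the checklist properties it lacks."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    report = {}
    for block in extractor.blocks:
        schema_type = block.get("@type") if isinstance(block, dict) else None
        if not isinstance(schema_type, str):
            continue  # Simplification: skip arrays and multi-typed blocks.
        expected = EXPECTED_PROPERTIES.get(schema_type, ())
        report[schema_type] = [name for name in expected if name not in block]
    return report
```

A block that declares only a headline would be reported as {"Article": ["author", "datePublished"]}. This complements, rather than replaces, a dedicated structured data validator.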

A practical example

Let’s look at a complete example comparing poor and optimised structure. First, the problematic version:

<div>
  <div class="title">How to choose a widget</div>
  <div class="author">By Jane Smith</div>
  <div class="content">
    <div>Introduction</div>
    <p>Text about widgets...</p>
    <div>Features to consider</div>
    <p>More text about widgets...</p>
  </div>
</div>

Now the machine readable version:

<article itemscope itemtype="https://schema.org/Article">
  <header>
    <h1 itemprop="headline">How to choose a widget</h1>
    <p>By <span itemprop="author" itemscope itemtype="https://schema.org/Person">
      <span itemprop="name">Jane Smith</span></span>
    </p>
  </header>
  
  <!-- <main> cannot be nested inside <article>, so a <div> carries articleBody -->
  <div itemprop="articleBody">
    <section>
      <h2>Introduction</h2>
      <p>Text about widgets...</p>
    </section>
    
    <section>
      <h2>Features to consider</h2>
      <p>More text about widgets...</p>
    </section>
  </div>
</article>

The optimised version:

  • Uses semantic HTML elements
  • Implements schema.org markup
  • Maintains a logical heading hierarchy
  • Groups related content in sections

Common mistakes to avoid

  1. Meaningless containers: Using <div> when a semantic element would be more appropriate
  2. Poor heading structure: Skipping heading levels or using them for styling
  3. Inconsistent metadata: Different titles and descriptions across various meta tags
  4. Generic schema markup: Using broad types like Thing when more specific ones exist
  5. Redundant structure: Wrapping semantic elements in unnecessary containers
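
The heading structure mistake is one of the easier ones to detect automatically. This minimal sketch, using Python’s standard html.parser, lists every place where a heading jumps more than one level deeper than the heading before it. HeadingCollector and skipped_levels are hypothetical names, and the check cannot tell whether a heading was chosen for styling rather than structure.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Records the level of each <h1>..<h6> in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def skipped_levels(html: str) -> list:
    """Return (previous_level, level) pairs where a level was skipped.

    A previous_level of 0 means the page's first heading was not an <h1>.
    """
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    problems = []
    previous = 0
    for level in collector.levels:
        if level > previous + 1:
            problems.append((previous, level))
        previous = level
    return problems


print(skipped_levels("<h1>Guide</h1><h3>Details</h3><h2>Summary</h2>"))  # [(1, 3)]
```

Moving back up the hierarchy (for example from <h3> to <h2>) is not flagged, since closing a subsection is a normal part of a logical outline.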

Next steps for your content

Start with these practical steps:

  1. Verify AI crawler access in your robots.txt
  2. Audit your current HTML structure
  3. Replace generic containers with semantic elements
  4. Break content into self-contained sections
  5. Implement relevant schema.org types
  6. Validate your structured data
  7. Test with tools that show how machines see your content (we’ll cover these tools in detail in an upcoming post)

Each section of your content should make sense on its own, without requiring context from other sections. This helps AI systems extract and understand specific parts of your content without losing meaning.

Remember that machine readability isn’t just about technical implementation. It’s about creating clear, logical structure that both humans and machines can understand.

Conclusion

A truly machine readable page combines crawler access, semantic HTML, relevant metadata, and structured data to help AI systems understand your content. Focus on meaningful structure and clear relationships between elements. The goal is to make your content not just parseable, but truly comprehensible to machines.