
What are document N-grams?

Written by

IV

Creator

Published on

4/12/2025

In Semantic SEO, document N-grams are sequences of N items (typically words) found within a document. The sequences are usually contiguous, although skip-grams allow gaps between the words. Here are several key aspects of document N-grams:

  • Identification of patterns and phrases: N-gram analysis involves processing text to identify how frequently different sequences of words appear. This helps in understanding the common phrases and linguistic structures present in a document.
  • Different types of N-grams: The value of 'N' determines the type of n-gram. Bigrams consist of two consecutive words, trigrams of three, four-grams of four, and so on. Skip-grams are non-contiguous sequences in which some words are skipped (e.g., 1-skip bigrams allow one skipped word between the pair, 2-skip bigrams allow two).
  • Understanding document topic and context: Site-wide n-grams, which appear on every web page of a source, are particularly helpful for search engines to locate the main topic and macro context of the entire website. Analysing the consistent appearance of certain target words across a document can help understand its overall character.
  • Semantic SEO and ranking: Unique phrase sequences or unique n-grams containing original information can convey authority on a topic. By providing unique n-grams, particularly within supplementary content, a website can be perceived as an authority by search engines. Using lexical relationships, like hyponyms, can aid in creating these unique n-grams.
  • Query semantics: Understanding the n-grams within documents is related to query semantics, which focuses on the meaning and relevance of search terms. Search engines use n-gram analysis, along with other Natural Language Processing (NLP) techniques, to understand the relationship between queries and documents, focusing on context rather than just string matching.
  • Tools for analysis: Tools like Oncrawl offer features such as "N-gram Analysis as site-wide" to help analyse these word sequences within a website.
  • Sequence modelling: Sequence modelling, the backbone of Semantic SEO, involves understanding how likely words are to appear together. N-gram analysis contributes to this by revealing common word sequences in documents.

In essence, document N-grams provide a way to analyse the composition of text at a multi-word level, offering insights into the content's themes, linguistic patterns, and its potential relevance to user queries for search engines.
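
The ideas in the list above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not how any search engine or the Oncrawl feature actually computes n-grams. The function names (`tokenize`, `ngrams`, `skip_bigrams`, `site_wide_ngrams`, `bigram_probability`), the simple regex tokeniser, and the sample page texts are all assumptions made for the example. A k-skip bigram is taken here to mean an ordered pair of words with at most k words skipped between them.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical, simplistic tokeniser)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return contiguous n-grams as tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, k):
    """Return k-skip bigrams: ordered pairs with at most k words skipped between them."""
    pairs = []
    for i, first in enumerate(tokens):
        # The partner word may sit 1 to k + 1 positions after `first`.
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs


def site_wide_ngrams(pages, n):
    """Return the n-grams that occur on every page in `pages`."""
    per_page = [set(ngrams(tokenize(page), n)) for page in pages]
    return set.intersection(*per_page) if per_page else set()


def bigram_probability(tokens, first, second):
    """Estimate P(second | first) from raw bigram counts, with no smoothing."""
    pair_counts = Counter(ngrams(tokens, 2))
    first_count = sum(c for (a, _), c in pair_counts.items() if a == first)
    return pair_counts[(first, second)] / first_count if first_count else 0.0


# Made-up sample pages, for illustration only.
pages = [
    "Acme SEO guide to topical authority",
    "Contact the Acme SEO guide team",
]
tokens = tokenize(pages[0])

print(Counter(ngrams(tokens, 2)).most_common(3))  # most frequent bigrams on one page
print(skip_bigrams(tokens, 1))                    # 1-skip bigrams on one page
print(site_wide_ngrams(pages, 2))                 # bigrams shared by every page
print(bigram_probability(tokens, "acme", "seo"))  # how often "seo" follows "acme"
```

The probability estimate uses raw counts only. A real analysis would normally also deal with stop words, punctuation, and smoothing for unseen sequences, all of which are left out here to keep the sketch short.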
