Data & Analytics

N-Gram

A sequence of n consecutive units (words, characters, etc.) extracted from text. A foundational technique in natural language processing.

N-gram Natural language processing NLP Text analysis Language model
Created: December 19, 2025 Updated: April 2, 2026

What is N-Gram?

An n-gram is a sequence of n consecutive units (words or characters) extracted from text. For example, from “natural language processing,” the unigrams (1-word) are “natural,” “language,” “processing”; the bigrams (2-word) are “natural language” and “language processing.”
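The extraction above can be sketched in a few lines of plain Python (no external libraries; the function name `ngrams` is our own choice here):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "natural language processing".split()
print(ngrams(words, 1))  # unigrams: [('natural',), ('language',), ('processing',)]
print(ngrams(words, 2))  # bigrams: [('natural', 'language'), ('language', 'processing')]
```

The same function works for character n-grams by passing a string instead of a word list.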

In a nutshell: “Split sentences into small consecutive chunks and search for patterns.”

Key points:

  • What it does: Divide text into small units and analyze word/character connection patterns
  • Why it matters: Capture text’s semantic structure concisely for language prediction and classification
  • Who uses it: NLP engineers, search engine companies, text analysis specialists

Why it matters

N-gram is one of natural language processing’s most basic and powerful techniques. When a spell checker suggests a correction for a typo, or a smartphone keyboard proposes the next word as you type, n-gram-based language models often drive these features.

Text classification (such as spam detection) and machine translation also use n-grams. Because the technique is simple, effective, and computationally cheap, it remains widely used today.

How it works

N-gram relies on statistical probability models.

Basic concept: the model learns from existing text which word is likely to follow a given word. If “world” follows “hello” with probability 0.95, the system strongly suggests “world” after “hello.”

Probability calculation: count each n-gram’s occurrences in a corpus (a large text collection). The bigram probability of the current word given the previous word = “times (previous word + current word) appears” ÷ “times (previous word) appears.”
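This count-based estimate can be computed directly from a toy corpus (the corpus text here is an invented example):

```python
from collections import Counter

def bigram_prob(tokens, prev, cur):
    """P(cur | prev) = count(prev, cur) / count(prev), estimated from tokens."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

corpus = "the cat sat on the mat the cat ran".split()
# "the" appears 3 times; the pair ("the", "cat") appears 2 times
print(bigram_prob(corpus, "the", "cat"))  # 2/3 ≈ 0.667
```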

Smoothing techniques: n-grams never seen in training would otherwise get probability zero, so smoothing adjusts the counts, letting the model handle previously unseen text.
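One of the simplest smoothing methods is add-one (Laplace) smoothing, which adds 1 to every count so no bigram gets probability zero. A minimal sketch, using the same invented corpus style as above:

```python
from collections import Counter

def laplace_bigram_prob(tokens, prev, cur, vocab_size):
    """Add-one (Laplace) smoothed bigram probability:
    P(cur | prev) = (count(prev, cur) + 1) / (count(prev) + V)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + vocab_size)

corpus = "the cat sat on the mat the cat ran".split()
V = len(set(corpus))  # vocabulary size: 6 distinct words
# the unseen bigram ("the", "sat") no longer gets probability zero
print(laplace_bigram_prob(corpus, "the", "sat", V))  # (0 + 1) / (3 + 6) ≈ 0.111
```

More sophisticated schemes (Good-Turing, Kneser-Ney) follow the same idea of reallocating probability mass to unseen events.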

Libraries like NLTK and spaCy simplify n-gram extraction and probability calculation.

Real-world use cases

Predictive text and autocomplete

When composing an email, typing “thank” prompts the suggestion “you,” driven by a bigram model.
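A bigram-based suggester reduces to “look up the most frequent follower of the last word typed.” A minimal sketch (the training sentence is an invented example):

```python
from collections import Counter, defaultdict

def build_suggester(tokens):
    """Map each word to its most frequent follower, as an autocomplete sketch."""
    followers = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        followers[prev][cur] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in followers.items()}

corpus = "thank you very much thank you for reading".split()
suggest = build_suggester(corpus)
print(suggest["thank"])  # "you"
```

A production system would use far larger corpora and fall back to unigram frequencies for unknown words.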

Spell checking

Typing “teh” triggers a suggestion of “the,” the high-frequency correct form.
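One way character n-grams support this: score candidate spellings by how common their character bigrams are in real text, so misspellings with rare letter pairs score low. A rough sketch with a hypothetical tiny corpus (a real checker would use large frequency tables and edit-distance candidates):

```python
from collections import Counter

def char_ngrams(word, n):
    """Character n-grams of a word, e.g. 'the' -> ['th', 'he'] for n=2."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Hypothetical tiny corpus for illustration only.
corpus = "the theme then there other weather".split()
bigram_freq = Counter(bg for w in corpus for bg in char_ngrams(w, 2))

def score(word):
    """Sum of character-bigram frequencies; common spellings score higher."""
    return sum(bigram_freq[bg] for bg in char_ngrams(word, 2))

print(score("the"), score("teh"))  # "the" scores much higher than "teh"
```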

Machine translation

When translation candidates exist, the target language’s n-gram model selects the most natural expression.
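The candidate-ranking step can be sketched by scoring each candidate with the product of its bigram relative frequencies under a target-language corpus (the corpus and candidates here are invented for illustration):

```python
from collections import Counter

def sentence_score(tokens, unigrams, bigrams):
    """Product of bigram relative frequencies; higher means more natural."""
    score = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        score *= bigrams[(prev, cur)] / max(unigrams[prev], 1)
    return score

# Hypothetical target-language corpus.
corpus = "i drink strong coffee every day i drink strong tea".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))

candidates = [["strong", "coffee"], ["powerful", "coffee"]]
best = max(candidates, key=lambda c: sentence_score(c, uni, bi))
print(best)  # ['strong', 'coffee'] — this collocation appears in the corpus
```

Real systems compute this in log space to avoid underflow on long sentences.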

Benefits and considerations

Benefits: Simple to implement, low computational cost, and effective at capturing basic language patterns. It can often be trained on limited data.

Considerations: Cannot capture long-range context (increasing n worsens data sparsity) and lacks deep semantic understanding. Transformer-based neural networks have largely replaced n-grams in state-of-the-art systems, though n-grams still serve well for simple tasks and small systems.

Frequently asked questions

Q: What’s the difference between bigram and trigram? A: Bigram examines 2-word connections; trigram examines 3-word connections. Larger n provides richer context but needs more data.

Q: Is n-gram sufficient for spell checking? A: N-grams work effectively, but complex contexts benefit from more advanced techniques such as large language models (LLMs).

Related Terms

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a groundbreaking natural language ...
