What Is a Large Language Model? A First-Principles Explanation
Large language models are at the centre of the current AI wave, but the explanations on offer range from too technical to too vague. This explainer builds from scratch — what they are, how they work, and what they genuinely cannot do.
8 April 2025
TL;DR
- A large language model is a neural network trained to predict the most likely next token given a sequence of preceding tokens
- 'Large' refers to both the number of model parameters (weights) and the amount of text used in training — both measured in the billions to trillions
- LLMs do not retrieve facts from a database: they encode statistical patterns from training data into weights, which is why they can be confidently wrong
- Capabilities like translation, summarisation, code generation, and reasoning emerge from the scale of training rather than being explicitly programmed
- Understanding what LLMs are — and are not — is necessary context for using them well and evaluating their outputs critically
A large language model (LLM) is a type of artificial neural network trained on large quantities of text to predict what word, or more precisely what token, comes next in a sequence. That is the core of it. Everything else — the apparent reasoning, the knowledge, the ability to write code or translate languages — emerges from doing this prediction task at extraordinary scale. Understanding the mechanism helps explain both what LLMs are genuinely capable of and where their well-documented failures come from.
The Core Task: Next-Token Prediction
During training, an LLM is shown enormous amounts of text and repeatedly asked to predict the next token given the tokens before it. A token is roughly a word or word-fragment — modern tokenisers split text into subword units that balance vocabulary size with coverage. The model adjusts its internal parameters (weights) to improve its predictions, minimising the difference between what it predicted and what the text actually contained.
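To make subword tokenisation concrete, here is a minimal sketch of splitting a word against a fixed vocabulary by greedy longest-match. The vocabulary is invented for illustration; production tokenisers (such as BPE variants) learn their vocabularies from data and use more sophisticated merge rules.

```python
def tokenise(word, vocab):
    """Split a word into subword tokens by greedy longest-match."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character if nothing in the vocab matches.
            tokens.append(word[i])
            i += 1
    return tokens

# Toy vocabulary, invented for this example.
vocab = {"un", "believ", "able", "token", "isation"}
print(tokenise("unbelievable", vocab))  # ['un', 'believ', 'able']
print(tokenise("tokenisation", vocab))  # ['token', 'isation']
```

The point of subword units is coverage: any string can be tokenised (worst case, character by character), while common fragments get their own tokens, keeping the vocabulary to a manageable size.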
After training on hundreds of billions to trillions of tokens, the model has developed internal representations of statistical patterns in language: which words tend to follow which, how syntactic structures work, which concepts appear in which contexts, what facts are commonly stated. These representations are not a database of facts you can query — they are distributed across billions of numbers (weights) in a way that is not directly human-readable.
What 'Large' Actually Means
The 'large' in LLM refers to two things simultaneously. First, parameter count: modern frontier models have hundreds of billions of trainable parameters — the numbers that are adjusted during training. Second, training data scale: models are trained on text corpora measured in trillions of tokens, typically scraped from the web, books, code repositories, and other sources.
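Back-of-the-envelope arithmetic makes the parameter-count side of this tangible. The 70-billion figure below is illustrative rather than a reference to any specific model:

```python
# Rough memory footprint of a hypothetical 70B-parameter model.
params = 70e9                 # illustrative parameter count
bytes_per_param = 2           # 16-bit floats, a common inference precision
memory_gb = params * bytes_per_param / 1e9
print(f"{memory_gb:.0f} GB just to hold the weights")  # 140 GB
```

Every one of those parameters is a number adjusted during training; together they are the only place the model's 'knowledge' lives.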
Scale matters because capability tends to emerge at scale in a way that is difficult to predict from smaller models. Capabilities that are not present in a 1B parameter model may appear reliably at 10B or 70B. This emergent behaviour is not fully understood theoretically, which is part of why AI development remains empirically driven — researchers often discover what models can do by building and testing them.
Why LLMs Can Be Confidently Wrong
LLMs do not have access to a verified database of facts. They encode statistical patterns from their training data — which means they encode the biases, errors, and omissions in that data too. When asked a factual question, a model produces the sequence of tokens that, based on its training, is most likely to be the correct answer. Most of the time this is right. When it is wrong, the model has no mechanism to detect that it is wrong — it produces a confident-sounding answer because that is what its training optimised for.
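The mechanism behind this can be sketched in a few lines. A model's final step converts raw scores (logits) over the vocabulary into a probability distribution via softmax, then emits a likely token. The token names and scores below are invented for illustration; the key observation is that nothing in this step consults a source of truth.

```python
import math

def softmax(logits):
    """Convert raw per-token scores into a probability distribution."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token following "The capital of Australia is".
# Which candidate scores highest depends entirely on training statistics.
logits = {"Canberra": 4.1, "Sydney": 3.8, "Melbourne": 2.0}
probs = softmax(logits)
best = max(probs, key=probs.get)
print(best)  # the single most likely token, emitted with full fluency
```

Whether the training data happened to favour the right answer or a common misconception, the output looks the same: a fluent continuation with no attached signal of uncertainty.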
This phenomenon — producing plausible but incorrect outputs with apparent confidence — is often called hallucination. It is not a bug to be patched; it is a consequence of the architecture. LLMs are trained to produce likely text, not to verify truth.
What Emerges From Scale
Despite this fundamental limitation, LLMs have demonstrated capabilities that were not anticipated when the training objective was designed. They can translate between languages without explicit translation training. They can write and debug code. They follow complex instructions and apply reasoning across multi-step problems. They summarise, classify, extract structured information, and produce creative writing.
These capabilities are best understood as statistical generalisations learned from training data in which humans performed these tasks. The model has seen enough examples of translation, code, and reasoning that it can produce outputs that pattern-match to correct examples — reliably enough to be useful, though not reliably enough to be trusted without review.
Key Takeaways
- An LLM is a neural network trained to predict the next token — all other capabilities emerge from this task at scale
- Model parameters encode statistical patterns from training data, not a queryable database of verified facts
- Scale (parameter count and training data volume) unlocks capabilities that do not exist in smaller models, often unpredictably
- Confident-sounding incorrect outputs are a structural property of the architecture, not a fixable bug
- Capabilities like coding, translation, and reasoning are statistical generalisations from training data, which is why they are useful but not fully reliable