There’s something that confused many people (including yours truly) when they first learned about large language models. LLMs are literally called language models, so how do they generate all other kinds of information?
All that LLMs do, at a basic level, is predict the next word. Yep, that’s it; that’s the whole thing. It’s easy to see how this works for natural language and even code, but LLMs also generate pictures, videos, audio, and 3D objects. You can use LLMs to build websites and apps end-to-end, from code to content and multimedia. But why exactly is this possible?
Let’s answer this question by taking a quick look at what’s under the hood.
Large LEGO models: everything is tokens
The key point we should all understand is that the concept of “language” differs between humans and AIs. For an AI model, it’s all about tokens, and, practically speaking, they’re less like words or phrases and more like… LEGOs. Let’s briefly talk about why that’s the case.
When playing with LEGO, you can use the same kinds of building blocks to build a castle, a space shuttle, or a dinosaur. Tokens are a lot like that: they are AI’s building blocks that can represent anything. A token can be a whole word, like “cat,” or a part of a word, like “in-” or “-ing.” Now, here’s the important part: a token can also represent visual or audible information, such as a patch of an image or a musical note.
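To make the building-block idea concrete, here’s a toy sketch of how a word gets chopped into subword tokens. Real tokenizers (like BPE) learn their vocabulary from huge amounts of data; the tiny hard-coded vocabulary below is purely hypothetical and exists only to show the mechanics.

```python
# Toy illustration: a tiny, made-up vocabulary of subword "building blocks".
# Real tokenizers learn tens of thousands of these pieces from data.
VOCAB = ["play", "ing", "ed", "s", "walk", "cat"]

def tokenize(word: str) -> list[str]:
    """Greedily split a word into the longest known pieces."""
    tokens = []
    while word:
        for piece in sorted(VOCAB, key=len, reverse=True):
            if word.startswith(piece):
                tokens.append(piece)
                word = word[len(piece):]
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(word[0])
            word = word[1:]
    return tokens

print(tokenize("playing"))  # ['play', 'ing']
print(tokenize("cats"))     # ['cat', 's']
```

Notice how “playing” and “cats” are built from reusable pieces, the same way two different LEGO models reuse the same bricks.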
Let’s get a little nerdy. Each token has a mathematical representation of its meaning and context, which looks like a string of numbers. For example, the word “king” might become something like [0.2, 0.8, 0.1, …] with hundreds of numbers for various dimensions and relationships. With tokens encoded in this manner, AI can do math with meaning. It can literally calculate that “king” – “man” + “woman” ≈ “queen” because the numbers encode relationships, not just definitions.
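The famous “king” example can be sketched in a few lines. The three-number vectors below are invented for illustration (real embeddings have hundreds of learned dimensions), but the arithmetic is exactly the kind the model performs.

```python
# Toy "embeddings" along made-up dimensions (royalty, masculinity, femininity).
# Real models learn hundreds of dimensions from data; these values are illustrative.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.9],
}

def add(a, b):  return [x + y for x, y in zip(a, b)]
def sub(a, b):  return [x - y for x, y in zip(a, b)]
def dist(a, b): return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# "king" - "man" + "woman" lands closest to... which word?
target = add(sub(emb["king"], emb["man"]), emb["woman"])
nearest = min(emb, key=lambda w: dist(emb[w], target))
print(nearest)  # queen
```

The result vector doesn’t match “queen” exactly; it’s simply *nearest* to it, which is why the equation in the text uses ≈ rather than =.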

Technically, the model doesn’t care what the token represents. What it cares about is how tokens relate to one another and the patterns that emerge from these relationships.
You can also think of it this way: everything on a computer is ones and zeros. Text, images, music, and videos are all binary at their core. LLMs operate one level above that. They’re pattern-matching machines that can learn relationships between any sequences of tokens.
Everything is language
One of the biggest recent breakthroughs in GenAI occurred when researchers realized that models can process images using language. More specifically, if you can describe images with words, you can teach an AI model to translate between the two.
A milestone model that emerged from this realization was CLIP, and it became a game-changer. Released in 2021, CLIP was trained on millions of images paired with natural-language captions. As a result, it could establish the connection between diverse data points, such as:
- The pixels that make up a photo of a dog
- The word “dog”
- The sentence “a golden retriever playing in a park”
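Here’s a toy sketch of the core trick behind CLIP: images and captions are embedded into a shared space, and the best-matching caption is the one whose vector points in the most similar direction (highest cosine similarity). The embeddings below are invented; in the real model, two learned encoders produce them from pixels and words.

```python
import math

def cosine(a, b):
    """Cosine similarity: how closely two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented embeddings in a shared image-text space (CLIP-style).
image_emb = {"photo_of_dog": [0.9, 0.1, 0.2]}
text_emb = {
    "a golden retriever playing in a park": [0.8, 0.2, 0.1],
    "a sunset over mountains":              [0.1, 0.9, 0.3],
}

best = max(text_emb, key=lambda t: cosine(image_emb["photo_of_dog"], text_emb[t]))
print(best)  # the dog caption scores highest
```

Because the dog photo and the dog caption sit close together in this shared space, the model can “connect” pixels and words without ever being told explicit rules about either.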
Whenever you ask an AI to “draw a sunset over mountains,” the model isn’t magically learning to paint. It’s doing something much weirder and more interesting:
- It understands your text description (i.e., the language model doing its thing).
- It knows what visual patterns are associated with those concepts (thanks to training on image-text pairs).
- It generates tokens that represent those visual patterns.
- Those tokens get decoded back into pixels.
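The four steps above can be sketched as a data-flow diagram in code. Every function name here is a hypothetical stub invented to show how the pieces connect; real systems replace each stub with a large learned model, not arithmetic.

```python
# Hypothetical stubs sketching the four steps; the function bodies are
# placeholders that only demonstrate the shape of the pipeline.

def encode_text(prompt: str) -> list[int]:
    """Step 1: the language side turns the prompt into text tokens."""
    return [hash(w) % 1000 for w in prompt.split()]

def generate_visual_tokens(text_tokens: list[int]) -> list[int]:
    """Steps 2-3: predict visual tokens conditioned on the text tokens
    (a real model does this with a trained transformer)."""
    return [(t * 7) % 256 for t in text_tokens]

def decode_to_pixels(visual_tokens: list[int]) -> list[tuple]:
    """Step 4: a decoder turns visual tokens back into pixel values."""
    return [(v, 255 - v, v // 2) for v in visual_tokens]

pixels = decode_to_pixels(generate_visual_tokens(encode_text("a sunset over mountains")))
print(len(pixels))  # one RGB triple per visual token in this toy sketch
```

The key takeaway: at no point does the model “paint.” It predicts tokens, and a decoder translates those tokens into pixels at the very end.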
From LLMs to LMMs: the multimodal revolution
Fast forward to 2026, and the AI landscape is becoming increasingly multimodal. Large multimodal models (LMMs) build on LLMs primarily by incorporating specialized encoders and layers that map non-text data (such as images or audio) into the same numerical “embedding space” that the LLM backbone understands.
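A minimal sketch of that “mapping into the embedding space” idea: a projection matrix converts a vision encoder’s features into vectors of the same shape as the LLM’s text embeddings. The dimensions and matrix values below are made up for illustration; in an actual LMM, the projection is learned during training.

```python
# Toy projection from a 2-D image-feature space into a 3-D "text" embedding
# space. Real models use far larger, learned matrices.

def project(image_features, W):
    """Multiply a feature vector by projection matrix W (one row per output dim)."""
    return [sum(w * f for w, f in zip(row, image_features)) for row in W]

W = [
    [0.5, 0.1],
    [0.2, 0.7],
    [0.9, 0.3],
]
image_features = [1.0, 2.0]        # pretend output of a vision encoder
text_space_vec = project(image_features, W)
print(text_space_vec)  # roughly [0.7, 1.6, 1.5] -- same shape as a text embedding
```

Once image features live in the same space as text embeddings, the LLM backbone can attend over both kinds of tokens with the exact same machinery.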
LMMs don’t just work with one type of data. Instead, these models enable services like ChatGPT and Gemini to:
- Read text and images in the same conversation
- Understand charts, diagrams, and screenshots
- Generate responses that reference visual information
Note that the AI is still fundamentally doing the same thing here: finding patterns in sequences. The only difference is that the sequences now include visual tokens and text tokens — all mixed together.
Why this matters
Understanding this helps explain both the power and limitations of AI:
The power: If you can represent something as a sequence of patterns, AI can potentially learn to work with it. Audio, video, 3D models, scientific data — it’s all fair game.
The limitations: AI models don’t truly understand images the way humans do, but they can convincingly simulate understanding via increasingly sophisticated pattern-matching. An AI model knows that the word “sunset” correlates with orange and purple hues and certain landscape compositions, even though it doesn’t experience the beauty of a sunset.
One more thing to keep in mind is that working with images is far more resource-intensive than working with text. A single image might require thousands of tokens to represent, while a paragraph might only need a hundred. It’s no secret that training multimodal models requires enormous computing power and energy. AI-based image generation is also far less predictable than text generation. This may be temporary and simply reflective of AI’s ongoing evolution, though some people theorize that AI models are actually about to reach their peak.
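A quick back-of-the-envelope calculation shows why images are so token-hungry. Assuming ViT-style 16×16-pixel patches (one visual token per patch) and the common rule of thumb of roughly 1.3 tokens per English word, the numbers look like this:

```python
# Back-of-the-envelope token budget. The patch size follows the ViT
# convention; the tokens-per-word ratio is a rough rule of thumb.
width, height, patch = 1024, 1024, 16
visual_tokens = (width // patch) * (height // patch)

paragraph_words = 75          # a typical paragraph
tokens_per_word = 1.3         # rough rule of thumb for English text
text_tokens = round(paragraph_words * tokens_per_word)

print(visual_tokens)  # 4096 visual tokens for one image
print(text_tokens)    # 98 tokens for a paragraph
```

That’s a ~40x difference for a single modest-resolution image versus a paragraph of prose, which is exactly why multimodal training budgets balloon so quickly.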
TL;DR
So if LLMs work with language, how can they generate images?
The short answer is that they don’t really work with language. They work with patterns.
Language just happens to be the easiest pattern to feed into AI. That said, these models also train on vast troves of publicly available images.
The moment you start thinking about text, images, and music as different ways of encoding patterns — different languages describing the same universe — that’s when AI stops seeming like magic and starts seeming like… well, something even weirder than magic.
Looking for production-ready AI expertise?
AgileEngine’s expertise spans the development of patented AI-first products that are top-rated by Gartner and G2. If you seek a consultation on AI adoption and need high-skilled experts to drive your initiatives, feel free to contact us — or explore more about our AI Studio and check our case studies.