Multimodal is not a feature, it's a stack change
Adding image understanding to your AI product is not a feature flag. It changes your data pipeline, your eval suite, your storage, your latency budget, and your cost model.
A product manager walks into a planning meeting and says, “Can we add image understanding? The new model supports it.” The team estimates it at two sprints. They are wrong by a factor of four, and they will not realize it until sprint three.
This happens constantly. Multimodal capabilities — image, audio, video — look like feature additions. They are not. They are stack changes. The distinction matters because feature additions work within your existing infrastructure. Stack changes require you to rebuild parts of it.
What actually changes
Here is what “add image understanding” means in practice, for a team that has a working text-based AI product.
Data pipeline. Your current pipeline ingests text. It chunks it, embeds it, indexes it. Images are different in every way. They need to be extracted from documents — PDFs, slides, emails with attachments. They need preprocessing — resizing, format conversion, OCR for text-in-images. They need metadata extraction — what page is this image on, what text surrounds it, what is the caption. Your text chunking code does not handle any of this. You are building a second pipeline.
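The split described above can be sketched as a second ingestion path running beside the first. A minimal sketch: the record fields, chunk size, and blob-key scheme are illustrative assumptions, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """One extracted image plus the metadata the retrieval layer will need."""
    doc_id: str
    page: int              # which page the image came from
    caption: str           # caption or nearby text, if any
    surrounding_text: str  # text context for grounding retrieval
    blob_key: str          # where the preprocessed bytes live


def ingest_page(doc_id: str, page: int, text: str, images: list[bytes]) -> dict:
    """Route a page's content down two pipelines: the existing text path
    and a new, separate image path. Extraction and preprocessing details
    (OCR, resizing, format conversion) are elided."""
    # Existing path: naive fixed-size chunking stands in for real chunking.
    text_chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    # New path: every image becomes a record with its own metadata.
    image_records = [
        ImageRecord(doc_id, page, caption="",
                    surrounding_text=text[:200],
                    blob_key=f"{doc_id}/p{page}/img{n}")
        for n, _ in enumerate(images)
    ]
    return {"text_chunks": text_chunks, "images": image_records}
```

The point of keeping the two paths separate is that text chunking code never has to learn about pages, captions, or blobs.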
Storage. Text chunks are small. A typical chunk is 500–1000 tokens, a few kilobytes. Images are large. A single page rendered at reasonable quality is 200KB–2MB. A 100-page PDF produces 100 images. Your vector store was sized for text. Your blob storage budget was sized for text. Your bandwidth costs were sized for text. Multiply everything by 100x and see if your architecture still makes sense.
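The multiplication is worth doing explicitly. A back-of-envelope sketch using the rough figures above (a few KB per chunk, ~800 KB per rendered page); the defaults are assumptions, not benchmarks.

```python
def storage_estimate(pages: int, avg_text_chunk_kb: float = 3.0,
                     chunks_per_page: int = 2,
                     avg_image_kb: float = 800.0) -> dict:
    """Compare text-only storage against one-rendered-image-per-page
    storage for the same document."""
    text_kb = pages * chunks_per_page * avg_text_chunk_kb
    image_kb = pages * avg_image_kb
    return {"text_kb": text_kb, "image_kb": image_kb,
            "multiplier": round(image_kb / text_kb, 1)}
```

For a 100-page PDF this yields roughly 600 KB of text chunks against 80 MB of images, a two-orders-of-magnitude difference in line with the 100x figure above.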
Embedding and retrieval. Text embeddings and image embeddings live in different spaces. If you want to retrieve images based on text queries — and you do — you need a multimodal embedding model. These models have different dimensionality, different performance characteristics, and different failure modes than your text embedding model. You are not adding a column to your index. You are adding a second index with a different model and a different query path.
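The "second index with a different query path" can be sketched as follows. Every callable here (the indexes and both embedding functions) is a hypothetical stand-in for real components; the merge step is deliberately naive.

```python
def retrieve(query: str, text_index, image_index,
             embed_text, embed_multimodal, k: int = 5) -> list:
    """Two-index query path: the text index keeps the existing embedding
    model; image retrieval goes through a separate multimodal embedding
    model and a separate index."""
    text_hits = text_index.search(embed_text(query), k)
    image_hits = image_index.search(embed_multimodal(query), k)
    # Merging the two result lists (interleave, rerank, modality
    # routing) is its own design problem, not solved here.
    return text_hits + image_hits
```

Note that nothing in the text path changed, but the query path now fans out to two models and two indexes, each with its own failure modes.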
Latency. Sending an image to a vision model takes longer than sending text. Significantly longer. A text-only call to GPT-4 might take 1–3 seconds. The same call with an image might take 5–15 seconds. If your product has a 3-second SLA on response time, you just violated it. You need to rethink your latency budget. Maybe you preprocess images asynchronously. Maybe you cache image analysis results. Maybe you accept a slower experience for image queries. All of these are architectural decisions, not feature decisions.
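The caching option can be sketched in a few lines: key the vision-model output by image content so the slow call happens once per unique image rather than once per query. `call_vision_model` is a hypothetical stand-in for the real client.

```python
import hashlib

_analysis_cache: dict[str, str] = {}


def analyze_image(image_bytes: bytes, call_vision_model) -> str:
    """Return the cached vision-model analysis for this image, calling
    the (slow, expensive) model only on a cache miss."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _analysis_cache:
        _analysis_cache[key] = call_vision_model(image_bytes)
    return _analysis_cache[key]
```

In production this cache would live in a shared store rather than process memory, but the shape of the decision is the same: pay the 5–15 seconds once, at ingestion or first touch, not on every query.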
Cost. Vision model calls cost more than text-only calls. Often 2–5x more per request, depending on image resolution and token count. If you are processing images at ingestion time — analyzing every image in every document — your ingestion costs go up dramatically. If you are processing images at query time — sending images to the model when the user asks about them — your per-query costs go up dramatically. Either way, your cost model changes.
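The ingestion-time versus query-time tradeoff is easy to make concrete. A sketch with placeholder prices: the per-call cost and the 3x vision multiplier are illustrative assumptions within the 2–5x range above, not published pricing.

```python
def monthly_cost(docs_per_month: int, images_per_doc: int,
                 queries_per_month: int, image_query_rate: float,
                 text_call_usd: float = 0.01,
                 vision_multiplier: float = 3.0) -> dict:
    """Compare analyzing every image at ingestion against analyzing
    images only when a query touches them."""
    vision_call_usd = text_call_usd * vision_multiplier
    ingest_time = docs_per_month * images_per_doc * vision_call_usd
    query_time = queries_per_month * image_query_rate * vision_call_usd
    return {"analyze_at_ingestion": ingest_time,
            "analyze_at_query": query_time}
```

Which side wins depends entirely on your ratio of documents ingested to image queries served, which is why this is a modeling exercise, not a default.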
Eval. This is the one teams forget about until it is too late. How do you evaluate whether the model correctly understood an image? Your text eval is straightforward — compare the generated answer to a reference answer. Image understanding eval is fundamentally harder. Did the model correctly read the chart? Did it understand the diagram? Did it extract the right numbers from the table? Each of these is a different eval task with different scoring criteria. Your eval suite just tripled in complexity.
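One of those new eval tasks, "did it extract the right numbers from the table", can be sketched as a scorer. This is one scorer out of several a real suite would need, and the tolerance handling is a simplifying assumption.

```python
def score_number_extraction(predicted: list[float], golden: list[float],
                            rel_tol: float = 0.01) -> float:
    """Recall of golden numbers among the model's extracted numbers,
    with a relative tolerance for rounding. Each predicted value can
    satisfy at most one golden value."""
    matched = 0
    remaining = list(predicted)
    for g in golden:
        for p in remaining:
            if abs(p - g) <= rel_tol * max(abs(g), 1e-9):
                matched += 1
                remaining.remove(p)
                break
    return matched / len(golden) if golden else 1.0
```

Chart reading and diagram understanding need entirely different scorers (and often human or model-graded judgments), which is where the tripled complexity comes from.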
The three-component rule
Here is a heuristic we use. Count the number of infrastructure components that need to change to support the new capability. If it is one or two, it is a feature. If it is three or more, it is a platform evolution.
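The rule is simple enough to write down directly, which also makes it easy to apply in a planning doc:

```python
def scope(changed_components: set[str]) -> str:
    """The three-component rule: three or more touched infrastructure
    components means platform work, not a feature."""
    return "platform evolution" if len(changed_components) >= 3 else "feature"
```

Applied to image understanding, the set is {pipeline, storage, retrieval, latency, cost, eval}, six components, well past the threshold.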
Adding image understanding typically changes six components: data pipeline, storage, embedding and retrieval, latency architecture, cost model, and eval suite. This is not a feature. This is a second product built on top of the first one.
The reason this distinction matters is planning. Features get estimated in sprints. Platform evolutions get estimated in quarters. If you estimate a platform evolution in sprints, you will be wrong, and you will spend the extra time in a state of perpetual “we’re almost done” that erodes team morale and stakeholder trust.
The right way to scope it
If you genuinely need multimodal capabilities — and sometimes you do — scope it as a platform project.
Phase 1: Spike. One engineer, one week. Build the simplest possible version — take one image, send it to the vision model, get a response. This tells you whether the model can actually do what you need. Many teams discover in the spike that the model’s image understanding is not good enough for their use case. Better to learn this in a week than in a quarter.
Phase 2: Pipeline. Build the image ingestion pipeline. Extraction, preprocessing, storage. Do not integrate it with the existing text pipeline yet. Run it in parallel. This takes 2–4 weeks depending on document complexity.
Phase 3: Retrieval. Add image retrieval to your query path. This might mean a multimodal embedding model, a separate index, or a hybrid approach. Test it in isolation before connecting it to the generation step. Another 2–4 weeks.
Phase 4: Eval. Build an eval suite for image understanding. This is its own workstream. You need golden sets with images, scoring functions that can handle visual content, and CI gates that run image-specific tests. 2–3 weeks.
Phase 5: Integration. Connect the image pipeline to the existing text pipeline. Handle the mixed-modality queries — “what does the chart on page 7 show and how does it relate to the text?” This is where the complexity lives. 2–4 weeks.
That is 9–16 weeks. Not two sprints. And this is the optimistic timeline assuming the spike validates the approach.
The alternative nobody considers
Before building multimodal infrastructure, ask whether you actually need it. In many cases, the user’s real need can be met with a simpler approach.
If users want to understand charts and tables, OCR plus structured extraction might be enough. Convert the visual to text, process it with your existing text pipeline. It is less impressive in a demo. It ships in a week instead of a quarter.
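The "convert the visual to text" step can be sketched to show how little changes downstream. The OCR call itself (tesseract, a vision model, whatever you use) is out of scope here; the chunk shape is an illustrative assumption.

```python
def ocr_to_chunks(ocr_text: str, doc_id: str, page: int) -> list[dict]:
    """Turn OCR output into ordinary text chunks tagged with their
    image origin, ready for the existing text pipeline. Nothing
    downstream has to learn about images."""
    return [{"doc_id": doc_id, "page": page, "source": "ocr",
             "text": line.strip()}
            for line in ocr_text.splitlines() if line.strip()]
```

The `source` tag preserves provenance so you can later measure how often OCR-derived chunks answer queries, which tells you whether the full multimodal build is ever worth it.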
If users want to search for images, metadata and captions might be enough. Tag images during ingestion with descriptions generated by a vision model, then search the text descriptions with your existing retrieval stack.
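A minimal version of that caption search, using plain keyword overlap against descriptions generated at ingestion. The scoring is deliberately crude; a real system would reuse your existing BM25 or embedding retrieval over the caption text.

```python
def caption_search(query: str, captions: dict[str, str],
                   k: int = 3) -> list[str]:
    """Rank image ids by keyword overlap between the query and each
    image's vision-model-generated caption. Pure text search: no
    multimodal index involved."""
    q_terms = set(query.lower().split())
    scored = sorted(
        captions.items(),
        key=lambda kv: len(q_terms & set(kv[1].lower().split())),
        reverse=True,
    )
    return [img_id for img_id, _ in scored[:k]]
```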
These approaches are not as powerful as true multimodal understanding. They are an order of magnitude simpler to build, operate, and evaluate. For many products, “good enough” ships and compounds while “perfect” sits in a planning document.
The heuristic
Before adding a multimodal capability, count the infrastructure components it touches. If it is three or more, call it what it is — a platform evolution — and scope it in quarters, not sprints. Then ask whether a text-only approximation would meet 80% of the user need at 20% of the cost. Usually it does.
tl;dr
The pattern. Teams estimate image understanding at two sprints because they are scoping a feature, when they are actually scoping six infrastructure changes across data pipelines, storage, retrieval, latency, cost, and evals. The fix. Count the infrastructure components the capability touches — if it is three or more, scope it as a quarter-long platform project with explicit phases, not a sprint item. The outcome. The timeline becomes honest, the phased build surfaces problems before they compound, and the team discovers early whether a simpler text-only approximation would have shipped the same user value in a week.