The pitch is always the same. A vendor shows a demo, a developer runs a prototype, or a board member reads an article and asks why the company is not doing more with AI. The conversation turns to budget. Someone mentions API costs, which are low. The assumption forms that AI is cheap.
It is not cheap. The API costs are real, but they are small relative to the total cost of delivering something useful in production. That gap, between what the invoice shows and what it actually costs to build and operate AI responsibly, is where most AI programmes run into trouble.
This is not an argument against AI investment. It is an argument for counting the whole cost before you commit.
What is actually in the invoice
Start with the visible part. API token costs are real and worth tracking. The problem is that they are easy to quote, easy to forecast, and easy to compare, so they dominate the budget conversation. A prototype that costs £40 a month in tokens sounds trivially cheap. The question is what it costs to get that prototype to production and keep it there.
Token costs vary by model tier, by input vs output ratio, and by how well the system is designed to avoid unnecessary inference. A poorly designed RAG pipeline with large context windows on every query can cost 10x more than a well-designed one doing the same job. The engineering to design it well is not free. That cost lives below the waterline.
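As a rough illustration of how context size drives the bill, the sketch below compares two designs of the same pipeline. The prices, query volume, and token counts are hypothetical placeholders, not figures from any vendor's rate card; substitute your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in GBP per million tokens (assumption, not a real rate card).
    input_per_million: float
    output_per_million: float


def monthly_inference_cost(pricing: Pricing, queries_per_month: int,
                           input_tokens: int, output_tokens: int) -> float:
    """Monthly token spend for a fixed average query shape."""
    per_query = (input_tokens * pricing.input_per_million
                 + output_tokens * pricing.output_per_million) / 1_000_000
    return per_query * queries_per_month


pricing = Pricing(input_per_million=2.0, output_per_million=8.0)

# Same job, same output length; only the context stuffed into each query differs.
bloated = monthly_inference_cost(pricing, 50_000, input_tokens=20_000, output_tokens=300)
trimmed = monthly_inference_cost(pricing, 50_000, input_tokens=1_500, output_tokens=300)

print(f"bloated: £{bloated:,.0f}/month, trimmed: £{trimmed:,.0f}/month, "
      f"ratio: {bloated / trimmed:.1f}x")
```

With these assumed inputs the gap is a little under 8x, driven entirely by input tokens. The output side of the bill is identical in both designs.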
The hidden costs, one by one
Engineering time
This is consistently the largest single cost in any AI programme. The activities that consume it: prompt engineering and evaluation (more rigorous than it sounds), integration with existing systems and data sources, building evaluation harnesses to measure output quality, and iterating when the model changes or the use case scope shifts.
A rule of thumb, based on building and observing multiple AI integrations: for every pound spent on inference, expect to spend four to ten pounds on engineering, depending on how production-grade the system needs to be. Higher for regulated industries, lower for low-stakes internal tools.
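Applied to a forecast, the rule of thumb gives a range rather than a point estimate. The sketch below assumes a hypothetical production inference forecast of £30,000 a year; the 4x and 10x multipliers are the ones stated above.

```python
def engineering_cost_range(annual_inference_cost: float,
                           low_multiplier: float = 4.0,
                           high_multiplier: float = 10.0) -> tuple[float, float]:
    """Return the (low, high) engineering spend implied by the rule of thumb."""
    return (annual_inference_cost * low_multiplier,
            annual_inference_cost * high_multiplier)


# Hypothetical forecast: £30,000 a year of inference once in production.
low, high = engineering_cost_range(30_000.0)
print(f"Engineering: £{low:,.0f} to £{high:,.0f} a year")
```

Where a system lands in that range depends on how production-grade it needs to be: regulated workloads sit near the top, low-stakes internal tools near the bottom.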
Rework and iteration
AI systems do not behave like traditional software. A model update from a vendor can change output characteristics without warning. A shift in the distribution of input data changes accuracy. A use case that worked in a demo fails on messy real-world data. These are not edge cases. They are the normal operating environment, and they generate engineering time that was not budgeted.
Vendor model deprecations are a particular trap. If the model you built against goes end-of-life and the replacement behaves differently, re-evaluation and re-tuning are real costs. The industry moves fast, and that movement is not free for the teams consuming it.
Operations and monitoring
A production AI system needs monitoring that goes beyond uptime checks. Model drift (gradual degradation in output quality over time), cost anomalies (a change in usage pattern that sends spend to unexpected levels), latency regressions, and hallucination rates all need tracking. Building and maintaining that observability layer is not trivial.
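Cost anomaly detection is the simplest of these checks to sketch. The example below flags a day's spend that sits well above recent history, using a plain z-score. The threshold, the spend figures, and the function name are illustrative assumptions; a production system would more likely use whatever alerting its monitoring platform provides.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's spend if it is more than `threshold` standard deviations
    above the mean of recent daily spend. Only upward spikes are flagged."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > threshold


# Hypothetical daily spend in GBP over the previous week.
recent = [41.0, 39.5, 40.2, 42.1, 38.9, 40.7, 41.3]

print(is_cost_anomaly(recent, 41.5))  # an ordinary day
print(is_cost_anomaly(recent, 95.0))  # a spike worth paging someone about
```

A check like this catches sudden jumps but not slow creep, which is why periodic review of the trend still matters.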
On-call implications are also real. If a customer-facing AI feature goes wrong at 2am, who is paged? What does the rollback look like? These questions need answers before go-live, and answering them has a cost.
Compliance and governance
In any regulated sector, or any business that handles personal data, AI systems require formal risk assessment, data protection impact assessments, audit trail capability, and policy documentation. The EU AI Act creates tiered obligations based on risk category. Meeting those obligations is not a one-time exercise: the assessments and documentation need updating as the system evolves and as regulation changes.
Legal review of AI vendor contracts is also often underestimated. Data processing agreements, IP ownership of model outputs, liability for harmful outputs, and terms around training data retention all require attention, and legal time costs money.
Organisational change and training
Any AI deployment that changes how people work requires change management. Users need training. Processes need updating. Someone needs to field the "why did it give me this answer?" questions. At enterprise scale, this can be a significant programme in its own right, not an afterthought.
Opportunity cost
The hardest cost to put in a spreadsheet: the work that does not get done because engineering time went to AI. Every sprint hour spent on model evaluation, prompt iteration, and integration debugging is an hour not spent on product features, technical debt, or infrastructure that users and customers would notice. The AI programme needs to deliver enough value to justify that trade-off, or the organisation is worse off for having started.
Presenting this to a CFO
A CFO does not need to understand how a language model works. They need four things: total cost range, expected benefit baseline, measurement approach, and a decision point at which investment continues or stops.
The cost range should have three components: inference costs (the visible part, easy to model), engineering costs (time-and-materials estimate with explicit assumptions), and a contingency line for rework, compliance work, and change management. Present a low case, a central case, and a high case. The spread between low and high tells the CFO something honest about uncertainty, which is preferable to false precision.
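One way to lay this out is a small table of cases, each with the same three components. All figures in the sketch below are hypothetical placeholders chosen only to show the structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostCase:
    inference: float     # the visible part, modelled from expected volume
    engineering: float   # time-and-materials estimate
    contingency: float   # rework, compliance work, change management

    @property
    def total(self) -> float:
        return self.inference + self.engineering + self.contingency


# Hypothetical annual figures in GBP.
cases = {
    "low": CostCase(inference=20_000, engineering=80_000, contingency=20_000),
    "central": CostCase(inference=30_000, engineering=180_000, contingency=45_000),
    "high": CostCase(inference=45_000, engineering=300_000, contingency=90_000),
}

for name, case in cases.items():
    print(f"{name:>8}: £{case.total:,.0f}")

spread = cases["high"].total - cases["low"].total
print(f"  spread: £{spread:,.0f}")
```

Keeping the assumptions behind each case next to the numbers makes the spread defensible when it is challenged.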
The benefit baseline needs to be concrete. "Improved productivity" is not a benefit baseline. "Reduce average handling time from 8 minutes to 5 minutes for 120 agents, saving approximately £180k per year" is a benefit baseline. It can be challenged, refined, and tracked. Vague benefit claims make the finance conversation much harder and set the programme up for an impossible post-hoc justification exercise.
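A figure like that is only as good as the inputs behind it, so it helps to show the arithmetic. In the sketch below, the agent count and handling times come from the example above; the contact volume and the loaded hourly cost are illustrative assumptions, picked so the sum lands on roughly the same number.

```python
def annual_handling_time_saving(agents: int,
                                contacts_per_agent_per_year: int,
                                minutes_before: float,
                                minutes_after: float,
                                loaded_cost_per_hour: float) -> float:
    """Annual value of reduced handling time, in the currency of the hourly cost."""
    minutes_saved = ((minutes_before - minutes_after)
                     * contacts_per_agent_per_year * agents)
    return minutes_saved / 60 * loaded_cost_per_hour


saving = annual_handling_time_saving(
    agents=120,
    contacts_per_agent_per_year=1_500,  # assumption: in-scope contacts only
    minutes_before=8,
    minutes_after=5,
    loaded_cost_per_hour=20.0,          # assumption: fully loaded GBP cost
)
print(f"£{saving:,.0f} per year")
```

Each input is a line the finance team can challenge separately, which is the point of stating the baseline this way.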
The decision point is the most important element, and the one most often omitted. Define up front: at what point in the programme will you assess whether to continue, scale, or stop? This is not a sign of low confidence. It is standard capital allocation discipline applied to technology investment. A programme with a defined checkpoint is easier to fund than one presented as an open-ended commitment.
Build vs buy vs compose: a decision model
The most common version of this question in 2026: do we use a vendor's packaged AI product, build our own integration against a foundation model API, or compose something from open-source components? The answer depends on three variables: differentiation requirement, data sensitivity, and engineering maturity.
The data sensitivity overlay
The matrix above assumes standard data sensitivity. For highly sensitive data (patient records, financial data, legal proceedings), the buy column shrinks. Vendor data processing terms, training data retention policies, and subprocessor chains need scrutiny that most organisations do not have the capacity to conduct thoroughly for every vendor they onboard.
For these use cases, private deployment (whether that is a self-hosted open-source model, a managed private endpoint from a major provider, or an on-premise appliance) shifts the cost profile significantly. The inference cost goes up. The compliance cost goes down. The engineering cost to deploy and maintain a private model is substantial. For most mid-market organisations, this trade-off only makes sense for a small number of genuinely high-sensitivity use cases.
The costs that compound
Two costs deserve special attention because they grow with programme scale rather than staying flat. The first is context window spend. As systems get more capable and users learn to give them more context, average context length grows. A system designed when the average prompt was 500 tokens may be running at 4,000 tokens a year later. The cost difference is real, and it requires periodic review of prompt design and retrieval strategy to manage.
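The effect of that drift on input spend is linear and easy to check. The sketch below uses the 500 and 4,000 token averages from the paragraph above; the query volume and the price per million input tokens are hypothetical.

```python
def input_cost_per_month(avg_prompt_tokens: int,
                         queries_per_month: int,
                         price_per_million_input: float) -> float:
    """Monthly spend on input tokens alone."""
    return avg_prompt_tokens * queries_per_month * price_per_million_input / 1_000_000


# Assumptions: 100,000 queries a month at £2 per million input tokens.
at_launch = input_cost_per_month(500, 100_000, 2.0)
a_year_later = input_cost_per_month(4_000, 100_000, 2.0)

print(f"launch: £{at_launch:,.0f}, a year later: £{a_year_later:,.0f}, "
      f"growth: {a_year_later / at_launch:.0f}x")
```

Query volume usually grows over the same period, so the two effects multiply rather than add.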
The second is shadow AI spend. As AI tools become easier to adopt, employees will subscribe to tools individually and expense them, or simply use free tiers of consumer-grade tools with business data. This creates cost visibility gaps and, more seriously, data governance gaps. The cost of discovering and rationalising shadow AI spend a year into a programme is consistently higher than the cost of establishing a policy from the start.
What a well-structured AI budget looks like
A CFO-ready AI budget has five lines: inference costs (with a realistic volume model), engineering time (expressed as headcount or day rates, not vague "resource allocation"), operations and tooling (monitoring platforms, observability, alerting infrastructure), governance and compliance (legal review, risk assessments, policy documentation, training), and a contingency line of at least 20% for rework and vendor-driven changes.
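A minimal sketch of that five-line structure follows. It assumes the contingency is calculated as a percentage of the other four lines combined, which is one reasonable reading; the amounts are hypothetical.

```python
def build_budget(inference: float, engineering: float, operations: float,
                 governance: float, contingency_rate: float = 0.20) -> dict[str, float]:
    """Assemble the five budget lines plus a total.
    Assumption: contingency is a share of the four other lines combined."""
    subtotal = inference + engineering + operations + governance
    contingency = subtotal * contingency_rate
    return {
        "inference": inference,
        "engineering": engineering,
        "operations_and_tooling": operations,
        "governance_and_compliance": governance,
        "contingency": contingency,
        "total": subtotal + contingency,
    }


# Hypothetical annual figures in GBP.
budget = build_budget(inference=30_000, engineering=180_000,
                      operations=25_000, governance=40_000)

for line, amount in budget.items():
    print(f"{line:<26} £{amount:>10,.0f}")
```

In this example inference is under a tenth of the total, which is the gap between the invoice and the real cost in miniature.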
It also has a benefit section with a baseline metric, a target metric, a measurement method, and a timeline. The programme should not be expected to be in net positive territory in the first six months. A twelve-month payback period is realistic for well-scoped use cases. Longer than 24 months requires a strong strategic argument, not just a financial one.
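Payback can be estimated with a simple month-by-month cumulative sum. The sketch below assumes an upfront build cost, a flat monthly running cost, and a flat monthly benefit that only begins once the system is live; all figures and parameter names are hypothetical.

```python
from typing import Optional


def payback_month(upfront_cost: int, monthly_run_cost: int, monthly_benefit: int,
                  benefit_start_month: int = 1, horizon_months: int = 36) -> Optional[int]:
    """Return the first month in which cumulative benefit covers cumulative cost,
    or None if that does not happen within the horizon."""
    cumulative = -upfront_cost
    for month in range(1, horizon_months + 1):
        cumulative -= monthly_run_cost
        if month >= benefit_start_month:
            cumulative += monthly_benefit
        if cumulative >= 0:
            return month
    return None


# Hypothetical: £60k build, £5k a month to run, £15k a month benefit from month 4.
print(payback_month(60_000, 5_000, 15_000, benefit_start_month=4))

# If the benefit never exceeds the running cost, there is no payback at all.
print(payback_month(60_000, 5_000, 4_000))
```

Running the same function against the low, central, and high cost cases shows how sensitive the payback date is to the assumptions.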
The programmes that fail budget scrutiny are not usually the expensive ones. They are the ones with no measurement plan, no decision checkpoint, and benefit cases built on assumptions that were never validated. Build the measurement in before the spend starts, and the finance conversation becomes straightforward.