What is your software doing?

The third control point-of-focus: ensuring software does what it should. Let me give you a perspective, and something you can do tomorrow morning to improve your pipeline.

Continued from Part 2: Are you sure you know how to write software?

Every change should do what it was meant to do — and the system as a whole should continue to do what it has always been meant to do. That is the control point-of-focus.

Ask the team whether their software does this and they will say of course it does. The new feature works in the demo, tests pass, users are using it, nobody has complained. The question feels almost rude.

Put the controls hat on for a moment and the same question is asking something sharper: not “does the new thing work?” but “can you prove, on every release, that the system continues to do what is expected — not just for the new change, but for everything that was meant to keep working?”

The delivery team’s view

A well-run team writes tests. Unit tests, integration tests, and a regression suite that gets run before releases. The team manually tests new features against the requirements. The product owner signs off in sprint review. The release goes out. If users complain, the team fixes the bug; if they don’t, the team treats that as confirmation.

This is professional software delivery. It is real work.

The auditor’s view

The auditor does not want to see your test coverage report. The auditor wants to see, for each release, evidence that the system was proven to do what is expected — by tests that were designed to prove specific behaviours the business cares about, captured in human-readable form, automated where possible, and run on every release.

One classic audit question lands like a punch in this conversation: “Why isn’t your goal 100% code coverage?” Engineering teams produce credible-sounding answers — Pareto says the last twenty percent of tests cost eighty percent of the effort; every metric opens a fresh argument about what “100%” actually means (branches? statements? paths? mutation tests?). The answers are not wrong, but they are wishy-washy, and the auditor can hear it. We will come back to this.

On top of those behavioural tests, the auditor expects systematic pre-production checks (smoke, stress, load, regulatory algorithms) with documented targets and recorded outcomes. The release record should show “we tested these things, against these targets, and got these results” for every release, no exceptions.

This is a more demanding question than “do the unit tests pass?” and a more answerable one than “do you have acceptance criteria on every ticket?” In my experience the right answer is closer to engineering practice than to ticket hygiene — and closer to BDD than to unit testing.

What each side is expecting

Same focus, two languages.

The delivery team is expecting…	The auditor is expecting…
Tests pass before we merge	An agreed, behaviour-led set of tests that prove the system does what is expected — written down, automated where possible, run every release
Regression bugs get caught	A regression pack with coverage and automation rates tracked over time
Manual testing covers the new feature	Manual testing covers both the new feature and validation of any newly automated tests
Stress and load tests run before major releases	Pre-production tests run with documented targets and recorded results — every release, not just major ones
Releases go out on schedule	Every release demonstrates that what was meant to keep working still does

The delivery team’s expectations are about catching bugs before they ship. The auditor’s expectations are about proving continued correctness. Same control. Different posture.

What the controls hat opens up

It sounds easy. Put your controls hat on for a minute and it isn’t.

Do you have an agreed, behaviour-led set of tests that prove the system does what is expected? Call this set p. The product owner and the engineering lead agree on p together, and it lives in human-readable form so anyone can read it as the system evolves. The tests are behavioural, not ticket-derived: each one proves a specific system behaviour the business cares about. As the system changes, tests get modified, extended, or retired. This is closer to BDD than unit testing — the SDLC writes a new test for every ticket; controls testing maintains a living set that proves the system behaves correctly.
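
To make “behaviour-led and human-readable” concrete, here is a minimal sketch of what one entry in p might look like as a Given/When/Then check in plain Python. The behaviour (a daily transfer limit), the Account class, and the limit value are all hypothetical, invented purely for illustration; a real team would use its own domain and quite possibly a dedicated BDD tool.

```python
from dataclasses import dataclass

DAILY_LIMIT = 1_000  # hypothetical business rule, for illustration only


@dataclass
class Account:
    """Toy stand-in for the system under test."""
    balance: int
    transferred_today: int = 0

    def transfer(self, amount: int) -> bool:
        """Return True if the transfer is accepted, False if it is rejected."""
        if amount <= 0 or amount > self.balance:
            return False
        if self.transferred_today + amount > DAILY_LIMIT:
            return False
        self.balance -= amount
        self.transferred_today += amount
        return True


def test_transfer_over_daily_limit_is_rejected() -> None:
    """
    Behaviour: a customer cannot transfer more than the daily limit.

    Given an account with a balance of 5,000 and 900 already transferred today
    When the customer tries to transfer 200
    Then the transfer is rejected and the balance is unchanged
    """
    account = Account(balance=5_000, transferred_today=900)  # Given
    accepted = account.transfer(200)                         # When
    assert accepted is False                                 # Then
    assert account.balance == 5_000


test_transfer_over_daily_limit_is_rejected()
```

The docstring is the part the product owner reads; the three lines beneath it are the part the pipeline runs. Keeping both in one place is one way to stop the written description and the executed check from drifting apart.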

The set p is also the answer to that earlier audit question: “why isn’t your goal 100% code coverage?” Code coverage measures how much of the code the tests execute; p measures what the system is meant to do. Once p is agreed, “100% of p tested” means something concrete that everyone (product owner, engineering lead, auditor) can read and agree on. That is a definition of “100%” that holds up.

How much of p is actually written down? Call the written subset w. Coverage = w/p. Anything less than 100% is a known gap — functionality you cannot prove on release, that depends on hope and history. The priority is closing this gap. If you have identified one hundred behaviours the system must demonstrate and you have written seventy-five of them down, you are deploying every release on a prayer that the remaining twenty-five still work.

How much of w is automated? The automated subset is q — fully deterministic, machine-executable, runs on every release without a human in the loop. Automation rate = q/w. This is a measure of where your testers’ time is going. Low q/w means they are stuck running manual tests every release instead of extending the set. High q/w gives them room to push w toward p. The trajectory matters more than the absolute number — what you want to see is both ratios climbing over time.
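
The two ratios are simple enough to compute by hand, but a small sketch shows how they relate. The values of p and w come from the example above; the value of q is a hypothetical number, since the text does not give one.

```python
def coverage(p: int, w: int) -> float:
    """Coverage = w / p: share of agreed behaviours that are written down."""
    if p <= 0 or not 0 <= w <= p:
        raise ValueError("need p > 0 and 0 <= w <= p")
    return w / p


def automation_rate(w: int, q: int) -> float:
    """Automation rate = q / w: share of written tests that run unattended."""
    if w <= 0 or not 0 <= q <= w:
        raise ValueError("need w > 0 and 0 <= q <= w")
    return q / w


p, w = 100, 75  # figures from the example above
q = 30          # hypothetical: the text gives no value for q

print(f"coverage        w/p = {coverage(p, w):.0%}")
print(f"automation rate q/w = {automation_rate(w, q):.0%}")
print(f"behaviours unproven on release: {p - w}")
print(f"manual tests per release:       {w - q}")
```

With these numbers, coverage is 75% and the automation rate is 40%: twenty-five behaviours cannot be proven on release, and forty-five tests still need a human every time.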

Does every release validate both the system AND the new automations? When a test moves from manual (in w−q) to automated (in q), the manual run for that release should also confirm that the new automation gives the same result the manual run gave. Otherwise q is growing on trust, not evidence.
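
One way to record that cut-over evidence is sketched below: for each test that moved into q this release, compare the manual outcome with the automated outcome. The function name, the test IDs, and the verdict labels are hypothetical choices for illustration.

```python
def validate_new_automations(
    newly_automated: set[str],
    manual_results: dict[str, bool],
    automated_results: dict[str, bool],
) -> dict[str, str]:
    """Compare manual and automated outcomes for tests that just moved into q."""
    verdicts = {}
    for test_id in sorted(newly_automated):
        manual = manual_results.get(test_id)
        automated = automated_results.get(test_id)
        if manual is None or automated is None:
            # One side of the comparison was never run, so nothing is proven.
            verdicts[test_id] = "missing evidence"
        elif manual == automated:
            verdicts[test_id] = "validated"
        else:
            verdicts[test_id] = "mismatch"
    return verdicts


# Hypothetical test IDs and results for one release.
newly_automated = {"B-017", "B-023", "B-031"}
manual_results = {"B-017": True, "B-023": True}
automated_results = {"B-017": True, "B-023": False, "B-031": True}

verdicts = validate_new_automations(newly_automated, manual_results, automated_results)
for test_id, verdict in verdicts.items():
    print(test_id, verdict)
```

Only a “validated” verdict should let a test count toward q. A mismatch or a missing manual run means the automation is still growing on trust.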

Are pre-production tests recorded with targets and results? Smoke tests, stress tests, load tests, and any system-specific tests (regulatory algorithm verification, performance SLAs, security scans) all need defined targets and per-release pass/fail records. “We ran the load test and it seemed fine” does not survive a control review.
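
As a sketch of what “defined targets and per-release pass/fail records” could look like in practice, the structure below keeps the target next to the measured value so the outcome is derived rather than asserted. The check names, metrics, numbers, and release identifier are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreProdCheck:
    """One pre-production check with its documented target and measured result."""
    name: str
    metric: str
    target: float
    measured: float
    higher_is_better: bool = False

    @property
    def passed(self) -> bool:
        # Compare in the direction the target was written for.
        if self.higher_is_better:
            return self.measured >= self.target
        return self.measured <= self.target


def release_report(release: str, checks: list[PreProdCheck]) -> list[str]:
    """Render one line per check: target, result, and pass/fail outcome."""
    lines = [f"release {release}"]
    for c in checks:
        outcome = "PASS" if c.passed else "FAIL"
        lines.append(
            f"{c.name}: {c.metric} target={c.target} measured={c.measured} {outcome}"
        )
    return lines


# Hypothetical checks and numbers, for illustration only.
checks = [
    PreProdCheck("smoke", "failed endpoints", target=0, measured=0),
    PreProdCheck("load", "p95 latency ms", target=300, measured=270),
    PreProdCheck("stress", "sustained requests/s", target=500, measured=450,
                 higher_is_better=True),
]

for line in release_report("2024.06.1", checks):
    print(line)
```

Here the stress check fails because 450 is below its target of 500. That is exactly the kind of result “it seemed fine” would have hidden.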

Are branches short-lived? If a feature branch lives for three weeks, the test results from when it was created are out of date by the time it merges. Daily merge to main is the only sustainable discipline. Long-running branches are a control gap dressed as flexibility.
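
A rough way to make branch age visible is to flag anything that has been open longer than a day. The sketch below assumes you can already obtain each branch's creation time from your version control host; the branch names and dates are hypothetical, and creation time is used here only as a simple proxy for how stale the branch's test results are.

```python
from datetime import datetime, timedelta


def stale_branches(
    created_at: dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=1),
) -> list[str]:
    """Return names of branches that have been open longer than max_age."""
    return sorted(
        name for name, created in created_at.items() if now - created > max_age
    )


# Hypothetical branch data; in practice this would come from your version control host.
now = datetime(2024, 6, 10, 9, 0)
branches = {
    "feature/quick-fix": datetime(2024, 6, 10, 8, 0),    # one hour old
    "feature/new-pricing": datetime(2024, 5, 20, 9, 0),  # three weeks old
}

print(stale_branches(branches, now))
```

The one-day default mirrors the daily-merge discipline described above; a team starting from three-week branches might begin with a looser threshold and tighten it over time.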

Are changes deployed continuously, with feature toggles for anything not yet ready? Code that sits in main but never reaches production has not been proven regression-free. Always deploy; feature-toggle off until ready. That keeps the deployed system continuously aligned with main and forces the regression pack to keep up.
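
For readers who have not used feature toggles, here is a minimal hand-rolled sketch of the idea. It is not the API of any particular toggle library; the toggle name, the pricing rules, and the environment dictionaries are hypothetical.

```python
def is_enabled(name: str, toggles: dict[str, bool]) -> bool:
    """Unknown toggles default to off, so unfinished work stays dark."""
    return toggles.get(name, False)


def legacy_quote(amount_cents: int) -> int:
    """Existing behaviour: flat 2% fee (hypothetical rule)."""
    return amount_cents + amount_cents * 2 // 100


def new_quote(amount_cents: int) -> int:
    """Unfinished replacement: 1% fee plus a 50-cent charge (hypothetical rule)."""
    return amount_cents + amount_cents // 100 + 50


def quote(amount_cents: int, toggles: dict[str, bool]) -> int:
    """Deployed entry point: the new path ships, but only runs when toggled on."""
    if is_enabled("new_pricing", toggles):
        return new_quote(amount_cents)
    return legacy_quote(amount_cents)


production = {"new_pricing": False}  # deployed, but switched off
staging = {"new_pricing": True}      # switched on where it is being tested

print(quote(10_000, production))
print(quote(10_000, staging))
```

The new code is in the deployed build from day one, so the regression pack exercises the system it ships in, while users keep seeing the legacy behaviour until the toggle is flipped.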

None of these are exotic questions. Each one is a place where a control gap can hide.

What an auditor actually wants to see

Three specific evidence asks:

A documented regression pack — the agreed, behaviour-led set of tests (p) in human-readable form, with the written subset (w) and automated subset (q) clearly tracked. Two ratios visible: coverage (w/p, target 100%) and automation rate (q/w, showing how much testing is hands-off versus hands-on each release). Both ratios should trend upward.
A release record for every release showing that the regression pack ran (the automated portion), that manual testing covered both the un-automated tests and validation of any newly automated ones, and that pre-production tests (smoke, stress, load, system-specific) ran against their documented targets with pass/fail outcomes.
A continuous-deployment record demonstrating that branches are short-lived, changes deploy promptly, and feature toggles gate incomplete work — so the deployed system is provably aligned with main.

That is the whole list. Nothing custom. Nothing expensive. Three evidence trails, almost all of which most teams already produce as a side-effect of doing the work, or could produce by configuring the tools they already use.
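
If it helps to picture the release record, the sketch below gathers the evidence into one structure per release and reports which sections are still empty. The section names, field names, and values are hypothetical; your tools will have their own shapes.

```python
import json

# Sections expected in every release record (names are hypothetical).
REQUIRED_SECTIONS = ("regression_pack", "manual_testing", "pre_production", "deployment")


def evidence_gaps(record: dict) -> list[str]:
    """Return the required sections that are absent or empty in a release record."""
    return [section for section in REQUIRED_SECTIONS if not record.get(section)]


record = {
    "release": "2024.06.1",
    "regression_pack": {"p": 100, "w": 75, "q": 30, "automated_run": "pass"},
    "manual_testing": {"unautomated_tests_run": 45, "new_automations_validated": 2},
    "pre_production": [
        {"check": "load", "target": "p95 <= 300 ms", "result": "270 ms", "outcome": "pass"},
    ],
    "deployment": {},  # nothing recorded yet, so this shows up as a gap
}

print(json.dumps(record, indent=2))
print("missing evidence:", evidence_gaps(record))
```

A check like this could run as the last step of the pipeline, so a release with an empty section is visible before the auditor asks.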

What you can do tomorrow

If your team is not doing this, four steps:

Sit down with your product owner and your engineering lead and agree on p — the minimal, behaviour-led set of tests that prove the system does what is expected. Put it in a human-readable document anyone can read. The conversation alone often reshapes the team’s view of what really matters about the system. Behaviours, not tickets.
Inventory how much of p is written (w) and how much is automated (q). Hang two ratios on the wall: w/p (coverage) and q/w (automation rate). Now you can see, at a glance, both the gaps and the trajectory.
Write your first behavioural test to replace a single manual UAT. Pick the most repetitive manual test your team runs every release; write it BDD-style as an automated, deterministic behavioural check. On the release where you cut it over, run both manual and automated — and demonstrate they give the same result. That is q + 1, and the template for everything that follows.
Adopt a feature-toggle library if you do not already have one, and start deploying every change with the unfinished bits toggled off. Long-running branches go away. Stale changes go away. Regressions in features you forgot you had go away — because no feature is ever far from production. While you are at it, surface your pre-production test results (smoke, stress, load, regulatory) in a release report with their targets and outcomes.

All of this can start tomorrow morning — get your product owner and engineering lead in a room, agree on what p looks like, sketch the first few behaviours, pick a manual UAT to automate. The conversation is half a day, and it is the start. The rest is real engineering work, and engineering work takes time. What it does not take is new tooling — everything described here runs on what most modern engineering teams already have.
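
Once the two ratios are on the wall, the trajectory can be checked mechanically as well. The sketch below uses hypothetical per-release snapshots of p, w, and q, and one possible definition of “climbing”: the ratio never drops and ends higher than it started.

```python
def is_climbing(ratios: list[float]) -> bool:
    """True if the ratio never drops between releases and ends above where it began."""
    if len(ratios) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(ratios, ratios[1:]))
    return never_drops and ratios[-1] > ratios[0]


# Hypothetical (release, p, w, q) snapshots, oldest first.
history = [
    ("R1", 100, 75, 30),
    ("R2", 100, 80, 36),
    ("R3", 104, 88, 40),
]

coverage_history = [w / p for _, p, w, _ in history]
automation_history = [q / w for _, _, w, q in history]

print("coverage climbing:  ", is_climbing(coverage_history))
print("automation climbing:", is_climbing(automation_history))
```

Note that p grows in the third snapshot. Adding behaviours to p can pull coverage down for a release, so a stricter or looser definition of the trend may suit your team better.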

The reframe

The auditor isn’t asking you to test more. They are asking you to prove, on every release, that the system continues to do what is expected — and to make that proof a side-effect of how you release, not a separate activity bolted on at the end.

The next post sounds like the auditor is finally letting up — but it isn’t quite that. Auditors don’t have an opinion on your CI pipeline.

Paul Gresham is the creator of C3P (Continuous Compliance Control Protocol) and founder of Paul Gresham Advisory LLC. He has spent thirty-five years building and governing software in regulated industries, with senior technology leadership roles at global financial institutions. He writes about continuous compliance, software engineering controls, and AI governance.