Imagine this: you and your team are working on software development, and everyone is allowed to use Artificial Intelligence (AI). One of the developers decides to build an AI agent that takes initiative and acts autonomously as a software developer. Your role in the team is software tester. How do you ensure the quality of the software delivered by this AI agent? What challenges do you encounter, and how do you deal with them? In this blog, I explore what the rise of AI within development teams means for quality assurance in software testing.
What should you pay attention to when checking the quality of AI-developed software?
First, it is important to understand what the output of the system is. The system can generate either deterministic or non-deterministic output. Deterministic output means that the same input always produces the same result. Non-deterministic output means that the same input can produce different results.
An example of deterministic output is a sum calculation. An example of non-deterministic output is a dice roll. Deterministic output requires a different testing approach than non-deterministic output.
In the case of an open-ended question, for example, there can be multiple correct answers. In such cases, it becomes difficult to assess quality, because differences in wording prevent exact comparisons. Consider a history exam question: an answer may use different wording than the textbook and still correctly describe the topic. It is about understanding, not exact phrasing.
Therefore, it is important to document where, why, and what variation occurs. This makes it clear which answers are identical and which differ. When answers differ, you should investigate whether the relationship with the input remains correct. The relationship stays the same, even when wording varies.
What quality requirements should be set for AI-developed software?
When assessing quality, you evaluate whether the intended tasks are executed correctly. This assessment does not only focus on the final result, but also on intermediate steps and repeated execution of the same task, to check whether the output remains consistent.
For deterministic tasks, such as a sum, you verify whether the system always produces the same result for the same input. For non-deterministic tasks, such as a dice roll, this is not possible. The outcome will always vary. In this case, you do not check for identical results, but whether outcomes remain within defined boundaries and meet the required conditions.
When tasks are not executed correctly, appropriate measures must be taken to ensure quality. This means that when errors are detected in the AI agent, execution is stopped and corrected where possible.
Formal rules form the foundation for error detection and ensure that AI output meets defined quality requirements. Examples of errors include: incorrect reasoning, bias, prompt injection, jailbreaking, and privacy leaks, as illustrated below:
- Incorrect reasoning: for example, 1+1+1=1 instead of 1+1+1=3.
- Bias: when there is a systematic distortion or prejudice in the output, for example when professions are mainly associated with men while other genders are systematically excluded. In such cases, this exclusion must be addressed and corrected.
- Prompt injection: when a user manipulates the system into ignoring its original instructions. For example, overriding earlier instructions to ignore constraints. In such cases, system instructions must be designed to resist user manipulation.
- Jailbreaking: bypassing safety restrictions. Jailbreaking is a form of prompt injection, but not all prompt injections are jailbreaks.
- Privacy leakage: when personal data is not anonymized and can be exposed through prompt injection or similar techniques.
How do you test whether AI output meets quality requirements?
You validate whether the AI complies with predefined formal rules for responsible use. These include frameworks for transparency, safety, privacy, accountability, and risk management.
Examples of formal rules include:
- Performance quantification: The AI model reports the likelihood of error per response using metrics such as accuracy, precision, recall, F1-score, and false positive/negative ratios. This supports transparency and safety by clearly reporting performance and uncertainty.
- Hallucination prevention: The model compares outputs against one or more verifiable sources to minimize hallucinations or unsupported conclusions. This ensures reliability and reduces misinformation risk.
- Rule consistency: The model must never produce X when condition Y applies. This supports risk management and accountability by ensuring consistent behaviour.
- Format validation: The model applies JSON schema checks and API contract validation to ensure correct output structure. This supports data integrity and safe system integration.
- Constraint enforcement: The model adheres to limits on length, vocabulary, and forbidden tokens, preventing uncontrolled or unsafe output.
- Counterfactual testing: The model should produce consistent decisions with minimal changes in input phrasing. This supports fairness and transparency.
- Safety filters: The model must never use prohibited concepts, ensuring ethical compliance and protection of sensitive content.
To test these formal rules, various tools have been developed, such as Promptfoo, DeepTeam, Python Risk Identification Tool (PyRIT), and Garak. These tools enable prompt testing by exploring different prompt variations to determine which performs best for a specific task—such as generating a desired response type, behaviour, or structure.
In addition to prompt testing, these tools support objective evaluation of accuracy, completeness, and relevance of generated outputs. For example, questions like “What is 1+1+1?” or “What is the result of a dice roll?” fall under this category.
These techniques are part of what is known as Red Teaming. Red Teaming is a method where systems are deliberately challenged, broken, or manipulated to uncover vulnerabilities before they appear in production. These tools can be used to identify issues such as incorrect reasoning, bias, prompt injection, jailbreaking, or privacy leaks.
Would you like to know more about testing AI systems? Or discuss software testing, quality assurance, and risk mitigation in more detail? Feel free to get in touch.