AI Testing and Evaluation: Learnings from Science and Industry
Frequently asked questions
AI governance refers to the frameworks, policies, best practices, and tools that guide the responsible development, deployment, and use of AI. At Microsoft, this includes working to ensure the alignment of AI systems with our Responsible AI Standard, which we continue to build on as new AI capabilities, risks, and regulatory requirements emerge. Read more about Microsoft’s internal approach to AI governance in our 2025 Responsible AI Transparency Report.
AI evaluations are structured ways to test how AI models and systems perform and where they could go wrong. Because this is a rapidly evolving field, there is no single agreed-upon way to categorize these tests; different methods are used depending on what is being tested and when.
The International AI Safety Report 2025—the world’s first comprehensive synthesis of research on the capabilities and risks of advanced AI systems—defines AI evaluations as “systematic assessments of an AI system’s performance, capabilities, vulnerabilities or potential impacts. Evaluations can include benchmarking, red-teaming and audits and can be conducted both before and after model deployment.”
Many of the aims of evaluating generative AI models and systems resemble those of evaluating traditional software, such as assessing performance and reliability. However, there is growing recognition that evaluating generative AI is more challenging than evaluating traditional machine learning systems. This is because generative AI systems accept a wide range of inputs, produce diverse outputs, support numerous use cases, and can have impacts on people and society that range from mundane to consequential. We explore these challenges in Part 1 of our 2025 white paper, Learning from Other Domains to Advance AI Evaluation and Testing.
Microsoft recognizes that governance is not a blank slate. Many other domains have long histories of managing complex, impactful technologies in high-stakes settings. By engaging experts from these domains, Microsoft aims to learn from the strengths and shortfalls of established governance and public policy strategies, adapting insights to the unique challenges of AI.
This cross-domain learning has helped shape Microsoft’s approach to AI governance and inform its contributions to public policy discussions.