LLM Evaluation in GxP / 21 CFR Part 11 Environments

By Jelino Walker — AI Engineer, JW Consultancy — 6 June 2026

TL;DR

Validating LLMs in regulated environments (21 CFR Part 11, EU GxP Annex 11, GAMP5) requires more than accuracy benchmarks — it requires a documented evidence package, an audit trail for every decision, and a defensible baseline before any model goes live.

In practice this means: a rule-based baseline first, a stratified holdout set (minimum 20%), a confidence floor (e.g., 0.70) before any ML override, and an LLM-as-judge scoring layer with a regression alarm threshold. Observability tooling such as Langfuse provides the trace records regulators expect.

This guide documents the approach used in two production systems — Dyniq (107 agents, 7-tier routing, LLM-as-judge with 1-10 rubric and 15% regression alarm) and WalkerFinance (0.70 confidence floor, two-gate scikit-learn classifier, 50+ human labels before any override, 20% holdout baseline comparison).

What is LLM evaluation in a regulated context?

LLM evaluation in a regulated context is a structured process for measuring model outputs against a documented ground truth, a rule-based baseline, and predefined quality thresholds — all captured in a validation evidence package. It is distinct from standard ML evaluation because the artefacts (test sets, scores, audit logs) must be available for regulatory inspection.

Regulators do not certify LLMs; they inspect the process by which the system was validated and the controls in place to detect regression. Every scoring decision must be traceable to a specific timestamp and actor, including who ran it, what the input was, what the model returned, and what score it received.
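
As a minimal sketch of what one traceable scoring decision could look like, the record below carries the five fields named above (actor, input, output, score, timestamp). The class name ScoreRecord and the helpers make_record and to_audit_line are hypothetical, chosen for illustration; they are not taken from Dyniq, WalkerFinance, or any library.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a record cannot be mutated after creation
class ScoreRecord:
    """One scoring decision: who ran it, on what input, with what result."""
    actor: str
    model_input: str
    model_output: str
    score: float
    timestamp: str  # ISO 8601, UTC


def make_record(actor: str, model_input: str, model_output: str, score: float) -> ScoreRecord:
    """Stamp a scoring decision with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return ScoreRecord(actor, model_input, model_output, score, now)


def to_audit_line(record: ScoreRecord) -> str:
    """Serialise a record as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    rec = make_record("qa.reviewer", "example input", "example output", 7.0)
    print(to_audit_line(rec))
```

A real deployment would also need access controls and protection against edits to the log itself; this sketch only shows the shape of a single record.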

The practical implication: evaluation is not a one-time gate but a continuous loop. Regression alarms (e.g., Dyniq's 15% threshold) automatically trigger a review when model quality drops by more than the threshold relative to the validated baseline.

Key regulatory frameworks

21 CFR Part 11 (FDA)

21 CFR Part 11 governs electronic records and electronic signatures in FDA-regulated industries. For AI/ML systems, it requires validated systems with audit trails, access controls, and the ability to reconstruct any electronic record.

If an LLM output influences a regulatory submission or GMP decision, the system producing that output is subject to Part 11. The LLM itself is not certified; the system — including its eval harness and audit log — must be validated.

EU GxP Annex 11 (Computerised Systems)

Annex 11 applies to computerised systems used in GxP-regulated activities across the EU. It requires risk-based validation, data integrity controls, and documented supplier/vendor assessment for any system in scope.

For AI systems, Annex 11 is more explicit than Part 11 about vendor controls — the LLM provider (e.g., a cloud API) is a supplier and must be assessed. Data integrity requirements apply to training data, holdout sets, and evaluation logs.

GAMP5 (Risk-Based Approach to GxP IT)

GAMP5 provides a risk-based framework for categorising and validating GxP IT systems. Custom AI/ML systems, including those using fine-tuned or prompt-engineered LLMs, fall under Category 5 (custom applications), which carries the highest validation effort.

As industry guidance rather than regulation, GAMP5 calls for a documented validation evidence package: design specifications, test scripts, test results, and a validation summary report. For LLMs, this package should include the eval harness configuration, holdout set description, and baseline comparison results.

Evaluation dimensions

Dimension	Definition
Correctness	Factual accuracy of model output vs. ground truth labels.
Completeness	Whether the output covers all required information for the task.
Hallucination rate	Frequency of confidently stated falsehoods not supported by source material.
Latency	Response time under production load across model providers.
Cost	Per-query cost across model providers at production volume.
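
To make the dimensions in the table concrete, here is a hedged sketch of how per-example evaluation results might be rolled up into one summary. The EvalResult fields and the summarise function are hypothetical names, and treating each quality dimension as a simple proportion over the evaluated examples is an assumption, not a method described in this guide.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalResult:
    correct: bool        # output matches the ground truth label
    complete: bool       # output covers all required information
    hallucinated: bool   # output asserts something unsupported by the source
    latency_s: float     # response time in seconds
    cost_usd: float      # cost of this single query


def summarise(results: list[EvalResult]) -> dict[str, float]:
    """Aggregate per-example results into one value per evaluation dimension."""
    if not results:
        raise ValueError("no results to summarise")
    n = len(results)
    return {
        "correctness": sum(r.correct for r in results) / n,
        "completeness": sum(r.complete for r in results) / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "mean_latency_s": mean(r.latency_s for r in results),
        "mean_cost_usd": mean(r.cost_usd for r in results),
    }
```

The boolean flags would come from human labels or a scoring layer; the sketch only covers the aggregation step.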

Tooling in practice

Langfuse (observability)

Langfuse provides tracing, score tracking, and a dashboard for LLM pipelines. In Dyniq, it captures every agent call — input, output, latency, and score — creating the audit trail that regulated environments require. It is the observability layer across all 107 agents in the Dyniq routing cascade.

LLM-as-judge (1-10 rubric, 15% regression alarm)

LLM-as-judge uses a second language model to score outputs against a rubric. In Dyniq, this is a 1-10 rubric applied to every agent response. When aggregate scores drop 15% below the validated baseline, an alarm triggers a human review.

LLM-as-judge is a useful signal but cannot be the sole validation mechanism. It must be paired with human review for high-risk GxP decisions and documented as one component of the broader eval harness.

Holdout sets and baseline comparison (WalkerFinance)

WalkerFinance uses a stratified 20% holdout set. The ML classifier must beat the rule-based baseline on this holdout before any override goes live. This comparison is the core validation gate — without it, there is no defensible reason to trust the ML over the rule.
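
The holdout gate can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the WalkerFinance implementation: the function names are hypothetical, the split is hand-rolled with the standard library rather than scikit-learn, and "beats the baseline" is assumed to mean strictly higher accuracy on the holdout.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction=0.20, seed=0):
    """Return (train_idx, holdout_idx), holding out the same fraction of each label.

    Labels must be sortable (e.g., strings) so the split order is deterministic.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split can be reproduced for audit
    train_idx, holdout_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # At least one holdout example per class, even for very small classes.
        n_holdout = max(1, round(len(indices) * holdout_fraction))
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


def accuracy(predictions, truth):
    """Fraction of predictions that match the ground truth."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def ml_beats_baseline(ml_preds, rule_preds, truth):
    """Validation gate: ML must be strictly more accurate than the rules on the holdout."""
    return accuracy(ml_preds, truth) > accuracy(rule_preds, truth)
```

Recording the seed and the resulting holdout indices alongside the comparison result lets the same split be reconstructed later. A tie counts as a failure here, so the rule engine stays in place unless the classifier is measurably better.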

Confidence floors (WalkerFinance: 0.70, two-gate classifier)

The WalkerFinance classifier uses a 0.70 confidence floor. The ML override only activates when the classifier exceeds this threshold — below it, the rule engine remains authoritative. This is the first gate; the second gate is the 50+ human label requirement.

The two-gate design ensures no ML override reaches production without both sufficient human-labelled training data and sufficient predictive confidence on unseen examples.
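
A minimal sketch of the two-gate decision, assuming the thresholds quoted above. The function name choose_label and its parameters are hypothetical, and the confidence comparison is assumed to be inclusive (>= 0.70), following the comparison table later in this guide.

```python
CONFIDENCE_FLOOR = 0.70  # confidence floor quoted in the text
MIN_HUMAN_LABELS = 50    # "50+ human labels" quoted in the text


def choose_label(rule_label, ml_label, ml_confidence, human_label_count):
    """Return (label, source). The rule engine stays authoritative unless both gates pass."""
    confident = ml_confidence >= CONFIDENCE_FLOOR          # gate 1 (assumed inclusive)
    enough_labels = human_label_count >= MIN_HUMAN_LABELS  # gate 2
    if confident and enough_labels:
        return ml_label, "ml_override"
    return rule_label, "rule_engine"


if __name__ == "__main__":
    print(choose_label("rule-category", "ml-category", 0.82, 64))
    print(choose_label("rule-category", "ml-category", 0.65, 64))
```

Returning the source alongside the label means each decision can be logged with the component that made it.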

Regression alarms (Dyniq: 15% threshold)

A regression alarm fires when a quality metric drops by more than a defined threshold relative to the validated baseline. In Dyniq, a 15% drop in aggregate LLM-as-judge scores triggers an automatic review before the next deployment. This is the continuous monitoring component that regulators expect to see documented.
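
The alarm logic can be sketched as below. This is an illustration, not Dyniq's code: it assumes the "aggregate" is the mean judge score and that the 15% is a relative drop against the validated baseline mean. The function names are hypothetical.

```python
from statistics import mean

REGRESSION_THRESHOLD = 0.15  # 15% drop, as quoted in the text


def relative_drop(baseline_mean, current_mean):
    """Fractional drop of the current mean relative to the baseline mean."""
    if baseline_mean <= 0:
        raise ValueError("baseline mean must be positive")
    return (baseline_mean - current_mean) / baseline_mean


def regression_alarm(baseline_scores, current_scores, threshold=REGRESSION_THRESHOLD):
    """True when the mean judge score has dropped by more than `threshold`."""
    return relative_drop(mean(baseline_scores), mean(current_scores)) > threshold


if __name__ == "__main__":
    baseline = [8, 8, 8]
    print(regression_alarm(baseline, [7, 7, 7]))  # 12.5% drop
    print(regression_alarm(baseline, [6, 6, 6]))  # 25% drop
```

Under these assumptions, a validated baseline mean of 8.0 on the 1-10 rubric puts the alarm line at 8.0 × 0.85 = 6.8: any current mean below that triggers a review.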

Rule-based vs. ML classifier vs. LLM output: when each is appropriate in GxP

Approach	When appropriate	GxP fit	Limitation
Rule-based baseline	Always — the starting point before any ML	Fully auditable, deterministic, easiest to validate	Brittle on edge cases; cannot generalise beyond coded rules
ML classifier override	After 50+ human labels, confidence ≥ 0.70, beats baseline on 20% holdout	Defensible with proper holdout and audit trail	Requires ongoing monitoring; confidence floor must be validated
LLM output	Non-critical reasoning, summarisation, draft generation with human review	Acceptable as a signal; not sole decision-maker for GxP decisions	Hallucination risk; requires LLM-as-judge + human-in-the-loop for high-risk use

10 steps to a defensible LLM eval in a GxP system

A named, citable checklist for teams building LLM evaluation harnesses in regulated environments.

1. Define ground truth with domain experts (not just engineers).
2. Establish a rule-based baseline before training any ML.
3. Build a holdout set (minimum 20% split, stratified).
4. Set a confidence floor (e.g., 0.70); ML only overrides when it exceeds it.
5. Require N human labels (e.g., 50+) before any ML override goes live.
6. Implement LLM-as-judge scoring (1-10 rubric) with a regression alarm threshold.
7. Use observability tooling (e.g., Langfuse) to trace every decision.
8. Maintain an audit trail: actor, input, output, score, timestamp.
9. Test under adversarial inputs before each release (red team).
10. Document the validation evidence package per GAMP5/Annex 11 requirements.

Frequently asked questions

Does 21 CFR Part 11 apply to LLM outputs?

Yes, if LLM outputs are used in electronic records that support regulatory submissions or GMP decisions. The audit trail, access controls, and system validation requirements of Part 11 apply to the system, not just the data.

How do you validate an LLM in a GxP environment?

Define a ground truth dataset, establish a rule-based baseline, set a holdout set, require the LLM to beat the baseline on the holdout before any production use, and document the validation evidence.

What is LLM-as-judge and is it acceptable for regulated use?

LLM-as-judge uses a second LLM to score outputs on a rubric (e.g., 1-10). It is acceptable as one signal in a broader eval harness, not as the sole validation method. Pair it with human review for critical decisions.

What confidence floor should I use for ML overrides in pharma?

0.70 is a reasonable starting point for a two-gate classifier. The key principle: the ML override should only act when it provably beats the existing rule. Set the floor based on your holdout baseline, not a guess.

How is Annex 11 different from 21 CFR Part 11 for AI systems?

Annex 11 (EU) focuses on risk-based validation of computerised systems including data integrity and supplier assessment. Part 11 (FDA) focuses on electronic records and electronic signatures. For AI systems, both require validated systems and audit trails; Annex 11 is more explicit about vendor/supplier controls.

Building an LLM system in a regulated environment? JW Consultancy has delivered production eval harnesses for pharma-adjacent and high-compliance use cases.
