Mapping the European AI HLEG Trustworthiness Framework to Formal Methods
1. Motivation & Vision
The European High‑Level Expert Group on Artificial Intelligence (AI HLEG) defines Trustworthy AI through seven concrete requirements:
- Human agency & oversight
- Technical robustness & safety
- Privacy & data governance
- Transparency
- Diversity, non‑discrimination & fairness
- Societal & environmental well‑being
- Accountability
These requirements are widely cited by regulators, standards bodies (ISO/IEC 42001, IEEE 7010), and industry road‑maps, yet only a subset of them can be expressed and verified with current formal‑method techniques.
The thesis aims to systematically formalise each trustworthiness dimension, decide what can be proved (e.g., safety properties, privacy leakage bounds) and what must remain empirical or policy‑driven (e.g., societal impact). The final artefact is a trustworthiness‑profiling framework that couples formal verification results with a lightweight, traceable “trust‑score” for AI systems.
2. Research Questions
| # | Question |
|---|---|
| RQ1 | Which of the seven AI‑HLEG trustworthiness requirements can be expressed as formal specifications (temporal logic, type systems, information‑flow, probabilistic models, etc.)? |
| RQ2 | For each formally expressible requirement, what existing formal‑method techniques (model checking, theorem proving, symbolic execution, probabilistic verification, quantitative information‑flow analysis) are suitable and scalable? |
| RQ3 | How can the remaining (non‑formal) requirements be integrated into a unified trustworthiness assessment so that the overall score is traceable to the formal evidence? |
| RQ4 | Does the proposed formal‑trustworthiness profiling improve developer confidence and regulatory compliance compared with a baseline checklist‑only approach? |
| RQ5 | What are the limitations (expressiveness, scalability, data‑privacy) of current formal methods when applied to modern AI components (deep neural networks, reinforcement‑learning agents, large language models)? |
3. Scope & Deliverables
| Deliverable | Content |
|---|---|
| D1 – Trustworthiness Taxonomy & Formalisation Blueprint | Mapping of each AI‑HLEG requirement to a formal language (e.g., LTL, PCTL, type‑level contracts, relational algebra). For non‑formalizable aspects, a metadata schema describing required artefacts (documentation, impact assessments). |
| D2 – Prototype Verification Suite | A set of reusable verification modules (e.g., safety‑property model checker, differential‑privacy analyser, fairness counter‑example generator) implemented on top of open‑source tools (PRISM, Z3, DeepCheck, CrySL, Aequitas). |
| D3 – Trust‑Score Engine | A Python/JavaScript library that aggregates formal verification outcomes, metadata, and human‑provided evidence into a single, auditable trust‑score (with per‑requirement breakdown). |
| D4 – Case‑Study Implementation | Application of the framework to two real‑world AI systems (e.g., a credit‑scoring model and an autonomous‑driving perception stack). For each case: formalise requirements, run the verification suite, compute the trust‑score, and produce a compliance report. |
| D5 – Empirical Evaluation | Quantitative (time‑to‑verification, false‑positive/negative rates) and qualitative (developer survey, regulator interview) assessment of the framework against a baseline “checklist‑only” process. |
| D6 – Thesis Manuscript | ~80‑page dissertation covering background, methodology, results, discussion, limitations, and future work. All artefacts released under an OSI‑approved license. |
4. Methodology
4.1. Literature & Standards Review (Weeks 1‑4)
- Study the AI HLEG report, ISO 42001, IEEE 7010, and emerging regulations (EU AI‑Act).
- Survey formal‑method literature on safety, privacy, fairness, explainability, robustness, accountability.
- Identify gaps and prior attempts at “trustworthy AI verification”.
4.2. Requirement Formalisation (Weeks 5‑9)
| Requirement | Candidate Formalism | Example Property |
|---|---|---|
| Human agency & oversight | Linear Temporal Logic (LTL) + supervisory‐controller models | “Whenever the system proposes an autonomous action, a human decision is required within 5 s.” |
| Technical robustness & safety | Probabilistic CTL (PCTL), Hybrid Automata | “Probability of a collision in a 10 km mission ≤ 10⁻⁶.” |
| Privacy & data governance | Quantitative Information Flow (QIF), Differential Privacy (ε‑DP) | “Leakage ≤ 0.01 bits for any single record.” |
| Transparency | Contract‑based specifications (pre/post‑conditions, model‑explainability contracts) | “Every prediction is accompanied by a saliency map satisfying completeness ≥ 0.8.” |
| Fairness | Statistical parity constraints, probabilistic relational properties | “For all protected groups a, b: P(ŷ=1 ∣ A=a) − P(ŷ=1 ∣ A=b) ≤ ε (statistical parity within ε).” |
| Societal & environmental well‑being | Meta‑model (no direct formalisation) → Evidence artefacts (LCA reports, impact assessments). | |
| Accountability | Audit trails expressed as event‑log LTL; non‑repudiation predicates. | “All high‑risk decisions are logged with immutable timestamps.” |
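To make the table examples concrete, the oversight and robustness properties could be written roughly as follows. This is a sketch: propose_action, human_decision, and collision are assumed atomic propositions, the 5 s deadline uses a bounded operator from MTL (plain LTL has no real‑time bounds), and the 10 km mission scope is omitted for brevity.

```latex
% Human agency & oversight: every autonomous proposal is followed by a
% human decision within 5 seconds (MTL-style bounded response).
\mathbf{G}\bigl(\mathit{propose\_action} \rightarrow \mathbf{F}_{\le 5\,\mathrm{s}}\ \mathit{human\_decision}\bigr)

% Technical robustness & safety: the probability of reaching a collision
% state stays below 1e-6 (PCTL).
\mathbf{P}_{\le 10^{-6}}\bigl[\ \mathbf{F}\ \mathit{collision}\ \bigr]
```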
Create a catalogue (Markdown/JSON) that links each AI‑HLEG sub‑requirement to the following (a sketch of one entry follows this list):
- Formal language (if any)
- Reference tool(s) capable of checking it
- Expected input artefacts (model, dataset, policies)
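A single catalogue entry might look like the sketch below; the field names, file names, and threshold are illustrative assumptions rather than a fixed schema:

```json
{
  "requirement": "privacy_and_data_governance",
  "sub_requirement": "privacy",
  "formal_language": "differential_privacy (epsilon-DP)",
  "tools": ["Google DP Library"],
  "input_artefacts": ["training_pipeline.py", "dp_config.yaml"],
  "property_template": "epsilon <= 1.0 per training run",
  "fallback_evidence": null
}
```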
4.3. Prototype Toolchain Development (Weeks 10‑16)
- Safety & Robustness – Use PRISM (model checking of Markov Decision Processes) or HyTech for hybrid systems to verify probabilistic safety properties.
- Privacy – Implement a wrapper around Google’s Differential‑Privacy library and QIF analyser (e.g., Leakage).
- Fairness – Integrate Aequitas or IBM AI Fairness 360, expose counter‑example generation via Z3.
- Transparency – Define a contract DSL (similar to Alloy or Spec#) and a static analyser that checks that explanations are produced.
- Accountability – Use log‑formalisation (e.g., LTL over event streams) with K-Lin to verify mandatory logging.
Each module will expose a CLI returning a JSON payload:
{
  "requirement": "technical_robustness",
  "status": "PASS",
  "evidence": "prism_model_check.txt",
  "metrics": { "prob_coll": 3.2e-7 }
}
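As an illustration, a robustness module wrapping PRISM could produce that payload roughly as in the sketch below; the model file, the collision label, the threshold, and the output parsing are placeholder assumptions, not the final implementation:

```python
#!/usr/bin/env python3
"""Sketch of a CLI wrapper for the technical-robustness module (illustrative only)."""
import json
import subprocess
import sys

THRESHOLD = 1e-6  # placeholder collision-probability bound taken from the catalogue


def run_prism(model_path: str, property_str: str) -> float:
    """Run PRISM on a model/property pair and parse the reported probability.
    Assumes a PRISM installation on PATH; output parsing is deliberately simplified."""
    out = subprocess.run(
        ["prism", model_path, "-pf", property_str],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if line.startswith("Result:"):
            return float(line.split()[1])
    raise RuntimeError("could not parse PRISM output")


def main() -> None:
    prob = run_prism("perception.prism", 'P=? [ F "collision" ]')
    payload = {
        "requirement": "technical_robustness",
        "status": "PASS" if prob <= THRESHOLD else "FAIL",
        "evidence": "prism_model_check.txt",
        "metrics": {"prob_coll": prob},
    }
    json.dump(payload, sys.stdout, indent=2)


if __name__ == "__main__":
    main()
```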
4.4. Trust‑Score Engine (Weeks 17‑19)
- Design a weighted aggregation scheme (e.g., multi‑criteria decision analysis) where formally proved requirements receive a full score, empirically assessed ones receive a partial score capped by the credibility of supplied artefacts.
- Produce an HTML/JSON report that is traceable: each score element links back to the verification artefact or documentation file.
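A minimal sketch of the aggregation idea follows, assuming illustrative per‑requirement weights and a simple credibility cap; the real weighting scheme would be derived with stakeholders via multi‑criteria decision analysis:

```python
"""Illustrative weighted aggregation for the trust-score engine (sketch only)."""
from dataclasses import dataclass


@dataclass
class Evidence:
    requirement: str
    kind: str           # "formal" (machine-checked proof) or "empirical" (artefact-based)
    passed: bool
    credibility: float  # 0..1, only meaningful for empirical evidence


# Placeholder weights per AI-HLEG requirement (sum to 1.0).
WEIGHTS = {
    "human_oversight": 0.15,
    "technical_robustness": 0.20,
    "privacy": 0.15,
    "transparency": 0.15,
    "fairness": 0.15,
    "societal_wellbeing": 0.10,
    "accountability": 0.10,
}


def score(evidence: list[Evidence]) -> dict[str, float]:
    """Formally proved requirements earn their full weight; empirical evidence is
    capped by artefact credibility. A requirement's score never exceeds its weight."""
    per_req: dict[str, float] = {}
    for ev in evidence:
        weight = WEIGHTS.get(ev.requirement, 0.0)
        if not ev.passed:
            contribution = 0.0
        elif ev.kind == "formal":
            contribution = weight
        else:
            contribution = weight * min(max(ev.credibility, 0.0), 1.0)
        per_req[ev.requirement] = min(weight, per_req.get(ev.requirement, 0.0) + contribution)
    per_req["total"] = sum(per_req.values())
    return per_req
```

The per‑requirement breakdown in the returned dictionary is what the HTML/JSON report would link back to the underlying verification artefacts or documentation files.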
4.5. Case‑Study Selection & Execution (Weeks 20‑24)
| Case | Domain | AI Component | Reason for selection |
|---|---|---|---|
| C1 | FinTech | Gradient‑Boosted credit‑scoring model (tabular data) | Rich fairness & privacy concerns, easy to model. |
| C2 | Autonomous Driving | Perception pipeline (CNN + sensor fusion) | Safety‑critical, probabilistic behaviour, need for robustness checks. |
For each case (a pipeline sketch follows this list):
- Model extraction – translate the AI component into a formal representation (e.g., abstract Markov decision process for the perception system, decision tree for the credit model).
- Requirement instantiation – fill the catalogue with concrete thresholds (e.g., financial fairness bound, privacy ε).
- Run verification suite – collect JSON evidence.
- Compute trust‑score – generate a compliance dossier.
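The four steps could be orchestrated roughly as sketched below; module and file names are hypothetical, and the collected evidence is then passed to the trust‑score engine from Section 4.4:

```python
"""Illustrative end-to-end run for one case study (names and paths are placeholders)."""
import json
import subprocess

# Requirement -> CLI command of the corresponding verification module.
MODULES = {
    "technical_robustness": ["python", "verify_robustness.py"],
    "privacy": ["python", "verify_privacy.py"],
    "fairness": ["python", "verify_fairness.py"],
}


def run_case(case_dir: str) -> list[dict]:
    """Run every applicable module inside the case directory and collect the
    JSON payloads they print to stdout."""
    results = []
    for requirement, cmd in MODULES.items():
        print(f"running {requirement} module ...")
        out = subprocess.run(
            cmd, cwd=case_dir, capture_output=True, text=True, check=True
        ).stdout
        results.append(json.loads(out))
    return results


if __name__ == "__main__":
    evidence = run_case("cases/c1_credit_scoring")
    with open("c1_evidence.json", "w") as fh:
        json.dump(evidence, fh, indent=2)  # fed into the trust-score engine afterwards
```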
4.6. Evaluation
- Benchmarking – measure CPU/memory usage and runtime for each verification module on the case‑study artefacts.
- Baseline comparison – repeat the same assessment using the AI‑HLEG self‑assessment checklist (no formal proofs). Compare total time and coverage (percentage of requirements addressed).
- User study – short survey (N ≈ 12) of data‑science engineers and regulatory officers. Questions focus on perceived trust, understanding of evidence, and willingness to adopt the framework.
- Statistical analysis – paired t‑tests for verification times and Likert‑scale aggregation for confidence ratings (see the sketch after this list).
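A brief sketch of the planned analysis, assuming timing measurements and Likert responses are exported to CSV files (file and column names are placeholders):

```python
"""Illustrative evaluation analysis: framework vs. checklist-only baseline."""
import pandas as pd
from scipy import stats

# One row per requirement/participant with times (minutes) under both conditions.
times = pd.read_csv("evaluation_times.csv")  # columns: requirement, framework_min, checklist_min
t_stat, p_value = stats.ttest_rel(times["framework_min"], times["checklist_min"])
print(f"paired t-test on verification time: t = {t_stat:.2f}, p = {p_value:.3f}")

# Likert responses (1-5) on perceived trust and understandability, one row per answer.
survey = pd.read_csv("survey_likert.csv")    # columns: question, score
print(survey.groupby("question")["score"].agg(["median", "mean", "count"]))
```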
4.7. Documentation & Dissemination (Weeks 25‑30)
- Write the full thesis (introduction → discussion).
- Publish the catalogue and verification suite on GitHub under MIT license.
- Prepare a 15‑minute conference‑style video demo (e.g., for ICSE or FLoC workshops).
5. Expected Contributions
| Academic | Practical |
|---|---|
| C1 – A systematic taxonomy linking AI‑HLEG trustworthiness requirements to formal specification languages. | P1 – An open‑source Trustworthy‑AI verification toolkit that can be plugged into CI pipelines. |
| C2 – Empirical evidence on the feasibility and limits of formally verifying modern ML components. | P2 – A trust‑score engine that produces auditable compliance reports for regulators and auditors. |
| C3 – A methodology for combining formal evidence with non‑formal artefacts in a single, traceable assessment. | P3 – Demonstrated case studies (credit scoring, autonomous perception) showing end‑to‑end usage. |
| C4 – Recommendations for future research directions (e.g., formalising societal impact, scaling probabilistic verification). | P4 – A knowledge base (catalogue) that can serve as a starting point for standardisation bodies. |
6. Technologies & Toolchain
| Layer | Tool / Language |
|---|---|
| Modelling / Formal Specs | Alloy, TLA⁺, PRISM, Hybrid Automata (HyTech) |
| Verification Engines | Z3 (SMT), Coq/Why3 (theorem proving), DeepCheck (Neural‑network reachability), Aequitas/AI‑Fairness‑360 (fairness), Google DP Library |
| Programming | Python (for glue code & trust‑score engine), Rust (optional for performance‑critical adapters) |
| Containerisation | Docker Compose (each verifier as a micro‑service) |
| CI / Automation | GitHub Actions (run verification on PRs) |
| Reporting | Markdown → HTML, Jupyter notebooks, OpenAPI for REST endpoints |
| Evaluation | pytest, timeit, Qualtrics (surveys), R / pandas (statistical analysis) |
7. Risks & Mitigation
| Risk | Impact | Mitigation |
|---|---|---|
| R1 – Formal model extraction is too costly (e.g., abstracting a CNN). | Delays, limited coverage. | Focus on safety‑critical sub‑components (sensor fusion, decision logic) and use over‑approximations (interval abstraction). |
| R2 – LLM‑generated fairness constraints may be ambiguous. | Unreliable specifications. | Keep all fairness predicates hand‑crafted for the case studies; use LLM only for documentation assistance. |
| R3 – Regulators may reject a quantitative trust‑score. | Low practical relevance. | Design the score as transparent and configurable; provide full audit trail and a “raw evidence” view. |
| R4 – Tools (e.g., PRISM) may not scale to realistic models. | Incomplete verification. | Apply compositional verification: verify smaller modules independently, then compose results using assume‑guarantee reasoning. |
| R5 – Limited participant pool for user study. | Threat to external validity. | Combine the survey with semi‑structured interviews of senior data‑science managers (≈5) to enrich qualitative insights. |
8. Timeline (30 weeks)
| Week | Milestone |
|---|---|
| 1‑4 | Literature & standards review; finalise RQs |
| 5‑9 | Formalisation catalogue (draft) |
| 10‑16 | Prototype verification modules |
| 17‑19 | Trust‑score engine design & implementation |
| 20‑24 | Case‑study preparation & execution |
| 25‑27 | Evaluation (benchmarks + user study) |
| 28‑30 | Writing, polishing, release of artefacts, submission |
9. Why This Thesis Is Feasible & Relevant
- Existing Foundations: All required formal methods and verification tools are mature, open‑source, and have documented APIs.
- Clear Boundary: By explicitly separating formalizable from non‑formalizable trustworthiness aspects, the scope stays manageable while still delivering a valuable artefact.
- Regulatory Urgency: The EU AI‑Act and ISO 42001 will soon demand demonstrable compliance; a framework that provides provable evidence meets a pressing industry need.
- Academic Impact: The interdisciplinary blend of formal methods, ethical AI, and software‑engineering practice aligns with hot topics at conferences such as ICSE, FLoC, CAV, and AAAI.
10. Deliverable Summary
| # | Deliverable | Format |
|---|---|---|
| D1 | Trustworthiness taxonomy & formalisation blueprint | PDF + public GitHub repo (Markdown) |
| D2 | Verification suite (Dockerised micro‑services) | Open‑source code (MIT) |
| D3 | Trust‑score engine (library + CLI) | Python package (PyPI) |
| D4 | Two complete case‑study reports | PDF + reproducible notebooks |
| D5 | Evaluation results (benchmarks, survey data) | CSV + statistical scripts |
| D6 | Full master’s thesis | LaTeX PDF (university format) |
11. Final Thought
By mapping the European AI HLEG’s trustworthiness pillars onto concrete formal specifications, and by building a tool‑supported evidence pipeline, this thesis will deliver an end‑to‑end, reproducible methodology that makes explicit which parts of a modern AI system can be mathematically proved and which still require human judgement. It bridges the gap between ethical guidelines and rigorous engineering, a step forward for trustworthy AI in both academia and industry.
