
Formalising Trustworthy AI

Mapping the European AI HLEG Trustworthiness Framework to Formal Methods


1. Motivation & Vision

The European High‑Level Expert Group on Artificial Intelligence (AI HLEG) defines Trustworthy AI through seven concrete requirements:

  1. Human agency & oversight
  2. Technical robustness & safety
  3. Privacy & data governance
  4. Transparency
  5. Diversity, non‑discrimination & fairness
  6. Societal & environmental well‑being
  7. Accountability

These requirements are widely referenced by regulators, standards bodies (ISO 42001, IEEE 7010) and industry road-maps, yet only a subset can be expressed and verified with current formal-method techniques.

The thesis aims to systematically formalise each trustworthiness dimension, decide what can be proved (e.g., safety properties, privacy leakage bounds) and what must remain empirical or policy‑driven (e.g., societal impact). The final artefact is a trustworthiness‑profiling framework that couples formal verification results with a lightweight, traceable “trust‑score” for AI systems.


2. Research Questions

RQ1 – Which of the seven AI-HLEG trustworthiness requirements can be expressed as formal specifications (temporal logic, type systems, information-flow, probabilistic models, etc.)?
RQ2 – For each formally expressible requirement, what existing formal-method techniques (model checking, theorem proving, symbolic execution, probabilistic verification, quantitative information-flow analysis) are suitable and scalable?
RQ3 – How can the remaining (non-formal) requirements be integrated into a unified trustworthiness assessment so that the overall score is traceable to the formal evidence?
RQ4 – Does the proposed formal-trustworthiness profiling improve developer confidence and regulatory compliance compared with a baseline checklist-only approach?
RQ5 – What are the limitations (expressiveness, scalability, data privacy) of current formal methods when applied to modern AI components (deep neural networks, reinforcement-learning agents, large language models)?

3. Scope & Deliverables

D1 – Trustworthiness Taxonomy & Formalisation Blueprint: Mapping of each AI-HLEG requirement to a formal language (e.g., LTL, PCTL, type-level contracts, relational algebra). For non-formalisable aspects, a metadata schema describing required artefacts (documentation, impact assessments).
D2 – Prototype Verification Suite: A set of reusable verification modules (e.g., safety-property model checker, differential-privacy analyser, fairness counter-example generator) implemented on top of open-source tools (PRISM, Z3, DeepCheck, CrySL, Aequitas).
D3 – Trust-Score Engine: A Python/JavaScript library that aggregates formal verification outcomes, metadata, and human-provided evidence into a single, auditable trust-score (with a per-requirement breakdown).
D4 – Case-Study Implementation: Application of the framework to two real-world AI systems (e.g., a credit-scoring model and an autonomous-driving perception stack). For each case: formalise requirements, run the verification suite, compute the trust-score, and produce a compliance report.
D5 – Empirical Evaluation: Quantitative (time-to-verification, false-positive/negative rates) and qualitative (developer survey, regulator interview) assessment of the framework against a baseline "checklist-only" process.
D6 – Thesis Manuscript: An approximately 80-page dissertation covering background, methodology, results, discussion, limitations, and future work. All artefacts released under an OSI-approved license.

4. Methodology

4.1. Literature & Standards Review (Weeks 1‑4)

  • Study the AI HLEG report, ISO 42001, IEEE 7010, and emerging regulations (EU AI‑Act). 
  • Survey formal-method literature on safety, privacy, fairness, explainability, robustness, and accountability. 
  • Identify gaps and prior attempts at “trustworthy AI verification”.

4.2. Requirement Formalisation (Weeks 5‑9)

  • Human agency & oversight – Formalism: Linear Temporal Logic (LTL) + supervisory-controller models. Example property: "Whenever the system proposes an autonomous action, a human decision is required within 5 s."
  • Technical robustness & safety – Formalism: probabilistic CTL (PCTL), hybrid automata. Example property: "Probability of a collision in a 10 km mission ≤ 10⁻⁶."
  • Privacy & data governance – Formalism: Quantitative Information Flow (QIF), differential privacy (ε-DP). Example property: "Leakage ≤ 0.01 bits for any single record."
  • Transparency – Formalism: contract-based specifications (pre/post-conditions, model-explainability contracts). Example property: "Every prediction is accompanied by a saliency map satisfying completeness ≥ 0.8."
  • Fairness – Formalism: statistical parity constraints, probabilistic relational properties. Example property: "|P(ŷ=1 | A=a) − P(ŷ=1 | A=b)| ≤ ε for all protected groups a, b."
  • Societal & environmental well-being – No direct formalisation (meta-model); evidence artefacts instead (LCA reports, impact assessments).
  • Accountability – Formalism: audit trails expressed as event-log LTL, non-repudiation predicates. Example property: "All high-risk decisions are logged with immutable timestamps."
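
For illustration, the human-oversight entry above can be written as a bounded-response property in MTL, a timed extension of LTL; the atomic propositions proposeAction and humanDecision are placeholders that the concrete system model would have to provide:

  G ( proposeAction → F[0,5s] humanDecision )

Read: globally, whenever the system proposes an autonomous action, a human decision follows within 5 seconds.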

Create a catalogue (Markdown/JSON) that links each AI‑HLEG sub‑requirement to:

  • Formal language (if any) 
  • Reference tool(s) capable of checking it 
  • Expected input artefacts (model, dataset, policies)
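
As a sketch of what a single catalogue entry could look like, the following Python dict (which would be serialised into the Markdown/JSON catalogue) uses purely illustrative field names, property template, and tool references; the final schema is a design decision of the thesis:

# Hypothetical catalogue entry for one AI-HLEG sub-requirement (all field names illustrative).
entry = {
    "requirement": "privacy_and_data_governance",
    "sub_requirement": "privacy_of_training_data",
    "formal_language": "differential_privacy",          # None for non-formalisable aspects
    "property_template": "epsilon-DP with epsilon <= {epsilon}",
    "tools": ["google-differential-privacy"],           # reference tool(s) able to check it
    "input_artefacts": ["training_pipeline_config", "dataset_schema"],
    "fallback_evidence": None,                          # e.g., an impact assessment for non-formal items
}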

4.3. Prototype Toolchain Development (Weeks 10‑16)

  1. Safety & Robustness – Use PRISM (model checking of Markov Decision Processes) or HyTech for hybrid systems to verify probabilistic safety properties. 
  2. Privacy – Implement a wrapper around Google's Differential-Privacy library and a QIF analyser (e.g., Leakage). 
  3. Fairness – Integrate Aequitas or IBM AI Fairness 360 and expose counter-example generation via Z3.
  4. Transparency – Define a contract DSL (similar to Alloy or Spec#) and a static analyser that checks that explanations are produced. 
  5. Accountability – Use log‑formalisation (e.g., LTL over event streams) with K-Lin to verify mandatory logging.
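
As a small illustration of step 5, a simplified monitor for the invariant "every high-risk decision is accompanied by a timestamped log entry" could look as follows; the JSON-lines log format and its field names (decision_id, risk, logged_at) are assumptions for the sketch, not the format the toolchain will ultimately use:

import json

def check_mandatory_logging(log_path: str) -> bool:
    """Check that every high-risk decision event carries a log timestamp."""
    ok = True
    with open(log_path) as fh:
        for line in fh:
            event = json.loads(line)                    # one decision event per line
            if event.get("risk") == "high" and not event.get("logged_at"):
                print(f"violation: decision {event.get('decision_id')} has no timestamp")
                ok = False
    # Immutability of the log (e.g., hash chaining) would be checked by a separate module.
    return ok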

Each module will expose a CLI returning a JSON payload:

{
  "requirement": "technical_robustness",
  "status": "PASS",
  "evidence": "prism_model_check.txt",
  "metrics": { "prob_coll": 3.2e-7 }
}
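
A minimal sketch of such a CLI adapter for the robustness module, assuming PRISM is on the PATH and invoked as "prism <model> <properties>"; the file names, the 1e-6 threshold, and the output parsing are placeholders:

import json
import subprocess
import sys

def verify_robustness(model="perception.prism", props="collision.props"):
    # Run PRISM on the probabilistic model and keep its raw output as evidence.
    run = subprocess.run(["prism", model, props], capture_output=True, text=True)
    with open("prism_model_check.txt", "w") as fh:
        fh.write(run.stdout)
    # Extract the computed probability from PRISM's "Result: ..." line.
    prob = None
    for line in run.stdout.splitlines():
        if line.startswith("Result:"):
            prob = float(line.split()[1])
            break
    payload = {
        "requirement": "technical_robustness",
        "status": "PASS" if prob is not None and prob <= 1e-6 else "FAIL",
        "evidence": "prism_model_check.txt",
        "metrics": {"prob_coll": prob},
    }
    json.dump(payload, sys.stdout, indent=2)

if __name__ == "__main__":
    verify_robustness()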

4.4. Trust‑Score Engine (Weeks 17‑19)

  • Design a weighted aggregation scheme (e.g., multi-criteria decision analysis) in which formally proved requirements receive a full score and empirically assessed ones receive a partial score capped by the credibility of the supplied artefacts (a minimal sketch follows this list). 
  • Produce an HTML/JSON report that is traceable: each score element links back to the verification artefact or documentation file.
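
A minimal sketch of the aggregation mentioned above, a plain weighted sum in the spirit of multi-criteria decision analysis; the weights, the credibility cap, and the expected input format are illustrative assumptions, not the final design:

# Hypothetical weights per AI-HLEG requirement (sum to 1.0); to be calibrated in the thesis.
WEIGHTS = {
    "technical_robustness": 0.25,
    "privacy": 0.20,
    "fairness": 0.20,
    "transparency": 0.15,
    "accountability": 0.10,
    "human_oversight": 0.05,
    "societal_wellbeing": 0.05,
}

def trust_score(results: dict) -> float:
    """results maps requirement -> {"status", "kind" ("formal"/"empirical"), "credibility" in [0, 1]}."""
    total = 0.0
    for requirement, weight in WEIGHTS.items():
        outcome = results.get(requirement, {"status": "FAIL"})
        if outcome["status"] != "PASS":
            continue
        # Formal proofs earn the full weight; empirical evidence is capped by artefact credibility.
        cap = 1.0 if outcome.get("kind") == "formal" else outcome.get("credibility", 0.5)
        total += weight * cap
    return round(total, 3)

Each contribution in the breakdown would link back to the corresponding JSON evidence file so that the score stays traceable.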

4.5. Case‑Study Selection & Execution (Weeks 20‑24)

  • C1 – FinTech: gradient-boosted credit-scoring model (tabular data). Reason for selection: rich fairness & privacy concerns, easy to model.
  • C2 – Autonomous driving: perception pipeline (CNN + sensor fusion). Reason for selection: safety-critical, probabilistic behaviour, need for robustness checks.

For each case:

  1. Model extraction – translate the AI component into a formal representation (e.g., abstract Markov decision process for the perception system, decision tree for the credit model). 
  2. Requirement instantiation – fill the catalogue with concrete thresholds (e.g., financial fairness bound, privacy ε). 
  3. Run verification suite – collect JSON evidence. 
  4. Compute trust‑score – generate a compliance dossier.
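
As a concrete illustration of this loop for C1, the instantiated thresholds and the driver calls might look as follows; the threshold values and the helper functions run_suite and trust_score are hypothetical placeholders, not committed interfaces:

# Hypothetical instantiation of the catalogue for case study C1 (credit scoring).
c1_thresholds = {
    "fairness": {"statistical_parity_gap_max": 0.05},   # |P(y_hat=1|A=a) - P(y_hat=1|A=b)| <= 0.05
    "privacy": {"epsilon_max": 1.0},                     # differential-privacy budget
    "robustness": None,                                  # not applicable to the tabular model
}

# evidence = run_suite("credit_model.pkl", c1_thresholds)  # step 3: collect per-requirement JSON payloads
# score = trust_score(evidence)                            # step 4: aggregate into the compliance dossier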

4.6. Evaluation

  1. Benchmarking – measure CPU/memory usage and runtime for each verification module on the case‑study artefacts. 
  2. Baseline comparison – repeat the same assessment using the AI‑HLEG self‑assessment checklist (no formal proofs). Compare total time and coverage (percentage of requirements addressed). 
  3. User study – short survey (N ≈ 12) of data-science engineers and regulatory officers. Questions focus on perceived trust, understanding of evidence, and willingness to adopt the framework. 
  4. Statistical analysis – paired t-tests (time) and Likert-scale aggregation (confidence).
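
For the time comparison in step 4, the paired t-test could be computed as in the sketch below (using scipy); the numbers shown are placeholders for illustration only, not measured results:

from scipy.stats import ttest_rel

# Per-requirement assessment time in hours: framework vs. checklist-only baseline
# (placeholder values, to be replaced by the benchmark measurements).
framework_hours = [2.0, 3.5, 1.0, 4.0, 2.5, 0.5, 1.5]
checklist_hours = [3.0, 3.0, 2.0, 5.5, 3.0, 1.0, 2.5]

t_stat, p_value = ttest_rel(framework_hours, checklist_hours)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")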

4.7. Documentation & Dissemination (Weeks 25‑30)

  • Write the full thesis (introduction → discussion). 
  • Publish the catalogue and verification suite on GitHub under MIT license. 
  • Prepare a 15‑minute conference‑style video demo (e.g., for ICSE or FLoC workshops).

5. Expected Contributions

Academic contributions:

  • C1 – A systematic taxonomy linking AI-HLEG trustworthiness requirements to formal specification languages.
  • C2 – Empirical evidence on the feasibility and limits of formally verifying modern ML components.
  • C3 – A methodology for combining formal evidence with non-formal artefacts in a single, traceable assessment.
  • C4 – Recommendations for future research directions (e.g., formalising societal impact, scaling probabilistic verification).

Practical contributions:

  • P1 – An open-source Trustworthy-AI verification toolkit that can be plugged into CI pipelines.
  • P2 – A trust-score engine that produces auditable compliance reports for regulators and auditors.
  • P3 – Demonstrated case studies (credit scoring, autonomous perception) showing end-to-end usage.
  • P4 – A knowledge base (catalogue) that can serve as a starting point for standardisation bodies.

6. Technologies & Toolchain

  • Modelling / formal specs: Alloy, TLA⁺, PRISM, hybrid automata (HyTech)
  • Verification engines: Z3 (SMT), Coq/Why3 (theorem proving), DeepCheck (neural-network reachability), Aequitas / AI Fairness 360 (fairness), Google DP library
  • Programming: Python (glue code & trust-score engine), Rust (optional, for performance-critical adapters)
  • Containerisation: Docker Compose (each verifier as a micro-service)
  • CI / automation: GitHub Actions (run verification on PRs)
  • Reporting: Markdown → HTML, Jupyter notebooks, OpenAPI for REST endpoints
  • Evaluation: pytest, timeit, Qualtrics (surveys), R / pandas (statistical analysis)

7. Risks & Mitigation

R1 – Formal model extraction is too costly (e.g., abstracting a CNN). Impact: delays, limited coverage. Mitigation: focus on safety-critical sub-components (sensor fusion, decision logic) and use over-approximations (interval abstraction).
R2 – LLM-generated fairness constraints may be ambiguous. Impact: unreliable specifications. Mitigation: keep all fairness predicates hand-crafted for the case studies; use LLMs only for documentation assistance.
R3 – Regulators may reject a quantitative trust-score. Impact: low practical relevance. Mitigation: design the score to be transparent and configurable; provide a full audit trail and a "raw evidence" view.
R4 – Tools (e.g., PRISM) may not scale to realistic models. Impact: incomplete verification. Mitigation: apply compositional verification: verify smaller modules independently, then compose results using assume-guarantee reasoning.
R5 – Limited participant pool for the user study. Impact: threat to external validity. Mitigation: combine the survey with semi-structured interviews of senior data-science managers (≈5) to enrich qualitative insights.

8. Timeline (30 weeks)

Weeks 1-4: Literature & standards review; finalise RQs
Weeks 5-9: Formalisation catalogue (draft)
Weeks 10-16: Prototype verification modules
Weeks 17-19: Trust-score engine design & implementation
Weeks 20-24: Case-study preparation & execution
Weeks 25-27: Evaluation (benchmarks + user study)
Weeks 28-30: Writing, polishing, release of artefacts, submission

9. Why This Thesis Is Feasible & Relevant

  • Existing Foundations: All required formal methods and verification tools are mature, open‑source, and have documented APIs. 
  • Clear Boundary: By explicitly separating formalizable from non‑formalizable trustworthiness aspects, the scope stays manageable while still delivering a valuable artefact. 
  • Regulatory Urgency: The EU AI‑Act and ISO 42001 will soon demand demonstrable compliance; a framework that provides provable evidence meets a pressing industry need. 
  • Academic Impact: The interdisciplinary blend of formal methods, ethical AI, and software-engineering practice aligns with hot topics at conferences such as ICSE, FLoC, CAV, and AAAI.

10. Deliverable Summary

D1 – Trustworthiness taxonomy & formalisation blueprint: PDF + public GitHub repo (Markdown)
D2 – Verification suite (Dockerised micro-services): open-source code (MIT)
D3 – Trust-score engine (library + CLI): Python package (PyPI)
D4 – Two complete case-study reports: PDF + reproducible notebooks
D5 – Evaluation results (benchmarks, survey data): CSV + statistical scripts
D6 – Full master's thesis: LaTeX PDF (university format)

11. Final Thought

By mapping the European AI HLEG's trustworthiness pillars onto concrete formal specifications, and by building a tool-supported evidence pipeline, this thesis will provide the first end-to-end, reproducible methodology that makes explicit which parts of a modern AI system can be mathematically proved and which parts still require human judgement. It bridges the gap between ethical guidelines and rigorous engineering: a step forward for trustworthy AI in both academia and industry.
