
Formalising Trustworthy AI

Mapping the European AI HLEG Trustworthiness Framework to Formal Methods


1. Motivation & Vision

The European High‑Level Expert Group on Artificial Intelligence (AI HLEG) defines Trustworthy AI through seven concrete requirements:

  1. Human agency & oversight
  2. Technical robustness & safety
  3. Privacy & data governance
  4. Transparency
  5. Diversity, non‑discrimination & fairness
  6. Societal & environmental well‑being
  7. Accountability

These requirements are widely referenced by regulators, standards bodies (ISO 42001, IEEE 7010) and industry road-maps, yet only a subset can be expressed and verified with current formal-method techniques.

The thesis aims to systematically formalise each trustworthiness dimension, decide what can be proved (e.g., safety properties, privacy leakage bounds) and what must remain empirical or policy‑driven (e.g., societal impact). The final artefact is a trustworthiness‑profiling framework that couples formal verification results with a lightweight, traceable “trust‑score” for AI systems.


2. Research Questions

RQ1 – Which of the seven AI-HLEG trustworthiness requirements can be expressed as formal specifications (temporal logic, type systems, information-flow, probabilistic models, etc.)?
RQ2 – For each formally expressible requirement, what existing formal-method techniques (model checking, theorem proving, symbolic execution, probabilistic verification, quantitative information-flow analysis) are suitable and scalable?
RQ3 – How can the remaining (non-formal) requirements be integrated into a unified trustworthiness assessment so that the overall score is traceable to the formal evidence?
RQ4 – Does the proposed formal-trustworthiness profiling improve developer confidence and regulatory compliance compared with a baseline checklist-only approach?
RQ5 – What are the limitations (expressiveness, scalability, data privacy) of current formal methods when applied to modern AI components (deep neural networks, reinforcement-learning agents, large language models)?

3. Scope & Deliverables

D1 – Trustworthiness Taxonomy & Formalisation Blueprint: Mapping of each AI-HLEG requirement to a formal language (e.g., LTL, PCTL, type-level contracts, relational algebra). For non-formalisable aspects, a metadata schema describing required artefacts (documentation, impact assessments).
D2 – Prototype Verification Suite: A set of reusable verification modules (e.g., safety-property model checker, differential-privacy analyser, fairness counter-example generator) implemented on top of open-source tools (PRISM, Z3, DeepCheck, CrySL, Aequitas).
D3 – Trust-Score Engine: A Python/JavaScript library that aggregates formal verification outcomes, metadata, and human-provided evidence into a single, auditable trust-score (with a per-requirement breakdown).
D4 – Case-Study Implementation: Application of the framework to two real-world AI systems (e.g., a credit-scoring model and an autonomous-driving perception stack). For each case: formalise requirements, run the verification suite, compute the trust-score, and produce a compliance report.
D5 – Empirical Evaluation: Quantitative (time-to-verification, false-positive/negative rates) and qualitative (developer survey, regulator interview) assessment of the framework against a baseline "checklist-only" process.
D6 – Thesis Manuscript: An approximately 80-page dissertation covering background, methodology, results, discussion, limitations, and future work. All artefacts released under an OSI-approved license.

4. Methodology

4.1. Literature & Standards Review (Weeks 1‑4)

  • Study the AI HLEG report, ISO 42001, IEEE 7010, and emerging regulations (EU AI‑Act). 
  • Survey formal-method literature on safety, privacy, fairness, explainability, robustness, and accountability. 
  • Identify gaps and prior attempts at “trustworthy AI verification”.

4.2. Requirement Formalisation (Weeks 5‑9)

  • Human agency & oversight – Formalism: Linear Temporal Logic (LTL) + supervisory-controller models. Example property: "Whenever the system proposes an autonomous action, a human decision is required within 5 s."
  • Technical robustness & safety – Formalism: probabilistic CTL (PCTL), hybrid automata. Example property: "Probability of a collision in a 10 km mission ≤ 10⁻⁶."
  • Privacy & data governance – Formalism: Quantitative Information Flow (QIF), differential privacy (ε-DP). Example property: "Leakage ≤ 0.01 bits for any single record."
  • Transparency – Formalism: contract-based specifications (pre/post-conditions, model-explainability contracts). Example property: "Every prediction is accompanied by a saliency map satisfying completeness ≥ 0.8."
  • Fairness – Formalism: statistical parity constraints, probabilistic relational properties. Example property: "|P(ŷ=1 | A=a) − P(ŷ=1 | A=b)| ≤ ε for all protected groups a, b."
  • Societal & environmental well-being – No direct formalisation (meta-model); evidence artefacts instead (LCA reports, impact assessments).
  • Accountability – Formalism: audit trails expressed as event-log LTL, non-repudiation predicates. Example property: "All high-risk decisions are logged with immutable timestamps."
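
For illustration, the human-oversight entry above can be written as a bounded-response property in MTL, a timed extension of LTL; the atomic propositions proposeAction and humanDecision are placeholders that the concrete system model would have to provide:

  G ( proposeAction → F[0,5s] humanDecision )

Read: globally, whenever the system proposes an autonomous action, a human decision follows within 5 seconds.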

Create a catalogue (Markdown/JSON) that links each AI‑HLEG sub‑requirement to:

  • Formal language (if any) 
  • Reference tool(s) capable of checking it 
  • Expected input artefacts (model, dataset, policies)
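
As a sketch of what a single catalogue entry could look like, the following Python dict (which would be serialised into the Markdown/JSON catalogue) uses purely illustrative field names, property template, and tool references; the final schema is a design decision of the thesis:

# Hypothetical catalogue entry for one AI-HLEG sub-requirement (all field names illustrative).
entry = {
    "requirement": "privacy_and_data_governance",
    "sub_requirement": "privacy_of_training_data",
    "formal_language": "differential_privacy",          # None for non-formalisable aspects
    "property_template": "epsilon-DP with epsilon <= {epsilon}",
    "tools": ["google-differential-privacy"],           # reference tool(s) able to check it
    "input_artefacts": ["training_pipeline_config", "dataset_schema"],
    "fallback_evidence": None,                          # e.g., an impact assessment for non-formal items
}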

4.3. Prototype Toolchain Development (Weeks 10‑16)

  1. Safety & Robustness – Use PRISM (model checking of Markov Decision Processes) or HyTech for hybrid systems to verify probabilistic safety properties. 
  2. Privacy – Implement a wrapper around Google's Differential-Privacy library and a QIF analyser (e.g., Leakage). 
  3. Fairness – Integrate Aequitas or IBM AI Fairness 360 and expose counter-example generation via Z3.
  4. Transparency – Define a contract DSL (similar to Alloy or Spec#) and a static analyser that checks that explanations are produced. 
  5. Accountability – Use log‑formalisation (e.g., LTL over event streams) with K-Lin to verify mandatory logging.
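
As a small illustration of step 5, a simplified monitor for the invariant "every high-risk decision is accompanied by a timestamped log entry" could look as follows; the JSON-lines log format and its field names (decision_id, risk, logged_at) are assumptions for the sketch, not the format the toolchain will ultimately use:

import json

def check_mandatory_logging(log_path: str) -> bool:
    """Check that every high-risk decision event carries a log timestamp."""
    ok = True
    with open(log_path) as fh:
        for line in fh:
            event = json.loads(line)                    # one decision event per line
            if event.get("risk") == "high" and not event.get("logged_at"):
                print(f"violation: decision {event.get('decision_id')} has no timestamp")
                ok = False
    # Immutability of the log (e.g., hash chaining) would be checked by a separate module.
    return ok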

Each module will expose a CLI returning a JSON payload:

{
  "requirement": "technical_robustness",
  "status": "PASS",
  "evidence": "prism_model_check.txt",
  "metrics": { "prob_coll": 3.2e-7 }
}
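
A minimal sketch of such a CLI adapter for the robustness module, assuming PRISM is on the PATH and invoked as "prism <model> <properties>"; the file names, the 1e-6 threshold, and the output parsing are placeholders:

import json
import subprocess
import sys

def verify_robustness(model="perception.prism", props="collision.props"):
    # Run PRISM on the probabilistic model and keep its raw output as evidence.
    run = subprocess.run(["prism", model, props], capture_output=True, text=True)
    with open("prism_model_check.txt", "w") as fh:
        fh.write(run.stdout)
    # Extract the computed probability from PRISM's "Result: ..." line.
    prob = None
    for line in run.stdout.splitlines():
        if line.startswith("Result:"):
            prob = float(line.split()[1])
            break
    payload = {
        "requirement": "technical_robustness",
        "status": "PASS" if prob is not None and prob <= 1e-6 else "FAIL",
        "evidence": "prism_model_check.txt",
        "metrics": {"prob_coll": prob},
    }
    json.dump(payload, sys.stdout, indent=2)

if __name__ == "__main__":
    verify_robustness()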

4.4. Trust‑Score Engine (Weeks 17‑19)

  • Design a weighted aggregation scheme (e.g., multi-criteria decision analysis) in which formally proved requirements receive a full score and empirically assessed ones receive a partial score capped by the credibility of the supplied artefacts (a minimal sketch follows this list). 
  • Produce an HTML/JSON report that is traceable: each score element links back to the verification artefact or documentation file.
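
A minimal sketch of the aggregation mentioned above, a plain weighted sum in the spirit of multi-criteria decision analysis; the weights, the credibility cap, and the expected input format are illustrative assumptions, not the final design:

# Hypothetical weights per AI-HLEG requirement (sum to 1.0); to be calibrated in the thesis.
WEIGHTS = {
    "technical_robustness": 0.25,
    "privacy": 0.20,
    "fairness": 0.20,
    "transparency": 0.15,
    "accountability": 0.10,
    "human_oversight": 0.05,
    "societal_wellbeing": 0.05,
}

def trust_score(results: dict) -> float:
    """results maps requirement -> {"status", "kind" ("formal"/"empirical"), "credibility" in [0, 1]}."""
    total = 0.0
    for requirement, weight in WEIGHTS.items():
        outcome = results.get(requirement, {"status": "FAIL"})
        if outcome["status"] != "PASS":
            continue
        # Formal proofs earn the full weight; empirical evidence is capped by artefact credibility.
        cap = 1.0 if outcome.get("kind") == "formal" else outcome.get("credibility", 0.5)
        total += weight * cap
    return round(total, 3)

Each contribution in the breakdown would link back to the corresponding JSON evidence file so that the score stays traceable.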

4.5. Case‑Study Selection & Execution (Weeks 20‑24)

  • C1 – FinTech: gradient-boosted credit-scoring model (tabular data). Reason for selection: rich fairness & privacy concerns, easy to model.
  • C2 – Autonomous driving: perception pipeline (CNN + sensor fusion). Reason for selection: safety-critical, probabilistic behaviour, need for robustness checks.

For each case:

  1. Model extraction – translate the AI component into a formal representation (e.g., abstract Markov decision process for the perception system, decision tree for the credit model). 
  2. Requirement instantiation – fill the catalogue with concrete thresholds (e.g., financial fairness bound, privacy ε). 
  3. Run verification suite – collect JSON evidence. 
  4. Compute trust‑score – generate a compliance dossier.
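
As a concrete illustration of this loop for C1, the instantiated thresholds and the driver calls might look as follows; the threshold values and the helper functions run_suite and trust_score are hypothetical placeholders, not committed interfaces:

# Hypothetical instantiation of the catalogue for case study C1 (credit scoring).
c1_thresholds = {
    "fairness": {"statistical_parity_gap_max": 0.05},   # |P(y_hat=1|A=a) - P(y_hat=1|A=b)| <= 0.05
    "privacy": {"epsilon_max": 1.0},                     # differential-privacy budget
    "robustness": None,                                  # not applicable to the tabular model
}

# evidence = run_suite("credit_model.pkl", c1_thresholds)  # step 3: collect per-requirement JSON payloads
# score = trust_score(evidence)                            # step 4: aggregate into the compliance dossier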

4.6. Evaluation

  1. Benchmarking – measure CPU/memory usage and runtime for each verification module on the case‑study artefacts. 
  2. Baseline comparison – repeat the same assessment using the AI‑HLEG self‑assessment checklist (no formal proofs). Compare total time and coverage (percentage of requirements addressed). 
  3. User study – short survey (N ≈ 12) of data-science engineers and regulatory officers. Questions focus on perceived trust, understanding of evidence, and willingness to adopt the framework. 
  4. Statistical analysis – paired t-tests (time) and Likert-scale aggregation (confidence).
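
For the time comparison in step 4, the paired t-test could be computed as in the sketch below (using scipy); the numbers shown are placeholders for illustration only, not measured results:

from scipy.stats import ttest_rel

# Per-requirement assessment time in hours: framework vs. checklist-only baseline
# (placeholder values, to be replaced by the benchmark measurements).
framework_hours = [2.0, 3.5, 1.0, 4.0, 2.5, 0.5, 1.5]
checklist_hours = [3.0, 3.0, 2.0, 5.5, 3.0, 1.0, 2.5]

t_stat, p_value = ttest_rel(framework_hours, checklist_hours)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")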

4.7. Documentation & Dissemination (Weeks 25‑30)

  • Write the full thesis (introduction → discussion). 
  • Publish the catalogue and verification suite on GitHub under MIT license. 
  • Prepare a 15‑minute conference‑style video demo (e.g., for ICSE or FLoC workshops).

5. Expected Contributions

Academic contributions:

  • C1 – A systematic taxonomy linking AI-HLEG trustworthiness requirements to formal specification languages.
  • C2 – Empirical evidence on the feasibility and limits of formally verifying modern ML components.
  • C3 – A methodology for combining formal evidence with non-formal artefacts in a single, traceable assessment.
  • C4 – Recommendations for future research directions (e.g., formalising societal impact, scaling probabilistic verification).

Practical contributions:

  • P1 – An open-source Trustworthy-AI verification toolkit that can be plugged into CI pipelines.
  • P2 – A trust-score engine that produces auditable compliance reports for regulators and auditors.
  • P3 – Demonstrated case studies (credit scoring, autonomous perception) showing end-to-end usage.
  • P4 – A knowledge base (catalogue) that can serve as a starting point for standardisation bodies.

6. Technologies & Toolchain

  • Modelling / formal specs: Alloy, TLA⁺, PRISM, hybrid automata (HyTech)
  • Verification engines: Z3 (SMT), Coq/Why3 (theorem proving), DeepCheck (neural-network reachability), Aequitas / AI Fairness 360 (fairness), Google DP library
  • Programming: Python (glue code & trust-score engine), Rust (optional, for performance-critical adapters)
  • Containerisation: Docker Compose (each verifier as a micro-service)
  • CI / automation: GitHub Actions (run verification on PRs)
  • Reporting: Markdown → HTML, Jupyter notebooks, OpenAPI for REST endpoints
  • Evaluation: pytest, timeit, Qualtrics (surveys), R / pandas (statistical analysis)

7. Risks & Mitigation

R1 – Formal model extraction is too costly (e.g., abstracting a CNN). Impact: delays, limited coverage. Mitigation: focus on safety-critical sub-components (sensor fusion, decision logic) and use over-approximations (interval abstraction).
R2 – LLM-generated fairness constraints may be ambiguous. Impact: unreliable specifications. Mitigation: keep all fairness predicates hand-crafted for the case studies; use LLMs only for documentation assistance.
R3 – Regulators may reject a quantitative trust-score. Impact: low practical relevance. Mitigation: design the score to be transparent and configurable; provide a full audit trail and a "raw evidence" view.
R4 – Tools (e.g., PRISM) may not scale to realistic models. Impact: incomplete verification. Mitigation: apply compositional verification: verify smaller modules independently, then compose results using assume-guarantee reasoning.
R5 – Limited participant pool for the user study. Impact: threat to external validity. Mitigation: combine the survey with semi-structured interviews of senior data-science managers (≈5) to enrich qualitative insights.

8. Timeline (30 weeks)

Weeks 1-4: Literature & standards review; finalise RQs
Weeks 5-9: Formalisation catalogue (draft)
Weeks 10-16: Prototype verification modules
Weeks 17-19: Trust-score engine design & implementation
Weeks 20-24: Case-study preparation & execution
Weeks 25-27: Evaluation (benchmarks + user study)
Weeks 28-30: Writing, polishing, release of artefacts, submission

9. Why This Thesis Is Feasible & Relevant

  • Existing Foundations: All required formal methods and verification tools are mature, open‑source, and have documented APIs. 
  • Clear Boundary: By explicitly separating formalizable from non‑formalizable trustworthiness aspects, the scope stays manageable while still delivering a valuable artefact. 
  • Regulatory Urgency: The EU AI‑Act and ISO 42001 will soon demand demonstrable compliance; a framework that provides provable evidence meets a pressing industry need. 
  • Academic Impact: The interdisciplinary blend of formal methods, ethical AI, and software-engineering practice aligns with hot topics at conferences such as ICSE, FLoC, CAV, and AAAI.

10. Deliverable Summary

D1 – Trustworthiness taxonomy & formalisation blueprint: PDF + public GitHub repo (Markdown)
D2 – Verification suite (Dockerised micro-services): open-source code (MIT)
D3 – Trust-score engine (library + CLI): Python package (PyPI)
D4 – Two complete case-study reports: PDF + reproducible notebooks
D5 – Evaluation results (benchmarks, survey data): CSV + statistical scripts
D6 – Full master's thesis: LaTeX PDF (university format)

11. Final Thought

By mapping the European AI HLEG's trustworthiness pillars onto concrete formal specifications, and by building a tool-supported evidence pipeline, this thesis will provide the first end-to-end, reproducible methodology that makes explicit which parts of a modern AI system can be mathematically proved and which parts still require human judgement. It bridges the gap between ethical guidelines and rigorous engineering: a step forward for trustworthy AI in both academia and industry.
