AI‑Enhanced Formal‑Methods‑as‑a‑Service: Designing, Implementing, and Evaluating an MCP‑Based Backend that Powers a VS Code Copilot Extension for Automated Program Analysis and Verification.
1. Motivation & Vision
Formal methods (model checking, static analysis, theorem proving) provide mathematically‑rigorous guarantees about software, but they remain isolated, heavyweight tools that are rarely used inside everyday IDEs.
Recent advances in large‑language models (LLMs) and generative AI (e.g., GitHub Copilot, ChatGPT) have shown that AI can guide developers, suggest code, and even write specifications.
The vision of this thesis is a single, AI‑augmented, cloud‑hosted service (an MCP – “Multi‑Component Platform”) that:
- Wraps a portfolio of open‑source verification tools (e.g., SPIN, CBMC, Frama‑C, Why3, Z3).
- Enriches the raw tool output with AI‑driven explanations, bug‑localisation, and repair suggestions.
- Exposes the whole capability through a lightweight VS Code extension that behaves like a Copilot “assistant” – the user types, the assistant offers formal‑method insights in real time.
If successful, the system would demonstrate that AI can make formal methods practical for everyday programming while preserving their soundness.
2. Research Questions
| # | Question |
|---|---|
| RQ1 | How can AI be combined with traditional formal‑method tools to produce human‑readable diagnostics and repair suggestions without sacrificing soundness? |
| RQ2 | What is the most extensible architecture for hosting multiple verification back‑ends (MCP) while supporting low‑latency, on‑demand queries from an IDE? |
| RQ3 | Does the AI‑augmented Copilot assistant improve developer productivity and confidence compared to using the raw tools or plain Copilot alone? |
| RQ4 | What are the security and privacy trade‑offs of running code analyses on a remote server, and how can they be mitigated? |
3. Scope & Deliverables
| Deliverable | Description |
|---|---|
| D1 – Conceptual Architecture | A detailed design (UML, data‑flow) of the MCP server, AI‑enhancement pipeline, and VS Code extension. |
| D2 – MCP Server Implementation | Containerised micro‑services for at least three verification tools (e.g., CBMC for C, SPIN for Promela, Z3 via Why3 for functional languages). |
| D3 – AI‑Assistance Layer | Fine‑tuned LLM (or prompt‑engineering on a hosted model) that parses tool output, generates natural‑language explanations, and proposes code patches. |
| D4 – VS Code Copilot‑Style Extension | An extension that intercepts the user’s cursor context, forwards a request to the MCP, and renders AI‑augmented suggestions inline (similar to Copilot). |
| D5 – Evaluation Suite | Benchmarks (real‑world open‑source projects), quantitative metrics (latency, precision/recall of bug detection, suggestion acceptance rate), and a small user study (≈10‑15 participants). |
| D6 – Thesis Document | Full academic write‑up (≈80 pages) covering background, methodology, results, discussion, and future work. |
4. Methodology
4.1. Background Study (Weeks 1‑4)
- Survey formal‑methods tools and their APIs.
- Review state‑of‑the‑art AI‑assisted programming assistants (Copilot, Tabnine, CodeBERT).
- Identify gaps where AI can add value (explanation, repair, prioritisation).
4.2. Concept & Architecture (Weeks 5‑8)
- MCP Core – REST‑/gRPC‑based gateway that receives JSON requests of the form `{"language": "c", "file": "…", "cursor": 123, "action": "verify"}` (a fuller schema is sketched after this list).
- Tool Plugins – Docker‑based micro‑services wrapping each verifier. Each plugin:
- Accepts source code and optional specifications.
- Returns machine‑readable results (e.g., JSON of counterexample traces).
- AI Layer – A prompt‑engine that sends verifier output to an LLM (OpenAI, Azure AI, or a locally hosted LLaMA). The prompt instructs the LLM to:
- Summarise the bug in natural language.
- Point to the offending line(s).
- Propose a minimal code change (patch).
- VS Code Extension – Built with TypeScript, using the VS Code Language Server Protocol (LSP). The extension:
- Sends a “verification request” when the user saves or invokes a command.
- Receives AI‑augmented suggestions and displays them as Copilot‑style inline completions or as diagnostics.
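To make the shared data model concrete, the following TypeScript sketch shows one possible shape for the messages exchanged between the extension, the MCP gateway, and the tool plugins. All interface and field names are illustrative assumptions for this proposal, not a fixed API.

```typescript
// Sketch of a possible shared JSON schema for MCP messages.
// All interface and field names here are illustrative assumptions.

/** Request sent by the VS Code extension to the MCP gateway. */
export interface VerificationRequest {
  language: "c" | "promela" | "whyml"; // decides which tool plugin handles it
  file: string;                        // source text (or a path in self-hosted mode)
  cursor: number;                      // offset of the user's cursor, for context
  action: "verify" | "explain" | "patch";
}

/** One step of a counterexample trace emitted by a verifier plugin. */
export interface TraceStep {
  line: number;
  variable: string;
  value: string;
}

/** Machine-readable result every tool plugin must return to the gateway. */
export interface ToolResult {
  tool: "cbmc" | "spin" | "why3";
  verdict: "safe" | "violation" | "unknown";
  trace: TraceStep[];
}

/** AI-augmented payload the gateway sends back to the IDE. */
export interface AiSuggestion {
  explanation: string; // natural-language summary of the bug
  patch: string;       // proposed minimal code change, e.g. a unified diff
  confidence: number;  // 0..1, reported by the AI layer
  raw: ToolResult;     // untouched verifier output, kept for soundness
}
```

Carrying the untouched `raw` verifier result through the pipeline is what later allows the extension to offer a "View raw verifier output" escape hatch (see risk R1).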
4.3. Implementation (Weeks 9‑20)
| Phase | Tasks | Expected Outcome |
|---|---|---|
| MCP server | • Scaffold FastAPI (Python) or NestJS (Node) gateway. • Containerise CBMC, SPIN, Why3+Z3. • Implement request routing & authentication. | A functional HTTP endpoint that can analyse a snippet with any enabled tool. |
| AI layer | • Choose an LLM (GPT‑4‑Turbo, LLaMA‑2‑13B). • Build prompt template, test on sample verifier outputs. • Implement caching & rate‑limiting. | A service that returns JSON {explanation, patch, confidence}. |
| VS Code extension | • Set up scaffolding with Yeoman. • Implement LSP client that talks to MCP. • Render suggestions as CompletionItems and as Diagnostics (a save‑to‑diagnostic sketch follows this table). | A usable extension that shows AI‑augmented verification hints. |
| Integration Tests | • End‑to‑end scripts that compile a project, send to MCP, retrieve suggestion, and auto‑apply patch in a sandbox. | Confidence that the pipeline works on real code bases. |
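As a rough sketch of the VS Code extension row above, the snippet below wires a save listener to the MCP gateway and surfaces the AI explanation as a diagnostic. The endpoint URL and payload/response shapes follow the schema sketched in Section 4.2 and are assumptions; it also assumes a VS Code runtime with a global `fetch` (Node 18+).

```typescript
import * as vscode from "vscode";

// Assumed gateway endpoint; a real extension would read this from settings.
const MCP_URL = "http://localhost:8080/verify";

export function activate(context: vscode.ExtensionContext) {
  const diagnostics = vscode.languages.createDiagnosticCollection("mcp-verify");
  context.subscriptions.push(diagnostics);

  // Fire a verification request whenever the user saves a C file.
  context.subscriptions.push(
    vscode.workspace.onDidSaveTextDocument(async (doc) => {
      if (doc.languageId !== "c") return;

      const response = await fetch(MCP_URL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          language: "c",
          file: doc.getText(),
          cursor: 0,
          action: "verify",
        }),
      });
      const suggestion = (await response.json()) as {
        explanation: string;
        patch: string;
        raw: { trace: { line: number }[] };
      };

      // Surface the AI explanation as a warning on the first trace line.
      const line = Math.min(
        Math.max((suggestion.raw.trace[0]?.line ?? 1) - 1, 0),
        doc.lineCount - 1
      );
      diagnostics.set(doc.uri, [
        new vscode.Diagnostic(
          doc.lineAt(line).range,
          suggestion.explanation,
          vscode.DiagnosticSeverity.Warning
        ),
      ]);
    })
  );
}
```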
4.4. Evaluation (Weeks 21‑27)
- Quantitative Benchmarks
- Dataset: 30 open‑source C/C++ projects (e.g., drawn from the NIST SAMATE reference datasets and GitHub).
- Metrics (a computation sketch follows this list):
- Detection Rate (true positives / all known bugs).
- Explanation quality (human rating on a 5‑point Likert scale).
- Patch Correctness (does the suggested fix compile & pass tests?).
- Latency (target: average round‑trip time below 2 s).
- User Study
- Recruit 10‑15 master's students / junior developers.
- Two conditions: (a) “Plain Copilot only”, (b) “Copilot + AI‑formal‑method assistant”.
- Tasks: fix injected bugs, add missing assertions.
- Capture time‑to‑completion, subjective confidence, and acceptance rate of AI suggestions.
- Security & Privacy Review
- Threat modelling (code leakage, malicious payloads).
- Implement optional client‑side encryption & policy‑based code sanitisation.
- Evaluate overhead and impact on latency.
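A minimal sketch of how the quantitative metrics listed above could be computed from per‑run benchmark records; the record shape is an assumption made for illustration.

```typescript
// Sketch of the benchmark bookkeeping: the BenchmarkRun shape is an assumption.

interface BenchmarkRun {
  knownBug: boolean;                // ground truth: does the sample contain a seeded bug?
  reported: boolean;                // did the pipeline flag it?
  patchPassesTests: boolean | null; // null when no patch was suggested
  explanationRating: number;        // 1-5 Likert score from a human rater
  roundTripMs: number;              // end-to-end latency of the request
}

function summarise(runs: BenchmarkRun[]) {
  const knownBugs = runs.filter((r) => r.knownBug);
  const detected = knownBugs.filter((r) => r.reported);
  const withPatch = runs.filter((r) => r.patchPassesTests !== null);

  return {
    // Detection rate: true positives over all known bugs.
    detectionRate: detected.length / knownBugs.length,
    // Patch correctness: share of suggested patches that compile and pass tests.
    patchCorrectness:
      withPatch.filter((r) => r.patchPassesTests).length / withPatch.length,
    // Mean Likert rating of the AI explanations.
    meanExplanationRating:
      runs.reduce((sum, r) => sum + r.explanationRating, 0) / runs.length,
    // Average round-trip latency in milliseconds (target: below 2000 ms).
    meanLatencyMs: runs.reduce((sum, r) => sum + r.roundTripMs, 0) / runs.length,
  };
}
```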
4.5. Writing & Dissemination (Weeks 28‑30)
- Draft each thesis chapter (introduction → related work → methodology → results → discussion).
- Prepare a 15‑minute conference‑style presentation.
- Publish the source code (MIT license) on GitHub and write a short technical blog post.
5. Expected Contributions
| Academic | Practical |
|---|---|
| C1 – A novel AI‑enhanced formal‑methods‑as‑a‑service architecture that cleanly separates verification kernels from AI post‑processing. | P1 – An open‑source VS Code extension that brings formal‑method feedback directly into the developer’s coding workflow, mimicking Copilot’s UI. |
| C2 – Empirical evidence (benchmark + user study) on the usefulness of AI‑generated explanations & patches for formal‑method results. | P2 – A reusable Docker‑based MCP framework that can be extended with additional analysis tools (e.g., symbolic execution, abstract interpretation). |
| C3 – Guidelines for secure, privacy‑respecting remote program analysis services. | P3 – A prompt‑engineering recipe for turning raw verifier output into actionable natural‑language suggestions. |
6. Technologies & Toolchain
| Layer | Recommended Tools |
|---|---|
| Server (MCP) | • FastAPI (Python) or NestJS (Node) for REST/gRPC gateway. • Docker Compose for micro‑service orchestration. • NGINX as reverse proxy & TLS terminator. |
| Verification Tools | • CBMC (bounded model checking for C). • SPIN (model checking for Promela). • Why3 + Z3 (SMT‑based verification for functional languages). |
| AI Layer | • OpenAI API (GPT‑4‑Turbo) or locally hosted LLaMA‑2‑13B with vLLM. • Prompt templates stored in a JSON/YAML config (a prompt‑engine sketch follows this table). |
| IDE Extension | • VS Code Extension API (TypeScript). • Language Server Protocol for diagnostics. • GitHub Copilot SDK (preview) if available, otherwise mimic inline completions. |
| Testing & CI | • GitHub Actions (build, test, Docker push). • pytest / Jest for unit tests. |
| Evaluation | • BenchExec for reproducible tool benchmarking. • Qualtrics / Google Forms for user‑study questionnaires. |
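For the AI‑layer row, the sketch below shows one way the prompt engine could call the OpenAI chat‑completions endpoint and map the reply onto the `{explanation, patch, confidence}` shape from Section 4.3. The prompt wording, model name, and JSON‑mode setting are assumptions; a locally hosted model with a compatible API could be substituted.

```typescript
// Minimal prompt-engine sketch: turn raw verifier output into the
// {explanation, patch, confidence} shape used by the gateway.
// The prompt wording and model name are assumptions.

const OPENAI_URL = "https://api.openai.com/v1/chat/completions";

async function explainVerifierOutput(
  source: string,
  verifierOutput: string
): Promise<{ explanation: string; patch: string; confidence: number }> {
  const response = await fetch(OPENAI_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4-turbo",
      messages: [
        {
          role: "system",
          content:
            'You explain formal-verification counterexamples to developers. ' +
            'Reply with a JSON object with keys "explanation" (plain-language summary), ' +
            '"patch" (minimal unified diff) and "confidence" (number between 0 and 1).',
        },
        {
          role: "user",
          content: `Source:\n${source}\n\nVerifier output:\n${verifierOutput}`,
        },
      ],
      // JSON mode keeps the reply machine-parseable.
      response_format: { type: "json_object" },
    }),
  });

  const body = await response.json();
  // The model's JSON reply is a string inside the first choice.
  return JSON.parse(body.choices[0].message.content);
}
```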
7. Risks & Mitigation Strategies
| Risk | Impact | Mitigation |
|---|---|---|
| R1 – LLM hallucinations (incorrect explanations or patches). | Low trust, false sense of safety. | Keep soundness in the back‑end: only present AI suggestions as advice; provide a “View raw verifier output” button. |
| R2 – High latency due to remote verification or LLM calls. | Poor UX, users may disable the extension. | Cache results per file version (a content‑hash cache is sketched after this table); run lightweight static analysis locally and fall back to the server only for heavy checks. |
| R3 – Code‑privacy concerns when uploading proprietary code. | Legal/ethical obstacles. | Offer a self‑hosted mode (Docker Compose) where the whole stack runs behind the user’s firewall; provide a clear data‑handling policy. |
| R4 – Integration complexity (different tool output formats). | Development delays. | Define a common JSON schema for all tool plugins; write adapters early and use automated tests for each plugin. |
| R5 – Limited evaluation participants (hard to recruit). | Threat to validity of RQ3. | Leverage university courses (e.g., a programming lab) where participation can be incentivised with extra credit. |
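As a sketch of the R2 mitigation, results can be cached per file version by keying them on a content hash, so re‑verifying an unchanged file never leaves the editor; the TTL and the shape of the cached value are assumptions.

```typescript
import { createHash } from "node:crypto";

// R2 mitigation sketch: key cached results on a hash of the file content,
// so re-verifying an unchanged file version never triggers a second round trip.
// The TTL and the shape of the cached value are assumptions.

const cache = new Map<string, { result: unknown; storedAt: number }>();
const TTL_MS = 10 * 60 * 1000; // assumed 10-minute lifetime

function cacheKey(language: string, source: string): string {
  return createHash("sha256")
    .update(language)
    .update("\0")
    .update(source)
    .digest("hex");
}

async function verifyWithCache(
  language: string,
  source: string,
  verify: (lang: string, src: string) => Promise<unknown>
): Promise<unknown> {
  const key = cacheKey(language, source);
  const hit = cache.get(key);
  if (hit && Date.now() - hit.storedAt < TTL_MS) {
    return hit.result; // unchanged file version: reuse the earlier result
  }
  const result = await verify(language, source);
  cache.set(key, { result, storedAt: Date.now() });
  return result;
}
```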
8. Proposed Timeline (30 weeks)
| Week | Milestone |
|---|---|
| 1‑4 | Literature review, define research questions, finalize tool selection. |
| 5‑8 | Architecture design (UML, data flows), write first chapter. |
| 9‑12 | Implement MCP gateway and Dockerised verification plugins. |
| 13‑16 | Build AI prompt layer; iterate on explanation/patch quality. |
| 17‑20 | Develop VS Code extension, integrate with MCP, initial end‑to‑end tests. |
| 21‑24 | Benchmark suite, collect quantitative metrics. |
| 25‑27 | Conduct user study, analyse results, write evaluation chapter. |
| 28‑30 | Security/privacy analysis, final writing, proofreading, submission. |
9. Why This Thesis Is Feasible
- Existing building blocks – All verification tools and AI APIs are publicly available and can be containerised.
- Clear scope – Limiting the MCP to three representative tools keeps the implementation manageable while still showcasing heterogeneity.
- Incremental integration – The VS Code extension can be built step‑by‑step (first static diagnostics, then AI‑augmented suggestions).
- Measurable outcomes – Both quantitative (precision/recall, latency) and qualitative (user confidence) metrics are well‑defined, allowing a robust evaluation.
- Academic relevance – The work sits at the intersection of formal methods, AI‑assisted programming, and software‑engineering tools, a hot topic for conferences such as ASE, ICSE, CAV, ESEC/FSE.
10. Take‑away Summary
- Concept: An AI‑enhanced, cloud‑hosted verification platform (MCP) that packages multiple formal‑method tools, enriches their output with LLM‑generated explanations & repairs, and delivers the result directly in VS Code via a Copilot‑style extension.
- Implementation: Dockerised micro‑services for CBMC, SPIN, Why3/Z3; a FastAPI gateway; an LLM prompt‑engine; a TypeScript VS Code extension using LSP.
- Evaluation: Benchmark on 30 open‑source projects + a controlled user study; measure detection rates, explanation quality, patch correctness, latency, and user productivity.
- Outcome: A reusable research artefact, a peer‑reviewable thesis, and evidence that AI can make formal methods approachable without compromising their rigor.
If you adopt this topic, you will produce a concrete, open‑source system that bridges the gap between theory (formal verification) and practice (AI‑driven developer assistance)—a contribution that is both academically publishable and directly useful to software engineers.
