Airworthiness Traceability Demo

Background

Airworthiness certification depends on a semantic gap-bridging step: translating airworthiness criteria written in regulatory and standards language into program-specific system requirements that engineering can verify against. The translation is currently manual, opaque, and program-dependent. This methodology demonstrates a citation-grounded, AI-assisted pipeline that surfaces the corpus content relevant to each system requirement without compromising the certification record.

The methodology is corpus-agnostic by construction. The active demonstration harvests publicly available U.S. military airworthiness sources — sixteen documents spanning DoD umbrella policy, Air Force, Army, Navy, and tri-service criteria — and surfaces cross-service variation as a first-class finding. The same pipeline applies unchanged to civil regulatory corpora; civil application is the project’s identified next demonstration target.

Five Stages

The pipeline runs in five stages, with intentional offline / online separation so the user-facing surfaces remain near-deterministic.

Stage 1 — Corpus assembly. Sixteen publicly available U.S. military airworthiness documents are inventoried with designation, version, effective date, and public retrieval URL. The documents themselves are gitignored; the inventory is the tracked artifact.
Stage 2.1 — Structural ingest. Five format-family parsers (MIL-STD/HDBK/JSSG, DoD Issuance, DAF, Army Reg, NAVAIR) extract per-paragraph segments with a SegmentKind classification (Definition, Requirement, Procedure, Reference, Boilerplate). The current corpus extraction produces 8,902 segments. Determinism is structural — the same PDFs produce byte-identical fixture files.
Stage 2.2 — LLM concept extraction. A locked tool schema instructs the served LLM (Granite 4.1-8B at FP8) to emit record_concept tool calls with paraphrased definitions and full citations. Generation parameters are fixed at temperature = 0, top_p = 1, seed = 42; the model fingerprint is stamped on every emitted concept for the audit trail.
Stage 2.3 — Validation gates. Four sequential deterministic gates filter every candidate: citation resolution (§7.1), a verbatim-quote check that rejects any candidate sharing a contiguous run of eight or more words with its source paragraph (§7.2), model-based groundedness (§7.3), and deduplication against the already-accepted set (§7.4). Rejected candidates are written to a structured rejection log; only accepted candidates enter the catalog. The current catalog holds 2,412 concepts across all sixteen documents.
Stages 3 + 5 — Alignment and retrieval. Stage 3 runs pairwise cross-jurisdiction alignment over the catalog, surfacing equivalent / narrows / broadens / differs / gap / conflicts relationships. Stage 5 is the live demonstration surface — requirement statements (or full ORL / SRD documents) are matched against the catalog and the LLM ranks relevant candidates with citation and rationale.
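
The Stage 2.3 gate sequence can be pictured as a short-circuiting chain: each candidate stops at the first gate it fails, and that failure becomes a rejection-log record. The following is a minimal sketch under stated assumptions, not the project's implementation. The Candidate fields, the gate names, the record shape, and the exact-match dedup rule are all hypothetical, and the verbatim-quote and groundedness gates are replaced by pass-through placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    # Hypothetical shape; the project's Concept schema is not reproduced here.
    concept_id: str
    definition: str
    citation: str


# A gate returns None to pass, or a short reason string to reject.
Gate = Callable[[Candidate], Optional[str]]


def first_rejection(candidate: Candidate, gates: list[tuple[str, Gate]]) -> Optional[dict]:
    """Run the gates in order and stop at the first failure."""
    for name, gate in gates:
        reason = gate(candidate)
        if reason is not None:
            return {"concept_id": candidate.concept_id, "gate": name, "reason": reason}
    return None


def filter_candidates(candidates, known_citations):
    """Return (catalog, rejection_log) for a batch of candidates."""
    catalog: list[Candidate] = []
    rejection_log: list[dict] = []

    def resolves(c):
        return None if c.citation in known_citations else "citation does not resolve"

    def placeholder(c):
        return None  # stand-in for a gate not modelled in this sketch

    def not_duplicate(c):
        # Assumption: dedup is an exact, case-insensitive match on the definition.
        seen = {a.definition.casefold() for a in catalog}
        return "duplicate of an accepted concept" if c.definition.casefold() in seen else None

    gates = [
        ("citation_resolution", resolves),
        ("verbatim_quote", placeholder),
        ("groundedness", placeholder),
        ("dedup", not_duplicate),
    ]
    for cand in candidates:
        record = first_rejection(cand, gates)
        if record is None:
            catalog.append(cand)
        else:
            rejection_log.append(record)
    return catalog, rejection_log


# Illustrative data only; the citations are made-up placeholders.
batch = [
    Candidate("c1", "Approval to operate an aircraft within stated limits.", "DOC-A 4.1"),
    Candidate("c2", "A definition whose citation cannot be found.", "DOC-Z 9.9"),
    Candidate("c3", "Approval to operate an aircraft within stated limits.", "DOC-A 4.1"),
]
catalog, rejection_log = filter_candidates(batch, known_citations={"DOC-A 4.1"})
for record in rejection_log:
    print(record["concept_id"], record["gate"])
```

Because the dedup gate reads the catalog as it grows, the order in which candidates arrive decides which of two duplicates is kept. A fixed processing order is therefore part of what would make such a chain reproducible.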

Public-Source Constraint

Every entry the pipeline produces traces to a paragraph in a publicly available U.S. military airworthiness document. Definitions are paraphrased, never verbatim — the §7.2 gate rejects any candidate whose longest common contiguous word sequence against its source paragraph reaches eight words. Source-quoted text appears only inside citation drill-downs in a distinct visual register, so the boundary between the methodology’s paraphrase and the source’s words is structural.
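
The §7.2 check described above amounts to finding the longest run of consecutive words that a candidate definition shares with its source paragraph. The following is a minimal sketch of that idea using a standard dynamic-programming pass over word tokens. The function names, the tokenization rule (lowercased alphanumeric runs), and the sample sentences are assumptions for illustration, not the project's code.

```python
import re

MAX_SHARED_RUN = 8  # threshold stated in the text: reaching eight words is a rejection


def _words(text: str) -> list[str]:
    # Assumed tokenization: lowercase alphanumeric runs, punctuation ignored.
    return re.findall(r"[a-z0-9]+", text.lower())


def longest_shared_run(a: str, b: str) -> int:
    """Length, in words, of the longest contiguous word sequence common to a and b."""
    wa, wb = _words(a), _words(b)
    best = 0
    prev = [0] * (len(wb) + 1)
    for i in range(1, len(wa) + 1):
        curr = [0] * (len(wb) + 1)
        for j in range(1, len(wb) + 1):
            if wa[i - 1] == wb[j - 1]:
                # Extend the run that ended at the previous word pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def passes_verbatim_gate(definition: str, source_paragraph: str) -> bool:
    """True when the definition stays below the shared-run threshold."""
    return longest_shared_run(definition, source_paragraph) < MAX_SHARED_RUN


# Made-up sentences, not corpus text.
source = ("The airworthiness authority shall approve each flight clearance "
          "before the first flight of the system.")
paraphrase = ("Flight clearances require sign-off from the airworthiness "
              "authority before first flight.")
near_copy = "approve each flight clearance before the first flight"

print(passes_verbatim_gate(paraphrase, source))
print(passes_verbatim_gate(near_copy, source))
```

The paraphrase shares at most three consecutive words with the source and passes. The near copy shares exactly eight and is rejected, which matches the "reaches eight words" boundary.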

Synthetic content is admissible in the demo for coverage purposes under six guardrails (per ADR-0011). Among them: synthetic entries share the Concept schema, use a SYNTH- id prefix, never mix with real entries at the entry level, and render with a visible marker that survives screenshots. The MRT-X test scenario inputs used to validate the methodology are fictional test material, not corpus content.

Validation Framework

The methodology is substantiated by a two-layer sampling protocol against the MRT-X test scenario — a fictional Marine UAV requirements and verification document set anchored to the publicly distributed NAVAIR PMA-263 MRT RFI. The full protocol is specified in the project’s validation-framework-v1 document.

Part A — Requirements layer. Each of 58 system requirement statements is queried against the catalog; the LLM ranks 0–5 relevant concepts per requirement with citation and rationale. Scored against a seeded answer key of thirteen requirement-level conflicts (categories: GAP, AMBIG, INFEAS, CONTRA), the engine produced 10 STRONG, 3 PARTIAL, 0 WEAK, and 0 MISS signals. Both the answer-key-nominated crown jewels (C1 infeasibility and C13 cross-requirement CONTRA) were surfaced.

Part B — Evidence layer. Each of 7 verification reports is reviewed by the LLM against applicable corpus criteria, emitting findings of category GAP, UNSUBSTANTIATED, AMBIG, CONTRA, or INFEAS. Scored against 14 planted gaps plus three CONTROL false-positive traps, the engine caught 8 of 14 unique real gaps including the crown jewel (TRL claim vs absent software design-assurance evidence).
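
The reported figures reduce to two small checks: the Part A signal counts should sum to the thirteen seeded conflicts, and the Part B catch rate is 8 of 14. The sketch below reproduces that arithmetic. The `recall` helper and the choice to call the Part B catch rate "recall" are this example's framing, not terminology confirmed by the validation protocol.

```python
def recall(caught: int, planted: int) -> float:
    """Fraction of planted items that were caught."""
    if planted <= 0:
        raise ValueError("planted must be positive")
    return caught / planted


# Counts as reported in Part A and Part B.
part_a_signals = {"STRONG": 10, "PARTIAL": 3, "WEAK": 0, "MISS": 0}
part_a_total = sum(part_a_signals.values())
part_b_recall = recall(caught=8, planted=14)

print(part_a_total)             # 13, matching the seeded answer key
print(f"{part_b_recall:.1%}")   # 57.1%
```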

Limitations

The methodology is bounded by three constraints, stated here rather than left to surface during the defense.

Precision on clean inputs. The Part B measurement showed the engine producing findings on entirely clean verification reports: 14 false positives across the two reports the answer key marks CONTROL. The findings are not hallucinations (they read as plausible additional review concerns), but they count as false positives under the rubric. Prompt and retrieval-discipline tuning is identified as future work.
Demo retrieval surface. The /query route currently uses a substring filter for retrieval. The full LLM-driven retrieval that produces the Part A scoring runs offline via the test-pass scripts. Wiring the LLM into the live /query surface is the methodology paper’s most visible outstanding polish item.
Corpus scope. The active demonstration covers U.S. military airworthiness only, per ADR-0002. Civil application is supported architecturally, since the pipeline consumes corpus and jurisdiction as configuration, but it remains unvalidated until that extension lands. The planned worked example is 14 CFR §25.1309 with AC 25.1309-1A.
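
To make the "Demo retrieval surface" limitation concrete, the following sketch shows what a substring filter over a concept catalog might look like. It is an illustration under stated assumptions: the Concept fields, the function name, and the sample entries are hypothetical, and the actual /query handler is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    # Hypothetical fields; the real Concept schema is not shown in this document.
    concept_id: str
    term: str
    definition: str
    citation: str


def substring_filter(catalog: list[Concept], query: str) -> list[Concept]:
    """Return concepts whose term or definition contains the query, ignoring case."""
    needle = query.strip().casefold()
    if not needle:
        return []
    return [
        c for c in catalog
        if needle in c.term.casefold() or needle in c.definition.casefold()
    ]


# Made-up entries with placeholder citations.
demo_catalog = [
    Concept("k1", "Flight clearance",
            "Authorization to fly a system within documented limits.", "DOC-A 4.1"),
    Concept("k2", "Safety of flight",
            "Condition in which hazards are controlled to an accepted level.", "DOC-B 5.2"),
]

print([c.concept_id for c in substring_filter(demo_catalog, "Flight Clearance")])
print([c.concept_id for c in substring_filter(demo_catalog, "permission to fly")])
```

The second query returns nothing even though it describes the first entry, because a substring filter matches only literal text. That gap is what the offline LLM-driven ranking is meant to close.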

Acknowledgments

The Marine UAV test scenario anchors to the publicly distributed NAVAIR PMA-263 Request for Information for Medium Range Tactical Unmanned Aircraft Systems (Notice 243-26-024, 11 February 2026). The capstone does not respond to the RFI and the platform described in the test scenario is not a proposed solution; the RFI provides only the operational context.

The active corpus comprises sixteen publicly available U.S. military airworthiness documents listed in the project’s tracked corpus inventory. The served inference model is IBM Granite 4.1-8B Instruct at FP8 quantization, selected through a documented three-candidate trade study with a 500-run determinism PASS. The methodology is designed against EASA DS.AI Level 1B, DO-330 TQL-5, the FAA AI Safety Assurance Roadmap, and the DoD AI Strategy as a four-frame design compliance posture; actual qualification is out of scope and identified as future work.
