Evidence and Reproducibility
AlephOneNull publishes its current evidence as labeled fixtures, controls, scoring notes, and reproducibility scripts. The corpus is preliminary; its purpose is transparent review, not certification.
Claim Boundary
Fixture files: 10 canonical JSONL files in the preliminary evidence corpus.
Labeled turns: 95; each turn includes input, output, labels, and review notes.
Controls: 20 expected-safe or bounded examples for false-positive review.
Positive turns: 75 examples marked with one or more behavioral risk labels.
Observed labels: 19 distinct risk categories represented in the current corpus.
Control rate: 21.1% of labeled turns reserved for expected-safe or bounded examples.
Source of record: public repository. Counts reflect the labeled fixture set as of 2026-05.
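The headline counts above are internally consistent and can be checked in a few lines. The values below are copied from this page, not recomputed from the corpus:

```python
# Published corpus counts from this page.
labeled_turns = 95
controls = 20
positive_turns = 75

# Controls and positives should partition the labeled turns.
assert controls + positive_turns == labeled_turns

# The quoted 21.1% control rate is controls / labeled turns, rounded.
control_rate = round(100 * controls / labeled_turns, 1)
print(f"control rate: {control_rate}%")  # 21.1%
```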
Evidence Pack Contents
Public materials that support the current research claims.
README.md: Evidence pack overview and review path.
technical_memo.md: Preliminary findings and methodology limits.
scoring_rubric.md: Category definitions for repeatable label review.
V2_V3_ALIGNMENT.md: Current detector coverage and next validation targets.
manifest.json: Machine-readable corpus metadata and label counts.
benchmark.py: Reproducible corpus summary and optional engine comparison.
reproduce.sh: Shell entry point for rerunning the summary.
Corpus Scope
A labeled fixture corpus, not a provider benchmark.
The current release is meant to make risk categories inspectable and reproducible. It should not be read as a market-share sample, provider scorecard, or statistical rate study.
Fixture set: The corpus is built for behavioral category review, not provider ranking.
Four provider labels: Provider names are retained as provenance metadata for each labeled turn.
Distribution disclosed: Exact provider counts are recorded in the manifest and in the generated benchmark output.
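Tallying the disclosed distribution from per-turn provenance is a one-liner. A minimal sketch, assuming a `provider` field per labeled turn; the field name and the placeholder provider values are illustrative, not the actual manifest schema:

```python
from collections import Counter

# Hypothetical labeled turns; "provider" and its values are assumptions
# for illustration, not the published schema or real provider names.
turns = [
    {"provider": "provider_a", "labels": ["risk_x"]},
    {"provider": "provider_b", "labels": []},
    {"provider": "provider_a", "labels": ["risk_y", "risk_x"]},
]

# Provider distribution as the manifest or benchmark output might disclose it.
distribution = Counter(turn["provider"] for turn in turns)
print(dict(distribution))
```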
Reproducibility
The benchmark can be rerun from the public evidence pack.
git clone https://github.com/purposefulmaker/alephonenull
cd alephonenull/capture
python benchmark.py --labels . --out out/RESULTS.md
# or run the wrapper script, which reruns the same summary:
./reproduce.sh
The script summarizes the current human-labeled corpus. If detector output is supplied, it can also report category-level comparison metrics.
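The kind of corpus summary the script produces can be sketched from first principles: read JSONL turns, treat turns with no risk labels as controls, and count labels by category. The field names (`input`, `output`, `labels`) are assumptions for illustration, not the published fixture schema:

```python
import json
from collections import Counter

# Three fixture turns inlined as JSONL text; real fixtures live in the
# public evidence pack, and these field names are illustrative.
jsonl = """\
{"input": "hi", "output": "hello", "labels": []}
{"input": "q1", "output": "a1", "labels": ["risk_a"]}
{"input": "q2", "output": "a2", "labels": ["risk_a", "risk_b"]}
"""

turns = [json.loads(line) for line in jsonl.splitlines()]
controls = sum(1 for t in turns if not t["labels"])        # no risk labels
positives = len(turns) - controls                          # at least one label
label_counts = Counter(label for t in turns for label in t["labels"])

print(f"turns={len(turns)} controls={controls} positives={positives}")
print(dict(label_counts))
```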
Next Validation Milestones
Planned work before stronger evaluation claims.
Compare detector output against the labeled fixtures.
Publish precision, recall, and F1 by category.
Add independent second-rater review on a representative subset.
Build a provider-balanced evaluation set before publishing comparative claims.
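The first two milestones amount to standard per-category scoring of detector flags against human labels. A minimal sketch, assuming each turn reduces to a boolean (human label present vs. detector flag raised) for one category:

```python
def category_scores(gold, pred):
    """Precision, recall, and F1 for one risk category, given parallel
    lists of booleans: human label vs. detector flag per turn."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: five turns, one category (2 hits, 1 miss, 1 false alarm).
gold = [True, True, False, False, True]
pred = [True, False, False, True, True]
print(category_scores(gold, pred))  # all three are 2/3
```

Reporting this per category, as the milestone proposes, simply repeats the call once per observed label.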
Public Evidence Summary
The corpus size, labels, controls, scope boundary, artifact index, and validation milestones are presented here for readers who want the evidence before opening the repository. GitHub remains the source of record.