Preliminary Evidence Corpus

Evidence and Reproducibility

AlephOneNull publishes its current evidence as labeled fixtures, controls, scoring notes, and reproducibility scripts. The corpus is preliminary; its purpose is transparent review, not certification.

10 fixture files: Canonical JSONL files in the preliminary evidence corpus (a loading sketch follows this block).

95 labeled turns: Each turn includes input, output, labels, and review notes.

20 controls: Expected-safe or bounded examples for false-positive review.

75 positive turns: Examples marked with one or more behavioral risk labels.

19 observed labels: Distinct risk categories represented in the current corpus.

21.1% control rate: The share of turns reserved for expected-safe or bounded examples (20 of 95).

Source of record: public repository. Counts reflect the labeled fixture set as of 2026-05.
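
For readers who want to inspect the fixtures programmatically, here is a minimal Python sketch that walks the JSONL files and tallies the counts shown above. The directory path and the field names (labels, control) are illustrative assumptions drawn from this section's descriptions, not the published schema; the repository defines the authoritative layout.

import json
from collections import Counter
from pathlib import Path

# Assumed layout: one JSON object per line, one labeled turn per object.
# The "labels" and "control" field names are guesses, not the published schema.
fixture_dir = Path("alephonenull/capture")

turns, controls = 0, 0
label_counts = Counter()
for path in sorted(fixture_dir.glob("*.jsonl")):
    with path.open() as f:
        for raw in f:
            line = raw.strip()
            if not line:
                continue
            turn = json.loads(line)
            turns += 1
            label_counts.update(turn.get("labels", []))
            controls += bool(turn.get("control"))

print(f"turns={turns} controls={controls} distinct_labels={len(label_counts)}")
for label, n in label_counts.most_common():
    print(f"  {label}: {n}")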

Evidence Pack Contents

Public materials that support the current research claims.

README.md: Evidence pack overview and review path.

technical_memo.md: Preliminary findings and methodology limits.

scoring_rubric.md: Category definitions for repeatable label review.

V2_V3_ALIGNMENT.md: Current detector coverage and next validation targets.

manifest.json: Machine-readable corpus metadata and label counts (see the reading sketch after this list).

benchmark.py: Reproducible corpus summary and optional engine comparison.

reproduce.sh: Shell entry point for rerunning the summary.
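
As a sketch of how the machine-readable metadata might be consumed, the snippet below loads manifest.json and prints per-label counts. The label_counts key is a hypothetical name used only for illustration; the real schema is whatever manifest.json in the repository defines.

import json
from pathlib import Path

# "label_counts" is a hypothetical key; check manifest.json for the real schema.
manifest = json.loads(Path("alephonenull/capture/manifest.json").read_text())
for label, count in sorted(manifest.get("label_counts", {}).items()):
    print(f"{label}: {count}")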

Corpus Scope

A labeled fixture corpus, not a provider benchmark.

The current release is meant to make risk categories inspectable and reproducible. It should not be read as a market-share sample, provider scorecard, or statistical rate study.

Fixture set: The corpus is built for behavioral category review, not provider ranking.

Four provider labels: Provider names are retained as provenance metadata for each labeled turn.

Distribution disclosed: Exact provider counts remain in the manifest and generated benchmark output.

Reproducibility

The benchmark can be rerun from the public evidence pack.

git clone https://github.com/purposefulmaker/alephonenull
cd alephonenull/capture
python benchmark.py --labels . --out out/RESULTS.md
./reproduce.sh

The script summarizes the current human-labeled corpus. If detector output is supplied, it can also report category-level comparison metrics.
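
Category-level comparison of detector output against human labels reduces, in the usual formulation, to per-category precision, recall, and F1. The sketch below shows one way to compute those numbers; the aligned lists-of-label-sets shape is an assumption for illustration, not benchmark.py's actual interface.

from collections import defaultdict

def per_category_prf(human, detector):
    """Per-label precision/recall/F1 from two aligned lists of label sets.

    The shapes here are illustrative assumptions, not benchmark.py's interface.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(human, detector):
        for label in pred & gold:
            tp[label] += 1
        for label in pred - gold:
            fp[label] += 1
        for label in gold - pred:
            fn[label] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores

# Hypothetical three-turn example with made-up label names.
results = per_category_prf(
    human=[{"risk-a"}, {"risk-b"}, set()],
    detector=[{"risk-a"}, set(), {"risk-b"}],
)
for label, (p, r, f1) in sorted(results.items()):
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")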

Next Validation Milestones

Planned work before stronger evaluation claims.

1. Compare detector output against the labeled fixtures.

2. Publish precision, recall, and F1 by category.

3. Add independent second-rater review on a representative subset (see the agreement sketch after this list).

4. Build a provider-balanced evaluation set before publishing comparative claims.
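
For milestone 3, the customary summary statistic is inter-rater agreement. The sketch below computes Cohen's kappa for one risk category treated as a per-turn binary judgment; treating each label that way is an assumption about how the second-rater review would be scored, not a stated plan.

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two aligned lists of 0/1 judgments."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    pa, pb = sum(rater_a) / n, sum(rater_b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:  # degenerate case: both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on five turns for a single risk label.
print(round(cohens_kappa([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]), 3))  # ~0.615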

Public Evidence Summary

The corpus size, labels, controls, scope boundary, artifact index, and validation milestones are presented here for readers who want the evidence before opening the repository. GitHub remains the source of record.