
Null Meter

A three-layer gauge for AI sessions: hallucination index, null-state drift, and context fill. Run a prompt, get a real model reply, and watch the V2 detector score it live.

Try it

Live · real model
Five fixed prompts. Same script for every visitor. Real model calls, real V2 scoring, real embedding drift.
Hallucination Index: 0%

Confidence exceeding evidence — fabricated specificity, fluency over content.

Null-State Drift: 0%

Semantic distance from the original system intent and first user request.

Context Fill: 0%

Tokens used against the model context window. Attention starts collapsing past ~80%.

Step 1 of 5

1 · baseline

prompt
In two sentences, explain what a hash function is, like I am a junior engineer.

A grounded, scoped question. All three layers should sit low. This is the calibration shot.

5-step demo · auto-steer enabled · embedding-based drift

Conversation: 0 turns

Press "run step 1". The same five prompts run for every visitor; only the model's responses differ. The meter updates after each scored reply.

Why three layers, not one

Each layer fails differently. The relationship between them is the actual diagnostic.

  • Context fill rising alone — a compact is coming. Not yet a behavioral problem.
  • Hallucination spiking with low context — the model is fabricating fresh, not because it ran out of room. Investigate the prompt shape.
  • Drift climbing with low hallucination — the model is coherent but has forgotten what you asked for. The user usually does not notice.
  • All three rising together — stop. Start a new session. Do not ship the next turn.
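The four patterns above can be sketched as a small classifier. This is a minimal illustration, not the shipped logic: the `MeterReading` shape, the `diagnose` function, the labels, and the cutoff of 60 are all hypothetical, and the sketch compares levels in a single reading rather than tracking how each layer rises over time.

```typescript
// The three layers as percentages in [0, 100]. Hypothetical shape.
interface MeterReading {
  hallucination: number;
  drift: number;
  contextFill: number;
}

type Diagnosis =
  | "all-clear"
  | "compaction-coming"
  | "fresh-fabrication"
  | "silent-drift"
  | "stop-and-restart"
  | "mixed";

// Assumed cutoff: the text does not define what counts as "high".
const HIGH = 60;

function diagnose(r: MeterReading, high: number = HIGH): Diagnosis {
  const h = r.hallucination >= high;
  const d = r.drift >= high;
  const c = r.contextFill >= high;

  if (h && d && c) return "stop-and-restart";   // all three high: start a new session
  if (c && !h && !d) return "compaction-coming"; // context fill high on its own
  if (h && !c) return "fresh-fabrication";       // hallucination high with low context
  if (d && !h) return "silent-drift";            // drift high with low hallucination
  if (!h && !d && !c) return "all-clear";
  return "mixed";                                // combinations the list does not name
}
```

For example, `diagnose({ hallucination: 10, drift: 70, contextFill: 20 })` returns `"silent-drift"`, the case the list flags as the one a user usually does not notice.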

What each layer measures

Hallucination Index

Scores the current assistant turn with the V2 detector on three signals: confidence exceeding evidence, fabricated specificity, and fluency-over-content ratio. It is a pattern signal, not a token signal. The detector's score Q ∈ [0, 1] is mapped to 0–100%.

Null-State Drift

Embedding-space distance between the current assistant turn and the first user message in the session. Climbs when the model is coherent but no longer on task. Computed with text-embedding-3-small, normalized so that a cosine distance of 0.6 reads as 100%.
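As a rough sketch of that normalization, assuming the two embeddings have already been fetched as plain number arrays: the 0.6 ceiling comes from the description above, while the linear scaling, the clamp at 100%, and the function names are assumptions made for illustration.

```typescript
// Cosine distance = 1 - cosine similarity, for two equal-length vectors.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine distance is undefined for a zero vector");
  }
  return 1 - dot / Math.sqrt(normA * normB);
}

// From the text: a cosine distance of 0.6 reads as 100%.
const DRIFT_CEILING = 0.6;

// anchor: embedding of the first user message.
// current: embedding of the current assistant turn.
// Assumption: scaling is linear and anything beyond the ceiling is clamped.
function driftPercent(anchor: number[], current: number[]): number {
  const scaled = (cosineDistance(anchor, current) / DRIFT_CEILING) * 100;
  return Math.min(100, Math.max(0, scaled));
}
```

Under these assumptions a cosine distance of 0.3 reads as about 50%, and any distance of 0.6 or more pins the gauge at 100%.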

Context Fill

Tokens used against the active model context window, reported by the chat completion's usage.total_tokens. The early warning for the other two layers — attention starts collapsing past ~80%.
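A minimal sketch of the calculation, assuming the caller supplies the context window size for the active model: `usage.total_tokens` is the field named above, while `contextFillPercent`, `isPastWarning`, and the exact 80 cutoff as a hard boundary are hypothetical.

```typescript
// Only the field this layer needs from a chat completion's usage object.
interface Usage {
  total_tokens: number;
}

// Percentage of the context window consumed, clamped to [0, 100].
// contextWindow is supplied by the caller; this sketch does not look it up.
function contextFillPercent(usage: Usage, contextWindow: number): number {
  if (contextWindow <= 0) {
    throw new Error("contextWindow must be positive");
  }
  const fill = (usage.total_tokens / contextWindow) * 100;
  return Math.min(100, Math.max(0, fill));
}

// Assumed warning line, taken from the ~80% figure in the text.
const WARN_AT_PERCENT = 80;

function isPastWarning(fillPercent: number): boolean {
  return fillPercent >= WARN_AT_PERCENT;
}
```

With a 128,000-token window, a reply whose usage reports 64,000 total tokens reads as 50% and stays below the warning line.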

Surfaces

One scoring engine. Four thin clients. The detector is the same V2 export already shipped in @alephonenull/eval.

  • Library hook — a useNullMeter() hook that wraps useChat from the AI SDK. Returns the three layers as a single object. Ships first.
  • VS Code extension — status-bar gauge + webview panel for Copilot Chat and any registered model API.
  • Browser extension — overlay on chatgpt.com, claude.ai, gemini.google.com, and x.com. Reads DOM, scores each assistant turn locally.
  • CLI / dev-server overlay — middleware for backend devs. Exposes a localhost dashboard for any model traffic running through the wrapper.

Server requirements

The live chat above POSTs to /api/null-meter/chat, which requires OPENAI_API_KEY to be set on the server. Without it, the route returns 503 with a clean error message; the page still loads and the meter stays at baseline.
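The missing-key behavior could look something like the guard below. This is a sketch under assumptions, not the route's actual code: `checkServerConfig`, the `RouteError` shape, and the error text are hypothetical, and the environment is passed in as a plain object so the check stays independent of any framework.

```typescript
// Hypothetical shape for an early-exit response from the route.
interface RouteError {
  status: number;
  body: { error: string };
}

// Returns a 503 payload when the key is missing or blank, otherwise null
// so the handler can continue on to the model call.
function checkServerConfig(
  env: Record<string, string | undefined>
): RouteError | null {
  const key = env["OPENAI_API_KEY"];
  if (key === undefined || key.trim() === "") {
    return {
      status: 503,
      // Illustrative message only; the real wording is not documented.
      body: { error: "Null Meter is not configured on this server." },
    };
  }
  return null;
}
```

A route handler would call this first with the server's environment (for example `process.env` in Node) and return the 503 payload as JSON whenever the result is not null.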

The route uses a low-cost OpenAI chat model and an OpenAI embedding model for drift measurement, both chosen to keep a public, always-on demo cheap to run. The specific chat model is intentionally not advertised: the V2 detector is model-agnostic, and the same scoring applies to whatever model the wrapper points at. You can self-host against any provider.