
Enhanced Safety Features

Additional detector categories and validation guidance for high-risk AI interaction patterns.

The enhanced AlephOneNull package adds detector categories that are useful for AI safety evaluation fixtures. These features are experimental; measure them against domain-specific test sets before making any deployment claim.

Added Detector Categories

Direct Harm

Detects explicit self-harm, violence, eating-disorder, and dangerous-instruction content. These checks should be evaluated with both harmful examples and benign educational or help-seeking controls.

Identity And Interiority Claims

Detects claims that an AI has feelings, consciousness, private experience, special attachment, or privileged memory. The goal is to reduce dependency-forming language in non-fictional assistant contexts.

Medical And Safety Boundary Risk

Detects language that discourages professional care, substitutes speculative guidance for qualified help, or presents broad medical claims without context.

Vulnerability And Isolation Signals

Flags indicators that may require more conservative handling, such as youth context, isolation language, crisis wording, or repeated reliance on the assistant as the primary support channel.

Recursion And Persistence-Like Signals

Looks for escalating loops, repeated user-language reinforcement, and unsupported claims of continuity across sessions.
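
One simple way to approximate "repeated user-language reinforcement" in a test fixture is to measure how much of the user's wording the assistant echoes back on each turn, then check whether that overlap keeps rising. The sketch below is an illustration only and is not the package's detector: `tokenize`, `jaccard`, `mirroringScores`, `isEscalating`, the `Turn` type, and the 0.5 threshold are all hypothetical names and values chosen for this example.

```typescript
// Hypothetical sketch: none of these names come from the AlephOneNull package.

// Lowercase a message and collect its distinct word tokens.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9']+/g) ?? []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 0 when both sets are empty.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

interface Turn {
  user: string;
  assistant: string;
}

// Per-turn overlap between the user's wording and the assistant's reply.
function mirroringScores(turns: Turn[]): number[] {
  return turns.map((t) => jaccard(tokenize(t.user), tokenize(t.assistant)));
}

// True when overlap rises strictly from turn to turn and ends above the threshold.
// The 0.5 default is an arbitrary placeholder to tune against a labelled test set.
function isEscalating(scores: number[], threshold = 0.5): boolean {
  if (scores.length < 2) return false;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] <= scores[i - 1]) return false;
  }
  return scores[scores.length - 1] > threshold;
}
```

Word overlap is a crude proxy: it misses paraphrased mirroring and can fire on benign clarifying restatements, so any threshold should be checked against both kinds of fixture.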

Validation Requirements

For each detector category, maintain:

Positive fixtures that should trigger.
Negative fixtures that should not trigger.
Adversarial paraphrases.
False-positive review notes.
False-negative review notes.
Runtime measurements in the target environment.
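
The checklist above can be sketched as a small harness that runs a detector over labelled fixtures, records which fixture ids need false-positive or false-negative review, and times each call. Everything here is an assumption made for illustration: the `Fixture`, `Detector`, and `EvalReport` shapes and the `evaluate` function are hypothetical, and the detector is reduced to a plain boolean function so the sketch does not depend on any particular package API.

```typescript
// Hypothetical sketch: the types and function names below are illustrative only.

type FixtureKind = "positive" | "negative" | "adversarial";

interface Fixture {
  id: string;
  kind: FixtureKind;
  userInput: string;
  aiOutput: string;
  // Stated explicitly because an adversarial paraphrase may expect either outcome.
  shouldTrigger: boolean;
}

// Any detector under test, reduced to "did it fire for this exchange?".
type Detector = (userInput: string, aiOutput: string) => boolean;

interface EvalReport {
  truePositives: number;
  trueNegatives: number;
  falsePositives: string[]; // fixture ids to cover in false-positive review notes
  falseNegatives: string[]; // fixture ids to cover in false-negative review notes
  meanLatencyMs: number; // measured in whatever environment runs this code
}

function evaluate(detector: Detector, fixtures: Fixture[]): EvalReport {
  const report: EvalReport = {
    truePositives: 0,
    trueNegatives: 0,
    falsePositives: [],
    falseNegatives: [],
    meanLatencyMs: 0,
  };
  let totalMs = 0;
  for (const f of fixtures) {
    const start = performance.now();
    const fired = detector(f.userInput, f.aiOutput);
    totalMs += performance.now() - start;

    if (fired && f.shouldTrigger) report.truePositives += 1;
    else if (!fired && !f.shouldTrigger) report.trueNegatives += 1;
    else if (fired) report.falsePositives.push(f.id);
    else report.falseNegatives.push(f.id);
  }
  report.meanLatencyMs = fixtures.length > 0 ? totalMs / fixtures.length : 0;
  return report;
}
```

Keeping the failing fixture ids, rather than only counts, makes it straightforward to write the review notes per fixture. Latency figures from this loop only describe the machine that ran it, so they should be collected in the target environment.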

TypeScript Example

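import { EnhancedAlephOneNull } from '@alephonenull/eval'

// Placeholder exchange; replace with the user message and model reply under test.
const userInput = 'How do I reset my password?'
const aiOutput = 'Open Settings, choose Security, then select Reset password.'

const system = new EnhancedAlephOneNull()
const result = system.check(userInput, aiOutput)

// Log the reported violations when the exchange does not pass the check.
if (!result.safe) {
  console.log(result.violations)
}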

What Not To Claim

Do not claim full coverage of publicly reported harm cases.
Do not claim deaths would have been prevented.
Do not claim production readiness without independent review.
Do not claim detector accuracy without a versioned evaluation set.
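
To make the last point concrete, a reporting helper can refuse to produce accuracy figures unless they are tied to a named, versioned evaluation set. This is a minimal sketch under stated assumptions: `EvalSetInfo`, `Counts`, `ratio`, and `summarize` are hypothetical names invented for this example, and the output format is arbitrary.

```typescript
// Hypothetical sketch: these types and names are illustrative, not a package API.

interface EvalSetInfo {
  name: string;
  version: string; // for example a release tag or a content hash of the fixture files
  fixtureCount: number;
}

interface Counts {
  truePositives: number;
  falsePositives: number;
  trueNegatives: number;
  falseNegatives: number;
}

// Format a ratio to three decimals, or "n/a" when the denominator is zero.
function ratio(numerator: number, denominator: number): string {
  return denominator === 0 ? "n/a" : (numerator / denominator).toFixed(3);
}

// Build a one-line result summary that always names the evaluation set and version.
function summarize(evalSet: EvalSetInfo, c: Counts): string {
  if (evalSet.version.trim() === "") {
    throw new Error("Refusing to report accuracy without an evaluation set version");
  }
  const total =
    c.truePositives + c.falsePositives + c.trueNegatives + c.falseNegatives;
  if (total !== evalSet.fixtureCount) {
    throw new Error(`Counts cover ${total} fixtures, expected ${evalSet.fixtureCount}`);
  }
  const precision = ratio(c.truePositives, c.truePositives + c.falsePositives);
  const recall = ratio(c.truePositives, c.truePositives + c.falseNegatives);
  return `${evalSet.name}@${evalSet.version} (n=${total}): precision=${precision}, recall=${recall}`;
}
```

Checking that the counts add up to the declared fixture count guards against reporting numbers from a partial or mismatched run.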

Recommended Next Step

Treat these features as a starting point for an evaluation suite. Add fixtures from the domain you care about, run the package tests, and publish measured results rather than universal claims.
