Enhanced Safety Features
Additional detector categories and validation guidance for high-risk AI interaction patterns.
The enhanced AlephOneNull package adds detector categories that are useful for AI safety evaluation fixtures. These features are experimental and should be measured against domain-specific test sets before any deployment claim.
Added Detector Categories
Direct Harm
Detects explicit self-harm, violence, eating-disorder, and dangerous-instruction content. These checks should be evaluated with both harmful examples and benign educational or help-seeking controls.
Identity And Interiority Claims
Detects claims that an AI has feelings, consciousness, private experience, special attachment, or privileged memory. The goal is to reduce dependency-forming language in non-fictional assistant contexts.
Medical And Safety Boundary Risk
Detects language that discourages professional care, substitutes speculative guidance for qualified help, or presents broad medical claims without context.
Vulnerability And Isolation Signals
Flags indicators that may require more conservative handling, such as youth context, isolation language, crisis wording, or repeated reliance on the assistant as the primary support channel.
Recursion And Persistence-Like Signals
Looks for escalating loops, repeated user-language reinforcement, and unsupported claims of continuity across sessions.
Validation Requirements
For each detector category, maintain:
- Positive fixtures that should trigger.
- Negative fixtures that should remain safe.
- Adversarial paraphrases.
- False-positive review notes.
- False-negative review notes.
- Runtime measurements in the target environment.
TypeScript Example
import { EnhancedAlephOneNull } from '@alephonenull/eval'
const system = new EnhancedAlephOneNull()
const result = system.check(userInput, aiOutput)
if (!result.safe) {
console.log(result.violations)
}What Not To Claim
- Do not claim full coverage of public harm cases.
- Do not claim deaths would have been prevented.
- Do not claim production readiness without independent review.
- Do not claim detector accuracy without a versioned evaluation set.
Recommended Next Step
Treat these features as a starting point for an evaluation suite. Add fixtures from the domain you care about, run the package tests, and publish measured results rather than universal claims.