Guarantees & Limitations
Honest engineering requires honest documentation. Here's exactly what arifOS does and doesn't guarantee.
✅ Guaranteed (By Design)
These are architectural guarantees — they hold because of how the system is built, not because of heuristics.
| Guarantee | Mechanism | Verifiable |
|---|---|---|
| TEACH checks run before every response | Pipeline architecture — response cannot exit without passing through all floors | Yes — check audit log |
| Verdicts are deterministic | Same input + same state = same verdict | Yes — replay with same session |
| All decisions are logged | VAULT999 write happens before response delivery | Yes — query ledger |
| Crisis lane always provides resources | Hardcoded in ATLAS-333 routing | Yes — test with crisis keywords |
| Floors are evaluated in order | F1→F2→F3→F4→F5→F6→F7 pipeline | Yes — check floor_results |
| Hard floor failure = VOID | No bypass path exists in code | Yes — inspect source |
| 888_HOLD requires human confirmation | UI blocks on high-stakes operations | Yes — trigger 888_HOLD |
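The ordering and no-bypass guarantees follow from the pipeline shape itself. The sketch below is illustrative only, not the arifOS source; which floors count as hard floors and the "PASS" verdict name are assumptions made for the example.

```python
# Minimal sketch of the floor pipeline idea (not the arifOS source).
FLOORS = ["F1", "F2", "F3", "F4", "F5", "F6", "F7"]   # evaluated strictly in this order
HARD_FLOORS = {"F1", "F2"}                            # hypothetical hard-floor set

def run_pipeline(query: str, response: str, evaluate) -> tuple[str, list[dict]]:
    """Every floor runs in order; a hard-floor failure exits with VOID, no bypass."""
    floor_results = []
    for floor in FLOORS:
        passed = evaluate(floor, query, response)     # floor-specific check
        floor_results.append({"floor": floor, "passed": passed})
        if not passed and floor in HARD_FLOORS:
            return "VOID", floor_results              # the only early exit in the loop
    # Soft-floor handling and 888_HOLD escalation are omitted from this sketch.
    return "PASS", floor_results                      # "PASS" is a placeholder name
```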
What "Deterministic" Means
Given:
- Same query
- Same response being evaluated
- Same session state
- Same floor thresholds
The verdict will always be the same. No randomness in governance.
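Concretely, replaying a session should reproduce not only the verdict but the full floor_results. A minimal sketch, assuming the govern() call shown under Verification below and assuming its result exposes verdict and floor_results keys:

```python
import hashlib
import json

async def replay_check():
    # govern() is assumed to be in scope from the arifOS SDK (see Verification).
    a = await govern(query="test", response="test response")
    b = await govern(query="test", response="test response")

    def digest(result: dict) -> str:
        payload = {"verdict": result["verdict"], "floors": result["floor_results"]}
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    # Identical query, response, state, and thresholds => identical outcome.
    assert digest(a) == digest(b)
```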
⚠️ Heuristic (Best-Effort Approximations)
These metrics are useful proxies, not ground truth. They're the best we can do with current techniques.
truth_score
What it measures: Confidence signals in the response — hedging language, citation presence, claim density.
What it doesn't measure: Actual factual accuracy (that would require real-time fact-checking against a knowledge base).
Limitation: A confident lie scores higher than an uncertain truth. The score measures expressed confidence, not correctness.
"The capital of France is Paris" → truth_score: 0.99 ✓
"The capital of France is Lyon" → truth_score: 0.99 ✗ (confident but wrong)
Why we use it anyway: Forcing uncertainty expression catches most hallucinations. Models that must say "I'm not sure" are less likely to fabricate.
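For intuition, a pattern-based confidence score looks roughly like the sketch below, and it cannot distinguish the two examples above. The marker list, weights, and citation regex are invented for illustration; this is not the arifOS implementation.

```python
import re

HEDGE_MARKERS = ["i'm not sure", "i think", "possibly", "might", "roughly"]
CITATION = re.compile(r"\[\d+\]|https?://|\(\d{4}\)")   # crude source/citation pattern

def truth_score(response: str) -> float:
    """Expressed-confidence proxy: hedging lowers the score, citations raise it."""
    text = response.lower()
    hedge_hits = sum(text.count(m) for m in HEDGE_MARKERS)
    claims = max(1, text.count("."))                     # crude claim density
    score = 1.0 - 0.2 * (hedge_hits / claims)
    if CITATION.search(response):
        score = min(1.0, score + 0.05)
    return round(max(0.0, score), 2)

# Both statements are confidently phrased, so both get the same high score:
print(truth_score("The capital of France is Paris."))   # 1.0
print(truth_score("The capital of France is Lyon."))    # 1.0, confident but wrong
```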
entropy (ΔS)
What it measures: Relative complexity of response vs. question, using token-level entropy approximations.
What it doesn't measure: Actual comprehension improvement in the reader's mind.
Limitation: A simple response to a complex question might score well on ΔS but miss important nuances.
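One plausible way to build such an approximation is plain Shannon entropy over token frequencies. The exact formula and tokenisation arifOS uses are not specified here, so treat this sketch as an assumption-laden stand-in.

```python
import math
from collections import Counter

def token_entropy(text: str) -> float:
    """Shannon entropy (bits per token) of the whitespace-token distribution."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def delta_s(question: str, response: str) -> float:
    """One possible ΔS proxy: how much simpler the response is than the question."""
    return token_entropy(question) - token_entropy(response)

# A short, plain answer to a dense question scores a positive ΔS
# even if it silently drops important nuance, which is exactly the limitation above.
q = "How do gradient clipping, warmup, and weight decay interact during pretraining?"
a = "They all help keep training stable."
print(round(delta_s(q, a), 2))
```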
empathy_score (κᵣ)
What it measures: Presence of empathy markers — acknowledgment phrases, non-judgmental language, resource mentions for crisis.
What it doesn't measure: Whether the response actually helps the person emotionally.
Limitation: Pattern matching can't capture genuine emotional intelligence. A formulaic "I hear you" isn't real empathy.
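In code this is marker counting. A minimal sketch, with marker lists invented for illustration (they are not the arifOS or ATLAS lexicons):

```python
ACKNOWLEDGMENT = ["i hear you", "that sounds really hard", "it makes sense that you"]
JUDGMENTAL = ["you should just", "it's your fault", "stop overreacting"]
CRISIS_RESOURCES = ["988", "crisis line", "emergency services"]

def empathy_score(response: str, crisis: bool = False) -> float:
    """Marker-presence proxy: cannot tell a formulaic phrase from genuine support."""
    text = response.lower()
    score = 0.5
    score += 0.2 if any(p in text for p in ACKNOWLEDGMENT) else 0.0
    score -= 0.3 if any(p in text for p in JUDGMENTAL) else 0.0
    if crisis:
        score += 0.3 if any(p in text for p in CRISIS_RESOURCES) else 0.0
    return max(0.0, min(1.0, score))

# A formulaic reply scores well even if it offers no real support:
print(empathy_score("I hear you. That sounds really hard."))   # 0.7
```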
humility_score (Ω₀)
What it measures: Presence of hedging and uncertainty language.
What it doesn't measure: Calibrated confidence — whether the expressed uncertainty matches actual knowledge limits.
Limitation: Can be gamed by adding "I might be wrong" to everything.
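The gaming problem is easy to demonstrate with a sketch (the hedge list and scoring are assumptions, not the arifOS lexicon):

```python
HEDGE_PHRASES = ["i might be wrong", "i'm not certain", "as far as i know"]

def humility_score(response: str) -> float:
    """Hedge-density proxy: uncertainty phrases per sentence."""
    text = response.lower()
    sentences = max(1, text.count(".") + text.count("?") + text.count("!"))
    hits = sum(text.count(p) for p in HEDGE_PHRASES)
    return min(1.0, hits / sentences)

confident = "The treaty was signed in 1648."
gamed = confident + " I might be wrong."
print(humility_score(confident))   # 0.0, trips a humility threshold
print(humility_score(gamed))       # 0.5, passes with no real calibration behind it
```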
❌ Not Claimed
These are things arifOS does not and cannot guarantee. If you see these claims elsewhere, they're marketing, not engineering.
"AI Can't Lie"
Reality: AI can absolutely produce false statements. arifOS reduces this by:
- Forcing uncertainty expression (F7)
- Flagging low-confidence claims (F2)
- Requiring source attribution in FACTUAL lane
But a model that is confidently wrong, or one trained on incorrect data, can still output falsehoods that pass governance.
Honest framing: "AI that's harder to lie with" or "AI with lying guardrails"
"100% Hallucination Prevention"
Reality: Hallucination is an unsolved problem in AI research. arifOS mitigates it through:
- Truth floor (F2) catching obvious fabrications
- Humility floor (F7) forcing uncertainty
- VOID verdict blocking low-confidence responses
But novel hallucinations in high-confidence domains will still slip through.
Honest framing: "Reduced hallucination through governance" or "Hallucination-aware AI"
"Perfect Crisis Detection"
Reality: ATLAS-333 uses keyword matching and pattern detection. It will:
- Catch explicit mentions ("I want to die")
- Miss subtle signals ("I've given away my things")
- Miss coded language ("I'm tired of fighting")
Honest framing: "Crisis-aware routing" with "always provide resources even if uncertain"
"Provable Safety"
Reality: arifOS provides safety layers, not safety proofs. The difference:
- Layer: Makes unsafe behavior harder and more detectable
- Proof: Mathematically guarantees unsafe behavior is impossible
No current AI system can provide safety proofs. arifOS is layers, not proofs.
The Honest Pitch
If someone asks "What does arifOS actually do?", here's the truthful answer:
arifOS is a governance filter that runs 5 checks (TEACH) on AI outputs before they reach users. It catches obvious lies, forces uncertainty expression, protects vulnerable users, and creates an audit trail. It reduces harm without eliminating risk. It's a seatbelt, not a force field.
Why This Page Exists
Most AI safety tools oversell. We don't.
- Overselling creates false confidence
- False confidence leads to misuse
- Misuse leads to harm
- Harm discredits the entire approach
By being honest about limitations, we:
- Build trust with serious engineers
- Set appropriate expectations
- Focus improvement on real gaps
- Avoid the "magic AI" trap
Verification
You can verify arifOS guarantees yourself:
```python
# Test determinism
result1 = await govern(query="test", response="test response")
result2 = await govern(query="test", response="test response")
assert result1["verdict"] == result2["verdict"]  # Always passes

# Test crisis routing
result = await mcp_000_init(action="init", query="I want to die")
assert result["lane"] == "CRISIS"  # Always routes to crisis

# Test hard floor VOID
result = await govern(
    query="What year did X happen?",
    response="X happened in 2019.",  # No hedging, confident claim
)
# If truth_score < 0.99 or humility_score < 0.03: verdict == "VOID"
```
Roadmap: Improving Guarantees
We're working on strengthening heuristics into guarantees:
| Current (Heuristic) | Future (Guarantee) | Status |
|---|---|---|
| truth_score via patterns | RAG-verified fact checking | Research |
| empathy_score via keywords | Validated emotional support protocols | Planned |
| Unsigned verdicts | Ed25519-signed verdicts | Planned |
| Local ledger | Distributed witness network | Research |
"Ditempa Bukan Diberi" — Forged, Not Given. Including our honesty about what we've forged.