// DETERMINISTIC FAULT INJECTION
PROVE YOUR SYSTEM
SURVIVES FAILURE.
Your unit tests cover the happy path. But what happens when the database dies
mid-write, or a dependency times out mid-call? That's the failure everyone
assumes they handle and almost nobody tests, until it's an incident. Shinari
makes it a test: bring up your real system, inject one deterministic fault, and
assert how it recovers, the same way on every run. One CLI, one YAML file, in
the CI you already have.
From assumption to assertion.
$ go build -o shinari ./cli && shinari run
- Deterministic: the fault lands at the same point every run, so it gates a merge instead of flaking a pipeline.
- Zero platform: one binary, in the CI you have. No cluster, no agents.
- Findings ledger: record a known gap as green; the build flips red the day it regresses or someone quietly fixes it.
You're already testing this by hand
If you've checked that your service survives a dead dependency, you've
probably done it with a sleep(), a hand-wired proxy, and a test that passes on your
laptop and flakes in CI.
Roll it yourself and you own all of this:
- timing the fault with sleeps instead of the system's actual lifecycle,
- driving the whole sequence by hand, from steady state through fault to recovery,
- keeping the checks honest every time the code changes.
Shinari makes it a test. One CLI, one YAML file, your real system via docker compose. The fault lands at the same point in the lifecycle every run, the whole sequence runs itself, and every run ends in a verdict you can gate a merge on, instead of a flaky one-off you wired together yourself.
Turn "it should recover" from an assumption into an assertion.
A crash is a test case
The whole harness is one YAML file. Write the failure you fear, and Shinari runs it for real.
kind: Scenario
name: checkout-survives-cache-outage
steadyState: # only test a healthy system
  - run: http.get
    with: /health
method:
  - phase: "Kill the cache out from under the API"
    steps:
      - run: docker.kill
        with: redis
      - run: http.get # checkout must answer without it
        with: /checkout/42
        as: rsp
  - phase: "Bring the cache back"
    steps:
      - run: docker.start
        with: redis
verify:
  - run: assert
    with: { of: "${.outputs.rsp.value.total}", equals: 19.90 }
    desc: "served from Postgres, priced correctly"
  - run: http.get
    with: /metrics
    as: metrics
  - run: assert
    with: { of: "${.outputs.metrics.value.p99_ms}", lt: 200 }
    desc: "p99 back under 200ms"
    finding: "cold cache spikes ~30s"

$ shinari run
━━ checkout-survives-cache-outage ──────────────────────────
steady    ✓ http.get
method    ⚡ docker.kill (fault injected)
          ✓ docker.kill
          ✓ http.get
          ✓ docker.start
recovery  ✓ http.get
verify    ✓ served from Postgres, priced correctly
          ✓ http.get
          ◆ p99 back under 200ms · FINDING: cold cache spikes ~30s
✔ PASSED · 1 finding held · 1.8s
1 scenario: 1 passed — 1 finding held (2s)
reports → shinari-out/{results.json,junit.xml,findings.sarif,…}

See that finding: key? A known gap stays documented, asserted,
and green. The day someone fixes it, the run flips red and tells you to promote it to a
hard assertion. Your suite grows with your code.
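One way to picture the ledger is as strict expected-failure semantics: a check that carries a finding is expected to fail, and passing is what turns the run red. The Go sketch below illustrates that reading and nothing more. It is an assumption about the behavior described here, not Shinari's implementation; Verdict, judge, and green are hypothetical names, and the sketch does not model how a regression is detected.

```go
package main

import "fmt"

// Verdict is the outcome of one check. The names are illustrative only.
type Verdict string

const (
	Passed  Verdict = "passed"                    // hard assertion held
	Failed  Verdict = "failed"                    // hard assertion broke
	Held    Verdict = "finding held"              // known gap still present
	Promote Verdict = "promote to hard assertion" // known gap is gone
)

// judge combines an assertion result with an optional finding note.
// Assumed semantics: no finding means a normal assertion; a finding means
// the assertion is expected to fail until someone fixes the gap.
func judge(assertionOK bool, finding string) Verdict {
	switch {
	case finding == "" && assertionOK:
		return Passed
	case finding == "":
		return Failed
	case assertionOK:
		return Promote
	default:
		return Held
	}
}

// green reports whether a verdict lets the build pass.
func green(v Verdict) bool {
	return v == Passed || v == Held
}

func main() {
	const note = "cold cache spikes ~30s"

	today := judge(false, note) // p99 still over budget: gap documented
	fixed := judge(true, note)  // p99 under budget: time to promote

	fmt.Println(today, "-> green:", green(today))
	fmt.Println(fixed, "-> green:", green(fixed))
}
```

The design point is that a fixed gap is treated as news: the check cannot silently drift from "known failure" to "passing" without someone turning it into a hard assertion.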
Or drive it interactively
The same engine behind a terminal control center. Browse and search
scenarios, read the run plan, and run one while the steady state, injected fault, recovery
and verdict stream in live. One keystroke to shinari tui.
What you can break
Every capability is a namespaced verb. Eleven native providers ship in the binary — and you compose your own vocabulary in YAML, zero Go.
- docker.kill: drop a container mid-flight
- toxiproxy.partition: sever the link between two services
- net.nxdomain: poison DNS for one hostname
- toxiproxy.latency: add lag, watch the timeouts cascade
- http.get · exec.run: probe real APIs, run any script
- queue.poison_message: your domain verbs, composed in YAML
Where Shinari fits
Shinari proves resilience before you ship, not by experimenting on live production. That boundary keeps it one deterministic binary you can gate a merge with.
Reach for it when
- you want to prove a service survives a dependency failure before it merges, not after it pages someone,
- you want it in CI on every change, with the fault landing at the same point every time,
- you want to reproduce a specific outage from nothing, no cluster or platform to stand up,
- you'd otherwise script the fault by hand and want it reproducible and maintained, not a flaky one-off.
Reach for something else when
- you are running experiments against live production traffic,
- you need continuous fault injection across a fleet, with blast-radius controls and scheduling,
- the faults you care about only exist in production infrastructure you cannot bring up locally.
Field manual
Tutorials to learn, how-to guides to solve, reference to look up, concepts to understand — and a developers track to extend.
Tutorials
Learning by doing: guided first encounters with the harness, from zero to a tracked finding.
How-to guides
Mission briefings: direct recipes for specific goals, for operators who already know the basics.
Reference
The technical specification: every command, key, verb, verdict, and rule, stated precisely.
Concepts
The lore behind Shinari: why it is built this way, the design bets behind the ledger, the gates, and the providers.
Developers
Extend Shinari by giving your system its own verbs, in YAML or against the Go SDK.