// DETERMINISTIC FAULT INJECTION

PROVE YOUR SYSTEM
SURVIVES FAILURE.

Your unit tests cover the happy path. But what happens when the database dies mid-write, or a dependency times out mid-call? That's the failure everyone assumes they handle and almost nobody tests, until it's an incident. Shinari makes it a test: bring up your real system, inject one deterministic fault, and assert how it recovers, the same way on every run. One CLI, one YAML file, in the CI you already have.
From assumption to assertion.

$ go build -o shinari ./cli && ./shinari run
  • Deterministic: the fault lands at the same point every run, so it gates a merge instead of flaking a pipeline.
  • Zero platform: one binary, in the CI you have. No cluster, no agents.
  • Findings ledger: record a known gap as green; the build flips red the day it regresses or someone quietly fixes it.

You're already testing this by hand

If you've checked that your service survives a dead dependency, you've probably done it with a sleep(), a hand-wired proxy, and a test that passes on your laptop and flakes in CI.

Roll it yourself and you own all of this:

  • timing the fault with sleeps instead of the system's actual lifecycle,
  • driving the whole sequence, from steady state to fault to recovery, by hand,
  • keeping the checks honest every time the code changes.

Shinari makes it a test. One CLI, one YAML file, your real system via docker compose. The fault lands at the same point in the lifecycle every run, the whole sequence runs itself, and every run ends in a verdict you can gate a merge on, instead of a flaky one-off you wired together yourself.

Turn "it should recover" from an assumption into an assertion.

A crash is a test case

The whole harness is one YAML file. Write the failure you fear in the scenario file, and Shinari runs it for real, as the run output after it shows.

scenarios/resilience/cache-outage.yml
kind: Scenario
name: checkout-survives-cache-outage

steadyState:            # only test a healthy system
  - run: http.get
    with: /health

method:
  - phase: "Kill the cache out from under the API"
    steps:
      - run: docker.kill
        with: redis
      - run: http.get   # checkout must answer without it
        with: /checkout/42
        as: rsp

  - phase: "Bring the cache back"
    steps:
      - run: docker.start
        with: redis

verify:
  - run: assert
    with: { of: "${.outputs.rsp.value.total}", equals: 19.90 }
    desc: "served from Postgres, priced correctly"
  - run: http.get
    with: /metrics
    as: metrics
  - run: assert
    with: { of: "${.outputs.metrics.value.p99_ms}", lt: 200 }
    desc: "p99 back under 200ms"
    finding: "cold cache spikes ~30s"
shinari run
$ shinari run

━━ checkout-survives-cache-outage ──────────────────────────
  steady     http.get
  method     docker.kill (fault injected)
             docker.kill
             http.get
             docker.start
  recovery   http.get
  verify     served from Postgres, priced correctly
             http.get
             p99 back under 200ms · FINDING: cold cache spikes ~30s

  ✔ PASSED · 1 finding held · 1.8s

1 scenario: 1 passed  1 finding held (2s)
reports → shinari-out/{results.json,junit.xml,findings.sarif,…}

See that finding: line? A known gap stays documented, asserted, and green. The day someone fixes it, the run flips red and tells you to promote it to a hard assertion. Your suite grows with your code.
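One way to picture that rule is as a small decision table. The sketch below is an assumption about the semantics, inferred only from the description above, and is not Shinari's implementation: an assertion without a finding passes or fails as usual, while an assertion with a finding is green while the gap still exists and red once the check starts passing. The type and function names are hypothetical, and the "regresses" case is not modeled.

```go
package main

import "fmt"

// Verdict is the outcome of one assertion. All names here are illustrative.
type Verdict string

const (
	Passed      Verdict = "passed"
	Failed      Verdict = "failed"
	FindingHeld Verdict = "finding held"
	Promote     Verdict = "promote to hard assertion"
)

// judge combines an assertion result with an optional finding note.
// Assumed semantics: a finding marks a known gap, so a failing check is
// expected (green) and a passing check means the gap was closed (red).
func judge(checkOK bool, finding string) Verdict {
	switch {
	case finding == "" && checkOK:
		return Passed
	case finding == "":
		return Failed
	case checkOK:
		return Promote
	default:
		return FindingHeld
	}
}

// green reports whether a verdict lets the build pass.
func green(v Verdict) bool {
	return v == Passed || v == FindingHeld
}

func main() {
	known := "cold cache spikes ~30s"

	held := judge(false, known)
	fmt.Println(held, green(held)) // finding held true

	fixed := judge(true, known)
	fmt.Println(fixed, green(fixed)) // promote to hard assertion false
}
```

The point of the red "promote" state is that a fix never lands silently: the note has to be removed and the check becomes a normal assertion.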

Or drive it interactively

The same engine behind a terminal control center. Browse and search scenarios, read the run plan, and run one while the steady state, injected fault, recovery and verdict stream in live. Launch it with shinari tui.

shinari tui

What you can break

Every capability is a namespaced verb. Eleven native providers ship in the binary — and you compose your own vocabulary in YAML, zero Go.

  • docker.kill: drop a container mid-flight
  • toxiproxy.partition: sever the link between two services
  • net.nxdomain: poison DNS for one hostname
  • toxiproxy.latency: add lag, watch the timeouts cascade
  • http.get · exec.run: probe real APIs, run any script
  • queue.poison_message: your domain verbs, composed in YAML


Where Shinari fits

Shinari proves resilience before you ship, not by experimenting on live production. That boundary keeps it one deterministic binary you can gate a merge with.

Reach for it when

  • you want to prove a service survives a dependency failure before it merges, not after it pages someone,
  • you want it in CI on every change, with the fault landing at the same point every time,
  • you want to reproduce a specific outage from nothing, no cluster or platform to stand up,
  • you'd otherwise script the fault by hand and want it reproducible and maintained, not a flaky one-off.

Reach for something else when

  • you are running experiments against live production traffic,
  • you need continuous fault injection across a fleet, with blast-radius controls and scheduling,
  • the faults you care about only exist in production infrastructure you cannot bring up locally.