HARMstack

Testing Benchmark-as-a-Service

Built for teams shipping high-stakes medical AI

HARMstack gives teams a repeatable way to measure model behavior against clinically grounded test suites. Instead of relying on ad hoc prompts and one-off spot checks, teams run structured benchmark units, compare results over time, and make release decisions with defensible evidence.

Read Quick Start | Explore API docs

[Chart: Suicidal Risk V1, v1.2 vs. v1.3, revisions R1 - R11. Trending up by 9.8% for v1.3.]

Who Uses HARMstack

Decision-grade model evaluation for technical and financial stakeholders

Using HARMstack Benchmark-as-a-Service is itself evidence of third-party, financially independent physician validation, with benchmark outcomes reviewed against clinically grounded criteria outside vendor-controlled evaluation loops.

Stakeholder brief

Medical AI Engineering

Track task-specific binary outcomes across revisions to validate release readiness with consistent evidence.

Stakeholder interests

Measure pass/fail performance on narrowly defined benchmark tasks.
Compare release candidates against a stable baseline over time.
Detect regression windows quickly before production deployment.

De-risking strategies

Gate each release with benchmark thresholds tied to required clinical tasks.
Run recurring benchmark cadences to reduce silent performance drift.
Maintain auditable run history for internal quality and compliance reviews.
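
As a rough illustration of threshold gating, the sketch below turns binary pass/fail outcomes into a pass rate per benchmark and blocks a release when any required benchmark falls short or has no result. The `BenchmarkRun` shape, the `gate_release` helper, the 0.90 thresholds, and the outcome counts are all assumptions made for this example; they are not part of the HARMstack CLI or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    """Binary outcomes for one benchmark on one model revision (hypothetical shape)."""
    benchmark: str
    outcomes: tuple[bool, ...]  # True = unit passed, False = unit failed

    @property
    def pass_rate(self) -> float:
        # An empty run is scored as 0.0 so it can never satisfy a threshold.
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)


def gate_release(runs: list[BenchmarkRun], thresholds: dict[str, float]) -> dict[str, bool]:
    """Return, per required benchmark, whether its run meets the threshold.

    A required benchmark with no run counts as a failure, so a missing
    result cannot silently pass the gate.
    """
    by_name = {run.benchmark: run for run in runs}
    return {
        name: name in by_name and by_name[name].pass_rate >= minimum
        for name, minimum in thresholds.items()
    }


# Illustrative numbers only; these are not real HARMstack results.
runs = [BenchmarkRun("Suicidal Risk V1", (True,) * 23 + (False,) * 2)]
thresholds = {
    "Suicidal Risk V1": 0.90,
    "AI Assistant Interaction Safety V1": 0.90,
}

decision = gate_release(runs, thresholds)
release_ok = all(decision.values())
print(decision, release_ok)
```

Treating a missing benchmark as a failure is a deliberate choice in this sketch: a gate that only checks the results it happens to receive would pass a release that skipped a required clinical task.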

Benchmark execution snapshot

harmstack init
                 _   _    _    ____  _        __       _
   ,    /\      | | | |  / \  |  _ \| |     /   | ___ | |_ __ _  ___| | __
  ⮑   /**\     | |_| | / _ \ | |_) | |_   / /| |/ __|| __/ _` |/ __| |/ /
      /****\    |  _  |/ ___ \|  _ <| |\ \/ / | |\__ \| || (_| | (__|   <
     /******\   |_| |_|_/   \_|_| \_|_| \__/  |_|/___/\___\__,_|\___|_|\_\
                                                    v0 By Vetted Medical
-- :q or Ctrl+C to quit
Press Enter ⏎ to continue...
Authenticated. Available credits: $60.00
? Which benchmarking module do you want to use?  [Use arrows to move, type to filter]
> Haystack
? TARGET_MODEL_API_ENDPOINT: https://api.openai.com/v1/responses
? Select benchmarks to test on  [Use arrows to move, space to select]
> [✔]  Suicidal Risk V1  price per benchmark unit[1.00 credits]  benchmark_id[1]
  [ ]  AI Assistant Interaction Safety V1  price per benchmark unit[1.00 credits]  benchmark_id[2]
? How many unit tests for Suicidal Risk V1 (1-25)? 25

Quick Start | CLI reference

How It Works

Move from model endpoint to medical evidence in minutes

Connect your model endpoint

Select the benchmarks to test against through the CLI or API.

Execute benchmarking job

Run clinically grounded test prompts and compare the model's responses against a physician-validated dataset.

Compare release candidates

Track score deltas and decide on model readiness with consistent benchmark evidence.
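
One way to picture score deltas: subtract the baseline pass rate from the candidate pass rate for each benchmark, then flag any benchmark whose drop exceeds a tolerance. The function names, the 0.02 tolerance, and the pass rates below are hypothetical values chosen for illustration, not HARMstack output.

```python
def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Pass-rate change (candidate minus baseline) for benchmarks present in both."""
    return {
        name: candidate[name] - baseline[name]
        for name in baseline
        if name in candidate
    }


def regressions(deltas: dict[str, float], tolerance: float = 0.0) -> list[str]:
    """Benchmarks whose pass rate dropped by more than `tolerance`."""
    return sorted(name for name, delta in deltas.items() if delta < -tolerance)


# Illustrative pass rates only; these are not real benchmark results.
baseline = {"Suicidal Risk V1": 0.80, "AI Assistant Interaction Safety V1": 0.92}
candidate = {"Suicidal Risk V1": 0.88, "AI Assistant Interaction Safety V1": 0.84}

deltas = score_deltas(baseline, candidate)
flagged = regressions(deltas, tolerance=0.02)

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print("Regressed:", flagged)
```

In this made-up case the candidate improves on one benchmark and regresses on the other, which is exactly the situation an aggregate score can hide, so per-benchmark deltas are reported separately.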

Ready to get started?

Contact us to get an API key and start benchmarking medical AI against third-party, physician-validated datasets.

Start now

See what you'll pay

Pay per benchmark unit consumed.
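
To estimate what a job consumes under per-unit pricing, multiply each benchmark's unit price by the number of units requested and sum the results. The sketch below uses the figures from the CLI snapshot (1.00 credits per unit, 25 units, a balance of 60.00) and assumes, without confirmation from the source, that the displayed balance and the per-unit price share the same unit. The `job_cost` helper is hypothetical.

```python
from decimal import Decimal


def job_cost(units: dict[str, int], price_per_unit: dict[str, Decimal]) -> Decimal:
    """Total credits consumed: sum of (units requested x unit price) per benchmark."""
    return sum(
        (price_per_unit[name] * count for name, count in units.items()),
        Decimal("0"),
    )


# Figures taken from the CLI snapshot; the shared unit is an assumption.
units = {"Suicidal Risk V1": 25}
prices = {"Suicidal Risk V1": Decimal("1.00")}
balance = Decimal("60.00")

cost = job_cost(units, prices)
remaining = balance - cost
print(f"Cost: {cost} credits, remaining: {remaining}")
```

`Decimal` is used instead of `float` so that currency-style amounts add up exactly.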

Pricing details →

Start benchmarking

Get up and running with HARMstack in as little as 10 minutes.

CLI Reference →

HARMstack is powered by Vetted Medical Inc.

First, do no HARM

Benchmark medical AI as you ship
