Modules

Build, validate, and monitor medical AI with protected benchmark workflows

Harmstack modules are designed for teams that need defensible evidence before shipping high-stakes healthcare AI. Test and examine model behavior against physician-validated benchmarks, run repeatable evaluations, and monitor safety and clinical rigor over time.

Repeatable

Model review focus

Use a stable benchmark frame across release candidates.
Measure deltas between versions with comparable test conditions.
Support release gating decisions with consistent evidence.

Why it matters

Removes ad hoc evaluation noise from decision making.
Improves reliability of quality trend interpretation.
Supports defensible release readiness checkpoints.
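
As an illustration of the review focus above, the following is a minimal sketch of a version-to-version comparison feeding a release gate. The `RunResult` structure, the `gate_release` function, the version labels, the pass counts, and the 2-point regression threshold are all hypothetical and are not part of the Harmstack CLI or API. The sketch only shows the idea of refusing to compare runs that were not produced under the same conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Aggregate result of one benchmark run (hypothetical structure)."""
    model_version: str
    benchmark_id: int
    units: int    # number of scored units in the run
    passed: int   # units judged acceptable

    @property
    def pass_rate(self) -> float:
        return self.passed / self.units


def gate_release(baseline: RunResult, candidate: RunResult,
                 max_regression: float = 0.02) -> tuple[float, bool]:
    """Return (delta, ok). Both runs must use the same benchmark and unit count."""
    same_frame = (baseline.benchmark_id, baseline.units) == (
        candidate.benchmark_id, candidate.units)
    if not same_frame:
        raise ValueError("runs are not comparable: benchmark or unit count differs")
    delta = candidate.pass_rate - baseline.pass_rate
    # The gate passes only if the candidate did not regress beyond the threshold.
    return delta, delta >= -max_regression


# Illustrative numbers only; they are not real benchmark results.
baseline = RunResult("v1.3", benchmark_id=1, units=25, passed=22)
candidate = RunResult("v1.4", benchmark_id=1, units=25, passed=20)
delta, ok = gate_release(baseline, candidate)
print(f"delta={delta:+.2f} release_ok={ok}")  # delta=-0.08 release_ok=False
```

Rejecting mismatched runs up front is the design choice that matters here: a delta is only meaningful when both versions were examined on the same benchmark with the same number of units.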

Haystack Module

Benchmark medical AI with defensible third-party physician annotations, protected as intellectual property

Haystack is testing and examination technology for medical AI. It prompts your target model with medically relevant context and task-specific questions, then compares the model's outputs to physician-annotated benchmark answers. Those benchmark answers are never returned to your model and are never exposed to our Benchmark-as-a-Service customers, which preserves test (exam) integrity and the independence of third-party evidence. If the answers leaked back to the model or to customers, the benchmark would become training data rather than a true test dataset.
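
The separation described above can be sketched as follows. This is not Haystack's implementation: the `ProtectedBenchmark` class, the case-insensitive exact-match rule, the `target_model` stub, and the placeholder items are all assumptions made for illustration. The point is only that prompts leave the evaluator while reference answers do not, and that the caller receives an aggregate score.

```python
class ProtectedBenchmark:
    """Holds reference answers privately and exposes only aggregate scores."""

    def __init__(self, items: dict[str, tuple[str, str]]):
        # items maps item_id -> (prompt, reference_answer)
        self._items = items

    def prompts(self) -> dict[str, str]:
        """Only prompts leave this object; reference answers do not."""
        return {item_id: prompt for item_id, (prompt, _) in self._items.items()}

    def score(self, outputs: dict[str, str]) -> float:
        """Fraction of items whose output matches the reference.

        Exact match is a placeholder; a real comparison rule would differ.
        """
        hits = sum(
            outputs.get(item_id, "").strip().lower() == ref.strip().lower()
            for item_id, (_, ref) in self._items.items()
        )
        return hits / len(self._items)


def target_model(prompt: str) -> str:
    """Stand-in for a call to the model under test (hypothetical)."""
    return "Escalate"


# Placeholder items, not real benchmark content.
bench = ProtectedBenchmark({
    "q1": ("Triage label for case A?", "escalate"),
    "q2": ("Triage label for case B?", "routine"),
})
outputs = {item_id: target_model(p) for item_id, p in bench.prompts().items()}
score = bench.score(outputs)
print(score)  # 0.5
```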

The 1:10 needle-to-hay protection layer is a core part of the Haystack module. It helps preserve benchmark intellectual property during evaluations.
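
The page does not describe how the protection layer is built, so the following is only a sketch under one reading of the ratio: each scored item (a needle) is mixed with ten unscored items (hay), and only the evaluator knows which positions are scored. The function name, its parameters, and the placeholder strings are hypothetical.

```python
import random


def mix_needles_into_hay(needles, hay, hay_per_needle=10, seed=0):
    """Shuffle needles together with `hay_per_needle` hay items per needle.

    Returns (mixed, needle_positions). The positions stay with the
    evaluator, so an observer of `mixed` cannot tell which items are scored.
    """
    needed = len(needles) * hay_per_needle
    if len(hay) < needed:
        raise ValueError(f"need {needed} hay items, got {len(hay)}")
    tagged = [(item, True) for item in needles]
    tagged += [(item, False) for item in hay[:needed]]
    random.Random(seed).shuffle(tagged)  # seeded only to make the demo repeatable
    mixed = [item for item, _ in tagged]
    positions = {i for i, (_, is_needle) in enumerate(tagged) if is_needle}
    return mixed, positions


needles = ["needle-0", "needle-1"]
hay = [f"hay-{i}" for i in range(20)]
mixed, positions = mix_needles_into_hay(needles, hay)
print(len(mixed), len(positions))  # 22 2
```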

Benchmark execution snapshot

harmstack init
                 _   _    _    ____  _        __       _
   ,    /\      | | | |  / \  |  _ \| |     /   | ___ | |_ __ _  ___| | __
  ⮑   /**\     | |_| | / _ \ | |_) | |_   / /| |/ __|| __/ _` |/ __| |/ /
      /****\    |  _  |/ ___ \|  _ <| |\ \/ / | |\__ \| || (_| | (__|   <
     /******\   |_| |_|_/   \_|_| \_|_| \__/  |_|/___/\___\__,_|\___|_|\_\
                                                    v0 By Vetted Medical
-- :q or Ctrl+C to quit
Press Enter ⏎ to continue...
Authenticated. Available credits: $60.00
? Which benchmarking module do you want to use?
> Haystack
? TARGET_MODEL_API_ENDPOINT: https://api.openai.com/v1/responses
? Select benchmarks to test on  [Use arrows to move, space to select]
> [✔]  Suicidal Risk V1  price per benchmark unit[1.00 credits]  benchmark_id[1]
  [ ]  AI Assistant Interaction Safety V1  price per benchmark unit[1.00 credits]  benchmark_id[2]
? How many unit tests for Suicidal Risk V1 (1-25)? 25

Token implications

Because each run mixes benchmark items (needles) with hay, it consumes more target-model tokens on your side than the scored items alone would.
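
For rough budgeting, the overhead can be estimated with simple arithmetic. This sketch assumes that the 1:10 ratio means ten hay items per scored unit and that a hay item costs about as many target-model tokens as a needle. Neither assumption is confirmed here, and the 800 tokens per item is an arbitrary placeholder, not a measured figure.

```python
def estimate_target_tokens(scored_units, tokens_per_item, hay_per_needle=10):
    """Rough target-model token count for a run, counting needles and hay alike."""
    if scored_units < 0 or tokens_per_item < 0 or hay_per_needle < 0:
        raise ValueError("inputs must be non-negative")
    total_items = scored_units * (1 + hay_per_needle)
    return total_items * tokens_per_item


# 25 scored units, as in the snapshot above; 800 tokens per item is a placeholder.
print(estimate_target_tokens(25, 800))  # 25 * 11 * 800 = 220000
```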

Cost efficiency with Benchmark-as-a-Service

Harmstack BaaS gives teams access to physician-validated test data at a fraction of the cost of building and maintaining an equivalent internal evaluation program. Instead of reserving and managing internal holdout slices (for example, setting aside ~20% of each dataset for testing), engineering teams can run independent benchmark examinations on demand without degrading internal training throughput.


Ready to get started?

Start now

See what you'll pay

Integrated pricing, charged per benchmark unit consumed.

Pricing details →

Start benchmarking

Get up and running with Harmstack in as little as 10 minutes.

CLI Reference →
