Modules

Build, validate, and monitor medical AI with protected benchmark workflows

Harmstack modules are designed for teams that need defensible evidence before shipping high-stakes healthcare AI. Test and examine model behavior against physician-validated benchmarks, run repeatable evaluations, and monitor safety and clinical rigor over time.

Repeatable

Model review focus

  • Use a stable benchmark frame across release candidates.

  • Measure deltas between versions with comparable test conditions.

  • Support release gating decisions with consistent evidence.

Why it matters

  • Removes ad hoc evaluation noise from decision-making.

  • Makes quality trends more reliable to interpret.

  • Supports defensible release readiness checkpoints.
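The version-to-version comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not part of Harmstack: the `BenchmarkResult` shape, the `score_deltas` and `passes_gate` helpers, the 0.02 regression threshold, and the scores themselves are all hypothetical and chosen only to show the idea of gating on deltas measured under the same benchmark frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """Score for one benchmark under fixed test conditions (hypothetical shape)."""
    benchmark: str
    score: float  # fraction of benchmark units passed, 0.0 to 1.0


def score_deltas(baseline, candidate):
    """Return candidate-minus-baseline score for every benchmark both runs share."""
    base = {r.benchmark: r.score for r in baseline}
    return {
        r.benchmark: round(r.score - base[r.benchmark], 4)
        for r in candidate
        if r.benchmark in base
    }


def passes_gate(deltas, max_regression=0.02):
    """Gate passes only if no benchmark regressed by more than max_regression."""
    return all(delta >= -max_regression for delta in deltas.values())


# Made-up scores for two release candidates, for illustration only.
baseline = [
    BenchmarkResult("Suicidal Risk V1", 0.88),
    BenchmarkResult("AI Assistant Interaction Safety V1", 0.92),
]
candidate = [
    BenchmarkResult("Suicidal Risk V1", 0.84),
    BenchmarkResult("AI Assistant Interaction Safety V1", 0.93),
]

deltas = score_deltas(baseline, candidate)
print(deltas)
print("release gate passed:", passes_gate(deltas))  # False: one benchmark regressed
```

Gating on the worst per-benchmark delta, rather than on an average, keeps a regression in one safety area from being masked by an improvement in another.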

Haystack Module

Benchmark medical AI against defensible third-party physician annotations, protected as intellectual property

Haystack is a testing and examination technology for medical AI. It prompts your target model with medically relevant context and task-specific questions, then compares the model's outputs to benchmark answers annotated by physicians. Those benchmark answers are never returned to your model and never exposed to Benchmark-as-a-Service customers, which preserves test (exam) integrity and the independence of third-party evidence. If the answers leaked back into model or customer loops, the benchmark would become training data rather than a true test dataset.

The 1:10 needle-to-hay protection layer is a core part of the Haystack module: it helps preserve benchmark intellectual property during evaluations.
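The page does not describe how the needle-to-hay layer is implemented, so the following is only one plausible reading, sketched in Python under stated assumptions: each scored benchmark item (a "needle") is mixed with ten unscored filler prompts ("hay"), the order is shuffled, and only the operator knows which positions are needles. The `build_run` and `score_run` helpers and the `grade` callback are hypothetical names, not Harmstack APIs.

```python
import random


def build_run(needles, hay_pool, hay_per_needle=10, seed=0):
    """Mix each needle with hay_per_needle filler prompts and shuffle the order.

    Returns (prompts, needle_positions). In this sketch the positions stay with
    the benchmark operator and are never sent to the target model.
    """
    rng = random.Random(seed)
    items = [(prompt, True) for prompt in needles]
    items += [
        (rng.choice(hay_pool), False)
        for _ in range(len(needles) * hay_per_needle)
    ]
    rng.shuffle(items)
    prompts = [prompt for prompt, _ in items]
    needle_positions = {i for i, (_, is_needle) in enumerate(items) if is_needle}
    return prompts, needle_positions


def score_run(outputs, needle_positions, grade):
    """Grade only the needle outputs; hay outputs are discarded (assumption)."""
    graded = [grade(i, outputs[i]) for i in sorted(needle_positions)]
    return sum(graded) / len(graded)


# Placeholder prompts; real benchmark content is never exposed like this.
prompts, positions = build_run(["needle-1", "needle-2"], ["hay-a", "hay-b", "hay-c"])
print(len(prompts))    # 22: 2 needles plus 20 hay prompts
print(len(positions))  # 2
```

Under this reading, an observer of the traffic sees 22 prompts and cannot tell which two are scored, which is the property a protection layer of this kind would rely on.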

Benchmark execution snapshot

harmstack init

[HARMstack ASCII-art banner]

v0 By Vetted Medical

-- :q or Ctrl+C to quit

Press Enter ⏎ to continue...

Authenticated. Available credits: $60.00

? Which benchmarking module do you want to use?

> Haystack

? TARGET_MODEL_API_ENDPOINT: https://api.openai.com/v1/responses

? Select benchmarks to test on [Use arrows to move, space to select]

> [✔] Suicidal Risk V1 price per benchmark unit[1.00 credits] benchmark_id[1]

[ ] AI Assistant Interaction Safety V1 price per benchmark unit[1.00 credits] benchmark_id[2]

? How many unit tests for Suicidal Risk V1 (1-25)? 25

Token implications

Because of the needle/hay mix, your target model processes hay content in addition to the scored benchmark items, so benchmark runs consume more target-model tokens on your side.
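A rough way to budget for this is sketched below in Python. It assumes, without confirmation from the source, that one benchmark unit corresponds to one needle, that each needle is accompanied by ten hay prompts (the 1:10 ratio), and that hay prompts cost about as many tokens as needles. The `estimate_prompts` and `estimate_tokens` helpers and the 800-token average are hypothetical.

```python
def estimate_prompts(units, hay_per_needle=10):
    """Total prompts sent to the target model: each unit plus its hay prompts."""
    return units * (1 + hay_per_needle)


def estimate_tokens(units, avg_tokens_per_prompt, hay_per_needle=10):
    """Approximate target-model tokens for a run, assuming uniform prompt size."""
    return estimate_prompts(units, hay_per_needle) * avg_tokens_per_prompt


units = 25  # the maximum selectable in the session snapshot above
print(estimate_prompts(units))                           # 275
print(estimate_tokens(units, avg_tokens_per_prompt=800)) # 220000
```

Under these assumptions a 25-unit run costs roughly eleven times the tokens of the scored items alone, so it is worth checking your target-model rate limits and budget before a large run.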

Cost efficiency with Benchmark-as-a-Service

Harmstack BaaS gives teams access to physician-validated test data at a fraction of the cost of building and maintaining an equivalent internal evaluation program. Instead of reserving and managing internal holdout slices (for example, holding back ~20% of each dataset for testing), engineering teams can run independent benchmark examinations on demand without degrading internal training throughput.

Ready to get started?

Contact us to get an API key and start benchmarking medical AI against third-party, physician-validated datasets.

Start now

Start benchmarking

Get up and running with Harmstack in as little as 10 minutes.

CLI Reference →

HARMstack is powered by Vetted Medical Inc.

© 2026 Vetted Medical Inc.