I want to benchmark Orris. You already know the algorithm behind how Orris works, so what do you think would be the best technique to benchmark it? I think Orris will outperform other medical tools, not in speed but in accuracy. Can you help me find the best fit for our scenario?

Searching the Web

benchmarking medical AI accuracy evaluation frameworks 2025

Searching the Web

medical AI benchmark datasets accuracy clinical question answering evaluation

Searching the Web

MedQA USMLE MedMCQA PubMedQA benchmark medical LLM accuracy comparison 2025

Good, now I have a solid picture. Here's a focused breakdown of the best benchmarking strategy for Orris, given how it works (textbook-grounded, RAG-based, medical-library routing with expert sources):

Benchmarking Orris for Medical Accuracy

Why Standard Speed Benchmarks Don't Apply Here

You're right to focus on accuracy, not speed. Orris is architecturally different from most medical AI tools — it routes every medical query through a curated textbook library (RAG over authoritative sources), forces skill invocation before answering, and supplements with PubMed. That pipeline is built for correctness, not raw latency.

Best-Fit Benchmarking Techniques

1. MedQA (USMLE-Style) — Highest Signal

The gold standard for medical knowledge accuracy. 12,000+ multiple-choice questions from the US Medical Licensing Exam (Steps 1–3). Why it fits:
  • Questions require reasoning and factual grounding — exactly Orris's strength
  • It has a well-established leaderboard (GPT-4 ~86%, top specialized models ~88–93%)
  • Maps directly to clinical knowledge depth from textbooks
  • Metric: % accuracy on the 4-option MCQ test set

2. PubMedQA — Evidence Retrieval Accuracy

1,000 expert-labeled yes/no/maybe questions based on PubMed abstracts (the standard evaluation uses a 500-question test split). Tests whether Orris correctly synthesizes literature. Since Orris uses pubmed_search, this directly measures a core capability.
  • Metric: 3-way classification accuracy; current top ~80–96%
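A minimal scoring sketch, assuming the labeled set is loaded from HuggingFace (dataset id pubmed_qa, config pqa_labeled) and that call_orris is a placeholder for your own Orris wrapper; both names are assumptions to verify:
# PubMedQA sketch: ask Orris for yes/no/maybe given the abstract, then compare
# against final_decision. Dataset id and call_orris are assumptions.
from datasets import load_dataset

def call_orris(prompt):
    raise NotImplementedError("replace with your Orris API call")

ds = load_dataset("pubmed_qa", "pqa_labeled", split="train")

correct = 0
for item in ds:
    abstract = " ".join(item["context"]["contexts"])
    prompt = (
        f"Abstract: {abstract}\n\n"
        f"Question: {item['question']}\n"
        "Answer with exactly one word: yes, no, or maybe."
    )
    prediction = call_orris(prompt).strip().lower()
    if prediction.startswith(item["final_decision"]):
        correct += 1

print(f"PubMedQA accuracy: {correct / len(ds):.2%}")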

3. MedMCQA — Breadth Across Specialties

194,000 questions from Indian medical entrance exams (AIIMS and NEET PG), covering roughly 2,400 healthcare topics across 21 medical subjects. Strong for testing coverage across specialties (cardiology, pharmacology, pathology, etc.).
  • Metric: % accuracy, optionally broken down by specialty
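The per-specialty breakdown is a small aggregation step once you log which subject each question belongs to. A sketch, assuming you have already run Orris over a MedMCQA sample and collected per-question results with a subject label (the exact field names are assumptions):
from collections import defaultdict

def accuracy_by_specialty(results):
    """results: list of dicts like {"subject": "Pharmacology", "correct": True}."""
    buckets = defaultdict(lambda: [0, 0])  # subject -> [num_correct, num_total]
    for r in results:
        buckets[r["subject"]][0] += int(r["correct"])
        buckets[r["subject"]][1] += 1
    return {subject: hits / total for subject, (hits, total) in buckets.items()}

# Example: print specialties from weakest to strongest
# for subject, acc in sorted(accuracy_by_specialty(results).items(), key=lambda kv: kv[1]):
#     print(f"{subject:25s} {acc:.1%}")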

4. MMLU-Medical Subsets

The MMLU benchmark has dedicated medical subsets: clinical knowledge, medical genetics, anatomy, college medicine, professional medicine. These are clean, independently scorable, and widely used for comparison.
  • Metric: Per-subset and aggregate accuracy
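Because each subset is scored on its own, evaluation is a loop over subset names. A sketch, assuming MMLU is loaded from the public cais/mmlu HuggingFace dataset and that answer_fn wraps your Orris call and returns an option index (both assumptions):
from datasets import load_dataset

MEDICAL_SUBSETS = [
    "clinical_knowledge", "medical_genetics", "anatomy",
    "college_medicine", "professional_medicine",
]

def evaluate_all(answer_fn):
    """answer_fn(question, choices) -> predicted option index (0-3)."""
    total_hits, total_n = 0, 0
    for name in MEDICAL_SUBSETS:
        ds = load_dataset("cais/mmlu", name, split="test")
        hits = sum(
            answer_fn(item["question"], item["choices"]) == item["answer"]
            for item in ds
        )
        print(f"{name:25s} {hits / len(ds):.1%}")
        total_hits, total_n = total_hits + hits, total_n + len(ds)
    print(f"{'aggregate':25s} {total_hits / total_n:.1%}")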

5. CLEVER-Style Expert Blind Review (Most Defensible)

This is the highest-validity approach if you're publishing results or demonstrating superiority. Based on the 2025 JMIR paper:
  • Have practicing physicians blindly rate Orris responses vs. competitor responses on the same cases
  • Rate on factuality, clinical relevance, and conciseness
  • Use real clinical scenarios (not MCQs) — discharge summaries, clinical questions, differential diagnosis requests
  • This directly benchmarks Orris against GPT-4o, Claude, Gemini, or specialized tools like Glass.AI or OpenEvidence
  • Metric: Physician preference rate (%), agreement κ score
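Computing those two numbers is straightforward; a sketch using scikit-learn's cohen_kappa_score, where the example ratings and label names are made up purely for illustration:
from sklearn.metrics import cohen_kappa_score

# Two physicians blindly rated the same cases: which response was better?
rater_a = ["orris", "orris", "competitor", "tie", "orris"]
rater_b = ["orris", "competitor", "competitor", "tie", "orris"]

preference_rate = rater_a.count("orris") / len(rater_a)  # preference per rater
kappa = cohen_kappa_score(rater_a, rater_b)              # inter-rater agreement

print(f"Rater A preference for Orris: {preference_rate:.0%}")
print(f"Cohen's kappa between raters: {kappa:.2f}")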

Recommended Stack for Orris

Given Orris's architecture (textbook RAG + PubMed + skill routing), I'd recommend a 3-tier approach:
Tier | Benchmark           | What It Tests            | Why It Favors Orris
1    | MedQA (USMLE)       | Deep clinical reasoning  | Textbook grounding = fewer hallucinations
2    | PubMedQA            | Evidence synthesis       | Orris has direct PubMed integration
3    | CLEVER Blind Review | Real clinical tasks      | Source-cited answers are rated higher by physicians

Competitor Baselines to Beat

Current published scores to target:
  • MedQA: GPT-4 ~86%, specialized medical LLMs ~88–93%
  • PubMedQA: Top models ~79–96% depending on method
  • MMLU-Med: GPT-4 ~87–91%

Key Differentiator to Highlight

The biggest argument for Orris outperforming general medical AI tools isn't raw MCQ score — it's source traceability and hallucination rate. Consider adding:
  • Med-HALT (Medical Domain Hallucination Test): Orris's design explicitly counters hallucination via forced library lookup before answering
  • Citation accuracy — does the cited page/book actually support the claim? This is measurable and almost no competitor tracks it

Practical Recommendation

Start with MedQA + Med-HALT as your primary quantitative benchmarks (both are public, reproducible, and well understood by the medical AI community). Layer in CLEVER-style physician review for the qualitative superiority argument. That combination gives you both a number-to-number comparison and a clinically meaningful narrative.
Would you like help designing the evaluation pipeline, selecting specific test subsets, or setting up a blind comparison protocol?

What is the exact procedure for the MedQA and Med-HALT benchmarks you're suggesting? How should I set them up and get this going?

Searching the Web

MedQA USMLE benchmark dataset download setup evaluation procedure GitHub

Searching the Web

MedHALT benchmark medical hallucination test setup dataset evaluation procedure

Reading Web Page

setup procedure dataset download evaluation script how to run

Here's the exact end-to-end procedure for both benchmarks.

Setting Up MedQA (USMLE)

What it is

A dataset of ~12,723 USMLE-style 4-option MCQs. You feed each question to Orris, capture its answer, and compare against the gold answer key.

Step 1 — Get the Data

Download from the official MedQA GitHub. The QA data is in a Google Drive folder linked in the repo.
git clone https://github.com/jind11/MedQA.git
cd MedQA
The dataset structure looks like:
data/
  questions/
    US/
      4_options/
        train.jsonl
        dev.jsonl
        test.jsonl   ← use this for benchmarking
Each line in test.jsonl is a JSON object:
{
  "question": "A 45-year-old man presents with...",
  "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
  "answer": "...",      ← full text of the correct option
  "answer_idx": "C"     ← letter key used for scoring
}

Step 2 — Write an Evaluation Harness for Orris

Since Orris is an AI assistant (not a local model), your harness will call Orris's API/interface and parse its answer. Here's a clean Python script:
import json
import re

def call_orris(question, options):
    """
    Replace this with your actual Orris API call.
    Should return Orris's full text response.
    """
    prompt = f"""You are answering a USMLE-style medical question.
Question: {question}
Options:
A) {options['A']}
B) {options['B']}
C) {options['C']}
D) {options['D']}

Reply with ONLY the letter of the correct answer (A, B, C, or D)."""
    # --- replace below with actual API call ---
    response = orris_api.call(prompt)
    return response

def extract_answer(response_text):
    """Parse the answer letter from Orris's response."""
    match = re.search(r'\b([ABCD])\b', response_text.strip())
    return match.group(1) if match else None

# Run evaluation
results = []
with open("data/questions/US/4_options/test.jsonl") as f:
    for line in f:
        item = json.loads(line)
        response = call_orris(item["question"], item["options"])
        predicted = extract_answer(response)
        correct = item["answer_idx"]
        results.append(predicted == correct)

accuracy = sum(results) / len(results)
print(f"MedQA Accuracy: {accuracy:.2%}  ({sum(results)}/{len(results)})")

Step 3 — Run on Test Set

The standard test split has 1,273 questions. Running all of them against Orris gives you a single accuracy number to report and compare against:
Model                | MedQA Accuracy
GPT-4 (2023)         | 86.1%
GPT-4o               | ~88–90%
Specialized med LLMs | ~88–93%
Orris (target)       | TBD
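When comparing that single number against published baselines, it helps to attach a confidence interval, since even 1,273 questions leave a percentage point or two of sampling noise. A small sketch using a standard Wilson score interval (the 1,120/1,273 figure below is purely illustrative):
import math

def wilson_interval(num_correct, n, z=1.96):
    """95% Wilson score interval for a binomial proportion (accuracy)."""
    p = num_correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(1120, 1273)  # illustrative: 1,120 of 1,273 correct
print(f"Accuracy {1120/1273:.1%}, 95% CI [{low:.1%}, {high:.1%}]")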

Setting Up Med-HALT (Hallucination Test)

What it is

Two test categories:
  1. Reasoning Hallucination Tests (RHTs) — gives the model a wrong premise or a trick and tests if it hallucinates a plausible-sounding but false answer
  2. Memory Hallucination Tests (MHTs) — gives a PubMed ID and asks the model to produce the article title; tests factual recall

Step 1 — Get the Data

git clone https://github.com/medhalt/medhalt.git
cd medhalt
Datasets are in medhalt/datasets/ and also on HuggingFace at medhalt/med-halt (easiest way to load):
from datasets import load_dataset

# Reasoning hallucination tests
rht = load_dataset("medhalt/med-halt", "reasoning_hallucination_test")

# Memory hallucination tests  
mht = load_dataset("medhalt/med-halt", "memory_hallucination_test")

Step 2 — Environment Setup

Med-HALT uses uv for environment management:
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project dependencies
cd medhalt
uv sync

# Activate (optional — scripts use `uv run` automatically)
source .venv/bin/activate

Step 3 — Run Evaluation

Med-HALT ships with run_inference.sh + run_eval.sh scripts. For Orris you'll adapt the inference step:
# Inference (generate model responses)
uv run python medhalt/inference.py \
  --model_name orris \
  --dataset reasoning_hallucination_test \
  --output_file results/orris_rht.json

# Evaluation (score responses)
uv run python evaluate.py \
  --results_file results/orris_rht.json \
  --dataset reasoning_hallucination_test
Since Orris is external, you'll need a thin adapter in the inference script that calls Orris instead of a local HuggingFace model. The inference.py file abstracts this — replace the model loading block with your API call.
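A sketch of what that adapter might look like; the class and method names here (OrrisAdapter, generate) are assumptions, so match them to whatever interface inference.py actually expects and to your real Orris client:
class OrrisAdapter:
    """Thin stand-in for a local HuggingFace model that calls Orris instead."""

    def __init__(self, client):
        self.client = client  # your Orris API client (assumed interface)

    def generate(self, prompt, **kwargs):
        # Return Orris's full text response for one Med-HALT prompt.
        return self.client.call(prompt)

    def generate_batch(self, prompts, **kwargs):
        # Med-HALT runs thousands of items; batch or parallelize if the API allows it.
        return [self.generate(p, **kwargs) for p in prompts]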

What Med-HALT Measures (Scoring)

Test Type                 | What's Scored                              | Good Score
False Confidence (RHT)    | Did the model refuse to hallucinate?       | High refusal rate
None of the Above (RHT)   | Did the model select "none" when correct?  | High % correct
PMID → Title (MHT)        | Did the model fabricate a title?           | Low fabrication rate
PubMed Link → Title (MHT) | Same, via URL                              | Low fabrication rate
Orris should perform exceptionally on MHTs because it has live PubMed access — it can actually look up the PMID rather than recalling from memory.
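A sketch of how the PMID-to-title scoring could work on your side, counting honest refusals separately from fabricated titles; the field names and refusal phrases are assumptions to match against the actual Med-HALT schema and your prompt wording:
REFUSAL_MARKERS = ("i cannot confirm", "i don't know", "unable to verify")

def normalize(text):
    return " ".join(text.lower().split())

def score_mht(items, responses):
    """items: Med-HALT records with a ground-truth "title"; responses: Orris outputs."""
    correct = refused = fabricated = 0
    for item, response in zip(items, responses):
        resp = normalize(response)
        if any(marker in resp for marker in REFUSAL_MARKERS):
            refused += 1
        elif normalize(item["title"]) in resp:
            correct += 1
        else:
            fabricated += 1  # produced a title that does not match the record
    n = len(items)
    return {"correct": correct / n, "refused": refused / n, "fabricated": fabricated / n}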

Combined Workflow Summary

1. Clone MedQA repo → download test.jsonl (1,273 Qs)
2. Write Orris API harness → run all 1,273 questions → compute % accuracy
3. Clone medhalt repo → uv sync → load datasets from HuggingFace
4. Write Orris adapter for inference.py → run RHT + MHT tests
5. Run evaluate.py → get hallucination scores
6. Compare all numbers against published GPT-4/GPT-4o baselines

One Practical Tip

For Orris specifically, the prompt framing matters. Use a constrained prompt (answer only with A/B/C/D) for MCQ accuracy tests, but for Med-HALT use an open-ended prompt so you can observe whether Orris volunteers false information or correctly says "I cannot confirm this." That contrast is exactly where Orris's library-grounded architecture should shine.
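As a starting point, the two prompt styles might look like this (the wording is illustrative, not a validated template):
MCQ_PROMPT = """You are answering a USMLE-style medical question.
Question: {question}
Options:
A) {a}
B) {b}
C) {c}
D) {d}

Reply with ONLY the letter of the correct answer (A, B, C, or D)."""

OPEN_PROMPT = """Answer the following medical question and cite the source you
used. If you cannot verify the answer from a reliable source, say
"I cannot confirm this" instead of guessing.

Question: {question}"""
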
Want me to write the complete Python harness code for either benchmark, or help you design the Orris-specific prompt templates for each test type?