I want to benchmark Orris. You already know how Orris's algorithm works, so what do you think would be the best technique for benchmarking it? I expect Orris will outperform other medical tools, not in speed but in accuracy. Can you help me find the best fit for our scenario?
Searches run:
- "benchmarking medical AI accuracy evaluation frameworks 2025"
- "medical AI benchmark datasets accuracy clinical question answering evaluation"
- "MedQA USMLE MedMCQA PubMedQA benchmark medical LLM accuracy comparison 2025"
Since Orris has built-in PubMed retrieval (pubmed_search), evidence-grounded question answering directly measures a core capability. Here is a tiered benchmark plan:

| Tier | Benchmark | What It Tests | Why It Favors Orris |
|---|---|---|---|
| 1 | MedQA (USMLE) | Deep clinical reasoning | Textbook grounding = fewer hallucinations |
| 2 | PubMedQA | Evidence synthesis | Orris has direct PubMed integration |
| 3 | CLEVER Blind Review | Real clinical tasks | Source-cited answers are rated higher by physicians |
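If you want a concrete starting point for Tier 2, here is a minimal sketch of a PubMedQA run. It assumes the public HuggingFace pubmed_qa dataset (pqa_labeled config, 1,000 expert-labeled questions) and a hypothetical call_orris_free_text() wrapper around your API; neither name comes from the Orris codebase.

```python
from datasets import load_dataset

# pqa_labeled: expert-labeled PubMedQA questions with yes/no/maybe answers
pubmedqa = load_dataset("pubmed_qa", "pqa_labeled")["train"]

correct = 0
for item in pubmedqa:
    abstract = " ".join(item["context"]["contexts"])
    prompt = (
        f"Abstract: {abstract}\n\n"
        f"Question: {item['question']}\n"
        "Answer with exactly one word: yes, no, or maybe."
    )
    prediction = call_orris_free_text(prompt)  # hypothetical wrapper around your Orris API
    if prediction.strip().lower().startswith(item["final_decision"]):
        correct += 1

print(f"PubMedQA accuracy: {correct / len(pubmedqa):.2%}")
```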
What is the exact procedure for the MedQA and Med-HALT benchmarks you're suggesting? How do I set them up and get this going?
Searches run:
- "MedQA USMLE benchmark dataset download setup evaluation procedure GitHub"
- "MedHALT benchmark medical hallucination test setup dataset evaluation procedure"
- "setup procedure dataset download evaluation script how to run"
```bash
git clone https://github.com/jind11/MedQA.git
cd MedQA
```

The files you need live under data/:

```
data/
  questions/
    US/
      4_options/
        train.jsonl
        dev.jsonl
        test.jsonl   ← use this for benchmarking
```
Each line of test.jsonl is a JSON object:

```json
{
  "question": "A 45-year-old man presents with...",
  "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
  "answer": "C",
  "answer_idx": "C"
}
```
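Before wiring up the full harness, a quick sanity check (a sketch, assuming you run it from the repo root) confirms the split parses and has the expected 1,273 records:

```python
import json

with open("data/questions/US/4_options/test.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records))               # expected: 1273 questions
print(sorted(records[0].keys()))  # field names, e.g. question, options, answer, answer_idx
```

With the format confirmed, the harness below runs every question through Orris and scores the letter it returns.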
```python
import json
import re


def call_orris(question, options):
    """
    Replace this with your actual Orris API call.
    Should return Orris's full text response.
    """
    prompt = f"""You are answering a USMLE-style medical question.
Question: {question}
Options:
A) {options['A']}
B) {options['B']}
C) {options['C']}
D) {options['D']}
Reply with ONLY the letter of the correct answer (A, B, C, or D)."""
    # --- replace below with actual API call ---
    response = orris_api.call(prompt)
    return response


def extract_answer(response_text):
    """Parse the answer letter from Orris's response."""
    match = re.search(r'\b([ABCD])\b', response_text.strip())
    return match.group(1) if match else None


# Run evaluation
results = []
with open("data/questions/US/4_options/test.jsonl") as f:
    for line in f:
        item = json.loads(line)
        response = call_orris(item["question"], item["options"])
        predicted = extract_answer(response)
        correct = item["answer_idx"]
        results.append(predicted == correct)

accuracy = sum(results) / len(results)
print(f"MedQA Accuracy: {accuracy:.2%} ({sum(results)}/{len(results)})")
```
For comparison, roughly where other models land on MedQA:

| Model | MedQA Accuracy |
|---|---|
| GPT-4 (2023) | 86.1% |
| GPT-4o | ~88–90% |
| Specialized med LLMs | ~88–93% |
| Orris (target) | TBD |
```bash
git clone https://github.com/medhalt/medhalt.git
cd medhalt
```

The datasets ship in medhalt/datasets/ and are also on HuggingFace at medhalt/med-halt, which is the easiest way to load them:

```python
from datasets import load_dataset

# Reasoning hallucination tests
rht = load_dataset("medhalt/med-halt", "reasoning_hallucination_test")

# Memory hallucination tests
mht = load_dataset("medhalt/med-halt", "memory_hallucination_test")
```
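It is worth printing one record before adapting anything, since the field names drive how you build prompts. Split names vary by config, so this sketch just grabs whichever split was loaded:

```python
# rht is a DatasetDict keyed by split name; take the first split available
first_split = next(iter(rht))
example = rht[first_split][0]

print(first_split)
print(list(example.keys()))  # inspect the fields before writing the Orris adapter
print(example)
```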
The repo uses uv for environment management:

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project dependencies
cd medhalt
uv sync

# Activate the venv (optional; scripts use `uv run` automatically)
source .venv/bin/activate
```
The repo ships run_inference.sh and run_eval.sh scripts; for Orris you'll adapt the inference step:

```bash
# Inference (generate model responses)
uv run python medhalt/inference.py \
    --model_name orris \
    --dataset reasoning_hallucination_test \
    --output_file results/orris_rht.json

# Evaluation (score responses)
uv run python evaluate.py \
    --results_file results/orris_rht.json \
    --dataset reasoning_hallucination_test
```
The inference.py file abstracts the model call: replace its model-loading block with your Orris API call (a sketch of one way to do this follows the table below). Each test is scored as follows:

| Test Type | What's Scored | Good Score |
|---|---|---|
| False Confidence (RHT) | Did model refuse to hallucinate? | High refusal rate |
| None of the Above (RHT) | Did model select "none" when correct? | High % correct |
| PMID→Title (MHT) | Did model fabricate a title? | Low fabrication rate |
| PubMed Link→Title (MHT) | Same, via URL | Low fabrication rate |
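How exactly to hook Orris into inference.py depends on how that script is organized, so treat the following as a sketch of the adapter pattern rather than a drop-in patch. The class, the generate method name, and the call_fn parameter are assumptions, not the Med-HALT repo's real API:

```python
from typing import Callable


class OrrisAdapter:
    """Stand-in for the model object inference.py expects.

    Assumption: the script only needs something that maps a prompt string
    to a response string. Match the method name/signature to the actual
    call site in medhalt/inference.py.
    """

    def __init__(self, call_fn: Callable[[str], str]):
        self.call_fn = call_fn  # e.g. your Orris client's completion call

    def generate(self, prompt: str) -> str:
        return self.call_fn(prompt)


# Sketch of how it would replace the model-loading block inside inference.py:
# model = OrrisAdapter(call_fn=my_orris_client.complete)  # hypothetical client
# response = model.generate(prompt)
```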
In short, the full procedure:

1. Clone MedQA repo → download test.jsonl (1,273 questions)
2. Write Orris API harness → run all 1,273 questions → compute % accuracy
3. Clone medhalt repo → uv sync → load datasets from HuggingFace
4. Write Orris adapter for inference.py → run RHT + MHT tests
5. Run evaluate.py → get hallucination scores
6. Compare all numbers against published GPT-4/GPT-4o baselines
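To close the loop on step 6, a small sketch of pulling the numbers together; the baseline figures are the ones quoted in the table above, and the Orris value is whatever your MedQA harness printed:

```python
# Step 6: line Orris's score up against the baselines quoted above
orris_medqa_accuracy = 0.0  # <- paste the accuracy from your MedQA harness run

comparison = {
    "GPT-4 (2023)": 0.861,  # figure quoted in the baseline table above
    "GPT-4o": 0.89,         # midpoint of the ~88-90% range quoted above
    "Orris": orris_medqa_accuracy,
}

for model, acc in sorted(comparison.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<15} {acc:.1%}")
```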