Open Medical-LLM Leaderboard

🩺 The Open Medical-LLM Leaderboard aims to track, rank, and evaluate the performance of large language models (LLMs) on medical question-answering tasks. It evaluates LLMs across a diverse array of medical datasets, including MedQA (USMLE), PubMedQA, MedMCQA, and the medicine- and biology-related subsets of MMLU. The leaderboard offers a comprehensive assessment of each model's medical knowledge and question-answering capabilities.

The datasets cover various aspects of medicine such as general medical knowledge, clinical knowledge, anatomy, genetics, and more. They contain multiple-choice and open-ended questions that require medical reasoning and understanding. More details on the datasets can be found in the "LLM Benchmarks Details" section below.
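
As an illustration, a multiple-choice item from these datasets typically pairs a clinical vignette with lettered options and a single gold answer. The sketch below uses hypothetical field names and a made-up question; consult each dataset's card for the exact schema.

```python
# Illustrative MedQA-style multiple-choice item (hypothetical field names;
# real dataset schemas differ, see each dataset card for details).
item = {
    "question": (
        "A 45-year-old man presents with crushing substernal chest pain "
        "radiating to the left arm. Which is the most likely diagnosis?"
    ),
    "options": {
        "A": "Aortic dissection",
        "B": "Myocardial infarction",
        "C": "Pulmonary embolism",
        "D": "Pericarditis",
    },
    "answer": "B",  # gold label the model must select
}
```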

The main evaluation metric used is Accuracy (ACC). Submit a model for automated evaluation on the "Submit" page. If you have comments or suggestions on additional medical datasets to include, please reach out to us in our discussion forum.
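
For reference, accuracy here is simply the fraction of questions a model answers correctly. A minimal sketch (the function name and inputs are ours, not the leaderboard's code):

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of items where the predicted choice matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Example: two of three answers correct -> ACC ~= 0.667
print(round(accuracy(["B", "C", "A"], ["B", "C", "D"]), 3))
```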

Evaluation Purpose: The primary role of this leaderboard is to assess and compare the performance of models. It does not facilitate the distribution, deployment, or clinical use of these models. The models on this leaderboard are not approved for clinical use and are intended for research purposes only. Please refer to the "Advisory Notice" section on the "About" page.

The backend of the Open Medical-LLM Leaderboard uses the EleutherAI Language Model Evaluation Harness. More technical details can be found on the "About" page.
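
For readers who want to reproduce scores locally, below is a minimal sketch using the harness's Python entry point. It assumes a recent lm-eval release; task names, the model id, and defaults vary across versions, so treat this as a starting point rather than the leaderboard's exact configuration.

```python
# Sketch: scoring a model on medical tasks with the EleutherAI
# lm-evaluation-harness (pip install lm-eval). Task names and defaults
# vary across harness versions; "your-org/your-model" is a placeholder.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=your-org/your-model",
    tasks=["pubmedqa", "medmcqa", "medqa_4options"],
    num_fewshot=0,                                # leaderboard reports zero-shot
    batch_size=8,
)
print(results["results"])                         # per-task accuracy ("acc")
```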

The GPT-4 and Med-PaLM-2 results are taken from their official papers. Since Med-PaLM-2 does not report zero-shot accuracy, we use its 5-shot accuracy for comparison. All other results presented are in the zero-shot setting. Gemini results are taken from a recent Clinical-NLP (NAACL 2024) paper.

Model Availability Requirement: To maintain the integrity of the leaderboard, only models that remain actively accessible are included. Submissions must be available either via an API or a public Hugging Face repository so that the reported results can be validated. If a model's repository is empty or its API is inaccessible, the submission will be removed from the leaderboard; the goal is to ensure that every listed model remains accessible for evaluation and comparison.
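
One lightweight way to check availability for a Hugging Face submission is to query the Hub for the repository and confirm it lists files. The sketch below uses the huggingface_hub client with a placeholder repo id; it is an illustration, not the leaderboard's actual validation pipeline.

```python
# Sketch: validating that a submitted Hugging Face repo exists and is
# non-empty (an illustration, not the leaderboard's actual pipeline).
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

def is_available(repo_id: str) -> bool:
    try:
        info = model_info(repo_id)  # raises if missing, private, or deleted
    except RepositoryNotFoundError:
        return False
    return bool(info.siblings)      # an empty repo lists no files

print(is_available("your-org/your-model"))  # placeholder repo id
```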
