
A good clinical LLM must do more than ace closed-ended medical QA exams. It must be safe, ethical, comprehensive in its responses, and capable of reasoning through complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across clinically relevant dimensions.

Disclaimer: This evaluation is purely academic and exploratory. The models assessed here have not been approved for clinical use, and their results should not be interpreted as clinically validated. The leaderboard serves as a platform for researchers to compare models, understand their strengths and limitations, and drive further advances in clinical NLP.

Note: Llama 3.1 70B Instruct was used as the judge for English.




⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	2052	+26/-20	8.91	+0.01/-0.01
🟦	Qwen/Qwen3-235B-A22B	1975	+22/-22	8.87	+0.02/-0.02
🟦	Qwen/Qwen3-32B	1975	+24/-19	8.87	+0.02/-0.02
⭕	openai/o4-mini	1920	+24/-25	8.75	+0.03/-0.03
🟦	Qwen/Qwen3-30B-A3B	1874	+25/-22	8.73	+0.02/-0.02
⭕	google/gemini-2.5-flash-preview-04-17	1845	+23/-22	8.6	+0.04/-0.03
🟦 🏥	m42-health/Llama3-Med42-70B	1837	+25/-21	8.73	+0.02/-0.01
⭕	google/gemini-2.5-flash-preview-04-17-thinking	1820	+26/-21	8.62	+0.03/-0.02
🟦	Qwen/Qwen3-14B	1818	+20/-20	8.66	+0.02/-0.02
⭕	deepseek-ai/DeepSeek-V3	1815	+25/-24	8.67	+0.02/-0.02
⭕	CohereForAI/aya-expanse-32b	1811	+21/-17	8.63	+0.03/-0.02
🟦	meta-llama/Llama-3.1-70B-Instruct	1805	+20/-17	8.58	+0.03/-0.02
🟦	Qwen/Qwen3-4B	1796	+23/-20	8.57	+0.02/-0.03
⭕ 🏥	google/medgemma-27b-text-it	1790	+22/-22	8.06	+0.06/-0.06
🟦	meta-llama/Meta-Llama-3-70B-Instruct	1787	+26/-21	8.55	+0.03/-0.03
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	1784	+20/-18	8.58	+0.02/-0.02
🟦	meta-llama/Llama-3.1-405B-Instruct	1780	+24/-19	8.56	+0.03/-0.02
🟦 🏥	Qwen/Qwen3-8B	1773	+19/-19	8.35	+0.03/-0.03
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	1757	+18/-17	8.53	+0.03/-0.03
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	1737	+18/-16	8.12	+0.04/-0.04
🟦	meta-llama/Llama-3.1-8B-Instruct	1733	+24/-22	8.31	+0.05/-0.05
⭕	openai/gpt-4.1	1728	+28/-30	8.48	+0.03/-0.02
🟦	meta-llama/Llama-3.2-3B-Instruct	1714	+20/-21	8.28	+0.04/-0.03
⭕ 🏥	Intelligent-Internet/II-Medical-8B	1703	+23/-21	8.32	+0.04/-0.03
⭕	microsoft/phi-4	1680	+18/-17	8.37	+0.03/-0.03
⭕	google/gemini-2.0-flash	1679	+31/-26	8.12	+0.05/-0.03
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	1619	+24/-22	8.21	+0.04/-0.03
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	1617	+22/-26	8.14	+0.04/-0.04
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	1603	+24/-23	8.11	+0.04/-0.03
⭕	mistralai/Mistral-Large-Instruct-2407	1598	+24/-23	8.21	+0.03/-0.03
🟦	princeton-nlp/gemma-2-9b-it-SimPO	1582	+23/-22	8.01	+0.04/-0.03
⭕	01-ai/Yi-1.5-6B-Chat	1574	+27/-21	7.76	+0.05/-0.04
⭕	meta-llama/Llama-3.2-1B-Instruct	1570	+26/-22	7.42	+0.08/-0.05
🟦	Qwen/Qwen2-72B-Instruct	1549	+28/-24	8.02	+0.04/-0.03
🟦	Qwen/Qwen2.5-72B-Instruct	1545	+26/-23	8.02	+0.03/-0.04
⭕	openai/gpt-4.1-mini	1542	+29/-27	8.07	+0.03/-0.03
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	1527	+31/-21	7.72	+0.04/-0.04
🟦 🏥	FractalAIResearch/Ramanujan-Ganit-R1-14B	1525	+24/-19	7.25	+0.05/-0.03
🟦	NousResearch/Hermes-3-Llama-3.1-8B	1518	+23/-26	7.88	+0.04/-0.04
⭕	akjindal53244/Llama-3.1-Storm-8B	1505	+26/-23	7.65	+0.04/-0.05
🟦	Qwen/Qwen2.5-7B-Instruct	1475	+23/-26	7.72	+0.04/-0.04
🟢 🏥	OpenMeditron/Meditron3-70B	1468	+30/-23	7.45	+0.05/-0.05
⭕	openai/gpt-4o-mini-2024-07-18	1464	+26/-20	7.83	+0.03/-0.03
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	1461	+23/-19	7.68	+0.04/-0.04
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	1457	+27/-26	7.49	+0.04/-0.04
🟦	Qwen/QwQ-32B-Preview	1447	+27/-27	6.2	+0.09/-0.09
🟦	oxyapi/oxy-1-small	1440	+26/-25	7.47	+0.04/-0.04
⭕	mistralai/Mistral-7B-Instruct-v0.3	1426	+23/-19	7.53	+0.04/-0.04
🟦	tiiuae/Falcon3-7B-Instruct	1412	+26/-25	7.24	+0.05/-0.05
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	1404	+26/-23	6.77	+0.07/-0.06
🟦	NousResearch/Hermes-2-Pro-Llama-3-8B	1392	+33/-24	7.39	+0.06/-0.04
⭕	upstage/SOLAR-10.7B-Instruct-v1.0	1365	+22/-27	7.31	+0.04/-0.04
🟢	Qwen/Qwen2.5-72B	1351	+26/-27	7.04	+0.04/-0.05
🟦	Qwen/Qwen2.5-3B-Instruct	1345	+27/-22	7.04	+0.05/-0.04
🟦	tiiuae/Falcon3-3B-Instruct	1344	+27/-22	6.87	+0.05/-0.04
🟦	tiiuae/Falcon3-1B-Instruct	1278	+22/-21	6.36	+0.07/-0.06
🟢	tiiuae/falcon-11B	1270	+34/-25	5.51	+0.09/-0.08
🟦	tiiuae/Falcon3-10B-Instruct	1270	+30/-23	6.55	+0.07/-0.05
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	1213	+24/-24	6.34	+0.07/-0.05
🟢	Qwen/Qwen2.5-7B	1111	+33/-29	3.94	+0.11/-0.1
🟦	Qwen/Qwen3-0.6B	1100	+26/-21	5.36	+0.06/-0.06
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	1046	+28/-26	3.69	+0.07/-0.05
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	1015	+21/-19	4.08	+0.08/-0.05
🟢	Qwen/Qwen2-0.5B	945	+25/-23	2.38	+0.06/-0.07
⭕	silma-ai/SILMA-9B-Instruct-v1.0	934	+24/-19	2.72	+0.05/-0.05
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	869	+20/-16	1.27	+0.05/-0.05
🟦	Qwen/Qwen2.5-3B	791	+27/-23	0.98	+0.07/-0.05
⭕ 🏥	winninghealth/WiNGPT2-Gemma-2-9B-Chat	726	+17/-17	0.03	+0.01/-0.01
⭕ 🏥	winninghealth/WiNGPT2-Llama-3-8B-Chat	705	+20/-14	0.03	+0.01/-0.0
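
The rows above pair each value with an asymmetric interval such as `+26/-20`. A minimal sketch for reading such a row and checking whether two models' intervals overlap is given below. The column meanings are not labeled in this extract, so treating the third field as a rating and the fourth as its interval is an assumption, and the names `Row`, `parse_row`, and `intervals_overlap` are hypothetical helpers, not part of the leaderboard.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    rating: float
    low: float   # rating minus the "-down" part of the interval
    high: float  # rating plus the "+up" part of the interval


def parse_row(line: str) -> Row:
    """Parse one tab-separated row: marker(s), model id, value, "+up/-down", ...

    Assumption: field 2 is the rating and field 3 is its interval.
    """
    fields = line.rstrip("\n").split("\t")
    model, rating, interval = fields[1], float(fields[2]), fields[3]
    up, down = interval.split("/")
    return Row(model, rating, rating - abs(float(down)), rating + float(up))


def intervals_overlap(a: Row, b: Row) -> bool:
    """True when the two [low, high] ranges share at least one point."""
    return a.low <= b.high and b.low <= a.high


if __name__ == "__main__":
    # Two rows copied from the table above.
    first = parse_row("🟦\tQwen/Qwen3-235B-A22B\t1975\t+22/-22\t8.87\t+0.02/-0.02")
    second = parse_row("⭕\topenai/o4-mini\t1920\t+24/-25\t8.75\t+0.03/-0.03")
    print(first, second, intervals_overlap(first, second))
```

Overlapping intervals suggest that the ordering of two adjacent models may not be meaningful, which is one reason to read the interval columns alongside the point values.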




⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	2052	+26/-20	8.91	+0.01/-0.01	false	true	?	instruction-tuned	Original	llama3.1	1149	70.55	2024-10-25 07:09:19+00:00
🟦	Qwen/Qwen3-235B-A22B	1975	+22/-22	8.87	+0.02/-0.02	false	true	?	preference-tuned	Original	apache-2.0	352	235.09	2025-04-29 10:42:15+00:00
🟦	Qwen/Qwen3-32B	1975	+24/-19	8.87	+0.02/-0.02	false	true	?	preference-tuned	Original	apache-2.0	162	32.76	2025-04-29 10:45:55+00:00
⭕	openai/o4-mini	1920	+24/-25	8.75	+0.03/-0.03	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:42:03+00:00
🟦	Qwen/Qwen3-30B-A3B	1874	+25/-22	8.73	+0.02/-0.02	false	true	?	preference-tuned	Original	apache-2.0	208	30.53	2025-04-29 10:45:32+00:00
⭕	google/gemini-2.5-flash-preview-04-17	1845	+23/-22	8.6	+0.04/-0.03	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-07 16:20:40+00:00
🟦 🏥	m42-health/Llama3-Med42-70B	1837	+25/-21	8.73	+0.02/-0.01	true	true	?	preference-tuned	Original	llama3	34	70.55	2024-10-24 06:24:59+00:00
⭕	google/gemini-2.5-flash-preview-04-17-thinking	1820	+26/-21	8.62	+0.03/-0.02	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-07 16:20:53+00:00
🟦	Qwen/Qwen3-14B	1818	+20/-20	8.66	+0.02/-0.02	false	true	?	preference-tuned	Original	apache-2.0	137	14.77	2025-05-12 12:17:12+00:00
⭕	deepseek-ai/DeepSeek-V3	1815	+25/-24	8.67	+0.02/-0.02	false	true	?	instruction-tuned	Original	other	1980	685	2024-10-22 23:04:13+00:00
⭕	CohereForAI/aya-expanse-32b	1811	+21/-17	8.63	+0.03/-0.02	false	true	?	instruction-tuned	Original	cc-by-nc-4.0	66	32.3	2024-10-25 07:13:05+00:00
🟦	meta-llama/Llama-3.1-70B-Instruct	1805	+20/-17	8.58	+0.03/-0.02	false	true	?	preference-tuned	Original	llama3.1	617	70.55	2024-10-24 13:25:28+00:00
🟦	Qwen/Qwen3-4B	1796	+23/-20	8.57	+0.02/-0.03	false	true	?	preference-tuned	Original	apache-2.0	79	4.02	2025-04-29 10:46:23+00:00
⭕ 🏥	google/medgemma-27b-text-it	1790	+22/-22	8.06	+0.06/-0.06	true	true	?	instruction-tuned	Original	other	108	27.01	2025-05-22 07:54:44+00:00
🟦	meta-llama/Meta-Llama-3-70B-Instruct	1787	+26/-21	8.55	+0.03/-0.03	false	true	?	preference-tuned	Original	llama3	1417	70.55	2024-10-24 13:25:47+00:00
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	1784	+20/-18	8.58	+0.02/-0.02	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟦	meta-llama/Llama-3.1-405B-Instruct	1780	+24/-19	8.56	+0.03/-0.02	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟦 🏥	Qwen/Qwen3-8B	1773	+19/-19	8.35	+0.03/-0.03	true	true	?	preference-tuned	Original	apache-2.0	300	8.19	2025-05-20 11:36:36+00:00
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	1757	+18/-17	8.53	+0.03/-0.03	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	1737	+18/-16	8.12	+0.04/-0.04	false	true	?	instruction-tuned	Original	apache-2.0	0	32.76	2025-05-19 12:37:03+00:00
🟦	meta-llama/Llama-3.1-8B-Instruct	1733	+24/-22	8.31	+0.05/-0.05	false	true	?	preference-tuned	Original	llama3.1	2845	8.03	2024-07-24 14:33:56+00:00
⭕	openai/gpt-4.1	1728	+28/-30	8.48	+0.03/-0.02	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:41:32+00:00
🟦	meta-llama/Llama-3.2-3B-Instruct	1714	+20/-21	8.28	+0.04/-0.03	false	true	?	preference-tuned	Original	llama3.2	402	3.21	2024-10-24 06:23:04+00:00
⭕ 🏥	Intelligent-Internet/II-Medical-8B	1703	+23/-21	8.32	+0.04/-0.03	true	true	?	instruction-tuned	Original	null	42	8.19	2025-05-16 09:57:55+00:00
⭕	microsoft/phi-4	1680	+18/-17	8.37	+0.03/-0.03	false	true	?	instruction-tuned	Original	null	0	-1	2025-01-17 12:10:32+00:00
⭕	google/gemini-2.0-flash	1679	+31/-26	8.12	+0.05/-0.03	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-07 16:19:08+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	1619	+24/-22	8.21	+0.04/-0.03	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:07:42+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	1617	+22/-26	8.14	+0.04/-0.04	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:05+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	1603	+24/-23	8.11	+0.04/-0.03	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:06:51+00:00
⭕	mistralai/Mistral-Large-Instruct-2407	1598	+24/-23	8.21	+0.03/-0.03	false	true	?	instruction-tuned	Original	other	808	122.61	2024-11-25 11:27:40+00:00
🟦	princeton-nlp/gemma-2-9b-it-SimPO	1582	+23/-22	8.01	+0.04/-0.03	false	true	?	preference-tuned	Original	mit	110	9.24	2024-10-25 07:11:14+00:00
⭕	01-ai/Yi-1.5-6B-Chat	1574	+27/-21	7.76	+0.05/-0.04	false	true	?	instruction-tuned	Original	apache-2.0	41	6.06	2024-10-22 23:04:13+00:00
⭕	meta-llama/Llama-3.2-1B-Instruct	1570	+26/-22	7.42	+0.08/-0.05	false	false	?	instruction-tuned	Original	llama3.2	430	1.24	2024-10-25 07:14:38+00:00
🟦	Qwen/Qwen2-72B-Instruct	1549	+28/-24	8.02	+0.04/-0.03	false	true	?	preference-tuned	Original	other	675	72.71	2024-11-14 11:37:18+00:00
🟦	Qwen/Qwen2.5-72B-Instruct	1545	+26/-23	8.02	+0.03/-0.04	false	true	?	preference-tuned	Original	other	343	72.71	2024-10-22 14:35:49+00:00
⭕	openai/gpt-4.1-mini	1542	+29/-27	8.07	+0.03/-0.03	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:41:43+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	1527	+31/-21	7.72	+0.04/-0.04	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:22+00:00
🟦 🏥	FractalAIResearch/Ramanujan-Ganit-R1-14B	1525	+24/-19	7.25	+0.05/-0.03	true	true	?	preference-tuned	Original	mit	0	14.77	2025-05-20 11:29:52+00:00
🟦	NousResearch/Hermes-3-Llama-3.1-8B	1518	+23/-26	7.88	+0.04/-0.04	false	true	?	preference-tuned	Original	llama3	254	8.03	2024-12-10 09:38:34+00:00
⭕	akjindal53244/Llama-3.1-Storm-8B	1505	+26/-23	7.65	+0.04/-0.05	false	true	?	instruction-tuned	Original	llama3.1	164	8.03	2024-11-14 11:35:17+00:00
🟦	Qwen/Qwen2.5-7B-Instruct	1475	+23/-26	7.72	+0.04/-0.04	false	true	?	preference-tuned	Original	apache-2.0	274	7.62	2024-11-14 11:36:44+00:00
🟢 🏥	OpenMeditron/Meditron3-70B	1468	+30/-23	7.45	+0.05/-0.05	true	true	?	pretrained	Original	null	9	70.55	2024-11-11 13:58:37+00:00
⭕	openai/gpt-4o-mini-2024-07-18	1464	+26/-20	7.83	+0.03/-0.03	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-02 10:08:27+00:00
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	1461	+23/-19	7.68	+0.04/-0.04	true	true	?	preference-tuned	Original	llama3	339	70	2024-07-24 14:33:56+00:00
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	1457	+27/-26	7.49	+0.04/-0.04	false	true	?	preference-tuned	Original	null	0	-1	2025-03-06 02:18:06+00:00
🟦	Qwen/QwQ-32B-Preview	1447	+27/-27	6.2	+0.09/-0.09	false	true	?	preference-tuned	Original	apache-2.0	239	32.76	2024-11-28 04:57:07+00:00
🟦	oxyapi/oxy-1-small	1440	+26/-25	7.47	+0.04/-0.04	false	true	?	preference-tuned	Original	apache-2.0	67	14.77	2024-12-10 07:27:22+00:00
⭕	mistralai/Mistral-7B-Instruct-v0.3	1426	+23/-19	7.53	+0.04/-0.04	false	true	?	instruction-tuned	Original	apache-2.0	1131	7.25	2024-11-14 11:38:25+00:00
🟦	tiiuae/Falcon3-7B-Instruct	1412	+26/-25	7.24	+0.05/-0.05	false	true	?	preference-tuned	Original	other	23	7.46	2024-12-19 05:59:29+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	1404	+26/-23	6.77	+0.07/-0.06	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:42+00:00
🟦	NousResearch/Hermes-2-Pro-Llama-3-8B	1392	+33/-24	7.39	+0.06/-0.04	false	true	?	preference-tuned	Original	llama3	411	8.03	2024-12-10 10:10:16+00:00
⭕	upstage/SOLAR-10.7B-Instruct-v1.0	1365	+22/-27	7.31	+0.04/-0.04	false	true	?	instruction-tuned	Original	cc-by-nc-4.0	613	10.73	2024-10-22 22:52:54+00:00
🟢	Qwen/Qwen2.5-72B	1351	+26/-27	7.04	+0.04/-0.05	false	true	?	pretrained	Original	other	39	72.71	2024-11-14 11:37:02+00:00
🟦	Qwen/Qwen2.5-3B-Instruct	1345	+27/-22	7.04	+0.05/-0.04	false	false	?	preference-tuned	Original	other	87	3.09	2024-11-18 11:36:42+00:00
🟦	tiiuae/Falcon3-3B-Instruct	1344	+27/-22	6.87	+0.05/-0.04	false	true	?	preference-tuned	Original	other	12	3.23	2024-12-19 06:00:40+00:00
🟦	tiiuae/Falcon3-1B-Instruct	1278	+22/-21	6.36	+0.07/-0.06	false	true	?	preference-tuned	Original	other	21	1.67	2024-12-19 06:01:10+00:00
🟢	tiiuae/falcon-11B	1270	+34/-25	5.51	+0.09/-0.08	false	false	?	pretrained	Original	unknown	210	11.1	2024-10-29 07:23:16+00:00
🟦	tiiuae/Falcon3-10B-Instruct	1270	+30/-23	6.55	+0.07/-0.05	false	true	?	preference-tuned	Original	other	40	10.31	2024-12-19 05:58:51+00:00
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	1213	+24/-24	6.34	+0.07/-0.05	false	true	?	instruction-tuned	Original	apache-2.0	346	1.71	2024-11-22 10:44:37+00:00
🟢	Qwen/Qwen2.5-7B	1111	+33/-29	3.94	+0.11/-0.1	false	true	?	pretrained	Original	apache-2.0	67	7.62	2024-11-14 11:36:22+00:00
🟦	Qwen/Qwen3-0.6B	1100	+26/-21	5.36	+0.06/-0.06	false	true	?	preference-tuned	Original	apache-2.0	91	0.75	2025-04-29 10:46:33+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	1046	+28/-26	3.69	+0.07/-0.05	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:09:02+00:00
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	1015	+21/-19	4.08	+0.08/-0.05	false	true	?	preference-tuned	Original	apache-2.0	60	0.36	2024-12-10 08:36:15+00:00
🟢	Qwen/Qwen2-0.5B	945	+25/-23	2.38	+0.06/-0.07	false	false	?	pretrained	Original	apache-2.0	101	0.49	2024-10-22 13:46:13+00:00
⭕	silma-ai/SILMA-9B-Instruct-v1.0	934	+24/-19	2.72	+0.05/-0.05	false	true	?	instruction-tuned	Original	gemma	44	9.24	2024-11-14 11:39:56+00:00
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	869	+20/-16	1.27	+0.05/-0.05	true	false	?	pretrained	Original	llama3	37	8.03	2024-10-25 07:16:58+00:00
🟦	Qwen/Qwen2.5-3B	791	+27/-23	0.98	+0.07/-0.05	false	false	?	preference-tuned	Original	other	26	3.09	2024-10-22 13:17:21+00:00
⭕ 🏥	winninghealth/WiNGPT2-Gemma-2-9B-Chat	726	+17/-17	0.03	+0.01/-0.01	true	true	?	instruction-tuned	Original	apache-2.0	2	9.24	2025-05-19 05:20:13+00:00
⭕ 🏥	winninghealth/WiNGPT2-Llama-3-8B-Chat	705	+20/-14	0.03	+0.01/-0.0	true	true	?	instruction-tuned	Original	apache-2.0	5	8.03	2025-05-19 05:21:32+00:00




⭕	deepseek-ai/DeepSeek-V3	87.82	64.31	95.78	99.61	72.32
⭕	openai/gpt-4.1-mini	87.55	61.64	95.93	99.31	74.38
🟦	Qwen/Qwen2-72B-Instruct	87.51	61.51	95.99	99.47	73.72
⭕	microsoft/phi-4	87.44	67.28	95.9	99.52	66.52
🟦	Qwen/Qwen2.5-72B-Instruct	87.41	63.09	95.89	99.56	70.91
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	87.36	61.36	95.78	98.82	74.94
⭕	openai/o4-mini	87.17	79.89	94.9	95.27	64.43
⭕	mistralai/Mistral-Large-Instruct-2407	87.09	61.74	97.16	99.2	68.42
🟦	Qwen/Qwen3-32B	86.95	63.64	95.53	98.59	70.15
🟦	Qwen/Qwen3-235B-A22B	86.78	68.97	95.21	97.84	65.67
🟦 🏥	m42-health/Llama3-Med42-70B	86.75	66.7	95.79	98.5	65.21
⭕	google/gemini-2.5-flash-preview-04-17-thinking	86.7	56.52	96.48	99.54	73.97
🟦	Qwen/Qwen3-30B-A3B	86.64	57.18	96.02	98.7	75.85
🟦	Qwen/Qwen2.5-7B-Instruct	86.21	56.12	96.13	99.14	72.77
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	86.15	55.44	96.23	99.32	72.64
🟦	Qwen/Qwen3-14B	86.14	54.21	95.98	98.81	76.97
🟦	oxyapi/oxy-1-small	85.84	57.35	96.13	98.78	68.93
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	85.8	57.8	95.89	98.59	69.06
⭕	openai/gpt-4.1	85.79	50.6	96.66	99.24	78.26
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	85.76	53.52	96.13	99.3	73.2
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	85.6	50.84	96.35	99.21	76.95
🟦	Qwen/Qwen2.5-3B-Instruct	85.59	56.58	95.45	98.74	70.05
⭕	mistralai/Mistral-7B-Instruct-v0.3	85.47	54.31	96.4	99.31	68.84
🟦	tiiuae/Falcon3-10B-Instruct	85.43	49.13	96.36	99.71	77.76
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	85.36	53.13	95.66	98.5	74.18
⭕	CohereForAI/aya-expanse-32b	85.19	53.17	96.57	97.96	71.62
🟢	Qwen/Qwen2.5-7B	85.08	51.12	96.46	99.08	71.75
🟦	NousResearch/Hermes-3-Llama-3.1-8B	84.86	46.65	96.73	99.21	78.97
🟦	Qwen/QwQ-32B-Preview	84.85	73.9	96.21	99.31	49.13
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	84.68	55.94	97.04	96.21	66.59
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	84.33	49.41	96.56	98.04	71.34
🟦	tiiuae/Falcon3-7B-Instruct	84.33	44.54	96.53	99.29	79.77
🟦	meta-llama/Llama-3.1-70B-Instruct	84.31	51.62	96.71	97.47	68.16
🟦	tiiuae/Falcon3-3B-Instruct	84.09	44.91	96.21	99.15	77.67
🟢	Qwen/Qwen2.5-72B	84.07	51.09	97.58	99.16	60.65
🟦	meta-llama/Llama-3.1-405B-Instruct	83.95	45.8	96.48	98.31	75.66
🟦	meta-llama/Llama-3.1-8B-Instruct	83.79	54.71	96.44	96.75	62.11
🟦	meta-llama/Llama-3.2-3B-Instruct	83.52	50.42	96.33	96.87	66.54
⭕	akjindal53244/Llama-3.1-Storm-8B	83.26	41.57	96.79	98.13	80.65
🟦	meta-llama/Llama-3.3-70B-Instruct	83.19	44.17	96.68	97.63	73.71
🟦	meta-llama/Meta-Llama-3-70B-Instruct	83.06	45.3	96.61	96.02	75.24
🟢 🏥	OpenMeditron/Meditron3-70B	83.05	40.46	97.11	98.39	79.65
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	82.94	40.74	96.34	98.67	79.24
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	82.93	40.34	96.81	98.51	79.23
🟦	princeton-nlp/gemma-2-9b-it-SimPO	82.84	41.79	96.5	97.77	77.29
🟦	Qwen/Qwen3-0.6B	82.37	41.38	95.81	96.26	82.13
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	82.32	41.76	95.92	97.37	75.1
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	82.26	46.44	95.86	96.99	64.28
⭕ 🏥	google/medgemma-4b-it	81.94	37.52	95.82	98.24	83.39
🟦	Qwen/Qwen2.5-3B	81.66	44.32	97.41	97.62	57.24
⭕	meta-llama/Llama-3.2-1B-Instruct	80.61	44.66	95.86	93.62	63.23
🟦	tiiuae/Falcon3-1B-Instruct	80.35	34.13	96.06	97.03	80.61
🟦 🏥	Qwen/Qwen3-8B	79.28	61.4	96.75	99.28	31.7
🟦	Qwen/Qwen2.5-0.5B-Instruct	79.21	38.47	96.5	96.13	54.23
⭕ 🏥	Intelligent-Internet/II-Medical-8B	78.59	59.72	96.68	99.23	29.9
⭕	silma-ai/SILMA-9B-Instruct-v1.0	73.66	16.29	97.72	95.57	92.16
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	69.33	35.32	96.15	93.2	12.67
🟢	tiiuae/falcon-11B	64.87	60.94	97.61	97	-48.55
🟢	Qwen/Qwen2-0.5B	64.01	45.33	98.2	93.82	-218.26
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	62.64	92.95	98.5	89.42	-172.93
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	59.15	15.9	97.75	52.88	85.85
⭕ 🏥	google/medgemma-27b-text-it	57.68	12.36	98.47	74.56	-131.51
🟦	ministral/Ministral-3b-instruct	54.69	2.1	99.49	64.59	-237.7




⭕	deepseek-ai/DeepSeek-V3	87.82	64.31	95.78	99.61	72.32	false	true	?	instruction-tuned	Original	other	1980	685	2024-10-22 23:04:13+00:00
⭕	openai/gpt-4.1-mini	87.55	61.64	95.93	99.31	74.38	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:41:43+00:00
🟦	Qwen/Qwen2-72B-Instruct	87.51	61.51	95.99	99.47	73.72	false	true	?	preference-tuned	Original	other	675	72.71	2024-11-14 11:37:18+00:00
⭕	microsoft/phi-4	87.44	67.28	95.9	99.52	66.52	false	true	?	instruction-tuned	Original	null	0	-1	2025-01-17 12:10:32+00:00
🟦	Qwen/Qwen2.5-72B-Instruct	87.41	63.09	95.89	99.56	70.91	false	true	?	preference-tuned	Original	other	343	72.71	2024-10-22 14:35:49+00:00
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	87.36	61.36	95.78	98.82	74.94	false	true	?	instruction-tuned	Original	apache-2.0	0	32.76	2025-05-19 12:37:03+00:00
⭕	openai/o4-mini	87.17	79.89	94.9	95.27	64.43	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:42:03+00:00
⭕	mistralai/Mistral-Large-Instruct-2407	87.09	61.74	97.16	99.2	68.42	false	true	?	instruction-tuned	Original	other	808	122.61	2024-11-25 11:27:40+00:00
🟦	Qwen/Qwen3-32B	86.95	63.64	95.53	98.59	70.15	false	true	?	preference-tuned	Original	apache-2.0	162	32.76	2025-04-29 10:45:55+00:00
🟦	Qwen/Qwen3-235B-A22B	86.78	68.97	95.21	97.84	65.67	false	true	?	preference-tuned	Original	apache-2.0	352	235.09	2025-04-29 10:42:15+00:00
🟦 🏥	m42-health/Llama3-Med42-70B	86.75	66.7	95.79	98.5	65.21	true	true	?	preference-tuned	Original	llama3	34	70.55	2024-10-24 06:24:59+00:00
⭕	google/gemini-2.5-flash-preview-04-17-thinking	86.7	56.52	96.48	99.54	73.97	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-07 16:20:53+00:00
🟦	Qwen/Qwen3-30B-A3B	86.64	57.18	96.02	98.7	75.85	false	true	?	preference-tuned	Original	apache-2.0	208	30.53	2025-04-29 10:45:32+00:00
🟦	Qwen/Qwen2.5-7B-Instruct	86.21	56.12	96.13	99.14	72.77	false	true	?	preference-tuned	Original	apache-2.0	274	7.62	2024-11-14 11:36:44+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	86.15	55.44	96.23	99.32	72.64	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:07:42+00:00
🟦	Qwen/Qwen3-14B	86.14	54.21	95.98	98.81	76.97	false	true	?	preference-tuned	Original	apache-2.0	137	14.77	2025-05-12 12:17:12+00:00
🟦	oxyapi/oxy-1-small	85.84	57.35	96.13	98.78	68.93	false	true	?	preference-tuned	Original	apache-2.0	67	14.77	2024-12-10 07:27:22+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	85.8	57.8	95.89	98.59	69.06	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:22+00:00
⭕	openai/gpt-4.1	85.79	50.6	96.66	99.24	78.26	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:41:32+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	85.76	53.52	96.13	99.3	73.2	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:05+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	85.6	50.84	96.35	99.21	76.95	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:06:51+00:00
🟦	Qwen/Qwen2.5-3B-Instruct	85.59	56.58	95.45	98.74	70.05	false	false	?	preference-tuned	Original	other	87	3.09	2024-11-18 11:36:42+00:00
⭕	mistralai/Mistral-7B-Instruct-v0.3	85.47	54.31	96.4	99.31	68.84	false	true	?	instruction-tuned	Original	apache-2.0	1131	7.25	2024-11-14 11:38:25+00:00
🟦	tiiuae/Falcon3-10B-Instruct	85.43	49.13	96.36	99.71	77.76	false	true	?	preference-tuned	Original	other	40	10.31	2024-12-19 05:58:51+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	85.36	53.13	95.66	98.5	74.18	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:42+00:00
⭕	CohereForAI/aya-expanse-32b	85.19	53.17	96.57	97.96	71.62	false	true	?	instruction-tuned	Original	cc-by-nc-4.0	66	32.3	2024-10-25 07:13:05+00:00
🟢	Qwen/Qwen2.5-7B	85.08	51.12	96.46	99.08	71.75	false	true	?	pretrained	Original	apache-2.0	67	7.62	2024-11-14 11:36:22+00:00
🟦	NousResearch/Hermes-3-Llama-3.1-8B	84.86	46.65	96.73	99.21	78.97	false	true	?	preference-tuned	Original	llama3	254	8.03	2024-12-10 09:38:34+00:00
🟦	Qwen/QwQ-32B-Preview	84.85	73.9	96.21	99.31	49.13	false	true	?	preference-tuned	Original	apache-2.0	239	32.76	2024-11-28 04:57:07+00:00
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	84.68	55.94	97.04	96.21	66.59	false	true	?	instruction-tuned	Original	llama3.1	1149	70.55	2024-10-25 07:09:19+00:00
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	84.33	49.41	96.56	98.04	71.34	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟦	tiiuae/Falcon3-7B-Instruct	84.33	44.54	96.53	99.29	79.77	false	true	?	preference-tuned	Original	other	23	7.46	2024-12-19 05:59:29+00:00
🟦	meta-llama/Llama-3.1-70B-Instruct	84.31	51.62	96.71	97.47	68.16	false	true	?	preference-tuned	Original	llama3.1	617	70.55	2024-10-24 13:25:28+00:00
🟦	tiiuae/Falcon3-3B-Instruct	84.09	44.91	96.21	99.15	77.67	false	true	?	preference-tuned	Original	other	12	3.23	2024-12-19 06:00:40+00:00
🟢	Qwen/Qwen2.5-72B	84.07	51.09	97.58	99.16	60.65	false	true	?	pretrained	Original	other	39	72.71	2024-11-14 11:37:02+00:00
🟦	meta-llama/Llama-3.1-405B-Instruct	83.95	45.8	96.48	98.31	75.66	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟦	meta-llama/Llama-3.1-8B-Instruct	83.79	54.71	96.44	96.75	62.11	false	true	?	preference-tuned	Original	llama3.1	2845	8.03	2024-07-24 14:33:56+00:00
🟦	meta-llama/Llama-3.2-3B-Instruct	83.52	50.42	96.33	96.87	66.54	false	true	?	preference-tuned	Original	llama3.2	402	3.21	2024-10-24 06:23:04+00:00
⭕	akjindal53244/Llama-3.1-Storm-8B	83.26	41.57	96.79	98.13	80.65	false	true	?	instruction-tuned	Original	llama3.1	164	8.03	2024-11-14 11:35:17+00:00
🟦	meta-llama/Llama-3.3-70B-Instruct	83.19	44.17	96.68	97.63	73.71	false	true	?	preference-tuned	Original	llama3.3	632	70.55	2024-12-09 09:10:34+00:00
🟦	meta-llama/Meta-Llama-3-70B-Instruct	83.06	45.3	96.61	96.02	75.24	false	true	?	preference-tuned	Original	llama3	1417	70.55	2024-10-24 13:25:47+00:00
🟢 🏥	OpenMeditron/Meditron3-70B	83.05	40.46	97.11	98.39	79.65	true	true	?	pretrained	Original	null	9	70.55	2024-11-11 13:58:37+00:00
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	82.94	40.74	96.34	98.67	79.24	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	82.93	40.34	96.81	98.51	79.23	true	true	?	preference-tuned	Original	llama3	339	70	2024-07-24 14:33:56+00:00
🟦	princeton-nlp/gemma-2-9b-it-SimPO	82.84	41.79	96.5	97.77	77.29	false	true	?	preference-tuned	Original	mit	110	9.24	2024-10-25 07:11:14+00:00
🟦	Qwen/Qwen3-0.6B	82.37	41.38	95.81	96.26	82.13	false	true	?	preference-tuned	Original	apache-2.0	91	0.75	2025-04-29 10:46:33+00:00
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	82.32	41.76	95.92	97.37	75.1	false	true	?	instruction-tuned	Original	apache-2.0	346	1.71	2024-11-22 10:44:37+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	82.26	46.44	95.86	96.99	64.28	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:09:02+00:00
⭕ 🏥	google/medgemma-4b-it	81.94	37.52	95.82	98.24	83.39	true	true	?	instruction-tuned	Original	other	106	4.3	2025-05-22 07:54:59+00:00
🟦	Qwen/Qwen2.5-3B	81.66	44.32	97.41	97.62	57.24	false	false	?	preference-tuned	Original	other	26	3.09	2024-10-22 13:17:21+00:00
⭕	meta-llama/Llama-3.2-1B-Instruct	80.61	44.66	95.86	93.62	63.23	false	false	?	instruction-tuned	Original	llama3.2	430	1.24	2024-10-25 07:14:38+00:00
🟦	tiiuae/Falcon3-1B-Instruct	80.35	34.13	96.06	97.03	80.61	false	true	?	preference-tuned	Original	other	21	1.67	2024-12-19 06:01:10+00:00
🟦 🏥	Qwen/Qwen3-8B	79.28	61.4	96.75	99.28	31.7	true	true	?	preference-tuned	Original	apache-2.0	300	8.19	2025-05-20 11:36:36+00:00
🟦	Qwen/Qwen2.5-0.5B-Instruct	79.21	38.47	96.5	96.13	54.23	false	false	?	preference-tuned	Original	apache-2.0	100	0.49	2024-11-18 11:36:27+00:00
⭕ 🏥	Intelligent-Internet/II-Medical-8B	78.59	59.72	96.68	99.23	29.9	true	true	?	instruction-tuned	Original	null	42	8.19	2025-05-16 09:57:55+00:00
⭕	silma-ai/SILMA-9B-Instruct-v1.0	73.66	16.29	97.72	95.57	92.16	false	true	?	instruction-tuned	Original	gemma	44	9.24	2024-11-14 11:39:56+00:00
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	69.33	35.32	96.15	93.2	12.67	false	true	?	preference-tuned	Original	apache-2.0	60	0.36	2024-12-10 08:36:15+00:00
🟢	tiiuae/falcon-11B	64.87	60.94	97.61	97	-48.55	false	false	?	pretrained	Original	unknown	210	11.1	2024-10-29 07:23:16+00:00
🟢	Qwen/Qwen2-0.5B	64.01	45.33	98.2	93.82	-218.26	false	false	?	pretrained	Original	apache-2.0	101	0.49	2024-10-22 13:46:13+00:00
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	62.64	92.95	98.5	89.42	-172.93	false	true	?	preference-tuned	Original	null	0	-1	2025-03-06 02:18:06+00:00
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	59.15	15.9	97.75	52.88	85.85	true	false	?	pretrained	Original	llama3	37	8.03	2024-10-25 07:16:58+00:00
⭕ 🏥	google/medgemma-27b-text-it	57.68	12.36	98.47	74.56	-131.51	true	true	?	instruction-tuned	Original	other	108	27.01	2025-05-22 07:54:44+00:00
🟦	ministral/Ministral-3b-instruct	54.69	2.1	99.49	64.59	-237.7	false	true	?	preference-tuned	Original	apache-2.0	33	3.32	2024-12-10 10:39:41+00:00




⭕	openai/gpt-4.1	96.11	92.08	97.25	99
🟦 🏥	Qwen/Qwen3-8B	95.56	91.83	96.17	98.67
🟦 🏥	m42-health/Llama3-Med42-70B	95.14	90	97.42	98
⭕	openai/gpt-4.1-mini	94.69	88.58	96.67	98.83
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	94.64	92.41	96.83	94.67
⭕	google/gemini-2.5-flash-preview-04-17-thinking	94.61	87.5	97.33	99
⭕ 🏥	Intelligent-Internet/II-Medical-8B	94.5	89.07	95.92	98.5
⭕	openai/o4-mini	94.27	90.08	96.91	95.83
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	93.94	86.25	97.58	98
⭕	deepseek-ai/DeepSeek-V3	93.89	85.33	97.33	99
🟦	Qwen/QwQ-32B-Preview	93.85	85.48	97.75	98.33
🟦	Qwen/Qwen2.5-72B-Instruct	93.28	88.25	96.42	95.17
🟦	princeton-nlp/gemma-2-9b-it-SimPO	93.16	85.07	96.58	97.83
⭕	akjindal53244/Llama-3.1-Storm-8B	92.91	83.56	96	99.17
⭕	CohereForAI/aya-expanse-32b	92.88	84.16	95.82	98.67
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	92.66	80.58	97.57	99.83
⭕	microsoft/phi-4	92.64	82.16	96.58	99.17
🟦	Qwen/Qwen2.5-7B-Instruct	92.61	83.25	96.58	98
🟦	Qwen/Qwen3-32B	92.55	84.41	96.75	96.5
🟦	Qwen/Qwen3-235B-A22B	92.38	84.07	96.57	96.5
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	92.33	80.66	96.67	99.67
🟦	NousResearch/Hermes-3-Llama-3.1-8B	92.21	80.65	97.16	98.83
🟦	Qwen/Qwen3-30B-A3B	92.11	83.5	96.5	96.33
🟦	meta-llama/Llama-3.1-70B-Instruct	92.02	79.73	96.99	99.33
🟦	meta-llama/Llama-3.3-70B-Instruct	91.99	79.64	96.83	99.5
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	91.91	78.73	97.33	99.67
🟦	meta-llama/Llama-3.1-8B-Instruct	91.85	79.82	96.57	99.17
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	91.8	81.57	96.67	97.17
⭕	mistralai/Mistral-Large-Instruct-2407	91.53	79.75	97	97.83
⭕	mistralai/Mistral-7B-Instruct-v0.3	91.49	79.06	95.75	99.67
🟦	tiiuae/Falcon3-10B-Instruct	91.16	79.15	96.17	98.17
🟦	meta-llama/Llama-3.1-405B-Instruct	91.08	76.91	97.33	99
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	90.94	77.91	96.25	98.67
🟦	Qwen/Qwen3-14B	90.83	79.98	96.17	96.33
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	90.8	76.81	96.42	99.17
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	90.66	75.65	96.67	99.67
🟦	meta-llama/Meta-Llama-3-70B-Instruct	90.66	75.64	97	99.33
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	90.5	75.92	96.42	99.17
🟦	Qwen/Qwen2-72B-Instruct	90.45	74.98	97.08	99.29
🟦	tiiuae/Falcon3-7B-Instruct	90.29	74.31	96.74	99.83
🟦	oxyapi/oxy-1-small	90.1	76.06	96.58	97.67
⭕ 🏥	google/medgemma-4b-it	89.66	75.67	95.16	98.17
🟢	Qwen/Qwen2.5-7B	89.66	74.73	95.92	98.33
🟦	meta-llama/Llama-3.2-3B-Instruct	89.63	75.06	96.32	97.5
🟦	Qwen/Qwen2.5-3B-Instruct	89.2	73.71	95.73	98.17
🟢 🏥	OpenMeditron/Meditron3-70B	89.17	70.42	97.08	100
🟦	tiiuae/Falcon3-3B-Instruct	88.91	74.73	94.17	97.83
🟢	Qwen/Qwen2.5-72B	88.83	70.25	96.58	99.67
🟦	Qwen/Qwen2.5-3B	87.06	69.11	94.74	97.33
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	86.8	68.24	94.49	97.67
⭕	meta-llama/Llama-3.2-1B-Instruct	85.32	66.68	94.46	94.83
🟢	tiiuae/falcon-11B	84.18	57.53	96.67	98.33
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	82.87	63.04	93.9	91.67
🟦	tiiuae/Falcon3-1B-Instruct	82.57	64.56	92.31	90.83
🟦	Qwen/Qwen3-0.6B	82.3	54	95.74	97.17
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	80.21	48.9	96.07	95.67
🟦	Qwen/Qwen2.5-0.5B-Instruct	79.82	52.29	94.33	92.83
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	78.54	44.37	98.08	93.17
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	71.44	47.33	95.67	71.33
🟢	Qwen/Qwen2-0.5B	70	21.95	97.82	90.21
⭕	silma-ai/SILMA-9B-Instruct-v1.0	69.17	10.69	99.5	97.33
⭕ 🏥	google/medgemma-27b-text-it	64.52	26.82	92.08	74.67
🟦	ministral/Ministral-3b-instruct	56.67	1.58	99.33	69.08




⭕	openai/gpt-4.1	96.11	92.08	97.25	99	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:41:32+00:00
🟦 🏥	Qwen/Qwen3-8B	95.56	91.83	96.17	98.67	true	true	?	preference-tuned	Original	apache-2.0	300	8.19	2025-05-20 11:36:36+00:00
🟦 🏥	m42-health/Llama3-Med42-70B	95.14	90	97.42	98	true	true	?	preference-tuned	Original	llama3	34	70.55	2024-10-24 06:24:59+00:00
⭕	openai/gpt-4.1-mini	94.69	88.58	96.67	98.83	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:41:43+00:00
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	94.64	92.41	96.83	94.67	false	true	?	instruction-tuned	Original	apache-2.0	0	32.76	2025-05-19 12:37:03+00:00
⭕	google/gemini-2.5-flash-preview-04-17-thinking	94.61	87.5	97.33	99	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-07 16:20:53+00:00
⭕ 🏥	Intelligent-Internet/II-Medical-8B	94.5	89.07	95.92	98.5	true	true	?	instruction-tuned	Original	null	42	8.19	2025-05-16 09:57:55+00:00
⭕	openai/o4-mini	94.27	90.08	96.91	95.83	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:42:03+00:00
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	93.94	86.25	97.58	98	false	true	?	instruction-tuned	Original	llama3.1	1149	70.55	2024-10-25 07:09:19+00:00
⭕	deepseek-ai/DeepSeek-V3	93.89	85.33	97.33	99	false	true	?	instruction-tuned	Original	other	1980	685	2024-10-22 23:04:13+00:00
🟦	Qwen/QwQ-32B-Preview	93.85	85.48	97.75	98.33	false	true	?	preference-tuned	Original	apache-2.0	239	32.76	2024-11-28 04:57:07+00:00
🟦	Qwen/Qwen2.5-72B-Instruct	93.28	88.25	96.42	95.17	false	true	?	preference-tuned	Original	other	343	72.71	2024-10-22 14:35:49+00:00
🟦	princeton-nlp/gemma-2-9b-it-SimPO	93.16	85.07	96.58	97.83	false	true	?	preference-tuned	Original	mit	110	9.24	2024-10-25 07:11:14+00:00
⭕	akjindal53244/Llama-3.1-Storm-8B	92.91	83.56	96	99.17	false	true	?	instruction-tuned	Original	llama3.1	164	8.03	2024-11-14 11:35:17+00:00
⭕	CohereForAI/aya-expanse-32b	92.88	84.16	95.82	98.67	false	true	?	instruction-tuned	Original	cc-by-nc-4.0	66	32.3	2024-10-25 07:13:05+00:00
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	92.66	80.58	97.57	99.83	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
⭕	microsoft/phi-4	92.64	82.16	96.58	99.17	false	true	?	instruction-tuned	Original	null	0	-1	2025-01-17 12:10:32+00:00
🟦	Qwen/Qwen2.5-7B-Instruct	92.61	83.25	96.58	98	false	true	?	preference-tuned	Original	apache-2.0	274	7.62	2024-11-14 11:36:44+00:00
🟦	Qwen/Qwen3-32B	92.55	84.41	96.75	96.5	false	true	?	preference-tuned	Original	apache-2.0	162	32.76	2025-04-29 10:45:55+00:00
🟦	Qwen/Qwen3-235B-A22B	92.38	84.07	96.57	96.5	false	true	?	preference-tuned	Original	apache-2.0	352	235.09	2025-04-29 10:42:15+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	92.33	80.66	96.67	99.67	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:06:51+00:00
🟦	NousResearch/Hermes-3-Llama-3.1-8B	92.21	80.65	97.16	98.83	false	true	?	preference-tuned	Original	llama3	254	8.03	2024-12-10 09:38:34+00:00
🟦	Qwen/Qwen3-30B-A3B	92.11	83.5	96.5	96.33	false	true	?	preference-tuned	Original	apache-2.0	208	30.53	2025-04-29 10:45:32+00:00
🟦	meta-llama/Llama-3.1-70B-Instruct	92.02	79.73	96.99	99.33	false	true	?	preference-tuned	Original	llama3.1	617	70.55	2024-10-24 13:25:28+00:00
🟦	meta-llama/Llama-3.3-70B-Instruct	91.99	79.64	96.83	99.5	false	true	?	preference-tuned	Original	llama3.3	632	70.55	2024-12-09 09:10:34+00:00
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	91.91	78.73	97.33	99.67	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟦	meta-llama/Llama-3.1-8B-Instruct	91.85	79.82	96.57	99.17	false	true	?	preference-tuned	Original	llama3.1	2845	8.03	2024-07-24 14:33:56+00:00
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	91.8	81.57	96.67	97.17	false	true	?	preference-tuned	Original	null	0	-1	2025-03-06 02:18:06+00:00
⭕	mistralai/Mistral-Large-Instruct-2407	91.53	79.75	97	97.83	false	true	?	instruction-tuned	Original	other	808	122.61	2024-11-25 11:27:40+00:00
⭕	mistralai/Mistral-7B-Instruct-v0.3	91.49	79.06	95.75	99.67	false	true	?	instruction-tuned	Original	apache-2.0	1131	7.25	2024-11-14 11:38:25+00:00
🟦	tiiuae/Falcon3-10B-Instruct	91.16	79.15	96.17	98.17	false	true	?	preference-tuned	Original	other	40	10.31	2024-12-19 05:58:51+00:00
🟦	meta-llama/Llama-3.1-405B-Instruct	91.08	76.91	97.33	99	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	90.94	77.91	96.25	98.67	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:22+00:00
🟦	Qwen/Qwen3-14B	90.83	79.98	96.17	96.33	false	true	?	preference-tuned	Original	apache-2.0	137	14.77	2025-05-12 12:17:12+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	90.8	76.81	96.42	99.17	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:05+00:00
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	90.66	75.65	96.67	99.67	true	true	?	preference-tuned	Original	llama3	339	70	2024-07-24 14:33:56+00:00
🟦	meta-llama/Meta-Llama-3-70B-Instruct	90.66	75.64	97	99.33	false	true	?	preference-tuned	Original	llama3	1417	70.55	2024-10-24 13:25:47+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	90.5	75.92	96.42	99.17	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:07:42+00:00
🟦	Qwen/Qwen2-72B-Instruct	90.45	74.98	97.08	99.29	false	true	?	preference-tuned	Original	other	675	72.71	2024-11-14 11:37:18+00:00
🟦	tiiuae/Falcon3-7B-Instruct	90.29	74.31	96.74	99.83	false	true	?	preference-tuned	Original	other	23	7.46	2024-12-19 05:59:29+00:00
🟦	oxyapi/oxy-1-small	90.1	76.06	96.58	97.67	false	true	?	preference-tuned	Original	apache-2.0	67	14.77	2024-12-10 07:27:22+00:00
⭕ 🏥	google/medgemma-4b-it	89.66	75.67	95.16	98.17	true	true	?	instruction-tuned	Original	other	106	4.3	2025-05-22 07:54:59+00:00
🟢	Qwen/Qwen2.5-7B	89.66	74.73	95.92	98.33	false	true	?	pretrained	Original	apache-2.0	67	7.62	2024-11-14 11:36:22+00:00
🟦	meta-llama/Llama-3.2-3B-Instruct	89.63	75.06	96.32	97.5	false	true	?	preference-tuned	Original	llama3.2	402	3.21	2024-10-24 06:23:04+00:00
🟦	Qwen/Qwen2.5-3B-Instruct	89.2	73.71	95.73	98.17	false	false	?	preference-tuned	Original	other	87	3.09	2024-11-18 11:36:42+00:00
🟢 🏥	OpenMeditron/Meditron3-70B	89.17	70.42	97.08	100	true	true	?	pretrained	Original	null	9	70.55	2024-11-11 13:58:37+00:00
🟦	tiiuae/Falcon3-3B-Instruct	88.91	74.73	94.17	97.83	false	true	?	preference-tuned	Original	other	12	3.23	2024-12-19 06:00:40+00:00
🟢	Qwen/Qwen2.5-72B	88.83	70.25	96.58	99.67	false	true	?	pretrained	Original	other	39	72.71	2024-11-14 11:37:02+00:00
🟦	Qwen/Qwen2.5-3B	87.06	69.11	94.74	97.33	false	false	?	preference-tuned	Original	other	26	3.09	2024-10-22 13:17:21+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	86.8	68.24	94.49	97.67	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:42+00:00
⭕	meta-llama/Llama-3.2-1B-Instruct	85.32	66.68	94.46	94.83	false	false	?	instruction-tuned	Original	llama3.2	430	1.24	2024-10-25 07:14:38+00:00
🟢	tiiuae/falcon-11B	84.18	57.53	96.67	98.33	false	false	?	pretrained	Original	unknown	210	11.1	2024-10-29 07:23:16+00:00
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	82.87	63.04	93.9	91.67	false	true	?	instruction-tuned	Original	apache-2.0	346	1.71	2024-11-22 10:44:37+00:00
🟦	tiiuae/Falcon3-1B-Instruct	82.57	64.56	92.31	90.83	false	true	?	preference-tuned	Original	other	21	1.67	2024-12-19 06:01:10+00:00
🟦	Qwen/Qwen3-0.6B	82.3	54	95.74	97.17	false	true	?	preference-tuned	Original	apache-2.0	91	0.75	2025-04-29 10:46:33+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	80.21	48.9	96.07	95.67	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:09:02+00:00
🟦	Qwen/Qwen2.5-0.5B-Instruct	79.82	52.29	94.33	92.83	false	false	?	preference-tuned	Original	apache-2.0	100	0.49	2024-11-18 11:36:27+00:00
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	78.54	44.37	98.08	93.17	false	true	?	preference-tuned	Original	apache-2.0	60	0.36	2024-12-10 08:36:15+00:00
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	71.44	47.33	95.67	71.33	true	false	?	pretrained	Original	llama3	37	8.03	2024-10-25 07:16:58+00:00
🟢	Qwen/Qwen2-0.5B	70	21.95	97.82	90.21	false	false	?	pretrained	Original	apache-2.0	101	0.49	2024-10-22 13:46:13+00:00
⭕	silma-ai/SILMA-9B-Instruct-v1.0	69.17	10.69	99.5	97.33	false	true	?	instruction-tuned	Original	gemma	44	9.24	2024-11-14 11:39:56+00:00
⭕ 🏥	google/medgemma-27b-text-it	64.52	26.82	92.08	74.67	true	true	?	instruction-tuned	Original	other	108	27.01	2025-05-22 07:54:44+00:00
🟦	ministral/Ministral-3b-instruct	56.67	1.58	99.33	69.08	false	true	?	preference-tuned	Original	apache-2.0	33	3.32	2024-12-10 10:39:41+00:00


⭕	google/gemini-2.5-flash-preview-04-17-thinking	95.76	93.52	97.53	96.24
⭕	deepseek-ai/DeepSeek-V3	94.23	89.33	97.45	95.92
🟢	Qwen/Qwen2.5-72B	93.92	86.72	98.56	96.48
⭕	microsoft/phi-4	93.89	88.91	97.32	95.44
🟦	Qwen/QwQ-32B-Preview	93.71	90.48	97.37	93.28
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	93.64	86.85	97.36	96.72
🟢 🏥	OpenMeditron/Meditron3-70B	93.34	85.57	98.28	96.16
⭕	openai/gpt-4.1	93.22	90.31	97.6	91.76
⭕	openai/gpt-4.1-mini	93.19	89.48	96.8	93.28
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	93.1	85.03	97.32	96.96
🟦	Qwen/Qwen2.5-72B-Instruct	92.91	88.25	97.44	93.04
🟦	Qwen/Qwen2-72B-Instruct	92.9	85.57	97.76	95.36
🟦	meta-llama/Llama-3.1-405B-Instruct	92.66	84.23	97.28	96.48
⭕	akjindal53244/Llama-3.1-Storm-8B	92.61	84.06	97.52	96.24
🟦	NousResearch/Hermes-3-Llama-3.1-8B	92.59	85.6	97.44	94.72
🟦 🏥	Qwen/Qwen3-8B	92.4	88.6	97.72	90.88
🟦	Qwen/Qwen2.5-7B-Instruct	92.16	86.99	97.2	92.3
⭕ 🏥	Intelligent-Internet/II-Medical-8B	92.03	89.27	97.15	89.68
🟦	Qwen/Qwen2.5-3B-Instruct	91.75	83.19	96.92	95.12
🟦	oxyapi/oxy-1-small	91.59	85.26	97.76	91.74
🟦 🏥	m42-health/Llama3-Med42-70B	91.58	88.45	96.04	90.24
🟦	meta-llama/Llama-3.3-70B-Instruct	91.55	80.48	97.68	96.48
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	91.53	82.86	97.92	93.82
🟦	Qwen/Qwen3-235B-A22B	91.25	86.46	96.4	90.88
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	91.2	86.28	97.12	90.2
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	91.2	88.55	96.02	89.02
🟦	meta-llama/Llama-3.1-70B-Instruct	91.01	81.26	97.76	94
🟦	Qwen/Qwen2.5-3B	90.84	76.57	98.04	97.92
🟦	meta-llama/Meta-Llama-3-70B-Instruct	90.65	77.57	98.08	96.32
⭕ 🏥	google/medgemma-4b-it	90.65	79.03	96.84	96.08
🟢	Qwen/Qwen2.5-7B	90.65	81.88	97.1	92.96
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	90.4	81.8	96.76	92.64
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	90.3	75.87	97.84	97.2
🟦	Qwen/Qwen3-30B-A3B	90.25	85.11	96.35	89.28
🟦	Qwen/Qwen3-14B	90.22	84.44	96.45	89.76
🟦	Qwen/Qwen3-32B	90.11	88.03	95.98	86.32
⭕	openai/o4-mini	90.05	91.36	96.57	82.22
🟦	meta-llama/Llama-3.1-8B-Instruct	90.04	81.43	97.41	91.28
⭕	CohereForAI/aya-expanse-32b	89.6	80.16	96.79	91.84
🟦	princeton-nlp/gemma-2-9b-it-SimPO	89.08	84.08	95.18	88
⭕	mistralai/Mistral-7B-Instruct-v0.3	88.82	78.15	97.04	91.26
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	88.69	78.99	96.83	90.24
⭕	mistralai/Mistral-Large-Instruct-2407	88.37	81.43	97.13	86.56
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	88.23	77.17	95.93	91.6
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	88.18	73.81	96.5	94.24
🟦	meta-llama/Llama-3.2-3B-Instruct	87.73	77	96.6	89.58
🟦	tiiuae/Falcon3-7B-Instruct	87.24	74.01	96.35	91.36
🟦	tiiuae/Falcon3-10B-Instruct	85.77	76.01	96.99	84.3
🟢	tiiuae/falcon-11B	85.77	66.64	97.94	92.72
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	84.92	69.69	96.04	89.04
⭕	meta-llama/Llama-3.2-1B-Instruct	84.33	64.5	97.14	91.36
⭕	silma-ai/SILMA-9B-Instruct-v1.0	83.99	56.97	97.82	97.2
🟦	tiiuae/Falcon3-3B-Instruct	82.4	63.68	96.17	87.36
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	81.02	53.96	97.1	92
🟦	Qwen/Qwen3-0.6B	80.88	56.37	96.35	89.92
🟢	Qwen/Qwen2-0.5B	80.24	49.69	97.83	93.2
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	79.33	52.03	96.69	89.28
🟦	tiiuae/Falcon3-1B-Instruct	76.66	47.84	96.58	85.58
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	73.7	35.97	98.16	86.96
🟦	Qwen/Qwen2.5-0.5B-Instruct	71.02	48.86	97.64	66.56
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	70.41	48.97	95.34	66.92
⭕ 🏥	google/medgemma-27b-text-it	62.93	25.01	95.84	67.94
🟦	ministral/Ministral-3b-instruct	55.79	1.4	99.56	66.41


⭕	google/gemini-2.5-flash-preview-04-17-thinking	95.76	93.52	97.53	96.24	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-07 16:20:53+00:00
⭕	deepseek-ai/DeepSeek-V3	94.23	89.33	97.45	95.92	false	true	?	instruction-tuned	Original	other	1980	685	2024-10-22 23:04:13+00:00
🟢	Qwen/Qwen2.5-72B	93.92	86.72	98.56	96.48	false	true	?	pretrained	Original	other	39	72.71	2024-11-14 11:37:02+00:00
⭕	microsoft/phi-4	93.89	88.91	97.32	95.44	false	true	?	instruction-tuned	Original	null	0	-1	2025-01-17 12:10:32+00:00
🟦	Qwen/QwQ-32B-Preview	93.71	90.48	97.37	93.28	false	true	?	preference-tuned	Original	apache-2.0	239	32.76	2024-11-28 04:57:07+00:00
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	93.64	86.85	97.36	96.72	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟢 🏥	OpenMeditron/Meditron3-70B	93.34	85.57	98.28	96.16	true	true	?	pretrained	Original	null	9	70.55	2024-11-11 13:58:37+00:00
⭕	openai/gpt-4.1	93.22	90.31	97.6	91.76	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:41:32+00:00
⭕	openai/gpt-4.1-mini	93.19	89.48	96.8	93.28	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:41:43+00:00
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	93.1	85.03	97.32	96.96	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟦	Qwen/Qwen2.5-72B-Instruct	92.91	88.25	97.44	93.04	false	true	?	preference-tuned	Original	other	343	72.71	2024-10-22 14:35:49+00:00
🟦	Qwen/Qwen2-72B-Instruct	92.9	85.57	97.76	95.36	false	true	?	preference-tuned	Original	other	675	72.71	2024-11-14 11:37:18+00:00
🟦	meta-llama/Llama-3.1-405B-Instruct	92.66	84.23	97.28	96.48	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
⭕	akjindal53244/Llama-3.1-Storm-8B	92.61	84.06	97.52	96.24	false	true	?	instruction-tuned	Original	llama3.1	164	8.03	2024-11-14 11:35:17+00:00
🟦	NousResearch/Hermes-3-Llama-3.1-8B	92.59	85.6	97.44	94.72	false	true	?	preference-tuned	Original	llama3	254	8.03	2024-12-10 09:38:34+00:00
🟦 🏥	Qwen/Qwen3-8B	92.4	88.6	97.72	90.88	true	true	?	preference-tuned	Original	apache-2.0	300	8.19	2025-05-20 11:36:36+00:00
🟦	Qwen/Qwen2.5-7B-Instruct	92.16	86.99	97.2	92.3	false	true	?	preference-tuned	Original	apache-2.0	274	7.62	2024-11-14 11:36:44+00:00
⭕ 🏥	Intelligent-Internet/II-Medical-8B	92.03	89.27	97.15	89.68	true	true	?	instruction-tuned	Original	null	42	8.19	2025-05-16 09:57:55+00:00
🟦	Qwen/Qwen2.5-3B-Instruct	91.75	83.19	96.92	95.12	false	false	?	preference-tuned	Original	other	87	3.09	2024-11-18 11:36:42+00:00
🟦	oxyapi/oxy-1-small	91.59	85.26	97.76	91.74	false	true	?	preference-tuned	Original	apache-2.0	67	14.77	2024-12-10 07:27:22+00:00
🟦 🏥	m42-health/Llama3-Med42-70B	91.58	88.45	96.04	90.24	true	true	?	preference-tuned	Original	llama3	34	70.55	2024-10-24 06:24:59+00:00
🟦	meta-llama/Llama-3.3-70B-Instruct	91.55	80.48	97.68	96.48	false	true	?	preference-tuned	Original	llama3.3	632	70.55	2024-12-09 09:10:34+00:00
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	91.53	82.86	97.92	93.82	false	true	?	instruction-tuned	Original	llama3.1	1149	70.55	2024-10-25 07:09:19+00:00
🟦	Qwen/Qwen3-235B-A22B	91.25	86.46	96.4	90.88	false	true	?	preference-tuned	Original	apache-2.0	352	235.09	2025-04-29 10:42:15+00:00
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	91.2	86.28	97.12	90.2	false	true	?	preference-tuned	Original	null	0	-1	2025-03-06 02:18:06+00:00
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	91.2	88.55	96.02	89.02	false	true	?	instruction-tuned	Original	apache-2.0	0	32.76	2025-05-19 12:37:03+00:00
🟦	meta-llama/Llama-3.1-70B-Instruct	91.01	81.26	97.76	94	false	true	?	preference-tuned	Original	llama3.1	617	70.55	2024-10-24 13:25:28+00:00
🟦	Qwen/Qwen2.5-3B	90.84	76.57	98.04	97.92	false	false	?	preference-tuned	Original	other	26	3.09	2024-10-22 13:17:21+00:00
🟦	meta-llama/Meta-Llama-3-70B-Instruct	90.65	77.57	98.08	96.32	false	true	?	preference-tuned	Original	llama3	1417	70.55	2024-10-24 13:25:47+00:00
⭕ 🏥	google/medgemma-4b-it	90.65	79.03	96.84	96.08	true	true	?	instruction-tuned	Original	other	106	4.3	2025-05-22 07:54:59+00:00
🟢	Qwen/Qwen2.5-7B	90.65	81.88	97.1	92.96	false	true	?	pretrained	Original	apache-2.0	67	7.62	2024-11-14 11:36:22+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	90.4	81.8	96.76	92.64	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:06:51+00:00
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	90.3	75.87	97.84	97.2	true	true	?	preference-tuned	Original	llama3	339	70	2024-07-24 14:33:56+00:00
🟦	Qwen/Qwen3-30B-A3B	90.25	85.11	96.35	89.28	false	true	?	preference-tuned	Original	apache-2.0	208	30.53	2025-04-29 10:45:32+00:00
🟦	Qwen/Qwen3-14B	90.22	84.44	96.45	89.76	false	true	?	preference-tuned	Original	apache-2.0	137	14.77	2025-05-12 12:17:12+00:00
🟦	Qwen/Qwen3-32B	90.11	88.03	95.98	86.32	false	true	?	preference-tuned	Original	apache-2.0	162	32.76	2025-04-29 10:45:55+00:00
⭕	openai/o4-mini	90.05	91.36	96.57	82.22	false	true	?	instruction-tuned	Original	null	-1	-1	2025-05-14 08:42:03+00:00
🟦	meta-llama/Llama-3.1-8B-Instruct	90.04	81.43	97.41	91.28	false	true	?	preference-tuned	Original	llama3.1	2845	8.03	2024-07-24 14:33:56+00:00
⭕	CohereForAI/aya-expanse-32b	89.6	80.16	96.79	91.84	false	true	?	instruction-tuned	Original	cc-by-nc-4.0	66	32.3	2024-10-25 07:13:05+00:00
🟦	princeton-nlp/gemma-2-9b-it-SimPO	89.08	84.08	95.18	88	false	true	?	preference-tuned	Original	mit	110	9.24	2024-10-25 07:11:14+00:00
⭕	mistralai/Mistral-7B-Instruct-v0.3	88.82	78.15	97.04	91.26	false	true	?	instruction-tuned	Original	apache-2.0	1131	7.25	2024-11-14 11:38:25+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	88.69	78.99	96.83	90.24	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:22+00:00
⭕	mistralai/Mistral-Large-Instruct-2407	88.37	81.43	97.13	86.56	false	true	?	instruction-tuned	Original	other	808	122.61	2024-11-25 11:27:40+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	88.23	77.17	95.93	91.6	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:05+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	88.18	73.81	96.5	94.24	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:07:42+00:00
🟦	meta-llama/Llama-3.2-3B-Instruct	87.73	77	96.6	89.58	false	true	?	preference-tuned	Original	llama3.2	402	3.21	2024-10-24 06:23:04+00:00
🟦	tiiuae/Falcon3-7B-Instruct	87.24	74.01	96.35	91.36	false	true	?	preference-tuned	Original	other	23	7.46	2024-12-19 05:59:29+00:00
🟦	tiiuae/Falcon3-10B-Instruct	85.77	76.01	96.99	84.3	false	true	?	preference-tuned	Original	other	40	10.31	2024-12-19 05:58:51+00:00
🟢	tiiuae/falcon-11B	85.77	66.64	97.94	92.72	false	false	?	pretrained	Original	unknown	210	11.1	2024-10-29 07:23:16+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	84.92	69.69	96.04	89.04	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:42+00:00
⭕	meta-llama/Llama-3.2-1B-Instruct	84.33	64.5	97.14	91.36	false	false	?	instruction-tuned	Original	llama3.2	430	1.24	2024-10-25 07:14:38+00:00
⭕	silma-ai/SILMA-9B-Instruct-v1.0	83.99	56.97	97.82	97.2	false	true	?	instruction-tuned	Original	gemma	44	9.24	2024-11-14 11:39:56+00:00
🟦	tiiuae/Falcon3-3B-Instruct	82.4	63.68	96.17	87.36	false	true	?	preference-tuned	Original	other	12	3.23	2024-12-19 06:00:40+00:00
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	81.02	53.96	97.1	92	false	true	?	instruction-tuned	Original	apache-2.0	346	1.71	2024-11-22 10:44:37+00:00
🟦	Qwen/Qwen3-0.6B	80.88	56.37	96.35	89.92	false	true	?	preference-tuned	Original	apache-2.0	91	0.75	2025-04-29 10:46:33+00:00
🟢	Qwen/Qwen2-0.5B	80.24	49.69	97.83	93.2	false	false	?	pretrained	Original	apache-2.0	101	0.49	2024-10-22 13:46:13+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	79.33	52.03	96.69	89.28	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:09:02+00:00
🟦	tiiuae/Falcon3-1B-Instruct	76.66	47.84	96.58	85.58	false	true	?	preference-tuned	Original	other	21	1.67	2024-12-19 06:01:10+00:00
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	73.7	35.97	98.16	86.96	false	true	?	preference-tuned	Original	apache-2.0	60	0.36	2024-12-10 08:36:15+00:00
🟦	Qwen/Qwen2.5-0.5B-Instruct	71.02	48.86	97.64	66.56	false	false	?	preference-tuned	Original	apache-2.0	100	0.49	2024-11-18 11:36:27+00:00
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	70.41	48.97	95.34	66.92	true	false	?	pretrained	Original	llama3	37	8.03	2024-10-25 07:16:58+00:00
⭕ 🏥	google/medgemma-27b-text-it	62.93	25.01	95.84	67.94	true	true	?	instruction-tuned	Original	other	108	27.01	2025-05-22 07:54:44+00:00
🟦	ministral/Ministral-3b-instruct	55.79	1.4	99.56	66.41	false	true	?	preference-tuned	Original	apache-2.0	33	3.32	2024-12-10 10:39:41+00:00


🟦	Qwen/Qwen3-235B-A22B	0.5	0.54	0.37	0.46	0.61	0.43	0.63	0.35
🟦	Qwen/Qwen3-32B	0.5	0.53	0.38	0.45	0.6	0.44	0.62	0.35
🟦	deepseek-ai/DeepSeek-R1	0.49	0.53	0.36	0.44	0.6	0.41	0.63	0.36
🟦	Qwen/Qwen3-4B	0.43	0.47	0.3	0.36	0.54	0.38	0.58	0.28
⭕	deepseek-ai/DeepSeek-V3	0.42	0.45	0.34	0.35	0.54	0.33	0.56	0.34
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	0.41	0.43	0.38	0.35	0.52	0.34	0.56	0.24
⭕	openai/gpt-4.1-mini	0.4	0.42	0.33	0.31	0.53	0.3	0.55	0.36
⭕	mistralai/Mistral-Large-Instruct-2407	0.35	0.38	0.27	0.25	0.46	0.26	0.51	0.27
⭕	microsoft/phi-4	0.34	0.38	0.31	0.24	0.45	0.25	0.5	0.25
⭕	CohereForAI/aya-expanse-32b	0.34	0.37	0.28	0.25	0.44	0.26	0.46	0.22
⭕	openai/gpt-4o-mini-2024-07-18	0.33	0.36	0.3	0.24	0.46	0.24	0.47	0.26
🟦 🏥	m42-health/Llama3-Med42-70B	0.33	0.38	0.26	0.24	0.44	0.24	0.46	0.25
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	0.32	0.35	0.32	0.23	0.45	0.22	0.46	0.26
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	0.32	0.34	0.3	0.24	0.43	0.22	0.46	0.24
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	0.31	0.34	0.24	0.22	0.45	0.25	0.45	0.19
🟦	meta-llama/Llama-3.1-70B-Instruct	0.29	0.3	0.28	0.19	0.41	0.19	0.46	0.24
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	0.26	0.29	0.23	0.15	0.38	0.16	0.4	0.2
🟦	meta-llama/Llama-3.2-3B-Instruct	0.26	0.28	0.2	0.16	0.37	0.17	0.42	0.16
🟢 🏥	OpenMeditron/Meditron3-70B	0.21	0.24	0.17	0.1	0.33	0.12	0.33	0.18
🟦	Qwen/Qwen3-0.6B	0.16	0.18	0.13	0.04	0.3	0.08	0.3	0.11


🟦	Qwen/Qwen3-235B-A22B	0.24	0.27	0.1	0.27	0.27	0.28	0.26	0.18
🟦	Qwen/Qwen3-32B	0.24	0.28	0.15	0.25	0.24	0.28	0.24	0.15
🟦	deepseek-ai/DeepSeek-R1	0.22	0.26	0.07	0.25	0.22	0.24	0.26	0.16
🟦	Qwen/Qwen3-4B	0.16	0.2	0.05	0.16	0.17	0.22	0.22	0.07
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	0.15	0.16	0.15	0.17	0.15	0.15	0.19	0.07
⭕	deepseek-ai/DeepSeek-V3	0.14	0.12	0.08	0.14	0.19	0.15	0.15	0.1
⭕	openai/gpt-4.1-mini	0.12	0.13	0.1	0.1	0.16	0.12	0.11	0.11
🟦 🏥	m42-health/Llama3-Med42-70B	0.07	0.12	0.02	0.04	0.11	0.1	0.07	0.07
⭕	CohereForAI/aya-expanse-32b	0.07	0.1	0.03	0.05	0.1	0.09	0.06	0.05
⭕	microsoft/phi-4	0.06	0.1	0.06	0.04	0.09	0.07	0.06	0.02
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	0.06	0.11	0	0.05	0.07	0.09	0.04	0
⭕	mistralai/Mistral-Large-Instruct-2407	0.05	0.09	0	0.03	0.08	0.09	0.1	0.02
⭕	openai/gpt-4o-mini-2024-07-18	0.05	0.06	0.02	0.03	0.1	0.08	0.07	0
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	0.05	0.05	0.05	0.04	0.07	0.06	0.04	0.06
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	0.05	0.08	0.1	0.03	0.08	0.02	0.03	0.03
🟦	meta-llama/Llama-3.1-70B-Instruct	0.02	0	0.04	0	0.07	0.02	0.05	0.01
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	0.01	0.03	0.02	0	0.06	0	0	0
🟦	meta-llama/Llama-3.2-3B-Instruct	0	0.02	0	0	0.04	0.01	0.02	0
🟦	Qwen/Qwen3-0.6B	0	0	0	0	0	0	0	0
🟢 🏥	OpenMeditron/Meditron3-70B	0	0.02	0	0	0.03	0	0	0


⭕ 🏥	winninghealth/WiNGPT2-Llama-3-8B-Chat	1	+0.002/-0.002
⭕ 🏥	winninghealth/WiNGPT2-Gemma-2-9B-Chat	1.01	+0.002/-0.002
🟦	Qwen/QwQ-32B-Preview	1.02	+0.005/-0.006
🟦	tiiuae/Falcon3-7B-Instruct	1.05	+0.006/-0.005
⭕	google/gemini-2.5-flash-preview-04-17	1.12	+0.01/-0.01
⭕	CohereForAI/aya-expanse-32b	1.12	+0.012/-0.012
🟦	Qwen/Qwen2-72B-Instruct	1.12	+0.01/-0.01
⭕	google/gemini-2.5-flash-preview-04-17-thinking	1.13	+0.011/-0.014
⭕	google/gemini-2.0-flash	1.13	+0.01/-0.009
🟦	tiiuae/Falcon3-1B-Instruct	1.13	+0.012/-0.012
⭕	openai/o4-mini	1.15	+0.011/-0.013
⭕	openai/gpt-4.1	1.16	+0.013/-0.013
🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	1.17	+0.012/-0.013
⭕	akjindal53244/Llama-3.1-Storm-8B	1.2	+0.016/-0.019
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	1.21	+0.011/-0.012
🟦	Qwen/Qwen2.5-3B-Instruct	1.21	+0.012/-0.015
⭕	meta-llama/Llama-3.2-1B-Instruct	1.22	+0.017/-0.017
⭕	openai/gpt-4.1-mini	1.23	+0.016/-0.012
🟦	princeton-nlp/gemma-2-9b-it-SimPO	1.23	+0.016/-0.014
🟦	meta-llama/Meta-Llama-3-70B-Instruct	1.23	+0.018/-0.017
🟦	Qwen/Qwen3-14B	1.24	+0.015/-0.014
🟦	Qwen/Qwen2.5-72B-Instruct	1.25	+0.013/-0.014
🟦	Qwen/Qwen3-235B-A22B	1.26	+0.018/-0.013
🟦	meta-llama/Llama-3.1-405B-Instruct	1.26	+0.019/-0.018
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	1.27	+0.023/-0.016
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	1.32	+0.017/-0.019
⭕	mistralai/Mistral-Large-Instruct-2407	1.32	+0.021/-0.019
🟦	meta-llama/Llama-3.3-70B-Instruct	1.33	+0.016/-0.018
⭕ 🏥	google/medgemma-27b-text-it	1.34	+0.016/-0.011
🟦 🏥	Qwen/Qwen3-8B	1.35	+0.018/-0.018
⭕	openai/gpt-4o-mini-2024-07-18	1.36	+0.021/-0.017
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	1.37	+0.02/-0.016
🟦 🏥	m42-health/Llama3-Med42-70B	1.39	+0.02/-0.024
🟦	Qwen/Qwen2.5-0.5B-Instruct	1.4	+0.018/-0.03
⭕	deepseek-ai/DeepSeek-V3	1.4	+0.018/-0.015
⭕	silma-ai/SILMA-9B-Instruct-v1.0	1.41	+0.027/-0.027
🟢	Qwen/Qwen2.5-72B	1.41	+0.026/-0.019
🟦	meta-llama/Llama-3.1-8B-Instruct	1.41	+0.022/-0.024
⭕ 🏥	Intelligent-Internet/II-Medical-8B	1.44	+0.024/-0.023
🟦	meta-llama/Llama-3.2-3B-Instruct	1.46	+0.023/-0.023
🟢 🏥	OpenMeditron/Meditron3-70B	1.47	+0.021/-0.019
⭕	01-ai/Yi-1.5-6B-Chat	1.49	+0.03/-0.022
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	1.5	+0.021/-0.016
🟦	NousResearch/Hermes-3-Llama-3.1-8B	1.51	+0.024/-0.031
⭕	upstage/SOLAR-10.7B-Instruct-v1.0	1.52	+0.021/-0.02
🟦	meta-llama/Llama-3.1-70B-Instruct	1.52	+0.021/-0.025
⭕	mistralai/Mistral-7B-Instruct-v0.3	1.54	+0.022/-0.027
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	1.69	+0.021/-0.02
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	1.71	+0.025/-0.025
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	1.89	+0.029/-0.029
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	1.92	+0.026/-0.022
🟦	NousResearch/Hermes-2-Pro-Llama-3-8B	2.08	+0.035/-0.051
🟦	Qwen/Qwen3-0.6B	2.12	+0.033/-0.029
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	2.19	+0.031/-0.039
🟦	ministral/Ministral-3b-instruct	2.42	+0.032/-0.035
🟢	tiiuae/falcon-11B	2.45	+0.029/-0.038
🟦	oxyapi/oxy-1-small	2.65	+0.042/-0.032
🟦	Qwen/Qwen2.5-3B	2.95	+0.042/-0.037


🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	80.86	91.71	69.33	76.62	83.11	88.07	73	84.15
🟢 🏥	OpenMeditron/Meditron3-70B	80.02	86.97	64	70.95	79.26	91.95	80.6	86.38
🟦	meta-llama/Llama-3.1-70B-Instruct	79.82	87.62	65.33	71.79	78.16	94.79	73.6	87.45
🟦	meta-llama/Llama-3.3-70B-Instruct	79.47	86.34	64.24	72.01	78.32	94.6	73	87.77
⭕	deepseek-ai/DeepSeek-V3	78.82	90.36	66.79	71.93	79.03	92.17	63.4	88.09
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	78.31	90.4	64.2	73.2	76.9	79	73.2	91.3
🟦 🏥	m42-health/Llama3-Med42-70B	78.28	86.49	61.09	72.82	79.42	83.73	79.2	85.21
🟦	Qwen/Qwen3-235B-A22B	78.11	88.25	65.09	70.98	80.75	89.06	69.8	82.87
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	77.02	86.51	65.09	69.69	75.26	83.27	74	85.32
🟦	meta-llama/Meta-Llama-3-70B-Instruct	76.95	85.73	62.91	72.15	78.08	84.73	67.4	87.66
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	75.73	83.04	58.91	65.91	74.47	88.48	71	88.3
🟦	Qwen/Qwen2.5-72B-Instruct	75.59	87.37	63.52	68.4	76.12	84.87	63.2	85.64
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	75.2	81.68	60.36	70.69	76.67	91.67	58.6	86.7
⭕	mistralai/Mistral-Large-Instruct-2407	75.03	87.65	66.67	68.25	75.81	85.93	52	88.94
🟢	Qwen/Qwen2.5-72B	75.02	86.98	62.06	67.32	75.49	84.94	74.8	73.51
⭕ 🏥	google/medgemma-27b-text-it	73.68	82.59	60.85	64.71	74.08	79.02	67.8	86.7
🟦	Qwen/Qwen2-72B-Instruct	72.6	85.17	61.45	67.97	72.9	75.47	61.6	83.62
🟦 🏥	baichuan-inc/Baichuan-M1-14B-Instruct	72.3	83.91	60.73	65.48	76.75	80.95	55.6	82.66
⭕	microsoft/phi-4	71.13	85.9	59.64	64.81	71.48	75.03	53	88.09
🟦	Qwen/Qwen3-32B	70.99	85.19	60.48	66.34	73.29	79.28	48.4	83.94
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	70.57	81.83	60.85	61.46	69.36	73.68	61.6	85.21
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	70.22	93.5	49.09	74.4	75.96	55.78	69	73.83
🟦	Qwen/QwQ-32B-Preview	69.85	83.15	59.15	63.73	69.52	75.84	49.6	87.98
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	69.78	84.93	60.12	63.47	70.15	75.82	46	87.98
🟦	Qwen/Qwen3-14B	69	83.45	57.58	63.61	68.26	68.41	57.2	84.47
🟦	oxyapi/oxy-1-small	68.8	80.47	53.09	59.12	65.59	70.69	70.4	82.23
⭕	google/gemma-3-27b-it	68.62	81.49	54.55	56.16	68.66	69.09	65.2	85.21
🟦	Qwen/Qwen3-30B-A3B	68.5	85.63	61.09	64.57	72.19	76.6	44	75.43
🟦	meta-llama/Llama-3.1-8B-Instruct	67.2	73.4	49.9	58.4	62	68.2	76.2	82.3
⭕	silma-ai/SILMA-9B-Instruct-v1.0	66.16	76.09	49.21	54.94	61.59	61.52	75.6	84.15
⭕	akjindal53244/Llama-3.1-Storm-8B	65.99	73.07	50.79	57.9	63	69.5	62.8	84.89
🟢	meta-llama/Llama-3.1-70B	65.56	82.44	60.36	65.14	75.88	81.39	15.6	78.09
⭕	CohereForAI/aya-expanse-32b	65.46	77.65	50.55	59.14	65.36	69.17	50	86.38
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	65.05	75.49	51.52	55.22	62.53	64.58	67	79.04
⭕ 🏥	winninghealth/WiNGPT2-Gemma-2-9B-Chat	64.08	83.21	48	58.43	69.91	60.43	53.8	74.79
🟦	meta-llama/Llama-3.2-3B-Instruct	63.2	67.76	38.06	52.81	55.77	74.41	70.6	82.98
⭕ 🏥	BiMediX/BiMediX-Bi	63	73.21	44.85	61.61	65.2	61.06	77.2	57.87
🟦 🏥	FractalAIResearch/Ramanujan-Ganit-R1-14B	62.48	75.65	49.21	54.98	61.35	63.59	71.4	61.17
🟦	tiiuae/Falcon3-10B-Instruct	61.78	75.39	48.61	54.86	59.07	63.31	50.6	80.64
🟦	princeton-nlp/gemma-2-9b-it-SimPO	61.14	76.81	48.48	52.93	59.62	59.4	49	81.7
🟦 🏥	Qwen/Qwen3-8B	60.92	78.43	53.7	58.67	63.39	68.24	33.6	70.43
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	60.28	76.52	47.64	56.75	60.41	61.33	47	72.34
🟦	Qwen/Qwen2.5-7B-Instruct	60.03	76.71	48	56.83	60.17	61.4	45.2	71.91
🟢	Qwen/Qwen2.5-7B	59.84	73.38	45.33	53.84	57.66	57.69	55.8	75.21
⭕	upstage/SOLAR-10.7B-Instruct-v1.0	59.38	69.46	36.61	46.71	52	58.56	65.6	86.7
⭕ 🏥	google/medgemma-4b-it	58.92	63.51	36.97	52.16	55.38	57.85	67.4	79.15
🟦	NousResearch/Hermes-2-Pro-Llama-3-8B	58.83	68.24	39.03	50.82	53.89	56.98	58.2	84.68
🟦	NousResearch/Hermes-3-Llama-3.1-8B	58.22	70.51	43.15	51.11	55.7	55.65	49	82.45
🟦	tiiuae/Falcon3-7B-Instruct	57.45	70.83	39.39	52.35	54.99	54.48	53.2	76.91
🟦	Qwen/Qwen3-4B	56.83	72.56	45.7	54.1	58.37	61.04	43.8	62.23
⭕	neulab/Pangea-7B	56.69	68.3	39.27	50.2	53.02	54.42	57	74.57
⭕	tiiuae/falcon-mamba-7b-instruct	55.01	65.23	35.03	45.85	45.95	41.85	69.2	81.91
⭕ 🏥	winninghealth/WiNGPT2-Llama-3-8B-Chat	53.72	72.5	40.24	53.41	61.67	52.22	22.8	73.19
🟢	tiiuae/falcon-mamba-7b	53.43	64.31	31.88	46.67	46.58	39.45	66.4	78.72
⭕	mistralai/Mistral-7B-Instruct-v0.3	53.42	65.06	34.79	46.31	49.25	50.63	45.8	82.13
🟦	Qwen/Qwen2.5-3B	52.89	66.89	34.3	49.32	47.92	48.04	67.8	55.96
🟦	tiiuae/Falcon3-Mamba-7B-Instruct	52.61	69.82	38.55	47	50.51	55.28	49	58.09
🟢	tiiuae/falcon-11B	51.9	62.45	27.88	43.49	43.91	44.54	58.4	82.66
⭕	01-ai/Yi-1.5-6B-Chat	50.45	64	35.15	42.55	44.23	46.93	61.8	58.51
🟢	meta-llama/Llama-3.1-8B	50.06	64.82	38.67	49.96	55.07	52.18	38	51.7
🟦	Qwen/Qwen2.5-3B-Instruct	49.62	67.99	34.79	49.15	48.78	51.55	29.2	65.85
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	48.41	59.48	26.42	42.22	44.3	41.85	57.8	66.81
🟦	tiiuae/Falcon3-3B-Instruct	45.74	57.52	24.97	42.29	41.24	42.85	42.4	68.94
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	42.39	50.61	18.67	37.37	36.06	28.19	69	56.81
⭕ 🏥	Intelligent-Internet/II-Medical-8B	41.36	46.38	35.39	38.78	31.26	37.14	31.4	69.15
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	39.87	50.65	22.91	36.19	33.15	30.81	48.6	56.81
⭕	meta-llama/Llama-3.2-1B-Instruct	39.41	46.79	21.94	36.53	37.71	45.61	30.4	56.91
🟦	tiiuae/Falcon3-1B-Instruct	37.97	45.15	15.27	35.43	33.23	28.61	49.6	58.51
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	34.6	31.17	13.21	32.54	29.69	24.09	55.2	56.28
🟢	Qwen/Qwen2-0.5B	34.56	38.88	12.24	29.12	29.93	18.46	56.4	56.91
🟦	Qwen/Qwen2.5-0.5B-Instruct	32.42	44.38	15.76	35.19	36.21	28.12	11	56.28
🟦	Qwen/Qwen3-0.6B	31.53	46.94	19.76	32.44	33.94	33.22	11.2	43.19
🟦	ministral/Ministral-3b-instruct	28.62	22.64	10.55	25.1	28.83	21.2	35.2	56.81
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	28.33	22.65	10.79	24.41	25.92	19.61	38	56.91



🟦	meta-llama/Llama-4-Maverick-17B-128E-Instruct	80.86	91.71	69.33	76.62	83.11	88.07	73	84.15	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟢 🏥	OpenMeditron/Meditron3-70B	80.02	86.97	64	70.95	79.26	91.95	80.6	86.38	true	true	?	pretrained	Original	null	9	70.55	2024-11-11 13:58:37+00:00
🟦	meta-llama/Llama-3.1-70B-Instruct	79.82	87.62	65.33	71.79	78.16	94.79	73.6	87.45	false	true	?	preference-tuned	Original	llama3.1	617	70.55	2024-10-24 13:25:28+00:00
🟦	meta-llama/Llama-3.3-70B-Instruct	79.47	86.34	64.24	72.01	78.32	94.6	73	87.77	false	true	?	preference-tuned	Original	llama3.3	632	70.55	2024-12-09 09:10:34+00:00
⭕	deepseek-ai/DeepSeek-V3	78.82	90.36	66.79	71.93	79.03	92.17	63.4	88.09	false	true	?	instruction-tuned	Original	other	1980	685	2024-10-22 23:04:13+00:00
🟦 🏥	aaditya/Llama3-OpenBioLLM-70B	78.31	90.4	64.2	73.2	76.9	79	73.2	91.3	true	true	?	preference-tuned	Original	llama3	339	70	2024-07-24 14:33:56+00:00
🟦 🏥	m42-health/Llama3-Med42-70B	78.28	86.49	61.09	72.82	79.42	83.73	79.2	85.21	true	true	?	preference-tuned	Original	llama3	34	70.55	2024-10-24 06:24:59+00:00
🟦	Qwen/Qwen3-235B-A22B	78.11	88.25	65.09	70.98	80.75	89.06	69.8	82.87	false	true	?	preference-tuned	Original	apache-2.0	352	235.09	2025-04-29 10:42:15+00:00
🟦	meta-llama/Llama-4-Scout-17B-16E-Instruct	77.02	86.51	65.09	69.69	75.26	83.27	74	85.32	false	true	?	preference-tuned	Original	null	0	-1	2025-01-20 10:32:48+00:00
🟦	meta-llama/Meta-Llama-3-70B-Instruct	76.95	85.73	62.91	72.15	78.08	84.73	67.4	87.66	false	true	?	preference-tuned	Original	llama3	1417	70.55	2024-10-24 13:25:47+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-70B	75.73	83.04	58.91	65.91	74.47	88.48	71	88.3	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:06:51+00:00
🟦	Qwen/Qwen2.5-72B-Instruct	75.59	87.37	63.52	68.4	76.12	84.87	63.2	85.64	false	true	?	preference-tuned	Original	other	343	72.71	2024-10-22 14:35:49+00:00
⭕	nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	75.2	81.68	60.36	70.69	76.67	91.67	58.6	86.7	false	true	?	instruction-tuned	Original	llama3.1	1149	70.55	2024-10-25 07:09:19+00:00
⭕	mistralai/Mistral-Large-Instruct-2407	75.03	87.65	66.67	68.25	75.81	85.93	52	88.94	false	true	?	instruction-tuned	Original	other	808	122.61	2024-11-25 11:27:40+00:00
🟢	Qwen/Qwen2.5-72B	75.02	86.98	62.06	67.32	75.49	84.94	74.8	73.51	false	true	?	pretrained	Original	other	39	72.71	2024-11-14 11:37:02+00:00
⭕ 🏥	google/medgemma-27b-text-it	73.68	82.59	60.85	64.71	74.08	79.02	67.8	86.7	true	true	?	instruction-tuned	Original	other	108	27.01	2025-05-22 07:54:44+00:00
🟦	Qwen/Qwen2-72B-Instruct	72.6	85.17	61.45	67.97	72.9	75.47	61.6	83.62	false	true	?	preference-tuned	Original	other	675	72.71	2024-11-14 11:37:18+00:00
🟦 🏥	baichuan-inc/Baichuan-M1-14B-Instruct	72.3	83.91	60.73	65.48	76.75	80.95	55.6	82.66	true	true	?	preference-tuned	Original	null	58	14.47	2025-05-19 07:03:03+00:00
⭕	microsoft/phi-4	71.13	85.9	59.64	64.81	71.48	75.03	53	88.09	false	true	?	instruction-tuned	Original	null	0	-1	2025-01-17 12:10:32+00:00
🟦	Qwen/Qwen3-32B	70.99	85.19	60.48	66.34	73.29	79.28	48.4	83.94	false	true	?	preference-tuned	Original	apache-2.0	162	32.76	2025-04-29 10:45:55+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	70.57	81.83	60.85	61.46	69.36	73.68	61.6	85.21	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:07:42+00:00
🟢 🏥	ProbeMedicalYonseiMAILab/medllama3-v20	70.22	93.5	49.09	74.4	75.96	55.78	69	73.83	true	false	?	pretrained	Original	llama3	37	8.03	2024-10-25 07:16:58+00:00
🟦	Qwen/QwQ-32B-Preview	69.85	83.15	59.15	63.73	69.52	75.84	49.6	87.98	false	true	?	preference-tuned	Original	apache-2.0	239	32.76	2024-11-28 04:57:07+00:00
⭕	chengang12345/Qwen2.5-32B-Instruct-FineTune	69.78	84.93	60.12	63.47	70.15	75.82	46	87.98	false	true	?	instruction-tuned	Original	apache-2.0	0	32.76	2025-05-19 12:37:03+00:00
🟦	Qwen/Qwen3-14B	69	83.45	57.58	63.61	68.26	68.41	57.2	84.47	false	true	?	preference-tuned	Original	apache-2.0	137	14.77	2025-05-12 12:17:12+00:00
🟦	oxyapi/oxy-1-small	68.8	80.47	53.09	59.12	65.59	70.69	70.4	82.23	false	true	?	preference-tuned	Original	apache-2.0	67	14.77	2024-12-10 07:27:22+00:00
⭕	google/gemma-3-27b-it	68.62	81.49	54.55	56.16	68.66	69.09	65.2	85.21	false	true	?	instruction-tuned	Original	gemma	1373	27.43	2025-05-23 10:26:40+00:00
🟦	Qwen/Qwen3-30B-A3B	68.5	85.63	61.09	64.57	72.19	76.6	44	75.43	false	true	?	preference-tuned	Original	apache-2.0	208	30.53	2025-04-29 10:45:32+00:00
🟦	meta-llama/Llama-3.1-8B-Instruct	67.2	73.4	49.9	58.4	62	68.2	76.2	82.3	false	true	?	preference-tuned	Original	llama3.1	2845	8.03	2024-07-24 14:33:56+00:00
⭕	silma-ai/SILMA-9B-Instruct-v1.0	66.16	76.09	49.21	54.94	61.59	61.52	75.6	84.15	false	true	?	instruction-tuned	Original	gemma	44	9.24	2024-11-14 11:39:56+00:00
⭕	akjindal53244/Llama-3.1-Storm-8B	65.99	73.07	50.79	57.9	63	69.5	62.8	84.89	false	true	?	instruction-tuned	Original	llama3.1	164	8.03	2024-11-14 11:35:17+00:00
🟢	meta-llama/Llama-3.1-70B	65.56	82.44	60.36	65.14	75.88	81.39	15.6	78.09	false	false	?	pretrained	Original	llama3.1	308	70.55	2024-11-14 11:33:15+00:00
⭕	CohereForAI/aya-expanse-32b	65.46	77.65	50.55	59.14	65.36	69.17	50	86.38	false	true	?	instruction-tuned	Original	cc-by-nc-4.0	66	32.3	2024-10-25 07:13:05+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	65.05	75.49	51.52	55.22	62.53	64.58	67	79.04	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:05+00:00
⭕ 🏥	winninghealth/WiNGPT2-Gemma-2-9B-Chat	64.08	83.21	48	58.43	69.91	60.43	53.8	74.79	true	true	?	instruction-tuned	Original	apache-2.0	2	9.24	2025-05-19 05:20:13+00:00
🟦	meta-llama/Llama-3.2-3B-Instruct	63.2	67.76	38.06	52.81	55.77	74.41	70.6	82.98	false	true	?	preference-tuned	Original	llama3.2	402	3.21	2024-10-24 06:23:04+00:00
⭕ 🏥	BiMediX/BiMediX-Bi	63	73.21	44.85	61.61	65.2	61.06	77.2	57.87	true	true	?	instruction-tuned	Original	cc-by-nc-sa-4.0	3	0	2024-11-25 07:11:28+00:00
🟦 🏥	FractalAIResearch/Ramanujan-Ganit-R1-14B	62.48	75.65	49.21	54.98	61.35	63.59	71.4	61.17	true	true	?	preference-tuned	Original	mit	0	14.77	2025-05-20 11:29:52+00:00
🟦	tiiuae/Falcon3-10B-Instruct	61.78	75.39	48.61	54.86	59.07	63.31	50.6	80.64	false	true	?	preference-tuned	Original	other	40	10.31	2024-12-19 05:58:51+00:00
🟦	princeton-nlp/gemma-2-9b-it-SimPO	61.14	76.81	48.48	52.93	59.62	59.4	49	81.7	false	true	?	preference-tuned	Original	mit	110	9.24	2024-10-25 07:11:14+00:00
🟦 🏥	Qwen/Qwen3-8B	60.92	78.43	53.7	58.67	63.39	68.24	33.6	70.43	true	true	?	preference-tuned	Original	apache-2.0	300	8.19	2025-05-20 11:36:36+00:00
🟦	newsbang/Homer-v1.0-Qwen2.5-7B	60.28	76.52	47.64	56.75	60.41	61.33	47	72.34	false	true	?	preference-tuned	Original	null	0	-1	2025-03-06 02:18:06+00:00
🟦	Qwen/Qwen2.5-7B-Instruct	60.03	76.71	48	56.83	60.17	61.4	45.2	71.91	false	true	?	preference-tuned	Original	apache-2.0	274	7.62	2024-11-14 11:36:44+00:00
🟢	Qwen/Qwen2.5-7B	59.84	73.38	45.33	53.84	57.66	57.69	55.8	75.21	false	true	?	pretrained	Original	apache-2.0	67	7.62	2024-11-14 11:36:22+00:00
⭕	upstage/SOLAR-10.7B-Instruct-v1.0	59.38	69.46	36.61	46.71	52	58.56	65.6	86.7	false	true	?	instruction-tuned	Original	cc-by-nc-4.0	613	10.73	2024-10-22 22:52:54+00:00
⭕ 🏥	google/medgemma-4b-it	58.92	63.51	36.97	52.16	55.38	57.85	67.4	79.15	true	true	?	instruction-tuned	Original	other	106	4.3	2025-05-22 07:54:59+00:00
🟦	NousResearch/Hermes-2-Pro-Llama-3-8B	58.83	68.24	39.03	50.82	53.89	56.98	58.2	84.68	false	true	?	preference-tuned	Original	llama3	411	8.03	2024-12-10 10:10:16+00:00
🟦	NousResearch/Hermes-3-Llama-3.1-8B	58.22	70.51	43.15	51.11	55.7	55.65	49	82.45	false	true	?	preference-tuned	Original	llama3	254	8.03	2024-12-10 09:38:34+00:00
🟦	tiiuae/Falcon3-7B-Instruct	57.45	70.83	39.39	52.35	54.99	54.48	53.2	76.91	false	true	?	preference-tuned	Original	other	23	7.46	2024-12-19 05:59:29+00:00
🟦	Qwen/Qwen3-4B	56.83	72.56	45.7	54.1	58.37	61.04	43.8	62.23	false	true	?	preference-tuned	Original	apache-2.0	79	4.02	2025-04-29 10:46:23+00:00
⭕	neulab/Pangea-7B	56.69	68.3	39.27	50.2	53.02	54.42	57	74.57	false	true	?	instruction-tuned	Original	apache-2.0	76	7.94	2024-10-25 09:59:22+00:00
⭕	tiiuae/falcon-mamba-7b-instruct	55.01	65.23	35.03	45.85	45.95	41.85	69.2	81.91	false	true	?	instruction-tuned	Original	other	64	7.27	2024-11-18 11:47:16+00:00
⭕ 🏥	winninghealth/WiNGPT2-Llama-3-8B-Chat	53.72	72.5	40.24	53.41	61.67	52.22	22.8	73.19	true	true	?	instruction-tuned	Original	apache-2.0	5	8.03	2025-05-19 05:21:32+00:00
🟢	tiiuae/falcon-mamba-7b	53.43	64.31	31.88	46.67	46.58	39.45	66.4	78.72	false	false	?	pretrained	Original	other	211	7.27	2024-10-29 07:20:18+00:00
⭕	mistralai/Mistral-7B-Instruct-v0.3	53.42	65.06	34.79	46.31	49.25	50.63	45.8	82.13	false	true	?	instruction-tuned	Original	apache-2.0	1131	7.25	2024-11-14 11:38:25+00:00
🟦	Qwen/Qwen2.5-3B	52.89	66.89	34.3	49.32	47.92	48.04	67.8	55.96	false	false	?	preference-tuned	Original	other	26	3.09	2024-10-22 13:17:21+00:00
🟦	tiiuae/Falcon3-Mamba-7B-Instruct	52.61	69.82	38.55	47	50.51	55.28	49	58.09	false	true	?	preference-tuned	Original	other	13	7.27	2024-12-19 06:00:16+00:00
🟢	tiiuae/falcon-11B	51.9	62.45	27.88	43.49	43.91	44.54	58.4	82.66	false	false	?	pretrained	Original	unknown	210	11.1	2024-10-29 07:23:16+00:00
⭕	01-ai/Yi-1.5-6B-Chat	50.45	64	35.15	42.55	44.23	46.93	61.8	58.51	false	true	?	instruction-tuned	Original	apache-2.0	41	6.06	2024-10-22 23:04:13+00:00
🟢	meta-llama/Llama-3.1-8B	50.06	64.82	38.67	49.96	55.07	52.18	38	51.7	false	false	?	pretrained	Original	llama3.1	1068	8.03	2024-11-14 07:33:20+00:00
🟦	Qwen/Qwen2.5-3B-Instruct	49.62	67.99	34.79	49.15	48.78	51.55	29.2	65.85	false	false	?	preference-tuned	Original	other	87	3.09	2024-11-18 11:36:42+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Llama-8B	48.41	59.48	26.42	42.22	44.3	41.85	57.8	66.81	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:22+00:00
🟦	tiiuae/Falcon3-3B-Instruct	45.74	57.52	24.97	42.29	41.24	42.85	42.4	68.94	false	true	?	preference-tuned	Original	other	12	3.23	2024-12-19 06:00:40+00:00
⭕	HuggingFaceTB/SmolLM2-1.7B-Instruct	42.39	50.61	18.67	37.37	36.06	28.19	69	56.81	false	true	?	instruction-tuned	Original	apache-2.0	346	1.71	2024-11-22 10:44:37+00:00
⭕ 🏥	Intelligent-Internet/II-Medical-8B	41.36	46.38	35.39	38.78	31.26	37.14	31.4	69.15	true	true	?	instruction-tuned	Original	null	42	8.19	2025-05-16 09:57:55+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	39.87	50.65	22.91	36.19	33.15	30.81	48.6	56.81	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:08:42+00:00
⭕	meta-llama/Llama-3.2-1B-Instruct	39.41	46.79	21.94	36.53	37.71	45.61	30.4	56.91	false	false	?	instruction-tuned	Original	llama3.2	430	1.24	2024-10-25 07:14:38+00:00
🟦	tiiuae/Falcon3-1B-Instruct	37.97	45.15	15.27	35.43	33.23	28.61	49.6	58.51	false	true	?	preference-tuned	Original	other	21	1.67	2024-12-19 06:01:10+00:00
🟦	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	34.6	31.17	13.21	32.54	29.69	24.09	55.2	56.28	false	true	?	preference-tuned	Original	null	0	-1	2025-01-22 17:09:02+00:00
🟢	Qwen/Qwen2-0.5B	34.56	38.88	12.24	29.12	29.93	18.46	56.4	56.91	false	false	?	pretrained	Original	apache-2.0	101	0.49	2024-10-22 13:46:13+00:00
🟦	Qwen/Qwen2.5-0.5B-Instruct	32.42	44.38	15.76	35.19	36.21	28.12	11	56.28	false	false	?	preference-tuned	Original	apache-2.0	100	0.49	2024-11-18 11:36:27+00:00
🟦	Qwen/Qwen3-0.6B	31.53	46.94	19.76	32.44	33.94	33.22	11.2	43.19	false	true	?	preference-tuned	Original	apache-2.0	91	0.75	2025-04-29 10:46:33+00:00
🟦	ministral/Ministral-3b-instruct	28.62	22.64	10.55	25.1	28.83	21.2	35.2	56.81	false	true	?	preference-tuned	Original	apache-2.0	33	3.32	2024-12-10 10:39:41+00:00
🟦	HuggingFaceTB/SmolLM2-360M-Instruct	28.33	22.65	10.79	24.41	25.92	19.61	38	56.91	false	true	?	preference-tuned	Original	apache-2.0	60	0.36	2024-12-10 08:36:15+00:00



🟦	Qwen/Qwen2.5-72B-Instruct	78.79	76.88	81.13	81.2	81.86	78.8	72.89
🟦	meta-llama/Llama-3.3-70B-Instruct	78.62	73.42	80.66	80.86	80.86	78.94	77.01
🟦 🏥	m42-health/Llama3-Med42-70B	75.29	69.3	78.07	77.74	78.54	76.68	71.43
⭕	google/gemma-3-27b-it	73.03	68.24	74.68	74.75	74.35	73.62	72.56
⭕	CohereForAI/aya-expanse-32b	69.44	64.32	71.16	70.7	71.63	70.9	67.91
⭕	mistralai/Mistral-7B-Instruct-v0.3	47.42	33.02	55.95	54.49	52.43	53.22	35.42



🟦	Qwen/Qwen2.5-72B-Instruct	78.79	76.88	81.13	81.2	81.86	78.8	72.89	false	true	?	preference-tuned	Original	other	343	72.71	2024-10-22 14:35:49+00:00
🟦	meta-llama/Llama-3.3-70B-Instruct	78.62	73.42	80.66	80.86	80.86	78.94	77.01	false	true	?	preference-tuned	Original	llama3.3	632	70.55	2024-12-09 09:10:34+00:00
🟦 🏥	m42-health/Llama3-Med42-70B	75.29	69.3	78.07	77.74	78.54	76.68	71.43	true	true	?	preference-tuned	Original	llama3	34	70.55	2024-10-24 06:24:59+00:00
⭕	google/gemma-3-27b-it	73.03	68.24	74.68	74.75	74.35	73.62	72.56	false	true	?	instruction-tuned	Original	gemma	1373	27.43	2025-05-23 10:26:40+00:00
⭕	CohereForAI/aya-expanse-32b	69.44	64.32	71.16	70.7	71.63	70.9	67.91	false	true	?	instruction-tuned	Original	cc-by-nc-4.0	66	32.3	2024-10-25 07:13:05+00:00
⭕	mistralai/Mistral-7B-Instruct-v0.3	47.42	33.02	55.95	54.49	52.43	53.22	35.42	false	true	?	instruction-tuned	Original	apache-2.0	1131	7.25	2024-11-14 11:38:25+00:00

About

The MEDIC Leaderboard evaluates large language models (LLMs) on various healthcare tasks across five key dimensions. Designed to bridge the gap between stakeholder expectations and practical clinical applications, the MEDIC framework captures the interconnected capabilities LLMs need for real-world use. Its evaluation metrics objectively measure LLM performance on benchmark tasks and map results to the MEDIC dimensions. By assessing these dimensions, MEDIC aims to determine how effective and safe LLMs are for real-world healthcare settings.

Evaluation Categories

Close-ended Questions

This category measures the accuracy of an LLM's medical knowledge by having it answer multiple-choice questions from datasets like MedQA, MedMCQA, MMLU, MMLU Pro, PubMedQA, USMLE and Toxigen.

We used EleutherAI's Evaluation Harness framework, which scores a model by the likelihood it assigns to each proposed answer rather than by evaluating the generated text directly. We modified the framework's codebase to provide more detailed and relevant results: rather than calculating only the probability of generating the answer choice labels (e.g., a., b., c., or d.), we calculate the probability of generating the full answer text.
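The likelihood-based selection described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function names and the example log-likelihood values are hypothetical, and in practice the per-answer log-likelihoods would come from the evaluated model.

```python
import math


def pick_answer(choice_logprobs: dict[str, float]) -> str:
    """Return the answer text with the highest log-likelihood."""
    return max(choice_logprobs, key=choice_logprobs.get)


def choice_probabilities(choice_logprobs: dict[str, float]) -> dict[str, float]:
    """Normalise log-likelihoods into probabilities over the candidate answers."""
    top = max(choice_logprobs.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {text: math.exp(lp - top) for text, lp in choice_logprobs.items()}
    total = sum(weights.values())
    return {text: w / total for text, w in weights.items()}


# Hypothetical log-likelihoods of each full answer text given the question.
example = {
    "full text of option a": -2.1,
    "full text of option b": -4.7,
    "full text of option c": -6.3,
    "full text of option d": -5.9,
}

predicted = pick_answer(example)
probabilities = choice_probabilities(example)
```

The model is counted as correct when `predicted` matches the reference answer text.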

Open-ended Questions

This category assesses the quality of the LLM's reasoning and explanations. The LLM is tasked with answering open-ended medical questions from various datasets:

Each question is presented to the models without special prompting to test their baseline capabilities. To compare models, we use a tournament-style approach. A judge (Llama 3.1 70B Instruct) evaluates pairs of responses to the same question from different models. To eliminate position bias, each comparison is performed twice with the response positions reversed. If the winner changes when the positions are swapped, we consider the responses too close and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models. Note that this evaluation assesses only the quality of response writing, not medical accuracy. Properly evaluating clinical accuracy would require a thorough study involving healthcare professionals.
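The position-swap rule and the Elo conversion can be sketched as below. The page does not specify how win rates are turned into Elo ratings, so this sketch assumes the standard sequential Elo update. The function names, the verdict labels, and the K-factor of 16 are illustrative assumptions.

```python
def resolve_pair(verdict_a_first: str, verdict_b_first: str) -> str:
    """Combine the judge's two verdicts for one pair of responses.

    Each verdict names the winning model ("A" or "B"). The first comes from
    the run with A's response shown first, the second from the run with B's
    response shown first. Disagreement between the runs counts as a tie.
    """
    return verdict_a_first if verdict_a_first == verdict_b_first else "tie"


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of model A against model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, outcome: str, k: float = 16.0):
    """Return updated (rating_a, rating_b) after one resolved comparison."""
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[outcome]
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# The judge prefers A in both orderings, so A wins the comparison.
outcome = resolve_pair("A", "A")
new_a, new_b = elo_update(1500.0, 1500.0, outcome)
```

When the two orderings disagree, `resolve_pair` returns `"tie"` and two equally rated models keep their ratings unchanged.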

Medical Safety

The Medical Safety category uses the "med-safety" benchmark dataset, which consists of 4500 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA). In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.

Medical Summarization

This category evaluates the LLM's ability to summarize medical texts, with a focus on clinical trial descriptions from ClinicalTrials.gov. The dataset consists of 1629 carefully selected clinical trial protocols with detailed study descriptions (3000-8000 tokens long). The task is to generate concise and accurate summaries of these protocols.

It uses a novel "cross-examination" framework, in which questions are generated from both the original document and the LLM's summary and then used to score the summary. Four key scores are calculated:

Coverage: Measures how thoroughly the summary covers the original document. A higher score means the summary includes more details from the original.
Conformity: Also called the non-contradiction score, this checks if the summary avoids contradicting the original document. A higher score means the summary aligns better with the original.
Consistency: Measures the level of non-hallucination, or how much the summary sticks to the facts in the document. A higher score means the summary is more factual and accurate.
Conciseness: Measures how brief the summary is. A higher score means the summary is more concise. A negative score means the summary is longer than the original document.
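As a rough illustration of two of these scores, the sketch below assumes that conciseness is one minus the ratio of summary length to document length, which matches the statement that a summary longer than the original yields a negative score. It also assumes that coverage is the fraction of document-derived questions that can be answered from the summary. Both formulas and all names are assumptions, not the leaderboard's published definitions.

```python
def conciseness(summary_tokens: int, document_tokens: int) -> float:
    """Assumed formula: 1 - len(summary) / len(document).

    Positive when the summary is shorter than the document and negative
    when it is longer.
    """
    return 1.0 - summary_tokens / document_tokens


def coverage(answerable_from_summary: list[bool]) -> float:
    """Assumed formula: share of document-derived questions that the
    summary can answer. Each flag is one question's yes/no outcome."""
    if not answerable_from_summary:
        return 0.0
    return sum(answerable_from_summary) / len(answerable_from_summary)


short_summary = conciseness(summary_tokens=500, document_tokens=4000)
overlong_summary = conciseness(summary_tokens=5000, document_tokens=4000)
covered = coverage([True, True, False, True])
```

Under these assumptions, a 500-token summary of a 4000-token protocol is rewarded for brevity, while a 5000-token "summary" is penalised with a negative conciseness score.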

Note Generation

This category assesses the LLM's ability to generate structured clinical notes from doctor-patient conversations. It uses the same cross-examination framework as Medical Summarization across two datasets:

ACI-Bench: A comprehensive collection designed specifically for benchmarking clinical note generation from doctor-patient dialogues. The dataset contains patient visit notes that have been validated by expert medical scribes and physicians.
SOAP Notes: The test split of the ChartNote dataset, which contains 250 synthetic patient-doctor conversations generated from real clinical notes. The task is to generate notes in the SOAP format with the following sections:
- Subjective: Patient's description of symptoms, medical history, and personal experiences
- Objective: Observable data like physical exam findings, vital signs, and diagnostic test results
- Assessment: Healthcare provider's diagnosis based on subjective and objective information
- Plan: Treatment plan including medications, therapies, follow-ups, and referrals
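The four SOAP sections map naturally onto a small record type. The sketch below is only an illustration of the target structure: the `SoapNote` class, its `render` method, and the placeholder strings are hypothetical and are not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass
class SoapNote:
    """Container for the four sections of a SOAP clinical note."""

    subjective: str  # patient's own account of symptoms and history
    objective: str   # exam findings, vital signs, test results
    assessment: str  # provider's diagnosis
    plan: str        # medications, therapies, follow-ups, referrals

    def render(self) -> str:
        """Format the note as plain text with one titled block per section."""
        sections = [
            ("Subjective", self.subjective),
            ("Objective", self.objective),
            ("Assessment", self.assessment),
            ("Plan", self.plan),
        ]
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


# Placeholder content only; real notes are generated from the conversation.
note = SoapNote(
    subjective="(patient-reported symptoms)",
    objective="(exam findings and vitals)",
    assessment="(diagnosis)",
    plan="(treatment and follow-up)",
)
rendered = note.render()
```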


01-ai/Yi-1.5-6B-Chat	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	LOW CONTEXT LENGTH	LOW CONTEXT LENGTH
BiMediX/BiMediX-Bi	main	false	instruction-tuned	auto	Original	FINISHED	CHAT TEMPLATE ISSUE	CHAT TEMPLATE ISSUE	CHAT TEMPLATE ISSUE	CHAT TEMPLATE ISSUE
CohereForAI/aya-expanse-32b	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
HuggingFaceTB/SmolLM2-1.7B-Instruct	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
HuggingFaceTB/SmolLM2-360M-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Intelligent-Internet/II-Medical-8B	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
NousResearch/Hermes-2-Pro-Llama-3-8B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
NousResearch/Hermes-3-Llama-3.1-8B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
OpenMeditron/Meditron3-70B	main	false	pretrained	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
ProbeMedicalYonseiMAILab/medllama3-v20	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/QwQ-32B-Preview	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen2-0.5B	main	false	pretrained	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen2-72B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen2.5-0.5B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen2.5-3B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen2.5-3B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen2.5-72B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen2.5-72B	main	false	pretrained	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen2.5-7B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen2.5-7B	main	false	pretrained	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen3-0.6B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen3-14B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen3-235B-A22B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen3-30B-A3B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen3-32B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen3-4B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
Qwen/Qwen3-8B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
aaditya/Llama3-OpenBioLLM-70B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
akjindal53244/Llama-3.1-Storm-8B	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
chengang12345/Qwen2.5-32B-Instruct-FineTune	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
deepseek-ai/DeepSeek-R1-Distill-Llama-70B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
deepseek-ai/DeepSeek-R1-Distill-Llama-8B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
deepseek-ai/DeepSeek-V3	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
google/gemini-2.5-flash-preview-04-17-thinking	main	true	instruction-tuned	auto	Original	PRIVATE MODEL	FINISHED	FINISHED	FINISHED	FINISHED
google/medgemma-27b-text-it	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
google/medgemma-4b-it	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
m42-health/Llama3-Med42-70B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
meta-llama/Llama-3.1-70B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
meta-llama/Llama-3.1-70B	main	false	pretrained	auto	Original	FINISHED	CHAT TEMPLATE MISSING	CHAT TEMPLATE MISSING	CHAT TEMPLATE MISSING	CHAT TEMPLATE MISSING
meta-llama/Llama-3.1-8B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
meta-llama/Llama-3.1-8B	main	false	pretrained	auto	Original	FINISHED	CHAT TEMPLATE MISSING	CHAT TEMPLATE MISSING	CHAT TEMPLATE MISSING	CHAT TEMPLATE MISSING
meta-llama/Llama-3.2-1B-Instruct	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
meta-llama/Llama-3.2-3B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
meta-llama/Llama-3.3-70B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
meta-llama/Llama-4-Scout-17B-16E-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
meta-llama/Meta-Llama-3-70B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
microsoft/phi-4	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
ministral/Ministral-3b-instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
mistralai/Mistral-7B-Instruct-v0.3	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
mistralai/Mistral-Large-Instruct-2407	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
neulab/Pangea-7B	main	false	instruction-tuned	auto	Original	FINISHED	VLLM NOT SUPPORTED	VLLM NOT SUPPORTED	VLLM NOT SUPPORTED	VLLM NOT SUPPORTED
newsbang/Homer-v1.0-Qwen2.5-7B	main	false	preference-tuned	bfloat16	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
openai/gpt-4.1-mini	main	true	instruction-tuned	auto	Original	PRIVATE MODEL	FINISHED	FINISHED	FINISHED	FINISHED
openai/gpt-4.1	main	true	instruction-tuned	auto	Original	PRIVATE MODEL	FINISHED	FINISHED	FINISHED	FINISHED
openai/o4-mini	main	true	instruction-tuned	auto	Original	PRIVATE MODEL	FINISHED	FINISHED	FINISHED	FINISHED
oxyapi/oxy-1-small	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
princeton-nlp/gemma-2-9b-it-SimPO	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
silma-ai/SILMA-9B-Instruct-v1.0	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
tiiuae/Falcon3-10B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
tiiuae/Falcon3-1B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
tiiuae/Falcon3-3B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
tiiuae/Falcon3-7B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
tiiuae/Falcon3-Mamba-7B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	LOW CONTEXT LENGTH	LOW CONTEXT LENGTH	LOW CONTEXT LENGTH	LOW CONTEXT LENGTH
tiiuae/falcon-11B	main	false	pretrained	auto	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
tiiuae/falcon-mamba-7b-instruct	main	false	instruction-tuned	auto	Original	FINISHED	LOW CONTEXT LENGTH	LOW CONTEXT LENGTH	LOW CONTEXT LENGTH	LOW CONTEXT LENGTH
tiiuae/falcon-mamba-7b	main	false	pretrained	auto	Original	FINISHED	CHAT TEMPLATE MISSING	CHAT TEMPLATE MISSING	CHAT TEMPLATE MISSING	CHAT TEMPLATE MISSING
upstage/SOLAR-10.7B-Instruct-v1.0	main	false	instruction-tuned	auto	Original	FINISHED	FINISHED	FINISHED	LOW CONTEXT LENGTH	LOW CONTEXT LENGTH
winninghealth/WiNGPT2-Gemma-2-9B-Chat	main	false	instruction-tuned	bfloat16	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED
winninghealth/WiNGPT2-Llama-3-8B-Chat	main	false	instruction-tuned	bfloat16	Original	FINISHED	FINISHED	FINISHED	FINISHED	FINISHED


FractalAIResearch/Ramanujan-Ganit-R1-14B	main	false	preference-tuned	auto	Original	FINISHED	FINISHED	RUNNING	FINISHED	FINISHED
google/gemma-3-4b-it	main	false	instruction-tuned	auto	Original	RUNNING	PENDING	PENDING	PENDING	PENDING


baichuan-inc/Baichuan-M1-14B-Instruct	main	false	preference-tuned	auto	Original	FINISHED	PENDING	PENDING	PENDING	PENDING
google/gemini-2.0-flash	main	true	instruction-tuned	auto	Original	PRIVATE MODEL	FINISHED	FINISHED	RERUN	RERUN
google/gemini-2.5-flash-preview-04-17	main	true	instruction-tuned	auto	Original	PRIVATE MODEL	FINISHED	FINISHED	RERUN	RERUN
google/gemma-3-27b-it	main	false	instruction-tuned	auto	Original	PENDING	PENDING	PENDING	PENDING	PENDING
meta-llama/Llama-3.1-405B-Instruct	main	false	preference-tuned	auto	Original	RERUN	FINISHED	FINISHED	FINISHED	FINISHED
meta-llama/Llama-4-Maverick-17B-128E-Instruct	main	false	preference-tuned	auto	Original	RERUN	FINISHED	FINISHED	FINISHED	FINISHED
openai/gpt-4o-mini-2024-07-18	main	true	instruction-tuned	auto	Original	PRIVATE MODEL	FINISHED	FINISHED	RERUN	RERUN

Submission Guide for the MEDIC Benchmark

First Steps Before Submitting a Model

1. Ensure Your Model Loads with AutoClasses

2. Convert Weights to Safetensors

3. Complete Your Model Card

4. Select the correct model type

5. Select Correct Precision

6. Medically oriented model

7. Chat template

✉️✨ Submit your model here!