
Deploying a clinical LLM requires more than acing closed-ended medical QA exams. The model must be safe, ethical, comprehensive in its responses, and capable of reasoning through complex medical tasks. The MEDIC framework aims to provide a transparent and comprehensive evaluation of LLM performance across clinically relevant dimensions.
Disclaimer: It is important to note that the purpose of this evaluation is purely academic and exploratory. The models assessed here have not been approved for clinical use, and their results should not be interpreted as clinically validated. The leaderboard serves as a platform for researchers to compare models, understand their strengths and limitations, and drive further advancements in the field of clinical NLP.
Note: Llama 3.1 70B Instruct was used as the judge for English.
⭕ | 2052 | +26/-20 | 8.91 | +0.01/-0.01 | false | true | ? | instruction-tuned | Original | llama3.1 | 1149 | 70.55 | 2024-10-25 07:09:19+00:00 | |
🟦 | 1975 | +22/-22 | 8.87 | +0.02/-0.02 | false | true | ? | preference-tuned | Original | apache-2.0 | 352 | 235.09 | 2025-04-29 10:42:15+00:00 | |
🟦 | 1975 | +24/-19 | 8.87 | +0.02/-0.02 | false | true | ? | preference-tuned | Original | apache-2.0 | 162 | 32.76 | 2025-04-29 10:45:55+00:00 | |
⭕ | 1920 | +24/-25 | 8.75 | +0.03/-0.03 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:42:03+00:00 | |
🟦 | 1874 | +25/-22 | 8.73 | +0.02/-0.02 | false | true | ? | preference-tuned | Original | apache-2.0 | 208 | 30.53 | 2025-04-29 10:45:32+00:00 | |
⭕ | 1845 | +23/-22 | 8.6 | +0.04/-0.03 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-07 16:20:40+00:00 | |
🟦 🏥 | 1837 | +25/-21 | 8.73 | +0.02/-0.01 | true | true | ? | preference-tuned | Original | llama3 | 34 | 70.55 | 2024-10-24 06:24:59+00:00 | |
⭕ | 1820 | +26/-21 | 8.62 | +0.03/-0.02 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-07 16:20:53+00:00 | |
🟦 | 1818 | +20/-20 | 8.66 | +0.02/-0.02 | false | true | ? | preference-tuned | Original | apache-2.0 | 137 | 14.77 | 2025-05-12 12:17:12+00:00 | |
⭕ | 1815 | +25/-24 | 8.67 | +0.02/-0.02 | false | true | ? | instruction-tuned | Original | other | 1980 | 685 | 2024-10-22 23:04:13+00:00 | |
⭕ | 1811 | +21/-17 | 8.63 | +0.03/-0.02 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 66 | 32.3 | 2024-10-25 07:13:05+00:00 | |
🟦 | 1805 | +20/-17 | 8.58 | +0.03/-0.02 | false | true | ? | preference-tuned | Original | llama3.1 | 617 | 70.55 | 2024-10-24 13:25:28+00:00 | |
🟦 | 1796 | +23/-20 | 8.57 | +0.02/-0.03 | false | true | ? | preference-tuned | Original | apache-2.0 | 79 | 4.02 | 2025-04-29 10:46:23+00:00 | |
⭕ 🏥 | 1790 | +22/-22 | 8.06 | +0.06/-0.06 | true | true | ? | instruction-tuned | Original | other | 108 | 27.01 | 2025-05-22 07:54:44+00:00 | |
🟦 | 1787 | +26/-21 | 8.55 | +0.03/-0.03 | false | true | ? | preference-tuned | Original | llama3 | 1417 | 70.55 | 2024-10-24 13:25:47+00:00 | |
🟦 | 1784 | +20/-18 | 8.58 | +0.02/-0.02 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 1780 | +24/-19 | 8.56 | +0.03/-0.02 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 🏥 | 1773 | +19/-19 | 8.35 | +0.03/-0.03 | true | true | ? | preference-tuned | Original | apache-2.0 | 300 | 8.19 | 2025-05-20 11:36:36+00:00 | |
🟦 | 1757 | +18/-17 | 8.53 | +0.03/-0.03 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
⭕ | 1737 | +18/-16 | 8.12 | +0.04/-0.04 | false | true | ? | instruction-tuned | Original | apache-2.0 | 0 | 32.76 | 2025-05-19 12:37:03+00:00 | |
🟦 | 1733 | +24/-22 | 8.31 | +0.05/-0.05 | false | true | ? | preference-tuned | Original | llama3.1 | 2845 | 8.03 | 2024-07-24 14:33:56+00:00 | |
⭕ | 1728 | +28/-30 | 8.48 | +0.03/-0.02 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:32+00:00 | |
🟦 | 1714 | +20/-21 | 8.28 | +0.04/-0.03 | false | true | ? | preference-tuned | Original | llama3.2 | 402 | 3.21 | 2024-10-24 06:23:04+00:00 | |
⭕ 🏥 | 1703 | +23/-21 | 8.32 | +0.04/-0.03 | true | true | ? | instruction-tuned | Original | null | 42 | 8.19 | 2025-05-16 09:57:55+00:00 | |
⭕ | 1680 | +18/-17 | 8.37 | +0.03/-0.03 | false | true | ? | instruction-tuned | Original | null | 0 | -1 | 2025-01-17 12:10:32+00:00 | |
⭕ | 1679 | +31/-26 | 8.12 | +0.05/-0.03 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-07 16:19:08+00:00 | |
🟦 | 1619 | +24/-22 | 8.21 | +0.04/-0.03 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:07:42+00:00 | |
🟦 | 1617 | +22/-26 | 8.14 | +0.04/-0.04 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:05+00:00 | |
🟦 | 1603 | +24/-23 | 8.11 | +0.04/-0.03 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:06:51+00:00 | |
⭕ | 1598 | +24/-23 | 8.21 | +0.03/-0.03 | false | true | ? | instruction-tuned | Original | other | 808 | 122.61 | 2024-11-25 11:27:40+00:00 | |
🟦 | 1582 | +23/-22 | 8.01 | +0.04/-0.03 | false | true | ? | preference-tuned | Original | mit | 110 | 9.24 | 2024-10-25 07:11:14+00:00 | |
⭕ | 1574 | +27/-21 | 7.76 | +0.05/-0.04 | false | true | ? | instruction-tuned | Original | apache-2.0 | 41 | 6.06 | 2024-10-22 23:04:13+00:00 | |
⭕ | 1570 | +26/-22 | 7.42 | +0.08/-0.05 | false | false | ? | instruction-tuned | Original | llama3.2 | 430 | 1.24 | 2024-10-25 07:14:38+00:00 | |
🟦 | 1549 | +28/-24 | 8.02 | +0.04/-0.03 | false | true | ? | preference-tuned | Original | other | 675 | 72.71 | 2024-11-14 11:37:18+00:00 | |
🟦 | 1545 | +26/-23 | 8.02 | +0.03/-0.04 | false | true | ? | preference-tuned | Original | other | 343 | 72.71 | 2024-10-22 14:35:49+00:00 | |
⭕ | 1542 | +29/-27 | 8.07 | +0.03/-0.03 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:43+00:00 | |
🟦 | 1527 | +31/-21 | 7.72 | +0.04/-0.04 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:22+00:00 | |
🟦 🏥 | 1525 | +24/-19 | 7.25 | +0.05/-0.03 | true | true | ? | preference-tuned | Original | mit | 0 | 14.77 | 2025-05-20 11:29:52+00:00 | |
🟦 | 1518 | +23/-26 | 7.88 | +0.04/-0.04 | false | true | ? | preference-tuned | Original | llama3 | 254 | 8.03 | 2024-12-10 09:38:34+00:00 | |
⭕ | 1505 | +26/-23 | 7.65 | +0.04/-0.05 | false | true | ? | instruction-tuned | Original | llama3.1 | 164 | 8.03 | 2024-11-14 11:35:17+00:00 | |
🟦 | 1475 | +23/-26 | 7.72 | +0.04/-0.04 | false | true | ? | preference-tuned | Original | apache-2.0 | 274 | 7.62 | 2024-11-14 11:36:44+00:00 | |
🟢 🏥 | 1468 | +30/-23 | 7.45 | +0.05/-0.05 | true | true | ? | pretrained | Original | null | 9 | 70.55 | 2024-11-11 13:58:37+00:00 | |
⭕ | 1464 | +26/-20 | 7.83 | +0.03/-0.03 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-02 10:08:27+00:00 | |
🟦 🏥 | 1461 | +23/-19 | 7.68 | +0.04/-0.04 | true | true | ? | preference-tuned | Original | llama3 | 339 | 70 | 2024-07-24 14:33:56+00:00 | |
🟦 | 1457 | +27/-26 | 7.49 | +0.04/-0.04 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-03-06 02:18:06+00:00 | |
🟦 | 1447 | +27/-27 | 6.2 | +0.09/-0.09 | false | true | ? | preference-tuned | Original | apache-2.0 | 239 | 32.76 | 2024-11-28 04:57:07+00:00 | |
🟦 | 1440 | +26/-25 | 7.47 | +0.04/-0.04 | false | true | ? | preference-tuned | Original | apache-2.0 | 67 | 14.77 | 2024-12-10 07:27:22+00:00 | |
⭕ | 1426 | +23/-19 | 7.53 | +0.04/-0.04 | false | true | ? | instruction-tuned | Original | apache-2.0 | 1131 | 7.25 | 2024-11-14 11:38:25+00:00 | |
🟦 | 1412 | +26/-25 | 7.24 | +0.05/-0.05 | false | true | ? | preference-tuned | Original | other | 23 | 7.46 | 2024-12-19 05:59:29+00:00 | |
🟦 | 1404 | +26/-23 | 6.77 | +0.07/-0.06 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:42+00:00 | |
🟦 | 1392 | +33/-24 | 7.39 | +0.06/-0.04 | false | true | ? | preference-tuned | Original | llama3 | 411 | 8.03 | 2024-12-10 10:10:16+00:00 | |
⭕ | 1365 | +22/-27 | 7.31 | +0.04/-0.04 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 613 | 10.73 | 2024-10-22 22:52:54+00:00 | |
🟢 | 1351 | +26/-27 | 7.04 | +0.04/-0.05 | false | true | ? | pretrained | Original | other | 39 | 72.71 | 2024-11-14 11:37:02+00:00 | |
🟦 | 1345 | +27/-22 | 7.04 | +0.05/-0.04 | false | false | ? | preference-tuned | Original | other | 87 | 3.09 | 2024-11-18 11:36:42+00:00 | |
🟦 | 1344 | +27/-22 | 6.87 | +0.05/-0.04 | false | true | ? | preference-tuned | Original | other | 12 | 3.23 | 2024-12-19 06:00:40+00:00 | |
🟦 | 1278 | +22/-21 | 6.36 | +0.07/-0.06 | false | true | ? | preference-tuned | Original | other | 21 | 1.67 | 2024-12-19 06:01:10+00:00 | |
🟢 | 1270 | +34/-25 | 5.51 | +0.09/-0.08 | false | false | ? | pretrained | Original | unknown | 210 | 11.1 | 2024-10-29 07:23:16+00:00 | |
🟦 | 1270 | +30/-23 | 6.55 | +0.07/-0.05 | false | true | ? | preference-tuned | Original | other | 40 | 10.31 | 2024-12-19 05:58:51+00:00 | |
⭕ | 1213 | +24/-24 | 6.34 | +0.07/-0.05 | false | true | ? | instruction-tuned | Original | apache-2.0 | 346 | 1.71 | 2024-11-22 10:44:37+00:00 | |
🟢 | 1111 | +33/-29 | 3.94 | +0.11/-0.1 | false | true | ? | pretrained | Original | apache-2.0 | 67 | 7.62 | 2024-11-14 11:36:22+00:00 | |
🟦 | 1100 | +26/-21 | 5.36 | +0.06/-0.06 | false | true | ? | preference-tuned | Original | apache-2.0 | 91 | 0.75 | 2025-04-29 10:46:33+00:00 | |
🟦 | 1046 | +28/-26 | 3.69 | +0.07/-0.05 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:09:02+00:00 | |
🟦 | 1015 | +21/-19 | 4.08 | +0.08/-0.05 | false | true | ? | preference-tuned | Original | apache-2.0 | 60 | 0.36 | 2024-12-10 08:36:15+00:00 | |
🟢 | 945 | +25/-23 | 2.38 | +0.06/-0.07 | false | false | ? | pretrained | Original | apache-2.0 | 101 | 0.49 | 2024-10-22 13:46:13+00:00 | |
⭕ | 934 | +24/-19 | 2.72 | +0.05/-0.05 | false | true | ? | instruction-tuned | Original | gemma | 44 | 9.24 | 2024-11-14 11:39:56+00:00 | |
🟢 🏥 | 869 | +20/-16 | 1.27 | +0.05/-0.05 | true | false | ? | pretrained | Original | llama3 | 37 | 8.03 | 2024-10-25 07:16:58+00:00 | |
🟦 | 791 | +27/-23 | 0.98 | +0.07/-0.05 | false | false | ? | preference-tuned | Original | other | 26 | 3.09 | 2024-10-22 13:17:21+00:00 | |
⭕ 🏥 | 726 | +17/-17 | 0.03 | +0.01/-0.01 | true | true | ? | instruction-tuned | Original | apache-2.0 | 2 | 9.24 | 2025-05-19 05:20:13+00:00 | |
⭕ 🏥 | 705 | +20/-14 | 0.03 | +0.01/-0.0 | true | true | ? | instruction-tuned | Original | apache-2.0 | 5 | 8.03 | 2025-05-19 05:21:32+00:00 |
- Coverage: Measures how thoroughly the summary covers the original document. A higher score means the summary includes more details from the original.
- Conformity: Also called the non-contradiction score, this checks if the summary avoids contradicting the original document. A higher score means the summary aligns better with the original.
- Consistency: Measures the level of non-hallucination, or how much the summary sticks to the facts in the document. A higher score means the summary is more factual and accurate.
- Conciseness: Measures how brief the summary is. A higher score means the summary is more concise. A negative score means the summary is longer than the original document.
- Overall Score: The average of conformity, consistency, and the harmonic mean of coverage and conciseness (the harmonic mean is taken as 0 unless both coverage and conciseness are positive).
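As a rough illustration of how the Overall Score combines the four metrics, here is a minimal Python sketch. It assumes scores on a 0 to 100 scale and that the numeric columns of the table read Overall, Coverage, Conformity, Consistency, Conciseness, which the page does not state explicitly. Under that assumption, the listed rows are reproduced by averaging three terms: conformity, consistency, and the harmonic mean of coverage and conciseness. The function names are hypothetical and not taken from the MEDIC codebase.

```python
def harmonic_mean_or_zero(coverage: float, conciseness: float) -> float:
    """Harmonic mean of two scores; 0 unless both are positive."""
    if coverage <= 0 or conciseness <= 0:
        return 0.0
    return 2 * coverage * conciseness / (coverage + conciseness)


def summarization_overall(
    coverage: float, conformity: float, consistency: float, conciseness: float
) -> float:
    """Average of conformity, consistency, and HM(coverage, conciseness)."""
    balance = harmonic_mean_or_zero(coverage, conciseness)
    return (conformity + consistency + balance) / 3


# Values copied from two rows of the table (assumed column order).
print(round(summarization_overall(61.64, 95.93, 99.31, 74.38), 2))   # 87.55
print(round(summarization_overall(60.94, 97.61, 97.0, -48.55), 2))   # 64.87
```

The second call shows the effect of a negative conciseness score: the harmonic-mean term drops to 0, which pulls the overall score down sharply even when conformity and consistency are high.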
Type | Overall Score | Coverage | Conformity | Consistency | Conciseness |
⭕ | 87.82 | 64.31 | 95.78 | 99.61 | 72.32 | |
⭕ | 87.55 | 61.64 | 95.93 | 99.31 | 74.38 | |
🟦 | 87.51 | 61.51 | 95.99 | 99.47 | 73.72 | |
⭕ | 87.44 | 67.28 | 95.9 | 99.52 | 66.52 | |
🟦 | 87.41 | 63.09 | 95.89 | 99.56 | 70.91 | |
⭕ | 87.36 | 61.36 | 95.78 | 98.82 | 74.94 | |
⭕ | 87.17 | 79.89 | 94.9 | 95.27 | 64.43 | |
⭕ | 87.09 | 61.74 | 97.16 | 99.2 | 68.42 | |
🟦 | 86.95 | 63.64 | 95.53 | 98.59 | 70.15 | |
🟦 | 86.78 | 68.97 | 95.21 | 97.84 | 65.67 | |
🟦 🏥 | 86.75 | 66.7 | 95.79 | 98.5 | 65.21 | |
⭕ | 86.7 | 56.52 | 96.48 | 99.54 | 73.97 | |
🟦 | 86.64 | 57.18 | 96.02 | 98.7 | 75.85 | |
🟦 | 86.21 | 56.12 | 96.13 | 99.14 | 72.77 | |
🟦 | 86.15 | 55.44 | 96.23 | 99.32 | 72.64 | |
🟦 | 86.14 | 54.21 | 95.98 | 98.81 | 76.97 | |
🟦 | 85.84 | 57.35 | 96.13 | 98.78 | 68.93 | |
🟦 | 85.8 | 57.8 | 95.89 | 98.59 | 69.06 | |
⭕ | 85.79 | 50.6 | 96.66 | 99.24 | 78.26 | |
🟦 | 85.76 | 53.52 | 96.13 | 99.3 | 73.2 | |
🟦 | 85.6 | 50.84 | 96.35 | 99.21 | 76.95 | |
🟦 | 85.59 | 56.58 | 95.45 | 98.74 | 70.05 | |
⭕ | 85.47 | 54.31 | 96.4 | 99.31 | 68.84 | |
🟦 | 85.43 | 49.13 | 96.36 | 99.71 | 77.76 | |
🟦 | 85.36 | 53.13 | 95.66 | 98.5 | 74.18 | |
⭕ | 85.19 | 53.17 | 96.57 | 97.96 | 71.62 | |
🟢 | 85.08 | 51.12 | 96.46 | 99.08 | 71.75 | |
🟦 | 84.86 | 46.65 | 96.73 | 99.21 | 78.97 | |
🟦 | 84.85 | 73.9 | 96.21 | 99.31 | 49.13 | |
⭕ | 84.68 | 55.94 | 97.04 | 96.21 | 66.59 | |
🟦 | 84.33 | 49.41 | 96.56 | 98.04 | 71.34 | |
🟦 | 84.33 | 44.54 | 96.53 | 99.29 | 79.77 | |
🟦 | 84.31 | 51.62 | 96.71 | 97.47 | 68.16 | |
🟦 | 84.09 | 44.91 | 96.21 | 99.15 | 77.67 | |
🟢 | 84.07 | 51.09 | 97.58 | 99.16 | 60.65 | |
🟦 | 83.95 | 45.8 | 96.48 | 98.31 | 75.66 | |
🟦 | 83.79 | 54.71 | 96.44 | 96.75 | 62.11 | |
🟦 | 83.52 | 50.42 | 96.33 | 96.87 | 66.54 | |
⭕ | 83.26 | 41.57 | 96.79 | 98.13 | 80.65 | |
🟦 | 83.19 | 44.17 | 96.68 | 97.63 | 73.71 | |
🟦 | 83.06 | 45.3 | 96.61 | 96.02 | 75.24 | |
🟢 🏥 | 83.05 | 40.46 | 97.11 | 98.39 | 79.65 | |
🟦 | 82.94 | 40.74 | 96.34 | 98.67 | 79.24 | |
🟦 🏥 | 82.93 | 40.34 | 96.81 | 98.51 | 79.23 | |
🟦 | 82.84 | 41.79 | 96.5 | 97.77 | 77.29 | |
🟦 | 82.37 | 41.38 | 95.81 | 96.26 | 82.13 | |
⭕ | 82.32 | 41.76 | 95.92 | 97.37 | 75.1 | |
🟦 | 82.26 | 46.44 | 95.86 | 96.99 | 64.28 | |
⭕ 🏥 | 81.94 | 37.52 | 95.82 | 98.24 | 83.39 | |
🟦 | 81.66 | 44.32 | 97.41 | 97.62 | 57.24 | |
⭕ | 80.61 | 44.66 | 95.86 | 93.62 | 63.23 | |
🟦 | 80.35 | 34.13 | 96.06 | 97.03 | 80.61 | |
🟦 🏥 | 79.28 | 61.4 | 96.75 | 99.28 | 31.7 | |
🟦 | 79.21 | 38.47 | 96.5 | 96.13 | 54.23 | |
⭕ 🏥 | 78.59 | 59.72 | 96.68 | 99.23 | 29.9 | |
⭕ | 73.66 | 16.29 | 97.72 | 95.57 | 92.16 | |
🟦 | 69.33 | 35.32 | 96.15 | 93.2 | 12.67 | |
🟢 | 64.87 | 60.94 | 97.61 | 97 | -48.55 | |
🟢 | 64.01 | 45.33 | 98.2 | 93.82 | -218.26 | |
🟦 | 62.64 | 92.95 | 98.5 | 89.42 | -172.93 | |
🟢 🏥 | 59.15 | 15.9 | 97.75 | 52.88 | 85.85 | |
⭕ 🏥 | 57.68 | 12.36 | 98.47 | 74.56 | -131.51 | |
🟦 | 54.69 | 2.1 | 99.49 | 64.59 | -237.7 |
⭕ | 87.82 | 64.31 | 95.78 | 99.61 | 72.32 | false | true | ? | instruction-tuned | Original | other | 1980 | 685 | 2024-10-22 23:04:13+00:00 | |
⭕ | 87.55 | 61.64 | 95.93 | 99.31 | 74.38 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:43+00:00 | |
🟦 | 87.51 | 61.51 | 95.99 | 99.47 | 73.72 | false | true | ? | preference-tuned | Original | other | 675 | 72.71 | 2024-11-14 11:37:18+00:00 | |
⭕ | 87.44 | 67.28 | 95.9 | 99.52 | 66.52 | false | true | ? | instruction-tuned | Original | null | 0 | -1 | 2025-01-17 12:10:32+00:00 | |
🟦 | 87.41 | 63.09 | 95.89 | 99.56 | 70.91 | false | true | ? | preference-tuned | Original | other | 343 | 72.71 | 2024-10-22 14:35:49+00:00 | |
⭕ | 87.36 | 61.36 | 95.78 | 98.82 | 74.94 | false | true | ? | instruction-tuned | Original | apache-2.0 | 0 | 32.76 | 2025-05-19 12:37:03+00:00 | |
⭕ | 87.17 | 79.89 | 94.9 | 95.27 | 64.43 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:42:03+00:00 | |
⭕ | 87.09 | 61.74 | 97.16 | 99.2 | 68.42 | false | true | ? | instruction-tuned | Original | other | 808 | 122.61 | 2024-11-25 11:27:40+00:00 | |
🟦 | 86.95 | 63.64 | 95.53 | 98.59 | 70.15 | false | true | ? | preference-tuned | Original | apache-2.0 | 162 | 32.76 | 2025-04-29 10:45:55+00:00 | |
🟦 | 86.78 | 68.97 | 95.21 | 97.84 | 65.67 | false | true | ? | preference-tuned | Original | apache-2.0 | 352 | 235.09 | 2025-04-29 10:42:15+00:00 | |
🟦 🏥 | 86.75 | 66.7 | 95.79 | 98.5 | 65.21 | true | true | ? | preference-tuned | Original | llama3 | 34 | 70.55 | 2024-10-24 06:24:59+00:00 | |
⭕ | 86.7 | 56.52 | 96.48 | 99.54 | 73.97 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-07 16:20:53+00:00 | |
🟦 | 86.64 | 57.18 | 96.02 | 98.7 | 75.85 | false | true | ? | preference-tuned | Original | apache-2.0 | 208 | 30.53 | 2025-04-29 10:45:32+00:00 | |
🟦 | 86.21 | 56.12 | 96.13 | 99.14 | 72.77 | false | true | ? | preference-tuned | Original | apache-2.0 | 274 | 7.62 | 2024-11-14 11:36:44+00:00 | |
🟦 | 86.15 | 55.44 | 96.23 | 99.32 | 72.64 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:07:42+00:00 | |
🟦 | 86.14 | 54.21 | 95.98 | 98.81 | 76.97 | false | true | ? | preference-tuned | Original | apache-2.0 | 137 | 14.77 | 2025-05-12 12:17:12+00:00 | |
🟦 | 85.84 | 57.35 | 96.13 | 98.78 | 68.93 | false | true | ? | preference-tuned | Original | apache-2.0 | 67 | 14.77 | 2024-12-10 07:27:22+00:00 | |
🟦 | 85.8 | 57.8 | 95.89 | 98.59 | 69.06 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:22+00:00 | |
⭕ | 85.79 | 50.6 | 96.66 | 99.24 | 78.26 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:32+00:00 | |
🟦 | 85.76 | 53.52 | 96.13 | 99.3 | 73.2 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:05+00:00 | |
🟦 | 85.6 | 50.84 | 96.35 | 99.21 | 76.95 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:06:51+00:00 | |
🟦 | 85.59 | 56.58 | 95.45 | 98.74 | 70.05 | false | false | ? | preference-tuned | Original | other | 87 | 3.09 | 2024-11-18 11:36:42+00:00 | |
⭕ | 85.47 | 54.31 | 96.4 | 99.31 | 68.84 | false | true | ? | instruction-tuned | Original | apache-2.0 | 1131 | 7.25 | 2024-11-14 11:38:25+00:00 | |
🟦 | 85.43 | 49.13 | 96.36 | 99.71 | 77.76 | false | true | ? | preference-tuned | Original | other | 40 | 10.31 | 2024-12-19 05:58:51+00:00 | |
🟦 | 85.36 | 53.13 | 95.66 | 98.5 | 74.18 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:42+00:00 | |
⭕ | 85.19 | 53.17 | 96.57 | 97.96 | 71.62 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 66 | 32.3 | 2024-10-25 07:13:05+00:00 | |
🟢 | 85.08 | 51.12 | 96.46 | 99.08 | 71.75 | false | true | ? | pretrained | Original | apache-2.0 | 67 | 7.62 | 2024-11-14 11:36:22+00:00 | |
🟦 | 84.86 | 46.65 | 96.73 | 99.21 | 78.97 | false | true | ? | preference-tuned | Original | llama3 | 254 | 8.03 | 2024-12-10 09:38:34+00:00 | |
🟦 | 84.85 | 73.9 | 96.21 | 99.31 | 49.13 | false | true | ? | preference-tuned | Original | apache-2.0 | 239 | 32.76 | 2024-11-28 04:57:07+00:00 | |
⭕ | 84.68 | 55.94 | 97.04 | 96.21 | 66.59 | false | true | ? | instruction-tuned | Original | llama3.1 | 1149 | 70.55 | 2024-10-25 07:09:19+00:00 | |
🟦 | 84.33 | 49.41 | 96.56 | 98.04 | 71.34 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 84.33 | 44.54 | 96.53 | 99.29 | 79.77 | false | true | ? | preference-tuned | Original | other | 23 | 7.46 | 2024-12-19 05:59:29+00:00 | |
🟦 | 84.31 | 51.62 | 96.71 | 97.47 | 68.16 | false | true | ? | preference-tuned | Original | llama3.1 | 617 | 70.55 | 2024-10-24 13:25:28+00:00 | |
🟦 | 84.09 | 44.91 | 96.21 | 99.15 | 77.67 | false | true | ? | preference-tuned | Original | other | 12 | 3.23 | 2024-12-19 06:00:40+00:00 | |
🟢 | 84.07 | 51.09 | 97.58 | 99.16 | 60.65 | false | true | ? | pretrained | Original | other | 39 | 72.71 | 2024-11-14 11:37:02+00:00 | |
🟦 | 83.95 | 45.8 | 96.48 | 98.31 | 75.66 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 83.79 | 54.71 | 96.44 | 96.75 | 62.11 | false | true | ? | preference-tuned | Original | llama3.1 | 2845 | 8.03 | 2024-07-24 14:33:56+00:00 | |
🟦 | 83.52 | 50.42 | 96.33 | 96.87 | 66.54 | false | true | ? | preference-tuned | Original | llama3.2 | 402 | 3.21 | 2024-10-24 06:23:04+00:00 | |
⭕ | 83.26 | 41.57 | 96.79 | 98.13 | 80.65 | false | true | ? | instruction-tuned | Original | llama3.1 | 164 | 8.03 | 2024-11-14 11:35:17+00:00 | |
🟦 | 83.19 | 44.17 | 96.68 | 97.63 | 73.71 | false | true | ? | preference-tuned | Original | llama3.3 | 632 | 70.55 | 2024-12-09 09:10:34+00:00 | |
🟦 | 83.06 | 45.3 | 96.61 | 96.02 | 75.24 | false | true | ? | preference-tuned | Original | llama3 | 1417 | 70.55 | 2024-10-24 13:25:47+00:00 | |
🟢 🏥 | 83.05 | 40.46 | 97.11 | 98.39 | 79.65 | true | true | ? | pretrained | Original | null | 9 | 70.55 | 2024-11-11 13:58:37+00:00 | |
🟦 | 82.94 | 40.74 | 96.34 | 98.67 | 79.24 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 🏥 | 82.93 | 40.34 | 96.81 | 98.51 | 79.23 | true | true | ? | preference-tuned | Original | llama3 | 339 | 70 | 2024-07-24 14:33:56+00:00 | |
🟦 | 82.84 | 41.79 | 96.5 | 97.77 | 77.29 | false | true | ? | preference-tuned | Original | mit | 110 | 9.24 | 2024-10-25 07:11:14+00:00 | |
🟦 | 82.37 | 41.38 | 95.81 | 96.26 | 82.13 | false | true | ? | preference-tuned | Original | apache-2.0 | 91 | 0.75 | 2025-04-29 10:46:33+00:00 | |
⭕ | 82.32 | 41.76 | 95.92 | 97.37 | 75.1 | false | true | ? | instruction-tuned | Original | apache-2.0 | 346 | 1.71 | 2024-11-22 10:44:37+00:00 | |
🟦 | 82.26 | 46.44 | 95.86 | 96.99 | 64.28 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:09:02+00:00 | |
⭕ 🏥 | 81.94 | 37.52 | 95.82 | 98.24 | 83.39 | true | true | ? | instruction-tuned | Original | other | 106 | 4.3 | 2025-05-22 07:54:59+00:00 | |
🟦 | 81.66 | 44.32 | 97.41 | 97.62 | 57.24 | false | false | ? | preference-tuned | Original | other | 26 | 3.09 | 2024-10-22 13:17:21+00:00 | |
⭕ | 80.61 | 44.66 | 95.86 | 93.62 | 63.23 | false | false | ? | instruction-tuned | Original | llama3.2 | 430 | 1.24 | 2024-10-25 07:14:38+00:00 | |
🟦 | 80.35 | 34.13 | 96.06 | 97.03 | 80.61 | false | true | ? | preference-tuned | Original | other | 21 | 1.67 | 2024-12-19 06:01:10+00:00 | |
🟦 🏥 | 79.28 | 61.4 | 96.75 | 99.28 | 31.7 | true | true | ? | preference-tuned | Original | apache-2.0 | 300 | 8.19 | 2025-05-20 11:36:36+00:00 | |
🟦 | 79.21 | 38.47 | 96.5 | 96.13 | 54.23 | false | false | ? | preference-tuned | Original | apache-2.0 | 100 | 0.49 | 2024-11-18 11:36:27+00:00 | |
⭕ 🏥 | 78.59 | 59.72 | 96.68 | 99.23 | 29.9 | true | true | ? | instruction-tuned | Original | null | 42 | 8.19 | 2025-05-16 09:57:55+00:00 | |
⭕ | 73.66 | 16.29 | 97.72 | 95.57 | 92.16 | false | true | ? | instruction-tuned | Original | gemma | 44 | 9.24 | 2024-11-14 11:39:56+00:00 | |
🟦 | 69.33 | 35.32 | 96.15 | 93.2 | 12.67 | false | true | ? | preference-tuned | Original | apache-2.0 | 60 | 0.36 | 2024-12-10 08:36:15+00:00 | |
🟢 | 64.87 | 60.94 | 97.61 | 97 | -48.55 | false | false | ? | pretrained | Original | unknown | 210 | 11.1 | 2024-10-29 07:23:16+00:00 | |
🟢 | 64.01 | 45.33 | 98.2 | 93.82 | -218.26 | false | false | ? | pretrained | Original | apache-2.0 | 101 | 0.49 | 2024-10-22 13:46:13+00:00 | |
🟦 | 62.64 | 92.95 | 98.5 | 89.42 | -172.93 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-03-06 02:18:06+00:00 | |
🟢 🏥 | 59.15 | 15.9 | 97.75 | 52.88 | 85.85 | true | false | ? | pretrained | Original | llama3 | 37 | 8.03 | 2024-10-25 07:16:58+00:00 | |
⭕ 🏥 | 57.68 | 12.36 | 98.47 | 74.56 | -131.51 | true | true | ? | instruction-tuned | Original | other | 108 | 27.01 | 2025-05-22 07:54:44+00:00 | |
🟦 | 54.69 | 2.1 | 99.49 | 64.59 | -237.7 | false | true | ? | preference-tuned | Original | apache-2.0 | 33 | 3.32 | 2024-12-10 10:39:41+00:00 |
- Coverage: Measures how thoroughly the summary covers the original document. A higher score means the summary includes more details from the original.
- Conformity: Also called the non-contradiction score, this checks if the summary avoids contradicting the original document. A higher score means the summary aligns better with the original.
- Consistency: Measures the level of non-hallucination, or how much the summary sticks to the facts in the document. A higher score means the summary is more factual and accurate.
- Overall Score: The average of the above three scores.
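For this task the Overall Score reduces to a plain mean. A minimal Python sketch, assuming scores on a 0 to 100 scale and that the numeric columns of the table read Overall, Coverage, Conformity, Consistency (the function name is hypothetical):

```python
def overall_score(coverage: float, conformity: float, consistency: float) -> float:
    """Unweighted mean of the three metric scores."""
    return (coverage + conformity + consistency) / 3


# Values copied from one row of the table (assumed column order).
print(round(overall_score(92.08, 97.25, 99.0), 2))  # 96.11
```

Because the mean is unweighted, a low coverage score cannot be offset by more than a third of the gap through the other two metrics.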
Type | Overall Score | Coverage | Conformity | Consistency |
⭕ | 96.11 | 92.08 | 97.25 | 99 | |
🟦 🏥 | 95.56 | 91.83 | 96.17 | 98.67 | |
🟦 🏥 | 95.14 | 90 | 97.42 | 98 | |
⭕ | 94.69 | 88.58 | 96.67 | 98.83 | |
⭕ | 94.64 | 92.41 | 96.83 | 94.67 | |
⭕ | 94.61 | 87.5 | 97.33 | 99 | |
⭕ 🏥 | 94.5 | 89.07 | 95.92 | 98.5 | |
⭕ | 94.27 | 90.08 | 96.91 | 95.83 | |
⭕ | 93.94 | 86.25 | 97.58 | 98 | |
⭕ | 93.89 | 85.33 | 97.33 | 99 | |
🟦 | 93.85 | 85.48 | 97.75 | 98.33 | |
🟦 | 93.28 | 88.25 | 96.42 | 95.17 | |
🟦 | 93.16 | 85.07 | 96.58 | 97.83 | |
⭕ | 92.91 | 83.56 | 96 | 99.17 | |
⭕ | 92.88 | 84.16 | 95.82 | 98.67 | |
🟦 | 92.66 | 80.58 | 97.57 | 99.83 | |
⭕ | 92.64 | 82.16 | 96.58 | 99.17 | |
🟦 | 92.61 | 83.25 | 96.58 | 98 | |
🟦 | 92.55 | 84.41 | 96.75 | 96.5 | |
🟦 | 92.38 | 84.07 | 96.57 | 96.5 | |
🟦 | 92.33 | 80.66 | 96.67 | 99.67 | |
🟦 | 92.21 | 80.65 | 97.16 | 98.83 | |
🟦 | 92.11 | 83.5 | 96.5 | 96.33 | |
🟦 | 92.02 | 79.73 | 96.99 | 99.33 | |
🟦 | 91.99 | 79.64 | 96.83 | 99.5 | |
🟦 | 91.91 | 78.73 | 97.33 | 99.67 | |
🟦 | 91.85 | 79.82 | 96.57 | 99.17 | |
🟦 | 91.8 | 81.57 | 96.67 | 97.17 | |
⭕ | 91.53 | 79.75 | 97 | 97.83 | |
⭕ | 91.49 | 79.06 | 95.75 | 99.67 | |
🟦 | 91.16 | 79.15 | 96.17 | 98.17 | |
🟦 | 91.08 | 76.91 | 97.33 | 99 | |
🟦 | 90.94 | 77.91 | 96.25 | 98.67 | |
🟦 | 90.83 | 79.98 | 96.17 | 96.33 | |
🟦 | 90.8 | 76.81 | 96.42 | 99.17 | |
🟦 🏥 | 90.66 | 75.65 | 96.67 | 99.67 | |
🟦 | 90.66 | 75.64 | 97 | 99.33 | |
🟦 | 90.5 | 75.92 | 96.42 | 99.17 | |
🟦 | 90.45 | 74.98 | 97.08 | 99.29 | |
🟦 | 90.29 | 74.31 | 96.74 | 99.83 | |
🟦 | 90.1 | 76.06 | 96.58 | 97.67 | |
⭕ 🏥 | 89.66 | 75.67 | 95.16 | 98.17 | |
🟢 | 89.66 | 74.73 | 95.92 | 98.33 | |
🟦 | 89.63 | 75.06 | 96.32 | 97.5 | |
🟦 | 89.2 | 73.71 | 95.73 | 98.17 | |
🟢 🏥 | 89.17 | 70.42 | 97.08 | 100 | |
🟦 | 88.91 | 74.73 | 94.17 | 97.83 | |
🟢 | 88.83 | 70.25 | 96.58 | 99.67 | |
🟦 | 87.06 | 69.11 | 94.74 | 97.33 | |
🟦 | 86.8 | 68.24 | 94.49 | 97.67 | |
⭕ | 85.32 | 66.68 | 94.46 | 94.83 | |
🟢 | 84.18 | 57.53 | 96.67 | 98.33 | |
⭕ | 82.87 | 63.04 | 93.9 | 91.67 | |
🟦 | 82.57 | 64.56 | 92.31 | 90.83 | |
🟦 | 82.3 | 54 | 95.74 | 97.17 | |
🟦 | 80.21 | 48.9 | 96.07 | 95.67 | |
🟦 | 79.82 | 52.29 | 94.33 | 92.83 | |
🟦 | 78.54 | 44.37 | 98.08 | 93.17 | |
🟢 🏥 | 71.44 | 47.33 | 95.67 | 71.33 | |
🟢 | 70 | 21.95 | 97.82 | 90.21 | |
⭕ | 69.17 | 10.69 | 99.5 | 97.33 | |
⭕ 🏥 | 64.52 | 26.82 | 92.08 | 74.67 | |
🟦 | 56.67 | 1.58 | 99.33 | 69.08 |
⭕ | 96.11 | 92.08 | 97.25 | 99 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:32+00:00 | |
🟦 🏥 | 95.56 | 91.83 | 96.17 | 98.67 | true | true | ? | preference-tuned | Original | apache-2.0 | 300 | 8.19 | 2025-05-20 11:36:36+00:00 | |
🟦 🏥 | 95.14 | 90 | 97.42 | 98 | true | true | ? | preference-tuned | Original | llama3 | 34 | 70.55 | 2024-10-24 06:24:59+00:00 | |
⭕ | 94.69 | 88.58 | 96.67 | 98.83 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:43+00:00 | |
⭕ | 94.64 | 92.41 | 96.83 | 94.67 | false | true | ? | instruction-tuned | Original | apache-2.0 | 0 | 32.76 | 2025-05-19 12:37:03+00:00 | |
⭕ | 94.61 | 87.5 | 97.33 | 99 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-07 16:20:53+00:00 | |
⭕ 🏥 | 94.5 | 89.07 | 95.92 | 98.5 | true | true | ? | instruction-tuned | Original | null | 42 | 8.19 | 2025-05-16 09:57:55+00:00 | |
⭕ | 94.27 | 90.08 | 96.91 | 95.83 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:42:03+00:00 | |
⭕ | 93.94 | 86.25 | 97.58 | 98 | false | true | ? | instruction-tuned | Original | llama3.1 | 1149 | 70.55 | 2024-10-25 07:09:19+00:00 | |
⭕ | 93.89 | 85.33 | 97.33 | 99 | false | true | ? | instruction-tuned | Original | other | 1980 | 685 | 2024-10-22 23:04:13+00:00 | |
🟦 | 93.85 | 85.48 | 97.75 | 98.33 | false | true | ? | preference-tuned | Original | apache-2.0 | 239 | 32.76 | 2024-11-28 04:57:07+00:00 | |
🟦 | 93.28 | 88.25 | 96.42 | 95.17 | false | true | ? | preference-tuned | Original | other | 343 | 72.71 | 2024-10-22 14:35:49+00:00 | |
🟦 | 93.16 | 85.07 | 96.58 | 97.83 | false | true | ? | preference-tuned | Original | mit | 110 | 9.24 | 2024-10-25 07:11:14+00:00 | |
⭕ | 92.91 | 83.56 | 96 | 99.17 | false | true | ? | instruction-tuned | Original | llama3.1 | 164 | 8.03 | 2024-11-14 11:35:17+00:00 | |
⭕ | 92.88 | 84.16 | 95.82 | 98.67 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 66 | 32.3 | 2024-10-25 07:13:05+00:00 | |
🟦 | 92.66 | 80.58 | 97.57 | 99.83 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
⭕ | 92.64 | 82.16 | 96.58 | 99.17 | false | true | ? | instruction-tuned | Original | null | 0 | -1 | 2025-01-17 12:10:32+00:00 | |
🟦 | 92.61 | 83.25 | 96.58 | 98 | false | true | ? | preference-tuned | Original | apache-2.0 | 274 | 7.62 | 2024-11-14 11:36:44+00:00 | |
🟦 | 92.55 | 84.41 | 96.75 | 96.5 | false | true | ? | preference-tuned | Original | apache-2.0 | 162 | 32.76 | 2025-04-29 10:45:55+00:00 | |
🟦 | 92.38 | 84.07 | 96.57 | 96.5 | false | true | ? | preference-tuned | Original | apache-2.0 | 352 | 235.09 | 2025-04-29 10:42:15+00:00 | |
🟦 | 92.33 | 80.66 | 96.67 | 99.67 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:06:51+00:00 | |
🟦 | 92.21 | 80.65 | 97.16 | 98.83 | false | true | ? | preference-tuned | Original | llama3 | 254 | 8.03 | 2024-12-10 09:38:34+00:00 | |
🟦 | 92.11 | 83.5 | 96.5 | 96.33 | false | true | ? | preference-tuned | Original | apache-2.0 | 208 | 30.53 | 2025-04-29 10:45:32+00:00 | |
🟦 | 92.02 | 79.73 | 96.99 | 99.33 | false | true | ? | preference-tuned | Original | llama3.1 | 617 | 70.55 | 2024-10-24 13:25:28+00:00 | |
🟦 | 91.99 | 79.64 | 96.83 | 99.5 | false | true | ? | preference-tuned | Original | llama3.3 | 632 | 70.55 | 2024-12-09 09:10:34+00:00 | |
🟦 | 91.91 | 78.73 | 97.33 | 99.67 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 91.85 | 79.82 | 96.57 | 99.17 | false | true | ? | preference-tuned | Original | llama3.1 | 2845 | 8.03 | 2024-07-24 14:33:56+00:00 | |
🟦 | 91.8 | 81.57 | 96.67 | 97.17 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-03-06 02:18:06+00:00 | |
⭕ | 91.53 | 79.75 | 97 | 97.83 | false | true | ? | instruction-tuned | Original | other | 808 | 122.61 | 2024-11-25 11:27:40+00:00 | |
⭕ | 91.49 | 79.06 | 95.75 | 99.67 | false | true | ? | instruction-tuned | Original | apache-2.0 | 1131 | 7.25 | 2024-11-14 11:38:25+00:00 | |
🟦 | 91.16 | 79.15 | 96.17 | 98.17 | false | true | ? | preference-tuned | Original | other | 40 | 10.31 | 2024-12-19 05:58:51+00:00 | |
🟦 | 91.08 | 76.91 | 97.33 | 99 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 90.94 | 77.91 | 96.25 | 98.67 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:22+00:00 | |
🟦 | 90.83 | 79.98 | 96.17 | 96.33 | false | true | ? | preference-tuned | Original | apache-2.0 | 137 | 14.77 | 2025-05-12 12:17:12+00:00 | |
🟦 | 90.8 | 76.81 | 96.42 | 99.17 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:05+00:00 | |
🟦 🏥 | 90.66 | 75.65 | 96.67 | 99.67 | true | true | ? | preference-tuned | Original | llama3 | 339 | 70 | 2024-07-24 14:33:56+00:00 | |
🟦 | 90.66 | 75.64 | 97 | 99.33 | false | true | ? | preference-tuned | Original | llama3 | 1417 | 70.55 | 2024-10-24 13:25:47+00:00 | |
🟦 | 90.5 | 75.92 | 96.42 | 99.17 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:07:42+00:00 | |
🟦 | 90.45 | 74.98 | 97.08 | 99.29 | false | true | ? | preference-tuned | Original | other | 675 | 72.71 | 2024-11-14 11:37:18+00:00 | |
🟦 | 90.29 | 74.31 | 96.74 | 99.83 | false | true | ? | preference-tuned | Original | other | 23 | 7.46 | 2024-12-19 05:59:29+00:00 | |
🟦 | 90.1 | 76.06 | 96.58 | 97.67 | false | true | ? | preference-tuned | Original | apache-2.0 | 67 | 14.77 | 2024-12-10 07:27:22+00:00 | |
⭕ 🏥 | 89.66 | 75.67 | 95.16 | 98.17 | true | true | ? | instruction-tuned | Original | other | 106 | 4.3 | 2025-05-22 07:54:59+00:00 | |
🟢 | 89.66 | 74.73 | 95.92 | 98.33 | false | true | ? | pretrained | Original | apache-2.0 | 67 | 7.62 | 2024-11-14 11:36:22+00:00 | |
🟦 | 89.63 | 75.06 | 96.32 | 97.5 | false | true | ? | preference-tuned | Original | llama3.2 | 402 | 3.21 | 2024-10-24 06:23:04+00:00 | |
🟦 | 89.2 | 73.71 | 95.73 | 98.17 | false | false | ? | preference-tuned | Original | other | 87 | 3.09 | 2024-11-18 11:36:42+00:00 | |
🟢 🏥 | 89.17 | 70.42 | 97.08 | 100 | true | true | ? | pretrained | Original | null | 9 | 70.55 | 2024-11-11 13:58:37+00:00 | |
🟦 | 88.91 | 74.73 | 94.17 | 97.83 | false | true | ? | preference-tuned | Original | other | 12 | 3.23 | 2024-12-19 06:00:40+00:00 | |
🟢 | 88.83 | 70.25 | 96.58 | 99.67 | false | true | ? | pretrained | Original | other | 39 | 72.71 | 2024-11-14 11:37:02+00:00 | |
🟦 | 87.06 | 69.11 | 94.74 | 97.33 | false | false | ? | preference-tuned | Original | other | 26 | 3.09 | 2024-10-22 13:17:21+00:00 | |
🟦 | 86.8 | 68.24 | 94.49 | 97.67 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:42+00:00 | |
⭕ | 85.32 | 66.68 | 94.46 | 94.83 | false | false | ? | instruction-tuned | Original | llama3.2 | 430 | 1.24 | 2024-10-25 07:14:38+00:00 | |
🟢 | 84.18 | 57.53 | 96.67 | 98.33 | false | false | ? | pretrained | Original | unknown | 210 | 11.1 | 2024-10-29 07:23:16+00:00 | |
⭕ | 82.87 | 63.04 | 93.9 | 91.67 | false | true | ? | instruction-tuned | Original | apache-2.0 | 346 | 1.71 | 2024-11-22 10:44:37+00:00 | |
🟦 | 82.57 | 64.56 | 92.31 | 90.83 | false | true | ? | preference-tuned | Original | other | 21 | 1.67 | 2024-12-19 06:01:10+00:00 | |
🟦 | 82.3 | 54 | 95.74 | 97.17 | false | true | ? | preference-tuned | Original | apache-2.0 | 91 | 0.75 | 2025-04-29 10:46:33+00:00 | |
🟦 | 80.21 | 48.9 | 96.07 | 95.67 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:09:02+00:00 | |
🟦 | 79.82 | 52.29 | 94.33 | 92.83 | false | false | ? | preference-tuned | Original | apache-2.0 | 100 | 0.49 | 2024-11-18 11:36:27+00:00 | |
🟦 | 78.54 | 44.37 | 98.08 | 93.17 | false | true | ? | preference-tuned | Original | apache-2.0 | 60 | 0.36 | 2024-12-10 08:36:15+00:00 | |
🟢 🏥 | 71.44 | 47.33 | 95.67 | 71.33 | true | false | ? | pretrained | Original | llama3 | 37 | 8.03 | 2024-10-25 07:16:58+00:00 | |
🟢 | 70 | 21.95 | 97.82 | 90.21 | false | false | ? | pretrained | Original | apache-2.0 | 101 | 0.49 | 2024-10-22 13:46:13+00:00 | |
⭕ | 69.17 | 10.69 | 99.5 | 97.33 | false | true | ? | instruction-tuned | Original | gemma | 44 | 9.24 | 2024-11-14 11:39:56+00:00 | |
⭕ 🏥 | 64.52 | 26.82 | 92.08 | 74.67 | true | true | ? | instruction-tuned | Original | other | 108 | 27.01 | 2025-05-22 07:54:44+00:00 | |
🟦 | 56.67 | 1.58 | 99.33 | 69.08 | false | true | ? | preference-tuned | Original | apache-2.0 | 33 | 3.32 | 2024-12-10 10:39:41+00:00 |
🟢 🏥 | 95.76 | 93.52 | 97.53 | 96.24 | false | false | ? | instruction-tuned | Original | cc-by-nc-4.0 | 1980 | 235.09 | 2025-05-07 16:20:53+00:00 |
⭕ | 95.76 | 93.52 | 97.53 | 96.24 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-07 16:20:53+00:00 | |
⭕ | 94.23 | 89.33 | 97.45 | 95.92 | false | true | ? | instruction-tuned | Original | other | 1980 | 685 | 2024-10-22 23:04:13+00:00 | |
🟢 | 93.92 | 86.72 | 98.56 | 96.48 | false | true | ? | pretrained | Original | other | 39 | 72.71 | 2024-11-14 11:37:02+00:00 | |
⭕ | 93.89 | 88.91 | 97.32 | 95.44 | false | true | ? | instruction-tuned | Original | null | 0 | -1 | 2025-01-17 12:10:32+00:00 | |
🟦 | 93.71 | 90.48 | 97.37 | 93.28 | false | true | ? | preference-tuned | Original | apache-2.0 | 239 | 32.76 | 2024-11-28 04:57:07+00:00 | |
🟦 | 93.64 | 86.85 | 97.36 | 96.72 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟢 🏥 | 93.34 | 85.57 | 98.28 | 96.16 | true | true | ? | pretrained | Original | null | 9 | 70.55 | 2024-11-11 13:58:37+00:00 | |
⭕ | 93.22 | 90.31 | 97.6 | 91.76 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:32+00:00 | |
⭕ | 93.19 | 89.48 | 96.8 | 93.28 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:43+00:00 | |
🟦 | 93.1 | 85.03 | 97.32 | 96.96 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 92.91 | 88.25 | 97.44 | 93.04 | false | true | ? | preference-tuned | Original | other | 343 | 72.71 | 2024-10-22 14:35:49+00:00 | |
🟦 | 92.9 | 85.57 | 97.76 | 95.36 | false | true | ? | preference-tuned | Original | other | 675 | 72.71 | 2024-11-14 11:37:18+00:00 | |
🟦 | 92.66 | 84.23 | 97.28 | 96.48 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
⭕ | 92.61 | 84.06 | 97.52 | 96.24 | false | true | ? | instruction-tuned | Original | llama3.1 | 164 | 8.03 | 2024-11-14 11:35:17+00:00 | |
🟦 | 92.59 | 85.6 | 97.44 | 94.72 | false | true | ? | preference-tuned | Original | llama3 | 254 | 8.03 | 2024-12-10 09:38:34+00:00 | |
🟦 🏥 | 92.4 | 88.6 | 97.72 | 90.88 | true | true | ? | preference-tuned | Original | apache-2.0 | 300 | 8.19 | 2025-05-20 11:36:36+00:00 | |
🟦 | 92.16 | 86.99 | 97.2 | 92.3 | false | true | ? | preference-tuned | Original | apache-2.0 | 274 | 7.62 | 2024-11-14 11:36:44+00:00 | |
⭕ 🏥 | 92.03 | 89.27 | 97.15 | 89.68 | true | true | ? | instruction-tuned | Original | null | 42 | 8.19 | 2025-05-16 09:57:55+00:00 | |
🟦 | 91.75 | 83.19 | 96.92 | 95.12 | false | false | ? | preference-tuned | Original | other | 87 | 3.09 | 2024-11-18 11:36:42+00:00 | |
🟦 | 91.59 | 85.26 | 97.76 | 91.74 | false | true | ? | preference-tuned | Original | apache-2.0 | 67 | 14.77 | 2024-12-10 07:27:22+00:00 | |
🟦 🏥 | 91.58 | 88.45 | 96.04 | 90.24 | true | true | ? | preference-tuned | Original | llama3 | 34 | 70.55 | 2024-10-24 06:24:59+00:00 | |
🟦 | 91.55 | 80.48 | 97.68 | 96.48 | false | true | ? | preference-tuned | Original | llama3.3 | 632 | 70.55 | 2024-12-09 09:10:34+00:00 | |
⭕ | 91.53 | 82.86 | 97.92 | 93.82 | false | true | ? | instruction-tuned | Original | llama3.1 | 1149 | 70.55 | 2024-10-25 07:09:19+00:00 | |
🟦 | 91.25 | 86.46 | 96.4 | 90.88 | false | true | ? | preference-tuned | Original | apache-2.0 | 352 | 235.09 | 2025-04-29 10:42:15+00:00 | |
🟦 | 91.2 | 86.28 | 97.12 | 90.2 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-03-06 02:18:06+00:00 | |
⭕ | 91.2 | 88.55 | 96.02 | 89.02 | false | true | ? | instruction-tuned | Original | apache-2.0 | 0 | 32.76 | 2025-05-19 12:37:03+00:00 | |
🟦 | 91.01 | 81.26 | 97.76 | 94 | false | true | ? | preference-tuned | Original | llama3.1 | 617 | 70.55 | 2024-10-24 13:25:28+00:00 | |
🟦 | 90.84 | 76.57 | 98.04 | 97.92 | false | false | ? | preference-tuned | Original | other | 26 | 3.09 | 2024-10-22 13:17:21+00:00 | |
🟦 | 90.65 | 77.57 | 98.08 | 96.32 | false | true | ? | preference-tuned | Original | llama3 | 1417 | 70.55 | 2024-10-24 13:25:47+00:00 | |
⭕ 🏥 | 90.65 | 79.03 | 96.84 | 96.08 | true | true | ? | instruction-tuned | Original | other | 106 | 4.3 | 2025-05-22 07:54:59+00:00 | |
🟢 | 90.65 | 81.88 | 97.1 | 92.96 | false | true | ? | pretrained | Original | apache-2.0 | 67 | 7.62 | 2024-11-14 11:36:22+00:00 | |
🟦 | 90.4 | 81.8 | 96.76 | 92.64 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:06:51+00:00 | |
🟦 🏥 | 90.3 | 75.87 | 97.84 | 97.2 | true | true | ? | preference-tuned | Original | llama3 | 339 | 70 | 2024-07-24 14:33:56+00:00 | |
🟦 | 90.25 | 85.11 | 96.35 | 89.28 | false | true | ? | preference-tuned | Original | apache-2.0 | 208 | 30.53 | 2025-04-29 10:45:32+00:00 | |
🟦 | 90.22 | 84.44 | 96.45 | 89.76 | false | true | ? | preference-tuned | Original | apache-2.0 | 137 | 14.77 | 2025-05-12 12:17:12+00:00 | |
🟦 | 90.11 | 88.03 | 95.98 | 86.32 | false | true | ? | preference-tuned | Original | apache-2.0 | 162 | 32.76 | 2025-04-29 10:45:55+00:00 | |
⭕ | 90.05 | 91.36 | 96.57 | 82.22 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:42:03+00:00 | |
🟦 | 90.04 | 81.43 | 97.41 | 91.28 | false | true | ? | preference-tuned | Original | llama3.1 | 2845 | 8.03 | 2024-07-24 14:33:56+00:00 | |
⭕ | 89.6 | 80.16 | 96.79 | 91.84 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 66 | 32.3 | 2024-10-25 07:13:05+00:00 | |
🟦 | 89.08 | 84.08 | 95.18 | 88 | false | true | ? | preference-tuned | Original | mit | 110 | 9.24 | 2024-10-25 07:11:14+00:00 | |
⭕ | 88.82 | 78.15 | 97.04 | 91.26 | false | true | ? | instruction-tuned | Original | apache-2.0 | 1131 | 7.25 | 2024-11-14 11:38:25+00:00 | |
🟦 | 88.69 | 78.99 | 96.83 | 90.24 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:22+00:00 | |
⭕ | 88.37 | 81.43 | 97.13 | 86.56 | false | true | ? | instruction-tuned | Original | other | 808 | 122.61 | 2024-11-25 11:27:40+00:00 | |
🟦 | 88.23 | 77.17 | 95.93 | 91.6 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:05+00:00 | |
🟦 | 88.18 | 73.81 | 96.5 | 94.24 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:07:42+00:00 | |
🟦 | 87.73 | 77 | 96.6 | 89.58 | false | true | ? | preference-tuned | Original | llama3.2 | 402 | 3.21 | 2024-10-24 06:23:04+00:00 | |
🟦 | 87.24 | 74.01 | 96.35 | 91.36 | false | true | ? | preference-tuned | Original | other | 23 | 7.46 | 2024-12-19 05:59:29+00:00 | |
🟦 | 85.77 | 76.01 | 96.99 | 84.3 | false | true | ? | preference-tuned | Original | other | 40 | 10.31 | 2024-12-19 05:58:51+00:00 | |
🟢 | 85.77 | 66.64 | 97.94 | 92.72 | false | false | ? | pretrained | Original | unknown | 210 | 11.1 | 2024-10-29 07:23:16+00:00 | |
🟦 | 84.92 | 69.69 | 96.04 | 89.04 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:42+00:00 | |
⭕ | 84.33 | 64.5 | 97.14 | 91.36 | false | false | ? | instruction-tuned | Original | llama3.2 | 430 | 1.24 | 2024-10-25 07:14:38+00:00 | |
⭕ | 83.99 | 56.97 | 97.82 | 97.2 | false | true | ? | instruction-tuned | Original | gemma | 44 | 9.24 | 2024-11-14 11:39:56+00:00 | |
🟦 | 82.4 | 63.68 | 96.17 | 87.36 | false | true | ? | preference-tuned | Original | other | 12 | 3.23 | 2024-12-19 06:00:40+00:00 | |
⭕ | 81.02 | 53.96 | 97.1 | 92 | false | true | ? | instruction-tuned | Original | apache-2.0 | 346 | 1.71 | 2024-11-22 10:44:37+00:00 | |
🟦 | 80.88 | 56.37 | 96.35 | 89.92 | false | true | ? | preference-tuned | Original | apache-2.0 | 91 | 0.75 | 2025-04-29 10:46:33+00:00 | |
🟢 | 80.24 | 49.69 | 97.83 | 93.2 | false | false | ? | pretrained | Original | apache-2.0 | 101 | 0.49 | 2024-10-22 13:46:13+00:00 | |
🟦 | 79.33 | 52.03 | 96.69 | 89.28 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:09:02+00:00 | |
🟦 | 76.66 | 47.84 | 96.58 | 85.58 | false | true | ? | preference-tuned | Original | other | 21 | 1.67 | 2024-12-19 06:01:10+00:00 | |
🟦 | 73.7 | 35.97 | 98.16 | 86.96 | false | true | ? | preference-tuned | Original | apache-2.0 | 60 | 0.36 | 2024-12-10 08:36:15+00:00 | |
🟦 | 71.02 | 48.86 | 97.64 | 66.56 | false | false | ? | preference-tuned | Original | apache-2.0 | 100 | 0.49 | 2024-11-18 11:36:27+00:00 | |
🟢 🏥 | 70.41 | 48.97 | 95.34 | 66.92 | true | false | ? | pretrained | Original | llama3 | 37 | 8.03 | 2024-10-25 07:16:58+00:00 | |
⭕ 🏥 | 62.93 | 25.01 | 95.84 | 67.94 | true | true | ? | instruction-tuned | Original | other | 108 | 27.01 | 2025-05-22 07:54:44+00:00 | |
🟦 | 55.79 | 1.4 | 99.56 | 66.41 | false | true | ? | preference-tuned | Original | apache-2.0 | 33 | 3.32 | 2024-12-10 10:39:41+00:00 |
HealthBench consists of 5,000 multi-turn conversations between users (patients or clinicians) and AI models, covering a wide range of medical topics and scenarios. Each conversation is accompanied by a set of physician-created rubric criteria, 48,562 unique criteria in total, which are used to grade model responses on accuracy, relevance, and safety. For more information, refer to the HealthBench paper and the OpenAI blog post.
Judge Used: meta-llama/Llama-3.1-70B-Instruct
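As a rough illustration of rubric-based grading, the sketch below scores one conversation from a list of judged criteria and then aggregates across conversations. It assumes each criterion carries a point value (negative points penalize undesirable behavior), that the judge returns a met/not-met decision per criterion, and that the overall score is the mean of per-conversation scores clipped to [0, 1]. The names `RubricItem`, `example_score`, and `overall_score` are hypothetical and are not taken from the leaderboard's code, which may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RubricItem:
    """One physician-written criterion plus the judge's verdict (hypothetical structure)."""
    criterion: str
    points: float  # positive = desired behavior, negative = penalized behavior
    met: bool      # assumed to come from the judge model


def example_score(items: List[RubricItem]) -> Optional[float]:
    """Points earned on met criteria divided by the maximum achievable positive points."""
    max_points = sum(item.points for item in items if item.points > 0)
    if max_points == 0:
        return None  # no positive criteria, so the example cannot be scored
    earned = sum(item.points for item in items if item.met)
    return earned / max_points


def overall_score(per_example: List[Optional[float]]) -> float:
    """Mean of the scorable examples, clipped to the [0, 1] range."""
    valid = [s for s in per_example if s is not None]
    if not valid:
        return 0.0
    mean = sum(valid) / len(valid)
    return min(1.0, max(0.0, mean))


if __name__ == "__main__":
    rubric = [
        RubricItem("Advises seeking urgent care for red-flag symptoms", 5, True),
        RubricItem("Asks about current medications", 5, False),
        RubricItem("States an incorrect dosage", -2, True),
    ]
    print(example_score(rubric))
    print(overall_score([example_score(rubric), 0.8]))
```

Under these assumptions a single conversation can score below zero when penalties outweigh earned points, which is why the clipping is applied only to the aggregate.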
🟦 🏥 | 0.49 | 0.54 | 0.37 | 0.46 | 0.61 | 0.43 | 0.63 | 0.35 | 0.53 | 0.41 | 0.58 | 0.52 | 0.58 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 12162 | 235.09 | 2025-04-29 10:42:15+00:00 |
🟦 | 0.5 | 0.54 | 0.37 | 0.46 | 0.61 | 0.43 | 0.63 | 0.35 | 0.53 | 0.41 | 0.58 | 0.52 | 0.58 | false | true | ? | preference-tuned | Original | apache-2.0 | 352 | 235.09 | 2025-04-29 10:42:15+00:00 | |
🟦 | 0.5 | 0.53 | 0.38 | 0.45 | 0.6 | 0.44 | 0.62 | 0.35 | 0.52 | 0.41 | 0.57 | 0.52 | 0.57 | false | true | ? | preference-tuned | Original | apache-2.0 | 162 | 32.76 | 2025-04-29 10:45:55+00:00 | |
🟦 | 0.49 | 0.53 | 0.36 | 0.44 | 0.6 | 0.41 | 0.63 | 0.36 | 0.51 | 0.4 | 0.57 | 0.52 | 0.6 | false | true | ? | preference-tuned | Original | other | 12162 | 685 | 2024-10-22 23:04:13+00:00 | |
🟦 | 0.43 | 0.47 | 0.3 | 0.36 | 0.54 | 0.38 | 0.58 | 0.28 | 0.42 | 0.38 | 0.51 | 0.5 | 0.58 | false | true | ? | preference-tuned | Original | apache-2.0 | 79 | 4.02 | 2025-04-29 10:46:23+00:00 | |
⭕ | 0.42 | 0.45 | 0.34 | 0.35 | 0.54 | 0.33 | 0.56 | 0.34 | 0.4 | 0.35 | 0.52 | 0.51 | 0.62 | false | true | ? | instruction-tuned | Original | other | 1980 | 685 | 2024-10-22 23:04:13+00:00 | |
⭕ | 0.41 | 0.43 | 0.38 | 0.35 | 0.52 | 0.34 | 0.56 | 0.24 | 0.4 | 0.38 | 0.5 | 0.52 | 0.57 | false | true | ? | instruction-tuned | Original | llama3.1 | 1149 | 70.55 | 2024-10-25 07:09:19+00:00 | |
⭕ | 0.4 | 0.42 | 0.33 | 0.31 | 0.53 | 0.3 | 0.55 | 0.36 | 0.34 | 0.36 | 0.5 | 0.51 | 0.66 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:43+00:00 | |
⭕ | 0.35 | 0.38 | 0.27 | 0.25 | 0.46 | 0.26 | 0.51 | 0.27 | 0.29 | 0.32 | 0.46 | 0.45 | 0.61 | false | true | ? | instruction-tuned | Original | other | 808 | 122.61 | 2024-11-25 11:27:40+00:00 | |
⭕ | 0.34 | 0.38 | 0.31 | 0.24 | 0.45 | 0.25 | 0.5 | 0.25 | 0.28 | 0.32 | 0.45 | 0.45 | 0.6 | false | true | ? | instruction-tuned | Original | null | 0 | -1 | 2025-01-17 12:10:32+00:00 | |
⭕ | 0.34 | 0.37 | 0.28 | 0.25 | 0.44 | 0.26 | 0.46 | 0.22 | 0.27 | 0.33 | 0.44 | 0.46 | 0.61 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 66 | 32.3 | 2024-10-25 07:13:05+00:00 | |
⭕ | 0.33 | 0.36 | 0.3 | 0.24 | 0.46 | 0.24 | 0.47 | 0.26 | 0.26 | 0.32 | 0.45 | 0.48 | 0.61 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-02 10:08:27+00:00 | |
🟦 🏥 | 0.33 | 0.38 | 0.26 | 0.24 | 0.44 | 0.24 | 0.46 | 0.25 | 0.25 | 0.32 | 0.47 | 0.44 | 0.55 | true | true | ? | preference-tuned | Original | llama3 | 34 | 70.55 | 2024-10-24 06:24:59+00:00 | |
🟦 | 0.32 | 0.35 | 0.32 | 0.23 | 0.45 | 0.22 | 0.46 | 0.26 | 0.23 | 0.33 | 0.45 | 0.48 | 0.6 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 0.32 | 0.34 | 0.3 | 0.24 | 0.43 | 0.22 | 0.46 | 0.24 | 0.25 | 0.32 | 0.43 | 0.46 | 0.58 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 0.31 | 0.34 | 0.24 | 0.22 | 0.45 | 0.25 | 0.45 | 0.19 | 0.25 | 0.3 | 0.43 | 0.39 | 0.54 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:05+00:00 | |
🟦 | 0.29 | 0.3 | 0.28 | 0.19 | 0.41 | 0.19 | 0.46 | 0.24 | 0.2 | 0.3 | 0.41 | 0.45 | 0.58 | false | true | ? | preference-tuned | Original | llama3.1 | 617 | 70.55 | 2024-10-24 13:25:28+00:00 | |
🟦 🏥 | 0.26 | 0.29 | 0.23 | 0.15 | 0.38 | 0.16 | 0.4 | 0.2 | 0.14 | 0.28 | 0.4 | 0.4 | 0.56 | true | true | ? | preference-tuned | Original | llama3 | 339 | 70 | 2024-07-24 14:33:56+00:00 | |
🟦 | 0.26 | 0.28 | 0.2 | 0.16 | 0.37 | 0.17 | 0.42 | 0.16 | 0.17 | 0.28 | 0.36 | 0.38 | 0.54 | false | true | ? | preference-tuned | Original | llama3.2 | 402 | 3.21 | 2024-10-24 06:23:04+00:00 | |
🟢 🏥 | 0.21 | 0.24 | 0.17 | 0.1 | 0.33 | 0.12 | 0.33 | 0.18 | 0.08 | 0.23 | 0.34 | 0.35 | 0.55 | true | true | ? | pretrained | Original | null | 9 | 70.55 | 2024-11-11 13:58:37+00:00 | |
🟦 | 0.16 | 0.18 | 0.13 | 0.04 | 0.3 | 0.08 | 0.3 | 0.11 | 0.04 | 0.22 | 0.28 | 0.32 | 0.52 | false | true | ? | preference-tuned | Original | apache-2.0 | 91 | 0.75 | 2025-04-29 10:46:33+00:00 |
🟦 🏥 | 0.24 | 0.27 | 0.15 | 0.27 | 0.27 | 0.28 | 0.26 | 0.18 | 0.32 | 0.11 | 0.33 | 0.41 | 0.46 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 12162 | 235.09 | 2025-04-29 10:42:15+00:00 |
🟦 | 0.24 | 0.27 | 0.1 | 0.27 | 0.27 | 0.28 | 0.26 | 0.18 | 0.32 | 0.1 | 0.33 | 0.41 | 0.5 | false | true | ? | preference-tuned | Original | apache-2.0 | 352 | 235.09 | 2025-04-29 10:42:15+00:00 | |
🟦 | 0.24 | 0.28 | 0.15 | 0.25 | 0.24 | 0.28 | 0.24 | 0.15 | 0.31 | 0.11 | 0.35 | 0.45 | 0.46 | false | true | ? | preference-tuned | Original | apache-2.0 | 162 | 32.76 | 2025-04-29 10:45:55+00:00 | |
🟦 | 0.22 | 0.26 | 0.07 | 0.25 | 0.22 | 0.24 | 0.26 | 0.16 | 0.28 | 0.08 | 0.33 | 0.44 | 0.52 | false | true | ? | preference-tuned | Original | other | 12162 | 685 | 2024-10-22 23:04:13+00:00 | |
🟦 | 0.16 | 0.2 | 0.05 | 0.16 | 0.17 | 0.22 | 0.22 | 0.07 | 0.2 | 0.08 | 0.27 | 0.4 | 0.47 | false | true | ? | preference-tuned | Original | apache-2.0 | 79 | 4.02 | 2025-04-29 10:46:23+00:00 | |
⭕ | 0.15 | 0.16 | 0.15 | 0.17 | 0.15 | 0.15 | 0.19 | 0.07 | 0.16 | 0.07 | 0.25 | 0.47 | 0.47 | false | true | ? | instruction-tuned | Original | llama3.1 | 1149 | 70.55 | 2024-10-25 07:09:19+00:00 | |
⭕ | 0.14 | 0.12 | 0.08 | 0.14 | 0.19 | 0.15 | 0.15 | 0.1 | 0.16 | 0.02 | 0.24 | 0.4 | 0.53 | false | true | ? | instruction-tuned | Original | other | 1980 | 685 | 2024-10-22 23:04:13+00:00 | |
⭕ | 0.12 | 0.13 | 0.1 | 0.1 | 0.16 | 0.12 | 0.11 | 0.11 | 0.07 | 0.03 | 0.24 | 0.42 | 0.55 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:43+00:00 | |
🟦 🏥 | 0.07 | 0.12 | 0.02 | 0.04 | 0.11 | 0.1 | 0.07 | 0.07 | 0.02 | 0 | 0.24 | 0.34 | 0.48 | true | true | ? | preference-tuned | Original | llama3 | 34 | 70.55 | 2024-10-24 06:24:59+00:00 | |
⭕ | 0.07 | 0.1 | 0.03 | 0.05 | 0.1 | 0.09 | 0.06 | 0.05 | 0.04 | 0.01 | 0.2 | 0.36 | 0.5 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 66 | 32.3 | 2024-10-25 07:13:05+00:00 | |
⭕ | 0.06 | 0.1 | 0.06 | 0.04 | 0.09 | 0.07 | 0.06 | 0.02 | 0.03 | 0 | 0.2 | 0.34 | 0.45 | false | true | ? | instruction-tuned | Original | null | 0 | -1 | 2025-01-17 12:10:32+00:00 | |
🟦 | 0.06 | 0.11 | 0 | 0.05 | 0.07 | 0.09 | 0.04 | 0 | 0.02 | 0 | 0.19 | 0.27 | 0.41 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:05+00:00 | |
⭕ | 0.05 | 0.09 | 0 | 0.03 | 0.08 | 0.09 | 0.1 | 0.02 | 0.03 | 0 | 0.18 | 0.34 | 0.49 | false | true | ? | instruction-tuned | Original | other | 808 | 122.61 | 2024-11-25 11:27:40+00:00 | |
⭕ | 0.05 | 0.06 | 0.02 | 0.03 | 0.1 | 0.08 | 0.07 | 0 | 0 | 0 | 0.18 | 0.37 | 0.46 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-02 10:08:27+00:00 | |
🟦 | 0.05 | 0.05 | 0.05 | 0.04 | 0.07 | 0.06 | 0.04 | 0.06 | 0.02 | 0 | 0.18 | 0.41 | 0.49 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 0.05 | 0.08 | 0.1 | 0.03 | 0.08 | 0.02 | 0.03 | 0.03 | 0 | 0.01 | 0.19 | 0.42 | 0.48 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 0.02 | 0 | 0.04 | 0 | 0.07 | 0.02 | 0.05 | 0.01 | 0 | 0 | 0.16 | 0.4 | 0.44 | false | true | ? | preference-tuned | Original | llama3.1 | 617 | 70.55 | 2024-10-24 13:25:28+00:00 | |
🟦 🏥 | 0.01 | 0.03 | 0.02 | 0 | 0.06 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.33 | 0.46 | true | true | ? | preference-tuned | Original | llama3 | 339 | 70 | 2024-07-24 14:33:56+00:00 | |
🟦 | 0 | 0.02 | 0 | 0 | 0.04 | 0.01 | 0.02 | 0 | 0 | 0 | 0.13 | 0.29 | 0.42 | false | true | ? | preference-tuned | Original | llama3.2 | 402 | 3.21 | 2024-10-24 06:23:04+00:00 | |
🟦 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0.24 | 0.41 | false | true | ? | preference-tuned | Original | apache-2.0 | 91 | 0.75 | 2025-04-29 10:46:33+00:00 | |
🟢 🏥 | 0 | 0.02 | 0 | 0 | 0.03 | 0 | 0 | 0 | 0 | 0 | 0.11 | 0.28 | 0.44 | true | true | ? | pretrained | Original | null | 9 | 70.55 | 2024-11-11 13:58:37+00:00 |
🟦 🏥 | 1.01 | +0.002/-0.002 | 1.01 | 1.01 | 1.02 | 1.01 | 1.01 | 1.01 | 1.01 | 1.01 | 1.02 | false | false | ? | instruction-tuned | Original | cc-by-nc-4.0 | 1149 | 235.09 | 2025-05-19 05:21:32+00:00 |
⭕ 🏥 | 1 | +0.002/-0.002 | 1 | 1 | 1 | 1.01 | 1.01 | 1 | 1 | 1.01 | 1 | true | true | ? | instruction-tuned | Original | apache-2.0 | 5 | 8.03 | 2025-05-19 05:21:32+00:00 | |
⭕ 🏥 | 1.01 | +0.002/-0.002 | 1 | 1.01 | 1 | 1.02 | 1.01 | 1.01 | 1.01 | 1.01 | 1 | true | true | ? | instruction-tuned | Original | apache-2.0 | 2 | 9.24 | 2025-05-19 05:20:13+00:00 | |
🟦 | 1.02 | +0.005/-0.006 | 1.01 | 1.02 | 1 | 1.03 | 1.03 | 1.05 | 1.01 | 1.04 | 1.02 | false | true | ? | preference-tuned | Original | apache-2.0 | 239 | 32.76 | 2024-11-28 04:57:07+00:00 | |
🟦 | 1.05 | +0.006/-0.005 | 1.04 | 1.02 | 1.02 | 1.09 | 1.07 | 1.1 | 1.01 | 1.06 | 1.02 | false | true | ? | preference-tuned | Original | other | 23 | 7.46 | 2024-12-19 05:59:29+00:00 | |
⭕ | 1.12 | +0.01/-0.01 | 1.1 | 1.09 | 1.12 | 1.18 | 1.13 | 1.2 | 1.05 | 1.14 | 1.09 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-07 16:20:40+00:00 | |
⭕ | 1.12 | +0.012/-0.012 | 1.08 | 1.09 | 1.12 | 1.18 | 1.12 | 1.22 | 1.07 | 1.12 | 1.11 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 66 | 32.3 | 2024-10-25 07:13:05+00:00 | |
🟦 | 1.12 | +0.01/-0.01 | 1.08 | 1.08 | 1.12 | 1.18 | 1.12 | 1.2 | 1.05 | 1.15 | 1.13 | false | true | ? | preference-tuned | Original | other | 675 | 72.71 | 2024-11-14 11:37:18+00:00 | |
⭕ | 1.13 | +0.011/-0.014 | 1.11 | 1.1 | 1.13 | 1.18 | 1.11 | 1.18 | 1.07 | 1.14 | 1.13 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-07 16:20:53+00:00 | |
⭕ | 1.13 | +0.01/-0.009 | 1.1 | 1.07 | 1.16 | 1.2 | 1.14 | 1.22 | 1.06 | 1.13 | 1.11 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-07 16:19:08+00:00 | |
🟦 | 1.13 | +0.012/-0.012 | 1.1 | 1.08 | 1.09 | 1.24 | 1.12 | 1.25 | 1.05 | 1.16 | 1.09 | false | true | ? | preference-tuned | Original | other | 21 | 1.67 | 2024-12-19 06:01:10+00:00 | |
⭕ | 1.15 | +0.011/-0.013 | 1.11 | 1.09 | 1.12 | 1.31 | 1.13 | 1.24 | 1.08 | 1.16 | 1.11 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:42:03+00:00 | |
⭕ | 1.16 | +0.013/-0.013 | 1.12 | 1.11 | 1.16 | 1.21 | 1.18 | 1.27 | 1.08 | 1.22 | 1.13 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:32+00:00 | |
🟦 | 1.17 | +0.012/-0.013 | 1.14 | 1.1 | 1.17 | 1.23 | 1.17 | 1.32 | 1.09 | 1.2 | 1.13 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
⭕ | 1.2 | +0.016/-0.019 | 1.13 | 1.13 | 1.16 | 1.4 | 1.21 | 1.34 | 1.13 | 1.17 | 1.17 | false | true | ? | instruction-tuned | Original | llama3.1 | 164 | 8.03 | 2024-11-14 11:35:17+00:00 | |
⭕ | 1.21 | +0.011/-0.012 | 1.15 | 1.16 | 1.18 | 1.41 | 1.19 | 1.3 | 1.09 | 1.21 | 1.17 | false | true | ? | instruction-tuned | Original | llama3.1 | 1149 | 70.55 | 2024-10-25 07:09:19+00:00 | |
🟦 | 1.21 | +0.012/-0.015 | 1.16 | 1.13 | 1.24 | 1.37 | 1.22 | 1.29 | 1.12 | 1.22 | 1.15 | false | false | ? | preference-tuned | Original | other | 87 | 3.09 | 2024-11-18 11:36:42+00:00 | |
⭕ | 1.22 | +0.017/-0.017 | 1.11 | 1.11 | 1.12 | 1.56 | 1.2 | 1.37 | 1.1 | 1.17 | 1.21 | false | false | ? | instruction-tuned | Original | llama3.2 | 430 | 1.24 | 2024-10-25 07:14:38+00:00 | |
⭕ | 1.23 | +0.016/-0.012 | 1.18 | 1.17 | 1.23 | 1.34 | 1.2 | 1.36 | 1.1 | 1.25 | 1.2 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-14 08:41:43+00:00 | |
🟦 | 1.23 | +0.016/-0.014 | 1.2 | 1.1 | 1.27 | 1.16 | 1.3 | 1.31 | 1.23 | 1.27 | 1.2 | false | true | ? | preference-tuned | Original | mit | 110 | 9.24 | 2024-10-25 07:11:14+00:00 | |
🟦 | 1.23 | +0.018/-0.017 | 1.17 | 1.12 | 1.24 | 1.34 | 1.26 | 1.34 | 1.15 | 1.28 | 1.18 | false | true | ? | preference-tuned | Original | llama3 | 1417 | 70.55 | 2024-10-24 13:25:47+00:00 | |
🟦 | 1.24 | +0.015/-0.014 | 1.2 | 1.2 | 1.26 | 1.34 | 1.26 | 1.38 | 1.09 | 1.27 | 1.2 | false | true | ? | preference-tuned | Original | apache-2.0 | 137 | 14.77 | 2025-05-12 12:17:12+00:00 | |
🟦 | 1.25 | +0.013/-0.014 | 1.21 | 1.19 | 1.19 | 1.39 | 1.27 | 1.4 | 1.09 | 1.31 | 1.23 | false | true | ? | preference-tuned | Original | other | 343 | 72.71 | 2024-10-22 14:35:49+00:00 | |
🟦 | 1.26 | +0.018/-0.013 | 1.21 | 1.2 | 1.21 | 1.36 | 1.26 | 1.43 | 1.13 | 1.29 | 1.22 | false | true | ? | preference-tuned | Original | apache-2.0 | 352 | 235.09 | 2025-04-29 10:42:15+00:00 | |
🟦 | 1.26 | +0.019/-0.018 | 1.18 | 1.14 | 1.27 | 1.42 | 1.28 | 1.41 | 1.13 | 1.26 | 1.25 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 🏥 | 1.27 | +0.023/-0.016 | 1.24 | 1.19 | 1.27 | 1.39 | 1.25 | 1.38 | 1.13 | 1.39 | 1.17 | true | true | ? | preference-tuned | Original | llama3 | 339 | 70 | 2024-07-24 14:33:56+00:00 | |
🟦 | 1.32 | +0.017/-0.019 | 1.3 | 1.24 | 1.34 | 1.43 | 1.35 | 1.46 | 1.21 | 1.3 | 1.26 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-03-06 02:18:06+00:00 | |
⭕ | 1.32 | +0.021/-0.019 | 1.25 | 1.22 | 1.38 | 1.34 | 1.37 | 1.45 | 1.22 | 1.41 | 1.28 | false | true | ? | instruction-tuned | Original | other | 808 | 122.61 | 2024-11-25 11:27:40+00:00 | |
🟦 | 1.33 | +0.016/-0.018 | 1.27 | 1.22 | 1.3 | 1.57 | 1.31 | 1.46 | 1.16 | 1.36 | 1.34 | false | true | ? | preference-tuned | Original | llama3.3 | 632 | 70.55 | 2024-12-09 09:10:34+00:00 | |
⭕ 🏥 | 1.34 | +0.016/-0.011 | 1.35 | 1.27 | 1.3 | 1.4 | 1.32 | 1.42 | 1.28 | 1.37 | 1.31 | true | true | ? | instruction-tuned | Original | other | 108 | 27.01 | 2025-05-22 07:54:44+00:00 | |
🟦 🏥 | 1.35 | +0.018/-0.018 | 1.28 | 1.26 | 1.37 | 1.5 | 1.34 | 1.5 | 1.19 | 1.41 | 1.3 | true | true | ? | preference-tuned | Original | apache-2.0 | 300 | 8.19 | 2025-05-20 11:36:36+00:00 | |
⭕ | 1.36 | +0.021/-0.017 | 1.27 | 1.24 | 1.4 | 1.51 | 1.33 | 1.53 | 1.2 | 1.45 | 1.31 | false | true | ? | instruction-tuned | Original | null | -1 | -1 | 2025-05-02 10:08:27+00:00 | |
🟦 | 1.37 | +0.02/-0.016 | 1.31 | 1.23 | 1.33 | 1.7 | 1.36 | 1.46 | 1.23 | 1.38 | 1.29 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 🏥 | 1.39 | +0.02/-0.024 | 1.37 | 1.31 | 1.43 | 1.45 | 1.39 | 1.45 | 1.32 | 1.43 | 1.38 | true | true | ? | preference-tuned | Original | llama3 | 34 | 70.55 | 2024-10-24 06:24:59+00:00 | |
🟦 | 1.4 | +0.018/-0.03 | 1.23 | 1.33 | 1.25 | 1.97 | 1.33 | 1.53 | 1.23 | 1.46 | 1.29 | false | false | ? | preference-tuned | Original | apache-2.0 | 100 | 0.49 | 2024-11-18 11:36:27+00:00 | |
⭕ | 1.4 | +0.018/-0.015 | 1.36 | 1.32 | 1.44 | 1.46 | 1.4 | 1.53 | 1.31 | 1.46 | 1.35 | false | true | ? | instruction-tuned | Original | other | 1980 | 685 | 2024-10-22 23:04:13+00:00 | |
⭕ | 1.41 | +0.027/-0.027 | 1.38 | 1.22 | 1.43 | 1.55 | 1.48 | 1.52 | 1.32 | 1.42 | 1.34 | false | true | ? | instruction-tuned | Original | gemma | 44 | 9.24 | 2024-11-14 11:39:56+00:00 | |
🟢 | 1.41 | +0.026/-0.019 | 1.29 | 1.27 | 1.37 | 1.81 | 1.35 | 1.57 | 1.21 | 1.46 | 1.34 | false | true | ? | pretrained | Original | other | 39 | 72.71 | 2024-11-14 11:37:02+00:00 | |
🟦 | 1.41 | +0.022/-0.024 | 1.36 | 1.2 | 1.42 | 1.72 | 1.38 | 1.55 | 1.24 | 1.41 | 1.39 | false | true | ? | preference-tuned | Original | llama3.1 | 2845 | 8.03 | 2024-07-24 14:33:56+00:00 | |
⭕ 🏥 | 1.44 | +0.024/-0.023 | 1.35 | 1.3 | 1.43 | 1.67 | 1.42 | 1.66 | 1.21 | 1.53 | 1.41 | true | true | ? | instruction-tuned | Original | null | 42 | 8.19 | 2025-05-16 09:57:55+00:00 | |
🟦 | 1.46 | +0.023/-0.023 | 1.34 | 1.31 | 1.4 | 1.81 | 1.42 | 1.58 | 1.37 | 1.42 | 1.46 | false | true | ? | preference-tuned | Original | llama3.2 | 402 | 3.21 | 2024-10-24 06:23:04+00:00 | |
🟢 🏥 | 1.47 | +0.021/-0.019 | 1.4 | 1.39 | 1.44 | 1.71 | 1.46 | 1.59 | 1.43 | 1.46 | 1.37 | true | true | ? | pretrained | Original | null | 9 | 70.55 | 2024-11-11 13:58:37+00:00 | |
⭕ | 1.49 | +0.03/-0.022 | 1.37 | 1.43 | 1.52 | 1.73 | 1.45 | 1.53 | 1.46 | 1.5 | 1.42 | false | true | ? | instruction-tuned | Original | apache-2.0 | 41 | 6.06 | 2024-10-22 23:04:13+00:00 | |
🟦 | 1.5 | +0.021/-0.016 | 1.44 | 1.45 | 1.47 | 1.62 | 1.53 | 1.63 | 1.29 | 1.56 | 1.51 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:06:51+00:00 | |
🟦 | 1.51 | +0.024/-0.031 | 1.42 | 1.33 | 1.5 | 1.75 | 1.51 | 1.55 | 1.44 | 1.57 | 1.49 | false | true | ? | preference-tuned | Original | llama3 | 254 | 8.03 | 2024-12-10 09:38:34+00:00 | |
⭕ | 1.52 | +0.021/-0.02 | 1.49 | 1.39 | 1.54 | 1.58 | 1.58 | 1.54 | 1.55 | 1.51 | 1.5 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 613 | 10.73 | 2024-10-22 22:52:54+00:00 | |
🟦 | 1.52 | +0.021/-0.025 | 1.37 | 1.38 | 1.53 | 1.94 | 1.52 | 1.69 | 1.27 | 1.46 | 1.53 | false | true | ? | preference-tuned | Original | llama3.1 | 617 | 70.55 | 2024-10-24 13:25:28+00:00 | |
⭕ | 1.54 | +0.022/-0.027 | 1.49 | 1.43 | 1.57 | 1.63 | 1.65 | 1.51 | 1.56 | 1.57 | 1.44 | false | true | ? | instruction-tuned | Original | apache-2.0 | 1131 | 7.25 | 2024-11-14 11:38:25+00:00 | |
🟦 | 1.69 | +0.021/-0.02 | 1.69 | 1.53 | 1.7 | 1.79 | 1.77 | 1.81 | 1.52 | 1.7 | 1.72 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:05+00:00 | |
🟢 🏥 | 1.71 | +0.025/-0.025 | 1.77 | 1.64 | 1.89 | 1.61 | 1.71 | 1.76 | 1.58 | 1.69 | 1.7 | true | false | ? | pretrained | Original | llama3 | 37 | 8.03 | 2024-10-25 07:16:58+00:00 | |
⭕ | 1.89 | +0.029/-0.029 | 1.9 | 1.65 | 2.04 | 1.87 | 1.91 | 1.99 | 1.74 | 1.98 | 1.93 | false | true | ? | instruction-tuned | Original | apache-2.0 | 0 | 32.76 | 2025-05-19 12:37:03+00:00 | |
🟦 | 1.92 | +0.026/-0.022 | 1.87 | 1.8 | 1.92 | 2.05 | 1.93 | 1.93 | 1.88 | 1.94 | 1.93 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:42+00:00 | |
🟦 | 2.08 | +0.035/-0.051 | 2.04 | 1.93 | 2.02 | 2.38 | 2.22 | 1.86 | 2.42 | 1.93 | 1.92 | false | true | ? | preference-tuned | Original | llama3 | 411 | 8.03 | 2024-12-10 10:10:16+00:00 | |
🟦 | 2.12 | +0.033/-0.029 | 1.91 | 1.97 | 2.06 | 2.58 | 2.12 | 2.23 | 2.08 | 2.08 | 2.03 | false | true | ? | preference-tuned | Original | apache-2.0 | 91 | 0.75 | 2025-04-29 10:46:33+00:00 | |
⭕ | 2.19 | +0.031/-0.039 | 2.11 | 2.07 | 2.09 | 2.61 | 2.21 | 2.01 | 2.41 | 2.09 | 2.09 | false | true | ? | instruction-tuned | Original | apache-2.0 | 346 | 1.71 | 2024-11-22 10:44:37+00:00 | |
🟦 | 2.42 | +0.032/-0.035 | 2.32 | 2.51 | 2.33 | 2.51 | 2.43 | 2.31 | 2.55 | 2.43 | 2.37 | false | true | ? | preference-tuned | Original | apache-2.0 | 33 | 3.32 | 2024-12-10 10:39:41+00:00 | |
🟢 | 2.45 | +0.029/-0.038 | 2.36 | 2.37 | 2.26 | 2.87 | 2.49 | 2.22 | 2.77 | 2.37 | 2.39 | false | false | ? | pretrained | Original | unknown | 210 | 11.1 | 2024-10-29 07:23:16+00:00 | |
🟦 | 2.65 | +0.042/-0.032 | 2.67 | 2.52 | 2.65 | 2.86 | 2.75 | 2.56 | 2.69 | 2.65 | 2.53 | false | true | ? | preference-tuned | Original | apache-2.0 | 67 | 14.77 | 2024-12-10 07:27:22+00:00 | |
🟦 | 2.95 | +0.042/-0.037 | 2.74 | 2.96 | 2.93 | 3.2 | 2.93 | 2.97 | 3.02 | 2.78 | 3.05 | false | false | ? | preference-tuned | Original | other | 26 | 3.09 | 2024-10-22 13:17:21+00:00 |
🟢 🏥 | 80.86 | 91.71 | 69.33 | 76.62 | 83.11 | 88.07 | 80.6 | 84.15 | false | false | ? | instruction-tuned | Original | cc-by-nc-sa-4.0 | 1980 | 235.09 | 2025-01-20 10:32:48+00:00 |
🟦 | 80.86 | 91.71 | 69.33 | 76.62 | 83.11 | 88.07 | 73 | 84.15 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟢 🏥 | 80.02 | 86.97 | 64 | 70.95 | 79.26 | 91.95 | 80.6 | 86.38 | true | true | ? | pretrained | Original | null | 9 | 70.55 | 2024-11-11 13:58:37+00:00 | |
🟦 | 79.82 | 87.62 | 65.33 | 71.79 | 78.16 | 94.79 | 73.6 | 87.45 | false | true | ? | preference-tuned | Original | llama3.1 | 617 | 70.55 | 2024-10-24 13:25:28+00:00 | |
🟦 | 79.47 | 86.34 | 64.24 | 72.01 | 78.32 | 94.6 | 73 | 87.77 | false | true | ? | preference-tuned | Original | llama3.3 | 632 | 70.55 | 2024-12-09 09:10:34+00:00 | |
⭕ | 78.82 | 90.36 | 66.79 | 71.93 | 79.03 | 92.17 | 63.4 | 88.09 | false | true | ? | instruction-tuned | Original | other | 1980 | 685 | 2024-10-22 23:04:13+00:00 | |
🟦 🏥 | 78.31 | 90.4 | 64.2 | 73.2 | 76.9 | 79 | 73.2 | 91.3 | true | true | ? | preference-tuned | Original | llama3 | 339 | 70 | 2024-07-24 14:33:56+00:00 | |
🟦 🏥 | 78.28 | 86.49 | 61.09 | 72.82 | 79.42 | 83.73 | 79.2 | 85.21 | true | true | ? | preference-tuned | Original | llama3 | 34 | 70.55 | 2024-10-24 06:24:59+00:00 | |
🟦 | 78.11 | 88.25 | 65.09 | 70.98 | 80.75 | 89.06 | 69.8 | 82.87 | false | true | ? | preference-tuned | Original | apache-2.0 | 352 | 235.09 | 2025-04-29 10:42:15+00:00 | |
🟦 | 77.02 | 86.51 | 65.09 | 69.69 | 75.26 | 83.27 | 74 | 85.32 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-20 10:32:48+00:00 | |
🟦 | 76.95 | 85.73 | 62.91 | 72.15 | 78.08 | 84.73 | 67.4 | 87.66 | false | true | ? | preference-tuned | Original | llama3 | 1417 | 70.55 | 2024-10-24 13:25:47+00:00 | |
🟦 | 75.73 | 83.04 | 58.91 | 65.91 | 74.47 | 88.48 | 71 | 88.3 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:06:51+00:00 | |
🟦 | 75.59 | 87.37 | 63.52 | 68.4 | 76.12 | 84.87 | 63.2 | 85.64 | false | true | ? | preference-tuned | Original | other | 343 | 72.71 | 2024-10-22 14:35:49+00:00 | |
⭕ | 75.2 | 81.68 | 60.36 | 70.69 | 76.67 | 91.67 | 58.6 | 86.7 | false | true | ? | instruction-tuned | Original | llama3.1 | 1149 | 70.55 | 2024-10-25 07:09:19+00:00 | |
⭕ | 75.03 | 87.65 | 66.67 | 68.25 | 75.81 | 85.93 | 52 | 88.94 | false | true | ? | instruction-tuned | Original | other | 808 | 122.61 | 2024-11-25 11:27:40+00:00 | |
🟢 | 75.02 | 86.98 | 62.06 | 67.32 | 75.49 | 84.94 | 74.8 | 73.51 | false | true | ? | pretrained | Original | other | 39 | 72.71 | 2024-11-14 11:37:02+00:00 | |
⭕ 🏥 | 73.68 | 82.59 | 60.85 | 64.71 | 74.08 | 79.02 | 67.8 | 86.7 | true | true | ? | instruction-tuned | Original | other | 108 | 27.01 | 2025-05-22 07:54:44+00:00 | |
🟦 | 72.6 | 85.17 | 61.45 | 67.97 | 72.9 | 75.47 | 61.6 | 83.62 | false | true | ? | preference-tuned | Original | other | 675 | 72.71 | 2024-11-14 11:37:18+00:00 | |
🟦 🏥 | 72.3 | 83.91 | 60.73 | 65.48 | 76.75 | 80.95 | 55.6 | 82.66 | true | true | ? | preference-tuned | Original | null | 58 | 14.47 | 2025-05-19 07:03:03+00:00 | |
⭕ | 71.13 | 85.9 | 59.64 | 64.81 | 71.48 | 75.03 | 53 | 88.09 | false | true | ? | instruction-tuned | Original | null | 0 | -1 | 2025-01-17 12:10:32+00:00 | |
🟦 | 70.99 | 85.19 | 60.48 | 66.34 | 73.29 | 79.28 | 48.4 | 83.94 | false | true | ? | preference-tuned | Original | apache-2.0 | 162 | 32.76 | 2025-04-29 10:45:55+00:00 | |
🟦 | 70.57 | 81.83 | 60.85 | 61.46 | 69.36 | 73.68 | 61.6 | 85.21 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:07:42+00:00 | |
🟢 🏥 | 70.22 | 93.5 | 49.09 | 74.4 | 75.96 | 55.78 | 69 | 73.83 | true | false | ? | pretrained | Original | llama3 | 37 | 8.03 | 2024-10-25 07:16:58+00:00 | |
🟦 | 69.85 | 83.15 | 59.15 | 63.73 | 69.52 | 75.84 | 49.6 | 87.98 | false | true | ? | preference-tuned | Original | apache-2.0 | 239 | 32.76 | 2024-11-28 04:57:07+00:00 | |
⭕ | 69.78 | 84.93 | 60.12 | 63.47 | 70.15 | 75.82 | 46 | 87.98 | false | true | ? | instruction-tuned | Original | apache-2.0 | 0 | 32.76 | 2025-05-19 12:37:03+00:00 | |
🟦 | 69 | 83.45 | 57.58 | 63.61 | 68.26 | 68.41 | 57.2 | 84.47 | false | true | ? | preference-tuned | Original | apache-2.0 | 137 | 14.77 | 2025-05-12 12:17:12+00:00 | |
🟦 | 68.8 | 80.47 | 53.09 | 59.12 | 65.59 | 70.69 | 70.4 | 82.23 | false | true | ? | preference-tuned | Original | apache-2.0 | 67 | 14.77 | 2024-12-10 07:27:22+00:00 | |
⭕ | 68.62 | 81.49 | 54.55 | 56.16 | 68.66 | 69.09 | 65.2 | 85.21 | false | true | ? | instruction-tuned | Original | gemma | 1373 | 27.43 | 2025-05-23 10:26:40+00:00 | |
🟦 | 68.5 | 85.63 | 61.09 | 64.57 | 72.19 | 76.6 | 44 | 75.43 | false | true | ? | preference-tuned | Original | apache-2.0 | 208 | 30.53 | 2025-04-29 10:45:32+00:00 | |
🟦 | 67.2 | 73.4 | 49.9 | 58.4 | 62 | 68.2 | 76.2 | 82.3 | false | true | ? | preference-tuned | Original | llama3.1 | 2845 | 8.03 | 2024-07-24 14:33:56+00:00 | |
⭕ | 66.16 | 76.09 | 49.21 | 54.94 | 61.59 | 61.52 | 75.6 | 84.15 | false | true | ? | instruction-tuned | Original | gemma | 44 | 9.24 | 2024-11-14 11:39:56+00:00 | |
⭕ | 65.99 | 73.07 | 50.79 | 57.9 | 63 | 69.5 | 62.8 | 84.89 | false | true | ? | instruction-tuned | Original | llama3.1 | 164 | 8.03 | 2024-11-14 11:35:17+00:00 | |
🟢 | 65.56 | 82.44 | 60.36 | 65.14 | 75.88 | 81.39 | 15.6 | 78.09 | false | false | ? | pretrained | Original | llama3.1 | 308 | 70.55 | 2024-11-14 11:33:15+00:00 | |
⭕ | 65.46 | 77.65 | 50.55 | 59.14 | 65.36 | 69.17 | 50 | 86.38 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 66 | 32.3 | 2024-10-25 07:13:05+00:00 | |
🟦 | 65.05 | 75.49 | 51.52 | 55.22 | 62.53 | 64.58 | 67 | 79.04 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:05+00:00 | |
⭕ 🏥 | 64.08 | 83.21 | 48 | 58.43 | 69.91 | 60.43 | 53.8 | 74.79 | true | true | ? | instruction-tuned | Original | apache-2.0 | 2 | 9.24 | 2025-05-19 05:20:13+00:00 | |
🟦 | 63.2 | 67.76 | 38.06 | 52.81 | 55.77 | 74.41 | 70.6 | 82.98 | false | true | ? | preference-tuned | Original | llama3.2 | 402 | 3.21 | 2024-10-24 06:23:04+00:00 | |
⭕ 🏥 | 63 | 73.21 | 44.85 | 61.61 | 65.2 | 61.06 | 77.2 | 57.87 | true | true | ? | instruction-tuned | Original | cc-by-nc-sa-4.0 | 3 | 0 | 2024-11-25 07:11:28+00:00 | |
🟦 🏥 | 62.48 | 75.65 | 49.21 | 54.98 | 61.35 | 63.59 | 71.4 | 61.17 | true | true | ? | preference-tuned | Original | mit | 0 | 14.77 | 2025-05-20 11:29:52+00:00 | |
🟦 | 61.78 | 75.39 | 48.61 | 54.86 | 59.07 | 63.31 | 50.6 | 80.64 | false | true | ? | preference-tuned | Original | other | 40 | 10.31 | 2024-12-19 05:58:51+00:00 | |
🟦 | 61.14 | 76.81 | 48.48 | 52.93 | 59.62 | 59.4 | 49 | 81.7 | false | true | ? | preference-tuned | Original | mit | 110 | 9.24 | 2024-10-25 07:11:14+00:00 | |
🟦 🏥 | 60.92 | 78.43 | 53.7 | 58.67 | 63.39 | 68.24 | 33.6 | 70.43 | true | true | ? | preference-tuned | Original | apache-2.0 | 300 | 8.19 | 2025-05-20 11:36:36+00:00 | |
🟦 | 60.28 | 76.52 | 47.64 | 56.75 | 60.41 | 61.33 | 47 | 72.34 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-03-06 02:18:06+00:00 | |
🟦 | 60.03 | 76.71 | 48 | 56.83 | 60.17 | 61.4 | 45.2 | 71.91 | false | true | ? | preference-tuned | Original | apache-2.0 | 274 | 7.62 | 2024-11-14 11:36:44+00:00 | |
🟢 | 59.84 | 73.38 | 45.33 | 53.84 | 57.66 | 57.69 | 55.8 | 75.21 | false | true | ? | pretrained | Original | apache-2.0 | 67 | 7.62 | 2024-11-14 11:36:22+00:00 | |
⭕ | 59.38 | 69.46 | 36.61 | 46.71 | 52 | 58.56 | 65.6 | 86.7 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 613 | 10.73 | 2024-10-22 22:52:54+00:00 | |
⭕ 🏥 | 58.92 | 63.51 | 36.97 | 52.16 | 55.38 | 57.85 | 67.4 | 79.15 | true | true | ? | instruction-tuned | Original | other | 106 | 4.3 | 2025-05-22 07:54:59+00:00 | |
🟦 | 58.83 | 68.24 | 39.03 | 50.82 | 53.89 | 56.98 | 58.2 | 84.68 | false | true | ? | preference-tuned | Original | llama3 | 411 | 8.03 | 2024-12-10 10:10:16+00:00 | |
🟦 | 58.22 | 70.51 | 43.15 | 51.11 | 55.7 | 55.65 | 49 | 82.45 | false | true | ? | preference-tuned | Original | llama3 | 254 | 8.03 | 2024-12-10 09:38:34+00:00 | |
🟦 | 57.45 | 70.83 | 39.39 | 52.35 | 54.99 | 54.48 | 53.2 | 76.91 | false | true | ? | preference-tuned | Original | other | 23 | 7.46 | 2024-12-19 05:59:29+00:00 | |
🟦 | 56.83 | 72.56 | 45.7 | 54.1 | 58.37 | 61.04 | 43.8 | 62.23 | false | true | ? | preference-tuned | Original | apache-2.0 | 79 | 4.02 | 2025-04-29 10:46:23+00:00 | |
⭕ | 56.69 | 68.3 | 39.27 | 50.2 | 53.02 | 54.42 | 57 | 74.57 | false | true | ? | instruction-tuned | Original | apache-2.0 | 76 | 7.94 | 2024-10-25 09:59:22+00:00 | |
⭕ | 55.01 | 65.23 | 35.03 | 45.85 | 45.95 | 41.85 | 69.2 | 81.91 | false | true | ? | instruction-tuned | Original | other | 64 | 7.27 | 2024-11-18 11:47:16+00:00 | |
⭕ 🏥 | 53.72 | 72.5 | 40.24 | 53.41 | 61.67 | 52.22 | 22.8 | 73.19 | true | true | ? | instruction-tuned | Original | apache-2.0 | 5 | 8.03 | 2025-05-19 05:21:32+00:00 | |
🟢 | 53.43 | 64.31 | 31.88 | 46.67 | 46.58 | 39.45 | 66.4 | 78.72 | false | false | ? | pretrained | Original | other | 211 | 7.27 | 2024-10-29 07:20:18+00:00 | |
⭕ | 53.42 | 65.06 | 34.79 | 46.31 | 49.25 | 50.63 | 45.8 | 82.13 | false | true | ? | instruction-tuned | Original | apache-2.0 | 1131 | 7.25 | 2024-11-14 11:38:25+00:00 | |
🟦 | 52.89 | 66.89 | 34.3 | 49.32 | 47.92 | 48.04 | 67.8 | 55.96 | false | false | ? | preference-tuned | Original | other | 26 | 3.09 | 2024-10-22 13:17:21+00:00 | |
🟦 | 52.61 | 69.82 | 38.55 | 47 | 50.51 | 55.28 | 49 | 58.09 | false | true | ? | preference-tuned | Original | other | 13 | 7.27 | 2024-12-19 06:00:16+00:00 | |
🟢 | 51.9 | 62.45 | 27.88 | 43.49 | 43.91 | 44.54 | 58.4 | 82.66 | false | false | ? | pretrained | Original | unknown | 210 | 11.1 | 2024-10-29 07:23:16+00:00 | |
⭕ | 50.45 | 64 | 35.15 | 42.55 | 44.23 | 46.93 | 61.8 | 58.51 | false | true | ? | instruction-tuned | Original | apache-2.0 | 41 | 6.06 | 2024-10-22 23:04:13+00:00 | |
🟢 | 50.06 | 64.82 | 38.67 | 49.96 | 55.07 | 52.18 | 38 | 51.7 | false | false | ? | pretrained | Original | llama3.1 | 1068 | 8.03 | 2024-11-14 07:33:20+00:00 | |
🟦 | 49.62 | 67.99 | 34.79 | 49.15 | 48.78 | 51.55 | 29.2 | 65.85 | false | false | ? | preference-tuned | Original | other | 87 | 3.09 | 2024-11-18 11:36:42+00:00 | |
🟦 | 48.41 | 59.48 | 26.42 | 42.22 | 44.3 | 41.85 | 57.8 | 66.81 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:22+00:00 | |
🟦 | 45.74 | 57.52 | 24.97 | 42.29 | 41.24 | 42.85 | 42.4 | 68.94 | false | true | ? | preference-tuned | Original | other | 12 | 3.23 | 2024-12-19 06:00:40+00:00 | |
⭕ | 42.39 | 50.61 | 18.67 | 37.37 | 36.06 | 28.19 | 69 | 56.81 | false | true | ? | instruction-tuned | Original | apache-2.0 | 346 | 1.71 | 2024-11-22 10:44:37+00:00 | |
⭕ 🏥 | 41.36 | 46.38 | 35.39 | 38.78 | 31.26 | 37.14 | 31.4 | 69.15 | true | true | ? | instruction-tuned | Original | null | 42 | 8.19 | 2025-05-16 09:57:55+00:00 | |
🟦 | 39.87 | 50.65 | 22.91 | 36.19 | 33.15 | 30.81 | 48.6 | 56.81 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:08:42+00:00 | |
⭕ | 39.41 | 46.79 | 21.94 | 36.53 | 37.71 | 45.61 | 30.4 | 56.91 | false | false | ? | instruction-tuned | Original | llama3.2 | 430 | 1.24 | 2024-10-25 07:14:38+00:00 | |
🟦 | 37.97 | 45.15 | 15.27 | 35.43 | 33.23 | 28.61 | 49.6 | 58.51 | false | true | ? | preference-tuned | Original | other | 21 | 1.67 | 2024-12-19 06:01:10+00:00 | |
🟦 | 34.6 | 31.17 | 13.21 | 32.54 | 29.69 | 24.09 | 55.2 | 56.28 | false | true | ? | preference-tuned | Original | null | 0 | -1 | 2025-01-22 17:09:02+00:00 | |
🟢 | 34.56 | 38.88 | 12.24 | 29.12 | 29.93 | 18.46 | 56.4 | 56.91 | false | false | ? | pretrained | Original | apache-2.0 | 101 | 0.49 | 2024-10-22 13:46:13+00:00 | |
🟦 | 32.42 | 44.38 | 15.76 | 35.19 | 36.21 | 28.12 | 11 | 56.28 | false | false | ? | preference-tuned | Original | apache-2.0 | 100 | 0.49 | 2024-11-18 11:36:27+00:00 | |
🟦 | 31.53 | 46.94 | 19.76 | 32.44 | 33.94 | 33.22 | 11.2 | 43.19 | false | true | ? | preference-tuned | Original | apache-2.0 | 91 | 0.75 | 2025-04-29 10:46:33+00:00 | |
🟦 | 28.62 | 22.64 | 10.55 | 25.1 | 28.83 | 21.2 | 35.2 | 56.81 | false | true | ? | preference-tuned | Original | apache-2.0 | 33 | 3.32 | 2024-12-10 10:39:41+00:00 | |
🟦 | 28.33 | 22.65 | 10.79 | 24.41 | 25.92 | 19.61 | 38 | 56.91 | false | true | ? | preference-tuned | Original | apache-2.0 | 60 | 0.36 | 2024-12-10 08:36:15+00:00 |
📊 Dataset Information: This tab uses the Global MMLU dataset, filtered to the medical subcategory only (10.7%).
🟦 🏥 | 78.79 | 76.88 | 81.13 | 80.86 | 81.86 | 78.94 | 72.89 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 1373 | 72.71 | 2024-10-22 14:35:49+00:00 |
🟦 | 78.79 | 76.88 | 81.13 | 81.2 | 81.86 | 78.8 | 72.89 | false | true | ? | preference-tuned | Original | other | 343 | 72.71 | 2024-10-22 14:35:49+00:00 | |
🟦 | 78.62 | 73.42 | 80.66 | 80.86 | 80.86 | 78.94 | 77.01 | false | true | ? | preference-tuned | Original | llama3.3 | 632 | 70.55 | 2024-12-09 09:10:34+00:00 | |
🟦 🏥 | 75.29 | 69.3 | 78.07 | 77.74 | 78.54 | 76.68 | 71.43 | true | true | ? | preference-tuned | Original | llama3 | 34 | 70.55 | 2024-10-24 06:24:59+00:00 | |
⭕ | 73.03 | 68.24 | 74.68 | 74.75 | 74.35 | 73.62 | 72.56 | false | true | ? | instruction-tuned | Original | gemma | 1373 | 27.43 | 2025-05-23 10:26:40+00:00 | |
⭕ | 69.44 | 64.32 | 71.16 | 70.7 | 71.63 | 70.9 | 67.91 | false | true | ? | instruction-tuned | Original | cc-by-nc-4.0 | 66 | 32.3 | 2024-10-25 07:13:05+00:00 | |
⭕ | 47.42 | 33.02 | 55.95 | 54.49 | 52.43 | 53.22 | 35.42 | false | true | ? | instruction-tuned | Original | apache-2.0 | 1131 | 7.25 | 2024-11-14 11:38:25+00:00 |
About
The MEDIC Leaderboard evaluates large language models (LLMs) on various healthcare tasks across five key dimensions. Designed to bridge the gap between stakeholder expectations and practical clinical applications, the MEDIC framework captures the interconnected capabilities LLMs need for real-world use. Its evaluation metrics objectively measure LLM performance on benchmark tasks and map results to the MEDIC dimensions. By assessing these dimensions, MEDIC aims to determine how effective and safe LLMs are for real-world healthcare settings.

Evaluation Categories
Close-ended Questions
This category measures the accuracy of an LLM's medical knowledge by having it answer multiple-choice questions from datasets like MedQA, MedMCQA, MMLU, MMLU Pro, PubMedQA, USMLE and Toxigen.
We used EleutherAI's Evaluation Harness framework, which scores the likelihood of a model generating each proposed answer rather than evaluating the generated text directly. We modified the framework's codebase to provide more detailed and relevant results: rather than calculating only the probability of generating the answer choice labels (e.g., a., b., c., or d.), we calculate the probability of generating the full answer text.
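As a rough illustration of likelihood-based scoring over full answer texts, the sketch below picks the choice with the highest log-likelihood. It is not the Evaluation Harness code: `pick_answer`, `toy_loglik`, and the numbers in `fake_scores` are hypothetical stand-ins for a real model's log-probabilities.

```python
from typing import Callable, Sequence


def pick_answer(
    question: str,
    choices: Sequence[str],
    loglik: Callable[[str, str], float],
) -> int:
    """Return the index of the choice whose full text is most likely.

    `loglik(context, continuation)` is assumed to return the model's
    log-likelihood of `continuation` given `context`.
    """
    scores = [loglik(question, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


# Hypothetical log-likelihoods standing in for a real language model.
fake_scores = {
    "Vitamin A": -9.1,
    "Vitamin C": -2.3,
    "Vitamin D": -7.8,
    "Vitamin K": -8.4,
}


def toy_loglik(context: str, continuation: str) -> float:
    # A real implementation would sum token log-probabilities here.
    return fake_scores[continuation]


question = "Which vitamin deficiency causes scurvy?"
choices = list(fake_scores)
best = pick_answer(question, choices, toy_loglik)
print(choices[best])
```

Scoring the full answer text instead of a label such as "b." means the ranking depends on the content of each option, not on how the model treats option letters.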
Open-ended Questions
This category assesses the quality of the LLM's reasoning and explanations. The LLM is tasked with answering open-ended medical questions from various datasets:
Each question is presented to the models without special prompting to test their baseline capabilities. To compare models, we use a tournament-style approach. A judge (Llama 3.1 70B Instruct) evaluates pairs of responses to the same question from different models. To reduce position bias, each comparison is performed twice with the response positions reversed. If the winner changes when the positions are swapped, we consider the responses too close to call and declare a tie. After multiple comparisons, we calculate win rates and convert them to Elo ratings to rank the models. Note that this evaluation assesses only the quality of the written response, not its medical accuracy. Properly evaluating clinical accuracy would require a thorough study involving healthcare professionals.
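The position-swap rule and a rating update can be sketched as follows. This is a minimal illustration, not the leaderboard's implementation: the judge is a toy function, and the sequential Elo update, the K-factor of 32, and the starting rating of 1000 are assumptions, since the exact conversion from win rates to Elo is not described here.

```python
from typing import Callable, Dict

# A judge takes (question, first_response, second_response) and returns
# "first" or "second". Here it is a hypothetical stand-in for the LLM judge.
Judge = Callable[[str, str, str], str]


def compare_debiased(judge: Judge, question: str, resp_a: str, resp_b: str) -> float:
    """Score for A: 1.0 = win, 0.0 = loss, 0.5 = tie (verdict flipped on swap)."""
    a_wins_shown_first = judge(question, resp_a, resp_b) == "first"
    a_wins_shown_second = judge(question, resp_b, resp_a) == "second"
    if a_wins_shown_first and a_wins_shown_second:
        return 1.0
    if not a_wins_shown_first and not a_wins_shown_second:
        return 0.0
    return 0.5  # the winner changed with position, so call it a tie


def update_elo(ratings: Dict[str, float], a: str, b: str, score_a: float, k: float = 32.0) -> None:
    """Standard Elo update in place (k=32 is an assumed K-factor)."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
    delta = k * (score_a - expected_a)
    ratings[a] += delta
    ratings[b] -= delta


def longer_is_better(question: str, first: str, second: str) -> str:
    # Toy judge with no position bias: prefers the longer response.
    return "first" if len(first) >= len(second) else "second"


def always_first(question: str, first: str, second: str) -> str:
    # Toy judge with pure position bias: always prefers the first slot.
    return "first"


ratings = {"model_a": 1000.0, "model_b": 1000.0}
score = compare_debiased(longer_is_better, "q", "a long detailed answer", "short")
biased_score = compare_debiased(always_first, "q", "a long detailed answer", "short")
update_elo(ratings, "model_a", "model_b", score)
print(score, biased_score, ratings)
```

With the position-biased judge the two verdicts disagree, so the comparison is recorded as a tie instead of rewarding whichever response happened to be shown first.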
Medical Safety
The Medical Safety category uses the "med-safety" benchmark dataset, which consists of 4500 scenarios presenting harmful medical requests. These scenarios cover all nine principles of medical ethics as defined by the American Medical Association (AMA). In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
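One plausible way to aggregate the judge's scores is a mean per ethics principle plus an overall mean, sketched below. The aggregation method, the principle labels, and the example scores are assumptions for illustration; the text above does not specify how the per-response scores are combined.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def harmfulness_summary(
    judged: List[Tuple[str, float]],
) -> Tuple[float, Dict[str, float]]:
    """Aggregate (principle, harmfulness score) pairs.

    Returns the overall mean and the mean per principle. Lower is safer,
    since a low score means the model refused the harmful request.
    """
    by_principle: Dict[str, List[float]] = defaultdict(list)
    for principle, score in judged:
        by_principle[principle].append(score)
    per_principle = {p: mean(scores) for p, scores in by_principle.items()}
    overall = mean(score for _, score in judged)
    return overall, per_principle


# Hypothetical judged responses: (principle label, harmfulness score).
judged = [
    ("principle_1", 1.0),
    ("principle_1", 2.0),
    ("principle_2", 1.0),
    ("principle_2", 4.0),
]
overall, per_principle = harmfulness_summary(judged)
print(overall, per_principle)
```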
Medical Summarization
This category evaluates the LLM's ability to summarize medical texts, with a focus on clinical trial descriptions from ClinicalTrials.gov. The dataset consists of 1629 carefully selected clinical trial protocols with detailed study descriptions (3000-8000 tokens long). The task is to generate concise and accurate summaries of these protocols.
It uses a novel "cross-examination" framework, in which questions are generated from both the original document and the LLM's summary and are then used to score the summary. The four key scores are:
- Coverage: Measures how thoroughly the summary covers the original document. A higher score means the summary includes more details from the original.
- Conformity: Also called the non-contradiction score, this checks if the summary avoids contradicting the original document. A higher score means the summary aligns better with the original.
- Consistency: Measures the level of non-hallucination, or how much the summary sticks to the facts in the document. A higher score means the summary is more factual and accurate.
- Conciseness: Measures how brief the summary is. A higher score means the summary is more concise. A negative score means the summary is longer than the original document.
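The four scores above could be computed along the following lines. This is a sketch under stated assumptions, not the framework's actual formulas: it assumes each generated question is answered "yes", "no", or "idk" against the other text, treats "idk" as missing information and "no" as a contradiction, and measures conciseness as the relative reduction in word count.

```python
from typing import Dict, Sequence


def share_not(answers: Sequence[str], bad: str) -> float:
    """Fraction of answers that are not equal to `bad`."""
    return sum(a != bad for a in answers) / len(answers)


def cross_exam_scores(
    doc_questions_vs_summary: Sequence[str],
    summary_questions_vs_doc: Sequence[str],
    document: str,
    summary: str,
) -> Dict[str, float]:
    """Assumed scoring rules for the four cross-examination scores.

    doc_questions_vs_summary: answers to document-derived questions,
        given only the summary.
    summary_questions_vs_doc: answers to summary-derived questions,
        given only the document.
    """
    return {
        # Document facts the summary can answer at all.
        "coverage": share_not(doc_questions_vs_summary, "idk"),
        # Document facts the summary does not contradict.
        "conformity": share_not(doc_questions_vs_summary, "no"),
        # Summary facts that the document supports (non-hallucination).
        "consistency": share_not(summary_questions_vs_doc, "idk"),
        # Negative when the summary is longer than the document.
        "conciseness": 1.0 - len(summary.split()) / len(document.split()),
    }


scores = cross_exam_scores(
    doc_questions_vs_summary=["yes", "yes", "idk", "no"],
    summary_questions_vs_doc=["yes", "idk"],
    document="one two three four five six seven eight",
    summary="one two",
)
print(scores)
```

Under these assumptions a summary longer than its source yields a negative conciseness, which matches the behaviour described in the list.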
Note Generation
This category assesses the LLM's ability to generate structured clinical notes from doctor-patient conversations. It uses the same cross-examination framework as Medical Summarization across two datasets:
ACI-Bench: A comprehensive collection designed specifically for benchmarking clinical note generation from doctor-patient dialogues. The dataset contains patient visit notes that have been validated by expert medical scribes and physicians.
SOAP Notes: Uses the test split of the ChartNote dataset, which contains 250 synthetic patient-doctor conversations generated from real clinical notes. The task involves generating notes in the SOAP format with the following sections:
- Subjective: Patient's description of symptoms, medical history, and personal experiences
- Objective: Observable data like physical exam findings, vital signs, and diagnostic test results
- Assessment: Healthcare provider's diagnosis based on subjective and objective information
- Plan: Treatment plan including medications, therapies, follow-ups, and referrals
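The four SOAP sections map naturally onto a small record type. The sketch below is illustrative only: the `SoapNote` class and its helper methods are hypothetical and are not part of the benchmark, and the example note is invented.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class SoapNote:
    """A clinical note in SOAP format (hypothetical helper type)."""

    subjective: str  # patient's description of symptoms and history
    objective: str   # exam findings, vital signs, test results
    assessment: str  # provider's diagnosis
    plan: str        # treatment, follow-ups, referrals

    def missing_sections(self) -> List[str]:
        """Names of sections that are empty or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the note as plain text, one labelled section per line."""
        return "\n".join(
            f"{f.name.capitalize()}: {getattr(self, f.name)}" for f in fields(self)
        )


note = SoapNote(
    subjective="Patient reports a dry cough for three days.",
    objective="Temperature 37.2 C, lungs clear on auscultation.",
    assessment="Likely viral upper respiratory infection.",
    plan="",
)
print(note.render())
print(note.missing_sections())
```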
Currently, the benchmark supports evaluation of decoder-type models hosted on the Hugging Face Hub. It does not support adapter models yet, but we will add support for them soon.
Submission Guide for the MEDIC Benchmark
First Steps Before Submitting a Model
1. Ensure Your Model Loads with AutoClasses
Verify that you can load your model and tokenizer using AutoClasses:
from transformers import AutoConfig, AutoModel, AutoTokenizer

model_name = "your model name"  # replace with your Hub repository ID
revision = "main"  # branch, tag, or commit hash you plan to submit
config = AutoConfig.from_pretrained(model_name, revision=revision)
model = AutoModel.from_pretrained(model_name, revision=revision)
tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
Note:
- If this step fails, debug your model before submitting.
- Ensure your model is public.
2. Convert Weights to Safetensors
Safetensors is a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the Extended Viewer!
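A minimal sketch of one way to convert a checkpoint, assuming the model loads with `AutoModelForCausalLM` and that `transformers` and `safetensors` are installed. The function name `convert_to_safetensors` and its arguments are illustrative; `safe_serialization=True` is the `save_pretrained` option that writes `.safetensors` files.

```python
def convert_to_safetensors(model_id: str, out_dir: str, revision: str = "main") -> None:
    """Load a checkpoint and re-save its weights in safetensors format."""
    # Imported lazily so this sketch can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

    # safe_serialization=True writes *.safetensors instead of pickle-based *.bin files.
    model.save_pretrained(out_dir, safe_serialization=True)
    tokenizer.save_pretrained(out_dir)


# Example call (downloads the model, so it is left commented out):
# convert_to_safetensors("your model name", "./converted-model")
```

The converted folder can then be uploaded to the Hub in place of the original weights.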
3. Complete Your Model Card
When we add extra information about models to the leaderboard, it will be taken automatically from the model card.
4. Select the correct model type
Choose the correct model category from the options below:
- 🟢 pretrained models: new base models trained on a given text corpus using masked modelling, or new base models continuously trained on a further corpus (which may include IFT/chat data) using masked modelling.
- ⭕ fine-tuned models: pretrained models fine-tuned on more data or tasks.
- 🟦 preference-tuned models: chat-like fine-tunes, using IFT (datasets of task instructions), RLHF, or DPO (changing the model loss slightly with an added policy), etc.
5. Select Correct Precision
Choose the right precision to avoid evaluation errors:
- Not all models convert properly from float16 to bfloat16.
- Incorrect precision can cause issues (e.g., loading a bf16 model in fp16 may generate NaNs).
- If you have selected auto, the precision specified under torch_dtype in the model config will be used.
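To see which precision the auto option would pick up, you can read `torch_dtype` from the model's `config.json` before submitting. The sketch below is an illustration under stated assumptions, not the leaderboard's code: the function name `resolve_precision` is hypothetical, and it returns `None` when the config has no `torch_dtype` entry instead of guessing a default.

```python
import json
from typing import Optional


def resolve_precision(selected: str, config_text: str) -> Optional[str]:
    """Return the precision that would be used for a submission.

    `selected` is the precision chosen in the form (for example "auto",
    "float16", "bfloat16"); `config_text` is the raw content of config.json.
    """
    if selected != "auto":
        return selected
    config = json.loads(config_text)
    # None signals that the config does not declare a dtype.
    return config.get("torch_dtype")


if __name__ == "__main__":
    print(resolve_precision("auto", '{"torch_dtype": "bfloat16"}'))
```

If this returns `None`, or a dtype different from the one the model was trained in, selecting the precision explicitly is the safer choice.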
6. Medically Oriented Model
If the model has been built specifically for the medical domain, i.e. pretrained or fine-tuned on a significant amount of medical data, make sure to check the Domain specific checkbox.
7. Chat Template
Select this option if your model uses a chat template. The chat template will be used during evaluation.
- Before submitting, make sure the chat template is defined in tokenizer config.
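A quick way to confirm this is to look for a `chat_template` entry in `tokenizer_config.json`. The sketch below assumes the file content has been read into a string; the helper name `has_chat_template` is hypothetical. Some checkpoints may store the template in a separate file instead, in which case this check would report a false negative.

```python
import json


def has_chat_template(tokenizer_config_text: str) -> bool:
    """Return True if tokenizer_config.json defines a non-empty chat template."""
    config = json.loads(tokenizer_config_text)
    # The entry may be a single template string or a list of named templates.
    return bool(config.get("chat_template"))


if __name__ == "__main__":
    sample = '{"chat_template": "{% for m in messages %}{{ m.content }}{% endfor %}"}'
    print(has_chat_template(sample))
```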
Upon successful submission of your request, your model's results will be updated on the leaderboard within 5 working days!