| # | Model | Size | Notes | SimpShort PoT | SimpShort CoT | CompShort PoT | CompShort CoT | SimpLong PoT | SimpLong CoT | CompLong PoT | CompLong CoT | AVG PoT | AVG CoT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| - | Human Expert | - | - | 91.0 | 91.0 | 87.0 | 87.0 | 84.0 | 84.0 | 76.0 | 76.0 | - | - |
| 1 | GPT-4o | | | 84.0 | 86.0 | 69.5 | 76.5 | 56.0 | 64.0 | 41.0 | 36.7 | 60.8 | 62.4 |
| 2 | GPT-4-Turbo | | | 85.5 | 82.5 | 80.0 | 81.0 | 56.0 | 53.0 | 38.7 | 38.3 | 62.9 | 61.9 |
| 3 | Claude-3-Opus | | | 80.5 | 79.5 | 73.5 | 77.5 | 51.0 | 61.0 | 42.0 | 39.7 | 60.6 | 61.8 |
| 4 | DeepSeek-V2 | 236B | MoE | 87.0 | 82.0 | 75.5 | 69.5 | 61.0 | 56.0 | 43.0 | 39.7 | 64.4 | 59.8 |
| 5 | Mistral-Large | 123B | | 85.0 | 83.5 | 76.5 | 81.0 | 56.0 | 55.0 | 41.0 | 31.3 | 62.8 | 59.7 |
| 6 | Claude-3.5-Sonnet | | | 78.0 | 77.0 | 76.0 | 69.5 | 54.0 | 61.0 | 44.0 | 40.0 | 61.8 | 59.2 |
| 7 | Claude-3-Sonnet | | | 82.5 | 80.0 | 80.5 | 73.0 | 55.0 | 56.0 | 40.3 | 35.3 | 62.7 | 58.5 |
| 8 | DeepSeek-Coder-V2 | 236B | Code | 85.0 | 79.0 | 78.0 | 66.5 | 56.0 | 54.0 | 41.0 | 37.7 | 63.1 | 57.3 |
| 9 | Gemini-1.5-Flash | | | 85.0 | 78.0 | 78.5 | 69.5 | 55.0 | 46.0 | 40.0 | 31.7 | 62.8 | 54.5 |
| 10 | Llama-3.1 | 70B | | 74.5 | 76.5 | 68.0 | 71.0 | 53.0 | 50.0 | 34.7 | 29.3 | 55.3 | 54.1 |
| 11 | Gemini-1.5-Pro | | | 85.5 | 80.5 | 80.0 | 58.0 | 58.0 | 55.0 | 40.3 | 30.0 | 63.7 | 52.8 |
| 12 | Claude-3-Haiku | | | 74.5 | 79.0 | 71.5 | 58.5 | 55.0 | 50.0 | 36.7 | 31.7 | 57.1 | 52.5 |
| 13 | Qwen2 | 72B | | 26.5 | 74.0 | 24.5 | 72.5 | 8.0 | 45.0 | 7.0 | 27.0 | 16.4 | 52.4 |
| 14 | GPT-4o-Mini | | | 88.5 | 69.5 | 77.0 | 69.5 | 53.0 | 56.0 | 38.7 | 28.0 | 62.5 | 52.2 |
| 15 | Llama-3 | 70B | | 84.5 | 73.5 | 64.0 | 63.5 | 52.0 | 42.0 | 41.0 | 28.3 | 59.0 | 50.1 |
| 16 | Mixtral-8x22B | 141B | MoE | 30.0 | 74.0 | 21.5 | 57.0 | 25.0 | 47.0 | 14.7 | 24.0 | 21.5 | 47.6 |
| 17 | Gemma-2 | 9B | | 79.0 | 66.5 | 65.0 | 54.5 | 50.0 | 39.0 | 24.3 | 17.7 | 51.4 | 41.8 |
| 18 | DeepSeek-Coder-V2-Lite | 16B | Code | 66.0 | 67.5 | 51.0 | 53.5 | 27.0 | 30.0 | 22.0 | 20.3 | 40.9 | 41.6 |
| 19 | WizardLM-2 | 141B | MoE | 62.5 | 60.5 | 56.5 | 55.5 | 25.0 | 34.0 | 17.7 | 18.0 | 39.5 | 40.0 |
| 20 | C4AI Command R+ | 104B | | 35.5 | 65.5 | 39.0 | 51.0 | 19.0 | 31.0 | 8.7 | 18.3 | 24.3 | 39.9 |
| 21 | Yi-1.5 | 9B | | 18.0 | 68.5 | 24.5 | 56.0 | 2.0 | 14.0 | 4.0 | 14.0 | 12.4 | 38.1 |
| 22 | Yi-1.5 | 34B | | 0.5 | 64.5 | 1.0 | 53.0 | 0.0 | 14.0 | 0.0 | 15.3 | 0.4 | 36.9 |
| 23 | Mistral-Nemo | 12B | | 52.5 | 59.5 | 37.5 | 44.0 | 28.0 | 37.0 | 15.3 | 16.7 | 31.7 | 36.8 |
| 24 | Llama-3.1 | 8B | | 62.0 | 60.0 | 44.0 | 42.5 | 32.0 | 33.0 | 19.0 | 14.3 | 37.6 | 35.1 |
| 25 | DBRX | 132B | MoE | 41.0 | 57.0 | 29.5 | 43.0 | 32.0 | 30.0 | 12.0 | 16.3 | 26.1 | 34.9 |
| 26 | GPT-3.5-Turbo | | | 71.0 | 60.5 | 52.5 | 39.0 | 41.0 | 28.0 | 28.7 | 15.0 | 46.8 | 34.0 |
| 27 | Codestral | 22B | Code | 39.0 | 51.5 | 38.5 | 41.5 | 18.0 | 23.0 | 17.3 | 13.0 | 28.1 | 31.0 |
| 28 | Llama-3 | 8B | | 49.5 | 56.5 | 21.5 | 31.0 | 24.0 | 29.0 | 10.0 | 12.3 | 24.5 | 30.1 |
| 29 | Qwen2 | 7B | | 13.0 | 56.0 | 9.5 | 33.0 | 4.0 | 31.0 | 2.3 | 10.0 | 7.0 | 29.9 |
| 30 | Mathstral | 7B | Math | 43.5 | 55.0 | 32.5 | 35.0 | 10.0 | 23.0 | 11.3 | 11.7 | 24.5 | 29.8 |
| 31 | GLM-4 | 9B | | 69.5 | 44.0 | 53.5 | 34.0 | 33.0 | 20.0 | 17.7 | 8.7 | 41.5 | 25.3 |
| 32 | Aya-23 | 35B | | 1.5 | 44.0 | 1.0 | 25.5 | 0.0 | 20.0 | 0.0 | 11.7 | 0.6 | 24.3 |
| 33 | DeepSeek-V2-Lite | 16B | MoE | 7.0 | 45.5 | 3.5 | 18.0 | 1.0 | 17.0 | 1.0 | 10.3 | 3.1 | 21.9 |
| 34 | Mixtral-8x7B-v0.1 | 46B | MoE | 0.5 | 39.0 | 2.0 | 17.0 | 0.0 | 25.0 | 0.0 | 12.7 | 0.6 | 21.9 |
| 35 | DeepSeek-Math | 7B | Math | 2.0 | 46.0 | 1.0 | 27.0 | 1.0 | 4.0 | 0.3 | 8.0 | 1.0 | 21.8 |
| 36 | Llama-2 | 70B | | 32.5 | 43.5 | 16.5 | 25.0 | 1.0 | 8.0 | 2.0 | 7.0 | 13.1 | 20.8 |
| 37 | WizardLM-2 | 7B | | 47.0 | 42.0 | 30.5 | 28.5 | 5.0 | 6.0 | 7.3 | 5.7 | 22.7 | 20.5 |
| 38 | Mistral-v0.3 | 7B | | 49.5 | 40.0 | 40.5 | 28.0 | 25.0 | 9.0 | 11.3 | 5.7 | 29.9 | 20.3 |
| 39 | WizardMath | 7B | Math | 22.5 | 32.0 | 12.0 | 22.5 | 6.0 | 7.0 | 3.7 | 3.3 | 10.8 | 15.7 |
| 40 | InternLM2-Math-Plus | 7B | Math | 28.5 | 27.5 | 15.0 | 14.0 | 7.0 | 9.0 | 4.7 | 4.0 | 13.5 | 13.0 |
| 41 | StarCoder2 | 15B | Code | 47.5 | 21.0 | 34.0 | 15.5 | 11.0 | 6.0 | 8.3 | 4.3 | 24.9 | 11.5 |
| 42 | InternLM2 | 7B | | 18.0 | 20.0 | 4.5 | 11.0 | 9.0 | 10.0 | 2.7 | 2.3 | 7.8 | 9.9 |
| 43 | Gemma-1 | 7B | | 1.0 | 20.0 | 0.0 | 7.5 | 0.0 | 7.0 | 0.0 | 3.3 | 0.2 | 9.0 |
| 44 | Llama-2 | 7B | | 4.0 | 17.0 | 4.0 | 11.5 | 0.0 | 2.0 | 1.3 | 2.7 | 2.5 | 8.4 |
| 45 | DeepSeek-Coder-V1 | 33B | Code | 19.0 | 18.5 | 8.5 | 8.5 | 2.0 | 2.0 | 3.7 | 1.7 | 8.5 | 7.6 |
| 46 | WizardCoder | 33B | Code | 32.5 | 16.0 | 17.5 | 8.0 | 5.0 | 2.0 | 5.0 | 1.0 | 15.0 | 6.6 |
| 47 | Aya-23 | 8B | | 1.0 | 13.0 | 0.0 | 9.0 | 0.0 | 2.0 | 0.3 | 2.3 | 0.4 | 6.6 |
| 48 | Gemma-1 | 2B | | 4.0 | 8.0 | 1.5 | 7.5 | 0.0 | 2.0 | 0.0 | 0.0 | 1.4 | 4.1 |
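
The AVG columns do not match a plain mean of the four subset scores. The table does not say how AVG is computed, but the reported values appear consistent with a weighted mean using weights 2 : 2 : 1 : 3 for SimpShort, CompShort, SimpLong, and CompLong. These weights are an assumption inferred from the numbers themselves, not something the table states. A minimal Python sketch for checking a row under that assumption (the names `WEIGHTS` and `weighted_avg` are illustrative):

```python
# Assumed subset weights, inferred from the reported numbers; the table
# does not state how the AVG columns are computed.
WEIGHTS = {"SimpShort": 2, "CompShort": 2, "SimpLong": 1, "CompLong": 3}


def weighted_avg(scores):
    """Weighted mean of the four subset scores under the assumed weights."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Two rows copied from the table.
gpt4o_pot = {"SimpShort": 84.0, "CompShort": 69.5, "SimpLong": 56.0, "CompLong": 41.0}
deepseek_v2_cot = {"SimpShort": 82.0, "CompShort": 69.5, "SimpLong": 56.0, "CompLong": 39.7}

print(f"GPT-4o PoT:      {weighted_avg(gpt4o_pot):.2f} (table reports 60.8)")
print(f"DeepSeek-V2 CoT: {weighted_avg(deepseek_v2_cot):.2f} (table reports 59.8)")
```

Because the subset scores are themselves rounded to one decimal place, a recomputed average can differ from the reported one by a few hundredths, so compare with a small tolerance and not for exact equality.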