model self-improve. In the following sections, we detail important designs within our method, along
with additional approaches for the model to self-improve without supervised data.
3.1 GENERATING AND FILTERING MULTIPLE REASONING PATHS
Self-consistency (Wang et al., 2022b) brings large improvements on reasoning tasks (e.g., 56.5% → 74.4% on the GSM8K test set), and the gap between greedy decoding and diverse decoding shows there is potential for further improving the reasoning ability of $M$ by using the self-selected high-confidence reasoning paths as training data.
Question: Stefan goes to a restaurant with his family. They order an appetizer that costs $10 and 4 entrees that are $20 each. If they tip 20% of the total, what is the total amount of money that they spend?

Multiple Path Decoding:

Output 1: The appetizer costs $10. The entrees cost $20 each so in total 4 * $20 = $80. This means the total cost is $10 + $80 = $90. They tip 20% of it, so the total amount they spend is $90 * 1.2 = $108. The answer is 108.

Output 2: The appetizer costs $10 and the entrees are $20 each. There are 4 entrees so the sum is $20 * 4 = $80. The waiter gets 20% of the total. 20% of $80 is $80 * .2 = $16. The answer is $80 + $16 = $96. (Incorrect reasoning path)

Output 3: The appetizer costs $10. The entrees cost 4 * $20 = $80. The tip is 20% of the total, so it is 20% of the $90 they have spent. The tip is 0.2 * 90 = $18. The total they spent is $90 + $18 = $108. The answer is 108.

Table 1: Examples of 3 self-generated CoT reasoning paths given a question. Outputs 1 and 3 are the most consistent reasoning paths based on majority voting and are kept as self-training data.
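To make the decoding step concrete, the following is a minimal sketch of multiple-path decoding in Python. The `generate` callable, the prompt format, and the default values of `m` and `temperature` are illustrative assumptions, not details specified by this section.

```python
# Minimal sketch of multiple-path decoding. `generate` is a hypothetical
# callable that returns one sampled completion for a prompt; the few-shot
# CoT prompt format follows Wei et al. (2022b).

def sample_cot_paths(generate, few_shot_prompt, question, m=32, temperature=0.7):
    """Sample m diverse CoT reasoning paths for a single question."""
    prompt = f"{few_shot_prompt}\n\nQ: {question}\nA:"
    # Temperature sampling (rather than greedy decoding) produces the
    # diverse reasoning paths that self-consistency aggregates over.
    return [generate(prompt, temperature=temperature) for _ in range(m)]
```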
For each training question $x_i$, we sample $m$ CoT reasoning paths, denoted as $\{r_{i_1}, r_{i_2}, \ldots, r_{i_m}\}$ (see Table 1 for examples). Since $M$ is prompted with the CoT examples from Wei et al. (2022b), we apply the same output parsing with "The answer is" to generate their predicted answers $\{y_{i_1}, y_{i_2}, \ldots, y_{i_m}\}$. The most consistent answer, which is not necessarily a correct answer, is selected by majority voting, denoted as $\tilde{y}_i = \arg\max_{y_{i_j}} \sum_{k=1}^{m} \mathbb{I}(y_{i_j} = y_{i_k})$. For all the training questions, we filter the CoT reasoning paths that reach $\tilde{y}_i$ as the final answer to be put into the self-training data, denoted as $D_{\text{self-consistent}} = \{x_i, \tilde{r}_i\}$, where $\tilde{r}_i = \{r_{i_j} \mid 1 \le j \le m,\; y_{i_j} = \tilde{y}_i\}$.
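A short sketch of this majority-voting filter follows; the answer-parsing regex and the helper names are assumptions for illustration, not code from the paper.

```python
import re
from collections import Counter

def parse_answer(path: str):
    """Extract the prediction following the 'The answer is' pattern (Wei et al., 2022b)."""
    match = re.search(r"The answer is (.+?)\.?\s*$", path.strip())
    return match.group(1) if match else None

def filter_self_consistent(question: str, paths: list[str]):
    """Keep only the paths whose parsed answer equals the majority-voted answer."""
    answers = [parse_answer(r) for r in paths]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None  # no parsable answer; skip this question
    y_tilde, _ = votes.most_common(1)[0]  # the most consistent answer
    r_tilde = [r for r, y in zip(paths, answers) if y == y_tilde]
    return {"question": question, "paths": r_tilde, "answer": y_tilde}
```

Note that no ground-truth label is consulted: the filter keeps whichever answer the sampled paths agree on most often, correct or not.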
Figure 2: The relation of accuracy and confidence of the majority-voted answer after multiple path decoding on GSM8K training-set questions (x-axis: confidence; y-axis: accuracy; circle area/color: number of questions). Predicted confidence from self-consistency (Wang et al., 2022b) is well calibrated (Guo et al., 2017).
Since we do not use any ground-truth labels to filter out cases where $\tilde{y}_i \neq y_i$, it is important that the self-generated CoT reasoning paths are mostly reliable and that incorrect answers do not hurt the self-improvement of the model. We plot the relation between the accuracy and the confidence of self-generated CoT paths for each question in the GSM8K training set in Fig. 2. The confidence is the number of CoT paths leading to $\tilde{y}_i$ divided by the total number of paths $m$. The y-axis shows the accuracy of $\tilde{y}_i$ at a given confidence, and the circle area and color darkness show the number of questions at that confidence. We observe that confident answers are more likely to be correct: when a question has many consistent CoT paths, the corresponding $\tilde{y}_i$ is more likely to be correct. Conversely, when $\tilde{y}_i$ is wrong, it tends to be supported by fewer CoT paths, and thus introduces little noise into the training samples.
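The confidence statistic plotted in Fig. 2 can be computed directly from the same parsed answers; a minimal sketch, reusing the hypothetical `parse_answer` helper from above:

```python
def vote_confidence(paths, y_tilde):
    """Confidence of the majority answer: fraction of the m paths that reach it."""
    answers = [parse_answer(r) for r in paths]
    return sum(y == y_tilde for y in answers) / len(paths)
```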