Team # 2318982 Page 5 of 25
Table 1: Notations
Symbols                     Description
$A_i$                       The set of states that are reachable in one step from state $i$.
$S$                         The state space of the Markov chain.
$W$                         The set of all words a player may fill in.
$p_x$                       The subjective probability that word $x$ is the correct answer.
$freq_x$                    The word frequency of word $x$.
$I_x$                       The amount of information obtained by filling in the word $x$ at the opening.
$x_{true}^{(r)}$            The correct word on day $r$.
$G_i$                       The set of words that the player has guessed when in state $i$.
$p_k^{(r)}(i, j)$           The transition probability from state $i$ to state $j$ in the Markov chain on day $r$.
$T_j^{(r)}$                 The number of steps to first reach state $j$ from state $i$ in the Markov chain on day $r$.
$C_k^{(r)}$                 The set of absorbing states of the Markov chain on day $r$.
$T_{absorbed}^{(r)}$        The number of steps before falling into an absorbing state in the Markov chain on day $r$.
$q_k(r)$                    The proportion of all players using strategy $k$ on day $r$.
Table 1 defines the main parameters; the specific values of those parameters will be given later.
3 Data Preprocessing
Since we are only allowed to use the dataset “Problem_C_Data_Wordle.csv” provided by COMAP,
we need to pre-process it before solving the problem. An initial inspection of the
dataset showed that there are some outliers and missing values.
• In the word column, we find that the lengths of some words are not equal to five, such as
“rprobe”, “clen” and “tash”. As noted by COMAP, in line 18, for contest 545,
the word listed is “rprobe” while it should be “probe”. By looking up the solution word of
the day published by Wordle, we also find that “clen” should be “clean” and “tash” should be
“trash”.
• Additionally, in line 34, for contest 529, the number of reported results listed is “2569”, while
the correct number should be “25569”.
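The corrections above can be sketched in a few lines of Python. This is only a minimal illustration, not the exact script we used: the field names (contest, word, results) are assumed rather than taken from the CSV header, and the sample rows are placeholders apart from the values quoted in the text.

```python
# Corrections stated in the text: misspelled word -> solution word.
WORD_FIXES = {"rprobe": "probe", "clen": "clean", "tash": "trash"}
# Correction stated in the text: contest number -> number of reported results.
RESULT_FIXES = {529: 25569}


def clean_rows(rows):
    """Return corrected copies of the rows and the rows that still fail validation."""
    cleaned, invalid = [], []
    for row in rows:
        fixed = dict(row)  # copy so the input rows are left untouched
        word = fixed["word"].strip().lower()
        fixed["word"] = WORD_FIXES.get(word, word)
        fixed["results"] = RESULT_FIXES.get(fixed["contest"], fixed["results"])
        cleaned.append(fixed)
        # A valid Wordle solution has exactly five letters.
        if len(fixed["word"]) != 5 or not fixed["word"].isalpha():
            invalid.append(fixed)
    return cleaned, invalid


# Placeholder rows: field names are assumptions; a "results" of 0 and the
# word "hello" are dummies, and contest numbers not given in the text are None.
sample = [
    {"contest": 545, "word": "rprobe", "results": 0},
    {"contest": None, "word": "clen", "results": 0},
    {"contest": None, "word": "tash", "results": 0},
    {"contest": 529, "word": "hello", "results": 2569},
]

cleaned, invalid = clean_rows(sample)
for row in cleaned:
    print(row["contest"], row["word"], row["results"])
```

Any row left in invalid after this step would need to be checked by hand against the published solution words.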
4 Task 1: Number Prediction and Word Attributes
In this section, we predict the number of reported results on March 1, 2023 by building an
ARIMA model and choosing the optimal parameters. We then summarize the word attributes and
explore their effect on the percentage of scores reported in the difficulty model
by building a multiple linear regression.
4.1 Number Prediction Based on ARIMA Model
The autoregressive integrated moving average (ARIMA) model is a statistical analysis
model that uses time-series data to predict future trends. The basic idea of ARIMA is that
the data sequence formed by the predicted quantity over time is regarded as a random sequence and a