Team # 2314151 Page 4 of 24
2.2 Notations
Table 1: Notations
Symbol    Definition
s_j       Timestamp
k         Growth rate
δ_j       The amount of change in the growth rate on the timestamp
m         Offset amount
ϵ         Error term
N         Number of cycles in the seasonality model
D_i       Period before and after a holiday
κ_i       Range of holiday effects
P         Significance level
3 Data Processing
3.1 Data Cleaning
The Topic C dataset reports on the use of Wordle over the past year. However, we found a number of
dirty entries in it.
Table 2: Dirty data
Contest number  Word   Number of reported results  Number in hard mode  1 try  2 tries  3 tries  4 tries  5 tries  6 tries  7 or more tries (X)
525             clen   26381                       2424                 1      17       36       31       12       3        0
314             tash   106652                      7001                 2      19       34       27       13       4        1
540             naïve  21947                       2075                 1      7        24       32       24       11       1
473             marxh  30935                       2885                 0      9        30       35       19       6        1
207             favor  137586                      3073                 1      4        15       26       29       21       4
In the data shown above, the words numbered 525 and 314 are not valid for the game because they
are only four letters long, so we inferred that a letter was dropped when the dataset was entered. To
fix this, we used artificial intelligence algorithms to find the most similar valid words and used
those instead. The word numbered 540 contains a misspelled letter ("ï" instead of "i") and should
be "naive." We searched the word database and found that the word "marxh," numbered 473, does not
exist. We then compared it with similarly spelled words in the database and concluded that the correct
spelling should be "marsh." The word numbered 207 has an extra space in the input, so it is also an
outlier. Deleting the extra space gives the correct data.
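
The checks described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: WORD_LIST is a tiny hypothetical stand-in for a real word database, the function names are invented for this example, and difflib's string similarity is swapped in for the artificial intelligence algorithms mentioned above, whose details are not specified here.

```python
import difflib
import unicodedata

# Hypothetical stand-in for a real five-letter word database (assumption).
WORD_LIST = ["marsh", "naive", "favor", "study", "train"]


def normalize(raw):
    """Strip whitespace, lowercase, and drop diacritics ('naïve' -> 'naive')."""
    text = unicodedata.normalize("NFKD", raw.strip().lower())
    return "".join(ch for ch in text if not unicodedata.combining(ch))


def check_word(raw, word_list=WORD_LIST):
    """Return (cleaned word, list of issues found in the raw entry)."""
    issues = []
    word = normalize(raw)
    if word != raw:
        issues.append("whitespace, case, or accent normalized")
    if len(word) != 5:
        issues.append("length is not 5")
    elif word not in word_list:
        issues.append("not in word list")
    return word, issues


def suggest(word, word_list=WORD_LIST):
    """Closest entry in word_list by difflib similarity, or None if none is close."""
    matches = difflib.get_close_matches(word, word_list, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

With this hypothetical word list, check_word("clen") flags the length problem, check_word("favor ") returns the entry with the extra space removed, and suggest("marxh") returns "marsh". A suggestion for a four-letter entry such as "clen" depends entirely on the word database used, so any replacement should be reviewed by hand.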
3.2 Outlier rejection and standardization
We use the 68–95–99.7 rule (3σ criterion) to screen and reject outliers [2]. We found an anomaly
in the Number of reported results data for the word 'study' on 2022/11/30, and we zeroed it to bring it