Team # 2301192 Page 5
Table 2: Notations of Word Attributes Used in the Paper
Symbols Definition
Freq Word Frequency
SLF the Sum of Letter Frequencies
BU the Breadth of Usage of a Word
NDLW the Number of Different Letters in a Word
a-z the Number of Letters from a to z in a Word
3 Model 1: Integration of Interpretation and Prediction Model Based on Prophet and SIRS
3.1 Data Preprocessing and Exploratory Analysis
3.1.1 Data Collection and Pre-processing
In addressing Task 1, it is indispensable to analyze the word attributes related to the problem and to collect relevant data. Possible factors include word frequency, the breadth of usage across different fields, the number of different letters in a word, and the part of speech. In general Natural Language Processing (NLP), there are 36 commonly used parts of speech [2], of which we selected the 18 types relevant to this task, as shown in Table 1.
To process missing values, abnormal values and repeated observations in the original data set, we apply a series of data-processing methods: data cleaning, establishment of dummy variables for discrete variables, logarithmic transformation of the number of reports, and creation of new attributes. These four steps eliminate extraneous information and facilitate the identification and extraction of relevant information from the dataset.
Step 1: In the data-cleaning stage, we use Python to check for missing, outlier and duplicate values. By measuring the length of each word, we check for empty or unusually long values. We find no empty values but three outliers: "tash", "clen" and "rprobe". After searching and comparing online, we correct these words to "trash", "clean" and "probe". Furthermore, using the duplicated() method, we check for duplicate values and find none.
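The cleaning checks above can be sketched in pandas as follows; the column name and the sample words around the three outliers are assumptions for illustration, not the actual data set.

```python
import pandas as pd

# Hypothetical miniature of the word column (names and filler words assumed).
df = pd.DataFrame({"word": ["eerie", "tash", "clen", "rprobe", "slate"]})

# Missing-value check: count empty entries.
print(df["word"].isna().sum())

# Outlier check: Wordle answers have exactly five letters,
# so any other length flags a suspicious value.
outliers = df[df["word"].str.len() != 5]

# Correct the three outliers identified by searching online.
corrections = {"tash": "trash", "clen": "clean", "rprobe": "probe"}
df["word"] = df["word"].replace(corrections)

# Duplicate check with pandas' duplicated() method.
print(df.duplicated().sum())
```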
Step 2: To make the discrete part-of-speech variable easier for the model to process, we construct 17 dummy variables that convert the 18 categories into binary variables (encoding k categories with k-1 dummies avoids perfect multicollinearity).
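This encoding can be sketched with pandas; the tag names below are placeholders, and `drop_first=True` is what turns k categories into k-1 binary columns.

```python
import pandas as pd

# Hypothetical miniature with a part-of-speech column (tag names assumed).
df = pd.DataFrame({"pos": ["NN", "VB", "JJ", "NN"]})

# drop_first=True encodes k categories with k-1 dummy columns, which is
# why 18 part-of-speech types yield 17 binary variables.
dummies = pd.get_dummies(df["pos"], prefix="pos", drop_first=True)
df = pd.concat([df.drop(columns="pos"), dummies], axis=1)
print(df.columns.tolist())
```

Here the three tags JJ, NN, VB produce two dummy columns; the dropped category is represented implicitly by all dummies being zero.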
Step 3: We plan to use a time-series model to predict the number of reports on March 1, 2023. In such models, it is crucial to eliminate heteroscedasticity in the data. Taking the logarithm does not change the data's nature or correlations, but it compresses the scale of the variable; by shrinking the absolute values, it makes heteroscedasticity easier to eliminate. Therefore, we logarithmically transform the reported quantity.
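As a minimal sketch of this step, the transform below uses invented report counts (the actual series is not reproduced here); note that ordering and monotone relationships survive the transform while the scale is compressed.

```python
import numpy as np
import pandas as pd

# Hypothetical daily report counts spanning an order of magnitude
# (values and column name assumed for illustration).
reports = pd.Series([361908, 150000, 25000], name="num_reports")

# Natural-log transform: preserves order and correlation structure,
# shrinks absolute values, and so stabilises the variance.
log_reports = np.log(reports)
```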
Step 4: To comprehensively explore the influence of various word attributes on the reported number of Hard-Mode plays, we further extract word attributes and establish several new variables, elaborated in Section 3.4.
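The attributes of Table 2 that depend only on the word itself (SLF, NDLW and the a-z letter counts) can be extracted as sketched below; Freq and BU require external corpora and are omitted. The letter-frequency table is an assumed set of standard English values, since the paper does not specify its source.

```python
from collections import Counter
import string

# Approximate relative frequencies (%) of English letters a-z
# (assumed values; the paper's exact frequency table is not given).
FREQ = dict(zip(string.ascii_lowercase,
                [8.2, 1.5, 2.8, 4.3, 12.7, 2.2, 2.0, 6.1, 7.0, 0.15, 0.77,
                 4.0, 2.4, 6.7, 7.5, 1.9, 0.095, 6.0, 6.3, 9.1, 2.8, 0.98,
                 2.4, 0.15, 2.0, 0.074]))

def word_attributes(word):
    """Extract the word-intrinsic attributes of Table 2 for one word."""
    counts = Counter(word)
    return {
        "SLF": sum(FREQ[c] for c in word),   # Sum of Letter Frequencies
        "NDLW": len(counts),                 # Number of Different Letters in a Word
        **{c: counts.get(c, 0)               # a-z: count of each letter
           for c in string.ascii_lowercase},
    }

attrs = word_attributes("eerie")
```

For "eerie", NDLW is 3 (letters e, r, i) and the repeated letter e contributes three times to SLF.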
3.1.2 Data Description and Exploratory Analysis
We visualize the data to explore its inherent patterns, which aids modeling. Figure