lead to fertile islands in the string landscape, i.e. to patches in the parameter space of Z_6-II models where the number of MSSM-like models is above average.
Let us start with an overview of the main points of the following discussion. We begin with the preprocessing of our data, where we transform each Z_6-II model into a suitable, machine-readable representation of 26 parameters X, also known as features. Then, we utilize a neural network to project each Z_6-II model to a point in a two-dimensional image, yielding a "chart" of the Z_6-II landscape. This is done such that the reconstruction error (i.e. the error when we map each point of the two-dimensional chart back to a feature vector X) is as small as possible. In this chart of the Z_6-II landscape we can easily identify fertile islands where MSSM-like models appear to cluster – even though the neural network had no information during training about whether a model is MSSM-like or not. Afterwards, a decision tree is used to investigate these fertile islands, i.e. to find conditions on the 26 features X of a Z_6-II model such that one can directly decide whether a given Z_6-II model is located on a fertile island of the landscape or not. Finally, we discuss the performance of this procedure: we analyze how many MSSM-like models can be found if we restrict the search for MSSM-like models to the fertile islands.
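The two-step procedure (unsupervised autoencoder chart plus a decision tree on the 26 features) can be summarized in a short, hedged code sketch. The layer sizes, placeholder data and island labels below are our own illustrative assumptions, not the exact setup used in this work.

```python
# Hedged sketch: 2D autoencoder "chart" + decision tree on the raw features.
# All data, labels and network sizes are placeholders.
import numpy as np
import tensorflow as tf
from sklearn.tree import DecisionTreeClassifier

n_models = 1000
X_raw = np.random.randint(0, 37, size=(n_models, 26))                  # 26 integer features per model
X_one_hot = np.eye(37)[X_raw].reshape(n_models, 26 * 37).astype("float32")

# Unsupervised autoencoder with a two-dimensional bottleneck.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(962,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2),
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(962, activation="sigmoid"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(X_one_hot, X_one_hot, epochs=5, batch_size=64, verbose=0)  # no MSSM labels used

chart = encoder.predict(X_one_hot, verbose=0)  # 2D coordinates per model ("chart" of the landscape)

# Placeholder island labels; in practice they mark the regions of the chart
# where MSSM-like models cluster.
on_island = (chart[:, 0] > chart[:, 0].mean()).astype(int)

# Decision tree: simple conditions on the 26 features that predict whether
# a model lies on a fertile island.
tree = DecisionTreeClassifier(max_depth=4).fit(X_raw, on_island)
```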
3.1. Data preprocessing
We start our machine learning workflow with the most basic but crucial step: defining our training and validation sets. The training set is used by the machine learning algorithms to actually tune the weights and biases of the neurons, while the validation set is used to estimate the generalization properties of our machine learning model and can be exploited for hyperparameter search, e.g. to adjust the architecture of the neural network. Both of these sets contribute to the structure of the machine learning model.
In our case, we have a coarse sample of O(700,000) Z_6-II models. This dataset is used to build our machine learning algorithm and is divided randomly into 60% training and 40% validation data.
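As an illustration, such a random 60/40 split could be performed as follows; the placeholder data, variable names and the use of scikit-learn are our own assumptions, not a description of the original implementation.

```python
# Minimal sketch of a random 60/40 train/validation split.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=42)
models = rng.integers(low=0, high=37, size=(700_000, 26))  # placeholder feature vectors

X_train, X_val = train_test_split(models, test_size=0.4, random_state=42)
print(X_train.shape, X_val.shape)  # (420000, 26) and (280000, 26)
```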
In order for the autoencoder to handle the data, we need a suitable numerical representation of the data. In our case, there exists a natural representation: the 26-dimensional feature vector of integers X, see Appendix A. However, it turns out that this representation does not perform well with the autoencoder. In fact, a more abstract representation, a so-called one-hot encoding, leads to a much better result. One-hot encoding is an approach for categorical data that has no internal order, such as the values "green", "red" and "blue". It generates a vector with n components, where n equals the total number of possible values. Hence, in the example of three colors we have n = 3, and "green", "red" and "blue" have the one-hot encodings (1, 0, 0), (0, 1, 0) and (0, 0, 1), respectively. In our case of Z_6-II models, each feature X_k of X can take 37 different values (i.e. there are in total 37 different breaking patterns for each E_8 factor). Thus, each component X_k of the 26-dimensional feature vector X is represented by a 37-dimensional vector. This 37-dimensional vector is zero except for the component that corresponds to the given value of X_k, which equals 1. Therefore, we obtain for each Z_6-II model a (26 × 37 = 962)-dimensional feature vector X^one-hot as input to our neural network.
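As a sketch, the one-hot encoding of a single Z_6-II model could be implemented as follows; the function and variable names are illustrative assumptions.

```python
# Minimal sketch of the one-hot encoding step: 26 integer features, each
# taking one of 37 values, mapped to a 962-dimensional binary vector.
import numpy as np

NUM_FEATURES = 26
NUM_VALUES = 37  # possible breaking patterns per feature

def one_hot_encode(x):
    """Map a 26-dimensional integer feature vector (values 0..36) to a
    962-dimensional binary vector: 26 blocks of length 37, one '1' per block."""
    encoded = np.zeros(NUM_FEATURES * NUM_VALUES)
    for k, value in enumerate(x):
        encoded[k * NUM_VALUES + value] = 1.0
    return encoded

x = np.random.randint(0, NUM_VALUES, size=NUM_FEATURES)  # placeholder model
x_one_hot = one_hot_encode(x)
print(x_one_hot.shape)  # (962,)
```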
3.2. The autoencoder
The main effect of an autoencoder neural network is that redundancies in the feature vector X^one-hot (such as irrelevant features) can be detected and reduced. Thus, an autoencoder