eCaReNet
survival and hazard functions are defined as

$S(t_j) = P(t^* > t_j)$, (3)

$h(t_j) = P(t^* = t_j \mid t^* > t_{j-1})$, (4)

$S(t_j) = \prod_{k=0}^{j} (1 - h(t_k))$. (5)
The survival function is a monotonically decreasing
function, as can be seen from Equation 5.
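Equation 5 can be made concrete with a few lines of numpy; the hazard values below are invented for illustration:

```python
import numpy as np

# Discrete-time survival via Eq. 5: S(t_j) = prod_{k=0}^{j} (1 - h(t_k)).
# Hypothetical per-interval hazards h(t_0)..h(t_3).
hazards = np.array([0.05, 0.10, 0.20, 0.15])
survival = np.cumprod(1.0 - hazards)   # S(t_0)..S(t_3)

# Each factor (1 - h(t_k)) lies in [0, 1], so the curve never increases.
assert np.all(np.diff(survival) <= 0)
```

The cumulative product directly exposes why the survival function is monotonically decreasing: every additional interval multiplies by a factor at most 1.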
An important characteristic of survival data is censoring. Not all patients in the dataset experience an event, either because they are lost to follow-up, their event occurs after the end of documentation, or they never relapse. These patients are right-censored, and here $t^*$ is not the time of the event, but the last observed time without any event.
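The censoring convention can be illustrated with two hypothetical records (the field names are assumptions, not the paper's data format):

```python
# For uncensored patients (c = 0), t_star is the relapse time in months;
# for right-censored patients (c = 1), t_star is only the last observed
# event-free time, i.e. a lower bound on the true event time.
patients = [
    {"t_star": 30, "c": 0},  # relapse observed at month 30
    {"t_star": 48, "c": 1},  # event-free for 48 months, then lost to follow-up
]

censored = [p for p in patients if p["c"] == 1]
```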
4.2. Model
As a base model for our proposed survival prediction, an InceptionV3 network (Szegedy et al., 2015), pretrained on the ImageNet dataset (Russakovsky et al., 2015), is chosen, and its last layers are replaced to perform survival prediction as described below. We chose InceptionV3 as it achieved the best results in our experiments. We include two preceding steps (4.2.1 and 4.2.2) before training our survival model eCaReNet in a third step. Figure C.1 shows an overview of the presented models and the datasets they are trained on.
4.2.1. $M_{ISUP}$
In the first step, we additionally pretrain the InceptionV3 model to adapt it to our histopathology domain. Our model $M_{ISUP}$ takes images from the Gleason dataset as input (Figure C.1A), downsized with bilinear interpolation to 1024 × 1024 pixels, and classifies them into one of six classes (benign or one of 5 malignant ISUP classes). During training, a cross-entropy loss is used. For training details and results, see Appendix B.
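The six-class objective can be sketched as a plain cross-entropy over softmax logits; the logit values below are invented for illustration:

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy loss for one sample with integer class label."""
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    return -log_probs[label]

# One image, 6 output nodes: benign + 5 malignant ISUP classes.
logits = np.array([2.0, 0.5, 0.1, -1.0, -0.5, 0.0])
loss = cross_entropy(logits, label=0)         # true class: benign
```

Averaging this loss over a batch gives the training objective for $M_{ISUP}$.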
4.2.2. $M_{Bin}$
In the second step, a binary classification model $M_{Bin}$ is used to predict relapse within 2 years on the survival dataset (Figure C.1B). 2 years was chosen as it lies close to the median (26.8 months) of the relapse times (44% of relapses occur earlier than 2 years). For this, we took the model $M_{ISUP}$ and modified the output to 2 classes. The input image is resized to 1024 × 1024 pixels as in $M_{ISUP}$, and a cross-entropy loss is applied during training. As opposed to the first step, the prediction per image is saved and used in the third step, which is the survival prediction model eCaReNet, shown in Figure 1.
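Deriving the 2-year binary target can be sketched as follows; the text does not specify how patients censored before 2 years are handled, so excluding them is an assumption shown as one option:

```python
def binary_label(t_star_months, censored):
    """2-year relapse label for M_Bin.

    1 if relapse was observed within 24 months, 0 if the patient is
    known to be event-free at 24 months. Patients censored before
    24 months are ambiguous; returning None (exclusion) is an
    assumption, not the paper's documented choice.
    """
    if t_star_months <= 24 and not censored:
        return 1
    if t_star_months >= 24:
        return 0
    return None
```

Example: `binary_label(18, censored=False)` yields 1, while a patient last seen event-free at month 40 (censored) safely receives label 0.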
4.2.3. eCaReNet
Each image of the survival dataset is cut into square, non-overlapping patches as input to eCaReNet (64 patches with 256×256 pixels each, see also Section 5). As this model predicts the hazard over time, one output node per time interval is needed. We chose 28 intervals to cover a time span of 7 years with intervals of 3 months' length, covering the 90% of relapses that occur prior to 7 years. For eCaReNet, only the first 4 inception blocks of $M_{ISUP}$ are used, to reduce overfitting. The following global average pooling layer reduces the dimensionality. Then a self-attention block, as proposed by Rymarczyk et al. (2021), models the influence of each patch across all other patches. Next, the aforementioned binary classification is concatenated with the output vector of the self-attention layer. This concatenated vector is repeated 28 times to model the discrete time intervals. The current time step is concatenated to each of these vectors. A gated recurrent unit (GRU) layer (Cho et al., 2014) models the temporal dependency of the hazard rate in the output, as proposed by Ren et al. (2019). At the end, an attention-based MIL layer weights the predictions per patch and outputs a prediction per image, as proposed in Ilse et al. (2018).
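The tensor manipulations described above can be followed as a shape walkthrough; the feature dimension D and all values are assumptions, and the self-attention and GRU stages are stubbed out to keep the sketch dependency-free:

```python
import numpy as np

P, D, T = 64, 128, 28                 # patches, assumed feature dim, time intervals

feats = np.random.rand(P, D)          # after truncated InceptionV3 + global avg pool
attended = feats                      # self-attention stub: keeps shape (P, D)
binary_pred = np.random.rand(P, 1)    # per-patch M_Bin relapse-within-2y score

x = np.concatenate([attended, binary_pred], axis=1)          # (P, D+1)
x = np.repeat(x[:, None, :], T, axis=1)                      # repeat 28x: (P, T, D+1)
t_idx = np.broadcast_to(np.arange(T)[None, :, None].astype(float), (P, T, 1))
x = np.concatenate([x, t_idx], axis=2)                       # add time step: (P, T, D+2)
# A GRU over the T axis would then emit one hazard per interval per patch,
# and the MIL attention layer pools the P patches into one image-level prediction.
```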
An individual survival curve per patient is obtained through Equation 5. Using the normalized area under the survival curve, the patient's overall risk is estimated. Since a large area under the survival curve indicates a low risk r and vice versa, the normalized area is subtracted from one:

$r = 1 - \frac{1}{t_k} \sum_{i=1}^{k} S(t_i) \cdot |t_i - t_{i-1}|$, (6)

with the last interval k at time $t_k$ (based on the survival time prediction in Xiao et al. (2020)). Since the risk score is a single numerical value between 0 and 1, it eases comparison among patients.
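Equation 6 reduces to a short numpy computation; the interval grid and the constant hazard below are illustrative, not model outputs:

```python
import numpy as np

k = 28
t = np.arange(1, k + 1) * 0.25               # 28 quarterly intervals, in years
hazard = np.full(k, 0.05)                    # illustrative constant hazard
S = np.cumprod(1.0 - hazard)                 # S(t_1)..S(t_k) via Eq. 5
widths = np.diff(np.concatenate([[0.0], t])) # |t_i - t_{i-1}|, here all 0.25
r = 1.0 - (S * widths).sum() / t[-1]         # Eq. 6: 1 - normalized AUC
```

Because $S$ takes values in [0, 1] and the widths sum to $t_k$, the normalized area lies in [0, 1], so r does as well.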
As proposed by Kvamme et al. (2019), during training a maximum likelihood loss is optimized. It differs for censored (c = 1) and uncensored (c = 0) patients with the observed event time $t^*$. For uncensored patients, the loss $L_u$ can be defined by the