
To train and evaluate the performance of AnatomyNet,
we curated a dataset of 261 head and neck CT images
from a number of publicly available sources. We carried
out systematic experimental analyses on various compo-
nents of the network, and demonstrated their effective-
ness by comparing with other published methods. When
benchmarked on the test dataset from the MICCAI 2015
competition on HaN segmentation, the AnatomyNet out-
performed the state-of-the-art method by 3.3% in terms of the Dice similarity coefficient (DSC), averaged over nine anatomical structures.
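For reference, the DSC used in this comparison follows the standard definition, DSC = 2|A ∩ B| / (|A| + |B|). The sketch below is a generic illustration of this metric in Python, not the authors' evaluation script; the mask arrays in the usage comment are placeholders.

import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    # Standard DSC between two binary masks: 2|A ∩ B| / (|A| + |B|).
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom > 0 else 1.0

# The benchmark score averages the per-structure DSC over the nine OARs, e.g.:
# mean_dsc = np.mean([dice_coefficient(p, t) for p, t in zip(pred_masks, true_masks)])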
The rest of the paper is organized as follows. Section II B describes the network structure and SE residual block of AnatomyNet. The design of the loss function for AnatomyNet is presented in Section II C. The handling of missing annotations is addressed in Section II D. Section III validates the effectiveness of the proposed network and its components. Discussion and limitations are presented in Section IV. We conclude the work in Section V.
II. MATERIALS AND METHODS
Next we describe our deep learning model to delin-
eate OARs from head and neck CT images. Our model
receives whole-volume HaN CT images of a patient as
input and outputs the 3D binary masks of all OARs at
once. The dimension of a typical HaN CT is around
178 × 512 × 512, but the sizes can vary across differ-
ent patients because of image cropping and different set-
tings. In this work, we focus on segmenting the nine OARs most relevant to head and neck cancer radiation therapy: brain stem, chiasm, mandible, optic nerve left, optic nerve right, parotid gland left, parotid gland right, submandibular gland left, and submandibular gland right. Therefore, our model produces nine 3D binary masks for each whole-volume CT.
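As a concrete illustration of this interface, the sketch below shows the expected tensor shapes; the class name AnatomyNet and the channel layout are assumptions for illustration, not the released implementation.

import torch

# Whole-volume HaN CT as a single-channel 3D tensor:
# (batch, channel, depth, height, width), e.g. a typical 178-slice scan.
ct_volume = torch.zeros(1, 1, 178, 512, 512)

# Hypothetical forward pass (the network itself is described in Section II B):
# model = AnatomyNet(num_organs=9)
# logits = model(ct_volume)              # (1, 9, 178, 512, 512), one channel per OAR
# masks = torch.sigmoid(logits) > 0.5    # nine 3D binary masks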
A. Data
Before we introduce our model, we first describe the cu-
ration of training and testing data. Our data consists of
whole-volume CT images together with manually gener-
ated binary masks of the nine anatomies described above.
They were collected from four publicly available sources:
1) DATASET 1 (38 samples) consists of the training set
from the MICCAI Head and Neck Auto Segmentation
Challenge 2015 [4]. 2) DATASET 2 (46 samples) consists of CT images from the Head-Neck Cetuximab collection, downloaded from The Cancer Imaging Archive (TCIA, https://wiki.cancerimagingarchive.net/) [36]. 3) DATASET 3 (177 samples) consists of CT images from four different institutions in Québec, Canada [37], also downloaded from TCIA [36]. 4) DATASET 4 (10 samples) consists of the test set from the MICCAI HaN Segmentation Challenge 2015. We combined the first three datasets and used the aggregated data as our training data, altogether yielding 261 training samples.
DATASET 4 was used as our final evaluation/test dataset
so that we can benchmark our performance against pub-
lished results evaluated on the same dataset. Each of
the training and test samples contains both head and
neck images and the corresponding manually delineated
OARs.
In generating these datasets, we carried out several data cleaning steps, including 1) mapping the structure names assigned by different doctors at different hospitals to a unified set of annotation names (a minimal sketch of this step is given at the end of this subsection), 2) finding correspondences between the annotations and the CT images, 3) converting annotations in the radiation therapy format into usable ground truth label masks, and 4) removing the chest region from CT images to focus on the head and neck anatomies. We have taken care to make sure that the four
datasets described above are non-overlapping to avoid
any potential pitfall of inflating testing or validation per-
formance.
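As an illustration of cleaning step 1 above, the following is a minimal sketch of annotation-name unification; the raw names and the lookup table are hypothetical examples, not the actual mapping used in curating these datasets.

from typing import Optional

# Hypothetical lookup table from hospital-specific structure names to unified names.
RAW_TO_UNIFIED = {
    "BrainStem": "brain_stem",
    "Brain Stem": "brain_stem",
    "OpticChiasm": "chiasm",
    "Parotid_L": "parotid_gland_left",
    "LT Parotid": "parotid_gland_left",
    "Submandibular_R": "submandibular_gland_right",
}

def unify_name(raw_name: str) -> Optional[str]:
    # Strip surrounding whitespace before lookup; return None for structures
    # outside the nine OARs, so they can be dropped from the label masks.
    return RAW_TO_UNIFIED.get(raw_name.strip())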
B. Network architecture
We take advantage of the robust feature learning mech-
anisms obtained from squeeze-and-excitation (SE) resid-
ual blocks [30], and incorporate them into a modified
U-Net architecture for medical image segmentation. We
propose a novel three-dimensional U-Net with SE residual blocks and a hybrid focal and Dice loss for anatomical segmentation, as illustrated in Fig. 1.
The AnatomyNet is a variant of 3D U-Net [25, 38, 39],
one of the most commonly used neural net architectures
in biomedical image segmentation. The standard U-Net
contains multiple down-sampling layers, implemented via max-pooling or convolutions with a stride of two. Although these layers are beneficial for learning high-level features needed to segment complex, large anatomies, they can hurt the segmentation of small anatomies, such as the optic chiasm, which occupy only a few slices in HaN CT images. We design the AnatomyNet with only one down-sampling layer to account for the trade-off between GPU memory usage and network learning capacity. The down-sampling layer is used in the first encoding block so that the feature maps and gradients in the following layers occupy less GPU memory than in other network structures. Inspired by the effectiveness of squeeze-and-excitation residual features in image object classification, we design 3D squeeze-and-excitation (SE) residual blocks in the AnatomyNet for OAR segmentation. The SE residual block adaptively calibrates residual feature maps within each feature channel. The 3D SE residual learning extracts 3D features directly from CT images by extending the two-dimensional squeeze, excitation, scale, and convolution operations to their three-dimensional counterparts.
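To make this concrete, the following is a minimal PyTorch sketch of one possible 3D SE residual block; the channel count, reduction ratio, and layer ordering are our assumptions for illustration, not the exact published configuration. The block computes a residual 3D convolutional feature, squeezes it with global average pooling, excites it through two fully connected layers, scales each channel by the resulting weight, and adds the identity shortcut.

import torch
import torch.nn as nn

class SEResidualBlock3D(nn.Module):
    # Sketch of a 3D squeeze-and-excitation residual block (illustrative sizes).
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
        )
        # Squeeze: global average pooling collapses each 3D feature map to one value.
        self.squeeze = nn.AdaptiveAvgPool3d(1)
        # Excitation: two fully connected layers produce one weight per channel.
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.body(x)
        b, c = residual.shape[:2]
        # Scale: reweight each channel of the residual, then add the identity shortcut.
        w = self.excite(self.squeeze(residual).view(b, c)).view(b, c, 1, 1, 1)
        return torch.relu(x + residual * w)

Because the network has only one down-sampling layer, located in the first encoding block, such blocks operate on feature maps no coarser than half the input resolution.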