An End-to-End Neural Network Approach to Story Segmentation

Jia Yu∗, Lei Xie∗‡, Xiong Xiao†, Eng Siong Chng†

∗Shaanxi Provincial Key Laboratory of Speech and Image Information Processing,
School of Computer Science, Northwestern Polytechnical University, Xi’an, China
†School of Computer Engineering, Nanyang Technological University, Singapore
E-mail: {jiayu,lxie}@nwpu-aslp.org, {xiaoxiong,ASESChng}@ntu.edu.sg
Abstract— This paper proposes an end-to-end story segmentation
approach based on a long short-term memory (LSTM)
recurrent neural network (RNN). Traditional story segmentation
approaches form a two-stage pipeline consisting of feature
extraction and segmentation, each of which has its own objective
function. In other words, the objective function used to extract
features is different from the true performance measure of story
segmentation, which may degrade the segmentation results. In
this paper, we combine the two components and optimize them
jointly, using an LSTM-RNN. Specifically, one LSTM layer is
used to extract sentence vectors, and another LSTM layer is used
to predict story boundaries, taking the sentence vectors as
input. Importantly, the whole network is optimized directly
to predict story boundaries. We also investigate a bi-directional
LSTM (BLSTM), which can exploit both past and future
information when extracting sentence vectors and predicting
story boundaries. Experimental results on the TDT2 corpus show that
the proposed approach achieves state-of-the-art performance in
story segmentation.
I. INTRODUCTION
Story segmentation is the task of partitioning a stream of
audio, video or text into story segments, each addressing a
specific topic. It is a necessary precursor for a variety of
language processing technologies including content indexing
and retrieval [1], document summarization [2], topic detection
and tracking [3], [4] and information extraction [5]. Typical
story segmentation approaches form a pipeline consisting of
feature learning and segmentation. The two components are
not optimized jointly for story segmentation; instead, independent
assumptions are made for each component [6], [7], [8], [9].
Recently, end-to-end (E2E) neural network (NN) learning that
jointly optimizes all components (e.g., in speech recognition)
has achieved promising results [10], [11], [12]. This motivates
us to develop an end-to-end NN approach for the story
segmentation task at hand.
Story segmentation has been studied for different genres,
such as broadcast news [13], [14], meeting recordings [15] and
lectures [16], [17], over various types of media, including
audio [17], [18], [19], video [20] and text [21], [22], [23], [24],
[6], [15]. In this paper, we aim to perform story segmentation
for textual documents like broadcast news speech recognition
transcripts. Note that, with the recent tremendous success
of large vocabulary continuous speech recognition (LVCSR)
using deep neural networks (DNN) [25], [26], [27], [28], [29],
[30], [31], we can easily obtain high-accuracy transcripts for
broadcast news. Thus traditional text segmentation approaches,
which serve a similar purpose to story segmentation, can be
readily applied to speech recognition transcripts.

‡Corresponding author
Traditional story segmentation approaches on text form a
pipeline consisting of feature learning, which captures
semantic or topic information from a stream of text, and
segmentation, which partitions the stream into topically
coherent segments by detecting topic shifts.
Feature extraction heavily affects the performance of story
segmentation. The bag-of-words (BOW) representation, or its
term frequency-inverse document frequency (tf-idf) weighted
variant, is a simple representation used in typical story
segmentation approaches, e.g., TextTiling and dynamic
programming (DP) [6], [7], [8]. However, BOW and tf-idf only
count the occurrences of words, ignoring the semantic relations
among them. In contrast, probabilistic latent semantic analysis
(pLSA) [9], latent Dirichlet allocation (LDA) [32] and
LapPLSA [33] employ latent topic variables and build topic
models that describe the probability distribution of words over
topics. With these probabilistic models, BOW-based word
representations are transformed into topic representations and
used in various segmentation approaches [32], [34]. Recently,
neural network based topic models have shown promising
performance [35], [36], [37], [38], [39]. Specifically, in
previous work we derived word representations in topic space
from a neural network based topic model, leading to improved
story segmentation performance [40].
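For concreteness, the tf-idf weighting mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the representation used in any cited system; the tokenized example documents and the plain log(N/df) idf variant are assumptions (practical systems usually add smoothing and normalization):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf vectors for a list of tokenized documents.
    tf = raw term count in the document; idf = log(N / df),
    where df is the number of documents containing the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

docs = [
    "the market fell sharply today".split(),
    "the storm hit the coast today".split(),
]
vecs = tfidf(docs)
print(vecs[0]["the"])               # 0.0 -- "the" occurs in every document
print(round(vecs[0]["market"], 3))  # 0.693 -- log(2/1), unique to doc 0
```

Note how terms shared by all documents receive zero weight, which is exactly why tf-idf alone captures no semantic relations between distinct words, motivating the topic models discussed above.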
The second component of the pipeline is the segmenter. The
above-mentioned TextTiling [6], [7] and dynamic programming
(DP) [33], [41], [42], [43] are typical detection-based
approaches, which find an optimal partition of the word
sequence by optimizing a local or global objective. Popular
probabilistic approaches locate story boundaries using the
distribution of topics over documents and the distribution of
words over topics; such approaches include pLSA [34],
BayesSeg [44], dd-CRP [45] and HMMs [23], [24], [21].
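The detection-based idea can be illustrated with a minimal sketch in the spirit of TextTiling: a story boundary is placed at the gap where lexical similarity between adjacent units dips furthest below its neighbors. This is a simplification (per-sentence bag-of-words without the block smoothing of the original algorithm), and the toy sentences are invented:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * \
           math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def depth_scores(sentences):
    """Depth score at each inter-sentence gap: how far the lexical
    similarity dips below the highest similarity seen on each side."""
    sims = [cosine(Counter(sentences[i]), Counter(sentences[i + 1]))
            for i in range(len(sentences) - 1)]
    return [(max(sims[:i + 1]) - s) + (max(sims[i:]) - s)
            for i, s in enumerate(sims)]

# Toy transcript: two finance sentences followed by two weather sentences.
sentences = [
    "stocks fell on wall street".split(),
    "stocks rose on wall street".split(),
    "rain flooded the coastal town".split(),
    "rain hit the coastal region".split(),
]
scores = depth_scores(sentences)
boundary = scores.index(max(scores))  # gap with the deepest similarity dip
print(boundary)  # → 1 (boundary between sentences 1 and 2)
```

The deepest dip correctly falls between the finance and weather sentences; real systems threshold the depth scores rather than taking a single argmax.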
The two components of a story segmentation system are
traditionally modeled independently. The objective function
used to extract features may be substantially different from
the true performance measure of story segmentation. This
sort of inconsistency may degrade the performance of story
segmentation. The purpose of end-to-end (E2E) learning is to
Proceedings of APSIPA Annual Summit and Conference 2017
12 - 15 December 2017, Malaysia
978-1-5386-1542-3@2017 APSIPA