206 Y. Li et al.
detection accuracy. Note that the detection is performed frame-by-frame and
the temporal information is automatically learnt by the deep LSTM network
without requiring a sliding window design, which is time efficient.
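The frame-by-frame idea can be illustrated with a minimal sketch. It is not the paper's network: a single scalar recurrent unit stands in for the deep LSTM, and the weights `w_in`, `w_rec` and the `threshold` are hypothetical values chosen only for illustration. The point is that each incoming frame is visited once and the temporal context is carried in the recurrent state rather than in a buffer of past frames.

```python
import math

def rnn_step(state, frame, w_in=0.5, w_rec=0.9):
    """One recurrent update mixing the previous state with the current frame.

    A toy scalar unit, used here as a stand-in for an LSTM cell (assumption).
    """
    return math.tanh(w_rec * state + w_in * frame)

def detect_stream(frames, threshold=0.5):
    """Yield one label per frame as it arrives; no window of frames is stored."""
    state = 0.0  # the only memory carried between frames
    for frame in frames:
        state = rnn_step(state, frame)
        yield "action" if state > threshold else "background"

# Toy stream: a scalar feature per frame, high while something is happening.
labels = list(detect_stream([0, 0, 1, 1, 1, 0, 0]))
```

Because `detect_stream` is a generator, a label is available as soon as its frame has been consumed, which is the property an online detector needs.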
The main contributions of this paper are summarized as follows:
– We investigate the new problem of online action detection for streaming
skeleton data by leveraging recurrent neural networks.
– We propose an end-to-end Joint Classification-Regression RNN to address our
target problem. Our method leverages the advantages of RNNs for frame-wise
action detection and forecasting without requiring a sliding window design
or explicit looking forward or backward.
– We build a large action dataset for the task of online action detection from
streaming sequences.
2 Related Work
2.1 Action Recognition and Action Detection
Action recognition and detection have attracted considerable research interest in
recent years. Most methods are designed for action recognition [13,14,17], i.e.,
to recognize the action type from a well-segmented sequence, or offline action
detection [8,10,18,19]. However, in many applications it is desirable to recognize
the action on the fly, without waiting for the completion of the action, e.g.,
in human-computer interaction to reduce the response delay. In [5], a learning
formulation based on a structural SVM is proposed to recognize partial events,
enabling early detection. To reduce the observational latency of human action
recognition, a non-parametric moving pose framework [6] and a dynamic integral
bag-of-words approach [20] are proposed respectively to detect actions earlier.
Our model goes beyond early detection. Besides providing frame-wise class
information, it forecasts when actions will start and end.
To localize actions in a streaming video sequence, existing detection methods
utilize either a sliding-window scheme [5,8–10] or action proposal approaches
[11,21,22]. These methods usually suffer from low computational efficiency or
unsatisfactory localization accuracy, owing to the overlapping window design and
the unsupervised localization approach. Moreover, the sliding-window size is
difficult to determine.
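The efficiency cost of overlapping windows can be made concrete with a small count. The sketch below assumes that every window is processed from scratch, with no feature caching across windows; the sequence length, window size, and stride are hypothetical values, not settings taken from any cited method.

```python
def sliding_windows(n_frames, window, stride):
    """Return (start, end) index pairs, end exclusive, for windows that fit."""
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]

def frame_evaluations(n_frames, window, stride):
    """Total frame visits when each window is processed independently."""
    return len(sliding_windows(n_frames, window, stride)) * window

n_frames, window, stride = 100, 20, 5
windowed = frame_evaluations(n_frames, window, stride)  # 17 windows * 20 frames = 340
framewise = n_frames                                    # each frame visited once = 100
```

Under these assumptions the windowed scheme touches each frame several times, and the ratio grows as the stride shrinks relative to the window size.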
Our framework addresses online action detection in such a way that
it can predict the action at each time slot efficiently without requiring a sliding
window design. We use a regression design to determine the start/end points,
which is learned in a supervised manner during training and makes the localization
more accurate. Furthermore, it forecasts the start of impending actions and the end
of ongoing actions.
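One way to picture a supervised regression target for a start or end point is a per-frame confidence curve that peaks at the annotated frame. The sketch below assumes a Gaussian-like shape with a hypothetical width `sigma`; this section does not specify the target, so both the shape and the helper names are illustrative assumptions.

```python
import math

def confidence_curve(n_frames, point, sigma=2.0):
    """Per-frame target that equals 1 at the annotated frame and decays around it.

    The Gaussian-like shape and sigma are assumptions made for illustration.
    """
    return [math.exp(-((t - point) ** 2) / (2 * sigma ** 2)) for t in range(n_frames)]

def localize(curve):
    """Pick the frame with the highest confidence as the predicted boundary."""
    return max(range(len(curve)), key=curve.__getitem__)

start_target = confidence_curve(30, point=12)  # annotated start at frame 12
predicted_start = localize(start_target)
```

A curve of this kind rises before the annotated frame is reached, so a regressor trained on it has a signal to learn from ahead of the boundary, which is consistent with the forecasting behaviour described above.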
2.2 Deep Learning
Recently, deep learning has been exploited for action recognition [17]. Instead
of using hand-crafted features, deep learning can automatically learn robust