AN EFFICIENT CODING FRAMEWORK FOR COMPACT DESCRIPTORS EXTRACTED
FROM VIDEO SEQUENCE
Zhangshuai Huang*, Ling-Yu Duan*, Jie Lin†, Shiqi Wang◦, Siwei Ma*, Tiejun Huang*
* Institute of Digital Media, School of EE & CS, Peking University, Beijing, 100871, China
Cooperative Medianet Innovation Center, Shanghai, China
† Institute for Infocomm Research, 119613, Singapore
◦ Dept. of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada
ABSTRACT
Towards effective and efficient image matching and retrieval, the
emerging MPEG standard Compact Descriptors for Visual Search
(CDVS) provides compact descriptors for still images, consisting of
compressed local and global descriptors. However, coding CDVS
descriptors frame by frame over a video sequence does not exploit
inter-frame redundancy, and may therefore consume considerable
bandwidth and storage. In this work, we propose an efficient coding
framework that generates compact CDVS descriptors for video
sequences. For local descriptors, we propose a multiple reference
predictive technique that exploits the temporal correlation of local
descriptors and location coordinates over a sequence of frames. To
further improve prediction performance, keypoint tracking is applied
to identify temporally repeated keypoints. For global descriptors, a
propagation coding scheme is employed to compress the global
descriptors of adjacent frames. Empirical evaluation shows that the
proposed coding approach yields a bit rate below 40 kbps on average
while maintaining comparable matching and retrieval performance.
Compared to the sequence of original frame-level CDVS descriptors,
the proposed approach achieves over 25× bit rate reduction.
Index Terms— Compact descriptor, MPEG CDVS, Interest
points tracking, Predictive coding, Propagation coding.
1. INTRODUCTION
Video analysis applications, such as mobile augmented reality, visual
sensor networks and distributed surveillance, usually transmit visual
data from a mobile client to a remote server for subsequent matching
or retrieval tasks. Instead of sending raw images or videos, recent
works [1] [2] have proposed extracting low bit rate visual descriptors
directly on the mobile client, towards low latency delivery in
wireless environments. In general, visual descriptors can be broadly
categorized into two groups. The first group is local descriptors,
such as SIFT [3] and SURF [4]. The second group is global descriptors,
such as Bag-of-Words [5], Fisher Vector (FV) [6] and Vector of Locally
Aggregated Descriptors (VLAD) [7]. These global descriptors
are usually aggregated from the statistics of local descriptors.
[Fig. 1 block diagram. Client: Video, CDVS Descriptors, Encoding. Compressed stream. Remote Server: Decoding, CDVS Descriptors, Retrieval or Matching.]
Fig. 1. Overview of our proposed approach. CDVS descriptors extracted
from video are encoded at the client and transmitted to the server for
further retrieval or matching tasks.
Compact representation of local and global descriptors has
drawn much research attention. For compact local descriptors,
Chandrasekhar et al. proposed the Compressed Histogram of Gradients
(CHoG) descriptor with ∼50 bits. Other representative works include
BRIEF [8], ORB [9] and BRISK [10]. For compact global descriptors,
Chen et al. [11] introduced the Residual Enhanced Visual Vector
(REVV), which reduces the VLAD dimension with Linear Discriminant
Analysis (LDA) followed by sign binarization. Lin et al. [12] proposed
the Scalable Compressed Fisher Vector (SCFV), which directly binarizes
the FV, followed by centroid-based bit selection. In particular,
the emerging MPEG standard [13], Compact Descriptors for Visual
Search (CDVS), has standardized both compact local and compact
global descriptors. It has been shown that CDVS obtains state-of-the-art
image matching and retrieval performance at a low bit rate [14].
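As a rough illustration of the sign binarization step mentioned above, the following Python sketch maps a real-valued vector to bits and compares two binary signatures by Hamming distance. It is only a toy under stated assumptions: the function names and input vectors are hypothetical, and the dimension reduction and bit selection stages of REVV and SCFV are not reproduced.

```python
from typing import List, Sequence


def sign_binarize(vec: Sequence[float]) -> List[int]:
    """Keep only the sign of each component: 1 if positive, else 0.

    Assumption: zero is mapped to 0; actual schemes may treat it differently.
    """
    return [1 if x > 0 else 0 for x in vec]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions where two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

Two signatures produced this way can be compared with `hamming(sign_binarize(u), sign_binarize(v))`, which is what makes binarized global descriptors cheap to store and match.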
Nevertheless, there is little work on compressing descriptors
extracted from video sequences, especially CDVS descriptors. Unlike
still images, video inherently exhibits temporal redundancy. Recent
work has addressed this redundancy for either local or global
descriptors. For local descriptors, Makar et al. [15] proposed a
temporally coherent keypoint detector and inter-frame coding of
canonical patches. Baroffio et al. [16][17] adopted both intra- and
inter-frame coding to compress SIFT- and BRIEF-like [8] descriptors,
where a coding mode decision scheme was proposed to improve the
coding efficiency. For global descriptors, Chen et al. [18] proposed
inter-frame coding of the scalable residual-based global signature
REVV by propagating either codewords or residual vectors.
In this paper, we propose an efficient coding framework to compress
streams of CDVS descriptors. An overview of our proposed approach is
illustrated in Fig. 1. Within this framework, we investigate a coding
pipeline for both local and global descriptors (see Fig. 2). We first
introduce a tracking process, before the feature selection stage of
the CDVS framework, to recognize repeated keypoints and thus better
exploit temporal consistency. We then employ a multiple reference
predictive coding technique to reduce the temporal redundancy of
local descriptors and location coordinates. Finally, an efficient
propagation coding technique is designed to compress the global
descriptors. Extensive experiments have been conducted on the
Stanford MAR Dataset [15]. The results show that our approach
achieves a significant bit rate reduction with little effect on the
image matching and retrieval performance.
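To make the idea of multiple reference prediction concrete, the following Python sketch shows one simple way such a scheme could work. It is not the coding method of this paper: it assumes, for illustration only, that each descriptor is a binary string stored as an integer, that the candidate references are descriptors of the same tracked keypoint in earlier frames, and that the best reference is the one leaving the fewest non-zero residual bits. The names `encode` and `decode` are hypothetical, and entropy coding of the residual is omitted.

```python
from typing import List, Tuple


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def encode(current: int, references: List[int]) -> Tuple[int, int]:
    """Pick the reference closest to `current` in Hamming distance.

    Returns (reference index, XOR residual). A sparse residual is
    cheaper to entropy-code than the descriptor itself.
    """
    if not references:
        raise ValueError("at least one reference descriptor is required")
    best = min(range(len(references)),
               key=lambda i: popcount(current ^ references[i]))
    return best, current ^ references[best]


def decode(ref_index: int, residual: int, references: List[int]) -> int:
    """Invert `encode`: XOR the residual back onto the chosen reference."""
    return references[ref_index] ^ residual
```

With several candidate references, a keypoint that changes little over time tends to leave a residual with few set bits, which is the redundancy a predictive coder aims to exploit.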