
Online Video Recommendation Based on
Multimodal Fusion and Relevance Feedback∗
Bo Yang†, Tao Mei‡, Xian-Sheng Hua‡, Linjun Yang‡, Shi-Qiang Yang†, Mingjing Li‡
† Department of Computer Science and Technology, Tsinghua University, Beijing 100084, P. R. China
‡ Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, P. R. China
bo.yang02@gmail.com; {tmei, xshua, linjuny, mjli}@microsoft.com; yangshq@mail.tsinghua.edu.cn
ABSTRACT
With Internet delivery of video content surging to an
unprecedented level, video recommendation has become a very
popular online service. The capability of recommending
relevant videos to targeted users can reduce the effort users
spend finding the most relevant content according to their
current viewings or preferences. This paper presents a novel
online video recommendation system based on multimodal
fusion and relevance feedback. Given an online video
document, which usually consists of video content and related
information (such as query, title, tags, and surroundings),
video recommendation is formulated as finding a list of the
most relevant videos in terms of multimodal relevance. We
express the multimodal relevance between two video
documents as the combination of textual, visual, and aural
relevance. Furthermore, since the relevance weights of the
three modalities differ from one video document to another,
we adopt relevance feedback to automatically adjust the
intra-weights within each modality and the inter-weights among
different modalities using users’ click-through data, and we
use an attention fusion function to fuse the multimodal
relevance together. Unlike traditional recommenders, which
assume that a sufficient collection of user profiles is
available, the proposed system is able to recommend videos
without user profiles. We conducted an extensive experiment
on 20 videos searched by the top 10 representative queries
from more than 13k online videos, and report the effectiveness
of our video recommendation system.
Categories and Subject Descriptors
H.5.1 [Information Interfaces and Presentation]: Mul-
timedia Information Systems—video; H.3.5 [Information
Storage and Retrieval]: Online Information Services—
Web-based services
General Terms
Algorithms, Human Factors, Experimentation.
Keywords
online video recommendation, multimodal fusion, relevance
feedback
∗ This work was performed while the first author was visiting
Microsoft Research Asia as a research intern.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CIVR’07, July 9–11, 2007, Amsterdam, The Netherlands.
Copyright 2007 ACM 978-1-59593-733-9/07/0007 ...$5.00.
1. INTRODUCTION
Driven by the coming of age of the Internet generation and the
advent of near-ubiquitous broadband Internet access, online
delivery of video content has surged to an unprecedented level.
According to an Online Publishers Association study [17],
more than 140 million people (69%) have watched video
online, with 50 million (24%) doing so weekly. This trend
has brought a variety of online video services, such as video
search, video tagging and editing, video sharing, video
advertising, and so on. Consequently, today’s online users
face a daunting volume of video content, be it from video
sharing or blog content, or from IPTV and mobile TV. As a
result, there is an increasing demand for an online video
service that pushes “interesting” or “relevant” content to
targeted people at every opportunity.
Video recommendation is such a service: it relieves users of
the effort of manually filtering out unrelated content and
finding the most interesting videos according to their
current viewings or preferences. While many existing
video-oriented sites, such as YouTube [6], MySpace [5],
Yahoo! [4], Google Video [2], and MSN Soapbox [1], already
provide recommendation services, it is likely that most of
them recommend relevant videos based only on surrounding
text information (such as the title, tags, and comments).
Leveraging video content and users’ click-through data for
more efficient recommendation remains a challenging
research problem.
Early research on recommendation began with Resnick
et al., who gave a general definition of a recommender
system as one that assists and augments the natural social
process [18]. A typical recommender system receives the
recommendations provided by users as inputs, then aggregates
them and directs them to appropriate recipients, aiming at
good matches between recommended items and users. In the
specific domain of online video services, by contrast, the
input of a video recommendation system is the video content
clicked by a user, together with related information (such as
the query and surrounding text provided by content
providers), and the output is a list of recommended videos
according to the user’s current viewing and preferences
(such as user interest and location).
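As a minimal illustration of this input/output formulation, the following Python sketch ranks candidate videos against a clicked video by a weighted sum of textual, visual, and aural relevance scores. The linear weighting, the names `fuse_relevance` and `recommend`, the weight values, and the relevance scores are all assumptions made for illustration; they stand in for, and are not, the attention fusion function and the feedback-adjusted weights of the proposed system.

```python
from typing import Dict, List, Tuple

# Hypothetical inter-modality weights (illustrative values, not from the paper).
WEIGHTS: Dict[str, float] = {"textual": 0.5, "visual": 0.3, "aural": 0.2}


def fuse_relevance(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear weighted fusion of per-modality relevance scores.

    This is an assumed stand-in for a fusion function; a missing modality
    contributes a relevance of 0.0.
    """
    total = sum(weights.values())
    return sum(weights[m] * scores.get(m, 0.0) for m in weights) / total


def recommend(candidates: Dict[str, Dict[str, float]],
              weights: Dict[str, float],
              top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (video id, fused relevance) pairs, best first."""
    ranked = sorted(
        ((vid, fuse_relevance(s, weights)) for vid, s in candidates.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]


# Made-up relevance scores of three candidates with respect to a clicked video.
candidates = {
    "video_a": {"textual": 0.9, "visual": 0.2, "aural": 0.1},
    "video_b": {"textual": 0.4, "visual": 0.8, "aural": 0.7},
    "video_c": {"textual": 0.1, "visual": 0.1, "aural": 0.1},
}
top = recommend(candidates, WEIGHTS, top_k=2)
```

In this sketch the weights are fixed constants; in a feedback-driven setting they would instead be updated from click-through data, which the sketch does not attempt to model.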