2.2. Semantic relation and semantic similarity
Capturing the semantic relations between terms and measuring semantic
similarity have attracted many researchers' attention in recent years,
and a number of semantic relation and semantic similarity measures have
been proposed. The previously studied popular semantic similarity
methods were evaluated using WordNet (http://wordnet.princeton.edu/) as
an underlying reference ontology. Billhardt et al. [31] proposed the
context vector model based on VSM, which incorporated term de-
pendencies and thus obtained semantically richer representations
of documents. Budanitsky and Hirst [32] presented an evaluation of
resource-based measures of lexical semantic distance (equivalently,
semantic relatedness) for natural language processing applications.
Mikolov and Dean [13] captured the semantic relations between words by
using the similarity of their context information. Liu et al. [33]
proposed a new short text modeling
method by combining the semantic information obtained from
a hierarchical lexical database and the statistical information ex-
tracted from the corpus involved. Resnik [34] presented a measure
of semantic similarity in an IS-A taxonomy based on the notion of
shared information content. Turney [35] proposed an algorithm to
measure the similarity of pairs of words by using Pointwise Mutual
Information and Information Retrieval (PMI-IR). Farahat and Kamel [36]
proposed new models for document representation that can capture the
semantic similarity between documents based on measures of correlations
between their terms. Gao et al. [37] proposed a WordNet-based semantic
similarity measure that combines edge-counting and information content
theory.
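For illustration, a minimal sketch of such WordNet-based measures is given below. It assumes the NLTK interface to WordNet and the Brown information-content file are available; the two terms are only examples, not terms from our datasets.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
# May require: nltk.download('wordnet'); nltk.download('wordnet_ic')

# Illustrative terms only; in practice the terms come from video metadata.
car = wn.synset('car.n.01')
truck = wn.synset('truck.n.01')

# Edge-counting style measure (shortest path in the IS-A taxonomy).
print('Path similarity:', car.path_similarity(truck))

# Information-content measure in the spirit of Resnik [34],
# using the Brown corpus IC file shipped with NLTK.
brown_ic = wordnet_ic.ic('ic-brown.dat')
print('Resnik similarity:', car.res_similarity(truck, brown_ic))
```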
2.3. Clustering ensemble
Clustering ensemble is an approach widely adopted in clustering
research. It combines multiple clustering results to improve the
quality of the final result and consists of two main parts: diversity
(creating multiple clusterings) and a consensus function (combining the
multiple clusterings). Strehl and Ghosh [38] introduced the cluster
ensemble problem and applied graph-theoretic algorithms to obtain
consensus clustering results.
Fred and Jain [39] used evidence accumulation for combining
multiple clusterings and demonstrated that evidence accumulation
outperforms other combination approaches. Mimaroglu and Erdil [40,41]
combined multiple clusterings by using the evidence accumulated from
the individual clusterings. Other solutions for combining multiple
clusterings, based on genetic algorithms, were proposed by Mohammadi
et al. [42]. Azimi
et al. [43] presented an ensemble method based on the ant colony
algorithm, which can automatically determine the number of
clusters.
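The evidence-accumulation idea referred to above can be summarized as building a co-association matrix that counts how often two items are grouped together across the base clusterings. The sketch below illustrates this idea only; it is not the exact procedure of [39-41].

```python
import numpy as np

def co_association(labelings):
    """Build a co-association matrix from several base clusterings.

    labelings: list of 1-D integer label arrays, one per base clustering,
    all over the same n items. Entry (i, j) is the fraction of base
    clusterings that put items i and j in the same cluster.
    """
    labelings = [np.asarray(l) for l in labelings]
    n = len(labelings[0])
    ca = np.zeros((n, n))
    for labels in labelings:
        ca += (labels[:, None] == labels[None, :]).astype(float)
    return ca / len(labelings)

# Toy example with three base clusterings of five items.
base = [
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
    [1, 1, 2, 2, 2],
]
print(co_association(base))
```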
Semi-supervised clustering ensemble has become an interesting problem
in machine learning. Yang et al. [44] presented a semi-supervised
consensus clustering ensemble based on a multi-ant colonies algorithm.
Iqbal et al. [45] proposed a semi-supervised clustering ensemble that
uses a voting scheme. Wang and Pan [46] exploited spectral clustering
to generate a consensus clustering under semi-supervision. Mahmood
et al. [21] incorporated Must-Link constraints into a graph-tree
consensus clustering ensemble. Yang et al. [47] proposed a novel
semi-supervised multi-ant colonies consensus clustering algorithm and
parallelized it on MapReduce. Yu et al. [48] proposed an incremental
semi-supervised clustering ensemble approach, whose contribution is an
incremental ensemble member selection scheme based on local and global
objective functions.
3. System framework for social WVC
3.1. System overview
Fig. 1 shows the framework of the Social WVC model. Firstly, the
textual information from social web videos, such as the title, tags,
and description, is selected and then processed by feature extraction.
Secondly, possible external sources are employed, e.g., WordNet,
Word2vec, and the NGD from the Google search engine, which are expected
to capture the semantic information and the feature relevance of the
term lists in documents. In this step, an additional technique for the
semantic relation between terms in documents is used to capture the
relations from local views. After that, the similarities from each
model are combined by a combination function into a single similarity
before passing it to the clustering model. Thirdly, three clustering
algorithms, namely affinity propagation, spectral clustering, and graph
partitioning, are selected for clustering. The related videos' data are
included as pairwise constraints (Must-Link). Finally, we incorporate
the pairwise constraints into the clustering ensemble in [38] to obtain
the ultimate results.
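As an illustration of the combination step, the sketch below merges several pairwise similarity matrices into a single matrix before clustering. It uses a hypothetical weighted average; the exact combination function used by the model is defined later in the paper.

```python
import numpy as np

def combine_similarities(sim_matrices, weights=None):
    """Combine several n-by-n similarity matrices into one.

    Illustrative weighted average, not the paper's exact combination
    function. sim_matrices is a list of arrays of identical shape;
    weights defaults to a uniform combination.
    """
    sims = [np.asarray(s, dtype=float) for s in sim_matrices]
    if weights is None:
        weights = np.ones(len(sims))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * s for w, s in zip(weights, sims))

# Example: combine WordNet-, Word2vec-, and NGD-based similarities
# (random placeholders here) for n = 4 videos.
n = 4
sim_wordnet, sim_word2vec, sim_ngd = (np.random.rand(n, n) for _ in range(3))
combined = combine_similarities([sim_wordnet, sim_word2vec, sim_ngd])
print(combined.shape)
```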
3.2. Feature extraction
Each dataset consists of three subsets, i.e., title, tag, and
description, where each subset contains short information with noisy
and incomplete keywords. During feature extraction, we use a number of
techniques (i.e., word splitting; stop-word removal
(www.ranks.nl/stopwords) to omit the most common words such as
prepositions, articles, and conjunctions; word stemming or
lemmatization; and tokenization) to obtain the useful information on
videos.
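A minimal preprocessing sketch is shown below. It assumes NLTK and its English stop-word list are available; the concrete tools used in the experiments may differ.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# May require: nltk.download('stopwords')

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def extract_features(text):
    """Tokenize, remove stop words, and stem the metadata of one video."""
    # Word splitting / tokenization: keep alphabetic tokens only.
    tokens = re.findall(r'[a-zA-Z]+', text.lower())
    # Stop-word removal (prepositions, articles, conjunctions, ...).
    tokens = [t for t in tokens if t not in stop_words]
    # Stemming (lemmatization could be used instead).
    return [stemmer.stem(t) for t in tokens]

print(extract_features('A funny cat video, recorded with the new camera!'))
```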
3.3. Vector space model
In order to categorize the social web videos, we use the Vector Space
Model (VSM) to compare two videos by using their textual feature
vectors. In this model, each video is represented as a vector in a
common vector space. The similarity between two videos is measured by
the OrS.
Definition 1 [49]. Term Frequency (TF). Suppose d is a document and t
is a term in d. TF(t, d) represents the frequency of a term in a
document:

TF(t, d) = f_{t,d},   (1)

where f_{t,d} is the frequency of term t in document d.
Definition 2 [49]. Inverse Document Frequency (IDF). Suppose D is a
document space and t is a term in D. IDF(t, D) is defined as follows:

IDF(t, D) = \log \frac{N}{1 + |\{ d \in D : t \in d \}|},   (2)

where N is the total number of documents in the corpus, and
|{ d ∈ D : t ∈ d }| is the number of documents in which term t appears.
Definition 3 [49]. Term Frequency-Inverse Document Frequency (TF-IDF).
Suppose D is a document space, d ∈ D, and t is a term in D. The Term
Frequency-Inverse Document Frequency (TF-IDF) of t to d in D is defined
as follows:

TF-IDF(t, d, D) = TF(t, d) \times IDF(t, D).   (3)
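The following minimal sketch implements Eqs. (1)-(3) directly, including the 1 added to the document frequency in Eq. (2); it only illustrates the definitions, not the full weighting scheme of the system.

```python
import math
from collections import Counter

def tf(term, doc):
    """Eq. (1): raw frequency of the term in one tokenized document."""
    return Counter(doc)[term]

def idf(term, docs):
    """Eq. (2): log of N over (1 + number of documents containing term)."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + n_containing))

def tf_idf(term, doc, docs):
    """Eq. (3): TF-IDF weight of a term for one document in the corpus."""
    return tf(term, doc) * idf(term, docs)

# Toy corpus of tokenized video descriptions.
docs = [
    ['funny', 'cat', 'video'],
    ['cat', 'compilation'],
    ['music', 'video'],
]
print(tf_idf('funny', docs[0], docs))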