Bilingual Printed Document Image Retrieval Based on SIFT Feature
Eksan Firkat¹, Abdusalam Dawut¹, Palidan Tuerxun¹, Askar Hamdulla²*
¹ School of Software, Xinjiang University, Urumqi 830046, P.R. China
² Institute of Information Science and Engineering, Xinjiang University, Urumqi 830046, P.R. China
*corresponding author’s email: askarhamdulla@sina.com
Abstract—
This paper presents a printed document retrieval system that can
retrieve Chinese and Uyghur keywords from printed document images.
We introduce the framework of the system and the processing steps
behind it, which can serve as a baseline for word-spotting-based
document retrieval systems. We also describe the extraction
algorithm, which uses the local SIFT descriptor to extract features
from images, and the Euclidean-distance-based matching algorithm
used to find the query word in printed document images. Some of the
ideas applied in this system may be helpful for other bilingual
printed document image retrieval systems.
Keywords- word spotting; SIFT; document
retrieval system
The world holds a huge amount of printed literature, which is a
valuable asset for study. Much of it, however, has not been
converted into a searchable format, which makes it difficult to use
well; searching for specific information, for example, is very time
consuming. The development of information retrieval approaches has
reduced this obstacle, but there are still challenges in providing
effective search mechanisms. Two main retrieval approaches have
been proposed so far.
The first is the Optical Character Recognition (OCR) approach,
which converts the printed document image into a text file. This
not only stores the content efficiently but also makes it
thoroughly searchable. However, when the image quality is poor or
the image is corrupted by noise, OCR does not work very well. As an
alternative to OCR, the keyword spotting (KWS) approach can
retrieve document images very effectively because it is independent
of the complex language structure and focuses on the features of
the given word. In this approach, spotting is done by matching
feature points between the query word and the printed document
images [1, 2].
This paper presents a word-spotting-based printed document
retrieval system that uses local descriptor information in the form
of SIFT [3]. The main steps of the proposed system are as follows.
First, the document image is segmented into words, which are the
basic units for matching, and SIFT features are extracted from
these matching units. Next, the features and location information
of the segmented words are stored as matrices in preparation for
the matching stage. During the retrieval phase, the query word is
rendered as an image and its SIFT features are extracted; a ratio
matching algorithm based on Euclidean distance is then used to
retrieve the printed document images. The most notable point of
this system is that it can retrieve both Chinese and Uyghur printed
document images. The details of this approach are described in the
following parts of this paper.
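The ratio matching step of the retrieval phase can be illustrated
with a minimal sketch. It is not the implementation used in this
system; it only assumes that each word image is already represented
by a list of descriptor vectors (128-dimensional for SIFT, shortened
to 2 dimensions in the usage note) and that a match is accepted when
the nearest descriptor is clearly closer than the second nearest.
The names ratio_match and rank_words, the layout of the index, and
the threshold 0.8 (the value suggested in [3]) are assumptions made
for illustration.

```python
import math


def ratio_match(query_desc, word_desc, ratio=0.8):
    """Return (query_index, word_index) pairs that pass the ratio test.

    query_desc, word_desc: lists of equal-length descriptor vectors.
    ratio: assumed threshold; not a value reported in this paper.
    """
    matches = []
    if len(word_desc) < 2:
        # The ratio test needs a nearest and a second-nearest neighbour.
        return matches
    for qi, q in enumerate(query_desc):
        # Euclidean distance from this query descriptor to every candidate.
        dists = sorted((math.dist(q, w), wi) for wi, w in enumerate(word_desc))
        (d1, best), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((qi, best))
    return matches


def rank_words(query_desc, index, ratio=0.8):
    """Rank stored word images by their number of accepted matches.

    index: list of (location, descriptors) pairs, where location is any
    hypothetical identifier of the word's position in a document image.
    Words without any accepted match are dropped.
    """
    scored = [(len(ratio_match(query_desc, desc, ratio)), loc)
              for loc, desc in index]
    scored.sort(key=lambda item: -item[0])
    return [loc for score, loc in scored if score > 0]
```

For example, with the toy query [[0.0, 0.0], [10.0, 10.0]], a stored
word whose descriptors are [[0.1, 0.0], [10.0, 10.1], [5.0, 5.0]]
yields two accepted matches and is ranked ahead of a word with only
one. Counting accepted matches is one simple way to score a
candidate word; other scoring rules are possible.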
The framework of the proposed system is described as follows. The
user inputs the query word, which is then rendered as an image in
preparation for the matching step. The printed document images are
segmented into clusters of word images, also in preparation for
matching. To carry out these steps, an extraction algorithm
extracts the local features of the printed document images and of
the query word image to build feature vector clusters. The
Euclidean-distance-based matching algorithm then retrieves the
target word and pinpoints its location, which satisfies the user's
purpose of retrieving the query word from the printed document
image. The flow diagram is shown in Figure 1: