Sketch Less for More: On-the-Fly Fine-Grained Sketch Based Image Retrieval

Ayan Kumar Bhunia¹  Yongxin Yang¹  Timothy M. Hospedales¹,²  Tao Xiang¹  Yi-Zhe Song¹
¹SketchX, CVSSP, University of Surrey, United Kingdom  ²University of Edinburgh, United Kingdom
{a.bhunia, yongxin.yang, t.xiang, y.song}@surrey.ac.uk, t.hospedales@ed.ac.uk

Abstract

Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of retrieving a particular photo instance given a user's query sketch. Its widespread applicability is however hindered by the fact that drawing a sketch takes time, and most people struggle to draw a complete and faithful sketch. In this paper, we reformulate the conventional FG-SBIR framework to tackle these challenges, with the ultimate goal of retrieving the target photo with the least number of strokes possible. We further propose an on-the-fly design that starts retrieving as soon as the user starts drawing. To accomplish this, we devise a reinforcement learning based cross-modal retrieval framework that directly optimizes the rank of the ground-truth photo over a complete sketch drawing episode. Additionally, we introduce a novel reward scheme that circumvents the problems related to irrelevant sketch strokes, and thus provides us with a more consistent rank list during retrieval. We achieve superior early-retrieval efficiency over state-of-the-art methods and alternative baselines on two publicly available fine-grained sketch retrieval datasets.

1. Introduction

Due to the rapid proliferation of touch-screen devices, the computer vision community has witnessed significant research progress in sketch-related computer vision problems [49, 41, 29, 6, 9, 4]. Among these methods, sketch-based image retrieval (SBIR) [4, 6, 9] has received particular attention due to its potential commercial applications. SBIR was initially posed as a category-level retrieval problem. However, it became apparent that the key advantage of sketch over text/tag-based retrieval was conveying fine-grained detail [10] – leading to a focus on fine-grained SBIR that aims to retrieve a particular photo within a gallery. Great progress has been made on FG-SBIR [49, 41, 29], but two barriers hinder its widespread adoption in practice – the time taken to draw a complete sketch, and the drawing skill shortage of the user. Firstly, while sketch can convey fine-grained appearance details more easily than text, drawing a complete sketch is slow compared to clicking a tag or typing a search keyword. Secondly, although state-of-the-art vision systems are good at recognising badly drawn sketches [36, 50], users who perceive themselves as someone who "can't sketch" worry about getting details wrong and receiving inaccurate results.

[Figure 1. Examples showing the potential of our framework, which can retrieve the target photo (top-5 list) using fewer strokes than the conventional baseline method (55% vs. 95% of the sketch in the examples shown).]

In this paper we break these barriers by taking a view of "less is more" and propose to tackle a new fine-grained SBIR problem that aims to retrieve the target photo with just a few strokes, as opposed to requiring the complete sketch. This problem assumes an "on-the-fly" setting, where retrieval is conducted at every stroke drawn. Figure 1 offers an illustrative example of our on-the-fly FG-SBIR framework. Due to stroke-by-stroke retrieval, and a framework optimised for few-stroke retrieval, users can usually "stop early" as soon as their goal is retrieved.
This makes sketch more comparable with traditional search methods in terms of the time taken to issue a query, and more accessible – those inexperienced at drawing can retrieve their queried photo based on the easiest/earliest strokes possible [1], while requiring fewer of the detailed strokes that are harder to draw correctly.

Solving this new problem is non-trivial. One might argue that we can directly feed incomplete sketches into off-the-shelf FG-SBIR frameworks [49, 36], perhaps also enhanced by including synthesised sketches in the training data. However, those frameworks are not fundamentally designed to handle incomplete sketches. This is particularly the case since most of them employ a triplet ranking framework where each triplet is treated as an independent training example, so they struggle to perform well across a whole range of sketch completion points. Also, the initial sketch strokes could correspond to many possible photos due to their highly abstracted nature, and are thus more likely to give a noisy gradient. Last, there is no specific mechanism that can guide an existing FG-SBIR model to retrieve the photo with minimal sketch strokes, leaving it struggling to perform well across a complete sketching episode during on-the-fly retrieval.

A novel on-the-fly FG-SBIR framework is proposed in this work. First and foremost, instead of the de facto choice of triplet networks that learn an embedding where sketch-photo pairs lie close, we introduce a new model design that optimizes the rank of the corresponding photo over a sketch drawing episode. Secondly, the model is optimised specifically to return the true match within a minimum number of strokes. Lastly, efforts are taken to mitigate the effect of misleading noisy strokes on obtaining a consistent photo ranking list as users add details towards the end of a sketch.

More concretely, we render the sketch at different time instants of drawing, and feed it through a deep embedding network to get a vector representation. While other SBIR frameworks [49, 36] use a triplet loss [45] to learn an embedding suited for comparing sketch and photo, we optimise the rank of the target photo with respect to a sketch query. By calculating the rank of the ground-truth photo at each time instant t and maximizing the sum of 1/rank_t over a complete sketching episode, we ensure that the correct photo is retrieved as early as possible. Since ranking is a non-differentiable operation, we use a Reinforcement Learning (RL) [16] based pipeline to achieve this goal. Representation learning is performed with knowledge of the whole sequence, as we optimize the reward non-myopically over the sketch drawing episode. This is unlike the triplet loss used for feature learning, which does not take into account the temporal nature of the sketch. We further introduce a global reward to guard against harmful noisy strokes, especially during later stages of sketching where details are typically added. This also stabilises the RL training process, and produces smoother retrieval results.

Our contributions can be summarised as follows: (a) We introduce a novel on-the-fly FG-SBIR framework trained using reinforcement learning to retrieve a photo using an incomplete sketch, and to do so with the minimum possible drawing. (b) To this end, we develop a novel reward scheme that models the early retrieval objective, as well as one based on Kendall-Tau [18] rank distance that takes into account the completeness of the sketch and the associated uncertainty. (c) Extensive experiments on two public datasets demonstrate the superiority of our framework.
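To make the two reward signals above concrete, the following minimal Python sketch computes an episode return from the per-step ranks: the early-retrieval term sums 1/rank_t over the episode, and a Kendall-Tau rank-distance term between consecutive rank lists stands in for the global consistency reward. The excerpt does not specify how the two terms are combined, so the subtractive form and the weight `lam` below are assumptions, not the paper's formulation.

```python
def early_retrieval_reward(ranks):
    """Sum of 1/rank_t over a sketching episode: the earlier the true
    photo climbs to a high rank, the larger the return."""
    return float(sum(1.0 / r for r in ranks))

def kendall_tau_distance(order_a, order_b):
    """Kendall-Tau distance between two rank lists of the same items,
    as a plain O(n^2) count of discordant pairs."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    discordant = 0
    for i in range(len(order_a)):
        for j in range(i + 1, len(order_a)):
            # order_a places item i before item j; discordant if order_b disagrees.
            if pos_b[order_a[i]] > pos_b[order_a[j]]:
                discordant += 1
    return discordant

def episode_reward(ranks, rank_lists, lam=0.1):
    """Assumed combination: early-retrieval reward minus a penalty on
    rank-list churn between consecutive steps (lam is a made-up weight)."""
    churn = sum(kendall_tau_distance(rank_lists[t - 1], rank_lists[t])
                for t in range(1, len(rank_lists)))
    return early_retrieval_reward(ranks) - lam * churn
```

Since the ranking operation itself is non-differentiable, a score of this kind can only serve as an RL reward (e.g., via policy gradients), which is exactly why the text adopts a reinforcement learning pipeline rather than a differentiable loss.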
2. Related Works

Category-level SBIR: Category-level sketch-photo retrieval is now well studied [3, 43, 2, 6, 5, 47, 39, 9, 24, 4, 23]. Contemporary research directions can be broadly classified into traditional SBIR, zero-shot SBIR and sketch-image hashing. In traditional SBIR [3, 5, 4, 2], object classes are common to both training and testing, whereas zero-shot SBIR [47, 6, 9, 24] asks models to generalise across disjoint training and testing classes in order to alleviate annotation costs. Sketch-image hashing [23, 39] aims to improve the computational cost of retrieval by embedding to binary hash-codes rather than continuous vectors.

While these SBIR works assume a single-step retrieval process, a recent study by Collomosse et al. [4] proposed an interactive SBIR framework. Given an initial sketch query, if the system is unable to retrieve the user's goal on the first try, it resorts to providing some relevant image clusters to the user. The user can then select an image cluster in order to disambiguate the search, based on which the system generates a new query sketch for the following iteration. This interaction continues until the user's goal is retrieved. This system used Sketch-RNN [13] for sketch query generation after every interaction. However, Sketch-RNN is acknowledged to be weak in multi-class sketch generation [13]. As a result, the generated sketches often diverge from the user's intent, leading to poor performance. Note that though such interaction through clusters is reasonable in the case of category-level retrieval, it is not applicable to our FG-SBIR task where all photos belong to a single class and differ in subtle ways only.

Fine-grained SBIR: FG-SBIR is a more recent addition to sketch analysis and is less studied compared to the category-level SBIR task. One of the first studies [20] addressed it by graph-matching of deformable-part models. A number of deep learning approaches subsequently emerged [49, 41, 30, 29]. Yu et al. [49] proposed a deep triplet-ranking model for instance-level FG-SBIR. This paradigm was subsequently improved through hybrid generative-discriminative cross-domain image generation [30], and by providing an attention mechanism for fine-grained details as well as more sophisticated triplet losses [41]. Recently, Pang et al. [29] studied cross-category FG-SBIR in analogy to the 'zero-shot' SBIR mentioned earlier. In this paper, we open up a new research direction by studying FG-SBIR framework design for on-the-fly and early photo retrieval.

Partial Sketch: One of the most popular areas for studying incomplete or partial data is image inpainting [48, 51]. Significant progress has been made in this area using contextual attention [48] and the Conditional Variational Autoencoder (CVAE) [51]. Following this direction, works have attempted to model partial sketch data [22, 13, 12].

[Figure 2. Illustration of the proposed on-the-fly framework's efficacy over a baseline FG-SBIR method [41, 49] trained with completed sketches only. For this particular example, our method needs only 30% of the complete sketch to include the true match in the top-10 rank list, compared to 80% for the baseline. Top-5 photo images retrieved by either framework are shown, in progressive sketch-rendering steps of 10%. The number at the bottom denotes the paired (true match) photo's rank at every stage.]
Sketch-RNN [13] learns to predict multiple possible endings of incomplete sketches using a Variational Autoencoder (VAE). While Sketch-RNN works on sequential pen coordinates, Liu et al. [22] extend conditional image-to-image translation to the rasterized sparse sketch domain for partial sketch completion, followed by an auxiliary sketch recognition task. Ghosh et al. [12] proposed an interactive sketch-to-image translation method, which completes an incomplete object outline and thereafter generates a final synthesised image. Overall, these works first try to complete the partial sketch by modelling a conditional distribution based on image-to-image translation, and subsequently focus on a specific task objective, be it sketch recognition or sketch-to-image generation. Unlike these two-stage inference frameworks, we focus on instance-level photo retrieval with a minimum number of sketch strokes, thus enabling partial sketch queries in a single step.

Reinforcement Learning in Vision: There has been significant progress in leveraging Reinforcement Learning (RL) [16] techniques in various computer vision problems [44, 14]. Vision applications benefiting from RL include visual relationship detection [21], automatic face aging [8], vision-language navigation [44] and 3D scene completion [14]. In terms of sketch analysis, RL was leveraged to study abstraction and summarisation by trading off between the recognisability of a sketch and the number of strokes [34, 27]. While those studies aimed to discover salient strokes by using RL to filter out unnecessary strokes from a given complete sketch, we focus on leveraging RL to retrieve a photo on-the-fly with a minimum number of strokes.

3. Methodology

Overview: Our objective is to design an 'on-the-fly' FG-SBIR framework, where we perform live analysis of the sketch as the user draws. The system should re-rank candidate photos based on the sketch information up to that instant and retrieve the target photo at the earliest stroke possible (see Figure 2 for an example of how the framework works in practice). To this end, we first pre-train a state-of-the-art FG-SBIR model [49, 41] using triplet loss. Thereafter, we keep the photo branch fixed, and fine-tune the sketch branch through a non-differentiable ranking based metric over complete sketch drawing episodes using reinforcement learning.

Formally, a pre-trained FG-SBIR model learns an embedding function F(·) : I → ℝ^D, mapping a rasterized sketch or photo I to a D-dimensional feature. Given a gallery of M photo images G = {X_i} (i = 1, …, M), we obtain a list of D-dimensional vectors Ĝ = {F(X_i)} using F(·). Now, for a given query sketch S and some pairwise distance metric, we obtain the top-q retrieved photos from G, denoted as Ret_q(F(S), Ĝ). If the ground-truth (paired) target photo appears in the top-q list, we consider top-q accuracy to be true for that sketch sample. Since we are dealing with on-the-fly retrieval, a sketch is represented as S = (p_1, p_2, p_3, …, p_N), where p_i denotes one sketch coordinate tuple (x, y), and N stands for the maximum number of points. We assume that there exists a sketch rendering operation ∅(·), which takes a list S_K of the first K coordinates in S and produces one rasterized sketch image. Our objective is to train the framework so that the ground-truth paired photo appears in Ret_q(F(∅(S_K)), Ĝ) with a minimum value of K.

[Figure 3. (a) A conventional FG-SBIR framework trained using triplet loss.
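To make the retrieval setup concrete, the sketch below computes the ground-truth photo's rank at each rendering step from pre-computed embeddings; this is the rank_t quantity on which the episode reward is built. The text leaves the pairwise distance metric unspecified, so Euclidean distance is an assumption here, and `sketch_feats`, `gallery_feats` and `gt_index` are hypothetical inputs standing for F(∅(S_K)) at T completion points, Ĝ, and the paired photo's gallery index.

```python
import numpy as np

def ranks_over_episode(sketch_feats, gallery_feats, gt_index):
    """sketch_feats: (T, D) partial-sketch embeddings, one per rendering step.
    gallery_feats: (M, D) photo embeddings (the list G-hat).
    Returns the 1-indexed rank of the true photo at each step."""
    ranks = []
    for feat in sketch_feats:
        dists = np.linalg.norm(gallery_feats - feat, axis=1)  # (M,) distances
        order = np.argsort(dists)                             # gallery indices, best first
        ranks.append(int(np.where(order == gt_index)[0][0]) + 1)
    return ranks

def top_q_hit(ranks, q=5):
    """Top-q accuracy flag at each step: True once the true photo is in Ret_q."""
    return [r <= q for r in ranks]
```

Training then seeks the smallest K at which `top_q_hit` turns True, which is exactly the "minimum value of K" objective stated above.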
(b) Our proposed reinforcement learning based framework, which takes into account a complete sketch rendering episode. Key locks signify that particular weights are fixed during RL training.]

3.1. Background: Base Models

For pre-training, we use a state-of-the-art Siamese network [41] with three CNN branches with shared weights, corresponding to a query sketch, a positive photo and a negative photo respectively (see Figure 3(a)). Following recent state-of-the-art sketch feature extraction pipelines [6, 41], we use soft spatial attention [46] to focus on salient parts of the feature map. Our baseline model consists of three specific modules: (a) f_θ is initialised from pre-trained InceptionV3 [42] weights, (b) f_att is modelled using a 1×1 convolution followed by a softmax operation, and (c) g_φ is a final fully-connected layer with l2 normalisation to obtain an embedding of size D. Given a feature map B = f_θ(I), the output of the attention module is computed by B_att = B + B · f_att(B). Glo…
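A minimal PyTorch sketch of one such branch is given below, following the formula B_att = B + B · f_att(B) with f_att as a 1×1 convolution and a softmax over spatial locations. The paper initialises f_θ from InceptionV3; a ResNet-50 trunk stands in here purely to keep the example short, the embedding size D = 64 is an assumption, and since the excerpt cuts off mid-sentence ("Glo…"), the pooling of the attended map into a vector is likewise an assumed step.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class AttentionEmbedding(nn.Module):
    """One branch of the Siamese network: backbone f_theta, soft spatial
    attention f_att, and an l2-normalised embedding head g_phi."""

    def __init__(self, embed_dim=64):
        super().__init__()
        resnet = models.resnet50(weights=None)  # stand-in for InceptionV3
        self.f_theta = nn.Sequential(*list(resnet.children())[:-2])  # (N, 2048, H, W) maps
        self.f_att = nn.Conv2d(2048, 1, kernel_size=1)  # 1x1 conv attention scores
        self.g_phi = nn.Linear(2048, embed_dim)         # final FC layer

    def forward(self, x):
        B = self.f_theta(x)                     # feature map B = f_theta(I)
        n, _, h, w = B.shape
        # Softmax over all spatial positions yields the attention map.
        att = F.softmax(self.f_att(B).flatten(2), dim=-1).view(n, 1, h, w)
        B_att = B + B * att                     # B_att = B + B . f_att(B)
        feat = B_att.flatten(2).sum(dim=-1)     # assumed pooling to (N, 2048)
        return F.normalize(self.g_phi(feat), p=2, dim=-1)  # l2-normalised, size D
```

During triplet pre-training, all three branches share these weights; in the subsequent RL stage, per the text, the photo branch is kept fixed and only the sketch branch is fine-tuned.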