1520-9210 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMM.2018.2885235, IEEE
Transactions on Multimedia
Image Generation. The early works in image generation
were mostly limited to simple texture synthesis based on hand-
crafted features [50]. Recent approaches for generating re-
alistic images mainly employ two kinds of models, i.e., the
variational autoencoder (VAE) [6] and GAN [5]. VAE is a
well-known approach for capturing complicated distributions
and has been widely applied in various generative tasks.
Gregor et al. [51] establish a sequential generation model for
generating realistic images that takes advantage of both VAE
and recurrent neural networks with an attention mechanism.
Yan et al. [20] propose a layered generative model built on
the conditional VAE. GAN is another fairly popular gener-
ative model that has been exploited by many recent works.
Some works have focused on the architecture of the original
GAN approach and have provided better solutions for image
generation [52], [53], [54], [55]. Instead of utilizing only
noise vectors as inputs, the CGAN [26] approach adds extra
information along with the noise to constrain the generation
process and produces impressive results. Based on CGAN,
many works [24], [25] have proposed solving the problems of
image-to-image translation and style translation. Our genera-
tive model also takes inspiration from the CGAN and image-
to-image models in that we take landmarks as the structure
conditions and a reference image as the appearance condition,
and a generative model is trained to produce new object poses
corresponding to the given conditions.
Video Generation. Another topic closely related to our
problem is video generation and prediction. The early works
in this space have been mostly based on video texture meth-
ods [56], [57], [58]. These methods utilize an input reference
video and aim to generate periodic motion sequences. Guided
by ideas from neuroscience, Lotter et al. [11] propose a
predictive neural network for video generation. Finn et al. [8]
show another approach to video prediction conditioned on
action cues that focuses on pixel-level movement. Mathieu et
al. [12] exploit multi-scale architecture to solve the problem of
deformation in future predictions. Instead of predicting future
information at the pixel level, Van et al. [59] attempt to obtain
the transformations between frames given the observed videos.
GANs and VAEs have also been utilized in video generation, such as
in [9], [60], [61]. A recent work [62] disentangles the process
of learning pose features from content features in videos and
can synthesize high-quality videos. However, these methods
usually suffer from two main issues: 1) the foreground objects
are easily deformed during the generation process, and 2) the
consistency between adjacent frames
is hard to maintain. These problems suggest that strong motion
constraints need to be considered for generating more realistic
videos. Therefore, landmark information is employed to guide
the motion generation in our proposed model. Recently, some
other contemporary works [19], [63] also propose employing
landmark information for motion prediction.
III. METHOD
We formulate our problem as follows. Given an appearance
reference image I that specifies the appearance information
for both foreground objects and the background, we aim to
generate a sequence of frames $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_T\}$ containing
the same object moving against the background according to
a specific motion pattern c. In other words, our task is to
generate motion sequences from a single static image. Given
very limited information (I and c), the main challenge is how
to find a good solution in the extremely high-dimensional video
space. To address this issue, we propose a two-step framework
to improve the generation quality. First, we introduce an
action-conditioned landmark generation network, which aims
to generate a sequence of object landmarks $\hat{S} = \{\hat{s}_1, \dots, \hat{s}_T\}$
corresponding to the specified motion pattern c. This network
is described in detail in Section III-B. The landmark sequence
$\hat{S}$ is a good low-dimensional representation of foreground ob-
ject structures and thus provides strong guidance information
for video generation. Therefore, we propose an appearance-
preserving motion generation network in the second step. This
network takes reference image I and object landmarks
ˆ
S
as appearance and motion constraints, respectively, and will
thus generate high-quality motion sequences. The details of
this network are described in Section III-C. Specifically, we
employ a VAE model and a GAN model in these two steps,
respectively.
In the following, we first briefly revisit generative adversarial
networks and then introduce our two-step model.
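Before turning to the details, the overall two-step pipeline can be sketched as follows. This is only an illustrative outline, not the paper's implementation; `landmark_net` and `motion_net` are hypothetical stand-ins for the networks of Sections III-B and III-C:

```python
def generate_video(reference_image, motion_pattern, T, landmark_net, motion_net):
    """Two-step generation: landmarks first (Sec. III-B), frames second (Sec. III-C)."""
    # Step 1: action-conditioned landmark generation, driven by the
    # motion pattern c and the time step.
    landmarks = [landmark_net(motion_pattern, t) for t in range(T)]
    # Step 2: appearance-preserving motion generation, conditioned on
    # the reference image I and on each generated landmark.
    frames = [motion_net(reference_image, s) for s in landmarks]
    return landmarks, frames
```

The key design choice is that frames are generated from the low-dimensional landmark sequence rather than predicted directly in pixel space.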
A. Generative Adversarial Networks
The GAN [5] approach was explicitly designed to address
data generation tasks. Specifically, GAN models consist of two
parts, i.e., a generator network G and a discriminator network
D. The discriminator aims to distinguish real samples from
synthesized ones, while the generator aims to generate data
that are close to the real distribution to fool the discriminator.
In this way, G and D form a min-max game, and the objective
function is defined as follows:
\min_G \max_D \mathcal{L}_{GAN} = \mathbb{E}_{x \sim p_r(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))],   (1)
where $p_r$ and $p(z)$ are the distribution of the real data and the
zero-mean Gaussian distribution $\mathcal{N}(0, 1)$, respectively.
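As a concrete check of Eq. (1), the objective can be estimated by Monte-Carlo sampling. The 1-D generator and discriminator below are toy stand-ins chosen only to make the two expectation terms explicit; they are not the models used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setup: real data from N(2, 0.5^2), noise from p(z) = N(0, 1).
real = rng.normal(2.0, 0.5, size=10_000)  # samples from p_r(x)
z = rng.normal(0.0, 1.0, size=10_000)     # samples from p(z)

def G(z):
    # Toy generator: an affine map of the noise.
    return 1.5 * z + 1.0

def D(x):
    # Toy discriminator: a sigmoid score in (0, 1).
    return 1.0 / (1.0 + np.exp(-(x - 1.0)))

# Monte-Carlo estimate of the objective in Eq. (1):
# E_{x~p_r}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]
loss = np.mean(np.log(D(real))) + np.mean(np.log(1.0 - D(G(z))))
```

In training, D ascends this objective while G descends it, which is the min-max game described above.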
However, the original GAN is unstable and difficult to
train because of the use of the Jensen-Shannon (JS) divergence
in its loss function. Under an (approximately) optimal D,
minimizing the loss of G is equivalent to minimizing the JS
divergence between the real distribution $p_r$ and the generating
distribution $p_g$. Because a non-negligible overlap between
$p_r$ and $p_g$ is almost impossible, the JS divergence is almost
always the constant $\log 2$, which eventually makes the gradient
of the generator approach 0. WGAN [64] was proposed to
improve the stability of the model by replacing the JS divergence
with the Wasserstein distance, which remains continuous under
a Lipschitz-constrained critic and thus solves the gradient
vanishing problem. Based on the Wasserstein distance, the loss
function of G is defined as follows:
\mathcal{L}_{WGAN}(G) = -\mathbb{E}_{x \sim P_g}[D(x)]   (2)
and the loss function of D is defined as follows:
\mathcal{L}_{WGAN}(D) = \mathbb{E}_{x \sim P_g}[D(x)] - \mathbb{E}_{x \sim P_r}[D(x)].   (3)
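To make Eqs. (2) and (3) concrete, both losses can be estimated from samples. The fixed linear critic below is a toy 1-Lipschitz function (slope 0.5) standing in for the constrained discriminator; the distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

fake = rng.normal(0.0, 1.0, size=10_000)  # samples from P_g
real = rng.normal(2.0, 0.5, size=10_000)  # samples from P_r

def critic(x):
    # Toy critic with Lipschitz constant 0.5 (unbounded output, no sigmoid).
    return 0.5 * x

# Eq. (2): generator loss, L_WGAN(G) = -E_{x~P_g}[D(x)].
loss_G = -np.mean(critic(fake))

# Eq. (3): critic loss, L_WGAN(D) = E_{x~P_g}[D(x)] - E_{x~P_r}[D(x)];
# minimizing it pushes the critic's scores apart on real and fake samples.
loss_D = np.mean(critic(fake)) - np.mean(critic(real))
```

Because the critic's output is not squashed through a sigmoid, its gradients do not saturate, which is the practical source of WGAN's improved stability.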