"深度学习与多模态技术综述：模态介绍、架构探究与资源利用"

需积分: 2 154 浏览量更新于2024-04-10 收藏 39.16MB PDF 举报

Multimodal Deep Learning is a cutting-edge technology that combines multiple modalities such as natural language processing (NLP) and computer vision (CV) to improve the performance of various tasks. This booklet serves as a comprehensive overview of the current state-of-the-art in NLP, CV, and multimodal architectures. The introduction provides a brief overview of Multimodal Deep Learning, highlighting its importance and applications in various fields. The outline of the booklet is also presented, setting the stage for the subsequent discussions on different modalities. The section on modalities delves into the latest advancements in NLP and CV. It showcases the state-of-the-art techniques and models used in these domains, highlighting their strengths and limitations. Additionally, it discusses the resources and benchmarks available for NLP, CV, and multimodal tasks, providing a holistic view of the field. The core of the booklet focuses on multimodal architectures, specifically Image2Text models. These architectures leverage the strengths of both NLP and CV to perform complex tasks such as image captioning, visual question answering, and multimodal sentiment analysis. The section explores the various architectures and frameworks used in Multimodal Deep Learning, providing insights into their design principles and performance metrics. Overall, this booklet serves as a valuable resource for researchers, practitioners, and enthusiasts interested in Multimodal Deep Learning. It offers a comprehensive overview of the current trends and advancements in the field, showcasing the potential of combining multiple modalities to enhance the performance of AI systems. With its in-depth analysis and insightful discussions, this booklet lays the foundation for future research and innovation in the exciting field of Multimodal Deep Learning.

10 2 Introducing the modalities

2.1.1 Introduction

Natural Language Processing (NLP) exists for about 50 years, but it is more

relevant than ever. There have been several breakthroughs in this branch

of machine learning that is concerned with spoken and written language. In

this work, the most inﬂuential ones of the last decade are going to be pre-

sented. Starting with word embeddings, which eﬃciently model word semantics.

Encoder-decoder architectures represent another step forward by making mini-

mal assumptions about the sequence structure. Next, the attention mechanism

allows human-like focus shifting to put more emphasis on more relevant parts.

Then, the transformer applies attention in its architecture to process the

data non-sequentially, which boosts the performance on language tasks to

exceptional levels. At last, the most inﬂuential transformer architectures are

recognized before a few current topics in natural language processing are

discussed.

2.1.2 Word Embeddings

As mentioned in the introduction, one of the earlier advances in NLP is

learning word internal representations. Before that, a big problem with text

modelling was its messiness, while machine learning algorithms undoubtedly

prefer structured and well-deﬁned ﬁxed-length inputs. On a granular level, the

models rather work with numerical than textual data. Thus, by using very

basic techniques like one-hot encoding or bag-of-words, a text is converted

into its equivalent vector of numbers without losing information.

In the example depicting one-hot encoding (see Figure 2.1), there are ten

simple words and the dark squares indicate the only index with a non-zero

value.

FIGURE 2.1:

Ten one-hot encoded words (Source: Pilehvar and Camacho-

Collados (2021))

In contrast, there are multiple non-zero values while using bag-of-words, which

is another way of extracting features from text to use in modelling where we

measure if a word is present from a vocabulary of known words. It is called

2.1 State-of-the-art in NLP 11

bag-of-words because the order is disregarded here.

Treating words as atomic units has some plausible reasons, like robustness and

simplicity. It was even argued that simple models on a huge amount of data

outperform complex models trained on less data. However, simple techniques

are problematic for many tasks, e.g. when it comes to relevant in-domain data

for automatic speech recognition. The size of high-quality transcribed speech

data is often limited to just millions of words, so simply scaling up simpler

models is not possible in certain situations and therefore more advanced

techniques are needed. Additionally, thanks to the progress of machine

learning techniques, it is realistic to train more complex models on massive

amounts of data. Logically, more complex models generally outperform basic

ones. Other disadvantages of classic word representations are described by the

curse of dimensionality and the generalization problem. The former becomes a

problem due to the growing vocabulary equivalently increasing the feature

size. This results in sparse and high-dimensional vectors. The latter occurs

because the similarity between words is not captured. Therefore, previously

learned information cannot be used. Besides, assigning a distinct vector to

each word is a limitation, which becomes especially obvious for languages with

large vocabularies and many rare words.

To combat the downfalls of simple word representations, word embeddings

enable to use eﬃcient and dense representations in which similar words have a

similar encoding. So words that are closer in the vector space are expected to

be similar in meaning. An embedding is hereby deﬁned as a vector of ﬂoating

point values (with the length of the vector being a hyperparameter). The

values for the embedding are trainable parameters which are learned similarly

to a model learning the weights for a dense layer. The dimensionality of the

word representations is typically much smaller than the number of words in

the dictionary. For example, Mikolov et al. (2013a) called dimensions between

50-100 modest for more than a few hundred million words. For small data

sets, dimensionality for the word vectors could start at 8 and go up to 1024

for larger data sets. It is expected that higher dimensions can rather pick up

intricate relationships between words if given enough data to learn from.

For any NLP tasks, it is sensible to start with word embeddings because

it allows to conveniently incorporate prior knowledge into the model and

can be seen as a basic form of transfer learning. It is important to note

that even though embeddings attempt to represent the meaning of words

and do that to an extent, the semantics of the word in a given context

cannot be captured. This is due to the words having static precomputed

representations in traditional embedding techniques. Thus, the word

"bank" can either refer to a ﬁnancial institution or a river bank. Contex-

2.1 State-of-the-art in NLP 13

Another way of using vector representations of words is in the ﬁeld of transla-

tions. It has been presented that relations can be drawn from feature spaces of

diﬀerent languages. In below, the distributed word representations of numbers

between English and Spanish are compared. In this case, the same numbers

have similar geometric arrangements, which suggests that mapping linearly

between vector spaces of languages is feasible. Applying this simple method

for a larger set of translations in English and Spanish led to remarkable results

- achieving almost 90 % precision.

FIGURE 2.4:

Representations of numbers in English and Spanish (Source:

Mikolov et al. (2013c)).

This technique was then used for other experiments. One use case is the

detection of dictionary errors. Taking translations from a dictionary and

computing their geometric distance returns a conﬁdence measure. Closely

evaluating the translations with low conﬁdence and outputting an alternative

(one that is closest in vector space) results in a plain way to assess dictionary

translations. Furthermore, training the word embeddings on a large corpora

makes it possible to give sensible out-of-dictionary predictions for words.

This was tested by randomly removing a part of the vocabulary before.

Taking a look at the predictions revealed that they were often to some

extent related to the translations with regard to meaning and semantics.

Despite the accomplishments in other tasks, translations between distant

languages exposed shortcomings of word embeddings. For example, the

accuracy for translations between English and Vietnamese seemed signif-

icantly lower. This can be ascribed to both languages not having a good

one-to-one correspondence because the concept of a word is diﬀerent than

in English. In addition, the used Vietnamese model contains numerous syn-

onyms, which complicates making exact predictions (see Mikolov et al. (2013c)).

Turning the attention to one of the most impactful embedding techniques,

word2vec. It was proposed by Mikolov et al. (2013a) and is not a singular

algorithm. It can rather be seen as a family of model architectures and op-

timizations to learn word representations. Word2vec’s popularity also stems

from its success on multiple downstream natural language processing tasks.

剩余271页未读，继续阅读

T1.Faker

粉丝: 2w+
资源: 9

"深度学习与多模态技术综述：模态介绍、架构探究与资源利用"

Deep Learning in Medical Image Analysis and Multimodal Learning

深度学习Deep Learning

Recent Advances and Trends in Multimodal Deep Learning A Re

a hybrid method for traffic flow forecasting using multimodal deep learning

Deep Learning in Medical Image Analysis and Multimodal Learning最新智能医疗专著

Multimodal tag localization based on deep learning

Deep Learning and Multimodal Fusion of 3D Point Cloud

CMU11-777 multimodal machine learning Fall 2019讲义

deep multimodal learning a survey on recent advances and trends

exploration of deep learning-based multimodal fusion for semantic road scene

最新资源