efficient parallelism to enhance performance. Importantly, Large Vision Models
extend their transformative capabilities to fundamental computer vision tasks
beyond classification. A significant breakthrough in the segmentation task has
been achieved with the Segment Anything (SAM) model (Kirillov et al., 2023).
SAM comprises a ViT-H image encoder, a prompt encoder, and a transformer-
based mask decoder, which predicts object masks. SAM’s remarkable zero-shot
generalization ability enables it to segment previously unseen objects and im-
ages. To train SAM, the authors constructed SA-1B, the largest segmentation dataset to date with over 1 billion masks, a notable milestone in this field.
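As a concrete illustration of this pipeline, the sketch below shows prompt-based segmentation through the publicly released segment_anything package; the checkpoint path, input image, and prompt point are placeholder assumptions, not values from the paper.

```python
# Minimal sketch of prompt-based segmentation with SAM using the
# released `segment_anything` package; paths and prompts are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H variant (image encoder + prompt encoder + mask decoder)
# from a downloaded checkpoint (placeholder path).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Encode the image once; prompts can then be decoded interactively.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground point prompt (x, y) with label 1 ("foreground").
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks
)
print(masks.shape, scores)  # boolean masks and their predicted quality scores
```

Because the image embedding is computed once and reused across prompts, the mask decoder can respond to new points or boxes interactively, which is what enables SAM's promptable, zero-shot use.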
Large multi-modal models, such as Large Vision-Language Models (LVLMs),
have shown remarkable success in various tasks, expanding their influence into
the realm of vision-language understanding (Gan et al., 2022). This success has spawned a line of research dedicated to exploring the potential of LVLMs, with a focus on both contrastive learning (Radford et al., 2021; Dai et al., 2021; Jia et al., 2021; Li et al., 2022) and generative modeling (Driess et al., 2023; Alayrac et al., 2022; Wang et al., 2022; Liu et al., 2023). Remarkably, Liu et al. (2023) demonstrated
that LVLMs exhibit exceptional zero-shot Optical Character Recognition (OCR)
performance without explicit training on OCR-specific data. This finding un-
derscores the critical importance of understanding the capabilities of LVLMs
in handling text-related visual tasks, considering their unique ability to extract
contextual information from various data sources, including text and images.
One noteworthy example of a generative pre-trained LVLM is GPT-4 (OpenAI,
2023), which has showcased exceptional visual comprehension and reasoning
abilities. While GPT-4 has achieved near-human performance on professional
and academic benchmarks, detailed technical specifications of the model remain
undisclosed. However, the primary focus of this discussion is a specific category of large multi-modal models: Large Vision-Language Models (LVLMs), which venture beyond vision-only or language-only modeling. Typically, LVLMs employ a dual-stream architecture, where input text and images undergo separate encoding processes
to extract relevant features. For representation learning, the features from dif-
ferent modalities are either aligned through contrastive learning (Radford et al., 2021; Chen et al., 2021) or fused into a unified representation using an additional encoder (Goswami et al., 2022; Wang et al., 2023). The entire model, encompassing both unimodal and multimodal encoders, undergoes
pre-training on large-scale image-text datasets and is subsequently fine-tuned
for specific tasks or used for zero-shot tasks without further fine-tuning. Pre-
training objectives may involve a combination of multi-modal and unimodal
tasks, with common multi-modal tasks encompassing image-text contrastive
learning, image-text matching, autoregressive modeling, masked modeling, and
image-grounded text generation.
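To make the dual-stream design and the image-text contrastive objective concrete, the following is a minimal sketch; the encoders, embedding dimension, and temperature are illustrative assumptions rather than the configuration of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamContrastiveModel(nn.Module):
    """Toy dual-stream model: separate image and text encoders whose outputs
    are aligned with a symmetric image-text contrastive (InfoNCE) loss."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 embed_dim: int = 512, temperature: float = 0.07):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a ViT backbone (assumed)
        self.text_encoder = text_encoder    # e.g. a transformer LM (assumed)
        self.image_proj = nn.LazyLinear(embed_dim)
        self.text_proj = nn.LazyLinear(embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / temperature).log())

    def forward(self, images: torch.Tensor, texts: torch.Tensor) -> torch.Tensor:
        # Encode each modality in its own stream (assumed to return one
        # feature vector per sample), then project into a shared space.
        img_emb = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt_emb = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)

        # Similarity matrix between every image-text pair in the batch;
        # the diagonal entries are the matching pairs.
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        targets = torch.arange(len(images), device=images.device)

        # Symmetric contrastive loss: match images to texts and texts to images.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
```

In practice the other objectives listed above (image-text matching, masked modeling, image-grounded text generation) are typically added as further loss terms on top of, or in place of, this contrastive term.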
Recent studies suggest that scaling up unimodal encoders and engaging in
multi-objective pre-training across both uni- and multi-modalities can signifi-
cantly enhance multi-modal representation learning. LVLMs have recently made
substantial progress in text-to-image generation, employing two main method-