At the start of this stage, SAM was trained using common public segmentation datasets. After sufficient data annotation, SAM was retrained using only newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other architectural details evolved; in total we retrained our model 6 times. Average annotation time per mask decreased from 34 to 14 seconds as the model improved. We note that 14 seconds is 6.5× faster than mask annotation for COCO [66] and only 2× slower than bounding-box labeling with extreme points [76, 71]. As SAM improved, the average number of masks per image increased from 20 to 44 masks. Overall, we collected 4.3M masks from 120k images in this stage.
Semi-automatic stage. In this stage, we aimed to increase the diversity of masks in order to improve our model's ability to segment anything. To focus annotators on less prominent objects, we first automatically detected confident masks. Then we presented annotators with images prefilled with these masks and asked them to annotate any additional unannotated objects. To detect confident masks, we trained a bounding box detector [84] on all first stage masks using a generic "object" category. During this stage we collected an additional 5.9M masks in 180k images (for a total of 10.2M masks). As in the first stage, we periodically retrained our model on newly collected data (5 times). Average annotation time per mask went back up to 34 seconds (excluding the automatic masks) as these objects were more challenging to label. The average number of masks per image went from 44 to 72 masks (including the automatic masks).
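The detector specifics are deferred to [84]; purely as an illustration of how first-stage masks can be turned into class-agnostic box targets for such a detector, a minimal NumPy sketch (the helper names and target format are our own assumptions, not the authors' code) is:

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> list:
    """Convert a binary mask (H, W) to an [x0, y0, x1, y1] box (assumes a non-empty mask)."""
    ys, xs = np.nonzero(mask)
    return [float(xs.min()), float(ys.min()), float(xs.max()) + 1.0, float(ys.max()) + 1.0]

def masks_to_detection_targets(masks):
    """Every first-stage mask becomes one box with the single generic "object" class (label 0)."""
    boxes = np.array([mask_to_box(m) for m in masks], dtype=np.float32)
    labels = np.zeros(len(masks), dtype=np.int64)  # one class: "object"
    return {"boxes": boxes, "labels": labels}
```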
Fully automatic stage. In the final stage, annotation was fully automatic. This was feasible due to two major enhancements to our model. First, at the start of this stage, we had collected enough masks to greatly improve the model, including the diverse masks from the previous stage. Second, by this stage we had developed the ambiguity-aware model, which allowed us to predict valid masks even in ambiguous cases. Specifically, we prompted the model with a 32×32 regular grid of points and for each point predicted a set of masks that may correspond to valid objects. With the ambiguity-aware model, if a point lies on a part or subpart, our model will return the subpart, part, and whole object. The IoU prediction module of our model is used to select confident masks; moreover, we identified and selected only stable masks (we consider a mask stable if thresholding the probability map at 0.5 − δ and 0.5 + δ results in similar masks). Finally, after selecting the confident and stable masks, we applied non-maximal suppression (NMS) to filter duplicates. To further improve the quality of smaller masks, we also processed multiple overlapping zoomed-in image crops. For further details of this stage, see §B. We applied fully automatic mask generation to all 11M images in our dataset, producing a total of 1.1B high-quality masks.
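To make the selection criteria concrete, here is a rough sketch of the per-mask filtering (our own illustrative code, not the released implementation; the threshold values and the stability offset δ are placeholder assumptions):

```python
import numpy as np

def stability_score(prob: np.ndarray, delta: float = 0.05) -> float:
    """IoU between the binary masks obtained by thresholding the predicted
    probability map at 0.5 - delta and at 0.5 + delta."""
    loose = prob > 0.5 - delta   # permissive threshold
    tight = prob > 0.5 + delta   # strict threshold
    inter = np.logical_and(loose, tight).sum()
    union = np.logical_or(loose, tight).sum()
    return float(inter) / max(float(union), 1.0)

def keep_mask(prob: np.ndarray, predicted_iou: float,
              iou_thresh: float = 0.88, stability_thresh: float = 0.95) -> bool:
    """Keep a candidate mask only if the model's own IoU prediction is high
    (confidence) and the mask barely changes under threshold perturbation
    (stability). Threshold values here are placeholders."""
    return predicted_iou >= iou_thresh and stability_score(prob) >= stability_thresh
```

Candidate masks that survive this per-point filter across the full grid are then deduplicated with NMS, as described above.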
We describe and analyze the resulting dataset, SA-1B, next.
Figure 5: Image-size normalized mask center distributions.
5. Segment Anything Dataset
Our dataset, SA-1B, consists of 11M diverse, high-resolution, licensed, and privacy-protecting images and 1.1B high-quality segmentation masks collected with our data engine. We compare SA-1B with existing datasets and analyze mask quality and properties. We are releasing SA-1B to aid future development of foundation models for computer vision. We note that SA-1B will be released under a favorable license agreement for certain research uses and with protections for researchers.
Images. We licensed a new set of 11M images from a provider that works directly with photographers. These images are high resolution (3300×4950 pixels on average), and the resulting data size can present accessibility and storage challenges. Therefore, we are releasing downsampled images with their shortest side set to 1500 pixels. Even after downsampling, our images are significantly higher resolution than many existing vision datasets (e.g., COCO [66] images are ∼480×640 pixels). Note that most models today operate on much lower resolution inputs. Faces and vehicle license plates have been blurred in the released images.
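The shortest-side downsampling is a standard resize; a minimal sketch using Pillow (our own helper, not a release script) would be:

```python
from PIL import Image

def resize_shortest_side(img: Image.Image, target: int = 1500) -> Image.Image:
    """Downsample so the shortest side equals `target`, preserving aspect ratio.
    Images already at or below the target are returned unchanged."""
    w, h = img.size
    short = min(w, h)
    if short <= target:
        return img
    scale = target / short
    return img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
```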
Masks. Our data engine produced 1.1B masks, 99.1% of which were generated fully automatically. Therefore, the quality of the automatic masks is centrally important. We compare them directly to professional annotations and look at how various mask properties compare to prominent segmentation datasets. Our main conclusion, as borne out in the analysis below and the experiments in §7, is that our automatic masks are high quality and effective for training models. Motivated by these findings, SA-1B only includes automatically generated masks.
Mask quality. To estimate mask quality, we randomly sampled 500 images (∼50k masks) and asked our professional annotators to improve the quality of all masks in these images. Annotators did so using our model and pixel-precise "brush" and "eraser" editing tools. This procedure resulted in pairs of automatically predicted and professionally corrected masks. We computed IoU between each pair and found that 94% of pairs have greater than 90% IoU (and 97% of pairs have greater than 75% IoU). For comparison, prior work estimates inter-annotator consistency at 85–91% IoU [44, 60]. Our experiments in §7 confirm by human ratings that mask quality is high relative to a variety of datasets and that training our model on automatic masks is nearly as good as using all masks produced by the data engine.
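The pairwise agreement numbers above reduce to a standard mask-IoU computation over (automatic, corrected) mask pairs; a minimal sketch (our own code, assuming boolean mask arrays of equal shape) is:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / max(float(union), 1.0)

def fraction_above(pairs, thresh):
    """Fraction of (automatic, corrected) pairs whose IoU exceeds `thresh`,
    e.g. the 90% and 75% cutoffs reported above."""
    ious = [mask_iou(auto, fixed) for auto, fixed in pairs]
    return sum(iou > thresh for iou in ious) / len(ious)
```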