At the start of this stage, SAM was trained using common public segmentation datasets. After sufficient data annotation, SAM was retrained using only newly annotated masks. As more masks were collected, the image encoder was scaled from ViT-B to ViT-H and other architectural details evolved; in total we retrained our model 6 times. Average annotation time per mask decreased from 34 to 14 seconds as the model improved. We note that 14 seconds is 6.5× faster than mask annotation for COCO [66] and only 2× slower than bounding-box labeling with extreme points [76, 71]. As SAM improved, the average number of masks per image increased from 20 to 44 masks. Overall, we collected 4.3M masks from 120k images in this stage.
Semi-automatic stage. In this stage, we aimed to increase the diversity of masks in order to improve our model's ability to segment anything. To focus annotators on less prominent objects, we first automatically detected confident masks. Then we presented annotators with images prefilled with these masks and asked them to annotate any additional unannotated objects. To detect confident masks, we trained a bounding box detector [84] on all first stage masks using a generic "object" category. During this stage we collected an additional 5.9M masks in 180k images (for a total of 10.2M masks). As in the first stage, we periodically retrained our model on newly collected data (5 times). Average annotation time per mask went back up to 34 seconds (excluding the automatic masks) as these objects were more challenging to label. The average number of masks per image went from 44 to 72 masks (including the automatic masks).
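The detector specifics are deferred to [84]; purely as an illustration of how first-stage masks can be turned into class-agnostic box targets for such a detector, a minimal NumPy sketch (the helper names and target format are our own assumptions, not the authors' code) is:

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> list:
    """Convert a binary mask (H, W) to an [x0, y0, x1, y1] box (assumes a non-empty mask)."""
    ys, xs = np.nonzero(mask)
    return [float(xs.min()), float(ys.min()), float(xs.max()) + 1.0, float(ys.max()) + 1.0]

def masks_to_detection_targets(masks):
    """Every first-stage mask becomes one box with the single generic "object" class (label 0)."""
    boxes = np.array([mask_to_box(m) for m in masks], dtype=np.float32)
    labels = np.zeros(len(masks), dtype=np.int64)  # one class: "object"
    return {"boxes": boxes, "labels": labels}
```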
Fully automatic stage. In the final stage, annotation was fully automatic. This was feasible due to two major enhancements to our model. First, at the start of this stage, we had collected enough masks to greatly improve the model, including the diverse masks from the previous stage. Second, by this stage we had developed the ambiguity-aware model, which allowed us to predict valid masks even in ambiguous cases. Specifically, we prompted the model with a 32×32 regular grid of points and for each point predicted a set of masks that may correspond to valid objects. With the ambiguity-aware model, if a point lies on a part or subpart, our model will return the subpart, part, and whole object. The IoU prediction module of our model is used to select confident masks; moreover, we identified and selected only stable masks (we consider a mask stable if thresholding the probability map at 0.5 − δ and 0.5 + δ results in similar masks). Finally, after selecting the confident and stable masks, we applied non-maximal suppression (NMS) to filter duplicates. To further improve the quality of smaller masks, we also processed multiple overlapping zoomed-in image crops. For further details of this stage, see §B. We applied fully automatic mask generation to all 11M images in our dataset, producing a total of 1.1B high-quality masks.
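To make the selection criteria concrete, here is a rough sketch of the per-mask filtering (our own illustrative code, not the released implementation; the threshold values and the stability offset δ are placeholder assumptions):

```python
import numpy as np

def stability_score(prob: np.ndarray, delta: float = 0.05) -> float:
    """IoU between the binary masks obtained by thresholding the predicted
    probability map at 0.5 - delta and at 0.5 + delta."""
    loose = prob > 0.5 - delta   # permissive threshold
    tight = prob > 0.5 + delta   # strict threshold
    inter = np.logical_and(loose, tight).sum()
    union = np.logical_or(loose, tight).sum()
    return float(inter) / max(float(union), 1.0)

def keep_mask(prob: np.ndarray, predicted_iou: float,
              iou_thresh: float = 0.88, stability_thresh: float = 0.95) -> bool:
    """Keep a candidate mask only if the model's own IoU prediction is high
    (confidence) and the mask barely changes under threshold perturbation
    (stability). Threshold values here are placeholders."""
    return predicted_iou >= iou_thresh and stability_score(prob) >= stability_thresh
```

Candidate masks that survive this per-point filter across the full grid are then deduplicated with NMS, as described above.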
We describe and analyze the resulting dataset, SA-1B, next.
Figure 5: Image-size normalized mask center distributions.
5. Segment Anything Dataset
Our dataset, SA-1B, consists of 11M diverse, high-resolution, licensed, and privacy-protecting images and 1.1B high-quality segmentation masks collected with our data engine. We compare SA-1B with existing datasets and analyze mask quality and properties. We are releasing SA-1B to aid future development of foundation models for computer vision. We note that SA-1B will be released under a favorable license agreement for certain research uses and with protections for researchers.
Images. We licensed a new set of 11M images from a provider that works directly with photographers. These images are high resolution (3300×4950 pixels on average), and the resulting data size can present accessibility and storage challenges. Therefore, we are releasing downsampled images with their shortest side set to 1500 pixels. Even after downsampling, our images are significantly higher resolution than many existing vision datasets (e.g., COCO [66] images are ∼480×640 pixels). Note that most models today operate on much lower resolution inputs. Faces and vehicle license plates have been blurred in the released images.
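The shortest-side downsampling is a standard resize; a minimal sketch using Pillow (our own helper, not a release script) would be:

```python
from PIL import Image

def resize_shortest_side(img: Image.Image, target: int = 1500) -> Image.Image:
    """Downsample so the shortest side equals `target`, preserving aspect ratio.
    Images already at or below the target are returned unchanged."""
    w, h = img.size
    short = min(w, h)
    if short <= target:
        return img
    scale = target / short
    return img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
```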
Masks. Our data engine produced 1.1B masks, 99.1% of which were generated fully automatically. Therefore, the quality of the automatic masks is centrally important. We compare them directly to professional annotations and look at how various mask properties compare to prominent segmentation datasets. Our main conclusion, as borne out in the analysis below and the experiments in §7, is that our automatic masks are high quality and effective for training models. Motivated by these findings, SA-1B only includes automatically generated masks.
Mask quality. To estimate mask quality, we randomly sampled 500 images (∼50k masks) and asked our professional annotators to improve the quality of all masks in these images. Annotators did so using our model and pixel-precise "brush" and "eraser" editing tools. This procedure resulted in pairs of automatically predicted and professionally corrected masks. We computed IoU between each pair and found that 94% of pairs have greater than 90% IoU (and 97% of pairs have greater than 75% IoU). For comparison, prior work estimates inter-annotator consistency at 85–91% IoU [44, 60]. Our experiments in §7 confirm by human ratings that mask quality is high relative to a variety of datasets and that training our model on automatic masks is nearly as good as using all masks produced by the data engine.
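The pairwise agreement numbers above reduce to a standard mask-IoU computation over (automatic, corrected) mask pairs; a minimal sketch (our own code, assuming boolean mask arrays of equal shape) is:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / max(float(union), 1.0)

def fraction_above(pairs, thresh):
    """Fraction of (automatic, corrected) pairs whose IoU exceeds `thresh`,
    e.g. the 90% and 75% cutoffs reported above."""
    ious = [mask_iou(auto, fixed) for auto, fixed in pairs]
    return sum(iou > thresh for iou in ious) / len(ious)
```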