Deep Image Matting
Ning Xu 1,2, Brian Price 3, Scott Cohen 3, and Thomas Huang 1,2
1 Beckman Institute for Advanced Science and Technology
2 University of Illinois at Urbana-Champaign
3 Adobe Research
{ningxu2,t-huang1}@illinois.edu, {bprice,scohen}@adobe.com
Abstract
Image matting is a fundamental computer vision problem and has many applications. Previous algorithms have poor performance when an image has similar foreground and background colors or complicated textures. The main reasons are that prior methods 1) only use low-level features and 2) lack high-level context. In this paper, we propose a novel deep learning based algorithm that can tackle both of these problems. Our deep model has two parts. The first part is a deep convolutional encoder-decoder network that takes an image and the corresponding trimap as inputs and predicts the alpha matte of the image. The second part is a small convolutional network that refines the alpha matte predictions of the first network to have more accurate alpha values and sharper edges. In addition, we create a large-scale image matting dataset including 49,300 training images and 1,000 testing images. We evaluate our algorithm on the image matting benchmark, our testing set, and a wide variety of real images. Experimental results clearly demonstrate the superiority of our algorithm over previous methods.
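As a rough sketch of the two-part model summarized above, the following PyTorch-style code (with illustrative layer sizes and names, assumed rather than taken from the paper) shows a 4-channel encoder-decoder that maps an RGB image plus trimap to a coarse alpha matte, followed by a small convolutional refinement network:

import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    # Stage 1: coarse alpha from RGB image + trimap (4 input channels).
    # Widths and depths here are illustrative, not the paper's exact design.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid(),  # alpha in [0, 1]
        )

    def forward(self, rgb, trimap):
        x = torch.cat([rgb, trimap], dim=1)   # B x 4 x H x W
        return self.decoder(self.encoder(x))

class RefinementNet(nn.Module):
    # Stage 2: small fully convolutional network that sharpens the coarse alpha.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, rgb, coarse_alpha):
        x = torch.cat([rgb, coarse_alpha], dim=1)
        # Residual correction added to the coarse prediction, clamped to valid alphas.
        return (coarse_alpha + self.net(x)).clamp(0, 1)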
1. Introduction
Matting, the problem of accurate foreground estimation in images and videos, has significant practical importance. It is a key technology in image editing and film production, and effective natural image matting methods can greatly improve current professional workflows. This demands methods that can handle real-world images in unconstrained scenes.
Unfortunately, current matting approaches do not generalize well to typical everyday scenes. This is partially due to the difficulty of the problem: as formulated, the matting problem is underconstrained, with 7 unknown values per pixel but only 3 known values:

$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \qquad \alpha_i \in [0, 1]. \tag{1}$$
where the RGB color at pixel $i$, $I_i$, is known, and the foreground color $F_i$, background color $B_i$, and matte estimate $\alpha_i$ are unknown. However, current methods are further limited by the way they approach the problem.
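As a minimal numerical illustration of this counting argument (the foreground, background, and alpha values below are invented for the example, not taken from the paper):

import numpy as np

# Per-pixel compositing (Eq. 1): the observed color is a convex blend of
# an unknown foreground and an unknown background.
F = np.array([0.9, 0.2, 0.1])    # unknown foreground RGB  (3 values)
B = np.array([0.1, 0.4, 0.8])    # unknown background RGB  (3 values)
alpha = 0.3                      # unknown matte value     (1 value)

I = alpha * F + (1.0 - alpha) * B   # the only observed quantity (3 values)
print(I)                            # [0.34 0.34 0.59]

# 3 + 3 + 1 = 7 unknowns per pixel versus the 3 known components of I,
# so Eq. 1 alone cannot be inverted without extra constraints (e.g. a trimap).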
The first limitation stems from current methods being designed to solve the matting equation (Eq. 1). This equation formulates matting as a linear combination of two colors, and consequently most current algorithms treat it largely as a color problem. The standard approaches include sampling foreground and background colors [3, 9], propagating alpha values according to the matting equation [14, 31, 22], or a hybrid of the two [32, 13, 28, 16]. Such approaches rely largely on color as the distinguishing feature (often along with the spatial positions of the pixels), making them highly sensitive to situations where the foreground and background color distributions overlap. Unfortunately, this is the common case in natural images, and it often leads to low-frequency “smearing” or high-frequency “chunky” artifacts, depending on the method (see Fig. 1, top row). Even the recently proposed deep learning methods are highly reliant on color-dependent propagation methods [8, 29].
A second limitation is due to the focus on a very small dataset. Generating ground truth for matting is very difficult, and the alphamatting.com dataset [25] made a significant contribution to matting research by providing ground-truth data. Unfortunately, it contains only 27 training images and 8 test images, most of which are objects in front of an image displayed on a monitor. Due to its size and the constraints of the dataset (e.g. indoor lab scenes, indoor lighting, no humans or animals), it is by its nature biased, and methods are incentivized to fit to this data for publication purposes. As is the case with all datasets, especially small ones, at some point methods will overfit to the dataset and no longer generalize to real scenes. A recent video matting dataset is available [10] with 3 training videos and 10 test videos, 5 of which were extracted from green screen footage and the