Table 2: Improvement in accuracy when pre-training on the public ImageNet-21k
dataset over the “standard” ILSVRC-2012. Both models are ResNet152x4.

                      ILSVRC-2012  CIFAR-10  CIFAR-100  Pets   Flowers  VTAB-1k (19 tasks)
BiT-S (ILSVRC-2012)   81.30        97.51     86.21      93.97  89.89    66.87
BiT-M (ImageNet-21k)  85.39        98.91     92.17      94.46  99.30    70.64
Improvement           +4.09        +1.40     +5.96      +0.49  +9.41    +3.77
Downstream Fine-Tuning. To attain a low per-task adaptation cost, we do
not perform any hyperparameter sweeps downstream. Instead, we present
BiT-HyperRule, a heuristic to determine all hyperparameters for fine-tuning.
Most hyperparameters are fixed across all datasets, but the schedule, resolution,
and usage of MixUp depend on the task's image resolution and training set size.
For all tasks, we use SGD with an initial learning rate of 0.003, momentum
0.9, and batch size 512. We resize input images with area smaller than 96 × 96
pixels to 160 × 160 pixels, and then take a random crop of 128 × 128 pixels. We
resize larger images to 448 × 448 and take a 384 × 384-sized crop.¹ We apply
random crops and horizontal flips for all tasks, except those for which cropping
or flipping destroys the label semantics, see Supplementary section F for details.
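To make the resolution rule above concrete, here is a minimal sketch in Python; the helper name and return convention are our own illustration, not the authors' released code.

# Illustrative sketch of the BiT-HyperRule resolution rule described above.
# The helper name resize_and_crop_sizes is hypothetical, not from the BiT code release.

def resize_and_crop_sizes(height, width):
    """Map an input image size to square (resize_to, crop_to) sizes in pixels."""
    if height * width < 96 * 96:
        # Small images: resize to 160 x 160, then take a random 128 x 128 crop.
        return 160, 128
    # Larger images: resize to 448 x 448, then take a random 384 x 384 crop.
    # (Per the footnote, the largest R152x4 uses 512 -> 480 instead.)
    return 448, 384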
For the schedule length, we define three scale regimes based on the number of
examples: we call small tasks those with fewer than 20k labeled examples, medium
those with fewer than 500k, and any larger dataset a large task. We fine-tune
BiT for 500 steps on small tasks, for 10k steps on medium tasks, and for 20k
steps on large tasks. During fine-tuning, we decay the learning rate by a factor of
10 at 30%, 60%, and 90% of the training steps. Finally, we use MixUp [67], with
α = 0.1, for medium and large tasks; a compact sketch of the full rule follows
below. See Supplementary section A for details.
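The schedule portion of BiT-HyperRule is equally mechanical. The sketch below (our own naming, not the released code) collects the fixed optimizer settings, the size-dependent step budgets, the 30/60/90% learning-rate drops, and the MixUp rule in one place.

# Sketch of the BiT-HyperRule schedule described above. All names here are
# illustrative; this is not the authors' released API.

BASE_LR = 0.003  # SGD with momentum 0.9 and batch size 512, fixed for all tasks

def hyperrule_schedule(num_examples):
    """Return (total_steps, mixup_alpha) given the training-set size."""
    if num_examples < 20_000:    # small task
        return 500, 0.0          # no MixUp
    if num_examples < 500_000:   # medium task
        return 10_000, 0.1
    return 20_000, 0.1           # large task

def learning_rate(step, total_steps):
    """Piecewise-constant LR: divide BASE_LR by 10 at 30%, 60%, and 90% of training."""
    n_drops = sum(step >= int(total_steps * f) for f in (0.3, 0.6, 0.9))
    return BASE_LR / (10 ** n_drops)

For example, CIFAR-10, with 50k training images, falls into the medium regime: 10k steps with MixUp enabled.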
3.4 Standard Computer Vision Benchmarks
We evaluate BiT-L on standard benchmarks and compare its performance to the
current state-of-the-art results (Table 1). We separate models that perform task-
independent pre-training (“general” representations), from those that perform
task-dependent auxiliary training (“specialist” representations). The specialist
methods condition on a particular task, for example ILSVRC-2012, then train
using a large support dataset, such as JFT-300M [38] or Instagram-1B [63]. See
discussion in Section 5. Specialist representations are highly effective, but require
a large training cost per task. By contrast, general representations require
large-scale training only once, followed by a cheap adaptation phase.
BiT-L outperforms previously reported generalist SOTA models as well as,
in many cases, the SOTA specialist models. Inspired by strong results of BiT-L
trained on JFT-300M, we also train models on the public ImageNet-21k dataset.
¹ For our largest R152x4, we increase resolution to 512 × 512 and crop to 480 × 480.