complex systems for helmet detection [10] also do a great job at leveraging the contextual information around small
objects to isolate them and facilitate their detection. However, their approach is not quite universally applicable and
comes at the cost of introducing a two-step process.
Typical adjustments to the internal structure of the model are surface-level. In a recent apple detection system [32],
the backbone of YOLOv5 is slightly modified to simplify it, which shows that the backbone can be adapted to a system’s
requirements and opens the way for additional changes. If a single backbone element can be modified, more drastic
changes can be applied for additional effect.
2.4 Small object detection
Some effort has been put into developing systems that direct processing towards certain areas of the input image
[29, 28, 27], which allows the resolution to be adjusted and therefore bypasses the limitation of having fewer pixels
define an object. This approach, however, is better suited to systems that are not time-sensitive, as it requires
multiple passes through a network at different scales. The idea of paying more attention to specific scales can
nevertheless inspire the way we treat certain feature maps.
Additionally, much can be learned from how feature maps are treated, rather than only from modifying the backbone.
Different types of feature pyramid networks (FPNs) [13, 30, 15] aggregate feature maps in different ways to enhance a
backbone. Such techniques have proven rather effective.
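To make the idea of aggregating feature maps concrete, the following is a minimal sketch of the top-down pathway of a basic feature pyramid, written in plain Python. It is only an illustration under simplifying assumptions: feature maps are single-channel 2D lists, each level has half the resolution of the previous one, and the lateral and smoothing convolutions of a real FPN are omitted. The function names are hypothetical and do not correspond to any of the cited implementations.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))  # repeat each row
    return out


def add_maps(a, b):
    """Element-wise sum of two feature maps of identical shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down_fpn(features):
    """Merge backbone features ordered from finest to coarsest.

    The coarsest map is passed through unchanged; every finer map is summed
    with the upsampled result of the level above it.
    """
    merged = [features[-1]]
    for lateral in reversed(features[:-1]):
        merged.append(add_maps(lateral, upsample2x(merged[-1])))
    return merged[::-1]  # back to finest-to-coarsest order
```

Other pyramid variants differ mainly in which levels are connected and in the direction in which information flows, so a sketch of that kind would only change the loop in `top_down_fpn`.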
2.5 Autonomous vehicles
Within autonomous driving, object detection can provide valuable contextual information about the vehicle’s
surroundings and heavily inform its decision-making process [17, 4]. In this case, smaller objects translate to objects
further away, and detecting them gives the system a more complete context to make use of. These systems focus heavily
on inference time, sacrificing detection performance if needed, but work can be done to improve them at minimal cost.
Performance in this field is critical, as a small improvement in the detector can greatly impact the entire vehicle. A
common requirement in this area is for detectors to be single-stage [31], for the simple reason that fewer steps and
transitions between them often translate into fewer resources needed.
3 Methodology
YOLOv5 provides four different scales of its model: S, M, L, and X, which stand for Small, Medium, Large, and
Xlarge, respectively. Each of these scales applies a different multiplier to the depth and width of the model, meaning
the overall structure of the model remains constant, but the size and complexity of each model are scaled. In our
experiments, we apply changes to the structure of the models individually across all the scales and treat each one as a
different model for the purposes of evaluating their effect.
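As an illustration of how depth and width multipliers leave the structure intact while changing its size, the sketch below scales a single block's repeat count and channel count. The multiplier values are recalled from the public YOLOv5 model configuration files and are not stated in this text, so they should be treated as an assumption; the function names and the rounding of channels to a multiple of 8 are likewise an approximation of the reference behaviour rather than a copy of it.

```python
import math

# (depth multiplier, width multiplier) per scale.
# Assumed values, recalled from the public YOLOv5 configuration files.
SCALES = {
    "S": (0.33, 0.50),
    "M": (0.67, 0.75),
    "L": (1.00, 1.00),
    "X": (1.33, 1.25),
}


def scale_depth(repeats, depth_multiple):
    """Scale how many times a block is repeated; single blocks are kept as-is."""
    if repeats <= 1:
        return repeats
    return max(round(repeats * depth_multiple), 1)


def scale_width(channels, width_multiple, divisor=8):
    """Scale a channel count and round it up to a multiple of `divisor`."""
    return math.ceil(channels * width_multiple / divisor) * divisor


def scale_block(repeats, channels, scale):
    """Return the (repeats, channels) of one block at the given model scale."""
    depth_multiple, width_multiple = SCALES[scale]
    return scale_depth(repeats, depth_multiple), scale_width(channels, width_multiple)
```

Under these assumed multipliers, a block defined with 9 repeats and 512 channels becomes 3 repeats and 256 channels at scale S, and 12 repeats and 640 channels at scale X, while the sequence of blocks is the same in both.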
To set a baseline, we trained and tested the unmodified versions of the four scales of YOLOv5. We then tested changes
to these networks individually in order to observe their impact separately against our baseline results. The techniques
and structures that did not appear to contribute to better accuracy or inference time were filtered out when moving to the
next phase. We then attempted combinations of the selected techniques. This process was repeated, observing whether
certain techniques complemented or diminished each other and adding more complex combinations progressively.
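The selection procedure described above can be sketched roughly as follows. This is a hypothetical outline rather than the scripts used in this work: the metric keys `"map"` and `"ms"`, the function names, and the rule of keeping any change that improves on the baseline in at least one of the two measures are all assumptions made for illustration.

```python
from itertools import combinations


def keep_promising(baseline, results):
    """Return the names of changes that beat the baseline on accuracy or speed.

    `baseline` and each value of `results` are dicts with a "map" entry
    (higher is better) and an "ms" entry (inference time, lower is better).
    """
    kept = []
    for name, metrics in results.items():
        better_accuracy = metrics["map"] > baseline["map"]
        faster = metrics["ms"] < baseline["ms"]
        if better_accuracy or faster:
            kept.append(name)
    return kept


def next_phase(kept, size):
    """List the combinations of `size` retained changes to try next."""
    return list(combinations(sorted(kept), size))
```

Calling `next_phase` with increasing values of `size` mirrors the progressive move from individual changes to more complex combinations.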
We first discuss the appropriate evaluation metric for our work (Section 3.1) and the dataset used for our investigation
(Section 3.2). We then describe the model changes we apply under controlled circumstances (Section 3.3), logging and
adjusting as we move through different stages.
3.1 Evaluation metric
The original implementation of YOLOv5 provides compatibility with the metrics of the Microsoft Common Objects in
Context (COCO) API [14] at three different object scales (bounding box areas) and Intersection over Union (IoU)
thresholds, which prove useful for the purpose of this study. The way values at specific scales are calculated can give us a good indication