Each I/O has to traverse several layers on its way from the application to
the hardware. The block layer allows applications to access
diverse storage devices in a uniform way and provides the
storage device drivers with a single point of entry from all
applications, thus hiding the complexity and diversity of
storage devices. In addition, the block layer implements
I/O scheduling, which performs operations called
merging and sorting to significantly improve the performance
of the system as a whole.
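The merging operation mentioned above can be illustrated with a minimal sketch: if an incoming request begins exactly where a queued request ends, the two are folded into one larger sequential request. The `io_req` structure and `try_back_merge()` below are illustrative simplifications, not the kernel's actual `struct request` machinery.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified request descriptor; the real Linux
 * struct request carries far more state. */
struct io_req {
    unsigned long sector;   /* starting sector */
    unsigned long nr_sects; /* length in sectors */
};

/* Back-merge: if 'next' begins exactly where 'cur' ends, fold it
 * into 'cur' so the driver sees one larger sequential request. */
static bool try_back_merge(struct io_req *cur, const struct io_req *next)
{
    if (cur->sector + cur->nr_sects != next->sector)
        return false;           /* not contiguous: keep them separate */
    cur->nr_sects += next->nr_sects;
    return true;
}
```

Fewer, larger requests amortize per-request overhead in every layer below the block layer, which is why merging improves system-wide performance.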
The SCSI layer mainly constructs SCSI commands from the
I/O requests delivered by the block layer. The libfc (FCP) layer
maps SCSI commands to Fibre Channel (FC) frames as
defined in the Fibre Channel Protocol for SCSI (FCP)
standard [18]. The FCoE layer encapsulates FC frames into
FCoE frames, and de-encapsulates FCoE frames into FC
frames, as specified by the FC-BB-6 standard [3]. In other words, the SCSI,
FCP, and FCoE layers mainly translate the I/O requests from the
block layer into FCoE command frames. The Ethernet
driver transmits/receives FCoE frames to/from the hardware.
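The FCoE encapsulation step amounts to wrapping the FC frame in an Ethernet header, an FCoE header, and a short FCoE trailer. The sketch below computes the resulting on-wire length; the field widths follow our reading of the FC-BB frame format (14-byte FCoE header including the SOF byte, 4-byte trailer including the EOF byte), and the helper name is ours, not a kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_HLEN       14  /* dst MAC + src MAC + EtherType          */
#define ETH_P_FCOE 0x8906  /* EtherType assigned to FCoE             */
#define FCOE_HLEN      14  /* version + reserved (13 B) + SOF (1 B)  */
#define FCOE_TLEN       4  /* EOF (1 B) + reserved (3 B)             */

/* Total on-wire length (excluding the Ethernet FCS) after
 * encapsulating an FC frame of fc_len bytes into an FCoE frame.
 * In the kernel this framing is built inside an sk_buff. */
static size_t fcoe_frame_len(size_t fc_len)
{
    return ETH_HLEN + FCOE_HLEN + fc_len + FCOE_TLEN;
}
```

For a maximum-size FC frame (24-byte FC header, 2112-byte payload, 4-byte CRC, i.e., 2140 bytes), this yields 2172 bytes, which is why FCoE requires jumbo-capable Ethernet links.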
The main I/O performance factors in the Open-FCoE stack can
be summarized as follows: (1) the I/O-Issuing Side translates the
I/O requests into FCoE-format frames; (2) the I/O-Completion
Side informs the I/O-issuing threads of the I/O completions;
(3) Parallel Processing and Synchronization implements
parallel access on multi-core servers. In this section, we
describe and investigate the current Open-FCoE stack
with respect to the above factors.
2.1 Issue 1: High Synchronization Overhead from
Single Queue & Shared Lock Mechanism
Fig. 2 shows the I/O request transmission process in the
SCSI/FCP/FCoE layers of the Open-FCoE stack when multiple
cores/threads submit I/O requests to a remote target in a
multi-core system. We describe it as follows:
1) The SCSI layer builds the SCSI command structure
describing the I/O operation received from the block layer;
it then acquires the shared lock when: (1) enqueueing
the SCSI command into the shared queue in the SCSI
layer; and (2) dispatching the SCSI command from
the shared queue in the SCSI layer to the FCP layer.
2) The FCP layer builds an internal data structure (FCP
request) describing the SCSI command received from the
SCSI layer, and acquires the shared lock when
enqueueing the FCP request into the internal shared
queue in the FCP layer. It then initializes an FC
frame in an sk_buff structure for the FCP request, and
delivers the sk_buff structure to the FCoE layer.
3) The FCoE layer encapsulates the FC frame into an FCoE
frame, and then acquires the shared lock when:
(1) enqueueing the FCoE frame; and (2) dequeueing
the FCoE frame to transmit it to the network
through the standard interface dev_queue_xmit().
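The single queue & shared lock pattern recurring in all three steps can be modeled with the minimal sketch below: every producer and consumer, regardless of which core it runs on, must take the same mutex. The structure and function names are illustrative, not the kernel's.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define QLEN 256

/* One queue shared by all cores, guarded by one lock. */
struct shared_queue {
    pthread_mutex_t lock;      /* the single shared lock */
    void *items[QLEN];
    int head, tail, count;
};

static bool sq_enqueue(struct shared_queue *q, void *cmd)
{
    bool ok = false;
    pthread_mutex_lock(&q->lock);   /* all producers serialize here */
    if (q->count < QLEN) {
        q->items[q->tail] = cmd;
        q->tail = (q->tail + 1) % QLEN;
        q->count++;
        ok = true;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

static void *sq_dequeue(struct shared_queue *q)
{
    void *cmd = NULL;
    pthread_mutex_lock(&q->lock);   /* consumers serialize here too */
    if (q->count > 0) {
        cmd = q->items[q->head];
        q->head = (q->head + 1) % QLEN;
        q->count--;
    }
    pthread_mutex_unlock(&q->lock);
    return cmd;
}
```

With N cores issuing I/O concurrently, the lock becomes a serialization point: each enqueue and dequeue in each of the three layers contends for it, so per-core throughput degrades as N grows.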
The shared lock clearly provides the synchronization
required for the shared queue on multi-core servers. However,
such a single queue & shared lock mechanism in the SCSI/
FCP/FCoE layers limits software scalability
on multi-core systems.
To improve scalability, modern servers
employ cache-coherent Non-Uniform Memory Access (cc-
NUMA) in their multi-core architecture, such as the one depicted
in Fig. 3, which corresponds to the servers used in our work. In such an
architecture, several well-known characteristics [11],
[19], [20], [21], [22], [23], [24] significantly impact
software performance, such as Migratory Sharing,
False Sharing, and the large latency gap between
local and remote memory accesses. These characteristics make it
challenging to develop multi-threaded
software for cc-NUMA multi-core systems [25].
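False sharing, for instance, arises when logically independent per-core data happen to occupy the same cache line, so updates from different cores ping-pong the line across the coherence fabric. A common remedy is to pad each item out to a full cache line, as sketched below (the 64-byte line size is a typical x86 value, assumed here).

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64   /* typical x86 cache-line size; an assumption */

/* Unpadded: counters of adjacent cores can share one cache line,
 * so independent updates still cause coherence traffic. */
struct ctr_packed {
    unsigned long v;
};

/* Padded: each counter occupies a full cache line, eliminating
 * false sharing at the cost of extra memory. */
struct ctr_padded {
    unsigned long v;
    char pad[CACHELINE - sizeof(unsigned long)];
};
```

Per-core queue designs in later sections of such stacks typically combine this padding with NUMA-local allocation so that each core touches only lines it owns.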
Fig. 1. Architecture of Linux Open-FCoE stack.
Fig. 2. Process of I/O requests transmission in the current Open-FCoE
stack.
Fig. 3. Multi-core architecture with cache coherent non-uniform memory
access (cc-NUMA).
2516 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 9, SEPTEMBER 2017