Analysis of System Reliability for Cache Coherence Scheme in Multi-Processor
Sizhao Li¹, Shan Lin¹, Deming Chen³, W. Eric Wong⁴, and Donghui Guo¹,², Senior Member, IEEE
1. Dept. of Electronic Engineering, Xiamen University, Fujian 361005, China
2. IC Design & IT Research Center of Fujian Province, Xiamen University, Fujian 361005, China
3. Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL 61801, USA
4. Department of Computer Science, University of Texas at Dallas, TX 75083, USA
sizhao.li@gmail.com, dchen@illinois.edu, ewong@utdallas.edu, dhguo@xmu.edu.cn
Abstract—This paper introduces a cache coherence scheme for
multi-processor systems. Each kind of software has a specific
execution model, and these models can be used to solve cache
coherence on the AHB bus. First, we use a dynamic address
mapping policy to implement the data cache. Second, to account
for the randomness of the application environment, we set up an
adaptive configuration and management mechanism for the shared
cache within the finite state machine timing sequence model of
each kind of software, so as to ensure system reliability. To
support Linux, a multi-tasking, multi-user operating system, the
multi-processor must use shared memory technology, so this paper
also introduces the memory management unit. On this basis, it
focuses on how the multi-processor and the AHB bus cooperate to
ensure cache coherence of the whole system. A software execution
model combined with hardware design can achieve instruction or
data coherence between each cache and main memory.
Keywords- cache coherence, memory management, system
reliability, multi-processors, system failure.
I. INTRODUCTION
Research on multi-processors has long been an important part
of microprocessor research. Over the years, a variety of
hardware architectures have been proposed to support
cooperation and communication among processors, so as to
improve processing speed and performance. To ensure system
reliability, the data in each processor must be correct [1].
Because the cache provides instructions and data to the
processor, cache coherence is of primary importance [2].
Currently, most multi-processors cannot work without an
on-chip cache [3]. To improve system performance, on-chip
designs commonly use multi-level caches. Each processor unit
usually has its own private L1 or L2 cache, and the units can
also share other on-chip storage resources through the
interconnect [4]. A multi-processor often runs different
programs simultaneously, so it must consider how to configure
on-chip storage resources and how to manage their sharing.
Different architectures may adopt different management
models, and memory management itself is highly complex [5].
Therefore, the optimal configuration of a shared cache
depends on the instruction and data reuse of the particular
software or environment. Without a good memory coherence
management protocol and an error detection and repair
mechanism, the system may suffer memory usage conflicts or
data transfer errors, which can lead to system instability or
collapse [6]. These errors affect the reliability of the whole
system, as measured by Mean Time to Failure (MTTF), Mean
Residual Life (MRL), and other performance indices [7].
The rest of this paper is organized as follows. Section II
describes the related work. Section III describes the
mathematical model of cache coherence and explains the
importance of coherence from the system reliability point of
view. Section IV describes the evaluation results and presents
the discussion. Finally, the conclusion is given.
II. SYSTEM RELIABILITY AND CACHE COHERENCE
PROTOCOL
System reliability is the capability of a system to complete
a specified function under stated conditions and within the
required time [8]. Two factors affect system reliability: the
inherent reliability of the system's devices, and the effect
of external conditions.
A. Failure model
We first introduce the following four key concepts. Let $T$ be
the failure time, $f(t)$ the probability density function of
the failure time, and $F(t)$ its distribution function:
$$F(t) = P(T \le t) = \int_0^t f(u)\,du$$
1) The Reliability Function can be defined as
$$R(t) = 1 - F(t) = P(T > t)$$
where $R(t)$ is the probability that the device unit does not
fail in the time interval (0, t].
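As a rough numerical illustration of the reliability function, the sketch below assumes an exponentially distributed failure time, which the paper does not specify. The rate value and the function names are hypothetical and chosen only for this example. It evaluates R(t) = 1 - F(t) and approximates the MTTF mentioned in the introduction using the standard identity that MTTF is the integral of R(t) over time.

```python
import math


def distribution(t, rate):
    """F(t) = P(T <= t), assuming an exponential failure time (illustrative)."""
    return 1.0 - math.exp(-rate * t)


def reliability(t, rate):
    """R(t) = 1 - F(t): probability of no failure in (0, t]."""
    return 1.0 - distribution(t, rate)


def mttf(rate, horizon, steps=100_000):
    """Approximate MTTF as the integral of R(t) over [0, horizon]
    using the trapezoidal rule; horizon must be large enough that
    R(horizon) is negligible."""
    h = horizon / steps
    total = 0.5 * (reliability(0.0, rate) + reliability(horizon, rate))
    for k in range(1, steps):
        total += reliability(k * h, rate)
    return total * h


RATE = 0.001  # assumed failure rate per hour, not taken from the paper

for t in (0.0, 100.0, 1000.0):
    print(f"R({t}) = {reliability(t, RATE):.4f}")

print(f"MTTF ~ {mttf(RATE, 20_000.0):.2f} hours")
```

Under this exponential assumption the MTTF should come out close to 1/RATE. A different lifetime distribution would only require changing `distribution`.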
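2) The Failure Rate Function $z(t)$ describes the probability
that a device unit which has survived to time $t$ fails in the
time interval (t, t+Δt]:
$$P(t < T \le t + \Delta t \mid T > t) = \frac{F(t + \Delta t) - F(t)}{R(t)}$$
where $F(t) = P(T \le t)$ is the distribution function of the
failure time. Dividing both sides by $\Delta t$ and taking the
limit as $\Delta t \to 0$ gives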