A Generalized Systematic Comma Free Code
Tianbo Xue
∗†
and Francis C. M. Lau
∗†
∗
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong
†
The Hong Kong Polytechnic University Shenzhen Research Institute, China
Email: tianbo.xue@connect.polyu.hk and francis-cm.lau@polyu.edu.hk
Abstract—For channels that introduce substitution, in-
sertion and deletion errors, one challenging problem for
a code designer is to avoid false code synchronization. In
other words, the probability of false codewords occurring
should be minimized with appropriate code design. In this
paper, we propose a new class of systematic comma free
code called generalized F (n, s, t) code. We first prove that
this code is a synchronous code and then we derive the
probabilities of false synchronization when a substitution,
insertion or deletion error occur. We compare the theoret-
ical and simulation results under different parameters. We
also compare the performance of our proposed code with
the classical F code.
I. INTRODUCTION
Many communication channels corrupt transmitted
signals by adding noise, causing intersymbol interfer-
ence, introducing adjacent channel interference, etc.
Some channels cause errors by substituting the transmit-
ted symbols with other symbols. There are also channels
that not only substitute symbols, but also insert new sym-
bols and remove transmitting symbols. One such exam-
ple is deoxyribonucleic acid (DNA)-based data storage
system where data are stored as DNA sequences [1–
4]. During the synthesis process and sequencing process
of the DNA sequences, nucleotides storing information
can be substituted, added or removed, causing substi-
tution, insertion and deletion errors, respectively. When
insertion and/or deletion errors occur, synchronization
between codewords will be lost.
One approach toward maintaining synchronization in
the presence of errors is to transmit “commas” between
codewords [5]. Another approach that do not depend
on “commas” was first proposed by Golomb et al. [6].
Such codes that do not depend on “commas” are called
“comma-free codes”. Specifically, a comma-free code
(CFC) is a set of codewords of length n over an alphabet
such that given any two codewords u = u
1
u
2
···u
n
and
v = v
1
v
2
···v
n
belonging to this set, the n letter con-
catenation w = u
k
···u
n
v
1
···v
k−1
(k =2, 3,...,n)
will not form a codeword. The class of CFCs proposed
by Golomb et al., however, does not provide fixed
locations for carrying information.
In [7], systematic fixed block length CFCs are intro-
duced. A systematic CFC is a subclass of CFC which
use fixed locations in each codeword for maintaining
synchronization and the remaining locations for trans-
mitting information. One strategy often used is that the
first few positions of a codeword are fixed to either 0 or
1. This fixed sequence, which appears at the beginning
of each codeword, is called the primer drive of the code.
An “F code” is a classic systematic CFC. The class of
F codes [7] does not preclude the existence of a primer
drive in the body of a codeword but it ensures that no
codewords are formed between concatenated codewords.
F codes are more efficient in terms of transmitting
information compared to other codes such as Gilbert
codes [5]. Denote the code length by n>1 and suppose
r is the least integer ≥
!
2(n − 1), s is the least integer
≥
r
2
and t = r − s. Then an F code of length n
has 0s fixed in the positions 1, 2,...,s and 1s fixed in
the positions n, n − s, n − 2s, . . . , n − (t − 1)s while
all other positions are arbitrary and can be used to
transmit information symbols. It has been proved that
the F code is the most efficient systematic CFC that
uses fixed places to maintain synchronization [7]. For
example, when n =15, the information “101101100”
will be encoded into an F code “000101101111001”
where the digits in bold represent those fixed bits used
for maintaining synchronization.
Decoding of a received data sequence is assumed to
be performed using a fixed-length decoding window of
size n, which is simply the length of an F codeword.
The decoder only knows the beginning and end of the
received sequence but has no direct information about the
boundaries between codewords. Whenever the decoding
window detects the structure of an F code, this codeword
is decoded immediately. Then the decoding window
shifts n positions to the right and attempts detecting
and decoding the next codeword. If any substitution/in-
sertion/deletion errors occur, the n bits in the decoding
window may not form a valid codeword. In this case, the
decoding window will only shift one position at a time
to the right until a valid F codeword is found. These
procedures repeat until the last bit in the received data
sequence is reached.
By definition, a valid codeword cannot be formed
based on bits from two consecutive CFCs such as
systematic F codes. However, a valid codeword or a
false codeword can be formed when (substitution/in-
sertion/deletion) errors occur in CFCs. For example,
if there is an insertion error occurring in an F code,
this F codeword will not be detected or decoded. At