DOI: 10.1145/1467247.1467269
Error Correction up to the
Information-Theoretic Limit
By Venkatesan Guruswami and Atri Rudra
Abstract
Ever since the birth of coding theory almost 60 years ago,
researchers have been pursuing the elusive goal of constructing the “best codes,” whose encoding introduces the
minimum possible redundancy for the level of noise they
can correct. In this article, we survey recent progress in list
decoding that has led to efficient error-correction schemes
with an optimal amount of redundancy, even against
worst-case errors caused by a potentially malicious channel. To correct a proportion r (say 20%) of worst-case errors, these codes
only need close to a proportion r of redundant symbols. The
redundancy cannot possibly be any lower, information-theoretically. This new method holds the promise of correcting a
factor of two more errors compared to the conventional algorithms currently in use in diverse everyday applications.
1. Introduction
Coping with corrupt data is an essential part of our modern-day lives, even if most of the time we are blissfully unaware
of it. When we watch television, the TV has to deal with signals that get distorted during transmission. While talking on
our cellphones, the phone has to work with audio signals that
get corrupted during transmission through the atmosphere,
though we definitely are aware of it when the connection is
poor. When we surf the Internet, the TCP/IP protocol has to
account for packets that get lost or garbled while being routed.
When we watch movies on DVDs, the player has to overcome
loss of data due to scratches. Even when we buy our groceries,
the scanner has to deal with distorted barcodes on packages.
The key ingredient in coping with errors in these and
many other applications is an error-correcting code or just
code for brevity. The idea behind codes is conceptually simple: add redundancy to the information so that even if the
resulting data gets corrupted (e.g., packets get garbled
during routing or the DVD gets some scratches), the original
information can still be recovered.
Ideally, we would like to add a small amount of redundancy and at the same time be able to correct many errors.
As one might expect, these are conflicting goals and striking the right balance is where things get interesting. For
example, consider the code where every information bit is
repeated, say, 100 times (this is known as the repetition
code). Intuitively, this code should do well. In particular, the
following is a natural error-recovery procedure or a decoding
algorithm: for every consecutive 100 bits of the data, identify whether the majority of the bits is 0 or 1, and output the
corresponding bit. Unless we happen to be unlucky, this
decoding algorithm can recover from quite a few errors. The
downside is that every 100 bits of data contain only one bit
of information—imagine how large a DVD would need to
be in order to store a movie with such redundancy. On the
other extreme is the parity code, which appends the parity
of the information bits at the end of the message. This code
uses the minimum amount of redundancy possible but has
poor error-recovery capabilities. Indeed, even if just two bits
get flipped, the corruption can go undetected. For example, 0001 gets
encoded as 00011 under the parity code. If the first two bits
get corrupted and we receive 11011, we would misinterpret
the original message to be 1101. Imagine Clark Gable saying
at the end of your small parity-encoded DVD for Gone with
the Wind, “Frankly, my dear, I don’t give a ham!”
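To make the two extremes concrete, here is a small sketch in Python (our own toy illustration, not code from the article; the names repetition_encode, repetition_decode, and parity_encode are ours) of the 100-fold repetition code with majority-vote decoding, and of the parity code:

def repetition_encode(bits, reps=100):
    # Repeat every information bit `reps` times.
    return [b for b in bits for _ in range(reps)]

def repetition_decode(received, reps=100):
    # Majority vote within each block of `reps` consecutive bits.
    decoded = []
    for i in range(0, len(received), reps):
        block = received[i:i + reps]
        decoded.append(1 if sum(block) > len(block) // 2 else 0)
    return decoded

def parity_encode(bits):
    # Append a single parity bit; one flipped bit is detectable,
    # but an even number of flips goes unnoticed.
    return bits + [sum(bits) % 2]

message = [0, 0, 0, 1]
codeword = repetition_encode(message)
# Even if up to 49 of the 100 copies of each bit are flipped,
# the majority vote still recovers the message.
assert repetition_decode(codeword) == message
print(parity_encode(message))  # [0, 0, 0, 1, 1], i.e., 0001 -> 00011

The repetition code here spends 100 transmitted bits per information bit (rate 1/100), while the parity code spends only one extra bit (rate k/(k + 1)); this is exactly the tension the next paragraph formalizes.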
To capture this inherent tension between the redundancy
and the error tolerance of codes, let us define codes and some
key parameters formally. A code C is given by an encoding
map of the form C : Σ^k → Σ^n (for integers k < n) which encodes
a sequence of k symbols from Σ into a larger sequence of n
symbols. Given a message m ∈ Σ^k, C(m) is known as the corresponding codeword. The parameters k, n, and Σ are called the
dimension, block length, and alphabet of C, respectively. We
will often use the ratio R = k/n, which is called the rate of C.
Note that R exactly captures the amount of information contained per symbol of a codeword. The Hamming distance between
two strings in Σ^n is the number of coordinates where they differ. The minimum distance of a code is defined to be the smallest Hamming distance between two distinct codewords.
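As a quick numerical illustration of these definitions, here is a toy sketch of our own (continuing in Python); the four-codeword example is the binary parity code with k = 2 and n = 3:

from itertools import combinations

def hamming_distance(x, y):
    # Number of coordinates where the two strings differ.
    return sum(1 for a, b in zip(x, y) if a != b)

def minimum_distance(codewords):
    # Smallest Hamming distance between two distinct codewords.
    return min(hamming_distance(c1, c2)
               for c1, c2 in combinations(codewords, 2))

# Each 2-bit message followed by its parity bit.
code = ["000", "011", "101", "110"]
k, n = 2, 3
print("rate R =", k / n)                             # 0.666...
print("minimum distance =", minimum_distance(code))  # 2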
Thus, our question of interest can now be restated as follows: given a code C of rate R, what is the maximum fraction
of errors (which we will henceforth refer to as r) that can
be tolerated by C? Now, as every codeword has k symbols of
information, it is intuitive that in the worst case at least k
symbols of a codeword should be uncorrupted to have any
hope of recovering the original information. In other words,
we can only have r ≤ (n − k)/n = 1 − R, irrespective of the computational power of the decoder.
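For a concrete instance of this bound, take the 20% figure from the abstract: to tolerate a fraction r = 0.2 of worst-case errors, a code can have rate at most R = 1 − r = 0.8, i.e., at least 20% of the transmitted symbols must be redundant. Likewise, a rate-1/2 code cannot tolerate more than a 1/2 fraction of errors, no matter how clever the decoder.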
The main focus of this article is the following question:
Can we construct a code C of rate R that can be efficiently
decoded from close to a 1 − R fraction of errors?
Quite surprisingly, we will show the answer to be yes. Thus,
the above simple information-theoretic limit can in fact be
approached. In particular, for small rates, we can recover from
situations where almost all of the data can be corrupted. For
example, we will be able to recover even if Clark Gable were
to spout “alhfksa, hy meap xH don’z hive b hayn!” There are
The original version of this paper was published in IEEE
Transactions on Information Theory, 2008.