Doi: 10.1145/1467247.1467269

Error Correction up to the
Information-Theoretic Limit

by Venkatesan guruswami and Atri Rudra

abstract

Ever since the birth of coding theory almost 60 years ago, researchers have been pursuing the elusive goal of constructing the “best codes,” whose encoding introduces the minimum possible redundancy for the level of noise they can correct. In this article, we survey recent progress in list decoding that has led to efficient error-correction schemes with an optimal amount of redundancy, even against worst-case errors caused by a potentially malicious channel. To correct a proportion r (say 20%) of worst-case errors, these codes only need close to a proportion r of redundant symbols. The redundancy cannot possibly be any lower information theoretically. This new method holds the promise of correcting a factor of two more errors compared to the conventional algorithms currently in use in diverse everyday applications.

1. intRoDuction

Coping with corrupt data is an essential part of our modern day lives; even if most of the time we are blissfully unaware of it. When we watch television, the TV has to deal with signals that get distorted during transmission. While talking on our cellphones, the phone has to work with audio signals that get corrupted during transmission through the atmosphere though we definitely are aware of it when the connection is poor. When we surf the Internet, the TCP/IP protocol has to account for packets that get lost or garbled while being routed. When we watch movies on DVDs, the player has to overcome loss of data due to scratches. Even when we buy our groceries, the scanner has to deal with distorted barcodes on packages.

The key ingredient in coping with errors in these and many other applications is an error-correcting code or just code for brevity. The idea behind codes is conceptually simple: add redundancy to the information so that even if the resulting data gets corrupted, e.g. packets get corrupted during routing or the DVD gets some scratches, the original information can still be recovered.

Ideally, we would like to add a small amount of redundancy and at the same time be able to correct many errors. As one might expect, these are conflicting goals and striking the right balance is where things get interesting. For example, consider the code where every information bit is repeated say a 100 times (this is known as the repetition code). Intuitively, this code should do well. In particular, the following is a natural error-recovery procedure or a decoding algorithm: for every consecutive 100 bits of the data, identify whether the majority of the bits is 0 or 1, and output the corresponding bit. Unless we happen to be unlucky, this decoding algorithm can recover from quite a few errors. The downside is that every 100 bits of data contain only one bit

of information—imagine how large a DVD would need to be in order to store a movie with such redundancy. On the other extreme is the parity code, which appends the parity of the information bits at the end of the message. This code uses the minimum amount of redundancy possible but has poor error-recovery capabilities. Indeed, even if just two bits get flipped, it can go undetected. For example, 0001 gets encoded as 00011 under the parity code. If the first two bits get corrupted and we receive 11011, we would misinterpret the original message to be 1101. Imagine Clark Gable saying at the end of your small parity encoded DVD for Gone with the Wind, “Frankly, my dear, I don’t give a ham !”

To capture this inherent tension between the redundancy and the error tolerance of codes, let us define codes and some key parameters formally. A code C is given by an encoding map of the form C : Σk → Σn (for integers k < n) which encodes a sequence of k symbols from Σ into a larger sequence of n symbols. Given a message m Î Σk, C(m) is known as the corresponding codeword. The parameters k, n, and Σ are called the dimension, block length, and alphabet of C, respectively. We will often use the ratio R = k/n, which is called the rate of C. Note that R exactly captures the amount of information contained per bit of a codeword. The Hamming distance between two strings in Σn is the number of coordinates where they differ. The minimum distance of a code is defined to be the small-est Hamming distance between two distinct codewords.

Thus, our question of interest can be now re-stated as follows: given a code C of rate R, what is the maximum fraction of errors (which we will henceforth refer to as r) that can be tolerated by C Now as every codeword has k symbols of information, it is intuitive that in the worst case at least k symbols of a codeword should be uncorrupted to have any hope of recovering the original information. In other words, we can only have r ≤ (nk)/n = 1 − R, irrespective of the computational power of the decoder.

The main focus of this article is the following question:

Can we construct a code C of rate R that can be efficiently decoded from close to a 1 − R fraction of errors?

Quite surprisingly, we will show the answer to be yes. Thus, the above simple information-theoretic limit can in fact be approached. In particular, for small rates, we can recover from situations where almost all of the data can be corrupted. For example, we will be able to recover even if Clark Gable were to spout “alhfksa, hy meap xH don’z hive b hayn!” There are

The original version of this paper was published in IEEE Transactions on Information Theory, 2008.

References:

Archives