Run-Length Encoding (RLE)

    Suppose we are given a file or a source message that has too many redundant characters. For example, an average MS Word file has too many consecutive byte-255 and NULL characters. Is it possible to represent these consecutive bytes or “runs” into a more compact form?

Indeed, a compression technique was designed to solve this particular problem. It is called Run-Length Encoding or RLE. Its name so accurately describes the process because it encodes a run of bytes to the following 2-byte form: {byte, length}, with length representing the number of runs of a single byte and which means that we can encode as many as 255 consecutive runs. This technique is the simplest in run-length encoding techniques. As an example, consider the following data source or string of 24 letters:

Input String: abbbbbbbbbbefffgggghhijk

To encode the above string, the output would be {‘a’, 1}, {‘b’, 10}, {‘e’, 1}, {‘f’, 3}, {‘g’, 4}, {‘h’, 2}, {‘i’, 1}, {‘j’, 1}, and {‘k’, 1 }. The total compressed form for this source is just 18 bytes. This saves us exactly 6 bytes, with a compression ratio of 25%. Technically, a better length byte could mean “how many more follows,” so that we could actually record at most 256 bytes: the byte flag plus the possible 255 runs. Thus, the letter ‘a’ would then be encoded as {‘a’, 0} instead of {‘a’, 1}.

  One drawback of this kind of RLE is that if there is only a single byte to encode, you have to add an extra byte for the length byte, like the one in the previous example when we encoded the letters ‘a’ and ‘e’. It is good to assume that there are only certain sections of the data source that have these runs. Though we could encode only 256 runs of a single byte, consecutive bytes more than 256 oftentimes do not occur in majority of files that it works perfectly well for files with enough redundant characters or bytes.

   We can improve this technique by limiting the length code to just 4 bits. With this arrangement, we can encode only a maximum of 16 runs of a byte, which is adequate for simple compression since actual files typically do not contain too many redundant bytes.

Text files. One way to improve the byte-length method is in the area of text-file compression. Given the nature of text (ASCII) files in which the individual bytes can actually be encoded in only 7 bits (i.e., bit-7, or the 8th bit, of all bytes in a text file is 0), we can therefore use bit-7 as a “signal” to the decoder if a byte is repeated or not. If there is a run of a byte, we will set the byte’s bit-7 to 1. Thus, if the decoder sees a byte which is greater than 127, it only means that there is a run of bytes and the decoder will promptly write those number of bytes.

   Binary files. Another clever form of run-length encoding is to encode if and only if there is a run. That is, do not encode an additional byte for a single non-redundant byte: encode only those redundant bytes.  This is done by encoding twice the byte and then encoding the length byte: {byte, byte, length}. This way, we do not incur a length byte for those bytes which occur only independently in a data stream. Thus, in the decompression phase, the presence of a twin byte alerts us that there is exactly a run of bytes. Hence, {‘b’, ‘b’, 8} means that there are 10 runs of byte ‘b’.  It follows that we should then write the next eight bytes after the two. The previous example would then be encoded like this:

{‘a’}, {‘b’, ‘b’, 8}, {‘e’}, {‘f’, ‘f’, 1}, {‘g’, ‘g’, 2}, 

{‘h’, ‘h’, 0}, {‘i’}, {j}, {‘k’}.

This encoding needs only 17 bytes for output. Notice that the letters ‘a’ and ‘e’ are now encoded as is, with single bytes. For very large files, this technique is more powerful than the “byte-length” technique. This method can record at most 257 consecutive bytes (2 + (0..255)).

   Another drawback we incur from this new technique, however, is the additional symbol encoded. If there are only two runs of a symbol, we would need an additional byte, encoding the run with three bytes instead of just two bytes. In general, however, this is more effective when we look at the data as a single large file which may naturally have a series of identical bytes.

Source Code:

RLE.ZIP - includes C implementations of run-length encoding for text and binary files. 

Gerald R Tamayo,
Dec 23, 2009, 8:53 PM