Since the 1980s, the capacity and speed of storage devices have improved
tremendously, and new kinds of cheaper, more efficient memory devices
emerge constantly. Nevertheless, as files grow larger, it is helpful if
we can somehow encode them into a smaller size (when not in current use)
and restore them when needed later. That way, we can make better use of
our storage media, such as diskettes, hard disks, CDs, USB flash disks,
and tapes.

The process of transforming a data source into a smaller size is called **compression**, while restoring it to its original form is called **decompression**. Typically, the data source is a *string* or *sequence of bytes*; it might also be a *file* or a group of files. For our purposes, we consider the data source to be a file, though the techniques we study here apply equally to any source.

Most of the data that we store on our computer devices is *digital*: each unit of information is packed in binary form. This binary nature of the source is the subject of much research: how much *information* it really contains, and how that information can be represented in a smaller number of binary digits, or *bits*.
The compressed (or output) file is ultimately a concatenation of
thousands, or even millions, of bit strings. Clearly, the cost of sending
data over communications networks is reduced when files are highly
compressed.

**Definition of Terms**

At its most basic level, data compression does not work only on bytes; in theory, we encode “**symbols**,” which may be 8-bit or 16-bit symbols. These symbols, or numerical representations of objects, are *distinct*, and taken together the whole group is called the “**alphabet**.”
In the English language, the 26 letters are the symbols and the group
is called the English Alphabet. The whole string of symbols to compress
is referred to as the “**message**” or “**data source**.”

Consider an input message *S* of length *m* drawn from a source alphabet *A* of *n* distinct symbols, i.e., *S* = *s*_{0}...*s*_{m-1} and *A* = {*a*_{0},...,*a*_{n-1}}. To compress the message, we build a **code** (or **codeword**) for each of the symbols in *A*, thus creating another alphabet *B*. This alphabet of codes might be a **standard** alphabet or one deliberately created by the encoding algorithm according to the statistics of the message.
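
To make these terms concrete, here is a minimal Python sketch that extracts the source alphabet of a small message and pairs it with a possible code alphabet; the message `ABRACADABRA` and the bit-string codes are invented for illustration and are not part of any standard.

```python
# A minimal sketch of the terms above; the message and the code table
# are hypothetical examples chosen for illustration.
message = "ABRACADABRA"            # the message S, of length m = 11

alphabet = sorted(set(message))    # source alphabet A = ['A', 'B', 'C', 'D', 'R'], n = 5

# A possible code alphabet B for this message: each distinct symbol of A
# maps one-to-one to a bit string in B.
codes = {"A": "0", "B": "10", "R": "110", "C": "1110", "D": "1111"}

print(len(message), alphabet, codes)
```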

Normally, the distinct symbols of the source alphabet *A* have a *one-to-one* correspondence with the members of the code alphabet *B*. That is, if *A* = {*a*_{0},...,*a*_{n-1}}, then *B* = {*b*_{0},...,*b*_{n-1}}, where *n* is typically 256. This ensures that all the symbols can be recovered during decompression from their corresponding codes.

We define compression methods that employ such a one-to-one mapping as **prefix-free coding** algorithms, where no code is a *prefix* of any other code. However, other coding techniques can create just *one* code for a “group” of symbols, enabling better compression. These methods are classified as **dictionary-based** algorithms.
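
The prefix-free property is easy to check mechanically. The short sketch below is an illustrative helper, not taken from any particular library; the function name and the example codes are assumptions.

```python
def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of any other codeword."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True

# {0, 10, 110, 111} is prefix-free; {0, 01, 11} is not, since 0 prefixes 01.
print(is_prefix_free(["0", "10", "110", "111"]))   # True
print(is_prefix_free(["0", "01", "11"]))           # False
```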

The process of creating and employing a certain code as a substitute for a message symbol is called **encoding**. In other words, the compression process is nothing more than a way of *writing* the data source using *another* alphabet. To be effective, the written output must be smaller than the original data source, so most of the codes must be shorter than the original symbols. The size of a code in *bits* is referred to as its **codelength**, or **bit length**. Limiting the codelengths is the main focus of data compression.
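
The following sketch encodes a short message with a hypothetical code table and compares the total codelength against the 8 bits per symbol of the original; the table is illustrative and not produced by any particular algorithm.

```python
# Hypothetical code table: frequent symbols get codes shorter than the
# 8 bits each symbol occupies in the original message.
message = "ABRACADABRA"
codes = {"A": "0", "B": "10", "R": "110", "C": "1110", "D": "1111"}

encoded = "".join(codes[s] for s in message)   # write S using alphabet B

original_bits = 8 * len(message)               # 11 symbols x 8 bits = 88 bits
compressed_bits = len(encoded)                 # total codelength = 23 bits
print(encoded, original_bits, compressed_bits)
```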

Codes can be categorized in terms of “symbol-to-code” *mapping*, noting in particular the size in bits of the symbols and their corresponding codes [Lelewer and Hirschberg 1987]. These types of mapping are classified as *block-block*, *block-variable*, *variable-block*, and *variable-variable* coding.

In *block-block* coding, the symbols and codes are of the same size. The Extended ASCII [1] codes and EBCDIC [2] codes are block-block codes and do not provide compression. On the other hand, *block-variable* coding maps individual symbols into codes of varying sizes—and provides compression.

Dictionary-based algorithms are of the *variable-block* coding type since the number of symbols per code varies (grouping “strings” of two or more symbols whenever possible), while the size of the codes (usually the *indices* into the dictionary) is fixed.
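
As a rough sketch of the variable-block idea, the dictionary below maps strings of varying length to fixed-width indices; the dictionary contents and the 12-bit index width are assumptions made for illustration, not the rules of any specific algorithm.

```python
# Hypothetical dictionary: variable-length strings map to fixed-width indices.
dictionary = {"T": 0, "H": 1, "E": 2, "TH": 3, "HE": 4, "THE": 5}
INDEX_BITS = 12                                # every output code has the same size

def emit(phrase):
    """Emit one fixed-width code for a (possibly multi-symbol) phrase."""
    return format(dictionary[phrase], f"0{INDEX_BITS}b")

print(emit("THE"))   # one 12-bit code standing in for three 8-bit symbols
```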

A more advanced type is *variable-variable* coding. In practice, fixed-length dictionary codes can be turned into *variable* codes: the string codes are simply limited in size at the start of the coding process and gradually widened as the algorithm proceeds. After some finite number of codes has been defined, the code sizes stabilize and the output becomes a *block* code.
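
One way to picture this is a code width that starts small and widens as the dictionary fills, eventually settling at a fixed (block) size. The starting and maximum widths in the sketch below are illustrative assumptions only.

```python
import math

def code_width(dictionary_size, start_bits=9, max_bits=16):
    """Bits per dictionary index: starts at start_bits, grows with the
    dictionary, and settles at max_bits (a block code) thereafter."""
    needed = math.ceil(math.log2(max(dictionary_size, 2)))
    return min(max(needed, start_bits), max_bits)

for size in (300, 600, 5000, 70000, 200000):
    print(size, code_width(size))   # 9, 10, 13, 16, 16 bits
```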

**Compression Ratio**

A common measure of a compression algorithm is its ability to produce smaller files. The so-called **compression ratio** is based on this measure. Authors have defined the term in several ways. Theoretically, as most purists note, the compression ratio should be the *compressed size* over the *original size*; others use *original size* / *compressed size*.

However, it has persisted in the literature (and in actual compression programs like WinZIP) that this ratio corresponds to the *reduction* in the number of bytes, usually expressed in percentage form. The percentage value thus indicates the *bytes actually removed*, not the compressed size:

**Comp. Ratio (%)** = (1 − *compressed size* / *original size*) × 100

For example, if you have a 250-byte original file and the output file
size is exactly 50 bytes, then your compression ratio is 80% and not
20%. In other words, the *percentage of reduction* is exactly 80%, and
the resulting compressed file is only 20% of the original file.
Therefore, the higher the compression ratio, the smaller the compressed
file, which is the whole purpose of data compression.
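
The worked example above translates directly into a small helper; the function below is just an illustration of the formula, with an arbitrary name.

```python
def compression_ratio(original_size, compressed_size):
    """Percentage of bytes removed, per the formula above."""
    return (1 - compressed_size / original_size) * 100

print(compression_ratio(250, 50))   # 80.0, i.e. the output is 20% of the original
```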

***

[1] **A**merican **S**tandard **C**ode for **I**nformation **I**nterchange (ASCII)

[2] **E**xtended **B**inary **C**oded **D**ecimal **I**nterchange **C**ode