Data Compression: Making the Big Smaller and Faster (Part 1)
In this article, we will spend some time understanding the basics of compression and how it works.
How important is data compression? The sharing of information in a fast and efficient manner has been an area of constant study and research. Companies like Google and Facebook have spent a lot of time and effort trying to develop faster and better compression algorithms. Compression algorithms have existed since the ’70s and the ongoing research to have better algorithms proves just how important compression is for the Internet and for all of us.
The Need for Data Compression
The World Wide Web (WWW) has undergone a lot of changes since it was made available to the public in 1991. Believe it or not, the copy of the world’s first website can still be browsed here. Back then, webpages were very simple. Today, they are increasingly more complex and there is an evident need to have compression algorithms that are lossless, fast, and efficient.
There are several best practices that help optimize page load time, read this blog post to learn about webpage optimization techniques. In this article, we will spend some time understanding the basics of compression and how it works. We will also cover a new type of compression method called “Brotli” in the second part of this blog.
Encoding and Data Compression
Let’s start by understanding what data encoding and compression are:
The word “compression” comes from the Latin word compressare, which means to press together. “Encoding” is the process of placing a sequence of characters in a specialized format that allows efficient data storage as well as transmission. Per Wikipedia: “Data compression involves encoding information using fewer bits than the original representation.”
Compression plays a key role when it comes to saving bandwidth and speeding up your site. Modern day websites involve a lot of HTTP requests and responses between the client (the browser) and the server to serve a webpage. With an overall increase in the number of HTTP requests and responses, it becomes important to ensure that these transfers are taking place at a fast and efficient rate.
HTTP works on a request-response model, as demonstrated below:
In this case, we are not using any compression method to compress the response being sent by the server.
- The browser sends an HTTP request asking for the Index.html page
- The server looks for the requested file and responds with the requested resource and a 200 OK HTTP status message
- The browser receives the server’s response and renders the page
As we can see, in this case there is no compression involved. The server responded with a 300 KB file (index.html page). If the file size was bigger, it would have taken more time for the response to be sent on the wire and this would have increased the overall page load time. Please note that we are currently looking only at a single HTTP response. Modern websites receive hundreds of such HTTP responses from the server to render a webpage.
The image below shows the same HTTP request – response between the browser and the server, but in this case, we use compression to reduce the size of the response being sent by the server to the browser.
Today, complex and dynamic websites generate hundreds of HTTP requests/responses. This made it important to have a system which would ensure fast and efficient data transfer between the server and the browser. This is when compression algorithms like Deflate and Gzip came into existence.
Introduction to Gzip
Gzip is a compression method that is used to make files smaller for storage and faster transmission over the network. Gzip is one of the most popular, powerful, and effective ways of compressing data and it can reduce the file size by up to 70%.
Gzip is based on the DEFLATE algorithm, which in turn is a combination of LZ77 and Huffman coding. Understanding how LZ77 works is essential to understand how compression methods like DEFLATE and Gzip work.
Developed in the late ’70s by Abraham Lempel and Jacob Ziv, the LZ77 method of compression looks for sequences of characters that recur in a text. It performs compression by replacing the recurring occurrences of strings using pointers that backreference identical strings, previously encountered in the text, that needs to be compressed.
The pointer or backreference is of the form <relative jump, length>, where relative jump signifies how many bytes are there between the current occurrence of the string and its last occurrence and length is the total number of identical bytes found.
Now let us understand this better with the help of an example. Assume, there is a text file with the following text:
As idle as a painted ship, upon a painted ocean.
In this file, we see the following strings: “as” and “painted” occurring multiple times. What LZ77 method does is, it replaces multiple occurrences of strings with the notation: <relative jump, length>.
So using LZ77, the text will get encoded in the following way:
As idle <8,2> a painted ship, upon a <21,7> ocean.
To encode the text, we took the following steps:
- Looked at the string and tried to find occurrences of the same “string” or “substrings”.
- Replaced multiple occurrences of a string with the notation: <relative jump, length>; The two strings: “as” and “painted” were replaced the multiple occurrences of the strings with <relative jump, length>.
- The string “painted” which would have earlier occupied 7 bytes (i.e. the number of characters in the word: “painted”) X 1 byte = 7 bytes was compressed to occupy only 2 bytes. 2 bytes or 16 bits is the size of the pointer or backreference.
Huffman Coding is another lossless data compression algorithm. The frequency of occurrence of a string in a text file or pixels in images form the basis of Huffman coding. To get a deeper understanding of this algorithm, read this detailed tutorial that clearly explains how Huffman Coding works.
For the first test run, we did not specify any encoding to be used by passing the custom header: Accept-Encoding: identity along with the request. The first image shows no Content-Encoding being passed for the request.
In the second image, the browser is sending Accept-Encoding:gzip, for which the server is sending zipped file as the response.
We can clearly see how Gzip can drastically compress the files to improve data transmission rate over the wire.
Catchpoint’s Scheduled tests also highlight the difference between compressed and not-compressed content loading on webpages.
A new compression method called Brotli was introduced not too long ago. The Brotli compression algorithm is optimized for the web and specifically for small text documents. We will discuss more about this compression method and what it has to offer to the World Wide Web community in the second part of the article.