Danny van Kooten

HTML compression on popular websites

TLDR: I grabbed a list of the 10,000 most popular domains on the internet, downloaded each homepage and checked which compression techniques were applied. Surprisingly, quite a few of them (~8%) are not applying any compression at all, with some of them leaving terabytes of potential monthly data savings on the table ($$$ + a slower website).

Some notable entries that caught my eye are the websites of the US Department of State, multiple country specific branches of Lidl, the Python programming language, Klarna and Zapier.

At its core, a website is just a collection of text, image and video files.

The text files include the actual content of the site (HTML), how this content should be displayed (CSS) and any dynamic code that should run in your browser (JavaScript).

All of these text files are highly suitable for lossless compression, which replaces repeated text patterns with a pointer to the previous occurrence of the pattern. As a result, larger text files result in higher compression ratios. Compression ratios of well over 80% are not uncommon, especially for files well over 100 kB.

All things considered, compression is a net win for everyone involved:

  • As a website owner, your bill for outgoing network traffic is lower.
  • As a website visitor, you’re downloading less data so the website loads faster.

Gzip and Brotli are the two most commonly used compression algorithms on the web right now. Both strike a nice balance between compression ratio and performance, and most web server software (like Nginx or Caddy) supports them out of the box.
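In nginx, for example, enabling gzip takes only a few lines of configuration. This is an illustrative snippet, not taken from any of the sites discussed; the type list and minimum length are just sensible defaults:

```nginx
# Enable gzip compression for responses.
gzip on;

# Compress common text types besides text/html (which is always included).
gzip_types text/css application/javascript application/json image/svg+xml;

# Don't bother compressing tiny responses.
gzip_min_length 256;
```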

You would think any website with millions of unique visitors per month would be using compression, right? Let’s find out!

10,000 HTTP requests later

I grabbed a list of the top 10,000 domains on the internet (according to their Alexa rank) and then made an HTTP request to each of them, discarding any error responses or requests that failed to return a response within 10 seconds.

The User-Agent HTTP header was set to match that of my browser. Accept-Encoding was set to br, gzip, deflate, again matching that of my browser.

To estimate the number of bytes that could be saved by applying compression, I ran each response body through Go’s compress/gzip package using the default compression level.


Successful requests

Of all 10,000 HTTP requests, just under 7,900 managed to return a successful HTML response within 10 seconds.

  • HTTP 200 OK: 7,885
  • Non-OK response / request timed out: 2,015

Compression

Of these 7,900 HTML responses, about 8% did not apply any compression.

  • Gzip: 5,028
  • Brotli: 2,190
  • None: 663
  • Deflate: 4

On average, about 55 kB of data transfer per response could be saved by enabling compression on these uncompressed HTML responses.


Notable sites not using compression

Lidl.cz, the Czech branch of a popular supermarket chain here in Europe, topped the list by shipping 1.65 MB of uncompressed HTML. Applying gzip at compression level 2 would reduce this to just under 200 kB, a reduction of almost 90%.

Another interesting site that caught my eye was Python.org. Apparently the sample nginx configuration provided by Heroku was missing the gzip_proxied any directive needed to compress proxied responses.
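For reference, the relevant nginx configuration looks something like this (a minimal illustration, not the actual Python.org config): gzip_proxied controls whether responses nginx fetches from an upstream server get compressed at all, and it defaults to off.

```nginx
gzip on;

# Compress responses that nginx proxies from an upstream server, too.
# Without this directive, proxied responses are never gzipped
# (the default is "gzip_proxied off").
gzip_proxied any;
```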

This was addressed back in 2021, but the fix was never merged upstream into the Python website codebase. Hopefully once my PR gets merged, it will be!

Other popular websites that I was surprised to see were klarna.com, zapier.com and state.gov.

You can browse the full list of websites not applying any compression here.

Running the math on data transfer that could be avoided

Take tmz.com as an example, #2 on the list:

  • They could save 665 kB on their homepage’s HTML by applying compression.
  • Similarweb estimates their monthly traffic at about 60M visits.
  • The cache lifetime on their HTML responses is set to only 60 seconds, so a substantial amount of traffic will be downloading the HTML over and over again.

Multiplying 665 kilobytes by 10 million already gets you to about 6.14 terabytes of unnecessary data transfer.

That’s 6.14 terabytes of data transfer that could be avoided by adding a single line to a configuration file somewhere:

```nginx
gzip on;
```

If you happen to know anyone working at these websites, please let them know about this opportunity. There’s no good reason not to be using compression, so this is likely due to an oversight or misconfiguration on their part.


The code for this experiment is up on GitHub here: dannyvankooten/top-websites-compression.