CDN from Scratch (4) Cache Control

0. Foreword

Starting from this article, this series will cover specific configuration items. The configuration pages differ from platform to platform but are broadly similar; you can find the corresponding page by the name of the configuration item.

Regarding caching, let's start with a clear definition:

Web caching (or HTTP caching) is an information technology used to temporarily store (cache) Web documents such as HTML pages and images to reduce server latency.
Web caching systems save copies of documents that pass through the system; if certain conditions are met, subsequent requests can be served from the cache.
Web caching systems can refer to both devices and computer programs.

"Cache" can refer to a server-side cache, i.e. a cache stored on CDN nodes or a reverse proxy through explicit user configuration or the platform's defaults; it can also refer to a client-side cache, where content is kept on the client for a certain period so that the client no longer requests the server at all.

Websites can be roughly divided into dynamic websites, mixed dynamic-and-static websites, and static websites. This is not necessarily the best classification, but the best practice follows from it: dynamic content should in principle never be cached, while static content should live on a separate domain name with its own caching rules.

1. Client cache

Taking Nginx as an example, using location blocks you can write:

Nginx
location ~* \.(css|js|map|scss)$ {
     expires 1d;
}

location ~* \.(avif|bmp|gif|ico|jpeg|jpg|pjpeg|png|svg|swf|tiff|webp)$ {
     expires 7d;
}

location ~* \.(otf|ttc|ttf|woff|woff2)$ {
     expires 30d;
}
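
For reference, the expires directive works by setting the Expires and Cache-Control: max-age=... response headers. A roughly equivalent explicit form of the image rule above, as a sketch (note that "public" here is an addition beyond what expires itself emits):

Nginx
location ~* \.(avif|bmp|gif|ico|jpeg|jpg|pjpeg|png|svg|swf|tiff|webp)$ {
     # Roughly equivalent to "expires 7d": 7 days = 604800 seconds;
     # "public" additionally allows shared (CDN/proxy) caches to store it.
     add_header Cache-Control "public, max-age=604800";
}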

If pseudo-static rules or a reverse proxy are used at the same time, these location blocks will conflict with them and prevent such files from hitting the other rules, so you can use if conditions instead:

Nginx
# Pseudo-static rules or reverse proxy rules
location / {
     # Omit rule content
     ...
     if ( $uri ~* "\.(css|js|map|scss)$" )
     {
         expires 1d;
     }
     if ( $uri ~* "\.(avif|bmp|gif|ico|jpeg|jpg|pjpeg|png|svg|swf|tiff|webp)$" )
     {
         expires 7d;
     }
     if ( $uri ~* "\.(oft|ttc|ttf|woff|woff2)$" )
     {
         expires 30d;
     }
}

Here are my recommended caching rules: cache time of front-end files < images < fonts. This is exactly the reverse of how often these files are updated: front-end files may need fine-tuning from time to time (I'm not the only one with front-end obsessive-compulsive disorder who wants to fix the layout the moment something looks misaligned); images generally never change, though icons and avatars might; and font files are essentially never changed.

Also, if the filename comes with a hash, like:

/assets/style.6f4e6432.css
/assets/app.ee143e90.js
/assets/inter-roman-latin.2ed14f66.woff2

Generally speaking, you can boldly cache such files for a very long time, say 1145141919810 seconds, because when a file is updated its name gets a new hash, so the old cache entry is simply never hit again. Qingbei CDN adopts this approach: every time the VitePress documentation is updated, only the page cache needs to be purged, and the latest static resources are pulled automatically.
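
Taking Nginx as an example, a minimal sketch of such a rule (the 8-hex-digit hash pattern is my assumption based on the filenames above; adapt it to whatever your bundler actually emits):

Nginx
# Hash-named build artifacts: cache for a year and mark them immutable
# so browsers skip revalidation entirely. The [0-9a-f]{8} pattern is
# an assumption; the quotes are required because the regex contains {}.
location ~* "\.[0-9a-f]{8}\.(css|js|woff2)$" {
     add_header Cache-Control "public, max-age=31536000, immutable";
}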

However, client-side caching is unreliable. Simply pressing Ctrl+F5 bypasses the cache for the current page and reloads it, and the various XX Guards and XX Clean Masters are always eager to help users clean up cache files:

Your online garbage has reached 114514.1919810 Bytes! It is recommended that you clean it up immediately!

Therefore, we also need server-side caching to ensure that users hit the cache whenever they fetch files that should be cached.

2. Server cache

2.1. Dynamic website

Dynamic content (not only PHP, ASP, and the like, but also page files such as HTML) should in theory not be cached, but if you do need to cache it, you can configure it like this (taking Nginx as an example):

Nginx
location /
{
     # Other content omitted
     ...
     proxy_ignore_headers Set-Cookie Cache-Control Expires;
     proxy_cache cache_one;                   # shared cache zone, declared elsewhere
     proxy_cache_key $host$uri$is_args$args;
     proxy_cache_valid 200 304 301 302 1d;    # cache these status codes for 1 day
}

The cache time unit can be m (minutes), h (hours), d (days), or y (years).
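
Note that the cache_one zone referenced by proxy_cache must be declared in advance with proxy_cache_path in the http block; a minimal sketch (the path, zone size, and limits here are placeholder assumptions, not recommendations):

Nginx
# Declares the shared-memory zone "cache_one" used by proxy_cache above.
http {
     proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=cache_one:64m
                      max_size=1g inactive=1d use_temp_path=off;
}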

Cloudflare automatically determines whether a file should be cached. If we need to cache files that are not cached by default, such as HTML files, we need a Page Rule matching them with "Cache Level: Cache Everything":

[Screenshot: Cloudflare Page Rules]

However, after adding this, the cache status is still DYNAMIC, i.e. the dynamic content is not cached. The reason is that even with caching turned on, Cloudflare will still not cache a file if the origin does not send a Cache-Control header. So just add a few lines on the origin:

Nginx
location ~* \.(html)$ {
     expires 1h;
     add_header Cache-Control "public, max-age=3600";
}

That’s it.

2.2. Dynamic and static websites

In fact, accelerating static and dynamic content works almost the same way; the difference is that the former also needs rules for which files match which cache settings. Here, if the CDN does not check whether the origin sends a Cache-Control header the way Cloudflare does, a "cache everything" rule is a mistake: once enabled with high priority in the CDN, it overrides all other rules, every file gets cached, and updates to the website stop taking effect. (Some people even cache everything without excluding the admin backend directory, so visitors get to see the admin interface without logging in XD)

My suggestion is to move static resources onto separate domain names, so that a mixed website can be split into a dynamic website and a static website and configured separately, saving a lot of trouble.
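
For example, a dedicated static-resource server block could look like the following sketch (static.example.com and the root path are hypothetical):

Nginx
# A hypothetical static-only domain: everything it serves is static,
# so one blanket caching rule suffices and cannot clash with dynamic rules.
server {
     listen 80;
     server_name static.example.com;   # hypothetical domain
     root /var/www/static;             # assumption: where static files live

     location / {
          expires 30d;
     }
}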

2.3. Static website

If you use the strategy above, you can put all static resources under a single domain name to reduce DNS lookup latency, but this runs into a browser connection limit, for which there is a corresponding workaround:

The HTTP/1.1 specification, RFC 2616, stipulated that at most 2 simultaneous connections were allowed to the same domain name. Because this requirement seemed unreasonable, later browser implementations all ignored it, and the restriction was eventually removed in RFC 7230. However, to keep things fair, browsers still allow at most 6 simultaneous connections per domain name.

If you want to download dozens or even hundreds of files simultaneously (this is not unusual: open DevTools next time you visit GitHub and you will see no fewer than 50 HTTP requests), the most common way to squeeze more out of the browser is domain name hashing (a.k.a. domain sharding). GitHub's avatar{0..3}.githubusercontent.com, Bilibili's i{0..3}.hdslb.com, and JD.com's img{1..30}.360buyimg.com all do this. However, domain name hashing brings the following problems:

  1. The cost of establishing an HTTP connection is huge: each domain name requires its own DNS query, TCP three-way handshake, and a TLS handshake on top of that.
  2. Every HTTP/1.1 connection must carry the same information such as User-Agent, and for non-cookie-free domain names the Cookie header as well, which wastes traffic.
  3. More concurrent and keep-alive connections put a performance burden on both the client and the server.
  4. If the same resource is hashed to different domain names on different pages, the HTTP cache cannot be used effectively.

In short, taking this site as an example, the front-end resources it references are distributed across only two domain names:

cdn.tsinbei.com
cdn.staticfile.org

Referencing resources from two or three domain names keeps the cost of DNS lookups within an acceptable range.

The rest of the cache configuration is the same as above; just adapt the location or if blocks.

3. Postscript

Many people have a misconception about reducing the load on the origin site:

"Just use CDN to reduce the pressure on the origin site!"

However, if the website has dynamic content, a CDN still has to go back to the origin, so it cannot reduce the origin's load. The correct solution is clustered, distributed deployment, or simply upgrading the origin server.

If it is static resources, however, caching and CDN can do a great deal to reduce the bandwidth and concurrency the origin spends serving them: once cached, resources are served from the cache until it expires, and with CDN distribution the bandwidth and concurrency bottleneck is no longer the origin but the CDN. Using object storage is also a good choice.

With the cache properly configured, even an origin on a 1 Mbps small water pipe can perform like a 1 Gbps big water pump.
