Caching with CORS

Senior Professional Services Engineer

July 14, 2015

Quick shout-out to JSONP

Before diving into CORS (cross-origin resource sharing), I need to mention JSONP, which is the other solution for getting data from a different “origin.” In Using ESI, Part 2: Leveraging VCL and ESI to Use JSONP, Simon explains what JSONP is and how to cache it with Fastly, using one Fastly-specific feature, req.topurl. Now, with Varnish 4.1, req_top.url (note the underscore and the extra period) is available, which allows you to do the same thing with vanilla Varnish.

The problem

So what is the problem that JSONP and CORS are trying to solve? Getting data (usually JSON) from a third party and using it from JavaScript. For security reasons, AJAX (XMLHttpRequest) requests are not allowed to retrieve data from (or send data to) another origin. So if your website is http://www.example.com/, JavaScript on your site is not allowed to load data from https://api.thirdparty.example/. JSONP gets around this by wrapping the data in a callback function, and CORS accomplishes it by explicitly granting access through response headers.

Origin in this context is the scheme, hostname and (optional) port part of the URL. So for both http://www.example.com/ and http://www.example.com/js/foo.js the origin is http://www.example.com. For https://api.thirdparty.example/v1/data the origin is https://api.thirdparty.example.

The advantages of CORS over JSONP

Even though JSONP is cacheable using ESI, it’s a rather complex setup in Varnish, and requires more CPU because of the ESI sub-request. As I will show in this post, CORS only requires some header manipulation.
JSONP only allows GET requests, since it relies on the <script> tag. CORS lets the server specify which methods are allowed, so that PUT, POST, DELETE, and more become available.

Caching CORS responses

If you don’t care which domains access your API, all you need to do is add the following header to its responses:

Access-Control-Allow-Origin: *

Since there’s no variance in this header, there’s nothing special in caching these responses. You can just set the TTL as you normally would using beresp.ttl or Cache-Control: max-age.
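If the backend does not emit the header itself, one option is to add it at the cache instead. The following is a minimal sketch, assuming the same VCL dialect as the other examples in this post (vcl_fetch and beresp) and a backend that is not CORS-aware:

```vcl
sub vcl_fetch {
  # Assumption: the backend does not send CORS headers, so add the
  # wildcard only when the header is missing. Because it is set on
  # beresp, the header is stored with the cached object.
  if (!beresp.http.Access-Control-Allow-Origin) {
    set beresp.http.Access-Control-Allow-Origin = "*";
  }
}
```

Since the header is identical for every client, no Vary entry is needed for it.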

The same principle applies if you’re only allowing a single origin, except you would list the origin instead of the *:

Access-Control-Allow-Origin: http://www.example.com

And caching is just as simple as with allowing all origins; set a TTL and you’re done.

Where things get tricky is if you want to allow multiple origins, say both http://www.example.com and https://www.example.com. The W3C Recommendation for CORS specifies that the Access-Control-Allow-Origin header can take a space-separated list of origins, but immediately warns that in practice, browsers only allow a single origin to be listed. This means that Access-Control-Allow-Origin needs to be set depending on the value of the Origin header in the request.

To still be able to cache these responses, you will have to use the Vary header. If you are not familiar with how this header works, I refer you to a blog post about Vary that I wrote a while ago, which explains it in depth.

The easiest implementation would be to just add Origin to the Vary header. A typical request and response would look something like this:

GET /v1/data HTTP/1.1
Host: api.example.com
Origin: http://www.example.com

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 4365
Access-Control-Allow-Origin: http://www.example.com
Vary: Origin
Cache-Control: max-age=3600

This would cache for an hour, and be served from cache for any request that has http://www.example.com as its origin. The problem with this approach is that every distinct Origin value gets its own cache entry, so a request from any origin that is not yet in your cache goes to your backend, even if that origin is not allowed. So normalizing the Origin header is key.

You should normalize to either one of the values that are allowed, or nothing. Basically you’re whitelisting certain values, and deleting everything else. Here’s what the VCL would look like:

sub vcl_recv {
  if (req.http.Origin != "https://www.example.com"
      && req.http.Origin != "http://www.example.com"
      && req.http.Origin != "http://www.friends.example") {
    unset req.http.Origin;
  }
  ...
}
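As the whitelist grows, the chain of comparisons can get long. The same check could be written as a single regular expression. This is only a sketch of an alternative and uses the same three example origins as above:

```vcl
sub vcl_recv {
  # Same whitelist as above, written as one anchored regex.
  # The ^ and $ anchors matter: without them a value such as
  # http://www.example.com.evil.example would also match.
  if (req.http.Origin !~ "^(https?://www\.example\.com|http://www\.friends\.example)$") {
    unset req.http.Origin;
  }
}
```

A request without an Origin header also fails the match, and unsetting a header that is not there is harmless.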

And if your backend doesn’t send a Vary header with Origin in it:

sub vcl_fetch {
  if (beresp.http.Vary) {
    set beresp.http.Vary = beresp.http.Vary + ",Origin";
  } else {
    set beresp.http.Vary = "Origin";
  }
  ...
}
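If you are not sure whether your backend already lists Origin in Vary, a slightly more defensive variant only appends it when it is missing. This is a sketch; the regex is my own and simply looks for Origin as a whole, case-insensitive entry in the comma-separated list:

```vcl
sub vcl_fetch {
  if (!beresp.http.Vary) {
    set beresp.http.Vary = "Origin";
  } else if (beresp.http.Vary !~ "(?i)(^|,)\s*origin\s*(,|$)") {
    # Vary is present but does not list Origin yet, so append it.
    set beresp.http.Vary = beresp.http.Vary + ",Origin";
  }
}
```

This avoids ending up with a header like Vary: Origin,Origin when the backend starts sending it later on.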

However, now the list of allowed origins is in both your VCL and your application. And for each allowed origin, there’s a copy of the response in the cache, which uses up space, and each copy is the result of a backend request.

Luckily, setting headers is something Varnish is really good at. :)

So here’s some VCL that does not Vary on Origin, so you have a single copy of the response in your cache, and then sets the Access-Control-Allow-Origin header if the Origin in the request is on your whitelist.

sub vcl_deliver {
  if (req.http.Origin == "https://www.example.com"
      || req.http.Origin == "http://www.example.com"
      || req.http.Origin == "http://www.friends.example") {
    set resp.http.Access-Control-Allow-Origin = req.http.Origin;
  }
  if (resp.http.Vary) {
    set resp.http.Vary = resp.http.Vary + ",Origin";
  } else {
    set resp.http.Vary = "Origin";
  }
  ...
}

You might have noticed that Vary is still set, but in this case we’re setting it on resp, not on beresp. This is to make sure that any caches between your Varnish and the browser, which you have no control over, still do the right thing, which is to cache the response, but still serve different variations based on the Origin header.

The value of beresp.http.Vary (which you can only set in vcl_fetch, before the object enters the cache) is used to determine how the object should be cached. You can only set resp.http.Vary in vcl_deliver, which is after the object has been inserted into the cache, but before the response is sent downstream, i.e. to the browser or an intermediate cache.

The VCL example above does assume that your backend does not know about CORS, and doesn’t send either Vary: Origin or Access-Control-Allow-Origin. If it does, you will have to take that into account, like so:

sub vcl_recv {
  # Save Origin in a custom header
  set req.http.X-Saved-Origin = req.http.Origin;
  # Remove Origin from the request so that the backend
  # doesn't add CORS headers.
  unset req.http.Origin;
  ...
}

sub vcl_deliver {
  if (req.http.X-Saved-Origin == "https://www.example.com"
      || req.http.X-Saved-Origin == "http://www.example.com"
      || req.http.X-Saved-Origin == "http://www.friends.example") {
    set resp.http.Access-Control-Allow-Origin =
                                   req.http.X-Saved-Origin;
  }
  if (resp.http.Vary) {
    set resp.http.Vary = resp.http.Vary + ",Origin";
  } else {
    set resp.http.Vary = "Origin";
  }
  ...
}

Other CORS headers

Access-Control-Allow-Origin is the most used header, but other response headers have similar ramifications. If they are different depending on Origin, make sure that Origin is in your Vary, or add them to the response using VCL.
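As an illustration of adding another CORS header using VCL, here is a sketch that sets Access-Control-Expose-Headers next to Access-Control-Allow-Origin for whitelisted origins. The exposed header name X-Request-Id is hypothetical, and the example assumes your backend does not send CORS headers itself:

```vcl
sub vcl_deliver {
  if (req.http.Origin == "https://www.example.com"
      || req.http.Origin == "http://www.example.com"
      || req.http.Origin == "http://www.friends.example") {
    set resp.http.Access-Control-Allow-Origin = req.http.Origin;
    # Hypothetical: let scripts on whitelisted origins read a
    # custom response header called X-Request-Id.
    set resp.http.Access-Control-Expose-Headers = "X-Request-Id";
  }
}
```

Because both headers are added at delivery time, the cached object itself stays the same for every origin.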

CORS also defines two more request headers, which browsers might use in a preflight request before doing something like a PUT or DELETE request. I will discuss those in a future Varnish tip.

Recap

To deal with caches where you are not in control, it’s important that Vary contains Origin. For the most efficient caching with Varnish, it is best to put all CORS logic in VCL.