That's most certainly impossible without Lua or some other dynamic language. Even then, there are many limitations, as mentioned in the manual:
[it assumes] you're using fancy URLs like example.com/Page (we don't have that yet, !335 (merged))
[it assumes] your cache folder is inside your MediaWiki folder (we have cache in /srv/http/archwiki/cache and MediaWiki in /srv/http/archwiki/public; but this is probably easily solvable)
as with the Apache solution, you have to run a cron job or manually rebuild the cache files when they change (this is a major problem IMO)
pages whose title include non-ASCII characters are served through PHP (which should be transparent to the user anyway)
I think this is not entirely correct, since the Lua script does not do any encoding with ngx.var.request_uri. They have a more general note in the Apache section: "This will still not match titles containing periods, slashes or other punctuation which the file cache escapes but browsers don't (or vice versa)."
We have subpages with "/" in the title even in the main namespace, and localized pages with a Unicode suffix, which would reduce the efficiency of this approach.
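For illustration only, this is roughly what a static (Apache-style) nginx variant of that approach could look like. Everything in it, including the flat cache layout and paths, is an assumption based on the directories mentioned above; the real file cache escapes and hashes titles, so this is only meant to show where the approach breaks:

```
# Hypothetical sketch of serving the file cache without Lua. Assumes fancy
# URLs (/title/<Page>) and a flat layout like /srv/http/archwiki/cache/<Page>.html,
# which is NOT what MediaWiki's file cache really produces.
location ~ ^/title/(.+)$ {
    root /srv/http/archwiki;              # the cache lives outside the docroot
    # A real setup would also have to send logged-in users (session cookie)
    # straight to PHP and rebuild/purge stale files -- both omitted here.
    try_files /cache/$1.html @mediawiki;  # mismatches for subpages and non-ASCII titles, as noted above
}

location @mediawiki {
    # Hand everything else to MediaWiki (assumes an existing PHP-FPM location).
    rewrite ^/title/(.*)$ /index.php?title=$1&$args last;
}
```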
I should also note that:
Only anonymous requests can be served directly by the web server (or by anything other than PHP in general). So potential attackers can still cause problems regardless of the cache configuration, simply by creating an account. Was the last attack done by anonymous or authenticated requests? If you don't know, logging should be improved.
The file cache should apply only to pages which:
are not special pages
are not redirects
are being viewed in the current version, plain view, with no URL parameters
The first two points can't be checked in any way other than by running the PHP code (and this is in fact how MediaWiki behaves).
We had many problems with caching in the past where cached content was not invalidated correctly, so I'd reconsider whether adding even more complexity to it (caching is clearly not bug-free in MediaWiki itself) is the right way forward.
I have no idea how CDN caching is supposed to work in practice. You might need to set up Varnish or Squid for that. I'm not even sure if running them on the same host as the web server would actually improve performance...
Still, some of my notes above might be relevant to CDN as well. For example, is it possible to skip caching special pages and redirects? And are we certain that the last attack was by anonymous requests (authenticated would not be affected by CDN caching)? It is also hard to argue about caching strategies without some real performance data.
I have no idea how CDN caching is supposed to work in practice. You might need to set up Varnish or Squid for that. I'm not even sure if running them on the same host as the web server would actually improve performance...
CDN caching just sets the relevant headers (e.g. Cache-Control), which control how nginx caches the content. From an older nginx blog post:
How Does NGINX Determine Whether or Not to Cache Something?
By default, NGINX respects the Cache-Control headers from origin servers. It does not cache responses with Cache-Control set to Private, No-Cache, or No-Store or with Set-Cookie in the response header. NGINX only caches GET and HEAD client requests. You can override these defaults as described in the answers below.
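To make that concrete, here is a minimal sketch of what enabling such a cache in front of PHP-FPM might look like; the zone name, sizes, socket path and cookie name are all assumptions, not our actual configuration:

```
# Hypothetical fastcgi cache for the MediaWiki backend.
# (fastcgi_cache_path goes in the http block, the rest in the server block.)
fastcgi_cache_path /var/cache/nginx/archwiki levels=1:2 keys_zone=mediawiki:16m
                   max_size=1g inactive=10m;

location ~ \.php$ {
    include fastcgi.conf;
    fastcgi_pass unix:/run/php-fpm/php-fpm.sock;

    fastcgi_cache     mediawiki;
    fastcgi_cache_key "$scheme$request_method$host$request_uri";
    # nginx honours Cache-Control/Expires from the backend by default and does
    # not cache responses with Set-Cookie; this TTL only applies when the
    # response carries no caching header at all.
    fastcgi_cache_valid 200 301 302 5m;
    # Never serve cached pages to logged-in users (the cookie name is a guess).
    fastcgi_cache_bypass $cookie_archwiki_session;
    fastcgi_no_cache     $cookie_archwiki_session;
    # Expose the cache result, useful for debugging and monitoring.
    add_header X-Cache-Status $upstream_cache_status;
}
```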
For example, is it possible to skip caching special pages and redirects?
Should be doable, but what is the issue with caching special pages and redirects?
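For what it's worth, a rough sketch of how such a skip rule could look on the nginx side; the URL patterns and the variable name are assumptions:

```
# Hypothetical bypass rules: skip the cache for special pages and for anything
# with a query string (old revisions, action=..., etc.). Redirects cannot be
# detected from the URL alone, which is the harder part.
map $request_uri $wiki_skip_cache {
    default               0;
    "~^/title/Special:"   1;
    "~\?"                 1;
}
# ...and in the PHP location from the sketch above:
#   fastcgi_cache_bypass $wiki_skip_cache;
#   fastcgi_no_cache     $wiki_skip_cache;
```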
And are we certain that the last attack was by anonymous requests (authenticated would not be affected by CDN caching)?
We aren't certain (though I doubt the requests were authenticated), but all caching can be bypassed :)
It is also hard to argue about caching strategies without some real performance data.
@svenstaro ran a benchmark with oha during the meeting and it was around 40 r/s. Running a benchmark myself against https://wiki.archlinux.org/load.php?lang=en&modules=skins.vector.styles.legacy%2Cresponsive%7Czzz.ext.archLinux.styles&only=styles&skin=vector (which is already cached by nginx), I'm getting 780 r/s.
CDN caching just sets the relevant headers (e.g. Cache-Control), which control how nginx caches the content.
Interesting, so this should work
Should be doable, but what is the issue with caching special pages and redirects?
When we fix https://github.com/archlinux/archwiki/pull/29, I'd expect Special:Recentchanges to be always up to date even for anonymous users. Caching redirects might lead to infinite loops (and other problems) when they are changed while the old version is still cached.
But I'd expect MediaWiki to set Cache-Control: no-cache or Cache-Control: private for these pages as appropriate... (we should still check this)
Sure, but that's just one metric. How many r/s are there in practice? Which numbers indicate "high load" on the server? How many requests are anonymous and how many are authenticated? Or more interestingly, how many requests are cacheable (see the criteria in previous posts) and how many are not?
I don't have access to monitoring.archlinux.org now, but last time I checked, nginx was not monitored. This should be done first so we have some data to evaluate the changes.
Caching redirects might lead to infinite loops (and other problems) when they are changed while the old version is still cached.
We could implement purging, but that adds more complexity (it isn't supported natively by nginx). The simplest solution is to allow stale content for up to e.g. 5 minutes (my original idea). Would that break anything? E.g. workflows?
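A sketch of how that could be expressed in nginx, reusing the hypothetical cache zone from the earlier sketch (the 5-minute value is just the example above):

```
fastcgi_cache_valid 200 301 302 5m;              # cap cached entries at ~5 minutes
fastcgi_cache_use_stale error timeout updating;  # serve a stale copy while it is refreshed
fastcgi_cache_background_update on;              # refresh expired entries in the background
fastcgi_cache_lock on;                           # collapse concurrent misses into one upstream request
```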
But I'd expect MediaWiki to set Cache-Control: no-cache or Cache-Control: private for these pages as appropriate... (we should still check this)
I dug into the code (line 2569) and Cache-Control: s-maxage={$this->mCdnMaxage}, must-revalidate, max-age=0 is set for redirects and, I assume, also for special pages.
Sure, but that's just one metric. How many r/s are there in practice? Which numbers indicate "high load" on the server? How many requests are anonymous and how many are authenticated? Or more interestingly, how many requests are cacheable (see the criteria in previous posts) and how many are not?
I don't have access to monitoring.archlinux.org now, but last time I checked, nginx was not monitored. This should be done first so we have some data to evaluate the changes.
It has been added since then; this is the RPS for the last seven days:
We can't easily distinguish between anonymous and authenticated users (we are just parsing the nginx access log), but we could add $upstream_cache_status to the log after deploying caching, which would allow us to monitor the cache hit ratio.
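Something along these lines, presumably; the format name and field layout are made up, not the current config:

```
log_format wiki_cache '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" cache=$upstream_cache_status';
access_log /var/log/nginx/wiki.archlinux.org.access.log wiki_cache;
```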