# Arch Infrastructure

This repository contains the complete collection of Ansible playbooks and roles for the Arch Linux infrastructure.

## Requirements

Install these packages:
  - terraform
  - python-typer
  - python-jmespath
  - moreutils (for playbooks/tasks/reencrypt-vault-key.yml)

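On Arch, for example, these can be installed in one go (a sketch, assuming all four are available in the official repositories under the names listed above):

    pacman -S terraform python-typer python-jmespath moreutils
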
### Instructions

All systems are set up the same way. For the first-time setup in the Hetzner rescue system,
run the provisioning script: `ansible-playbook playbooks/tasks/install-arch.yml -l $host`.

The provisioning script configures a sane basic systemd setup with sshd. By design, it is NOT idempotent.
After the provisioning script has run, it is safe to reboot.

Once in the new system, run the regular playbook: `HCLOUD_TOKEN=$(misc/get_key.py misc/vault_hetzner.yml hetzner_cloud_api_key) ansible-playbook playbooks/$hostname.yml`.
This playbook is the one regularly used for administrating the server and is entirely idempotent.

When adding a new machine, you should also deploy our SSH known_hosts file and update the SSH hostkeys file in this git repo.
For this you can simply run the `playbooks/tasks/sync-ssh-hostkeys.yml` playbook and commit the changes it makes to this git repository.
It will also deploy any new SSH host keys to all our machines.
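
For example (a sketch; the hostkeys file path is the one referenced in the backup documentation below, and the commit message is illustrative):

    ansible-playbook playbooks/tasks/sync-ssh-hostkeys.yml
    git add docs/ssh-known_hosts.txt
    git commit -m "Update SSH known_hosts"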

#### Note about GPG keys

The root_access.yml file contains the root_gpgkeys variable that determines the users that have access to the vault, as well as the borg backup keys.
All the keys should be in the local user's GPG keyring and at **minimum** be locally signed with --lsign-key. This is necessary for running either the reencrypt-vault-key
or the fetch-borg-keys tasks.
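
For example, to locally sign one of the keys (`<keyid>` stands for a fingerprint of one of the root_gpgkeys):

    gpg --lsign-key <keyid>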

#### Note about Ansible dynamic inventories

We use a dynamic inventory script in order to automatically get information for
all servers directly from hcloud. You don't really have to do anything to make
this work but you should keep in mind to NOT add hcloud servers to `hosts`!
They'll be available automatically.
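
To inspect what the dynamic inventory resolves to, you can run, for example:

    ansible-inventory --list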

#### Note about first time certificates

The first time a certificate is issued, you'll have to do this manually. First, configure the DNS to
point to the new server and then run a playbook against the server which includes the nginx role. Then, on the server,
it is necessary to run the following once:

    certbot certonly --email webmaster@archlinux.org --agree-tos --rsa-key-size 4096 --renew-by-default --webroot -w /var/lib/letsencrypt/ -d <domain-name>

Note that some roles already run this automatically.

#### Note about packer

We use packer to build snapshots on hcloud to use as server base images.
In order to use this, you need to install packer and then run

    packer build -var $(misc/get_key.py misc/vault_hetzner.yml hetzner_cloud_api_key --format env) packer/archlinux.json

This will take some time, after which a new snapshot will have been created on the primary hcloud archlinux project.

#### Note about terraform

We use terraform in two ways:

1. To provision a part of the infrastructure on hcloud (and possibly other service providers in the future)
2. To declaratively configure applications

For both of these, we have set up a separate terraform script. The reason for that is that sadly terraform can't have
providers depend on other providers, so we can't declaratively state that we want to configure software on a server which
itself needs to be provisioned first. Therefore, we use a two-stage process. Generally speaking, scenario 1. is configured in
`tf-stage1` and scenario 2. in `tf-stage2`. Maybe in the future, we can just have a single terraform script for everything,
but for the time being, this is what we're stuck with.

The very first time you run terraform on your system, you'll have to init it:

    cd tf-stage1  # and also tf-stage2
    terraform init -backend-config="conn_str=postgres://terraform:$(../misc/get_key.py group_vars/all/vault_terraform.yml vault_terraform_db_password)@state.archlinux.org"

After making changes to the infrastructure in `tf-stage1/archlinux.tf`, run

    terraform plan

This will show you planned changes between the current infrastructure and the desired infrastructure.
You can then run

    terraform apply

to actually apply your changes.

The same applies to changed application configuration, in which case you'd run
it inside of `tf-stage2` instead of `tf-stage1`, as shown below.

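For example (a sketch combining the commands above):

    cd tf-stage2
    terraform plan
    terraform apply
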
We store terraform state on a special server that is the only hcloud server NOT
managed by terraform so that we do not run into a chicken-and-egg problem. The
state server is assumed to just exist, so in the unlikely case that we have to
entirely redo this infrastructure, the state server would have to be set up
manually.

#### SMTP Configuration

All hosts should be relaying email through our primary mx host (currently 'orion'). See [docs/email.md](./docs/email.md) for full details.

#### Note about opendkim

The opendkim DNS data has to be added to DNS manually. The role verifies that the DNS is correct before starting opendkim.

The file that has to be added to the zone is `/etc/opendkim/private/$selector.txt`.

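To check that the published record matches, you can compare the file against a DNS query (a sketch; `<domain>` is a placeholder, and the query assumes the standard `<selector>._domainkey` location for DKIM records):

    cat /etc/opendkim/private/$selector.txt
    dig +short TXT $selector._domainkey.<domain>
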
### Putting a service in maintenance mode

Most web services with an nginx configuration can be put into maintenance mode by running the playbook with a maintenance variable:

    ansible-playbook -e maintenance=true playbooks/<playbook.yml>

This also works with a tag:

    ansible-playbook -t <tag> -e maintenance=true playbooks/<playbook.yml>

As long as you pass the maintenance variable to the playbook run, the web service will stay in maintenance mode. As soon as you stop
passing it on the command line and run the playbook again, the regular nginx configuration should resume and the service should accept
requests by the end of the run.

Passing maintenance=false will also prevent the regular nginx configuration from resuming, but will not put the service into maintenance
mode.

Keep in mind that passing the maintenance variable to the whole playbook, without any tag, will put all the web services that support
maintenance mode into maintenance mode. Use tags to affect only the services you want.

Documentation on how to add the maintenance mode to a web service is inside [docs/maintenance.md](./docs/maintenance.md).

### Finding servers requiring security updates

Arch-audit can be used to find servers in need of updates for security issues.

    ansible all -a "arch-audit -u"

#### Updating servers

The following steps should be used to update our managed servers (see the sketch after the list):

  * pacman -Syu
  * manually update the kernel, since it is in IgnorePkg by default
  * sync
  * checkservices
  * reboot

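A typical session on one host might look like this (a sketch: `checkservices` comes from the Arch Linux contrib scripts, and the exact kernel package to reinstall may differ):

    pacman -Syu
    pacman -S linux    # the kernel is in IgnorePkg by default, so update it explicitly
    sync
    checkservices
    reboot
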
## Servers

### orion

#### Services
  - repos/sync (repos.archlinux.org)
  - sources (sources.archlinux.org)
  - archive (archive.archlinux.org)

### luna

#### Services
  - mailman
  - projects (projects.archlinux.org)

### apollo

#### Services
  - wiki (wiki.archlinux.org)
  - archweb
  - patchwork

### aur.archlinux.org

#### Services
  - aurweb

### bugs.archlinux.org

#### Services
  - flyspray

### bbs.archlinux.org

#### Services
  - bbs

### phrik.archlinux.org

#### Services
  - phrik (irc bot). Users in the phrik group, defined in the host vars (re-using the archusers role), are allowed to restart the irc bot.

### dragon

#### Services
  - build server
  - sogrep

### state.archlinux.org

#### Services
  - postgres server for terraform state

### quassel.archlinux.org

#### Services
  - quassel core

### matrix.archlinux.org

#### Services
  - Matrix homeserver (Synapse)
  - Matrix ↔ IRC bridge

### homedir.archlinux.org

#### Services
  - ~/user/ webhost

### accounts.archlinux.org

This server is _special_. It runs keycloak and is central to our unified Arch Linux account management world.
It has an Ansible playbook for the keycloak service, but that playbook only installs the package and starts it; the actual configuration happens via a secondary Terraform file just for keycloak, `keycloak.tf`.
The reason for doing it this way is that the Terraform support for Keycloak is much superior, and it's declarative too, which is great for making sure that no old config remains after config changes.

So to set up this server from scratch, run:

  - `cd tf-stage1`
  - `terraform apply`
  - `cd ../tf-stage2`
  - `terraform import keycloak_realm.master master`
  - `terraform apply`

#### Services
  - keycloak

### mirror.pkgbuild.com

#### Services
  - Regular mirror.

### reproducible.archlinux.org

#### Services
  - Runs a master rebuilderd instance with two workers:
    - repro1.pkgbuild.com (PIA worker)
    - repro3.pkgbuild.com (packet.net machine which runs Ubuntu)

### runner1.archlinux.org

Slow-ish PIA box with spinning disks.

#### Services
  - GitLab runner

### runner2.archlinux.org

Medium-fast-ish packet.net box running Debian. It is currently maintained manually.

#### Services
  - GitLab runner

## Ansible repo workflows

### Replace vault password and change vaulted passwords

  - Generate a new key and save it as ./new-vault-pw: `pwgen -s 64 1 > new-vault-pw`
  - `for i in $(ag ANSIBLE_VAULT -l); do ansible-vault rekey --new-vault-password-file new-vault-pw $i; done`
  - Change the key in misc/vault-password.gpg
  - `rm new-vault-pw`

### Re-encrypting the vault after adding or removing a GPG key

  - Make sure you have all the GPG keys **at least** locally signed
  - Run the `playbooks/tasks/reencrypt-vault-key.yml` playbook and make sure it does not have **any** failed task
  - Test that the vault is working by running `ansible-vault view` on any encrypted vault file (see the example below)
  - Commit and push your changes
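
For example, using a vault file referenced elsewhere in this README:

    ansible-vault view group_vars/all/vault_terraform.yml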

### Fetching the borg keys for local storage

  - Make sure you have all the GPG keys **at least** locally signed
  - Run the `playbooks/tasks/fetch-borg-keys.yml` playbook, as sketched below
  - Make sure the playbook runs successfully and check the keys under the borg-keys directory
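
A minimal sketch of those two steps:

    ansible-playbook playbooks/tasks/fetch-borg-keys.yml
    ls borg-keys/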

## Backup documentation

Adding a new server to be backed up goes as follows:

* Make sure the new server's host key is synced to `docs/ssh-known_hosts.txt`; if not, run:

      ansible-playbook playbooks/tasks/sync-ssh-hostkeys.yml

* Add the server to `[borg-clients]` in `hosts`

* Run the borg role on u236610.your-storagebox.de to allow the new machine to create backups

      ansible-playbook playbooks/hetzner_storagebox.yml

* Run the borg role for rsync.net to allow the new machine to create backups

      ansible-playbook playbooks/rsync.net.yml

* Run the borg role on the new machine to initialize the repository

      ansible-playbook playbooks/$machine.yml -t borg

Backups should be checked now and then. Some common tasks are listed below.
You'll have to get the correct username from the vault.

### Listing current backups per server

    borg list ssh://<hetzner_storagebox_username>@u236610.your-storagebox.de:23/~/backup/<hostname>
    borg list ssh://<rsync_net_username>@prio.ch-s012.rsync.net:22/~/backup/<hostname>

Example

    borg list ssh://<hetzner_storagebox_username>@u236610.your-storagebox.de:23/~/backup/homedir.archlinux.org

### Listing files in a backup

    borg list ssh://<hetzner_storagebox_username>@u236610.your-storagebox.de:23/~/backup/<hostname>::<archive name>

Example

    borg list ssh://<hetzner_storagebox_username>@u236610.your-storagebox.de:23/~/backup/homedir.archlinux.org::20191127-084357

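### Extracting files from a backup

Restoring works similarly (a sketch using `borg extract`; the repository URL and archive name follow the same pattern as in the listing examples above, and the trailing path argument is optional):

    borg extract ssh://<hetzner_storagebox_username>@u236610.your-storagebox.de:23/~/backup/<hostname>::<archive name> <path/to/file>
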
## Updating Gitlab

Our GitLab installation uses [Omnibus](https://docs.gitlab.com/omnibus/) to run GitLab on Docker. Updating GitLab is as simple as running the Ansible gitlab playbook:

    ansible-playbook playbooks/gitlab.archlinux.org.yml -t gitlab

## One-shots

A bunch of once-only admin task scripts can be found in `one-shots/`.
We try to minimize the amount of manual one-shot admin work we have to do, but for some migrations such scripts are sometimes necessary.