- Dec 04, 2021
-
-
Jelle van der Waa authored
-
Jelle van der Waa authored
-
- Sep 22, 2021
-
-
Evangelos Foutras authored
This partially reverts commit c3d00264. The smartd changes are not ready to land yet and were included in the above commit by accident.
-
- Sep 17, 2021
-
-
Jelle van der Waa authored
phrik has a polkit rule for demize to restart phrik
-
- May 23, 2021
-
-
Signed-off-by: Leonidas Spyropoulos <artafinde@gmail.com>
-
Closes: #332
Signed-off-by: Leonidas Spyropoulos <artafinde@gmail.com>
-
- May 14, 2021
-
-
Kristian Klausen authored
-
- May 13, 2021
-
-
Co-authored-by: Kristian Klausen <kristian@klausen.dk>
-
- Apr 11, 2021
-
-
Jelle van der Waa authored
We want pacman/arch-audit notifications to be grouped, as otherwise we'll be spammed with ~X emails for every host. Closes: #191
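
The grouping idea can be sketched in a few lines of Python. This is only an illustration of bucketing alerts by a shared label so that each bucket produces one notification; it is not the Alertmanager configuration from the commit, and the alert names and hosts below are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname",)):
    """Bucket alerts by the given label names; each bucket maps to one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels.get("instance", "unknown"))
    return dict(groups)


# Hypothetical alerts: the same two rules firing on several hosts.
alerts = [
    {"labels": {"alertname": "pacman_updates_pending", "instance": "host-a"}},
    {"labels": {"alertname": "pacman_updates_pending", "instance": "host-b"}},
    {"labels": {"alertname": "pacman_updates_pending", "instance": "host-c"}},
    {"labels": {"alertname": "arch_audit_vulnerable", "instance": "host-a"}},
    {"labels": {"alertname": "arch_audit_vulnerable", "instance": "host-b"}},
]

grouped = group_alerts(alerts)
for key, instances in grouped.items():
    print(key[0], "->", ", ".join(instances))
```

With grouping on the alert name, the five firing alerts collapse into two notifications instead of one email per host and rule.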
-
- Apr 07, 2021
-
-
Jelle van der Waa authored
Reintroduce the arch-audit rule, as arch-audit no longer reports false positives from [testing]. Relax the high CPU alert, as our MediaWiki instance is perfectly fine running at 85% CPU for some time, and relax our "disk will fill within X" alert, as our borg backups generate enough data in a short time to trigger the 4 hour alarm.
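
Why a short burst of backup data can trip a "disk will fill within X" alert is easier to see with numbers. The sketch below fits a least-squares line through free-space samples and extrapolates it, in the spirit of a linear prediction; it is not the rule from the commit, and the function name and sample values are made up for illustration.

```python
def extrapolate_free_space(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, free) samples and
    extrapolate it `seconds_ahead` past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


FOUR_HOURS = 4 * 3600

# Hypothetical samples as (seconds, free GiB).
steady = [(0, 10.0), (3600, 9.0), (7200, 8.0)]   # 1 GiB/hour of normal growth
burst = [(0, 10.0), (1800, 7.0), (3600, 4.0)]    # a backup writing 6 GiB/hour

steady_prediction = extrapolate_free_space(steady, FOUR_HOURS)
burst_prediction = extrapolate_free_space(burst, FOUR_HOURS)

# A prediction below zero means "disk full within the window".
print("steady fires:", steady_prediction < 0)
print("burst fires:", burst_prediction < 0)
```

The burst is short-lived in practice, but a linear extrapolation over a small window assumes it continues, which is why a longer horizon reduces this kind of false alarm.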
-
- Mar 27, 2021
-
-
Jelle van der Waa authored
Our qemu TCG builds generate enough CPU usage over time that they should be excluded from warnings.
-
- Mar 01, 2021
-
-
Jelle van der Waa authored
Our dedicated servers are fairly slow when rebooting and are then not available for 5 minutes, which means a ServiceDown notification is sent for a normal reboot.
-
Jelle van der Waa authored
The value of the expr is not really useful as of now, but if we show the value of probe_ssl_earliest_cert_expiry it should show the date when the cert expires.
-
Jelle van der Waa authored
A Prometheus alert's $value is the result of the expression, so it will be the number of seconds since the last backup and not the last backup date.
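
The distinction can be illustrated with a small Python sketch. It assumes the alert expression has the form "current time minus last backup timestamp"; the function name and timestamps are hypothetical and only show why the expression result is an elapsed duration, while a date can only be rendered from the underlying timestamp.

```python
from datetime import datetime, timezone


def describe_backup(now_ts, last_backup_ts):
    """Return (elapsed seconds, formatted last-backup date) for two Unix timestamps."""
    # This difference is what an expression like `now - last_backup` evaluates to,
    # and therefore what $value would contain.
    elapsed_seconds = now_ts - last_backup_ts
    # The date has to come from the timestamp itself, not from the difference.
    last_backup = datetime.fromtimestamp(last_backup_ts, tz=timezone.utc)
    return elapsed_seconds, last_backup.strftime("%Y-%m-%d %H:%M UTC")


last_backup_ts = 1614556800          # 2021-03-01 00:00 UTC
now_ts = last_backup_ts + 90000      # 25 hours later

elapsed, last_backup_date = describe_backup(now_ts, last_backup_ts)
print(f"seconds since last backup: {elapsed}")
print(f"last backup at: {last_backup_date}")
```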
-
Jelle van der Waa authored
-
- Feb 27, 2021
-
-
Sven-Hendrik Haase authored
gemini takes a long time to run backups and it would sometimes produce false positives for not having backed up for some time. The higher threshold should help with those false positives.
-
- Feb 14, 2021
-
-
Kristian Klausen authored
yaml: truthy value should be one of [false, true] (truthy)
yaml: wrong indentation: expected 4 but found 2 (indentation)
yaml: too few spaces before comment (comments)
yaml: missing starting space in comment (comments)
yaml: too many blank lines (1 > 0) (empty-lines)
yaml: too many spaces after colon (colons)
yaml: comment not indented like content (comments-indentation)
yaml: no new line character at the end of file (new-line-at-end-of-file)
load-failure: Failed to load or parse file
parser-error: couldn't resolve module/action 'hosts'. This often indicates a misspelling, missing collection, or incorrect module path.
-
- Jan 31, 2021
-
-
- Jan 26, 2021
-
-
Jelle van der Waa authored
Collect btrfs errors for Prometheus using the btrfs command from btrfs-progs, which since 5.10 supports JSON output for device stats. In the future, the collected errors will trigger an alert when they reach a certain threshold.
-
- Jan 03, 2021
-
-
- Dec 18, 2020
-
-
Jelle van der Waa authored
Rebuilderd-workers are expected to have a high cpu load. Closes: #240
-
- Dec 16, 2020
-
-
Frederik Schwan authored
-
- Oct 29, 2020
-
-
Jelle van der Waa authored
Closes: #166
-
- Sep 20, 2020
-
-
- Sep 18, 2020
-
-
3 days is a bit too late. Certbot renews the certificate 30 days before, so 25 days should be safe and shouldn't cause any "false positives" due to transient errors.
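
The reasoning behind the 25 day threshold can be checked with a short sketch. It assumes the certificate expiry is available as a Unix timestamp (as probe_ssl_earliest_cert_expiry reports it); the function names and the sample values are illustrative, not taken from the actual rule.

```python
SECONDS_PER_DAY = 86400


def days_until_expiry(cert_expiry_ts, now_ts):
    """Days remaining until the certificate expires."""
    return (cert_expiry_ts - now_ts) / SECONDS_PER_DAY


def should_alert(cert_expiry_ts, now_ts, threshold_days=25):
    """Fire once fewer than `threshold_days` remain before expiry."""
    return days_until_expiry(cert_expiry_ts, now_ts) < threshold_days


now = 0  # arbitrary reference point for the example

# Renewal is due at 30 days left; a transient failure that delays it by
# two days leaves 28 days, which is still above the threshold.
print(should_alert(28 * SECONDS_PER_DAY, now))

# Renewal has been failing for almost a week: 24 days left, alert fires.
print(should_alert(24 * SECONDS_PER_DAY, now))

# With the old 3 day threshold the same situation would stay silent.
print(should_alert(24 * SECONDS_PER_DAY, now, threshold_days=3))
```

The five day gap between the renewal point and the threshold is what absorbs transient renewal errors while still leaving more than three weeks to react.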
-
- Sep 06, 2020
-
-
Jelle van der Waa authored
Record the rebuilderd queue length in Prometheus so we can generate an alert when the queue length keeps rising, as this could be an indication that the rebuilders have builds which are stuck.
-
Jelle van der Waa authored
Run the blackbox exporter on monitoring.archlinux.org to monitor the HTTP status of the public services we provide on other machines. Also adds an alert for when a certificate is about to expire in 3 days.
-
- Aug 31, 2020
-
-
Jelle van der Waa authored
Introduce a new monitoring server with prometheus and alertmanager for monitoring all our boxes.
-