Arch Linux Bugs Snapshotter (ALBS)
With the sunsetting of bugs.archlinux.org on the horizon (give or take a few years), it is time to think about how to archive it so that it remains accessible to future generations.
This is a take on it! :)
Snapshots
The snapshots branch contains a snapshot of https://bugs.archlinux.org, which is updated regularly.
Usage
First install wget, rsync, libxslt and prettier, then run:
$ ./snapshotter.sh [maximum number of tasks to download] [download attachment: true (default) or false] [prettify the HTML files: true (default) or false] [download dir, default: snapshots/2021-04-01T22:52+02:00]
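Before running the script, it can help to confirm that the required tools are on PATH. The following is a minimal sketch, not part of snapshotter.sh: the helper name missing_tools is hypothetical, and it assumes the binaries are called wget, rsync, xsltproc (the command-line tool shipped with libxslt) and prettier.

```shell
#!/bin/sh
# Hypothetical helper (not part of snapshotter.sh): print the name of
# every given tool that cannot be found on PATH, one per line.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
)

# Assumed binary names; xsltproc is provided by the libxslt package.
missing=$(missing_tools wget rsync xsltproc prettier)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
fi
```

Going by the argument order in the usage line above, an invocation that limits the run to 100 tasks and skips attachments would presumably look like this (the values are made up for illustration):
$ ./snapshotter.sh 100 false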
How It Works
- https://bugs.archlinux.org/index.php?project=0&status[]=&changedfrom=2021-04-01 is scraped to get the newest task id
- The range of tasks to download is decided:
  - If $ALBS_RANGE_DOWNLOAD_ENABLED = true then:
    - A range of tasks is computed based on $ALBS_RANGE_DOWNLOAD_CHUNK and $ALBS_RANGE_DOWNLOAD_CHUNKS
  - else: $min=0 and $max=$new_task_id
- A list of URLs is generated: https://bugs.archlinux.org/task/{$min..$max}
- wget starts downloading the URLs, including page requisites and linked user pages
- xsltproc is run on all the HTML files to clean up the HTML (remove navbar entries, login form etc.)
- prettier is run on all the HTML files to prettify the HTML (primarily fixing indentation)
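The first three steps above (finding the newest task id, deciding the range, generating the URL list) can be sketched in POSIX shell as follows. This is only an illustration under stated assumptions and is not taken from snapshotter.sh: all three function names are hypothetical, the scraping step assumes task links appear in the HTML as task/<id>, and the chunk arithmetic assumes that $ALBS_RANGE_DOWNLOAD_CHUNK is a 0-based chunk index and $ALBS_RANGE_DOWNLOAD_CHUNKS is the total number of chunks. The sketch uses an explicit loop for the URL list because bash performs brace expansion before variable expansion, so {$min..$max} does not expand to a sequence there.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the highest task id.
# Assumption: task links look like "task/<id>" in the page source.
newest_task_id() (
    grep -o 'task/[0-9][0-9]*' | cut -d/ -f2 | sort -n | tail -n 1
)

# Hypothetical helper: print "min max" for one chunk of the ids 0..newest.
# Assumption: chunk is a 0-based index, chunks is the total chunk count.
compute_range() (
    newest=$1 chunk=$2 chunks=$3
    # Ceiling of (newest + 1) / chunks, so all chunks together cover 0..newest.
    size=$(( (newest + chunks) / chunks ))
    min=$(( chunk * size ))
    max=$(( min + size - 1 ))
    if [ "$max" -gt "$newest" ]; then max=$newest; fi
    printf '%s %s\n' "$min" "$max"
)

# Print one task URL per line for the ids min..max (inclusive).
generate_urls() (
    id=$1
    while [ "$id" -le "$2" ]; do
        printf 'https://bugs.archlinux.org/task/%s\n' "$id"
        id=$(( id + 1 ))
    done
)

# Example for the "else" branch, with a made-up newest task id of 3.
new_task_id=3
generate_urls 0 "$new_task_id"
```

A list produced this way could then be handed to wget, for example through its --input-file option. The exact wget options used by snapshotter.sh are not shown here.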
Cloning
If you don't need the snapshots branch, you can do a partial clone and avoid downloading all the blobs needed by that branch:
$ git clone --filter=blob:none https://gitlab.archlinux.org/archlinux/archlinux-bugs-snapshotter.git
The snapshots branch also contains all the attachments. You can avoid downloading those as well by using sparse-checkout:
$ git sparse-checkout set '/*' '!/attachments/'
$ git sparse-checkout init
$ git checkout snapshots
Maintainer
ALBS is written and maintained by Kristian Klausen.