Arch Linux Bugs Snapshotter (ALBS)
With the sunsetting of bugs.archlinux.org on the horizon (give or take a few years), it is time to think about how to archive it so that future generations can still access it.
This is a take on it! :)
Usage
First install wget, rsync, libxslt and prettier, then run:
$ ./snapshotter.sh [maximum number of tasks to download] [download attachment: true (default) or false] [prettify the HTML files: true (default) or false] [download dir, default: snapshots/2021-04-01T22:52+02:00]
How It Works
- https://bugs.archlinux.org/index.php?project=0&status[]=&changedfrom=2021-04-01 is scraped to get the newest task id
- The range of tasks to download is decided:
  - If $ALBS_RANGE_DOWNLOAD_ENABLED = true then:
    - A range of tasks is computed based on $ALBS_RANGE_DOWNLOAD_CHUNK and $ALBS_RANGE_DOWNLOAD_CHUNKS
  - else: $min=0 and $max=$new_task_id
- A list of URLs is generated: https://bugs.archlinux.org/task/{$min..$max}
- wget starts downloading the URLs, including page requisites and linked user pages
- xsltproc is run on all the HTML files to clean up the HTML (remove navbar entries, login form etc.)
- prettier is run on all the HTML files to prettify the HTML (primarily fixing indentation)
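
The range and URL-list steps above can be sketched roughly as follows. This is only an illustration, not the code from snapshotter.sh: the function names compute_range and generate_urls are hypothetical, and the way the chunk size and chunk count combine (here: download the newest chunk * chunks tasks) is an assumption, since the description above does not spell it out.

```shell
# Sketch only: names and chunk semantics are assumptions, not taken from snapshotter.sh.

# Print "min max" for the range of task ids to download.
# Args: newest task id, range download enabled (true/false), chunk size, number of chunks.
# The ( ) function body runs in a subshell, so the variables stay local.
compute_range() (
  newest_id=$1; enabled=$2; chunk=$3; chunks=$4
  min=0
  max=$newest_id
  if [ "$enabled" = true ]; then
    # Assumption: fetch only the newest chunk * chunks tasks.
    min=$(( newest_id - chunk * chunks + 1 ))
    if [ "$min" -lt 0 ]; then
      min=0
    fi
  fi
  echo "$min $max"
)

# Print one task URL per line for every id from min to max (inclusive).
generate_urls() (
  id=$1; max=$2
  while [ "$id" -le "$max" ]; do
    echo "https://bugs.archlinux.org/task/$id"
    id=$(( id + 1 ))
  done
)

# Example with a made-up newest task id of 70000, 2 chunks of 100 tasks each.
range=$(compute_range 70000 true 100 2)
min=${range% *}
max=${range#* }
generate_urls "$min" "$max" | tail -n 1
```

The output of generate_urls could be written to a file and handed to wget through its --input-file option, which keeps the URL generation separate from the download step.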