Kristian Klausen authored
It does more harm than good:
* Some comment links aren't converted[1]
* Old http://bugs.archlinux.org/task/ links aren't downloaded due to
  --max-redirect 0
* Links with query parameters are downloaded multiple times, ex:
  https://bugs.archlinux.org/task/69462?string=tt-rss from [2]

This reverts commit 94b035d4.

[1] https://bugs.archlinux.org/task/13059
[2] https://bugs.archlinux.org/task/69718
5ff17f7c

Arch Linux Bugs Snapshotter (ALBS)

With the sunsetting of bugs.archlinux.org on the horizon (give or take a few years), it is time to think about how we are going to archive it so that future generations can still access it.

This is a take on it! :)

Usage

First install wget, rsync, libxslt and prettier, then run:

$ ./snapshotter.sh [maximum number of tasks to download] [download attachment: true (default) or false] [prettify the HTML files: true (default) or false] [download dir, default: snapshots/2021-04-01T22:52+02:00]
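The positional arguments and their defaults could be handled roughly as sketched below. This is not taken from snapshotter.sh: the helper name `parse_args`, the variable names, the treatment of an empty first argument as "no limit", and the `date` format (picked to resemble the default directory shown above; `%:z` requires GNU date) are all assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# parse_args: hypothetical helper mirroring the four positional arguments.
parse_args() {
  # Assumption: an empty value means "no limit" in this sketch.
  max_tasks=${1:-}
  download_attachments=${2:-true}
  prettify_html=${3:-true}
  # Assumed format resembling snapshots/2021-04-01T22:52+02:00 (GNU date only).
  download_dir=${4:-snapshots/$(date +%Y-%m-%dT%H:%M%:z)}
}

# Example: at most 100 tasks, no attachments, remaining arguments left at defaults.
parse_args 100 false
printf '%s\n' "$max_tasks" "$download_attachments" "$prettify_html" "$download_dir"
```

Under these assumptions, the equivalent call to the real script would be `./snapshotter.sh 100 false`.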

How It Works

  1. https://bugs.archlinux.org/index.php?project=0&status[]=&changedfrom=2021-04-01 is scraped to get the newest task ID
  2. The range of tasks to download is decided:
    • If $ALBS_RANGE_DOWNLOAD_ENABLED = true then:
      • A range of tasks is computed based on $ALBS_RANGE_DOWNLOAD_CHUNK and $ALBS_RANGE_DOWNLOAD_CHUNKS
    • else:
      • $min=0
      • $max=$new_task_id
  3. A list of URLs is generated: https://bugs.archlinux.org/task/{$min..$max}
  4. wget starts downloading the URLs, including page requisites and linked user pages
  5. xsltproc is run on all the HTML files to clean up the HTML (remove navbar entries, login form etc.)
  6. prettier is run on all the HTML files to prettify the HTML (primarily fixing indentation)
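
Steps 2 and 3 can be sketched as follows for the non-range case only ($min=0, $max=$new_task_id). The helper name `generate_task_urls` and the example task ID are made up for illustration. The range mode is left out because its computation from $ALBS_RANGE_DOWNLOAD_CHUNK and $ALBS_RANGE_DOWNLOAD_CHUNKS is not described here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# generate_task_urls MIN MAX: hypothetical helper that prints one task URL
# per line for every ID from MIN to MAX, inclusive.
generate_task_urls() {
  local min=$1 max=$2 id
  for ((id = min; id <= max; id++)); do
    printf 'https://bugs.archlinux.org/task/%d\n' "$id"
  done
}

# Made-up value standing in for the newest task ID scraped in step 1.
new_task_id=3
generate_task_urls 0 "$new_task_id"
```

A list in this one-URL-per-line form could be written to a file and passed to wget with `--input-file`, though whether the script does it that way is not stated here.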