diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000000000000000000000000000000000000..fb394de7276b15d1d9fabdb57e46e82eafce6650
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "archlinux-common-style"]
+	path = archlinux-common-style
+	url = https://gitlab.archlinux.org/archlinux/archlinux-common-style.git
diff --git a/README.md b/README.md
index d8d4448ea1d6c6c61c6de211e3dc5c05e0b14d5f..5ad15bdabb4ee015242456885ef886f4aab4a1b6 100644
--- a/README.md
+++ b/README.md
@@ -6,11 +6,11 @@
 
 ## Installation
 
-1. In the directory `mysite` copy `local_settings.py.example` to `local_settings.py` and edit `DEBUG = True` and the `SECRET_KEY` variable.
+1. Copy `local_settings.py.example` to `local_settings.py` and edit `DEBUG = True` and the `SECRET_KEY` variable.
 2. Configure a connection to a [PostgreSQL](https://wiki.archlinux.org/index.php/PostgreSQL) database
-   in the [Django database settings](https://docs.djangoproject.com/en/1.11/ref/settings/#databases)
-   in the `mysite/local_settings.py` file.
+   in the [Django database settings](https://docs.djangoproject.com/en/3.1/ref/settings/#databases)
+   in the `local_settings.py` file.
 3. Make sure that the [pg_trgm](https://www.postgresql.org/docs/current/pgtrgm.html) extension
    is [created](https://www.postgresql.org/docs/current/sql-createextension.html)
@@ -38,3 +38,161 @@
 for the development, you can run e.g. `update.py --only-repos core` to import
 only man pages from the core repository (the smallest one, download size is
 about 160 MiB) or even `update.py --only-packages coreutils man-pages`.
+
+## About
+
+This website was created for the [man template](https://wiki.archlinux.org/index.php/Template:Man)
+on the Arch wiki. Originally, the template replaced plain text, unclickable
+references to man pages with links to [man7.org](https://man7.org/linux/man-pages/),
+which contains a handful of manuals taken directly from upstream. Later, we
+considered switching to another site providing more manuals. Since we did not
+find a suitable external site, we decided to build a new service to satisfy all
+our requirements:
+
+1. All man pages from official Arch packages are available. Old versions and
+   permalinks are not necessary.
+2. Functionality does not require JavaScript.
+3. Pages are addressable by their name and section, both occurring exactly once
+   in the URL to avoid problems with pages such as
+   [ar(1)](https://jlk.fjfi.cvut.cz/arch/manpages/man/ar.1) and
+   [ar(1p)](https://jlk.fjfi.cvut.cz/arch/manpages/man/ar.1p).
+4. The URLs used by the _man_ template should not redirect to permalinks,
+   otherwise users would start copy-pasting them to the wiki and it would be
+   hard to check if they are the same as the canonical URLs.
+5. Human-readable subsection anchors.
+6. The page should clearly indicate the Arch package version containing the
+   page.
+
+See the [original discussion](https://wiki.archlinux.org/index.php/Template_talk:Man#Sources)
+for details.
+
+We used a dynamic approach instead of building a website consisting of
+completely static pages. The main building blocks are the
+[Django web framework](https://www.djangoproject.com/), the
+[PostgreSQL](https://www.postgresql.org/) database server, the `mandoc` tool
+from the [mandoc toolset](http://mdocml.bsd.lv/) for the conversion to HTML and
+the [pyalpm](https://github.com/archlinux/pyalpm) library for data extraction
+from the Arch repositories. The code is available in the
+[archmanweb](https://gitlab.archlinux.org/archlinux/archmanweb) repository on
+GitLab.
+
+Overall, this approach allows us to provide the following features without
+rebuilding the whole website from scratch:
+
+- Listings with custom filters and orderings.
+- Links to other versions of the same manual provided by different packages.
+- Links to similar manuals available in other sections or languages.
+- Searching in the names and descriptions of packages and manuals, similarly to
+  [apropos(1)](https://jlk.fjfi.cvut.cz/arch/manpages/about).
+
+### Similar projects
+
+Some similar projects, each using a different approach, are:
+
+- [manned.org](https://manned.org/) ([code](https://g.blicky.net/manned.git/),
+  [Arch BBS thread](https://bbs.archlinux.org/viewtopic.php?id=145382))
+- [man7.org](http://man7.org/linux/man-pages/) (no idea about website scripts)
+- [manpages.debian.org](https://manpages.debian.org/)
+  ([source](https://github.com/Debian/debiman/))
+- [man.openbsd.org](http://man.openbsd.org/) (runs with the mandoc CGI script)
+
+## Test cases
+
+These links serve as test cases to ensure that all features still work; they
+are not useful to regular users.
+
+### URLs with dots
+
+- intro
+- intro.1
+- intro.1.en
+- intro.en
+- systemd.service
+- systemd.service.5
+- systemd.service.5.en
+- systemd.service.en
+- gimp-2.8
+- gimp-2.8.1
+- gimp-2.8.1.en
+- gimp-2.8.en
+- CA.pl
+- CA.pl.1ssl
+- CA.pl.1ssl.en
+- CA.pl.en
+
+### Best match lookup
+
+Ambiguous cases are ordered by section, package repository and package version,
+then the first manual is selected.
+
+- mount redirects to
+  mount.8
+  (not mount.2)
+- gv redirects to
+  gv.1
+  (not gv.3guile,
+  gv.3lua etc.)
+- graphviz/gv redirects to
+  graphviz/gv.3guile
+  (not graphviz/gv.3lua etc.)
+- gv.3 redirects to
+  gv.3guile
+  (not gv.1,
+  gv.3lua etc.)
+- aliases.5 displays
+  extra/postfix/aliases.5
+  (not community/opensmtpd/aliases.5)
+- mysqld.8 displays
+  extra/mariadb/mysqld.8
+  (not community/percona-server/mysqld.8)
+- mailx and
+  mailx.1 redirect to
+  mail.1.en as a symbolic link
+  (not mailx.1p)
+
+### Language fallback
+
+- nvidia-smi.cs →
+  nvidia-smi.en →
+  nvidia-smi.1.en
+  (maybe we should try harder and avoid the double redirect)
+- nvidia-smi.1.cs →
+  nvidia-smi.1.en
+- nvidia-smi.foo → 404
+- nvidia-smi.1.foo → 404
+
+### Package filter
+
+- nvidia-utils/nvidia-smi.en
+- nvidia-340xx-utils/nvidia-smi.en
+- nvidia-utils/nvidia-smi.cs →
+  nvidia-utils/nvidia-smi.en
+- nvidia-340xx-utils/nvidia-smi.cs →
+  nvidia-utils/nvidia-340xx-smi.en
+- foo/nvidia-smi.cs → 404
+- foo/nvidia-smi.en → 404
+
+### .so macros
+
+There is a groff(1) extension for the
+man(7) and
+mdoc(7)
+languages to include contents of other files using the `.so` macro. In normal
+operation where manuals are stored as files on a file system, the
+soelim(1)
+pre-processor handles the inclusion. Our system is based on a database rather
+than a file system, so we need a custom `soelim` as well.
+
+Some pages which contain the `.so` macro:
+
+- [.1.zh_CN
+- pwunconv(8)
+- pam(8)
+- url(7)
+- xorg.conf.d(5)
+- glibc(7)
+- systemd-logind(8)
+- shorewall6.conf(5)
+  points to a page contained in a different package (`shorewall` instead of `shorewall6`)
+- lsof(8)
+  (not a "hardlink", includes an invalid file `./00DIALECTS`)
diff --git a/archlinux-common-style b/archlinux-common-style
new file mode 160000
index 0000000000000000000000000000000000000000..fe41472481348017b99e31f205235cdcaa0d556f
--- /dev/null
+++ b/archlinux-common-style
@@ -0,0 +1 @@
+Subproject commit fe41472481348017b99e31f205235cdcaa0d556f
diff --git a/archmanweb/management/commands/man_drop_cache.py b/archmanweb/management/commands/man_drop_cache.py
new file mode 100755
index 0000000000000000000000000000000000000000..bc4c68a0b7fb6798811af3d96aedc1e70bc39f8c
--- /dev/null
+++ b/archmanweb/management/commands/man_drop_cache.py
@@ -0,0 +1,15 @@
+#! /usr/bin/env python3
+
+from django.core.management.base import BaseCommand
+from django.db import connection
+
+class Command(BaseCommand):
+    help = "Drops cached data from the database"
+
+    def handle(self, *args, **kwargs):
+        with connection.cursor() as c:
+            c.execute("UPDATE archmanweb_content SET html = NULL WHERE html IS NOT NULL;")
+            c.execute("UPDATE archmanweb_content SET txt = NULL WHERE txt IS NOT NULL;")
+            c.execute("UPDATE archmanweb_content SET description = NULL WHERE description IS NOT NULL;")
+            c.execute("UPDATE archmanweb_manpage SET converted_content_id = NULL WHERE converted_content_id IS NOT NULL;")
+            c.execute("VACUUM FULL archmanweb_content;")
diff --git a/update.py b/archmanweb/management/commands/man_update.py
similarity index 68%
rename from update.py
rename to archmanweb/management/commands/man_update.py
index 11dc4bc65073805df9c6145054f6e992198f2a8d..b4947a50b3a497e032db5b9da3c9ff46c70cbdc9 100755
--- a/update.py
+++ b/archmanweb/management/commands/man_update.py
@@ -10,12 +10,10 @@ import subprocess
 import chardet
 import pyalpm
 
-from finder import MANDIR, ManPagesFinder
+from archmanweb.management.utils.finder import MANDIR, ManPagesFinder
 
-# init django
-os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
+from django.core.management.base import BaseCommand
 import django
-django.setup()
 from django.db import connection, transaction
 from django.db.models import Count
 from archmanweb.models import Package, Content, ManPage, SymbolicLink, UpdateLog, SoelimError
@@ -313,93 +311,113 @@ def update_man_pages(finder, updated_pkgs):
     return updated_pages
 
 
-if __name__ == "__main__":
-    # init logging
-    logger = logging.getLogger()
-    logger.setLevel(logging.INFO)
-    handler = logging.StreamHandler()
-    formatter = logging.Formatter("{levelname:8} {message}", style="{")
-    handler.setFormatter(formatter)
-    logger.addHandler(handler)
-
-    parser = argparse.ArgumentParser(description="update man pages in the django database")
-    parser.add_argument("--force", action="store_true",
-                        help="force an import of man pages from all packages, even if they were not updated recently")
-    parser.add_argument("--only-repos", action="store", nargs="+", metavar="NAME",
-                        help="import packages (and man pages) only from these repositories")
-    parser.add_argument("--only-packages", action="store", nargs="+", metavar="NAME",
-                        help="import man pages only from these packages")
-    parser.add_argument("--cache-dir", action="store", default="./.cache/",
-                        help="path to the cache directory (default: %(default)s)")
-    parser.add_argument("--keep-tarballs", action="store_true",
-                        help="keep downloaded package tarballs in the cache directory")
-    parser.add_argument("--workers", type=int, default=0,
-                        help="number of workers for parallel processing (0 = use 1 worker per CPU core)")
-    args = parser.parse_args()
-
-    start = datetime.datetime.now(tz=datetime.timezone.utc)
-
-    finder = ManPagesFinder(args.cache_dir)
-    finder.refresh()
-
-    # everything in a single transaction
-    with transaction.atomic():
-        updated_pkgs = update_packages(finder, force=args.force, only_repos=args.only_repos)
-        if args.only_packages is None:
-            count_updated_pages = update_man_pages(finder, updated_pkgs)
-        else:
-            count_updated_pages = update_man_pages(finder, [p for p in updated_pkgs if p.name in args.only_packages])
-
-    # this is called outside of the transaction, so that the cache can be reused on errors
-    if args.keep_tarballs is False:
-        finder.clear_pkgcache()
-
-    # convert manual pages to plain-text
-    # (one transaction per update, otherwise we might hit memory allocation error)
-    def worker(man_id):
-        man = ManPage.objects.get(id=man_id)
-        try:
-            man.get_converted("txt")
-        except SoelimError as e:
-            logger.error("SoelimError ({}) while converting {}.{}.{} to txt".format(str(e), man.name, man.section, man.lang))
-        except subprocess.CalledProcessError as e:
-            logger.error("CalledProcessError while converting {}.{}.{} to txt:\nreturncode = {}\nstderr = {}"
-                         .format(man.name, man.section, man.lang, e.returncode, e.stderr))
-
-    # prepare man page IDs which need to be converted
-    # (queryset needs to be a list for multiprocessing to work)
-    queryset = ManPage.objects.only("package", "lang", "content_id", "converted_content_id").filter(content__txt=None).values_list("id", flat=True)
-    queryset = list(queryset)
-
-    # all existing database connections have to be closed before forking,
-    # each process will then recreate its own connection:
-    # https://stackoverflow.com/a/10684672
-    django.db.connections.close_all()
-
-    # parallel processing of the queryset
-    import concurrent.futures
-    with concurrent.futures.ProcessPoolExecutor(max_workers=args.workers or None) as executor:
-        executor.map(worker, queryset)
-
-    # VACUUM cannot run inside a transaction block
-    if updated_pkgs or args.only_packages is not None:
-        logger.info("Running VACUUM FULL ANALYZE on our tables...")
-        for Model in [Package, Content, ManPage, SymbolicLink]:
-            table = Model.objects.model._meta.db_table
-            logger.info("--> {}".format(table))
-            with connection.cursor() as cursor:
-                cursor.execute("VACUUM FULL ANALYZE {};".format(table))
-
-    end = datetime.datetime.now(tz=datetime.timezone.utc)
-
-    # log update
-    log = UpdateLog()
-    log.timestamp = start
-    log.duration = end - start
-    log.updated_pkgs = len(updated_pkgs)
-    log.updated_pages = count_updated_pages
-    log.stats_count_man_pages = ManPage.objects.count()
-    log.stats_count_symlinks = SymbolicLink.objects.count()
-    log.stats_count_all_pkgs = Package.objects.count()
-    log.stats_count_pkgs_with_mans = ManPage.objects.aggregate(Count("package_id", distinct=True))["package_id__count"]
-    log.save()
+class Command(BaseCommand):
+    help = "Update man pages in the Django database"
+
+    def __init__(self, *args, **kwargs):
+        BaseCommand.__init__(self, *args, **kwargs)
+
+        # TODO: use Django settings to configure the logger
+        # https://docs.djangoproject.com/en/3.1/topics/logging/
+        logger = logging.getLogger()
+        logger.setLevel(logging.INFO)
+        handler = logging.StreamHandler()
+        formatter = logging.Formatter("{levelname:8} {message}", style="{")
+        handler.setFormatter(formatter)
+        logger.addHandler(handler)
+
+    def add_arguments(self, parser):
+        """
+        :param parser: an instance of :py:class:`argparse.ArgumentParser`
+        """
+        parser.add_argument("--force", action="store_true",
+                            help="force an import of man pages from all packages, even if they were not updated recently")
+        parser.add_argument("--only-repos", action="store", nargs="+", metavar="NAME",
+                            help="import packages (and man pages) only from these repositories")
+        parser.add_argument("--only-packages", action="store", nargs="+", metavar="NAME",
+                            help="import man pages only from these packages")
+        parser.add_argument("--cache-dir", action="store", default="./.cache/",
+                            help="path to the cache directory (default: %(default)s)")
+        parser.add_argument("--keep-tarballs", action="store_true",
+                            help="keep downloaded package tarballs in the cache directory")
+        parser.add_argument("--workers", type=int, default=0,
+                            help="number of workers for parallel processing (0 = use 1 worker per CPU core; default: %(default)s)")
+
+    def handle(self, **kwargs):
+        start = datetime.datetime.now(tz=datetime.timezone.utc)
+        updated_pkgs, count_updated_pages = self.do_update(**kwargs)
+        end = datetime.datetime.now(tz=datetime.timezone.utc)
+
+        # log update
+        log = UpdateLog()
+        log.timestamp = start
+        log.duration = end - start
+        log.updated_pkgs = len(updated_pkgs)
+        log.updated_pages = count_updated_pages
+        log.stats_count_man_pages = ManPage.objects.count()
+        log.stats_count_symlinks = SymbolicLink.objects.count()
+        log.stats_count_all_pkgs = Package.objects.count()
+        log.stats_count_pkgs_with_mans = ManPage.objects.aggregate(Count("package_id", distinct=True))["package_id__count"]
+        log.save()
+
+    def do_update(self, *, cache_dir, workers,
+                  force=False,
+                  only_repos=None,
+                  only_packages=None,
+                  keep_tarballs=False,
+                  **kwargs):
+        finder = ManPagesFinder(cache_dir)
+        finder.refresh()
+
+        # everything in a single transaction
+        with transaction.atomic():
+            updated_pkgs = update_packages(finder, force=force, only_repos=only_repos)
+            if only_packages is None:
+                count_updated_pages = update_man_pages(finder, updated_pkgs)
+            else:
+                count_updated_pages = update_man_pages(finder, [p for p in updated_pkgs if p.name in only_packages])
+
+        # this is called outside of the transaction, so that the cache can be reused on errors
+        if keep_tarballs is False:
+            finder.clear_pkgcache()
+
+        # convert manual pages to plain-text
+        # (one transaction per update, otherwise we might hit memory allocation error)
+        def worker(man_id):
+            man = ManPage.objects.get(id=man_id)
+            try:
+                man.get_converted("txt")
+            except SoelimError as e:
+                logger.error("SoelimError ({}) while converting {}.{}.{} to txt".format(str(e), man.name, man.section, man.lang))
+            except subprocess.CalledProcessError as e:
+                logger.error("CalledProcessError while converting {}.{}.{} to txt:\nreturncode = {}\nstderr = {}"
+                             .format(man.name, man.section, man.lang, e.returncode, e.stderr))
+
+        # prepare man page IDs which need to be converted
+        # (queryset needs to be a list for multiprocessing to work)
+        queryset = ManPage.objects.only("package", "lang", "content_id", "converted_content_id").filter(content__txt=None).values_list("id", flat=True)
+        queryset = list(queryset)
+
+        # all existing database connections have to be closed before forking,
+        # each process will then recreate its own connection:
+        # https://stackoverflow.com/a/10684672
+        django.db.connections.close_all()
+
+        # parallel processing of the queryset
+        import concurrent.futures
+        # FIXME: ProcessPoolExecutor deadlocks here since the code was moved into the Command class,
+        # even though database connections are closed just above, which used to work before.
+        #with concurrent.futures.ProcessPoolExecutor(max_workers=workers or None) as executor:
+        with concurrent.futures.ThreadPoolExecutor(max_workers=workers or None) as executor:
+            executor.map(worker, queryset)
+
+        # VACUUM cannot run inside a transaction block
+        if updated_pkgs or only_packages is not None:
+            logger.info("Running VACUUM FULL ANALYZE on our tables...")
+            for Model in [Package, Content, ManPage, SymbolicLink]:
+                table = Model.objects.model._meta.db_table
+                logger.info("--> {}".format(table))
+                with connection.cursor() as cursor:
+                    cursor.execute("VACUUM FULL ANALYZE {};".format(table))
+
+        return updated_pkgs, count_updated_pages
diff --git a/finder.py b/archmanweb/management/utils/finder.py
similarity index 97%
rename from finder.py
rename to archmanweb/management/utils/finder.py
index 9e410053959007c523d9b3d5db654eaa851e62a9..e0c4cab5fb6278bdf0d07723954e49bd2d826e2f 100644
--- a/finder.py
+++ b/archmanweb/management/utils/finder.py
@@ -188,9 +188,9 @@ class ManPagesFinder:
 
         # get the pkg tarball
         _pattern = "{}-{}-{}.pkg.tar.*".format(pkg.name, pkg.version, pkg.arch)
-        if not list(self.cachedir.glob(_pattern)):
+        if not list(f for f in self.cachedir.glob(_pattern) if not str(f).endswith(".part")):
             self._download_package(pkg)
-        tarballs = sorted(self.cachedir.glob(_pattern))
+        tarballs = sorted(f for f in self.cachedir.glob(_pattern) if not str(f).endswith(".part"))
         assert len(tarballs) > 0, _pattern
         tarball = tarballs[0]
 
diff --git a/archmanweb/static/archmanweb/base.css b/archmanweb/static/archmanweb/base.css
index fcec57a4146fdd0add7628f1da9948716d69e914..4f1eb1d6f39b85b0ccdd1ad672db65ef6fadd9da 100644
--- a/archmanweb/static/archmanweb/base.css
+++ b/archmanweb/static/archmanweb/base.css
@@ -1,48 +1,8 @@
-header {
-    padding: 10px 15px;
-    background: #333;
-    border-bottom: 5px #08c solid;
-    color: white !important;
-    display: flex;
-    justify-content: space-between;
-    align-items: center;
-    font-family: sans-serif;
-}
-header > h1 {
-    font-size: 1.25em;
-    margin: 0;
-}
-header > nav {
-    margin-bottom: 0;
-}
-header > nav > a {
-    font-weight: bold;
-    font-size: 1em;
-    text-decoration: none;
-    color: #999 !important;
-}
-header > nav > a:focus,
-header > nav > a:hover {
-    text-decoration: underline;
-    color: white !important;
-}
-header > nav > a.selected {
-    color: white !important;
-}
-header > nav > a:not(:last-child) {
-    margin-right: 1em;
-}
-
-/* responsive header */
-@media only screen and (max-width: 650px) {
-    header {
-        /* place the items in vertical direction */
-        flex-direction: column;
-    }
-
-    header > :not(:last-child) {
-        margin: 0 0 0.5rem;
-    }
+#archnavbar form {
+    display: inline-block !important;
+    font-size: 14px !important;
+    line-height: 14px !important;
+    padding: 14px 15px 0px !important;
 }
 
 /* simple reset */
diff --git a/archmanweb/templates/about.html b/archmanweb/templates/about.html
deleted file mode 100644
index 6c6d2491254b245a393aa244b17fefa9e9d876e8..0000000000000000000000000000000000000000
--- a/archmanweb/templates/about.html
+++ /dev/null
@@ -1,69 +0,0 @@
-{% extends "index.html" %}
-
-{% block content %}
-
-About
-
-This website was created for the man template
-on the Arch wiki. Originally, the template replaced plain text, unclickable references to man pages with links
-to man7.org, which contains a handful of manuals taken directly
-from upstream. Later, we considered switching to another site providing more manuals. Since we did not find
-a suitable external site, we decided to build a new service to satisfy all our requirements:
-
-1. All man pages from official Arch packages are available. Old versions and permalinks are not necessary.
-2. Functionality does not require Javascript.
-3. Pages are addressable by their name and section, both occurring exactly once in the URL to avoid
-   problems with pages such as ar(1) and
-   ar(1p).
-4. The URLs used by the man template should not redirect to permalinks, otherwise users would start
-   copy-pasting them to the wiki and it would be hard to check if they are the same as the canonical URLs.
-5. Human-readable subsection anchors.
-6. The page should clearly indicate the Arch package version containing the page.
-
-See the original discussion
-for details.
-
-We used a dynamic approach instead of building a website consisting of completely static pages. The main
-building blocks are the Django web framework, the
-PostgreSQL database server, the mandoc tool from
-the mandoc toolset for the conversion to HTML and the
-pyalpm library for data extraction from the Arch repositories.
-The code is available in the lahwaacz/archmanweb
-repository at GitHub.
-
-Overall, this approach allows us to provide the following features without rebuilding the whole website
-from scratch:
-
-Similar projects
-
-Some similar projects, each using a different approach, are:
-
-{% endblock %}
diff --git a/archmanweb/templates/base.html b/archmanweb/templates/base.html
index de04ba85822f1af24c81ca941222bfb82836a9a6..cc4356b9130ee967b78cb36de2f77d450dc8fcd6 100644
--- a/archmanweb/templates/base.html
+++ b/archmanweb/templates/base.html
@@ -6,19 +6,37 @@
 {% block title %}Arch manual pages{% endblock %}
-
+
+
+
+
 {% block head %}
 {% endblock %}
-

Arch manual pages

- {% if search_form is None %} - - {% endif %} + +
 {% block content %}
@@ -41,7 +60,7 @@
 {% endblock %}