
More efficient data structure for build set graphs

Problem

The current petgraph-based implementation has performance bottlenecks:

  • Loading overview pages requires iterating through many large graphs to find builds with specific statuses
  • Finding and updating builds by pkgbase requires graph traversal

Solution

Use SQLite tables with strategic indexes for common queries while keeping in-memory petgraph where necessary for traversal operations.

This moves frequently-accessed data into indexed database tables:

  • Status filtering across multiple graphs becomes a simple indexed query instead of loading and iterating through thousands of nodes
  • Pkgbase lookups are reduced from graph traversal to direct index lookups
  • Many usages of petgraph can be replaced with simple, efficient SQL queries
  • Graph calculation still works better with petgraph, but we won't have to store JSON blobs of petgraph structs in the DB anymore
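
As a rough illustration of the first two points, here is a minimal sketch of those hot paths as plain indexed queries. It assumes rusqlite as the driver (just an assumption for the sketch) and the schema below; the function names are made up for the example.

// Sketch only: assumes rusqlite and the builds table defined below.
use rusqlite::{params, Connection, OptionalExtension, Result};

/// Overview pages: all build ids with a given status, across all graphs.
fn builds_with_status(conn: &Connection, status: &str) -> Result<Vec<i64>> {
    let mut stmt = conn.prepare("SELECT id FROM builds WHERE status = ?1")?;
    let ids = stmt
        .query_map(params![status], |row| row.get(0))?
        .collect::<Result<Vec<i64>>>()?;
    Ok(ids)
}

/// Pkgbase lookup: a direct hit on the unique key, no graph traversal.
fn build_id_by_pkgbase(
    conn: &Connection,
    namespace_id: i64,
    iteration_id: i64,
    architecture: &str,
    pkgbase: &str,
) -> Result<Option<i64>> {
    conn.query_row(
        "SELECT id FROM builds
         WHERE namespace_id = ?1 AND iteration_id = ?2
           AND architecture = ?3 AND pkgbase = ?4",
        params![namespace_id, iteration_id, architecture, pkgbase],
        |row| row.get(0),
    )
    .optional()
}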

Database Schema

CREATE TABLE builds (
    id INTEGER PRIMARY KEY,
    namespace_id INTEGER NOT NULL REFERENCES namespaces(id),
    iteration_id INTEGER NOT NULL REFERENCES iterations(id),
    architecture TEXT NOT NULL,
    pkgbase TEXT NOT NULL,
    branch TEXT NOT NULL,
    "commit" TEXT NOT NULL,  -- quoted: COMMIT is a reserved word in SQLite
    status TEXT NOT NULL,
    version TEXT NOT NULL,
    pkgnames TEXT NOT NULL,  -- JSON array
    UNIQUE(namespace_id, iteration_id, architecture, pkgbase)
);

CREATE TABLE edges (
    dependent_build_id INTEGER NOT NULL REFERENCES builds(id),
    depends_on_build_id INTEGER NOT NULL REFERENCES builds(id),
    PRIMARY KEY (dependent_build_id, depends_on_build_id)
);

Indexes:

  • Pkgbase lookup: CREATE INDEX idx_builds_lookup ON builds(namespace_id, iteration_id, architecture, pkgbase)
  • Global status filtering: CREATE INDEX idx_builds_status ON builds(status)
  • Forward dependency traversal: CREATE INDEX idx_edges_forward ON edges(dependent_build_id, depends_on_build_id)
  • Reverse dependency traversal: CREATE INDEX idx_edges_reverse ON edges(depends_on_build_id, dependent_build_id)

Additionally, we'll add a graph_hash column to the build iterations table. See below for details.

Scheduling builds

Currently, we have to traverse the graph every time we schedule builds. Instead, we can update the statuses of dependent builds whenever a build completes:

  1. Query all builds that depend on it using reverse edge index
  2. For each dependent, check if all its dependencies are now successful
  3. Update status from Blocked to Pending where appropriate

SQL for checking if a build can be unblocked:

UPDATE builds 
SET status = 'Pending'
WHERE status = 'Blocked'
AND NOT EXISTS (
    SELECT 1 FROM edges e
    JOIN builds dep ON e.depends_on_build_id = dep.id
    WHERE e.dependent_build_id = builds.id
    AND dep.status != 'Built'
)
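
In code, step 1 scopes this update to the dependents of the build that just finished, so each completion only touches its own neighborhood. A sketch assuming rusqlite (the function name and transaction layout are illustrative):

// Sketch: unblock only the dependents of the build that just completed
// (assumes the build finished successfully). Assumes rusqlite; names are
// illustrative.
use rusqlite::{params, Connection, Result};

fn on_build_completed(conn: &mut Connection, completed_build_id: i64) -> Result<usize> {
    let tx = conn.transaction()?;
    // Mark the finished build first so the dependency check below sees it.
    tx.execute(
        "UPDATE builds SET status = 'Built' WHERE id = ?1",
        params![completed_build_id],
    )?;
    // Unblock dependents whose dependencies are now all built.
    // The reverse edge index (idx_edges_reverse) serves the inner lookup.
    let unblocked = tx.execute(
        "UPDATE builds
         SET status = 'Pending'
         WHERE status = 'Blocked'
           AND id IN (
               SELECT dependent_build_id FROM edges
               WHERE depends_on_build_id = ?1
           )
           AND NOT EXISTS (
               SELECT 1 FROM edges e
               JOIN builds dep ON e.depends_on_build_id = dep.id
               WHERE e.dependent_build_id = builds.id
                 AND dep.status != 'Built'
           )",
        params![completed_build_id],
    )?;
    tx.commit()?;
    Ok(unblocked)
}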

Actually dispatching builds is now easy; finding the builds that are ready to be scheduled is a single query:

SELECT * FROM builds 
WHERE status = 'Pending'

If anything goes wrong during the update, the build graph might get stuck. Here are two approaches to ensure builds always get scheduled eventually:

  1. Background job that scans all Blocked builds and verifies they should still be blocked. This is essentially what we have now, but it would run much less often, which matters because checking the dependencies of every blocked build can get quite expensive.
  2. Record build completions in an event table and process them idempotently, with retries (sketched below).
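
A minimal sketch of the second approach, again assuming rusqlite. The build_events table and the function names are illustrative, not part of the schema above; the point is that the unblock update only depends on current build statuses, so replaying an event is harmless.

// Sketch of the event-table approach: record completions, process them
// idempotently, and retry automatically if a run dies before committing.
// Assumes rusqlite; table and function names are illustrative.
use rusqlite::{params, Connection, Result};

fn init_event_table(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS build_events (
             id INTEGER PRIMARY KEY,
             build_id INTEGER NOT NULL REFERENCES builds(id),
             processed INTEGER NOT NULL DEFAULT 0
         );",
    )
}

fn process_pending_events(conn: &mut Connection) -> Result<()> {
    // Collect unprocessed completion events.
    let events: Vec<(i64, i64)> = {
        let mut stmt =
            conn.prepare("SELECT id, build_id FROM build_events WHERE processed = 0")?;
        let rows = stmt.query_map([], |row| Ok((row.get(0)?, row.get(1)?)))?;
        rows.collect::<Result<_>>()?
    };
    for (event_id, build_id) in events {
        let tx = conn.transaction()?;
        // Same unblock condition as above, scoped to this build's dependents.
        tx.execute(
            "UPDATE builds
             SET status = 'Pending'
             WHERE status = 'Blocked'
               AND id IN (SELECT dependent_build_id FROM edges
                          WHERE depends_on_build_id = ?1)
               AND NOT EXISTS (
                   SELECT 1 FROM edges e
                   JOIN builds dep ON e.depends_on_build_id = dep.id
                   WHERE e.dependent_build_id = builds.id
                     AND dep.status != 'Built')",
            params![build_id],
        )?;
        tx.execute(
            "UPDATE build_events SET processed = 1 WHERE id = ?1",
            params![event_id],
        )?;
        // If we crash before this commit, the event stays unprocessed and the
        // next run simply retries it.
        tx.commit()?;
    }
    Ok(())
}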

Other Graph Operations

For rendering, if we move to a browser library, it's easy: those libraries just take a list of nodes and edges as input, which we can query directly. The dot crate works the same way.
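
For instance, the node and edge lists for one iteration reduce to two queries. A sketch, again assuming rusqlite; the return shape is just an illustration of what a renderer might take:

// Sketch: fetch the node and edge lists for one iteration, e.g. to feed a
// browser-side renderer. Assumes rusqlite; names are illustrative.
use rusqlite::{params, Connection, Result};

/// Returns (build id, pkgbase, status) tuples plus (dependent, depends_on) id pairs.
fn graph_for_rendering(
    conn: &Connection,
    iteration_id: i64,
) -> Result<(Vec<(i64, String, String)>, Vec<(i64, i64)>)> {
    let mut stmt = conn.prepare(
        "SELECT id, pkgbase, status FROM builds WHERE iteration_id = ?1",
    )?;
    let nodes = stmt
        .query_map(params![iteration_id], |r| Ok((r.get(0)?, r.get(1)?, r.get(2)?)))?
        .collect::<Result<Vec<_>>>()?;

    let mut stmt = conn.prepare(
        "SELECT e.dependent_build_id, e.depends_on_build_id
         FROM edges e
         JOIN builds b ON b.id = e.dependent_build_id
         WHERE b.iteration_id = ?1",
    )?;
    let edges = stmt
        .query_map(params![iteration_id], |r| Ok((r.get(0)?, r.get(1)?)))?
        .collect::<Result<Vec<_>>>()?;

    Ok((nodes, edges))
}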

For the build graph calculation, I think petgraph still works better than SQL, but there are some improvements we can make, especially so that we no longer need to store the petgraph struct in the DB. I'll split this up into a separate issue later on; a rough sketch follows the list below.

  1. Build global graph with minimal data: Load pkgnames and dependency relationships into in-memory petgraph structs. Maintain nodes and edges in BTreeSet structures for automatic sorting, which will be used in step 3.
  2. Calculate subgraph. Start with packages that have changes and perform BFS traversal (using petgraph) to find all transitive dependents.
  3. To check if a new iteration is needed, compute a hash of the sorted graph structure and compare it with the existing graph hash. Insert into the database only if the hashes differ, avoiding unnecessary writes when the graph hasn't changed.
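
A hedged sketch of steps 2 and 3: BFS over the reversed dependency edges collects the transitive dependents, and a hash over the sorted node and edge sets is compared against the graph_hash stored on the iteration. The function names, graph representation, and hash choice are all illustrative, not decided here.

// Sketch of steps 2 and 3. Edges point dependent -> dependency, matching the
// edges table above. Names and the hash function are illustrative.
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeSet, HashMap};
use std::hash::{Hash, Hasher};

use petgraph::graph::{DiGraph, NodeIndex};
use petgraph::visit::{Bfs, Reversed};

/// Step 2: all pkgbases that transitively depend on the changed ones.
fn affected_subgraph(
    graph: &DiGraph<String, ()>,
    indices: &HashMap<String, NodeIndex>,
    changed: &[String],
) -> BTreeSet<String> {
    let mut affected = BTreeSet::new();
    // Walking the reversed graph from a changed package visits its dependents.
    let reversed = Reversed(graph);
    for pkgbase in changed {
        if let Some(&start) = indices.get(pkgbase) {
            let mut bfs = Bfs::new(reversed, start);
            while let Some(node) = bfs.next(reversed) {
                affected.insert(graph[node].clone());
            }
        }
    }
    affected
}

/// Step 3: hash the sorted structure; BTreeSet iteration order is stable, so
/// equal graphs always produce equal hashes for comparison with graph_hash.
fn graph_hash(nodes: &BTreeSet<String>, edges: &BTreeSet<(String, String)>) -> u64 {
    let mut hasher = DefaultHasher::new();
    for node in nodes {
        node.hash(&mut hasher);
    }
    for edge in edges {
        edge.hash(&mut hasher);
    }
    hasher.finish()
}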