jj/docs/technical/architecture.md
2023-06-09 07:51:52 +02:00

246 lines
9.9 KiB
Markdown

# Architecture
## Data model
The commit data model is similar
to [Git's object model](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects)
, but with some differences.
## Separation of library from UI
The `jj` binary consists of two Rust crates: the library crate
(`jujutsu-lib`) and the CLI crate (`jujutsu`). The library crate is currently
only used by the CLI crate, but it is meant to also be usable from a GUI or TUI,
or in a server serving requests from multiple users. As a result, the library
should avoid interacting directly with the user via the terminal or by other
means; all input/output is handled by the CLI crate [^1]. Since the library
crate is meant to usable in a server, it also cannot read configuration from the
user's home directory, or from user-specific environment variables.
[^1]: There are a few exceptions, such as for messages printed during automatic
upgrades of the repo format
A lot of thought has gone into making the library crate's API easy to use, but
not much has gone into "details" such as which collection types are used, or
which symbols are exposed in the API.
## Storage-independent APIs
One overarching principle in the design is that it should be easy to change
where data is stored. The goal was to be able to put storage on local-disk by
default but also be able to move storage to the cloud at Google
(and for anyone). To that end, commits (and trees, files, etc.) are stored by
the commit backend, operations (and views) are stored by the operation backend,
the heads of the operation log are stored by the "op heads" backend, the commit
index is stored by the index backend, and the working copy is stored by the
working copy backend. The interfaces are defined in terms of plain Rust data
types, not tied to a specific format. The last working copy doesn't have its own
trait defined yet, but its interface is small and easy to create traits for when
needed.
The commit backend to use when loading a repo is specified in
the `.jj/repo/store/backend` file. We don't yet have support for choosing
different implementations for other kinds of backends than the commit backend.
## Design of the library crate
### Overview
Here's a diagram showing some important types in the library crate. The
following sections describe each component.
```mermaid
graph TD;
ReadonlyRepo-->Store;
ReadonlyRepo-->OpStore;
ReadonlyRepo-->OpHeadsStore;
ReadonlyRepo-->ReadonlyIndex
MutableIndex-->ReadonlyIndex;
Store-->Backend;
GitBackend-->Backend;
LocalBackend-->Backend;
LocalBackend-->StackedTable;
MutableRepo-->ReadonlyRepo;
MutableRepo-->MutableIndex;
Transaction-->MutableRepo;
WorkingCopy-->TreeState;
Workspace-->WorkingCopy;
Workspace-->RepoLoader;
RepoLoader-->Store;
RepoLoader-->OpStore;
RepoLoader-->OpHeadsStore;
RepoLoader-->ReadonlyRepo;
Git-->GitBackend;
GitBackend-->StackedTable;
```
### Backend
The [`Backend`](../../lib/src/backend.rs) trait defines the interface each
commit backend needs to implement. The current in-tree commit backends
are [`GitBackend`](../../lib/src/git_backend.rs)
and [`LocalBackend`](../../lib/src/local_backend.rs).
Since there are non-commit backends, the `Backend` trait should probably be
renamed to `CommitBackend`.
### GitBackend
The `GitBackend` stores commits in a Git repository. It uses `libgit2` to read
and write commits and refs.
To prevent GC from deleting commits that are still reachable from the operation
log, the `GitBackend` stores a ref for each commit in the operation log in
the `refs/jj/keep/` namespace.
Commit data that is available in Jujutsu's model but not in Git's model is
stored in a `StackedTable` in `.jj/repo/store/extra/`. That is currently the
change ID and the list of predecessors. For commits that don't have any data in
that table, which is any commit created by `git`, we use an empty list as
predecessors, and the bit-reversed commit ID as change ID.
Because we use the Git Object ID as commit ID, two commits that differ only in
their change ID, for example, will get the same commit ID, so we error out when
trying to write the second one of them.
### LocalBackend
The `LocalBackend` is just a proof of concept. It stores objects addressed by
their hash, with one file per object.
### Store
The `Store` type wraps the `Backend` and returns wrapped types for commits and
trees to make them easier to use. The wrapped objects have a reference to
the `Store` itself, so you can do e.g. `commit.parents()` without having to
provide the `Store` as an argument.
The `Store` type also provides caching of commits and trees.
### ReadonlyRepo
A `ReadonlyRepo` represents the state of a repo at a specific operation. It
keeps the view object associated with that operation.
The repository doesn't know where on disk any working copies live. It knows, via
the view object, which commit is supposed to be the current working-copy commit
in each workspace.
### MutableRepo
A `MutableRepo` is a mutable version of `ReadonlyRepo`. It has a reference to
its base `ReadonlyRepo`, but it has its own copy of the view object and lets the
caller modify it.
### Transaction
The `Transaction` object has a `MutableRepo` and metadata that will go into the
operation log. When the transaction commits, the `MutableRepo` becomes a view
object in the operation log on disk, and the `Transaction` object becomes an
operation object. In memory, `Transaction::commit()` returns a
new `ReadonlyRepo`.
### RepoLoader
The `RepoLoader` represents a repository at an unspecified operation. You can
think of as a pointer to the `.jj/repo/` directory. It can create
a `ReadonlyRepo` given an operation ID.
### TreeState
The `TreeState` type represents the state of the files in a working copy. It
keep track of the mtime and size for each tracked file. It knows the `TreeId`
that the working copy represents. It has a `snapshot()` method that will use the
recorded mtimes and sizes and detect changes in the working copy. If anything
changed, it will return a new `TreeId`. It also has `checkout()` for updating
the files on disk to match a requested `TreeId`.
The `TreeState` type supports sparse checkouts. In fact, all working copies are
sparse; they simply track the full repo in most cases.
### WorkingCopy
The `WorkingCopy` type has a `TreeState` but also knows which `WorkspaceId` it
has and at which operation it was most recently updated.
### Workspace
The `Workspace` type represents the combination of a repo and a working copy (
like Git's 'worktree' concept).
The repo view at the current operation determines the desired working-copy
commit in each workspace. The `WorkingCopy` determines what is actually in the
working copy. The working copy can become stale if the working-copy commit was
changed from another workspace (or if the process updating the working copy
crashed, for example).
### Git
The `git` module contains functionality for interoperating with a Git repo, at a
higher level than the `GitBackend`. The `GitBackend` is restricted by
the `Backend` trait; the `git` module is specifically for Git-backed repos. It
has functionality for importing refs from the Git repo and for exporting to refs
in the Git repo. It also has functionality for pushing and pulling to/from Git
remotes.
### Revsets
A user-provided revset expression string goes through a few different stages to
be evaluated:
1. Parse the expression into a `RevsetExpression`, which is close to an AST
2. Resolve symbols and functions like `tags()` into specific commits. After
this stage, the expression is still a `RevsetExpression`, but it won't have
any `CommitRef` variants in it.
3. Resolve visibility. This stage resolves `visible_heads()` and `all()` and
produces a `ResolvedExpression`.
4. Evaluate the `ResolvedExpression` into a `Revset`.
This evaluation step is performed by `Index::evaluate_revset()`, allowing
the `Revset` implementation to leverage the specifics of a custom index
implementation. The first three steps are independent of the index
implementation.
### StackedTable
`StackedTable` (actually `ReadonlyTable` and `MutableTable`) is a simple disk
format for storing key-value pairs sorted by key. The keys have to have the same
size but the values can have different sizes. We use our own format because we
want [lock-free concurrency](concurrency.md) and there doesn't seem to be an
existing key-value store we could use.
The file format contains a lookup table followed by concatenated values. The
lookup table is a sorted list of keys, where each key is followed by the
associated value's offset in the concatenated values.
A table can have a parent table. When looking up a key, if it's not found in the
current table, the parent table is searched. We never update a table in place.
If the number of new entries to write is less than half the number of entries in
the parent table, we create a new table with the new entries and a pointer to
the parent. Otherwise, we copy the entries from the parent table and the new
entries into a new table with the grandparent as the parent. We do that
recursively so parent tables are at least 2 times as large as child tables. This
results in O(log N) amortized insertion time and lookup time.
There's no garbage collection of unreachable tables yet.
The tables are named by their hash. We keep a separate directory of pointers to
the current leaf tables, in the same way as we
do [for the operation log](concurrency.md#storage).
## Design of the CLI crate
### Templates
The concept is copied from Mercurial, but the syntax is different. The main
difference is that the top-level expression is a template expression, not a
string like in Mercurial. There is also no string interpolation (e.g.
`"Commit ID: {node}"` in Mercurial).
### Diff-editing
Diff-editing works by creating two very sparse working copies, containing only
the files we want the user to edit. We then let the user edit the right-hand
side of the diff. Then we simply snapshot that working copy to create the new
tree.