This adds an initial `jj util gc` command, which simply calls `git gc`
when using the Git backend. That should already be useful in
non-colocated repos because it's not obvious how to GC (repack) such
repos. In my own jj repo, it shrank `.jj/repo/store/` from 2.4 GiB to
780 MiB, and `jj log --ignore-working-copy` was sped up from 157 ms to
86 ms.
I haven't added any tests because the functionality depends on having
a `git` binary on the PATH, which we don't yet depend on anywhere
else. I think we'll still be able to test most of the future garbage
collection functionality without a `git` binary because the
interesting parts are about manipulating the Git repo before calling
`git gc` on it.
This enables a cheap str-to-RepoPath cast, which is useful when sorting
and filtering a large Vec<(String, _)> list by a matcher, for example. It
will also eliminate the temporary allocation in repo_path.parent().
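For illustration, here is a minimal sketch of the unsized-wrapper pattern
that makes such a cast possible (the real RepoPath type and method names may
differ); it's the same trick the standard library uses for `Path`/`OsStr`:
```
#[repr(transparent)]
struct RepoPath(str);

impl RepoPath {
    fn from_internal_str(value: &str) -> &RepoPath {
        // SAFETY: RepoPath is a #[repr(transparent)] wrapper around str, so
        // the fat pointers have identical layout (same trick as Path::new).
        unsafe { &*(value as *const str as *const RepoPath) }
    }

    fn parent(&self) -> Option<&RepoPath> {
        // No temporary allocation: the parent is just a prefix of the same str.
        self.0
            .rsplit_once('/')
            .map(|(parent, _)| RepoPath::from_internal_str(parent))
    }
}
```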
Recognize signature metadata in git commit objects, and implement a
basic version of that for the native backend. Also extract the signed
data (the commit's binary representation without the signature) so it
can be verified later.
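As a rough sketch (field names are assumptions), the stored metadata might
look like this:
```
// Sketch only: signature metadata attached to a commit. The signed payload
// is kept alongside the signature so verification can happen later without
// having to reconstruct it from the backend.
struct SecureSig {
    /// The signed data: the commit's binary representation without the signature.
    data: Vec<u8>,
    /// The raw signature bytes (e.g. an ASCII-armored GPG signature).
    sig: Vec<u8>,
}
```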
Since the concurrent diff algorithm is significantly slower when using
the Git backend, I think we'll have to switch between the two
algorithms depending on backend. Even if the concurrent version always
performed as well as the sequential version, exactly how concurrent it
should be probably still depends on the backend. This commit therefore
adds a function to the `Backend` trait, so each backend can say how
much concurrency they deal well with. I then use that number for
choosing between the sequential and concurrent versions in
`MergedTree::diff_stream()`, and also to decide the number of
concurrent reads to do in the concurrent version.
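As a sketch (the method name and default are assumptions), the new trait
method might look like this:
```
pub trait Backend: Send + Sync {
    /// How many concurrent read requests this backend handles well. A local,
    /// disk-based backend gains nothing from concurrency, so 1 is a sensible
    /// default; a cloud-based backend might return something like 100.
    fn concurrency(&self) -> usize {
        1
    }

    // ...the existing read/write methods are elided here.
}
```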
During the transition to using more async code, I keep running into
https://github.com/rust-lang/futures-rs/issues/2090. Right now, I want
to convert `MergedTree::diff()` into a `Stream`. I don't want to
update all call sites at once, so instead I'm adding a
`MergedTree::diff_stream()` method, which just wraps
`MergedTree::diff()` in a `Stream`. However, since the iterator is
synchronous, it needs to block on the async `Backend::read_tree()`
calls. If we then also block on the `Stream` in the CLI, we run into
the panic.
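Conceptually the wrapper is just this (a simplified sketch, not the actual
code): `futures::stream::iter` lifts the synchronous iterator into a `Stream`.
```
use futures::stream::{self, Stream};

// Lift a synchronous iterator into a Stream. Producing each item still blocks
// on the underlying Backend::read_tree() calls, which is what leads to the
// nested-block_on panic when the caller also blocks on the Stream.
fn iter_to_stream<T>(iter: impl Iterator<Item = T>) -> impl Stream<Item = T> {
    stream::iter(iter)
}
```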
This avoids https://github.com/rust-lang/futures-rs/issues/2090. I
don't think we need to worry about reading legacy conflicts
asynchronously - async is really only useful for Google's backend
right now, and we don't use the legacy format at Google. In
particular, I don't want `MergedTree::value()` to have to be async.
It seems we'll end up using `block_on()` quite a bit, at least until
we're done transitioning to async, and the function name doesn't
conflict with anything else, so let's always import it when we need
it.
The commit backend at Google is cloud-based (and so are the other
backends); it reads and writes commits from/to a server, which stores
them in a database. That makes latency much higher than for disk-based
backends. To reduce the latency, we have a local daemon process that
caches and prefetches objects. There are still many cases where
latency is high, such as when diffing two uncached commits. We can
improve that by changing some of our (jj's) algorithms to read many
objects concurrently from the backend. In the case of tree-diffing, we
can fetch one level (depth) of the tree at a time. There are several
ways of doing that:
* Make the backend methods `async`
* Use many threads for reading from the backend
* Add backend methods for batch reading
I don't think we typically need CPU parallelism, so it's wasteful to
have hundreds of threads running in order to fetch hundreds of objects
in parallel (especially when using a synchronous backend like the Git
backend). Batching would work well for the tree-diffing case, but it's
not as composable as `async`. For example, if we wanted to fetch some
commits at the same time as we were doing a diff, it's hard to see how
to do that with batching. Using async seems like our best bet.
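For illustration, a minimal sketch of what async read methods could look like
using the `async-trait` crate (the type names here are simplified stand-ins
for jj's real types):
```
use async_trait::async_trait;

// Simplified stand-ins for the real jj types.
pub struct Tree;
pub struct TreeId(Vec<u8>);
pub type BackendResult<T> = Result<T, String>;

#[async_trait]
pub trait Backend: Send + Sync {
    // Reads become async so algorithms like tree diffing can issue many
    // requests concurrently (e.g. fetch one level of the tree at a time).
    async fn read_tree(&self, id: &TreeId) -> BackendResult<Tree>;

    // Writes stay synchronous; see the next paragraph.
    fn write_tree(&self, contents: &Tree) -> BackendResult<TreeId>;
}
```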
I didn't make the backend interface's write functions async because
writes are already async with the daemon we have at Google. That
daemon will hash the object and immediately return, and then send the
object to the server in the background. I think any cloud-based
solution will need a similar daemon process. However, we may need to
reconsider this if/when jj gets used on a server with a custom backend
that writes directly to a database (i.e. no async daemon in between).
I've tried to measure the performance impact. The largest difference
I've been able to measure was on `jj diff --ignore-working-copy -s
--from v5.0 --to v6.0` in the Linux repo, which increased from 749 ms
to 773 ms (3.3%). In most cases I've
tested, there's no measurable difference. I've tried diffing from the
root commit, as well as `jj --ignore-working-copy log --no-graph -r
'::v3.0 & author(torvalds)' -T 'commit_id ++ "\n"'` (to test a
commit-heavy load).
We currently represent the root tree id in a commit by `Merge<TreeId>`
plus a boolean `uses_tree_conflict_format`. It's better to use an enum
for that. That makes it harder to forget to check which type of tree
it is, and it makes it impossible to store a legacy tree with multiple
ids (as we could with `uses_tree_conflict_format=false`,
`root_tree=Merge::new(...)`).
Maybe more importantly, we're also going to want to pass around this
information in most places where we currently pass a single `TreeId`,
and passing two separate values would be annoying.
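As a rough sketch (the type and variant names are assumptions), the enum
could look like this:
```
// `TreeId` and `Merge<T>` stand in for the existing jj types (stubbed here
// so the sketch is self-contained).
pub struct TreeId(Vec<u8>);
pub struct Merge<T>(Vec<T>); // simplified: the real type tracks adds/removes

// Exactly one representation per commit: either a single legacy tree or a
// tree-level conflict, never a legacy tree with multiple ids.
pub enum MergedTreeId {
    /// Old-style commit: a single tree that may contain path-level conflicts.
    Legacy(TreeId),
    /// New-style commit: a tree-level conflict, one tree id per term.
    Merge(Merge<TreeId>),
}
```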
Unlike the git backend, we don't need to support path-level conflicts
for existing repos because we don't care about compatibility with
existing repos using the native backend. However, we still need to
support both formats until all code paths are able to handle
tree-level conflicts.
Since `Conflict<T>` can also represent a non-conflict state (a single
term), `Merge<T>` seems like a better name.
Thanks to @ilyagr for the suggestion in
https://github.com/martinvonz/jj/pull/1774#discussion_r1257547709
Sorry about the churn. It would have been better if I had thought of
this name before I introduced `Conflict<T>`.
Tree-level conflicts (#1624) will be stored as multiple trees
associated with a single commit. This patch adds support for that in
`backend::Commit` and in the backends.
When the Git backend writes a tree conflict, it creates a special root
tree for the commit. That tree has only the individual trees from the
conflict as subtrees. That way we prevent the trees from getting
GC'd. We also write the tree ids to the extra metadata table
(i.e. outside of the Git repo) so we don't need to load the tree
object to determine if there are conflicts.
I also added a new flag to `backend::Commit` indicating whether the
commit is a new-style commit (with support for tree-level
conflicts). That will help with the migration. We will remove it once
we no longer care about old repos. When the flag is set, we know that
a commit with a single tree cannot have conflicts. When the flag is
not set, it's an old-style commit where we have to walk the whole tree
to find conflicts.
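Roughly, the special root tree can be built with git2's tree builder so that
each conflict tree stays reachable from the commit (a sketch only; the entry
names and exact layout are made up for illustration):
```
// Sketch: write a synthetic root tree whose entries are the individual trees
// of the conflict, so `git gc` keeps them reachable.
fn write_conflict_root_tree(
    repo: &git2::Repository,
    tree_ids: &[git2::Oid],
) -> Result<git2::Oid, git2::Error> {
    let mut builder = repo.treebuilder(None)?;
    for (i, id) in tree_ids.iter().enumerate() {
        // 0o040000 is git's filemode for a tree (directory) entry.
        builder.insert(format!(".jjconflict-{i}"), *id, 0o040000)?;
    }
    builder.write()
}
```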
It was convenient that what the git backend stored in its "extras"
table was exactly a subset of the fields that the local backend
stores, but it's a bit ugly and limiting. For example, it makes it
possible to populate the `author` field in the git extras, but that
would have no effect. It's better if it's not possible to do that (we
store the author field in the git commit, of course).
What made me notice this now was that I'm working on tree-level
conflicts (#1624) and I'm thinking of adding a field to the git extras
saying "this commit has a single tree, but it's still a new-style
commit", so we know not to walk such trees to find path-level
conflicts. That's only needed for the git backend because we don't
care about compatibility for the local backend.
The internal backend at Google doesn't let you write any value you
want in the committer field. The `Store` type still caches the
value it attempted to write, which gets a little weird when the
written value is not what we tried to write. We should use the value
the backend actually wrote. However, we don't know if the backend
changed anything without reading the value back, which is often
wasteful. This commit changes the API to return the written value.
I only changed the signature of `write_commit()` for now. Maybe we
should make a similar change to `write_tree()`.
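A sketch of the signature change (types stubbed; the real signature may
differ in detail):
```
// Simplified stand-ins for the real jj types.
pub struct Commit { /* author, committer, root tree, ... */ }
pub struct CommitId(Vec<u8>);
pub type BackendResult<T> = Result<T, String>;

pub trait Backend {
    // Before (roughly): fn write_commit(&self, contents: Commit) -> BackendResult<CommitId>;
    // After: also return the commit as actually written, so `Store` can cache
    // the real value (e.g. a committer field the backend rewrote).
    fn write_commit(&self, contents: Commit) -> BackendResult<(CommitId, Commit)>;
}
```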
This has several advantages:
* Makes it possible to downcast to non-Git custom backends (might be
useful at Google, but we haven't needed it yet)
* Lets us access more specific functionality on the `GitBackend`,
making it possible to access the `git2::Repository` without
creating a copy of it.
* Removes the dependency on Git from the backend
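For the first point, the downcasting hook could look something like this (a
sketch; the method name is an assumption):
```
use std::any::Any;

pub trait Backend {
    /// Expose `&dyn Any` so callers can try to downcast to a concrete backend
    /// type when they need backend-specific functionality.
    fn as_any(&self) -> &dyn Any;
}

// A caller needing Git-specific functionality could then do something like:
//     if let Some(git_backend) = backend.as_any().downcast_ref::<GitBackend>() {
//         // use Git-specific methods on git_backend here
//     }
```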
It took a while before I realized that conflicts could be modeled as
simple algebraic expressions with positive and negative terms (they
were modeled as recursive 3-way conflicts initially). We've been
thinking of them that way for a while now, so let's make the
`ConflictPart` name match that model.
Our internal backend at Google uses a 32-byte change id, so I'd like
to make the backend able to decide the length. To start with, let's
make the backend able to decide what the root change id should
be. That's consistent with how we already let the backend decide what
the root commit id should be.
The function is currently only about the length of commit IDs, so
let's clarify that. I'm going to add another function for the length
of change IDs next. I don't know if we're going to care about lengths
of other hashes in the future. We might even be able to remove the
current restriction that all commit IDs and all change IDs have the
same length.
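As a sketch (the method names are assumptions), the split could look like:
```
pub trait Backend {
    /// Length of commit IDs in bytes.
    fn commit_id_length(&self) -> usize;

    /// Length of change IDs in bytes (to be added in the next change).
    fn change_id_length(&self) -> usize;
}
```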
I needed this in the course of debugging an error. Before this commit, the error looked like this:
```
Error: Unexpected error from backend: Object not found
```
After this commit, it looks like this:
```
Error: Unexpected error from backend: Object with CommitId 8f59646bc9bb6bb44b5624f1248f4a708f37003c not found: object not found - no match for id (8f59646bc9bb6bb44b5624f1248f4a708f37003c); class=Odb (9); code=NotFound (-3)
```
The Protobuf team at Google decided to let us use Protobufs internally
after all. That will make things a little easier for us with the
Google-internal adaptations, and the `protobuf` crate is noticeably
faster than the `thrift` crate.
This effectively rolls back commit 5b10c9aa0a. I resolved some
conflicts caused by the rename from `NormalFile` to `File`. I also
kept the changelog entry, but changed it to say that the hashing
scheme has changed (not the format); since the hashes are just used
for identity, existing repos should still work.
Let's acknowledge everyone's contributions by replacing "Google LLC"
in the copyright header by "The Jujutsu Authors". If I understand
correctly, it won't have any legal effect, but maybe it still helps
reduce concerns from contributors (though I haven't heard any
concerns).
Google employees can read about Google's policy at
go/releasing/contributions#copyright.
There are no "non-normal" files, so "normal" is not needed. We have
symlinks and conflicts, but they are not files, so I think just "file"
is unambiguous.
I left `testutils::write_normal_file()` because there it's used to
mean "not executable file" (there's also a `write_executable_file()`).
I left `working_copy::FileType::Normal` since renaming `Normal` there
to `File` would also suggest we should rename `FileType`, and I don't
know what would be a better name for that type.
This migrates the native backend from Protobuf to Thrift since
Google's Protobuf team doesn't let us import jj into Google's monorepo
if it uses a third-party Protobuf library.
Since the native backend is not supported, I didn't write any
migration code for it.
We can't remove `lib/src/protos/store.proto` yet, because it's also
used by the Git backend (only the `predecessors` and `change_id`
fields).
We currently determine if the repo uses the Git backend or the local
backend by checking for presence of a `.jj/repo/store/git_target`
file. To make it easier to add out-of-tree backends, let's instead add
a file that indicates which backend to use.
In many of these places, we don't need an owned value, so using a
reference means we don't force the caller to clone the value. I really
doubt it will have any noticeable impact on performance (I think these
are all once-per-repo paths); it's just a little simpler this way.
I had made the backends unaware of the virtual root commit because
they don't need to know about it, and we could avoid some duplicated
code by putting that in `Store` instead. However, as we saw in
b21a123bc8, the root commit being virtual has some user-visible
effects (they can't create a merge with the root and some other
commit). So I'm thinking that we may want to make the root commit an
actual commit, depending on which backend is used. Specifically, when
using the Git backend, we cannot record the root commit as an actual
parent since Git would fail when trying to look it up. Backends that
don't need compatibility can make the root commit an actual commit,
however.
This commit therefore makes the backends aware of the root commit. It
makes it remain a virtual commit in the Git backend, and makes it an
actual commit in the `LocalBackend`.
This commit breaks any existing repos using the `LocalBackend`, but
there shouldn't be any such repos other than for testing.