This removes the last use of `ouroboros` in `merged_tree.rs`. The set
of conflicts to iterate is usually so small that I didn't bother
checking the performance impact.
While importing the `ouroboros` crate and the `aliasable` crate it
depends on, the "unsafe Rust reviewer" expressed some concern that
they contain a lot of unsafe code that's hard to review. We can avoid
the unsafe code altogether by making `TreeEntriesIterator` not
self-refential. Instead, we can collect the matching entries in an
individual tree up front. It does have some performance cost:
```
❯ hyperfine --warmup 3 --runs 30 \
'/tmp/jj-before --ignore-working-copy files -r v6.0' \
'/tmp/jj-after --ignore-working-copy files -r v6.0'
Benchmark 1: /tmp/jj-before --ignore-working-copy files -r v6.0
Time (mean ± σ): 461.4 ms ± 14.3 ms [User: 232.1 ms, System: 229.4 ms]
Range (min … max): 443.4 ms … 496.3 ms 30 runs
Benchmark 2: /tmp/jj-after --ignore-working-copy files -r v6.0
Time (mean ± σ): 482.0 ms ± 14.3 ms [User: 257.2 ms, System: 224.9 ms]
Range (min … max): 461.8 ms … 513.3 ms 30 runs
Summary
'/tmp/jj-before --ignore-working-copy files -r v6.0' ran
1.04 ± 0.04 times faster than '/tmp/jj-after --ignore-working-copy files -r v6.0'
```
I think that's acceptable.
Since the concurrent diff algorithm is significantly slower when using
the Git backend, I think we'll have to use switch between the two
algorithms depending on backend. Even if the concurrent version always
performed as well as the sequential version, exactly how concurrent it
should be probably still depends on the backend. This commit therefore
adds a function to the `Backend` trait, so each backend can say how
much concurrency they deal well with. I then use that number for
choosing between the sequential and concurrent versions in
`MergedTree::diff_stream()`, and also to decide the number of
concurrent reads to do in the concurrent version.
When diffing two trees, we currently start at the root and diff those
trees. Then we diff each subtree, one at a time, recursively. When
using a commit backend that uses remote storage, like our backend at
Google does, diffing the subtrees one at a time gets very slow. We
should be able to diff subtrees concurrently. That way, the number of
roundtrips to a server becomes determined by the depth of the deepest
difference instead of by the number of differing trees (times 2,
even). This patch implements such an algorithm behind a `Stream`
interface. It's not hooked in to `MergedTree::diff_stream()` yet; that
will happen in the next commit.
I timed the new implementation by updating `jj diff -s` to use the new
diff stream and then ran it on the Linux repo with `jj diff
--ignore-working-copy -s --from v5.0 --to v6.0`. That slowed down by
~20%, from ~750 ms to ~900 ms. Maybe we can get some of that
performance back but I think it'll be hard to match
`MergedTree::diff()`. We can decide later if we're okay with the
difference (after hopefully reducing the gap a bit) or if we want to
keep both implementations.
I also timed the new implementation on our cloud-based repo at
Google. As expected, it made some diffs much faster (I'm not sure if
I'm allowed to share figures).
I'm going to implement a `Stream`-based version optimized for
high-latency (RPC-based) commit backends. So far, that implementation
is about 20% slower in the Linux repo when running `jj diff
--ignore-working-copy -s --from v5.0 --to v6.0`. I think that's almost
only because the algorithm is different, not because it's async per
se.
This commit adds a `Stream`-based version of `MergedTree::diff()` that
just wraps the regular iterator in stream. I updated `jj diff` to use
it. I couldn't measure any difference on the command above in the
Linux repo. I think that means we can safely use the same
`Stream`-based interface regardless of backend, even if we end up
needing two different implementations of the `Stream`. We would then
be using the wrapped iterator from this commit for local backends, and
the new implementation for remote backends. But ideally we can make
the remote-friendly implementation fast enough that we don't need two
implementations.
During the transition to using more async code, I keep running into
https://github.com/rust-lang/futures-rs/issues/2090. Right now, I want
to convert `MergedTree::diff()` into a `Stream`. I don't want to
update all call sites at once, so instead I'm adding a
`MergedTree::diff_stream()` method, which just wraps
`MergedTree::diff()` in a `Stream. However, since the iterator is
synchronous, it needs to block on the async `Backend::read_tree()`
calls. If we then also block on the `Stream` in the CLI, we run into
the panic.
I want to fix error propagation before I start using async in this
code. This makes the diff iterator propagate errors from reading tree
objects.
Errors include the path and don't stop the iteration. The idea is that
we should be able to show the user an error inline in diff output if
we failed to read a tree. That's going to be especially useful for
backends that can return `BackendError::AccessDenied`. That error
variant doesn't yet exist, but I plan to add it, and use it in
Google's internal backend.
Reasons to introduce this alias:
* Reduces complexity of a type, to silence Clippy warnings in the
future if we use this type as a type parameter
* The type is used quite frequently, so it makes sense to have a name
for it
* It's easier to visually scan for the end of the type when you don't
have to match opening and closing angle brackets
I'm going to add `MergedTreeValue` as an alias for
`Merge<Option<TreeValue>>`, but we already have a type by that name in
`merged_tree`. This patch renames it away, to make room for the new
alias. I used `MergedTreeVal` for this borrowing version to be a bit
like how `str` is a borrowed version of `String`.
It seems we'll end up using `block_on()` quite a bit, at least until
we're done transitioning to async, and the function name doesn't
conflict with anything else, so let's always import it when we need
it.
I'm going to rewrite `TreeDiffIterator` to fetch one level (depth) of
the tree at a time and concurrently. One step towards that is to
convert the iterator to a `Stream`. I'd like to do that by making the
current `Iterator` implementation call the new `Stream`
implementation. However, we can't call `futures::executor::block_on()`
on a future that itself calls `futures::executor::block_on()` (as
`Store::read_tree()` does), so the first step is to bubble up the
async-ness a bit. This patch does that by fetching both sides of the
diff concurrently. That should give close to a 2x speedup on
high-latency backends. (It doesn't help with our backend at Google,
however, because we have a daemon process that does some speculative
prefetching that usually downloads the child trees anyway.)
I ran into a bug the other day where `jj status` said there was a
conflict in a file but there were no conflict markers in the working
copy. The commit was created when I squashed a conflict resolution
into the commit's parent. The rebased child commit then ended up in
this state. I.e., it looked something like this before squashing:
```
C (no conflict)
|
| B conflict
|/
A conflict
```
The conflict in B was different from the conflict in A. When I
squashed in C, jj would try to resolve the conflicts by first creating
a 7-way conflict (3 from A, 3 from B, 1 from C). Because of the exact
content-level changes, the 7-way conflict couldn't be automatically
resolved by `files::merge()` (the way it currently works
anyway). However, after simplifying the conflict, it could be
resolved. Because `MergedTree::merge()` does another round of conflict
simplification of the result at the end of the function, it was the
simplifed version that actually got stored in the commit. So when
inspecting the conflict later (e.g. in the working copy, as I did), it
could be automatically resolved.
I think there are at least two ways to solve this. One is to call
`merge_trees()` again after calling `tree.simplify()` in
`MergedTree::merge()`. However, I think it would only matter in the
case of content-level conflicts. Therefore, it seems better to make
the content-level resolution solve this case to start with. I've done
that by simplifying the conflict before passing it into
`files::merge()`. We could even do the fix in `files::merge()`, but
doing it before calling it has the advantage that we can avoid reading
some unchanged content from the backend.
Only tests dealing with Git submodules care about the backend type.
Switching the tests to use the test backend also uncovered another bug
in `MergedTree`, so I fixed that too. The bug only happens with legacy
trees (path-level conflicts) and backends that care about the conflict
path, so it wouldn't happen with Git backends, and it wouldn't happen
at Google either (because we use tree-level conflicts).
This fixes a bug where we used the parent directory's path when trying
read trees and files for a child entry. Many tests in
`test_merged_tree` fail after switching to the test backend there
without this fix/
When restoring (`jj restore`) a 3-sided conflict from one tree into a
2-sided tree (or a resolved tree), we'll need to extend the size arity
of the target tree to that of the source tree. I had not considered
this case before. This patch relaxes the constraint in
`MergedTreeBuilder` to allow such cases. The additional trees are
based on empty trees with only the larger overrides in them.
We're finally ready to start writing trees using the new format where
we represent conflicts by having multiple trees in the commit instead
of having a single tree with multiple entries at a path. This patch
adds a config option for that. It's not ready to be used yet, so I
haven't updated the release notes or other documentation.
I added only a simple CLI test for testing what happens when the
config is enabled in an existing repo. 108 tests currently fail if we
flip the default.
With the already existing `MergedTree::resolve()` and all the recent
refactorings into `Merge<T>`, it's now very easy to add support for
3-way merging of `MergedTree` instances.
This introduces a `MergedTreeBuilder` type, which takes a set of base
trees and overrides. The idea is that it will be able to write
multiple trees or a legacy tree. For now, it's only able to write
legacy trees. To show that it works, the working copy's snaphotting
code has been updated to use it.
As #2165 showed, when diffing two `MergedTree::Legacy` variants (or
one of each variant) and re recurse into a subtree, we need to treat
that as a legacy tree too, so we expand `TreeValue::Conflict`s found
in the diff.
This converts `TreeDiffIterator::tree()` and
`TreeDiffIterator::single_tree()` into associated functions and passes
in the `&MergedTree` into the former. This prepares for fixing #2165,
and it removes the need for the `TreeDiffIterator::store` field.
If we're going to be able to replace most instances of `Tree` by
`MergedTree`, we'll need to be able to diff two `MergedTree`s. This
implements support for that. The implementation copies a lot from the
diff iterator we have for `Tree`. I suspect we should be able to reuse
some of the code by introducing some traits that can then be
implemented by both `Tree` and `MergedTree`. I've left a TODO about
that.
Many of the `TreeBuilder` users have an `Option<TreeValue>` and call
either `set()` or `remove()` or the builder depending on whether the
value is present. Let's centralize this logic in a new
`TreeBuilder::set_or_remove()`.
There were still many instances of `conflict` left from before we
renamed `Conflict<T>` to `Merge<T>`. I decided to rename many of them
based on the type parameter instead of the container. I think that
made it more readable in many cases.