The current diff algorithm does a full LCS on the words of the texts,
which is really slow. Diffing the working copy when e.g.
`src/commands.py` has changes far apart takes seconds. This patch adds
an implementation inspired by JGit's Histogram diff. I say "inspired"
because I just didn't quite understand it :P In particular, I didn't
understand what it does when it finds non-unique elements. I decided
to line up the leading common elements on both sides of the merge. I
don't know if that usually gives good enough results in practice.
I'm sure this can still be optimized a lot, but this seems good enough
as a start. There is also many things to improve about the quality of
the diffs.
This patch introduces a new `IndexStore` struct. The idea is that it
will know about the directory in which the index files are stored, the
associations with operations. It may also cache `Arc<ReadonlyIndex>`
instances so if multiple `ReadonlyIndex` instances are loaded, they
can be returned from the cache. That may be useful when merging
operations because the operations are likely to share a large parent
index file. For now, however, all the new type has is `init()`,
`load()`, and `reinit()`.
We currently need to read the commit objects for finding common
ancestors. That can be very slow when the common ancestor is far back
in history. This patch adds a function for finding common ancestors
using the index instead.
Unlike the current algorithm, which only returns one common ancestor,
the new index-based one correctly handles criss-cross merges.
Here are some timings for finding the common ancestors in the git.git
repo:
| Without index | With Index |
| First run | Subsequent | First run | Subsequent |
v2.30.0-rc0 v2.30.0-rc1 | 5.68 ms | 5.94 us | 40.3 us | 4.77 us |
v2.25.4 v2.26.1 | 1.75 ms | 1.42 us | 13.8 ms | 4.29 ms |
v1.0.0 v2.0.0 | 492 ms | 2.79 ms | 23.4 ms | 6.41 ms |
Finding ancestors of v2.25.4 and v2.26.1 got much slower because the
new algorithm finds all common ancestors. Therefore, it also finds
v2.24.2, v2.23.2, v2.22.3, v2.21.2, v2.20.3, v2.19.4, v2.18.3, and
v2.17.4, which it then filters out because they're all ancestors of
v2.25.3.
Also note that the result was incorrect before, because the old
algorithm would return as soon as it had found a common ancestor, even
if it's not the latest common ancestor. For example, for the common
ancestor between v1.0.0 and v2.0.0, it returned an ancestor of v1.0.0
because it happened to get there by following some side branch that
led there more quickly.
The only place we currently need to find the common ancestor is when
merging trees, which we only do when the user runs `jj merge`, as well
as when operating on existing merge commits (e.g. to diff or rebase
them). That means that this change won't be very noticeable. However,
it's something we clearly want to do sooner or later, so we might as
well get it done.
I had tried to generate the protobuf code at build time many months
ago, but decided against it because it slowed down the build too
much. I didn't realize there was the
"cargo:rerun-if-changed=<filename>" feature that time. Given that that
exists, it seems like an obvious win to generate the source code at
build time.
I put the generated sources in `$OUT_DIR` (where [1] says they should
be), then include them in the `protos` module by using the `include!`
macro. The biggest problem with that is that I couldn't get IntelliJ
to understand it, even after enabling the experimental features
described in [2].
[1] https://doc.rust-lang.org/cargo/reference/build-script-examples.html#code-generation
[2] https://github.com/intellij-rust/intellij-rust/issues/1908#issuecomment-592773865