The `Repo` is a higher-level type that the index shouldn't have to
know about. With this change, a custom revset implementation should be
able to evaluate the revset on a server without knowing which repo it
refers to.
We already pass a `CompositeIndex` to
`default_revset_engine::evaluate()` so let's use that wherever we
currently use `repo.index()`. That will help us remove the `repo`
argument, and it will also let us use internal types (like `IndexEntry`)
in the index methods we call.
I'm about to replace the `&dyn Repo` argument by several smaller
types, and it's easier to collect those in a single context type than
to pass them separately as arguments.
I also moved `revset_for_commit_ids()` and `take_latest_revset()` onto
the new type because it was easy. `build_predicate_fn()` and
`has_diff_from_parent()` ran into some lifetime issue when I tried.
The `public_heads()` revset only contains the root commit in
practice. I'm not sure what we want to do about phases, but since we
don't have any real support for them yet, let's just remove this
revset. I didn't update the changelog because we don't seem to have
documented the revset function (and it seems unlikely that users who
found out about it found it useful enough to use it when they could
just use `root`).
The `ProtoOpStore` was separated out to simplify the migration from
Thrift. Now that the `ThriftOpStore` is gone, we can inline
`ProtoOpStore` as the TODO says.
Inline diffs on multi-byte UTF-8 characters would match individual
bytes, causing garbled diffs in some cases. For example, replacing
`⊢` with `⊣`, which differ in the final byte only, caused the
diff to display a diff of the bytes instead of the character.
This commit uses a workaround present in Mercurial by treating all
bytes 0x80 and above as word characters, causing any multi-byte
character to be treated as a word and not segmented.
https://www.mercurial-scm.org/repo/hg/file/6.3.3/mercurial/patch.py#l51
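As a rough illustration of the workaround (not the actual jj code; the
names `is_word_byte` and `tokenize` are made up for this sketch, and it
does not group whitespace runs the way Mercurial's regex does), treating
every byte >= 0x80 as a word byte keeps a multi-byte character in one
token:

```rust
/// Returns true if `b` should be treated as part of a word.
/// Bytes >= 0x80 only occur inside multi-byte UTF-8 sequences, so
/// counting them as word bytes keeps each non-ASCII character whole.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_' || b >= 0x80
}

/// Splits `text` into runs of word bytes and single non-word bytes.
fn tokenize(text: &[u8]) -> Vec<&[u8]> {
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let mut end = start + 1;
        if is_word_byte(text[start]) {
            // Extend the token over the whole run of word bytes.
            while end < text.len() && is_word_byte(text[end]) {
                end += 1;
            }
        }
        tokens.push(&text[start..end]);
        start = end;
    }
    tokens
}
```

With this, `⊢` and `⊣` each come out as a single three-byte token, so an
inline diff compares the characters rather than their final bytes.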
This serves the role of limit() in Mercurial. Since a revset in jj is
(conceptually) an unordered set, a "limit" predicate should define its
ordering criteria. That's why the added predicate is named "latest".
Closes #1110
There are no remaining places where we iterate over a revset and need
the `IndexEntry`s, so we can now make `Revset::iter()` yield
`CommitId`s instead.
I'm about to make `Revset::iter()` yield just `CommitId`s, but the
tests in `test_default_revset_graph_iterator.rs` need an `IndexEntry`
iterator so they can pass it into `RevsetGraphIterator::new()`. This
commit prepares for the change by adding a
`RevsetImpl::iter_graph_impl()` that returns `RevsetGraphIterator`,
keeping `InternalRevset` still hidden within the revset engine. We
could instead have made that (and `ToPredicateFn`) visible to tests. I
can't say which is better.
I don't know if we ever resolve revsets in a mutable repo, but now
that we can get a change id index from a revset, it's easier to
implement this functionality that way.
This replaces the direct use of `IdIndex` in `ReadonlyRepo` by use of
`Revset::change_id_index()`.
I made the `Index` trait require `Send` and `Sync` in order to be able
to store an instance of it in `ReadonlyRepo` (via `ChangeIdIndex`) and
still have that be `Send` and `Sync`. We could alternatively store the
`ChangeIdIndex` in a `Mutex`. Now that will be up to the
`ChangeIdIndex` instead.
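To show why the supertraits matter, here is a small sketch with
stand-in types (the `Index` trait, `InMemoryIndex`, and the
`ChangeIdIndex` struct below are simplified placeholders, not the real
definitions). A struct holding a `Box<dyn Index>` is only `Send + Sync`
if the trait requires it:

```rust
use std::sync::Arc;

/// Placeholder trait; the supertraits make every `dyn Index` trait
/// object `Send + Sync`.
trait Index: Send + Sync {
    fn num_commits(&self) -> usize;
}

struct InMemoryIndex {
    commits: Vec<String>,
}

impl Index for InMemoryIndex {
    fn num_commits(&self) -> usize {
        self.commits.len()
    }
}

/// Placeholder for a type that stores an index behind a trait object.
struct ChangeIdIndex {
    index: Box<dyn Index>,
}

/// Compiles only if `T` is `Send + Sync`.
fn assert_send_sync<T: Send + Sync>() {}

fn share_across_threads() -> usize {
    // Without `Send + Sync` on the trait, this line would not compile.
    assert_send_sync::<ChangeIdIndex>();
    let shared = Arc::new(ChangeIdIndex {
        index: Box::new(InMemoryIndex {
            commits: vec!["a".into(), "b".into()],
        }),
    });
    let worker = {
        let shared = Arc::clone(&shared);
        std::thread::spawn(move || shared.index.num_commits())
    };
    worker.join().unwrap()
}
```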
One of the remaining places we depend on index positions is when
creating a `ChangeIdIndex`. This moves that into the revset engine
(which is coupled to the commit index implementation) by adding a
`Revset::change_id_index()` method. We will also use this function
later when we add support for resolving change id prefixes within a small
revset.
The current implementation simply creates an in-memory index using the
existing `IdIndex` we have in `repo.rs`.
The custom implementation at Google might do the same for small
revsets that are available on the client, but for revsets involving
many commits on the server, it might use a suboptimal implementation
that uses longer-than-necessary prefixes for performance reasons. That
can be done by querying a server-side index including changes not in
the revset, and then verifying that the resulting commits are actually
in the revset.
When deciding the order to visit commits to rebase, we currently look
up parents in the index. I'm trying to remove the current `IndexEntry`
type and will probably have revset iterators yield simply
`CommitId`. Let's therefore look up commit objects here.
I timed this by rewriting all commits in the jj repo. I couldn't
measure any difference. That makes sense since we cache the commits in
`Store` and we would read the commit when rebasing it anyway.
The function is only used in tests, so it doesn't belong in
`default_revset_engine`. Also, it's not specific to that
implementation, so I rewrote it as a revset evaluation.
I'd like to be able to pass a `self` of type `&ReadonlyRepo` to
functions that take a `&dyn Repo`. For that, we need `ReadonlyRepo`
itself to implement `Repo` instead of having `Arc<ReadonlyRepo>`
implement it. I could have solved it in a different way, but the `Arc`
requirement seems like an unnecessary constraint.
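A minimal sketch of the difference, using cut-down stand-ins for `Repo`
and `ReadonlyRepo` (the `name()` method and the helper functions are
invented for illustration):

```rust
use std::sync::Arc;

trait Repo {
    fn name(&self) -> &str;
}

struct ReadonlyRepo {
    name: String,
}

// The trait is implemented on the struct itself, not on
// `Arc<ReadonlyRepo>`.
impl Repo for ReadonlyRepo {
    fn name(&self) -> &str {
        &self.name
    }
}

fn describe(repo: &dyn Repo) -> String {
    format!("repo: {}", repo.name())
}

impl ReadonlyRepo {
    fn describe_self(&self) -> String {
        // `self` is `&ReadonlyRepo`, which coerces to `&dyn Repo`.
        describe(self)
    }
}

fn describe_shared(repo: &Arc<ReadonlyRepo>) -> String {
    // Callers holding an `Arc` can still pass a plain reference.
    describe(repo.as_ref())
}
```

If only `Arc<ReadonlyRepo>` implemented the trait, `describe_self()`
would have no `Arc` to hand over.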
The functions resolving a change id to commits currently return a
`Vec<IndexEntry>`. We want to avoid depending on `IndexEntry` and we
only need the commit ids here.
The index position is specific to the default index implementation and
we don't want to use it outside of there. This commit removes the
use of it as a key for nodes in the graphlog.
I timed it on the git.git repo using `jj log -r 'all()' -T commit_id`
(the worst case I can think of) and it slowed down from ~2.02 s to
~2.20 s (~9%).
Since we hid the graph iterator implementation behind
`Revset::iter_graph()`, I don't think we have any callers of
`Revset::iter()` that require the iteration to be in index position order,
so let's not promise that. We do want to promise that the iteration is
in topological order with children before parents, however.
We need 1.64 to bump `clap` to `4.1`. We don't really need to upgrade
to that, but being on an older version causes minor confusions like
#1393. Rust 1.64 is very close to 6 months old at this point.
For large repos, it's useful to be able to use shorter change id and
commit id prefixes by resolving the prefix in a limited subset of the
repo (typically the same subset that you'd want to see in your default
log output). For very large repos, like Google's internal one, the
shortest unique prefix evaluated within the whole repo is practically
useless because it's long enough that the user would want to copy and
paste it anyway.
Mercurial supports this with its `revisions.disambiguatewithin` config
(added in https://www.mercurial-scm.org/repo/hg/rev/503f936489dd). I'd
like to add the same feature to jj. Mercurial's implementation works
by attempting to resolve the prefix in the whole repo and then, if the
prefix was ambiguous, it resolves it in the configured subset
instead. The advantage of doing it that way is that there's no extra
cost of resolving the revset defining the subset if the prefix was not
ambiguous within the whole repo. However, there are two important
reasons to do it differently in jj:
* We support very large repos using custom backends, and it's probably
cheaper to resolve a prefix within the subset because it can all be
cached on the client. Resolving the prefix within the whole repo
requires a roundtrip to the server.
* We want to be able to resolve change id prefixes, which is always
done in *some* revset. That revset is currently `all()`, i.e. all
visible commits. Even on local disk, it's probably cheaper to
resolve a small revset first and then resolve the prefix within that
than it is to build up the index of all visible change ids.
We could achieve the goal by letting each revset engine respect the
configured subset, but since the solution proposed above makes sense
also for local-disk repos, I think it's better to do it outside of the
revset engine, so all revset engines can share the code.
This commit prepares for the new functionality by moving the symbol
resolution out of `Index::evaluate_revset()`.
We want to allow custom revset engines to define their own graph
iterator. This commit helps with that by adding a
`Revset::iter_graph()` function that returns an abstract iterator.
The current `RevsetGraphIterator` can be configured to skip or include
transitive edges. It skips them by default and we don't expose an
option in the CLI. I didn't bother including that functionality in the new
`iter_graph()` either. At least for now, it will be up to the
implementation whether it includes such edges (it would of course be
free to ignore the caller's request even if we added an option for it
in the API).
This commit adds an `evaluate_revset()` function to the `Index`
trait. It will require some further cleanup, but it already achieves
the goal of letting the index implementation decide which revset
engine to use.
We want to allow customization of the revset engine, so it can query
server indexes, for example. The current revset implementation will be
our default implementation for now. What's left in the `revset` module
after this commit is mostly parsing code.
Now that there's a single implementation of `Revset`, I think it makes
more sense for `is_empty()` to be defined there. Maybe different
revset engines have different ways of implementing it. Even if they
don't, this is trivial to re-implement in each revset engine.
As the comment above `ToPredicateFn` says, it could be a private
type. This commit makes that happen by making the private `Revset`
implementations (`DifferenceRevset` etc.) instead implement an
internal revset type called `InternalRevset`. That type is what
extends `ToPredicateFn`, so the public type doesn't have to. The new
type will not need to implement the new functions I'm about to add to
the `Revset` trait.
We don't want the public `Revset` interface to know about
`ToPredicateFn`. In order to hide it, I'm wrapping the internal type
in another type, so only the internal type can keep implementing
`ToPredicateFn`.
I'd like to be able to change the return type of `evaluate_revset()`
to be an internal type. Since all external callers currently call the
function via `RevsetExpression::evaluate()`, it turns out it's easy to
make it private. To benefit from an internal type, we also need to
make the recursive calls be directly to the internal function.
The tests adding and removing heads to the repo mostly want to verify
that the set of heads is expected. Some of them also check that
commits are available in the index. But they shouldn't care about the
exact index stats.
I don't think there's much to gain from making the index match exactly
what's reachable from the view. FWIW, our cloud-based implementation
at Google will probably make everyone's commits visible in the index
regardless of which operation they're at.
We don't want custom index implementations to have to conform to the
same kind of stats as the default implementation. This commit also
makes the command error out on non-default index types.
I broke the commands in a27da7d8d5 and thought I just fixed it in
c7cf914694a8. However, as I added a test, I realized that I made it
only reindex the commits since the previous operation. I meant for the
command to do a full reindexing of the repo. This fixes that.
I broke `jj debug reindex` in a27da7d8d5. From that commit, we no
longer delete the pointer to the old index, so nothing happens when we
reload the index. This commit fixes that, and also makes the command
error out if run on a repo with a non-default index type.
This is yet another step towards making the index pluggable. The
`IndexStore` trait seems reasonable after this commit. There's still a
lot of work to remove `IndexPosition` from the `Index` trait.
I didn't make `ReadonlyIndex` extend `Index` because it needed an
`as_index()` to convert to `&dyn Index` trait object
anyway. Separating the types also gives us flexibility to implement
the two traits on different types.
Not all index implementations may want to store the readonly index
implementation in an Arc. Exposing the Arc in the interface is also
problematic because `Arc<IndexImpl>` cannot be cast to `Arc<dyn
Index>`.
These two files are closely related, and `Index` and `IndexStore` are
expected to be customized together, so it seems better to keep them in
a single file.
This is another step towards allowing a custom `jj` binary to have its
own index type. We're going to have a server-backed index
implementation at Google, for example.
This is a step towards making the index storage pluggable. The
interface will probably change a bit soon, but let's start with
functions that match the current implementation.
I called the current implementation the `DefaultIndexStore`. Calling
it `SimpleIndexStore` (like `SimpleOpStore` and `SimpleOpHeadsStore`)
didn't seem accurate.
In `git_fetch()`, any glob present in `globs` is an "allow" mark. Using
`&[]` to represent an "allow-all" may be misleading, as it could
indicate that no branch (only the git HEAD) should be fetched.
By using an `Option<&[&str]>`, it is clearer that `None` means that
all branches are fetched.
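A small sketch of the resulting semantics. The `should_fetch()` and
`glob_matches()` helpers are hypothetical, and the matcher is
deliberately simplified to a trailing `*`; the real `git_fetch()` does
not necessarily look like this:

```rust
/// Simplified matcher for illustration: a trailing `*` matches any
/// suffix, anything else must match exactly.
fn glob_matches(glob: &str, branch: &str) -> bool {
    match glob.strip_suffix('*') {
        Some(prefix) => branch.starts_with(prefix),
        None => glob == branch,
    }
}

/// `None` means "fetch every branch"; `Some(globs)` is an allow-list,
/// so an empty list allows nothing.
fn should_fetch(branch: &str, globs: Option<&[&str]>) -> bool {
    match globs {
        None => true,
        Some(globs) => globs.iter().any(|glob| glob_matches(glob, branch)),
    }
}
```

With this shape, `should_fetch("main", None)` is true while an empty
`Some` slice rejects every branch, so the two cases can no longer be
confused.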
Using &[String] forces the caller to materialize owned strings if they
have only references, which is costly. Using &[&str] makes it cheap
if the caller owns strings as well.
To be able to make e.g. `jj log some/path` perform well on cloud-based
repos, a custom revset engine needs to be able to see the paths to
filter by. That way it is able to pass those to a server-side index. This
commit helps with that by effectively converting `jj log -r foo
some/path` into `jj log -r 'foo & file(some/path)'`.
It makes the APIs much simpler if we don't have to pass in information
about the initial operation when we create the `OpHeadsStore`. It also
makes the alternative `OpHeadsStore` implementations simpler since we
move some logic into a shared location (`ReadonlyRepo::init()`).
This effectively undoes ec07104126. Maybe some further refactoring
made it possible to move it back as I'm doing in this commit?
By taking an `OperationId` argument to `IndexStore::write_index()`, we
can remove `associate_file_with_operation()` from the trait. That
simplifies the interface a little bit. The reason I noticed this was
that I'm trying to extract a trait for `IndexStore`, and the word
"file" in it is too specific for e.g. a cloud-based implementation.
I plan to make `RepoLoader::init()` return a `Result`, which means
that `WorkspaceLoader::load()` will need to return more kinds of
errors. Making it return `WorkspaceLoadError` is a good start. By also
extracting a function for converting `WorkspaceLoadError` to
`CommandError`, we can reuse the handling of `PathError` in
`cli_util`.
I'm about to make `RepoLoader::init()` return a `Result`, and I don't
want to have to wrap that in a new error in
`ReadonlyRepo::load_at_head()` since that's only used in tests.
This should fix #1304. I think the added test simulates the behavior of
multiple rebase conflicts, but I don't have expertise around this.
add_index could be replaced with a peekable iterator, but the iterator version
wouldn't be as readable as the current implementation.
The outermost "op-log" label isn't moved to the default template. I think
it belongs to the command's formatter rather than the template.
Old bikeshedding items:
- "current_head", "is_head", or "is_head_op"
=> renamed to "current_operation"
- "templates.op-log" vs "templates.op_log" (the whole template is labeled
as "op-log")
=> renamed to "op_log"
- "template-aliases.'format_operation_duration(time_range)'"
=> renamed to 'format_time_range(time_range)'
The type doesn't seem to provide any benefit. I don't think I had a
good reason for creating it in the first place; it was probably just
unfamiliarity with Rust.
This is another step towards removing `RevsetIterator`. These types
are private, so someone using the library can't accidentally create a
`UnionRevsetIterator` with inputs in different order, for example.
I was thinking of replacing `RevsetIterator` by a regular
`Iterator<Item=IndexEntry>`. However, that would make it easier to
pass in an iterator that produces revisions in a non-topological order
into `RevsetGraphIterator`, which would produce unexpected results (it
would result in nodes that are not connected to their parents, if
their parents had already been emitted). I think it makes sense to
instead pass in a revset into `RevsetGraphIterator`.
Incidentally, it will also be useful to have the full revset available
in `RevsetGraphIterator` if we rewrite the algorithm to be more
similar to Mercurial's and Sapling's algorithm, which involves asking
the revset if it contains parent revisions.
We write conflicts to the working copy by materializing them as
conflict markers in a file. When the file has been modified (or just
the mtime has changed), we parse the markers to reconstruct the
conflict. For example, let's say we see this conflict marker:
```
<<<<<<<
+++++++
b
%%%%%%%
-a
+c
>>>>>>>
```
Then we will create a hunk with ["a"] as removed and ["b", "c"] as
added.
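To make the example concrete, here is a simplified sketch of parsing a
single hunk in that marker format. It is not the real parser;
`ParsedHunk`, `Section`, and `parse_conflict()` are names invented for
this illustration, and it ignores text outside the markers:

```rust
/// Removed and added sides reconstructed from one conflict hunk.
#[derive(Debug, Default, PartialEq)]
struct ParsedHunk {
    removed: Vec<String>,
    added: Vec<String>,
}

enum Section {
    Outside,
    Snapshot, // after `+++++++`: lines belong to one added side
    Diff,     // after `%%%%%%%`: `-`/`+`/` ` prefixed diff lines
}

fn push_line(side: Option<&mut String>, line: &str) {
    if let Some(side) = side {
        side.push_str(line);
        side.push('\n');
    }
}

fn parse_conflict(text: &str) -> ParsedHunk {
    let mut hunk = ParsedHunk::default();
    let mut section = Section::Outside;
    for line in text.lines() {
        match line {
            "<<<<<<<" => section = Section::Outside,
            ">>>>>>>" => break,
            "+++++++" => {
                section = Section::Snapshot;
                hunk.added.push(String::new());
            }
            "%%%%%%%" => {
                // A diff section contributes one removed and one added side.
                section = Section::Diff;
                hunk.removed.push(String::new());
                hunk.added.push(String::new());
            }
            _ => match section {
                Section::Outside => {}
                Section::Snapshot => push_line(hunk.added.last_mut(), line),
                Section::Diff => {
                    if let Some(rest) = line.strip_prefix('-') {
                        push_line(hunk.removed.last_mut(), rest);
                    } else if let Some(rest) = line.strip_prefix('+') {
                        push_line(hunk.added.last_mut(), rest);
                    } else if let Some(rest) = line.strip_prefix(' ') {
                        // Context lines belong to both sides.
                        push_line(hunk.removed.last_mut(), rest);
                        push_line(hunk.added.last_mut(), rest);
                    }
                }
            },
        }
    }
    hunk
}
```

For the marker shown above, this yields `removed == ["a\n"]` and
`added == ["b\n", "c\n"]`, with the adds in the order the sections
appear in the file.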
Now, since commit b84be06c08, when we materialize conflicts, we
minimize the diff part of the marker (the `%%%%%%%` part). The problem
is that that minimization may result in a different order of the
positive conflict terms. That's particularly bad because we do the
minimization per hunk, so we can end up reconstructing an input that
never existed.
This commit fixes the bug by only considering the next add and the one
after that, and emitting either only the first with `%%%%%%%`, or both
of them, with the first one in `+++++++` and the second one in
`%%%%%%%`.
Note that the recent fix to add context to modify/delete conflicts
means that when we parse such modified conflicts, we'll always
consider them resolved, since the expected adds/removes we pass will
not match what's actually in the file. That doesn't seem so bad, and
it's not obvious what the fix should be, so I'll leave that for later.
The function only needs the `TreeValue` so it makes more sense this
way, I think. That will also let the caller keep the rest of the
`Conflict` value owned (though there is nothing but the `value` field
in it right now).
It took a while before I realized that conflicts could be modeled as
simple algebraic expressions with positive and negative terms (they
were modeled as recursive 3-way conflicts initially). We've been
thinking of them that way for a while now, so let's make the
`ConflictPart` name match that model.
When we materialize modify/delete conflicts, we currently don't
include any context lines. That's because modify/delete conflicts have
only two sides, so there's no common base to compare to. Hunks that
are unchanged on the "modify" side are therefore not considered
conflicting, and since they don't contribute new changes, they're
simply skipped (here:
3dfedf5814/lib/src/files.rs (L228-L230)).
It seems more useful to instead pretend that the missing side is an
empty file. That way we'll get a conflict in the entire file.
We can still decide later to make e.g. `jj resolve` prompt the user on
modify/delete conflicts just like `hg resolve` does (or maybe it
actually happens earlier there, I don't remember).
Closes #1244.
If I understand correctly, the 'revset lifetimes on `Box<dyn
Revset<'index> + 'revset>` are not constrained by the lifetime of a
revset; we don't have any revsets that borrow data from other
revsets. Instead, they're all about constraining a boxed revset to the
index's lifetime. Without the lifetime annotation, it would default to
'static, and the borrow-checker doesn't like `dyn Revset<'index> +
'static`, since the revset could then live longer than the index it
borrows.
It's been about 10 weeks and 730 commits since 0.6.0, compared to
about 7 weeks and 350 commits between 0.5.0 and 0.6.0, so it's time
for a new release. There have been significant user-visible changes and
code-quality improvements. Thanks, everyone!
This is just a little preparation for extracting a `Repo` trait that's
implemented by both `ReadonlyRepo` and `MutableRepo`. The `index()`
function in that trait will of course have to return the same type in
both implementations, and that type will be `&dyn Index`.
Even though we don't know the details yet, we know that we want to
make the index pluggable like the commit and opstore
backends. Defining a trait for it should be a good step. We can refine
the trait later.
By separating the value spaces of change ids and commit ids, we can
simplify lookup of a prefix. For example, if we know that a prefix is
for a change id, we don't have to try to find matching commit ids. I
think it might also help new users more quickly understand that change
ids are not commit ids.
This commit is a step towards that separation. It allows resolving
change ids by using hex digits from the back of the alphabet instead
of 0-f, so 'z'='0', 'y'='1', etc, and 'k'='f'. Thanks to @ilyagr for
the idea. The regular hex digits are still allowed.
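The mapping can be sketched as follows. The function names are made up
for this example and the real implementation may differ; the point is
only that 'z' down to 'k' count up from 0 to 15 while regular hex
digits keep working:

```rust
/// Converts one digit of either alphabet to its 4-bit value.
fn digit_value(c: char) -> Option<u8> {
    match c {
        '0'..='9' => Some(c as u8 - b'0'),
        'a'..='f' => Some(c as u8 - b'a' + 10),
        // Reverse hex: 'z' = 0, 'y' = 1, ..., 'k' = 15.
        'k'..='z' => Some(b'z' - c as u8),
        _ => None,
    }
}

/// Rewrites an id written in either alphabet as regular lowercase hex.
/// Returns `None` if any character is not a valid digit.
fn to_forward_hex(id: &str) -> Option<String> {
    id.chars()
        .map(|c| digit_value(c).and_then(|v| char::from_digit(u32::from(v), 16)))
        .collect()
}
```

For example, `to_forward_hex("zyk")` gives `Some("01f")`, and letters
between 'g' and 'j' are rejected because they belong to neither
alphabet.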
Supported values are:
- `none` for no author information,
- `full` for both the name and email,
- `name` for just the name,
- `username` for the username part of the email,
- (default) `email` (or any other gibberish for that matter) for the full email.
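Roughly, the selection behaves like the sketch below. The function
`format_author()` is hypothetical, and the `name <email>` layout used
for `full` is an assumption made only for this example:

```rust
/// Picks the author representation for a configured style.
/// Unknown styles fall back to the full email, like the default.
fn format_author(style: &str, name: &str, email: &str) -> String {
    match style {
        "none" => String::new(),
        // Assumed layout for illustration only.
        "full" => format!("{name} <{email}>"),
        "name" => name.to_string(),
        // Everything before the first '@'.
        "username" => email.split('@').next().unwrap_or(email).to_string(),
        _ => email.to_string(),
    }
}
```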
There's a subtle difference between
- 'expression = { whitespace* ... whitespace* }', and
- '_{ whitespace* ~ expression ~ whitespace* }'.
The former includes surrounding whitespace in an "expression", the latter
doesn't. This affects the span of error indication.
The added expect_arguments() is basically a copy from the template_parser.
I'll reimplement it to support keyword arguments, so I don't care much about
the current implementation.
I leave expect_no/one_argument() as wrappers because parsing 0/1 arguments
is pretty common.
Error messages are slightly changed. I personally prefer not to add extra
code for singular/plural handling, but if we do, I'll add 'if N == 1' case.
Our internal backend at Google uses a 32-byte change id, so I'd like
to make the backend able to decide the length. To start with, let's
make the backend able to decide what the root change id should
be. That's consistent with how we already let the backend decide what
the root commit id should be.
The function is currently only about the length of commit IDs, so
let's clarify that. I'm going to add another function for the length
of change IDs next. I don't know if we're going to care about lengths
of other hashes in the future. We might even be able to remove the
current restriction that all commit IDs and all change IDs have the
same length.
I think the CLI currently checks that the backend is not told to write
a merge commit with the root as one parent, but we should not panic if
those checks fail.
Git's HEAD ref is similar to other refs and can logically have
conflicts just like the other refs in `git_refs`. As with the other
refs, it can happen if you run concurrent commands importing two
different updates from Git. So let's treat `git_head` the same as
`git_refs` by making it an `Option<RefTarget>`.
Add a new git.auto-local-branch config option. When set to false, a
remote-tracking branch imported from Git will not automatically create a
local branch target. This is implemented by a new GitSettings struct
that passes Git-related settings from UserSettings.
This behavior is particularly useful in a co-located jj and Git repo,
because a Git remote might have branches that are not of everyday
interest to the user, so it does not make sense to export them as local
branches in Git. E.g. https://github.com/gitster/git, the maintainer's
fork of Git, has 379 branches, most of which are topic branches kept
around for historical reasons, and Git developers wouldn't be expected
to have local branches for each remote-tracking branch.
I don't think there's a good reason not to write the
`.jj/working_copy/tree_state` file on init. Being able to assume that
the file exists means that we won't need the store object to lazily
load the `TreeState` object. Well, except that `TreeState` keeps an
`Arc<Store>`, but I'm trying to change that.
When building an initial index from an existing Git repo, for example,
we walk parents and predecessors to find all commits to index. Part of
that code was looking up the whole parent and predecessor commits even
though it only needed the ids. I don't know if this has a measurable
impact on performance, but it's not really any more complex to just
get the ids anyway.
I would expect `Commit::is_empty()` to check if the commit is empty in
our usual sense, i.e. that there are no changes compared to the
auto-merged parents. However, it would return `false` for any merge
commit (and for the root commit). Since we only use it in one place,
let's inline it there. The use there does seem reasonable, because
it's about abandoning an "uninteresting" working-copy commit.
I think of it more as style than a format, so using `style` in the
config key makes sense to me.
I didn't bother making upgrades easy by supporting the old name since
this was just released and only a few developers probably have it set.
revset::resolve_change_id() for ReadonlyRepo will be replaced with this
implementation. This doesn't mean revset query will speed up. A trivial
query will become slower due to the initialization cost of the change id
index. "jj log -r hex" will get faster since we have to pay the cost anyway.
Benchmark numbers (against my "linux" repo):
Command:
hyperfine --warmup 3 --runs 20 \
"jj log -r $hex -T '' --no-commit-working-copy --no-graph"
Linear search (e874570947):
Time (mean ± σ): 223.9 ms ± 16.2 ms [User: 181.2 ms, System: 42.7 ms]
Range (min … max): 207.7 ms … 247.6 ms 50 runs
Building IdIndex:
Time (mean ± σ): 855.0 ms ± 21.7 ms [User: 788.4 ms, System: 66.6 ms]
Range (min … max): 822.6 ms … 927.5 ms 50 runs
Building IdIndex, but hacked to store SmallVec<[u8; 20]>:
Time (mean ± σ): 406.1 ms ± 15.9 ms [User: 354.1 ms, System: 52.0 ms]
Range (min … max): 382.2 ms … 428.6 ms 50 runs
For my "jj" work repo, changes are < ~1ms.
I've preferred "working-copy commit" over "checkout" for a while
because I think it's clearer, but there were lots of places still
using "checkout". I've left "checkout" in places where it refers to
the action of updating the working copy or the working-copy commit.
`SimpleOpHeadsStore` currently stores its files in
`.jj/repo/op_heads/simple_op_heads/`. The `.jj/repo/op_heads/type`
file indicates the type of op-heads backend. If that contains
"simple_op_head_store", we use the `SimpleOpHeadsStore`
backend. There's no need for the `simple_op_heads` directory to also
indicate the type of backend in its name. I kept just the `heads` in
the name to make it less redundant with the parent directory (which is
`op_heads`). We could alternatively call the directory `values` or
similar.
Since this function depends on both index and view, it can't be moved to
one of the storage objects. If we go forward with this approach, some
revset::resolve_*() functions will also be migrated to RepoRef.
This patch slightly changes the function name since a "prefix" might have
various meanings.
This should fix the panic in the case reported in #1107. It's a bit
hard to reproduce because we normally notice the missing commit when
we snapshot the working copy, but it's possible to reproduce it using
`--no-commit-working-copy`.
I suspect the added test is too brittle because it checks the exact
error message. On the other hand, it might be useful to have one test
case like this so we catch accidental changes in the format.
Since IdIndex is immutable, we don't need fast insertion provided by BTreeMap.
Let's simply use Vec for some speed up. More importantly, this allows us to
store multiple (ChangeId, CommitId) pairs for the same change id, and will
unblock the use of IdIndex in revset::resolve_symbol().
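The idea can be sketched like this (a simplified stand-in, not the
actual `IdIndex`; the method names are assumptions). A sorted `Vec`
looked up by binary search naturally returns every entry that shares a
key:

```rust
/// Immutable index backed by a sorted `Vec` of key/value pairs.
struct IdIndex<K, V> {
    entries: Vec<(K, V)>,
}

impl<K: Ord, V> IdIndex<K, V> {
    /// Sorts once at construction; the sort is stable, so values with
    /// equal keys keep their insertion order.
    fn from_vec(mut entries: Vec<(K, V)>) -> Self {
        entries.sort_by(|a, b| a.0.cmp(&b.0));
        IdIndex { entries }
    }

    /// Returns all entries stored under `key` (possibly none).
    fn lookup(&self, key: &K) -> &[(K, V)] {
        let start = self.entries.partition_point(|(k, _)| k < key);
        let len = self.entries[start..].partition_point(|(k, _)| k == key);
        &self.entries[start..start + len]
    }
}
```

A `BTreeMap<K, V>` would keep only one value per key, whereas here two
commits with the same change id simply sit next to each other.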
Some benchmark numbers (against my "linux" repo) follow.
Command:
hyperfine --warmup 3 "jj log -r master \
-T 'commit_id.short_prefix_and_brackets()' \
--no-commit-working-copy --no-graph"
Original:
Time (mean ± σ): 1.892 s ± 0.031 s [User: 1.800 s, System: 0.092 s]
Range (min … max): 1.833 s … 1.935 s 10 runs
This commit:
Time (mean ± σ): 867.5 ms ± 2.7 ms [User: 809.9 ms, System: 57.7 ms]
Range (min … max): 862.3 ms … 871.0 ms 10 runs
With my "jj" work repo, this saves ~4ms to show the log with default revset.
Command:
JJ_CONFIG=/dev/null hyperfine --warmup 3 --runs 100 \
"jj log -T 'commit_id.short_prefix_and_brackets() \
change_id.short_prefix_and_brackets()' \
--no-commit-working-copy"
Baseline (a7541e1ba4):
Time (mean ± σ): 54.1 ms ± 16.4 ms [User: 46.4 ms, System: 7.8 ms]
Range (min … max): 36.5 ms … 78.1 ms 100 runs
This commit:
Time (mean ± σ): 49.5 ms ± 16.4 ms [User: 42.4 ms, System: 7.2 ms]
Range (min … max): 31.4 ms … 70.9 ms 100 runs
This iterator will be used to merge neighbor commit ids across segments.
resolve_prefix() is simplified to a non-short-circuiting loop. I think that's
fine because visiting parents is cheap, and the costly operation here is
segment_resolve_prefix().
entry_by_pos() could also be migrated to iterator, but I leave the unsafe
bits there.
ReadonlyIndex implementation leverages the existing binary search
function. MutableIndex one is basically the same as repo::IdIndex.
Shortest prefix length could be calculated for each segment, but I think
returning neighbors is better for testing.
This is ugly, but we need a special case because root_change_id and
root_commit_id aren't equal but share the same prefix bytes. In practice,
no one would care for the shortest root id prefix, but we'll need to deal
with a similar problem when migrating prefix id resolution to repo layer.
This helps us to migrate commit_id index to ReadonlyIndex. For large
repositories, this also reduces initialization cost, but that's not the main
intent of this change.
https://github.com/martinvonz/jj/pull/1041#issuecomment-1399225876
common_hex_len() and iter_half_bytes() are added to backend.rs since more
call sites will be added to index.rs, and I feel index.rs isn't a good place
to host these kinds of utility functions.
I made it a free function. Alternatively, the root id could be instantiated
by and obtained through backend, but I don't think we'll need such level of
abstraction.
I'm going to add a workaround for shortest prefix calculation of the root ids,
where this function will be used.
Make op resolution a closed operation, powered by a callback provided by the
caller which runs under an internal lock scope. This allows for greatly
simplifying the internal lifetime structuring.
If commit_id[..prefix_len] < prefix, commit_id < prefix is obviously true.
If commit_id[..prefix_len] == prefix, commit_id < prefix returns false. So
slicing isn't needed.
This makes commit_id_byte_prefix_to_pos() basically the same as
segment_commit_id_to_pos(), and these two functions can be merged.
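To double-check the reasoning, here is a small sketch comparing both
ways of finding the lower bound in a sorted list of ids. The function
names are invented for this example; they are not the functions named
above:

```rust
/// Lower bound using only the first `prefix.len()` bytes of each id.
fn lower_bound_sliced(sorted_ids: &[Vec<u8>], prefix: &[u8]) -> usize {
    sorted_ids.partition_point(|id| {
        let len = prefix.len().min(id.len());
        &id[..len] < prefix
    })
}

/// Lower bound comparing the whole id against the prefix.
/// Slices compare lexicographically, so an id that merely starts
/// with `prefix` is not less than it.
fn lower_bound_whole(sorted_ids: &[Vec<u8>], prefix: &[u8]) -> usize {
    sorted_ids.partition_point(|id| id.as_slice() < prefix)
}
```

Both functions return the same position for any prefix, which is why
the slicing can be dropped.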
matches() is called from resolve_change_id() loop right now, so it's better to
not allocate String there. Regarding new IdIndex integration, I'll probably make
IdIndex store raw byte ids instead of hexes, and use HexPrefix to look up
range and test prefixes. I think this is basically the same as prefix lookup
in MutableIndex, but I have no idea if we can factor out a common interface.
I made HexPrefix store (Vec<u8>, bool) instead of (Vec<u8>, Option<u8>) so
both min/partial prefixes can be borrowed as slice.
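As an illustration of that representation (a sketch only; the field
and method names are assumptions, not necessarily those in the code),
an odd-length prefix such as "abc" is stored as the bytes `[0xab,
0xc0]` plus a flag saying the last byte is only half specified:

```rust
struct HexPrefix {
    /// Smallest byte string starting with the prefix; a trailing odd
    /// digit is stored in the high nibble with the low nibble zeroed.
    min_prefix_bytes: Vec<u8>,
    has_odd_byte: bool,
}

impl HexPrefix {
    fn new(prefix: &str) -> Option<HexPrefix> {
        let digits: Option<Vec<u8>> = prefix
            .chars()
            .map(|c| c.to_digit(16).map(|d| d as u8))
            .collect();
        let digits = digits?;
        let min_prefix_bytes = digits
            .chunks(2)
            .map(|pair| (pair[0] << 4) | pair.get(1).copied().unwrap_or(0))
            .collect();
        Some(HexPrefix {
            min_prefix_bytes,
            has_odd_byte: digits.len() % 2 == 1,
        })
    }

    /// Borrowed as a slice, e.g. for a range lookup in a sorted index.
    fn min_prefix_bytes(&self) -> &[u8] {
        &self.min_prefix_bytes
    }

    fn matches(&self, id: &[u8]) -> bool {
        match self.min_prefix_bytes.split_last() {
            Some((odd, full)) if self.has_odd_byte => {
                // Full bytes must match; the next byte only in its high nibble.
                id.starts_with(full)
                    && id.get(full.len()).map_or(false, |b| (*b & 0xf0) == *odd)
            }
            _ => id.starts_with(&self.min_prefix_bytes),
        }
    }
}
```

No `String` is allocated in `matches()`, which matters when it is
called in a loop.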
By inlining `write_commit_internal()` into `write_commit()`, we can
avoid redoing some steps when we retry. This includes taking the mutex
lock, and reading the tree object and parent commits. It also means
that we avoid cloning the input commit object, which we otherwise
would even in the non-retrying case. I haven't measured if any of this
makes a significant difference, but I think it also slightly
simplifies the code, so it doesn't have to.
This is fast enough to be used on medium-sized repositories such as git/git.
It is a bit slow, but bearable, on huge repositories such as torvalds/linux.
There is 0 performance penalty if the display of unique prefixes is disabled.
A trie-based implementation will be submitted for consideration in a
follow-up PR. It is faster, but more complicated.
**Update:** I also just discovered https://sapling-scm.com/docs/internals/indexedlog/
There are three important aspects of performance that seemed relevant:
1. Speed of computing the shortest unique prefix per id. It is worlds faster
than the naive implementation before this commit. It can be optimized
further by using a trie or maybe the `fst` crate.
2. Speed of initial loading of the index that happens before the first commit is
shown. This is the part that's noticeable but bearable on torvalds/linux.
This could be optimized by storing a sorted list of commit and change ids on
disk. This would likely involve reworking the `Index`.
Failing that, the speed of initial loading doesn't change if a trie is used
and would likely be worse with the `fst` crate.
3. Memory use is unremarkable here. I don't have good tools to measure it
precisely, but it does not balloon to gigabytes even on the linux repo.
This creates a templater function `short_underscore_prefix` for commit and
change ids. It is similar to `short` function, but shows one fewer hexadecimal
digit and inserts an underscore after the shortest unique prefix.
Highlighting with an underline and perhaps color/bold will be in a follow-up
PR.
The implementation is quadratic, a simple comparison of each id with every
other id. It is replaced in a subsequent commit. The problem with it is that,
while it works fine for a `jj`-sized repo, it becomes painfully slow with a
repo the size of git/git.
Still, this naive implementation is included here since it's simple, and could
be used as a reference implementation.
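For reference, the quadratic approach amounts to something like the
following sketch. It works on hex strings for simplicity, assumes all
ids have the same length, and the signature is invented for this
example rather than copied from `repo.rs`:

```rust
/// Number of leading bytes that `a` and `b` have in common.
fn common_prefix_len(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count()
}

/// Length of the shortest prefix of `target` that no other id in `ids`
/// shares. Each call scans every id, so computing it for all ids is
/// quadratic in the number of ids.
fn shortest_unique_prefix_len(ids: &[&str], target: &str) -> usize {
    ids.iter()
        .filter(|other| **other != target)
        .map(|other| common_prefix_len(other, target) + 1)
        .max()
        .unwrap_or(1)
}
```

With ids `abc123`, `abd456`, and `f00000`, the first needs three digits
and the last needs only one.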
The `shortest_unique_prefix_length` function goes into `repo.rs` since that's
convenient for follow-up commits in this PR to have nicer diffs.
Fixes https://github.com/martinvonz/jj/issues/1050
Thanks to Martin for suggesting the exact fix.
The tests go into the new tests/test_duplicate_command.rs, which will be
expanded shortly with other tests depending on this bugfix.