write up a new API proposal, tracked

Niko Matsakis 2022-08-02 10:14:31 +03:00
parent 8e348f0bc8
commit db75a1a510

@ -2,99 +2,288 @@
{{#include caveat.md}}
This page contains a brief overview of the pieces of a salsa program. For a more detailed look, check out the [tutorial](./tutorial.md), which walks through the creation of an entire project end-to-end.
This page contains a brief overview of the pieces of a salsa program.
For a more detailed look, check out the [tutorial](./tutorial.md), which walks through the creation of an entire project end-to-end.
## Goal of Salsa
The goal of salsa is to support efficient **incremental recomputation**.
salsa is used in rust-analyzer, for example, to help it recompile your program quickly as you type.
The basic idea of a salsa program is like this:
```rust
let mut input = ...;
loop {
let output = your_program(&input);
modify(&mut input);
}
```
You start out with an input that has some value.
You invoke your program to get back a result.
Some time later, you modify the input and invoke your program again.
**Our goal is to make this second call faster by re-using some of the results from the first call.**
In reality, of course, you can have many inputs and "your program" may be many different methods and functions defined on those inputs.
But this picture still conveys a few important concepts:
- Salsa separates out the "incremental computation" (the function `your_program`) from some outer loop that is defining the inputs.
- Salsa gives you the tools to define `your_program`.
- Salsa assumes that `your_program` is a purely deterministic function of its inputs, or else this whole setup makes no sense.
- The mutation of inputs always happens outside of `your_program`, as part of this master loop.
## Database
Every salsa program has an omnipresent _database_, which stores all the data across revisions. As you change the inputs to your program, we will consult this database to see if there are old computations that can be reused. The database is also used to implement interning and other convenient features.
Each time you run your program, salsa remembers the values of each computation in a **database**.
When the inputs change, it consults this database to look for values that can be reused.
The database is also used to implement interning (making a canonical version of a value that can be copied around and cheaply compared for equality) and other convenient salsa features.
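The reuse idea can be sketched with the standard library alone. This is a minimal illustration, not salsa's implementation: the `Database` type, its `word_count` method, and the `executions` counter are hypothetical names made up for this example, and the check here simply compares the stored input text against the new one.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a database: for each key it remembers the
/// input it last saw and the output computed from that input.
struct Database {
    memo: HashMap<String, (String, usize)>,
    /// Counts how often the computation actually ran.
    executions: usize,
}

impl Database {
    fn new() -> Self {
        Database { memo: HashMap::new(), executions: 0 }
    }

    /// Counts the words in `text`, reusing the stored result when the
    /// text previously seen under `key` is unchanged.
    fn word_count(&mut self, key: &str, text: &str) -> usize {
        if let Some((old_text, count)) = self.memo.get(key) {
            if old_text.as_str() == text {
                return *count; // reuse the earlier result
            }
        }
        self.executions += 1;
        let count = text.split_whitespace().count();
        self.memo.insert(key.to_string(), (text.to_string(), count));
        count
    }
}

fn main() {
    let mut db = Database::new();
    assert_eq!(db.word_count("a.txt", "fn foo"), 2);
    assert_eq!(db.word_count("a.txt", "fn foo"), 2); // served from the database
    assert_eq!(db.executions, 1);
    assert_eq!(db.word_count("a.txt", "fn foo bar"), 3); // input changed
    assert_eq!(db.executions, 2);
}
```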
## Memoized functions
## Inputs
The most basic concept in salsa is a **memoized function**. When you mark a function as memoized, that indicates that you would like to store its value in the database:
Every Salsa program begins with an **input**.
Inputs are special structs that define the starting point of your program.
Everything else in your program is ultimately a deterministic function of these inputs.
For example, in a compiler, there might be an input defining the contents of a file on disk:
```rust
#[salsa::memoized]
fn parse_module(db: &dyn Db, module: Module) -> Ast {
...
#[salsa::input]
pub struct ProgramFile {
pub path: PathBuf,
pub contents: String,
}
```
When you call a memoized function, we first check if we can find the answer in the database. In that case, we return a clone of the saved answer instead of executing the function twice.
Sometimes you have memoized functions whose return type might be expensive to clone. In that case, you can mark the memoized function as `return_ref`. When you call a `return_ref` function, we will return a reference to the memoized result in the database:
You create an input by using the `new` method.
Because the values of input fields are stored in the database, you also give an `&mut`-reference to the database:
```rust
#[salsa::memoized(return_ref)]
fn module_text(db: &dyn Db, module: Module) -> &String {
...
}
let file: ProgramFile = ProgramFile::new(
&mut db,
PathBuf::from("some_path.txt"),
String::from("fn foo() { }"),
);
```
## Inputs and revisions
### Salsa structs are just an integer
Each memoized function has an associated `set` method that can be used to set a return value explicitly. Memoized functions whose values are explicitly set are called _inputs_.
The `ProgramFile` struct generated by the `salsa::input` macro doesn't actually store any data. It's just a newtyped integer id:
```rust
fn load_module_source(db: &mut dyn Db, module: Module) {
let source: String = load_source_text();
module_text::set(db, module, source);
// ^^^ set function!
}
// Generated by the `#[salsa::input]` macro:
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct ProgramFile(salsa::Id);
```
Often, inputs don't have a function body, but simply panic in the case that they are not set explicitly, but this is not required. For example, the `module_text` function returns the raw bytes for a module. This is likely not something we can compute from "inside" the system, so the definition might just panic:
This means that, when you have a `ProgramFile`, you can easily copy it around and put it wherever you like.
To actually read any of its fields, however, you will need to use the database and a getter method.
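To make the "just an integer" idea concrete, here is a hand-written sketch of the pattern using only the standard library. It is an assumption-laden illustration, not the code the macro generates: the `Database` and `FileData` types are hypothetical, a plain `u32` index stands in for `salsa::Id`, and paths are plain `String`s.

```rust
/// Hypothetical id type: a newtyped index into the database.
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
struct ProgramFile(u32);

/// The actual field values live here, inside the database.
struct FileData {
    path: String,
    contents: String,
}

#[derive(Default)]
struct Database {
    files: Vec<FileData>,
}

impl ProgramFile {
    /// Creating a value stores data in the database, hence `&mut`.
    fn new(db: &mut Database, path: String, contents: String) -> Self {
        db.files.push(FileData { path, contents });
        ProgramFile((db.files.len() - 1) as u32)
    }

    /// Getters only read, so a shared reference is enough.
    fn path(self, db: &Database) -> String {
        db.files[self.0 as usize].path.clone()
    }

    fn contents(self, db: &Database) -> String {
        db.files[self.0 as usize].contents.clone()
    }
}

fn main() {
    let mut db = Database::default();
    let file = ProgramFile::new(
        &mut db,
        String::from("some_path.txt"),
        String::from("fn foo() { }"),
    );
    let copy = file; // `ProgramFile` is `Copy`: copying it copies only the id
    assert_eq!(copy, file);
    assert_eq!(file.path(&db), "some_path.txt");
    assert_eq!(copy.contents(&db), "fn foo() { }");
}
```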
### Reading fields and `return_ref`
You can access the value of an input's fields by using the getter method.
As this is only reading the field, it just needs a `&`-reference to the database:
```rust
#[salsa::memoized(return_ref)]
fn module_text(db: &dyn Db, module: Module) -> String {
panic!("text for module `{module:?}` not set")
}
let contents: String = file.contents(&db);
```
Each time you invoke `set`, you begin a new **revision** of the database. Each memoized result in the database tracks the revision in which it was computed; invoking `set` may invalidate memoized results, causing functions to be re-executed (see the reference for [more details on how salsa decides when a memoized result is outdated](./reference/algorithm.md)).
## Entity values
Entity structs are special structs whose fields are versioned and stored in the database. For example, the `Module` type that we have been passing around could potentially be declared as an entity:
Invoking the accessor clones the value from the database.
Sometimes this is not what you want, so you can annotate fields with `#[return_ref]` to indicate that they should return a reference into the database instead:
```rust
#[salsa::entity]
struct Module {
#[salsa::input]
pub struct ProgramFile {
pub path: PathBuf,
#[return_ref]
path: String,
pub contents: String,
}
```
A new module could be created with the `new` method:
Now `file.contents(&db)` will return an `&String`.
You can also use the `data` method to access the entire struct:
```rust
let m: Module = Module::new(db, "some_path".to_string());
file.data(&db)
```
Despite the struct declaration above, the actual `Module` struct is just a newtyped integer, guaranteed to be unique within this database revision. You can access fields via accessors like `m.path(db)` (the `#[return_ref]` attribute here indicates that a `path` returns an `&String`, and not a cloned `String`).
### Writing input fields
## Interned values
Finally, you can also modify the value of an input field by using the setter method.
Since this modifies the input, the setter takes an `&mut`-reference to the database:
In addition to entities, you can also declare _interned structs_ (and enums). Interned structs take arbitrary data and replace it with an integer. Unlike an entity, where each call to `new` returns a fresh integer, interning the same data twice gives back the same integer.
```rust
file.set_contents(&mut db, String::from("fn foo() { /* add a comment */ }"));
```
A common use for interning is to intern strings:
## Tracked functions
Once you've defined your inputs, the next things to define are **tracked functions**:
```rust
#[salsa::tracked]
fn parse_file(db: &dyn crate::Db, file: ProgramFile) -> Ast {
let contents: &str = file.contents(db);
...
}
```
When you call a tracked function, salsa will track which inputs it accesses (in this example, `file.contents(db)`).
It will also memoize the return value (the `Ast`, in this case).
If you call a tracked function twice, salsa checks if the inputs have changed; if not, it can return the memoized value.
The algorithm salsa uses to decide when a tracked function needs to be re-executed is called the [red-green algorithm](./reference/algorithm.md), and it's where the name salsa comes from.
Tracked functions have to follow a particular structure:
- They must take a `&`-reference to the database as their first argument.
- Note that because this is an `&`-reference, it is not possible to create or modify inputs during a tracked function!
- They must take a "salsa struct" as the second argument -- in our example, this is an input struct, but there are other kinds of salsa structs we'll describe shortly.
- They _can_ take additional arguments, but it's faster and better if they don't.
Tracked functions can return any clone-able type. A clone is required since, when the value is cached, the result will be cloned out of the database. Tracked functions can also be annotated with `#[return_ref]` if you would prefer to return a reference into the database instead (if `parse_file` were so annotated, then callers would actually get back an `&Ast`, for example).
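The "check whether the inputs have changed" step can be sketched with revision counters. This is a deliberately simplified stand-in, not the red-green algorithm itself: there is a single input and a single memoized function, and every name here (`Database`, `line_count`, `contents_changed_at`, `executions`) is hypothetical. `Cell` is used so that the function can update its memo through a `&`-reference, mirroring the rule that tracked functions take `&db`.

```rust
use std::cell::Cell;

struct Database {
    /// Bumped every time an input is set.
    revision: u64,
    contents: String,
    /// Revision in which `contents` last changed.
    contents_changed_at: u64,
    /// Memo for `line_count`: (value, revision in which it was computed).
    line_count_memo: Cell<Option<(usize, u64)>>,
    /// Counts how often `line_count` actually executed.
    executions: Cell<u32>,
}

impl Database {
    fn new(contents: &str) -> Self {
        Database {
            revision: 1,
            contents: contents.to_string(),
            contents_changed_at: 1,
            line_count_memo: Cell::new(None),
            executions: Cell::new(0),
        }
    }

    /// Setting an input needs `&mut` and starts a new revision.
    fn set_contents(&mut self, contents: &str) {
        self.revision += 1;
        self.contents = contents.to_string();
        self.contents_changed_at = self.revision;
    }
}

/// Plays the role of a tracked function: it only needs `&Database`.
fn line_count(db: &Database) -> usize {
    if let Some((value, computed_at)) = db.line_count_memo.get() {
        if computed_at >= db.contents_changed_at {
            return value; // input unchanged since the memo was computed
        }
    }
    db.executions.set(db.executions.get() + 1);
    let value = db.contents.lines().count();
    db.line_count_memo.set(Some((value, db.revision)));
    value
}

fn main() {
    let mut db = Database::new("a\nb");
    assert_eq!(line_count(&db), 2);
    assert_eq!(line_count(&db), 2); // memoized
    assert_eq!(db.executions.get(), 1);
    db.set_contents("a\nb\nc");
    assert_eq!(line_count(&db), 3); // re-executed after the input changed
    assert_eq!(db.executions.get(), 2);
}
```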
## Tracked structs
**Tracked structs** are intermediate structs created during your computation.
Like inputs, their fields are stored inside the database.
Unlike inputs, they can only be created inside a tracked function, and their fields can never change once they are created.
```rust
#[salsa::tracked]
struct Ast {
#[return_ref]
top_level_items: Vec<Item>,
}
```
Just as with an input, new values are created by invoking `Ast::new`.
Unlike with an input, the `new` for a tracked struct only requires a `&`-reference to the database:
```rust
#[salsa::tracked]
fn parse_file(db: &dyn crate::Db, file: ProgramFile) -> Ast {
let contents: &str = file.contents(db);
let mut parser = Parser::new(contents);
let mut top_level_items = vec![];
while let Some(item) = parser.parse_top_level_item() {
top_level_items.push(item);
}
Ast::new(db, top_level_items) // <-- create an Ast!
}
```
### Tracked struct identity and getter methods
Just like an input struct, a tracked struct is just a newtyped integer, and you access its fields with a getter method (e.g., `ast.top_level_items(db)`).
### `#[id]` fields
When a tracked function is re-executed because its inputs have changed, the tracked structs it creates in the new execution are matched against those from the old execution, and the values of their fields are compared.
If the field values have not changed, then other tracked functions that only read those fields will not be re-executed.
Normally, tracked structs are matched up by the order in which they are created.
For example, the first `Ast` that is created by `parse_file` in the old execution will be matched against the first `Ast` created by `parse_file` in the new execution.
In our example, `parse_file` only ever creates a single `Ast`, so this works great.
Sometimes, however, it doesn't work so well.
For example, imagine that we had a tracked struct for items in the file:
```rust
#[salsa::tracked]
struct Item {
name: Word, // we'll define Word in a second!
...
}
```
Maybe our parser first creates an `Item` with the name `foo` and then later a second `Item` with the name `bar`.
Then the user changes the input to reorder the functions.
Although we are still creating the same number of items, we are now creating them in the reverse order, so the naive algorithm will match up the _old_ `foo` struct with the new `bar` struct.
This will look to salsa as though the `foo` function was renamed to `bar` and the `bar` function was renamed to `foo`.
We'll still get the right result, but we might do more recomputation than we needed to do if we understood that they were just reordered.
To address this, you can tag fields in a tracked struct as `#[id]`. These fields are then used to "match up" struct instances across executions:
```rust
#[salsa::tracked]
struct Item {
#[id]
name: Word, // we'll define Word in a second!
...
}
```
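The difference between matching by creation order and matching by an id field can be illustrated with plain Rust. This sketch only models the matching step; the `Item` struct with `String` fields and the two helper functions are hypothetical and do not reflect how salsa stores or compares tracked structs internally.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct Item {
    name: String, // plays the role of the `#[id]` field
    body: String,
}

/// Pairs old and new items by creation order and counts the pairs whose
/// fields are all equal.
fn unchanged_by_order(old: &[Item], new: &[Item]) -> usize {
    old.iter().zip(new.iter()).filter(|(o, n)| o == n).count()
}

/// Pairs old and new items by `name` and counts the pairs whose fields
/// are all equal.
fn unchanged_by_id(old: &[Item], new: &[Item]) -> usize {
    let old_by_name: HashMap<&str, &Item> =
        old.iter().map(|item| (item.name.as_str(), item)).collect();
    let mut unchanged = 0;
    for item in new {
        if old_by_name.get(item.name.as_str()) == Some(&item) {
            unchanged += 1;
        }
    }
    unchanged
}

fn main() {
    let foo = Item { name: "foo".into(), body: "{ }".into() };
    let bar = Item { name: "bar".into(), body: "{ 1 }".into() };
    let old = vec![foo.clone(), bar.clone()];
    let new = vec![bar, foo]; // same items, reordered

    // By order, every pair looks changed; by id, nothing changed.
    assert_eq!(unchanged_by_order(&old, &new), 0);
    assert_eq!(unchanged_by_id(&old, &new), 2);
}
```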
### Overriding tracked functions for particular structs
Sometimes it is useful to define a tracked function but _override_ its value for some particular struct.
For example, maybe the default way to compute the representation for a function is to read the AST, but you also have some built-in functions in your language and you want to hard-code their results.
Salsa supports this use case via "override" methods:
```rust
#[salsa::tracked]
fn representation(db: &dyn crate::Db, item: Item) -> Representation {
// read the user's input AST by default
let ast = ast(db, item);
// ...
}
fn create_builtin_item(db: &dyn crate::Db) -> Item {
let i = Item::new(db, ...);
let r = hardcoded_representation();
representation::override(db, i, r); // <-- override method!
i
}
```
Overriding can also be really useful for unit testing, since you can force the values that you want to be returned.
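One way to picture an override is as a value placed in the memo table before the function ever runs. The sketch below shows only that idea, under stated assumptions: items are plain `u32` ids, representations are `String`s, and `Database`, `override_representation`, and `representation` are hypothetical names, not salsa APIs.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Default)]
struct Database {
    /// Memoized or overridden results, keyed by a hypothetical item id.
    representation: RefCell<HashMap<u32, String>>,
}

/// Stores a hard-coded result so the default computation is skipped.
fn override_representation(db: &Database, item: u32, value: &str) {
    db.representation.borrow_mut().insert(item, value.to_string());
}

/// Returns the stored value if there is one, else computes the default.
fn representation(db: &Database, item: u32) -> String {
    let cached = db.representation.borrow().get(&item).cloned();
    if let Some(value) = cached {
        return value;
    }
    // Stand-in for "read the user's AST and compute a representation".
    let value = format!("computed from AST of item {}", item);
    db.representation.borrow_mut().insert(item, value.clone());
    value
}

fn main() {
    let db = Database::default();
    override_representation(&db, 1, "builtin");
    assert_eq!(representation(&db, 1), "builtin");
    assert_eq!(representation(&db, 2), "computed from AST of item 2");
}
```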
## Interned structs and enums
The final kind of salsa type is the _interned_ struct/enum.
Interned structs/enums are useful for quick equality comparison.
They are commonly used to represent strings or other primitive values.
Most compilers, for example, will define a type to represent a user identifier:
```rust
#[salsa::interned]
struct Word {
#[return_ref]
text: String
pub text: String,
}
```
Interning the same value twice gives the same integer, so in this code...
As with input and tracked structs, the `Word` struct itself is just a newtyped integer, and the actual data is stored in the database.
You can create a new interned struct using `new`, just like with input and tracked structs:
```rust
let w1 = Word::new(db, "foo".to_string());
let w2 = Word::new(db, "foo".to_string());
let w2 = Word::new(db, "bar".to_string());
let w3 = Word::new(db, "foo".to_string());
```
...we know that `w1 == w2`.
When you create two interned structs with the same field values, you are guaranteed to get back the same integer id. So here, `assert_eq!(w1, w3)` and `assert_ne!(w1, w2)` both hold.
You can access the fields of an interned struct using a getter, like `word.text(db)`. These getters respect the `#[return_ref]` annotation.
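The guarantee that equal values get equal ids is the classic interning technique, which can be sketched without salsa. The `Interner` type and its methods below are hypothetical and exist only for this example; in salsa the database plays this role.

```rust
use std::collections::HashMap;

/// A newtyped integer id, cheap to copy and compare.
#[derive(Copy, Clone, PartialEq, Eq, Hash, Debug)]
struct Word(u32);

#[derive(Default)]
struct Interner {
    /// Maps each distinct string to its id.
    ids: HashMap<String, Word>,
    /// Maps each id back to its string.
    texts: Vec<String>,
}

impl Interner {
    /// Returns the existing id for `text`, or allocates a fresh one.
    fn intern(&mut self, text: &str) -> Word {
        if let Some(&word) = self.ids.get(text) {
            return word;
        }
        let word = Word(self.texts.len() as u32);
        self.texts.push(text.to_string());
        self.ids.insert(text.to_string(), word);
        word
    }

    /// Getter: looks the data up again from the id.
    fn text(&self, word: Word) -> &str {
        &self.texts[word.0 as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let w1 = interner.intern("foo");
    let w2 = interner.intern("bar");
    let w3 = interner.intern("foo");
    assert_eq!(w1, w3); // same text, same id
    assert_ne!(w1, w2); // different text, different id
    assert_eq!(interner.text(w2), "bar");
}
```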
### The data struct and method
In addition to the newtype'd integer, the `#[salsa::interned]` macro creates a "data" struct that contains all the fields of the interned value.
For an interned struct `Word`, this struct is normally named `WordData`, but this can be overridden:
```rust
// Generated by `#[salsa::interned]`:
#[derive(Copy, Clone, ...)]
pub struct Word(salsa::Id);
pub struct WordData {
pub text: String,
}
```
You can access this data struct by invoking `word.data(db)`, which returns an `&WordData`.
This is particularly useful when interning enums, so that you can match on the result.
## Accumulators
@ -107,17 +296,17 @@ To create an accumulator, you declare a type as an _accumulator_:
pub struct Diagnostics(String);
```
It must be a newtype of something, like `String`. Now, during a memoized function's execution, you can push those values:
It must be a newtype of something, like `String`. Now, during a tracked function's execution, you can push those values:
```rust
Diagnostics::push(db, "some_string".to_string())
```
Then later, from outside the execution, you can ask for the set of diagnostics that were accumulated by some particular memoized function. For example, imagine that we have a type-checker and, during type-checking, it reports some diagnostics:
Then later, from outside the execution, you can ask for the set of diagnostics that were accumulated by some particular tracked function. For example, imagine that we have a type-checker and, during type-checking, it reports some diagnostics:
```rust
#[salsa::memoized]
fn type_check(db: &dyn Db, module: Module) {
#[salsa::tracked]
fn type_check(db: &dyn Db, item: Item) {
// ...
Diagnostics::push(db, "some error message".to_string())
// ...