en_GB contains the rule `COMPOUNDRULE #*0{`. Unfortunately any word
starting with a `#` was detected as a comment. The new logic instead
skips only lines that start with `#` and allows unparsed data at the
end of each line.
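Roughly, the distinction is the following (an illustrative sketch, not the actual parser code):

```rust
// Sketch: treat a line as a comment only when it starts with '#', instead of
// cutting every line at the first '#'. (Trailing unparsed data is simply
// allowed; that part is not shown here.)
fn is_comment(line: &str) -> bool {
    line.trim_start().starts_with('#')
}

fn main() {
    assert!(is_comment("# a real comment"));
    // Here '#' is a flag used by the compound rule, not a comment marker:
    assert!(!is_comment("COMPOUNDRULE #*0{"));
}
```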
This fixes a panic from a `fr` dictionary (see the test case). Indexing
into the string by the 'from' pattern's length is unsafe since that
length is effectively arbitrary and might or might not lie on a
character boundary:

byte index 4 is not a char boundary; it is inside 'é' (bytes 3..5) of `caféx`
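To illustrate the boundary problem with the word from that panic message (an illustration of the issue, not the actual fix):

```rust
// "caféx" is six bytes: 'é' occupies bytes 3..5, so byte index 4 is not a
// character boundary and `&word[..4]` would panic. `str::get` and
// `str::is_char_boundary` let us handle that case without panicking.
fn main() {
    let word = "caféx";
    let from_len = 4; // effectively arbitrary: the 'from' pattern's length

    assert!(!word.is_char_boundary(from_len));
    assert_eq!(word.get(..from_len), None); // `None` instead of a panic
}
```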
Hunspell does not reject Unicode when the flag type is not `UTF-8`.
For non `UTF-8` flag types, Hunspell operates in terms of bytes instead
of characters and doesn't check whether the values fit the definition of
the flag type. For example, the short flag type (the default) handles
the two-byte (in UTF-8) character 'À' by using its first byte and
ignoring the rest. This change translates Hunspell's parsing code
from its `HashMgr` type directly to ensure we follow the same rules.
Unicode flags that take more than 16 bits to represent are rejected by
Nuspell but are accepted by Hunspell. When parsing a single flag (in a
PFX rule, for example), Hunspell takes the higher of the two code
units, discarding the lower. When parsing a flag set (as in a .dic
line), Hunspell takes both code units.
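A rough illustration of the byte and code-unit behaviour described above (this is not Spellbook's or Hunspell's actual parsing code):

```rust
// Illustration only: with a non-UTF-8 flag type, Hunspell works in terms of
// bytes or UTF-16 code units rather than characters.
fn main() {
    // Short flag type (the default): only the first UTF-8 byte counts, so
    // 'À' (0xC3 0x80) becomes the flag 0xC3.
    assert_eq!("À".as_bytes()[0], 0xC3);

    // A character outside the BMP takes two UTF-16 code units. Per the
    // behaviour described above, a single flag keeps only the higher unit
    // while a flag set keeps both.
    let mut buf = [0u16; 2];
    '🂡'.encode_utf16(&mut buf);
    assert_eq!(buf, [0xD83C, 0xDCA1]);
}
```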
I haven't updated all parsing. In particular I don't think that compound
rules (using wildcards '*' and '?') would work accurately if used with
flags with high scalar values. It may be worthwhile to emit an error in
those cases instead of silently behaving unpredictably.
The latest Clippy has a lint that checks that calls respect the MSRV.
It's not perfect - it misses some cases, like trait items
(`Arc::default()`, for example). Spellbook is not changed often enough
for this to be a pain point, however.
Also, Spellbook is not meant to increase its MSRV without good reason
(unlike an application like Helix), so eventually it will lag further
behind on Rust versions than rust-analyzer supports. The MSRV CI check
will catch actual violations not caught by Clippy before a release.
This was intended to follow `hashbrown` v0.15's switch. There is also a
nice side effect of dropping proc-macro stuff that I didn't realize
ahash had introduced, making the `cargo tree` and `Cargo.lock`
delightfully small.
As previously remarked, the switch does not seem to change performance
either for the better or worse.
This can be useful for checking text that doesn't follow normal casing
conventions. For example, "Alice" should be correct in prose but
"alice" should not, while in source code, conventions about naming
variables tend to enforce lower casing.
Installing the latest stable toolchain can make CI fail when Clippy is
updated with new lints, for example. We should pin the toolchain
versions to the MSRV in the `test` and `lints` checks to avoid spurious
failures.
At the end of the day the old and new code are equivalent.
`String::insert` does slightly more work (calculating the UTF-8 length
and such) which is unnecessary since we know that `' '.len_utf8() == 1`,
but this string modification is done at most once per suggestion, so the
savings are not perceptible.
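For example (illustrative only; the surrounding suggestion code is not shown):

```rust
// `String::insert` handles any `char`, so it computes the char's UTF-8
// length before shifting bytes; for ' ' that length is always 1.
fn main() {
    let mut suggestion = String::from("alot");
    suggestion.insert(1, ' ');
    assert_eq!(suggestion, "a lot");
}
```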
Nuspell switches into UTF-32 for the ngram part of the suggester. This
makes plenty of the metrics easier to calculate since, for example,
`s[3]` is the third character in a UTF-32 string. (Not so with UTF-8:
indexing into a UTF-8 string is a byte index and not necessarily a
character boundary.)
UTF-32 in Rust is not very well supported compared to UTF-8. Instead of
a `String` you use a `Vec<char>` and instead of `&str` you have
`&[char]`. These alternatives do not have routines as well optimized as
UTF-8's, especially when it comes to the standard library's unstable
`Pattern` trait. The standard library will use platform `memcmp` and the
`memchr` crate for operations like `[u8]::eq` and `[u8]::contains`
respectively, which far outperform generic/dumb `[T]::eq` or
`[T]::starts_with`.
The most expensive part of ngram analysis seems to be the first step:
iterating over the whole word list and comparing the input word with the
basic `ngram_similarity` function. A flamegraph reveals that we spend
a huge amount of time in `contains_slice`, a generic but dumb
routine called by `ngram_similarity` in a loop. It's a function that
finds whether a slice contains a subslice and emulates Nuspell's use of
`std::basic_string_view<char32_t>::find`. This was ultimately a lot of
`[char]::starts_with` which is fairly slow relative to `str::contains`
against a `str` pattern.
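For context, that kind of generic routine looks roughly like this (a sketch, not necessarily the exact code that was replaced):

```rust
// A generic but "dumb" subslice search: check each window of `haystack`
// against `needle` with `starts_with`. For `&[char]` this cannot fall back on
// the memcmp/memchr fast paths that `str` and `[u8]` searches get.
fn contains_slice<T: PartialEq>(haystack: &[T], needle: &[T]) -> bool {
    if needle.is_empty() {
        return true;
    }
    if needle.len() > haystack.len() {
        return false;
    }
    (0..=haystack.len() - needle.len()).any(|i| haystack[i..].starts_with(needle))
}
```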
The `ngram` module was a bit special before this commit because it
eagerly switched most `str`s into UTF-32: `Vec<char>`s. That matched
Nuspell and, as mentioned above, made some calculations easier/dumber.
But the optimizations in the standard library for UTF-8 are undeniable.
This commit decreases the total time to `suggest` for a rough word like
"exmaple" by 25%.
"exmaple" is tough because it contains two 'e's. 'E' is super common in
English, so the `ngram_similarity` function ends up working relatively
harder for "exmaple". Same for other words with multiple common chars or
common stem substrings. The reason is that `ngram_similarity` has a fast
lane to break out of looping when it notices that a word is quite
dissimilar. It's a kind of "layered cake" - for a `left` and `right`
string, you first find any k=1 k-grams of `left` in `right`, and that's a
fancy way of saying you find any `char`s in `left` that are in `right`.
If there is more than one match you move on to k=2: find any substrings
of `right` that match any two-character window in `left`. So the
substrings you search for increase in size:
k=1: "e", "x", "m", "a", "p", "l", "e"
k=2: "ex", "xm", "ma", "ap", "pl", "le"
k=3: "exm", "xma", "map", "apl", "ple"
...
You may break out of the loop at a low `k` if your words are dissimilar.
For words with multiple common letters, though, the loop is unlikely to
break out early against your average stem in the dictionary.
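A condensed sketch of that layered search (the real `ngram_similarity` differs in its scoring and break-out details, and this version slices `left` by precomputed byte offsets rather than converting to UTF-32):

```rust
// For each k, count how many k-character windows of `left` occur in `right`,
// and stop as soon as a layer finds no matches.
fn ngram_similarity_sketch(left: &str, right: &str) -> usize {
    // Byte offsets of each character boundary in `left`, including the end.
    let mut bounds: Vec<usize> = left.char_indices().map(|(i, _)| i).collect();
    bounds.push(left.len());
    let n = bounds.len() - 1; // number of characters in `left`

    let mut score = 0;
    for k in 1..=n {
        let mut hits = 0;
        for start in 0..=n - k {
            // A k-character window of `left`, sliced on character boundaries.
            let window = &left[bounds[start]..bounds[start + k]];
            if right.contains(window) {
                hits += 1;
            }
        }
        if hits == 0 {
            // Dissimilar words break out here at a low k; "exmaple" versus a
            // stem containing 'e' or "le" usually does not.
            break;
        }
        score += hits;
    }
    score
}
```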
All of this is to say that checking whether `right` contains a given
subslice of `left` is central to this module and even more so in
degenerate cases. So I believe focusing on UTF-8 here is worth the extra
complexity of dealing with byte indices.
---
To make this possible, this module adds `CharsStr`, a wrapper struct
around `&str`:
```rust
struct CharsStr<'a, 'i> {
    inner: &'a str,
    char_indices: &'i [u16],
}
```
This eagerly computes the `str::char_indices` in `CharsStr::new`,
borrowing a `&'i mut Vec<u16>`'s allocation. So this is hopefully about
as expensive as converting each stem or expanded string to UTF-32 in a
reused `Vec<char>`, since we need to iterate on chars and allocate per
char anyway. We retain (and do not duplicate - as we
would by converting to UTF-32) the UTF-8 representation though, allowing
us to take advantage of the standard library's string searching
optimizations. Hopefully this is also thriftier with memory.
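For illustration, the constructor and a couple of hypothetical helpers could look like this (the actual implementation in the commit may differ; `char_len` and `char_index` are only examples of the kind of accessor this enables):

```rust
// A hedged sketch: reuse the caller's buffer to record each character's
// starting byte offset while keeping the UTF-8 data borrowed, not copied.
impl<'a, 'i> CharsStr<'a, 'i> {
    fn new(inner: &'a str, buffer: &'i mut Vec<u16>) -> Self {
        buffer.clear();
        buffer.extend(inner.char_indices().map(|(idx, _)| idx as u16));
        CharsStr {
            inner,
            char_indices: buffer.as_slice(),
        }
    }

    // Number of characters, analogous to a UTF-32 string's length.
    fn char_len(&self) -> usize {
        self.char_indices.len()
    }

    // Byte offset of the nth character, so slices of `inner` always land on
    // character boundaries.
    fn char_index(&self, nth: usize) -> usize {
        self.char_indices[nth] as usize
    }
}
```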
This was meant to be removed when enough of the library was implemented.
With suggestions fully implemented it can now be removed. The only
remaining `dead_code` to resolve is compound pattern replacements, which
are a TODO in the checker.
This is the last part of the `suggester`. Hunspell has a bespoke
string similarity measurement called "ngram similarity." Conceptually
it's like Jaro or Levenshtein similarity - a measurement for how close
two strings are.
The suggester resorts to ngram suggestions when it believes that the
simple string edits in `suggest_low` are not high quality. Ngram
suggestions are a pipeline:
* Iterate on all stems in the wordlist. Take the 100 most promising
according to a basic ngram similarity score.
* Expand all affixes for each stem and give each expanded form a score
based on another ngram-similarity-based metric. Take up to the top 200
most promising candidates.
* Determine a threshold to eliminate lower quality candidates.
* Return the remaining, most promising candidates.
It's notable that, because we iterate over the entire wordlist, ngram
suggestions are far, far slower than the basic edit-based suggestions.
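As a rough outline only (the function names and the stand-in scoring here are illustrative, the affix expansion step is omitted, and this is not Spellbook's actual API):

```rust
use std::cmp::Reverse;

// A deliberately simplified outline of the ngram suggestion pipeline above.
fn ngram_suggest(word: &str, wordlist: &[&str]) -> Vec<String> {
    // Stand-in for the basic ngram similarity score.
    fn rough_score(word: &str, stem: &str) -> usize {
        word.chars().filter(|&c| stem.contains(c)).count()
    }

    // 1. Score every stem in the wordlist; keep the 100 most promising.
    let mut stems: Vec<&str> = wordlist.to_vec();
    stems.sort_by_key(|stem| Reverse(rough_score(word, stem)));
    stems.truncate(100);

    // 2. Expand affixes for each stem (omitted here) and rescore the
    //    expanded forms, keeping up to the top 200 candidates.
    let mut candidates: Vec<(usize, &str)> = stems
        .into_iter()
        .map(|stem| (rough_score(word, stem), stem))
        .collect();
    candidates.sort_by_key(|&(score, _)| Reverse(score));
    candidates.truncate(200);

    // 3. Determine a threshold to drop lower quality candidates, then
    // 4. return the survivors, most promising first.
    let threshold = candidates.first().map_or(0, |&(score, _)| score / 2);
    candidates
        .into_iter()
        .filter(|&(score, _)| score >= threshold)
        .map(|(_, stem)| stem.to_string())
        .collect()
}
```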