321 Commits

Author SHA1 Message Date
Michael Davis
f5135a3b76 Prepare a v0.3.5 release v0.3.5 2025-09-12 10:10:16 -04:00
Michael Davis
5402ab95c9 deps: Loosen restrictions on hashbrown and foldhash 2025-09-03 09:11:05 -04:00
Michael Davis
ebf9375268 Add changelog notes for recent fixes 2025-08-31 13:53:19 -04:00
Gentle
e81244d629 change comment detection logic
en_GB contains the rule `COMPOUNDRULE #*0{`.
Unfortunately, any word starting with a '#' was detected as a comment.

The new logic skips all lines starting with '#' and then
allows unparsed data at the end of each line.
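
A minimal sketch of the new shape (hedged: `is_comment_line` is an
illustrative name, not Spellbook's actual parser):

```rust
// Only a line that *starts* with '#' is a comment; a '#' elsewhere
// (like the flag in `COMPOUNDRULE #*0{`) is left to the parser, which
// now tolerates trailing unparsed data on a line.
fn is_comment_line(line: &str) -> bool {
    line.starts_with('#')
}
```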
2025-08-31 13:47:16 -04:00
Gentle
e97dd88005 skip the BOM if there is one
Some of the public lexicons start with a UTF-8 BOM,
which confused the parser.
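
A minimal sketch of the fix, using only the standard library:

```rust
// Drop a leading UTF-8 BOM (U+FEFF) if present, otherwise return the
// input unchanged.
fn skip_bom(input: &str) -> &str {
    input.strip_prefix('\u{FEFF}').unwrap_or(input)
}
```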
2025-08-31 13:40:56 -04:00
Michael Davis
91780067f3 Prepare a v0.3.4 release v0.3.4 2025-04-30 10:11:04 -04:00
Michael Davis
6739b13aae checker: Add unit test for REP end pattern panic 2025-04-30 10:08:31 -04:00
Michael Davis
d917733ef9 checker: Prefer str::ends_with to indexing arbitrarily
This is the same as bbcc45c322 but for equivalent code in the checker.
2025-04-25 09:59:29 -04:00
Michael Davis
bc23955dc6 Prepare a v0.3.3 release v0.3.3 2025-04-21 20:58:22 -04:00
Michael Davis
bbcc45c322 suggester: Prefer str::ends_with to indexing arbitrarily
This fixes a panic from a `fr` dictionary (see the test case). Indexing
into the string by the 'from' pattern's length is unsafe since it is
effectively arbitrary and might or might not lie on a character
boundary.

    byte index 4 is not a char boundary; it is inside 'é' (bytes 3..5) of `caféx`
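
A hedged sketch of the class of fix (not the exact suggester code):

```rust
// Slicing at `word.len() - pat.len()` is an arbitrary byte offset and
// panics when it splits a multi-byte char, e.g. word = "caféx" with a
// two-byte pattern (byte index 4 falls inside 'é'). `str::ends_with`
// checks the same thing without indexing.
fn has_suffix(word: &str, pat: &str) -> bool {
    word.ends_with(pat)
}
```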
2025-04-21 20:46:18 -04:00
Michael Davis
d8e1aea1a0 Prepare a v0.3.2 release v0.3.2 2025-04-15 19:22:00 -04:00
Michael Davis
6d5495466e Align flag parsing with Hunspell's HashMgr
Hunspell does not reject Unicode when the flag type is not `UTF-8`.
For non `UTF-8` flag types, Hunspell operates in terms of bytes instead
of characters and doesn't check whether the values fit the definition of
the flag type. For example, the short flag type (the default) supports a
two-byte (in its UTF-8 representation) character like 'À' by using its
first byte and ignoring the rest. This change translates Hunspell's parsing code
from its `HashMgr` type directly to ensure we follow the same rules.
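
A hedged sketch of the single-flag case (illustrative names, not
Spellbook's API):

```rust
// With the default short flag type, a single flag is the first *byte*
// of the field, so 'À' (UTF-8: 0xC3 0x80) becomes flag 0xC3 and the
// trailing 0x80 is ignored.
type Flag = u16;

fn parse_single_short_flag(field: &str) -> Option<Flag> {
    field.bytes().next().map(Flag::from)
}
```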
2025-04-12 12:47:36 -04:00
Michael Davis
49603a2fbf Add an example to load dictionaries by path 2025-04-12 12:44:20 -04:00
Michael Davis
79ed7e8e76 Publicize MAX_WORD_LEN constant 2025-03-27 13:27:01 -04:00
Michael Davis
2b43f73e3b docs: Update internal docs on the representation of UTF-8 flags 2025-03-15 16:18:12 -04:00
Michael Davis
134e5d82a7 docs: Shout out to known dependents in the README 2025-03-15 15:50:00 -04:00
Michael Davis
3abf2ebe91 Prepare a v0.3.1 release v0.3.1 2025-03-11 17:26:17 -04:00
Michael Davis
a3b8e02b6f Follow Hunspell's way of parsing flags with large Unicode scalar values
Unicode flags that take more than 16 bits to represent are rejected by
Nuspell but are accepted by Hunspell. When parsing single flags (like in
a PFX rule for example), Hunspell takes the higher of the two code
units, discarding the lower. When parsing a flag set (like in a .dic
line), Hunspell takes both code units.

I haven't updated all parsing. In particular I don't think that compound
rules (using wildcards '*' and '?') would work accurately if used with
flags with high scalar values. It may be worthwhile to emit an error in
those cases instead of silently behaving unpredictably.
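
A hedged sketch of the flag-set case described above (illustrative
names, not Spellbook's API; the single-flag case keeps only one of the
two code units):

```rust
// With the `UTF-8` flag type, a scalar value above U+FFFF contributes
// both of its UTF-16 code units rather than being rejected.
type Flag = u16;

fn parse_utf8_flag_set(field: &str) -> Vec<Flag> {
    // '𝄞' (U+1D11E) encodes as the surrogate pair 0xD834, 0xDD1E and
    // therefore yields two flags.
    field.encode_utf16().collect()
}
```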
2025-03-04 10:15:41 -05:00
Michael Davis
0335a521d4 minor: Simplify conversion for ASCII flags in compound rules 2025-03-04 09:48:57 -05:00
Michael Davis
247bc4d1b0 Replace ahash with foldhash in benchmarks 2025-03-04 09:13:06 -05:00
Michael Davis
c24f0e4d26 Prepare a v0.3.0 release v0.3.0 2025-02-04 13:30:23 -05:00
Michael Davis
10f0feb5b8 Develop using latest stable Rust
The latest Clippy has a lint that checks that calls follow MSRV. It's
not perfect: it misses some cases, like trait items (`Arc::default()`,
for example). Spellbook is not changed often enough for that to be a
pain point, however.

Also, Spellbook is not meant to increase its MSRV without good reason
(unlike an application like Helix), so eventually it will run further
behind on Rust versions than rust-analyzer supports. The MSRV CI will
catch actual
violations not caught by Clippy before a release.
2025-02-04 13:25:07 -05:00
Michael Davis
1a61e13f30 Switch default hasher feature to use foldhash instead of ahash
This was intended to follow `hashbrown` v0.15's switch. There is also a
nice side effect of dropping proc-macro stuff that I didn't realize
ahash had introduced, making the `cargo tree` and `Cargo.lock`
delightfully small.

As previously remarked, the switch does not seem to change performance
either for the better or worse.
2025-02-04 13:04:33 -05:00
Michael Davis
b6c31ea767 Allow configuring the Checker to convert lowercase to other casings
This can be useful for checking text that doesn't follow normal casing
conventions. For example, in prose "Alice" should be correct but "alice"
should not, while in source code, naming conventions for variables tend
to enforce lower casing.
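
A sketch of what this might look like in use; the builder method name
below is assumed from this description rather than taken from the
actual API:

```rust
// Hypothetical usage; `check_lower_as_title` is an assumed method name.
fn check_source_code_word(dict: &spellbook::Dictionary, word: &str) -> bool {
    dict.checker()
        .check_lower_as_title(true) // accept "alice" for entry "Alice"
        .check(word)
}
```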
2025-01-28 09:01:57 -05:00
Michael Davis
6b3d6f6985 Expose the Checker type
This was a TODO comment before - we can expose the Checker type to
allow customizing its behavior like the Suggester does with ngram
suggestions.
2025-01-28 09:01:04 -05:00
Michael Davis
ad65f0a3f4 style: Prefer direct call to char::len_utf8 over a debug assertion 2025-01-28 09:00:25 -05:00
Michael Davis
4995a50184 Document use of unsafe 2025-01-27 18:56:37 -05:00
Michael Davis
dad67e8e3d Add a few more safety comments 2025-01-27 18:56:16 -05:00
Michael Davis
b8b92ee969 Accept Clippy 1.84 lints 2025-01-23 15:07:49 -05:00
Michael Davis
f034382c9a Accept Clippy 1.83 explicit lifetime elision lints
These lifetimes are not used in their `impl` blocks so they can be
safely elided.
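
An illustrative example of the kind of change (not a specific Spellbook
`impl`):

```rust
struct Wrapper<'a>(&'a str);

// Before: impl<'a> std::fmt::Display for Wrapper<'a> { ... }
// After: the lifetime is never used by name, so it can be elided.
impl std::fmt::Display for Wrapper<'_> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str(self.0)
    }
}
```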
2024-12-04 17:40:39 -05:00
Michael Davis
305f34f0ea CI: Use MSRV for tests & lints
Installing the latest stable toolchain can make CI fail when Clippy is
updated with new lints, for example. We should pin the toolchain version
to the MSRV in the `test` and `lints` checks to avoid spurious failures.
2024-12-03 09:48:36 -05:00
Michael Davis
c033d2e22a compare doc: Copy edits 2024-12-02 19:20:44 -05:00
Michael Davis
d6933d1f63 Prepare a v0.2.0 release v0.2.0 2024-11-18 18:46:17 -05:00
Michael Davis
1f8894c046 examples: Use Display for Duration in check example 2024-11-13 19:48:54 -05:00
Michael Davis
c35504f915 umbra_slice: Address clippy lint about transmute
Recent Clippy versions want the left- and right-hand sides written
explicitly.
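
An illustrative sketch of the shape of the fix (not the exact
`umbra_slice` code):

```rust
// Before (lints in recent Clippy): unsafe { std::mem::transmute(flag) }
fn flag_to_bytes(flag: u16) -> [u8; 2] {
    // Both the source and target types are now spelled out on the call.
    unsafe { std::mem::transmute::<u16, [u8; 2]>(flag) }
}
```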
2024-11-13 19:42:42 -05:00
Michael Davis
dfb9e07a21 suggester: Refactor out unnecessary unsafe string modification
At the end of the day the old and new code are equivalent.
`String::insert` does slightly more work (calculating the UTF-8 length
and such) which is unnecessary since we know that `' '.len_utf8() == 1`,
but this string modification is done at most once per suggestion, so the
savings are not perceptible.
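
Roughly, the safe replacement (a sketch, not the exact call site):

```rust
// `String::insert` shifts the tail and computes ' '.len_utf8() == 1
// itself; at most one call per suggestion makes the cost negligible.
fn insert_space(suggestion: &mut String, byte_idx: usize) {
    suggestion.insert(byte_idx, ' ');
}
```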
2024-11-13 19:38:43 -05:00
Michael Davis
10a5f9e1fb Update docs for FlagSet to mention UmbraSlice<Flag> 2024-11-13 19:38:13 -05:00
Michael Davis
0c5b568051 compare doc: Copy edits 2024-11-13 09:16:49 -05:00
Michael Davis
ca13e830a8 Add initial notes about performance comparison 2024-11-12 23:56:55 -05:00
Michael Davis
8b80d25405 suggester: Switch from UTF-32 to UTF-8 in ngram module
Nuspell switches into UTF-32 for the ngram part of the suggester. This
makes plenty of the metrics easier to calculate since, for example,
`s[3]` is the character at index 3 in a UTF-32 string. (Not so with UTF-8:
indexing into a UTF-8 string is a byte index and not necessarily a
character boundary.)

UTF-32 in Rust is not very well supported compared to UTF-8. Instead of
a `String` you use a `Vec<char>` and instead of `&str` you have
`&[char]`. These alternatives do not have routines as well optimized as
UTF-8's, especially when it comes to the standard library's unstable
`Pattern` trait. The standard library will use platform `memcmp` and the
`memchr` crate for operations like `[u8]::eq` and `[u8]::contains`
respectively, which far outperform generic/dumb `[T]::eq` or
`[T]::starts_with`.

The most expensive part of ngram analysis seems to be the first step:
iterating over the whole word list and comparing the input word with the
basic `ngram_similarity` function. A flamegraph reveals that we spend
a huge amount of time in `contains_slice`, a generic but dumb
routine called by `ngram_similarity` in a loop. It's a function that
finds whether a slice contains a subslice and emulates Nuspell's use of
`std::basic_string_view<char32_t>::find`. This was ultimately a lot of
`[char]::starts_with` which is fairly slow relative to `str::contains`
against a `str` pattern.

The `ngram` module was a bit special before this commit because it
eagerly switched most `str`s into UTF-32: `Vec<char>`s. That matched
Nuspell and, as mentioned above, made some calculations easier/dumber.
But the optimizations in the standard library for UTF-8 are undeniable.

This commit decreases the total time to `suggest` for a tough word like
"exmaple" by 25%.

"exmaple" is tough because it contains two 'e's. 'E' is super common in
English, so the `ngram_similarity` function ends up working relatively
harder for "exmaple". Same for other words with multiple common chars or
common stem substrings. The reason is that `ngram_similarity` has a fast
lane to break out of looping when it notices that a word is quite
dissimilar. It's a kind of "layered cake": for a `left` and `right`
string, you first find any k=1 k-grams of `left` in `right`, which is a
fancy way of saying you find any `char`s in `left` that are in `right`.
If there is more than one match you move on to k=2: find any substrings
of `right` that match any two-character window in `left`. So the
substrings you search for increase in size:

    k=1: "e", "x", "m", "a", "p", "l", "e"
    k=2: "ex", "xm", "ma", "ap", "pl", "le"
    k=3: "exm", "xma", "map", "apl", "ple"
    ...

You may break out of the loop at a low `k` if your words are dissimilar.
Words with multiple common letters, though, are unlikely to break out
early for your average stem in the dictionary.

All of this is to say that checking whether `right` contains a given
subslice of `left` is central to this module and even more so in
degenerate cases. So I believe focusing on UTF-8 here is worth the extra
complexity of dealing with byte indices.
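
A rough sketch of that layered search over UTF-8 (illustrative; the
real scoring and break-out condition differ in detail):

```rust
// Byte offsets of char starts let every k-char window slice cleanly,
// so `str::contains` gets the optimized substring search.
fn ngram_similarity(max_k: usize, left: &str, right: &str) -> usize {
    // Byte offsets of each char start, plus the end of the string.
    let mut starts: Vec<usize> = left.char_indices().map(|(i, _)| i).collect();
    starts.push(left.len());
    let mut score = 0;
    for k in 1..=max_k.min(starts.len() - 1) {
        let mut hits = 0;
        // Window i of k chars is left[starts[i]..starts[i + k]].
        for w in starts.windows(k + 1) {
            if right.contains(&left[w[0]..w[k]]) {
                hits += 1;
            }
        }
        if hits == 0 {
            break; // dissimilar words exit at a low k
        }
        score += hits;
    }
    score
}
```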

---

To make this possible, this module adds `CharsStr`, a wrapper struct
around `&str`:

```rust
struct CharsStr<'a, 'i> {
    inner: &'a str,
    char_indices: &'i [u16]
}
```

This eagerly computes the `str::char_indices` in `CharsStr::new`,
borrowing a `&'i mut Vec<u16>`'s allocation. So this is hopefully about
as expensive as converting each stem or expanded string to UTF-32 in a
reused `Vec<char>`, since we need to iterate over the chars anyway and
allocate per char. We retain (and do not duplicate, as we
would by converting to UTF-32) the UTF-8 representation though, allowing
us to take advantage of the standard library's string searching
optimizations. Hopefully this is also thriftier with memory.
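
A sketch of how those indices might be filled in (hedged; the real
`CharsStr::new` may differ):

```rust
// Record each char's starting byte offset once, reusing the caller's
// buffer across stems; u16 suffices because word length is bounded
// (see MAX_WORD_LEN).
fn char_starts<'i>(s: &str, buf: &'i mut Vec<u16>) -> &'i [u16] {
    buf.clear();
    buf.extend(s.char_indices().map(|(i, _)| i as u16));
    buf
}
```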
2024-11-12 22:20:19 -05:00
Michael Davis
b17897064d examples: Use Display impl for Duration 2024-11-12 21:15:28 -05:00
Michael Davis
5e4dfba1b8 aff: Prefer Box<str> to String
As mentioned in the internals guide, there were a few fields of structs
left that were `String`s only for simplicity.
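
For instance (an illustrative field, not the exact struct):

```rust
// Box<str> is a pointer plus a length; String also carries a capacity
// field that these never-mutated fields don't need.
struct Replacement {
    from: Box<str>,
    to: Box<str>,
}
```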
2024-11-11 17:40:34 -05:00
Michael Davis
808ae1f353 Remove crate-wide allow for dead_code
This was meant to be removed when enough of the library was implemented.
With suggestions fully implemented it can now be removed. The only
dead_code left to resolve is compound pattern replacements, which is
a TODO in the checker.
2024-11-11 17:32:55 -05:00
Michael Davis
6c2e06897c Add documentation about the release with suggestions 2024-11-11 17:25:56 -05:00
Michael Davis
3f0aa5cab0 Implement "ngram" suggestions
This is the last part of the `suggester`. Hunspell has a bespoke
string similarity measurement called "ngram similarity." Conceptually
it's like Jaro or Levenshtein similarity: a measurement of how close
two strings are.

The suggester resorts to ngram suggestions when it believes that the
simple string edits in `suggest_low` are not high quality. Ngram
suggestions are a pipeline:

* Iterate on all stems in the wordlist. Take the 100 most promising
  according to a basic ngram similarity score.
* Expand all affixes for each stem and give each expanded form a score
  based on another ngram-similarity-based metric. Take up to the top 200
  most promising candidates.
* Determine a threshold to eliminate lower quality candidates.
* Return the last remaining most promising candidates.

It's notable that, because we iterate over the entire wordlist, ngram
suggestions are far, far slower than the basic edit-based suggestions.
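
A compressed, self-contained sketch of the first step with a toy score
(illustrative names and a plain `&[&str]` standing in for the wordlist;
the real scoring is the ngram similarity described above):

```rust
// Step 1 in miniature: rank stems by a cheap similarity and keep the
// most promising. Steps 2-4 (affix expansion, a finer score, and the
// threshold) follow the same shape and are elided here.
fn most_promising_stems<'a>(wordlist: &[&'a str], word: &str) -> Vec<&'a str> {
    let mut scored: Vec<(&'a str, usize)> = wordlist
        .iter()
        .map(|stem| (*stem, similarity(word, stem)))
        .collect();
    scored.sort_by(|a, b| b.1.cmp(&a.1));
    scored.truncate(100);
    scored.into_iter().map(|(stem, _)| stem).collect()
}

// Toy stand-in: how many chars of `a` appear in `b` (k = 1 only).
fn similarity(a: &str, b: &str) -> usize {
    a.chars().filter(|&c| b.contains(c)).count()
}
```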
2024-11-11 17:25:37 -05:00
Michael Davis
8455774155 Add suggester legacy tests
Most (21 of 32) currently fail. This is maybe not super surprising as
ngram suggestions are not yet implemented. Nice that at least some pass
though.
2024-11-09 14:18:00 -05:00
Michael Davis
7a3bf3451e suggester: Fix subslice in try_rep_suggestion
The equivalent C++ was subslicing to a _length_ of `j - i`, and we want
to slice _to_ `j`.
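
In miniature:

```rust
// C++'s substr(i, n) takes a start and a *length*; Rust ranges take a
// start and an *end*, so substr(i, j - i) corresponds to &s[i..j].
fn subslice(s: &str, i: usize, j: usize) -> &str {
    &s[i..j]
}
```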
2024-11-09 14:06:52 -05:00
Michael Davis
d23a934446 suggester: Make index into hyphenated words absolute
This fixes suggest_breakdefault hanging. (Because it kept breaking the
word the same way over and over again.)
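
The shape of the fix, roughly (illustrative, not the exact suggester
code):

```rust
// Resume the search from an absolute position so each iteration makes
// progress; keeping only the relative offset re-breaks the word at the
// same place forever.
fn break_points(word: &str) -> Vec<usize> {
    let mut idx = 0;
    let mut points = Vec::new();
    while let Some(rel) = word[idx..].find('-') {
        idx += rel + 1; // absolute; using `rel` alone would not advance
        points.push(idx - 1);
    }
    points
}
```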
2024-11-09 14:01:12 -05:00
Michael Davis
e282b05c55 suggester: Avoid panic rotating titlecased suggestion to front 2024-11-09 13:54:51 -05:00
Michael Davis
99a63bb6ab suggester: Implement the camel/pascal branch in suggest_impl 2024-11-09 13:42:36 -05:00