en_GB contains the rule `COMPOUNDRULE #*0{`. Unfortunately any word
starting with a `#` was detected as a comment. The new logic instead
skips only lines that start with `#` and allows unparsed data at the
end of each line.
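Roughly, the distinction is the following (an illustrative sketch, not the actual parser code):

```rust
// Sketch: treat a line as a comment only when it starts with '#', instead of
// cutting every line at the first '#'. (Trailing unparsed data is simply
// allowed; that part is not shown here.)
fn is_comment(line: &str) -> bool {
    line.trim_start().starts_with('#')
}

fn main() {
    assert!(is_comment("# a real comment"));
    // Here '#' is a flag used by the compound rule, not a comment marker:
    assert!(!is_comment("COMPOUNDRULE #*0{"));
}
```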
This fixes a panic from a `fr` dictionary (see the test case). Indexing
into the string by the 'from' pattern's length is unsafe since that
length is effectively arbitrary and might or might not lie on a
character boundary:

byte index 4 is not a char boundary; it is inside 'é' (bytes 3..5) of `caféx`
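To illustrate the boundary problem with the word from that panic message (an illustration of the issue, not the actual fix):

```rust
// "caféx" is six bytes: 'é' occupies bytes 3..5, so byte index 4 is not a
// character boundary and `&word[..4]` would panic. `str::get` and
// `str::is_char_boundary` let us handle that case without panicking.
fn main() {
    let word = "caféx";
    let from_len = 4; // effectively arbitrary: the 'from' pattern's length

    assert!(!word.is_char_boundary(from_len));
    assert_eq!(word.get(..from_len), None); // `None` instead of a panic
}
```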
Hunspell does not reject Unicode when the flag type is not `UTF-8`.
For non `UTF-8` flag types, Hunspell operates in terms of bytes instead
of characters and doesn't check whether the values fit the definition of
the flag type. For example, the short flag type (the default) handles
the two-byte (in UTF-8) character 'À' by using its first byte and
ignoring the rest. This change translates Hunspell's parsing code
from its `HashMgr` type directly to ensure we follow the same rules.
Unicode flags that take more than 16 bits to represent are rejected by
Nuspell but are accepted by Hunspell. When parsing a single flag (in a
PFX rule, for example), Hunspell takes the higher of the two code
units, discarding the lower. When parsing a flag set (as in a .dic
line), Hunspell takes both code units.
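A rough illustration of the byte and code-unit behaviour described above (this is not Spellbook's or Hunspell's actual parsing code):

```rust
// Illustration only: with a non-UTF-8 flag type, Hunspell works in terms of
// bytes or UTF-16 code units rather than characters.
fn main() {
    // Short flag type (the default): only the first UTF-8 byte counts, so
    // 'À' (0xC3 0x80) becomes the flag 0xC3.
    assert_eq!("À".as_bytes()[0], 0xC3);

    // A character outside the BMP takes two UTF-16 code units. Per the
    // behaviour described above, a single flag keeps only the higher unit
    // while a flag set keeps both.
    let mut buf = [0u16; 2];
    '🂡'.encode_utf16(&mut buf);
    assert_eq!(buf, [0xD83C, 0xDCA1]);
}
```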
I haven't updated all parsing. In particular I don't think that compound
rules (using wildcards '*' and '?') would work accurately if used with
flags with high scalar values. It may be worthwhile to emit an error in
those cases instead of silently behaving unpredictably.
The latest Clippy has a lint that checks that calls respect the MSRV.
It's not perfect - it misses some cases, like trait items
(`Arc::default()`, for example). Spellbook is not changed often enough
for this to be a pain point, however.
Also, Spellbook is not meant to increase its MSRV without good reason
(unlike an application like Helix), so eventually it will lag further
behind on Rust versions than rust-analyzer supports. The MSRV CI check
will catch actual violations not caught by Clippy before a release.
This was intended to follow `hashbrown` v0.15's switch. There is also a
nice side effect of dropping proc-macro stuff that I didn't realize
ahash had introduced, making the `cargo tree` and `Cargo.lock`
delightfully small.
As previously remarked, the switch does not seem to change performance
either for the better or worse.
This can be useful for checking text that doesn't follow normal casing
conventions. For example, "Alice" should be correct in prose but
"alice" should not, while in source code, conventions about naming
variables tend to enforce lower casing.
Installing the latest stable toolchain can make CI fail when Clippy is
updated with new lints, for example. We should pin the toolchain
versions to the MSRV in the `test` and `lints` checks to avoid spurious
failures.
At the end of the day the old and new code are equivalent.
`String::insert` does slightly more work (calculating the UTF-8 length
and such) which is unnecessary since we know that `' '.len_utf8() == 1`,
but this string modification is done at most once per suggestion, so the
savings are not perceptible.
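For example (illustrative only; the surrounding suggestion code is not shown):

```rust
// `String::insert` handles any `char`, so it computes the char's UTF-8
// length before shifting bytes; for ' ' that length is always 1.
fn main() {
    let mut suggestion = String::from("alot");
    suggestion.insert(1, ' ');
    assert_eq!(suggestion, "a lot");
}
```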
Nuspell switches into UTF-32 for the ngram part of the suggester. This
makes plenty of the metrics easier to calculate since, for example,
`s[3]` is the third character in a UTF-32 string. (Not so with UTF-8:
indexing into a UTF-8 string is a byte index and not necessarily a
character boundary.)
UTF-32 in Rust is not very well supported compared to UTF-8. Instead of
a `String` you use a `Vec<char>` and instead of `&str` you have
`&[char]`. These alternatives do not have routines as well optimized as
UTF-8's, especially when it comes to the standard library's unstable
`Pattern` trait. The standard library will use platform `memcmp` and the
`memchr` crate for operations like `[u8]::eq` and `[u8]::contains`
respectively, which far outperform generic/dumb `[T]::eq` or
`[T]::starts_with`.
The most expensive part of ngram analysis seems to be the first step:
iterating over the whole word list and comparing the input word with the
basic `ngram_similarity` function. A flamegraph reveals that we spend
a huge amount of time in `contains_slice`, a generic but dumb
routine called by `ngram_similarity` in a loop. It's a function that
finds whether a slice contains a subslice and emulates Nuspell's use of
`std::basic_string_view<char32_t>::find`. This was ultimately a lot of
`[char]::starts_with` which is fairly slow relative to `str::contains`
against a `str` pattern.
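For context, that kind of generic routine looks roughly like this (a sketch, not necessarily the exact code that was replaced):

```rust
// A generic but "dumb" subslice search: check each window of `haystack`
// against `needle` with `starts_with`. For `&[char]` this cannot fall back on
// the memcmp/memchr fast paths that `str` and `[u8]` searches get.
fn contains_slice<T: PartialEq>(haystack: &[T], needle: &[T]) -> bool {
    if needle.is_empty() {
        return true;
    }
    if needle.len() > haystack.len() {
        return false;
    }
    (0..=haystack.len() - needle.len()).any(|i| haystack[i..].starts_with(needle))
}
```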
The `ngram` module was a bit special before this commit because it
eagerly switched most `str`s into UTF-32: `Vec<char>`s. That matched
Nuspell and, as mentioned above, made some calculations easier/dumber.
But the optimizations in the standard library for UTF-8 are undeniable.
This commit decreases the total time to `suggest` for a rough word like
"exmaple" by 25%.
"exmaple" is tough because it contains two 'e's. 'E' is super common in
English, so the `ngram_similarity` function ends up working relatively
harder for "exmaple". Same for other words with multiple common chars or
common stem substrings. The reason is that `ngram_similarity` has a fast
lane to break out of looping when it notices that a word is quite
dissimilar. It's a kind of "layered cake" - for a `left` and `right`
string, you first find any k=1 k-grams of `left` in `right`, and that's a
fancy way of saying you find any `char`s in `left` that are in `right`.
If there is more than one match you move on to k=2: find any substrings
of `right` that match any two-character window in `left`. So the
substrings you search for increase in size:
k=1: "e", "x", "m", "a", "p", "l", "e"
k=2: "ex", "xm", "ma", "ap", "pl", "le"
k=3: "exm", "xma", "map", "apl", "ple"
...
You may break out of the loop at a low `k` if your words are dissimilar.
For words with multiple common letters, though, the loop is unlikely to
break out early against your average stem in the dictionary.
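A condensed sketch of that layered search (the real `ngram_similarity` differs in its scoring and break-out details, and this version slices `left` by precomputed byte offsets rather than converting to UTF-32):

```rust
// For each k, count how many k-character windows of `left` occur in `right`,
// and stop as soon as a layer finds no matches.
fn ngram_similarity_sketch(left: &str, right: &str) -> usize {
    // Byte offsets of each character boundary in `left`, including the end.
    let mut bounds: Vec<usize> = left.char_indices().map(|(i, _)| i).collect();
    bounds.push(left.len());
    let n = bounds.len() - 1; // number of characters in `left`

    let mut score = 0;
    for k in 1..=n {
        let mut hits = 0;
        for start in 0..=n - k {
            // A k-character window of `left`, sliced on character boundaries.
            let window = &left[bounds[start]..bounds[start + k]];
            if right.contains(window) {
                hits += 1;
            }
        }
        if hits == 0 {
            // Dissimilar words break out here at a low k; "exmaple" versus a
            // stem containing 'e' or "le" usually does not.
            break;
        }
        score += hits;
    }
    score
}
```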
All of this is to say that checking whether `right` contains a given
subslice of `left` is central to this module and even more so in
degenerate cases. So I believe focusing on UTF-8 here is worth the extra
complexity of dealing with byte indices.
---
To make this possible, this module adds `CharsStr`, a wrapper struct
around `&str`:
```rust
struct CharsStr<'a, 'i> {
    inner: &'a str,
    char_indices: &'i [u16],
}
```
This eagerly computes the `str::char_indices` in `CharsStr::new`,
borrowing a `&'i mut Vec<u16>`'s allocation. So this is hopefully about
as expensive as converting each stem or expanded string to UTF-32 in a
reused `Vec<char>`, since we need to iterate on chars and allocate per
char anyway. We retain (and do not duplicate - as we
would by converting to UTF-32) the UTF-8 representation though, allowing
us to take advantage of the standard library's string searching
optimizations. Hopefully this is also thriftier with memory.
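For illustration, the constructor and a couple of hypothetical helpers could look like this (the actual implementation in the commit may differ; `char_len` and `char_index` are only examples of the kind of accessor this enables):

```rust
// A hedged sketch: reuse the caller's buffer to record each character's
// starting byte offset while keeping the UTF-8 data borrowed, not copied.
impl<'a, 'i> CharsStr<'a, 'i> {
    fn new(inner: &'a str, buffer: &'i mut Vec<u16>) -> Self {
        buffer.clear();
        buffer.extend(inner.char_indices().map(|(idx, _)| idx as u16));
        CharsStr {
            inner,
            char_indices: buffer.as_slice(),
        }
    }

    // Number of characters, analogous to a UTF-32 string's length.
    fn char_len(&self) -> usize {
        self.char_indices.len()
    }

    // Byte offset of the nth character, so slices of `inner` always land on
    // character boundaries.
    fn char_index(&self, nth: usize) -> usize {
        self.char_indices[nth] as usize
    }
}
```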
This was meant to be removed when enough of the library was implemented.
With suggestions fully implemented it can now be removed. The only
remaining `dead_code` to resolve is compound pattern replacements, which
are a TODO in the checker.
This is the last part of the `suggester`. Hunspell has a bespoke
string similarity measurement called "ngram similarity." Conceptually
it's like Jaro or Levenshtein similarity - a measurement for how close
two strings are.
The suggester resorts to ngram suggestions when it believes that the
simple string edits in `suggest_low` are not high quality. Ngram
suggestions are a pipeline:
* Iterate on all stems in the wordlist. Take the 100 most promising
according to a basic ngram similarity score.
* Expand all affixes for each stem and give each expanded form a score
based on another ngram-similarity-based metric. Take up to the top 200
most promising candidates.
* Determine a threshold to eliminate lower quality candidates.
* Return the remaining, most promising candidates.
It's notable that, because we iterate over the entire wordlist, ngram
suggestions are far, far slower than the basic edit-based suggestions.
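As a rough outline only (the function names and the stand-in scoring here are illustrative, the affix expansion step is omitted, and this is not Spellbook's actual API):

```rust
use std::cmp::Reverse;

// A deliberately simplified outline of the ngram suggestion pipeline above.
fn ngram_suggest(word: &str, wordlist: &[&str]) -> Vec<String> {
    // Stand-in for the basic ngram similarity score.
    fn rough_score(word: &str, stem: &str) -> usize {
        word.chars().filter(|&c| stem.contains(c)).count()
    }

    // 1. Score every stem in the wordlist; keep the 100 most promising.
    let mut stems: Vec<&str> = wordlist.to_vec();
    stems.sort_by_key(|stem| Reverse(rough_score(word, stem)));
    stems.truncate(100);

    // 2. Expand affixes for each stem (omitted here) and rescore the
    //    expanded forms, keeping up to the top 200 candidates.
    let mut candidates: Vec<(usize, &str)> = stems
        .into_iter()
        .map(|stem| (rough_score(word, stem), stem))
        .collect();
    candidates.sort_by_key(|&(score, _)| Reverse(score));
    candidates.truncate(200);

    // 3. Determine a threshold to drop lower quality candidates, then
    // 4. return the survivors, most promising first.
    let threshold = candidates.first().map_or(0, |&(score, _)| score / 2);
    candidates
        .into_iter()
        .filter(|&(score, _)| score >= threshold)
        .map(|(_, stem)| stem.to_string())
        .collect()
}
```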