321 Commits

Author SHA1 Message Date
Michael Davis
55b4d302fb HashBag: Remove unused get helper 2024-08-22 18:30:27 -04:00
Michael Davis
c92f1e33b3 aff parser: Improve coverage 2024-08-22 15:11:27 -04:00
Michael Davis
83b32fa298 Move shell definition into flake, use flake-compat 2024-08-22 13:53:39 -04:00
Michael Davis
66608bec2b Add a blank impl of check_compound_with_pattern_replacements
This function only does something when CompoundResult has a replacement
field on it. In my searching I didn't see a dictionary that actually
uses that so we might want to hold off on implementing that function
(which is really very complex in the Nuspell codebase, with lots of
`goto`) until we find a language that uses this.
2024-08-22 10:55:24 -04:00
Michael Davis
77d77d1fa9 checker: Fix lifetimes on CompoundResult returns
These have to live as long as the `word` and `&self` references. We can
elide the lifetimes because of that: the compiler will infer that the
reference needs to live as long as the intersection of the references
in the domain.
2024-08-22 10:55:18 -04:00
Michael Davis
68bfc779cf minor: Fix typo 2024-08-22 10:51:48 -04:00
Michael Davis
73af8e8965 Raise local version of Rust to 1.74.0
Ideally we would keep this at the MSRV but rust-analyzer now doesn't
work with anything older than this version. We had to jump up to 1.70
to get `black_box` to run the benchmarks anyway.
2024-08-22 09:54:49 -04:00
Michael Davis
d8f0ce0b81 Parse CompoundPatterns 2024-08-22 09:54:42 -04:00
Michael Davis
19ff354485 Update flake inputs 2024-08-22 09:47:02 -04:00
Michael Davis
ee2acd87f0 Use Box<str> for StringPair=>StrPair type 2024-08-22 09:25:52 -04:00
Michael Davis
c767cc7d99 Add a benchmark for a number word 2024-08-22 09:25:11 -04:00
Michael Davis
7e7678cb4f Perform case conversion based on locale
A few Turkic locales have special conversion rules for 'i' and 'I' which
we need to handle.

Nuspell covers this with ICU but we don't want to pull in ICU4X - it's
a very large dependency and adds weight even if we just pull in the
case mapping parts. Luckily the code to do this ourselves (like
Hunspell does too) is very small.
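
A minimal sketch of what locale-aware conversion without ICU could look like. The `Casing` variants and the `lower`/`upper` helper names are assumptions for illustration, not necessarily what this commit adds. Only the dotted/dotless 'i' pairs are special-cased; everything else falls through to the standard library.

```rust
/// Hypothetical casing mode, selected from the dictionary's locale.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Casing {
    /// Turkish, Azerbaijani and other Turkic locales.
    Turkic,
    /// Everything else.
    Other,
}

/// Lowercase `word`, mapping 'I' -> 'ı' and 'İ' -> 'i' for Turkic locales.
pub fn lower(word: &str, casing: Casing) -> String {
    match casing {
        Casing::Other => word.to_lowercase(),
        Casing::Turkic => {
            let mut out = String::with_capacity(word.len());
            for ch in word.chars() {
                match ch {
                    'I' => out.push('ı'),
                    'İ' => out.push('i'),
                    // Per-char conversion skips context rules such as the
                    // Greek final sigma; acceptable for a sketch.
                    _ => out.extend(ch.to_lowercase()),
                }
            }
            out
        }
    }
}

/// Uppercase `word`, mapping 'i' -> 'İ' and 'ı' -> 'I' for Turkic locales.
pub fn upper(word: &str, casing: Casing) -> String {
    match casing {
        Casing::Other => word.to_uppercase(),
        Casing::Turkic => {
            let mut out = String::with_capacity(word.len());
            for ch in word.chars() {
                match ch {
                    'i' => out.push('İ'),
                    'ı' => out.push('I'),
                    _ => out.extend(ch.to_uppercase()),
                }
            }
            out
        }
    }
}
```

With this sketch "ISIK" lowercases to "ısık" under `Casing::Turkic` but to "isik" under `Casing::Other`.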
2024-08-21 17:19:20 -04:00
Michael Davis
fe902cec15 Rename checker::Casing to Capitalization
To avoid confusion with the Casing enum we'll introduce in the child
commit.
2024-08-21 17:19:07 -04:00
Michael Davis
5097053fab checker: Refactor classic compounding to be done like compound rules 2024-08-21 16:27:59 -04:00
Michael Davis
02fef71635 Rename HashMultiMap type to HashBag 2024-08-21 15:33:11 -04:00
Michael Davis
53fadffd9c style: Use a Lazy static for test en_US dictionary 2024-08-21 14:16:56 -04:00
Michael Davis
c1407d65fc checker: Improve sharp casing unit test 2024-08-21 12:37:09 -04:00
Michael Davis
3ef6b252ad Add a few minor unit tests for coverage 2024-08-21 12:14:56 -04:00
Michael Davis
1b15077f3f minor: Clean up imports 2024-08-21 11:27:34 -04:00
Michael Davis
73d83fcf7c shell: Add llvm-tools-preview and cargo-llvm-cov 2024-08-21 10:32:32 -04:00
Michael Davis
7cc1f04aad style: Replace BreakTable's From impl on a Vec with new on a slice 2024-08-21 10:32:31 -04:00
Michael Davis
325b2d8ff3 Implement check for COMPOUNDRULE compounds
This fixes the check for compounds in en_US like "10th" and "202nd". These
are considered valid because of the COMPOUNDRULE rule. From en_US.aff:

    # compound rules:
    # 1. [0-9]*1[0-9]th (10th, 11th, 12th, 56714th, etc.)
    # 2. [0-9]*[02-9](1st|2nd|3rd|[4-9]th) (21st, 22nd, 123rd, 1234th, etc.)
    COMPOUNDRULE 2
    COMPOUNDRULE n*1t
    COMPOUNDRULE n*mp

For example with dictionary entries "1/n1" and "0th/pt", we split up
the word "10th" into parts "1" and "0th" which we look up in the
dictionary. We then check those words' flagsets against the patterns.
"10th" would be represented as:

    `&[&flagset!['n', '1'], &flagset!['p', 't']]`

That matches the pattern `n*1t`: zero or more 'n' flags and then a '1'
flag (used for "1" and other digits) and a 't' flag ("0th"). The 'n'
flag matches zero times for "10th" but allows other numbers in front.
For example this pattern would also match "110th".
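
The matching described above can be sketched as a small backtracking matcher. This is an illustration only: the `Element`, `parse_rule`, `flagset` and `rule_matches` names are hypothetical, flags are assumed to be single characters (the default flag type), and the parenthesized syntax for long or numeric flags is not handled.

```rust
use std::collections::HashSet;

type Flag = char;
type FlagSet = HashSet<Flag>;

/// One element of a COMPOUNDRULE pattern (hypothetical representation).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Element {
    /// Exactly one word carrying the flag.
    One(Flag),
    /// `flag*`: zero or more words carrying the flag.
    ZeroOrMore(Flag),
    /// `flag?`: zero or one word carrying the flag.
    ZeroOrOne(Flag),
}

/// Parse a rule such as `n*1t`, assuming single-character flags.
fn parse_rule(rule: &str) -> Vec<Element> {
    let mut elements = Vec::new();
    let mut chars = rule.chars().peekable();
    while let Some(flag) = chars.next() {
        let element = match chars.peek() {
            Some('*') => {
                chars.next();
                Element::ZeroOrMore(flag)
            }
            Some('?') => {
                chars.next();
                Element::ZeroOrOne(flag)
            }
            _ => Element::One(flag),
        };
        elements.push(element);
    }
    elements
}

/// Build a flagset from a string of single-character flags.
fn flagset(flags: &str) -> FlagSet {
    flags.chars().collect()
}

/// Does the sequence of per-word flagsets match the whole rule?
fn rule_matches(rule: &[Element], words: &[&FlagSet]) -> bool {
    let Some((element, rest)) = rule.split_first() else {
        // An exhausted rule only matches when no words are left over.
        return words.is_empty();
    };
    // Try to consume the first word with `flag`, then continue with `next`.
    let consume = |flag: &Flag, next: &[Element]| match words.split_first() {
        Some((word, remaining)) if word.contains(flag) => rule_matches(next, remaining),
        _ => false,
    };
    match element {
        Element::One(flag) => consume(flag, rest),
        Element::ZeroOrOne(flag) => rule_matches(rest, words) || consume(flag, rest),
        // Staying on `rule` after consuming a word lets `*` repeat.
        Element::ZeroOrMore(flag) => rule_matches(rest, words) || consume(flag, rule),
    }
}
```

Under these assumptions, the flagsets for "1" + "0th" and for "1" + "1" + "0th" both match `n*1t`, while neither matches `n*mp`.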

These kinds of compounds are fairly straightforward to check compared
to the other kind of compounding. This is the only compounding used by
en_US and a few other dictionaries.

This completes support for en_US for the checker - nothing else in the
aff affects the checker.
2024-08-21 10:29:41 -04:00
Michael Davis
0ae891d3e4 aff: Use boxed slices for CompoundRule types 2024-08-20 09:02:17 -04:00
Michael Davis
114ede1fa1 Fix some typos
Now that spellbook runs in Helix :D
2024-08-19 12:01:03 -04:00
Michael Davis
883792fcd0 minor: Use slice reference for parser list
With the reference we don't need to update the number of entries when
we add a parser. Tip taken from nucleo's `case_fold` module via Pascal.
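
A small sketch of the difference, using made-up parsers rather than the real aff parser table. With an array the length is part of the type and has to be bumped on every addition; with a slice reference it is not.

```rust
/// Hypothetical parser signature: consumes the rest of a line.
type Parser = fn(&str) -> usize;

fn parse_word_count(rest: &str) -> usize {
    rest.split_whitespace().count()
}

fn parse_len(rest: &str) -> usize {
    rest.len()
}

// Array form: `const PARSERS: [(&str, Parser); 2] = [...]` needs the `2`
// updated whenever an entry is added. The slice form below does not.
const PARSERS: &[(&str, Parser)] = &[
    ("WORDS", parse_word_count as Parser),
    ("LEN", parse_len as Parser),
];

/// Look up the parser registered for `key` and run it on `rest`.
fn dispatch(key: &str, rest: &str) -> Option<usize> {
    PARSERS
        .iter()
        .find(|(name, _)| *name == key)
        .map(|(_, parser)| parser(rest))
}
```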
2024-08-18 21:50:32 -04:00
Michael Davis
70c9af5631 Support ICONV and at least parse OCONV
(OCONV is used in the suggest API.)
2024-08-18 21:42:17 -04:00
Michael Davis
6e5ee0492d Add other casings to the benchmarks 2024-08-18 10:09:45 -04:00
Michael Davis
c38fa46853 Add special case for upper words with apostrophes 2024-08-18 09:56:45 -04:00
Michael Davis
4482c259b4 minor: Style improvements for spell_break 2024-08-18 09:56:45 -04:00
Michael Davis
6826683d5c minor: Update docs and comments 2024-08-18 09:24:52 -04:00
Michael Davis
7fabe38b00 minor: Derive default on enums instead of manual implementations 2024-08-18 09:17:34 -04:00
Michael Davis
fcdfe7d8ae Use boxed slices for BreakTable representation 2024-08-18 09:16:42 -04:00
Michael Davis
1cdaee8a3a minor: Resolve clippy lint about type bounds 2024-08-17 14:35:15 -04:00
Michael Davis
0ba85188ef minor: Improve panic docs 2024-08-17 14:27:50 -04:00
Michael Davis
8845e5df5d minor: Style and Debug derives 2024-08-17 14:27:29 -04:00
Michael Davis
2479466cfe Add special case for uppercase words with sharps in CHECKSHARPS 2024-08-17 14:27:11 -04:00
Michael Davis
2eceeda0be checker: Basic support for uppercase word checking 2024-08-17 13:21:16 -04:00
Michael Davis
7e4c1ac63d aff: Add COMPOUNDROOT 2024-08-17 12:11:45 -04:00
Michael Davis
f936ea1d98 Use Option<NonZeroU16>s for aff count/length options 2024-08-17 12:11:45 -04:00
Michael Davis
d2231b837c checker: Add a simple route for titlecase words
This is not fully correct because we don't do case conversion completely
accurately.
2024-08-17 12:11:45 -04:00
Michael Davis
5553b5b7eb checker: Fix nonsensical lifetimes 2024-08-17 12:09:24 -04:00
Michael Davis
303b92075b Add a very naive example for checking prose 2024-08-17 12:09:24 -04:00
Michael Davis
4a4ae680d6 Store the word list stem on CompoundingResult & AffixForm
This approximately matches Nuspell's storage of the wordlist pointer.
To make the lifetimes play nice we return the key from the map as well
as the value. This value is equal to the `stem2`/`stem3`s in the
affixing functions but we need to return the data from the map to use
that lifetime rather than the Cows produced by stripping and adding with
affixes.

As another happy consequence of this, we can drop the `'aff` lifetime
on the `word: &str` parameters in the affix stripping functions. The
input word should have a distinct lifetime and not having this lifetime
will cause problems for the compounding functions introduced in the
child commits.
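
A rough illustration of the lifetime trick, with a plain `HashMap` standing in for the word list. The `WordList`, `AffixForm` and `strip_ies` definitions here are simplified stand-ins, not the real types. The candidate stem is a temporary `Cow`, so the result borrows the equal key stored in the map instead.

```rust
use std::borrow::Cow;
use std::collections::HashMap;

/// Simplified stand-in for the dictionary's word list.
pub struct WordList {
    pub words: HashMap<Box<str>, Vec<char>>,
}

/// Simplified result type: both fields borrow from the dictionary.
pub struct AffixForm<'dict> {
    pub stem: &'dict str,
    pub flags: &'dict [char],
}

impl WordList {
    /// Return the stored key along with the value so that the stem has the
    /// dictionary's lifetime rather than the lifetime of `candidate`.
    pub fn lookup<'dict>(&'dict self, candidate: &str) -> Option<AffixForm<'dict>> {
        let (key, value) = self.words.get_key_value(candidate)?;
        let stem: &'dict str = key;
        let flags: &'dict [char] = value;
        Some(AffixForm { stem, flags })
    }
}

/// Toy suffix rule: strip "ies", add "y". Note that `word` carries no
/// `'dict` lifetime.
pub fn strip_ies<'dict>(dict: &'dict WordList, word: &str) -> Option<AffixForm<'dict>> {
    let base = word.strip_suffix("ies")?;
    // The candidate is owned and dropped at the end of this function.
    let candidate: Cow<str> = Cow::Owned(format!("{base}y"));
    dict.lookup(&candidate)
}
```

Returning `&candidate` directly would not compile, since the `Cow` does not outlive the function; the key from `get_key_value` is equal to it and lives as long as the dictionary.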
2024-08-12 12:01:52 -04:00
Michael Davis
ea520a7407 Port the word lookup function for compounds 2024-08-12 12:01:52 -04:00
Michael Davis
7551efe438 Port are_three_code_points_equal from Nuspell 2024-08-12 12:01:51 -04:00
Michael Davis
c27a8249a2 Use const generics for affixing mode
This mirrors Nuspell's use of C++ template parameters.
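
A sketch of the general shape, assuming the mode is encoded as a `u8` constant (stable Rust does not accept enums as const generic parameters). The mode names, the `Stem` type and the `stem_allowed` function are illustrative assumptions.

```rust
/// Affixing mode encoded as an integer so it can be a const generic.
pub type AffixingMode = u8;
pub const FULL_WORD: AffixingMode = 0;
pub const AT_COMPOUND_BEGIN: AffixingMode = 1;
pub const AT_COMPOUND_END: AffixingMode = 2;

/// Hypothetical dictionary entry.
pub struct Stem {
    /// Set when the stem may only appear inside a compound.
    pub only_in_compound: bool,
}

/// The mode is known at compile time, so each instantiation is
/// monomorphized and the branch on `MODE` can be resolved statically,
/// much like a C++ template parameter.
pub fn stem_allowed<const MODE: AffixingMode>(stem: &Stem) -> bool {
    if MODE == FULL_WORD {
        !stem.only_in_compound
    } else {
        true
    }
}
```

Callers pick the mode with a turbofish, for example `stem_allowed::<FULL_WORD>(&stem)`.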
2024-08-12 12:01:51 -04:00
Michael Davis
1376b8f913 Use str::char_indices instead of counting char bytes 2024-03-28 16:43:36 -04:00
Michael Davis
b5201915f8 Add basic shells for intro compounding functions 2024-03-28 16:43:28 -04:00
Michael Davis
bf0abd026a checker: Check COMPLEXPREFIXES prefixing rules 2024-03-25 18:35:32 -04:00
Michael Davis
5d3e8c8a3f checker: Strip a suffix, then prefix, then suffix 2024-03-24 15:44:48 -04:00