This function only does something when CompoundResult has a replacement
field on it. In my searching I didn't find a dictionary that actually
uses it, so we might want to hold off on implementing the function
(which is very complex in the Nuspell codebase, with lots of `goto`)
until we find a language that uses it.
These have to live as long as the `word` and `&self` references. We can
elide the lifetimes because of that: the compiler will infer that the
reference needs to live as long as the intersection of the references
in the domain.
Ideally we would keep this at the MSRV but rust-analyzer no longer
works with anything older than this version. We had to jump up to 1.70
to get `black_box` to run the benchmarks anyway.
A few Turkic locales have special conversion rules for 'i' and 'I' which
we need to handle.
Nuspell covers this with ICU but we don't want to pull in ICU4X - it's
a very large dependency and adds weight even if we just pull in the
case mapping parts. Luckily the code to do this ourselves (like
Hunspell does too) is very small.
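As a rough illustration of how small that code can be, here is a minimal
sketch of Turkic-aware case conversion. The function names are made up
for this example and are not taken from the codebase. It only handles
the dotted/dotless 'i' pairs and defers to the standard library for
everything else; it ignores combining marks such as U+0307.

```rust
/// Lowercases `word` with the Turkic mapping for 'I' and 'İ'.
/// (Hypothetical helper, for illustration only.)
fn to_lower_turkic(word: &str) -> String {
    let mut out = String::with_capacity(word.len());
    for ch in word.chars() {
        match ch {
            // Capital I lowercases to dotless ı (U+0131).
            'I' => out.push('\u{0131}'),
            // Capital İ with dot above (U+0130) lowercases to plain i.
            '\u{0130}' => out.push('i'),
            // Everything else uses the default Unicode mapping.
            _ => out.extend(ch.to_lowercase()),
        }
    }
    out
}

/// Uppercases `word` with the Turkic mapping for 'i' and 'ı'.
/// (Hypothetical helper, for illustration only.)
fn to_upper_turkic(word: &str) -> String {
    let mut out = String::with_capacity(word.len());
    for ch in word.chars() {
        match ch {
            // Plain i uppercases to İ with dot above (U+0130).
            'i' => out.push('\u{0130}'),
            // Dotless ı (U+0131) uppercases to plain I.
            '\u{0131}' => out.push('I'),
            // Everything else uses the default Unicode mapping.
            _ => out.extend(ch.to_uppercase()),
        }
    }
    out
}
```

Callers would pick these over `str::to_lowercase`/`str::to_uppercase`
only when the dictionary's language is one of the Turkic locales.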
This fixes the check for compounds in en_US like "10th" and "202nd".
These are considered valid because of the COMPOUNDRULE rules. From
en_US.aff:
# compound rules:
# 1. [0-9]*1[0-9]th (10th, 11th, 12th, 56714th, etc.)
# 2. [0-9]*[02-9](1st|2nd|3rd|[4-9]th) (21st, 22nd, 123rd, 1234th, etc.)
COMPOUNDRULE 2
COMPOUNDRULE n*1t
COMPOUNDRULE n*mp
For example with dictionary entries "1/n1" and "0th/pt", we split up
the word "10th" into parts "1" and "0th" which we look up in the
dictionary. We then check those words' flagsets against the patterns.
"10th" would be represented as:
`&[&flagset!['n', '1'], &flagset!['p', 't']]`
That matches the pattern `n*1t`: zero or more 'n' flags and then a '1'
flag (used for "1" and other digits) and a 't' flag ("0th"). The 'n'
flag matches zero times for "10th" but allows other numbers in front.
For example this pattern would also match "110th".
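The matching described above can be sketched as a small backtracking
matcher. This is an illustration only, not the code in this commit:
`Quantifier`, `parse_rule` and `rule_matches` are hypothetical names,
flags are modeled as plain `char`s, and flagsets as `&[char]` slices
rather than the real flagset type.

```rust
#[derive(Clone, Copy)]
enum Quantifier {
    One,
    ZeroOrOne,
    ZeroOrMore,
}

/// Parses a rule such as "n*1t" into (flag, quantifier) pairs.
fn parse_rule(rule: &str) -> Vec<(char, Quantifier)> {
    let mut parsed: Vec<(char, Quantifier)> = Vec::new();
    for ch in rule.chars() {
        match ch {
            // '*' and '?' modify the flag that precedes them.
            '*' | '?' => {
                if let Some(last) = parsed.last_mut() {
                    last.1 = if ch == '*' {
                        Quantifier::ZeroOrMore
                    } else {
                        Quantifier::ZeroOrOne
                    };
                }
            }
            _ => parsed.push((ch, Quantifier::One)),
        }
    }
    parsed
}

/// Returns true when the flagsets of the word parts, in order, satisfy
/// the whole rule.
fn rule_matches(rule: &[(char, Quantifier)], parts: &[&[char]]) -> bool {
    match rule.split_first() {
        // An exhausted rule matches only if every part was consumed.
        None => parts.is_empty(),
        Some((&(flag, quantifier), rest)) => {
            // Does the next part's flagset contain this flag?
            let head = parts.first().map_or(false, |flags| flags.contains(&flag));
            match quantifier {
                Quantifier::One => head && rule_matches(rest, &parts[1..]),
                Quantifier::ZeroOrOne => {
                    rule_matches(rest, parts) || (head && rule_matches(rest, &parts[1..]))
                }
                // Either skip the flag or consume one part and retry it.
                Quantifier::ZeroOrMore => {
                    rule_matches(rest, parts) || (head && rule_matches(rule, &parts[1..]))
                }
            }
        }
    }
}
```

With the entries "1/n1" and "0th/pt", the parts of "10th" carry the
flagsets `['n', '1']` and `['p', 't']`, which this sketch accepts for
`n*1t`. It accepts "110th" as well, but rejects "0th" on its own.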
These kinds of compounds are fairly straightforward to check compared
to the other kind of compounding. This is the only compounding used by
en_US and a few other dictionaries.
This completes support for en_US for the checker - nothing else in the
aff affects the checker.
This approximately matches Nuspell's storage of the wordlist pointer.
To make the lifetimes play nice we return the key from the map as well
as the value. This value is equal to the `stem2`/`stem3`s in the
affixing functions but we need to return the data from the map to use
that lifetime rather than the Cows produced by stripping and adding with
affixes.
As another happy consequence of this, we can drop the `'aff` lifetime
on the `word: &str` parameters in the affix stripping functions. The
input word should have its own distinct lifetime; tying it to `'aff`
will cause problems for the compounding functions introduced in the
child commits.
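To illustrate the lifetime trick, here is a minimal sketch that uses a
plain `HashMap` in place of the real word list. `Dictionary`, `lookup`
and `strip_suffix_and_lookup` are hypothetical names, and stripping a
trailing "s" stands in for real affix handling.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the word list: stem -> flags.
struct Dictionary {
    words: HashMap<String, Vec<char>>,
}

impl Dictionary {
    /// `word` has its own anonymous lifetime. Both returned references
    /// borrow from the map, so they live as long as `&self` does.
    fn lookup<'dict>(&'dict self, word: &str) -> Option<(&'dict str, &'dict [char])> {
        self.words
            .get_key_value(word)
            .map(|(stem, flags)| (stem.as_str(), flags.as_slice()))
    }
}

fn strip_suffix_and_lookup<'dict>(dict: &'dict Dictionary, word: &str) -> Option<&'dict str> {
    // Owned temporary, standing in for the Cow produced by affix stripping.
    let stem: String = word.strip_suffix('s')?.to_owned();
    // The key comes from the map, so it can outlive the temporary `stem`.
    dict.lookup(&stem).map(|(key, _flags)| key)
}
```

Returning `&stem` here would not compile, because `stem` is dropped at
the end of the function. Returning the map's own key sidesteps that.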