Viktor Lofgren
6f7530e807
(refac) Clean up index code
2025-09-02 18:53:58 +02:00
Viktor Lofgren
87ce4a1b52
(refac) Clean up index code
2025-09-02 17:52:38 +02:00
Viktor Lofgren
52194cbe7a
(refac) Clean up index code
2025-09-02 17:44:42 +02:00
Viktor Lofgren
fd1ac03c78
(refac) Clean up index code
2025-09-02 17:30:19 +02:00
Viktor Lofgren
5e5b86efb4
(refac) Clean up index code
2025-09-02 17:24:30 +02:00
Viktor Lofgren
f332ec6191
(refac) Clean up index code
2025-09-02 13:13:10 +02:00
Viktor Lofgren
c25c1af437
(refac) Clean up index code
2025-09-02 13:04:05 +02:00
Viktor Lofgren
eb0c911b45
(refac) Clean up index code
2025-09-02 12:50:07 +02:00
Viktor Lofgren
1979870ce4
(refac) Merge index-forward, index-reverse, index/query into index
...
The project has too many submodules, and it's a bit of a headache to navigate.
2025-09-02 12:30:42 +02:00
Viktor Lofgren
0ba2ea38e1
(index) Move reverse index into a distinct package
2025-09-02 11:59:56 +02:00
Viktor Lofgren
d6cfbceeea
(index) Use a configurable hasher in the index
2025-09-01 13:44:28 +02:00
Viktor Lofgren
e369d200cc
(refac) Simplify index data model by merging SearchParameters, SearchTerms and ResultRankingContext into a new object called SearchContext
...
The previous design was difficult to reason about as similar data was stored in several places, and different functions wanted different nearly identical (but not fully identical) context objects.
This is in preparation for making the keyword hash function configurable, as we want to focus all the code that hashes keywords into one place.
2025-09-01 13:17:11 +02:00
Viktor Lofgren
946d64c8da
(index) Make hash algorithm selection configurable, writer-side
2025-09-01 12:03:01 +02:00
Viktor Lofgren
42f043a60f
(API) Add language parameter to the APIs
2025-09-01 09:33:39 +02:00
Viktor Lofgren
b46f2e1407
(sideload) Remove upper limit on XML entities
...
This unfucks the sideloading of stackexchange definitions.
This broke at some point when we merged the executor service into the index service.
2025-08-31 14:14:09 +02:00
Viktor Lofgren
18aa1b9764
(zim) Fix parsing of modern wikipedia zim files
...
The parser was relying on a deprecated format and
wikipedia has stopped generating zim files that
work with the old assumptions. The new approach should
hopefully work better.
2025-08-31 12:52:44 +02:00
Viktor Lofgren
2f3950e0d5
(language) Roll KeywordExtractor into LanguageDefinition
2025-08-29 10:55:48 +02:00
Viktor Lofgren
61d803869e
(language) Add support for languages with no POS-tagging
...
Clean up previous commit a bit.
2025-08-29 10:55:48 +02:00
Viktor Lofgren
df6434d177
(language) Add support for languages with no POS-tagging
...
This disables a lot of the smart keyword extraction,
which is mostly a crutch for helping English and similar
large languages to find relevant search results.
Smaller languages where a POS-tag model may not be available
are probably fine with this disabled, as the search engine can
likely just rawdog the entire results list.
2025-08-29 10:55:48 +02:00
Viktor Lofgren
59519ed7c4
(language) Adjust languages.xml
2025-08-29 10:55:47 +02:00
Viktor Lofgren
874fc2d250
(language) Remove debug logging junk
2025-08-29 10:55:47 +02:00
Viktor Lofgren
69e8ec0eef
(language) Fix subject keywords matcher with better rules and correct logic
2025-08-29 10:55:47 +02:00
Viktor Lofgren
a7eb5f54e6
(language) Clean up PosPattern, add tests
2025-08-29 10:55:47 +02:00
Viktor Lofgren
b29ba3e228
(language) Integrate new configurable POS patterns with keyword matchers
2025-08-29 10:55:47 +02:00
Viktor Lofgren
5fa5029c60
(language) Clean up UI
2025-08-29 10:55:47 +02:00
Viktor Lofgren
4257f60f00
(keywords) Fix logic error causing misidentification of some keywords
2025-08-29 10:55:47 +02:00
Viktor Lofgren
ce221d3a0e
(language) Integrate old keyword extraction logic with new test tool
2025-08-29 10:55:47 +02:00
Viktor Lofgren
f0741142a3
(refac) Move keyword extraction into language processing
2025-08-29 10:55:47 +02:00
Viktor Lofgren
0899e4d895
(language) First version of the language processing debug tool
2025-08-29 10:55:47 +02:00
Viktor Lofgren
bbf7c5a1cb
(language) Fix RDRPosTagger back to working order and integrate with SentenceExtractor
2025-08-29 10:55:47 +02:00
Viktor Lofgren
686a40e69b
(language) Update modelling
2025-08-29 10:55:47 +02:00
Viktor Lofgren
8af254f44f
(language) Parse PosPattern tags
2025-08-29 10:55:47 +02:00
Viktor Lofgren
2c21bd9287
(language) Add logging for unknown POS tags in PosPattern
2025-08-29 10:55:47 +02:00
Viktor Lofgren
f9645e2f00
(language) Enhance PosPattern to support wildcard variants in pattern matching
2025-08-29 10:55:47 +02:00
Viktor Lofgren
81e311b558
(language) POS-patterns WIP
2025-08-29 10:55:47 +02:00
Viktor Lofgren
507c09146a
(language) Add support for downloadable resources, parsing POS tag configuration tags
2025-08-29 10:55:47 +02:00
Viktor Lofgren
f682425594
(language) Basic test for LanguageConfiguration
2025-08-29 10:55:47 +02:00
Viktor Lofgren
de67006c4f
(language) Initial integration of new language configuration utility
2025-08-29 10:55:47 +02:00
Viktor Lofgren
eea32bb7b4
(language) Very basic language.xml loading off classpath
2025-08-29 10:55:47 +02:00
Viktor Lofgren
e976940a4e
(config) Move slf4j config files to common:config
2025-08-29 10:55:47 +02:00
Viktor Lofgren
b564b33028
(language) Initial embryo for language configuration
2025-08-29 10:55:47 +02:00
Viktor Lofgren
1cca16a58e
(language) Simplify language filters
2025-08-29 10:55:47 +02:00
Viktor Lofgren
70b4ed6d81
(ldb) Pipe language information into LDB database
2025-08-29 10:55:47 +02:00
Viktor Lofgren
45dc6412c1
(converter) Add language column to slop tables
2025-08-29 10:55:47 +02:00
Viktor Lofgren
b3b95edcb5
(converter) Bypass some of the grammar processing in the keyword extraction depending on language selection
2025-08-29 10:55:47 +02:00
Viktor Lofgren
338d300e1a
(converter) Clean up spans-handling
...
This code was unnecessarily difficult to follow with repeated packing and re-packing of the same data.
2025-08-29 10:55:47 +02:00
Viktor Lofgren
fa685bf1f4
(converter) Add Language field to ProcessedDocumentDetails
2025-08-29 10:55:47 +02:00
Viktor Lofgren
d79a3e2b2a
(converter) Tag documents by language in the index as a keyword
2025-08-29 10:55:47 +02:00
Viktor Lofgren
854382b2be
(language-filter) Experimentally permit Swedish results to pass through the language filter
2025-08-29 10:55:47 +02:00
Viktor Lofgren
8710adbc2a
(build) Reduce log noise during tests
2025-08-29 10:55:32 +02:00