mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-05 21:22:39 +02:00

3505 Commits

Author SHA1 Message Date
Viktor Lofgren
6f7530e807 (refac) Clean up index code 2025-09-02 18:53:58 +02:00
Viktor Lofgren
87ce4a1b52 (refac) Clean up index code 2025-09-02 17:52:38 +02:00
Viktor Lofgren
52194cbe7a (refac) Clean up index code 2025-09-02 17:44:42 +02:00
Viktor Lofgren
fd1ac03c78 (refac) Clean up index code 2025-09-02 17:30:19 +02:00
Viktor Lofgren
5e5b86efb4 (refac) Clean up index code 2025-09-02 17:24:30 +02:00
Viktor Lofgren
f332ec6191 (refac) Clean up index code 2025-09-02 13:13:10 +02:00
Viktor Lofgren
c25c1af437 (refac) Clean up index code 2025-09-02 13:04:05 +02:00
Viktor Lofgren
eb0c911b45 (refac) Clean up index code 2025-09-02 12:50:07 +02:00
Viktor Lofgren
1979870ce4 (refac) Merge index-forward, index-reverse, index/query into index
The project has too many submodules, and it's a bit of a headache to navigate.
2025-09-02 12:30:42 +02:00
Viktor Lofgren
0ba2ea38e1 (index) Move reverse index into a distinct package 2025-09-02 11:59:56 +02:00
Viktor Lofgren
d6cfbceeea (index) Use a configurable hasher in the index 2025-09-01 13:44:28 +02:00
Viktor Lofgren
e369d200cc (refac) Simplify index data model by merging SearchParameters, SearchTerms and ResultRankingContext into a new object called SearchContext
The previous design was difficult to reason about as similar data was stored in several places, and different functions wanted different nearly identical (but not fully identical) context objects.

This is in preparation for making the keyword hash function configurable, as we want to focus all the code that hashes keywords into one place.
2025-09-01 13:17:11 +02:00
Viktor Lofgren
946d64c8da (index) Make hash algorithm selection configurable, writer-side 2025-09-01 12:03:01 +02:00
Viktor Lofgren
42f043a60f (API) Add language parameter to the APIs 2025-09-01 09:33:39 +02:00
Viktor Lofgren
b46f2e1407 (sideload) Remove upper limit on XML entities
This unfucks the sideloading of stackexchange definitions.

This broke some time ago, when we merged the executor service into the index service.
2025-08-31 14:14:09 +02:00
Viktor Lofgren
18aa1b9764 (zim) Fix parsing of modern wikipedia zim files
The parser was relying on a deprecated format and
wikipedia has stopped generating zim files that
work with the old assumptions. The new approach should
hopefully work better.
2025-08-31 12:52:44 +02:00
Viktor Lofgren
2f3950e0d5 (language) Roll KeywordExtractor into LanguageDefinition 2025-08-29 10:55:48 +02:00
Viktor Lofgren
61d803869e (language) Add support for languages with no POS-tagging
Clean up previous commit a bit.
2025-08-29 10:55:48 +02:00
Viktor Lofgren
df6434d177 (language) Add support for languages with no POS-tagging
This disables a lot of the smart keyword extraction,
which is mostly a crutch for helping English and similar
large languages to find relevant search results.

Smaller languages, where a POS-tag model may not be available,
are probably fine with this disabled, as the search engine can
likely just rawdog the entire results list.
2025-08-29 10:55:48 +02:00
Viktor Lofgren
59519ed7c4 (language) Adjust languages.xml 2025-08-29 10:55:47 +02:00
Viktor Lofgren
874fc2d250 (language) Remove debug logging junk 2025-08-29 10:55:47 +02:00
Viktor Lofgren
69e8ec0eef (language) Fix subject keywords matcher with better rules and correct logic 2025-08-29 10:55:47 +02:00
Viktor Lofgren
a7eb5f54e6 (language) Clean up PosPattern, add tests 2025-08-29 10:55:47 +02:00
Viktor Lofgren
b29ba3e228 (language) Integrate new configurable POS patterns with keyword matchers 2025-08-29 10:55:47 +02:00
Viktor Lofgren
5fa5029c60 (language) Clean up UI 2025-08-29 10:55:47 +02:00
Viktor Lofgren
4257f60f00 (keywords) Fix logic error causing misidentification of some keywords 2025-08-29 10:55:47 +02:00
Viktor Lofgren
ce221d3a0e (language) Integrate old keyword extraction logic with new test tool 2025-08-29 10:55:47 +02:00
Viktor Lofgren
f0741142a3 (refac) Move keyword extraction into language processing 2025-08-29 10:55:47 +02:00
Viktor Lofgren
0899e4d895 (language) First version of the language processing debug tool 2025-08-29 10:55:47 +02:00
Viktor Lofgren
bbf7c5a1cb (language) Fix RDRPosTagger back to working order and integrate with SentenceExtractor 2025-08-29 10:55:47 +02:00
Viktor Lofgren
686a40e69b (language) Update modelling 2025-08-29 10:55:47 +02:00
Viktor Lofgren
8af254f44f (language) Parse PosPattern tags 2025-08-29 10:55:47 +02:00
Viktor Lofgren
2c21bd9287 (language) Add logging for unknown POS tags in PosPattern 2025-08-29 10:55:47 +02:00
Viktor Lofgren
f9645e2f00 (language) Enhance PosPattern to support wildcard variants in pattern matching 2025-08-29 10:55:47 +02:00
Viktor Lofgren
81e311b558 (language) POS-patterns WIP 2025-08-29 10:55:47 +02:00
Viktor Lofgren
507c09146a (language) Add support for downloadable resources, parsing POS tag configuration tags 2025-08-29 10:55:47 +02:00
Viktor Lofgren
f682425594 (language) Basic test for LanguageConfiguration 2025-08-29 10:55:47 +02:00
Viktor Lofgren
de67006c4f (language) Initial integration of new language configuration utility 2025-08-29 10:55:47 +02:00
Viktor Lofgren
eea32bb7b4 (language) Very basic language.xml loading off classpath 2025-08-29 10:55:47 +02:00
Viktor Lofgren
e976940a4e (config) Move slf4j config files to common:config 2025-08-29 10:55:47 +02:00
Viktor Lofgren
b564b33028 (language) Initial embryo for language configuration 2025-08-29 10:55:47 +02:00
Viktor Lofgren
1cca16a58e (language) Simplify language filters 2025-08-29 10:55:47 +02:00
Viktor Lofgren
70b4ed6d81 (ldb) Pipe language information into LDB database 2025-08-29 10:55:47 +02:00
Viktor Lofgren
45dc6412c1 (converter) Add language column to slop tables 2025-08-29 10:55:47 +02:00
Viktor Lofgren
b3b95edcb5 (converter) Bypass some of the grammar processing in the keyword extraction depending on language selection 2025-08-29 10:55:47 +02:00
Viktor Lofgren
338d300e1a (converter) Clean up spans-handling
This code was unnecessarily difficult to follow with repeated packing and re-packing of the same data.
2025-08-29 10:55:47 +02:00
Viktor Lofgren
fa685bf1f4 (converter) Add Language field to ProcessedDocumentDetails 2025-08-29 10:55:47 +02:00
Viktor Lofgren
d79a3e2b2a (converter) Tag documents by language in the index as a keyword 2025-08-29 10:55:47 +02:00
Viktor Lofgren
854382b2be (language-filter) Experimentally permit Swedish results to pass through the language filter 2025-08-29 10:55:47 +02:00
Viktor Lofgren
8710adbc2a (build) Reduce log noise during tests 2025-08-29 10:55:32 +02:00