mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-05 21:22:39 +02:00

(query-segmentation) Merge pull request #89 from MarginaliaSearch/query-segmentation

The changeset cleans up the query parsing logic in the query service. It removes a large amount of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and replaces it with a new, cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term.

A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model.
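A dictionary-driven segmenter of this kind can be sketched with a greedy longest-match pass over the query tokens. The class and method names below are illustrative, not the project's actual API; this is only a minimal sketch of the technique, assuming the n-gram dictionary is an in-memory set of space-joined phrases.

```java
import java.util.*;

// Illustrative sketch of dictionary-based n-gram segmentation.
// Names are hypothetical; the real model is loaded from a file built
// from Wikipedia data, not passed in as a Set.
public class NgramSegmenter {
    private final Set<String> knownNgrams;
    private final int maxNgramLength;

    public NgramSegmenter(Set<String> knownNgrams, int maxNgramLength) {
        this.knownNgrams = knownNgrams;
        this.maxNgramLength = maxNgramLength;
    }

    /** Greedily groups tokens into the longest n-grams found in the dictionary. */
    public List<String> segment(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 1;
            // Try the longest candidate first, shrinking until a dictionary hit
            for (int len = Math.min(maxNgramLength, tokens.size() - i); len > 1; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (knownNgrams.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            out.add(String.join(" ", tokens.subList(i, i + matched)));
            i += matched;
        }
        return out;
    }
}
```

With a dictionary containing "search engine", the tokens ["marginalia", "search", "engine"] segment into ["marginalia", "search engine"].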

A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This is for the benefit of easily writing rules for generating alternative queries, e.g. using the new segmentation data.
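The idea of a DAG with sentinel start/end vertices can be sketched as follows. This is a simplified illustration (it assumes each token appears at most once, and the class names are invented), showing how a rewrite rule adds an alternative path and how every begin-to-end path yields one alternative query.

```java
import java.util.*;

// Illustrative sketch of a query DAG with sentinel BEGIN/END vertices;
// not the project's actual representation. Tokens are assumed unique.
public class QueryGraph {
    public static final String BEGIN = "^", END = "$";
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    public QueryGraph(List<String> tokens) {
        String prev = BEGIN;
        for (String tok : tokens) {
            addEdge(prev, tok);
            prev = tok;
        }
        addEdge(prev, END);
    }

    // A rewrite rule (e.g. one using segmentation data) calls this to
    // splice in an alternative path through the graph.
    public void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    /** Every path from BEGIN to END is one alternative query. */
    public List<List<String>> allQueries() {
        List<List<String>> results = new ArrayList<>();
        walk(BEGIN, new ArrayList<>(), results);
        return results;
    }

    private void walk(String v, List<String> path, List<List<String>> results) {
        if (v.equals(END)) {
            results.add(new ArrayList<>(path));
            return;
        }
        if (!v.equals(BEGIN)) path.add(v);
        for (String next : edges.getOrDefault(v, Set.of())) {
            walk(next, path, results);
        }
        if (!v.equals(BEGIN)) path.remove(path.size() - 1);
    }
}
```

For example, starting from "marginalia search engine", a rule could add the edges "marginalia" → "searchengine" → END, after which the graph enumerates both "marginalia search engine" and "marginalia searchengine".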

The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine".
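The expansion semantics can be demonstrated with a small sketch that splits the expression into groups of alternatives and takes their Cartesian product. This is an illustrative re-implementation of the behaviour described above, not the index's actual parser, and it assumes well-formed, non-nested parentheses.

```java
import java.util.*;
import java.util.regex.*;

// Illustrative sketch of expanding the parenthesized alternative syntax;
// not the index's actual parser. Assumes non-nested, balanced parentheses.
public class QueryExpander {
    public static List<String> expand(String expr) {
        // Tokenize into parenthesized alternative groups and bare words
        List<List<String>> groups = new ArrayList<>();
        Matcher m = Pattern.compile("\\(([^)]*)\\)|(\\S+)").matcher(expr);
        while (m.find()) {
            if (m.group(1) != null) {
                List<String> alts = new ArrayList<>();
                for (String alt : m.group(1).split("\\|"))
                    alts.add(alt.trim());
                groups.add(alts);
            } else {
                groups.add(List.of(m.group(2)));
            }
        }
        // Cartesian product of the groups yields every concrete query
        List<String> results = new ArrayList<>(List.of(""));
        for (List<String> group : groups) {
            List<String> next = new ArrayList<>();
            for (String prefix : results)
                for (String alt : group)
                    next.add(prefix.isEmpty() ? alt : prefix + " " + alt);
            results = next;
        }
        return results;
    }
}
```

Expanding "( wiby | marginalia | kagi ) ( search engine | searchengine )" produces the six concrete queries listed above.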

The compiled query is passed to the index, which parses the expression and uses it to drive execution of the search and ranking of the results.
Authored by Viktor, 2024-04-16 15:31:05 +02:00; committed by GitHub.
143 changed files with 4547 additions and 2229 deletions


@@ -275,9 +275,7 @@ public class ZIMReader {
     }
     // Gives the minimum required information needed for the given articleName
-    public DirectoryEntry forEachTitles(Consumer<ArticleEntry> aeConsumer, Consumer<RedirectEntry> reConsumer)
+    public DirectoryEntry forEachTitles(Consumer<String> titleConsumer)
             throws IOException {
         int numberOfArticles = mFile.getArticleCount();
@@ -287,26 +285,9 @@ public class ZIMReader {
         System.err.println(numberOfArticles);
         long start = System.currentTimeMillis();
-        Map<Integer, Map<Integer, String>> data = new TreeMap<>();
         System.err.println("Indexing");
         for (long i = beg; i < end; i+=4) {
             var entry = getDirectoryInfoAtTitlePosition(i);
             if (((i-beg)%100_000) == 0) {
                 System.err.printf("%f%%\n", ((i-beg) * 100.) / (end-beg));
             }
-            if (entry.mimeType == targetMime && entry instanceof ArticleEntry) {
-                aeConsumer.accept((ArticleEntry) entry);
-            }
-            else if (entry.mimeType == 65535 && entry instanceof RedirectEntry) {
-                reConsumer.accept((RedirectEntry) entry);
-            }
+            titleConsumer.accept(entry.title);
         }
         return null;