(query-segmentation) Merge pull request #89 from MarginaliaSearch/query-segmentation
The changeset cleans up the query parsing logic in the query service. It removes a large amount of old and largely unmaintainable query-rewriting logic based on POS-tagging rules and replaces it with a cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term.

A new query segmentation model is introduced, based on a dictionary of known n-grams, along with tools for extracting this dictionary from Wikipedia data. The changeset adds a new segmentation model file, which is downloaded by the usual run/setup.sh, as well as an updated term frequency model.

A new intermediate representation of the query is introduced, based on a DAG with predefined vertices that initiate and terminate the graph. This makes it easy to write rules for generating alternative queries, e.g. using the new segmentation data. The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine". This compiled query is passed to the index, which parses the expression and uses it to drive execution of the search and ranking of the results.
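The expansion of the compiled query syntax can be illustrated with a short, self-contained sketch. This is only an illustration of the cartesian-product expansion described above, under the assumption that each parenthesized group has already been parsed into a list of alternatives; the class and method names are invented for the example and are not the project's actual parser.

    import java.util.ArrayList;
    import java.util.List;

    class CompiledQueryExpansionSketch {
        // Each inner list is one parenthesized group of alternatives, e.g.
        // "( wiby | marginalia | kagi )" becomes List.of("wiby", "marginalia", "kagi").
        static List<String> expand(List<List<String>> groups) {
            List<String> results = new ArrayList<>();
            results.add("");
            for (List<String> alternatives : groups) {
                List<String> next = new ArrayList<>();
                for (String prefix : results) {
                    for (String alt : alternatives) {
                        next.add(prefix.isEmpty() ? alt : prefix + " " + alt);
                    }
                }
                results = next;
            }
            return results;
        }

        public static void main(String[] args) {
            var expanded = expand(List.of(
                    List.of("wiby", "marginalia", "kagi"),
                    List.of("search engine", "searchengine")));
            // Prints the six expanded queries listed in the example above.
            expanded.forEach(System.out::println);
        }
    }

Running the sketch prints the six variants from the example; that cross product is the set of concrete queries the index ends up evaluating and ranking.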
@@ -275,9 +275,7 @@ public class ZIMReader {
     }
 
     // Gives the minimum required information needed for the given articleName
-    public DirectoryEntry forEachTitles(Consumer<ArticleEntry> aeConsumer, Consumer<RedirectEntry> reConsumer)
+    public DirectoryEntry forEachTitles(Consumer<String> titleConsumer)
             throws IOException {
 
         int numberOfArticles = mFile.getArticleCount();
@@ -287,26 +285,9 @@ public class ZIMReader {
         System.err.println(numberOfArticles);
         long start = System.currentTimeMillis();
 
-        Map<Integer, Map<Integer, String>> data = new TreeMap<>();
-
-        System.err.println("Indexing");
-
         for (long i = beg; i < end; i+=4) {
             var entry = getDirectoryInfoAtTitlePosition(i);
 
             if (((i-beg)%100_000) == 0) {
                 System.err.printf("%f%%\n", ((i-beg) * 100.) / (end-beg));
             }
 
-            if (entry.mimeType == targetMime && entry instanceof ArticleEntry) {
-                aeConsumer.accept((ArticleEntry) entry);
-            }
-            else if (entry.mimeType == 65535 && entry instanceof RedirectEntry) {
-                reConsumer.accept((RedirectEntry) entry);
-            }
+            titleConsumer.accept(entry.title);
         }
 
         return null;
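As a usage note, the refactored forEachTitles above takes a single Consumer<String> that receives each entry's title, replacing the older two-consumer overload. A hypothetical caller might look like the following sketch; the ZIMFile construction and file path are assumptions for illustration, not taken from this diff.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    class TitleDumpSketch {
        // Collects every title via the simplified forEachTitles(Consumer<String>) API.
        // ZIMReader/ZIMFile construction here is assumed, not shown in the diff.
        static List<String> collectTitles(String zimPath) throws IOException {
            var reader = new ZIMReader(new ZIMFile(zimPath));
            List<String> titles = new ArrayList<>();
            reader.forEachTitles(titles::add);
            return titles;
        }
    }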