(search) Reintroduce query rewriting for recipes, add rules for wikis and forums

(deploy) Add capability of adding tags to deploy script
Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1
2025-10-06 07:32:38 +02:00 · 2024-12-31 16:05:00 +01:00 · 2024-12-31 16:04:13 +01:00 · 2024-12-31 14:41:02 +01:00 · 2024-12-31 14:40:34 +01:00
5 changed files with 89 additions and 191 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -8,20 +8,10 @@ be implemented as well.
 Major goals:

 * Reach 1 billion pages indexed
-* Improve technical ability of indexing and search.  Although this area has improved a bit, the
-  search engine is still not very good at dealing with longer queries.

-## Proper Position Index (COMPLETED 2024-09)

-The search engine uses a fixed width bit mask to indicate word positions.  It has the benefit
-of being very fast to evaluate and works well for what it is, but is inaccurate and has the 
-drawback of making support for quoted search terms inaccurate and largely reliant on indexing 
-word n-grams known beforehand.  This limits the ability to interpret longer queries.
-
-The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
-list, as is the civilized way of doing this.
-
-Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+* Improve technical ability of indexing and search.  ~~Although this area has improved a bit, the
+  search engine is still not very good at dealing with longer queries.~~  (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly.  There is still more work to be done.)

 ## Hybridize crawler w/ Common Crawl data

@@ -37,8 +27,7 @@ Retaining the ability to independently crawl the web is still strongly desirable

 ## Safe Search

-The search engine has a bit of a problem showing spicy content mixed in with the results.  It would be desirable
-to have a way to filter this out.  It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
+The search engine has a bit of a problem showing spicy content mixed in with the results.  It would be desirable to have a way to filter this out.  It's likely that something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php))
 combined with a naive Bayesian filter would go a long way, or something more sophisticated...?

 ## Web Design Overhaul
@@ -55,15 +44,6 @@ associated with each language added, at least a models file or two, as well as s

 It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.

-## Finalize RSS support (COMPLETED 2024-11)
-
-Marginalia has experimental RSS preview support for a few domains.  This works well and
-it should be extended to all domains.  It would also be interesting to offer search of the
-RSS data itself, or use the RSS set to feed a special live index that updates faster than the
-main dataset. 
-
-Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
-
 ## Support for binary formats like PDF

 The crawler needs to be modified to retain them, and the conversion logic needs to parse them.  
@@ -80,5 +60,27 @@ This looks like a good idea that wouldn't just help clean up the search filters
 website, but might be cheap enough we might go as far as to offer a number of ad-hoc custom search
 filter for any API consumer.

-I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, 
-which is quite ad-hoc, but instead to work together to find some new common description language for this. 
+I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this. 
+
+# Completed
+
+## Proper Position Index (COMPLETED 2024-09)
+
+The search engine uses a fixed width bit mask to indicate word positions.  It has the benefit
+of being very fast to evaluate and works well for what it is, but is inaccurate and has the 
+drawback of making support for quoted search terms inaccurate and largely reliant on indexing 
+word n-grams known beforehand.  This limits the ability to interpret longer queries.
+
+The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
+list, as is the civilized way of doing this.
+
+Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+
+## Finalize RSS support (COMPLETED 2024-11)
+
+Marginalia has experimental RSS preview support for a few domains.  This works well and
+it should be extended to all domains.  It would also be interesting to offer search of the
+RSS data itself, or use the RSS set to feed a special live index that updates faster than the
+main dataset. 
+
+Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
--- a/code/functions/search-query/java/nu/marginalia/functions/searchquery/query_parser/QueryExpansion.java
+++ b/code/functions/search-query/java/nu/marginalia/functions/searchquery/query_parser/QueryExpansion.java
@@ -25,6 +25,7 @@ public class QueryExpansion {
            this::joinDashes,
            this::splitWordNum,
            this::joinTerms,
+            this::categoryKeywords,
            this::ngramAll
    );

@@ -98,6 +99,24 @@ public class QueryExpansion {
        }
    }

+    // Category keyword substitution, e.g. guitar wiki -> guitar generator:wiki
+    public void categoryKeywords(QWordGraph graph) {
+
+        for (var qw : graph) {
+
+            // Ensure we only perform the substitution on the last word in the query
+            if (!graph.getNextOriginal(qw).getFirst().isEnd()) {
+                continue;
+            }
+
+            switch (qw.word()) {
+                case "recipe", "recipes" -> graph.addVariant(qw, "category:food");
+                case "forum" -> graph.addVariant(qw, "generator:forum");
+                case "wiki" -> graph.addVariant(qw, "generator:wiki");
+            }
+        }
+    }
+
    // Turn 'lawn chair' into 'lawnchair'
    public void joinTerms(QWordGraph graph) {
        QWord prev = null;
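
The `categoryKeywords` rule added above can be illustrated outside the query graph. The following is a minimal Python sketch, not the project's code: `CATEGORY_KEYWORDS` and `category_variants` are hypothetical names, and plain whitespace splitting stands in for `QWordGraph`. It assumes, based on the `addVariant` call, that the category term is offered as an alternative to the final word rather than replacing it.

```python
# Hypothetical stand-in for the Java switch statement; the mapping is copied from the diff.
CATEGORY_KEYWORDS = {
    "recipe": "category:food",
    "recipes": "category:food",
    "forum": "generator:forum",
    "wiki": "generator:wiki",
}


def category_variants(query: str) -> dict[int, str]:
    """Return {word index: variant term} for words that would gain a variant.

    Assumption: whitespace splitting approximates the query graph's word order.
    Only the last word of the query is considered, mirroring the isEnd() check.
    """
    words = query.split()
    if not words:
        return {}

    last = len(words) - 1
    variant = CATEGORY_KEYWORDS.get(words[last])
    return {last: variant} if variant is not None else {}


if __name__ == "__main__":
    print(category_variants("pie recipe"))
    print(category_variants("recipe pie"))
```

Restricting the rule to the final word is what makes "pie recipe" and "recipe pie" behave differently, which is the case `testExpansion9` in this commit checks.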
--- a/code/functions/search-query/java/nu/marginalia/util/language/EnglishDictionary.java
+++ b/code/functions/search-query/java/nu/marginalia/util/language/EnglishDictionary.java
@@ -1,165 +0,0 @@
-package nu.marginalia.util.language;
-
-import com.google.inject.Inject;
-import nu.marginalia.term_frequency_dict.TermFrequencyDict;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import java.io.BufferedReader;
-import java.io.InputStreamReader;
-import java.util.*;
-import java.util.regex.Pattern;
-import java.util.stream.Collectors;
-
-public class EnglishDictionary {
-    private final Set<String> englishWords = new HashSet<>();
-    private final TermFrequencyDict tfDict;
-    private final Logger logger = LoggerFactory.getLogger(getClass());
-
-    @Inject
-    public EnglishDictionary(TermFrequencyDict tfDict) {
-        this.tfDict = tfDict;
-        try (var resource = Objects.requireNonNull(ClassLoader.getSystemResourceAsStream("dictionary/en-words"),
-                "Could not load word frequency table");
-             var br = new BufferedReader(new InputStreamReader(resource))
-        ) {
-            for (;;) {
-                String s = br.readLine();
-                if (s == null) {
-                    break;
-                }
-                englishWords.add(s.toLowerCase());
-            }
-        }
-        catch (Exception ex) {
-            throw new RuntimeException(ex);
-        }
-    }
-
-    public boolean isWord(String word) {
-        return englishWords.contains(word);
-    }
-
-    private static final Pattern ingPattern = Pattern.compile(".*(\\w)\\1ing$");
-
-    public Collection<String> getWordVariants(String s) {
-        var variants = findWordVariants(s);
-
-        var ret = variants.stream()
-                .filter(var -> tfDict.getTermFreq(var) > 100)
-                .collect(Collectors.toList());
-
-        if (s.equals("recipe") || s.equals("recipes")) {
-            ret.add("category:food");
-        }
-
-        return ret;
-    }
-
-
-    public Collection<String> findWordVariants(String s) {
-        int sl = s.length();
-
-        if (sl < 2) {
-            return Collections.emptyList();
-        }
-        if (s.endsWith("s")) {
-            String a = s.substring(0, sl-1);
-            String b = s + "es";
-            if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-        }
-        if (s.endsWith("sm")) {
-            String a = s.substring(0, sl-1)+"t";
-            String b = s.substring(0, sl-1)+"ts";
-            if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-        }
-        if (s.endsWith("st")) {
-            String a = s.substring(0, sl-1)+"m";
-            String b = s + "s";
-            if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-        }
-        else if (ingPattern.matcher(s).matches() && sl > 4) { // humming, clapping
-            var a = s.substring(0, sl-4);
-            var b = s.substring(0, sl-3) + "ed";
-
-            if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-        }
-        else {
-            String a = s + "s";
-            String b = ingForm(s);
-            String c = s + "ed";
-
-            if (isWord(a) && isWord(b) && isWord(c)) {
-                return List.of(a, b, c);
-            }
-            else if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(b) && isWord(c)) {
-                return List.of(b, c);
-            }
-            else if (isWord(a) && isWord(c)) {
-                return List.of(a, c);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-            else if (isWord(c)) {
-                return List.of(c);
-            }
-        }
-
-        return Collections.emptyList();
-    }
-
-    public String ingForm(String s) {
-        if (s.endsWith("t") && !s.endsWith("tt")) {
-            return s + "ting";
-        }
-        if (s.endsWith("n") && !s.endsWith("nn")) {
-            return s + "ning";
-        }
-        if (s.endsWith("m") && !s.endsWith("mm")) {
-            return s + "ming";
-        }
-        if (s.endsWith("r") && !s.endsWith("rr")) {
-            return s + "ring";
-        }
-        return s + "ing";
-    }
-}
--- a/code/functions/search-query/test/nu/marginalia/query/svc/QueryFactoryTest.java
+++ b/code/functions/search-query/test/nu/marginalia/query/svc/QueryFactoryTest.java
@@ -12,6 +12,7 @@ import nu.marginalia.index.query.limit.SpecificationLimit;
 import nu.marginalia.index.query.limit.SpecificationLimitType;
 import nu.marginalia.segmentation.NgramLexicon;
 import nu.marginalia.term_frequency_dict.TermFrequencyDict;
+import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.Test;

@@ -207,6 +208,17 @@ public class QueryFactoryTest {
        System.out.println(subquery);
    }

+    @Test
+    public void testExpansion9() {
+        var subquery = parseAndGetSpecs("pie recipe");
+
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" category:food "));
+
+        subquery = parseAndGetSpecs("recipe pie");
+
+        Assertions.assertFalse(subquery.query.compiledQuery.contains(" category:food "));
+    }
+
    @Test
    public void testParsing() {
        var subquery = parseAndGetSpecs("strlen()");
--- a/tools/deployment/deployment.py
+++ b/tools/deployment/deployment.py
@@ -222,6 +222,31 @@ def run_gradle_build(targets: str) -> None:
    if return_code != 0:
        raise BuildError(service, return_code)

+
+def find_free_tag() -> str:
+    cmd = ['git', 'tag']
+    result = subprocess.run(cmd, capture_output=True, text=True)
+
+    if result.returncode != 0:
+        raise RuntimeError(f"Git command failed: {result.stderr}")
+
+    existing_tags = set(result.stdout.splitlines())
+
+    for i in range(1, 100000):
+        tag = f'deploy-{i:04d}'
+        if tag not in existing_tags:
+            return tag
+    raise RuntimeError("Failed to find a free deployment tag")
+
+def add_tags(tags: str) -> None:
+    new_tag = find_free_tag()
+
+    cmd = ['git', 'tag', new_tag, '-am', tags]
+    result = subprocess.run(cmd, capture_output=True, text=True)
+
+    if result.returncode != 0:
+        raise RuntimeError(f"Git command failed: {result.stderr}")
+
 # Example usage:
 if __name__ == '__main__':
    # Define service configuration
@@ -295,7 +320,9 @@ if __name__ == '__main__':
        parser = argparse.ArgumentParser(
            prog='deployment.py',
            description='Continuous Deployment helper')
+
        parser.add_argument('-v', '--verify', help='Verify the tags are valid, if present', action='store_true')
+        parser.add_argument('-a', '--add', help='Add the tags provided as a new deployment tag, usually combined with -t', action='store_true')
        parser.add_argument('-t', '--tag', help='Use the specified tag value instead of the head git tag starting with deploy-')

        args = parser.parse_args()
@@ -316,7 +343,10 @@ if __name__ == '__main__':
            print("Services to build:", plan.services_to_build)
            print("Instances to deploy:", [container.name for container in plan.instances_to_deploy])

-            if not args.verify:
+            if args.verify:
+                if args.add:
+                    add_tags(args.tag)
+            else:
                print("\nExecution Plan:")

                build_and_deploy(plan, SERVICE_CONFIG)
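
The tag selection in `find_free_tag` can be exercised without a git repository by separating the search from the `git tag` call. This is a minimal sketch under that assumption: `next_free_tag` and its `limit` parameter are hypothetical and not part of `deployment.py`; the `deploy-{i:04d}` format and the upper bound are taken from the diff.

```python
def next_free_tag(existing_tags: set[str], limit: int = 100000) -> str:
    """Return the lowest unused tag of the form deploy-NNNN.

    Hypothetical helper: takes the tag set as an argument instead of
    running `git tag`, so the search logic can be tested in isolation.
    """
    for i in range(1, limit):
        tag = f"deploy-{i:04d}"
        if tag not in existing_tags:
            return tag
    raise RuntimeError("Failed to find a free deployment tag")


if __name__ == "__main__":
    print(next_free_tag({"deploy-0001", "deploy-0003"}))
```

Note that the search returns the lowest free number rather than one past the highest existing tag, so a gap left by a deleted tag would be reused.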
Author	SHA1	Message	Date
Viktor Lofgren	baeb4a46cd	(search) Reintroduce query rewriting for recipes, add rules for wikis and forums	2024-12-31 16:05:00 +01:00
Viktor Lofgren	5e2a8e9f27	(deploy) Add capability of adding tags to deploy script	2024-12-31 16:04:13 +01:00
Viktor	cc1a5bdf90	Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1 Update ROADMAP.md	2024-12-31 14:41:02 +01:00
Viktor	7f7b1ffaba	Update ROADMAP.md	2024-12-31 14:40:34 +01:00