mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

Compare commits


9 Commits

Author SHA1 Message Date
Viktor Lofgren
710af4999a (feed-fetcher) Add &quot; entity mapping in feed fetcher 2025-01-01 15:45:17 +01:00
Viktor Lofgren
baeb4a46cd (search) Reintroduce query rewriting for recipes, add rules for wikis and forums 2024-12-31 16:05:00 +01:00
Viktor Lofgren
5e2a8e9f27 (deploy) Add capability of adding tags to deploy script 2024-12-31 16:04:13 +01:00
Viktor
cc1a5bdf90 Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-12-31 14:41:02 +01:00
Viktor
7f7b1ffaba Update ROADMAP.md 2024-12-31 14:40:34 +01:00
Viktor Lofgren
0ea8092350 (search) Add link promoting the redesign beta 2024-12-30 15:47:13 +01:00
Viktor Lofgren
483d29497e (deploy) Add hashbang to deploy script 2024-12-30 15:47:13 +01:00
Viktor Lofgren
bae44497fe (crawler) Add a new system property crawler.maxFetchSize
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
Viktor Lofgren
0d59202aca (crawler) Do not remove W/-prefix on weak e-tags
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
11 changed files with 115 additions and 220 deletions

ROADMAP.md

@@ -8,20 +8,10 @@ be implemented as well.
 Major goals:
 * Reach 1 billion pages indexed
-* Improve technical ability of indexing and search. Although this area has improved a bit, the
-search engine is still not very good at dealing with longer queries.
-
-## Proper Position Index (COMPLETED 2024-09)
-
-The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
-of being very fast to evaluate and works well for what it is, but is inaccurate and has the
-drawback of making support for quoted search terms inaccurate and largely reliant on indexing
-word n-grams known beforehand. This limits the ability to interpret longer queries.
-
-The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
-list, as is the civilized way of doing this.
-
-Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
+search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done.)
 
 ## Hybridize crawler w/ Common Crawl data
@@ -37,8 +27,7 @@ Retaining the ability to independently crawl the web is still strongly desirable
 ## Safe Search
 
-The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable
-to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
+The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
 combined with naive bayesian filter would go a long way, or something more sophisticated...?
 
 ## Web Design Overhaul
@@ -55,15 +44,6 @@ associated with each language added, at least a models file or two, as well as s
 It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
 
-## Finalize RSS support (COMPLETED 2024-11)
-
-Marginalia has experimental RSS preview support for a few domains. This works well and
-it should be extended to all domains. It would also be interesting to offer search of the
-RSS data itself, or use the RSS set to feed a special live index that updates faster than the
-main dataset.
-
-Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
-
 ## Support for binary formats like PDF
 
 The crawler needs to be modified to retain them, and the conversion logic needs to parse them.
@@ -80,5 +60,27 @@ This looks like a good idea that wouldn't just help clean up the search filters
 website, but might be cheap enough we might go as far as to offer a number of ad-hoc custom search
 filter for any API consumer.
 
-I've talked to the stract dev and he does not think it's a good idea to mimic their optics language,
-which is quite ad-hoc, but instead to work together to find some new common description language for this.
+I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
+
+# Completed
+
+## Proper Position Index (COMPLETED 2024-09)
+
+The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
+of being very fast to evaluate and works well for what it is, but is inaccurate and has the
+drawback of making support for quoted search terms inaccurate and largely reliant on indexing
+word n-grams known beforehand. This limits the ability to interpret longer queries.
+
+The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
+list, as is the civilized way of doing this.
+
+Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+
+## Finalize RSS support (COMPLETED 2024-11)
+
+Marginalia has experimental RSS preview support for a few domains. This works well and
+it should be extended to all domains. It would also be interesting to offer search of the
+RSS data itself, or use the RSS set to feed a special live index that updates faster than the
+main dataset.
+
+Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)

FeedFetcherService.java

@@ -402,6 +402,7 @@ public class FeedFetcherService {
                 "&ndash;", "-",
                 "&rsquo;", "'",
                 "&lsquo;", "'",
+                "&quot;", "\"",
                 "&nbsp;", ""
         );
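For readers skimming the diff, a minimal self-contained sketch of how a flat pair list like this can be applied to feed text follows. The ENTITY_PAIRS and applyEntityMappings names are illustrative, not the actual identifiers in FeedFetcherService:

import java.util.List;

class EntityMappingSketch {
    // (entity, replacement) pairs laid out flat, mirroring the mapping in the hunk above
    private static final List<String> ENTITY_PAIRS = List.of(
            "&ndash;", "-",
            "&rsquo;", "'",
            "&lsquo;", "'",
            "&quot;", "\"",
            "&nbsp;", ""
    );

    static String applyEntityMappings(String input) {
        String result = input;
        // Walk the flat list two entries at a time: key at i, replacement at i + 1
        for (int i = 0; i + 1 < ENTITY_PAIRS.size(); i += 2) {
            result = result.replace(ENTITY_PAIRS.get(i), ENTITY_PAIRS.get(i + 1));
        }
        return result;
    }
}

Sequential String.replace passes are perfectly adequate for a short fixed table like this; a single-pass scanner would only pay off for a much larger mapping.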

QueryExpansion.java

@@ -25,6 +25,7 @@ public class QueryExpansion {
             this::joinDashes,
             this::splitWordNum,
             this::joinTerms,
+            this::categoryKeywords,
             this::ngramAll
     );
@@ -98,6 +99,24 @@ public class QueryExpansion {
         }
     }
 
+    // Category keyword substitution, e.g. guitar wiki -> guitar generator:wiki
+    public void categoryKeywords(QWordGraph graph) {
+
+        for (var qw : graph) {
+
+            // Ensure we only perform the substitution on the last word in the query
+            if (!graph.getNextOriginal(qw).getFirst().isEnd()) {
+                continue;
+            }
+
+            switch (qw.word()) {
+                case "recipe", "recipes" -> graph.addVariant(qw, "category:food");
+                case "forum" -> graph.addVariant(qw, "generator:forum");
+                case "wiki" -> graph.addVariant(qw, "generator:wiki");
+            }
+        }
+    }
+
     // Turn 'lawn chair' into 'lawnchair'
     public void joinTerms(QWordGraph graph) {
         QWord prev = null;
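Note the design choice in categoryKeywords: the substitution only fires when the keyword is the last word of the query, so "pie recipe" gains a category:food variant while "recipe pie" does not. The QueryFactoryTest change further down exercises exactly this asymmetry.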

EnglishDictionary.java (deleted)

@@ -1,165 +0,0 @@
package nu.marginalia.util.language;

import com.google.inject.Inject;
import nu.marginalia.term_frequency_dict.TermFrequencyDict;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class EnglishDictionary {
    private final Set<String> englishWords = new HashSet<>();
    private final TermFrequencyDict tfDict;
    private final Logger logger = LoggerFactory.getLogger(getClass());

    @Inject
    public EnglishDictionary(TermFrequencyDict tfDict) {
        this.tfDict = tfDict;
        try (var resource = Objects.requireNonNull(ClassLoader.getSystemResourceAsStream("dictionary/en-words"),
                "Could not load word frequency table");
             var br = new BufferedReader(new InputStreamReader(resource))
        ) {
            for (;;) {
                String s = br.readLine();
                if (s == null) {
                    break;
                }
                englishWords.add(s.toLowerCase());
            }
        }
        catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public boolean isWord(String word) {
        return englishWords.contains(word);
    }

    private static final Pattern ingPattern = Pattern.compile(".*(\\w)\\1ing$");

    public Collection<String> getWordVariants(String s) {
        var variants = findWordVariants(s);

        var ret = variants.stream()
                .filter(var -> tfDict.getTermFreq(var) > 100)
                .collect(Collectors.toList());

        if (s.equals("recipe") || s.equals("recipes")) {
            ret.add("category:food");
        }

        return ret;
    }

    public Collection<String> findWordVariants(String s) {
        int sl = s.length();

        if (sl < 2) {
            return Collections.emptyList();
        }
        if (s.endsWith("s")) {
            String a = s.substring(0, sl-1);
            String b = s + "es";

            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        if (s.endsWith("sm")) {
            String a = s.substring(0, sl-1)+"t";
            String b = s.substring(0, sl-1)+"ts";

            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        if (s.endsWith("st")) {
            String a = s.substring(0, sl-1)+"m";
            String b = s + "s";

            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        else if (ingPattern.matcher(s).matches() && sl > 4) { // humming, clapping
            var a = s.substring(0, sl-4);
            var b = s.substring(0, sl-3) + "ed";

            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        else {
            String a = s + "s";
            String b = ingForm(s);
            String c = s + "ed";

            if (isWord(a) && isWord(b) && isWord(c)) {
                return List.of(a, b, c);
            }
            else if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(b) && isWord(c)) {
                return List.of(b, c);
            }
            else if (isWord(a) && isWord(c)) {
                return List.of(a, c);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
            else if (isWord(c)) {
                return List.of(c);
            }
        }

        return Collections.emptyList();
    }

    public String ingForm(String s) {
        if (s.endsWith("t") && !s.endsWith("tt")) {
            return s + "ting";
        }
        if (s.endsWith("n") && !s.endsWith("nn")) {
            return s + "ning";
        }
        if (s.endsWith("m") && !s.endsWith("mm")) {
            return s + "ming";
        }
        if (s.endsWith("r") && !s.endsWith("rr")) {
            return s + "ring";
        }
        return s + "ing";
    }
}
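This deleted class is where the old recipe hack lived: getWordVariants appended category:food for the words recipe/recipes. The categoryKeywords pass added to QueryExpansion above is the reintroduced, generalized home for that behavior, matching the commit message "Reintroduce query rewriting for recipes, add rules for wikis and forums".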

QueryFactoryTest.java

@@ -12,6 +12,7 @@ import nu.marginalia.index.query.limit.SpecificationLimit;
 import nu.marginalia.index.query.limit.SpecificationLimitType;
 import nu.marginalia.segmentation.NgramLexicon;
 import nu.marginalia.term_frequency_dict.TermFrequencyDict;
+import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.Test;
@@ -207,6 +208,17 @@ public class QueryFactoryTest {
         System.out.println(subquery);
     }
 
+    @Test
+    public void testExpansion9() {
+        var subquery = parseAndGetSpecs("pie recipe");
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" category:food "));
+
+        subquery = parseAndGetSpecs("recipe pie");
+        Assertions.assertFalse(subquery.query.compiledQuery.contains(" category:food "));
+    }
+
     @Test
     public void testParsing() {
         var subquery = parseAndGetSpecs("strlen()");
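The asserted needle " category:food " is deliberately padded with spaces, presumably so it only matches a whole term in the compiled query string rather than a substring of a longer token.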

ContentTags.java

@@ -20,34 +20,11 @@ public record ContentTags(String etag, String lastMod) {
     public void paint(Request.Builder getBuilder) {
 
         if (etag != null) {
-            getBuilder.addHeader("If-None-Match", ifNoneMatch());
+            getBuilder.addHeader("If-None-Match", etag);
         }
 
         if (lastMod != null) {
-            getBuilder.addHeader("If-Modified-Since", ifModifiedSince());
+            getBuilder.addHeader("If-Modified-Since", lastMod);
         }
     }
-
-    private String ifNoneMatch() {
-        // Remove the W/ prefix if it exists
-        // 'W/' (case-sensitive) indicates that a weak validator is used. Weak etags are
-        // easy to generate, but are far less useful for comparisons. Strong validators
-        // are ideal for comparisons but can be very difficult to generate efficiently.
-        // Weak ETag values of two representations of the same resources might be semantically
-        // equivalent, but not byte-for-byte identical. This means weak etags prevent caching
-        // when byte range requests are used, but strong etags mean range requests can
-        // still be cached.
-        // - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag
-        if (null != etag && etag.startsWith("W/")) {
-            return etag.substring(2);
-        } else {
-            return etag;
-        }
-    }
-
-    private String ifModifiedSince() {
-        return lastMod;
-    }
 }
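To make the behavioral change concrete, here is a small self-contained sketch of the round-trip semantics; ContentTagsSketch is illustrative, not the real ContentTags record (which also carries lastMod):

// After the fix, the validator is echoed back to the server verbatim.
record ContentTagsSketch(String etag) {
    String ifNoneMatchValue() {
        return etag; // W/ prefix preserved, exactly as received
    }
}

class WeakEtagDemo {
    public static void main(String[] args) {
        var tags = new ContentTagsSketch("W/\"67ab43\"");
        // Before this commit the W/ prefix was stripped, sending "67ab43",
        // which a server may fail to match against the weak etag it issued.
        System.out.println("If-None-Match: " + tags.ifNoneMatchValue());
        // prints: If-None-Match: W/"67ab43"
    }
}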

WarcRecorder.java

@@ -34,8 +34,9 @@ import java.util.*;
 public class WarcRecorder implements AutoCloseable {
     /** Maximum time we'll wait on a single request */
     static final int MAX_TIME = 30_000;
-    /** Maximum (decompressed) size we'll fetch */
-    static final int MAX_SIZE = 1024 * 1024 * 10;
+
+    /** Maximum (decompressed) size we'll save */
+    static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);
 
     private final WarcWriter writer;
     private final Path warcFile;
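Because the limit is now read via Integer.getInteger, it can be overridden per JVM without a code change: starting the crawler with -Dcrawler.maxFetchSize=52428800 raises the cap to 50 MB, while leaving the property unset keeps the 10 MB default.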

SimpleLinkScraper.java

@@ -48,6 +48,8 @@ public class SimpleLinkScraper implements AutoCloseable {
     private final Duration readTimeout = Duration.ofSeconds(10);
     private final DomainLocks domainLocks = new DomainLocks();
 
+    private final static int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);
+
     public SimpleLinkScraper(LiveCrawlDataSet dataSet,
                              DbDomainQueries domainQueries,
                              DomainBlacklist domainBlacklist) {
@@ -207,7 +209,7 @@ public class SimpleLinkScraper implements AutoCloseable {
         }
 
         byte[] body = getResponseData(response);
-        if (body.length > 1024 * 1024) {
+        if (body.length > MAX_SIZE) {
             return new FetchResult.Error(parsedUrl);
         }
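This replaces the previously hard-coded 1 MB cutoff with the same crawler.maxFetchSize property used by WarcRecorder above. As the commit message notes, the two consumers react differently at the limit: the live crawler rejects an over-sized item outright (FetchResult.Error), while the batch crawler truncates it.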

search/index/index-redesign (new file)

@@ -0,0 +1,14 @@
+<section id="frontpage-tips">
+    <h2>Public Beta Available</h2>
+    <div class="info">
+        <p>
+            A redesigned version of the search engine UI is available for beta testing.
+            Feel free to give it a spin, feedback is welcome!
+            The old UI will also remain available if you hate it,
+            or run into compatibility issues.
+        </p>
+        <p>
+            <a href="https://test.marginalia.nu/">Try it out!</a>
+        </p>
+    </div>
+</section>

Search frontpage template

@@ -24,7 +24,7 @@
 <section id="frontpage">
     {{>search/index/index-news}}
     {{>search/index/index-about}}
-    {{>search/index/index-tips}}
+    {{>search/index/index-redesign}}
 </section>
 
 {{>search/parts/search-footer}}

tools/deployment/deployment.py (Normal file → Executable file)

@@ -1,3 +1,5 @@
+#!/usr/bin/env python3
+
 from dataclasses import dataclass
 import subprocess, os
 from typing import List, Set, Dict, Optional
@@ -220,6 +222,31 @@ def run_gradle_build(targets: str) -> None:
     if return_code != 0:
         raise BuildError(service, return_code)
 
+def find_free_tag() -> str:
+    cmd = ['git', 'tag']
+    result = subprocess.run(cmd, capture_output=True, text=True)
+
+    if result.returncode != 0:
+        raise RuntimeError(f"Git command failed: {result.stderr}")
+
+    existing_tags = set(result.stdout.splitlines())
+
+    for i in range(1, 100000):
+        tag = f'deploy-{i:04d}'
+        if tag not in existing_tags:
+            return tag
+
+    raise RuntimeError("Failed to find a free deployment tag")
+
+def add_tags(tags: str) -> None:
+    new_tag = find_free_tag()
+
+    cmd = ['git', 'tag', new_tag, '-am', tags]
+    result = subprocess.run(cmd, capture_output=True, text=True)
+
+    if result.returncode != 0:
+        raise RuntimeError(f"Git command failed: {result.stderr}")
+
 # Example usage:
 if __name__ == '__main__':
     # Define service configuration
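A note on the git plumbing: git tag <name> -a -m <message> creates an annotated tag, so the supplied tag string becomes the tag's message rather than its name; the deployment plan is presumably recovered later by reading the message of the newest deploy- tag.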
@@ -293,7 +320,9 @@ if __name__ == '__main__':
     parser = argparse.ArgumentParser(
         prog='deployment.py',
         description='Continuous Deployment helper')
 
     parser.add_argument('-v', '--verify', help='Verify the tags are valid, if present', action='store_true')
+    parser.add_argument('-a', '--add', help='Add the tags provided as a new deployment tag, usually combined with -t', action='store_true')
     parser.add_argument('-t', '--tag', help='Use the specified tag value instead of the head git tag starting with deploy-')
 
     args = parser.parse_args()
@@ -314,7 +343,10 @@ if __name__ == '__main__':
     print("Services to build:", plan.services_to_build)
     print("Instances to deploy:", [container.name for container in plan.instances_to_deploy])
 
-    if not args.verify:
+    if args.verify:
+        if args.add:
+            add_tags(args.tag)
+    else:
         print("\nExecution Plan:")
 
         build_and_deploy(plan, SERVICE_CONFIG)
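Putting the new flags together: a run like ./deployment.py -v -a -t 'search,api' would validate the tag value, pick the next unused deploy-NNNN name, and create it as an annotated tag carrying the supplied tag list as its message, without building or deploying anything locally. (Flag semantics inferred from the diff above; the script's --help output is authoritative.)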