mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

Compare commits


16 Commits

Author SHA1 Message Date
Viktor Lofgren
b62f043910 (search) Adjust token formation rules to be more lenient to C++ and PHP code.
This addresses Issue #142
2025-01-05 20:50:27 +01:00
Viktor
9b2ceaf37c Merge pull request #141 from MarginaliaSearch/vlofgren-patch-1
Update FUNDING.yml
2025-01-05 18:40:20 +01:00
Viktor
8019c2ce18 Update FUNDING.yml 2025-01-05 18:40:06 +01:00
Viktor Lofgren
4da3563d8a (service) Clean up exceptions when requestScreengrab is not available 2025-01-04 14:45:51 +01:00
Viktor Lofgren
48d0a3089a (service) Improve logging around grpc
This change adds a marker for gRPC-specific logging and improves the clarity and meaningfulness of the log messages.
2025-01-02 20:40:53 +01:00
Viktor Lofgren
594df64b20 (domain-info) Use appropriate sqlite database when fetching feed status 2025-01-02 20:20:36 +01:00
Viktor Lofgren
78eb1417a7 (service) Only block on SingleNodeChannelPool creation in QueryClient
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.

This should make most services start faster as a result.
2025-01-02 18:42:01 +01:00
Viktor Lofgren
67edc8f90d (domain-info) Only flag domains with rss feed items as having a feed 2025-01-02 17:41:52 +01:00
Viktor Lofgren
5f576b7d0c (query-parser) Strip leading underlines
This addresses issue #140, where __builtin_ffs gives no results.
2025-01-02 14:39:03 +01:00
Viktor Lofgren
0b65164f60 (chore) Fix broken test 2025-01-01 18:06:29 +01:00
Viktor Lofgren
9be477de33 (domain-info) Add a feed flag to domain info
This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.
2025-01-01 18:02:33 +01:00
Viktor Lofgren
710af4999a (feed-fetcher) Add " entity mapping in feed fetcher 2025-01-01 15:45:17 +01:00
Viktor Lofgren
baeb4a46cd (search) Reintroduce query rewriting for recipes, add rules for wikis and forums 2024-12-31 16:05:00 +01:00
Viktor Lofgren
5e2a8e9f27 (deploy) Add capability of adding tags to deploy script 2024-12-31 16:04:13 +01:00
Viktor
cc1a5bdf90 Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-12-31 14:41:02 +01:00
Viktor
7f7b1ffaba Update ROADMAP.md 2024-12-31 14:40:34 +01:00
21 changed files with 265 additions and 221 deletions

.github/FUNDING.yml vendored
View File

@@ -1,5 +1,6 @@
# These are supported funding model platforms
polar: marginalia-search
github: MarginaliaSearch
patreon: marginalia_nu
open_collective: # Replace with a single Open Collective username

ROADMAP.md
View File

@@ -8,20 +8,10 @@ be implemented as well.
Major goals:
* Reach 1 billion pages indexed
* Improve technical ability of indexing and search. Although this area has improved a bit, the
search engine is still not very good at dealing with longer queries.
## Proper Position Index (COMPLETED 2024-09)
The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
of being very fast to evaluate and works well for what it is, but is inaccurate and has the
drawback of making support for quoted search terms inaccurate and largely reliant on indexing
word n-grams known beforehand. This limits the ability to interpret longer queries.
The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
list, as is the civilized way of doing this.
Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done.)
## Hybridize crawler w/ Common Crawl data
@@ -37,8 +27,7 @@ Retaining the ability to independently crawl the web is still strongly desirable
## Safe Search
The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable
to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
combined with a naive Bayesian filter would go a long way, or something more sophisticated...?
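To make the Bayesian half of that idea concrete, here is a minimal sketch of a naive Bayes "spicy content" classifier. Every name in it is hypothetical; nothing below exists in the codebase:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the naive Bayes filter floated above.
class NaiveBayesSafetyFilter {
    private final Map<String, Integer> spicyCounts = new HashMap<>();
    private final Map<String, Integer> cleanCounts = new HashMap<>();
    private int spicyDocs = 0;
    private int cleanDocs = 0;

    void train(String[] tokens, boolean spicy) {
        var counts = spicy ? spicyCounts : cleanCounts;
        if (spicy) spicyDocs++; else cleanDocs++;
        for (String t : tokens) counts.merge(t, 1, Integer::sum);
    }

    /** Log-odds that a document is spicy; add-one smoothing keeps unseen tokens harmless. */
    double spicyLogOdds(String[] tokens) {
        double logOdds = Math.log((spicyDocs + 1.0) / (cleanDocs + 1.0));
        for (String t : tokens) {
            double pSpicy = (spicyCounts.getOrDefault(t, 0) + 1.0) / (spicyDocs + 2.0);
            double pClean = (cleanCounts.getOrDefault(t, 0) + 1.0) / (cleanDocs + 2.0);
            logOdds += Math.log(pSpicy / pClean);
        }
        return logOdds; // > 0: probably filter; combine with the URL blacklist verdict
    }
}
```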
## Web Design Overhaul
@@ -55,15 +44,6 @@ associated with each language added, at least a models file or two, as well as s
It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
## Finalize RSS support (COMPLETED 2024-11)
Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.
Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
## Support for binary formats like PDF
The crawler needs to be modified to retain them, and the conversion logic needs to parse them.
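For the conversion side, a rough sketch of what PDF text extraction could look like, assuming a library such as Apache PDFBox were chosen (an assumption for illustration, not a project decision):

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical converter step; PDFBox (2.x API) is an assumed library choice.
class PdfTextExtractor {
    String extractText(byte[] pdfBytes) throws IOException {
        // Parse the crawled bytes and flatten the document to plain text
        try (PDDocument doc = PDDocument.load(pdfBytes)) {
            return new PDFTextStripper().getText(doc);
        }
    }
}
```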
@@ -80,5 +60,27 @@ This looks like a good idea that wouldn't just help clean up the search filters
website, but might be cheap enough that we could go as far as offering a number of ad-hoc custom search
filters for any API consumer.
I've talked to the stract dev and he does not think it's a good idea to mimic their optics language,
which is quite ad-hoc, but instead to work together to find some new common description language for this.
I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
# Completed
## Proper Position Index (COMPLETED 2024-09)
The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
of being very fast to evaluate and works well for what it is, but is inaccurate and has the
drawback of making support for quoted search terms inaccurate and largely reliant on indexing
word n-grams known beforehand. This limits the ability to interpret longer queries.
The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
list, as is the civilized way of doing this.
Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
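As an aside for readers unfamiliar with gamma codes, a tiny sketch of the idea applied to a positions list (delta-code the gaps, then Elias-gamma the gap values); illustrative only, not the actual index format:

```java
// Illustrative sketch of Elias gamma coding for a positions list.
class EliasGamma {
    /** Gamma code: floor(log2 n) zero bits, then n in binary (n >= 1). */
    static String encode(int n) {
        String bin = Integer.toBinaryString(n);
        return "0".repeat(bin.length() - 1) + bin;
    }

    public static void main(String[] args) {
        int[] positions = {3, 7, 8, 15};
        StringBuilder bits = new StringBuilder();
        int prev = 0;
        for (int p : positions) {
            bits.append(encode(p - prev)); // encode the gap, not the absolute position
            prev = p;
        }
        // gaps 3,4,1,7 -> 011 00100 1 00111 (small gaps cost few bits)
        System.out.println(bits);
    }
}
```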
## Finalize RSS support (COMPLETED 2024-11)
Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.
Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)

GrpcMultiNodeChannelPool.java
View File

@@ -7,8 +7,6 @@ import nu.marginalia.service.discovery.property.PartitionTraits;
import nu.marginalia.service.discovery.property.ServiceEndpoint;
import nu.marginalia.service.discovery.property.ServiceKey;
import nu.marginalia.service.discovery.property.ServicePartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.List;
import java.util.concurrent.CompletableFuture;
@@ -24,7 +22,7 @@ import java.util.function.Function;
public class GrpcMultiNodeChannelPool<STUB> {
private final ConcurrentHashMap<Integer, GrpcSingleNodeChannelPool<STUB>> pools =
new ConcurrentHashMap<>();
private static final Logger logger = LoggerFactory.getLogger(GrpcMultiNodeChannelPool.class);
private final ServiceRegistryIf serviceRegistryIf;
private final ServiceKey<? extends PartitionTraits.Multicast> serviceKey;
private final Function<ServiceEndpoint.InstanceAddress, ManagedChannel> channelConstructor;

GrpcSingleNodeChannelPool.java
View File

@@ -10,6 +10,8 @@ import nu.marginalia.service.discovery.property.ServiceKey;
import org.jetbrains.annotations.NotNull;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;
import java.time.Duration;
import java.util.*;
@@ -26,13 +28,13 @@ import java.util.function.Function;
public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
private final Map<InstanceAddress, ConnectionHolder> channels = new ConcurrentHashMap<>();
private final Marker grpcMarker = MarkerFactory.getMarker("GRPC");
private static final Logger logger = LoggerFactory.getLogger(GrpcSingleNodeChannelPool.class);
private final ServiceRegistryIf serviceRegistryIf;
private final Function<InstanceAddress, ManagedChannel> channelConstructor;
private final Function<ManagedChannel, STUB> stubConstructor;
public GrpcSingleNodeChannelPool(ServiceRegistryIf serviceRegistryIf,
ServiceKey<? extends PartitionTraits.Unicast> serviceKey,
Function<InstanceAddress, ManagedChannel> channelConstructor,
@@ -48,8 +50,6 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
serviceRegistryIf.registerMonitor(this);
onChange();
awaitChannel(Duration.ofSeconds(5));
}
@@ -62,10 +62,10 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
for (var route : Sets.symmetricDifference(oldRoutes, newRoutes)) {
ConnectionHolder oldChannel;
if (newRoutes.contains(route)) {
logger.info("Adding route {}", route);
logger.info(grpcMarker, "Adding route {} => {}", serviceKey, route);
oldChannel = channels.put(route, new ConnectionHolder(route));
} else {
logger.info("Expelling route {}", route);
logger.info(grpcMarker, "Expelling route {} => {}", serviceKey, route);
oldChannel = channels.remove(route);
}
if (oldChannel != null) {
@@ -103,7 +103,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
}
try {
logger.info("Creating channel for {}:{}", serviceKey, address);
logger.info(grpcMarker, "Creating channel for {} => {}", serviceKey, address);
value = channelConstructor.apply(address);
if (channel.compareAndSet(null, value)) {
return value;
@@ -114,7 +114,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
}
}
catch (Exception e) {
logger.error("Failed to get channel for " + address, e);
logger.error(grpcMarker, "Failed to get channel for " + address, e);
return null;
}
}
@@ -206,7 +206,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
}
for (var e : exceptions) {
logger.error("Failed to call service {}", serviceKey, e);
logger.error(grpcMarker, "Failed to call service {}", serviceKey, e);
}
throw new ServiceNotAvailableException(serviceKey);

ServiceNotAvailableException.java
View File

@@ -4,6 +4,11 @@ import nu.marginalia.service.discovery.property.ServiceKey;
public class ServiceNotAvailableException extends RuntimeException {
public ServiceNotAvailableException(ServiceKey<?> key) {
super("Service " + key + " not available");
super(key.toString());
}
@Override
public StackTraceElement[] getStackTrace() { // Suppress stack trace
return new StackTraceElement[0];
}
}
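A note on the design: the getStackTrace() override above hides the trace from log formatters that ask for it, but the trace is still captured when the exception is constructed. A sketch of the complementary JDK idiom, should the capture cost ever matter on this path:

```java
// Sketch only: an exception that never records a stack trace;
// complements the getStackTrace() override shown above.
class QuietException extends RuntimeException {
    QuietException(String message) {
        super(message);
    }

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this; // skip the (relatively expensive) trace capture
    }
}
```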

ServiceEndpoint.java
View File

@@ -48,5 +48,10 @@ public record ServiceEndpoint(String host, int port) {
public int port() {
return endpoint.port();
}
@Override
public String toString() {
return endpoint().host() + ":" + endpoint.port() + " [" + instance + "]";
}
}
}

ServiceKey.java
View File

@@ -48,6 +48,19 @@ public sealed interface ServiceKey<P extends ServicePartition> {
{
throw new UnsupportedOperationException();
}
@Override
public String toString() {
final String shortName;
int periodIndex = name.lastIndexOf('.');
if (periodIndex >= 0) shortName = name.substring(periodIndex+1);
else shortName = name;
return "rest:" + shortName;
}
}
record Grpc<P extends ServicePartition>(String name, P partition) implements ServiceKey<P> {
public String baseName() {
@@ -64,6 +77,18 @@ public sealed interface ServiceKey<P extends ServicePartition> {
{
return new Grpc<>(name, partition);
}
@Override
public String toString() {
final String shortName;
int periodIndex = name.lastIndexOf('.');
if (periodIndex >= 0) shortName = name.substring(periodIndex+1);
else shortName = name;
return "grpc:" + shortName + "[" + partition.identifier() + "]";
}
}
}

View File

@@ -101,6 +101,7 @@ message RpcSimilarDomain {
bool active = 6;
bool screenshot = 7;
LINK_TYPE linkType = 8;
bool feed = 9;
enum LINK_TYPE {
BACKWARD = 0;

SimilarDomainsService.java
View File

@@ -9,6 +9,7 @@ import gnu.trove.map.hash.TIntIntHashMap;
import gnu.trove.set.TIntSet;
import gnu.trove.set.hash.TIntHashSet;
import it.unimi.dsi.fastutil.ints.Int2DoubleArrayMap;
import nu.marginalia.WmsaHome;
import nu.marginalia.api.domains.RpcSimilarDomain;
import nu.marginalia.api.domains.model.SimilarDomain;
import nu.marginalia.api.linkgraph.AggregateLinkGraphClient;
@@ -17,10 +18,14 @@ import org.roaringbitmap.RoaringBitmap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.nio.file.Path;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
@@ -32,12 +37,13 @@ public class SimilarDomainsService {
private final HikariDataSource dataSource;
private final AggregateLinkGraphClient linkGraphClient;
private volatile TIntIntHashMap domainIdToIdx = new TIntIntHashMap(100_000);
private final TIntIntHashMap domainIdToIdx = new TIntIntHashMap(100_000);
private volatile int[] domainIdxToId;
public volatile Int2DoubleArrayMap[] relatedDomains;
public volatile TIntList[] domainNeighbors = null;
public volatile RoaringBitmap screenshotDomains = null;
public volatile RoaringBitmap feedDomains = null;
public volatile RoaringBitmap activeDomains = null;
public volatile RoaringBitmap indexedDomains = null;
public volatile TIntDoubleHashMap domainRanks = null;
@@ -82,6 +88,7 @@ public class SimilarDomainsService {
domainNames = new String[domainIdToIdx.size()];
domainNeighbors = new TIntList[domainIdToIdx.size()];
screenshotDomains = new RoaringBitmap();
feedDomains = new RoaringBitmap();
activeDomains = new RoaringBitmap();
indexedDomains = new RoaringBitmap();
relatedDomains = new Int2DoubleArrayMap[domainIdToIdx.size()];
@@ -145,10 +152,12 @@ public class SimilarDomainsService {
activeDomains.add(idx);
}
updateScreenshotInfo();
logger.info("Loaded {} domains", domainRanks.size());
isReady = true;
// We can defer these as they only populate a roaringbitmap, and will degrade gracefully when not complete
updateScreenshotInfo();
updateFeedInfo();
}
}
catch (SQLException throwables) {
@@ -156,6 +165,42 @@ public class SimilarDomainsService {
}
}
private void updateFeedInfo() {
Set<String> feedsDomainNames = new HashSet<>(500_000);
Path readerDbPath = WmsaHome.getDataPath().resolve("rss-feeds.db").toAbsolutePath();
String dbUrl = "jdbc:sqlite:" + readerDbPath;
logger.info("Opening feed db at " + dbUrl);
try (var conn = DriverManager.getConnection(dbUrl);
var stmt = conn.createStatement()) {
var rs = stmt.executeQuery("""
select
json_extract(feed, '$.domain') as domain
from feed
where json_array_length(feed, '$.items') > 0
""");
while (rs.next()) {
feedsDomainNames.add(rs.getString(1));
}
}
catch (SQLException ex) {
logger.error("Failed to read RSS feed items", ex);
}
for (int idx = 0; idx < domainNames.length; idx++) {
String name = domainNames[idx];
if (name == null) {
continue;
}
if (feedsDomainNames.contains(name)) {
feedDomains.add(idx);
}
}
}
private void updateScreenshotInfo() {
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.createStatement()) {
@@ -254,6 +299,7 @@ public class SimilarDomainsService {
.setIndexed(indexedDomains.contains(idx))
.setActive(activeDomains.contains(idx))
.setScreenshot(screenshotDomains.contains(idx))
.setFeed(feedDomains.contains(idx))
.setLinkType(RpcSimilarDomain.LINK_TYPE.valueOf(linkType.name()))
.build());
@@ -369,6 +415,7 @@ public class SimilarDomainsService {
.setIndexed(indexedDomains.contains(idx))
.setActive(activeDomains.contains(idx))
.setScreenshot(screenshotDomains.contains(idx))
.setFeed(feedDomains.contains(idx))
.setLinkType(RpcSimilarDomain.LINK_TYPE.valueOf(linkType.name()))
.build());

LiveCaptureClient.java
View File

@@ -5,6 +5,7 @@ import com.google.inject.Singleton;
import nu.marginalia.api.livecapture.LiveCaptureApiGrpc.LiveCaptureApiBlockingStub;
import nu.marginalia.service.client.GrpcChannelPoolFactory;
import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
import nu.marginalia.service.client.ServiceNotAvailableException;
import nu.marginalia.service.discovery.property.ServiceKey;
import nu.marginalia.service.discovery.property.ServicePartition;
import org.slf4j.Logger;
@@ -29,6 +30,9 @@ public class LiveCaptureClient {
channelPool.call(LiveCaptureApiBlockingStub::requestScreengrab)
.run(RpcDomainId.newBuilder().setDomainId(domainId).build());
}
catch (ServiceNotAvailableException e) {
logger.info("requestScreengrab() failed since the service is not available");
}
catch (Exception e) {
logger.error("API Exception", e);
}

FeedFetcherService.java
View File

@@ -402,6 +402,7 @@ public class FeedFetcherService {
"&ndash;", "-",
"&rsquo;", "'",
"&lsquo;", "'",
"&quot;", "\"",
"&nbsp;", ""
);

TestXmlSanitization.java
View File

@@ -10,7 +10,6 @@ public class TestXmlSanitization {
Assertions.assertEquals("&amp;", FeedFetcherService.sanitizeEntities("&amp;"));
Assertions.assertEquals("&lt;", FeedFetcherService.sanitizeEntities("&lt;"));
Assertions.assertEquals("&gt;", FeedFetcherService.sanitizeEntities("&gt;"));
Assertions.assertEquals("&quot;", FeedFetcherService.sanitizeEntities("&quot;"));
Assertions.assertEquals("&apos;", FeedFetcherService.sanitizeEntities("&apos;"));
}
@@ -23,4 +22,9 @@ public class TestXmlSanitization {
public void testTranslatedHtmlEntity() {
Assertions.assertEquals("Foo -- Bar", FeedFetcherService.sanitizeEntities("Foo &mdash; Bar"));
}
@Test
public void testTranslatedHtmlEntityQuot() {
Assertions.assertEquals("\"Bob\"", FeedFetcherService.sanitizeEntities("&quot;Bob&quot;"));
}
}

QueryClient.java
View File

@@ -9,10 +9,9 @@ import nu.marginalia.service.client.GrpcChannelPoolFactory;
import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
import nu.marginalia.service.discovery.property.ServiceKey;
import nu.marginalia.service.discovery.property.ServicePartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.annotation.CheckReturnValue;
import java.time.Duration;
@Singleton
public class QueryClient {
@@ -24,13 +23,14 @@ public class QueryClient {
private final GrpcSingleNodeChannelPool<QueryApiGrpc.QueryApiBlockingStub> queryApiPool;
private final Logger logger = LoggerFactory.getLogger(getClass());
@Inject
public QueryClient(GrpcChannelPoolFactory channelPoolFactory) {
public QueryClient(GrpcChannelPoolFactory channelPoolFactory) throws InterruptedException {
this.queryApiPool = channelPoolFactory.createSingle(
ServiceKey.forGrpcApi(QueryApiGrpc.class, ServicePartition.any()),
QueryApiGrpc::newBlockingStub);
// Hold up initialization until we have a downstream connection
this.queryApiPool.awaitChannel(Duration.ofSeconds(5));
}
@CheckReturnValue

QueryExpansion.java
View File

@@ -25,6 +25,7 @@ public class QueryExpansion {
this::joinDashes,
this::splitWordNum,
this::joinTerms,
this::categoryKeywords,
this::ngramAll
);
@@ -98,6 +99,24 @@ public class QueryExpansion {
}
}
// Category keyword substitution, e.g. guitar wiki -> guitar generator:wiki
public void categoryKeywords(QWordGraph graph) {
for (var qw : graph) {
// Ensure we only perform the substitution on the last word in the query
if (!graph.getNextOriginal(qw).getFirst().isEnd()) {
continue;
}
switch (qw.word()) {
case "recipe", "recipes" -> graph.addVariant(qw, "category:food");
case "forum" -> graph.addVariant(qw, "generator:forum");
case "wiki" -> graph.addVariant(qw, "generator:wiki");
}
}
}
// Turn 'lawn chair' into 'lawnchair'
public void joinTerms(QWordGraph graph) {
QWord prev = null;

QueryParser.java
View File

@@ -155,16 +155,25 @@ public class QueryParser {
// Remove trailing punctuation
int lastChar = str.charAt(str.length() - 1);
if (":.,!?$'".indexOf(lastChar) >= 0)
entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 1), lt.displayStr()));
if (":.,!?$'".indexOf(lastChar) >= 0) {
str = str.substring(0, str.length() - 1);
entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
}
// Remove term elements that aren't indexed by the search engine
if (str.endsWith("'s"))
entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 2), lt.displayStr()));
if (str.endsWith("()"))
entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 2), lt.displayStr()));
if (str.startsWith("$"))
entity.replace(new QueryToken.LiteralTerm(str.substring(1), lt.displayStr()));
if (str.endsWith("'s")) {
str = str.substring(0, str.length() - 2);
entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
}
if (str.endsWith("()")) {
str = str.substring(0, str.length() - 2);
entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
}
while (str.startsWith("$") || str.startsWith("_")) {
str = str.substring(1);
entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
}
if (entity.isBlank()) {
entity.remove();

EnglishDictionary.java
View File

@@ -1,165 +0,0 @@
package nu.marginalia.util.language;
import com.google.inject.Inject;
import nu.marginalia.term_frequency_dict.TermFrequencyDict;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class EnglishDictionary {
private final Set<String> englishWords = new HashSet<>();
private final TermFrequencyDict tfDict;
private final Logger logger = LoggerFactory.getLogger(getClass());
@Inject
public EnglishDictionary(TermFrequencyDict tfDict) {
this.tfDict = tfDict;
try (var resource = Objects.requireNonNull(ClassLoader.getSystemResourceAsStream("dictionary/en-words"),
"Could not load word frequency table");
var br = new BufferedReader(new InputStreamReader(resource))
) {
for (;;) {
String s = br.readLine();
if (s == null) {
break;
}
englishWords.add(s.toLowerCase());
}
}
catch (Exception ex) {
throw new RuntimeException(ex);
}
}
public boolean isWord(String word) {
return englishWords.contains(word);
}
private static final Pattern ingPattern = Pattern.compile(".*(\\w)\\1ing$");
public Collection<String> getWordVariants(String s) {
var variants = findWordVariants(s);
var ret = variants.stream()
.filter(var -> tfDict.getTermFreq(var) > 100)
.collect(Collectors.toList());
if (s.equals("recipe") || s.equals("recipes")) {
ret.add("category:food");
}
return ret;
}
public Collection<String> findWordVariants(String s) {
int sl = s.length();
if (sl < 2) {
return Collections.emptyList();
}
if (s.endsWith("s")) {
String a = s.substring(0, sl-1);
String b = s + "es";
if (isWord(a) && isWord(b)) {
return List.of(a, b);
}
else if (isWord(a)) {
return List.of(a);
}
else if (isWord(b)) {
return List.of(b);
}
}
if (s.endsWith("sm")) {
String a = s.substring(0, sl-1)+"t";
String b = s.substring(0, sl-1)+"ts";
if (isWord(a) && isWord(b)) {
return List.of(a, b);
}
else if (isWord(a)) {
return List.of(a);
}
else if (isWord(b)) {
return List.of(b);
}
}
if (s.endsWith("st")) {
String a = s.substring(0, sl-1)+"m";
String b = s + "s";
if (isWord(a) && isWord(b)) {
return List.of(a, b);
}
else if (isWord(a)) {
return List.of(a);
}
else if (isWord(b)) {
return List.of(b);
}
}
else if (ingPattern.matcher(s).matches() && sl > 4) { // humming, clapping
var a = s.substring(0, sl-4);
var b = s.substring(0, sl-3) + "ed";
if (isWord(a) && isWord(b)) {
return List.of(a, b);
}
else if (isWord(a)) {
return List.of(a);
}
else if (isWord(b)) {
return List.of(b);
}
}
else {
String a = s + "s";
String b = ingForm(s);
String c = s + "ed";
if (isWord(a) && isWord(b) && isWord(c)) {
return List.of(a, b, c);
}
else if (isWord(a) && isWord(b)) {
return List.of(a, b);
}
else if (isWord(b) && isWord(c)) {
return List.of(b, c);
}
else if (isWord(a) && isWord(c)) {
return List.of(a, c);
}
else if (isWord(a)) {
return List.of(a);
}
else if (isWord(b)) {
return List.of(b);
}
else if (isWord(c)) {
return List.of(c);
}
}
return Collections.emptyList();
}
public String ingForm(String s) {
if (s.endsWith("t") && !s.endsWith("tt")) {
return s + "ting";
}
if (s.endsWith("n") && !s.endsWith("nn")) {
return s + "ning";
}
if (s.endsWith("m") && !s.endsWith("mm")) {
return s + "ming";
}
if (s.endsWith("r") && !s.endsWith("rr")) {
return s + "ring";
}
return s + "ing";
}
}

QueryParserTest.java
View File

@@ -0,0 +1,32 @@
package nu.marginalia.functions.searchquery.query_parser;
import nu.marginalia.functions.searchquery.query_parser.token.QueryToken;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import java.util.List;
class QueryParserTest {
@Test
// https://github.com/MarginaliaSearch/MarginaliaSearch/issues/140
void parse__builtin_ffs() {
QueryParser parser = new QueryParser();
var tokens = parser.parse("__builtin_ffs");
Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("builtin_ffs", "__builtin_ffs")), tokens);
}
@Test
void trailingParens() {
QueryParser parser = new QueryParser();
var tokens = parser.parse("strcpy()");
Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("strcpy", "strcpy()")), tokens);
}
@Test
void trailingQuote() {
QueryParser parser = new QueryParser();
var tokens = parser.parse("bob's");
Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("bob", "bob's")), tokens);
}
}

QueryFactoryTest.java
View File

@@ -12,6 +12,7 @@ import nu.marginalia.index.query.limit.SpecificationLimit;
import nu.marginalia.index.query.limit.SpecificationLimitType;
import nu.marginalia.segmentation.NgramLexicon;
import nu.marginalia.term_frequency_dict.TermFrequencyDict;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
@@ -207,6 +208,17 @@ public class QueryFactoryTest {
System.out.println(subquery);
}
@Test
public void testExpansion9() {
var subquery = parseAndGetSpecs("pie recipe");
Assertions.assertTrue(subquery.query.compiledQuery.contains(" category:food "));
subquery = parseAndGetSpecs("recipe pie");
Assertions.assertFalse(subquery.query.compiledQuery.contains(" category:food "));
}
@Test
public void testParsing() {
var subquery = parseAndGetSpecs("strlen()");

SentenceSegmentSplitter.java
View File

@@ -27,7 +27,7 @@ public class SentenceSegmentSplitter {
else {
// If we flatten unicode, we do this...
// FIXME: This can almost definitely be cleaned up and simplified.
wordBreakPattern = Pattern.compile("([^/_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
wordBreakPattern = Pattern.compile("([^/<>$:_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
}
}
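A quick standalone sketch of the effect of the new character class: with <, >, $ and : no longer treated as word breaks, C++ and PHP identifiers survive as single tokens (cf. the tests in the next file):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Standalone illustration of the updated wordBreakPattern; not project code.
class WordBreakDemo {
    public static void main(String[] args) {
        Pattern updated = Pattern.compile(
            "([^/<>$:_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
        // Neither string contains a break character any more, so each stays whole:
        System.out.println(Arrays.toString(updated.split("std::vector"))); // [std::vector]
        System.out.println(Arrays.toString(updated.split("$_GET")));       // [$_GET]
    }
}
```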

SentenceExtractorTest.java
View File

@@ -28,6 +28,20 @@ class SentenceExtractorTest {
System.out.println(dld);
}
@Test
void testCplusplus() {
var dld = sentenceExtractor.extractSentence("std::vector", EnumSet.noneOf(HtmlTag.class));
assertEquals(1, dld.length());
assertEquals("std::vector", dld.wordsLowerCase[0]);
}
@Test
void testPHP() {
var dld = sentenceExtractor.extractSentence("$_GET", EnumSet.noneOf(HtmlTag.class));
assertEquals(1, dld.length());
assertEquals("$_get", dld.wordsLowerCase[0]);
}
@Test
void testPolishArtist() {
var dld = sentenceExtractor.extractSentence("Uklański", EnumSet.noneOf(HtmlTag.class));

deployment.py
View File

@@ -222,6 +222,31 @@ def run_gradle_build(targets: str) -> None:
if return_code != 0:
raise BuildError(service, return_code)
def find_free_tag() -> str:
cmd = ['git', 'tag']
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
raise RuntimeError(f"Git command failed: {result.stderr}")
existing_tags = set(result.stdout.splitlines())
for i in range(1, 100000):
tag = f'deploy-{i:04d}'
if tag not in existing_tags:
return tag
raise RuntimeError(f"Failed to find a free deployment tag")
def add_tags(tags: str) -> None:
new_tag = find_free_tag()
cmd = ['git', 'tag', new_tag, '-am', tags]
result = subprocess.run(cmd, capture_output=True, text=True)  # capture stderr for the error message below
if result.returncode != 0:
raise RuntimeError(f"Git command failed: {result.stderr}")
# Example usage:
if __name__ == '__main__':
# Define service configuration
@@ -295,7 +320,9 @@ if __name__ == '__main__':
parser = argparse.ArgumentParser(
prog='deployment.py',
description='Continuous Deployment helper')
parser.add_argument('-v', '--verify', help='Verify the tags are valid, if present', action='store_true')
parser.add_argument('-a', '--add', help='Add the tags provided as a new deployment tag, usually combined with -t', action='store_true')
parser.add_argument('-t', '--tag', help='Use the specified tag value instead of the head git tag starting with deploy-')
args = parser.parse_args()
@@ -316,7 +343,10 @@ if __name__ == '__main__':
print("Services to build:", plan.services_to_build)
print("Instances to deploy:", [container.name for container in plan.instances_to_deploy])
if not args.verify:
if args.verify:
if args.add:
add_tags(args.tag)
else:
print("\nExecution Plan:")
build_and_deploy(plan, SERVICE_CONFIG)