mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

Compare commits

...

21 Commits

Author SHA1 Message Date
Viktor Lofgren
b62f043910 (search) Adjust token formation rules to be more lenient to C++ and PHP code.
This addresses Issue #142
2025-01-05 20:50:27 +01:00
Viktor
9b2ceaf37c Merge pull request #141 from MarginaliaSearch/vlofgren-patch-1
Update FUNDING.yml
2025-01-05 18:40:20 +01:00
Viktor
8019c2ce18 Update FUNDING.yml 2025-01-05 18:40:06 +01:00
Viktor Lofgren
4da3563d8a (service) Clean up exceptions when requestScreengrab is not available 2025-01-04 14:45:51 +01:00
Viktor Lofgren
48d0a3089a (service) Improve logging around grpc
This change adds a marker for the gRPC-specific logging, as well as improves the clarity and meaningfulness of the log messages.
2025-01-02 20:40:53 +01:00
Viktor Lofgren
594df64b20 (domain-info) Use appropriate sqlite database when fetching feed status 2025-01-02 20:20:36 +01:00
Viktor Lofgren
78eb1417a7 (service) Only block on SingleNodeChannelPool creation in QueryClient
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.

This should make most services start faster as a result.
2025-01-02 18:42:01 +01:00
Viktor Lofgren
67edc8f90d (domain-info) Only flag domains with rss feed items as having a feed 2025-01-02 17:41:52 +01:00
Viktor Lofgren
5f576b7d0c (query-parser) Strip leading underlines
This addresses issue #140, where __builtin_ffs gives no results.
2025-01-02 14:39:03 +01:00
Viktor Lofgren
0b65164f60 (chore) Fix broken test 2025-01-01 18:06:29 +01:00
Viktor Lofgren
9be477de33 (domain-info) Add a feed flag to domain info
This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.
2025-01-01 18:02:33 +01:00
Viktor Lofgren
710af4999a (feed-fetcher) Add &quot; entity mapping in feed fetcher 2025-01-01 15:45:17 +01:00
Viktor Lofgren
baeb4a46cd (search) Reintroduce query rewriting for recipes, add rules for wikis and forums 2024-12-31 16:05:00 +01:00
Viktor Lofgren
5e2a8e9f27 (deploy) Add capability of adding tags to deploy script 2024-12-31 16:04:13 +01:00
Viktor
cc1a5bdf90 Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-12-31 14:41:02 +01:00
Viktor
7f7b1ffaba Update ROADMAP.md 2024-12-31 14:40:34 +01:00
Viktor Lofgren
0ea8092350 (search) Add link promoting the redesign beta 2024-12-30 15:47:13 +01:00
Viktor Lofgren
483d29497e (deploy) Add hashbang to deploy script 2024-12-30 15:47:13 +01:00
Viktor Lofgren
bae44497fe (crawler) Add a new system property crawler.maxFetchSize
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
Viktor Lofgren
0d59202aca (crawler) Do not remove W/-prefix on weak e-tags
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
Viktor Lofgren
0ca43f0c9c (live-crawler) Improve live crawler short-circuit logic
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch!  This makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
28 changed files with 377 additions and 287 deletions
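
Note on commit 0d59202aca: its message says the server expects a weak ETag back with its `W/` prefix intact. A minimal sketch of that idea with the JDK's `java.net.http` client is below. The class and method names are hypothetical and this is not the crawler's actual code; it only illustrates echoing the stored validator verbatim in `If-None-Match`.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper for illustration; not taken from the Marginalia crawler.
final class ConditionalGetSketch {
    /** Builds a GET that revalidates against a previously stored ETag, if there is one. */
    static HttpRequest conditionalGet(URI uri, String storedEtag) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(uri).GET();
        if (storedEtag != null && !storedEtag.isBlank()) {
            // Send the validator back exactly as received, including any W/ prefix.
            builder.header("If-None-Match", storedEtag);
        }
        return builder.build();
    }
}
```

Building the request does not touch the network, so the header can be inspected with `request.headers().firstValue("If-None-Match")` before anything is sent.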

1
.github/FUNDING.yml vendored
View File

@@ -1,5 +1,6 @@
 # These are supported funding model platforms
+polar: marginalia-search
 github: MarginaliaSearch
 patreon: marginalia_nu
 open_collective: # Replace with a single Open Collective username

View File

@@ -8,20 +8,10 @@ be implemented as well.
 Major goals:
 * Reach 1 billion pages indexed
-* Improve technical ability of indexing and search. Although this area has improved a bit, the
-search engine is still not very good at dealing with longer queries.
-## Proper Position Index (COMPLETED 2024-09)
-The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
-of being very fast to evaluate and works well for what it is, but is inaccurate and has the
-drawback of making support for quoted search terms inaccurate and largely reliant on indexing
-word n-grams known beforehand. This limits the ability to interpret longer queries.
-The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
-list, as is the civilized way of doing this.
-Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
+search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done )
 ## Hybridize crawler w/ Common Crawl data
@@ -37,8 +27,7 @@ Retaining the ability to independently crawl the web is still strongly desirable
 ## Safe Search
-The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable
-to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
+The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
 combined with naive bayesian filter would go a long way, or something more sophisticated...?
 ## Web Design Overhaul
@@ -55,15 +44,6 @@ associated with each language added, at least a models file or two, as well as s
 It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
-## Finalize RSS support (COMPLETED 2024-11)
-Marginalia has experimental RSS preview support for a few domains. This works well and
-it should be extended to all domains. It would also be interesting to offer search of the
-RSS data itself, or use the RSS set to feed a special live index that updates faster than the
-main dataset.
-Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
 ## Support for binary formats like PDF
 The crawler needs to be modified to retain them, and the conversion logic needs to parse them.
@@ -80,5 +60,27 @@ This looks like a good idea that wouldn't just help clean up the search filters
 website, but might be cheap enough we might go as far as to offer a number of ad-hoc custom search
 filter for any API consumer.
-I've talked to the stract dev and he does not think it's a good idea to mimic their optics language,
-which is quite ad-hoc, but instead to work together to find some new common description language for this.
+I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
+# Completed
+## Proper Position Index (COMPLETED 2024-09)
+The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
+of being very fast to evaluate and works well for what it is, but is inaccurate and has the
+drawback of making support for quoted search terms inaccurate and largely reliant on indexing
+word n-grams known beforehand. This limits the ability to interpret longer queries.
+The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
+list, as is the civilized way of doing this.
+Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+## Finalize RSS support (COMPLETED 2024-11)
+Marginalia has experimental RSS preview support for a few domains. This works well and
+it should be extended to all domains. It would also be interesting to offer search of the
+RSS data itself, or use the RSS set to feed a special live index that updates faster than the
+main dataset.
+Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)

View File

@@ -7,8 +7,6 @@ import nu.marginalia.service.discovery.property.PartitionTraits;
 import nu.marginalia.service.discovery.property.ServiceEndpoint;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
 import java.util.List;
 import java.util.concurrent.CompletableFuture;
@@ -24,7 +22,7 @@ import java.util.function.Function;
 public class GrpcMultiNodeChannelPool<STUB> {
     private final ConcurrentHashMap<Integer, GrpcSingleNodeChannelPool<STUB>> pools =
             new ConcurrentHashMap<>();
-    private static final Logger logger = LoggerFactory.getLogger(GrpcMultiNodeChannelPool.class);
     private final ServiceRegistryIf serviceRegistryIf;
     private final ServiceKey<? extends PartitionTraits.Multicast> serviceKey;
     private final Function<ServiceEndpoint.InstanceAddress, ManagedChannel> channelConstructor;

View File

@@ -10,6 +10,8 @@ import nu.marginalia.service.discovery.property.ServiceKey;
 import org.jetbrains.annotations.NotNull;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
+import org.slf4j.Marker;
+import org.slf4j.MarkerFactory;
 import java.time.Duration;
 import java.util.*;
@@ -26,13 +28,13 @@ import java.util.function.Function;
 public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
     private final Map<InstanceAddress, ConnectionHolder> channels = new ConcurrentHashMap<>();
+    private final Marker grpcMarker = MarkerFactory.getMarker("GRPC");
     private static final Logger logger = LoggerFactory.getLogger(GrpcSingleNodeChannelPool.class);
     private final ServiceRegistryIf serviceRegistryIf;
     private final Function<InstanceAddress, ManagedChannel> channelConstructor;
     private final Function<ManagedChannel, STUB> stubConstructor;
     public GrpcSingleNodeChannelPool(ServiceRegistryIf serviceRegistryIf,
                                      ServiceKey<? extends PartitionTraits.Unicast> serviceKey,
                                      Function<InstanceAddress, ManagedChannel> channelConstructor,
@@ -48,8 +50,6 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         serviceRegistryIf.registerMonitor(this);
         onChange();
-        awaitChannel(Duration.ofSeconds(5));
     }
@@ -62,10 +62,10 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         for (var route : Sets.symmetricDifference(oldRoutes, newRoutes)) {
             ConnectionHolder oldChannel;
             if (newRoutes.contains(route)) {
-                logger.info("Adding route {}", route);
+                logger.info(grpcMarker, "Adding route {} => {}", serviceKey, route);
                 oldChannel = channels.put(route, new ConnectionHolder(route));
             } else {
-                logger.info("Expelling route {}", route);
+                logger.info(grpcMarker, "Expelling route {} => {}", serviceKey, route);
                 oldChannel = channels.remove(route);
             }
             if (oldChannel != null) {
@@ -103,7 +103,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
             }
             try {
-                logger.info("Creating channel for {}:{}", serviceKey, address);
+                logger.info(grpcMarker, "Creating channel for {} => {}", serviceKey, address);
                 value = channelConstructor.apply(address);
                 if (channel.compareAndSet(null, value)) {
                     return value;
@@ -114,7 +114,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
                 }
             }
             catch (Exception e) {
-                logger.error("Failed to get channel for " + address, e);
+                logger.error(grpcMarker, "Failed to get channel for " + address, e);
                 return null;
             }
         }
@@ -206,7 +206,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
             }
             for (var e : exceptions) {
-                logger.error("Failed to call service {}", serviceKey, e);
+                logger.error(grpcMarker, "Failed to call service {}", serviceKey, e);
             }
             throw new ServiceNotAvailableException(serviceKey);

View File

@@ -4,6 +4,11 @@ import nu.marginalia.service.discovery.property.ServiceKey;
 public class ServiceNotAvailableException extends RuntimeException {
     public ServiceNotAvailableException(ServiceKey<?> key) {
-        super("Service " + key + " not available");
+        super(key.toString());
+    }
+    @Override
+    public StackTraceElement[] getStackTrace() { // Suppress stack trace
+        return new StackTraceElement[0];
     }
 }
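
Review note: the hunk above suppresses the trace by overriding `getStackTrace()`. For comparison, the JDK also offers a protected `RuntimeException` constructor with a `writableStackTrace` flag. The sketch below shows that alternative under a hypothetical class name; it is not what this commit does, and whether it suits the project is a judgment call for the author.

```java
// Hypothetical class name; illustrates the JDK's writableStackTrace flag,
// not the approach taken in the diff above.
final class QuietServiceException extends RuntimeException {
    QuietServiceException(String serviceName) {
        // Arguments: message, cause, enableSuppression, writableStackTrace.
        // With writableStackTrace=false the trace is never filled in.
        super(serviceName, null, false, false);
    }
}
```

With the flag set to `false`, `getStackTrace()` returns a zero-length array without any override, and the cost of capturing the trace is skipped when the exception is constructed.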

View File

@@ -48,5 +48,10 @@ public record ServiceEndpoint(String host, int port) {
         public int port() {
             return endpoint.port();
         }
+        @Override
+        public String toString() {
+            return endpoint().host() + ":" + endpoint.port() + " [" + instance + "]";
+        }
     }
 }

View File

@@ -48,6 +48,19 @@ public sealed interface ServiceKey<P extends ServicePartition> {
         {
             throw new UnsupportedOperationException();
         }
+        @Override
+        public String toString() {
+            final String shortName;
+            int periodIndex = name.lastIndexOf('.');
+            if (periodIndex >= 0) shortName = name.substring(periodIndex+1);
+            else shortName = name;
+            return "rest:" + shortName;
+        }
     }
     record Grpc<P extends ServicePartition>(String name, P partition) implements ServiceKey<P> {
         public String baseName() {
@@ -64,6 +77,18 @@ public sealed interface ServiceKey<P extends ServicePartition> {
         {
             return new Grpc<>(name, partition);
         }
+        @Override
+        public String toString() {
+            final String shortName;
+            int periodIndex = name.lastIndexOf('.');
+            if (periodIndex >= 0) shortName = name.substring(periodIndex+1);
+            else shortName = name;
+            return "grpc:" + shortName + "[" + partition.identifier() + "]";
+        }
     }
 }
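
Review note: both `toString()` overrides above repeat the same shortening step, which keeps only the part of the name after the last period. A standalone sketch of that step follows. The class name, the helper names, and the sample partition identifier are hypothetical; only the string logic mirrors the diff.

```java
// Hypothetical helper for illustration; mirrors the shortening logic in the diff.
final class ShortNameSketch {
    /** Returns the text after the last '.', or the whole name if it has no period. */
    static String shortName(String name) {
        int periodIndex = name.lastIndexOf('.');
        return periodIndex >= 0 ? name.substring(periodIndex + 1) : name;
    }

    /** Label in the style of the Rest key's toString(). */
    static String restLabel(String name) {
        return "rest:" + shortName(name);
    }

    /** Label in the style of the Grpc key's toString(); partitionId is an assumed string. */
    static String grpcLabel(String name, String partitionId) {
        return "grpc:" + shortName(name) + "[" + partitionId + "]";
    }
}
```

Extracting the shared step into one method like this would remove the duplication between the two records, though that refactoring is only a suggestion.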

View File

@@ -101,6 +101,7 @@ message RpcSimilarDomain {
   bool active = 6;
   bool screenshot = 7;
   LINK_TYPE linkType = 8;
+  bool feed = 9;
   enum LINK_TYPE {
     BACKWARD = 0;

View File

@@ -9,6 +9,7 @@ import gnu.trove.map.hash.TIntIntHashMap;
 import gnu.trove.set.TIntSet;
 import gnu.trove.set.hash.TIntHashSet;
 import it.unimi.dsi.fastutil.ints.Int2DoubleArrayMap;
+import nu.marginalia.WmsaHome;
 import nu.marginalia.api.domains.RpcSimilarDomain;
 import nu.marginalia.api.domains.model.SimilarDomain;
 import nu.marginalia.api.linkgraph.AggregateLinkGraphClient;
@@ -17,10 +18,14 @@ import org.roaringbitmap.RoaringBitmap;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
+import java.nio.file.Path;
+import java.sql.DriverManager;
 import java.sql.ResultSet;
 import java.sql.SQLException;
 import java.util.ArrayList;
+import java.util.HashSet;
 import java.util.List;
+import java.util.Set;
 import java.util.concurrent.Executors;
 import java.util.concurrent.ScheduledExecutorService;
 import java.util.concurrent.TimeUnit;
@@ -32,12 +37,13 @@ public class SimilarDomainsService {
     private final HikariDataSource dataSource;
     private final AggregateLinkGraphClient linkGraphClient;
-    private volatile TIntIntHashMap domainIdToIdx = new TIntIntHashMap(100_000);
+    private final TIntIntHashMap domainIdToIdx = new TIntIntHashMap(100_000);
     private volatile int[] domainIdxToId;
     public volatile Int2DoubleArrayMap[] relatedDomains;
     public volatile TIntList[] domainNeighbors = null;
     public volatile RoaringBitmap screenshotDomains = null;
+    public volatile RoaringBitmap feedDomains = null;
     public volatile RoaringBitmap activeDomains = null;
     public volatile RoaringBitmap indexedDomains = null;
     public volatile TIntDoubleHashMap domainRanks = null;
@@ -82,6 +88,7 @@ public class SimilarDomainsService {
             domainNames = new String[domainIdToIdx.size()];
             domainNeighbors = new TIntList[domainIdToIdx.size()];
             screenshotDomains = new RoaringBitmap();
+            feedDomains = new RoaringBitmap();
             activeDomains = new RoaringBitmap();
             indexedDomains = new RoaringBitmap();
             relatedDomains = new Int2DoubleArrayMap[domainIdToIdx.size()];
@@ -145,10 +152,12 @@ public class SimilarDomainsService {
                     activeDomains.add(idx);
                 }
-                updateScreenshotInfo();
                 logger.info("Loaded {} domains", domainRanks.size());
                 isReady = true;
+                // We can defer these as they only populate a roaringbitmap, and will degrade gracefully when not complete
+                updateScreenshotInfo();
+                updateFeedInfo();
             }
         }
         catch (SQLException throwables) {
@@ -156,6 +165,42 @@ public class SimilarDomainsService {
         }
     }
+    private void updateFeedInfo() {
+        Set<String> feedsDomainNames = new HashSet<>(500_000);
+        Path readerDbPath = WmsaHome.getDataPath().resolve("rss-feeds.db").toAbsolutePath();
+        String dbUrl = "jdbc:sqlite:" + readerDbPath;
+        logger.info("Opening feed db at " + dbUrl);
+        try (var conn = DriverManager.getConnection(dbUrl);
+             var stmt = conn.createStatement()) {
+            var rs = stmt.executeQuery("""
+                    select
+                        json_extract(feed, '$.domain') as domain
+                    from feed
+                    where json_array_length(feed, '$.items') > 0
+                    """);
+            while (rs.next()) {
+                feedsDomainNames.add(rs.getString(1));
+            }
+        }
+        catch (SQLException ex) {
+            logger.error("Failed to read RSS feed items", ex);
+        }
+        for (int idx = 0; idx < domainNames.length; idx++) {
+            String name = domainNames[idx];
+            if (name == null) {
+                continue;
+            }
+            if (feedsDomainNames.contains(name)) {
+                feedDomains.add(idx);
+            }
+        }
+    }
     private void updateScreenshotInfo() {
         try (var connection = dataSource.getConnection()) {
             try (var stmt = connection.createStatement()) {
@@ -254,6 +299,7 @@ public class SimilarDomainsService {
                     .setIndexed(indexedDomains.contains(idx))
                     .setActive(activeDomains.contains(idx))
                     .setScreenshot(screenshotDomains.contains(idx))
+                    .setFeed(feedDomains.contains(idx))
                     .setLinkType(RpcSimilarDomain.LINK_TYPE.valueOf(linkType.name()))
                     .build());
@@ -369,6 +415,7 @@ public class SimilarDomainsService {
                     .setIndexed(indexedDomains.contains(idx))
                     .setActive(activeDomains.contains(idx))
                     .setScreenshot(screenshotDomains.contains(idx))
+                    .setFeed(feedDomains.contains(idx))
                     .setLinkType(RpcSimilarDomain.LINK_TYPE.valueOf(linkType.name()))
                     .build());

View File

@@ -5,6 +5,7 @@ import com.google.inject.Singleton;
 import nu.marginalia.api.livecapture.LiveCaptureApiGrpc.LiveCaptureApiBlockingStub;
 import nu.marginalia.service.client.GrpcChannelPoolFactory;
 import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
+import nu.marginalia.service.client.ServiceNotAvailableException;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
 import org.slf4j.Logger;
@@ -29,6 +30,9 @@ public class LiveCaptureClient {
             channelPool.call(LiveCaptureApiBlockingStub::requestScreengrab)
                     .run(RpcDomainId.newBuilder().setDomainId(domainId).build());
         }
+        catch (ServiceNotAvailableException e) {
+            logger.info("requestScreengrab() failed since the service is not available");
+        }
         catch (Exception e) {
             logger.error("API Exception", e);
         }

View File

@@ -402,6 +402,7 @@ public class FeedFetcherService {
             "&ndash;", "-",
             "&rsquo;", "'",
             "&lsquo;", "'",
+            "&quot;", "\"",
             "&nbsp;", ""
     );
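
Review note: the hunk above only shows the tail of the entity table, so the way `sanitizeEntities` applies it is not visible here. The sketch below assumes plain string replacement over a small table. The class name and method are hypothetical, and the table holds only the pairs visible in this diff plus the `&mdash;` mapping implied by the existing test.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the real FeedFetcherService.sanitizeEntities may work differently.
final class EntityMappingSketch {
    // HTML entities that are replaced with plain-text equivalents.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("&mdash;", "--");
        REPLACEMENTS.put("&ndash;", "-");
        REPLACEMENTS.put("&rsquo;", "'");
        REPLACEMENTS.put("&lsquo;", "'");
        REPLACEMENTS.put("&quot;", "\"");
        REPLACEMENTS.put("&nbsp;", "");
    }

    /** Replaces every mapped entity; anything not in the table is left untouched. */
    static String translate(String input) {
        String out = input;
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }
}
```

Entities absent from the table, such as `&amp;`, pass through unchanged, which matches the expectations in the test file that follows.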

View File

@@ -10,7 +10,6 @@ public class TestXmlSanitization {
         Assertions.assertEquals("&amp;", FeedFetcherService.sanitizeEntities("&amp;"));
         Assertions.assertEquals("&lt;", FeedFetcherService.sanitizeEntities("&lt;"));
         Assertions.assertEquals("&gt;", FeedFetcherService.sanitizeEntities("&gt;"));
-        Assertions.assertEquals("&quot;", FeedFetcherService.sanitizeEntities("&quot;"));
         Assertions.assertEquals("&apos;", FeedFetcherService.sanitizeEntities("&apos;"));
     }
@@ -23,4 +22,9 @@ public class TestXmlSanitization {
     public void testTranslatedHtmlEntity() {
         Assertions.assertEquals("Foo -- Bar", FeedFetcherService.sanitizeEntities("Foo &mdash; Bar"));
     }
+    @Test
+    public void testTranslatedHtmlEntityQuot() {
+        Assertions.assertEquals("\"Bob\"", FeedFetcherService.sanitizeEntities("&quot;Bob&quot;"));
+    }
 }

View File

@@ -9,10 +9,9 @@ import nu.marginalia.service.client.GrpcChannelPoolFactory;
 import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
 import javax.annotation.CheckReturnValue;
+import java.time.Duration;
 @Singleton
 public class QueryClient {
@@ -24,13 +23,14 @@ public class QueryClient {
     private final GrpcSingleNodeChannelPool<QueryApiGrpc.QueryApiBlockingStub> queryApiPool;
-    private final Logger logger = LoggerFactory.getLogger(getClass());
     @Inject
-    public QueryClient(GrpcChannelPoolFactory channelPoolFactory) {
+    public QueryClient(GrpcChannelPoolFactory channelPoolFactory) throws InterruptedException {
         this.queryApiPool = channelPoolFactory.createSingle(
                 ServiceKey.forGrpcApi(QueryApiGrpc.class, ServicePartition.any()),
                 QueryApiGrpc::newBlockingStub);
+        // Hold up initialization until we have a downstream connection
+        this.queryApiPool.awaitChannel(Duration.ofSeconds(5));
     }
     @CheckReturnValue

View File

@@ -25,6 +25,7 @@ public class QueryExpansion {
             this::joinDashes,
             this::splitWordNum,
             this::joinTerms,
+            this::categoryKeywords,
             this::ngramAll
     );
@@ -98,6 +99,24 @@ public class QueryExpansion {
         }
     }
+    // Category keyword substitution, e.g. guitar wiki -> guitar generator:wiki
+    public void categoryKeywords(QWordGraph graph) {
+        for (var qw : graph) {
+            // Ensure we only perform the substitution on the last word in the query
+            if (!graph.getNextOriginal(qw).getFirst().isEnd()) {
+                continue;
+            }
+            switch (qw.word()) {
+                case "recipe", "recipes" -> graph.addVariant(qw, "category:food");
+                case "forum" -> graph.addVariant(qw, "generator:forum");
+                case "wiki" -> graph.addVariant(qw, "generator:wiki");
+            }
+        }
+    }
     // Turn 'lawn chair' into 'lawnchair'
     public void joinTerms(QWordGraph graph) {
         QWord prev = null;
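
Review note: `categoryKeywords` above only adds a variant when the matching word is the last one in the query. The sketch below restates that rule over a plain list of words, without `QWordGraph`. The class and method names are hypothetical, and a list of strings is a simplification of the real graph structure.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the last-word rule; the real code operates on QWordGraph.
final class CategoryKeywordSketch {
    /** Maps a single word to its category keyword, if it has one. */
    static Optional<String> variantFor(String word) {
        return switch (word) {
            case "recipe", "recipes" -> Optional.of("category:food");
            case "forum" -> Optional.of("generator:forum");
            case "wiki" -> Optional.of("generator:wiki");
            default -> Optional.empty();
        };
    }

    /** Only the final word of the query is eligible for substitution. */
    static Optional<String> variantForQuery(List<String> words) {
        if (words.isEmpty()) {
            return Optional.empty();
        }
        return variantFor(words.get(words.size() - 1));
    }
}
```

Under this rule `guitar wiki` gains the `generator:wiki` variant, while `wiki software` gains none because `wiki` is not the last word.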

View File

@@ -155,16 +155,25 @@ public class QueryParser {
             // Remove trailing punctuation
             int lastChar = str.charAt(str.length() - 1);
-            if (":.,!?$'".indexOf(lastChar) >= 0)
-                entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 1), lt.displayStr()));
+            if (":.,!?$'".indexOf(lastChar) >= 0) {
+                str = str.substring(0, str.length() - 1);
+                entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+            }
             // Remove term elements that aren't indexed by the search engine
-            if (str.endsWith("'s"))
-                entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 2), lt.displayStr()));
-            if (str.endsWith("()"))
-                entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 2), lt.displayStr()));
-            if (str.startsWith("$"))
-                entity.replace(new QueryToken.LiteralTerm(str.substring(1), lt.displayStr()));
+            if (str.endsWith("'s")) {
+                str = str.substring(0, str.length() - 2);
+                entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+            }
+            if (str.endsWith("()")) {
+                str = str.substring(0, str.length() - 2);
+                entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+            }
+            while (str.startsWith("$") || str.startsWith("_")) {
+                str = str.substring(1);
+                entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+            }
             if (entity.isBlank()) {
                 entity.remove();
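
Review note: the change above makes each cleanup step operate on the result of the previous one by reassigning `str`, and turns the leading `$` check into a loop that also strips underscores. A standalone sketch of just that string logic is below. The class and method names are hypothetical, the empty-string guard is an addition for safety, and the real parser runs other steps before this one that are not reproduced here.

```java
// Hypothetical sketch of the cleanup steps; not the actual QueryParser code.
final class TokenCleanupSketch {
    private static final String TRAILING_PUNCTUATION = ":.,!?$'";

    static String cleanup(String term) {
        String str = term;
        if (str.isEmpty()) {
            return str; // guard added for this sketch
        }
        // Remove one trailing punctuation character.
        char lastChar = str.charAt(str.length() - 1);
        if (TRAILING_PUNCTUATION.indexOf(lastChar) >= 0) {
            str = str.substring(0, str.length() - 1);
        }
        // Remove suffixes that are not indexed.
        if (str.endsWith("'s")) {
            str = str.substring(0, str.length() - 2);
        }
        if (str.endsWith("()")) {
            str = str.substring(0, str.length() - 2);
        }
        // Strip every leading '$' or '_' so that __builtin_ffs becomes builtin_ffs.
        while (str.startsWith("$") || str.startsWith("_")) {
            str = str.substring(1);
        }
        return str;
    }
}
```

Because each step sees the output of the one before it, an input such as `$foo(),` ends up as `foo`, which the old code could not produce since every branch started again from the original string.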

View File

@@ -1,165 +0,0 @@
package nu.marginalia.util.language;
import com.google.inject.Inject;
import nu.marginalia.term_frequency_dict.TermFrequencyDict;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class EnglishDictionary {
    private final Set<String> englishWords = new HashSet<>();
    private final TermFrequencyDict tfDict;
    private final Logger logger = LoggerFactory.getLogger(getClass());

    @Inject
    public EnglishDictionary(TermFrequencyDict tfDict) {
        this.tfDict = tfDict;
        try (var resource = Objects.requireNonNull(ClassLoader.getSystemResourceAsStream("dictionary/en-words"),
                "Could not load word frequency table");
             var br = new BufferedReader(new InputStreamReader(resource))
        ) {
            for (;;) {
                String s = br.readLine();
                if (s == null) {
                    break;
                }
                englishWords.add(s.toLowerCase());
            }
        }
        catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public boolean isWord(String word) {
        return englishWords.contains(word);
    }

    private static final Pattern ingPattern = Pattern.compile(".*(\\w)\\1ing$");

    public Collection<String> getWordVariants(String s) {
        var variants = findWordVariants(s);

        var ret = variants.stream()
                .filter(var -> tfDict.getTermFreq(var) > 100)
                .collect(Collectors.toList());

        if (s.equals("recipe") || s.equals("recipes")) {
            ret.add("category:food");
        }

        return ret;
    }

    public Collection<String> findWordVariants(String s) {
        int sl = s.length();

        if (sl < 2) {
            return Collections.emptyList();
        }
        if (s.endsWith("s")) {
            String a = s.substring(0, sl-1);
            String b = s + "es";
            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        if (s.endsWith("sm")) {
            String a = s.substring(0, sl-1)+"t";
            String b = s.substring(0, sl-1)+"ts";
            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        if (s.endsWith("st")) {
            String a = s.substring(0, sl-1)+"m";
            String b = s + "s";
            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        else if (ingPattern.matcher(s).matches() && sl > 4) { // humming, clapping
            var a = s.substring(0, sl-4);
            var b = s.substring(0, sl-3) + "ed";

            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        else {
            String a = s + "s";
            String b = ingForm(s);
            String c = s + "ed";

            if (isWord(a) && isWord(b) && isWord(c)) {
                return List.of(a, b, c);
            }
            else if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(b) && isWord(c)) {
                return List.of(b, c);
            }
            else if (isWord(a) && isWord(c)) {
                return List.of(a, c);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
            else if (isWord(c)) {
                return List.of(c);
            }
        }

        return Collections.emptyList();
    }

    public String ingForm(String s) {
        if (s.endsWith("t") && !s.endsWith("tt")) {
            return s + "ting";
        }
        if (s.endsWith("n") && !s.endsWith("nn")) {
            return s + "ning";
        }
        if (s.endsWith("m") && !s.endsWith("mm")) {
            return s + "ming";
        }
        if (s.endsWith("r") && !s.endsWith("rr")) {
            return s + "ring";
        }
        return s + "ing";
    }
}
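
The consonant-doubling rule in `ingForm` and the doubled-letter `ingPattern` can be tried in isolation. The following is a minimal sketch that copies both rules out of `EnglishDictionary`; the class name `IngFormSketch` and the helper `stem` are hypothetical and exist only for illustration.

```java
import java.util.regex.Pattern;

class IngFormSketch {
    // Same expression as ingPattern above: a doubled word character followed by "ing"
    static final Pattern ING = Pattern.compile(".*(\\w)\\1ing$");

    // Equivalent to ingForm above: double a trailing t, n, m or r unless it is already doubled
    static String ingForm(String s) {
        for (String c : new String[] {"t", "n", "m", "r"}) {
            if (s.endsWith(c) && !s.endsWith(c + c)) {
                return s + c + "ing";
            }
        }
        return s + "ing";
    }

    // Hypothetical helper: the stem candidate findWordVariants derives for a doubled -ing form
    static String stem(String s) {
        return s.substring(0, s.length() - 4);
    }

    public static void main(String[] args) {
        System.out.println(ingForm("run"));
        System.out.println(ING.matcher("humming").matches());
        System.out.println(stem("humming"));
    }
}
```

Only four consonants are doubled, so a word such as "clap" produces "claping" in the forward direction; in `findWordVariants` that candidate is only returned if `isWord` accepts it.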

@@ -0,0 +1,32 @@
package nu.marginalia.functions.searchquery.query_parser;

import nu.marginalia.functions.searchquery.query_parser.token.QueryToken;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

import java.util.List;

class QueryParserTest {

    @Test
    // https://github.com/MarginaliaSearch/MarginaliaSearch/issues/140
    void parse__builtin_ffs() {
        QueryParser parser = new QueryParser();
        var tokens = parser.parse("__builtin_ffs");
        Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("builtin_ffs", "__builtin_ffs")), tokens);
    }

    @Test
    void trailingParens() {
        QueryParser parser = new QueryParser();
        var tokens = parser.parse("strcpy()");
        Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("strcpy", "strcpy()")), tokens);
    }

    @Test
    void trailingQuote() {
        QueryParser parser = new QueryParser();
        var tokens = parser.parse("bob's");
        Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("bob", "bob's")), tokens);
    }
}
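
The three tests pin down the expected term for each raw query string, not how `QueryParser` arrives at it. As a rough illustration only, the sketch below shows one hypothetical normalization that satisfies the same three cases; `TermNormalizationSketch` and `normalize` are invented names and this is not the parser's actual implementation.

```java
class TermNormalizationSketch {
    // Hypothetical normalization: strip leading underscores, then a trailing "()" or "'s"
    static String normalize(String raw) {
        int start = 0;
        while (start < raw.length() && raw.charAt(start) == '_') {
            start++;
        }
        String s = raw.substring(start);
        if (s.endsWith("()")) {
            s = s.substring(0, s.length() - 2);
        }
        if (s.endsWith("'s")) {
            s = s.substring(0, s.length() - 2);
        }
        return s;
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"__builtin_ffs", "strcpy()", "bob's"}) {
            System.out.println(raw + " -> " + normalize(raw));
        }
    }
}
```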

@@ -12,6 +12,7 @@ import nu.marginalia.index.query.limit.SpecificationLimit;
 import nu.marginalia.index.query.limit.SpecificationLimitType;
 import nu.marginalia.segmentation.NgramLexicon;
 import nu.marginalia.term_frequency_dict.TermFrequencyDict;
+import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.Test;
@@ -207,6 +208,17 @@ public class QueryFactoryTest {
         System.out.println(subquery);
     }
+    @Test
+    public void testExpansion9() {
+        var subquery = parseAndGetSpecs("pie recipe");
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" category:food "));
+        subquery = parseAndGetSpecs("recipe pie");
+        Assertions.assertFalse(subquery.query.compiledQuery.contains(" category:food "));
+    }
     @Test
     public void testParsing() {
         var subquery = parseAndGetSpecs("strlen()");

@@ -27,7 +27,7 @@ public class SentenceSegmentSplitter {
         else {
             // If we flatten unicode, we do this...
             // FIXME: This can almost definitely be cleaned up and simplified.
-            wordBreakPattern = Pattern.compile("([^/_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
+            wordBreakPattern = Pattern.compile("([^/<>$:_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
         }
     }
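
The effect of adding `<`, `>`, `$` and `:` to the allowed character class can be checked by splitting a few strings with both expressions. This is a sketch under the assumption that the pattern is applied as a delimiter; it uses `Pattern.split` for brevity, which may differ from how `SentenceSegmentSplitter` applies it, and `WordBreakSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class WordBreakSketch {
    // Expression before the change
    static final Pattern OLD = Pattern.compile(
            "([^/_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
    // Expression after the change: '<', '>', '$' and ':' no longer break words
    static final Pattern NEW = Pattern.compile(
            "([^/<>$:_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");

    public static void main(String[] args) {
        for (String s : new String[] {"std::vector", "$_GET", "Foo<Bar> baz"}) {
            System.out.println(s + " old=" + Arrays.toString(OLD.split(s))
                    + " new=" + Arrays.toString(NEW.split(s)));
        }
    }
}
```

With the new expression `std::vector` and `$_GET` each stay in one piece, which is what `testCplusplus` and `testPHP` assert at the sentence-extractor level.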

@@ -28,6 +28,20 @@ class SentenceExtractorTest {
         System.out.println(dld);
     }
+    @Test
+    void testCplusplus() {
+        var dld = sentenceExtractor.extractSentence("std::vector", EnumSet.noneOf(HtmlTag.class));
+        assertEquals(1, dld.length());
+        assertEquals("std::vector", dld.wordsLowerCase[0]);
+    }
+    @Test
+    void testPHP() {
+        var dld = sentenceExtractor.extractSentence("$_GET", EnumSet.noneOf(HtmlTag.class));
+        assertEquals(1, dld.length());
+        assertEquals("$_get", dld.wordsLowerCase[0]);
+    }
     @Test
     void testPolishArtist() {
         var dld = sentenceExtractor.extractSentence("Uklański", EnumSet.noneOf(HtmlTag.class));

@@ -20,34 +20,11 @@ public record ContentTags(String etag, String lastMod) {
     public void paint(Request.Builder getBuilder) {
         if (etag != null) {
-            getBuilder.addHeader("If-None-Match", ifNoneMatch());
+            getBuilder.addHeader("If-None-Match", etag);
         }
         if (lastMod != null) {
-            getBuilder.addHeader("If-Modified-Since", ifModifiedSince());
+            getBuilder.addHeader("If-Modified-Since", lastMod);
         }
     }
-    private String ifNoneMatch() {
-        // Remove the W/ prefix if it exists
-        //'W/' (case-sensitive) indicates that a weak validator is used. Weak etags are
-        // easy to generate, but are far less useful for comparisons. Strong validators
-        // are ideal for comparisons but can be very difficult to generate efficiently.
-        // Weak ETag values of two representations of the same resources might be semantically
-        // equivalent, but not byte-for-byte identical. This means weak etags prevent caching
-        // when byte range requests are used, but strong etags mean range requests can
-        // still be cached.
-        // - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag
-        if (null != etag && etag.startsWith("W/")) {
-            return etag.substring(2);
-        } else {
-            return etag;
-        }
-    }
-    private String ifModifiedSince() {
-        return lastMod;
-    }
 }
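
After this change the stored validators are sent back verbatim, including any `W/` prefix. A minimal sketch of the same conditional-request construction follows; it assumes the JDK's `java.net.http.HttpRequest` as a stand-in for the `Request.Builder` used in the diff, and `ConditionalGetSketch` and `conditionalGet` are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class ConditionalGetSketch {
    // Builds a GET request that echoes the stored validators back unchanged
    static HttpRequest conditionalGet(String url, String etag, String lastMod) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        if (etag != null) {
            // A weak validator such as W/"abc" is passed through as received
            builder.header("If-None-Match", etag);
        }
        if (lastMod != null) {
            builder.header("If-Modified-Since", lastMod);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        HttpRequest request = conditionalGet("https://example.com/", "W/\"abc\"", null);
        System.out.println(request.headers().map());
    }
}
```

Sending the weak tag unmodified is consistent with the HTTP specification, which defines `If-None-Match` in terms of weak comparison.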

@@ -34,8 +34,9 @@ import java.util.*;
 public class WarcRecorder implements AutoCloseable {
     /** Maximum time we'll wait on a single request */
     static final int MAX_TIME = 30_000;
-    /** Maximum (decompressed) size we'll fetch */
-    static final int MAX_SIZE = 1024 * 1024 * 10;
+    /** Maximum (decompressed) size we'll save */
+    static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);
     private final WarcWriter writer;
     private final Path warcFile;
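
`Integer.getInteger` reads a JVM system property and falls back to the given default when the property is absent or not a valid integer. The sketch below shows that lookup on its own; `FetchSizeSketch` and `maxFetchSize` are hypothetical names, while the property key and default are taken from the diff.

```java
class FetchSizeSketch {
    static final String PROPERTY = "crawler.maxFetchSize";
    static final int DEFAULT_SIZE = 10 * 1024 * 1024;

    // Looks the property up on every call; the diff instead reads it once into a static final field
    static int maxFetchSize() {
        return Integer.getInteger(PROPERTY, DEFAULT_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(maxFetchSize());
    }
}
```

Because `MAX_SIZE` in the diff is a `static final` field, the value is fixed when the class is initialized, so an override would normally be supplied at JVM start-up, for example with `-Dcrawler.maxFetchSize=2097152`.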

@@ -1,11 +1,15 @@
 package nu.marginalia.io;
+import nu.marginalia.model.crawldata.CrawledDocument;
+import nu.marginalia.model.crawldata.CrawledDomain;
 import nu.marginalia.model.crawldata.SerializableCrawlData;
 import org.jetbrains.annotations.Nullable;
 import java.io.IOException;
 import java.nio.file.Path;
+import java.util.ArrayList;
 import java.util.Iterator;
+import java.util.List;
 /** Closable iterator exceptional over serialized crawl data
  * The data may appear in any order, and the iterator must be closed.
@@ -26,6 +30,37 @@ public interface SerializableCrawlDataStream extends AutoCloseable {
     @Nullable
     default Path path() { return null; }
+    /** For tests */
+    default List<SerializableCrawlData> asList() throws IOException {
+        List<SerializableCrawlData> data = new ArrayList<>();
+        while (hasNext()) {
+            data.add(next());
+        }
+        return data;
+    }
+    /** For tests */
+    default List<CrawledDocument> docsAsList() throws IOException {
+        List<CrawledDocument> data = new ArrayList<>();
+        while (hasNext()) {
+            if (next() instanceof CrawledDocument doc) {
+                data.add(doc);
+            }
+        }
+        return data;
+    }
+    /** For tests */
+    default List<CrawledDomain> domainsAsList() throws IOException {
+        List<CrawledDomain> data = new ArrayList<>();
+        while (hasNext()) {
+            if (next() instanceof CrawledDomain domain) {
+                data.add(domain);
+            }
+        }
+        return data;
+    }
     // Dummy iterator over nothing
     static SerializableCrawlDataStream empty() {
         return new SerializableCrawlDataStream() {
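
Each of the three helpers drains the stream while filtering by type, so a second call on the same stream finds nothing left. The sketch below reproduces that pattern with a plain `Iterator`; the `Item`, `Doc` and `Domain` types are hypothetical stand-ins for the crawl data classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class DrainSketch {
    // Hypothetical stand-ins for SerializableCrawlData, CrawledDocument and CrawledDomain
    interface Item {}
    record Doc(String url) implements Item {}
    record Domain(String name) implements Item {}

    // Consumes the whole iterator and keeps only the Doc entries
    static List<Doc> docsAsList(Iterator<Item> stream) {
        List<Doc> docs = new ArrayList<>();
        while (stream.hasNext()) {
            if (stream.next() instanceof Doc doc) {
                docs.add(doc);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Item> items = List.of(new Domain("example.com"), new Doc("https://example.com/"));
        Iterator<Item> stream = items.iterator();
        System.out.println(docsAsList(stream));
        System.out.println(stream.hasNext());
    }
}
```

This one-shot behaviour is presumably why the methods are marked as being for tests: entries of other types are discarded along the way rather than left for a later call.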

@@ -26,6 +26,7 @@ import java.net.http.HttpHeaders;
 import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.time.Duration;
+import java.util.ArrayList;
 import java.util.List;
 import java.util.Optional;
 import java.util.concurrent.ThreadLocalRandom;
@@ -47,6 +48,8 @@ public class SimpleLinkScraper implements AutoCloseable {
     private final Duration readTimeout = Duration.ofSeconds(10);
     private final DomainLocks domainLocks = new DomainLocks();
+    private final static int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);
     public SimpleLinkScraper(LiveCrawlDataSet dataSet,
                              DbDomainQueries domainQueries,
                              DomainBlacklist domainBlacklist) {
@@ -65,52 +68,68 @@ public class SimpleLinkScraper implements AutoCloseable {
         pool.submitQuietly(() -> retrieveNow(domain, id.getAsInt(), urls));
     }
-    public void retrieveNow(EdgeDomain domain, int domainId, List<String> urls) throws Exception {
+    public int retrieveNow(EdgeDomain domain, int domainId, List<String> urls) throws Exception {
+        EdgeUrl rootUrl = domain.toRootUrlHttps();
+        List<EdgeUrl> relevantUrls = new ArrayList<>();
+        for (var url : urls) {
+            Optional<EdgeUrl> optParsedUrl = lp.parseLink(rootUrl, url);
+            if (optParsedUrl.isEmpty()) {
+                continue;
+            }
+            if (dataSet.hasUrl(optParsedUrl.get())) {
+                continue;
+            }
+            relevantUrls.add(optParsedUrl.get());
+        }
+        if (relevantUrls.isEmpty()) {
+            return 0;
+        }
+        int fetched = 0;
         try (HttpClient client = HttpClient
                 .newBuilder()
                 .connectTimeout(connectTimeout)
                 .followRedirects(HttpClient.Redirect.NEVER)
                 .version(HttpClient.Version.HTTP_2)
                 .build();
-             DomainLocks.DomainLock lock = domainLocks.lockDomain(domain) // throttle concurrent access per domain; do not remove
+             // throttle concurrent access per domain; IDE will complain it's not used, but it holds a semaphore -- do not remove:
+             DomainLocks.DomainLock lock = domainLocks.lockDomain(domain)
         ) {
-            EdgeUrl rootUrl = domain.toRootUrlHttps();
             SimpleRobotRules rules = fetchRobotsRules(rootUrl, client);
             if (rules == null) { // I/O error fetching robots.txt
                 // If we can't fetch the robots.txt,
-                for (var url : urls) {
-                    lp.parseLink(rootUrl, url).ifPresent(this::maybeFlagAsBad);
+                for (var url : relevantUrls) {
+                    maybeFlagAsBad(url);
                 }
-                return;
+                return fetched;
             }
             CrawlDelayTimer timer = new CrawlDelayTimer(rules.getCrawlDelay());
-            for (var url : urls) {
-                Optional<EdgeUrl> optParsedUrl = lp.parseLink(rootUrl, url);
-                if (optParsedUrl.isEmpty()) {
-                    continue;
-                }
-                if (dataSet.hasUrl(optParsedUrl.get())) {
-                    continue;
-                }
-                EdgeUrl parsedUrl = optParsedUrl.get();
-                if (!rules.isAllowed(url)) {
+            for (var parsedUrl : relevantUrls) {
+                if (!rules.isAllowed(parsedUrl.toString())) {
                     maybeFlagAsBad(parsedUrl);
                     continue;
                 }
                 switch (fetchUrl(domainId, parsedUrl, timer, client)) {
-                    case FetchResult.Success(int id, EdgeUrl docUrl, String body, String headers)
-                            -> dataSet.saveDocument(id, docUrl, body, headers, "");
+                    case FetchResult.Success(int id, EdgeUrl docUrl, String body, String headers) -> {
+                        dataSet.saveDocument(id, docUrl, body, headers, "");
+                        fetched++;
+                    }
                     case FetchResult.Error(EdgeUrl docUrl) -> maybeFlagAsBad(docUrl);
                 }
             }
         }
+        return fetched;
     }
     private void maybeFlagAsBad(EdgeUrl url) {
@@ -190,7 +209,7 @@ public class SimpleLinkScraper implements AutoCloseable {
         }
         byte[] body = getResponseData(response);
-        if (body.length > 1024 * 1024) {
+        if (body.length > MAX_SIZE) {
             return new FetchResult.Error(parsedUrl);
         }
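
The comment on `DomainLocks.DomainLock` notes that the variable looks unused but holds a semaphore. A minimal sketch of that idiom follows, assuming a lock object that acquires a permit on construction and releases it in `close()`; `DomainLockSketch` and its nested `Lock` are hypothetical and not the project's `DomainLocks` implementation.

```java
import java.util.concurrent.Semaphore;

class DomainLockSketch {
    // Hypothetical lock: takes one permit when created, gives it back when closed
    static final class Lock implements AutoCloseable {
        private final Semaphore semaphore;

        Lock(Semaphore semaphore) {
            this.semaphore = semaphore;
            semaphore.acquireUninterruptibly();
        }

        @Override
        public void close() {
            semaphore.release();
        }
    }

    // The resource variable is never referenced in the body, yet it holds a permit until the block exits
    static int permitsWhileHeld(Semaphore semaphore) {
        try (Lock lock = new Lock(semaphore)) {
            return semaphore.availablePermits();
        }
    }

    public static void main(String[] args) {
        Semaphore perDomain = new Semaphore(2);
        System.out.println(permitsWhileHeld(perDomain));
        System.out.println(perDomain.availablePermits());
    }
}
```

Removing the seemingly unused resource would silently drop the per-domain throttling, which is what the comment in the diff warns against.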

@@ -3,8 +3,8 @@ package nu.marginalia.livecrawler;
 import nu.marginalia.db.DomainBlacklistImpl;
 import nu.marginalia.io.SerializableCrawlDataStream;
 import nu.marginalia.model.EdgeDomain;
+import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.crawldata.CrawledDocument;
-import nu.marginalia.model.crawldata.CrawledDomain;
 import org.apache.commons.io.FileUtils;
 import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.Assertions;
@@ -38,7 +38,8 @@ class SimpleLinkScraperTest {
     @Test
     public void testRetrieveNow() throws Exception {
         var scraper = new SimpleLinkScraper(dataSet, null, Mockito.mock(DomainBlacklistImpl.class));
-        scraper.retrieveNow(new EdgeDomain("www.marginalia.nu"), 1, List.of("https://www.marginalia.nu/"));
+        int fetched = scraper.retrieveNow(new EdgeDomain("www.marginalia.nu"), 1, List.of("https://www.marginalia.nu/"));
+        Assertions.assertEquals(1, fetched);
         var streams = dataSet.getDataStreams();
         Assertions.assertEquals(1, streams.size());
@@ -46,23 +47,20 @@ class SimpleLinkScraperTest {
         SerializableCrawlDataStream firstStream = streams.iterator().next();
         Assertions.assertTrue(firstStream.hasNext());
-        if (firstStream.next() instanceof CrawledDomain domain) {
-            Assertions.assertEquals("www.marginalia.nu",domain.getDomain());
-        }
-        else {
-            Assertions.fail();
-        }
-        Assertions.assertTrue(firstStream.hasNext());
-        if ((firstStream.next() instanceof CrawledDocument document)) {
-            // verify we decompress the body string
-            Assertions.assertTrue(document.documentBody.startsWith("<!doctype"));
-        }
-        else{
-            Assertions.fail();
-        }
-        Assertions.assertFalse(firstStream.hasNext());
+        List<CrawledDocument> documents = firstStream.docsAsList();
+        Assertions.assertEquals(1, documents.size());
+        Assertions.assertTrue(documents.getFirst().documentBody.startsWith("<!doctype"));
+    }
+    @Test
+    public void testRetrieveNow_Redundant() throws Exception {
+        dataSet.saveDocument(1, new EdgeUrl("https://www.marginalia.nu/"), "<html>", "", "127.0.0.1");
+        var scraper = new SimpleLinkScraper(dataSet, null, Mockito.mock(DomainBlacklistImpl.class));
+        // If the requested URL is already in the dataSet, retrieveNow should short-circuit and not fetch anything
+        int fetched = scraper.retrieveNow(new EdgeDomain("www.marginalia.nu"), 1, List.of("https://www.marginalia.nu/"));
+        Assertions.assertEquals(0, fetched);
     }
 }

@@ -0,0 +1,14 @@
<section id="frontpage-tips">
    <h2>Public Beta Available</h2>
    <div class="info">
        <p>
            A redesigned version of the search engine UI is available for beta testing.
            Feel free to give it a spin; feedback is welcome!
            The old one will remain available if you hate it
            or have compatibility issues.
        </p>
        <p>
            <a href="https://test.marginalia.nu/">Try it out!</a>
        </p>
    </div>
</section>

@@ -24,7 +24,7 @@
 <section id="frontpage">
     {{>search/index/index-news}}
     {{>search/index/index-about}}
-    {{>search/index/index-tips}}
+    {{>search/index/index-redesign}}
 </section>
 {{>search/parts/search-footer}}

tools/deployment/deployment.py Normal file → Executable file

@@ -1,3 +1,5 @@
+#!/usr/bin/env python3
 from dataclasses import dataclass
 import subprocess, os
 from typing import List, Set, Dict, Optional
@@ -220,6 +222,31 @@ def run_gradle_build(targets: str) -> None:
     if return_code != 0:
         raise BuildError(service, return_code)
+def find_free_tag() -> str:
+    cmd = ['git', 'tag']
+    result = subprocess.run(cmd, capture_output=True, text=True)
+    if result.returncode != 0:
+        raise RuntimeError(f"Git command failed: {result.stderr}")
+    existing_tags = set(result.stdout.splitlines())
+    for i in range(1, 100000):
+        tag = f'deploy-{i:04d}'
+        if not tag in existing_tags:
+            return tag
+    raise RuntimeError(f"Failed to find a free deployment tag")
+def add_tags(tags: str) -> None:
+    new_tag = find_free_tag()
+    cmd = ['git', 'tag', new_tag, '-am', tags]
+    result = subprocess.run(cmd)
+    if result.returncode != 0:
+        raise RuntimeError(f"Git command failed: {result.stderr}")
 # Example usage:
 if __name__ == '__main__':
     # Define service configuration
@@ -293,7 +320,9 @@ if __name__ == '__main__':
     parser = argparse.ArgumentParser(
         prog='deployment.py',
         description='Continuous Deployment helper')
     parser.add_argument('-v', '--verify', help='Verify the tags are valid, if present', action='store_true')
+    parser.add_argument('-a', '--add', help='Add the tags provided as a new deployment tag, usually combined with -t', action='store_true')
     parser.add_argument('-t', '--tag', help='Use the specified tag value instead of the head git tag starting with deploy-')
     args = parser.parse_args()
@@ -314,7 +343,10 @@ if __name__ == '__main__':
     print("Services to build:", plan.services_to_build)
     print("Instances to deploy:", [container.name for container in plan.instances_to_deploy])
-    if not args.verify:
+    if args.verify:
+        if args.add:
+            add_tags(args.tag)
+    else:
         print("\nExecution Plan:")
         build_and_deploy(plan, SERVICE_CONFIG)