Mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git, synced 2025-10-06 07:32:38 +02:00
Compare commits: deploy-001...deploy-002 (109 commits)
Commits:

b5469bd8a1
6a6318d04c
55933f8d40
be6382e0d0
45e771f96b
8dde502cc9
3e66767af3
9ec9d1b338
dcad0d7863
94e1aa0baf
b62f043910
6ea22d0d21
8c69dc31b8
00734ea87f
3009713db4
9b2ceaf37c
8019c2ce18
a9e312b8b1
4da3563d8a
48d0a3089a
594df64b20
06efb5abfc
78eb1417a7
8c8f2ad5ee
f71e79d10f
1b27c5cf06
67edc8f90d
5f576b7d0c
8b05c788fd
236f033bc9
510fc75121
0376f2e6e3
0b65164f60
9be477de33
84f55b84ff
ab5c30ad51
0c839453c5
5e4c5d03ae
710af4999a
a5b0a1ae62
e9f71ee39b
baeb4a46cd
5e2a8e9f27
cc1a5bdf90
7f7b1ffaba
0ea8092350
483d29497e
bae44497fe
0d59202aca
0ca43f0c9c
3bc99639a0
927bc0b63c
d968801dc1
89db69d360
895cee7004
4bb71b8439
81cdd6385d
e76c42329f
e6ef4734ea
df4bc1d7e9
2b222efa75
6d18e6d840
2a3c63f209
9f70cecaef
c08203e2ed
86497fd32f
3b998573fd
e161882ec7
357f349e30
e4769f541d
2a173e2861
a6a900266c
bdba53f055
bbdde789e7
eab61cd48a
0ce2ba9ad9
3ddcebaa36
b91463383e
7444a2f36c
fdee07048d
2fbf201761
4018e4c434
f3382b5bd8
9287ee0141
2769c8f869
ddb66f33ba
79500b8fbc
187eea43a4
a89ed6fa9f
8d168be138
6e1aa7b391
deab9b9516
39d99a906a
6f72e6e0d3
d786d79483
01510f6c2e
7ba43e9e3f
97bfcd1353
aa3c85c196
fb75a3827d
7d546d0e2a
8fcb6ffd7a
f97de0c15a
be9e192b78
75ae1c9526
33761a0236
19b69b1764
8b804359a9
f050bf5c4c
.github/FUNDING.yml (vendored, 1 line changed)

@@ -1,5 +1,6 @@
 # These are supported funding model platforms
 
+polar: marginalia-search
 github: MarginaliaSearch
 patreon: marginalia_nu
 open_collective: # Replace with a single Open Collective username
.gitignore (vendored, 1 line changed)

@@ -7,3 +7,4 @@ build/
 lombok.config
 Dockerfile
 run
+jte-classes
ROADMAP.md (52 lines changed)

@@ -8,20 +8,10 @@ be implemented as well.
 Major goals:
 
 * Reach 1 billion pages indexed
-* Improve technical ability of indexing and search. Although this area has improved a bit, the
-search engine is still not very good at dealing with longer queries.
-
-## Proper Position Index (COMPLETED 2024-09)
-
-The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
-of being very fast to evaluate and works well for what it is, but is inaccurate and has the
-drawback of making support for quoted search terms inaccurate and largely reliant on indexing
-word n-grams known beforehand. This limits the ability to interpret longer queries.
-
-The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
-list, as is the civilized way of doing this.
-
-Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
+search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done )

@@ -37,8 +27,7 @@ Retaining the ability to independently crawl the web is still strongly desirable
 
 ## Safe Search
 
-The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable
-to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
+The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
 combined with naive bayesian filter would go a long way, or something more sophisticated...?
 
 ## Web Design Overhaul

@@ -55,15 +44,6 @@ associated with each language added, at least a models file or two, as well as s
 
 It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
 
-## Finalize RSS support (COMPLETED 2024-11)
-
-Marginalia has experimental RSS preview support for a few domains. This works well and
-it should be extended to all domains. It would also be interesting to offer search of the
-RSS data itself, or use the RSS set to feed a special live index that updates faster than the
-main dataset.
-
-Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
-
 ## Support for binary formats like PDF
 
 The crawler needs to be modified to retain them, and the conversion logic needs to parse them.

@@ -80,5 +60,27 @@ This looks like a good idea that wouldn't just help clean up the search filters
 website, but might be cheap enough we might go as far as to offer a number of ad-hoc custom search
 filter for any API consumer.
 
-I've talked to the stract dev and he does not think it's a good idea to mimic their optics language,
-which is quite ad-hoc, but instead to work together to find some new common description language for this.
+I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
+
+# Completed
+
+## Proper Position Index (COMPLETED 2024-09)
+
+The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
+of being very fast to evaluate and works well for what it is, but is inaccurate and has the
+drawback of making support for quoted search terms inaccurate and largely reliant on indexing
+word n-grams known beforehand. This limits the ability to interpret longer queries.
+
+The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
+list, as is the civilized way of doing this.
+
+Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+
+## Finalize RSS support (COMPLETED 2024-11)
+
+Marginalia has experimental RSS preview support for a few domains. This works well and
+it should be extended to all domains. It would also be interesting to offer search of the
+RSS data itself, or use the RSS set to feed a special live index that updates faster than the
+main dataset.
+
+Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
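An aside on the "gamma coded positions list" the roadmap refers to: an Elias gamma code writes a positive integer's bit length in unary followed by the value in binary, so small gaps between successive word positions cost only a few bits. The sketch below is illustrative only — the class and method names are invented here, and this is not the implementation from PR #99:

```java
public class GammaPositionsExample {

    // Append the Elias gamma code of value (must be >= 1) as a string of bits
    static void gammaEncode(StringBuilder bits, int value) {
        int n = 31 - Integer.numberOfLeadingZeros(value); // floor(log2(value))
        bits.append("0".repeat(n));                       // bit length, in unary
        bits.append(Integer.toBinaryString(value));       // the value, MSB first
    }

    // Positions are delta-encoded first, since gamma codes favor small values
    static String encodePositions(int[] positions) {
        StringBuilder bits = new StringBuilder();
        int prev = 0;
        for (int pos : positions) {
            gammaEncode(bits, pos - prev); // gaps are >= 1 for strictly increasing positions
            prev = pos;
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        // A word occurring at positions 3, 4, 7, 19 encodes the gaps 3, 1, 3, 12
        // as 011 1 011 0001100 -- fourteen bits in total
        System.out.println(encodePositions(new int[] {3, 4, 7, 19}));
    }
}
```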
@@ -48,6 +48,7 @@ ext {
     dockerImageTag='latest'
     dockerImageRegistry='marginalia'
     jibVersion = '3.4.3'
 
 }
 
 idea {
@@ -28,7 +28,7 @@ public class DbDomainQueries {
     }
 
 
-    public Integer getDomainId(EdgeDomain domain) {
+    public Integer getDomainId(EdgeDomain domain) throws NoSuchElementException {
         try (var connection = dataSource.getConnection()) {
 
             return domainIdCache.get(domain, () -> {
@@ -42,6 +42,9 @@ public class DbDomainQueries {
                 throw new NoSuchElementException();
             });
         }
+        catch (UncheckedExecutionException ex) {
+            throw new NoSuchElementException();
+        }
         catch (ExecutionException ex) {
             throw new RuntimeException(ex.getCause());
         }
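For context on the new catch clause: Guava's Cache.get wraps checked exceptions thrown by the loader in ExecutionException and unchecked ones in UncheckedExecutionException, so both have to be handled. A minimal standalone sketch of that behavior (the class name is invented for illustration):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.util.concurrent.UncheckedExecutionException;

import java.util.NoSuchElementException;
import java.util.concurrent.ExecutionException;

public class GuavaCacheExceptionDemo {
    public static void main(String[] args) {
        Cache<String, Integer> cache = CacheBuilder.newBuilder().build();

        try {
            // The loader throws an unchecked exception, so Cache.get rethrows it
            // wrapped in UncheckedExecutionException, not in the checked
            // ExecutionException reserved for checked loader failures.
            cache.get("missing", () -> {
                throw new NoSuchElementException("no such domain");
            });
        } catch (UncheckedExecutionException ex) {
            System.out.println("unchecked loader failure: " + ex.getCause());
        } catch (ExecutionException ex) {
            System.out.println("checked loader failure: " + ex.getCause());
        }
    }
}
```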
@@ -42,6 +42,12 @@ dependencies {
     implementation libs.bundles.curator
     implementation libs.bundles.flyway
 
+    libs.bundles.jooby.get().each {
+        implementation dependencies.create(it) {
+            exclude group: 'org.slf4j'
+        }
+    }
+
     testImplementation libs.bundles.slf4j.test
     implementation libs.bundles.mariadb
@@ -7,8 +7,6 @@ import nu.marginalia.service.discovery.property.PartitionTraits;
 import nu.marginalia.service.discovery.property.ServiceEndpoint;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
 
 import java.util.List;
 import java.util.concurrent.CompletableFuture;
@@ -24,7 +22,7 @@ import java.util.function.Function;
 public class GrpcMultiNodeChannelPool<STUB> {
     private final ConcurrentHashMap<Integer, GrpcSingleNodeChannelPool<STUB>> pools =
             new ConcurrentHashMap<>();
-    private static final Logger logger = LoggerFactory.getLogger(GrpcMultiNodeChannelPool.class);
 
     private final ServiceRegistryIf serviceRegistryIf;
     private final ServiceKey<? extends PartitionTraits.Multicast> serviceKey;
     private final Function<ServiceEndpoint.InstanceAddress, ManagedChannel> channelConstructor;
@@ -10,6 +10,8 @@ import nu.marginalia.service.discovery.property.ServiceKey;
 import org.jetbrains.annotations.NotNull;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
+import org.slf4j.Marker;
+import org.slf4j.MarkerFactory;
 
 import java.time.Duration;
 import java.util.*;
@@ -26,13 +28,13 @@ import java.util.function.Function;
 public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
     private final Map<InstanceAddress, ConnectionHolder> channels = new ConcurrentHashMap<>();
 
+    private final Marker grpcMarker = MarkerFactory.getMarker("GRPC");
     private static final Logger logger = LoggerFactory.getLogger(GrpcSingleNodeChannelPool.class);
 
     private final ServiceRegistryIf serviceRegistryIf;
     private final Function<InstanceAddress, ManagedChannel> channelConstructor;
     private final Function<ManagedChannel, STUB> stubConstructor;
 
 
     public GrpcSingleNodeChannelPool(ServiceRegistryIf serviceRegistryIf,
                                      ServiceKey<? extends PartitionTraits.Unicast> serviceKey,
                                      Function<InstanceAddress, ManagedChannel> channelConstructor,
@@ -48,8 +50,6 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         serviceRegistryIf.registerMonitor(this);
 
         onChange();
-
-        awaitChannel(Duration.ofSeconds(5));
     }
 
@@ -62,10 +62,10 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         for (var route : Sets.symmetricDifference(oldRoutes, newRoutes)) {
             ConnectionHolder oldChannel;
             if (newRoutes.contains(route)) {
-                logger.info("Adding route {}", route);
+                logger.info(grpcMarker, "Adding route {} => {}", serviceKey, route);
                 oldChannel = channels.put(route, new ConnectionHolder(route));
             } else {
-                logger.info("Expelling route {}", route);
+                logger.info(grpcMarker, "Expelling route {} => {}", serviceKey, route);
                 oldChannel = channels.remove(route);
             }
             if (oldChannel != null) {
@@ -103,7 +103,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         }
 
         try {
-            logger.info("Creating channel for {}:{}", serviceKey, address);
+            logger.info(grpcMarker, "Creating channel for {} => {}", serviceKey, address);
             value = channelConstructor.apply(address);
             if (channel.compareAndSet(null, value)) {
                 return value;
@@ -114,7 +114,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
             }
         }
         catch (Exception e) {
-            logger.error("Failed to get channel for " + address, e);
+            logger.error(grpcMarker, "Failed to get channel for " + address, e);
             return null;
         }
     }
@@ -206,7 +206,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         }
 
         for (var e : exceptions) {
-            logger.error("Failed to call service {}", serviceKey, e);
+            logger.error(grpcMarker, "Failed to call service {}", serviceKey, e);
         }
 
         throw new ServiceNotAvailableException(serviceKey);
@@ -4,6 +4,11 @@ import nu.marginalia.service.discovery.property.ServiceKey;
 
 public class ServiceNotAvailableException extends RuntimeException {
     public ServiceNotAvailableException(ServiceKey<?> key) {
-        super("Service " + key + " not available");
+        super(key.toString());
     }
 
+    @Override
+    public StackTraceElement[] getStackTrace() { // Suppress stack trace
+        return new StackTraceElement[0];
+    }
 }
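A side note on the stack trace suppression above: overriding getStackTrace() shortens log output, but the trace is still captured when the exception is constructed. A common alternative idiom — not what this commit does — is to skip capture entirely via the protected Throwable constructor:

```java
public class QuietException extends RuntimeException {
    public QuietException(String message) {
        // enableSuppression=false, writableStackTrace=false:
        // fillInStackTrace() is never invoked, so construction stays cheap
        super(message, null, false, false);
    }
}
```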
@@ -48,5 +48,10 @@ public record ServiceEndpoint(String host, int port) {
         public int port() {
             return endpoint.port();
         }
+
+        @Override
+        public String toString() {
+            return endpoint().host() + ":" + endpoint.port() + " [" + instance + "]";
+        }
     }
 }
@@ -48,6 +48,19 @@ public sealed interface ServiceKey<P extends ServicePartition> {
         {
             throw new UnsupportedOperationException();
         }
+
+        @Override
+        public String toString() {
+            final String shortName;
+
+            int periodIndex = name.lastIndexOf('.');
+
+            if (periodIndex >= 0) shortName = name.substring(periodIndex+1);
+            else shortName = name;
+
+            return "rest:" + shortName;
+        }
+
     }
     record Grpc<P extends ServicePartition>(String name, P partition) implements ServiceKey<P> {
         public String baseName() {
@@ -64,6 +77,18 @@ public sealed interface ServiceKey<P extends ServicePartition> {
         {
             return new Grpc<>(name, partition);
         }
+
+        @Override
+        public String toString() {
+            final String shortName;
+
+            int periodIndex = name.lastIndexOf('.');
+
+            if (periodIndex >= 0) shortName = name.substring(periodIndex+1);
+            else shortName = name;
+
+            return "grpc:" + shortName + "[" + partition.identifier() + "]";
+        }
     }
 
 }
@@ -0,0 +1,178 @@ (new file)
package nu.marginalia.service.server;

import io.jooby.*;
import io.prometheus.client.Counter;
import nu.marginalia.mq.inbox.MqInboxIf;
import nu.marginalia.service.client.ServiceNotAvailableException;
import nu.marginalia.service.discovery.property.ServiceEndpoint;
import nu.marginalia.service.discovery.property.ServiceKey;
import nu.marginalia.service.discovery.property.ServicePartition;
import nu.marginalia.service.module.ServiceConfiguration;
import nu.marginalia.service.server.jte.JteModule;
import nu.marginalia.service.server.mq.ServiceMqSubscription;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class JoobyService {
    private final Logger logger = LoggerFactory.getLogger(getClass());

    // Marker for filtering out sensitive content from the persistent logs
    private final Marker httpMarker = MarkerFactory.getMarker("HTTP");

    private final Initialization initialization;

    private final static Counter request_counter = Counter.build("wmsa_request_counter", "Request Counter")
            .labelNames("service", "node")
            .register();
    private final static Counter request_counter_good = Counter.build("wmsa_request_counter_good", "Good Requests")
            .labelNames("service", "node")
            .register();
    private final static Counter request_counter_bad = Counter.build("wmsa_request_counter_bad", "Bad Requests")
            .labelNames("service", "node")
            .register();
    private final static Counter request_counter_err = Counter.build("wmsa_request_counter_err", "Error Requests")
            .labelNames("service", "node")
            .register();
    private final String serviceName;
    private static volatile boolean initialized = false;

    protected final MqInboxIf messageQueueInbox;
    private final int node;
    private GrpcServer grpcServer;

    private ServiceConfiguration config;
    private final List<MvcExtension> joobyServices;
    private final ServiceEndpoint restEndpoint;

    public JoobyService(BaseServiceParams params,
                        ServicePartition partition,
                        List<DiscoverableService> grpcServices,
                        List<MvcExtension> joobyServices
                        ) throws Exception {

        this.joobyServices = joobyServices;
        this.initialization = params.initialization;
        config = params.configuration;
        node = config.node();

        String inboxName = config.serviceName();
        logger.info("Inbox name: {}", inboxName);

        var serviceRegistry = params.serviceRegistry;

        restEndpoint = serviceRegistry.registerService(ServiceKey.forRest(config.serviceId(), config.node()),
                config.instanceUuid(), config.externalAddress());

        var mqInboxFactory = params.messageQueueInboxFactory;
        messageQueueInbox = mqInboxFactory.createSynchronousInbox(inboxName, config.node(), config.instanceUuid());
        messageQueueInbox.subscribe(new ServiceMqSubscription(this));

        serviceName = System.getProperty("service-name");

        initialization.addCallback(params.heartbeat::start);
        initialization.addCallback(messageQueueInbox::start);
        initialization.addCallback(() -> params.eventLog.logEvent("SVC-INIT", serviceName + ":" + config.node()));
        initialization.addCallback(() -> serviceRegistry.announceInstance(config.instanceUuid()));

        Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
            if (e instanceof ServiceNotAvailableException) {
                // reduce log spam for this common case
                logger.error("Service not available: {}", e.getMessage());
            }
            else {
                logger.error("Uncaught exception", e);
            }
            request_counter_err.labels(serviceName, Integer.toString(node)).inc();
        });

        if (!initialization.isReady() && ! initialized ) {
            initialized = true;
            grpcServer = new GrpcServer(config, serviceRegistry, partition, grpcServices);
            grpcServer.start();
        }
    }

    public void startJooby(Jooby jooby) {

        logger.info("{} Listening to {}:{} ({})", getClass().getSimpleName(),
                restEndpoint.host(),
                restEndpoint.port(),
                config.externalAddress());

        // FIXME: This won't work outside of docker, may need to submit a PR to jooby to allow classpaths here
        jooby.install(new JteModule(Path.of("/app/resources/jte"), Path.of("/app/classes/jte-precompiled")));
        jooby.assets("/*", Paths.get("/app/resources/static"));

        var options = new ServerOptions();
        options.setHost(config.bindAddress());
        options.setPort(restEndpoint.port());

        // Enable gzip compression of response data, but set compression to the lowest level
        // since it doesn't really save much more space to dial it up. It's typically a
        // single digit percentage difference since HTML already compresses very well with level = 1.
        options.setCompressionLevel(1);

        jooby.setServerOptions(options);

        jooby.get("/internal/ping", ctx -> "pong");
        jooby.get("/internal/started", this::isInitialized);
        jooby.get("/internal/ready", this::isReady);

        for (var service : joobyServices) {
            jooby.mvc(service);
        }

        jooby.before(this::auditRequestIn);
        jooby.after(this::auditRequestOut);
    }

    private Object isInitialized(Context ctx) {
        if (initialization.isReady()) {
            return "ok";
        }
        else {
            ctx.setResponseCode(StatusCode.FAILED_DEPENDENCY_CODE);
            return "bad";
        }
    }

    public boolean isReady() {
        return true;
    }

    private String isReady(Context ctx) {
        if (isReady()) {
            return "ok";
        }
        else {
            ctx.setResponseCode(StatusCode.FAILED_DEPENDENCY_CODE);
            return "bad";
        }
    }

    private void auditRequestIn(Context ctx) {
        request_counter.labels(serviceName, Integer.toString(node)).inc();
    }

    private void auditRequestOut(Context ctx, Object result, Throwable failure) {
        if (ctx.getResponseCode().value() < 400) {
            request_counter_good.labels(serviceName, Integer.toString(node)).inc();
        }
        else {
            request_counter_bad.labels(serviceName, Integer.toString(node)).inc();
        }

        if (failure != null) {
            logger.error("Request failed " + ctx.getMethod() + " " + ctx.getRequestURL(), failure);
            request_counter_err.labels(serviceName, Integer.toString(node)).inc();
        }
    }

}
@@ -16,7 +16,7 @@ import spark.Spark;
 
 import java.util.List;
 
-public class Service {
+public class SparkService {
     private final Logger logger = LoggerFactory.getLogger(getClass());
 
     // Marker for filtering out sensitive content from the persistent logs
@@ -43,10 +43,10 @@
     private final int node;
     private GrpcServer grpcServer;
 
-    public Service(BaseServiceParams params,
-                   Runnable configureStaticFiles,
-                   ServicePartition partition,
-                   List<DiscoverableService> grpcServices) throws Exception {
+    public SparkService(BaseServiceParams params,
+                        Runnable configureStaticFiles,
+                        ServicePartition partition,
+                        List<DiscoverableService> grpcServices) throws Exception {
 
         this.initialization = params.initialization;
         var config = params.configuration;
@@ -126,18 +126,18 @@
         }
     }
 
-    public Service(BaseServiceParams params,
-                   ServicePartition partition,
-                   List<DiscoverableService> grpcServices) throws Exception {
+    public SparkService(BaseServiceParams params,
+                        ServicePartition partition,
+                        List<DiscoverableService> grpcServices) throws Exception {
         this(params,
-             Service::defaultSparkConfig,
+             SparkService::defaultSparkConfig,
             partition,
            grpcServices);
    }
 
-    public Service(BaseServiceParams params) throws Exception {
+    public SparkService(BaseServiceParams params) throws Exception {
        this(params,
-             Service::defaultSparkConfig,
+             SparkService::defaultSparkConfig,
            ServicePartition.any(),
            List.of());
    }
@@ -0,0 +1,61 @@ (new file)
package nu.marginalia.service.server.jte;

import edu.umd.cs.findbugs.annotations.NonNull;
import edu.umd.cs.findbugs.annotations.Nullable;
import gg.jte.ContentType;
import gg.jte.TemplateEngine;
import gg.jte.resolve.DirectoryCodeResolver;
import io.jooby.*;

import java.io.File;
import java.nio.file.Path;
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.stream.Stream;

// Temporary workaround for a bug
// APL-2.0 https://github.com/jooby-project/jooby
public class JteModule implements Extension {
    private Path sourceDirectory;
    private Path classDirectory;
    private TemplateEngine templateEngine;

    public JteModule(@NonNull Path sourceDirectory, @NonNull Path classDirectory) {
        this.sourceDirectory = (Path)Objects.requireNonNull(sourceDirectory, "Source directory is required.");
        this.classDirectory = (Path)Objects.requireNonNull(classDirectory, "Class directory is required.");
    }

    public JteModule(@NonNull Path sourceDirectory) {
        this.sourceDirectory = (Path)Objects.requireNonNull(sourceDirectory, "Source directory is required.");
    }

    public JteModule(@NonNull TemplateEngine templateEngine) {
        this.templateEngine = (TemplateEngine)Objects.requireNonNull(templateEngine, "Template engine is required.");
    }

    public void install(@NonNull Jooby application) {
        if (this.templateEngine == null) {
            this.templateEngine = create(application.getEnvironment(), this.sourceDirectory, this.classDirectory);
        }

        ServiceRegistry services = application.getServices();
        services.put(TemplateEngine.class, this.templateEngine);
        application.encoder(MediaType.html, new JteTemplateEngine(this.templateEngine));
    }

    public static TemplateEngine create(@NonNull Environment environment, @NonNull Path sourceDirectory, @Nullable Path classDirectory) {
        boolean dev = environment.isActive("dev", new String[]{"test"});
        if (dev) {
            Objects.requireNonNull(sourceDirectory, "Source directory is required.");
            Path requiredClassDirectory = (Path)Optional.ofNullable(classDirectory).orElseGet(() -> sourceDirectory.resolve("jte-classes"));
            TemplateEngine engine = TemplateEngine.create(new DirectoryCodeResolver(sourceDirectory), requiredClassDirectory, ContentType.Html, environment.getClassLoader());
            Optional<List<String>> var10000 = Optional.ofNullable(System.getProperty("jooby.run.classpath")).map((it) -> it.split(File.pathSeparator)).map(Stream::of).map(Stream::toList);
            Objects.requireNonNull(engine);
            var10000.ifPresent(engine::setClassPath);
            return engine;
        } else {
            return classDirectory == null ? TemplateEngine.createPrecompiled(ContentType.Html) : TemplateEngine.createPrecompiled(classDirectory, ContentType.Html);
        }
    }
}
@@ -0,0 +1,48 @@ (new file)
package nu.marginalia.service.server.jte;

import edu.umd.cs.findbugs.annotations.NonNull;
import gg.jte.TemplateEngine;
import io.jooby.Context;
import io.jooby.MapModelAndView;
import io.jooby.ModelAndView;
import io.jooby.buffer.DataBuffer;
import io.jooby.internal.jte.DataBufferOutput;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;

// Temporary workaround for a bug
// APL-2.0 https://github.com/jooby-project/jooby
class JteTemplateEngine implements io.jooby.TemplateEngine {
    private final TemplateEngine jte;
    private final List<String> extensions;

    public JteTemplateEngine(TemplateEngine jte) {
        this.jte = jte;
        this.extensions = List.of(".jte", ".kte");
    }


    @NonNull @Override
    public List<String> extensions() {
        return extensions;
    }

    @Override
    public DataBuffer render(Context ctx, ModelAndView modelAndView) {
        var buffer = ctx.getBufferFactory().allocateBuffer();
        var output = new DataBufferOutput(buffer, StandardCharsets.UTF_8);
        var attributes = ctx.getAttributes();
        if (modelAndView instanceof MapModelAndView mapModelAndView) {
            var mapModel = new HashMap<String, Object>();
            mapModel.putAll(attributes);
            mapModel.putAll(mapModelAndView.getModel());
            jte.render(modelAndView.getView(), mapModel, output);
        } else {
            jte.render(modelAndView.getView(), modelAndView.getModel(), output);
        }

        return buffer;
    }
}
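For readers unfamiliar with gg.jte: the engine wrapped above can also be driven directly, outside of Jooby. A minimal sketch, assuming a hypothetical precompiled template hello.jte that takes a single name parameter:

```java
import gg.jte.ContentType;
import gg.jte.TemplateEngine;
import gg.jte.output.StringOutput;

import java.util.Map;

public class JteRenderExample {
    public static void main(String[] args) {
        // Loads templates precompiled on the classpath, as the non-dev branch
        // of JteModule.create() above does
        TemplateEngine engine = TemplateEngine.createPrecompiled(ContentType.Html);

        var output = new StringOutput();
        // "hello.jte" is an invented template name for this illustration
        engine.render("hello.jte", Map.of("name", "world"), output);

        System.out.println(output);
    }
}
```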
@@ -3,7 +3,6 @@ package nu.marginalia.service.server.mq;
 import nu.marginalia.mq.MqMessage;
 import nu.marginalia.mq.inbox.MqInboxResponse;
 import nu.marginalia.mq.inbox.MqSubscription;
-import nu.marginalia.service.server.Service;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -15,10 +14,10 @@ import java.util.Map;
 public class ServiceMqSubscription implements MqSubscription {
     private static final Logger logger = LoggerFactory.getLogger(ServiceMqSubscription.class);
     private final Map<String, Method> requests = new HashMap<>();
-    private final Service service;
+    private final Object service;
 
 
-    public ServiceMqSubscription(Service service) {
+    public ServiceMqSubscription(Object service) {
         this.service = service;
 
         /* Wire up all methods annotated with @MqRequest and @MqNotification
@@ -6,4 +6,8 @@ public record BrowseResultSet(Collection<BrowseResult> results, String focusDoma
     public BrowseResultSet(Collection<BrowseResult> results) {
         this(results, "");
     }
+
+    public boolean hasFocusDomain() {
+        return focusDomain != null && !focusDomain.isBlank();
+    }
 }
@@ -38,6 +38,7 @@ public class DomainsProtobufCodec {
                 sd.getIndexed(),
                 sd.getActive(),
                 sd.getScreenshot(),
+                sd.getFeed(),
                 SimilarDomain.LinkType.valueOf(sd.getLinkType().name())
         );
     }
@@ -71,6 +71,23 @@ public class DomainInformation {
         return new String(Character.toChars(firstChar)) + new String(Character.toChars(secondChar));
     }
 
+    public String getAsnFlag() {
+        if (asnCountry == null || asnCountry.codePointCount(0, asnCountry.length()) != 2) {
+            return "";
+        }
+        String country = asnCountry;
+
+        if ("UK".equals(country)) {
+            country = "GB";
+        }
+
+        int offset = 0x1F1E6;
+        int asciiOffset = 0x41;
+        int firstChar = Character.codePointAt(country, 0) - asciiOffset + offset;
+        int secondChar = Character.codePointAt(country, 1) - asciiOffset + offset;
+        return new String(Character.toChars(firstChar)) + new String(Character.toChars(secondChar));
+    }
+
     public EdgeDomain getDomain() {
         return this.domain;
     }
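Background on the flag logic above: a two-letter ISO 3166 country code becomes a flag emoji by shifting each ASCII letter into the Unicode regional indicator block U+1F1E6..U+1F1FF ('A', 0x41, maps to U+1F1E6). A standalone sketch of the same arithmetic, with class and method names invented for illustration:

```java
public class FlagEmojiExample {
    // Maps a two-letter ISO 3166 code to regional indicator symbols,
    // mirroring the arithmetic in getAsnFlag() above
    static String toFlagEmoji(String countryCode) {
        int offset = 0x1F1E6 - 'A'; // 'A' -> U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A
        StringBuilder sb = new StringBuilder();
        countryCode.codePoints()
                   .map(cp -> cp + offset)
                   .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toFlagEmoji("SE")); // 🇸🇪
        System.out.println(toFlagEmoji("GB")); // 🇬🇧 ("UK" must be normalized to "GB" first)
    }
}
```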
@@ -9,6 +9,7 @@ public record SimilarDomain(EdgeUrl url,
                             boolean indexed,
                             boolean active,
                             boolean screenshot,
+                            boolean feed,
                             LinkType linkType) {
 
     public String getRankSymbols() {
@@ -52,12 +53,12 @@ public record SimilarDomain(EdgeUrl url,
             return NONE;
         }
 
-        public String toString() {
+        public String faIcon() {
             return switch (this) {
-                case FOWARD -> "→";
-                case BACKWARD -> "←";
-                case BIDIRECTIONAL -> "⇆";
-                case NONE -> "-";
+                case FOWARD -> "fa-solid fa-arrow-right";
+                case BACKWARD -> "fa-solid fa-arrow-left";
+                case BIDIRECTIONAL -> "fa-solid fa-arrow-right-arrow-left";
+                case NONE -> "";
             };
         }
@@ -101,6 +101,7 @@ message RpcSimilarDomain {
       bool active = 6;
      bool screenshot = 7;
      LINK_TYPE linkType = 8;
+      bool feed = 9;
 
      enum LINK_TYPE {
          BACKWARD = 0;
@@ -9,6 +9,7 @@ import gnu.trove.map.hash.TIntIntHashMap;
 import gnu.trove.set.TIntSet;
 import gnu.trove.set.hash.TIntHashSet;
 import it.unimi.dsi.fastutil.ints.Int2DoubleArrayMap;
+import nu.marginalia.WmsaHome;
 import nu.marginalia.api.domains.RpcSimilarDomain;
 import nu.marginalia.api.domains.model.SimilarDomain;
 import nu.marginalia.api.linkgraph.AggregateLinkGraphClient;
@@ -17,10 +18,14 @@ import org.roaringbitmap.RoaringBitmap;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import java.nio.file.Path;
+import java.sql.DriverManager;
 import java.sql.ResultSet;
 import java.sql.SQLException;
 import java.util.ArrayList;
+import java.util.HashSet;
 import java.util.List;
+import java.util.Set;
 import java.util.concurrent.Executors;
 import java.util.concurrent.ScheduledExecutorService;
 import java.util.concurrent.TimeUnit;
@@ -32,12 +37,13 @@ public class SimilarDomainsService {
     private final HikariDataSource dataSource;
     private final AggregateLinkGraphClient linkGraphClient;
 
-    private volatile TIntIntHashMap domainIdToIdx = new TIntIntHashMap(100_000);
+    private final TIntIntHashMap domainIdToIdx = new TIntIntHashMap(100_000);
     private volatile int[] domainIdxToId;
 
     public volatile Int2DoubleArrayMap[] relatedDomains;
     public volatile TIntList[] domainNeighbors = null;
     public volatile RoaringBitmap screenshotDomains = null;
+    public volatile RoaringBitmap feedDomains = null;
     public volatile RoaringBitmap activeDomains = null;
     public volatile RoaringBitmap indexedDomains = null;
     public volatile TIntDoubleHashMap domainRanks = null;
@@ -82,6 +88,7 @@ public class SimilarDomainsService {
             domainNames = new String[domainIdToIdx.size()];
             domainNeighbors = new TIntList[domainIdToIdx.size()];
             screenshotDomains = new RoaringBitmap();
+            feedDomains = new RoaringBitmap();
             activeDomains = new RoaringBitmap();
             indexedDomains = new RoaringBitmap();
             relatedDomains = new Int2DoubleArrayMap[domainIdToIdx.size()];
@@ -145,10 +152,12 @@ public class SimilarDomainsService {
                 activeDomains.add(idx);
             }
 
-            updateScreenshotInfo();
-
             logger.info("Loaded {} domains", domainRanks.size());
             isReady = true;
+
+            // We can defer these as they only populate a roaringbitmap, and will degrade gracefully when not complete
+            updateScreenshotInfo();
+            updateFeedInfo();
         }
     }
     catch (SQLException throwables) {
@@ -156,6 +165,42 @@ public class SimilarDomainsService {
         }
     }
 
+    private void updateFeedInfo() {
+        Set<String> feedsDomainNames = new HashSet<>(500_000);
+        Path readerDbPath = WmsaHome.getDataPath().resolve("rss-feeds.db").toAbsolutePath();
+        String dbUrl = "jdbc:sqlite:" + readerDbPath;
+
+        logger.info("Opening feed db at " + dbUrl);
+
+        try (var conn = DriverManager.getConnection(dbUrl);
+             var stmt = conn.createStatement()) {
+            var rs = stmt.executeQuery("""
+                select
+                    json_extract(feed, '$.domain') as domain
+                from feed
+                where json_array_length(feed, '$.items') > 0
+                """);
+            while (rs.next()) {
+                feedsDomainNames.add(rs.getString(1));
+            }
+        }
+        catch (SQLException ex) {
+            logger.error("Failed to read RSS feed items", ex);
+        }
+
+        for (int idx = 0; idx < domainNames.length; idx++) {
+            String name = domainNames[idx];
+            if (name == null) {
+                continue;
+            }
+
+            if (feedsDomainNames.contains(name)) {
+                feedDomains.add(idx);
+            }
+        }
+
+    }
+
     private void updateScreenshotInfo() {
         try (var connection = dataSource.getConnection()) {
             try (var stmt = connection.createStatement()) {
@@ -254,6 +299,7 @@ public class SimilarDomainsService {
                     .setIndexed(indexedDomains.contains(idx))
                     .setActive(activeDomains.contains(idx))
                     .setScreenshot(screenshotDomains.contains(idx))
+                    .setFeed(feedDomains.contains(idx))
                     .setLinkType(RpcSimilarDomain.LINK_TYPE.valueOf(linkType.name()))
                     .build());
@@ -369,6 +415,7 @@ public class SimilarDomainsService {
                    .setIndexed(indexedDomains.contains(idx))
                    .setActive(activeDomains.contains(idx))
                    .setScreenshot(screenshotDomains.contains(idx))
+                    .setFeed(feedDomains.contains(idx))
                    .setLinkType(RpcSimilarDomain.LINK_TYPE.valueOf(linkType.name()))
                    .build());
@@ -5,6 +5,7 @@ import com.google.inject.Singleton;
 import nu.marginalia.api.livecapture.LiveCaptureApiGrpc.LiveCaptureApiBlockingStub;
 import nu.marginalia.service.client.GrpcChannelPoolFactory;
 import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
+import nu.marginalia.service.client.ServiceNotAvailableException;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
 import org.slf4j.Logger;
@@ -29,6 +30,9 @@ public class LiveCaptureClient {
             channelPool.call(LiveCaptureApiBlockingStub::requestScreengrab)
                     .run(RpcDomainId.newBuilder().setDomainId(domainId).build());
         }
+        catch (ServiceNotAvailableException e) {
+            logger.info("requestScreengrab() failed since the service is not available");
+        }
         catch (Exception e) {
             logger.error("API Exception", e);
         }
@@ -24,6 +24,7 @@ dependencies {
     implementation project(':code:libraries:message-queue')
 
     implementation project(':code:execution:api')
+    implementation project(':code:processes:crawling-process:ft-content-type')
 
     implementation libs.jsoup
     implementation libs.rssreader
@@ -8,6 +8,7 @@ import nu.marginalia.rss.model.FeedDefinition;
 import nu.marginalia.rss.model.FeedItems;
 import nu.marginalia.service.module.ServiceConfiguration;
 import org.jetbrains.annotations.NotNull;
+import org.jetbrains.annotations.Nullable;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -127,6 +128,26 @@ public class FeedDb {
         return FeedItems.none();
     }
 
+    @Nullable
+    public String getEtag(EdgeDomain domain) {
+        if (!feedDbEnabled) {
+            throw new IllegalStateException("Feed database is disabled on this node");
+        }
+
+        // Capture the current reader to avoid concurrency issues
+        FeedDbReader reader = this.reader;
+        try {
+            if (reader != null) {
+                return reader.getEtag(domain);
+            }
+        }
+        catch (Exception e) {
+            logger.error("Error getting etag for " + domain, e);
+        }
+        return null;
+    }
+
     public Optional<String> getFeedAsJson(String domain) {
         if (!feedDbEnabled) {
             throw new IllegalStateException("Feed database is disabled on this node");
@@ -214,7 +235,7 @@ public class FeedDb {
 
     public Instant getFetchTime() {
         if (!Files.exists(readerDbPath)) {
-            return Instant.ofEpochMilli(0);
+            return Instant.EPOCH;
         }
 
         try {
@@ -224,7 +245,23 @@ public class FeedDb {
         }
         catch (IOException ex) {
             logger.error("Failed to read the creatiom time of {}", readerDbPath);
-            return Instant.ofEpochMilli(0);
+            return Instant.EPOCH;
         }
     }
 
+    public boolean hasData() {
+        if (!feedDbEnabled) {
+            throw new IllegalStateException("Feed database is disabled on this node");
+        }
+
+        // Capture the current reader to avoid concurrency issues
+        FeedDbReader reader = this.reader;
+
+        if (reader != null) {
+            return reader.hasData();
+        }
+
+        return false;
+    }
+
 }
@@ -8,6 +8,7 @@ import nu.marginalia.rss.model.FeedItems;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import javax.annotation.Nullable;
 import java.nio.file.Path;
 import java.sql.Connection;
 import java.sql.DriverManager;
@@ -32,6 +33,7 @@ public class FeedDbReader implements AutoCloseable {
         try (var stmt = connection.createStatement()) {
             stmt.executeUpdate("CREATE TABLE IF NOT EXISTS feed (domain TEXT PRIMARY KEY, feed JSON)");
             stmt.executeUpdate("CREATE TABLE IF NOT EXISTS errors (domain TEXT PRIMARY KEY, cnt INT DEFAULT 0)");
+            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS etags (domain TEXT PRIMARY KEY, etag TEXT)");
         }
     }
 
@@ -106,6 +108,22 @@ public class FeedDbReader implements AutoCloseable {
         return FeedItems.none();
     }
 
+    @Nullable
+    public String getEtag(EdgeDomain domain) {
+        try (var stmt = connection.prepareStatement("SELECT etag FROM etags WHERE DOMAIN = ?")) {
+            stmt.setString(1, domain.toString());
+            var rs = stmt.executeQuery();
+
+            if (rs.next()) {
+                return rs.getString(1);
+            }
+        } catch (SQLException e) {
+            logger.error("Error getting etag for " + domain, e);
+        }
+
+        return null;
+    }
+
     private FeedItems deserialize(String string) {
         return gson.fromJson(string, FeedItems.class);
     }
@@ -141,4 +159,18 @@ public class FeedDbReader implements AutoCloseable {
     }
 
 
+    public boolean hasData() {
+        try (var stmt = connection.prepareStatement("SELECT 1 FROM feed LIMIT 1")) {
+            var rs = stmt.executeQuery();
+            if (rs.next()) {
+                return rs.getBoolean(1);
+            }
+            else {
+                return false;
+            }
+        }
+        catch (SQLException ex) {
+            return false;
+        }
+    }
 }
@@ -20,6 +20,7 @@ public class FeedDbWriter implements AutoCloseable {
     private final Connection connection;
     private final PreparedStatement insertFeedStmt;
     private final PreparedStatement insertErrorStmt;
+    private final PreparedStatement insertEtagStmt;
     private final Path dbPath;
 
     private volatile boolean closed = false;
@@ -34,10 +35,12 @@ public class FeedDbWriter implements AutoCloseable {
         try (var stmt = connection.createStatement()) {
             stmt.executeUpdate("CREATE TABLE IF NOT EXISTS feed (domain TEXT PRIMARY KEY, feed JSON)");
             stmt.executeUpdate("CREATE TABLE IF NOT EXISTS errors (domain TEXT PRIMARY KEY, cnt INT DEFAULT 0)");
+            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS etags (domain TEXT PRIMARY KEY, etag TEXT)");
         }
 
         insertFeedStmt = connection.prepareStatement("INSERT INTO feed (domain, feed) VALUES (?, ?)");
         insertErrorStmt = connection.prepareStatement("INSERT INTO errors (domain, cnt) VALUES (?, ?)");
+        insertEtagStmt = connection.prepareStatement("INSERT INTO etags (domain, etag) VALUES (?, ?)");
     }
 
     public Path getDbPath() {
@@ -56,6 +59,20 @@ public class FeedDbWriter implements AutoCloseable {
         }
     }
 
+    public synchronized void saveEtag(String domain, String etag) {
+        if (etag == null || etag.isBlank())
+            return;
+
+        try {
+            insertEtagStmt.setString(1, domain.toLowerCase());
+            insertEtagStmt.setString(2, etag);
+            insertEtagStmt.executeUpdate();
+        }
+        catch (SQLException e) {
+            logger.error("Error saving etag for " + domain, e);
+        }
+    }
+
     public synchronized void setErrorCount(String domain, int count) {
         try {
             insertErrorStmt.setString(1, domain);
@@ -5,6 +5,8 @@ import com.apptasticsoftware.rssreader.RssReader;
 import com.google.inject.Inject;
 import com.opencsv.CSVReader;
 import nu.marginalia.WmsaHome;
+import nu.marginalia.contenttype.ContentType;
+import nu.marginalia.contenttype.DocumentBodyToString;
 import nu.marginalia.executor.client.ExecutorClient;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.nodecfg.NodeConfigurationService;
@@ -32,9 +34,7 @@ import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.nio.charset.StandardCharsets;
 import java.sql.SQLException;
-import java.time.Duration;
-import java.time.LocalDateTime;
-import java.time.ZonedDateTime;
+import java.time.*;
 import java.time.format.DateTimeFormatter;
 import java.util.*;
 import java.util.concurrent.Executors;
@@ -59,7 +59,6 @@ public class FeedFetcherService {
     private final DomainLocks domainLocks = new DomainLocks();
 
     private volatile boolean updating;
-    private boolean deterministic = false;
 
     @Inject
     public FeedFetcherService(FeedDb feedDb,
@@ -91,11 +90,6 @@ public class FeedFetcherService {
         REFRESH
     };
 
-    /** Disable random-based heuristics. This is meant for testing */
-    public void setDeterministic() {
-        this.deterministic = true;
-    }
-
     public void updateFeeds(UpdateMode updateMode) throws IOException {
         if (updating) // Prevent concurrent updates
         {
@@ -135,37 +129,37 @@ public class FeedFetcherService {
             for (var feed : definitions) {
                 executor.submitQuietly(() -> {
                     try {
-                        var oldData = feedDb.getFeed(new EdgeDomain(feed.domain()));
+                        EdgeDomain domain = new EdgeDomain(feed.domain());
+                        var oldData = feedDb.getFeed(domain);
 
                         // If we have existing data, we might skip updating it with a probability that increases with time,
                         // this is to avoid hammering the feeds that are updated very rarely and save some time and resources
                         // on our end
+                        @Nullable
+                        String ifModifiedSinceDate = switch(updateMode) {
+                            case REFRESH -> getIfModifiedSinceDate(feedDb);
+                            case CLEAN -> null;
+                        };
+
+                        /* Disable for now:
 
                         if (!oldData.isEmpty()) {
                             Duration duration = feed.durationSinceUpdated();
                             long daysSinceUpdate = duration.toDays();
 
 
                             if (deterministic || (daysSinceUpdate > 2 && ThreadLocalRandom.current()
                                     .nextInt(1, 1 + (int) Math.min(10, daysSinceUpdate) / 2) > 1)) {
                                 // Skip updating this feed, just write the old data back instead
                                 writer.saveFeed(oldData);
                                 return;
                             }
                         }
+                        */
+
+                        @Nullable
+                        String ifNoneMatchTag = switch (updateMode) {
+                            case REFRESH -> feedDb.getEtag(domain);
+                            case CLEAN -> null;
+                        };
 
                         FetchResult feedData;
                         try (DomainLocks.DomainLock domainLock = domainLocks.lockDomain(new EdgeDomain(feed.domain()))) {
-                            feedData = fetchFeedData(feed, client);
+                            feedData = fetchFeedData(feed, client, ifModifiedSinceDate, ifNoneMatchTag);
                         } catch (Exception ex) {
                             feedData = new FetchResult.TransientError();
                         }
 
                         switch (feedData) {
-                            case FetchResult.Success(String value) -> writer.saveFeed(parseFeed(value, feed));
+                            case FetchResult.Success(String value, String etag) -> {
+                                writer.saveEtag(feed.domain(), etag);
+                                writer.saveFeed(parseFeed(value, feed));
+                            }
+                            case FetchResult.NotModified() -> {
+                                writer.saveEtag(feed.domain(), ifNoneMatchTag);
+                                writer.saveFeed(oldData);
+                            }
                             case FetchResult.TransientError() -> {
                                 int errorCount = errorCounts.getOrDefault(feed.domain().toLowerCase(), 0);
                                 writer.setErrorCount(feed.domain().toLowerCase(), ++errorCount);
@@ -212,30 +206,73 @@ public class FeedFetcherService {
         }
     }
 
-    private FetchResult fetchFeedData(FeedDefinition feed, HttpClient client) {
+    @Nullable
+    static String getIfModifiedSinceDate(FeedDb feedDb) {
+
+        // If the db is fresh, we don't send If-Modified-Since
+        if (!feedDb.hasData())
+            return null;
+
+        Instant cutoffInstant = feedDb.getFetchTime();
+
+        // If we're unable to establish fetch time, we don't send If-Modified-Since
+        if (cutoffInstant == Instant.EPOCH)
+            return null;
+
+        return cutoffInstant.atZone(ZoneId.of("GMT")).format(DateTimeFormatter.RFC_1123_DATE_TIME);
+    }
+
+    private FetchResult fetchFeedData(FeedDefinition feed,
+                                      HttpClient client,
+                                      @Nullable String ifModifiedSinceDate,
+                                      @Nullable String ifNoneMatchTag)
+    {
        try {
            URI uri = new URI(feed.feedUrl());
 
-            HttpRequest getRequest = HttpRequest.newBuilder()
+            HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
                    .GET()
                    .uri(uri)
                    .header("User-Agent", WmsaHome.getUserAgent().uaIdentifier())
+                    .header("Accept-Encoding", "gzip")
                    .header("Accept", "text/*, */*;q=0.9")
                    .timeout(Duration.ofSeconds(15))
-                    .build();
+                    ;
 
+            if (ifModifiedSinceDate != null) {
+                requestBuilder.header("If-Modified-Since", ifModifiedSinceDate);
+            }
+
+            if (ifNoneMatchTag != null) {
+                requestBuilder.header("If-None-Match", ifNoneMatchTag);
+            }
+
+            HttpRequest getRequest = requestBuilder.build();
+
            for (int i = 0; i < 3; i++) {
-                var rs = client.send(getRequest, HttpResponse.BodyHandlers.ofString());
-                if (429 == rs.statusCode()) {
+                HttpResponse<byte[]> rs = client.send(getRequest, HttpResponse.BodyHandlers.ofByteArray());
+
+                if (rs.statusCode() == 429) { // Too Many Requests
                    int retryAfter = Integer.parseInt(rs.headers().firstValue("Retry-After").orElse("2"));
                    Thread.sleep(Duration.ofSeconds(Math.clamp(retryAfter, 1, 5)));
-                } else if (200 == rs.statusCode()) {
-                    return new FetchResult.Success(rs.body());
-                } else if (404 == rs.statusCode()) {
-                    return new FetchResult.PermanentError(); // never try again
-                } else {
-                    return new FetchResult.TransientError(); // we try again in a few days
+                    continue;
                }
+
+                String newEtagValue = rs.headers().firstValue("ETag").orElse("");
+
+                return switch (rs.statusCode()) {
+                    case 200 -> {
+                        byte[] responseData = getResponseData(rs);
+
+                        String contentType = rs.headers().firstValue("Content-Type").orElse("");
+                        String bodyText = DocumentBodyToString.getStringData(ContentType.parse(contentType), responseData);
+
+                        yield new FetchResult.Success(bodyText, newEtagValue);
+                    }
+                    case 304 -> new FetchResult.NotModified(); // via If-Modified-Since semantics
+                    case 404 -> new FetchResult.PermanentError(); // never try again
+                    default -> new FetchResult.TransientError(); // we try again later
                };
            }
        }
        catch (Exception ex) {
@@ -245,8 +282,22 @@ public class FeedFetcherService {
         return new FetchResult.TransientError();
     }
 
+    private byte[] getResponseData(HttpResponse<byte[]> response) throws IOException {
+        String encoding = response.headers().firstValue("Content-Encoding").orElse("");
+
+        if ("gzip".equals(encoding)) {
+            try (var stream = new GZIPInputStream(new ByteArrayInputStream(response.body()))) {
+                return stream.readAllBytes();
+            }
+        }
+        else {
+            return response.body();
+        }
+    }
+
     public sealed interface FetchResult {
-        record Success(String value) implements FetchResult {}
+        record Success(String value, String etag) implements FetchResult {}
+        record NotModified() implements FetchResult {}
         record TransientError() implements FetchResult {}
         record PermanentError() implements FetchResult {}
     }
@@ -351,6 +402,7 @@ public class FeedFetcherService {
             "&ndash;", "-",
             "&rsquo;", "'",
             "&lsquo;", "'",
+            "&quot;", "\"",
             "&nbsp;", ""
     );
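A note on the conditional-fetch logic above: If-Modified-Since carries an RFC 1123 date in GMT, and a 304 response means neither validator changed, which is why the NotModified branch re-saves the old feed data. A small illustrative sketch of producing such a header value with the same formatter the diff uses:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class IfModifiedSinceExample {
    public static void main(String[] args) {
        // 784111777 is the classic HTTP-date example timestamp (Nov 6, 1994)
        String headerValue = Instant.ofEpochSecond(784111777)
                .atZone(ZoneId.of("GMT"))
                .format(DateTimeFormatter.RFC_1123_DATE_TIME);

        // Prints "Sun, 6 Nov 1994 08:49:37 GMT" -- note Java's formatter omits
        // the leading zero on the day-of-month, which servers accept in practice
        System.out.println(headerValue);
    }
}
```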
@@ -96,7 +96,6 @@ class FeedFetcherServiceTest extends AbstractModule {
             feedDb.switchDb(writer);
         }
 
-        feedFetcherService.setDeterministic();
         feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);
 
         var result = feedDb.getFeed(new EdgeDomain("www.marginalia.nu"));
@@ -104,6 +103,26 @@ class FeedFetcherServiceTest extends AbstractModule {
         Assertions.assertFalse(result.isEmpty());
     }
 
+    @Tag("flaky")
+    @Test
+    public void testFetchRepeatedly() throws Exception {
+        try (var writer = feedDb.createWriter()) {
+            writer.saveFeed(new FeedItems("www.marginalia.nu", "https://www.marginalia.nu/log/index.xml", "", List.of()));
+            feedDb.switchDb(writer);
+        }
+
+        feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);
+        Assertions.assertNotNull(feedDb.getEtag(new EdgeDomain("www.marginalia.nu")));
+        feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);
+        Assertions.assertNotNull(feedDb.getEtag(new EdgeDomain("www.marginalia.nu")));
+        feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);
+        Assertions.assertNotNull(feedDb.getEtag(new EdgeDomain("www.marginalia.nu")));
+
+        var result = feedDb.getFeed(new EdgeDomain("www.marginalia.nu"));
+        System.out.println(result);
+        Assertions.assertFalse(result.isEmpty());
+    }
+
     @Tag("flaky")
     @Test
     public void test404() throws Exception {
@@ -112,7 +131,6 @@ class FeedFetcherServiceTest extends AbstractModule {
             feedDb.switchDb(writer);
         }
 
-        feedFetcherService.setDeterministic();
         feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);
 
         // We forget the feed on a 404 error
@@ -10,7 +10,6 @@ public class TestXmlSanitization {
         Assertions.assertEquals("&amp;", FeedFetcherService.sanitizeEntities("&amp;"));
         Assertions.assertEquals("&lt;", FeedFetcherService.sanitizeEntities("&lt;"));
         Assertions.assertEquals("&gt;", FeedFetcherService.sanitizeEntities("&gt;"));
-        Assertions.assertEquals("&quot;", FeedFetcherService.sanitizeEntities("&quot;"));
         Assertions.assertEquals("&apos;", FeedFetcherService.sanitizeEntities("&apos;"));
     }
 
@@ -23,4 +22,9 @@ public class TestXmlSanitization {
     public void testTranslatedHtmlEntity() {
         Assertions.assertEquals("Foo -- Bar", FeedFetcherService.sanitizeEntities("Foo &mdash; Bar"));
     }
+
+    @Test
+    public void testTranslatedHtmlEntityQuot() {
+        Assertions.assertEquals("\"Bob\"", FeedFetcherService.sanitizeEntities("&quot;Bob&quot;"));
+    }
 }
@@ -7,4 +7,8 @@ public record DictionaryResponse(String word, List<DictionaryEntry> entries) {
         this.word = word;
         this.entries = entries.stream().toList(); // Make an immutable copy
     }
+
+    public boolean hasEntries() {
+        return !entries.isEmpty();
+    }
 }
@@ -9,10 +9,9 @@ import nu.marginalia.service.client.GrpcChannelPoolFactory;
 import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
 
 import javax.annotation.CheckReturnValue;
+import java.time.Duration;
 
 @Singleton
 public class QueryClient {
@@ -24,13 +23,14 @@ public class QueryClient {
 
     private final GrpcSingleNodeChannelPool<QueryApiGrpc.QueryApiBlockingStub> queryApiPool;
 
-    private final Logger logger = LoggerFactory.getLogger(getClass());
-
     @Inject
-    public QueryClient(GrpcChannelPoolFactory channelPoolFactory) {
+    public QueryClient(GrpcChannelPoolFactory channelPoolFactory) throws InterruptedException {
         this.queryApiPool = channelPoolFactory.createSingle(
                 ServiceKey.forGrpcApi(QueryApiGrpc.class, ServicePartition.any()),
                 QueryApiGrpc::newBlockingStub);
+
+        // Hold up initialization until we have a downstream connection
+        this.queryApiPool.awaitChannel(Duration.ofSeconds(5));
     }
 
     @CheckReturnValue
@@ -71,6 +71,17 @@ public class QueryFactory {

        String[] parts = StringUtils.split(str, '_');

+       // Trim down tokens to match the behavior of the tokenizer used in indexing
+       for (int i = 0; i < parts.length; i++) {
+           String part = parts[i];
+
+           if (part.endsWith("'s") && part.length() > 2) {
+               part = part.substring(0, part.length()-2);
+           }
+
+           parts[i] = part;
+       }
+
        if (parts.length > 1) {
            // Require that the terms appear in sequence
            queryBuilder.phraseConstraint(SearchPhraseConstraint.mandatory(parts));

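To make the new trimming step concrete, the standalone snippet below replays the loop above on a sample term; only the input string is invented:

```java
import org.apache.commons.lang3.StringUtils;
import java.util.Arrays;

// Replays the tokenizer-matching trim from the QueryFactory diff above.
public class TrimDemo {
    public static void main(String[] args) {
        String str = "bob's_burgers"; // hypothetical query term
        String[] parts = StringUtils.split(str, '_');
        for (int i = 0; i < parts.length; i++) {
            String part = parts[i];
            // Possessive suffixes are not indexed, so strip them here too
            if (part.endsWith("'s") && part.length() > 2) {
                part = part.substring(0, part.length() - 2);
            }
            parts[i] = part;
        }
        System.out.println(Arrays.toString(parts)); // [bob, burgers]
    }
}
```
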
@@ -25,6 +25,7 @@ public class QueryExpansion {
            this::joinDashes,
            this::splitWordNum,
            this::joinTerms,
+           this::categoryKeywords,
            this::ngramAll
    );

@@ -98,6 +99,24 @@ public class QueryExpansion {
        }
    }

+   // Category keyword substitution, e.g. guitar wiki -> guitar generator:wiki
+   public void categoryKeywords(QWordGraph graph) {
+
+       for (var qw : graph) {
+
+           // Ensure we only perform the substitution on the last word in the query
+           if (!graph.getNextOriginal(qw).getFirst().isEnd()) {
+               continue;
+           }
+
+           switch (qw.word()) {
+               case "recipe", "recipes" -> graph.addVariant(qw, "category:food");
+               case "forum" -> graph.addVariant(qw, "generator:wiki".equals("") ? "" : "generator:forum");
+               case "wiki" -> graph.addVariant(qw, "generator:wiki");
+           }
+       }
+   }
+
    // Turn 'lawn chair' into 'lawnchair'
    public void joinTerms(QWordGraph graph) {
        QWord prev = null;

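The substitution only fires on the last word of the query, so "pie recipe" gains a `category:food` variant while "recipe pie" does not (the QueryFactoryTest changes further down assert exactly this). A minimal sketch of the same rule on a plain token list, standing in for the real `QWordGraph`:

```java
import java.util.List;
import java.util.Optional;

// Sketch of the last-word category substitution from QueryExpansion above,
// using a plain List<String> instead of the real QWordGraph.
public class CategoryDemo {
    static Optional<String> categoryVariant(List<String> words) {
        return switch (words.getLast()) {
            case "recipe", "recipes" -> Optional.of("category:food");
            case "forum" -> Optional.of("generator:forum");
            case "wiki" -> Optional.of("generator:wiki");
            default -> Optional.empty();
        };
    }

    public static void main(String[] args) {
        System.out.println(categoryVariant(List.of("pie", "recipe"))); // Optional[category:food]
        System.out.println(categoryVariant(List.of("recipe", "pie"))); // Optional.empty
    }
}
```
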
@@ -155,16 +155,25 @@ public class QueryParser {

        // Remove trailing punctuation
        int lastChar = str.charAt(str.length() - 1);
-       if (":.,!?$'".indexOf(lastChar) >= 0)
-           entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 1), lt.displayStr()));
+       if (":.,!?$'".indexOf(lastChar) >= 0) {
+           str = str.substring(0, str.length() - 1);
+           entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+       }

        // Remove term elements that aren't indexed by the search engine
-       if (str.endsWith("'s"))
-           entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 2), lt.displayStr()));
-       if (str.endsWith("()"))
-           entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 2), lt.displayStr()));
-       if (str.startsWith("$"))
-           entity.replace(new QueryToken.LiteralTerm(str.substring(1), lt.displayStr()));
+       if (str.endsWith("'s")) {
+           str = str.substring(0, str.length() - 2);
+           entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+       }
+       if (str.endsWith("()")) {
+           str = str.substring(0, str.length() - 2);
+           entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+       }

+       while (str.startsWith("$") || str.startsWith("_")) {
+           str = str.substring(1);
+           entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+       }

        if (entity.isBlank()) {
            entity.remove();

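The key behavioral fix: the old code computed each substring from the original `str` without updating it, so the trims did not compose; the new code rewrites `str` at each step, and the final while-loop strips any run of leading `$` or `_` rather than a single `$`. A self-contained replay of that loop, matching the `__builtin_ffs` case from issue 140 that the new QueryParserTest below asserts:

```java
// Replays the prefix-stripping loop from the QueryParser diff above.
public class StripDemo {
    public static void main(String[] args) {
        String str = "__builtin_ffs";
        // Leading '$' and '_' are peeled off one character at a time
        while (str.startsWith("$") || str.startsWith("_")) {
            str = str.substring(1);
        }
        System.out.println(str); // builtin_ffs
    }
}
```
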
@@ -1,165 +0,0 @@
package nu.marginalia.util.language;

import com.google.inject.Inject;
import nu.marginalia.term_frequency_dict.TermFrequencyDict;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class EnglishDictionary {
    private final Set<String> englishWords = new HashSet<>();
    private final TermFrequencyDict tfDict;
    private final Logger logger = LoggerFactory.getLogger(getClass());

    @Inject
    public EnglishDictionary(TermFrequencyDict tfDict) {
        this.tfDict = tfDict;
        try (var resource = Objects.requireNonNull(ClassLoader.getSystemResourceAsStream("dictionary/en-words"),
                "Could not load word frequency table");
             var br = new BufferedReader(new InputStreamReader(resource))
        ) {
            for (;;) {
                String s = br.readLine();
                if (s == null) {
                    break;
                }
                englishWords.add(s.toLowerCase());
            }
        }
        catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public boolean isWord(String word) {
        return englishWords.contains(word);
    }

    private static final Pattern ingPattern = Pattern.compile(".*(\\w)\\1ing$");

    public Collection<String> getWordVariants(String s) {
        var variants = findWordVariants(s);

        var ret = variants.stream()
                .filter(var -> tfDict.getTermFreq(var) > 100)
                .collect(Collectors.toList());

        if (s.equals("recipe") || s.equals("recipes")) {
            ret.add("category:food");
        }

        return ret;
    }


    public Collection<String> findWordVariants(String s) {
        int sl = s.length();

        if (sl < 2) {
            return Collections.emptyList();
        }
        if (s.endsWith("s")) {
            String a = s.substring(0, sl-1);
            String b = s + "es";
            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        if (s.endsWith("sm")) {
            String a = s.substring(0, sl-1)+"t";
            String b = s.substring(0, sl-1)+"ts";
            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        if (s.endsWith("st")) {
            String a = s.substring(0, sl-1)+"m";
            String b = s + "s";
            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        else if (ingPattern.matcher(s).matches() && sl > 4) { // humming, clapping
            var a = s.substring(0, sl-4);
            var b = s.substring(0, sl-3) + "ed";

            if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
        }
        else {
            String a = s + "s";
            String b = ingForm(s);
            String c = s + "ed";

            if (isWord(a) && isWord(b) && isWord(c)) {
                return List.of(a, b, c);
            }
            else if (isWord(a) && isWord(b)) {
                return List.of(a, b);
            }
            else if (isWord(b) && isWord(c)) {
                return List.of(b, c);
            }
            else if (isWord(a) && isWord(c)) {
                return List.of(a, c);
            }
            else if (isWord(a)) {
                return List.of(a);
            }
            else if (isWord(b)) {
                return List.of(b);
            }
            else if (isWord(c)) {
                return List.of(c);
            }
        }

        return Collections.emptyList();
    }

    public String ingForm(String s) {
        if (s.endsWith("t") && !s.endsWith("tt")) {
            return s + "ting";
        }
        if (s.endsWith("n") && !s.endsWith("nn")) {
            return s + "ning";
        }
        if (s.endsWith("m") && !s.endsWith("mm")) {
            return s + "ming";
        }
        if (s.endsWith("r") && !s.endsWith("rr")) {
            return s + "ring";
        }
        return s + "ing";
    }
}

@@ -0,0 +1,32 @@
package nu.marginalia.functions.searchquery.query_parser;

import nu.marginalia.functions.searchquery.query_parser.token.QueryToken;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

import java.util.List;

class QueryParserTest {

    @Test
    // https://github.com/MarginaliaSearch/MarginaliaSearch/issues/140
    void parse__builtin_ffs() {
        QueryParser parser = new QueryParser();
        var tokens = parser.parse("__builtin_ffs");
        Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("builtin_ffs", "__builtin_ffs")), tokens);
    }

    @Test
    void trailingParens() {
        QueryParser parser = new QueryParser();
        var tokens = parser.parse("strcpy()");
        Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("strcpy", "strcpy()")), tokens);
    }

    @Test
    void trailingQuote() {
        QueryParser parser = new QueryParser();
        var tokens = parser.parse("bob's");
        Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("bob", "bob's")), tokens);
    }
}

@@ -12,6 +12,7 @@ import nu.marginalia.index.query.limit.SpecificationLimit;
import nu.marginalia.index.query.limit.SpecificationLimitType;
import nu.marginalia.segmentation.NgramLexicon;
import nu.marginalia.term_frequency_dict.TermFrequencyDict;
+import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

@@ -207,6 +208,28 @@ public class QueryFactoryTest {
        System.out.println(subquery);
    }

+   @Test
+   public void testQuotedApostrophe() {
+       var subquery = parseAndGetSpecs("\"bob's cars\"");
+
+       System.out.println(subquery);
+
+       Assertions.assertTrue(subquery.query.compiledQuery.contains(" bob "));
+       Assertions.assertFalse(subquery.query.compiledQuery.contains(" bob's "));
+       Assertions.assertEquals("\"bob's cars\"", subquery.humanQuery);
+   }
+
+   @Test
+   public void testExpansion9() {
+       var subquery = parseAndGetSpecs("pie recipe");
+
+       Assertions.assertTrue(subquery.query.compiledQuery.contains(" category:food "));
+
+       subquery = parseAndGetSpecs("recipe pie");
+
+       Assertions.assertFalse(subquery.query.compiledQuery.contains(" category:food "));
+   }
+
    @Test
    public void testParsing() {
        var subquery = parseAndGetSpecs("strlen()");

@@ -85,7 +85,7 @@ class BTreeWriterTest {
    public void testWriteEntrySize2() throws IOException {
        BTreeContext ctx = new BTreeContext(4, 2, BTreeBlockSize.BS_64);

-       var tempFile = Files.createTempFile(Path.of("/tmp"), "tst", "dat");
+       var tempFile = Files.createTempFile("tst", "dat");

        int[] data = generateItems32(64);

@@ -27,7 +27,7 @@ public class SentenceSegmentSplitter {
        else {
            // If we flatten unicode, we do this...
            // FIXME: This can almost definitely be cleaned up and simplified.
-           wordBreakPattern = Pattern.compile("([^/_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
+           wordBreakPattern = Pattern.compile("([^/<>$:_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
        }
    }

@@ -90,12 +90,17 @@ public class SentenceSegmentSplitter {
        for (int i = 0; i < ret.size(); i++) {
            String part = ret.get(i);

+           if (part.startsWith("<") && part.endsWith(">") && part.length() > 2) {
+               ret.set(i, part.substring(1, part.length() - 1));
+           }
+
            if (part.startsWith("'") && part.length() > 1) {
                ret.set(i, part.substring(1));
            }
            if (part.endsWith("'") && part.length() > 1) {
                ret.set(i, part.substring(0, part.length()-1));
            }

            while (part.endsWith(".")) {
                part = part.substring(0, part.length()-1);
                ret.set(i, part);

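The regex change adds `<`, `>`, `$` and `:` to the negated character class, so they no longer act as word breaks; combined with the angle-bracket stripping above, tokens like `std::vector` and `$_GET` survive segmentation intact (the SentenceExtractorTest additions below verify this). A small demonstration of the widened pattern:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Demonstrates that the widened word-break pattern above keeps
// programming-language tokens together when splitting on it.
public class WordBreakDemo {
    private static final Pattern wordBreakPattern = Pattern.compile(
            "([^/<>$:_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");

    public static void main(String[] args) {
        String[] parts = wordBreakPattern.split("use std::vector or $_GET");
        System.out.println(Arrays.toString(parts)); // [use, std::vector, or, $_GET]
    }
}
```
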
@@ -28,6 +28,20 @@ class SentenceExtractorTest {
        System.out.println(dld);
    }

+   @Test
+   void testCplusplus() {
+       var dld = sentenceExtractor.extractSentence("std::vector", EnumSet.noneOf(HtmlTag.class));
+       assertEquals(1, dld.length());
+       assertEquals("std::vector", dld.wordsLowerCase[0]);
+   }
+
+   @Test
+   void testPHP() {
+       var dld = sentenceExtractor.extractSentence("$_GET", EnumSet.noneOf(HtmlTag.class));
+       assertEquals(1, dld.length());
+       assertEquals("$_get", dld.wordsLowerCase[0]);
+   }
+
    @Test
    void testPolishArtist() {
        var dld = sentenceExtractor.extractSentence("Uklański", EnumSet.noneOf(HtmlTag.class));

@@ -25,12 +25,11 @@ public class ProcessedDocumentDetails {

    public List<EdgeUrl> linksInternal;
    public List<EdgeUrl> linksExternal;
-   public List<EdgeUrl> feedLinks;

    public DocumentMetadata metadata;
    public GeneratorType generator;

    public String toString() {
-       return "ProcessedDocumentDetails(title=" + this.title + ", description=" + this.description + ", pubYear=" + this.pubYear + ", length=" + this.length + ", quality=" + this.quality + ", hashCode=" + this.hashCode + ", features=" + this.features + ", standard=" + this.standard + ", linksInternal=" + this.linksInternal + ", linksExternal=" + this.linksExternal + ", feedLinks=" + this.feedLinks + ", metadata=" + this.metadata + ", generator=" + this.generator + ")";
+       return "ProcessedDocumentDetails(title=" + this.title + ", description=" + this.description + ", pubYear=" + this.pubYear + ", length=" + this.length + ", quality=" + this.quality + ", hashCode=" + this.hashCode + ", features=" + this.features + ", standard=" + this.standard + ", linksInternal=" + this.linksInternal + ", linksExternal=" + this.linksExternal + ", metadata=" + this.metadata + ", generator=" + this.generator + ")";
    }
}

@@ -34,7 +34,6 @@ public class LinkProcessor {

        ret.linksExternal = new ArrayList<>();
        ret.linksInternal = new ArrayList<>();
-       ret.feedLinks = new ArrayList<>();
    }

    public Set<EdgeUrl> getSeenUrls() {

@@ -72,19 +71,6 @@ public class LinkProcessor {
        }
    }

-   /** Accepts a link as a feed link */
-   public void acceptFeed(EdgeUrl link) {
-       if (!isLinkPermitted(link)) {
-           return;
-       }
-
-       if (!seenUrls.add(link)) {
-           return;
-       }
-
-       ret.feedLinks.add(link);
-   }
-
    private boolean isLinkPermitted(EdgeUrl link) {
        if (!permittedSchemas.contains(link.proto.toLowerCase())) {
            return false;

@@ -294,11 +294,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
        for (var meta : doc.select("meta[http-equiv=refresh]")) {
            linkParser.parseMetaRedirect(baseUrl, meta).ifPresent(lp::accept);
        }
-       for (var link : doc.select("link[rel=alternate]")) {
-           feedExtractor
-                   .getFeedFromAlternateTag(baseUrl, link)
-                   .ifPresent(lp::acceptFeed);
-       }

        words.addAllSyntheticTerms(FileLinks.createFileLinkKeywords(lp, domain));
        words.addAllSyntheticTerms(FileLinks.createFileEndingKeywords(doc));

@@ -125,7 +125,6 @@ public class PlainTextDocumentProcessorPlugin extends AbstractDocumentProcessorP
        /* These are assumed to be populated */
        ret.linksInternal = new ArrayList<>();
        ret.linksExternal = new ArrayList<>();
-       ret.feedLinks = new ArrayList<>();

        return new DetailsWithWords(ret, words);
    }

@@ -166,7 +166,6 @@ public class StackexchangeSideloader implements SideloadSource {
            ret.details.length = 128;

            ret.details.standard = HtmlStandard.HTML5;
-           ret.details.feedLinks = List.of();
            ret.details.linksExternal = List.of();
            ret.details.linksInternal = List.of();
            ret.state = UrlIndexingState.OK;

@@ -178,7 +178,6 @@ public class ConverterBatchWriter implements AutoCloseable, ConverterBatchWriter
    public void writeDomainData(ProcessedDomain domain) throws IOException {
        DomainMetadata metadata = DomainMetadata.from(domain);

-       List<String> feeds = getFeedUrls(domain);

        domainWriter.write(
                new SlopDomainRecord(

@@ -188,25 +187,11 @@ public class ConverterBatchWriter implements AutoCloseable, ConverterBatchWriter
                        metadata.visited(),
                        Optional.ofNullable(domain.state).map(DomainIndexingState::toString).orElse(""),
                        Optional.ofNullable(domain.redirect).map(EdgeDomain::toString).orElse(""),
-                       domain.ip,
-                       feeds
+                       domain.ip
                )
        );
    }

-   private List<String> getFeedUrls(ProcessedDomain domain) {
-       var documents = domain.documents;
-       if (documents == null)
-           return List.of();
-
-       return documents.stream().map(doc -> doc.details)
-               .filter(Objects::nonNull)
-               .flatMap(dets -> dets.feedLinks.stream())
-               .distinct()
-               .map(EdgeUrl::toString)
-               .toList();
-   }
-
    public void close() throws IOException {
        domainWriter.close();
        documentWriter.close();

@@ -1,7 +1,6 @@
package nu.marginalia.model.processed;

import nu.marginalia.slop.SlopTable;
-import nu.marginalia.slop.column.array.ObjectArrayColumn;
import nu.marginalia.slop.column.primitive.IntColumn;
import nu.marginalia.slop.column.string.EnumColumn;
import nu.marginalia.slop.column.string.TxtStringColumn;

@@ -10,7 +9,6 @@ import nu.marginalia.slop.desc.StorageType;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Consumer;

public record SlopDomainRecord(

@@ -20,8 +18,7 @@ public record SlopDomainRecord(
        int visitedUrls,
        String state,
        String redirectDomain,
-       String ip,
-       List<String> rssFeeds)
+       String ip)
{

    public record DomainWithIpProjection(

@@ -38,9 +35,6 @@ public record SlopDomainRecord(
    private static final IntColumn goodUrlsColumn = new IntColumn("goodUrls", StorageType.PLAIN);
    private static final IntColumn visitedUrlsColumn = new IntColumn("visitedUrls", StorageType.PLAIN);

-   private static final ObjectArrayColumn<String> rssFeedsColumn = new TxtStringColumn("rssFeeds", StandardCharsets.UTF_8, StorageType.GZIP).asArray();
-
    public static class DomainNameReader extends SlopTable {
        private final TxtStringColumn.Reader domainsReader;

@@ -101,8 +95,6 @@ public record SlopDomainRecord(
        private final IntColumn.Reader goodUrlsReader;
        private final IntColumn.Reader visitedUrlsReader;

-       private final ObjectArrayColumn<String>.Reader rssFeedsReader;
-
        public Reader(SlopTable.Ref<SlopDomainRecord> ref) throws IOException {
            super(ref);

@@ -114,8 +106,6 @@ public record SlopDomainRecord(
            knownUrlsReader = knownUrlsColumn.open(this);
            goodUrlsReader = goodUrlsColumn.open(this);
            visitedUrlsReader = visitedUrlsColumn.open(this);
-
-           rssFeedsReader = rssFeedsColumn.open(this);
        }

        public Reader(Path baseDir, int page) throws IOException {

@@ -140,8 +130,7 @@ public record SlopDomainRecord(
                    visitedUrlsReader.get(),
                    statesReader.get(),
                    redirectReader.get(),
-                   ipReader.get(),
-                   rssFeedsReader.get()
+                   ipReader.get()
            );
        }
    }

@@ -156,8 +145,6 @@ public record SlopDomainRecord(
        private final IntColumn.Writer goodUrlsWriter;
        private final IntColumn.Writer visitedUrlsWriter;

-       private final ObjectArrayColumn<String>.Writer rssFeedsWriter;
-
        public Writer(Path baseDir, int page) throws IOException {
            super(baseDir, page);

@@ -169,8 +156,6 @@ public record SlopDomainRecord(
            knownUrlsWriter = knownUrlsColumn.create(this);
            goodUrlsWriter = goodUrlsColumn.create(this);
            visitedUrlsWriter = visitedUrlsColumn.create(this);
-
-           rssFeedsWriter = rssFeedsColumn.create(this);
        }

        public void write(SlopDomainRecord record) throws IOException {

@@ -182,8 +167,6 @@ public record SlopDomainRecord(
            knownUrlsWriter.put(record.knownUrls());
            goodUrlsWriter.put(record.goodUrls());
            visitedUrlsWriter.put(record.visitedUrls());
-
-           rssFeedsWriter.put(record.rssFeeds());
        }
    }
}

@@ -9,7 +9,6 @@ import org.junit.jupiter.api.Test;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
-import java.util.List;

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

@@ -35,8 +34,7 @@ public class SlopDomainRecordTest {
                1, 2, 3,
                "state",
                "redirectDomain",
-               "192.168.0.1",
-               List.of("rss1", "rss2")
+               "192.168.0.1"
        );

        try (var writer = new SlopDomainRecord.Writer(testDir, 0)) {

@@ -7,6 +7,7 @@ import nu.marginalia.WmsaHome;
import nu.marginalia.converting.model.ProcessedDomain;
import nu.marginalia.converting.processor.DomainProcessor;
import nu.marginalia.crawl.CrawlerMain;
+import nu.marginalia.crawl.DomainStateDb;
import nu.marginalia.crawl.fetcher.HttpFetcher;
import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
import nu.marginalia.crawl.fetcher.warc.WarcRecorder;

@@ -46,6 +47,7 @@ public class CrawlingThenConvertingIntegrationTest {

    private Path fileName;
    private Path fileName2;
+   private Path dbTempFile;

    @BeforeAll
    public static void setUpAll() {

@@ -63,16 +65,18 @@ public class CrawlingThenConvertingIntegrationTest {
        httpFetcher = new HttpFetcherImpl(WmsaHome.getUserAgent().uaString());
        this.fileName = Files.createTempFile("crawling-then-converting", ".warc.gz");
        this.fileName2 = Files.createTempFile("crawling-then-converting", ".warc.gz");
+       this.dbTempFile = Files.createTempFile("domains", "db");
    }

    @AfterEach
    public void tearDown() throws IOException {
        Files.deleteIfExists(fileName);
        Files.deleteIfExists(fileName2);
+       Files.deleteIfExists(dbTempFile);
    }

    @Test
-   public void testInvalidDomain() throws IOException {
+   public void testInvalidDomain() throws Exception {
        // Attempt to fetch an invalid domain
        var specs = new CrawlerMain.CrawlSpecRecord("invalid.invalid.invalid", 10);

@@ -88,7 +92,7 @@ public class CrawlingThenConvertingIntegrationTest {
    }

    @Test
-   public void testRedirectingDomain() throws IOException {
+   public void testRedirectingDomain() throws Exception {
        // Attempt to fetch an invalid domain
        var specs = new CrawlerMain.CrawlSpecRecord("memex.marginalia.nu", 10);

@@ -107,7 +111,7 @@ public class CrawlingThenConvertingIntegrationTest {
    }

    @Test
-   public void testBlockedDomain() throws IOException {
+   public void testBlockedDomain() throws Exception {
        // Attempt to fetch an invalid domain
        var specs = new CrawlerMain.CrawlSpecRecord("search.marginalia.nu", 10);

@@ -124,7 +128,7 @@ public class CrawlingThenConvertingIntegrationTest {
    }

    @Test
-   public void crawlSunnyDay() throws IOException {
+   public void crawlSunnyDay() throws Exception {
        var specs = new CrawlerMain.CrawlSpecRecord("www.marginalia.nu", 10);

        CrawledDomain domain = crawl(specs);

@@ -157,7 +161,7 @@ public class CrawlingThenConvertingIntegrationTest {


    @Test
-   public void crawlContentTypes() throws IOException {
+   public void crawlContentTypes() throws Exception {
        var specs = new CrawlerMain.CrawlSpecRecord("www.marginalia.nu", 10,
                List.of(
                        "https://www.marginalia.nu/sanic.png",

@@ -195,7 +199,7 @@ public class CrawlingThenConvertingIntegrationTest {


    @Test
-   public void crawlRobotsTxt() throws IOException {
+   public void crawlRobotsTxt() throws Exception {
        var specs = new CrawlerMain.CrawlSpecRecord("search.marginalia.nu", 5,
                List.of("https://search.marginalia.nu/search?q=hello+world")
        );

@@ -235,15 +239,17 @@ public class CrawlingThenConvertingIntegrationTest {
            return null; // unreachable
        }
    }
-   private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs) throws IOException {
+   private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs) throws Exception {
        return crawl(specs, domain -> true);
    }

-   private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs, Predicate<EdgeDomain> domainBlacklist) throws IOException {
+   private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs, Predicate<EdgeDomain> domainBlacklist) throws Exception {
        List<SerializableCrawlData> data = new ArrayList<>();

-       try (var recorder = new WarcRecorder(fileName)) {
-           new CrawlerRetreiver(httpFetcher, new DomainProber(domainBlacklist), specs, recorder).crawlDomain();
+       try (var recorder = new WarcRecorder(fileName);
+            var db = new DomainStateDb(dbTempFile))
+       {
+           new CrawlerRetreiver(httpFetcher, new DomainProber(domainBlacklist), specs, db, recorder).crawlDomain();
        }

        CrawledDocumentParquetRecordFileWriter.convertWarc(specs.domain(),

@@ -46,6 +46,8 @@ dependencies {

    implementation libs.notnull
    implementation libs.guava
+   implementation libs.sqlite
+
    implementation dependencies.create(libs.guice.get()) {
        exclude group: 'com.google.guava'
    }

@@ -241,6 +241,7 @@ public class CrawlerMain extends ProcessMainClass {

        // Set up the work log and the warc archiver so we can keep track of what we've done
        try (WorkLog workLog = new WorkLog(outputDir.resolve("crawler.log"));
+            DomainStateDb domainStateDb = new DomainStateDb(outputDir.resolve("domainstate.db"));
             WarcArchiverIf warcArchiver = warcArchiverFactory.get(outputDir);
             AnchorTagsSource anchorTagsSource = anchorTagsSourceFactory.create(domainsToCrawl)
        ) {

@@ -258,6 +259,7 @@ public class CrawlerMain extends ProcessMainClass {
                        anchorTagsSource,
                        outputDir,
                        warcArchiver,
+                       domainStateDb,
                        workLog);

                if (pendingCrawlTasks.putIfAbsent(crawlSpec.domain(), task) == null) {

@@ -299,11 +301,12 @@ public class CrawlerMain extends ProcessMainClass {
        heartbeat.start();

        try (WorkLog workLog = new WorkLog(outputDir.resolve("crawler-" + targetDomainName.replace('/', '-') + ".log"));
+            DomainStateDb domainStateDb = new DomainStateDb(outputDir.resolve("domainstate.db"));
             WarcArchiverIf warcArchiver = warcArchiverFactory.get(outputDir);
             AnchorTagsSource anchorTagsSource = anchorTagsSourceFactory.create(List.of(new EdgeDomain(targetDomainName)))
        ) {
            var spec = new CrawlSpecRecord(targetDomainName, 1000, List.of());
-           var task = new CrawlTask(spec, anchorTagsSource, outputDir, warcArchiver, workLog);
+           var task = new CrawlTask(spec, anchorTagsSource, outputDir, warcArchiver, domainStateDb, workLog);
            task.run();
        }
        catch (Exception ex) {

@@ -324,18 +327,21 @@ public class CrawlerMain extends ProcessMainClass {
        private final AnchorTagsSource anchorTagsSource;
        private final Path outputDir;
        private final WarcArchiverIf warcArchiver;
+       private final DomainStateDb domainStateDb;
        private final WorkLog workLog;

        CrawlTask(CrawlSpecRecord specification,
                  AnchorTagsSource anchorTagsSource,
                  Path outputDir,
                  WarcArchiverIf warcArchiver,
+                 DomainStateDb domainStateDb,
                  WorkLog workLog)
        {
            this.specification = specification;
            this.anchorTagsSource = anchorTagsSource;
            this.outputDir = outputDir;
            this.warcArchiver = warcArchiver;
+           this.domainStateDb = domainStateDb;
            this.workLog = workLog;

            this.domain = specification.domain();

@@ -359,7 +365,7 @@ public class CrawlerMain extends ProcessMainClass {
            }

            try (var warcRecorder = new WarcRecorder(newWarcFile); // write to a temp file for now
-                var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, warcRecorder);
+                var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, domainStateDb, warcRecorder);
                 CrawlDataReference reference = getReference();
            )
            {

@@ -0,0 +1,127 @@
package nu.marginalia.crawl;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.annotation.Nullable;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.time.Instant;
import java.util.Optional;

/** Supplemental sqlite database for storing the summary of a crawl.
 * One database exists per crawl data set.
 * */
public class DomainStateDb implements AutoCloseable {

    private static final Logger logger = LoggerFactory.getLogger(DomainStateDb.class);

    private final Connection connection;

    public record SummaryRecord(
            String domainName,
            Instant lastUpdated,
            String state,
            @Nullable String stateDesc,
            @Nullable String feedUrl
    )
    {
        public static SummaryRecord forSuccess(String domainName) {
            return new SummaryRecord(domainName, Instant.now(), "OK", null, null);
        }

        public static SummaryRecord forSuccess(String domainName, String feedUrl) {
            return new SummaryRecord(domainName, Instant.now(), "OK", null, feedUrl);
        }

        public static SummaryRecord forError(String domainName, String state, String stateDesc) {
            return new SummaryRecord(domainName, Instant.now(), state, stateDesc, null);
        }

        public boolean equals(Object other) {
            if (other == this) {
                return true;
            }
            if (!(other instanceof SummaryRecord(String name, Instant updated, String state1, String desc, String url))) {
                return false;
            }
            return domainName.equals(name) &&
                    lastUpdated.toEpochMilli() == updated.toEpochMilli() &&
                    state.equals(state1) &&
                    (stateDesc == null ? desc == null : stateDesc.equals(desc)) &&
                    (feedUrl == null ? url == null : feedUrl.equals(url));
        }

        public int hashCode() {
            return domainName.hashCode() + Long.hashCode(lastUpdated.toEpochMilli());
        }

    }

    public DomainStateDb(Path filename) throws SQLException {
        String sqliteDbString = "jdbc:sqlite:" + filename.toString();
        connection = DriverManager.getConnection(sqliteDbString);

        try (var stmt = connection.createStatement()) {
            stmt.executeUpdate("""
                    CREATE TABLE IF NOT EXISTS summary (
                        domain TEXT PRIMARY KEY,
                        lastUpdatedEpochMs LONG NOT NULL,
                        state TEXT NOT NULL,
                        stateDesc TEXT,
                        feedUrl TEXT
                    )
                    """);

            stmt.execute("PRAGMA journal_mode=WAL");
        }
    }

    @Override
    public void close() throws SQLException {
        connection.close();
    }


    public void save(SummaryRecord record) {
        try (var stmt = connection.prepareStatement("""
                INSERT OR REPLACE INTO summary (domain, lastUpdatedEpochMs, state, stateDesc, feedUrl)
                VALUES (?, ?, ?, ?, ?)
                """)) {
            stmt.setString(1, record.domainName());
            stmt.setLong(2, record.lastUpdated().toEpochMilli());
            stmt.setString(3, record.state());
            stmt.setString(4, record.stateDesc());
            stmt.setString(5, record.feedUrl());
            stmt.executeUpdate();
        } catch (SQLException e) {
            logger.error("Failed to insert summary record", e);
        }
    }

    public Optional<SummaryRecord> get(String domainName) {
        try (var stmt = connection.prepareStatement("""
                SELECT domain, lastUpdatedEpochMs, state, stateDesc, feedUrl
                FROM summary
                WHERE domain = ?
                """)) {
            stmt.setString(1, domainName);
            var rs = stmt.executeQuery();
            if (rs.next()) {
                return Optional.of(new SummaryRecord(
                        rs.getString("domain"),
                        Instant.ofEpochMilli(rs.getLong("lastUpdatedEpochMs")),
                        rs.getString("state"),
                        rs.getString("stateDesc"),
                        rs.getString("feedUrl")
                ));
            }
        } catch (SQLException e) {
            logger.error("Failed to get summary record", e);
        }

        return Optional.empty();
    }
}

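A short usage sketch of the class above; the database path is illustrative and the feed URL is borrowed from the DomainStateDbTest further down:

```java
import java.nio.file.Path;

// Round-trip sketch: persist a summary record, then read it back.
public class DomainStateDbDemo {
    public static void main(String[] args) throws Exception {
        try (var db = new DomainStateDb(Path.of("/tmp/domainstate.db"))) {
            db.save(DomainStateDb.SummaryRecord.forSuccess(
                    "www.marginalia.nu", "https://www.marginalia.nu/atom.xml"));

            db.get("www.marginalia.nu").ifPresent(rec ->
                    System.out.println(rec.state() + " " + rec.feedUrl()));
        }
    }
}
```
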
@@ -20,34 +20,11 @@ public record ContentTags(String etag, String lastMod) {
    public void paint(Request.Builder getBuilder) {

        if (etag != null) {
-           getBuilder.addHeader("If-None-Match", ifNoneMatch());
+           getBuilder.addHeader("If-None-Match", etag);
        }

        if (lastMod != null) {
-           getBuilder.addHeader("If-Modified-Since", ifModifiedSince());
+           getBuilder.addHeader("If-Modified-Since", lastMod);
        }
    }

-   private String ifNoneMatch() {
-       // Remove the W/ prefix if it exists
-
-       //'W/' (case-sensitive) indicates that a weak validator is used. Weak etags are
-       // easy to generate, but are far less useful for comparisons. Strong validators
-       // are ideal for comparisons but can be very difficult to generate efficiently.
-       // Weak ETag values of two representations of the same resources might be semantically
-       // equivalent, but not byte-for-byte identical. This means weak etags prevent caching
-       // when byte range requests are used, but strong etags mean range requests can
-       // still be cached.
-       // - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag
-
-       if (null != etag && etag.startsWith("W/")) {
-           return etag.substring(2);
-       } else {
-           return etag;
-       }
-   }
-
-   private String ifModifiedSince() {
-       return lastMod;
-   }
}

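After this change the stored ETag is sent back verbatim, `W/` prefix and all, which is what `If-None-Match` comparison expects. A hedged sketch of the resulting conditional request, using the JDK's `HttpRequest` instead of the OkHttp `Request.Builder` from the diff; the header values are illustrative:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of a conditional GET built from previously stored validators.
public class ConditionalGetDemo {
    public static void main(String[] args) {
        String etag = "W/\"67ab43\"";                     // sent verbatim now, W/ prefix included
        String lastMod = "Wed, 21 Oct 2015 07:28:00 GMT"; // Last-Modified from the previous fetch

        HttpRequest req = HttpRequest.newBuilder(URI.create("https://www.marginalia.nu/"))
                .header("If-None-Match", etag)
                .header("If-Modified-Since", lastMod)
                .build();

        // The server replies 304 Not Modified if the document is unchanged.
        System.out.println(req.headers().map());
    }
}
```
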
@@ -34,8 +34,9 @@ import java.util.*;
public class WarcRecorder implements AutoCloseable {
    /** Maximum time we'll wait on a single request */
    static final int MAX_TIME = 30_000;
-   /** Maximum (decompressed) size we'll fetch */
-   static final int MAX_SIZE = 1024 * 1024 * 10;
+
+   /** Maximum (decompressed) size we'll save */
+   static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);

    private final WarcWriter writer;
    private final Path warcFile;

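The constant is now overridable at launch via a system property, using the standard JDK helper:

```java
// Passing -Dcrawler.maxFetchSize=52428800 on the java command line raises
// the cap to 50 MB; Integer.getInteger falls back to the default otherwise.
int maxSize = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);
```
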
@@ -4,6 +4,7 @@ import crawlercommons.robots.SimpleRobotRules;
import nu.marginalia.atags.model.DomainLinks;
import nu.marginalia.contenttype.ContentType;
import nu.marginalia.crawl.CrawlerMain;
+import nu.marginalia.crawl.DomainStateDb;
import nu.marginalia.crawl.fetcher.ContentTags;
import nu.marginalia.crawl.fetcher.HttpFetcher;
import nu.marginalia.crawl.fetcher.HttpFetcherImpl;

@@ -16,7 +17,9 @@ import nu.marginalia.ip_blocklist.UrlBlocklist;
import nu.marginalia.link_parser.LinkParser;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.body.DocumentBodyExtractor;
import nu.marginalia.model.body.HttpFetchResult;
import nu.marginalia.model.crawldata.CrawlerDomainStatus;
import org.jsoup.Jsoup;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@@ -46,6 +49,7 @@ public class CrawlerRetreiver implements AutoCloseable {

    private final DomainProber domainProber;
    private final DomainCrawlFrontier crawlFrontier;
+   private final DomainStateDb domainStateDb;
    private final WarcRecorder warcRecorder;
    private final CrawlerRevisitor crawlerRevisitor;

@@ -55,8 +59,10 @@ public class CrawlerRetreiver implements AutoCloseable {
    public CrawlerRetreiver(HttpFetcher fetcher,
                            DomainProber domainProber,
                            CrawlerMain.CrawlSpecRecord specs,
+                           DomainStateDb domainStateDb,
                            WarcRecorder warcRecorder)
    {
+       this.domainStateDb = domainStateDb;
        this.warcRecorder = warcRecorder;
        this.fetcher = fetcher;
        this.domainProber = domainProber;

@@ -90,8 +96,21 @@ public class CrawlerRetreiver implements AutoCloseable {
        try {
            // Do an initial domain probe to determine the root URL
            EdgeUrl rootUrl;
-           if (probeRootUrl() instanceof HttpFetcher.DomainProbeResult.Ok ok) rootUrl = ok.probedUrl();
-           else return 1;
+
+           var probeResult = probeRootUrl();
+           switch (probeResult) {
+               case HttpFetcher.DomainProbeResult.Ok(EdgeUrl probedUrl) -> {
+                   rootUrl = probedUrl; // Good track
+               }
+               case HttpFetcher.DomainProbeResult.Redirect(EdgeDomain domain1) -> {
+                   domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, "Redirect", domain1.toString()));
+                   return 1;
+               }
+               case HttpFetcher.DomainProbeResult.Error(CrawlerDomainStatus status, String desc) -> {
+                   domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, status.toString(), desc));
+                   return 1;
+               }
+           }

            // Sleep after the initial probe, we don't have access to the robots.txt yet
            // so we don't know the crawl delay

@@ -114,7 +133,8 @@ public class CrawlerRetreiver implements AutoCloseable {

            delayTimer.waitFetchDelay(0); // initial delay after robots.txt

-           sniffRootDocument(rootUrl, delayTimer);
+           DomainStateDb.SummaryRecord summaryRecord = sniffRootDocument(rootUrl, delayTimer);
+           domainStateDb.save(summaryRecord);

            // Play back the old crawl data (if present) and fetch the documents comparing etags and last-modified
            if (crawlerRevisitor.recrawl(oldCrawlData, robotsRules, delayTimer) > 0) {

@@ -196,7 +216,9 @@ public class CrawlerRetreiver implements AutoCloseable {
        return domainProbeResult;
    }

-   private void sniffRootDocument(EdgeUrl rootUrl, CrawlDelayTimer timer) {
+   private DomainStateDb.SummaryRecord sniffRootDocument(EdgeUrl rootUrl, CrawlDelayTimer timer) {
+       Optional<String> feedLink = Optional.empty();

        try {
            var url = rootUrl.withPathAndParam("/", null);

@@ -204,11 +226,11 @@ public class CrawlerRetreiver implements AutoCloseable {
            timer.waitFetchDelay(0);

            if (!(result instanceof HttpFetchResult.ResultOk ok))
-               return;
+               return DomainStateDb.SummaryRecord.forSuccess(domain);

            var optDoc = ok.parseDocument();
            if (optDoc.isEmpty())
-               return;
+               return DomainStateDb.SummaryRecord.forSuccess(domain);

            // Sniff the software based on the sample document
            var doc = optDoc.get();

@@ -216,7 +238,6 @@ public class CrawlerRetreiver implements AutoCloseable {
            crawlFrontier.enqueueLinksFromDocument(url, doc);

            EdgeUrl faviconUrl = url.withPathAndParam("/favicon.ico", null);
-           Optional<EdgeUrl> sitemapUrl = Optional.empty();

            for (var link : doc.getElementsByTag("link")) {
                String rel = link.attr("rel");

@@ -232,23 +253,33 @@ public class CrawlerRetreiver implements AutoCloseable {

                // Grab the RSS/Atom as a sitemap if it exists
                if (rel.equalsIgnoreCase("alternate")
-                   && (type.equalsIgnoreCase("application/atom+xml") || type.equalsIgnoreCase("application/atomsvc+xml"))) {
+                   && (type.equalsIgnoreCase("application/atom+xml")
+                       || type.equalsIgnoreCase("application/atomsvc+xml")
+                       || type.equalsIgnoreCase("application/rss+xml")
+                   )) {
                    String href = link.attr("href");

-                   sitemapUrl = linkParser.parseLink(url, href)
-                           .filter(crawlFrontier::isSameDomain);
+                   feedLink = linkParser.parseLink(url, href)
+                           .filter(crawlFrontier::isSameDomain)
+                           .map(EdgeUrl::toString);
                }
            }

-           // Download the sitemap if available exists
-           if (sitemapUrl.isPresent()) {
-               sitemapFetcher.downloadSitemaps(List.of(sitemapUrl.get()));
+
+           if (feedLink.isEmpty()) {
+               feedLink = guessFeedUrl(timer);
+           }
+
+           // Download the sitemap if available
+           if (feedLink.isPresent()) {
+               sitemapFetcher.downloadSitemaps(List.of(feedLink.get()));
                timer.waitFetchDelay(0);
            }

            // Grab the favicon if it exists
            fetchWithRetry(faviconUrl, timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
            timer.waitFetchDelay(0);

        }
        catch (Exception ex) {
            logger.error("Error configuring link filter", ex);

@@ -256,6 +287,74 @@ public class CrawlerRetreiver implements AutoCloseable {
        finally {
            crawlFrontier.addVisited(rootUrl);
        }

+       if (feedLink.isPresent()) {
+           return DomainStateDb.SummaryRecord.forSuccess(domain, feedLink.get());
+       }
+       else {
+           return DomainStateDb.SummaryRecord.forSuccess(domain);
+       }
    }

+   private final List<String> likelyFeedEndpoints = List.of(
+           "rss.xml",
+           "atom.xml",
+           "feed.xml",
+           "index.xml",
+           "feed",
+           "rss",
+           "atom",
+           "feeds",
+           "blog/feed",
+           "blog/rss"
+   );
+
+   private Optional<String> guessFeedUrl(CrawlDelayTimer timer) throws InterruptedException {
+       var oldDomainStateRecord = domainStateDb.get(domain);
+
+       // If we are already aware of an old feed URL, then we can just revalidate it
+       if (oldDomainStateRecord.isPresent()) {
+           var oldRecord = oldDomainStateRecord.get();
+           if (oldRecord.feedUrl() != null && validateFeedUrl(oldRecord.feedUrl(), timer)) {
+               return Optional.of(oldRecord.feedUrl());
+           }
+       }
+
+       for (String endpoint : likelyFeedEndpoints) {
+           String url = "https://" + domain + "/" + endpoint;
+           if (validateFeedUrl(url, timer)) {
+               return Optional.of(url);
+           }
+       }
+
+       return Optional.empty();
+   }
+
+   private boolean validateFeedUrl(String url, CrawlDelayTimer timer) throws InterruptedException {
+       var parsedOpt = EdgeUrl.parse(url);
+       if (parsedOpt.isEmpty())
+           return false;
+
+       HttpFetchResult result = fetchWithRetry(parsedOpt.get(), timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
+       timer.waitFetchDelay(0);
+
+       if (!(result instanceof HttpFetchResult.ResultOk ok)) {
+           return false;
+       }
+
+       // Extract the beginning of the body and sniff it for a feed root element
+       Optional<String> bodyOpt = DocumentBodyExtractor.asString(ok).getBody();
+       if (bodyOpt.isEmpty())
+           return false;
+       String body = bodyOpt.get();
+       body = body.substring(0, Math.min(128, body.length())).toLowerCase();
+
+       if (body.contains("<atom"))
+           return true;
+       if (body.contains("<rss"))
+           return true;
+
+       return false;
+   }

    public HttpFetchResult fetchContentWithReference(EdgeUrl top,

@@ -7,9 +7,9 @@ import nu.marginalia.model.EdgeUrl;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

-import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
+import java.util.Optional;
import java.util.Set;

public class SitemapFetcher {

@@ -24,26 +24,27 @@ public class SitemapFetcher {
    }

    public void downloadSitemaps(SimpleRobotRules robotsRules, EdgeUrl rootUrl) {
-       List<String> sitemaps = robotsRules.getSitemaps();
+       List<String> urls = robotsRules.getSitemaps();

-       List<EdgeUrl> urls = new ArrayList<>(sitemaps.size());
-       if (!sitemaps.isEmpty()) {
-           for (var url : sitemaps) {
-               EdgeUrl.parse(url).ifPresent(urls::add);
-           }
-       }
-       else {
-           urls.add(rootUrl.withPathAndParam("/sitemap.xml", null));
+       if (urls.isEmpty()) {
+           urls = List.of(rootUrl.withPathAndParam("/sitemap.xml", null).toString());
        }

        downloadSitemaps(urls);
    }

-   public void downloadSitemaps(List<EdgeUrl> urls) {
+   public void downloadSitemaps(List<String> urls) {

        Set<String> checkedSitemaps = new HashSet<>();

-       for (var url : urls) {
+       for (var rawUrl : urls) {
+           Optional<EdgeUrl> parsedUrl = EdgeUrl.parse(rawUrl);
+           if (parsedUrl.isEmpty()) {
+               continue;
+           }
+
+           EdgeUrl url = parsedUrl.get();
+
            // Let's not download sitemaps from other domains for now
            if (!crawlFrontier.isSameDomain(url)) {
                continue;

@@ -1,11 +1,15 @@
package nu.marginalia.io;

+import nu.marginalia.model.crawldata.CrawledDocument;
+import nu.marginalia.model.crawldata.CrawledDomain;
import nu.marginalia.model.crawldata.SerializableCrawlData;
import org.jetbrains.annotations.Nullable;

import java.io.IOException;
import java.nio.file.Path;
+import java.util.ArrayList;
import java.util.Iterator;
+import java.util.List;

/** Closable iterator exceptional over serialized crawl data
 * The data may appear in any order, and the iterator must be closed.

@@ -26,6 +30,37 @@ public interface SerializableCrawlDataStream extends AutoCloseable {
    @Nullable
    default Path path() { return null; }

+   /** For tests */
+   default List<SerializableCrawlData> asList() throws IOException {
+       List<SerializableCrawlData> data = new ArrayList<>();
+       while (hasNext()) {
+           data.add(next());
+       }
+       return data;
+   }
+
+   /** For tests */
+   default List<CrawledDocument> docsAsList() throws IOException {
+       List<CrawledDocument> data = new ArrayList<>();
+       while (hasNext()) {
+           if (next() instanceof CrawledDocument doc) {
+               data.add(doc);
+           }
+       }
+       return data;
+   }
+
+   /** For tests */
+   default List<CrawledDomain> domainsAsList() throws IOException {
+       List<CrawledDomain> data = new ArrayList<>();
+       while (hasNext()) {
+           if (next() instanceof CrawledDomain domain) {
+               data.add(domain);
+           }
+       }
+       return data;
+   }
+
    // Dummy iterator over nothing
    static SerializableCrawlDataStream empty() {
        return new SerializableCrawlDataStream() {

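A brief sketch of the new convenience accessors in a test; `empty()` stands in for a real crawl data stream, and note that each accessor drains the underlying iterator, so a real test should use a fresh stream per call:

```java
// Sketch: collecting only the document records from a crawl data stream.
public class StreamHelpersDemo {
    public static void main(String[] args) throws Exception {
        try (SerializableCrawlDataStream stream = SerializableCrawlDataStream.empty()) {
            var docs = stream.docsAsList(); // CrawledDocument records only
            System.out.println(docs.size()); // 0 for the empty stream
        }
    }
}
```
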
@@ -18,6 +18,7 @@ public class ContentTypeLogic {
            "application/xhtml",
            "application/xml",
            "application/atom+xml",
            "application/atomsvc+xml",
            "application/rss+xml",
            "application/x-rss+xml",
            "application/rdf+xml",

@@ -23,6 +23,10 @@ public sealed interface DocumentBodyResult<T> {
            return mapper.apply(contentType, body);
        }

+       public Optional<T> getBody() {
+           return Optional.of(body);
+       }
+
        @Override
        public void ifPresent(ExConsumer<T, Exception> consumer) throws Exception {
            consumer.accept(contentType, body);

@@ -41,6 +45,11 @@ public sealed interface DocumentBodyResult<T> {
            return (DocumentBodyResult<T2>) this;
        }

+       @Override
+       public Optional<T> getBody() {
+           return Optional.empty();
+       }
+
        @Override
        public void ifPresent(ExConsumer<T, Exception> consumer) throws Exception {
        }

@@ -49,6 +58,7 @@ public sealed interface DocumentBodyResult<T> {
    <T2> Optional<T2> mapOpt(BiFunction<ContentType, T, T2> mapper);
    <T2> Optional<T2> flatMapOpt(BiFunction<ContentType, T, Optional<T2>> mapper);
    <T2> DocumentBodyResult<T2> flatMap(BiFunction<ContentType, T, DocumentBodyResult<T2>> mapper);
+   Optional<T> getBody();

    void ifPresent(ExConsumer<T,Exception> consumer) throws Exception;

@@ -0,0 +1,66 @@
package nu.marginalia.crawl;

import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.SQLException;
import java.time.Instant;

import static org.junit.jupiter.api.Assertions.assertEquals;

class DomainStateDbTest {

    Path tempFile;

    @BeforeEach
    void setUp() throws IOException {
        tempFile = Files.createTempFile(getClass().getSimpleName(), ".db");
    }

    @AfterEach
    void tearDown() throws IOException {
        Files.deleteIfExists(tempFile);
    }

    @Test
    public void testSunnyDay() throws SQLException {
        try (var db = new DomainStateDb(tempFile)) {
            var allFields = new DomainStateDb.SummaryRecord(
                    "all.marginalia.nu",
                    Instant.now(),
                    "OK",
                    "Bad address",
                    "https://www.marginalia.nu/atom.xml"
            );

            var minFields = new DomainStateDb.SummaryRecord(
                    "min.marginalia.nu",
                    Instant.now(),
                    "OK",
                    null,
                    null
            );

            db.save(allFields);
            db.save(minFields);

            assertEquals(allFields, db.get("all.marginalia.nu").orElseThrow());
            assertEquals(minFields, db.get("min.marginalia.nu").orElseThrow());

            var updatedAllFields = new DomainStateDb.SummaryRecord(
                    "all.marginalia.nu",
                    Instant.now(),
                    "BAD",
                    null,
                    null
            );

            db.save(updatedAllFields);
            assertEquals(updatedAllFields, db.get("all.marginalia.nu").orElseThrow());
        }
    }

}

@@ -42,24 +42,24 @@ class ContentTypeProberTest {
        port = r.nextInt(10000) + 8000;
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 10);

-       server.createContext("/html", exchange -> {
+       server.createContext("/html.gz", exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "text/html");
            exchange.sendResponseHeaders(200, -1);
            exchange.close();
        });
-       server.createContext("/redir", exchange -> {
-           exchange.getResponseHeaders().add("Location", "/html");
+       server.createContext("/redir.gz", exchange -> {
+           exchange.getResponseHeaders().add("Location", "/html.gz");
            exchange.sendResponseHeaders(301, -1);
            exchange.close();
        });

-       server.createContext("/bin", exchange -> {
+       server.createContext("/bin.gz", exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "application/binary");
            exchange.sendResponseHeaders(200, -1);
            exchange.close();
        });

-       server.createContext("/timeout", exchange -> {
+       server.createContext("/timeout.gz", exchange -> {
            try {
                Thread.sleep(15_000);
            } catch (InterruptedException e) {

@@ -73,10 +73,10 @@ class ContentTypeProberTest {

        server.start();

-       htmlEndpoint = EdgeUrl.parse("http://localhost:" + port + "/html").get();
-       binaryEndpoint = EdgeUrl.parse("http://localhost:" + port + "/bin").get();
-       timeoutEndpoint = EdgeUrl.parse("http://localhost:" + port + "/timeout").get();
-       htmlRedirEndpoint = EdgeUrl.parse("http://localhost:" + port + "/redir").get();
+       htmlEndpoint = EdgeUrl.parse("http://localhost:" + port + "/html.gz").get();
+       binaryEndpoint = EdgeUrl.parse("http://localhost:" + port + "/bin.gz").get();
+       timeoutEndpoint = EdgeUrl.parse("http://localhost:" + port + "/timeout.gz").get();
+       htmlRedirEndpoint = EdgeUrl.parse("http://localhost:" + port + "/redir.gz").get();

        fetcher = new HttpFetcherImpl("test");
        recorder = new WarcRecorder(warcFile);

@@ -2,6 +2,7 @@ package nu.marginalia.crawling.retreival;

import crawlercommons.robots.SimpleRobotRules;
import nu.marginalia.crawl.CrawlerMain;
+import nu.marginalia.crawl.DomainStateDb;
import nu.marginalia.crawl.fetcher.ContentTags;
import nu.marginalia.crawl.fetcher.HttpFetcher;
import nu.marginalia.crawl.fetcher.HttpFetcherImpl;

@@ -18,6 +19,7 @@ import nu.marginalia.model.crawldata.SerializableCrawlData;
import nu.marginalia.test.CommonTestData;
import okhttp3.Headers;
import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.mockito.Mockito;
import org.slf4j.Logger;

@@ -25,6 +27,9 @@ import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.net.URISyntaxException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

@@ -36,9 +41,14 @@ public class CrawlerMockFetcherTest {

    Map<EdgeUrl, CrawledDocument> mockData = new HashMap<>();
    HttpFetcher fetcherMock = new MockFetcher();

+   private Path dbTempFile;
+
+   @BeforeEach
+   public void setUp() throws IOException {
+       dbTempFile = Files.createTempFile("domains","db");
+   }

    @AfterEach
-   public void tearDown() {
+   public void tearDown() throws IOException {
+       Files.deleteIfExists(dbTempFile);
        mockData.clear();
    }

@@ -66,15 +76,17 @@ public class CrawlerMockFetcherTest {

    }

-   void crawl(CrawlerMain.CrawlSpecRecord spec) throws IOException {
-       try (var recorder = new WarcRecorder()) {
-           new CrawlerRetreiver(fetcherMock, new DomainProber(d -> true), spec, recorder)
+   void crawl(CrawlerMain.CrawlSpecRecord spec) throws IOException, SQLException {
+       try (var recorder = new WarcRecorder();
+            var db = new DomainStateDb(dbTempFile)
+       ) {
+           new CrawlerRetreiver(fetcherMock, new DomainProber(d -> true), spec, db, recorder)
                    .crawlDomain();
        }
    }

    @Test
-   public void testLemmy() throws URISyntaxException, IOException {
+   public void testLemmy() throws Exception {
        List<SerializableCrawlData> out = new ArrayList<>();

        registerUrlClasspathData(new EdgeUrl("https://startrek.website/"), "mock-crawl-data/lemmy/index.html");

@@ -85,7 +97,7 @@ public class CrawlerMockFetcherTest {
    }

    @Test
-   public void testMediawiki() throws URISyntaxException, IOException {
+   public void testMediawiki() throws Exception {
        List<SerializableCrawlData> out = new ArrayList<>();

        registerUrlClasspathData(new EdgeUrl("https://en.wikipedia.org/"), "mock-crawl-data/mediawiki/index.html");

@@ -94,7 +106,7 @@ public class CrawlerMockFetcherTest {
    }

    @Test
-   public void testDiscourse() throws URISyntaxException, IOException {
+   public void testDiscourse() throws Exception {
        List<SerializableCrawlData> out = new ArrayList<>();

        registerUrlClasspathData(new EdgeUrl("https://community.tt-rss.org/"), "mock-crawl-data/discourse/index.html");

@@ -4,6 +4,7 @@ import nu.marginalia.UserAgent;
import nu.marginalia.WmsaHome;
import nu.marginalia.atags.model.DomainLinks;
import nu.marginalia.crawl.CrawlerMain;
+import nu.marginalia.crawl.DomainStateDb;
import nu.marginalia.crawl.fetcher.HttpFetcher;
import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
import nu.marginalia.crawl.fetcher.warc.WarcRecorder;

@@ -25,6 +26,7 @@ import java.io.RandomAccessFile;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
+import java.sql.SQLException;
import java.util.*;
import java.util.stream.Collectors;

@@ -39,11 +41,13 @@ class CrawlerRetreiverTest {
    Path tempFileWarc2;
    Path tempFileParquet2;
    Path tempFileWarc3;
+   Path tempFileDb;

    @BeforeEach
    public void setUp() throws IOException {
        httpFetcher = new HttpFetcherImpl("search.marginalia.nu; testing a bit :D");
        tempFileParquet1 = Files.createTempFile("crawling-process", ".parquet");
        tempFileParquet2 = Files.createTempFile("crawling-process", ".parquet");
+       tempFileDb = Files.createTempFile("crawling-process", ".db");

    }

@@ -505,22 +509,26 @@ class CrawlerRetreiverTest {
    }

    private void doCrawlWithReferenceStream(CrawlerMain.CrawlSpecRecord specs, SerializableCrawlDataStream stream) {
-       try (var recorder = new WarcRecorder(tempFileWarc2)) {
-           new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, recorder).crawlDomain(new DomainLinks(),
+       try (var recorder = new WarcRecorder(tempFileWarc2);
+            var db = new DomainStateDb(tempFileDb)
+       ) {
+           new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, db, recorder).crawlDomain(new DomainLinks(),
                    new CrawlDataReference(stream));
        }
-       catch (IOException ex) {
+       catch (IOException | SQLException ex) {
            Assertions.fail(ex);
        }
    }

    @NotNull
    private DomainCrawlFrontier doCrawl(Path tempFileWarc1, CrawlerMain.CrawlSpecRecord specs) {
-       try (var recorder = new WarcRecorder(tempFileWarc1)) {
-           var crawler = new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, recorder);
+       try (var recorder = new WarcRecorder(tempFileWarc1);
+            var db = new DomainStateDb(tempFileDb)
+       ) {
+           var crawler = new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, db, recorder);
            crawler.crawlDomain();
            return crawler.getCrawlFrontier();
-       } catch (IOException ex) {
+       } catch (IOException | SQLException ex) {
            Assertions.fail(ex);
            return null; // unreachable
        }

@@ -179,6 +179,9 @@ public class LiveCrawlerMain extends ProcessMainClass {
                 EdgeDomain domain = new EdgeDomain(entry.getKey());
                 List<String> urls = entry.getValue();

+                if (urls.isEmpty())
+                    continue;
+
                 fetcher.scheduleRetrieval(domain, urls);
             }
         }
@@ -3,6 +3,8 @@ package nu.marginalia.livecrawler;
 import crawlercommons.robots.SimpleRobotRules;
 import crawlercommons.robots.SimpleRobotRulesParser;
 import nu.marginalia.WmsaHome;
+import nu.marginalia.contenttype.ContentType;
+import nu.marginalia.contenttype.DocumentBodyToString;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
 import nu.marginalia.crawl.logic.DomainLocks;
 import nu.marginalia.crawl.retreival.CrawlDelayTimer;
@@ -16,6 +18,7 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 import javax.annotation.Nullable;
+import java.io.ByteArrayInputStream;
 import java.io.IOException;
 import java.net.URISyntaxException;
 import java.net.http.HttpClient;
@@ -23,10 +26,12 @@ import java.net.http.HttpHeaders;
 import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.time.Duration;
+import java.util.ArrayList;
 import java.util.List;
+import java.util.Optional;
 import java.util.concurrent.ThreadLocalRandom;
 import java.util.concurrent.TimeUnit;
+import java.util.zip.GZIPInputStream;

 /** A simple link scraper that fetches URLs and stores them in a database,
  * with no concept of a crawl frontier, WARC output, or other advanced features
@@ -43,6 +48,8 @@ public class SimpleLinkScraper implements AutoCloseable {
     private final Duration readTimeout = Duration.ofSeconds(10);
     private final DomainLocks domainLocks = new DomainLocks();

+    private final static int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);

     public SimpleLinkScraper(LiveCrawlDataSet dataSet,
                              DbDomainQueries domainQueries,
                              DomainBlacklist domainBlacklist) {
@@ -61,52 +68,68 @@ public class SimpleLinkScraper implements AutoCloseable {
         pool.submitQuietly(() -> retrieveNow(domain, id.getAsInt(), urls));
     }

-    public void retrieveNow(EdgeDomain domain, int domainId, List<String> urls) throws Exception {
+    public int retrieveNow(EdgeDomain domain, int domainId, List<String> urls) throws Exception {
+
+        EdgeUrl rootUrl = domain.toRootUrlHttps();
+
+        List<EdgeUrl> relevantUrls = new ArrayList<>();
+
+        for (var url : urls) {
+            Optional<EdgeUrl> optParsedUrl = lp.parseLink(rootUrl, url);
+            if (optParsedUrl.isEmpty()) {
+                continue;
+            }
+            if (dataSet.hasUrl(optParsedUrl.get())) {
+                continue;
+            }
+            relevantUrls.add(optParsedUrl.get());
+        }
+
+        if (relevantUrls.isEmpty()) {
+            return 0;
+        }
+
+        int fetched = 0;

         try (HttpClient client = HttpClient
                 .newBuilder()
                 .connectTimeout(connectTimeout)
                 .followRedirects(HttpClient.Redirect.NEVER)
                 .version(HttpClient.Version.HTTP_2)
                 .build();
-             DomainLocks.DomainLock lock = domainLocks.lockDomain(domain) // throttle concurrent access per domain; do not remove
+             // throttle concurrent access per domain; IDE will complain it's not used, but it holds a semaphore -- do not remove:
+             DomainLocks.DomainLock lock = domainLocks.lockDomain(domain)
        ) {

-            EdgeUrl rootUrl = domain.toRootUrlHttps();

            SimpleRobotRules rules = fetchRobotsRules(rootUrl, client);

            if (rules == null) { // I/O error fetching robots.txt
                // If we can't fetch the robots.txt,
-               for (var url : urls) {
-                   lp.parseLink(rootUrl, url).ifPresent(this::maybeFlagAsBad);
+               for (var url : relevantUrls) {
+                   maybeFlagAsBad(url);
                }
-               return;
+               return fetched;
            }

            CrawlDelayTimer timer = new CrawlDelayTimer(rules.getCrawlDelay());

-           for (var url : urls) {
-               Optional<EdgeUrl> optParsedUrl = lp.parseLink(rootUrl, url);
-               if (optParsedUrl.isEmpty()) {
-                   continue;
-               }
-               if (dataSet.hasUrl(optParsedUrl.get())) {
-                   continue;
-               }
+           for (var parsedUrl : relevantUrls) {

-               EdgeUrl parsedUrl = optParsedUrl.get();
-               if (!rules.isAllowed(url)) {
+               if (!rules.isAllowed(parsedUrl.toString())) {
                    maybeFlagAsBad(parsedUrl);
                    continue;
                }

                switch (fetchUrl(domainId, parsedUrl, timer, client)) {
-                   case FetchResult.Success(int id, EdgeUrl docUrl, String body, String headers)
-                       -> dataSet.saveDocument(id, docUrl, body, headers, "");
+                   case FetchResult.Success(int id, EdgeUrl docUrl, String body, String headers) -> {
+                       dataSet.saveDocument(id, docUrl, body, headers, "");
+                       fetched++;
+                   }
                    case FetchResult.Error(EdgeUrl docUrl) -> maybeFlagAsBad(docUrl);
                }
            }
        }

+       return fetched;
    }

    private void maybeFlagAsBad(EdgeUrl url) {
@@ -128,6 +151,7 @@ public class SimpleLinkScraper implements AutoCloseable {
        var robotsRequest = HttpRequest.newBuilder(rootUrl.withPathAndParam("/robots.txt", null).asURI())
                .GET()
                .header("User-Agent", WmsaHome.getUserAgent().uaString())
+               .header("Accept-Encoding","gzip")
                .timeout(readTimeout);

        // Fetch the robots.txt
@@ -135,9 +159,10 @@ public class SimpleLinkScraper implements AutoCloseable {
        try {
            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            HttpResponse<byte[]> robotsTxt = client.send(robotsRequest.build(), HttpResponse.BodyHandlers.ofByteArray());
+
            if (robotsTxt.statusCode() == 200) {
                return parser.parseContent(rootUrl.toString(),
-                       robotsTxt.body(),
+                       getResponseData(robotsTxt),
                        robotsTxt.headers().firstValue("Content-Type").orElse("text/plain"),
                        WmsaHome.getUserAgent().uaIdentifier());
            }
@@ -161,18 +186,19 @@ public class SimpleLinkScraper implements AutoCloseable {
                .GET()
                .header("User-Agent", WmsaHome.getUserAgent().uaString())
                .header("Accept", "text/html")
+               .header("Accept-Encoding", "gzip")
                .timeout(readTimeout)
                .build();

        try {
-           HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
+           HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            // Handle rate limiting by waiting and retrying once
            if (response.statusCode() == 429) {
                timer.waitRetryDelay(new HttpFetcherImpl.RateLimitException(
                        response.headers().firstValue("Retry-After").orElse("5")
                ));
-               response = client.send(request, HttpResponse.BodyHandlers.ofString());
+               response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            }

            String contentType = response.headers().firstValue("Content-Type").orElse("").toLowerCase();
@@ -182,12 +208,14 @@ public class SimpleLinkScraper implements AutoCloseable {
                return new FetchResult.Error(parsedUrl);
            }

-           String body = response.body();
-           if (body.length() > 1024 * 1024) {
+           byte[] body = getResponseData(response);
+           if (body.length > MAX_SIZE) {
                return new FetchResult.Error(parsedUrl);
            }

-           return new FetchResult.Success(domainId, parsedUrl, body, headersToString(response.headers()));
+           String bodyText = DocumentBodyToString.getStringData(ContentType.parse(contentType), body);
+
+           return new FetchResult.Success(domainId, parsedUrl, bodyText, headersToString(response.headers()));
            }
        }
        catch (IOException ex) {
@@ -198,6 +226,19 @@ public class SimpleLinkScraper implements AutoCloseable {
        return new FetchResult.Error(parsedUrl);
    }

+   private byte[] getResponseData(HttpResponse<byte[]> response) throws IOException {
+       String encoding = response.headers().firstValue("Content-Encoding").orElse("");
+
+       if ("gzip".equals(encoding)) {
+           try (var stream = new GZIPInputStream(new ByteArrayInputStream(response.body()))) {
+               return stream.readAllBytes();
+           }
+       }
+       else {
+           return response.body();
+       }
+   }

    sealed interface FetchResult {
        record Success(int domainId, EdgeUrl url, String body, String headers) implements FetchResult {}
        record Error(EdgeUrl url) implements FetchResult {}
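An aside on the gzip handling this commit adds to SimpleLinkScraper: java.net.http's HttpClient does not transparently decode Content-Encoding, so once the scraper starts sending "Accept-Encoding: gzip" it has to decompress response bodies itself, which is what the new getResponseData method does. A minimal self-contained sketch of the same pattern follows; the class and method names here are illustrative, not part of the codebase.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.zip.GZIPInputStream;

class GzipFetchSketch {
    // Fetch a URL with gzip negotiation, decompressing manually if the server compressed the body.
    static byte[] fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept-Encoding", "gzip")
                .build();

        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

        boolean gzipped = response.headers()
                .firstValue("Content-Encoding")
                .map("gzip"::equals)
                .orElse(false);

        if (!gzipped) {
            return response.body();
        }
        try (var in = new GZIPInputStream(new ByteArrayInputStream(response.body()))) {
            return in.readAllBytes();
        }
    }
}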
@@ -0,0 +1,66 @@
package nu.marginalia.livecrawler;

import nu.marginalia.db.DomainBlacklistImpl;
import nu.marginalia.io.SerializableCrawlDataStream;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.crawldata.CrawledDocument;
import org.apache.commons.io.FileUtils;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.mockito.Mockito;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.SQLException;
import java.util.List;

class SimpleLinkScraperTest {
    private Path tempDir;
    private LiveCrawlDataSet dataSet;

    @BeforeEach
    public void setUp() throws IOException, SQLException {
        tempDir = Files.createTempDirectory(getClass().getSimpleName());
        dataSet = new LiveCrawlDataSet(tempDir);
    }

    @AfterEach
    public void tearDown() throws Exception {
        dataSet.close();
        FileUtils.deleteDirectory(tempDir.toFile());
    }

    @Test
    public void testRetrieveNow() throws Exception {
        var scraper = new SimpleLinkScraper(dataSet, null, Mockito.mock(DomainBlacklistImpl.class));
        int fetched = scraper.retrieveNow(new EdgeDomain("www.marginalia.nu"), 1, List.of("https://www.marginalia.nu/"));
        Assertions.assertEquals(1, fetched);

        var streams = dataSet.getDataStreams();
        Assertions.assertEquals(1, streams.size());

        SerializableCrawlDataStream firstStream = streams.iterator().next();
        Assertions.assertTrue(firstStream.hasNext());

        List<CrawledDocument> documents = firstStream.docsAsList();
        Assertions.assertEquals(1, documents.size());
        Assertions.assertTrue(documents.getFirst().documentBody.startsWith("<!doctype"));
    }

    @Test
    public void testRetrieveNow_Redundant() throws Exception {
        dataSet.saveDocument(1, new EdgeUrl("https://www.marginalia.nu/"), "<html>", "", "127.0.0.1");
        var scraper = new SimpleLinkScraper(dataSet, null, Mockito.mock(DomainBlacklistImpl.class));

        // If the requested URL is already in the dataSet, retrieveNow should short-circuit and not fetch anything
        int fetched = scraper.retrieveNow(new EdgeDomain("www.marginalia.nu"), 1, List.of("https://www.marginalia.nu/"));
        Assertions.assertEquals(0, fetched);
    }
}
@@ -11,7 +11,7 @@ import nu.marginalia.api.svc.RateLimiterService;
 import nu.marginalia.api.svc.ResponseCache;
 import nu.marginalia.model.gson.GsonFactory;
 import nu.marginalia.service.server.BaseServiceParams;
-import nu.marginalia.service.server.Service;
+import nu.marginalia.service.server.SparkService;
 import nu.marginalia.service.server.mq.MqRequest;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -21,7 +21,7 @@ import spark.Request;
 import spark.Response;
 import spark.Spark;

-public class ApiService extends Service {
+public class ApiService extends SparkService {

     private final Logger logger = LoggerFactory.getLogger(getClass());
     private final Gson gson = GsonFactory.get();
@@ -69,7 +69,7 @@ public class ApiService extends Service {
         this.searchOperator = searchOperator;

         Spark.get("/api/", (rq, rsp) -> {
-            rsp.redirect("https://memex.marginalia.nu/projects/edge/api.gmi");
+            rsp.redirect("https://about.marginalia-search.com/article/api/");
             return "";
         });
@@ -9,7 +9,7 @@ import nu.marginalia.renderer.MustacheRenderer;
 import nu.marginalia.renderer.RendererFactory;
 import nu.marginalia.screenshot.ScreenshotService;
 import nu.marginalia.service.server.BaseServiceParams;
-import nu.marginalia.service.server.Service;
+import nu.marginalia.service.server.SparkService;
 import org.jetbrains.annotations.NotNull;
 import spark.Request;
 import spark.Response;
@@ -18,7 +18,7 @@ import spark.Spark;
 import java.util.Map;
 import java.util.Optional;

-public class DatingService extends Service {
+public class DatingService extends SparkService {
     private final DomainBlacklist blacklist;
     private final DbBrowseDomainsSimilarCosine browseSimilarCosine;
     private final DbBrowseDomainsRandom browseRandom;
@@ -5,7 +5,7 @@ import com.zaxxer.hikari.HikariDataSource;
 import nu.marginalia.renderer.MustacheRenderer;
 import nu.marginalia.renderer.RendererFactory;
 import nu.marginalia.service.server.BaseServiceParams;
-import nu.marginalia.service.server.Service;
+import nu.marginalia.service.server.SparkService;
 import nu.marginalia.service.server.StaticResources;
 import org.jetbrains.annotations.NotNull;
 import spark.Request;
@@ -15,7 +15,7 @@ import spark.Spark;
 import java.sql.SQLException;
 import java.util.*;

-public class ExplorerService extends Service {
+public class ExplorerService extends SparkService {

     private final MustacheRenderer<Object> renderer;
     private final HikariDataSource dataSource;
code/services-application/search-service-legacy/build.gradle (new file, 94 lines)
@@ -0,0 +1,94 @@
plugins {
    id 'java'
    id 'io.freefair.sass-base' version '8.4'
    id 'io.freefair.sass-java' version '8.4'
    id 'application'
    id 'jvm-test-suite'

    id 'com.google.cloud.tools.jib' version '3.4.3'
}

application {
    mainClass = 'nu.marginalia.search.SearchMain'
    applicationName = 'search-service-legacy'
}

tasks.distZip.enabled = false


java {
    toolchain {
        languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
    }
}
sass {
    sourceMapEnabled = true
    sourceMapEmbed = true
    outputStyle = EXPANDED
}

apply from: "$rootProject.projectDir/srcsets.gradle"
apply from: "$rootProject.projectDir/docker.gradle"

dependencies {
    implementation project(':code:common:db')
    implementation project(':code:common:model')
    implementation project(':code:common:service')
    implementation project(':code:common:config')
    implementation project(':code:index:query')

    implementation project(':code:libraries:easy-lsh')
    implementation project(':code:libraries:language-processing')
    implementation project(':code:libraries:braille-block-punch-cards')
    implementation project(':code:libraries:term-frequency-dict')

    implementation project(':code:functions:live-capture:api')
    implementation project(':code:functions:math:api')
    implementation project(':code:functions:domain-info:api')
    implementation project(':code:functions:search-query:api')


    implementation project(':code:index:api')
    implementation project(':code:common:renderer')

    implementation project(':code:features-search:screenshots')
    implementation project(':code:features-search:random-websites')

    implementation libs.bundles.slf4j

    implementation libs.roaringbitmap
    implementation libs.prometheus
    implementation libs.notnull
    implementation libs.guava
    implementation dependencies.create(libs.guice.get()) {
        exclude group: 'com.google.guava'
    }
    implementation libs.handlebars
    implementation dependencies.create(libs.spark.get()) {
        exclude group: 'org.eclipse.jetty'
    }
    implementation libs.bundles.jetty
    implementation libs.opencsv
    implementation libs.trove
    implementation libs.fastutil
    implementation libs.bundles.gson
    implementation libs.bundles.mariadb
    implementation libs.bundles.nlp

    testImplementation libs.bundles.slf4j.test
    testImplementation libs.bundles.junit
    testImplementation libs.mockito

    testImplementation platform('org.testcontainers:testcontainers-bom:1.17.4')
    testImplementation libs.commons.codec
    testImplementation 'org.testcontainers:mariadb:1.17.4'
    testImplementation 'org.testcontainers:junit-jupiter:1.17.4'
    testImplementation project(':code:libraries:test-helpers')
}

tasks.register('paperDoll', Test) {
    useJUnitPlatform {
        includeTags "paperdoll"
    }
    jvmArgs = [ '-DrunPaperDoll=true', '--enable-preview' ]
}
@@ -0,0 +1,47 @@
package nu.marginalia.search;

import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;
import nu.marginalia.service.MainClass;
import nu.marginalia.service.discovery.ServiceRegistryIf;
import nu.marginalia.service.module.ServiceConfiguration;
import nu.marginalia.service.module.ServiceDiscoveryModule;
import nu.marginalia.service.ServiceId;
import nu.marginalia.service.module.ServiceConfigurationModule;
import nu.marginalia.service.module.DatabaseModule;
import nu.marginalia.service.server.Initialization;
import spark.Spark;

public class SearchMain extends MainClass {
    private final SearchService service;

    @Inject
    public SearchMain(SearchService service) {
        this.service = service;
    }

    public static void main(String... args) {

        init(ServiceId.Search, args);

        Spark.staticFileLocation("/static/search/");

        Injector injector = Guice.createInjector(
                new SearchModule(),
                new ServiceConfigurationModule(ServiceId.Search),
                new ServiceDiscoveryModule(),
                new DatabaseModule(false)
        );


        // Orchestrate the boot order for the services
        var registry = injector.getInstance(ServiceRegistryIf.class);
        var configuration = injector.getInstance(ServiceConfiguration.class);
        orchestrateBoot(registry, configuration);

        injector.getInstance(SearchMain.class);
        injector.getInstance(Initialization.class).setReady();

    }
}
@@ -0,0 +1,20 @@
package nu.marginalia.search;

import com.google.inject.AbstractModule;
import nu.marginalia.LanguageModels;
import nu.marginalia.WebsiteUrl;
import nu.marginalia.WmsaHome;
import nu.marginalia.renderer.config.HandlebarsConfigurator;

public class SearchModule extends AbstractModule {

    public void configure() {
        bind(HandlebarsConfigurator.class).to(SearchHandlebarsConfigurator.class);

        bind(LanguageModels.class).toInstance(WmsaHome.getLanguageModels());

        bind(WebsiteUrl.class).toInstance(new WebsiteUrl(
                System.getProperty("search.legacyWebsiteUrl", "https://old-search.marginalia.nu/")));
    }

}
@@ -0,0 +1,266 @@
package nu.marginalia.search;

import com.google.inject.Inject;
import com.google.inject.Singleton;
import nu.marginalia.WebsiteUrl;
import nu.marginalia.api.math.MathClient;
import nu.marginalia.api.searchquery.QueryClient;
import nu.marginalia.api.searchquery.model.query.QueryResponse;
import nu.marginalia.api.searchquery.model.results.DecoratedSearchResultItem;
import nu.marginalia.bbpc.BrailleBlockPunchCards;
import nu.marginalia.db.DbDomainQueries;
import nu.marginalia.index.query.limit.QueryLimits;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.crawl.DomainIndexingState;
import nu.marginalia.search.command.SearchParameters;
import nu.marginalia.search.model.ClusteredUrlDetails;
import nu.marginalia.search.model.DecoratedSearchResults;
import nu.marginalia.search.model.SearchFilters;
import nu.marginalia.search.model.UrlDetails;
import nu.marginalia.search.results.UrlDeduplicator;
import nu.marginalia.search.svc.SearchQueryCountService;
import nu.marginalia.search.svc.SearchUnitConversionService;
import org.apache.logging.log4j.util.Strings;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

import javax.annotation.Nullable;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@Singleton
public class SearchOperator {

    private static final Logger logger = LoggerFactory.getLogger(SearchOperator.class);

    // Marker for filtering out sensitive content from the persistent logs
    private final Marker queryMarker = MarkerFactory.getMarker("QUERY");

    private final MathClient mathClient;
    private final DbDomainQueries domainQueries;
    private final QueryClient queryClient;
    private final SearchQueryParamFactory paramFactory;
    private final WebsiteUrl websiteUrl;
    private final SearchUnitConversionService searchUnitConversionService;
    private final SearchQueryCountService searchVisitorCount;


    @Inject
    public SearchOperator(MathClient mathClient,
                          DbDomainQueries domainQueries,
                          QueryClient queryClient,
                          SearchQueryParamFactory paramFactory,
                          WebsiteUrl websiteUrl,
                          SearchUnitConversionService searchUnitConversionService,
                          SearchQueryCountService searchVisitorCount
                          )
    {

        this.mathClient = mathClient;
        this.domainQueries = domainQueries;
        this.queryClient = queryClient;
        this.paramFactory = paramFactory;
        this.websiteUrl = websiteUrl;
        this.searchUnitConversionService = searchUnitConversionService;
        this.searchVisitorCount = searchVisitorCount;
    }

    public List<UrlDetails> doSiteSearch(String domain,
                                         int domainId,
                                         int count) {

        var queryParams = paramFactory.forSiteSearch(domain, domainId, count);
        var queryResponse = queryClient.search(queryParams);

        return getResultsFromQuery(queryResponse);
    }

    public List<UrlDetails> doBacklinkSearch(String domain) {

        var queryParams = paramFactory.forBacklinkSearch(domain);
        var queryResponse = queryClient.search(queryParams);

        return getResultsFromQuery(queryResponse);
    }

    public List<UrlDetails> doLinkSearch(String source, String dest) {
        var queryParams = paramFactory.forLinkSearch(source, dest);
        var queryResponse = queryClient.search(queryParams);

        return getResultsFromQuery(queryResponse);
    }

    public DecoratedSearchResults doSearch(SearchParameters userParams) throws InterruptedException {
        // The full user-facing search query does additional work to try to evaluate the query
        // e.g. as a unit conversion query.  This is done in parallel with the regular search.

        Future<String> eval = searchUnitConversionService.tryEval(userParams.query());

        // Perform the regular search

        var queryParams = paramFactory.forRegularSearch(userParams);
        QueryResponse queryResponse = queryClient.search(queryParams);
        var queryResults = getResultsFromQuery(queryResponse);

        // Cluster the results based on the query response
        List<ClusteredUrlDetails> clusteredResults = SearchResultClusterer
                .selectStrategy(queryResponse)
                .clusterResults(queryResults, 25);

        // Log the query and results

        logger.info(queryMarker, "Human terms: {}", Strings.join(queryResponse.searchTermsHuman(), ','));
        logger.info(queryMarker, "Search Result Count: {}", queryResults.size());

        // Get the evaluation result and other data to return to the user
        String evalResult = getFutureOrDefault(eval, "");

        String focusDomain = queryResponse.domain();
        int focusDomainId = focusDomain == null
                ? -1
                : domainQueries.tryGetDomainId(new EdgeDomain(focusDomain)).orElse(-1);

        List<String> problems = getProblems(evalResult, queryResults, queryResponse);

        List<DecoratedSearchResults.Page> resultPages = IntStream.rangeClosed(1, queryResponse.totalPages())
                .mapToObj(number -> new DecoratedSearchResults.Page(
                        number,
                        number == userParams.page(),
                        userParams.withPage(number).renderUrl(websiteUrl)
                ))
                .toList();

        // Return the results to the user
        return DecoratedSearchResults.builder()
                .params(userParams)
                .problems(problems)
                .evalResult(evalResult)
                .results(clusteredResults)
                .filters(new SearchFilters(websiteUrl, userParams))
                .focusDomain(focusDomain)
                .focusDomainId(focusDomainId)
                .resultPages(resultPages)
                .build();
    }


    public List<UrlDetails> getResultsFromQuery(QueryResponse queryResponse) {
        final QueryLimits limits = queryResponse.specs().queryLimits;
        final UrlDeduplicator deduplicator = new UrlDeduplicator(limits.resultsByDomain());

        // Update the query count (this is what you see on the front page)
        searchVisitorCount.registerQuery();

        return queryResponse.results().stream()
                .filter(deduplicator::shouldRetain)
                .limit(limits.resultsTotal())
                .map(SearchOperator::createDetails)
                .toList();
    }

    private static UrlDetails createDetails(DecoratedSearchResultItem item) {
        return new UrlDetails(
                item.documentId(),
                item.domainId(),
                cleanUrl(item.url),
                item.title,
                item.description,
                item.format,
                item.features,
                DomainIndexingState.ACTIVE,
                item.rankingScore, // termScore
                item.resultsFromDomain,
                BrailleBlockPunchCards.printBits(item.bestPositions, 64),
                Long.bitCount(item.bestPositions),
                item.rawIndexResult,
                item.rawIndexResult.keywordScores
        );
    }

    /** Replace nuisance domains with replacements where available */
    private static EdgeUrl cleanUrl(EdgeUrl url) {
        String topdomain = url.domain.topDomain;
        String subdomain = url.domain.subDomain;
        String path = url.path;

        if (topdomain.equals("fandom.com")) {
            int wikiIndex = path.indexOf("/wiki/");
            if (wikiIndex >= 0) {
                return new EdgeUrl("https", new EdgeDomain("breezewiki.com"), null, "/" + subdomain + path.substring(wikiIndex), null);
            }
        }
        else if (topdomain.equals("medium.com")) {
            if (!subdomain.isBlank()) {
                return new EdgeUrl("https", new EdgeDomain("scribe.rip"), null, path, null);
            }
            else {
                String article = path.substring(path.indexOf("/", 1));
                return new EdgeUrl("https", new EdgeDomain("scribe.rip"), null, article, null);
            }

        }
        return url;
    }

    private List<String> getProblems(String evalResult, List<UrlDetails> queryResults, QueryResponse response) throws InterruptedException {

        // We don't debug the query if it's a site search
        if (response.domain() == null)
            return List.of();

        final List<String> problems = new ArrayList<>(response.problems());

        if (queryResults.size() <= 5 && null == evalResult) {
            problems.add("Try rephrasing the query, changing the word order or using synonyms to get different results.");

            // Try to spell check the search terms
            var suggestions = getFutureOrDefault(
                    mathClient.spellCheck(response.searchTermsHuman()),
                    Map.of()
            );

            suggestions.forEach((term, suggestion) -> {
                if (suggestion.size() > 1) {
                    String suggestionsStr = "\"%s\" could be spelled %s".formatted(term, suggestion.stream().map(s -> "\"" + s + "\"").collect(Collectors.joining(", ")));
                    problems.add(suggestionsStr);
                }
            });
        }

        Set<String> representativeKeywords = response.getAllKeywords();
        if (representativeKeywords.size() > 1 && (representativeKeywords.contains("definition") || representativeKeywords.contains("define") || representativeKeywords.contains("meaning")))
        {
            problems.add("Tip: Try using a query that looks like <tt>define:word</tt> if you want a dictionary definition");
        }

        return problems;
    }

    private <T> T getFutureOrDefault(@Nullable Future<T> fut, T defaultValue) {
        return getFutureOrDefault(fut, Duration.ofMillis(50), defaultValue);
    }

    private <T> T getFutureOrDefault(@Nullable Future<T> fut, Duration timeout, T defaultValue) {
        if (fut == null || fut.isCancelled()) {
            return defaultValue;
        }
        try {
            return fut.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        }
        catch (Exception ex) {
            logger.warn("Error fetching eval result", ex);
            return defaultValue;
        }
    }

}
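One pattern in SearchOperator worth calling out: auxiliary lookups such as unit conversion and spell checking run as futures on a strict time budget, and getFutureOrDefault degrades to a default value rather than letting a slow sidecar stall or fail the whole search. A standalone sketch of the same idea, with illustrative names:

import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeboxSketch {
    // Wait at most `timeout` for the future; fall back to `defaultValue` on timeout or error.
    static <T> T getOrDefault(Future<T> fut, Duration timeout, T defaultValue) {
        if (fut == null || fut.isCancelled())
            return defaultValue;
        try {
            return fut.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return defaultValue;
        }
        catch (ExecutionException | TimeoutException e) {
            return defaultValue;
        }
    }
}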
@@ -0,0 +1,104 @@
package nu.marginalia.search;

import nu.marginalia.api.searchquery.model.query.QueryParams;
import nu.marginalia.api.searchquery.model.query.SearchQuery;
import nu.marginalia.api.searchquery.model.query.SearchSetIdentifier;
import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
import nu.marginalia.index.query.limit.QueryLimits;
import nu.marginalia.index.query.limit.QueryStrategy;
import nu.marginalia.index.query.limit.SpecificationLimit;
import nu.marginalia.search.command.SearchParameters;

import java.util.List;

public class SearchQueryParamFactory {

    public QueryParams forRegularSearch(SearchParameters userParams) {
        SearchQuery prototype = new SearchQuery();
        var profile = userParams.profile();

        profile.addTacitTerms(prototype);
        userParams.js().addTacitTerms(prototype);
        userParams.adtech().addTacitTerms(prototype);

        return new QueryParams(
                userParams.query(),
                null,
                prototype.searchTermsInclude,
                prototype.searchTermsExclude,
                prototype.searchTermsPriority,
                prototype.searchTermsAdvice,
                profile.getQualityLimit(),
                profile.getYearLimit(),
                profile.getSizeLimit(),
                SpecificationLimit.none(),
                List.of(),
                new QueryLimits(5, 100, 200, 8192),
                profile.searchSetIdentifier.name(),
                userParams.strategy(),
                userParams.temporalBias(),
                userParams.page()
        );

    }

    public QueryParams forSiteSearch(String domain, int domainId, int count) {
        return new QueryParams("site:"+domain,
                null,
                List.of(),
                List.of(),
                List.of(),
                List.of(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                List.of(domainId),
                new QueryLimits(count, count, 100, 512),
                SearchSetIdentifier.NONE.name(),
                QueryStrategy.AUTO,
                ResultRankingParameters.TemporalBias.NONE,
                1
        );
    }

    public QueryParams forBacklinkSearch(String domain) {
        return new QueryParams("links:"+domain,
                null,
                List.of(),
                List.of(),
                List.of(),
                List.of(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                List.of(),
                new QueryLimits(100, 100, 100, 512),
                SearchSetIdentifier.NONE.name(),
                QueryStrategy.AUTO,
                ResultRankingParameters.TemporalBias.NONE,
                1
        );
    }

    public QueryParams forLinkSearch(String sourceDomain, String destDomain) {
        return new QueryParams("site:" + sourceDomain + " links:" + destDomain,
                null,
                List.of(),
                List.of(),
                List.of(),
                List.of(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                List.of(),
                new QueryLimits(100, 100, 100, 512),
                SearchSetIdentifier.NONE.name(),
                QueryStrategy.AUTO,
                ResultRankingParameters.TemporalBias.NONE,
                1
        );
    }
}
|
||||
package nu.marginalia.search;
|
||||
|
||||
import nu.marginalia.api.searchquery.model.query.QueryResponse;
|
||||
import nu.marginalia.search.model.ClusteredUrlDetails;
|
||||
import nu.marginalia.search.model.UrlDetails;
|
||||
|
||||
import java.util.List;
|
||||
import java.util.stream.Collectors;
|
||||
|
||||
/** Functions for clustering search results */
|
||||
public class SearchResultClusterer {
|
||||
private SearchResultClusterer() {}
|
||||
|
||||
public interface SearchResultClusterStrategy {
|
||||
List<ClusteredUrlDetails> clusterResults(List<UrlDetails> results, int total);
|
||||
}
|
||||
|
||||
public static SearchResultClusterStrategy selectStrategy(QueryResponse response) {
|
||||
if (response.domain() != null && !response.domain().isBlank())
|
||||
return SearchResultClusterer::noOp;
|
||||
|
||||
return SearchResultClusterer::byDomain;
|
||||
}
|
||||
|
||||
/** No clustering, just return the results as is */
|
||||
private static List<ClusteredUrlDetails> noOp(List<UrlDetails> results, int total) {
|
||||
if (results.isEmpty())
|
||||
return List.of();
|
||||
|
||||
return results.stream()
|
||||
.map(ClusteredUrlDetails::new)
|
||||
.toList();
|
||||
}
|
||||
|
||||
/** Cluster the results by domain, and return the top "total" clusters
|
||||
* sorted by the relevance of the best result
|
||||
*/
|
||||
private static List<ClusteredUrlDetails> byDomain(List<UrlDetails> results, int total) {
|
||||
if (results.isEmpty())
|
||||
return List.of();
|
||||
|
||||
return results.stream()
|
||||
.collect(
|
||||
Collectors.groupingBy(details -> details.domainId)
|
||||
)
|
||||
.values().stream()
|
||||
.map(ClusteredUrlDetails::new)
|
||||
.sorted()
|
||||
.limit(total)
|
||||
.toList();
|
||||
}
|
||||
|
||||
}
|
@@ -0,0 +1,128 @@
|
||||
package nu.marginalia.search;
|
||||
|
||||
import com.google.inject.Inject;
|
||||
import io.prometheus.client.Counter;
|
||||
import io.prometheus.client.Histogram;
|
||||
import nu.marginalia.WebsiteUrl;
|
||||
import nu.marginalia.search.svc.*;
|
||||
import nu.marginalia.service.server.BaseServiceParams;
|
||||
import nu.marginalia.service.server.SparkService;
|
||||
import nu.marginalia.service.server.StaticResources;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
import spark.Request;
|
||||
import spark.Response;
|
||||
import spark.Route;
|
||||
import spark.Spark;
|
||||
|
||||
import java.net.URLEncoder;
|
||||
import java.nio.charset.StandardCharsets;
|
||||
|
||||
public class SearchService extends SparkService {
|
||||
|
||||
private final WebsiteUrl websiteUrl;
|
||||
private final StaticResources staticResources;
|
||||
|
||||
private static final Logger logger = LoggerFactory.getLogger(SearchService.class);
|
||||
private static final Histogram wmsa_search_service_request_time = Histogram.build()
|
||||
.name("wmsa_search_service_request_time")
|
||||
.linearBuckets(0.05, 0.05, 15)
|
||||
.labelNames("matchedPath", "method")
|
||||
.help("Search service request time (seconds)")
|
||||
.register();
|
||||
private static final Counter wmsa_search_service_error_count = Counter.build()
|
||||
.name("wmsa_search_service_error_count")
|
||||
.labelNames("matchedPath", "method")
|
||||
.help("Search service error count")
|
||||
.register();
|
||||
|
||||
@Inject
|
||||
public SearchService(BaseServiceParams params,
|
||||
WebsiteUrl websiteUrl,
|
||||
StaticResources staticResources,
|
||||
SearchFrontPageService frontPageService,
|
||||
SearchErrorPageService errorPageService,
|
||||
SearchAddToCrawlQueueService addToCrawlQueueService,
|
||||
SearchSiteInfoService siteInfoService,
|
||||
SearchCrosstalkService crosstalkService,
|
||||
SearchQueryService searchQueryService)
|
||||
throws Exception
|
||||
{
|
||||
super(params);
|
||||
|
||||
this.websiteUrl = websiteUrl;
|
||||
this.staticResources = staticResources;
|
||||
|
||||
Spark.staticFiles.expireTime(600);
|
||||
|
||||
SearchServiceMetrics.get("/search", searchQueryService::pathSearch);
|
||||
|
||||
SearchServiceMetrics.get("/", frontPageService::render);
|
||||
SearchServiceMetrics.get("/news.xml", frontPageService::renderNewsFeed);
|
||||
SearchServiceMetrics.get("/:resource", this::serveStatic);
|
||||
|
||||
SearchServiceMetrics.post("/site/suggest/", addToCrawlQueueService::suggestCrawling);
|
||||
|
||||
SearchServiceMetrics.get("/site-search/:site/*", this::siteSearchRedir);
|
||||
|
||||
SearchServiceMetrics.get("/site/:site", siteInfoService::handle);
|
||||
SearchServiceMetrics.post("/site/:site", siteInfoService::handlePost);
|
||||
|
||||
SearchServiceMetrics.get("/crosstalk/", crosstalkService::handle);
|
||||
|
||||
Spark.exception(Exception.class, (e,p,q) -> {
|
||||
logger.error("Error during processing", e);
|
||||
wmsa_search_service_error_count.labels(p.pathInfo(), p.requestMethod()).inc();
|
||||
errorPageService.serveError(p, q);
|
||||
});
|
||||
|
||||
Spark.awaitInitialization();
|
||||
}
|
||||
|
||||
|
||||
|
||||
/** Wraps a route with a timer and a counter */
|
||||
private static class SearchServiceMetrics implements Route {
|
||||
private final Route delegatedRoute;
|
||||
|
||||
static void get(String path, Route route) {
|
||||
Spark.get(path, new SearchServiceMetrics(route));
|
||||
}
|
||||
static void post(String path, Route route) {
|
||||
Spark.post(path, new SearchServiceMetrics(route));
|
||||
}
|
||||
|
||||
private SearchServiceMetrics(Route delegatedRoute) {
|
||||
this.delegatedRoute = delegatedRoute;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Object handle(Request request, Response response) throws Exception {
|
||||
return wmsa_search_service_request_time
|
||||
.labels(request.matchedPath(), request.requestMethod())
|
||||
.time(() -> delegatedRoute.handle(request, response));
|
||||
}
|
||||
}
|
||||
|
||||
private Object serveStatic(Request request, Response response) {
|
||||
String resource = request.params("resource");
|
||||
staticResources.serveStatic("search", resource, request, response);
|
||||
return "";
|
||||
}
|
||||
|
||||
private Object siteSearchRedir(Request request, Response response) {
|
||||
final String site = request.params("site");
|
||||
final String searchTerms;
|
||||
|
||||
if (request.splat().length == 0) searchTerms = "";
|
||||
else searchTerms = request.splat()[0];
|
||||
|
||||
final String query = URLEncoder.encode(String.format("%s site:%s", searchTerms, site), StandardCharsets.UTF_8).trim();
|
||||
final String profile = request.queryParamOrDefault("profile", "yolo");
|
||||
|
||||
response.redirect(websiteUrl.withPath("search?query="+query+"&profile="+profile));
|
||||
|
||||
return "";
|
||||
}
|
||||
|
||||
}
|
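On the SearchServiceMetrics wrapper above: timing each route through a shared Prometheus histogram labelled by matched path and method is the stock simpleclient pattern. An equivalent sketch using the explicit timer API, which some may find easier to read than the callable form (the histogram and route names here are assumptions, not code from the repo):

import io.prometheus.client.Histogram;
import java.util.concurrent.Callable;

class TimedRouteSketch {
    static final Histogram requestTime = Histogram.build()
            .name("request_time_seconds")
            .labelNames("path", "method")
            .help("Request time (seconds)")
            .register();

    // Same effect as histogram.labels(...).time(callable): the duration is
    // observed even when the wrapped route throws.
    static Object handleTimed(String path, String method, Callable<Object> route) throws Exception {
        Histogram.Timer timer = requestTime.labels(path, method).startTimer();
        try {
            return route.call();
        } finally {
            timer.observeDuration();
        }
    }
}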
@@ -0,0 +1,43 @@
package nu.marginalia.search.command;

import com.google.inject.Inject;
import nu.marginalia.search.command.commands.*;
import spark.Response;

import java.util.ArrayList;
import java.util.List;

public class CommandEvaluator {

    private final List<SearchCommandInterface> specialCommands = new ArrayList<>();
    private final SearchCommand defaultCommand;

    @Inject
    public CommandEvaluator(
            BrowseCommand browse,
            ConvertCommand convert,
            DefinitionCommand define,
            BangCommand bang,
            SiteRedirectCommand siteRedirect,
            SearchCommand search
    ) {
        specialCommands.add(browse);
        specialCommands.add(convert);
        specialCommands.add(define);
        specialCommands.add(bang);
        specialCommands.add(siteRedirect);

        defaultCommand = search;
    }

    public Object eval(Response response, SearchParameters parameters) {
        for (var cmd : specialCommands) {
            var maybe = cmd.process(response, parameters);
            if (maybe.isPresent())
                return maybe.get();
        }

        return defaultCommand.process(response, parameters).orElse("");
    }

}
@@ -0,0 +1,29 @@
package nu.marginalia.search.command;

import nu.marginalia.api.searchquery.model.query.SearchQuery;

import javax.annotation.Nullable;
import java.util.Arrays;

public enum SearchAdtechParameter {
    DEFAULT("default"),
    REDUCE("reduce", "special:ads", "special:affiliate");

    public final String value;
    public final String[] implictExcludeSearchTerms;

    SearchAdtechParameter(String value, String... implictExcludeSearchTerms) {
        this.value = value;
        this.implictExcludeSearchTerms = implictExcludeSearchTerms;
    }

    public static SearchAdtechParameter parse(@Nullable String value) {
        if (REDUCE.value.equals(value)) return REDUCE;

        return DEFAULT;
    }

    public void addTacitTerms(SearchQuery subquery) {
        subquery.searchTermsExclude.addAll(Arrays.asList(implictExcludeSearchTerms));
    }
}
@@ -0,0 +1,10 @@
package nu.marginalia.search.command;


import spark.Response;

import java.util.Optional;

public interface SearchCommandInterface {
    Optional<Object> process(Response response, SearchParameters parameters);
}
@@ -0,0 +1,31 @@
package nu.marginalia.search.command;

import nu.marginalia.api.searchquery.model.query.SearchQuery;

import javax.annotation.Nullable;
import java.util.Arrays;

public enum SearchJsParameter {
    DEFAULT("default"),
    DENY_JS("no-js", "js:true"),
    REQUIRE_JS("yes-js", "js:false");

    public final String value;
    public final String[] implictExcludeSearchTerms;

    SearchJsParameter(String value, String... implictExcludeSearchTerms) {
        this.value = value;
        this.implictExcludeSearchTerms = implictExcludeSearchTerms;
    }

    public static SearchJsParameter parse(@Nullable String value) {
        if (DENY_JS.value.equals(value)) return DENY_JS;
        if (REQUIRE_JS.value.equals(value)) return REQUIRE_JS;

        return DEFAULT;
    }

    public void addTacitTerms(SearchQuery subquery) {
        subquery.searchTermsExclude.addAll(Arrays.asList(implictExcludeSearchTerms));
    }
}
@@ -0,0 +1,106 @@
package nu.marginalia.search.command;

import nu.marginalia.WebsiteUrl;
import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
import nu.marginalia.index.query.limit.QueryStrategy;
import nu.marginalia.index.query.limit.SpecificationLimit;
import nu.marginalia.search.model.SearchProfile;
import spark.Request;

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

import static nu.marginalia.search.command.SearchRecentParameter.RECENT;

public record SearchParameters(String query,
                               SearchProfile profile,
                               SearchJsParameter js,
                               SearchRecentParameter recent,
                               SearchTitleParameter searchTitle,
                               SearchAdtechParameter adtech,
                               boolean newFilter,
                               int page
                               ) {

    public SearchParameters(String queryString, Request request) {
        this(
                queryString,
                SearchProfile.getSearchProfile(request.queryParams("profile")),
                SearchJsParameter.parse(request.queryParams("js")),
                SearchRecentParameter.parse(request.queryParams("recent")),
                SearchTitleParameter.parse(request.queryParams("searchTitle")),
                SearchAdtechParameter.parse(request.queryParams("adtech")),
                "true".equals(request.queryParams("newfilter")),
                Integer.parseInt(Objects.requireNonNullElse(request.queryParams("page"), "1"))
        );
    }

    public String profileStr() {
        return profile.filterId;
    }

    public SearchParameters withProfile(SearchProfile profile) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, true, page);
    }

    public SearchParameters withJs(SearchJsParameter js) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, true, page);
    }
    public SearchParameters withAdtech(SearchAdtechParameter adtech) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, true, page);
    }

    public SearchParameters withRecent(SearchRecentParameter recent) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, true, page);
    }

    public SearchParameters withTitle(SearchTitleParameter title) {
        return new SearchParameters(query, profile, js, recent, title, adtech, true, page);
    }

    public SearchParameters withPage(int page) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, false, page);
    }

    public String renderUrl(WebsiteUrl baseUrl) {
        String path = String.format("/search?query=%s&profile=%s&js=%s&adtech=%s&recent=%s&searchTitle=%s&newfilter=%s&page=%d",
                URLEncoder.encode(query, StandardCharsets.UTF_8),
                URLEncoder.encode(profile.filterId, StandardCharsets.UTF_8),
                URLEncoder.encode(js.value, StandardCharsets.UTF_8),
                URLEncoder.encode(adtech.value, StandardCharsets.UTF_8),
                URLEncoder.encode(recent.value, StandardCharsets.UTF_8),
                URLEncoder.encode(searchTitle.value, StandardCharsets.UTF_8),
                Boolean.valueOf(newFilter).toString(),
                page
        );

        return baseUrl.withPath(path);
    }

    public ResultRankingParameters.TemporalBias temporalBias() {
        if (recent == RECENT) {
            return ResultRankingParameters.TemporalBias.RECENT;
        }
        else if (profile == SearchProfile.VINTAGE) {
            return ResultRankingParameters.TemporalBias.OLD;
        }

        return ResultRankingParameters.TemporalBias.NONE;
    }

    public QueryStrategy strategy() {
        if (searchTitle == SearchTitleParameter.TITLE) {
            return QueryStrategy.REQUIRE_FIELD_TITLE;
        }

        return QueryStrategy.AUTO;
    }

    public SpecificationLimit yearLimit() {
        if (recent == RECENT)
            return SpecificationLimit.greaterThan(2018);

        return profile.getYearLimit();
    }
}
@@ -0,0 +1,21 @@
package nu.marginalia.search.command;

import javax.annotation.Nullable;

public enum SearchRecentParameter {
    DEFAULT("default"),
    RECENT("recent");

    public final String value;

    SearchRecentParameter(String value) {
        this.value = value;
    }

    public static SearchRecentParameter parse(@Nullable String value) {
        if (RECENT.value.equals(value)) return RECENT;

        return DEFAULT;
    }

}
@@ -0,0 +1,21 @@
package nu.marginalia.search.command;

import javax.annotation.Nullable;

public enum SearchTitleParameter {
    DEFAULT("default"),
    TITLE("title");

    public final String value;

    SearchTitleParameter(String value) {
        this.value = value;
    }

    public static SearchTitleParameter parse(@Nullable String value) {
        if (TITLE.value.equals(value)) return TITLE;

        return DEFAULT;
    }

}
@@ -0,0 +1,104 @@
package nu.marginalia.search.command.commands;

import com.google.inject.Inject;
import nu.marginalia.search.command.SearchCommandInterface;
import nu.marginalia.search.command.SearchParameters;
import nu.marginalia.search.exceptions.RedirectException;
import spark.Response;

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class BangCommand implements SearchCommandInterface {
    private final Map<String, String> bangsToPattern = new HashMap<>();

    @Inject
    public BangCommand()
    {
        bangsToPattern.put("!g", "https://www.google.com/search?q=%s");
        bangsToPattern.put("!ddg", "https://duckduckgo.com/?q=%s");
        bangsToPattern.put("!w", "https://search.marginalia.nu/search?query=%s+site:en.wikipedia.org&profile=wiki");
    }

    @Override
    public Optional<Object> process(Response response, SearchParameters parameters) {

        for (var entry : bangsToPattern.entrySet()) {
            String bangPattern = entry.getKey();
            String redirectPattern = entry.getValue();

            var match = matchBangPattern(parameters.query(), bangPattern);

            if (match.isPresent()) {
                var url = String.format(redirectPattern, URLEncoder.encode(match.get(), StandardCharsets.UTF_8));
                throw new RedirectException(url);
            }
        }

        return Optional.empty();
    }

    /** If the query contains the bang pattern bangKey, return the query with the bang pattern removed. */
    Optional<String> matchBangPattern(String query, String bangKey) {
        var bm = new BangMatcher(query);

        while (bm.findNext(bangKey)) {

            if (!bm.isRelativeSpaceOrInvalid(-1))
                continue;
            if (!bm.isRelativeSpaceOrInvalid(bangKey.length()))
                continue;

            String prefix = bm.prefix().trim();
            String suffix = bm.suffix(bangKey.length()).trim();

            String ret = (prefix + " " + suffix).trim();

            return Optional.of(ret)
                    .filter(s -> !s.isBlank());
        }

        return Optional.empty();
    }

    private static class BangMatcher {
        private final String str;
        private int pos;

        public String prefix() {
            return str.substring(0, pos);
        }

        public String suffix(int offset) {
            if (pos+offset < str.length())
                return str.substring(pos + offset);
            return "";
        }

        public BangMatcher(String str) {
            this.str = str;
            this.pos = -1;
        }

        public boolean findNext(String pattern) {
            if (pos + 1 >= str.length())
                return false;

            return (pos = str.indexOf(pattern, pos + 1)) >= 0;
        }

        public boolean isRelativeSpaceOrInvalid(int offset) {
            if (offset + pos < 0)
                return true;
            if (offset + pos >= str.length())
                return true;

            return Character.isSpaceChar(str.charAt(offset + pos));
        }

    }

}
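To make matchBangPattern's contract concrete, a test-style sketch (package-private access from the same package assumed; run with assertions enabled, -ea). The bang may sit anywhere in the query as long as it stands alone as a token, and the remainder of the query becomes the redirect query:

import java.util.Optional;

var bang = new BangCommand();

// Bang at the start: the rest of the query is returned
assert bang.matchBangPattern("!g marginalia search", "!g")
        .equals(Optional.of("marginalia search"));

// Bang at the end works too; prefix and suffix are joined and trimmed
assert bang.matchBangPattern("tail !ddg", "!ddg")
        .equals(Optional.of("tail"));

// No match when the bang is glued to another token
assert bang.matchBangPattern("hello!g world", "!g").isEmpty();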
@@ -0,0 +1,36 @@
package nu.marginalia.search.command.commands;

import com.google.inject.Inject;
import nu.marginalia.renderer.MustacheRenderer;
import nu.marginalia.renderer.RendererFactory;
import nu.marginalia.search.command.SearchCommandInterface;
import nu.marginalia.search.command.SearchParameters;
import nu.marginalia.search.svc.SearchUnitConversionService;
import spark.Response;

import java.io.IOException;
import java.util.Map;
import java.util.Optional;

public class ConvertCommand implements SearchCommandInterface {
    private final SearchUnitConversionService searchUnitConversionService;
    private final MustacheRenderer<Map<String, String>> conversionRenderer;

    @Inject
    public ConvertCommand(SearchUnitConversionService searchUnitConversionService, RendererFactory rendererFactory) throws IOException {
        this.searchUnitConversionService = searchUnitConversionService;

        conversionRenderer = rendererFactory.renderer("search/conversion-results");
    }

    @Override
    public Optional<Object> process(Response response, SearchParameters parameters) {
        var conversion = searchUnitConversionService.tryConversion(parameters.query());
        return conversion.map(s -> conversionRenderer.render(Map.of(
                "query", parameters.query(),
                "result", s,
                "profile", parameters.profileStr())
        ));

    }
}
@@ -0,0 +1,70 @@
|
||||
|
||||
package nu.marginalia.search.command.commands;
|
||||
|
||||
import com.google.inject.Inject;
|
||||
import nu.marginalia.api.math.MathClient;
|
||||
import nu.marginalia.api.math.model.DictionaryResponse;
|
||||
import nu.marginalia.renderer.MustacheRenderer;
|
||||
import nu.marginalia.search.command.SearchCommandInterface;
|
||||
import nu.marginalia.search.command.SearchParameters;
|
||||
import nu.marginalia.renderer.RendererFactory;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
import spark.Response;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.util.Map;
|
||||
import java.util.Optional;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
import java.util.function.Predicate;
|
||||
import java.util.regex.Pattern;
|
||||
|
||||
public class DefinitionCommand implements SearchCommandInterface {
|
||||
private final Logger logger = LoggerFactory.getLogger(getClass());
|
||||
|
||||
private final MustacheRenderer<DictionaryResponse> dictionaryRenderer;
|
||||
private final MathClient mathClient;
|
||||
|
||||
|
||||
private final Predicate<String> queryPatternPredicate = Pattern.compile("^define:[A-Za-z\\s-0-9]+$").asPredicate();
|
||||
|
||||
@Inject
|
||||
public DefinitionCommand(RendererFactory rendererFactory, MathClient mathClient)
|
||||
throws IOException
|
||||
{
|
||||
|
||||
dictionaryRenderer = rendererFactory.renderer("search/dictionary-results");
|
||||
this.mathClient = mathClient;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Optional<Object> process(Response response, SearchParameters parameters) {
|
||||
if (!queryPatternPredicate.test(parameters.query())) {
|
||||
return Optional.empty();
|
||||
}
|
||||
|
||||
var results = lookupDefinition(parameters.query());
|
||||
|
||||
return Optional.of(dictionaryRenderer.render(results,
|
||||
Map.of("query", parameters.query(),
|
||||
"profile", parameters.profileStr())
|
||||
));
|
||||
}
|
||||
|
||||
|
||||
private DictionaryResponse lookupDefinition(String humanQuery) {
|
||||
String definePrefix = "define:";
|
||||
String word = humanQuery.substring(definePrefix.length()).toLowerCase();
|
||||
|
||||
try {
|
||||
return mathClient
|
||||
.dictionaryLookup(word)
|
||||
.get(250, TimeUnit.MILLISECONDS);
|
||||
}
|
||||
catch (Exception e) {
|
||||
logger.error("Failed to lookup definition for word: " + word, e);
|
||||
|
||||
throw new RuntimeException(e);
|
||||
}
|
||||
}
|
||||
}
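The command only engages for queries of the define:word shape, and bounds the dictionary lookup to 250 ms. A quick sketch of what the gate accepts; the words are arbitrary examples:

Predicate<String> p = Pattern.compile("^define:[A-Za-z\\s-0-9]+$").asPredicate();
p.test("define:serendipity");  // true  -> this command renders the page
p.test("define:");             // false -> falls through to the next command
p.test("what is serendipity"); // false -> not a define: query at all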
@@ -0,0 +1,39 @@
package nu.marginalia.search.command.commands;

import com.google.inject.Inject;
import nu.marginalia.renderer.MustacheRenderer;
import nu.marginalia.renderer.RendererFactory;
import nu.marginalia.search.SearchOperator;
import nu.marginalia.search.command.SearchCommandInterface;
import nu.marginalia.search.command.SearchParameters;
import nu.marginalia.search.model.DecoratedSearchResults;
import spark.Response;

import java.io.IOException;
import java.util.Optional;

public class SearchCommand implements SearchCommandInterface {
    private final SearchOperator searchOperator;
    private final MustacheRenderer<DecoratedSearchResults> searchResultsRenderer;

    @Inject
    public SearchCommand(SearchOperator searchOperator,
                         RendererFactory rendererFactory) throws IOException {
        this.searchOperator = searchOperator;

        searchResultsRenderer = rendererFactory.renderer("search/search-results");
    }

    @Override
    public Optional<Object> process(Response response, SearchParameters parameters) {
        try {
            DecoratedSearchResults results = searchOperator.doSearch(parameters);
            return Optional.of(searchResultsRenderer.render(results));
        }
        catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }
}
@@ -0,0 +1,50 @@
package nu.marginalia.search.command.commands;

import com.google.inject.Inject;
import nu.marginalia.search.command.SearchCommandInterface;
import nu.marginalia.search.command.SearchParameters;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import spark.Response;

import java.util.Optional;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class SiteRedirectCommand implements SearchCommandInterface {

    private final Logger logger = LoggerFactory.getLogger(getClass());

    private final Predicate<String> queryPatternPredicate = Pattern.compile("^(site|links):[.A-Za-z\\-0-9]+$").asPredicate();

    @Inject
    public SiteRedirectCommand() {
    }

    @Override
    public Optional<Object> process(Response response, SearchParameters parameters) {
        if (!queryPatternPredicate.test(parameters.query())) {
            return Optional.empty();
        }

        int idx = parameters.query().indexOf(':');
        String prefix = parameters.query().substring(0, idx);
        String domain = parameters.query().substring(idx + 1).toLowerCase();

        // Use an HTML redirect here, so we can use relative URLs
        String view = switch (prefix) {
            case "links" -> "links";
            default -> "info";
        };

        return Optional.of("""
                <!DOCTYPE html>
                <html lang="en">
                <meta charset="UTF-8">
                <title>Redirecting...</title>
                <meta http-equiv="refresh" content="0; url=/site/%s?view=%s">
                """.formatted(domain, view)
        );
    }

}
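For a query such as links:example.com (an arbitrary domain that matches the pattern), the text block above would format to roughly this page, which immediately bounces the browser to the site view:

<!DOCTYPE html>
<html lang="en">
<meta charset="UTF-8">
<title>Redirecting...</title>
<meta http-equiv="refresh" content="0; url=/site/example.com?view=links">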
@@ -0,0 +1,66 @@
package nu.marginalia.search.db;

import com.google.inject.Inject;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class DbNearDomainsQuery {

    private final HikariDataSource dataSource;

    @Inject
    public DbNearDomainsQuery(HikariDataSource dataSource) {
        this.dataSource = dataSource;
    }

    public List<Integer> getRelatedDomains(String term, Consumer<String> onProblem) {
        List<Integer> ret = new ArrayList<>();
        try (var conn = dataSource.getConnection();
             var selfStmt = conn.prepareStatement("""
                     SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?
                     """);
             var stmt = conn.prepareStatement("""
                     SELECT NEIGHBOR_ID, ND.INDEXED, ND.STATE FROM EC_DOMAIN_NEIGHBORS_2
                     INNER JOIN EC_DOMAIN ND ON ND.ID=NEIGHBOR_ID
                     WHERE DOMAIN_ID=?
                     """)) {
            ResultSet rsp;

            selfStmt.setString(1, term);
            rsp = selfStmt.executeQuery();
            int domainId = -1;
            if (rsp.next()) {
                domainId = rsp.getInt(1);
                ret.add(domainId);
            }

            stmt.setInt(1, domainId);
            rsp = stmt.executeQuery();

            while (rsp.next()) {
                int id = rsp.getInt(1);
                int indexed = rsp.getInt(2);
                String state = rsp.getString(3);

                if (indexed > 0 && ("ACTIVE".equalsIgnoreCase(state) || "SOCIAL_MEDIA".equalsIgnoreCase(state) || "SPECIAL".equalsIgnoreCase(state))) {
                    ret.add(id);
                }
            }
        }
        catch (Exception ex) {
            throw new RuntimeException(ex);
        }

        if (ret.isEmpty()) {
            onProblem.accept("Could not find domains adjacent to " + term);
        }

        return ret;
    }

}
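A short usage sketch; the domain name is an arbitrary example and the problem consumer is just one plausible choice:

// Illustrative only: resolve a domain to its ID, then collect its indexed, active neighbors.
List<String> problems = new ArrayList<>();
List<Integer> related = dbNearDomainsQuery.getRelatedDomains("marginalia.nu", problems::add);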
@@ -0,0 +1,102 @@
package nu.marginalia.search.model;

import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.idx.WordFlags;
import org.jetbrains.annotations.NotNull;

import java.util.*;

/** A class to hold a list of UrlDetails, grouped by domain, where the first one is the main result
 * and the rest are additional results, for summary display. */
public class ClusteredUrlDetails implements Comparable<ClusteredUrlDetails> {

    @NotNull
    public final UrlDetails first;

    @NotNull
    public final List<UrlDetails> rest;

    /** Create a new ClusteredUrlDetails from a collection of UrlDetails,
     * with the best result as "first", and the others, in descending order
     * of quality, as the "rest".
     *
     * @param details A collection of UrlDetails, which must not be empty.
     */
    public ClusteredUrlDetails(Collection<UrlDetails> details) {
        var items = new ArrayList<>(details);

        items.sort(Comparator.naturalOrder());

        if (items.isEmpty())
            throw new IllegalArgumentException("Empty list of details");

        this.first = items.removeFirst();
        this.rest = items;

        double bestScore = first.termScore;
        double scoreLimit = Math.min(4.0, bestScore * 1.25);

        this.rest.removeIf(urlDetail -> {
            if (urlDetail.termScore > scoreLimit)
                return false;

            for (var keywordScore : urlDetail.resultItem.keywordScores) {
                if (keywordScore.isKeywordSpecial())
                    continue;
                if (keywordScore.hasTermFlag(WordFlags.Title))
                    return false;
                if (keywordScore.hasTermFlag(WordFlags.ExternalLink))
                    return false;
                if (keywordScore.hasTermFlag(WordFlags.UrlDomain))
                    return false;
                if (keywordScore.hasTermFlag(WordFlags.UrlPath))
                    return false;
                if (keywordScore.hasTermFlag(WordFlags.Subjects))
                    return false;
            }

            return true;
        });
    }

    public ClusteredUrlDetails(@NotNull UrlDetails onlyFirst) {
        this.first = onlyFirst;
        this.rest = Collections.emptyList();
    }

    // For renderer use, do not remove
    public @NotNull UrlDetails getFirst() {
        return first;
    }

    // For renderer use, do not remove
    public @NotNull List<UrlDetails> getRest() {
        return rest;
    }

    public EdgeDomain getDomain() {
        return first.url.getDomain();
    }

    public boolean hasMultiple() {
        return !rest.isEmpty();
    }

    /** Returns the total number of results from the same domain,
     * including such results that are not included here. */
    public int totalCount() {
        return first.resultsFromSameDomain;
    }

    public int remainingCount() {
        return totalCount() - 1 - rest.size();
    }

    @Override
    public int compareTo(@NotNull ClusteredUrlDetails o) {
        return Objects.compare(first, o.first, UrlDetails::compareTo);
    }
}
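The constructor expects all the details to belong to one domain, so grouping has to happen upstream. A sketch of how that might look; the flatResults variable and the grouping step are assumptions, not part of this diff:

// Illustrative only: group a flat result list by domain, then cluster each group.
Map<EdgeDomain, List<UrlDetails>> byDomain = flatResults.stream()
        .collect(Collectors.groupingBy(d -> d.url.getDomain()));

List<ClusteredUrlDetails> clusters = byDomain.values().stream()
        .map(ClusteredUrlDetails::new)
        .sorted()
        .toList();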
@@ -0,0 +1,186 @@
package nu.marginalia.search.model;

import nu.marginalia.search.command.SearchParameters;

import java.util.List;

/**
 * A class to hold details about the search results,
 * as used by the handlebars templating engine to render
 * the search results page.
 */
public class DecoratedSearchResults {
    private final SearchParameters params;
    private final List<String> problems;
    private final String evalResult;

    public DecoratedSearchResults(SearchParameters params,
                                  List<String> problems,
                                  String evalResult,
                                  List<ClusteredUrlDetails> results,
                                  String focusDomain,
                                  int focusDomainId,
                                  SearchFilters filters,
                                  List<Page> resultPages) {
        this.params = params;
        this.problems = problems;
        this.evalResult = evalResult;
        this.results = results;
        this.focusDomain = focusDomain;
        this.focusDomainId = focusDomainId;
        this.filters = filters;
        this.resultPages = resultPages;
    }

    public final List<ClusteredUrlDetails> results;

    public static DecoratedSearchResultsBuilder builder() {
        return new DecoratedSearchResultsBuilder();
    }

    public SearchParameters getParams() {
        return params;
    }

    public List<String> getProblems() {
        return problems;
    }

    public String getEvalResult() {
        return evalResult;
    }

    public List<ClusteredUrlDetails> getResults() {
        return results;
    }

    public String getFocusDomain() {
        return focusDomain;
    }

    public int getFocusDomainId() {
        return focusDomainId;
    }

    public SearchFilters getFilters() {
        return filters;
    }

    public List<Page> getResultPages() {
        return resultPages;
    }

    private final String focusDomain;
    private final int focusDomainId;
    private final SearchFilters filters;

    private final List<Page> resultPages;

    public boolean isMultipage() {
        return resultPages.size() > 1;
    }

    public record Page(int number, boolean current, String href) {
    }

    // These are used by the search form; they look unused in the IDE, but are used by the mustache template.
    // DO NOT REMOVE THEM
    public int getResultCount() {
        return results.size();
    }

    public String getQuery() {
        return params.query();
    }

    public String getProfile() {
        return params.profile().filterId;
    }

    public String getJs() {
        return params.js().value;
    }

    public String getAdtech() {
        return params.adtech().value;
    }

    public String getRecent() {
        return params.recent().value;
    }

    public String getSearchTitle() {
        return params.searchTitle().value;
    }

    public int page() {
        return params.page();
    }

    public Boolean isNewFilter() {
        return params.newFilter();
    }

    public static class DecoratedSearchResultsBuilder {
        private SearchParameters params;
        private List<String> problems;
        private String evalResult;
        private List<ClusteredUrlDetails> results;
        private String focusDomain;
        private int focusDomainId;
        private SearchFilters filters;
        private List<Page> resultPages;

        DecoratedSearchResultsBuilder() {
        }

        public DecoratedSearchResultsBuilder params(SearchParameters params) {
            this.params = params;
            return this;
        }

        public DecoratedSearchResultsBuilder problems(List<String> problems) {
            this.problems = problems;
            return this;
        }

        public DecoratedSearchResultsBuilder evalResult(String evalResult) {
            this.evalResult = evalResult;
            return this;
        }

        public DecoratedSearchResultsBuilder results(List<ClusteredUrlDetails> results) {
            this.results = results;
            return this;
        }

        public DecoratedSearchResultsBuilder focusDomain(String focusDomain) {
            this.focusDomain = focusDomain;
            return this;
        }

        public DecoratedSearchResultsBuilder focusDomainId(int focusDomainId) {
            this.focusDomainId = focusDomainId;
            return this;
        }

        public DecoratedSearchResultsBuilder filters(SearchFilters filters) {
            this.filters = filters;
            return this;
        }

        public DecoratedSearchResultsBuilder resultPages(List<Page> resultPages) {
            this.resultPages = resultPages;
            return this;
        }

        public DecoratedSearchResults build() {
            return new DecoratedSearchResults(this.params, this.problems, this.evalResult, this.results, this.focusDomain, this.focusDomainId, this.filters, this.resultPages);
        }

        public String toString() {
            return "DecoratedSearchResults.DecoratedSearchResultsBuilder(params=" + this.params + ", problems=" + this.problems + ", evalResult=" + this.evalResult + ", results=" + this.results + ", focusDomain=" + this.focusDomain + ", focusDomainId=" + this.focusDomainId + ", filters=" + this.filters + ", resultPages=" + this.resultPages + ")";
        }
    }
}
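The builder mirrors the constructor field for field. A sketch of how a results page model might be assembled; all argument values here are placeholders:

// Illustrative only: assemble the template model for one results page.
DecoratedSearchResults model = DecoratedSearchResults.builder()
        .params(parameters)
        .problems(List.of())
        .evalResult("")
        .results(clusters)
        .focusDomain("")
        .focusDomainId(-1)
        .filters(new SearchFilters(websiteUrl, parameters))
        .resultPages(List.of(new DecoratedSearchResults.Page(1, true, "?page=1")))
        .build();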
@@ -0,0 +1,223 @@
package nu.marginalia.search.model;

import nu.marginalia.WebsiteUrl;
import nu.marginalia.search.command.*;

import java.util.List;

/** Models the search filters displayed next to the search results */
public class SearchFilters {
    private final WebsiteUrl url;

    public final String currentFilter;

    // These are necessary for the renderer to access the data
    public final RemoveJsOption removeJsOption;
    public final ReduceAdtechOption reduceAdtechOption;
    public final ShowRecentOption showRecentOption;
    public final SearchTitleOption searchTitleOption;

    public final List<List<Filter>> filterGroups;

    // Getters are for the renderer to access the data

    public String getCurrentFilter() {
        return currentFilter;
    }

    public RemoveJsOption getRemoveJsOption() {
        return removeJsOption;
    }

    public ReduceAdtechOption getReduceAdtechOption() {
        return reduceAdtechOption;
    }

    public ShowRecentOption getShowRecentOption() {
        return showRecentOption;
    }

    public SearchTitleOption getSearchTitleOption() {
        return searchTitleOption;
    }

    public List<List<Filter>> getFilterGroups() {
        return filterGroups;
    }

    public SearchFilters(WebsiteUrl url, SearchParameters parameters) {
        this.url = url;

        removeJsOption = new RemoveJsOption(parameters);
        reduceAdtechOption = new ReduceAdtechOption(parameters);
        showRecentOption = new ShowRecentOption(parameters);
        searchTitleOption = new SearchTitleOption(parameters);

        currentFilter = parameters.profile().filterId;

        filterGroups = List.of(
                List.of(
                        new Filter("No Filter", SearchProfile.NO_FILTER, parameters),
                        // new Filter("Popular", SearchProfile.POPULAR, parameters),
                        new Filter("Small Web", SearchProfile.SMALLWEB, parameters),
                        new Filter("Blogosphere", SearchProfile.BLOGOSPHERE, parameters),
                        new Filter("Academia", SearchProfile.ACADEMIA, parameters)
                ),
                List.of(
                        new Filter("Vintage", SearchProfile.VINTAGE, parameters),
                        new Filter("Plain Text", SearchProfile.PLAIN_TEXT, parameters),
                        new Filter("~tilde", SearchProfile.TILDE, parameters)
                ),
                List.of(
                        new Filter("Wiki", SearchProfile.WIKI, parameters),
                        new Filter("Forum", SearchProfile.FORUM, parameters),
                        new Filter("Docs", SearchProfile.DOCS, parameters),
                        new Filter("Recipes", SearchProfile.FOOD, parameters)
                )
        );
    }

    public class RemoveJsOption {
        private final SearchJsParameter value;

        public final String url;

        public String getUrl() {
            return url;
        }

        public boolean isSet() {
            return value.equals(SearchJsParameter.DENY_JS);
        }

        public String name() {
            return "Remove Javascript";
        }

        public RemoveJsOption(SearchParameters parameters) {
            this.value = parameters.js();

            var toggledValue = switch (parameters.js()) {
                case DENY_JS -> SearchJsParameter.DEFAULT;
                default -> SearchJsParameter.DENY_JS;
            };

            this.url = parameters.withJs(toggledValue).renderUrl(SearchFilters.this.url);
        }
    }

    public class ReduceAdtechOption {
        private final SearchAdtechParameter value;

        public final String url;

        public String getUrl() {
            return url;
        }

        public boolean isSet() {
            return value.equals(SearchAdtechParameter.REDUCE);
        }

        public String name() {
            return "Reduce Adtech";
        }

        public ReduceAdtechOption(SearchParameters parameters) {
            this.value = parameters.adtech();

            var toggledValue = switch (parameters.adtech()) {
                case REDUCE -> SearchAdtechParameter.DEFAULT;
                default -> SearchAdtechParameter.REDUCE;
            };

            this.url = parameters.withAdtech(toggledValue).renderUrl(SearchFilters.this.url);
        }
    }

    public class ShowRecentOption {
        private final SearchRecentParameter value;

        public final String url;

        public String getUrl() {
            return url;
        }

        public boolean isSet() {
            return value.equals(SearchRecentParameter.RECENT);
        }

        public String name() {
            return "Recent Results";
        }

        public ShowRecentOption(SearchParameters parameters) {
            this.value = parameters.recent();

            var toggledValue = switch (parameters.recent()) {
                case RECENT -> SearchRecentParameter.DEFAULT;
                default -> SearchRecentParameter.RECENT;
            };

            this.url = parameters.withRecent(toggledValue).renderUrl(SearchFilters.this.url);
        }
    }

    public class SearchTitleOption {
        private final SearchTitleParameter value;

        public final String url;

        public String getUrl() {
            return url;
        }

        public boolean isSet() {
            return value.equals(SearchTitleParameter.TITLE);
        }

        public String name() {
            return "Search In Title";
        }

        public SearchTitleOption(SearchParameters parameters) {
            this.value = parameters.searchTitle();

            var toggledValue = switch (parameters.searchTitle()) {
                case TITLE -> SearchTitleParameter.DEFAULT;
                default -> SearchTitleParameter.TITLE;
            };

            this.url = parameters.withTitle(toggledValue).renderUrl(SearchFilters.this.url);
        }
    }

    public class Filter {
        public final SearchProfile profile;

        public final String displayName;
        public final boolean current;
        public final String url;

        public Filter(String displayName, SearchProfile profile, SearchParameters parameters) {
            this.displayName = displayName;
            this.profile = profile;
            this.current = profile.equals(parameters.profile());

            this.url = parameters.withProfile(profile).renderUrl(SearchFilters.this.url);
        }

        public String getDisplayName() {
            return displayName;
        }

        public boolean isCurrent() {
            return current;
        }

        public String getUrl() {
            return url;
        }
    }
}
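Each option captures the current parameter value and precomputes the URL that flips it, so the template only needs to print url and isSet. A sketch of what a consumer sees; the variables are placeholders:

// Illustrative only: render state for the "Remove Javascript" toggle.
SearchFilters filters = new SearchFilters(websiteUrl, parameters);
boolean jsFilterActive = filters.getRemoveJsOption().isSet();
String toggleHref = filters.getRemoveJsOption().getUrl(); // same query, js parameter flipped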
@@ -0,0 +1,105 @@
package nu.marginalia.search.model;

import nu.marginalia.index.query.limit.SpecificationLimit;
import nu.marginalia.model.crawl.HtmlFeature;
import nu.marginalia.api.searchquery.model.query.SearchQuery;
import nu.marginalia.api.searchquery.model.query.SearchSetIdentifier;

import java.util.Objects;

public enum SearchProfile {
    POPULAR("default", SearchSetIdentifier.POPULAR),
    SMALLWEB("modern", SearchSetIdentifier.SMALLWEB),
    BLOGOSPHERE("blogosphere", SearchSetIdentifier.BLOGS),
    NO_FILTER("corpo", SearchSetIdentifier.NONE),
    VINTAGE("vintage", SearchSetIdentifier.NONE),
    TILDE("tilde", SearchSetIdentifier.NONE),
    CORPO_CLEAN("corpo-clean", SearchSetIdentifier.NONE),
    ACADEMIA("academia", SearchSetIdentifier.NONE),
    PLAIN_TEXT("plain-text", SearchSetIdentifier.NONE),
    FOOD("food", SearchSetIdentifier.POPULAR),
    FORUM("forum", SearchSetIdentifier.NONE),
    WIKI("wiki", SearchSetIdentifier.NONE),
    DOCS("docs", SearchSetIdentifier.NONE),
    ;

    public final String filterId;
    public final SearchSetIdentifier searchSetIdentifier;

    SearchProfile(String filterId, SearchSetIdentifier searchSetIdentifier) {
        this.filterId = filterId;
        this.searchSetIdentifier = searchSetIdentifier;
    }

    private final static SearchProfile[] values = values();

    public static SearchProfile getSearchProfile(String param) {
        if (null == param) {
            return NO_FILTER;
        }

        for (var profile : values) {
            if (Objects.equals(profile.filterId, param)) {
                return profile;
            }
        }

        return NO_FILTER;
    }

    public void addTacitTerms(SearchQuery subquery) {
        if (this == ACADEMIA) {
            subquery.searchTermsAdvice.add("special:academia");
        }
        if (this == VINTAGE) {
            subquery.searchTermsPriority.add("format:html123");
            subquery.searchTermsPriority.add("js:false");
        }
        if (this == TILDE) {
            subquery.searchTermsAdvice.add("special:tilde");
        }
        if (this == PLAIN_TEXT) {
            subquery.searchTermsAdvice.add("format:plain");
        }
        if (this == WIKI) {
            subquery.searchTermsAdvice.add("generator:wiki");
        }
        if (this == FORUM) {
            subquery.searchTermsAdvice.add("generator:forum");
        }
        if (this == DOCS) {
            subquery.searchTermsAdvice.add("generator:docs");
        }
        if (this == FOOD) {
            subquery.searchTermsAdvice.add(HtmlFeature.CATEGORY_FOOD.getKeyword());
            subquery.searchTermsExclude.add("special:ads");
        }
    }

    public SpecificationLimit getYearLimit() {
        if (this == SMALLWEB) {
            return SpecificationLimit.greaterThan(2015);
        }
        if (this == VINTAGE) {
            return SpecificationLimit.lessThan(2003);
        }
        else return SpecificationLimit.none();
    }

    public SpecificationLimit getSizeLimit() {
        if (this == SMALLWEB) {
            return SpecificationLimit.lessThan(500);
        }
        else return SpecificationLimit.none();
    }

    public SpecificationLimit getQualityLimit() {
        if (this == SMALLWEB) {
            return SpecificationLimit.lessThan(5);
        }
        else return SpecificationLimit.none();
    }

}
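A small sketch of the profile lookup and the tacit terms a profile implies; the SearchQuery instance is a placeholder:

SearchProfile profile = SearchProfile.getSearchProfile("vintage");   // VINTAGE
SearchProfile fallback = SearchProfile.getSearchProfile("bogus-id"); // NO_FILTER

profile.addTacitTerms(searchQuery); // adds format:html123 and js:false as priority terms
SpecificationLimit years = profile.getYearLimit(); // lessThan(2003) for VINTAGE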
@@ -0,0 +1,293 @@
package nu.marginalia.search.model;

import nu.marginalia.api.searchquery.model.results.SearchResultItem;
import nu.marginalia.api.searchquery.model.results.SearchResultKeywordScore;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.crawl.DomainIndexingState;
import nu.marginalia.model.crawl.HtmlFeature;

import java.util.ArrayList;
import java.util.List;

/**
 * A class to hold details about a single search result.
 */
public class UrlDetails implements Comparable<UrlDetails> {
    public long id;
    public int domainId;

    public EdgeUrl url;
    public String title;
    public String description;

    public String format;
    public int features;

    public DomainIndexingState domainState;

    public double termScore;

    public int resultsFromSameDomain;

    public String positions;
    public int positionsCount;
    public SearchResultItem resultItem;
    public List<SearchResultKeywordScore> keywordScores;

    public UrlDetails(long id, int domainId, EdgeUrl url, String title, String description, String format, int features, DomainIndexingState domainState, double termScore, int resultsFromSameDomain, String positions, int positionsCount, SearchResultItem resultItem, List<SearchResultKeywordScore> keywordScores) {
        this.id = id;
        this.domainId = domainId;
        this.url = url;
        this.title = title;
        this.description = description;
        this.format = format;
        this.features = features;
        this.domainState = domainState;
        this.termScore = termScore;
        this.resultsFromSameDomain = resultsFromSameDomain;
        this.positions = positions;
        this.positionsCount = positionsCount;
        this.resultItem = resultItem;
        this.keywordScores = keywordScores;
    }

    public UrlDetails() {
    }

    public boolean hasMoreResults() {
        return resultsFromSameDomain > 1;
    }

    public String getFormat() {
        if (null == format) {
            return "?";
        }
        switch (format) {
            case "HTML123":
                return "HTML 1-3";
            case "HTML4":
                return "HTML 4";
            case "XHTML":
                return "XHTML";
            case "HTML5":
                return "HTML 5";
            case "PLAIN":
                return "Plain Text";
            default:
                return "?";
        }
    }

    public int hashCode() {
        return Long.hashCode(id);
    }

    @Override
    public int compareTo(UrlDetails other) {
        int result = Double.compare(getTermScore(), other.getTermScore());
        if (result == 0) result = Long.compare(getId(), other.getId());
        return result;
    }

    public boolean equals(Object other) {
        if (other == null) {
            return false;
        }
        if (other == this) {
            return true;
        }
        if (other instanceof UrlDetails) {
            return ((UrlDetails) other).id == id;
        }
        return false;
    }

    public String getTitle() {
        if (title == null || title.isBlank()) {
            return url.toString();
        }
        return title;
    }

    public boolean isPlainText() {
        return "PLAIN".equals(format);
    }

    public int getProblemCount() {
        int mask = HtmlFeature.JS.getFeatureBit()
                | HtmlFeature.COOKIES.getFeatureBit()
                | HtmlFeature.TRACKING.getFeatureBit()
                | HtmlFeature.AFFILIATE_LINK.getFeatureBit()
                | HtmlFeature.TRACKING_ADTECH.getFeatureBit()
                | HtmlFeature.ADVERTISEMENT.getFeatureBit();

        return Integer.bitCount(features & mask);
    }

    public List<UrlProblem> getProblems() {
        List<UrlProblem> problems = new ArrayList<>();

        if (isScripts()) {
            problems.add(new UrlProblem("Js", "The page uses Javascript"));
        }
        if (isCookies()) {
            problems.add(new UrlProblem("Co", "The page uses Cookies"));
        }
        if (isTracking()) {
            problems.add(new UrlProblem("Tr", "The page uses Tracking/Analytics"));
        }
        if (isAffiliate()) {
            problems.add(new UrlProblem("Af", "The page may use Affiliate Linking"));
        }
        if (isAds()) {
            problems.add(new UrlProblem("Ad", "The page uses Ads/Adtech Tracking"));
        }
        return problems;
    }

    public boolean isScripts() {
        return HtmlFeature.hasFeature(features, HtmlFeature.JS);
    }

    public boolean isTracking() {
        return HtmlFeature.hasFeature(features, HtmlFeature.TRACKING);
    }

    public boolean isAffiliate() {
        return HtmlFeature.hasFeature(features, HtmlFeature.AFFILIATE_LINK);
    }

    public boolean isMedia() {
        return HtmlFeature.hasFeature(features, HtmlFeature.MEDIA);
    }

    public boolean isCookies() {
        return HtmlFeature.hasFeature(features, HtmlFeature.COOKIES);
    }

    public boolean isAds() {
        return HtmlFeature.hasFeature(features, HtmlFeature.TRACKING_ADTECH);
    }

    public int getMatchRank() {
        if (termScore <= 1) return 1;
        if (termScore <= 2) return 2;
        if (termScore <= 3) return 3;
        if (termScore <= 5) return 5;

        return 10;
    }

    public long getId() {
        return this.id;
    }

    public int getDomainId() {
        return this.domainId;
    }

    public EdgeUrl getUrl() {
        return this.url;
    }

    public String getDescription() {
        return this.description;
    }

    public int getFeatures() {
        return this.features;
    }

    public DomainIndexingState getDomainState() {
        return this.domainState;
    }

    public double getTermScore() {
        return this.termScore;
    }

    public int getResultsFromSameDomain() {
        return this.resultsFromSameDomain;
    }

    public String getPositions() {
        return this.positions;
    }

    public int getPositionsCount() {
        return this.positionsCount;
    }

    public SearchResultItem getResultItem() {
        return this.resultItem;
    }

    public List<SearchResultKeywordScore> getKeywordScores() {
        return this.keywordScores;
    }

    public UrlDetails withId(long id) {
        return this.id == id ? this : new UrlDetails(id, this.domainId, this.url, this.title, this.description, this.format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withDomainId(int domainId) {
        return this.domainId == domainId ? this : new UrlDetails(this.id, domainId, this.url, this.title, this.description, this.format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withUrl(EdgeUrl url) {
        return this.url == url ? this : new UrlDetails(this.id, this.domainId, url, this.title, this.description, this.format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withTitle(String title) {
        return this.title == title ? this : new UrlDetails(this.id, this.domainId, this.url, title, this.description, this.format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withDescription(String description) {
        return this.description == description ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, description, this.format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withFormat(String format) {
        return this.format == format ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, this.description, format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withFeatures(int features) {
        return this.features == features ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, this.description, this.format, features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withDomainState(DomainIndexingState domainState) {
        return this.domainState == domainState ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, this.description, this.format, this.features, domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withTermScore(double termScore) {
        return this.termScore == termScore ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, this.description, this.format, this.features, this.domainState, termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withResultsFromSameDomain(int resultsFromSameDomain) {
        return this.resultsFromSameDomain == resultsFromSameDomain ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, this.description, this.format, this.features, this.domainState, this.termScore, resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withPositions(String positions) {
        return this.positions == positions ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, this.description, this.format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, positions, this.positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withPositionsCount(int positionsCount) {
        return this.positionsCount == positionsCount ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, this.description, this.format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, positionsCount, this.resultItem, this.keywordScores);
    }

    public UrlDetails withResultItem(SearchResultItem resultItem) {
        return this.resultItem == resultItem ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, this.description, this.format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, resultItem, this.keywordScores);
    }

    public UrlDetails withKeywordScores(List<SearchResultKeywordScore> keywordScores) {
        return this.keywordScores == keywordScores ? this : new UrlDetails(this.id, this.domainId, this.url, this.title, this.description, this.format, this.features, this.domainState, this.termScore, this.resultsFromSameDomain, this.positions, this.positionsCount, this.resultItem, keywordScores);
    }

    public String toString() {
        return "UrlDetails(id=" + this.getId() + ", domainId=" + this.getDomainId() + ", url=" + this.getUrl() + ", title=" + this.getTitle() + ", description=" + this.getDescription() + ", format=" + this.getFormat() + ", features=" + this.getFeatures() + ", domainState=" + this.getDomainState() + ", termScore=" + this.getTermScore() + ", resultsFromSameDomain=" + this.getResultsFromSameDomain() + ", positions=" + this.getPositions() + ", positionsCount=" + this.getPositionsCount() + ", resultItem=" + this.getResultItem() + ", keywordScores=" + this.getKeywordScores() + ")";
    }

    public static record UrlProblem(String name, String description) {
    }
}
@@ -0,0 +1,27 @@
package nu.marginalia.search.results;

import com.google.inject.Inject;
import com.google.inject.Singleton;
import nu.marginalia.browse.model.BrowseResult;
import nu.marginalia.screenshot.ScreenshotService;

import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

@Singleton
public class BrowseResultCleaner {
    private final ScreenshotService screenshotService;

    @Inject
    public BrowseResultCleaner(ScreenshotService screenshotService) {
        this.screenshotService = screenshotService;
    }

    public Predicate<BrowseResult> shouldRemoveResultPredicateBr() {
        Set<String> domainHashes = new HashSet<>(100);

        return (res) -> !screenshotService.hasScreenshot(res.domainId())
                     || !domainHashes.add(res.domainHash());
    }
}
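Note that the returned predicate is stateful: the captured domainHashes set lets the first result per domain through and rejects later duplicates, so a fresh predicate is needed for each result list. A sketch; browseResults is a placeholder:

// Illustrative only: drop screenshot-less results and repeated domains in place.
browseResults.removeIf(browseResultCleaner.shouldRemoveResultPredicateBr());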
@@ -0,0 +1,69 @@
package nu.marginalia.search.results;

import gnu.trove.list.TLongList;
import gnu.trove.list.array.TLongArrayList;
import gnu.trove.map.hash.TObjectIntHashMap;
import gnu.trove.set.hash.TIntHashSet;
import nu.marginalia.api.searchquery.model.results.DecoratedSearchResultItem;
import nu.marginalia.lsh.EasyLSH;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Objects;

public class UrlDeduplicator {
    private final int LSH_SIMILARITY_THRESHOLD = 2;
    private static final Logger logger = LoggerFactory.getLogger(UrlDeduplicator.class);

    private final TIntHashSet seenSuperficialhashes = new TIntHashSet(200);
    private final TLongList seehLSHList = new TLongArrayList(200);
    private final TObjectIntHashMap<String> keyCount = new TObjectIntHashMap<>(200, 0.75f, 0);

    private final int resultsPerKey;

    public UrlDeduplicator(int resultsPerKey) {
        this.resultsPerKey = resultsPerKey;
    }

    public boolean shouldRemove(DecoratedSearchResultItem details) {
        if (!deduplicateOnSuperficialHash(details))
            return true;
        if (!deduplicateOnLSH(details))
            return true;
        if (!limitResultsPerDomain(details))
            return true;

        return false;
    }

    public boolean shouldRetain(DecoratedSearchResultItem details) {
        return !shouldRemove(details);
    }

    private boolean deduplicateOnSuperficialHash(DecoratedSearchResultItem details) {
        return seenSuperficialhashes.add(Objects.hash(details.url.path, details.title));
    }

    private boolean deduplicateOnLSH(DecoratedSearchResultItem details) {
        long thisHash = details.dataHash;

        if (0 == thisHash)
            return true;

        // Trove's forEach returns true only if the procedure held for every element,
        // i.e. if this document is sufficiently dissimilar to all previously seen ones
        if (seehLSHList.forEach(otherHash -> EasyLSH.hammingDistance(thisHash, otherHash) >= LSH_SIMILARITY_THRESHOLD))
        {
            seehLSHList.add(thisHash);
            return true;
        }

        return false;
    }

    private boolean limitResultsPerDomain(DecoratedSearchResultItem details) {
        final var domain = details.getUrl().getDomain();
        final String key = domain.getDomainKey();

        return keyCount.adjustOrPutValue(key, 1, 1) <= resultsPerKey;
    }
}
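The deduplicator likewise keeps per-query state, so one instance filters one result list. A sketch; the limit of 3 results per domain is an arbitrary choice:

// Illustrative only: drop near-duplicates and cap results per domain.
UrlDeduplicator deduplicator = new UrlDeduplicator(3);
List<DecoratedSearchResultItem> kept = rankedResults.stream()
        .filter(deduplicator::shouldRetain)
        .toList();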
Some files were not shown because too many files have changed in this diff.