Mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git
Synced 2025-10-06 17:32:39 +02:00

Compare commits: deploy-000 ... deploy-002 (116 commits)
Commits (SHA1):
b5469bd8a1
6a6318d04c
55933f8d40
be6382e0d0
45e771f96b
8dde502cc9
3e66767af3
9ec9d1b338
dcad0d7863
94e1aa0baf
b62f043910
6ea22d0d21
8c69dc31b8
00734ea87f
3009713db4
9b2ceaf37c
8019c2ce18
a9e312b8b1
4da3563d8a
48d0a3089a
594df64b20
06efb5abfc
78eb1417a7
8c8f2ad5ee
f71e79d10f
1b27c5cf06
67edc8f90d
5f576b7d0c
8b05c788fd
236f033bc9
510fc75121
0376f2e6e3
0b65164f60
9be477de33
84f55b84ff
ab5c30ad51
0c839453c5
5e4c5d03ae
710af4999a
a5b0a1ae62
e9f71ee39b
baeb4a46cd
5e2a8e9f27
cc1a5bdf90
7f7b1ffaba
0ea8092350
483d29497e
bae44497fe
0d59202aca
0ca43f0c9c
3bc99639a0
927bc0b63c
d968801dc1
89db69d360
895cee7004
4bb71b8439
e4a41f7dd1
69ad6287b1
81cdd6385d
e76c42329f
e6ef4734ea
41a59dcf45
df4bc1d7e9
2b222efa75
94d4d2edb7
7ae19a92ba
56d14e56d7
a557c7ae7f
6d18e6d840
2a3c63f209
9f70cecaef
c08203e2ed
86497fd32f
3b998573fd
e161882ec7
357f349e30
e4769f541d
2a173e2861
a6a900266c
bdba53f055
bbdde789e7
eab61cd48a
0ce2ba9ad9
3ddcebaa36
b91463383e
7444a2f36c
fdee07048d
2fbf201761
4018e4c434
f3382b5bd8
9287ee0141
2769c8f869
ddb66f33ba
79500b8fbc
187eea43a4
a89ed6fa9f
8d168be138
6e1aa7b391
deab9b9516
39d99a906a
6f72e6e0d3
d786d79483
01510f6c2e
7ba43e9e3f
97bfcd1353
aa3c85c196
fb75a3827d
7d546d0e2a
8fcb6ffd7a
f97de0c15a
be9e192b78
75ae1c9526
33761a0236
19b69b1764
8b804359a9
f050bf5c4c
.github/FUNDING.yml (vendored, 1 line changed)

@@ -1,5 +1,6 @@
 # These are supported funding model platforms
 
+polar: marginalia-search
 github: MarginaliaSearch
 patreon: marginalia_nu
 open_collective: # Replace with a single Open Collective username
.gitignore (vendored, 1 line changed)

@@ -7,3 +7,4 @@ build/
 lombok.config
 Dockerfile
 run
+jte-classes
ROADMAP.md (58 lines changed)

@@ -8,20 +8,10 @@ be implemented as well.
 Major goals:
 
 * Reach 1 billion pages indexed
-* Improve technical ability of indexing and search. Although this area has improved a bit, the
-search engine is still not very good at dealing with longer queries.
-
-## Proper Position Index (COMPLETED 2024-09)
-
-The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
-of being very fast to evaluate and works well for what it is, but is inaccurate and has the
-drawback of making support for quoted search terms inaccurate and largely reliant on indexing
-word n-grams known beforehand. This limits the ability to interpret longer queries.
-
-The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
-list, as is the civilized way of doing this.
-
-Completed with PR https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99
+* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
+search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done )
 
 ## Hybridize crawler w/ Common Crawl data
 
@@ -37,10 +27,15 @@ Retaining the ability to independently crawl the web is still strongly desirable
 
 ## Safe Search
 
-The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable
-to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
+The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
 combined with naive bayesian filter would go a long way, or something more sophisticated...?
 
+## Web Design Overhaul
+
+The design is kinda clunky and hard to maintain, and needlessly outdated-looking.
+
+In progress: PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127) -- demo available at https://test.marginalia.nu/
+
 ## Additional Language Support
 
 It would be desirable if the search engine supported more languages than English. This is partially about
@@ -49,15 +44,6 @@ associated with each language added, at least a models file or two, as well as s
 
 It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
 
-## Finalize RSS support (COMPLETED 2024-11)
-
-Marginalia has experimental RSS preview support for a few domains. This works well and
-it should be extended to all domains. It would also be interesting to offer search of the
-RSS data itself, or use the RSS set to feed a special live index that updates faster than the
-main dataset.
-
-Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122)
-
 ## Support for binary formats like PDF
 
 The crawler needs to be modified to retain them, and the conversion logic needs to parse them.
@@ -74,5 +60,27 @@ This looks like a good idea that wouldn't just help clean up the search filters
 website, but might be cheap enough we might go as far as to offer a number of ad-hoc custom search
 filter for any API consumer.
 
-I've talked to the stract dev and he does not think it's a good idea to mimic their optics language,
-which is quite ad-hoc, but instead to work together to find some new common description language for this.
+I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
+
+# Completed
+
+## Proper Position Index (COMPLETED 2024-09)
+
+The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
+of being very fast to evaluate and works well for what it is, but is inaccurate and has the
+drawback of making support for quoted search terms inaccurate and largely reliant on indexing
+word n-grams known beforehand. This limits the ability to interpret longer queries.
+
+The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
+list, as is the civilized way of doing this.
+
+Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+
+## Finalize RSS support (COMPLETED 2024-11)
+
+Marginalia has experimental RSS preview support for a few domains. This works well and
+it should be extended to all domains. It would also be interesting to offer search of the
+RSS data itself, or use the RSS set to feed a special live index that updates faster than the
+main dataset.
+
+Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
@@ -48,6 +48,7 @@ ext {
     dockerImageTag='latest'
     dockerImageRegistry='marginalia'
     jibVersion = '3.4.3'
+
 }
 
 idea {
@@ -28,7 +28,7 @@ public class DbDomainQueries {
     }
 
 
-    public Integer getDomainId(EdgeDomain domain) {
+    public Integer getDomainId(EdgeDomain domain) throws NoSuchElementException {
         try (var connection = dataSource.getConnection()) {
 
             return domainIdCache.get(domain, () -> {
@@ -42,6 +42,9 @@ public class DbDomainQueries {
                 throw new NoSuchElementException();
             });
         }
+        catch (UncheckedExecutionException ex) {
+            throw new NoSuchElementException();
+        }
         catch (ExecutionException ex) {
             throw new RuntimeException(ex.getCause());
         }
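Note on the new catch clause: Guava's Cache.get(key, loader) wraps unchecked exceptions thrown by the loader in UncheckedExecutionException, so the NoSuchElementException raised inside the lambda would otherwise surface in wrapped form. A minimal sketch of that behavior, assuming a plain Guava cache rather than the project's actual wiring:

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.util.concurrent.UncheckedExecutionException;

    import java.util.NoSuchElementException;
    import java.util.concurrent.ExecutionException;

    class CacheMissDemo {
        public static void main(String[] args) throws ExecutionException {
            Cache<String, Integer> cache = CacheBuilder.newBuilder().build();
            try {
                // The loader throws an unchecked exception for an absent key
                cache.get("missing", () -> { throw new NoSuchElementException(); });
            }
            catch (UncheckedExecutionException ex) {
                // Guava delivers the loader's unchecked exception wrapped;
                // unwrapping restores the type callers of getDomainId() expect
                throw new NoSuchElementException();
            }
        }
    }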
@@ -42,6 +42,12 @@ dependencies {
     implementation libs.bundles.curator
     implementation libs.bundles.flyway
 
+    libs.bundles.jooby.get().each {
+        implementation dependencies.create(it) {
+            exclude group: 'org.slf4j'
+        }
+    }
+
     testImplementation libs.bundles.slf4j.test
     implementation libs.bundles.mariadb
 
@@ -7,8 +7,6 @@ import nu.marginalia.service.discovery.property.PartitionTraits;
 import nu.marginalia.service.discovery.property.ServiceEndpoint;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
 
 import java.util.List;
 import java.util.concurrent.CompletableFuture;
@@ -24,7 +22,7 @@ import java.util.function.Function;
 public class GrpcMultiNodeChannelPool<STUB> {
     private final ConcurrentHashMap<Integer, GrpcSingleNodeChannelPool<STUB>> pools =
             new ConcurrentHashMap<>();
-    private static final Logger logger = LoggerFactory.getLogger(GrpcMultiNodeChannelPool.class);
     private final ServiceRegistryIf serviceRegistryIf;
     private final ServiceKey<? extends PartitionTraits.Multicast> serviceKey;
     private final Function<ServiceEndpoint.InstanceAddress, ManagedChannel> channelConstructor;
@@ -10,6 +10,8 @@ import nu.marginalia.service.discovery.property.ServiceKey;
 import org.jetbrains.annotations.NotNull;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
+import org.slf4j.Marker;
+import org.slf4j.MarkerFactory;
 
 import java.time.Duration;
 import java.util.*;
@@ -26,13 +28,13 @@ import java.util.function.Function;
 public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
     private final Map<InstanceAddress, ConnectionHolder> channels = new ConcurrentHashMap<>();
 
+    private final Marker grpcMarker = MarkerFactory.getMarker("GRPC");
     private static final Logger logger = LoggerFactory.getLogger(GrpcSingleNodeChannelPool.class);
 
     private final ServiceRegistryIf serviceRegistryIf;
     private final Function<InstanceAddress, ManagedChannel> channelConstructor;
     private final Function<ManagedChannel, STUB> stubConstructor;
 
 
     public GrpcSingleNodeChannelPool(ServiceRegistryIf serviceRegistryIf,
                                      ServiceKey<? extends PartitionTraits.Unicast> serviceKey,
                                      Function<InstanceAddress, ManagedChannel> channelConstructor,
@@ -48,8 +50,6 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         serviceRegistryIf.registerMonitor(this);
 
         onChange();
-
-        awaitChannel(Duration.ofSeconds(5));
     }
 
 
@@ -62,10 +62,10 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         for (var route : Sets.symmetricDifference(oldRoutes, newRoutes)) {
             ConnectionHolder oldChannel;
             if (newRoutes.contains(route)) {
-                logger.info("Adding route {}", route);
+                logger.info(grpcMarker, "Adding route {} => {}", serviceKey, route);
                 oldChannel = channels.put(route, new ConnectionHolder(route));
             } else {
-                logger.info("Expelling route {}", route);
+                logger.info(grpcMarker, "Expelling route {} => {}", serviceKey, route);
                 oldChannel = channels.remove(route);
             }
             if (oldChannel != null) {
@@ -103,7 +103,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         }
 
         try {
-            logger.info("Creating channel for {}:{}", serviceKey, address);
+            logger.info(grpcMarker, "Creating channel for {} => {}", serviceKey, address);
             value = channelConstructor.apply(address);
             if (channel.compareAndSet(null, value)) {
                 return value;
@@ -114,7 +114,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
             }
         }
         catch (Exception e) {
-            logger.error("Failed to get channel for " + address, e);
+            logger.error(grpcMarker, "Failed to get channel for " + address, e);
             return null;
         }
     }
@@ -206,7 +206,7 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
         }
 
         for (var e : exceptions) {
-            logger.error("Failed to call service {}", serviceKey, e);
+            logger.error(grpcMarker, "Failed to call service {}", serviceKey, e);
         }
 
         throw new ServiceNotAvailableException(serviceKey);
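For context, the grpcMarker threaded through these log calls is a standard SLF4J marker: it is attached to each log event, so a logging backend can filter or route gRPC chatter without matching on message text. A minimal sketch of the pattern, not the project's actual logging configuration:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.Marker;
    import org.slf4j.MarkerFactory;

    class MarkerDemo {
        private static final Logger logger = LoggerFactory.getLogger(MarkerDemo.class);
        private static final Marker GRPC = MarkerFactory.getMarker("GRPC");

        void onRouteChange(String serviceKey, String route) {
            // The marker travels with the event; a backend filter
            // (e.g. logback's MarkerFilter) can key on it wholesale
            logger.info(GRPC, "Adding route {} => {}", serviceKey, route);
        }
    }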
@@ -4,6 +4,11 @@ import nu.marginalia.service.discovery.property.ServiceKey;
 
 public class ServiceNotAvailableException extends RuntimeException {
     public ServiceNotAvailableException(ServiceKey<?> key) {
-        super("Service " + key + " not available");
+        super(key.toString());
+    }
+
+    @Override
+    public StackTraceElement[] getStackTrace() { // Suppress stack trace
+        return new StackTraceElement[0];
     }
 }
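Overriding getStackTrace() suppresses the printed trace, but the trace is still captured when the exception is constructed. A common alternative, shown here as a hypothetical sketch rather than what this diff does, skips the capture entirely via the four-argument RuntimeException constructor:

    class CheapServiceException extends RuntimeException {
        CheapServiceException(String message) {
            // enableSuppression = false, writableStackTrace = false:
            // no stack trace is recorded, so construction stays cheap
            super(message, null, false, false);
        }
    }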
@@ -48,5 +48,10 @@ public record ServiceEndpoint(String host, int port) {
         public int port() {
             return endpoint.port();
         }
+
+        @Override
+        public String toString() {
+            return endpoint().host() + ":" + endpoint.port() + " [" + instance + "]";
+        }
     }
 }
@@ -48,6 +48,19 @@ public sealed interface ServiceKey<P extends ServicePartition> {
         {
             throw new UnsupportedOperationException();
         }
+
+        @Override
+        public String toString() {
+            final String shortName;
+
+            int periodIndex = name.lastIndexOf('.');
+
+            if (periodIndex >= 0) shortName = name.substring(periodIndex+1);
+            else shortName = name;
+
+            return "rest:" + shortName;
+        }
     }
     record Grpc<P extends ServicePartition>(String name, P partition) implements ServiceKey<P> {
         public String baseName() {
@@ -64,6 +77,18 @@ public sealed interface ServiceKey<P extends ServicePartition> {
         {
             return new Grpc<>(name, partition);
         }
+
+        @Override
+        public String toString() {
+            final String shortName;
+
+            int periodIndex = name.lastIndexOf('.');
+
+            if (periodIndex >= 0) shortName = name.substring(periodIndex+1);
+            else shortName = name;
+
+            return "grpc:" + shortName + "[" + partition.identifier() + "]";
+        }
     }
 
 }
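Both toString() overrides share the same shortening step. A worked example with an illustrative service name, not one taken from the codebase:

    class ServiceKeyNameDemo {
        public static void main(String[] args) {
            String name = "nu.marginalia.example.FeedApi";  // illustrative
            int periodIndex = name.lastIndexOf('.');
            String shortName = periodIndex >= 0 ? name.substring(periodIndex + 1) : name;
            // Prints "FeedApi"; the Grpc variant renders it as
            // "grpc:FeedApi[<partition id>]", the Rest variant as "rest:FeedApi"
            System.out.println(shortName);
        }
    }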
@@ -0,0 +1,178 @@
+package nu.marginalia.service.server;
+
+import io.jooby.*;
+import io.prometheus.client.Counter;
+import nu.marginalia.mq.inbox.MqInboxIf;
+import nu.marginalia.service.client.ServiceNotAvailableException;
+import nu.marginalia.service.discovery.property.ServiceEndpoint;
+import nu.marginalia.service.discovery.property.ServiceKey;
+import nu.marginalia.service.discovery.property.ServicePartition;
+import nu.marginalia.service.module.ServiceConfiguration;
+import nu.marginalia.service.server.jte.JteModule;
+import nu.marginalia.service.server.mq.ServiceMqSubscription;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.slf4j.Marker;
+import org.slf4j.MarkerFactory;
+
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.List;
+
+public class JoobyService {
+    private final Logger logger = LoggerFactory.getLogger(getClass());
+
+    // Marker for filtering out sensitive content from the persistent logs
+    private final Marker httpMarker = MarkerFactory.getMarker("HTTP");
+
+    private final Initialization initialization;
+
+    private final static Counter request_counter = Counter.build("wmsa_request_counter", "Request Counter")
+            .labelNames("service", "node")
+            .register();
+    private final static Counter request_counter_good = Counter.build("wmsa_request_counter_good", "Good Requests")
+            .labelNames("service", "node")
+            .register();
+    private final static Counter request_counter_bad = Counter.build("wmsa_request_counter_bad", "Bad Requests")
+            .labelNames("service", "node")
+            .register();
+    private final static Counter request_counter_err = Counter.build("wmsa_request_counter_err", "Error Requests")
+            .labelNames("service", "node")
+            .register();
+    private final String serviceName;
+    private static volatile boolean initialized = false;
+
+    protected final MqInboxIf messageQueueInbox;
+    private final int node;
+    private GrpcServer grpcServer;
+
+    private ServiceConfiguration config;
+    private final List<MvcExtension> joobyServices;
+    private final ServiceEndpoint restEndpoint;
+
+    public JoobyService(BaseServiceParams params,
+                        ServicePartition partition,
+                        List<DiscoverableService> grpcServices,
+                        List<MvcExtension> joobyServices
+                        ) throws Exception {
+
+        this.joobyServices = joobyServices;
+        this.initialization = params.initialization;
+        config = params.configuration;
+        node = config.node();
+
+        String inboxName = config.serviceName();
+        logger.info("Inbox name: {}", inboxName);
+
+        var serviceRegistry = params.serviceRegistry;
+
+        restEndpoint = serviceRegistry.registerService(ServiceKey.forRest(config.serviceId(), config.node()),
+                config.instanceUuid(), config.externalAddress());
+
+        var mqInboxFactory = params.messageQueueInboxFactory;
+        messageQueueInbox = mqInboxFactory.createSynchronousInbox(inboxName, config.node(), config.instanceUuid());
+        messageQueueInbox.subscribe(new ServiceMqSubscription(this));
+
+        serviceName = System.getProperty("service-name");
+
+        initialization.addCallback(params.heartbeat::start);
+        initialization.addCallback(messageQueueInbox::start);
+        initialization.addCallback(() -> params.eventLog.logEvent("SVC-INIT", serviceName + ":" + config.node()));
+        initialization.addCallback(() -> serviceRegistry.announceInstance(config.instanceUuid()));
+
+        Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
+            if (e instanceof ServiceNotAvailableException) {
+                // reduce log spam for this common case
+                logger.error("Service not available: {}", e.getMessage());
+            }
+            else {
+                logger.error("Uncaught exception", e);
+            }
+            request_counter_err.labels(serviceName, Integer.toString(node)).inc();
+        });
+
+        if (!initialization.isReady() && !initialized) {
+            initialized = true;
+            grpcServer = new GrpcServer(config, serviceRegistry, partition, grpcServices);
+            grpcServer.start();
+        }
+    }
+
+    public void startJooby(Jooby jooby) {
+
+        logger.info("{} Listening to {}:{} ({})", getClass().getSimpleName(),
+                restEndpoint.host(),
+                restEndpoint.port(),
+                config.externalAddress());
+
+        // FIXME: This won't work outside of docker, may need to submit a PR to jooby to allow classpaths here
+        jooby.install(new JteModule(Path.of("/app/resources/jte"), Path.of("/app/classes/jte-precompiled")));
+        jooby.assets("/*", Paths.get("/app/resources/static"));
+
+        var options = new ServerOptions();
+        options.setHost(config.bindAddress());
+        options.setPort(restEndpoint.port());
+
+        // Enable gzip compression of response data, but set compression to the lowest level
+        // since it doesn't really save much more space to dial it up.  It's typically a
+        // single digit percentage difference since HTML already compresses very well with level = 1.
+        options.setCompressionLevel(1);
+
+        jooby.setServerOptions(options);
+
+        jooby.get("/internal/ping", ctx -> "pong");
+        jooby.get("/internal/started", this::isInitialized);
+        jooby.get("/internal/ready", this::isReady);
+
+        for (var service : joobyServices) {
+            jooby.mvc(service);
+        }
+
+        jooby.before(this::auditRequestIn);
+        jooby.after(this::auditRequestOut);
+    }
+
+    private Object isInitialized(Context ctx) {
+        if (initialization.isReady()) {
+            return "ok";
+        }
+        else {
+            ctx.setResponseCode(StatusCode.FAILED_DEPENDENCY_CODE);
+            return "bad";
+        }
+    }
+
+    public boolean isReady() {
+        return true;
+    }
+
+    private String isReady(Context ctx) {
+        if (isReady()) {
+            return "ok";
+        }
+        else {
+            ctx.setResponseCode(StatusCode.FAILED_DEPENDENCY_CODE);
+            return "bad";
+        }
+    }
+
+    private void auditRequestIn(Context ctx) {
+        request_counter.labels(serviceName, Integer.toString(node)).inc();
+    }
+
+    private void auditRequestOut(Context ctx, Object result, Throwable failure) {
+        if (ctx.getResponseCode().value() < 400) {
+            request_counter_good.labels(serviceName, Integer.toString(node)).inc();
+        }
+        else {
+            request_counter_bad.labels(serviceName, Integer.toString(node)).inc();
+        }
+
+        if (failure != null) {
+            logger.error("Request failed " + ctx.getMethod() + " " + ctx.getRequestURL(), failure);
+            request_counter_err.labels(serviceName, Integer.toString(node)).inc();
+        }
+    }
+
+}
@@ -16,7 +16,7 @@ import spark.Spark;
 
 import java.util.List;
 
-public class Service {
+public class SparkService {
     private final Logger logger = LoggerFactory.getLogger(getClass());
 
     // Marker for filtering out sensitive content from the persistent logs
@@ -43,7 +43,7 @@ public class Service {
     private final int node;
     private GrpcServer grpcServer;
 
-    public Service(BaseServiceParams params,
+    public SparkService(BaseServiceParams params,
                    Runnable configureStaticFiles,
                    ServicePartition partition,
                    List<DiscoverableService> grpcServices) throws Exception {
@@ -126,18 +126,18 @@ public class Service {
         }
     }
 
-    public Service(BaseServiceParams params,
+    public SparkService(BaseServiceParams params,
                    ServicePartition partition,
                    List<DiscoverableService> grpcServices) throws Exception {
         this(params,
-             Service::defaultSparkConfig,
+             SparkService::defaultSparkConfig,
              partition,
             grpcServices);
     }
 
-    public Service(BaseServiceParams params) throws Exception {
+    public SparkService(BaseServiceParams params) throws Exception {
         this(params,
-             Service::defaultSparkConfig,
+             SparkService::defaultSparkConfig,
              ServicePartition.any(),
              List.of());
     }
@@ -0,0 +1,61 @@
+package nu.marginalia.service.server.jte;
+
+import edu.umd.cs.findbugs.annotations.NonNull;
+import edu.umd.cs.findbugs.annotations.Nullable;
+import gg.jte.ContentType;
+import gg.jte.TemplateEngine;
+import gg.jte.resolve.DirectoryCodeResolver;
+import io.jooby.*;
+
+import java.io.File;
+import java.nio.file.Path;
+import java.util.List;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.stream.Stream;
+
+// Temporary workaround for a bug
+// APL-2.0 https://github.com/jooby-project/jooby
+public class JteModule implements Extension {
+    private Path sourceDirectory;
+    private Path classDirectory;
+    private TemplateEngine templateEngine;
+
+    public JteModule(@NonNull Path sourceDirectory, @NonNull Path classDirectory) {
+        this.sourceDirectory = (Path)Objects.requireNonNull(sourceDirectory, "Source directory is required.");
+        this.classDirectory = (Path)Objects.requireNonNull(classDirectory, "Class directory is required.");
+    }
+
+    public JteModule(@NonNull Path sourceDirectory) {
+        this.sourceDirectory = (Path)Objects.requireNonNull(sourceDirectory, "Source directory is required.");
+    }
+
+    public JteModule(@NonNull TemplateEngine templateEngine) {
+        this.templateEngine = (TemplateEngine)Objects.requireNonNull(templateEngine, "Template engine is required.");
+    }
+
+    public void install(@NonNull Jooby application) {
+        if (this.templateEngine == null) {
+            this.templateEngine = create(application.getEnvironment(), this.sourceDirectory, this.classDirectory);
+        }
+
+        ServiceRegistry services = application.getServices();
+        services.put(TemplateEngine.class, this.templateEngine);
+        application.encoder(MediaType.html, new JteTemplateEngine(this.templateEngine));
+    }
+
+    public static TemplateEngine create(@NonNull Environment environment, @NonNull Path sourceDirectory, @Nullable Path classDirectory) {
+        boolean dev = environment.isActive("dev", new String[]{"test"});
+        if (dev) {
+            Objects.requireNonNull(sourceDirectory, "Source directory is required.");
+            Path requiredClassDirectory = (Path)Optional.ofNullable(classDirectory).orElseGet(() -> sourceDirectory.resolve("jte-classes"));
+            TemplateEngine engine = TemplateEngine.create(new DirectoryCodeResolver(sourceDirectory), requiredClassDirectory, ContentType.Html, environment.getClassLoader());
+            Optional<List<String>> var10000 = Optional.ofNullable(System.getProperty("jooby.run.classpath")).map((it) -> it.split(File.pathSeparator)).map(Stream::of).map(Stream::toList);
+            Objects.requireNonNull(engine);
+            var10000.ifPresent(engine::setClassPath);
+            return engine;
+        } else {
+            return classDirectory == null ? TemplateEngine.createPrecompiled(ContentType.Html) : TemplateEngine.createPrecompiled(classDirectory, ContentType.Html);
+        }
+    }
+}
@@ -0,0 +1,48 @@
+package nu.marginalia.service.server.jte;
+
+import edu.umd.cs.findbugs.annotations.NonNull;
+import gg.jte.TemplateEngine;
+import io.jooby.Context;
+import io.jooby.MapModelAndView;
+import io.jooby.ModelAndView;
+import io.jooby.buffer.DataBuffer;
+import io.jooby.internal.jte.DataBufferOutput;
+
+import java.nio.charset.StandardCharsets;
+import java.util.HashMap;
+import java.util.List;
+
+// Temporary workaround for a bug
+// APL-2.0 https://github.com/jooby-project/jooby
+class JteTemplateEngine implements io.jooby.TemplateEngine {
+    private final TemplateEngine jte;
+    private final List<String> extensions;
+
+    public JteTemplateEngine(TemplateEngine jte) {
+        this.jte = jte;
+        this.extensions = List.of(".jte", ".kte");
+    }
+
+
+    @NonNull @Override
+    public List<String> extensions() {
+        return extensions;
+    }
+
+    @Override
+    public DataBuffer render(Context ctx, ModelAndView modelAndView) {
+        var buffer = ctx.getBufferFactory().allocateBuffer();
+        var output = new DataBufferOutput(buffer, StandardCharsets.UTF_8);
+        var attributes = ctx.getAttributes();
+        if (modelAndView instanceof MapModelAndView mapModelAndView) {
+            var mapModel = new HashMap<String, Object>();
+            mapModel.putAll(attributes);
+            mapModel.putAll(mapModelAndView.getModel());
+            jte.render(modelAndView.getView(), mapModel, output);
+        } else {
+            jte.render(modelAndView.getView(), modelAndView.getModel(), output);
+        }
+
+        return buffer;
+    }
+}
@@ -3,7 +3,6 @@ package nu.marginalia.service.server.mq;
 import nu.marginalia.mq.MqMessage;
 import nu.marginalia.mq.inbox.MqInboxResponse;
 import nu.marginalia.mq.inbox.MqSubscription;
-import nu.marginalia.service.server.Service;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -15,10 +14,10 @@ import java.util.Map;
 public class ServiceMqSubscription implements MqSubscription {
     private static final Logger logger = LoggerFactory.getLogger(ServiceMqSubscription.class);
     private final Map<String, Method> requests = new HashMap<>();
-    private final Service service;
+    private final Object service;
 
 
-    public ServiceMqSubscription(Service service) {
+    public ServiceMqSubscription(Object service) {
         this.service = service;
 
         /* Wire up all methods annotated with @MqRequest and @MqNotification
@@ -50,12 +50,18 @@ public class LiveCrawlActor extends RecordActorPrototype {
                 yield new Monitor("-");
             }
             case Monitor(String feedsHash) -> {
+                // Sleep initially in case this is during start-up
                 for (;;) {
+                    try {
+                        Thread.sleep(Duration.ofMinutes(15));
                         String currentHash = feedsClient.getFeedDataHash();
                         if (!Objects.equals(currentHash, feedsHash)) {
                             yield new LiveCrawl(currentHash);
                         }
-                    Thread.sleep(Duration.ofMinutes(15));
+                    }
+                    catch (RuntimeException ex) {
+                        logger.error("Failed to fetch feed data hash");
+                    }
                 }
             }
             case LiveCrawl(String feedsHash, long msgId) when msgId < 0 -> {
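The reshaped loop follows the usual monitor-loop discipline: sleep before the first poll so a freshly started service does not query its dependencies before they are up, and catch RuntimeException inside the loop so one failed poll does not kill the monitor. A generic sketch of the pattern, with poll() as a placeholder rather than the project's API:

    import java.time.Duration;

    class MonitorLoopDemo {
        void run() throws InterruptedException {
            for (;;) {
                try {
                    Thread.sleep(Duration.ofMinutes(15)); // sleep first: tolerate start-up
                    poll();                               // may throw while dependencies warm up
                }
                catch (RuntimeException ex) {
                    // log and retry on the next tick instead of exiting the loop
                }
            }
        }

        void poll() { /* placeholder for feedsClient.getFeedDataHash() etc. */ }
    }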
@@ -6,4 +6,8 @@ public record BrowseResultSet(Collection<BrowseResult> results, String focusDoma
     public BrowseResultSet(Collection<BrowseResult> results) {
         this(results, "");
     }
+
+    public boolean hasFocusDomain() {
+        return focusDomain != null && !focusDomain.isBlank();
+    }
 }
@@ -38,6 +38,7 @@ public class DomainsProtobufCodec {
                 sd.getIndexed(),
                 sd.getActive(),
                 sd.getScreenshot(),
+                sd.getFeed(),
                 SimilarDomain.LinkType.valueOf(sd.getLinkType().name())
         );
     }
@@ -71,6 +71,23 @@ public class DomainInformation {
         return new String(Character.toChars(firstChar)) + new String(Character.toChars(secondChar));
     }
 
+    public String getAsnFlag() {
+        if (asnCountry == null || asnCountry.codePointCount(0, asnCountry.length()) != 2) {
+            return "";
+        }
+        String country = asnCountry;
+
+        if ("UK".equals(country)) {
+            country = "GB";
+        }
+
+        int offset = 0x1F1E6;
+        int asciiOffset = 0x41;
+        int firstChar = Character.codePointAt(country, 0) - asciiOffset + offset;
+        int secondChar = Character.codePointAt(country, 1) - asciiOffset + offset;
+        return new String(Character.toChars(firstChar)) + new String(Character.toChars(secondChar));
+    }
+
     public EdgeDomain getDomain() {
         return this.domain;
     }
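The codepoint arithmetic in getAsnFlag() maps a two-letter ISO 3166 country code onto Unicode regional indicator symbols (U+1F1E6 through U+1F1FF), which most platforms render as a flag emoji. A worked example, assuming the input "SE":

    class FlagDemo {
        public static void main(String[] args) {
            String country = "SE";   // example input
            int offset = 0x1F1E6;    // REGIONAL INDICATOR SYMBOL LETTER A
            int asciiOffset = 0x41;  // 'A'
            // 'S' (0x53) maps to 0x1F1F8, 'E' (0x45) maps to 0x1F1EA
            int firstChar = Character.codePointAt(country, 0) - asciiOffset + offset;
            int secondChar = Character.codePointAt(country, 1) - asciiOffset + offset;
            // Together the two codepoints render as the Swedish flag
            System.out.println(new String(Character.toChars(firstChar))
                    + new String(Character.toChars(secondChar)));
        }
    }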
@@ -9,6 +9,7 @@ public record SimilarDomain(EdgeUrl url,
                             boolean indexed,
                             boolean active,
                             boolean screenshot,
+                            boolean feed,
                             LinkType linkType) {
 
     public String getRankSymbols() {
@@ -52,12 +53,12 @@ public record SimilarDomain(EdgeUrl url,
             return NONE;
         }
 
-        public String toString() {
+        public String faIcon() {
             return switch (this) {
-                case FOWARD -> "→";
-                case BACKWARD -> "←";
-                case BIDIRECTIONAL -> "⇆";
-                case NONE -> "-";
+                case FOWARD -> "fa-solid fa-arrow-right";
+                case BACKWARD -> "fa-solid fa-arrow-left";
+                case BIDIRECTIONAL -> "fa-solid fa-arrow-right-arrow-left";
+                case NONE -> "";
             };
         }
@@ -101,6 +101,7 @@ message RpcSimilarDomain {
     bool active = 6;
     bool screenshot = 7;
     LINK_TYPE linkType = 8;
+    bool feed = 9;
 
     enum LINK_TYPE {
         BACKWARD = 0;
@@ -9,6 +9,7 @@ import gnu.trove.map.hash.TIntIntHashMap;
 import gnu.trove.set.TIntSet;
 import gnu.trove.set.hash.TIntHashSet;
 import it.unimi.dsi.fastutil.ints.Int2DoubleArrayMap;
+import nu.marginalia.WmsaHome;
 import nu.marginalia.api.domains.RpcSimilarDomain;
 import nu.marginalia.api.domains.model.SimilarDomain;
 import nu.marginalia.api.linkgraph.AggregateLinkGraphClient;
@@ -17,10 +18,14 @@ import org.roaringbitmap.RoaringBitmap;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import java.nio.file.Path;
+import java.sql.DriverManager;
 import java.sql.ResultSet;
 import java.sql.SQLException;
 import java.util.ArrayList;
+import java.util.HashSet;
 import java.util.List;
+import java.util.Set;
 import java.util.concurrent.Executors;
 import java.util.concurrent.ScheduledExecutorService;
 import java.util.concurrent.TimeUnit;
@@ -32,12 +37,13 @@ public class SimilarDomainsService {
     private final HikariDataSource dataSource;
     private final AggregateLinkGraphClient linkGraphClient;
 
-    private volatile TIntIntHashMap domainIdToIdx = new TIntIntHashMap(100_000);
+    private final TIntIntHashMap domainIdToIdx = new TIntIntHashMap(100_000);
     private volatile int[] domainIdxToId;
 
     public volatile Int2DoubleArrayMap[] relatedDomains;
     public volatile TIntList[] domainNeighbors = null;
     public volatile RoaringBitmap screenshotDomains = null;
+    public volatile RoaringBitmap feedDomains = null;
     public volatile RoaringBitmap activeDomains = null;
     public volatile RoaringBitmap indexedDomains = null;
     public volatile TIntDoubleHashMap domainRanks = null;
@@ -82,6 +88,7 @@ public class SimilarDomainsService {
             domainNames = new String[domainIdToIdx.size()];
             domainNeighbors = new TIntList[domainIdToIdx.size()];
             screenshotDomains = new RoaringBitmap();
+            feedDomains = new RoaringBitmap();
             activeDomains = new RoaringBitmap();
             indexedDomains = new RoaringBitmap();
             relatedDomains = new Int2DoubleArrayMap[domainIdToIdx.size()];
@@ -145,10 +152,12 @@ public class SimilarDomainsService {
                     activeDomains.add(idx);
             }
 
-            updateScreenshotInfo();
-
             logger.info("Loaded {} domains", domainRanks.size());
             isReady = true;
+
+            // We can defer these as they only populate a roaringbitmap, and will degrade gracefully when not complete
+            updateScreenshotInfo();
+            updateFeedInfo();
         }
     }
     catch (SQLException throwables) {
@@ -156,6 +165,42 @@ public class SimilarDomainsService {
         }
     }
 
+    private void updateFeedInfo() {
+        Set<String> feedsDomainNames = new HashSet<>(500_000);
+        Path readerDbPath = WmsaHome.getDataPath().resolve("rss-feeds.db").toAbsolutePath();
+        String dbUrl = "jdbc:sqlite:" + readerDbPath;
+
+        logger.info("Opening feed db at " + dbUrl);
+
+        try (var conn = DriverManager.getConnection(dbUrl);
+             var stmt = conn.createStatement()) {
+            var rs = stmt.executeQuery("""
+                select
+                    json_extract(feed, '$.domain') as domain
+                from feed
+                where json_array_length(feed, '$.items') > 0
+                """);
+            while (rs.next()) {
+                feedsDomainNames.add(rs.getString(1));
+            }
+        }
+        catch (SQLException ex) {
+            logger.error("Failed to read RSS feed items", ex);
+        }
+
+        for (int idx = 0; idx < domainNames.length; idx++) {
+            String name = domainNames[idx];
+            if (name == null) {
+                continue;
+            }
+
+            if (feedsDomainNames.contains(name)) {
+                feedDomains.add(idx);
+            }
+        }
+    }
+
     private void updateScreenshotInfo() {
         try (var connection = dataSource.getConnection()) {
             try (var stmt = connection.createStatement()) {
@@ -254,6 +299,7 @@ public class SimilarDomainsService {
                     .setIndexed(indexedDomains.contains(idx))
                     .setActive(activeDomains.contains(idx))
                     .setScreenshot(screenshotDomains.contains(idx))
+                    .setFeed(feedDomains.contains(idx))
                     .setLinkType(RpcSimilarDomain.LINK_TYPE.valueOf(linkType.name()))
                     .build());
 
@@ -369,6 +415,7 @@ public class SimilarDomainsService {
                     .setIndexed(indexedDomains.contains(idx))
                     .setActive(activeDomains.contains(idx))
                     .setScreenshot(screenshotDomains.contains(idx))
+                    .setFeed(feedDomains.contains(idx))
                     .setLinkType(RpcSimilarDomain.LINK_TYPE.valueOf(linkType.name()))
                     .build());
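Why deferring updateScreenshotInfo() and updateFeedInfo() degrades gracefully: the flags live in RoaringBitmaps, and a lookup against a bitmap that is still being filled simply answers false for bits not yet added, so early queries see fewer flags rather than errors. A minimal sketch of the two operations the service relies on:

    import org.roaringbitmap.RoaringBitmap;

    class BitmapDemo {
        public static void main(String[] args) {
            RoaringBitmap feedDomains = new RoaringBitmap();
            feedDomains.add(42);                           // domain index 42 has a feed
            System.out.println(feedDomains.contains(42));  // true
            System.out.println(feedDomains.contains(7));   // false until populated
        }
    }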
@@ -59,12 +59,6 @@ public class FeedsClient {
                 .forEachRemaining(rsp -> consumer.accept(rsp.getDomain(), new ArrayList<>(rsp.getUrlList())));
     }
 
-    public record UpdatedDomain(String domain, List<String> urls) {
-        public UpdatedDomain(RpcUpdatedLinksResponse rsp) {
-            this(rsp.getDomain(), new ArrayList<>(rsp.getUrlList()));
-        }
-    }
-
     /** Get the hash of the feed data, for identifying when the data has been updated */
     public String getFeedDataHash() {
         return channelPool.call(FeedApiGrpc.FeedApiBlockingStub::getFeedDataHash)
@@ -5,6 +5,7 @@ import com.google.inject.Singleton;
 import nu.marginalia.api.livecapture.LiveCaptureApiGrpc.LiveCaptureApiBlockingStub;
 import nu.marginalia.service.client.GrpcChannelPoolFactory;
 import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
+import nu.marginalia.service.client.ServiceNotAvailableException;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
 import org.slf4j.Logger;
@@ -29,6 +30,9 @@ public class LiveCaptureClient {
             channelPool.call(LiveCaptureApiBlockingStub::requestScreengrab)
                     .run(RpcDomainId.newBuilder().setDomainId(domainId).build());
         }
+        catch (ServiceNotAvailableException e) {
+            logger.info("requestScreengrab() failed since the service is not available");
+        }
         catch (Exception e) {
             logger.error("API Exception", e);
         }
@@ -46,6 +46,7 @@ message RpcFeed {
     string feedUrl = 3;
     string updated = 4;
     repeated RpcFeedItem items = 5;
+    int64 fetchTimestamp = 6;
 }
 
 message RpcFeedItem {
@@ -24,6 +24,7 @@ dependencies {
     implementation project(':code:libraries:message-queue')
 
     implementation project(':code:execution:api')
+    implementation project(':code:processes:crawling-process:ft-content-type')
 
     implementation libs.jsoup
     implementation libs.rssreader
@@ -8,13 +8,16 @@ import nu.marginalia.rss.model.FeedDefinition;
 import nu.marginalia.rss.model.FeedItems;
 import nu.marginalia.service.module.ServiceConfiguration;
 import org.jetbrains.annotations.NotNull;
+import org.jetbrains.annotations.Nullable;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import java.io.BufferedInputStream;
+import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.nio.file.StandardCopyOption;
+import java.nio.file.attribute.PosixFileAttributes;
 import java.security.MessageDigest;
 import java.time.Instant;
 import java.util.Base64;
@@ -125,6 +128,26 @@ public class FeedDb {
         return FeedItems.none();
     }
 
+    @Nullable
+    public String getEtag(EdgeDomain domain) {
+        if (!feedDbEnabled) {
+            throw new IllegalStateException("Feed database is disabled on this node");
+        }
+
+        // Capture the current reader to avoid concurrency issues
+        FeedDbReader reader = this.reader;
+        try {
+            if (reader != null) {
+                return reader.getEtag(domain);
+            }
+        }
+        catch (Exception e) {
+            logger.error("Error getting etag for " + domain, e);
+        }
+        return null;
+    }
+
     public Optional<String> getFeedAsJson(String domain) {
         if (!feedDbEnabled) {
             throw new IllegalStateException("Feed database is disabled on this node");
@@ -209,4 +232,36 @@ public class FeedDb {
 
         reader.getLinksUpdatedSince(since, consumer);
     }
+
+    public Instant getFetchTime() {
+        if (!Files.exists(readerDbPath)) {
+            return Instant.EPOCH;
+        }
+
+        try {
+            return Files.readAttributes(readerDbPath, PosixFileAttributes.class)
+                    .creationTime()
+                    .toInstant();
+        }
+        catch (IOException ex) {
+            logger.error("Failed to read the creatiom time of {}", readerDbPath);
+            return Instant.EPOCH;
+        }
+    }
+
+    public boolean hasData() {
+        if (!feedDbEnabled) {
+            throw new IllegalStateException("Feed database is disabled on this node");
+        }
+
+        // Capture the current reader to avoid concurrency issues
+        FeedDbReader reader = this.reader;
+
+        if (reader != null) {
+            return reader.hasData();
+        }
+
+        return false;
+    }
 }
@@ -8,6 +8,7 @@ import nu.marginalia.rss.model.FeedItems;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import javax.annotation.Nullable;
 import java.nio.file.Path;
 import java.sql.Connection;
 import java.sql.DriverManager;
@@ -32,6 +33,7 @@ public class FeedDbReader implements AutoCloseable {
         try (var stmt = connection.createStatement()) {
             stmt.executeUpdate("CREATE TABLE IF NOT EXISTS feed (domain TEXT PRIMARY KEY, feed JSON)");
             stmt.executeUpdate("CREATE TABLE IF NOT EXISTS errors (domain TEXT PRIMARY KEY, cnt INT DEFAULT 0)");
+            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS etags (domain TEXT PRIMARY KEY, etag TEXT)");
         }
     }
@@ -106,6 +108,22 @@ public class FeedDbReader implements AutoCloseable {
         return FeedItems.none();
     }
 
+    @Nullable
+    public String getEtag(EdgeDomain domain) {
+        try (var stmt = connection.prepareStatement("SELECT etag FROM etags WHERE DOMAIN = ?")) {
+            stmt.setString(1, domain.toString());
+            var rs = stmt.executeQuery();
+
+            if (rs.next()) {
+                return rs.getString(1);
+            }
+        } catch (SQLException e) {
+            logger.error("Error getting etag for " + domain, e);
+        }
+
+        return null;
+    }
+
     private FeedItems deserialize(String string) {
         return gson.fromJson(string, FeedItems.class);
     }
@@ -141,4 +159,18 @@ public class FeedDbReader implements AutoCloseable {
     }
 
 
+    public boolean hasData() {
+        try (var stmt = connection.prepareStatement("SELECT 1 FROM feed LIMIT 1")) {
+            var rs = stmt.executeQuery();
+            if (rs.next()) {
+                return rs.getBoolean(1);
+            }
+            else {
+                return false;
+            }
+        }
+        catch (SQLException ex) {
+            return false;
+        }
+    }
 }
@@ -20,6 +20,7 @@ public class FeedDbWriter implements AutoCloseable {
     private final Connection connection;
     private final PreparedStatement insertFeedStmt;
     private final PreparedStatement insertErrorStmt;
+    private final PreparedStatement insertEtagStmt;
     private final Path dbPath;

     private volatile boolean closed = false;
@@ -34,10 +35,12 @@ public class FeedDbWriter implements AutoCloseable {
         try (var stmt = connection.createStatement()) {
             stmt.executeUpdate("CREATE TABLE IF NOT EXISTS feed (domain TEXT PRIMARY KEY, feed JSON)");
             stmt.executeUpdate("CREATE TABLE IF NOT EXISTS errors (domain TEXT PRIMARY KEY, cnt INT DEFAULT 0)");
+            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS etags (domain TEXT PRIMARY KEY, etag TEXT)");
         }

         insertFeedStmt = connection.prepareStatement("INSERT INTO feed (domain, feed) VALUES (?, ?)");
         insertErrorStmt = connection.prepareStatement("INSERT INTO errors (domain, cnt) VALUES (?, ?)");
+        insertEtagStmt = connection.prepareStatement("INSERT INTO etags (domain, etag) VALUES (?, ?)");
     }

     public Path getDbPath() {
@@ -56,6 +59,20 @@ public class FeedDbWriter implements AutoCloseable {
         }
     }

+    public synchronized void saveEtag(String domain, String etag) {
+        if (etag == null || etag.isBlank())
+            return;
+
+        try {
+            insertEtagStmt.setString(1, domain.toLowerCase());
+            insertEtagStmt.setString(2, etag);
+            insertEtagStmt.executeUpdate();
+        }
+        catch (SQLException e) {
+            logger.error("Error saving etag for " + domain, e);
+        }
+    }
+
     public synchronized void setErrorCount(String domain, int count) {
         try {
             insertErrorStmt.setString(1, domain);
@@ -5,6 +5,8 @@ import com.apptasticsoftware.rssreader.RssReader;
 import com.google.inject.Inject;
 import com.opencsv.CSVReader;
 import nu.marginalia.WmsaHome;
+import nu.marginalia.contenttype.ContentType;
+import nu.marginalia.contenttype.DocumentBodyToString;
 import nu.marginalia.executor.client.ExecutorClient;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.nodecfg.NodeConfigurationService;
@@ -32,9 +34,7 @@ import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.nio.charset.StandardCharsets;
 import java.sql.SQLException;
-import java.time.Duration;
-import java.time.LocalDateTime;
-import java.time.ZonedDateTime;
+import java.time.*;
 import java.time.format.DateTimeFormatter;
 import java.util.*;
 import java.util.concurrent.Executors;
@@ -59,7 +59,6 @@ public class FeedFetcherService {
     private final DomainLocks domainLocks = new DomainLocks();

     private volatile boolean updating;
-    private boolean deterministic = false;

     @Inject
     public FeedFetcherService(FeedDb feedDb,
@@ -91,11 +90,6 @@ public class FeedFetcherService {
         REFRESH
     };

-    /** Disable random-based heuristics. This is meant for testing */
-    public void setDeterministic() {
-        this.deterministic = true;
-    }
-
     public void updateFeeds(UpdateMode updateMode) throws IOException {
         if (updating) // Prevent concurrent updates
         {
@@ -135,37 +129,37 @@ public class FeedFetcherService {
             for (var feed : definitions) {
                 executor.submitQuietly(() -> {
                     try {
-                        var oldData = feedDb.getFeed(new EdgeDomain(feed.domain()));
-
-                        // If we have existing data, we might skip updating it with a probability that increases with time,
-                        // this is to avoid hammering the feeds that are updated very rarely and save some time and resources
-                        // on our end
-
-                        /* Disable for now:
-                        if (!oldData.isEmpty()) {
-                            Duration duration = feed.durationSinceUpdated();
-                            long daysSinceUpdate = duration.toDays();
-
-                            if (deterministic || (daysSinceUpdate > 2 && ThreadLocalRandom.current()
-                                    .nextInt(1, 1 + (int) Math.min(10, daysSinceUpdate) / 2) > 1)) {
-                                // Skip updating this feed, just write the old data back instead
-                                writer.saveFeed(oldData);
-                                return;
-                            }
-                        }
-                        */
+                        EdgeDomain domain = new EdgeDomain(feed.domain());
+                        var oldData = feedDb.getFeed(domain);
+
+                        @Nullable
+                        String ifModifiedSinceDate = switch(updateMode) {
+                            case REFRESH -> getIfModifiedSinceDate(feedDb);
+                            case CLEAN -> null;
+                        };
+
+                        @Nullable
+                        String ifNoneMatchTag = switch (updateMode) {
+                            case REFRESH -> feedDb.getEtag(domain);
+                            case CLEAN -> null;
+                        };

                         FetchResult feedData;
                         try (DomainLocks.DomainLock domainLock = domainLocks.lockDomain(new EdgeDomain(feed.domain()))) {
-                            feedData = fetchFeedData(feed, client);
+                            feedData = fetchFeedData(feed, client, ifModifiedSinceDate, ifNoneMatchTag);
                         } catch (Exception ex) {
                             feedData = new FetchResult.TransientError();
                         }

                         switch (feedData) {
-                            case FetchResult.Success(String value) -> writer.saveFeed(parseFeed(value, feed));
+                            case FetchResult.Success(String value, String etag) -> {
+                                writer.saveEtag(feed.domain(), etag);
+                                writer.saveFeed(parseFeed(value, feed));
+                            }
+                            case FetchResult.NotModified() -> {
+                                writer.saveEtag(feed.domain(), ifNoneMatchTag);
+                                writer.saveFeed(oldData);
+                            }
                             case FetchResult.TransientError() -> {
                                 int errorCount = errorCounts.getOrDefault(feed.domain().toLowerCase(), 0);
                                 writer.setErrorCount(feed.domain().toLowerCase(), ++errorCount);
@@ -212,30 +206,73 @@ public class FeedFetcherService {
         }
     }

-    private FetchResult fetchFeedData(FeedDefinition feed, HttpClient client) {
+    @Nullable
+    static String getIfModifiedSinceDate(FeedDb feedDb) {
+
+        // If the db is fresh, we don't send If-Modified-Since
+        if (!feedDb.hasData())
+            return null;
+
+        Instant cutoffInstant = feedDb.getFetchTime();
+
+        // If we're unable to establish fetch time, we don't send If-Modified-Since
+        if (cutoffInstant == Instant.EPOCH)
+            return null;
+
+        return cutoffInstant.atZone(ZoneId.of("GMT")).format(DateTimeFormatter.RFC_1123_DATE_TIME);
+    }
+
+    private FetchResult fetchFeedData(FeedDefinition feed,
+                                      HttpClient client,
+                                      @Nullable String ifModifiedSinceDate,
+                                      @Nullable String ifNoneMatchTag)
+    {
         try {
             URI uri = new URI(feed.feedUrl());

-            HttpRequest getRequest = HttpRequest.newBuilder()
+            HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
                     .GET()
                     .uri(uri)
                     .header("User-Agent", WmsaHome.getUserAgent().uaIdentifier())
+                    .header("Accept-Encoding", "gzip")
                     .header("Accept", "text/*, */*;q=0.9")
                     .timeout(Duration.ofSeconds(15))
-                    .build();
+                    ;
+
+            if (ifModifiedSinceDate != null) {
+                requestBuilder.header("If-Modified-Since", ifModifiedSinceDate);
+            }
+
+            if (ifNoneMatchTag != null) {
+                requestBuilder.header("If-None-Match", ifNoneMatchTag);
+            }
+
+            HttpRequest getRequest = requestBuilder.build();
+
             for (int i = 0; i < 3; i++) {
-                var rs = client.send(getRequest, HttpResponse.BodyHandlers.ofString());
-                if (429 == rs.statusCode()) {
+                HttpResponse<byte[]> rs = client.send(getRequest, HttpResponse.BodyHandlers.ofByteArray());
+
+                if (rs.statusCode() == 429) { // Too Many Requests
                     int retryAfter = Integer.parseInt(rs.headers().firstValue("Retry-After").orElse("2"));
                     Thread.sleep(Duration.ofSeconds(Math.clamp(retryAfter, 1, 5)));
-                } else if (200 == rs.statusCode()) {
-                    return new FetchResult.Success(rs.body());
-                } else if (404 == rs.statusCode()) {
-                    return new FetchResult.PermanentError(); // never try again
-                } else {
-                    return new FetchResult.TransientError(); // we try again in a few days
+                    continue;
                 }
+
+                String newEtagValue = rs.headers().firstValue("ETag").orElse("");
+
+                return switch (rs.statusCode()) {
+                    case 200 -> {
+                        byte[] responseData = getResponseData(rs);
+
+                        String contentType = rs.headers().firstValue("Content-Type").orElse("");
+                        String bodyText = DocumentBodyToString.getStringData(ContentType.parse(contentType), responseData);
+
+                        yield new FetchResult.Success(bodyText, newEtagValue);
+                    }
+                    case 304 -> new FetchResult.NotModified(); // via If-Modified-Since semantics
+                    case 404 -> new FetchResult.PermanentError(); // never try again
+                    default -> new FetchResult.TransientError(); // we try again later
+                };
             }
         }
        catch (Exception ex) {
@@ -245,8 +282,22 @@ public class FeedFetcherService {
         return new FetchResult.TransientError();
     }

+    private byte[] getResponseData(HttpResponse<byte[]> response) throws IOException {
+        String encoding = response.headers().firstValue("Content-Encoding").orElse("");
+
+        if ("gzip".equals(encoding)) {
+            try (var stream = new GZIPInputStream(new ByteArrayInputStream(response.body()))) {
+                return stream.readAllBytes();
+            }
+        }
+        else {
+            return response.body();
+        }
+    }
+
     public sealed interface FetchResult {
-        record Success(String value) implements FetchResult {}
+        record Success(String value, String etag) implements FetchResult {}
+        record NotModified() implements FetchResult {}
         record TransientError() implements FetchResult {}
         record PermanentError() implements FetchResult {}
     }
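Aside: a minimal, self-contained sketch of the conditional-GET-plus-gzip pattern the reworked `fetchFeedData` implements. The feed URL is borrowed from the tests further down; everything else is standard `java.net.http` and `java.util.zip`:

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.zip.GZIPInputStream;

public class ConditionalGetDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        URI feed = new URI("https://www.marginalia.nu/log/index.xml");

        // First fetch: advertise gzip and remember the ETag the server returns
        HttpRequest first = HttpRequest.newBuilder()
                .GET().uri(feed)
                .header("Accept-Encoding", "gzip")
                .timeout(Duration.ofSeconds(15))
                .build();
        HttpResponse<byte[]> rs = client.send(first, HttpResponse.BodyHandlers.ofByteArray());

        byte[] body = rs.body();
        if ("gzip".equals(rs.headers().firstValue("Content-Encoding").orElse(""))) {
            // Transparently decompress, as getResponseData() does above
            try (var in = new GZIPInputStream(new ByteArrayInputStream(body))) {
                body = in.readAllBytes();
            }
        }
        String etag = rs.headers().firstValue("ETag").orElse("");

        // Second fetch: replay the ETag; a 304 means the cached copy is still valid
        HttpRequest.Builder second = HttpRequest.newBuilder().GET().uri(feed);
        if (!etag.isBlank()) {
            second.header("If-None-Match", etag);
        }
        HttpResponse<Void> rs2 = client.send(second.build(), HttpResponse.BodyHandlers.discarding());
        System.out.println(rs2.statusCode() + " (304 = not modified, 200 = fresh body)");
    }
}
```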
@@ -316,6 +367,8 @@ public class FeedFetcherService {

     public FeedItems parseFeed(String feedData, FeedDefinition definition) {
         try {
+            feedData = sanitizeEntities(feedData);
+
             List<Item> rawItems = rssReader.read(
                     // Massage the data to maximize the possibility of the flaky XML parser consuming it
                     new BOMInputStream(new ByteArrayInputStream(feedData.trim().getBytes(StandardCharsets.UTF_8)), false)
@@ -342,6 +395,33 @@ public class FeedFetcherService {
         }
     }

+    private static final Map<String, String> HTML_ENTITIES = Map.of(
+            "&raquo;", "»",
+            "&laquo;", "«",
+            "&mdash;", "--",
+            "&ndash;", "-",
+            "&rsquo;", "'",
+            "&lsquo;", "'",
+            "&quot;", "\"",
+            "&nbsp;", ""
+    );
+
+    /** The XML parser will blow up if you insert HTML entities in the feed XML,
+     * which is unfortunately relatively common. Replace them as far as is possible
+     * with their corresponding characters
+     */
+    static String sanitizeEntities(String feedData) {
+        String result = feedData;
+        for (Map.Entry<String, String> entry : HTML_ENTITIES.entrySet()) {
+            result = result.replace(entry.getKey(), entry.getValue());
+        }
+
+        // Handle lone ampersands not part of a recognized XML entity
+        result = result.replaceAll("&(?!(amp|lt|gt|apos|quot);)", "&amp;");
+
+        return result;
+    }
+
     /** Decide whether to keep URI fragments in the feed items.
      * <p></p>
      * We keep fragments if there are multiple different fragments in the items.
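Aside: the lone-ampersand handling above hinges on a negative lookahead. A tiny sketch using the exact regex from `sanitizeEntities` (the input string is invented for illustration):

```java
public class EntityEscapeDemo {
    public static void main(String[] args) {
        // Escape bare ampersands while leaving the five XML entities intact
        String input = "Bed & Breakfast &amp; Board";
        String out = input.replaceAll("&(?!(amp|lt|gt|apos|quot);)", "&amp;");
        System.out.println(out); // Bed &amp; Breakfast &amp; Board
    }
}
```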
@@ -107,8 +107,7 @@ public class FeedsGrpcService extends FeedApiGrpc.FeedApiImplBase implements Dis

     @Override
     public void getFeed(RpcDomainId request,
-                        StreamObserver<RpcFeed> responseObserver)
-    {
+                        StreamObserver<RpcFeed> responseObserver) {
         if (!feedDb.isEnabled()) {
             responseObserver.onError(new IllegalStateException("Feed database is disabled on this node"));
             return;
@@ -126,7 +125,8 @@ public class FeedsGrpcService extends FeedApiGrpc.FeedApiImplBase implements Dis
                 .setDomainId(request.getDomainId())
                 .setDomain(domainName.get().toString())
                 .setFeedUrl(feedItems.feedUrl())
-                .setUpdated(feedItems.updated());
+                .setUpdated(feedItems.updated())
+                .setFetchTimestamp(feedDb.getFetchTime().toEpochMilli());

         for (var item : feedItems.items()) {
             retB.addItemsBuilder()
@@ -96,10 +96,31 @@ class FeedFetcherServiceTest extends AbstractModule {
             feedDb.switchDb(writer);
         }

-        feedFetcherService.setDeterministic();
         feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);

-        Assertions.assertFalse(feedDb.getFeed(new EdgeDomain("www.marginalia.nu")).isEmpty());
+        var result = feedDb.getFeed(new EdgeDomain("www.marginalia.nu"));
+        System.out.println(result);
+        Assertions.assertFalse(result.isEmpty());
+    }
+
+    @Tag("flaky")
+    @Test
+    public void testFetchRepeatedly() throws Exception {
+        try (var writer = feedDb.createWriter()) {
+            writer.saveFeed(new FeedItems("www.marginalia.nu", "https://www.marginalia.nu/log/index.xml", "", List.of()));
+            feedDb.switchDb(writer);
+        }
+
+        feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);
+        Assertions.assertNotNull(feedDb.getEtag(new EdgeDomain("www.marginalia.nu")));
+        feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);
+        Assertions.assertNotNull(feedDb.getEtag(new EdgeDomain("www.marginalia.nu")));
+        feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);
+        Assertions.assertNotNull(feedDb.getEtag(new EdgeDomain("www.marginalia.nu")));
+
+        var result = feedDb.getFeed(new EdgeDomain("www.marginalia.nu"));
+        System.out.println(result);
+        Assertions.assertFalse(result.isEmpty());
     }

     @Tag("flaky")
@@ -110,7 +131,6 @@ class FeedFetcherServiceTest extends AbstractModule {
             feedDb.switchDb(writer);
         }

-        feedFetcherService.setDeterministic();
         feedFetcherService.updateFeeds(FeedFetcherService.UpdateMode.REFRESH);

         // We forget the feed on a 404 error
@@ -0,0 +1,30 @@
+package nu.marginalia.rss.svc;
+
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+public class TestXmlSanitization {
+
+    @Test
+    public void testPreservedEntities() {
+        Assertions.assertEquals("&amp;", FeedFetcherService.sanitizeEntities("&amp;"));
+        Assertions.assertEquals("&lt;", FeedFetcherService.sanitizeEntities("&lt;"));
+        Assertions.assertEquals("&gt;", FeedFetcherService.sanitizeEntities("&gt;"));
+        Assertions.assertEquals("&apos;", FeedFetcherService.sanitizeEntities("&apos;"));
+    }
+
+    @Test
+    public void testStrayAmpersand() {
+        Assertions.assertEquals("Bed &amp; Breakfast", FeedFetcherService.sanitizeEntities("Bed & Breakfast"));
+    }
+
+    @Test
+    public void testTranslatedHtmlEntity() {
+        Assertions.assertEquals("Foo -- Bar", FeedFetcherService.sanitizeEntities("Foo &mdash; Bar"));
+    }
+
+    @Test
+    public void testTranslatedHtmlEntityQuot() {
+        Assertions.assertEquals("\"Bob\"", FeedFetcherService.sanitizeEntities("&quot;Bob&quot;"));
+    }
+}
@@ -7,4 +7,8 @@ public record DictionaryResponse(String word, List<DictionaryEntry> entries) {
         this.word = word;
         this.entries = entries.stream().toList(); // Make an immutable copy
     }
+
+    public boolean hasEntries() {
+        return !entries.isEmpty();
+    }
 }
@@ -9,10 +9,9 @@ import nu.marginalia.service.client.GrpcChannelPoolFactory;
 import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;

 import javax.annotation.CheckReturnValue;
+import java.time.Duration;

 @Singleton
 public class QueryClient {
@@ -24,13 +23,14 @@ public class QueryClient {

     private final GrpcSingleNodeChannelPool<QueryApiGrpc.QueryApiBlockingStub> queryApiPool;

-    private final Logger logger = LoggerFactory.getLogger(getClass());
-
     @Inject
-    public QueryClient(GrpcChannelPoolFactory channelPoolFactory) {
+    public QueryClient(GrpcChannelPoolFactory channelPoolFactory) throws InterruptedException {
         this.queryApiPool = channelPoolFactory.createSingle(
                 ServiceKey.forGrpcApi(QueryApiGrpc.class, ServicePartition.any()),
                 QueryApiGrpc::newBlockingStub);
+
+        // Hold up initialization until we have a downstream connection
+        this.queryApiPool.awaitChannel(Duration.ofSeconds(5));
     }

     @CheckReturnValue
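Aside: `awaitChannel` here trades lazy startup for a fail-fast constructor. A minimal sketch of the same block-until-ready idea using only the JDK; `CountDownLatch` stands in for the channel pool's internal readiness signal, so this is an analogy, not the pool's actual mechanism:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AwaitReadyDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch channelReady = new CountDownLatch(1);

        // In the real pool, a connection monitor would trip this when a
        // downstream channel becomes available
        new Thread(channelReady::countDown).start();

        // Block construction until a connection exists, or give up after 5s
        if (!channelReady.await(5, TimeUnit.SECONDS)) {
            throw new IllegalStateException("No downstream connection");
        }
        System.out.println("connected");
    }
}
```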
@@ -71,6 +71,17 @@ public class QueryFactory {

         String[] parts = StringUtils.split(str, '_');

+        // Trim down tokens to match the behavior of the tokenizer used in indexing
+        for (int i = 0; i < parts.length; i++) {
+            String part = parts[i];
+
+            if (part.endsWith("'s") && part.length() > 2) {
+                part = part.substring(0, part.length()-2);
+            }
+
+            parts[i] = part;
+        }
+
         if (parts.length > 1) {
             // Require that the terms appear in sequence
             queryBuilder.phraseConstraint(SearchPhraseConstraint.mandatory(parts));
@@ -25,6 +25,7 @@ public class QueryExpansion {
             this::joinDashes,
             this::splitWordNum,
             this::joinTerms,
+            this::categoryKeywords,
             this::ngramAll
     );

@@ -98,6 +99,24 @@ public class QueryExpansion {
         }
     }

+    // Category keyword substitution, e.g. guitar wiki -> guitar generator:wiki
+    public void categoryKeywords(QWordGraph graph) {
+
+        for (var qw : graph) {
+
+            // Ensure we only perform the substitution on the last word in the query
+            if (!graph.getNextOriginal(qw).getFirst().isEnd()) {
+                continue;
+            }
+
+            switch (qw.word()) {
+                case "recipe", "recipes" -> graph.addVariant(qw, "category:food");
+                case "forum" -> graph.addVariant(qw, "generator:forum");
+                case "wiki" -> graph.addVariant(qw, "generator:wiki");
+            }
+        }
+    }
+
     // Turn 'lawn chair' into 'lawnchair'
     public void joinTerms(QWordGraph graph) {
         QWord prev = null;
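Aside: the substitution table in `categoryKeywords` is small enough to restate as data. A sketch of the last-word rule in isolation; the `Map` is lifted straight from the switch above, while the query list is an invented example:

```java
import java.util.List;
import java.util.Map;

public class CategoryKeywordDemo {
    static final Map<String, String> CATEGORY = Map.of(
            "recipe", "category:food",
            "recipes", "category:food",
            "forum", "generator:forum",
            "wiki", "generator:wiki");

    public static void main(String[] args) {
        List<String> query = List.of("pie", "recipe");

        // Only the final word of the query is eligible for substitution,
        // mirroring the isEnd() check in categoryKeywords()
        String variant = CATEGORY.get(query.getLast());
        System.out.println(variant != null ? variant : "(no category variant)");
    }
}
```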
@@ -155,16 +155,25 @@ public class QueryParser {

         // Remove trailing punctuation
         int lastChar = str.charAt(str.length() - 1);
-        if (":.,!?$'".indexOf(lastChar) >= 0)
-            entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 1), lt.displayStr()));
+        if (":.,!?$'".indexOf(lastChar) >= 0) {
+            str = str.substring(0, str.length() - 1);
+            entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+        }

         // Remove term elements that aren't indexed by the search engine
-        if (str.endsWith("'s"))
-            entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 2), lt.displayStr()));
-        if (str.endsWith("()"))
-            entity.replace(new QueryToken.LiteralTerm(str.substring(0, str.length() - 2), lt.displayStr()));
-        if (str.startsWith("$"))
-            entity.replace(new QueryToken.LiteralTerm(str.substring(1), lt.displayStr()));
+        if (str.endsWith("'s")) {
+            str = str.substring(0, str.length() - 2);
+            entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+        }
+        if (str.endsWith("()")) {
+            str = str.substring(0, str.length() - 2);
+            entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+        }
+
+        while (str.startsWith("$") || str.startsWith("_")) {
+            str = str.substring(1);
+            entity.replace(new QueryToken.LiteralTerm(str, lt.displayStr()));
+        }

         if (entity.isBlank()) {
             entity.remove();
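Aside: the switch from single `if` statements to a `while` loop is what makes `__builtin_ffs` findable (issue #140): each stripped character re-enters the test. Reduced to its core:

```java
public class PrefixStripDemo {
    public static void main(String[] args) {
        String str = "__builtin_ffs";

        // Strip leading '$' and '_' one at a time, as the reworked parser does
        while (str.startsWith("$") || str.startsWith("_")) {
            str = str.substring(1);
        }
        System.out.println(str); // builtin_ffs
    }
}
```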
@@ -1,165 +0,0 @@
-package nu.marginalia.util.language;
-
-import com.google.inject.Inject;
-import nu.marginalia.term_frequency_dict.TermFrequencyDict;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import java.io.BufferedReader;
-import java.io.InputStreamReader;
-import java.util.*;
-import java.util.regex.Pattern;
-import java.util.stream.Collectors;
-
-public class EnglishDictionary {
-    private final Set<String> englishWords = new HashSet<>();
-    private final TermFrequencyDict tfDict;
-    private final Logger logger = LoggerFactory.getLogger(getClass());
-
-    @Inject
-    public EnglishDictionary(TermFrequencyDict tfDict) {
-        this.tfDict = tfDict;
-        try (var resource = Objects.requireNonNull(ClassLoader.getSystemResourceAsStream("dictionary/en-words"),
-                "Could not load word frequency table");
-             var br = new BufferedReader(new InputStreamReader(resource))
-        ) {
-            for (;;) {
-                String s = br.readLine();
-                if (s == null) {
-                    break;
-                }
-                englishWords.add(s.toLowerCase());
-            }
-        }
-        catch (Exception ex) {
-            throw new RuntimeException(ex);
-        }
-    }
-
-    public boolean isWord(String word) {
-        return englishWords.contains(word);
-    }
-
-    private static final Pattern ingPattern = Pattern.compile(".*(\\w)\\1ing$");
-
-    public Collection<String> getWordVariants(String s) {
-        var variants = findWordVariants(s);
-
-        var ret = variants.stream()
-                .filter(var -> tfDict.getTermFreq(var) > 100)
-                .collect(Collectors.toList());
-
-        if (s.equals("recipe") || s.equals("recipes")) {
-            ret.add("category:food");
-        }
-
-        return ret;
-    }
-
-    public Collection<String> findWordVariants(String s) {
-        int sl = s.length();
-
-        if (sl < 2) {
-            return Collections.emptyList();
-        }
-        if (s.endsWith("s")) {
-            String a = s.substring(0, sl-1);
-            String b = s + "es";
-            if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-        }
-        if (s.endsWith("sm")) {
-            String a = s.substring(0, sl-1)+"t";
-            String b = s.substring(0, sl-1)+"ts";
-            if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-        }
-        if (s.endsWith("st")) {
-            String a = s.substring(0, sl-1)+"m";
-            String b = s + "s";
-            if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-        }
-        else if (ingPattern.matcher(s).matches() && sl > 4) { // humming, clapping
-            var a = s.substring(0, sl-4);
-            var b = s.substring(0, sl-3) + "ed";
-
-            if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-        }
-        else {
-            String a = s + "s";
-            String b = ingForm(s);
-            String c = s + "ed";
-
-            if (isWord(a) && isWord(b) && isWord(c)) {
-                return List.of(a, b, c);
-            }
-            else if (isWord(a) && isWord(b)) {
-                return List.of(a, b);
-            }
-            else if (isWord(b) && isWord(c)) {
-                return List.of(b, c);
-            }
-            else if (isWord(a) && isWord(c)) {
-                return List.of(a, c);
-            }
-            else if (isWord(a)) {
-                return List.of(a);
-            }
-            else if (isWord(b)) {
-                return List.of(b);
-            }
-            else if (isWord(c)) {
-                return List.of(c);
-            }
-        }
-
-        return Collections.emptyList();
-    }
-
-    public String ingForm(String s) {
-        if (s.endsWith("t") && !s.endsWith("tt")) {
-            return s + "ting";
-        }
-        if (s.endsWith("n") && !s.endsWith("nn")) {
-            return s + "ning";
-        }
-        if (s.endsWith("m") && !s.endsWith("mm")) {
-            return s + "ming";
-        }
-        if (s.endsWith("r") && !s.endsWith("rr")) {
-            return s + "ring";
-        }
-        return s + "ing";
-    }
-}
@@ -0,0 +1,32 @@
+package nu.marginalia.functions.searchquery.query_parser;
+
+import nu.marginalia.functions.searchquery.query_parser.token.QueryToken;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.Test;
+
+import java.util.List;
+
+class QueryParserTest {
+
+    @Test
+    // https://github.com/MarginaliaSearch/MarginaliaSearch/issues/140
+    void parse__builtin_ffs() {
+        QueryParser parser = new QueryParser();
+        var tokens = parser.parse("__builtin_ffs");
+        Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("builtin_ffs", "__builtin_ffs")), tokens);
+    }
+
+    @Test
+    void trailingParens() {
+        QueryParser parser = new QueryParser();
+        var tokens = parser.parse("strcpy()");
+        Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("strcpy", "strcpy()")), tokens);
+    }
+
+    @Test
+    void trailingQuote() {
+        QueryParser parser = new QueryParser();
+        var tokens = parser.parse("bob's");
+        Assertions.assertEquals(List.of(new QueryToken.LiteralTerm("bob", "bob's")), tokens);
+    }
+}
@@ -12,6 +12,7 @@ import nu.marginalia.index.query.limit.SpecificationLimit;
 import nu.marginalia.index.query.limit.SpecificationLimitType;
 import nu.marginalia.segmentation.NgramLexicon;
 import nu.marginalia.term_frequency_dict.TermFrequencyDict;
+import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.Test;

@@ -207,6 +208,28 @@ public class QueryFactoryTest {
         System.out.println(subquery);
     }

+    @Test
+    public void testQuotedApostrophe() {
+        var subquery = parseAndGetSpecs("\"bob's cars\"");
+
+        System.out.println(subquery);
+
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" bob "));
+        Assertions.assertFalse(subquery.query.compiledQuery.contains(" bob's "));
+        Assertions.assertEquals("\"bob's cars\"", subquery.humanQuery);
+    }
+
+    @Test
+    public void testExpansion9() {
+        var subquery = parseAndGetSpecs("pie recipe");
+
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" category:food "));
+
+        subquery = parseAndGetSpecs("recipe pie");
+
+        Assertions.assertFalse(subquery.query.compiledQuery.contains(" category:food "));
+    }
+
     @Test
     public void testParsing() {
         var subquery = parseAndGetSpecs("strlen()");
@@ -85,7 +85,7 @@ class BTreeWriterTest {
     public void testWriteEntrySize2() throws IOException {
         BTreeContext ctx = new BTreeContext(4, 2, BTreeBlockSize.BS_64);

-        var tempFile = Files.createTempFile(Path.of("/tmp"), "tst", "dat");
+        var tempFile = Files.createTempFile("tst", "dat");

         int[] data = generateItems32(64);
@@ -27,7 +27,7 @@ public class SentenceSegmentSplitter {
         else {
             // If we flatten unicode, we do this...
             // FIXME: This can almost definitely be cleaned up and simplified.
-            wordBreakPattern = Pattern.compile("([^/_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
+            wordBreakPattern = Pattern.compile("([^/<>$:_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");
         }
     }

@@ -90,12 +90,17 @@ public class SentenceSegmentSplitter {
         for (int i = 0; i < ret.size(); i++) {
             String part = ret.get(i);

+            if (part.startsWith("<") && part.endsWith(">") && part.length() > 2) {
+                ret.set(i, part.substring(1, part.length() - 1));
+            }
+
             if (part.startsWith("'") && part.length() > 1) {
                 ret.set(i, part.substring(1));
             }
             if (part.endsWith("'") && part.length() > 1) {
                 ret.set(i, part.substring(0, part.length()-1));
             }

             while (part.endsWith(".")) {
                 part = part.substring(0, part.length()-1);
                 ret.set(i, part);
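Aside: widening the word-break character class is what lets `std::vector` and `$_GET` survive tokenization (see the tests that follow). A sketch using the updated pattern verbatim; the sample sentence is invented:

```java
import java.util.regex.Pattern;

public class WordBreakDemo {
    public static void main(String[] args) {
        // The widened class from this hunk keeps <, >, $, : and _ inside tokens
        Pattern wordBreak = Pattern.compile(
                "([^/<>$:_#@.a-zA-Z'+\\-0-9\\u00C0-\\u00D6\\u00D8-\\u00f6\\u00f8-\\u00ff]+)|[|]|(\\.(\\s+|$))");

        for (String token : wordBreak.split("use std::vector and $_GET here")) {
            if (!token.isBlank()) System.out.println(token);
        }
        // prints: use / std::vector / and / $_GET / here
    }
}
```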
@@ -28,6 +28,20 @@ class SentenceExtractorTest {
         System.out.println(dld);
     }

+    @Test
+    void testCplusplus() {
+        var dld = sentenceExtractor.extractSentence("std::vector", EnumSet.noneOf(HtmlTag.class));
+        assertEquals(1, dld.length());
+        assertEquals("std::vector", dld.wordsLowerCase[0]);
+    }
+
+    @Test
+    void testPHP() {
+        var dld = sentenceExtractor.extractSentence("$_GET", EnumSet.noneOf(HtmlTag.class));
+        assertEquals(1, dld.length());
+        assertEquals("$_get", dld.wordsLowerCase[0]);
+    }
+
     @Test
     void testPolishArtist() {
         var dld = sentenceExtractor.extractSentence("Uklański", EnumSet.noneOf(HtmlTag.class));
@@ -25,12 +25,11 @@ public class ProcessedDocumentDetails {

     public List<EdgeUrl> linksInternal;
     public List<EdgeUrl> linksExternal;
-    public List<EdgeUrl> feedLinks;

     public DocumentMetadata metadata;
     public GeneratorType generator;

     public String toString() {
-        return "ProcessedDocumentDetails(title=" + this.title + ", description=" + this.description + ", pubYear=" + this.pubYear + ", length=" + this.length + ", quality=" + this.quality + ", hashCode=" + this.hashCode + ", features=" + this.features + ", standard=" + this.standard + ", linksInternal=" + this.linksInternal + ", linksExternal=" + this.linksExternal + ", feedLinks=" + this.feedLinks + ", metadata=" + this.metadata + ", generator=" + this.generator + ")";
+        return "ProcessedDocumentDetails(title=" + this.title + ", description=" + this.description + ", pubYear=" + this.pubYear + ", length=" + this.length + ", quality=" + this.quality + ", hashCode=" + this.hashCode + ", features=" + this.features + ", standard=" + this.standard + ", linksInternal=" + this.linksInternal + ", linksExternal=" + this.linksExternal + ", metadata=" + this.metadata + ", generator=" + this.generator + ")";
     }
 }
@@ -34,7 +34,6 @@ public class LinkProcessor {

         ret.linksExternal = new ArrayList<>();
         ret.linksInternal = new ArrayList<>();
-        ret.feedLinks = new ArrayList<>();
     }

     public Set<EdgeUrl> getSeenUrls() {
@@ -72,19 +71,6 @@ public class LinkProcessor {
         }
     }

-    /** Accepts a link as a feed link */
-    public void acceptFeed(EdgeUrl link) {
-        if (!isLinkPermitted(link)) {
-            return;
-        }
-
-        if (!seenUrls.add(link)) {
-            return;
-        }
-
-        ret.feedLinks.add(link);
-    }
-
     private boolean isLinkPermitted(EdgeUrl link) {
         if (!permittedSchemas.contains(link.proto.toLowerCase())) {
             return false;
@@ -294,11 +294,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
         for (var meta : doc.select("meta[http-equiv=refresh]")) {
             linkParser.parseMetaRedirect(baseUrl, meta).ifPresent(lp::accept);
         }
-        for (var link : doc.select("link[rel=alternate]")) {
-            feedExtractor
-                    .getFeedFromAlternateTag(baseUrl, link)
-                    .ifPresent(lp::acceptFeed);
-        }

         words.addAllSyntheticTerms(FileLinks.createFileLinkKeywords(lp, domain));
         words.addAllSyntheticTerms(FileLinks.createFileEndingKeywords(doc));
@@ -125,7 +125,6 @@ public class PlainTextDocumentProcessorPlugin extends AbstractDocumentProcessorP
         /* These are assumed to be populated */
         ret.linksInternal = new ArrayList<>();
         ret.linksExternal = new ArrayList<>();
-        ret.feedLinks = new ArrayList<>();

         return new DetailsWithWords(ret, words);
     }
@@ -166,7 +166,6 @@ public class StackexchangeSideloader implements SideloadSource {
         ret.details.length = 128;

         ret.details.standard = HtmlStandard.HTML5;
-        ret.details.feedLinks = List.of();
         ret.details.linksExternal = List.of();
         ret.details.linksInternal = List.of();
         ret.state = UrlIndexingState.OK;
@@ -178,7 +178,6 @@ public class ConverterBatchWriter implements AutoCloseable, ConverterBatchWriter
     public void writeDomainData(ProcessedDomain domain) throws IOException {
         DomainMetadata metadata = DomainMetadata.from(domain);

-        List<String> feeds = getFeedUrls(domain);

         domainWriter.write(
                 new SlopDomainRecord(
@@ -188,25 +187,11 @@ public class ConverterBatchWriter implements AutoCloseable, ConverterBatchWriter
                         metadata.visited(),
                         Optional.ofNullable(domain.state).map(DomainIndexingState::toString).orElse(""),
                         Optional.ofNullable(domain.redirect).map(EdgeDomain::toString).orElse(""),
-                        domain.ip,
-                        feeds
+                        domain.ip
                 )
         );
     }

-    private List<String> getFeedUrls(ProcessedDomain domain) {
-        var documents = domain.documents;
-        if (documents == null)
-            return List.of();
-
-        return documents.stream().map(doc -> doc.details)
-                .filter(Objects::nonNull)
-                .flatMap(dets -> dets.feedLinks.stream())
-                .distinct()
-                .map(EdgeUrl::toString)
-                .toList();
-    }
-
     public void close() throws IOException {
         domainWriter.close();
         documentWriter.close();
@@ -1,7 +1,6 @@
 package nu.marginalia.model.processed;

 import nu.marginalia.slop.SlopTable;
-import nu.marginalia.slop.column.array.ObjectArrayColumn;
 import nu.marginalia.slop.column.primitive.IntColumn;
 import nu.marginalia.slop.column.string.EnumColumn;
 import nu.marginalia.slop.column.string.TxtStringColumn;
@@ -10,7 +9,6 @@ import nu.marginalia.slop.desc.StorageType;
 import java.io.IOException;
 import java.nio.charset.StandardCharsets;
 import java.nio.file.Path;
-import java.util.List;
 import java.util.function.Consumer;

 public record SlopDomainRecord(
@@ -20,8 +18,7 @@ public record SlopDomainRecord(
         int visitedUrls,
         String state,
         String redirectDomain,
-        String ip,
-        List<String> rssFeeds)
+        String ip)
 {

     public record DomainWithIpProjection(
@@ -38,9 +35,6 @@ public record SlopDomainRecord(
     private static final IntColumn goodUrlsColumn = new IntColumn("goodUrls", StorageType.PLAIN);
     private static final IntColumn visitedUrlsColumn = new IntColumn("visitedUrls", StorageType.PLAIN);

-    private static final ObjectArrayColumn<String> rssFeedsColumn = new TxtStringColumn("rssFeeds", StandardCharsets.UTF_8, StorageType.GZIP).asArray();
-
-
     public static class DomainNameReader extends SlopTable {
         private final TxtStringColumn.Reader domainsReader;

@@ -101,8 +95,6 @@ public record SlopDomainRecord(
         private final IntColumn.Reader goodUrlsReader;
         private final IntColumn.Reader visitedUrlsReader;

-        private final ObjectArrayColumn<String>.Reader rssFeedsReader;
-
         public Reader(SlopTable.Ref<SlopDomainRecord> ref) throws IOException {
             super(ref);

@@ -114,8 +106,6 @@ public record SlopDomainRecord(
             knownUrlsReader = knownUrlsColumn.open(this);
             goodUrlsReader = goodUrlsColumn.open(this);
             visitedUrlsReader = visitedUrlsColumn.open(this);
-
-            rssFeedsReader = rssFeedsColumn.open(this);
         }

         public Reader(Path baseDir, int page) throws IOException {
@@ -140,8 +130,7 @@ public record SlopDomainRecord(
                     visitedUrlsReader.get(),
                     statesReader.get(),
                     redirectReader.get(),
-                    ipReader.get(),
-                    rssFeedsReader.get()
+                    ipReader.get()
             );
         }
     }
@@ -156,8 +145,6 @@ public record SlopDomainRecord(
         private final IntColumn.Writer goodUrlsWriter;
         private final IntColumn.Writer visitedUrlsWriter;

-        private final ObjectArrayColumn<String>.Writer rssFeedsWriter;
-
         public Writer(Path baseDir, int page) throws IOException {
             super(baseDir, page);

@@ -169,8 +156,6 @@ public record SlopDomainRecord(
             knownUrlsWriter = knownUrlsColumn.create(this);
             goodUrlsWriter = goodUrlsColumn.create(this);
             visitedUrlsWriter = visitedUrlsColumn.create(this);
-
-            rssFeedsWriter = rssFeedsColumn.create(this);
         }

         public void write(SlopDomainRecord record) throws IOException {
@@ -182,8 +167,6 @@ public record SlopDomainRecord(
             knownUrlsWriter.put(record.knownUrls());
             goodUrlsWriter.put(record.goodUrls());
             visitedUrlsWriter.put(record.visitedUrls());
-
-            rssFeedsWriter.put(record.rssFeeds());
         }
     }
 }
@@ -9,7 +9,6 @@ import org.junit.jupiter.api.Test;
 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
-import java.util.List;

 import static org.junit.jupiter.api.Assertions.assertFalse;
 import static org.junit.jupiter.api.Assertions.assertTrue;
@@ -35,8 +34,7 @@ public class SlopDomainRecordTest {
                 1, 2, 3,
                 "state",
                 "redirectDomain",
-                "192.168.0.1",
-                List.of("rss1", "rss2")
+                "192.168.0.1"
         );

         try (var writer = new SlopDomainRecord.Writer(testDir, 0)) {
@@ -7,6 +7,7 @@ import nu.marginalia.WmsaHome;
 import nu.marginalia.converting.model.ProcessedDomain;
 import nu.marginalia.converting.processor.DomainProcessor;
 import nu.marginalia.crawl.CrawlerMain;
+import nu.marginalia.crawl.DomainStateDb;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
@@ -46,6 +47,7 @@ public class CrawlingThenConvertingIntegrationTest {

     private Path fileName;
     private Path fileName2;
+    private Path dbTempFile;

     @BeforeAll
     public static void setUpAll() {
@@ -63,16 +65,18 @@ public class CrawlingThenConvertingIntegrationTest {
         httpFetcher = new HttpFetcherImpl(WmsaHome.getUserAgent().uaString());
         this.fileName = Files.createTempFile("crawling-then-converting", ".warc.gz");
         this.fileName2 = Files.createTempFile("crawling-then-converting", ".warc.gz");
+        this.dbTempFile = Files.createTempFile("domains", "db");
     }

     @AfterEach
     public void tearDown() throws IOException {
         Files.deleteIfExists(fileName);
         Files.deleteIfExists(fileName2);
+        Files.deleteIfExists(dbTempFile);
     }

     @Test
-    public void testInvalidDomain() throws IOException {
+    public void testInvalidDomain() throws Exception {
         // Attempt to fetch an invalid domain
         var specs = new CrawlerMain.CrawlSpecRecord("invalid.invalid.invalid", 10);

@@ -88,7 +92,7 @@ public class CrawlingThenConvertingIntegrationTest {
     }

     @Test
-    public void testRedirectingDomain() throws IOException {
+    public void testRedirectingDomain() throws Exception {
         // Attempt to fetch an invalid domain
         var specs = new CrawlerMain.CrawlSpecRecord("memex.marginalia.nu", 10);

@@ -107,7 +111,7 @@ public class CrawlingThenConvertingIntegrationTest {
     }

     @Test
-    public void testBlockedDomain() throws IOException {
+    public void testBlockedDomain() throws Exception {
         // Attempt to fetch an invalid domain
         var specs = new CrawlerMain.CrawlSpecRecord("search.marginalia.nu", 10);

@@ -124,7 +128,7 @@ public class CrawlingThenConvertingIntegrationTest {
     }

     @Test
-    public void crawlSunnyDay() throws IOException {
+    public void crawlSunnyDay() throws Exception {
         var specs = new CrawlerMain.CrawlSpecRecord("www.marginalia.nu", 10);

         CrawledDomain domain = crawl(specs);
@@ -157,7 +161,7 @@ public class CrawlingThenConvertingIntegrationTest {


     @Test
-    public void crawlContentTypes() throws IOException {
+    public void crawlContentTypes() throws Exception {
         var specs = new CrawlerMain.CrawlSpecRecord("www.marginalia.nu", 10,
                 List.of(
                         "https://www.marginalia.nu/sanic.png",
@@ -195,7 +199,7 @@ public class CrawlingThenConvertingIntegrationTest {


     @Test
-    public void crawlRobotsTxt() throws IOException {
+    public void crawlRobotsTxt() throws Exception {
         var specs = new CrawlerMain.CrawlSpecRecord("search.marginalia.nu", 5,
                 List.of("https://search.marginalia.nu/search?q=hello+world")
         );
@@ -235,15 +239,17 @@ public class CrawlingThenConvertingIntegrationTest {
         return null; // unreachable
     }

-    private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs) throws IOException {
+    private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs) throws Exception {
         return crawl(specs, domain -> true);
     }

-    private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs, Predicate<EdgeDomain> domainBlacklist) throws IOException {
+    private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs, Predicate<EdgeDomain> domainBlacklist) throws Exception {
         List<SerializableCrawlData> data = new ArrayList<>();

-        try (var recorder = new WarcRecorder(fileName)) {
-            new CrawlerRetreiver(httpFetcher, new DomainProber(domainBlacklist), specs, recorder).crawlDomain();
+        try (var recorder = new WarcRecorder(fileName);
+             var db = new DomainStateDb(dbTempFile))
+        {
+            new CrawlerRetreiver(httpFetcher, new DomainProber(domainBlacklist), specs, db, recorder).crawlDomain();
         }

         CrawledDocumentParquetRecordFileWriter.convertWarc(specs.domain(),
@@ -46,6 +46,8 @@ dependencies {
|
|||||||
|
|
||||||
implementation libs.notnull
|
implementation libs.notnull
|
||||||
implementation libs.guava
|
implementation libs.guava
|
||||||
|
implementation libs.sqlite
|
||||||
|
|
||||||
implementation dependencies.create(libs.guice.get()) {
|
implementation dependencies.create(libs.guice.get()) {
|
||||||
exclude group: 'com.google.guava'
|
exclude group: 'com.google.guava'
|
||||||
}
|
}
|
||||||
|
@@ -241,6 +241,7 @@ public class CrawlerMain extends ProcessMainClass {
|
|||||||
|
|
||||||
// Set up the work log and the warc archiver so we can keep track of what we've done
|
// Set up the work log and the warc archiver so we can keep track of what we've done
|
||||||
try (WorkLog workLog = new WorkLog(outputDir.resolve("crawler.log"));
|
try (WorkLog workLog = new WorkLog(outputDir.resolve("crawler.log"));
|
||||||
|
DomainStateDb domainStateDb = new DomainStateDb(outputDir.resolve("domainstate.db"));
|
||||||
WarcArchiverIf warcArchiver = warcArchiverFactory.get(outputDir);
|
WarcArchiverIf warcArchiver = warcArchiverFactory.get(outputDir);
|
||||||
AnchorTagsSource anchorTagsSource = anchorTagsSourceFactory.create(domainsToCrawl)
|
AnchorTagsSource anchorTagsSource = anchorTagsSourceFactory.create(domainsToCrawl)
|
||||||
) {
|
) {
|
||||||
@@ -258,6 +259,7 @@ public class CrawlerMain extends ProcessMainClass {
|
|||||||
anchorTagsSource,
|
anchorTagsSource,
|
||||||
outputDir,
|
outputDir,
|
||||||
warcArchiver,
|
warcArchiver,
|
||||||
|
domainStateDb,
|
||||||
workLog);
|
workLog);
|
||||||
|
|
||||||
if (pendingCrawlTasks.putIfAbsent(crawlSpec.domain(), task) == null) {
|
if (pendingCrawlTasks.putIfAbsent(crawlSpec.domain(), task) == null) {
|
||||||
@@ -299,11 +301,12 @@ public class CrawlerMain extends ProcessMainClass {
|
|||||||
heartbeat.start();
|
heartbeat.start();
|
||||||
|
|
||||||
try (WorkLog workLog = new WorkLog(outputDir.resolve("crawler-" + targetDomainName.replace('/', '-') + ".log"));
|
try (WorkLog workLog = new WorkLog(outputDir.resolve("crawler-" + targetDomainName.replace('/', '-') + ".log"));
|
||||||
|
DomainStateDb domainStateDb = new DomainStateDb(outputDir.resolve("domainstate.db"));
|
||||||
WarcArchiverIf warcArchiver = warcArchiverFactory.get(outputDir);
|
WarcArchiverIf warcArchiver = warcArchiverFactory.get(outputDir);
|
||||||
AnchorTagsSource anchorTagsSource = anchorTagsSourceFactory.create(List.of(new EdgeDomain(targetDomainName)))
|
AnchorTagsSource anchorTagsSource = anchorTagsSourceFactory.create(List.of(new EdgeDomain(targetDomainName)))
|
||||||
) {
|
) {
|
||||||
var spec = new CrawlSpecRecord(targetDomainName, 1000, List.of());
|
var spec = new CrawlSpecRecord(targetDomainName, 1000, List.of());
|
||||||
var task = new CrawlTask(spec, anchorTagsSource, outputDir, warcArchiver, workLog);
|
var task = new CrawlTask(spec, anchorTagsSource, outputDir, warcArchiver, domainStateDb, workLog);
|
||||||
task.run();
|
task.run();
|
||||||
}
|
}
|
||||||
catch (Exception ex) {
|
catch (Exception ex) {
|
||||||
@@ -324,18 +327,21 @@ public class CrawlerMain extends ProcessMainClass {
|
|||||||
private final AnchorTagsSource anchorTagsSource;
|
private final AnchorTagsSource anchorTagsSource;
|
||||||
private final Path outputDir;
|
private final Path outputDir;
|
||||||
private final WarcArchiverIf warcArchiver;
|
private final WarcArchiverIf warcArchiver;
|
||||||
|
private final DomainStateDb domainStateDb;
|
||||||
private final WorkLog workLog;
|
private final WorkLog workLog;
|
||||||
|
|
||||||
CrawlTask(CrawlSpecRecord specification,
|
CrawlTask(CrawlSpecRecord specification,
|
||||||
AnchorTagsSource anchorTagsSource,
|
AnchorTagsSource anchorTagsSource,
|
||||||
Path outputDir,
|
Path outputDir,
|
||||||
WarcArchiverIf warcArchiver,
|
WarcArchiverIf warcArchiver,
|
||||||
|
DomainStateDb domainStateDb,
|
||||||
WorkLog workLog)
|
WorkLog workLog)
|
||||||
{
|
{
|
||||||
this.specification = specification;
|
this.specification = specification;
|
||||||
this.anchorTagsSource = anchorTagsSource;
|
this.anchorTagsSource = anchorTagsSource;
|
||||||
this.outputDir = outputDir;
|
this.outputDir = outputDir;
|
||||||
this.warcArchiver = warcArchiver;
|
this.warcArchiver = warcArchiver;
|
||||||
|
this.domainStateDb = domainStateDb;
|
||||||
this.workLog = workLog;
|
this.workLog = workLog;
|
||||||
|
|
||||||
this.domain = specification.domain();
|
this.domain = specification.domain();
|
||||||
@@ -359,7 +365,7 @@ public class CrawlerMain extends ProcessMainClass {
|
|||||||
}
|
}
|
||||||
|
|
||||||
try (var warcRecorder = new WarcRecorder(newWarcFile); // write to a temp file for now
|
try (var warcRecorder = new WarcRecorder(newWarcFile); // write to a temp file for now
|
||||||
var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, warcRecorder);
|
var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, domainStateDb, warcRecorder);
|
||||||
CrawlDataReference reference = getReference();
|
CrawlDataReference reference = getReference();
|
||||||
)
|
)
|
||||||
{
|
{
|
||||||
|
@@ -0,0 +1,127 @@
|
|||||||
|
package nu.marginalia.crawl;
|
||||||
|
|
||||||
|
import org.slf4j.Logger;
|
||||||
|
import org.slf4j.LoggerFactory;
|
||||||
|
|
||||||
|
import javax.annotation.Nullable;
|
||||||
|
import java.nio.file.Path;
|
||||||
|
import java.sql.Connection;
|
||||||
|
import java.sql.DriverManager;
|
||||||
|
import java.sql.SQLException;
|
||||||
|
import java.time.Instant;
|
||||||
|
import java.util.Optional;
|
||||||
|
|
||||||
|
/** Supplemental sqlite database for storing the summary of a crawl.
|
||||||
|
* One database exists per crawl data set.
|
||||||
|
* */
|
||||||
|
public class DomainStateDb implements AutoCloseable {
|
||||||
|
|
||||||
|
private static final Logger logger = LoggerFactory.getLogger(DomainStateDb.class);
|
||||||
|
|
||||||
|
private final Connection connection;
|
||||||
|
|
||||||
|
public record SummaryRecord(
|
||||||
|
String domainName,
|
||||||
|
Instant lastUpdated,
|
||||||
|
String state,
|
||||||
|
@Nullable String stateDesc,
|
||||||
|
@Nullable String feedUrl
|
||||||
|
)
|
||||||
|
{
|
||||||
|
public static SummaryRecord forSuccess(String domainName) {
|
||||||
|
return new SummaryRecord(domainName, Instant.now(), "OK", null, null);
|
||||||
|
}
|
||||||
|
|
||||||
|
public static SummaryRecord forSuccess(String domainName, String feedUrl) {
|
||||||
|
return new SummaryRecord(domainName, Instant.now(), "OK", null, feedUrl);
|
||||||
|
}
|
||||||
|
|
||||||
|
public static SummaryRecord forError(String domainName, String state, String stateDesc) {
|
||||||
|
return new SummaryRecord(domainName, Instant.now(), state, stateDesc, null);
|
||||||
|
}
|
||||||
|
|
||||||
|
public boolean equals(Object other) {
|
||||||
|
if (other == this) {
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
if (!(other instanceof SummaryRecord(String name, Instant updated, String state1, String desc, String url))) {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
return domainName.equals(name) &&
|
||||||
|
lastUpdated.toEpochMilli() == updated.toEpochMilli() &&
|
||||||
|
state.equals(state1) &&
|
||||||
|
(stateDesc == null ? desc == null : stateDesc.equals(desc)) &&
|
||||||
|
(feedUrl == null ? url == null : feedUrl.equals(url));
|
||||||
|
}
|
||||||
|
|
||||||
|
public int hashCode() {
|
||||||
|
return domainName.hashCode() + Long.hashCode(lastUpdated.toEpochMilli());
|
||||||
|
}
|
||||||
|
|
||||||
|
}
|
||||||
|
|
||||||
|
public DomainStateDb(Path filename) throws SQLException {
|
||||||
|
String sqliteDbString = "jdbc:sqlite:" + filename.toString();
|
||||||
|
connection = DriverManager.getConnection(sqliteDbString);
|
||||||
|
|
||||||
|
try (var stmt = connection.createStatement()) {
|
||||||
|
stmt.executeUpdate("""
|
||||||
|
CREATE TABLE IF NOT EXISTS summary (
|
||||||
|
domain TEXT PRIMARY KEY,
|
||||||
|
lastUpdatedEpochMs LONG NOT NULL,
|
||||||
|
state TEXT NOT NULL,
|
||||||
|
stateDesc TEXT,
|
||||||
|
feedUrl TEXT
|
||||||
|
)
|
||||||
|
""");
|
||||||
|
|
||||||
|
stmt.execute("PRAGMA journal_mode=WAL");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public void close() throws SQLException {
|
||||||
|
connection.close();
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
public void save(SummaryRecord record) {
|
||||||
|
try (var stmt = connection.prepareStatement("""
|
||||||
|
INSERT OR REPLACE INTO summary (domain, lastUpdatedEpochMs, state, stateDesc, feedUrl)
|
||||||
|
VALUES (?, ?, ?, ?, ?)
|
||||||
|
""")) {
|
||||||
|
stmt.setString(1, record.domainName());
|
||||||
|
stmt.setLong(2, record.lastUpdated().toEpochMilli());
|
||||||
|
stmt.setString(3, record.state());
|
||||||
|
stmt.setString(4, record.stateDesc());
|
||||||
|
stmt.setString(5, record.feedUrl());
|
||||||
|
stmt.executeUpdate();
|
||||||
|
} catch (SQLException e) {
|
||||||
|
logger.error("Failed to insert summary record", e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
public Optional<SummaryRecord> get(String domainName) {
|
||||||
|
try (var stmt = connection.prepareStatement("""
|
||||||
|
SELECT domain, lastUpdatedEpochMs, state, stateDesc, feedUrl
|
||||||
|
FROM summary
|
||||||
|
WHERE domain = ?
|
||||||
|
""")) {
|
||||||
|
stmt.setString(1, domainName);
|
||||||
|
var rs = stmt.executeQuery();
|
||||||
|
if (rs.next()) {
|
||||||
|
return Optional.of(new SummaryRecord(
|
||||||
|
rs.getString("domain"),
|
||||||
|
Instant.ofEpochMilli(rs.getLong("lastUpdatedEpochMs")),
|
||||||
|
rs.getString("state"),
|
||||||
|
rs.getString("stateDesc"),
|
||||||
|
rs.getString("feedUrl")
|
||||||
|
));
|
||||||
|
}
|
||||||
|
} catch (SQLException e) {
|
||||||
|
logger.error("Failed to get summary record", e);
|
||||||
|
}
|
||||||
|
|
||||||
|
return Optional.empty();
|
||||||
|
}
|
||||||
|
}
|
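A minimal usage sketch of the new DomainStateDb (illustrative only; the path and domain values here are assumptions, not part of the change):

    try (var db = new DomainStateDb(Path.of("domainstate.db"))) {
        // One row per domain: saving again for the same domain overwrites the old summary,
        // since the table uses INSERT OR REPLACE on the domain primary key.
        db.save(DomainStateDb.SummaryRecord.forSuccess("www.example.com", "https://www.example.com/feed.xml"));

        db.get("www.example.com")
          .map(DomainStateDb.SummaryRecord::feedUrl)
          .ifPresent(System.out::println); // prints the stored feed URL
    }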
@@ -20,34 +20,11 @@ public record ContentTags(String etag, String lastMod) {
     public void paint(Request.Builder getBuilder) {

         if (etag != null) {
-            getBuilder.addHeader("If-None-Match", ifNoneMatch());
+            getBuilder.addHeader("If-None-Match", etag);
         }

         if (lastMod != null) {
-            getBuilder.addHeader("If-Modified-Since", ifModifiedSince());
+            getBuilder.addHeader("If-Modified-Since", lastMod);
         }
     }

-    private String ifNoneMatch() {
-        // Remove the W/ prefix if it exists
-
-        //'W/' (case-sensitive) indicates that a weak validator is used. Weak etags are
-        // easy to generate, but are far less useful for comparisons. Strong validators
-        // are ideal for comparisons but can be very difficult to generate efficiently.
-        // Weak ETag values of two representations of the same resources might be semantically
-        // equivalent, but not byte-for-byte identical. This means weak etags prevent caching
-        // when byte range requests are used, but strong etags mean range requests can
-        // still be cached.
-        // - https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag
-
-        if (null != etag && etag.startsWith("W/")) {
-            return etag.substring(2);
-        } else {
-            return etag;
-        }
-    }
-
-    private String ifModifiedSince() {
-        return lastMod;
-    }
 }
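With the helper methods gone, paint() now echoes the stored validators back verbatim; a weak etag keeps its W/ prefix, which is fine since If-None-Match comparison is defined to be weak anyway. Illustrative request headers for a page that previously returned a weak etag and a Last-Modified date (values hypothetical):

    If-None-Match: W/"abc123"
    If-Modified-Since: Sat, 04 Jan 2025 10:00:00 GMT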
@@ -139,7 +139,7 @@ public class HttpFetcherImpl implements HttpFetcher {
     public ContentTypeProbeResult probeContentType(EdgeUrl url,
                                                    WarcRecorder warcRecorder,
                                                    ContentTags tags) throws RateLimitException {
-        if (tags.isEmpty()) {
+        if (tags.isEmpty() && contentTypeLogic.isUrlLikeBinary(url)) {
             var headBuilder = new Request.Builder().head()
                     .addHeader("User-agent", userAgentString)
                     .addHeader("Accept-Encoding", "gzip")
@@ -34,8 +34,9 @@ import java.util.*;
 public class WarcRecorder implements AutoCloseable {
     /** Maximum time we'll wait on a single request */
     static final int MAX_TIME = 30_000;
-    /** Maximum (decompressed) size we'll fetch */
-    static final int MAX_SIZE = 1024 * 1024 * 10;
+    /** Maximum (decompressed) size we'll save */
+    static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);

     private final WarcWriter writer;
     private final Path warcFile;
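Because Integer.getInteger() reads a JVM system property, the size cap is now tunable at launch time without a rebuild, with the old 10 MB value as the default. For example (value illustrative):

    java -Dcrawler.maxFetchSize=20971520 ...   # raise the cap to 20 MiB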
@@ -4,6 +4,7 @@ import crawlercommons.robots.SimpleRobotRules;
 import nu.marginalia.atags.model.DomainLinks;
 import nu.marginalia.contenttype.ContentType;
 import nu.marginalia.crawl.CrawlerMain;
+import nu.marginalia.crawl.DomainStateDb;
 import nu.marginalia.crawl.fetcher.ContentTags;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
@@ -16,7 +17,9 @@ import nu.marginalia.ip_blocklist.UrlBlocklist;
 import nu.marginalia.link_parser.LinkParser;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.model.EdgeUrl;
+import nu.marginalia.model.body.DocumentBodyExtractor;
 import nu.marginalia.model.body.HttpFetchResult;
+import nu.marginalia.model.crawldata.CrawlerDomainStatus;
 import org.jsoup.Jsoup;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -46,6 +49,7 @@ public class CrawlerRetreiver implements AutoCloseable {

     private final DomainProber domainProber;
     private final DomainCrawlFrontier crawlFrontier;
+    private final DomainStateDb domainStateDb;
     private final WarcRecorder warcRecorder;
     private final CrawlerRevisitor crawlerRevisitor;

@@ -55,8 +59,10 @@ public class CrawlerRetreiver implements AutoCloseable {
     public CrawlerRetreiver(HttpFetcher fetcher,
                             DomainProber domainProber,
                             CrawlerMain.CrawlSpecRecord specs,
+                            DomainStateDb domainStateDb,
                             WarcRecorder warcRecorder)
     {
+        this.domainStateDb = domainStateDb;
         this.warcRecorder = warcRecorder;
         this.fetcher = fetcher;
         this.domainProber = domainProber;
@@ -90,8 +96,21 @@ public class CrawlerRetreiver implements AutoCloseable {
         try {
             // Do an initial domain probe to determine the root URL
             EdgeUrl rootUrl;
-            if (probeRootUrl() instanceof HttpFetcher.DomainProbeResult.Ok ok) rootUrl = ok.probedUrl();
-            else return 1;
+
+            var probeResult = probeRootUrl();
+            switch (probeResult) {
+                case HttpFetcher.DomainProbeResult.Ok(EdgeUrl probedUrl) -> {
+                    rootUrl = probedUrl; // Good track
+                }
+                case HttpFetcher.DomainProbeResult.Redirect(EdgeDomain domain1) -> {
+                    domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, "Redirect", domain1.toString()));
+                    return 1;
+                }
+                case HttpFetcher.DomainProbeResult.Error(CrawlerDomainStatus status, String desc) -> {
+                    domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, status.toString(), desc));
+                    return 1;
+                }
+            }

             // Sleep after the initial probe, we don't have access to the robots.txt yet
             // so we don't know the crawl delay
@@ -114,7 +133,8 @@ public class CrawlerRetreiver implements AutoCloseable {

             delayTimer.waitFetchDelay(0); // initial delay after robots.txt

-            sniffRootDocument(rootUrl, delayTimer);
+            DomainStateDb.SummaryRecord summaryRecord = sniffRootDocument(rootUrl, delayTimer);
+            domainStateDb.save(summaryRecord);

             // Play back the old crawl data (if present) and fetch the documents comparing etags and last-modified
             if (crawlerRevisitor.recrawl(oldCrawlData, robotsRules, delayTimer) > 0) {
@@ -196,7 +216,9 @@ public class CrawlerRetreiver implements AutoCloseable {
         return domainProbeResult;
     }

-    private void sniffRootDocument(EdgeUrl rootUrl, CrawlDelayTimer timer) {
+    private DomainStateDb.SummaryRecord sniffRootDocument(EdgeUrl rootUrl, CrawlDelayTimer timer) {
+        Optional<String> feedLink = Optional.empty();
+
         try {
             var url = rootUrl.withPathAndParam("/", null);

@@ -204,11 +226,11 @@ public class CrawlerRetreiver implements AutoCloseable {
             timer.waitFetchDelay(0);

             if (!(result instanceof HttpFetchResult.ResultOk ok))
-                return;
+                return DomainStateDb.SummaryRecord.forSuccess(domain);

             var optDoc = ok.parseDocument();
             if (optDoc.isEmpty())
-                return;
+                return DomainStateDb.SummaryRecord.forSuccess(domain);

             // Sniff the software based on the sample document
             var doc = optDoc.get();
@@ -216,7 +238,6 @@ public class CrawlerRetreiver implements AutoCloseable {
             crawlFrontier.enqueueLinksFromDocument(url, doc);

             EdgeUrl faviconUrl = url.withPathAndParam("/favicon.ico", null);
-            Optional<EdgeUrl> sitemapUrl = Optional.empty();

             for (var link : doc.getElementsByTag("link")) {
                 String rel = link.attr("rel");
@@ -232,23 +253,33 @@ public class CrawlerRetreiver implements AutoCloseable {

                 // Grab the RSS/Atom as a sitemap if it exists
                 if (rel.equalsIgnoreCase("alternate")
-                        && (type.equalsIgnoreCase("application/atom+xml") || type.equalsIgnoreCase("application/atomsvc+xml"))) {
+                        && (type.equalsIgnoreCase("application/atom+xml")
+                        || type.equalsIgnoreCase("application/atomsvc+xml")
+                        || type.equalsIgnoreCase("application/rss+xml")
+                )) {
                     String href = link.attr("href");

-                    sitemapUrl = linkParser.parseLink(url, href)
-                            .filter(crawlFrontier::isSameDomain);
+                    feedLink = linkParser.parseLink(url, href)
+                            .filter(crawlFrontier::isSameDomain)
+                            .map(EdgeUrl::toString);
                 }
             }

-            // Download the sitemap if available exists
-            if (sitemapUrl.isPresent()) {
-                sitemapFetcher.downloadSitemaps(List.of(sitemapUrl.get()));
+            if (feedLink.isEmpty()) {
+                feedLink = guessFeedUrl(timer);
+            }
+
+            // Download the sitemap if available
+            if (feedLink.isPresent()) {
+                sitemapFetcher.downloadSitemaps(List.of(feedLink.get()));
                 timer.waitFetchDelay(0);
             }

             // Grab the favicon if it exists
             fetchWithRetry(faviconUrl, timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
             timer.waitFetchDelay(0);

         }
         catch (Exception ex) {
             logger.error("Error configuring link filter", ex);
@@ -256,6 +287,74 @@ public class CrawlerRetreiver implements AutoCloseable {
         finally {
             crawlFrontier.addVisited(rootUrl);
         }
+
+        if (feedLink.isPresent()) {
+            return DomainStateDb.SummaryRecord.forSuccess(domain, feedLink.get());
+        }
+        else {
+            return DomainStateDb.SummaryRecord.forSuccess(domain);
+        }
+    }
+
+    private final List<String> likelyFeedEndpoints = List.of(
+            "rss.xml",
+            "atom.xml",
+            "feed.xml",
+            "index.xml",
+            "feed",
+            "rss",
+            "atom",
+            "feeds",
+            "blog/feed",
+            "blog/rss"
+    );
+
+    private Optional<String> guessFeedUrl(CrawlDelayTimer timer) throws InterruptedException {
+        var oldDomainStateRecord = domainStateDb.get(domain);
+
+        // If we are already aware of an old feed URL, then we can just revalidate it
+        if (oldDomainStateRecord.isPresent()) {
+            var oldRecord = oldDomainStateRecord.get();
+            if (oldRecord.feedUrl() != null && validateFeedUrl(oldRecord.feedUrl(), timer)) {
+                return Optional.of(oldRecord.feedUrl());
+            }
+        }
+
+        for (String endpoint : likelyFeedEndpoints) {
+            String url = "https://" + domain + "/" + endpoint;
+            if (validateFeedUrl(url, timer)) {
+                return Optional.of(url);
+            }
+        }
+
+        return Optional.empty();
+    }
+
+    private boolean validateFeedUrl(String url, CrawlDelayTimer timer) throws InterruptedException {
+        var parsedOpt = EdgeUrl.parse(url);
+        if (parsedOpt.isEmpty())
+            return false;
+
+        HttpFetchResult result = fetchWithRetry(parsedOpt.get(), timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
+        timer.waitFetchDelay(0);
+
+        if (!(result instanceof HttpFetchResult.ResultOk ok)) {
+            return false;
+        }
+
+        // Extract the beginning of the
+        Optional<String> bodyOpt = DocumentBodyExtractor.asString(ok).getBody();
+        if (bodyOpt.isEmpty())
+            return false;
+        String body = bodyOpt.get();
+        body = body.substring(0, Math.min(128, body.length())).toLowerCase();
+
+        if (body.contains("<atom"))
+            return true;
+        if (body.contains("<rss"))
+            return true;

+        return false;
     }

     public HttpFetchResult fetchContentWithReference(EdgeUrl top,
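The feed validation above amounts to a cheap prefix sniff rather than full XML parsing; a standalone sketch of the same check (hypothetical helper, not part of the diff):

    static boolean looksLikeFeed(String body) {
        // only the first 128 characters are inspected, lowercased
        String head = body.substring(0, Math.min(128, body.length())).toLowerCase();
        return head.contains("<atom") || head.contains("<rss");
    }

guessFeedUrl() first revalidates any feed URL remembered in the domain state db from an earlier crawl, and only then probes the likelyFeedEndpoints list in order, paying a fetch delay per attempt.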
@@ -7,9 +7,9 @@ import nu.marginalia.model.EdgeUrl;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

-import java.util.ArrayList;
 import java.util.HashSet;
 import java.util.List;
+import java.util.Optional;
 import java.util.Set;

 public class SitemapFetcher {
@@ -24,26 +24,27 @@ public class SitemapFetcher {
     }

     public void downloadSitemaps(SimpleRobotRules robotsRules, EdgeUrl rootUrl) {
-        List<String> sitemaps = robotsRules.getSitemaps();
+        List<String> urls = robotsRules.getSitemaps();

-        List<EdgeUrl> urls = new ArrayList<>(sitemaps.size());
-        if (!sitemaps.isEmpty()) {
-            for (var url : sitemaps) {
-                EdgeUrl.parse(url).ifPresent(urls::add);
-            }
-        }
-        else {
-            urls.add(rootUrl.withPathAndParam("/sitemap.xml", null));
+        if (urls.isEmpty()) {
+            urls = List.of(rootUrl.withPathAndParam("/sitemap.xml", null).toString());
         }

         downloadSitemaps(urls);
     }

-    public void downloadSitemaps(List<EdgeUrl> urls) {
+    public void downloadSitemaps(List<String> urls) {

         Set<String> checkedSitemaps = new HashSet<>();

-        for (var url : urls) {
+        for (var rawUrl : urls) {
+            Optional<EdgeUrl> parsedUrl = EdgeUrl.parse(rawUrl);
+            if (parsedUrl.isEmpty()) {
+                continue;
+            }
+
+            EdgeUrl url = parsedUrl.get();
+
             // Let's not download sitemaps from other domains for now
             if (!crawlFrontier.isSameDomain(url)) {
                 continue;
@@ -1,11 +1,15 @@
 package nu.marginalia.io;

+import nu.marginalia.model.crawldata.CrawledDocument;
+import nu.marginalia.model.crawldata.CrawledDomain;
 import nu.marginalia.model.crawldata.SerializableCrawlData;
 import org.jetbrains.annotations.Nullable;

 import java.io.IOException;
 import java.nio.file.Path;
+import java.util.ArrayList;
 import java.util.Iterator;
+import java.util.List;

 /** Closable iterator exceptional over serialized crawl data
  * The data may appear in any order, and the iterator must be closed.
@@ -26,6 +30,37 @@ public interface SerializableCrawlDataStream extends AutoCloseable {
     @Nullable
     default Path path() { return null; }

+    /** For tests */
+    default List<SerializableCrawlData> asList() throws IOException {
+        List<SerializableCrawlData> data = new ArrayList<>();
+        while (hasNext()) {
+            data.add(next());
+        }
+        return data;
+    }
+
+    /** For tests */
+    default List<CrawledDocument> docsAsList() throws IOException {
+        List<CrawledDocument> data = new ArrayList<>();
+        while (hasNext()) {
+            if (next() instanceof CrawledDocument doc) {
+                data.add(doc);
+            }
+        }
+        return data;
+    }
+
+    /** For tests */
+    default List<CrawledDomain> domainsAsList() throws IOException {
+        List<CrawledDomain> data = new ArrayList<>();
+        while (hasNext()) {
+            if (next() instanceof CrawledDomain domain) {
+                data.add(domain);
+            }
+        }
+        return data;
+    }
+
     // Dummy iterator over nothing
     static SerializableCrawlDataStream empty() {
         return new SerializableCrawlDataStream() {
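Illustrative use of the new convenience methods in a test (the empty stream stands in for real crawl data):

    try (SerializableCrawlDataStream stream = SerializableCrawlDataStream.empty()) {
        List<CrawledDocument> docs = stream.docsAsList();     // only the CrawledDocument records
        List<CrawledDomain> domains = stream.domainsAsList(); // only the CrawledDomain records
    }

Note that the helpers consume the underlying iterator, so a given stream instance can back only one such call.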
@@ -18,6 +18,7 @@ public class ContentTypeLogic {
     "application/xhtml",
     "application/xml",
     "application/atom+xml",
+    "application/atomsvc+xml",
     "application/rss+xml",
     "application/x-rss+xml",
     "application/rdf+xml",
@@ -23,6 +23,10 @@ public sealed interface DocumentBodyResult<T> {
             return mapper.apply(contentType, body);
         }

+        public Optional<T> getBody() {
+            return Optional.of(body);
+        }
+
         @Override
         public void ifPresent(ExConsumer<T, Exception> consumer) throws Exception {
             consumer.accept(contentType, body);
@@ -41,6 +45,11 @@ public sealed interface DocumentBodyResult<T> {
             return (DocumentBodyResult<T2>) this;
         }

+        @Override
+        public Optional<T> getBody() {
+            return Optional.empty();
+        }
+
         @Override
         public void ifPresent(ExConsumer<T, Exception> consumer) throws Exception {
         }
@@ -49,6 +58,7 @@ public sealed interface DocumentBodyResult<T> {
     <T2> Optional<T2> mapOpt(BiFunction<ContentType, T, T2> mapper);
     <T2> Optional<T2> flatMapOpt(BiFunction<ContentType, T, Optional<T2>> mapper);
     <T2> DocumentBodyResult<T2> flatMap(BiFunction<ContentType, T, DocumentBodyResult<T2>> mapper);
+    Optional<T> getBody();

     void ifPresent(ExConsumer<T,Exception> consumer) throws Exception;

@@ -0,0 +1,66 @@
+package nu.marginalia.crawl;
+
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.sql.SQLException;
+import java.time.Instant;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+class DomainStateDbTest {
+
+    Path tempFile;
+    @BeforeEach
+    void setUp() throws IOException {
+        tempFile = Files.createTempFile(getClass().getSimpleName(), ".db");
+    }
+
+    @AfterEach
+    void tearDown() throws IOException {
+        Files.deleteIfExists(tempFile);
+    }
+
+    @Test
+    public void testSunnyDay() throws SQLException {
+        try (var db = new DomainStateDb(tempFile)) {
+            var allFields = new DomainStateDb.SummaryRecord(
+                    "all.marginalia.nu",
+                    Instant.now(),
+                    "OK",
+                    "Bad address",
+                    "https://www.marginalia.nu/atom.xml"
+            );
+
+            var minFields = new DomainStateDb.SummaryRecord(
+                    "min.marginalia.nu",
+                    Instant.now(),
+                    "OK",
+                    null,
+                    null
+            );
+
+            db.save(allFields);
+            db.save(minFields);
+
+            assertEquals(allFields, db.get("all.marginalia.nu").orElseThrow());
+            assertEquals(minFields, db.get("min.marginalia.nu").orElseThrow());
+
+            var updatedAllFields = new DomainStateDb.SummaryRecord(
+                    "all.marginalia.nu",
+                    Instant.now(),
+                    "BAD",
+                    null,
+                    null
+            );
+
+            db.save(updatedAllFields);
+            assertEquals(updatedAllFields, db.get("all.marginalia.nu").orElseThrow());
+        }
+    }
+
+}
@@ -42,24 +42,24 @@ class ContentTypeProberTest {
     port = r.nextInt(10000) + 8000;
     server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 10);

-    server.createContext("/html", exchange -> {
+    server.createContext("/html.gz", exchange -> {
         exchange.getResponseHeaders().add("Content-Type", "text/html");
         exchange.sendResponseHeaders(200, -1);
         exchange.close();
     });
-    server.createContext("/redir", exchange -> {
-        exchange.getResponseHeaders().add("Location", "/html");
+    server.createContext("/redir.gz", exchange -> {
+        exchange.getResponseHeaders().add("Location", "/html.gz");
         exchange.sendResponseHeaders(301, -1);
         exchange.close();
     });

-    server.createContext("/bin", exchange -> {
+    server.createContext("/bin.gz", exchange -> {
         exchange.getResponseHeaders().add("Content-Type", "application/binary");
         exchange.sendResponseHeaders(200, -1);
         exchange.close();
     });

-    server.createContext("/timeout", exchange -> {
+    server.createContext("/timeout.gz", exchange -> {
         try {
             Thread.sleep(15_000);
         } catch (InterruptedException e) {
@@ -73,10 +73,10 @@ class ContentTypeProberTest {

     server.start();

-    htmlEndpoint = EdgeUrl.parse("http://localhost:" + port + "/html").get();
-    binaryEndpoint = EdgeUrl.parse("http://localhost:" + port + "/bin").get();
-    timeoutEndpoint = EdgeUrl.parse("http://localhost:" + port + "/timeout").get();
-    htmlRedirEndpoint = EdgeUrl.parse("http://localhost:" + port + "/redir").get();
+    htmlEndpoint = EdgeUrl.parse("http://localhost:" + port + "/html.gz").get();
+    binaryEndpoint = EdgeUrl.parse("http://localhost:" + port + "/bin.gz").get();
+    timeoutEndpoint = EdgeUrl.parse("http://localhost:" + port + "/timeout.gz").get();
+    htmlRedirEndpoint = EdgeUrl.parse("http://localhost:" + port + "/redir.gz").get();

     fetcher = new HttpFetcherImpl("test");
     recorder = new WarcRecorder(warcFile);
@@ -2,6 +2,7 @@ package nu.marginalia.crawling.retreival;

 import crawlercommons.robots.SimpleRobotRules;
 import nu.marginalia.crawl.CrawlerMain;
+import nu.marginalia.crawl.DomainStateDb;
 import nu.marginalia.crawl.fetcher.ContentTags;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
@@ -18,6 +19,7 @@ import nu.marginalia.model.crawldata.SerializableCrawlData;
 import nu.marginalia.test.CommonTestData;
 import okhttp3.Headers;
 import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 import org.mockito.Mockito;
 import org.slf4j.Logger;
@@ -25,6 +27,9 @@ import org.slf4j.LoggerFactory;

 import java.io.IOException;
 import java.net.URISyntaxException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.sql.SQLException;
 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.List;
@@ -36,9 +41,14 @@ public class CrawlerMockFetcherTest {

     Map<EdgeUrl, CrawledDocument> mockData = new HashMap<>();
     HttpFetcher fetcherMock = new MockFetcher();
+    private Path dbTempFile;
+
+    @BeforeEach
+    public void setUp() throws IOException {
+        dbTempFile = Files.createTempFile("domains","db");
+    }

     @AfterEach
-    public void tearDown() {
+    public void tearDown() throws IOException {
+        Files.deleteIfExists(dbTempFile);
         mockData.clear();
     }

@@ -66,15 +76,17 @@ public class CrawlerMockFetcherTest {

     }

-    void crawl(CrawlerMain.CrawlSpecRecord spec) throws IOException {
-        try (var recorder = new WarcRecorder()) {
-            new CrawlerRetreiver(fetcherMock, new DomainProber(d -> true), spec, recorder)
+    void crawl(CrawlerMain.CrawlSpecRecord spec) throws IOException, SQLException {
+        try (var recorder = new WarcRecorder();
+             var db = new DomainStateDb(dbTempFile)
+        ) {
+            new CrawlerRetreiver(fetcherMock, new DomainProber(d -> true), spec, db, recorder)
                     .crawlDomain();
         }
     }

     @Test
-    public void testLemmy() throws URISyntaxException, IOException {
+    public void testLemmy() throws Exception {
         List<SerializableCrawlData> out = new ArrayList<>();

         registerUrlClasspathData(new EdgeUrl("https://startrek.website/"), "mock-crawl-data/lemmy/index.html");
@@ -85,7 +97,7 @@ public class CrawlerMockFetcherTest {
     }

     @Test
-    public void testMediawiki() throws URISyntaxException, IOException {
+    public void testMediawiki() throws Exception {
         List<SerializableCrawlData> out = new ArrayList<>();

         registerUrlClasspathData(new EdgeUrl("https://en.wikipedia.org/"), "mock-crawl-data/mediawiki/index.html");
@@ -94,7 +106,7 @@ public class CrawlerMockFetcherTest {
     }

     @Test
-    public void testDiscourse() throws URISyntaxException, IOException {
+    public void testDiscourse() throws Exception {
         List<SerializableCrawlData> out = new ArrayList<>();

         registerUrlClasspathData(new EdgeUrl("https://community.tt-rss.org/"), "mock-crawl-data/discourse/index.html");
@@ -4,6 +4,7 @@ import nu.marginalia.UserAgent;
 import nu.marginalia.WmsaHome;
 import nu.marginalia.atags.model.DomainLinks;
 import nu.marginalia.crawl.CrawlerMain;
+import nu.marginalia.crawl.DomainStateDb;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
@@ -25,6 +26,7 @@ import java.io.RandomAccessFile;
 import java.net.URISyntaxException;
 import java.nio.file.Files;
 import java.nio.file.Path;
+import java.sql.SQLException;
 import java.util.*;
 import java.util.stream.Collectors;

@@ -39,11 +41,13 @@ class CrawlerRetreiverTest {
     Path tempFileWarc2;
     Path tempFileParquet2;
     Path tempFileWarc3;
+    Path tempFileDb;
     @BeforeEach
     public void setUp() throws IOException {
         httpFetcher = new HttpFetcherImpl("search.marginalia.nu; testing a bit :D");
         tempFileParquet1 = Files.createTempFile("crawling-process", ".parquet");
         tempFileParquet2 = Files.createTempFile("crawling-process", ".parquet");
+        tempFileDb = Files.createTempFile("crawling-process", ".db");

     }

@@ -505,22 +509,26 @@ class CrawlerRetreiverTest {
     }

     private void doCrawlWithReferenceStream(CrawlerMain.CrawlSpecRecord specs, SerializableCrawlDataStream stream) {
-        try (var recorder = new WarcRecorder(tempFileWarc2)) {
-            new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, recorder).crawlDomain(new DomainLinks(),
+        try (var recorder = new WarcRecorder(tempFileWarc2);
+             var db = new DomainStateDb(tempFileDb)
+        ) {
+            new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, db, recorder).crawlDomain(new DomainLinks(),
                     new CrawlDataReference(stream));
         }
-        catch (IOException ex) {
+        catch (IOException | SQLException ex) {
            Assertions.fail(ex);
        }
     }

     @NotNull
     private DomainCrawlFrontier doCrawl(Path tempFileWarc1, CrawlerMain.CrawlSpecRecord specs) {
-        try (var recorder = new WarcRecorder(tempFileWarc1)) {
-            var crawler = new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, recorder);
+        try (var recorder = new WarcRecorder(tempFileWarc1);
+             var db = new DomainStateDb(tempFileDb)
+        ) {
+            var crawler = new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, db, recorder);
             crawler.crawlDomain();
             return crawler.getCrawlFrontier();
-        } catch (IOException ex) {
+        } catch (IOException | SQLException ex) {
             Assertions.fail(ex);
             return null; // unreachable
         }
@@ -179,6 +179,9 @@ public class LiveCrawlerMain extends ProcessMainClass {
                 EdgeDomain domain = new EdgeDomain(entry.getKey());
                 List<String> urls = entry.getValue();

+                if (urls.isEmpty())
+                    continue;
+
                 fetcher.scheduleRetrieval(domain, urls);
             }
         }
@@ -3,7 +3,10 @@ package nu.marginalia.livecrawler;
 import crawlercommons.robots.SimpleRobotRules;
 import crawlercommons.robots.SimpleRobotRulesParser;
 import nu.marginalia.WmsaHome;
+import nu.marginalia.contenttype.ContentType;
+import nu.marginalia.contenttype.DocumentBodyToString;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
+import nu.marginalia.crawl.logic.DomainLocks;
 import nu.marginalia.crawl.retreival.CrawlDelayTimer;
 import nu.marginalia.db.DbDomainQueries;
 import nu.marginalia.db.DomainBlacklist;
@@ -15,6 +18,7 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 import javax.annotation.Nullable;
+import java.io.ByteArrayInputStream;
 import java.io.IOException;
 import java.net.URISyntaxException;
 import java.net.http.HttpClient;
@@ -22,10 +26,12 @@ import java.net.http.HttpHeaders;
 import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.time.Duration;
+import java.util.ArrayList;
 import java.util.List;
 import java.util.Optional;
 import java.util.concurrent.ThreadLocalRandom;
 import java.util.concurrent.TimeUnit;
+import java.util.zip.GZIPInputStream;

 /** A simple link scraper that fetches URLs and stores them in a database,
  * with no concept of a crawl frontier, WARC output, or other advanced features
@@ -40,6 +46,9 @@ public class SimpleLinkScraper implements AutoCloseable {
     private final DomainBlacklist domainBlacklist;
     private final Duration connectTimeout = Duration.ofSeconds(10);
     private final Duration readTimeout = Duration.ofSeconds(10);
+    private final DomainLocks domainLocks = new DomainLocks();
+
+    private final static int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);

     public SimpleLinkScraper(LiveCrawlDataSet dataSet,
                              DbDomainQueries domainQueries,
@@ -59,27 +68,11 @@ public class SimpleLinkScraper implements AutoCloseable {
         pool.submitQuietly(() -> retrieveNow(domain, id.getAsInt(), urls));
     }

-    public void retrieveNow(EdgeDomain domain, int domainId, List<String> urls) throws Exception {
-        try (HttpClient client = HttpClient
-                .newBuilder()
-                .connectTimeout(connectTimeout)
-                .followRedirects(HttpClient.Redirect.NEVER)
-                .version(HttpClient.Version.HTTP_2)
-                .build()) {
-
+    public int retrieveNow(EdgeDomain domain, int domainId, List<String> urls) throws Exception {
         EdgeUrl rootUrl = domain.toRootUrlHttps();

-        SimpleRobotRules rules = fetchRobotsRules(rootUrl, client);
-
-        if (rules == null) { // I/O error fetching robots.txt
-            // If we can't fetch the robots.txt,
-            for (var url : urls) {
-                lp.parseLink(rootUrl, url).ifPresent(this::maybeFlagAsBad);
-            }
-            return;
-        }
-
-        CrawlDelayTimer timer = new CrawlDelayTimer(rules.getCrawlDelay());
+        List<EdgeUrl> relevantUrls = new ArrayList<>();

         for (var url : urls) {
             Optional<EdgeUrl> optParsedUrl = lp.parseLink(rootUrl, url);
@@ -89,20 +82,54 @@ public class SimpleLinkScraper implements AutoCloseable {
             if (dataSet.hasUrl(optParsedUrl.get())) {
                 continue;
             }
+            relevantUrls.add(optParsedUrl.get());
+        }

-            EdgeUrl parsedUrl = optParsedUrl.get();
-            if (!rules.isAllowed(url)) {
+        if (relevantUrls.isEmpty()) {
+            return 0;
+        }
+
+        int fetched = 0;
+
+        try (HttpClient client = HttpClient
+                .newBuilder()
+                .connectTimeout(connectTimeout)
+                .followRedirects(HttpClient.Redirect.NEVER)
+                .version(HttpClient.Version.HTTP_2)
+                .build();
+             // throttle concurrent access per domain; IDE will complain it's not used, but it holds a semaphore -- do not remove:
+             DomainLocks.DomainLock lock = domainLocks.lockDomain(domain)
+        ) {
+            SimpleRobotRules rules = fetchRobotsRules(rootUrl, client);
+
+            if (rules == null) { // I/O error fetching robots.txt
+                // If we can't fetch the robots.txt,
+                for (var url : relevantUrls) {
+                    maybeFlagAsBad(url);
+                }
+                return fetched;
+            }
+
+            CrawlDelayTimer timer = new CrawlDelayTimer(rules.getCrawlDelay());
+
+            for (var parsedUrl : relevantUrls) {
+
+                if (!rules.isAllowed(parsedUrl.toString())) {
                     maybeFlagAsBad(parsedUrl);
                     continue;
                 }

                 switch (fetchUrl(domainId, parsedUrl, timer, client)) {
-                    case FetchResult.Success(int id, EdgeUrl docUrl, String body, String headers)
-                        -> dataSet.saveDocument(id, docUrl, body, headers, "");
+                    case FetchResult.Success(int id, EdgeUrl docUrl, String body, String headers) -> {
+                        dataSet.saveDocument(id, docUrl, body, headers, "");
+                        fetched++;
+                    }
                     case FetchResult.Error(EdgeUrl docUrl) -> maybeFlagAsBad(docUrl);
                 }
             }
         }
+
+        return fetched;
     }

     private void maybeFlagAsBad(EdgeUrl url) {
@@ -124,6 +151,7 @@ public class SimpleLinkScraper implements AutoCloseable {
         var robotsRequest = HttpRequest.newBuilder(rootUrl.withPathAndParam("/robots.txt", null).asURI())
                 .GET()
                 .header("User-Agent", WmsaHome.getUserAgent().uaString())
+                .header("Accept-Encoding","gzip")
                 .timeout(readTimeout);

         // Fetch the robots.txt
@@ -131,9 +159,10 @@ public class SimpleLinkScraper implements AutoCloseable {
         try {
             SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
             HttpResponse<byte[]> robotsTxt = client.send(robotsRequest.build(), HttpResponse.BodyHandlers.ofByteArray());

             if (robotsTxt.statusCode() == 200) {
                 return parser.parseContent(rootUrl.toString(),
-                        robotsTxt.body(),
+                        getResponseData(robotsTxt),
                         robotsTxt.headers().firstValue("Content-Type").orElse("text/plain"),
                         WmsaHome.getUserAgent().uaIdentifier());
             }
@@ -157,18 +186,19 @@ public class SimpleLinkScraper implements AutoCloseable {
                 .GET()
                 .header("User-Agent", WmsaHome.getUserAgent().uaString())
                 .header("Accept", "text/html")
+                .header("Accept-Encoding", "gzip")
                 .timeout(readTimeout)
                 .build();

         try {
-            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
+            HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

             // Handle rate limiting by waiting and retrying once
             if (response.statusCode() == 429) {
                 timer.waitRetryDelay(new HttpFetcherImpl.RateLimitException(
                         response.headers().firstValue("Retry-After").orElse("5")
                 ));
-                response = client.send(request, HttpResponse.BodyHandlers.ofString());
+                response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
             }

             String contentType = response.headers().firstValue("Content-Type").orElse("").toLowerCase();
@@ -178,12 +208,14 @@ public class SimpleLinkScraper implements AutoCloseable {
                 return new FetchResult.Error(parsedUrl);
             }

-            String body = response.body();
-            if (body.length() > 1024 * 1024) {
+            byte[] body = getResponseData(response);
+            if (body.length > MAX_SIZE) {
                 return new FetchResult.Error(parsedUrl);
             }

-            return new FetchResult.Success(domainId, parsedUrl, body, headersToString(response.headers()));
+            String bodyText = DocumentBodyToString.getStringData(ContentType.parse(contentType), body);
+
+            return new FetchResult.Success(domainId, parsedUrl, bodyText, headersToString(response.headers()));
         }
     }
     catch (IOException ex) {
@@ -194,6 +226,19 @@ public class SimpleLinkScraper implements AutoCloseable {
         return new FetchResult.Error(parsedUrl);
     }

+    private byte[] getResponseData(HttpResponse<byte[]> response) throws IOException {
+        String encoding = response.headers().firstValue("Content-Encoding").orElse("");
+
+        if ("gzip".equals(encoding)) {
+            try (var stream = new GZIPInputStream(new ByteArrayInputStream(response.body()))) {
+                return stream.readAllBytes();
+            }
+        }
+        else {
+            return response.body();
+        }
+    }
+
     sealed interface FetchResult {
         record Success(int domainId, EdgeUrl url, String body, String headers) implements FetchResult {}
         record Error(EdgeUrl url) implements FetchResult {}
|
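The gzip handling added above is needed because java.net.http's HttpClient does not transparently decompress response bodies: once the request advertises Accept-Encoding: gzip, the raw compressed bytes come back as-is. Below is a minimal standalone sketch of the same pattern (not part of the diff; the URL and class name are placeholders):

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.zip.GZIPInputStream;

    class GzipFetchSketch {
        public static void main(String[] args) throws IOException, InterruptedException {
            HttpClient client = HttpClient.newHttpClient();

            // Opt in to gzip; the JDK client will NOT decode it for us
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://www.example.com/"))
                    .header("Accept-Encoding", "gzip")
                    .build();

            HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            byte[] body;
            if ("gzip".equals(response.headers().firstValue("Content-Encoding").orElse(""))) {
                // Inflate by hand, exactly like getResponseData() above
                try (var in = new GZIPInputStream(new ByteArrayInputStream(response.body()))) {
                    body = in.readAllBytes();
                }
            } else {
                body = response.body();
            }

            System.out.println(body.length + " bytes after decoding");
        }
    }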
@@ -0,0 +1,66 @@
package nu.marginalia.livecrawler;

import nu.marginalia.db.DomainBlacklistImpl;
import nu.marginalia.io.SerializableCrawlDataStream;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.crawldata.CrawledDocument;
import org.apache.commons.io.FileUtils;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.mockito.Mockito;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.SQLException;
import java.util.List;

class SimpleLinkScraperTest {
    private Path tempDir;
    private LiveCrawlDataSet dataSet;

    @BeforeEach
    public void setUp() throws IOException, SQLException {
        tempDir = Files.createTempDirectory(getClass().getSimpleName());
        dataSet = new LiveCrawlDataSet(tempDir);
    }

    @AfterEach
    public void tearDown() throws Exception {
        dataSet.close();
        FileUtils.deleteDirectory(tempDir.toFile());
    }

    @Test
    public void testRetrieveNow() throws Exception {
        var scraper = new SimpleLinkScraper(dataSet, null, Mockito.mock(DomainBlacklistImpl.class));
        int fetched = scraper.retrieveNow(new EdgeDomain("www.marginalia.nu"), 1, List.of("https://www.marginalia.nu/"));
        Assertions.assertEquals(1, fetched);

        var streams = dataSet.getDataStreams();
        Assertions.assertEquals(1, streams.size());

        SerializableCrawlDataStream firstStream = streams.iterator().next();
        Assertions.assertTrue(firstStream.hasNext());

        List<CrawledDocument> documents = firstStream.docsAsList();
        Assertions.assertEquals(1, documents.size());
        Assertions.assertTrue(documents.getFirst().documentBody.startsWith("<!doctype"));
    }

    @Test
    public void testRetrieveNow_Redundant() throws Exception {
        dataSet.saveDocument(1, new EdgeUrl("https://www.marginalia.nu/"), "<html>", "", "127.0.0.1");
        var scraper = new SimpleLinkScraper(dataSet, null, Mockito.mock(DomainBlacklistImpl.class));

        // If the requested URL is already in the dataSet, retrieveNow should short-circuit and not fetch anything
        int fetched = scraper.retrieveNow(new EdgeDomain("www.marginalia.nu"), 1, List.of("https://www.marginalia.nu/"));
        Assertions.assertEquals(0, fetched);
    }
}
@@ -11,7 +11,7 @@ import nu.marginalia.api.svc.RateLimiterService;
 import nu.marginalia.api.svc.ResponseCache;
 import nu.marginalia.model.gson.GsonFactory;
 import nu.marginalia.service.server.BaseServiceParams;
-import nu.marginalia.service.server.Service;
+import nu.marginalia.service.server.SparkService;
 import nu.marginalia.service.server.mq.MqRequest;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -21,7 +21,7 @@ import spark.Request;
 import spark.Response;
 import spark.Spark;
 
-public class ApiService extends Service {
+public class ApiService extends SparkService {
 
     private final Logger logger = LoggerFactory.getLogger(getClass());
     private final Gson gson = GsonFactory.get();
@@ -69,7 +69,7 @@ public class ApiService extends Service {
         this.searchOperator = searchOperator;
 
         Spark.get("/api/", (rq, rsp) -> {
-            rsp.redirect("https://memex.marginalia.nu/projects/edge/api.gmi");
+            rsp.redirect("https://about.marginalia-search.com/article/api/");
             return "";
         });
@@ -9,7 +9,7 @@ import nu.marginalia.renderer.MustacheRenderer;
 import nu.marginalia.renderer.RendererFactory;
 import nu.marginalia.screenshot.ScreenshotService;
 import nu.marginalia.service.server.BaseServiceParams;
-import nu.marginalia.service.server.Service;
+import nu.marginalia.service.server.SparkService;
 import org.jetbrains.annotations.NotNull;
 import spark.Request;
 import spark.Response;
@@ -18,7 +18,7 @@ import spark.Spark;
 import java.util.Map;
 import java.util.Optional;
 
-public class DatingService extends Service {
+public class DatingService extends SparkService {
     private final DomainBlacklist blacklist;
     private final DbBrowseDomainsSimilarCosine browseSimilarCosine;
     private final DbBrowseDomainsRandom browseRandom;
@@ -5,7 +5,7 @@ import com.zaxxer.hikari.HikariDataSource;
 import nu.marginalia.renderer.MustacheRenderer;
 import nu.marginalia.renderer.RendererFactory;
 import nu.marginalia.service.server.BaseServiceParams;
-import nu.marginalia.service.server.Service;
+import nu.marginalia.service.server.SparkService;
 import nu.marginalia.service.server.StaticResources;
 import org.jetbrains.annotations.NotNull;
 import spark.Request;
@@ -15,7 +15,7 @@ import spark.Spark;
 import java.sql.SQLException;
 import java.util.*;
 
-public class ExplorerService extends Service {
+public class ExplorerService extends SparkService {
 
     private final MustacheRenderer<Object> renderer;
     private final HikariDataSource dataSource;
code/services-application/search-service-legacy/build.gradle (new file, 94 lines)
@@ -0,0 +1,94 @@
plugins {
    id 'java'
    id 'io.freefair.sass-base' version '8.4'
    id 'io.freefair.sass-java' version '8.4'
    id 'application'
    id 'jvm-test-suite'

    id 'com.google.cloud.tools.jib' version '3.4.3'
}

application {
    mainClass = 'nu.marginalia.search.SearchMain'
    applicationName = 'search-service-legacy'
}

tasks.distZip.enabled = false

java {
    toolchain {
        languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
    }
}
sass {
    sourceMapEnabled = true
    sourceMapEmbed = true
    outputStyle = EXPANDED
}

apply from: "$rootProject.projectDir/srcsets.gradle"
apply from: "$rootProject.projectDir/docker.gradle"

dependencies {
    implementation project(':code:common:db')
    implementation project(':code:common:model')
    implementation project(':code:common:service')
    implementation project(':code:common:config')
    implementation project(':code:index:query')

    implementation project(':code:libraries:easy-lsh')
    implementation project(':code:libraries:language-processing')
    implementation project(':code:libraries:braille-block-punch-cards')
    implementation project(':code:libraries:term-frequency-dict')

    implementation project(':code:functions:live-capture:api')
    implementation project(':code:functions:math:api')
    implementation project(':code:functions:domain-info:api')
    implementation project(':code:functions:search-query:api')

    implementation project(':code:index:api')
    implementation project(':code:common:renderer')

    implementation project(':code:features-search:screenshots')
    implementation project(':code:features-search:random-websites')

    implementation libs.bundles.slf4j

    implementation libs.roaringbitmap
    implementation libs.prometheus
    implementation libs.notnull
    implementation libs.guava
    implementation dependencies.create(libs.guice.get()) {
        exclude group: 'com.google.guava'
    }
    implementation libs.handlebars
    implementation dependencies.create(libs.spark.get()) {
        exclude group: 'org.eclipse.jetty'
    }
    implementation libs.bundles.jetty
    implementation libs.opencsv
    implementation libs.trove
    implementation libs.fastutil
    implementation libs.bundles.gson
    implementation libs.bundles.mariadb
    implementation libs.bundles.nlp

    testImplementation libs.bundles.slf4j.test
    testImplementation libs.bundles.junit
    testImplementation libs.mockito

    testImplementation platform('org.testcontainers:testcontainers-bom:1.17.4')
    testImplementation libs.commons.codec
    testImplementation 'org.testcontainers:mariadb:1.17.4'
    testImplementation 'org.testcontainers:junit-jupiter:1.17.4'
    testImplementation project(':code:libraries:test-helpers')
}

tasks.register('paperDoll', Test) {
    useJUnitPlatform {
        includeTags "paperdoll"
    }
    jvmArgs = [ '-DrunPaperDoll=true', '--enable-preview' ]
}
@@ -0,0 +1,47 @@
package nu.marginalia.search;

import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;
import nu.marginalia.service.MainClass;
import nu.marginalia.service.discovery.ServiceRegistryIf;
import nu.marginalia.service.module.ServiceConfiguration;
import nu.marginalia.service.module.ServiceDiscoveryModule;
import nu.marginalia.service.ServiceId;
import nu.marginalia.service.module.ServiceConfigurationModule;
import nu.marginalia.service.module.DatabaseModule;
import nu.marginalia.service.server.Initialization;
import spark.Spark;

public class SearchMain extends MainClass {
    private final SearchService service;

    @Inject
    public SearchMain(SearchService service) {
        this.service = service;
    }

    public static void main(String... args) {

        init(ServiceId.Search, args);

        Spark.staticFileLocation("/static/search/");

        Injector injector = Guice.createInjector(
                new SearchModule(),
                new ServiceConfigurationModule(ServiceId.Search),
                new ServiceDiscoveryModule(),
                new DatabaseModule(false)
        );

        // Orchestrate the boot order for the services
        var registry = injector.getInstance(ServiceRegistryIf.class);
        var configuration = injector.getInstance(ServiceConfiguration.class);
        orchestrateBoot(registry, configuration);

        injector.getInstance(SearchMain.class);
        injector.getInstance(Initialization.class).setReady();
    }
}
@@ -0,0 +1,20 @@
package nu.marginalia.search;

import com.google.inject.AbstractModule;
import nu.marginalia.LanguageModels;
import nu.marginalia.WebsiteUrl;
import nu.marginalia.WmsaHome;
import nu.marginalia.renderer.config.HandlebarsConfigurator;

public class SearchModule extends AbstractModule {

    public void configure() {
        bind(HandlebarsConfigurator.class).to(SearchHandlebarsConfigurator.class);

        bind(LanguageModels.class).toInstance(WmsaHome.getLanguageModels());

        bind(WebsiteUrl.class).toInstance(new WebsiteUrl(
                System.getProperty("search.legacyWebsiteUrl", "https://old-search.marginalia.nu/")));
    }

}
@@ -0,0 +1,266 @@
package nu.marginalia.search;

import com.google.inject.Inject;
import com.google.inject.Singleton;
import nu.marginalia.WebsiteUrl;
import nu.marginalia.api.math.MathClient;
import nu.marginalia.api.searchquery.QueryClient;
import nu.marginalia.api.searchquery.model.query.QueryResponse;
import nu.marginalia.api.searchquery.model.results.DecoratedSearchResultItem;
import nu.marginalia.bbpc.BrailleBlockPunchCards;
import nu.marginalia.db.DbDomainQueries;
import nu.marginalia.index.query.limit.QueryLimits;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.crawl.DomainIndexingState;
import nu.marginalia.search.command.SearchParameters;
import nu.marginalia.search.model.ClusteredUrlDetails;
import nu.marginalia.search.model.DecoratedSearchResults;
import nu.marginalia.search.model.SearchFilters;
import nu.marginalia.search.model.UrlDetails;
import nu.marginalia.search.results.UrlDeduplicator;
import nu.marginalia.search.svc.SearchQueryCountService;
import nu.marginalia.search.svc.SearchUnitConversionService;
import org.apache.logging.log4j.util.Strings;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

import javax.annotation.Nullable;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@Singleton
public class SearchOperator {

    private static final Logger logger = LoggerFactory.getLogger(SearchOperator.class);

    // Marker for filtering out sensitive content from the persistent logs
    private final Marker queryMarker = MarkerFactory.getMarker("QUERY");

    private final MathClient mathClient;
    private final DbDomainQueries domainQueries;
    private final QueryClient queryClient;
    private final SearchQueryParamFactory paramFactory;
    private final WebsiteUrl websiteUrl;
    private final SearchUnitConversionService searchUnitConversionService;
    private final SearchQueryCountService searchVisitorCount;

    @Inject
    public SearchOperator(MathClient mathClient,
                          DbDomainQueries domainQueries,
                          QueryClient queryClient,
                          SearchQueryParamFactory paramFactory,
                          WebsiteUrl websiteUrl,
                          SearchUnitConversionService searchUnitConversionService,
                          SearchQueryCountService searchVisitorCount)
    {
        this.mathClient = mathClient;
        this.domainQueries = domainQueries;
        this.queryClient = queryClient;
        this.paramFactory = paramFactory;
        this.websiteUrl = websiteUrl;
        this.searchUnitConversionService = searchUnitConversionService;
        this.searchVisitorCount = searchVisitorCount;
    }

    public List<UrlDetails> doSiteSearch(String domain, int domainId, int count) {
        var queryParams = paramFactory.forSiteSearch(domain, domainId, count);
        var queryResponse = queryClient.search(queryParams);

        return getResultsFromQuery(queryResponse);
    }

    public List<UrlDetails> doBacklinkSearch(String domain) {
        var queryParams = paramFactory.forBacklinkSearch(domain);
        var queryResponse = queryClient.search(queryParams);

        return getResultsFromQuery(queryResponse);
    }

    public List<UrlDetails> doLinkSearch(String source, String dest) {
        var queryParams = paramFactory.forLinkSearch(source, dest);
        var queryResponse = queryClient.search(queryParams);

        return getResultsFromQuery(queryResponse);
    }

    public DecoratedSearchResults doSearch(SearchParameters userParams) throws InterruptedException {
        // The full user-facing search query does additional work to try to evaluate the query
        // e.g. as a unit conversion query.  This is done in parallel with the regular search.
        Future<String> eval = searchUnitConversionService.tryEval(userParams.query());

        // Perform the regular search
        var queryParams = paramFactory.forRegularSearch(userParams);
        QueryResponse queryResponse = queryClient.search(queryParams);
        var queryResults = getResultsFromQuery(queryResponse);

        // Cluster the results based on the query response
        List<ClusteredUrlDetails> clusteredResults = SearchResultClusterer
                .selectStrategy(queryResponse)
                .clusterResults(queryResults, 25);

        // Log the query and results
        logger.info(queryMarker, "Human terms: {}", Strings.join(queryResponse.searchTermsHuman(), ','));
        logger.info(queryMarker, "Search Result Count: {}", queryResults.size());

        // Get the evaluation result and other data to return to the user
        String evalResult = getFutureOrDefault(eval, "");

        String focusDomain = queryResponse.domain();
        int focusDomainId = focusDomain == null
                ? -1
                : domainQueries.tryGetDomainId(new EdgeDomain(focusDomain)).orElse(-1);

        List<String> problems = getProblems(evalResult, queryResults, queryResponse);

        List<DecoratedSearchResults.Page> resultPages = IntStream.rangeClosed(1, queryResponse.totalPages())
                .mapToObj(number -> new DecoratedSearchResults.Page(
                        number,
                        number == userParams.page(),
                        userParams.withPage(number).renderUrl(websiteUrl)
                ))
                .toList();

        // Return the results to the user
        return DecoratedSearchResults.builder()
                .params(userParams)
                .problems(problems)
                .evalResult(evalResult)
                .results(clusteredResults)
                .filters(new SearchFilters(websiteUrl, userParams))
                .focusDomain(focusDomain)
                .focusDomainId(focusDomainId)
                .resultPages(resultPages)
                .build();
    }

    public List<UrlDetails> getResultsFromQuery(QueryResponse queryResponse) {
        final QueryLimits limits = queryResponse.specs().queryLimits;
        final UrlDeduplicator deduplicator = new UrlDeduplicator(limits.resultsByDomain());

        // Update the query count (this is what you see on the front page)
        searchVisitorCount.registerQuery();

        return queryResponse.results().stream()
                .filter(deduplicator::shouldRetain)
                .limit(limits.resultsTotal())
                .map(SearchOperator::createDetails)
                .toList();
    }

    private static UrlDetails createDetails(DecoratedSearchResultItem item) {
        return new UrlDetails(
                item.documentId(),
                item.domainId(),
                cleanUrl(item.url),
                item.title,
                item.description,
                item.format,
                item.features,
                DomainIndexingState.ACTIVE,
                item.rankingScore, // termScore
                item.resultsFromDomain,
                BrailleBlockPunchCards.printBits(item.bestPositions, 64),
                Long.bitCount(item.bestPositions),
                item.rawIndexResult,
                item.rawIndexResult.keywordScores
        );
    }

    /** Replace nuisance domains with replacements where available */
    private static EdgeUrl cleanUrl(EdgeUrl url) {
        String topdomain = url.domain.topDomain;
        String subdomain = url.domain.subDomain;
        String path = url.path;

        if (topdomain.equals("fandom.com")) {
            int wikiIndex = path.indexOf("/wiki/");
            if (wikiIndex >= 0) {
                return new EdgeUrl("https", new EdgeDomain("breezewiki.com"), null, "/" + subdomain + path.substring(wikiIndex), null);
            }
        }
        else if (topdomain.equals("medium.com")) {
            if (!subdomain.isBlank()) {
                return new EdgeUrl("https", new EdgeDomain("scribe.rip"), null, path, null);
            }
            else {
                String article = path.substring(path.indexOf("/", 1));
                return new EdgeUrl("https", new EdgeDomain("scribe.rip"), null, article, null);
            }
        }
        return url;
    }

    private List<String> getProblems(String evalResult, List<UrlDetails> queryResults, QueryResponse response) throws InterruptedException {
        // We don't debug the query if it's a site search
        if (response.domain() == null)
            return List.of();

        final List<String> problems = new ArrayList<>(response.problems());

        if (queryResults.size() <= 5 && null == evalResult) {
            problems.add("Try rephrasing the query, changing the word order or using synonyms to get different results.");

            // Try to spell check the search terms
            var suggestions = getFutureOrDefault(
                    mathClient.spellCheck(response.searchTermsHuman()),
                    Map.of()
            );

            suggestions.forEach((term, suggestion) -> {
                if (suggestion.size() > 1) {
                    String suggestionsStr = "\"%s\" could be spelled %s".formatted(term, suggestion.stream().map(s -> "\"" + s + "\"").collect(Collectors.joining(", ")));
                    problems.add(suggestionsStr);
                }
            });
        }

        Set<String> representativeKeywords = response.getAllKeywords();
        if (representativeKeywords.size() > 1 && (representativeKeywords.contains("definition") || representativeKeywords.contains("define") || representativeKeywords.contains("meaning")))
        {
            problems.add("Tip: Try using a query that looks like <tt>define:word</tt> if you want a dictionary definition");
        }

        return problems;
    }

    private <T> T getFutureOrDefault(@Nullable Future<T> fut, T defaultValue) {
        return getFutureOrDefault(fut, Duration.ofMillis(50), defaultValue);
    }

    private <T> T getFutureOrDefault(@Nullable Future<T> fut, Duration timeout, T defaultValue) {
        if (fut == null || fut.isCancelled()) {
            return defaultValue;
        }
        try {
            return fut.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        }
        catch (Exception ex) {
            logger.warn("Error fetching eval result", ex);
            return defaultValue;
        }
    }

}
@@ -0,0 +1,104 @@
package nu.marginalia.search;

import nu.marginalia.api.searchquery.model.query.QueryParams;
import nu.marginalia.api.searchquery.model.query.SearchQuery;
import nu.marginalia.api.searchquery.model.query.SearchSetIdentifier;
import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
import nu.marginalia.index.query.limit.QueryLimits;
import nu.marginalia.index.query.limit.QueryStrategy;
import nu.marginalia.index.query.limit.SpecificationLimit;
import nu.marginalia.search.command.SearchParameters;

import java.util.List;

public class SearchQueryParamFactory {

    public QueryParams forRegularSearch(SearchParameters userParams) {
        SearchQuery prototype = new SearchQuery();
        var profile = userParams.profile();

        profile.addTacitTerms(prototype);
        userParams.js().addTacitTerms(prototype);
        userParams.adtech().addTacitTerms(prototype);

        return new QueryParams(
                userParams.query(),
                null,
                prototype.searchTermsInclude,
                prototype.searchTermsExclude,
                prototype.searchTermsPriority,
                prototype.searchTermsAdvice,
                profile.getQualityLimit(),
                profile.getYearLimit(),
                profile.getSizeLimit(),
                SpecificationLimit.none(),
                List.of(),
                new QueryLimits(5, 100, 200, 8192),
                profile.searchSetIdentifier.name(),
                userParams.strategy(),
                userParams.temporalBias(),
                userParams.page()
        );
    }

    public QueryParams forSiteSearch(String domain, int domainId, int count) {
        return new QueryParams("site:"+domain,
                null,
                List.of(),
                List.of(),
                List.of(),
                List.of(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                List.of(domainId),
                new QueryLimits(count, count, 100, 512),
                SearchSetIdentifier.NONE.name(),
                QueryStrategy.AUTO,
                ResultRankingParameters.TemporalBias.NONE,
                1
        );
    }

    public QueryParams forBacklinkSearch(String domain) {
        return new QueryParams("links:"+domain,
                null,
                List.of(),
                List.of(),
                List.of(),
                List.of(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                List.of(),
                new QueryLimits(100, 100, 100, 512),
                SearchSetIdentifier.NONE.name(),
                QueryStrategy.AUTO,
                ResultRankingParameters.TemporalBias.NONE,
                1
        );
    }

    public QueryParams forLinkSearch(String sourceDomain, String destDomain) {
        return new QueryParams("site:" + sourceDomain + " links:" + destDomain,
                null,
                List.of(),
                List.of(),
                List.of(),
                List.of(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                SpecificationLimit.none(),
                List.of(),
                new QueryLimits(100, 100, 100, 512),
                SearchSetIdentifier.NONE.name(),
                QueryStrategy.AUTO,
                ResultRankingParameters.TemporalBias.NONE,
                1
        );
    }
}
@@ -0,0 +1,53 @@
package nu.marginalia.search;

import nu.marginalia.api.searchquery.model.query.QueryResponse;
import nu.marginalia.search.model.ClusteredUrlDetails;
import nu.marginalia.search.model.UrlDetails;

import java.util.List;
import java.util.stream.Collectors;

/** Functions for clustering search results */
public class SearchResultClusterer {
    private SearchResultClusterer() {}

    public interface SearchResultClusterStrategy {
        List<ClusteredUrlDetails> clusterResults(List<UrlDetails> results, int total);
    }

    public static SearchResultClusterStrategy selectStrategy(QueryResponse response) {
        if (response.domain() != null && !response.domain().isBlank())
            return SearchResultClusterer::noOp;

        return SearchResultClusterer::byDomain;
    }

    /** No clustering, just return the results as is */
    private static List<ClusteredUrlDetails> noOp(List<UrlDetails> results, int total) {
        if (results.isEmpty())
            return List.of();

        return results.stream()
                .map(ClusteredUrlDetails::new)
                .toList();
    }

    /** Cluster the results by domain, and return the top "total" clusters
     *  sorted by the relevance of the best result
     */
    private static List<ClusteredUrlDetails> byDomain(List<UrlDetails> results, int total) {
        if (results.isEmpty())
            return List.of();

        return results.stream()
                .collect(
                        Collectors.groupingBy(details -> details.domainId)
                )
                .values().stream()
                .map(ClusteredUrlDetails::new)
                .sorted()
                .limit(total)
                .toList();
    }

}
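The byDomain strategy above leans on Collectors.groupingBy to fold results into one bucket per domainId before each bucket becomes a cluster. A toy sketch of just that grouping step (Hit is a hypothetical stand-in for UrlDetails, not a class in this codebase):

    import java.util.List;
    import java.util.stream.Collectors;

    class GroupingSketch {
        // Toy stand-in for UrlDetails, just to show the grouping step
        record Hit(int domainId, String url) {}

        public static void main(String[] args) {
            var hits = List.of(new Hit(1, "a/1"), new Hit(2, "b/1"), new Hit(1, "a/2"));

            // Same shape as byDomain(): one bucket per domainId; each bucket becomes one cluster
            var buckets = hits.stream().collect(Collectors.groupingBy(Hit::domainId));

            System.out.println(buckets.get(1).size()); // 2 -- both a/1 and a/2 collapse into one cluster
        }
    }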
@@ -0,0 +1,128 @@
package nu.marginalia.search;

import com.google.inject.Inject;
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import nu.marginalia.WebsiteUrl;
import nu.marginalia.search.svc.*;
import nu.marginalia.service.server.BaseServiceParams;
import nu.marginalia.service.server.SparkService;
import nu.marginalia.service.server.StaticResources;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import spark.Request;
import spark.Response;
import spark.Route;
import spark.Spark;

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchService extends SparkService {

    private final WebsiteUrl websiteUrl;
    private final StaticResources staticResources;

    private static final Logger logger = LoggerFactory.getLogger(SearchService.class);
    private static final Histogram wmsa_search_service_request_time = Histogram.build()
            .name("wmsa_search_service_request_time")
            .linearBuckets(0.05, 0.05, 15)
            .labelNames("matchedPath", "method")
            .help("Search service request time (seconds)")
            .register();
    private static final Counter wmsa_search_service_error_count = Counter.build()
            .name("wmsa_search_service_error_count")
            .labelNames("matchedPath", "method")
            .help("Search service error count")
            .register();

    @Inject
    public SearchService(BaseServiceParams params,
                         WebsiteUrl websiteUrl,
                         StaticResources staticResources,
                         SearchFrontPageService frontPageService,
                         SearchErrorPageService errorPageService,
                         SearchAddToCrawlQueueService addToCrawlQueueService,
                         SearchSiteInfoService siteInfoService,
                         SearchCrosstalkService crosstalkService,
                         SearchQueryService searchQueryService)
            throws Exception
    {
        super(params);

        this.websiteUrl = websiteUrl;
        this.staticResources = staticResources;

        Spark.staticFiles.expireTime(600);

        SearchServiceMetrics.get("/search", searchQueryService::pathSearch);

        SearchServiceMetrics.get("/", frontPageService::render);
        SearchServiceMetrics.get("/news.xml", frontPageService::renderNewsFeed);
        SearchServiceMetrics.get("/:resource", this::serveStatic);

        SearchServiceMetrics.post("/site/suggest/", addToCrawlQueueService::suggestCrawling);

        SearchServiceMetrics.get("/site-search/:site/*", this::siteSearchRedir);

        SearchServiceMetrics.get("/site/:site", siteInfoService::handle);
        SearchServiceMetrics.post("/site/:site", siteInfoService::handlePost);

        SearchServiceMetrics.get("/crosstalk/", crosstalkService::handle);

        Spark.exception(Exception.class, (e,p,q) -> {
            logger.error("Error during processing", e);
            wmsa_search_service_error_count.labels(p.pathInfo(), p.requestMethod()).inc();
            errorPageService.serveError(p, q);
        });

        Spark.awaitInitialization();
    }

    /** Wraps a route with a timer and a counter */
    private static class SearchServiceMetrics implements Route {
        private final Route delegatedRoute;

        static void get(String path, Route route) {
            Spark.get(path, new SearchServiceMetrics(route));
        }
        static void post(String path, Route route) {
            Spark.post(path, new SearchServiceMetrics(route));
        }

        private SearchServiceMetrics(Route delegatedRoute) {
            this.delegatedRoute = delegatedRoute;
        }

        @Override
        public Object handle(Request request, Response response) throws Exception {
            return wmsa_search_service_request_time
                    .labels(request.matchedPath(), request.requestMethod())
                    .time(() -> delegatedRoute.handle(request, response));
        }
    }

    private Object serveStatic(Request request, Response response) {
        String resource = request.params("resource");
        staticResources.serveStatic("search", resource, request, response);
        return "";
    }

    private Object siteSearchRedir(Request request, Response response) {
        final String site = request.params("site");
        final String searchTerms;

        if (request.splat().length == 0) searchTerms = "";
        else searchTerms = request.splat()[0];

        final String query = URLEncoder.encode(String.format("%s site:%s", searchTerms, site), StandardCharsets.UTF_8).trim();
        final String profile = request.queryParamOrDefault("profile", "yolo");

        response.redirect(websiteUrl.withPath("search?query="+query+"&profile="+profile));

        return "";
    }

}
@@ -0,0 +1,43 @@
package nu.marginalia.search.command;

import com.google.inject.Inject;
import nu.marginalia.search.command.commands.*;
import spark.Response;

import java.util.ArrayList;
import java.util.List;

public class CommandEvaluator {

    private final List<SearchCommandInterface> specialCommands = new ArrayList<>();
    private final SearchCommand defaultCommand;

    @Inject
    public CommandEvaluator(
            BrowseCommand browse,
            ConvertCommand convert,
            DefinitionCommand define,
            BangCommand bang,
            SiteRedirectCommand siteRedirect,
            SearchCommand search
    ) {
        specialCommands.add(browse);
        specialCommands.add(convert);
        specialCommands.add(define);
        specialCommands.add(bang);
        specialCommands.add(siteRedirect);

        defaultCommand = search;
    }

    public Object eval(Response response, SearchParameters parameters) {
        for (var cmd : specialCommands) {
            var maybe = cmd.process(response, parameters);
            if (maybe.isPresent())
                return maybe.get();
        }

        return defaultCommand.process(response, parameters).orElse("");
    }

}
@@ -0,0 +1,29 @@
package nu.marginalia.search.command;

import nu.marginalia.api.searchquery.model.query.SearchQuery;

import javax.annotation.Nullable;
import java.util.Arrays;

public enum SearchAdtechParameter {
    DEFAULT("default"),
    REDUCE("reduce", "special:ads", "special:affiliate");

    public final String value;
    public final String[] implictExcludeSearchTerms;

    SearchAdtechParameter(String value, String... implictExcludeSearchTerms) {
        this.value = value;
        this.implictExcludeSearchTerms = implictExcludeSearchTerms;
    }

    public static SearchAdtechParameter parse(@Nullable String value) {
        if (REDUCE.value.equals(value)) return REDUCE;

        return DEFAULT;
    }

    public void addTacitTerms(SearchQuery subquery) {
        subquery.searchTermsExclude.addAll(Arrays.asList(implictExcludeSearchTerms));
    }
}
@@ -0,0 +1,10 @@
package nu.marginalia.search.command;

import spark.Response;

import java.util.Optional;

public interface SearchCommandInterface {
    Optional<Object> process(Response response, SearchParameters parameters);
}
@@ -0,0 +1,31 @@
package nu.marginalia.search.command;

import nu.marginalia.api.searchquery.model.query.SearchQuery;

import javax.annotation.Nullable;
import java.util.Arrays;

public enum SearchJsParameter {
    DEFAULT("default"),
    DENY_JS("no-js", "js:true"),
    REQUIRE_JS("yes-js", "js:false");

    public final String value;
    public final String[] implictExcludeSearchTerms;

    SearchJsParameter(String value, String... implictExcludeSearchTerms) {
        this.value = value;
        this.implictExcludeSearchTerms = implictExcludeSearchTerms;
    }

    public static SearchJsParameter parse(@Nullable String value) {
        if (DENY_JS.value.equals(value)) return DENY_JS;
        if (REQUIRE_JS.value.equals(value)) return REQUIRE_JS;

        return DEFAULT;
    }

    public void addTacitTerms(SearchQuery subquery) {
        subquery.searchTermsExclude.addAll(Arrays.asList(implictExcludeSearchTerms));
    }
}
@@ -0,0 +1,106 @@
package nu.marginalia.search.command;

import nu.marginalia.WebsiteUrl;
import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
import nu.marginalia.index.query.limit.QueryStrategy;
import nu.marginalia.index.query.limit.SpecificationLimit;
import nu.marginalia.search.model.SearchProfile;
import spark.Request;

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

import static nu.marginalia.search.command.SearchRecentParameter.RECENT;

public record SearchParameters(String query,
                               SearchProfile profile,
                               SearchJsParameter js,
                               SearchRecentParameter recent,
                               SearchTitleParameter searchTitle,
                               SearchAdtechParameter adtech,
                               boolean newFilter,
                               int page
                               ) {

    public SearchParameters(String queryString, Request request) {
        this(
                queryString,
                SearchProfile.getSearchProfile(request.queryParams("profile")),
                SearchJsParameter.parse(request.queryParams("js")),
                SearchRecentParameter.parse(request.queryParams("recent")),
                SearchTitleParameter.parse(request.queryParams("searchTitle")),
                SearchAdtechParameter.parse(request.queryParams("adtech")),
                "true".equals(request.queryParams("newfilter")),
                Integer.parseInt(Objects.requireNonNullElse(request.queryParams("page"), "1"))
        );
    }

    public String profileStr() {
        return profile.filterId;
    }

    public SearchParameters withProfile(SearchProfile profile) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, true, page);
    }

    public SearchParameters withJs(SearchJsParameter js) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, true, page);
    }
    public SearchParameters withAdtech(SearchAdtechParameter adtech) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, true, page);
    }

    public SearchParameters withRecent(SearchRecentParameter recent) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, true, page);
    }

    public SearchParameters withTitle(SearchTitleParameter title) {
        return new SearchParameters(query, profile, js, recent, title, adtech, true, page);
    }

    public SearchParameters withPage(int page) {
        return new SearchParameters(query, profile, js, recent, searchTitle, adtech, false, page);
    }

    public String renderUrl(WebsiteUrl baseUrl) {
        String path = String.format("/search?query=%s&profile=%s&js=%s&adtech=%s&recent=%s&searchTitle=%s&newfilter=%s&page=%d",
                URLEncoder.encode(query, StandardCharsets.UTF_8),
                URLEncoder.encode(profile.filterId, StandardCharsets.UTF_8),
                URLEncoder.encode(js.value, StandardCharsets.UTF_8),
                URLEncoder.encode(adtech.value, StandardCharsets.UTF_8),
                URLEncoder.encode(recent.value, StandardCharsets.UTF_8),
                URLEncoder.encode(searchTitle.value, StandardCharsets.UTF_8),
                Boolean.valueOf(newFilter).toString(),
                page
        );

        return baseUrl.withPath(path);
    }

    public ResultRankingParameters.TemporalBias temporalBias() {
        if (recent == RECENT) {
            return ResultRankingParameters.TemporalBias.RECENT;
        }
        else if (profile == SearchProfile.VINTAGE) {
            return ResultRankingParameters.TemporalBias.OLD;
        }

        return ResultRankingParameters.TemporalBias.NONE;
    }

    public QueryStrategy strategy() {
        if (searchTitle == SearchTitleParameter.TITLE) {
            return QueryStrategy.REQUIRE_FIELD_TITLE;
        }

        return QueryStrategy.AUTO;
    }

    public SpecificationLimit yearLimit() {
        if (recent == RECENT)
            return SpecificationLimit.greaterThan(2018);

        return profile.getYearLimit();
    }
}
@@ -0,0 +1,21 @@
package nu.marginalia.search.command;

import javax.annotation.Nullable;

public enum SearchRecentParameter {
    DEFAULT("default"),
    RECENT("recent");

    public final String value;

    SearchRecentParameter(String value) {
        this.value = value;
    }

    public static SearchRecentParameter parse(@Nullable String value) {
        if (RECENT.value.equals(value)) return RECENT;

        return DEFAULT;
    }

}
@@ -0,0 +1,21 @@
package nu.marginalia.search.command;

import javax.annotation.Nullable;

public enum SearchTitleParameter {
    DEFAULT("default"),
    TITLE("title");

    public final String value;

    SearchTitleParameter(String value) {
        this.value = value;
    }

    public static SearchTitleParameter parse(@Nullable String value) {
        if (TITLE.value.equals(value)) return TITLE;

        return DEFAULT;
    }

}
@@ -0,0 +1,104 @@
package nu.marginalia.search.command.commands;

import com.google.inject.Inject;
import nu.marginalia.search.command.SearchCommandInterface;
import nu.marginalia.search.command.SearchParameters;
import nu.marginalia.search.exceptions.RedirectException;
import spark.Response;

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class BangCommand implements SearchCommandInterface {
    private final Map<String, String> bangsToPattern = new HashMap<>();

    @Inject
    public BangCommand()
    {
        bangsToPattern.put("!g", "https://www.google.com/search?q=%s");
        bangsToPattern.put("!ddg", "https://duckduckgo.com/?q=%s");
        bangsToPattern.put("!w", "https://search.marginalia.nu/search?query=%s+site:en.wikipedia.org&profile=wiki");
    }

    @Override
    public Optional<Object> process(Response response, SearchParameters parameters) {

        for (var entry : bangsToPattern.entrySet()) {
            String bangPattern = entry.getKey();
            String redirectPattern = entry.getValue();

            var match = matchBangPattern(parameters.query(), bangPattern);

            if (match.isPresent()) {
                var url = String.format(redirectPattern, URLEncoder.encode(match.get(), StandardCharsets.UTF_8));
                throw new RedirectException(url);
            }
        }

        return Optional.empty();
    }

    /** If the query contains the bang pattern bangKey, return the query with the bang pattern removed. */
    Optional<String> matchBangPattern(String query, String bangKey) {
        var bm = new BangMatcher(query);

        while (bm.findNext(bangKey)) {

            if (!bm.isRelativeSpaceOrInvalid(-1))
                continue;
            if (!bm.isRelativeSpaceOrInvalid(bangKey.length()))
                continue;

            String prefix = bm.prefix().trim();
            String suffix = bm.suffix(bangKey.length()).trim();

            String ret = (prefix + " " + suffix).trim();

            return Optional.of(ret)
                    .filter(s -> !s.isBlank());
        }

        return Optional.empty();
    }

    private static class BangMatcher {
        private final String str;
        private int pos;

        public String prefix() {
            return str.substring(0, pos);
        }

        public String suffix(int offset) {
            if (pos+offset < str.length())
                return str.substring(pos + offset);
            return "";
        }

        public BangMatcher(String str) {
            this.str = str;
            this.pos = -1;
        }

        public boolean findNext(String pattern) {
            if (pos + 1 >= str.length())
                return false;

            return (pos = str.indexOf(pattern, pos + 1)) >= 0;
        }

        public boolean isRelativeSpaceOrInvalid(int offset) {
            if (offset + pos < 0)
                return true;
            if (offset + pos >= str.length())
                return true;

            return Character.isSpaceChar(str.charAt(offset + pos));
        }

    }

}
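Hand-traced behaviour of matchBangPattern above (these calls are illustrative, not part of the diff; the method is package-private, so a probe like this would live in the same package):

    var bang = new BangCommand();
    bang.matchBangPattern("!g marginalia search", "!g");  // Optional["marginalia search"]
    bang.matchBangPattern("marginalia !g search", "!g");  // Optional["marginalia search"]
    bang.matchBangPattern("e!gg", "!g");                  // Optional.empty -- no whitespace delimiter around the bang
    bang.matchBangPattern("!g", "!g");                    // Optional.empty -- remainder is blank, filtered out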
@@ -0,0 +1,36 @@
package nu.marginalia.search.command.commands;

import com.google.inject.Inject;
import nu.marginalia.renderer.MustacheRenderer;
import nu.marginalia.renderer.RendererFactory;
import nu.marginalia.search.command.SearchCommandInterface;
import nu.marginalia.search.command.SearchParameters;
import nu.marginalia.search.svc.SearchUnitConversionService;
import spark.Response;

import java.io.IOException;
import java.util.Map;
import java.util.Optional;

public class ConvertCommand implements SearchCommandInterface {
    private final SearchUnitConversionService searchUnitConversionService;
    private final MustacheRenderer<Map<String, String>> conversionRenderer;

    @Inject
    public ConvertCommand(SearchUnitConversionService searchUnitConversionService, RendererFactory rendererFactory) throws IOException {
        this.searchUnitConversionService = searchUnitConversionService;

        conversionRenderer = rendererFactory.renderer("search/conversion-results");
    }

    @Override
    public Optional<Object> process(Response response, SearchParameters parameters) {
        var conversion = searchUnitConversionService.tryConversion(parameters.query());
        return conversion.map(s -> conversionRenderer.render(Map.of(
                "query", parameters.query(),
                "result", s,
                "profile", parameters.profileStr())
        ));
    }
}
@@ -0,0 +1,70 @@
package nu.marginalia.search.command.commands;

import com.google.inject.Inject;
import nu.marginalia.api.math.MathClient;
import nu.marginalia.api.math.model.DictionaryResponse;
import nu.marginalia.renderer.MustacheRenderer;
import nu.marginalia.search.command.SearchCommandInterface;
import nu.marginalia.search.command.SearchParameters;
import nu.marginalia.renderer.RendererFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import spark.Response;

import java.io.IOException;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class DefinitionCommand implements SearchCommandInterface {
    private final Logger logger = LoggerFactory.getLogger(getClass());

    private final MustacheRenderer<DictionaryResponse> dictionaryRenderer;
    private final MathClient mathClient;

    private final Predicate<String> queryPatternPredicate = Pattern.compile("^define:[A-Za-z\\s-0-9]+$").asPredicate();

    @Inject
    public DefinitionCommand(RendererFactory rendererFactory, MathClient mathClient)
            throws IOException
    {
        dictionaryRenderer = rendererFactory.renderer("search/dictionary-results");
        this.mathClient = mathClient;
    }

    @Override
    public Optional<Object> process(Response response, SearchParameters parameters) {
        if (!queryPatternPredicate.test(parameters.query())) {
            return Optional.empty();
        }

        var results = lookupDefinition(parameters.query());

        return Optional.of(dictionaryRenderer.render(results,
                Map.of("query", parameters.query(),
                        "profile", parameters.profileStr())
        ));
    }

    private DictionaryResponse lookupDefinition(String humanQuery) {
        String definePrefix = "define:";
        String word = humanQuery.substring(definePrefix.length()).toLowerCase();

        try {
            return mathClient
                    .dictionaryLookup(word)
                    .get(250, TimeUnit.MILLISECONDS);
        }
        catch (Exception e) {
            logger.error("Failed to lookup definition for word: " + word, e);

            throw new RuntimeException(e);
        }
    }
}
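The regex gate means only queries of the form define:&lt;word&gt; ever reach the dictionary lookup, and the 250 ms timeout keeps a slow MathClient from stalling page rendering. A few hedged examples of how the predicate behaves (note that in Java's regex dialect the - after \s inside the character class is a literal hyphen):

Predicate<String> p = Pattern.compile("^define:[A-Za-z\\s-0-9]+$").asPredicate();
p.test("define:marginalia");    // true  -> triggers a dictionary lookup
p.test("define:self-evident");  // true  -> hyphens and digits are admitted
p.test("definition of word");   // false -> falls through to the next command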
@@ -0,0 +1,39 @@
package nu.marginalia.search.command.commands;

import com.google.inject.Inject;
import nu.marginalia.renderer.MustacheRenderer;
import nu.marginalia.renderer.RendererFactory;
import nu.marginalia.search.SearchOperator;
import nu.marginalia.search.command.SearchCommandInterface;
import nu.marginalia.search.command.SearchParameters;
import nu.marginalia.search.model.DecoratedSearchResults;
import spark.Response;

import java.io.IOException;
import java.util.Optional;

public class SearchCommand implements SearchCommandInterface {
    private final SearchOperator searchOperator;
    private final MustacheRenderer<DecoratedSearchResults> searchResultsRenderer;

    @Inject
    public SearchCommand(SearchOperator searchOperator,
                         RendererFactory rendererFactory) throws IOException {
        this.searchOperator = searchOperator;

        searchResultsRenderer = rendererFactory.renderer("search/search-results");
    }

    @Override
    public Optional<Object> process(Response response, SearchParameters parameters) {
        try {
            DecoratedSearchResults results = searchOperator.doSearch(parameters);
            return Optional.of(searchResultsRenderer.render(results));
        }
        catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }
}
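Unlike the specialized commands, SearchCommand accepts every query, so chain order matters: it only works as the final fallback. Catching InterruptedException and re-asserting the interrupt flag before returning empty is the idiomatic way to abandon the request without swallowing the thread's cancellation. A sketch of the assumed registration order follows; the list itself is illustrative, not from this diff.

// Specialized commands first; the catch-all search last.
List<SearchCommandInterface> commands = List.of(
        convertCommand,       // unit conversion queries
        definitionCommand,    // define: queries
        siteRedirectCommand,  // site: and links: queries
        searchCommand         // everything else: a full search
);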
@@ -0,0 +1,50 @@
package nu.marginalia.search.command.commands;

import com.google.inject.Inject;
import nu.marginalia.search.command.SearchCommandInterface;
import nu.marginalia.search.command.SearchParameters;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import spark.Response;

import java.util.Optional;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class SiteRedirectCommand implements SearchCommandInterface {

    private final Logger logger = LoggerFactory.getLogger(getClass());

    private final Predicate<String> queryPatternPredicate = Pattern.compile("^(site|links):[.A-Za-z\\-0-9]+$").asPredicate();

    @Inject
    public SiteRedirectCommand() {
    }

    @Override
    public Optional<Object> process(Response response, SearchParameters parameters) {
        if (!queryPatternPredicate.test(parameters.query())) {
            return Optional.empty();
        }

        int idx = parameters.query().indexOf(':');
        String prefix = parameters.query().substring(0, idx);
        String domain = parameters.query().substring(idx + 1).toLowerCase();

        // Use an HTML redirect here, so we can use relative URLs
        String view = switch (prefix) {
            case "links" -> "links";
            default -> "info";
        };

        return Optional.of("""
                <!DOCTYPE html>
                <html lang="en">
                <meta charset="UTF-8">
                <title>Redirecting...</title>
                <meta http-equiv="refresh" content="0; url=/site/%s?view=%s">
                """.formatted(domain, view)
        );
    }

}
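A worked example of the template output: for the query links:example.com the command emits an instant client-side redirect, while any other matching prefix (i.e. site:) defaults to the info view.

// For the query "links:example.com", process() returns this HTML page, which
// immediately redirects to the relative URL /site/example.com?view=links:
String redirectPage = """
        <!DOCTYPE html>
        <html lang="en">
        <meta charset="UTF-8">
        <title>Redirecting...</title>
        <meta http-equiv="refresh" content="0; url=/site/example.com?view=links">
        """;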
@@ -0,0 +1,66 @@
package nu.marginalia.search.db;

import com.google.inject.Inject;
import com.zaxxer.hikari.HikariDataSource;

import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class DbNearDomainsQuery {

    private final HikariDataSource dataSource;

    @Inject
    public DbNearDomainsQuery(HikariDataSource dataSource) {
        this.dataSource = dataSource;
    }

    public List<Integer> getRelatedDomains(String term, Consumer<String> onProblem) {
        List<Integer> ret = new ArrayList<>();
        try (var conn = dataSource.getConnection();

             var selfStmt = conn.prepareStatement("""
                     SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?
                     """);
             var stmt = conn.prepareStatement("""
                     SELECT NEIGHBOR_ID, ND.INDEXED, ND.STATE FROM EC_DOMAIN_NEIGHBORS_2
                     INNER JOIN EC_DOMAIN ND ON ND.ID=NEIGHBOR_ID
                     WHERE DOMAIN_ID=?
                     """)) {
            ResultSet rsp;
            selfStmt.setString(1, term);
            rsp = selfStmt.executeQuery();
            int domainId = -1;
            if (rsp.next()) {
                domainId = rsp.getInt(1);
                ret.add(domainId);
            }

            stmt.setInt(1, domainId);
            rsp = stmt.executeQuery();

            while (rsp.next()) {
                int id = rsp.getInt(1);
                int indexed = rsp.getInt(2);
                String state = rsp.getString(3);

                if (indexed > 0 && ("ACTIVE".equalsIgnoreCase(state) || "SOCIAL_MEDIA".equalsIgnoreCase(state) || "SPECIAL".equalsIgnoreCase(state))) {
                    ret.add(id);
                }
            }

        }
        catch (Exception ex) {
            throw new RuntimeException(ex);
        }

        if (ret.isEmpty()) {
            onProblem.accept("Could not find domains adjacent " + term);
        }

        return ret;
    }

}
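Note the failure path: if the seed domain is absent from EC_DOMAIN, domainId stays -1, the neighbor query matches nothing, and the onProblem callback fires with a message destined for the results page. A minimal call-site sketch follows; the surrounding variables are assumptions, not code from this commit.

// Illustrative call site: gather the seed domain plus its indexed, active
// neighbors, recording a user-visible problem if nothing is found.
List<String> problems = new ArrayList<>();
List<Integer> domainIds =
        dbNearDomainsQuery.getRelatedDomains("www.marginalia.nu", problems::add);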
@@ -0,0 +1,102 @@
package nu.marginalia.search.model;

import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.idx.WordFlags;
import org.jetbrains.annotations.NotNull;

import java.util.*;

/** A class to hold a list of UrlDetails, grouped by domain, where the first one is the main result
 * and the rest are additional results, for summary display. */
public class ClusteredUrlDetails implements Comparable<ClusteredUrlDetails> {

    @NotNull
    public final UrlDetails first;

    @NotNull
    public final List<UrlDetails> rest;

    /** Create a new ClusteredUrlDetails from a collection of UrlDetails,
     * with the best result as "first", and the others, in descending order
     * of quality as the "rest"...
     *
     * @param details A collection of UrlDetails, which must not be empty.
     */
    public ClusteredUrlDetails(Collection<UrlDetails> details) {
        var items = new ArrayList<>(details);

        items.sort(Comparator.naturalOrder());

        if (items.isEmpty())
            throw new IllegalArgumentException("Empty list of details");

        this.first = items.removeFirst();
        this.rest = items;

        double bestScore = first.termScore;
        double scoreLimit = Math.min(4.0, bestScore * 1.25);

        this.rest.removeIf(urlDetail -> {
            if (urlDetail.termScore > scoreLimit)
                return false;

            for (var keywordScore : urlDetail.resultItem.keywordScores) {
                if (keywordScore.isKeywordSpecial())
                    continue;
                if (keywordScore.hasTermFlag(WordFlags.Title))
                    return false;
                if (keywordScore.hasTermFlag(WordFlags.ExternalLink))
                    return false;
                if (keywordScore.hasTermFlag(WordFlags.UrlDomain))
                    return false;
                if (keywordScore.hasTermFlag(WordFlags.UrlPath))
                    return false;
                if (keywordScore.hasTermFlag(WordFlags.Subjects))
                    return false;
            }

            return true;
        });

    }

    public ClusteredUrlDetails(@NotNull UrlDetails onlyFirst) {
        this.first = onlyFirst;
        this.rest = Collections.emptyList();
    }

    // For renderer use, do not remove
    public @NotNull UrlDetails getFirst() {
        return first;
    }

    // For renderer use, do not remove
    public @NotNull List<UrlDetails> getRest() {
        return rest;
    }

    public EdgeDomain getDomain() {
        return first.url.getDomain();
    }

    public boolean hasMultiple() {
        return !rest.isEmpty();
    }

    /** Returns the total number of results from the same domain,
     * including such results that are not included here. */
    public int totalCount() {
        return first.resultsFromSameDomain;
    }

    public int remainingCount() {
        return totalCount() - 1 - rest.size();
    }

    @Override
    public int compareTo(@NotNull ClusteredUrlDetails o) {
        return Objects.compare(first, o.first, UrlDetails::compareTo);
    }
}
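The pruning rule in the first constructor keeps a sibling result only if its score clears the limit or one of its keywords hit a high-signal field (title, external link, URL domain or path, subjects); everything else is treated as near-duplicate clutter. Below is a hedged sketch of the grouping step that would produce these clusters; the actual grouping code lives elsewhere in the search service.

// Group flat UrlDetails by domain, cluster each group, and order clusters
// by their best result. Illustrative only; requires java.util.stream.Collectors.
List<ClusteredUrlDetails> clusters = urlDetails.stream()
        .collect(Collectors.groupingBy(detail -> detail.url.getDomain()))
        .values().stream()
        .map(ClusteredUrlDetails::new)
        .sorted()
        .toList();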
@@ -0,0 +1,186 @@
package nu.marginalia.search.model;

import nu.marginalia.search.command.SearchParameters;

import java.util.List;

/**
 * A class to hold details about the search results,
 * as used by the handlebars templating engine to render
 * the search results page.
 */
public class DecoratedSearchResults {
    private final SearchParameters params;
    private final List<String> problems;
    private final String evalResult;

    public DecoratedSearchResults(SearchParameters params,
                                  List<String> problems,
                                  String evalResult,
                                  List<ClusteredUrlDetails> results,
                                  String focusDomain,
                                  int focusDomainId,
                                  SearchFilters filters,
                                  List<Page> resultPages) {
        this.params = params;
        this.problems = problems;
        this.evalResult = evalResult;
        this.results = results;
        this.focusDomain = focusDomain;
        this.focusDomainId = focusDomainId;
        this.filters = filters;
        this.resultPages = resultPages;
    }

    public final List<ClusteredUrlDetails> results;

    public static DecoratedSearchResultsBuilder builder() {
        return new DecoratedSearchResultsBuilder();
    }

    public SearchParameters getParams() {
        return params;
    }

    public List<String> getProblems() {
        return problems;
    }

    public String getEvalResult() {
        return evalResult;
    }

    public List<ClusteredUrlDetails> getResults() {
        return results;
    }

    public String getFocusDomain() {
        return focusDomain;
    }

    public int getFocusDomainId() {
        return focusDomainId;
    }

    public SearchFilters getFilters() {
        return filters;
    }

    public List<Page> getResultPages() {
        return resultPages;
    }

    private final String focusDomain;
    private final int focusDomainId;
    private final SearchFilters filters;

    private final List<Page> resultPages;

    public boolean isMultipage() {
        return resultPages.size() > 1;
    }

    public record Page(int number, boolean current, String href) {
    }

    // These are used by the search form, they look unused in the IDE but are used by the mustache template,
    // DO NOT REMOVE THEM
    public int getResultCount() {
        return results.size();
    }

    public String getQuery() {
        return params.query();
    }

    public String getProfile() {
        return params.profile().filterId;
    }

    public String getJs() {
        return params.js().value;
    }

    public String getAdtech() {
        return params.adtech().value;
    }

    public String getRecent() {
        return params.recent().value;
    }

    public String getSearchTitle() {
        return params.searchTitle().value;
    }

    public int page() {
        return params.page();
    }

    public Boolean isNewFilter() {
        return params.newFilter();
    }


    public static class DecoratedSearchResultsBuilder {
        private SearchParameters params;
        private List<String> problems;
        private String evalResult;
        private List<ClusteredUrlDetails> results;
        private String focusDomain;
        private int focusDomainId;
        private SearchFilters filters;
        private List<Page> resultPages;

        DecoratedSearchResultsBuilder() {
        }

        public DecoratedSearchResultsBuilder params(SearchParameters params) {
            this.params = params;
            return this;
        }

        public DecoratedSearchResultsBuilder problems(List<String> problems) {
            this.problems = problems;
            return this;
        }

        public DecoratedSearchResultsBuilder evalResult(String evalResult) {
            this.evalResult = evalResult;
            return this;
        }

        public DecoratedSearchResultsBuilder results(List<ClusteredUrlDetails> results) {
            this.results = results;
            return this;
        }

        public DecoratedSearchResultsBuilder focusDomain(String focusDomain) {
            this.focusDomain = focusDomain;
            return this;
        }

        public DecoratedSearchResultsBuilder focusDomainId(int focusDomainId) {
            this.focusDomainId = focusDomainId;
            return this;
        }

        public DecoratedSearchResultsBuilder filters(SearchFilters filters) {
            this.filters = filters;
            return this;
        }

        public DecoratedSearchResultsBuilder resultPages(List<Page> resultPages) {
            this.resultPages = resultPages;
            return this;
        }

        public DecoratedSearchResults build() {
            return new DecoratedSearchResults(this.params, this.problems, this.evalResult, this.results, this.focusDomain, this.focusDomainId, this.filters, this.resultPages);
        }

        public String toString() {
            return "DecoratedSearchResults.DecoratedSearchResultsBuilder(params=" + this.params + ", problems=" + this.problems + ", evalResult=" + this.evalResult + ", results=" + this.results + ", focusDomain=" + this.focusDomain + ", focusDomainId=" + this.focusDomainId + ", filters=" + this.filters + ", resultPages=" + this.resultPages + ")";
        }
    }
}
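The hand-rolled builder mirrors what a generated builder would provide. A minimal usage sketch follows; the local variables are assumptions, not code from this diff.

// Illustrative construction from a search operator; placeholder inputs.
DecoratedSearchResults results = DecoratedSearchResults.builder()
        .params(parameters)
        .problems(problems)
        .evalResult("")            // no calculator result for this query
        .results(clusters)
        .focusDomain(null)
        .focusDomainId(-1)
        .filters(filters)
        .resultPages(pages)
        .build();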
Some files were not shown because too many files have changed in this diff.