mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

Compare commits


64 Commits

Author SHA1 Message Date
Viktor
3b2ac414dc Merge pull request #210 from MarginaliaSearch/ads-fingerprinting
Implement advertisement and popover identification based on DOM sample data
2025-07-21 12:25:31 +02:00
Viktor Lofgren
0ba6515a01 (converter) Ensure converter works well even when dom sample data is unavailable 2025-07-21 12:11:17 +02:00
Viktor Lofgren
16c6b0f151 (search) Add link to new discord community 2025-07-20 20:54:42 +02:00
Viktor Lofgren
e998692900 (converter) Ensure converter works well even when dom sample data is unavailable 2025-07-20 19:24:40 +02:00
Viktor Lofgren
eeb1695a87 (search) Clean up dead code 2025-07-20 19:15:01 +02:00
Viktor Lofgren
a0ab910940 (search) Clean up code 2025-07-20 19:14:13 +02:00
Viktor Lofgren
b9f31048d7 (search) Clean up overlong class names 2025-07-20 19:13:04 +02:00
Viktor Lofgren
12c304289a (grpc) Use grpc-netty instead of grpc-netty-shaded
This will help reduce runaway thread pool sizes
2025-07-20 17:36:25 +02:00
Viktor Lofgren
6ee01dabea (search) Drastically reduce worker thread count in search-service 2025-07-20 17:16:58 +02:00
Viktor Lofgren
1b80e282a7 (search) Drastically reduce worker thread count in search-service 2025-07-20 16:58:33 +02:00
Viktor Lofgren
a65d18f1d1 (client) Use virtual threads in a few more clients 2025-07-20 14:10:02 +02:00
Viktor Lofgren
90a1ff220b (ui) Clean up UI 2025-07-19 18:41:36 +02:00
Viktor Lofgren
d6c7092335 (classifier) More rules 2025-07-19 18:41:36 +02:00
Viktor Lofgren
b716333856 (classifier) Match regexes against the path + query only, as well as the full URL 2025-07-19 18:41:36 +02:00
Viktor Lofgren
b504b8482c (classifier) Add new tracker 2025-07-19 18:41:36 +02:00
Viktor Lofgren
80da1e9ad1 (ui) UI cleanup 2025-07-19 18:41:36 +02:00
Viktor Lofgren
d3f744a441 (ui) Add traffic report to overview menu 2025-07-19 18:41:36 +02:00
Viktor Lofgren
60fb539875 (ui) Add explanatory blurb 2025-07-19 18:41:35 +02:00
Viktor Lofgren
7f5094fedf (ui) Clean up UI 2025-07-19 18:41:35 +02:00
Viktor Lofgren
45066636a5 (classifier) Add classification for domains that make 3rd party requests 2025-07-19 18:41:35 +02:00
Viktor Lofgren
e2d6898c51 (search) Change tag colors to more pleasant ones 2025-07-19 18:41:35 +02:00
Viktor Lofgren
58ef767b94 (search) Improve traffic report UI 2025-07-19 18:41:35 +02:00
Viktor Lofgren
f9f268c67a (grpc) Improve error handling 2025-07-19 18:41:35 +02:00
Viktor Lofgren
f44c2bdee9 (chore) Cleanup 2025-07-19 18:41:35 +02:00
Viktor Lofgren
6fdf477c18 (refac) Move DomSampleClassification to top level 2025-07-19 18:41:35 +02:00
Viktor Lofgren
6b6e455e3f (classifier) Clean up xml 2025-07-19 18:41:35 +02:00
Viktor Lofgren
a3a126540c (classifier) Add README.md 2025-07-19 18:41:35 +02:00
Viktor Lofgren
842b19da40 (search) Mobile layout + phrasing 2025-07-19 18:41:35 +02:00
Viktor Lofgren
2a30e93bf0 (classifier) 2025-07-19 18:41:34 +02:00
Viktor Lofgren
3d998f12c0 (search) Use display name where possible 2025-07-19 18:41:34 +02:00
Viktor Lofgren
cbccc2ac23 (classification) Add /ccm/collect as an ads-related request 2025-07-19 18:41:34 +02:00
Viktor Lofgren
2cfc23f9b7 (search) Fix layout for mobile 2025-07-18 19:06:23 +02:00
Viktor Lofgren
88fe394cdb (request-classifier) Add rule for /pagead/ 2025-07-18 19:01:33 +02:00
Viktor Lofgren
f30fcebd4f Remove dead code 2025-07-18 18:56:42 +02:00
Viktor Lofgren
5d885927b4 (search) Fix layout and presentation 2025-07-18 17:54:47 +02:00
Viktor Lofgren
7622c8358e (request-classifier) Adjust flagging of a few hosts 2025-07-18 17:54:46 +02:00
Viktor Lofgren
69ed9aef47 (ddgt) Load global tracker data 2025-07-18 17:02:50 +02:00
Viktor Lofgren
4c78c223da (search) Fix endpoint collection 2025-07-18 16:59:05 +02:00
Viktor Lofgren
71b9935dd6 (search) Add warmup to programmatic tailwind classes, fix word break 2025-07-18 16:49:31 +02:00
Viktor Lofgren
ad38f2fd83 (search) Hide classification tag on unclassified requests 2025-07-18 15:45:40 +02:00
Viktor Lofgren
9c47388846 (search) Improve display ordering 2025-07-18 15:44:55 +02:00
Viktor Lofgren
d9ab10e33f (search) Fix tracker data for the correct domain 2025-07-18 15:29:15 +02:00
Viktor Lofgren
e13ea7f42b (search) Sort results by classifications 2025-07-18 14:51:35 +02:00
Viktor Lofgren
f38daeb036 (WIP) First stab at a GUI for viewing network traffic
The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter.

The change also adds a parser for DDG's tracker radar data.
2025-07-18 13:58:57 +02:00
Viktor Lofgren
6e214293e5 (ping) Fix backoff value overflow 2025-07-16 19:50:12 +02:00
Viktor Lofgren
52582a6d7d (experiment) Also add clients to loom experiment 2025-07-16 18:08:00 +02:00
Viktor Lofgren
ec0e39ad32 (experiment) Also add clients to loom experiment 2025-07-16 17:28:57 +02:00
Viktor Lofgren
6a15aee4b0 (ping) Fix arithmetic errors in backoff strategy due to long overflow 2025-07-16 17:23:36 +02:00
Viktor Lofgren
bd5111e8a2 (experimental) Add flag for using loom/virtual threads in gRPC executor 2025-07-16 17:12:07 +02:00
Viktor Lofgren
1ecbeb0272 (doc) Update ROADMAP.md 2025-07-14 13:38:34 +02:00
Viktor Lofgren
b91354925d (converter) Index documents even when they are short
... but assign short documents a special flag and penalize them in index lookups
2025-07-14 12:24:25 +02:00
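
The flag-and-penalize approach in this commit can be sketched in isolation: keep short documents at conversion time, set a feature bit (the repository's flag is the SHORT_DOCUMENT feature, per the HtmlFeature diff below in this comparison), and subtract a ranking penalty at lookup time. The class name, bit position, and penalty value here are invented for illustration; they are not Marginalia's actual API.

```java
// Hypothetical sketch: short documents are indexed but carry a flag bit,
// and the index worsens their score rather than excluding them outright.
class ShortDocPenalty {
    static final int SHORT_DOCUMENT_BIT = 1 << 5; // illustrative bit position
    static final double PENALTY = 2.0;            // illustrative penalty value

    // Lower score = better in this sketch; flagged documents rank worse.
    static double adjustScore(double baseScore, int featureMask) {
        if ((featureMask & SHORT_DOCUMENT_BIT) != 0)
            return baseScore + PENALTY;
        return baseScore;
    }
}
```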
Viktor Lofgren
3f85c9c154 (refac) Clean up code 2025-07-14 11:55:21 +02:00
Viktor Lofgren
89e03d6914 (chore) Idiomatic error handling in gRPC clients
responseObserver.onError(...) should be passed Status.WHATEVER.foo().asRuntimeException() and not random throwables as was done before.
2025-07-13 02:59:22 +02:00
Viktor Lofgren
14e0bc9f26 (index) Add comment about encoding caveat 2025-07-13 02:47:00 +02:00
Viktor Lofgren
7065b46c6f (index) Add penalties for new feature flags from dom sample 2025-07-13 02:37:30 +02:00
Viktor Lofgren
0372190c90 (index, refac) Move domain ranking to a better named package 2025-07-13 02:37:29 +02:00
Viktor Lofgren
ceaf32fb90 (converter) Integrate dom sample features into the converter 2025-07-13 01:38:28 +02:00
Viktor Lofgren
b57db01415 (converter) Clean out some old and redundant advertisement and tracking detection code 2025-07-11 19:32:25 +02:00
Viktor Lofgren
ce7d522608 (converter) First basic hook-in of the new dom sample classifier into the converter workflow 2025-07-11 16:57:37 +02:00
Viktor Lofgren
18649b6ee9 (converter) Move DomSampleClassifier to converter's code tree 2025-07-11 16:12:48 +02:00
Viktor Lofgren
f6417aef1a (converter) Additional code cleanup 2025-07-11 15:58:48 +02:00
Viktor Lofgren
2aa7e376b0 (converter) Clean up code around document deduplication 2025-07-11 15:54:28 +02:00
Viktor Lofgren
f33bc44860 (dom-sample) Create API for fetching DOM sample data across services 2025-07-11 15:41:10 +02:00
Viktor Lofgren
a2826efd44 (dom-sample) First stab at classifying outgoing requests from DOM sample data 2025-07-11 15:41:10 +02:00
81 changed files with 1742 additions and 280 deletions

ROADMAP.md

@@ -48,10 +48,6 @@ filter for any API consumer.
 I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
-## Show favicons next to search results
-This is expected from search engines. Basic proof of concept sketch of fetching this data has been done, but the feature is some way from being reality.
 ## Specialized crawler for github
 One of the search engine's biggest limitations right now is that it does not index github at all. A specialized crawler that fetches at least the readme.md would go a long way toward providing search capabilities in this domain.
@@ -66,6 +62,10 @@ The documents database probably should have some sort of flag indicating it's a
 PDF parsing is known to be a bit of a security liability so some thought needs to be put in
 that direction as well.
+## Show favicons next to search results (COMPLETED 2025-03)
+This is expected from search engines. Basic proof of concept sketch of fetching this data has been done, but the feature is some way from being reality.
 ## Web Design Overhaul (COMPLETED 2025-01)
 The design is kinda clunky and hard to maintain, and needlessly outdated-looking.

HtmlFeature.java

@@ -5,13 +5,15 @@ import java.util.Collection;
 public enum HtmlFeature {
     // Note, the first 32 of these features are bit encoded in the database
     // so be sure to keep anything that's potentially important toward the top
-    // of the list
+    // of the list; but adding new values will shift the encoded values and break
+    // binary compatibility! Scroll down for a marker where you should add new values
+    // if they need to be accessible from IndexResultScoreCalculator!
     MEDIA( "special:media"),
     JS("special:scripts"),
     AFFILIATE_LINK( "special:affiliate"),
     TRACKING("special:tracking"),
-    TRACKING_ADTECH("special:ads"), // We'll call this ads for now
+    TRACKING_ADTECH("special:adtech"),
     KEBAB_CASE_URL("special:kcurl"), // https://www.example.com/urls-that-look-like-this/
     LONG_URL("special:longurl"),
@@ -30,6 +32,15 @@ public enum HtmlFeature {
     PDF("format:pdf"),
+    POPOVER("special:popover"),
+    CONSENT("special:consent"),
+    SHORT_DOCUMENT("special:shorty"),
+    THIRD_PARTY_REQUESTS("special:3pr"),
+
+    // Here! It is generally safe to add additional values here without
+    // disrupting the encoded values used by the DocumentValuator
+    // class in the index!
     /** For fingerprinting and ranking */
     OPENGRAPH("special:opengraph"),
     OPENGRAPH_IMAGE("special:opengraph:image"),
@@ -67,6 +78,7 @@ public enum HtmlFeature {
     S3_FEATURE("special:s3"),
+    MISSING_DOM_SAMPLE("special:nosample"),
     UNKNOWN("special:uncategorized");
@@ -93,6 +105,8 @@
     }
     public int getFeatureBit() {
+        if (getClass().desiredAssertionStatus() && ordinal() >= 32)
+            throw new IllegalStateException("Attempting to extract feature bit of " + name() + ", with ordinal " + ordinal());
         return (1<< ordinal());
     }
 }
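
The ordering constraint that the comments in this diff warn about exists because each of the first 32 enum constants owns bit `1 << ordinal()` in a database-stored int mask. A stand-alone sketch of the same encoding scheme (the mini-enum here is illustrative, not the real HtmlFeature):

```java
// Ordinal-based bit encoding: each early enum constant owns one bit of an
// int mask, so reordering or inserting constants near the top silently
// changes every stored value — hence the "add new values at the marker" rule.
enum Feature {
    MEDIA, JS, TRACKING; // illustrative subset

    int featureBit() {
        if (ordinal() >= 32)
            throw new IllegalStateException("no bit available for " + name());
        return 1 << ordinal();
    }

    static int encode(java.util.Collection<Feature> features) {
        int mask = 0;
        for (Feature f : features) mask |= f.featureBit();
        return mask;
    }

    static boolean isSet(int mask, Feature f) {
        return (mask & f.featureBit()) != 0;
    }
}

class FeatureBitsDemo {
    public static void main(String[] args) {
        // MEDIA has ordinal 0 (bit 1), TRACKING has ordinal 2 (bit 4)
        int mask = Feature.encode(java.util.List.of(Feature.MEDIA, Feature.TRACKING));
        System.out.println(mask);                            // 5
        System.out.println(Feature.isSet(mask, Feature.JS)); // false
    }
}
```

This is why values past the in-file marker (POPOVER, CONSENT, etc.) are safe to append: they sit beyond the bit-encoded range, or at stable positions after it.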

GrpcChannelPoolFactory.java

@@ -13,6 +13,7 @@ import nu.marginalia.service.discovery.property.ServicePartition;
 import nu.marginalia.util.NamedExecutorFactory;
 import java.util.concurrent.Executor;
+import java.util.concurrent.Executors;
 import java.util.function.Function;
 @Singleton
@@ -20,10 +21,15 @@ public class GrpcChannelPoolFactory {
     private final NodeConfigurationWatcher nodeConfigurationWatcher;
     private final ServiceRegistryIf serviceRegistryIf;
-    private static final Executor executor = NamedExecutorFactory.createFixed("gRPC-Channel-Pool",
-            Math.clamp(Runtime.getRuntime().availableProcessors() / 2, 2, 32));
-    private static final Executor offloadExecutor = NamedExecutorFactory.createFixed("gRPC-Offload-Pool",
-            Math.clamp(Runtime.getRuntime().availableProcessors() / 2, 2, 32));
+    private static final boolean useLoom = Boolean.getBoolean("system.experimentalUseLoom");
+
+    private static final Executor executor = useLoom
+            ? Executors.newVirtualThreadPerTaskExecutor()
+            : NamedExecutorFactory.createFixed("gRPC-Channel-Pool", Math.clamp(Runtime.getRuntime().availableProcessors() / 2, 2, 32));
+    private static final Executor offloadExecutor = useLoom
+            ? Executors.newVirtualThreadPerTaskExecutor()
+            : NamedExecutorFactory.createFixed("gRPC-Offload-Pool", Math.clamp(Runtime.getRuntime().availableProcessors() / 2, 2, 32));
 @Inject
 public GrpcChannelPoolFactory(NodeConfigurationWatcher nodeConfigurationWatcher,
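
The useLoom toggle in this diff can be reproduced with JDK classes alone; NamedExecutorFactory is Marginalia's own helper, so the sketch below substitutes a plain fixed thread pool (assumption: Java 21+, which provides both virtual threads and `Math.clamp`):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;

// A system property selects between a virtual-thread-per-task executor
// (Loom) and a bounded platform-thread pool, mirroring the diff above.
class ExecutorToggle {
    private static final boolean useLoom = Boolean.getBoolean("system.experimentalUseLoom");

    static Executor channelPoolExecutor() {
        // Same sizing rule as the non-Loom branch in the diff
        int nThreads = Math.clamp(Runtime.getRuntime().availableProcessors() / 2, 2, 32);
        return useLoom
                ? Executors.newVirtualThreadPerTaskExecutor()
                : Executors.newFixedThreadPool(nThreads);
    }

    public static void main(String[] args) throws Exception {
        // Run with -Dsystem.experimentalUseLoom=true to take the Loom branch
        var done = new java.util.concurrent.CompletableFuture<String>();
        channelPoolExecutor().execute(() ->
                done.complete(Thread.currentThread().isVirtual() ? "virtual" : "platform"));
        System.out.println(done.get());
    }
}
```

Virtual threads make a per-task executor cheap, which is why the same flag lets later commits in this comparison shrink or remove fixed worker pools.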

GrpcSingleNodeChannelPool.java

@@ -2,6 +2,7 @@ package nu.marginalia.service.client;
 import com.google.common.collect.Sets;
 import io.grpc.ManagedChannel;
+import io.grpc.StatusRuntimeException;
 import nu.marginalia.service.discovery.ServiceRegistryIf;
 import nu.marginalia.service.discovery.monitor.ServiceChangeMonitor;
 import nu.marginalia.service.discovery.property.PartitionTraits;
@@ -206,6 +207,11 @@ public class GrpcSingleNodeChannelPool<STUB> extends ServiceChangeMonitor {
     }
     for (var e : exceptions) {
+        if (e instanceof StatusRuntimeException se) {
+            throw se; // Re-throw SRE as-is
+        }
+
+        // If there are other exceptions, log them
         logger.error(grpcMarker, "Failed to call service {}", serviceKey, e);
     }

GrpcServer.java

@@ -1,9 +1,9 @@
 package nu.marginalia.service.server;
 import io.grpc.Server;
-import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;
-import io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoopGroup;
-import io.grpc.netty.shaded.io.netty.channel.socket.nio.NioServerSocketChannel;
+import io.grpc.netty.NettyServerBuilder;
+import io.netty.channel.nio.NioEventLoopGroup;
+import io.netty.channel.socket.nio.NioServerSocketChannel;
 import nu.marginalia.service.discovery.ServiceRegistryIf;
 import nu.marginalia.service.discovery.property.ServiceKey;
 import nu.marginalia.service.discovery.property.ServicePartition;
@@ -13,9 +13,14 @@ import nu.marginalia.util.NamedExecutorFactory;
 import java.io.IOException;
 import java.net.InetSocketAddress;
 import java.util.List;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
 public class GrpcServer {
     private final Server server;
+    private static final boolean useLoom = Boolean.getBoolean("system.experimentalUseLoom");
     public GrpcServer(ServiceConfiguration config,
                       ServiceRegistryIf serviceRegistry,
                       ServicePartition partition,
@@ -26,13 +31,19 @@ public class GrpcServer {
         int nThreads = Math.clamp(Runtime.getRuntime().availableProcessors() / 2, 2, 16);
         // Start the gRPC server
+        ExecutorService workExecutor = useLoom ?
+                Executors.newVirtualThreadPerTaskExecutor() :
+                NamedExecutorFactory.createFixed("nettyExecutor", nThreads);
         var grpcServerBuilder = NettyServerBuilder.forAddress(new InetSocketAddress(config.bindAddress(), port))
-                .executor(NamedExecutorFactory.createFixed("nettyExecutor", nThreads))
+                .executor(workExecutor)
                 .workerEventLoopGroup(new NioEventLoopGroup(nThreads, NamedExecutorFactory.createFixed("Worker-ELG", nThreads)))
                 .bossEventLoopGroup(new NioEventLoopGroup(nThreads, NamedExecutorFactory.createFixed("Boss-ELG", nThreads)))
                 .channelType(NioServerSocketChannel.class);
         for (var grpcService : grpcServices) {
             if (!grpcService.shouldRegisterService()) {
                 continue;
             }

JoobyService.java

@@ -125,8 +125,7 @@ public class JoobyService {
     // Set a cap on the number of worker threads, as Jooby's default value does not seem to consider
     // multi-tenant servers with high thread counts, and spins up an exorbitant number of threads in that
     // scenario
-    options.setWorkerThreads(Math.min(128, options.getWorkerThreads()));
+    options.setWorkerThreads(Math.min(16, options.getWorkerThreads()));
     jooby.setServerOptions(options);

ExecutorCrawlGrpcService.java

@@ -1,6 +1,7 @@
 package nu.marginalia.execution;
 import com.google.inject.Inject;
+import io.grpc.Status;
 import io.grpc.stub.StreamObserver;
 import nu.marginalia.actor.ExecutorActor;
 import nu.marginalia.actor.ExecutorActorControlService;
@@ -36,7 +37,7 @@ public class ExecutorCrawlGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -52,7 +53,7 @@ public class ExecutorCrawlGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -66,7 +67,7 @@ public class ExecutorCrawlGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -80,7 +81,7 @@ public class ExecutorCrawlGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -98,7 +99,7 @@ public class ExecutorCrawlGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }

ExecutorExportGrpcService.java

@@ -2,6 +2,7 @@ package nu.marginalia.execution;
 import com.google.inject.Inject;
 import com.google.inject.Singleton;
+import io.grpc.Status;
 import io.grpc.stub.StreamObserver;
 import nu.marginalia.actor.ExecutorActor;
 import nu.marginalia.actor.ExecutorActorControlService;
@@ -38,7 +39,7 @@ public class ExecutorExportGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -57,7 +58,7 @@ public class ExecutorExportGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -73,7 +74,7 @@ public class ExecutorExportGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -87,7 +88,7 @@ public class ExecutorExportGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -99,7 +100,7 @@ public class ExecutorExportGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -114,14 +115,14 @@ public class ExecutorExportGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
 @Override
 public void exportAllAtags(Empty request, StreamObserver<Empty> responseObserver) {
     if (serviceConfiguration.node() != 1) {
-        responseObserver.onError(new IllegalArgumentException("Export all atags is only available on node 1"));
+        responseObserver.onError(Status.UNAVAILABLE.withDescription("Export all atags is only available on node 1").asRuntimeException());
     }
     try {
         actorControlService.startFrom(ExecutorActor.PREC_EXPORT_ALL,
@@ -131,7 +132,7 @@ public class ExecutorExportGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -145,7 +146,7 @@ public class ExecutorExportGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -159,7 +160,7 @@ public class ExecutorExportGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
 }

ExecutorGrpcService.java

@@ -1,6 +1,7 @@
 package nu.marginalia.execution;
 import com.google.inject.Inject;
+import io.grpc.Status;
 import io.grpc.stub.StreamObserver;
 import nu.marginalia.WmsaHome;
 import nu.marginalia.actor.ActorApi;
@@ -58,7 +59,7 @@ public class ExecutorGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -70,7 +71,7 @@ public class ExecutorGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -82,7 +83,7 @@ public class ExecutorGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -96,7 +97,7 @@ public class ExecutorGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -112,7 +113,7 @@ public class ExecutorGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -128,7 +129,7 @@ public class ExecutorGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -203,7 +204,7 @@ public class ExecutorGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -229,7 +230,7 @@ public class ExecutorGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -276,7 +277,7 @@ public class ExecutorGrpcService
     }
     catch (Exception e) {
         logger.error("Failed to update nsfw filters", e);
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
 }

ExecutorSideloadGrpcService.java

@@ -1,6 +1,7 @@
 package nu.marginalia.execution;
 import com.google.inject.Inject;
+import io.grpc.Status;
 import io.grpc.stub.StreamObserver;
 import nu.marginalia.actor.ExecutorActor;
 import nu.marginalia.actor.ExecutorActorControlService;
@@ -33,7 +34,7 @@ public class ExecutorSideloadGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -48,7 +49,7 @@ public class ExecutorSideloadGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -63,7 +64,7 @@ public class ExecutorSideloadGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -78,7 +79,7 @@ public class ExecutorSideloadGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }
@@ -93,7 +94,7 @@ public class ExecutorSideloadGrpcService
         responseObserver.onCompleted();
     }
     catch (Exception e) {
-        responseObserver.onError(e);
+        responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
     }
 }

DomainInfoClient.java

@@ -2,6 +2,8 @@ package nu.marginalia.api.domains;
 import com.google.inject.Inject;
 import com.google.inject.Singleton;
+import nu.marginalia.api.domains.model.DomainInformation;
+import nu.marginalia.api.domains.model.SimilarDomain;
 import nu.marginalia.service.client.GrpcChannelPoolFactory;
 import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
 import nu.marginalia.service.discovery.property.ServiceKey;
@@ -10,16 +12,19 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import java.util.List;
-import java.util.concurrent.*;
-import nu.marginalia.api.domains.model.*;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
 @Singleton
 public class DomainInfoClient {
     private static final Logger logger = LoggerFactory.getLogger(DomainInfoClient.class);
     private final GrpcSingleNodeChannelPool<DomainInfoAPIGrpc.DomainInfoAPIBlockingStub> channelPool;
-    private final ExecutorService executor = Executors.newWorkStealingPool(8);
+    private static final boolean useLoom = Boolean.getBoolean("system.experimentalUseLoom");
+    private static final ExecutorService executor = useLoom ? Executors.newVirtualThreadPerTaskExecutor() : Executors.newWorkStealingPool(8);
 @Inject
 public DomainInfoClient(GrpcChannelPoolFactory factory) {

DomSampleClient.java

@@ -0,0 +1,99 @@
package nu.marginalia.api.domsample;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import nu.marginalia.service.client.GrpcChannelPoolFactory;
import nu.marginalia.service.client.GrpcSingleNodeChannelPool;
import nu.marginalia.service.discovery.property.ServiceKey;
import nu.marginalia.service.discovery.property.ServicePartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
@Singleton
public class DomSampleClient {
private final GrpcSingleNodeChannelPool<DomSampleApiGrpc.DomSampleApiBlockingStub> channelPool;
private static final Logger logger = LoggerFactory.getLogger(DomSampleClient.class);
@Inject
public DomSampleClient(GrpcChannelPoolFactory factory) {
// The client is only interested in the primary node
var key = ServiceKey.forGrpcApi(DomSampleApiGrpc.class, ServicePartition.any());
this.channelPool = factory.createSingle(key, DomSampleApiGrpc::newBlockingStub);
}
public Optional<RpcDomainSample> getSample(String domainName) {
try {
var val = channelPool.call(DomSampleApiGrpc.DomSampleApiBlockingStub::getSample)
.run(RpcDomainName.newBuilder().setDomainName(domainName).build());
return Optional.of(val);
}
catch (StatusRuntimeException sre) {
if (sre.getStatus().getCode() != Status.Code.NOT_FOUND) {
logger.error("Failed to fetch DOM sample");
}
return Optional.empty();
}
}
public boolean hasSample(String domainName) {
try {
return channelPool.call(DomSampleApiGrpc.DomSampleApiBlockingStub::hasSample)
.run(RpcDomainName.newBuilder().setDomainName(domainName).build())
.getAnswer();
}
catch (StatusRuntimeException sre) {
return false;
}
}
public CompletableFuture<Boolean> hasSample(String domainName, ExecutorService executor) {
try {
return channelPool.call(DomSampleApiGrpc.DomSampleApiBlockingStub::hasSample)
.async(executor)
.run(RpcDomainName.newBuilder().setDomainName(domainName).build())
.thenApply(RpcBooleanRsp::getAnswer);
}
catch (StatusRuntimeException sre) {
return CompletableFuture.completedFuture(false);
}
}
public CompletableFuture<RpcDomainSample> getSampleAsync(String domainName, ExecutorService executorService) {
return channelPool.call(DomSampleApiGrpc.DomSampleApiBlockingStub::getSample)
.async(executorService)
.run(RpcDomainName.newBuilder().setDomainName(domainName).build());
}
public List<RpcDomainSample> getAllSamples(String domainName) {
try {
Iterator<RpcDomainSample> val = channelPool.call(DomSampleApiGrpc.DomSampleApiBlockingStub::getAllSamples)
.run(RpcDomainName.newBuilder().setDomainName(domainName).build());
List<RpcDomainSample> ret = new ArrayList<>();
val.forEachRemaining(ret::add);
return ret;
}
catch (StatusRuntimeException sre) {
logger.error("Failed to fetch DOM sample");
return List.of();
}
}
public boolean waitReady(Duration duration) throws InterruptedException {
return channelPool.awaitChannel(duration);
}
}

View File

@@ -24,7 +24,9 @@ import java.util.function.BiConsumer;
 @Singleton
 public class FeedsClient {
-private final ExecutorService executorService = Executors.newCachedThreadPool();
+private static final boolean useLoom = Boolean.getBoolean("system.experimentalUseLoom");
+private static final ExecutorService executorService = useLoom ? Executors.newVirtualThreadPerTaskExecutor() : Executors.newCachedThreadPool();
 private final GrpcSingleNodeChannelPool<FeedApiGrpc.FeedApiBlockingStub> channelPool;
 private final MqOutbox updateFeedsOutbox;

View File

@@ -0,0 +1,40 @@
syntax="proto3";
package nu.marginalia.api.domsample;
option java_package="nu.marginalia.api.domsample";
option java_multiple_files=true;
service DomSampleApi {
rpc getSample(RpcDomainName) returns (RpcDomainSample) {}
rpc hasSample(RpcDomainName) returns (RpcBooleanRsp) {}
rpc getAllSamples(RpcDomainName) returns (stream RpcDomainSample) {}
}
message RpcDomainName {
string domainName = 1;
}
message RpcBooleanRsp {
bool answer = 1;
}
message RpcDomainSample {
string domainName = 1;
string url = 2;
string htmlSample = 3;
bool accepted_popover = 4;
repeated RpcOutgoingRequest outgoingRequests = 5;
}
message RpcOutgoingRequest {
RequestMethod method = 1;
int64 timestamp = 2;
string url = 3;
enum RequestMethod {
GET = 0;
POST = 1;
OTHER = 2;
};
}

View File

@@ -0,0 +1,123 @@
package nu.marginalia.domsample;
import com.google.inject.Inject;
import io.grpc.Status;
import io.grpc.stub.StreamObserver;
import nu.marginalia.api.domsample.*;
import nu.marginalia.domsample.db.DomSampleDb;
import nu.marginalia.service.server.DiscoverableService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.List;
public class DomSampleGrpcService
extends DomSampleApiGrpc.DomSampleApiImplBase
implements DiscoverableService
{
private static final Logger logger = LoggerFactory.getLogger(DomSampleGrpcService.class);
private final DomSampleDb domSampleDb;
@Inject
public DomSampleGrpcService(DomSampleDb domSampleDb) {
this.domSampleDb = domSampleDb;
}
@Override
public void getSample(RpcDomainName request, StreamObserver<RpcDomainSample> responseObserver) {
String domainName = request.getDomainName();
if (domainName.isBlank()) {
responseObserver.onError(Status.INVALID_ARGUMENT
.withDescription("Invalid domain name")
.asRuntimeException());
return;
}
try {
List<DomSampleDb.Sample> dbRecords = domSampleDb.getSamples(domainName);
if (dbRecords.isEmpty()) {
responseObserver.onError(Status.NOT_FOUND.withDescription("No sample found").asRuntimeException());
return;
}
// Grab the first sample
RpcDomainSample.Builder response = convert(dbRecords.getFirst());
responseObserver.onNext(response.build());
responseObserver.onCompleted();
}
catch (Exception e) {
logger.error("Error in getSample()", e);
responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
}
}
@Override
public void hasSample(RpcDomainName request, StreamObserver<RpcBooleanRsp> responseObserver) {
String domainName = request.getDomainName();
if (domainName.isBlank()) {
responseObserver.onError(Status.INVALID_ARGUMENT
.withDescription("Invalid domain name")
.asRuntimeException());
return;
}
try {
responseObserver.onNext(RpcBooleanRsp.newBuilder()
.setAnswer(domSampleDb.hasSample(domainName)).build());
responseObserver.onCompleted();
}
catch (Exception e) {
responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
}
}
@Override
public void getAllSamples(RpcDomainName request, StreamObserver<RpcDomainSample> responseObserver) {
String domainName = request.getDomainName();
if (domainName.isBlank()) {
responseObserver.onError(Status.INVALID_ARGUMENT
.withDescription("Invalid domain name")
.asRuntimeException());
return;
}
try {
List<DomSampleDb.Sample> dbRecords = domSampleDb.getSamples(domainName);
for (var record : dbRecords) {
responseObserver.onNext(convert(record).build());
}
responseObserver.onCompleted();
}
catch (Exception e) {
logger.error("Error in getSample()", e);
responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
}
}
private RpcDomainSample.Builder convert(DomSampleDb.Sample dbSample) {
var sampleBuilder = RpcDomainSample.newBuilder()
.setDomainName(dbSample.domain())
.setAcceptedPopover(dbSample.acceptedPopover())
.setHtmlSample(dbSample.sample());
for (var req : dbSample.parseRequests()) {
sampleBuilder.addOutgoingRequestsBuilder()
.setUrl(req.uri().toString())
.setMethod(switch (req.method().toUpperCase())
{
case "GET" -> RpcOutgoingRequest.RequestMethod.GET;
case "POST" -> RpcOutgoingRequest.RequestMethod.POST;
default -> RpcOutgoingRequest.RequestMethod.OTHER;
})
.setTimestamp(req.timestamp());
}
return sampleBuilder;
}
}

View File

@@ -1,17 +1,28 @@
 package nu.marginalia.domsample.db;
 import nu.marginalia.WmsaHome;
+import nu.marginalia.model.EdgeUrl;
+import org.apache.commons.lang3.StringUtils;
 import org.jsoup.Jsoup;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import java.net.URI;
+import java.net.URISyntaxException;
 import java.nio.file.Path;
 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.SQLException;
-import java.util.*;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.function.Predicate;
 public class DomSampleDb implements AutoCloseable {
 private static final String dbFileName = "dom-sample.db";
 private final Connection connection;
+private static final Logger logger = LoggerFactory.getLogger(DomSampleDb.class);
 public DomSampleDb() throws SQLException{
 this(WmsaHome.getDataPath().resolve(dbFileName));
@@ -88,14 +99,78 @@ public class DomSampleDb implements AutoCloseable {
 }
-public record Sample(String url, String domain, String sample, String requests, boolean acceptedPopover) {}
+public record Sample(String url, String domain, String sample, String requests, boolean acceptedPopover) {
public List<SampleRequest> parseRequests() {
List<SampleRequest> requests = new ArrayList<>();
// Request format is METHOD\tTIMESTAMP\tURI\n
for (var line : StringUtils.split(this.requests, '\n')) {
String[] parts = StringUtils.split(line, "\t", 3);
if (parts.length != 3) continue;
try {
String method = parts[0];
long ts = Long.parseLong(parts[1]);
String linkUrl = parts[2];
URI uri = parseURI(linkUrl);
requests.add(new SampleRequest(method, ts, uri));
}
catch (Exception e) {
logger.warn("Failed to parse requests", e);
}
}
return requests;
}
private static URI parseURI(String uri) throws URISyntaxException {
try {
return new URI(uri);
}
catch (URISyntaxException ex) {
return new EdgeUrl(uri).asURI();
}
}
}
public record SampleRequest(String method, long timestamp, URI uri) {}
/**
* @param consumer - consume the sample, return true to continue consumption
* @throws SQLException
*/
public void forEachSample(Predicate<Sample> consumer) throws SQLException {
try (var stmt = connection.prepareStatement("""
SELECT url, domain, sample, requests, accepted_popover
FROM samples
"""))
{
var rs = stmt.executeQuery();
while (rs.next()) {
var sample = new Sample(
rs.getString("url"),
rs.getString("domain"),
rs.getString("sample"),
rs.getString("requests"),
rs.getBoolean("accepted_popover")
);
if (!consumer.test(sample)) break;
}
}
}
 public List<Sample> getSamples(String domain) throws SQLException {
 List<Sample> samples = new ArrayList<>();
 try (var stmt = connection.prepareStatement("""
 SELECT url, sample, requests, accepted_popover
 FROM samples
 WHERE domain = ?
 """))
 {
@@ -116,6 +191,21 @@ public class DomSampleDb implements AutoCloseable {
 return samples;
 }
public boolean hasSample(String domain) throws SQLException {
try (var stmt = connection.prepareStatement("""
SELECT 1
FROM samples
WHERE domain = ?
"""))
{
stmt.setString(1, domain);
var rs = stmt.executeQuery();
return rs.next();
}
}
 public void saveSample(String domain, String url, String rawContent) throws SQLException {
 var doc = Jsoup.parse(rawContent);

View File
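The `requests` column parsed by `Sample#parseRequests` above uses a simple line-oriented format: `METHOD\tTIMESTAMP\tURI`, one request per line. A minimal standalone sketch of the same parsing logic, using only the JDK (no Marginalia classes, and `String.split` in place of `StringUtils`):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class RequestLogDemo {
    record SampleRequest(String method, long timestamp, URI uri) {}

    // Parse the METHOD\tTIMESTAMP\tURI format, skipping malformed lines,
    // mirroring the tolerant behavior of Sample#parseRequests
    static List<SampleRequest> parse(String requests) {
        List<SampleRequest> out = new ArrayList<>();
        for (String line : requests.split("\n")) {
            String[] parts = line.split("\t", 3);
            if (parts.length != 3) continue; // skip lines without all three fields
            try {
                out.add(new SampleRequest(parts[0], Long.parseLong(parts[1]), new URI(parts[2])));
            } catch (Exception e) {
                // bad timestamp or URI: skip, where the real code logs a warning and continues
            }
        }
        return out;
    }

    public static void main(String[] args) {
        var reqs = parse("GET\t1721460000000\thttps://example.com/ad.js\nPOST\t1721460001000\thttps://tracker.example/beacon");
        System.out.println(reqs.size());          // 2
        System.out.println(reqs.get(0).method()); // GET
    }
}
```

The real implementation additionally falls back to `EdgeUrl` for URIs that `java.net.URI` rejects.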

@@ -1,6 +1,7 @@
 package nu.marginalia.rss.svc;
 import com.google.inject.Inject;
+import io.grpc.Status;
 import io.grpc.stub.StreamObserver;
 import nu.marginalia.api.feeds.*;
 import nu.marginalia.db.DbDomainQueries;
@@ -69,7 +70,7 @@ public class FeedsGrpcService extends FeedApiGrpc.FeedApiImplBase implements Dis
 @Override
 public void getFeedDataHash(Empty request, StreamObserver<RpcFeedDataHash> responseObserver) {
 if (!feedDb.isEnabled()) {
-responseObserver.onError(new IllegalStateException("Feed database is disabled on this node"));
+responseObserver.onError(Status.INTERNAL.withDescription("Feed database is disabled on this node").asRuntimeException());
 return;
 }
@@ -80,7 +81,7 @@ public class FeedsGrpcService extends FeedApiGrpc.FeedApiImplBase implements Dis
 }
 catch (Exception e) {
 logger.error("Error getting feed data hash", e);
-responseObserver.onError(e);
+responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
 }
 }
@@ -101,7 +102,7 @@ public class FeedsGrpcService extends FeedApiGrpc.FeedApiImplBase implements Dis
 }
 catch (Exception e) {
 logger.error("Error getting updated links", e);
-responseObserver.onError(e);
+responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
 }
 }
@@ -109,13 +110,13 @@ public class FeedsGrpcService extends FeedApiGrpc.FeedApiImplBase implements Dis
 public void getFeed(RpcDomainId request,
 StreamObserver<RpcFeed> responseObserver) {
 if (!feedDb.isEnabled()) {
-responseObserver.onError(new IllegalStateException("Feed database is disabled on this node"));
+responseObserver.onError(Status.INTERNAL.withDescription("Feed database is disabled on this node").asRuntimeException());
 return;
 }
 Optional<EdgeDomain> domainName = domainQueries.getDomain(request.getDomainId());
 if (domainName.isEmpty()) {
-responseObserver.onError(new IllegalArgumentException("Domain not found"));
+responseObserver.onError(Status.NOT_FOUND.withDescription("Domain not found").asRuntimeException());
 return;
 }

View File

@@ -26,7 +26,9 @@ public class MathClient {
 private static final Logger logger = LoggerFactory.getLogger(MathClient.class);
 private final GrpcSingleNodeChannelPool<MathApiGrpc.MathApiBlockingStub> channelPool;
-private final ExecutorService executor = Executors.newWorkStealingPool(8);
+private static final boolean useLoom = Boolean.getBoolean("system.experimentalUseLoom");
+private static final ExecutorService executor = useLoom ? Executors.newVirtualThreadPerTaskExecutor() : Executors.newWorkStealingPool(8);
 @Inject
 public MathClient(GrpcChannelPoolFactory factory) {

View File

@@ -3,6 +3,7 @@ package nu.marginalia.functions.searchquery;
 import com.google.common.collect.Lists;
 import com.google.inject.Inject;
 import com.google.inject.Singleton;
+import io.grpc.Status;
 import io.grpc.stub.StreamObserver;
 import io.prometheus.client.Histogram;
 import nu.marginalia.api.searchquery.*;
@@ -93,7 +94,7 @@ public class QueryGRPCService
 });
 } catch (Exception e) {
 logger.error("Exception", e);
-responseObserver.onError(e);
+responseObserver.onError(Status.INTERNAL.withCause(e).asRuntimeException());
 }
 }

View File

@@ -38,7 +38,9 @@ public class IndexClient {
 .help("Count of results filtered by NSFW tier")
 .register();
-private static final ExecutorService executor = Executors.newCachedThreadPool();
+private static final boolean useLoom = Boolean.getBoolean("system.experimentalUseLoom");
+private static final ExecutorService executor = useLoom ? Executors.newVirtualThreadPerTaskExecutor() : Executors.newCachedThreadPool();
 @Inject
 public IndexClient(GrpcChannelPoolFactory channelPoolFactory,

View File

@@ -1,10 +1,10 @@
-package nu.marginalia.ranking.domains;
+package nu.marginalia.domainranking;
 import gnu.trove.list.TIntList;
 import gnu.trove.list.array.TIntArrayList;
-import nu.marginalia.ranking.domains.accumulator.RankingResultAccumulator;
-import nu.marginalia.ranking.domains.data.GraphSource;
-import nu.marginalia.ranking.domains.jgrapht.PersonalizedPageRank;
+import nu.marginalia.domainranking.accumulator.RankingResultAccumulator;
+import nu.marginalia.domainranking.data.GraphSource;
+import nu.marginalia.domainranking.jgrapht.PersonalizedPageRank;
 import org.jgrapht.Graph;
 import org.jgrapht.alg.interfaces.VertexScoringAlgorithm;
 import org.jgrapht.alg.scoring.PageRank;

View File

@@ -1,6 +1,6 @@
-package nu.marginalia.ranking.domains;
+package nu.marginalia.domainranking;
-import nu.marginalia.ranking.domains.accumulator.RankingResultAccumulator;
+import nu.marginalia.domainranking.accumulator.RankingResultAccumulator;
 import java.util.function.Supplier;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.accumulator;
+package nu.marginalia.domainranking.accumulator;
 public interface RankingResultAccumulator<T> {
 void add(int domainId, int rank);

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.accumulator;
+package nu.marginalia.domainranking.accumulator;
 import org.roaringbitmap.RoaringBitmap;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.accumulator;
+package nu.marginalia.domainranking.accumulator;
 import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.accumulator;
+package nu.marginalia.domainranking.accumulator;
 import it.unimi.dsi.fastutil.ints.IntOpenHashSet;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.accumulator;
+package nu.marginalia.domainranking.accumulator;
 import gnu.trove.list.array.TIntArrayList;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.data;
+package nu.marginalia.domainranking.data;
 import com.zaxxer.hikari.HikariDataSource;
 import org.jgrapht.Graph;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.data;
+package nu.marginalia.domainranking.data;
 import org.jgrapht.Graph;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.data;
+package nu.marginalia.domainranking.data;
 import com.google.inject.Inject;
 import com.zaxxer.hikari.HikariDataSource;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.data;
+package nu.marginalia.domainranking.data;
 import com.google.inject.Inject;
 import com.zaxxer.hikari.HikariDataSource;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.data;
+package nu.marginalia.domainranking.data;
 import com.google.inject.Inject;
 import com.zaxxer.hikari.HikariDataSource;

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.ranking.domains.jgrapht;
+package nu.marginalia.domainranking.jgrapht;
 /*
 * (C) Copyright 2016-2023, by Dimitrios Michail and Contributors.
@@ -21,8 +21,9 @@ package nu.marginalia.ranking.domains.jgrapht;
 /* (modified by @vlofgren to add personalization) */
-import org.jgrapht.*;
-import org.jgrapht.alg.interfaces.*;
+import org.jgrapht.Graph;
+import org.jgrapht.Graphs;
+import org.jgrapht.alg.interfaces.VertexScoringAlgorithm;
 import java.util.*;

View File

@@ -2,6 +2,7 @@ package nu.marginalia.index;
 import com.google.inject.Inject;
 import com.google.inject.Singleton;
+import io.grpc.Status;
 import io.grpc.stub.StreamObserver;
 import io.prometheus.client.Counter;
 import io.prometheus.client.Gauge;
@@ -148,7 +149,7 @@ public class IndexGrpcService
 }
 catch (Exception ex) {
 logger.error("Error in handling request", ex);
-responseObserver.onError(ex);
+responseObserver.onError(Status.INTERNAL.withCause(ex).asRuntimeException());
 }
 }

View File

@@ -551,9 +551,18 @@ public class IndexResultScoreCalculator {
 largeSiteFactor = 2;
 }
-if (DocumentMetadata.hasFlags(featureFlags, HtmlFeature.TRACKING_ADTECH.getFeatureBit()))
+if (DocumentMetadata.hasFlags(featureFlags, HtmlFeature.ADVERTISEMENT.getFeatureBit()))
 penalty += 7.5 * largeSiteFactor;
+if (DocumentMetadata.hasFlags(featureFlags, HtmlFeature.CONSENT.getFeatureBit()))
+penalty += 2.5 * largeSiteFactor;
+if (DocumentMetadata.hasFlags(featureFlags, HtmlFeature.POPOVER.getFeatureBit()))
+penalty += 2.5 * largeSiteFactor;
+if (DocumentMetadata.hasFlags(featureFlags, HtmlFeature.TRACKING_ADTECH.getFeatureBit()))
+penalty += 5.0 * largeSiteFactor;
 if (DocumentMetadata.hasFlags(featureFlags, HtmlFeature.AFFILIATE_LINK.getFeatureBit()))
 penalty += 5.0 * largeSiteFactor;
@@ -563,6 +572,9 @@ public class IndexResultScoreCalculator {
 if (DocumentMetadata.hasFlags(featureFlags, HtmlFeature.TRACKING.getFeatureBit()))
 penalty += 2.5 * largeSiteFactor;
+if (DocumentMetadata.hasFlags(featureFlags, HtmlFeature.SHORT_DOCUMENT.getFeatureBit()))
+penalty += 2.5 * largeSiteFactor;
 if (isForum || isWiki) {
 penalty = Math.min(0, penalty - 2);
 }

View File
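The scoring change above adds a fixed additive penalty per detected feature, scaled up for large sites. A self-contained sketch of the scheme (the bit masks below are hypothetical stand-ins; the real values come from `HtmlFeature` and `DocumentMetadata.hasFlags`):

```java
public class PenaltyDemo {
    // Hypothetical feature bit masks; the actual bits live in HtmlFeature
    static final int ADVERTISEMENT = 1, CONSENT = 2, POPOVER = 4, TRACKING_ADTECH = 8;

    // Each detected feature contributes a fixed penalty, doubled for large sites
    static double penalty(int featureFlags, boolean largeSite) {
        double largeSiteFactor = largeSite ? 2 : 1;
        double penalty = 0;
        if ((featureFlags & ADVERTISEMENT) != 0)   penalty += 7.5 * largeSiteFactor;
        if ((featureFlags & CONSENT) != 0)         penalty += 2.5 * largeSiteFactor;
        if ((featureFlags & POPOVER) != 0)         penalty += 2.5 * largeSiteFactor;
        if ((featureFlags & TRACKING_ADTECH) != 0) penalty += 5.0 * largeSiteFactor;
        return penalty;
    }

    public static void main(String[] args) {
        System.out.println(penalty(ADVERTISEMENT | POPOVER, false)); // 10.0
        System.out.println(penalty(ADVERTISEMENT | POPOVER, true));  // 20.0
    }
}
```

Note how the diff splits what was a single `TRACKING_ADTECH` penalty of 7.5 into a 7.5 `ADVERTISEMENT` penalty plus a separate 5.0 `TRACKING_ADTECH` penalty, now that the DOM sample data can distinguish the two.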

@@ -6,14 +6,14 @@ import gnu.trove.list.TIntList;
 import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
 import nu.marginalia.db.DomainRankingSetsService;
 import nu.marginalia.db.DomainTypes;
+import nu.marginalia.domainranking.PageRankDomainRanker;
+import nu.marginalia.domainranking.accumulator.RankingResultHashMapAccumulator;
+import nu.marginalia.domainranking.accumulator.RankingResultHashSetAccumulator;
+import nu.marginalia.domainranking.data.GraphSource;
+import nu.marginalia.domainranking.data.LinkGraphSource;
+import nu.marginalia.domainranking.data.SimilarityGraphSource;
 import nu.marginalia.index.IndexFactory;
 import nu.marginalia.index.domainrankings.DomainRankings;
-import nu.marginalia.ranking.domains.PageRankDomainRanker;
-import nu.marginalia.ranking.domains.accumulator.RankingResultHashMapAccumulator;
-import nu.marginalia.ranking.domains.accumulator.RankingResultHashSetAccumulator;
-import nu.marginalia.ranking.domains.data.GraphSource;
-import nu.marginalia.ranking.domains.data.LinkGraphSource;
-import nu.marginalia.ranking.domains.data.SimilarityGraphSource;
 import nu.marginalia.service.control.ServiceEventLog;
 import nu.marginalia.service.module.ServiceConfiguration;
 import org.slf4j.Logger;

View File

@@ -1,6 +1,6 @@
-package nu.marginalia.ranking.domains;
+package nu.marginalia.domainranking;
-import nu.marginalia.ranking.domains.accumulator.RankingResultListAccumulator;
+import nu.marginalia.domainranking.accumulator.RankingResultListAccumulator;
 import org.junit.jupiter.api.Disabled;
 import org.junit.jupiter.api.Test;

View File

@@ -1,12 +1,12 @@
-package nu.marginalia.ranking.domains;
+package nu.marginalia.domainranking;
 import com.zaxxer.hikari.HikariConfig;
 import com.zaxxer.hikari.HikariDataSource;
 import nu.marginalia.api.linkgraph.AggregateLinkGraphClient;
-import nu.marginalia.ranking.domains.data.InvertedLinkGraphSource;
-import nu.marginalia.ranking.domains.data.LinkGraphSource;
-import nu.marginalia.ranking.domains.data.SimilarityGraphSource;
+import nu.marginalia.domainranking.data.InvertedLinkGraphSource;
+import nu.marginalia.domainranking.data.LinkGraphSource;
+import nu.marginalia.domainranking.data.SimilarityGraphSource;
 import nu.marginalia.test.TestMigrationLoader;
 import org.jgrapht.Graph;
 import org.jgrapht.graph.DefaultWeightedEdge;

View File

@@ -1,7 +1,7 @@
-package nu.marginalia.ranking.domains;
+package nu.marginalia.domainranking;
 import nu.marginalia.array.LongArrayFactory;
-import nu.marginalia.ranking.domains.data.GraphSource;
+import nu.marginalia.domainranking.data.GraphSource;
 import org.apache.commons.lang3.StringUtils;
 import org.jgrapht.Graph;
 import org.jgrapht.graph.DefaultDirectedGraph;

View File

@@ -1,7 +1,7 @@
-package nu.marginalia.ranking.domains;
+package nu.marginalia.domainranking;
 import nu.marginalia.array.LongArrayFactory;
-import nu.marginalia.ranking.domains.data.GraphSource;
+import nu.marginalia.domainranking.data.GraphSource;
 import org.apache.commons.lang3.StringUtils;
 import org.jgrapht.Graph;
 import org.jgrapht.graph.DefaultDirectedGraph;

View File

@@ -1,6 +1,6 @@
-package nu.marginalia.ranking.domains;
+package nu.marginalia.domainranking;
-import nu.marginalia.ranking.domains.data.GraphSource;
+import nu.marginalia.domainranking.data.GraphSource;
 import org.apache.commons.lang3.StringUtils;
 import org.jgrapht.Graph;
 import org.jgrapht.graph.DefaultUndirectedWeightedGraph;

View File

@@ -47,11 +47,14 @@ dependencies {
 implementation project(':code:processes:converting-process:ft-anchor-keywords')
 implementation project(':code:processes:converting-process:ft-keyword-extraction')
+implementation project(':code:processes:converting-process:ft-dom-classifier')
 implementation project(':code:processes:crawling-process:ft-crawl-blocklist')
 implementation project(':code:processes:crawling-process:ft-link-parser')
 implementation project(':code:processes:crawling-process:ft-content-type')
+implementation project(':code:functions:live-capture:api')
 testImplementation project(':code:libraries:term-frequency-dict')
 testImplementation project(':code:processes:crawling-process:model')

View File

@@ -0,0 +1,40 @@
plugins {
id 'java'
id "de.undercouch.download" version "5.1.0"
id 'jvm-test-suite'
}
java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
}
}
apply from: "$rootProject.projectDir/srcsets.gradle"
dependencies {
implementation project(':code:common:config')
implementation project(':code:common:service')
implementation project(':code:common:model')
implementation project(':code:common:db')
implementation project(':code:functions:live-capture:api')
implementation libs.bundles.slf4j
implementation libs.guava
implementation dependencies.create(libs.guice.get()) {
exclude group: 'com.google.guava'
}
implementation libs.trove
implementation libs.gson
implementation libs.bundles.protobuf
implementation libs.bundles.mariadb
implementation libs.duckdb
implementation libs.notnull
implementation libs.jsoup
testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation libs.mockito
}

View File

@@ -0,0 +1,99 @@
package nu.marginalia.ddtrackergradar;
import com.google.gson.Gson;
import nu.marginalia.WmsaHome;
import nu.marginalia.ddtrackergradar.model.DDGTDomain;
import nu.marginalia.model.gson.GsonFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;
/** Holds tracker metadata from DuckDuckGo's Tracker Radar.
 * The data itself is licensed CC-BY-NC-SA 4.0.
 * */
public class DDGTrackerData {
private final Map<String, DDGTDomain> topDomains = new HashMap<>();
private final Map<String, DDGTDomain> domains = new HashMap<>();
private final Gson gson = GsonFactory.get();
private static final Logger logger = LoggerFactory.getLogger(DDGTrackerData.class);
public DDGTrackerData() {
// Data is assumed to be in ${WMSA_HOME}/data/tracker-radar
// ... do a shallow clone of the repo
// https://github.com/duckduckgo/tracker-radar/
Path dataDir = WmsaHome.getDataPath().resolve("tracker-radar");
if (!Files.exists(dataDir)) {
logger.info("tracker-radar data absent from expected path {}, loading nothing", dataDir);
return;
}
try (var sources = Files.list(dataDir.resolve("domains"))) {
sources.filter(Files::isDirectory).forEach(this::loadDomainDir);
}
catch (IOException e) {
logger.error("Failed to read tracker radar data dir", e);
}
}
/** Tries to fetch available information about tracking coming from the specified domain
*/
public Optional<DDGTDomain> getDomainInfo(String domain) {
return Optional
.ofNullable(topDomains.get(domain))
.or(() -> Optional.ofNullable(domains.get(domain)));
}
/** public for testing */
public void loadDomainDir(Path dir) {
try (var dirContent = Files.list(dir)) {
dirContent
.filter(Files::isRegularFile)
.filter(path -> path.toString().endsWith(".json"))
.forEach(this::loadDomainModel);
}
catch (IOException e) {
logger.error("Error while loading DDGT tracker data", e);
}
}
void loadDomainModel(Path jsonFile) {
try {
var model = gson.fromJson(Files.readString(jsonFile), DDGTDomain.class);
if (model.domain() == null)
return;
if ((model.owner() == null || model.owner().isEmpty())
&& (model.categories() == null || model.categories().isEmpty()))
return;
topDomains.put(model.domain(), model);
domains.put(model.domain(), model);
if (model.subdomains() != null) {
for (String subdomain : model.subdomains()) {
domains.put(subdomain + "." + model.domain(), model);
}
}
}
catch (Exception e) {
logger.error("Error while loading DDGT tracker data", e);
}
}
/** Returns every classification label present in the loaded data set */
public Set<String> getAllClassifications() {
Set<String> ret = new HashSet<>();
for (var domain: domains.values()) {
ret.addAll(domain.categories());
}
return ret;
}
}

View File
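`DDGTrackerData` resolves a domain in two steps: first the top-level-domain map, then a fallback map that also contains one entry per subdomain. The same lookup strategy in a standalone sketch (class, record, and sample data below are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class TrackerLookupDemo {
    record DomainInfo(String domain, String owner) {}

    static final Map<String, DomainInfo> topDomains = new HashMap<>();
    static final Map<String, DomainInfo> domains = new HashMap<>();

    // Register a domain and its subdomains, mirroring what loadDomainModel does
    static void register(DomainInfo info, String... subdomains) {
        topDomains.put(info.domain(), info);
        domains.put(info.domain(), info);
        for (String sub : subdomains) {
            domains.put(sub + "." + info.domain(), info);
        }
    }

    // Lookup prefers the top-level map, then falls back to the subdomain-aware map
    static Optional<DomainInfo> getDomainInfo(String domain) {
        return Optional.ofNullable(topDomains.get(domain))
                .or(() -> Optional.ofNullable(domains.get(domain)));
    }

    public static void main(String[] args) {
        register(new DomainInfo("tracker.example", "ExampleCorp"), "cdn", "pixel");
        System.out.println(getDomainInfo("pixel.tracker.example").isPresent()); // true
        System.out.println(getDomainInfo("unknown.example").isPresent());       // false
    }
}
```

Pre-expanding subdomains into the map trades memory for O(1) lookups, avoiding suffix-walking each queried hostname.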

@@ -0,0 +1,12 @@
package nu.marginalia.ddtrackergradar.model;
import java.util.List;
public record DDGTDomain(
String domain,
DDGTOwner owner,
List<String> categories,
List<String> subdomains
)
{
}

View File

@@ -0,0 +1,10 @@
package nu.marginalia.ddtrackergradar.model;
public record DDGTOwner(String name, String displayName, String privacyPolicy, String url) {
public boolean isEmpty() {
return name == null
&& displayName == null
&& privacyPolicy == null
&& url == null;
}
}

View File

@@ -0,0 +1,25 @@
package nu.marginalia.domclassifier;
import nu.marginalia.model.crawl.HtmlFeature;
import javax.annotation.Nullable;
/**
* Feature classifications for the DOM sample
*/
public enum DomSampleClassification {
ADS(HtmlFeature.ADVERTISEMENT),
TRACKING(HtmlFeature.TRACKING_ADTECH),
CONSENT(HtmlFeature.CONSENT),
POPOVER(HtmlFeature.POPOVER),
THIRD_PARTY_REQUESTS(HtmlFeature.THIRD_PARTY_REQUESTS),
UNCLASSIFIED(HtmlFeature.MISSING_DOM_SAMPLE),
IGNORE(null);
@Nullable
public final HtmlFeature htmlFeature;
DomSampleClassification(@Nullable HtmlFeature feature) {
this.htmlFeature = feature;
}
}

View File

@@ -0,0 +1,174 @@
package nu.marginalia.domclassifier;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import nu.marginalia.api.domsample.RpcDomainSample;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import org.jsoup.Jsoup;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URISyntaxException;
import java.util.*;
import java.util.function.Predicate;
import java.util.regex.Pattern;
@Singleton
public class DomSampleClassifier {
private static final Logger logger = LoggerFactory.getLogger(DomSampleClassifier.class);
private final List<Map.Entry<Predicate<String>, DomSampleClassification>> regexClassification = new ArrayList<>();
private final Map<String, DomSampleClassification> urlClassification = new HashMap<>();
private final Map<String, DomSampleClassification> topDomainClassification = new HashMap<>();
private final Map<String, DomSampleClassification> fullDomainClassification = new HashMap<>();
@Inject
public DomSampleClassifier() throws ParserConfigurationException, IOException, SAXException {
this(ClassLoader.getSystemResourceAsStream("request-classifier.xml"));
}
public DomSampleClassifier(InputStream specificationXmlData) throws ParserConfigurationException, IOException, SAXException {
Objects.requireNonNull(specificationXmlData, "specificationXmlData is null");
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(specificationXmlData);
NodeList classifierNodes = doc.getElementsByTagName("classifier");
for (int i = 0; i < classifierNodes.getLength(); i++) {
Element classifier = (Element) classifierNodes.item(i);
String target = classifier.getAttribute("target");
String rule = classifier.getAttribute("rule");
String content = classifier.getTextContent().trim();
// Convert rule to Classification enum
DomSampleClassification classification = DomSampleClassification.valueOf(rule.toUpperCase());
// Add to appropriate map based on target
switch (target) {
case "url":
urlClassification.put(content, classification);
break;
case "url-regex":
regexClassification.add(Map.entry(Pattern.compile(content).asPredicate(), classification));
break;
case "top":
topDomainClassification.put(content, classification);
break;
case "domain":
fullDomainClassification.put(content, classification);
break;
default:
throw new IllegalArgumentException("Unknown target type: " + target);
}
}
}
public Set<DomSampleClassification> classifySample(RpcDomainSample sample) {
Set<DomSampleClassification> classifications = new HashSet<>();
// Look at DOM
EdgeDomain sampleDomain = new EdgeDomain(sample.getDomainName());
try {
var parsedDoc = Jsoup.parse(sample.getHtmlSample());
var fixedElements = parsedDoc.select("*[data-position=fixed]");
if (sample.getAcceptedPopover()) {
classifications.add(DomSampleClassification.POPOVER);
}
else if (!fixedElements.isEmpty()) {
String fixedText = fixedElements.text().toLowerCase();
if (fixedText.contains("cookie") ||
fixedText.contains("subscribe") ||
fixedText.contains("consent") ||
fixedText.contains("newsletter") ||
fixedText.contains("gdpr"))
{
classifications.add(DomSampleClassification.POPOVER);
}
}
}
catch (Exception ex) {
logger.warn("Error when parsing DOM HTML sample", ex);
}
// Classify outgoing requests
for (var req : sample.getOutgoingRequestsList()) {
EdgeUrl url;
try {
url = new EdgeUrl(req.getUrl());
}
catch (URISyntaxException ex) {
continue;
}
if (!url.domain.hasSameTopDomain(sampleDomain)) {
classifications.add(DomSampleClassification.THIRD_PARTY_REQUESTS);
}
var clazz = classifyRequest(url);
if (clazz != DomSampleClassification.IGNORE && clazz != DomSampleClassification.UNCLASSIFIED) {
classifications.add(clazz);
}
}
return classifications;
}
public DomSampleClassification classifyRequest(EdgeUrl edgeUrl) {
StringBuilder pathSb = new StringBuilder(edgeUrl.path);
if (edgeUrl.param != null) {
pathSb.append("?").append(edgeUrl.param);
}
String pathMatchString = pathSb.toString();
String urlDisplayString = edgeUrl.toDisplayString();
for (Map.Entry<Predicate<String>, DomSampleClassification> regexMatcher : regexClassification) {
var matcher = regexMatcher.getKey();
if (matcher.test(pathMatchString) || matcher.test(urlDisplayString)) {
var clazz = regexMatcher.getValue();
if (clazz != DomSampleClassification.IGNORE) {
return clazz;
}
}
}
DomSampleClassification clazz = urlClassification.get(edgeUrl.toDisplayString());
if (clazz != null && clazz != DomSampleClassification.IGNORE) {
return clazz;
}
clazz = fullDomainClassification.get(edgeUrl.domain.toString());
if (clazz != null && clazz != DomSampleClassification.IGNORE) {
return clazz;
}
clazz = topDomainClassification.get(edgeUrl.domain.topDomain);
if (clazz != null && clazz != DomSampleClassification.IGNORE) {
return clazz;
}
return DomSampleClassification.UNCLASSIFIED;
}
}

View File

@@ -0,0 +1,8 @@
Holds a classification model for rendered DOM data and exported network traffic generated by
[functions/live-capture](../../../functions/live-capture).
The model is primarily used in the [converting-process](../../converting-process), but is also run in the search UI for inspection purposes.
The traffic classification model is found in [resources/request-classifier.xml](resources/request-classifier.xml).
The code evaluating the model is in [DomSampleClassifier.java](java/nu/marginalia/domclassifier/DomSampleClassifier.java).
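`DomSampleClassifier` evaluates rules from most to least specific: `url-regex` patterns (tested against both the path+query and the full URL), then exact `url` matches, then full `domain` matches, then registrable `top` domain matches. A hypothetical rule of each type (the `/pagead/` pattern is from the shipped rule set; the example hostnames are illustrative only):

```xml
<rules>
    <!-- regex, matched against the path+query and the full URL -->
    <classifier target="url-regex" rule="ads">/pagead/</classifier>
    <!-- exact URL match -->
    <classifier target="url" rule="tracking">https://tracker.example.com/collect</classifier>
    <!-- full hostname match -->
    <classifier target="domain" rule="tracking">stats.example.com</classifier>
    <!-- registrable (top) domain match, covering all its hosts -->
    <classifier target="top" rule="consent">consent-vendor.example</classifier>
</rules>
```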

View File

@@ -0,0 +1,111 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rules [
<!ELEMENT rules (classifier*)>
<!ELEMENT classifier (#PCDATA)>
<!ATTLIST classifier
target (url-regex|url|domain|top) #REQUIRED
rule (ads|tracking|consent|ignore) #REQUIRED>
]>
<!-- Contains rules for mapping outgoing requests during DOM Sampling to website classification -->
<rules>
<!-- Regex rules -->
<classifier target="url-regex" rule="tracking">/ads/ga-audiences</classifier>
<classifier target="url-regex" rule="tracking">/google_top_exp.js$</classifier>
<classifier target="url-regex" rule="tracking">/ccm/collect$</classifier>
<classifier target="url-regex" rule="tracking">^/[0-9]+\.js$</classifier>
<classifier target="url-regex" rule="tracking">^/[a-z0-9]\.gif$</classifier>
<classifier target="url-regex" rule="ads">/pagead/</classifier>
<classifier target="url-regex" rule="ads">/google-ads/</classifier>
<!-- URL classifications TRACKING -->
<classifier target="url" rule="tracking">https://googleads.g.doubleclick.net/pagead/id</classifier>
<classifier target="url" rule="tracking">https://securepubads.g.doubleclick.net/tag/js/gpt.js</classifier>
<classifier target="url" rule="tracking">https://pagead2.googlesyndication.com/ccm/collect</classifier>
<classifier target="url" rule="tracking">https://z-na.amazon-adsystem.com/widgets/onejs</classifier>
<!-- Full domain classifications ADS -->
<classifier target="domain" rule="ads">securepubads.g.doubleclick.net</classifier>
<classifier target="domain" rule="ads">googleads.g.doubleclick.net</classifier>
<!-- Full domain classifications TRACKING -->
<classifier target="domain" rule="tracking">stats.g.doubleclick.net</classifier>
<classifier target="domain" rule="tracking">insight.adsrvr.org</classifier>
<classifier target="domain" rule="tracking">pixel.wp.com</classifier>
<classifier target="domain" rule="tracking">connect.facebook.net</classifier>
<classifier target="domain" rule="tracking">stats.wp.com</classifier>
<classifier target="domain" rule="tracking">track.hubspot.com</classifier>
<classifier target="domain" rule="tracking">analytics.tiktok.com</classifier>
<classifier target="domain" rule="tracking">analytics-ipv6.tiktokw.us</classifier>
<classifier target="domain" rule="tracking">tr6.snapchat.com</classifier>
<classifier target="domain" rule="tracking">tr.snapchat.com</classifier>
<classifier target="domain" rule="tracking">geo-location.prebid.cloud</classifier>
<classifier target="domain" rule="tracking">px.ads.linkedin.com</classifier>
<classifier target="domain" rule="tracking">region1.analytics.google.com</classifier>
<classifier target="domain" rule="tracking">api.hubapi.com</classifier>
<classifier target="domain" rule="tracking">bat.bing.com</classifier>
<classifier target="domain" rule="tracking">bat.bing.net</classifier>
<classifier target="domain" rule="tracking">c.bing.com</classifier>
<classifier target="domain" rule="tracking">c.bing.net</classifier>
<classifier target="domain" rule="tracking">analytics.twitter.com</classifier>
<classifier target="domain" rule="tracking">play.google.com</classifier>
<classifier target="domain" rule="tracking">www.youtube.com</classifier>
<!-- Full domain classifications CONSENT -->
<classifier target="domain" rule="consent">cdnconsents.websitepolicies.com</classifier>
<!-- Top-level domain classifications - ADS -->
<classifier target="top" rule="ads">googlesyndication.com</classifier>
<classifier target="top" rule="ads">amazon-adsystem.com</classifier>
<classifier target="top" rule="ads">smartadserver.com</classifier>
<classifier target="top" rule="ads">googleadservices.com</classifier>
<classifier target="top" rule="ads">prebid.cloud</classifier>
<classifier target="top" rule="ads">pubmine.com</classifier>
<classifier target="top" rule="ads">adtrafficquality.google</classifier>
<classifier target="top" rule="ads">syndicatedsearch.goog</classifier>
<classifier target="top" rule="ads">adsrvr.org</classifier>
<classifier target="top" rule="ads">adnxs.net</classifier>
<classifier target="top" rule="ads">aditude.io</classifier>
<classifier target="top" rule="ads">buysellads.net</classifier>
<!-- Top-level domain classifications - TRACKING -->
<classifier target="top" rule="tracking">plausible.io</classifier>
<classifier target="top" rule="tracking">amplitude.com</classifier>
<classifier target="top" rule="tracking">hsadspixel.net</classifier>
<classifier target="top" rule="tracking">demdex.net</classifier>
<classifier target="top" rule="tracking">omtrdc.net</classifier>
<classifier target="top" rule="tracking">ggpht.com</classifier>
<classifier target="top" rule="tracking">doubleclick.net</classifier>
<classifier target="top" rule="tracking">google.com</classifier>
<classifier target="top" rule="tracking">google.se</classifier>
<classifier target="top" rule="tracking">google-analytics.com</classifier>
<classifier target="top" rule="tracking">googletagmanager.com</classifier>
<classifier target="top" rule="tracking">cloudflareinsights.com</classifier>
<classifier target="top" rule="tracking">branch.io</classifier>
<classifier target="top" rule="tracking">clarity.ms</classifier>
<classifier target="top" rule="tracking">hotjar.com</classifier>
<classifier target="top" rule="tracking">hotjar.io</classifier>
<classifier target="top" rule="tracking">nr-data.net</classifier>
<classifier target="top" rule="tracking">newrelic.com</classifier>
<classifier target="top" rule="tracking">siteimproveanalytics.com</classifier>
<classifier target="top" rule="tracking">siteimproveanalytics.io</classifier>
<classifier target="top" rule="tracking">hs-analytics.net</classifier>
<classifier target="top" rule="tracking">sentry.io</classifier>
<classifier target="top" rule="tracking">hs-scripts.com</classifier>
<classifier target="top" rule="tracking">addtoany.com</classifier>
<classifier target="top" rule="tracking">facebook.com</classifier>
<classifier target="top" rule="tracking">scorecardresearch.com</classifier>
<!-- Top-level domain classifications - CONSENT -->
<classifier target="top" rule="consent">trustarc.com</classifier>
<classifier target="top" rule="consent">truste.com</classifier>
<classifier target="top" rule="consent">onetrust.com</classifier>
<classifier target="top" rule="consent">cookielaw.org</classifier>
<classifier target="top" rule="consent">hs-banner.com</classifier>
<classifier target="top" rule="consent">fundingchoicesmessages.google.com</classifier>
</rules>

View File

@@ -0,0 +1,15 @@
package nu.marginalia.ddtrackergradar;
import org.junit.jupiter.api.Test;
import java.nio.file.Path;
class DDGTrackerDataTest {
@Test
public void testLoad() {
DDGTrackerData data = new DDGTrackerData();
data.loadDomainDir(Path.of("/home/vlofgren/Work/tracker-radar/domains/US/"));
data.getDomainInfo("hotjar.com").ifPresent(System.out::println);
data.getAllClassifications().forEach(System.out::println);
}
}

View File

@@ -113,6 +113,13 @@ public class DocumentKeywordsBuilder {
         newWords.forEach(word -> wordToMeta.putIfAbsent(word, meta));
     }
 
+    public void addSyntheticTerm(String newWord) {
+        byte meta = WordFlags.Synthetic.asBit();
+
+        wordToMeta.putIfAbsent(newWord, meta);
+    }
+
     public List<String> getWordsWithAnyFlag(long flags) {
         List<String> ret = new ArrayList<>();

View File

@@ -23,6 +23,7 @@ import nu.marginalia.process.control.ProcessHeartbeatImpl;
 import nu.marginalia.process.log.WorkLog;
 import nu.marginalia.process.log.WorkLogEntry;
 import nu.marginalia.service.module.DatabaseModule;
+import nu.marginalia.service.module.ServiceDiscoveryModule;
 import nu.marginalia.storage.FileStorageService;
 import nu.marginalia.util.SimpleBlockingThreadPool;
 import nu.marginalia.worklog.BatchingWorkLog;
@@ -59,6 +60,7 @@ public class ConverterMain extends ProcessMainClass {
         Injector injector = Guice.createInjector(
                 new ConverterModule(),
                 new ProcessConfigurationModule("converter"),
+                new ServiceDiscoveryModule(),
                 new DatabaseModule(false)
         );

View File

@@ -5,10 +5,12 @@ import nu.marginalia.atags.AnchorTextKeywords;
 import nu.marginalia.atags.model.DomainLinks;
 import nu.marginalia.converting.model.DisqualifiedException;
 import nu.marginalia.converting.model.ProcessedDocument;
+import nu.marginalia.converting.processor.classifier.AcceptableAds;
 import nu.marginalia.converting.processor.plugin.AbstractDocumentProcessorPlugin;
 import nu.marginalia.converting.processor.plugin.HtmlDocumentProcessorPlugin;
 import nu.marginalia.converting.processor.plugin.PdfDocumentProcessorPlugin;
 import nu.marginalia.converting.processor.plugin.PlainTextDocumentProcessorPlugin;
+import nu.marginalia.domclassifier.DomSampleClassification;
 import nu.marginalia.keyword.LinkTexts;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.model.EdgeUrl;
@@ -22,7 +24,6 @@ import org.slf4j.LoggerFactory;
 import org.slf4j.Marker;
 import org.slf4j.MarkerFactory;
 
-import java.io.IOException;
 import java.net.URISyntaxException;
 import java.util.ArrayList;
 import java.util.List;
@@ -60,6 +61,7 @@ public class DocumentProcessor {
     public ProcessedDocument process(CrawledDocument crawledDocument,
                                      EdgeDomain domain,
                                      DomainLinks externalDomainLinks,
+                                     Set<DomSampleClassification> domSampleClassifications,
                                      DocumentDecorator documentDecorator) {
         ProcessedDocument ret = new ProcessedDocument();
@@ -79,7 +81,38 @@
                 default -> DocumentClass.EXTERNALLY_LINKED_MULTI;
             };
 
-            processDocument(crawledDocument, documentClass, documentDecorator, externalDomainLinks, ret);
+            var crawlerStatus = CrawlerDocumentStatus.valueOf(crawledDocument.crawlerStatus);
+            if (crawlerStatus != CrawlerDocumentStatus.OK)
+                throw new DisqualifiedException(crawlerStatus);
+
+            if (AcceptableAds.hasAcceptableAdsHeader(crawledDocument))
+                throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.ACCEPTABLE_ADS);
+
+            if (!isAcceptedContentType(crawledDocument))
+                throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.CONTENT_TYPE);
+
+            ret.state = crawlerStatusToUrlState(crawledDocument.crawlerStatus, crawledDocument.httpStatus);
+
+            LinkTexts linkTexts = anchorTextKeywords.getAnchorTextKeywords(externalDomainLinks, ret.url);
+
+            var detailsWithWords =
+                    findPlugin(crawledDocument)
+                            .createDetails(crawledDocument, linkTexts, domSampleClassifications, documentClass);
+
+            ret.details = detailsWithWords.details();
+            ret.words = detailsWithWords.words();
+
+            if (ret.url.path.equals("/")) {
+                ret.words.addMeta("special:root", WordFlags.Synthetic.asBit());
+            }
+
+            documentDecorator.apply(ret);
+
+            if (Boolean.TRUE.equals(crawledDocument.hasCookies)
+                    && ret.details != null
+                    && ret.details.features != null)
+            {
+                ret.details.features.add(HtmlFeature.COOKIES);
+            }
         }
         catch (DisqualifiedException ex) {
             ret.state = UrlIndexingState.DISQUALIFIED;
@@ -89,60 +122,12 @@
         catch (Exception ex) {
             ret.state = UrlIndexingState.DISQUALIFIED;
             ret.stateReason = DisqualifiedException.DisqualificationReason.PROCESSING_EXCEPTION.toString();
-            logger.info(converterAuditMarker, "Failed to convert {}: {}", crawledDocument.url, ex.getClass().getSimpleName());
-            logger.warn(converterAuditMarker, "Failed to convert " + crawledDocument.url, ex);
+            logger.warn(converterAuditMarker, "Failed to convert {}: {}", crawledDocument.url, ex.getClass().getSimpleName());
         }
 
         return ret;
     }
 
-    private void processDocument(CrawledDocument crawledDocument,
-                                 DocumentClass documentClass,
-                                 DocumentDecorator documentDecorator,
-                                 DomainLinks externalDomainLinks,
-                                 ProcessedDocument ret) throws URISyntaxException, IOException, DisqualifiedException
-    {
-        var crawlerStatus = CrawlerDocumentStatus.valueOf(crawledDocument.crawlerStatus);
-        if (crawlerStatus != CrawlerDocumentStatus.OK) {
-            throw new DisqualifiedException(crawlerStatus);
-        }
-
-        if (AcceptableAds.hasAcceptableAdsHeader(crawledDocument)) {
-            throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.ACCEPTABLE_ADS);
-        }
-
-        if (!isAcceptedContentType(crawledDocument)) {
-            throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.CONTENT_TYPE);
-        }
-
-        ret.state = crawlerStatusToUrlState(crawledDocument.crawlerStatus, crawledDocument.httpStatus);
-
-        AbstractDocumentProcessorPlugin plugin = findPlugin(crawledDocument);
-
-        EdgeUrl url = new EdgeUrl(crawledDocument.url);
-        LinkTexts linkTexts = anchorTextKeywords.getAnchorTextKeywords(externalDomainLinks, url);
-
-        AbstractDocumentProcessorPlugin.DetailsWithWords detailsWithWords = plugin.createDetails(crawledDocument, linkTexts, documentClass);
-
-        ret.details = detailsWithWords.details();
-        ret.words = detailsWithWords.words();
-
-        if (url.path.equals("/")) {
-            ret.words.addMeta("special:root", WordFlags.Synthetic.asBit());
-        }
-
-        documentDecorator.apply(ret);
-
-        if (Boolean.TRUE.equals(crawledDocument.hasCookies)
-                && ret.details != null
-                && ret.details.features != null)
-        {
-            ret.details.features.add(HtmlFeature.COOKIES);
-        }
-    }
-
     private AbstractDocumentProcessorPlugin findPlugin(CrawledDocument crawledDocument) throws DisqualifiedException {
         for (var plugin : processorPlugins) {
             if (plugin.isApplicable(crawledDocument))

View File

@@ -1,6 +1,7 @@
 package nu.marginalia.converting.processor;
 
 import com.google.inject.Inject;
+import nu.marginalia.api.domsample.DomSampleClient;
 import nu.marginalia.atags.model.DomainLinks;
 import nu.marginalia.atags.source.AnchorTagsSource;
 import nu.marginalia.atags.source.AnchorTagsSourceFactory;
@@ -12,11 +13,14 @@ import nu.marginalia.converting.processor.logic.links.TopKeywords;
 import nu.marginalia.converting.sideload.SideloadSource;
 import nu.marginalia.converting.writer.ConverterBatchWritableIf;
 import nu.marginalia.converting.writer.ConverterBatchWriter;
+import nu.marginalia.domclassifier.DomSampleClassification;
+import nu.marginalia.domclassifier.DomSampleClassifier;
 import nu.marginalia.geoip.GeoIpDictionary;
 import nu.marginalia.geoip.sources.AsnTable;
 import nu.marginalia.io.SerializableCrawlDataStream;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.model.crawl.DomainIndexingState;
+import nu.marginalia.model.crawl.UrlIndexingState;
 import nu.marginalia.model.crawldata.CrawledDocument;
 import nu.marginalia.model.crawldata.CrawledDomain;
 import nu.marginalia.model.crawldata.CrawlerDomainStatus;
@@ -28,7 +32,11 @@ import org.slf4j.LoggerFactory;
 import java.io.IOException;
 import java.sql.SQLException;
+import java.time.Duration;
 import java.util.*;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
 import java.util.regex.Pattern;
 
 public class DomainProcessor {
@@ -36,21 +44,29 @@ public class DomainProcessor {
     private final SiteWords siteWords;
     private final AnchorTagsSource anchorTagsSource;
     private final GeoIpDictionary geoIpDictionary;
+    private final DomSampleClient domSampleClient;
+    private final DomSampleClassifier domSampleClassifier;
+    private final ExecutorService domSampleExecutor = Executors.newCachedThreadPool();
 
     private final Logger logger = LoggerFactory.getLogger(getClass());
 
+    private final boolean hasDomSamples;
+
     @Inject
     public DomainProcessor(DocumentProcessor documentProcessor,
                            SiteWords siteWords,
                            AnchorTagsSourceFactory anchorTagsSourceFactory,
-                           GeoIpDictionary geoIpDictionary) throws SQLException
-    {
+                           DomSampleClient domSampleClient,
+                           GeoIpDictionary geoIpDictionary,
+                           DomSampleClassifier domSampleClassifier) throws SQLException, InterruptedException {
         this.documentProcessor = documentProcessor;
         this.siteWords = siteWords;
         this.anchorTagsSource = anchorTagsSourceFactory.create();
         this.geoIpDictionary = geoIpDictionary;
+        this.domSampleClient = domSampleClient;
+        this.domSampleClassifier = domSampleClassifier;
 
         geoIpDictionary.waitReady();
+
+        hasDomSamples = !Boolean.getBoolean("converter.ignoreDomSampleData") && domSampleClient.waitReady(Duration.ofSeconds(15));
     }
 
     public SimpleProcessing simpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) {
@@ -73,6 +89,21 @@
         }
     }
 
+    /** Fetch and process the DOM sample and extract classifications */
+    private Set<DomSampleClassification> getDomainClassifications(String domainName) throws ExecutionException, InterruptedException {
+        if (!hasDomSamples) {
+            return EnumSet.of(DomSampleClassification.UNCLASSIFIED);
+        }
+
+        return domSampleClient
+                .getSampleAsync(domainName, domSampleExecutor)
+                .thenApply(domSampleClassifier::classifySample)
+                .handle((a,b) ->
+                        Objects.requireNonNullElseGet(a,
+                                () -> EnumSet.of(DomSampleClassification.UNCLASSIFIED)))
+                .get();
+    }
+
     @Nullable
     public ProcessedDomain fullProcessing(SerializableCrawlDataStream dataStream) {
         try {
@@ -80,7 +111,6 @@
             return null;
         }
 
-        List<ProcessedDocument> docs = new ArrayList<>();
         Set<String> processedUrls = new HashSet<>();
 
         if (!(dataStream.next() instanceof CrawledDomain crawledDomain)) {
@@ -94,10 +124,12 @@
         ProcessedDomain ret = new ProcessedDomain();
 
         processDomain(crawledDomain, ret, documentDecorator);
-        ret.documents = docs;
+        ret.documents = new ArrayList<>();
 
         // Process Documents
 
+        Set<DomSampleClassification> classifications = getDomainClassifications(crawledDomain.getDomain());
+
        try (var deduplicator = new LshDocumentDeduplicator()) {
             while (dataStream.hasNext()) {
                 if (!(dataStream.next() instanceof CrawledDocument doc))
@@ -110,9 +142,23 @@
                     continue;
 
                 try {
-                    var processedDoc = documentProcessor.process(doc, ret.domain, externalDomainLinks, documentDecorator);
+                    var processedDoc = documentProcessor.process(doc, ret.domain, externalDomainLinks, classifications, documentDecorator);
 
-                    deduplicator.markIfDuplicate(processedDoc);
-                    docs.add(processedDoc);
+                    if (deduplicator.isDocumentDuplicate(processedDoc)) {
+                        processedDoc.state = UrlIndexingState.DISQUALIFIED;
+                        processedDoc.stateReason = "Duplicate";
+                    }
+
+                    if (processedDoc.isOk() && processedDoc.words != null && processedDoc.details != null) {
+                        classifications.forEach(classification -> {
+                            if (classification.htmlFeature == null) return;
+
+                            processedDoc.words.addSyntheticTerm(classification.htmlFeature.getKeyword());
+                            processedDoc.details.features.add(classification.htmlFeature);
+                        });
+                    }
+
+                    ret.documents.add(processedDoc);
                 } catch (Exception ex) {
                     logger.warn("Failed to process " + doc.url, ex);
                 }
@@ -142,15 +188,17 @@
         private final DomainLinks externalDomainLinks;
         private final LshDocumentDeduplicator deduplicator = new LshDocumentDeduplicator();
 
+        Set<DomSampleClassification> classifications;
+
         private static final ProcessingIterator.Factory iteratorFactory = ProcessingIterator.factory(8,
                 Integer.getInteger("java.util.concurrent.ForkJoinPool.common.parallelism", Runtime.getRuntime().availableProcessors())
         );
 
-        SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint) throws IOException {
+        SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint) throws Exception {
            this(dataStream, sizeHint, List.of());
        }
 
-        SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) throws IOException {
+        SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) throws Exception {
            this.dataStream = dataStream;
 
            if (!dataStream.hasNext() || !(dataStream.next() instanceof CrawledDomain crawledDomain))
@@ -166,6 +214,8 @@
             processDomain(crawledDomain, domain, documentDecorator);
 
+            classifications = getDomainClassifications(crawledDomain.getDomain());
+
             externalDomainLinks = anchorTagsSource.getAnchorTags(domain.domain);
         }
@@ -187,16 +237,31 @@
             taskConsumer.accept(() -> {
-                var processedDoc = documentProcessor.process(doc, domain.domain, externalDomainLinks, documentDecorator);
+                var processedDoc = documentProcessor.process(doc, domain.domain, externalDomainLinks, classifications, documentDecorator);
 
                 synchronized (deduplicator) {
-                    deduplicator.markIfDuplicate(processedDoc);
+                    if (deduplicator.isDocumentDuplicate(processedDoc)) {
+                        processedDoc.state = UrlIndexingState.DISQUALIFIED;
+                        processedDoc.stateReason = "Duplicate";
+                    }
                 }
 
                 if (processedDoc.isProcessedFully()) {
                     // This is a bit sketchy, but we need to set the size and topology to something
                     processedDoc.details.metadata = processedDoc.details.metadata.withSizeAndTopology(
                             10_000, externalDomainLinks.countForUrl(processedDoc.url));
+
+                    // Apply classifications
+                    try {
+                        classifications.forEach(classification -> {
+                            if (classification.htmlFeature == null) return;
+
+                            processedDoc.words.addSyntheticTerm(classification.htmlFeature.getKeyword());
+                            processedDoc.details.features.add(classification.htmlFeature);
+                        });
+                    }
+                    catch (Exception ex) {
+                    }
                 }
 
                 return processedDoc;
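The `.handle((a,b) -> …)` chain in `getDomainClassifications` above swallows any failure from the DOM sample fetch and substitutes a sentinel result rather than letting the exception propagate out of `get()`. A self-contained sketch of that fallback pattern with plain `CompletableFuture` (the class name and sentinel string are illustrative only):

```java
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

public class FallbackDemo {
    public static void main(String[] args) throws Exception {
        // A stage that always fails, standing in for an unavailable DOM sample service.
        CompletableFuture<Set<String>> classify =
                CompletableFuture.supplyAsync(() -> {
                    throw new IllegalStateException("no sample for this domain");
                });

        // handle() is invoked with (result, throwable); exactly one is non-null.
        // Its return value replaces both outcomes, so a failure degrades to a
        // sentinel classification instead of surfacing as an ExecutionException.
        Set<String> classifications = classify
                .handle((ok, err) -> Objects.requireNonNullElseGet(ok, () -> Set.of("UNCLASSIFIED")))
                .get();

        System.out.println(classifications); // prints [UNCLASSIFIED]
    }
}
```

Because `handle` consumes the exception, `get()` never throws here; callers only ever see a non-null set.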

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.converting.processor;
+package nu.marginalia.converting.processor.classifier;
 
 import nu.marginalia.model.crawldata.CrawledDocument;
 import org.jsoup.nodes.Document;

View File

@@ -2,6 +2,7 @@ package nu.marginalia.converting.processor.logic;
 import crawlercommons.utils.Strings;
 import nu.marginalia.converting.model.DisqualifiedException;
+import nu.marginalia.domclassifier.DomSampleClassification;
 import nu.marginalia.model.DocumentFormat;
 import nu.marginalia.model.crawl.HtmlFeature;
 import nu.marginalia.model.crawldata.CrawledDocument;
@@ -14,6 +15,8 @@ import org.jsoup.select.NodeVisitor;
 import java.util.List;
 import java.util.Set;
+import static nu.marginalia.domclassifier.DomSampleClassification.*;
 public class DocumentValuator {
     public double getQuality(CrawledDocument crawledDocument,
@@ -126,6 +129,25 @@ public class DocumentValuator {
         return quality + adjustment;
     }
+    public double getQuality(Set<DomSampleClassification> classifications) {
+        double quality = 0;
+        if (classifications.contains(ADS)) {
+            quality -= 6;
+        }
+        if (classifications.contains(TRACKING)) {
+            quality -= 4;
+        }
+        if (classifications.contains(CONSENT)) {
+            quality -= 4;
+        }
+        else if (classifications.contains(POPOVER)) {
+            quality -= 4;
+        }
+        return quality;
+    }
     public static class ScriptVisitor implements NodeVisitor {
         boolean hasBadScript = false;
         int scriptLength = 0;
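The new classification-based penalty can be sketched standalone. The enum below is a stand-in for `DomSampleClassification` as it appears in the diff; the penalty weights match the added `getQuality(Set<...>)` method (ads weigh heaviest, and CONSENT/POPOVER are alternatives, not cumulative):

```java
import java.util.EnumSet;
import java.util.Set;

public class QualitySketch {
    // Stand-in for nu.marginalia.domclassifier.DomSampleClassification
    enum DomSampleClassification { ADS, TRACKING, CONSENT, POPOVER, UNCLASSIFIED }

    // Same penalty structure as DocumentValuator.getQuality(Set<DomSampleClassification>)
    static double quality(Set<DomSampleClassification> classifications) {
        double quality = 0;
        if (classifications.contains(DomSampleClassification.ADS)) quality -= 6;
        if (classifications.contains(DomSampleClassification.TRACKING)) quality -= 4;
        if (classifications.contains(DomSampleClassification.CONSENT)) quality -= 4;
        else if (classifications.contains(DomSampleClassification.POPOVER)) quality -= 4;
        return quality;
    }

    public static void main(String[] args) {
        // ADS and TRACKING stack...
        System.out.println(quality(EnumSet.of(
                DomSampleClassification.ADS, DomSampleClassification.TRACKING))); // -10.0
        // ...but CONSENT shadows POPOVER via the else-if
        System.out.println(quality(EnumSet.of(
                DomSampleClassification.CONSENT, DomSampleClassification.POPOVER))); // -4.0
    }
}
```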

View File

@@ -3,7 +3,6 @@ package nu.marginalia.converting.processor.logic;
 import com.google.inject.Inject;
 import com.google.inject.Singleton;
 import nu.marginalia.converting.model.DocumentHeaders;
-import nu.marginalia.converting.processor.classifier.adblock.AdblockSimulator;
 import nu.marginalia.converting.processor.classifier.adblock.GoogleAnwersSpamDetector;
 import nu.marginalia.converting.processor.classifier.topic.RecipeDetector;
 import nu.marginalia.converting.processor.classifier.topic.TextileCraftDetector;
@@ -65,20 +64,17 @@ public class FeatureExtractor {
             "counter.yadro.ru"
     );
-    private final AdblockSimulator adblockSimulator;
     private final RecipeDetector recipeDetector;
     private final TextileCraftDetector textileCraftDetector;
     private final WoodworkingDetector woodworkingDetector;
     private final GoogleAnwersSpamDetector googleAnwersSpamDetector;
     @Inject
-    public FeatureExtractor(AdblockSimulator adblockSimulator,
-                            RecipeDetector recipeDetector,
+    public FeatureExtractor(RecipeDetector recipeDetector,
                             TextileCraftDetector textileCraftDetector,
                             WoodworkingDetector woodworkingDetector,
                             GoogleAnwersSpamDetector googleAnwersSpamDetector)
     {
-        this.adblockSimulator = adblockSimulator;
         this.recipeDetector = recipeDetector;
         this.textileCraftDetector = textileCraftDetector;
         this.woodworkingDetector = woodworkingDetector;
@@ -218,13 +214,6 @@ public class FeatureExtractor {
         }
     }
-    if (features.contains(HtmlFeature.JS)
-        // remove while disabled to get rid of expensive clone() call:
-        // adblockSimulator.hasAds(doc.clone())
-    ) {
-        features.add(HtmlFeature.ADVERTISEMENT);
-    }
     if (!doc.getElementsByTag("object").isEmpty()
         || !doc.getElementsByTag("audio").isEmpty()
         || !doc.getElementsByTag("video").isEmpty()) {

View File

@@ -1,7 +1,6 @@
 package nu.marginalia.converting.processor.logic;
 import gnu.trove.list.array.TLongArrayList;
-import nu.marginalia.model.crawl.UrlIndexingState;
 import nu.marginalia.converting.model.ProcessedDocument;
 import nu.marginalia.lsh.EasyLSH;
@@ -14,26 +13,25 @@ public class LshDocumentDeduplicator implements AutoCloseable {
     private final TLongArrayList hashCodes = new TLongArrayList(1000);
     private static final int DISTANCE_THRESHOLD = 2;
-    public void markIfDuplicate(ProcessedDocument document) {
-        if (!document.isProcessedFully()) {
-            return;
-        }
+    public boolean isDocumentDuplicate(ProcessedDocument document) {
+        if (!document.isOk()) return false;
+        if (document.words == null) return false;
+        if (document.details == null) return false;
         if (document.words.size() < 100) {
-            return;
+            return false;
         }
         long hashCode = document.details.hashCode;
         for (int i = 0; i < hashCodes.size(); i++) {
             if (EasyLSH.hammingDistance(hashCode, hashCodes.get(i)) < DISTANCE_THRESHOLD) {
-                document.state = UrlIndexingState.DISQUALIFIED;
-                document.stateReason = "Duplicate";
-                return;
+                return true;
             }
         }
         hashCodes.add(hashCode);
+        return false;
     }
     @Override
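The refactored deduplicator's core check can be sketched standalone. This assumes `EasyLSH.hammingDistance` is an ordinary Hamming distance over 64-bit locality-sensitive hashes, which `Long.bitCount(a ^ b)` reproduces:

```java
import java.util.ArrayList;
import java.util.List;

public class LshDedupSketch {
    // Hamming distance between two 64-bit LSH fingerprints
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    private final List<Long> hashCodes = new ArrayList<>();
    private static final int DISTANCE_THRESHOLD = 2;

    // Returns true if a near-duplicate hash was already seen;
    // otherwise records the hash, matching isDocumentDuplicate above
    boolean isDuplicate(long hashCode) {
        for (long seen : hashCodes) {
            if (hammingDistance(hashCode, seen) < DISTANCE_THRESHOLD) return true;
        }
        hashCodes.add(hashCode);
        return false;
    }

    public static void main(String[] args) {
        var dedup = new LshDedupSketch();
        System.out.println(dedup.isDuplicate(0b1011L)); // false: first sighting
        System.out.println(dedup.isDuplicate(0b1010L)); // true: one bit away
        System.out.println(dedup.isDuplicate(0b0000L)); // false: distance 3 from 0b1011
    }
}
```

Note that duplicates are never added to the seen set, so the memory cost is bounded by the number of distinct documents.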

View File

@@ -3,6 +3,7 @@ package nu.marginalia.converting.processor.plugin;
 import nu.marginalia.converting.model.DisqualifiedException;
 import nu.marginalia.converting.model.ProcessedDocumentDetails;
 import nu.marginalia.converting.processor.DocumentClass;
+import nu.marginalia.domclassifier.DomSampleClassification;
 import nu.marginalia.keyword.LinkTexts;
 import nu.marginalia.keyword.model.DocumentKeywordsBuilder;
 import nu.marginalia.language.filter.LanguageFilter;
@@ -26,7 +27,7 @@ public abstract class AbstractDocumentProcessorPlugin {
         this.languageFilter = languageFilter;
     }
-    public abstract DetailsWithWords createDetails(CrawledDocument crawledDocument, LinkTexts linkTexts, DocumentClass documentClass) throws DisqualifiedException, URISyntaxException, IOException;
+    public abstract DetailsWithWords createDetails(CrawledDocument crawledDocument, LinkTexts linkTexts, Set<DomSampleClassification> domSampleClassifications, DocumentClass documentClass) throws DisqualifiedException, URISyntaxException, IOException;
     public abstract boolean isApplicable(CrawledDocument doc);
     protected void checkDocumentLanguage(DocumentLanguageData dld) throws DisqualifiedException {

View File

@@ -6,15 +6,16 @@ import nu.marginalia.converting.model.DisqualifiedException;
 import nu.marginalia.converting.model.DocumentHeaders;
 import nu.marginalia.converting.model.GeneratorType;
 import nu.marginalia.converting.model.ProcessedDocumentDetails;
-import nu.marginalia.converting.processor.AcceptableAds;
 import nu.marginalia.converting.processor.DocumentClass;
 import nu.marginalia.converting.processor.MetaRobotsTag;
+import nu.marginalia.converting.processor.classifier.AcceptableAds;
 import nu.marginalia.converting.processor.logic.*;
 import nu.marginalia.converting.processor.logic.dom.MeasureLengthVisitor;
 import nu.marginalia.converting.processor.logic.links.FileLinks;
 import nu.marginalia.converting.processor.logic.links.LinkProcessor;
 import nu.marginalia.converting.processor.plugin.specialization.HtmlProcessorSpecializations;
 import nu.marginalia.converting.processor.pubdate.PubDateSniffer;
+import nu.marginalia.domclassifier.DomSampleClassification;
 import nu.marginalia.gregex.GuardedRegex;
 import nu.marginalia.gregex.GuardedRegexFactory;
 import nu.marginalia.keyword.DocumentKeywordExtractor;
@@ -23,7 +24,6 @@ import nu.marginalia.keyword.model.DocumentKeywordsBuilder;
 import nu.marginalia.language.filter.LanguageFilter;
 import nu.marginalia.language.model.DocumentLanguageData;
 import nu.marginalia.language.sentence.ThreadLocalSentenceExtractorProvider;
-import nu.marginalia.link_parser.FeedExtractor;
 import nu.marginalia.link_parser.LinkParser;
 import nu.marginalia.model.DocumentFormat;
 import nu.marginalia.model.EdgeDomain;
@@ -62,12 +62,10 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
     private static final DocumentValuator documentValuator = new DocumentValuator();
     private static final LinkParser linkParser = new LinkParser();
-    private static final FeedExtractor feedExtractor = new FeedExtractor(linkParser);
     private final ThreadLocalSentenceExtractorProvider sentenceExtractorProvider;
     private final HtmlProcessorSpecializations htmlProcessorSpecializations;
-    private static final int MAX_DOCUMENT_LENGTH_BYTES = Integer.getInteger("converter.max-body-length",128_000);
     private static boolean lenientProcessing = Boolean.getBoolean("converter.lenientProcessing");
     @Inject
@@ -106,7 +104,7 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
     @Override
     public DetailsWithWords createDetails(CrawledDocument crawledDocument,
                                           LinkTexts linkTexts,
-                                          DocumentClass documentClass)
+                                          Set<DomSampleClassification> domSampleClassifications, DocumentClass documentClass)
             throws DisqualifiedException, URISyntaxException, IOException {
         if (!lenientProcessing && languageFilter.isBlockedUnicodeRange(crawledDocument.documentBody(512))) {
@@ -138,7 +136,14 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
         final int length = getLength(doc);
         final DocumentFormat format = getDocumentFormat(doc);
-        final double quality = documentValuator.getQuality(crawledDocument, format, doc, length);
+        final double quality;
+        if (domSampleClassifications.contains(DomSampleClassification.UNCLASSIFIED)) {
+            quality = documentValuator.getQuality(crawledDocument, format, doc, length);
+        }
+        else {
+            quality = documentValuator.getQuality(domSampleClassifications);
+        }
         if (!lenientProcessing && isDisqualified(documentClass, url, quality, doc.title())) {
             throw new DisqualifiedException(DisqualificationReason.QUALITY);
@@ -148,10 +153,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
         checkDocumentLanguage(dld);
-        if (!lenientProcessing && !documentLengthLogic.validateLength(dld, specialization.lengthModifier() * documentClass.lengthLimitModifier())) {
-            throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.LENGTH);
-        }
         var ret = new ProcessedDocumentDetails();
         ret.length = length;
@@ -160,6 +161,12 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
         final Set<HtmlFeature> features = featureExtractor.getFeatures(url, doc, documentHeaders, dld);
+        if (!documentLengthLogic.validateLength(dld, specialization.lengthModifier() * documentClass.lengthLimitModifier())) {
+            features.add(HtmlFeature.SHORT_DOCUMENT);
+        }
         ret.features = features;
         ret.quality = documentValuator.adjustQuality(quality, features);
         ret.hashCode = dld.localitySensitiveHashCode();
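The quality-selection branch above treats `UNCLASSIFIED` as a sentinel: when DOM sample data was unavailable, fall back to the old heuristic valuator. A minimal sketch of that decision, with placeholder scoring functions (the `-2.5` heuristic value and the simplified classification score are assumptions for illustration):

```java
import java.util.EnumSet;
import java.util.Set;

public class QualitySourceSketch {
    // Stand-in for nu.marginalia.domclassifier.DomSampleClassification
    enum DomSampleClassification { ADS, TRACKING, CONSENT, POPOVER, UNCLASSIFIED }

    // Placeholder for the document-content-based valuator path
    static double heuristicQuality() { return -2.5; }

    // Placeholder for the classification-based valuator path (simplified)
    static double classificationQuality(Set<DomSampleClassification> c) {
        return c.contains(DomSampleClassification.ADS) ? -6 : 0;
    }

    // UNCLASSIFIED signals "no usable DOM sample": use the heuristic instead
    static double quality(Set<DomSampleClassification> classifications) {
        if (classifications.contains(DomSampleClassification.UNCLASSIFIED)) {
            return heuristicQuality();
        }
        return classificationQuality(classifications);
    }

    public static void main(String[] args) {
        System.out.println(quality(EnumSet.of(DomSampleClassification.UNCLASSIFIED))); // -2.5
        System.out.println(quality(EnumSet.of(DomSampleClassification.ADS)));          // -6.0
    }
}
```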

View File

@@ -7,6 +7,7 @@ import nu.marginalia.converting.model.ProcessedDocumentDetails;
 import nu.marginalia.converting.processor.DocumentClass;
 import nu.marginalia.converting.processor.logic.DocumentLengthLogic;
 import nu.marginalia.converting.processor.plugin.specialization.DefaultSpecialization;
+import nu.marginalia.domclassifier.DomSampleClassification;
 import nu.marginalia.keyword.DocumentKeywordExtractor;
 import nu.marginalia.keyword.LinkTexts;
 import nu.marginalia.keyword.model.DocumentKeywordsBuilder;
@@ -77,7 +78,7 @@ public class PdfDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
     @Override
     public DetailsWithWords createDetails(CrawledDocument crawledDocument,
                                           LinkTexts linkTexts,
-                                          DocumentClass documentClass)
+                                          Set<DomSampleClassification> domSampleClassifications, DocumentClass documentClass)
             throws DisqualifiedException, URISyntaxException, IOException {
         String documentBody = crawledDocument.documentBody();

View File

@@ -8,6 +8,7 @@ import nu.marginalia.converting.processor.DocumentClass;
 import nu.marginalia.converting.processor.logic.DocumentLengthLogic;
 import nu.marginalia.converting.processor.logic.PlainTextLogic;
 import nu.marginalia.converting.util.LineUtils;
+import nu.marginalia.domclassifier.DomSampleClassification;
 import nu.marginalia.keyword.DocumentKeywordExtractor;
 import nu.marginalia.keyword.LinkTexts;
 import nu.marginalia.keyword.model.DocumentKeywordsBuilder;
@@ -23,10 +24,7 @@ import org.apache.commons.lang3.StringUtils;
 import java.net.URISyntaxException;
 import java.time.LocalDate;
-import java.util.ArrayList;
-import java.util.EnumSet;
-import java.util.HashSet;
-import java.util.List;
+import java.util.*;
 public class PlainTextDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin {
@@ -70,7 +68,7 @@ public class PlainTextDocumentProcessorP
     @Override
     public DetailsWithWords createDetails(CrawledDocument crawledDocument,
                                           LinkTexts linkTexts,
-                                          DocumentClass documentClass)
+                                          Set<DomSampleClassification> domSampleClassifications, DocumentClass documentClass)
             throws DisqualifiedException, URISyntaxException {
         String documentBody = crawledDocument.documentBody();

View File

@@ -7,6 +7,7 @@ import nu.marginalia.converting.model.GeneratorType;
 import nu.marginalia.converting.model.ProcessedDocument;
 import nu.marginalia.converting.processor.DocumentClass;
 import nu.marginalia.converting.processor.plugin.HtmlDocumentProcessorPlugin;
+import nu.marginalia.domclassifier.DomSampleClassification;
 import nu.marginalia.keyword.LinkTexts;
 import nu.marginalia.model.DocumentFormat;
 import nu.marginalia.model.EdgeUrl;
@@ -64,7 +65,7 @@ public class SideloaderProcessing {
     var ret = new ProcessedDocument();
     try {
-        var details = htmlProcessorPlugin.createDetails(crawledDoc, linkTexts, documentClass);
+        var details = htmlProcessorPlugin.createDetails(crawledDoc, linkTexts, EnumSet.noneOf(DomSampleClassification.class), documentClass);
         ret.words = details.words();

View File

@@ -0,0 +1,16 @@
+package nu.marginalia.converting.processor.classifier.adblock;
+import nu.marginalia.domclassifier.DomSampleClassifier;
+import org.junit.jupiter.api.Test;
+import org.xml.sax.SAXException;
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.IOException;
+class DomSampleClassifierTest {
+    @Test
+    public void testLoadSpecs() throws ParserConfigurationException, IOException, SAXException {
+        new DomSampleClassifier();
+    }
+}

View File

@@ -25,6 +25,7 @@ import java.net.URISyntaxException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.time.Instant;
+import java.util.Set;
 @Tag("flaky")
 class PdfDocumentProcessorPluginTest {
@@ -51,7 +52,7 @@ class PdfDocumentProcessorPluginTest {
     }
     public AbstractDocumentProcessorPlugin.DetailsWithWords testPdfFile(byte[] pdfBytes) throws Exception {
         var doc = new CrawledDocument("test", "https://www.example.com/sample.pdf", "application/pdf", Instant.now().toString(), 200, "OK", "OK", "", pdfBytes, false, -1, null, null);
-        return plugin.createDetails(doc, new LinkTexts(), DocumentClass.NORMAL);
+        return plugin.createDetails(doc, new LinkTexts(), Set.of(), DocumentClass.NORMAL);
     }
     public AbstractDocumentProcessorPlugin.DetailsWithWords testPdfFile(Path file) throws Exception {

View File

@@ -61,14 +61,14 @@ public class BackoffStrategy {
     };
     double backoffMinutes = baseInterval.toMinutes()
-            * Math.pow(multiplier, backoffConsecutiveFailures - 1);
+            * Math.pow(multiplier, Math.clamp(backoffConsecutiveFailures, 1, 10));
-    Duration newDuration = Duration.ofMinutes(Math.round(0.5+backoffMinutes));
-    if (newDuration.compareTo(maxInterval) > 0) {
+    var backoffVal = Math.round(0.5+backoffMinutes);
+    if (backoffVal > maxInterval.toMinutes()) {
         return maxInterval;
     }
-    return newDuration;
+    return Duration.ofMinutes(backoffVal);
 }
 private Duration addJitter(Duration duration) {
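The revised backoff computation, sketched as a standalone method. The parameter names are assumptions; the logic mirrors the diff: the exponent is clamped to [1, 10] so repeated failures can't overflow the multiplication, and the result is capped at `maxInterval` by comparing minutes before constructing a `Duration`. Note `Math.clamp` requires Java 21+:

```java
import java.time.Duration;

public class BackoffSketch {
    // Exponential backoff with a clamped exponent and a hard ceiling,
    // following the revised BackoffStrategy logic
    static Duration nextInterval(Duration baseInterval,
                                 double multiplier,
                                 int consecutiveFailures,
                                 Duration maxInterval) {
        double backoffMinutes = baseInterval.toMinutes()
                * Math.pow(multiplier, Math.clamp(consecutiveFailures, 1, 10));
        long backoffVal = Math.round(0.5 + backoffMinutes);
        if (backoffVal > maxInterval.toMinutes()) {
            return maxInterval;
        }
        return Duration.ofMinutes(backoffVal);
    }

    public static void main(String[] args) {
        // 10-minute base, doubling per failure
        System.out.println(nextInterval(Duration.ofMinutes(10), 2.0, 3, Duration.ofHours(24)));
        // With many failures the clamp + ceiling kick in
        System.out.println(nextInterval(Duration.ofMinutes(10), 2.0, 100, Duration.ofHours(1)));
    }
}
```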

View File

@@ -44,6 +44,7 @@ dependencies {
 implementation project(':code:functions:favicon:api')
 implementation project(':code:functions:domain-info:api')
 implementation project(':code:functions:search-query:api')
+implementation project(':code:processes:converting-process:ft-dom-classifier')
 implementation project(':code:index:api')

View File

@@ -132,46 +132,6 @@ public class SearchFrontPageService {
         return new IndexModel(items, refreshDateStr, searchVisitorCount.getQueriesPerMinute());
     }
-    /* FIXME
-    public Object renderNewsFeed(Request request, Response response) {
-        List<NewsItem> newsItems = getNewsItems();
-        StringBuilder sb = new StringBuilder();
-        sb.append("""
-                <?xml version="1.0" encoding="UTF-8"?>
-                <rss version="2.0">
-                <channel>
-                <title>Marginalia Search News and Mentions</title>
-                <link>https://search.marginalia.nu/</link>
-                <description>News and Mentions of Marginalia Search</description>
-                <language>en-us</language>
-                <ttl>60</ttl>
-                """);
-        sb.append("<lastBuildDate>").append(ZonedDateTime.now().format(DateTimeFormatter.RFC_1123_DATE_TIME)).append("</lastBuildDate>\n");
-        sb.append("<pubDate>").append(ZonedDateTime.now().format(DateTimeFormatter.RFC_1123_DATE_TIME)).append("</pubDate>\n");
-        sb.append("<ttl>60</ttl>\n");
-        for (var item : newsItems) {
-            sb.append("<item>\n");
-            sb.append("<title>").append(item.title()).append("</title>\n");
-            sb.append("<link>").append(item.url()).append("</link>\n");
-            if (item.source != null) {
-                sb.append("<author>").append(item.source()).append("</author>\n");
-            }
-            sb.append("<pubDate>").append(item.date().atStartOfDay().atZone(ZoneId.systemDefault()).format(DateTimeFormatter.RFC_1123_DATE_TIME)).append("</pubDate>\n");
-            sb.append("</item>\n");
-        }
-        sb.append("</channel>\n");
-        sb.append("</rss>\n");
-        response.type("application/rss+xml");
-        return sb.toString();
-    }*/
     public record IndexModel(List<NewsItemCluster> news,
                              String refreshDate,
                              int searchPerMinute) { }

View File

@@ -1,6 +1,7 @@
package nu.marginalia.search.svc; package nu.marginalia.search.svc;
import com.google.inject.Inject; import com.google.inject.Inject;
import com.google.inject.Singleton;
import com.zaxxer.hikari.HikariDataSource; import com.zaxxer.hikari.HikariDataSource;
import io.jooby.Context; import io.jooby.Context;
import io.jooby.MapModelAndView; import io.jooby.MapModelAndView;
@@ -9,12 +10,20 @@ import io.jooby.annotation.*;
import nu.marginalia.api.domains.DomainInfoClient; import nu.marginalia.api.domains.DomainInfoClient;
import nu.marginalia.api.domains.model.DomainInformation; import nu.marginalia.api.domains.model.DomainInformation;
import nu.marginalia.api.domains.model.SimilarDomain; import nu.marginalia.api.domains.model.SimilarDomain;
import nu.marginalia.api.domsample.DomSampleClient;
import nu.marginalia.api.domsample.RpcDomainSample;
import nu.marginalia.api.domsample.RpcOutgoingRequest;
import nu.marginalia.api.feeds.FeedsClient; import nu.marginalia.api.feeds.FeedsClient;
import nu.marginalia.api.feeds.RpcFeed; import nu.marginalia.api.feeds.RpcFeed;
import nu.marginalia.api.feeds.RpcFeedItem; import nu.marginalia.api.feeds.RpcFeedItem;
import nu.marginalia.api.livecapture.LiveCaptureClient; import nu.marginalia.api.livecapture.LiveCaptureClient;
import nu.marginalia.db.DbDomainQueries; import nu.marginalia.db.DbDomainQueries;
import nu.marginalia.ddtrackergradar.DDGTrackerData;
import nu.marginalia.ddtrackergradar.model.DDGTDomain;
import nu.marginalia.domclassifier.DomSampleClassification;
import nu.marginalia.domclassifier.DomSampleClassifier;
import nu.marginalia.model.EdgeDomain; import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.screenshot.ScreenshotService; import nu.marginalia.screenshot.ScreenshotService;
import nu.marginalia.search.SearchOperator; import nu.marginalia.search.SearchOperator;
import nu.marginalia.search.model.GroupedUrlDetails; import nu.marginalia.search.model.GroupedUrlDetails;
@@ -26,6 +35,7 @@ import nu.marginalia.service.server.RateLimiter;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
import javax.annotation.Nullable;
import java.sql.SQLException; import java.sql.SQLException;
import java.util.*; import java.util.*;
import java.util.concurrent.CompletableFuture; import java.util.concurrent.CompletableFuture;
@@ -34,6 +44,9 @@ import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit; import java.util.concurrent.TimeUnit;
import java.util.function.Supplier; import java.util.function.Supplier;
import static nu.marginalia.search.svc.SearchSiteInfoService.TrafficSample.*;
@Singleton
public class SearchSiteInfoService { public class SearchSiteInfoService {
private static final Logger logger = LoggerFactory.getLogger(SearchSiteInfoService.class); private static final Logger logger = LoggerFactory.getLogger(SearchSiteInfoService.class);
@@ -43,13 +56,17 @@ public class SearchSiteInfoService {
private final DbDomainQueries domainQueries; private final DbDomainQueries domainQueries;
private final FeedsClient feedsClient; private final FeedsClient feedsClient;
private final LiveCaptureClient liveCaptureClient; private final LiveCaptureClient liveCaptureClient;
private final DomSampleClient domSampleClient;
private final ScreenshotService screenshotService; private final ScreenshotService screenshotService;
private final HikariDataSource dataSource; private final HikariDataSource dataSource;
private final DDGTrackerData ddgTrackerData;
private final SearchSiteSubscriptionService searchSiteSubscriptions; private final SearchSiteSubscriptionService searchSiteSubscriptions;
private final RateLimiter rateLimiter = RateLimiter.custom(60); private final RateLimiter rateLimiter = RateLimiter.custom(60);
private final DomSampleClassifier domSampleClassifier;
@Inject @Inject
public SearchSiteInfoService(SearchOperator searchOperator, public SearchSiteInfoService(SearchOperator searchOperator,
DomainInfoClient domainInfoClient, DomainInfoClient domainInfoClient,
@@ -59,6 +76,9 @@ public class SearchSiteInfoService {
LiveCaptureClient liveCaptureClient, LiveCaptureClient liveCaptureClient,
ScreenshotService screenshotService, ScreenshotService screenshotService,
HikariDataSource dataSource, HikariDataSource dataSource,
DomSampleClient domSampleClient,
DomSampleClassifier domSampleClassifier,
DDGTrackerData ddgTrackerData,
SearchSiteSubscriptionService searchSiteSubscriptions) SearchSiteSubscriptionService searchSiteSubscriptions)
{ {
this.searchOperator = searchOperator; this.searchOperator = searchOperator;
@@ -70,6 +90,9 @@ public class SearchSiteInfoService {
this.liveCaptureClient = liveCaptureClient; this.liveCaptureClient = liveCaptureClient;
this.screenshotService = screenshotService; this.screenshotService = screenshotService;
this.dataSource = dataSource; this.dataSource = dataSource;
this.domSampleClient = domSampleClient;
this.domSampleClassifier = domSampleClassifier;
this.ddgTrackerData = ddgTrackerData;
this.searchSiteSubscriptions = searchSiteSubscriptions; this.searchSiteSubscriptions = searchSiteSubscriptions;
Thread.ofPlatform().name("Recently Added Domains Model Updater").start(this::modelUpdater); Thread.ofPlatform().name("Recently Added Domains Model Updater").start(this::modelUpdater);
@@ -154,6 +177,7 @@ public class SearchSiteInfoService {
case "links" -> listLinks(domainName, page); case "links" -> listLinks(domainName, page);
case "docs" -> listDocs(domainName, page); case "docs" -> listDocs(domainName, page);
case "info" -> listInfo(context, domainName); case "info" -> listInfo(context, domainName);
case "traffic" -> listSiteRequests(context, domainName);
case "report" -> reportSite(domainName); case "report" -> reportSite(domainName);
default -> listInfo(context, domainName); default -> listInfo(context, domainName);
}; };
@@ -239,6 +263,7 @@ public class SearchSiteInfoService {
String url = "https://" + domainName + "/"; String url = "https://" + domainName + "/";
boolean hasScreenshot = screenshotService.hasScreenshot(domainId); boolean hasScreenshot = screenshotService.hasScreenshot(domainId);
boolean isSubscribed = searchSiteSubscriptions.isSubscribed(context, domain); boolean isSubscribed = searchSiteSubscriptions.isSubscribed(context, domain);
boolean rateLimited = !rateLimiter.isAllowed(); boolean rateLimited = !rateLimiter.isAllowed();
@@ -368,6 +393,91 @@ public class SearchSiteInfoService {
); );
} }
private SiteInfoModel listSiteRequests(Context context, String domainName) {
if (!rateLimiter.isAllowed()) {
return forServiceUnavailable(domainName);
}
Optional<RpcDomainSample> sample = domSampleClient.getSample(domainName.toLowerCase());
if (sample.isEmpty()) {
return forNoData(domainName);
}
final EdgeDomain currentDomain = new EdgeDomain(domainName);
final List<RequestsForTargetDomain> requests = new ArrayList<>();
final Map<EdgeDomain, List<Map.Entry<EdgeUrl, RpcOutgoingRequest>>> urlsPerDomain = new HashMap<>();
final Set<EdgeUrl> seenUrls = new HashSet<>();
for (RpcOutgoingRequest rpcOutgoingRequest : sample.get().getOutgoingRequestsList()) {
Optional<EdgeUrl> parsedUrl = EdgeUrl.parse(rpcOutgoingRequest.getUrl());
if (parsedUrl.isEmpty())
continue;
final EdgeUrl url = parsedUrl.get();
if (url.domain.hasSameTopDomain(currentDomain))
continue;
if (!seenUrls.add(url))
continue;
urlsPerDomain
.computeIfAbsent(url.getDomain(), k -> new ArrayList<>())
.add(Map.entry(url, rpcOutgoingRequest));
}
Map<DomSampleClassification, Integer> requestSummary = new HashMap<>();
urlsPerDomain.forEach((requestDomain, urlsAndReqs) -> {
final List<RequestEndpoint> endpoints = new ArrayList<>();
for (Map.Entry<EdgeUrl, RpcOutgoingRequest> urlAndReq : urlsAndReqs) {
final EdgeUrl url = urlAndReq.getKey();
final RpcOutgoingRequest outgoingRequest = urlAndReq.getValue();
final DomSampleClassification clazz = domSampleClassifier.classifyRequest(url);
requestSummary.merge(clazz, 1, Integer::sum);
endpoints.add(
new RequestEndpoint(
url.path + (url.param == null ? "" : "?" + url.param),
outgoingRequest.getMethod().name(),
clazz
)
);
}
@Nullable
final DDGTDomain trackerData =
ddgTrackerData
.getDomainInfo(requestDomain.toString())
.orElse(null);
requests.add(
new RequestsForTargetDomain(
requestDomain,
endpoints,
trackerData
)
);
});
requests.sort(Comparator
.comparing((RequestsForTargetDomain req) -> req.endpoints.getFirst().classification.ordinal())
.thenComparing(req -> req.ownerDisplayName() == null)
.thenComparing(req -> req.domain.topDomain)
.thenComparing(req -> req.domain.toString()));
return new TrafficSample(domainName, requestSummary, requests);
}
public interface SiteInfoModel {
String domain();
}
public record Docs(String domain,
long domainId,
List<UrlDetails> results,
@@ -395,10 +505,6 @@ public class SearchSiteInfoService {
}
}
public interface SiteInfoModel {
String domain();
}
public record SiteInfoWithContext(String domain,
boolean isSubscribed,
List<DbDomainQueries.DomainWithNode> siblingDomains,
@@ -492,4 +598,108 @@ public class SearchSiteInfoService {
}
}
public record TrafficSample(String domain,
boolean hasData,
boolean serviceAvailable,
Map<DomSampleClassification, Integer> requestSummary,
List<RequestsForTargetDomain> requests) implements SiteInfoModel {
public static String classificationIcon(DomSampleClassification clazz) {
return switch (clazz) {
case ADS -> "fa-ad";
case TRACKING -> "fa-crosshairs";
case CONSENT -> "fa-shield-alt";
default -> "";
};
}
public static String classificationColor(DomSampleClassification clazz) {
return switch (clazz) {
case ADS -> "bg-red-100 text-red-800 dark:bg-red-900 dark:text-white dark:border dark:border-red-400";
case TRACKING -> "bg-purple-100 text-purple-800 dark:bg-purple-900 dark:text-white dark:border dark:border-purple-400";
case CONSENT -> "bg-yellow-100 text-yellow-800 dark:bg-yellow-900 dark:text-white dark:border dark:border-yellow-400";
default -> "";
};
}
public static String categoryColor(String category) {
return switch (category) {
case "Ad Motivated Tracking", "Tracking", "Advertising", "Third-Party Analytics Marketing", "Action Pixels", "Badge" -> "bg-red-100 text-red-800 dark:bg-red-900 dark:text-white dark:border dark:border-red-400";
case "CDN", "Fraud Prevention", "Online Payment", "Consent Management Platform", "SSO" -> "bg-green-100 text-green-800 dark:bg-green-900 dark:text-white dark:border dark:border-green-400";
case "Social - Comment", "Social - Share", "Social Network", "Federated Login" -> "bg-yellow-100 text-yellow-800 dark:bg-yellow-900 dark:text-white dark:border dark:border-yellow-400";
case "Session Replay", "Audience Measurement", "Analytics", "Tag Manager" -> "bg-purple-100 text-purple-800 dark:bg-purple-900 dark:text-white dark:border dark:border-purple-400";
case "Malware", "Ad Fraud", "Unknown High Risk Behavior", "Obscure Ownership" -> "bg-blue-100 text-blue-800 dark:bg-blue-900 dark:text-blue-200 dark:border dark:border-blue-400";
default -> "bg-gray-200 text-gray-800 dark:bg-gray-600 dark:text-gray-200 dark:border dark:border-gray-200";
};
}
public TrafficSample(String domain,
Map<DomSampleClassification, Integer> requestSummary,
List<RequestsForTargetDomain> requests
) {
this(domain, true, true, requestSummary, requests);
}
static TrafficSample forNoData(String domain) {
return new TrafficSample(domain, false, true, Map.of(), List.of());
}
static TrafficSample forServiceUnavailable(String domain) {
return new TrafficSample(domain, true, false, Map.of(), List.of());
}
public record RequestEndpoint(String path,
String method,
DomSampleClassification classification) {
}
public record RequestsForTargetDomain(EdgeDomain domain, List<RequestEndpoint> endpoints, @Nullable DDGTDomain ddgtTrackerInfo)
{
public List<String> ownerCategories() {
if (ddgtTrackerInfo == null) return List.of();
if (ddgtTrackerInfo.categories() == null) return List.of();
return ddgtTrackerInfo.categories();
}
@Nullable
public String ownerName() {
if (ddgtTrackerInfo == null)
return null;
if (ddgtTrackerInfo.owner() == null)
return null;
return ddgtTrackerInfo.owner().name();
}
@Nullable
public String ownerDisplayName() {
if (ddgtTrackerInfo == null)
return null;
if (ddgtTrackerInfo.owner() == null)
return null;
return ddgtTrackerInfo.owner().displayName();
}
@Nullable
public String ownerUrl() {
if (ddgtTrackerInfo == null)
return null;
if (ddgtTrackerInfo.owner() == null)
return null;
return ddgtTrackerInfo.owner().url();
}
@Nullable
public String ownerPolicy() {
if (ddgtTrackerInfo == null)
return null;
if (ddgtTrackerInfo.owner() == null)
return null;
return ddgtTrackerInfo.owner().privacyPolicy();
}
}
}
}
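The grouping and ordering logic in `listSiteRequests` above can be sketched in isolation. This is a minimal, hypothetical reconstruction (the record and enum shapes are made up, not the real Marginalia types): endpoints are bucketed per third-party domain with `computeIfAbsent`, and the domains are then sorted so the most severe classification (lowest enum ordinal) appears first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RequestGrouping {
    // Assumed ordering: lower ordinal = shown first, mirroring the diff's sort
    enum Classification { ADS, TRACKING, CONSENT, UNCLASSIFIED }

    record Endpoint(String path, Classification classification) {}
    record DomainRequests(String domain, List<Endpoint> endpoints) {}

    // Bucket endpoints per target domain, then order by worst classification
    static List<DomainRequests> group(List<Map.Entry<String, Endpoint>> requests) {
        Map<String, List<Endpoint>> byDomain = new HashMap<>();
        for (var req : requests) {
            // one bucket per third-party domain, as in listSiteRequests
            byDomain.computeIfAbsent(req.getKey(), k -> new ArrayList<>())
                    .add(req.getValue());
        }
        List<DomainRequests> out = new ArrayList<>();
        byDomain.forEach((domain, eps) -> out.add(new DomainRequests(domain, eps)));
        // chained Comparator: classification of the first endpoint, then domain name
        out.sort(Comparator
                .comparing((DomainRequests d) -> d.endpoints().get(0).classification().ordinal())
                .thenComparing(DomainRequests::domain));
        return out;
    }

    public static void main(String[] args) {
        var sorted = group(List.of(
                Map.entry("tracker.example", new Endpoint("/t.js", Classification.TRACKING)),
                Map.entry("ads.example", new Endpoint("/bid", Classification.ADS))));
        System.out.println(sorted.get(0).domain()); // ads.example sorts first
    }
}
```

The real code additionally breaks ties on tracker-owner metadata and top domain; the chained `Comparator.comparing(...).thenComparing(...)` pattern is the same.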


@@ -13,4 +13,21 @@ as we sometimes generate classes from Java code or javascript!
<input type="radio" name="suggestion" class="peer hidden" checked>
<div class="px-4 py-2 cursor-pointer dark:peer-checked:bg-gray-700 dark:hover:bg-gray-700 peer-checked:bg-gray-300 hover:bg-gray-300 w-full">
</div>
</label>
// case ADS -> "fa-ad";
case TRACKING -> "fa-crosshairs";
case CONSENT -> "fa-shield-alt";
default -> "";
};
}
<i class="bg-red-100 text-red-800 dark:bg-red-900 dark:text-red-200 dark:border dark:border-red-400"></i>
<i class="bg-purple-100 text-purple-800 dark:bg-purple-900 dark:text-purple-200 dark:border dark:border-purple-400"></i>
<i class="bg-yellow-100 text-yellow-800 dark:bg-yellow-900 dark:text-yellow-200 dark:border dark:border-yellow-400"></i>
<i class="bg-red-100 text-red-800 dark:bg-red-900 dark:text-red-200 dark:border dark:border-red-400"></i>
<i class="bg-green-100 text-green-800 dark:bg-green-900 dark:text-green-200 dark:border dark:border-green-400"></i>
<i class="bg-purple-100 text-purple-800 dark:bg-purple-900 dark:text-purple-200 dark:border dark:border-purple-400"></i>
<i class="bg-blue-100 text-blue-800 dark:bg-blue-900 dark:text-blue-200 dark:border dark:border-blue-400"></i>
<i class="bg-gray-200 text-gray-800 dark:bg-gray-600 dark:text-gray-200 dark:border dark:border-gray-200"></i>


@@ -38,35 +38,41 @@
<a href="https://old-search.marginalia.nu/" class="underline text-liteblue dark:text-blue-200">here</a>.
</div>
</div>
<div class="mx-auto flex flex-col sm:flex-row my-4 sm:space-x-2 space-y-2 sm:space-y-0 w-full md:w-auto px-2">
<div class="mx-auto flex flex-col sm:flex-row my-4 sm:space-x-2 space-y-2 sm:space-y-0 w-full md:w-auto px-2 items-center sm:items-stretch">
<div class="flex flex-col border border-gray-300 dark:border-gray-600 rounded overflow-hidden dark:bg-gray-800 bg-white p-6 space-y-3">
<div class="flex flex-col border border-gray-300 dark:border-gray-600 rounded overflow-hidden dark:bg-gray-800 bg-white p-8 sm:p-4 space-y-3 w-96 sm:w-64">
<div><i class="fas fa-sailboat mx-2 text-margeblue dark:text-slate-200"></i>Explore the Web</div>
<ul class="list-disc ml-6 text-slate-700 dark:text-white text-xs leading-5">
<ul class="list-disc ml-8 sm:ml-6 text-slate-700 dark:text-white text-xs leading-5">
<li>Prioritizes non-commercial content</li>
<li>Tools for both search and discovery</li>
<li>Find lost old websites</li>
</ul>
</div>
<div class="flex flex-col border border-gray-300 dark:border-gray-600 rounded overflow-hidden dark:bg-gray-800 bg-white p-6 space-y-3 ">
<div class="flex flex-col border border-gray-300 dark:border-gray-600 rounded overflow-hidden dark:bg-gray-800 bg-white p-8 sm:p-4 space-y-3 w-96 sm:w-64">
<div><i class="fas fa-hand-holding-hand mx-2 text-margeblue dark:text-slate-200"></i>Open Source</div>
<ul class="list-disc ml-6 text-slate-700 dark:text-white text-xs leading-5">
<ul class="list-disc ml-8 sm:ml-6 text-slate-700 dark:text-white text-xs leading-5">
<li>Custom index and crawler software</li>
<li>Simple technology -- no AI or cloud</li>
<li>Simple technology, no AI</li>
<li>AGPL license</li>
</ul>
<div class="text-xs text-liteblue dark:text-blue-200 pt-4">
<div class="flex pt-4 gap-2">
<i class="fas fa-link"></i>
<div class="text-xs text-liteblue dark:text-blue-200">
<i class="fa-brands fa-github"></i>
<a href="https://git.marginalia.nu/" class="underline">Git Repository</a>
</div>
<div class="text-xs text-liteblue dark:text-blue-200">
<i class="fa-brands fa-discord"></i>
<a href="https://discord.gg/GgpkrVbF" class="underline">Project Discord</a>
</div>
</div>
</div>
<div class="flex flex-col border border-gray-300 dark:border-gray-600 rounded overflow-hidden dark:bg-gray-800 bg-white p-6 space-y-3 ">
<div class="flex flex-col border border-gray-300 dark:border-gray-600 rounded overflow-hidden dark:bg-gray-800 bg-white p-8 sm:p-4 space-y-3 w-96 sm:w-64">
<div><i class="fas fa-lock mx-2 text-margeblue dark:text-slate-200"></i> Privacy by default</div>
<ul class="list-disc ml-6 text-slate-700 dark:text-white text-xs leading-5">
<ul class="list-disc ml-8 sm:ml-6 text-slate-700 dark:text-white text-xs leading-5">
<li>Filter out tracking and adtech</li>
<li>Filter out tracking</li>
<li>No user or search data shared with 3rd parties</li>
<li>No data shared with 3rd parties</li>
<li>No long-term retention of queries or IP addresses</li>
<li>No long-term retention of IPs</li>
</ul>
<div class="text-xs text-liteblue dark:text-blue-200 pt-4">
<i class="fas fa-link"></i>


@@ -16,7 +16,7 @@
<header class="border-b border-gray-300 dark:border-gray-600 bg-white dark:bg-gray-800 shadow-md">
<div class="max-w-[1400px] mx-auto p-4">
<div class="flex place-items-baseline space-x-2">
<span class="text-gray-900 dark:text-white text-md font-mono rounded-sm block p-2.5">
<span class="text-gray-900 dark:text-white break-none text-sm sm:text-md font-mono rounded-sm block p-2.5">
${model.domain()}
</span>
<span class="grow"></span>
@@ -57,8 +57,8 @@
</div>
<div class="mx-auto md:px-4 border dark:border-gray-600 bg-slate-50 dark:bg-gray-600">
<div class="flex md:space-x-2 max-w-[1000px] mx-auto">
<div class="has-[:checked]:bg-slate-200 dark:has-[:checked]:bg-slate-800 py-1 sm:px-2 px-1">
<div class="has-[:checked]:bg-slate-200 dark:has-[:checked]:bg-slate-800 py-1 px-2">
<a href="?view=info" class="whitespace-nowrap place-items-baseline space-x-1 text-gray-700 dark:text-white text-sm hover:text-gray-900 dark:hover:text-gray-200">
<a href="?view=info" class="whitespace-nowrap place-items-baseline space-x-1 text-gray-700 dark:text-white text-xs sm:text-sm hover:text-gray-900 dark:hover:text-gray-200">
@if (model instanceof SearchSiteInfoService.SiteInfoWithContext)
<input type="checkbox" class="sr-only hidden " checked readonly />
@else
@@ -71,8 +71,8 @@
</a>
</div>
<div class="has-[:checked]:bg-slate-200 dark:has-[:checked]:bg-slate-800 py-1 sm:px-2 px-1">
<div class="has-[:checked]:bg-slate-200 dark:has-[:checked]:bg-slate-800 py-1 px-2">
<a href="?view=docs" class="whitespace-nowrap place-items-baseline space-x-1 text-gray-700 dark:text-white text-sm hover:text-gray-900 dark:hover:text-gray-200">
<a href="?view=docs" class="whitespace-nowrap place-items-baseline space-x-1 text-gray-700 dark:text-white text-xs sm:text-sm hover:text-gray-900 dark:hover:text-gray-200">
@if (model instanceof SearchSiteInfoService.Docs)
<input type="checkbox" class="sr-only hidden absolute" checked readonly />
@else
@@ -81,12 +81,13 @@
<i class="fa-regular fa-file"></i>
<span>Documents</span>
<span class="hidden sm:inline">Documents</span>
<span class="inline sm:hidden">Docs</span>
</a>
</div>
<div class="has-[:checked]:bg-slate-200 dark:has-[:checked]:bg-slate-800 py-1 sm:px-2 px-1">
<div class="has-[:checked]:bg-slate-200 dark:has-[:checked]:bg-slate-800 py-1 px-2">
<a href="?view=links" class="whitespace-nowrap place-items-baseline space-x-1 text-gray-700 dark:text-white text-sm hover:text-gray-900 dark:hover:text-gray-200">
<a href="?view=links" class="whitespace-nowrap place-items-baseline space-x-1 text-gray-700 dark:text-white text-xs sm:text-sm hover:text-gray-900 dark:hover:text-gray-200">
@if (model instanceof SearchSiteInfoService.Backlinks)
<input type="checkbox" class="sr-only hidden absolute" checked readonly />
@else
@@ -95,12 +96,27 @@
<i class="fas fa-link"></i>
<span>Backlinks</span>
<span class="hidden sm:inline">Backlinks</span>
<span class="inline sm:hidden">Links</span>
</a>
</div>
<div class="has-[:checked]:bg-slate-200 dark:has-[:checked]:bg-slate-800 py-1 px-2">
<a href="?view=traffic" class="whitespace-nowrap place-items-baseline space-x-1 text-gray-700 dark:text-white text-xs sm:text-sm hover:text-gray-900 dark:hover:text-gray-200">
@if (model instanceof SearchSiteInfoService.TrafficSample)
<input type="checkbox" class="sr-only hidden absolute" checked readonly />
@else
<span></span>
@endif
<i class="fas fa-crosshairs"></i>
<span class="hidden sm:inline">Requests</span>
<span class="inline sm:hidden">Reqs</span>
</a>
</div>
<div class="grow"></div>
<div class="has-[:checked]:bg-slate-200 dark:has-[:checked]:bg-slate-800 py-1 sm:px-2 px-1">
<div class="has-[:checked]:bg-slate-200 dark:has-[:checked]:bg-slate-800 py-1 px-2">
<a href="?view=report" class="text-sm whitespace-nowrap place-items-baseline space-x-1 text-red-800 dark:text-red-200 text-sm hover:text-red-600 dark:hover:text-red-300">
<a href="?view=report" class="text-sm whitespace-nowrap place-items-baseline space-x-1 text-red-800 dark:text-red-200 text-xs sm:text-sm hover:text-red-600 dark:hover:text-red-300">
@if (model instanceof SearchSiteInfoService.ReportDomain)
<input type="checkbox" class="sr-only hidden absolute" checked readonly />
@else
@@ -126,6 +142,8 @@
@template.siteinfo.view.backlinks(backlinks = backlinks)
@elseif (model instanceof SearchSiteInfoService.Docs docs)
@template.siteinfo.view.docs(docs = docs)
@elseif (model instanceof SearchSiteInfoService.TrafficSample report)
@template.siteinfo.view.traffic(report = report)
@endif
</div>


@@ -148,7 +148,6 @@
</form>
@endif
@if (!siteInfo.siblingDomains().isEmpty())
<div class="mx-3 flex place-items-baseline space-x-2 p-2 bg-gray-100 dark:bg-gray-600 rounded">
<i class="fas fa-globe"></i>


@@ -0,0 +1,179 @@
@import nu.marginalia.domclassifier.DomSampleClassification
@import nu.marginalia.search.svc.SearchSiteInfoService.*
@import nu.marginalia.search.svc.SearchSiteInfoService.TrafficSample.RequestsForTargetDomain
@param TrafficSample report
<!-- Main content -->
<div class="flex flex-col space-y-2 w-full">
<div class="flex flex-col space-y-4 my-4">
@if (!report.serviceAvailable())
<div class="border border-gray-300 dark:border-gray-600 rounded bg-white dark:bg-gray-800 dark:text-white overflow-hidden mx-2 text-gray-800 text-sm">
<div class="flex place-items-center space-x-2 p-2 text-md border-b dark:border-gray-600 bg-margeblue text-white">
<span>Third-Party Requests</span>
</div>
<div class="p-4">
This service is currently being relentlessly scraped by bots and access
is disabled until they give up.
</div>
</div>
@elseif (!report.hasData())
<div class="border border-gray-300 dark:border-gray-600 rounded bg-white dark:bg-gray-800 dark:text-white overflow-hidden mx-2 text-gray-800 text-sm">
<div class="flex place-items-center space-x-2 p-2 text-md border-b dark:border-gray-600 bg-margeblue text-white">
<span>Third-Party Requests</span>
</div>
<div class="p-4">
The database of third party requests is still being assembled, and the
search engine doesn't yet have any information about <span class="inline font-mono text-pink-800 dark:text-pink-200">${report.domain()}</span>.
<p class="mt-4"></p>
Be patient. Several million websites need to be visited and assessed,
each visit taking up to 30 seconds. At the current rate, it is expected
the full database will be complete around the end of 2025, or early 2026.
</div>
</div>
@else
<div class="border border-gray-300 dark:border-gray-600 rounded bg-white dark:bg-gray-800 dark:text-white overflow-hidden mx-2 text-gray-800 text-sm">
<div class="flex place-items-center space-x-2 p-2 text-md border-b dark:border-gray-600 bg-margeblue text-white">
<span>Third-Party Requests</span>
</div>
<div class="p-4">
To better understand what <span class="inline font-mono text-pink-800 dark:text-pink-200">${report.domain()}</span> is doing
in the background as you visit the website, the search engine records which third-party servers it talks to.
<p class="mt-2"></p>
To help make sense of the recorded network traffic, the report is supplemented with information from
<a href="https://github.com/duckduckgo/tracker-radar/" class="text-blue-800 dark:text-blue-200 underline" rel="external">DuckDuckGo's Tracker Radar</a>,
subject to the CC BY-NC-SA 4.0 license.
<details class="mt-2">
<summary class="text-gray-600 hover:text-gray-700 dark:text-gray-400 hover:dark:text-gray-300 cursor-pointer select-none">
Learn More
</summary>
<p class="mt-2">
The search engine classifies third party requests into four buckets, based on their apparent purpose.
</p>
<p class="mt-2">
<span class="text-red-600 dark:text-red-400"><i class="fa fa-ad"></i> Advertisement</span> requests are involved in the bidding or display of advertisements, or the tracking
of ad impressions. They do not guarantee ads will be present on the website, as the advertisement
broker may decide it's not economic to place an ad for any particular visitor, but it is on the other hand virtually
impossible for ads to be present if this type of activity is not found.
</p>
<p class="mt-2">
<span class="text-purple-600 dark:text-purple-400"><i class="fa fa-crosshairs"></i> Tracking</span> requests analyze user behavior on the web, sometimes with the purpose of building a profile
for advertisement using cookies or browser fingerprinting technologies, other times the traffic exists only to help understand what visitors are doing on a website
for the benefit of the webmasters.
</p>
<p class="mt-2">
<span class="text-orange-600 dark:text-orange-400"><i class="fa fa-shield-alt"></i> Consent</span> requests manage GDPR or cookie consent popups, and similar nuisances.
In general, tracking and advertisement scripts are not run until a consent popup is dismissed. The system will try to automatically
agree to tracking consent popups when it can identify them in order to also capture these deferred requests, but this is not always successful,
so the presence of consent requests alone is a weak indicator a website may intend to load tracking or advertisement scripts.
</p>
<p class="mt-2">
<span class="text-gray-600 dark:text-gray-400"><i class="fa fa-question-circle"></i> Unclassified</span> requests are requests the system doesn't know what they are. Often these are
requests to content-delivery networks intended to reduce the network traffic to the server hosting the website and speed up page loads.
</p>
<p class="mt-2"></p>
This data is continuously updated, but updates are fairly
slow so the information may not be fully up to date.
</details>
</div>
</div>
@endif
</div>
@if (report.hasData())
<div class="mx-2">
<div class="flex place-items-center space-x-2 p-2 text-md border-b dark:border-gray-600 bg-margeblue text-white rounded border mb-4">
<span>Summary</span>
</div>
<!-- Summary Stats -->
<div class="grid grid-cols-4 gap-4 mb-8">
<div class="bg-white rounded p-4 shadow-sm border dark:bg-gray-800 dark:border-gray-600 place-items-center">
<div class="text-2xl font-bold text-red-600 dark:text-red-400">${report.requestSummary().getOrDefault(DomSampleClassification.ADS, 0)}</div>
<div class="text-sm text-gray-600 dark:text-gray-400">Ads</div>
</div>
<div class="bg-white rounded p-4 shadow-sm border dark:bg-gray-800 dark:border-gray-600 place-items-center">
<div class="text-2xl font-bold text-purple-600 dark:text-purple-400">${report.requestSummary().getOrDefault(DomSampleClassification.TRACKING, 0)}</div>
<div class="text-sm text-gray-600 dark:text-gray-400">Tracking</div>
</div>
<div class="bg-white rounded p-4 shadow-sm border dark:bg-gray-800 dark:border-gray-600 place-items-center">
<div class="text-2xl font-bold text-orange-600 dark:text-orange-400">${report.requestSummary().getOrDefault(DomSampleClassification.CONSENT, 0)}</div>
<div class="text-sm text-gray-600 dark:text-gray-400">Consent</div>
</div>
<div class="bg-white rounded p-4 shadow-sm border dark:bg-gray-800 dark:border-gray-600 place-items-center">
<div class="text-2xl font-bold text-gray-600 dark:text-gray-400">${report.requestSummary().getOrDefault(DomSampleClassification.UNCLASSIFIED, 0)}</div>
<div class="text-sm text-gray-600 dark:text-gray-400">Other</div>
</div>
</div>
</div>
<!-- Domain Groups -->
<div class="space-y-4 mx-2">
<div class="flex place-items-center space-x-2 p-2 text-md border-b dark:border-gray-600 bg-margeblue text-white rounded border">
<span>Breakdown</span>
</div>
@if (report.requests().isEmpty())
<div class="border border-gray-300 dark:border-gray-600 rounded bg-white dark:bg-gray-800 dark:text-white flex flex-col overflow-hidden p-4 mx-2 text-gray-800 text-sm">
No third-party requests were made!
</div>
@endif
@for (RequestsForTargetDomain request : report.requests())
<!-- Google Analytics Domain -->
<div class="bg-white rounded shadow-sm border border-gray-200 dark:bg-gray-800 dark:border-gray-600">
<div class="p-2 md:p-6 border-b border-gray-100 dark:border-gray-600">
<div class="flex items-start justify-between flex-col md:flex-row gap-2">
<div class="flex-1">
<h3 class="text-lg font-semibold dark:text-gray-100 text-gray-900 font-mono">${request.domain().toString()}</h3>
@if (request.ownerDisplayName() != null)
<p class="text-sm text-gray-600 dark:text-gray-400 mt-1">${request.ownerDisplayName()}</p>
@elseif (request.ownerName() != null)
<p class="text-sm text-gray-600 dark:text-gray-400 mt-1">${request.ownerName()}</p>
@endif
<div class="flex items-center gap-4 mt-3">
@if (request.ownerUrl() != null)
<a href="${request.ownerUrl()}" rel="external nofollow" class="text-blue-600 dark:text-blue-200 text-sm flex flex-row place-items-baseline gap-1">
<i class="fas fa-external-link-alt text-xs"></i> Visit Site
</a>
@endif
@if (request.ownerPolicy() != null)
<a href="${request.ownerPolicy()}" rel="external nofollow" class="text-blue-600 dark:text-blue-200 text-sm flex flex-row place-items-baseline gap-1">
<i class="fas fa-shield-alt text-xs"></i> Privacy Policy
</a>
@endif
</div>
</div>
<div class="flex flex-wrap justify-end gap-2 md:ml-2">
@for (String tag : request.ownerCategories())
<span class="px-2 py-1 ${TrafficSample.categoryColor(tag)} text-xs rounded">${tag}</span>
@endfor
</div>
</div>
</div>
<div class="p-4">
<div class="space-y-3">
@for (var req : request.endpoints())
<div class="flex items-center justify-between py-2 px-3 bg-gray-100 dark:bg-gray-600 rounded-lg">
<div class="flex items-center gap-3">
<div class="text-xs text-gray-500 dark:text-gray-100 font-mono">${req.method()}</div>
<span class="text-sm text-gray-600 dark:text-white font-mono break-all">${req.path()}</span>
</div>
@if (req.classification() != DomSampleClassification.UNCLASSIFIED)
<span class="px-2 py-1 bg-orange-100 text-orange-800 text-xs rounded flex flex-row place-items-baseline gap-1 ${TrafficSample.classificationColor(req.classification())}">
<i class="fa ${TrafficSample.classificationIcon(req.classification())}"></i> ${req.classification().name()}</span>
@endif
</div>
@endfor
</div>
</div>
</div>
@endfor
</div>
@endif
</div>
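The "Learn More" prose in the template above describes four classification buckets (advertisement, tracking, consent, unclassified), and the commit notes mention matching classifier regexes against both the full URL and the path + query. A minimal sketch of that idea, with entirely made-up patterns rather than the real Marginalia rule set, might look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class UrlClassifierSketch {
    enum Classification { ADS, TRACKING, CONSENT, UNCLASSIFIED }

    // Hypothetical rules; first match wins, so order matters (LinkedHashMap)
    private static final Map<Pattern, Classification> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("doubleclick\\.net"), Classification.ADS);
        RULES.put(Pattern.compile("/collect\\b"), Classification.TRACKING);
        RULES.put(Pattern.compile("consent|cmp"), Classification.CONSENT);
    }

    // Each rule is tried against the full URL and against just path + query
    public static Classification classify(String fullUrl, String pathAndQuery) {
        for (var rule : RULES.entrySet()) {
            if (rule.getKey().matcher(fullUrl).find()
                    || rule.getKey().matcher(pathAndQuery).find())
                return rule.getValue();
        }
        return Classification.UNCLASSIFIED;
    }

    public static void main(String[] args) {
        // a tracking beacon recognized by its path alone
        System.out.println(classify("https://stats.example.com/collect?v=1", "/collect?v=1"));
    }
}
```

Matching the path + query separately lets a rule like `/collect` fire regardless of which host serves the beacon, while host-based rules still catch well-known ad domains.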


@@ -87,6 +87,7 @@ public class JtePaperDoll {
"results", ret)
)
);
Spark.get("/site-info",
(rq, rs) -> {
if ("links".equals(rq.queryParams("view"))) {
@@ -98,6 +99,9 @@
else if ("report".equals(rq.queryParams("view"))) {
return MockedSearchResults.mockReportDomain();
}
else if ("traffic".equals(rq.queryParams("view"))) {
return MockedSearchResults.mockTrafficReport();
}
else return MockedSearchResults.mockSiteInfoData();
},


@@ -7,6 +7,9 @@ import nu.marginalia.api.searchquery.model.results.SearchResultItem;
import nu.marginalia.browse.model.BrowseResult;
import nu.marginalia.browse.model.BrowseResultSet;
import nu.marginalia.db.DbDomainQueries;
import nu.marginalia.ddtrackergradar.model.DDGTDomain;
import nu.marginalia.ddtrackergradar.model.DDGTOwner;
import nu.marginalia.domclassifier.DomSampleClassification;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.crawl.DomainIndexingState;
@@ -19,6 +22,7 @@ import nu.marginalia.search.svc.SearchSiteInfoService;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
public class MockedSearchResults {
@@ -271,4 +275,49 @@ public class MockedSearchResults {
List.of(mockUrlDetails("https://www.example.com/some-incredibly-long-address-that-goes-on-and-on", "One document")),
List.of(mockUrlDetails("https://other.example.com/", "Other document")));
}
public static Object mockTrafficReport() {
List<SearchSiteInfoService.TrafficSample.RequestsForTargetDomain> requests = new ArrayList<>();
requests.add(new SearchSiteInfoService.TrafficSample.RequestsForTargetDomain(
new EdgeDomain("hotjar.com"),
List.of(new SearchSiteInfoService.TrafficSample.RequestEndpoint("/foo.js", "POST", DomSampleClassification.TRACKING)),
new DDGTDomain(
"hotjar.com",
new DDGTOwner("Hotjar Ltd", "Hotjar", "https://www.example.com/", "https://www.hotjar.com/"),
List.of("Tracking", "Session Replay"),
List.of()
)
));
requests.add(new SearchSiteInfoService.TrafficSample.RequestsForTargetDomain(
new EdgeDomain("doubleclick.net"),
List.of(new SearchSiteInfoService.TrafficSample.RequestEndpoint("/foo.js", "GET", DomSampleClassification.TRACKING),
new SearchSiteInfoService.TrafficSample.RequestEndpoint("/bar.js", "GET", DomSampleClassification.TRACKING)),
new DDGTDomain(
"doubleclick.net",
new DDGTOwner("Doubleclick Inc", "Doubleclick", "https://www.example.com/", "https://www.hotjar.com/"),
List.of("CDN", "Advertising"),
List.of()
)
));
requests.add(new SearchSiteInfoService.TrafficSample.RequestsForTargetDomain(
new EdgeDomain("sketchy.org"),
List.of(new SearchSiteInfoService.TrafficSample.RequestEndpoint("/foo.js", "GET", DomSampleClassification.ADS),
new SearchSiteInfoService.TrafficSample.RequestEndpoint("/bar.js", "GET", DomSampleClassification.CONSENT)),
new DDGTDomain(
"sketchy.org",
new DDGTOwner("Doubious AB", "Legit Enterprises", "https://www.example.com/", "https://www.hotjar.com/"),
List.of("Malware", "Social - Comment"),
List.of()
)
));
return new SearchSiteInfoService.TrafficSample(
"example.com",
Map.of(
DomSampleClassification.ADS, 3,
DomSampleClassification.TRACKING, 10
),
requests
);
}
}


@@ -5,6 +5,7 @@ import com.google.inject.Inject;
import io.jooby.Context;
import io.jooby.Jooby;
import nu.marginalia.assistant.suggest.Suggestions;
import nu.marginalia.domsample.DomSampleGrpcService;
import nu.marginalia.domsample.DomSampleService;
import nu.marginalia.functions.domains.DomainInfoGrpcService;
import nu.marginalia.functions.math.MathGrpcService;
@@ -22,7 +23,6 @@ import java.util.List;
public class AssistantService extends JoobyService {
private final Logger logger = LoggerFactory.getLogger(getClass());
private final Gson gson = GsonFactory.get();
@org.jetbrains.annotations.NotNull
private final ScreenshotService screenshotService;
private final Suggestions suggestions;
@@ -32,6 +32,7 @@ public class AssistantService extends JoobyService {
DomainInfoGrpcService domainInfoGrpcService,
LiveCaptureGrpcService liveCaptureGrpcService,
DomSampleService domSampleService,
DomSampleGrpcService domSampleGrpcService,
FeedsGrpcService feedsGrpcService,
MathGrpcService mathGrpcService,
Suggestions suggestions)
@@ -41,7 +42,9 @@ public class AssistantService extends JoobyService {
List.of(domainInfoGrpcService,
mathGrpcService,
liveCaptureGrpcService,
feedsGrpcService),
feedsGrpcService,
domSampleGrpcService
),
List.of());
this.screenshotService = screenshotService;


@@ -36,6 +36,7 @@ dependencies {
     implementation project(':code:common:linkdb')
     implementation project(':code:common:service')
     implementation project(':code:common:model')
+    implementation project(':code:functions:live-capture:api')
     implementation libs.bundles.slf4j
     implementation libs.bundles.grpc


@@ -9,6 +9,7 @@ import gnu.trove.list.array.TIntArrayList;
 import nu.marginalia.IndexLocations;
 import nu.marginalia.LanguageModels;
 import nu.marginalia.WmsaHome;
+import nu.marginalia.api.domsample.DomSampleClient;
 import nu.marginalia.db.DomainTypes;
 import nu.marginalia.index.domainrankings.DomainRankings;
 import nu.marginalia.index.journal.IndexJournalSlopWriter;
@@ -37,6 +38,7 @@ import java.sql.SQLException;
 import java.util.ArrayList;
 import java.util.Random;
 import java.util.UUID;
+import java.util.concurrent.CompletableFuture;
 import static nu.marginalia.linkdb.LinkdbFileNames.DOCDB_FILE_NAME;
 import static nu.marginalia.linkdb.LinkdbFileNames.DOMAIN_LINKS_FILE_NAME;
@@ -85,6 +87,9 @@ public class IntegrationTestModule extends AbstractModule {
         bind(FileStorageService.class).toInstance(fileStorageServiceMock);
         bind(ServiceHeartbeat.class).toInstance(new FakeServiceHeartbeat());
         bind(ProcessHeartbeat.class).toInstance(new FakeProcessHeartbeat());
+        DomSampleClient domSampleClientMock = Mockito.mock(DomSampleClient.class);
+        when(domSampleClientMock.getSampleAsync(any(), any())).thenReturn(CompletableFuture.failedFuture(new RuntimeException()));
+        bind(DomSampleClient.class).toInstance(domSampleClientMock);
         SearchSetsService setsServiceMock = Mockito.mock(SearchSetsService.class);
         when(setsServiceMock.getSearchSetByName("NONE")).thenReturn(new SearchSetAny());
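The integration-test hunk above binds a `DomSampleClient` whose `getSampleAsync` always returns a failed `CompletableFuture`, exercising the "converter works even when dom sample data is unavailable" path from the commit messages. A minimal standalone sketch of that consumer-side pattern, using plain `CompletableFuture` instead of Mockito (the `SampleClient` interface and `fetchSample` helper are hypothetical names, not the project's API):

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for DomSampleClient's async lookup
interface SampleClient {
    CompletableFuture<String> getSampleAsync(String domain, String type);
}

public class FallbackSketch {
    // Degrade gracefully: a failed future becomes Optional.empty()
    // rather than propagating the exception to the caller.
    static Optional<String> fetchSample(SampleClient client, String domain) {
        try {
            return Optional.of(client.getSampleAsync(domain, "HTML").join());
        } catch (Exception e) {
            return Optional.empty(); // sample data unavailable; continue without it
        }
    }

    public static void main(String[] args) {
        SampleClient failing =
                (d, t) -> CompletableFuture.failedFuture(new RuntimeException("no sample"));
        System.out.println(fetchSample(failing, "example.com").isPresent()); // prints false
    }
}
```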


@@ -83,6 +83,7 @@ include 'code:common:renderer'
 include 'code:processes:converting-process'
 include 'code:processes:converting-process:model'
 include 'code:processes:converting-process:ft-keyword-extraction'
+include 'code:processes:converting-process:ft-dom-classifier'
 include 'code:processes:crawling-process'
 include 'code:processes:crawling-process:model'
@@ -159,7 +160,7 @@ dependencyResolutionManagement {
         library('protobuf', 'com.google.protobuf', 'protobuf-java').version('3.16.3')
         library('grpc-protobuf', 'io.grpc', 'grpc-protobuf').version('1.73.0')
         library('grpc-stub', 'io.grpc', 'grpc-stub').version('1.73.0')
-        library('grpc-netty', 'io.grpc', 'grpc-netty-shaded').version('1.73.0')
+        library('grpc-netty', 'io.grpc', 'grpc-netty').version('1.73.0')
         library('prometheus', 'io.prometheus', 'simpleclient').version('0.16.0')
         library('prometheus-servlet', 'io.prometheus', 'simpleclient_servlet').version('0.16.0')