mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

Compare commits


45 Commits

Author SHA1 Message Date
Viktor Lofgren
251006d4f9 (url) Fix urlencoding issues with certain symbols
Problems primarily cropped up with sideloaded Wikipedia articles, though the search engine has been returning inconsistently URL-encoded search results for a while; browsers and servers have seemingly magically fixed the issues in many scenarios.

This addresses Issue #195 and Issue #131.
2025-05-03 23:48:45 +02:00
Viktor Lofgren
c3e99dc12a (service) Limit logging from ad hoc task heartbeats
Certain usage patterns of the ad hoc task heartbeats would lead to an incredible amount of log noise, as every update was logged.

Limit log updates to increments of 10% to avoid this problem.
2025-05-03 12:39:58 +02:00
Viktor
aaaa2de022 Merge pull request #196 from MarginaliaSearch/filter-export-sample-data
Add the ability to filter sample data based on content type
2025-05-02 13:23:49 +02:00
Viktor Lofgren
fc1388422a (actor) Add the ability to filter sample data based on content type
This will help in extracting relevant test sets for PDF processing.
2025-05-02 13:09:22 +02:00
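(The diffs further down only show the new ctFilter string being plumbed through the RPC and actor layers; the matching logic itself is not part of this compare. As a purely hypothetical sketch, such a filter might be applied as a predicate over content types; all names here are illustrative rather than taken from the codebase:

    import java.util.function.Predicate;

    class ContentTypeFilters {
        // Hypothetical helper: builds a predicate from a filter string such as "application/pdf".
        // An empty filter keeps everything; otherwise match on the content type's prefix so
        // parameters like "; charset=utf-8" don't prevent a match.
        static Predicate<String> fromFilterString(String ctFilter) {
            if (ctFilter == null || ctFilter.isBlank())
                return contentType -> true;
            String needle = ctFilter.toLowerCase();
            return contentType -> contentType != null && contentType.toLowerCase().startsWith(needle);
        }
    }

Usage would be along the lines of ContentTypeFilters.fromFilterString("application/pdf").test(contentType) while iterating the sample.)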
Viktor Lofgren
b07080db16 (crawler) Don't retry requests when encountering UnknownHostException 2025-05-01 16:07:34 +02:00
Viktor Lofgren
e9d86dca4a (crawler) Add timeout to wrap-up phase of WarcInputBuffer. 2025-05-01 15:57:47 +02:00
Viktor Lofgren
1d693f0efa (build) Upgrade JIB to 3.4.5 2025-04-30 15:26:52 +02:00
Viktor Lofgren
5874a163dc (build) Upgrade gradle to 8.14 2025-04-30 15:26:37 +02:00
Viktor Lofgren
5ec7a1deab (crawler) Fix 80%-ish progress crawler stall
The crawl tasks are started in two phases: first a loop that generates and submits them, then a second loop that drains the task list. If the first loop contains a long-running crawl task that is triggered late, the rest of the crawl may halt until that task is finished.

Fixed the problem by also draining and retrying the task list in the first loop.
2025-04-29 12:23:51 +02:00
Viktor Lofgren
7fea2808ed (search) Fix error view
Fix rendering error when query was null

Fix border on error message.
2025-04-27 12:12:56 +02:00
Viktor Lofgren
8da74484f0 (search) Remove unused count modifier from the footer help 2025-04-27 12:08:34 +02:00
Viktor Lofgren
923d5a7234 (search) Add a note for TUI users pointing them to the old UI 2025-04-27 11:52:07 +02:00
Viktor Lofgren
58f88749b8 (deploy) assistant 2025-04-25 13:25:50 +02:00
Viktor Lofgren
77f727a5ba (crawler) Alter conditional request logic to avoid sending both If-None-Match and If-Modified-Since
It seems like some servers dislike this combination, and may turn a 304 into a 200.
2025-04-25 13:19:07 +02:00
Viktor Lofgren
667cfb53dc (assistant) Remove more link text junk from suggestions at loadtime. 2025-04-24 13:35:29 +02:00
Viktor Lofgren
fe36d4ed20 (deploy) Executor services 2025-04-24 13:23:51 +02:00
Viktor Lofgren
acf4bef98d (assistant) Improve search suggestions
Improve suggestions by loading a secondary suggestions set with link text data.
2025-04-24 13:10:59 +02:00
Viktor Lofgren
2a737c34bb (search) Improve suggestions UX
Fix the highlight colors when arrowing through search suggestions.  Also fix the suggestions box for dark mode.
2025-04-24 12:34:05 +02:00
Viktor Lofgren
90a577af82 (search) Improve suggestions UX 2025-04-24 00:32:25 +02:00
Viktor
f0c9b935d8 Merge pull request #192 from MarginaliaSearch/improve-suggestions
Improve typeahead suggestions
2025-04-23 20:17:49 +02:00
Viktor Lofgren
7b5493dd51 (assistant) Improve typeahead suggestions
Implement a new prefix search structure (not a trie, but hash table based) with a concept of score.
2025-04-23 20:13:53 +02:00
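A minimal sketch of the idea described here, with illustrative names rather than the actual PrefixSearchStructure in the codebase: each term is registered under every one of its prefixes in a hash table, and candidates are ranked by a stored score at lookup time.

    import java.util.*;

    class ScoredPrefixIndex {
        private record Entry(String term, int score) {}

        private static final int MAX_PREFIX = 8;

        // Each prefix (up to MAX_PREFIX chars) maps to the terms it can complete
        private final Map<String, List<Entry>> buckets = new HashMap<>();

        void insert(String term, int score) {
            for (int len = 1; len <= Math.min(term.length(), MAX_PREFIX); len++) {
                buckets.computeIfAbsent(term.substring(0, len), k -> new ArrayList<>())
                       .add(new Entry(term, score));
            }
        }

        List<String> suggest(String prefix, int n) {
            String key = prefix.substring(0, Math.min(prefix.length(), MAX_PREFIX));
            return buckets.getOrDefault(key, List.of()).stream()
                    .filter(e -> e.term().startsWith(prefix)) // needed when the prefix is longer than MAX_PREFIX
                    .sorted(Comparator.comparingInt(Entry::score).reversed())
                    .limit(n)
                    .map(Entry::term)
                    .toList();
        }
    }

Unlike a trie, a lookup is a single hash probe plus a scan of one bucket, at the cost of storing each term once per prefix length.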
Viktor Lofgren
c246a59158 (search) Make it clearer that it's a search engine 2025-04-22 16:03:42 +02:00
Viktor
0b99781d24 Merge pull request #191 from MarginaliaSearch/pdf-support-in-crawler
Pdf support in crawler
2025-04-22 15:52:41 +02:00
Viktor Lofgren
39db9620c1 (crawler) Increase maximum permitted file size to 32 MB 2025-04-22 15:51:03 +02:00
Viktor Lofgren
1781599363 (crawler) Add support for crawling PDF files 2025-04-22 15:50:05 +02:00
Viktor Lofgren
6b2d18fb9b (crawler) Adjust domain limits to be generally more permissive. 2025-04-22 15:27:57 +02:00
Viktor
59b1d200ab Merge pull request #190 from MarginaliaSearch/download-sample-chores
Download sample chores
2025-04-22 13:29:49 +02:00
Viktor Lofgren
897010a2cf (control) Update download sample data actor with better UI
The original implementation didn't really give a lot of feedback about what it was doing. This change adds a progress bar to the download step.

Relates to issue 189.
2025-04-22 13:27:22 +02:00
Viktor Lofgren
602af7a77e (control) Update UI with new sample sizes
Relates to issue 189.
2025-04-22 13:27:13 +02:00
Viktor Lofgren
a7d91c8527 (crawler) Clean up fetcher detailed logging 2025-04-21 12:53:52 +02:00
Viktor Lofgren
7151602124 (crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
Cleaning up after changes.
2025-04-21 12:47:03 +02:00
Viktor Lofgren
884e33bd4a (crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
Change back to an unbounded queue, tighten sleep times a bit.
2025-04-21 11:48:15 +02:00
Viktor Lofgren
e84d5c497a (crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
Change to a bounded queue and add a sleep to reduce the number of effectively busy-looping threads.
2025-04-21 00:39:26 +02:00
Viktor Lofgren
2d2d3e2466 (crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
Change to a bounded queue and add a sleep to reduce the number of effectively busy-looping threads.
2025-04-21 00:36:48 +02:00
Viktor Lofgren
647dd9b12f (crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready 2025-04-21 00:24:30 +02:00
Viktor Lofgren
de4e2849ce (crawler) Tweak request retry counts
Increase the default number of tries to 3, but don't retry on SSL errors as they are unlikely to fix themselves in the short term.
2025-04-19 00:19:48 +02:00
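HttpFetcherImpl implements Apache HttpClient 5's HttpRequestRetryStrategy (see the HttpFetcherImpl diff below), but the retry logic itself is not part of this compare. A hedged sketch of what "3 tries, but no retry on SSL errors" could look like, folding in the UnknownHostException case from the commit above; the class name and details are assumptions, not the verbatim implementation:

    import java.io.IOException;
    import java.net.UnknownHostException;
    import javax.net.ssl.SSLException;

    import org.apache.hc.client5.http.HttpRequestRetryStrategy;
    import org.apache.hc.core5.http.HttpRequest;
    import org.apache.hc.core5.http.HttpResponse;
    import org.apache.hc.core5.http.protocol.HttpContext;
    import org.apache.hc.core5.util.TimeValue;

    class CrawlerRetryStrategy implements HttpRequestRetryStrategy {
        private static final int MAX_TRIES = 3;

        @Override
        public boolean retryRequest(HttpRequest request, IOException exception, int execCount, HttpContext context) {
            if (exception instanceof SSLException) return false;         // unlikely to fix itself in the short term
            if (exception instanceof UnknownHostException) return false; // DNS failures won't resolve by retrying
            return execCount < MAX_TRIES;
        }

        @Override
        public boolean retryRequest(HttpResponse response, int execCount, HttpContext context) {
            // Only retry transient server-side conditions
            int code = response.getCode();
            return (code == 429 || code == 503) && execCount < MAX_TRIES;
        }

        @Override
        public TimeValue getRetryInterval(HttpResponse response, int execCount, HttpContext context) {
            return TimeValue.ofSeconds(1);
        }
    }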
Viktor Lofgren
3c43f1954e (crawler) Add custom cookie store implementation
Apache HttpClient's cookie implementation builds an enormous concurrent hashmap with every cookie for every domain ever crawled.  This is a big waste of resources.

Replacing it with a fairly crude domain-isolated instance, as we are primarily interested in answering whether a cookie is set, and we will never retain cookies long term.
2025-04-18 13:04:22 +02:00
Viktor Lofgren
fa2462ec39 (crawler) Re-enable aborts on timeout 2025-04-18 12:59:34 +02:00
Viktor Lofgren
f4ad7145db (crawler) Disable SO_LINGER 2025-04-18 01:42:02 +02:00
Viktor Lofgren
068b450180 (crawler) Temporarily disable request.abort() 2025-04-18 01:25:56 +02:00
Viktor Lofgren
05b909a21f (crawler) Add logging to get more info about connection leak 2025-04-18 01:06:52 +02:00
Viktor Lofgren
3d179cddce (crawler) Correctly consume entity in sitemap retrieval 2025-04-18 00:32:21 +02:00
Viktor Lofgren
1a2aae496a (crawler) Correct handling and abortion of HttpClient's requests
There was a resource leak in the initial Apache HttpClient implementation of the WarcInputBuffer, which failed to free up its resources.

Using HttpGet objects instead of the Classic...Request objects, as the latter fail to expose an abort() method.
2025-04-18 00:16:26 +02:00
Viktor Lofgren
353cdffb3f (crawler) Increase connection request timeout, restore congestion timeout 2025-04-17 21:32:06 +02:00
Viktor Lofgren
2e3f1313c7 (crawler) Log exceptions while crawling in crawler audit log 2025-04-17 21:18:09 +02:00
69 changed files with 1485 additions and 631 deletions

View File

@@ -5,7 +5,7 @@ plugins {
     // This is a workaround for a bug in the Jib plugin that causes it to stall randomly
     // https://github.com/GoogleContainerTools/jib/issues/3347
-    id 'com.google.cloud.tools.jib' version '3.4.4' apply(false)
+    id 'com.google.cloud.tools.jib' version '3.4.5' apply(false)
 }

 group 'marginalia'

@@ -47,7 +47,7 @@ ext {
     dockerImageBase='container-registry.oracle.com/graalvm/jdk:24'
     dockerImageTag='latest'
     dockerImageRegistry='marginalia'
-    jibVersion = '3.4.4'
+    jibVersion = '3.4.5'
 }

 idea {

View File

@@ -1,16 +1,14 @@
 package nu.marginalia.model;

 import nu.marginalia.util.QueryParams;
+import org.apache.commons.lang3.StringUtils;

 import javax.annotation.Nullable;
 import java.io.Serializable;
-import java.net.MalformedURLException;
-import java.net.URI;
-import java.net.URISyntaxException;
-import java.net.URL;
+import java.net.*;
+import java.nio.charset.StandardCharsets;
 import java.util.Objects;
 import java.util.Optional;
-import java.util.regex.Pattern;

 public class EdgeUrl implements Serializable {
     public final String proto;
@@ -33,7 +31,7 @@ public class EdgeUrl implements Serializable {
     private static URI parseURI(String url) throws URISyntaxException {
         try {
-            return new URI(urlencodeFixer(url));
+            return EdgeUriFactory.uriFromString(url);
         } catch (URISyntaxException ex) {
             throw new URISyntaxException("Failed to parse URI '" + url + "'", ex.getMessage());
         }
@@ -51,58 +49,6 @@ public class EdgeUrl implements Serializable {
         }
     }

-    private static Pattern badCharPattern = Pattern.compile("[ \t\n\"<>\\[\\]()',|]");
-
-    /* Java's URI parser is a bit too strict in throwing exceptions when there's an error.
-       Here on the Internet, standards are like the picture on the box of the frozen pizza,
-       and what you get is more like what's on the inside, we try to patch things instead,
-       just give it a best-effort attempt att cleaning out broken or unnecessary constructions
-       like bad or missing URLEncoding
-    */
-    public static String urlencodeFixer(String url) throws URISyntaxException {
-        var s = new StringBuilder();
-        String goodChars = "&.?:/-;+$#";
-        String hexChars = "0123456789abcdefABCDEF";
-
-        int pathIdx = findPathIdx(url);
-
-        if (pathIdx < 0) { // url looks like http://marginalia.nu
-            return url + "/";
-        }
-        s.append(url, 0, pathIdx);
-
-        // We don't want the fragment, and multiple fragments breaks the Java URIParser for some reason
-        int end = url.indexOf("#");
-        if (end < 0) end = url.length();
-
-        for (int i = pathIdx; i < end; i++) {
-            int c = url.charAt(i);
-
-            if (goodChars.indexOf(c) >= 0 || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
-                s.appendCodePoint(c);
-            } else if (c == '%' && i + 2 < end) {
-                int cn = url.charAt(i + 1);
-                int cnn = url.charAt(i + 2);
-                if (hexChars.indexOf(cn) >= 0 && hexChars.indexOf(cnn) >= 0) {
-                    s.appendCodePoint(c);
-                } else {
-                    s.append("%25");
-                }
-            } else {
-                s.append(String.format("%%%02X", c));
-            }
-        }
-
-        return s.toString();
-    }
-
-    private static int findPathIdx(String url) throws URISyntaxException {
-        int colonIdx = url.indexOf(':');
-        if (colonIdx < 0 || colonIdx + 2 >= url.length()) {
-            throw new URISyntaxException(url, "Lacking protocol");
-        }
-        return url.indexOf('/', colonIdx + 2);
-    }
-
     public EdgeUrl(URI URI) {
         try {
@@ -247,3 +193,123 @@ public class EdgeUrl implements Serializable {
     }
 }
+
+/* Java's URI parser is a bit too strict in throwing exceptions when there's an error.
+   Here on the Internet, standards are like the picture on the box of the frozen pizza,
+   and what you get is more like what's on the inside, we try to patch things instead,
+   just give it a best-effort attempt att cleaning out broken or unnecessary constructions
+   like bad or missing URLEncoding
+*/
+class EdgeUriFactory {
+    public static URI uriFromString(String url) throws URISyntaxException {
+        var s = new StringBuilder();
+
+        int pathIdx = findPathIdx(url);
+
+        if (pathIdx < 0) { // url looks like http://marginalia.nu
+            return new URI(url + "/");
+        }
+        s.append(url, 0, pathIdx);
+
+        // We don't want the fragment, and multiple fragments breaks the Java URIParser for some reason
+        int end = url.indexOf("#");
+        if (end < 0) end = url.length();
+
+        int queryIdx = url.indexOf('?');
+        if (queryIdx < 0) queryIdx = end;
+
+        recombinePaths(s, url.substring(pathIdx, queryIdx));
+
+        if (queryIdx < end) {
+            recombineQueryString(s, url.substring(queryIdx + 1, end));
+        }
+
+        return new URI(s.toString());
+    }
+
+    private static void recombinePaths(StringBuilder sb, String path) {
+        if (path == null || path.isEmpty()) {
+            return;
+        }
+
+        String[] pathParts = StringUtils.split(path, '/');
+        if (pathParts.length == 0) {
+            sb.append('/');
+            return;
+        }
+
+        for (String pathPart : pathParts) {
+            if (pathPart.isEmpty()) continue;
+
+            if (needsUrlEncode(pathPart)) {
+                sb.append('/');
+                sb.append(URLEncoder.encode(pathPart, StandardCharsets.UTF_8));
+            } else {
+                sb.append('/');
+                sb.append(pathPart);
+            }
+        }
+    }
+
+    private static void recombineQueryString(StringBuilder sb, String param) {
+        if (param == null || param.isEmpty()) {
+            return;
+        }
+
+        sb.append('?');
+
+        String[] pathParts = StringUtils.split(param, '&');
+        boolean first = true;
+        for (String pathPart : pathParts) {
+            if (pathPart.isEmpty()) continue;
+
+            if (first) {
+                first = false;
+            } else {
+                sb.append('&');
+            }
+
+            if (needsUrlEncode(pathPart)) {
+                sb.append(URLEncoder.encode(pathPart, StandardCharsets.UTF_8));
+            } else {
+                sb.append(pathPart);
+            }
+        }
+    }
+
+    /** Test if the url element needs URL encoding.
+     * <p></p>
+     * Note we may have been given an already encoded path element,
+     * so we include % and + in the list of good characters
+     */
+    private static boolean needsUrlEncode(String urlElement) {
+        for (int i = 0; i < urlElement.length(); i++) {
+            char c = urlElement.charAt(i);
+
+            if (c >= 'a' && c <= 'z') continue;
+            if (c >= 'A' && c <= 'Z') continue;
+            if (c >= '0' && c <= '9') continue;
+            if ("-_.~+?=&".indexOf(c) >= 0) continue;
+            if (c == '%' && i + 2 < urlElement.length()) {
+                char c1 = urlElement.charAt(i + 1);
+                char c2 = urlElement.charAt(i + 2);
+                if (c1 >= '0' && c1 <= '9' && c2 >= '0' && c2 <= '9') {
+                    i += 2;
+                    continue;
+                }
+            }
+
+            return true;
+        }
+
+        return false;
+    }
+
+    private static int findPathIdx(String url) throws URISyntaxException {
+        int colonIdx = url.indexOf(':');
+        if (colonIdx < 0 || colonIdx + 3 >= url.length()) {
+            throw new URISyntaxException(url, "Lacking protocol");
+        }
+        return url.indexOf('/', colonIdx + 3);
+    }
+}

View File

@@ -1,6 +1,6 @@
 package nu.marginalia.model;

-import nu.marginalia.model.EdgeUrl;
+import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.Test;

 import java.net.URISyntaxException;
@@ -21,25 +21,25 @@ class EdgeUrlTest {
             new EdgeUrl("https://memex.marginalia.nu/#here")
         );
     }

     @Test
-    public void testParam() throws URISyntaxException {
-        System.out.println(new EdgeUrl("https://memex.marginalia.nu/index.php?id=1").toString());
-        System.out.println(new EdgeUrl("https://memex.marginalia.nu/showthread.php?id=1&count=5&tracking=123").toString());
-    }
-
-    @Test
-    void urlencodeFixer() throws URISyntaxException {
-        System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/#heredoc"));
-        System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/%-sign"));
-        System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/%22-sign"));
-        System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/\n \"huh\""));
+    void testUriFromString() throws URISyntaxException {
+        Assertions.assertEquals("https://www.example.com/", EdgeUriFactory.uriFromString("https://www.example.com/#heredoc").toString());
+        Assertions.assertEquals("https://www.example.com/%25-sign", EdgeUriFactory.uriFromString("https://www.example.com/%-sign").toString());
+        Assertions.assertEquals("https://www.example.com/%22-sign", EdgeUriFactory.uriFromString("https://www.example.com/%22-sign").toString());
+        Assertions.assertEquals("https://www.example.com/%0A+%22huh%22", EdgeUriFactory.uriFromString("https://www.example.com/\n \"huh\"").toString());
+        Assertions.assertEquals("https://en.wikipedia.org/wiki/S%C3%A1mi", EdgeUriFactory.uriFromString("https://en.wikipedia.org/wiki/Sámi").toString());
     }

     @Test
     void testParms() throws URISyntaxException {
-        System.out.println(new EdgeUrl("https://search.marginalia.nu/?id=123"));
-        System.out.println(new EdgeUrl("https://search.marginalia.nu/?t=123"));
-        System.out.println(new EdgeUrl("https://search.marginalia.nu/?v=123"));
-        System.out.println(new EdgeUrl("https://search.marginalia.nu/?m=123"));
-        System.out.println(new EdgeUrl("https://search.marginalia.nu/?follow=123"));
+        Assertions.assertEquals("id=123", new EdgeUrl("https://search.marginalia.nu/?id=123").param);
+        Assertions.assertEquals("t=123", new EdgeUrl("https://search.marginalia.nu/?t=123").param);
+        Assertions.assertEquals("v=123", new EdgeUrl("https://search.marginalia.nu/?v=123").param);
+        Assertions.assertEquals("id=1", new EdgeUrl("https://memex.marginalia.nu/showthread.php?id=1&count=5&tracking=123").param);
+        Assertions.assertEquals("id=1&t=5", new EdgeUrl("https://memex.marginalia.nu/shöwthrëad.php?id=1&t=5&tracking=123").param);
+        Assertions.assertEquals("id=1&t=5", new EdgeUrl("https://memex.marginalia.nu/shöwthrëad.php?trëaking=123&id=1&t=5&").param);
+        Assertions.assertNull(new EdgeUrl("https://search.marginalia.nu/?m=123").param);
+        Assertions.assertNull(new EdgeUrl("https://search.marginalia.nu/?follow=123").param);
     }
 }

View File

@@ -59,16 +59,13 @@ public class ProcessAdHocTaskHeartbeatImpl implements AutoCloseable, ProcessAdHocTaskHeartbeat {
      */
     @Override
     public void progress(String step, int stepProgress, int stepCount) {
+        int lastProgress = this.progress;
         this.step = step;

-        // off by one since we calculate the progress based on the number of steps,
-        // and Enum.ordinal() is zero-based (so the 5th step in a 5 step task is 4, not 5; resulting in the
-        // final progress being 80% and not 100%)
         this.progress = (int) Math.round(100. * stepProgress / (double) stepCount);

-        logger.info("ProcessTask {} progress: {}%", taskBase, progress);
+        if (this.progress / 10 != lastProgress / 10) {
+            logger.info("ProcessTask {} progress: {}%", taskBase, progress);
+        }
     }

     /** Wrap a collection to provide heartbeat progress updates as it's iterated through */

View File

@@ -57,16 +57,13 @@ public class ServiceAdHocTaskHeartbeatImpl implements AutoCloseable, ServiceAdHocTaskHeartbeat {
      */
     @Override
     public void progress(String step, int stepProgress, int stepCount) {
+        int lastProgress = this.progress;
         this.step = step;

-        // off by one since we calculate the progress based on the number of steps,
-        // and Enum.ordinal() is zero-based (so the 5th step in a 5 step task is 4, not 5; resulting in the
-        // final progress being 80% and not 100%)
         this.progress = (int) Math.round(100. * stepProgress / (double) stepCount);

-        logger.info("ServiceTask {} progress: {}%", taskBase, progress);
+        if (this.progress / 10 != lastProgress / 10) {
+            logger.info("ProcessTask {} progress: {}%", taskBase, progress);
+        }
     }

     public void shutDown() {

View File

@@ -48,12 +48,13 @@ public class ExecutorExportClient {
         return msgId;
     }

-    public void exportSampleData(int node, FileStorageId fid, int size, String name) {
+    public void exportSampleData(int node, FileStorageId fid, int size, String ctFilter, String name) {
         channelPool.call(ExecutorExportApiBlockingStub::exportSampleData)
                 .forNode(node)
                 .run(RpcExportSampleData.newBuilder()
                         .setFileStorageId(fid.id())
                         .setSize(size)
+                        .setCtFilter(ctFilter)
                         .setName(name)
                         .build());
     }

View File

@@ -100,6 +100,7 @@ message RpcExportSampleData {
     int64 fileStorageId = 1;
     int32 size = 2;
     string name = 3;
+    string ctFilter = 4;
 }

 message RpcDownloadSampleData {
     string sampleSet = 1;

View File

@@ -8,6 +8,7 @@ import nu.marginalia.actor.state.ActorResumeBehavior;
 import nu.marginalia.actor.state.ActorStep;
 import nu.marginalia.actor.state.Resume;
 import nu.marginalia.service.control.ServiceEventLog;
+import nu.marginalia.service.control.ServiceHeartbeat;
 import nu.marginalia.storage.FileStorageService;
 import nu.marginalia.storage.model.FileStorage;
 import nu.marginalia.storage.model.FileStorageId;
@@ -19,6 +20,7 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 import java.io.*;
+import java.net.HttpURLConnection;
 import java.net.MalformedURLException;
 import java.net.URI;
 import java.net.URL;
@@ -32,6 +34,7 @@ public class DownloadSampleActor extends RecordActorPrototype {

     private final FileStorageService storageService;
     private final ServiceEventLog eventLog;
+    private final ServiceHeartbeat heartbeat;
     private final Logger logger = LoggerFactory.getLogger(getClass());

     @Resume(behavior = ActorResumeBehavior.ERROR)
@@ -66,15 +69,39 @@ public class DownloadSampleActor extends RecordActorPrototype {
                 Files.deleteIfExists(Path.of(tarFileName));

-                try (var is = new BufferedInputStream(new URI(downloadURI).toURL().openStream());
-                     var os = new BufferedOutputStream(Files.newOutputStream(Path.of(tarFileName), StandardOpenOption.CREATE))) {
-                    is.transferTo(os);
+                HttpURLConnection urlConnection = (HttpURLConnection) new URI(downloadURI).toURL().openConnection();
+
+                try (var hb = heartbeat.createServiceAdHocTaskHeartbeat("Downloading sample")) {
+                    long size = urlConnection.getContentLengthLong();
+
+                    byte[] buffer = new byte[8192];
+
+                    try (var is = new BufferedInputStream(urlConnection.getInputStream());
+                         var os = new BufferedOutputStream(Files.newOutputStream(Path.of(tarFileName), StandardOpenOption.CREATE))) {
+                        long copiedSize = 0;
+
+                        while (copiedSize < size) {
+                            int read = is.read(buffer);
+
+                            if (read < 0) // We've been promised a file of length 'size'
+                                throw new IOException("Unexpected end of stream");
+
+                            os.write(buffer, 0, read);
+                            copiedSize += read;
+
+                            // Update progress bar
+                            hb.progress(String.format("%d MB", copiedSize / 1024 / 1024), (int) (copiedSize / 1024), (int) (size / 1024));
+                        }
+                    }
                 }
                 catch (Exception ex) {
                     eventLog.logEvent(DownloadSampleActor.class, "Error downloading sample");
                     logger.error("Error downloading sample", ex);
                     yield new Error();
                 }
+                finally {
+                    urlConnection.disconnect();
+                }

                 eventLog.logEvent(DownloadSampleActor.class, "Download complete");
                 yield new Extract(fileStorageId, tarFileName);
@@ -170,11 +197,12 @@ public class DownloadSampleActor extends RecordActorPrototype {
     @Inject
     public DownloadSampleActor(Gson gson,
                                FileStorageService storageService,
-                               ServiceEventLog eventLog)
+                               ServiceEventLog eventLog, ServiceHeartbeat heartbeat)
     {
         super(gson);
         this.storageService = storageService;
         this.eventLog = eventLog;
+        this.heartbeat = heartbeat;
     }
 }

View File

@@ -26,32 +26,32 @@ public class ExportSampleDataActor extends RecordActorPrototype {
     private final MqOutbox exportTasksOutbox;
     private final Logger logger = LoggerFactory.getLogger(getClass());

-    public record Export(FileStorageId crawlId, int size, String name) implements ActorStep {}
-    public record Run(FileStorageId crawlId, FileStorageId destId, int size, String name, long msgId) implements ActorStep {
-        public Run(FileStorageId crawlId, FileStorageId destId, int size, String name) {
-            this(crawlId, destId, size, name, -1);
+    public record Export(FileStorageId crawlId, int size, String ctFilter, String name) implements ActorStep {}
+    public record Run(FileStorageId crawlId, FileStorageId destId, int size, String ctFilter, String name, long msgId) implements ActorStep {
+        public Run(FileStorageId crawlId, FileStorageId destId, int size, String name, String ctFilter) {
+            this(crawlId, destId, size, name, ctFilter,-1);
         }
     }

     @Override
     public ActorStep transition(ActorStep self) throws Exception {
         return switch(self) {
-            case Export(FileStorageId crawlId, int size, String name) -> {
+            case Export(FileStorageId crawlId, int size, String ctFilter, String name) -> {
                 var storage = storageService.allocateStorage(FileStorageType.EXPORT,
                         "crawl-sample-export",
                         "Crawl Data Sample " + name + "/" + size + " " + LocalDateTime.now()
                 );

                 if (storage == null) yield new Error("Bad storage id");
-                yield new Run(crawlId, storage.id(), size, name);
+                yield new Run(crawlId, storage.id(), size, ctFilter, name);
             }
-            case Run(FileStorageId crawlId, FileStorageId destId, int size, String name, long msgId) when msgId < 0 -> {
+            case Run(FileStorageId crawlId, FileStorageId destId, int size, String ctFilter, String name, long msgId) when msgId < 0 -> {
                 storageService.setFileStorageState(destId, FileStorageState.NEW);

-                long newMsgId = exportTasksOutbox.sendAsync(ExportTaskRequest.sampleData(crawlId, destId, size, name));
-                yield new Run(crawlId, destId, size, name, newMsgId);
+                long newMsgId = exportTasksOutbox.sendAsync(ExportTaskRequest.sampleData(crawlId, destId, ctFilter, size, name));
+                yield new Run(crawlId, destId, size, ctFilter, name, newMsgId);
             }
-            case Run(_, FileStorageId destId, _, _, long msgId) -> {
+            case Run(_, FileStorageId destId, _, _, _, long msgId) -> {
                 var rsp = processWatcher.waitResponse(exportTasksOutbox, ProcessService.ProcessId.EXPORT_TASKS, msgId);

                 if (rsp.state() != MqMessageState.OK) {
@@ -70,7 +70,7 @@ public class ExportSampleDataActor extends RecordActorPrototype {
     @Override
     public String describe() {
-        return "Export RSS/Atom feeds from crawl data";
+        return "Export sample crawl data";
     }

     @Inject

View File

@@ -49,6 +49,7 @@ public class ExecutorExportGrpcService
                         new ExportSampleDataActor.Export(
                                 FileStorageId.of(request.getFileStorageId()),
                                 request.getSize(),
+                                request.getCtFilter(),
                                 request.getName()
                         )
                 );

View File

@@ -229,13 +229,15 @@ public class FeedFetcherService {
                             .timeout(Duration.ofSeconds(15))
                     ;

-            if (ifModifiedSinceDate != null) {
+            // Set the If-Modified-Since or If-None-Match headers if we have them
+            // though since there are certain idiosyncrasies in server implementations,
+            // we avoid setting both at the same time as that may turn a 304 into a 200.
+            if (ifNoneMatchTag != null) {
+                requestBuilder.header("If-None-Match", ifNoneMatchTag);
+            } else if (ifModifiedSinceDate != null) {
                 requestBuilder.header("If-Modified-Since", ifModifiedSinceDate);
             }

-            if (ifNoneMatchTag != null) {
-                requestBuilder.header("If-None-Match", ifNoneMatchTag);
-            }
-
             HttpRequest getRequest = requestBuilder.build();

View File

@@ -20,7 +20,6 @@ import nu.marginalia.model.crawldata.CrawledDocument;
 import nu.marginalia.model.crawldata.CrawledDomain;
 import nu.marginalia.model.crawldata.SerializableCrawlData;
 import nu.marginalia.parquet.crawldata.CrawledDocumentParquetRecordFileWriter;
-import org.apache.hc.client5.http.cookie.BasicCookieStore;
 import org.junit.jupiter.api.*;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -247,7 +246,7 @@ public class CrawlingThenConvertingIntegrationTest {
     private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs, Predicate<EdgeDomain> domainBlacklist) throws Exception {
         List<SerializableCrawlData> data = new ArrayList<>();

-        try (var recorder = new WarcRecorder(fileName, new BasicCookieStore());
+        try (var recorder = new WarcRecorder(fileName);
              var db = new DomainStateDb(dbTempFile))
         {
             new CrawlerRetreiver(httpFetcher, new DomainProber(domainBlacklist), specs, db, recorder).crawlDomain();

View File

@@ -43,6 +43,7 @@ import java.nio.file.StandardCopyOption;
 import java.security.Security;
 import java.util.*;
 import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.LinkedBlockingQueue;
 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicInteger;

@@ -66,6 +67,8 @@ public class CrawlerMain extends ProcessMainClass {
     private final Map<String, CrawlTask> pendingCrawlTasks = new ConcurrentHashMap<>();

+    private final LinkedBlockingQueue<CrawlTask> retryQueue = new LinkedBlockingQueue<>();
+
     private final AtomicInteger tasksDone = new AtomicInteger(0);
     private final HttpFetcherImpl fetcher;

@@ -261,28 +264,44 @@ public class CrawlerMain extends ProcessMainClass {
                 if (workLog.isJobFinished(crawlSpec.domain))
                     continue;

-                var task = new CrawlTask(
-                        crawlSpec,
-                        anchorTagsSource,
-                        outputDir,
-                        warcArchiver,
-                        domainStateDb,
-                        workLog);
+                var task = new CrawlTask(crawlSpec, anchorTagsSource, outputDir, warcArchiver, domainStateDb, workLog);

                 // Try to run immediately, to avoid unnecessarily keeping the entire work set in RAM
                 if (!trySubmitDeferredTask(task)) {
-                    // Otherwise add to the taskList for deferred execution
+                    // Drain the retry queue to the taskList, and try to submit any tasks that are in the retry queue
+                    retryQueue.drainTo(taskList);
+                    taskList.removeIf(this::trySubmitDeferredTask);
+
+                    // Then add this new task to the retry queue
                     taskList.add(task);
                 }
             }

             // Schedule viable tasks for execution until list is empty
-            while (!taskList.isEmpty()) {
-                taskList.removeIf(this::trySubmitDeferredTask);
-
-                // Add a small pause here to avoid busy looping toward the end of the execution cycle when
-                // we might have no new viable tasks to run for hours on end
-                TimeUnit.MILLISECONDS.sleep(50);
+            for (int emptyRuns = 0;emptyRuns < 300;) {
+                boolean hasTasks = !taskList.isEmpty();
+
+                // The order of these checks very important to avoid a race condition
+                // where we miss a task that is put into the retry queue
+                boolean hasRunningTasks = pool.getActiveCount() > 0;
+                boolean hasRetryTasks = !retryQueue.isEmpty();
+
+                if (hasTasks || hasRetryTasks || hasRunningTasks) {
+                    retryQueue.drainTo(taskList);
+
+                    // Try to submit any tasks that are in the retry queue (this will block if the pool is full)
+                    taskList.removeIf(this::trySubmitDeferredTask);
+
+                    // Add a small pause here to avoid busy looping toward the end of the execution cycle when
+                    // we might have no new viable tasks to run for hours on end
+                    TimeUnit.MILLISECONDS.sleep(5);
+                } else {
+                    // We have no tasks to run, and no tasks in the retry queue
+                    // but we wait a bit to see if any new tasks come in via the retry queue
+                    emptyRuns++;
+                    TimeUnit.SECONDS.sleep(1);
+                }
             }

             logger.info("Shutting down the pool, waiting for tasks to complete...");
@@ -414,7 +433,7 @@ public class CrawlerMain extends ProcessMainClass {
         /** Best effort indicator whether we could start this now without getting stuck in
          * DomainLocks purgatory */
         public boolean canRun() {
-            return domainLocks.canLock(new EdgeDomain(domain));
+            return domainLocks.isLockableHint(new EdgeDomain(domain));
         }

         @Override
@@ -425,66 +444,82 @@ public class CrawlerMain extends ProcessMainClass {
                 return;
             }

-            Path newWarcFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.LIVE);
-            Path tempFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.TEMP);
-            Path slopFile = CrawlerOutputFile.createSlopPath(outputDir, id, domain);
-
-            // Move the WARC file to a temp file if it exists, so we can resume the crawl using the old data
-            // while writing to the same file name as before
-            if (Files.exists(newWarcFile)) {
-                Files.move(newWarcFile, tempFile, StandardCopyOption.REPLACE_EXISTING);
-            }
-            else {
-                Files.deleteIfExists(tempFile);
-            }
-
-            try (var warcRecorder = new WarcRecorder(newWarcFile, fetcher); // write to a temp file for now
-                 var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, domainStateDb, warcRecorder);
-                 CrawlDataReference reference = getReference()
-            )
-            {
-                // Resume the crawl if it was aborted
-                if (Files.exists(tempFile)) {
-                    retriever.syncAbortedRun(tempFile);
-                    Files.delete(tempFile);
-                }
-
-                DomainLinks domainLinks = anchorTagsSource.getAnchorTags(domain);
-
-                int size;
-                try (var lock = domainLocks.lockDomain(new EdgeDomain(domain))) {
-                    size = retriever.crawlDomain(domainLinks, reference);
-                }
-
-                // Delete the reference crawl data if it's not the same as the new one
-                // (mostly a case when migrating from legacy->warc)
-                reference.delete();
-
-                // Convert the WARC file to Parquet
-                SlopCrawlDataRecord
-                        .convertWarc(domain, userAgent, newWarcFile, slopFile);
-
-                // Optionally archive the WARC file if full retention is enabled,
-                // otherwise delete it:
-                warcArchiver.consumeWarc(newWarcFile, domain);
-
-                // Mark the domain as finished in the work log
-                workLog.setJobToFinished(domain, slopFile.toString(), size);
-
-                // Update the progress bar
-                heartbeat.setProgress(tasksDone.incrementAndGet() / (double) totalTasks);
-
-                logger.info("Fetched {}", domain);
-            } catch (Exception e) {
-                logger.error("Error fetching domain " + domain, e);
-            }
-            finally {
-                // We don't need to double-count these; it's also kept in the workLog
-                pendingCrawlTasks.remove(domain);
-                Thread.currentThread().setName("[idle]");
-
-                Files.deleteIfExists(newWarcFile);
-                Files.deleteIfExists(tempFile);
-            }
+            Optional<DomainLocks.DomainLock> lock = domainLocks.tryLockDomain(new EdgeDomain(domain));
+            // We don't have a lock, so we can't run this task
+            // we return to avoid blocking the pool for too long
+            if (lock.isEmpty()) {
+                if (retryQueue.remainingCapacity() > 0) {
+                    // Sleep a moment to avoid busy looping via the retry queue
+                    // in the case when few tasks remain and almost all are ineligible for
+                    // immediate restart
+                    Thread.sleep(5);
+                }
+                retryQueue.put(this);
+                return;
+            }
+            DomainLocks.DomainLock domainLock = lock.get();
+
+            try (domainLock) {
+                Thread.currentThread().setName("crawling:" + domain);
+
+                Path newWarcFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.LIVE);
+                Path tempFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.TEMP);
+                Path slopFile = CrawlerOutputFile.createSlopPath(outputDir, id, domain);
+
+                // Move the WARC file to a temp file if it exists, so we can resume the crawl using the old data
+                // while writing to the same file name as before
+                if (Files.exists(newWarcFile)) {
+                    Files.move(newWarcFile, tempFile, StandardCopyOption.REPLACE_EXISTING);
+                }
+                else {
+                    Files.deleteIfExists(tempFile);
+                }
+
+                try (var warcRecorder = new WarcRecorder(newWarcFile); // write to a temp file for now
+                     var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, domainStateDb, warcRecorder);
+                     CrawlDataReference reference = getReference())
+                {
+                    // Resume the crawl if it was aborted
+                    if (Files.exists(tempFile)) {
+                        retriever.syncAbortedRun(tempFile);
+                        Files.delete(tempFile);
+                    }
+
+                    DomainLinks domainLinks = anchorTagsSource.getAnchorTags(domain);
+
+                    int size = retriever.crawlDomain(domainLinks, reference);
+
+                    // Delete the reference crawl data if it's not the same as the new one
+                    // (mostly a case when migrating from legacy->warc)
+                    reference.delete();
+
+                    // Convert the WARC file to Slop
+                    SlopCrawlDataRecord
+                            .convertWarc(domain, userAgent, newWarcFile, slopFile);
+
+                    // Optionally archive the WARC file if full retention is enabled,
+                    // otherwise delete it:
+                    warcArchiver.consumeWarc(newWarcFile, domain);
+
+                    // Mark the domain as finished in the work log
+                    workLog.setJobToFinished(domain, slopFile.toString(), size);
+
+                    // Update the progress bar
+                    heartbeat.setProgress(tasksDone.incrementAndGet() / (double) totalTasks);
+
+                    logger.info("Fetched {}", domain);
+                } catch (Exception e) {
+                    logger.error("Error fetching domain " + domain, e);
+                }
+                finally {
+                    // We don't need to double-count these; it's also kept in the workLog
                    pendingCrawlTasks.remove(domain);
+                    Thread.currentThread().setName("[idle]");
+
+                    Files.deleteIfExists(newWarcFile);
+                    Files.deleteIfExists(tempFile);
+                }
+            }
         }

View File

@@ -1,6 +1,6 @@
 package nu.marginalia.crawl.fetcher;

-import org.apache.hc.core5.http.io.support.ClassicRequestBuilder;
+import org.apache.hc.client5.http.classic.methods.HttpGet;

 /** Encapsulates request modifiers; the ETag and Last-Modified tags for a resource */
 public record ContentTags(String etag, String lastMod) {
@@ -17,14 +17,16 @@ public record ContentTags(String etag, String lastMod) {
     }

     /** Paints the tags onto the request builder. */
-    public void paint(ClassicRequestBuilder getBuilder) {
+    public void paint(HttpGet request) {
+        // Paint the ETag header if present,
+        // otherwise paint the Last-Modified header
+        // (but not both at the same time due to some servers not liking it)
         if (etag != null) {
-            getBuilder.addHeader("If-None-Match", etag);
-        }
-
-        if (lastMod != null) {
-            getBuilder.addHeader("If-Modified-Since", lastMod);
+            request.addHeader("If-None-Match", etag);
+        } else if (lastMod != null) {
+            request.addHeader("If-Modified-Since", lastMod);
         }
     }
 }

View File

@@ -1,34 +0,0 @@
-package nu.marginalia.crawl.fetcher;
-
-import java.io.IOException;
-import java.net.CookieHandler;
-import java.net.URI;
-import java.util.List;
-import java.util.Map;
-import java.util.concurrent.ConcurrentHashMap;
-
-public class Cookies extends CookieHandler {
-    final ThreadLocal<ConcurrentHashMap<String, List<String>>> cookieJar = ThreadLocal.withInitial(ConcurrentHashMap::new);
-
-    public void clear() {
-        cookieJar.get().clear();
-    }
-
-    public boolean hasCookies() {
-        return !cookieJar.get().isEmpty();
-    }
-
-    public List<String> getCookies() {
-        return cookieJar.get().values().stream().flatMap(List::stream).toList();
-    }
-
-    @Override
-    public Map<String, List<String>> get(URI uri, Map<String, List<String>> requestHeaders) throws IOException {
-        return cookieJar.get();
-    }
-
-    @Override
-    public void put(URI uri, Map<String, List<String>> responseHeaders) throws IOException {
-        cookieJar.get().putAll(responseHeaders);
-    }
-}

View File

@@ -0,0 +1,56 @@
+package nu.marginalia.crawl.fetcher;
+
+import org.apache.hc.client5.http.classic.methods.HttpUriRequestBase;
+import org.apache.hc.core5.http.ClassicHttpRequest;
+import org.apache.hc.core5.http.HttpResponse;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.StringJoiner;
+
+public class DomainCookies {
+    private final Map<String, String> cookies = new HashMap<>();
+
+    public boolean hasCookies() {
+        return !cookies.isEmpty();
+    }
+
+    public void updateCookieStore(HttpResponse response) {
+        for (var header : response.getHeaders()) {
+            if (header.getName().equalsIgnoreCase("Set-Cookie")) {
+                parseCookieHeader(header.getValue());
+            }
+        }
+    }
+
+    private void parseCookieHeader(String value) {
+        // Parse the Set-Cookie header value and extract the cookies
+        String[] parts = value.split(";");
+        String cookie = parts[0].trim();
+
+        if (cookie.contains("=")) {
+            String[] cookieParts = cookie.split("=");
+            String name = cookieParts[0].trim();
+            String val = cookieParts[1].trim();
+            cookies.put(name, val);
+        }
+    }
+
+    public void paintRequest(HttpUriRequestBase request) {
+        request.addHeader("Cookie", createCookieHeader());
+    }
+
+    public void paintRequest(ClassicHttpRequest request) {
+        request.addHeader("Cookie", createCookieHeader());
+    }
+
+    private String createCookieHeader() {
+        StringJoiner sj = new StringJoiner("; ");
+        for (var cookie : cookies.entrySet()) {
+            sj.add(cookie.getKey() + "=" + cookie.getValue());
+        }
+        return sj.toString();
+    }
+}

View File

@@ -23,6 +23,7 @@ public interface HttpFetcher extends AutoCloseable {
     HttpFetchResult fetchContent(EdgeUrl url,
                                  WarcRecorder recorder,
+                                 DomainCookies cookies,
                                  CrawlDelayTimer timer,
                                  ContentTags tags,
                                  ProbeType probeType);

View File

@@ -17,6 +17,7 @@ import nu.marginalia.model.crawldata.CrawlerDomainStatus;
import org.apache.hc.client5.http.ConnectionKeepAliveStrategy; import org.apache.hc.client5.http.ConnectionKeepAliveStrategy;
import org.apache.hc.client5.http.HttpRequestRetryStrategy; import org.apache.hc.client5.http.HttpRequestRetryStrategy;
import org.apache.hc.client5.http.classic.HttpClient; import org.apache.hc.client5.http.classic.HttpClient;
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.config.ConnectionConfig; import org.apache.hc.client5.http.config.ConnectionConfig;
import org.apache.hc.client5.http.config.RequestConfig; import org.apache.hc.client5.http.config.RequestConfig;
import org.apache.hc.client5.http.cookie.BasicCookieStore; import org.apache.hc.client5.http.cookie.BasicCookieStore;
@@ -34,6 +35,7 @@ import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.apache.hc.core5.http.io.support.ClassicRequestBuilder; import org.apache.hc.core5.http.io.support.ClassicRequestBuilder;
import org.apache.hc.core5.http.message.MessageSupport; import org.apache.hc.core5.http.message.MessageSupport;
import org.apache.hc.core5.http.protocol.HttpContext; import org.apache.hc.core5.http.protocol.HttpContext;
import org.apache.hc.core5.pool.PoolStats;
import org.apache.hc.core5.util.TimeValue; import org.apache.hc.core5.util.TimeValue;
import org.apache.hc.core5.util.Timeout; import org.apache.hc.core5.util.Timeout;
import org.jsoup.Jsoup; import org.jsoup.Jsoup;
@@ -45,11 +47,14 @@ import org.slf4j.Marker;
import org.slf4j.MarkerFactory; import org.slf4j.MarkerFactory;
import javax.net.ssl.SSLContext; import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLException;
import java.io.IOException; import java.io.IOException;
import java.net.SocketTimeoutException; import java.net.SocketTimeoutException;
import java.net.URISyntaxException; import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.security.NoSuchAlgorithmException; import java.security.NoSuchAlgorithmException;
import java.time.Duration; import java.time.Duration;
import java.time.Instant;
import java.util.*; import java.util.*;
import java.util.concurrent.Semaphore; import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit; import java.util.concurrent.TimeUnit;
@@ -76,14 +81,20 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
} }
private final CloseableHttpClient client; private final CloseableHttpClient client;
private PoolingHttpClientConnectionManager connectionManager;
public PoolStats getPoolStats() {
return connectionManager.getTotalStats();
}
private CloseableHttpClient createClient() throws NoSuchAlgorithmException { private CloseableHttpClient createClient() throws NoSuchAlgorithmException {
final ConnectionConfig connectionConfig = ConnectionConfig.custom() final ConnectionConfig connectionConfig = ConnectionConfig.custom()
.setSocketTimeout(10, TimeUnit.SECONDS) .setSocketTimeout(10, TimeUnit.SECONDS)
.setConnectTimeout(30, TimeUnit.SECONDS) .setConnectTimeout(30, TimeUnit.SECONDS)
.setValidateAfterInactivity(TimeValue.ofSeconds(5))
.build(); .build();
final PoolingHttpClientConnectionManager connectionManager = PoolingHttpClientConnectionManagerBuilder.create() connectionManager = PoolingHttpClientConnectionManagerBuilder.create()
.setMaxConnPerRoute(2) .setMaxConnPerRoute(2)
.setMaxConnTotal(5000) .setMaxConnTotal(5000)
.setDefaultConnectionConfig(connectionConfig) .setDefaultConnectionConfig(connectionConfig)
@@ -91,15 +102,27 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
.build(); .build();
connectionManager.setDefaultSocketConfig(SocketConfig.custom() connectionManager.setDefaultSocketConfig(SocketConfig.custom()
.setSoLinger(TimeValue.ofSeconds(15)) .setSoLinger(TimeValue.ofSeconds(-1))
.setSoTimeout(Timeout.ofSeconds(10)) .setSoTimeout(Timeout.ofSeconds(10))
.build() .build()
); );
Thread.ofPlatform().daemon(true).start(() -> {
try {
for (;;) {
TimeUnit.SECONDS.sleep(15);
logger.info("Connection pool stats: {}", connectionManager.getTotalStats());
}
}
catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
});
final RequestConfig defaultRequestConfig = RequestConfig.custom() final RequestConfig defaultRequestConfig = RequestConfig.custom()
.setCookieSpec(StandardCookieSpec.RELAXED) .setCookieSpec(StandardCookieSpec.RELAXED)
.setResponseTimeout(10, TimeUnit.SECONDS) .setResponseTimeout(10, TimeUnit.SECONDS)
.setConnectionRequestTimeout(8, TimeUnit.SECONDS) .setConnectionRequestTimeout(5, TimeUnit.MINUTES)
.build(); .build();
return HttpClients.custom() return HttpClients.custom()
@@ -287,6 +310,7 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
* recorded in the WARC file on failure. * recorded in the WARC file on failure.
*/ */
public ContentTypeProbeResult probeContentType(EdgeUrl url, public ContentTypeProbeResult probeContentType(EdgeUrl url,
DomainCookies cookies,
CrawlDelayTimer timer, CrawlDelayTimer timer,
ContentTags tags) { ContentTags tags) {
if (!tags.isEmpty() || !contentTypeLogic.isUrlLikeBinary(url)) { if (!tags.isEmpty() || !contentTypeLogic.isUrlLikeBinary(url)) {
@@ -299,9 +323,11 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
.addHeader("Accept-Encoding", "gzip") .addHeader("Accept-Encoding", "gzip")
.build(); .build();
var result = SendLock.wrapSend(client, head, (rsp) -> { cookies.paintRequest(head);
EntityUtils.consume(rsp.getEntity());
return SendLock.wrapSend(client, head, (rsp) -> {
cookies.updateCookieStore(rsp);
EntityUtils.consume(rsp.getEntity());
int statusCode = rsp.getCode(); int statusCode = rsp.getCode();
// Handle redirects // Handle redirects
@@ -339,8 +365,6 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
return new ContentTypeProbeResult.BadContentType(contentType, statusCode); return new ContentTypeProbeResult.BadContentType(contentType, statusCode);
} }
}); });
return result;
} }
catch (SocketTimeoutException ex) { catch (SocketTimeoutException ex) {
@@ -362,6 +386,7 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
@Override @Override
public HttpFetchResult fetchContent(EdgeUrl url, public HttpFetchResult fetchContent(EdgeUrl url,
WarcRecorder warcRecorder, WarcRecorder warcRecorder,
DomainCookies cookies,
CrawlDelayTimer timer, CrawlDelayTimer timer,
ContentTags contentTags, ContentTags contentTags,
ProbeType probeType) ProbeType probeType)
@@ -369,26 +394,32 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
try { try {
if (probeType == HttpFetcher.ProbeType.FULL) { if (probeType == HttpFetcher.ProbeType.FULL) {
try { try {
var probeResult = probeContentType(url, timer, contentTags); var probeResult = probeContentType(url, cookies, timer, contentTags);
logger.info(crawlerAuditMarker, "Probe result {} for {}", probeResult.getClass().getSimpleName(), url);
switch (probeResult) { switch (probeResult) {
case HttpFetcher.ContentTypeProbeResult.NoOp(): case HttpFetcher.ContentTypeProbeResult.NoOp():
break; // break; //
case HttpFetcher.ContentTypeProbeResult.Ok(EdgeUrl resolvedUrl): case HttpFetcher.ContentTypeProbeResult.Ok(EdgeUrl resolvedUrl):
logger.info(crawlerAuditMarker, "Probe result OK for {}", url);
url = resolvedUrl; // If we were redirected while probing, use the final URL for fetching url = resolvedUrl; // If we were redirected while probing, use the final URL for fetching
break; break;
case ContentTypeProbeResult.BadContentType badContentType: case ContentTypeProbeResult.BadContentType badContentType:
warcRecorder.flagAsFailedContentTypeProbe(url, badContentType.contentType(), badContentType.statusCode()); warcRecorder.flagAsFailedContentTypeProbe(url, badContentType.contentType(), badContentType.statusCode());
logger.info(crawlerAuditMarker, "Probe result Bad ContenType ({}) for {}", badContentType.contentType(), url);
return new HttpFetchResult.ResultNone(); return new HttpFetchResult.ResultNone();
case ContentTypeProbeResult.BadContentType.Timeout(Exception ex): case ContentTypeProbeResult.BadContentType.Timeout(Exception ex):
logger.info(crawlerAuditMarker, "Probe result Timeout for {}", url);
warcRecorder.flagAsTimeout(url); warcRecorder.flagAsTimeout(url);
return new HttpFetchResult.ResultException(ex); return new HttpFetchResult.ResultException(ex);
case ContentTypeProbeResult.Exception(Exception ex): case ContentTypeProbeResult.Exception(Exception ex):
logger.info(crawlerAuditMarker, "Probe result Exception({}) for {}", ex.getClass().getSimpleName(), url);
warcRecorder.flagAsError(url, ex); warcRecorder.flagAsError(url, ex);
return new HttpFetchResult.ResultException(ex); return new HttpFetchResult.ResultException(ex);
case ContentTypeProbeResult.HttpError httpError: case ContentTypeProbeResult.HttpError httpError:
logger.info(crawlerAuditMarker, "Probe result HTTP Error ({}) for {}", httpError.statusCode(), url);
return new HttpFetchResult.ResultException(new HttpException("HTTP status code " + httpError.statusCode() + ": " + httpError.message())); return new HttpFetchResult.ResultException(new HttpException("HTTP status code " + httpError.statusCode() + ": " + httpError.message()));
case ContentTypeProbeResult.Redirect redirect: case ContentTypeProbeResult.Redirect redirect:
logger.info(crawlerAuditMarker, "Probe result redirect for {} -> {}", url, redirect.location());
return new HttpFetchResult.ResultRedirect(redirect.location()); return new HttpFetchResult.ResultRedirect(redirect.location());
} }
} catch (Exception ex) { } catch (Exception ex) {
@@ -398,36 +429,41 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
} }
ClassicRequestBuilder getBuilder = ClassicRequestBuilder.get(url.asURI()) HttpGet request = new HttpGet(url.asURI());
.addHeader("User-Agent", userAgentString) request.addHeader("User-Agent", userAgentString);
.addHeader("Accept-Encoding", "gzip") request.addHeader("Accept-Encoding", "gzip");
.addHeader("Accept-Language", "en,*;q=0.5") request.addHeader("Accept-Language", "en,*;q=0.5");
.addHeader("Accept", "text/html, application/xhtml+xml, text/*;q=0.8"); request.addHeader("Accept", "text/html, application/xhtml+xml, text/*;q=0.8");
contentTags.paint(getBuilder); contentTags.paint(request);
try (var sl = new SendLock()) { try (var sl = new SendLock()) {
HttpFetchResult result = warcRecorder.fetch(client, getBuilder.build()); Instant start = Instant.now();
HttpFetchResult result = warcRecorder.fetch(client, cookies, request);
Duration fetchDuration = Duration.between(start, Instant.now());
if (result instanceof HttpFetchResult.ResultOk ok) { if (result instanceof HttpFetchResult.ResultOk ok) {
if (ok.statusCode() == 304) { if (ok.statusCode() == 304) {
return new HttpFetchResult.Result304Raw(); result = new HttpFetchResult.Result304Raw();
} }
} }
switch (result) { switch (result) {
case HttpFetchResult.ResultOk ok -> logger.info(crawlerAuditMarker, "Fetch result OK {} for {}", ok.statusCode(), url); case HttpFetchResult.ResultOk ok -> logger.info(crawlerAuditMarker, "Fetch result OK {} for {} ({} ms)", ok.statusCode(), url, fetchDuration.toMillis());
case HttpFetchResult.ResultRedirect redirect -> logger.info(crawlerAuditMarker, "Fetch result redirect: {} for {}", redirect.url(), url); case HttpFetchResult.ResultRedirect redirect -> logger.info(crawlerAuditMarker, "Fetch result redirect: {} for {}", redirect.url(), url);
case HttpFetchResult.ResultNone none -> logger.info(crawlerAuditMarker, "Fetch result none for {}", url); case HttpFetchResult.ResultNone none -> logger.info(crawlerAuditMarker, "Fetch result none for {}", url);
case HttpFetchResult.ResultException ex -> logger.error(crawlerAuditMarker, "Fetch result exception: {} for {}", ex.getClass().getSimpleName(), url); case HttpFetchResult.ResultException ex -> logger.error(crawlerAuditMarker, "Fetch result exception for {}", url, ex.ex());
case HttpFetchResult.Result304Raw raw -> logger.info(crawlerAuditMarker, "Fetch result: 304 Raw for {}", url); case HttpFetchResult.Result304Raw raw -> logger.info(crawlerAuditMarker, "Fetch result: 304 Raw for {}", url);
case HttpFetchResult.Result304ReplacedWithReference ref -> logger.info(crawlerAuditMarker, "Fetch result: 304 With reference for {}", url); case HttpFetchResult.Result304ReplacedWithReference ref -> logger.info(crawlerAuditMarker, "Fetch result: 304 With reference for {}", url);
} }
return result; return result;
} }
} }
catch (Exception ex) { catch (Exception ex) {
ex.printStackTrace(); logger.error(crawlerAuditMarker, "Fetch result exception for {}", url, ex);
return new HttpFetchResult.ResultException(ex); return new HttpFetchResult.ResultException(ex);
} }
@@ -494,56 +530,61 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
     }

-    private SitemapResult fetchSingleSitemap(EdgeUrl sitemapUrl) throws URISyntaxException, IOException, InterruptedException {
-        ClassicHttpRequest getRequest = ClassicRequestBuilder.get(sitemapUrl.asURI())
-                .addHeader("User-Agent", userAgentString)
-                .addHeader("Accept-Encoding", "gzip")
-                .addHeader("Accept", "text/*, */*;q=0.9")
-                .addHeader("User-Agent", userAgentString)
-                .build();
+    private SitemapResult fetchSingleSitemap(EdgeUrl sitemapUrl) throws URISyntaxException {
+        HttpGet getRequest = new HttpGet(sitemapUrl.asURI());
+
+        getRequest.addHeader("User-Agent", userAgentString);
+        getRequest.addHeader("Accept-Encoding", "gzip");
+        getRequest.addHeader("Accept", "text/*, */*;q=0.9");
+        getRequest.addHeader("User-Agent", userAgentString);

         try (var sl = new SendLock()) {
             return client.execute(getRequest, response -> {
-                if (response.getCode() != 200) {
-                    return new SitemapResult.SitemapError();
-                }
-
-                Document parsedSitemap = Jsoup.parse(
-                        EntityUtils.toString(response.getEntity()),
-                        sitemapUrl.toString(),
-                        Parser.xmlParser()
-                );
-
-                if (parsedSitemap.childrenSize() == 0) {
-                    return new SitemapResult.SitemapError();
-                }
-
-                String rootTagName = parsedSitemap.child(0).tagName();
-
-                return switch (rootTagName.toLowerCase()) {
-                    case "sitemapindex" -> {
-                        List<String> references = new ArrayList<>();
-                        for (var locTag : parsedSitemap.getElementsByTag("loc")) {
-                            references.add(locTag.text().trim());
-                        }
-                        yield new SitemapResult.SitemapReferences(Collections.unmodifiableList(references));
-                    }
-                    case "urlset" -> {
-                        List<String> urls = new ArrayList<>();
-                        for (var locTag : parsedSitemap.select("url > loc")) {
-                            urls.add(locTag.text().trim());
-                        }
-                        yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
-                    }
-                    case "rss", "atom" -> {
-                        List<String> urls = new ArrayList<>();
-                        for (var locTag : parsedSitemap.select("link, url")) {
-                            urls.add(locTag.text().trim());
-                        }
-                        yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
-                    }
-                    default -> new SitemapResult.SitemapError();
-                };
+                try {
+                    if (response.getCode() != 200) {
+                        return new SitemapResult.SitemapError();
+                    }
+
+                    Document parsedSitemap = Jsoup.parse(
+                            EntityUtils.toString(response.getEntity()),
+                            sitemapUrl.toString(),
+                            Parser.xmlParser()
+                    );
+
+                    if (parsedSitemap.childrenSize() == 0) {
+                        return new SitemapResult.SitemapError();
+                    }
+
+                    String rootTagName = parsedSitemap.child(0).tagName();
+
+                    return switch (rootTagName.toLowerCase()) {
+                        case "sitemapindex" -> {
+                            List<String> references = new ArrayList<>();
+                            for (var locTag : parsedSitemap.getElementsByTag("loc")) {
+                                references.add(locTag.text().trim());
+                            }
+                            yield new SitemapResult.SitemapReferences(Collections.unmodifiableList(references));
+                        }
+                        case "urlset" -> {
+                            List<String> urls = new ArrayList<>();
+                            for (var locTag : parsedSitemap.select("url > loc")) {
+                                urls.add(locTag.text().trim());
+                            }
+                            yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
+                        }
+                        case "rss", "atom" -> {
+                            List<String> urls = new ArrayList<>();
+                            for (var locTag : parsedSitemap.select("link, url")) {
+                                urls.add(locTag.text().trim());
+                            }
+                            yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
+                        }
+                        default -> new SitemapResult.SitemapError();
+                    };
+                }
+                finally {
+                    EntityUtils.consume(response.getEntity());
+                }
             });
         }
         catch (Exception ex) {
@@ -574,13 +615,12 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
     private Optional<SimpleRobotRules> fetchAndParseRobotsTxt(EdgeUrl url, WarcRecorder recorder) {
         try (var sl = new SendLock()) {
-            ClassicHttpRequest request = ClassicRequestBuilder.get(url.asURI())
-                    .addHeader("User-Agent", userAgentString)
-                    .addHeader("Accept-Encoding", "gzip")
-                    .addHeader("Accept", "text/*, */*;q=0.9")
-                    .build();
+            HttpGet request = new HttpGet(url.asURI());
+            request.addHeader("User-Agent", userAgentString);
+            request.addHeader("Accept-Encoding", "gzip");
+            request.addHeader("Accept", "text/*, */*;q=0.9");

-            HttpFetchResult result = recorder.fetch(client, request);
+            HttpFetchResult result = recorder.fetch(client, new DomainCookies(), request);

             return DocumentBodyExtractor.asBytes(result).mapOpt((contentType, body) ->
                     robotsParser.parseContent(url.toString(),
@@ -596,18 +636,19 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
     @Override
     public boolean retryRequest(HttpRequest request, IOException exception, int executionCount, HttpContext context) {
-        if (exception instanceof SocketTimeoutException ex) {
-            return false;
-        }
-
-        return executionCount < 3;
+        return switch (exception) {
+            case SocketTimeoutException ste -> false;
+            case SSLException ssle -> false;
+            case UnknownHostException uhe -> false;
+            default -> executionCount <= 3;
+        };
     }

     @Override
     public boolean retryRequest(HttpResponse response, int executionCount, HttpContext context) {
         return switch (response.getCode()) {
-            case 500, 503 -> executionCount < 2;
-            case 429 -> executionCount < 3;
+            case 500, 503 -> executionCount <= 2;
+            case 429 -> executionCount <= 3;
             default -> false;
         };
     }
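HttpFetcherImpl implements HttpRequestRetryStrategy itself (see the hunk headers above), so Apache HttpClient 5 consults the first retryRequest overload on I/O failures and the second on retryable status codes. A minimal sketch of how such a strategy is registered with the client builder; the actual construction site is not shown in this diff, so take this as an illustration only:

    // Sketch: registering the fetcher as its own retry strategy.
    CloseableHttpClient client = HttpClients.custom()
            .setRetryStrategy(this) // 'this' is the HttpFetcherImpl instance
            .build();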

View File: nu.marginalia.crawl.fetcher.warc.WarcInputBuffer

@@ -2,6 +2,7 @@ package nu.marginalia.crawl.fetcher.warc;
 import org.apache.commons.io.IOUtils;
 import org.apache.commons.io.input.BOMInputStream;
+import org.apache.hc.client5.http.classic.methods.HttpGet;
 import org.apache.hc.core5.http.ClassicHttpResponse;
 import org.apache.hc.core5.http.Header;
 import org.netpreserve.jwarc.WarcTruncationReason;
@@ -43,7 +44,9 @@ public abstract class WarcInputBuffer implements AutoCloseable {
      * and suppressed from the headers.
      * If an error occurs, a buffer will be created with no content and an error status.
      */
-    static WarcInputBuffer forResponse(ClassicHttpResponse response, Duration timeLimit) throws IOException {
+    static WarcInputBuffer forResponse(ClassicHttpResponse response,
+                                       HttpGet request,
+                                       Duration timeLimit) throws IOException {
         if (response == null)
             return new ErrorBuffer();
@@ -54,16 +57,47 @@ public abstract class WarcInputBuffer implements AutoCloseable {
             return new ErrorBuffer();
         }

-        InputStream is = entity.getContent();
-        long length = entity.getContentLength();
+        Instant start = Instant.now();
+        InputStream is = null;
+        try {
+            is = entity.getContent();
+            long length = entity.getContentLength();

-        try (response) {
             if (length > 0 && length < 8192) {
                 // If the content is small and not compressed, we can just read it into memory
-                return new MemoryBuffer(response.getHeaders(), timeLimit, is, (int) length);
+                return new MemoryBuffer(response.getHeaders(), request, timeLimit, is, (int) length);
             } else {
                 // Otherwise, we unpack it into a file and read it from there
-                return new FileBuffer(response.getHeaders(), timeLimit, is);
+                return new FileBuffer(response.getHeaders(), request, timeLimit, is);
+            }
+        }
+        finally {
+            // We're required to consume the stream to avoid leaking connections,
+            // but we also don't want to get stuck on slow or malicious connections
+            // forever, so we set a time limit on this phase and call abort() if it's exceeded.
+            try {
+                while (is != null) {
+                    // Consume some data
+                    if (is.skip(65536) == 0) {
+                        // Note that skip may return 0 if the stream is empty
+                        // or for other unspecified reasons, so we need to check
+                        // with read() as well to determine if the stream is done
+                        if (is.read() == -1)
+                            is = null;
+                    }
+                    // Check if the time limit has been exceeded
+                    else if (Duration.between(start, Instant.now()).compareTo(timeLimit) > 0) {
+                        request.abort();
+                        is = null;
+                    }
+                }
+            }
+            catch (IOException e) {
+                // Ignore the exception
+            }
+            finally {
+                // Close the input stream
+                IOUtils.closeQuietly(is);
             }
         }
@@ -71,7 +105,7 @@ public abstract class WarcInputBuffer implements AutoCloseable {
     }

     /** Copy an input stream to an output stream, with a maximum size and time limit */
-    protected void copy(InputStream is, OutputStream os, Duration timeLimit) {
+    protected void copy(InputStream is, HttpGet request, OutputStream os, Duration timeLimit) {
         Instant start = Instant.now();
         Instant timeout = start.plus(timeLimit);
         long size = 0;
@@ -86,6 +120,11 @@ public abstract class WarcInputBuffer implements AutoCloseable {
             Duration remaining = Duration.between(Instant.now(), timeout);
             if (remaining.isNegative()) {
                 truncationReason = WarcTruncationReason.TIME;
+                // Abort the request if the time limit is exceeded
+                // so we don't keep the connection open forever or are forced to consume
+                // the stream to the end
+                request.abort();
                 break;
             }
@@ -104,6 +143,7 @@ public abstract class WarcInputBuffer implements AutoCloseable {
             }
             else if (truncationReason != WarcTruncationReason.LENGTH) {
                 truncationReason = WarcTruncationReason.LENGTH;
+                break;
             }
         } catch (IOException e) {
@@ -111,13 +151,6 @@ public abstract class WarcInputBuffer implements AutoCloseable {
             }
         }

-        // Try to close the connection as long as we haven't timed out.
-        // As per Apache HttpClient's semantics, this will reset the connection
-        // and close the stream if we have timed out.
-        if (truncationReason != WarcTruncationReason.TIME) {
-            IOUtils.closeQuietly(is);
-        }
     }

     /** Takes a Content-Range header and checks if it is complete.
@@ -218,7 +251,7 @@ class ErrorBuffer extends WarcInputBuffer {
 /** Buffer for when we have the response in memory */
 class MemoryBuffer extends WarcInputBuffer {
     byte[] data;
-    public MemoryBuffer(Header[] headers, Duration timeLimit, InputStream responseStream, int size) {
+    public MemoryBuffer(Header[] headers, HttpGet request, Duration timeLimit, InputStream responseStream, int size) {
         super(suppressContentEncoding(headers));

         if (!isRangeComplete(headers)) {
@@ -229,7 +262,7 @@ class MemoryBuffer extends WarcInputBuffer {
         var outputStream = new ByteArrayOutputStream(size);

-        copy(responseStream, outputStream, timeLimit);
+        copy(responseStream, request, outputStream, timeLimit);

         data = outputStream.toByteArray();
     }
@@ -253,7 +286,7 @@ class MemoryBuffer extends WarcInputBuffer {
 class FileBuffer extends WarcInputBuffer {
     private final Path tempFile;

-    public FileBuffer(Header[] headers, Duration timeLimit, InputStream responseStream) throws IOException {
+    public FileBuffer(Header[] headers, HttpGet request, Duration timeLimit, InputStream responseStream) throws IOException {
         super(suppressContentEncoding(headers));

         if (!isRangeComplete(headers)) {
@@ -265,7 +298,7 @@ class FileBuffer extends WarcInputBuffer {
         this.tempFile = Files.createTempFile("rsp", ".html");
         try (var out = Files.newOutputStream(tempFile)) {
-            copy(responseStream, out, timeLimit);
+            copy(responseStream, request, out, timeLimit);
         }
         catch (Exception ex) {
             truncationReason = WarcTruncationReason.UNSPECIFIED;
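The wrap-up phase above is the general "consume, but never block forever" idiom for HTTP/1.1 connection reuse: drain to EOF so the connection can return to the pool, but hard-abort if a slow or malicious peer drags it out. A standalone sketch of the same pattern with the stream and abort hook passed in; the class and method names here are illustrative, not from the repo:

    import java.io.IOException;
    import java.io.InputStream;
    import java.time.Duration;
    import java.time.Instant;

    class BoundedDrain {
        /** Drain 'is' until EOF or until 'limit' elapses; run 'abort' on timeout. */
        static void drain(InputStream is, Duration limit, Runnable abort) throws IOException {
            Instant deadline = Instant.now().plus(limit);
            while (true) {
                // skip() may legitimately return 0, so confirm EOF with read()
                if (is.skip(65536) == 0 && is.read() == -1)
                    return; // EOF: connection can be reused
                if (Instant.now().isAfter(deadline)) {
                    abort.run(); // e.g. HttpGet::abort, which resets the connection
                    return;
                }
            }
        }
    }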

View File: nu.marginalia.crawl.fetcher.warc.WarcRecorder

@@ -1,6 +1,7 @@
 package nu.marginalia.crawl.fetcher.warc;

 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
 import nu.marginalia.link_parser.LinkParser;
@@ -8,9 +9,7 @@ import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.body.HttpFetchResult;
 import org.apache.hc.client5.http.classic.HttpClient;
-import org.apache.hc.client5.http.cookie.BasicCookieStore;
-import org.apache.hc.client5.http.cookie.CookieStore;
-import org.apache.hc.core5.http.ClassicHttpRequest;
+import org.apache.hc.client5.http.classic.methods.HttpGet;
 import org.apache.hc.core5.http.NameValuePair;
 import org.jetbrains.annotations.Nullable;
 import org.netpreserve.jwarc.*;
@@ -42,7 +41,7 @@ public class WarcRecorder implements AutoCloseable {
     static final int MAX_TIME = 30_000;

     /** Maximum (decompressed) size we'll save */
-    static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);
+    static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 32 * 1024 * 1024);

     private final WarcWriter writer;
     private final Path warcFile;
@@ -53,23 +52,15 @@ public class WarcRecorder implements AutoCloseable {
     // Affix a version string in case we need to change the format in the future
     // in some way
     private final String warcRecorderVersion = "1.0";
-    private final CookieStore cookies;

     private final LinkParser linkParser = new LinkParser();

     /**
      * Create a new WarcRecorder that will write to the given file
      *
      * @param warcFile The file to write to
      */
-    public WarcRecorder(Path warcFile, HttpFetcherImpl fetcher) throws IOException {
+    public WarcRecorder(Path warcFile) throws IOException {
         this.warcFile = warcFile;
         this.writer = new WarcWriter(warcFile);
-        this.cookies = fetcher.getCookies();
-    }
-
-    public WarcRecorder(Path warcFile, CookieStore cookies) throws IOException {
-        this.warcFile = warcFile;
-        this.writer = new WarcWriter(warcFile);
-        this.cookies = cookies;
     }

     /**
@@ -79,24 +70,21 @@ public class WarcRecorder implements AutoCloseable {
     public WarcRecorder() throws IOException {
         this.warcFile = Files.createTempFile("warc", ".warc.gz");
         this.writer = new WarcWriter(this.warcFile);
-        this.cookies = new BasicCookieStore();
         temporaryFile = true;
     }

-    private boolean hasCookies() {
-        return !cookies.getCookies().isEmpty();
-    }
-
     public HttpFetchResult fetch(HttpClient client,
-                                 ClassicHttpRequest request)
+                                 DomainCookies cookies,
+                                 HttpGet request)
             throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException
     {
-        return fetch(client, request, Duration.ofMillis(MAX_TIME));
+        return fetch(client, cookies, request, Duration.ofMillis(MAX_TIME));
     }

     public HttpFetchResult fetch(HttpClient client,
-                                 ClassicHttpRequest request,
+                                 DomainCookies cookies,
+                                 HttpGet request,
                                  Duration timeout)
             throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException
     {
@@ -113,13 +101,15 @@ public class WarcRecorder implements AutoCloseable {
         // Inject a range header to attempt to limit the size of the response
         // to the maximum size we want to store, if the server supports it.
         request.addHeader("Range", "bytes=0-"+MAX_SIZE);
+        cookies.paintRequest(request);

         try {
-            return client.execute(request, response -> {
-                try (WarcInputBuffer inputBuffer = WarcInputBuffer.forResponse(response, timeout);
+            return client.execute(request,response -> {
+                try (WarcInputBuffer inputBuffer = WarcInputBuffer.forResponse(response, request, timeout);
                      InputStream inputStream = inputBuffer.read()) {
+                    cookies.updateCookieStore(response);

                     // Build and write the request
                     WarcDigestBuilder requestDigestBuilder = new WarcDigestBuilder();
@@ -143,8 +133,9 @@ public class WarcRecorder implements AutoCloseable {
                     warcRequest.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
                     writer.write(warcRequest);

-                    if (hasCookies()) {
-                        extraHeaders.put("X-Has-Cookies", List.of("1"));
+                    if (cookies.hasCookies()) {
+                        response.addHeader("X-Has-Cookies", 1);
                     }

                     byte[] responseHeaders = WarcProtocolReconstructor.getResponseHeader(response, inputBuffer.size()).getBytes(StandardCharsets.UTF_8);
@@ -259,7 +250,7 @@ public class WarcRecorder implements AutoCloseable {
         writer.write(item);
     }

-    private void saveOldResponse(EdgeUrl url, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags contentTags) {
+    private void saveOldResponse(EdgeUrl url, DomainCookies domainCookies, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags contentTags) {
         try {
             WarcDigestBuilder responseDigestBuilder = new WarcDigestBuilder();
             WarcDigestBuilder payloadDigestBuilder = new WarcDigestBuilder();
@@ -320,7 +311,7 @@ public class WarcRecorder implements AutoCloseable {
                     .date(Instant.now())
                     .body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());

-            if (hasCookies()) {
+            if (domainCookies.hasCookies() || (headers != null && headers.contains("Set-Cookie:"))) {
                 builder.addHeader("X-Has-Cookies", "1");
             }
@@ -340,8 +331,8 @@ public class WarcRecorder implements AutoCloseable {
      * an E-Tag or Last-Modified header, and the server responds with a 304 Not Modified. In this
      * scenario we want to record the data as it was in the previous crawl, but not re-fetch it.
      */
-    public void writeReferenceCopy(EdgeUrl url, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags ctags) {
-        saveOldResponse(url, contentType, statusCode, documentBody, headers, ctags);
+    public void writeReferenceCopy(EdgeUrl url, DomainCookies cookies, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags ctags) {
+        saveOldResponse(url, cookies, contentType, statusCode, documentBody, headers, ctags);
     }

     public void writeWarcinfoHeader(String ip, EdgeDomain domain, HttpFetcherImpl.DomainProbeResult result) throws IOException {
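The cookie flow in fetch() above is: paintRequest() stamps the jar's stored cookies onto the outgoing request, updateCookieStore() absorbs any Set-Cookie headers from the response, and hasCookies() flags the WARC record with X-Has-Cookies. A minimal caller sketch, assuming an already-built HttpClient named client and a placeholder URL; exception handling is omitted:

    // Sketch: one DomainCookies jar per crawl task, reused across fetches.
    DomainCookies cookies = new DomainCookies();
    try (WarcRecorder recorder = new WarcRecorder(Files.createTempFile("warc", ".warc.gz"))) {
        HttpGet request = new HttpGet("https://www.example.com/");
        HttpFetchResult result = recorder.fetch(client, cookies, request);
        // A second fetch against the same domain replays any cookies the
        // first response set, since the same jar is passed in again.
    }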

View File: nu.marginalia.crawl.logic.DomainLocks

@@ -3,6 +3,7 @@ package nu.marginalia.crawl.logic;
 import nu.marginalia.model.EdgeDomain;

 import java.util.Map;
+import java.util.Optional;
 import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.Semaphore;
@@ -19,8 +20,22 @@ public class DomainLocks {
      * and may be held by another thread. The caller is responsible for locking and releasing the lock.
      */
     public DomainLock lockDomain(EdgeDomain domain) throws InterruptedException {
-        return new DomainLock(domain.toString(),
-                locks.computeIfAbsent(domain.topDomain.toLowerCase(), this::defaultPermits));
+        var sem = locks.computeIfAbsent(domain.topDomain.toLowerCase(), this::defaultPermits);
+
+        sem.acquire();
+
+        return new DomainLock(sem);
+    }
+
+    public Optional<DomainLock> tryLockDomain(EdgeDomain domain) {
+        var sem = locks.computeIfAbsent(domain.topDomain.toLowerCase(), this::defaultPermits);
+        if (sem.tryAcquire(1)) {
+            return Optional.of(new DomainLock(sem));
+        }
+        else {
+            // We don't have a lock, so we return an empty optional
+            return Optional.empty();
+        }
     }

     private Semaphore defaultPermits(String topDomain) {
@@ -28,23 +43,27 @@ public class DomainLocks {
             return new Semaphore(16);
         if (topDomain.equals("blogspot.com"))
             return new Semaphore(8);
+        if (topDomain.equals("tumblr.com"))
+            return new Semaphore(8);
         if (topDomain.equals("neocities.org"))
-            return new Semaphore(4);
+            return new Semaphore(8);
         if (topDomain.equals("github.io"))
-            return new Semaphore(4);
+            return new Semaphore(8);

+        // Substack really dislikes broad-scale crawlers, so we need to be careful
+        // to not get blocked.
         if (topDomain.equals("substack.com")) {
             return new Semaphore(1);
         }
+        if (topDomain.endsWith(".edu")) {
+            return new Semaphore(1);
+        }

         return new Semaphore(2);
     }

-    public boolean canLock(EdgeDomain domain) {
+    /** Returns true if the domain is lockable, i.e. if it is not already locked by another thread.
+     * (this is just a hint, and does not guarantee that the domain is actually lockable any time
+     * after this method returns true)
+     */
+    public boolean isLockableHint(EdgeDomain domain) {
         Semaphore sem = locks.get(domain.topDomain.toLowerCase());
         if (null == sem)
             return true;
@@ -53,22 +72,16 @@ public class DomainLocks {
     }

     public static class DomainLock implements AutoCloseable {
-        private final String domainName;
         private final Semaphore semaphore;

-        DomainLock(String domainName, Semaphore semaphore) throws InterruptedException {
-            this.domainName = domainName;
+        DomainLock(Semaphore semaphore) {
             this.semaphore = semaphore;
-
-            Thread.currentThread().setName("crawling:" + domainName + " [await domain lock]");
-            semaphore.acquire();
-            Thread.currentThread().setName("crawling:" + domainName);
         }

         @Override
         public void close() throws Exception {
             semaphore.release();
-            Thread.currentThread().setName("crawling:" + domainName + " [wrapping up]");
+            Thread.currentThread().setName("[idle]");
         }
     }
 }
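tryLockDomain() gives callers a non-blocking alternative to lockDomain(): rather than parking a worker thread on Semaphore.acquire(), a busy domain can be skipped and retried on a later pass. A usage sketch; the pending queue and crawlDomain() routine are hypothetical, and the enclosing method is assumed to declare throws Exception since DomainLock.close() does:

    Optional<DomainLocks.DomainLock> maybeLock = domainLocks.tryLockDomain(domain);
    if (maybeLock.isPresent()) {
        try (DomainLocks.DomainLock lock = maybeLock.get()) {
            crawlDomain(domain);  // hypothetical per-domain crawl routine
        }
    }
    else {
        pending.add(domain);      // domain is busy; retry it on a later pass
    }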

View File: CrawlerRetreiver

@@ -6,6 +6,7 @@ import nu.marginalia.contenttype.ContentType;
 import nu.marginalia.crawl.CrawlerMain;
 import nu.marginalia.crawl.DomainStateDb;
 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import nu.marginalia.crawl.logic.LinkFilterSelector;
@@ -51,9 +52,10 @@ public class CrawlerRetreiver implements AutoCloseable {
     private final DomainStateDb domainStateDb;
     private final WarcRecorder warcRecorder;
     private final CrawlerRevisitor crawlerRevisitor;
+    private final DomainCookies cookies = new DomainCookies();

     private static final CrawlerConnectionThrottle connectionThrottle = new CrawlerConnectionThrottle(
-            Duration.ofMillis(50) // pace the connections to avoid network congestion at startup
+            Duration.ofSeconds(1) // pace the connections to avoid network congestion at startup
     );

     int errorCount = 0;
@@ -124,7 +126,7 @@ public class CrawlerRetreiver implements AutoCloseable {
             }

             Instant recrawlStart = Instant.now();
-            CrawlerRevisitor.RecrawlMetadata recrawlMetadata = crawlerRevisitor.recrawl(oldCrawlData, robotsRules, delayTimer);
+            CrawlerRevisitor.RecrawlMetadata recrawlMetadata = crawlerRevisitor.recrawl(oldCrawlData, cookies, robotsRules, delayTimer);
             Duration recrawlTime = Duration.between(recrawlStart, Instant.now());

             // Play back the old crawl data (if present) and fetch the documents comparing etags and last-modified
@@ -274,7 +276,7 @@ public class CrawlerRetreiver implements AutoCloseable {
         try {
             var url = rootUrl.withPathAndParam("/", null);

-            HttpFetchResult result = fetcher.fetchContent(url, warcRecorder, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+            HttpFetchResult result = fetcher.fetchContent(url, warcRecorder, cookies, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
             timer.waitFetchDelay(0);

             if (result instanceof HttpFetchResult.ResultRedirect(EdgeUrl location)) {
@@ -337,7 +339,7 @@ public class CrawlerRetreiver implements AutoCloseable {

         // Grab the favicon if it exists
-        if (fetcher.fetchContent(faviconUrl, warcRecorder, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED) instanceof HttpFetchResult.ResultOk iconResult) {
+        if (fetcher.fetchContent(faviconUrl, warcRecorder, cookies, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED) instanceof HttpFetchResult.ResultOk iconResult) {
             String contentType = iconResult.header("Content-Type");
             byte[] iconData = iconResult.getBodyBytes();
@@ -407,7 +409,7 @@ public class CrawlerRetreiver implements AutoCloseable {
         if (parsedOpt.isEmpty())
             return false;

-        HttpFetchResult result = fetcher.fetchContent(parsedOpt.get(), warcRecorder, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+        HttpFetchResult result = fetcher.fetchContent(parsedOpt.get(), warcRecorder, cookies, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
         timer.waitFetchDelay(0);

         if (!(result instanceof HttpFetchResult.ResultOk ok)) {
@@ -435,7 +437,7 @@ public class CrawlerRetreiver implements AutoCloseable {
     {
         var contentTags = reference.getContentTags();

-        HttpFetchResult fetchedDoc = fetcher.fetchContent(top, warcRecorder, timer, contentTags, HttpFetcher.ProbeType.FULL);
+        HttpFetchResult fetchedDoc = fetcher.fetchContent(top, warcRecorder, cookies, timer, contentTags, HttpFetcher.ProbeType.FULL);
         timer.waitFetchDelay();

         if (Thread.interrupted()) {
@@ -461,7 +463,7 @@ public class CrawlerRetreiver implements AutoCloseable {
     {
         var doc = reference.doc();

-        warcRecorder.writeReferenceCopy(top, doc.contentType, doc.httpStatus, doc.documentBodyBytes, doc.headers, contentTags);
+        warcRecorder.writeReferenceCopy(top, cookies, doc.contentType, doc.httpStatus, doc.documentBodyBytes, doc.headers, contentTags);

         fetchedDoc = new HttpFetchResult.Result304ReplacedWithReference(doc.url,
                 new ContentType(doc.contentType, "UTF-8"),

View File: nu.marginalia.crawl.retreival.revisit.CrawlerRevisitor

@@ -2,6 +2,7 @@ package nu.marginalia.crawl.retreival.revisit;

 import crawlercommons.robots.SimpleRobotRules;
 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import nu.marginalia.crawl.retreival.CrawlDataReference;
 import nu.marginalia.crawl.retreival.CrawlDelayTimer;
@@ -37,6 +38,7 @@ public class CrawlerRevisitor {

     /** Performs a re-crawl of old documents, comparing etags and last-modified */
     public RecrawlMetadata recrawl(CrawlDataReference oldCrawlData,
+                                   DomainCookies cookies,
                                    SimpleRobotRules robotsRules,
                                    CrawlDelayTimer delayTimer)
             throws InterruptedException {
@@ -132,6 +134,7 @@ public class CrawlerRevisitor {
                     }
                     // Add a WARC record so we don't repeat this
                     warcRecorder.writeReferenceCopy(url,
+                            cookies,
                             doc.contentType,
                             doc.httpStatus,
                             doc.documentBodyBytes,

View File: ContentTypes

@@ -6,6 +6,7 @@ public class ContentTypes {
     public static final Set<String> acceptedContentTypes = Set.of("application/xhtml+xml",
             "application/xhtml",
             "text/html",
+            "application/pdf",
             "image/x-icon",
             "text/plain");
@@ -19,4 +20,9 @@ public class ContentTypes {
         return false;
     }

+    public static boolean isBinary(String contentTypeHeader) {
+        String lcHeader = contentTypeHeader.toLowerCase();
+        return lcHeader.startsWith("application/pdf");
+    }
+
 }
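The new isBinary() helper is a case-insensitive prefix test on the raw Content-Type header value, so parameters such as a charset suffix do not affect the result. Illustrative expectations (not tests from the repo):

    ContentTypes.isBinary("application/pdf");                 // true
    ContentTypes.isBinary("Application/PDF; charset=binary"); // true
    ContentTypes.isBinary("text/html");                       // false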

View File: SlopSerializableCrawlDataStream

@@ -37,8 +37,12 @@ public class SlopSerializableCrawlDataStream implements AutoCloseable, Serializa
     public boolean filter(String url, int status, String contentType) {
         String ctLc = contentType.toLowerCase();

+        // Permit all plain text content types
         if (ctLc.startsWith("text/"))
             return true;
+        // PDF
+        else if (ctLc.startsWith("application/pdf"))
+            return true;
         else if (ctLc.startsWith("x-marginalia/"))
             return true;

View File: ContentTypeLogic

@@ -10,7 +10,7 @@ import java.util.regex.Pattern;

 public class ContentTypeLogic {

-    private static final Predicate<String> probableHtmlPattern = Pattern.compile("^.*\\.(htm|html|php|txt|md)$").asMatchPredicate();
+    private static final Predicate<String> probableGoodPattern = Pattern.compile("^.*\\.(htm|html|php|txt|md|pdf)$").asMatchPredicate();
     private static final Predicate<String> probableBinaryPattern = Pattern.compile("^.*\\.[a-z]+$").asMatchPredicate();
     private static final Set<String> blockedContentTypes = Set.of("text/css", "text/javascript");
     private static final List<String> acceptedContentTypePrefixes = List.of(
@@ -22,6 +22,7 @@ public class ContentTypeLogic {
             "application/rss+xml",
             "application/x-rss+xml",
             "application/rdf+xml",
+            "application/pdf",
             "x-rss+xml"
     );
     private boolean allowAllContentTypes = false;
@@ -34,7 +35,7 @@ public class ContentTypeLogic {
     public boolean isUrlLikeBinary(EdgeUrl url) {
         String pathLowerCase = url.path.toLowerCase();

-        if (probableHtmlPattern.test(pathLowerCase))
+        if (probableGoodPattern.test(pathLowerCase))
             return false;

         return probableBinaryPattern.test(pathLowerCase);
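With pdf added to the "probably good" extension list, a .pdf path is short-circuited to "not binary" before the catch-all any-extension pattern gets a say. Expected behavior, as a hedged illustration (assuming the default constructor; EdgeUrl construction may throw URISyntaxException, and example.com is a placeholder):

    ContentTypeLogic logic = new ContentTypeLogic();
    logic.isUrlLikeBinary(new EdgeUrl("https://example.com/paper.pdf"));  // false: now fetched
    logic.isUrlLikeBinary(new EdgeUrl("https://example.com/index.html")); // false: as before
    logic.isUrlLikeBinary(new EdgeUrl("https://example.com/image.jpg"));  // true: still skipped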

View File: SlopCrawlDataRecord

@@ -216,6 +216,11 @@ public record SlopCrawlDataRecord(String domain,
             return false;
         }

+        // If the format is binary, we don't want to translate it if the response is truncated
+        if (response.truncated() != WarcTruncationReason.NOT_TRUNCATED && ContentTypes.isBinary(contentType)) {
+            return false;
+        }
+
         return true;
     }

View File: HttpFetcherImplContentTypeProbeTest

@@ -11,6 +11,8 @@ import org.junit.jupiter.api.*;
 import java.io.IOException;
 import java.net.URISyntaxException;

+import static org.junit.jupiter.api.Assertions.assertEquals;
+
 @Tag("slow")
 class HttpFetcherImplContentTypeProbeTest {
@@ -85,55 +87,59 @@ class HttpFetcherImplContentTypeProbeTest {
     @AfterEach
     public void tearDown() throws IOException {
+        var stats = fetcher.getPoolStats();
+        assertEquals(0, stats.getLeased());
+        assertEquals(0, stats.getPending());
+
         fetcher.close();
     }

     @Test
     public void testProbeContentTypeHtmlShortcircuitPath() throws URISyntaxException {
-        var result = fetcher.probeContentType(new EdgeUrl("https://localhost/test.html"), new CrawlDelayTimer(50), ContentTags.empty());
-        Assertions.assertInstanceOf(HttpFetcher.ContentTypeProbeResult.Ok.class, result);
+        var result = fetcher.probeContentType(new EdgeUrl("https://localhost/test.html"), new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
+        Assertions.assertInstanceOf(HttpFetcher.ContentTypeProbeResult.NoOp.class, result);
     }

     @Test
     public void testProbeContentTypeHtmlShortcircuitTags() {
-        var result = fetcher.probeContentType(contentTypeBinaryUrl, new CrawlDelayTimer(50), new ContentTags("a", "b"));
-        Assertions.assertInstanceOf(HttpFetcher.ContentTypeProbeResult.Ok.class, result);
+        var result = fetcher.probeContentType(contentTypeBinaryUrl, new DomainCookies(), new CrawlDelayTimer(50), new ContentTags("a", "b"));
+        Assertions.assertInstanceOf(HttpFetcher.ContentTypeProbeResult.NoOp.class, result);
     }

     @Test
     public void testProbeContentTypeHtml() {
-        var result = fetcher.probeContentType(contentTypeHtmlUrl, new CrawlDelayTimer(50), ContentTags.empty());
+        var result = fetcher.probeContentType(contentTypeHtmlUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
         Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.Ok(contentTypeHtmlUrl), result);
     }

     @Test
     public void testProbeContentTypeBinary() {
-        var result = fetcher.probeContentType(contentTypeBinaryUrl, new CrawlDelayTimer(50), ContentTags.empty());
+        var result = fetcher.probeContentType(contentTypeBinaryUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
         Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.BadContentType("application/octet-stream", 200), result);
     }

     @Test
     public void testProbeContentTypeRedirect() {
-        var result = fetcher.probeContentType(redirectUrl, new CrawlDelayTimer(50), ContentTags.empty());
+        var result = fetcher.probeContentType(redirectUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
         Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.Redirect(contentTypeHtmlUrl), result);
     }

     @Test
     public void testProbeContentTypeBadHttpStatus() {
-        var result = fetcher.probeContentType(badHttpStatusUrl, new CrawlDelayTimer(50), ContentTags.empty());
+        var result = fetcher.probeContentType(badHttpStatusUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
         Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.HttpError(500, "Bad status code"), result);
     }

     @Test
     public void testOnlyGetAllowed() {
-        var result = fetcher.probeContentType(onlyGetAllowedUrl, new CrawlDelayTimer(50), ContentTags.empty());
+        var result = fetcher.probeContentType(onlyGetAllowedUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
         Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.Ok(onlyGetAllowedUrl), result);
     }

     @Test
     public void testTimeout() {
-        var result = fetcher.probeContentType(timeoutUrl, new CrawlDelayTimer(50), ContentTags.empty());
+        var result = fetcher.probeContentType(timeoutUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
         Assertions.assertInstanceOf(HttpFetcher.ContentTypeProbeResult.Timeout.class, result);
     }

View File: HttpFetcherImplDomainProbeTest

@@ -12,6 +12,8 @@ import org.junit.jupiter.api.*;
 import java.io.IOException;
 import java.net.URISyntaxException;

+import static org.junit.jupiter.api.Assertions.assertEquals;
+
 @Tag("slow")
 class HttpFetcherImplDomainProbeTest {
@@ -47,6 +49,10 @@ class HttpFetcherImplDomainProbeTest {
     @AfterEach
     public void tearDown() throws IOException {
+        var stats = fetcher.getPoolStats();
+        assertEquals(0, stats.getLeased());
+        assertEquals(0, stats.getPending());
+
         fetcher.close();
     }

View File: HttpFetcherImplFetchTest

@@ -31,6 +31,7 @@ class HttpFetcherImplFetchTest {
     private static String lastModified = "Wed, 21 Oct 2024 07:28:00 GMT";

     private static EdgeUrl okUrl;
+    private static EdgeUrl okUrlSetsCookie;
     private static EdgeUrl okRangeResponseUrl;
     private static EdgeUrl okUrlWith304;
@@ -39,6 +40,8 @@ class HttpFetcherImplFetchTest {
     private static EdgeUrl badHttpStatusUrl;
     private static EdgeUrl keepAliveUrl;

+    private static EdgeUrl pdfUrl;
+
     @BeforeAll
     public static void setupAll() throws URISyntaxException {
         wireMockServer =
@@ -88,6 +91,19 @@ class HttpFetcherImplFetchTest {
                         .withStatus(200)
                         .withBody("Hello World")));

+        okUrlSetsCookie = new EdgeUrl("http://localhost:18089/okSetCookie.bin");
+        wireMockServer.stubFor(WireMock.head(WireMock.urlEqualTo(okUrlSetsCookie.path))
+                .willReturn(WireMock.aResponse()
+                        .withHeader("Content-Type", "text/html")
+                        .withHeader("Set-Cookie", "test=1")
+                        .withStatus(200)));
+        wireMockServer.stubFor(WireMock.get(WireMock.urlEqualTo(okUrlSetsCookie.path))
+                .willReturn(WireMock.aResponse()
+                        .withHeader("Content-Type", "text/html")
+                        .withHeader("Set-Cookie", "test=1")
+                        .withStatus(200)
+                        .withBody("Hello World")));
+
         okUrlWith304 = new EdgeUrl("http://localhost:18089/ok304.bin");
         wireMockServer.stubFor(WireMock.head(WireMock.urlEqualTo(okUrlWith304.path))
                 .willReturn(WireMock.aResponse()
@@ -117,6 +133,15 @@ class HttpFetcherImplFetchTest {
                         .withHeader("Keep-Alive", "max=4, timeout=30")
                         .withBody("Hello")
                 ));

+        pdfUrl = new EdgeUrl("http://localhost:18089/test.pdf");
+        wireMockServer.stubFor(WireMock.get(WireMock.urlEqualTo(pdfUrl.path))
+                .willReturn(WireMock.aResponse()
+                        .withHeader("Content-Type", "application/pdf")
+                        .withStatus(200)
+                        .withBody("Hello World")));
+
         wireMockServer.start();
     }
@@ -134,20 +159,31 @@ class HttpFetcherImplFetchTest {
     public void setUp() throws IOException {
         fetcher = new HttpFetcherImpl(new UserAgent("test.marginalia.nu", "test.marginalia.nu"));
         warcFile = Files.createTempFile(getClass().getSimpleName(), ".warc");
-        warcRecorder = new WarcRecorder(warcFile, fetcher);
+        warcRecorder = new WarcRecorder(warcFile);
     }

     @AfterEach
     public void tearDown() throws IOException {
+        var stats = fetcher.getPoolStats();
+        assertEquals(0, stats.getLeased());
+        assertEquals(0, stats.getPending());
+
+        System.out.println(stats);
+
         fetcher.close();
         warcRecorder.close();

         Files.deleteIfExists(warcFile);
     }

+    @Test
+    public void testFoo() {
+        fetcher.fetchSitemapUrls("https://www.marginalia.nu/sitemap.xml", new CrawlDelayTimer(100));
+    }
+
     @Test
     public void testOk_NoProbe() throws IOException {
-        var result = fetcher.fetchContent(okUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+        var result = fetcher.fetchContent(okUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);

         Assertions.assertInstanceOf(HttpFetchResult.ResultOk.class, result);
         Assertions.assertTrue(result.isOk());
@@ -158,12 +194,29 @@ class HttpFetcherImplFetchTest {
         Assertions.assertInstanceOf(WarcResponse.class, warcRecords.get(1));

         WarcResponse response = (WarcResponse) warcRecords.get(1);
-        assertEquals("0", response.headers().first("X-Has-Cookies").orElse("0"));
+        assertEquals("0", response.http().headers().first("X-Has-Cookies").orElse("0"));
+    }
+
+    @Test
+    public void testOkSetsCookie() throws IOException {
+        var cookies = new DomainCookies();
+        var result = fetcher.fetchContent(okUrlSetsCookie, warcRecorder, cookies, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+
+        Assertions.assertInstanceOf(HttpFetchResult.ResultOk.class, result);
+        Assertions.assertTrue(result.isOk());
+
+        List<WarcRecord> warcRecords = getWarcRecords();
+        assertEquals(2, warcRecords.size());
+        Assertions.assertInstanceOf(WarcRequest.class, warcRecords.get(0));
+        Assertions.assertInstanceOf(WarcResponse.class, warcRecords.get(1));
+
+        WarcResponse response = (WarcResponse) warcRecords.get(1);
+        assertEquals("1", response.http().headers().first("X-Has-Cookies").orElse("0"));
     }

     @Test
     public void testOk_FullProbe() {
-        var result = fetcher.fetchContent(okUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.FULL);
+        var result = fetcher.fetchContent(okUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.FULL);

         Assertions.assertInstanceOf(HttpFetchResult.ResultOk.class, result);
         Assertions.assertTrue(result.isOk());
@@ -171,7 +224,7 @@ class HttpFetcherImplFetchTest {

     @Test
     public void testOk304_NoProbe() {
-        var result = fetcher.fetchContent(okUrlWith304, warcRecorder, new CrawlDelayTimer(1000), new ContentTags(etag, lastModified), HttpFetcher.ProbeType.DISABLED);
+        var result = fetcher.fetchContent(okUrlWith304, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), new ContentTags(etag, lastModified), HttpFetcher.ProbeType.DISABLED);

         Assertions.assertInstanceOf(HttpFetchResult.Result304Raw.class, result);
         System.out.println(result);
@@ -180,7 +233,7 @@ class HttpFetcherImplFetchTest {
     @Test
     public void testOk304_FullProbe() {
-        var result = fetcher.fetchContent(okUrlWith304, warcRecorder, new CrawlDelayTimer(1000), new ContentTags(etag, lastModified), HttpFetcher.ProbeType.FULL);
+        var result = fetcher.fetchContent(okUrlWith304, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), new ContentTags(etag, lastModified), HttpFetcher.ProbeType.FULL);

         Assertions.assertInstanceOf(HttpFetchResult.Result304Raw.class, result);
         System.out.println(result);
@@ -188,7 +241,7 @@ class HttpFetcherImplFetchTest {
     @Test
     public void testBadStatus_NoProbe() throws IOException {
-        var result = fetcher.fetchContent(badHttpStatusUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+        var result = fetcher.fetchContent(badHttpStatusUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);

         Assertions.assertInstanceOf(HttpFetchResult.ResultOk.class, result);
         Assertions.assertFalse(result.isOk());
@@ -202,7 +255,7 @@ class HttpFetcherImplFetchTest {
     @Test
     public void testBadStatus_FullProbe() {
-        var result = fetcher.fetchContent(badHttpStatusUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.FULL);
+        var result = fetcher.fetchContent(badHttpStatusUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.FULL);

         Assertions.assertInstanceOf(HttpFetchResult.ResultOk.class, result);
         Assertions.assertFalse(result.isOk());
@@ -212,7 +265,7 @@ class HttpFetcherImplFetchTest {
     @Test
     public void testRedirect_NoProbe() throws URISyntaxException, IOException {
-        var result = fetcher.fetchContent(redirectUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+        var result = fetcher.fetchContent(redirectUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);

         Assertions.assertInstanceOf(HttpFetchResult.ResultRedirect.class, result);
         assertEquals(new EdgeUrl("http://localhost:18089/test.html.bin"), ((HttpFetchResult.ResultRedirect) result).url());
@@ -225,7 +278,7 @@ class HttpFetcherImplFetchTest {
     @Test
     public void testRedirect_FullProbe() throws URISyntaxException {
-        var result = fetcher.fetchContent(redirectUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.FULL);
+        var result = fetcher.fetchContent(redirectUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.FULL);

         Assertions.assertInstanceOf(HttpFetchResult.ResultRedirect.class, result);
         assertEquals(new EdgeUrl("http://localhost:18089/test.html.bin"), ((HttpFetchResult.ResultRedirect) result).url());
@@ -238,7 +291,7 @@ class HttpFetcherImplFetchTest {
     public void testFetchTimeout_NoProbe() throws IOException, URISyntaxException {
         Instant requestStart = Instant.now();
-        var result = fetcher.fetchContent(timeoutUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+        var result = fetcher.fetchContent(timeoutUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);

         Assertions.assertInstanceOf(HttpFetchResult.ResultException.class, result);
@@ -262,7 +315,7 @@ class HttpFetcherImplFetchTest {
     @Test
     public void testRangeResponse() throws IOException {
-        var result = fetcher.fetchContent(okRangeResponseUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+        var result = fetcher.fetchContent(okRangeResponseUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);

         Assertions.assertInstanceOf(HttpFetchResult.ResultOk.class, result);
         Assertions.assertTrue(result.isOk());
@@ -279,7 +332,7 @@ class HttpFetcherImplFetchTest {
     @Test
     public void testFetchTimeout_Probe() throws IOException, URISyntaxException {
         Instant requestStart = Instant.now();
-        var result = fetcher.fetchContent(timeoutUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.FULL);
+        var result = fetcher.fetchContent(timeoutUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.FULL);
         Instant requestEnd = Instant.now();

         Assertions.assertInstanceOf(HttpFetchResult.ResultException.class, result);
@@ -302,7 +355,15 @@ class HttpFetcherImplFetchTest {
     @Test
     public void testKeepaliveUrl() {
         // mostly for smoke testing and debugger utility
-        var result = fetcher.fetchContent(keepAliveUrl, warcRecorder, new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+        var result = fetcher.fetchContent(keepAliveUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
+
+        Assertions.assertInstanceOf(HttpFetchResult.ResultOk.class, result);
+        Assertions.assertTrue(result.isOk());
+    }
+
+    @Test
+    public void testPdf() {
+        var result = fetcher.fetchContent(pdfUrl, warcRecorder, new DomainCookies(), new CrawlDelayTimer(1000), ContentTags.empty(), HttpFetcher.ProbeType.FULL);

         Assertions.assertInstanceOf(HttpFetchResult.ResultOk.class, result);
         Assertions.assertTrue(result.isOk());
@@ -319,6 +380,13 @@ class HttpFetcherImplFetchTest {
             WarcXEntityRefused.register(reader);

             for (var record : reader) {
+                // Load the body, we need to do this before we close the reader to have access to the content.
+                if (record instanceof WarcRequest req) {
+                    req.http();
+                } else if (record instanceof WarcResponse rsp) {
+                    rsp.http();
+                }
+
                 records.add(record);
             }
         }

View File: nu.marginalia.crawl.retreival.CrawlerWarcResynchronizerTest

@@ -1,12 +1,12 @@
 package nu.marginalia.crawl.retreival;

+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.model.EdgeUrl;
 import org.apache.hc.client5.http.classic.HttpClient;
-import org.apache.hc.client5.http.cookie.BasicCookieStore;
+import org.apache.hc.client5.http.classic.methods.HttpGet;
 import org.apache.hc.client5.http.impl.classic.HttpClients;
-import org.apache.hc.core5.http.io.support.ClassicRequestBuilder;
 import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
@@ -45,7 +45,7 @@ class CrawlerWarcResynchronizerTest {
     @Test
     void run() throws IOException, URISyntaxException {
-        try (var oldRecorder = new WarcRecorder(fileName, new BasicCookieStore())) {
+        try (var oldRecorder = new WarcRecorder(fileName)) {
             fetchUrl(oldRecorder, "https://www.marginalia.nu/");
             fetchUrl(oldRecorder, "https://www.marginalia.nu/log/");
             fetchUrl(oldRecorder, "https://www.marginalia.nu/feed/");
@@ -55,7 +55,7 @@ class CrawlerWarcResynchronizerTest {
         var crawlFrontier = new DomainCrawlFrontier(new EdgeDomain("www.marginalia.nu"), List.of(), 100);

-        try (var newRecorder = new WarcRecorder(outputFile, new BasicCookieStore())) {
+        try (var newRecorder = new WarcRecorder(outputFile)) {
             new CrawlerWarcResynchronizer(crawlFrontier, newRecorder).run(fileName);
         }
@@ -78,10 +78,10 @@ class CrawlerWarcResynchronizerTest {
     }

     void fetchUrl(WarcRecorder recorder, String url) throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException {
-        var req = ClassicRequestBuilder.get(new java.net.URI(url))
-                .addHeader("User-agent", "test.marginalia.nu")
-                .addHeader("Accept-Encoding", "gzip")
-                .build();
-        recorder.fetch(httpClient, req);
+        HttpGet request = new HttpGet(url);
+        request.addHeader("User-agent", "test.marginalia.nu");
+        request.addHeader("Accept-Encoding", "gzip");
+
+        recorder.fetch(httpClient, new DomainCookies(), request);
     }
 }


@@ -2,6 +2,7 @@ package nu.marginalia.crawl.retreival.fetcher;
 
 import com.sun.net.httpserver.HttpServer;
 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
 import nu.marginalia.crawl.retreival.CrawlDelayTimer;
@@ -88,7 +89,7 @@ class ContentTypeProberTest {
     @Test
     void probeContentTypeOk() throws Exception {
-        HttpFetcher.ContentTypeProbeResult result = fetcher.probeContentType(htmlEndpoint, new CrawlDelayTimer(50), ContentTags.empty());
+        HttpFetcher.ContentTypeProbeResult result = fetcher.probeContentType(htmlEndpoint, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
 
         System.out.println(result);
@@ -97,7 +98,7 @@ class ContentTypeProberTest {
     @Test
     void probeContentTypeRedir() throws Exception {
-        HttpFetcher.ContentTypeProbeResult result = fetcher.probeContentType(htmlRedirEndpoint, new CrawlDelayTimer(50), ContentTags.empty());
+        HttpFetcher.ContentTypeProbeResult result = fetcher.probeContentType(htmlRedirEndpoint, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
 
         System.out.println(result);
@@ -106,7 +107,7 @@ class ContentTypeProberTest {
     @Test
     void probeContentTypeBad() throws Exception {
-        HttpFetcher.ContentTypeProbeResult result = fetcher.probeContentType(binaryEndpoint, new CrawlDelayTimer(50), ContentTags.empty());
+        HttpFetcher.ContentTypeProbeResult result = fetcher.probeContentType(binaryEndpoint, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
 
         System.out.println(result);
@@ -115,7 +116,7 @@ class ContentTypeProberTest {
     @Test
     void probeContentTypeTimeout() throws Exception {
-        HttpFetcher.ContentTypeProbeResult result = fetcher.probeContentType(timeoutEndpoint, new CrawlDelayTimer(50), ContentTags.empty());
+        HttpFetcher.ContentTypeProbeResult result = fetcher.probeContentType(timeoutEndpoint, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
 
         System.out.println(result);


@@ -1,11 +1,11 @@
 package nu.marginalia.crawl.retreival.fetcher;
 
 import com.sun.net.httpserver.HttpServer;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import org.apache.hc.client5.http.classic.HttpClient;
-import org.apache.hc.client5.http.cookie.BasicCookieStore;
+import org.apache.hc.client5.http.classic.methods.HttpGet;
 import org.apache.hc.client5.http.impl.classic.HttpClients;
-import org.apache.hc.core5.http.io.support.ClassicRequestBuilder;
 import org.junit.jupiter.api.*;
 import org.netpreserve.jwarc.WarcReader;
 import org.netpreserve.jwarc.WarcRequest;
@@ -51,14 +51,14 @@ class WarcRecorderFakeServerTest {
         os.write("<html><body>hello</body></html>".getBytes());
         os.flush();
         try {
-            TimeUnit.SECONDS.sleep(1);
+            TimeUnit.SECONDS.sleep(2);
         } catch (InterruptedException e) {
             throw new RuntimeException(e);
         }
         os.write(":".getBytes());
         os.flush();
         try {
-            TimeUnit.SECONDS.sleep(1);
+            TimeUnit.SECONDS.sleep(2);
         } catch (InterruptedException e) {
             throw new RuntimeException(e);
         }
@@ -89,24 +89,22 @@ class WarcRecorderFakeServerTest {
         fileNameWarc = Files.createTempFile("test", ".warc");
         fileNameParquet = Files.createTempFile("test", ".parquet");
 
-        client = new WarcRecorder(fileNameWarc, new BasicCookieStore());
+        client = new WarcRecorder(fileNameWarc);
     }
 
     @AfterEach
     public void tearDown() throws Exception {
         client.close();
         Files.delete(fileNameWarc);
     }
 
     @Test
     public void fetchFast() throws Exception {
-        client.fetch(httpClient,
-                ClassicRequestBuilder
-                        .get(new java.net.URI("http://localhost:14510/fast"))
-                        .addHeader("User-agent", "test.marginalia.nu")
-                        .addHeader("Accept-Encoding", "gzip")
-                        .build()
-        );
+        HttpGet request = new HttpGet("http://localhost:14510/fast");
+        request.addHeader("User-agent", "test.marginalia.nu");
+        request.addHeader("Accept-Encoding", "gzip");
+        client.fetch(httpClient, new DomainCookies(), request);
 
         Map<String, String> sampleData = new HashMap<>();
         try (var warcReader = new WarcReader(fileNameWarc)) {
@@ -127,11 +125,13 @@ class WarcRecorderFakeServerTest {
     public void fetchSlow() throws Exception {
         Instant start = Instant.now();
+
+        HttpGet request = new HttpGet("http://localhost:14510/slow");
+        request.addHeader("User-agent", "test.marginalia.nu");
+        request.addHeader("Accept-Encoding", "gzip");
+
         client.fetch(httpClient,
-                ClassicRequestBuilder.get(new java.net.URI("http://localhost:14510/slow"))
-                        .addHeader("User-agent", "test.marginalia.nu")
-                        .addHeader("Accept-Encoding", "gzip")
-                        .build(),
+                new DomainCookies(),
+                request,
                 Duration.ofSeconds(1)
         );
         Instant end = Instant.now();
@@ -149,6 +149,8 @@ class WarcRecorderFakeServerTest {
             });
         }
 
+        System.out.println(
+                Files.readString(fileNameWarc));
         System.out.println(sampleData);
 
         // Timeout is set to 1 second, but the server will take 5 seconds to respond,
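The slow-path test exercises the fetch overload that takes an explicit time budget. A sketch of that variant, with names taken from the diff (the endpoint is the test's fake server):

    HttpGet request = new HttpGet("http://localhost:14510/slow");
    request.addHeader("User-agent", "test.marginalia.nu");
    request.addHeader("Accept-Encoding", "gzip");

    // With a 1 s budget against a server that stalls mid-response,
    // the transfer is cut short; the test asserts on the elapsed time.
    client.fetch(httpClient, new DomainCookies(), request, Duration.ofSeconds(1));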


@@ -2,14 +2,14 @@ package nu.marginalia.crawl.retreival.fetcher;
 
 import nu.marginalia.UserAgent;
 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
+import nu.marginalia.io.SerializableCrawlDataStream;
 import nu.marginalia.model.EdgeUrl;
-import nu.marginalia.parquet.crawldata.CrawledDocumentParquetRecordFileReader;
-import nu.marginalia.parquet.crawldata.CrawledDocumentParquetRecordFileWriter;
+import nu.marginalia.slop.SlopCrawlDataRecord;
 import org.apache.hc.client5.http.classic.HttpClient;
-import org.apache.hc.client5.http.cookie.BasicCookieStore;
+import org.apache.hc.client5.http.classic.methods.HttpGet;
 import org.apache.hc.client5.http.impl.classic.HttpClients;
-import org.apache.hc.core5.http.io.support.ClassicRequestBuilder;
 import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
@@ -24,13 +24,14 @@ import java.nio.file.Files;
 import java.nio.file.Path;
 import java.security.NoSuchAlgorithmException;
 import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
 
 class WarcRecorderTest {
     Path fileNameWarc;
-    Path fileNameParquet;
+    Path fileNameSlop;
     WarcRecorder client;
 
     HttpClient httpClient;
@@ -39,9 +40,9 @@ class WarcRecorderTest {
         httpClient = HttpClients.createDefault();
 
         fileNameWarc = Files.createTempFile("test", ".warc");
-        fileNameParquet = Files.createTempFile("test", ".parquet");
+        fileNameSlop = Files.createTempFile("test", ".slop.zip");
 
-        client = new WarcRecorder(fileNameWarc, new BasicCookieStore());
+        client = new WarcRecorder(fileNameWarc);
     }
 
     @AfterEach
@@ -52,12 +53,12 @@ class WarcRecorderTest {
 
     @Test
     void fetch() throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException {
-        client.fetch(httpClient,
-                ClassicRequestBuilder.get(new java.net.URI("https://www.marginalia.nu/"))
-                        .addHeader("User-agent", "test.marginalia.nu")
-                        .addHeader("Accept-Encoding", "gzip")
-                        .build()
-        );
+        HttpGet request = new HttpGet("https://www.marginalia.nu/");
+        request.addHeader("User-agent", "test.marginalia.nu");
+        request.addHeader("Accept-Encoding", "gzip");
+
+        client.fetch(httpClient, new DomainCookies(), request);
 
         Map<String, String> sampleData = new HashMap<>();
         try (var warcReader = new WarcReader(fileNameWarc)) {
@@ -78,8 +79,9 @@ class WarcRecorderTest {
 
     @Test
     public void flagAsSkipped() throws IOException, URISyntaxException {
-        try (var recorder = new WarcRecorder(fileNameWarc, new BasicCookieStore())) {
+        try (var recorder = new WarcRecorder(fileNameWarc)) {
             recorder.writeReferenceCopy(new EdgeUrl("https://www.marginalia.nu/"),
+                    new DomainCookies(),
                     "text/html",
                     200,
                     "<?doctype html><html><body>test</body></html>".getBytes(),
@@ -102,8 +104,9 @@ class WarcRecorderTest {
 
     @Test
     public void flagAsSkippedNullBody() throws IOException, URISyntaxException {
-        try (var recorder = new WarcRecorder(fileNameWarc, new BasicCookieStore())) {
+        try (var recorder = new WarcRecorder(fileNameWarc)) {
             recorder.writeReferenceCopy(new EdgeUrl("https://www.marginalia.nu/"),
+                    new DomainCookies(),
                     "text/html",
                     200,
                     null,
@@ -114,8 +117,9 @@ class WarcRecorderTest {
 
     @Test
     public void testSaveImport() throws URISyntaxException, IOException {
-        try (var recorder = new WarcRecorder(fileNameWarc, new BasicCookieStore())) {
+        try (var recorder = new WarcRecorder(fileNameWarc)) {
             recorder.writeReferenceCopy(new EdgeUrl("https://www.marginalia.nu/"),
+                    new DomainCookies(),
                     "text/html",
                     200,
                     "<?doctype html><html><body>test</body></html>".getBytes(),
@@ -138,35 +142,46 @@ class WarcRecorderTest {
 
     @Test
    public void testConvertToParquet() throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException {
-        client.fetch(httpClient, ClassicRequestBuilder
-                .get(new java.net.URI("https://www.marginalia.nu/"))
-                .addHeader("User-agent", "test.marginalia.nu")
-                .addHeader("Accept-Encoding", "gzip")
-                .build());
+        HttpGet request1 = new HttpGet("https://www.marginalia.nu/");
+        request1.addHeader("User-agent", "test.marginalia.nu");
+        request1.addHeader("Accept-Encoding", "gzip");
 
-        client.fetch(httpClient, ClassicRequestBuilder
-                .get(new java.net.URI("https://www.marginalia.nu/log/"))
-                .addHeader("User-agent", "test.marginalia.nu")
-                .addHeader("Accept-Encoding", "gzip")
-                .build());
+        client.fetch(httpClient, new DomainCookies(), request1);
 
-        client.fetch(httpClient, ClassicRequestBuilder
-                .get(new java.net.URI("https://www.marginalia.nu/sanic.png"))
-                .addHeader("User-agent", "test.marginalia.nu")
-                .addHeader("Accept-Encoding", "gzip")
-                .build());
+        HttpGet request2 = new HttpGet("https://www.marginalia.nu/log/");
+        request2.addHeader("User-agent", "test.marginalia.nu");
+        request2.addHeader("Accept-Encoding", "gzip");
 
-        CrawledDocumentParquetRecordFileWriter.convertWarc(
+        client.fetch(httpClient, new DomainCookies(), request2);
+
+        HttpGet request3 = new HttpGet("https://www.marginalia.nu/sanic.png");
+        request3.addHeader("User-agent", "test.marginalia.nu");
+        request3.addHeader("Accept-Encoding", "gzip");
+
+        client.fetch(httpClient, new DomainCookies(), request3);
+
+        HttpGet request4 = new HttpGet("https://downloads.marginalia.nu/test.pdf");
+        request4.addHeader("User-agent", "test.marginalia.nu");
+        request4.addHeader("Accept-Encoding", "gzip");
+
+        client.fetch(httpClient, new DomainCookies(), request4);
+
+        SlopCrawlDataRecord.convertWarc(
                 "www.marginalia.nu",
                 new UserAgent("test", "test"),
                 fileNameWarc,
-                fileNameParquet);
+                fileNameSlop);
 
-        var urls = CrawledDocumentParquetRecordFileReader.stream(fileNameParquet).map(doc -> doc.url).toList();
-        assertEquals(2, urls.size());
+        List<String> urls;
+        try (var stream = SerializableCrawlDataStream.openDataStream(fileNameSlop)) {
+            urls = stream.docsAsList().stream().map(doc -> doc.url.toString()).toList();
+        }
+
+        assertEquals(3, urls.size());
         assertEquals("https://www.marginalia.nu/", urls.get(0));
         assertEquals("https://www.marginalia.nu/log/", urls.get(1));
         // sanic.jpg gets filtered out for its bad mime type
+        assertEquals("https://downloads.marginalia.nu/test.pdf", urls.get(2));
    }
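With the parquet writer swapped out for slop, reading the converted crawl data back goes through SerializableCrawlDataStream rather than a parquet record reader. A compact sketch using only calls that appear in this diff (warcPath and slopPath stand in for the test's temp files):

    SlopCrawlDataRecord.convertWarc("www.marginalia.nu",
            new UserAgent("test", "test"), warcPath, slopPath);

    try (var stream = SerializableCrawlDataStream.openDataStream(slopPath)) {
        for (var doc : stream.docsAsList()) {
            System.out.println(doc.url);  // documents in crawl order, bad mime types filtered out
        }
    }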


@@ -1,6 +1,7 @@
 package nu.marginalia.crawling;
 
 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
@@ -31,7 +32,7 @@ class HttpFetcherTest {
     void fetchUTF8() throws Exception {
         var fetcher = new HttpFetcherImpl("nu.marginalia.edge-crawler");
         try (var recorder = new WarcRecorder()) {
-            var result = fetcher.fetchContent(new EdgeUrl("https://www.marginalia.nu"), recorder, new CrawlDelayTimer(100), ContentTags.empty(), HttpFetcher.ProbeType.FULL);
+            var result = fetcher.fetchContent(new EdgeUrl("https://www.marginalia.nu"), recorder, new DomainCookies(), new CrawlDelayTimer(100), ContentTags.empty(), HttpFetcher.ProbeType.FULL);
             if (DocumentBodyExtractor.asString(result) instanceof DocumentBodyResult.Ok bodyOk) {
                 System.out.println(bodyOk.contentType());
             }
@@ -49,7 +50,7 @@ class HttpFetcherTest {
         var fetcher = new HttpFetcherImpl("nu.marginalia.edge-crawler");
         try (var recorder = new WarcRecorder()) {
-            var result = fetcher.fetchContent(new EdgeUrl("https://www.marginalia.nu/robots.txt"), recorder, new CrawlDelayTimer(100), ContentTags.empty(), HttpFetcher.ProbeType.FULL);
+            var result = fetcher.fetchContent(new EdgeUrl("https://www.marginalia.nu/robots.txt"), recorder, new DomainCookies(), new CrawlDelayTimer(100), ContentTags.empty(), HttpFetcher.ProbeType.FULL);
             if (DocumentBodyExtractor.asString(result) instanceof DocumentBodyResult.Ok bodyOk) {
                 System.out.println(bodyOk.contentType());
             }


@@ -3,10 +3,7 @@ package nu.marginalia.crawling.retreival;
 
 import crawlercommons.robots.SimpleRobotRules;
 import nu.marginalia.crawl.CrawlerMain;
 import nu.marginalia.crawl.DomainStateDb;
-import nu.marginalia.crawl.fetcher.ContentTags;
-import nu.marginalia.crawl.fetcher.HttpFetcher;
-import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
-import nu.marginalia.crawl.fetcher.SitemapRetriever;
+import nu.marginalia.crawl.fetcher.*;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import nu.marginalia.crawl.retreival.CrawlDelayTimer;
 import nu.marginalia.crawl.retreival.CrawlerRetreiver;
@@ -137,7 +134,7 @@ public class CrawlerMockFetcherTest {
     }
 
     @Override
-    public HttpFetchResult fetchContent(EdgeUrl url, WarcRecorder recorder, CrawlDelayTimer timer, ContentTags tags, ProbeType probeType) {
+    public HttpFetchResult fetchContent(EdgeUrl url, WarcRecorder recorder, DomainCookies cookies, CrawlDelayTimer timer, ContentTags tags, ProbeType probeType) {
         logger.info("Fetching {}", url);
         if (mockData.containsKey(url)) {
             byte[] bodyBytes = mockData.get(url).documentBodyBytes;


@@ -16,7 +16,6 @@ import nu.marginalia.model.crawldata.CrawledDocument;
 import nu.marginalia.model.crawldata.CrawledDomain;
 import nu.marginalia.model.crawldata.SerializableCrawlData;
 import nu.marginalia.slop.SlopCrawlDataRecord;
-import org.apache.hc.client5.http.cookie.BasicCookieStore;
 import org.jetbrains.annotations.NotNull;
 import org.junit.jupiter.api.*;
 import org.netpreserve.jwarc.*;
@@ -180,7 +179,7 @@ class CrawlerRetreiverTest {
                 new EdgeDomain("www.marginalia.nu"),
                 List.of(), 100);
         var resync = new CrawlerWarcResynchronizer(revisitCrawlFrontier,
-                new WarcRecorder(tempFileWarc2, new BasicCookieStore())
+                new WarcRecorder(tempFileWarc2)
         );
 
         // truncate the size of the file to simulate a crash
@@ -456,7 +455,7 @@ class CrawlerRetreiverTest {
                 List.of(), 100);
         var resync = new CrawlerWarcResynchronizer(revisitCrawlFrontier,
-                new WarcRecorder(tempFileWarc3, new BasicCookieStore())
+                new WarcRecorder(tempFileWarc3)
         );
 
         // truncate the size of the file to simulate a crash
@@ -507,7 +506,7 @@ class CrawlerRetreiverTest {
     }
 
     private void doCrawlWithReferenceStream(CrawlerMain.CrawlSpecRecord specs, CrawlDataReference reference) {
-        try (var recorder = new WarcRecorder(tempFileWarc2, new BasicCookieStore());
+        try (var recorder = new WarcRecorder(tempFileWarc2);
             var db = new DomainStateDb(tempFileDb)
         ) {
             new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, db, recorder).crawlDomain(new DomainLinks(), reference);
@@ -519,7 +518,7 @@ class CrawlerRetreiverTest {
     @NotNull
     private DomainCrawlFrontier doCrawl(Path tempFileWarc1, CrawlerMain.CrawlSpecRecord specs) {
-        try (var recorder = new WarcRecorder(tempFileWarc1, new BasicCookieStore());
+        try (var recorder = new WarcRecorder(tempFileWarc1);
             var db = new DomainStateDb(tempFileDb)
         ) {
             var crawler = new CrawlerRetreiver(httpFetcher, new DomainProber(d -> true), specs, db, recorder);


@@ -53,6 +53,8 @@ dependencies {
     implementation libs.commons.compress
     implementation libs.commons.codec
     implementation libs.jsoup
+    implementation libs.slop
+    implementation libs.jwarc


@@ -3,11 +3,15 @@ package nu.marginalia.extractor;
 
 import com.google.inject.Inject;
 import nu.marginalia.process.log.WorkLog;
 import nu.marginalia.process.log.WorkLogEntry;
+import nu.marginalia.slop.SlopCrawlDataRecord;
+import nu.marginalia.slop.SlopTablePacker;
 import nu.marginalia.storage.FileStorageService;
 import nu.marginalia.storage.model.FileStorage;
 import nu.marginalia.storage.model.FileStorageId;
 import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
 import org.apache.commons.compress.utils.IOUtils;
+import org.apache.commons.io.FileUtils;
+import org.apache.commons.lang3.StringUtils;
 
 import java.io.IOException;
 import java.nio.file.Files;
@@ -27,7 +31,7 @@ public class SampleDataExporter {
     public SampleDataExporter(FileStorageService storageService) {
         this.storageService = storageService;
     }
 
-    public void export(FileStorageId crawlId, FileStorageId destId, int size, String name) throws SQLException, IOException {
+    public void export(FileStorageId crawlId, FileStorageId destId, int size, String ctFilter, String name) throws SQLException, IOException {
         FileStorage destStorage = storageService.getStorage(destId);
         Path inputDir = storageService.getStorage(crawlId).asPath();
@@ -54,6 +58,7 @@ public class SampleDataExporter {
         Path newCrawlerLogFile = Files.createTempFile(destStorage.asPath(), "crawler", ".log",
                 PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rw-r--r--")));
         try (var bw = Files.newBufferedWriter(newCrawlerLogFile, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
             for (var item : entriesAll) {
                 bw.write(item.id() + " " + item.ts() + " " + item.relPath() + " " + item.cnt() + "\n");
@@ -72,7 +77,22 @@ public class SampleDataExporter {
             Path crawlDataPath = inputDir.resolve(item.relPath());
             if (!Files.exists(crawlDataPath)) continue;
 
-            addFileToTar(stream, crawlDataPath, item.relPath());
+            if (StringUtils.isBlank(ctFilter)) {
+                addFileToTar(stream, crawlDataPath, item.relPath());
+            }
+            else /* filter != null */ {
+                boolean didFilterData = false;
+                try {
+                    crawlDataPath = filterEntries(crawlDataPath, ctFilter);
+                    didFilterData = true;
+                    addFileToTar(stream, crawlDataPath, item.relPath());
+                }
+                finally {
+                    if (didFilterData) {
+                        Files.deleteIfExists(crawlDataPath);
+                    }
+                }
+            }
         }
 
         addFileToTar(stream, newCrawlerLogFile, "crawler.log");
@@ -86,6 +106,46 @@ public class SampleDataExporter {
         Files.move(tmpTarFile, destStorage.asPath().resolve("crawl-data.tar"), StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
     }
 
+    /** Filters the entries in the crawl data file based on the content type.
+     * @param crawlDataPath The path to the crawl data file.
+     * @param contentTypeFilter The content type to filter by.
+     * @return The path to the filtered crawl data file.
+     */
+    private Path filterEntries(Path crawlDataPath, String contentTypeFilter) throws IOException {
+        Path tempDir = crawlDataPath.resolveSibling(crawlDataPath.getFileName() + ".filtered");
+        Path tempFile = crawlDataPath.resolveSibling(crawlDataPath.getFileName() + ".filtered.slop.zip");
+
+        Files.createDirectory(tempDir);
+
+        try (var writer = new SlopCrawlDataRecord.Writer(tempDir);
+             var reader = new SlopCrawlDataRecord.FilteringReader(crawlDataPath) {
+                 @Override
+                 public boolean filter(String url, int status, String contentType) {
+                     if (contentTypeFilter.equals(contentType))
+                         return true;
+                     else if (contentType.startsWith("x-marginalia/"))
+                         // This is a metadata entry, typically domain or redirect information;
+                         // let's keep those to not confuse the consumer of the data, which might
+                         // expect at least the domain summary.
+                         return true;
+                     return false;
+                 }
+             }
+        ) {
+            while (reader.hasRemaining()) {
+                writer.write(reader.get());
+            }
+
+            SlopTablePacker.packToSlopZip(tempDir, tempFile);
+        }
+        finally {
+            FileUtils.deleteDirectory(tempDir.toFile());
+        }
+
+        return tempFile;
+    }
+
     private void addFileToTar(TarArchiveOutputStream outputStream, Path file, String fileName) throws IOException {
         var entry = outputStream.createArchiveEntry(file.toFile(), fileName);
         entry.setSize(Files.size(file));
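A hypothetical invocation of the extended export, keeping only PDF documents plus the x-marginalia/ metadata records that the FilteringReader above always passes through (exporter and the storage ids are placeholders):

    exporter.export(crawlId, destId, 1000, "application/pdf", "pdf-sample");

Passing a blank ctFilter takes the first branch and preserves the old behavior of copying each crawl data file into the tar verbatim.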


@@ -92,7 +92,7 @@ public class ExportTasksMain extends ProcessMainClass {
                 termFrequencyExporter.export(request.crawlId, request.destId);
                 break;
             case SAMPLE_DATA:
-                sampleDataExporter.export(request.crawlId, request.destId, request.size, request.name);
+                sampleDataExporter.export(request.crawlId, request.destId, request.size, request.ctFilter, request.name);
                 break;
             case ADJACENCIES:
                 websiteAdjacenciesCalculator.export();


@@ -16,6 +16,7 @@ public class ExportTaskRequest {
     public FileStorageId destId;
     public int size;
     public String name;
+    public String ctFilter;
 
     public ExportTaskRequest(Task task) {
         this.task = task;
@@ -42,12 +43,13 @@ public class ExportTaskRequest {
         return request;
     }
 
-    public static ExportTaskRequest sampleData(FileStorageId crawlId, FileStorageId destId, int size, String name) {
+    public static ExportTaskRequest sampleData(FileStorageId crawlId, FileStorageId destId, String ctFilter, int size, String name) {
         ExportTaskRequest request = new ExportTaskRequest(Task.SAMPLE_DATA);
         request.crawlId = crawlId;
         request.destId = destId;
         request.size = size;
         request.name = name;
+        request.ctFilter = ctFilter;
         return request;
     }
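For completeness, the message-queue side of the same request; note that ctFilter precedes size in this factory method, while SampleDataExporter.export takes it after size, which is easy to transpose at a call site (the ids are placeholders):

    var request = ExportTaskRequest.sampleData(crawlId, destId, "application/pdf", 1000, "pdf-sample");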


@@ -3,7 +3,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }
 
 java {


@@ -3,7 +3,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }
 
 application {


@@ -3,7 +3,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }
 
 application {


@@ -5,7 +5,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }
 
 application {


@@ -3,7 +3,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
     id 'gg.jte.gradle' version '3.1.15'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }
 
 application {


@@ -26,4 +26,10 @@
     <link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml" title="Marginalia">
 </head>
+<noscript>
+    <h1>Users of text-based browsers</h1>
+    <p>Consider using the old interface at <a href="https://old-search.marginalia.nu/">https://old-search.marginalia.nu/</a>,
+        as it uses fewer modern CSS tricks and should work better than the new UI.  It is functionally nearly identical, just rendered with a different layout.</p>
+    <hr>
+</noscript>


@@ -1,9 +1,16 @@
 This is a bit of a hack!
 
 This class exists to let tailwind know we're using these classes even though they aren't visible in the code,
-as we sometimes generate classes from Java code!
+as we sometimes generate classes from Java code or javascript!
 
 <i class="text-blue-800 bg-blue-50 dark:text-blue-200 dark:bg-blue-950"></i>
 <i class="text-green-800 bg-green-50 dark:text-green-200 dark:bg-green-950"></i>
 <i class="text-purple-800 bg-purple-50 dark:text-purple-200 dark:bg-purple-950"></i>
 <i class="text-blue-950 bg-gray-100 dark:text-blue-50 dark:bg-gray-900"></i>
+<span class="hover:bg-gray-300 "></span>
+<label class="suggestion group block relative">
+    <input type="radio" name="suggestion" class="peer hidden" checked>
+    <div class="px-4 py-2 cursor-pointer dark:peer-checked:bg-gray-700 dark:hover:bg-gray-700 peer-checked:bg-gray-300 hover:bg-gray-300 w-full">
+    </div>
+</label>


@@ -26,7 +26,7 @@
 
     <!-- Main content -->
     <main class="flex-1 p-4 max-w-2xl space-y-4">
-        <div class="border dark:border-gray-600 rounded bg-white text-black dark:bg-gray-800 dark:text-white text-m p-4">
+        <div class="border border-gray-300 dark:border-gray-600 rounded bg-white text-black dark:bg-gray-800 dark:text-white text-m p-4">
             <div class="flex space-x-3 place-items-baseline">
                 <i class="fa fa-circle-exclamation text-red-800"></i>
                 <div class="grow">${model.errorTitle()}</div>


@@ -80,10 +80,6 @@
 <tr><td>rank&gt;50</td><td>The ranking of the website is at least 50 in a span of 1 - 255</td></tr>
 <tr><td>rank&lt;50</td><td>The ranking of the website is at most 50 in a span of 1 - 255</td></tr>
-<tr><td>count&gt;10</td><td>The search term must appear in at least 10 results from the domain</td></tr>
-<tr><td>count&lt;10</td><td>The search term must appear in at most 10 results from the domain</td></tr>
 <tr><td>format:html5</td><td>Filter documents using the HTML5 standard. This is typically modern websites.</td></tr>
 <tr><td>format:xhtml</td><td>Filter documents using the XHTML standard</td></tr>
 <tr><td>format:html123</td><td>Filter documents using the HTML standards 1, 2, and 3. This is typically very old websites.</td></tr>


@@ -7,13 +7,13 @@
 
     <form class="flex-1 max-w-2xl" action="/search">
         <div class="flex">
-            @if (query.isBlank())
+            @if (query != null && query.isBlank())
                 <%-- Add autofocus if the query is blank --%>
                 <input type="text"
                        class="shadow-inner flex-1 dark:bg-black dark:text-gray-100 bg-gray-50 border dark:border-gray-600 border-gray-300 text-gray-900 text-sm rounded-sm block w-full p-2.5"
                        value="${query}"
                        autofocus
-                       placeholder="Search..."
+                       placeholder="Search the web!"
                        autocomplete="off"
                        name="query"
                        id="searchInput" />
@@ -21,13 +21,13 @@
                 <input type="text"
                        class="shadow-inner flex-1 dark:bg-black dark:text-gray-100 bg-gray-50 border dark:border-gray-600 border-gray-300 text-gray-900 text-sm rounded-sm block w-full p-2.5"
                        value="${query}"
-                       placeholder="Search..."
+                       placeholder="Search the web!"
                        autocomplete="off"
                        name="query"
                        id="searchInput" />
             @endif
-            <div id="searchSuggestions" class="text-sm absolute top-2 mt-10 w-96 bg-white dark:bg-black border dark:border-gray-600 border-gray-200 rounded-lg shadow-lg hidden"></div>
+            <div aria-hidden="true" id="searchSuggestions" class="text-sm absolute top-3 mt-10 w-96 bg-white dark:bg-black border dark:border-gray-600 border-gray-300 rounded-lg shadow-lg hidden"></div>
 
             <button class="px-4 py-2 bg-margeblue text-white ml-2 rounded whitespace-nowrap active:text-slate-200">
                 <i class="fas fa-search text-sm sm:mr-3"></i>


@@ -43,13 +43,13 @@ function displaySuggestions(suggestions) {
     }
 
     suggestionsContainer.innerHTML = suggestions.map((suggestion, index) => `
-        <div
-            class="suggestion px-4 py-2 cursor-pointer hover:bg-gray-100 ${index === selectedIndex ? 'bg-blue-50' : ''}"
-            data-index="${index}"
-        >
-            ${suggestion}
-        </div>
+        <label class="suggestion group block relative">
+            <input type="radio" name="suggestion" class="peer hidden" ${index === selectedIndex ? 'checked' : ''}>
+            <div class="px-4 py-2 cursor-pointer dark:peer-checked:bg-gray-700 dark:hover:bg-gray-700 peer-checked:bg-gray-300 hover:bg-gray-300 w-full" data-index="${index}">
+                ${suggestion}
+            </div>
+        </label>
     `).join('');
 
     suggestionsContainer.classList.remove('hidden');


@@ -2,7 +2,7 @@ plugins {
     id 'java'
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }
 
 java {


@@ -3,7 +3,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }
 
 application {


@@ -10,7 +10,8 @@ import static com.google.inject.name.Names.named;
 
 public class AssistantModule extends AbstractModule {
     public void configure() {
-        bind(Path.class).annotatedWith(named("suggestions-file")).toInstance(WmsaHome.getHomePath().resolve("data/suggestions.txt"));
+        bind(Path.class).annotatedWith(named("suggestions-file1")).toInstance(WmsaHome.getHomePath().resolve("data/suggestions2.txt.gz"));
+        bind(Path.class).annotatedWith(named("suggestions-file2")).toInstance(WmsaHome.getHomePath().resolve("data/suggestions3.txt.gz"));
 
         bind(LanguageModels.class).toInstance(WmsaHome.getLanguageModels());
     }


@@ -0,0 +1,465 @@
package nu.marginalia.assistant.suggest;
import gnu.trove.list.array.TIntArrayList;
import org.jetbrains.annotations.NotNull;
import java.util.*;
/** Unhinged data structure for fast prefix searching.
*/
public class PrefixSearchStructure {
// Core data structures
private final HashMap<String, TIntArrayList> prefixIndex; // Short prefix index (up to 8 chars)
private final HashMap<String, TIntArrayList> longPrefixIndex; // Long prefix index (9-16 chars)
private final ArrayList<String> words; // All words by ID
private final TIntArrayList wordScores; // Scores for all words
// Configuration
private static final int SHORT_PREFIX_LENGTH = 8;
private static final int MAX_INDEXED_PREFIX_LENGTH = 16;
public int size() {
return words.size();
}
// For sorting efficiency
private static class WordScorePair {
final String word;
final int score;
WordScorePair(String word, int score) {
this.word = word;
this.score = score;
}
}
/**
* Creates a new PrefixSearchStructure for typeahead search.
*/
public PrefixSearchStructure() {
prefixIndex = new HashMap<>(1024);
longPrefixIndex = new HashMap<>(1024);
words = new ArrayList<>(1024);
wordScores = new TIntArrayList(1024);
}
/**
* Adds a prefix to the index.
*/
private void indexPrefix(String word, int wordId) {
// Index short prefixes
for (int i = 1; i <= Math.min(word.length(), SHORT_PREFIX_LENGTH); i++) {
String prefix = word.substring(0, i);
TIntArrayList wordIds = prefixIndex.computeIfAbsent(
prefix, k -> new TIntArrayList(16));
wordIds.add(wordId);
}
// Index longer prefixes
for (int i = SHORT_PREFIX_LENGTH + 1; i <= Math.min(word.length(), MAX_INDEXED_PREFIX_LENGTH); i++) {
String prefix = word.substring(0, i);
TIntArrayList wordIds = longPrefixIndex.computeIfAbsent(
prefix, k -> new TIntArrayList(8));
wordIds.add(wordId);
}
// If the word contains spaces, also index by each term for multi-word queries
if (word.contains(" ")) {
String[] terms = word.split("\\s+");
for (String term : terms) {
if (term.length() >= 2) {
for (int i = 1; i <= Math.min(term.length(), SHORT_PREFIX_LENGTH); i++) {
String termPrefix = "t:" + term.substring(0, i);
TIntArrayList wordIds = prefixIndex.computeIfAbsent(
termPrefix, k -> new TIntArrayList(16));
wordIds.add(wordId);
}
}
}
}
}
/**
* Inserts a word with its associated score.
*/
public void insert(String word, int score) {
if (word == null || word.isEmpty()) {
return;
}
// Add to the word list and index
int wordId = words.size();
words.add(word);
wordScores.add(score);
indexPrefix(word, wordId);
}
/**
* Returns the top k completions for a given prefix.
*/
public List<ScoredSuggestion> getTopCompletions(String prefix, int k) {
if (prefix == null || prefix.isEmpty()) {
// Return top k words by score
return getTopKWords(k);
}
// Check if this is a term search (t:) - for searching within multi-word items
boolean isTermSearch = false;
if (prefix.startsWith("t:") && prefix.length() > 2) {
isTermSearch = true;
prefix = prefix.substring(2);
}
// 1. Fast path for short prefixes
if (prefix.length() <= SHORT_PREFIX_LENGTH) {
String lookupPrefix = isTermSearch ? "t:" + prefix : prefix;
TIntArrayList wordIds = prefixIndex.get(lookupPrefix);
if (wordIds != null) {
return getTopKFromWordIds(wordIds, k);
}
}
// 2. Fast path for long prefixes (truncate to MAX_INDEXED_PREFIX_LENGTH)
if (prefix.length() > SHORT_PREFIX_LENGTH) {
// Try exact match in longPrefixIndex first
if (prefix.length() <= MAX_INDEXED_PREFIX_LENGTH) {
TIntArrayList wordIds = longPrefixIndex.get(prefix);
if (wordIds != null) {
return getTopKFromWordIds(wordIds, k);
}
}
// If prefix is longer than MAX_INDEXED_PREFIX_LENGTH, truncate and filter
if (prefix.length() > MAX_INDEXED_PREFIX_LENGTH) {
String truncatedPrefix = prefix.substring(0, MAX_INDEXED_PREFIX_LENGTH);
TIntArrayList candidateIds = longPrefixIndex.get(truncatedPrefix);
if (candidateIds != null) {
// Filter candidates by the full prefix
return getFilteredTopKFromWordIds(candidateIds, prefix, k);
}
}
}
// 3. Optimized fallback for long prefixes - use prefix tree for segments
List<ScoredSuggestion> results = new ArrayList<>();
// Handle multi-segment queries by finding candidates from first 8 chars
if (prefix.length() > SHORT_PREFIX_LENGTH) {
String shortPrefix = prefix.substring(0, Math.min(prefix.length(), SHORT_PREFIX_LENGTH));
TIntArrayList candidates = prefixIndex.get(shortPrefix);
if (candidates != null) {
return getFilteredTopKFromWordIds(candidates, prefix, k);
}
}
// 4. Last resort - optimized binary search in sorted segments
return findByBinarySearchPrefix(prefix, k);
}
/**
* Helper to get the top k words by score.
*/
private List<ScoredSuggestion> getTopKWords(int k) {
// Create pairs of (score, wordId)
int[][] pairs = new int[words.size()][2];
for (int i = 0; i < words.size(); i++) {
pairs[i][0] = wordScores.get(i);
pairs[i][1] = i;
}
// Sort by score (descending)
Arrays.sort(pairs, (a, b) -> Integer.compare(b[0], a[0]));
// Take top k
List<ScoredSuggestion> results = new ArrayList<>();
for (int i = 0; i < Math.min(k, pairs.length); i++) {
String word = words.get(pairs[i][1]);
int score = pairs[i][0];
results.add(new ScoredSuggestion(word, score));
}
return results;
}
/**
* Helper to get the top k words from a list of word IDs.
*/
private List<ScoredSuggestion> getTopKFromWordIds(TIntArrayList wordIds, int k) {
if (wordIds == null || wordIds.isEmpty()) {
return Collections.emptyList();
}
// For small lists, avoid sorting
if (wordIds.size() <= k) {
List<ScoredSuggestion> results = new ArrayList<>(wordIds.size());
int[] ids = wordIds.toArray();
for (int wordId : ids) {
if (wordId >= 0 && wordId < words.size()) {
results.add(new ScoredSuggestion(words.get(wordId), wordScores.get(wordId)));
}
}
results.sort((a, b) -> Integer.compare(b.getScore(), a.getScore()));
return results;
}
// For larger lists, use an array-based approach for better performance
// Find top k without full sorting
int[] topScores = new int[k];
int[] topWordIds = new int[k];
int[] ids = wordIds.toArray();
// Initialize with first k elements
int filledCount = Math.min(k, ids.length);
for (int i = 0; i < filledCount; i++) {
int wordId = ids[i];
if (wordId >= 0 && wordId < words.size()) {
topWordIds[i] = wordId;
topScores[i] = wordScores.get(wordId);
}
}
// Sort initial elements
for (int i = 0; i < filledCount; i++) {
for (int j = i + 1; j < filledCount; j++) {
if (topScores[j] > topScores[i]) {
// Swap scores
int tempScore = topScores[i];
topScores[i] = topScores[j];
topScores[j] = tempScore;
// Swap word IDs
int tempId = topWordIds[i];
topWordIds[i] = topWordIds[j];
topWordIds[j] = tempId;
}
}
}
// Process remaining elements
int minScore = filledCount > 0 ? topScores[filledCount - 1] : Integer.MIN_VALUE;
for (int i = k; i < ids.length; i++) {
int wordId = ids[i];
if (wordId >= 0 && wordId < words.size()) {
int score = wordScores.get(wordId);
if (score > minScore) {
// Replace the lowest element
topScores[filledCount - 1] = score;
topWordIds[filledCount - 1] = wordId;
// Bubble up the new element
for (int j = filledCount - 1; j > 0; j--) {
if (topScores[j] > topScores[j - 1]) {
// Swap scores
int tempScore = topScores[j];
topScores[j] = topScores[j - 1];
topScores[j - 1] = tempScore;
// Swap word IDs
int tempId = topWordIds[j];
topWordIds[j] = topWordIds[j - 1];
topWordIds[j - 1] = tempId;
} else {
break;
}
}
// Update min score
minScore = topScores[filledCount - 1];
}
}
}
// Create result list
List<ScoredSuggestion> results = new ArrayList<>(filledCount);
for (int i = 0; i < filledCount; i++) {
results.add(new ScoredSuggestion(words.get(topWordIds[i]), topScores[i]));
}
return results;
}
/**
* Use binary search on sorted word segments to efficiently find matches.
*/
private List<ScoredSuggestion> findByBinarySearchPrefix(String prefix, int k) {
// If we have a lot of words, use an optimized segment approach
if (words.size() > 1000) {
// Divide words into segments for better locality
int segmentSize = 1000;
int numSegments = (words.size() + segmentSize - 1) / segmentSize;
// Find matches using binary search within each segment
List<WordScorePair> allMatches = new ArrayList<>();
for (int segment = 0; segment < numSegments; segment++) {
int start = segment * segmentSize;
int end = Math.min(start + segmentSize, words.size());
// Binary search for first potential match
int pos = Collections.binarySearch(
words.subList(start, end),
prefix,
(a, b) -> a.compareTo(b)
);
if (pos < 0) {
pos = -pos - 1;
}
// Collect all matches
for (int i = start + pos; i < end && i < words.size(); i++) {
String word = words.get(i);
if (word.startsWith(prefix)) {
allMatches.add(new WordScorePair(word, wordScores.get(i)));
} else if (word.compareTo(prefix) > 0) {
break; // Past potential matches
}
}
}
// Sort by score and take top k
allMatches.sort((a, b) -> Integer.compare(b.score, a.score));
List<ScoredSuggestion> results = new ArrayList<>(Math.min(k, allMatches.size()));
for (int i = 0; i < Math.min(k, allMatches.size()); i++) {
WordScorePair pair = allMatches.get(i);
results.add(new ScoredSuggestion(pair.word, pair.score));
}
return results;
}
// Fallback for small dictionaries - linear scan but optimized
return simpleSearchFallback(prefix, k);
}
/**
* Optimized linear scan - only used for small dictionaries.
*/
private List<ScoredSuggestion> simpleSearchFallback(String prefix, int k) {
// Use primitive arrays for better cache locality
int[] matchScores = new int[Math.min(words.size(), 100)]; // Assume we won't find more than 100 matches
String[] matchWords = new String[matchScores.length];
int matchCount = 0;
for (int i = 0; i < words.size() && matchCount < matchScores.length; i++) {
String word = words.get(i);
if (word.startsWith(prefix)) {
matchWords[matchCount] = word;
matchScores[matchCount] = wordScores.get(i);
matchCount++;
}
}
// Sort matches by score (in-place for small arrays)
for (int i = 0; i < matchCount; i++) {
for (int j = i + 1; j < matchCount; j++) {
if (matchScores[j] > matchScores[i]) {
// Swap scores
int tempScore = matchScores[i];
matchScores[i] = matchScores[j];
matchScores[j] = tempScore;
// Swap words
String tempWord = matchWords[i];
matchWords[i] = matchWords[j];
matchWords[j] = tempWord;
}
}
}
// Create results
List<ScoredSuggestion> results = new ArrayList<>(Math.min(k, matchCount));
for (int i = 0; i < Math.min(k, matchCount); i++) {
results.add(new ScoredSuggestion(matchWords[i], matchScores[i]));
}
return results;
}
/**
* Get top k words from candidate IDs, filtering by the full prefix.
*/
private List<ScoredSuggestion> getFilteredTopKFromWordIds(TIntArrayList wordIds, String fullPrefix, int k) {
if (wordIds == null || wordIds.isEmpty()) {
return Collections.emptyList();
}
// Make primitive arrays for better performance
String[] matchWords = new String[Math.min(wordIds.size(), 1000)];
int[] matchScores = new int[matchWords.length];
int matchCount = 0;
int[] ids = wordIds.toArray();
for (int i = 0; i < ids.length && matchCount < matchWords.length; i++) {
int wordId = ids[i];
if (wordId >= 0 && wordId < words.size()) {
String word = words.get(wordId);
if (word.startsWith(fullPrefix)) {
matchWords[matchCount] = word;
matchScores[matchCount] = wordScores.get(wordId);
matchCount++;
}
}
}
// Sort by score (efficient insertion sort for small k)
for (int i = 0; i < Math.min(matchCount, k); i++) {
int maxPos = i;
for (int j = i + 1; j < matchCount; j++) {
if (matchScores[j] > matchScores[maxPos]) {
maxPos = j;
}
}
if (maxPos != i) {
// Swap
int tempScore = matchScores[i];
matchScores[i] = matchScores[maxPos];
matchScores[maxPos] = tempScore;
String tempWord = matchWords[i];
matchWords[i] = matchWords[maxPos];
matchWords[maxPos] = tempWord;
}
}
// Create result list (only up to k elements)
List<ScoredSuggestion> results = new ArrayList<>(Math.min(k, matchCount));
for (int i = 0; i < Math.min(k, matchCount); i++) {
results.add(new ScoredSuggestion(matchWords[i], matchScores[i]));
}
return results;
}
/**
* Class representing a suggested completion.
*/
public static class ScoredSuggestion implements Comparable<ScoredSuggestion> {
private final String word;
private final int score;
public ScoredSuggestion(String word, int score) {
this.word = word;
this.score = score;
}
public String getWord() {
return word;
}
public int getScore() {
return score;
}
@Override
public String toString() {
return word + " (" + score + ")";
}
@Override
public int compareTo(@NotNull PrefixSearchStructure.ScoredSuggestion o) {
return Integer.compare(this.score, o.score);
}
}
}
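A minimal usage sketch of the structure above: scores drive the ranking, and multi-word entries are additionally reachable through the internal t:-prefixed term index:

    PrefixSearchStructure index = new PrefixSearchStructure();
    index.insert("marginalia search", 100);
    index.insert("margarine recipes", 40);

    // Best-scoring completions first
    for (var suggestion : index.getTopCompletions("marg", 10)) {
        System.out.println(suggestion);  // e.g. "marginalia search (100)"
    }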


@@ -2,74 +2,89 @@ package nu.marginalia.assistant.suggest;
import com.google.inject.Inject; import com.google.inject.Inject;
import com.google.inject.name.Named; import com.google.inject.name.Named;
import nu.marginalia.functions.math.dict.SpellChecker;
import nu.marginalia.term_frequency_dict.TermFrequencyDict;
import nu.marginalia.model.crawl.HtmlFeature;
import org.apache.commons.collections4.trie.PatriciaTrie;
import org.apache.commons.lang3.StringUtils; import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
import java.io.BufferedInputStream;
import java.io.IOException; import java.io.IOException;
import java.nio.file.Files; import java.nio.file.Files;
import java.nio.file.Path; import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.*; import java.util.*;
import java.util.function.Supplier; import java.util.zip.GZIPInputStream;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class Suggestions { public class Suggestions {
private PatriciaTrie<String> suggestionsTrie = null; List<PrefixSearchStructure> searchStructures = new ArrayList<>();
private TermFrequencyDict termFrequencyDict = null;
private volatile boolean ready = false; private volatile boolean ready = false;
private final SpellChecker spellChecker;
private static final Pattern suggestionPattern = Pattern.compile("^[a-zA-Z0-9]+( [a-zA-Z0-9]+)*$");
private static final Logger logger = LoggerFactory.getLogger(Suggestions.class); private static final Logger logger = LoggerFactory.getLogger(Suggestions.class);
private static final int MIN_SUGGEST_LENGTH = 3; private static final int MIN_SUGGEST_LENGTH = 3;
@Inject @Inject
public Suggestions(@Named("suggestions-file") Path suggestionsFile, public Suggestions(@Named("suggestions-file1") Path suggestionsFile1,
SpellChecker spellChecker, @Named("suggestions-file2") Path suggestionsFile2
TermFrequencyDict dict
) { ) {
this.spellChecker = spellChecker;
Thread.ofPlatform().start(() -> { Thread.ofPlatform().start(() -> {
suggestionsTrie = loadSuggestions(suggestionsFile); searchStructures.add(loadSuggestions(suggestionsFile1));
termFrequencyDict = dict; searchStructures.add(loadSuggestions(suggestionsFile2));
ready = true; ready = true;
logger.info("Loaded {} suggestions", suggestionsTrie.size()); logger.info("Loaded suggestions");
}); });
} }
private static PatriciaTrie<String> loadSuggestions(Path file) { private static PrefixSearchStructure loadSuggestions(Path file) {
PrefixSearchStructure ret = new PrefixSearchStructure();
if (!Files.exists(file)) { if (!Files.exists(file)) {
logger.error("Suggestions file {} absent, loading empty suggestions db", file); logger.error("Suggestions file {} absent, loading empty suggestions db", file);
return new PatriciaTrie<>(); return ret;
} }
try (var lines = Files.lines(file)) {
var ret = new PatriciaTrie<String>();
lines.filter(suggestionPattern.asPredicate()) try (var scanner = new Scanner(new GZIPInputStream(new BufferedInputStream(Files.newInputStream(file, StandardOpenOption.READ))))) {
.filter(line -> line.length()<32) while (scanner.hasNextLine()) {
.map(String::toLowerCase) String line = scanner.nextLine().trim();
.forEach(w -> ret.put(w, w)); String[] parts = StringUtils.split(line, " ,", 2);
if (parts.length != 2) {
logger.warn("Invalid suggestion line: {}", line);
continue;
}
int cnt = Integer.parseInt(parts[0]);
if (cnt > 1) {
String word = parts[1];
// Add special keywords to the suggestions // Remove quotes and trailing periods if this is a CSV
for (var feature : HtmlFeature.values()) { if (word.startsWith("\"") && word.endsWith("\"")) {
String keyword = feature.getKeyword(); word = word.substring(1, word.length() - 1);
}
ret.put(keyword, keyword); // Remove trailing periods
ret.put("-" + keyword, "-" + keyword); while (word.endsWith(".")) {
word = word.substring(0, word.length() - 1);
}
// Remove junk items we may have gotten from link extraction
if (word.startsWith("click here"))
continue;
if (word.contains("new window"))
continue;
if (word.contains("click to"))
continue;
if (word.startsWith("share "))
continue;
if (word.length() > 3) {
ret.insert(word, cnt);
}
}
} }
return ret; return ret;
} }
catch (IOException ex) { catch (IOException ex) {
logger.error("Failed to load suggestions file", ex); logger.error("Failed to load suggestions file", ex);
return new PatriciaTrie<>(); return new PrefixSearchStructure();
} }
} }
@@ -83,96 +98,36 @@ public class Suggestions {
         searchWord = StringUtils.stripStart(searchWord.toLowerCase(), " ");

-        return Stream.of(
-                new SuggestionStream("", getSuggestionsForKeyword(count, searchWord)),
-                suggestionsForLastWord(count, searchWord),
-                spellCheckStream(searchWord)
-        )
-                .flatMap(SuggestionsStreamable::stream)
-                .limit(count)
-                .collect(Collectors.toList());
+        return getSuggestionsForKeyword(count, searchWord);
     }

-    private SuggestionsStreamable suggestionsForLastWord(int count, String searchWord) {
-        int sp = searchWord.lastIndexOf(' ');
-
-        if (sp < 0) {
-            return Stream::empty;
-        }
-
-        String prefixString = searchWord.substring(0, sp+1);
-        String suggestString = searchWord.substring(sp+1);
-
-        return new SuggestionStream(prefixString, getSuggestionsForKeyword(count, suggestString));
-    }
-
-    private SuggestionsStreamable spellCheckStream(String word) {
-        int start = word.lastIndexOf(' ');
-        String prefix;
-        String corrWord;
-
-        if (start < 0) {
-            corrWord = word;
-            prefix = "";
-        }
-        else {
-            prefix = word.substring(0, start + 1);
-            corrWord = word.substring(start + 1);
-        }
-
-        if (corrWord.length() >= MIN_SUGGEST_LENGTH) {
-            Supplier<Stream<String>> suggestionsLazyEval = () -> spellChecker.correct(corrWord).stream();
-            return new SuggestionStream(prefix, Stream.of(suggestionsLazyEval).flatMap(Supplier::get));
-        }
-        else {
-            return Stream::empty;
-        }
-    }
-
-    public Stream<String> getSuggestionsForKeyword(int count, String prefix) {
+    public List<String> getSuggestionsForKeyword(int count, String prefix) {
         if (!ready)
-            return Stream.empty();
+            return List.of();

         if (prefix.length() < MIN_SUGGEST_LENGTH) {
-            return Stream.empty();
+            return List.of();
         }

-        var start = suggestionsTrie.select(prefix);
-        if (start == null) {
-            return Stream.empty();
-        }
-
-        if (!start.getKey().startsWith(prefix)) {
-            return Stream.empty();
-        }
-
-        SuggestionsValueCalculator sv = new SuggestionsValueCalculator();
-
-        return Stream.iterate(start.getKey(), Objects::nonNull, suggestionsTrie::nextKey)
-                .takeWhile(s -> s.startsWith(prefix))
-                .limit(256)
-                .sorted(Comparator.comparing(sv::get).thenComparing(String::length).thenComparing(Comparator.naturalOrder()))
-                .limit(count);
+        List<PrefixSearchStructure.ScoredSuggestion> resultsAll = new ArrayList<>();
+        for (var searchStructure : searchStructures) {
+            resultsAll.addAll(searchStructure.getTopCompletions(prefix, count));
+        }
+        resultsAll.sort(Comparator.reverseOrder());
+
+        List<String> ret = new ArrayList<>(count);
+        Set<String> seen = new HashSet<>();
+        for (var result : resultsAll) {
+            if (seen.add(result.getWord())) {
+                ret.add(result.getWord());
+            }
+            if (ret.size() >= count) {
+                break;
+            }
+        }
+        return ret;
     }
-
-    private record SuggestionStream(String prefix, Stream<String> suggestionStream) implements SuggestionsStreamable {
-        public Stream<String> stream() {
-            return suggestionStream.map(s -> prefix + s);
-        }
-    }
-
-    interface SuggestionsStreamable { Stream<String> stream(); }
-
-    private class SuggestionsValueCalculator {
-
-        private final Map<String, Long> hashCache = new HashMap<>(512);
-
-        public int get(String s) {
-            long hash = hashCache.computeIfAbsent(s, TermFrequencyDict::getStringHash);
-            return -termFrequencyDict.getTermFreqHash(hash);
-        }
-    }
 }
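PrefixSearchStructure itself is not part of this diff. As a reading aid, here is a minimal sketch of the contract the code above relies on; only insert, getTopCompletions, ScoredSuggestion, and getWord are taken from the diff, and the TreeMap-backed implementation is an assumption, not the repository's actual data structure:

    import java.util.*;

    // Hypothetical sketch of the contract used by Suggestions; not the repo's implementation.
    class PrefixSearchStructure {
        // word -> accumulated count; a real trie would avoid the scan, this keeps the sketch short
        private final TreeMap<String, Integer> words = new TreeMap<>();

        public void insert(String word, int count) {
            words.merge(word, count, Integer::sum);
        }

        // Top-k completions for a prefix, highest count first.
        public List<ScoredSuggestion> getTopCompletions(String prefix, int k) {
            return words.tailMap(prefix).entrySet().stream()
                    .takeWhile(e -> e.getKey().startsWith(prefix))
                    .map(e -> new ScoredSuggestion(e.getKey(), e.getValue()))
                    .sorted(Comparator.reverseOrder())
                    .limit(k)
                    .toList();
        }

        // Ordered by score, so resultsAll.sort(Comparator.reverseOrder()) ranks best first.
        record ScoredSuggestion(String word, int score) implements Comparable<ScoredSuggestion> {
            public String getWord() { return word; }
            public int compareTo(ScoredSuggestion other) { return Integer.compare(score, other.score); }
        }
    }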

View File

@@ -2,7 +2,7 @@ plugins {
     id 'java'
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }

 java {

View File

@@ -59,9 +59,14 @@ public class ControlMain extends MainClass {
             download(adblockFile, new URI("https://downloads.marginalia.nu/data/adblock.txt"));
         }

-        Path suggestionsFile = dataPath.resolve("suggestions.txt");
+        Path suggestionsFile = dataPath.resolve("suggestions2.txt.gz");
         if (!Files.exists(suggestionsFile)) {
-            downloadGzipped(suggestionsFile, new URI("https://downloads.marginalia.nu/data/suggestions.txt.gz"));
+            download(suggestionsFile, new URI("https://downloads.marginalia.nu/data/suggestions2.txt.gz"));
+        }
+
+        Path altSuggestionsFile = dataPath.resolve("suggestions3.txt.gz");
+        if (!Files.exists(altSuggestionsFile)) {
+            download(altSuggestionsFile, new URI("https://downloads.marginalia.nu/data/suggestions3.txt.gz"));
         }

         Path asnRawData = dataPath.resolve("asn-data-raw-table");

View File

@@ -321,9 +321,10 @@ public class ControlNodeActionsService {
     private Object exportSampleData(Request req, Response rsp) {
         FileStorageId source = parseSourceFileStorageId(req.queryParams("source"));
         int size = Integer.parseInt(req.queryParams("size"));
+        String ctFilter = req.queryParams("ctFilter");
         String name = req.queryParams("name");

-        exportClient.exportSampleData(Integer.parseInt(req.params("id")), source, size, name);
+        exportClient.exportSampleData(Integer.parseInt(req.params("id")), source, size, ctFilter, name);

         return "";
     }
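A hedged usage sketch of the extended export call; the node id, sample size, and filter value here are hypothetical, with the content-type filter matching the new form field described further down:

    // Hypothetical values: node 1, a 1000-domain sample, HTML documents only.
    exportClient.exportSampleData(1, source, 1000, "text/html", "html-only-sample");

Leaving ctFilter blank presumably exports all content types, per the form's "If set" wording.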

View File

@@ -24,25 +24,25 @@ This is a sample of real crawl data. It is intended for demo, testing and devel
 <tr>
     <td><input id="sample-s" value="sample-s" name="sample" class="form-check-input" type="radio"></td>
     <td><label for="sample-s">Small</label></td>
-    <td>1000 Domains. About 2 GB. </td>
+    <td>1000 Domains. About 1 GB. </td>
 </tr>
 <tr>
     <td><input id="sample-m" value="sample-m" name="sample" class="form-check-input" type="radio"></td>
     <td><label for="sample-m">Medium</label></td>
-    <td>2000 Domains. About 6 GB. Recommended.</td>
+    <td>2000 Domains. About 2 GB. Recommended.</td>
 </tr>
 <tr>
     <td><input id="sample-l" value="sample-l" name="sample" class="form-check-input" type="radio"></td>
     <td><label for="sample-l">Large</label></td>
-    <td>5000 Domains. About 20 GB.</td>
+    <td>5000 Domains. About 7 GB.</td>
 </tr>
 <tr>
     <td><input id="sample-xl" value="sample-xl" name="sample" class="form-check-input" type="radio"></td>
     <td><label for="sample-xl">Huge</label></td>
-    <td>50,000 Domains. Around 180 GB. Primarily intended for pre-production like testing environments.
+    <td>50,000 Domains. Around 80 GB. Primarily intended for pre-production like testing environments.
         Expect hours of processing time. </td>
 </tr>
 </table>

View File

@@ -35,6 +35,11 @@
<div><input type="text" name="size" id="size" pattern="\d+" /></div> <div><input type="text" name="size" id="size" pattern="\d+" /></div>
<small class="text-muted">How many domains to include in the sample set</small> <small class="text-muted">How many domains to include in the sample set</small>
</div> </div>
<div class="mb-3">
<label for="ctFilter">Content Type Filter</label>
<div><input type="text" name="ctFilter" id="ctFilter" /></div>
<small class="text-muted">If set, includes only documents with the specified content type value</small>
</div>
<div class="mb-3"> <div class="mb-3">
<label for="name">Name</label> <label for="name">Name</label>
<div><input type="text" name="name" id="name" /></div> <div><input type="text" name="name" id="name" /></div>

View File

@@ -3,7 +3,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }

 application {

View File

@@ -3,7 +3,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }

 application {

View File

@@ -3,7 +3,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }

 application {

View File

@@ -10,6 +10,7 @@ import nu.marginalia.api.searchquery.model.results.PrototypeRankingParameters;
 import nu.marginalia.converting.processor.DomainProcessor;
 import nu.marginalia.converting.writer.ConverterBatchWriter;
 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import nu.marginalia.functions.searchquery.QueryFactory;
@@ -43,7 +44,6 @@ import nu.marginalia.process.control.FakeProcessHeartbeat;
 import nu.marginalia.storage.FileStorageService;
 import nu.marginalia.test.IntegrationTestModule;
 import nu.marginalia.test.TestUtil;
-import org.apache.hc.client5.http.cookie.BasicCookieStore;
 import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
@@ -121,11 +121,12 @@ public class IntegrationTest {
     public void run() throws Exception {

         /** CREATE WARC */
-        try (WarcRecorder warcRecorder = new WarcRecorder(warcData, new BasicCookieStore())) {
+        try (WarcRecorder warcRecorder = new WarcRecorder(warcData)) {
             warcRecorder.writeWarcinfoHeader("127.0.0.1", new EdgeDomain("www.example.com"),
                     new HttpFetcherImpl.DomainProbeResult.Ok(new EdgeUrl("https://www.example.com/")));

             warcRecorder.writeReferenceCopy(new EdgeUrl("https://www.example.com/"),
+                    new DomainCookies(),
                     "text/html", 200,
                     """
                     <html>

View File

@@ -3,7 +3,7 @@ plugins {
     id 'application'
     id 'jvm-test-suite'
-    id 'com.google.cloud.tools.jib' version '3.4.4'
+    id 'com.google.cloud.tools.jib' version '3.4.5'
 }

 java {

View File

@@ -1,4 +1,6 @@
-## This is a token file for automatic deployment
+## This is a token file for triggering automatic deployment when no commit is made.
 2025-01-08: Deploy executor.
 2025-01-07: Deploy executor.
+2025-04-24: Deploy executor.
+2025-04-24: Deploy assistant.

View File

@@ -1,5 +1,5 @@
 distributionBase=GRADLE_USER_HOME
 distributionPath=wrapper/dists
-distributionUrl=https\://services.gradle.org/distributions/gradle-8.10-bin.zip
+distributionUrl=https\://services.gradle.org/distributions/gradle-8.14-bin.zip
 zipStoreBase=GRADLE_USER_HOME
 zipStorePath=wrapper/dists