1
1
mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

Compare commits

...

41 Commits

Author SHA1 Message Date
Viktor Lofgren
18700e1919 (sample) Fix bug where slop files would not be saved despite containing data 2025-05-06 13:38:21 +02:00
Viktor Lofgren
120b431998 (crawler) Fix outdated assumptions about content types and http status codes always being 200 when good.
We now sometimes get 206 when good.
2025-05-06 13:18:30 +02:00
Viktor Lofgren
71dad99326 (crawler) Revisitor should not demand a 200, but support a 206 as well 2025-05-06 13:11:52 +02:00
Viktor Lofgren
c1e8afdf86 (crawler) Remove domains from pending crawl tasks queue when retrying 2025-05-06 12:56:30 +02:00
Viktor Lofgren
fa32dddc24 (sample-actor) Make content type matching lenient with regard to ct parameters such as charset 2025-05-06 12:48:09 +02:00
Viktor Lofgren
a266fcbf30 (sample-actor) Clean up debris from previous runs to avoid errors on re-runs 2025-05-05 13:16:37 +02:00
Viktor Lofgren
6e47e58e0e (sample-actor) Add progress tracking to sample export actor 2025-05-05 13:04:14 +02:00
Viktor Lofgren
9dc43d8b4a (sample-actor) Update the actor export sample actor to not generate empty files when the filter is not applicable. 2025-05-05 12:56:12 +02:00
Viktor Lofgren
83967e3305 (sample-actor) Update the actor export sample actor to not generate empty files when the filter is not applicable. 2025-05-05 12:50:21 +02:00
Viktor Lofgren
4db980a291 (jooby-service) Set an upper limit on the number of worker threads 2025-05-05 12:40:31 +02:00
Viktor Lofgren
089b177868 (deploy) Executor partition 4. 2025-05-05 12:21:27 +02:00
Viktor Lofgren
9c8e9a68d5 (deploy) Executor partition 4. 2025-05-05 12:00:05 +02:00
Viktor Lofgren
413d5cc788 (url, minor) Fix typo in test 2025-05-04 16:28:30 +02:00
Viktor Lofgren
58539b92ac (search) Don't show addresses with URLencoding in the UI 2025-05-04 16:26:39 +02:00
Viktor Lofgren
fe72f16df1 (url) Add additional tests for parameter handling 2025-05-04 16:23:39 +02:00
Viktor Lofgren
b49a244a2e (url) Fix encoding handling of query parameters 2025-05-04 16:18:47 +02:00
Viktor Lofgren
3f0b4c010f (deploy) Fix deploy script to be aware of the status service 2025-05-04 16:14:07 +02:00
Viktor Lofgren
c6e0cd93f7 (status) Fix status service to poll the new domain 2025-05-04 16:11:08 +02:00
Viktor Lofgren
80a7ccb080 Trigger redeploy of qs, search and api 2025-05-04 16:07:28 +02:00
Viktor Lofgren
54dec347c4 (url) Fix urlencoding issues with certain symbols
Optimize the code by adding a simple heuristic for guessing whether we need to repair the URI before we pass it to Java's parser.
2025-05-04 13:39:39 +02:00
Viktor Lofgren
d6ee3f0785 (url) Fix urlencoding issues with certain symbols
The urlencoding logic would consider the need to urlencode on an element basis, which is incorrect.  Even if we urlencode on an element basis, we should either urlencode or not urlencode, never a mix of the two.
2025-05-04 13:08:49 +02:00
Viktor Lofgren
8be88afcf3 (url) Fix urlencoding issues with certain symbols
We also need to apply the fix when performing toString() on the EdgeUrl, the URI class will URLDecode the input.

The change also alters the parseURI method to only run the URLEncode-fixer during parsing if URI doesn't throw an exception.  This bad path is obviously going to be slower, but realistically, most URLs are valid, so it's probably a significant optimization to do it like this.
2025-05-04 12:58:13 +02:00
Viktor Lofgren
0e3c00d3e1 (url) Fix urlencoding issues with certain symbols
Minor fix of issue where url sanitizer would strip some trailing slashes.
2025-05-03 23:58:28 +02:00
Viktor Lofgren
4279a7f1aa (url) Fix urlencoding issues with certain symbols
Minor fix with previously urlencoded codepoints, we need to account for the fact that they are encoded in hexadecimal.
2025-05-03 23:51:39 +02:00
Viktor Lofgren
251006d4f9 (url) Fix urlencoding issues with certain symbols
Problems primarily cropped up with sideloaded wikipedia articles, though the search engine has been returning inconsistently URLEncoded search results for a while, though browsers and servers have seemingly magically fixed the issues in many scenarios.

This addresses Issue #195 and Issue #131.
2025-05-03 23:48:45 +02:00
Viktor Lofgren
c3e99dc12a (service) Limit logging from ad hoc task heartbeats
Certain usage patterns of the ad hoc task heartbeats would lead to an incredible amount of log noise, as it would log each update.

Limit log updates to increments of 10% to avoid this problem.
2025-05-03 12:39:58 +02:00
Viktor
aaaa2de022 Merge pull request #196 from MarginaliaSearch/filter-export-sample-data
Add the ability to filter sample data based on content type
2025-05-02 13:23:49 +02:00
Viktor Lofgren
fc1388422a (actor) Add the ability to filter sample data based on content type
This will help in extracting relevant test sets for PDF processing.
2025-05-02 13:09:22 +02:00
Viktor Lofgren
b07080db16 (crawler) Don't retry requests when encountering UnknownHostException 2025-05-01 16:07:34 +02:00
Viktor Lofgren
e9d86dca4a (crawler) Add timeout to wrap-up phase of WarcInputBuffer. 2025-05-01 15:57:47 +02:00
Viktor Lofgren
1d693f0efa (build) Upgrade JIB to 3.4.5 2025-04-30 15:26:52 +02:00
Viktor Lofgren
5874a163dc (build) Upgrade gradle to 8.14 2025-04-30 15:26:37 +02:00
Viktor Lofgren
5ec7a1deab (crawler) Fix 80%-ish progress crawler stall
Since the crawl tasks are started in two phases, first when generating them in one loop, and then in a second loop that drains the task list; if the first loop contains a long-running crawl task that is triggered late, the rest of the crawl may halt until that task is finish.

Fixed the problem by draining and re-trying also in the first loop.
2025-04-29 12:23:51 +02:00
Viktor Lofgren
7fea2808ed (search) Fix error view
Fix rendering error when query was null

Fix border on error message.
2025-04-27 12:12:56 +02:00
Viktor Lofgren
8da74484f0 (search) Remove unused count modifier from the footer help 2025-04-27 12:08:34 +02:00
Viktor Lofgren
923d5a7234 (search) Add a note for TUI users pointing them to the old UI 2025-04-27 11:52:07 +02:00
Viktor Lofgren
58f88749b8 (deploy) assistant 2025-04-25 13:25:50 +02:00
Viktor Lofgren
77f727a5ba (crawler) Alter conditional request logic to avoid sending both If-None-Match and If-Modified-Since
It seems like some servers dislike this combination, and may turn a 304 into a 200.
2025-04-25 13:19:07 +02:00
Viktor Lofgren
667cfb53dc (assistant) Remove more link text junk from suggestions at loadtime. 2025-04-24 13:35:29 +02:00
Viktor Lofgren
fe36d4ed20 (deploy) Executor services 2025-04-24 13:23:51 +02:00
Viktor Lofgren
acf4bef98d (assistant) Improve search suggestions
Improve suggestions by loading a secondary suggestions set with link text data.
2025-04-24 13:10:59 +02:00
50 changed files with 606 additions and 208 deletions

View File

@@ -5,7 +5,7 @@ plugins {
// This is a workaround for a bug in the Jib plugin that causes it to stall randomly // This is a workaround for a bug in the Jib plugin that causes it to stall randomly
// https://github.com/GoogleContainerTools/jib/issues/3347 // https://github.com/GoogleContainerTools/jib/issues/3347
id 'com.google.cloud.tools.jib' version '3.4.4' apply(false) id 'com.google.cloud.tools.jib' version '3.4.5' apply(false)
} }
group 'marginalia' group 'marginalia'
@@ -47,7 +47,7 @@ ext {
dockerImageBase='container-registry.oracle.com/graalvm/jdk:24' dockerImageBase='container-registry.oracle.com/graalvm/jdk:24'
dockerImageTag='latest' dockerImageTag='latest'
dockerImageRegistry='marginalia' dockerImageRegistry='marginalia'
jibVersion = '3.4.4' jibVersion = '3.4.5'
} }
idea { idea {

View File

@@ -1,16 +1,14 @@
package nu.marginalia.model; package nu.marginalia.model;
import nu.marginalia.util.QueryParams; import nu.marginalia.util.QueryParams;
import org.apache.commons.lang3.StringUtils;
import javax.annotation.Nullable; import javax.annotation.Nullable;
import java.io.Serializable; import java.io.Serializable;
import java.net.MalformedURLException; import java.net.*;
import java.net.URI; import java.nio.charset.StandardCharsets;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Objects; import java.util.Objects;
import java.util.Optional; import java.util.Optional;
import java.util.regex.Pattern;
public class EdgeUrl implements Serializable { public class EdgeUrl implements Serializable {
public final String proto; public final String proto;
@@ -33,7 +31,7 @@ public class EdgeUrl implements Serializable {
private static URI parseURI(String url) throws URISyntaxException { private static URI parseURI(String url) throws URISyntaxException {
try { try {
return new URI(urlencodeFixer(url)); return EdgeUriFactory.parseURILenient(url);
} catch (URISyntaxException ex) { } catch (URISyntaxException ex) {
throw new URISyntaxException("Failed to parse URI '" + url + "'", ex.getMessage()); throw new URISyntaxException("Failed to parse URI '" + url + "'", ex.getMessage());
} }
@@ -51,58 +49,6 @@ public class EdgeUrl implements Serializable {
} }
} }
private static Pattern badCharPattern = Pattern.compile("[ \t\n\"<>\\[\\]()',|]");
/* Java's URI parser is a bit too strict in throwing exceptions when there's an error.
Here on the Internet, standards are like the picture on the box of the frozen pizza,
and what you get is more like what's on the inside, we try to patch things instead,
just give it a best-effort attempt att cleaning out broken or unnecessary constructions
like bad or missing URLEncoding
*/
public static String urlencodeFixer(String url) throws URISyntaxException {
var s = new StringBuilder();
String goodChars = "&.?:/-;+$#";
String hexChars = "0123456789abcdefABCDEF";
int pathIdx = findPathIdx(url);
if (pathIdx < 0) { // url looks like http://marginalia.nu
return url + "/";
}
s.append(url, 0, pathIdx);
// We don't want the fragment, and multiple fragments breaks the Java URIParser for some reason
int end = url.indexOf("#");
if (end < 0) end = url.length();
for (int i = pathIdx; i < end; i++) {
int c = url.charAt(i);
if (goodChars.indexOf(c) >= 0 || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
s.appendCodePoint(c);
} else if (c == '%' && i + 2 < end) {
int cn = url.charAt(i + 1);
int cnn = url.charAt(i + 2);
if (hexChars.indexOf(cn) >= 0 && hexChars.indexOf(cnn) >= 0) {
s.appendCodePoint(c);
} else {
s.append("%25");
}
} else {
s.append(String.format("%%%02X", c));
}
}
return s.toString();
}
private static int findPathIdx(String url) throws URISyntaxException {
int colonIdx = url.indexOf(':');
if (colonIdx < 0 || colonIdx + 2 >= url.length()) {
throw new URISyntaxException(url, "Lacking protocol");
}
return url.indexOf('/', colonIdx + 2);
}
public EdgeUrl(URI URI) { public EdgeUrl(URI URI) {
try { try {
@@ -166,11 +112,32 @@ public class EdgeUrl implements Serializable {
sb.append(port); sb.append(port);
} }
EdgeUriFactory.urlencodePath(sb, path);
if (param != null) {
EdgeUriFactory.urlencodeQuery(sb, param);
}
return sb.toString();
}
public String toDisplayString() {
StringBuilder sb = new StringBuilder(256);
sb.append(proto);
sb.append("://");
sb.append(domain);
if (port != null) {
sb.append(':');
sb.append(port);
}
sb.append(path); sb.append(path);
if (param != null) { if (param != null) {
sb.append('?'); sb.append('?').append(param);
sb.append(param);
} }
return sb.toString(); return sb.toString();
@@ -247,3 +214,244 @@ public class EdgeUrl implements Serializable {
} }
} }
class EdgeUriFactory {
public static URI parseURILenient(String url) throws URISyntaxException {
if (shouldOmitUrlencodeRepair(url)) {
try {
return new URI(url);
}
catch (URISyntaxException ex) {
// ignore and run the lenient parser
}
}
var s = new StringBuilder(url.length()+8);
int pathIdx = findPathIdx(url);
if (pathIdx < 0) { // url looks like http://marginalia.nu
return new URI(url + "/");
}
s.append(url, 0, pathIdx);
// We don't want the fragment, and multiple fragments breaks the Java URIParser for some reason
int end = url.indexOf("#");
if (end < 0) end = url.length();
int queryIdx = url.indexOf('?');
if (queryIdx < 0) queryIdx = end;
urlencodePath(s, url.substring(pathIdx, queryIdx));
if (queryIdx < end) {
urlencodeQuery(s, url.substring(queryIdx + 1, end));
}
return new URI(s.toString());
}
/** Break apart the path element of an URI into its components, and then
* urlencode any component that needs it, and recombine it into a single
* path element again.
*/
public static void urlencodePath(StringBuilder sb, String path) {
if (path == null || path.isEmpty()) {
return;
}
String[] pathParts = StringUtils.split(path, '/');
if (pathParts.length == 0) {
sb.append('/');
return;
}
boolean shouldUrlEncode = false;
for (String pathPart : pathParts) {
if (pathPart.isEmpty()) continue;
if (needsUrlEncode(pathPart)) {
shouldUrlEncode = true;
break;
}
}
for (String pathPart : pathParts) {
if (pathPart.isEmpty()) continue;
if (shouldUrlEncode) {
sb.append('/');
sb.append(URLEncoder.encode(pathPart, StandardCharsets.UTF_8).replace("+", "%20"));
} else {
sb.append('/');
sb.append(pathPart);
}
}
if (path.endsWith("/")) {
sb.append('/');
}
}
/** Break apart the query element of a URI into its components, and then
* urlencode any component that needs it, and recombine it into a single
* query element again.
*/
public static void urlencodeQuery(StringBuilder sb, String param) {
if (param == null || param.isEmpty()) {
return;
}
String[] queryParts = StringUtils.split(param, '&');
boolean shouldUrlEncode = false;
for (String queryPart : queryParts) {
if (queryPart.isEmpty()) continue;
if (needsUrlEncode(queryPart)) {
shouldUrlEncode = true;
break;
}
}
boolean first = true;
for (String queryPart : queryParts) {
if (queryPart.isEmpty()) continue;
if (first) {
sb.append('?');
first = false;
} else {
sb.append('&');
}
if (shouldUrlEncode) {
int idx = queryPart.indexOf('=');
if (idx < 0) {
sb.append(URLEncoder.encode(queryPart, StandardCharsets.UTF_8));
} else {
sb.append(URLEncoder.encode(queryPart.substring(0, idx), StandardCharsets.UTF_8));
sb.append('=');
sb.append(URLEncoder.encode(queryPart.substring(idx + 1), StandardCharsets.UTF_8));
}
} else {
sb.append(queryPart);
}
}
}
/** Test if the url element needs URL encoding.
* <p></p>
* Note we may have been given an already encoded path element,
* so we include % and + in the list of good characters
*/
static boolean needsUrlEncode(String urlElement) {
for (int i = 0; i < urlElement.length(); i++) {
char c = urlElement.charAt(i);
if (isUrlSafe(c)) continue;
if ("+".indexOf(c) >= 0) continue;
if (c == '%' && i + 2 < urlElement.length()) {
char c1 = urlElement.charAt(i + 1);
char c2 = urlElement.charAt(i + 2);
if (isHexDigit(c1) && isHexDigit(c2)) {
i += 2;
continue;
}
}
return true;
}
return false;
}
static boolean isUrlSafe(int c) {
if (c >= 'a' && c <= 'z') return true;
if (c >= 'A' && c <= 'Z') return true;
if (c >= '0' && c <= '9') return true;
if (c == '-' || c == '_' || c == '.' || c == '~') return true;
return false;
}
/** Test if the URL is a valid URL that does not need to be
* urlencoded.
* <p></p>
* This is a very simple heuristic test that does not guarantee
* that the URL is valid, but it will identify cases where we
* are fairly certain that the URL does not need encoding,
* so we can skip a bunch of allocations and string operations
* that would otherwise be needed to fix the URL.
*/
static boolean shouldOmitUrlencodeRepair(String url) {
int idx = 0;
final int len = url.length();
// Validate the scheme
while (idx < len - 2) {
char c = url.charAt(idx++);
if (c == ':') break;
if (!isAsciiAlphabetic(c)) return false;
}
if (url.charAt(idx++) != '/') return false;
if (url.charAt(idx++) != '/') return false;
// Validate the authority
while (idx < len) {
char c = url.charAt(idx++);
if (c == '/') break;
if (c == ':') continue;
if (c == '@') continue;
if (!isUrlSafe(c)) return false;
}
// Validate the path
if (idx >= len) return true;
while (idx < len) {
char c = url.charAt(idx++);
if (c == '?') break;
if (c == '/') continue;
if (c == '#') return true;
if (!isUrlSafe(c)) return false;
}
if (idx >= len) return true;
// Validate the query
while (idx < len) {
char c = url.charAt(idx++);
if (c == '&') continue;
if (c == '=') continue;
if (c == '#') return true;
if (!isUrlSafe(c)) return false;
}
return true;
}
private static boolean isAsciiAlphabetic(int c) {
return (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}
private static boolean isHexDigit(int c) {
return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}
/** Find the index of the path element in a URL.
* <p></p>
* The path element starts after the scheme and authority part of the URL,
* which is everything up to and including the first slash after the colon.
*/
private static int findPathIdx(String url) throws URISyntaxException {
int colonIdx = url.indexOf(':');
if (colonIdx < 0 || colonIdx + 3 >= url.length()) {
throw new URISyntaxException(url, "Lacking scheme");
}
return url.indexOf('/', colonIdx + 3);
}
}

View File

@@ -1,6 +1,6 @@
package nu.marginalia.model; package nu.marginalia.model;
import nu.marginalia.model.EdgeUrl; import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test; import org.junit.jupiter.api.Test;
import java.net.URISyntaxException; import java.net.URISyntaxException;
@@ -21,25 +21,70 @@ class EdgeUrlTest {
new EdgeUrl("https://memex.marginalia.nu/#here") new EdgeUrl("https://memex.marginalia.nu/#here")
); );
} }
@Test @Test
public void testParam() throws URISyntaxException { void testUriFromString() throws URISyntaxException {
System.out.println(new EdgeUrl("https://memex.marginalia.nu/index.php?id=1").toString()); // We test these URLs several times as we perform URLEncode-fixing both when parsing the URL and when
System.out.println(new EdgeUrl("https://memex.marginalia.nu/showthread.php?id=1&count=5&tracking=123").toString()); // converting it back to a string, we want to ensure there is no changes along the way.
}
@Test Assertions.assertEquals("/", EdgeUriFactory.parseURILenient("https://www.example.com/").getPath());
void urlencodeFixer() throws URISyntaxException { Assertions.assertEquals("https://www.example.com/", EdgeUriFactory.parseURILenient("https://www.example.com/").toString());
System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/#heredoc")); Assertions.assertEquals("https://www.example.com/", new EdgeUrl("https://www.example.com/").toString());
System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/%-sign"));
System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/%22-sign")); Assertions.assertEquals("/", EdgeUriFactory.parseURILenient("https://www.example.com/#heredoc").getPath());
System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/\n \"huh\"")); Assertions.assertEquals("https://www.example.com/", EdgeUriFactory.parseURILenient("https://www.example.com/#heredoc").toString());
Assertions.assertEquals("https://www.example.com/", new EdgeUrl("https://www.example.com/#heredoc").toString());
Assertions.assertEquals("/trailingslash/", EdgeUriFactory.parseURILenient("https://www.example.com/trailingslash/").getPath());
Assertions.assertEquals("https://www.example.com/trailingslash/", EdgeUriFactory.parseURILenient("https://www.example.com/trailingslash/").toString());
Assertions.assertEquals("https://www.example.com/trailingslash/", new EdgeUrl("https://www.example.com/trailingslash/").toString());
Assertions.assertEquals("/%-sign", EdgeUriFactory.parseURILenient("https://www.example.com/%-sign").getPath());
Assertions.assertEquals("https://www.example.com/%25-sign", EdgeUriFactory.parseURILenient("https://www.example.com/%-sign").toString());
Assertions.assertEquals("https://www.example.com/%25-sign", new EdgeUrl("https://www.example.com/%-sign").toString());
Assertions.assertEquals("/%-sign/\"-sign", EdgeUriFactory.parseURILenient("https://www.example.com//%-sign/\"-sign").getPath());
Assertions.assertEquals("https://www.example.com/%25-sign/%22-sign", EdgeUriFactory.parseURILenient("https://www.example.com//%-sign/\"-sign").toString());
Assertions.assertEquals("https://www.example.com/%25-sign/%22-sign", new EdgeUrl("https://www.example.com//%-sign/\"-sign").toString());
Assertions.assertEquals("/\"-sign", EdgeUriFactory.parseURILenient("https://www.example.com/%22-sign").getPath());
Assertions.assertEquals("https://www.example.com/%22-sign", EdgeUriFactory.parseURILenient("https://www.example.com/%22-sign").toString());
Assertions.assertEquals("https://www.example.com/%22-sign", new EdgeUrl("https://www.example.com/%22-sign").toString());
Assertions.assertEquals("/\n \"huh\"", EdgeUriFactory.parseURILenient("https://www.example.com/\n \"huh\"").getPath());
Assertions.assertEquals("https://www.example.com/%0A%20%22huh%22", EdgeUriFactory.parseURILenient("https://www.example.com/\n \"huh\"").toString());
Assertions.assertEquals("https://www.example.com/%0A%20%22huh%22", new EdgeUrl("https://www.example.com/\n \"huh\"").toString());
Assertions.assertEquals("/wiki/Sámi", EdgeUriFactory.parseURILenient("https://en.wikipedia.org/wiki/Sámi").getPath());
Assertions.assertEquals("https://en.wikipedia.org/wiki/S%C3%A1mi", EdgeUriFactory.parseURILenient("https://en.wikipedia.org/wiki/Sámi").toString());
Assertions.assertEquals("https://en.wikipedia.org/wiki/S%C3%A1mi", new EdgeUrl("https://en.wikipedia.org/wiki/Sámi").toString());
Assertions.assertEquals("https://www.prijatelji-zivotinja.hr/index.en.php?id=2301k", new EdgeUrl("https://www.prijatelji-zivotinja.hr/index.en.php?id=2301k").toString());
} }
@Test @Test
void testParms() throws URISyntaxException { void testParms() throws URISyntaxException {
System.out.println(new EdgeUrl("https://search.marginalia.nu/?id=123")); Assertions.assertEquals("id=123", new EdgeUrl("https://search.marginalia.nu/?id=123").param);
System.out.println(new EdgeUrl("https://search.marginalia.nu/?t=123")); Assertions.assertEquals("https://search.marginalia.nu/?id=123", new EdgeUrl("https://search.marginalia.nu/?id=123").toString());
System.out.println(new EdgeUrl("https://search.marginalia.nu/?v=123"));
System.out.println(new EdgeUrl("https://search.marginalia.nu/?m=123")); Assertions.assertEquals("t=123", new EdgeUrl("https://search.marginalia.nu/?t=123").param);
System.out.println(new EdgeUrl("https://search.marginalia.nu/?follow=123")); Assertions.assertEquals("https://search.marginalia.nu/?t=123", new EdgeUrl("https://search.marginalia.nu/?t=123").toString());
Assertions.assertEquals("v=123", new EdgeUrl("https://search.marginalia.nu/?v=123").param);
Assertions.assertEquals("https://search.marginalia.nu/?v=123", new EdgeUrl("https://search.marginalia.nu/?v=123").toString());
Assertions.assertEquals("id=1", new EdgeUrl("https://memex.marginalia.nu/showthread.php?id=1&count=5&tracking=123").param);
Assertions.assertEquals("https://memex.marginalia.nu/showthread.php?id=1",
new EdgeUrl("https://memex.marginalia.nu/showthread.php?id=1&count=5&tracking=123").toString());
Assertions.assertEquals("id=1&t=5", new EdgeUrl("https://memex.marginalia.nu/shöwthrëad.php?id=1&t=5&tracking=123").param);
Assertions.assertEquals("https://memex.marginalia.nu/sh%C3%B6wthr%C3%ABad.php?id=1&t=5", new EdgeUrl("https://memex.marginalia.nu/shöwthrëad.php?id=1&t=5&tracking=123").toString());
Assertions.assertEquals("id=1&t=5", new EdgeUrl("https://memex.marginalia.nu/shöwthrëad.php?trëaking=123&id=1&t=5&").param);
Assertions.assertEquals("https://memex.marginalia.nu/sh%C3%B6wthr%C3%ABad.php?id=1&t=5", new EdgeUrl("https://memex.marginalia.nu/shöwthrëad.php?trëaking=123&id=1&t=5&").toString());
Assertions.assertNull(new EdgeUrl("https://search.marginalia.nu/?m=123").param);
Assertions.assertNull(new EdgeUrl("https://search.marginalia.nu/?follow=123").param);
} }
} }

View File

@@ -59,16 +59,13 @@ public class ProcessAdHocTaskHeartbeatImpl implements AutoCloseable, ProcessAdHo
*/ */
@Override @Override
public void progress(String step, int stepProgress, int stepCount) { public void progress(String step, int stepProgress, int stepCount) {
int lastProgress = this.progress;
this.step = step; this.step = step;
// off by one since we calculate the progress based on the number of steps,
// and Enum.ordinal() is zero-based (so the 5th step in a 5 step task is 4, not 5; resulting in the
// final progress being 80% and not 100%)
this.progress = (int) Math.round(100. * stepProgress / (double) stepCount); this.progress = (int) Math.round(100. * stepProgress / (double) stepCount);
logger.info("ProcessTask {} progress: {}%", taskBase, progress); if (this.progress / 10 != lastProgress / 10) {
logger.info("ProcessTask {} progress: {}%", taskBase, progress);
}
} }
/** Wrap a collection to provide heartbeat progress updates as it's iterated through */ /** Wrap a collection to provide heartbeat progress updates as it's iterated through */

View File

@@ -57,16 +57,13 @@ public class ServiceAdHocTaskHeartbeatImpl implements AutoCloseable, ServiceAdHo
*/ */
@Override @Override
public void progress(String step, int stepProgress, int stepCount) { public void progress(String step, int stepProgress, int stepCount) {
int lastProgress = this.progress;
this.step = step; this.step = step;
// off by one since we calculate the progress based on the number of steps,
// and Enum.ordinal() is zero-based (so the 5th step in a 5 step task is 4, not 5; resulting in the
// final progress being 80% and not 100%)
this.progress = (int) Math.round(100. * stepProgress / (double) stepCount); this.progress = (int) Math.round(100. * stepProgress / (double) stepCount);
logger.info("ServiceTask {} progress: {}%", taskBase, progress); if (this.progress / 10 != lastProgress / 10) {
logger.info("ProcessTask {} progress: {}%", taskBase, progress);
}
} }
public void shutDown() { public void shutDown() {

View File

@@ -122,6 +122,11 @@ public class JoobyService {
// single digit percentage difference since HTML already compresses very well with level = 1. // single digit percentage difference since HTML already compresses very well with level = 1.
options.setCompressionLevel(1); options.setCompressionLevel(1);
// Set a cap on the number of worker threads, as Jooby's default value does not seem to consider
// multi-tenant servers with high thread counts, and spins up an exorbitant number of threads in that
// scenario
options.setWorkerThreads(Math.min(128, options.getWorkerThreads()));
jooby.setServerOptions(options); jooby.setServerOptions(options);

View File

@@ -48,12 +48,13 @@ public class ExecutorExportClient {
return msgId; return msgId;
} }
public void exportSampleData(int node, FileStorageId fid, int size, String name) { public void exportSampleData(int node, FileStorageId fid, int size, String ctFilter, String name) {
channelPool.call(ExecutorExportApiBlockingStub::exportSampleData) channelPool.call(ExecutorExportApiBlockingStub::exportSampleData)
.forNode(node) .forNode(node)
.run(RpcExportSampleData.newBuilder() .run(RpcExportSampleData.newBuilder()
.setFileStorageId(fid.id()) .setFileStorageId(fid.id())
.setSize(size) .setSize(size)
.setCtFilter(ctFilter)
.setName(name) .setName(name)
.build()); .build());
} }

View File

@@ -100,6 +100,7 @@ message RpcExportSampleData {
int64 fileStorageId = 1; int64 fileStorageId = 1;
int32 size = 2; int32 size = 2;
string name = 3; string name = 3;
string ctFilter = 4;
} }
message RpcDownloadSampleData { message RpcDownloadSampleData {
string sampleSet = 1; string sampleSet = 1;

View File

@@ -26,32 +26,32 @@ public class ExportSampleDataActor extends RecordActorPrototype {
private final MqOutbox exportTasksOutbox; private final MqOutbox exportTasksOutbox;
private final Logger logger = LoggerFactory.getLogger(getClass()); private final Logger logger = LoggerFactory.getLogger(getClass());
public record Export(FileStorageId crawlId, int size, String name) implements ActorStep {} public record Export(FileStorageId crawlId, int size, String ctFilter, String name) implements ActorStep {}
public record Run(FileStorageId crawlId, FileStorageId destId, int size, String name, long msgId) implements ActorStep { public record Run(FileStorageId crawlId, FileStorageId destId, int size, String ctFilter, String name, long msgId) implements ActorStep {
public Run(FileStorageId crawlId, FileStorageId destId, int size, String name) { public Run(FileStorageId crawlId, FileStorageId destId, int size, String name, String ctFilter) {
this(crawlId, destId, size, name, -1); this(crawlId, destId, size, name, ctFilter,-1);
} }
} }
@Override @Override
public ActorStep transition(ActorStep self) throws Exception { public ActorStep transition(ActorStep self) throws Exception {
return switch(self) { return switch(self) {
case Export(FileStorageId crawlId, int size, String name) -> { case Export(FileStorageId crawlId, int size, String ctFilter, String name) -> {
var storage = storageService.allocateStorage(FileStorageType.EXPORT, var storage = storageService.allocateStorage(FileStorageType.EXPORT,
"crawl-sample-export", "crawl-sample-export",
"Crawl Data Sample " + name + "/" + size + " " + LocalDateTime.now() "Crawl Data Sample " + name + "/" + size + " " + LocalDateTime.now()
); );
if (storage == null) yield new Error("Bad storage id"); if (storage == null) yield new Error("Bad storage id");
yield new Run(crawlId, storage.id(), size, name); yield new Run(crawlId, storage.id(), size, ctFilter, name);
} }
case Run(FileStorageId crawlId, FileStorageId destId, int size, String name, long msgId) when msgId < 0 -> { case Run(FileStorageId crawlId, FileStorageId destId, int size, String ctFilter, String name, long msgId) when msgId < 0 -> {
storageService.setFileStorageState(destId, FileStorageState.NEW); storageService.setFileStorageState(destId, FileStorageState.NEW);
long newMsgId = exportTasksOutbox.sendAsync(ExportTaskRequest.sampleData(crawlId, destId, size, name)); long newMsgId = exportTasksOutbox.sendAsync(ExportTaskRequest.sampleData(crawlId, destId, ctFilter, size, name));
yield new Run(crawlId, destId, size, name, newMsgId); yield new Run(crawlId, destId, size, ctFilter, name, newMsgId);
} }
case Run(_, FileStorageId destId, _, _, long msgId) -> { case Run(_, FileStorageId destId, _, _, _, long msgId) -> {
var rsp = processWatcher.waitResponse(exportTasksOutbox, ProcessService.ProcessId.EXPORT_TASKS, msgId); var rsp = processWatcher.waitResponse(exportTasksOutbox, ProcessService.ProcessId.EXPORT_TASKS, msgId);
if (rsp.state() != MqMessageState.OK) { if (rsp.state() != MqMessageState.OK) {
@@ -70,7 +70,7 @@ public class ExportSampleDataActor extends RecordActorPrototype {
@Override @Override
public String describe() { public String describe() {
return "Export RSS/Atom feeds from crawl data"; return "Export sample crawl data";
} }
@Inject @Inject

View File

@@ -49,6 +49,7 @@ public class ExecutorExportGrpcService
new ExportSampleDataActor.Export( new ExportSampleDataActor.Export(
FileStorageId.of(request.getFileStorageId()), FileStorageId.of(request.getFileStorageId()),
request.getSize(), request.getSize(),
request.getCtFilter(),
request.getName() request.getName()
) )
); );

View File

@@ -229,13 +229,15 @@ public class FeedFetcherService {
.timeout(Duration.ofSeconds(15)) .timeout(Duration.ofSeconds(15))
; ;
if (ifModifiedSinceDate != null) { // Set the If-Modified-Since or If-None-Match headers if we have them
// though since there are certain idiosyncrasies in server implementations,
// we avoid setting both at the same time as that may turn a 304 into a 200.
if (ifNoneMatchTag != null) {
requestBuilder.header("If-None-Match", ifNoneMatchTag);
} else if (ifModifiedSinceDate != null) {
requestBuilder.header("If-Modified-Since", ifModifiedSinceDate); requestBuilder.header("If-Modified-Since", ifModifiedSinceDate);
} }
if (ifNoneMatchTag != null) {
requestBuilder.header("If-None-Match", ifNoneMatchTag);
}
HttpRequest getRequest = requestBuilder.build(); HttpRequest getRequest = requestBuilder.build();

View File

@@ -264,17 +264,16 @@ public class CrawlerMain extends ProcessMainClass {
if (workLog.isJobFinished(crawlSpec.domain)) if (workLog.isJobFinished(crawlSpec.domain))
continue; continue;
var task = new CrawlTask( var task = new CrawlTask(crawlSpec, anchorTagsSource, outputDir, warcArchiver, domainStateDb, workLog);
crawlSpec,
anchorTagsSource,
outputDir,
warcArchiver,
domainStateDb,
workLog);
// Try to run immediately, to avoid unnecessarily keeping the entire work set in RAM // Try to run immediately, to avoid unnecessarily keeping the entire work set in RAM
if (!trySubmitDeferredTask(task)) { if (!trySubmitDeferredTask(task)) {
// Otherwise add to the taskList for deferred execution
// Drain the retry queue to the taskList, and try to submit any tasks that are in the retry queue
retryQueue.drainTo(taskList);
taskList.removeIf(this::trySubmitDeferredTask);
// Then add this new task to the retry queue
taskList.add(task); taskList.add(task);
} }
} }
@@ -449,13 +448,7 @@ public class CrawlerMain extends ProcessMainClass {
// We don't have a lock, so we can't run this task // We don't have a lock, so we can't run this task
// we return to avoid blocking the pool for too long // we return to avoid blocking the pool for too long
if (lock.isEmpty()) { if (lock.isEmpty()) {
if (retryQueue.remainingCapacity() > 0) { pendingCrawlTasks.remove(domain);
// Sleep a moment to avoid busy looping via the retry queue
// in the case when few tasks remain and almost all are ineligible for
// immediate restart
Thread.sleep(5);
}
retryQueue.put(this); retryQueue.put(this);
return; return;
} }

View File

@@ -19,11 +19,13 @@ public record ContentTags(String etag, String lastMod) {
/** Paints the tags onto the request builder. */ /** Paints the tags onto the request builder. */
public void paint(HttpGet request) { public void paint(HttpGet request) {
// Paint the ETag header if present,
// otherwise paint the Last-Modified header
// (but not both at the same time due to some servers not liking it)
if (etag != null) { if (etag != null) {
request.addHeader("If-None-Match", etag); request.addHeader("If-None-Match", etag);
} } else if (lastMod != null) {
if (lastMod != null) {
request.addHeader("If-Modified-Since", lastMod); request.addHeader("If-Modified-Since", lastMod);
} }
} }

View File

@@ -51,6 +51,7 @@ import javax.net.ssl.SSLException;
import java.io.IOException; import java.io.IOException;
import java.net.SocketTimeoutException; import java.net.SocketTimeoutException;
import java.net.URISyntaxException; import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.security.NoSuchAlgorithmException; import java.security.NoSuchAlgorithmException;
import java.time.Duration; import java.time.Duration;
import java.time.Instant; import java.time.Instant;
@@ -635,14 +636,12 @@ public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
@Override @Override
public boolean retryRequest(HttpRequest request, IOException exception, int executionCount, HttpContext context) { public boolean retryRequest(HttpRequest request, IOException exception, int executionCount, HttpContext context) {
if (exception instanceof SocketTimeoutException) { // Timeouts are not recoverable return switch (exception) {
return false; case SocketTimeoutException ste -> false;
} case SSLException ssle -> false;
if (exception instanceof SSLException) { // SSL exceptions are unlikely to be recoverable case UnknownHostException uhe -> false;
return false; default -> executionCount <= 3;
} };
return executionCount <= 3;
} }
@Override @Override

View File

@@ -57,6 +57,7 @@ public abstract class WarcInputBuffer implements AutoCloseable {
return new ErrorBuffer(); return new ErrorBuffer();
} }
Instant start = Instant.now();
InputStream is = null; InputStream is = null;
try { try {
is = entity.getContent(); is = entity.getContent();
@@ -71,8 +72,25 @@ public abstract class WarcInputBuffer implements AutoCloseable {
} }
} }
finally { finally {
// We're required to consume the stream to avoid leaking connections,
// but we also don't want to get stuck on slow or malicious connections
// forever, so we set a time limit on this phase and call abort() if it's exceeded.
try { try {
is.skip(Long.MAX_VALUE); while (is != null) {
// Consume some data
if (is.skip(65536) == 0) {
// Note that skip may return 0 if the stream is empty
// or for other unspecified reasons, so we need to check
// with read() as well to determine if the stream is done
if (is.read() == -1)
is = null;
}
// Check if the time limit has been exceeded
else if (Duration.between(start, Instant.now()).compareTo(timeLimit) > 0) {
request.abort();
is = null;
}
}
} }
catch (IOException e) { catch (IOException e) {
// Ignore the exception // Ignore the exception

View File

@@ -74,7 +74,7 @@ public class CrawlerRevisitor {
// If the reference document is empty or the HTTP status is not 200, we'll skip it since it's // If the reference document is empty or the HTTP status is not 200, we'll skip it since it's
// unlikely to produce anything meaningful for us. // unlikely to produce anything meaningful for us.
if (doc.httpStatus != 200) if (doc.httpStatus != 200 && doc.httpStatus != 206)
continue; continue;
if (!doc.hasBody()) if (!doc.hasBody())
continue; continue;

View File

@@ -58,7 +58,7 @@ public record DocumentWithReference(
if (null == doc) if (null == doc)
return ContentTags.empty(); return ContentTags.empty();
if (doc.documentBodyBytes.length == 0 || doc.httpStatus != 200) if (doc.documentBodyBytes.length == 0 || (doc.httpStatus != 200 && doc.httpStatus != 206))
return ContentTags.empty(); return ContentTags.empty();
String lastmod = doc.getLastModified(); String lastmod = doc.getLastModified();

View File

@@ -1,5 +1,7 @@
package nu.marginalia; package nu.marginalia;
import org.apache.commons.lang3.StringUtils;
import java.util.Set; import java.util.Set;
public class ContentTypes { public class ContentTypes {
@@ -11,9 +13,9 @@ public class ContentTypes {
"text/plain"); "text/plain");
public static boolean isAccepted(String contentTypeHeader) { public static boolean isAccepted(String contentTypeHeader) {
String lcHeader = contentTypeHeader.toLowerCase(); String lcHeader = StringUtils.substringBefore(contentTypeHeader.toLowerCase(), ';');
for (var type : acceptedContentTypes) { for (var type : acceptedContentTypes) {
if (lcHeader.startsWith(type)) { if (lcHeader.equals(type)) {
return true; return true;
} }
} }
@@ -21,7 +23,7 @@ public class ContentTypes {
} }
public static boolean isBinary(String contentTypeHeader) { public static boolean isBinary(String contentTypeHeader) {
String lcHeader = contentTypeHeader.toLowerCase(); String lcHeader = StringUtils.substringBefore(contentTypeHeader.toLowerCase(), ';');
return lcHeader.startsWith("application/pdf"); return lcHeader.startsWith("application/pdf");
} }

View File

@@ -277,7 +277,8 @@ public record SlopCrawlDataRecord(String domain,
try (var table = new SlopTable(path)) { try (var table = new SlopTable(path)) {
ShortColumn.Reader statusReader = statusColumn.open(table); ShortColumn.Reader statusReader = statusColumn.open(table);
while (statusReader.hasRemaining()) { while (statusReader.hasRemaining()) {
if (statusReader.get() == 200) { int status = statusReader.get();
if (status == 200 || status == 206) {
cnt++; cnt++;
} }
} }

View File

@@ -53,6 +53,8 @@ dependencies {
implementation libs.commons.compress implementation libs.commons.compress
implementation libs.commons.codec implementation libs.commons.codec
implementation libs.jsoup implementation libs.jsoup
implementation libs.slop
implementation libs.jwarc

View File

@@ -1,13 +1,18 @@
package nu.marginalia.extractor; package nu.marginalia.extractor;
import com.google.inject.Inject; import com.google.inject.Inject;
import nu.marginalia.process.control.ProcessHeartbeat;
import nu.marginalia.process.log.WorkLog; import nu.marginalia.process.log.WorkLog;
import nu.marginalia.process.log.WorkLogEntry; import nu.marginalia.process.log.WorkLogEntry;
import nu.marginalia.slop.SlopCrawlDataRecord;
import nu.marginalia.slop.SlopTablePacker;
import nu.marginalia.storage.FileStorageService; import nu.marginalia.storage.FileStorageService;
import nu.marginalia.storage.model.FileStorage; import nu.marginalia.storage.model.FileStorage;
import nu.marginalia.storage.model.FileStorageId; import nu.marginalia.storage.model.FileStorageId;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream; import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.utils.IOUtils; import org.apache.commons.compress.utils.IOUtils;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils;
import java.io.IOException; import java.io.IOException;
import java.nio.file.Files; import java.nio.file.Files;
@@ -16,18 +21,19 @@ import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption; import java.nio.file.StandardOpenOption;
import java.nio.file.attribute.PosixFilePermissions; import java.nio.file.attribute.PosixFilePermissions;
import java.sql.SQLException; import java.sql.SQLException;
import java.util.ArrayList; import java.util.*;
import java.util.Collections;
import java.util.List;
public class SampleDataExporter { public class SampleDataExporter {
private final FileStorageService storageService; private final FileStorageService storageService;
private final ProcessHeartbeat processHeartbeat;
@Inject @Inject
public SampleDataExporter(FileStorageService storageService) { public SampleDataExporter(FileStorageService storageService, ProcessHeartbeat processHeartbeat) {
this.storageService = storageService; this.storageService = storageService;
this.processHeartbeat = processHeartbeat;
} }
public void export(FileStorageId crawlId, FileStorageId destId, int size, String name) throws SQLException, IOException {
public void export(FileStorageId crawlId, FileStorageId destId, int size, String ctFilter, String name) throws SQLException, IOException {
FileStorage destStorage = storageService.getStorage(destId); FileStorage destStorage = storageService.getStorage(destId);
Path inputDir = storageService.getStorage(crawlId).asPath(); Path inputDir = storageService.getStorage(crawlId).asPath();
@@ -54,11 +60,6 @@ public class SampleDataExporter {
Path newCrawlerLogFile = Files.createTempFile(destStorage.asPath(), "crawler", ".log", Path newCrawlerLogFile = Files.createTempFile(destStorage.asPath(), "crawler", ".log",
PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rw-r--r--"))); PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rw-r--r--")));
try (var bw = Files.newBufferedWriter(newCrawlerLogFile, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
for (var item : entriesAll) {
bw.write(item.id() + " " + item.ts() + " " + item.relPath() + " " + item.cnt() + "\n");
}
}
Path newManifestJsonFile = Files.createTempFile(destStorage.asPath(), "manifest", ".json", Path newManifestJsonFile = Files.createTempFile(destStorage.asPath(), "manifest", ".json",
PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rw-r--r--"))); PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rw-r--r--")));
@@ -67,12 +68,34 @@ public class SampleDataExporter {
var tmpTarFile = Files.createTempFile(destStorage.asPath(), "data", ".tar", var tmpTarFile = Files.createTempFile(destStorage.asPath(), "data", ".tar",
PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rw-r--r--"))); PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rw-r--r--")));
try (var stream = new TarArchiveOutputStream(Files.newOutputStream(tmpTarFile, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING))) { try (var stream = new TarArchiveOutputStream(Files.newOutputStream(tmpTarFile, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING));
for (var item : entriesAll) { var logWriter = Files.newBufferedWriter(newCrawlerLogFile, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
var hb = processHeartbeat.createAdHocTaskHeartbeat("Generating Sample")
) {
for (var item : hb.wrap("Scanning", entriesAll)) {
Path crawlDataPath = inputDir.resolve(item.relPath()); Path crawlDataPath = inputDir.resolve(item.relPath());
if (!Files.exists(crawlDataPath)) continue; if (!Files.exists(crawlDataPath)) continue;
addFileToTar(stream, crawlDataPath, item.relPath()); if (StringUtils.isBlank(ctFilter)) {
addFileToTar(stream, crawlDataPath, item.relPath());
logWriter.write(item.id() + " " + item.ts() + " " + item.relPath() + " " + item.cnt() + "\n");
}
else /* filter != null */ {
Path filteredData = null;
try {
filteredData = filterEntries(crawlDataPath, ctFilter);
addFileToTar(stream, filteredData, item.relPath());
logWriter.write(item.id() + " " + item.ts() + " " + item.relPath() + " " + item.cnt() + "\n");
}
catch (NoSuchElementException ex) {
// Ignore
}
finally {
if (filteredData != null) {
Files.deleteIfExists(filteredData);
}
}
}
} }
addFileToTar(stream, newCrawlerLogFile, "crawler.log"); addFileToTar(stream, newCrawlerLogFile, "crawler.log");
@@ -86,6 +109,48 @@ public class SampleDataExporter {
Files.move(tmpTarFile, destStorage.asPath().resolve("crawl-data.tar"), StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING); Files.move(tmpTarFile, destStorage.asPath().resolve("crawl-data.tar"), StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
} }
/** Filters the entries in the crawl data file based on the content type. */
private Path filterEntries(Path crawlDataPath, String contentTypeFilter) throws IOException, NoSuchElementException {
Path tempDir = crawlDataPath.resolveSibling(crawlDataPath.getFileName() + ".filtered");
Path tempFile = crawlDataPath.resolveSibling(crawlDataPath.getFileName() + ".filtered.slop.zip");
// We may have debris from a previous run, so let's clean it up
if (Files.isDirectory(tempDir)) {
FileUtils.deleteDirectory(tempDir.toFile());
}
Files.createDirectory(tempDir);
try (var writer = new SlopCrawlDataRecord.Writer(tempDir);
var reader = new SlopCrawlDataRecord.FilteringReader(crawlDataPath) {
@Override
public boolean filter(String url, int status, String contentType) {
return Objects.equals(StringUtils.substringBefore(contentType, ';'), contentTypeFilter)
|| contentType.startsWith("x-marginalia/"); // metadata records
}
}
) {
boolean wroteEntry = false;
while (reader.hasRemaining()) {
var entry = reader.get();
writer.write(entry);
wroteEntry = wroteEntry || Objects.equals(StringUtils.substringBefore(entry.contentType(), ';'), contentTypeFilter);
}
if (!wroteEntry) {
throw new NoSuchElementException("No relevant entries");
}
SlopTablePacker.packToSlopZip(tempDir, tempFile);
}
finally {
FileUtils.deleteDirectory(tempDir.toFile());
}
return tempFile;
}
private void addFileToTar(TarArchiveOutputStream outputStream, Path file, String fileName) throws IOException { private void addFileToTar(TarArchiveOutputStream outputStream, Path file, String fileName) throws IOException {
var entry = outputStream.createArchiveEntry(file.toFile(), fileName); var entry = outputStream.createArchiveEntry(file.toFile(), fileName);
entry.setSize(Files.size(file)); entry.setSize(Files.size(file));

View File

@@ -92,7 +92,7 @@ public class ExportTasksMain extends ProcessMainClass {
termFrequencyExporter.export(request.crawlId, request.destId); termFrequencyExporter.export(request.crawlId, request.destId);
break; break;
case SAMPLE_DATA: case SAMPLE_DATA:
sampleDataExporter.export(request.crawlId, request.destId, request.size, request.name); sampleDataExporter.export(request.crawlId, request.destId, request.size, request.ctFilter, request.name);
break; break;
case ADJACENCIES: case ADJACENCIES:
websiteAdjacenciesCalculator.export(); websiteAdjacenciesCalculator.export();

View File

@@ -16,6 +16,7 @@ public class ExportTaskRequest {
public FileStorageId destId; public FileStorageId destId;
public int size; public int size;
public String name; public String name;
public String ctFilter;
public ExportTaskRequest(Task task) { public ExportTaskRequest(Task task) {
this.task = task; this.task = task;
@@ -42,12 +43,13 @@ public class ExportTaskRequest {
return request; return request;
} }
public static ExportTaskRequest sampleData(FileStorageId crawlId, FileStorageId destId, int size, String name) { public static ExportTaskRequest sampleData(FileStorageId crawlId, FileStorageId destId, String ctFilter, int size, String name) {
ExportTaskRequest request = new ExportTaskRequest(Task.SAMPLE_DATA); ExportTaskRequest request = new ExportTaskRequest(Task.SAMPLE_DATA);
request.crawlId = crawlId; request.crawlId = crawlId;
request.destId = destId; request.destId = destId;
request.size = size; request.size = size;
request.name = name; request.name = name;
request.ctFilter = ctFilter;
return request; return request;
} }

View File

@@ -3,7 +3,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
java { java {

View File

@@ -3,7 +3,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
application { application {

View File

@@ -3,7 +3,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
application { application {

View File

@@ -5,7 +5,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
application { application {

View File

@@ -3,7 +3,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'gg.jte.gradle' version '3.1.15' id 'gg.jte.gradle' version '3.1.15'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
application { application {

View File

@@ -180,7 +180,7 @@ public class UrlDetails implements Comparable<UrlDetails> {
* semantically meaningful codepoints into entity codes */ * semantically meaningful codepoints into entity codes */
public String displayUrl() { public String displayUrl() {
StringBuilder sb = new StringBuilder(); StringBuilder sb = new StringBuilder();
String urlStr = url.toString(); String urlStr = url.toDisplayString();
for (int i = 0; i < urlStr.length(); i++) { for (int i = 0; i < urlStr.length(); i++) {
char c = urlStr.charAt(i); char c = urlStr.charAt(i);

View File

@@ -26,4 +26,10 @@
<link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml" title="Marginalia"> <link rel="search" type="application/opensearchdescription+xml" href="/opensearch.xml" title="Marginalia">
</head> </head>
<noscript>
<h1>Users of text-based browsers</h1>
<p>Consider using the old interface at <a href="https://old-search.marginalia.nu/">https://old-search.marginalia.nu/</a>,
as it uses fewer modern CSS tricks, and should work better than the new UI. It's functionally nearly identical, but just renders it using a different layout.</p>
<hr>
</noscript>

View File

@@ -26,7 +26,7 @@
<!-- Main content --> <!-- Main content -->
<main class="flex-1 p-4 max-w-2xl space-y-4"> <main class="flex-1 p-4 max-w-2xl space-y-4">
<div class="border dark:border-gray-600 rounded bg-white text-black dark:bg-gray-800 dark:text-white text-m p-4"> <div class="border border-gray-300 dark:border-gray-600 rounded bg-white text-black dark:bg-gray-800 dark:text-white text-m p-4">
<div class="flex space-x-3 place-items-baseline"> <div class="flex space-x-3 place-items-baseline">
<i class="fa fa-circle-exclamation text-red-800"></i> <i class="fa fa-circle-exclamation text-red-800"></i>
<div class="grow">${model.errorTitle()}</div> <div class="grow">${model.errorTitle()}</div>

View File

@@ -80,10 +80,6 @@
<tr><td>rank&gt;50</td><td>The ranking of the website is at least 50 in a span of 1 - 255</td></tr> <tr><td>rank&gt;50</td><td>The ranking of the website is at least 50 in a span of 1 - 255</td></tr>
<tr><td>rank&lt;50</td><td>The ranking of the website is at most 50 in a span of 1 - 255</td></tr> <tr><td>rank&lt;50</td><td>The ranking of the website is at most 50 in a span of 1 - 255</td></tr>
<tr><td>count&gt;10</td><td> The search term must appear in at least 10 results form the domain</td></tr>
<tr><td>count&lt;10</td><td> The search term must appear in at most 10 results from the domain</td></tr>
<tr><td>format:html5</td><td>Filter documents using the HTML5 standard. This is typically modern websites.</td></tr> <tr><td>format:html5</td><td>Filter documents using the HTML5 standard. This is typically modern websites.</td></tr>
<tr><td>format:xhtml</td><td>Filter documents using the XHTML standard</td></tr> <tr><td>format:xhtml</td><td>Filter documents using the XHTML standard</td></tr>
<tr><td>format:html123</td><td>Filter documents using the HTML standards 1, 2, and 3. This is typically very old websites. </td></tr> <tr><td>format:html123</td><td>Filter documents using the HTML standards 1, 2, and 3. This is typically very old websites. </td></tr>

View File

@@ -7,7 +7,7 @@
<form class="flex-1 max-w-2xl" action="/search"> <form class="flex-1 max-w-2xl" action="/search">
<div class="flex"> <div class="flex">
@if (query.isBlank()) @if (query != null && query.isBlank())
<%-- Add autofocus if the query is blank --%> <%-- Add autofocus if the query is blank --%>
<input type="text" <input type="text"
class="shadow-inner flex-1 dark:bg-black dark:text-gray-100 bg-gray-50 border dark:border-gray-600 border-gray-300 text-gray-900 text-sm rounded-sm block w-full p-2.5" class="shadow-inner flex-1 dark:bg-black dark:text-gray-100 bg-gray-50 border dark:border-gray-600 border-gray-300 text-gray-900 text-sm rounded-sm block w-full p-2.5"

View File

@@ -2,7 +2,7 @@ plugins {
id 'java' id 'java'
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
java { java {

View File

@@ -20,6 +20,6 @@ public class StatusModule extends AbstractModule {
bind(String.class) bind(String.class)
.annotatedWith(Names.named("searchEngineTestQuery")) .annotatedWith(Names.named("searchEngineTestQuery"))
.toInstance(System.getProperty("status-service.public-query", .toInstance(System.getProperty("status-service.public-query",
"https://search.marginalia.nu/search?query=plato&ref=marginalia-automatic-metrics")); "https://marginalia-search.com/search?query=plato&ref=marginalia-automatic-metrics"));
} }
} }

View File

@@ -3,7 +3,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
application { application {

View File

@@ -10,7 +10,8 @@ import static com.google.inject.name.Names.named;
public class AssistantModule extends AbstractModule { public class AssistantModule extends AbstractModule {
public void configure() { public void configure() {
bind(Path.class).annotatedWith(named("suggestions-file")).toInstance(WmsaHome.getHomePath().resolve("data/suggestions2.txt.gz")); bind(Path.class).annotatedWith(named("suggestions-file1")).toInstance(WmsaHome.getHomePath().resolve("data/suggestions2.txt.gz"));
bind(Path.class).annotatedWith(named("suggestions-file2")).toInstance(WmsaHome.getHomePath().resolve("data/suggestions3.txt.gz"));
bind(LanguageModels.class).toInstance(WmsaHome.getLanguageModels()); bind(LanguageModels.class).toInstance(WmsaHome.getLanguageModels());
} }

View File

@@ -1,6 +1,7 @@
package nu.marginalia.assistant.suggest; package nu.marginalia.assistant.suggest;
import gnu.trove.list.array.TIntArrayList; import gnu.trove.list.array.TIntArrayList;
import org.jetbrains.annotations.NotNull;
import java.util.*; import java.util.*;
@@ -434,7 +435,7 @@ public class PrefixSearchStructure {
/** /**
* Class representing a suggested completion. * Class representing a suggested completion.
*/ */
public static class ScoredSuggestion { public static class ScoredSuggestion implements Comparable<ScoredSuggestion> {
private final String word; private final String word;
private final int score; private final int score;
@@ -455,5 +456,10 @@ public class PrefixSearchStructure {
public String toString() { public String toString() {
return word + " (" + score + ")"; return word + " (" + score + ")";
} }
@Override
public int compareTo(@NotNull PrefixSearchStructure.ScoredSuggestion o) {
return Integer.compare(this.score, o.score);
}
} }
} }

View File

@@ -2,8 +2,6 @@ package nu.marginalia.assistant.suggest;
import com.google.inject.Inject; import com.google.inject.Inject;
import com.google.inject.name.Named; import com.google.inject.name.Named;
import nu.marginalia.functions.math.dict.SpellChecker;
import nu.marginalia.term_frequency_dict.TermFrequencyDict;
import org.apache.commons.lang3.StringUtils; import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
@@ -13,35 +11,27 @@ import java.io.IOException;
import java.nio.file.Files; import java.nio.file.Files;
import java.nio.file.Path; import java.nio.file.Path;
import java.nio.file.StandardOpenOption; import java.nio.file.StandardOpenOption;
import java.util.ArrayList; import java.util.*;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream; import java.util.zip.GZIPInputStream;
public class Suggestions { public class Suggestions {
private PrefixSearchStructure searchStructure = null; List<PrefixSearchStructure> searchStructures = new ArrayList<>();
private TermFrequencyDict termFrequencyDict = null;
private volatile boolean ready = false; private volatile boolean ready = false;
private final SpellChecker spellChecker;
private static final Pattern suggestionPattern = Pattern.compile("^[a-zA-Z0-9]+( [a-zA-Z0-9]+)*$");
private static final Logger logger = LoggerFactory.getLogger(Suggestions.class); private static final Logger logger = LoggerFactory.getLogger(Suggestions.class);
private static final int MIN_SUGGEST_LENGTH = 3; private static final int MIN_SUGGEST_LENGTH = 3;
@Inject @Inject
public Suggestions(@Named("suggestions-file") Path suggestionsFile, public Suggestions(@Named("suggestions-file1") Path suggestionsFile1,
SpellChecker spellChecker, @Named("suggestions-file2") Path suggestionsFile2
TermFrequencyDict dict
) { ) {
this.spellChecker = spellChecker;
Thread.ofPlatform().start(() -> { Thread.ofPlatform().start(() -> {
searchStructure = loadSuggestions(suggestionsFile); searchStructures.add(loadSuggestions(suggestionsFile1));
termFrequencyDict = dict; searchStructures.add(loadSuggestions(suggestionsFile2));
ready = true; ready = true;
logger.info("Loaded {} suggestions", searchStructure.size()); logger.info("Loaded suggestions");
}); });
} }
@@ -55,8 +45,8 @@ public class Suggestions {
try (var scanner = new Scanner(new GZIPInputStream(new BufferedInputStream(Files.newInputStream(file, StandardOpenOption.READ))))) { try (var scanner = new Scanner(new GZIPInputStream(new BufferedInputStream(Files.newInputStream(file, StandardOpenOption.READ))))) {
while (scanner.hasNextLine()) { while (scanner.hasNextLine()) {
String line = scanner.nextLine(); String line = scanner.nextLine().trim();
String[] parts = StringUtils.split(line, " ", 2); String[] parts = StringUtils.split(line, " ,", 2);
if (parts.length != 2) { if (parts.length != 2) {
logger.warn("Invalid suggestion line: {}", line); logger.warn("Invalid suggestion line: {}", line);
continue; continue;
@@ -64,7 +54,30 @@ public class Suggestions {
int cnt = Integer.parseInt(parts[0]); int cnt = Integer.parseInt(parts[0]);
if (cnt > 1) { if (cnt > 1) {
String word = parts[1]; String word = parts[1];
ret.insert(word, cnt);
// Remove quotes and trailing periods if this is a CSV
if (word.startsWith("\"") && word.endsWith("\"")) {
word = word.substring(1, word.length() - 1);
}
// Remove trailing periods
while (word.endsWith(".")) {
word = word.substring(0, word.length() - 1);
}
// Remove junk items we may have gotten from link extraction
if (word.startsWith("click here"))
continue;
if (word.contains("new window"))
continue;
if (word.contains("click to"))
continue;
if (word.startsWith("share "))
continue;
if (word.length() > 3) {
ret.insert(word, cnt);
}
} }
} }
return ret; return ret;
@@ -96,10 +109,22 @@ public class Suggestions {
return List.of(); return List.of();
} }
var results = searchStructure.getTopCompletions(prefix, count); List<PrefixSearchStructure.ScoredSuggestion> resultsAll = new ArrayList<>();
for (var searchStructure : searchStructures) {
resultsAll.addAll(searchStructure.getTopCompletions(prefix, count));
}
resultsAll.sort(Comparator.reverseOrder());
List<String> ret = new ArrayList<>(count); List<String> ret = new ArrayList<>(count);
for (var result : results) {
ret.add(result.getWord()); Set<String> seen = new HashSet<>();
for (var result : resultsAll) {
if (seen.add(result.getWord())) {
ret.add(result.getWord());
}
if (ret.size() >= count) {
break;
}
} }
return ret; return ret;

View File

@@ -2,7 +2,7 @@ plugins {
id 'java' id 'java'
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
java { java {

View File

@@ -64,6 +64,11 @@ public class ControlMain extends MainClass {
download(suggestionsFile, new URI("https://downloads.marginalia.nu/data/suggestions2.txt.gz")); download(suggestionsFile, new URI("https://downloads.marginalia.nu/data/suggestions2.txt.gz"));
} }
Path altSuggestionsFile = dataPath.resolve("suggestions3.txt.gz");
if (!Files.exists(altSuggestionsFile)) {
download(altSuggestionsFile, new URI("https://downloads.marginalia.nu/data/suggestions3.txt.gz"));
}
Path asnRawData = dataPath.resolve("asn-data-raw-table"); Path asnRawData = dataPath.resolve("asn-data-raw-table");
if (!Files.exists(asnRawData)) { if (!Files.exists(asnRawData)) {
download(asnRawData, new URI("https://thyme.apnic.net/current/data-raw-table")); download(asnRawData, new URI("https://thyme.apnic.net/current/data-raw-table"));

View File

@@ -321,9 +321,10 @@ public class ControlNodeActionsService {
private Object exportSampleData(Request req, Response rsp) { private Object exportSampleData(Request req, Response rsp) {
FileStorageId source = parseSourceFileStorageId(req.queryParams("source")); FileStorageId source = parseSourceFileStorageId(req.queryParams("source"));
int size = Integer.parseInt(req.queryParams("size")); int size = Integer.parseInt(req.queryParams("size"));
String ctFilter = req.queryParams("ctFilter");
String name = req.queryParams("name"); String name = req.queryParams("name");
exportClient.exportSampleData(Integer.parseInt(req.params("id")), source, size, name); exportClient.exportSampleData(Integer.parseInt(req.params("id")), source, size, ctFilter, name);
return ""; return "";
} }

View File

@@ -35,6 +35,11 @@
<div><input type="text" name="size" id="size" pattern="\d+" /></div> <div><input type="text" name="size" id="size" pattern="\d+" /></div>
<small class="text-muted">How many domains to include in the sample set</small> <small class="text-muted">How many domains to include in the sample set</small>
</div> </div>
<div class="mb-3">
<label for="ctFilter">Content Type Filter</label>
<div><input type="text" name="ctFilter" id="ctFilter" /></div>
<small class="text-muted">If set, includes only documents with the specified content type value</small>
</div>
<div class="mb-3"> <div class="mb-3">
<label for="name">Name</label> <label for="name">Name</label>
<div><input type="text" name="name" id="name" /></div> <div><input type="text" name="name" id="name" /></div>

View File

@@ -3,7 +3,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
application { application {

View File

@@ -3,7 +3,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
application { application {

View File

@@ -3,7 +3,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
application { application {

View File

@@ -3,7 +3,7 @@ plugins {
id 'application' id 'application'
id 'jvm-test-suite' id 'jvm-test-suite'
id 'com.google.cloud.tools.jib' version '3.4.4' id 'com.google.cloud.tools.jib' version '3.4.5'
} }
java { java {

View File

@@ -1,4 +1,9 @@
## This is a token file for automatic deployment ## This is a token file for triggering automatic deployment when no commit is made.
2025-01-08: Deploy executor. 2025-01-08: Deploy executor.
2025-01-07: Deploy executor. 2025-01-07: Deploy executor.
2025-04-24: Deploy executor.
2025-04-24: Deploy assistant.
2025-05-04: Deploy qs, search and api-services.
2025-05-05: Deploy executor partition 4.
2025-05-05: Deploy control.

View File

@@ -1,5 +1,5 @@
distributionBase=GRADLE_USER_HOME distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-8.10-bin.zip distributionUrl=https\://services.gradle.org/distributions/gradle-8.14-bin.zip
zipStoreBase=GRADLE_USER_HOME zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists zipStorePath=wrapper/dists

View File

@@ -314,6 +314,13 @@ if __name__ == '__main__':
deploy_tier=0, deploy_tier=0,
groups={"all", "core"} groups={"all", "core"}
), ),
'status': ServiceConfig(
gradle_target=':code:services-application:status-service:docker',
docker_name='status-service',
instances=None,
deploy_tier=4,
groups={"all"}
),
'query': ServiceConfig( 'query': ServiceConfig(
gradle_target=':code:services-core:query-service:docker', gradle_target=':code:services-core:query-service:docker',
docker_name='query-service', docker_name='query-service',