(assistant) Improve search suggestions

Improve suggestions by loading a secondary suggestions set with link text data.
(search) Improve suggestions UX
2025-10-06 07:32:38 +02:00 · 2025-04-24 13:10:59 +02:00 · 2025-04-24 12:34:05 +02:00 · 2025-04-24 00:32:25 +02:00 · 2025-04-23 20:17:49 +02:00 · 2025-04-23 20:13:53 +02:00
186 changed files with 5860 additions and 1979 deletions
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -1,4 +1,4 @@
-# Roadmap 2024-2025
+# Roadmap 2025

 This is a roadmap with major features planned for Marginalia Search.

@@ -30,12 +30,6 @@ Retaining the ability to independently crawl the web is still strongly desirable
 The search engine has a bit of a problem showing spicy content mixed in with the results.  It would be desirable to have a way to filter this out.  It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
 combined with naive bayesian filter would go a long way, or something more sophisticated...?

-## Web Design Overhaul
-
-The design is kinda clunky and hard to maintain, and needlessly outdated-looking.  
-
-In progress: PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127)  -- demo available at https://test.marginalia.nu/
-
 ## Additional Language Support

 It would be desirable if the search engine supported more languages than English.  This is partially about
@@ -62,8 +56,31 @@ filter for any API consumer.

 I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this. 

+## Show favicons next to search results
+
+This is expected from search engines.  Basic proof of concept sketch of fetching this data has been done, but the feature is some way from being reality. 
+
+## Specialized crawler for github
+
+One of the search engine's biggest limitations right now is that it does not index github at all.   A specialized crawler that fetches at least the readme.md would go a long way toward providing search capabilities in this domain.
+
 # Completed

+## Web Design Overhaul (COMPLETED 2025-01)
+
+The design is kinda clunky and hard to maintain, and needlessly outdated-looking.  
+
+PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127)
+
+## Finalize RSS support (COMPLETED 2024-11)
+
+Marginalia has experimental RSS preview support for a few domains.  This works well and
+it should be extended to all domains.  It would also be interesting to offer search of the
+RSS data itself, or use the RSS set to feed a special live index that updates faster than the
+main dataset. 
+
+Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
+
 ## Proper Position Index (COMPLETED 2024-09)

 The search engine uses a fixed width bit mask to indicate word positions.  It has the benefit
@@ -76,11 +93,3 @@ list, as is the civilized way of doing this.

 Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)

-## Finalize RSS support (COMPLETED 2024-11)
-
-Marginalia has experimental RSS preview support for a few domains.  This works well and
-it should be extended to all domains.  It would also be interesting to offer search of the
-RSS data itself, or use the RSS set to feed a special live index that updates faster than the
-main dataset. 
-
-Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
--- a/build.gradle
+++ b/build.gradle
@@ -5,7 +5,7 @@ plugins {

    // This is a workaround for a bug in the Jib plugin that causes it to stall randomly
    // https://github.com/GoogleContainerTools/jib/issues/3347
-    id 'com.google.cloud.tools.jib' version '3.4.3' apply(false)
+    id 'com.google.cloud.tools.jib' version '3.4.4' apply(false)
 }

 group 'marginalia'
@@ -43,12 +43,11 @@ subprojects.forEach {it ->
 }

 ext {
-    jvmVersion=23
-    dockerImageBase='container-registry.oracle.com/graalvm/jdk:23'
+    jvmVersion = 24
+    dockerImageBase='container-registry.oracle.com/graalvm/jdk:24'
    dockerImageTag='latest'
    dockerImageRegistry='marginalia'
-    jibVersion = '3.4.3'
-
+    jibVersion = '3.4.4'
 }

 idea {
--- a/code/common/config/java/nu/marginalia/LanguageModels.java
+++ b/code/common/config/java/nu/marginalia/LanguageModels.java
@@ -24,58 +24,4 @@ public class LanguageModels {
        this.fasttextLanguageModel = fasttextLanguageModel;
        this.segments = segments;
    }
-
-    public static LanguageModelsBuilder builder() {
-        return new LanguageModelsBuilder();
-    }
-
-    public static class LanguageModelsBuilder {
-        private Path termFrequencies;
-        private Path openNLPSentenceDetectionData;
-        private Path posRules;
-        private Path posDict;
-        private Path fasttextLanguageModel;
-        private Path segments;
-
-        LanguageModelsBuilder() {
-        }
-
-        public LanguageModelsBuilder termFrequencies(Path termFrequencies) {
-            this.termFrequencies = termFrequencies;
-            return this;
-        }
-
-        public LanguageModelsBuilder openNLPSentenceDetectionData(Path openNLPSentenceDetectionData) {
-            this.openNLPSentenceDetectionData = openNLPSentenceDetectionData;
-            return this;
-        }
-
-        public LanguageModelsBuilder posRules(Path posRules) {
-            this.posRules = posRules;
-            return this;
-        }
-
-        public LanguageModelsBuilder posDict(Path posDict) {
-            this.posDict = posDict;
-            return this;
-        }
-
-        public LanguageModelsBuilder fasttextLanguageModel(Path fasttextLanguageModel) {
-            this.fasttextLanguageModel = fasttextLanguageModel;
-            return this;
-        }
-
-        public LanguageModelsBuilder segments(Path segments) {
-            this.segments = segments;
-            return this;
-        }
-
-        public LanguageModels build() {
-            return new LanguageModels(this.termFrequencies, this.openNLPSentenceDetectionData, this.posRules, this.posDict, this.fasttextLanguageModel, this.segments);
-        }
-
-        public String toString() {
-            return "LanguageModels.LanguageModelsBuilder(termFrequencies=" + this.termFrequencies + ", openNLPSentenceDetectionData=" + this.openNLPSentenceDetectionData + ", posRules=" + this.posRules + ", posDict=" + this.posDict + ", fasttextLanguageModel=" + this.fasttextLanguageModel + ", segments=" + this.segments + ")";
-        }
-    }
 }
--- a/code/common/db/java/nu/marginalia/db/DbDomainQueries.java
+++ b/code/common/db/java/nu/marginalia/db/DbDomainQueries.java
@@ -22,6 +22,7 @@ public class DbDomainQueries {
    private static final Logger logger = LoggerFactory.getLogger(DbDomainQueries.class);

    private final Cache<EdgeDomain, Integer> domainIdCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
+    private final Cache<EdgeDomain, DomainIdWithNode> domainWithNodeCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
    private final Cache<Integer, EdgeDomain> domainNameCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
    private final Cache<String, List<DomainWithNode>> siblingsCache = CacheBuilder.newBuilder().maximumSize(10_000).build();

@@ -59,6 +60,34 @@ public class DbDomainQueries {
        }
    }

+
+    public DomainIdWithNode getDomainIdWithNode(EdgeDomain domain) throws NoSuchElementException {
+        try {
+            return domainWithNodeCache.get(domain, () -> {
+                try (var connection = dataSource.getConnection();
+                     var stmt = connection.prepareStatement("SELECT ID, NODE_AFFINITY FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
+
+                    stmt.setString(1, domain.toString());
+                    var rsp = stmt.executeQuery();
+                    if (rsp.next()) {
+                        return new DomainIdWithNode(rsp.getInt(1), rsp.getInt(2));
+                    }
+                }
+                catch (SQLException ex) {
+                    throw new RuntimeException(ex);
+                }
+
+                throw new NoSuchElementException();
+            });
+        }
+        catch (UncheckedExecutionException ex) {
+            throw new NoSuchElementException();
+        }
+        catch (ExecutionException ex) {
+            throw new RuntimeException(ex.getCause());
+        }
+    }
+
    public OptionalInt tryGetDomainId(EdgeDomain domain) {

        Integer maybeId = domainIdCache.getIfPresent(domain);
@@ -145,4 +174,6 @@ public class DbDomainQueries {
            return nodeAffinity > 0;
        }
    }
+
+    public record DomainIdWithNode (int domainId, int nodeAffinity) { }
 }
--- a/code/common/model/java/nu/marginalia/model/EdgeDomain.java
+++ b/code/common/model/java/nu/marginalia/model/EdgeDomain.java
@@ -14,7 +14,7 @@ public class EdgeDomain implements Serializable {
    @Nonnull
    public final String topDomain;

-    public EdgeDomain(String host) {
+    public EdgeDomain(@Nonnull String host) {
        Objects.requireNonNull(host, "domain name must not be null");

        host = host.toLowerCase();
@@ -61,6 +61,10 @@ public class EdgeDomain implements Serializable {
        this.topDomain = topDomain;
    }

+    public static String getTopDomain(String host) {
+        return new EdgeDomain(host).topDomain;
+    }
+
    private boolean looksLikeGovTld(String host) {
        if (host.length() < 8)
            return false;
@@ -116,24 +120,6 @@ public class EdgeDomain implements Serializable {
        return topDomain.substring(0, cutPoint).toLowerCase();
    }

-    public String getLongDomainKey() {
-        StringBuilder ret = new StringBuilder();
-
-        int cutPoint = topDomain.indexOf('.');
-        if (cutPoint < 0) {
-            ret.append(topDomain);
-        } else {
-            ret.append(topDomain, 0, cutPoint);
-        }
-
-        if (!subDomain.isEmpty() && !"www".equals(subDomain)) {
-            ret.append(":");
-            ret.append(subDomain);
-        }
-
-        return ret.toString().toLowerCase();
-    }
-
    /** If possible, try to provide an alias domain,
     * i.e. a domain name that is very likely to link to this one
     * */
--- a/code/common/service/java/nu/marginalia/process/log/WorkLog.java
+++ b/code/common/service/java/nu/marginalia/process/log/WorkLog.java
@@ -10,7 +10,9 @@ import java.nio.charset.StandardCharsets;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.time.LocalDateTime;
-import java.util.*;
+import java.util.HashSet;
+import java.util.Optional;
+import java.util.Set;
 import java.util.function.Function;

 /** WorkLog is a journal of work done by a process,
@@ -61,6 +63,12 @@ public class WorkLog implements AutoCloseable, Closeable {
        return new WorkLoadIterable<>(logFile, mapper);
    }

+    public static int countEntries(Path crawlerLog) throws IOException{
+        try (var linesStream = Files.lines(crawlerLog)) {
+            return (int) linesStream.filter(WorkLogEntry::isJobId).count();
+        }
+    }
+
    // Use synchro over concurrent set to avoid competing writes
    // - correct is better than fast here, it's sketchy enough to use
    // a PrintWriter
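
Note on the hunk above: `countEntries` streams the work log once and counts the lines accepted by `WorkLogEntry::isJobId`, so a caller can report progress against a known total. The sketch below shows the same pattern in isolation. The body of `isJobId` is not part of this diff, so `looksLikeJobLine` is a hypothetical stand-in that assumes job lines are the non-blank lines not starting with `#`. The class name `WorkLogCountSketch` is likewise invented for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

class WorkLogCountSketch {

    // Hypothetical stand-in for WorkLogEntry::isJobId (assumption: comment
    // lines start with '#', and blank lines are not jobs).
    static boolean looksLikeJobLine(String line) {
        return !line.isBlank() && !line.startsWith("#");
    }

    // Counts job lines; the stream is closed by try-with-resources.
    // Unlike the (int) cast in the diff, toIntExact throws on overflow.
    static int countEntries(Path log) throws IOException {
        try (Stream<String> lines = Files.lines(log)) {
            return Math.toIntExact(lines.filter(WorkLogCountSketch::looksLikeJobLine).count());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("worklog", ".log");
        try {
            // Placeholder content, not the real crawler.log format
            Files.write(tmp, List.of("# comment", "job-a", "", "job-b"));
            System.out.println("job lines: " + countEntries(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```
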
--- a/code/common/service/java/nu/marginalia/service/module/ServiceConfigurationModule.java
+++ b/code/common/service/java/nu/marginalia/service/module/ServiceConfigurationModule.java
@@ -6,6 +6,7 @@ import nu.marginalia.service.ServiceId;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

+import java.io.IOException;
 import java.net.InetAddress;
 import java.net.NetworkInterface;
 import java.util.Enumeration;
@@ -115,11 +116,12 @@ public class ServiceConfigurationModule extends AbstractModule {
        }
    }

-    public static String getLocalNetworkIP() throws Exception {
+    public static String getLocalNetworkIP() throws IOException {
        Enumeration<NetworkInterface> nets = NetworkInterface.getNetworkInterfaces();

        while (nets.hasMoreElements()) {
            NetworkInterface netif = nets.nextElement();
+            logger.info("Considering network interface {}:  Up? {},  Loopback? {}", netif.getDisplayName(), netif.isUp(), netif.isLoopback());
            if (!netif.isUp() || netif.isLoopback()) {
                continue;
            }
@@ -127,6 +129,7 @@ public class ServiceConfigurationModule extends AbstractModule {
            Enumeration<InetAddress> inetAddresses = netif.getInetAddresses();
            while (inetAddresses.hasMoreElements()) {
                InetAddress addr = inetAddresses.nextElement();
+                logger.info("Considering address {}: SiteLocal? {}, Loopback? {}", addr.getHostAddress(), addr.isSiteLocalAddress(), addr.isLoopbackAddress());
                if (addr.isSiteLocalAddress() && !addr.isLoopbackAddress()) {
                    return addr.getHostAddress();
                }
--- a/code/common/service/java/nu/marginalia/service/server/JoobyService.java
+++ b/code/common/service/java/nu/marginalia/service/server/JoobyService.java
@@ -15,6 +15,7 @@ import org.slf4j.LoggerFactory;
 import org.slf4j.Marker;
 import org.slf4j.MarkerFactory;

+import java.nio.file.Files;
 import java.nio.file.Path;
 import java.nio.file.Paths;
 import java.util.List;
@@ -106,9 +107,12 @@ public class JoobyService {
                config.externalAddress());

        // FIXME:  This won't work outside of docker, may need to submit a PR to jooby to allow classpaths here
-        jooby.install(new JteModule(Path.of("/app/resources/jte"), Path.of("/app/classes/jte-precompiled")));
-        jooby.assets("/*", Paths.get("/app/resources/static"));
-
+        if (Files.exists(Path.of("/app/resources/jte")) || Files.exists(Path.of("/app/classes/jte-precompiled"))) {
+            jooby.install(new JteModule(Path.of("/app/resources/jte"), Path.of("/app/classes/jte-precompiled")));
+        }
+        if (Files.exists(Path.of("/app/resources/static"))) {
+            jooby.assets("/*", Paths.get("/app/resources/static"));
+        }
        var options = new ServerOptions();
        options.setHost(config.bindAddress());
        options.setPort(restEndpoint.port());
--- a/code/common/service/java/nu/marginalia/service/server/MetricsServer.java
+++ b/code/common/service/java/nu/marginalia/service/server/MetricsServer.java
@@ -6,25 +6,36 @@ import nu.marginalia.service.module.ServiceConfiguration;
 import org.eclipse.jetty.server.Server;
 import org.eclipse.jetty.servlet.ServletContextHandler;
 import org.eclipse.jetty.servlet.ServletHolder;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;

 import java.net.InetSocketAddress;

 public class MetricsServer {

+    private static final Logger logger = LoggerFactory.getLogger(MetricsServer.class);
+
    @Inject
-    public MetricsServer(ServiceConfiguration configuration) throws Exception {
+    public MetricsServer(ServiceConfiguration configuration) {
        // If less than zero, we forego setting up a metrics server
        if (configuration.metricsPort() < 0)
            return;

-        Server server = new Server(new InetSocketAddress(configuration.bindAddress(), configuration.metricsPort()));
+        try {
+            Server server = new Server(new InetSocketAddress(configuration.bindAddress(), configuration.metricsPort()));

-        ServletContextHandler context = new ServletContextHandler();
-        context.setContextPath("/");
-        server.setHandler(context);
+            ServletContextHandler context = new ServletContextHandler();
+            context.setContextPath("/");
+            server.setHandler(context);

-        context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
+            context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");

-        server.start();
+            logger.info("MetricsServer listening on {}:{}", configuration.bindAddress(), configuration.metricsPort());
+
+            server.start();
+        }
+        catch (Exception|NoSuchMethodError ex) {
+            logger.error("Failed to set up metrics server", ex);
+        }
    }
 }
--- a/code/common/service/java/nu/marginalia/service/server/RateLimiter.java
+++ b/code/common/service/java/nu/marginalia/service/server/RateLimiter.java
@@ -35,21 +35,8 @@ public class RateLimiter {
    }


-    public static RateLimiter forExpensiveRequest() {
-        return new RateLimiter(5, 10);
-    }
-
    public static RateLimiter custom(int perMinute) {
-        return new RateLimiter(perMinute, 60);
-    }
-
-    public static RateLimiter forSpamBots() {
-        return new RateLimiter(120, 3600);
-    }
-
-
-    public static RateLimiter forLogin() {
-        return new RateLimiter(3, 15);
+        return new RateLimiter(4 * perMinute, perMinute);
    }

    private void cleanIdleBuckets() {
@@ -62,7 +49,7 @@ public class RateLimiter {
    }

    private Bucket createBucket() {
-        var refill = Refill.greedy(1, Duration.ofSeconds(refillRate));
+        var refill = Refill.greedy(refillRate, Duration.ofSeconds(60));
        var bw = Bandwidth.classic(capacity, refill);
        return Bucket.builder().addLimit(bw).build();
    }
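
Note on the two hunks above: taken together, `custom(perMinute)` now appears to build a bucket that holds up to `4 * perMinute` tokens and refills `perMinute` tokens every 60 seconds. This assumes the constructor takes `(capacity, refillRate)` in that order, which this diff does not show. The following is a plain-Java approximation of that behaviour, not the Bucket4j implementation: it refills continuously using a `double`, starts full, and takes the clock as a parameter so it can be exercised deterministically. All names in it are hypothetical.

```java
class TokenBucketSketch {
    private final long capacity;         // maximum burst size
    private final long refillPerMinute;  // sustained rate
    private double tokens;
    private long lastNanos;

    TokenBucketSketch(long capacity, long refillPerMinute, long nowNanos) {
        this.capacity = capacity;
        this.refillPerMinute = refillPerMinute;
        this.tokens = capacity;          // assumption: the bucket starts full
        this.lastNanos = nowNanos;
    }

    // Mirrors the shape of RateLimiter.custom(perMinute) after this change
    static TokenBucketSketch custom(int perMinute, long nowNanos) {
        return new TokenBucketSketch(4L * perMinute, perMinute, nowNanos);
    }

    // Refill in proportion to elapsed time, cap at capacity, then try to
    // take a single token.
    synchronized boolean tryConsume(long nowNanos) {
        double elapsedMinutes = (nowNanos - lastNanos) / 60e9;
        tokens = Math.min(capacity, tokens + elapsedMinutes * refillPerMinute);
        lastNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long now = System.nanoTime();
        TokenBucketSketch bucket = custom(10, now);
        int allowed = 0;
        for (int i = 0; i < 100; i++) {
            if (bucket.tryConsume(now)) allowed++;
        }
        System.out.println("allowed in initial burst: " + allowed);
    }
}
```

With `perMinute = 10`, a client can burst up to 40 requests and is then held to roughly one request every six seconds until the bucket refills.
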
--- a/code/common/service/resources/log4j2-json.xml
+++ b/code/common/service/resources/log4j2-json.xml
@@ -5,6 +5,7 @@
            <Filters>
                <MarkerFilter marker="QUERY" onMatch="DENY" onMismatch="NEUTRAL" />
                <MarkerFilter marker="HTTP" onMatch="DENY" onMismatch="NEUTRAL" />
+                <MarkerFilter marker="CRAWLER" onMatch="DENY" onMismatch="NEUTRAL" />
            </Filters>
        </Console>
        <RollingFile name="LogToFile" fileName="${env:WMSA_LOG_DIR:-/var/log/wmsa}/wmsa-${sys:service-name}-${env:WMSA_SERVICE_NODE:-0}.log" filePattern="/var/log/wmsa/wmsa-${sys:service-name}-${env:WMSA_SERVICE_NODE:-0}-log-%d{MM-dd-yy-HH-mm-ss}-%i.log.gz"
@@ -13,9 +14,20 @@
            <Filters>
                <MarkerFilter marker="QUERY" onMatch="DENY" onMismatch="NEUTRAL" />
                <MarkerFilter marker="HTTP" onMatch="DENY" onMismatch="NEUTRAL" />
+                <MarkerFilter marker="CRAWLER" onMatch="DENY" onMismatch="NEUTRAL" />
            </Filters>
            <SizeBasedTriggeringPolicy size="10MB" />
        </RollingFile>
+        <RollingFile name="LogToFile" fileName="${env:WMSA_LOG_DIR:-/var/log/wmsa}/crawler-audit-${env:WMSA_SERVICE_NODE:-0}.log" filePattern="/var/log/wmsa/crawler-audit-${env:WMSA_SERVICE_NODE:-0}-log-%d{MM-dd-yy-HH-mm-ss}-%i.log.gz"
+                     ignoreExceptions="false">
+            <PatternLayout>
+                <Pattern>%d{yyyy-MM-dd HH:mm:ss,SSS}: %msg{nolookups}%n</Pattern>
+            </PatternLayout>
+            <SizeBasedTriggeringPolicy size="100MB" />
+            <Filters>
+                <MarkerFilter marker="CRAWLER" onMatch="ALLOW" onMismatch="DENY" />
+            </Filters>
+        </RollingFile>
    </Appenders>
    <Loggers>
        <Logger name="org.apache.zookeeper" level="WARN" />
--- a/code/common/service/resources/log4j2-prod.xml
+++ b/code/common/service/resources/log4j2-prod.xml
@@ -5,6 +5,7 @@
            <Filters>
                <MarkerFilter marker="QUERY" onMatch="DENY" onMismatch="NEUTRAL" />
                <MarkerFilter marker="HTTP" onMatch="DENY" onMismatch="NEUTRAL" />
+                <MarkerFilter marker="CRAWLER" onMatch="DENY" onMismatch="NEUTRAL" />
            </Filters>
        </Console>
        <RollingFile name="LogToFile" fileName="${env:WMSA_LOG_DIR:-/var/log/wmsa}/wmsa-${sys:service-name}-${env:WMSA_SERVICE_NODE:-0}.log" filePattern="/var/log/wmsa/wmsa-${sys:service-name}-${env:WMSA_SERVICE_NODE:-0}-log-%d{MM-dd-yy-HH-mm-ss}-%i.log.gz"
@@ -17,6 +18,17 @@
                <MarkerFilter marker="PROCESS" onMatch="DENY" onMismatch="NEUTRAL" />
                <MarkerFilter marker="QUERY" onMatch="DENY" onMismatch="NEUTRAL" />
                <MarkerFilter marker="HTTP" onMatch="DENY" onMismatch="NEUTRAL" />
+                <MarkerFilter marker="CRAWLER" onMatch="DENY" onMismatch="NEUTRAL" />
+            </Filters>
+        </RollingFile>
+        <RollingFile name="LogToFile" fileName="${env:WMSA_LOG_DIR:-/var/log/wmsa}/crawler-audit-${env:WMSA_SERVICE_NODE:-0}.log" filePattern="/var/log/wmsa/crawler-audit-${env:WMSA_SERVICE_NODE:-0}-log-%d{MM-dd-yy-HH-mm-ss}-%i.log.gz"
+                     ignoreExceptions="false">
+            <PatternLayout>
+                <Pattern>%d{yyyy-MM-dd HH:mm:ss,SSS}: %msg{nolookups}%n</Pattern>
+            </PatternLayout>
+            <SizeBasedTriggeringPolicy size="100MB" />
+            <Filters>
+                <MarkerFilter marker="CRAWLER" onMatch="ALLOW" onMismatch="DENY" />
            </Filters>
        </RollingFile>
    </Appenders>
--- a/code/common/service/test/nu/marginalia/service/discovery/ZkServiceRegistryTest.java
+++ b/code/common/service/test/nu/marginalia/service/discovery/ZkServiceRegistryTest.java
@@ -25,7 +25,7 @@ import static org.mockito.Mockito.when;
 class ZkServiceRegistryTest {
    private static final int ZOOKEEPER_PORT = 2181;
    private static final GenericContainer<?> zookeeper =
-            new GenericContainer<>("zookeeper:3.8.0")
+            new GenericContainer<>("zookeeper:3.8")
                    .withExposedPorts(ZOOKEEPER_PORT);

    List<ZkServiceRegistry> registries = new ArrayList<>();
--- a/code/execution/java/nu/marginalia/actor/ExecutorActor.java
+++ b/code/execution/java/nu/marginalia/actor/ExecutorActor.java
@@ -20,6 +20,7 @@ public enum ExecutorActor {
    EXPORT_FEEDS(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
    EXPORT_SAMPLE_DATA(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
    DOWNLOAD_SAMPLE(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
+    MIGRATE_CRAWL_DATA(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),

    PROC_CONVERTER_SPAWNER(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED, NodeProfile.SIDELOAD),
    PROC_LOADER_SPAWNER(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED, NodeProfile.SIDELOAD),
--- a/code/execution/java/nu/marginalia/actor/ExecutorActorControlService.java
+++ b/code/execution/java/nu/marginalia/actor/ExecutorActorControlService.java
@@ -66,6 +66,7 @@ public class ExecutorActorControlService {
                                       DownloadSampleActor downloadSampleActor,
                                       ScrapeFeedsActor scrapeFeedsActor,
                                       ExecutorActorStateMachines stateMachines,
+                                       MigrateCrawlDataActor migrateCrawlDataActor,
                                       ExportAllPrecessionActor exportAllPrecessionActor,
                                       UpdateRssActor updateRssActor) throws SQLException {
        this.messageQueueFactory = messageQueueFactory;
@@ -107,6 +108,8 @@ public class ExecutorActorControlService {
        register(ExecutorActor.SCRAPE_FEEDS, scrapeFeedsActor);
        register(ExecutorActor.UPDATE_RSS, updateRssActor);

+        register(ExecutorActor.MIGRATE_CRAWL_DATA, migrateCrawlDataActor);
+
        if (serviceConfiguration.node() == 1) {
            register(ExecutorActor.PREC_EXPORT_ALL, exportAllPrecessionActor);
        }
--- a/code/execution/java/nu/marginalia/actor/proc/UpdateRssActor.java
+++ b/code/execution/java/nu/marginalia/actor/proc/UpdateRssActor.java
@@ -14,6 +14,8 @@ import nu.marginalia.mq.persistence.MqPersistence;
 import nu.marginalia.nodecfg.NodeConfigurationService;
 import nu.marginalia.nodecfg.model.NodeProfile;
 import nu.marginalia.service.module.ServiceConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;

 import java.time.Duration;
 import java.time.LocalDateTime;
@@ -29,6 +31,7 @@ public class UpdateRssActor extends RecordActorPrototype {

    private final NodeConfigurationService nodeConfigurationService;
    private final MqPersistence persistence;
+    private static final Logger logger = LoggerFactory.getLogger(UpdateRssActor.class);

    @Inject
    public UpdateRssActor(Gson gson,
@@ -101,8 +104,8 @@ public class UpdateRssActor extends RecordActorPrototype {
            case UpdateRefresh(int count, long msgId) -> {
                MqMessage msg = persistence.waitForMessageTerminalState(msgId, Duration.ofSeconds(10), Duration.ofHours(12));
                if (msg == null) {
-                    // Retry the update
-                    yield new Error("Failed to update feeds: message not found");
+                    logger.warn("UpdateRefresh is taking a very long time");
+                    yield new UpdateRefresh(count, msgId);
                } else if (msg.state() != MqMessageState.OK) {
                    // Retry the update
                    yield new Error("Failed to update feeds: " + msg.state());
@@ -119,8 +122,8 @@ public class UpdateRssActor extends RecordActorPrototype {
            case UpdateClean(long msgId) -> {
                MqMessage msg = persistence.waitForMessageTerminalState(msgId, Duration.ofSeconds(10), Duration.ofHours(12));
                if (msg == null) {
-                    // Retry the update
-                    yield new Error("Failed to update feeds: message not found");
+                    logger.warn("UpdateClean is taking a very long time");
+                    yield new UpdateClean(msgId);
                } else if (msg.state() != MqMessageState.OK) {
                    // Retry the update
                    yield new Error("Failed to update feeds: " + msg.state());
--- a/code/execution/java/nu/marginalia/actor/task/DownloadSampleActor.java
+++ b/code/execution/java/nu/marginalia/actor/task/DownloadSampleActor.java
@@ -8,6 +8,7 @@ import nu.marginalia.actor.state.ActorResumeBehavior;
 import nu.marginalia.actor.state.ActorStep;
 import nu.marginalia.actor.state.Resume;
 import nu.marginalia.service.control.ServiceEventLog;
+import nu.marginalia.service.control.ServiceHeartbeat;
 import nu.marginalia.storage.FileStorageService;
 import nu.marginalia.storage.model.FileStorage;
 import nu.marginalia.storage.model.FileStorageId;
@@ -19,6 +20,7 @@ import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 import java.io.*;
+import java.net.HttpURLConnection;
 import java.net.MalformedURLException;
 import java.net.URI;
 import java.net.URL;
@@ -32,6 +34,7 @@ public class DownloadSampleActor extends RecordActorPrototype {

    private final FileStorageService storageService;
    private final ServiceEventLog eventLog;
+    private final ServiceHeartbeat heartbeat;
    private final Logger logger = LoggerFactory.getLogger(getClass());

    @Resume(behavior = ActorResumeBehavior.ERROR)
@@ -66,15 +69,39 @@ public class DownloadSampleActor extends RecordActorPrototype {

                Files.deleteIfExists(Path.of(tarFileName));

-                try (var is = new BufferedInputStream(new URI(downloadURI).toURL().openStream());
-                     var os = new BufferedOutputStream(Files.newOutputStream(Path.of(tarFileName), StandardOpenOption.CREATE))) {
-                    is.transferTo(os);
+                HttpURLConnection urlConnection = (HttpURLConnection) new URI(downloadURI).toURL().openConnection();
+
+                try (var hb = heartbeat.createServiceAdHocTaskHeartbeat("Downloading sample")) {
+                    long size = urlConnection.getContentLengthLong();
+                    byte[] buffer = new byte[8192];
+
+                    try (var is = new BufferedInputStream(urlConnection.getInputStream());
+                         var os = new BufferedOutputStream(Files.newOutputStream(Path.of(tarFileName), StandardOpenOption.CREATE))) {
+                        long copiedSize = 0;
+
+                        while (copiedSize < size) {
+                            int read = is.read(buffer);
+
+                            if (read < 0) // We've been promised a file of length 'size'
+                                throw new IOException("Unexpected end of stream");
+
+                            os.write(buffer, 0, read);
+                            copiedSize += read;
+
+                            // Update progress bar
+                            hb.progress(String.format("%d MB", copiedSize / 1024 / 1024), (int) (copiedSize / 1024), (int) (size / 1024));
+                        }
+                    }
+
                }
                catch (Exception ex) {
                    eventLog.logEvent(DownloadSampleActor.class, "Error downloading sample");
                    logger.error("Error downloading sample", ex);
                    yield new Error();
                }
+                finally {
+                    urlConnection.disconnect();
+                }

                eventLog.logEvent(DownloadSampleActor.class, "Download complete");
                yield new Extract(fileStorageId, tarFileName);
@@ -170,11 +197,12 @@ public class DownloadSampleActor extends RecordActorPrototype {
    @Inject
    public DownloadSampleActor(Gson gson,
                               FileStorageService storageService,
-                               ServiceEventLog eventLog)
+                               ServiceEventLog eventLog, ServiceHeartbeat heartbeat)
    {
        super(gson);
        this.storageService = storageService;
        this.eventLog = eventLog;
+        this.heartbeat = heartbeat;
    }

 }
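
Note on the download loop above: it copies exactly `size` bytes, where `size` comes from `getContentLengthLong()`. That method returns -1 when the server sends no content length, in which case `while (copiedSize < size)` would not run at all and the output file would stay empty. The sketch below is one possible variant, not the code in this PR: it reads to end of stream, reports progress after every chunk, and only checks the byte count when a length was advertised. `ProgressCopySketch` and its `copy` method are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.function.LongConsumer;

class ProgressCopySketch {

    // Copies in to out, calling onProgress with the running byte count.
    // expectedSize < 0 means "unknown"; the length check is then skipped.
    static long copy(InputStream in, OutputStream out, long expectedSize,
                     LongConsumer onProgress) throws IOException {
        byte[] buffer = new byte[8192];
        long copied = 0;
        int read;
        while ((read = in.read(buffer)) >= 0) {
            out.write(buffer, 0, read);
            copied += read;
            onProgress.accept(copied);
        }
        if (expectedSize >= 0 && copied != expectedSize) {
            throw new IOException("Expected " + expectedSize + " bytes but copied " + copied);
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[20_000];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long n = copy(new ByteArrayInputStream(data), out, data.length,
                copied -> System.out.println(copied / 1024 + " KB"));
        System.out.println("copied " + n + " bytes");
    }
}
```
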
--- a/code/execution/java/nu/marginalia/actor/task/MigrateCrawlDataActor.java
+++ b/code/execution/java/nu/marginalia/actor/task/MigrateCrawlDataActor.java
@@ -0,0 +1,150 @@
+package nu.marginalia.actor.task;
+
+import com.google.gson.Gson;
+import jakarta.inject.Inject;
+import jakarta.inject.Singleton;
+import nu.marginalia.actor.prototype.RecordActorPrototype;
+import nu.marginalia.actor.state.ActorStep;
+import nu.marginalia.io.CrawlerOutputFile;
+import nu.marginalia.process.log.WorkLog;
+import nu.marginalia.process.log.WorkLogEntry;
+import nu.marginalia.service.control.ServiceHeartbeat;
+import nu.marginalia.slop.SlopCrawlDataRecord;
+import nu.marginalia.storage.FileStorageService;
+import nu.marginalia.storage.model.FileStorage;
+import nu.marginalia.storage.model.FileStorageId;
+import org.apache.logging.log4j.util.Strings;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardCopyOption;
+import java.util.Map;
+import java.util.Optional;
+import java.util.function.Function;
+
+@Singleton
+public class MigrateCrawlDataActor extends RecordActorPrototype {
+
+    private final FileStorageService fileStorageService;
+    private final ServiceHeartbeat serviceHeartbeat;
+    private static final Logger logger = LoggerFactory.getLogger(MigrateCrawlDataActor.class);
+
+    @Inject
+    public MigrateCrawlDataActor(Gson gson, FileStorageService fileStorageService, ServiceHeartbeat serviceHeartbeat) {
+        super(gson);
+
+        this.fileStorageService = fileStorageService;
+        this.serviceHeartbeat = serviceHeartbeat;
+    }
+
+    public record Run(long fileStorageId) implements ActorStep {}
+
+    @Override
+    public ActorStep transition(ActorStep self) throws Exception {
+        return switch (self) {
+            case Run(long fileStorageId) -> {
+
+                FileStorage storage = fileStorageService.getStorage(FileStorageId.of(fileStorageId));
+                Path root = storage.asPath();
+
+                Path crawlerLog = root.resolve("crawler.log");
+                Path newCrawlerLog = Files.createTempFile(root, "crawler", ".migrate.log");
+
+                int totalEntries = WorkLog.countEntries(crawlerLog);
+
+                try (WorkLog workLog = new WorkLog(newCrawlerLog);
+                     var heartbeat = serviceHeartbeat.createServiceAdHocTaskHeartbeat("Migrating")
+                ) {
+                    int entryIdx = 0;
+
+                    for (Map.Entry<WorkLogEntry, Path> item : WorkLog.iterableMap(crawlerLog, new CrawlDataLocator(root))) {
+
+                        final WorkLogEntry entry = item.getKey();
+                        final Path inputPath = item.getValue();
+
+                        Path outputPath = inputPath;
+                        heartbeat.progress("Migrating " + inputPath.getFileName(), entryIdx++, totalEntries);
+
+                        if (inputPath.toString().endsWith(".parquet")) {
+                            String domain = entry.id();
+                            String id = Integer.toHexString(domain.hashCode());
+
+                            outputPath = CrawlerOutputFile.createSlopPath(root, id, domain);
+
+                            if (Files.exists(inputPath)) {
+                                try {
+                                    SlopCrawlDataRecord.convertFromParquet(inputPath, outputPath);
+                                    Files.deleteIfExists(inputPath);
+                                } catch (Exception ex) {
+                                    outputPath = inputPath; // don't update the work log on error
+                                    logger.error("Failed to convert {}", inputPath, ex);
+                                }
+                            }
+                            else if (!Files.exists(outputPath)) {
+                                // if the input file is missing, and the output file is missing, we just write the log
+                                // record identical to the old one
+                                outputPath = inputPath;
+                            }
+                        }
+
+                        // Write a log entry for the (possibly) converted file
+                        workLog.setJobToFinished(entry.id(), outputPath.toString(), entry.cnt());
+                    }
+                }
+
+                Path oldCrawlerLog = Files.createTempFile(root, "crawler-", ".migrate.old.log");
+                Files.move(crawlerLog, oldCrawlerLog, StandardCopyOption.REPLACE_EXISTING);
+                Files.move(newCrawlerLog, crawlerLog);
+
+                yield new End();
+            }
+            default -> new Error();
+        };
+    }
+
+    private static class CrawlDataLocator implements Function<WorkLogEntry, Optional<Map.Entry<WorkLogEntry, Path>>> {
+
+        private final Path crawlRootDir;
+
+        CrawlDataLocator(Path crawlRootDir) {
+            this.crawlRootDir = crawlRootDir;
+        }
+
+        @Override
+        public Optional<Map.Entry<WorkLogEntry, Path>> apply(WorkLogEntry entry) {
+            var path = getCrawledFilePath(crawlRootDir, entry.path());
+
+            if (!Files.exists(path)) {
+                return Optional.empty();
+            }
+
+            try {
+                return Optional.of(Map.entry(entry, path));
+            }
+            catch (Exception ex) {
+                return Optional.empty();
+            }
+        }
+
+        private Path getCrawledFilePath(Path crawlDir, String fileName) {
+            int sp = fileName.lastIndexOf('/');
+
+            // Normalize the filename
+            if (sp >= 0 && sp + 1 < fileName.length())
+                fileName = fileName.substring(sp + 1);
+            if (fileName.length() < 4)
+                fileName = Strings.repeat("0", 4 - fileName.length()) + fileName;
+
+            String sp1 = fileName.substring(0, 2);
+            String sp2 = fileName.substring(2, 4);
+            return crawlDir.resolve(sp1).resolve(sp2).resolve(fileName);
+        }
+    }
+
+    @Override
+    public String describe() {
+        return "Migrates crawl data to the latest format";
+    }
+}
--- a/code/functions/favicon/api/build.gradle
+++ b/code/functions/favicon/api/build.gradle
@@ -0,0 +1,47 @@
+plugins {
+    id 'java'
+
+    id "com.google.protobuf" version "0.9.4"
+    id 'jvm-test-suite'
+}
+
+java {
+    toolchain {
+        languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
+    }
+}
+
+jar.archiveBaseName = 'favicon-api'
+
+apply from: "$rootProject.projectDir/protobuf.gradle"
+apply from: "$rootProject.projectDir/srcsets.gradle"
+
+dependencies {
+    implementation project(':code:common:model')
+    implementation project(':code:common:config')
+    implementation project(':code:common:service')
+
+    implementation libs.bundles.slf4j
+
+    implementation libs.prometheus
+    implementation libs.notnull
+    implementation libs.guava
+    implementation dependencies.create(libs.guice.get()) {
+        exclude group: 'com.google.guava'
+    }
+    implementation libs.gson
+    implementation libs.bundles.protobuf
+    implementation libs.guava
+    libs.bundles.grpc.get().each {
+        implementation dependencies.create(it) {
+            exclude group: 'com.google.guava'
+        }
+    }
+
+
+
+    testImplementation libs.bundles.slf4j.test
+    testImplementation libs.bundles.junit
+    testImplementation libs.mockito
+
+}
--- a/code/functions/favicon/api/java/nu/marginalia/api/favicon/FaviconClient.java
+++ b/code/functions/favicon/api/java/nu/marginalia/api/favicon/FaviconClient.java
@@ -0,0 +1,39 @@
+package nu.marginalia.api.favicon;
+
+import com.google.inject.Inject;
+import nu.marginalia.service.client.GrpcChannelPoolFactory;
+import nu.marginalia.service.client.GrpcMultiNodeChannelPool;
+import nu.marginalia.service.discovery.property.ServiceKey;
+import nu.marginalia.service.discovery.property.ServicePartition;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Optional;
+
+public class FaviconClient {
+    private static final Logger logger = LoggerFactory.getLogger(FaviconClient.class);
+
+    private final GrpcMultiNodeChannelPool<FaviconAPIGrpc.FaviconAPIBlockingStub> channelPool;
+
+    @Inject
+    public FaviconClient(GrpcChannelPoolFactory factory) {
+        this.channelPool = factory.createMulti(
+                ServiceKey.forGrpcApi(FaviconAPIGrpc.class, ServicePartition.multi()),
+                FaviconAPIGrpc::newBlockingStub);
+    }
+
+    public record FaviconData(byte[] bytes, String contentType) {}
+
+
+    public Optional<FaviconData> getFavicon(String domain, int node) {
+        RpcFaviconResponse rsp = channelPool.call(FaviconAPIGrpc.FaviconAPIBlockingStub::getFavicon)
+                .forNode(node)
+                .run(RpcFaviconRequest.newBuilder().setDomain(domain).build());
+
+        if (rsp.getData().isEmpty())
+            return Optional.empty();
+
+        return Optional.of(new FaviconData(rsp.getData().toByteArray(), rsp.getContentType()));
+    }
+
+}
--- a/code/functions/favicon/api/src/main/protobuf/favicon.proto
+++ b/code/functions/favicon/api/src/main/protobuf/favicon.proto
@@ -0,0 +1,20 @@
+syntax="proto3";
+package marginalia.api.favicon;
+
+option java_package="nu.marginalia.api.favicon";
+option java_multiple_files=true;
+
+service FaviconAPI {
+  /** Fetches the favicon for a domain. */
+  rpc getFavicon(RpcFaviconRequest) returns (RpcFaviconResponse) {}
+}
+
+message RpcFaviconRequest {
+  string domain = 1;
+}
+
+message RpcFaviconResponse {
+  string domain = 1;
+  bytes data = 2;
+  string contentType = 3;
+}
--- a/code/functions/favicon/build.gradle
+++ b/code/functions/favicon/build.gradle
@@ -0,0 +1,49 @@
+plugins {
+    id 'java'
+
+    id 'application'
+    id 'jvm-test-suite'
+}
+
+java {
+    toolchain {
+        languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
+    }
+}
+
+apply from: "$rootProject.projectDir/srcsets.gradle"
+
+dependencies {
+    implementation project(':code:common:config')
+    implementation project(':code:common:service')
+    implementation project(':code:common:model')
+    implementation project(':code:common:db')
+    implementation project(':code:functions:favicon:api')
+    implementation project(':code:processes:crawling-process')
+
+    implementation libs.bundles.slf4j
+
+    implementation libs.prometheus
+    implementation libs.guava
+    libs.bundles.grpc.get().each {
+        implementation dependencies.create(it) {
+            exclude group: 'com.google.guava'
+        }
+    }
+
+
+    implementation libs.notnull
+    implementation libs.guava
+    implementation dependencies.create(libs.guice.get()) {
+        exclude group: 'com.google.guava'
+    }
+    implementation dependencies.create(libs.spark.get()) {
+        exclude group: 'org.eclipse.jetty'
+    }
+
+    testImplementation libs.bundles.slf4j.test
+    testImplementation libs.bundles.junit
+    testImplementation libs.mockito
+
+
+}
--- a/code/functions/favicon/java/nu/marginalia/functions/favicon/FaviconGrpcService.java
+++ b/code/functions/favicon/java/nu/marginalia/functions/favicon/FaviconGrpcService.java
@@ -0,0 +1,48 @@
+package nu.marginalia.functions.favicon;
+
+import com.google.inject.Inject;
+import com.google.inject.Singleton;
+import com.google.protobuf.ByteString;
+import io.grpc.stub.StreamObserver;
+import nu.marginalia.api.favicon.FaviconAPIGrpc;
+import nu.marginalia.api.favicon.RpcFaviconRequest;
+import nu.marginalia.api.favicon.RpcFaviconResponse;
+import nu.marginalia.crawl.DomainStateDb;
+import nu.marginalia.service.server.DiscoverableService;
+
+import java.util.Optional;
+
+@Singleton
+public class FaviconGrpcService extends FaviconAPIGrpc.FaviconAPIImplBase implements DiscoverableService {
+    private final DomainStateDb domainStateDb;
+
+    @Inject
+    public FaviconGrpcService(DomainStateDb domainStateDb) {
+        this.domainStateDb = domainStateDb;
+    }
+
+    public boolean shouldRegisterService() {
+        return domainStateDb.isAvailable();
+    }
+
+    @Override
+    public void getFavicon(RpcFaviconRequest request, StreamObserver<RpcFaviconResponse> responseObserver) {
+        Optional<DomainStateDb.FaviconRecord> icon = domainStateDb.getIcon(request.getDomain());
+
+        RpcFaviconResponse response;
+        if (icon.isEmpty()) {
+            response = RpcFaviconResponse.newBuilder().build();
+        }
+        else {
+            var iconRecord = icon.get();
+            response = RpcFaviconResponse.newBuilder()
+                            .setContentType(iconRecord.contentType())
+                            .setDomain(request.getDomain())
+                            .setData(ByteString.copyFrom(iconRecord.imageData()))
+                            .build();
+        }
+
+        responseObserver.onNext(response);
+        responseObserver.onCompleted();
+    }
+}
--- a/code/functions/live-capture/build.gradle
+++ b/code/functions/live-capture/build.gradle
@@ -34,6 +34,7 @@ dependencies {
    implementation libs.bundles.slf4j
    implementation libs.commons.lang3
    implementation libs.commons.io
+    testImplementation libs.wiremock

    implementation libs.prometheus
    implementation libs.guava
--- a/code/functions/live-capture/java/nu/marginalia/livecapture/BrowserlessClient.java
+++ b/code/functions/live-capture/java/nu/marginalia/livecapture/BrowserlessClient.java
@@ -1,6 +1,7 @@
 package nu.marginalia.livecapture;

 import com.google.gson.Gson;
+import nu.marginalia.WmsaHome;
 import nu.marginalia.model.gson.GsonFactory;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -12,6 +13,7 @@ import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.time.Duration;
 import java.util.Map;
+import java.util.Optional;

 /** Client for local browserless.io API */
 public class BrowserlessClient implements AutoCloseable {
@@ -27,13 +29,16 @@ public class BrowserlessClient implements AutoCloseable {
    private final URI browserlessURI;
    private final Gson gson = GsonFactory.get();

+    private final String userAgent = WmsaHome.getUserAgent().uaString();
+
    public BrowserlessClient(URI browserlessURI) {
        this.browserlessURI = browserlessURI;
    }

-    public String content(String url, GotoOptions gotoOptions) throws IOException, InterruptedException {
+    public Optional<String> content(String url, GotoOptions gotoOptions) throws IOException, InterruptedException {
        Map<String, Object> requestData = Map.of(
                "url", url,
+                "userAgent", userAgent,
                "gotoOptions", gotoOptions
        );

@@ -49,10 +54,10 @@ public class BrowserlessClient implements AutoCloseable {

        if (rsp.statusCode() >= 300) {
            logger.info("Failed to fetch content for {}, status {}", url, rsp.statusCode());
-            return null;
+            return Optional.empty();
        }

-        return rsp.body();
+        return Optional.of(rsp.body());
    }

    public byte[] screenshot(String url, GotoOptions gotoOptions, ScreenshotOptions screenshotOptions)
@@ -60,6 +65,7 @@ public class BrowserlessClient implements AutoCloseable {

        Map<String, Object> requestData = Map.of(
                "url", url,
+                "userAgent", userAgent,
                "options", screenshotOptions,
                "gotoOptions", gotoOptions
        );
@@ -84,7 +90,7 @@ public class BrowserlessClient implements AutoCloseable {
    }

    @Override
-    public void close() throws Exception {
+    public void close() {
        httpClient.shutdownNow();
    }

--- a/code/functions/live-capture/java/nu/marginalia/rss/svc/FeedFetcherService.java
+++ b/code/functions/live-capture/java/nu/marginalia/rss/svc/FeedFetcherService.java
@@ -33,6 +33,7 @@ import java.sql.SQLException;
 import java.time.*;
 import java.time.format.DateTimeFormatter;
 import java.util.*;
+import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicInteger;
@@ -71,7 +72,7 @@ public class FeedFetcherService {
    public enum UpdateMode {
        CLEAN,
        REFRESH
-    };
+    }

    public void updateFeeds(UpdateMode updateMode) throws IOException {
        if (updating) // Prevent concurrent updates
@@ -87,6 +88,7 @@ public class FeedFetcherService {
                .followRedirects(HttpClient.Redirect.NORMAL)
                .version(HttpClient.Version.HTTP_2)
                .build();
+             ExecutorService fetchExecutor = Executors.newCachedThreadPool();
             FeedJournal feedJournal = FeedJournal.create();
             var heartbeat = serviceHeartbeat.createServiceAdHocTaskHeartbeat("Update Rss Feeds")
        ) {
@@ -131,7 +133,7 @@ public class FeedFetcherService {

                        FetchResult feedData;
                        try (DomainLocks.DomainLock domainLock = domainLocks.lockDomain(new EdgeDomain(feed.domain()))) {
-                            feedData = fetchFeedData(feed, client, ifModifiedSinceDate, ifNoneMatchTag);
+                            feedData = fetchFeedData(feed, client, fetchExecutor, ifModifiedSinceDate, ifNoneMatchTag);
                        } catch (Exception ex) {
                            feedData = new FetchResult.TransientError();
                        }
@@ -211,6 +213,7 @@ public class FeedFetcherService {

    private FetchResult fetchFeedData(FeedDefinition feed,
                                      HttpClient client,
+                                      ExecutorService executorService,
                                      @Nullable String ifModifiedSinceDate,
                                      @Nullable String ifNoneMatchTag)
    {
@@ -237,7 +240,14 @@ public class FeedFetcherService {
            HttpRequest getRequest = requestBuilder.build();

            for (int i = 0; i < 3; i++) {
-                HttpResponse<byte[]> rs = client.send(getRequest, HttpResponse.BodyHandlers.ofByteArray());
+
+                /* Note: we need an executor to time-limit HttpClient's send() method, as its
+                 * timeout support only covers the time until the response starts to arrive,
+                 * and does not catch the case where the server starts sending data but then hangs.
+                 */
+                HttpResponse<byte[]> rs = executorService.submit(
+                        () -> client.send(getRequest, HttpResponse.BodyHandlers.ofByteArray()))
+                                .get(15, TimeUnit.SECONDS);

                if (rs.statusCode() == 429) { // Too Many Requests
                    int retryAfter = Integer.parseInt(rs.headers().firstValue("Retry-After").orElse("2"));
--- a/code/functions/live-capture/test/nu/marginalia/livecapture/BrowserlessClientTest.java
+++ b/code/functions/live-capture/test/nu/marginalia/livecapture/BrowserlessClientTest.java
@@ -1,5 +1,9 @@
 package nu.marginalia.livecapture;

+import com.github.tomakehurst.wiremock.WireMockServer;
+import com.github.tomakehurst.wiremock.core.WireMockConfiguration;
+import nu.marginalia.WmsaHome;
+import nu.marginalia.service.module.ServiceConfigurationModule;
 import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.Tag;
@@ -8,34 +12,86 @@ import org.testcontainers.containers.GenericContainer;
 import org.testcontainers.junit.jupiter.Testcontainers;
 import org.testcontainers.utility.DockerImageName;

+import java.io.IOException;
 import java.net.URI;
 import java.util.Map;

+import static com.github.tomakehurst.wiremock.client.WireMock.*;
+
+
 @Testcontainers
 @Tag("slow")
 public class BrowserlessClientTest {
    static GenericContainer<?> container = new GenericContainer<>(DockerImageName.parse("browserless/chrome"))
            .withEnv(Map.of("TOKEN", "BROWSERLESS_TOKEN"))
+            .withNetworkMode("bridge")
            .withExposedPorts(3000);

+    static WireMockServer wireMockServer =
+            new WireMockServer(WireMockConfiguration.wireMockConfig()
+                    .port(18089));
+
+    static String localIp;
+
+    static URI browserlessURI;
+
    @BeforeAll
-    public static void setup() {
+    public static void setup() throws IOException {
        container.start();
+
+        browserlessURI = URI.create(String.format("http://%s:%d/",
+                container.getHost(),
+                container.getMappedPort(3000))
+        );
+
+        wireMockServer.start();
+        wireMockServer.stubFor(get("/").willReturn(aResponse().withStatus(200).withBody("Ok")));
+
+        localIp = ServiceConfigurationModule.getLocalNetworkIP();
+
+    }
+
+    @Tag("flaky")
+    @Test
+    public void testInspectContentUA__Flaky() throws Exception {
+        try (var client = new BrowserlessClient(browserlessURI)) {
+            client.content("http://" + localIp + ":18089/",
+                    BrowserlessClient.GotoOptions.defaultValues()
+            );
+        }
+
+        wireMockServer.verify(getRequestedFor(urlEqualTo("/")).withHeader("User-Agent", equalTo(WmsaHome.getUserAgent().uaString())));
+    }
+
+    @Tag("flaky")
+    @Test
+    public void testInspectScreenshotUA__Flaky() throws Exception {
+        try (var client = new BrowserlessClient(browserlessURI)) {
+            client.screenshot("http://" + localIp + ":18089/",
+                    BrowserlessClient.GotoOptions.defaultValues(),
+                    BrowserlessClient.ScreenshotOptions.defaultValues()
+            );
+        }
+
+        wireMockServer.verify(getRequestedFor(urlEqualTo("/")).withHeader("User-Agent", equalTo(WmsaHome.getUserAgent().uaString())));
    }

    @Test
    public void testContent() throws Exception {
-        try (var client = new BrowserlessClient(URI.create("http://" + container.getHost() + ":" + container.getMappedPort(3000)))) {
-            var content = client.content("https://www.marginalia.nu/", BrowserlessClient.GotoOptions.defaultValues());
-            Assertions.assertNotNull(content, "Content should not be null");
+        try (var client = new BrowserlessClient(browserlessURI)) {
+            var content = client.content("https://www.marginalia.nu/", BrowserlessClient.GotoOptions.defaultValues()).orElseThrow();
+
            Assertions.assertFalse(content.isBlank(), "Content should not be empty");
        }
    }

    @Test
    public void testScreenshot() throws Exception {
-        try (var client = new BrowserlessClient(URI.create("http://" + container.getHost() + ":" + container.getMappedPort(3000)))) {
-            var screenshot = client.screenshot("https://www.marginalia.nu/", BrowserlessClient.GotoOptions.defaultValues(), BrowserlessClient.ScreenshotOptions.defaultValues());
+        try (var client = new BrowserlessClient(browserlessURI)) {
+            var screenshot = client.screenshot("https://www.marginalia.nu/",
+                    BrowserlessClient.GotoOptions.defaultValues(),
+                    BrowserlessClient.ScreenshotOptions.defaultValues());
+
            Assertions.assertNotNull(screenshot, "Screenshot should not be null");
        }
    }
--- a/code/functions/search-query/java/nu/marginalia/functions/searchquery/query_parser/QueryExpansion.java
+++ b/code/functions/search-query/java/nu/marginalia/functions/searchquery/query_parser/QueryExpansion.java
@@ -134,6 +134,10 @@ public class QueryExpansion {
                if (scoreCombo > scoreA + scoreB || scoreCombo > 1000) {
                    graph.addVariantForSpan(prev, qw, joinedWord);
                }
+                else if (StringUtils.isAlpha(prev.word()) && StringUtils.isNumeric(qw.word())) { // join e.g. trs 80 to trs80 and trs-80
+                    graph.addVariantForSpan(prev, qw, prev.word() + qw.word());
+                    graph.addVariantForSpan(prev, qw, prev.word() + "-" + qw.word());
+                }
            }

            prev = qw;
--- a/code/functions/search-query/test/nu/marginalia/query/svc/QueryFactoryTest.java
+++ b/code/functions/search-query/test/nu/marginalia/query/svc/QueryFactoryTest.java
@@ -213,6 +213,18 @@ public class QueryFactoryTest {
        System.out.println(subquery);
    }

+
+    @Test
+    public void testContractionWordNum() {
+        var subquery = parseAndGetSpecs("glove 80");
+
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" glove "));
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" 80 "));
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" glove-80 "));
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" glove80 "));
+    }
+
+
    @Test
    public void testCplusPlus() {
        var subquery = parseAndGetSpecs("std::vector::push_back vector");
--- a/code/index/java/nu/marginalia/index/results/DomainRankingOverrides.java
+++ b/code/index/java/nu/marginalia/index/results/DomainRankingOverrides.java
@@ -0,0 +1,119 @@
+package nu.marginalia.index.results;
+
+import com.google.inject.Inject;
+import com.google.inject.Singleton;
+import gnu.trove.map.hash.TIntDoubleHashMap;
+import nu.marginalia.WmsaHome;
+import nu.marginalia.db.DbDomainQueries;
+import nu.marginalia.model.EdgeDomain;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.List;
+import java.util.OptionalInt;
+import java.util.concurrent.TimeUnit;
+
+@Singleton
+public class DomainRankingOverrides {
+    private final DbDomainQueries domainQueries;
+
+    private volatile TIntDoubleHashMap rankingFactors = new TIntDoubleHashMap(100, 0.75f, -1, 1.);
+
+    private static final Logger logger = LoggerFactory.getLogger(DomainRankingOverrides.class);
+
+    private final Path overrideFilePath;
+
+    @Inject
+    public DomainRankingOverrides(DbDomainQueries domainQueries) {
+        this.domainQueries = domainQueries;
+
+        overrideFilePath = WmsaHome.getDataPath().resolve("domain-ranking-factors.txt");
+
+        Thread.ofPlatform().start(this::updateRunner);
+    }
+
+    // for test access
+    public DomainRankingOverrides(DbDomainQueries domainQueries, Path overrideFilePath)
+    {
+        this.domainQueries = domainQueries;
+        this.overrideFilePath = overrideFilePath;
+    }
+
+
+    public double getRankingFactor(int domainId) {
+        return rankingFactors.get(domainId);
+    }
+
+    private void updateRunner() {
+        for (;;) {
+            reloadFile();
+
+            try {
+                TimeUnit.MINUTES.sleep(5);
+            } catch (InterruptedException ex) {
+                logger.warn("Thread interrupted", ex);
+                break;
+            }
+        }
+    }
+
+    void reloadFile() {
+        if (!Files.exists(overrideFilePath)) {
+            return;
+        }
+
+        try {
+            List<String> lines = Files.readAllLines(overrideFilePath);
+
+            double factor = 1.;
+
+            var newRankingFactors = new TIntDoubleHashMap(lines.size(), 0.75f, -1, 1.);
+
+            for (var line : lines) {
+                if (line.isBlank()) continue;
+                if (line.startsWith("#")) continue;
+
+                String[] parts = line.split("\\s+");
+                if (parts.length != 2) {
+                    logger.warn("Unrecognized format for domain overrides file: {}", line);
+                    continue;
+                }
+
+                try {
+                    switch (parts[0]) {
+                        case "value" -> {
+                            // a malformed number throws here and is handled by the catch below
+                            factor = Double.parseDouble(parts[1]);
+                            if (factor < 0) {
+                                logger.error("Negative values are not permitted, found {}", factor);
+                                factor = 1;
+                            }
+                        }
+                        case "domain" -> {
+                            // any exception thrown here is handled by the catch below
+                            OptionalInt domainId = domainQueries.tryGetDomainId(new EdgeDomain(parts[1]));
+                            if (domainId.isPresent()) {
+                                newRankingFactors.put(domainId.getAsInt(), factor);
+                            }
+                            else {
+                                logger.warn("Unrecognized domain name {}", parts[1]);
+                            }
+                        }
+                        default -> {
+                            logger.warn("Unrecognized format {}", line);
+                        }
+                    }
+                } catch (Exception ex) {
+                    logger.warn("Error in parsing domain overrides file: {} ({})", line, ex.getClass().getSimpleName());
+                }
+            }
+
+            rankingFactors = newRankingFactors;
+        } catch (IOException ex) {
+            logger.error("Failed to read {}", overrideFilePath, ex);
+        }
+    }
+}
--- a/code/index/java/nu/marginalia/index/results/IndexResultRankingService.java
+++ b/code/index/java/nu/marginalia/index/results/IndexResultRankingService.java
@@ -40,13 +40,16 @@ public class IndexResultRankingService {

    private final DocumentDbReader documentDbReader;
    private final StatefulIndex statefulIndex;
+    private final DomainRankingOverrides domainRankingOverrides;

    @Inject
    public IndexResultRankingService(DocumentDbReader documentDbReader,
-                                     StatefulIndex statefulIndex)
+                                     StatefulIndex statefulIndex,
+                                     DomainRankingOverrides domainRankingOverrides)
    {
        this.documentDbReader = documentDbReader;
        this.statefulIndex = statefulIndex;
+        this.domainRankingOverrides = domainRankingOverrides;
    }

    public List<SearchResultItem> rankResults(SearchParameters params,
@@ -57,7 +60,7 @@ public class IndexResultRankingService {
        if (resultIds.isEmpty())
            return List.of();

-        IndexResultScoreCalculator resultRanker = new IndexResultScoreCalculator(statefulIndex, rankingContext, params);
+        IndexResultScoreCalculator resultRanker = new IndexResultScoreCalculator(statefulIndex, domainRankingOverrides, rankingContext, params);

        List<SearchResultItem> results = new ArrayList<>(resultIds.size());

--- a/code/index/java/nu/marginalia/index/results/IndexResultScoreCalculator.java
+++ b/code/index/java/nu/marginalia/index/results/IndexResultScoreCalculator.java
@@ -41,14 +41,17 @@ public class IndexResultScoreCalculator {
    private final CombinedIndexReader index;
    private final QueryParams queryParams;

+    private final DomainRankingOverrides domainRankingOverrides;
    private final ResultRankingContext rankingContext;
    private final CompiledQuery<String> compiledQuery;

    public IndexResultScoreCalculator(StatefulIndex statefulIndex,
+                                      DomainRankingOverrides domainRankingOverrides,
                                      ResultRankingContext rankingContext,
                                      SearchParameters params)
    {
        this.index = statefulIndex.get();
+        this.domainRankingOverrides = domainRankingOverrides;
        this.rankingContext = rankingContext;

        this.queryParams = params.queryParams;
@@ -127,10 +130,10 @@ public class IndexResultScoreCalculator {
                * wordFlagsQuery.root.visit(new TermFlagsGraphVisitor(params.getBm25K(), wordFlagsQuery.data, unorderedMatches.getWeightedCounts(), rankingContext))
                / (Math.sqrt(unorderedMatches.searchableKeywordCount + 1));

+        double rankingAdjustment = domainRankingOverrides.getRankingFactor(UrlIdCodec.getDomainId(combinedId));
+
        double score = normalize(
-                score_firstPosition + score_proximity + score_verbatim
-                        + score_bM25
-                        + score_bFlags,
+                rankingAdjustment * (score_firstPosition + score_proximity + score_verbatim + score_bM25 + score_bFlags),
                -Math.min(0, documentBonus) // The magnitude of documentBonus, if it is negative; otherwise 0
        );

@@ -580,3 +583,4 @@ public class IndexResultScoreCalculator {
    }

 }
+
--- a/code/index/test/nu/marginalia/index/results/DomainRankingOverridesTest.java
+++ b/code/index/test/nu/marginalia/index/results/DomainRankingOverridesTest.java
@@ -0,0 +1,103 @@
+package nu.marginalia.index.results;
+
+import com.zaxxer.hikari.HikariConfig;
+import com.zaxxer.hikari.HikariDataSource;
+import nu.marginalia.db.DbDomainQueries;
+import nu.marginalia.model.EdgeDomain;
+import nu.marginalia.test.TestMigrationLoader;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.parallel.Execution;
+import org.junit.jupiter.api.parallel.ExecutionMode;
+import org.testcontainers.containers.MariaDBContainer;
+import org.testcontainers.junit.jupiter.Container;
+import org.testcontainers.junit.jupiter.Testcontainers;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+import java.sql.SQLException;
+
+@Testcontainers
+@Execution(ExecutionMode.SAME_THREAD)
+@Tag("slow")
+class DomainRankingOverridesTest {
+    @Container
+    static MariaDBContainer<?> mariaDBContainer = new MariaDBContainer<>("mariadb")
+            .withDatabaseName("WMSA_prod")
+            .withUsername("wmsa")
+            .withPassword("wmsa")
+            .withNetworkAliases("mariadb");
+
+    private static DbDomainQueries domainQueries;
+
+    @BeforeAll
+    public static void setup() throws SQLException {
+        HikariConfig config = new HikariConfig();
+        config.setJdbcUrl(mariaDBContainer.getJdbcUrl());
+        config.setUsername("wmsa");
+        config.setPassword("wmsa");
+
+        var dataSource = new HikariDataSource(config);
+
+        TestMigrationLoader.flywayMigration(dataSource);
+
+        try (var conn = dataSource.getConnection();
+             var stmt = conn.createStatement()) {
+            stmt.executeUpdate("DELETE FROM EC_DOMAIN"); // Wipe any old state from other test runs
+
+            stmt.executeUpdate("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('first.example.com', 'example.com', 1)");
+            stmt.executeUpdate("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('second.example.com', 'example.com', 1)");
+            stmt.executeUpdate("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('third.example.com', 'example.com', 1)");
+            stmt.executeUpdate("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('not-added.example.com', 'example.com', 1)");
+        }
+
+        domainQueries = new DbDomainQueries(dataSource);
+
+    }
+
+    @Test
+    public void test() throws IOException {
+
+        Path overridesFile = Files.createTempFile(getClass().getSimpleName(), ".txt");
+        try {
+
+            Files.writeString(overridesFile, """
+                    # A comment
+                    value 0.75
+                    domain first.example.com
+                    domain second.example.com
+                    
+                    value 1.1
+                    domain third.example.com
+                    """,
+                    StandardOpenOption.APPEND);
+
+            var overrides = new DomainRankingOverrides(domainQueries, overridesFile);
+
+            overrides.reloadFile();
+
+            Assertions.assertEquals(0.75, overrides.getRankingFactor(
+                    domainQueries.getDomainId(new EdgeDomain("first.example.com"))
+            ));
+            Assertions.assertEquals(0.75, overrides.getRankingFactor(
+                    domainQueries.getDomainId(new EdgeDomain("second.example.com"))
+            ));
+            Assertions.assertEquals(1.1, overrides.getRankingFactor(
+                    domainQueries.getDomainId(new EdgeDomain("third.example.com"))
+            ));
+            Assertions.assertEquals(1.0, overrides.getRankingFactor(
+                    domainQueries.getDomainId(new EdgeDomain("not-added.example.com"))
+            ));
+            Assertions.assertEquals(1.0, overrides.getRankingFactor(1<<23));
+
+        }
+        finally {
+            Files.deleteIfExists(overridesFile);
+        }
+    }
+
+}
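Note on the overrides file format: the test above only shows the file contents and the expected ranking factors, not the parser. The following is a minimal sketch of how such a file could be read, assuming that each `value` line sets the factor for the `domain` lines that follow it, that `#` starts a comment, and that unlisted domains fall back to 1.0. The class `OverridesFormatSketch` and its `parse` method are hypothetical and are not part of `DomainRankingOverrides`; the sketch also keys by domain name rather than by domain id.

```java
import java.util.HashMap;
import java.util.Map;

public class OverridesFormatSketch {
    /** Hypothetical parser: maps domain name -> ranking factor. */
    static Map<String, Double> parse(String contents) {
        Map<String, Double> factors = new HashMap<>();
        double currentValue = 1.0; // assumed default until the first "value" line

        for (String rawLine : contents.lines().toList()) {
            String line = rawLine.strip();
            if (line.isEmpty() || line.startsWith("#"))
                continue; // skip blank lines and comments

            String[] parts = line.split("\\s+", 2);
            if (parts.length != 2)
                continue; // ignore malformed lines in this sketch

            switch (parts[0]) {
                case "value" -> currentValue = Double.parseDouble(parts[1]);
                case "domain" -> factors.put(parts[1], currentValue);
                default -> { /* unknown directive, ignored */ }
            }
        }
        return factors;
    }

    public static void main(String[] args) {
        Map<String, Double> factors = parse("""
                # A comment
                value 0.75
                domain first.example.com

                value 1.1
                domain third.example.com
                """);
        System.out.println(factors.getOrDefault("first.example.com", 1.0));
        System.out.println(factors.getOrDefault("not-added.example.com", 1.0));
    }
}
```

Looking up unknown domains with `getOrDefault(..., 1.0)` mirrors the neutral factor the test expects for `not-added.example.com`.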
--- a/code/libraries/blocking-thread-pool/java/nu/marginalia/util/SimpleBlockingThreadPool.java
+++ b/code/libraries/blocking-thread-pool/java/nu/marginalia/util/SimpleBlockingThreadPool.java
@@ -23,16 +23,33 @@ public class SimpleBlockingThreadPool {
    private final Logger logger = LoggerFactory.getLogger(SimpleBlockingThreadPool.class);

    public SimpleBlockingThreadPool(String name, int poolSize, int queueSize) {
+        this(name, poolSize, queueSize, ThreadType.PLATFORM);
+    }
+
+    public SimpleBlockingThreadPool(String name, int poolSize, int queueSize, ThreadType threadType) {
        tasks = new ArrayBlockingQueue<>(queueSize);

        for (int i = 0; i < poolSize; i++) {
-            Thread worker = new Thread(this::worker, name  + "[" + i + "]");
-            worker.setDaemon(true);
-            worker.start();
+
+            Thread.Builder threadBuilder = switch (threadType) {
+                case VIRTUAL -> Thread.ofVirtual();
+                case PLATFORM -> Thread.ofPlatform().daemon(true);
+            };
+
+            Thread worker = threadBuilder
+                    .name(name  + "[" + i + "]")
+                    .start(this::worker);
+
            workers.add(worker);
        }

    }
+
+    public enum ThreadType {
+        VIRTUAL,
+        PLATFORM
+    }
+
    public void submit(Task task) throws InterruptedException {
        tasks.put(task);
    }
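Note on the thread builder change: the following standalone sketch isolates the `Thread.Builder` selection used in the new constructor. It assumes Java 21 or later, where `Thread.ofVirtual()` and `Thread.ofPlatform()` are final APIs. The class `ThreadBuilderSketch` and the helper `startWorkers` are illustrative names, not part of `SimpleBlockingThreadPool`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadBuilderSketch {
    enum ThreadType { VIRTUAL, PLATFORM }

    /** Starts poolSize named threads of the requested type, each running body once. */
    static List<Thread> startWorkers(String name, int poolSize, ThreadType threadType, Runnable body) {
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < poolSize; i++) {
            // Virtual threads are always daemon threads; platform threads must opt in.
            Thread.Builder builder = switch (threadType) {
                case VIRTUAL -> Thread.ofVirtual();
                case PLATFORM -> Thread.ofPlatform().daemon(true);
            };
            workers.add(builder.name(name + "[" + i + "]").start(body));
        }
        return workers;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        for (ThreadType type : ThreadType.values()) {
            for (Thread t : startWorkers("demo", 4, type, counter::incrementAndGet)) {
                t.join();
            }
        }
        System.out.println(counter.get()); // 4 virtual + 4 platform runs
    }
}
```

Because both builders share the `Thread.Builder` interface, the naming and start logic stays identical for either thread type.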
--- a/code/libraries/coded-sequence/java/nu/marginalia/sequence/slop/GammaCodedSequenceArrayColumn.java
+++ b/code/libraries/coded-sequence/java/nu/marginalia/sequence/slop/GammaCodedSequenceArrayColumn.java
@@ -45,6 +45,11 @@ public class GammaCodedSequenceArrayColumn extends AbstractObjectColumn<List<Gam
        );
    }

+    @Override
+    public int alignmentSize() {
+        return 1;
+    }
+
    public Reader openUnregistered(URI uri, int page) throws IOException {
        return new Reader(
                dataColumn.openUnregistered(uri, page),
@@ -109,6 +114,11 @@ public class GammaCodedSequenceArrayColumn extends AbstractObjectColumn<List<Gam
            dataReader.skip(toSkip);
        }

+        @Override
+        public boolean isDirect() {
+            return dataReader.isDirect();
+        }
+
        @Override
        public boolean hasRemaining() throws IOException {
            return groupsReader.hasRemaining();
--- a/code/libraries/coded-sequence/java/nu/marginalia/sequence/slop/GammaCodedSequenceColumn.java
+++ b/code/libraries/coded-sequence/java/nu/marginalia/sequence/slop/GammaCodedSequenceColumn.java
@@ -44,6 +44,11 @@ public class GammaCodedSequenceColumn extends AbstractObjectColumn<GammaCodedSeq
        );
    }

+    @Override
+    public int alignmentSize() {
+        return 1;
+    }
+
    public Reader openUnregistered(URI uri, int page) throws IOException {
        return new Reader(
                Storage.reader(uri, this, page, false),
@@ -96,6 +101,11 @@ public class GammaCodedSequenceColumn extends AbstractObjectColumn<GammaCodedSeq
            this.indexReader = indexReader;
        }

+        @Override
+        public boolean isDirect() {
+            return storage.isDirect();
+        }
+
        @Override
        public AbstractColumn<?, ?> columnDesc() {
            return GammaCodedSequenceColumn.this;
--- a/code/libraries/coded-sequence/java/nu/marginalia/sequence/slop/VarintCodedSequenceArrayColumn.java
+++ b/code/libraries/coded-sequence/java/nu/marginalia/sequence/slop/VarintCodedSequenceArrayColumn.java
@@ -45,6 +45,11 @@ public class VarintCodedSequenceArrayColumn extends AbstractObjectColumn<List<Va
        );
    }

+    @Override
+    public int alignmentSize() {
+        return 1;
+    }
+
    public Reader openUnregistered(URI uri, int page) throws IOException {
        return new Reader(
                dataColumn.openUnregistered(uri, page),
@@ -109,6 +114,11 @@ public class VarintCodedSequenceArrayColumn extends AbstractObjectColumn<List<Va
            dataReader.skip(toSkip);
        }

+        @Override
+        public boolean isDirect() {
+            return dataReader.isDirect();
+        }
+
        @Override
        public boolean hasRemaining() throws IOException {
            return groupsReader.hasRemaining();
--- a/code/libraries/coded-sequence/java/nu/marginalia/sequence/slop/VarintCodedSequenceColumn.java
+++ b/code/libraries/coded-sequence/java/nu/marginalia/sequence/slop/VarintCodedSequenceColumn.java
@@ -44,6 +44,11 @@ public class VarintCodedSequenceColumn extends AbstractObjectColumn<VarintCodedS
        );
    }

+    @Override
+    public int alignmentSize() {
+        return 1;
+    }
+
    public Reader openUnregistered(URI uri, int page) throws IOException {
        return new Reader(
                Storage.reader(uri, this, page, false),
@@ -101,6 +106,11 @@ public class VarintCodedSequenceColumn extends AbstractObjectColumn<VarintCodedS
            return VarintCodedSequenceColumn.this;
        }

+        @Override
+        public boolean isDirect() {
+            return storage.isDirect();
+        }
+
        @Override
        public void skip(long positions) throws IOException {
            for (int i = 0; i < positions; i++) {
--- a/code/libraries/language-processing/java/nu/marginalia/language/sentence/SentenceExtractor.java
+++ b/code/libraries/language-processing/java/nu/marginalia/language/sentence/SentenceExtractor.java
@@ -155,8 +155,15 @@ public class SentenceExtractor {
    public List<DocumentSentence> extractSentencesFromString(String text, EnumSet<HtmlTag> htmlTags) {
        String[] sentences;

-        // Normalize spaces
+        // Safety net against malformed-data DoS attacks:
+        // 5+ MB <p>-tags have been found in the wild that break
+        // the sentence extractor, causing it to stall forever.
+        if (text.length() > 50_000) {
+            // 50k chars can hold a small novel, let alone single html tags
+            text = text.substring(0, 50_000);
+        }

+        // Normalize spaces
        text = normalizeSpaces(text);

        // Split into sentences
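Note on the length cap: a plain `substring(0, 50_000)` can cut a surrogate pair in half if the boundary lands inside a supplementary character. The following is a small sketch of a cap that backs off by one char in that case. The class `LengthCapSketch` and the helper `capLength` are hypothetical, and whether the sentence extractor is affected by a dangling surrogate is an assumption that the diff does not settle.

```java
public class LengthCapSketch {
    /** Truncates text to at most maxChars UTF-16 chars without splitting a surrogate pair. */
    static String capLength(String text, int maxChars) {
        if (text.length() <= maxChars)
            return text;

        int end = maxChars;
        // If the last kept char is a high surrogate, its low surrogate would be cut off.
        if (end > 0 && Character.isHighSurrogate(text.charAt(end - 1)))
            end--;

        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String huge = "x".repeat(5_000_000);
        System.out.println(capLength(huge, 50_000).length());
    }
}
```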
--- a/code/libraries/message-queue/java/nu/marginalia/actor/prototype/RecordActorPrototype.java
+++ b/code/libraries/message-queue/java/nu/marginalia/actor/prototype/RecordActorPrototype.java
@@ -5,9 +5,7 @@ import nu.marginalia.actor.state.*;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.List;
+import java.util.*;

 public abstract class RecordActorPrototype implements ActorPrototype {

@@ -118,7 +116,7 @@ public abstract class RecordActorPrototype implements ActorPrototype {
        }

        private String functionName(Class<? extends ActorStep> functionClass) {
-            return functionClass.getSimpleName().toUpperCase();
+            return ActorStep.functionName(functionClass);
        }

        private ActorStep constructState(String message) throws ReflectiveOperationException {
@@ -145,4 +143,43 @@ public abstract class RecordActorPrototype implements ActorPrototype {
        }
    }

+    /** Get a map of JSON prototypes, keyed by function name, for each actor step declared by this actor */
+    @SuppressWarnings("unchecked")
+    public Map<String, String> getMessagePrototypes() {
+        Map<String, String> messagePrototypes = new HashMap<>();
+
+        for (var clazz : getClass().getDeclaredClasses()) {
+            if (!clazz.isRecord() || !ActorStep.class.isAssignableFrom(clazz))
+                continue;
+
+            StringJoiner sj = new StringJoiner(",\n\t", "{\n\t", "\n}");
+
+            renderToJsonPrototype(sj, (Class<? extends Record>) clazz);
+
+            messagePrototypes.put(ActorStep.functionName((Class<? extends ActorStep>) clazz), sj.toString());
+        }
+
+        return messagePrototypes;
+    }
+
+    @SuppressWarnings("unchecked")
+    private void renderToJsonPrototype(StringJoiner sj, Class<? extends Record> recordType) {
+        for (var field : recordType.getDeclaredFields()) {
+            String typeName = field.getType().getSimpleName();
+
+            if ("List".equals(typeName)) {
+                sj.add(String.format("\"%s\": [ ]", field.getName()));
+            }
+            else if (field.getType().isRecord()) {
+                var innerSj = new StringJoiner(",", "{", "}");
+                renderToJsonPrototype(innerSj, (Class<? extends Record>) field.getType());
+                sj.add(String.format("\"%s\": %s", field.getName(), innerSj));
+            }
+            else {
+                sj.add(String.format("\"%s\": \"%s\"", field.getName(), typeName));
+            }
+        }
+
+    }
+
 }
--- a/code/libraries/message-queue/java/nu/marginalia/actor/state/ActorStep.java
+++ b/code/libraries/message-queue/java/nu/marginalia/actor/state/ActorStep.java
@@ -1,3 +1,7 @@
 package nu.marginalia.actor.state;

-public interface ActorStep {}
+public interface ActorStep {
+    static String functionName(Class<? extends ActorStep> type) {
+        return type.getSimpleName().toUpperCase();
+    }
+}
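Note on the JSON prototypes: the following standalone sketch shows the same idea as `renderToJsonPrototype`, rendering a record's fields as a JSON-like template with nested records expanded recursively. It uses `getRecordComponents()` rather than `getDeclaredFields()`, which is a substitution made here for illustration. The records `Inner` and `Crawl` and the class `JsonPrototypeSketch` are made up for the example.

```java
import java.lang.reflect.RecordComponent;
import java.util.List;
import java.util.StringJoiner;

public class JsonPrototypeSketch {
    // Example records, invented for this sketch
    record Inner(int depth) {}
    record Crawl(String domain, List<String> urls, Inner options) {}

    /** Renders a JSON-like prototype where each value is the component's type name. */
    static String render(Class<? extends Record> recordType) {
        StringJoiner sj = new StringJoiner(", ", "{ ", " }");
        for (RecordComponent component : recordType.getRecordComponents()) {
            Class<?> type = component.getType();
            if (List.class.isAssignableFrom(type)) {
                sj.add("\"" + component.getName() + "\": [ ]");
            } else if (type.isRecord()) {
                // Recurse into the nested record and embed its rendering
                sj.add("\"" + component.getName() + "\": " + render(type.asSubclass(Record.class)));
            } else {
                sj.add("\"" + component.getName() + "\": \"" + type.getSimpleName() + "\"");
            }
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(Crawl.class));
        // prints: { "domain": "String", "urls": [ ], "options": { "depth": "int" } }
    }
}
```

Returning the rendered string from the recursive call, instead of passing joiners around, makes it harder to embed the wrong joiner in the output.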
--- a/code/processes/converting-process/build.gradle
+++ b/code/processes/converting-process/build.gradle
@@ -87,6 +87,8 @@ dependencies {
    implementation libs.commons.compress
    implementation libs.sqlite

+    implementation libs.bundles.httpcomponents
+
    testImplementation libs.bundles.slf4j.test
    testImplementation libs.bundles.junit
    testImplementation libs.mockito
--- a/code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
@@ -12,7 +12,6 @@ import nu.marginalia.converting.sideload.SideloadSourceFactory;
 import nu.marginalia.converting.writer.ConverterBatchWritableIf;
 import nu.marginalia.converting.writer.ConverterBatchWriter;
 import nu.marginalia.converting.writer.ConverterWriter;
-import nu.marginalia.io.CrawledDomainReader;
 import nu.marginalia.io.SerializableCrawlDataStream;
 import nu.marginalia.mq.MessageQueueFactory;
 import nu.marginalia.mqapi.converting.ConvertRequest;
@@ -36,6 +35,7 @@ import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.sql.SQLException;
+import java.util.ArrayList;
 import java.util.Collection;
 import java.util.List;
 import java.util.Optional;
@@ -51,6 +51,7 @@ public class ConverterMain extends ProcessMainClass {
    private final ProcessHeartbeat heartbeat;
    private final FileStorageService fileStorageService;
    private final SideloadSourceFactory sideloadSourceFactory;
+    private static final int SIDELOAD_THRESHOLD = Integer.getInteger("converter.sideloadThreshold", 10_000);

    public static void main(String... args) throws Exception {

@@ -201,12 +202,26 @@ public class ConverterMain extends ProcessMainClass {
            processedDomains.set(batchingWorkLog.size());
            heartbeat.setProgress(processedDomains.get() / (double) totalDomains);

-            for (var domain : WorkLog.iterableMap(crawlDir.getLogFile(),
+            logger.info("Processing small items");
+
+            // We separate the large and small domains to reduce the number of critical sections,
+            // as the large domains have a separate processing track that doesn't store everything
+            // in memory
+
+            final List<Path> bigTasks = new ArrayList<>();
+
+            // First process the small items
+            for (var dataPath : WorkLog.iterableMap(crawlDir.getLogFile(),
                    new CrawlDataLocator(crawlDir.getDir(), batchingWorkLog)))
            {
+                if (SerializableCrawlDataStream.getSizeHint(dataPath) >= SIDELOAD_THRESHOLD) {
+                    bigTasks.add(dataPath);
+                    continue;
+                }
+
                pool.submit(() -> {
-                    try {
-                        ConverterBatchWritableIf writable = processor.createWritable(domain);
+                    try (var dataStream = SerializableCrawlDataStream.openDataStream(dataPath)) {
+                        ConverterBatchWritableIf writable = processor.fullProcessing(dataStream);
                        converterWriter.accept(writable);
                    }
                    catch (Exception ex) {
@@ -225,10 +240,39 @@ public class ConverterMain extends ProcessMainClass {
            do {
                System.out.println("Waiting for pool to terminate... " + pool.getActiveCount() + " remaining");
            } while (!pool.awaitTermination(60, TimeUnit.SECONDS));
+
+            logger.info("Processing large items");
+
+            try (var hb = heartbeat.createAdHocTaskHeartbeat("Large Domains")) {
+                int bigTaskIdx = 0;
+                // Next the big items domain-by-domain
+                for (var dataPath : bigTasks) {
+                    hb.progress(dataPath.toFile().getName(), bigTaskIdx++, bigTasks.size());
+
+                    try {
+                        // SerializableCrawlDataStream is AutoCloseable, but we can't use try-with-resources here,
+                        // as the stream would then be closed before the converterWriter has consumed it.
+                        // Instead, the converterWriter guarantees it will close the stream after consuming it.
+
+                        var stream = SerializableCrawlDataStream.openDataStream(dataPath);
+                        ConverterBatchWritableIf writable = processor.simpleProcessing(stream, SerializableCrawlDataStream.getSizeHint(dataPath));
+
+                        converterWriter.accept(writable);
+                    }
+                    catch (Exception ex) {
+                        logger.info("Error in processing", ex);
+                    }
+                    finally {
+                        heartbeat.setProgress(processedDomains.incrementAndGet() / (double) totalDomains);
+                    }
+                }
+            }
+
+            logger.info("Processing complete");
        }
    }

-    private static class CrawlDataLocator implements Function<WorkLogEntry, Optional<SerializableCrawlDataStream>> {
+    private static class CrawlDataLocator implements Function<WorkLogEntry, Optional<Path>> {

        private final Path crawlRootDir;
        private final BatchingWorkLog batchingWorkLog;
@@ -239,7 +283,7 @@ public class ConverterMain extends ProcessMainClass {
        }

        @Override
-        public Optional<SerializableCrawlDataStream> apply(WorkLogEntry entry) {
+        public Optional<Path> apply(WorkLogEntry entry) {
            if (batchingWorkLog.isItemProcessed(entry.id())) {
                return Optional.empty();
            }
@@ -252,7 +296,7 @@ public class ConverterMain extends ProcessMainClass {
            }

            try {
-                return Optional.of(CrawledDomainReader.createDataStream(path));
+                return Optional.of(path);
            }
            catch (Exception ex) {
                return Optional.empty();
--- a/code/processes/converting-process/java/nu/marginalia/converting/processor/DocumentProcessor.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/processor/DocumentProcessor.java
@@ -19,6 +19,7 @@ import nu.marginalia.model.idx.WordFlags;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

+import java.io.IOException;
 import java.net.URISyntaxException;
 import java.util.ArrayList;
 import java.util.List;
@@ -91,7 +92,7 @@ public class DocumentProcessor {
                                 DocumentClass documentClass,
                                 DocumentDecorator documentDecorator,
                                 DomainLinks externalDomainLinks,
-                                 ProcessedDocument ret) throws URISyntaxException, DisqualifiedException
+                                 ProcessedDocument ret) throws URISyntaxException, IOException, DisqualifiedException
    {

        var crawlerStatus = CrawlerDocumentStatus.valueOf(crawledDocument.crawlerStatus);
@@ -109,7 +110,7 @@ public class DocumentProcessor {

        ret.state = crawlerStatusToUrlState(crawledDocument.crawlerStatus, crawledDocument.httpStatus);

-        final var plugin = findPlugin(crawledDocument);
+        AbstractDocumentProcessorPlugin plugin = findPlugin(crawledDocument);

        EdgeUrl url = new EdgeUrl(crawledDocument.url);
        LinkTexts linkTexts = anchorTextKeywords.getAnchorTextKeywords(externalDomainLinks, url);
--- a/code/processes/converting-process/java/nu/marginalia/converting/processor/DomainProcessor.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/processor/DomainProcessor.java
@@ -32,7 +32,6 @@ import java.util.*;
 import java.util.regex.Pattern;

 public class DomainProcessor {
-    private static final int SIDELOAD_THRESHOLD = Integer.getInteger("converter.sideloadThreshold", 10_000);
    private final DocumentProcessor documentProcessor;
    private final SiteWords siteWords;
    private final AnchorTagsSource anchorTagsSource;
@@ -54,21 +53,9 @@ public class DomainProcessor {
        geoIpDictionary.waitReady();
    }

-    public ConverterBatchWritableIf createWritable(SerializableCrawlDataStream domain) {
-        final int sizeHint = domain.sizeHint();
-
-        if (sizeHint > SIDELOAD_THRESHOLD) {
-            // If the file is too big, we run a processing mode that doesn't
-            // require loading the entire dataset into RAM
-            return sideloadProcessing(domain, sizeHint);
-        }
-
-        return fullProcessing(domain);
-    }
-
-    public SideloadProcessing sideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) {
+    public SimpleProcessing simpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) {
        try {
-            return new SideloadProcessing(dataStream, sizeHint, extraKeywords);
+            return new SimpleProcessing(dataStream, sizeHint, extraKeywords);
        }
        catch (Exception ex) {
            logger.warn("Failed to process domain sideload", ex);
@@ -76,9 +63,9 @@ public class DomainProcessor {
        }
    }

-    public SideloadProcessing sideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint) {
+    public SimpleProcessing simpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint) {
        try {
-            return new SideloadProcessing(dataStream, sizeHint);
+            return new SimpleProcessing(dataStream, sizeHint);
        }
        catch (Exception ex) {
            logger.warn("Failed to process domain sideload", ex);
@@ -86,22 +73,84 @@ public class DomainProcessor {
        }
    }

-    public class SideloadProcessing implements ConverterBatchWritableIf, SideloadSource {
+    @Nullable
+    public ProcessedDomain fullProcessing(SerializableCrawlDataStream dataStream) {
+        try {
+            if (!dataStream.hasNext()) {
+                return null;
+            }
+
+            List<ProcessedDocument> docs = new ArrayList<>();
+            Set<String> processedUrls = new HashSet<>();
+
+            var firstRecord = dataStream.next();
+            if (!(firstRecord instanceof CrawledDomain crawledDomain))
+                throw new IllegalStateException("First record must be a domain, was " + firstRecord.getClass().getSimpleName());
+
+            DomainLinks externalDomainLinks = anchorTagsSource.getAnchorTags(crawledDomain.getDomain());
+            DocumentDecorator documentDecorator = new DocumentDecorator();
+
+            // Process Domain Record
+
+            ProcessedDomain ret = new ProcessedDomain();
+            processDomain(crawledDomain, ret, documentDecorator);
+            ret.documents = docs;
+
+            // Process Documents
+
+            try (var deduplicator = new LshDocumentDeduplicator()) {
+                while (dataStream.hasNext()) {
+                    if (!(dataStream.next() instanceof CrawledDocument doc))
+                        continue;
+                    if (doc.url == null)
+                        continue;
+                    if (doc.documentBodyBytes.length == 0)
+                        continue;
+                    if (!processedUrls.add(doc.url))
+                        continue;
+
+                    try {
+                        var processedDoc = documentProcessor.process(doc, ret.domain, externalDomainLinks, documentDecorator);
+                        deduplicator.markIfDuplicate(processedDoc);
+                        docs.add(processedDoc);
+                    } catch (Exception ex) {
+                        logger.warn("Failed to process " + doc.url, ex);
+                    }
+                }
+            }
+
+            // Add late keywords and features from domain-level information
+
+            calculateStatistics(ret, externalDomainLinks);
+
+            return ret;
+        }
+        catch (Exception ex) {
+            logger.warn("Failed to process domain", ex);
+            return null;
+        }
+    }
+
+    /** The simple processing track processes documents individually, and does not perform any domain-level analysis.
+     *  This is needed to process extremely large domains, which would otherwise eat up too much RAM.
+     */
+    public class SimpleProcessing implements ConverterBatchWritableIf, SideloadSource {
        private final SerializableCrawlDataStream dataStream;
        private final ProcessedDomain domain;
        private final DocumentDecorator documentDecorator;
        private final Set<String> processedUrls = new HashSet<>();
        private final DomainLinks externalDomainLinks;
        private final LshDocumentDeduplicator deduplicator = new LshDocumentDeduplicator();
+
        private static final ProcessingIterator.Factory iteratorFactory = ProcessingIterator.factory(8,
                Integer.getInteger("java.util.concurrent.ForkJoinPool.common.parallelism", Runtime.getRuntime().availableProcessors())
        );

-        SideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint) throws IOException {
+        SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint) throws IOException {
            this(dataStream, sizeHint, List.of());
        }

-        SideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) throws IOException {
+        SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) throws IOException {
            this.dataStream = dataStream;

            if (!dataStream.hasNext() || !(dataStream.next() instanceof CrawledDomain crawledDomain))
@@ -128,6 +177,7 @@ public class DomainProcessor {
        @Override
        public Iterator<ProcessedDocument> getDocumentsStream() {
            return iteratorFactory.create((taskConsumer) -> {
+
                while (dataStream.hasNext())
                {
                    if (!(dataStream.next() instanceof CrawledDocument doc))
@@ -172,65 +222,6 @@ public class DomainProcessor {
        }
    }

-
-    @Nullable
-    public ProcessedDomain fullProcessing(SerializableCrawlDataStream dataStream) {
-        try {
-            if (!dataStream.hasNext()) {
-                return null;
-            }
-
-            List<ProcessedDocument> docs = new ArrayList<>();
-            Set<String> processedUrls = new HashSet<>();
-
-            if (!(dataStream.next() instanceof CrawledDomain crawledDomain)) {
-                throw new IllegalStateException("First record must be a domain, was " + dataStream.next().getClass().getSimpleName());
-            }
-
-            DomainLinks externalDomainLinks = anchorTagsSource.getAnchorTags(crawledDomain.getDomain());
-            DocumentDecorator documentDecorator = new DocumentDecorator();
-
-            // Process Domain Record
-
-            ProcessedDomain ret = new ProcessedDomain();
-            processDomain(crawledDomain, ret, documentDecorator);
-            ret.documents = docs;
-
-            // Process Documents
-
-            try (var deduplicator = new LshDocumentDeduplicator()) {
-                while (dataStream.hasNext()) {
-                    if (!(dataStream.next() instanceof CrawledDocument doc))
-                        continue;
-                    if (doc.url == null)
-                        continue;
-                    if (doc.documentBody.isBlank())
-                        continue;
-                    if (!processedUrls.add(doc.url))
-                        continue;
-
-                    try {
-                        var processedDoc = documentProcessor.process(doc, ret.domain, externalDomainLinks, documentDecorator);
-                        deduplicator.markIfDuplicate(processedDoc);
-                        docs.add(processedDoc);
-                    } catch (Exception ex) {
-                        logger.warn("Failed to process " + doc.url, ex);
-                    }
-                }
-            }
-
-            // Add late keywords and features from domain-level information
-
-            calculateStatistics(ret, externalDomainLinks);
-
-            return ret;
-        }
-        catch (Exception ex) {
-            logger.warn("Failed to process domain", ex);
-            return null;
-        }
-    }
-
    private void processDomain(CrawledDomain crawledDomain,
                                          ProcessedDomain domain,
                                          DocumentDecorator decorator)
--- a/code/processes/converting-process/java/nu/marginalia/converting/processor/classifier/adblock/AdblockSimulator.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/processor/classifier/adblock/AdblockSimulator.java
@@ -116,7 +116,7 @@ public class AdblockSimulator {


    // Refrain from cleaning up this code, it's very hot code and needs to be fast.
-    // This version is about 100x faster than the a "clean" first stab implementation.
+    // This version is about 100x faster than a "clean" first stab implementation.

    class RuleVisitor implements NodeFilter {
        public boolean sawAds;
--- a/code/processes/converting-process/java/nu/marginalia/converting/processor/logic/DocumentGeneratorExtractor.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/processor/logic/DocumentGeneratorExtractor.java
@@ -23,7 +23,7 @@ public class DocumentGeneratorExtractor {

        var tags = doc.select("meta[name=generator]");

-        if (tags.size() == 0) {
+        if (tags.isEmpty()) {
            // Some sites have a comment in the head instead of a meta tag
            return fingerprintServerTech(doc, responseHeaders);
        }
--- a/code/processes/converting-process/java/nu/marginalia/converting/processor/logic/DocumentValuator.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/processor/logic/DocumentValuator.java
@@ -24,7 +24,7 @@ public class DocumentValuator {
        double scriptPenalty = getScriptPenalty(parsedDocument);
        double chatGptPenalty = getChatGptContentFarmPenalty(parsedDocument);

-        int rawLength = crawledDocument.documentBody.length();
+        int rawLength = crawledDocument.documentBodyBytes.length;

        if (textLength == 0) {
            throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.LENGTH);
--- a/code/processes/converting-process/java/nu/marginalia/converting/processor/logic/FeatureExtractor.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/processor/logic/FeatureExtractor.java
@@ -218,7 +218,10 @@ public class FeatureExtractor {
            }
        }

-        if (features.contains(HtmlFeature.JS) && adblockSimulator.hasAds(doc.clone())) {
+        if (features.contains(HtmlFeature.JS)
+            // remove while disabled to get rid of expensive clone() call:
+            // adblockSimulator.hasAds(doc.clone())
+            ) {
            features.add(HtmlFeature.ADVERTISEMENT);
        }

--- a/code/processes/converting-process/java/nu/marginalia/converting/processor/plugin/AbstractDocumentProcessorPlugin.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/processor/plugin/AbstractDocumentProcessorPlugin.java
@@ -14,6 +14,7 @@ import nu.marginalia.model.crawldata.CrawledDocument;
 import nu.marginalia.model.html.HtmlStandard;

 import javax.annotation.Nullable;
+import java.io.IOException;
 import java.net.URISyntaxException;
 import java.util.HashSet;
 import java.util.List;
@@ -25,7 +26,7 @@ public abstract class AbstractDocumentProcessorPlugin {
        this.languageFilter = languageFilter;
    }

-    public abstract DetailsWithWords createDetails(CrawledDocument crawledDocument, LinkTexts linkTexts, DocumentClass documentClass) throws DisqualifiedException, URISyntaxException;
+    public abstract DetailsWithWords createDetails(CrawledDocument crawledDocument, LinkTexts linkTexts, DocumentClass documentClass) throws DisqualifiedException, URISyntaxException, IOException;
    public abstract boolean isApplicable(CrawledDocument doc);

    protected void checkDocumentLanguage(DocumentLanguageData dld) throws DisqualifiedException {
@@ -86,6 +87,7 @@ public abstract class AbstractDocumentProcessorPlugin {

            return this;
        }
+
        public MetaTagsBuilder addPubDate(PubDate pubDate) {

            if (pubDate.year() > 1900) {
--- a/code/processes/converting-process/java/nu/marginalia/converting/processor/plugin/HtmlDocumentProcessorPlugin.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/processor/plugin/HtmlDocumentProcessorPlugin.java
@@ -6,6 +6,7 @@ import nu.marginalia.converting.model.DisqualifiedException;
 import nu.marginalia.converting.model.DocumentHeaders;
 import nu.marginalia.converting.model.GeneratorType;
 import nu.marginalia.converting.model.ProcessedDocumentDetails;
+import nu.marginalia.converting.processor.AcceptableAds;
 import nu.marginalia.converting.processor.DocumentClass;
 import nu.marginalia.converting.processor.MetaRobotsTag;
 import nu.marginalia.converting.processor.logic.*;
@@ -32,11 +33,11 @@ import nu.marginalia.model.crawldata.CrawledDocument;
 import nu.marginalia.model.html.HtmlStandard;
 import nu.marginalia.model.idx.DocumentFlags;
 import nu.marginalia.model.idx.DocumentMetadata;
-import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

+import java.io.IOException;
 import java.net.URISyntaxException;
 import java.util.EnumSet;
 import java.util.HashSet;
@@ -51,7 +52,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
    private final double minDocumentQuality;

    private final FeatureExtractor featureExtractor;
-    private final TitleExtractor titleExtractor;
    private final DocumentKeywordExtractor keywordExtractor;
    private final PubDateSniffer pubDateSniffer;

@@ -74,7 +74,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
            @Named("min-document-quality") Double minDocumentQuality,
            LanguageFilter languageFilter,
            FeatureExtractor featureExtractor,
-            TitleExtractor titleExtractor,
            DocumentKeywordExtractor keywordExtractor,
            PubDateSniffer pubDateSniffer,
            DocumentLengthLogic documentLengthLogic,
@@ -89,7 +88,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
        this.minDocumentQuality = minDocumentQuality;
        this.featureExtractor = featureExtractor;

-        this.titleExtractor = titleExtractor;
        this.keywordExtractor = keywordExtractor;
        this.pubDateSniffer = pubDateSniffer;
        this.metaRobotsTag = metaRobotsTag;
@@ -108,19 +106,17 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
    public DetailsWithWords createDetails(CrawledDocument crawledDocument,
                                          LinkTexts linkTexts,
                                          DocumentClass documentClass)
-            throws DisqualifiedException, URISyntaxException {
+            throws DisqualifiedException, URISyntaxException, IOException {

-        String documentBody = crawledDocument.documentBody;
-
-        if (languageFilter.isBlockedUnicodeRange(documentBody)) {
+        if (languageFilter.isBlockedUnicodeRange(crawledDocument.documentBody(512))) {
            throw new DisqualifiedException(DisqualificationReason.LANGUAGE);
        }

-        if (documentBody.length() > MAX_DOCUMENT_LENGTH_BYTES) { // 128kb
-            documentBody = documentBody.substring(0, MAX_DOCUMENT_LENGTH_BYTES);
-        }
+        Document doc = crawledDocument.parseBody();

-        Document doc = Jsoup.parse(documentBody);
+        if (AcceptableAds.hasAcceptableAdsTag(doc)) {
+            throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.ACCEPTABLE_ADS);
+        }

        if (!metaRobotsTag.allowIndexingByMetaTag(doc)) {
            throw new DisqualifiedException(DisqualificationReason.FORBIDDEN);
@@ -138,32 +134,33 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
        }

        var prunedDoc = specialization.prune(doc);
-        DocumentLanguageData dld = sentenceExtractorProvider.get().extractSentences(prunedDoc);

-        checkDocumentLanguage(dld);
-
-        var ret = new ProcessedDocumentDetails();

        final int length = getLength(doc);
        final HtmlStandard standard = getHtmlStandard(doc);
        final double quality = documentValuator.getQuality(crawledDocument, standard, doc, length);

+        if (isDisqualified(documentClass, url, quality, doc.title())) {
+            throw new DisqualifiedException(DisqualificationReason.QUALITY);
+        }
+
+        DocumentLanguageData dld = sentenceExtractorProvider.get().extractSentences(prunedDoc);
+
+        checkDocumentLanguage(dld);
+        documentLengthLogic.validateLength(dld, specialization.lengthModifier() * documentClass.lengthLimitModifier());
+
+        var ret = new ProcessedDocumentDetails();
+
        ret.length = length;
        ret.standard = standard;
        ret.title = specialization.getTitle(doc, dld, crawledDocument.url);

-        documentLengthLogic.validateLength(dld, specialization.lengthModifier() * documentClass.lengthLimitModifier());
-
        final Set<HtmlFeature> features = featureExtractor.getFeatures(url, doc, documentHeaders, dld);

        ret.features = features;
        ret.quality = documentValuator.adjustQuality(quality, features);
        ret.hashCode = dld.localitySensitiveHashCode();

-        if (isDisqualified(documentClass, url, quality, ret.title)) {
-            throw new DisqualifiedException(DisqualificationReason.QUALITY);
-        }
-
        PubDate pubDate = pubDateSniffer.getPubDate(documentHeaders, url, doc, standard, true);

        EnumSet<DocumentFlags> documentFlags = documentFlags(features, generatorParts.type());
--- a/code/processes/converting-process/java/nu/marginalia/converting/processor/plugin/PlainTextDocumentProcessorPlugin.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/processor/plugin/PlainTextDocumentProcessorPlugin.java
@@ -71,7 +71,7 @@ public class PlainTextDocumentProcessorPlugin extends AbstractDocumentProcessorP
                                          DocumentClass documentClass)
            throws DisqualifiedException, URISyntaxException {

-        String documentBody = crawledDocument.documentBody;
+        String documentBody = crawledDocument.documentBody();

        if (languageFilter.isBlockedUnicodeRange(documentBody)) {
            throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.LANGUAGE);
--- a/code/processes/converting-process/java/nu/marginalia/converting/sideload/SideloaderProcessing.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/sideload/SideloaderProcessing.java
@@ -19,6 +19,7 @@ import nu.marginalia.model.idx.DocumentMetadata;
 import nu.marginalia.model.idx.WordFlags;

 import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
 import java.time.LocalDateTime;
 import java.util.EnumSet;
 import java.util.List;
@@ -50,7 +51,7 @@ public class SideloaderProcessing {
                "OK",
                "NP",
                "",
-                body,
+                body.getBytes(StandardCharsets.UTF_8),
                false,
                null,
                null
--- a/code/processes/converting-process/java/nu/marginalia/converting/sideload/encyclopedia/EncyclopediaMarginaliaNuSideloader.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/sideload/encyclopedia/EncyclopediaMarginaliaNuSideloader.java
@@ -127,7 +127,7 @@ public class EncyclopediaMarginaliaNuSideloader implements SideloadSource, AutoC
        }
        fullHtml.append("</div></body></html>");

-        var doc = sideloaderProcessing
+        return sideloaderProcessing
                .processDocument(fullUrl,
                        fullHtml.toString(),
                        List.of("encyclopedia", "wiki"),
@@ -137,8 +137,6 @@ public class EncyclopediaMarginaliaNuSideloader implements SideloadSource, AutoC
                        anchorTextKeywords.getAnchorTextKeywords(domainLinks, new EdgeUrl(fullUrl)),
                        LocalDate.now().getYear(),
                        10_000_000);
-
-        return doc;
    }

    private String normalizeUtf8(String url) {
--- a/code/processes/converting-process/java/nu/marginalia/converting/sideload/warc/WarcSideloader.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/sideload/warc/WarcSideloader.java
@@ -106,11 +106,7 @@ public class WarcSideloader implements SideloadSource, AutoCloseable {
                return false;

            var url = new EdgeUrl(warcResponse.target());
-            if (!Objects.equals(url.getDomain(), domain)) {
-                return false;
-            }
-
-            return true;
+            return Objects.equals(url.getDomain(), domain);
        } catch (Exception e) {
            logger.warn("Failed to process response", e);
        }
--- a/code/processes/converting-process/java/nu/marginalia/converting/writer/ConverterWriter.java
+++ b/code/processes/converting-process/java/nu/marginalia/converting/writer/ConverterWriter.java
@@ -39,6 +39,9 @@ public class ConverterWriter implements AutoCloseable {
        workerThread.start();
    }

+    /** Queue the domain and eventually write it to the converter journal.
+     *  The domain object will be closed after it has been processed.
+     */
    public void accept(@Nullable ConverterBatchWritableIf domain) {
        if (null == domain)
            return;
@@ -72,15 +75,15 @@ public class ConverterWriter implements AutoCloseable {

                if (workLog.isItemCommitted(id) || workLog.isItemInCurrentBatch(id)) {
                    logger.warn("Skipping already logged item {}", id);
+                }
+                else {
+                    currentWriter.write(data);
+                    workLog.logItem(id);
                    data.close();
-                    continue;
                }

-                currentWriter.write(data);
-
-                workLog.logItem(id);
-
                switcher.tick();
+                data.close();
            }
        }
        catch (Exception ex) {
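Review note: in the rewritten loop above, the non-skipped path appears to reach `data.close()` twice, once inside the `else` branch and once after `switcher.tick()`. That is harmless only if `close()` is idempotent. Below is a minimal sketch of a structure that closes each item exactly once. It uses hypothetical stand-in names (`Item`, `processAll`, a `HashSet` in place of the work log), not the project's `ConverterBatchWritableIf` or `WorkLog` classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CloseOnceSketch {
    /** Hypothetical stand-in for a closeable batch item; counts close() calls. */
    static final class Item implements AutoCloseable {
        final String id;
        int closeCount = 0;

        Item(String id) { this.id = id; }

        @Override
        public void close() { closeCount++; }
    }

    /** Writes each distinct id at most once and closes every item exactly once. */
    static List<String> processAll(List<Item> items) {
        Set<String> logged = new HashSet<>();     // stand-in for the work log
        List<String> written = new ArrayList<>(); // stand-in for the batch writer

        for (Item item : items) {
            // try-with-resources closes the item once, whether it is written or skipped
            try (Item data = item) {
                if (logged.add(data.id)) {
                    written.add(data.id);
                }
                // else: already logged, skip silently
            }
        }
        return written;
    }
}
```

With try-with-resources the skip path and the write path share one close, so neither branch needs its own `close()` call.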
--- a/code/processes/converting-process/model/java/nu/marginalia/model/processed/SlopDocumentRecord.java
+++ b/code/processes/converting-process/model/java/nu/marginalia/model/processed/SlopDocumentRecord.java
@@ -11,7 +11,6 @@ import nu.marginalia.slop.column.primitive.IntColumn;
 import nu.marginalia.slop.column.primitive.LongColumn;
 import nu.marginalia.slop.column.string.EnumColumn;
 import nu.marginalia.slop.column.string.StringColumn;
-import nu.marginalia.slop.column.string.TxtStringColumn;
 import nu.marginalia.slop.desc.StorageType;
 import org.jetbrains.annotations.Nullable;

@@ -182,8 +181,8 @@ public record SlopDocumentRecord(
    }

    // Basic information
-    private static final TxtStringColumn domainsColumn = new TxtStringColumn("domain", StandardCharsets.UTF_8, StorageType.GZIP);
-    private static final TxtStringColumn urlsColumn = new TxtStringColumn("url", StandardCharsets.UTF_8, StorageType.GZIP);
+    private static final StringColumn domainsColumn = new StringColumn("domain", StandardCharsets.UTF_8, StorageType.GZIP);
+    private static final StringColumn urlsColumn = new StringColumn("url", StandardCharsets.UTF_8, StorageType.GZIP);
    private static final VarintColumn ordinalsColumn = new VarintColumn("ordinal", StorageType.PLAIN);
    private static final EnumColumn statesColumn = new EnumColumn("state", StandardCharsets.US_ASCII, StorageType.PLAIN);
    private static final StringColumn stateReasonsColumn = new StringColumn("stateReason", StandardCharsets.US_ASCII, StorageType.GZIP);
@@ -211,7 +210,7 @@ public record SlopDocumentRecord(
    private static final VarintCodedSequenceArrayColumn spansColumn = new VarintCodedSequenceArrayColumn("spans", StorageType.ZSTD);

    public static class KeywordsProjectionReader extends SlopTable {
-        private final TxtStringColumn.Reader domainsReader;
+        private final StringColumn.Reader domainsReader;
        private final VarintColumn.Reader ordinalsReader;
        private final IntColumn.Reader htmlFeaturesReader;
        private final LongColumn.Reader domainMetadataReader;
@@ -275,8 +274,8 @@ public record SlopDocumentRecord(
    }

    public static class MetadataReader extends SlopTable {
-        private final TxtStringColumn.Reader domainsReader;
-        private final TxtStringColumn.Reader urlsReader;
+        private final StringColumn.Reader domainsReader;
+        private final StringColumn.Reader urlsReader;
        private final VarintColumn.Reader ordinalsReader;
        private final StringColumn.Reader titlesReader;
        private final StringColumn.Reader descriptionsReader;
@@ -332,8 +331,8 @@ public record SlopDocumentRecord(
    }

    public static class Writer extends SlopTable {
-        private final TxtStringColumn.Writer domainsWriter;
-        private final TxtStringColumn.Writer urlsWriter;
+        private final StringColumn.Writer domainsWriter;
+        private final StringColumn.Writer urlsWriter;
        private final VarintColumn.Writer ordinalsWriter;
        private final EnumColumn.Writer statesWriter;
        private final StringColumn.Writer stateReasonsWriter;
--- a/code/processes/converting-process/test/nu/marginalia/converting/ConvertingIntegrationTest.java
+++ b/code/processes/converting-process/test/nu/marginalia/converting/ConvertingIntegrationTest.java
@@ -98,7 +98,7 @@ public class ConvertingIntegrationTest {

    @Test
    public void testMemexMarginaliaNuSideloadProcessing() throws IOException {
-        var ret = domainProcessor.sideloadProcessing(asSerializableCrawlData(readMarginaliaWorkingSet()), 100);
+        var ret = domainProcessor.simpleProcessing(asSerializableCrawlData(readMarginaliaWorkingSet()), 100);
        assertNotNull(ret);
        assertEquals("memex.marginalia.nu", ret.id());

@@ -146,7 +146,7 @@ public class ConvertingIntegrationTest {
                    "OK",
                    "",
                    "",
-                    readClassPathFile(p.toString()),
+                    readClassPathFile(p.toString()).getBytes(),
                    false,
                    null,
                    null
--- a/code/processes/converting-process/test/nu/marginalia/converting/CrawlingThenConvertingIntegrationTest.java
+++ b/code/processes/converting-process/test/nu/marginalia/converting/CrawlingThenConvertingIntegrationTest.java
@@ -200,23 +200,23 @@ public class CrawlingThenConvertingIntegrationTest {

    @Test
    public void crawlRobotsTxt() throws Exception {
-        var specs = new CrawlerMain.CrawlSpecRecord("search.marginalia.nu", 5,
-                        List.of("https://search.marginalia.nu/search?q=hello+world")
+        var specs = new CrawlerMain.CrawlSpecRecord("marginalia-search.com", 5,
+                        List.of("https://marginalia-search.com/search?q=hello+world")
        );

        CrawledDomain domain = crawl(specs);
        assertFalse(domain.doc.isEmpty());
        assertEquals("OK", domain.crawlerStatus);
-        assertEquals("search.marginalia.nu", domain.domain);
+        assertEquals("marginalia-search.com", domain.domain);

        Set<String> allUrls = domain.doc.stream().map(doc -> doc.url).collect(Collectors.toSet());
-        assertTrue(allUrls.contains("https://search.marginalia.nu/search"), "We expect a record for entities that are forbidden");
+        assertTrue(allUrls.contains("https://marginalia-search.com/search"), "We expect a record for entities that are forbidden");

        var output = process();

        assertNotNull(output);
        assertFalse(output.documents.isEmpty());
-        assertEquals(new EdgeDomain("search.marginalia.nu"), output.domain);
+        assertEquals(new EdgeDomain("marginalia-search.com"), output.domain);
        assertEquals(DomainIndexingState.ACTIVE, output.state);

        for (var doc : output.documents) {
--- a/code/processes/crawling-process/build.gradle
+++ b/code/processes/crawling-process/build.gradle
@@ -55,16 +55,19 @@ dependencies {
    implementation libs.zstd
    implementation libs.jwarc
    implementation libs.crawlercommons
-    implementation libs.okhttp3
    implementation libs.jsoup
    implementation libs.opencsv
    implementation libs.fastutil

    implementation libs.bundles.mariadb
+    implementation libs.bundles.httpcomponents

    testImplementation libs.bundles.slf4j.test
    testImplementation libs.bundles.junit
    testImplementation libs.mockito
+    testImplementation libs.wiremock
+
+

    testImplementation project(':code:processes:test-data')
 }
--- a/code/processes/crawling-process/ft-content-type/java/nu/marginalia/contenttype/ContentType.java
+++ b/code/processes/crawling-process/ft-content-type/java/nu/marginalia/contenttype/ContentType.java
@@ -2,11 +2,16 @@ package nu.marginalia.contenttype;

 import org.apache.commons.lang3.StringUtils;

+import java.nio.charset.Charset;
+import java.nio.charset.IllegalCharsetNameException;
+import java.nio.charset.StandardCharsets;
+
 /** Content type and charset of a document
 * @param contentType The content type, e.g. "text/html"
 * @param charset The charset, e.g. "UTF-8"
 */
 public record ContentType(String contentType, String charset) {
+
    public static ContentType parse(String contentTypeHeader) {
        if (contentTypeHeader == null || contentTypeHeader.isBlank())
            return new ContentType(null,  null);
@@ -15,9 +20,31 @@ public record ContentType(String contentType, String charset) {
        String contentType = parts[0].trim();
        String charset = parts.length > 1 ? parts[1].trim() : "UTF-8";

+        if (charset.toLowerCase().startsWith("charset=")) {
+            charset = charset.substring("charset=".length());
+        }
+
        return new ContentType(contentType, charset);
    }

+    /** Best-effort method for turning the provided charset string into a Java Charset,
+     * with some guesswork heuristics for when the name is unsupported or malformed.
+     */
+    public Charset asCharset() {
+        try {
+            if (Charset.isSupported(charset)) {
+                return Charset.forName(charset);
+            } else if (charset.equalsIgnoreCase("macintosh-latin")) {
+                return StandardCharsets.ISO_8859_1;
+            } else {
+                return StandardCharsets.UTF_8;
+            }
+        }
+        catch (IllegalCharsetNameException ex) { // thrown by Charset.isSupported()
+            return StandardCharsets.UTF_8;
+        }
+    }
+
    public boolean is(String contentType) {
        return this.contentType.equalsIgnoreCase(contentType);
    }
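Review note: the fallback chain added to `ContentType` can be sketched in isolation as follows. This is a hedged illustration, not the project's code: `charsetLabel` and `resolve` are hypothetical names, the `split(";", 2)` call is an assumption about how the header is divided, and unlike `ContentType.parse` the sketch returns "UTF-8" for a blank header. It also catches `IllegalArgumentException`, the parent of `IllegalCharsetNameException`, so that a null label falls back to UTF-8 as well.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class CharsetFallbackSketch {
    /** Extracts the charset label from a Content-Type header value, defaulting to "UTF-8". */
    static String charsetLabel(String header) {
        if (header == null || header.isBlank()) {
            return "UTF-8";
        }
        String[] parts = header.split(";", 2); // assumption: type and parameters are ';'-separated
        String label = parts.length > 1 ? parts[1].trim() : "UTF-8";

        // Locale.ROOT avoids locale-specific case mapping surprises
        if (label.toLowerCase(Locale.ROOT).startsWith("charset=")) {
            label = label.substring("charset=".length());
        }
        return label;
    }

    /** Resolves a label to a Charset, falling back to UTF-8 when it cannot be used. */
    static Charset resolve(String label) {
        try {
            if (Charset.isSupported(label)) {
                return Charset.forName(label);
            }
            if (label.equalsIgnoreCase("macintosh-latin")) {
                return StandardCharsets.ISO_8859_1; // close enough, per the diff's heuristic
            }
            return StandardCharsets.UTF_8;
        } catch (IllegalArgumentException ex) {
            // Charset.isSupported throws for malformed names (IllegalCharsetNameException) and for null
            return StandardCharsets.UTF_8;
        }
    }
}
```

For example, `resolve(charsetLabel("text/html; charset=ISO-8859-1"))` yields ISO-8859-1, while a malformed label such as `"not a charset!"` falls back to UTF-8.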
--- a/code/processes/crawling-process/ft-content-type/java/nu/marginalia/contenttype/DocumentBodyToString.java
+++ b/code/processes/crawling-process/ft-content-type/java/nu/marginalia/contenttype/DocumentBodyToString.java
@@ -1,9 +1,12 @@
 package nu.marginalia.contenttype;

+import org.jsoup.Jsoup;
+import org.jsoup.nodes.Document;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
 import java.nio.charset.Charset;
-import java.nio.charset.IllegalCharsetNameException;
 import java.nio.charset.StandardCharsets;
-import java.nio.charset.UnsupportedCharsetException;
 import java.util.Map;
 import java.util.concurrent.ConcurrentHashMap;

@@ -23,24 +26,25 @@ public class DocumentBodyToString {
        return new String(data, charset);
    }

+    public static Document getParsedData(ContentType type, byte[] data, int maxLength, String url) throws IOException {
+        final Charset charset;
+
+        if (type.charset() == null || type.charset().isBlank()) {
+            charset = StandardCharsets.UTF_8;
+        } else {
+            charset = charsetMap.computeIfAbsent(type, DocumentBodyToString::computeCharset);
+        }
+
+        ByteArrayInputStream bais = new ByteArrayInputStream(data, 0, Math.min(data.length, maxLength));
+
+        return Jsoup.parse(bais, charset.name(), url);
+    }
+
    private static Charset computeCharset(ContentType type) {
-        try {
-            if (type.charset() == null || type.charset().isBlank())
-                return StandardCharsets.UTF_8;
-            else {
-                return Charset.forName(type.charset());
-            }
-        }
-        catch (IllegalCharsetNameException ex) {
-            // Fall back to UTF-8 if we don't understand what this is.  It's *probably* fine? Maybe?
+        if (type.charset() == null || type.charset().isBlank())
            return StandardCharsets.UTF_8;
-        }
-        catch (UnsupportedCharsetException ex) {
-            // This is usually like Macintosh Latin
-            // (https://en.wikipedia.org/wiki/Macintosh_Latin_encoding)
-            //
-            // It's close enough to 8859-1 to serve
-            return StandardCharsets.ISO_8859_1;
+        else {
+            return type.asCharset();
        }
    }
 }
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
@@ -19,22 +19,19 @@ import nu.marginalia.crawl.retreival.DomainProber;
 import nu.marginalia.crawl.warc.WarcArchiverFactory;
 import nu.marginalia.crawl.warc.WarcArchiverIf;
 import nu.marginalia.db.DomainBlacklist;
-import nu.marginalia.io.CrawledDomainReader;
 import nu.marginalia.io.CrawlerOutputFile;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.mq.MessageQueueFactory;
-import nu.marginalia.parquet.crawldata.CrawledDocumentParquetRecordFileWriter;
 import nu.marginalia.process.ProcessConfiguration;
 import nu.marginalia.process.ProcessConfigurationModule;
 import nu.marginalia.process.ProcessMainClass;
 import nu.marginalia.process.control.ProcessHeartbeatImpl;
 import nu.marginalia.process.log.WorkLog;
 import nu.marginalia.service.module.DatabaseModule;
+import nu.marginalia.slop.SlopCrawlDataRecord;
 import nu.marginalia.storage.FileStorageService;
 import nu.marginalia.storage.model.FileStorageId;
 import nu.marginalia.util.SimpleBlockingThreadPool;
-import okhttp3.ConnectionPool;
-import okhttp3.Dispatcher;
 import org.jetbrains.annotations.NotNull;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -44,11 +41,9 @@ import java.nio.file.Files;
 import java.nio.file.Path;
 import java.nio.file.StandardCopyOption;
 import java.security.Security;
-import java.util.ArrayList;
-import java.util.Collections;
-import java.util.List;
-import java.util.Map;
+import java.util.*;
 import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.LinkedBlockingQueue;
 import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicInteger;

@@ -72,6 +67,8 @@ public class CrawlerMain extends ProcessMainClass {

    private final Map<String, CrawlTask> pendingCrawlTasks = new ConcurrentHashMap<>();

+    private final LinkedBlockingQueue<CrawlTask> retryQueue = new LinkedBlockingQueue<>();
+
    private final AtomicInteger tasksDone = new AtomicInteger(0);
    private final HttpFetcherImpl fetcher;

@@ -85,6 +82,7 @@ public class CrawlerMain extends ProcessMainClass {

    @Inject
    public CrawlerMain(UserAgent userAgent,
+                       HttpFetcherImpl httpFetcher,
                       ProcessHeartbeatImpl heartbeat,
                       MessageQueueFactory messageQueueFactory, DomainProber domainProber,
                       FileStorageService fileStorageService,
@@ -98,6 +96,7 @@ public class CrawlerMain extends ProcessMainClass {
        super(messageQueueFactory, processConfiguration, gson, CRAWLER_INBOX);

        this.userAgent = userAgent;
+        this.fetcher = httpFetcher;
        this.heartbeat = heartbeat;
        this.domainProber = domainProber;
        this.fileStorageService = fileStorageService;
@@ -107,14 +106,19 @@ public class CrawlerMain extends ProcessMainClass {
        this.blacklist = blacklist;
        this.node = processConfiguration.node();

+        SimpleBlockingThreadPool.ThreadType threadType;
+        if (Boolean.getBoolean("crawler.useVirtualThreads")) {
+            threadType = SimpleBlockingThreadPool.ThreadType.VIRTUAL;
+        }
+        else {
+            threadType = SimpleBlockingThreadPool.ThreadType.PLATFORM;
+        }
+
        pool = new SimpleBlockingThreadPool("CrawlerPool",
                Integer.getInteger("crawler.poolSize", 256),
-                1);
+                1,
+                threadType);

-        fetcher = new HttpFetcherImpl(userAgent,
-                new Dispatcher(),
-                new ConnectionPool(5, 10, TimeUnit.SECONDS)
-        );

        // Wait for the blacklist to be loaded before starting the crawl
        blacklist.waitUntilLoaded();
@@ -132,6 +136,10 @@ public class CrawlerMain extends ProcessMainClass {
        System.setProperty("sun.net.client.defaultConnectTimeout", "30000");
        System.setProperty("sun.net.client.defaultReadTimeout", "30000");

+        // Limit how long idle connections are kept alive, and the size of the connection pool
+        System.setProperty("jdk.httpclient.idleTimeout", "15"); // 15 seconds
+        System.setProperty("jdk.httpclient.connectionPoolSize", "256");
+
        // We don't want to use too much memory caching sessions for https
        System.setProperty("javax.net.ssl.sessionCacheSize", "2048");

@@ -225,10 +233,7 @@ public class CrawlerMain extends ProcessMainClass {

        logger.info("Loaded {} domains", crawlSpecRecords.size());

-        // Shuffle the domains to ensure we get a good mix of domains in each crawl,
-        // so that e.g. the big domains don't get all crawled at once, or we end up
-        // crawling the same server in parallel from different subdomains...
-        Collections.shuffle(crawlSpecRecords);
+        crawlSpecRecords.sort(crawlSpecArrangement(crawlSpecRecords));

        // First a validation run to ensure the file is all good to parse
        if (crawlSpecRecords.isEmpty()) {
@@ -249,9 +254,14 @@ public class CrawlerMain extends ProcessMainClass {
            // (this happens when the process is restarted after a crash or a shutdown)
            tasksDone.set(workLog.countFinishedJobs());

-            // Create crawl tasks and submit them to the pool for execution
+            // List of deferred tasks used to ensure beneficial scheduling of domains with regard to DomainLocks.
+            // Merely shuffling the domains tends to leave a lot of threads blocked waiting for a semaphore;
+            // this more aggressively attempts to schedule the jobs so as to avoid blocking.
+            List<CrawlTask> taskList = new ArrayList<>();
+
+            // Create crawl tasks
            for (CrawlSpecRecord crawlSpec : crawlSpecRecords) {
-                if (workLog.isJobFinished(crawlSpec.domain()))
+                if (workLog.isJobFinished(crawlSpec.domain))
                    continue;

                var task = new CrawlTask(
@@ -262,8 +272,36 @@ public class CrawlerMain extends ProcessMainClass {
                        domainStateDb,
                        workLog);

-                if (pendingCrawlTasks.putIfAbsent(crawlSpec.domain(), task) == null) {
-                    pool.submitQuietly(task);
+                // Try to run immediately, to avoid unnecessarily keeping the entire work set in RAM
+                if (!trySubmitDeferredTask(task)) {
+                    // Otherwise add to the taskList for deferred execution
+                    taskList.add(task);
+                }
+            }
+
+            // Schedule viable tasks for execution until the list is empty
+            for (int emptyRuns = 0;emptyRuns < 300;) {
+                boolean hasTasks = !taskList.isEmpty();
+
+                // The order of these checks is very important to avoid a race condition
+                // where we miss a task that is put into the retry queue
+                boolean hasRunningTasks = pool.getActiveCount() > 0;
+                boolean hasRetryTasks = !retryQueue.isEmpty();
+
+                if (hasTasks || hasRetryTasks || hasRunningTasks) {
+                    retryQueue.drainTo(taskList);
+
+                    // Try to submit any tasks that are in the retry queue (this will block if the pool is full)
+                    taskList.removeIf(this::trySubmitDeferredTask);
+
+                    // Add a small pause here to avoid busy looping toward the end of the execution cycle when
+                    // we might have no new viable tasks to run for hours on end
+                    TimeUnit.MILLISECONDS.sleep(5);
+                } else {
+                    // We have no tasks to run, and no tasks in the retry queue
+                    // but we wait a bit to see if any new tasks come in via the retry queue
+                    emptyRuns++;
+                    TimeUnit.SECONDS.sleep(1);
                }
            }

@@ -291,6 +329,51 @@ public class CrawlerMain extends ProcessMainClass {
        }
    }

+    /** Create a comparator that sorts the crawl specs in a way that is beneficial for the crawl,
+     * we want to enqueue domains that have common top domains first, but otherwise have a random
+     * order.
+     * <p></p>
+     * Note, we can't use hash codes for randomization as it is not desirable to have the same order
+     * every time the process is restarted (and CrawlSpecRecord is a record, which defines equals and
+     * hashcode based on the fields).
+     * */
+    private Comparator<CrawlSpecRecord> crawlSpecArrangement(List<CrawlSpecRecord> records) {
+        Random r = new Random();
+        Map<String, Integer> topDomainCounts = new HashMap<>(4 + (int) Math.sqrt(records.size()));
+        Map<String, Integer> randomOrder = new HashMap<>(records.size());
+
+        for (var spec : records) {
+            topDomainCounts.merge(EdgeDomain.getTopDomain(spec.domain), 1, Integer::sum);
+            randomOrder.put(spec.domain, r.nextInt());
+        }
+
+        return Comparator.comparing((CrawlSpecRecord spec) -> topDomainCounts.getOrDefault(EdgeDomain.getTopDomain(spec.domain), 0) >= 8)
+                .reversed()
+                .thenComparing(spec -> randomOrder.get(spec.domain))
+                .thenComparing(Record::hashCode); // tie-breaker in the unlikely case that two domains draw the same random value
+    }
+
+    /** Submit a task for execution if it can be run, returns true if it was submitted
+     * or if it can be discarded */
+    private boolean trySubmitDeferredTask(CrawlTask task) {
+        if (!task.canRun()) {
+            return false;
+        }
+
+        if (pendingCrawlTasks.putIfAbsent(task.domain, task) != null) {
+            return true; // task has already run, duplicate in crawl specs
+        }
+
+        try {
+            // This blocks the caller when the pool is full
+            pool.submitQuietly(task);
+            return true;
+        }
+        catch (RuntimeException ex) {
+            logger.error("Failed to submit task " + task.domain, ex);
+            return false;
+        }
+    }

    public void runForSingleDomain(String targetDomainName, FileStorageId fileStorageId) throws Exception {
        runForSingleDomain(targetDomainName, fileStorageService.getStorage(fileStorageId).asPath());
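Review note: the ordering built by `crawlSpecArrangement` (domains whose top domain occurs at least 8 times come first, everything else in random order) can be sketched with plain strings. This is an illustration under stated assumptions: `topDomain` is a hypothetical two-label helper, not `EdgeDomain.getTopDomain`, and the threshold and `Random` are passed in as parameters so the behaviour is easy to check.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class CrawlOrderSketch {
    /** Hypothetical helper: keeps the last two labels ("a.b.example.com" -> "example.com"). */
    static String topDomain(String domain) {
        String[] labels = domain.split("\\.");
        if (labels.length <= 2) {
            return domain;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    /** Returns a copy of the list with crowded top domains first, otherwise in random order. */
    static List<String> arrange(List<String> domains, int threshold, Random random) {
        Map<String, Integer> topDomainCounts = new HashMap<>();
        Map<String, Integer> randomOrder = new HashMap<>();

        for (String domain : domains) {
            topDomainCounts.merge(topDomain(domain), 1, Integer::sum);
            randomOrder.put(domain, random.nextInt());
        }

        // false sorts before true, so reverse to put crowded top domains first
        Comparator<String> crowdedFirst = Comparator
                .comparing((String d) -> topDomainCounts.get(topDomain(d)) >= threshold)
                .reversed();

        List<String> sorted = new ArrayList<>(domains);
        sorted.sort(crowdedFirst
                .thenComparing((String d) -> randomOrder.get(d))
                .thenComparing(Comparator.<String>naturalOrder())); // tie-breaker on equal random keys
        return sorted;
    }
}
```

Scheduling the crowded top domains early gives the tasks that contend for the same domain lock the most time to be interleaved with unrelated work.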
@@ -348,79 +431,117 @@ public class CrawlerMain extends ProcessMainClass {
            this.id = Integer.toHexString(domain.hashCode());
        }

+        /** Best effort indicator whether we could start this now without getting stuck in
+         * DomainLocks purgatory */
+        public boolean canRun() {
+            return domainLocks.isLockableHint(new EdgeDomain(domain));
+        }
+
        @Override
        public void run() throws Exception {

-            Path newWarcFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.LIVE);
-            Path tempFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.TEMP);
-            Path parquetFile = CrawlerOutputFile.createParquetPath(outputDir, id, domain);
-
-            // Move the WARC file to a temp file if it exists, so we can resume the crawl using the old data
-            // while writing to the same file name as before
-            if (Files.exists(newWarcFile)) {
-                Files.move(newWarcFile, tempFile, StandardCopyOption.REPLACE_EXISTING);
-            }
-            else {
-                Files.deleteIfExists(tempFile);
+            if (workLog.isJobFinished(domain)) { // No-Op
+                logger.info("Omitting task {}, as it has already run", domain);
+                return;
            }

-            try (var warcRecorder = new WarcRecorder(newWarcFile); // write to a temp file for now
-                 var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, domainStateDb, warcRecorder);
-                 CrawlDataReference reference = getReference();
-                 )
-            {
-                // Resume the crawl if it was aborted
-                if (Files.exists(tempFile)) {
-                    retriever.syncAbortedRun(tempFile);
-                    Files.delete(tempFile);
+            Optional<DomainLocks.DomainLock> lock = domainLocks.tryLockDomain(new EdgeDomain(domain));
+            // We don't have a lock, so we can't run this task
+            // we return to avoid blocking the pool for too long
+            if (lock.isEmpty()) {
+                if (retryQueue.remainingCapacity() > 0) {
+                    // Sleep a moment to avoid busy looping via the retry queue
+                    // in the case when few tasks remain and almost all are ineligible for
+                    // immediate restart
+                    Thread.sleep(5);
                }

-                DomainLinks domainLinks = anchorTagsSource.getAnchorTags(domain);
+                retryQueue.put(this);
+                return;
+            }
+            DomainLocks.DomainLock domainLock = lock.get();

-                int size;
-                try (var lock = domainLocks.lockDomain(new EdgeDomain(domain))) {
-                    size = retriever.crawlDomain(domainLinks, reference);
+            try (domainLock) {
+                Thread.currentThread().setName("crawling:" + domain);
+
+                Path newWarcFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.LIVE);
+                Path tempFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.TEMP);
+                Path slopFile = CrawlerOutputFile.createSlopPath(outputDir, id, domain);
+
+                // Move the WARC file to a temp file if it exists, so we can resume the crawl using the old data
+                // while writing to the same file name as before
+                if (Files.exists(newWarcFile)) {
+                    Files.move(newWarcFile, tempFile, StandardCopyOption.REPLACE_EXISTING);
+                }
+                else {
+                    Files.deleteIfExists(tempFile);
                }

-                // Delete the reference crawl data if it's not the same as the new one
-                // (mostly a case when migrating from legacy->warc)
-                reference.delete();
+                try (var warcRecorder = new WarcRecorder(newWarcFile); // write to a temp file for now
+                     var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, domainStateDb, warcRecorder);
+                     CrawlDataReference reference = getReference())
+                {
+                    // Resume the crawl if it was aborted
+                    if (Files.exists(tempFile)) {
+                        retriever.syncAbortedRun(tempFile);
+                        Files.delete(tempFile);
+                    }

-                // Convert the WARC file to Parquet
-                CrawledDocumentParquetRecordFileWriter
-                        .convertWarc(domain, userAgent, newWarcFile, parquetFile);
+                    DomainLinks domainLinks = anchorTagsSource.getAnchorTags(domain);

-                // Optionally archive the WARC file if full retention is enabled,
-                // otherwise delete it:
-                warcArchiver.consumeWarc(newWarcFile, domain);
+                    int size = retriever.crawlDomain(domainLinks, reference);

-                // Mark the domain as finished in the work log
-                workLog.setJobToFinished(domain, parquetFile.toString(), size);
+                    // Delete the reference crawl data if it's not the same as the new one
+                    // (mostly a case when migrating from legacy->warc)
+                    reference.delete();

-                // Update the progress bar
-                heartbeat.setProgress(tasksDone.incrementAndGet() / (double) totalTasks);
+                    // Convert the WARC file to Slop
+                    SlopCrawlDataRecord
+                            .convertWarc(domain, userAgent, newWarcFile, slopFile);

-                logger.info("Fetched {}", domain);
-            } catch (Exception e) {
-                logger.error("Error fetching domain " + domain, e);
-            }
-            finally {
-                // We don't need to double-count these; it's also kept int he workLog
-                pendingCrawlTasks.remove(domain);
-                Thread.currentThread().setName("[idle]");
+                    // Optionally archive the WARC file if full retention is enabled,
+                    // otherwise delete it:
+                    warcArchiver.consumeWarc(newWarcFile, domain);

-                Files.deleteIfExists(newWarcFile);
-                Files.deleteIfExists(tempFile);
+                    // Mark the domain as finished in the work log
+                    workLog.setJobToFinished(domain, slopFile.toString(), size);
+
+                    // Update the progress bar
+                    heartbeat.setProgress(tasksDone.incrementAndGet() / (double) totalTasks);
+
+                    logger.info("Fetched {}", domain);
+                } catch (Exception e) {
+                    logger.error("Error fetching domain " + domain, e);
+                }
+                finally {
+                    // We don't need to double-count these; it's also kept in the workLog
+                    pendingCrawlTasks.remove(domain);
+                    Thread.currentThread().setName("[idle]");
+
+                    Files.deleteIfExists(newWarcFile);
+                    Files.deleteIfExists(tempFile);
+                }
            }
        }

        private CrawlDataReference getReference() {
            try {
-                return new CrawlDataReference(CrawledDomainReader.createDataStream(outputDir, domain, id));
-            } catch (IOException e) {
+                Path slopPath = CrawlerOutputFile.getSlopPath(outputDir, id, domain);
+                if (Files.exists(slopPath)) {
+                    return new CrawlDataReference(slopPath);
+                }
+
+                Path parquetPath = CrawlerOutputFile.getParquetPath(outputDir, id, domain);
+                if (Files.exists(parquetPath)) {
+                    slopPath = migrateParquetData(parquetPath, domain, outputDir);
+                    return new CrawlDataReference(slopPath);
+                }
+
+            } catch (Exception e) {
                logger.debug("Failed to read previous crawl data for {}", specification.domain());
-                return new CrawlDataReference();
            }
+
+            return new CrawlDataReference();
        }

    }
@@ -480,4 +601,20 @@ public class CrawlerMain extends ProcessMainClass {
            }
        }
    }
+
+    // Migrate from parquet to slop if necessary
+    //
+    // This must be synchronized as chewing through parquet files in parallel leads to enormous memory overhead
+    private synchronized Path migrateParquetData(Path inputPath, String domain, Path crawlDataRoot) throws IOException {
+        if (!inputPath.toString().endsWith(".parquet")) {
+            return inputPath;
+        }
+
+        Path outputFile = CrawlerOutputFile.createSlopPath(crawlDataRoot, Integer.toHexString(domain.hashCode()), domain);
+
+        SlopCrawlDataRecord.convertFromParquet(inputPath, outputFile);
+
+        return outputFile;
+    }
+
 }
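Review note: the crawl task above never waits for a domain lock. It tries once, and if the lock is held it hands itself to `retryQueue` and returns, which frees the pool thread for other work. A minimal single-threaded sketch of that pattern follows. `RetryOnBusySketch`, its `Semaphore` and the `completed` counter are hypothetical stand-ins chosen for illustration, not the `DomainLocks` API from the patch, and the sketch uses the non-blocking `offer` where the patch uses `put`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;

// Hypothetical illustration of "try the lock, otherwise defer to a retry queue".
class RetryOnBusySketch {
    // A single permit stands in for the per-domain lock (assumption)
    final Semaphore domainPermit = new Semaphore(1);
    // Deferred tasks wait here until someone re-runs them
    final BlockingQueue<Runnable> retryQueue = new ArrayBlockingQueue<>(16);
    int completed = 0;

    Runnable task() {
        return new Runnable() {
            @Override
            public void run() {
                // Do not block: if the permit is taken, re-queue and return
                if (!domainPermit.tryAcquire()) {
                    retryQueue.offer(this);
                    return;
                }
                try {
                    completed++; // the actual crawl would happen here
                } finally {
                    domainPermit.release();
                }
            }
        };
    }
}
```

Something still has to drain the retry queue and resubmit the deferred tasks. In the sketch that is left to the caller.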
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/DomainStateDb.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/DomainStateDb.java
@@ -1,5 +1,8 @@
 package nu.marginalia.crawl;

+import com.google.inject.Inject;
+import nu.marginalia.storage.FileStorageService;
+import nu.marginalia.storage.model.FileStorageType;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

@@ -8,7 +11,9 @@ import java.nio.file.Path;
 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.SQLException;
+import java.time.Duration;
 import java.time.Instant;
+import java.util.Objects;
 import java.util.Optional;

 /** Supplemental sqlite database for storing the summary of a crawl.
@@ -20,6 +25,17 @@ public class DomainStateDb implements AutoCloseable {

    private final Connection connection;

+
+    public record CrawlMeta(
+            String domainName,
+            Instant lastFullCrawl,
+            Duration recrawlTime,
+            Duration crawlTime,
+            int recrawlErrors,
+            int crawlChanges,
+            int totalCrawlSize
+    ) {}
+
    public record SummaryRecord(
            String domainName,
            Instant lastUpdated,
@@ -60,7 +76,31 @@ public class DomainStateDb implements AutoCloseable {

    }

-    public DomainStateDb(Path filename) throws SQLException {
+    public record FaviconRecord(String contentType, byte[] imageData) {}
+
+    @Inject
+    public DomainStateDb(FileStorageService fileStorageService) throws SQLException {
+        this(findFilename(fileStorageService));
+    }
+
+    private static Path findFilename(FileStorageService fileStorageService) throws SQLException {
+        var fsId = fileStorageService.getOnlyActiveFileStorage(FileStorageType.CRAWL_DATA);
+
+        if (fsId.isPresent()) {
+            var fs = fileStorageService.getStorage(fsId.get());
+            return fs.asPath().resolve("domainstate.db");
+        }
+        else {
+            return null;
+        }
+    }
+
+    public DomainStateDb(@Nullable Path filename) throws SQLException {
+        if (null == filename) {
+            connection = null;
+            return;
+        }
+
        String sqliteDbString = "jdbc:sqlite:" + filename.toString();
        connection = DriverManager.getConnection(sqliteDbString);

@@ -74,18 +114,102 @@ public class DomainStateDb implements AutoCloseable {
                        feedUrl TEXT
                    )
                    """);
-
+            stmt.executeUpdate("""
+                    CREATE TABLE IF NOT EXISTS crawl_meta (
+                        domain TEXT PRIMARY KEY,
+                        lastFullCrawlEpochMs LONG NOT NULL,
+                        recrawlTimeMs LONG NOT NULL,
+                        recrawlErrors INTEGER NOT NULL,
+                        crawlTimeMs LONG NOT NULL,
+                        crawlChanges INTEGER NOT NULL,
+                        totalCrawlSize INTEGER NOT NULL
+                    )
+                    """);
+            stmt.executeUpdate("""
+                    CREATE TABLE IF NOT EXISTS favicon (
+                        domain TEXT PRIMARY KEY,
+                        contentType TEXT NOT NULL,
+                        icon BLOB NOT NULL
+                    )
+                    """);
            stmt.execute("PRAGMA journal_mode=WAL");
        }
    }

    @Override
    public void close() throws SQLException {
-        connection.close();
+        if (connection != null) {
+            connection.close();
+        }
    }

+    public boolean isAvailable() {
+        return connection != null;
+    }
+
+    public void saveIcon(String domain, FaviconRecord faviconRecord) {
+        if (connection == null) throw new IllegalStateException("No connection to domainstate db");
+
+        try (var stmt = connection.prepareStatement("""
+                INSERT OR REPLACE INTO favicon (domain, contentType, icon)
+                       VALUES(?, ?, ?)
+            """)) {
+            stmt.setString(1, domain);
+            stmt.setString(2, Objects.requireNonNullElse(faviconRecord.contentType, "application/octet-stream"));
+            stmt.setBytes(3, faviconRecord.imageData);
+            stmt.executeUpdate();
+        }
+        catch (SQLException ex) {
+            logger.error("Failed to insert favicon", ex);
+        }
+    }
+
+    public Optional<FaviconRecord> getIcon(String domain) {
+        if (connection == null)
+            return Optional.empty();
+
+        try (var stmt = connection.prepareStatement("SELECT contentType, icon FROM favicon WHERE domain = ?")) {
+            stmt.setString(1, domain);
+            var rs = stmt.executeQuery();
+
+            if (rs.next()) {
+                return Optional.of(
+                    new FaviconRecord(
+                        rs.getString("contentType"),
+                        rs.getBytes("icon")
+                    )
+                );
+            }
+        } catch (SQLException e) {
+            logger.error("Failed to retrieve favicon", e);
+        }
+
+        return Optional.empty();
+    }
+
+    public void save(CrawlMeta crawlMeta) {
+        if (connection == null) throw new IllegalStateException("No connection to domainstate db");
+
+        try (var stmt = connection.prepareStatement("""
+                INSERT OR REPLACE INTO crawl_meta (domain, lastFullCrawlEpochMs, recrawlTimeMs, recrawlErrors, crawlTimeMs, crawlChanges, totalCrawlSize)
+                VALUES (?, ?, ?, ?, ?, ?, ?)
+                """)) {
+            stmt.setString(1, crawlMeta.domainName());
+            stmt.setLong(2, crawlMeta.lastFullCrawl.toEpochMilli());
+            stmt.setLong(3, crawlMeta.recrawlTime.toMillis());
+            stmt.setInt(4, crawlMeta.recrawlErrors);
+            stmt.setLong(5, crawlMeta.crawlTime.toMillis());
+            stmt.setInt(6, crawlMeta.crawlChanges);
+            stmt.setInt(7, crawlMeta.totalCrawlSize);
+            stmt.executeUpdate();
+        } catch (SQLException e) {
+            logger.error("Failed to insert crawl meta record", e);
+        }
+    }

    public void save(SummaryRecord record) {
+        if (connection == null) throw new IllegalStateException("No connection to domainstate db");
+
        try (var stmt = connection.prepareStatement("""
                INSERT OR REPLACE INTO summary (domain, lastUpdatedEpochMs, state, stateDesc, feedUrl)
                VALUES (?, ?, ?, ?, ?)
@@ -101,7 +225,38 @@ public class DomainStateDb implements AutoCloseable {
        }
    }

-    public Optional<SummaryRecord> get(String domainName) {
+    public Optional<CrawlMeta> getMeta(String domainName) {
+        if (connection == null)
+            return Optional.empty();
+
+        try (var stmt = connection.prepareStatement("""
+                SELECT domain, lastFullCrawlEpochMs, recrawlTimeMs, recrawlErrors, crawlTimeMs, crawlChanges, totalCrawlSize
+                FROM crawl_meta
+                WHERE domain = ?
+                """)) {
+            stmt.setString(1, domainName);
+            var rs = stmt.executeQuery();
+            if (rs.next()) {
+                return Optional.of(new CrawlMeta(
+                        rs.getString("domain"),
+                        Instant.ofEpochMilli(rs.getLong("lastFullCrawlEpochMs")),
+                        Duration.ofMillis(rs.getLong("recrawlTimeMs")),
+                        Duration.ofMillis(rs.getLong("crawlTimeMs")),
+                        rs.getInt("recrawlErrors"),
+                        rs.getInt("crawlChanges"),
+                        rs.getInt("totalCrawlSize")
+                ));
+            }
+        } catch (SQLException ex) {
+            logger.error("Failed to get crawl meta record", ex);
+        }
+        return Optional.empty();
+    }
+
+    public Optional<SummaryRecord> getSummary(String domainName) {
+        if (connection == null)
+            return Optional.empty();
+
        try (var stmt = connection.prepareStatement("""
                SELECT domain, lastUpdatedEpochMs, state, stateDesc, feedUrl
                FROM summary
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/ContentTags.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/ContentTags.java
@@ -1,6 +1,6 @@
 package nu.marginalia.crawl.fetcher;

-import okhttp3.Request;
+import org.apache.hc.client5.http.classic.methods.HttpGet;

 /** Encapsulates request modifiers; the ETag and Last-Modified tags for a resource */
 public record ContentTags(String etag, String lastMod) {
@@ -17,14 +17,14 @@ public record ContentTags(String etag, String lastMod) {
    }

    /** Paints the tags onto the request builder. */
-    public void paint(Request.Builder getBuilder) {
+    public void paint(HttpGet request) {

        if (etag != null) {
-            getBuilder.addHeader("If-None-Match", etag);
+            request.addHeader("If-None-Match", etag);
        }

        if (lastMod != null) {
-            getBuilder.addHeader("If-Modified-Since", lastMod);
+            request.addHeader("If-Modified-Since", lastMod);
        }
    }
 }
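Review note: `ContentTags.paint` turns a stored ETag and Last-Modified value into the conditional request headers `If-None-Match` and `If-Modified-Since`, which lets the server answer 304 Not Modified rather than resend the body. The sketch below shows the same idea with the JDK's `java.net.http.HttpRequest` instead of Apache HttpClient, so it has no external dependencies. `ConditionalTagsSketch` is a hypothetical name and not part of the patch.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical stand-in for ContentTags, built on JDK types only.
record ConditionalTagsSketch(String etag, String lastMod) {

    // Adds a conditional header for each validator that is present
    HttpRequest.Builder paint(HttpRequest.Builder builder) {
        if (etag != null) {
            builder.header("If-None-Match", etag);
        }
        if (lastMod != null) {
            builder.header("If-Modified-Since", lastMod);
        }
        return builder;
    }

    public static void main(String[] args) {
        var tags = new ConditionalTagsSketch("\"abc123\"", null);
        HttpRequest request = tags
                .paint(HttpRequest.newBuilder(URI.create("https://example.com/")))
                .build();
        System.out.println(request.headers().map());
    }
}
```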
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/Cookies.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/Cookies.java
@@ -1,43 +0,0 @@
-package nu.marginalia.crawl.fetcher;
-
-import okhttp3.Cookie;
-import okhttp3.CookieJar;
-import okhttp3.HttpUrl;
-
-import java.util.Collections;
-import java.util.List;
-import java.util.concurrent.ConcurrentHashMap;
-
-public class Cookies {
-    final ThreadLocal<ConcurrentHashMap<String, List<Cookie>>> cookieJar = ThreadLocal.withInitial(ConcurrentHashMap::new);
-
-    public CookieJar getJar() {
-        return new CookieJar() {
-
-            @Override
-            public void saveFromResponse(HttpUrl url, List<Cookie> cookies) {
-
-                if (!cookies.isEmpty()) {
-                    cookieJar.get().put(url.host(), cookies);
-                }
-            }
-
-            @Override
-            public List<Cookie> loadForRequest(HttpUrl url) {
-                return cookieJar.get().getOrDefault(url.host(), Collections.emptyList());
-            }
-        };
-    }
-
-    public void clear() {
-        cookieJar.get().clear();
-    }
-
-    public boolean hasCookies() {
-        return !cookieJar.get().isEmpty();
-    }
-
-    public List<String> getCookies() {
-        return cookieJar.get().values().stream().flatMap(List::stream).map(Cookie::toString).toList();
-    }
-}
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/DomainCookies.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/DomainCookies.java
@@ -0,0 +1,56 @@
+package nu.marginalia.crawl.fetcher;
+
+import org.apache.hc.client5.http.classic.methods.HttpUriRequestBase;
+import org.apache.hc.core5.http.ClassicHttpRequest;
+import org.apache.hc.core5.http.HttpResponse;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.StringJoiner;
+
+public class DomainCookies {
+    private final Map<String, String> cookies = new HashMap<>();
+
+    public boolean hasCookies() {
+        return !cookies.isEmpty();
+    }
+
+    public void updateCookieStore(HttpResponse response) {
+        for (var header : response.getHeaders()) {
+            if (header.getName().equalsIgnoreCase("Set-Cookie")) {
+                parseCookieHeader(header.getValue());
+            }
+        }
+    }
+
+    private void parseCookieHeader(String value) {
+        // Parse the Set-Cookie header value and extract the cookies
+
+        String[] parts = value.split(";");
+        String cookie = parts[0].trim();
+
+        if (cookie.contains("=")) {
+            String[] cookieParts = cookie.split("=", 2);
+            String name = cookieParts[0].trim();
+            String val = cookieParts[1].trim();
+            cookies.put(name, val);
+        }
+    }
+
+    public void paintRequest(HttpUriRequestBase request) {
+        request.addHeader("Cookie", createCookieHeader());
+    }
+
+    public void paintRequest(ClassicHttpRequest request) {
+        request.addHeader("Cookie", createCookieHeader());
+    }
+
+    private String createCookieHeader() {
+        StringJoiner sj = new StringJoiner("; ");
+        for (var cookie : cookies.entrySet()) {
+            sj.add(cookie.getKey() + "=" + cookie.getValue());
+        }
+        return sj.toString();
+    }
+
+}
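Review note: the `Set-Cookie` handling in `DomainCookies` keeps only the leading `name=value` pair and ignores attributes such as `Path` or `Expires`. The standalone sketch below isolates that parsing step. `CookiePairSketch` and its `parse` method are hypothetical and not part of the patch. It splits on the first `=` only, so an empty value (`name=`) or a value that itself contains `=` (base64 padding, for instance) is handled without truncation or an out-of-bounds access.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: extracts the name=value pair from a Set-Cookie
// header value and ignores the attributes that follow it.
class CookiePairSketch {

    static Optional<Map.Entry<String, String>> parse(String headerValue) {
        // Everything after the first ';' is attributes (Path, Expires, ...)
        String pair = headerValue.split(";", 2)[0].trim();

        int eq = pair.indexOf('=');
        if (eq <= 0) {
            return Optional.empty(); // no '=' at all, or an empty cookie name
        }

        String name = pair.substring(0, eq).trim();
        String value = pair.substring(eq + 1).trim();
        return Optional.of(Map.entry(name, value));
    }

    public static void main(String[] args) {
        System.out.println(parse("session=abc==; Path=/; HttpOnly"));
    }
}
```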
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/HttpFetcher.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/HttpFetcher.java
@@ -3,31 +3,32 @@ package nu.marginalia.crawl.fetcher;
 import com.google.inject.ImplementedBy;
 import crawlercommons.robots.SimpleRobotRules;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
+import nu.marginalia.crawl.retreival.CrawlDelayTimer;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawlerDomainStatus;
+import org.apache.hc.client5.http.cookie.CookieStore;

 import java.util.List;

@ImplementedBy(HttpFetcherImpl.class)
-public interface HttpFetcher {
+public interface HttpFetcher extends AutoCloseable {
    void setAllowAllContentTypes(boolean allowAllContentTypes);

-    List<String> getCookies();
+    CookieStore getCookies();
    void clearCookies();

    DomainProbeResult probeDomain(EdgeUrl url);

-    ContentTypeProbeResult probeContentType(
-                                EdgeUrl url,
-                                WarcRecorder recorder,
-                                ContentTags tags) throws HttpFetcherImpl.RateLimitException;
-
    HttpFetchResult fetchContent(EdgeUrl url,
                                 WarcRecorder recorder,
+                                 DomainCookies cookies,
+                                 CrawlDelayTimer timer,
                                 ContentTags tags,
-                                 ProbeType probeType) throws HttpFetcherImpl.RateLimitException, Exception;
+                                 ProbeType probeType);
+
+    List<EdgeUrl> fetchSitemapUrls(String rootSitemapUrl, CrawlDelayTimer delayTimer);

    SimpleRobotRules fetchRobotRules(EdgeDomain domain, WarcRecorder recorder);

@@ -43,6 +44,7 @@ public interface HttpFetcher {

        /** This domain redirects to another domain */
        record Redirect(EdgeDomain domain) implements DomainProbeResult {}
+        record RedirectSameDomain_Internal(EdgeUrl domain) implements DomainProbeResult {}

        /** If the retrieval of the probed url was successful, return the url as it was fetched
         * (which may be different from the url we probed, if we attempted another URL schema).
@@ -53,7 +55,10 @@ public interface HttpFetcher {
    }

    sealed interface ContentTypeProbeResult {
+        record NoOp() implements ContentTypeProbeResult {}
        record Ok(EdgeUrl resolvedUrl) implements ContentTypeProbeResult { }
+        record HttpError(int statusCode, String message) implements ContentTypeProbeResult { }
+        record Redirect(EdgeUrl location) implements ContentTypeProbeResult { }
        record BadContentType(String contentType, int statusCode) implements ContentTypeProbeResult { }
        record Timeout(java.lang.Exception ex) implements ContentTypeProbeResult { }
        record Exception(java.lang.Exception ex) implements ContentTypeProbeResult { }
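Review note: because `ContentTypeProbeResult` is a sealed interface of records, callers can handle the new `HttpError` and `Redirect` cases with an exhaustive `switch`. The sketch below uses a trimmed-down, hypothetical mirror of that interface (`ProbeResultSketch`, with `String` in place of `EdgeUrl`) and assumes Java 21 pattern matching. The action strings are invented for illustration.

```java
// Hypothetical, reduced mirror of ContentTypeProbeResult.
sealed interface ProbeResultSketch {
    record Ok(String resolvedUrl) implements ProbeResultSketch {}
    record Redirect(String location) implements ProbeResultSketch {}
    record HttpError(int statusCode, String message) implements ProbeResultSketch {}
    record BadContentType(String contentType, int statusCode) implements ProbeResultSketch {}

    // No default branch: adding another record to the sealed interface
    // makes this switch fail to compile until the new case is handled.
    static String describe(ProbeResultSketch result) {
        return switch (result) {
            case Ok ok -> "fetch " + ok.resolvedUrl();
            case Redirect r -> "follow " + r.location();
            case HttpError e -> "give up: HTTP " + e.statusCode();
            case BadContentType b -> "skip " + b.contentType();
        };
    }
}
```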
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/HttpFetcherImpl.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/HttpFetcherImpl.java
@@ -1,78 +1,173 @@
 package nu.marginalia.crawl.fetcher;

 import com.google.inject.Inject;
+import com.google.inject.Singleton;
 import crawlercommons.robots.SimpleRobotRules;
 import crawlercommons.robots.SimpleRobotRulesParser;
 import nu.marginalia.UserAgent;
-import nu.marginalia.crawl.fetcher.socket.FastTerminatingSocketFactory;
-import nu.marginalia.crawl.fetcher.socket.IpInterceptingNetworkInterceptor;
-import nu.marginalia.crawl.fetcher.socket.NoSecuritySSL;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
+import nu.marginalia.crawl.retreival.CrawlDelayTimer;
+import nu.marginalia.link_parser.LinkParser;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.body.ContentTypeLogic;
 import nu.marginalia.model.body.DocumentBodyExtractor;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawlerDomainStatus;
-import okhttp3.ConnectionPool;
-import okhttp3.Dispatcher;
-import okhttp3.OkHttpClient;
-import okhttp3.Request;
+import org.apache.hc.client5.http.ConnectionKeepAliveStrategy;
+import org.apache.hc.client5.http.HttpRequestRetryStrategy;
+import org.apache.hc.client5.http.classic.HttpClient;
+import org.apache.hc.client5.http.classic.methods.HttpGet;
+import org.apache.hc.client5.http.config.ConnectionConfig;
+import org.apache.hc.client5.http.config.RequestConfig;
+import org.apache.hc.client5.http.cookie.BasicCookieStore;
+import org.apache.hc.client5.http.cookie.CookieStore;
+import org.apache.hc.client5.http.cookie.StandardCookieSpec;
+import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
+import org.apache.hc.client5.http.impl.classic.HttpClients;
+import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
+import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManagerBuilder;
+import org.apache.hc.client5.http.ssl.DefaultClientTlsStrategy;
+import org.apache.hc.core5.http.*;
+import org.apache.hc.core5.http.io.HttpClientResponseHandler;
+import org.apache.hc.core5.http.io.SocketConfig;
+import org.apache.hc.core5.http.io.entity.EntityUtils;
+import org.apache.hc.core5.http.io.support.ClassicRequestBuilder;
+import org.apache.hc.core5.http.message.MessageSupport;
+import org.apache.hc.core5.http.protocol.HttpContext;
+import org.apache.hc.core5.pool.PoolStats;
+import org.apache.hc.core5.util.TimeValue;
+import org.apache.hc.core5.util.Timeout;
+import org.jsoup.Jsoup;
+import org.jsoup.nodes.Document;
+import org.jsoup.parser.Parser;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
+import org.slf4j.Marker;
+import org.slf4j.MarkerFactory;

-import javax.net.ssl.X509TrustManager;
-import java.io.InterruptedIOException;
+import javax.net.ssl.SSLContext;
+import javax.net.ssl.SSLException;
+import java.io.IOException;
+import java.net.SocketTimeoutException;
+import java.net.URISyntaxException;
+import java.security.NoSuchAlgorithmException;
 import java.time.Duration;
-import java.util.List;
-import java.util.Objects;
-import java.util.Optional;
+import java.time.Instant;
+import java.util.*;
+import java.util.concurrent.Semaphore;
 import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicBoolean;


-public class HttpFetcherImpl implements HttpFetcher {
+@Singleton
+public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {

    private final Logger logger = LoggerFactory.getLogger(getClass());
    private final String userAgentString;
    private final String userAgentIdentifier;
-    private final Cookies cookies = new Cookies();
+
+    private final CookieStore cookies = new BasicCookieStore();

    private static final SimpleRobotRulesParser robotsParser = new SimpleRobotRulesParser();
    private static final ContentTypeLogic contentTypeLogic = new ContentTypeLogic();
+    private final Marker crawlerAuditMarker = MarkerFactory.getMarker("CRAWLER");

+    private final LinkParser linkParser = new LinkParser();
    @Override
    public void setAllowAllContentTypes(boolean allowAllContentTypes) {
        contentTypeLogic.setAllowAllContentTypes(allowAllContentTypes);
    }

-    private final OkHttpClient client;
+    private final CloseableHttpClient client;
+    private PoolingHttpClientConnectionManager connectionManager;

-    private static final FastTerminatingSocketFactory ftSocketFactory = new FastTerminatingSocketFactory();
+    public PoolStats getPoolStats() {
+        return connectionManager.getTotalStats();
+    }

-    private OkHttpClient createClient(Dispatcher dispatcher, ConnectionPool pool) {
-        var builder = new OkHttpClient.Builder();
-        if (dispatcher != null) {
-            builder.dispatcher(dispatcher);
-        }
+    private CloseableHttpClient createClient() throws NoSuchAlgorithmException {
+        final ConnectionConfig connectionConfig = ConnectionConfig.custom()
+                .setSocketTimeout(10, TimeUnit.SECONDS)
+                .setConnectTimeout(30, TimeUnit.SECONDS)
+                .setValidateAfterInactivity(TimeValue.ofSeconds(5))
+                .build();

-        return builder.sslSocketFactory(NoSecuritySSL.buildSocketFactory(), (X509TrustManager) NoSecuritySSL.trustAllCerts[0])
-            .socketFactory(ftSocketFactory)
-            .hostnameVerifier(NoSecuritySSL.buildHostnameVerifyer())
-            .addNetworkInterceptor(new IpInterceptingNetworkInterceptor())
-            .connectionPool(pool)
-            .cookieJar(cookies.getJar())
-            .followRedirects(true)
-            .followSslRedirects(true)
-            .connectTimeout(8, TimeUnit.SECONDS)
-            .readTimeout(10, TimeUnit.SECONDS)
-            .writeTimeout(10, TimeUnit.SECONDS)
-            .build();
+        connectionManager = PoolingHttpClientConnectionManagerBuilder.create()
+                .setMaxConnPerRoute(2)
+                .setMaxConnTotal(5000)
+                .setDefaultConnectionConfig(connectionConfig)
+                .setTlsSocketStrategy(new DefaultClientTlsStrategy(SSLContext.getDefault()))
+                .build();

+        connectionManager.setDefaultSocketConfig(SocketConfig.custom()
+                .setSoLinger(TimeValue.ofSeconds(-1))
+                .setSoTimeout(Timeout.ofSeconds(10))
+                .build()
+        );
+
+        Thread.ofPlatform().daemon(true).start(() -> {
+            try {
+                for (;;) {
+                    TimeUnit.SECONDS.sleep(15);
+                    logger.info("Connection pool stats: {}", connectionManager.getTotalStats());
+                }
+            }
+            catch (InterruptedException e) {
+                Thread.currentThread().interrupt();
+            }
+        });
+
+        final RequestConfig defaultRequestConfig = RequestConfig.custom()
+                .setCookieSpec(StandardCookieSpec.RELAXED)
+                .setResponseTimeout(10, TimeUnit.SECONDS)
+                .setConnectionRequestTimeout(5, TimeUnit.MINUTES)
+                .build();
+
+        return HttpClients.custom()
+                .setDefaultCookieStore(cookies)
+                .setConnectionManager(connectionManager)
+                .setRetryStrategy(this)
+                .setKeepAliveStrategy(new ConnectionKeepAliveStrategy() {
+                    // Default keep-alive duration is 3 minutes, but this is too long for us,
+                    // as we are either going to re-use it fairly quickly or close it for a long time.
+                    //
+                    // So we set it to 30 seconds or clamp the server-provided value to a minimum of 10 seconds.
+                    private static final TimeValue defaultValue = TimeValue.ofSeconds(30);
+
+                    @Override
+                    public TimeValue getKeepAliveDuration(HttpResponse response, HttpContext context) {
+                        final Iterator<HeaderElement> it = MessageSupport.iterate(response, HeaderElements.KEEP_ALIVE);
+
+                        while (it.hasNext()) {
+                            final HeaderElement he = it.next();
+                            final String param = he.getName();
+                            final String value = he.getValue();
+
+                            if (value == null)
+                                continue;
+                            if (!"timeout".equalsIgnoreCase(param))
+                                continue;
+
+                            try {
+                                long timeout = Long.parseLong(value);
+                                timeout = Math.clamp(timeout, 10, defaultValue.toSeconds());
+                                return TimeValue.ofSeconds(timeout);
+                            } catch (final NumberFormatException ignore) {
+                                break;
+                            }
+                        }
+                        return defaultValue;
+                    }
+                })
+                .disableRedirectHandling()
+                .setDefaultRequestConfig(defaultRequestConfig)
+                .build();
    }

    @Override
-    public List<String> getCookies() {
-        return cookies.getCookies();
+    public CookieStore getCookies() {
+        return cookies;
    }

    @Override
@@ -81,26 +176,32 @@ public class HttpFetcherImpl implements HttpFetcher {
    }

    @Inject
-    public HttpFetcherImpl(UserAgent userAgent,
-                           Dispatcher dispatcher,
-                           ConnectionPool connectionPool)
+    public HttpFetcherImpl(UserAgent userAgent)
    {
-        this.client = createClient(dispatcher, connectionPool);
+        try {
+            this.client = createClient();
+        } catch (NoSuchAlgorithmException e) {
+            throw new RuntimeException(e);
+        }
        this.userAgentString = userAgent.uaString();
        this.userAgentIdentifier = userAgent.uaIdentifier();
    }

    public HttpFetcherImpl(String userAgent) {
-        this.client = createClient(null, new ConnectionPool());
+        try {
+            this.client = createClient();
+        } catch (NoSuchAlgorithmException e) {
+            throw new RuntimeException(e);
+        }
        this.userAgentString = userAgent;
        this.userAgentIdentifier = userAgent;
    }

    // Not necessary in prod, but useful in test
-    public void close() {
-        client.dispatcher().executorService().shutdown();
-        client.connectionPool().evictAll();
+    public void close() throws IOException {
+        client.close();
    }
+
    /**
     * Probe the domain to see if it is reachable, attempting to identify which schema to use,
     * and if there are any redirects.  This is done by one or more HEAD requests.
@@ -110,23 +211,94 @@ public class HttpFetcherImpl implements HttpFetcher {
     */
    @Override
    public DomainProbeResult probeDomain(EdgeUrl url) {
-        var head = new Request.Builder().head().addHeader("User-agent", userAgentString)
-                .url(url.toString())
-                .build();
+        List<EdgeUrl> urls = new ArrayList<>();
+        urls.add(url);

-        var call = client.newCall(head);
+        int redirects = 0;
+        AtomicBoolean tryGet = new AtomicBoolean(false);

-        try (var rsp = call.execute()) {
-            EdgeUrl requestUrl = new EdgeUrl(rsp.request().url().toString());
+        while (!urls.isEmpty() && ++redirects < 5) {
+            ClassicHttpRequest request;

-            if (!Objects.equals(requestUrl.domain, url.domain)) {
-                return new DomainProbeResult.Redirect(requestUrl.domain);
+            EdgeUrl topUrl = urls.removeFirst();
+            try {
+                if (tryGet.get()) {
+                    request = ClassicRequestBuilder.get(topUrl.asURI())
+                                .addHeader("User-Agent", userAgentString)
+                                .addHeader("Accept-Encoding", "gzip")
+                                .addHeader("Range", "bytes=0-255")
+                                .build();
+                } else {
+                    request = ClassicRequestBuilder.head(topUrl.asURI())
+                                .addHeader("User-Agent", userAgentString)
+                                .addHeader("Accept-Encoding", "gzip")
+                                .build();
+                }
+            } catch (URISyntaxException e) {
+                return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Invalid URL");
            }
-            return new DomainProbeResult.Ok(requestUrl);
-        }
-        catch (Exception ex) {
-            return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, ex.getMessage());
+
+            try {
+                var result = SendLock.wrapSend(client, request, response -> {
+                    EntityUtils.consume(response.getEntity());
+
+                    return switch (response.getCode()) {
+                        case 200 -> new DomainProbeResult.Ok(topUrl);
+                        case 405 -> {
+                            if (!tryGet.get()) {
+                                tryGet.set(true);
+                                yield new DomainProbeResult.RedirectSameDomain_Internal(topUrl);
+                            }
+                            else {
+                                yield new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "HTTP status 405, tried HEAD and GET?!");
+                            }
+                        }
+                        case 301, 302, 307 -> {
+                            var location = response.getFirstHeader("Location");
+
+                            if (location != null) {
+                                Optional<EdgeUrl> newUrl = linkParser.parseLink(topUrl, location.getValue());
+                                if (newUrl.isEmpty()) {
+                                    yield new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Invalid location header on redirect");
+                                }
+                                EdgeUrl newEdgeUrl = newUrl.get();
+                                if (newEdgeUrl.domain.equals(topUrl.domain)) {
+                                    yield new DomainProbeResult.RedirectSameDomain_Internal(newEdgeUrl);
+                                }
+                                else {
+                                    yield new DomainProbeResult.Redirect(newEdgeUrl.domain);
+                                }
+                            }
+
+                            yield new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "No location header on redirect");
+
+                        }
+                        default ->
+                                new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "HTTP status " + response.getCode());
+                    };
+                });
+
+                if (result instanceof DomainProbeResult.RedirectSameDomain_Internal(EdgeUrl redirUrl)) {
+                    urls.add(redirUrl);
+                }
+                else {
+                    return result;
+                }
+
+                // We don't have robots.txt yet, so we'll assume a request delay of 1 second
+                TimeUnit.SECONDS.sleep(1);
+            }
+            catch (SocketTimeoutException ex) {
+                return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Timeout during domain probe");
+            }
+            catch (Exception ex) {
+                return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Error during domain probe");
+            }
+
        }
+
+        return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Failed to resolve domain root");
+
    }

    /** Perform a HEAD request to fetch the content type of a URL.
@@ -137,66 +309,73 @@ public class HttpFetcherImpl implements HttpFetcher {
     * recorded in the WARC file on failure.
     */
    public ContentTypeProbeResult probeContentType(EdgeUrl url,
-                                                   WarcRecorder warcRecorder,
-                                                   ContentTags tags) throws RateLimitException {
-        if (tags.isEmpty() && contentTypeLogic.isUrlLikeBinary(url)) {
-            var headBuilder = new Request.Builder().head()
-                    .addHeader("User-agent", userAgentString)
-                    .addHeader("Accept-Encoding", "gzip")
-                    .url(url.toString());
-
-            var head = headBuilder.build();
-            var call = client.newCall(head);
-
-            try (var rsp = call.execute()) {
-                var contentTypeHeader = rsp.header("Content-type");
-
-                if (contentTypeHeader != null && !contentTypeLogic.isAllowableContentType(contentTypeHeader)) {
-                    warcRecorder.flagAsFailedContentTypeProbe(url, contentTypeHeader, rsp.code());
-
-                    return new ContentTypeProbeResult.BadContentType(contentTypeHeader, rsp.code());
-                }
-
-                // Update the URL to the final URL of the HEAD request, otherwise we might end up doing
-
-                // HEAD 301 url1 -> url2
-                // HEAD 200 url2
-                // GET 301 url1 -> url2
-                // GET 200 url2
-
-                // which is not what we want. Overall we want to do as few requests as possible to not raise
-                // too many eyebrows when looking at the logs on the target server.  Overall it's probably desirable
-                // that it looks like the traffic makes sense, as opposed to looking like a broken bot.
-
-                var redirectUrl = new EdgeUrl(rsp.request().url().toString());
-                EdgeUrl ret;
-
-                if (Objects.equals(redirectUrl.domain, url.domain)) ret = redirectUrl;
-                else ret = url;
-
-                // Intercept rate limiting
-                if (rsp.code() == 429) {
-                    throw new HttpFetcherImpl.RateLimitException(Objects.requireNonNullElse(rsp.header("Retry-After"), "1"));
-                }
-
-                return new ContentTypeProbeResult.Ok(ret);
-            }
-            catch (RateLimitException ex) {
-                throw ex;
-            }
-            catch (InterruptedIOException ex) {
-                warcRecorder.flagAsTimeout(url);
-
-                return new ContentTypeProbeResult.Timeout(ex);
-            } catch (Exception ex) {
-                logger.error("Error during fetching {}[{}]", ex.getClass().getSimpleName(), ex.getMessage());
-
-                warcRecorder.flagAsError(url, ex);
-
-                return new ContentTypeProbeResult.Exception(ex);
-            }
+                                                   DomainCookies cookies,
+                                                   CrawlDelayTimer timer,
+                                                   ContentTags tags) {
+        if (!tags.isEmpty() || !contentTypeLogic.isUrlLikeBinary(url)) {
+            return new ContentTypeProbeResult.NoOp();
+        }
+
+        try {
+            ClassicHttpRequest head = ClassicRequestBuilder.head(url.asURI())
+                    .addHeader("User-Agent", userAgentString)
+                    .addHeader("Accept-Encoding", "gzip")
+                    .build();
+
+            cookies.paintRequest(head);
+
+            return SendLock.wrapSend(client, head, (rsp) -> {
+                cookies.updateCookieStore(rsp);
+                EntityUtils.consume(rsp.getEntity());
+                int statusCode = rsp.getCode();
+
+                // Handle redirects
+                if (statusCode == 301 || statusCode == 302 || statusCode == 307) {
+                    var location = rsp.getFirstHeader("Location");
+                    if (location != null) {
+                        Optional<EdgeUrl> newUrl = linkParser.parseLink(url, location.getValue());
+                        if (newUrl.isEmpty())
+                            return new ContentTypeProbeResult.HttpError(statusCode, "Invalid location header on redirect");
+                        return new ContentTypeProbeResult.Redirect(newUrl.get());
+                    }
+                }
+
+                if (statusCode == 405) {
+                    // If we get a 405, we can't probe the content type with HEAD, so we'll just say it's ok
+                    return new ContentTypeProbeResult.Ok(url);
+                }
+
+                // Handle errors
+                if (statusCode < 200 || statusCode >= 300) {
+                    return new ContentTypeProbeResult.HttpError(statusCode, "Bad status code");
+                }
+
+                // Handle missing content type
+                var ctHeader = rsp.getFirstHeader("Content-Type");
+                if (ctHeader == null) {
+                    return new ContentTypeProbeResult.HttpError(statusCode, "Missing Content-Type header");
+                }
+                var contentType = ctHeader.getValue();
+
+                // Check if the content type is allowed
+                if (contentTypeLogic.isAllowableContentType(contentType)) {
+                    return new ContentTypeProbeResult.Ok(url);
+                } else {
+                    return new ContentTypeProbeResult.BadContentType(contentType, statusCode);
+                }
+            });
+        }
+        catch (SocketTimeoutException ex) {
+
+            return new ContentTypeProbeResult.Timeout(ex);
+        }
+        catch (Exception ex) {
+            logger.error("Error during fetching {}[{}]", ex.getClass().getSimpleName(), ex.getMessage());
+            return new ContentTypeProbeResult.Exception(ex);
+        }
+        finally {
+            timer.waitFetchDelay();
        }
-        return new ContentTypeProbeResult.Ok(url);
    }

    /** Fetch the content of a URL, and record it in a WARC file,
@@ -206,35 +385,87 @@ public class HttpFetcherImpl implements HttpFetcher {
    @Override
    public HttpFetchResult fetchContent(EdgeUrl url,
                                           WarcRecorder warcRecorder,
+                                           DomainCookies cookies,
+                                           CrawlDelayTimer timer,
                                           ContentTags contentTags,
                                           ProbeType probeType)
-        throws Exception
    {
-        var getBuilder = new Request.Builder().get();
+        try {
+            if (probeType == HttpFetcher.ProbeType.FULL) {
+                try {
+                    var probeResult = probeContentType(url, cookies, timer, contentTags);

-        getBuilder.url(url.toString())
-                .addHeader("Accept-Encoding", "gzip")
-                .addHeader("Accept-Language", "en,*;q=0.5")
-                .addHeader("Accept", "text/html, application/xhtml+xml, text/*;q=0.8")
-                .addHeader("User-agent", userAgentString);
+                    switch (probeResult) {
+                        case HttpFetcher.ContentTypeProbeResult.NoOp():
+                            break; // No probe was performed, proceed with the fetch
+                        case HttpFetcher.ContentTypeProbeResult.Ok(EdgeUrl resolvedUrl):
+                            logger.info(crawlerAuditMarker, "Probe result OK for {}", url);
+                            url = resolvedUrl; // If we were redirected while probing, use the final URL for fetching
+                            break;
+                        case ContentTypeProbeResult.BadContentType badContentType:
+                            warcRecorder.flagAsFailedContentTypeProbe(url, badContentType.contentType(), badContentType.statusCode());
+                            logger.info(crawlerAuditMarker, "Probe result Bad ContentType ({}) for {}", badContentType.contentType(), url);
+                            return new HttpFetchResult.ResultNone();
+                        case ContentTypeProbeResult.Timeout(Exception ex):
+                            logger.info(crawlerAuditMarker, "Probe result Timeout for {}", url);
+                            warcRecorder.flagAsTimeout(url);
+                            return new HttpFetchResult.ResultException(ex);
+                        case ContentTypeProbeResult.Exception(Exception ex):
+                            logger.info(crawlerAuditMarker, "Probe result Exception({}) for {}", ex.getClass().getSimpleName(), url);
+                            warcRecorder.flagAsError(url, ex);
+                            return new HttpFetchResult.ResultException(ex);
+                        case ContentTypeProbeResult.HttpError httpError:
+                            logger.info(crawlerAuditMarker, "Probe result HTTP Error ({}) for {}", httpError.statusCode(), url);
+                            return new HttpFetchResult.ResultException(new HttpException("HTTP status code " + httpError.statusCode() + ": " + httpError.message()));
+                        case ContentTypeProbeResult.Redirect redirect:
+                            logger.info(crawlerAuditMarker, "Probe result redirect for {} -> {}", url, redirect.location());
+                            return new HttpFetchResult.ResultRedirect(redirect.location());
+                    }
+                } catch (Exception ex) {
+                    logger.warn("Failed to fetch {}", url, ex);
+                    return new HttpFetchResult.ResultException(ex);
+                }

-        contentTags.paint(getBuilder);
-
-        HttpFetchResult result = warcRecorder.fetch(client, getBuilder.build());
-
-        if (result instanceof HttpFetchResult.ResultOk ok) {
-            if (ok.statusCode() == 429) {
-                throw new RateLimitException(Objects.requireNonNullElse(ok.header("Retry-After"), "1"));
            }
-            if (ok.statusCode() == 304) {
-                return new HttpFetchResult.Result304Raw();
-            }
-            if (ok.statusCode() == 200) {
-                return ok;
+
+            HttpGet request = new HttpGet(url.asURI());
+            request.addHeader("User-Agent", userAgentString);
+            request.addHeader("Accept-Encoding", "gzip");
+            request.addHeader("Accept-Language", "en,*;q=0.5");
+            request.addHeader("Accept", "text/html, application/xhtml+xml, text/*;q=0.8");
+
+            contentTags.paint(request);
+
+            try (var sl = new SendLock()) {
+                Instant start = Instant.now();
+                HttpFetchResult result = warcRecorder.fetch(client, cookies, request);
+
+                Duration fetchDuration = Duration.between(start, Instant.now());
+
+                if (result instanceof HttpFetchResult.ResultOk ok) {
+                    if (ok.statusCode() == 304) {
+                        result = new HttpFetchResult.Result304Raw();
+                    }
+                }
+
+                switch (result) {
+                    case HttpFetchResult.ResultOk ok -> logger.info(crawlerAuditMarker, "Fetch result OK {} for {} ({} ms)", ok.statusCode(), url, fetchDuration.toMillis());
+                    case HttpFetchResult.ResultRedirect redirect -> logger.info(crawlerAuditMarker, "Fetch result redirect: {} for {}", redirect.url(), url);
+                    case HttpFetchResult.ResultNone none -> logger.info(crawlerAuditMarker, "Fetch result none for {}", url);
+                    case HttpFetchResult.ResultException ex -> logger.error(crawlerAuditMarker, "Fetch result exception for {}", url, ex.ex());
+                    case HttpFetchResult.Result304Raw raw -> logger.info(crawlerAuditMarker, "Fetch result: 304 Raw for {}", url);
+                    case HttpFetchResult.Result304ReplacedWithReference ref -> logger.info(crawlerAuditMarker, "Fetch result: 304 With reference for {}", url);
+                }
+
+                return result;
            }
        }
+        catch (Exception ex) {
+            logger.error(crawlerAuditMarker, "Fetch result exception for {}", url, ex);
+
+            return new HttpFetchResult.ResultException(ex);
+        }

-        return result;
    }

    @Override
@@ -242,6 +473,131 @@ public class HttpFetcherImpl implements HttpFetcher {
        return new SitemapRetriever();
    }

+    /** Fetch the URLs listed in the sitemap at root, following same-domain sitemap index references (bounded in fetches and URLs) */
+    @Override
+    public List<EdgeUrl> fetchSitemapUrls(String root, CrawlDelayTimer delayTimer) {
+        try {
+            List<EdgeUrl> ret = new ArrayList<>();
+
+            Set<String> seenUrls = new HashSet<>();
+            Set<String> seenSitemaps = new HashSet<>();
+
+            Deque<EdgeUrl> sitemapQueue = new LinkedList<>();
+
+            EdgeUrl rootSitemapUrl = new EdgeUrl(root);
+
+            sitemapQueue.add(rootSitemapUrl);
+
+            int fetchedSitemaps = 0;
+
+            while (!sitemapQueue.isEmpty() && ret.size() < 20_000 && ++fetchedSitemaps < 10) {
+                var head = sitemapQueue.removeFirst();
+
+                switch (fetchSingleSitemap(head)) {
+                    case SitemapResult.SitemapUrls(List<String> urls) -> {
+
+                        for (var url : urls) {
+                            if (seenUrls.add(url)) {
+                                EdgeUrl.parse(url)
+                                        .filter(u -> u.domain.equals(rootSitemapUrl.domain))
+                                        .ifPresent(ret::add);
+                            }
+                        }
+
+                    }
+                    case SitemapResult.SitemapReferences(List<String> refs) -> {
+                        for (var ref : refs) {
+                            if (seenSitemaps.add(ref)) {
+                                EdgeUrl.parse(ref)
+                                        .filter(url -> url.domain.equals(rootSitemapUrl.domain))
+                                        .ifPresent(sitemapQueue::addFirst);
+                            }
+                        }
+                    }
+                    case SitemapResult.SitemapError() -> {}
+                }
+
+                delayTimer.waitFetchDelay();
+            }
+
+            return ret;
+        }
+        catch (Exception ex) {
+            logger.error("Error while fetching sitemaps via {}: {} ({})", root, ex.getClass().getSimpleName(), ex.getMessage());
+            return List.of();
+        }
+    }
+
+
+    private SitemapResult fetchSingleSitemap(EdgeUrl sitemapUrl) throws URISyntaxException {
+        HttpGet getRequest = new HttpGet(sitemapUrl.asURI());
+
+        getRequest.addHeader("User-Agent", userAgentString);
+        getRequest.addHeader("Accept-Encoding", "gzip");
+        getRequest.addHeader("Accept", "text/*, */*;q=0.9");
+
+        try (var sl = new SendLock()) {
+            return client.execute(getRequest, response -> {
+                try {
+                    if (response.getCode() != 200) {
+                        return new SitemapResult.SitemapError();
+                    }
+
+                    Document parsedSitemap = Jsoup.parse(
+                            EntityUtils.toString(response.getEntity()),
+                            sitemapUrl.toString(),
+                            Parser.xmlParser()
+                    );
+
+                    if (parsedSitemap.childrenSize() == 0) {
+                        return new SitemapResult.SitemapError();
+                    }
+
+                    String rootTagName = parsedSitemap.child(0).tagName();
+
+                    return switch (rootTagName.toLowerCase()) {
+                        case "sitemapindex" -> {
+                            List<String> references = new ArrayList<>();
+                            for (var locTag : parsedSitemap.getElementsByTag("loc")) {
+                                references.add(locTag.text().trim());
+                            }
+                            yield new SitemapResult.SitemapReferences(Collections.unmodifiableList(references));
+                        }
+                        case "urlset" -> {
+                            List<String> urls = new ArrayList<>();
+                            for (var locTag : parsedSitemap.select("url > loc")) {
+                                urls.add(locTag.text().trim());
+                            }
+                            yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
+                        }
+                        case "rss", "atom" -> {
+                            List<String> urls = new ArrayList<>();
+                            for (var locTag : parsedSitemap.select("link, url")) {
+                                urls.add(locTag.text().trim());
+                            }
+                            yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
+                        }
+                        default -> new SitemapResult.SitemapError();
+                    };
+                }
+                finally {
+                    EntityUtils.consume(response.getEntity());
+                }
+            });
+        }
+        catch (Exception ex) {
+            logger.warn("Error while fetching sitemap {}: {} ({})", sitemapUrl, ex.getClass().getSimpleName(), ex.getMessage());
+            return new SitemapResult.SitemapError();
+        }
+    }
+
+    private sealed interface SitemapResult {
+        record SitemapUrls(List<String> urls) implements SitemapResult {}
+        record SitemapReferences(List<String> sitemapRefs) implements SitemapResult {}
+        record SitemapError() implements SitemapResult {}
+    }
+
    @Override
    public SimpleRobotRules fetchRobotRules(EdgeDomain domain, WarcRecorder recorder) {
        var ret = fetchAndParseRobotsTxt(new EdgeUrl("https", domain, null, "/robots.txt", null), recorder);
@@ -256,15 +612,14 @@ public class HttpFetcherImpl implements HttpFetcher {
    }

    private Optional<SimpleRobotRules> fetchAndParseRobotsTxt(EdgeUrl url, WarcRecorder recorder) {
-        try {
-            var getBuilder = new Request.Builder().get();
+        try (var sl = new SendLock()) {

-            getBuilder.url(url.toString())
-                    .addHeader("Accept-Encoding", "gzip")
-                    .addHeader("Accept", "text/*, */*;q=0.9")
-                    .addHeader("User-agent", userAgentString);
+            HttpGet request = new HttpGet(url.asURI());
+            request.addHeader("User-Agent", userAgentString);
+            request.addHeader("Accept-Encoding", "gzip");
+            request.addHeader("Accept", "text/*, */*;q=0.9");

-            HttpFetchResult result = recorder.fetch(client, getBuilder.build());
+            HttpFetchResult result = recorder.fetch(client, new DomainCookies(), request);

            return DocumentBodyExtractor.asBytes(result).mapOpt((contentType, body) ->
                robotsParser.parseContent(url.toString(),
@@ -278,6 +633,59 @@ public class HttpFetcherImpl implements HttpFetcher {
        }
    }

+    @Override
+    public boolean retryRequest(HttpRequest request, IOException exception, int executionCount, HttpContext context) {
+        if (exception instanceof SocketTimeoutException) { // Timeouts are not recoverable
+            return false;
+        }
+        if (exception instanceof SSLException) { // SSL exceptions are unlikely to be recoverable
+            return false;
+        }
+
+        return executionCount <= 3;
+    }
+
+    @Override
+    public boolean retryRequest(HttpResponse response, int executionCount, HttpContext context) {
+        return switch (response.getCode()) {
+            case 500, 503 -> executionCount <= 2;
+            case 429 -> executionCount <= 3;
+            default -> false;
+        };
+    }
+
+    @Override
+    public TimeValue getRetryInterval(HttpRequest request, IOException exception, int executionCount, HttpContext context) {
+        return TimeValue.ofSeconds(1);
+    }
+
+    @Override
+    public TimeValue getRetryInterval(HttpResponse response, int executionCount, HttpContext context) {
+
+        int statusCode = response.getCode();
+
+        // Give 503 a bit more time
+        if (statusCode == 503) return TimeValue.ofSeconds(5);
+
+        if (statusCode == 429) {
+            var retryAfterHeader = response.getFirstHeader("Retry-After"); // may be absent
+            String retryAfter = (retryAfterHeader != null) ? retryAfterHeader.getValue() : null;
+            if (retryAfter == null) {
+                return TimeValue.ofSeconds(2);
+            }
+
+            try {
+                int retryAfterTime = Integer.parseInt(retryAfter);
+                retryAfterTime = Math.clamp(retryAfterTime, 1, 5);
+
+                return TimeValue.ofSeconds(retryAfterTime);
+            } catch (NumberFormatException e) {
+                logger.warn("Invalid Retry-After header: {}", retryAfter);
+            }
+        }
+
+        return TimeValue.ofSeconds(2);
+    }

    public static class RateLimitException extends Exception {
        private final String retryAfter;
@@ -298,5 +706,31 @@ public class HttpFetcherImpl implements HttpFetcher {
            }
        }
    }
+
+}
+
+class SendLock implements AutoCloseable {
+
+    private static final Semaphore maxConcurrentRequests = new Semaphore(Integer.getInteger("crawler.maxConcurrentRequests", 512));
+    boolean closed = false;
+
+    public SendLock() {
+        maxConcurrentRequests.acquireUninterruptibly();
+    }
+
+    public static <T> T wrapSend(HttpClient client, final ClassicHttpRequest request,
+                                               final HttpClientResponseHandler<? extends T> responseHandler) throws IOException {
+        try (var lock = new SendLock()) {
+            return client.execute(request, responseHandler);
+        }
+    }
+
+    @Override
+    public void close() {
+        if (!closed) {
+            maxConcurrentRequests.release();
+            closed = true;
+        }
+    }
 }

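The `SendLock` class above caps the number of in-flight requests by taking a semaphore permit in its constructor and returning it in `close()`, so that try-with-resources releases the permit even when the request throws. A minimal, self-contained sketch of that pattern follows. The names `RequestPermit` and `withPermit`, and the fixed limit of 2, are hypothetical and chosen for illustration; the diff reads its limit from the `crawler.maxConcurrentRequests` system property and wraps an HTTP client call rather than a `Supplier`.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical stand-in for SendLock: a permit is taken on construction and returned on close(). */
final class RequestPermit implements AutoCloseable {
    // Assumed limit for illustration only.
    static final Semaphore PERMITS = new Semaphore(2);

    private boolean closed = false;

    RequestPermit() {
        // Blocks until a permit is free.
        PERMITS.acquireUninterruptibly();
    }

    /** Runs the action while holding a permit; the permit is released even if the action throws. */
    static <T> T withPermit(Supplier<T> action) {
        try (var permit = new RequestPermit()) {
            return action.get();
        }
    }

    @Override
    public void close() {
        // Guard against releasing the same permit twice.
        if (!closed) {
            PERMITS.release();
            closed = true;
        }
    }
}
```

The `closed` flag matters because releasing a `Semaphore` more often than it was acquired silently raises the effective limit, so a double `close()` has to be a no-op.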
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/socket/IpInterceptingNetworkInterceptor.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/socket/IpInterceptingNetworkInterceptor.java
@@ -1,31 +0,0 @@
-package nu.marginalia.crawl.fetcher.socket;
-
-import okhttp3.Interceptor;
-import okhttp3.Response;
-import org.jetbrains.annotations.NotNull;
-
-import java.io.IOException;
-
-
-/** An interceptor that intercepts network requests and adds the remote IP address as
- * a header in the response.  This is used to pass the remote IP address to the Warc
- * writer, as this information is not available in the response.
- */
-public class IpInterceptingNetworkInterceptor implements Interceptor  {
-    private static final String pseudoHeaderName = "X-Marginalia-Remote-IP";
-
-    @NotNull
-    @Override
-    public Response intercept(@NotNull Interceptor.Chain chain) throws IOException {
-        String IP = chain.connection().socket().getInetAddress().getHostAddress();
-
-        return chain.proceed(chain.request())
-                .newBuilder()
-                .addHeader(pseudoHeaderName, IP)
-                .build();
-    }
-
-    public static String getIpFromResponse(Response response) {
-        return response.header(pseudoHeaderName);
-    }
-}
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/socket/NoSecuritySSL.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/socket/NoSecuritySSL.java
@@ -27,7 +27,7 @@ public class NoSecuritySSL {
            }
    };

-    public static SSLSocketFactory buildSocketFactory() {
+    public static SSLContext buildSslContext() {
        try {
            // Install the all-trusting trust manager
            final SSLContext sslContext = SSLContext.getInstance("TLS");
@@ -40,14 +40,11 @@ public class NoSecuritySSL {
            clientSessionContext.setSessionCacheSize(2048);

            // Create a ssl socket factory with our all-trusting manager
-            return sslContext.getSocketFactory();
+            return sslContext;
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

-    public static HostnameVerifier buildHostnameVerifyer() {
-        return (hn, session) -> true;
-    }
 }
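The `NoSecuritySSL` change above returns the `SSLContext` itself instead of a socket factory derived from it. As a rough illustration of what building such a context can look like with only the JDK, here is a sketch under stated assumptions: the class name `TrustAllSslSketch` is hypothetical, and the trust manager shown is a guess at the general shape, not the project's actual implementation. It disables certificate validation entirely, which is only defensible for a crawler that reads public pages and never sends secrets.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;

/** Hypothetical sketch of an all-trusting SSLContext; do not use where authenticity matters. */
final class TrustAllSslSketch {
    static SSLContext buildSslContext() {
        // Trust manager that accepts every certificate chain without checking it.
        TrustManager[] trustAll = { new X509TrustManager() {
            @Override public void checkClientTrusted(X509Certificate[] chain, String authType) {}
            @Override public void checkServerTrusted(X509Certificate[] chain, String authType) {}
            @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        }};

        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(null, trustAll, new SecureRandom());
            // Same session cache size as the context lines in the diff.
            ctx.getClientSessionContext().setSessionCacheSize(2048);
            return ctx;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Returning the context rather than a socket factory leaves it to the HTTP client library to decide how to derive sockets and hostname verification from it.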
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/warc/WarcInputBuffer.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/warc/WarcInputBuffer.java
@@ -1,15 +1,20 @@
 package nu.marginalia.crawl.fetcher.warc;

-import okhttp3.Headers;
-import okhttp3.Response;
+import org.apache.commons.io.IOUtils;
 import org.apache.commons.io.input.BOMInputStream;
+import org.apache.hc.client5.http.classic.methods.HttpGet;
+import org.apache.hc.core5.http.ClassicHttpResponse;
+import org.apache.hc.core5.http.Header;
 import org.netpreserve.jwarc.WarcTruncationReason;

 import java.io.*;
 import java.nio.file.Files;
 import java.nio.file.Path;
-import java.util.Objects;
-import java.util.zip.GZIPInputStream;
+import java.time.Duration;
+import java.time.Instant;
+import java.util.Arrays;
+
+import static nu.marginalia.crawl.fetcher.warc.ErrorBuffer.suppressContentEncoding;

 /** Input buffer for temporary storage of a HTTP response
 *  This may be in-memory or on-disk, at the discretion of
@@ -17,8 +22,9 @@ import java.util.zip.GZIPInputStream;
 * */
 public abstract class WarcInputBuffer implements AutoCloseable {
    protected WarcTruncationReason truncationReason = WarcTruncationReason.NOT_TRUNCATED;
-    protected Headers headers;
-    WarcInputBuffer(Headers headers) {
+    protected Header[] headers;
+
+    WarcInputBuffer(Header[] headers) {
        this.headers = headers;
    }

@@ -30,7 +36,7 @@ public abstract class WarcInputBuffer implements AutoCloseable {

    public final WarcTruncationReason truncationReason() { return truncationReason; }

-    public final Headers headers() { return headers; }
+    public final Header[] headers() { return headers; }

    /** Create a buffer for a response.
     *  If the response is small and not compressed, it will be stored in memory.
@@ -38,33 +44,52 @@ public abstract class WarcInputBuffer implements AutoCloseable {
     *  and suppressed from the headers.
     *  If an error occurs, a buffer will be created with no content and an error status.
     */
-    static WarcInputBuffer forResponse(Response rsp) {
-        if (rsp == null)
+    static WarcInputBuffer forResponse(ClassicHttpResponse response,
+                                       HttpGet request,
+                                       Duration timeLimit) throws IOException {
+        if (response == null)
            return new ErrorBuffer();

-        try {
-            String contentLengthHeader = Objects.requireNonNullElse(rsp.header("Content-Length"), "-1");
-            int contentLength = Integer.parseInt(contentLengthHeader);
-            String contentEncoding = rsp.header("Content-Encoding");

-            if (contentEncoding == null && contentLength > 0 && contentLength < 8192) {
+        var entity = response.getEntity();
+
+        if (null == entity) {
+            return new ErrorBuffer();
+        }
+
+        InputStream is = null;
+        try {
+            is = entity.getContent();
+            long length = entity.getContentLength();
+
+            if (length > 0 && length < 8192) {
                // If the content is small and not compressed, we can just read it into memory
-                return new MemoryBuffer(rsp, contentLength);
-            }
-            else {
+                return new MemoryBuffer(response.getHeaders(), request, timeLimit, is, (int) length);
+            } else {
                // Otherwise, we unpack it into a file and read it from there
-                return new FileBuffer(rsp);
+                return new FileBuffer(response.getHeaders(), request, timeLimit, is);
            }
        }
-        catch (Exception ex) {
-            return new ErrorBuffer(rsp);
+        finally {
+            try {
+                if (is != null) is.skip(Long.MAX_VALUE);
+            }
+            catch (IOException e) {
+                // Ignore the exception
+            }
+            finally {
+                // Close the input stream
+                IOUtils.closeQuietly(is);
+            }
        }

+
    }

    /** Copy an input stream to an output stream, with a maximum size and time limit */
-    protected void copy(InputStream is, OutputStream os) {
-        long startTime = System.currentTimeMillis();
+    protected void copy(InputStream is, HttpGet request, OutputStream os, Duration timeLimit) {
+        Instant start = Instant.now();
+        Instant timeout = start.plus(timeLimit);
        long size = 0;

        byte[] buffer = new byte[8192];
@@ -74,24 +99,105 @@ public abstract class WarcInputBuffer implements AutoCloseable {

        while (true) {
            try {
+                Duration remaining = Duration.between(Instant.now(), timeout);
+                if (remaining.isNegative()) {
+                    truncationReason = WarcTruncationReason.TIME;
+                    // Abort the request if the time limit is exceeded
+                    // so we don't keep the connection open forever or are forced to consume
+                    // the stream to the end
+
+                    request.abort();
+                    break;
+                }
+
                int n = is.read(buffer);
+
                if (n < 0) break;
                size += n;
-                os.write(buffer, 0, n);

-                if (size > WarcRecorder.MAX_SIZE) {
+                // Once we've exceeded the max length, we stop writing and flag
+                // the truncation; forResponse() then attempts to skip the rest of
+                // the stream before closing it, as closing the stream early means
+                // resetting the connection, and that's generally not desirable.
+
+                if (size < WarcRecorder.MAX_SIZE) {
+                    os.write(buffer, 0, n);
+                }
+                else if (truncationReason != WarcTruncationReason.LENGTH) {
                    truncationReason = WarcTruncationReason.LENGTH;
                    break;
                }

-                if (System.currentTimeMillis() - startTime > WarcRecorder.MAX_TIME) {
-                    truncationReason = WarcTruncationReason.TIME;
-                    break;
-                }
            } catch (IOException e) {
-                throw new RuntimeException(e);
+                truncationReason = WarcTruncationReason.UNSPECIFIED;
            }
        }
+
+    }
+
+    /** Takes a Content-Range header and checks if it is complete.
+     *  A complete range is one that covers the entire resource.
+     *  For example, "bytes 0-2047/2048" is a complete range.
+     *  "bytes 0-1023/2048" and "bytes 0-1023/*" are not complete ranges.
+     */
+    public boolean isRangeComplete(Header[] headers) {
+        // Find the Content-Range header
+        String contentRangeHeader = null;
+        for (var header : headers) {
+            if ("Content-Range".equalsIgnoreCase(header.getName())) {
+                contentRangeHeader = header.getValue();
+                break;
+            }
+        }
+
+        // Return true if header is null or empty
+        if (contentRangeHeader == null || contentRangeHeader.isEmpty()) {
+            return true;
+        }
+
+        try {
+            // Content-Range format: "bytes range-start-range-end/size"
+            // e.g., "bytes 0-1023/2048" or "bytes 0-1023/*"
+
+            // Get the part after "bytes "
+            String[] parts = contentRangeHeader.split(" ", 2);
+            if (parts.length < 2) {
+                return false;
+            }
+
+            // Get the range and size parts (e.g., "0-1023/2048")
+            String rangeAndSize = parts[1];
+            String[] rangeAndSizeParts = rangeAndSize.split("/", 2);
+            if (rangeAndSizeParts.length < 2) {
+                return false;
+            }
+
+            // Get the range (e.g., "0-1023")
+            String range = rangeAndSizeParts[0];
+            String[] rangeParts = range.split("-", 2);
+            if (rangeParts.length < 2) {
+                return false;
+            }
+
+            // Get the size (e.g., "2048" or "*")
+            String size = rangeAndSizeParts[1];
+
+            // If size is "*", we don't know the total size, so return false
+            if ("*".equals(size)) {
+                return false;
+            }
+
+            // Parse as long to handle large files
+            long rangeStart = Long.parseLong(rangeParts[0]);
+            long rangeEnd = Long.parseLong(rangeParts[1]);
+            long totalSize = Long.parseLong(size);
+
+            // Check if the range covers the entire resource
+            return rangeStart == 0 && rangeEnd == totalSize - 1;
+
+        } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
+            return false;
+        }
    }

 }
@@ -99,12 +205,8 @@ public abstract class WarcInputBuffer implements AutoCloseable {
 /** Pseudo-buffer for when we have an error */
 class ErrorBuffer extends WarcInputBuffer {
    public ErrorBuffer() {
-        super(Headers.of());
-        truncationReason = WarcTruncationReason.UNSPECIFIED;
-    }
+        super(new Header[0]);

-    public ErrorBuffer(Response rsp) {
-        super(rsp.headers());
        truncationReason = WarcTruncationReason.UNSPECIFIED;
    }

@@ -120,17 +222,29 @@ class ErrorBuffer extends WarcInputBuffer {

    @Override
    public void close() throws Exception {}
+
+
+    static Header[] suppressContentEncoding(Header[] headers) {
+        return Arrays.stream(headers).filter(header -> !"Content-Encoding".equalsIgnoreCase(header.getName())).toArray(Header[]::new);
+    }
+
 }

 /** Buffer for when we have the response in memory */
 class MemoryBuffer extends WarcInputBuffer {
    byte[] data;
-    public MemoryBuffer(Response response, int size) {
-        super(response.headers());
+    public MemoryBuffer(Header[] headers, HttpGet request, Duration timeLimit, InputStream responseStream, int size) {
+        super(suppressContentEncoding(headers));
+
+        if (!isRangeComplete(headers)) {
+            truncationReason = WarcTruncationReason.LENGTH;
+        } else {
+            truncationReason = WarcTruncationReason.NOT_TRUNCATED;
+        }

        var outputStream = new ByteArrayOutputStream(size);

-        copy(response.body().byteStream(), outputStream);
+        copy(responseStream, request, outputStream, timeLimit);

        data = outputStream.toByteArray();
    }
@@ -154,53 +268,25 @@ class MemoryBuffer extends WarcInputBuffer {
 class FileBuffer extends WarcInputBuffer {
    private final Path tempFile;

-    public FileBuffer(Response response) throws IOException {
-        super(suppressContentEncoding(response.headers()));
+    public FileBuffer(Header[] headers, HttpGet request, Duration timeLimit, InputStream responseStream) throws IOException {
+        super(suppressContentEncoding(headers));
+
+        if (!isRangeComplete(headers)) {
+            truncationReason = WarcTruncationReason.LENGTH;
+        } else {
+            truncationReason = WarcTruncationReason.NOT_TRUNCATED;
+        }

        this.tempFile = Files.createTempFile("rsp", ".html");

-        if (response.body() == null) {
-            truncationReason = WarcTruncationReason.DISCONNECT;
-            return;
+        try (var out = Files.newOutputStream(tempFile)) {
+            copy(responseStream, request, out, timeLimit);
        }
-
-        if ("gzip".equals(response.header("Content-Encoding"))) {
-            try (var out = Files.newOutputStream(tempFile)) {
-                copy(new GZIPInputStream(response.body().byteStream()), out);
-            }
-            catch (Exception ex) {
-                truncationReason = WarcTruncationReason.UNSPECIFIED;
-            }
-        }
-        else {
-            try (var out = Files.newOutputStream(tempFile)) {
-                copy(response.body().byteStream(), out);
-            }
-            catch (Exception ex) {
-                truncationReason = WarcTruncationReason.UNSPECIFIED;
-            }
+        catch (Exception ex) {
+            truncationReason = WarcTruncationReason.UNSPECIFIED;
        }
    }

-    private static Headers suppressContentEncoding(Headers headers) {
-        var builder = new Headers.Builder();
-
-        headers.toMultimap().forEach((k, values) -> {
-            if ("Content-Encoding".equalsIgnoreCase(k)) {
-                return;
-            }
-            if ("Transfer-Encoding".equalsIgnoreCase(k)) {
-                return;
-            }
-            for (var value : values) {
-                builder.add(k, value);
-            }
-        });
-
-        return builder.build();
-    }
-
-
    public InputStream read() throws IOException {
        return Files.newInputStream(tempFile);
    }
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/warc/WarcProtocolReconstructor.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/warc/WarcProtocolReconstructor.java
@@ -1,11 +1,14 @@
 package nu.marginalia.crawl.fetcher.warc;

-import okhttp3.Protocol;
-import okhttp3.Response;
 import org.apache.commons.lang3.StringUtils;
+import org.apache.hc.core5.http.ClassicHttpResponse;
+import org.apache.hc.core5.http.Header;

 import java.net.URI;
 import java.net.URLEncoder;
+import java.net.http.HttpClient;
+import java.net.http.HttpHeaders;
+import java.net.http.HttpResponse;
 import java.nio.charset.StandardCharsets;
 import java.util.*;
 import java.util.stream.Collectors;
@@ -16,7 +19,7 @@ import java.util.stream.Collectors;
 public class WarcProtocolReconstructor {

    static String getHttpRequestString(String method,
-                                       Map<String, List<String>> mainHeaders,
+                                       Header[] mainHeaders,
                                       Map<String, List<String>> extraHeaders,
                                       URI uri) {
        StringBuilder requestStringBuilder = new StringBuilder();
@@ -33,12 +36,13 @@ public class WarcProtocolReconstructor {

        Set<String> addedHeaders = new HashSet<>();

-        mainHeaders.forEach((k, values) -> {
-            for (var value : values) {
-                addedHeaders.add(k);
-                requestStringBuilder.append(capitalizeHeader(k)).append(": ").append(value).append("\r\n");
-            }
-        });
+        for (var header : mainHeaders) {
+            String k = header.getName();
+            String v = header.getValue();
+
+            addedHeaders.add(k);
+            requestStringBuilder.append(capitalizeHeader(k)).append(": ").append(v).append("\r\n");
+        }

        extraHeaders.forEach((k, values) -> {
            if (!addedHeaders.contains(k)) {
@@ -75,17 +79,23 @@ public class WarcProtocolReconstructor {
        return "HTTP/" + version + " " + statusCode + " " + statusMessage + "\r\n" + headerString + "\r\n\r\n";
    }

-    static String getResponseHeader(Response response, long size) {
-        String version = response.protocol() == Protocol.HTTP_1_1 ? "1.1" : "2.0";
+    static String getResponseHeader(HttpResponse<?> response, long size) {
+        String version = response.version() == HttpClient.Version.HTTP_1_1 ? "1.1" : "2.0";

-        String statusCode = String.valueOf(response.code());
-        String statusMessage = STATUS_CODE_MAP.getOrDefault(response.code(), "Unknown");
+        String statusCode = String.valueOf(response.statusCode());
+        String statusMessage = STATUS_CODE_MAP.getOrDefault(response.statusCode(), "Unknown");

-        String headerString = getHeadersAsString(response, size);
+        String headerString = getHeadersAsString(response.headers(), size);

        return "HTTP/" + version + " " + statusCode + " " + statusMessage + "\r\n" + headerString + "\r\n\r\n";
    }

+    static String getResponseHeader(ClassicHttpResponse response, long size) {
+        String headerString = getHeadersAsString(response.getHeaders(), size);
+
+        return response.getVersion().format() + " " + response.getCode() + " " + response.getReasonPhrase() + "\r\n" + headerString + "\r\n\r\n";
+    }
+
    private static final Map<Integer, String> STATUS_CODE_MAP = Map.ofEntries(
            Map.entry(200, "OK"),
            Map.entry(201, "Created"),
@@ -148,10 +158,41 @@ public class WarcProtocolReconstructor {
        return joiner.toString();
    }

-    static private String getHeadersAsString(Response response, long responseSize) {
+
+
+    static private String getHeadersAsString(Header[] headers, long responseSize) {
        StringJoiner joiner = new StringJoiner("\r\n");

-        response.headers().toMultimap().forEach((k, values) -> {
+        for (var header : headers) {
+            String headerCapitalized = capitalizeHeader(header.getName());
+
+            // Omit pseudoheaders injected by the crawler itself
+            if (headerCapitalized.startsWith("X-Marginalia"))
+                continue;
+
+            // Omit Transfer-Encoding and Content-Encoding headers
+            if (headerCapitalized.equals("Transfer-Encoding"))
+                continue;
+            if (headerCapitalized.equals("Content-Encoding"))
+                continue;
+
+            // Since we're transparently decoding gzip, we need to update the Content-Length header
+            // to reflect the actual size of the response body. We'll do this at the end.
+            if (headerCapitalized.equals("Content-Length"))
+                continue;
+
+            joiner.add(headerCapitalized + ": " + header.getValue());
+        }
+
+        joiner.add("Content-Length: " + responseSize);
+
+        return joiner.toString();
+    }
+
+    static private String getHeadersAsString(HttpHeaders headers, long responseSize) {
+        StringJoiner joiner = new StringJoiner("\r\n");
+
+        headers.map().forEach((k, values) -> {
            String headerCapitalized = capitalizeHeader(k);

            // Omit pseudoheaders injected by the crawler itself
@@ -179,8 +220,8 @@ public class WarcProtocolReconstructor {
        return joiner.toString();
    }

-    // okhttp gives us flattened headers, so we need to reconstruct Camel-Kebab-Case style
-    // for the WARC parser's sake...
+    // okhttp gave us flattened headers, so we need to reconstruct Camel-Kebab-Case style
+    // for the WARC parser's sake...  (do we still need this, mr chesterton?)
    static private String capitalizeHeader(String k) {
        return Arrays.stream(StringUtils.split(k, '-'))
                .map(StringUtils::capitalize)
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/warc/WarcRecorder.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/warc/WarcRecorder.java
@@ -1,13 +1,16 @@
 package nu.marginalia.crawl.fetcher.warc;

 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
+import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
-import nu.marginalia.crawl.fetcher.socket.IpInterceptingNetworkInterceptor;
+import nu.marginalia.link_parser.LinkParser;
 import nu.marginalia.model.EdgeDomain;
 import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.body.HttpFetchResult;
-import okhttp3.OkHttpClient;
-import okhttp3.Request;
+import org.apache.hc.client5.http.classic.HttpClient;
+import org.apache.hc.client5.http.classic.methods.HttpGet;
+import org.apache.hc.core5.http.NameValuePair;
 import org.jetbrains.annotations.Nullable;
 import org.netpreserve.jwarc.*;
 import org.slf4j.Logger;
@@ -16,18 +19,20 @@ import org.slf4j.LoggerFactory;
 import java.io.IOException;
 import java.io.InputStream;
 import java.net.InetAddress;
+import java.net.SocketTimeoutException;
 import java.net.URI;
 import java.net.URISyntaxException;
 import java.nio.charset.StandardCharsets;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.security.NoSuchAlgorithmException;
+import java.time.Duration;
 import java.time.Instant;
 import java.util.*;

 /** Based on JWarc's fetch method, APL 2.0 license
 * <p></p>
- * This class wraps OkHttp's OkHttpClient and records the HTTP request and response in a WARC file,
+ * This class wraps HttpClient and records the HTTP request and response in a WARC file,
 * as best is possible given not all the data is available at the same time and needs to
 * be reconstructed.
 */
@@ -36,7 +41,7 @@ public class WarcRecorder implements AutoCloseable {
    static final int MAX_TIME = 30_000;

    /** Maximum (decompressed) size we'll save */
-    static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);
+    static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 32 * 1024 * 1024);

    private final WarcWriter writer;
    private final Path warcFile;
@@ -47,12 +52,7 @@ public class WarcRecorder implements AutoCloseable {
    // Affix a version string in case we need to change the format in the future
    // in some way
    private final String warcRecorderVersion = "1.0";
-
-    // We need to know if the site uses cookies so this can be reported among the search results
-    // -- flip this to true if we see any cookies.  This information will also be painted on any
-    // revisited pages.  It's not 100% perfect and a bit order dependent, but it's good enough.
-    private final WarcXCookieInformationHeader cookieInformation = new WarcXCookieInformationHeader();
-
+    private final LinkParser linkParser = new LinkParser();
    /**
     * Create a new WarcRecorder that will write to the given file
     *
@@ -74,108 +74,173 @@ public class WarcRecorder implements AutoCloseable {
        temporaryFile = true;
    }

-    public HttpFetchResult fetch(OkHttpClient client, Request request) throws NoSuchAlgorithmException,
-            IOException,
-            URISyntaxException,
-            InterruptedException
+    public HttpFetchResult fetch(HttpClient client,
+                                 DomainCookies cookies,
+                                 HttpGet request)
+            throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException
    {
-        URI requestUri = request.url().uri();
+        return fetch(client, cookies, request, Duration.ofMillis(MAX_TIME));
+    }
+
+    public HttpFetchResult fetch(HttpClient client,
+                                 DomainCookies cookies,
+                                 HttpGet request,
+                                 Duration timeout)
+            throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException
+    {
+        URI requestUri = request.getUri();

        WarcDigestBuilder responseDigestBuilder = new WarcDigestBuilder();
        WarcDigestBuilder payloadDigestBuilder = new WarcDigestBuilder();

-        String ip;
        Instant date = Instant.now();

-        var call = client.newCall(request);
+        // Not entirely sure why we need to do this, but keeping it due to Chesterton's Fence
+        Map<String, List<String>> extraHeaders = new HashMap<>(request.getHeaders().length);

-        cookieInformation.update(client, request.url());
+        // Inject a range header to attempt to limit the size of the response
+        // to the maximum size we want to store, if the server supports it.
+        request.addHeader("Range", "bytes=0-"+MAX_SIZE);
+        cookies.paintRequest(request);
+        try {
+            return client.execute(request,response -> {

-        try (var response = call.execute();
-             WarcInputBuffer inputBuffer = WarcInputBuffer.forResponse(response))
-        {
-            byte[] responseHeaders = WarcProtocolReconstructor.getResponseHeader(response, inputBuffer.size()).getBytes(StandardCharsets.UTF_8);
+                try (WarcInputBuffer inputBuffer = WarcInputBuffer.forResponse(response, request, timeout);
+                     InputStream inputStream = inputBuffer.read()) {

-            ResponseDataBuffer responseDataBuffer = new ResponseDataBuffer(inputBuffer.size() + responseHeaders.length);
-            InputStream inputStream = inputBuffer.read();
+                    cookies.updateCookieStore(response);

-            ip = IpInterceptingNetworkInterceptor.getIpFromResponse(response);
+                    // Build and write the request

-            responseDataBuffer.put(responseHeaders);
-            responseDataBuffer.updateDigest(responseDigestBuilder, 0, responseHeaders.length);
+                    WarcDigestBuilder requestDigestBuilder = new WarcDigestBuilder();

-            int dataStart = responseDataBuffer.pos();
+                    byte[] httpRequestString = WarcProtocolReconstructor
+                            .getHttpRequestString(
+                                    request.getMethod(),
+                                    request.getHeaders(),
+                                    extraHeaders,
+                                    requestUri)
+                            .getBytes();

-            for (;;) {
-                int remainingLength = responseDataBuffer.remaining();
-                if (remainingLength == 0)
-                    break;
+                    requestDigestBuilder.update(httpRequestString);

-                int startPos = responseDataBuffer.pos();
+                    WarcRequest warcRequest = new WarcRequest.Builder(requestUri)
+                            .blockDigest(requestDigestBuilder.build())
+                            .date(date)
+                            .body(MediaType.HTTP_REQUEST, httpRequestString)
+                            .build();

-                int n = responseDataBuffer.readFrom(inputStream, remainingLength);
-                if (n < 0)
-                    break;
+                    warcRequest.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
+                    writer.write(warcRequest);

-                responseDataBuffer.updateDigest(responseDigestBuilder, startPos, n);
-                responseDataBuffer.updateDigest(payloadDigestBuilder, startPos, n);
-            }

-            // It looks like this might be the same as requestUri, but it's not;
-            // it's the URI after resolving redirects.
-            final URI responseUri = response.request().url().uri();
+                    if (cookies.hasCookies()) {
+                        response.addHeader("X-Has-Cookies", 1);
+                    }

-            WarcResponse.Builder responseBuilder = new WarcResponse.Builder(responseUri)
-                    .blockDigest(responseDigestBuilder.build())
-                    .date(date)
-                    .body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());
+                    byte[] responseHeaders = WarcProtocolReconstructor.getResponseHeader(response, inputBuffer.size()).getBytes(StandardCharsets.UTF_8);

-            cookieInformation.paint(responseBuilder);
+                    ResponseDataBuffer responseDataBuffer = new ResponseDataBuffer(inputBuffer.size() + responseHeaders.length);

-            if (ip != null) responseBuilder.ipAddress(InetAddress.getByName(ip));
+                    responseDataBuffer.put(responseHeaders);
+                    responseDataBuffer.updateDigest(responseDigestBuilder, 0, responseHeaders.length);

-            responseBuilder.payloadDigest(payloadDigestBuilder.build());
-            responseBuilder.truncated(inputBuffer.truncationReason());
+                    int dataStart = responseDataBuffer.pos();

-            // Build and write the response
+                    for (;;) {
+                        int remainingLength = responseDataBuffer.remaining();
+                        if (remainingLength == 0)
+                            break;

-            var warcResponse = responseBuilder.build();
-            warcResponse.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
-            writer.write(warcResponse);
+                        int startPos = responseDataBuffer.pos();

-            // Build and write the request
+                        int n = responseDataBuffer.readFrom(inputStream, remainingLength);
+                        if (n < 0)
+                            break;

-            WarcDigestBuilder requestDigestBuilder = new WarcDigestBuilder();
+                        responseDataBuffer.updateDigest(responseDigestBuilder, startPos, n);
+                        responseDataBuffer.updateDigest(payloadDigestBuilder, startPos, n);
+                    }

-            byte[] httpRequestString = WarcProtocolReconstructor
-                    .getHttpRequestString(
-                            response.request().method(),
-                            response.request().headers().toMultimap(),
-                            request.headers().toMultimap(),
-                            requestUri)
-                    .getBytes();
+                    // With HTTP client libraries that resolve redirects transparently, this might differ
+                    // from the request URI; we currently don't resolve redirects transparently, so it's always
+                    // the same (the variables are kept separate in case this changes)
+                    final URI responseUri = requestUri;

-            requestDigestBuilder.update(httpRequestString);
+                    WarcResponse.Builder responseBuilder = new WarcResponse.Builder(responseUri)
+                            .blockDigest(responseDigestBuilder.build())
+                            .date(date)
+                            .concurrentTo(warcRequest.id())
+                            .body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());

-            WarcRequest warcRequest = new WarcRequest.Builder(requestUri)
-                    .blockDigest(requestDigestBuilder.build())
-                    .date(date)
-                    .body(MediaType.HTTP_REQUEST, httpRequestString)
-                    .concurrentTo(warcResponse.id())
-                    .build();
+                    InetAddress inetAddress = InetAddress.getByName(responseUri.getHost());
+                    responseBuilder.ipAddress(inetAddress);
+                    responseBuilder.payloadDigest(payloadDigestBuilder.build());
+                    responseBuilder.truncated(inputBuffer.truncationReason());

-            warcRequest.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
-            writer.write(warcRequest);
+                    // Build and write the response

-            return new HttpFetchResult.ResultOk(responseUri,
-                    response.code(),
-                    inputBuffer.headers(),
-                    ip,
-                    responseDataBuffer.data,
-                    dataStart,
-                    responseDataBuffer.length() - dataStart);
-        }
-        catch (Exception ex) {
+                    var warcResponse = responseBuilder.build();
+                    warcResponse.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
+                    writer.write(warcResponse);
+
+                    if (Duration.between(date, Instant.now()).compareTo(Duration.ofSeconds(9)) > 0
+                            && inputBuffer.size() < 2048
+                            && !requestUri.getPath().endsWith("robots.txt")) // don't bail on robots.txt
+                    {
+                        // Fast detection and mitigation of crawler traps that respond with slow
+                        // small responses, with a high branching factor
+
+                        // Note we bail *after* writing the warc records, this will effectively only
+                        // prevent link extraction from the document.
+
+                        logger.warn("URL {} took too long to fetch ({}s) and was too small for the effort ({}b)",
+                                requestUri,
+                                Duration.between(date, Instant.now()).getSeconds(),
+                                inputBuffer.size()
+                        );
+
+                        return new HttpFetchResult.ResultException(new IOException("Likely crawler trap"));
+                    }
+
+                    if (response.getCode() == 301 || response.getCode() == 302 || response.getCode() == 307) {
+                        // If the server responds with a redirect, we need to
+                        // update the request URI to the new location
+                        EdgeUrl redirectLocation = Optional.ofNullable(response.getFirstHeader("Location"))
+                                .map(NameValuePair::getValue)
+                                .flatMap(location -> linkParser.parseLink(new EdgeUrl(requestUri), location))
+                                .orElse(null);
+                        if (redirectLocation != null) {
+                            // If the redirect location is a valid URL, report the redirect to the caller
+                            return new HttpFetchResult.ResultRedirect(redirectLocation);
+                        } else {
+                            // If the redirect location is not a valid URL, report it as an error result
+                            return new HttpFetchResult.ResultException(new IOException("Invalid redirect location: " + response.getFirstHeader("Location")));
+                        }
+                    }
+
+
+                    return new HttpFetchResult.ResultOk(responseUri,
+                            response.getCode(),
+                            inputBuffer.headers(),
+                            inetAddress.getHostAddress(),
+                            responseDataBuffer.data,
+                            dataStart,
+                            responseDataBuffer.length() - dataStart);
+                } catch (Exception ex) {
+                    flagAsError(new EdgeUrl(requestUri), ex); // write a WARC record to indicate the error
+                    logger.warn("Failed to fetch URL {}:  {}", requestUri, ex.getMessage());
+                    return new HttpFetchResult.ResultException(ex);
+                }
+            });
+        // the client.execute() method will throw an exception if the request times out
+        // or on other IO exceptions, so we need to catch those here as well as having
+        // exception handling in the response handler
+        } catch (SocketTimeoutException ex) {
+            flagAsTimeout(new EdgeUrl(requestUri)); // write a WARC record to indicate the timeout
+            return new HttpFetchResult.ResultException(ex);
+        } catch (IOException ex) {
+            flagAsError(new EdgeUrl(requestUri), ex); // write a WARC record to indicate the error
            logger.warn("Failed to fetch URL {}:  {}", requestUri, ex.getMessage());
            return new HttpFetchResult.ResultException(ex);
        }
@@ -185,7 +250,7 @@ public class WarcRecorder implements AutoCloseable {
        writer.write(item);
    }

-    private void saveOldResponse(EdgeUrl url, String contentType, int statusCode, String documentBody, @Nullable String headers, ContentTags contentTags) {
+    private void saveOldResponse(EdgeUrl url, DomainCookies domainCookies, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags contentTags) {
        try {
            WarcDigestBuilder responseDigestBuilder = new WarcDigestBuilder();
            WarcDigestBuilder payloadDigestBuilder = new WarcDigestBuilder();
@@ -195,7 +260,7 @@ public class WarcRecorder implements AutoCloseable {
            if (documentBody == null) {
                bytes = new byte[0];
            } else {
-                bytes = documentBody.getBytes();
+                bytes = documentBody;
            }

            // Create a synthesis of custom headers and the original headers
@@ -246,7 +311,9 @@ public class WarcRecorder implements AutoCloseable {
                    .date(Instant.now())
                    .body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());

-            cookieInformation.paint(builder);
+            if (domainCookies.hasCookies() || (headers != null && headers.contains("Set-Cookie:"))) {
+                builder.addHeader("X-Has-Cookies", "1");
+            }

            var reference = builder.build();

@@ -264,8 +331,8 @@ public class WarcRecorder implements AutoCloseable {
     * an E-Tag or Last-Modified header, and the server responds with a 304 Not Modified.  In this
     * scenario we want to record the data as it was in the previous crawl, but not re-fetch it.
     */
-    public void writeReferenceCopy(EdgeUrl url, String contentType, int statusCode, String documentBody, @Nullable String headers, ContentTags ctags) {
-        saveOldResponse(url, contentType, statusCode, documentBody, headers, ctags);
+    public void writeReferenceCopy(EdgeUrl url, DomainCookies cookies, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags ctags) {
+        saveOldResponse(url, cookies, contentType, statusCode, documentBody, headers, ctags);
    }

    public void writeWarcinfoHeader(String ip, EdgeDomain domain, HttpFetcherImpl.DomainProbeResult result) throws IOException {
@@ -285,6 +352,9 @@ public class WarcRecorder implements AutoCloseable {
            case HttpFetcherImpl.DomainProbeResult.Ok ok:
                fields.put("X-WARC-Probe-Status", List.of("OK"));
                break;
+            case HttpFetcher.DomainProbeResult.RedirectSameDomain_Internal redirectSameDomain:
+                fields.put("X-WARC-Probe-Status", List.of("REDIR-INTERNAL"));
+                break;
        }

        var warcinfo = new Warcinfo.Builder()
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/logic/DomainLocks.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/logic/DomainLocks.java
@@ -3,6 +3,7 @@ package nu.marginalia.crawl.logic;
 import nu.marginalia.model.EdgeDomain;

 import java.util.Map;
+import java.util.Optional;
 import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.Semaphore;

@@ -19,8 +20,22 @@ public class DomainLocks {
     * and may be held by another thread.  The caller is responsible for locking and  releasing the lock.
     */
    public DomainLock lockDomain(EdgeDomain domain) throws InterruptedException {
-        return new DomainLock(domain.toString(),
-                locks.computeIfAbsent(domain.topDomain.toLowerCase(), this::defaultPermits));
+        var sem = locks.computeIfAbsent(domain.topDomain.toLowerCase(), this::defaultPermits);
+
+        sem.acquire();
+
+        return new DomainLock(sem);
+    }
+
+    public Optional<DomainLock> tryLockDomain(EdgeDomain domain) {
+        var sem = locks.computeIfAbsent(domain.topDomain.toLowerCase(), this::defaultPermits);
+        if (sem.tryAcquire(1)) {
+            return Optional.of(new DomainLock(sem));
+        }
+        else {
+            // We don't have a lock, so we return an empty optional
+            return Optional.empty();
+        }
    }

    private Semaphore defaultPermits(String topDomain) {
@@ -28,39 +43,45 @@ public class DomainLocks {
            return new Semaphore(16);
        if (topDomain.equals("blogspot.com"))
            return new Semaphore(8);
-
+        if (topDomain.equals("tumblr.com"))
+            return new Semaphore(8);
        if (topDomain.equals("neocities.org"))
-            return new Semaphore(4);
+            return new Semaphore(8);
        if (topDomain.equals("github.io"))
-            return new Semaphore(4);
+            return new Semaphore(8);

+        // Substack really dislikes broad-scale crawlers, so we need to be careful
+        // to not get blocked.
        if (topDomain.equals("substack.com")) {
            return new Semaphore(1);
        }
-        if (topDomain.endsWith(".edu")) {
-            return new Semaphore(1);
-        }

        return new Semaphore(2);
    }

+    /** Returns true if the domain is lockable, i.e. if it is not already locked by another thread.
+     * (this is just a hint, and does not guarantee that the domain is actually lockable any time
+     * after this method returns true)
+     */
+    public boolean isLockableHint(EdgeDomain domain) {
+        Semaphore sem = locks.get(domain.topDomain.toLowerCase());
+        if (null == sem)
+            return true;
+        else
+            return sem.availablePermits() > 0;
+    }
+
    public static class DomainLock implements AutoCloseable {
-        private final String domainName;
        private final Semaphore semaphore;

-        DomainLock(String domainName, Semaphore semaphore) throws InterruptedException {
-            this.domainName = domainName;
+        DomainLock(Semaphore semaphore) {
            this.semaphore = semaphore;
-
-            Thread.currentThread().setName("crawling:" + domainName + " [await domain lock]");
-            semaphore.acquire();
-            Thread.currentThread().setName("crawling:" + domainName);
        }

        @Override
        public void close() throws Exception {
            semaphore.release();
-            Thread.currentThread().setName("crawling:" + domainName + " [wrapping up]");
+            Thread.currentThread().setName("[idle]");
        }
    }
 }
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlDataReference.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlDataReference.java
@@ -4,6 +4,7 @@ import nu.marginalia.ContentTypes;
 import nu.marginalia.io.SerializableCrawlDataStream;
 import nu.marginalia.lsh.EasyLSH;
 import nu.marginalia.model.crawldata.CrawledDocument;
+import org.jetbrains.annotations.NotNull;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

@@ -11,54 +12,76 @@ import javax.annotation.Nullable;
 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
+import java.util.Iterator;
+import java.util.Objects;
+import java.util.Optional;

 /** A reference to a domain that has been crawled before. */
-public class CrawlDataReference implements AutoCloseable {
+public class CrawlDataReference implements AutoCloseable, Iterable<CrawledDocument> {
+
+    private boolean closed = false;
+
+    @Nullable
+    private final Path path;
+
+    @Nullable
+    private SerializableCrawlDataStream data = null;

-    private final SerializableCrawlDataStream data;
    private static final Logger logger = LoggerFactory.getLogger(CrawlDataReference.class);

-    public CrawlDataReference(SerializableCrawlDataStream data) {
-        this.data = data;
+    public CrawlDataReference(@Nullable Path path) {
+        this.path = path;
    }

    public CrawlDataReference() {
-        this(SerializableCrawlDataStream.empty());
+        this(null);
    }

    /** Delete the associated data from disk, if it exists */
    public void delete() throws IOException {
-        Path filePath = data.path();
-
-        if (filePath != null) {
-            Files.deleteIfExists(filePath);
+        if (path != null) {
+            Files.deleteIfExists(path);
        }
    }

-    /** Get the next document from the crawl data,
-     * returning null when there are no more documents
-     * available
-     */
-    @Nullable
-    public CrawledDocument nextDocument() {
-        try {
-            while (data.hasNext()) {
-                if (data.next() instanceof CrawledDocument doc) {
-                    if (!ContentTypes.isAccepted(doc.contentType))
-                        continue;
+    public @NotNull Iterator<CrawledDocument> iterator() {

-                    return doc;
+        requireStream();
+        // Guaranteed by requireStream(), but makes the non-null invariant explicit to static analysis
+        Objects.requireNonNull(data);
+
+        return data.map(next -> {
+            if (next instanceof CrawledDocument doc && ContentTypes.isAccepted(doc.contentType)) {
+                return Optional.of(doc);
+            }
+            else {
+                return Optional.empty();
+            }
+        });
+    }
+
+    /** After calling this method, data is guaranteed to be non-null */
+    private void requireStream() {
+        if (closed) {
+            throw new IllegalStateException("Use after close()");
+        }
+
+        if (data == null) {
+            try {
+                if (path != null) {
+                    data = SerializableCrawlDataStream.openDataStream(path);
+                    return;
                }
            }
-        }
-        catch (IOException ex) {
-            logger.error("Failed to read next document", ex);
-        }
+            catch (Exception ex) {
+                logger.error("Failed to open stream", ex);
+            }

-        return null;
+            data = SerializableCrawlDataStream.empty();
+        }
    }

-    public static boolean isContentBodySame(String one, String other) {
+    public static boolean isContentBodySame(byte[] one, byte[] other) {

        final long contentHashOne = contentHash(one);
        final long contentHashOther = contentHash(other);
@@ -66,7 +89,7 @@ public class CrawlDataReference implements AutoCloseable {
        return EasyLSH.hammingDistance(contentHashOne, contentHashOther) < 4;
    }

-    private static long contentHash(String content) {
+    private static long contentHash(byte[] content) {
        EasyLSH hash = new EasyLSH();
        int next = 0;

@@ -74,8 +97,8 @@ public class CrawlDataReference implements AutoCloseable {

        // In a naive best-effort fashion, extract the text
        // content of the document and feed it into the LSH
-        for (int i = 0; i < content.length(); i++) {
-            char c = content.charAt(i);
+        for (byte b : content) {
+            char c = (char) (b & 0xFF); // treat the byte as unsigned to avoid sign extension
            if (c == '<') {
                isInTag = true;
            } else if (c == '>') {
@@ -98,7 +121,12 @@ public class CrawlDataReference implements AutoCloseable {
    }

    @Override
-    public void close() throws Exception {
-        data.close();
+    public void close() throws IOException {
+        if (!closed) {
+            if (data != null) {
+                data.close();
+            }
+            closed = true;
+        }
    }
 }
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlDelayTimer.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlDelayTimer.java
@@ -3,6 +3,7 @@ package nu.marginalia.crawl.retreival;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;

 import java.time.Duration;
+import java.util.concurrent.ThreadLocalRandom;

 import static java.lang.Math.max;
 import static java.lang.Math.min;
@@ -50,15 +51,20 @@ public class CrawlDelayTimer {
        waitFetchDelay(0);
    }

+    public void waitFetchDelay(Duration spentTime) {
+        waitFetchDelay(spentTime.toMillis());
+    }
+
    public void waitFetchDelay(long spentTime) {
        long sleepTime = delayTime;

+        long jitter = ThreadLocalRandom.current().nextLong(0, 150);
        try {
            if (sleepTime >= 1) {
                if (spentTime > sleepTime)
                    return;

-                Thread.sleep(min(sleepTime - spentTime, 5000));
+                Thread.sleep(min(sleepTime - spentTime, 5000) + jitter);
            } else {
                // When no crawl delay is specified, lean toward twice the fetch+process time,
                // within sane limits. This means slower servers get slower crawling, and faster
@@ -71,17 +77,17 @@ public class CrawlDelayTimer {
                if (spentTime > sleepTime)
                    return;

-                Thread.sleep(sleepTime - spentTime);
+                Thread.sleep(sleepTime - spentTime + jitter);
            }

            if (slowDown) {
                // Additional delay when the server is signalling it wants slower requests
-                Thread.sleep(DEFAULT_CRAWL_DELAY_MIN_MS);
+                Thread.sleep(DEFAULT_CRAWL_DELAY_MIN_MS + jitter);
            }
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
-            throw new RuntimeException();
+            throw new RuntimeException("Interrupted", e);
        }
    }
 }
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerConnectionThrottle.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerConnectionThrottle.java
@@ -0,0 +1,42 @@
+package nu.marginalia.crawl.retreival;
+
+import java.time.Duration;
+import java.time.Instant;
+import java.util.concurrent.Semaphore;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * This class is used to stagger the rate at which connections are created.
+ * <p></p>
+ * It is used to ensure that we do not create too many connections at once,
+ * which can lead to network congestion and other issues.  Since the connections
+ * tend to be very long-lived, we can afford to wait a bit before creating the next
+ * one, even if it adds a bit of build-up time when the crawl starts.
+ */
+public class CrawlerConnectionThrottle {
+    private Instant lastCrawlStart = Instant.EPOCH;
+    private final Semaphore launchSemaphore = new Semaphore(1);
+
+    private final Duration launchInterval;
+
+    public CrawlerConnectionThrottle(Duration launchInterval) {
+        this.launchInterval = launchInterval;
+    }
+
+    public void waitForConnectionPermission() throws InterruptedException {
+        launchSemaphore.acquire(); // acquire outside the try block, so an interrupt here cannot release a permit we never held
+        try {
+            Instant nextPermittedLaunch = lastCrawlStart.plus(launchInterval);
+
+            if (nextPermittedLaunch.isAfter(Instant.now())) {
+                long waitTime = Duration.between(Instant.now(), nextPermittedLaunch).toMillis();
+                TimeUnit.MILLISECONDS.sleep(waitTime);
+            }
+
+            lastCrawlStart = Instant.now();
+        }
+        finally {
+            launchSemaphore.release();
+        }
+    }
+}
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
@@ -6,13 +6,12 @@ import nu.marginalia.contenttype.ContentType;
 import nu.marginalia.crawl.CrawlerMain;
 import nu.marginalia.crawl.DomainStateDb;
 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
-import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import nu.marginalia.crawl.logic.LinkFilterSelector;
 import nu.marginalia.crawl.retreival.revisit.CrawlerRevisitor;
 import nu.marginalia.crawl.retreival.revisit.DocumentWithReference;
-import nu.marginalia.crawl.retreival.sitemap.SitemapFetcher;
 import nu.marginalia.ip_blocklist.UrlBlocklist;
 import nu.marginalia.link_parser.LinkParser;
 import nu.marginalia.model.EdgeDomain;
@@ -20,7 +19,6 @@ import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.body.DocumentBodyExtractor;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawlerDomainStatus;
-import org.jsoup.Jsoup;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

@@ -28,14 +26,16 @@ import java.io.IOException;
 import java.net.InetAddress;
 import java.net.UnknownHostException;
 import java.nio.file.Path;
+import java.time.Duration;
+import java.time.Instant;
 import java.util.List;
+import java.util.Objects;
 import java.util.Optional;
 import java.util.concurrent.TimeUnit;

 public class CrawlerRetreiver implements AutoCloseable {

    private static final int MAX_ERRORS = 20;
-    private static final int HTTP_429_RETRY_LIMIT = 1; // Retry 429s once

    private final HttpFetcher fetcher;

@@ -52,8 +52,12 @@ public class CrawlerRetreiver implements AutoCloseable {
    private final DomainStateDb domainStateDb;
    private final WarcRecorder warcRecorder;
    private final CrawlerRevisitor crawlerRevisitor;
+    private final DomainCookies cookies = new DomainCookies();
+
+    private static final CrawlerConnectionThrottle connectionThrottle = new CrawlerConnectionThrottle(
+            Duration.ofSeconds(1) // pace the connections to avoid network congestion at startup
+    );

-    private final SitemapFetcher sitemapFetcher;
    int errorCount = 0;

    public CrawlerRetreiver(HttpFetcher fetcher,
@@ -71,7 +75,6 @@ public class CrawlerRetreiver implements AutoCloseable {

        crawlFrontier = new DomainCrawlFrontier(new EdgeDomain(domain), specs.urls(), specs.crawlDepth());
        crawlerRevisitor = new CrawlerRevisitor(crawlFrontier, this, warcRecorder);
-        sitemapFetcher = new SitemapFetcher(crawlFrontier, fetcher.createSitemapRetriever());

        // We must always crawl the index page first, this is assumed when fingerprinting the server
        var fst = crawlFrontier.peek();
@@ -93,30 +96,63 @@ public class CrawlerRetreiver implements AutoCloseable {
    }

    public int crawlDomain(DomainLinks domainLinks, CrawlDataReference oldCrawlData) {
-        try {
-            // Do an initial domain probe to determine the root URL
-            EdgeUrl rootUrl;
+        try (oldCrawlData) {

+            // Wait for permission to open a connection to avoid network congestion
+            // from hundreds/thousands of TCP handshakes
+            connectionThrottle.waitForConnectionPermission();
+
+            // Do an initial domain probe to determine the root URL
            var probeResult = probeRootUrl();
-            switch (probeResult) {
+
+            return switch (probeResult) {
                case HttpFetcher.DomainProbeResult.Ok(EdgeUrl probedUrl) -> {
-                    rootUrl = probedUrl; // Good track
+
+                    // Sleep after the initial probe; we don't have access to robots.txt yet,
+                    // so we don't know the crawl delay
+                    TimeUnit.SECONDS.sleep(1);
+
+                    final SimpleRobotRules robotsRules = fetcher.fetchRobotRules(probedUrl.domain, warcRecorder);
+                    final CrawlDelayTimer delayTimer = new CrawlDelayTimer(robotsRules.getCrawlDelay());
+
+                    delayTimer.waitFetchDelay(0); // initial delay after robots.txt
+
+                    DomainStateDb.SummaryRecord summaryRecord = sniffRootDocument(probedUrl, delayTimer);
+                    domainStateDb.save(summaryRecord);
+
+                    if (Thread.interrupted()) {
+                        // There's a small chance we're interrupted during the sniffing portion
+                        throw new InterruptedException();
+                    }
+
+                    // Play back the old crawl data (if present) and fetch the documents comparing etags and last-modified
+                    Instant recrawlStart = Instant.now();
+                    CrawlerRevisitor.RecrawlMetadata recrawlMetadata = crawlerRevisitor.recrawl(oldCrawlData, cookies, robotsRules, delayTimer);
+                    Duration recrawlTime = Duration.between(recrawlStart, Instant.now());
+
+                    if (recrawlMetadata.size() > 0) {
+                        // If we have reference data, we will always grow the crawl depth a bit
+                        crawlFrontier.increaseDepth(1.5, 2500);
+                    }
+
+                    oldCrawlData.close(); // proactively close the crawl data reference here to not hold onto expensive resources
+
+                    yield crawlDomain(probedUrl, robotsRules, delayTimer, domainLinks, recrawlMetadata, recrawlTime);
                }
                case HttpFetcher.DomainProbeResult.Redirect(EdgeDomain domain1) -> {
                    domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, "Redirect", domain1.toString()));
-                    return 1;
+                    yield 1;
                }
                case HttpFetcher.DomainProbeResult.Error(CrawlerDomainStatus status, String desc) -> {
                    domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, status.toString(), desc));
-                    return 1;
+                    yield 1;
                }
-            }
+                default -> {
+                    logger.error("Unexpected domain probe result {}", probeResult);
+                    yield 1;
+                }
+            };

-            // Sleep after the initial probe, we don't have access to the robots.txt yet
-            // so we don't know the crawl delay
-            TimeUnit.SECONDS.sleep(1);
-
-            return crawlDomain(oldCrawlData, rootUrl, domainLinks);
        }
        catch (Exception ex) {
            logger.error("Error crawling domain {}", domain, ex);
@@ -124,30 +160,31 @@ public class CrawlerRetreiver implements AutoCloseable {
        }
    }

-    private int crawlDomain(CrawlDataReference oldCrawlData,
-                            EdgeUrl rootUrl,
-                            DomainLinks domainLinks) throws InterruptedException {
+    private int crawlDomain(EdgeUrl rootUrl,
+                            SimpleRobotRules robotsRules,
+                            CrawlDelayTimer delayTimer,
+                            DomainLinks domainLinks,
+                            CrawlerRevisitor.RecrawlMetadata recrawlMetadata,
+                            Duration recrawlTime) {

-        final SimpleRobotRules robotsRules = fetcher.fetchRobotRules(rootUrl.domain, warcRecorder);
-        final CrawlDelayTimer delayTimer = new CrawlDelayTimer(robotsRules.getCrawlDelay());
-
-        delayTimer.waitFetchDelay(0); // initial delay after robots.txt
-
-        DomainStateDb.SummaryRecord summaryRecord = sniffRootDocument(rootUrl, delayTimer);
-        domainStateDb.save(summaryRecord);
-
-        // Play back the old crawl data (if present) and fetch the documents comparing etags and last-modified
-        if (crawlerRevisitor.recrawl(oldCrawlData, robotsRules, delayTimer) > 0) {
-            // If we have reference data, we will always grow the crawl depth a bit
-            crawlFrontier.increaseDepth(1.5, 2500);
-        }
+        Instant crawlStart = Instant.now();

        // Add external links to the crawl frontier
        crawlFrontier.addAllToQueue(domainLinks.getUrls(rootUrl.proto));

-        // Add links from the sitemap to the crawl frontier
-        sitemapFetcher.downloadSitemaps(robotsRules, rootUrl);
+        // Fetch sitemaps
+        for (var sitemap : robotsRules.getSitemaps()) {

+            // Validate the sitemap URL and check that it belongs to the same domain as the root URL
+            if (EdgeUrl.parse(sitemap)
+                    .map(url -> url.getDomain().equals(rootUrl.domain))
+                    .orElse(false)) {
+
+                crawlFrontier.addAllToQueue(fetcher.fetchSitemapUrls(sitemap, delayTimer));
+            }
+        }
+
+        int crawlerAdditions = 0;

        while (!crawlFrontier.isEmpty()
            && !crawlFrontier.isCrawlDepthReached()
@@ -180,7 +217,11 @@ public class CrawlerRetreiver implements AutoCloseable {
                continue;

            try {
-                fetchContentWithReference(top, delayTimer, DocumentWithReference.empty());
+                var result = fetchContentWithReference(top, delayTimer, DocumentWithReference.empty());
+
+                if (result.isOk()) {
+                    crawlerAdditions++;
+                }
            }
            catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
@@ -188,6 +229,17 @@ public class CrawlerRetreiver implements AutoCloseable {
            }
        }

+        Duration crawlTime = Duration.between(crawlStart, Instant.now());
+        domainStateDb.save(new DomainStateDb.CrawlMeta(
+                domain,
+                Instant.now(),
+                recrawlTime,
+                crawlTime,
+                recrawlMetadata.errors(),
+                crawlerAdditions,
+                recrawlMetadata.size() + crawlerAdditions
+        ));
+
        return crawlFrontier.visitedSize();
    }

@@ -216,17 +268,29 @@ public class CrawlerRetreiver implements AutoCloseable {
        return domainProbeResult;
    }

+
+
    private DomainStateDb.SummaryRecord sniffRootDocument(EdgeUrl rootUrl, CrawlDelayTimer timer) {
        Optional<String> feedLink = Optional.empty();

        try {
            var url = rootUrl.withPathAndParam("/", null);

-            HttpFetchResult result = fetchWithRetry(url, timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
+            HttpFetchResult result = fetcher.fetchContent(url, warcRecorder, cookies, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
            timer.waitFetchDelay(0);

-            if (!(result instanceof HttpFetchResult.ResultOk ok))
+            if (result instanceof HttpFetchResult.ResultRedirect(EdgeUrl location)) {
+                if (Objects.equals(location.domain, url.domain)) {
+                    // TODO: Follow the redirect to the new location and sniff the document
+                    crawlFrontier.addFirst(location);
+                }
+
                return DomainStateDb.SummaryRecord.forSuccess(domain);
+            }
+
+            if (!(result instanceof HttpFetchResult.ResultOk ok)) {
+                return DomainStateDb.SummaryRecord.forSuccess(domain);
+            }

            var optDoc = ok.parseDocument();
            if (optDoc.isEmpty())
@@ -271,18 +335,28 @@ public class CrawlerRetreiver implements AutoCloseable {
            }

            // Download the sitemap if available
-            if (feedLink.isPresent()) {
-                sitemapFetcher.downloadSitemaps(List.of(feedLink.get()));
-                timer.waitFetchDelay(0);
-            }
+            feedLink.ifPresent(s -> crawlFrontier.addAllToQueue(fetcher.fetchSitemapUrls(s, timer)));

            // Grab the favicon if it exists
-            fetchWithRetry(faviconUrl, timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
+
+            if (fetcher.fetchContent(faviconUrl, warcRecorder, cookies, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED) instanceof HttpFetchResult.ResultOk iconResult) {
+                String contentType = iconResult.header("Content-Type");
+                byte[] iconData = iconResult.getBodyBytes();
+
+                domainStateDb.saveIcon(
+                        domain,
+                        new DomainStateDb.FaviconRecord(contentType, iconData)
+                );
+            }
            timer.waitFetchDelay(0);

        }
        catch (Exception ex) {
            logger.error("Error configuring link filter", ex);
+            if (Thread.interrupted()) {
+                Thread.currentThread().interrupt();
+                return DomainStateDb.SummaryRecord.forError(domain, "Crawler Interrupted", ex.getMessage());
+            }
        }
        finally {
            crawlFrontier.addVisited(rootUrl);
@@ -310,7 +384,7 @@ public class CrawlerRetreiver implements AutoCloseable {
    );

    private Optional<String> guessFeedUrl(CrawlDelayTimer timer) throws InterruptedException {
-        var oldDomainStateRecord = domainStateDb.get(domain);
+        var oldDomainStateRecord = domainStateDb.getSummary(domain);

        // If we are already aware of an old feed URL, then we can just revalidate it
        if (oldDomainStateRecord.isPresent()) {
@@ -335,7 +409,7 @@ public class CrawlerRetreiver implements AutoCloseable {
        if (parsedOpt.isEmpty())
            return false;

-        HttpFetchResult result = fetchWithRetry(parsedOpt.get(), timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
+        HttpFetchResult result = fetcher.fetchContent(parsedOpt.get(), warcRecorder, cookies, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
        timer.waitFetchDelay(0);

        if (!(result instanceof HttpFetchResult.ResultOk ok)) {
@@ -361,110 +435,63 @@ public class CrawlerRetreiver implements AutoCloseable {
                                                     CrawlDelayTimer timer,
                                                     DocumentWithReference reference) throws InterruptedException
    {
-        logger.debug("Fetching {}", top);
-
-        long startTime = System.currentTimeMillis();
        var contentTags = reference.getContentTags();

-        HttpFetchResult fetchedDoc = fetchWithRetry(top, timer, HttpFetcher.ProbeType.FULL, contentTags);
+        HttpFetchResult fetchedDoc = fetcher.fetchContent(top, warcRecorder, cookies, timer, contentTags, HttpFetcher.ProbeType.FULL);
+        timer.waitFetchDelay();
+
+        if (Thread.interrupted()) {
+            Thread.currentThread().interrupt();
+            throw new InterruptedException();
+        }

        // Parse the document and enqueue links
        try {
-            if (fetchedDoc instanceof HttpFetchResult.ResultOk ok) {
-                var docOpt = ok.parseDocument();
-                if (docOpt.isPresent()) {
-                    var doc = docOpt.get();
+            switch (fetchedDoc) {
+                case HttpFetchResult.ResultOk ok -> {
+                    var docOpt = ok.parseDocument();
+                    if (docOpt.isPresent()) {
+                        var doc = docOpt.get();

-                    crawlFrontier.enqueueLinksFromDocument(top, doc);
-                    crawlFrontier.addVisited(new EdgeUrl(ok.uri()));
+                        var responseUrl = new EdgeUrl(ok.uri());
+
+                        crawlFrontier.enqueueLinksFromDocument(responseUrl, doc);
+                        crawlFrontier.addVisited(responseUrl);
+                    }
                }
-            }
-            else if (fetchedDoc instanceof HttpFetchResult.Result304Raw && reference.doc() != null) {
-                var doc = reference.doc();
+                case HttpFetchResult.Result304Raw ref when reference.doc() != null ->
+                {
+                    var doc = reference.doc();

-                warcRecorder.writeReferenceCopy(top, doc.contentType, doc.httpStatus, doc.documentBody, doc.headers, contentTags);
+                    warcRecorder.writeReferenceCopy(top, cookies, doc.contentType, doc.httpStatus, doc.documentBodyBytes, doc.headers, contentTags);

-                fetchedDoc = new HttpFetchResult.Result304ReplacedWithReference(doc.url,
-                        new ContentType(doc.contentType, "UTF-8"),
-                        doc.documentBody);
+                    fetchedDoc = new HttpFetchResult.Result304ReplacedWithReference(doc.url,
+                            new ContentType(doc.contentType, "UTF-8"),
+                            doc.documentBodyBytes);

-                if (doc.documentBody != null) {
-                    var parsed = Jsoup.parse(doc.documentBody);
+                    if (doc.documentBodyBytes != null) {
+                        var parsed = doc.parseBody();

-                    crawlFrontier.enqueueLinksFromDocument(top, parsed);
-                    crawlFrontier.addVisited(top);
+                        crawlFrontier.enqueueLinksFromDocument(top, parsed);
+                        crawlFrontier.addVisited(top);
+                    }
                }
-            }
-            else if (fetchedDoc instanceof HttpFetchResult.ResultException) {
-                errorCount ++;
+                case HttpFetchResult.ResultRedirect(EdgeUrl location) -> {
+                    if (Objects.equals(location.domain, top.domain)) {
+                        crawlFrontier.addFirst(location);
+                    }
+                }
+                case HttpFetchResult.ResultException ex -> errorCount++;
+                default -> {} // Ignore other types
            }
        }
        catch (Exception ex) {
            logger.error("Error parsing document {}", top, ex);
        }

-        timer.waitFetchDelay(System.currentTimeMillis() - startTime);
-
        return fetchedDoc;
    }

-    /** Fetch a document and retry on 429s */
-    private HttpFetchResult fetchWithRetry(EdgeUrl url,
-                                           CrawlDelayTimer timer,
-                                           HttpFetcher.ProbeType probeType,
-                                           ContentTags contentTags) throws InterruptedException {
-
-        long probeStart = System.currentTimeMillis();
-
-        if (probeType == HttpFetcher.ProbeType.FULL) {
-            retryLoop:
-            for (int i = 0; i <= HTTP_429_RETRY_LIMIT; i++) {
-                try {
-                    var probeResult = fetcher.probeContentType(url, warcRecorder, contentTags);
-
-                    switch (probeResult) {
-                        case HttpFetcher.ContentTypeProbeResult.Ok(EdgeUrl resolvedUrl):
-                            url = resolvedUrl; // If we were redirected while probing, use the final URL for fetching
-                            break retryLoop;
-                        case HttpFetcher.ContentTypeProbeResult.BadContentType badContentType:
-                            return new HttpFetchResult.ResultNone();
-                        case HttpFetcher.ContentTypeProbeResult.BadContentType.Timeout timeout:
-                            return new HttpFetchResult.ResultException(timeout.ex());
-                        case HttpFetcher.ContentTypeProbeResult.Exception exception:
-                            return new HttpFetchResult.ResultException(exception.ex());
-                        default:  // should be unreachable
-                            throw new IllegalStateException("Unknown probe result");
-                    }
-                }
-                catch (HttpFetcherImpl.RateLimitException ex) {
-                    timer.waitRetryDelay(ex);
-                }
-                catch (Exception ex) {
-                    logger.warn("Failed to fetch {}", url, ex);
-                    return new HttpFetchResult.ResultException(ex);
-                }
-            }
-
-            timer.waitFetchDelay(System.currentTimeMillis() - probeStart);
-        }
-
-
-        for (int i = 0; i <= HTTP_429_RETRY_LIMIT; i++) {
-            try {
-                return fetcher.fetchContent(url, warcRecorder, contentTags, probeType);
-            }
-            catch (HttpFetcherImpl.RateLimitException ex) {
-                timer.waitRetryDelay(ex);
-            }
-            catch (Exception ex) {
-                logger.warn("Failed to fetch {}", url, ex);
-                return new HttpFetchResult.ResultException(ex);
-            }
-        }
-
-        return new HttpFetchResult.ResultNone();
-    }
-
    private boolean isAllowedProtocol(String proto) {
        return proto.equalsIgnoreCase("http")
                || proto.equalsIgnoreCase("https");
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/DomainCrawlFrontier.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/DomainCrawlFrontier.java
@@ -55,6 +55,9 @@ public class DomainCrawlFrontier {
        }
    }

+    public EdgeDomain getDomain() {
+        return thisDomain;
+    }
    /** Increase the depth of the crawl by a factor.  If the current depth is smaller
     * than the number of already visited documents, the base depth will be adjusted
     * to the visited count first.
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/revisit/CrawlerRevisitor.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/revisit/CrawlerRevisitor.java
@@ -1,8 +1,8 @@
 package nu.marginalia.crawl.retreival.revisit;

-import com.google.common.base.Strings;
 import crawlercommons.robots.SimpleRobotRules;
 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import nu.marginalia.crawl.retreival.CrawlDataReference;
 import nu.marginalia.crawl.retreival.CrawlDelayTimer;
@@ -11,17 +11,23 @@ import nu.marginalia.crawl.retreival.DomainCrawlFrontier;
 import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawledDocument;
-import org.jsoup.Jsoup;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;

 /** This class encapsulates the logic for re-visiting a domain that has already been crawled.
 *  We may use information from the previous crawl to inform the next crawl, specifically the
 *  E-Tag and Last-Modified headers.
 */
 public class CrawlerRevisitor {
+
    private final DomainCrawlFrontier crawlFrontier;
    private final CrawlerRetreiver crawlerRetreiver;
    private final WarcRecorder warcRecorder;

+    private static final Logger logger = LoggerFactory.getLogger(CrawlerRevisitor.class);
+
    public CrawlerRevisitor(DomainCrawlFrontier crawlFrontier,
                            CrawlerRetreiver crawlerRetreiver,
                            WarcRecorder warcRecorder) {
@@ -31,7 +37,8 @@ public class CrawlerRevisitor {
    }

    /** Performs a re-crawl of old documents, comparing etags and last-modified */
-    public int recrawl(CrawlDataReference oldCrawlData,
+    public RecrawlMetadata recrawl(CrawlDataReference oldCrawlData,
+                       DomainCookies cookies,
                       SimpleRobotRules robotsRules,
                       CrawlDelayTimer delayTimer)
    throws InterruptedException {
@@ -39,19 +46,18 @@ public class CrawlerRevisitor {
        int retained = 0;
        int errors = 0;
        int skipped = 0;
+        int size = 0;

-        for (;;) {
+        for (CrawledDocument doc : oldCrawlData) {
            if (errors > 20) {
                // If we've had too many errors, we'll stop trying to recrawl
                break;
            }

-            CrawledDocument doc = oldCrawlData.nextDocument();
+            if (Thread.interrupted()) {
+                throw new InterruptedException();
+            }

-            if (doc == null)
-                break;
-
-            // This Shouldn't Happen (TM)
            var urlMaybe = EdgeUrl.parse(doc.url);
            if (urlMaybe.isEmpty())
                continue;
@@ -70,7 +76,7 @@ public class CrawlerRevisitor {
            // unlikely to produce anything meaningful for us.
            if (doc.httpStatus != 200)
                continue;
-            if (Strings.isNullOrEmpty(doc.documentBody))
+            if (!doc.hasBody())
                continue;

            if (!crawlFrontier.filterLink(url))
@@ -84,6 +90,7 @@ public class CrawlerRevisitor {
                continue;
            }

+            size++;

            double skipProb;

@@ -117,14 +124,20 @@ public class CrawlerRevisitor {
                // fashion to make sure we eventually catch changes over time
                // and ensure we discover new links

-                // Hoover up any links from the document
-                crawlFrontier.enqueueLinksFromDocument(url, Jsoup.parse(doc.documentBody));
+                try {
+                    // Hoover up any links from the document
+                    crawlFrontier.enqueueLinksFromDocument(url, doc.parseBody());

+                }
+                catch (IOException ex) {
+                    // Ignored: if the stored body can't be parsed, we simply don't enqueue links from it
+                }
                // Add a WARC record so we don't repeat this
                warcRecorder.writeReferenceCopy(url,
+                        cookies,
                        doc.contentType,
                        doc.httpStatus,
-                        doc.documentBody,
+                        doc.documentBodyBytes,
                        doc.headers,
                        new ContentTags(doc.etagMaybe, doc.lastModifiedMaybe)
                );
@@ -146,11 +159,15 @@ public class CrawlerRevisitor {
                else if (result instanceof HttpFetchResult.ResultException) {
                    errors++;
                }
-
                recrawled++;
            }
        }

-        return recrawled;
+        logger.info("Recrawl summary {}: {} recrawled, {} retained, {} errors, {} skipped",
+                crawlFrontier.getDomain(), recrawled, retained, errors, skipped);
+
+        return new RecrawlMetadata(size, errors, skipped);
    }
+
+    public record RecrawlMetadata(int size, int errors, int skipped) {}
 }
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/revisit/DocumentWithReference.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/revisit/DocumentWithReference.java
@@ -2,12 +2,11 @@ package nu.marginalia.crawl.retreival.revisit;

 import nu.marginalia.crawl.fetcher.ContentTags;
 import nu.marginalia.crawl.retreival.CrawlDataReference;
-import nu.marginalia.model.body.DocumentBodyExtractor;
-import nu.marginalia.model.body.DocumentBodyResult;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawledDocument;

 import javax.annotation.Nullable;
+import java.util.Objects;

 public record DocumentWithReference(
        @Nullable CrawledDocument doc,
@@ -35,21 +34,31 @@ public record DocumentWithReference(
            return false;
        if (doc == null)
            return false;
-        if (doc.documentBody == null)
-            return false;
+        if (doc.documentBodyBytes.length == 0) {
+            if (doc.httpStatus < 300) {
+                return resultOk.bytesLength() == 0;
+            }
+            else if (doc.httpStatus == 301 || doc.httpStatus == 302 || doc.httpStatus == 307) {
+                @Nullable
+                String docLocation = doc.getHeader("Location");
+                @Nullable
+                String resultLocation = resultOk.header("Location");

-        if (!(DocumentBodyExtractor.asString(resultOk) instanceof DocumentBodyResult.Ok<String> bodyOk)) {
-            return false;
+                return Objects.equals(docLocation, resultLocation);
+            }
+            else {
+                return doc.httpStatus == resultOk.statusCode();
+            }
        }

-        return CrawlDataReference.isContentBodySame(doc.documentBody, bodyOk.body());
+        return CrawlDataReference.isContentBodySame(doc.documentBodyBytes, resultOk.bytesRaw());
    }

    public ContentTags getContentTags() {
        if (null == doc)
            return ContentTags.empty();

-        if (doc.documentBody == null || doc.httpStatus != 200)
+        if (doc.documentBodyBytes.length == 0 || doc.httpStatus != 200)
            return ContentTags.empty();

        String lastmod = doc.getLastModified();
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/sitemap/SitemapFetcher.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/sitemap/SitemapFetcher.java
@@ -1,72 +0,0 @@
-package nu.marginalia.crawl.retreival.sitemap;
-
-import crawlercommons.robots.SimpleRobotRules;
-import nu.marginalia.crawl.fetcher.SitemapRetriever;
-import nu.marginalia.crawl.retreival.DomainCrawlFrontier;
-import nu.marginalia.model.EdgeUrl;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import java.util.HashSet;
-import java.util.List;
-import java.util.Optional;
-import java.util.Set;
-
-public class SitemapFetcher {
-
-    private final DomainCrawlFrontier crawlFrontier;
-    private final SitemapRetriever sitemapRetriever;
-    private static final Logger logger = LoggerFactory.getLogger(SitemapFetcher.class);
-
-    public SitemapFetcher(DomainCrawlFrontier crawlFrontier, SitemapRetriever sitemapRetriever) {
-        this.crawlFrontier = crawlFrontier;
-        this.sitemapRetriever = sitemapRetriever;
-    }
-
-    public void downloadSitemaps(SimpleRobotRules robotsRules, EdgeUrl rootUrl) {
-        List<String> urls = robotsRules.getSitemaps();
-
-        if (urls.isEmpty()) {
-            urls = List.of(rootUrl.withPathAndParam("/sitemap.xml", null).toString());
-        }
-
-        downloadSitemaps(urls);
-    }
-
-    public void downloadSitemaps(List<String> urls) {
-
-        Set<String> checkedSitemaps = new HashSet<>();
-
-        for (var rawUrl : urls) {
-            Optional<EdgeUrl> parsedUrl = EdgeUrl.parse(rawUrl);
-            if (parsedUrl.isEmpty()) {
-                continue;
-            }
-
-            EdgeUrl url = parsedUrl.get();
-
-            // Let's not download sitemaps from other domains for now
-            if (!crawlFrontier.isSameDomain(url)) {
-                continue;
-            }
-
-            if (checkedSitemaps.contains(url.path))
-                continue;
-
-            var sitemap =  sitemapRetriever.fetchSitemap(url);
-            if (sitemap.isEmpty()) {
-                continue;
-            }
-
-            // ensure we don't try to download this sitemap again
-            // (don't move this up, as we may want to check the same
-            // path with different protocols until we find one that works)
-
-            checkedSitemaps.add(url.path);
-
-            crawlFrontier.addAllToQueue(sitemap);
-        }
-
-        logger.debug("Queue is now {}", crawlFrontier.queueSize());
-    }
-}
--- a/code/processes/crawling-process/model/build.gradle
+++ b/code/processes/crawling-process/model/build.gradle
@@ -32,15 +32,17 @@ dependencies {
    implementation libs.bundles.parquet

    implementation libs.trove
+    implementation libs.slop
    implementation libs.jwarc
    implementation libs.gson
    implementation libs.commons.io
    implementation libs.commons.lang3
-    implementation libs.okhttp3
    implementation libs.jsoup
    implementation libs.snakeyaml
    implementation libs.zstd

+    implementation libs.bundles.httpcomponents
+
    testImplementation libs.bundles.slf4j.test
    testImplementation libs.bundles.junit
    testImplementation libs.mockito
--- a/code/processes/crawling-process/model/java/nu/marginalia/ContentTypes.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/ContentTypes.java
@@ -6,6 +6,7 @@ public class ContentTypes {
    public static final Set<String> acceptedContentTypes = Set.of("application/xhtml+xml",
            "application/xhtml",
            "text/html",
+            "application/pdf",
            "image/x-icon",
            "text/plain");

@@ -19,4 +20,9 @@ public class ContentTypes {
        return false;
    }

+    public static boolean isBinary(String contentTypeHeader) {
+        String lcHeader = contentTypeHeader.toLowerCase();
+        return lcHeader.startsWith("application/pdf");
+    }
+
 }
--- a/code/processes/crawling-process/model/java/nu/marginalia/io/CrawledDomainReader.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/io/CrawledDomainReader.java
@@ -1,45 +0,0 @@
-package nu.marginalia.io;
-
-import nu.marginalia.io.crawldata.format.ParquetSerializableCrawlDataStream;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import java.io.FileNotFoundException;
-import java.io.IOException;
-import java.nio.file.Files;
-import java.nio.file.Path;
-
-public class CrawledDomainReader {
-    private static final Logger logger = LoggerFactory.getLogger(CrawledDomainReader.class);
-
-    /** An iterator-like access to domain data  This must be closed otherwise it will leak off-heap memory! */
-    public static SerializableCrawlDataStream createDataStream(Path fullPath) throws IOException
-    {
-
-        String fileName = fullPath.getFileName().toString();
-        if (fileName.endsWith(".parquet")) {
-            try {
-                return new ParquetSerializableCrawlDataStream(fullPath);
-            } catch (Exception ex) {
-                logger.error("Error reading domain data from " + fullPath, ex);
-                return SerializableCrawlDataStream.empty();
-            }
-        } else {
-            logger.error("Unknown file type: {}", fullPath);
-            return SerializableCrawlDataStream.empty();
-        }
-    }
-
-    /** An iterator-like access to domain data. This must be closed otherwise it will leak off-heap memory! */
-    public static SerializableCrawlDataStream createDataStream(Path basePath, String domain, String id) throws IOException {
-        Path parquetPath = CrawlerOutputFile.getParquetPath(basePath, id, domain);
-
-        if (Files.exists(parquetPath)) {
-            return createDataStream(parquetPath);
-        }
-        else {
-            throw new FileNotFoundException("No such file: " + parquetPath);
-        }
-    }
-
-}
--- a/code/processes/crawling-process/model/java/nu/marginalia/io/CrawlerOutputFile.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/io/CrawlerOutputFile.java
@@ -35,7 +35,7 @@ public class CrawlerOutputFile {
        return destDir.resolve(id + "-" + filesystemSafeName(domain) + "-" + version.suffix + ".warc.gz");
    }

-    public static Path createParquetPath(Path basePath, String id, String domain) throws IOException {
+    public static Path createSlopPath(Path basePath, String id, String domain) throws IOException {
        id = padId(id);

        String first = id.substring(0, 2);
@@ -45,8 +45,9 @@ public class CrawlerOutputFile {
        if (!Files.exists(destDir)) {
            Files.createDirectories(destDir);
        }
-        return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".parquet");
+        return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".slop.zip");
    }
+
    public static Path getParquetPath(Path basePath, String id, String domain) {
        id = padId(id);

@@ -56,16 +57,18 @@ public class CrawlerOutputFile {
        Path destDir = basePath.resolve(first).resolve(second);
        return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".parquet");
    }
-    public static Path getWarcPath(Path basePath, String id, String domain, WarcFileVersion version) {
+
+    public static Path getSlopPath(Path basePath, String id, String domain) {
        id = padId(id);

        String first = id.substring(0, 2);
        String second = id.substring(2, 4);

        Path destDir = basePath.resolve(first).resolve(second);
-        return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".warc" + version.suffix);
+        return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".slop.zip");
    }

+
    /**
     * Pads the given ID with leading zeros to ensure it has a length of 4 characters.
     */
--- a/code/processes/crawling-process/model/java/nu/marginalia/io/SerializableCrawlDataStream.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/io/SerializableCrawlDataStream.java
@@ -1,35 +1,122 @@
 package nu.marginalia.io;

+import nu.marginalia.io.crawldata.format.ParquetSerializableCrawlDataStream;
+import nu.marginalia.io.crawldata.format.SlopSerializableCrawlDataStream;
 import nu.marginalia.model.crawldata.CrawledDocument;
 import nu.marginalia.model.crawldata.CrawledDomain;
 import nu.marginalia.model.crawldata.SerializableCrawlData;
 import org.jetbrains.annotations.Nullable;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;

 import java.io.IOException;
 import java.nio.file.Path;
 import java.util.ArrayList;
 import java.util.Iterator;
 import java.util.List;
+import java.util.Optional;
+import java.util.function.Function;

 /** Closable iterator exceptional over serialized crawl data
 * The data may appear in any order, and the iterator must be closed.
 *
- * @see CrawledDomainReader
 * */
 public interface SerializableCrawlDataStream extends AutoCloseable {
-
+    Logger logger = LoggerFactory.getLogger(SerializableCrawlDataStream.class);

    SerializableCrawlData next() throws IOException;

    /** Return a size hint for the stream.  0 is returned if the hint is not available,
     * or if the file is seemed too small to bother */
-    default int sizeHint() { return 0; }
+    default int getSizeHint() { return 0; }

    boolean hasNext() throws IOException;

    @Nullable
    default Path path() { return null; }

+    void close() throws IOException;
+
+    /** An iterator-like access to domain data.  This must be closed, otherwise it will leak off-heap memory! */
+    static SerializableCrawlDataStream openDataStream(Path fullPath) throws IOException
+    {
+
+        String fileName = fullPath.getFileName().toString();
+
+        if (fileName.endsWith(".slop.zip")) {
+            try {
+                return new SlopSerializableCrawlDataStream(fullPath);
+            } catch (Exception ex) {
+                logger.error("Error reading domain data from " + fullPath, ex);
+                return SerializableCrawlDataStream.empty();
+            }
+        }
+
+        else if (fileName.endsWith(".parquet")) {
+            logger.error("Opening deprecated parquet-style crawl data stream", new Exception());
+            try {
+                return new ParquetSerializableCrawlDataStream(fullPath);
+            } catch (Exception ex) {
+                logger.error("Error reading domain data from " + fullPath, ex);
+                return SerializableCrawlDataStream.empty();
+            }
+        }
+
+        logger.error("Unknown file type: {}", fullPath);
+        return SerializableCrawlDataStream.empty();
+    }
+
+    /** Get an indication of the size of the stream.  This is used to determine whether to
+     * load the stream into memory or not.  0 is returned if the hint is not available,
+     * or if the file is deemed too small to bother */
+    static int getSizeHint(Path fullPath) {
+        String fileName = fullPath.getFileName().toString();
+        if (fileName.endsWith(".parquet")) {
+            return ParquetSerializableCrawlDataStream.sizeHint(fullPath);
+        }
+        else if (fileName.endsWith(".slop.zip")) {
+            return SlopSerializableCrawlDataStream.sizeHint(fullPath);
+        }
+        else {
+            return 0;
+        }
+    }
+
+    default <T>  Iterator<T> map(Function<SerializableCrawlData, Optional<T>> mapper) {
+        return new Iterator<>() {
+            T next = null;
+
+            public boolean hasNext() {
+                if (next != null)
+                    return true;
+                try {
+                    while (SerializableCrawlDataStream.this.hasNext()) {
+                        var val = mapper.apply(SerializableCrawlDataStream.this.next());
+                        if (val.isPresent()) {
+                            next = val.get();
+                            return true;
+                        }
+                    }
+                }
+                catch (IOException ex) {
+                    logger.error("Error during stream", ex);
+                }
+
+                return false;
+            }
+
+            public T next() {
+                if (next == null && !hasNext())
+                    throw new IllegalStateException("No more data to read");
+
+                T ret = next;
+                next = null;
+                return ret;
+            }
+        };
+
+    }
+
    /** For tests */
    default List<SerializableCrawlData> asList() throws IOException {
        List<SerializableCrawlData> data = new ArrayList<>();
@@ -81,7 +168,6 @@ public interface SerializableCrawlDataStream extends AutoCloseable {
            public boolean hasNext() { return iterator.hasNext(); }
            public void close() {}
        };
-
    }

 }
--- a/code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/format/ParquetSerializableCrawlDataStream.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/format/ParquetSerializableCrawlDataStream.java
@@ -1,7 +1,6 @@
 package nu.marginalia.io.crawldata.format;

 import nu.marginalia.contenttype.ContentType;
-import nu.marginalia.contenttype.DocumentBodyToString;
 import nu.marginalia.hash.MurmurHash3_128;
 import nu.marginalia.io.SerializableCrawlDataStream;
 import nu.marginalia.model.EdgeUrl;
@@ -18,6 +17,7 @@ import java.nio.file.Path;
 import java.util.*;
 import java.util.stream.Stream;

+@Deprecated
 public class ParquetSerializableCrawlDataStream implements AutoCloseable, SerializableCrawlDataStream {
    private static final Logger logger = LoggerFactory.getLogger(ParquetSerializableCrawlDataStream.class);

@@ -40,7 +40,7 @@ public class ParquetSerializableCrawlDataStream implements AutoCloseable, Serial
        return path;
    }

-    public int sizeHint() {
+    public static int sizeHint(Path path) {
        // Only calculate size hint for large files
        // (the reason we calculate them in the first place is to assess whether it is large
        // because it has many documents, or because it is a small number of large documents)
@@ -124,9 +124,7 @@ public class ParquetSerializableCrawlDataStream implements AutoCloseable, Serial
        }
        else if (nextRecord.body != null) {
            try {
-                bodyString = DocumentBodyToString.getStringData(
-                        ContentType.parse(nextRecord.contentType),
-                        nextRecord.body);
+                ContentType.parse(nextRecord.contentType);
            } catch (Exception ex) {
                logger.error("Failed to convert body to string", ex);
                status = CrawlerDocumentStatus.BAD_CHARSET;
@@ -147,7 +145,7 @@ public class ParquetSerializableCrawlDataStream implements AutoCloseable, Serial
                status.toString(),
                "",
                nextRecord.headers,
-                bodyString,
+                nextRecord.body,
                // this field isn't actually used, maybe we can skip calculating it?
                nextRecord.cookies,
                lastModified,
--- a/code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/format/SlopSerializableCrawlDataStream.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/format/SlopSerializableCrawlDataStream.java
@@ -0,0 +1,185 @@
+package nu.marginalia.io.crawldata.format;
+
+import nu.marginalia.contenttype.ContentType;
+import nu.marginalia.io.SerializableCrawlDataStream;
+import nu.marginalia.model.EdgeUrl;
+import nu.marginalia.model.crawldata.*;
+import nu.marginalia.slop.SlopCrawlDataRecord;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.time.Instant;
+import java.util.ArrayDeque;
+import java.util.ArrayList;
+import java.util.Deque;
+import java.util.NoSuchElementException;
+
+public class SlopSerializableCrawlDataStream implements AutoCloseable, SerializableCrawlDataStream {
+    private static final Logger logger = LoggerFactory.getLogger(SlopSerializableCrawlDataStream.class);
+
+    private final SlopCrawlDataRecord.FilteringReader reader;
+
+    // Holds pending values.  This is not a read-ahead buffer; it exists because
+    // we sometimes generate multiple SerializableCrawlData records for a single input
+    private final Deque<SerializableCrawlData> nextQ = new ArrayDeque<>();
+
+    private boolean wroteDomainRecord = false;
+    private final Path path;
+
+    public SlopSerializableCrawlDataStream(Path file) throws IOException {
+        path = file;
+        reader = new SlopCrawlDataRecord.FilteringReader(file) {
+            @Override
+            public boolean filter(String url, int status, String contentType) {
+                String ctLc = contentType.toLowerCase();
+
+                // Permit all text/* content types (text/html, text/plain, etc.)
+                if (ctLc.startsWith("text/"))
+                    return true;
+                // PDF
+                else if (ctLc.startsWith("application/pdf"))
+                    return true;
+                else if (ctLc.startsWith("x-marginalia/"))
+                    return true;
+
+                return false;
+            }
+        };
+    }
+
+    @Override
+    public Path path() {
+        return path;
+    }
+
+    public static int sizeHint(Path path) {
+        // Only calculate size hint for large files
+        // (the reason we calculate them in the first place is to assess whether it is large
+        // because it has many documents, or because it is a small number of large documents)
+        try {
+            if (Files.size(path) > 10_000_000) {
+                return SlopCrawlDataRecord.countGoodStatusCodes(path);
+            }
+        } catch (IOException e) {
+            // suppressed
+        }
+
+        return 0;
+    }
+
+    @Override
+    public boolean hasNext() {
+        try {
+            while (reader.hasRemaining() && nextQ.isEmpty()) {
+                try {
+                    var nextRecord = reader.get();
+                    if (!wroteDomainRecord) {
+                        createDomainRecord(nextRecord);
+                        wroteDomainRecord = true;
+                    }
+
+                    createDocumentRecord(nextRecord);
+                } catch (Exception ex) {
+                    logger.error("Failed to create document record", ex);
+                }
+            }
+            return !nextQ.isEmpty();
+        }
+        catch (IOException ex) {
+            return false;
+        }
+    }
+
+    private void createDomainRecord(SlopCrawlDataRecord parquetRecord) throws URISyntaxException {
+
+        CrawlerDomainStatus status = CrawlerDomainStatus.OK;
+        String statusReason = "";
+
+        String redirectDomain = null;
+
+        // The advisory content types are used to signal various states of the crawl
+        // that are not actual crawled documents.
+
+        switch (parquetRecord.contentType()) {
+            case "x-marginalia/advisory;state=redirect" -> {
+                EdgeUrl crawledUrl = new EdgeUrl(parquetRecord.url());
+                redirectDomain = crawledUrl.getDomain().toString();
+                status = CrawlerDomainStatus.REDIRECT;
+            }
+            case "x-marginalia/advisory;state=blocked" -> {
+                status = CrawlerDomainStatus.BLOCKED;
+            }
+            case "x-marginalia/advisory;state=error" -> {
+                status = CrawlerDomainStatus.ERROR;
+                statusReason = new String(parquetRecord.body());
+            }
+        }
+
+        nextQ.add(new CrawledDomain(
+                parquetRecord.domain(),
+                redirectDomain,
+                status.toString(),
+                statusReason,
+                parquetRecord.ip(),
+                new ArrayList<>(),
+                new ArrayList<>()
+        ));
+    }
+
+    private void createDocumentRecord(SlopCrawlDataRecord nextRecord) {
+        CrawlerDocumentStatus status = CrawlerDocumentStatus.OK;
+
+        if (nextRecord.contentType().startsWith("x-marginalia/advisory;state=content-type-failed-probe")) {
+            status = CrawlerDocumentStatus.BAD_CONTENT_TYPE;
+        }
+        else if (nextRecord.contentType().startsWith("x-marginalia/advisory;state=robots-txt-skipped")) {
+            status = CrawlerDocumentStatus.ROBOTS_TXT;
+        }
+        else if (nextRecord.contentType().startsWith("x-marginalia/advisory")) {
+            // we don't care about the other advisory content types here
+            return;
+        }
+        else if (nextRecord.body() != null) {
+            try {
+                ContentType.parse(nextRecord.contentType());
+            } catch (Exception ex) {
+                logger.error("Failed to parse content type", ex);
+                status = CrawlerDocumentStatus.BAD_CHARSET;
+            }
+        }
+        else {
+            status = CrawlerDocumentStatus.ERROR;
+        }
+
+        nextQ.add(new CrawledDocument("",
+                nextRecord.url(),
+                nextRecord.contentType(),
+                Instant.ofEpochMilli(nextRecord.timestamp()).toString(),
+                nextRecord.httpStatus(),
+                status.toString(),
+                "",
+                nextRecord.headers(),
+                nextRecord.body(),
+                // this field isn't actually used, maybe we can skip calculating it?
+                nextRecord.cookies(),
+                null,
+                null));
+    }
+
+    public void close() throws IOException {
+        reader.close();
+    }
+
+    @Override
+    public SerializableCrawlData next() throws IOException {
+        if (!hasNext())
+            throw new NoSuchElementException();
+
+        return nextQ.poll();
+    }
+
+}
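The lookahead pattern behind `hasNext()` and `next()` above, where one input record may expand into zero or more output records parked in a `Deque`, can be sketched in isolation. This is a minimal illustration and not the project's code: `ExpandingIterator` and its `expand` function are hypothetical names, and a plain `Iterator` stands in for the Slop reader.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical stand-in for the stream class: each input may yield 0..n outputs.
final class ExpandingIterator<I, O> implements Iterator<O> {
    private final Iterator<I> source;
    private final Function<I, List<O>> expand;
    private final Deque<O> nextQ = new ArrayDeque<>();

    ExpandingIterator(Iterator<I> source, Function<I, List<O>> expand) {
        this.source = source;
        this.expand = expand;
    }

    @Override
    public boolean hasNext() {
        // Keep consuming inputs until something is queued or the source runs dry
        while (nextQ.isEmpty() && source.hasNext()) {
            nextQ.addAll(expand.apply(source.next()));
        }
        return !nextQ.isEmpty();
    }

    @Override
    public O next() {
        if (!hasNext())
            throw new NoSuchElementException();
        return nextQ.poll();
    }
}
```

Because `hasNext()` does the reading, inputs that expand to nothing (like the skipped advisory records in the diff) are passed over without the caller noticing.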
--- a/code/processes/crawling-process/model/java/nu/marginalia/model/body/ContentTypeLogic.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/model/body/ContentTypeLogic.java
@@ -10,7 +10,7 @@ import java.util.regex.Pattern;

 public class ContentTypeLogic {

-    private static final Predicate<String> probableHtmlPattern = Pattern.compile("^.*\\.(htm|html|php|txt|md)$").asMatchPredicate();
+    private static final Predicate<String> probableGoodPattern = Pattern.compile("^.*\\.(htm|html|php|txt|md|pdf)$").asMatchPredicate();
    private static final Predicate<String> probableBinaryPattern = Pattern.compile("^.*\\.[a-z]+$").asMatchPredicate();
    private static final Set<String> blockedContentTypes = Set.of("text/css", "text/javascript");
    private static final List<String> acceptedContentTypePrefixes = List.of(
@@ -22,6 +22,7 @@ public class ContentTypeLogic {
            "application/rss+xml",
            "application/x-rss+xml",
            "application/rdf+xml",
+            "application/pdf",
            "x-rss+xml"
    );
    private boolean allowAllContentTypes = false;
@@ -34,7 +35,7 @@ public class ContentTypeLogic {
    public boolean isUrlLikeBinary(EdgeUrl url) {
        String pathLowerCase = url.path.toLowerCase();

-        if (probableHtmlPattern.test(pathLowerCase))
+        if (probableGoodPattern.test(pathLowerCase))
            return false;

        return probableBinaryPattern.test(pathLowerCase);
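To see what the extension heuristic in `isUrlLikeBinary` does with the new `pdf` entry, the two patterns can be exercised on plain path strings. The sketch below copies the regular expressions from the diff; the class name `UrlBinaryHeuristic` and the method `isPathLikeBinary` are hypothetical, and a `String` path stands in for `EdgeUrl`.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Standalone copy of the two patterns from the diff, applied to a plain path string
final class UrlBinaryHeuristic {
    private static final Predicate<String> probableGoodPattern =
            Pattern.compile("^.*\\.(htm|html|php|txt|md|pdf)$").asMatchPredicate();
    private static final Predicate<String> probableBinaryPattern =
            Pattern.compile("^.*\\.[a-z]+$").asMatchPredicate();

    static boolean isPathLikeBinary(String path) {
        String pathLowerCase = path.toLowerCase();

        // Known-good extensions are never treated as binary
        if (probableGoodPattern.test(pathLowerCase))
            return false;

        // Any other purely alphabetic extension is assumed to be binary
        return probableBinaryPattern.test(pathLowerCase);
    }
}
```

With these patterns `/docs/paper.PDF` and `/about` are not flagged, while `/img/logo.png` is. An extension containing a digit, such as `.mp3`, is not matched by `[a-z]+` and so is not flagged either.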
--- a/code/processes/crawling-process/model/java/nu/marginalia/model/body/DocumentBodyExtractor.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/model/body/DocumentBodyExtractor.java
@@ -18,7 +18,7 @@ public class DocumentBodyExtractor {
            return asBytes(fetchOk);
        }
        else if (result instanceof HttpFetchResult.Result304ReplacedWithReference retained) {
-            return new DocumentBodyResult.Ok<>(retained.contentType(), retained.body().getBytes());
+            return new DocumentBodyResult.Ok<>(retained.contentType(), retained.body());
        }

        return new DocumentBodyResult.Error<>(CrawlerDocumentStatus.ERROR, "Fetch Result Not Ok");
--- a/code/processes/crawling-process/model/java/nu/marginalia/model/body/HttpFetchResult.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/model/body/HttpFetchResult.java
@@ -1,17 +1,22 @@
 package nu.marginalia.model.body;

 import nu.marginalia.contenttype.ContentType;
-import okhttp3.Headers;
+import nu.marginalia.model.EdgeUrl;
+import org.apache.hc.core5.http.Header;
+import org.apache.hc.core5.http.message.BasicHeader;
+import org.jetbrains.annotations.Nullable;
 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.netpreserve.jwarc.MessageHeaders;
 import org.netpreserve.jwarc.WarcResponse;

 import java.io.ByteArrayInputStream;
-import java.io.IOException;
 import java.io.InputStream;
 import java.net.InetAddress;
 import java.net.URI;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
 import java.util.Optional;

 /* FIXME:  This interface has a very unfortunate name that is not very descriptive.
@@ -56,42 +61,47 @@ public sealed interface HttpFetchResult {
     */
    record ResultOk(URI uri,
                    int statusCode,
-                    Headers headers,
+                    Header[] headers,
                    String ipAddress,
-                    byte[] bytesRaw,
+                    byte[] bytesRaw, // raw data for the entire response including headers
                    int bytesStart,
                    int bytesLength
    ) implements HttpFetchResult {

+        public ResultOk(URI uri, int status, MessageHeaders headers, String ipAddress, byte[] bytes, int bytesStart, int length) {
+            this(uri, status, convertHeaders(headers), ipAddress, bytes, bytesStart, length);
+        }
+
+        private static Header[] convertHeaders(MessageHeaders messageHeaders) {
+            List<Header> headers = new ArrayList<>(12);
+
+            messageHeaders.map().forEach((k, v) -> {
+                if (k.isBlank()) return;
+                if (!Character.isAlphabetic(k.charAt(0))) return;
+
+                for (var value : v) {
+                    headers.add(new BasicHeader(k, value));
+                }
+            });
+
+            return headers.toArray(new Header[0]);
+        }
+
        public boolean isOk() {
            return statusCode >= 200 && statusCode < 300;
        }

-        public ResultOk(URI uri,
-                        int statusCode,
-                        MessageHeaders headers,
-                        String ipAddress,
-                        byte[] bytesRaw,
-                        int bytesStart,
-                        int bytesLength) {
-            this(uri, statusCode, convertHeaders(headers), ipAddress, bytesRaw, bytesStart, bytesLength);
-        }
-
-        private static Headers convertHeaders(MessageHeaders headers) {
-            var ret = new Headers.Builder();
-            for (var header : headers.map().entrySet()) {
-                for (var value : header.getValue()) {
-                    ret.add(header.getKey(), value);
-                }
-            }
-            return ret.build();
-        }
-
        public InputStream getInputStream() {
            return new ByteArrayInputStream(bytesRaw, bytesStart, bytesLength);
        }

-        public Optional<Document> parseDocument() throws IOException {
+        /** Copy the byte range corresponding to the payload of the response.
+         *  Warning: this copies the data; use getInputStream() for zero-copy access. */
+        public byte[] getBodyBytes() {
+            return Arrays.copyOfRange(bytesRaw, bytesStart, bytesStart + bytesLength);
+        }
+
+        public Optional<Document> parseDocument() {
            return DocumentBodyExtractor.asString(this).flatMapOpt((contentType, body) -> {
                if (contentType.is("text/html")) {
                    return Optional.of(Jsoup.parse(body));
@@ -102,8 +112,15 @@ public sealed interface HttpFetchResult {
            });
        }

+        @Nullable
        public String header(String name) {
-            return headers.get(name);
+            for (var header : headers) {
+                if (header.getName().equalsIgnoreCase(name)) {
+                    return header.getValue();
+                }
+            }
+            return null;
        }

    }
@@ -114,20 +131,10 @@ public sealed interface HttpFetchResult {
     *
     * @see Result304Raw for the case where the document has not yet been replaced with the reference data.
     */
-    record Result304ReplacedWithReference(String url, ContentType contentType, String body) implements HttpFetchResult {
-
+    record Result304ReplacedWithReference(String url, ContentType contentType, byte[] body) implements HttpFetchResult {
        public boolean isOk() {
            return true;
        }
-
-        public Optional<Document> parseDocument() {
-            try {
-                return Optional.of(Jsoup.parse(body));
-            }
-            catch (Exception ex) {
-                return Optional.empty();
-            }
-        }
    }

    /** Fetching resulted in an exception */
@@ -137,6 +144,12 @@ public sealed interface HttpFetchResult {
        }
    }

+    record ResultRedirect(EdgeUrl url) implements HttpFetchResult {
+        public boolean isOk() {
+            return true;
+        }
+    }
+
    /** Fetching resulted in a HTTP 304, the remote content is identical to
     * our reference copy.  This will be replaced with a Result304ReplacedWithReference
     * at a later stage.
--- a/code/processes/crawling-process/model/java/nu/marginalia/model/crawldata/CrawledDocument.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/model/crawldata/CrawledDocument.java
@@ -1,8 +1,16 @@
 package nu.marginalia.model.crawldata;

+import nu.marginalia.contenttype.ContentType;
+import nu.marginalia.contenttype.DocumentBodyToString;
 import nu.marginalia.model.EdgeUrl;
 import org.apache.commons.lang3.StringUtils;
 import org.jetbrains.annotations.Nullable;
+import org.jsoup.nodes.Document;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.Arrays;
+import java.util.Objects;

 public final class CrawledDocument implements SerializableCrawlData {
    public String crawlId;
@@ -19,8 +27,52 @@ public final class CrawledDocument implements SerializableCrawlData {
    @Nullable
    public String headers;

-    public String documentBody;
+    public String documentBody() {
+        return DocumentBodyToString.getStringData(
+                ContentType.parse(contentType),
+                documentBodyBytes);
+    }

+    /** Attempt to parse the first sampleSize bytes of the document body into a string */
+    public String documentBody(int sampleSize) {
+        if (sampleSize >= documentBodyBytes.length) {
+            return documentBody();
+        }
+
+        // Truncating the string at an unlucky point *may* lead to a parsing error
+        // ... so we try again with a longer length
+        for (int i = 0; i <= 3 && sampleSize + i < documentBodyBytes.length; i++) {
+            try {
+                byte[] bytes = new byte[sampleSize + i];
+                System.arraycopy(documentBodyBytes, 0, bytes, 0, bytes.length);
+
+                return DocumentBodyToString.getStringData(
+                        ContentType.parse(contentType),
+                        bytes);
+            }
+            catch (RuntimeException ex) {
+                // Try again with i + 1
+            }
+        }
+
+        throw new IllegalArgumentException("Failed to parse substring");
+    }
+
+    public Document parseBody() throws IOException {
+        // Prevent stalls from parsing excessively large documents
+
+        return DocumentBodyToString.getParsedData(
+                ContentType.parse(contentType),
+                documentBodyBytes,
+                200_000,
+                url);
+    }
+
+    public boolean hasBody() {
+        return documentBodyBytes.length > 0;
+    }
+
+    public byte[] documentBodyBytes;
    /**
     * This is not guaranteed to be set in all versions of the format,
     * information may come in CrawledDomain instead
@@ -30,7 +82,7 @@ public final class CrawledDocument implements SerializableCrawlData {
    public String lastModifiedMaybe;
    public String etagMaybe;

-    public CrawledDocument(String crawlId, String url, String contentType, String timestamp, int httpStatus, String crawlerStatus, String crawlerStatusDesc, @Nullable String headers, String documentBody, Boolean hasCookies, String lastModifiedMaybe, String etagMaybe) {
+    public CrawledDocument(String crawlId, String url, String contentType, String timestamp, int httpStatus, String crawlerStatus, String crawlerStatusDesc, @Nullable String headers, byte[] documentBodyBytes, Boolean hasCookies, String lastModifiedMaybe, String etagMaybe) {
        this.crawlId = crawlId;
        this.url = url;
        this.contentType = contentType;
@@ -39,7 +91,7 @@ public final class CrawledDocument implements SerializableCrawlData {
        this.crawlerStatus = crawlerStatus;
        this.crawlerStatusDesc = crawlerStatusDesc;
        this.headers = headers;
-        this.documentBody = documentBody;
+        this.documentBodyBytes = Objects.requireNonNullElse(documentBodyBytes, new byte[] {});
        this.hasCookies = hasCookies;
        this.lastModifiedMaybe = lastModifiedMaybe;
        this.etagMaybe = etagMaybe;
@@ -50,7 +102,7 @@ public final class CrawledDocument implements SerializableCrawlData {
    }

    @Nullable
-    private String getHeader(String header) {
+    public String getHeader(String header) {
        if (headers == null) {
            return null;
        }
@@ -106,7 +158,7 @@ public final class CrawledDocument implements SerializableCrawlData {
    }

    public String toString() {
-        return "CrawledDocument(crawlId=" + this.crawlId + ", url=" + this.url + ", contentType=" + this.contentType + ", timestamp=" + this.timestamp + ", httpStatus=" + this.httpStatus + ", crawlerStatus=" + this.crawlerStatus + ", crawlerStatusDesc=" + this.crawlerStatusDesc + ", headers=" + this.headers + ", documentBody=" + this.documentBody + ", hasCookies=" + this.hasCookies + ", lastModifiedMaybe=" + this.lastModifiedMaybe + ", etagMaybe=" + this.etagMaybe + ")";
+        return "CrawledDocument(crawlId=" + this.crawlId + ", url=" + this.url + ", contentType=" + this.contentType + ", timestamp=" + this.timestamp + ", httpStatus=" + this.httpStatus + ", crawlerStatus=" + this.crawlerStatus + ", crawlerStatusDesc=" + this.crawlerStatusDesc + ", headers=" + this.headers + ", documentBody=" + documentBody() + ", hasCookies=" + this.hasCookies + ", lastModifiedMaybe=" + this.lastModifiedMaybe + ", etagMaybe=" + this.etagMaybe + ")";
    }

    public static class CrawledDocumentBuilder {
@@ -118,7 +170,7 @@ public final class CrawledDocument implements SerializableCrawlData {
        private String crawlerStatus;
        private String crawlerStatusDesc;
        private @Nullable String headers;
-        private String documentBody;
+        private byte[] documentBodyBytes = new byte[0];
        private String recrawlState;
        private Boolean hasCookies;
        private String lastModifiedMaybe;
@@ -168,10 +220,13 @@ public final class CrawledDocument implements SerializableCrawlData {
        }

        public CrawledDocumentBuilder documentBody(String documentBody) {
-            this.documentBody = documentBody;
+            this.documentBodyBytes = documentBody.getBytes(StandardCharsets.UTF_8);
+            return this;
+        }
+        public CrawledDocumentBuilder documentBodyBytes(byte[] documentBodyBytes) {
+            this.documentBodyBytes = documentBodyBytes;
            return this;
        }
-
        @Deprecated
        public CrawledDocumentBuilder recrawlState(String recrawlState) {
            this.recrawlState = recrawlState;
@@ -194,11 +249,11 @@ public final class CrawledDocument implements SerializableCrawlData {
        }

        public CrawledDocument build() {
-            return new CrawledDocument(this.crawlId, this.url, this.contentType, this.timestamp, this.httpStatus, this.crawlerStatus, this.crawlerStatusDesc, this.headers, this.documentBody, this.hasCookies, this.lastModifiedMaybe, this.etagMaybe);
+            return new CrawledDocument(this.crawlId, this.url, this.contentType, this.timestamp, this.httpStatus, this.crawlerStatus, this.crawlerStatusDesc, this.headers, this.documentBodyBytes, this.hasCookies, this.lastModifiedMaybe, this.etagMaybe);
        }

        public String toString() {
-            return "CrawledDocument.CrawledDocumentBuilder(crawlId=" + this.crawlId + ", url=" + this.url + ", contentType=" + this.contentType + ", timestamp=" + this.timestamp + ", httpStatus=" + this.httpStatus + ", crawlerStatus=" + this.crawlerStatus + ", crawlerStatusDesc=" + this.crawlerStatusDesc + ", headers=" + this.headers + ", documentBody=" + this.documentBody +  ", recrawlState=" + this.recrawlState + ", hasCookies=" + this.hasCookies + ", lastModifiedMaybe=" + this.lastModifiedMaybe + ", etagMaybe=" + this.etagMaybe + ")";
+            return "CrawledDocument.CrawledDocumentBuilder(crawlId=" + this.crawlId + ", url=" + this.url + ", contentType=" + this.contentType + ", timestamp=" + this.timestamp + ", httpStatus=" + this.httpStatus + ", crawlerStatus=" + this.crawlerStatus + ", crawlerStatusDesc=" + this.crawlerStatusDesc + ", headers=" + this.headers + ", documentBodyBytes=" + Arrays.toString(this.documentBodyBytes) +  ", recrawlState=" + this.recrawlState + ", hasCookies=" + this.hasCookies + ", lastModifiedMaybe=" + this.lastModifiedMaybe + ", etagMaybe=" + this.etagMaybe + ")";
        }
    }
 }
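The retry loop in `documentBody(int sampleSize)` guards against cutting the byte array in the middle of a multi-byte character. The diff does not show how `DocumentBodyToString` reacts to such input, so the following is only a sketch under the assumption of strict UTF-8 decoding; `SampleDecoder` and `decodeSample` are hypothetical names, and a JDK `CharsetDecoder` stands in for the project's helper.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: strict UTF-8 decoding of a prefix, retried with a longer prefix
final class SampleDecoder {
    static String decodeSample(byte[] body, int sampleSize) {
        if (sampleSize >= body.length) {
            return new String(body, StandardCharsets.UTF_8);
        }

        // A UTF-8 sequence is at most 4 bytes, so up to 3 extra bytes complete a cut character
        for (int i = 0; i <= 3 && sampleSize + i < body.length; i++) {
            try {
                var decoder = StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT);
                return decoder.decode(ByteBuffer.wrap(body, 0, sampleSize + i)).toString();
            } catch (CharacterCodingException ex) {
                // The cut landed inside a multi-byte sequence; try one byte further
            }
        }

        throw new IllegalArgumentException("Failed to parse substring");
    }
}
```

As in the diff, the window only grows while it stays shorter than the full body, so a sample that cannot be repaired within three extra bytes ends in an `IllegalArgumentException`.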
--- a/code/processes/crawling-process/model/java/nu/marginalia/parquet/crawldata/CrawledDocumentParquetRecordFileWriter.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/parquet/crawldata/CrawledDocumentParquetRecordFileWriter.java
@@ -165,27 +165,42 @@ public class CrawledDocumentParquetRecordFileWriter implements AutoCloseable {
            contentType = "";
        }

-        String headersStr = null;
+        boolean hasCookies = false;
+        String etag = null;
+        String lastModified = null;
+
        StringJoiner headersStrBuilder = new StringJoiner("\n");
        for (var header : headers) {
-            headersStrBuilder.add(header.getFirst() + ": " + header.getSecond());
+            if (header.getName().equalsIgnoreCase("X-Has-Cookies")) {
+                hasCookies = hasCookies || header.getValue().equals("1");
+            }
+            else if (header.getName().equalsIgnoreCase("ETag")) {
+                etag = header.getValue();
+            }
+            else if (header.getName().equalsIgnoreCase("Last-Modified")) {
+                lastModified = header.getValue();
+            }
+            else {
+                headersStrBuilder.add(header.getName() + ": " + header.getValue());
+            }
        }
-        headersStr = headersStrBuilder.toString();
+
+        String headersStr = headersStrBuilder.toString();


        write(new CrawledDocumentParquetRecord(
                domain,
                response.target(),
                fetchOk.ipAddress(),
-                WarcXCookieInformationHeader.hasCookies(response),
+                hasCookies,
                fetchOk.statusCode(),
                response.date(),
                contentType,
                bodyBytes,
                headersStr,
-                headers.get("ETag"),
-                headers.get("Last-Modified"))
-        );
+                etag,
+                lastModified
+        ));
    }


--- a/code/processes/crawling-process/model/java/nu/marginalia/slop/SlopCrawlDataRecord.java
+++ b/code/processes/crawling-process/model/java/nu/marginalia/slop/SlopCrawlDataRecord.java
@@ -0,0 +1,530 @@
+package nu.marginalia.slop;
+
+import nu.marginalia.ContentTypes;
+import nu.marginalia.UserAgent;
+import nu.marginalia.model.body.DocumentBodyExtractor;
+import nu.marginalia.model.body.DocumentBodyResult;
+import nu.marginalia.model.body.HttpFetchResult;
+import nu.marginalia.parquet.crawldata.CrawledDocumentParquetRecord;
+import nu.marginalia.parquet.crawldata.CrawledDocumentParquetRecordFileReader;
+import nu.marginalia.slop.column.array.ByteArrayColumn;
+import nu.marginalia.slop.column.primitive.ByteColumn;
+import nu.marginalia.slop.column.primitive.LongColumn;
+import nu.marginalia.slop.column.primitive.ShortColumn;
+import nu.marginalia.slop.column.string.EnumColumn;
+import nu.marginalia.slop.column.string.StringColumn;
+import nu.marginalia.slop.desc.StorageType;
+import nu.marginalia.slop.storage.LargeItem;
+import org.apache.commons.io.FileUtils;
+import org.apache.commons.lang3.StringUtils;
+import org.netpreserve.jwarc.*;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.time.Instant;
+import java.util.List;
+import java.util.Objects;
+import java.util.StringJoiner;
+
+public record SlopCrawlDataRecord(String domain,
+                                  String url,
+                                  String ip,
+                                  boolean cookies,
+                                  int httpStatus,
+                                  long timestamp,
+                                  String contentType,
+                                  byte[] body,
+                                  String headers)
+{
+    private static final EnumColumn domainColumn = new EnumColumn("domain", StandardCharsets.UTF_8, StorageType.ZSTD);
+    private static final StringColumn urlColumn = new StringColumn("url", StandardCharsets.UTF_8, StorageType.ZSTD);
+    private static final StringColumn ipColumn = new StringColumn("ip", StandardCharsets.ISO_8859_1, StorageType.ZSTD);
+    private static final ByteColumn cookiesColumn = new ByteColumn("cookies");
+    private static final ShortColumn statusColumn = new ShortColumn("httpStatus");
+    private static final LongColumn timestampColumn = new LongColumn("timestamp");
+    private static final EnumColumn contentTypeColumn = new EnumColumn("contentType", StandardCharsets.UTF_8);
+    private static final ByteArrayColumn bodyColumn = new ByteArrayColumn("body", StorageType.ZSTD);
+    private static final StringColumn headerColumn = new StringColumn("header", StandardCharsets.UTF_8, StorageType.ZSTD);
+
+    public SlopCrawlDataRecord(CrawledDocumentParquetRecord parquetRecord) {
+        this(parquetRecord.domain,
+                parquetRecord.url,
+                parquetRecord.ip,
+                parquetRecord.cookies,
+                parquetRecord.httpStatus,
+                parquetRecord.timestamp.toEpochMilli(),
+                parquetRecord.contentType,
+                parquetRecord.body,
+                parquetRecord.headers
+                );
+    }
+
+
+    private static SlopCrawlDataRecord forDomainRedirect(String domain, Instant date, String redirectDomain) {
+        return new SlopCrawlDataRecord(domain,
+                "https://" + redirectDomain + "/",
+                "",
+                false,
+                0,
+                date.toEpochMilli(),
+                "x-marginalia/advisory;state=redirect",
+                new byte[0],
+                ""
+        );
+    }
+
+    private static SlopCrawlDataRecord forDomainError(String domain, Instant date, String ip, String errorStatus) {
+        return new SlopCrawlDataRecord(domain,
+                "https://" + domain + "/",
+                ip,
+                false,
+                0,
+                date.toEpochMilli(),
+                "x-marginalia/advisory;state=error",
+                errorStatus.getBytes(StandardCharsets.UTF_8),
+                ""
+        );
+    }
+
+    private static SlopCrawlDataRecord forDocError(String domain, Instant date, String url, String errorStatus) {
+        return new SlopCrawlDataRecord(domain,
+                url,
+                "",
+                false,
+                0,
+                date.toEpochMilli(),
+                errorStatus,
+                new byte[0],
+                ""
+        );
+    }
+
+
+    public static void convertFromParquet(Path parquetInput, Path slopOutput) throws IOException {
+        Path tempDir = Files.createTempDirectory(slopOutput.getParent(), "conversion");
+
+        try (var writer = new Writer(tempDir);
+             var stream = CrawledDocumentParquetRecordFileReader.stream(parquetInput))
+        {
+            stream.forEach(
+                parquetRecord -> {
+                    try {
+                        writer.write(new SlopCrawlDataRecord(parquetRecord));
+                    } catch (IOException e) {
+                        throw new RuntimeException(e);
+                    }
+                });
+        }
+        catch (IOException ex) {
+            FileUtils.deleteDirectory(tempDir.toFile());
+            throw ex;
+        }
+
+        try {
+            SlopTablePacker.packToSlopZip(tempDir, slopOutput);
+            FileUtils.deleteDirectory(tempDir.toFile());
+        }
+        catch (Exception ex) {
+            logger.error("Failed to pack Slop table after Parquet conversion", ex);
+        }
+    }
+
+    private static final Logger logger = LoggerFactory.getLogger(SlopCrawlDataRecord.class);
+
+    public static void convertWarc(String domain,
+                                   UserAgent userAgent,
+                                   Path warcInputFile,
+                                   Path slopOutputFile) throws IOException {
+
+        Path tempDir = Files.createTempDirectory(slopOutputFile.getParent(), "slop-"+domain);
+
+        try (var warcReader = new WarcReader(warcInputFile);
+             var slopWriter = new SlopCrawlDataRecord.Writer(tempDir)
+        ) {
+            WarcXResponseReference.register(warcReader);
+            WarcXEntityRefused.register(warcReader);
+
+            String uaString = userAgent.uaString();
+
+            for (var record : warcReader) {
+                try {
+                    if (record instanceof WarcResponse response) {
+                        // this also captures WarcXResponseReference, which inherits from WarcResponse
+                        // and is used to store old responses from previous crawls; in this part of the logic
+                        // we treat them the same as a normal response
+
+                        if (!filterResponse(uaString, response)) {
+                            continue;
+                        }
+
+                        slopWriter.write(domain, response);
+                    } else if (record instanceof WarcXEntityRefused refused) {
+                        slopWriter.write(domain, refused);
+                    } else if (record instanceof Warcinfo warcinfo) {
+                        slopWriter.write(warcinfo);
+                    }
+                }
+                catch (Exception ex) {
+                    logger.error("Failed to convert WARC record to Slop", ex);
+                }
+            }
+        }
+        catch (Exception ex) {
+            logger.error("Failed to convert WARC file to Slop", ex);
+        }
+
+        try {
+            SlopTablePacker.packToSlopZip(tempDir, slopOutputFile);
+            FileUtils.deleteDirectory(tempDir.toFile());
+        }
+        catch (Exception ex) {
+            logger.error("Failed to pack Slop table after WARC conversion", ex);
+        }
+    }
+
+
+
+    /** Return true if the WarcResponse should be included in the conversion, false if it should be skipped */
+    private static boolean filterResponse(String uaString, WarcResponse response) throws IOException {
+
+        // We don't want to store robots.txt files, as they are not
+        // interesting for the analysis we want to do.  This is important
+        // since txt-files in general are interesting, and we don't want to
+        // exclude them as a class.
+
+        if (response.targetURI().getPath().equals("/robots.txt")) {
+            return false;
+        }
+
+        var headers = response.http().headers();
+        var robotsTags = headers.all("X-Robots-Tag");
+
+        if (!isXRobotsTagsPermitted(robotsTags, uaString)) {
+            return false;
+        }
+
+        // Strip out responses with content types we aren't interested in
+        // (though ideally we wouldn't download these at all)
+        String contentType = headers.first("Content-Type").orElse("text/plain").toLowerCase();
+
+        if (!ContentTypes.isAccepted(contentType)) {
+            return false;
+        }
+
+        // If the format is binary, we don't want to translate it if the response is truncated
+        if (response.truncated() != WarcTruncationReason.NOT_TRUNCATED && ContentTypes.isBinary(contentType)) {
+            return false;
+        }
+
+        return true;
+    }
+
+    /** Check the X-Robots-Tag header values to see if we are allowed to index this page.
+     * <p>
+     * Reference: <a href="https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag">https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag</a>
+     *
+     * @param xRobotsHeaderTags List of X-Robots-Tag values
+     * @param userAgent User agent string
+     * @return true if we are allowed to index this page
+     */
+    // Visible for tests
+    public static boolean isXRobotsTagsPermitted(List<String> xRobotsHeaderTags, String userAgent) {
+        boolean isPermittedGeneral = true;
+        boolean isPermittedMarginalia = false;
+        boolean isForbiddenMarginalia = false;
+
+        for (String header : xRobotsHeaderTags) {
+            if (header.indexOf(':') >= 0) {
+                String[] parts = StringUtils.split(header, ":", 2);
+
+                if (parts.length < 2)
+                    continue;
+
+                // Is this relevant to us?
+                if (!Objects.equals(parts[0].trim(), userAgent))
+                    continue;
+
+                if (parts[1].contains("noindex"))
+                    isForbiddenMarginalia = true;
+                else if (parts[1].contains("none"))
+                    isForbiddenMarginalia = true;
+                else if (parts[1].contains("all"))
+                    isPermittedMarginalia = true;
+            }
+            else {
+                if (header.contains("noindex"))
+                    isPermittedGeneral = false;
+                if (header.contains("none"))
+                    isPermittedGeneral = false;
+            }
+        }
+
+        if (isPermittedMarginalia)
+            return true;
+        if (isForbiddenMarginalia)
+            return false;
+        return isPermittedGeneral;
+    }
+
+    public static int countGoodStatusCodes(Path path) throws IOException {
+        int cnt = 0;
+
+        try (var table = new SlopTable(path)) {
+            ShortColumn.Reader statusReader = statusColumn.open(table);
+            while (statusReader.hasRemaining()) {
+                if (statusReader.get() == 200) {
+                    cnt++;
+                }
+            }
+        }
+
+        return cnt;
+    }
+
+    public static class Writer extends SlopTable {
+        private final EnumColumn.Writer domainColumnWriter;
+        private final StringColumn.Writer urlColumnWriter;
+        private final StringColumn.Writer ipColumnWriter;
+        private final ByteColumn.Writer cookiesColumnWriter;
+        private final ShortColumn.Writer statusColumnWriter;
+        private final LongColumn.Writer timestampColumnWriter;
+        private final EnumColumn.Writer contentTypeColumnWriter;
+        private final ByteArrayColumn.Writer bodyColumnWriter;
+        private final StringColumn.Writer headerColumnWriter;
+
+        public Writer(Path path) throws IOException {
+            super(path);
+
+            domainColumnWriter = domainColumn.create(this);
+            urlColumnWriter = urlColumn.create(this);
+            ipColumnWriter = ipColumn.create(this);
+            cookiesColumnWriter = cookiesColumn.create(this);
+            statusColumnWriter = statusColumn.create(this);
+            timestampColumnWriter = timestampColumn.create(this);
+            contentTypeColumnWriter = contentTypeColumn.create(this);
+            bodyColumnWriter = bodyColumn.create(this);
+            headerColumnWriter = headerColumn.create(this);
+        }
+
+        public void write(SlopCrawlDataRecord record) throws IOException {
+            domainColumnWriter.put(record.domain);
+            urlColumnWriter.put(record.url);
+            ipColumnWriter.put(record.ip);
+            cookiesColumnWriter.put(record.cookies ? (byte) 1 : (byte) 0);
+            statusColumnWriter.put((short) record.httpStatus);
+            timestampColumnWriter.put(record.timestamp);
+            contentTypeColumnWriter.put(record.contentType);
+            bodyColumnWriter.put(record.body);
+            headerColumnWriter.put(record.headers);
+        }
+
+        public void write(String domain, WarcResponse response) throws IOException {
+
+            HttpFetchResult result = HttpFetchResult.importWarc(response);
+            if (!(result instanceof HttpFetchResult.ResultOk fetchOk)) {
+                return;
+            }
+
+            byte[] bodyBytes;
+            String contentType;
+
+            var body = DocumentBodyExtractor.asBytes(result);
+
+            var headers = fetchOk.headers();
+
+            if (body instanceof DocumentBodyResult.Ok<byte[]> bodyOk) {
+                bodyBytes = bodyOk.body();
+                contentType = bodyOk.contentType().toString();
+            }
+            else {
+                bodyBytes = new byte[0];
+                contentType = "";
+            }
+
+            boolean hasCookies = false;
+
+            String headersStr;
+            StringJoiner headersStrBuilder = new StringJoiner("\n");
+            for (var header : headers) {
+                if (header.getName().equalsIgnoreCase("X-Cookies") && "1".equals(header.getValue())) {
+                    hasCookies = true;
+                }
+                headersStrBuilder.add(header.getName() + ": " + header.getValue());
+            }
+            headersStr = headersStrBuilder.toString();
+
+
+            write(new SlopCrawlDataRecord(
+                    domain,
+                    response.target(),
+                    fetchOk.ipAddress(),
+                    hasCookies,
+                    fetchOk.statusCode(),
+                    response.date().toEpochMilli(),
+                    contentType,
+                    bodyBytes,
+                    headersStr
+                )
+            );
+        }
+
+        private void write(String domain, WarcXEntityRefused refused) throws IOException {
+            URI profile = refused.profile();
+
+            String meta;
+            if (profile.equals(WarcXEntityRefused.documentRobotsTxtSkippedURN)) {
+                meta = "x-marginalia/advisory;state=robots-txt-skipped";
+            }
+            else if (profile.equals(WarcXEntityRefused.documentBadContentTypeURN)) {
+                meta = "x-marginalia/advisory;state=content-type-failed-probe";
+            }
+            else if (profile.equals(WarcXEntityRefused.documentProbeTimeout)) {
+                meta = "x-marginalia/advisory;state=timeout-probe";
+            }
+            else if (profile.equals(WarcXEntityRefused.documentUnspecifiedError)) {
+                meta = "x-marginalia/advisory;state=doc-error";
+            }
+            else {
+                meta = "x-marginalia/advisory;state=unknown";
+            }
+
+            write(forDocError(domain, refused.date(), refused.target(), meta));
+        }
+
+        private void write(Warcinfo warcinfo) throws IOException {
+            String selfDomain = warcinfo.fields().first("domain").orElse("");
+            String ip = warcinfo.fields().first("ip").orElse("");
+            String probeStatus = warcinfo.fields().first("X-WARC-Probe-Status").orElse("");
+
+            if (probeStatus.startsWith("REDIRECT;")) {
+                String redirectDomain = probeStatus.substring("REDIRECT;".length());
+                write(forDomainRedirect(selfDomain, warcinfo.date(), redirectDomain));
+            }
+            else if (!"OK".equals(probeStatus)) {
+                write(forDomainError(selfDomain, warcinfo.date(), ip, probeStatus));
+            }
+        }
+    }
+
+    public static class Reader extends SlopTable {
+        private final EnumColumn.Reader domainColumnReader;
+        private final StringColumn.Reader urlColumnReader;
+        private final StringColumn.Reader ipColumnReader;
+        private final ByteColumn.Reader cookiesColumnReader;
+        private final ShortColumn.Reader statusColumnReader;
+        private final LongColumn.Reader timestampColumnReader;
+        private final EnumColumn.Reader contentTypeColumnReader;
+        private final ByteArrayColumn.Reader bodyColumnReader;
+        private final StringColumn.Reader headerColumnReader;
+
+        public Reader(Path path) throws IOException {
+            super(path);
+
+            domainColumnReader = domainColumn.open(this);
+            urlColumnReader = urlColumn.open(this);
+            ipColumnReader = ipColumn.open(this);
+            cookiesColumnReader = cookiesColumn.open(this);
+            statusColumnReader = statusColumn.open(this);
+            timestampColumnReader = timestampColumn.open(this);
+            contentTypeColumnReader = contentTypeColumn.open(this);
+            bodyColumnReader = bodyColumn.open(this);
+            headerColumnReader = headerColumn.open(this);
+        }
+
+        public SlopCrawlDataRecord get() throws IOException {
+            return new SlopCrawlDataRecord(
+                    domainColumnReader.get(),
+                    urlColumnReader.get(),
+                    ipColumnReader.get(),
+                    cookiesColumnReader.get() == 1,
+                    statusColumnReader.get(),
+                    timestampColumnReader.get(),
+                    contentTypeColumnReader.get(),
+                    bodyColumnReader.get(),
+                    headerColumnReader.get()
+            );
+        }
+
+        public boolean hasRemaining() throws IOException {
+            return domainColumnReader.hasRemaining();
+        }
+    }
+
+
+    public abstract static class FilteringReader extends SlopTable {
+        private final EnumColumn.Reader domainColumnReader;
+        private final StringColumn.Reader urlColumnReader;
+        private final StringColumn.Reader ipColumnReader;
+        private final ByteColumn.Reader cookiesColumnReader;
+        private final ShortColumn.Reader statusColumnReader;
+        private final LongColumn.Reader timestampColumnReader;
+        private final EnumColumn.Reader contentTypeColumnReader;
+        private final ByteArrayColumn.Reader bodyColumnReader;
+        private final StringColumn.Reader headerColumnReader;
+
+        private SlopCrawlDataRecord next = null;
+
+        public FilteringReader(Path path) throws IOException {
+            super(path);
+
+            domainColumnReader = domainColumn.open(this);
+            urlColumnReader = urlColumn.open(this);
+            ipColumnReader = ipColumn.open(this);
+            cookiesColumnReader = cookiesColumn.open(this);
+            statusColumnReader = statusColumn.open(this);
+            timestampColumnReader = timestampColumn.open(this);
+            contentTypeColumnReader = contentTypeColumn.open(this);
+            bodyColumnReader = bodyColumn.open(this);
+            headerColumnReader = headerColumn.open(this);
+        }
+
+        public abstract boolean filter(String url, int status, String contentType);
+
+        public SlopCrawlDataRecord get() throws IOException {
+            if (next == null) {
+                if (!hasRemaining()) {
+                    throw new IllegalStateException("No more values remaining");
+                }
+            }
+            var val = next;
+            next = null;
+            return val;
+        }
+
+        public boolean hasRemaining() throws IOException {
+            if (next != null)
+                return true;
+
+            while (domainColumnReader.hasRemaining()) {
+                String domain = domainColumnReader.get();
+                String url = urlColumnReader.get();
+                String ip = ipColumnReader.get();
+                boolean cookies = cookiesColumnReader.get() == 1;
+                int status = statusColumnReader.get();
+                long timestamp = timestampColumnReader.get();
+                String contentType = contentTypeColumnReader.get();
+
+                LargeItem<byte[]> body = bodyColumnReader.getLarge();
+                LargeItem<String> headers = headerColumnReader.getLarge();
+
+                if (filter(url, status, contentType)) {
+                    next = new SlopCrawlDataRecord(
+                            domain, url, ip, cookies, status, timestamp, contentType, body.get(), headers.get()
+                    );
+                    return true;
+                }
+                else {
+                    body.close();
+                    headers.close();
+                }
+            }
+
+            return false;
+        }
+    }
+}
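Review note: the precedence applied by `isXRobotsTagsPermitted` in the file above can be restated as a small standalone sketch. This is not the code from the change. It is a simplified re-statement that uses only the JDK (`indexOf`/`substring` in place of `StringUtils.split`), and the class name `XRobotsTagSketch` and the user-agent value are hypothetical, chosen only for illustration.

```java
import java.util.List;

public class XRobotsTagSketch {

    /** Simplified re-statement of the precedence rule; an illustration, not the project's code. */
    public static boolean isPermitted(List<String> headerValues, String userAgent) {
        boolean permittedGeneral = true;
        boolean permittedForAgent = false;
        boolean forbiddenForAgent = false;

        for (String value : headerValues) {
            int colon = value.indexOf(':');
            if (colon >= 0) {
                // "agent: directive" form; ignore directives addressed to other agents
                String agent = value.substring(0, colon).trim();
                String directive = value.substring(colon + 1);
                if (!agent.equals(userAgent))
                    continue;

                if (directive.contains("noindex") || directive.contains("none"))
                    forbiddenForAgent = true;
                else if (directive.contains("all"))
                    permittedForAgent = true;
            }
            else if (value.contains("noindex") || value.contains("none")) {
                // Directive without an agent prefix applies to every crawler
                permittedGeneral = false;
            }
        }

        // Agent-specific directives win over the general ones
        if (permittedForAgent) return true;
        if (forbiddenForAgent) return false;
        return permittedGeneral;
    }

    public static void main(String[] args) {
        String ua = "search.marginalia.nu"; // hypothetical user-agent token

        System.out.println(isPermitted(List.of(), ua));                        // true
        System.out.println(isPermitted(List.of("noindex"), ua));               // false
        System.out.println(isPermitted(List.of("noindex", ua + ": all"), ua)); // true
        System.out.println(isPermitted(List.of(ua + ": noindex"), ua));        // false
        System.out.println(isPermitted(List.of("otherbot: noindex"), ua));     // true
    }
}
```

The third case is the one worth noting: a general `noindex` is overridden when the same response also carries an `all` directive addressed to our own agent.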
--- a/code/processes/crawling-process/model/java/org/netpreserve/jwarc/WarcXCookieInformationHeader.java
+++ b/code/processes/crawling-process/model/java/org/netpreserve/jwarc/WarcXCookieInformationHeader.java
@@ -1,35 +0,0 @@
-package org.netpreserve.jwarc;
-
-import okhttp3.HttpUrl;
-import okhttp3.OkHttpClient;
-
-/** Encapsulates out-of-band information about whether a website uses cookies,
- * using a non-standard WARC header "X-Has-Cookies".
- */
-public class WarcXCookieInformationHeader {
-    private boolean hasCookies = false;
-    private static final String headerName = "X-Has-Cookies";
-
-    public void update(OkHttpClient client, HttpUrl url) {
-        if (!hasCookies) {
-            hasCookies = !client.cookieJar().loadForRequest(url).isEmpty();
-        }
-    }
-
-    public boolean hasCookies() {
-        return hasCookies;
-    }
-
-    public void paint(WarcResponse.Builder builder) {
-        builder.addHeader(headerName, hasCookies ? "1" : "0");
-    }
-    public void paint(WarcXResponseReference.Builder builder) {
-        builder.addHeader(headerName, hasCookies ? "1" : "0");
-    }
-
-    public static boolean hasCookies(WarcRecord record) {
-        return record.headers().contains(headerName, "1");
-    }
-
-
-}
--- a/code/processes/crawling-process/model/test/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileWriterTest.java
+++ b/code/processes/crawling-process/model/test/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileWriterTest.java
@@ -80,7 +80,7 @@ class CrawledDocumentParquetRecordFileWriterTest {
        var document = (CrawledDocument) secondItem;
        assertEquals("https://www.marginalia.nu/", document.url);
        assertEquals("text/html", document.contentType);
-        assertEquals("hello world", document.documentBody);
+        assertEquals("hello world", document.documentBody());
        assertEquals(200, document.httpStatus);
    }

@@ -103,7 +103,7 @@ class CrawledDocumentParquetRecordFileWriterTest {
                    System.out.println(doc.url);
                    System.out.println(doc.contentType);
                    System.out.println(doc.httpStatus);
-                    System.out.println(doc.documentBody.length());
+                    System.out.println(doc.documentBody().length());
                }
            }
        } catch (IOException e) {
--- a/code/processes/crawling-process/test/nu/marginalia/crawl/DomainStateDbTest.java
+++ b/code/processes/crawling-process/test/nu/marginalia/crawl/DomainStateDbTest.java
@@ -8,9 +8,10 @@ import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.sql.SQLException;
+import java.time.Duration;
 import java.time.Instant;

-import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.*;

 class DomainStateDbTest {

@@ -26,7 +27,7 @@ class DomainStateDbTest {
    }

    @Test
-    public void testSunnyDay() throws SQLException {
+    public void testSummaryRecord() throws SQLException {
        try (var db = new DomainStateDb(tempFile)) {
            var allFields = new DomainStateDb.SummaryRecord(
                    "all.marginalia.nu",
@@ -47,8 +48,8 @@ class DomainStateDbTest {
            db.save(allFields);
            db.save(minFields);

-            assertEquals(allFields, db.get("all.marginalia.nu").orElseThrow());
-            assertEquals(minFields, db.get("min.marginalia.nu").orElseThrow());
+            assertEquals(allFields, db.getSummary("all.marginalia.nu").orElseThrow());
+            assertEquals(minFields, db.getSummary("min.marginalia.nu").orElseThrow());

            var updatedAllFields = new DomainStateDb.SummaryRecord(
                    "all.marginalia.nu",
@@ -59,7 +60,36 @@ class DomainStateDbTest {
            );

            db.save(updatedAllFields);
-            assertEquals(updatedAllFields, db.get("all.marginalia.nu").orElseThrow());
+            assertEquals(updatedAllFields, db.getSummary("all.marginalia.nu").orElseThrow());
+        }
+    }
+
+    @Test
+    public void testMetadata() throws SQLException {
+        try (var db = new DomainStateDb(tempFile)) {
+            var original = new DomainStateDb.CrawlMeta("example.com", Instant.ofEpochMilli(12345), Duration.ofMillis(30), Duration.ofMillis(300), 1, 2, 3);
+            db.save(original);
+
+            var maybeMeta = db.getMeta("example.com");
+            assertTrue(maybeMeta.isPresent());
+            assertEquals(original, maybeMeta.get());
+        }
+    }
+
+    @Test
+    public void testFavicon() throws SQLException {
+        try (var db = new DomainStateDb(tempFile)) {
+            db.saveIcon("www.marginalia.nu", new DomainStateDb.FaviconRecord("text/plain", "hello world".getBytes()));
+
+            var maybeData = db.getIcon("www.marginalia.nu");
+            assertTrue(maybeData.isPresent());
+            var actualData = maybeData.get();
+
+            assertEquals("text/plain", actualData.contentType());
+            assertArrayEquals("hello world".getBytes(), actualData.imageData());
+
+            maybeData = db.getIcon("foobar");
+            assertTrue(maybeData.isEmpty());
        }
    }

--- a/code/processes/crawling-process/test/nu/marginalia/crawl/fetcher/HttpFetcherImplContentTypeProbeTest.java
+++ b/code/processes/crawling-process/test/nu/marginalia/crawl/fetcher/HttpFetcherImplContentTypeProbeTest.java
@@ -0,0 +1,146 @@
+package nu.marginalia.crawl.fetcher;
+
+import com.github.tomakehurst.wiremock.WireMockServer;
+import com.github.tomakehurst.wiremock.client.WireMock;
+import com.github.tomakehurst.wiremock.core.WireMockConfiguration;
+import nu.marginalia.UserAgent;
+import nu.marginalia.crawl.retreival.CrawlDelayTimer;
+import nu.marginalia.model.EdgeUrl;
+import org.junit.jupiter.api.*;
+
+import java.io.IOException;
+import java.net.URISyntaxException;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+@Tag("slow")
+class HttpFetcherImplContentTypeProbeTest {
+
+    private HttpFetcherImpl fetcher;
+    private static WireMockServer wireMockServer;
+
+    private static EdgeUrl timeoutUrl;
+    private static EdgeUrl contentTypeHtmlUrl;
+    private static EdgeUrl contentTypeBinaryUrl;
+    private static EdgeUrl redirectUrl;
+    private static EdgeUrl badHttpStatusUrl;
+    private static EdgeUrl onlyGetAllowedUrl;
+
+    @BeforeAll
+    public static void setupAll() throws URISyntaxException {
+        wireMockServer =
+                new WireMockServer(WireMockConfiguration.wireMockConfig()
+                        .port(18089));
+
+        timeoutUrl = new EdgeUrl("http://localhost:18089/timeout.bin");
+
+        wireMockServer.stubFor(WireMock.head(WireMock.urlEqualTo(timeoutUrl.path))
+                .willReturn(WireMock.aResponse()
+                        .withFixedDelay(15000))); // 15 second delay to simulate a timeout
+
+        contentTypeHtmlUrl = new EdgeUrl("http://localhost:18089/test.html.bin");
+        wireMockServer.stubFor(WireMock.head(WireMock.urlEqualTo(contentTypeHtmlUrl.path))
+                .willReturn(WireMock.aResponse()
+                        .withHeader("Content-Type", "text/html")
+                        .withStatus(200)));
+
+        contentTypeBinaryUrl = new EdgeUrl("http://localhost:18089/test.bad.bin");
+        wireMockServer.stubFor(WireMock.head(WireMock.urlEqualTo(contentTypeBinaryUrl.path))
+                .willReturn(WireMock.aResponse()
+                        .withHeader("Content-Type", "application/octet-stream")
+                        .withStatus(200)));
+
+        redirectUrl = new EdgeUrl("http://localhost:18089/redirect.bin");
+        wireMockServer.stubFor(WireMock.head(WireMock.urlEqualTo(redirectUrl.path))
+                .willReturn(WireMock.aResponse()
+                        .withHeader("Location", "http://localhost:18089/test.html.bin")
+                        .withStatus(301)));
+
+        badHttpStatusUrl = new EdgeUrl("http://localhost:18089/badstatus.bin");
+        wireMockServer.stubFor(WireMock.head(WireMock.urlEqualTo(badHttpStatusUrl.path))
+                .willReturn(WireMock.aResponse()
+                        .withHeader("Content-Type", "text/html")
+                        .withStatus(500)));
+
+        onlyGetAllowedUrl = new EdgeUrl("http://localhost:18089/onlyget.bin");
+        wireMockServer.stubFor(WireMock.head(WireMock.urlEqualTo(onlyGetAllowedUrl.path))
+                .willReturn(WireMock.aResponse()
+                        .withStatus(405))); // Method Not Allowed
+        wireMockServer.stubFor(WireMock.get(WireMock.urlEqualTo(onlyGetAllowedUrl.path))
+                .willReturn(WireMock.aResponse()
+                        .withHeader("Content-Type", "text/html")
+                        .withStatus(200)));
+
+        wireMockServer.start();
+
+    }
+
+    @AfterAll
+    public static void tearDownAll() {
+        wireMockServer.stop();
+    }
+
+    @BeforeEach
+    public void setUp() {
+        fetcher = new HttpFetcherImpl(new UserAgent("test.marginalia.nu", "test.marginalia.nu"));
+    }
+
+    @AfterEach
+    public void tearDown() throws IOException {
+        var stats = fetcher.getPoolStats();
+        assertEquals(0, stats.getLeased());
+        assertEquals(0, stats.getPending());
+
+        fetcher.close();
+    }
+
+    @Test
+    public void testProbeContentTypeHtmlShortcircuitPath() throws URISyntaxException {
+        var result = fetcher.probeContentType(new EdgeUrl("https://localhost/test.html"), new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
+        Assertions.assertInstanceOf(HttpFetcher.ContentTypeProbeResult.NoOp.class, result);
+    }
+
+
+    @Test
+    public void testProbeContentTypeHtmlShortcircuitTags() {
+        var result = fetcher.probeContentType(contentTypeBinaryUrl, new DomainCookies(), new CrawlDelayTimer(50), new ContentTags("a", "b"));
+        Assertions.assertInstanceOf(HttpFetcher.ContentTypeProbeResult.NoOp.class, result);
+    }
+
+    @Test
+    public void testProbeContentTypeHtml() {
+        var result = fetcher.probeContentType(contentTypeHtmlUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
+        Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.Ok(contentTypeHtmlUrl), result);
+    }
+
+    @Test
+    public void testProbeContentTypeBinary() {
+        var result = fetcher.probeContentType(contentTypeBinaryUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
+        Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.BadContentType("application/octet-stream", 200), result);
+    }
+
+    @Test
+    public void testProbeContentTypeRedirect() {
+        var result = fetcher.probeContentType(redirectUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
+        Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.Redirect(contentTypeHtmlUrl), result);
+    }
+
+    @Test
+    public void testProbeContentTypeBadHttpStatus() {
+        var result = fetcher.probeContentType(badHttpStatusUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
+        Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.HttpError(500, "Bad status code"), result);
+    }
+
+    @Test
+    public void testOnlyGetAllowed() {
+        var result = fetcher.probeContentType(onlyGetAllowedUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
+        Assertions.assertEquals(new HttpFetcher.ContentTypeProbeResult.Ok(onlyGetAllowedUrl), result);
+    }
+
+    @Test
+    public void testTimeout() {
+        var result = fetcher.probeContentType(timeoutUrl, new DomainCookies(), new CrawlDelayTimer(50), ContentTags.empty());
+        Assertions.assertInstanceOf(HttpFetcher.ContentTypeProbeResult.Timeout.class, result);
+    }
+
+}
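Review note: the outcomes asserted by the tests above suggest a mapping from the probe response to a result type. The sketch below shows one way such a mapping could look, assuming the decision depends only on the status code, the `Content-Type` header and the `Location` header. Everything in it is hypothetical: the class `ProbeClassifierSketch`, its nested result records, and especially the content-type allow-list, which stands in for whatever the crawler actually accepts. The HEAD-to-GET fallback, the timeout case and the short-circuit (`NoOp`) cases need real I/O or extra inputs and are left out.

```java
public class ProbeClassifierSketch {

    /** Hypothetical result types, loosely modelled on the names used in the tests. */
    public sealed interface ProbeResult {
        record Ok(String url) implements ProbeResult {}
        record Redirect(String location) implements ProbeResult {}
        record BadContentType(String contentType, int status) implements ProbeResult {}
        record HttpError(int status, String message) implements ProbeResult {}
    }

    /** Assumed allow-list for illustration only; not the crawler's real rule. */
    static boolean isAcceptedContentType(String contentType) {
        String ct = contentType.toLowerCase();
        return ct.startsWith("text/") || ct.startsWith("application/xhtml");
    }

    /** Map a probe response to a result; contentType and location may be null. */
    public static ProbeResult classify(String url, int status, String contentType, String location) {
        // A 3xx with a Location header is reported as a redirect
        if (status >= 300 && status < 400 && location != null)
            return new ProbeResult.Redirect(location);

        // Anything else outside 2xx is treated as an error
        if (status < 200 || status >= 300)
            return new ProbeResult.HttpError(status, "Bad status code");

        // A 2xx with a content type we do not want is rejected
        if (contentType != null && !isAcceptedContentType(contentType))
            return new ProbeResult.BadContentType(contentType, status);

        return new ProbeResult.Ok(url);
    }

    public static void main(String[] args) {
        String url = "http://localhost:18089/test.html.bin";

        System.out.println(classify(url, 200, "text/html", null));
        System.out.println(classify(url, 200, "application/octet-stream", null));
        System.out.println(classify(url, 301, null, "http://localhost:18089/other"));
        System.out.println(classify(url, 500, "text/html", null));
    }
}
```

Keeping the classification free of I/O means it can be checked without a mock server, while the WireMock-based tests above remain responsible for covering the network behaviour.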