It's long surpassed the single-responsibility library it once was, and is as such out of place in its original location, and fits better among the function-type modules.
The primary motivation for this is that in production, the large number of partitioned services has lead to an intermittent exhaustion of available database connections, as each service has a connection pool.
The decision to have a separate executor service dates back from when the index service was very slow to start, and the executor didn't always spin off its memory-hungry tasks into separate processes, which meant the executor would sometimes OOM and crash, and it was undesirable to bring the index down with it.
The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter.
The change also adds a parser for DDG's tracker radar data.
This is a working "work in progress" commit, will need more refinement, but given the usual difficulties in testing crawler-adjacent code without actually crawling, it needs some maturation time in production.
The original approach of injecting javascript into the page directly didn't work with pages that reloaded themselves. To work around this, a chrome extension is used instead that does the same work, but subscribes to reload events and re-installs the change listener.
v0.0.11 uses atomic moves. This ensures we don't encounter a race condition in the backup service with lingering .tmp-files that should have been renamed.
PDFBox 2.x uses commons logging, which does not route through SLF4j, and thus is a hassle to configure; and is extremely verbose in its default logging settings.
Migrating to PDFBox 3.x lets us use slf4j to address the log spam by filtering out the noisy methods.
The previously used Java HttpClient seems unsuitable for crawler usage,
that lead to issues like send()-operations sometimes hanging forever,
with clunky workarounds such as running each send operation in a separate
Future that can be cancelled on a timeout.
It has too many assumptions that break, and fails to adequately expose
the inner workings of the connection pool to a degree that makes it possible
to configure in a satisfactory manner.
Apache's HttpClient solves all these problems.
The change also includes a new battery of tests for the HttpFetcher,
and refactors the retriever class a bit to move stuff into the HttpFetcher,
leading to a better separation of concerns.
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.
While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.
Some refactoring is still needed, but an dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.