It's long surpassed the single-responsibility library it once was, and is as such out of place in its original location, and fits better among the function-type modules.
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.
While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.
Some refactoring is still needed, but an dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.
The second most common error is forgetting to close one of the columns in a reader or writer.
To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
Refactoring keyword extraction to extract spans information.
Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.
A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
The last remaining REST service, the assistant-service, has been migrated to gRPC.
This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels.
Since it's no longer used by anything, RxJava has been removed as a dependency from the project.
Although the current state seems reasonably stable, this is a work-in-progress commit.
The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild.
Adding an **experimental** system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior.
IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.
Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base was lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically.
The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.
Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentations to reflect the changes.
Several tests were manually running migrations in a large copy-paste blob of code. This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.
A new test helper library is introduced with a TestMigrationLoader that can both run Flyway migrations, or load specific migrations in the cases a specific set of migrations need to be loaded. Existing tests are migrated to use the new code.