Use pointer:fine media query to better distinguish between mobile devices and PCs with a window in portrait orientation.
With this, we never show the mobile filtering functionality on desktop, and never show the touch-inaccessible minimized sidebar on mobile.
A bug was introduced at some point where the special keyword for filtering on javascript was changed from js:true/js:false to special:scripts.
Solves issue #155
The warc->slop converter was rejecting some items because they had headers that were representable in the Warc code's MessageHeader map implementation, but illegal in the HttpHeaders' implementation.
Fixing this by manually filtering these out. Ostensibly the constructor has a filtering predicate, but this annoyingly runs too late and fails to prevent the problem.
The crawler was incorrectly using the request URL as the base URL when resolving relative links. This caused problems when encountering redirects.
For example if we fetch /log, redirecting to /log/ and find links to foo/, and bar/; these would resolve to /foo and /bar, and not /log/foo and /log/bar.
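A minimal sketch of the difference, using plain java.net.URI rather than the crawler's own URL types:
```
import java.net.URI;

class RedirectBaseUrlDemo {
    public static void main(String[] args) {
        URI requestUrl   = URI.create("https://www.example.com/log");
        URI redirectedTo = URI.create("https://www.example.com/log/");

        // Resolving against the request URL drops the final path segment...
        System.out.println(requestUrl.resolve("foo/"));   // https://www.example.com/foo/

        // ...while resolving against the redirect target gives the intended result.
        System.out.println(redirectedTo.resolve("foo/")); // https://www.example.com/log/foo/
    }
}
```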
Nepenthes has been doing the rounds on social media; add an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly. Out-of-the-box crawl limits will also deal with this type of attack, but this fix is faster.
This should have the rpc stream reception be performed in parallel in separate threads, rather than blocking sequentially in the main thread, hopefully giving a slight performance boost.
This solves some issues with the rssreader based parser, which was very picky about the XML being valid. Jsoup is much more lenient when parsing malformed XML.
This is primarily to make the code a bit easier to reason about, and will reduce the level of indirection and data copying in the search-service->query-service->index-service communication chain.
Previously a special db table was used to hold domains slated for crawling, but this is deprecated. Instead, each domain now has a node_affinity flag that decides its indexing state: a value of -1 indicates it shouldn't be crawled, a value of 0 means it's slated for crawling by the next index partition to be crawled, and a positive value means it's assigned to that index partition.
The change set also adds a test case validating the modified behavior.
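A rough sketch of how the flag reads (purely illustrative, not the actual schema accessors):
```
/** Illustrative interpretation of the node_affinity values described above. */
class NodeAffinity {
    static String describe(int nodeAffinity) {
        if (nodeAffinity < 0) {
            return "not slated for crawling";
        } else if (nodeAffinity == 0) {
            return "slated for crawling by the next index partition that crawls";
        } else {
            return "assigned to index partition " + nodeAffinity;
        }
    }
}
```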
Quoted search queries that contained keywords with possessive 's endings were not returning any results, as the index does not retain that suffix, and the query parser was not stripping it away in this code path.
This solves issue #143.
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.
This should make most services start faster as a result.
Also correct the DbDomainQueries.getDomainId so that it throws NoSuchElementException when domain id is missing, and not UncheckedExecutionException via Cache.
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items that are too large, while the big crawler will truncate at that point.
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests. On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.
A new table was added to the FeedDb to hold one etag per domain.
If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.
This completes the changes for Issue #136.
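A simplified sketch of the conditional request logic using java.net.http; the class and parameter names here are stand-ins, not the feed fetcher's actual API:
```
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;

class ConditionalFeedFetchSketch {
    static HttpRequest buildRequest(URI feedUrl,
                                    Optional<String> storedEtag,
                                    ZonedDateTime feedDbCreationTime) {
        var builder = HttpRequest.newBuilder(feedUrl).GET();

        // Send back the etag from the last successful fetch, if we have one.
        storedEtag.ifPresent(etag -> builder.header("If-None-Match", etag));

        // The feed db creation date is the earliest update we could have received.
        builder.header("If-Modified-Since",
                feedDbCreationTime.format(DateTimeFormatter.RFC_1123_DATE_TIME));

        return builder.build();
    }

    /** A well-configured server answers 304 Not Modified, letting us skip the body entirely. */
    static boolean isNotModified(HttpResponse<?> response) {
        return response.statusCode() == 304;
    }
}
```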
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.
The change addresses issue #136, save for making the fetcher's requests conditional.
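The decoding side roughly boils down to the following, assuming the raw response body is available as an InputStream (names are illustrative):
```
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

class GzipResponseSketch {
    /** Only wrap the body in a GZIPInputStream when the server actually compressed it. */
    static InputStream maybeDecompress(InputStream body, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(body);
        }
        return body;
    }
}
```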
Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.
Solves issue #135
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
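A sketch of the preference order described above; the record and field names are made up for illustration, and a stable sort keeps the original presentation order for results that compare equal:
```
import java.util.Comparator;
import java.util.List;

class DedupPreferenceSketch {
    record Result(String scheme, String host) {}

    // Lower is better: HTTPS before HTTP, and named domains before raw IPs.
    static final Comparator<Result> PREFERENCE = Comparator
            .comparingInt((Result r) -> "https".equals(r.scheme()) ? 0 : 1)
            .thenComparingInt(r -> r.host().matches("\\d+(\\.\\d+){3}") ? 1 : 0);

    public static void main(String[] args) {
        var results = List.of(
                new Result("http", "example.com"),
                new Result("https", "example.com"),
                new Result("https", "93.184.216.34"));

        // Stream.sorted is stable for ordered sources, so ties keep their original order.
        results.stream().sorted(PREFERENCE).forEach(System.out::println);
    }
}
```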
Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.
Adding a tracking message to the export actor means it's possible to run them in a precession.
Adding a new precession actor, and some GUI components for triggering exports.
The change also adds a heartbeat to the export process.
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.
Fixes issue #109.
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are, on a dice roll, added to a bad URLs table that prevents further attempts at fetching them.
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.
While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, and some effort was also put toward making the process boot sequence a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.
Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
Some refactoring is still needed, but a dummy actor is in place, along with a process that crawls URLs from the livecapture service's RSS endpoints; the data makes it all the way to being indexable.
Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI.
Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
We can only do this for files that are not required for unit tests.
As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions. The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled.
Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs.
This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
Ensure all data is written to writeChannel by looping until the buffer is fully drained. This prevents potential data loss during the close operation and maintains data integrity.
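The fix boils down to the standard NIO idiom of looping until the buffer is drained, roughly like this (names are illustrative):
```
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

class DrainBufferSketch {
    /** A single write() call may consume only part of the buffer; loop until nothing remains. */
    static void writeFully(WritableByteChannel writeChannel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            writeChannel.write(buffer);
        }
    }
}
```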
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and addresses edge cases where an empty list is encountered.
Adjust the write offset calculation by adding the position of the write buffer. Updated the test to validate the transformation process and ensure correctness of output file positions.
Add a new actor that polls a URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job.
The URLs are specified in data/scrape-urls.txt. If this file is absent, the actor shuts down.
Return an empty byte array when screenshot fetch fails, ensuring downstream processes are not impacted by null responses. Additionally, only attempt to upload the screenshot if the byte array is non-empty, preventing invalid data from being stored.
Changed the scheduling function to use a single schedule call instead of a fixed delay for the init task. The updateScreenshotInfo method was also moved and slightly refactored for clearer readability and consistency.
Always request the main site screenshot to ensure staleness checks and necessary updates. Limit additional screenshot requests for similar and linking domains to avoid overloading with a maximum of 5 requests per view.
To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService extends BindableService was added.
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix a bug in retry-after header handling that assumed the timeout was in milliseconds and then clamped it to a lower bound of 500ms, meaning it was almost always handled wrong
We want to index all words in the document; stopword handling is moved to the index, where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search. This is no bueno.
This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends. Integrating this information with the ranking is not performed here.
Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.
The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance. It got worse, so the change is reverted.
Though it's been noted that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so this is changed to be done sequentially.
This will help these queries deal with domains that do not have a subdomain so that they do not drag up subdomains as well, as they are also given the special site:-keyword for their corresponding parent domain.
Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today were typically shown first, followed by the rest of the files with the oldest date first.
This lets the slop library be stand-alone without dependence on coded-sequence.
The change also gets rid of the vestigial seek() method in ColumnReader.
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.
The second most common error is forgetting to close one of the columns in a reader or writer.
To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
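A very rough sketch of the idea; the real SlopTable API differs, this only illustrates the lifecycle tracking and the close-time sync check:
```
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for a slop column reader or writer; only the position counter matters here. */
interface TrackedColumn extends AutoCloseable {
    long position();
    void close();
}

/** Sketch of a table object that owns its columns and verifies they stayed in sync. */
class SlopTableSketch implements AutoCloseable {
    private final List<TrackedColumn> columns = new ArrayList<>();

    <T extends TrackedColumn> T register(T column) {
        columns.add(column);
        return column;
    }

    @Override
    public void close() {
        long expected = columns.isEmpty() ? 0 : columns.get(0).position();
        for (TrackedColumn column : columns) {
            if (column.position() != expected) {
                throw new IllegalStateException("Slop columns are out of sync");
            }
            column.close();
        }
    }
}
```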
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress. Giving these a slightly larger allowance.
Refactoring keyword extraction to extract spans information.
Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
The CompressingStorageReader would incorrectly report having data when a file was empty. Preemptively attempting to fill the backing buffer fixes the behavior.
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers.
The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant amount of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
To let up the pressure on domains with lots of subdomains such as substack, medium, neocities, etc., a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
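Conceptually the mutex is just a per-domain semaphore with a single permit, something along these lines (the crawler's actual bookkeeping is more involved):
```
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Sketch of a per-domain lock limiting how many crawler threads work a domain at once. */
class DomainLockSketch {
    private final ConcurrentHashMap<String, Semaphore> locks = new ConcurrentHashMap<>();

    /** One permit per domain means at most one thread crawls a given domain at a time. */
    Semaphore forDomain(String topDomain) {
        return locks.computeIfAbsent(topDomain.toLowerCase(), k -> new Semaphore(1));
    }

    void crawl(String topDomain, Runnable task) throws InterruptedException {
        Semaphore lock = forDomain(topDomain);
        lock.acquire();
        try {
            task.run();
        } finally {
            lock.release();
        }
    }
}
```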
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.
Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer. This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.
The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.
Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
How'd This Ever Work? (tm)
TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.
The change also cleans out several parameters that no longer filled any function.
It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamma code. This is incorrect in this case, as we're able to provide a negative offset. Changing to using Integer.MIN_VALUE as an indicator that a value is absent instead, as this will never be used.
The priority index documents file can be trivially compressed to a large degree.
Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
The implementation was incorrectly using 1 bit more than it should. The change also adds a put method for Elias delta; and cleans up the interface a bit.
Previously this was the responsibility of the caller, which led to the possibility of passing in improperly prepared buffers and receiving a bad outcome.
Btree index adds overhead and disk space and doesn't fill any function for the prio index.
* Update finalize logic with a new IO transformer that copies the data and prepends a size
* Update the reader to read the new format
* Added a test
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again. Charset.forName(...) can be surprisingly expensive, and its built-in caching strategy, which only caches the last two values seen, doesn't cope well with how we're hitting it with a wide array of random charsets.
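A sketch of such a caching wrapper; the UTF-8 fallback for unknown or malformed names is an assumption here, not necessarily what the actual code does:
```
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.concurrent.ConcurrentHashMap;

class CharsetCacheSketch {
    private static final ConcurrentHashMap<String, Charset> CACHE = new ConcurrentHashMap<>();

    /** Cache Charset.forName() results; fall back to UTF-8 for unknown or malformed names. */
    static Charset lookup(String charsetName) {
        return CACHE.computeIfAbsent(charsetName, name -> {
            try {
                return Charset.forName(name);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                return StandardCharsets.UTF_8;
            }
        });
    }
}
```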
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.
Replacing with a fixed pointer alias that can be repositioned to the relevant data.
The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.
Removed this unnecessary step and move to copying the buffer directly instead.
After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.
The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns. By default, these are put in the log directory.
The change also adds a JVM parameter that makes it shut up about native access.
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
The default C++ language standard on macOS is gnu++98, which won't build
this module.
Full error:
```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
[](const p64x2& fst, const p64x2& snd) {
^
```
The input comes from the config file so this isn't a very realistic threat vector, and even if it wasn't it's a query in an empty duckdb instance; but adding a validation check to provide a better error message.
This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
Add a new rule that creates an alternative path that omits a word if it's a stopword.
In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect.
The reason retain() is so fast is due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already intersected documents, where the data is more sparse.
IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it)
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well; this is an attempt to help it out a bit.
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
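For reference, the standard branchless lower-bound shape looks roughly like this; it is not the library's actual implementation:
```
/** Sketch of a branchless lower-bound over a sorted long[]: index of the first value >= key. */
class BranchlessSearchSketch {
    static int lowerBound(long[] sorted, long key) {
        if (sorted.length == 0) return 0;

        int base = 0;
        int n = sorted.length;
        while (n > 1) {
            int half = n / 2;
            // The ternary typically compiles to a conditional move rather than a branch.
            base = sorted[base + half] < key ? base + half : base;
            n -= half;
        }
        return base + (sorted[base] < key ? 1 : 0);
    }

    public static void main(String[] args) {
        long[] data = {1, 3, 5, 7, 9};
        System.out.println(lowerBound(data, 5));  // 2
        System.out.println(lowerBound(data, 6));  // 3
        System.out.println(lowerBound(data, 10)); // 5
    }
}
```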
This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.
A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor.
The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.
Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.
The change removes the path prefix, and updates the docker templates to reflect the change. This will require a migration for existing systems.
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
The changeset cleans up the query parsing logic in the query service. It gets rid of a lot of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and adds a new cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term.
A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model.
A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This is for the benefit of easily writing rules for generating alternative queries, e.g. using the new segmentation data.
The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine".
This compiled query is passed to the index, which parses the expression, where it is used for execution of the search and ranking of the results.
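A toy expansion of that syntax, ignoring nesting and only handling a flat sequence of parenthesized alternation groups (the real parser is more general):
```
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryExpansionSketch {
    static List<String> expand(String expression) {
        // Collect each "( a | b | c )" group as a list of trimmed alternatives.
        List<List<String>> groups = new ArrayList<>();
        Matcher groupMatcher = Pattern.compile("\\(([^)]*)\\)").matcher(expression);
        while (groupMatcher.find()) {
            List<String> alternatives = new ArrayList<>();
            for (String alt : groupMatcher.group(1).split("\\|")) {
                alternatives.add(alt.trim());
            }
            groups.add(alternatives);
        }

        // Cartesian product of the groups, in order.
        List<String> results = new ArrayList<>(List.of(""));
        for (List<String> group : groups) {
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String alt : group) {
                    next.add(prefix.isEmpty() ? alt : prefix + " " + alt);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        expand("( wiby | marginalia | kagi ) ( search engine | searchengine )")
                .forEach(System.out::println);
    }
}
```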
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.
This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
The sign of the counter is used to indicate whether a term has appeared as title. Until it's seen in the title, it's provisionally saved as a negative count.
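A minimal sketch of the sign convention described above (method names are made up):
```
/** Sketch of the sign trick: counts stay negative until the term has been seen in a title. */
class TitleFlaggedCounter {
    static int increment(int count) {
        // Grow the magnitude, preserving the "not yet seen in a title" sign.
        return count <= 0 ? count - 1 : count + 1;
    }

    static int markSeenInTitle(int count) {
        // Flip a provisional (negative) count to positive once the term appears in the title.
        return count < 0 ? -count : count;
    }

    static int actualCount(int count) {
        return Math.abs(count);
    }

    static boolean seenInTitle(int count) {
        return count > 0;
    }
}
```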
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.
The modified behavior checks for nullity before creating a new instance.
Use SimpleBlockingThreadPool instead of Java's work-stealing pool, as the latter causes runaway memory consumption in some circumstances, while SimpleBlockingThreadPool uses a bounded queue and always pushes back against the supplier if it can't hold any more tasks.
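The push-back idea, reduced to its essence: a bounded queue whose put() blocks the submitter when full. This is only a sketch, not the actual SimpleBlockingThreadPool:
```
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Sketch of a bounded pool whose submit() blocks the caller when the queue is full. */
class BoundedBlockingPoolSketch {
    private final BlockingQueue<Runnable> queue;

    BoundedBlockingPoolSketch(int threads, int queueCapacity) {
        queue = new ArrayBlockingQueue<>(queueCapacity);
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(this::workLoop, "worker-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }

    /** put() blocks until there is room, pushing back against the task supplier. */
    void submit(Runnable task) throws InterruptedException {
        queue.put(task);
    }

    private void workLoop() {
        try {
            while (true) {
                queue.take().run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```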
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
I was following the release demo video for v2024.01.0
https://www.youtube.com/watch?v=PNwMkenQQ24 and when I did 'docker
compose up' the containers couldn't resolve the DNS name for 'zookeeper'
I realized this was because the zookeeper container was using the
default docker network, so I specified the wmsa network explicitly.
The previous behavior listened to too many changes and, based on zookeeper rather than curator assumptions about behavior, added an additional monitor on each invocation of each monitor (which always triggers on service state changes). This led to each monitor re-registering and effectively doubling the number of monitors whenever a service stopped or started, which in turn meant a lot of bizarre thrashing behavior even on changes in services that don't explicitly talk to each other.
This re-registering behavior is no longer done.
Netty and GRPC by default spawn an incredible number of threads on high-core CPUs, which amounts to a fair bit of RAM usage.
Add custom executors that throttle this behavior.
Add the ability to indicate to the search service that a request is malicious, and to poison the results by providing randomly reordered old results instead.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
Cleaning out a lot of old junk from the code, and one thing led to another...
* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```channelPool.call(grpcStub::method)
.async(executor) // <-- optional
.run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services. Theoretically this means we could ship the same code either as a monolith or as a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP!
Missing is documentation and testing, and some more breaking apart of code.
The change adds a hostname validation step to remove endpoints from the ZkServiceRegistry when they do not resolve. This is a scenario that primarily happens when running in docker, and the entire system is started and stopped.
The warmup would sometimes crash during a cold start-up, because it could not get an API. Changed the warmup to just create a GrpcSingleNodeChannelPool for the node.
The query service delegates and aggregates IndexDomainLinksApiGrpc
messages to the index services. The query client was accidentally
also doing this, instead of talking to the query service.
Fixed so it correctly talks to the query service and nothing else.
Adds new ways to configure the bind and external IP addresses for a service. Notably, if the environment variable WMSA_IN_DOCKER is present, the system will grab the HOSTNAME variable and announce that as the external address in the service registry.
The default bind address is also changed to be 0.0.0.0 only if WMSA_IN_DOCKER is present, otherwise 127.0.0.1; as this is a more secure default.
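A sketch of the address selection; the non-docker external address below is just a placeholder, as the real configuration plumbing allows overriding it:
```
/** Sketch of the bind/external address selection described above. */
class BindAddressSketch {
    static boolean inDocker() {
        return System.getenv("WMSA_IN_DOCKER") != null;
    }

    /** Bind to all interfaces only inside docker; default to loopback otherwise. */
    static String bindAddress() {
        return inDocker() ? "0.0.0.0" : "127.0.0.1";
    }

    /** In docker, announce the container's HOSTNAME as the external address in the registry. */
    static String externalAddress() {
        return inDocker() ? System.getenv("HOSTNAME") : "127.0.0.1";
    }
}
```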
The previous code made an incorrect assumption that all routes refer to the same node, and would overwrite the route list on each update. This led to storms of closing and opening channels whenever an update was received.
The new code is correctly aware that we may talk to multiple nodes.
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.
A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
The last remaining REST service, the assistant-service, has been migrated to gRPC.
This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels.
Since it's no longer used by anything, RxJava has been removed as a dependency from the project.
Although the current state seems reasonably stable, this is a work-in-progress commit.
In the scenario where an operator
* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data
The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file,
irrecoverably losing the crawl log and making it impossible to load!
To mitigate the impact of similar problems, the change saves a backup of the old crawl log, and complains loudly when this happens.
More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider to find them regardless of loaded state.
This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so the redundancy is merited.
To help services start faster, the blacklist will no longer block until it's loaded. If such a behavior is desirable, a method was added to explicitly wait for the data.
The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod.
This change moves the loading to a separate thread entirely. For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.
This filter currently does not distinguish itself very much from the unfiltered results, and lends the impression that the filters don't "do anything".
It may come back in some shape or form in the future, with some additional tweaking of the rankings...
Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This prevents overloading a server by crawling it in parallel from different subdomains, and avoids crawling big domains all at once.
Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub.
The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.
Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach.
The reddit sideloader now uses the SideloaderProcessing class. It also properly sets js-attributes for the sideloaded documents.
The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.
Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.
The change deprecates the 'algorithm' field from the domain ranking set configuration. Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.
The domain ranking code was admittedly a bit of a clown fiesta; at the same time buggy, fragile and inscrutable.
Migrating over to use JGraphT to store the link graph
when doing rankings, and using their PageRank implementation. Also added a modified version that does PersonalizedPageRank.
This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias.
The QueryStrategy makes it possible to e.g. require a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period.
These options are added to the search interface. The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well.
The vintage filter is modified to add a temporal bias for the past.
(converter) Loader for reddit data
Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.
Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.
Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these can not be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run.
The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.
Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename.
The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.
The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable.
Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc. It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.
Look at whether the property 'system.conserveProperty' is enabled when deciding the default pool size for the converter.
If true, a much more conservative default is used, limiting the risk of running out of memory.
Adds experimental support for clustering search results by e.g. domain. At a first stage, this is only enabled for the wiki and forum filters.
The commit also cleans up the UrlDetails class, which contained a number of vestigial entries.
The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
Recent changes to the result ranking mean the no filter mode returns sufficiently good results for most queries that filtering by default just makes the search results more restricted.
* (executor-api) Make executor API talk GRPC
The executor's REST API was very fragile and annoying to work with, lacking even basic type safety. Migrate to use GRPC instead. GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil. This is a fairly straightforward change, but it's also large so a solid round of testing is needed...
The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.
ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().
The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
!bang query handling seems to have fallen victim to an overzealous refactoring effort, and broke.
It's now repaired, and a test is in place to ensure we know if it breaks again.
The readme for the array library was extremely out of date. Updating it with accurate information about how the library works, and a demo that should compile.
Also added a system property for disabling the use of sun.misc.Unsafe.
Continues 467ba5be20 by breaking out a constant with the name of the primary ranking set. Also ensures it doesn't get spuriously logged as updated during the secondary updating pass.
This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after...
To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one.
The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader.
Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data.
Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead.
To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.
The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data.
Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.
To help distinguish between environments, a system property 'control.appBorder' is added that is injected as a body element border property in the control GUI stylesheets.
IndexJournalWriterPagingImpl was modified to not page on number of entries written, but number of (equivalent uncompressed) bytes written.
Since the failure mode if too much data is written per file is quiet corruption of the index, the former behavior was extremely fragile. The new behavior should consistently ensure that the data is sufficiently small to not cause any integer rollovers.
The change in 6dcc20038c was reverted, as there is really no sane reason to have this configurable in software.
The RandomFileAssembler implementations, introduced in commit 53c575db3f were all acting subtly differently. The RWF implementation wrote BigEndian longs instead of the native endianness used by the other implementations (and expected by the index construction code), further the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary.
A test was built to ensure the output of these implementations is equivalent.
To cope with writing large files out of order, the system needs some form of strategy to avoid writing them directly to disk, as this causes insane amounts of disk thrashing. By default, the data is just buffered in RAM. This works well on a large server, but smaller systems struggle.
To help systems with small RAM process large amounts of data, the old RandomWriteFunnel is brought back if the system property 'system.conserve-memory' is set to true. RandomWriteFunnel is buffering the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, and then it goes over the files one by one to construct one area of the file at a time. This is relatively slow and uses more than twice the disk size.
A new interface RandomFileAssembler is introduced as an abstraction for this operation. A third strategy, direct mmaps, is also introduced if the file is very small (less than 1 GB). In this domain, disk thrashing is unlikely since it will comfortably fit in RAM.
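A sketch of the strategy choice; the enum and method names mirror the description above rather than the real API:
```
/** Sketch of the RandomFileAssembler strategy selection described above. */
class FileAssemblerStrategySketch {
    enum Strategy { MMAP, IN_MEMORY, WRITE_FUNNEL }

    static Strategy choose(long expectedSizeBytes) {
        // Small files fit comfortably in RAM, so a direct mmap won't thrash the disk.
        if (expectedSizeBytes < 1L << 30) {
            return Strategy.MMAP;
        }
        // Conserve-memory mode trades speed and temporary disk space for a small RAM footprint.
        if (Boolean.getBoolean("system.conserve-memory")) {
            return Strategy.WRITE_FUNNEL;
        }
        // Default: buffer the out-of-order writes in RAM.
        return Strategy.IN_MEMORY;
    }
}
```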
This update enhances the SideloaderProcessing and DocumentClass modules to specially handle sideloaded wiki documents. Wiki content is generally truncated to the first paragraph, which tends to be too short to be included independently. An additional DocumentClass (SIDELOAD) has been introduced to suppress the length check in this case.
Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file. This works as intended.
Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier. Added logic to handle this case, amended the test case to verify the new behavior. Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.
The flag is `system.languageDetectionModelVersion`.
* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild.
Adding an **experimental** system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior.
IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.
Since the bleed-flags set by the anchor tag logic have been changed to Site and SiteAdjacent, give them a bit more importance when set together with ExternalLink.
UrlDomain and UrlPath are also now more consistently rewarded only once.
This adds a single-node barebones configuration to the install script. It also moves the log4j configuration into system.properties, and sets assertions to disabled by default.
This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu.
It also refactors some leaky abstractions out of FileStorageService. allocateTemporaryStorage has been renamed to allocateStorage. The storage was never temporary in any scenario...
It also doesn't take a storage base, as there was always only one valid option for this input. The allocateStorage method finds the appropriate base itself.
This will avoid having to dig in the message queue to perform this relatively common task.
The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.
It's a mistake to let it bleed into Title, as this is a high quality signal. We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.
This avoids concurrent access errors. This is especially important when using Unsafe-based LongArrays, since we have concurrent access to the underlying memory-mapped file. If we pull the rug out from under the caller by closing the file, we'll get a SIGSEGV. Even with a "safe" MemorySegment, we'll get ugly stacktraces if we close the file while a thread is still accessing it.
So we spin up a thread that sleeps for a minute before actually unmapping the file, allowing any ongoing requests to wrap up. This is 100% a hack, but it lets us get away with doing this without adding locks to the index readers.
Since this is "just" mmapped data, and this operation happens optimistically once a month, it should be safe if the call gets lost.
The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.
The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accommodate content types that append a charset as well.
Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine.
Removed the tool itself.
This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.
java.lang.Error:s were not handled properly, leading to mismatch in the bookkeeping of the FSMs. These are now caught, acted on, and re-thrown.
MqSynchronousInbox also no longer assumes all exceptions are InterruptedException.
This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties.
The documentation is updated to reflect the change.
Dead code was also removed in the process. CrawlSpecGenerator still feels a bit over-engineered, since it was built for a more general case and all implementations but the current one have been removed, but we'll leave it like this for now as it's still fairly readable.
Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base was lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically.
The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.
The existing RELATED_ID field has too many semantics associated with it;
among other things the FSM code uses this field in tracking state changes.
The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
It's a confusing default behavior.
This was off for nodes n>1 before as a bandaid since querying indices with no data caused delays and errors. This has been fixed now, so there's no need to do this anymore!
This improves query times, and gets rid of exceptions in the logs when one of the index nodes doesn't have any data loaded, yet is configured to answer queries.
In some scenarios, such as when restoring storage items from json-manifest on db failure, the file storage view would present the items in a non-chronological order. Added a sort() operation to mitigate this.
Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentations to reflect the changes.
The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead.
The ExportDataActor now uses the QueryClient appropriately. The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file.
Finally the form for triggering an export was overhauled.
Several tests were manually running migrations in a large copy-paste blob of code. This makes the tests less useful, as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it also makes it difficult to reason about what the tests are doing.
A new test helper library is introduced with a TestMigrationLoader that can either run all Flyway migrations, or load specific migrations in cases where a particular set of migrations needs to be loaded. Existing tests are migrated to use the new code.
Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest.
The warc files are concatenated into larger archives, up to about 1 GB each.
An index is also created containing filenames, domain names, offsets and sizes to help navigate these larger archives.
The warc data is saved in a directory warc/ under the crawl data storage.
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the work flow better.
It also adds a new /uploads directory in each index node, from which sideloaded data can be selected. This is a bit of a breaking change, as this directory needs to exist in each index node.
The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true.
The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
This adds a docker-compose file 'docker-compose-barebones.yml' which will only start the minimal number of services needed to run a whitelabel Marginalia Search-style search engine, with none of the surrounding frills.
The change also adds a minimal search GUI to the query service, which is also available with JSON results if the appropriate Accept header is provided.
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format.
To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records, presenting them in the new order.
This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.
Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.
This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service.
A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.
The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
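A simplified sketch of such a link file, assuming plain big-endian int pairs and an in-memory list; the actual implementation and class names will differ:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a domain link "table" as a flat file of 32-bit integer pairs
 *  (source domain id, destination domain id), small enough to keep in memory. */
class DomainLinkFile {
    record Link(int source, int dest) {}

    static void write(Path file, List<Link> links) throws IOException {
        try (var out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (Link link : links) {
                out.writeInt(link.source());
                out.writeInt(link.dest());
            }
        }
    }

    static List<Link> readAll(Path file) throws IOException {
        long pairs = Files.size(file) / (2L * Integer.BYTES);
        List<Link> links = new ArrayList<>();
        try (var in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
            for (long i = 0; i < pairs; i++) {
                links.add(new Link(in.readInt(), in.readInt()));
            }
        }
        return links;
    }
}
```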
Settings for enabling reproducible builds for all subprojects were added to improve build consistency. This includes preserving file timestamps and ordering files reproducibly.
This is primarily of help for docker, since it uses hashes to determine if a file or image layer has changed.
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
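In sketch form, the truncation amounts to something like the following; the actual value of MAX_TEXT_LENGTH in SentenceExtractor is an assumption here:

```java
class SentenceExtractorLimit {
    // The actual value of MAX_TEXT_LENGTH in SentenceExtractor may differ.
    static final int MAX_TEXT_LENGTH = 500_000;

    /** Caps overly long inputs before sentence extraction to bound resource usage. */
    static String capLength(String text) {
        return text.length() > MAX_TEXT_LENGTH
                ? text.substring(0, MAX_TEXT_LENGTH)
                : text;
    }
}
```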
There was a bug where, if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always ranked the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.
Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!
This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
Modify ProcessingIterator to be constructed via a factory, to enable re-use of its backing executor service.
This reduces thread churn in the converter sideloader style processing of regular crawl data.
Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number.
This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.
The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.
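As a hedged sketch of the idea, each query parameter can be re-encoded before the request line is reconstructed; the real WarcProtocolReconstructor logic may differ:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.stream.Collectors;

class QueryStringEncoder {
    /** Sketch only: re-encodes query parameters so special characters are legal
     *  in the reconstructed request line. */
    static String encode(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }
}
```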
Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.
Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.
This commit adds a safety check that the URL of the document is from the correct domain.
It also adds a sizeHint() method to SerializableCrawlDataStream which *may* provide an indication if the stream is very large and benefits from sideload-style processing (which is slow).
It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process.
These websites now receive a simplified treatment. This is executed in the converter batch writer thread. This is slower, but the documents will not be persisted in memory.
With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter. This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while.
The first step is to move stuff out of the domain processor into the document processor.
Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler. This switches to the modified Murmur hash function used throughout Marginalia.
Modified site info feed template to secure the description field against injected code. Also adjusted search service by extracting samples within the correct scope and including them in the returned site info. This improves the quality and security of the displayed information.
This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates.
The change introduces a new tiny feature that is a feedlot-client based on Java's HttpClient.
This is for filtering results on how many times the term appears on the domain. The intent is to be beneficial in creating e.g. a domain search feature. It's also very helpful when tracking down spammy domains.
A number of crawl jobs get stuck at about 300 documents, or just under. This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl. GOOD_URLS is based on how many documents successfully process, which is typically fairly small. Switching to KNOWN_URLS should let this grow faster.
The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table.
The floor is also increased to 250 from 200.
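In sketch form, the updated crawl budget looks roughly like this; where exactly the 1.5x recrawl modifier is applied is an assumption based on the description above:

```java
class CrawlDepthCalculator {
    /** Sketch of the crawl budget described above: a floor of 250, scaled from
     *  KNOWN_URLS rather than GOOD_URLS, with a 1.5x growth factor on recrawls. */
    static int crawlDepth(int knownUrls, boolean isRecrawl) {
        int depth = Math.max(250, (int) (1.25 * knownUrls));
        if (isRecrawl) {
            depth = (int) (1.5 * depth);
        }
        return depth;
    }
}
```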
Added functionality to remove processes from listing that have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the respective entry from the PROCESS_HEARTBEAT table in case heartbeats are absent for more than one day.
This fixes a bug where a prepared statement was created before the table it was supposed to insert into was created. This fails and does nothing.
Furthermore, added the logging that would have warned about this failure, had it been in place.
Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking. This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.
Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats. This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely.
The commit also adds a few tests to this logic.
The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM. This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.
Initialization parameters in DomainLoaderService and DomainIdRegistry have been updated to improve performance. This is done by adding sane default sizes to the hash tables involved, reducing GC churn, but also by setting a sensible fetch size to the queries used, and not fetching irrelevant information such as the domain name.
We do both ip2location and ASN data.
The change also adds some keywords based on autonomous system information, on a somewhat experimental basis. It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.
Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
This had the knock-on effect of breaking the anchor tag loading in the processor for a lot of domains, since they'd look up anchor tag data for the wrong domain name.
In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment.
Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters.
Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.
This is caused by a resource contention with the query code. The proper way to fix this is to use some form of synchronization, but that will slow the code down. So we just hammer it a few times and let the GC deal with the problem if it fails. Not optimal, but fast.
The code now intercepts and deals with potential exceptions during the parsing of search parameters. This is in response to constant bad requests from bots which were cluttering the logs. A catch clause is added that suppresses these errors and redirects to the base URL.
The site info view can't blindly assume that every website supports https. To figure out which schema to use when linking to a site, execute a single-result search for site:domain.name and then grab the schema off the result.
To allow this, a count parameter is introduced to doSiteSearch() in SearchOperator.
There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.
We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes.
Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s.
The commit also adds some safety try-catch:es around the charset handling, as it may sometimes explode when fed incorrect data, and we do be guessing...
This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.
This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream.
The parquet format stores the timestamp as a 64 bit long, seconds since unix epoch, without a logical type. This is to avoid having to do format conversions when writing and reading the data.
This parquet field populates the timestamp field in CrawledDocument.
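Since the column is a plain 64-bit long holding epoch seconds, conversion to and from java.time is trivial; a small sketch:

```java
import java.time.Instant;

class CrawlTimestamp {
    /** The parquet column stores plain epoch seconds in a 64-bit long with no
     *  logical type, so no format conversion is needed on read or write. */
    static long toColumnValue(Instant fetchTime) {
        return fetchTime.getEpochSecond();
    }

    static Instant fromColumnValue(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }
}
```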
Add an optional new field to CrawledDocument containing information about whether the domain has cookies. This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object.
Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.
This information is then propagated to the parquet file as a boolean.
For documents that are copied from the reference, use whatever value we last saw. This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflected in the search engine more quickly.
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.
A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.
Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, and adds information about redirects and errors due to probe failure.
It also refactors the fetch result, body extraction and content type abstractions.
This commit cleans up the warc->parquet conversion. Records with a http status other than 200 are now included.
The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.
The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
This is not hooked into anything yet. The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays. This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.
This finally resolves issues with gradle confusing IntelliJ by complaining about java incompatibilities (that were never a problem), which caused it to not report test errors correctly.
This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish.
There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.
A problem is that the WARC files are a bit too large. It will likely be necessary to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.
This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.
The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
This is the same as the prefix for the IP address, but I don't think that substantially matters, as the two have such different namespaces there can be no confusion.
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.
The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.
The converter is modified to query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.
The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.
The previous version used a personalized pagerank centering on a few academic domains, but this didn't work very well and most results were not very academia-centric.
This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available. It also permits exposing this information via API in the future if there is interest in this.
The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time.
Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.
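A minimal sketch of a TypeToken-taking get(), assuming Gson for deserialization; the actual client's transport and error handling are omitted:

```java
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.util.List;

/** Sketch only: a get() that takes a TypeToken so generic payloads such as
 *  List<Foo> survive deserialization; HTTP plumbing is left out. */
class JsonClient {
    private final Gson gson = new Gson();

    <T> T get(String json, TypeToken<T> type) {
        return gson.fromJson(json, type.getType());
    }
}

// Usage: List<String> items = client.get(body, new TypeToken<List<String>>() {});
```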
Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g. a database query in parallel and returning the computed results as an iterator.
The iterator was also improved on to be more reliable, previous versions of the logic would sometimes deadlock due to false positives in hasMore().
The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance.
It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.
A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor!
To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.
The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data.
Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.
The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or fits a specific regex pattern matching domains like *.ac.ccTld or *.edu.ccTld.
If these conditions are met, the search term "special:academia" is added to the domain.
The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well. The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.
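A sketch of the domain-name test described above; the exact pattern in DomainProcessor may differ:

```java
import java.util.regex.Pattern;

class AcademiaDomainCheck {
    // Captures the described cases: anything.edu, anything.ac.<ccTld>, anything.edu.<ccTld>
    private static final Pattern ACADEMIA =
            Pattern.compile(".*\\.(edu|ac\\.[a-z]{2}|edu\\.[a-z]{2})$");

    static boolean isAcademia(String domainName) {
        return ACADEMIA.matcher(domainName.toLowerCase()).matches();
    }
}
// If isAcademia(...) holds, the keyword "special:academia" is attached to the domain.
```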
Partially hook in the WarcRecorder into the crawler process. So far it's not read, but should record the crawled documents.
The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.
This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything.
The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'.
The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.
This functionality needs to be accessed by the WarcSideloader, which is in the converter. The resultant microlibrary is tiny, but I think in this case it's justifiable.
This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.
In the future this logic probably needs to move into a separate service, as it's still quite slow to load. But this fixes the response times and DoS potential of the previous version.
If a process is violently terminated, the associated file storage may get stuck in the ephemeral 'NEW' state, preventing future operations on the associated data.
To remedy this without having to dig through the database, a button was added to reset the state. It's a band-aid, but the situation is rare enough that I think it's fine.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.
Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs.
The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.
Tricky problem, creating a procedure apparently needs delimiter shenanigans in Flyway, otherwise it will truncate the END statement and mariadb will be sad.
This is to enable running an external repository for production and test.
Use `./gradlew -Pdocker-registry=registry.foo.bar -Pdocker-tag=my-tag` while building to accomplish this. By default, 'marginalia' is used for the repository and 'latest' as the tag.
This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32 bit IDs, and there's a lot of links between domains to go around...
Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE.
We also update the PRIMARY KEY to a BIGINT. We'd need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.
Support for anchor tag keywords
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet. Due to size limitations on github, this is available at https://downloads.marginalia.nu/exports
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* The ranking algorithm was tweaked to make better use of ngram information as well as weighting the priority BM25
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
* Crawler will also use the anchor tag file to prioritize crawling documents with external links.
Don't log the PROCESS stream to the executor's logs, as it will also be logged in the spawned process' log files.
Also tell the spawned process which "service" it is so that it gets a log file with a name that makes sense.
This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss.
It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.
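For reference, a sketch of the conventional miss encoding (as used by java.util.Arrays.binarySearch), where a miss is always returned as a strictly negative value that can never collide with a valid index:

```java
class LongBinarySearch {
    /** Returns the index on a hit, and -(insertionPoint + 1) on a miss. */
    static long binarySearch(long[] data, long key) {
        long low = 0, high = data.length - 1;
        while (low <= high) {
            long mid = (low + high) >>> 1;
            long value = data[(int) mid];
            if (value < key) low = mid + 1;
            else if (value > key) high = mid - 1;
            else return mid;   // hit
        }
        return -(low + 1);     // miss: strictly negative
    }
}
```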
* Encyclopedia sideloader; permit providing base URL.
* Storage base shows node id in GUI
* ProcessLivenessMonitorActor restarts automatically
* Clean-up of outbox code
This turned out to be very difficult to do in small isolated steps.
* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
This makes the index complete in the sense that you can deploy an index instance and build a complete separate application on top of it, without having to go through the Marginalia-laden search service.
* (index-reverse) Parallel construction of the reverse indexes.
* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.
* (index-reverse) Force changes to disk on close, reduce logging.
* (index-reverse) Clean up merging process and add back logging
* (run) Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM
* (index-reverse) Better logging during processing
* (array) 2GB+ compatible write() function
* (index-reverse) We are logging like Bolsonaro and I will not have it.
* (reverse-index) Self-diagnostics
* (btree) Fix bug in btree reader to do with large data sizes
Further de-ByteBuffer:ing of these classes is to be done, but this is the smallest most urgently needed benefit.
This commit is a WIP but in a fully working state, pushing due to the importance of the changes to offer lifecycle control over mmaps.
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
This is a big performance boost in array.range().get().
Without an override, each access will go through pages[page].get(...) for each get()-operation. This adds up very quickly. BTreeReader does a bunch of get():s on a range()'d array during traversal in the queryData... methods.
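To illustrate the principle with a simplified, assumed model (plain long[][] pages rather than the actual off-heap arrays), a range view can pin its page once instead of resolving it on every get():

```java
/** Sketch (names assumed): the generic get() resolves pages[page] on every call,
 *  while a range pinned to a single page can skip that lookup. */
class PagedLongArray {
    private final long[][] pages;
    private final int pageShift;

    PagedLongArray(long[][] pages, int pageShift) {
        this.pages = pages;
        this.pageShift = pageShift;
    }

    long get(long index) {
        // generic path: page lookup on every single access
        return pages[(int) (index >>> pageShift)][(int) (index & ((1L << pageShift) - 1))];
    }

    /** View over a span assumed to lie within a single page. */
    Range range(long start) {
        long[] page = pages[(int) (start >>> pageShift)];
        int offset = (int) (start & ((1L << pageShift) - 1));
        return new Range(page, offset);
    }

    record Range(long[] page, int offset) {
        long get(int i) {
            return page[offset + i]; // no repeated page resolution
        }
    }
}
```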
* Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well
* Remove remains of OldDomains
* Ensure LOADER_PROCESS_OPTS gets fed to the processes
* LinkdbStatusWriter won't execute batch after each added item post 100 items
This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader.
The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.
It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
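In sketch form, the lexicon-free word id is just a 64-bit hash of the keyword. Marginalia uses its own modified Murmur implementation; Guava's murmur3 stands in here purely for illustration:

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

class WordIds {
    /** Illustrative only: the keyword string deterministically maps to a 64-bit
     *  identifier, so no shared in-memory lexicon is required. */
    static long wordId(String keyword) {
        return Hashing.murmur3_128()
                .hashString(keyword, StandardCharsets.UTF_8)
                .asLong();
    }
}
```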
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.
For now, we no longer store new URLs in different domains. We need to re-implement this somehow, probably in a different job or as a different output.
This is a QOL improvement for mobile users, who otherwise would have to scroll all the way up to refresh.
Also removed the confusing "this is a random set of domains"-message when viewing adjacent websites, as it's not random.
Super weird encoding bug that only arises on versions below jdk18 causing crawl data to be read incorrectly.
Seems possibly related to the new standard charset of UTF-8. Maybe some library (unknown which) is attempting to be backwards compatible in a way that totally breaks?
This is mostly a pilot track for sideloading other large websites.
Also change the converter to produce a more compact output (java serialization instead of json).
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
* Add response caching to the API service to help SearXNG
* Clean up the code a bit.
* Add an endpoint without a terminating slash for getLicense.
* Add tests for API service.
* Utilities for merging BTrees of entity size 1 and 2.
* Isolate and clean up sorting algorithms.
* Functions for keeping distinct items in a LongArray
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
* A deduplication filter step ran too early, and removed many good results on the basis that they partially, but not fully, fit another set of search terms.
* Altered the query creation process to prefer documents where multiple terms appear in the priority index.
Not everyone shows up in the git commit history, doesn't mean they didn't contribute valuable changes.
In such circumstances, their deeds will be recorded here.
* [@samstorment](https://www.github.com/samstorment) provided a design overhaul for [https://explore.marginalia.nu/](https://explore.marginalia.nu/) in [10cad3](https://github.com/MarginaliaSearch/MarginaliaSearch/commit/10cad3abb29b8a87bf5fd56afbc192335e3e94d7)
via [issue #44](https://github.com/MarginaliaSearch/MarginaliaSearch/issues/44).
* [@dreimolo](https://github.com/dreimolo) provided build script [fixes for apple silicon](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/64)
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Note that packages under [third-party/](third-party/) have different licenses, and the code in [code/libraries/](code/libraries/) is dual-licensed under MIT.
d="m 261.75384,-85.665085 -13.08512,15.97528 h 13.498 v 3.4936 H 243.206 v -2.76312 l 13.08512,-15.97528 h -12.8628 v -3.4936 h 18.32552 z"
id="path12603"/><path
d="m 278.84063,-75.787725 v 6.12968 h 12.5452 v 3.46184 h -16.674 v -22.232 h 16.22936 v 3.46184 h -12.10056 v 5.78032 h 10.73488 v 3.39832 z"
id="path12605"/><path
d="m 323.74919,-66.196205 h -4.4464 l -4.54168,-6.5108 q -0.28584,0.03176 -0.85752,0.03176 h -5.01808 v 6.47904 h -4.1288 v -22.232 h 9.14688 q 2.89016,0 5.01808,0.9528 2.15968,0.9528 3.30304,2.73136 1.14336,1.77856 1.14336,4.22408 0,2.50904 -1.23864,4.31936 -1.20688,1.81032 -3.4936,2.6996 z m -4.54168,-14.32376 q 0,-2.12792 -1.39744,-3.27128 -1.39744,-1.14336 -4.09704,-1.14336 h -4.82752 v 8.86104 h 4.82752 q 2.6996,0 4.09704,-1.14336 1.39744,-1.17512 1.39744,-3.30304 z"
d="m 245.81989,-41.935548 v 3.861737 h 7.90356 v 2.180981 h -10.50473 v -14.0063 h 10.2246 v 2.180981 h -7.62343 v 3.641638 h 6.76304 v 2.140963 z"
id="path12612"/><path
d="m 270.04847,-40.414864 v -9.484266 h 2.58116 v 14.0063 h -2.14096 l -7.72347,-9.484266 v 9.484266 h -2.58117 v -14.0063 h 2.14097 z"
id="path12614"/><path
d="m 285.39308,-35.89283 h -2.60117 v -11.80531 h -4.64209 v -2.20099 h 11.88535 v 2.20099 h -4.64209 z"
id="path12616"/><path
d="m 307.52074,-35.89283 h -2.80126 l -2.86129,-4.101845 q -0.18008,0.02001 -0.54024,0.02001 h -3.16142 v 4.081836 h -2.60117 v -14.0063 h 5.76259 q 1.82082,0 3.16142,0.60027 1.36061,0.60027 2.08094,1.720774 0.72032,1.120504 0.72032,2.661197 0,1.580711 -0.78035,2.721224 -0.76034,1.140513 -2.20099,1.700765 z m -2.86129,-9.024059 q 0,-1.340603 -0.88039,-2.060927 -0.8804,-0.720324 -2.58116,-0.720324 h -3.04137 v 5.582511 h 3.04137 q 1.70076,0 2.58116,-0.720324 0.88039,-0.740333 0.88039,-2.080936 z"
id="path12618"/><path
d="m 319.76395,-35.69274 q -2.90131,0 -4.52204,-1.620729 -1.62073,-1.640738 -1.62073,-4.682106 v -7.903555 h 2.60117 v 7.80351 q 0,4.121854 3.5616,4.121854 3.5416,0 3.5416,-4.121854 v -7.80351 h 2.56115 v 7.903555 q 0,3.041368 -1.62073,4.682106 -1.60072,1.620729 -4.50202,1.620729 z"
This is the source code for marginalia.nu, including the [search engine](https://search.marginalia.nu),
the [MEMEX/gemini server](https://memex.marginalia.nu), and the [encyclopedia service](https://encyclopedia.marginalia.nu).
This is the source code for [Marginalia Search](https://search.marginalia.nu).
The aim of the project is to develop new and alternative discovery methods for the Internet.
It's an experimental workshop as much as it is a public service; the overarching goal is to
elevate the more human, non-commercial sides of the Internet. A side-goal is to do this without
requiring datacenters and expensive enterprise hardware, to run this operation on affordable hardware.
The canonical git server for this project is [https://git.marginalia.nu](https://git.marginalia.nu).
It is fine to mirror it on other hosts, but if you have issues or questions
git.marginalia.nu is where you want to go.
A side-goal is to do this without requiring datacenters and enterprise hardware budgets,
to be able to run this operation on affordable hardware with minimal operational overhead.
## Important note about wmsa.local
The long term plan is to refine the search engine so that it provides enough public value
that the project can be funded through grants, donations and commercial API licenses
(non-commercial share-alike is always free).
This project has a [sister repository called wmsa.local](https://git.marginalia.nu/marginalia/wmsa.local)
that contains scripts and configuration files for running and developing the code.
The system can both be run as a copy of Marginalia Search, or as a white-label search engine
for your own data (either crawled or side-loaded). At present the logic isn't very configurable, and a lot of the judgements
made are based on the Marginalia project's goals, but additional configurability is being
worked on!
Without it, development is very unpleasant.
Here's a demo of the set-up and operation of the self-hostable barebones mode of the search engine: [🌎 https://www.youtube.com/watch?v=PNwMkenQQ24](https://www.youtube.com/watch?v=PNwMkenQQ24)
While developing the code, you will want an environment variable WMSA_HOME pointing to
the directory in which wmsa.local is checked out, otherwise the code will not run and
several tests will fail.
## Set up
## Documentation
To set up a local test environment, follow the instructions in [📄 run/readme.md](run/readme.md)!
Documentation is a work in progress. See the [wiki](https://git.marginalia.nu/marginalia/marginalia.nu/wiki).
Further documentation is available at [🌎 https://docs.marginalia.nu/](https://docs.marginalia.nu/).
## Contributing
Before compiling, it's necessary to run [⚙️ run/setup.sh](run/setup.sh).
This will download supplementary model data that is necessary to run the code.
These are also necessary to run the tests.
[CONTRIBUTING.md](CONTRIBUTING.md)
If you wish to hack on the code, check out [📄 doc/ide-configuration.md](doc/ide-configuration.md).
## Supporting
## Hardware Requirements
Consider [supporting this project](https://memex.marginalia.nu/projects/edge/supporting.gmi).
A production-like environment requires a lot of RAM and ideally enterprise SSDs for
the index, as well as some additional terabytes of slower hard drives for storing crawl
data. It can be made to run on smaller hardware by limiting the size of the index.
The system will definitely run on a 32 GB machine, possibly smaller, but at that size it may not perform
very well as it relies on disk caching to be fast.
A local developer's deployment is possible with much smaller hardware (and index size).
## Project Structure
[📁 code/](code/) - The Source Code. See [📄 code/readme.md](code/readme.md) for a further breakdown of the structure and architecture.
[📁 run/](run/) - Scripts and files used to run the search engine locally
[📁 third-party/](third-party/) - Third party code
[📁 doc/](doc/) - Supplementary documentation
[📄 CONTRIBUTING.md](CONTRIBUTING.md) - How to contribute
[📄 LICENSE.md](LICENSE.md) - License terms
## Contact
You can email <kontakt@marginalia.nu> with any questions or feedback.
## License
The bulk of the project is available with AGPL 3.0, with exceptions. Some parts are co-licensed under MIT,
third party code may have different licenses. See the appropriate readme.md / license.md.
## Versioning
The project uses modified Calendar Versioning, where the first two numbers are the year and month coinciding
with the latest crawling operation, and the third number is a patch number.
```
      version
      --
yy.mm.VV
-----
crawl
```
For example, `23.03.02` is a release with crawl data from March 2023 (released in May 2023).
It is the second patch for the 23.03 release.
Versions with the same year and month are compatible with each other, or offer an upgrade path where the same
data set can be used, but across different crawl sets data format changes may be introduced, and you're generally
expected to re-crawl the data from scratch, as crawler data has a shelf life approximately as long as the major release
cycles of this project. After about 2-3 months it gets noticeably stale with many dead links.
For development purposes, crawling is discouraged and sample data is available. See [📄 run/readme.md](run/readme.md)
for more information.
## Funding
### Donations
Consider [donating to the project](https://www.marginalia.nu/marginalia-search/supporting/).
### Grants
This project was funded through the [NGI0 Entrust Fund](https://nlnet.nl/entrust), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu/) programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 101069594.
This is a roadmap with major features planned for Marginalia Search.
It's not set in any particular order and other features will definitely
be implemented as well.
Major goals:
* Reach 1 billion pages indexed
* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done.)
## Hybridize crawler w/ Common Crawl data
Sometimes Marginalia's relatively obscure crawler is blocked when attempting to crawl a website, or for
other technical reasons it may be prevented from doing so. A possible work-around is to hybridize the
crawler so that it attempts to fetch such inaccessible websites from common crawl. This is an important
step on the road to 1 billion pages indexed.
As a rough sketch, the crawler would identify target websites, consume CC's index, and then fetch the WARC data
with byte range queries.
Retaining the ability to independently crawl the web is still strongly desirable so going full CC is not an option.
## Safe Search
The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php))
combined with a naive Bayesian filter would go a long way, or something more sophisticated...?
## Additional Language Support
It would be desirable if the search engine supported more languages than English. This is partially about
rooting out assumptions regarding character encoding, but there's most likely some amount of custom logic
associated with each language added, at least a models file or two, as well as some fine tuning.
It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
## Support for binary formats like PDF
The crawler needs to be modified to retain them, and the conversion logic needs to parse them.
The documents database probably should have some sort of flag indicating it's a PDF as well.
PDF parsing is known to be a bit of a security liability so some thought needs to be put in
that direction as well.
## Custom ranking logic
Stract does an interesting thing where they have configurable search filters.
This looks like a good idea that wouldn't just help clean up the search filters on the main
website, but might be cheap enough that we could go as far as offering a number of ad-hoc custom search
filters for any API consumer.
I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
## Show favicons next to search results
This is expected from search engines. Basic proof of concept sketch of fetching this data has been done, but the feature is some way from being reality.
## Specialized crawler for github
One of the search engine's biggest limitations right now is that it does not index github at all. A specialized crawler that fetches at least the readme.md would go a long way toward providing search capabilities in this domain.
# Completed
## Web Design Overhaul (COMPLETED 2025-01)
The design is kinda clunky and hard to maintain, and needlessly outdated-looking.
## Finalize RSS support (COMPLETED)
Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.
Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
## Proper Position Index (COMPLETED 2024-09)
The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
of being very fast to evaluate and works well for what it is, but is inaccurate and has the
drawback of making support for quoted search terms inaccurate and largely reliant on indexing
word n-grams known beforehand. This limits the ability to interpret longer queries.
The positions mask should be supplemented or replaced with a more accurate (e.g. gamma coded) positions
list, as is the civilized way of doing this.
Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
INSERT IGNORE INTO CONF_DOMAIN_RANKING_SET(NAME, DESCRIPTION, ALGORITHM, DEPTH, DEFINITION) VALUES ('NONE', 'Reserved: No Ranking Algorithm', 'SPECIAL', 50000, '');
**TODO**: This module should probably be renamed and moved into some other package.
## See Also
The database is constructed by the [loading-process](../../processes/loading-process), and consumed by the [index-service](../../services-core/index-service).