mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-05 21:22:39 +02:00
152 Commits

Author SHA1 Message Date
Viktor Lofgren
c661ebb619 (refac) Move language-processing into functions
It's long surpassed the single-responsibility library it once was, and is as such out of place in its original location, and fits better among the function-type modules.
2025-09-18 10:30:40 +02:00
Viktor Lofgren
0cfd759f85 (deps) Upgrade slop to 0.17 for better skip performance and faster index construction times 2025-09-08 18:02:34 +02:00
Viktor Lofgren
edd453531e (index) Partition keyword lexicons by language 2025-09-04 17:24:48 +02:00
Viktor Lofgren
8ca6209260 (refac) Fold ft-anchor-keywords into converting-process 2025-09-03 13:03:38 +02:00
Viktor Lofgren
673c65d3c9 (refac) Fold term-frequency-dict into language-processing 2025-09-03 12:59:10 +02:00
Viktor Lofgren
1979870ce4 (refac) Merge index-forward, index-reverse, index/query into index
The project has too many submodules, and it's a bit of a headache to navigate.
2025-09-02 12:30:42 +02:00
Viktor Lofgren
f0741142a3 (refac) Move keyword extraction into language processing 2025-08-29 10:55:47 +02:00
Viktor Lofgren
ea99b62356 (build) Fix missing junit engine version 2025-08-16 11:01:32 +02:00
Viktor Lofgren
291ff0c4de (deps) Upgrade crawler commons to fix robots.txt-parser bug 2025-08-15 00:13:15 +02:00
Viktor Lofgren
4a98a3c711 (skiplist) Move to a separate directory instead of in the btree module 2025-08-14 01:09:46 +02:00
Viktor Lofgren
ca283f9684 (native) Clean up native helpers and break them into their own library 2025-08-10 20:55:34 +02:00
Viktor Lofgren
673b0d3de1 (index) Perf test tool (WIP!) 2025-07-26 11:49:31 +02:00
Viktor Lofgren
62b696b1c3 (architecture) Remove the separate executor service and merge it into the index service
The primary motivation for this is that in production, the large number of partitioned services has led to an intermittent exhaustion of available database connections, as each service has a connection pool.

The decision to have a separate executor service dates back to when the index service was very slow to start, and the executor didn't always spin off its memory-hungry tasks into separate processes, which meant the executor would sometimes OOM and crash, and it was undesirable to bring the index down with it.
2025-07-23 12:57:13 +02:00
Viktor Lofgren
556d7af9dc Reapply "(grpc) Use grpc-netty instead of grpc-netty-shaded"
This reverts commit b7a5219ed3.
2025-07-21 13:23:32 +02:00
Viktor Lofgren
b7a5219ed3 Revert "(grpc) Use grpc-netty instead of grpc-netty-shaded"
Reverting this change to see if it's the cause of some instability issues observed.
2025-07-21 13:10:41 +02:00
Viktor
3b2ac414dc Merge pull request #210 from MarginaliaSearch/ads-fingerprinting
Implement advertisement and popover identification based on DOM sample data
2025-07-21 12:25:31 +02:00
Viktor Lofgren
12c304289a (grpc) Use grpc-netty instead of grpc-netty-shaded
This will help reduce runaway thread pool sizes
2025-07-20 17:36:25 +02:00
Viktor Lofgren
f38daeb036 (WIP) First stab at a GUI for viewing network traffic
The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter.

The change also adds a parser for DDG's tracker radar data.
2025-07-18 13:58:57 +02:00
Viktor Lofgren
52ff7fb4dd (ndp) Add a process for adding new domains to be crawled
This is a working "work in progress" commit, will need more refinement, but given the usual difficulties in testing crawler-adjacent code without actually crawling, it needs some maturation time in production.
2025-06-21 14:10:27 +02:00
Viktor Lofgren
82456ad673 (coordination) Trial the use of zookeeper for coordinating semaphores across multiple crawler-like processes
The performance implications of this need to be evaluated. If it does not hold water, some other solution may be required instead.
2025-06-14 16:16:10 +02:00
Viktor Lofgren
d6d5467696 (ping) Add domain pinging service 2025-06-10 18:28:13 +02:00
Viktor Lofgren
4e1595c1a6 (nsfw) Initial work on adding UT1-based domain filtering 2025-06-06 14:23:37 +02:00
Viktor Lofgren
edf382e1c5 (website-capture) Add a custom docker image with a new custom extension for DOM capture
The original approach of injecting javascript into the page directly didn't work with pages that reloaded themselves.  To work around this, a chrome extension is used instead that does the same work, but subscribes to reload events and re-installs the change listener.
2025-05-21 14:13:54 +02:00
Viktor Lofgren
5aef844f0d (dependency) Increase slop version to 0.0.11
v0.0.11 uses atomic moves.  This ensures we don't encounter a race condition in the backup service with lingering .tmp-files that should have been renamed.
2025-05-12 14:09:16 +02:00
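The write-then-rename pattern that the 5aef844f0d message refers to can be sketched as follows. This is a minimal illustration and not slop's actual code: the class name `AtomicWriteSketch`, the method `writeAtomically` and the `.tmp` sibling naming are all assumptions made for the example. Only `Files.move` with `StandardCopyOption.ATOMIC_MOVE` is standard JDK API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only; not taken from slop or Marginalia.
class AtomicWriteSketch {

    /** Writes data to a sibling ".tmp" file, then renames it onto the destination in one step. */
    static void writeAtomically(Path destination, byte[] data) throws IOException {
        // Assumed naming scheme: "<name>.tmp" next to the destination file
        Path tmp = destination.resolveSibling(destination.getFileName() + ".tmp");

        Files.write(tmp, data);

        // With ATOMIC_MOVE, other processes see either no destination file (or the old one)
        // or the complete new one, never a half-renamed state. If the file system cannot
        // perform the move atomically, AtomicMoveNotSupportedException is thrown.
        // Whether an existing destination is replaced is implementation specific.
        Files.move(tmp, destination, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A caller such as a backup job would then only ever pick up fully written files, for example via `AtomicWriteSketch.writeAtomically(dir.resolve("data.bin"), bytes)`.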
Viktor Lofgren
36889950e8 (pdf) Migrate to PDFBox 3.0.5 and suppress log spam
PDFBox 2.x uses commons logging, which does not route through SLF4j, and thus is a hassle to configure; and is extremely verbose in its default logging settings.

Migrating to PDFBox 3.x lets us use slf4j to address the log spam by filtering out the noisy methods.
2025-05-08 18:03:26 +02:00
Viktor Lofgren
0819d46f97 (pdf) Minimal prototype to get PDFs working 2025-05-08 18:03:26 +02:00
Viktor Lofgren
d7c4c5141f (crawler) Migrate to Apache HttpClient for crawler
The previously used Java HttpClient seems unsuitable for crawler usage,
and led to issues like send() operations sometimes hanging forever,
with clunky workarounds such as running each send operation in a separate
Future that can be cancelled on a timeout.

It has too many assumptions that break, and fails to adequately expose
the inner workings of the connection pool to a degree that makes it possible
to configure in a satisfactory manner.

Apache's HttpClient solves all these problems.

The change also includes a new battery of tests for the HttpFetcher,
and refactors the retriever class a bit to move stuff into the HttpFetcher,
leading to a better separation of concerns.
2025-04-17 12:51:08 +02:00
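For readers unfamiliar with the workaround that the d7c4c5141f message calls clunky, the sketch below shows the general shape of running a blocking call in a separate Future and cancelling it on timeout. It is an assumption-laden illustration, not the crawler's previous code: `TimeoutCallSketch` and `callWithTimeout` are hypothetical names, and a plain `Callable` stands in for the HTTP send operation.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only; not taken from the Marginalia crawler.
class TimeoutCallSketch {

    // Daemon threads so that a stuck operation cannot keep the JVM alive
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable);
        thread.setDaemon(true);
        return thread;
    });

    /** Runs a blocking operation on another thread and gives up after the timeout. */
    static <T> T callWithTimeout(Callable<T> operation, Duration timeout) throws Exception {
        Future<T> future = POOL.submit(operation);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker thread; an operation that ignores
            // interruption keeps running, which is part of why this is a workaround
            future.cancel(true);
            throw e;
        }
    }
}
```

Each call costs an extra thread hand-off, and a timeout only interrupts the worker rather than guaranteeing that the underlying connection is released, which is consistent with the commit's preference for a client that exposes timeouts and connection pool settings directly.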
Viktor Lofgren
cfd4712191 (favicon) Add capability for fetching favicons 2025-03-21 13:38:58 +01:00
Viktor Lofgren
f076d05595 (deps) Upgrade slf4j to latest 2025-02-15 12:50:16 +01:00
Viktor Lofgren
fbba392491 (live-capture) Send a UA-string from the browserless fetcher as well
The change also introduces a somewhat convoluted wiremock test to intercept and verify that these headers are in fact sent
2025-02-04 13:36:49 +01:00
Viktor Lofgren
c8b0a32c0f (crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams 2025-01-26 15:40:17 +01:00
Viktor Lofgren
55d6ab933f Merge branch 'master' into slop-crawl-data-spike 2025-01-21 13:32:58 +01:00
Viktor Lofgren
2315fdc731 (search) Vendor rssreader and modify it to be able to consume the nlnet atom feed
Also dial down the logging a bit for the rssreader package.
2025-01-06 17:58:50 +01:00
Viktor Lofgren
1b27c5cf06 (search) Add a copy of the old UI as a separate service, search-service-legacy 2025-01-02 18:02:17 +01:00
Viktor Lofgren
47e58a21c6 Refactor documentBody method and ContentType charset handling
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
2024-12-17 17:11:37 +01:00
Viktor Lofgren
3714104976 Add loader for slop data in converter.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data.  This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00
Viktor Lofgren
f6f036b9b1 Switch to new Slop format for crawl data storage and processing.
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
2024-12-15 19:34:03 +01:00
Viktor Lofgren
b510b7feb8 Spike for storing crawl data in slop instead of parquet
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds.  On disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
fdee07048d (search) Remove Spark and migrate to Jooby for the search service 2024-12-10 19:13:13 +01:00
Viktor Lofgren
f050bf5c4c (WIP) Initial semi-working transformation to new tailwind UI
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.

There's also a lot of polish remaining everywhere, dead links, etc.
2024-12-05 14:00:17 +01:00
Viktor Lofgren
51e46ad2b0 (refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendants
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.

While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform.  It's still a bit heterogeneous between different processes, but a bit less so for now.
2024-11-21 16:00:09 +01:00
Viktor Lofgren
89d8af640d (live-crawl) Rename the live crawler code module to be more consistent with the other processes 2024-11-20 15:55:15 +01:00
Viktor Lofgren
a91ab4c203 (live-crawler) Crude first-try process for live crawling #WIP
Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
2024-11-19 19:35:01 +01:00
Viktor Lofgren
3791ea1e18 (service) Add a new application service for external liveness monitoring
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
2024-11-17 18:01:08 +01:00
Viktor Lofgren
9f47ce8d15 (chore) Remove lombok
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
bfeb9a4538 (feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service 2024-11-09 17:56:43 +01:00
Viktor Lofgren
23cce0c78a Add a new function 'Live Capture' for on-demand screenshot capture
The screenshots are requested by the site-service, and triggered via the site-info view.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
f78ef36cd4 (slop) Upgrade to 0.0.8, add encodings to string columns. 2024-09-04 15:19:00 +02:00
Viktor Lofgren
266d6e4bea (slop) Replace SlopPageRef<T> with SlopTable.Ref<T> 2024-08-21 10:13:49 +02:00
Viktor Lofgren
b0a874a842 (*) Upgrade slop library -> 0.0.5 2024-08-18 11:05:27 +02:00