MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-05 21:22:39 +02:00

Author	SHA1	Message	Date
Viktor Lofgren	c661ebb619	(refac) Move language-processing into functions It's long surpassed the single-responsibility library it once was, and is as such out of place in its original location, and fits better among the function-type modules.	2025-09-18 10:30:40 +02:00
Viktor Lofgren	0cfd759f85	(deps) Upgrade slop to 0.17 for better skip performance and faster index construction times	2025-09-08 18:02:34 +02:00
Viktor Lofgren	edd453531e	(index) Partition keyword lexicons by language	2025-09-04 17:24:48 +02:00
Viktor Lofgren	8ca6209260	(refac) Fold ft-anchor-keywords into converting-process	2025-09-03 13:03:38 +02:00
Viktor Lofgren	673c65d3c9	(refac) Fold term-frequency-dict into language-processing	2025-09-03 12:59:10 +02:00
Viktor Lofgren	1979870ce4	(refac) Merge index-forward, index-reverse, index/query into index The project has too many submodules, and it's a bit of a headache to navigate.	2025-09-02 12:30:42 +02:00
Viktor Lofgren	f0741142a3	(refac) Move keyword extraction into language processing	2025-08-29 10:55:47 +02:00
Viktor Lofgren	ea99b62356	(build) Fix missing junit engine version	2025-08-16 11:01:32 +02:00
Viktor Lofgren	291ff0c4de	(deps) Upgrade crawler commons to fix robots.txt-parser bug	2025-08-15 00:13:15 +02:00
Viktor Lofgren	4a98a3c711	(skiplist) Move to a separate directory instead of in the btree module	2025-08-14 01:09:46 +02:00
Viktor Lofgren	ca283f9684	(native) Clean up native helpers and break them into their own library	2025-08-10 20:55:34 +02:00
Viktor Lofgren	673b0d3de1	(index) Perf test tool (WIP!)	2025-07-26 11:49:31 +02:00
Viktor Lofgren	62b696b1c3	(architecture) Remove the separate executor service and merge it into the index service The primary motivation for this is that in production, the large number of partitioned services has led to an intermittent exhaustion of available database connections, as each service has a connection pool. The decision to have a separate executor service dates back from when the index service was very slow to start, and the executor didn't always spin off its memory-hungry tasks into separate processes, which meant the executor would sometimes OOM and crash, and it was undesirable to bring the index down with it.	2025-07-23 12:57:13 +02:00
Viktor Lofgren	556d7af9dc	Reapply "(grpc) Use grpc-netty instead of grpc-netty-shaded" This reverts commit `b7a5219ed3`.	2025-07-21 13:23:32 +02:00
Viktor Lofgren	b7a5219ed3	Revert "(grpc) Use grpc-netty instead of grpc-netty-shaded" Reverting this change to see if it's the cause of some instability issues observed.	2025-07-21 13:10:41 +02:00
Viktor	3b2ac414dc	Merge pull request #210 from MarginaliaSearch/ads-fingerprinting Implement advertisement and popover identification based on DOM sample data	2025-07-21 12:25:31 +02:00
Viktor Lofgren	12c304289a	(grpc) Use grpc-netty instead of grpc-netty-shaded This will help reduce runaway thread pool sizes	2025-07-20 17:36:25 +02:00
Viktor Lofgren	f38daeb036	(WIP) First stab at a GUI for viewing network traffic The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter. The change also adds a parser for DDG's tracker radar data.	2025-07-18 13:58:57 +02:00
Viktor Lofgren	52ff7fb4dd	(ndp) Add a process for adding new domains to be crawled This is a working "work in progress" commit, will need more refinement, but given the usual difficulties in testing crawler-adjacent code without actually crawling, it needs some maturation time in production.	2025-06-21 14:10:27 +02:00
Viktor Lofgren	82456ad673	(coordination) Trial the use of zookeeper for coordinating semaphores across multiple crawler-like processes The performance implication of this needs to be evaluated. If it does not hold water, some other solution may be required instead.	2025-06-14 16:16:10 +02:00
Viktor Lofgren	d6d5467696	(ping) Add domain pinging service	2025-06-10 18:28:13 +02:00
Viktor Lofgren	4e1595c1a6	(nsfw) Initial work on adding UT1-based domain filtering	2025-06-06 14:23:37 +02:00
Viktor Lofgren	edf382e1c5	(website-capture) Add a custom docker image with a new custom extension for DOM capture The original approach of injecting javascript into the page directly didn't work with pages that reloaded themselves. To work around this, a chrome extension is used instead that does the same work, but subscribes to reload events and re-installs the change listener.	2025-05-21 14:13:54 +02:00
Viktor Lofgren	5aef844f0d	(dependency) Increase slop version to 0.0.11 v0.0.11 uses atomic moves. This ensures we don't encounter a race condition in the backup service with lingering .tmp-files that should have been renamed.	2025-05-12 14:09:16 +02:00
Viktor Lofgren	36889950e8	(pdf) Migrate to PDFBox 3.0.5 and suppress log spam PDFBox 2.x uses commons logging, which does not route through SLF4j, and thus is a hassle to configure; and is extremely verbose in its default logging settings. Migrating to PDFBox 3.x lets us use slf4j to address the log spam by filtering out the noisy methods.	2025-05-08 18:03:26 +02:00
Viktor Lofgren	0819d46f97	(pdf) Minimal prototype to get PDFs working	2025-05-08 18:03:26 +02:00
Viktor Lofgren	d7c4c5141f	(crawler) Migrate to Apache HttpClient for crawler The previously used Java HttpClient seems unsuitable for crawler usage, which led to issues like send()-operations sometimes hanging forever, with clunky workarounds such as running each send operation in a separate Future that can be cancelled on a timeout. It has too many assumptions that break, and fails to adequately expose the inner workings of the connection pool to a degree that makes it possible to configure in a satisfactory manner. Apache's HttpClient solves all these problems. The change also includes a new battery of tests for the HttpFetcher, and refactors the retriever class a bit to move stuff into the HttpFetcher, leading to a better separation of concerns.	2025-04-17 12:51:08 +02:00
Viktor Lofgren	cfd4712191	(favicon) Add capability for fetching favicons	2025-03-21 13:38:58 +01:00
Viktor Lofgren	f076d05595	(deps) Upgrade slf4j to latest	2025-02-15 12:50:16 +01:00
Viktor Lofgren	fbba392491	(live-capture) Send a UA-string from the browserless fetcher as well The change also introduces a somewhat convoluted wiremock test to intercept and verify that these headers are in fact sent	2025-02-04 13:36:49 +01:00
Viktor Lofgren	c8b0a32c0f	(crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams	2025-01-26 15:40:17 +01:00
Viktor Lofgren	55d6ab933f	Merge branch 'master' into slop-crawl-data-spike	2025-01-21 13:32:58 +01:00
Viktor Lofgren	2315fdc731	(search) Vendor rssreader and modify it to be able to consume the nlnet atom feed Also dial down the logging a bit for the rssreader package.	2025-01-06 17:58:50 +01:00
Viktor Lofgren	1b27c5cf06	(search) Add a copy of the old UI as a separate service, `search-service-legacy`	2025-01-02 18:02:17 +01:00
Viktor Lofgren	47e58a21c6	Refactor documentBody method and ContentType charset handling Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.	2024-12-17 17:11:37 +01:00
Viktor Lofgren	3714104976	Add loader for slop data in converter. Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.	2024-12-17 15:40:24 +01:00
Viktor Lofgren	f6f036b9b1	Switch to new Slop format for crawl data storage and processing. Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.	2024-12-15 19:34:03 +01:00
Viktor Lofgren	b510b7feb8	Spike for storing crawl data in slop instead of parquet This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds. On disk size is virtually identical.	2024-12-15 15:49:47 +01:00
Viktor Lofgren	fdee07048d	(search) Remove Spark and migrate to Jooby for the search service	2024-12-10 19:13:13 +01:00
Viktor Lofgren	f050bf5c4c	(WIP) Initial semi-working transformation to new tailwind UI Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod. There's also a lot of polish remaining everywhere, dead links, etc.	2024-12-05 14:00:17 +01:00
Viktor Lofgren	51e46ad2b0	(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx. While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.	2024-11-21 16:00:09 +01:00
Viktor Lofgren	89d8af640d	(live-crawl) Rename the live crawler code module to be more consistent with the other processes	2024-11-20 15:55:15 +01:00
Viktor Lofgren	a91ab4c203	(live-crawler) Crude first-try process for live crawling #WIP Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.	2024-11-19 19:35:01 +01:00
Viktor Lofgren	3791ea1e18	(service) Add a new application service for external liveness monitoring The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.	2024-11-17 18:01:08 +01:00
Viktor Lofgren	9f47ce8d15	(chore) Remove lombok There are likely some instances of delombok gore with this commit.	2024-11-11 21:14:38 +01:00
Viktor Lofgren	bfeb9a4538	(feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service	2024-11-09 17:56:43 +01:00
Viktor Lofgren	23cce0c78a	Add a new function 'Live Capture' for on-demand screenshot capture The screenshots are requested by the site-service, and triggered via the site-info view.	2024-09-27 13:46:34 +02:00
Viktor Lofgren	f78ef36cd4	(slop) Upgrade to 0.0.8, add encodings to string columns.	2024-09-04 15:19:00 +02:00
Viktor Lofgren	266d6e4bea	(slop) Replace SlopPageRef<T> with SlopTable.Ref<T>	2024-08-21 10:13:49 +02:00
Viktor Lofgren	b0a874a842	(*) Upgrade slop library -> 0.0.5	2024-08-18 11:05:27 +02:00

152 Commits