mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

1096 Commits

Author SHA1 Message Date
Viktor Lofgren
ecb0e57a1a (crawler) Make the use of virtual threads in the crawler configurable via system properties 2025-03-27 21:26:05 +01:00
Viktor Lofgren
8c61f61b46 (crawler) Add crawling metadata to domainstate db 2025-03-27 16:38:37 +01:00
Viktor Lofgren
662a18c933 Revert "(crawler) Further rearrange crawl order"
This reverts commit 1c2426a052.

The change does not appear necessary to avoid problems.
2025-03-27 11:25:08 +01:00
Viktor Lofgren
1c2426a052 (crawler) Further rearrange crawl order
Limit crawl order preference to edu domains, to avoid hitting stuff like medium and wordpress with shotgun requests.
2025-03-27 11:19:20 +01:00
Viktor Lofgren
34df7441ac (crawler) Add some jitter to crawl delay to avoid accidentally synchronized requests 2025-03-27 11:15:16 +01:00
Viktor Lofgren
5387e2bd80 (crawler) Adjust crawl order to get a better mixture of domains 2025-03-27 11:12:48 +01:00
Viktor Lofgren
0f3b24d0f8 (crawler) Evaluate virtual threads for the crawler
The change also alters SimpleBlockingThreadPool to add the option to use virtual threads instead of platform threads.
2025-03-27 11:02:21 +01:00
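The virtual-thread option mentioned in the commit above could be toggled roughly as follows; the class name and the `crawler.useVirtualThreads` property name are illustrative guesses, not the project's actual code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadPoolFactory {
    /** Returns an executor backed by virtual or platform threads,
     *  selected via a system property (property name is hypothetical). */
    public static ExecutorService create(int platformPoolSize) {
        boolean useVirtual = Boolean.getBoolean("crawler.useVirtualThreads");
        return useVirtual
                ? Executors.newVirtualThreadPerTaskExecutor() // Java 21+
                : Executors.newFixedThreadPool(platformPoolSize);
    }
}
```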
Viktor Lofgren
a732095d2a (crawler) Improve crawl task ordering
Further improve the ordering of the crawl tasks in order to ensure that potentially blocking tasks are enqueued as soon as possible.
2025-03-26 16:51:37 +01:00
Viktor Lofgren
6607f0112f (crawler) Improve how the crawler deals with interruptions
In some cases, threads would previously fail to terminate when interrupted.
2025-03-26 16:19:57 +01:00
Viktor Lofgren
4913730de9 (jdk) Upgrade to Java 24 2025-03-26 13:26:06 +01:00
Viktor Lofgren
1db64f9d56 (chore) Fix zookeeper test by upgrading zk image version.
Test suddenly broke due to the increasing entropy of the universe.
2025-03-26 11:47:14 +01:00
Viktor Lofgren
4dcff14498 (search) Improve contrast with light mode 2025-03-25 13:15:31 +01:00
Viktor Lofgren
426658f64e (search) Improve contrast with light mode 2025-03-25 11:54:54 +01:00
Viktor Lofgren
2181b22f05 (crawler) Change default maxConcurrentRequests to 512
This seems like a more sensible default after testing a bit.  May need local tuning.
2025-03-22 12:11:09 +01:00
Viktor Lofgren
42bd79a609 (crawler) Experimentally throttle the number of active retrievals to see how this affects the network performance
There's been some indications that request storms lead to buffer bloat and bad throughput.

This adds a configurable semaphore, by default permitting 100 active requests.
2025-03-22 11:50:37 +01:00
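The configurable semaphore described above can be sketched as follows; the default of 100 permits comes from the commit message, while the class and method names are illustrative:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class RequestThrottle {
    private final Semaphore permits;

    public RequestThrottle(int maxActiveRequests) {
        this.permits = new Semaphore(maxActiveRequests);
    }

    /** Runs a request, blocking while maxActiveRequests are already in flight. */
    public <T> T withPermit(Supplier<T> request) {
        permits.acquireUninterruptibly();
        try {
            return request.get();
        } finally {
            permits.release();
        }
    }
}
```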
Viktor Lofgren
b91c1e528a (favicon) Send dummy svg result when image is missing
This prevents the browser from rendering a "broken image" in this scenario.
2025-03-21 15:15:14 +01:00
Viktor Lofgren
b1130d7a04 (domainstatedb) Allow creation of disconnected db
This is required for executor services that do not have crawl data to still be able to initialize.
2025-03-21 14:59:36 +01:00
Viktor Lofgren
8364bcdc97 (favicon) Add favicons to the matchograms 2025-03-21 14:30:40 +01:00
Viktor Lofgren
626cab5fab (favicon) Add favicon to site overview 2025-03-21 14:15:23 +01:00
Viktor Lofgren
cfd4712191 (favicon) Add capability for fetching favicons 2025-03-21 13:38:58 +01:00
Viktor Lofgren
9f18ced73d (crawler) Improve deferred task behavior 2025-03-18 12:54:18 +01:00
Viktor Lofgren
18e91269ab (crawler) Improve deferred task behavior 2025-03-18 12:25:22 +01:00
Viktor Lofgren
e315ca5758 (search) Change icon for small web filter
The previous icon was of an irregular size and shifted the layout in an unaesthetic way.
2025-03-17 12:07:34 +01:00
Viktor Lofgren
3ceea17c1d (search) Adjustments to device detection in CSS
Use pointer:fine media query to better distinguish between mobile devices and PCs with a window in portrait orientation.

With this, we never show the mobile filtering functionality on desktop; and never show the touch-inaccessible minimized sidebar on mobile.
2025-03-17 12:04:34 +01:00
Viktor Lofgren
b34527c1a3 (search) Add small web filter for new UI 2025-03-17 11:39:19 +01:00
Viktor Lofgren
185bf28fca (crawler) Correct issue leading to parquet files not being correctly preconverted
Path.endsWith("str") != String.endsWith(".str")
2025-03-10 13:48:12 +01:00
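The one-liner above is worth unpacking: `Path.endsWith` matches whole path segments, while `String.endsWith` matches character suffixes, so testing a file extension against a `Path` silently never matches. A small demonstration (the file name is made up):

```java
import java.nio.file.Path;

public class PathSuffixDemo {
    public static void main(String[] args) {
        Path p = Path.of("crawl-data/1234.parquet");

        // Path.endsWith compares path segments, so a bare extension never matches:
        System.out.println(p.endsWith(".parquet"));                          // false
        // To test an extension, compare the file name as a String instead:
        System.out.println(p.getFileName().toString().endsWith(".parquet")); // true
    }
}
```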
Viktor Lofgren
78cc25584a (crawler) Add error logging when entering bad path for historical crawl data 2025-03-10 13:38:40 +01:00
Viktor Lofgren
62ba30bacf (common) Log info about metrics server 2025-03-10 13:12:39 +01:00
Viktor Lofgren
3bb84eb206 (common) Log info about metrics server 2025-03-10 13:03:48 +01:00
Viktor Lofgren
be7d13ccce (crawler) Correct task execution logic in crawler
The old behavior would flag domains as pending too soon, leading to them being omitted from execution if they were not immediately available to run.
2025-03-09 13:47:51 +01:00
Viktor Lofgren
8c088a7c0b (crawler) Remove custom thread factory
This was causing issues, and not really doing much of benefit.
2025-03-09 11:50:52 +01:00
Viktor Lofgren
ea9a642b9b (crawler) More effective task scheduling in the crawler
This should hopefully allow more threads to be busy
2025-03-09 11:44:59 +01:00
Viktor Lofgren
27f528af6a (search) Fix "Remove Javascript" toggle
A bug was introduced at some point where the special keyword for filtering on javascript was changed to special:scripts, from js:true/js:false.

Solves issue #155
2025-02-28 12:03:04 +01:00
Viktor Lofgren
20ca41ec95 (processed model) Use String columns instead of Txt columns for SlopDocumentRecord
It's very likely TxtStringColumn is the culprit of the bug seen in https://github.com/MarginaliaSearch/MarginaliaSearch/issues/154 where the wrong URL was shown for a search result.
2025-02-24 11:41:51 +01:00
Viktor Lofgren
7671f0d9e4 (search) Display message when no search results are found 2025-02-24 11:15:55 +01:00
Viktor Lofgren
44d6bc71b7 (assistant) Migrate to Jooby framework 2025-02-15 13:28:12 +01:00
Viktor Lofgren
9d302e2973 (assistant) Migrate to Jooby framework 2025-02-15 13:26:04 +01:00
Viktor Lofgren
f553701224 (assistant) Migrate to Jooby framework 2025-02-15 13:21:48 +01:00
Viktor Lofgren
f076d05595 (deps) Upgrade slf4j to latest 2025-02-15 12:50:16 +01:00
Viktor Lofgren
b513809710 (*) Stopgap fix for metrics server initialization errors bringing down services 2025-02-14 17:09:48 +01:00
Viktor Lofgren
7519b28e21 (search) Correct exception from misbehaving bots feeding invalid urls 2025-02-14 17:05:24 +01:00
Viktor Lofgren
3eac4dd57f (search) Correct exception in error handler when page is missing 2025-02-14 17:00:21 +01:00
Viktor Lofgren
4c2810720a (search) Add redirect handler for full URLs in the /site endpoint 2025-02-14 16:31:11 +01:00
Viktor Lofgren
8480ba8daa (live-capture) Code cleanup 2025-02-04 14:05:36 +01:00
Viktor Lofgren
fbba392491 (live-capture) Send a UA-string from the browserless fetcher as well
The change also introduces a somewhat convoluted wiremock test to intercept and verify that these headers are in fact sent
2025-02-04 13:36:49 +01:00
Viktor Lofgren
530eb35949 (update-rss) Do not fail the feed fetcher control actor if it takes a long time to complete. 2025-02-03 11:35:32 +01:00
Viktor Lofgren
c2dd2175a2 (search) Add new query expansion rule contracting WORD NUM pairs into WORD-NUM and WORDNUM 2025-02-01 13:13:30 +01:00
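A sketch of how such a WORD NUM contraction rule might work; the regex and method are illustrative, not the project's implementation:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordNumExpansion {
    private static final Pattern WORD_NUM = Pattern.compile("\\b([a-zA-Z]+) (\\d+)\\b");

    /** Expands a "WORD NUM" query into WORD-NUM and WORDNUM variants. */
    public static List<String> variants(String query) {
        Matcher m = WORD_NUM.matcher(query);
        if (!m.find()) return List.of(query);
        return List.of(query,
                m.replaceAll("$1-$2"),   // WORD-NUM
                m.replaceAll("$1$2"));   // WORDNUM
    }
}
```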
Viktor Lofgren
b8581b0f56 (crawler) Safe sanitization of headers during warc->slop conversion
The warc->slop converter was rejecting some items because they had headers that were representable in the Warc code's MessageHeader map implementation, but illegal in the HttpHeaders' implementation.

Fixing this by manually filtering these out.  Ostensibly the constructor has a filtering predicate, but this annoyingly runs too late and fails to prevent the problem.
2025-01-31 12:47:42 +01:00
Viktor Lofgren
2ea34767d8 (crawler) Use the response URL when resolving relative links
The crawler was incorrectly using the request URL as the base URL when resolving relative links.  This caused problems when encountering redirects.

For example, if we fetch /log, which redirects to /log/, and find links foo/ and bar/, these would resolve to /foo and /bar rather than /log/foo and /log/bar.
2025-01-31 12:40:13 +01:00
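The difference can be demonstrated with `java.net.URI`, whose `resolve` follows RFC 3986 reference resolution (the URLs are illustrative):

```java
import java.net.URI;

public class BaseUrlDemo {
    public static void main(String[] args) {
        URI requestUrl  = URI.create("https://example.com/log");   // what we requested
        URI responseUrl = URI.create("https://example.com/log/");  // where we were redirected

        // Resolving against the request URL drops the last path segment:
        System.out.println(requestUrl.resolve("foo/"));   // https://example.com/foo/
        // Resolving against the response URL keeps it:
        System.out.println(responseUrl.resolve("foo/"));  // https://example.com/log/foo/
    }
}
```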
Viktor Lofgren
e9af838231 (actor) Fix migration actor final steps 2025-01-30 11:48:21 +01:00
Viktor Lofgren
ae0cad47c4 (actor) Utility method for getting a json prototype for actor states
If we can hook this into the control gui somehow, it'll make for a nice QOL upgrade when manually interacting with the actors.
2025-01-29 15:20:25 +01:00
Viktor Lofgren
5fbc8ef998 (misc) Tidying 2025-01-29 15:17:04 +01:00
Viktor Lofgren
32c6dd9e6a (actor) Delete old data in the migration actor 2025-01-29 14:51:46 +01:00
Viktor Lofgren
6ece6a6cfb (actor) Improve resilience for the migration actor 2025-01-29 14:43:09 +01:00
Viktor Lofgren
39cd1c18f8 Automatically run npm install tailwindcss@3 via setup.sh, as the new default version of the package is incompatible with the project 2025-01-29 12:21:08 +01:00
Viktor
eb65daaa88 Merge pull request #151 from Lionstiger/master
fix small grammar error in footerLegal.jte
2025-01-28 21:49:50 +01:00
Viktor
0bebdb6e33 Merge branch 'master' into master 2025-01-28 21:49:36 +01:00
Viktor Lofgren
1e50e392c6 (actor) Improve logging and error handling for data migration actor 2025-01-28 15:34:36 +01:00
Viktor Lofgren
fb673de370 (crawler) Change the header 'User-agent' to 'User-Agent' 2025-01-28 15:34:16 +01:00
Viktor Lofgren
eee73ab16c (crawler) Be more lenient when performing a domain probe 2025-01-28 15:24:30 +01:00
Viktor Lofgren
5354e034bf (search) Minor grammar fix 2025-01-27 18:36:31 +01:00
Magnus Wulf
72384ad6ca fix small grammar error 2025-01-27 15:04:57 +01:00
Viktor Lofgren
a2b076f9be (converter) Add progress tracking for big domains in converter 2025-01-26 18:03:59 +01:00
Viktor Lofgren
c8b0a32c0f (crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams 2025-01-26 15:40:17 +01:00
Viktor Lofgren
f0d74aa3bb (converter) Fix close() ordering to prevent converter crash 2025-01-26 14:47:36 +01:00
Viktor Lofgren
74a1f100f4 (converter) Refactor to remove CrawledDomainReader and move its functionality into SerializableCrawlDataStream 2025-01-26 14:46:50 +01:00
Viktor Lofgren
eb049658e4 (converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents
Refactor to do this without introducing additional copies
2025-01-26 14:28:53 +01:00
Viktor Lofgren
db138b2a6f (converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents 2025-01-26 14:25:57 +01:00
Viktor Lofgren
1673fc284c (converter) Reduce lock contention in converter by separating the processing of full and simple-track domains 2025-01-26 13:21:46 +01:00
Viktor Lofgren
503ea57d5b (converter) Reduce lock contention in converter by separating the processing of full and simple-track domains 2025-01-26 13:18:14 +01:00
Viktor Lofgren
18ca926c7f (converter) Truncate excessively long strings in SentenceExtractor, malformed data was effectively DOS:ing the converter 2025-01-26 12:52:54 +01:00
Viktor Lofgren
db99242db2 (converter) Adding some logging around the simple processing track to investigate an issue with the converter stalling 2025-01-26 12:02:00 +01:00
Viktor Lofgren
2b9d2985ba (doc) Update readme with up-to-date install instructions. 2025-01-24 18:51:41 +01:00
Viktor Lofgren
eeb6ecd711 (search) Make it clearer that the affiliate marker applies to the result, and not the search engine's relation to the result. 2025-01-24 18:50:00 +01:00
Viktor Lofgren
1f58aeadbf (build) Upgrade JIB 2025-01-24 18:49:28 +01:00
Viktor Lofgren
3d68be64da (crawler) Add default CT when it's missing for icons 2025-01-22 13:55:47 +01:00
Viktor Lofgren
668f3b16ef (search) Redirect ^/site/$ to /site 2025-01-22 13:35:18 +01:00
Viktor Lofgren
98a340a0d1 (crawler) Add favicon data to domain state db in its own table 2025-01-22 11:41:20 +01:00
Viktor Lofgren
8862100f7e (crawler) Improve logging and error handling 2025-01-21 21:44:21 +01:00
Viktor Lofgren
274941f6de (crawler) Smarter parquet->slop crawl data migration 2025-01-21 21:26:12 +01:00
Viktor Lofgren
abec83582d Fix refactoring gore 2025-01-21 15:08:04 +01:00
Viktor Lofgren
569520c9b6 (index) Add manual adjustments for rankings based on domain 2025-01-21 15:07:43 +01:00
Viktor Lofgren
088310e998 (converter) Improve simple processing performance
There was a regression introduced in the recent slop migration changes in the performance of the simple conversion track.  This fixes the regression.
2025-01-21 14:13:33 +01:00
Viktor
270cab874b Merge pull request #134 from MarginaliaSearch/slop-crawl-data-spike
Store crawl data in slop instead of parquet
2025-01-21 13:34:22 +01:00
Viktor Lofgren
4c74e280d3 (crawler) Fix urlencoding in sitemap fetcher 2025-01-21 13:33:35 +01:00
Viktor Lofgren
5b347e17ac (crawler) Automatically migrate to slop from parquet when crawling 2025-01-21 13:33:14 +01:00
Viktor Lofgren
55d6ab933f Merge branch 'master' into slop-crawl-data-spike 2025-01-21 13:32:58 +01:00
Viktor Lofgren
43b74e9706 (crawler) Fix exception handler and resource leak in WarcRecorder 2025-01-20 23:45:28 +01:00
Viktor Lofgren
579a115243 (crawler) Reduce log spam from error handling in new sitemap fetcher 2025-01-20 23:17:13 +01:00
Viktor
2c67f50a43 Merge pull request #150 from MarginaliaSearch/httpclient-in-crawler
Reduce the use of 3rd party code in the crawler
2025-01-20 19:35:30 +01:00
Viktor Lofgren
78a958e2b0 (crawler) Fix broken test that started failing after the search engine moved to a new domain 2025-01-20 18:52:14 +01:00
Viktor Lofgren
4e939389b2 (crawler) New Jsoup based sitemap parser 2025-01-20 14:37:44 +01:00
Viktor Lofgren
e67a9bdb91 (crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead. 2025-01-19 15:07:11 +01:00
Viktor Lofgren
567e4e1237 (crawler) Fast detection and bail-out for crawler traps
Improve logging and exclude robots.txt from this logic.
2025-01-18 15:28:54 +01:00
Viktor Lofgren
4342e42722 (crawler) Fast detection and bail-out for crawler traps
Nepenthes has been doing the rounds on social media; this adds an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly.  Out of the box crawl limits will also deal with this type of attack, but this fix is faster.
2025-01-17 13:02:57 +01:00
Viktor Lofgren
bc818056e6 (run) Fix templates for mariadb
Apparently the docker image contract changed at some point, and now we should spawn mariadbd and not mysqld; mariadb-admin and not mysqladmin.
2025-01-16 15:27:02 +01:00
Viktor Lofgren
de2feac238 (chore) Upgrade jib from 3.4.3 to 3.4.4 2025-01-16 15:10:45 +01:00
Viktor Lofgren
1e770205a5 (search) Dyslexia fix 2025-01-12 20:40:14 +01:00
Viktor
e44ecd6d69 Merge pull request #149 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2025-01-12 20:38:36 +01:00
Viktor
5b93a0e633 Update ROADMAP.md 2025-01-12 20:38:11 +01:00
Viktor
08fb0e5efe Update ROADMAP.md 2025-01-12 20:37:43 +01:00
Viktor
bcf67782ea Update ROADMAP.md 2025-01-12 20:37:09 +01:00
Viktor Lofgren
ef3f175ede (search) Don't clobber the search query URL with default values 2025-01-10 15:57:30 +01:00
Viktor Lofgren
bbe4b5d9fd Revert experimental changes 2025-01-10 15:52:02 +01:00
Viktor Lofgren
c67a635103 (search, experimental) Add a few debugging tracks to the search UI 2025-01-10 15:44:44 +01:00
Viktor Lofgren
20b24133fb (search, experimental) Add a few debugging tracks to the search UI 2025-01-10 15:34:48 +01:00
Viktor Lofgren
f2567677e8 (index-client) Clean up index client code
Improve error handling.  This should be a relatively rare case, but we don't want one bad index partition to blow up the entire query.
2025-01-10 15:17:07 +01:00
Viktor Lofgren
bc2c2061f2 (index-client) Clean up index client code
This should have the rpc stream reception be performed in parallel in separate threads, rather than blocking sequentially in the main thread, hopefully giving a slight performance boost.
2025-01-10 15:14:42 +01:00
Viktor Lofgren
1c7f5a31a5 (search) Further reduce the number of db queries by adding more caching to DbDomainQueries. 2025-01-10 14:17:29 +01:00
Viktor Lofgren
59a8ea60f7 (search) Further reduce the number of db queries by adding more caching to DbDomainQueries. 2025-01-10 14:15:22 +01:00
Viktor Lofgren
aa9b1244ea (search) Reduce the number of db queries a bit by caching data that doesn't change too often 2025-01-10 13:56:04 +01:00
Viktor Lofgren
2d17233366 (search) Reduce the number of db queries a bit by caching data that doesn't change too often 2025-01-10 13:53:56 +01:00
Viktor Lofgren
b245cc9f38 (search) Reduce the number of db queries a bit by caching data that doesn't change too often 2025-01-10 13:46:19 +01:00
Viktor Lofgren
6614d05bdf (db) Make db pool size configurable 2025-01-09 20:20:51 +01:00
Viktor Lofgren
55aeb03c4a (feeds) Replace rssreader based parsing with a custom jsoup based rss parser
This solves some issues with the rssreader based parser, which was very picky about the XML being valid.  Jsoup is much more lenient when parsing malformed XML.
2025-01-09 18:29:55 +01:00
Viktor Lofgren
faa589962f (live-capture) Browserless now requires a token 2025-01-09 14:51:11 +01:00
Viktor Lofgren
c7edd6b39f (live-capture) Browserless now requires a token 2025-01-09 14:46:05 +01:00
Viktor Lofgren
79da622e3b (search) Update front page with new banner about move 2025-01-08 21:38:19 +01:00
Viktor Lofgren
3da8337ba6 (feeds) Add system property for exporting fetched feeds to a slop table for debugging 2025-01-08 20:49:16 +01:00
Viktor Lofgren
a32d230f0a (special) Trigger deployment 2025-01-08 20:07:54 +01:00
Viktor Lofgren
3772bfd387 (query) Fix handling of optional ranking parameters 2025-01-08 17:11:22 +01:00
Viktor Lofgren
02a7900d1a (search) Correct search-in-title toggle in search UI 2025-01-08 16:51:10 +01:00
Viktor Lofgren
a1fb92468f (refac) Remove ResultRankingParameters, QueryLimits class and use protobuf classes directly instead
This is primarily to make the code a bit easier to reason about, and will reduce the level of indirection and data copying in the search-service->query-service->index-service communication chain.
2025-01-08 16:15:57 +01:00
Viktor Lofgren
b7f0a2a98e (search-service) Fix metrics for errors and request times
This was previously in place, but broke during the jooby migration.
2025-01-08 14:10:43 +01:00
Viktor Lofgren
5fb76b2e79 (search-service) Fix metrics for errors and request times
This was previously in place, but broke during the jooby migration.
2025-01-08 14:06:03 +01:00
Viktor Lofgren
ad8c97f342 (search-service) Begin replacement of the crawl queue mechanism with node_affinity flagging
Previously a special db table was used to hold domains slated for crawling, but this is deprecated.  Instead each domain now has a node_affinity flag that decides its indexing state: a value of -1 indicates it shouldn't be crawled, a value of 0 means it's slated for crawling by the next index partition to be crawled, and a positive value means it's assigned to an index partition.

The change set also adds a test case validating the modified behavior.
2025-01-08 13:25:56 +01:00
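The node_affinity semantics described in the commit message can be sketched as follows; the class and method names are hypothetical, not the project's actual code:

```java
public class NodeAffinity {
    /** Interprets the node_affinity value per the scheme in the commit message. */
    public static String describe(int nodeAffinity) {
        if (nodeAffinity < 0) return "not to be crawled";
        if (nodeAffinity == 0) return "slated for the next index partition that crawls";
        return "assigned to index partition " + nodeAffinity;
    }
}
```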
Viktor Lofgren
dc1b6373eb (search-service) Clean up readme 2025-01-08 13:04:39 +01:00
Viktor Lofgren
983d6d067c (search-service) Add indexing indicator to sibling domains listing 2025-01-08 12:58:34 +01:00
Viktor Lofgren
a84a06975c (ranking-params) Add disable penalties flag to ranking params
This will help debugging ranking issues.  Later it may be added to some filters.
2025-01-08 00:16:49 +01:00
Viktor Lofgren
d2864c13ec (query-params) Add additional permitted query params 2025-01-07 20:21:44 +01:00
Viktor Lofgren
03ba53ce51 (legacy-search) Update nav bar with correct links 2025-01-07 17:44:52 +01:00
Viktor Lofgren
d4a6684931 (specialization) Soften length requirements for wiki-specialized documents (incl. cppreference) 2025-01-07 15:53:25 +01:00
Viktor
6f0485287a Merge pull request #145 from MarginaliaSearch/cppreference_fixes
Cppreference fixes
2025-01-07 15:43:19 +01:00
Viktor Lofgren
59e2dd4c26 (specialization) Soften length requirements for wiki-specialized documents (incl. cppreference) 2025-01-07 15:41:30 +01:00
Viktor Lofgren
ca1807caae (specialization) Add new specialization for cppreference.com
Give this reference website some synthetically generated tokens to improve the likelihood of a good match.
2025-01-07 15:41:05 +01:00
Viktor Lofgren
26c20e18ac (keyword-extraction) Soften constraints on keyword patterns, allowing for longer segmented words 2025-01-07 15:20:50 +01:00
Viktor Lofgren
7c90b6b414 (query) Don't blindly make tokens containing a colon into a non-ranking advice term 2025-01-07 15:18:05 +01:00
Viktor Lofgren
b63c54c4ce (search) Update opensearch.xml to point to non-redirecting domains. 2025-01-07 00:23:09 +01:00
Viktor Lofgren
fecd2f4ec3 (deploy) Add legacy search service to deploy script 2025-01-07 00:21:13 +01:00
Viktor Lofgren
39e420de88 (search) Add wayback machine link to siteinfo 2025-01-06 20:33:10 +01:00
Viktor Lofgren
dc83619861 (rssreader) Further suppress logging 2025-01-06 20:20:37 +01:00
Viktor Lofgren
87d1c89701 (search) Add listing of sibling subdomains to site overview 2025-01-06 20:17:36 +01:00
Viktor Lofgren
a42a7769e2 (legacy-search) Remove legacy paperdoll class 2025-01-06 20:17:36 +01:00
Viktor
202bda884f Update readme.md
Add note about installing tailwindcss via npm
2025-01-06 18:35:13 +01:00
Viktor Lofgren
2315fdc731 (search) Vendor rssreader and modify it to be able to consume the nlnet atom feed
Also dial down the logging a bit for the rssreader package.
2025-01-06 17:58:50 +01:00
Viktor Lofgren
b5469bd8a1 (search) Turn relative feed URLs absolute when dealing with RSS/Atom item URLs 2025-01-06 16:56:24 +01:00
Viktor Lofgren
6a6318d04c (search) Add separate websiteUrl property to legacy service 2025-01-06 16:26:08 +01:00
Viktor Lofgren
55933f8d40 (search) Ensure we respect old URL contracts
/explore/random should be equivalent to /explore
2025-01-06 16:20:53 +01:00
Viktor
be6382e0d0 Merge pull request #127 from MarginaliaSearch/serp-redesign
Web UI redesign
2025-01-06 16:08:14 +01:00
Viktor Lofgren
45e771f96b (api) Update the / API redirect to the new documentation stub. 2025-01-06 16:07:32 +01:00
Viktor Lofgren
8dde502cc9 Merge branch 'master' into serp-redesign 2025-01-05 23:33:35 +01:00
Viktor Lofgren
3e66767af3 (search) Adjust query parsing to trim tokens in quoted search terms
Quoted search queries that contained keywords with possessive 's endings were not returning any results, as the index does not retain that suffix, and the query parser was not stripping it away in this code path.

This solves issue #143.
2025-01-05 23:33:09 +01:00
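A minimal sketch of the possessive-suffix stripping described above; illustrative only, not the project's query parser code:

```java
public class QuoteTokenTrim {
    /** Strips a possessive 's ending from a quoted search token,
     *  mirroring what the index drops at build time. */
    public static String trimPossessive(String token) {
        if (token.endsWith("'s"))
            return token.substring(0, token.length() - 2);
        return token;
    }
}
```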
Viktor Lofgren
9ec9d1b338 Merge branch 'master' into serp-redesign 2025-01-05 21:10:20 +01:00
Viktor Lofgren
dcad0d7863 (search) Tweak token formation. 2025-01-05 21:01:09 +01:00
Viktor Lofgren
94e1aa0baf (search) Tweak token formation to still break apart emails in brackets. 2025-01-05 20:55:44 +01:00
Viktor Lofgren
b62f043910 (search) Adjust token formation rules to be more lenient to C++ and PHP code.
This addresses Issue #142
2025-01-05 20:50:27 +01:00
Viktor Lofgren
6ea22d0d21 (search) Update front page with work-in-progress note 2025-01-05 19:08:02 +01:00
Viktor Lofgren
8c69dc31b8 Merge branch 'master' into serp-redesign 2025-01-05 18:52:51 +01:00
Viktor Lofgren
00734ea87f (search) Add hover text for matchogram 2025-01-05 18:50:44 +01:00
Viktor Lofgren
3009713db4 (search) Fix broken tests 2025-01-05 18:50:27 +01:00
Viktor
9b2ceaf37c Merge pull request #141 from MarginaliaSearch/vlofgren-patch-1
Update FUNDING.yml
2025-01-05 18:40:20 +01:00
Viktor
8019c2ce18 Update FUNDING.yml 2025-01-05 18:40:06 +01:00
Viktor Lofgren
a9e312b8b1 (service) Add links to marginalia-search.com where appropriate 2025-01-05 16:56:38 +01:00
Viktor Lofgren
4da3563d8a (service) Clean up exceptions when requestScreengrab is not available 2025-01-04 14:45:51 +01:00
Viktor Lofgren
48d0a3089a (service) Improve logging around grpc
This change adds a marker for the gRPC-specific logging, as well as improves the clarity and meaningfulness of the log messages.
2025-01-02 20:40:53 +01:00
Viktor Lofgren
594df64b20 (domain-info) Use appropriate sqlite database when fetching feed status 2025-01-02 20:20:36 +01:00
Viktor Lofgren
06efb5abfc Merge branch 'master' into serp-redesign 2025-01-02 18:42:12 +01:00
Viktor Lofgren
78eb1417a7 (service) Only block on SingleNodeChannelPool creation in QueryClient
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.

This should make most services start faster as a result.
2025-01-02 18:42:01 +01:00
Viktor Lofgren
8c8f2ad5ee (search) Add an indicator when a link has a feed in the similar/linked domains views 2025-01-02 18:11:57 +01:00
Viktor Lofgren
f71e79d10f (search) Add a copy of the old UI as a separate service, search-service-legacy 2025-01-02 18:03:42 +01:00
Viktor Lofgren
1b27c5cf06 (search) Add a copy of the old UI as a separate service, search-service-legacy 2025-01-02 18:02:17 +01:00
Viktor Lofgren
67edc8f90d (domain-info) Only flag domains with rss feed items as having a feed 2025-01-02 17:41:52 +01:00
Viktor Lofgren
5f576b7d0c (query-parser) Strip leading underlines
This addresses issue #140, where __builtin_ffs gives no results.
2025-01-02 14:39:03 +01:00
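The underline stripping might look roughly like this (an illustrative sketch with made-up names):

```java
public class LeadingUnderscoreTrim {
    /** Strips leading underscores so tokens like __builtin_ffs match indexed terms. */
    public static String strip(String token) {
        int i = 0;
        while (i < token.length() && token.charAt(i) == '_') i++;
        return token.substring(i);
    }
}
```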
Viktor Lofgren
8b05c788fd (Search) Enable gzip compression of responses 2025-01-01 18:34:42 +01:00
Viktor Lofgren
236f033bc9 (Search) Reduce whitespace in explore view on all resolutions 2025-01-01 18:23:35 +01:00
Viktor Lofgren
510fc75121 (Search) Reduce whitespace in explorer view on mobile 2025-01-01 18:18:09 +01:00
Viktor Lofgren
0376f2e6e3 Merge branch 'master' into serp-redesign
# Conflicts:
#	code/services-application/search-service/resources/templates/search/index/index.hdb
2025-01-01 18:15:09 +01:00
Viktor Lofgren
0b65164f60 (chore) Fix broken test 2025-01-01 18:06:29 +01:00
Viktor Lofgren
9be477de33 (domain-info) Add a feed flag to domain info
This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.
2025-01-01 18:02:33 +01:00
Viktor Lofgren
84f55b84ff (search) Add experimental OPML-export function for feed subscriptions 2025-01-01 17:17:54 +01:00
Viktor Lofgren
ab5c30ad51 (search) Fix site info view for completely unknown domains
Also correct the DbDomainQueries.getDomainId so that it throws NoSuchElementException when domain id is missing, and not UncheckedExecutionException via Cache.
2025-01-01 16:29:01 +01:00
Viktor Lofgren
0c839453c5 (search) Fix crosstalk link 2025-01-01 16:09:19 +01:00
Viktor Lofgren
5e4c5d03ae (search) Clean up breakpoints in site overview 2025-01-01 16:06:08 +01:00
Viktor Lofgren
710af4999a (feed-fetcher) Add " entity mapping in feed fetcher 2025-01-01 15:45:17 +01:00
Viktor Lofgren
a5b0a1ae62 (search) Move linked/similar domains to a popover style menu on mobile
Fix scroll
2025-01-01 15:37:35 +01:00
Viktor Lofgren
e9f71ee39b (search) Move linked/similar domains to a popover style menu on mobile 2025-01-01 15:23:25 +01:00
Viktor Lofgren
baeb4a46cd (search) Reintroduce query rewriting for recipes, add rules for wikis and forums 2024-12-31 16:05:00 +01:00
Viktor Lofgren
5e2a8e9f27 (deploy) Add capability of adding tags to deploy script 2024-12-31 16:04:13 +01:00
Viktor
cc1a5bdf90 Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-12-31 14:41:02 +01:00
Viktor
7f7b1ffaba Update ROADMAP.md 2024-12-31 14:40:34 +01:00
Viktor Lofgren
0ea8092350 (search) Add link promoting the redesign beta 2024-12-30 15:47:13 +01:00
Viktor Lofgren
483d29497e (deploy) Add hashbang to deploy script 2024-12-30 15:47:13 +01:00
Viktor Lofgren
bae44497fe (crawler) Add a new system property crawler.maxFetchSize
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
Viktor Lofgren
0d59202aca (crawler) Do not remove W/-prefix on weak e-tags
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
Viktor Lofgren
0ca43f0c9c (live-crawler) Improve live crawler short-circuit logic
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch!  This makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
Viktor Lofgren
3bc99639a0 (feed-fetcher) Make feed fetcher requests conditional
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests.  On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.

A new table was added to the FeedDb to hold one etag per domain.

If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.

This completes the changes for Issue #136.
2024-12-27 15:10:15 +01:00
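The conditional request might be assembled like this with Java's `HttpRequest` builder; the idea of one etag per domain and of using the feed db's creation date as the If-Modified-Since cutoff comes from the commit message, while the class and method names are illustrative:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class ConditionalFetch {
    /** Builds a conditional GET for a feed URL.  A 304 Not Modified
     *  response means the stored copy is still current. */
    public static HttpRequest build(URI feedUrl, String etag, ZonedDateTime dbCreated) {
        var builder = HttpRequest.newBuilder(feedUrl).GET();
        if (etag != null)
            builder.header("If-None-Match", etag);
        builder.header("If-Modified-Since",
                dbCreated.withZoneSameInstant(ZoneOffset.UTC)
                         .format(DateTimeFormatter.RFC_1123_DATE_TIME));
        return builder.build();
    }
}
```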
Viktor Lofgren
927bc0b63c (live-crawler) Add Accept-Encoding: gzip to outbound requests
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.

The change addresses issue #136, save for making the fetcher's requests conditional.
2024-12-27 03:59:34 +01:00
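The decoding side of this might look as follows; an illustrative sketch, not the project's code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBody {
    /** Reads a response body, decompressing when Content-Encoding is gzip. */
    public static byte[] decode(InputStream body, String contentEncoding) {
        try {
            InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                    ? new GZIPInputStream(body)
                    : body;
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for demonstration: gzip-compress a byte array. */
    public static byte[] gzip(byte[] data) {
        var out = new ByteArrayOutputStream();
        try (var gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```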
Viktor Lofgren
d968801dc1 (converter) Drop feed data from SlopDomainRecord
Also remove feed extraction from converter.  This is the crawler's responsibility now.
2024-12-26 17:57:08 +01:00
Viktor Lofgren
89db69d360 (crawler) Correct feed URLs in domain state db
Discovered feed URLs were given a double slash after their domain name in the DB.  This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
2024-12-26 15:18:31 +01:00
Viktor Lofgren
895cee7004 (crawler) Improved feed discovery, new domain state db per crawlset
Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided.  To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.

Solves issue #135
2024-12-26 15:05:52 +01:00
Viktor Lofgren
4bb71b8439 (crawler) Correct content type probing to only run on URLs that are suspected to be binary 2024-12-26 14:26:23 +01:00
Viktor Lofgren
e4a41f7dd1 (crawler) Correct content type probing to only run on URLs that are suspected to be binary 2024-12-26 14:13:17 +01:00
Viktor
69ad6287b1 Update ROADMAP.md 2024-12-25 21:16:38 +00:00
Viktor Lofgren
81cdd6385d Add rendering tests for most major views
This will prevent accidentally deploying a broken search service
2024-12-25 15:22:26 +01:00
Viktor Lofgren
e76c42329f Correct dark mode for infobox in site focused search 2024-12-25 15:06:05 +01:00
Viktor Lofgren
e6ef4734ea Fix tests 2024-12-25 15:05:41 +01:00
Viktor Lofgren
41a59dcf45 (feed) Sanitize illegal HTML entities out of the feed XML before parsing 2024-12-25 14:53:28 +01:00
Viktor Lofgren
df4bc1d7e9 Add update time to front page subscriptions 2024-12-25 14:42:00 +01:00
Viktor Lofgren
2b222efa75 Merge branch 'master' into serp-redesign 2024-12-25 14:22:42 +01:00
Viktor Lofgren
94d4d2edb7 (live-crawler) Add refresh date to feeds API
For now this is just the ctime for the feeds db.  We may want to store this per-record in the future.
2024-12-25 14:20:48 +01:00
Viktor Lofgren
7ae19a92ba (deploy) Improve deployment script to allow specification of partitions 2024-12-24 11:16:15 +01:00
Viktor Lofgren
56d14e56d7 (live-crawler) Improve LiveCrawlActor resilience to FeedService outages 2024-12-23 23:33:54 +01:00
Viktor Lofgren
a557c7ae7f (live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler 2024-12-23 23:31:03 +01:00
Viktor Lofgren
b66879ccb1 (feed) Add support for date discovery through atom:issued and atom:created
This is specifically to help parse monadnock.net's Atom feed.
2024-12-23 20:05:58 +01:00
Viktor Lofgren
f1b7157ca2 (deploy) Add basic linting ability to deployment script. 2024-12-23 16:21:29 +01:00
Viktor Lofgren
7622335e84 (deploy) Correct deploy script, set correct name for assistant 2024-12-23 15:59:02 +01:00
Viktor Lofgren
0da2047eae (live-capture) Correctly update processed count, disable poll rate adjustment based on freshness. 2024-12-23 15:56:27 +01:00
Viktor Lofgren
5ee4321110 (ci) Correct deploy script 2024-12-22 20:08:37 +01:00
Viktor Lofgren
9459b9933b (ci) Correct deploy script 2024-12-22 19:40:32 +01:00
Viktor Lofgren
87fb564f89 (ci) Add script for automatic deployment based on git tags 2024-12-22 19:24:54 +01:00
Viktor Lofgren
5ca8523220 (math) Reduce log error spam from null unit conversions 2024-12-21 18:51:45 +01:00
Viktor Lofgren
1118657ffd (system) Supply local IP to service discovery if multiFace is enabled 2024-12-19 22:20:19 +01:00
Viktor Lofgren
b1f970152d (system) To support configurations with multiple docker networks, bind to the "most local" interface.
Make the behavior optional.
2024-12-19 20:26:31 +01:00
Viktor Lofgren
e1783891ab (system) To support configurations with multiple docker networks, bind to the "most local" interface. 2024-12-19 20:18:57 +01:00
Viktor Lofgren
64d32471dd (deploy) Deploy executor test 2024-12-19 17:45:47 +01:00
Viktor Lofgren
232cc465d9 (deploy) Deploy executor test 2024-12-19 17:35:38 +01:00
Viktor Lofgren
8c963bd4ba (feeds) Remove Content-Encoding: gzip from feed fetcher
We don't support decompressing gzip, so this just gives us errors at this point should the server support it.
2024-12-18 22:23:44 +01:00
Viktor Lofgren
6a079c1c75 (feeds) Add per-domain throttling for feed fetcher. 2024-12-18 22:06:46 +01:00
Viktor Lofgren
2dc9f2e639 (feeds) Make feed XML parsing more lenient
... by consuming BOM markers and leading whitespace.
2024-12-18 17:18:41 +01:00
Viktor Lofgren
b66fb9caf6 (feeds) Improve error handling in the feed fetcher. 2024-12-18 17:02:13 +01:00
Viktor Lofgren
6d18e6d840 (search) Add clustering to subscriptions view 2024-12-18 15:36:05 +01:00
Viktor Lofgren
2a3c63f209 (search) Exclude generated style.css from git 2024-12-18 15:24:31 +01:00
Viktor Lofgren
9f70cecaef (search) Add site subscription feature that puts RSS updates on the front page 2024-12-18 15:24:31 +01:00
Viktor Lofgren
47e58a21c6 Refactor documentBody method and ContentType charset handling
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
2024-12-17 17:11:37 +01:00
Viktor Lofgren
3714104976 Add loader for slop data in converter.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data.  This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00
Viktor Lofgren
f6f036b9b1 Switch to new Slop format for crawl data storage and processing.
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
2024-12-15 19:34:03 +01:00
Viktor Lofgren
b510b7feb8 Spike for storing crawl data in slop instead of parquet
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds.  On disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
c08203e2ed (search) Prevent paperdoll from being run as a test by CI 2024-12-14 20:35:57 +01:00
Viktor Lofgren
86497fd32f (site-info) Mobile layout fix 2024-12-14 16:19:56 +01:00
Viktor Lofgren
3b998573fd Adjust colors on dark mode for site overview 2024-12-13 21:51:25 +01:00
Viktor Lofgren
e161882ec7 (search) Fix layout for light mode 2024-12-13 21:47:29 +01:00
Viktor Lofgren
357f349e30 (search) Table layout fixes for dictionary lookup 2024-12-13 21:47:08 +01:00
Viktor Lofgren
e4769f541d (search) Sort and deduplicate search results for better relevance.
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
2024-12-13 21:47:08 +01:00
Viktor Lofgren
2a173e2861 (search) Dark Mode 2024-12-13 21:47:07 +01:00
Viktor Lofgren
a6a900266c (search) Fix redirects 2024-12-13 02:40:51 +01:00
Viktor Lofgren
bdba53f055 (site) Update domain parameter type from PathParam to QueryParam 2024-12-13 02:15:35 +01:00
Viktor Lofgren
eb2fe18867 (sideload) Add LSH generation for sideloaded StackExchange data
Previously, the sideloader did not generate a locality-sensitive hashCode for document details.  This caused all documents from the same domain to be considered duplicates by the deduplication logic.
2024-12-13 02:10:52 +01:00
Viktor Lofgren
a7468c8d23 (converter) Ensure paths are created for converter batch writer 2024-12-13 01:35:07 +01:00
Viktor Lofgren
fb2beb1eac (converter) Fix data-loss bug where the converter writer would remove all but the last batch of processed data 2024-12-13 01:19:30 +01:00
Viktor Lofgren
0fb03e3d62 (export) Add logging to AtagExporter for error handling 2024-12-12 22:54:32 +01:00
Viktor Lofgren
67db3f295e (index) Revert some optimization changes 2024-12-12 22:14:24 +01:00
Viktor Lofgren
dafaab3ef7 (index) Additional optimization pass 2024-12-12 18:57:33 +01:00
Viktor Lofgren
3f11ca409f (index) Increase thread limit and optimize search result handling
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 17:07:06 +01:00
Viktor Lofgren
694eed79ef (index) Increase thread limit and optimize search result handling
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:32:31 +01:00
Viktor Lofgren
4220169119 (index) Increase thread limit and optimize search result handling
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:31:11 +01:00
Viktor Lofgren
bbdde789e7 Merge branch 'master' into serp-redesign 2024-12-11 19:45:17 +01:00
Viktor Lofgren
0a53ac68a0 Add specialization for steam store and GOG 2024-12-11 18:32:45 +01:00
Viktor Lofgren
eab61cd48a Merge branch 'master' into serp-redesign 2024-12-11 17:09:27 +01:00
Viktor Lofgren
e65d75a0f9 (crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets 2024-12-11 17:01:52 +01:00
Viktor Lofgren
3b99cffb3d (link-parser) Filter out URLs with binary file suffixes in LinkParser
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
2024-12-11 16:42:47 +01:00
Viktor Lofgren
a97c05107e Add synthetic meta flag for root path documents
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D
2024-12-11 16:10:44 +01:00
Viktor Lofgren
5002870d1f (converter) Refactor sideloaders to improve feature handling and keyword logic
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
2024-12-11 16:01:38 +01:00
Viktor Lofgren
73861e613f (ranking) Downtune score boost for unordered heading matches 2024-12-11 15:44:29 +01:00
Viktor Lofgren
0ce2ba9ad9 (jooby) Fix asset handler 2024-12-11 14:38:04 +01:00
Viktor Lofgren
3ddcebaa36 (search) Give serp/start a more consistent name to siteinfo/start
The change also cleans up the layout a bit.
2024-12-11 14:33:57 +01:00
Viktor Lofgren
b91463383e (jooby) Clean up initialization process 2024-12-11 14:33:18 +01:00
Viktor Lofgren
7444a2f36c (site-info) Add placeholder when a feed item lacks a title. 2024-12-10 22:46:12 +01:00
Viktor Lofgren
461bc3eb1a (generator) Add special workaround to flag fextralife as a wiki 2024-12-10 22:22:52 +01:00
Viktor Lofgren
cf7f84f033 (rank) Reduce the impact of domain rank bonus, and only apply it to cancel out negative penalties, never to increase the ranking 2024-12-10 22:04:12 +01:00
Viktor Lofgren
fdee07048d (search) Remove Spark and migrate to Jooby for the search service 2024-12-10 19:13:13 +01:00
Viktor Lofgren
2fbf201761 (search) Adjust crosstalk flex-basis 2024-12-10 15:12:51 +01:00
Viktor Lofgren
4018e4c434 (search) Add crosstalk to paperdoll 2024-12-10 15:12:39 +01:00
Viktor Lofgren
f3382b5bd8 (search) Completely remove all old hdb templates
Create new views for conversion results, dictionary results, and site crosstalk.
2024-12-10 15:04:49 +01:00
Viktor Lofgren
9fc82574f0 (fingerprint) Add FluxGarden as a wiki generator
#130
2024-12-10 13:51:42 +01:00
Viktor
589f4dafb9 Merge pull request #129 from MarginaliaSearch/atags-counts
(WIP) Improve atag sentence matching
2024-12-10 12:42:34 +00:00
Viktor Lofgren
c5d657ef98 (live-crawler) Flag live crawled documents with a special keyword 2024-12-10 13:42:10 +01:00
Viktor Lofgren
3c2bb566da (converter) Wipe the converter output path on initialization to avoid lingering stale data. 2024-12-10 13:41:05 +01:00
Viktor Lofgren
9287ee0141 (search) Improve hyphenation logic for titles 2024-12-09 15:29:10 +01:00
Viktor Lofgren
2769c8f869 (search) Remove sticky search bar to aid with performance on firefox (and iOS?) 2024-12-09 15:20:33 +01:00
Viktor Lofgren
ddb66f33ba (search) Add more feedback when pressing some buttons 2024-12-09 15:07:23 +01:00
Viktor Lofgren
79500b8fbc (search) Move search bar back up top on mobile, put filter button at the bottom instead. 2024-12-09 14:55:37 +01:00
Viktor Lofgren
187eea43a4 (search) Remove redundant @if 2024-12-09 14:46:02 +01:00
Viktor Lofgren
a89ed6fa9f (search) Fix rendering on site overview, more dense serp layout on mobile 2024-12-09 14:45:45 +01:00
Viktor Lofgren
e0c0ed27bc (keyword-extraction) Clean up code and add tests for position and spans calculation
This code has been a bit of a mess and historically significantly flaky, so some test coverage is more than overdue.
2024-12-08 14:14:52 +01:00
Viktor Lofgren
20abb91657 (loader) Correct DocumentLoaderService to properly do bulk inserts
Fixes issue #128
2024-12-08 13:12:52 +01:00
Viktor Lofgren
291ca8daf1 (converter/index) Improve atag sentence matching by taking into consideration how many times a sentence appears in the links
This change breaks the format of the atags.parquet file.
2024-12-08 00:27:11 +01:00
Viktor Lofgren
8d168be138 (search) Typeahead search, etc. 2024-12-07 15:47:01 +01:00
Viktor Lofgren
6e1aa7b391 (search) Make style.css depend on jte file changes
Also add a hack to ensure classes generated from java code get included in the stylesheet as intended.
2024-12-07 14:11:22 +01:00
Viktor Lofgren
deab9b9516 (search) Clean up start views for search and site-info 2024-12-07 14:11:22 +01:00
Viktor Lofgren
39d99a906a (search) Add proper tailwind build and host fontawesome locally 2024-12-07 14:11:22 +01:00
Viktor Lofgren
6f72e6e0d3 (explore) Add lazy loading and alt attributes to images 2024-12-07 14:11:22 +01:00
Viktor Lofgren
d786d79483 (site-info) Add whitespace-nowrap to pubDay span in overview.jte 2024-12-07 14:11:22 +01:00
Viktor Lofgren
01510f6c2e (serp) Add wayback link to search results 2024-12-07 14:11:22 +01:00
Viktor Lofgren
7ba43e9e3f (site) Adjust sizing of navbars 2024-12-07 14:11:16 +01:00
Viktor Lofgren
97bfcd1353 (site) Layout changes site-info 2024-12-07 14:11:16 +01:00
Viktor Lofgren
aa3c85c196 (site) Mobile layout fixes 2024-12-07 14:11:16 +01:00
Viktor Lofgren
ee2d5496d0 Revert "(experiment) Modify atags exporter to permit duplicates from different source domains"
This reverts commit 5c858a2b94.
2024-12-07 14:01:50 +01:00
Viktor Lofgren
5c858a2b94 (experiment) Modify atags exporter to permit duplicates from different source domains
This is an attempt to provide higher resolution term frequency data that will need evaluation when the data is processed.
2024-12-06 14:10:15 +01:00
Viktor Lofgren
fb75a3827d (site) Adjust coloration of search results 2024-12-05 16:58:00 +01:00
Viktor Lofgren
7d546d0e2a (site) Make SearchParameters generate relative URLs instead of absolute 2024-12-05 16:47:22 +01:00
Viktor Lofgren
8fcb6ffd7a (site-info) Increase contrast in search results for forums, wikis 2024-12-05 16:42:16 +01:00
Viktor Lofgren
f97de0c15a (site-info) Fix layout 2024-12-05 16:33:46 +01:00
Viktor Lofgren
be9e192b78 (site-info) Fix pagination in backlinks and documents views 2024-12-05 16:26:11 +01:00
Viktor Lofgren
75ae1c9526 (site-info) Do not show 'suggest for crawling' when the node affinity is already set to 0
This indicates the domain is already slated for crawling.
2024-12-05 16:18:46 +01:00
Viktor Lofgren
33761a0236 (site-info) Make the search box in the site viewer functional 2024-12-05 16:16:29 +01:00
Viktor Lofgren
19b69b1764 (site-info) Only show samples if feed is absent, never both. 2024-12-05 16:05:03 +01:00
Viktor Lofgren
8b804359a9 (serp) Layout fixes for mobile 2024-12-05 15:59:33 +01:00
Viktor Lofgren
f050bf5c4c (WIP) Initial semi-working transformation to new tailwind UI
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.

There's also a lot of polish remaining everywhere, dead links, etc.
2024-12-05 14:00:17 +01:00
Viktor Lofgren
fdc3efa250 (setup) Remove OpenNLP tokenization model
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
2024-11-28 16:03:05 +01:00
Viktor Lofgren
5fdd2c71f8 (setup) Update OpenNLP model URLs to archive.apache.org
Changed the URLs for downloading OpenNLP sentence and tokens models from downloads.apache.org to archive.apache.org; as the previous link has died.
2024-11-28 15:58:25 +01:00
Viktor Lofgren
c97c66a41c (ranking) Reduce the verbatim score multiplier 2024-11-28 13:37:11 +01:00
Viktor Lofgren
7b64377fd6 (ranking) Promote documents with multiple phrase matches with a log-scale bonus 2024-11-28 13:36:56 +01:00
Viktor Lofgren
e11ebf18e5 (span) Correct intersection counting logic, add comprehensive tests 2024-11-28 13:36:25 +01:00
Viktor Lofgren
ba47d72bf4 (ranking) Adjust scores for external link matches 2024-11-27 14:27:23 +01:00
Viktor Lofgren
52bc0272f8 (atag) Add alias domain support and improve domain handling
Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant near misses in the atags data.
2024-11-27 14:26:44 +01:00
Viktor Lofgren
d4bce13a03 (export) Add export actors to precession
Adding a tracking message to the export actor means it's possible to run them in a precession.

Adding a new precession actor, and some GUI components for triggering exports.

The change also adds a heartbeat to the export process.
2024-11-26 15:07:03 +01:00
Viktor Lofgren
b9842b57e0 (encyclopedia-sideloader) Add test suite and clean up urlencoding logic 2024-11-26 13:34:15 +01:00
Viktor Lofgren
95776e9bee (encyclopedia) Fix commit gore resulting in bad SQL query 2024-11-26 12:44:49 +01:00
Viktor Lofgren
077d8dcd11 (result-score) Adjust ranking parameters a tiny bit 2024-11-25 18:30:59 +01:00
Viktor Lofgren
9ec41e27c6 (keyword-extractor) Fix bug where external link keywords weren't generating document spans as intended 2024-11-25 18:30:22 +01:00
Viktor Lofgren
200743c84f (minor) Remove delombok debris 2024-11-25 18:29:21 +01:00
Viktor Lofgren
6d7998e349 (index) Correct behavior of debug function positionValues(), which was misleadingly incorrect 2024-11-25 18:28:53 +01:00
Viktor Lofgren
7d1ef08a0f (index) Correct ranking bonus for external linktext appearances 2024-11-25 17:40:15 +01:00
Viktor Lofgren
ea6b148df2 (docker) Add restart: always to executor nodes
The system will perform a janitor reset on these nodes when the node profile is switched, so it's important they restart automatically.
2024-11-25 15:31:45 +01:00
Viktor Lofgren
3ec9c4c5fa (export) Filter non-HTML documents in exporters
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed, helps maintain data consistency, and keeps memory usage down.
2024-11-25 15:06:42 +01:00
Viktor Lofgren
0b6b5dab07 (index) Add score bonuses for single-word anchor tag spans
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
2024-11-25 14:44:41 +01:00
Viktor Lofgren
ff17473105 Fix UTF-8 URL normalization issue in sideloader.
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.

Fixes issue #109.
2024-11-25 14:25:47 +01:00
Viktor Lofgren
dc5f97e737 (index) Add bonus for single-word title matches when the title is also a single word 2024-11-25 13:24:12 +01:00
Viktor Lofgren
d919179ba3 (index) Correct off-by-1 error in DocumentSpan.containsRange 2024-11-25 13:24:03 +01:00
Viktor Lofgren
f09669a5b0 (index) Correct usage of DocumentSpan.length() instead of DocumentSpan.size()
The latter counts the number of spans, and is not what you want here.
2024-11-25 13:11:55 +01:00
Viktor Lofgren
b3b0f6fed3 (actor) Add side-load profile to PROC_CONVERTER_SPAWNER.
This fell off during the profile split, but is necessary for sideloading.
2024-11-25 12:40:14 +01:00
Viktor Lofgren
88caca60f9 (live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list 2024-11-23 17:07:16 +01:00
Viktor Lofgren
923ebbac81 (feeds) Add logic to handle URI fragments in feed items
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
2024-11-23 16:38:56 +01:00
Viktor
df298df852 Merge pull request #125 from MarginaliaSearch/live-search
Add near real-time crawling from RSS feeds to supplement the slower batch based crawls
2024-11-22 16:38:37 +00:00
Viktor Lofgren
552b246099 (live-crawl) Improve error handling for errors during robots.txt-retrieval
Reduce log-spam and don't treat errors other than 404 as "all is permitted".
2024-11-22 14:15:32 +01:00
Viktor Lofgren
80e6d0069c (live-crawl-actor) Clear index journal before starting live crawl
This is to prevent data corruption.  This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
2024-11-22 14:04:57 +01:00
Viktor Lofgren
b941604135 (live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with. 2024-11-22 13:58:57 +01:00
Viktor Lofgren
52eb5bc84f (live-crawler) Keep track of bad URLs
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are, on a dice roll, added to a bad URLs table that prevents further attempts at fetching them.
2024-11-22 00:55:46 +01:00
Viktor Lofgren
4d23fe6261 (feeds) Simplify RSS User-Agent header
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService.  This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.
2024-11-21 16:43:56 +01:00
Viktor Lofgren
14519294d2 Merge branch 'master' into live-search 2024-11-21 16:00:20 +01:00
Viktor Lofgren
51e46ad2b0 (refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.

While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them; some effort was also put toward making the boot process a bit more uniform.  It's still a bit heterogeneous between different processes, but a bit less so for now.
2024-11-21 16:00:09 +01:00
Viktor Lofgren
665c8831a3 (model) Fix resource leak in partially read crawl data streams.
Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.
2024-11-20 19:29:13 +01:00
Viktor Lofgren
47dfbacb00 (conf) Introduce a new concept of node profiles
Node profiles decide which actors are started, and which views are available in the control GUI.  This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
2024-11-20 18:15:22 +01:00
Viktor Lofgren
f94911541a (live-crawl) Reduce the risk of id collisions with the main indexes
This is done by applying a large constant offset to the ordinals for the live crawled documents.  The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
2024-11-20 16:01:10 +01:00
Viktor Lofgren
89d8af640d (live-crawl) Rename the live crawler code module to be more consistent with the other processes 2024-11-20 15:55:15 +01:00
Viktor Lofgren
6e4252cf4c (live-crawl) Make the actor poll for feeds changes instead of being a one-shot thing.
Also changes the live crawl process to store the live crawl data in a fixed directory in the storage base rather than versioned directories.
2024-11-20 15:36:25 +01:00
Viktor Lofgren
79ce4de2ab (model) Remove deprecated fields from CrawledDocument and CrawledDomain 2024-11-20 15:27:05 +01:00
Viktor Lofgren
d6575dfee4 (live-crawler) Crude first-try process for live crawling #WIP
Some refactoring is still needed, but a dummy actor is in place, along with a process that crawls URLs from the livecapture service's RSS endpoints; the data makes it all the way to being indexable.
2024-11-19 21:00:18 +01:00
Viktor Lofgren
a91ab4c203 (live-crawler) Crude first-try process for live crawling #WIP
Some refactoring is still needed, but a dummy actor is in place, along with a process that crawls URLs from the livecapture service's RSS endpoints; the data makes it all the way to being indexable.
2024-11-19 19:35:01 +01:00
Viktor Lofgren
6a3079a167 (search) Fix missing getter for proto 2024-11-18 21:05:22 +01:00
Viktor Lofgren
c728a1e2f2 (rss) Add endpoint for extracting URLs changed within a timespan. 2024-11-18 14:59:32 +01:00
Viktor Lofgren
d874d76a09 (rss) Add an endpoint that can be used for identifying when RSS data has changed 2024-11-18 14:22:17 +01:00
Viktor Lofgren
70bc8831f5 (test) Fix excludeTags 2024-11-17 20:07:49 +01:00
Viktor Lofgren
41c11be075 (status) Clean up the status page a bit 2024-11-17 20:00:44 +01:00
Viktor Lofgren
163ce19846 (test) Tag status service endpoint tests as flaky
These tests have outside dependencies that inherently makes them unreliable and unsuitable for CI.
2024-11-17 19:48:01 +01:00
Viktor Lofgren
9eb16cb667 (test) Remove tests from fast suite
Adding a new @Tag("flaky") for tests that do not reliably return successes.  These may still be valuable during development, but should not run in CI.

Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
2024-11-17 19:45:59 +01:00
Viktor Lofgren
af40fa327b (status-service) Correct measurement pruning to use correct sqlite datetimes, as to not delete the database 2024-11-17 18:35:34 +01:00
Viktor Lofgren
cf6d28e71e (status-service) Enable auto-commit 2024-11-17 18:25:15 +01:00
Viktor Lofgren
3791ea1e18 (service) Add a new application service for external liveness monitoring
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
2024-11-17 18:01:08 +01:00
Viktor
34258b92d1 Merge pull request #124 from MarginaliaSearch/jdk-23+delombok
Friendship with lombok over, now JDK 23 is my best friend
2024-11-16 14:00:49 +00:00
Viktor Lofgren
e5db3f11e1 (chore) Clean up some of the uglier delomboking artifacts 2024-11-15 13:57:20 +01:00
Viktor Lofgren
9f47ce8d15 (chore) Remove lombok
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23 (chore) Remove use of deprecated STR.-style string templates 2024-11-11 18:02:28 +01:00
Viktor Lofgren
8b8bf0748f (feature-extraction) Add new DocumentHeaders class encapsulating Html headers.
Also adds a few new html features for CDNs and S3 hosting for use in ranking and query refinement.
2024-11-11 13:26:15 +01:00
Viktor
5cc71ae586 Merge pull request #123 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-11-10 18:57:49 +01:00
Viktor
33fcfe4b63 Update ROADMAP.md 2024-11-10 18:57:15 +01:00
Viktor
a31a3b53c4 Merge pull request #122 from MarginaliaSearch/fetch-rss-feeds
Automatic RSS feed polling
2024-11-10 18:35:28 +01:00
Viktor Lofgren
a456ec9599 (feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished 2024-11-10 18:30:28 +01:00
Viktor Lofgren
a2bc9a98c0 (feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished 2024-11-10 17:45:20 +01:00
Viktor Lofgren
e24a98390c (feed) Update API to allow specifying clean vs refresh update
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
2024-11-09 18:43:47 +01:00
Viktor Lofgren
6f858cd627 (feed) Decrease update interval to 24 hours 2024-11-09 18:17:51 +01:00
Viktor Lofgren
a293266ccd (feed) Wipe the feeds db and start over from system URLs periodically. 2024-11-09 18:17:16 +01:00
Viktor Lofgren
b8e0dc93d7 (search) Correctly show the feeds view when items are present
... otherwise show samples.   This commit also removes the (Experimental) bit, as this is getting fairly mature.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
d774c39031 (feeds) Reduce log spam 2024-11-09 17:56:43 +01:00
Viktor Lofgren
ab17af99da (feeds) Refresh the feed db using the previous db, when it is available. 2024-11-09 17:56:43 +01:00
Viktor Lofgren
b0ac3c586f (feeds) Correct parallelism using SimpleBlockingThreadPool 2024-11-09 17:56:43 +01:00
Viktor Lofgren
139fa85b18 (feeds) Add working heartbeat tracking progress 2024-11-09 17:56:43 +01:00
Viktor Lofgren
bfeb9a4538 (feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service 2024-11-09 17:56:43 +01:00
Viktor
3d6c79ae5f Merge pull request #121 from MarginaliaSearch/headless-setup
Headless deterministic setup
2024-11-08 13:50:54 +01:00
Viktor Lofgren
c9e9f73ea9 (setup) Break out installation action into non-interactive script 2024-11-08 13:38:40 +01:00
Viktor Lofgren
80e482b155 (setup) Add progress bar to downloads for better feedback 2024-11-08 13:38:40 +01:00
Viktor Lofgren
9351593495 (setup) Use huggingface for versioned hosting of language models 2024-11-08 13:38:40 +01:00
Viktor Lofgren
d74436f546 (setup) Use checksums for rdrpostagger and opennlp files
Also use versioned URLs for rdrpostagger
2024-11-08 13:38:40 +01:00
Viktor Lofgren
76e9053dd0 (setup) Move some file-downloads from setup script to the first boot of the control node of the system
We can only do this for files that are not required for unit tests.

As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions.  The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
2024-11-06 15:28:20 +01:00
Viktor Lofgren
dbb8bcdd8e (crawler) Use a better hashInt implementation in CrawlDataReference
Guava's hash functions are slow as hell.
2024-10-15 18:25:55 +02:00
Viktor Lofgren
7305afa0f8 (crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris 2024-10-15 17:27:59 +02:00
Viktor Lofgren
481f999b70 (crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
Viktor Lofgren
4b16022556 (crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains 2024-10-15 14:21:59 +02:00
Viktor Lofgren
89dd201a7b (link-parser) Make mailing list blocking optional 2024-10-15 13:48:32 +02:00
Viktor Lofgren
ab486323f2 (converter) Increase the number of links the converter will pick up per document 2024-10-15 13:46:19 +02:00
Viktor Lofgren
6460c11107 (index) Short-circuit rankResults when there are no results 2024-10-14 13:47:35 +02:00
Viktor Lofgren
89f7f3c17c (query-parser) Fix regression where advice terms weren't parsed properly 2024-10-14 13:46:37 +02:00
Viktor Lofgren
fe800b3af7 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:57:27 +02:00
Viktor Lofgren
01a16ff388 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:55:59 +02:00
Viktor Lofgren
eb60ddb729 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:49:39 +02:00
Viktor Lofgren
db5faeceee (download-sample) Break apart actor for better error recovery
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:39:43 +02:00
Viktor Lofgren
45d3e6aa71 (download-sample) Break apart actor for better error recovery
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:19:09 +02:00
Viktor Lofgren
d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table.  This is much preferred and means the operator can directly manage domains without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae (crawler, EXPERIMENT) Disable content type probing and use Accept header instead
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
2024-09-30 14:53:01 +02:00
Viktor Lofgren
90a2d4ae38 (index) Fix partial buffer writing in PrioDocIdsTransformer
Ensure all data is written to writeChannel by looping until the buffer is fully drained. This prevents potential data loss during the close operation and maintains data integrity.
2024-09-29 17:53:40 +02:00
Viktor Lofgren
2b8ab97ec1 (bit-writer) Do not clear buffer when creating a bit writer 2024-09-29 17:52:43 +02:00
Viktor Lofgren
43ca9c8a12 (sequence) Return Integer.MAX_VALUE for empty position lists.
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and addresses edge cases where an empty list is encountered.
2024-09-29 17:21:17 +02:00
Viktor Lofgren
69d99c91dd (index) Optimize buffer handling in PrioDocIdsTransformer 2024-09-29 17:20:49 +02:00
Viktor Lofgren
a8cc98a0f6 (index) Fix write offset calculation in PrioDocIdsTransformer
Adjust the write offset calculation by adding the position of the write buffer. Updated the test to validate the transformation process and ensure correctness of output file positions.
2024-09-29 17:20:29 +02:00
Viktor Lofgren
2ee58f4bc9 (index) Adjust ranking parameters to dial down the importance of tcfProximity and firstPosition 2024-09-29 15:33:12 +02:00
Viktor Lofgren
938431e514 (scrape-feeds-actor) Add deduplication of insertion data
To avoid unnecessary db churn, the domains to be added are put in a set instead of a list, ensuring that they are unique.
2024-09-28 14:41:14 +02:00
Viktor Lofgren
b2de3c70fa (scrape-feeds-actor) Add explicit commit in case it's disabled 2024-09-28 14:36:57 +02:00
Viktor Lofgren
542690d9f6 (search-service) Hide pagination when there is only 1 page of results 2024-09-28 13:48:09 +02:00
Viktor Lofgren
596a7fb4ea (actor) Disable the feed scraper on all nodes but the first 2024-09-28 12:36:16 +02:00
Viktor Lofgren
c3f726a01f (actor) Add a feed scraping actor
Add a new actor that polls a URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job.

The URLs are specified in data/scrape-urls.txt.  If this file is absent, the actor shuts down.
2024-09-28 12:33:29 +02:00
Viktor Lofgren
4538ade156 (live-capture) Add readme to live-capture function 2024-09-28 11:35:46 +02:00
Viktor Lofgren
f4709d8f32 (live-capture) Handle case when screenshot bytes are empty.
Add logic to flag the domain as fetched when the pngBytes array is empty. This ensures we won't try to re-fetch this domain again for a while.
2024-09-27 15:53:17 +02:00
Viktor Lofgren
3dda8c228c (live-capture) Handle failed screenshot fetch in BrowserlessClient
Return an empty byte array when screenshot fetch fails, ensuring downstream processes are not impacted by null responses. Additionally, only attempt to upload the screenshot if the byte array is non-empty, preventing invalid data from being stored.
2024-09-27 14:52:05 +02:00
Viktor Lofgren
ccf6b7caf3 (assistant) Refactor scheduling of tasks within SimilarDomainsService
Changed the scheduling function to use a single schedule call instead of a fixed delay for the init task. The updateScreenshotInfo method was also moved and slightly refactored for clearer readability and consistency.
2024-09-27 14:43:19 +02:00
Viktor Lofgren
fed33ed64a (search-service) Update screenshot request handling
Always request the main site screenshot to ensure staleness checks and necessary updates. Limit additional screenshot requests for similar and linking domains to avoid overloading with a maximum of 5 requests per view.
2024-09-27 14:27:25 +02:00
Viktor Lofgren
ca27d95ce1 (assistant) Add bounds checks for domain idx 2024-09-27 14:24:04 +02:00
Viktor Lofgren
3566fe296a (assistant) Add scheduled update job for screenshot information 2024-09-27 14:16:28 +02:00
Viktor Lofgren
c91435e314 (assistant) Don't attempt to respond to similarity and linkedness queries before the data is ready
This will reduce the number of exceptions in the assistant logs quite significantly.
2024-09-27 14:08:08 +02:00
Viktor Lofgren
31f30069a4 (live-capture) Dial down logging a bit 2024-09-27 14:00:55 +02:00
Viktor
e5726a75d2 Merge pull request #120 from MarginaliaSearch/live-capture-function
Add a new function 'Live Capture' for on-demand screenshot capture
2024-09-27 13:48:53 +02:00
Viktor Lofgren
c757d116bf (misc) Fix Broken Tests 2024-09-27 13:46:34 +02:00
Viktor Lofgren
23cce0c78a Add a new function 'Live Capture' for on-demand screenshot capture
The screenshots are requested by the site-service, and triggered via the site-info view.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
1bd29a586c (service-discovery) Add common base interface to all Grpc services
To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService (extending BindableService) was added.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
4565bfe359 (crawler) Make the crawler report crawling progress correctly when stopped and resumed. 2024-09-26 18:30:29 +02:00
Viktor Lofgren
336d6fdd14 (index-client) Fix error when zero results are found 2024-09-25 20:23:13 +02:00
Viktor Lofgren
95cde242ca (assistant) Fix NPE when IP information is absent 2024-09-25 20:19:17 +02:00
Viktor
9224176202 Merge pull request #119 from MarginaliaSearch/result-pagination
Add pagination support for the search results
2024-09-25 14:29:24 +02:00
Viktor Lofgren
0d2390fd13 (search-service) Only autofocus on the query when the query is empty 2024-09-25 14:27:03 +02:00
Viktor Lofgren
4a0356e26f (search-service) Add pagination support to the search GUI 2024-09-25 14:26:49 +02:00
Viktor Lofgren
73f973cc06 (search-query) Add pagination to search query API and the direct query-service interface 2024-09-25 14:20:59 +02:00
Viktor Lofgren
e9e8580913 (converter) Fix NPE bugs in converter due to the reintroduction of CrawledDocument.headers 2024-09-25 12:18:56 +02:00
Viktor Lofgren
8b85a58fea (search UX) Autofocus on the search form 2024-09-24 15:56:03 +02:00
Viktor Lofgren
40512511af (crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl
This code is still a bit too complex, but it's slowly getting better.
2024-09-24 15:08:22 +02:00
Viktor
10d8fc4fe7 Update ROADMAP.md 2024-09-24 14:57:30 +02:00
Viktor
9899d45ea8 Merge pull request #118 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-09-24 14:13:47 +02:00
Viktor
3eea471ca6 Update ROADMAP.md 2024-09-24 14:13:32 +02:00
Viktor Lofgren
3dec4b6b34 (index) Fix bug where tcfFirstPosition lit up because one term was in the title and the other was missing from the document
This was because firstPosition calculation was not invalidated when positions were missing.
2024-09-24 13:33:37 +02:00
Viktor Lofgren
162fc25ebc (minor) Fix accidental commit errors 2024-09-23 18:03:09 +02:00
Viktor Lofgren
e9854f194c (crawler) Refactor
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong
2024-09-23 17:51:07 +02:00
Viktor Lofgren
9c292a4f62 (doc) Fix outdated links in documentation 2024-09-22 13:56:17 +02:00
Viktor Lofgren
edb42836da (vcs) Fix shared state issues with VarintCodedSequence's iterators.
Also cleans up the code a bit.
2024-09-21 16:09:15 +02:00
Viktor Lofgren
1ff88ff0bc (vcs) Stopgap fix for quoted queries with the same term appearing multiple times
There are reentrance issues with VarintCodedSequence; this hides the symptom, but these need to be corrected properly.
2024-09-21 14:07:59 +02:00
Viktor Lofgren
28e7c8e5e0 Increase temporal bias weight to give the recent results filter a bit more recency 2024-09-17 18:11:40 +02:00
Viktor
463b3ed0ce Merge pull request #99 from MarginaliaSearch/term-positions
Improve term positions accuracy
2024-09-17 15:30:04 +02:00
Viktor Lofgren
8e78286068 Merge branch 'master' into term-positions 2024-09-17 15:20:46 +02:00
Viktor Lofgren
f4eeef145e (index) Reduce fetch size to improve timeout characteristics 2024-09-17 15:20:41 +02:00
Viktor Lofgren
87aa869338 (index) Correct positions mask to take into account offsets when overlapping 2024-09-17 14:40:37 +02:00
Viktor Lofgren
60ad4786bc (index) Use MemorySegment.copy for LongArray->LongArray transfers 2024-09-17 13:56:31 +02:00
Viktor Lofgren
a74df7f905 (index) Increase buffer size for PrioDocIdsTransformer 2024-09-17 13:52:52 +02:00
Viktor Lofgren
9f9c6736ab (index) Use MemorySegment.copy for LongArray->LongArray transfers 2024-09-17 13:49:02 +02:00
Viktor Lofgren
b95646625f (index) Correct prio index construction with mmap
Accidentally snuck in behavior from full index
2024-09-17 13:39:08 +02:00
Viktor Lofgren
6e47eae903 (index) Correct strange close handling of PositionsFileConstructor 2024-09-13 16:34:14 +02:00
Viktor Lofgren
934af0dd4b (index) Correct units in log message when shrinking the documents file 2024-09-13 16:33:19 +02:00
Viktor Lofgren
a8bec13ed9 (index) Evaluate using mmap reads during index construction in favor of filechannel reads
It's likely that this will be faster, as the reads are on average small and sequential, and can't be buffered easily.
2024-09-13 16:14:56 +02:00
Viktor Lofgren
1cf62f5850 (doc) Correct dead links and stale information in the docs 2024-09-13 11:02:13 +02:00
Viktor Lofgren
8047e77757 (doc) Correct dead links and stale information in the docs 2024-09-13 11:01:05 +02:00
Viktor Lofgren
2a92de29ce (loader) Fix it so that the loader doesn't explode if it sees an invalid URL 2024-09-12 11:36:00 +02:00
Viktor Lofgren
99523ca079 (query-parser) Remove test that is no longer relevant 2024-09-10 10:35:56 +02:00
Viktor Lofgren
35f49bbb60 (coded-sequence) Add equals and hashCode to VCS 2024-09-10 10:33:56 +02:00
Viktor Lofgren
50ec922c2b (index) Fix broken index tests
Also cleaned up the tests to be less fragile to ranking algorithm changes.
2024-09-10 10:23:46 +02:00
Viktor Lofgren
cfbbeaa26e (ranking) Clean up ranking test code 2024-09-08 15:46:51 +02:00
Viktor Lofgren
a3b0189934 Fix build errors after merge 2024-09-08 10:22:32 +02:00
Viktor Lofgren
8f367d96f8 Merge branch 'master' into term-positions
# Conflicts:
#	code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java
#	code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java
#	code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java
#	code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java
2024-09-08 10:14:43 +02:00
Viktor Lofgren
f78ef36cd4 (slop) Upgrade to 0.0.8, add encodings to string columns. 2024-09-04 15:19:00 +02:00
Viktor Lofgren
dc67c81f99 (summary) Fix a few cases where noscript tags would sometimes be used for document summary 2024-09-04 15:00:40 +02:00
Viktor Lofgren
50ba8fd099 (query-parsing) Correct handling of trailing parentheses 2024-09-03 11:45:14 +02:00
Viktor Lofgren
99b3b00b68 (query-parsing) Merge QueryTokenizer into QueryParser and add escaping of query grammar 2024-09-03 11:35:32 +02:00
Viktor Lofgren
f6d981761d (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:24:05 +02:00
Viktor Lofgren
8290c19e24 (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:21:01 +02:00
Viktor Lofgren
7a69dff6cf (search) Correct handling of languages on fandom 2024-09-01 13:46:01 +02:00
Viktor Lofgren
bfb7ed2c99 (search) Translate cursed medium URLs to scribe.rip links via the search application 2024-09-01 13:32:14 +02:00
Viktor Lofgren
e19dc9b13e (search) Translate cursed fandom URLs to breezewiki links via the search application 2024-09-01 13:23:35 +02:00
Viktor Lofgren
74148c790e (crawler) Pull additional new domains from node-affinity 0
Previously a bit ambiguously defined, node affinity 0 is now indicative that a domain is up for grabs for the next crawler
2024-09-01 13:00:36 +02:00
Viktor Lofgren
3d77456110 (*) Add domain parking service to ip blocklist 2024-09-01 12:53:22 +02:00
Viktor Lofgren
ab6a4b1749 (control) Correct id value for domain addition tool 2024-09-01 12:25:15 +02:00
Viktor Lofgren
aeeb1d0cb7 (control) Add utility for adding domains from an external URL 2024-09-01 12:14:21 +02:00
Viktor Lofgren
185b79f2a5 (converter) Fix bug where sideloaded reddit content was erroneously categorized as wiki-generated. 2024-09-01 11:30:25 +02:00
Viktor Lofgren
8d0f9652c7 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650 (crawler) Grab favicons as part of root sniff 2024-08-31 11:32:56 +02:00
Viktor Lofgren
b1bfe6f76e (control) New view for domains
Add capability to assign domains, and bulk-add new domains.
2024-08-30 17:06:48 +02:00
Viktor Lofgren
74e25370ca (control) New view for domains
Still a work in progress, but at this point it's possible to use for viewing domains
2024-08-29 15:40:40 +02:00
Viktor Lofgren
bb5d946c26 (index, EXPERIMENTAL) Clean up ranking code 2024-08-29 11:34:23 +02:00
Viktor Lofgren
abab5bdc8a (index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data 2024-08-26 14:20:39 +02:00
Viktor Lofgren
30bf845c81 (index) Speed up minDist calculations by excluding large lists 2024-08-26 13:04:15 +02:00
Viktor Lofgren
77efce0673 (paper-doll) Fix compilation 2024-08-26 12:51:29 +02:00
Viktor Lofgren
67a98fb0b0 (coded-sequence) Handle weird legacy HTML that puts everything in a heading 2024-08-26 12:49:15 +02:00
Viktor Lofgren
7d471ec30d (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:45:11 +02:00
Viktor Lofgren
f3182a9264 (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:02:37 +02:00
Viktor Lofgren
805cb5ad58 (coded-sequence) Correct behavior of findIntersections 2024-08-25 14:54:17 +02:00
Viktor Lofgren
fdf05cedae (index) Optimize DocumentSpan.countIntersections 2024-08-25 14:12:30 +02:00
Viktor Lofgren
9c5f463775 (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:59:11 +02:00
Viktor Lofgren
893fae6d59 (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:51:43 +02:00
Viktor Lofgren
5660f291af (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:43:29 +02:00
Viktor Lofgren
efd56efc63 (index) Optimize SequenceOperations.minDistance 2024-08-25 13:28:06 +02:00
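The minDistance operation above can be illustrated with a generic sketch: finding the minimum absolute distance between any pair of values drawn from two sorted position lists via a two-pointer merge walk in O(n+m). This is a hedged illustration of the general technique, not the actual SequenceOperations implementation; the class and method names here are hypothetical.

```java
/** Minimal sketch (not the Marginalia implementation) of the min-distance
 *  computation between two sorted term-position lists. */
class MinDistSketch {
    static int minDistance(int[] a, int[] b) {
        // Mirror the convention from the empty-list fix: no positions means no distance
        if (a.length == 0 || b.length == 0)
            return Integer.MAX_VALUE;

        int i = 0, j = 0, best = Integer.MAX_VALUE;
        while (i < a.length && j < b.length) {
            best = Math.min(best, Math.abs(a[i] - b[j]));
            // Advance the pointer at the smaller value; the other pairing
            // can only grow the distance for that element
            if (a[i] < b[j]) i++;
            else j++;
        }
        return best;
    }
}
```

The two-pointer walk avoids the naive O(n*m) all-pairs comparison, which matters when position lists are long.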
Viktor Lofgren
d94373f4b1 (index) Optimize calculatePositionsMask 2024-08-25 13:24:37 +02:00
Viktor Lofgren
0d01a48260 (index) Optimize SequenceOperations 2024-08-25 13:19:37 +02:00
Viktor Lofgren
00ab2684fa (index) Optimize SequenceOperations 2024-08-25 13:17:38 +02:00
Viktor Lofgren
a5585110a6 (index) Optimize SequenceOperations 2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e (index) Optimize DocumentSpan 2024-08-25 12:44:33 +02:00
Viktor Lofgren
982b03382b (index) Optimize DocumentSpan 2024-08-25 12:31:15 +02:00
Viktor Lofgren
24b805472a (index) Evaluate performance implication of decoding gcs early 2024-08-25 12:23:09 +02:00
Viktor Lofgren
6ce029b317 (index) Remove vestigial parameter 2024-08-25 12:14:12 +02:00
Viktor Lofgren
63e5b0ab18 (index) Correct weightedCounts calculations 2024-08-25 12:06:56 +02:00
Viktor Lofgren
6dda2c2d83 (coded-sequence) Reduce allocations in GCS.values() 2024-08-25 12:06:31 +02:00
Viktor Lofgren
3fb3c0b92e (index) Optimize ranking calculations 2024-08-25 11:56:11 +02:00
Viktor Lofgren
aa2c960b74 (index) Optimize ranking calculations 2024-08-25 11:53:44 +02:00
Viktor Lofgren
4fbcc02f96 (index) Adjust sensible defaults for ranking parameters 2024-08-25 11:24:16 +02:00
Viktor Lofgren
9aa8f13731 (index) Remove tcfAvgDist ranking parameter
This is captured by tcfProximity already
2024-08-25 11:20:19 +02:00
Viktor Lofgren
65bee366dc (index) Try harmonic mean for avgMinDist 2024-08-25 11:11:52 +02:00
Viktor Lofgren
53700e6667 (index) Try harmonic mean for avgMinDist 2024-08-25 11:08:41 +02:00
Viktor Lofgren
7f498e10b7 (index) Adjust proximity score 2024-08-25 11:01:35 +02:00
Viktor Lofgren
6eb0f13411 (index) Adjust handling of full phrase matches to prioritize full query matches over large partial matches 2024-08-25 10:54:04 +02:00
Viktor Lofgren
773377fe84 (index) Correct handling of full phrase match group 2024-08-25 10:48:34 +02:00
Viktor Lofgren
4372c8c835 (index) Give ranking components more consistent names 2024-08-25 10:44:27 +02:00
Viktor Lofgren
099133bdbc (index) Fix verbatim match score after moving full phrase group to a separate entity 2024-08-25 10:43:35 +02:00
Viktor Lofgren
b09e2dbeb7 (build) Fix dependency churn from testcontainers
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
96bcf03ad5 (index) Address broken tests
They are still broken, but less so.
2024-08-25 10:34:36 +02:00
Viktor Lofgren
0999f07320 (search-query) Add new ranking parameters for proximity and verbatim matches 2024-08-25 10:34:12 +02:00
Viktor Lofgren
5d2b455572 (search) Clean up inconsistent usage of MathClient in SearchOperator
Also clean up SearchOperator and adjacent code
2024-08-24 10:39:31 +02:00
Viktor Lofgren
ea75ddc0e0 (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:50:52 +02:00
Viktor Lofgren
2db0e446cb (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:49:29 +02:00
Viktor Lofgren
557bdaa694 (search) Clean up SearchQueryIndexService and surrounding code 2024-08-22 11:45:28 +02:00
Viktor Lofgren
9eb1f120fc (index) Repair positions bitmask for search result presentation 2024-08-22 11:28:23 +02:00
Viktor Lofgren
266d6e4bea (slop) Replace SlopPageRef<T> with SlopTable.Ref<T> 2024-08-21 10:13:49 +02:00
Viktor Lofgren
e4c97a91d8 (*) Comment clarity 2024-08-21 10:12:00 +02:00
Viktor Lofgren
b0a874a842 (*) Upgrade slop library -> 0.0.5 2024-08-18 11:05:27 +02:00
Viktor Lofgren
bca40de107 (*) Upgrade slop library 2024-08-18 10:43:41 +02:00
Viktor Lofgren
93652e0937 (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:55:48 +02:00
Viktor Lofgren
0a383a712d (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:44:17 +02:00
Viktor Lofgren
03d5dec24c (*) Refactor termCoherences and rename them to phrase constraints. 2024-08-15 11:02:19 +02:00
Viktor Lofgren
b2a3cac351 (*) Remove broken imports 2024-08-15 11:01:34 +02:00
Viktor Lofgren
a18edad04c (index) Remove stopword list from converter
We want to index all words in the document, stopword handling is moved to the index where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
2024-08-15 09:36:50 +02:00
Viktor Lofgren
92522e8d97 (index) Attenuate bm25 score based on query length 2024-08-15 08:41:38 +02:00
Viktor Lofgren
049d94ce31 (index) Add body position match to qdebug fields 2024-08-15 08:39:37 +02:00
Viktor Lofgren
dbc6a95276 (index) Consume the new 'body' span in index to make it used in ranking 2024-08-15 08:33:43 +02:00
Viktor Lofgren
75b0888032 (slop) Migrate to latest Slop version 2024-08-14 11:44:35 +02:00
Viktor Lofgren
2ad93ad41a (*) Clean up 2024-08-14 11:43:45 +02:00
Viktor Lofgren
623ee5570f (slop) Break slop out into its own repository 2024-08-13 09:50:05 +02:00
Viktor Lofgren
fd2bad39f3 (keyword-extraction) Add body field for terms that are not otherwise part of a field 2024-08-13 09:49:26 +02:00
Viktor Lofgren
e6c8a6febe (index) Add index-side deduplication in selectBestResults 2024-08-10 10:51:59 +02:00
Viktor Lofgren
4ece5f847b (index) Add more qdebug factors 2024-08-10 10:45:30 +02:00
Viktor Lofgren
e4f04af044 (index) Give BODY matches a verbatim match value 2024-08-10 10:22:19 +02:00
Viktor Lofgren
b730b17f52 (index) Correct handling of firstPosition to avoid d/z 2024-08-10 10:21:59 +02:00
Viktor Lofgren
98c40958ab (index) Simplify verbatim match calculation 2024-08-10 09:54:56 +02:00
Viktor Lofgren
41b52f5bcd (index) Simplify verbatim match calculation 2024-08-10 09:51:03 +02:00
Viktor Lofgren
4264fb9f49 (query-service) Clean up qdebug UI a bit 2024-08-10 09:51:03 +02:00
Viktor Lofgren
016a4c62e1 (index) Bugs and error fixes, chasing and fixing mystery results that did not contain all relevant keywords 2024-08-10 09:51:03 +02:00
Viktor Lofgren
2f38c95886 (index) Backport bugfix from term-positions branch
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search.  This is no bueno.

This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
2024-08-09 21:17:02 +02:00
Viktor Lofgren
df89661ed2 (index) In SearchResultItem, populate combinedId with combinedId and not its ranking-removed documentId cousin 2024-08-09 16:32:32 +02:00
Viktor Lofgren
41da4f422d (search-query) Always generate the "all"-segmentation 2024-08-09 13:20:00 +02:00
Viktor Lofgren
2e89b55593 (wip) Repair qdebug utility and show new ranking details 2024-08-09 12:57:25 +02:00
Viktor Lofgren
7babdb87d5 (index) Remove intermediate models 2024-08-07 10:10:44 +02:00
Viktor Lofgren
680ad19c7d (keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors 2024-08-06 11:16:56 +02:00
Viktor Lofgren
f01267bc6b (index) Don't load fwd index offsets into a hash table at start.
This makes the service take forever to start up.  Memory map the data instead and binary search.  This is a bit slower, but not by much.
2024-08-06 11:16:28 +02:00
Viktor Lofgren
df6a05b9a7 (index) Avoid hypothetical divide-by-zero in tcfAvgDist 2024-08-06 10:55:57 +02:00
Viktor Lofgren
8569bb8e11 (index) Avoid divide-by-zero when minDist returns 0 2024-08-06 10:34:05 +02:00
Viktor Lofgren
ca6e2db2b9 (index) Include external link texts in verbatim score 2024-08-06 10:23:23 +02:00
Viktor Lofgren
2080e31616 (converter) Store link text positions
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends.  Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
Viktor Lofgren
c379be846c (slop) Update readme 2024-08-04 10:58:23 +02:00
Viktor Lofgren
9bc665628b (slop) VarintLE implementation, correct enum8 column 2024-08-04 10:57:52 +02:00
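For context on the VarintLE column above, a generic little-endian base-128 varint codec looks like the sketch below: seven payload bits per byte, least-significant group first, with the high bit as a continuation flag. This is an assumed illustration of the general encoding, not the actual Slop API; the class and method names are hypothetical.

```java
/** Minimal sketch of little-endian base-128 varint coding. */
class VarintLE {
    /** Writes value at pos, returns the position after the last byte written. */
    static int write(byte[] out, int pos, long value) {
        while ((value & ~0x7FL) != 0) {              // more than 7 bits remain
            out[pos++] = (byte) ((value & 0x7F) | 0x80); // continuation bit set
            value >>>= 7;
        }
        out[pos++] = (byte) value;                   // final byte, high bit clear
        return pos;
    }

    /** Reads one varint starting at pos. */
    static long read(byte[] in, int pos) {
        long value = 0;
        for (int shift = 0; ; shift += 7) {
            byte b = in[pos++];
            value |= (long) (b & 0x7F) << shift;     // groups arrive low-order first
            if ((b & 0x80) == 0) return value;
        }
    }
}
```

Values below 128 occupy a single byte, which is why varints compare favorably to Elias gamma-style codes for small position deltas.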
Viktor Lofgren
ee49c01d86 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:47:23 +02:00
Viktor Lofgren
b21f8538a8 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:41:38 +02:00
Viktor Lofgren
dd15676d33 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:18:04 +02:00
Viktor Lofgren
ec5a17ad13 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:07:02 +02:00
Viktor Lofgren
e48f52faba (experiment) Add add-hoc filter runner 2024-08-03 13:24:03 +02:00
Viktor Lofgren
8462e88b8f (index) Add min-dist factor and adjust rankings 2024-08-03 13:07:00 +02:00
Viktor Lofgren
bf26ead010 (index) Remove hasPrioTerm check as we should sort this out in ranking 2024-08-03 13:06:50 +02:00
Viktor Lofgren
c2cedfa83c (index) Experimental ranking signals 2024-08-03 10:33:41 +02:00
Viktor Lofgren
eba2844361 (index) Experimental ranking signals 2024-08-03 10:32:46 +02:00
Viktor Lofgren
c6c8b059bf (index) Return some variant of the previously removed 'Bm25PrioGraphVisitor' 2024-08-03 10:10:12 +02:00
Viktor Lofgren
d8a99784e5 (index) Adding a few experimental relevance signals 2024-08-02 20:26:07 +02:00
Viktor Lofgren
57929ff242 (coded-sequence) Varint sequence 2024-08-02 20:22:56 +02:00
Viktor Lofgren
4430a39120 (loader) Clean up 2024-08-02 12:32:47 +02:00
Viktor Lofgren
6228f46af1 (loader) Reduce log spam 2024-08-02 12:21:03 +02:00
Viktor Lofgren
ac67b6b5da (converter) Fix exception handling while reading crawl data 2024-08-02 10:39:49 +02:00
Viktor Lofgren
1a268c24c8 (perf) Reduce DomPruningFilter hash table recalculation 2024-08-01 12:04:55 +02:00
Viktor Lofgren
38e2089c3f (perf) Code was still spending a lot of time resolving charsets
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
e2107901ec (index) Add span information for anchor tags, tweak ranking params 2024-08-01 11:46:30 +02:00
Viktor Lofgren
15745b692e (index) Coherences need to be able to deal with null values among positions 2024-07-31 22:00:14 +02:00
Viktor Lofgren
696fd8909d (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 21:44:10 +02:00
Viktor Lofgren
02b1c4b172 (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 20:21:23 +02:00
Viktor Lofgren
285e657f68 Merge branch 'master' into term-positions
# Conflicts:
#	code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
046ffc7752 (build) Upgrade jib to 3.4.3 2024-07-31 10:39:50 +02:00
Viktor Lofgren
2ef66ce0ca (actor) Reset NEW flag earlier when auto-deletion is disabled
Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.
2024-07-31 10:31:03 +02:00
Viktor Lofgren
dc5c668940 (index) Re-enable parallelization of index construction, disable parallel sorting during construction
The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance.  It got worse, so the change is reverted.

Though it's been noted that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so this is changed to be done sequentially.
2024-07-31 10:06:53 +02:00
Viktor Lofgren
f19148132a (search) Restrict site-search by passing domain id along with the site:-term
This will help these queries deal with domains that do not have a subdomain so that they do not drag up subdomains as well, as they are also given the special site:-keyword for their corresponding parent domain.
2024-07-30 21:41:07 +02:00
Viktor Lofgren
6d7b886aaa (converter) Correct sort order of files in control storage GUI
Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today was typically shown first, followed by the rest of the files with the oldest date first.
2024-07-30 19:43:27 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
7e4efa45b8 (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:21:21 +02:00
Viktor Lofgren
86ea28d6bc (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:18:52 +02:00
Viktor Lofgren
34703da144 (slop) Support for nested array types and array-of-object types
Also adding very basic support for filtered reads via SlopTable.  This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
1282f78bc5 (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 11:01:18 +02:00
Viktor Lofgren
2d5d965f7f (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 10:34:33 +02:00
Viktor Lofgren
afe56c7cf1 (loader) Tidy up code 2024-07-28 21:36:42 +02:00
Viktor Lofgren
7d51cf882f (loader) Move rssFeeds to a different column group to avoid errors 2024-07-28 21:30:10 +02:00
Viktor Lofgren
499deac2ef (slop) Fix test that broke when we split get into int get() and long getLong() 2024-07-28 21:20:37 +02:00
Viktor Lofgren
9685993adb (loader) Add spans to a different column group from spanCodes, as they are not in sync 2024-07-28 21:20:09 +02:00
Viktor Lofgren
261dcdadc8 (loader) Additional tracking for the control GUI 2024-07-28 21:19:45 +02:00
Viktor Lofgren
314a901bf0 (slop) Clean up build.gradle from unnecessary copy-paste garbage 2024-07-28 13:22:20 +02:00
Viktor Lofgren
1caad7e19e (slop) Update existing code to use the altered Slop interfaces 2024-07-28 13:21:08 +02:00
Viktor Lofgren
e585116dab (slop) Add 32 bit read method for Varint along with the old 64 bit version 2024-07-28 13:20:18 +02:00
Viktor Lofgren
40f42bf654 (slop) Add signed 16 bit column type "short" 2024-07-28 13:19:44 +02:00
Viktor Lofgren
eaf7fbb9e9 (slop) Improve Conveniences for Enum
* New fixed width 8 bit version of Enum
* Access to the enum's dictionary, and a method for reading the ordinal directly to reduce GC churn
2024-07-28 13:19:15 +02:00
Viktor Lofgren
d05a2e57e9 (index-forward) Spans Writer should not be in the index page loop context 2024-07-27 15:17:04 +02:00
Viktor Lofgren
f8684118f3 (slop) Add columnDesc information to the column readers and writers, and correct a few broken position() implementations
Added a test that should find any additional broken implementations, as it's very important that this function is correct.
2024-07-27 14:35:30 +02:00
Viktor Lofgren
2e1f669aea (slop) Remove additional vestigial seek() implementations 2024-07-27 14:35:30 +02:00
Viktor Lofgren
6c3abff664 (slop) Move GCS Slop column to the coded-sequence package
This lets the slop library be stand-alone without dependence on coded-sequence.

The change also gets rid of the vestigial seek() method in ColumnReader.
2024-07-27 13:58:45 +02:00
Viktor Lofgren
dcb43a3308 (slop) Introduce table concept to keep track of positions and simplify closing
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.

The second most common error is forgetting to close one of the columns in a reader or writer.

To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
2024-07-27 13:47:47 +02:00
Viktor Lofgren
ec600b967d (crawler) Adjust domain locking
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress.  Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
52a9a0d410 (slop) Translate nulls to empty strings when passed to the StringColumnWriters. 2024-07-25 18:26:41 +02:00
Viktor Lofgren
4123e99469 (slop) Handle empty compressed files correctly
The CompressingStorageReader would incorrectly report having data when a file was empty.  Preemptively attempting to fill the backing buffer fixes the behavior.
2024-07-25 18:26:13 +02:00
Viktor Lofgren
51a8a242ac (slop) First commit of slop library
Slop is a low-abstraction data storage convention for column based storage of complex data.
2024-07-25 15:08:41 +02:00
Viktor Lofgren
60ef826e07 (loader) Add heartbeat to update domain-ids step 2024-07-25 15:08:41 +02:00
Viktor Lofgren
2ad564404e (loader) Add heartbeat to update domain-ids step 2024-07-23 15:28:52 +02:00
Viktor Lofgren
2bb9f18411 (dld) Refactor DocumentLanguageData
Reduce the usage of raw arrays
2024-07-19 12:24:55 +02:00
Viktor Lofgren
7a1edc0880 (term-freq) Reduce the number of low-relevance words in the dictionary
Using a statistical trick to reduce the number of low-frequency words in the dictionary, as they are numerous and not very informative.
2024-07-19 12:23:28 +02:00
Viktor Lofgren
b812e96c6d (language-processing) Select the appropriate language filter
The incorrect filter was selected based on the provided parameter, this has been corrected.
2024-07-19 12:22:32 +02:00
Viktor Lofgren
22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant amount of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
Viktor Lofgren
d36055a2d0 (keyword-extractor) Retire TfIdfHigh WordFlag
This will bring the word flags count down to 8, and let us pack every value in a byte.
2024-07-17 13:54:39 +02:00
Viktor Lofgren
0d227f3543 (cleanup) Remove next-prime library only used in tests 2024-07-17 13:48:03 +02:00
Viktor Lofgren
accc598967 (crawler) Add 1 second pause after probing domain to reduce request pressure 2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba (crawler) Add a per-domain mutex for crawling
To let up the pressure on domains with lots of subdomains such as substack, medium, neocities, etc., a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
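The one-thread-per-domain limit can be realized with a semaphore keyed on the top domain; a minimal sketch, with hypothetical names rather than the crawler's actual classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Sketch of per-domain crawl locking: all crawl tasks for the same top
// domain share a single-permit semaphore, so e.g. thousands of
// *.substack.com subdomains are fetched one at a time.
public class DomainLocks {
    private final Map<String, Semaphore> locks = new ConcurrentHashMap<>();

    public Semaphore lockFor(String topDomain) {
        return locks.computeIfAbsent(topDomain, k -> new Semaphore(1));
    }

    public static void main(String[] args) throws InterruptedException {
        DomainLocks domainLocks = new DomainLocks();
        Semaphore lock = domainLocks.lockFor("substack.com");
        lock.acquire();
        try {
            // ... crawl one subdomain of substack.com ...
        } finally {
            lock.release();
        }
    }
}
```

Granting large hosting websites a bigger allowance, as the later adjustment describes, amounts to constructing the semaphore with more than one permit.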
Viktor Lofgren
6665e447aa (crawler) Add crawl delays around probe call and deal with 429:s properly during this phase 2024-07-16 15:33:24 +02:00
Viktor Lofgren
7eb955cc42 (setup) Change mirror for opennlp
Seems like the estointernet mirror no longer works.  Use apache.org instead.
2024-07-16 15:19:13 +02:00
Viktor Lofgren
f4d79c203d (crawler) Adjust revisit logic
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.

Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4 (crawler) Introduce absolute upper limit to crawl depth growth 2024-07-16 14:40:45 +02:00
Viktor Lofgren
0b31c4cfbb (coded-sequence) Replace GCS usage with an interface 2024-07-16 14:37:50 +02:00
Viktor Lofgren
5c098005cc (index) Fix broken test
Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.
2024-07-16 12:37:59 +02:00
Viktor Lofgren
ae87e41cec (index) Fix rare BitReader.takeWhileZero bug
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer.  This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.

The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.

Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
2024-07-16 11:03:56 +02:00
Viktor Lofgren
dfd19b5eb9 (index) Reduce the number of abstractions around result ranking
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
2024-07-16 08:18:54 +02:00
Viktor
8ed5b51a32 Merge branch 'master' into term-positions 2024-07-15 07:05:31 +02:00
Viktor Lofgren
9d0e5dee02 Fix gitignore issue causing .so files not to be ignored correctly. 2024-07-15 05:18:10 +02:00
Viktor Lofgren
ffd970036d (term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
How'd This Ever Work? (tm)

TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:16:17 +02:00
Viktor Lofgren
fa162698c2 (term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
How'd This Ever Work? (tm)

TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:15:30 +02:00
Viktor Lofgren
ad3857938d (search-api, ranking) Update with new ranking parameters
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.

The change also cleans out several parameters that no longer filled any function.
2024-07-15 04:49:40 +02:00
Viktor Lofgren
179a6002c2 (coded-sequence) Add a callback for re-filling underlying buffer 2024-07-12 23:50:28 +02:00
Viktor Lofgren
d28fc86956 (index-prio) Add fuzz test for prio index 2024-07-11 19:22:36 +02:00
Viktor Lofgren
6303977e9c (index-prio) Fail louder when size is 0 in PrioDocIdsTransformer
We can't deal with this scenario and should complain very loudly
2024-07-11 19:22:05 +02:00
Viktor Lofgren
97695693f2 (index-prio) Don't increment readItems counter when the output buffer is full
This behavior was causing the reader to sometimes discard trailing entries in the list.
2024-07-11 19:21:36 +02:00
Viktor Lofgren
1ab875a75d (test) Correcting flaky tests
Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
31881874a9 (coded-sequence) Correct indicator of next-value
It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamma code.  This is incorrect in this case, as we're able to provide a negative offset.  Changing to using Integer.MIN_VALUE as indicator that a value is absent instead, as this will never be used.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
f090f0101b (index-construction) Gather up preindex writes
Use fewer writes when finalizing the preindex documents.dat file, as this was getting too slow.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
9881cac2da (index-reader) Correctly handle negative offset values
When wordOffset(...) returns a negative value, it means the word isn't present in the index, and we should abort.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
12590d3449 (index-reverse) Added compression to priority index
The priority index documents file can be trivially compressed to a large degree.

Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
2024-07-11 16:13:23 +02:00
Viktor Lofgren
abf7a8d78d (coded-sequence) Correct implementation of Elias gamma
Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.
2024-07-10 14:28:28 +02:00
Viktor Lofgren
ecfe17521a (coded-sequence) Correct implementation of Elias gamma
The implementation was incorrectly using 1 bit more than it should.  The change also adds a put method for Elias delta; and cleans up the interface a bit.
2024-07-09 17:28:21 +02:00
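For reference, the gamma code writes a positive integer x as floor(log2 x) zero bits followed by the binary form of x, for 2·floor(log2 x)+1 bits in total; spending one bit more per value, as the old implementation did, adds up quickly over millions of position values. A string-based sketch of the scheme (illustrative only, not the library's packed-buffer API):

```java
// Sketch of Elias gamma coding over bit strings.  The real coded-sequence
// library operates on packed ByteBuffers; strings are used here purely to
// make the bit layout visible.
public class EliasGammaSketch {
    // Encode a positive integer x as floor(log2 x) zero bits, then x in binary.
    static String encode(int x) {
        if (x < 1) throw new IllegalArgumentException("gamma codes only values >= 1");
        int n = 31 - Integer.numberOfLeadingZeros(x); // floor(log2 x)
        return "0".repeat(n) + Integer.toBinaryString(x);
    }

    // Decode: count n leading zeros, then read the following n+1 bits as the value.
    static int decode(String bits) {
        int n = 0;
        while (bits.charAt(n) == '0') n++;
        return Integer.parseInt(bits.substring(n, 2 * n + 1), 2);
    }

    public static void main(String[] args) {
        System.out.println(encode(5)); // "00101": two zeros, then 101
    }
}
```

Note that no value below 1 is representable, which is why the sibling change above has to use Integer.MIN_VALUE as its absent-value sentinel.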
Viktor Lofgren
0d29e2a39d (index-reverse) Entry Sources reset() their LongQueryBuffer
Previously this was the responsibility of the caller, which led to the possibility of passing in improperly prepared buffers and receiving bad outcomes
2024-07-09 01:39:40 +02:00
Viktor Lofgren
12a2ab93db (actor) Improve error messages for convert-and-load
Some copy-and-paste errors had snuck in and every index construction error was reported as "repartitioned failed"; updated with more useful messages.
2024-07-08 19:19:30 +02:00
Viktor Lofgren
d90bd340bb (index-reverse) Removing btree indexes from prio documents file
Btree index adds overhead and disk space and doesn't fill any function for the prio index.

* Update finalize logic with a new IO transformer that copies the data and prepends a size
* Update the reader to read the new format
* Added a test
2024-07-08 17:20:17 +02:00
Viktor Lofgren
21afe94096 (index-reverse) Don't use 128 bit merge function for prio index 2024-07-07 21:36:10 +02:00
Viktor Lofgren
fa36689597 (index-reverse) Simplify priority index
* Do not emit a documents file
* Do not interlace metadata or offsets with doc ids
2024-07-06 18:04:08 +02:00
Viktor Lofgren
85c99ae808 (index-reverse) Split index construction into separate packages for full and priority index 2024-07-06 15:44:47 +02:00
Viktor Lofgren
a4ecd5f4ce (minor) Fix non-compiling test due to previous refactor 2024-07-06 15:11:43 +02:00
Viktor Lofgren
6401a513d7 (crawl) Fix onsubmit confirm dialog for single-site recrawl 2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor Lofgren
a6b03a66dc (crawl) Reduce Charset.forName() object churn
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again.  Charset.forName(...) can be surprisingly expensive, and its built-in caching strategy, which only caches the 2 last values seen, doesn't cope well with how we're hitting it with a wide array of random charsets
2024-07-04 20:49:07 +02:00
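The caching idea amounts to memoizing Charset.forName() in a concurrent map; a sketch assuming a hypothetical CharsetCache helper:

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Memoize Charset.forName() results.  A ConcurrentHashMap holds any number
// of distinct charset names, unlike the JDK's internal two-entry cache,
// so repeated lookups skip the relatively expensive provider search.
public class CharsetCache {
    private static final Map<String, Charset> CACHE = new ConcurrentHashMap<>();

    public static Charset forName(String name) {
        return CACHE.computeIfAbsent(name, Charset::forName);
    }

    public static void main(String[] args) {
        System.out.println(forName("ISO-8859-1")); // resolved once, then served from the map
    }
}
```

One wrinkle worth noting: invalid charset names make Charset.forName() throw, so a production version would likely also want to cache negative results rather than pay for the failed lookup on every occurrence.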
Viktor Lofgren
d023e399d2 (index) Remove unnecessary allocations in journal reader
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.

Replacing with a fixed pointer alias that can be repositioned to the relevant data.

The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.

Removed this unnecessary step and move to copying the buffer directly instead.
2024-07-04 15:38:22 +02:00
Viktor Lofgren
e8ab1e14e0 (keyword-extraction) Update upper limit to number of positions per word
After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.
2024-07-02 20:52:32 +02:00
Viktor Lofgren
a6e15cb338 (keyword-extraction) Update upper limit to number of positions per word
100 was a bit too low, let's try 256.
2024-06-30 22:46:56 +02:00
Viktor Lofgren
4fbb863a10 (keyword-extraction) Add upper limit to number of positions per word
Also adding some logging for this event to get a feel for how big these lists get with realistic data.  To be cleaned up later.
2024-06-30 22:41:38 +02:00
Viktor Lofgren
6ee4d1eb90 (keyword) Increase the work area for position encoding
The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.
2024-06-28 16:42:39 +02:00
Viktor Lofgren
738e0e5fed (process) Add option for automatic profiling
The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns.  By default, these are put in the log directory.

The change also adds a JVM parameter that makes it shut up about native access.
2024-06-27 13:58:36 +02:00
Viktor Lofgren
0e4dd3d76d (minor) Remove accidentally committed debug printf 2024-06-27 13:40:53 +02:00
Viktor Lofgren
10fe5a78cb (log) Prevent tests from trying to log to file
They would never have succeeded, but it adds an annoying preamble of error spam in the console window.
2024-06-27 13:19:48 +02:00
Viktor Lofgren
975b8ae2e9 (minor) Tidy code 2024-06-27 13:15:31 +02:00
Viktor Lofgren
935234939c (test) Add query parsing to IntegrationTest 2024-06-27 13:15:20 +02:00
Viktor Lofgren
87e38e6181 (search-query) refac: Move query factory 2024-06-27 13:14:47 +02:00
Viktor Lofgren
f73fc8dd57 (search-query) Fix end-inclusion bug in QWordGraphIterator 2024-06-27 13:13:42 +02:00
Viktor Lofgren
3faa5bf521 (search-query) Tidy up QueryGRPCService and IndexClient 2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480 (query) Tidy up code 2024-06-26 13:40:06 +02:00
Viktor Lofgren
02df421c94 (*) Trim the stopwords list
Having an overlong stopwords list leads to quoted terms not performing well.  For now we'll slash it to just "a" and "the".
2024-06-26 12:22:57 +02:00
Viktor Lofgren
95b9af92a0 (index) Implement working optional TermCoherences 2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771 (index) Correct TermCoherence requirements 2024-06-25 22:18:10 +02:00
Viktor Lofgren
b805f6daa8 (gamma) Fix readCount() behavior in EGC 2024-06-25 22:17:54 +02:00
Viktor Lofgren
dae22ccbe0 (test) Integration test from crawl->query 2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f (index) Partial re-implementation of position constraints 2024-06-24 15:55:54 +02:00
Viktor Lofgren
5461634616 (doc) Add readme.md for coded-sequence library
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
2024-06-24 14:28:51 +02:00
Viktor Lofgren
40bca93884 (gamma) Minor clean-up 2024-06-24 13:56:43 +02:00
Viktor Lofgren
b798f28443 (journal) Fixing journal encoding
Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.
2024-06-24 13:56:27 +02:00
Viktor Lofgren
fff2ce5721 (gamma) Correctly decode zero-length sequences 2024-06-24 13:11:41 +02:00
Viktor
69f88255e9 Merge pull request #101 from MarginaliaSearch/security-scan
Address security scan findings
2024-06-17 13:18:36 +02:00
Viktor
08ff79827e Merge branch 'master' into security-scan 2024-06-17 13:18:25 +02:00
Viktor Lofgren
67703e2274 (run) Update install.sh with stronger warnings against non-docker install. 2024-06-17 13:15:15 +02:00
Viktor Lofgren
d0d6bb173c (control) Fix warc data http status filter default value 2024-06-17 12:40:25 +02:00
Viktor Lofgren
54caf17107 (docs) Amend install instructions for non-docker install 2024-06-16 10:22:07 +02:00
Viktor Lofgren
2168b7cf7d (docs) Update docs with clearer references to the full guide
The commit also mentions the non-docker install
2024-06-16 10:01:19 +02:00
Viktor Lofgren
90744433c9 Merge branch 'master' into security-scan
# Conflicts:
#	code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Viktor
5371f078f7 Merge pull request #102 from jaseemabid/jabid/macos-build
Make the project buildable on macOS
2024-06-12 14:45:03 +02:00
Jaseem Abid
0dd14a4bd0 Specify C++ standard in build command
The default C++ language standard on macOS is gnu++98, which won't build
this module.

Full error:

```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
    [](const p64x2& fst, const p64x2& snd) {
    ^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09 Don't track build files (libcpp.so) with git 2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9 (crawler) Update WARC builder to use SHA-256 for digests 2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b (crawler) Fetch TLS instead of SSL context 2024-06-12 09:07:54 +02:00
Viktor Lofgren
55f3ac4846 (atags) Fix duckdb SQL injection
The input comes from the config file so this isn't a very realistic threat vector, and even if it wasn't it's a query in an empty duckdb instance; but adding a validation check to provide a better error message.
2024-06-12 09:05:57 +02:00
Viktor Lofgren
801cf4b5da (search) Fix bad practice usage of innerHTML to set what should be text content. 2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d (build) Upgrade parquet dependencies to 1.14.0
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
23759a7243 (loader) Correctly clamp document size 2024-06-10 18:29:14 +02:00
Viktor Lofgren
55b2b7636b (loader) Correctly load the positions column in the keyword projection 2024-06-10 18:27:15 +02:00
Viktor Lofgren
36160988e2 (index) Integrate positions data with indexes WIP
This change integrates the new positions data with the forward and reverse indexes.

The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d (index) Integrate positions file properly 2024-06-06 16:45:42 +02:00
Viktor Lofgren
dcbec9414f (index) Fix non-compiling tests 2024-06-06 16:35:09 +02:00
Viktor Lofgren
a07cf1ba93 (array/cpp) Update gitignore to properly exclude libcpp.so 2024-06-06 13:06:08 +02:00
Viktor Lofgren
4a8afa6b9f (index, WIP) Position data partially integrated with forward and reverse indexes.
There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Viktor
bb06cc9ff3 Merge pull request #98 from samstorment/ThemeSwitcher
OS Independent Theme Switcher
2024-06-06 12:51:19 +02:00
Sam Storment
9c06f446fb (search) Styling tweaks. Make the filter button near the top right corner a bit bigger so it's easier to press on mobile 2024-06-05 19:55:17 -05:00
Sam Storment
2d076cbd67 (search) move data-has-js attribute from body to html element 2024-06-05 18:20:33 -05:00
Sam Storment
fb2eef24d6 Handle themeing when javascript is disabled. Hide the theme select and fallback to dark media query instead of data-theme attribute 2024-06-03 14:15:35 -05:00
Sam Storment
e2f68d9ccf Add a theme select to the header that lets users toggle their theme independent of their OS theme 2024-06-02 21:02:52 -05:00
Viktor Lofgren
d4f4d751c0 Merge remote-tracking branch 'origin/master' 2024-06-02 16:30:41 +02:00
Viktor Lofgren
b4eac2516e (crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results 2024-06-02 16:30:34 +02:00
Viktor
4435f6245c Merge pull request #94 from samstorment/search-dark-theme
Search Dark Theme
2024-06-02 16:21:52 +02:00
Viktor Lofgren
9b922af075 (converter) Amend existing modifications to use gamma coded positions lists
... instead of serialized RoaringBitmaps as was the initial take on the problem.
2024-05-30 14:20:36 +02:00
Viktor Lofgren
0112ae725c (gamma) Implement a small library for Elias gamma coding an integer sequence 2024-05-30 14:19:13 +02:00
Viktor Lofgren
619392edf9 (keywords) Add position information to keywords 2024-05-28 16:54:53 +02:00
Viktor Lofgren
0894822b68 (converter) Add position information to serialized document data
This is not hooked in yet, and the term metadata is still left intact.  It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
2024-05-28 14:18:03 +02:00
Viktor Lofgren
206a7ce6c1 Merge remote-tracking branch 'origin/master' 2024-05-28 14:15:57 +02:00
Viktor Lofgren
a69ab311c7 (qword) Fix tests that broke due to stopword removal 2024-05-28 14:15:45 +02:00
Viktor
a61327fa0b Update ROADMAP.md 2024-05-24 13:57:50 +02:00
Viktor Lofgren
6985ab762a (query) Improve handling of stopwords in queries 2024-05-23 20:50:55 +02:00
Viktor Lofgren
0e8300979b (search) Update the no result text to request bug reports. 2024-05-23 20:18:16 +02:00
Viktor Lofgren
0b60411e5f (query) Bugfix stopword issue
Add a new rule that creates an alternative path that omits a word if it's a stopword.

In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
Viktor Lofgren
f83f777fff (converter) Experimental support for searching by URL
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
65b74f9cab (registry) Fix broken test 2024-05-23 14:15:01 +02:00
Sam Storment
7543e98035 Merge branch 'MarginaliaSearch:master' into search-dark-theme 2024-05-22 18:06:37 -05:00
Viktor Lofgren
59ec70eb73 (*) Clean up code related to crawl parquet inspection 2024-05-22 12:55:08 +02:00
Viktor Lofgren
365229991b (control) Improve pagination for crawl data inspector 2024-05-21 19:44:48 +02:00
Viktor Lofgren
959a8e29ee (control) Improve pagination for crawl data inspector 2024-05-21 19:27:25 +02:00
Viktor Lofgren
197c82acd4 (control) Add filter functionality for crawl data inspector 2024-05-21 19:05:44 +02:00
Viktor Lofgren
9539fdb53c (control) Clean up UX for crawl data inspector 2024-05-21 18:27:24 +02:00
Sam Storment
5659df4388 (search) Set link and form field colors manually to override browser defaults with poor dark mode contrast 2024-05-21 00:03:46 -05:00
Viktor Lofgren
24bf29d369 (*) Upgrade opennlp and deprecate the monkey patched version of the code as it's no longer needed 2024-05-20 18:03:21 +02:00
Viktor Lofgren
17dc00d05f (control) Partial implementation of inspection utility for crawl data
Uses duckdb and range queries to read the parquet files directly from the index partitions.

UX is a bit rough but is in working order.
2024-05-20 18:02:46 +02:00
Viktor Lofgren
4fcd4a8197 (index) Refactor to reduce the level of indirection 2024-05-19 12:40:33 +02:00
Viktor Lofgren
daf2a8df54 (btree) Roll back optimization of queryDataWithIndex
It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect.

The reason retain() is so fast is due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already-intersected documents, where the data is more sparse.
2024-05-19 11:29:28 +02:00
Sam Storment
43489c98d8 (search) Minor dark theme tweaks after the new mocked UI elements were added 2024-05-19 01:06:54 -05:00
Viktor Lofgren
88997a1c4f (btree) Clean up code 2024-05-18 18:38:46 +02:00
Viktor Lofgren
d12c77305c (btree) Clean up code 2024-05-18 18:03:17 +02:00
Viktor Lofgren
ab4e2b222e (array) Fix broken benchmarks 2024-05-18 13:41:24 +02:00
Viktor Lofgren
b867eadbef (big-string) Remove the unused bigstring library 2024-05-18 13:40:03 +02:00
Viktor Lofgren
19163fa883 (array) Clean up the Array library
IntArray gets the YAGNI axe.   The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot.   Removing the latter, as all it ever did was clutter up the codebase and add technical debt.  If we need int arrays, we fork LongArray again (or add int capabilities to it)

Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.

Finally adding sz=2 specializations to the quick- and insertion sort algorithms.  It seems the JIT isn't optimizing these particularly well; this is an attempt to help it out a bit.
2024-05-18 13:23:06 +02:00
Sam Storment
a7c33809c4 Merge branch 'master' into search-dark-theme 2024-05-17 22:52:19 -05:00
Viktor Lofgren
650f3843bb (array) Clean up search function jungle
Retire search functions that weren't used, including the native implementations.  Drop confusing suffixes on search function names.  Search functions no longer encode search misses as negative values.

Replaced binary search function with a branchless version that is much faster.

Cleaned up benchmark code.
2024-05-17 14:31:02 +02:00
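A branchless binary search replaces the hard-to-predict if/else on each probe with arithmetic the JIT can typically compile to a conditional move; a sketch over a plain sorted long[] (the real code operates on LongArray and memory-mapped data), which also illustrates the "no negative values for misses" convention by always returning an insertion point:

```java
// Sketch of a branchless lower-bound search over a sorted long[].
// Returns the index of the first element >= key (the insertion point),
// never a negative "miss" encoding.
public class BranchlessSearch {
    static int search(long[] a, long key) {
        if (a.length == 0)
            return 0;
        int base = 0;
        int n = a.length;
        while (n > 1) {
            int half = n / 2;
            // The ternary can become a conditional move rather than a branch,
            // avoiding mispredictions on effectively random probe outcomes.
            base += (a[base + half] < key) ? half : 0;
            n -= half;
        }
        return base + ((a[base] < key) ? 1 : 0);
    }

    public static void main(String[] args) {
        long[] data = {1, 3, 5, 7};
        System.out.println(search(data, 5)); // index of the element 5
    }
}
```

The window [base, base+n) halves each iteration without ever branching on the comparison result, which is what makes this variant faster than the classic low/high/mid loop on large arrays.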
Viktor Lofgren
9e766bc056 (array) Clean up search function jungle
Retire search functions that weren't used, including the native implementations.  Drop confusing suffixes on search function names.  Search functions no longer encode search misses as negative values.

Replaced binary search function with a branchless version that is much faster.

Cleaned up benchmark code.
2024-05-17 14:30:06 +02:00
Viktor Lofgren
48aff52e00 (array) Increase LongArray on-heap alignment to 16 bytes
This primarily affects benchmarks, making performance more consistent for the 128 bit operations, as the system mostly works with memory mapped data.
2024-05-16 19:12:36 +02:00
Viktor Lofgren
9d7616317e (array) Clean up native code a bit 2024-05-16 14:47:10 +02:00
Viktor Lofgren
d227a09fb1 (search) Extend paperdoll service mock with site info data and screenshots
It's a bit of a hack job but will do, random exploration is available but only through a "browse:random"-style query
2024-05-15 12:40:55 +02:00
Viktor Lofgren
f48cf77c4d (array, experimental) Add benchmark results for quicksort 2024-05-14 18:15:30 +02:00
Viktor Lofgren
3549be216f (array, experimental) Documentation for native algos 2024-05-14 17:43:05 +02:00
Viktor Lofgren
c3e3a3dbc5 (search) Fix problem list in clustered search results 2024-05-14 13:05:52 +02:00
Viktor Lofgren
55a7c1db00 (array, experimental) Call C++ helper methods to do some low level stuff a bit faster than is possible with Java 2024-05-14 12:54:14 +02:00
Sam Storment
bb315221ab (search, WIP) Make the dark theme look generally nicer. Rename CSS custom properties a bit. Switch a lot of background colors to HSL to make it easy to change colors relative to one another. 2024-05-14 01:32:40 -05:00
Sam Storment
c38766c5a6 (search, WIP) Convert SCSS variables to CSS custom properties for dynamic theming 2024-05-08 22:13:24 -05:00
Viktor Lofgren
c837321df1 (search) Provide a notification when no search results are found. 2024-05-06 20:11:39 +02:00
Viktor Lofgren
af7f6b89ec (search) Delete vestigial stylesheet from the old design. 2024-05-06 19:52:29 +02:00
Viktor Lofgren
29a4d3df23 (search) Improve search-service paperdoll by mocking suggestions and news 2024-05-06 19:52:13 +02:00
Viktor
bcbb9afac0 Merge pull request #93 from MarginaliaSearch/accessibility-improvements
Accessibility improvements
2024-05-04 15:45:26 +02:00
Viktor Lofgren
7d1cafc070 (control) Add skip link for navigation in control GUI 2024-05-04 12:36:44 +02:00
Viktor Lofgren
5951c67a8b (search) Center the search results page 2024-05-04 12:23:21 +02:00
Viktor Lofgren
c454007730 (search) Increase contrast for some UI elements 2024-05-04 12:02:52 +02:00
Viktor Lofgren
4e49cca43d (search) Clean up SCSS code a bit 2024-05-04 11:58:54 +02:00
Viktor Lofgren
49a8c06095 (search) Improve contrast for text on random button 2024-05-04 11:51:19 +02:00
Viktor Lofgren
d01d9fa670 (search) Add screenreader-specific notification remark about when search results start. 2024-05-04 11:41:06 +02:00
Viktor Lofgren
a53a32f006 (search) Spell out website problems with "atomic elements" instead of having a hover that's inaccessible with keyboard navigation 2024-05-04 11:41:05 +02:00
Viktor Lofgren
3548d54cf6 (search) Add a screenreader-only alert when the search filters are updated to make it easier to understand what happens. 2024-05-04 11:41:04 +02:00
Viktor Lofgren
01f242ac7e (search) Add stylesheet class for screenreader-only items 2024-05-04 11:41:03 +02:00
Viktor Lofgren
2840d9d403 (search) Add screenreader-only positions count text to search results 2024-05-04 11:41:03 +02:00
Viktor Lofgren
9fecfc5025 (search) Add autocomplete attribute to search-form 2024-05-04 11:41:02 +02:00
Viktor Lofgren
1b901e01f2 (search) Add bypass link that skips navigation 2024-05-04 11:41:01 +02:00
Viktor Lofgren
974aa35558 (search) Add proper alt-text to random exploration mode 2024-05-04 11:41:00 +02:00
Viktor Lofgren
4021a0ae98 (search) Add en-US language tags to all templates 2024-05-04 11:40:59 +02:00
Viktor Lofgren
b7a95be731 (search) Create a small mocking framework for running the search service in isolation. 2024-05-04 11:40:59 +02:00
Viktor Lofgren
616649f040 (logs) Fix logdir location 2024-05-04 11:40:59 +02:00
Viktor
ac3c692b5f Merge pull request #92 from MarginaliaSearch/no-docker-v2
(WIP) Changes to make the system runnable outside of docker
2024-05-01 13:00:56 +02:00
Viktor Lofgren
6087f9635c (qs) Move index.html out of public directory
It was put there to simulate the /public interface paradigm that is now deprecated.
2024-05-01 12:56:12 +02:00
Viktor Lofgren
2ad0bfda1e (*) Fix boot orchestration for the services
This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.

A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database.  Move the first boot check into the MainClass instead of the Service constructor.

The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
2024-05-01 12:39:48 +02:00
Viktor Lofgren
cf8b12bcdc Update install.sh with refined service descriptions 2024-05-01 12:07:30 +02:00
Viktor Lofgren
08f8b6e022 (system) Log loaded properties to the console 2024-04-30 18:29:11 +02:00
Viktor Lofgren
800ed6b1e9 (zk) Terminate immediately if zookeeper isn't found
This makes debugging easier
2024-04-30 18:28:49 +02:00
Viktor Lofgren
df93e57a9a (install) Add new option to install locally outside of docker 2024-04-30 18:28:21 +02:00
Viktor Lofgren
908535a3a0 (single-service) Ensure single-service spawner can specify the node 2024-04-30 18:27:46 +02:00
Viktor Lofgren
7fe2ab6f39 (file-storage) Ensure file storage root location can be overridden when running outside of docker 2024-04-30 18:26:15 +02:00
Viktor Lofgren
c9ee0c909e (download-sample) Set +x permissions on directories created during this job 2024-04-30 18:25:07 +02:00
Viktor Lofgren
38aedb50ac (converter) Do not suppress exceptions in the converter 2024-04-30 18:24:35 +02:00
Viktor Lofgren
4772e0b59d (service) Deprecate /public prefix on HTTP
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.

Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.

The change removes the path prefix, and updates the docker templates to reflect the change.  This will require a migration for existing systems.
2024-04-30 14:46:18 +02:00
Viktor Lofgren
9c49e876d5 (conf) Update the setup.sh script to also be able to perform model upgrades 2024-04-29 17:46:20 +02:00
Viktor Lofgren
152007cd5c (docker) Add missing zookeeper service to full marginalia config 2024-04-29 11:44:53 +02:00
Viktor Lofgren
70e2e41955 (crawler) Content type prober should not swallow exceptions 2024-04-27 18:27:23 +02:00
Viktor Lofgren
4d71c776fc (crawler) Modify crawl set growth to grow small domains faster than larger ones 2024-04-27 17:36:27 +02:00
Viktor
0f41105436 Merge pull request #90 from MarginaliaSearch/run-outside-docker
Run outside of Docker
2024-04-25 18:55:26 +02:00
Viktor
2d49071e96 Merge branch 'master' into run-outside-docker 2024-04-25 18:53:26 +02:00
Viktor Lofgren
89889ecbbd (single-service) Skip starting Prometheus if it's not explicitly enabled 2024-04-25 17:54:07 +02:00
Viktor Lofgren
41576e74d4 (doc) Clean up ROADMAP.md 2024-04-25 15:53:46 +02:00
Viktor Lofgren
c8ee354d0b (log) Make log dir configurable via environment variable 2024-04-25 15:09:18 +02:00
Viktor Lofgren
4e5f069809 (build) Migrate ssr to the new root setting schema of java lang version 2024-04-25 15:08:56 +02:00
Viktor Lofgren
6690e9bde8 (service) Ensure the service discovery starts early
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
2024-04-25 15:08:33 +02:00
Viktor Lofgren
e4b34b6ee6 (index) Correctly detect the presence of an all-virtual path through the query 2024-04-25 14:01:46 +02:00
Viktor Lofgren
3952ef6ca5 (service) Let singleservice configure ports and bind addresses 2024-04-25 13:49:57 +02:00
Viktor Lofgren
463d333846 (proj) Add ROADMAP.md 2024-04-25 13:07:35 +02:00
Viktor Lofgren
7eb5e6aa66 (crawler) Abort recrawl if error count is too high 2024-04-24 21:46:40 +02:00
Viktor Lofgren
282022d64e (crawler) Remove unnecessary double-fetch of the root document 2024-04-24 14:44:39 +02:00
Viktor Lofgren
91a98a8807 (crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber 2024-04-24 14:44:39 +02:00
Viktor Lofgren
32fe864a33 (build) Java 22 and its consequences has been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e1c9313396 (crawler) Emulate if-modified-since for domains that don't support the header
This will help reduce the strain on some server software, in particular Discourse.
2024-04-24 14:44:39 +02:00
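The emulation described above can be approximated client-side by comparing the new response's Last-Modified header against what was stored during the previous crawl, and synthesizing a "not modified" outcome locally. A hypothetical sketch in Python for illustration (the crawler itself is Java, and the function name is invented):

```python
from email.utils import parsedate_to_datetime

def treat_as_unmodified(stored_last_modified: str, response_last_modified: str) -> bool:
    # Synthesize a "304 Not Modified" locally: if the server reports a
    # Last-Modified no newer than what we saw on the previous crawl,
    # skip re-fetching/re-processing the document body.
    try:
        return (parsedate_to_datetime(response_last_modified)
                <= parsedate_to_datetime(stored_last_modified))
    except (TypeError, ValueError):
        return False  # unparseable or missing dates: assume modified
```

This spares servers (like Discourse) that serve a full response regardless of conditional headers, since the crawler can discard unchanged content early.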
Viktor Lofgren
f430a084e8 (crawler) Remove accidental log spam 2024-04-24 14:44:39 +02:00
Viktor Lofgren
a86b596897 (crawler) Code quality 2024-04-24 14:44:39 +02:00
Viktor Lofgren
6dd87b0378 (crawler) Use the probe-result to reduce the likelihood of crawling both http and https
This should drastically reduce the number of fetched documents on many domains
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c9f029c214 (crawler) Strip W/-prefix from the etag when supplied as If-None-Match 2024-04-24 14:44:39 +02:00
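Weak validators carry a `W/` prefix per RFC 9110, and some origin servers only match the bare quoted value in `If-None-Match`. A minimal sketch of the stripping (Python for illustration; the crawler itself is Java):

```python
def if_none_match_value(etag: str) -> str:
    # Weak validators are prefixed with W/ (RFC 9110); strip the prefix
    # so servers that only compare bare tags still recognise the value.
    return etag[2:] if etag.startswith("W/") else etag
```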
Viktor Lofgren
6b88db10ad (crawler) Ensure all appropriate headers are recorded on the request 2024-04-24 14:44:39 +02:00
Viktor Lofgren
8a891c2159 (crawler/converter) Remove legacy junk from parquet migration 2024-04-24 14:44:39 +02:00
Viktor Lofgren
ad2ac8eee3 (query) Mark flaky test, correct assert on test 2024-04-24 14:44:39 +02:00
Viktor Lofgren
f46733a47a (ranking) TermCoherenceFactory should be run for size=2 queries 2024-04-24 14:44:39 +02:00
Viktor Lofgren
934167323d (converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation. 2024-04-24 14:44:39 +02:00
Viktor Lofgren
64baa41e64 (query) Always generate an ngram alternative, suppresses generation of multiple identical query branches 2024-04-24 14:44:39 +02:00
Viktor Lofgren
5165cf6d15 (ranking) Set regularMask correctly 2024-04-24 14:44:39 +02:00
Viktor Lofgren
4489b21528 (ranking) Cleanup 2024-04-24 14:44:39 +02:00
Viktor Lofgren
f623b37577 (ranking) Suppress NaN:s in ranking output 2024-04-24 14:44:39 +02:00
Viktor Lofgren
f4a2fea451 (ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N 2024-04-24 14:44:39 +02:00
Viktor Lofgren
a748fc5448 (index, bugfix) Pass url quality to query service 2024-04-24 14:44:39 +02:00
Viktor Lofgren
0dcca0cb83 (index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp 2024-04-24 14:44:39 +02:00
Viktor Lofgren
b80a83339b (qs) Additional info in query debug UI 2024-04-24 14:44:39 +02:00
Viktor Lofgren
eb74d08f2a (qs) Additional info in query debug UI 2024-04-24 14:44:39 +02:00
Viktor Lofgren
e79ab0c70e (qs) Basic query debug feature 2024-04-24 14:44:39 +02:00
Viktor Lofgren
e419e26f3a (proto) Improve handling of omitted parameters 2024-04-24 14:44:39 +02:00
Viktor Lofgren
6102fd99bf (qs) Improve logging 2024-04-24 14:44:39 +02:00
Viktor Lofgren
def36719d3 (query) Minor code cleanup 2024-04-24 14:44:39 +02:00
Viktor Lofgren
462aa9af26 (query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-24 14:44:39 +02:00
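Leaving default-valued parameters off the wire can be sketched as follows (parameter names taken from the commit messages; the default values here are hypothetical, as is the dict-based encoding):

```python
DEFAULTS = {"bm25NgramWeight": 0.25, "tcfJaccardWeight": 1.0}  # hypothetical values

def encode(params: dict) -> dict:
    # Only values that differ from the defaults are serialized.
    return {k: v for k, v in params.items() if DEFAULTS.get(k) != v}

def decode(wire: dict) -> dict:
    # The receiver reconstitutes the full parameter set from the defaults.
    return {**DEFAULTS, **wire}
```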
Viktor Lofgren
a09c84e1b8 (query) Modify tokenizer to match the behavior of the sentence extractor
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-24 14:44:39 +02:00
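The possessive issue can be illustrated with a toy tokenizer (hypothetical, not the project's actual code): both the indexing side and the query side must strip the trailing possessive, or "plato's" on one side will never match "plato" on the other:

```python
import re

def tokenize(text: str) -> list[str]:
    # Mirror the sentence extractor's behavior: lowercase, split into
    # word-like tokens, and strip a trailing possessive so "plato's"
    # normalizes to "plato" on both sides of the index.
    return [re.sub(r"'s$", "", tok) for tok in re.findall(r"[\w']+", text.lower())]
```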
Viktor Lofgren
44b33798f3 (index) Clean up jaccard index term code and down-tune the parameter's importance a bit 2024-04-24 14:44:39 +02:00
Viktor Lofgren
2f0b648fad (index) Add jaccard index term to boost results based on term overlap 2024-04-24 14:44:39 +02:00
Viktor Lofgren
de0e56f027 (index) Remove position overlap check, coherences will do the work instead 2024-04-24 14:44:39 +02:00
Viktor Lofgren
973ced7b13 (index) Omit absent terms from coherence checks 2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb4b824a85 (index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus 2024-04-24 14:44:39 +02:00
Viktor Lofgren
c583a538b1 (search) Add implicit coherence constraints based on segmentation 2024-04-24 14:44:39 +02:00
Viktor Lofgren
e0224085b4 (index) Improve recall for small queries
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44c1e1d6d9 (index) Remove dead code
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c620e9c026 (index) Experimental performance regression fix 2024-04-24 14:44:39 +02:00
Viktor Lofgren
1bb88968c5 (test) Fix broken test 2024-04-24 14:44:39 +02:00
Viktor Lofgren
df75e8f4aa (index) Explicitly free LongQueryBuffers 2024-04-24 14:44:39 +02:00
Viktor Lofgren
adf846bfd2 (index) Fix term coherence evaluation
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
1748fcc5ac (valuation) Impose stronger constraints on locality of terms
Clean up logic a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
08416393e0 (valuation) Impose stronger constraints on locality of terms 2024-04-24 14:44:39 +02:00
Viktor Lofgren
fce26015c9 (encyclopedia) Index the full articles
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits.  This was not a good idea, so the change is reverted.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
155be1078d (index) Fix priority search terms
This functionality fell into disrepair some while ago.  It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6efc0f21fe (index) Clean up data model
The change set cleans up the data model for the term-level data.  This used to contain a bunch of fields with document-level metadata.  This data-duplication means a larger memory footprint and worse memory locality.

The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking.  This is again an effort to improve memory locality.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f3255e080d (ngram) Grab titles separately when extracting ngrams from wiki data 2024-04-24 14:44:39 +02:00
Viktor Lofgren
0da03d4cfc (zim) Fix title extractor 2024-04-24 14:44:39 +02:00
Viktor Lofgren
5f6a3ef9d0 (ngram) Correct |s|^|s|-normalization to use length and not count 2024-04-24 14:44:39 +02:00
Viktor Lofgren
afc4fed591 (ngram) Correct size value in ngram lexicon generation, trim the terms better 2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb505f98ef (ngram) Use simple blocking pool instead of FJP; split on underscores in article names. 2024-04-24 14:44:39 +02:00
Viktor Lofgren
a0b3634cb6 (ngram) Only extract frequencies of title words, but use the body to increment the counters...
The sign of the counter is used to indicate whether a term has appeared as title.  Until it's seen in the title, it's provisionally saved as a negative count.
2024-04-24 14:44:39 +02:00
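The sign trick described above can be sketched like this (a hypothetical reconstruction, not the exporter's actual code): body occurrences accumulate as negative counts until the term is first seen in a title, at which point the accumulated magnitude flips positive.

```python
from collections import defaultdict

counts: dict[str, int] = defaultdict(int)

def observe(term: str, in_title: bool) -> None:
    c = counts[term]
    if in_title:
        counts[term] = abs(c) + 1   # confirm as a title term, keeping the prior body count
    elif c > 0:
        counts[term] = c + 1        # already confirmed: count body occurrences normally
    else:
        counts[term] = c - 1        # provisional: store body occurrences as a negative count

def final_frequencies() -> dict[str, int]:
    # Only terms that appeared in at least one title survive.
    return {t: c for t, c in counts.items() if c > 0}
```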
Viktor Lofgren
e23359bae9 (query, minor) Remove debug statement 2024-04-24 14:44:39 +02:00
Viktor Lofgren
5531ed632a (query, minor) Remove debug statement 2024-04-24 14:44:39 +02:00
Viktor Lofgren
150ee21f3c (ngram) Clean up ngram lexicon code
This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
c96da0ce1e (segmentation) Pick best segmentation using |s|^|s|-style normalization
This is better than doing all segmentations possible at the same time.
2024-04-24 14:44:38 +02:00
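One plausible reading of the |s|^|s| weighting (a hypothetical sketch, not the project's actual scoring code): weight each dictionary segment by its length raised to itself, so a single long known n-gram dominates several short ones.

```python
def score(segmentation: list[str], lexicon: set[str]) -> int:
    # |s|^|s| weighting: a known segment of n words contributes n**n,
    # so "new york city" (3**3 = 27) beats "new york" + "city" (4 + 1).
    return sum(len(s.split()) ** len(s.split())
               for s in segmentation if s in lexicon)

lexicon = {"new york", "new york city", "city"}
candidates = [
    ["new york city"],
    ["new york", "city"],
    ["new", "york", "city"],
]
best = max(candidates, key=lambda seg: score(seg, lexicon))
```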
Viktor Lofgren
a0d9e66ff7 (ngram) Fix index range in NgramLexicon to avoid an exception 2024-04-24 14:44:38 +02:00
Viktor Lofgren
55f627ed4c (index) Clean up the code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
7dd8c78c6b (ngrams) Remove the vestigial logic for capturing permutations of n-grams
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8bf7d090fd (qs) Clean up parsing code using new record matching 2024-04-24 14:44:38 +02:00
Viktor Lofgren
6bfe04b609 (term-freq-exporter) Reduce thread count and memory usage 2024-04-24 14:44:38 +02:00
Viktor Lofgren
491d6bec46 (term-freq-exporter) Extract ngrams in term-frequency-exporter 2024-04-24 14:44:38 +02:00
Viktor Lofgren
4fb86ac692 (search) Fix outdated assumptions about the results
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.

For the API service, we'll simulate the old behavior to keep the API stable.

For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6cba6aef3b (minor) Remove dead code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
7e216db463 (index) Add origin trace information for index readers
This used to be supported by the system but got lost in refactoring at some point.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
adc90c8f1e (sentence-extractor) Fix resource leak in sentence extractor
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.

The modified behavior checks for nullity before creating a new instance.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
e3316a3672 (index) Clean up new index query code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
a3a6d6292b (qs, index) New query model integrated with index service.
Seems to work, tests are green and initial testing finds no errors.  Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8cb9455c32 (qs, WIP) Fix edge cases in query compilation
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w).  The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
dc65b2ee01 (qs, WIP) Clean up dead code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
98a1adbf81 (qs, WIP) Tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
0bd1e15cce (qs, WIP) Tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
eda926767e (qs, WIP) Tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
cd1a18c045 (qs, WIP) Break up code and tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
6f567fbea8 (qs, WIP) Fix output determinism, fix tests 2024-04-24 14:44:38 +02:00
Viktor Lofgren
0ebadd03a5 (WIP) Query rendering finally beginning to look like it works 2024-04-24 14:44:38 +02:00
Viktor Lofgren
2253b556b2 WIP 2024-04-24 14:44:17 +02:00
Viktor Lofgren
6a7a7009c7 (convert) Initial integration of segmentation data into the converter's keyword extraction logic 2024-04-24 14:44:17 +02:00
Viktor Lofgren
3c75057dcd (qs) Retire NGramBloomFilter, integrate new segmentation model instead 2024-04-24 14:44:17 +02:00
Viktor Lofgren
212d101727 (control) GUI for exporting segmentation data from a wikipedia zim 2024-04-24 14:44:17 +02:00
Viktor Lofgren
760b80659d (WIP) Partial integration of new query expansion code into the query-service 2024-04-24 14:44:17 +02:00
Viktor Lofgren
04879c005d (WIP) Improve data extraction from wikipedia data 2024-04-24 14:44:17 +02:00
Viktor Lofgren
cb82927756 (WIP) Implement first take of new query segmentation algorithm 2024-04-24 14:44:17 +02:00
Viktor Lofgren
8b9629f2f6 (crawler) Remove unnecessary double-fetch of the root document 2024-04-24 14:38:59 +02:00
Viktor Lofgren
f6db16b313 (crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber 2024-04-24 14:10:03 +02:00
Viktor Lofgren
4668b1ddcb (build) Java 22 and its consequences has been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
dcf9d9caad (crawler) Emulate if-modified-since for domains that don't support the header
This will help reduce the strain on some server software, in particular Discourse.
2024-04-22 17:26:31 +02:00
Viktor Lofgren
7a69b76001 (crawler) Remove accidental log spam 2024-04-22 15:51:37 +02:00
Viktor Lofgren
ac07ef822f (crawler) Code quality 2024-04-22 15:37:35 +02:00
Viktor Lofgren
e7d4bcd872 (crawler) Use the probe-result to reduce the likelihood of crawling both http and https
This should drastically reduce the number of fetched documents on many domains
2024-04-22 15:36:43 +02:00
Viktor Lofgren
a28c6d7cfe (crawler) Strip W/-prefix from the etag when supplied as If-None-Match 2024-04-22 14:31:05 +02:00
Viktor Lofgren
d816f048f5 (crawler) Ensure all appropriate headers are recorded on the request 2024-04-22 14:14:24 +02:00
Viktor Lofgren
b09ddd0036 (crawler/converter) Remove legacy junk from parquet migration 2024-04-22 12:34:28 +02:00
Viktor Lofgren
0a73b02a00 (query) Mark flaky test, correct assert on test 2024-04-21 12:30:14 +02:00
Viktor Lofgren
8769704462 (ranking) TermCoherenceFactory should be run for size=2 queries 2024-04-21 12:29:25 +02:00
Viktor Lofgren
214551f1df (converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation. 2024-04-19 20:36:01 +02:00
Viktor Lofgren
2cc74c005a (query) Always generate an ngram alternative, suppresses generation of multiple identical query branches 2024-04-19 19:42:30 +02:00
Viktor Lofgren
ed250f57f2 (ranking) Set regularMask correctly 2024-04-19 14:31:57 +02:00
Viktor Lofgren
e92c25f7e0 (ranking) Cleanup 2024-04-19 14:13:12 +02:00
Viktor Lofgren
3ab563f314 (ranking) Suppress NaN:s in ranking output 2024-04-19 13:58:28 +02:00
Viktor Lofgren
426338cb45 (ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N 2024-04-19 12:41:48 +02:00
Viktor Lofgren
5fa2375898 (index, bugfix) Pass url quality to query service 2024-04-19 12:41:26 +02:00
Viktor Lofgren
41782a0ab5 (index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp 2024-04-19 12:19:26 +02:00
Viktor Lofgren
9b06433b82 (qs) Additional info in query debug UI 2024-04-19 12:18:53 +02:00
Viktor Lofgren
def607d840 (qs) Additional info in query debug UI 2024-04-19 11:46:27 +02:00
Viktor Lofgren
2b811fb422 (qs) Basic query debug feature 2024-04-19 11:00:56 +02:00
Viktor Lofgren
36cc62c10c (proto) Improve handling of omitted parameters 2024-04-18 10:47:12 +02:00
Viktor Lofgren
975d92912c (qs) Improve logging 2024-04-18 10:44:08 +02:00
Viktor Lofgren
8bbaf457de (query) Minor code cleanup 2024-04-18 10:37:51 +02:00
Viktor Lofgren
7641a02f31 (query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-18 10:36:15 +02:00
Viktor Lofgren
ce16239e34 (query) Modify tokenizer to match the behavior of the sentence extractor
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-17 17:54:32 +02:00
Viktor Lofgren
d64bd227cf (index) Clean up jaccard index term code and down-tune the parameter's importance a bit 2024-04-17 17:40:16 +02:00
Viktor Lofgren
c5ab0a9054 (index) Add jaccard index term to boost results based on term overlap 2024-04-17 16:50:26 +02:00
Viktor Lofgren
dac948973d (index) Remove position overlap check, coherences will do the work instead 2024-04-17 14:20:01 +02:00
Viktor Lofgren
9d008d1d6f (index) Omit absent terms from coherence checks 2024-04-17 14:12:16 +02:00
Viktor Lofgren
f52457213e (index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus 2024-04-17 14:05:02 +02:00
Viktor Lofgren
579295a673 (search) Add implicit coherence constraints based on segmentation 2024-04-17 14:03:35 +02:00
Viktor Lofgren
af8ff8ce99 (index) Improve recall for small queries
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-16 22:51:03 +02:00
Viktor Lofgren
7fa3e86e64 (index) Remove dead code
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-16 19:59:27 +02:00
Viktor Lofgren
3359f72239 (index) Experimental performance regression fix 2024-04-16 19:48:14 +02:00
Viktor Lofgren
41fa154aa6 (test) Fix broken test 2024-04-16 19:48:14 +02:00
Viktor Lofgren
deaba0152d (index) Explicitly free LongQueryBuffers 2024-04-16 19:23:00 +02:00
Viktor Lofgren
feaef6093e (index) Fix term coherence evaluation
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-16 18:07:43 +02:00
Viktor Lofgren
078fa4fdd0 (valuation) Impose stronger constraints on locality of terms
Clean up logic a bit
2024-04-16 17:22:58 +02:00
Viktor Lofgren
2dc77a0638 (valuation) Impose stronger constraints on locality of terms 2024-04-16 17:15:21 +02:00
Viktor
cfd9a7187f (query-segmentation) Merge pull request #89 from MarginaliaSearch/query-segmentation
The changeset cleans up the query parsing logic in the query service. It gets rid of a lot of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and adds a new cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term.

A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model.

A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This is for the benefit of easily writing rules for generating alternative queries, e.g. using the new segmentation data.

The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine".

This compiled query is passed to the index, which parses the expression, where it is used for execution of the search and ranking of the results.
2024-04-16 15:31:05 +02:00
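The expansion described above is essentially a cartesian product over alternative groups; a minimal illustration using the example from the PR description (not the query service's actual parser):

```python
from itertools import product

# "( wiby | marginalia | kagi ) ( search engine | searchengine )"
# represented as a sequence of alternative groups:
groups = [["wiby", "marginalia", "kagi"], ["search engine", "searchengine"]]

# The cartesian product over the groups yields every concrete query variant.
variants = [" ".join(choice) for choice in product(*groups)]
```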
Viktor Lofgren
f434a8b492 (build) Upgrade jib plugin version 2024-04-16 15:25:23 +02:00
Viktor Lofgren
d2658d6f84 (sys) Add springboard service that can spawn multiple different marginalia services to make distribution easier. 2024-04-16 13:25:15 +02:00
Viktor Lofgren
8c559c8121 (conf) Add additional logic for discovering system root 2024-04-16 12:37:18 +02:00
Viktor Lofgren
2353c73c57 (encyclopedia) Index the full articles
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits.  This was not a good idea, so the change is reverted.
2024-04-16 12:10:13 +02:00
Viktor Lofgren
599e719ad4 (index) Fix priority search terms
This functionality fell into disrepair some while ago.  It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-15 16:44:08 +02:00
Viktor Lofgren
b6d365bacd (index) Clean up data model
The change set cleans up the data model for the term-level data.  This used to contain a bunch of fields with document-level metadata.  This data-duplication means a larger memory footprint and worse memory locality.

The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking.  This is again an effort to improve memory locality.
2024-04-15 16:04:07 +02:00
Viktor Lofgren
52f0c0d336 (ngram) Grab titles separately when extracting ngrams from wiki data 2024-04-13 19:34:16 +02:00
Viktor Lofgren
be55f3f937 (zim) Fix title extractor 2024-04-13 19:33:47 +02:00
Viktor Lofgren
fda1c05164 (ngram) Correct |s|^|s|-normalization to use length and not count 2024-04-13 18:05:30 +02:00
Viktor Lofgren
1329d4abd8 (ngram) Correct size value in ngram lexicon generation, trim the terms better 2024-04-13 17:51:02 +02:00
Viktor Lofgren
f064992137 (ngram) Use simple blocking pool instead of FJP; split on underscores in article names. 2024-04-13 17:07:23 +02:00
Viktor Lofgren
8a81a480a1 (ngram) Only extract frequencies of title words, but use the body to increment the counters...
The sign of the counter is used to indicate whether a term has appeared as title.  Until it's seen in the title, it's provisionally saved as a negative count.
2024-04-12 18:08:31 +02:00
Viktor Lofgren
d729c400e5 (query, minor) Remove debug statement 2024-04-12 17:52:55 +02:00
Viktor Lofgren
ad4810d991 (query, minor) Remove debug statement 2024-04-12 17:45:26 +02:00
Viktor Lofgren
6a67043537 (ngram) Clean up ngram lexicon code
This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.
2024-04-12 17:45:06 +02:00
Viktor Lofgren
864d6c28e7 (segmentation) Pick best segmentation using |s|^|s|-style normalization
This is better than doing all segmentations possible at the same time.
2024-04-12 17:44:14 +02:00
Viktor Lofgren
bb6b51ad91 (ngram) Fix index range in NgramLexicon to avoid an exception 2024-04-12 10:13:25 +02:00
Viktor Lofgren
65e3caf402 (index) Clean up the code 2024-04-11 18:50:21 +02:00
Viktor Lofgren
b7d9a7ae89 (ngrams) Remove the vestigial logic for capturing permutations of n-grams
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-11 18:12:01 +02:00
Viktor Lofgren
ed73d79ec1 (qs) Clean up parsing code using new record matching 2024-04-11 17:36:08 +02:00
Viktor Lofgren
c538c25008 (term-freq-exporter) Reduce thread count and memory usage 2024-04-10 17:11:23 +02:00
Viktor Lofgren
4b47fadbab (term-freq-exporter) Extract ngrams in term-frequency-exporter 2024-04-10 16:58:05 +02:00
Viktor Lofgren
fcdc843c15 (search) Fix outdated assumptions about the results
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.

For the API service, we'll simulate the old behavior to keep the API stable.

For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-07 12:09:44 +02:00
Viktor Lofgren
dbdcf459a7 (minor) Remove dead code 2024-04-06 16:27:16 +02:00
Viktor Lofgren
ef25d60666 (index) Add origin trace information for index readers
This used to be supported by the system but got lost in refactoring at some point.
2024-04-06 13:28:14 +02:00
Viktor Lofgren
7f7021ce64 (sentence-extractor) Fix resource leak in sentence extractor
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.

The modified behavior checks for nullity before creating a new instance.
2024-04-05 18:52:58 +02:00
Viktor Lofgren
448a941de2 (encyclopedia) Fix memory issue in preconversion step
Use SimpleBlockingThreadPool instead of Java's work-stealing pool, as the latter causes runaway memory consumption in some circumstances, while SimpleBlockingThreadPool uses a bounded queue and always pushes back against the supplier when it can't hold any more tasks.
2024-04-05 16:57:53 +02:00
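The backpressure behavior can be sketched with a bounded queue whose put() blocks the producer when full (illustration only; the project's SimpleBlockingThreadPool is Java):

```python
import queue
import threading

class BoundedPool:
    """submit() blocks once the task queue is full, pushing back on the producer."""

    def __init__(self, n_threads: int, queue_size: int):
        self._tasks: queue.Queue = queue.Queue(maxsize=queue_size)
        self._workers = [threading.Thread(target=self._run, daemon=True)
                         for _ in range(n_threads)]
        for w in self._workers:
            w.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:          # sentinel: stop this worker
                return
            task()

    def submit(self, task):
        self._tasks.put(task)         # blocks while the queue is at capacity

    def shutdown(self):
        for _ in self._workers:
            self._tasks.put(None)     # one sentinel per worker, after all real tasks
        for w in self._workers:
            w.join()
```

Unlike an unbounded work-stealing pool, the producer here can never race ahead of the consumers by more than `queue_size` tasks, which keeps peak memory proportional to the queue bound rather than to the input size.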
Viktor Lofgren
5766da69ec (gradle) Upgrade to Gradle 8.7
This will reduce the hassle of juggling JDK versions for JDK 22, which was not supported by Gradle 8.5.
2024-04-05 15:15:49 +02:00
Joshua Holland
617e633d7a Update keywords docs use of explore to browse
I can't tell when this happened, but the proper keyword now seems to be browse and not explore.
2024-04-05 15:15:49 +02:00
Viktor Lofgren
b770a1143f (run) Fix traefik middleware configuration 2024-04-05 15:15:49 +02:00
Viktor Lofgren
e1151ecf2a (gradle) Upgrade to Gradle 8.7
This will reduce the hassle of juggling JDK versions for JDK 22, which was not supported by Gradle 8.5.
2024-04-05 15:12:38 +02:00
Viktor Lofgren
ae7c760772 (index) Clean up new index query code 2024-04-05 13:30:49 +02:00
Viktor Lofgren
81815f3e0a (qs, index) New query model integrated with index service.
Seems to work, tests are green and initial testing finds no errors.  Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-04 20:17:58 +02:00
Viktor
3890c413a3 Merge pull request #88 from jmholla/patch-1
Update keywords docs use of explore to browse
2024-04-01 09:14:02 +02:00
Joshua Holland
8e02f567d7 Update keywords docs use of explore to browse
I can't tell when this happened, but the proper keyword now seems to be browse and not explore.
2024-04-01 00:04:12 -05:00
Viktor Lofgren
87bb93e1d4 (qs, WIP) Fix edge cases in query compilation
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w).  The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-03-29 12:40:27 +01:00
Viktor Lofgren
e596c929ac (qs, WIP) Clean up dead code 2024-03-28 16:37:23 +01:00
Viktor Lofgren
9852b0e609 (qs, WIP) Tidy it up a bit 2024-03-28 14:18:26 +01:00
Viktor Lofgren
51b0d6c0d3 (qs, WIP) Tidy it up a bit 2024-03-28 14:09:17 +01:00
Viktor Lofgren
15391c7a88 (qs, WIP) Tidy it up a bit 2024-03-28 13:54:30 +01:00
Viktor Lofgren
fe62593286 (qs, WIP) Break up code and tidy it up a bit 2024-03-28 13:26:54 +01:00
Viktor Lofgren
4cc11e183c (qs, WIP) Fix output determinism, fix tests 2024-03-28 13:11:26 +01:00
Viktor Lofgren
de8e753fc8 (run) Fix traefik middleware configuration 2024-03-28 13:03:12 +01:00
Viktor Lofgren
f82ebd7716 (WIP) Query rendering finally beginning to look like it works 2024-03-28 13:01:21 +01:00
Viktor Lofgren
bd0704d5a4 (*) Fix JDK22 migration issues
A few bizarre build errors cropped up when migrating to JDK22.  Not at all sure what caused them, but they were easy to mitigate.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
1968485881 (docs) Upgrade to JDK22 2024-03-21 14:33:27 +01:00
Viktor Lofgren
002afca1c5 (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:33:27 +01:00
Your Name
411b3f3138 (run/install.sh) fix docker compose file
I was following the release demo video for v2024.01.0
https://www.youtube.com/watch?v=PNwMkenQQ24 and when I did 'docker
compose up' the containers couldn't resolve the DNS name for 'zookeeper'
I realized this was because the zookeeper container was using the
default docker network, so I specified the wmsa network explicitly.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
a4b810f511 WIP 2024-03-21 14:33:26 +01:00
Viktor
cd8f33f830 Merge pull request #86 from MarginaliaSearch/jdk-22
Lift JDK version to 22
2024-03-21 14:29:41 +01:00
Viktor Lofgren
824765b1ee (*) Fix JDK22 migration issues
A few bizarre build errors cropped up when migrating to JDK22.  Not at all sure what caused them, but they were easy to mitigate.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
9e8138f853 (docs) Upgrade to JDK22 2024-03-21 14:27:13 +01:00
Viktor Lofgren
fe8d583fdd (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
0bd3365c24 (convert) Initial integration of segmentation data into the converter's keyword extraction logic 2024-03-19 14:28:42 +01:00
Viktor Lofgren
d8f4e7d72b (qs) Retire NGramBloomFilter, integrate new segmentation model instead 2024-03-19 10:42:09 +01:00
Viktor Lofgren
afc047cd27 (control) GUI for exporting segmentation data from a wikipedia zim 2024-03-18 13:45:23 +01:00
Viktor Lofgren
00ef4f9803 (WIP) Partial integration of new query expansion code into the query-service 2024-03-18 13:16:49 +01:00
Viktor Lofgren
07e4d7ec6d (WIP) Improve data extraction from wikipedia data 2024-03-18 13:16:00 +01:00
Viktor
258a344810 Merge pull request #85 from patrickbreen/master
(run/install.sh) fix docker compose file
2024-03-18 13:09:30 +01:00
Your Name
2a03014652 (run/install.sh) fix docker compose file
I was following the release demo video for v2024.01.0
https://www.youtube.com/watch?v=PNwMkenQQ24 and when I did 'docker
compose up' the containers couldn't resolve the DNS name for 'zookeeper'
I realized this was because the zookeeper container was using the
default docker network, so I specified the wmsa network explicitly.
2024-03-17 15:33:19 -04:00
Viktor Lofgren
8ae1f08095 (WIP) Implement first take of new query segmentation algorithm 2024-03-12 13:12:50 +01:00
Viktor Lofgren
57e6a12d08 (registry) Correct registerMonitor() behavior
The previous behavior listened to too many changes and, based on zookeeper rather than curator assumptions about behavior, added an additional monitor on each invocation of each monitor (which always triggers on service state changes). This led to each monitor re-registering and effectively doubling in number whenever a service stopped or started, which in turn meant a lot of bizarre thrashing behavior even on changes in services that don't explicitly talk to each other.

This re-registering behavior is no longer done.
2024-03-06 12:22:15 +01:00
Viktor Lofgren
46423612e3 (refac) Merge service-discovery and service modules
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
29bf473d74 (encyclopedia) Add URLencoding to path element
This prevents corruption of the links to the sideloaded encyclopedia data when the article path contains characters that are not valid in a URL.
2024-03-01 17:28:09 +01:00
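The fix described above boils down to URL-encoding each path element before it is interpolated into a link. A minimal sketch of the idea (the helper name is illustrative, not the actual Marginalia code; note that `URLEncoder` produces form encoding, so `+` is mapped back to `%20` for path segments):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PathEncoding {
    // Encode a single path element so reserved characters survive in a URL.
    // URLEncoder targets form encoding, which uses '+' for spaces; paths
    // conventionally use %20, so we map it back.
    public static String encodePathElement(String element) {
        return URLEncoder.encode(element, StandardCharsets.UTF_8)
                .replace("+", "%20");
    }
}
```

An article path like `Déjà vu (disambiguation)` then becomes a link-safe string instead of corrupting the URL.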
Viktor Lofgren
9689f3faee (domain-info) Fix incorrect array indexing 2024-02-29 18:56:09 +01:00
Viktor Lofgren
93fa58c93d (domain-info) Fix incorrect array indexing
Using the id instead of idx when addressing the ranksArray caused exceptions.
2024-02-29 17:54:23 +01:00
Viktor Lofgren
186a98cc99 (doc) Fix wonky bullet lists 2024-02-28 17:43:05 +01:00
Viktor Lofgren
9993f265ca (doc) Remove irrelevant text 2024-02-28 17:40:05 +01:00
Viktor Lofgren
144f967dbf (misc) Tweak pool sizes 2024-02-28 16:23:02 +01:00
Viktor Lofgren
b31c9bb726 (docs) Update process docs 2024-02-28 15:21:33 +01:00
Viktor Lofgren
c0820b5e5c (docs) Update service docs 2024-02-28 15:19:31 +01:00
Viktor Lofgren
65b8a1d5d9 (grpc) Reduce error spam 2024-02-28 14:44:48 +01:00
Viktor Lofgren
a0648844fb (grpc) Reduce error spam 2024-02-28 14:35:29 +01:00
Viktor Lofgren
c4a27003c6 (docs) Fix formatting 2024-02-28 14:22:57 +01:00
Viktor Lofgren
41abd8982f (math) Clean up error handling 2024-02-28 14:19:50 +01:00
Viktor Lofgren
86bbc1043e (service) Clean up thread pool creation 2024-02-28 14:06:32 +01:00
Viktor Lofgren
9a045a0588 (index) Clean up index code 2024-02-28 13:09:47 +01:00
Viktor Lofgren
9415539b38 (docs) Update docs 2024-02-28 12:25:19 +01:00
Viktor Lofgren
84bab2783d (docs) Fix fake news in docs 2024-02-28 12:16:45 +01:00
Viktor
0d6e7673e4 Merge pull request #81 from MarginaliaSearch/service-discovery
Zookeeper for service-discovery, kill service-client lib, refactor everything
2024-02-28 12:15:25 +01:00
Viktor Lofgren
d78e9e715f (misc) Fix broken tests 2024-02-28 12:12:43 +01:00
Viktor Lofgren
a8ec59eb75 (conf) Add migration warning when ZOOKEEPER_HOSTS is not set. 2024-02-28 12:09:38 +01:00
Viktor Lofgren
20fc0ef13c (gradle) Add task alias 'docker' for 'jibDockerBuild'
The change also moves the jib boilerplate to an include.
2024-02-28 11:59:15 +01:00
Viktor Lofgren
37ae8cb33c Migrate the docker compose files 2024-02-28 11:48:16 +01:00
Viktor Lofgren
9f1649636e Clean up documentation and rename domain-links to link-graph 2024-02-28 11:40:39 +01:00
Viktor Lofgren
3a65fe8917 Add offload executor to GrpcChannelPoolFactory 2024-02-27 22:08:39 +01:00
Viktor Lofgren
99a6e56e99 (index-client) Increase thread count in index client
This should be a fair bit larger than the number of index nodes
2024-02-27 22:00:29 +01:00
Viktor Lofgren
e696fd9e92 (docs) Begin un-fucking the docs after refactoring 2024-02-27 21:22:21 +01:00
Viktor Lofgren
c943954bb4 (domain-info) Reduce memory usage 2024-02-27 21:22:21 +01:00
Viktor Lofgren
eaf836dc66 (service/grpc) Reduce thread count
Netty and GRPC by default spawns an incredible number of threads on high-core CPUs, which amount to a fair bit of RAM usage.

Add custom executors that throttle this behavior.
2024-02-27 21:22:21 +01:00
Viktor Lofgren
dbf64b0987 (logs) Add the option for json logging 2024-02-27 21:22:20 +01:00
Viktor Lofgren
8d0af9548b (search) Bot mitigation
Add the ability to indicate to the search service that a request is malicious, and to poison the results by providing randomly reordered old results instead.
2024-02-27 21:22:19 +01:00
Viktor Lofgren
67aa20ea2c (array) Attempting to debug strange errors 2024-02-27 21:22:18 +01:00
Viktor Lofgren
5604e9f531 (query) Bump query length, see what happens :P 2024-02-27 21:22:17 +01:00
Viktor Lofgren
1a51ec2d69 (index) Index optimization 2024-02-27 21:22:17 +01:00
Viktor Lofgren
3eb0800742 (index) Improve granularity of candidate queue polling 2024-02-27 21:22:17 +01:00
Viktor Lofgren
427f3e922f (index) Retire count operation, clean up index code. 2024-02-27 21:22:17 +01:00
Viktor Lofgren
823ca73a3f (domain-ranking) Fix a crash during ranking when the edges of the similarity graph don't quite match the vertices of the link graph. 2024-02-27 21:22:17 +01:00
Viktor Lofgren
7fc0d4d786 (index) Observability for query execution queues 2024-02-27 21:22:17 +01:00
Viktor Lofgren
b8e336e809 (index) Reduce time allocation a bit 2024-02-27 21:22:17 +01:00
Viktor Lofgren
9429bf5c45 (index) Clean up 2024-02-27 21:22:17 +01:00
Viktor Lofgren
f7f0100174 (build) Make docker image registry and tag configurable in root build.gradle 2024-02-25 11:08:49 +01:00
Viktor Lofgren
fc00701a1e (index) Experimental refactoring of the indexing functionality 2024-02-25 11:05:10 +01:00
Viktor Lofgren
09447f2ad2 (process service) Inherit parent's assertion status 2024-02-24 18:32:37 +01:00
Viktor Lofgren
ff0ef1eebc (cleanup) Minor cleanups 2024-02-24 15:33:56 +01:00
Viktor Lofgren
1d34224416 (refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.

While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules.  Which you'll do a lot, because it's *modul*ar.  The src/main/java convention makes a lot of sense for a non-modular project though.  This ain't that.
2024-02-23 16:13:40 +01:00
Viktor Lofgren
56d35aa596 (refac) Move execution API out of executor service 2024-02-23 13:26:11 +01:00
Viktor Lofgren
2201b1a506 (refac) Clean up code issues 2024-02-23 11:39:19 +01:00
Viktor Lofgren
5cdb07023b (refac) Clean up unused imports 2024-02-23 11:27:20 +01:00
Viktor Lofgren
6154e16951 (refac) Remove "distPath" 2024-02-23 11:22:02 +01:00
Viktor Lofgren
f4ff7185f0 (refac) Move process-mqapi out of api directory 2024-02-23 11:18:29 +01:00
Viktor Lofgren
6357d30ea0 Clean up docs 2024-02-22 19:53:20 +01:00
Viktor Lofgren
8d4ef982d0 Clean up docs 2024-02-22 19:37:59 +01:00
Viktor Lofgren
4740156cfa Clean up docs 2024-02-22 18:18:58 +01:00
Viktor Lofgren
f8e7f75831 Move index to top level of code 2024-02-22 18:01:35 +01:00
Viktor Lofgren
085137ca63 * Extract the index functionality 2024-02-22 17:31:25 +01:00
Viktor Lofgren
3fd2a83184 * Extract the search-query function 2024-02-22 15:27:39 +01:00
Viktor Lofgren
66c1281301 (zk-registry) epic yak shaving WIP
Cleaning out a lot of old junk from the code, and one thing led to another...

* Build is improved, now constructing docker images with 'jib'.  Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter.  Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```channelPool.call(grpcStub::method)
              .async(executor) // <-- optional
              .run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services.  Theoretically this means we could ship the same code either as a monolith or as a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service.  WIP!

Missing is documentation and testing, and some more breaking apart of code.
2024-02-22 14:01:23 +01:00
Viktor Lofgren
73947d9eca (zk-registry) Filter out phantom addresses in the registry
The change adds a hostname validation step to remove endpoints from the ZkServiceRegistry when they do not resolve.  This is a scenario that primarily happens when running in docker, and the entire system is started and stopped.
2024-02-20 18:09:11 +01:00
Viktor Lofgren
a69c0b2718 (grpc-client) Fix warmup crash
The warmup would sometimes crash during a cold start-up, because it could not get an API.  Changed the warmup to just create a GrpcSingleNodeChannelPool for the node.
2024-02-20 18:03:57 +01:00
Viktor Lofgren
6c764bceeb (doc) Update documentation for service-discovery 2024-02-20 16:09:49 +01:00
Viktor Lofgren
273aeb7bae (doc) Update documentation with new gRPC service setup 2024-02-20 16:06:05 +01:00
Viktor Lofgren
d185858266 (minor) Add missing query parameter to ServiceEndpoint.toURL 2024-02-20 15:49:43 +01:00
Viktor Lofgren
453bd6064b (minor) Add warm-up to GrpcMultiNodeChannelPool to speed up the initial messages
Without doing this, connections would be created lazily, which is probably never desirable.
2024-02-20 15:45:16 +01:00
Viktor Lofgren
904f2587cd (minor) Add default ZOOKEEPER_HOSTS to service.env 2024-02-20 15:44:26 +01:00
Viktor Lofgren
14172312dc (query-client) Fix query client
The query service delegates and aggregates IndexDomainLinksApiGrpc
messages to the index services.  The query client was accidentally
also doing this, instead of talking to the query service.

Fixed so it correctly talks to the query service and nothing else.
2024-02-20 15:44:07 +01:00
Viktor Lofgren
c600d7aa47 (refac) Inject ServiceRegistry into WebsiteAdjacenciesCalculator 2024-02-20 15:42:32 +01:00
Viktor Lofgren
3c9234078a (refac) Propagate ZOOKEEPER_HOSTS to spawned processes 2024-02-20 15:42:16 +01:00
Viktor Lofgren
ee8e0497ae (refac) Move service discovery injection to a separate guice module 2024-02-20 15:41:04 +01:00
Viktor Lofgren
fd5d121648 (minor) Add WMSA_IN_DOCKER to all docker files 2024-02-20 15:39:46 +01:00
Viktor Lofgren
30bdb4b4e9 (config) Clean up service configuration for IP addresses
Adds new ways to configure the bind and external IP addresses for a service.  Notably, if the environment variable WMSA_IN_DOCKER is present, the system will grab the HOSTNAME variable and announce that as the external address in the service registry.

The default bind address is also changed to be 0.0.0.0 only if WMSA_IN_DOCKER is present, otherwise 127.0.0.1; as this is a more secure default.
2024-02-20 14:22:48 +01:00
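The selection logic described in this commit can be sketched roughly as follows (the class is illustrative; only the environment variable names `WMSA_IN_DOCKER` and `HOSTNAME` come from the commit message):

```java
import java.util.Map;

public class AddressSelection {
    // Inside docker (WMSA_IN_DOCKER set) bind to all interfaces; otherwise
    // default to loopback, the more secure choice outside a container.
    public static String bindAddress(Map<String, String> env) {
        return env.containsKey("WMSA_IN_DOCKER") ? "0.0.0.0" : "127.0.0.1";
    }

    // Inside docker, announce the container HOSTNAME to the service registry.
    public static String externalAddress(Map<String, String> env) {
        if (env.containsKey("WMSA_IN_DOCKER")) {
            return env.getOrDefault("HOSTNAME", "localhost");
        }
        return "127.0.0.1";
    }
}
```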
Viktor Lofgren
2ee492fb74 (gRPC) Bind gRPC services to an interface
By default gRPC magically decides on an interface.  The change will explicitly tell it what to use.
2024-02-20 14:22:47 +01:00
Viktor Lofgren
36a5c8b44c (cleanup) Clean up code 2024-02-20 14:22:47 +01:00
Viktor Lofgren
07b625c58d (query-client) Add support for fault-tolerant requests to single node services
Adding a method importantCall that will retry a failing request on each route until it succeeds or the routes run out.
2024-02-20 14:16:05 +01:00
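A rough sketch of what such an importantCall retry might look like (generic and illustrative, not the actual client API):

```java
import java.util.List;
import java.util.function.Function;

public class RetryingCall {
    // Try the call against each route in turn and return the first success;
    // if every route fails, rethrow the last failure.
    public static <T, R> R importantCall(List<T> routes, Function<T, R> call) {
        RuntimeException lastFailure = null;
        for (T route : routes) {
            try {
                return call.apply(route);
            } catch (RuntimeException ex) {
                lastFailure = ex;
            }
        }
        throw lastFailure != null ? lastFailure
                : new IllegalStateException("No routes available");
    }
}
```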
Viktor Lofgren
746a865106 (client) Fix handling of channel refreshes
The previous code made an incorrect assumption that all routes refer to the same node, and would overwrite the route list on each update.  This led to storms of closing and opening channels whenever an update was received.

The new code is correctly aware that we may talk to multiple nodes.
2024-02-20 14:14:09 +01:00
Viktor
f85ec28a16 Merge branch 'master' into service-discovery 2024-02-20 11:44:12 +01:00
Viktor Lofgren
0307c55f9f (refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.

A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.

The last remaining REST service, the assistant-service, has been migrated to gRPC.

This also proved a good time to clear out primordial technical debt from the root of the codebase.  The 'service-client' library has been taken behind the barn and given a last farewell.  It's replaced by a small library for managing gRPC channels.

Since it's no longer used by anything, RxJava has been removed as a dependency from the project.

Although the current state seems reasonably stable, this is a work-in-progress commit.
2024-02-20 11:41:14 +01:00
Viktor
d05c916491 Merge pull request #80 from MarginaliaSearch/ranking-algorithms
Clean up domain ranking code
2024-02-18 09:52:34 +01:00
Viktor Lofgren
c73e43f5c9 (recrawl) Mitigate recrawl-before-load footgun
In the scenario where an operator

* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data

The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file,
irrecoverably losing the crawl log and making the data impossible to load!

To mitigate the impact of similar problems, the change saves a backup of the old crawl log, and complains loudly when this happens.

More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl.  This should help the DbCrawlSpecProvider to find them regardless of loaded state.

This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so the extra caution is merited.
2024-02-18 09:23:20 +01:00
Viktor Lofgren
e61e7f44b9 (blacklist) Delay startup of blacklist
To help services start faster, the blacklist will no longer block until it's loaded.  If such a behavior is desirable, a method was added to explicitly wait for the data.
2024-02-18 09:23:20 +01:00
Viktor Lofgren
f9b6ac03c6 (api) Clean up incorrect error handling in GrpcChannelPool 2024-02-18 08:45:35 +01:00
Viktor Lofgren
296ccc5f8e (blacklist) Clean up blacklist impl
The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod.

This change moves the loading to a separate thread entirely.  For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.
2024-02-18 08:16:48 +01:00
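The pattern described here, loading on a background thread with an explicit blocking helper for callers that need the complete data, might look roughly like this (illustrative class, not the actual blacklist implementation):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class LazyBlacklist {
    private final Set<String> blocked = ConcurrentHashMap.newKeySet();
    private final CountDownLatch loaded = new CountDownLatch(1);

    // Start loading on a separate thread so service start-up is not blocked.
    public LazyBlacklist(Iterable<String> source) {
        new Thread(() -> {
            source.forEach(blocked::add);
            loaded.countDown();
        }).start();
    }

    // Callers that genuinely need the complete list block here explicitly.
    public void waitUntilLoaded() {
        try {
            loaded.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public boolean isBlacklisted(String domain) {
        return blocked.contains(domain);
    }
}
```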
Viktor Lofgren
8cb5825617 (search) Temporarily disable the Popular filter
This filter currently does not distinguish itself very much from the unfiltered results, and lends the impression that the filters don't "do anything".

It may come back in some shape or form in the future, with some additional tweaking of the rankings...
2024-02-18 08:02:01 +01:00
Viktor Lofgren
cee707abd8 (crawler) Implement domain shuffling in DbCrawlSpecProvider
Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This prevents overloading a single server by crawling its subdomains in parallel, and avoids crawling big domains all at once.
2024-02-17 17:47:38 +01:00
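The shuffling itself is simple; a sketch of the idea (illustrative, with a seed parameter added here only to keep the example deterministic):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CrawlSpecShuffle {
    // Shuffle the domain list after loading so parallel crawl tasks don't
    // cluster on one server's subdomains or on a single huge domain.
    public static List<String> shuffledDomains(List<String> domains, long seed) {
        List<String> copy = new ArrayList<>(domains);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }
}
```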
Viktor Lofgren
92717a4832 (client) Refactor GrpcStubPool to handle error states
Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub.

The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.
2024-02-17 14:42:26 +01:00
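The re-creation behavior described can be sketched generically (illustrative types; the real pool works with gRPC's ManagedChannel and its SHUTDOWN connectivity state):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ChannelPoolSketch {
    // Stand-in for a gRPC ManagedChannel; only the liveness check matters here.
    public interface Channel { boolean isShutdown(); }

    private final Map<Integer, Channel> channels = new ConcurrentHashMap<>();
    private final Function<Integer, Channel> channelFactory;

    public ChannelPoolSketch(Function<Integer, Channel> channelFactory) {
        this.channelFactory = channelFactory;
    }

    // Return the channel for a node, replacing it if it has shut down.
    public Channel channelFor(int node) {
        return channels.compute(node, (n, existing) ->
                (existing == null || existing.isShutdown())
                        ? channelFactory.apply(n)
                        : existing);
    }
}
```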
Viktor Lofgren
37a7296759 (sideload) Clean up the sideloading code
Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach.

The reddit sideloader now uses the SideloaderProcessing class.  It also properly sets js-attributes for the sideloaded documents.

The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.
2024-02-17 14:32:36 +01:00
Viktor Lofgren
ebbe49d17b (sideload) Fix sideloading of explicitly selected stackexchange files
Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.
2024-02-17 13:24:04 +01:00
Viktor Lofgren
b7e330855f (control) Update descriptive text in the control GUI 2024-02-16 20:32:31 +01:00
Viktor Lofgren
ac89224fb0 (domain-ranking) Remove lingering mentions of the algorithms field from the GUI 2024-02-16 20:28:37 +01:00
Viktor Lofgren
9ec262ae00 (domain-ranking) Integrate new ranking logic
The change deprecates the 'algorithm' field from the domain ranking set configuration.  Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.
2024-02-16 20:22:01 +01:00
Viktor Lofgren
64acdb5f2a (domain-ranking) Clean up domain ranking
The domain ranking code was admittedly a bit of a clown fiesta; at the same time buggy, fragile and inscrutable.

Migrating over to use JGraphT to store the link graph
when doing rankings, and using their PageRank implementation.  Also added a modified version that does PersonalizedPageRank.
2024-02-16 18:04:58 +01:00
Viktor Lofgren
a175b36382 (search) Correct accidental regression of the SmallWeb filter 2024-02-15 18:16:56 +01:00
Viktor Lofgren
16526d283c (search) Correct accidental regression of the Vintage filter 2024-02-15 18:13:34 +01:00
Viktor Lofgren
752e677555 (search) Expose getSearchTitle in DecoratedSearchResults 2024-02-15 13:56:44 +01:00
Viktor Lofgren
f796af1ae8 (search) Fix failed refactoring 2024-02-15 13:53:19 +01:00
Viktor Lofgren
2515993536 (search) Fix issue where searchTitle setting gets lost when searching again
It's important that the field names in SearchParameters matches the fields referenced in search-form.hdb, otherwise they will get lost in transit.
2024-02-15 13:52:11 +01:00
Viktor Lofgren
66b3e71e56 (search) Expose more search options
This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias.

The QueryStrategy makes it possible to e.g. require a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period.

These options are added to the search interface.  The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well.

The vintage filter is modified to add a temporal bias for the past.
2024-02-15 13:39:51 +01:00
Viktor Lofgren
652d151373 (process-models) Improve documentation 2024-02-15 12:21:12 +01:00
Viktor Lofgren
300b1a1b84 (index-query) Add some tests for the QueryFilter code 2024-02-15 12:03:30 +01:00
Viktor Lofgren
6c3b49417f (index-query) Improve documentation and code quality 2024-02-15 11:33:50 +01:00
Viktor Lofgren
dcc5cfb7c0 (index-journal) Improve documentation and code quality 2024-02-15 10:51:49 +01:00
Viktor
d970836605 Merge pull request #79 from MarginaliaSearch/reddit
(converter) Loader for reddit data

Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.

Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.

Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run.

The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.
2024-02-15 09:17:56 +01:00
Viktor Lofgren
8021bd0aae (control) Sort upload listing results
Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename.

The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.
2024-02-15 09:13:40 +01:00
Viktor Lofgren
8f91156d80 (control) Improve sideload UX
The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable.

Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc.  It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.
2024-02-14 18:38:20 +01:00
Viktor Lofgren
fab36d6e63 (converter) Loader for reddit data
Adds experimental sideloading support for pushshift.io style reddit data.  This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.

Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes.  Empirically this appears to mostly return good matches, even if it probably could index more.

Tests were written for this, but all require local reddit data which can't be distributed with the source code.  If these cannot be found, the tests will short-circuit as OK.  They're mostly there for debugging, and it's fine if they don't always run.

The change also refactors the sideloading a bit since it was a bit messy.
2024-02-14 17:35:44 +01:00
Viktor Lofgren
3d54879c14 (API, minor) Clean up comments. 2024-02-14 12:09:16 +01:00
Viktor Lofgren
e17fcde865 (API, minor) Remove unnecessary inject. 2024-02-14 12:05:50 +01:00
Viktor Lofgren
6950dffcb4 (API) Fix result order in API results
These results should be presented in the same order as their ranking score.
2024-02-14 11:47:14 +01:00
2225 changed files with 73378 additions and 34514 deletions

.github/FUNDING.yml

@@ -1,5 +1,6 @@
 # These are supported funding model platforms
+polar: marginalia-search
 github: MarginaliaSearch
 patreon: marginalia_nu
 open_collective: # Replace with a single Open Collective username

.gitignore

@@ -7,3 +7,4 @@ build/
 lombok.config
 Dockerfile
 run
+jte-classes

ROADMAP.md

@@ -0,0 +1,95 @@
# Roadmap 2025
This is a roadmap with major features planned for Marginalia Search.
It's not set in any particular order and other features will definitely
be implemented as well.
Major goals:
* Reach 1 billion pages indexed
* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done.)
## Hybridize crawler w/ Common Crawl data
Sometimes Marginalia's relatively obscure crawler is blocked when attempting to crawl a website, or for
other technical reasons it may be prevented from doing so. A possible work-around is to hybridize the
crawler so that it attempts to fetch such inaccessible websites from common crawl. This is an important
step on the road to 1 billion pages indexed.
As a rough sketch, the crawler would identify target websites, consume CC's index, and then fetch the WARC data
with byte range queries.
Retaining the ability to independently crawl the web is still strongly desirable so going full CC is not an option.
## Safe Search
The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php))
combined with a naive Bayesian filter would go a long way, or something more sophisticated...?
## Additional Language Support
It would be desirable if the search engine supported more languages than English. This is partially about
rooting out assumptions regarding character encoding, but there's most likely some amount of custom logic
associated with each language added, at least a models file or two, as well as some fine tuning.
It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
## Support for binary formats like PDF
The crawler needs to be modified to retain them, and the conversion logic needs to parse them.
The documents database probably should have some sort of flag indicating it's a PDF as well.
PDF parsing is known to be a bit of a security liability so some thought needs to be put in
that direction as well.
## Custom ranking logic
Stract does an interesting thing where they have configurable search filters.
This looks like a good idea that wouldn't just help clean up the search filters on the main
website, but might be cheap enough we might go as far as to offer a number of ad-hoc custom search
filter for any API consumer.
I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc; instead he'd like to work together to find some new common description language for this.
## Show favicons next to search results
This is expected from search engines. Basic proof of concept sketch of fetching this data has been done, but the feature is some way from being reality.
## Specialized crawler for github
One of the search engine's biggest limitations right now is that it does not index github at all. A specialized crawler that fetches at least the readme.md would go a long way toward providing search capabilities in this domain.
# Completed
## Web Design Overhaul (COMPLETED 2025-01)
The design is kinda clunky and hard to maintain, and needlessly outdated-looking.
PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127)
## Finalize RSS support (COMPLETED 2024-11)
Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.
Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
## Proper Position Index (COMPLETED 2024-09)
The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
of being very fast to evaluate and works well for what it is, but is inaccurate and has the
drawback of making support for quoted search terms inaccurate and largely reliant on indexing
word n-grams known beforehand. This limits the ability to interpret longer queries.
The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
list, as is the civilized way of doing this.
Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)

build.gradle

@@ -1,8 +1,11 @@
 plugins {
     id 'java'
     id("org.jetbrains.gradle.plugin.idea-ext") version "1.0"
-    id "io.freefair.lombok" version "8.3"
     id "me.champeau.jmh" version "0.6.6"
+    // This is a workaround for a bug in the Jib plugin that causes it to stall randomly
+    // https://github.com/GoogleContainerTools/jib/issues/3347
+    id 'com.google.cloud.tools.jib' version '3.4.4' apply(false)
 }

 group 'marginalia'
@@ -13,6 +16,14 @@ compileTestJava.options.encoding = "UTF-8"
 subprojects.forEach {it ->
     // Enable preview features for the entire project
+    if (it.path.contains(':code:')) {
+        sourceSets.main.java.srcDirs += file('java')
+        sourceSets.main.resources.srcDirs += file('resources')
+        sourceSets.test.java.srcDirs += file('test')
+        sourceSets.test.resources.srcDirs += file('test-resources')
+    }
     it.tasks.withType(JavaCompile).configureEach {
         options.compilerArgs += ['--enable-preview']
     }
@@ -28,32 +39,15 @@
         preserveFileTimestamps = false
         reproducibleFileOrder = true
     }
 }

-allprojects {
-    apply plugin: 'java'
-    apply plugin: 'io.freefair.lombok'
-
-    dependencies {
-        implementation libs.lombok
-        testImplementation libs.lombok
-        annotationProcessor libs.lombok
-        lombok libs.lombok // prevent plugin from downgrading the version to something incompatible with '19
-    }
-
-    test {
-        maxHeapSize = "8G"
-        useJUnitPlatform()
-    }
-
-    tasks.register('fastTests', Test) {
-        maxHeapSize = "8G"
-        useJUnitPlatform {
-            excludeTags "slow"
-        }
-    }
-}
+ext {
+    jvmVersion = 24
+    dockerImageBase='container-registry.oracle.com/graalvm/jdk:24'
+    dockerImageTag='latest'
+    dockerImageRegistry='marginalia'
+    jibVersion = '3.4.4'
+}

 idea {
@@ -74,6 +68,7 @@
 }

 java {
     toolchain {
-        languageVersion.set(JavaLanguageVersion.of(21))
+        languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
     }
 }

@@ -1,31 +0,0 @@
plugins {
id 'java'
id 'jvm-test-suite'
}
java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(21))
}
}
dependencies {
implementation project(':code:common:model')
implementation project(':code:common:config')
implementation project(':code:common:service-discovery')
implementation project(':code:common:service-client')
implementation libs.bundles.slf4j
implementation libs.prometheus
implementation libs.notnull
implementation libs.guice
implementation libs.rxjava
implementation libs.gson
testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation libs.mockito
}

@@ -1,8 +0,0 @@
# Assistant API
Client and models for talking to the [assistant-service](../../services-core/assistant-service),
implemented with the base client from [service-client](../../common/service-client).
## Central Classes
* [AssistantClient](src/main/java/nu/marginalia/assistant/client/AssistantClient.java)

@@ -1,95 +0,0 @@
package nu.marginalia.assistant.client;
import com.google.gson.reflect.TypeToken;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import io.reactivex.rxjava3.core.Observable;
import nu.marginalia.assistant.client.model.DictionaryResponse;
import nu.marginalia.assistant.client.model.DomainInformation;
import nu.marginalia.assistant.client.model.SimilarDomain;
import nu.marginalia.client.AbstractDynamicClient;
import nu.marginalia.client.exception.RouteNotConfiguredException;
import nu.marginalia.model.gson.GsonFactory;
import nu.marginalia.service.descriptor.ServiceDescriptors;
import nu.marginalia.service.id.ServiceId;
import nu.marginalia.client.Context;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
@Singleton
public class AssistantClient extends AbstractDynamicClient {
@Inject
public AssistantClient(ServiceDescriptors descriptors) {
super(descriptors.forId(ServiceId.Assistant), GsonFactory::get);
}
public Observable<DictionaryResponse> dictionaryLookup(Context ctx, String word) {
try {
return super.get(ctx, 0, "/dictionary/" + URLEncoder.encode(word, StandardCharsets.UTF_8), DictionaryResponse.class);
}
catch (RouteNotConfiguredException ex) {
return Observable.empty();
}
}
@SuppressWarnings("unchecked")
public Observable<List<String>> spellCheck(Context ctx, String word) {
try {
return (Observable<List<String>>) (Object) super.get(ctx, 0, "/spell-check/" + URLEncoder.encode(word, StandardCharsets.UTF_8), List.class);
}
catch (RouteNotConfiguredException ex) {
return Observable.empty();
}
}
public Observable<String> unitConversion(Context ctx, String value, String from, String to) {
try {
return super.get(ctx, 0, "/unit-conversion?value=" + value + "&from=" + from + "&to=" + to);
}
catch (RouteNotConfiguredException ex) {
return Observable.empty();
}
}
public Observable<String> evalMath(Context ctx, String expression) {
try {
return super.get(ctx, 0, "/eval-expression?value=" + URLEncoder.encode(expression, StandardCharsets.UTF_8));
}
catch (RouteNotConfiguredException ex) {
return Observable.empty();
}
}
public Observable<ArrayList<SimilarDomain>> similarDomains(Context ctx, int domainId, int count) {
try {
return super.get(ctx, 0, STR."/domain/\{domainId}/similar?count=\{count}", new TypeToken<ArrayList<SimilarDomain>>() {})
.onErrorResumeWith(Observable.just(new ArrayList<>()));
}
catch (RouteNotConfiguredException ex) {
return Observable.empty();
}
}
public Observable<ArrayList<SimilarDomain>> linkedDomains(Context ctx, int domainId, int count) {
try {
return super.get(ctx, 0, STR."/domain/\{domainId}/linking?count=\{count}", new TypeToken<ArrayList<SimilarDomain>>() {})
.onErrorResumeWith(Observable.just(new ArrayList<>()));
}
catch (RouteNotConfiguredException ex) {
return Observable.empty();
}
}
public Observable<DomainInformation> domainInformation(Context ctx, int domainId) {
try {
return super.get(ctx, 0, STR."/domain/\{domainId}/info", DomainInformation.class)
.onErrorResumeWith(Observable.just(new DomainInformation()));
}
catch (RouteNotConfiguredException ex) {
return Observable.empty();
}
}
}

@@ -1,14 +0,0 @@
package nu.marginalia.assistant.client.model;
import lombok.AllArgsConstructor;
import lombok.Getter;
import lombok.ToString;
@AllArgsConstructor
@Getter
@ToString
public class DictionaryEntry {
public final String type;
public final String word;
public final String definition;
}

@@ -1,14 +0,0 @@
package nu.marginalia.assistant.client.model;
import lombok.AllArgsConstructor;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.ToString;
import java.util.List;
@ToString @Getter @AllArgsConstructor @NoArgsConstructor
public class DictionaryResponse {
public String word;
public List<DictionaryEntry> entries;
}

@@ -1,48 +0,0 @@
package nu.marginalia.assistant.client.model;
import lombok.*;
import nu.marginalia.model.EdgeDomain;
@Getter @AllArgsConstructor @NoArgsConstructor @Builder
@ToString
public class DomainInformation {
EdgeDomain domain;
boolean blacklisted;
int pagesKnown;
int pagesFetched;
int pagesIndexed;
int incomingLinks;
int outboundLinks;
int nodeAffinity;
double ranking;
boolean suggestForCrawling;
boolean inCrawlQueue;
boolean unknownDomain;
String ip;
Integer asn;
String asnOrg;
String asnCountry;
String ipCountry;
String state;
public String getIpFlag() {
if (ipCountry == null || ipCountry.codePointCount(0, ipCountry.length()) != 2) {
return "";
}
String country = ipCountry;
if ("UK".equals(country)) {
country = "GB";
}
int offset = 0x1F1E6;
int asciiOffset = 0x41;
int firstChar = Character.codePointAt(country, 0) - asciiOffset + offset;
int secondChar = Character.codePointAt(country, 1) - asciiOffset + offset;
return new String(Character.toChars(firstChar)) + new String(Character.toChars(secondChar));
}
}
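The `getIpFlag` method above maps a two-letter ISO country code onto Unicode regional indicator symbols (U+1F1E6 for 'A' through U+1F1FF for 'Z'), which browsers render as a flag emoji. A self-contained sketch of the same arithmetic (the `FlagDemo` class is illustrative, not part of the codebase):

```java
public class FlagDemo {
    // Same arithmetic as getIpFlag: shift each ASCII letter ('A' = 0x41)
    // into the regional indicator block that starts at U+1F1E6, then
    // concatenate the two resulting code points.
    static String toFlag(String countryCode) {
        int offset = 0x1F1E6 - 0x41;
        int first = Character.codePointAt(countryCode, 0) + offset;
        int second = Character.codePointAt(countryCode, 1) + offset;
        return new String(Character.toChars(first))
             + new String(Character.toChars(second));
    }

    public static void main(String[] args) {
        System.out.println(toFlag("SE")); // renders as the Swedish flag
    }
}
```

Each code point is above U+FFFF, so the result is four `char`s (two surrogate pairs); this is why the production code goes through `Character.toChars` rather than a plain cast.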

@@ -1,301 +0,0 @@
package nu.marginalia.executor.client;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import nu.marginalia.client.AbstractDynamicClient;
import nu.marginalia.client.Context;
import nu.marginalia.client.grpc.GrpcStubPool;
import nu.marginalia.executor.api.*;
import nu.marginalia.executor.api.ExecutorApiGrpc.ExecutorApiBlockingStub;
import nu.marginalia.executor.model.ActorRunState;
import nu.marginalia.executor.model.ActorRunStates;
import nu.marginalia.executor.model.transfer.TransferItem;
import nu.marginalia.executor.model.transfer.TransferSpec;
import nu.marginalia.executor.storage.FileStorageContent;
import nu.marginalia.executor.storage.FileStorageFile;
import nu.marginalia.executor.upload.UploadDirContents;
import nu.marginalia.executor.upload.UploadDirItem;
import nu.marginalia.model.gson.GsonFactory;
import nu.marginalia.nodecfg.NodeConfigurationService;
import nu.marginalia.nodecfg.model.NodeConfiguration;
import nu.marginalia.service.descriptor.ServiceDescriptors;
import nu.marginalia.service.id.ServiceId;
import nu.marginalia.storage.model.FileStorageId;
import io.grpc.ManagedChannel;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.OutputStream;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;
@Singleton
public class ExecutorClient extends AbstractDynamicClient {
private final GrpcStubPool<ExecutorApiBlockingStub> stubPool;
private static final Logger logger = LoggerFactory.getLogger(ExecutorClient.class);
@Inject
public ExecutorClient(ServiceDescriptors descriptors, NodeConfigurationService nodeConfigurationService) {
super(descriptors.forId(ServiceId.Executor), GsonFactory::get);
stubPool = new GrpcStubPool<>(ServiceId.Executor) {
@Override
public ExecutorApiBlockingStub createStub(ManagedChannel channel) {
return ExecutorApiGrpc.newBlockingStub(channel);
}
@Override
public List<Integer> getEligibleNodes() {
return nodeConfigurationService.getAll()
.stream()
.map(NodeConfiguration::node)
.toList();
}
};
}
public void startFsm(int node, String actorName) {
stubPool.apiForNode(node).startFsm(
RpcFsmName.newBuilder()
.setActorName(actorName)
.build()
);
}
public void stopFsm(int node, String actorName) {
stubPool.apiForNode(node).stopFsm(
RpcFsmName.newBuilder()
.setActorName(actorName)
.build()
);
}
public void stopProcess(int node, String id) {
stubPool.apiForNode(node).stopProcess(
RpcProcessId.newBuilder()
.setProcessId(id)
.build()
);
}
public void triggerCrawl(int node, FileStorageId fid) {
stubPool.apiForNode(node).triggerCrawl(
RpcFileStorageId.newBuilder()
.setFileStorageId(fid.id())
.build()
);
}
public void triggerRecrawl(int node, FileStorageId fid) {
stubPool.apiForNode(node).triggerRecrawl(
RpcFileStorageId.newBuilder()
.setFileStorageId(fid.id())
.build()
);
}
public void triggerConvert(int node, FileStorageId fid) {
stubPool.apiForNode(node).triggerConvert(
RpcFileStorageId.newBuilder()
.setFileStorageId(fid.id())
.build()
);
}
public void triggerConvertAndLoad(int node, FileStorageId fid) {
stubPool.apiForNode(node).triggerConvertAndLoad(
RpcFileStorageId.newBuilder()
.setFileStorageId(fid.id())
.build()
);
}
public void loadProcessedData(int node, List<FileStorageId> ids) {
stubPool.apiForNode(node).loadProcessedData(
RpcFileStorageIds.newBuilder()
.addAllFileStorageIds(ids.stream().map(FileStorageId::id).toList())
.build()
);
}
public void calculateAdjacencies(int node) {
stubPool.apiForNode(node).calculateAdjacencies(Empty.getDefaultInstance());
}
public void sideloadEncyclopedia(int node, Path sourcePath, String baseUrl) {
stubPool.apiForNode(node).sideloadEncyclopedia(
RpcSideloadEncyclopedia.newBuilder()
.setBaseUrl(baseUrl)
.setSourcePath(sourcePath.toString())
.build()
);
}
public void sideloadDirtree(int node, Path sourcePath) {
stubPool.apiForNode(node).sideloadDirtree(
RpcSideloadDirtree.newBuilder()
.setSourcePath(sourcePath.toString())
.build()
);
}
public void sideloadWarc(int node, Path sourcePath) {
stubPool.apiForNode(node).sideloadWarc(
RpcSideloadWarc.newBuilder()
.setSourcePath(sourcePath.toString())
.build()
);
}
public void sideloadStackexchange(int node, Path sourcePath) {
stubPool.apiForNode(node).sideloadStackexchange(
RpcSideloadStackexchange.newBuilder()
.setSourcePath(sourcePath.toString())
.build()
);
}
public void createCrawlSpecFromDownload(int node, String description, String url) {
stubPool.apiForNode(node).createCrawlSpecFromDownload(
RpcCrawlSpecFromDownload.newBuilder()
.setDescription(description)
.setUrl(url)
.build()
);
}
public void exportAtags(int node, FileStorageId fid) {
stubPool.apiForNode(node).exportAtags(
RpcFileStorageId.newBuilder()
.setFileStorageId(fid.id())
.build()
);
}
public void exportSampleData(int node, FileStorageId fid, int size, String name) {
stubPool.apiForNode(node).exportSampleData(
RpcExportSampleData.newBuilder()
.setFileStorageId(fid.id())
.setSize(size)
.setName(name)
.build()
);
}
public void exportRssFeeds(int node, FileStorageId fid) {
stubPool.apiForNode(node).exportRssFeeds(
RpcFileStorageId.newBuilder()
.setFileStorageId(fid.id())
.build()
);
}
public void exportTermFrequencies(int node, FileStorageId fid) {
stubPool.apiForNode(node).exportTermFrequencies(
RpcFileStorageId.newBuilder()
.setFileStorageId(fid.id())
.build()
);
}
public void downloadSampleData(int node, String sampleSet) {
stubPool.apiForNode(node).downloadSampleData(
RpcDownloadSampleData.newBuilder()
.setSampleSet(sampleSet)
.build()
);
}
public void exportData(int node) {
stubPool.apiForNode(node).exportData(Empty.getDefaultInstance());
}
public void restoreBackup(int node, FileStorageId fid) {
stubPool.apiForNode(node).restoreBackup(
RpcFileStorageId.newBuilder()
.setFileStorageId(fid.id())
.build()
);
}
public ActorRunStates getActorStates(int node) {
try {
var rs = stubPool.apiForNode(node).getActorStates(Empty.getDefaultInstance());
var states = rs.getActorRunStatesList().stream()
.map(r -> new ActorRunState(
r.getActorName(),
r.getState(),
r.getActorDescription(),
r.getStateDescription(),
r.getTerminal(),
r.getCanStart())
)
.toList();
return new ActorRunStates(node, states);
}
catch (Exception ex) {
logger.warn("Failed to get actor states", ex);
// Return an empty list of states to avoid breaking the UI when a node is down
return new ActorRunStates(node, List.of());
}
}
public UploadDirContents listSideloadDir(int node) {
try {
var rs = stubPool.apiForNode(node).listSideloadDir(Empty.getDefaultInstance());
var items = rs.getEntriesList().stream()
.map(i -> new UploadDirItem(i.getName(), i.getLastModifiedTime(), i.getIsDirectory(), i.getSize()))
.toList();
return new UploadDirContents(rs.getPath(), items);
}
catch (Exception ex) {
logger.warn("Failed to list sideload dir", ex);
// Return an empty list of items to avoid breaking the UI when a node is down
return new UploadDirContents("", List.of());
}
}
public FileStorageContent listFileStorage(int node, FileStorageId fileId) {
try {
var rs = stubPool.apiForNode(node).listFileStorage(
RpcFileStorageId.newBuilder()
.setFileStorageId(fileId.id())
.build()
);
return new FileStorageContent(rs.getEntriesList().stream()
.map(e -> new FileStorageFile(e.getName(), e.getSize(), e.getLastModifiedTime()))
.toList());
}
catch (Exception ex) {
logger.warn("Failed to list file storage", ex);
// Return an empty list of items to avoid breaking the UI when a node is down
return new FileStorageContent(List.of());
}
}
public void transferFile(Context context, int node, FileStorageId fileId, String path, OutputStream destOutputStream) {
String endpoint = "/transfer/file/%d?path=%s".formatted(fileId.id(), URLEncoder.encode(path, StandardCharsets.UTF_8));
get(context, node, endpoint,
destOutputStream)
.blockingSubscribe();
}
public TransferSpec getTransferSpec(Context context, int node, int count) {
return get(context, node, "/transfer/spec?count="+count, TransferSpec.class)
.timeout(30, TimeUnit.MINUTES)
.blockingFirst();
}
public void yieldDomain(Context context, int node, TransferItem item) {
post(context, node, "/transfer/yield", item).blockingSubscribe();
}
}

@@ -1,9 +0,0 @@
package nu.marginalia.executor.model.transfer;
import nu.marginalia.storage.model.FileStorageId;
public record TransferItem(String domainName,
int domainId,
FileStorageId fileStorageId,
String path) {
}

@@ -1,13 +0,0 @@
package nu.marginalia.executor.model.transfer;
import java.util.List;
public record TransferSpec(List<TransferItem> items) {
public TransferSpec() {
this(List.of());
}
public int size() {
return items.size();
}
}

@@ -1,9 +0,0 @@
package nu.marginalia.executor.upload;
public record UploadDirItem (
String name,
String lastModifiedTime,
boolean isDirectory,
long size
) {
}

@@ -1,49 +0,0 @@
plugins {
id 'java'
id "com.google.protobuf" version "0.9.4"
id 'jvm-test-suite'
}
java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(21))
}
}
sourceSets {
main {
proto {
srcDir 'src/main/protobuf'
}
}
}
apply from: "$rootProject.projectDir/protobuf.gradle"
dependencies {
implementation project(':code:common:model')
implementation project(':code:common:config')
implementation project(':code:common:service-discovery')
implementation project(':code:common:service-client')
implementation project(':code:libraries:message-queue')
implementation project(':code:features-index:index-query')
implementation libs.bundles.slf4j
implementation libs.prometheus
implementation libs.notnull
implementation libs.guice
implementation libs.rxjava
implementation libs.protobuf
implementation libs.fastutil
implementation libs.javax.annotation
implementation libs.bundles.gson
implementation libs.bundles.grpc
testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation libs.mockito
}

@@ -1,8 +0,0 @@
# Index API
Client and models for talking to the [index-service](../../services-core/index-service),
implemented with the base client from [service-client](../../common/service-client).
## Central Classes
* [IndexClient](src/main/java/nu/marginalia/index/client/IndexClient.java)
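The `IndexClient` constructor derives its message-queue outbox identity from the `service-name` system property, falling back to a random UUID so two unnamed processes never collide on the same outbox. A minimal JDK-only sketch of that naming step (the `OutboxNameDemo` class is hypothetical):

```java
import java.util.UUID;

public class OutboxNameDemo {
    // Mirrors IndexClient's constructor: a "pp:" prefix plus the configured
    // service name, or a random UUID when the property is absent.
    static String outboxName() {
        return "pp:" + System.getProperty("service-name",
                                          UUID.randomUUID().toString());
    }
}
```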

@@ -1,95 +0,0 @@
package nu.marginalia.index.client;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import com.google.inject.name.Named;
import io.prometheus.client.Summary;
import io.reactivex.rxjava3.core.Observable;
import io.reactivex.rxjava3.schedulers.Schedulers;
import nu.marginalia.client.AbstractDynamicClient;
import nu.marginalia.client.Context;
import nu.marginalia.client.exception.RouteNotConfiguredException;
import nu.marginalia.index.client.model.query.SearchSpecification;
import nu.marginalia.index.client.model.results.SearchResultSet;
import nu.marginalia.model.gson.GsonFactory;
import nu.marginalia.mq.MessageQueueFactory;
import nu.marginalia.mq.outbox.MqOutbox;
import nu.marginalia.service.descriptor.ServiceDescriptors;
import nu.marginalia.service.id.ServiceId;
import java.util.List;
import javax.annotation.CheckReturnValue;
import java.util.UUID;
@Singleton
public class IndexClient extends AbstractDynamicClient {
private static final Summary wmsa_search_index_api_time = Summary.build().name("wmsa_search_index_api_time").help("-").register();
private final MessageQueueFactory messageQueueFactory;
MqOutbox outbox;
@Inject
public IndexClient(ServiceDescriptors descriptors,
MessageQueueFactory messageQueueFactory,
@Named("wmsa-system-node") Integer nodeId)
{
super(descriptors.forId(ServiceId.Index), GsonFactory::get);
this.messageQueueFactory = messageQueueFactory;
String inboxName = ServiceId.Index.serviceName;
String outboxName = "pp:"+System.getProperty("service-name", UUID.randomUUID().toString());
outbox = messageQueueFactory.createOutbox(inboxName, nodeId, outboxName, nodeId, UUID.randomUUID());
setTimeout(30);
}
public MqOutbox outbox() {
return outbox;
}
@CheckReturnValue
public SearchResultSet query(Context ctx, int node, SearchSpecification specs) {
return wmsa_search_index_api_time.time(
() -> this.postGet(ctx, node,"/search/", specs, SearchResultSet.class).blockingFirst()
);
}
@CheckReturnValue
public SearchResultSet query(Context ctx, List<Integer> nodes, SearchSpecification specs) {
return Observable.fromIterable(nodes)
.flatMap(node -> {
try {
return this
.postGet(ctx, node, "/search/", specs, SearchResultSet.class).onErrorReturn(t -> new SearchResultSet())
.observeOn(Schedulers.io());
} catch (RouteNotConfiguredException ex) {
return Observable.empty();
}
})
.reduce(SearchResultSet::combine)
.blockingGet();
}
@CheckReturnValue
public Observable<Boolean> isBlocked(Context ctx, int node) {
return super.get(ctx, node, "/is-blocked", Boolean.class);
}
public long triggerRepartition(int node) throws Exception {
return messageQueueFactory.sendSingleShotRequest(
ServiceId.Index.withNode(node),
IndexMqEndpoints.INDEX_REPARTITION,
null
);
}
public long triggerRerank(int node) throws Exception {
return messageQueueFactory.sendSingleShotRequest(
ServiceId.Index.withNode(node),
IndexMqEndpoints.INDEX_RERANK,
null
);
}
}

@@ -1,117 +0,0 @@
package nu.marginalia.index.client;
import nu.marginalia.index.api.*;
import nu.marginalia.index.client.model.query.SearchSubquery;
import nu.marginalia.index.client.model.results.Bm25Parameters;
import nu.marginalia.index.client.model.results.ResultRankingParameters;
import nu.marginalia.index.query.limit.QueryLimits;
import nu.marginalia.index.query.limit.SpecificationLimit;
import nu.marginalia.index.query.limit.SpecificationLimitType;
import java.util.ArrayList;
import java.util.List;
public class IndexProtobufCodec {
public static SpecificationLimit convertSpecLimit(RpcSpecLimit limit) {
return new SpecificationLimit(
SpecificationLimitType.valueOf(limit.getType().name()),
limit.getValue()
);
}
public static RpcSpecLimit convertSpecLimit(SpecificationLimit limit) {
return RpcSpecLimit.newBuilder()
.setType(RpcSpecLimit.TYPE.valueOf(limit.type().name()))
.setValue(limit.value())
.build();
}
public static QueryLimits convertQueryLimits(RpcQueryLimits queryLimits) {
return new QueryLimits(
queryLimits.getResultsByDomain(),
queryLimits.getResultsTotal(),
queryLimits.getTimeoutMs(),
queryLimits.getFetchSize()
);
}
public static RpcQueryLimits convertQueryLimits(QueryLimits queryLimits) {
return RpcQueryLimits.newBuilder()
.setResultsByDomain(queryLimits.resultsByDomain())
.setResultsTotal(queryLimits.resultsTotal())
.setTimeoutMs(queryLimits.timeoutMs())
.setFetchSize(queryLimits.fetchSize())
.build();
}
public static SearchSubquery convertSearchSubquery(RpcSubquery subquery) {
List<List<String>> coherences = new ArrayList<>();
for (int j = 0; j < subquery.getCoherencesCount(); j++) {
var coh = subquery.getCoherences(j);
coherences.add(new ArrayList<>(coh.getCoherencesList()));
}
return new SearchSubquery(
subquery.getIncludeList(),
subquery.getExcludeList(),
subquery.getAdviceList(),
subquery.getPriorityList(),
coherences
);
}
public static RpcSubquery convertSearchSubquery(SearchSubquery searchSubquery) {
var subqueryBuilder =
RpcSubquery.newBuilder()
.addAllAdvice(searchSubquery.getSearchTermsAdvice())
.addAllExclude(searchSubquery.getSearchTermsExclude())
.addAllInclude(searchSubquery.getSearchTermsInclude())
.addAllPriority(searchSubquery.getSearchTermsPriority());
for (var coherences : searchSubquery.searchTermCoherences) {
subqueryBuilder.addCoherencesBuilder().addAllCoherences(coherences);
}
return subqueryBuilder.build();
}
public static ResultRankingParameters convertRankingParameterss(RpcResultRankingParameters params) {
return new ResultRankingParameters(
new Bm25Parameters(params.getFullK(), params.getFullB()),
new Bm25Parameters(params.getPrioK(), params.getPrioB()),
params.getShortDocumentThreshold(),
params.getShortDocumentPenalty(),
params.getDomainRankBonus(),
params.getQualityPenalty(),
params.getShortSentenceThreshold(),
params.getShortSentencePenalty(),
params.getBm25FullWeight(),
params.getBm25PrioWeight(),
params.getTcfWeight(),
ResultRankingParameters.TemporalBias.valueOf(params.getTemporalBias().name()),
params.getTemporalBiasWeight()
);
};
public static RpcResultRankingParameters convertRankingParameterss(ResultRankingParameters rankingParams) {
return
RpcResultRankingParameters.newBuilder()
.setFullB(rankingParams.fullParams.b())
.setFullK(rankingParams.fullParams.k())
.setPrioB(rankingParams.prioParams.b())
.setPrioK(rankingParams.prioParams.k())
.setShortDocumentThreshold(rankingParams.shortDocumentThreshold)
.setShortDocumentPenalty(rankingParams.shortDocumentPenalty)
.setDomainRankBonus(rankingParams.domainRankBonus)
.setQualityPenalty(rankingParams.qualityPenalty)
.setShortSentenceThreshold(rankingParams.shortSentenceThreshold)
.setShortSentencePenalty(rankingParams.shortSentencePenalty)
.setBm25FullWeight(rankingParams.bm25FullWeight)
.setBm25PrioWeight(rankingParams.bm25PrioWeight)
.setTcfWeight(rankingParams.tcfWeight)
.setTemporalBias(RpcResultRankingParameters.TEMPORAL_BIAS.valueOf(rankingParams.temporalBias.name()))
.setTemporalBiasWeight(rankingParams.temporalBiasWeight)
.build();
}
}

@@ -1,34 +0,0 @@
package nu.marginalia.index.client.model.query;
import lombok.*;
import nu.marginalia.index.client.model.results.ResultRankingParameters;
import nu.marginalia.index.query.limit.QueryLimits;
import nu.marginalia.index.query.limit.QueryStrategy;
import nu.marginalia.index.query.limit.SpecificationLimit;
import java.util.List;
@ToString @Getter @Builder @With @AllArgsConstructor
public class SearchSpecification {
public List<SearchSubquery> subqueries;
/** If present and not empty, limit the search to these domain IDs */
public List<Integer> domains;
public String searchSetIdentifier;
public final String humanQuery;
public final SpecificationLimit quality;
public final SpecificationLimit year;
public final SpecificationLimit size;
public final SpecificationLimit rank;
public final SpecificationLimit domainCount;
public final QueryLimits queryLimits;
public final QueryStrategy queryStrategy;
public final ResultRankingParameters rankingParams;
}

@@ -1,79 +0,0 @@
package nu.marginalia.index.client.model.query;
import lombok.AllArgsConstructor;
import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.With;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
@Getter
@AllArgsConstructor
@With
@EqualsAndHashCode
public class SearchSubquery {
/** These terms must be present in the document and are used in ranking*/
public final List<String> searchTermsInclude;
/** These terms must be absent from the document */
public final List<String> searchTermsExclude;
/** These terms must be present in the document, but are not used in ranking */
public final List<String> searchTermsAdvice;
/** If these optional terms are present in the document, rank it highly */
public final List<String> searchTermsPriority;
/** Terms that we require to be in the same sentence */
public final List<List<String>> searchTermCoherences;
@Deprecated // why does this exist?
private double value = 0;
public SearchSubquery() {
this.searchTermsInclude = new ArrayList<>();
this.searchTermsExclude = new ArrayList<>();
this.searchTermsAdvice = new ArrayList<>();
this.searchTermsPriority = new ArrayList<>();
this.searchTermCoherences = new ArrayList<>();
}
public SearchSubquery(List<String> searchTermsInclude,
List<String> searchTermsExclude,
List<String> searchTermsAdvice,
List<String> searchTermsPriority,
List<List<String>> searchTermCoherences) {
this.searchTermsInclude = searchTermsInclude;
this.searchTermsExclude = searchTermsExclude;
this.searchTermsAdvice = searchTermsAdvice;
this.searchTermsPriority = searchTermsPriority;
this.searchTermCoherences = searchTermCoherences;
}
@Deprecated // why does this exist?
public SearchSubquery setValue(double value) {
if (Double.isInfinite(value) || Double.isNaN(value)) {
this.value = Double.MAX_VALUE;
} else {
this.value = value;
}
return this;
}
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
if (!searchTermsInclude.isEmpty()) sb.append("include=").append(searchTermsInclude.stream().collect(Collectors.joining(",", "[", "] ")));
if (!searchTermsExclude.isEmpty()) sb.append("exclude=").append(searchTermsExclude.stream().collect(Collectors.joining(",", "[", "] ")));
if (!searchTermsAdvice.isEmpty()) sb.append("advice=").append(searchTermsAdvice.stream().collect(Collectors.joining(",", "[", "] ")));
if (!searchTermsPriority.isEmpty()) sb.append("priority=").append(searchTermsPriority.stream().collect(Collectors.joining(",", "[", "] ")));
if (!searchTermCoherences.isEmpty()) sb.append("coherences=").append(searchTermCoherences.stream().map(coh->coh.stream().collect(Collectors.joining(",", "[", "] "))).collect(Collectors.joining(", ")));
return sb.toString();
}
}

@@ -1,82 +0,0 @@
package nu.marginalia.index.client.model.results;
import lombok.Getter;
import lombok.ToString;
import nu.marginalia.model.EdgeUrl;
import org.jetbrains.annotations.NotNull;
import javax.annotation.Nullable;
import java.util.List;
@Getter
@ToString
public class DecoratedSearchResultItem {
public final SearchResultItem rawIndexResult;
@NotNull
public final EdgeUrl url;
@NotNull
public final String title;
@NotNull
public final String description;
public final double urlQuality;
@NotNull
public final String format;
/** Document features bitmask, see HtmlFeature */
public final int features;
@Nullable
public final Integer pubYear;
public final long dataHash;
public final int wordsTotal;
public final double rankingScore;
public long documentId() {
return rawIndexResult.getDocumentId();
}
public int domainId() {
return rawIndexResult.getDomainId();
}
public int resultsFromDomain() {
return rawIndexResult.getResultsFromDomain();
}
public List<SearchResultKeywordScore> keywordScores() {
return rawIndexResult.getKeywordScores();
}
public long rankingId() {
return rawIndexResult.getRanking();
}
public DecoratedSearchResultItem(SearchResultItem rawIndexResult,
@NotNull
EdgeUrl url,
@NotNull
String title,
@NotNull
String description,
double urlQuality,
@NotNull
String format,
int features,
@Nullable
Integer pubYear,
long dataHash,
int wordsTotal,
double rankingScore)
{
this.rawIndexResult = rawIndexResult;
this.url = url;
this.title = title;
this.description = description;
this.urlQuality = urlQuality;
this.format = format;
this.features = features;
this.pubYear = pubYear;
this.dataHash = dataHash;
this.wordsTotal = wordsTotal;
this.rankingScore = rankingScore;
}
}

@@ -1,38 +0,0 @@
package nu.marginalia.index.client.model.results;
import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
import lombok.ToString;
import java.util.Map;
@ToString
public class ResultRankingContext {
private final int docCount;
public final ResultRankingParameters params;
private final Object2IntOpenHashMap<String> fullCounts = new Object2IntOpenHashMap<>(10, 0.5f);
private final Object2IntOpenHashMap<String> priorityCounts = new Object2IntOpenHashMap<>(10, 0.5f);
public ResultRankingContext(int docCount,
ResultRankingParameters params,
Map<String, Integer> fullCounts,
Map<String, Integer> prioCounts
) {
this.docCount = docCount;
this.params = params;
this.fullCounts.putAll(fullCounts);
this.priorityCounts.putAll(prioCounts);
}
public int termFreqDocCount() {
return docCount;
}
public int frequency(String keyword) {
return fullCounts.getOrDefault(keyword, 1);
}
public int priorityFrequency(String keyword) {
return priorityCounts.getOrDefault(keyword, 1);
}
}

@@ -1,62 +0,0 @@
package nu.marginalia.index.client.model.results;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.EqualsAndHashCode;
import lombok.ToString;
@Builder @AllArgsConstructor @ToString @EqualsAndHashCode
public class ResultRankingParameters {
/** Tuning for BM25 when applied to full document matches */
public final Bm25Parameters fullParams;
/** Tuning for BM25 when applied to priority matches, terms with relevance signal indicators */
public final Bm25Parameters prioParams;
/** Documents below this length are penalized */
public int shortDocumentThreshold;
public double shortDocumentPenalty;
/** Scaling factor associated with domain rank (unscaled rank value is 0-255; high is good) */
public double domainRankBonus;
/** Scaling factor associated with document quality (unscaled rank value is 0-15; high is bad) */
public double qualityPenalty;
/** Average sentence length values below this threshold are penalized, range [0-4), 2 or 3 is probably what you want */
public int shortSentenceThreshold;
/** Magnitude of penalty for documents with low average sentence length */
public double shortSentencePenalty;
public double bm25FullWeight;
public double bm25PrioWeight;
public double tcfWeight;
public TemporalBias temporalBias;
public double temporalBiasWeight;
public static ResultRankingParameters sensibleDefaults() {
return builder()
.fullParams(new Bm25Parameters(1.2, 0.5))
.prioParams(new Bm25Parameters(1.5, 0))
.shortDocumentThreshold(2000)
.shortDocumentPenalty(2.)
.domainRankBonus(1/25.)
.qualityPenalty(1/15.)
.shortSentenceThreshold(2)
.shortSentencePenalty(5)
.bm25FullWeight(1.)
.bm25PrioWeight(1.)
.tcfWeight(2.)
.temporalBias(TemporalBias.NONE)
.temporalBiasWeight(1. / (10.))
.build();
}
public enum TemporalBias {
RECENT, OLD, NONE
};
}

@@ -1,79 +0,0 @@
package nu.marginalia.index.client.model.results;
import lombok.AllArgsConstructor;
import lombok.Getter;
import nu.marginalia.model.id.UrlIdCodec;
import org.jetbrains.annotations.NotNull;
import java.util.ArrayList;
import java.util.List;
/** Represents a document matching a search query */
@AllArgsConstructor @Getter
public class SearchResultItem implements Comparable<SearchResultItem> {
/** Encoded ID that contains both the URL id and its ranking. This is
* probably not what you want, use getDocumentId() instead */
public final long combinedId;
/** How did the subqueries match against the document ? */
public final List<SearchResultKeywordScore> keywordScores;
/** How many other potential results existed in the same domain */
public int resultsFromDomain;
public SearchResultItem(long combinedId, int scoresCount) {
this.combinedId = combinedId;
this.keywordScores = new ArrayList<>(scoresCount);
}
public long getDocumentId() {
return UrlIdCodec.removeRank(combinedId);
}
public int getRanking() {
return UrlIdCodec.getRank(combinedId);
}
/* Used for evaluation */
private transient SearchResultPreliminaryScore scoreValue = null;
public void setScore(SearchResultPreliminaryScore score) {
scoreValue = score;
}
public SearchResultPreliminaryScore getScore() {
return scoreValue;
}
public int getDomainId() {
return UrlIdCodec.getDomainId(this.combinedId);
}
public int hashCode() {
return Long.hashCode(combinedId);
}
public String toString() {
return getClass().getSimpleName() + "[ url= " + getDocumentId() + ", rank=" + getRanking() + "]";
}
public boolean equals(Object other) {
if (other == null)
return false;
if (other == this)
return true;
if (other instanceof SearchResultItem o) {
return o.getDocumentId() == getDocumentId();
}
return false;
}
@Override
public int compareTo(@NotNull SearchResultItem o) {
// this looks like a bug, but we actually want this in a reversed order
int diff = o.getScore().compareTo(getScore());
if (diff != 0)
return diff;
return Long.compare(this.combinedId, o.combinedId);
}
}

@@ -1,99 +0,0 @@
package nu.marginalia.index.client.model.results;
import nu.marginalia.model.idx.WordFlags;
import nu.marginalia.model.idx.WordMetadata;
import nu.marginalia.model.idx.DocumentMetadata;
import java.util.Objects;
public final class SearchResultKeywordScore {
public final int subquery;
public final String keyword;
private final long encodedWordMetadata;
private final long encodedDocMetadata;
private final boolean hasPriorityTerms;
private final int htmlFeatures;
public SearchResultKeywordScore(int subquery,
String keyword,
long encodedWordMetadata,
long encodedDocMetadata,
int htmlFeatures,
boolean hasPriorityTerms) {
this.subquery = subquery;
this.keyword = keyword;
this.encodedWordMetadata = encodedWordMetadata;
this.encodedDocMetadata = encodedDocMetadata;
this.htmlFeatures = htmlFeatures;
this.hasPriorityTerms = hasPriorityTerms;
}
public boolean hasTermFlag(WordFlags flag) {
return WordMetadata.hasFlags(encodedWordMetadata, flag.asBit());
}
public int positionCount() {
return Long.bitCount(positions());
}
public int subquery() {
return subquery;
}
public long positions() {
return WordMetadata.decodePositions(encodedWordMetadata);
}
public boolean isKeywordSpecial() {
return keyword.contains(":") || hasTermFlag(WordFlags.Synthetic);
}
public boolean isKeywordRegular() {
return !keyword.contains(":")
&& !hasTermFlag(WordFlags.Synthetic);
}
public long encodedWordMetadata() {
return encodedWordMetadata;
}
public long encodedDocMetadata() {
return encodedDocMetadata;
}
public int htmlFeatures() {
return htmlFeatures;
}
public boolean hasPriorityTerms() {
return hasPriorityTerms;
}
@Override
public boolean equals(Object obj) {
if (obj == this) return true;
if (obj == null || obj.getClass() != this.getClass()) return false;
var that = (SearchResultKeywordScore) obj;
return this.subquery == that.subquery &&
Objects.equals(this.keyword, that.keyword) &&
this.encodedWordMetadata == that.encodedWordMetadata &&
this.encodedDocMetadata == that.encodedDocMetadata &&
this.hasPriorityTerms == that.hasPriorityTerms;
}
@Override
public int hashCode() {
return Objects.hash(subquery, keyword, encodedWordMetadata, encodedDocMetadata, hasPriorityTerms);
}
@Override
public String toString() {
return "SearchResultKeywordScore[" +
"set=" + subquery + ", " +
"keyword=" + keyword + ", " +
"encodedWordMetadata=" + new WordMetadata(encodedWordMetadata) + ", " +
"encodedDocMetadata=" + new DocumentMetadata(encodedDocMetadata) + ", " +
"hasPriorityTerms=" + hasPriorityTerms + ']';
}
}
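`positionCount()` works because `positions()` decodes a bitmask in which each set bit marks a (coarse) position bucket where the keyword occurred, so `Long.bitCount` counts distinct buckets rather than raw occurrences. A self-contained sketch of that encoding (the folding scheme here is illustrative, not `WordMetadata`'s exact layout):

```java
// Sketch of the position-mask idea: each set bit in a long marks that a
// keyword occurred somewhere in the corresponding position bucket.
public class PositionMaskDemo {
    static long encodePositions(int... positions) {
        long mask = 0;
        for (int p : positions) {
            mask |= 1L << (p % 64); // fold document positions into 64 buckets
        }
        return mask;
    }

    static int positionCount(long mask) {
        return Long.bitCount(mask);
    }

    public static void main(String[] args) {
        long mask = encodePositions(0, 3, 3, 17);
        System.out.println(positionCount(mask)); // → 3: duplicate buckets count once
    }
}
```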


@@ -1,32 +0,0 @@
package nu.marginalia.index.client.model.results;
import lombok.AllArgsConstructor;
import lombok.Getter;
import lombok.ToString;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
@AllArgsConstructor @Getter @ToString
public class SearchResultSet {
public SearchResultSet() {
results = new ArrayList<>();
}
public List<DecoratedSearchResultItem> results;
public int size() {
return results.size();
}
public static SearchResultSet combine(SearchResultSet l, SearchResultSet r) {
List<DecoratedSearchResultItem> combinedItems = new ArrayList<>(l.size() + r.size());
combinedItems.addAll(l.results);
combinedItems.addAll(r.results);
// TODO: Do we combine these correctly?
combinedItems.sort(Comparator.comparing(item -> item.rankingScore));
return new SearchResultSet(combinedItems);
}
}
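`combine` concatenates both result lists and re-sorts on `rankingScore`, which is correct even for pre-sorted inputs, if not optimal, as the TODO hints. The same concatenate-and-sort step as a self-contained sketch (a hypothetical `Result` record stands in for `DecoratedSearchResultItem`):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CombineDemo {
    record Result(String url, double rankingScore) {}

    // Concatenate two result lists and restore ascending rankingScore order.
    static List<Result> combine(List<Result> l, List<Result> r) {
        List<Result> combined = new ArrayList<>(l.size() + r.size());
        combined.addAll(l);
        combined.addAll(r);
        combined.sort(Comparator.comparing(Result::rankingScore));
        return combined;
    }

    public static void main(String[] args) {
        var a = List.of(new Result("a", 0.1), new Result("b", 0.7));
        var c = List.of(new Result("c", 0.4));
        System.out.println(combine(a, c).stream().map(Result::url).toList()); // → [a, c, b]
    }
}
```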


@@ -1,21 +0,0 @@
package nu.marginalia.mqapi.crawling;
import lombok.AllArgsConstructor;
import nu.marginalia.storage.model.FileStorageId;
import java.util.List;
/** A request to start a crawl */
@AllArgsConstructor
public class CrawlRequest {
/** (optional) Crawl spec(s) for sourcing domains to crawl. If not set,
* the EC_DOMAIN table will be consulted and domains with the corresponding
* node affinity will be used.
*/
public List<FileStorageId> specStorage;
/** File storage where the crawl data will be written. If it contains existing crawl data,
* this crawl data will be referenced for e-tags and last-modified checks.
*/
public FileStorageId crawlStorage;
}
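The class comment mentions that existing crawl data is referenced for ETag and Last-Modified revalidation. As background, a conditional HTTP fetch of that kind looks like this with the JDK's `java.net.http` client (a generic sketch, not the crawler's actual code; all names are hypothetical):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ConditionalFetchDemo {
    // Sketch: a recrawl can send the previously recorded ETag and Last-Modified
    // values so the server may answer 304 Not Modified instead of a full body.
    static HttpRequest conditionalGet(String url, String etag, String lastModified) {
        var builder = HttpRequest.newBuilder(URI.create(url)).GET();
        if (etag != null)
            builder.header("If-None-Match", etag);
        if (lastModified != null)
            builder.header("If-Modified-Since", lastModified);
        return builder.build();
    }

    public static void main(String[] args) {
        var req = conditionalGet("https://example.com/", "\"abc123\"", "Wed, 01 Jan 2025 00:00:00 GMT");
        System.out.println(req.headers().map());
    }
}
```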


@@ -1,37 +0,0 @@
plugins {
id 'java'
id 'jvm-test-suite'
}
java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(21))
}
}
dependencies {
implementation project(':code:common:model')
implementation project(':code:api:index-api')
implementation project(':code:common:config')
implementation project(':code:libraries:message-queue')
implementation project(':code:features-index:index-query')
implementation project(':code:common:service-discovery')
implementation project(':code:common:service-client')
implementation libs.bundles.slf4j
implementation libs.roaringbitmap
implementation libs.prometheus
implementation libs.notnull
implementation libs.trove
implementation libs.guice
implementation libs.rxjava
implementation libs.gson
implementation libs.bundles.grpc
implementation libs.protobuf
testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation libs.mockito
}


@@ -1,169 +0,0 @@
package nu.marginalia.query;
import lombok.SneakyThrows;
import nu.marginalia.index.api.*;
import nu.marginalia.index.client.IndexProtobufCodec;
import nu.marginalia.index.client.model.query.SearchSetIdentifier;
import nu.marginalia.index.client.model.query.SearchSpecification;
import nu.marginalia.index.client.model.query.SearchSubquery;
import nu.marginalia.index.client.model.results.DecoratedSearchResultItem;
import nu.marginalia.index.client.model.results.SearchResultItem;
import nu.marginalia.index.client.model.results.SearchResultKeywordScore;
import nu.marginalia.index.query.limit.QueryStrategy;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.query.model.ProcessedQuery;
import nu.marginalia.query.model.QueryParams;
import nu.marginalia.query.model.QueryResponse;
import java.util.ArrayList;
import java.util.List;
import static nu.marginalia.index.client.IndexProtobufCodec.*;
public class QueryProtobufCodec {
public static RpcIndexQuery convertQuery(RpcQsQuery request, ProcessedQuery query) {
var builder = RpcIndexQuery.newBuilder();
builder.addAllDomains(request.getDomainIdsList());
for (var subquery : query.specs.subqueries) {
builder.addSubqueries(IndexProtobufCodec.convertSearchSubquery(subquery));
}
builder.setSearchSetIdentifier(query.specs.searchSetIdentifier);
builder.setHumanQuery(request.getHumanQuery());
builder.setQuality(convertSpecLimit(query.specs.quality));
builder.setYear(convertSpecLimit(query.specs.year));
builder.setSize(convertSpecLimit(query.specs.size));
builder.setRank(convertSpecLimit(query.specs.rank));
builder.setDomainCount(convertSpecLimit(query.specs.domainCount));
builder.setQueryLimits(IndexProtobufCodec.convertQueryLimits(query.specs.queryLimits));
builder.setQueryStrategy(query.specs.queryStrategy.name());
builder.setParameters(IndexProtobufCodec.convertRankingParameterss(query.specs.rankingParams));
return builder.build();
}
public static QueryParams convertRequest(RpcQsQuery request) {
return new QueryParams(
request.getHumanQuery(),
request.getNearDomain(),
request.getTacitIncludesList(),
request.getTacitExcludesList(),
request.getTacitPriorityList(),
request.getTacitAdviceList(),
convertSpecLimit(request.getQuality()),
convertSpecLimit(request.getYear()),
convertSpecLimit(request.getSize()),
convertSpecLimit(request.getRank()),
convertSpecLimit(request.getDomainCount()),
request.getDomainIdsList(),
IndexProtobufCodec.convertQueryLimits(request.getQueryLimits()),
request.getSearchSetIdentifier());
}
public static QueryResponse convertQueryResponse(RpcQsResponse query) {
var results = new ArrayList<DecoratedSearchResultItem>(query.getResultsCount());
for (int i = 0; i < query.getResultsCount(); i++)
results.add(convertDecoratedResult(query.getResults(i)));
return new QueryResponse(
convertSearchSpecification(query.getSpecs()),
results,
query.getSearchTermsHumanList(),
query.getProblemsList(),
query.getDomain()
);
}
@SneakyThrows
private static DecoratedSearchResultItem convertDecoratedResult(RpcDecoratedResultItem results) {
return new DecoratedSearchResultItem(
convertRawResult(results.getRawItem()),
new EdgeUrl(results.getUrl()),
results.getTitle(),
results.getDescription(),
results.getUrlQuality(),
results.getFormat(),
results.getFeatures(),
results.getPubYear(), // ??,
results.getDataHash(),
results.getWordsTotal(),
results.getRankingScore()
);
}
private static SearchResultItem convertRawResult(RpcRawResultItem rawItem) {
var keywordScores = new ArrayList<SearchResultKeywordScore>(rawItem.getKeywordScoresCount());
for (int i = 0; i < rawItem.getKeywordScoresCount(); i++)
keywordScores.add(convertKeywordScore(rawItem.getKeywordScores(i)));
return new SearchResultItem(
rawItem.getCombinedId(),
keywordScores,
rawItem.getResultsFromDomain(),
null
);
}
private static SearchResultKeywordScore convertKeywordScore(RpcResultKeywordScore keywordScores) {
return new SearchResultKeywordScore(
keywordScores.getSubquery(),
keywordScores.getKeyword(),
keywordScores.getEncodedWordMetadata(),
keywordScores.getEncodedDocMetadata(),
keywordScores.getHtmlFeatures(),
keywordScores.getHasPriorityTerms()
);
}
private static SearchSpecification convertSearchSpecification(RpcIndexQuery specs) {
List<SearchSubquery> subqueries = new ArrayList<>(specs.getSubqueriesCount());
for (int i = 0; i < specs.getSubqueriesCount(); i++) {
subqueries.add(convertSearchSubquery(specs.getSubqueries(i)));
}
return new SearchSpecification(
subqueries,
specs.getDomainsList(),
specs.getSearchSetIdentifier(),
specs.getHumanQuery(),
IndexProtobufCodec.convertSpecLimit(specs.getQuality()),
IndexProtobufCodec.convertSpecLimit(specs.getYear()),
IndexProtobufCodec.convertSpecLimit(specs.getSize()),
IndexProtobufCodec.convertSpecLimit(specs.getRank()),
IndexProtobufCodec.convertSpecLimit(specs.getDomainCount()),
IndexProtobufCodec.convertQueryLimits(specs.getQueryLimits()),
QueryStrategy.valueOf(specs.getQueryStrategy()),
convertRankingParameterss(specs.getParameters())
);
}
public static RpcQsQuery convertQueryParams(QueryParams params) {
var builder = RpcQsQuery.newBuilder()
.addAllDomainIds(params.domainIds())
.addAllTacitAdvice(params.tacitAdvice())
.addAllTacitExcludes(params.tacitExcludes())
.addAllTacitIncludes(params.tacitIncludes())
.addAllTacitPriority(params.tacitPriority())
.setHumanQuery(params.humanQuery())
.setQueryLimits(convertQueryLimits(params.limits()))
.setQuality(convertSpecLimit(params.quality()))
.setYear(convertSpecLimit(params.year()))
.setSize(convertSpecLimit(params.size()))
.setRank(convertSpecLimit(params.rank()))
.setSearchSetIdentifier(params.identifier());
if (params.nearDomain() != null)
builder.setNearDomain(params.nearDomain());
return builder.build();
}
}


@@ -1,204 +0,0 @@
package nu.marginalia.query.client;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import gnu.trove.list.array.TIntArrayList;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.prometheus.client.Summary;
import nu.marginalia.client.AbstractDynamicClient;
import nu.marginalia.client.Context;
import nu.marginalia.index.api.Empty;
import nu.marginalia.index.api.IndexDomainLinksApiGrpc;
import nu.marginalia.index.api.QueryApiGrpc;
import nu.marginalia.index.api.RpcDomainId;
import nu.marginalia.index.client.model.query.SearchSpecification;
import nu.marginalia.index.client.model.results.SearchResultSet;
import nu.marginalia.model.gson.GsonFactory;
import nu.marginalia.query.QueryProtobufCodec;
import nu.marginalia.query.model.QueryParams;
import nu.marginalia.query.model.QueryResponse;
import nu.marginalia.service.descriptor.ServiceDescriptor;
import nu.marginalia.service.descriptor.ServiceDescriptors;
import nu.marginalia.service.id.ServiceId;
import org.roaringbitmap.PeekableCharIterator;
import org.roaringbitmap.longlong.PeekableLongIterator;
import org.roaringbitmap.longlong.Roaring64Bitmap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.annotation.CheckReturnValue;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
@Singleton
public class QueryClient extends AbstractDynamicClient {
private static final Summary wmsa_qs_api_delegate_time = Summary.build()
.name("wmsa_qs_api_delegate_time")
.help("query service delegate time")
.register();
private static final Summary wmsa_qs_api_search_time = Summary.build()
.name("wmsa_qs_api_search_time")
.help("query service search time")
.register();
private final Map<ServiceAndNode, ManagedChannel> channels = new ConcurrentHashMap<>();
private final Map<ServiceAndNode, QueryApiGrpc.QueryApiBlockingStub> queryIndexApis = new ConcurrentHashMap<>();
private final Map<ServiceAndNode, IndexDomainLinksApiGrpc.IndexDomainLinksApiBlockingStub> domainLinkApis = new ConcurrentHashMap<>();
record ServiceAndNode(String service, int node) {
public String getHostName() {
return service;
}
}
private ManagedChannel getChannel(ServiceAndNode serviceAndNode) {
return channels.computeIfAbsent(serviceAndNode,
san -> ManagedChannelBuilder
.forAddress(serviceAndNode.getHostName(), 81)
.usePlaintext()
.build());
}
public QueryApiGrpc.QueryApiBlockingStub queryApi(int node) {
return queryIndexApis.computeIfAbsent(new ServiceAndNode("query-service", node), n ->
QueryApiGrpc.newBlockingStub(
getChannel(n)
)
);
}
public IndexDomainLinksApiGrpc.IndexDomainLinksApiBlockingStub domainApi(int node) {
return domainLinkApis.computeIfAbsent(new ServiceAndNode("query-service", node), n ->
IndexDomainLinksApiGrpc.newBlockingStub(
getChannel(n)
)
);
}
private final Logger logger = LoggerFactory.getLogger(getClass());
@Inject
public QueryClient(ServiceDescriptors descriptors) {
super(descriptors.forId(ServiceId.Query), GsonFactory::get);
}
public QueryClient() {
super(new ServiceDescriptor(ServiceId.Query, "query-service"), GsonFactory::get);
}
/** Delegate an Index API style query directly to the index service */
@CheckReturnValue
public SearchResultSet delegate(Context ctx, SearchSpecification specs) {
return wmsa_qs_api_delegate_time.time(
() -> this.postGet(ctx, 0, "/delegate/", specs, SearchResultSet.class).blockingFirst()
);
}
@CheckReturnValue
public QueryResponse search(Context ctx, QueryParams params) {
return wmsa_qs_api_search_time.time(
() -> QueryProtobufCodec.convertQueryResponse(queryApi(0).query(QueryProtobufCodec.convertQueryParams(params)))
);
}
public AllLinks getAllDomainLinks() {
AllLinks links = new AllLinks();
domainApi(0).getAllLinks(Empty.newBuilder().build()).forEachRemaining(pairs -> {
for (int i = 0; i < pairs.getDestIdsCount(); i++) {
links.add(pairs.getSourceIds(i), pairs.getDestIds(i));
}
});
return links;
}
public List<Integer> getLinksToDomain(int domainId) {
try {
return domainApi(0).getLinksToDomain(RpcDomainId
.newBuilder()
.setDomainId(domainId)
.build())
.getDomainIdList();
}
catch (Exception e) {
logger.error("API Exception", e);
return List.of();
}
}
public List<Integer> getLinksFromDomain(int domainId) {
try {
return domainApi(0).getLinksFromDomain(RpcDomainId
.newBuilder()
.setDomainId(domainId)
.build())
.getDomainIdList();
}
catch (Exception e) {
logger.error("API Exception", e);
return List.of();
}
}
public int countLinksToDomain(int domainId) {
try {
return domainApi(0).countLinksToDomain(RpcDomainId
.newBuilder()
.setDomainId(domainId)
.build())
.getIdCount();
}
catch (Exception e) {
logger.error("API Exception", e);
return 0;
}
}
public int countLinksFromDomain(int domainId) {
try {
return domainApi(0).countLinksFromDomain(RpcDomainId
.newBuilder()
.setDomainId(domainId)
.build())
.getIdCount();
}
catch (Exception e) {
logger.error("API Exception", e);
return 0;
}
}
public static class AllLinks {
private final Roaring64Bitmap sourceToDest = new Roaring64Bitmap();
public void add(int source, int dest) {
sourceToDest.add(Integer.toUnsignedLong(source) << 32 | Integer.toUnsignedLong(dest));
}
public Iterator iterator() {
return new Iterator();
}
public class Iterator {
private final PeekableLongIterator base = sourceToDest.getLongIterator();
long val = Long.MIN_VALUE;
public boolean advance() {
if (base.hasNext()) {
val = base.next();
return true;
}
return false;
}
public int source() {
return (int) (val >>> 32);
}
public int dest() {
return (int) (val & 0xFFFF_FFFFL);
}
}
}
}
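The `AllLinks` structure above packs a (source, dest) pair of 32-bit domain ids into a single `long` so the whole link graph fits in one `Roaring64Bitmap`, and its iterator unpacks them with a shift and a mask. The packing round-trip can be sketched without the bitmap dependency (plain `long` arithmetic, hypothetical demo class):

```java
public class LinkPackingDemo {
    // Pack two 32-bit ids into one long: source in the high word, dest in the low.
    static long pack(int source, int dest) {
        return Integer.toUnsignedLong(source) << 32 | Integer.toUnsignedLong(dest);
    }

    static int source(long packed) {
        return (int) (packed >>> 32);
    }

    static int dest(long packed) {
        return (int) (packed & 0xFFFF_FFFFL);
    }

    public static void main(String[] args) {
        long packed = pack(123, -456); // ids are treated as unsigned 32-bit values
        System.out.println(source(packed)); // → 123
        System.out.println(dest(packed));   // → -456: the bit pattern survives the round-trip
    }
}
```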


@@ -1,23 +0,0 @@
package nu.marginalia.query.model;
import nu.marginalia.index.client.model.query.SearchSpecification;
import nu.marginalia.index.client.model.results.DecoratedSearchResultItem;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
public record QueryResponse(SearchSpecification specs,
List<DecoratedSearchResultItem> results,
List<String> searchTermsHuman,
List<String> problems,
String domain)
{
public Set<String> getAllKeywords() {
Set<String> keywords = new HashSet<>(100);
for (var sq : specs.subqueries) {
keywords.addAll(sq.searchTermsInclude);
}
return keywords;
}
}


@@ -1,23 +0,0 @@
# Clients
## Core Services
* [assistant-api](assistant-api/)
* [query-api](query-api/)
* [index-api](index-api/)
These are clients for the [core services](../services-core/), along with the models necessary for speaking to them. They each implement the abstract client classes from
[service-client](../common/service-client).
To use one, simply `@Inject` it into the constructor; requests can then be sent through it.
**Note:** If you are looking for the public API, it's handled by the api service in [services-application/api-service](../services-application/api-service).
## MQ-API Process API
[process-mqapi](process-mqapi/) defines requests and inboxes for the message queue based API used
for interacting with processes.
See [libraries/message-queue](../libraries/message-queue) and [services-application/control-service](../services-core/control-service).


@@ -7,20 +7,23 @@ plugins {
 java {
     toolchain {
-        languageVersion.set(JavaLanguageVersion.of(21))
+        languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
     }
 }
+apply from: "$rootProject.projectDir/srcsets.gradle"
+
 dependencies {
-    implementation project(':code:common:service-discovery')
-    implementation project(':code:common:service-client')
     implementation project(':code:common:db')
     implementation project(':code:common:model')
     implementation libs.bundles.slf4j
     implementation libs.bundles.mariadb
     implementation libs.mockito
-    implementation libs.guice
+    implementation libs.guava
+    implementation dependencies.create(libs.guice.get()) {
+        exclude group: 'com.google.guava'
+    }
     implementation libs.gson
     testImplementation libs.bundles.slf4j.test
@@ -30,6 +33,7 @@ dependencies {
     testImplementation project(':code:libraries:test-helpers')
     testImplementation platform('org.testcontainers:testcontainers-bom:1.17.4')
+    testImplementation libs.commons.codec
     testImplementation 'org.testcontainers:mariadb:1.17.4'
     testImplementation 'org.testcontainers:junit-jupiter:1.17.4'
     testImplementation project(':code:libraries:test-helpers')


@@ -3,28 +3,25 @@ package nu.marginalia;
 import java.nio.file.Path;
 public class LanguageModels {
-    public final Path ngramBloomFilter;
     public final Path termFrequencies;
     public final Path openNLPSentenceDetectionData;
     public final Path posRules;
     public final Path posDict;
-    public final Path openNLPTokenData;
     public final Path fasttextLanguageModel;
+    public final Path segments;
-    public LanguageModels(Path ngramBloomFilter,
-                          Path termFrequencies,
+    public LanguageModels(Path termFrequencies,
                           Path openNLPSentenceDetectionData,
                           Path posRules,
                           Path posDict,
-                          Path openNLPTokenData,
-                          Path fasttextLanguageModel) {
-        this.ngramBloomFilter = ngramBloomFilter;
+                          Path fasttextLanguageModel,
+                          Path segments) {
         this.termFrequencies = termFrequencies;
         this.openNLPSentenceDetectionData = openNLPSentenceDetectionData;
         this.posRules = posRules;
         this.posDict = posDict;
-        this.openNLPTokenData = openNLPTokenData;
         this.fasttextLanguageModel = fasttextLanguageModel;
+        this.segments = segments;
     }
 }


@@ -0,0 +1,117 @@
package nu.marginalia;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;
import java.util.Optional;
import java.util.stream.Stream;
public class WmsaHome {
public static UserAgent getUserAgent() {
return new UserAgent(
System.getProperty("crawler.userAgentString", "Mozilla/5.0 (compatible; Marginalia-like bot; +https://git.marginalia.nu/))"),
System.getProperty("crawler.userAgentIdentifier", "search.marginalia.nu")
);
}
public static Path getUploadDir() {
return Path.of(
System.getProperty("executor.uploadDir", "/uploads")
);
}
public static Path getHomePath() {
String[] possibleLocations = new String[] {
System.getenv("WMSA_HOME"),
System.getProperty("system.homePath"),
"/var/lib/wmsa",
"/wmsa"
};
Optional<String> retStr = Stream.of(possibleLocations)
.filter(Objects::nonNull)
.map(Path::of)
.filter(Files::isDirectory)
.map(Path::toString)
.findFirst();
if (retStr.isEmpty()) {
// Check parent directories for a fingerprint of the project's installation boilerplate
var prodRoot = Stream.iterate(Paths.get("").toAbsolutePath(), f -> f != null && Files.exists(f), Path::getParent)
.filter(p -> Files.exists(p.resolve("conf/properties/system.properties")))
.filter(p -> Files.exists(p.resolve("model/tfreq-new-algo3.bin")))
.findAny();
if (prodRoot.isPresent()) {
return prodRoot.get();
}
// Check if we are running in a test environment by looking for fingerprints
// matching the base of the source tree for the project, then looking up the
// run directory which contains a template for the installation we can use as
// though it's the project root for testing purposes
var testRoot = Stream.iterate(Paths.get("").toAbsolutePath(), f -> f != null && Files.exists(f), Path::getParent)
.filter(p -> Files.exists(p.resolve("run/env")))
.filter(p -> Files.exists(p.resolve("run/setup.sh")))
.map(p -> p.resolve("run"))
.findAny();
return testRoot.orElseThrow(() -> new IllegalStateException("""
Could not find $WMSA_HOME, either set environment
variable, the 'system.homePath' java property,
or ensure either /wmsa or /var/lib/wmsa exists
"""));
}
var ret = Path.of(retStr.get());
if (!Files.isDirectory(ret.resolve("model"))) {
throw new IllegalStateException("You need to run 'run/setup.sh' to download models to run/ before this will work!");
}
return ret;
}
public static Path getDataPath() {
return getHomePath().resolve("data");
}
public static Path getAdsDefinition() {
return getHomePath().resolve("data").resolve("adblock.txt");
}
public static Path getIPLocationDatabse() {
return getHomePath().resolve("data").resolve("IP2LOCATION-LITE-DB1.CSV");
}
public static Path getAsnMappingDatabase() {
return getHomePath().resolve("data").resolve("asn-data-raw-table");
}
public static Path getAsnInfoDatabase() {
return getHomePath().resolve("data").resolve("asn-used-autnums");
}
public static LanguageModels getLanguageModels() {
final Path home = getHomePath();
return new LanguageModels(
home.resolve("model/tfreq-new-algo3.bin"),
home.resolve("model/opennlp-sentence.bin"),
home.resolve("model/English.RDR"),
home.resolve("model/English.DICT"),
home.resolve("model/lid.176.ftz"),
home.resolve("model/segments.bin")
);
}
public static Path getAtagsPath() {
return getHomePath().resolve("data/atags.parquet");
}
}
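`getHomePath` falls back to walking up from the working directory with the three-argument `Stream.iterate`, stopping at the filesystem root and keeping the first ancestor that contains known fingerprint files. The same ancestor-walking idiom in isolation (hypothetical marker file name and demo class):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

public class AncestorSearchDemo {
    // Walk start, start.getParent(), ... until the filesystem root, returning
    // the first ancestor that contains the given marker file.
    static Optional<Path> findAncestorWith(Path start, String marker) {
        return Stream.iterate(start.toAbsolutePath(), p -> p != null, Path::getParent)
                .filter(p -> Files.exists(p.resolve(marker)))
                .findFirst();
    }

    public static void main(String[] args) throws Exception {
        Path root = Files.createTempDirectory("demo");
        Path nested = Files.createDirectories(root.resolve("a/b/c"));
        Files.createFile(root.resolve("run.marker"));
        // The marker sits three levels above the starting directory
        System.out.println(findAncestorWith(nested, "run.marker").isPresent()); // → true
    }
}
```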


@@ -3,6 +3,7 @@ package nu.marginalia.nodecfg;
 import com.google.inject.Inject;
 import com.zaxxer.hikari.HikariDataSource;
 import nu.marginalia.nodecfg.model.NodeConfiguration;
+import nu.marginalia.nodecfg.model.NodeProfile;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -20,10 +21,10 @@ public class NodeConfigurationService {
         this.dataSource = dataSource;
     }
-    public NodeConfiguration create(int id, String description, boolean acceptQueries, boolean keepWarcs) throws SQLException {
+    public NodeConfiguration create(int id, String description, boolean acceptQueries, boolean keepWarcs, NodeProfile nodeProfile) throws SQLException {
         try (var conn = dataSource.getConnection();
              var is = conn.prepareStatement("""
-                     INSERT IGNORE INTO NODE_CONFIGURATION(ID, DESCRIPTION, ACCEPT_QUERIES, KEEP_WARCS) VALUES(?, ?, ?, ?)
+                     INSERT IGNORE INTO NODE_CONFIGURATION(ID, DESCRIPTION, ACCEPT_QUERIES, KEEP_WARCS, NODE_PROFILE) VALUES(?, ?, ?, ?, ?)
                      """)
        )
        {
@@ -31,6 +32,7 @@ public class NodeConfigurationService {
             is.setString(2, description);
             is.setBoolean(3, acceptQueries);
             is.setBoolean(4, keepWarcs);
+            is.setString(5, nodeProfile.name());
             if (is.executeUpdate() <= 0) {
                 throw new IllegalStateException("Failed to insert configuration");
@@ -43,7 +45,7 @@ public class NodeConfigurationService {
     public List<NodeConfiguration> getAll() {
         try (var conn = dataSource.getConnection();
              var qs = conn.prepareStatement("""
-                     SELECT ID, DESCRIPTION, ACCEPT_QUERIES, AUTO_CLEAN, PRECESSION, KEEP_WARCS, DISABLED
+                     SELECT ID, DESCRIPTION, ACCEPT_QUERIES, AUTO_CLEAN, PRECESSION, KEEP_WARCS, NODE_PROFILE, DISABLED
                      FROM NODE_CONFIGURATION
                      """)) {
             var rs = qs.executeQuery();
@@ -58,6 +60,7 @@ public class NodeConfigurationService {
                         rs.getBoolean("AUTO_CLEAN"),
                         rs.getBoolean("PRECESSION"),
                         rs.getBoolean("KEEP_WARCS"),
+                        NodeProfile.valueOf(rs.getString("NODE_PROFILE")),
                         rs.getBoolean("DISABLED")
                 ));
             }
@@ -72,7 +75,7 @@ public class NodeConfigurationService {
     public NodeConfiguration get(int nodeId) throws SQLException {
         try (var conn = dataSource.getConnection();
              var qs = conn.prepareStatement("""
-                     SELECT ID, DESCRIPTION, ACCEPT_QUERIES, AUTO_CLEAN, PRECESSION, KEEP_WARCS, DISABLED
+                     SELECT ID, DESCRIPTION, ACCEPT_QUERIES, AUTO_CLEAN, PRECESSION, KEEP_WARCS, NODE_PROFILE, DISABLED
                      FROM NODE_CONFIGURATION
                      WHERE ID=?
                      """)) {
@@ -86,6 +89,7 @@ public class NodeConfigurationService {
                         rs.getBoolean("AUTO_CLEAN"),
                         rs.getBoolean("PRECESSION"),
                         rs.getBoolean("KEEP_WARCS"),
+                        NodeProfile.valueOf(rs.getString("NODE_PROFILE")),
                         rs.getBoolean("DISABLED")
                 );
             }
@@ -98,7 +102,7 @@ public class NodeConfigurationService {
         try (var conn = dataSource.getConnection();
              var us = conn.prepareStatement("""
                      UPDATE NODE_CONFIGURATION
-                     SET DESCRIPTION=?, ACCEPT_QUERIES=?, AUTO_CLEAN=?, PRECESSION=?, KEEP_WARCS=?, DISABLED=?
+                     SET DESCRIPTION=?, ACCEPT_QUERIES=?, AUTO_CLEAN=?, PRECESSION=?, KEEP_WARCS=?, DISABLED=?, NODE_PROFILE=?
                      WHERE ID=?
                      """))
         {
@@ -108,7 +112,8 @@ public class NodeConfigurationService {
             us.setBoolean(4, config.includeInPrecession());
             us.setBoolean(5, config.keepWarcs());
             us.setBoolean(6, config.disabled());
-            us.setInt(7, config.node());
+            us.setString(7, config.profile().name());
+            us.setInt(8, config.node());
             if (us.executeUpdate() <= 0)
                 throw new IllegalStateException("Failed to update configuration");


@@ -6,6 +6,7 @@ public record NodeConfiguration(int node,
                                 boolean autoClean,
                                 boolean includeInPrecession,
                                 boolean keepWarcs,
+                                NodeProfile profile,
                                 boolean disabled
 )
 {


@@ -0,0 +1,28 @@
package nu.marginalia.nodecfg.model;
public enum NodeProfile {
BATCH_CRAWL,
REALTIME,
MIXED,
SIDELOAD;
public boolean isBatchCrawl() {
return this == BATCH_CRAWL;
}
public boolean isRealtime() {
return this == REALTIME;
}
public boolean isMixed() {
return this == MIXED;
}
public boolean isSideload() {
return this == SIDELOAD;
}
public boolean permitBatchCrawl() {
return isBatchCrawl() || isMixed();
}
public boolean permitSideload() {
return isMixed() || isSideload();
}
}


@@ -2,13 +2,13 @@ package nu.marginalia.storage;
import com.google.inject.name.Named; import com.google.inject.name.Named;
import com.zaxxer.hikari.HikariDataSource; import com.zaxxer.hikari.HikariDataSource;
import lombok.SneakyThrows;
import nu.marginalia.storage.model.*; import nu.marginalia.storage.model.*;
import org.slf4j.Logger; import org.slf4j.Logger;
import org.slf4j.LoggerFactory; import org.slf4j.LoggerFactory;
import com.google.inject.Inject; import com.google.inject.Inject;
import com.google.inject.Singleton; import com.google.inject.Singleton;
import java.io.File; import java.io.File;
import java.io.IOException; import java.io.IOException;
import java.nio.file.*; import java.nio.file.*;
@@ -30,19 +30,26 @@ public class FileStorageService {
private static final DateTimeFormatter dirNameDatePattern = DateTimeFormatter.ofPattern("__uu-MM-dd'T'HH_mm_ss.SSS"); // filesystem safe ISO8601 private static final DateTimeFormatter dirNameDatePattern = DateTimeFormatter.ofPattern("__uu-MM-dd'T'HH_mm_ss.SSS"); // filesystem safe ISO8601
@Inject @Inject
public FileStorageService(HikariDataSource dataSource, @Named("wmsa-system-node") Integer node) { public FileStorageService(HikariDataSource dataSource,
@Named("wmsa-system-node") Integer node) {
this.dataSource = dataSource; this.dataSource = dataSource;
this.node = node; this.node = node;
for (var type : FileStorageType.values()) { logger.info("Resolving file storage root into {}", resolveStoragePath("/").toAbsolutePath());
String overrideProperty = System.getProperty(type.overrideName()); }
-            if (overrideProperty == null || overrideProperty.isBlank())
-                continue;
-
-            logger.info("FileStorage override present: {} -> {}", type,
-                    FileStorage.createOverrideStorage(type, FileStorageBaseType.CURRENT, overrideProperty).asPath());
-        }
+    /** Resolve a storage path from a relative path, injecting the system configured storage root
+     *  if set */
+    public static Path resolveStoragePath(String path) {
+        if (path.startsWith("/")) {
+            // Since Path.of("ANYTHING").resolve("/foo") = "/foo", we need to strip
+            // the leading slash
+            return resolveStoragePath(path.substring(1));
+        }
+        return Path
+                .of(System.getProperty("storage.root", "/"))
+                .resolve(path);
+    }

     /** @return the storage base with the given id, or null if it does not exist */
@@ -91,7 +98,7 @@ public class FileStorageService {
             throw new RuntimeException(e);
         }
-        File basePathFile = Path.of(base.path()).toFile();
+        File basePathFile = base.asPath().toFile();
         File[] files = basePathFile.listFiles(pathname -> pathname.isDirectory() && !ignoredPaths.contains(pathname.getName()));
         if (files == null) return;
         for (File file : files) {
@@ -119,6 +126,7 @@ public class FileStorageService {
         }
     }

     public void relateFileStorages(FileStorageId source, FileStorageId target) {
         try (var conn = dataSource.getConnection();
              var stmt = conn.prepareStatement("""
@@ -200,7 +208,6 @@ public class FileStorageService {
         return getStorageBase(type);
     }
-    @SneakyThrows
     private Path allocateDirectory(Path basePath, String prefix) throws IOException {
         LocalDateTime now = LocalDateTime.now();
         String timestampPart = now.format(dirNameDatePattern);
@@ -220,6 +227,9 @@ public class FileStorageService {
             );
         }

+        // Ensure umask didn't mess with the access permissions
+        Files.setPosixFilePermissions(maybePath, PosixFilePermissions.fromString("rwxr-xr-x"));
+
         return maybePath;
     }
@@ -278,20 +288,6 @@ public class FileStorageService {
     public FileStorage getStorageByType(FileStorageType type) throws SQLException {
-        String override = System.getProperty(type.overrideName());
-
-        if (override != null) {
-            // It is sometimes desirable to be able to override the
-            // configured location of a FileStorage when running a process
-            //
-            if (!Files.isDirectory(Path.of(override))) {
-                throw new IllegalStateException("FileStorageType " + type.name() + " was overridden, but location '" + override + "' does not exist!");
-            }
-            return FileStorage.createOverrideStorage(type, FileStorageBaseType.CURRENT, override);
-        }
-
         try (var conn = dataSource.getConnection();
              var stmt = conn.prepareStatement("""
                     SELECT PATH, STATE, DESCRIPTION, ID, BASE_ID, CREATE_DATE
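The new `resolveStoragePath` helper works around a `java.nio` quirk that the inline comment alludes to: `Path.resolve` discards the receiver entirely when handed an absolute path. A minimal standalone sketch of the same logic (the class name and demo values are illustrative, not from the patch):

```java
import java.nio.file.Path;

class ResolveDemo {
    // Mirrors the resolveStoragePath logic above: strip leading slashes so the
    // configured storage.root always applies, even to "absolute" stored paths.
    static Path resolveStoragePath(String path) {
        if (path.startsWith("/")) {
            return resolveStoragePath(path.substring(1));
        }
        return Path.of(System.getProperty("storage.root", "/")).resolve(path);
    }

    public static void main(String[] args) {
        // The gotcha: resolve() returns its argument unchanged when it is absolute
        System.out.println(Path.of("/root").resolve("/etc"));   // /etc
        System.out.println(Path.of("/root").resolve("etc"));    // /root/etc

        System.setProperty("storage.root", "/tmp");
        System.out.println(resolveStoragePath("/test"));        // /tmp/test
        System.out.println(resolveStoragePath("test"));         // /tmp/test
    }
}
```

Without the slash-stripping, an absolute path stored in the database would silently bypass the configured storage root.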
View File
@@ -1,5 +1,7 @@
 package nu.marginalia.storage.model;

+import nu.marginalia.storage.FileStorageService;
+
 import java.nio.file.Path;
 import java.time.LocalDateTime;
 import java.time.format.DateTimeFormatter;
@@ -24,36 +26,15 @@ public record FileStorage (
         String description)
 {
-    /** It is sometimes desirable to be able to create an override that isn't
-     * backed by the database.  This constructor permits this.
-     */
-    public static FileStorage createOverrideStorage(FileStorageType type, FileStorageBaseType baseType, String override) {
-        var mockBase = new FileStorageBase(
-                new FileStorageBaseId(-1),
-                baseType,
-                -1,
-                "OVERRIDE:" + type.name(),
-                "INVALIDINVALIDINVALID"
-        );
-        return new FileStorage(
-                new FileStorageId(-1),
-                mockBase,
-                type,
-                LocalDateTime.now(),
-                override,
-                FileStorageState.UNSET,
-                "OVERRIDE:" + type.name()
-        );
-    }
-
     public int node() {
         return base.node();
     }

     public Path asPath() {
-        return Path.of(path);
+        return FileStorageService.resolveStoragePath(path);
     }

     public boolean isActive() {
         return FileStorageState.ACTIVE.equals(state);
     }
View File
@@ -1,5 +1,7 @@
 package nu.marginalia.storage.model;

+import nu.marginalia.storage.FileStorageService;
+
 import java.nio.file.Path;

 /**
@@ -16,9 +18,11 @@ public record FileStorageBase(FileStorageBaseId id,
                               String name,
                               String path
                               ) {
     public Path asPath() {
-        return Path.of(path);
+        return FileStorageService.resolveStoragePath(path);
     }

     public boolean isValid() {
         return id.id() >= 0;
     }
View File
@@ -1,12 +1,11 @@
 package nu.marginalia.storage.model;

 public enum FileStorageType {
-    CRAWL_SPEC,
+    @Deprecated
+    CRAWL_SPEC, //
     CRAWL_DATA,
     PROCESSED_DATA,
     BACKUP,
     EXPORT;
-
-    public String overrideName() {
-        return "FS_OVERRIDE:"+name();
-    }
 }
View File
@@ -1,98 +0,0 @@
package nu.marginalia;
import nu.marginalia.service.ServiceHomeNotConfiguredException;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.stream.Stream;
public class WmsaHome {
public static UserAgent getUserAgent() {
return new UserAgent(
System.getProperty("crawler.userAgentString", "Mozilla/5.0 (compatible; Marginalia-like bot; +https://git.marginalia.nu/))"),
System.getProperty("crawler.userAgentIdentifier", "search.marginalia.nu")
);
}
public static Path getUploadDir() {
return Path.of(
System.getProperty("executor.uploadDir", "/uploads")
);
}
public static Path getHomePath() {
var retStr = Optional.ofNullable(System.getenv("WMSA_HOME")).orElseGet(WmsaHome::findDefaultHomePath);
var ret = Path.of(retStr);
if (!Files.isDirectory(ret)) {
throw new ServiceHomeNotConfiguredException("Could not find $WMSA_HOME, either set environment variable or ensure " + retStr + " exists");
}
if (!Files.isDirectory(ret.resolve("model"))) {
throw new ServiceHomeNotConfiguredException("You need to run 'run/setup.sh' to download models to run/ before this will work!");
}
return ret;
}
private static String findDefaultHomePath() {
// Assume this is a local developer and not a production system, since it would have WMSA_HOME set.
// Developers probably have a "run/" somewhere upstream from cwd.
//
return Stream.iterate(Paths.get("").toAbsolutePath(), f -> f != null && Files.exists(f), Path::getParent)
.filter(p -> Files.exists(p.resolve("run/env")))
.filter(p -> Files.exists(p.resolve("run/setup.sh")))
.map(p -> p.resolve("run"))
.findAny()
.orElse(Path.of("/var/lib/wmsa"))
.toString();
}
public static Path getAdsDefinition() {
return getHomePath().resolve("data").resolve("adblock.txt");
}
public static Path getIPLocationDatabse() {
return getHomePath().resolve("data").resolve("IP2LOCATION-LITE-DB1.CSV");
}
public static Path getAsnMappingDatabase() {
return getHomePath().resolve("data").resolve("asn-data-raw-table");
}
public static Path getAsnInfoDatabase() {
return getHomePath().resolve("data").resolve("asn-used-autnums");
}
public static LanguageModels getLanguageModels() {
final Path home = getHomePath();
return new LanguageModels(
home.resolve("model/ngrams.bin"),
home.resolve("model/tfreq-new-algo3.bin"),
home.resolve("model/opennlp-sentence.bin"),
home.resolve("model/English.RDR"),
home.resolve("model/English.DICT"),
home.resolve("model/opennlp-tok.bin"),
home.resolve("model/lid.176.ftz"));
}
public static Path getAtagsPath() {
return getHomePath().resolve("data/atags.parquet");
}
}
View File
@@ -2,7 +2,7 @@ package nu.marginalia.nodecfg;

 import com.zaxxer.hikari.HikariConfig;
 import com.zaxxer.hikari.HikariDataSource;
-import nu.marginalia.storage.FileStorageService;
+import nu.marginalia.nodecfg.model.NodeProfile;
 import nu.marginalia.test.TestMigrationLoader;
 import org.junit.jupiter.api.BeforeAll;
 import org.junit.jupiter.api.Tag;
@@ -13,12 +13,7 @@ import org.testcontainers.containers.MariaDBContainer;
 import org.testcontainers.junit.jupiter.Container;
 import org.testcontainers.junit.jupiter.Testcontainers;

-import java.io.IOException;
-import java.nio.file.Path;
 import java.sql.SQLException;
-import java.util.ArrayList;
-import java.util.List;
-import java.util.Objects;

 import static org.junit.jupiter.api.Assertions.*;
@@ -52,8 +47,8 @@ public class NodeConfigurationServiceTest {
     @Test
     public void test() throws SQLException {
-        var a = nodeConfigurationService.create(1, "Test", false, false);
-        var b = nodeConfigurationService.create(2, "Foo", true, false);
+        var a = nodeConfigurationService.create(1, "Test", false, false, NodeProfile.MIXED);
+        var b = nodeConfigurationService.create(2, "Foo", true, false, NodeProfile.MIXED);

         assertEquals(1, a.node());
         assertEquals("Test", a.description());
View File
@@ -3,6 +3,8 @@ package nu.marginalia.storage;
 import com.google.common.collect.Lists;
 import com.zaxxer.hikari.HikariConfig;
 import com.zaxxer.hikari.HikariDataSource;
+import nu.marginalia.storage.model.FileStorage;
+import nu.marginalia.storage.model.FileStorageBase;
 import nu.marginalia.storage.model.FileStorageBaseType;
 import nu.marginalia.storage.model.FileStorageType;
 import nu.marginalia.test.TestMigrationLoader;
@@ -52,11 +54,6 @@ public class FileStorageServiceTest {
     @BeforeEach
     public void setupEach() {
-        // clean up any file storage overrides
-        for (FileStorageType type : FileStorageType.values()) {
-            System.setProperty(type.overrideName(), "");
-        }
-
         fileStorageService = new FileStorageService(dataSource, 0);
     }
@@ -97,12 +94,43 @@ public class FileStorageServiceTest {
     }

     @Test
-    public void testOverride() throws SQLException {
-        System.setProperty(FileStorageType.BACKUP.overrideName(), "/tmp");
-        System.out.println(FileStorageType.BACKUP.overrideName());
-        fileStorageService = new FileStorageService(dataSource, 0);
-        Assertions.assertEquals(Path.of("/tmp"), fileStorageService.getStorageByType(FileStorageType.BACKUP).asPath());
+    public void testPathOverride() {
+        try {
+            System.setProperty("storage.root", "/tmp");
+            var path = new FileStorageBase(null, null, 0, null, "test").asPath();
+            Assertions.assertEquals(Path.of("/tmp/test"), path);
+        }
+        finally {
+            System.clearProperty("storage.root");
+        }
     }

+    @Test
+    public void testPathOverride3() {
+        try {
+            System.setProperty("storage.root", "/tmp");
+            var path = new FileStorageBase(null, null, 0, null, "/test").asPath();
+            Assertions.assertEquals(Path.of("/tmp/test"), path);
+        }
+        finally {
+            System.clearProperty("storage.root");
+        }
+    }
+
+    @Test
+    public void testPathOverride2() {
+        try {
+            System.setProperty("storage.root", "/tmp");
+            var path = new FileStorage(null, null, null, null, "test", null, null).asPath();
+            Assertions.assertEquals(Path.of("/tmp/test"), path);
+        }
+        finally {
+            System.clearProperty("storage.root");
+        }
+    }
+
     @Test
     public void testCreateBase() throws SQLException {
         String name = "test-" + UUID.randomUUID();
View File
@@ -17,7 +17,7 @@ plugins {

 java {
     toolchain {
-        languageVersion.set(JavaLanguageVersion.of(21))
+        languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
     }
 }
@@ -26,12 +26,17 @@ configurations {
     flywayMigration.extendsFrom(implementation)
 }

+apply from: "$rootProject.projectDir/srcsets.gradle"
+
 dependencies {
     implementation project(':code:common:model')
     implementation libs.bundles.slf4j

-    implementation libs.guice
+    implementation libs.guava
+    implementation dependencies.create(libs.guice.get()) {
+        exclude group: 'com.google.guava'
+    }
     implementation libs.bundles.gson
     implementation libs.notnull
@@ -40,7 +45,6 @@ dependencies {
     implementation libs.trove
-    implementation libs.rxjava
     implementation libs.bundles.mariadb

     flywayMigration 'org.flywaydb:flyway-mysql:10.0.1'
@@ -50,6 +54,7 @@ dependencies {
     testImplementation platform('org.testcontainers:testcontainers-bom:1.17.4')
+    testImplementation libs.commons.codec
     testImplementation 'org.testcontainers:mariadb:1.17.4'
     testImplementation 'org.testcontainers:junit-jupiter:1.17.4'
     testImplementation project(':code:libraries:test-helpers')
View File
@@ -0,0 +1,179 @@
package nu.marginalia.db;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.util.concurrent.UncheckedExecutionException;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import com.zaxxer.hikari.HikariDataSource;
import nu.marginalia.model.EdgeDomain;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.SQLException;
import java.util.*;
import java.util.concurrent.ExecutionException;
@Singleton
public class DbDomainQueries {
private final HikariDataSource dataSource;
private static final Logger logger = LoggerFactory.getLogger(DbDomainQueries.class);
private final Cache<EdgeDomain, Integer> domainIdCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
private final Cache<EdgeDomain, DomainIdWithNode> domainWithNodeCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
private final Cache<Integer, EdgeDomain> domainNameCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
private final Cache<String, List<DomainWithNode>> siblingsCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
@Inject
public DbDomainQueries(HikariDataSource dataSource)
{
this.dataSource = dataSource;
}
public Integer getDomainId(EdgeDomain domain) throws NoSuchElementException {
try {
return domainIdCache.get(domain, () -> {
try (var connection = dataSource.getConnection();
var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
return rsp.getInt(1);
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
throw new NoSuchElementException();
});
}
catch (UncheckedExecutionException ex) {
throw new NoSuchElementException();
}
catch (ExecutionException ex) {
throw new RuntimeException(ex.getCause());
}
}
public DomainIdWithNode getDomainIdWithNode(EdgeDomain domain) throws NoSuchElementException {
try {
return domainWithNodeCache.get(domain, () -> {
try (var connection = dataSource.getConnection();
var stmt = connection.prepareStatement("SELECT ID, NODE_AFFINITY FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
return new DomainIdWithNode(rsp.getInt(1), rsp.getInt(2));
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
throw new NoSuchElementException();
});
}
catch (UncheckedExecutionException ex) {
throw new NoSuchElementException();
}
catch (ExecutionException ex) {
throw new RuntimeException(ex.getCause());
}
}
public OptionalInt tryGetDomainId(EdgeDomain domain) {
Integer maybeId = domainIdCache.getIfPresent(domain);
if (maybeId != null) {
return OptionalInt.of(maybeId);
}
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
var id = rsp.getInt(1);
domainIdCache.put(domain, id);
return OptionalInt.of(id);
}
}
return OptionalInt.empty();
}
catch (UncheckedExecutionException ex) {
throw new RuntimeException(ex.getCause());
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
}
public Optional<EdgeDomain> getDomain(int id) {
EdgeDomain existing = domainNameCache.getIfPresent(id);
if (existing != null) {
return Optional.of(existing);
}
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("SELECT DOMAIN_NAME FROM EC_DOMAIN WHERE ID=?")) {
stmt.setInt(1, id);
var rsp = stmt.executeQuery();
if (rsp.next()) {
var val = new EdgeDomain(rsp.getString(1));
domainNameCache.put(id, val);
return Optional.of(val);
}
return Optional.empty();
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
}
public List<DomainWithNode> otherSubdomains(EdgeDomain domain, int cnt) throws ExecutionException {
String topDomain = domain.topDomain;
return siblingsCache.get(topDomain, () -> {
List<DomainWithNode> ret = new ArrayList<>();
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("SELECT DOMAIN_NAME, NODE_AFFINITY FROM EC_DOMAIN WHERE DOMAIN_TOP = ? LIMIT ?")) {
stmt.setString(1, topDomain);
stmt.setInt(2, cnt);
var rs = stmt.executeQuery();
while (rs.next()) {
var sibling = new EdgeDomain(rs.getString(1));
if (sibling.equals(domain))
continue;
ret.add(new DomainWithNode(sibling, rs.getInt(2)));
}
} catch (SQLException e) {
logger.error("Failed to get domain neighbors");
}
return ret;
});
}
public record DomainWithNode (EdgeDomain domain, int nodeAffinity) {
public boolean isIndexed() {
return nodeAffinity > 0;
}
}
public record DomainIdWithNode (int domainId, int nodeAffinity) { }
}
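The lookup methods in the new `DbDomainQueries` follow one pattern throughout: consult a Guava cache with a loading callable, translate a miss into `NoSuchElementException`, and deliberately avoid caching negative results. The same shape can be sketched with the standard library's `ConcurrentHashMap.computeIfAbsent`, which likewise declines to store a `null` mapping; the class name and the in-memory `table` below are illustrative stand-ins for the real `EC_DOMAIN` query:

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentHashMap;

class DomainIdCacheSketch {
    // Hypothetical stand-in for the database; the real class queries MariaDB.
    static final Map<String, Integer> table = Map.of("marginalia.nu", 1);

    private final Map<String, Integer> cache = new ConcurrentHashMap<>();

    int getDomainId(String domain) {
        // computeIfAbsent does not insert anything when the mapping function
        // returns null, so failed lookups are retried on the next call --
        // mirroring how the Guava version throws instead of caching a miss.
        Integer id = cache.computeIfAbsent(domain, table::get);
        if (id == null)
            throw new NoSuchElementException(domain);
        return id;
    }
}
```

Not caching misses matters here because a domain absent from `EC_DOMAIN` may appear later, and a poisoned negative entry would hide it for the cache's lifetime.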
View File
@@ -9,4 +9,5 @@ public interface DomainBlacklist {
     default TIntHashSet getSpamDomains() {
         return new TIntHashSet();
     }
+    void waitUntilLoaded() throws InterruptedException;
 }
View File
@@ -0,0 +1,126 @@
package nu.marginalia.db;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import com.zaxxer.hikari.HikariDataSource;
import gnu.trove.set.hash.TIntHashSet;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.SQLException;
import java.util.concurrent.TimeUnit;
@Singleton
public class DomainBlacklistImpl implements DomainBlacklist {
private final boolean blacklistDisabled = Boolean.getBoolean("blacklist.disable");
private final HikariDataSource dataSource;
private final Logger logger = LoggerFactory.getLogger(getClass());
private volatile TIntHashSet spamDomainSet = new TIntHashSet();
private volatile boolean isLoaded = false;
@Inject
public DomainBlacklistImpl(HikariDataSource dataSource) {
this.dataSource = dataSource;
Thread.ofPlatform().daemon().name("BlacklistUpdater").start(this::updateSpamList);
}
private void updateSpamList() {
// If the blacklist is disabled, we don't need to do anything
if (blacklistDisabled) {
isLoaded = true;
flagLoaded();
return;
}
for (;;) {
spamDomainSet = getSpamDomains();
// Set the flag to true after the first loading attempt, regardless of success,
// to avoid deadlocking threads that are waiting for this condition
flagLoaded();
// Sleep for 10 minutes before trying again
try {
TimeUnit.MINUTES.sleep(10);
}
catch (InterruptedException ex) {
break;
}
}
}
private void flagLoaded() {
if (!isLoaded) {
synchronized (this) {
isLoaded = true;
notifyAll();
}
}
}
/** Block until the blacklist has been loaded */
@Override
public void waitUntilLoaded() throws InterruptedException {
if (blacklistDisabled)
return;
if (!isLoaded) {
logger.info("Waiting for blacklist to be loaded");
synchronized (this) {
while (!isLoaded) {
wait(5000);
}
}
logger.info("Blacklist loaded, size = {}", spamDomainSet.size());
}
}
public TIntHashSet getSpamDomains() {
final TIntHashSet result = new TIntHashSet(1_000_000);
if (blacklistDisabled) {
return result;
}
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("""
SELECT EC_DOMAIN.ID
FROM EC_DOMAIN
INNER JOIN EC_DOMAIN_BLACKLIST
ON (EC_DOMAIN_BLACKLIST.URL_DOMAIN = EC_DOMAIN.DOMAIN_TOP
OR EC_DOMAIN_BLACKLIST.URL_DOMAIN = EC_DOMAIN.DOMAIN_NAME)
"""))
{
stmt.setFetchSize(1000);
var rsp = stmt.executeQuery();
while (rsp.next()) {
result.add(rsp.getInt(1));
}
}
} catch (SQLException ex) {
logger.error("Failed to load spam domain list", ex);
}
return result;
}
@Override
public boolean isBlacklisted(int domainId) {
if (spamDomainSet.contains(domainId)) {
return true;
}
return false;
}
}
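The `flagLoaded`/`waitUntilLoaded` pair above is a classic volatile-flag-plus-monitor gate: the updater thread flips the flag while holding the monitor and notifies, and waiters re-check the flag in a loop to guard against spurious wakeups. A stripped-down sketch of just that gate (the class name is illustrative, not from the patch):

```java
class BlacklistLoadGate {
    private volatile boolean isLoaded = false;

    // Called by the updater thread after the first load attempt,
    // regardless of success, so waiters can't deadlock on a failed load.
    void flagLoaded() {
        if (!isLoaded) {
            synchronized (this) {
                isLoaded = true;
                notifyAll();
            }
        }
    }

    // Block until the first load attempt has completed.
    void waitUntilLoaded() throws InterruptedException {
        if (!isLoaded) {
            synchronized (this) {
                while (!isLoaded) {
                    wait(5000);  // re-check in a loop; wait() may wake spuriously
                }
            }
        }
    }
}
```

Because the flag is only ever set inside the monitor, a waiter that observes `!isLoaded` under the same monitor is guaranteed to receive the `notifyAll` that follows the flag flip; the leading unsynchronized check is just a fast path once loading has finished.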
View File
@@ -2,7 +2,6 @@ package nu.marginalia.db;

 import com.google.inject.Inject;
 import com.zaxxer.hikari.HikariDataSource;
-import lombok.With;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -25,7 +24,7 @@ public class DomainRankingSetsService {
     public Optional<DomainRankingSet> get(String name) throws SQLException {
         try (var conn = dataSource.getConnection();
              var stmt = conn.prepareStatement("""
-                     SELECT NAME, DESCRIPTION, ALGORITHM, DEPTH, DEFINITION
+                     SELECT NAME, DESCRIPTION, DEPTH, DEFINITION
                      FROM CONF_DOMAIN_RANKING_SET
                      WHERE NAME = ?
                      """)) {
@@ -39,7 +38,6 @@ public class DomainRankingSetsService {
             return Optional.of(new DomainRankingSet(
                     rs.getString("NAME"),
                     rs.getString("DESCRIPTION"),
-                    DomainSetAlgorithm.valueOf(rs.getString("ALGORITHM")),
                     rs.getInt("DEPTH"),
                     rs.getString("DEFINITION")
             ));
@@ -53,15 +51,14 @@ public class DomainRankingSetsService {
     public void upsert(DomainRankingSet domainRankingSet) {
         try (var conn = dataSource.getConnection();
              var stmt = conn.prepareStatement("""
-                     REPLACE INTO CONF_DOMAIN_RANKING_SET(NAME, DESCRIPTION, ALGORITHM, DEPTH, DEFINITION)
-                     VALUES (?, ?, ?, ?, ?)
+                     REPLACE INTO CONF_DOMAIN_RANKING_SET(NAME, DESCRIPTION, DEPTH, DEFINITION)
+                     VALUES (?, ?, ?, ?)
                      """))
         {
             stmt.setString(1, domainRankingSet.name());
             stmt.setString(2, domainRankingSet.description());
-            stmt.setString(3, domainRankingSet.algorithm().name());
-            stmt.setInt(4, domainRankingSet.depth());
-            stmt.setString(5, domainRankingSet.definition());
+            stmt.setInt(3, domainRankingSet.depth());
+            stmt.setString(4, domainRankingSet.definition());
             stmt.executeUpdate();

             if (!conn.getAutoCommit())
@@ -94,7 +91,7 @@ public class DomainRankingSetsService {
         try (var conn = dataSource.getConnection();
              var stmt = conn.prepareStatement("""
-                     SELECT NAME, DESCRIPTION, ALGORITHM, DEPTH, DEFINITION
+                     SELECT NAME, DESCRIPTION, DEPTH, DEFINITION
                      FROM CONF_DOMAIN_RANKING_SET
                      """)) {
             var rs = stmt.executeQuery();
@@ -105,7 +102,6 @@ public class DomainRankingSetsService {
                         new DomainRankingSet(
                                 rs.getString("NAME"),
                                 rs.getString("DESCRIPTION"),
-                                DomainSetAlgorithm.valueOf(rs.getString("ALGORITHM")),
                                 rs.getInt("DEPTH"),
                                 rs.getString("DEFINITION"))
                 );
@@ -118,38 +114,23 @@ public class DomainRankingSetsService {
         }
     }

-    public enum DomainSetAlgorithm {
-        /** Use link graph, do a pagerank */
-        LINKS_PAGERANK,
-        /** Use link graph, do a cheirank */
-        LINKS_CHEIRANK,
-        /** Use adjacency graph, do a pagerank */
-        ADJACENCY_PAGERANK,
-        /** Use adjacency graph, do a cheirank */
-        ADJACENCY_CHEIRANK,
-        /** For reserved names.  Use special algorithm, function of name */
-        SPECIAL
-    };
-
-    /** Defines a domain ranking set, parameters for the ranking algorithms.
+    /**
+     * Defines a domain ranking set, parameters for the ranking algorithms.
      *
      * @param name Key and name of the set
      * @param description Human-readable description
-     * @param algorithm Algorithm to use
      * @param depth Depth of the algorithm
-     * @param definition Definition of the set, typically a list of domains or globs for domain-names
-     * */
-    @With
+     * @param definition Definition of the set, typically a list of domains or globs for domain-names
+     */
     public record DomainRankingSet(String name,
                                    String description,
-                                   DomainSetAlgorithm algorithm,
                                    int depth,
-                                   String definition)
-    {
+                                   String definition) {

         public Path fileName(Path base) {
             return base.resolve(name().toLowerCase() + ".dat");
         }

         public String[] domains() {
             return Arrays.stream(definition().split("\n+"))
                     .map(String::trim)
@@ -159,8 +140,23 @@ public class DomainRankingSetsService {
         }

         public boolean isSpecial() {
-            return algorithm() == DomainSetAlgorithm.SPECIAL;
+            return name().equals("BLOGS") || name().equals("NONE") || name().equals("RANK");
         }
+
+        public DomainRankingSet withName(String name) {
+            return this.name == name ? this : new DomainRankingSet(name, description, depth, definition);
+        }
+
+        public DomainRankingSet withDescription(String description) {
+            return this.description == description ? this : new DomainRankingSet(name, description, depth, definition);
+        }
+
+        public DomainRankingSet withDepth(int depth) {
+            return this.depth == depth ? this : new DomainRankingSet(name, description, depth, definition);
+        }
+
+        public DomainRankingSet withDefinition(String definition) {
+            return this.definition == definition ? this : new DomainRankingSet(name, description, depth, definition);
+        }
     }
 }
View File
@@ -24,7 +24,7 @@ public class DomainTypes {
         BLOG,
         CRAWL,
         TEST
-    };
+    }

     private final Logger logger = LoggerFactory.getLogger(DomainTypes.class);
View File
@@ -17,14 +17,14 @@ It's well documented and these are probably the only four tasks you'll ever need
 If you are not running the system via docker, you need to provide alternative connection details than
 the defaults (TODO: how?).

-The migration files are in [resources/db/migration](src/main/resources/db/migration). The file name convention
+The migration files are in [resources/db/migration](resources/db/migration). The file name convention
 incorporates the project's cal-ver versioning; and are applied in lexicographical order.

 VYY_MM_v_nnn__description.sql

 ## Central Paths

-* [migrations](src/main/resources/db/migration) - Flyway migrations
+* [migrations](resources/db/migration) - Flyway migrations

 ## See Also
View File
@@ -0,0 +1 @@
ALTER TABLE CONF_DOMAIN_RANKING_SET DROP COLUMN ALGORITHM;
View File
@@ -0,0 +1 @@
ALTER TABLE WMSA_prod.NODE_CONFIGURATION ADD COLUMN NODE_PROFILE VARCHAR(255) DEFAULT 'MIXED';
View File
@@ -1,91 +0,0 @@
package nu.marginalia.db;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.util.concurrent.UncheckedExecutionException;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import com.zaxxer.hikari.HikariDataSource;
import lombok.SneakyThrows;
import nu.marginalia.model.EdgeDomain;
import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.OptionalInt;
@Singleton
public class DbDomainQueries {
private final HikariDataSource dataSource;
private final Cache<EdgeDomain, Integer> domainIdCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
@Inject
public DbDomainQueries(HikariDataSource dataSource)
{
this.dataSource = dataSource;
}
@SneakyThrows
public Integer getDomainId(EdgeDomain domain) {
try (var connection = dataSource.getConnection()) {
return domainIdCache.get(domain, () -> {
try (var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
return rsp.getInt(1);
}
}
throw new NoSuchElementException();
});
}
catch (UncheckedExecutionException ex) {
throw ex.getCause();
}
}
@SneakyThrows
public OptionalInt tryGetDomainId(EdgeDomain domain) {
Integer maybeId = domainIdCache.getIfPresent(domain);
if (maybeId != null) {
return OptionalInt.of(maybeId);
}
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
var id = rsp.getInt(1);
domainIdCache.put(domain, id);
return OptionalInt.of(id);
}
}
return OptionalInt.empty();
}
catch (UncheckedExecutionException ex) {
return OptionalInt.empty();
}
}
@SneakyThrows
public Optional<EdgeDomain> getDomain(int id) {
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("SELECT DOMAIN_NAME FROM EC_DOMAIN WHERE ID=?")) {
stmt.setInt(1, id);
var rsp = stmt.executeQuery();
if (rsp.next()) {
return Optional.of(new EdgeDomain(rsp.getString(1)));
}
return Optional.empty();
}
}
}
}
View File
@@ -1,118 +0,0 @@
package nu.marginalia.db;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;
/** Class used in exporting data. This is intended to be used for a brief time
* and then discarded, not kept around as a service.
*/
public class DbDomainStatsExportMultitool implements AutoCloseable {
private final Connection connection;
private final int nodeId;
private final PreparedStatement knownUrlsQuery;
private final PreparedStatement visitedUrlsQuery;
private final PreparedStatement goodUrlsQuery;
private final PreparedStatement domainNameToId;
private final PreparedStatement allDomainsQuery;
private final PreparedStatement crawlQueueDomains;
private final PreparedStatement indexedDomainsQuery;
public DbDomainStatsExportMultitool(HikariDataSource dataSource, int nodeId) throws SQLException {
this.connection = dataSource.getConnection();
this.nodeId = nodeId;
knownUrlsQuery = connection.prepareStatement("""
SELECT KNOWN_URLS
FROM EC_DOMAIN INNER JOIN DOMAIN_METADATA
ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
WHERE DOMAIN_NAME=?
""");
visitedUrlsQuery = connection.prepareStatement("""
SELECT VISITED_URLS
FROM EC_DOMAIN INNER JOIN DOMAIN_METADATA
ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
WHERE DOMAIN_NAME=?
""");
goodUrlsQuery = connection.prepareStatement("""
SELECT GOOD_URLS
FROM EC_DOMAIN INNER JOIN DOMAIN_METADATA
ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
WHERE DOMAIN_NAME=?
""");
domainNameToId = connection.prepareStatement("""
SELECT ID
FROM EC_DOMAIN
WHERE DOMAIN_NAME=?
""");
allDomainsQuery = connection.prepareStatement("""
SELECT DOMAIN_NAME
FROM EC_DOMAIN
""");
crawlQueueDomains = connection.prepareStatement("""
SELECT DOMAIN_NAME
FROM CRAWL_QUEUE
""");
indexedDomainsQuery = connection.prepareStatement("""
SELECT DOMAIN_NAME
FROM EC_DOMAIN
WHERE INDEXED > 0
""");
}
public OptionalInt getVisitedUrls(String domainName) throws SQLException {
return executeNameToIntQuery(domainName, visitedUrlsQuery);
}
public OptionalInt getDomainId(String domainName) throws SQLException {
return executeNameToIntQuery(domainName, domainNameToId);
}
public List<String> getCrawlQueueDomains() throws SQLException {
return executeListQuery(crawlQueueDomains, 100);
}
public List<String> getAllIndexedDomains() throws SQLException {
return executeListQuery(indexedDomainsQuery, 100_000);
}
private OptionalInt executeNameToIntQuery(String domainName, PreparedStatement statement)
throws SQLException {
statement.setString(1, domainName);
try (var rs = statement.executeQuery()) {
if (rs.next()) {
return OptionalInt.of(rs.getInt(1));
}
return OptionalInt.empty();
}
}
private List<String> executeListQuery(PreparedStatement statement, int sizeHint) throws SQLException {
List<String> ret = new ArrayList<>(sizeHint);
try (var rs = statement.executeQuery()) {
while (rs.next()) {
ret.add(rs.getString(1));
}
}
return ret;
}
@Override
public void close() throws SQLException {
knownUrlsQuery.close();
goodUrlsQuery.close();
visitedUrlsQuery.close();
allDomainsQuery.close();
crawlQueueDomains.close();
domainNameToId.close();
indexedDomainsQuery.close();
connection.close();
}
}
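The `executeNameToIntQuery` helper above follows a common single-column lookup pattern: bind the domain name, execute the query, and map an absent row to `OptionalInt.empty()` rather than a sentinel value. A minimal stdlib-only sketch of the same contract, with a `Map` standing in for the `DOMAIN_METADATA` table (the class name and the sample data are illustrative, not part of the repo):

```java
import java.util.Map;
import java.util.OptionalInt;

public class DomainLookupSketch {
    // Stand-in for the DOMAIN_METADATA table; contents are purely illustrative.
    private static final Map<String, Integer> VISITED_URLS = Map.of(
            "example.com", 124,
            "marginalia.nu", 3551
    );

    // Mirrors executeNameToIntQuery: a present row maps to OptionalInt.of(...),
    // a missing row maps to OptionalInt.empty().
    public static OptionalInt getVisitedUrls(String domainName) {
        Integer v = VISITED_URLS.get(domainName);
        return v == null ? OptionalInt.empty() : OptionalInt.of(v);
    }

    public static void main(String[] args) {
        System.out.println(getVisitedUrls("example.com"));    // OptionalInt[124]
        System.out.println(getVisitedUrls("no-such.domain")); // OptionalInt.empty
    }
}
```

Returning `OptionalInt` pushes the "domain not found" case onto the caller explicitly, which is why `getDomainId` and `getVisitedUrls` in the class above can share one helper.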


@@ -1,76 +0,0 @@
package nu.marginalia.db;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import com.zaxxer.hikari.HikariDataSource;
import gnu.trove.set.hash.TIntHashSet;
import io.reactivex.rxjava3.schedulers.Schedulers;
import lombok.SneakyThrows;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.concurrent.TimeUnit;
@Singleton
public class DomainBlacklistImpl implements DomainBlacklist {
private volatile TIntHashSet spamDomainSet = new TIntHashSet();
private final HikariDataSource dataSource;
private final Logger logger = LoggerFactory.getLogger(getClass());
private final boolean blacklistDisabled = Boolean.getBoolean("blacklist.disable");
@Inject
public DomainBlacklistImpl(HikariDataSource dataSource) {
this.dataSource = dataSource;
Schedulers.io().schedulePeriodicallyDirect(this::updateSpamList, 5, 600, TimeUnit.SECONDS);
updateSpamList();
}
private void updateSpamList() {
try {
int oldSetSize = spamDomainSet.size();
spamDomainSet = getSpamDomains();
if (oldSetSize == 0 && !spamDomainSet.isEmpty()) {
logger.info("Synchronized {} spam domains", spamDomainSet.size());
}
}
catch (Exception ex) {
logger.error("Failed to synchronize spam domains", ex);
}
}
@SneakyThrows
public TIntHashSet getSpamDomains() {
final TIntHashSet result = new TIntHashSet(1_000_000);
if (blacklistDisabled) {
return result;
}
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("""
SELECT EC_DOMAIN.ID
FROM EC_DOMAIN
INNER JOIN EC_DOMAIN_BLACKLIST
ON (EC_DOMAIN_BLACKLIST.URL_DOMAIN = EC_DOMAIN.DOMAIN_TOP
OR EC_DOMAIN_BLACKLIST.URL_DOMAIN = EC_DOMAIN.DOMAIN_NAME)
""")) {
stmt.setFetchSize(1000);
var rsp = stmt.executeQuery();
while (rsp.next()) {
result.add(rsp.getInt(1));
}
}
}
return result;
}
@Override
public boolean isBlacklisted(int domainId) {
return spamDomainSet.contains(domainId);
}
}
}
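`DomainBlacklistImpl` relies on a simple read-mostly concurrency idiom: the refresher builds a complete new set off to the side, then publishes it with a single write to the `volatile` field, so readers never observe a partially filled set and need no locks. A stdlib-only sketch of that publish-by-swap pattern (the class name and the `Set<Integer>` stand-in for `TIntHashSet` are illustrative):

```java
import java.util.Set;

public class VolatileSwapSketch {
    // volatile: readers see either the old set or the new one, never a half-built set.
    private volatile Set<Integer> blocked = Set.of();

    // Equivalent of updateSpamList(): build the replacement fully, then publish atomically.
    public void refresh(Set<Integer> freshlyLoaded) {
        blocked = Set.copyOf(freshlyLoaded);
    }

    // Equivalent of isBlacklisted(): a lock-free read of the current snapshot.
    public boolean isBlocked(int id) {
        return blocked.contains(id);
    }

    public static void main(String[] args) {
        var b = new VolatileSwapSketch();
        System.out.println(b.isBlocked(7)); // false
        b.refresh(Set.of(7, 11));
        System.out.println(b.isBlocked(7)); // true
    }
}
```

The swap works because `updateSpamList` never mutates the published set in place; if it did, readers could see intermediate states and the `volatile` alone would not be enough.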
