mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

2283 Commits

Author SHA1 Message Date
Viktor Lofgren
426658f64e (search) Improve contrast with light mode 2025-03-25 11:54:54 +01:00
Viktor Lofgren
2181b22f05 (crawler) Change default maxConcurrentRequests to 512
This seems like a more sensible default after testing a bit.  May need local tuning.
2025-03-22 12:11:09 +01:00
Viktor Lofgren
42bd79a609 (crawler) Experimentally throttle the number of active retrievals to see how this affects the network performance
There's been some indications that request storms lead to buffer bloat and bad throughput.

This adds a configurable semaphore, by default permitting 100 active requests.
2025-03-22 11:50:37 +01:00
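The throttling idea in the commit above can be sketched with a counting semaphore. This is a minimal illustration only: the class `RequestThrottle` and its method names are hypothetical and not taken from the crawler's code; only the idea of a configurable semaphore with a default of 100 permits comes from the commit message.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not the crawler's actual class.
class RequestThrottle {
    private final Semaphore permits;

    RequestThrottle(int maxActiveRequests) {
        this.permits = new Semaphore(maxActiveRequests);
    }

    // Runs the fetch while holding a permit; callers block once the limit is reached.
    <T> T withPermit(Supplier<T> fetch) {
        permits.acquireUninterruptibly();
        try {
            return fetch.get();
        } finally {
            // Always hand the permit back, even if the fetch throws.
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A caller would wrap each retrieval, for example `new RequestThrottle(100).withPermit(() -> fetchPage(url))`, where `fetchPage` stands in for whatever performs the request.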
Viktor Lofgren
b91c1e528a (favicon) Send dummy svg result when image is missing
This prevents the browser from rendering a "broken image" in this scenario.
2025-03-21 15:15:14 +01:00
Viktor Lofgren
b1130d7a04 (domainstatedb) Allow creation of disconnected db
This is required for executor services that do not have crawl data to still be able to initialize.
2025-03-21 14:59:36 +01:00
Viktor Lofgren
8364bcdc97 (favicon) Add favicons to the matchograms 2025-03-21 14:30:40 +01:00
Viktor Lofgren
626cab5fab (favicon) Add favicon to site overview 2025-03-21 14:15:23 +01:00
Viktor Lofgren
cfd4712191 (favicon) Add capability for fetching favicons 2025-03-21 13:38:58 +01:00
Viktor Lofgren
9f18ced73d (crawler) Improve deferred task behavior 2025-03-18 12:54:18 +01:00
Viktor Lofgren
18e91269ab (crawler) Improve deferred task behavior 2025-03-18 12:25:22 +01:00
Viktor Lofgren
e315ca5758 (search) Change icon for small web filter
The previous icon was of an irregular size and shifted the layout in an unaesthetic way.
2025-03-17 12:07:34 +01:00
Viktor Lofgren
3ceea17c1d (search) Adjustments to device detection in CSS
Use pointer:fine media query to better distinguish between mobile devices and PCs with a window in portrait orientation.

With this, we never show mobile filtering functionality on mobile; and never show the touch-inaccessible minimized sidebar on mobile.
2025-03-17 12:04:34 +01:00
Viktor Lofgren
b34527c1a3 (search) Add small web filter for new UI 2025-03-17 11:39:19 +01:00
Viktor Lofgren
185bf28fca (crawler) Correct issue leading to parquet files not being correctly preconverted
Path.endsWith("str") != String.endsWith(".str")
2025-03-10 13:48:12 +01:00
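The pitfall named in the commit body above is that `Path.endsWith(String)` compares whole path elements, while `String.endsWith` compares characters. A small sketch, using the `parquet` suffix as an illustrative stand-in for the commit's `str` placeholder (the class and method names are hypothetical):

```java
import java.nio.file.Path;

// Hypothetical demo class; shows the two different notions of "ends with".
class PathEndsWithDemo {
    // True only if the final path element is exactly "parquet".
    static boolean byPathElement(String path) {
        return Path.of(path).endsWith("parquet");
    }

    // True if the file name text ends with the ".parquet" suffix.
    static boolean byFileSuffix(String path) {
        return Path.of(path).getFileName().toString().endsWith(".parquet");
    }
}
```

For a file such as `crawl/data.parquet`, `byPathElement` is false and `byFileSuffix` is true, which is the kind of mismatch the commit describes.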
Viktor Lofgren
78cc25584a (crawler) Add error logging when entering bad path for historical crawl data 2025-03-10 13:38:40 +01:00
Viktor Lofgren
62ba30bacf (common) Log info about metrics server 2025-03-10 13:12:39 +01:00
Viktor Lofgren
3bb84eb206 (common) Log info about metrics server 2025-03-10 13:03:48 +01:00
Viktor Lofgren
be7d13ccce (crawler) Correct task execution logic in crawler
The old behavior would flag domains as pending too soon, leading to them being omitted from execution if they were not immediately available to run.
2025-03-09 13:47:51 +01:00
Viktor Lofgren
8c088a7c0b (crawler) Remove custom thread factory
This was causing issues, and not really doing much of benefit.
2025-03-09 11:50:52 +01:00
Viktor Lofgren
ea9a642b9b (crawler) More effective task scheduling in the crawler
This should hopefully allow more threads to be busy
2025-03-09 11:44:59 +01:00
Viktor Lofgren
27f528af6a (search) Fix "Remove Javascript" toggle
A bug was introduced at some point where the special keyword for filtering on javascript was changed to special:scripts, from js:true/js:false.

Solves issue #155
2025-02-28 12:03:04 +01:00
Viktor Lofgren
20ca41ec95 (processed model) Use String columns instead of Txt columns for SlopDocumentRecord
It's very likely TxtStringColumn is the culprit of the bug seen in https://github.com/MarginaliaSearch/MarginaliaSearch/issues/154 where the wrong URL was shown for a search result.
2025-02-24 11:41:51 +01:00
Viktor Lofgren
7671f0d9e4 (search) Display message when no search results are found 2025-02-24 11:15:55 +01:00
Viktor Lofgren
44d6bc71b7 (assistant) Migrate to Jooby framework 2025-02-15 13:28:12 +01:00
Viktor Lofgren
9d302e2973 (assistant) Migrate to Jooby framework 2025-02-15 13:26:04 +01:00
Viktor Lofgren
f553701224 (assistant) Migrate to Jooby framework 2025-02-15 13:21:48 +01:00
Viktor Lofgren
f076d05595 (deps) Upgrade slf4j to latest 2025-02-15 12:50:16 +01:00
Viktor Lofgren
b513809710 (*) Stopgap fix for metrics server initialization errors bringing down services 2025-02-14 17:09:48 +01:00
Viktor Lofgren
7519b28e21 (search) Correct exception from misbehaving bots feeding invalid urls 2025-02-14 17:05:24 +01:00
Viktor Lofgren
3eac4dd57f (search) Correct exception in error handler when page is missing 2025-02-14 17:00:21 +01:00
Viktor Lofgren
4c2810720a (search) Add redirect handler for full URLs in the /site endpoint 2025-02-14 16:31:11 +01:00
Viktor Lofgren
8480ba8daa (live-capture) Code cleanup 2025-02-04 14:05:36 +01:00
Viktor Lofgren
fbba392491 (live-capture) Send a UA-string from the browserless fetcher as well
The change also introduces a somewhat convoluted wiremock test to intercept and verify that these headers are in fact sent
2025-02-04 13:36:49 +01:00
Viktor Lofgren
530eb35949 (update-rss) Do not fail the feed fetcher control actor if it takes a long time to complete. 2025-02-03 11:35:32 +01:00
Viktor Lofgren
c2dd2175a2 (search) Add new query expansion rule contracting WORD NUM pairs into WORD-NUM and WORDNUM 2025-02-01 13:13:30 +01:00
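As a rough illustration of the expansion rule in the commit above, a WORD NUM pair can be turned into its contracted variants as follows. The class `WordNumExpansion` is hypothetical, and the real query expansion code may generate and rank variants differently.

```java
import java.util.List;

// Hypothetical sketch of the WORD NUM contraction; not the project's query parser.
class WordNumExpansion {
    static List<String> expand(String word, String num) {
        boolean numeric = !num.isEmpty() && num.chars().allMatch(Character::isDigit);
        if (!numeric) {
            // Not a WORD NUM pair, so leave the terms as they were.
            return List.of(word + " " + num);
        }
        // Original pair, hyphenated form, and fused form.
        return List.of(word + " " + num, word + "-" + num, word + num);
    }
}
```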
Viktor Lofgren
b8581b0f56 (crawler) Safe sanitization of headers during warc->slop conversion
The warc->slop converter was rejecting some items because they had headers that were representable in the Warc code's MessageHeader map implementation, but illegal in the HttpHeaders' implementation.

Fixing this by manually filtering these out.  Ostensibly the constructor has a filtering predicate, but this annoyingly runs too late and fails to prevent the problem.
2025-01-31 12:47:42 +01:00
Viktor Lofgren
2ea34767d8 (crawler) Use the response URL when resolving relative links
The crawler was incorrectly using the request URL as the base URL when resolving relative links.  This caused problems when encountering redirects.

For example, if we fetch /log, which redirects to /log/, and find links to foo/ and bar/, these would resolve to /foo and /bar, and not /log/foo and /log/bar.
2025-01-31 12:40:13 +01:00
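The difference between resolving against the request URL and the response URL can be reproduced with `java.net.URI`. This is a sketch only: `example.com` and the class name are placeholders, and the crawler's own URL handling is not shown in the commit.

```java
import java.net.URI;

// Hypothetical demo of base-URL resolution; example.com is a placeholder host.
class RelativeLinkDemo {
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }
}
```

Resolving `foo/` against the request URL `https://example.com/log` yields `https://example.com/foo/`, while resolving it against the post-redirect URL `https://example.com/log/` yields `https://example.com/log/foo/`.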
Viktor Lofgren
e9af838231 (actor) Fix migration actor final steps 2025-01-30 11:48:21 +01:00
Viktor Lofgren
ae0cad47c4 (actor) Utility method for getting a json prototype for actor states
If we can hook this into the control gui somehow, it'll make for a nice QOL upgrade when manually interacting with the actors.
2025-01-29 15:20:25 +01:00
Viktor Lofgren
5fbc8ef998 (misc) Tidying 2025-01-29 15:17:04 +01:00
Viktor Lofgren
32c6dd9e6a (actor) Delete old data in the migration actor 2025-01-29 14:51:46 +01:00
Viktor Lofgren
6ece6a6cfb (actor) Improve resilience for the migration actor 2025-01-29 14:43:09 +01:00
Viktor Lofgren
39cd1c18f8 Automatically run npm install tailwindcss@3 via setup.sh, as the new default version of the package is incompatible with the project 2025-01-29 12:21:08 +01:00
Viktor
eb65daaa88 Merge pull request #151 from Lionstiger/master
fix small grammar error in footerLegal.jte
2025-01-28 21:49:50 +01:00
Viktor
0bebdb6e33 Merge branch 'master' into master 2025-01-28 21:49:36 +01:00
Viktor Lofgren
1e50e392c6 (actor) Improve logging and error handling for data migration actor 2025-01-28 15:34:36 +01:00
Viktor Lofgren
fb673de370 (crawler) Change the header 'User-agent' to 'User-Agent' 2025-01-28 15:34:16 +01:00
Viktor Lofgren
eee73ab16c (crawler) Be more lenient when performing a domain probe 2025-01-28 15:24:30 +01:00
Viktor Lofgren
5354e034bf (search) Minor grammar fix 2025-01-27 18:36:31 +01:00
Magnus Wulf
72384ad6ca fix small grammar error 2025-01-27 15:04:57 +01:00
Viktor Lofgren
a2b076f9be (converter) Add progress tracking for big domains in converter 2025-01-26 18:03:59 +01:00
Viktor Lofgren
c8b0a32c0f (crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams 2025-01-26 15:40:17 +01:00
Viktor Lofgren
f0d74aa3bb (converter) Fix close() ordering to prevent converter crash 2025-01-26 14:47:36 +01:00
Viktor Lofgren
74a1f100f4 (converter) Refactor to remove CrawledDomainReader and move its functionality into SerializableCrawlDataStream 2025-01-26 14:46:50 +01:00
Viktor Lofgren
eb049658e4 (converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents
Refactor to do this without introducing additional copies
2025-01-26 14:28:53 +01:00
Viktor Lofgren
db138b2a6f (converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents 2025-01-26 14:25:57 +01:00
Viktor Lofgren
1673fc284c (converter) Reduce lock contention in converter by separating the processing of full and simple-track domains 2025-01-26 13:21:46 +01:00
Viktor Lofgren
503ea57d5b (converter) Reduce lock contention in converter by separating the processing of full and simple-track domains 2025-01-26 13:18:14 +01:00
Viktor Lofgren
18ca926c7f (converter) Truncate excessively long strings in SentenceExtractor, malformed data was effectively DOS:ing the converter 2025-01-26 12:52:54 +01:00
Viktor Lofgren
db99242db2 (converter) Adding some logging around the simple processing track to investigate an issue with the converter stalling 2025-01-26 12:02:00 +01:00
Viktor Lofgren
2b9d2985ba (doc) Update readme with up-to-date install instructions. 2025-01-24 18:51:41 +01:00
Viktor Lofgren
eeb6ecd711 (search) Make it clearer that the affiliate marker applies to the result, and not the search engine's relation to the result. 2025-01-24 18:50:00 +01:00
Viktor Lofgren
1f58aeadbf (build) Upgrade JIB 2025-01-24 18:49:28 +01:00
Viktor Lofgren
3d68be64da (crawler) Add default CT when it's missing for icons 2025-01-22 13:55:47 +01:00
Viktor Lofgren
668f3b16ef (search) Redirect ^/site/$ to /site 2025-01-22 13:35:18 +01:00
Viktor Lofgren
98a340a0d1 (crawler) Add favicon data to domain state db in its own table 2025-01-22 11:41:20 +01:00
Viktor Lofgren
8862100f7e (crawler) Improve logging and error handling 2025-01-21 21:44:21 +01:00
Viktor Lofgren
274941f6de (crawler) Smarter parquet->slop crawl data migration 2025-01-21 21:26:12 +01:00
Viktor Lofgren
abec83582d Fix refactoring gore 2025-01-21 15:08:04 +01:00
Viktor Lofgren
569520c9b6 (index) Add manual adjustments for rankings based on domain 2025-01-21 15:07:43 +01:00
Viktor Lofgren
088310e998 (converter) Improve simple processing performance
There was a regression introduced in the recent slop migration changes in  the performance of the simple conversion track.  This reverts the issue.
2025-01-21 14:13:33 +01:00
Viktor
270cab874b Merge pull request #134 from MarginaliaSearch/slop-crawl-data-spike
Store crawl data in slop instead of parquet
2025-01-21 13:34:22 +01:00
Viktor Lofgren
4c74e280d3 (crawler) Fix urlencoding in sitemap fetcher 2025-01-21 13:33:35 +01:00
Viktor Lofgren
5b347e17ac (crawler) Automatically migrate to slop from parquet when crawling 2025-01-21 13:33:14 +01:00
Viktor Lofgren
55d6ab933f Merge branch 'master' into slop-crawl-data-spike 2025-01-21 13:32:58 +01:00
Viktor Lofgren
43b74e9706 (crawler) Fix exception handler and resource leak in WarcRecorder 2025-01-20 23:45:28 +01:00
Viktor Lofgren
579a115243 (crawler) Reduce log spam from error handling in new sitemap fetcher 2025-01-20 23:17:13 +01:00
Viktor
2c67f50a43 Merge pull request #150 from MarginaliaSearch/httpclient-in-crawler
Reduce the use of 3rd party code in the crawler
2025-01-20 19:35:30 +01:00
Viktor Lofgren
78a958e2b0 (crawler) Fix broken test that started failing after the search engine moved to a new domain 2025-01-20 18:52:14 +01:00
Viktor Lofgren
4e939389b2 (crawler) New Jsoup based sitemap parser 2025-01-20 14:37:44 +01:00
Viktor Lofgren
e67a9bdb91 (crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead. 2025-01-19 15:07:11 +01:00
Viktor Lofgren
567e4e1237 (crawler) Fast detection and bail-out for crawler traps
Improve logging and exclude robots.txt from this logic.
2025-01-18 15:28:54 +01:00
Viktor Lofgren
4342e42722 (crawler) Fast detection and bail-out for crawler traps
Nepenthes has been doing the rounds on social media. This adds an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly.  Out-of-the-box crawl limits will also deal with this type of attack, but this fix is faster.
2025-01-17 13:02:57 +01:00
Viktor Lofgren
bc818056e6 (run) Fix templates for mariadb
Apparently the docker image contract changed at some point, and now we should spawn mariadbd and not mysqld; mariadb-admin and not mysqladmin.
2025-01-16 15:27:02 +01:00
Viktor Lofgren
de2feac238 (chore) Upgrade jib from 3.4.3 to 3.4.4 2025-01-16 15:10:45 +01:00
Viktor Lofgren
1e770205a5 (search) Dyslexia fix 2025-01-12 20:40:14 +01:00
Viktor
e44ecd6d69 Merge pull request #149 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2025-01-12 20:38:36 +01:00
Viktor
5b93a0e633 Update ROADMAP.md 2025-01-12 20:38:11 +01:00
Viktor
08fb0e5efe Update ROADMAP.md 2025-01-12 20:37:43 +01:00
Viktor
bcf67782ea Update ROADMAP.md 2025-01-12 20:37:09 +01:00
Viktor Lofgren
ef3f175ede (search) Don't clobber the search query URL with default values 2025-01-10 15:57:30 +01:00
Viktor Lofgren
bbe4b5d9fd Revert experimental changes 2025-01-10 15:52:02 +01:00
Viktor Lofgren
c67a635103 (search, experimental) Add a few debugging tracks to the search UI 2025-01-10 15:44:44 +01:00
Viktor Lofgren
20b24133fb (search, experimental) Add a few debugging tracks to the search UI 2025-01-10 15:34:48 +01:00
Viktor Lofgren
f2567677e8 (index-client) Clean up index client code
Improve error handling.  This should be a relatively rare case, but we don't want one bad index partition to blow up the entire query.
2025-01-10 15:17:07 +01:00
Viktor Lofgren
bc2c2061f2 (index-client) Clean up index client code
This should have the rpc stream reception be performed in parallel in separate threads, rather than blocking sequentially in the main thread, hopefully giving a slight performance boost.
2025-01-10 15:14:42 +01:00
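The two index-client commits above describe receiving results from partitions in parallel and tolerating one bad partition. A minimal sketch of that pattern, assuming each partition can be modelled as a `Supplier` of result strings (the class name and this modelling are hypothetical; the real client works on gRPC streams):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical sketch; partitions are modelled as plain suppliers.
class ParallelPartitionQuery {
    @SafeVarargs
    static List<String> queryAll(Supplier<List<String>>... partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.length));
        try {
            // Start every partition query before waiting on any of them.
            List<Future<List<String>>> pending = new ArrayList<>();
            for (Supplier<List<String>> partition : partitions) {
                Callable<List<String>> task = partition::get;
                pending.add(pool.submit(task));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : pending) {
                try {
                    merged.addAll(future.get());
                } catch (ExecutionException e) {
                    // One failing partition is skipped rather than failing the whole query.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

Results are merged in submission order, so a failing partition simply leaves a gap in the combined list.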
Viktor Lofgren
1c7f5a31a5 (search) Further reduce the number of db queries by adding more caching to DbDomainQueries. 2025-01-10 14:17:29 +01:00
Viktor Lofgren
59a8ea60f7 (search) Further reduce the number of db queries by adding more caching to DbDomainQueries. 2025-01-10 14:15:22 +01:00
Viktor Lofgren
aa9b1244ea (search) Reduce the number of db queries a bit by caching data that doesn't change too often 2025-01-10 13:56:04 +01:00
Viktor Lofgren
2d17233366 (search) Reduce the number of db queries a bit by caching data that doesn't change too often 2025-01-10 13:53:56 +01:00
Viktor Lofgren
b245cc9f38 (search) Reduce the number of db queries a bit by caching data that doesn't change too often 2025-01-10 13:46:19 +01:00
Viktor Lofgren
6614d05bdf (db) Make db pool size configurable 2025-01-09 20:20:51 +01:00
Viktor Lofgren
55aeb03c4a (feeds) Replace rssreader based parsing with a custom jsoup based rss parser
This solves some issues with the rssreader based parser, which was very picky about the XML being valid.  Jsoup is much more lenient when parsing malformed XML.
2025-01-09 18:29:55 +01:00
Viktor Lofgren
faa589962f (live-capture) Browserless now requires a token 2025-01-09 14:51:11 +01:00
Viktor Lofgren
c7edd6b39f (live-capture) Browserless now requires a token 2025-01-09 14:46:05 +01:00
Viktor Lofgren
79da622e3b (search) Update front page with new banner about move 2025-01-08 21:38:19 +01:00
Viktor Lofgren
3da8337ba6 (feeds) Add system property for exporting fetched feeds to a slop table for debugging 2025-01-08 20:49:16 +01:00
Viktor Lofgren
a32d230f0a (special) Trigger deployment 2025-01-08 20:07:54 +01:00
Viktor Lofgren
3772bfd387 (query) Fix handling of optional ranking parameters 2025-01-08 17:11:22 +01:00
Viktor Lofgren
02a7900d1a (search) Correct search-in-title toggle in search UI 2025-01-08 16:51:10 +01:00
Viktor Lofgren
a1fb92468f (refac) Remove ResultRankingParameters, QueryLimits class and use protobuf classes directly instead
This is primarily to make the code a bit easier to reason about, and will reduce the level of indirection and data copying in the search-service->query-service->index-service communication chain.
2025-01-08 16:15:57 +01:00
Viktor Lofgren
b7f0a2a98e (search-service) Fix metrics for errors and request times
This was previously in place, but broke during the jooby migration.
2025-01-08 14:10:43 +01:00
Viktor Lofgren
5fb76b2e79 (search-service) Fix metrics for errors and request times
This was previously in place, but broke during the jooby migration.
2025-01-08 14:06:03 +01:00
Viktor Lofgren
ad8c97f342 (search-service) Begin replacement of the crawl queue mechanism with node_affinity flagging
Previously a special db table was used to hold domains slated for crawling, but this is deprecated, and instead now each domain has a node_affinity flag that decides its indexing state, where a value of -1 indicates it shouldn't be crawled, a value of 0 means it's slated for crawling by the next index partition to be crawled, and a positive value means it's assigned to an index partition.

The change set also adds a test case validating the modified behavior.
2025-01-08 13:25:56 +01:00
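The `node_affinity` semantics described in the commit above can be summarized in code. This sketch is an interpretation of the commit message only: the class is hypothetical, and treating a positive value as the partition number itself is an assumption.

```java
// Hypothetical interpretation of node_affinity values, based on the commit message.
class NodeAffinity {
    static String describe(int nodeAffinity) {
        if (nodeAffinity == -1) {
            return "excluded from crawling";
        }
        if (nodeAffinity == 0) {
            return "queued for the next index partition to be crawled";
        }
        if (nodeAffinity > 0) {
            // Assumption: the positive value identifies the partition.
            return "assigned to index partition " + nodeAffinity;
        }
        // Values below -1 are not described in the commit message.
        throw new IllegalArgumentException("undocumented node_affinity: " + nodeAffinity);
    }
}
```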
Viktor Lofgren
dc1b6373eb (search-service) Clean up readme 2025-01-08 13:04:39 +01:00
Viktor Lofgren
983d6d067c (search-service) Add indexing indicator to sibling domains listing 2025-01-08 12:58:34 +01:00
Viktor Lofgren
a84a06975c (ranking-params) Add disable penalties flag to ranking params
This will help debugging ranking issues.  Later it may be added to some filters.
2025-01-08 00:16:49 +01:00
Viktor Lofgren
d2864c13ec (query-params) Add additional permitted query params 2025-01-07 20:21:44 +01:00
Viktor Lofgren
03ba53ce51 (legacy-search) Update nav bar with correct links 2025-01-07 17:44:52 +01:00
Viktor Lofgren
d4a6684931 (specialization) Soften length requirements for wiki-specialized documents (incl. cppreference) 2025-01-07 15:53:25 +01:00
Viktor
6f0485287a Merge pull request #145 from MarginaliaSearch/cppreference_fixes
Cppreference fixes
2025-01-07 15:43:19 +01:00
Viktor Lofgren
59e2dd4c26 (specialization) Soften length requirements for wiki-specialized documents (incl. cppreference) 2025-01-07 15:41:30 +01:00
Viktor Lofgren
ca1807caae (specialization) Add new specialization for cppreference.com
Give this reference website some synthetically generated tokens to improve the likelihood of a good match.
2025-01-07 15:41:05 +01:00
Viktor Lofgren
26c20e18ac (keyword-extraction) Soften constraints on keyword patterns, allowing for longer segmented words 2025-01-07 15:20:50 +01:00
Viktor Lofgren
7c90b6b414 (query) Don't blindly make tokens containing a colon into a non-ranking advice term 2025-01-07 15:18:05 +01:00
Viktor Lofgren
b63c54c4ce (search) Update opensearch.xml to point to non-redirecting domains. 2025-01-07 00:23:09 +01:00
Viktor Lofgren
fecd2f4ec3 (deploy) Add legacy search service to deploy script 2025-01-07 00:21:13 +01:00
Viktor Lofgren
39e420de88 (search) Add wayback machine link to siteinfo 2025-01-06 20:33:10 +01:00
Viktor Lofgren
dc83619861 (rssreader) Further suppress logging 2025-01-06 20:20:37 +01:00
Viktor Lofgren
87d1c89701 (search) Add listing of sibling subdomains to site overview 2025-01-06 20:17:36 +01:00
Viktor Lofgren
a42a7769e2 (legacy-search) Remove legacy paperdoll class 2025-01-06 20:17:36 +01:00
Viktor
202bda884f Update readme.md
Add note about installing tailwindcss via npm
2025-01-06 18:35:13 +01:00
Viktor Lofgren
2315fdc731 (search) Vendor rssreader and modify it to be able to consume the nlnet atom feed
Also dial down the logging a bit for the rssreader package.
2025-01-06 17:58:50 +01:00
Viktor Lofgren
b5469bd8a1 (search) Turn relative feed URLs absolute when dealing with RSS/Atom item URLs 2025-01-06 16:56:24 +01:00
Viktor Lofgren
6a6318d04c (search) Add separate websiteUrl property to legacy service 2025-01-06 16:26:08 +01:00
Viktor Lofgren
55933f8d40 (search) Ensure we respect old URL contracts
/explore/random should be equivalent to /explore
2025-01-06 16:20:53 +01:00
Viktor
be6382e0d0 Merge pull request #127 from MarginaliaSearch/serp-redesign
Web UI redesign
2025-01-06 16:08:14 +01:00
Viktor Lofgren
45e771f96b (api) Update the / API redirect to the new documentation stub. 2025-01-06 16:07:32 +01:00
Viktor Lofgren
8dde502cc9 Merge branch 'master' into serp-redesign 2025-01-05 23:33:35 +01:00
Viktor Lofgren
3e66767af3 (search) Adjust query parsing to trim tokens in quoted search terms
Quoted search queries that contained keywords with possessive 's endings were not returning any results, as the index does not retain that suffix, and the query parser was not stripping it away in this code path.

This solves issue #143.
2025-01-05 23:33:09 +01:00
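The fix above amounts to applying the same suffix trimming to quoted terms as to unquoted ones. A minimal sketch of trimming a possessive ending, assuming a plain ASCII apostrophe (the class and method are hypothetical, and the real parser likely handles more cases):

```java
// Hypothetical sketch of possessive trimming; not the project's query parser.
class QuotedTermTrimmer {
    static String trimPossessive(String token) {
        if (token.length() > 2 && token.endsWith("'s")) {
            // Drop the 's suffix, which the index does not retain.
            return token.substring(0, token.length() - 2);
        }
        return token;
    }
}
```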
Viktor Lofgren
9ec9d1b338 Merge branch 'master' into serp-redesign 2025-01-05 21:10:20 +01:00
Viktor Lofgren
dcad0d7863 (search) Tweak token formation. 2025-01-05 21:01:09 +01:00
Viktor Lofgren
94e1aa0baf (search) Tweak token formation to still break apart emails in brackets. 2025-01-05 20:55:44 +01:00
Viktor Lofgren
b62f043910 (search) Adjust token formation rules to be more lenient to C++ and PHP code.
This addresses Issue #142
2025-01-05 20:50:27 +01:00
Viktor Lofgren
6ea22d0d21 (search) Update front page with work-in-progress note 2025-01-05 19:08:02 +01:00
Viktor Lofgren
8c69dc31b8 Merge branch 'master' into serp-redesign 2025-01-05 18:52:51 +01:00
Viktor Lofgren
00734ea87f (search) Add hover text for matchogram 2025-01-05 18:50:44 +01:00
Viktor Lofgren
3009713db4 (search) Fix broken tests 2025-01-05 18:50:27 +01:00
Viktor
9b2ceaf37c Merge pull request #141 from MarginaliaSearch/vlofgren-patch-1
Update FUNDING.yml
2025-01-05 18:40:20 +01:00
Viktor
8019c2ce18 Update FUNDING.yml 2025-01-05 18:40:06 +01:00
Viktor Lofgren
a9e312b8b1 (service) Add links to marginalia-search.com where appropriate 2025-01-05 16:56:38 +01:00
Viktor Lofgren
4da3563d8a (service) Clean up exceptions when requestScreengrab is not available 2025-01-04 14:45:51 +01:00
Viktor Lofgren
48d0a3089a (service) Improve logging around grpc
This change adds a marker for the gRPC-specific logging, as well as improves the clarity and meaningfulness of the log messages.
2025-01-02 20:40:53 +01:00
Viktor Lofgren
594df64b20 (domain-info) Use appropriate sqlite database when fetching feed status 2025-01-02 20:20:36 +01:00
Viktor Lofgren
06efb5abfc Merge branch 'master' into serp-redesign 2025-01-02 18:42:12 +01:00
Viktor Lofgren
78eb1417a7 (service) Only block on SingleNodeChannelPool creation in QueryClient
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.

This should make most services start faster as a result.
2025-01-02 18:42:01 +01:00
Viktor Lofgren
8c8f2ad5ee (search) Add an indicator when a link has a feed in the similar/linked domains views 2025-01-02 18:11:57 +01:00
Viktor Lofgren
f71e79d10f (search) Add a copy of the old UI as a separate service, search-service-legacy 2025-01-02 18:03:42 +01:00
Viktor Lofgren
1b27c5cf06 (search) Add a copy of the old UI as a separate service, search-service-legacy 2025-01-02 18:02:17 +01:00
Viktor Lofgren
67edc8f90d (domain-info) Only flag domains with rss feed items as having a feed 2025-01-02 17:41:52 +01:00
Viktor Lofgren
5f576b7d0c (query-parser) Strip leading underlines
This addresses issue #140, where __builtin_ffs gives no results.
2025-01-02 14:39:03 +01:00
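Stripping leading underscores from a query term, as the commit above describes, can be sketched in one line. The class name is hypothetical, and the actual query parser may apply the rule at a different stage.

```java
// Hypothetical sketch; removes only leading underscores, interior ones are kept.
class LeadingUnderscoreStripper {
    static String strip(String term) {
        return term.replaceFirst("^_+", "");
    }
}
```

With this rule, the `__builtin_ffs` query from issue #140 becomes `builtin_ffs`.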
Viktor Lofgren
8b05c788fd (Search) Enable gzip compression of responses 2025-01-01 18:34:42 +01:00
Viktor Lofgren
236f033bc9 (Search) Reduce whitespace in explore view on all resolutions 2025-01-01 18:23:35 +01:00
Viktor Lofgren
510fc75121 (Search) Reduce whitespace in explorer view on mobile 2025-01-01 18:18:09 +01:00
Viktor Lofgren
0376f2e6e3 Merge branch 'master' into serp-redesign
# Conflicts:
#	code/services-application/search-service/resources/templates/search/index/index.hdb
2025-01-01 18:15:09 +01:00
Viktor Lofgren
0b65164f60 (chore) Fix broken test 2025-01-01 18:06:29 +01:00
Viktor Lofgren
9be477de33 (domain-info) Add a feed flag to domain info
This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.
2025-01-01 18:02:33 +01:00
Viktor Lofgren
84f55b84ff (search) Add experimental OPML-export function for feed subscriptions 2025-01-01 17:17:54 +01:00
Viktor Lofgren
ab5c30ad51 (search) Fix site info view for completely unknown domains
Also correct the DbDomainQueries.getDomainId so that it throws NoSuchElementException when domain id is missing, and not UncheckedExecutionException via Cache.
2025-01-01 16:29:01 +01:00
Viktor Lofgren
0c839453c5 (search) Fix crosstalk link 2025-01-01 16:09:19 +01:00
Viktor Lofgren
5e4c5d03ae (search) Clean up breakpoints in site overview 2025-01-01 16:06:08 +01:00
Viktor Lofgren
710af4999a (feed-fetcher) Add &quot; entity mapping in feed fetcher 2025-01-01 15:45:17 +01:00
Viktor Lofgren
a5b0a1ae62 (search) Move linked/similar domains to a popover style menu on mobile
Fix scroll
2025-01-01 15:37:35 +01:00
Viktor Lofgren
e9f71ee39b (search) Move linked/similar domains to a popover style menu on mobile 2025-01-01 15:23:25 +01:00
Viktor Lofgren
baeb4a46cd (search) Reintroduce query rewriting for recipes, add rules for wikis and forums 2024-12-31 16:05:00 +01:00
Viktor Lofgren
5e2a8e9f27 (deploy) Add capability of adding tags to deploy script 2024-12-31 16:04:13 +01:00
Viktor
cc1a5bdf90 Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-12-31 14:41:02 +01:00
Viktor
7f7b1ffaba Update ROADMAP.md 2024-12-31 14:40:34 +01:00
Viktor Lofgren
0ea8092350 (search) Add link promoting the redesign beta 2024-12-30 15:47:13 +01:00
Viktor Lofgren
483d29497e (deploy) Add hashbang to deploy script 2024-12-30 15:47:13 +01:00
Viktor Lofgren
bae44497fe (crawler) Add a new system property crawler.maxFetchSize
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
Viktor Lofgren
0d59202aca (crawler) Do not remove W/-prefix on weak e-tags
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
Viktor Lofgren
0ca43f0c9c (live-crawler) Improve live crawler short-circuit logic
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch!  This makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
Viktor Lofgren
3bc99639a0 (feed-fetcher) Make feed fetcher requests conditional
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests.  On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.

A new table was added to the FeedDb to hold one etag per domain.

If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.

This completes the changes for Issue #136.
2024-12-27 15:10:15 +01:00
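A conditional request of the kind described above can be built with Java's `HttpClient` API. This is a sketch under assumptions: the class, the parameter names, and the way the stored ETag and database creation time are passed in are all hypothetical, and no request is actually sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch of building a conditional feed request.
class ConditionalFeedRequest {
    // HTTP dates use a fixed-width, English, GMT-based format.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    static String httpDate(long epochSeconds) {
        return HTTP_DATE.format(Instant.ofEpochSecond(epochSeconds));
    }

    static HttpRequest build(String feedUrl, String storedEtag, long dbCreatedEpochSeconds) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(feedUrl))
                // Cutoff date: the feed database's creation time.
                .header("If-Modified-Since", httpDate(dbCreatedEpochSeconds));
        if (storedEtag != null) {
            // The ETag is echoed back verbatim, including any W/ prefix.
            builder.header("If-None-Match", storedEtag);
        }
        return builder.GET().build();
    }
}
```

A server that supports validators can then answer `304 Not Modified` without a body when nothing has changed.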
Viktor Lofgren
927bc0b63c (live-crawler) Add Accept-Encoding: gzip to outbound requests
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.

The change addresses issue #136, save for making the fetcher's requests conditional.
2024-12-27 03:59:34 +01:00
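The decoding side of `Accept-Encoding: gzip` can be sketched with the JDK's `GZIPInputStream`. The class below is hypothetical and works on an in-memory body; the `gzip` helper exists only so the example can be exercised without a server.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of response body decoding; not the fetcher's actual code.
class GzipBodyDecoder {
    // Decompresses the body only when the server declared Content-Encoding: gzip.
    static byte[] decode(byte[] body, String contentEncoding) {
        if (contentEncoding == null || !contentEncoding.equalsIgnoreCase("gzip")) {
            return body;
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: compresses bytes the way a gzip-enabled server would.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```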
Viktor Lofgren
d968801dc1 (converter) Drop feed data from SlopDomainRecord
Also remove feed extraction from converter.  This is the crawler's responsibility now.
2024-12-26 17:57:08 +01:00
Viktor Lofgren
89db69d360 (crawler) Correct feed URLs in domain state db
Discovered feed URLs were given a double slash after their domain name in the DB.  This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
2024-12-26 15:18:31 +01:00
Viktor Lofgren
895cee7004 (crawler) Improved feed discovery, new domain state db per crawlset
Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided.  To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.

Solves issue #135
2024-12-26 15:05:52 +01:00
Viktor Lofgren
4bb71b8439 (crawler) Correct content type probing to only run on URLs that are suspected to be binary 2024-12-26 14:26:23 +01:00
Viktor Lofgren
e4a41f7dd1 (crawler) Correct content type probing to only run on URLs that are suspected to be binary 2024-12-26 14:13:17 +01:00
Viktor
69ad6287b1 Update ROADMAP.md 2024-12-25 21:16:38 +00:00
Viktor Lofgren
81cdd6385d Add rendering tests for most major views
This will prevent accidentally deploying a broken search service
2024-12-25 15:22:26 +01:00
Viktor Lofgren
e76c42329f Correct dark mode for infobox in site focused search 2024-12-25 15:06:05 +01:00
Viktor Lofgren
e6ef4734ea Fix tests 2024-12-25 15:05:41 +01:00
Viktor Lofgren
41a59dcf45 (feed) Sanitize illegal HTML entities out of the feed XML before parsing 2024-12-25 14:53:28 +01:00
Viktor Lofgren
df4bc1d7e9 Add update time to front page subscriptions 2024-12-25 14:42:00 +01:00
Viktor Lofgren
2b222efa75 Merge branch 'master' into serp-redesign 2024-12-25 14:22:42 +01:00
Viktor Lofgren
94d4d2edb7 (live-crawler) Add refresh date to feeds API
For now this is just the ctime for the feeds db.  We may want to store this per-record in the future.
2024-12-25 14:20:48 +01:00
Viktor Lofgren
7ae19a92ba (deploy) Improve deployment script to allow specification of partitions 2024-12-24 11:16:15 +01:00
Viktor Lofgren
56d14e56d7 (live-crawler) Improve LiveCrawlActor resilience to FeedService outages 2024-12-23 23:33:54 +01:00
Viktor Lofgren
a557c7ae7f (live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler 2024-12-23 23:31:03 +01:00
Viktor Lofgren
b66879ccb1 (feed) Add support for date discovery through atom:issued and atom:created
This is specifically to help parse monadnock.net's Atom feed.
2024-12-23 20:05:58 +01:00
Viktor Lofgren
f1b7157ca2 (deploy) Add basic linting ability to deployment script. 2024-12-23 16:21:29 +01:00
Viktor Lofgren
7622335e84 (deploy) Correct deploy script, set correct name for assistant 2024-12-23 15:59:02 +01:00
Viktor Lofgren
0da2047eae (live-capture) Correctly update processed count, disable poll rate adjustment based on freshness. 2024-12-23 15:56:27 +01:00
Viktor Lofgren
5ee4321110 (ci) Correct deploy script 2024-12-22 20:08:37 +01:00
Viktor Lofgren
9459b9933b (ci) Correct deploy script 2024-12-22 19:40:32 +01:00
Viktor Lofgren
87fb564f89 (ci) Add script for automatic deployment based on git tags 2024-12-22 19:24:54 +01:00
Viktor Lofgren
5ca8523220 (math) Reduce log error spam from null unit conversions 2024-12-21 18:51:45 +01:00
Viktor Lofgren
1118657ffd (system) Supply local IP to service discovery if multiFace is enabled 2024-12-19 22:20:19 +01:00
Viktor Lofgren
b1f970152d (system) To support configurations with multiple docker networks, bind to the "most local" interface.
Make the behavior optional.
2024-12-19 20:26:31 +01:00
Viktor Lofgren
e1783891ab (system) To support configurations with multiple docker networks, bind to the "most local" interface. 2024-12-19 20:18:57 +01:00
Viktor Lofgren
64d32471dd (deploy) Deploy executor test 2024-12-19 17:45:47 +01:00
Viktor Lofgren
232cc465d9 (deploy) Deploy executor test 2024-12-19 17:35:38 +01:00
Viktor Lofgren
8c963bd4ba (feeds) Remove Content-Encoding: gzip from feed fetcher
We don't support decompressing gzip, so this just gives us errors at this point should the server support it.
2024-12-18 22:23:44 +01:00
Viktor Lofgren
6a079c1c75 (feeds) Add per-domain throttling for feed fetcher. 2024-12-18 22:06:46 +01:00
Viktor Lofgren
2dc9f2e639 (feeds) Make feed XML parsing more lenient
... by consuming BOM markers and leading whitespace.
2024-12-18 17:18:41 +01:00
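A lenient pre-parse step of this kind could be sketched as follows. The class and method names are hypothetical and not taken from the repository; the sketch assumes the feed has already been decoded to a String, so a byte-order mark shows up as the character U+FEFF.

```java
// Hypothetical helper, not the project's actual code.
class FeedXmlCleaner {
    /** Drops leading BOM characters and whitespace so the XML prolog comes first. */
    static String stripLeadingNoise(String xml) {
        int start = 0;
        while (start < xml.length()) {
            char c = xml.charAt(start);
            if (c == '\uFEFF' || Character.isWhitespace(c)) {
                start++;          // skip BOM marker or whitespace
            } else {
                break;            // first real character reached
            }
        }
        return xml.substring(start);
    }
}
```

The cleaned string can then be handed to whichever XML parser is in use; strict parsers commonly reject content that precedes the XML declaration.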
Viktor Lofgren
b66fb9caf6 (feeds) Improve error handling in the feed fetcher. 2024-12-18 17:02:13 +01:00
Viktor Lofgren
6d18e6d840 (search) Add clustering to subscriptions view 2024-12-18 15:36:05 +01:00
Viktor Lofgren
2a3c63f209 (search) Exclude generated style.css from git 2024-12-18 15:24:31 +01:00
Viktor Lofgren
9f70cecaef (search) Add site subscription feature that puts RSS updates on the front page 2024-12-18 15:24:31 +01:00
Viktor Lofgren
47e58a21c6 Refactor documentBody method and ContentType charset handling
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
2024-12-17 17:11:37 +01:00
Viktor Lofgren
3714104976 Add loader for slop data in converter.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data.  This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00
Viktor Lofgren
f6f036b9b1 Switch to new Slop format for crawl data storage and processing.
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
2024-12-15 19:34:03 +01:00
Viktor Lofgren
b510b7feb8 Spike for storing crawl data in slop instead of parquet
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds.  On disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
c08203e2ed (search) Prevent paperdoll from being run as a test by CI 2024-12-14 20:35:57 +01:00
Viktor Lofgren
86497fd32f (site-info) Mobile layout fix 2024-12-14 16:19:56 +01:00
Viktor Lofgren
3b998573fd Adjust colors on dark mode for site overview 2024-12-13 21:51:25 +01:00
Viktor Lofgren
e161882ec7 (search) Fix layout for light mode 2024-12-13 21:47:29 +01:00
Viktor Lofgren
357f349e30 (search) Table layout fixes for dictionary lookup 2024-12-13 21:47:08 +01:00
Viktor Lofgren
e4769f541d (search) Sort and deduplicate search results for better relevance.
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
2024-12-13 21:47:08 +01:00
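The idea described above can be sketched roughly as below. This is an illustration under assumptions: the `Result` record, the `dedupKey` field, and the exact ranking of HTTPS versus raw-IP hosts are hypothetical, and the raw-IP check only covers IPv4.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ResultDeduplicator {
    /** Hypothetical result type: a URL plus a key that is equal for duplicate documents. */
    record Result(String url, String dedupKey) {}

    /** Lower is better: HTTPS beats HTTP, named domains beat raw IPv4 hosts. */
    static int rank(Result r) {
        String host = URI.create(r.url()).getHost();
        boolean rawIp = host != null && host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
        int rank = 0;
        if (!r.url().startsWith("https://")) rank += 2;
        if (rawIp) rank += 1;
        return rank;
    }

    static List<Result> deduplicate(List<Result> results) {
        // Pass 1: pick the preferred representative for each duplicate group.
        Map<String, Result> best = new HashMap<>();
        for (Result r : results) {
            best.merge(r.dedupKey(), r, (old, cur) -> rank(cur) < rank(old) ? cur : old);
        }
        // Pass 2: emit survivors in their original presentation order.
        List<Result> out = new ArrayList<>();
        for (Result r : results) {
            if (best.get(r.dedupKey()) == r) out.add(r);
        }
        return out;
    }
}
```

Choosing the winner in one pass and filtering in a second keeps the ranking order intact, so the preference rules only decide which duplicate survives, not where it appears.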
Viktor Lofgren
2a173e2861 (search) Dark Mode 2024-12-13 21:47:07 +01:00
Viktor Lofgren
a6a900266c (search) Fix redirects 2024-12-13 02:40:51 +01:00
Viktor Lofgren
bdba53f055 (site) Update domain parameter type from PathParam to QueryParam 2024-12-13 02:15:35 +01:00
Viktor Lofgren
eb2fe18867 (sideload) Add LSH generation for sideloaded StackExchange data
Previously, the sideloader did not generate a locality-sensitive hashCode for document details.  This caused all documents from the same domain to be considered duplicates by the deduplication logic.
2024-12-13 02:10:52 +01:00
Viktor Lofgren
a7468c8d23 (converter) Ensure paths are created for converter batch writer 2024-12-13 01:35:07 +01:00
Viktor Lofgren
fb2beb1eac (converter) Fix data-loss bug where the converter writer would remove all but the last batch of processed data 2024-12-13 01:19:30 +01:00
Viktor Lofgren
0fb03e3d62 (export) Add logging to AtagExporter for error handling 2024-12-12 22:54:32 +01:00
Viktor Lofgren
67db3f295e (index) Revert some optimization changes 2024-12-12 22:14:24 +01:00
Viktor Lofgren
dafaab3ef7 (index) Additional optimization pass 2024-12-12 18:57:33 +01:00
Viktor Lofgren
3f11ca409f (index) Increase thread limit and optimize search result handling
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 17:07:06 +01:00
Viktor Lofgren
694eed79ef (index) Increase thread limit and optimize search result handling
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:32:31 +01:00
Viktor Lofgren
4220169119 (index) Increase thread limit and optimize search result handling
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:31:11 +01:00
Viktor Lofgren
bbdde789e7 Merge branch 'master' into serp-redesign 2024-12-11 19:45:17 +01:00
Viktor Lofgren
0a53ac68a0 Add specialization for steam store and GOG 2024-12-11 18:32:45 +01:00
Viktor Lofgren
eab61cd48a Merge branch 'master' into serp-redesign 2024-12-11 17:09:27 +01:00
Viktor Lofgren
e65d75a0f9 (crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets 2024-12-11 17:01:52 +01:00
Viktor Lofgren
3b99cffb3d (link-parser) Filter out URLs with binary file suffixes in LinkParser
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
2024-12-11 16:42:47 +01:00
Viktor Lofgren
a97c05107e Add synthetic meta flag for root path documents
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D
2024-12-11 16:10:44 +01:00
Viktor Lofgren
5002870d1f (converter) Refactor sideloaders to improve feature handling and keyword logic
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
2024-12-11 16:01:38 +01:00
Viktor Lofgren
73861e613f (ranking) Downtune score boost for unordered heading matches 2024-12-11 15:44:29 +01:00
Viktor Lofgren
0ce2ba9ad9 (jooby) Fix asset handler 2024-12-11 14:38:04 +01:00
Viktor Lofgren
3ddcebaa36 (search) Give serp/start a more consistent name to siteinfo/start
The change also cleans up the layout a bit.
2024-12-11 14:33:57 +01:00
Viktor Lofgren
b91463383e (jooby) Clean up initialization process 2024-12-11 14:33:18 +01:00
Viktor Lofgren
7444a2f36c (site-info) Add placeholder when a feed item lacks a title. 2024-12-10 22:46:12 +01:00
Viktor Lofgren
461bc3eb1a (generator) Add special workaround to flag fextralife as a wiki 2024-12-10 22:22:52 +01:00
Viktor Lofgren
cf7f84f033 (rank) Reduce the impact of domain rank bonus, and only apply it to cancel out negative penalties, never to increase the ranking 2024-12-10 22:04:12 +01:00
Viktor Lofgren
fdee07048d (search) Remove Spark and migrate to Jooby for the search service 2024-12-10 19:13:13 +01:00
Viktor Lofgren
2fbf201761 (search) Adjust crosstalk flex-basis 2024-12-10 15:12:51 +01:00
Viktor Lofgren
4018e4c434 (search) Add crosstalk to paperdoll 2024-12-10 15:12:39 +01:00
Viktor Lofgren
f3382b5bd8 (search) Completely remove all old hdb templates
Create new views for conversion results, dictionary results, and site crosstalk.
2024-12-10 15:04:49 +01:00
Viktor Lofgren
9fc82574f0 (fingerprint) Add FluxGarden as a wiki generator
#130
2024-12-10 13:51:42 +01:00
Viktor
589f4dafb9 Merge pull request #129 from MarginaliaSearch/atags-counts
(WIP) Improve atag sentence matching
2024-12-10 12:42:34 +00:00
Viktor Lofgren
c5d657ef98 (live-crawler) Flag live crawled documents with a special keyword 2024-12-10 13:42:10 +01:00
Viktor Lofgren
3c2bb566da (converter) Wipe the converter output path on initialization to avoid lingering stale data. 2024-12-10 13:41:05 +01:00
Viktor Lofgren
9287ee0141 (search) Improve hyphenation logic for titles 2024-12-09 15:29:10 +01:00
Viktor Lofgren
2769c8f869 (search) Remove sticky search bar to aid with performance on firefox (and iOS?) 2024-12-09 15:20:33 +01:00
Viktor Lofgren
ddb66f33ba (search) Add more feedback when pressing some buttons 2024-12-09 15:07:23 +01:00
Viktor Lofgren
79500b8fbc (search) Move search bar back up top on mobile, put filter button at the bottom instead. 2024-12-09 14:55:37 +01:00
Viktor Lofgren
187eea43a4 (search) Remove redundant @if 2024-12-09 14:46:02 +01:00
Viktor Lofgren
a89ed6fa9f (search) Fix rendering on site overview, more dense serp layout on mobile 2024-12-09 14:45:45 +01:00
Viktor Lofgren
e0c0ed27bc (keyword-extraction) Clean up code and add tests for position and spans calculation
This code has been a bit of a mess and historically significantly flaky, so some test coverage is more than overdue.
2024-12-08 14:14:52 +01:00
Viktor Lofgren
20abb91657 (loader) Correct DocumentLoaderService to properly do bulk inserts
Fixes issue #128
2024-12-08 13:12:52 +01:00
Viktor Lofgren
291ca8daf1 (converter/index) Improve atag sentence matching by taking into consideration how many times a sentence appears in the links
This change breaks the format of the atags.parquet file.
2024-12-08 00:27:11 +01:00
Viktor Lofgren
8d168be138 (search) Typeahead search, etc. 2024-12-07 15:47:01 +01:00
Viktor Lofgren
6e1aa7b391 (search) Make style.css depend on jte file changes
Also add a hack to ensure classes generated from java code get included in the stylesheet as intended.
2024-12-07 14:11:22 +01:00
Viktor Lofgren
deab9b9516 (search) Clean up start views for search and site-info 2024-12-07 14:11:22 +01:00
Viktor Lofgren
39d99a906a (search) Add proper tailwind build and host fontawesome locally 2024-12-07 14:11:22 +01:00
Viktor Lofgren
6f72e6e0d3 (explore) Add lazy loading and alt attributes to images 2024-12-07 14:11:22 +01:00
Viktor Lofgren
d786d79483 (site-info) Add whitespace-nowrap to pubDay span in overview.jte 2024-12-07 14:11:22 +01:00
Viktor Lofgren
01510f6c2e (serp) Add wayback link to search results 2024-12-07 14:11:22 +01:00
Viktor Lofgren
7ba43e9e3f (site) Adjust sizing of navbars 2024-12-07 14:11:16 +01:00
Viktor Lofgren
97bfcd1353 (site) Layout changes site-info 2024-12-07 14:11:16 +01:00
Viktor Lofgren
aa3c85c196 (site) Mobile layout fixes 2024-12-07 14:11:16 +01:00
Viktor Lofgren
ee2d5496d0 Revert "(experiment) Modify atags exporter to permit duplicates from different source domains"
This reverts commit 5c858a2b94.
2024-12-07 14:01:50 +01:00
Viktor Lofgren
5c858a2b94 (experiment) Modify atags exporter to permit duplicates from different source domains
This is an attempt to provide higher resolution term frequency data that will need evaluation when the data is processed.
2024-12-06 14:10:15 +01:00
Viktor Lofgren
fb75a3827d (site) Adjust coloration of search results 2024-12-05 16:58:00 +01:00
Viktor Lofgren
7d546d0e2a (site) Make SearchParameters generate relative URLs instead of absolute 2024-12-05 16:47:22 +01:00
Viktor Lofgren
8fcb6ffd7a (site-info) Increase contrast in search results for forums, wikis 2024-12-05 16:42:16 +01:00
Viktor Lofgren
f97de0c15a (site-info) Fix layout 2024-12-05 16:33:46 +01:00
Viktor Lofgren
be9e192b78 (site-info) Fix pagination in backlinks and documents views 2024-12-05 16:26:11 +01:00
Viktor Lofgren
75ae1c9526 (site-info) Do not show 'suggest for crawling' when the node affinity is already set to 0
This indicates the domain is already slated for crawling.
2024-12-05 16:18:46 +01:00
Viktor Lofgren
33761a0236 (site-info) Make the search box in the site viewer functional 2024-12-05 16:16:29 +01:00
Viktor Lofgren
19b69b1764 (site-info) Only show samples if feed is absent, never both. 2024-12-05 16:05:03 +01:00
Viktor Lofgren
8b804359a9 (serp) Layout fixes for mobile 2024-12-05 15:59:33 +01:00
Viktor Lofgren
f050bf5c4c (WIP) Initial semi-working transformation to new tailwind UI
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.

There's also a lot of polish remaining everywhere, dead links, etc.
2024-12-05 14:00:17 +01:00
Viktor Lofgren
fdc3efa250 (setup) Remove OpenNLP tokenization model
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
2024-11-28 16:03:05 +01:00
Viktor Lofgren
5fdd2c71f8 (setup) Update OpenNLP model URLs to archive.apache.org
Changed the URLs for downloading OpenNLP sentence and tokens models from downloads.apache.org to archive.apache.org, as the previous link has died.
2024-11-28 15:58:25 +01:00
Viktor Lofgren
c97c66a41c (ranking) Reduce the verbatim score multiplier 2024-11-28 13:37:11 +01:00
Viktor Lofgren
7b64377fd6 (ranking) Promote documents with multiple phrase matches with a log-scale bonus 2024-11-28 13:36:56 +01:00
Viktor Lofgren
e11ebf18e5 (span) Correct intersection counting logic, add comprehensive tests 2024-11-28 13:36:25 +01:00
Viktor Lofgren
ba47d72bf4 (ranking) Adjust scores for external link matches 2024-11-27 14:27:23 +01:00
Viktor Lofgren
52bc0272f8 (atag) Add alias domain support and improve domain handling
Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.
2024-11-27 14:26:44 +01:00
Viktor Lofgren
d4bce13a03 (export) Add export actors to precession
Adding a tracking message to the export actor means it's possible to run them in a precession.

Adding a new precession actor, and some GUI components for triggering exports.

The change also adds a heartbeat to the export process.
2024-11-26 15:07:03 +01:00
Viktor Lofgren
b9842b57e0 (encyclopedia-sideloader) Add test suite and clean up urlencoding logic 2024-11-26 13:34:15 +01:00
Viktor Lofgren
95776e9bee (encyclopedia) Fix commit gore resulting in bad SQL query 2024-11-26 12:44:49 +01:00
Viktor Lofgren
077d8dcd11 (result-score) Adjust ranking parameters a tiny bit 2024-11-25 18:30:59 +01:00
Viktor Lofgren
9ec41e27c6 (keyword-extractor) Fix bug where external link keywords weren't generating document spans as intended 2024-11-25 18:30:22 +01:00
Viktor Lofgren
200743c84f (minor) Remove delombok debris 2024-11-25 18:29:21 +01:00
Viktor Lofgren
6d7998e349 (index) Correct behavior of debug function positionValues(), which was misleadingly incorrect 2024-11-25 18:28:53 +01:00
Viktor Lofgren
7d1ef08a0f (index) Correct ranking bonus for external linktext appearances 2024-11-25 17:40:15 +01:00
Viktor Lofgren
ea6b148df2 (docker) Add restart: always to executor nodes
The system will perform a janitor reset on these nodes when the node profile is switched, so it's important they restart automatically.
2024-11-25 15:31:45 +01:00
Viktor Lofgren
3ec9c4c5fa (export) Filter non-HTML documents in exporters
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.
2024-11-25 15:06:42 +01:00
Viktor Lofgren
0b6b5dab07 (index) Add score bonuses for single-word anchor tag spans
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
2024-11-25 14:44:41 +01:00
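The difference between a containing match and an exact match can be illustrated with a small sketch. The storage layout here (a flat array of half-open [start, end) pairs) and the class name are assumptions made for the example; the real `DocumentSpan` may be organized differently.

```java
// Illustrative only: a simplified stand-in for a span container.
class SpanSketch {
    // Assumed layout: consecutive [start, end) pairs, e.g. {5, 6, 10, 14}.
    private final int[] startsEnds;

    SpanSketch(int... startsEnds) {
        this.startsEnds = startsEnds;
    }

    /** True if some span covers [start, end), possibly with room to spare. */
    boolean containsRange(int start, int end) {
        for (int i = 0; i + 1 < startsEnds.length; i += 2) {
            if (startsEnds[i] <= start && end <= startsEnds[i + 1]) return true;
        }
        return false;
    }

    /** True only if some span is exactly [start, end). */
    boolean containsRangeExact(int start, int end) {
        for (int i = 0; i + 1 < startsEnds.length; i += 2) {
            if (startsEnds[i] == start && startsEnds[i + 1] == end) return true;
        }
        return false;
    }
}
```

Under this layout a single-word anchor span is a pair such as (5, 6), and a one-word query matches it exactly only when both boundaries line up.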
Viktor Lofgren
ff17473105 Fix UTF-8 URL normalization issue in sideloader.
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.

Fixes issue #109.
2024-11-25 14:25:47 +01:00
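A minimal version of the en-dash replacement might look like this. Only the method name `normalizeUtf8` comes from the commit message; the enclosing class and the body are an assumed sketch.

```java
// Sketch only; the surrounding class is hypothetical.
class UrlPathNormalizer {
    /** Replaces the en-dash (U+2013) with an ASCII hyphen-minus. */
    static String normalizeUtf8(String url) {
        return url.replace('\u2013', '-');
    }
}
```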
Viktor Lofgren
dc5f97e737 (index) Add bonus for single-word title matches when the title is also a single word 2024-11-25 13:24:12 +01:00
Viktor Lofgren
d919179ba3 (index) Correct off-by-1 error in DocumentSpan.containsRange 2024-11-25 13:24:03 +01:00
Viktor Lofgren
f09669a5b0 (index) Correct usage of DocumentSpan.length() instead of DocumentSpan.size()
The latter counts the number of spans, and is not what you want here.
2024-11-25 13:11:55 +01:00
Viktor Lofgren
b3b0f6fed3 (actor) Add side-load profile to PROC_CONVERTER_SPAWNER.
This fell off during the profile split, but is necessary for sideloading.
2024-11-25 12:40:14 +01:00
Viktor Lofgren
88caca60f9 (live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list 2024-11-23 17:07:16 +01:00
Viktor Lofgren
923ebbac81 (feeds) Add logic to handle URI fragments in feed items
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
2024-11-23 16:38:56 +01:00
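One way to read "based on their uniqueness" is: keep fragments only when stripping them would make distinct item URLs collapse into one. The sketch below implements that interpretation; the rule itself, the class name, and the method names are assumptions rather than the project's code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FeedFragmentPolicy {
    static String stripFragment(String url) {
        int idx = url.indexOf('#');
        return idx < 0 ? url : url.substring(0, idx);
    }

    /** Strips fragments unless they are what distinguishes the feed's items. */
    static List<String> cleanUrls(List<String> urls) {
        Set<String> original = new HashSet<>(urls);
        Set<String> stripped = new HashSet<>();
        for (String u : urls) stripped.add(stripFragment(u));

        // Fewer distinct URLs after stripping means the fragments carry information.
        if (stripped.size() < original.size()) {
            return List.copyOf(urls);
        }
        List<String> out = new ArrayList<>();
        for (String u : urls) out.add(stripFragment(u));
        return out;
    }
}
```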
Viktor
df298df852 Merge pull request #125 from MarginaliaSearch/live-search
Add near real-time crawling from RSS feeds to supplement the slower batch based crawls
2024-11-22 16:38:37 +00:00
Viktor Lofgren
552b246099 (live-crawl) Improve error handling for errors during robots.txt-retrieval
Reduce log-spam and don't treat errors other than 404 as "all is permitted".
2024-11-22 14:15:32 +01:00
Viktor Lofgren
80e6d0069c (live-crawl-actor) Clear index journal before starting live crawl
This is to prevent data corruption.   This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
2024-11-22 14:04:57 +01:00
Viktor Lofgren
b941604135 (live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with. 2024-11-22 13:58:57 +01:00
Viktor Lofgren
52eb5bc84f (live-crawler) Keep track of bad URLs
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are, on a dice roll, added to a bad URLs table that prevents further attempts at fetching them.
2024-11-22 00:55:46 +01:00
Viktor Lofgren
4d23fe6261 (feeds) Simplify RSS User-Agent header
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService.  This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.
2024-11-21 16:43:56 +01:00
Viktor Lofgren
14519294d2 Merge branch 'master' into live-search 2024-11-21 16:00:20 +01:00
Viktor Lofgren
51e46ad2b0 (refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendants
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.

While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform.  It's still a bit heterogeneous between different processes, but a bit less so for now.
2024-11-21 16:00:09 +01:00
Viktor Lofgren
665c8831a3 (model) Fix resource leak in partially read crawl data streams.
Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.
2024-11-20 19:29:13 +01:00
Viktor Lofgren
47dfbacb00 (conf) Introduce a new concept of node profiles
Node profiles decide which actors are started, and which views are available in the control GUI.  This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
2024-11-20 18:15:22 +01:00
Viktor Lofgren
f94911541a (live-crawl) Reduce the risk of id collisions with the main indexes
This is done by applying a large constant offset to the ordinals for the live crawled documents.  The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
2024-11-20 16:01:10 +01:00
Viktor Lofgren
89d8af640d (live-crawl) Rename the live crawler code module to be more consistent with the other processes 2024-11-20 15:55:15 +01:00
Viktor Lofgren
6e4252cf4c (live-crawl) Make the actor poll for feeds changes instead of being a one-shot thing.
Also changes the live crawl process to store the live crawl data in a fixed directory in the storage base rather than versioned directories.
2024-11-20 15:36:25 +01:00
Viktor Lofgren
79ce4de2ab (model) Remove deprecated fields from CrawledDocument and CrawledDomain 2024-11-20 15:27:05 +01:00
Viktor Lofgren
d6575dfee4 (live-crawler) Crude first-try process for live crawling #WIP
Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
2024-11-19 21:00:18 +01:00
Viktor Lofgren
a91ab4c203 (live-crawler) Crude first-try process for live crawling #WIP
Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
2024-11-19 19:35:01 +01:00
Viktor Lofgren
6a3079a167 (search) Fix missing getter for proto 2024-11-18 21:05:22 +01:00
Viktor Lofgren
c728a1e2f2 (rss) Add endpoint for extracting URLs changed within a timespan. 2024-11-18 14:59:32 +01:00
Viktor Lofgren
d874d76a09 (rss) Add an endpoint that can be used for identifying when RSS data has changed 2024-11-18 14:22:17 +01:00
Viktor Lofgren
70bc8831f5 (test) Fix excludeTags 2024-11-17 20:07:49 +01:00
Viktor Lofgren
41c11be075 (status) Clean up the status page a bit 2024-11-17 20:00:44 +01:00
Viktor Lofgren
163ce19846 (test) Tag status service endpoint tests as flaky
These tests have outside dependencies that inherently make them unreliable and unsuitable for CI.
2024-11-17 19:48:01 +01:00
Viktor Lofgren
9eb16cb667 (test) Remove tests from fast suite
Adding a new @Tag("flaky") for tests that do not reliably return successes.  These may still be valuable during development, but should not run in CI.

Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
2024-11-17 19:45:59 +01:00
Viktor Lofgren
af40fa327b (status-service) Correct measurement pruning to use correct sqlite datetimes, as to not delete the database 2024-11-17 18:35:34 +01:00
Viktor Lofgren
cf6d28e71e (status-service) Enable auto-commit 2024-11-17 18:25:15 +01:00
Viktor Lofgren
3791ea1e18 (service) Add a new application service for external liveness monitoring
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
2024-11-17 18:01:08 +01:00
Viktor
34258b92d1 Merge pull request #124 from MarginaliaSearch/jdk-23+delombok
Friendship with lombok over, now JDK 23 is my best friend
2024-11-16 14:00:49 +00:00
Viktor Lofgren
e5db3f11e1 (chore) Clean up some of the uglier delomboking artifacts 2024-11-15 13:57:20 +01:00
Viktor Lofgren
9f47ce8d15 (chore) Remove lombok
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23 (chore) Remove use of deprecated STR.-style string templates 2024-11-11 18:02:28 +01:00
Viktor Lofgren
8b8bf0748f (feature-extraction) Add new DocumentHeaders class encapsulating Html headers.
Also adds a few new html features for CDNs and S3 hosting for use in ranking and query refinement.
2024-11-11 13:26:15 +01:00
Viktor
5cc71ae586 Merge pull request #123 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-11-10 18:57:49 +01:00
Viktor
33fcfe4b63 Update ROADMAP.md 2024-11-10 18:57:15 +01:00
Viktor
a31a3b53c4 Merge pull request #122 from MarginaliaSearch/fetch-rss-feeds
Automatic RSS feed polling
2024-11-10 18:35:28 +01:00
Viktor Lofgren
a456ec9599 (feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished 2024-11-10 18:30:28 +01:00
Viktor Lofgren
a2bc9a98c0 (feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished 2024-11-10 17:45:20 +01:00
Viktor Lofgren
e24a98390c (feed) Update API to allow specifying clean vs refresh update
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
2024-11-09 18:43:47 +01:00
Viktor Lofgren
6f858cd627 (feed) Decrease update interval to 24 hours 2024-11-09 18:17:51 +01:00
Viktor Lofgren
a293266ccd (feed) Wipe the feeds db and start over from system URLs periodically. 2024-11-09 18:17:16 +01:00
Viktor Lofgren
b8e0dc93d7 (search) Correctly show the feeds view when items are present
... otherwise show samples.   This commit also removes the (Experimental) bit, as this is getting fairly mature.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
d774c39031 (feeds) Reduce log spam 2024-11-09 17:56:43 +01:00
Viktor Lofgren
ab17af99da (feeds) Refresh the feed db using the previous db, when it is available. 2024-11-09 17:56:43 +01:00
Viktor Lofgren
b0ac3c586f (feeds) Correct parallelism using SimpleBlockingThreadPool 2024-11-09 17:56:43 +01:00
Viktor Lofgren
139fa85b18 (feeds) Add working heartbeat tracking progress 2024-11-09 17:56:43 +01:00
Viktor Lofgren
bfeb9a4538 (feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service 2024-11-09 17:56:43 +01:00
Viktor
3d6c79ae5f Merge pull request #121 from MarginaliaSearch/headless-setup
Headless deterministic setup
2024-11-08 13:50:54 +01:00
Viktor Lofgren
c9e9f73ea9 (setup) Break out installation action into non-interactive script 2024-11-08 13:38:40 +01:00
Viktor Lofgren
80e482b155 (setup) Add progress bar to downloads for better feedback 2024-11-08 13:38:40 +01:00
Viktor Lofgren
9351593495 (setup) Use huggingface for versioned hosting of language models 2024-11-08 13:38:40 +01:00
Viktor Lofgren
d74436f546 (setup) Use checksums for rdrpostagger and opennlp files
Also use versioned URLs for rdrpostagger
2024-11-08 13:38:40 +01:00
Viktor Lofgren
76e9053dd0 (setup) Move some file-downloads from setup script to the first boot of the control node of the system
We can only do this for files that are not required for unit tests.

As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions.  The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
2024-11-06 15:28:20 +01:00
Viktor Lofgren
dbb8bcdd8e (crawler) Use a better hashInt implementation in CrawlDataReference
Guava's hash functions are slow as hell.
2024-10-15 18:25:55 +02:00
Viktor Lofgren
7305afa0f8 (crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris 2024-10-15 17:27:59 +02:00
Viktor Lofgren
481f999b70 (crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
Viktor Lofgren
4b16022556 (crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains 2024-10-15 14:21:59 +02:00
Viktor Lofgren
89dd201a7b (link-parser) Make mailing list blocking optional 2024-10-15 13:48:32 +02:00
Viktor Lofgren
ab486323f2 (converter) Increase the number of links the converter will pick up per document 2024-10-15 13:46:19 +02:00
Viktor Lofgren
6460c11107 (index) Short-circuit rankResults when there are no results 2024-10-14 13:47:35 +02:00
Viktor Lofgren
89f7f3c17c (query-parser) Fix regression where advice terms weren't parsed properly 2024-10-14 13:46:37 +02:00
Viktor Lofgren
fe800b3af7 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:57:27 +02:00
Viktor Lofgren
01a16ff388 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:55:59 +02:00
Viktor Lofgren
eb60ddb729 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:49:39 +02:00
Viktor Lofgren
db5faeceee (download-sample) Break apart actor for better error recovery
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:39:43 +02:00
Viktor Lofgren
45d3e6aa71 (download-sample) Break apart actor for better error recovery
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:19:09 +02:00
Viktor Lofgren
d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table.  This is much preferred and means the operator can directly manage domains without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae (crawler, EXPERIMENT) Disable content type probing and use Accept header instead
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
2024-09-30 14:53:01 +02:00
Viktor Lofgren
90a2d4ae38 (index) Fix partial buffer writing in PrioDocIdsTransformer
Ensure all data is written to writeChannel by looping until the buffer is fully drained. This prevents potential data loss during the close operation and maintains data integrity.
2024-09-29 17:53:40 +02:00
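The underlying pattern is general to Java NIO: a single `write` call on a channel may consume only part of the buffer, so the call has to be repeated until nothing remains. A minimal sketch follows; the helper class is hypothetical, and checked I/O exceptions are wrapped only to keep the example compact.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper illustrating the drain loop.
class ChannelWrites {
    /** Writes every remaining byte of the buffer to the channel. */
    static void writeFully(WritableByteChannel channel, ByteBuffer buffer) {
        try {
            // write() returns how many bytes it consumed, which may be fewer
            // than buffer.remaining(); loop until the buffer is drained.
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```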
Viktor Lofgren
2b8ab97ec1 (bit-writer) Do not clear buffer when creating a bit writer 2024-09-29 17:52:43 +02:00
Viktor Lofgren
43ca9c8a12 (sequence) Return Integer.MAX_VALUE for empty position lists.
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and addresses edge cases where an empty list is encountered.
2024-09-29 17:21:17 +02:00
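Why Integer.MAX_VALUE is a safer sentinel than 0 can be seen in a simplified two-list distance function. This is an assumed illustration, not the method the commit changed: the class name is hypothetical, and it handles two sorted position lists rather than an arbitrary number.

```java
// Simplified stand-in for a minimum-distance computation over term positions.
class PositionDistance {
    /** Smallest gap between any position in a and any position in b (both sorted). */
    static int minDistance(int[] a, int[] b) {
        // Returning 0 here would look like a perfect match; MAX_VALUE means "no match".
        if (a.length == 0 || b.length == 0) return Integer.MAX_VALUE;

        int i = 0, j = 0, best = Integer.MAX_VALUE;
        while (i < a.length && j < b.length) {
            best = Math.min(best, Math.abs(a[i] - b[j]));
            if (a[i] < b[j]) i++; else j++;   // advance the list that is behind
        }
        return best;
    }
}
```

Since a smaller distance normally means a better match, a 0 for missing data would rank documents lacking a term above documents that contain it.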
Viktor Lofgren
69d99c91dd (index) Optimize buffer handling in PrioDocIdsTransformer 2024-09-29 17:20:49 +02:00
Viktor Lofgren
a8cc98a0f6 (index) Fix write offset calculation in PrioDocIdsTransformer
Adjust the write offset calculation by adding the position of the write buffer. Updated the test to validate the transformation process and ensure correctness of output file positions.
2024-09-29 17:20:29 +02:00
Viktor Lofgren
2ee58f4bc9 (index) Adjust ranking parameters to dial down the importance of tcfProximity and firstPosition 2024-09-29 15:33:12 +02:00
Viktor Lofgren
938431e514 (scrape-feeds-actor) Add deduplication of insertion data
To avoid unnecessary db churn, the domains to be added are put in a set instead of a list, ensuring that they are unique.
2024-09-28 14:41:14 +02:00
Viktor Lofgren
b2de3c70fa (scrape-feeds-actor) Add explicit commit in case it's disabled 2024-09-28 14:36:57 +02:00
Viktor Lofgren
542690d9f6 (search-service) Hide pagination when there is only 1 page of results 2024-09-28 13:48:09 +02:00
Viktor Lofgren
596a7fb4ea (actor) Disable the feed scraper on all nodes but the first 2024-09-28 12:36:16 +02:00
Viktor Lofgren
c3f726a01f (actor) Add a feed scraping actor
Add a new actor that polls a URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job.

The URLs are specified in data/scrape-urls.txt.  If this file is absent, the actor shuts down.
2024-09-28 12:33:29 +02:00
Viktor Lofgren
4538ade156 (live-capture) Add readme to live-capture function 2024-09-28 11:35:46 +02:00
Viktor Lofgren
f4709d8f32 (live-capture) Handle case when screenshot bytes are empty.
Add logic to flag the domain as fetched when the pngBytes array is empty. This ensures we won't try to re-fetch this domain again for a while.
2024-09-27 15:53:17 +02:00
Viktor Lofgren
3dda8c228c (live-capture) Handle failed screenshot fetch in BrowserlessClient
Return an empty byte array when screenshot fetch fails, ensuring downstream processes are not impacted by null responses. Additionally, only attempt to upload the screenshot if the byte array is non-empty, preventing invalid data from being stored.
2024-09-27 14:52:05 +02:00
Viktor Lofgren
ccf6b7caf3 (assistant) Refactor scheduling of tasks within SimilarDomainsService
Changed the scheduling function to use a single schedule call instead of a fixed delay for the init task. The updateScreenshotInfo method was also moved and slightly refactored for clearer readability and consistency.
2024-09-27 14:43:19 +02:00
Viktor Lofgren
fed33ed64a (search-service) Update screenshot request handling
Always request the main site screenshot to ensure staleness checks and necessary updates. Limit additional screenshot requests for similar and linking domains to a maximum of 5 per view to avoid overloading.
2024-09-27 14:27:25 +02:00
Viktor Lofgren
ca27d95ce1 (assistant) Add bounds checks for domain idx 2024-09-27 14:24:04 +02:00
Viktor Lofgren
3566fe296a (assistant) Add scheduled update job for screenshot information 2024-09-27 14:16:28 +02:00
Viktor Lofgren
c91435e314 (assistant) Don't attempt to respond to similarity and linkedness queries before the data is ready
This will reduce the number of exceptions in the assistant logs quite significantly.
2024-09-27 14:08:08 +02:00
Viktor Lofgren
31f30069a4 (live-capture) Dial down logging a bit 2024-09-27 14:00:55 +02:00
Viktor
e5726a75d2 Merge pull request #120 from MarginaliaSearch/live-capture-function
Add a new function 'Live Capture' for on-demand screenshot capture
2024-09-27 13:48:53 +02:00
Viktor Lofgren
c757d116bf (misc) Fix Broken Tests 2024-09-27 13:46:34 +02:00
Viktor Lofgren
23cce0c78a Add a new function 'Live Capture' for on-demand screenshot capture
The screenshots are requested by the site-service, and triggered via the site-info view.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
1bd29a586c (service-discovery) Add common base interface to all Grpc services
To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService extends BindableService was added.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
4565bfe359 (crawler) Make the crawler report crawling progress correctly when stopped and resumed. 2024-09-26 18:30:29 +02:00
Viktor Lofgren
336d6fdd14 (index-client) Fix error when zero results are found 2024-09-25 20:23:13 +02:00
Viktor Lofgren
95cde242ca (assistant) Fix NPE when IP information is absent 2024-09-25 20:19:17 +02:00
Viktor
9224176202 Merge pull request #119 from MarginaliaSearch/result-pagination
Add pagination support for the search results
2024-09-25 14:29:24 +02:00
Viktor Lofgren
0d2390fd13 (search-service) Only autofocus on the query when the query is empty 2024-09-25 14:27:03 +02:00
Viktor Lofgren
4a0356e26f (search-service) Add pagination support to the search GUI 2024-09-25 14:26:49 +02:00
Viktor Lofgren
73f973cc06 (search-query) Add pagination to search query API and the direct query-service interface 2024-09-25 14:20:59 +02:00
Viktor Lofgren
e9e8580913 (converter) Fix NPE bugs in converter due to the reintroduction of CrawledDocument.headers 2024-09-25 12:18:56 +02:00
Viktor Lofgren
8b85a58fea (search UX) Autofocus on the search form 2024-09-24 15:56:03 +02:00
Viktor Lofgren
40512511af (crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl
This code is still a bit too complex, but it's slowly getting better.
2024-09-24 15:08:22 +02:00
Viktor
10d8fc4fe7 Update ROADMAP.md 2024-09-24 14:57:30 +02:00
Viktor
9899d45ea8 Merge pull request #118 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-09-24 14:13:47 +02:00
Viktor
3eea471ca6 Update ROADMAP.md 2024-09-24 14:13:32 +02:00
Viktor Lofgren
3dec4b6b34 (index) Fix bug where tcfFirstPosition lit up because one term was in the title and the other was missing from the document
This was because firstPosition calculation was not invalidated when positions were missing.
2024-09-24 13:33:37 +02:00
Viktor Lofgren
162fc25ebc (minor) Fix accidental commit errors 2024-09-23 18:03:09 +02:00
Viktor Lofgren
e9854f194c (crawler) Refactor
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong
2024-09-23 17:51:07 +02:00
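The Retry-After fix in the commit above comes down to units: HTTP defines the numeric form of the header as a delay in seconds, not milliseconds. The sketch below shows one way to interpret it; it is not the crawler's code. The RetryAfter class, the fallback parameter and the upper cap are assumptions made for illustration, and the HTTP-date form of the header is deliberately left unparsed.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not taken from the repository.
final class RetryAfter {
    private RetryAfter() {}

    /**
     * Interprets a Retry-After value given as delay-seconds.
     * Returns the fallback when the value is missing, negative, or not a plain
     * integer (for example an HTTP date), and caps the result at max.
     */
    static Duration parseDelay(String headerValue, Duration fallback, Duration max) {
        if (headerValue == null) return fallback;
        try {
            long seconds = Long.parseLong(headerValue.trim());
            if (seconds < 0) return fallback;
            Duration delay = Duration.ofSeconds(seconds); // seconds, not milliseconds
            return delay.compareTo(max) > 0 ? max : delay;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

With this reading, a header of `Retry-After: 120` yields a two-minute wait rather than 120 ms.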
Viktor Lofgren
9c292a4f62 (doc) Fix outdated links in documentation 2024-09-22 13:56:17 +02:00
Viktor Lofgren
edb42836da (vcs) Fix shared state issues with VarintCodedSequence's iterators.
Also cleans up the code a bit.
2024-09-21 16:09:15 +02:00
Viktor Lofgren
1ff88ff0bc (vcs) Stopgap fix for quoted queries with the same term appearing multiple times
There are reentrance issues with VarintCodedSequence; this hides the symptom, but they need to be corrected properly.
2024-09-21 14:07:59 +02:00
Viktor Lofgren
28e7c8e5e0 Increase temporal bias weight to give the recent results filter a bit more recency 2024-09-17 18:11:40 +02:00
Viktor
463b3ed0ce Merge pull request #99 from MarginaliaSearch/term-positions
Improve term positions accuracy
2024-09-17 15:30:04 +02:00
Viktor Lofgren
8e78286068 Merge branch 'master' into term-positions 2024-09-17 15:20:46 +02:00
Viktor Lofgren
f4eeef145e (index) Reduce fetch size to improve timeout characteristics 2024-09-17 15:20:41 +02:00
Viktor Lofgren
87aa869338 (index) Correct positions mask to take into account offsets when overlapping 2024-09-17 14:40:37 +02:00
Viktor Lofgren
60ad4786bc (index) Use MemorySegment.copy for LongArray->LongArray transfers 2024-09-17 13:56:31 +02:00
Viktor Lofgren
a74df7f905 (index) Increase buffer size for PrioDocIdsTransformer 2024-09-17 13:52:52 +02:00
Viktor Lofgren
9f9c6736ab (index) Use MemorySegment.copy for LongArray->LongArray transfers 2024-09-17 13:49:02 +02:00
Viktor Lofgren
b95646625f (index) Correct prio index construction with mmap
Accidentally snuck in behavior from full index
2024-09-17 13:39:08 +02:00
Viktor Lofgren
6e47eae903 (index) Correct strange close handling of PositionsFileConstructor 2024-09-13 16:34:14 +02:00
Viktor Lofgren
934af0dd4b (index) Correct units in log message when shrinking the documents file 2024-09-13 16:33:19 +02:00
Viktor Lofgren
a8bec13ed9 (index) Evaluate using mmap reads during index construction in favor of filechannel reads
It's likely that this will be faster, as the reads are on average small and sequential, and can't be buffered easily.
2024-09-13 16:14:56 +02:00
Viktor Lofgren
1cf62f5850 (doc) Correct dead links and stale information in the docs 2024-09-13 11:02:13 +02:00
Viktor Lofgren
8047e77757 (doc) Correct dead links and stale information in the docs 2024-09-13 11:01:05 +02:00
Viktor Lofgren
2a92de29ce (loader) Fix it so that the loader doesn't explode if it sees an invalid URL 2024-09-12 11:36:00 +02:00
Viktor Lofgren
99523ca079 (query-parser) Remove test that is no longer relevant 2024-09-10 10:35:56 +02:00
Viktor Lofgren
35f49bbb60 (coded-sequence) Add equals and hashCode to VCS 2024-09-10 10:33:56 +02:00
Viktor Lofgren
50ec922c2b (index) Fix broken index tests
Also cleaned up the tests to be less fragile to ranking algorithm changes.
2024-09-10 10:23:46 +02:00
Viktor Lofgren
cfbbeaa26e (ranking) Clean up ranking test code 2024-09-08 15:46:51 +02:00
Viktor Lofgren
a3b0189934 Fix build errors after merge 2024-09-08 10:22:32 +02:00
Viktor Lofgren
8f367d96f8 Merge branch 'master' into term-positions
# Conflicts:
#	code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java
#	code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java
#	code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java
#	code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java
2024-09-08 10:14:43 +02:00
Viktor Lofgren
f78ef36cd4 (slop) Upgrade to 0.0.8, add encodings to string columns. 2024-09-04 15:19:00 +02:00
Viktor Lofgren
dc67c81f99 (summary) Fix a few cases where noscript tags would sometimes be used for document summary 2024-09-04 15:00:40 +02:00
Viktor Lofgren
50ba8fd099 (query-parsing) Correct handling of trailing parentheses 2024-09-03 11:45:14 +02:00
Viktor Lofgren
99b3b00b68 (query-parsing) Merge QueryTokenizer into QueryParser and add escaping of query grammar 2024-09-03 11:35:32 +02:00
Viktor Lofgren
f6d981761d (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:24:05 +02:00
Viktor Lofgren
8290c19e24 (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:21:01 +02:00
Viktor Lofgren
7a69dff6cf (search) Correct handling of languages on fandom 2024-09-01 13:46:01 +02:00
Viktor Lofgren
bfb7ed2c99 (search) Translate cursed medium URLs to scribe.rip links via the search application 2024-09-01 13:32:14 +02:00
Viktor Lofgren
e19dc9b13e (search) Translate cursed fandom URLs to breezewiki links via the search application 2024-09-01 13:23:35 +02:00
Viktor Lofgren
74148c790e (crawler) Pull additional new domains from node-affinity 0
Previously a bit ambiguously defined, node affinity 0 is now indicative that a domain is up for grabs for the next crawler
2024-09-01 13:00:36 +02:00
Viktor Lofgren
3d77456110 (*) Add domain parking service to ip blocklist 2024-09-01 12:53:22 +02:00
Viktor Lofgren
ab6a4b1749 (control) Correct id value for domain addition tool 2024-09-01 12:25:15 +02:00
Viktor Lofgren
aeeb1d0cb7 (control) Add utility for adding domains from an external URL 2024-09-01 12:14:21 +02:00
Viktor Lofgren
185b79f2a5 (converter) Fix bug where sideloaded reddit content was erroneously categorized as wiki-generated. 2024-09-01 11:30:25 +02:00
Viktor Lofgren
8d0f9652c7 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650 (crawler) Grab favicons as part of root sniff 2024-08-31 11:32:56 +02:00
Viktor Lofgren
b1bfe6f76e (control) New view for domains
Add capability to assign domains, and bulk-add new domains.
2024-08-30 17:06:48 +02:00
Viktor Lofgren
74e25370ca (control) New view for domains
Still a work in progress, but at this point it's possible to use for viewing domains
2024-08-29 15:40:40 +02:00
Viktor Lofgren
bb5d946c26 (index, EXPERIMENTAL) Clean up ranking code 2024-08-29 11:34:23 +02:00
Viktor Lofgren
abab5bdc8a (index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data 2024-08-26 14:20:39 +02:00
Viktor Lofgren
30bf845c81 (index) Speed up minDist calculations by excluding large lists 2024-08-26 13:04:15 +02:00
Viktor Lofgren
77efce0673 (paper-doll) Fix compilation 2024-08-26 12:51:29 +02:00
Viktor Lofgren
67a98fb0b0 (coded-sequence) Handle weird legacy HTML that puts everything in a heading 2024-08-26 12:49:15 +02:00
Viktor Lofgren
7d471ec30d (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:45:11 +02:00
Viktor Lofgren
f3182a9264 (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:02:37 +02:00
Viktor Lofgren
805cb5ad58 (coded-sequence) Correct behavior of findIntersections 2024-08-25 14:54:17 +02:00
Viktor Lofgren
fdf05cedae (index) Optimize DocumentSpan.countIntersections 2024-08-25 14:12:30 +02:00
Viktor Lofgren
9c5f463775 (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:59:11 +02:00
Viktor Lofgren
893fae6d59 (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:51:43 +02:00
Viktor Lofgren
5660f291af (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:43:29 +02:00
Viktor Lofgren
efd56efc63 (index) Optimize SequenceOperations.minDistance 2024-08-25 13:28:06 +02:00
Viktor Lofgren
d94373f4b1 (index) Optimize calculatePositionsMask 2024-08-25 13:24:37 +02:00
Viktor Lofgren
0d01a48260 (index) Optimize SequenceOperations 2024-08-25 13:19:37 +02:00
Viktor Lofgren
00ab2684fa (index) Optimize SequenceOperations 2024-08-25 13:17:38 +02:00
Viktor Lofgren
a5585110a6 (index) Optimize SequenceOperations 2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e (index) Optimize DocumentSpan 2024-08-25 12:44:33 +02:00
Viktor Lofgren
982b03382b (index) Optimize DocumentSpan 2024-08-25 12:31:15 +02:00
Viktor Lofgren
24b805472a (index) Evaluate performance implication of decoding gcs early 2024-08-25 12:23:09 +02:00
Viktor Lofgren
6ce029b317 (index) Remove vestigial parameter 2024-08-25 12:14:12 +02:00
Viktor Lofgren
63e5b0ab18 (index) Correct weightedCounts calculations 2024-08-25 12:06:56 +02:00
Viktor Lofgren
6dda2c2d83 (coded-sequence) Reduce allocations in GCS.values() 2024-08-25 12:06:31 +02:00
Viktor Lofgren
3fb3c0b92e (index) Optimize ranking calculations 2024-08-25 11:56:11 +02:00
Viktor Lofgren
aa2c960b74 (index) Optimize ranking calculations 2024-08-25 11:53:44 +02:00
Viktor Lofgren
4fbcc02f96 (index) Adjust sensible defaults for ranking parameters 2024-08-25 11:24:16 +02:00
Viktor Lofgren
9aa8f13731 (index) Remove tcfAvgDist ranking parameter
This is captured by tcfProximity already
2024-08-25 11:20:19 +02:00
Viktor Lofgren
65bee366dc (index) Try harmonic mean for avgMinDist 2024-08-25 11:11:52 +02:00
Viktor Lofgren
53700e6667 (index) Try harmonic mean for avgMinDist 2024-08-25 11:08:41 +02:00
Viktor Lofgren
7f498e10b7 (index) Adjust proximity score 2024-08-25 11:01:35 +02:00
Viktor Lofgren
6eb0f13411 (index) Adjust handling of full phrase matches to prioritize full query matches over large partial matches 2024-08-25 10:54:04 +02:00
Viktor Lofgren
773377fe84 (index) Correct handling of full phrase match group 2024-08-25 10:48:34 +02:00
Viktor Lofgren
4372c8c835 (index) Give ranking components more consistent names 2024-08-25 10:44:27 +02:00
Viktor Lofgren
099133bdbc (index) Fix verbatim match score after moving full phrase group to a separate entity 2024-08-25 10:43:35 +02:00
Viktor Lofgren
b09e2dbeb7 (build) Fix dependency churn from testcontainers
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
96bcf03ad5 (index) Address broken tests
They are still broken, but less so.
2024-08-25 10:34:36 +02:00
Viktor Lofgren
0999f07320 (search-query) Add new ranking parameters for proximity and verbatim matches 2024-08-25 10:34:12 +02:00
Viktor Lofgren
5d2b455572 (search) Clean up inconsistent usage of MathClient in SearchOperator
Also clean up SearchOperator and adjacent code
2024-08-24 10:39:31 +02:00
Viktor Lofgren
ea75ddc0e0 (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:50:52 +02:00
Viktor Lofgren
2db0e446cb (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:49:29 +02:00
Viktor Lofgren
557bdaa694 (search) Clean up SearchQueryIndexService and surrounding code 2024-08-22 11:45:28 +02:00
Viktor Lofgren
9eb1f120fc (index) Repair positions bitmask for search result presentation 2024-08-22 11:28:23 +02:00
Viktor Lofgren
266d6e4bea (slop) Replace SlopPageRef<T> with SlopTable.Ref<T> 2024-08-21 10:13:49 +02:00
Viktor Lofgren
e4c97a91d8 (*) Comment clarity 2024-08-21 10:12:00 +02:00
Viktor Lofgren
b0a874a842 (*) Upgrade slop library -> 0.0.5 2024-08-18 11:05:27 +02:00
Viktor Lofgren
bca40de107 (*) Upgrade slop library 2024-08-18 10:43:41 +02:00
Viktor Lofgren
93652e0937 (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:55:48 +02:00
Viktor Lofgren
0a383a712d (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:44:17 +02:00
Viktor Lofgren
03d5dec24c (*) Refactor termCoherences and rename them to phrase constraints. 2024-08-15 11:02:19 +02:00
Viktor Lofgren
b2a3cac351 (*) Remove broken imports 2024-08-15 11:01:34 +02:00
Viktor Lofgren
a18edad04c (index) Remove stopword list from converter
We want to index all words in the document, so stopword handling is moved to the index, where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
2024-08-15 09:36:50 +02:00
Viktor Lofgren
92522e8d97 (index) Attenuate bm25 score based on query length 2024-08-15 08:41:38 +02:00
Viktor Lofgren
049d94ce31 (index) Add body position match to qdebug fields 2024-08-15 08:39:37 +02:00
Viktor Lofgren
dbc6a95276 (index) Consume the new 'body' span in index to make it used in ranking 2024-08-15 08:33:43 +02:00
Viktor Lofgren
75b0888032 (slop) Migrate to latest Slop version 2024-08-14 11:44:35 +02:00
Viktor Lofgren
2ad93ad41a (*) Clean up 2024-08-14 11:43:45 +02:00
Viktor Lofgren
623ee5570f (slop) Break slop out into its own repository 2024-08-13 09:50:05 +02:00
Viktor Lofgren
fd2bad39f3 (keyword-extraction) Add body field for terms that are not otherwise part of a field 2024-08-13 09:49:26 +02:00
Viktor Lofgren
e6c8a6febe (index) Add index-side deduplication in selectBestResults 2024-08-10 10:51:59 +02:00
Viktor Lofgren
4ece5f847b (index) Add more qdebug factors 2024-08-10 10:45:30 +02:00
Viktor Lofgren
e4f04af044 (index) Give BODY matches a verbatim match value 2024-08-10 10:22:19 +02:00
Viktor Lofgren
b730b17f52 (index) Correct handling of firstPosition to avoid divide-by-zero 2024-08-10 10:21:59 +02:00
Viktor Lofgren
98c40958ab (index) Simplify verbatim match calculation 2024-08-10 09:54:56 +02:00
Viktor Lofgren
41b52f5bcd (index) Simplify verbatim match calculation 2024-08-10 09:51:03 +02:00
Viktor Lofgren
4264fb9f49 (query-service) Clean up qdebug UI a bit 2024-08-10 09:51:03 +02:00
Viktor Lofgren
016a4c62e1 (index) Bugs and error fixes, chasing and fixing mystery results that did not contain all relevant keywords 2024-08-10 09:51:03 +02:00
Viktor Lofgren
2f38c95886 (index) Backport bugfix from term-positions branch
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search.  This is no bueno.

This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
2024-08-09 21:17:02 +02:00
Viktor Lofgren
df89661ed2 (index) In SearchResultItem, populate combinedId with combinedId and not its ranking-removed documentId cousin 2024-08-09 16:32:32 +02:00
Viktor Lofgren
41da4f422d (search-query) Always generate the "all"-segmentation 2024-08-09 13:20:00 +02:00
Viktor Lofgren
2e89b55593 (wip) Repair qdebug utility and show new ranking details 2024-08-09 12:57:25 +02:00
Viktor Lofgren
7babdb87d5 (index) Remove intermediate models 2024-08-07 10:10:44 +02:00
Viktor Lofgren
680ad19c7d (keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors 2024-08-06 11:16:56 +02:00
Viktor Lofgren
f01267bc6b (index) Don't load fwd index offsets into a hash table at start.
This makes the service take forever to start up.  Memory map the data instead and binary search.  This is a bit slower, but not by much.
2024-08-06 11:16:28 +02:00
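The trade-off in the commit above (no hash table to build at startup, at the cost of a logarithmic lookup) can be sketched with a binary search over a sorted run of document ids. This is an illustration only: the SortedIdIndex class is hypothetical, and a LongBuffer stands in for the memory-mapped data. In practice such a buffer could be obtained from FileChannel.map(...).asLongBuffer().

```java
import java.nio.LongBuffer;

// Hypothetical sketch; not the repository's forward index reader.
final class SortedIdIndex {
    private SortedIdIndex() {}

    /** Returns the index of docId within the ascending-sorted buffer, or -1 if it is absent. */
    static int indexOf(LongBuffer sortedIds, long docId) {
        int lo = 0;
        int hi = sortedIds.limit() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;          // overflow-safe midpoint
            long value = sortedIds.get(mid);    // absolute read, does not move the buffer position
            if (value < docId) lo = mid + 1;
            else if (value > docId) hi = mid - 1;
            else return mid;
        }
        return -1;
    }
}
```

The returned index can then be used to read the matching offset from a parallel region of the same file, so nothing needs to be loaded up front.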
Viktor Lofgren
df6a05b9a7 (index) Avoid hypothetical divide-by-zero in tcfAvgDist 2024-08-06 10:55:57 +02:00
Viktor Lofgren
8569bb8e11 (index) Avoid divide-by-zero when minDist returns 0 2024-08-06 10:34:05 +02:00
Viktor Lofgren
ca6e2db2b9 (index) Include external link texts in verbatim score 2024-08-06 10:23:23 +02:00
Viktor Lofgren
2080e31616 (converter) Store link text positions
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends.  Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
Viktor Lofgren
c379be846c (slop) Update readme 2024-08-04 10:58:23 +02:00
Viktor Lofgren
9bc665628b (slop) VarintLE implementation, correct enum8 column 2024-08-04 10:57:52 +02:00
Viktor Lofgren
ee49c01d86 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:47:23 +02:00
Viktor Lofgren
b21f8538a8 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:41:38 +02:00
Viktor Lofgren
dd15676d33 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:18:04 +02:00
Viktor Lofgren
ec5a17ad13 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:07:02 +02:00
Viktor Lofgren
e48f52faba (experiment) Add ad-hoc filter runner 2024-08-03 13:24:03 +02:00
Viktor Lofgren
8462e88b8f (index) Add min-dist factor and adjust rankings 2024-08-03 13:07:00 +02:00
Viktor Lofgren
bf26ead010 (index) Remove hasPrioTerm check as we should sort this out in ranking 2024-08-03 13:06:50 +02:00
Viktor Lofgren
c2cedfa83c (index) Experimental ranking signals 2024-08-03 10:33:41 +02:00
Viktor Lofgren
eba2844361 (index) Experimental ranking signals 2024-08-03 10:32:46 +02:00
Viktor Lofgren
c6c8b059bf (index) Return some variant of the previously removed 'Bm25PrioGraphVisitor' 2024-08-03 10:10:12 +02:00
Viktor Lofgren
d8a99784e5 (index) Adding a few experimental relevance signals 2024-08-02 20:26:07 +02:00
Viktor Lofgren
57929ff242 (coded-sequence) Varint sequence 2024-08-02 20:22:56 +02:00
Viktor Lofgren
4430a39120 (loader) Clean up 2024-08-02 12:32:47 +02:00
Viktor Lofgren
6228f46af1 (loader) Reduce log spam 2024-08-02 12:21:03 +02:00
Viktor Lofgren
ac67b6b5da (converter) Fix exception handling while reading crawl data 2024-08-02 10:39:49 +02:00
Viktor Lofgren
1a268c24c8 (perf) Reduce DomPruningFilter hash table recalculation 2024-08-01 12:04:55 +02:00
Viktor Lofgren
38e2089c3f (perf) Code was still spending a lot of time resolving charsets
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
e2107901ec (index) Add span information for anchor tags, tweak ranking params 2024-08-01 11:46:30 +02:00
Viktor Lofgren
15745b692e (index) Coherences need to be able to deal with null values among positions 2024-07-31 22:00:14 +02:00
Viktor Lofgren
696fd8909d (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 21:44:10 +02:00
Viktor Lofgren
02b1c4b172 (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 20:21:23 +02:00
Viktor Lofgren
285e657f68 Merge branch 'master' into term-positions
# Conflicts:
#	code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
046ffc7752 (build) Upgrade jib to 3.4.3 2024-07-31 10:39:50 +02:00
Viktor Lofgren
2ef66ce0ca (actor) Reset NEW flag earlier when auto-deletion is disabled
Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.
2024-07-31 10:31:03 +02:00
Viktor Lofgren
dc5c668940 (index) Re-enable parallelization of index construction, disable parallel sorting during construction
The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance.  It got worse, so the change is reverted.

It has been noted, though, that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so this is changed to be done sequentially.
2024-07-31 10:06:53 +02:00
Viktor Lofgren
f19148132a (search) Restrict site-search by passing domain id along with the site:-term
This will help these queries deal with domains that do not have a subdomain so that they do not drag up subdomains as well, as they are also given the special site:-keyword for their corresponding parent domain.
2024-07-30 21:41:07 +02:00
Viktor Lofgren
6d7b886aaa (converter) Correct sort order of files in control storage GUI
Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today were typically shown first, followed by the rest of the files with the oldest date first.
2024-07-30 19:43:27 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
7e4efa45b8 (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:21:21 +02:00
Viktor Lofgren
86ea28d6bc (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:18:52 +02:00
Viktor Lofgren
34703da144 (slop) Support for nested array types and array-of-object types
Also adding very basic support for filtered reads via SlopTable.  This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
1282f78bc5 (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 11:01:18 +02:00
Viktor Lofgren
2d5d965f7f (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 10:34:33 +02:00
Viktor Lofgren
afe56c7cf1 (loader) Tidy up code 2024-07-28 21:36:42 +02:00
Viktor Lofgren
7d51cf882f (loader) Move rssFeeds to a different column group to avoid errors 2024-07-28 21:30:10 +02:00
Viktor Lofgren
499deac2ef (slop) Fix test that broke when we split get into int get() and long getLong() 2024-07-28 21:20:37 +02:00
Viktor Lofgren
9685993adb (loader) Add spans to a different column group from spanCodes, as they are not in sync 2024-07-28 21:20:09 +02:00
Viktor Lofgren
261dcdadc8 (loader) Additional tracking for the control GUI 2024-07-28 21:19:45 +02:00
Viktor Lofgren
314a901bf0 (slop) Clean up build.gradle from unnecessary copy-paste garbage 2024-07-28 13:22:20 +02:00
Viktor Lofgren
1caad7e19e (slop) Update existing code to use the altered Slop interfaces 2024-07-28 13:21:08 +02:00
Viktor Lofgren
e585116dab (slop) Add 32 bit read method for Varint along with the old 64 bit version 2024-07-28 13:20:18 +02:00
Viktor Lofgren
40f42bf654 (slop) Add signed 16 bit column type "short" 2024-07-28 13:19:44 +02:00
Viktor Lofgren
eaf7fbb9e9 (slop) Improve Conveniences for Enum
* New fixed width 8 bit version of Enum
* Access to the enum's dictionary, and a method for reading the ordinal directly to reduce GC churn
2024-07-28 13:19:15 +02:00
Viktor Lofgren
d05a2e57e9 (index-forward) Spans Writer should not be in the index page loop context 2024-07-27 15:17:04 +02:00
Viktor Lofgren
f8684118f3 (slop) Add columnDesc information to the column readers and writers, and correct a few broken position() implementations
Added a test that should find any additional broken implementations, as it's very important that this function is correct.
2024-07-27 14:35:30 +02:00
Viktor Lofgren
2e1f669aea (slop) Remove additional vestigial seek() implementations 2024-07-27 14:35:30 +02:00
Viktor Lofgren
6c3abff664 (slop) Move GCS Slop column to the coded-sequence package
This lets the slop library be stand-alone without dependence on coded-sequence.

The change also gets rid of the vestigial seek() method in ColumnReader.
2024-07-27 13:58:45 +02:00
Viktor Lofgren
dcb43a3308 (slop) Introduce table concept to keep track of positions and simplify closing
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.

The second most common error is forgetting to close one of the columns in a reader or writer.

To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
2024-07-27 13:47:47 +02:00
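The idea behind the table concept in the commit above can be sketched as a registry that owns its columns, closes all of them, and verifies on close that they advanced in lockstep. The classes below are hypothetical illustrations and are not the real Slop API. The sketch also assumes every registered column should end at the same position, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the idea; these are not the Slop library's classes.
final class ColumnTable implements AutoCloseable {
    /** Minimal column abstraction: a position counter and a close hook. */
    interface Column {
        long position();
        void close();
    }

    /** Toy column that only counts how many values were read or written. */
    static final class CountingColumn implements Column {
        private long position;
        private boolean closed;
        void advance() { position++; }
        boolean isClosed() { return closed; }
        @Override public long position() { return position; }
        @Override public void close() { closed = true; }
    }

    private final List<Column> columns = new ArrayList<>();

    /** Registers a column so that the table manages its lifecycle. */
    <T extends Column> T register(T column) {
        columns.add(column);
        return column;
    }

    /** Closes every registered column, then fails if their positions diverged. */
    @Override
    public void close() {
        long expected = -1;
        boolean inSync = true;
        for (Column column : columns) {
            long pos = column.position();
            if (expected < 0) expected = pos;
            else if (pos != expected) inSync = false;
            column.close(); // always close, even when a mismatch was found
        }
        if (!inSync) throw new IllegalStateException("Columns are out of sync");
    }
}
```

A conditional read that forgets to skip the other columns leaves one position behind, and the mismatch surfaces when the table is closed instead of as corrupted data later on.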
Viktor Lofgren
ec600b967d (crawler) Adjust domain locking
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress.  Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
52a9a0d410 (slop) Translate nulls to empty strings when passed to the StringColumnWriters. 2024-07-25 18:26:41 +02:00
Viktor Lofgren
4123e99469 (slop) Handle empty compressed files correctly
The CompressingStorageReader would incorrectly report having data when a file was empty.  Preemptively attempting to fill the backing buffer fixes the behavior.
2024-07-25 18:26:13 +02:00
Viktor Lofgren
51a8a242ac (slop) First commit of slop library
Slop is a low-abstraction data storage convention for column based storage of complex data.
2024-07-25 15:08:41 +02:00
Viktor Lofgren
60ef826e07 (loader) Add heartbeat to update domain-ids step 2024-07-25 15:08:41 +02:00
Viktor Lofgren
2ad564404e (loader) Add heartbeat to update domain-ids step 2024-07-23 15:28:52 +02:00
Viktor Lofgren
2bb9f18411 (dld) Refactor DocumentLanguageData
Reduce the usage of raw arrays
2024-07-19 12:24:55 +02:00
Viktor Lofgren
7a1edc0880 (term-freq) Reduce the number of low-relevance words in the dictionary
Using a statistical trick to reduce the number of low-frequency words in the dictionary, as they are numerous and not very informative.
2024-07-19 12:23:28 +02:00
Viktor Lofgren
b812e96c6d (language-processing) Select the appropriate language filter
The incorrect filter was selected based on the provided parameter; this has been corrected.
2024-07-19 12:22:32 +02:00
Viktor Lofgren
22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant amount of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
Viktor Lofgren
d36055a2d0 (keyword-extractor) Retire TfIdfHigh WordFlag
This will bring the word flags count down to 8, and let us pack every value in a byte.
2024-07-17 13:54:39 +02:00
Viktor Lofgren
0d227f3543 (cleanup) Remove next-prime library only used in tests 2024-07-17 13:48:03 +02:00
Viktor Lofgren
accc598967 (crawler) Add 1 second pause after probing domain to reduce request pressure 2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba (crawler) Add a per-domain mutex for crawling
To ease the pressure on domains with lots of subdomains, such as substack, medium, and neocities, a per-domain mutex is added that limits crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
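A per-domain mutex of the kind described above could look roughly like the sketch below. It is an illustration only: the class name `DomainLocks`, the `withLock` helper, and the crude "last two labels" rule for grouping subdomains are assumptions made here, not the crawler's actual implementation.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: one lock per top domain, shared by all of its subdomains.
final class DomainLocks {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Crude grouping rule (an assumption): keep only the last two labels of the host name.
    static String topDomain(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] parts = lower.split("\\.");
        int n = parts.length;
        return n <= 2 ? lower : parts[n - 2] + "." + parts[n - 1];
    }

    // Runs the task while holding the lock for the host's top domain,
    // so at most one thread crawls that domain at a time.
    void withLock(String host, Runnable crawlTask) {
        ReentrantLock lock = locks.computeIfAbsent(topDomain(host), k -> new ReentrantLock());
        lock.lock();
        try {
            crawlTask.run();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        DomainLocks domainLocks = new DomainLocks();
        domainLocks.withLock("alice.substack.com",
                () -> System.out.println("crawling under lock for " + topDomain("alice.substack.com")));
    }
}
```

A real crawler would need a proper public-suffix lookup instead of the two-label rule, which misgroups hosts under suffixes such as co.uk.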
Viktor Lofgren
6665e447aa (crawler) Add crawl delays around probe call and deal with 429:s properly during this phase 2024-07-16 15:33:24 +02:00
Viktor Lofgren
7eb955cc42 (setup) Change mirror for opennlp
Seems like the estointernet mirror no longer works.  Use apache.org instead.
2024-07-16 15:19:13 +02:00
Viktor Lofgren
f4d79c203d (crawler) Adjust revisit logic
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.

Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4 (crawler) Introduce absolute upper limit to crawl depth growth 2024-07-16 14:40:45 +02:00
Viktor Lofgren
0b31c4cfbb (coded-sequence) Replace GCS usage with an interface 2024-07-16 14:37:50 +02:00
Viktor Lofgren
5c098005cc (index) Fix broken test
Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.
2024-07-16 12:37:59 +02:00
Viktor Lofgren
ae87e41cec (index) Fix rare BitReader.takeWhileZero bug
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer.  This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.

The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.

Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
2024-07-16 11:03:56 +02:00
Viktor Lofgren
dfd19b5eb9 (index) Reduce the number of abstractions around result ranking
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
2024-07-16 08:18:54 +02:00
Viktor
8ed5b51a32 Merge branch 'master' into term-positions 2024-07-15 07:05:31 +02:00
Viktor Lofgren
9d0e5dee02 Fix gitignore issue causing .so files not to be ignored correctly. 2024-07-15 05:18:10 +02:00
Viktor Lofgren
ffd970036d (term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
How'd This Ever Work? (tm)

TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:16:17 +02:00
Viktor Lofgren
fa162698c2 (term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
How'd This Ever Work? (tm)

TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:15:30 +02:00
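The locking problem described in the two commits above (synchronizing on the instance while initializing shared static members) can be illustrated with a small sketch. The class `SharedModelUser` and its fields are hypothetical stand-ins, not the real SentenceExtractor code; the point is only that the monitor must be shared by every instance, for example the class object.

```java
// Hypothetical stand-in for a class with expensive shared static state.
final class SharedModelUser {
    private static Object sharedModel;  // shared across all instances
    private static int loadCount = 0;   // how many times the model was built

    SharedModelUser() {
        // Lock on the class object so every instance contends for the same monitor.
        // synchronized (this) would use a different monitor per instance and
        // would not protect the static fields.
        synchronized (SharedModelUser.class) {
            if (sharedModel == null) {
                sharedModel = new Object();
                loadCount++;
            }
        }
    }

    // Constructs instances from several threads at once and reports the load count.
    static int spinUp(int threads) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> new SharedModelUser());
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        synchronized (SharedModelUser.class) {
            return loadCount;
        }
    }

    public static void main(String[] args) {
        System.out.println("model loaded " + spinUp(8) + " time(s)");
    }
}
```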
Viktor Lofgren
ad3857938d (search-api, ranking) Update with new ranking parameters
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.

The change also cleans out several parameters that no longer filled any function.
2024-07-15 04:49:40 +02:00
Viktor Lofgren
179a6002c2 (coded-sequence) Add a callback for re-filling underlying buffer 2024-07-12 23:50:28 +02:00
Viktor Lofgren
d28fc86956 (index-prio) Add fuzz test for prio index 2024-07-11 19:22:36 +02:00
Viktor Lofgren
6303977e9c (index-prio) Fail louder when size is 0 in PrioDocIdsTransformer
We can't deal with this scenario and should complain very loudly
2024-07-11 19:22:05 +02:00
Viktor Lofgren
97695693f2 (index-prio) Don't increment readItems counter when the output buffer is full
This behavior was causing the reader to sometimes discard trailing entries in the list.
2024-07-11 19:21:36 +02:00
Viktor Lofgren
1ab875a75d (test) Correcting flaky tests
Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
31881874a9 (coded-sequence) Correct indicator of next-value
It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamma code.  This is incorrect in this case, as we're able to provide a negative offset.  Changing to using Integer.MIN_VALUE as indicator that a value is absent instead, as this will never be used.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
f090f0101b (index-construction) Gather up preindex writes
Use fewer writes when finalizing the preindex documents.dat file, as this was getting too slow.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
9881cac2da (index-reader) Correctly handle negative offset values
When wordOffset(...) returns a negative value, it means the word isn't present in the index, and we should abort.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
12590d3449 (index-reverse) Added compression to priority index
The priority index documents file can be trivially compressed to a large degree.

Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
2024-07-11 16:13:23 +02:00
Viktor Lofgren
abf7a8d78d (coded-sequence) Correct implementation of Elias gamma
Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.
2024-07-10 14:28:28 +02:00
Viktor Lofgren
ecfe17521a (coded-sequence) Correct implementation of Elias gamma
The implementation was incorrectly using 1 bit more than it should.  The change also adds a put method for Elias delta; and cleans up the interface a bit.
2024-07-09 17:28:21 +02:00
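For reference, textbook Elias gamma coding writes a positive integer n as floor(log2 n) zero bits followed by the binary form of n, for 2*floor(log2 n) + 1 bits in total. The sketch below uses strings of '0' and '1' purely for readability; it is not the coded-sequence library's bit-packed implementation, and the class name `EliasGammaSketch` is made up for this example.

```java
// String-based illustration of Elias gamma coding (not a bit-packed implementation).
final class EliasGammaSketch {
    // Encodes n >= 1 as (bitLength - 1) zeros followed by n in binary.
    static String encode(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("Elias gamma encodes positive integers only");
        }
        String binary = Integer.toBinaryString(n);
        return "0".repeat(binary.length() - 1) + binary;
    }

    // Decodes a single gamma-coded value from the start of the bit string.
    static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') {
            zeros++;
        }
        // The value occupies zeros + 1 bits, starting at the first '1'.
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 5, 8}) {
            System.out.println(n + " -> " + encode(n));
        }
    }
}
```

Under this definition 5 (binary 101) becomes 00101, which is five bits. An encoder that emits one extra bit per value is still decodable by a matching decoder but wastes space, which appears to be the kind of error the commit above corrects.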
Viktor Lofgren
0d29e2a39d (index-reverse) Entry Sources reset() their LongQueryBuffer
Previously this was the responsibility of the caller, which led to the possibility of passing in improperly prepared buffers and receiving bad results.
2024-07-09 01:39:40 +02:00
Viktor Lofgren
12a2ab93db (actor) Improve error messages for convert-and-load
Some copy-and-paste errors had snuck in and every index construction error was reported as "repartitioned failed"; updated with more useful messages.
2024-07-08 19:19:30 +02:00
Viktor Lofgren
d90bd340bb (index-reverse) Removing btree indexes from prio documents file
Btree index adds overhead and disk space and doesn't fill any function for the prio index.

* Update finalize logic with a new IO transformer that copies the data and prepends a size
* Update the reader to read the new format
* Added a test
2024-07-08 17:20:17 +02:00
Viktor Lofgren
21afe94096 (index-reverse) Don't use 128 bit merge function for prio index 2024-07-07 21:36:10 +02:00
Viktor Lofgren
fa36689597 (index-reverse) Simplify priority index
* Do not emit a documents file
* Do not interlace metadata or offsets with doc ids
2024-07-06 18:04:08 +02:00
Viktor Lofgren
85c99ae808 (index-reverse) Split index construction into separate packages for full and priority index 2024-07-06 15:44:47 +02:00
Viktor Lofgren
a4ecd5f4ce (minor) Fix non-compiling test due to previous refactor 2024-07-06 15:11:43 +02:00
Viktor Lofgren
6401a513d7 (crawl) Fix onsubmit confirm dialog for single-site recrawl 2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor Lofgren
a6b03a66dc (crawl) Reduce Charset.forName() object churn
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again.  Charset.forName(...) can be surprisingly expensive, and its built-in caching strategy, which just caches the last 2 values seen, doesn't cope well with how we're hitting it with a wide array of random charsets.
2024-07-04 20:49:07 +02:00
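The caching idea in the commit above can be sketched as a small memoizing wrapper around Charset.forName(). The class name `CharsetCache`, the key normalization, and the UTF-8 fallback for unknown names are assumptions for illustration, not the crawler's actual behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: memoizes Charset.forName() lookups by normalized name.
final class CharsetCache {
    private static final ConcurrentHashMap<String, Charset> CACHE = new ConcurrentHashMap<>();

    static Charset forName(String name) {
        String key = name.trim().toLowerCase(Locale.ROOT);
        return CACHE.computeIfAbsent(key, CharsetCache::lookup);
    }

    private static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            // Falling back to UTF-8 is an assumption made for this sketch.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(forName("ISO-8859-1"));
        System.out.println(forName("no-such-charset"));
    }
}
```

Since charset names come from crawled documents, an unbounded map can grow with junk names; a production version would probably want a size limit.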
Viktor Lofgren
d023e399d2 (index) Remove unnecessary allocations in journal reader
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.

Replacing with a fixed pointer alias that can be repositioned to the relevant data.

The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.

Removed this unnecessary step and move to copying the buffer directly instead.
2024-07-04 15:38:22 +02:00
Viktor Lofgren
e8ab1e14e0 (keyword-extraction) Update upper limit to number of positions per word
After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.
2024-07-02 20:52:32 +02:00
Viktor Lofgren
a6e15cb338 (keyword-extraction) Update upper limit to number of positions per word
100 was a bit too low, let's try 256.
2024-06-30 22:46:56 +02:00
Viktor Lofgren
4fbb863a10 (keyword-extraction) Add upper limit to number of positions per word
Also adding some logging for this event to get a feel for how big these lists get with realistic data.  To be cleaned up later.
2024-06-30 22:41:38 +02:00
Viktor Lofgren
6ee4d1eb90 (keyword) Increase the work area for position encoding
The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.
2024-06-28 16:42:39 +02:00
Viktor Lofgren
738e0e5fed (process) Add option for automatic profiling
The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns.  By default, these are put in the log directory.

The change also adds a JVM parameter that makes it shut up about native access.
2024-06-27 13:58:36 +02:00
Viktor Lofgren
0e4dd3d76d (minor) Remove accidentally committed debug printf 2024-06-27 13:40:53 +02:00
Viktor Lofgren
10fe5a78cb (log) Prevent tests from trying to log to file
They would never have succeeded, but it adds an annoying preamble of error spam in the console window.
2024-06-27 13:19:48 +02:00
Viktor Lofgren
975b8ae2e9 (minor) Tidy code 2024-06-27 13:15:31 +02:00
Viktor Lofgren
935234939c (test) Add query parsing to IntegrationTest 2024-06-27 13:15:20 +02:00
Viktor Lofgren
87e38e6181 (search-query) refac: Move query factory 2024-06-27 13:14:47 +02:00
Viktor Lofgren
f73fc8dd57 (search-query) Fix end-inclusion bug in QWordGraphIterator 2024-06-27 13:13:42 +02:00
Viktor Lofgren
3faa5bf521 (search-query) Tidy up QueryGRPCService and IndexClient 2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480 (query) Tidy up code 2024-06-26 13:40:06 +02:00
Viktor Lofgren
02df421c94 (*) Trim the stopwords list
Having an overlong stopwords list leads to quoted terms not performing well.  For now we'll slash it to just "a" and "the".
2024-06-26 12:22:57 +02:00
Viktor Lofgren
95b9af92a0 (index) Implement working optional TermCoherences 2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771 (index) Correct TermCoherence requirements 2024-06-25 22:18:10 +02:00
Viktor Lofgren
b805f6daa8 (gamma) Fix readCount() behavior in EGC 2024-06-25 22:17:54 +02:00
Viktor Lofgren
dae22ccbe0 (test) Integration test from crawl->query 2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f (index) Partial re-implementation of position constraints 2024-06-24 15:55:54 +02:00
Viktor Lofgren
5461634616 (doc) Add readme.md for coded-sequence library
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
2024-06-24 14:28:51 +02:00
Viktor Lofgren
40bca93884 (gamma) Minor clean-up 2024-06-24 13:56:43 +02:00
Viktor Lofgren
b798f28443 (journal) Fixing journal encoding
Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.
2024-06-24 13:56:27 +02:00
Viktor Lofgren
fff2ce5721 (gamma) Correctly decode zero-length sequences 2024-06-24 13:11:41 +02:00
Viktor
69f88255e9 Merge pull request #101 from MarginaliaSearch/security-scan
Address security scan findings
2024-06-17 13:18:36 +02:00
Viktor
08ff79827e Merge branch 'master' into security-scan 2024-06-17 13:18:25 +02:00
Viktor Lofgren
67703e2274 (run) Update install.sh with stronger warnings against non-docker install. 2024-06-17 13:15:15 +02:00
Viktor Lofgren
d0d6bb173c (control) Fix warc data http status filter default value 2024-06-17 12:40:25 +02:00
Viktor Lofgren
54caf17107 (docs) Amend install instructions for non-docker install 2024-06-16 10:22:07 +02:00
Viktor Lofgren
2168b7cf7d (docs) Update docs with clearer references to the full guide
The commit also mentions the non-docker install
2024-06-16 10:01:19 +02:00
Viktor Lofgren
90744433c9 Merge branch 'master' into security-scan
# Conflicts:
#	code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Viktor
5371f078f7 Merge pull request #102 from jaseemabid/jabid/macos-build
Make the project buildable on macOS
2024-06-12 14:45:03 +02:00
Jaseem Abid
0dd14a4bd0 Specify C++ standard in build command
The default C++ language standard on macOS is gnu++98, which won't build
this module.

Full error:

```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
    [](const p64x2& fst, const p64x2& snd) {
    ^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09 Don't track build files(libcpp.so) with git 2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9 (crawler) Update WARC builder to use SHA-256 for digests 2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b (crawler) Fetch TLS instead of SSL context 2024-06-12 09:07:54 +02:00
Viktor Lofgren
55f3ac4846 (atags) Fix duckdb SQL injection
The input comes from the config file so this isn't a very realistic threat vector, and even if it wasn't it's a query in an empty duckdb instance; but adding a validation check to provide a better error message.
2024-06-12 09:05:57 +02:00
Viktor Lofgren
801cf4b5da (search) Fix bad practice usage of innerHTML to set what should be text content. 2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d (build) Upgrade parquet dependencies to 1.14.0
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
23759a7243 (loader) Correctly clamp document size 2024-06-10 18:29:14 +02:00
Viktor Lofgren
55b2b7636b (loader) Correctly load the positions column in the keyword projection 2024-06-10 18:27:15 +02:00
Viktor Lofgren
36160988e2 (index) Integrate positions data with indexes WIP
This change integrates the new positions data with the forward and reverse indexes.

The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d (index) Integrate positions file properly 2024-06-06 16:45:42 +02:00
Viktor Lofgren
dcbec9414f (index) Fix non-compiling tests 2024-06-06 16:35:09 +02:00
Viktor Lofgren
a07cf1ba93 (array/cpp) Update gitignore to properly exclude libcpp.so 2024-06-06 13:06:08 +02:00
Viktor Lofgren
4a8afa6b9f (index, WIP) Position data partially integrated with forward and reverse indexes.
There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Viktor
bb06cc9ff3 Merge pull request #98 from samstorment/ThemeSwitcher
OS Independent Theme Switcher
2024-06-06 12:51:19 +02:00
Sam Storment
9c06f446fb (search) Styling tweaks. Make the filter button near the top right corner a bit bigger so it's easier to press on mobile 2024-06-05 19:55:17 -05:00
Sam Storment
2d076cbd67 (search) move data-has-js attribute from body to html element 2024-06-05 18:20:33 -05:00
Sam Storment
fb2eef24d6 Handle theming when JavaScript is disabled. Hide the theme select and fall back to the dark media query instead of the data-theme attribute 2024-06-03 14:15:35 -05:00
Sam Storment
e2f68d9ccf Add a theme select to the header that lets users toggle their theme independent of their OS theme 2024-06-02 21:02:52 -05:00
Viktor Lofgren
d4f4d751c0 Merge remote-tracking branch 'origin/master' 2024-06-02 16:30:41 +02:00
Viktor Lofgren
b4eac2516e (crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results 2024-06-02 16:30:34 +02:00
Viktor
4435f6245c Merge pull request #94 from samstorment/search-dark-theme
Search Dark Theme
2024-06-02 16:21:52 +02:00
Viktor Lofgren
9b922af075 (converter) Amend existing modifications to use gamma coded positions lists
... instead of serialized RoaringBitmaps as was the initial take on the problem.
2024-05-30 14:20:36 +02:00
Viktor Lofgren
0112ae725c (gamma) Implement a small library for Elias gamma coding an integer sequence 2024-05-30 14:19:13 +02:00
Viktor Lofgren
619392edf9 (keywords) Add position information to keywords 2024-05-28 16:54:53 +02:00
Viktor Lofgren
0894822b68 (converter) Add position information to serialized document data
This is not hooked in yet, and the term metadata is still left intact.  It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
2024-05-28 14:18:03 +02:00
Viktor Lofgren
206a7ce6c1 Merge remote-tracking branch 'origin/master' 2024-05-28 14:15:57 +02:00
Viktor Lofgren
a69ab311c7 (qword) Fix tests that broke due to stopword removal 2024-05-28 14:15:45 +02:00
Viktor
a61327fa0b Update ROADMAP.md 2024-05-24 13:57:50 +02:00
Viktor Lofgren
6985ab762a (query) Improve handling of stopwords in queries 2024-05-23 20:50:55 +02:00
Viktor Lofgren
0e8300979b (search) Update the no result text to request bug reports. 2024-05-23 20:18:16 +02:00
Viktor Lofgren
0b60411e5f (query) Bugfix stopword issue
Add a new rule that creates an alternative path that omits a word if it's a stopword.

In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
Viktor Lofgren
f83f777fff (converter) Experimental support for searching by URL
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
65b74f9cab (registry) Fix broken test 2024-05-23 14:15:01 +02:00
Sam Storment
7543e98035 Merge branch 'MarginaliaSearch:master' into search-dark-theme 2024-05-22 18:06:37 -05:00
Viktor Lofgren
59ec70eb73 (*) Clean up code related to crawl parquet inspection 2024-05-22 12:55:08 +02:00
Viktor Lofgren
365229991b (control) Improve pagination for crawl data inspector 2024-05-21 19:44:48 +02:00
Viktor Lofgren
959a8e29ee (control) Improve pagination for crawl data inspector 2024-05-21 19:27:25 +02:00
Viktor Lofgren
197c82acd4 (control) Add filter functionality for crawl data inspector 2024-05-21 19:05:44 +02:00
Viktor Lofgren
9539fdb53c (control) Clean up UX for crawl data inspector 2024-05-21 18:27:24 +02:00
Sam Storment
5659df4388 (search) Set link and form field colors manually to override browser defaults with poor dark mode contrast 2024-05-21 00:03:46 -05:00
Viktor Lofgren
24bf29d369 (*) Upgrade opennlp and deprecate the monkey patched version of the code as it's no longer needed 2024-05-20 18:03:21 +02:00
Viktor Lofgren
17dc00d05f (control) Partial implementation of inspection utility for crawl data
Uses duckdb and range queries to read the parquet files directly from the index partitions.

UX is a bit rough but is in working order.
2024-05-20 18:02:46 +02:00
Viktor Lofgren
4fcd4a8197 (index) Refactor to reduce the level of indirection 2024-05-19 12:40:33 +02:00
Viktor Lofgren
daf2a8df54 (btree) Roll back optimization of queryDataWithIndex
It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect.

The reason retain is so fast is due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already intersected documents, where the data is more sparse.
2024-05-19 11:29:28 +02:00
Sam Storment
43489c98d8 (search) Minor dark theme tweaks after the new mocked UI elements were added 2024-05-19 01:06:54 -05:00
Viktor Lofgren
88997a1c4f (btree) Clean up code 2024-05-18 18:38:46 +02:00
Viktor Lofgren
d12c77305c (btree) Clean up code 2024-05-18 18:03:17 +02:00
Viktor Lofgren
ab4e2b222e (array) Fix broken benchmarks 2024-05-18 13:41:24 +02:00
Viktor Lofgren
b867eadbef (big-string) Remove the unused bigstring library 2024-05-18 13:40:03 +02:00
Viktor Lofgren
19163fa883 (array) Clean up the Array library
IntArray gets the YAGNI axe.   The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot.   Removing the latter, as all it ever did was clutter up the codebase and add technical debt.  If we need int arrays, we fork LongArray again (or add int capabilities to it)

Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.

Finally adding sz=2 specializations to the quick- and insertion sort algorithms.  It seems the JIT isn't optimizing these particularly well, this is an attempt to help it out a bit.
2024-05-18 13:23:06 +02:00
Sam Storment
a7c33809c4 Merge branch 'master' into search-dark-theme 2024-05-17 22:52:19 -05:00
Viktor Lofgren
650f3843bb (array) Clean up search function jungle
Retire search functions that weren't used, including the native implementations.  Drop confusing suffixes on search function names.  Search functions no longer encode search misses as negative values.

Replaced binary search function with a branchless version that is much faster.

Cleaned up benchmark code.
2024-05-17 14:31:02 +02:00
Viktor Lofgren
9e766bc056 (array) Clean up search function jungle
Retire search functions that weren't used, including the native implementations.  Drop confusing suffixes on search function names.  Search functions no longer encode search misses as negative values.

Replaced binary search function with a branchless version that is much faster.

Cleaned up benchmark code.
2024-05-17 14:30:06 +02:00
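A branchless binary search in the general sense mentioned above can be sketched as follows. This is a generic lower-bound formulation written for illustration, not the array library's code; the class and method names are made up. Whether the JIT actually compiles the conditional expression without a branch is not guaranteed.

```java
// Generic sketch of a "branchless" lower-bound search over a sorted long[].
final class BranchlessSearchSketch {
    // Returns the index of the first element >= key, or a.length if there is none.
    // A miss is therefore reported as an insertion point, not as a negative value.
    static int lowerBound(long[] a, long key) {
        int base = 0;
        int len = a.length;
        while (len > 1) {
            int half = len / 2;
            // The loop runs a fixed number of steps for a given length; the only
            // data-dependent choice is this conditional add, which a compiler may
            // turn into a conditional move.
            base += (a[base + half - 1] < key) ? half : 0;
            len -= half;
        }
        return base + ((a.length > 0 && a[base] < key) ? 1 : 0);
    }

    public static void main(String[] args) {
        long[] data = {1, 3, 5, 7};
        System.out.println(lowerBound(data, 5));
        System.out.println(lowerBound(data, 8));
    }
}
```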
Viktor Lofgren
48aff52e00 (array) Increase LongArray on-heap alignment to 16 bytes
This primarily affects benchmarks, making performance more consistent for the 128 bit operations, as the system mostly works with memory mapped data.
2024-05-16 19:12:36 +02:00
Viktor Lofgren
9d7616317e (array) Clean up native code a bit 2024-05-16 14:47:10 +02:00
Viktor Lofgren
d227a09fb1 (search) Extend paperdoll service mock with site info data and screenshots
It's a bit of a hack job but will do, random exploration is available but only through a "browse:random"-style query
2024-05-15 12:40:55 +02:00
Viktor Lofgren
f48cf77c4d (array, experimental) Add benchmark results for quicksort 2024-05-14 18:15:30 +02:00
Viktor Lofgren
3549be216f (array, experimental) Documentation for native algos 2024-05-14 17:43:05 +02:00
Viktor Lofgren
c3e3a3dbc5 (search) Fix problem list in clustered search results 2024-05-14 13:05:52 +02:00
Viktor Lofgren
55a7c1db00 (array, experimental) Call C++ helper methods to do some low level stuff a bit faster than is possible with Java 2024-05-14 12:54:14 +02:00
Sam Storment
bb315221ab (search, WIP) Make the dark theme look generally nicer. Rename CSS custom properties a bit. Switch a lot of background colors to HSL to make it easy to change colors relative to one another. 2024-05-14 01:32:40 -05:00
Sam Storment
c38766c5a6 (search, WIP) Convert SCSS variables to CSS custom properties for dynamic theming 2024-05-08 22:13:24 -05:00
Viktor Lofgren
c837321df1 (search) Provide a notification when no search results are found. 2024-05-06 20:11:39 +02:00
Viktor Lofgren
af7f6b89ec (search) Delete vestigial stylesheet from the old design. 2024-05-06 19:52:29 +02:00
Viktor Lofgren
29a4d3df23 (search) Improve search-service paperdoll by mocking suggestions and news 2024-05-06 19:52:13 +02:00
Viktor
bcbb9afac0 Merge pull request #93 from MarginaliaSearch/accessibility-improvements
Accessibility improvements
2024-05-04 15:45:26 +02:00
Viktor Lofgren
7d1cafc070 (control) Add skip link for navigation in control GUI 2024-05-04 12:36:44 +02:00
Viktor Lofgren
5951c67a8b (search) Center the search results page 2024-05-04 12:23:21 +02:00
Viktor Lofgren
c454007730 (search) Increase contrast for some UI elements 2024-05-04 12:02:52 +02:00
Viktor Lofgren
4e49cca43d (search) Clean up SCSS code a bit 2024-05-04 11:58:54 +02:00
Viktor Lofgren
49a8c06095 (search) Improve contrast for text on random button 2024-05-04 11:51:19 +02:00
Viktor Lofgren
d01d9fa670 (search) Add screenreader-specific notification remark about when search results start. 2024-05-04 11:41:06 +02:00
Viktor Lofgren
a53a32f006 (search) Spell out website problems with "atomic elements" instead of having a hover that's inaccessible with keyboard navigation 2024-05-04 11:41:05 +02:00
Viktor Lofgren
3548d54cf6 (search) Add a screenreader-only alert when the search filters are updated to make it easier to understand what happens. 2024-05-04 11:41:04 +02:00
Viktor Lofgren
01f242ac7e (search) Add stylesheet class for screenreader-only items 2024-05-04 11:41:03 +02:00
Viktor Lofgren
2840d9d403 (search) Add screenreader-only positions count text to search results 2024-05-04 11:41:03 +02:00
Viktor Lofgren
9fecfc5025 (search) Add autocomplete attribute to search-form 2024-05-04 11:41:02 +02:00
Viktor Lofgren
1b901e01f2 (search) Add bypass link that skips navigation 2024-05-04 11:41:01 +02:00
Viktor Lofgren
974aa35558 (search) Add proper alt-text to random exploration mode 2024-05-04 11:41:00 +02:00
Viktor Lofgren
4021a0ae98 (search) Add en-US language tags to all templates 2024-05-04 11:40:59 +02:00
Viktor Lofgren
b7a95be731 (search) Create a small mocking framework for running the search service in isolation. 2024-05-04 11:40:59 +02:00
Viktor Lofgren
616649f040 (logs) Fix logdir location 2024-05-04 11:40:59 +02:00
Viktor
ac3c692b5f Merge pull request #92 from MarginaliaSearch/no-docker-v2
(WIP) Changes to make the system runnable outside of docker
2024-05-01 13:00:56 +02:00
Viktor Lofgren
6087f9635c (qs) Move index.html out of public directory
It was put there to simulate the /public interface paradigm that is now deprecated.
2024-05-01 12:56:12 +02:00
Viktor Lofgren
2ad0bfda1e (*) Fix boot orchestration for the services
This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.

A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database.  Move the first boot check into the MainClass instead of the Service constructor.

The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
2024-05-01 12:39:48 +02:00
Viktor Lofgren
cf8b12bcdc Update install.sh with refined service descriptions 2024-05-01 12:07:30 +02:00
Viktor Lofgren
08f8b6e022 (system) Log loaded properties to the console 2024-04-30 18:29:11 +02:00
Viktor Lofgren
800ed6b1e9 (zk) Terminate immediately if zookeeper isn't found
This makes debugging easier
2024-04-30 18:28:49 +02:00
Viktor Lofgren
df93e57a9a (install) Add new option to install locally outside of docker 2024-04-30 18:28:21 +02:00
Viktor Lofgren
908535a3a0 (single-service) Ensure single-service spawner can specify the node 2024-04-30 18:27:46 +02:00
Viktor Lofgren
7fe2ab6f39 (file-storage) Ensure file storage root location can be overridden when running outside of docker 2024-04-30 18:26:15 +02:00
Viktor Lofgren
c9ee0c909e (download-sample) Set +x permissions on directories created during this job 2024-04-30 18:25:07 +02:00
Viktor Lofgren
38aedb50ac (converter) Do not suppress exceptions in the converter 2024-04-30 18:24:35 +02:00
Viktor Lofgren
4772e0b59d (service) Deprecate /public prefix on HTTP
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.

Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.

The change removes the path prefix, and updates the docker templates to reflect the change.  This will require a migration for existing systems.
2024-04-30 14:46:18 +02:00
Viktor Lofgren
9c49e876d5 (conf) Update the setup.sh script to also be able to perform model upgrades 2024-04-29 17:46:20 +02:00
Viktor Lofgren
152007cd5c (docker) Add missing zookeeper service to full marginalia config 2024-04-29 11:44:53 +02:00
Viktor Lofgren
70e2e41955 (crawler) Content type prober should not swallow exceptions 2024-04-27 18:27:23 +02:00
Viktor Lofgren
4d71c776fc (crawler) Modify crawl set growth to grow small domains faster than larger ones 2024-04-27 17:36:27 +02:00
Viktor
0f41105436 Merge pull request #90 from MarginaliaSearch/run-outside-docker
Run outside of Docker
2024-04-25 18:55:26 +02:00
Viktor
2d49071e96 Merge branch 'master' into run-outside-docker 2024-04-25 18:53:26 +02:00
Viktor Lofgren
89889ecbbd (single-service) Skip starting Prometheus if it's not explicitly enabled 2024-04-25 17:54:07 +02:00
Viktor Lofgren
41576e74d4 (doc) Clean up ROADMAP.md 2024-04-25 15:53:46 +02:00
Viktor Lofgren
c8ee354d0b (log) Make log dir configurable via environment variable 2024-04-25 15:09:18 +02:00
Viktor Lofgren
4e5f069809 (build) Migrate ssr to the new root setting schema of java lang version 2024-04-25 15:08:56 +02:00
Viktor Lofgren
6690e9bde8 (service) Ensure the service discovery starts early
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
2024-04-25 15:08:33 +02:00
Viktor Lofgren
e4b34b6ee6 (index) Correctly detect the presence of an all-virtual path through the query 2024-04-25 14:01:46 +02:00
Viktor Lofgren
3952ef6ca5 (service) Let singleservice configure ports and bind addresses 2024-04-25 13:49:57 +02:00
Viktor Lofgren
463d333846 (proj) Add ROADMAP.md 2024-04-25 13:07:35 +02:00
Viktor Lofgren
7eb5e6aa66 (crawler) Abort recrawl if error count is too high 2024-04-24 21:46:40 +02:00
Viktor Lofgren
282022d64e (crawler) Remove unnecessary double-fetch of the root document 2024-04-24 14:44:39 +02:00
Viktor Lofgren
91a98a8807 (crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber 2024-04-24 14:44:39 +02:00
Viktor Lofgren
32fe864a33 (build) Java 22 and its consequences has been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e1c9313396 (crawler) Emulate if-modified-since for domains that don't support the header
This will help reduce the strain on some server software, in particular Discourse.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f430a084e8 (crawler) Remove accidental log spam 2024-04-24 14:44:39 +02:00
Viktor Lofgren
a86b596897 (crawler) Code quality 2024-04-24 14:44:39 +02:00
Viktor Lofgren
6dd87b0378 (crawler) Use the probe-result to reduce the likelihood of crawling both http and https
This should drastically reduce the number of fetched documents on many domains
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c9f029c214 (crawler) Strip W/-prefix from the etag when supplied as If-None-Match 2024-04-24 14:44:39 +02:00
Viktor Lofgren
6b88db10ad (crawler) Ensure all appropriate headers are recorded on the request 2024-04-24 14:44:39 +02:00
Viktor Lofgren
8a891c2159 (crawler/converter) Remove legacy junk from parquet migration 2024-04-24 14:44:39 +02:00
Viktor Lofgren
ad2ac8eee3 (query) Mark flaky test, correct assert on test 2024-04-24 14:44:39 +02:00
Viktor Lofgren
f46733a47a (ranking) TermCoherenceFactory should be run for size=2 queries 2024-04-24 14:44:39 +02:00
Viktor Lofgren
934167323d (converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation. 2024-04-24 14:44:39 +02:00
Viktor Lofgren
64baa41e64 (query) Always generate an ngram alternative, suppresses generation of multiple identical query branches 2024-04-24 14:44:39 +02:00
Viktor Lofgren
5165cf6d15 (ranking) Set regularMask correctly 2024-04-24 14:44:39 +02:00
Viktor Lofgren
4489b21528 (ranking) Cleanup 2024-04-24 14:44:39 +02:00
Viktor Lofgren
f623b37577 (ranking) Suppress NaN:s in ranking output 2024-04-24 14:44:39 +02:00
Viktor Lofgren
f4a2fea451 (ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N 2024-04-24 14:44:39 +02:00
Viktor Lofgren
a748fc5448 (index, bugfix) Pass url quality to query service 2024-04-24 14:44:39 +02:00
Viktor Lofgren
0dcca0cb83 (index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp 2024-04-24 14:44:39 +02:00
Viktor Lofgren
b80a83339b (qs) Additional info in query debug UI 2024-04-24 14:44:39 +02:00
Viktor Lofgren
eb74d08f2a (qs) Additional info in query debug UI 2024-04-24 14:44:39 +02:00
Viktor Lofgren
e79ab0c70e (qs) Basic query debug feature 2024-04-24 14:44:39 +02:00
Viktor Lofgren
e419e26f3a (proto) Improve handling of omitted parameters 2024-04-24 14:44:39 +02:00
Viktor Lofgren
6102fd99bf (qs) Improve logging 2024-04-24 14:44:39 +02:00
Viktor Lofgren
def36719d3 (query) Minor code cleanup 2024-04-24 14:44:39 +02:00
Viktor Lofgren
462aa9af26 (query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a09c84e1b8 (query) Modify tokenizer to match the behavior of the sentence extractor
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44b33798f3 (index) Clean up jaccard index term code and down-tune the parameter's importance a bit 2024-04-24 14:44:39 +02:00
Viktor Lofgren
2f0b648fad (index) Add jaccard index term to boost results based on term overlap 2024-04-24 14:44:39 +02:00
Viktor Lofgren
de0e56f027 (index) Remove position overlap check, coherences will do the work instead 2024-04-24 14:44:39 +02:00
Viktor Lofgren
973ced7b13 (index) Omit absent terms from coherence checks 2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb4b824a85 (index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus 2024-04-24 14:44:39 +02:00
Viktor Lofgren
c583a538b1 (search) Add implicit coherence constraints based on segmentation 2024-04-24 14:44:39 +02:00
Viktor Lofgren
e0224085b4 (index) Improve recall for small queries
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44c1e1d6d9 (index) Remove dead code
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c620e9c026 (index) Experimental performance regression fix 2024-04-24 14:44:39 +02:00
Viktor Lofgren
1bb88968c5 (test) Fix broken test 2024-04-24 14:44:39 +02:00
Viktor Lofgren
df75e8f4aa (index) Explicitly free LongQueryBuffers 2024-04-24 14:44:39 +02:00
Viktor Lofgren
adf846bfd2 (index) Fix term coherence evaluation
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
1748fcc5ac (valuation) Impose stronger constraints on locality of terms
Clean up logic a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
08416393e0 (valuation) Impose stronger constraints on locality of terms 2024-04-24 14:44:39 +02:00
Viktor Lofgren
fce26015c9 (encyclopedia) Index the full articles
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits.  This was not a good idea, so the change is reverted.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
155be1078d (index) Fix priority search terms
This functionality fell into disrepair some while ago.  It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6efc0f21fe (index) Clean up data model
The change set cleans up the data model for the term-level data.  This used to contain a bunch of fields with document-level metadata.  This data-duplication means a larger memory footprint and worse memory locality.

The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking.  This is again an effort to improve memory locality.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f3255e080d (ngram) Grab titles separately when extracting ngrams from wiki data 2024-04-24 14:44:39 +02:00
Viktor Lofgren
0da03d4cfc (zim) Fix title extractor 2024-04-24 14:44:39 +02:00
Viktor Lofgren
5f6a3ef9d0 (ngram) Correct |s|^|s|-normalization to use length and not count 2024-04-24 14:44:39 +02:00
Viktor Lofgren
afc4fed591 (ngram) Correct size value in ngram lexicon generation, trim the terms better 2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb505f98ef (ngram) Use simple blocking pool instead of FJP; split on underscores in article names. 2024-04-24 14:44:39 +02:00
Viktor Lofgren
a0b3634cb6 (ngram) Only extract frequencies of title words, but use the body to increment the counters...
The sign of the counter is used to indicate whether a term has appeared as title.  Until it's seen in the title, it's provisionally saved as a negative count.
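
The signed-counter idea above can be sketched as follows. This is an illustration only, not the project's lexicon code: the class `SignedTitleCounterSketch` and its method names are hypothetical, and the sketch assumes that a title sighting also counts as one occurrence, so that a count of zero never needs a sign.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the actual ngram extraction code.
final class SignedTitleCounterSketch {
    // Negative value: term seen only in bodies so far (provisional).
    // Positive value: term has appeared in at least one title.
    private final Map<String, Integer> counts = new HashMap<>();

    void seenInBody(String term) {
        // Grow the magnitude by one while keeping the current sign;
        // a previously unseen term starts out provisional at -1.
        counts.merge(term, -1, (old, ignored) -> old > 0 ? old + 1 : old - 1);
    }

    void seenInTitle(String term) {
        // Flip a provisional count to positive and count this sighting too
        // (assumption of this sketch, see the lead-in).
        counts.merge(term, 1, (old, ignored) -> Math.abs(old) + 1);
    }

    // Frequency as exported: terms never seen in a title report zero.
    int titleTermFrequency(String term) {
        return Math.max(0, counts.getOrDefault(term, 0));
    }
}
```

A single `int` per term carries both the count and the "seen as title" flag, which avoids a second map or a wrapper object per term.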
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e23359bae9 (query, minor) Remove debug statement 2024-04-24 14:44:39 +02:00
Viktor Lofgren
5531ed632a (query, minor) Remove debug statement 2024-04-24 14:44:39 +02:00
Viktor Lofgren
150ee21f3c (ngram) Clean up ngram lexicon code
This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
c96da0ce1e (segmentation) Pick best segmentation using |s|^|s|-style normalization
This is better than doing all segmentations possible at the same time.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
a0d9e66ff7 (ngram) Fix index range in NgramLexicon to avoid an exception 2024-04-24 14:44:38 +02:00
Viktor Lofgren
55f627ed4c (index) Clean up the code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
7dd8c78c6b (ngrams) Remove the vestigial logic for capturing permutations of n-grams
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8bf7d090fd (qs) Clean up parsing code using new record matching 2024-04-24 14:44:38 +02:00
Viktor Lofgren
6bfe04b609 (term-freq-exporter) Reduce thread count and memory usage 2024-04-24 14:44:38 +02:00
Viktor Lofgren
491d6bec46 (term-freq-exporter) Extract ngrams in term-frequency-exporter 2024-04-24 14:44:38 +02:00
Viktor Lofgren
4fb86ac692 (search) Fix outdated assumptions about the results
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.

For the API service, we'll simulate the old behavior to keep the API stable.

For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6cba6aef3b (minor) Remove dead code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
7e216db463 (index) Add origin trace information for index readers
This used to be supported by the system but got lost in refactoring at some point.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
adc90c8f1e (sentence-extractor) Fix resource leak in sentence extractor
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.

The modified behavior checks for nullity before creating a new instance.
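
A minimal sketch of the null-check fix described above. Only the field name `ngramLexicon` comes from the commit message; `ExpensiveModel`, the construction counter, and the class name are hypothetical stand-ins, and the `synchronized` block is an assumption added here for thread safety.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not the actual sentence extractor.
final class LazyStaticInitSketch {
    // Counts constructions so the effect of the null check is observable.
    static final AtomicInteger constructions = new AtomicInteger();

    // Stand-in for a large, expensive-to-load model.
    static final class ExpensiveModel {
        ExpensiveModel() { constructions.incrementAndGet(); }
    }

    // Shared across all instances; must only be created once.
    private static ExpensiveModel ngramLexicon;

    LazyStaticInitSketch() {
        synchronized (LazyStaticInitSketch.class) {
            // Without this check, every new instance would allocate a fresh model.
            if (ngramLexicon == null) {
                ngramLexicon = new ExpensiveModel();
            }
        }
    }
}
```

Creating several instances then constructs the model only once, whereas an unconditional assignment in the constructor would allocate one model per instance.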
2024-04-24 14:44:38 +02:00
Viktor Lofgren
e3316a3672 (index) Clean up new index query code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
a3a6d6292b (qs, index) New query model integrated with index service.
Seems to work, tests are green and initial testing finds no errors.  Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8cb9455c32 (qs, WIP) Fix edge cases in query compilation
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w).  The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
dc65b2ee01 (qs, WIP) Clean up dead code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
98a1adbf81 (qs, WIP) Tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
0bd1e15cce (qs, WIP) Tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
eda926767e (qs, WIP) Tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
cd1a18c045 (qs, WIP) Break up code and tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
6f567fbea8 (qs, WIP) Fix output determinism, fix tests 2024-04-24 14:44:38 +02:00
Viktor Lofgren
0ebadd03a5 (WIP) Query rendering finally beginning to look like it works 2024-04-24 14:44:38 +02:00
Viktor Lofgren
2253b556b2 WIP 2024-04-24 14:44:17 +02:00
Viktor Lofgren
6a7a7009c7 (convert) Initial integration of segmentation data into the converter's keyword extraction logic 2024-04-24 14:44:17 +02:00
Viktor Lofgren
3c75057dcd (qs) Retire NGramBloomFilter, integrate new segmentation model instead 2024-04-24 14:44:17 +02:00
Viktor Lofgren
212d101727 (control) GUI for exporting segmentation data from a wikipedia zim 2024-04-24 14:44:17 +02:00
Viktor Lofgren
760b80659d (WIP) Partial integration of new query expansion code into the query-service 2024-04-24 14:44:17 +02:00
Viktor Lofgren
04879c005d (WIP) Improve data extraction from wikipedia data 2024-04-24 14:44:17 +02:00
Viktor Lofgren
cb82927756 (WIP) Implement first take of new query segmentation algorithm 2024-04-24 14:44:17 +02:00
Viktor Lofgren
8b9629f2f6 (crawler) Remove unnecessary double-fetch of the root document 2024-04-24 14:38:59 +02:00
Viktor Lofgren
f6db16b313 (crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber 2024-04-24 14:10:03 +02:00
Viktor Lofgren
4668b1ddcb (build) Java 22 and its consequences has been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
dcf9d9caad (crawler) Emulate if-modified-since for domains that don't support the header
This will help reduce the strain on some server software, in particular Discourse.
2024-04-22 17:26:31 +02:00
Viktor Lofgren
7a69b76001 (crawler) Remove accidental log spam 2024-04-22 15:51:37 +02:00
Viktor Lofgren
ac07ef822f (crawler) Code quality 2024-04-22 15:37:35 +02:00
Viktor Lofgren
e7d4bcd872 (crawler) Use the probe-result to reduce the likelihood of crawling both http and https
This should drastically reduce the number of fetched documents on many domains
2024-04-22 15:36:43 +02:00
Viktor Lofgren
a28c6d7cfe (crawler) Strip W/-prefix from the etag when supplied as If-None-Match 2024-04-22 14:31:05 +02:00
Viktor Lofgren
d816f048f5 (crawler) Ensure all appropriate headers are recorded on the request 2024-04-22 14:14:24 +02:00
Viktor Lofgren
b09ddd0036 (crawler/converter) Remove legacy junk from parquet migration 2024-04-22 12:34:28 +02:00
Viktor Lofgren
0a73b02a00 (query) Mark flaky test, correct assert on test 2024-04-21 12:30:14 +02:00
Viktor Lofgren
8769704462 (ranking) TermCoherenceFactory should be run for size=2 queries 2024-04-21 12:29:25 +02:00
Viktor Lofgren
214551f1df (converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation. 2024-04-19 20:36:01 +02:00
Viktor Lofgren
2cc74c005a (query) Always generate an ngram alternative, suppresses generation of multiple identical query branches 2024-04-19 19:42:30 +02:00
Viktor Lofgren
ed250f57f2 (ranking) Set regularMask correctly 2024-04-19 14:31:57 +02:00
Viktor Lofgren
e92c25f7e0 (ranking) Cleanup 2024-04-19 14:13:12 +02:00
Viktor Lofgren
3ab563f314 (ranking) Suppress NaN:s in ranking output 2024-04-19 13:58:28 +02:00
Viktor Lofgren
426338cb45 (ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N 2024-04-19 12:41:48 +02:00
Viktor Lofgren
5fa2375898 (index, bugfix) Pass url quality to query service 2024-04-19 12:41:26 +02:00
Viktor Lofgren
41782a0ab5 (index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp 2024-04-19 12:19:26 +02:00
Viktor Lofgren
9b06433b82 (qs) Additional info in query debug UI 2024-04-19 12:18:53 +02:00
Viktor Lofgren
def607d840 (qs) Additional info in query debug UI 2024-04-19 11:46:27 +02:00
Viktor Lofgren
2b811fb422 (qs) Basic query debug feature 2024-04-19 11:00:56 +02:00
Viktor Lofgren
36cc62c10c (proto) Improve handling of omitted parameters 2024-04-18 10:47:12 +02:00
Viktor Lofgren
975d92912c (qs) Improve logging 2024-04-18 10:44:08 +02:00
Viktor Lofgren
8bbaf457de (query) Minor code cleanup 2024-04-18 10:37:51 +02:00
Viktor Lofgren
7641a02f31 (query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-18 10:36:15 +02:00
Viktor Lofgren
ce16239e34 (query) Modify tokenizer to match the behavior of the sentence extractor
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-17 17:54:32 +02:00
Viktor Lofgren
d64bd227cf (index) Clean up jaccard index term code and down-tune the parameter's importance a bit 2024-04-17 17:40:16 +02:00
Viktor Lofgren
c5ab0a9054 (index) Add jaccard index term to boost results based on term overlap 2024-04-17 16:50:26 +02:00
Viktor Lofgren
dac948973d (index) Remove position overlap check, coherences will do the work instead 2024-04-17 14:20:01 +02:00
Viktor Lofgren
9d008d1d6f (index) Omit absent terms from coherence checks 2024-04-17 14:12:16 +02:00
Viktor Lofgren
f52457213e (index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus 2024-04-17 14:05:02 +02:00
Viktor Lofgren
579295a673 (search) Add implicit coherence constraints based on segmentation 2024-04-17 14:03:35 +02:00
Viktor Lofgren
af8ff8ce99 (index) Improve recall for small queries
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-16 22:51:03 +02:00
Viktor Lofgren
7fa3e86e64 (index) Remove dead code
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-16 19:59:27 +02:00
Viktor Lofgren
3359f72239 (index) Experimental performance regression fix 2024-04-16 19:48:14 +02:00
Viktor Lofgren
41fa154aa6 (test) Fix broken test 2024-04-16 19:48:14 +02:00
Viktor Lofgren
deaba0152d (index) Explicitly free LongQueryBuffers 2024-04-16 19:23:00 +02:00
Viktor Lofgren
feaef6093e (index) Fix term coherence evaluation
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-16 18:07:43 +02:00
Viktor Lofgren
078fa4fdd0 (valuation) Impose stronger constraints on locality of terms
Clean up logic a bit
2024-04-16 17:22:58 +02:00
Viktor Lofgren
2dc77a0638 (valuation) Impose stronger constraints on locality of terms 2024-04-16 17:15:21 +02:00
Viktor
cfd9a7187f (query-segmentation) Merge pull request #89 from MarginaliaSearch/query-segmentation
The changeset cleans up the query parsing logic in the query service. It gets rid of a lot of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and adds a new cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term.

A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model.

A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This is for the benefit of easily writing rules for generating alternative queries, e.g. using the new segmentation data.

The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine".

This compiled query is passed to the index, which parses the expression, where it is used for execution of the search and ranking of the results.
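
The expansion in the example above can be sketched with a small recursive-descent parser. This is an illustration under assumptions, not the project's parser: the class `QueryExpansionSketch` is hypothetical, and the grammar (alternatives separated by `|`, grouping with parentheses, juxtaposition as sequence) is inferred from the example alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the actual compiled-query parser.
final class QueryExpansionSketch {
    private final String[] tokens;
    private int pos = 0;

    private QueryExpansionSketch(String expr) {
        // Pad parentheses and bars so a whitespace split yields clean tokens.
        this.tokens = expr.replace("(", " ( ").replace(")", " ) ").replace("|", " | ")
                .trim().split("\\s+");
    }

    static List<String> expand(String expr) {
        QueryExpansionSketch parser = new QueryExpansionSketch(expr);
        List<String> result = parser.parseAlternatives();
        if (parser.pos != parser.tokens.length)
            throw new IllegalArgumentException("Unexpected token: " + parser.tokens[parser.pos]);
        return result;
    }

    // alternatives := sequence ( '|' sequence )*
    private List<String> parseAlternatives() {
        List<String> out = new ArrayList<>(parseSequence());
        while (pos < tokens.length && tokens[pos].equals("|")) {
            pos++;
            out.addAll(parseSequence());
        }
        return out;
    }

    // sequence := ( WORD | '(' alternatives ')' )*
    private List<String> parseSequence() {
        List<String> prefixes = new ArrayList<>(List.of(""));
        while (pos < tokens.length && !tokens[pos].equals("|") && !tokens[pos].equals(")")) {
            List<String> choices;
            if (tokens[pos].equals("(")) {
                pos++;
                choices = parseAlternatives();
                if (pos >= tokens.length || !tokens[pos].equals(")"))
                    throw new IllegalArgumentException("Missing closing parenthesis");
                pos++;
            } else {
                choices = List.of(tokens[pos++]);
            }
            // Cartesian product of everything so far with the new choices.
            List<String> next = new ArrayList<>();
            for (String prefix : prefixes)
                for (String choice : choices)
                    next.add(prefix.isEmpty() ? choice : prefix + " " + choice);
            prefixes = next;
        }
        return prefixes;
    }
}
```

Calling `QueryExpansionSketch.expand("( wiby | marginalia | kagi ) ( search engine | searchengine )")` yields the six queries listed above, in the same order.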
2024-04-16 15:31:05 +02:00
Viktor Lofgren
f434a8b492 (build) Upgrade jib plugin version 2024-04-16 15:25:23 +02:00
Viktor Lofgren
d2658d6f84 (sys) Add springboard service that can spawn multiple different marginalia services to make distribution easier. 2024-04-16 13:25:15 +02:00
Viktor Lofgren
8c559c8121 (conf) Add additional logic for discovering system root 2024-04-16 12:37:18 +02:00
Viktor Lofgren
2353c73c57 (encyclopedia) Index the full articles
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits.  This was not a good idea, so the change is reverted.
2024-04-16 12:10:13 +02:00
Viktor Lofgren
599e719ad4 (index) Fix priority search terms
This functionality fell into disrepair some while ago.  It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-15 16:44:08 +02:00
Viktor Lofgren
b6d365bacd (index) Clean up data model
The change set cleans up the data model for the term-level data.  This used to contain a bunch of fields with document-level metadata.  This data-duplication means a larger memory footprint and worse memory locality.

The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking.  This is again an effort to improve memory locality.
2024-04-15 16:04:07 +02:00
Viktor Lofgren
52f0c0d336 (ngram) Grab titles separately when extracting ngrams from wiki data 2024-04-13 19:34:16 +02:00
Viktor Lofgren
be55f3f937 (zim) Fix title extractor 2024-04-13 19:33:47 +02:00
Viktor Lofgren
fda1c05164 (ngram) Correct |s|^|s|-normalization to use length and not count 2024-04-13 18:05:30 +02:00
Viktor Lofgren
1329d4abd8 (ngram) Correct size value in ngram lexicon generation, trim the terms better 2024-04-13 17:51:02 +02:00
Viktor Lofgren
f064992137 (ngram) Use simple blocking pool instead of FJP; split on underscores in article names. 2024-04-13 17:07:23 +02:00
Viktor Lofgren
8a81a480a1 (ngram) Only extract frequencies of title words, but use the body to increment the counters...
The sign of the counter is used to indicate whether a term has appeared as title.  Until it's seen in the title, it's provisionally saved as a negative count.
2024-04-12 18:08:31 +02:00
Viktor Lofgren
d729c400e5 (query, minor) Remove debug statement 2024-04-12 17:52:55 +02:00
Viktor Lofgren
ad4810d991 (query, minor) Remove debug statement 2024-04-12 17:45:26 +02:00
Viktor Lofgren
6a67043537 (ngram) Clean up ngram lexicon code
This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.
2024-04-12 17:45:06 +02:00
Viktor Lofgren
864d6c28e7 (segmentation) Pick best segmentation using |s|^|s|-style normalization
This is better than doing all segmentations possible at the same time.
2024-04-12 17:44:14 +02:00
Viktor Lofgren
bb6b51ad91 (ngram) Fix index range in NgramLexicon to avoid an exception 2024-04-12 10:13:25 +02:00
Viktor Lofgren
65e3caf402 (index) Clean up the code 2024-04-11 18:50:21 +02:00
Viktor Lofgren
b7d9a7ae89 (ngrams) Remove the vestigial logic for capturing permutations of n-grams
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-11 18:12:01 +02:00
Viktor Lofgren
ed73d79ec1 (qs) Clean up parsing code using new record matching 2024-04-11 17:36:08 +02:00
Viktor Lofgren
c538c25008 (term-freq-exporter) Reduce thread count and memory usage 2024-04-10 17:11:23 +02:00
Viktor Lofgren
4b47fadbab (term-freq-exporter) Extract ngrams in term-frequency-exporter 2024-04-10 16:58:05 +02:00
Viktor Lofgren
fcdc843c15 (search) Fix outdated assumptions about the results
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.

For the API service, we'll simulate the old behavior to keep the API stable.

For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-07 12:09:44 +02:00
Viktor Lofgren
dbdcf459a7 (minor) Remove dead code 2024-04-06 16:27:16 +02:00
Viktor Lofgren
ef25d60666 (index) Add origin trace information for index readers
This used to be supported by the system but got lost in refactoring at some point.
2024-04-06 13:28:14 +02:00
Viktor Lofgren
7f7021ce64 (sentence-extractor) Fix resource leak in sentence extractor
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.

The modified behavior checks for nullity before creating a new instance.
2024-04-05 18:52:58 +02:00
Viktor Lofgren
448a941de2 (encyclopedia) Fix memory issue in preconversion step
Use SimpleBlockingThreadPool instead of Java's work-stealing pool, as the latter causes runaway memory consumption in some circumstances, while SimpleBlockingThreadPool uses a bounded queue and always pushes back against the supplier if it can't hold any more tasks.
2024-04-05 16:57:53 +02:00
Viktor Lofgren
5766da69ec (gradle) Upgrade to Gradle 8.7
This will reduce the hassle of juggling JDK versions for JDK 22, which was not supported by Gradle 8.5.
2024-04-05 15:15:49 +02:00
Joshua Holland
617e633d7a Update keywords docs use of explore to browse
I can't tell when this happened, but the proper keyword now seems to be browse and not explore.
2024-04-05 15:15:49 +02:00
Viktor Lofgren
b770a1143f (run) Fix traefik middleware configuration 2024-04-05 15:15:49 +02:00
Viktor Lofgren
e1151ecf2a (gradle) Upgrade to Gradle 8.7
This will reduce the hassle of juggling JDK versions for JDK 22, which was not supported by Gradle 8.5.
2024-04-05 15:12:38 +02:00
Viktor Lofgren
ae7c760772 (index) Clean up new index query code 2024-04-05 13:30:49 +02:00
Viktor Lofgren
81815f3e0a (qs, index) New query model integrated with index service.
Seems to work, tests are green and initial testing finds no errors.  Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-04 20:17:58 +02:00
Viktor
3890c413a3 Merge pull request #88 from jmholla/patch-1
Update keywords docs use of explore to browse
2024-04-01 09:14:02 +02:00
Joshua Holland
8e02f567d7 Update keywords docs use of explore to browse
I can't tell when this happened, but the proper keyword now seems to be browse and not explore.
2024-04-01 00:04:12 -05:00
Viktor Lofgren
87bb93e1d4 (qs, WIP) Fix edge cases in query compilation
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w).  The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-03-29 12:40:27 +01:00
Viktor Lofgren
e596c929ac (qs, WIP) Clean up dead code 2024-03-28 16:37:23 +01:00
Viktor Lofgren
9852b0e609 (qs, WIP) Tidy it up a bit 2024-03-28 14:18:26 +01:00
Viktor Lofgren
51b0d6c0d3 (qs, WIP) Tidy it up a bit 2024-03-28 14:09:17 +01:00
Viktor Lofgren
15391c7a88 (qs, WIP) Tidy it up a bit 2024-03-28 13:54:30 +01:00
Viktor Lofgren
fe62593286 (qs, WIP) Break up code and tidy it up a bit 2024-03-28 13:26:54 +01:00
Viktor Lofgren
4cc11e183c (qs, WIP) Fix output determinism, fix tests 2024-03-28 13:11:26 +01:00
Viktor Lofgren
de8e753fc8 (run) Fix traefik middleware configuration 2024-03-28 13:03:12 +01:00
Viktor Lofgren
f82ebd7716 (WIP) Query rendering finally beginning to look like it works 2024-03-28 13:01:21 +01:00
Viktor Lofgren
bd0704d5a4 (*) Fix JDK22 migration issues
A few bizarre build errors cropped up when migrating to JDK22.  Not at all sure what caused them, but they were easy to mitigate.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
1968485881 (docs) Upgrade to JDK22 2024-03-21 14:33:27 +01:00
Viktor Lofgren
002afca1c5 (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:33:27 +01:00
Your Name
411b3f3138 (run/install.sh) fix docker compose file
I was following the release demo video for v2024.01.0
https://www.youtube.com/watch?v=PNwMkenQQ24 and when I did 'docker
compose up' the containers couldn't resolve the DNS name for 'zookeeper'.
I realized this was because the zookeeper container was using the
default docker network, so I specified the wmsa network explicitly.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
a4b810f511 WIP 2024-03-21 14:33:26 +01:00
Viktor
cd8f33f830 Merge pull request #86 from MarginaliaSearch/jdk-22
Lift JDK version to 22
2024-03-21 14:29:41 +01:00
Viktor Lofgren
824765b1ee (*) Fix JDK22 migration issues
A few bizarre build errors cropped up when migrating to JDK22.  Not at all sure what caused them, but they were easy to mitigate.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
9e8138f853 (docs) Upgrade to JDK22 2024-03-21 14:27:13 +01:00
Viktor Lofgren
fe8d583fdd (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
0bd3365c24 (convert) Initial integration of segmentation data into the converter's keyword extraction logic 2024-03-19 14:28:42 +01:00
Viktor Lofgren
d8f4e7d72b (qs) Retire NGramBloomFilter, integrate new segmentation model instead 2024-03-19 10:42:09 +01:00
Viktor Lofgren
afc047cd27 (control) GUI for exporting segmentation data from a wikipedia zim 2024-03-18 13:45:23 +01:00
Viktor Lofgren
00ef4f9803 (WIP) Partial integration of new query expansion code into the query-service 2024-03-18 13:16:49 +01:00
Viktor Lofgren
07e4d7ec6d (WIP) Improve data extraction from wikipedia data 2024-03-18 13:16:00 +01:00
Viktor
258a344810 Merge pull request #85 from patrickbreen/master
(run/install.sh) fix docker compose file
2024-03-18 13:09:30 +01:00
Your Name
2a03014652 (run/install.sh) fix docker compose file
I was following the release demo video for v2024.01.0
https://www.youtube.com/watch?v=PNwMkenQQ24 and when I did 'docker
compose up' the containers couldn't resolve the DNS name for 'zookeeper'.
I realized this was because the zookeeper container was using the
default docker network, so I specified the wmsa network explicitly.
2024-03-17 15:33:19 -04:00
Viktor Lofgren
8ae1f08095 (WIP) Implement first take of new query segmentation algorithm 2024-03-12 13:12:50 +01:00
Viktor Lofgren
57e6a12d08 (registry) Correct registerMonitor() behavior
The previous behavior would listen to too many changes and, based on assumptions about zookeeper's behavior rather than curator's, add an additional monitor on each invocation of each monitor (which always triggers on service state changes). Each monitor thus re-registered itself, effectively doubling the number of monitors whenever a service stopped or started, which in turn meant a lot of bizarre thrashing behavior even on changes in services that don't explicitly talk to each other.

This re-registering behavior is no longer done.
2024-03-06 12:22:15 +01:00
Viktor Lofgren
46423612e3 (refac) Merge service-discovery and service modules
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
29bf473d74 (encyclopedia) Add URLencoding to path element
This prevents corruption of the links to the sideloaded encyclopedia data when the article path contains characters that are not valid in a URL.
2024-03-01 17:28:09 +01:00
Viktor Lofgren
9689f3faee (domain-info) Fix incorrect array indexing 2024-02-29 18:56:09 +01:00
Viktor Lofgren
93fa58c93d (domain-info) Fix incorrect array indexing
Using the id instead of idx when addressing the ranksArray caused exceptions.
2024-02-29 17:54:23 +01:00
Viktor Lofgren
186a98cc99 (doc) Fix wonky bullet lists 2024-02-28 17:43:05 +01:00
Viktor Lofgren
9993f265ca (doc) Remove irrelevant text 2024-02-28 17:40:05 +01:00
Viktor Lofgren
144f967dbf (misc) Tweak pool sizes 2024-02-28 16:23:02 +01:00
Viktor Lofgren
b31c9bb726 (docs) Update process docs 2024-02-28 15:21:33 +01:00
Viktor Lofgren
c0820b5e5c (docs) Update service docs 2024-02-28 15:19:31 +01:00
Viktor Lofgren
65b8a1d5d9 (grpc) Reduce error spam 2024-02-28 14:44:48 +01:00
Viktor Lofgren
a0648844fb (grpc) Reduce error spam 2024-02-28 14:35:29 +01:00
Viktor Lofgren
c4a27003c6 (docs) Fix formatting 2024-02-28 14:22:57 +01:00
Viktor Lofgren
41abd8982f (math) Clean up error handling 2024-02-28 14:19:50 +01:00
Viktor Lofgren
86bbc1043e (service) Clean up thread pool creation 2024-02-28 14:06:32 +01:00
Viktor Lofgren
9a045a0588 (index) Clean up index code 2024-02-28 13:09:47 +01:00
Viktor Lofgren
9415539b38 (docs) Update docs 2024-02-28 12:25:19 +01:00
Viktor Lofgren
84bab2783d (docs) Fix fake news in docs 2024-02-28 12:16:45 +01:00
Viktor
0d6e7673e4 Merge pull request #81 from MarginaliaSearch/service-discovery
Zookeeper for service-discovery, kill service-client lib, refactor everything
2024-02-28 12:15:25 +01:00
Viktor Lofgren
d78e9e715f (misc) Fix broken tests 2024-02-28 12:12:43 +01:00
Viktor Lofgren
a8ec59eb75 (conf) Add migration warning when ZOOKEEPER_HOSTS is not set. 2024-02-28 12:09:38 +01:00
Viktor Lofgren
20fc0ef13c (gradle) Add task alias 'docker' for 'jibDockerBuild'
The change also moves the jib boilerplate to an include.
2024-02-28 11:59:15 +01:00
Viktor Lofgren
37ae8cb33c Migrate the docker compose files 2024-02-28 11:48:16 +01:00
Viktor Lofgren
9f1649636e Clean up documentation and rename domain-links to link-graph 2024-02-28 11:40:39 +01:00
Viktor Lofgren
3a65fe8917 Add offload executor to GrpcChannelPoolFactory 2024-02-27 22:08:39 +01:00
Viktor Lofgren
99a6e56e99 (index-client) Increase thread count in index client
This should be a fair bit larger than the number of index nodes
2024-02-27 22:00:29 +01:00
Viktor Lofgren
e696fd9e92 (docs) Begin un-fucking the docs after refactoring 2024-02-27 21:22:21 +01:00
Viktor Lofgren
c943954bb4 (domain-info) Reduce memory usage 2024-02-27 21:22:21 +01:00
Viktor Lofgren
eaf836dc66 (service/grpc) Reduce thread count
Netty and GRPC by default spawn an incredible number of threads on high-core CPUs, which amounts to a fair bit of RAM usage.

Add custom executors that throttle this behavior.
2024-02-27 21:22:21 +01:00
Viktor Lofgren
dbf64b0987 (logs) Add the option for json logging 2024-02-27 21:22:20 +01:00
Viktor Lofgren
8d0af9548b (search) Bot mitigation
Add the ability to indicate to the search service that a request is malicious, and to poison the results by providing randomly reordered old results instead.
2024-02-27 21:22:19 +01:00
Viktor Lofgren
67aa20ea2c (array) Attempting to debug strange errors 2024-02-27 21:22:18 +01:00
Viktor Lofgren
5604e9f531 (query) Bump query length, see what happens :P 2024-02-27 21:22:17 +01:00
Viktor Lofgren
1a51ec2d69 (index) Index optimization 2024-02-27 21:22:17 +01:00
Viktor Lofgren
3eb0800742 (index) Improve granularity of candidate queue polling 2024-02-27 21:22:17 +01:00
Viktor Lofgren
427f3e922f (index) Retire count operation, clean up index code. 2024-02-27 21:22:17 +01:00
Viktor Lofgren
823ca73a3f (domain-ranking) Fix a crash during ranking when the edges of the similarity graph don't quite match the vertices of the link graph. 2024-02-27 21:22:17 +01:00
Viktor Lofgren
7fc0d4d786 (index) Observability for query execution queues 2024-02-27 21:22:17 +01:00
Viktor Lofgren
b8e336e809 (index) Reduce time allocation a bit 2024-02-27 21:22:17 +01:00
Viktor Lofgren
9429bf5c45 (index) Clean up 2024-02-27 21:22:17 +01:00
Viktor Lofgren
f7f0100174 (build) Make docker image registry and tag configurable in root build.gradle 2024-02-25 11:08:49 +01:00
Viktor Lofgren
fc00701a1e (index) Experimental refactoring of the indexing functionality 2024-02-25 11:05:10 +01:00
Viktor Lofgren
09447f2ad2 (process service) Inherit parent's assertion status 2024-02-24 18:32:37 +01:00
Viktor Lofgren
ff0ef1eebc (cleanup) Minor cleanups 2024-02-24 15:33:56 +01:00
Viktor Lofgren
1d34224416 (refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.

While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules.  Which you'll do a lot, because it's *modul*ar.  The src/main/java convention makes a lot of sense for a non-modular project though.  This ain't that.
2024-02-23 16:13:40 +01:00
Viktor Lofgren
56d35aa596 (refac) Move execution API out of executor service 2024-02-23 13:26:11 +01:00
Viktor Lofgren
2201b1a506 (refac) Clean up code issues 2024-02-23 11:39:19 +01:00
Viktor Lofgren
5cdb07023b (refac) Clean up unused imports 2024-02-23 11:27:20 +01:00
Viktor Lofgren
6154e16951 (refac) Remove "distPath" 2024-02-23 11:22:02 +01:00
Viktor Lofgren
f4ff7185f0 (refac) Move process-mqapi out of api directory 2024-02-23 11:18:29 +01:00
Viktor Lofgren
6357d30ea0 Clean up docs 2024-02-22 19:53:20 +01:00
Viktor Lofgren
8d4ef982d0 Clean up docs 2024-02-22 19:37:59 +01:00
Viktor Lofgren
4740156cfa Clean up docs 2024-02-22 18:18:58 +01:00
Viktor Lofgren
f8e7f75831 Move index to top level of code 2024-02-22 18:01:35 +01:00
Viktor Lofgren
085137ca63 * Extract the index functionality 2024-02-22 17:31:25 +01:00
Viktor Lofgren
3fd2a83184 * Extract the search-query function 2024-02-22 15:27:39 +01:00
Viktor Lofgren
66c1281301 (zk-registry) epic yak shaving WIP
Cleaning out a lot of old junk from the code, and one thing led to another...

* Build is improved, now constructing docker images with 'jib'.  Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter.  Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```
channelPool.call(grpcStub::method)
           .async(executor) // <-- optional
           .run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services.  Theoretically this means we could ship the same code either as a monolith or as a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service.  WIP!

Missing is documentation and testing, and some more breaking apart of code.
2024-02-22 14:01:23 +01:00
Viktor Lofgren
73947d9eca (zk-registry) Filter out phantom addresses in the registry
The change adds a hostname validation step to remove endpoints from the ZkServiceRegistry when they do not resolve.  This scenario primarily happens when running in docker and the entire system is stopped and started.
2024-02-20 18:09:11 +01:00
Viktor Lofgren
a69c0b2718 (grpc-client) Fix warmup crash
The warmup would sometimes crash during a cold start-up, because it could not get an API.  Changed the warmup to just create a GrpcSingleNodeChannelPool for the node.
2024-02-20 18:03:57 +01:00
Viktor Lofgren
6c764bceeb (doc) Update documentation for service-discovery 2024-02-20 16:09:49 +01:00
Viktor Lofgren
273aeb7bae (doc) Update documentation with new gRPC service setup 2024-02-20 16:06:05 +01:00
Viktor Lofgren
d185858266 (minor) Add missing query parameter to ServiceEndpoint.toURL 2024-02-20 15:49:43 +01:00
Viktor Lofgren
453bd6064b (minor) Add warm-up to GrpcMultiNodeChannelPool to speed up the initial messages
Without doing this, connections would be created lazily, which is probably never desirable.
2024-02-20 15:45:16 +01:00
Viktor Lofgren
904f2587cd (minor) Add default ZOOKEEPER_HOSTS to service.env 2024-02-20 15:44:26 +01:00
Viktor Lofgren
14172312dc (query-client) Fix query client
The query service delegates and aggregates IndexDomainLinksApiGrpc
messages to the index services.  The query client was accidentally
also doing this, instead of talking to the query service.

Fixed so it correctly talks to the query service and nothing else.
2024-02-20 15:44:07 +01:00
Viktor Lofgren
c600d7aa47 (refac) Inject ServiceRegistry into WebsiteAdjacenciesCalculator 2024-02-20 15:42:32 +01:00
Viktor Lofgren
3c9234078a (refac) Propagate ZOOKEEPER_HOSTS to spawned processes 2024-02-20 15:42:16 +01:00
Viktor Lofgren
ee8e0497ae (refac) Move service discovery injection to a separate guice module 2024-02-20 15:41:04 +01:00
Viktor Lofgren
fd5d121648 (minor) Add WMSA_IN_DOCKER to all docker files 2024-02-20 15:39:46 +01:00
Viktor Lofgren
30bdb4b4e9 (config) Clean up service configuration for IP addresses
Adds new ways to configure the bind and external IP addresses for a service.  Notably, if the environment variable WMSA_IN_DOCKER is present, the system will grab the HOSTNAME variable and announce that as the external address in the service registry.

The default bind address is also changed to be 0.0.0.0 only if WMSA_IN_DOCKER is present, otherwise 127.0.0.1; as this is a more secure default.
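
The address selection described above could be sketched roughly as follows. The class, the method names, and the fallback parameter are assumptions for illustration; only the WMSA_IN_DOCKER and HOSTNAME variables and the two default addresses come from the message.

```java
import java.util.Map;

// Hypothetical sketch; not the project's actual configuration code.
final class ServiceAddresses {
    /** Binds to all interfaces only inside docker, otherwise to loopback. */
    static String defaultBindAddress(Map<String, String> env) {
        return env.containsKey("WMSA_IN_DOCKER") ? "0.0.0.0" : "127.0.0.1";
    }

    /** Announces HOSTNAME as the external address inside docker (the fallback is an assumption). */
    static String externalAddress(Map<String, String> env, String fallback) {
        if (env.containsKey("WMSA_IN_DOCKER")) {
            return env.getOrDefault("HOSTNAME", fallback);
        }
        return fallback;
    }
}
```

In a real service the map would presumably be `System.getenv()`; passing it in keeps the sketch easy to test.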
2024-02-20 14:22:48 +01:00
Viktor Lofgren
2ee492fb74 (gRPC) Bind gRPC services to an interface
By default, gRPC magically decides on an interface.  The change will explicitly tell it what to use.
2024-02-20 14:22:47 +01:00
Viktor Lofgren
36a5c8b44c (cleanup) Clean up code 2024-02-20 14:22:47 +01:00
Viktor Lofgren
07b625c58d (query-client) Add support for fault-tolerant requests to single node services
Adding a method importantCall that will retry a failing request on each route until it succeeds or the routes run out.
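
As an illustration of the retry behavior described above, a minimal sketch follows. The class name, the generic signature, and the exception handling are assumptions; the real importantCall may well look different.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; not the project's actual API.
final class RouteRetry {
    /** Tries the call against each route in order and returns the first successful result. */
    static <R, T> T callWithRetry(List<R> routes, Function<R, T> call) {
        RuntimeException lastError = null;
        for (R route : routes) {
            try {
                return call.apply(route);
            } catch (RuntimeException e) {
                lastError = e; // remember the failure and move on to the next route
            }
        }
        if (lastError == null) {
            throw new IllegalStateException("No routes available");
        }
        throw new IllegalStateException("All routes failed", lastError);
    }
}
```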
2024-02-20 14:16:05 +01:00
Viktor Lofgren
746a865106 (client) Fix handling of channel refreshes
The previous code made an incorrect assumption that all routes refer to the same node, and would overwrite the route list on each update.  This led to storms of closing and opening channels whenever an update was received.

The new code is correctly aware that we may talk to multiple nodes.
2024-02-20 14:14:09 +01:00
Viktor
f85ec28a16 Merge branch 'master' into service-discovery 2024-02-20 11:44:12 +01:00
Viktor Lofgren
0307c55f9f (refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.

A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.

The last remaining REST service, the assistant-service, has been migrated to gRPC.

This also proved a good time to clear out primordial technical debt from the root of the codebase.  The 'service-client' library has been taken behind the barn and given a last farewell.  It's replaced by a small library for managing gRPC channels.

Since it's no longer used by anything, RxJava has been removed as a dependency from the project.

Although the current state seems reasonably stable, this is a work-in-progress commit.
2024-02-20 11:41:14 +01:00
Viktor
d05c916491 Merge pull request #80 from MarginaliaSearch/ranking-algorithms
Clean up domain ranking code
2024-02-18 09:52:34 +01:00
Viktor Lofgren
c73e43f5c9 (recrawl) Mitigate recrawl-before-load footgun
In the scenario where an operator

* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data

The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file,
irrecoverably losing the crawl log and making the data impossible to load!

To mitigate the impact of similar problems, the change saves a backup of the old crawl log, as well as complains about this happening.

More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl.  This should help the DbCrawlSpecProvider to find them regardless of loaded state.

This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.
2024-02-18 09:23:20 +01:00
Viktor Lofgren
e61e7f44b9 (blacklist) Delay startup of blacklist
To help services start faster, the blacklist will no longer block until it's loaded.  If such a behavior is desirable, a method was added to explicitly wait for the data.
2024-02-18 09:23:20 +01:00
Viktor Lofgren
f9b6ac03c6 (api) Clean up incorrect error handling in GrpcChannelPool 2024-02-18 08:45:35 +01:00
Viktor Lofgren
296ccc5f8e (blacklist) Clean up blacklist impl
The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod.

This change moves the loading to a separate thread entirely.  For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.
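
A rough sketch of the pattern, assuming a CompletableFuture-based loader. The class and method names are hypothetical, and treating nothing as blacklisted before the data arrives is an assumption made for this example.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch; not the project's actual blacklist implementation.
final class LazyBlacklist {
    private final CompletableFuture<Set<String>> loaded;

    LazyBlacklist(Supplier<Set<String>> loader) {
        // Loading happens on a background thread, so construction returns immediately.
        this.loaded = CompletableFuture.supplyAsync(loader);
    }

    /** Blocks until the blacklist data is available. */
    void waitUntilLoaded() {
        loaded.join();
    }

    /** Non-blocking lookup; nothing is reported as blacklisted until loading completes. */
    boolean isBlacklisted(String domain) {
        return loaded.getNow(Set.of()).contains(domain);
    }
}
```

Callers that cannot tolerate an empty blacklist would call `waitUntilLoaded()` once before their first lookup.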
2024-02-18 08:16:48 +01:00
Viktor Lofgren
8cb5825617 (search) Temporarily disable the Popular filter
This filter currently does not distinguish itself very much from the unfiltered results, and lends the impression that the filters don't "do anything".

It may come back in some shape or form in the future, with some additional tweaking of the rankings...
2024-02-18 08:02:01 +01:00
Viktor Lofgren
cee707abd8 (crawler) Implement domain shuffling in DbCrawlSpecProvider
Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This change prevents overload of crawling the same server in parallel from different subdomains or crawling big domains all at once.
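
The shuffling step amounts to something like the sketch below. The class and method names are hypothetical, and the injectable Random is an assumption made for testability.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; not the DbCrawlSpecProvider code itself.
final class DomainShuffle {
    /** Returns a shuffled copy so related subdomains are unlikely to be crawled back to back. */
    static List<String> shuffled(List<String> domains, Random random) {
        List<String> copy = new ArrayList<>(domains);
        Collections.shuffle(copy, random);
        return copy;
    }
}
```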
2024-02-17 17:47:38 +01:00
Viktor Lofgren
92717a4832 (client) Refactor GrpcStubPool to handle error states
Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub.

The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.
2024-02-17 14:42:26 +01:00
Viktor Lofgren
37a7296759 (sideload) Clean up the sideloading code
Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach.

The reddit sideloader now uses the SideloaderProcessing class.  It also properly sets js-attributes for the sideloaded documents.

The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.
2024-02-17 14:32:36 +01:00
Viktor Lofgren
ebbe49d17b (sideload) Fix sideloading of explicitly selected stackexchange files
Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.
2024-02-17 13:24:04 +01:00
Viktor Lofgren
b7e330855f (control) Update descriptive text in the control GUI 2024-02-16 20:32:31 +01:00
Viktor Lofgren
ac89224fb0 (domain-ranking) Remove lingering mentions of the algorithms field from the GUI 2024-02-16 20:28:37 +01:00
Viktor Lofgren
9ec262ae00 (domain-ranking) Integrate new ranking logic
The change deprecates the 'algorithm' field from the domain ranking set configuration.  Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.
2024-02-16 20:22:01 +01:00
Viktor Lofgren
64acdb5f2a (domain-ranking) Clean up domain ranking
The domain ranking code was admittedly a bit of a clown fiesta; at the same time buggy, fragile and inscrutable.

Migrating over to use JGraphT to store the link graph
when doing rankings, and using their PageRank implementation.  Also added a modified version that does PersonalizedPageRank.
2024-02-16 18:04:58 +01:00
Viktor Lofgren
a175b36382 (search) Correct accidental regression of the SmallWeb filter 2024-02-15 18:16:56 +01:00
Viktor Lofgren
16526d283c (search) Correct accidental regression of the Vintage filter 2024-02-15 18:13:34 +01:00
Viktor Lofgren
752e677555 (search) Expose getSearchTitle in DecoratedSearchResults 2024-02-15 13:56:44 +01:00
Viktor Lofgren
f796af1ae8 (search) Fix failed refactoring 2024-02-15 13:53:19 +01:00
Viktor Lofgren
2515993536 (search) Fix issue where searchTitle setting gets lost when searching again
It's important that the field names in SearchParameters match the fields referenced in search-form.hdb, otherwise they will get lost in transit.
2024-02-15 13:52:11 +01:00
Viktor Lofgren
66b3e71e56 (search) Expose more search options
This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias.

The QueryStrategy makes it possible to e.g. require that a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period.

These options are added to the search interface.  The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well.

The vintage filter is modified to add a temporal bias for the past.
2024-02-15 13:39:51 +01:00
Viktor Lofgren
652d151373 (process-models) Improve documentation 2024-02-15 12:21:12 +01:00
Viktor Lofgren
300b1a1b84 (index-query) Add some tests for the QueryFilter code 2024-02-15 12:03:30 +01:00
Viktor Lofgren
6c3b49417f (index-query) Improve documentation and code quality 2024-02-15 11:33:50 +01:00
Viktor Lofgren
dcc5cfb7c0 (index-journal) Improve documentation and code quality 2024-02-15 10:51:49 +01:00
Viktor
d970836605 Merge pull request #79 from MarginaliaSearch/reddit
(converter) Loader for reddit data

Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.

Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.

Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these can not be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run.

The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.
2024-02-15 09:17:56 +01:00
Viktor Lofgren
8021bd0aae (control) Sort upload listing results
Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename.

The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.
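
The ordering described above can be expressed as a chained comparator. This is a sketch with hypothetical names; placing directories before files is an assumption, since the message only says the directory flag is the first sort key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not the control GUI's actual listing code.
final class UploadListing {
    record Entry(String name, boolean directory) {}

    // Directories first (assumption), then alphabetically by filename.
    static final Comparator<Entry> ORDER =
            Comparator.comparing((Entry e) -> !e.directory())
                      .thenComparing(Entry::name);

    static List<Entry> sorted(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(ORDER);
        return copy;
    }
}
```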
2024-02-15 09:13:40 +01:00
Viktor Lofgren
8f91156d80 (control) Improve sideload UX
The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable.

Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc.  It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.
2024-02-14 18:38:20 +01:00
Viktor Lofgren
fab36d6e63 (converter) Loader for reddit data
Adds experimental sideloading support for pushshift.io style reddit data.  This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.

Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes.  Empirically this appears to mostly return good matches, even if it probably could index more.

Tests were written for this, but all require local reddit data which can't be distributed with the source code.  If these can not be found, the tests will short-circuit as OK.  They're mostly there for debugging, and it's fine if they don't always run.

The change also refactors the sideloading a bit since it was a bit messy.
2024-02-14 17:35:44 +01:00
Viktor Lofgren
3d54879c14 (API, minor) Clean up comments. 2024-02-14 12:09:16 +01:00
Viktor Lofgren
e17fcde865 (API, minor) Remove unnecessary inject. 2024-02-14 12:05:50 +01:00
Viktor Lofgren
6950dffcb4 (API) Fix result order in API results
These results should be presented in the same order as their ranking score.
2024-02-14 11:47:14 +01:00
Viktor Lofgren
02dd5c5853 (converter) Look at properties when deciding pool size
Look at whether the property 'system.conserveProperty' is enabled when deciding the default pool size for the converter.

If true, a much more conservative default is used, limiting the risk of running out of memory.
2024-02-12 16:24:19 +01:00
Viktor Lofgren
5a1087dbf9 (qs-gui) Update documentation, add param for domain limit 2024-02-12 16:13:48 +01:00
Viktor Lofgren
7564dfeb7a (minor) Correct link in documentation for app services 2024-02-12 15:55:06 +01:00
Viktor Lofgren
10bad635a8 (search) Experimental support for clustering search results
Improves clustering of results.
2024-02-11 20:00:11 +01:00
Viktor Lofgren
7cc8b0fed5 (search) Experimental support for clustering search results
Improves clustering of results.
2024-02-11 19:58:55 +01:00
Viktor Lofgren
a77846373b (search) Experimental support for clustering search results
Improves clustering of results.
2024-02-11 19:48:55 +01:00
Viktor Lofgren
bcd0dabb92 (search) Experimental support for clustering search results
Adds experimental support for clustering search results by e.g. domain.  At a first stage, this is only enabled for the wiki and forum filters.

The commit also cleans up the UrlDetails class, which contained a number of vestigial entries.
2024-02-11 17:31:38 +01:00
Viktor Lofgren
9d68062553 (converter) Make processing pool size configurable 2024-02-10 20:59:08 +01:00
Viktor Lofgren
e66d0b7431 (warc) Minor code clean-up.
Remove redundant String$getBytes().  This is mainly an improvement in code consistency.
2024-02-10 18:30:33 +01:00
Viktor Lofgren
ba26f6ce84 (doc) Documentation corrections 2024-02-10 14:16:01 +01:00
Viktor Lofgren
929caed0b9 (warc) Improve WARC standard adherence
The WARC specification says the records should transparently remove compression.  This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
2024-02-09 20:07:01 +01:00
Viktor Lofgren
8340aa2b6c (warc) Improve WARC standard adherence
The WARC specification says the records should transparently remove compression.  This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
2024-02-09 17:29:21 +01:00
Viktor Lofgren
1188fe3bf0 (conf) Improve naming consistency
Rename the property system.conserve-memory to system.conserveMemory in order to be consistent with other properties in the system.
2024-02-09 14:43:08 +01:00
Viktor Lofgren
b15f47d80e (db) Retire the EC_DOMAIN_LINK table
Retire the EC_DOMAIN_LINK table as the data has been migrated off into a file instead.
2024-02-08 15:52:30 +01:00
Viktor Lofgren
ef261cbbd7 (search) Remove stray spaces in bang commands 2024-02-08 14:46:18 +01:00
Viktor
06997ff255 Merge pull request #78 from conor-f/patch-1
(search) Fix broken !ddg handling
2024-02-08 13:45:38 +01:00
Conor Flynn
9d7df87886 (search) Fix broken !ddg handling
https://duckduckgo.com/search?q=asdf leads to running a search for the term "search" instead of "asdf".

Both https://duckduckgo.com/<query> and https://duckduckgo.com/?q=<query> are accepted, but using GET vars seemed more in-keeping with the code.
2024-02-08 13:28:02 +01:00
Viktor Lofgren
a4b2323ca3 (search) Change default search profile to No Filter
Recent changes to the result ranking mean the no filter mode returns sufficiently good results for most queries that filtering by default just makes the search results more restricted.
2024-02-08 13:04:05 +01:00
Viktor
e8de468b0b Make executor API talk GRPC (#75)
* (executor-api) Make executor API talk GRPC

The executor's REST API was very fragile and annoying to work with, lacking even basic type safety.  Migrate to use GRPC instead.  GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil.  This is a fairly straightforward change, but it's also large so a solid round of testing is needed...

The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.

ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().

The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
2024-02-08 13:01:12 +01:00
Viktor Lofgren
d83a3bf4e2 (search) Fix broken !w handling
Printf format error derp.
2024-02-08 12:11:33 +01:00
Viktor Lofgren
f2b39ad055 (search) Fix broken !bang handling
!bang query handling seems to have fallen victim to an overzealous refactoring effort, and broken.

It's now repaired, and a test is in place to ensure we know if it breaks again.
2024-02-08 12:05:09 +01:00
Viktor Lofgren
95d1bd98e4 (array) Update documentation, make unsafe configurable
The readme for the array library was extremely out of date.  Updating it with accurate information about how the library works, and a demo that should compile.

Also added a system property for disabling the use of sun.misc.Unsafe.
2024-02-07 12:26:47 +01:00
Viktor Lofgren
8acbc6a6b4 (index-construction) Split repartition into two actions cont'd
Continues 467ba5be20 by breaking out a constant with the name of the primary ranking set.  Also ensures it doesn't get spuriously logged as updated during the secondary updating pass.
2024-02-06 19:54:17 +01:00
Viktor Lofgren
467ba5be20 (index-construction) Split repartition into two actions
This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets.  Since only the first is necessary before the index construction, the rest can be delayed until after...

To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one.

The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader.

Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data.

Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead.

To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.
2024-02-06 17:20:07 +01:00
Viktor Lofgren
29ddf9e61d (doc) Update docs 2024-02-06 16:29:55 +01:00
Viktor Lofgren
92e119cab3 (doc) Update docs 2024-02-06 12:43:42 +01:00
Viktor Lofgren
92049ba8e4 (doc) Update docs 2024-02-06 12:41:28 +01:00
Viktor Lofgren
54330b9921 (*) Remove dead code 2024-02-06 12:41:13 +01:00
Viktor Lofgren
d1aeb030f2 (doc) Update RandomWriteFunnel documentation 2024-02-06 12:35:24 +01:00
Viktor Lofgren
f89274d1ea (minor) Fix broken test
Fallout from changes in endianness made in d986f90074
2024-02-06 12:12:26 +01:00
Viktor Lofgren
7286596fb4 (deps) Remove monkey patched GSON
The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data.

Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.
2024-02-06 12:11:39 +01:00
Viktor Lofgren
a2fc83d94e (control) Add configurable border styling
To help distinguish between environments, a system property 'control.appBorder' is added that is injected as a body element border property in the control GUI stylesheets.
2024-02-06 12:05:02 +01:00
Viktor Lofgren
2161799cc3 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:18:00 +01:00
Viktor Lofgren
c88f132057 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:10:03 +01:00
Viktor Lofgren
c6313a5906 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:06:36 +01:00
Viktor Lofgren
eadcdb5bed (minor) Improve error handling, naming and logging in IndexResultDecorator 2024-02-05 21:05:44 +01:00
Viktor Lofgren
6e7649b5f7 (loader) Mitigate fragile paging behavior
IndexJournalWriterPagingImpl was modified to not page on number of entries written, but number of (equivalent uncompressed) bytes written.

Since the failure mode if too much data is written per file is quiet corruption of the index, the former behavior was extremely fragile.  The new behavior should consistently ensure that the data is sufficiently small to not cause any integer rollovers.

The change in 6dcc20038c was reverted, as there is really no sane reason to have this configurable in software.
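
To illustrate paging on bytes rather than entry count, here is a small sketch. The class, the limit parameter, and the method names are hypothetical; the actual IndexJournalWriterPagingImpl is not reproduced here.

```java
// Hypothetical sketch of byte-based paging; not the project's implementation.
final class BytePager {
    private final long maxBytesPerPage;
    private long bytesInPage = 0;
    private int page = 0;

    BytePager(long maxBytesPerPage) {
        this.maxBytesPerPage = maxBytesPerPage;
    }

    /** Accounts for an entry of the given uncompressed size and returns the page it lands in. */
    int add(long entryBytes) {
        // Switch to a new page before the limit would be exceeded, regardless of entry count.
        if (bytesInPage > 0 && bytesInPage + entryBytes > maxBytesPerPage) {
            page++;
            bytesInPage = 0;
        }
        bytesInPage += entryBytes;
        return page;
    }
}
```

With this approach a handful of very large entries triggers a page switch just as reliably as many small ones.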
2024-02-05 21:05:03 +01:00
Viktor Lofgren
d986f90074 (index) Fix consistency between RandomFileAssembler implementations
The RandomFileAssembler implementations, introduced in commit 53c575db3f, were all acting subtly differently.  The RWF implementation wrote BigEndian longs instead of the native endianness used by the other implementations (and expected by the index construction code).  Furthermore, the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary.

A test was built to ensure the output of these implementations is equivalent.
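
A small illustration of why the byte order mismatch matters, using plain ByteBuffer calls. The class name is hypothetical and the example is unrelated to the actual RandomFileAssembler code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class for illustration only.
final class EndianDemo {
    /** Serializes a long using the given byte order. */
    static byte[] encode(long value, ByteOrder order) {
        return ByteBuffer.allocate(Long.BYTES).order(order).putLong(value).array();
    }

    /** Deserializes a long using the given byte order. */
    static long decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getLong();
    }
}
```

Writing the value 1 as big-endian and reading it back as little-endian yields `1L << 56`, which is the kind of silent corruption a cross-implementation equivalence test would catch.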
2024-02-05 21:01:32 +01:00
Viktor Lofgren
53c575db3f (index-construction) Make random-write file strategy configurable
To cope with writing large files out of order, the system needs some form of strategy to avoid writing them directly to disk, as this causes insane amounts of disk thrashing.  By default, the data is just buffered in RAM.  This works well on a large server, but smaller systems struggle.

To help systems with small RAM process large amounts of data, the old RandomWriteFunnel is brought back if the system property 'system.conserve-memory' is set to true.  RandomWriteFunnel buffers the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, and then it goes over the files one by one to construct one area of the file at a time.  This is relatively slow and uses more than twice the disk size.

A new interface RandomFileAssembler is introduced as an abstraction for this operation.  A third strategy, direct mmaps, is also introduced if the file is very small (less than 1 GB).  In this domain, disk thrashing is unlikely since it will comfortably fit in RAM.
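
The pigeonholing idea behind RandomWriteFunnel can be sketched in memory as below. This is an illustration under simplifying assumptions: in-memory lists stand in for the small temporary files, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory sketch; the real funnel spills its bins to disk.
final class WriteFunnelSketch {
    private final int totalSize;
    private final int binSize;
    private final List<List<long[]>> bins = new ArrayList<>();

    WriteFunnelSketch(int totalSize, int binSize) {
        this.totalSize = totalSize;
        this.binSize = binSize;
        int binCount = (totalSize + binSize - 1) / binSize;
        for (int i = 0; i < binCount; i++) {
            bins.add(new ArrayList<>());
        }
    }

    /** Queues an out-of-order write in the bin covering its address range. */
    void put(int address, long value) {
        bins.get(address / binSize).add(new long[] { address, value });
    }

    /** Replays the bins one at a time, so each pass only touches one area of the output. */
    long[] assemble() {
        long[] out = new long[totalSize];
        for (List<long[]> bin : bins) {
            for (long[] write : bin) {
                out[(int) write[0]] = write[1];
            }
        }
        return out;
    }
}
```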
2024-02-05 12:31:15 +01:00
Viktor Lofgren
6dcc20038c (index-journal) Make index journal page size configurable
Adds a new system property loader.journal-page-size to configure this setting.
2024-02-05 11:26:05 +01:00
Viktor Lofgren
fa145f632b (sideload) Add special handling for sideloaded wiki documents
This update enhances the SideloaderProcessing and DocumentClass modules to specially handle sideloaded wiki documents. Wiki content is generally truncated to the first paragraph, which tends to be too short to be included independently. An additional DocumentClass (SIDELOAD) has been introduced to suppress the length check in this case.
2024-02-02 21:22:07 +01:00
Viktor Lofgren
785d8deadd (crawler) Improve meta-tag redirect handling, add tests for redirects.
Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file.  This works as intended.

Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier.  Added logic to handle this case, amended the test case to verify the new behavior.  Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.
2024-02-01 20:30:43 +01:00
Viktor Lofgren
93a2d5afbf (*) Fix poorly named test
Likely old refactoring gore.
2024-02-01 20:08:15 +01:00
Viktor Lofgren
d60c6b18d4 (doc) Update the crawler readmes, as they've grown stale. 2024-02-01 18:10:55 +01:00
Viktor Lofgren
d1e02569f4 (language-processing) Add a system property for configuring which language detection model to use
The flag is `system.languageDetectionModelVersion`.

* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
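
The mapping in the list above could be sketched roughly as follows.  The enum, the method name, the default of 0 when the property is unset and the rejection of values above 2 are assumptions for illustration only; the commit message does not state how those cases are handled.

```java
import java.util.EnumSet;
import java.util.Set;

final class LanguageModelSelectionSketch {
    // Hypothetical names for the two models mentioned in the commit message.
    enum Model { LEGACY, FASTTEXT }

    static Set<Model> modelsFor(int version) {
        if (version < 0) return EnumSet.noneOf(Model.class);   // negative: no model
        switch (version) {
            case 0: return EnumSet.allOf(Model.class);          // both models
            case 1: return EnumSet.of(Model.LEGACY);            // old model only
            case 2: return EnumSet.of(Model.FASTTEXT);          // fasttext model only
            default:
                // Assumption: values above 2 are treated as a configuration error.
                throw new IllegalArgumentException("Unknown model version: " + version);
        }
    }

    public static void main(String[] args) {
        // Assumption: 0 is used when the property is not set.
        int version = Integer.getInteger("system.languageDetectionModelVersion", 0);
        System.out.println(modelsFor(version));
    }
}
```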
2024-01-31 13:02:33 +01:00
Viktor Lofgren
9ce67029ca (language-processing) Add a system property for configuring which language detection model to use
The flag is `system.languageDetectionModelVersion`.

* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
2024-01-31 13:02:16 +01:00
Viktor Lofgren
98f3382cea (minor) Fix test and improve error message 2024-01-31 11:53:41 +01:00
Viktor Lofgren
52a0255814 (*) Add flag for disabling ASCII flattening
The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this.  This assumption holds poorly in the wild.

Adding an **experimental** system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior.

IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.
2024-01-31 11:50:59 +01:00
Viktor Lofgren
eb59ac8535 (index-ranking) Adjust the BM25P factors a bit
Since the bleed-flags set by the anchor tags logic have been changed to Site and SiteAdjacent, give them a bit more importance when set together with ExternalLink.

UrlDomain and UrlPath are also more consistently rewarded only once.
2024-01-30 21:27:29 +01:00
Viktor Lofgren
acc2b4e10f (*) Update the readme with a link to the demo video 2024-01-26 13:49:41 +01:00
Viktor Lofgren
6f830f0e08 (*) Update the readme with a link to the demo video 2024-01-26 13:48:47 +01:00
Viktor Lofgren
6edc318597 (control) Fix typo in URL linking to new-crawl-specs 2024-01-26 10:43:10 +01:00
Viktor Lofgren
182c0cf28e (control) Add warnings about domain data contamination 2024-01-25 18:26:15 +01:00
Viktor Lofgren
0b105b5986 (converter) Update hyperlink text for new crawl spec creation.
Fix minor typo.
2024-01-25 18:05:11 +01:00
Viktor Lofgren
081c7d22bc Fix typo in install.sh 2024-01-25 17:08:18 +01:00
Viktor Lofgren
6aee896657 (*) Add single-node barebones configuration
This adds a single-node barebones configuration to the install script.  It also moves the log4j configuration into system.properties, and sets assertions to disabled by default.
2024-01-25 16:40:28 +01:00
Viktor Lofgren
cae1bad274 (*) Add download-sample action, refactor file storage
This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu.

It also refactors some leaky abstractions out of FileStorageService.  allocateTemporaryStorage has been renamed to allocateStorage, since the storage was never temporary in any scenario.

It also doesn't take a storage base, as there was always only one valid option for this input.  The allocateStorage method finds the appropriate base itself.
2024-01-25 13:36:30 +01:00
Viktor Lofgren
1b8b97b8ec (sample-exporter) Add some limits on sizes and lengths
Tar files will reject entries with filenames over 100 bytes, so we need a limit there.  Also added a maximum size limit to keep the file sizes reasonable.
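
A minimal sketch of such a filename check, assuming names are encoded as UTF-8 and that the limit is counted in bytes rather than characters.  The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class TarNameLimitSketch {
    // Size of the name field in a classic tar header.
    static final int MAX_NAME_BYTES = 100;

    // Assumption: the limit applies to the UTF-8 encoded length of the name.
    static boolean fitsInTarEntry(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length <= MAX_NAME_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsInTarEntry("www.example.com.parquet"));
    }
}
```

Counting bytes matters for non-ASCII names, where one character can take several bytes.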
2024-01-25 11:51:53 +01:00
Viktor Lofgren
0846606b12 (doc) Add ide quick-start guide 2024-01-24 14:39:33 +01:00
Viktor Lofgren
245ebcdfc6 (doc) Add ide quick-start guide 2024-01-24 14:37:58 +01:00
Viktor Lofgren
1b1e711c93 (doc) Add ide quick-start guide 2024-01-24 14:36:44 +01:00
Viktor Lofgren
c088c25b09 (*) Fix broken test, clean up code 2024-01-24 12:50:41 +01:00
Viktor Lofgren
958d64720e (control) Add a view for restarting aborted processes
This will avoid having to dig in the message queue to perform this relatively common task.

The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.
2024-01-24 12:47:10 +01:00
Viktor Lofgren
805afad4fe (control) New GUI for exporting crawl data samples
Not going to win any beauty pageants, but this is pretty peripheral functionality.
2024-01-23 17:08:21 +01:00
Viktor Lofgren
400f4840ad (*) Fix broken code in jmh 2024-01-23 17:08:21 +01:00
Viktor Lofgren
ee7792596d (*) Fix broken test
Probably shouldn't have tests depending on external data like this...
2024-01-23 12:03:47 +01:00
Viktor Lofgren
0081328aca (converter) Adjust which flags are set by anchor text keywords
It's a mistake to let it bleed into Title, as this is a high quality signal.  We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.
2024-01-23 11:54:00 +01:00
Viktor Lofgren
3fff7f6878 (converter) Fix issue where quality limits were no longer enforced 2024-01-23 11:42:17 +01:00
Viktor Lofgren
f15dd06473 (index) Delayed close() of SearchIndexReader
This avoids concurrent access errors.  This is especially important when using Unsafe-based LongArrays, since we have concurrent access to the underlying memory-mapped file.  If we pull the rug from under the caller by closing the file, we'll get a SIGSEGV.  Even with a "safe" MemorySegment, we'll get ugly stacktraces if we close the file while a thread is still accessing it.

So we spin up a thread that sleeps for a minute before actually unmapping the file, allowing any ongoing requests to wrap up.  This is 100% a hack, but it lets us get away with doing this without adding locks to the index readers.

Since this is "just" mmapped data, and this operation happens optimistically once a month, it should be safe if the call gets lost.
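
The delayed close described above can be sketched along these lines.  This is not the SearchIndexReader code; the class name, the method name and the returned CompletableFuture (added so a caller or test can observe completion) are assumptions.

```java
import java.util.concurrent.CompletableFuture;

final class DelayedCloseSketch {
    // Closes the resource on a background thread after the given delay,
    // giving in-flight readers time to finish.  No locking is involved.
    static CompletableFuture<Void> closeLater(AutoCloseable resource, long delayMillis) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        Thread closer = new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
                resource.close();
                done.complete(null);
            } catch (Exception e) {
                // Covers both interruption and failures in close().
                done.completeExceptionally(e);
            }
        }, "delayed-close");
        closer.setDaemon(true); // the close may be lost if the JVM exits first
        closer.start();
        return done;
    }

    public static void main(String[] args) {
        closeLater(() -> System.out.println("closed"), 100).join();
    }
}
```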
2024-01-23 11:08:41 +01:00
Viktor Lofgren
dd26819d66 (actor) Try to fix rare data race where a finished job is considered dead. 2024-01-22 21:22:38 +01:00
Viktor Lofgren
562012fb22 (doc) Migrate documentation https://docs.marginalia.nu/ 2024-01-22 19:40:08 +01:00
Viktor Lofgren
a6d257df5b (converter) Update Stackexchange sideload instruction
The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.
2024-01-22 18:29:20 +01:00
Viktor Lofgren
41d896ba3e (converter) Refactor content type check in PlainTextDocumentProcessorPlugin
The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accommodate content types that append a charset as well.
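
A small sketch of the check as described, written as a standalone helper.  The class and method names are hypothetical, and the null handling is an assumption; the actual `isApplicable` signature is not shown in this log.

```java
final class PlainTextCheckSketch {
    // Accepts exactly "text/plain", or "text/plain;" followed by parameters such as a charset.
    static boolean isPlainText(String contentType) {
        if (contentType == null) return false; // assumption: a missing content type is not plain text
        return contentType.equals("text/plain") || contentType.startsWith("text/plain;");
    }

    public static void main(String[] args) {
        System.out.println(isPlainText("text/plain; charset=UTF-8"));
    }
}
```

Matching on "text/plain;" rather than just the "text/plain" prefix avoids accepting unrelated types that merely begin with the same characters.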
2024-01-22 17:52:14 +01:00
Viktor Lofgren
51cdf46645 (control) Improve accessibility in search-to-ban template
This update enhances accessibility by associating labels with the corresponding checkboxes in the search-to-ban template.
2024-01-22 15:01:00 +01:00
Viktor Lofgren
1eb0adf6d3 (array) Add sun.misc.Unsafe variant of LongArray 2024-01-22 13:38:42 +01:00
Viktor Lofgren
40c9d2050f (control) Fully automatic conversion
Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine.

Removed the tool itself.

This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency.  This has been fixed, and :third-party:xz was removed.
2024-01-22 13:03:24 +01:00
Viktor Lofgren
3a325845c7 (mq) Add better error handling in fsm and mq
java.lang.Error:s were not handled properly, leading to mismatch in the bookkeeping of the FSMs.  These are now caught, acted on, and re-thrown.
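
The catch, act and re-throw pattern mentioned above might look roughly like this.  The names are hypothetical and the callback stands in for whatever bookkeeping the FSM performs.

```java
import java.util.function.Consumer;

final class RethrowingStepSketch {
    // Runs a step; on any failure, including java.lang.Error, lets the caller
    // update its bookkeeping before the original throwable propagates.
    static void runStep(Runnable step, Consumer<Throwable> onFailure) {
        try {
            step.run();
        } catch (Throwable t) {
            onFailure.accept(t);
            throw t; // precise rethrow: only unchecked throwables can reach here
        }
    }

    public static void main(String[] args) {
        try {
            runStep(() -> { throw new Error("simulated"); },
                    t -> System.out.println("recorded: " + t.getMessage()));
        } catch (Error e) {
            System.out.println("rethrown: " + e.getMessage());
        }
    }
}
```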

MqSynchronousInbox also no longer assumes all exceptions are InterruptedException.
2024-01-22 13:03:24 +01:00
Viktor Lofgren
6a1bfd6270 (array) Remove unused 'madvise' code and 3rd party dependency on 'uppend'
This wasn't actually hooked in anywhere.  Removing the dependency and code.  If it turns out we need madvise in the future, we'll re-introduce it.
2024-01-22 13:01:57 +01:00
Viktor Lofgren
b91ea1d7ca (control) Re-add gui for sideloading dirtrees 2024-01-20 18:09:40 +01:00
Viktor Lofgren
c5760cd535 (test) Fix broken test 2024-01-20 13:39:40 +01:00
Viktor Lofgren
91c7960800 (crawler) Extract additional configuration properties
This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties.

The documentation is updated to reflect the change.

Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.
2024-01-20 10:36:04 +01:00
Viktor Lofgren
2079a5574b (control) Update heading in restore backup template
Changed the heading in the partial restore backup page from "Load" to "Restore Backup".
2024-01-19 21:46:53 +01:00
Viktor Lofgren
27ffb8fa8a (converter) Integrate zim->db conversion into automatic encyclopedia processing workflow
Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file.  This isn't very ergonomic, so parts of that code-base were lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically.

The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.
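
One way such a name could be derived is sketched below.  What the crc32 is computed over is not stated here; this example assumes it is the path string, and the exact layout of the name is likewise an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.CRC32;

final class ZimDbNameSketch {
    // Assumption: the hash covers the full path string, so the same input file
    // maps to the same .db name on repeat loads.
    static String dbFileName(Path zimFile) {
        CRC32 crc = new CRC32();
        crc.update(zimFile.toString().getBytes(StandardCharsets.UTF_8));
        return zimFile.getFileName() + "." + Long.toHexString(crc.getValue()) + ".db";
    }

    public static void main(String[] args) {
        System.out.println(dbFileName(Path.of("uploads", "wikipedia_en.zim")));
    }
}
```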
2024-01-19 13:59:03 +01:00
Viktor Lofgren
22c8fb3f59 (crawler) Fix a bug where reference copies of crawl data were written without etag and last-modified
This commit also adds a band-aid to ParquetSerializableCrawlDataStream to fetch this from the 304-entity.  This can be removed in a few months.
2024-01-18 16:02:27 +01:00
Viktor Lofgren
964419803a Fix broken test 2024-01-18 15:42:01 +01:00
Viktor Lofgren
6271d5d544 (mq) Add relation tracking between MQ messages for easier tracking and debugging.
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID.  This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.
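
The thread-to-message dictionary could be sketched as below.  The class, the method names and the -1 sentinel are assumptions for illustration; the actual inbox handler code is not shown in this log.

```java
import java.util.concurrent.ConcurrentHashMap;

final class AuditRelationSketch {
    // Maps the ID of a thread to the ID of the message it is currently handling.
    private static final ConcurrentHashMap<Long, Long> activeMessageByThread = new ConcurrentHashMap<>();

    // Hypothetical sentinel for "this thread is not handling any message".
    static final long NO_RELATION = -1L;

    // Called by an inbox handler when it starts processing a message.
    static void beginHandling(long messageId) {
        activeMessageByThread.put(Thread.currentThread().getId(), messageId);
    }

    // Called by the handler when it is done with the message.
    static void endHandling() {
        activeMessageByThread.remove(Thread.currentThread().getId());
    }

    // Looked up when sending a new message, to fill in AUDIT_RELATED_ID.
    static long auditRelatedId() {
        return activeMessageByThread.getOrDefault(Thread.currentThread().getId(), NO_RELATION);
    }

    public static void main(String[] args) {
        beginHandling(42);
        System.out.println(auditRelatedId());
        endHandling();
    }
}
```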

The existing RELATED_ID field has too many semantics associated with it;
among other things, the FSM code uses this field in tracking state changes.

The change set also improves the consistency of inbox names.  The IndexClient was buggy and populated its outbox with a UUID.  This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
2024-01-18 15:08:27 +01:00
Viktor Lofgren
175bd310f5 (control) Message queue UX improvements 2024-01-18 13:05:50 +01:00
Viktor Lofgren
67ee6f4126 (control) Clean up filtering UX in Events table 2024-01-18 12:35:39 +01:00
Viktor Lofgren
01b312f14c (*) Make new index nodes accept queries by default
It's a confusing default behavior.

This was off for nodes n>1 before as a bandaid since querying indices with no data caused delays and errors.  This has been fixed now, so there's no need to do this anymore!
2024-01-18 12:05:37 +01:00
Viktor Lofgren
18638c62de (control) Rephrase text 2024-01-18 11:53:10 +01:00
Viktor Lofgren
753d000788 (control) Add toggle for automatic loading of processed data 2024-01-18 11:52:58 +01:00
Viktor Lofgren
19e781b104 (control) Add basic input validation to node actions
Will present a simple error message when required fields aren't populated, instead of a cryptic HTTP status error.
2024-01-18 11:52:49 +01:00
Viktor Lofgren
aa2df327db (index) Prevent index from attempting to answer queries when no index data is loaded
This improves query times, and gets rid of exceptions in the logs when one of the index nodes doesn't have any data loaded, yet is configured to answer queries.
2024-01-18 11:05:45 +01:00
Viktor Lofgren
321fa94b8f (crawler) Fix rare exception in content type handling due to improper length checking of a split() array 2024-01-18 11:05:45 +01:00
Viktor
ca80957143 Merge pull request #73 from MarginaliaSearch/configurable-search-sets
(WIP) Configurable domain ranking sets
2024-01-17 21:12:20 +01:00
Viktor Lofgren
41cdb8f71b (control) Fix broken update button in the update-domain-ranking-set form
id property was on the wrong element.
2024-01-17 18:21:09 +01:00
Viktor Lofgren
304d4c9acf (control) Fix result ordering in the file storage listing view
In some scenarios, such as when restoring storage items from json-manifest on db failure, the file storage view would present the items in a non-chronological order.  Added a sort() operation to mitigate this.
2024-01-17 10:56:30 +01:00
Viktor Lofgren
7fd4c092e3 (control) Clean up UX and accessibility for new domain ranking sets.
The change also adds basic support for error messages in the GUI.
2024-01-17 10:47:14 +01:00
Viktor Lofgren
2fe5705542 (control) GUI for ranking sets
Still missing is some polish, forms don't have proper labels, validation is inconsistent, no error messages, etc.
2024-01-16 17:10:09 +01:00
Viktor Lofgren
e968365858 (index) Use new DomainRankingSets to configure ranking algos in index svc 2024-01-16 12:43:32 +01:00
Viktor Lofgren
36ad4c7466 (db) Add a new configuration object 'domain ranking set' for storing ranking parameters 2024-01-16 12:34:00 +01:00
Viktor Lofgren
5a62b3058f (query-api) Make the search set identifier a string value in the API
This will free the core marginalia search engine to use arbitrary search set definitions, while the app can use its hardcoded defaults.
2024-01-16 10:55:24 +01:00
Viktor Lofgren
ec8fe9f031 (doc) Add screenshot to conversion step in crawling doc 2024-01-15 16:31:33 +01:00
Viktor Lofgren
a1df9e886a (control) Also clean up stale 'NEW' messages 2024-01-15 16:14:02 +01:00
Viktor Lofgren
ce5ae1931d (doc) Update Crawling Docs
Still a WIP, but now more accurately reflects the new GUI, with screenshots to boot!
2024-01-15 16:08:01 +01:00
Viktor Lofgren
b9445d4f62 (doc) Update Crawling Docs
Still a WIP, but now more accurately reflects the new GUI, with screenshots to boot!
2024-01-15 16:06:59 +01:00
Viktor Lofgren
fd1eec99b5 (cleanup) Fix broken tests 2024-01-15 15:44:33 +01:00
Viktor Lofgren
e162406d40 (control) New control-side actors for cleaning up stale service heartbeats and message queue entries 2024-01-15 15:44:23 +01:00
Viktor Lofgren
c41e68aaab (control) New export actions for RSS/Atom feeds and term frequency data
This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.
2024-01-15 14:54:26 +01:00
Viktor Lofgren
4665af6c42 (control) Move export data endpoint to actions controller 2024-01-15 11:06:22 +01:00
Viktor Lofgren
c0b15427fe (control) New crawl view should use radio buttons as multiple specs aren't supported 2024-01-15 11:03:47 +01:00
Viktor Lofgren
f29a9d972d (control) Move 'new crawl spec' to /node/:id/actions, out of /node/:id/storage 2024-01-15 11:02:00 +01:00
Viktor Lofgren
b192373ae7 (control) Highlight unavailable items (creating, deleting) in node actions views 2024-01-15 10:47:54 +01:00
Viktor Lofgren
c042650382 (docs) Improve query service documentation 2024-01-13 21:16:45 +01:00
Viktor Lofgren
07a916a720 (search) Give the swipe hint on mobile a nicer finish 2024-01-13 18:51:54 +01:00
Viktor Lofgren
5134044530 (assistant) Make assistant client more robust to the service going down
This is especially important for the non-essential functions, like website similarities...
2024-01-13 18:29:30 +01:00
Viktor Lofgren
4c62065e74 (install) Add two separate templates for the install script
One template is for the full Marginalia Search style install, and the other is for a barebones install with no Marginalia-related fluff.
2024-01-13 18:27:42 +01:00
Viktor Lofgren
d28fc99119 (MainClass) ensure logging isn't loaded before service name is known
Loading it too early causes all logs to have names like ${sys:service-name}, instead of the service name...
2024-01-13 18:19:50 +01:00
Viktor Lofgren
c9fb45c85f (search) Fix control.hideMarginaliaApp handling 2024-01-13 17:24:15 +01:00
Viktor Lofgren
7c6e18f7a7 (*) Overhaul settings and properties
Use a system.properties file to configure the system.  This is loaded statically by MainClass or ProcessMainClass.  Update the property names to be more consistent, and update the documentations to reflect the changes.
2024-01-13 17:12:18 +01:00
Viktor Lofgren
176b9c9666 (convert) Add sizeHints to legacy serializable crawl data stream
This reduces the maximum memory usage when processing legacy crawl data
2024-01-13 15:50:36 +01:00
Viktor Lofgren
ecd9c35233 (control) Clean up the event log
* Generate fewer uninteresting event messages.
* Display fewer irrelevant fields in the overview table.
2024-01-13 13:28:02 +01:00
Viktor Lofgren
71e32c57d9 (control) Add better timestamps for the events and message queue views
Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.
2024-01-13 13:04:56 +01:00
Viktor Lofgren
2fefd0e4e3 (control) Add better timestamps for the events and message queue views
Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.
2024-01-13 13:03:52 +01:00
Viktor Lofgren
81eaf79a25 (control) UX polish 2024-01-13 12:31:13 +01:00
Viktor Lofgren
8dea7217a6 (control) UX fixes, node GUI doesn't break when an executor service goes offline. 2024-01-13 12:17:30 +01:00
Viktor Lofgren
c0fb9e17e8 (control) Add filter dropdown to message queue table
This makes inspecting the queues for processes much easier, as these important
messages are otherwise often drowned out by FSM chatter.
2024-01-12 18:46:17 +01:00
Viktor Lofgren
83776a8dce (control) Wean the ExportDataActor off EC_DOMAIN_LINK
The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead.

The ExportDataActor now uses the QueryClient appropriately.  The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file.
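
A minimal sketch of quoting every CSV value, assuming embedded double quotes are escaped by doubling them.  The class and method names are made up; the ExportDataActor's own formatting code is not shown here.

```java
final class CsvQuoteSketch {
    // Wraps a value in double quotes, doubling any quotes inside it.
    static String quote(String value) {
        return '"' + value.replace("\"", "\"\"") + '"';
    }

    // Joins already-quoted values with commas to form one CSV row.
    static String row(String... values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(quote(values[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(row("www.example.com", "other.example.org"));
    }
}
```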

Finally the form for triggering an export was overhauled.
2024-01-12 17:09:11 +01:00
Viktor Lofgren
98c0972619 (control) Add a summary table for Actors in the Node overview 2024-01-12 16:32:15 +01:00
Viktor Lofgren
56d832d661 (control) Adjust the margins of the headings to be consistent 2024-01-12 16:16:57 +01:00
Viktor Lofgren
de3a350afe (control) Disable broken actions and mark the actions view as WIP 2024-01-12 16:16:39 +01:00
Viktor Lofgren
708a741960 (test) Clean up test usage of migrations
Several tests were manually running migrations in a large copy-paste blob of code.  This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.

A new test helper library is introduced with a TestMigrationLoader that can either run all Flyway migrations, or load only a specific set of migrations in the cases where that is needed.  Existing tests are migrated to use the new code.
2024-01-12 15:55:50 +01:00
Viktor Lofgren
0caef1b307 (warc) Toggle for saving WARC data
Add a toggle for saving the WARC data generated by the search engine's crawler.  Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest.

The warc files are concatenated into larger archives, up to about 1 GB each.
An index is also created containing filenames, domain names, offsets and sizes
to help navigate these larger archives.

The warc data is saved in a directory warc/ under the crawl data storage.
2024-01-12 13:45:14 +01:00
Viktor Lofgren
264e2db539 (control) UX-improvements for control service
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views.  It has many small tweaks to make the work flow better.

It also adds a new /uploads directory in each index node, from which sideloaded data can be selected.  This is a bit of a breaking change, as this directory needs to exist in each index node.
2024-01-12 12:33:05 +01:00
Viktor Lofgren
734996002c (*) install script for deploying Marginalia outside the codebase
The changeset also makes the control service responsible for flyway migrations.  This helps reduce the number of places the database configuration needs to be spread out.  These automatic migrations can be disabled with -DdisableFlyway=true.

The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
2024-01-11 12:40:03 +01:00
Viktor Lofgren
205e5016e8 (docs) Document barebones config 2024-01-11 09:43:08 +01:00
Viktor Lofgren
a0f28a7f9b (*) Add a barebones configuration
This adds a docker-compose file 'docker-compose-barebones.yml' which will only start the minimal number of services needed to run a whitelabel Marginalia Search-style search engine, with none of the surrounding frills.

The change also adds a minimal search GUI to the query service, which is also available with JSON results if the appropriate Accept header is provided.
2024-01-10 20:23:51 +01:00
Viktor Lofgren
14b7680328 (loader) Update the size of the keyword files created by the loader
Previously these ended up being about 200 MB each, which is wastefully small.  Increasing the size of these files makes the index construction faster.
2024-01-10 17:09:19 +01:00
Viktor Lofgren
f44222ce53 (control) Add a 'cancel' button to the process list
This is a very nice QoL improvement, since it means you don't have to dig in the Actors view to terminate processes.
2024-01-10 15:02:42 +01:00
Viktor Lofgren
f310ad8d98 (control) Actor terminations work better
Improves jank in the abort actor action, which would sometimes cause actors to hang or restart.
2024-01-10 14:18:49 +01:00
Viktor Lofgren
d56b394bcc (control) GUI for loading external WARC files 2024-01-10 12:13:30 +01:00
Viktor Lofgren
55c9501e57 (search) Serve proper content type for static resources 2024-01-10 10:46:51 +01:00
Viktor
fad9575154 Merge pull request #69 from MarginaliaSearch/converter-optimizations
Refactor the DomainProcessor to take advantage of the new crawl data format
2024-01-10 09:46:54 +01:00
Viktor Lofgren
97e11e1ac9 (search) Fix acknowledgement page for domain complaints rendering as plain text
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used.  This method is removed with this change.
2024-01-10 09:37:40 +01:00
Viktor Lofgren
e6a1e164b2 (search) Swap swipe direction for more consistent experience 2024-01-10 09:37:40 +01:00
Viktor Lofgren
e4f8f81e89 (search) Mobile UX improvements.
Swipe right to show filter menu.

Fix CSS bug that caused parts of the menu to not have a background.
2024-01-10 09:37:39 +01:00
Viktor Lofgren
176b3bb526 (search) Toggle for showing recent results
Actually persist the value of the toggle between searches too...
2024-01-10 09:37:39 +01:00
Viktor Lofgren
b07752fa9b (search) Toggle for showing recent results
Will by default show results from the last 2 years.  May need to tune this later.
2024-01-10 09:37:39 +01:00
Viktor Lofgren
68fd0efbde (search) Clean up search results template
Rendering is very slow. Let's see if this has a measurable effect on latency.
2024-01-10 09:37:39 +01:00
Viktor Lofgren
c80d3eb812 (search) Remove dead code 2024-01-10 09:37:35 +01:00
Viktor Lofgren
f9320995d6 (search) When clicking asn-links, show results from the unfiltered view... 2024-01-10 09:37:13 +01:00
Viktor Lofgren
f592c9f04d (search) Fix acknowledgement page for domain complaints rendering as plain text
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used.  This method is removed with this change.
2024-01-10 09:26:34 +01:00
Viktor Lofgren
bd7970fb1f (search) Swap swipe direction for more consistent experience 2024-01-09 13:38:40 +01:00
Viktor Lofgren
c47730f2cc (search) Mobile UX improvements.
Swipe right to show filter menu.

Fix CSS bug that caused parts of the menu to not have a background.
2024-01-09 13:30:30 +01:00
Viktor Lofgren
41cccfd2aa (search) Toggle for showing recent results
Actually persist the value of the toggle between searches too...
2024-01-09 11:36:49 +01:00
Viktor Lofgren
aff690f7d6 (search) Toggle for showing recent results
Will by default show results from the last 2 years.  May need to tune this later.
2024-01-09 11:28:36 +01:00
Viktor Lofgren
d4b0539d39 (search) Clean up search results template
Rendering is very slow. Let's see if this has a measurable effect on latency.
2024-01-08 20:57:40 +01:00
Viktor Lofgren
cb55273769 (search) When clicking asn-links, show results from the unfiltered view... 2024-01-08 20:02:19 +01:00
Viktor Lofgren
fbad625126 (linkdb) Add delegating implementation of DomainLinkDb
This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.
2024-01-08 19:56:33 +01:00
Viktor Lofgren
e49ba887e9 (crawl data) Add compatibility layer for old crawl data format
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records.  This is true for the new parquet format, but not for the old zstd/gson format.

To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records, presenting them in the new order.

This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.

Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
2024-01-08 19:16:49 +01:00
Viktor Lofgren
edc1acbb7e (*) Replace EC_DOMAIN_LINK table with files and in-memory caching
The EC_DOMAIN_LINK MariaDB table stores links between domains.  This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB).  This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.

This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains.  This file is loaded in memory in each node, and can be queried via the Query Service.
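
To illustrate the idea of a flat file of 32 bit integer pairs loaded into memory, here is a sketch that works on a byte array instead of a real file.  The (source, destination) ordering within a pair, the big-endian byte order and the in-memory map layout are assumptions; the actual file format may differ.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DomainLinkPairsSketch {
    // Serializes links as consecutive (source, destination) int pairs.
    static byte[] encode(int[][] links) {
        ByteBuffer buf = ByteBuffer.allocate(links.length * 2 * Integer.BYTES);
        for (int[] link : links) {
            buf.putInt(link[0]);
            buf.putInt(link[1]);
        }
        return buf.array();
    }

    // Loads the pairs into a map from source domain ID to destination domain IDs.
    static Map<Integer, List<Integer>> decode(byte[] data) {
        Map<Integer, List<Integer>> outgoing = new HashMap<>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.remaining() >= 2 * Integer.BYTES) {
            int source = buf.getInt();
            int dest = buf.getInt();
            outgoing.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
        }
        return outgoing;
    }

    public static void main(String[] args) {
        byte[] file = encode(new int[][] { {1, 2}, {1, 3}, {4, 1} });
        System.out.println(decode(file));
    }
}
```

At 8 bytes per link, such a file carries no per-row overhead and can be read front to back in a single pass.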

A migration step is needed before this file is created in each node.   Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.

The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
2024-01-08 15:53:13 +01:00
Viktor Lofgren
d304c10641 Merge branch 'master' into converter-optimizations 2024-01-05 13:22:46 +01:00
Viktor Lofgren
302c53a8e7 (build) Enable reproducible builds in build.gradle
Settings for enabling reproducible builds for all subprojects were added to improve build consistency. This includes preserving file timestamps and ordering files reproducibly.

This is primarily of help for docker, since it uses hashes to determine if a file or image layer has changed.
2024-01-05 13:22:13 +01:00
Viktor Lofgren
ef02b712ad (build) Remove false dependency between icp and index-service
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:22:13 +01:00
Viktor Lofgren
aca217cf9a (qs) Better metrics for QS 2024-01-05 13:22:13 +01:00
Viktor Lofgren
9e3386dbbb (search) Fetch fewer results per page
This is a test to evaluate how this impacts load times.
2024-01-05 13:22:13 +01:00
Viktor Lofgren
fdec565b34 (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-05 13:22:13 +01:00
Viktor Lofgren
33c2188c87 (feature) More trackers 2024-01-05 13:22:13 +01:00
Viktor Lofgren
b3c8fa74cc (feature) Add another doubleclick variant to the adtech trackers 2024-01-05 13:22:13 +01:00
Viktor Lofgren
e53bb70bef (converter) Penalize chatgpt content farm spam 2024-01-05 13:22:13 +01:00
Viktor Lofgren
109bec372c (index) Adjust BM25 parameters 2024-01-05 13:21:52 +01:00
Viktor Lofgren
5c2561d05d (search) Add query strategy requiring link 2024-01-05 13:21:52 +01:00
Viktor Lofgren
0e970b8037 (valuation) Tweaking penalties a bit 2024-01-05 13:21:52 +01:00
Viktor Lofgren
1694b4d6ef (valuation) Increase the penalty for adtech a bit 2024-01-05 13:21:34 +01:00
Viktor Lofgren
396299c1db (index) Reduce the value of site and site-adjacent in BM25P calculations 2024-01-05 13:21:33 +01:00
Viktor Lofgren
71d789aab0 (index) Tweak result valuation renormalization 2024-01-05 13:21:33 +01:00
Viktor Lofgren
41ca50ff0e (build) Enable reproducible builds in build.gradle
Settings for enabling reproducible builds for all subprojects were added to improve build consistency. This includes preserving file timestamps and ordering files reproducibly.

This is primarily of help for docker, since it uses hashes to determine if a file or image layer has changed.
2024-01-05 13:19:59 +01:00
Viktor Lofgren
6d2e14a656 (build) Remove false dependency between icp and index-service
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:17:29 +01:00
Viktor Lofgren
4078708aea (qs) Better metrics for QS 2024-01-04 13:27:14 +01:00
Viktor Lofgren
343ea9c6d8 (search) Fetch fewer results per page
This is a test to evaluate how this impacts load times.
2024-01-04 13:18:07 +01:00
Viktor Lofgren
60361f88ed (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-03 23:14:03 +01:00
Viktor Lofgren
f7560cb1d8 (feature) More trackers 2024-01-03 17:31:02 +01:00
Viktor Lofgren
1f66568d59 (feature) More trackers 2024-01-03 17:27:25 +01:00
Viktor Lofgren
7af07cef95 (feature) Add another doubleclick variant to the adtech trackers 2024-01-03 17:21:12 +01:00
Viktor Lofgren
41a540a629 (converter) Penalize chatgpt content farm spam 2024-01-03 17:04:38 +01:00
Viktor Lofgren
f599944942 (converter) Penalize chatgpt content farm spam 2024-01-03 16:51:26 +01:00
Viktor Lofgren
1e06aee6a2 (index) Adjust BM25 parameters 2024-01-03 16:30:46 +01:00
Viktor Lofgren
7bbaedef97 (search) Add query strategy requiring link 2024-01-03 16:23:00 +01:00
Viktor Lofgren
87048511fe (valuation) Tweaking penalties a bit 2024-01-03 16:02:25 +01:00
Viktor Lofgren
c770f0b68b (valuation) Tweaking penalties a bit 2024-01-03 15:59:21 +01:00
Viktor Lofgren
78c00ad512 (valuation) Tweaking penalties a bit 2024-01-03 15:52:57 +01:00
Viktor Lofgren
a19879d494 (valuation) Tweaking penalties a bit 2024-01-03 15:32:33 +01:00
Viktor Lofgren
ac1aca36b0 (valuation) Increase the penalty for adtech a bit 2024-01-03 15:20:38 +01:00
Viktor Lofgren
1f3b89cf28 (index) Reduce the value of site and site-adjacent in BM25P calculations 2024-01-03 15:20:18 +01:00
Viktor Lofgren
f732f6ae6f (index) Tweak result valuation renormalization 2024-01-03 14:53:53 +01:00
Viktor Lofgren
0b9f3d1751 (*) Remove accidental commit of debug logging 2024-01-03 14:32:00 +01:00
Viktor Lofgren
0806aa6dfe (language-processing) Add maximum length limit for text input in SentenceExtractor
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
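
The truncation amounts to something like the sketch below.  The value of the constant is hypothetical, since the actual limit is not stated in the commit message, and the helper is a standalone stand-in for the logic in SentenceExtractor.

```java
final class TextLimitSketch {
    // Hypothetical value; the real MAX_TEXT_LENGTH is not given in this log.
    static final int MAX_TEXT_LENGTH = 65536;

    // Returns the text unchanged if it is within the limit, otherwise its leading part.
    static String limit(String text) {
        return text.length() > MAX_TEXT_LENGTH ? text.substring(0, MAX_TEXT_LENGTH) : text;
    }

    public static void main(String[] args) {
        System.out.println(limit("x".repeat(MAX_TEXT_LENGTH + 10)).length());
    }
}
```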
2024-01-03 14:27:47 +01:00
Viktor Lofgren
32436d099c (language-processing) Add maximum length limit for text input in SentenceExtractor
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
2024-01-03 14:27:47 +01:00
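A minimal sketch of the truncation described in the commit above. The class name, method name and the value of MAX_TEXT_LENGTH are assumptions for illustration; the commit message does not state the real limit or show how SentenceExtractor applies it.

```java
class TruncateSketch {
    // Hypothetical value; the commit does not say what the actual limit is.
    static final int MAX_TEXT_LENGTH = 65536;

    /** Returns the text unchanged if it fits, otherwise its first MAX_TEXT_LENGTH chars. */
    static String clamp(String text) {
        if (text.length() <= MAX_TEXT_LENGTH) {
            return text;
        }
        return text.substring(0, MAX_TEXT_LENGTH);
    }

    public static void main(String[] args) {
        String huge = "x".repeat(MAX_TEXT_LENGTH + 1000);
        System.out.println(clamp(huge).length());
        System.out.println(clamp("short text"));
    }
}
```

Truncating before sentence splitting bounds the work done per document, at the cost of ignoring the tail of very long inputs.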
Viktor Lofgren
4ce692ccaf (converter) Use SimpleBlockingThreadPool in ProcessingIterator 2024-01-03 14:27:47 +01:00
Viktor Lofgren
3caa4eed75 Merge branch 'master' into converter-optimizations 2024-01-02 17:13:25 +01:00
Viktor Lofgren
c70f508ae8 (prometheus) Saner histogram buckets 2024-01-02 17:13:14 +01:00
Viktor Lofgren
9e64d7aaf9 Merge branch 'master' into converter-optimizations 2024-01-02 15:46:24 +01:00
Viktor Lofgren
72b773f06d (search) fix search metrics labeling 2024-01-02 15:46:14 +01:00
Viktor Lofgren
5f978b865b Merge branch 'master' into converter-optimizations 2024-01-02 15:41:48 +01:00
Viktor Lofgren
57a4f92722 (api) fix missing metrics label in api service 2024-01-02 15:41:38 +01:00
Viktor Lofgren
87351e89ca Merge branch 'master' into converter-optimizations 2024-01-02 15:17:02 +01:00
Viktor
7920c67a48 Merge pull request #71 from MarginaliaSearch/metrics
Add Prometheus Instrumentation
2024-01-02 15:13:53 +01:00
Viktor Lofgren
192e356169 (prometheus) Add instrumentation to the api service 2024-01-02 15:12:44 +01:00
Viktor Lofgren
31232e49fb (prometheus) Add instrumentation to the search, qs and index services. 2024-01-02 15:02:29 +01:00
Viktor Lofgren
116595d218 (prometheus) Add in-docker prometheus instance to exfiltrate metrics from the docker-based services 2024-01-02 14:28:53 +01:00
Viktor Lofgren
9d93a31755 Merge branch 'master' into converter-optimizations 2024-01-02 12:36:16 +01:00
Viktor Lofgren
9f7df59945 (sideload) Reduce quality assessment.
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:35:59 +01:00
Viktor Lofgren
d2418521a7 (index) Further ranking adjustments 2024-01-02 12:35:59 +01:00
Viktor Lofgren
9330b5b1d9 (index) Adjust rank weightings to fix bad wikipedia results
There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero.  This meant that "bad" results always ranked the same.  The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.

Some of the weights were also re-adjusted based on what appears to produce better results.  Needs evaluation.
2024-01-02 12:35:44 +01:00
Viktor Lofgren
faa50bf578 (sideload) Just index based on first paragraph
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!

This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-02 12:35:44 +01:00
Viktor Lofgren
f0d9618dfc (sideload) Reduce quality assessment.
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:34:58 +01:00
Viktor Lofgren
310a880fa8 (index) Further ranking adjustments 2024-01-02 12:24:52 +01:00
Viktor Lofgren
fc6e3b6da0 (index) Further ranking adjustments 2024-01-01 18:51:03 +01:00
Viktor Lofgren
50771045d0 (index) Further ranking adjustments 2024-01-01 18:43:17 +01:00
Viktor Lofgren
8f522470ed (index) Adjust rank weightings to fix bad wikipedia results
There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero.  This meant that "bad" results always ranked the same.  The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.

Some of the weights were also re-adjusted based on what appears to produce better results.  Needs evaluation.
2024-01-01 17:16:29 +01:00
Viktor Lofgren
dc90c9ac65 (sideload) Just index based on first paragraph
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!

This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-01 16:19:38 +01:00
Viktor Lofgren
e46e174b59 (keyword-extractor) Add another test for Name-extractor 2024-01-01 15:21:51 +01:00
Viktor Lofgren
7f3f3f577c (backup) Add task heartbeats to the backup service 2024-01-01 15:20:57 +01:00
Viktor Lofgren
75d87c73d1 (crawler) Disable Java's infinite DNS caching 2023-12-31 16:59:08 +01:00
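One way to bound the JVM's DNS cache lifetime is the `networkaddress.cache.ttl` security property, where a value of -1 means "cache forever". The sketch below is an assumption about the general technique, not the code from the commit above; the class name, method name and the TTL value are hypothetical.

```java
import java.security.Security;

class DnsCacheSketch {
    /**
     * Caps how long successful DNS lookups are cached, in seconds.
     * Should be called early, before the first name lookup is made.
     */
    static void limitDnsCaching(int ttlSeconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(ttlSeconds));
    }

    public static void main(String[] args) {
        limitDnsCaching(3600); // hypothetical TTL of one hour
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

For a long-running crawler this matters because a host whose address changes would otherwise keep resolving to the stale address for the lifetime of the process.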
Viktor Lofgren
0fe44c9bf2 (crawler) Fix broken test
A necessary step was accidentally deleted when cleaning up these tests previously.
2023-12-30 13:56:44 +01:00
Viktor Lofgren
7a1d20ed0a (converter) Better use of ProcessingIterator
Modify processingiterator to be constructed via a factory, to enable re-use of its backing executor service.

This reduces thread churn in the converter sideloader style processing of regular crawl data.
2023-12-30 13:53:55 +01:00
Viktor Lofgren
70c83b60a1 (converter) Clean up fullProcessing()
This function made some very flimsy-looking assumptions about the order of an iterable.  These are still made, but more explicitly so.
2023-12-30 13:36:18 +01:00
Viktor Lofgren
7ba296ccdf (converter) Route sizeHint to SideloadProcessing
Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number.

This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.
2023-12-30 13:05:10 +01:00
Viktor Lofgren
0b112cb4d4 (warc) Update URL encoding in WarcProtocolReconstructor
The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.
2023-12-29 19:41:37 +01:00
Viktor Lofgren
68ac8d3e09 (search) Fetch fewer linking and similar domains.
Showing a total of 200 connected domains is not very informative.
2023-12-29 16:37:27 +01:00
Viktor Lofgren
f6fa8bd722 (search) Fetch fewer linking and similar domains.
Showing a total of 200 connected domains is not very informative.
2023-12-29 16:37:00 +01:00
Viktor Lofgren
6aee27a3f1 (*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style. 2023-12-29 16:36:01 +01:00
Viktor Lofgren
401568033c Merge branch 'master' into converter-optimizations 2023-12-29 15:55:57 +01:00
Viktor Lofgren
ea73be6831 (search) Remove the ugly placeholder screenshots from the site info view. 2023-12-29 15:55:46 +01:00
Viktor Lofgren
ba8a75c84b Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool 2023-12-29 15:10:32 +01:00
Viktor Lofgren
a1f3ccdd6d Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool 2023-12-29 14:59:39 +01:00
Viktor Lofgren
647d38007f Reduce queue polling time in ProcessingIterator
Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.
2023-12-29 14:27:58 +01:00
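The polling change above can be pictured roughly as follows. This is a sketch under assumptions, not the ProcessingIterator implementation: the class name, the `next` helper and the `producersDone` flag are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class PollSketch {
    /**
     * Waits for the next item, waking up every 50 ms to check whether the
     * producers have finished. Returns null once they are done and the
     * queue has been drained.
     */
    static <T> T next(BlockingQueue<T> queue, BooleanSupplier producersDone)
            throws InterruptedException {
        while (true) {
            T item = queue.poll(50, TimeUnit.MILLISECONDS);
            if (item != null) {
                return item;
            }
            if (producersDone.getAsBoolean() && queue.isEmpty()) {
                return null;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("document");
        System.out.println(next(queue, () -> false));
        System.out.println(next(queue, () -> true));
    }
}
```

A shorter poll interval mostly affects how quickly the consumer notices that the producers are done; items that arrive during the wait are returned immediately either way.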
Viktor Lofgren
e7dd28b926 (converter) Optimize sideload-loading
Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.
2023-12-29 14:25:48 +01:00
Viktor Lofgren
b5fc9673d9 Merge branch 'master' into converter-optimizations 2023-12-29 14:04:43 +01:00
Viktor Lofgren
a065040323 (search) Don't inject arbitrary HTML into the site info view xD 2023-12-29 14:04:26 +01:00
Viktor Lofgren
dec3b1092d (converter) Fix bugs in conversion
This commit adds a safety check that the URL of the document is from the correct domain.

It also adds a sizeHint() method to SerializableCrawlDataStream which *may* provide an indication if the stream is very large and benefits from sideload-style processing (which is slow).

It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
2023-12-29 13:58:08 +01:00
Viktor Lofgren
407915a86e (converter) Fix NPEs in converter due to the new data format 2023-12-28 22:54:53 +01:00
Viktor Lofgren
c488599879 (converter) Fix NPE in converter 2023-12-28 19:52:26 +01:00
Viktor Lofgren
bcecc93e39 (converter) Swallow errors in parquet data stream 2023-12-28 19:45:35 +01:00
Viktor Lofgren
ff7d1a250e Merge branch 'master' into converter-optimizations 2023-12-28 19:35:00 +01:00
Viktor Lofgren
70f338c3de (search) Fix NPE in layout selection 2023-12-28 19:34:46 +01:00
Viktor Lofgren
c847d83011 (converter) Add size hint to converter sideload processing 2023-12-28 19:14:16 +01:00
Viktor Lofgren
5ce46a61d4 Merge branch 'master' into converter-optimizations 2023-12-28 13:26:19 +01:00
Viktor
775974d5ec Merge pull request #67 from MarginaliaSearch/rss-feeds-in-site-info
Add RSS Feeds to site info (WIP)
2023-12-28 13:25:38 +01:00
Viktor Lofgren
c7af40c368 (search) Change layout balance when feeds/samples are present 2023-12-28 13:16:10 +01:00
Viktor Lofgren
00a974a721 (crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions
This commit also improves the test coverage for this part of the code.
2023-12-27 20:02:17 +01:00
Viktor Lofgren
7428ba2dd7 (converter) Basic test coverage for sideloading-style processing 2023-12-27 19:29:26 +01:00
Viktor Lofgren
b37223c053 (converter) Basic test coverage for sideloading-style processing 2023-12-27 18:33:16 +01:00
Viktor Lofgren
24051fec03 (converter) WIP Run sideload-style processing for large domains
The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis.   This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process.

These websites now receive a simplified treatment.  This is executed in the converter batch writer thread.  This is slower, but the documents will not be persisted in memory.
2023-12-27 18:20:03 +01:00
Viktor Lofgren
f811a29f87 (crawler) Fix resource leak in crawler
A 10 MB thread local buffer wasn't static.  Oops.
2023-12-27 16:32:17 +01:00
Viktor Lofgren
acf7bcc7a6 (converter) Refactor the DomainProcessor for new format of crawl data
With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter.  This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while.

The first step is to move stuff out of the domain processor into the document processor.
2023-12-27 13:57:59 +01:00
Viktor Lofgren
9707366348 (test) Fix a few slow tests that broke due to domainCount 2023-12-27 13:29:59 +01:00
Viktor Lofgren
9e5fe71f5b (crawler) Switch hash function in crawler
Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler.   This switches to the modified Murmur hash function used throughout Marginalia.
2023-12-27 13:29:00 +01:00
Viktor Lofgren
5d1b7da728 Updated site info feed and search service
Modified site info feed template to secure the description field against injected code. Also adjusted search service by extracting samples within the correct scope and including them in the returned site info. This improves the quality and security of the displayed information.
2023-12-26 22:06:01 +01:00
Viktor Lofgren
3ea1ddae22 (crawler) Roll back switch to virtual thread pool in crawler
This seems to cause a resource leak; it seems the http library uses thread locals?
2023-12-26 19:37:34 +01:00
Viktor Lofgren
1694e9c78c (search) Add RSS Feeds to site info
This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates.

The change introduces a small new feature, a feedlot client based on Java's HttpClient.
2023-12-26 16:21:40 +01:00
Viktor Lofgren
4763077b76 (search/index) Add a new keyword "count"
This is for filtering results on how many times the term appears on the domain.  The intent is to be beneficial in creating e.g. a domain search feature.   It's also very helpful when tracking down spammy domains.
2023-12-25 20:38:29 +01:00
Viktor Lofgren
c0eaca220c (search) Add convenient link for AS search to the search view 2023-12-25 15:07:58 +01:00
Viktor Lofgren
25d086c4e1 (crawler) Clean up stale warc files
We should probably have an option to keep them, but not by default!
2023-12-25 15:07:36 +01:00
Viktor Lofgren
88551043cd (crawler) Even more lenient resyncing 2023-12-25 01:48:11 +01:00
Viktor Lofgren
f779f760c4 (crawler) Even more lenient resyncing 2023-12-25 01:44:18 +01:00
Viktor Lofgren
f18f82e229 (crawler) Write etags and last-modified on reference copy
This commit also fixes a test that broke with a previous change.
2023-12-25 01:40:13 +01:00
Viktor Lofgren
67ef2b45fa (crawler) Reduce logging 2023-12-25 01:10:03 +01:00
Viktor Lofgren
d72e871265 (warc) Fix resync 2023-12-25 01:03:03 +01:00
Viktor Lofgren
4c9bc13309 (warc) Reduce log spam 2023-12-25 00:58:31 +01:00
Viktor Lofgren
84563b0d46 (crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK 2023-12-25 00:55:05 +01:00
Viktor Lofgren
c5aab7e8db (warc) Fix NPE in WarcRecorder 2023-12-25 00:54:38 +01:00
Viktor Lofgren
1755b646b8 (warc) Fix NPE in WarcRecorder 2023-12-25 00:48:42 +01:00
Viktor Lofgren
85f906ea53 (executor) Fix removal of stale process heartbeats 2023-12-23 13:49:24 +01:00
Viktor Lofgren
e1a155a9c8 (crawler) Increase growth of crawl jobs
A number of crawl jobs get stuck at about 300 documents, or just under.  This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl.  GOOD_URLS is based on how many documents successfully process, which is typically fairly small.  Switching to KNOWN_URLS should let this grow faster.

The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table.

The floor is also increased to 250 from 200.
2023-12-23 13:22:10 +01:00
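The limit described in the commit above works out roughly as in the sketch below. The actual calculation happens in the SQL query in DbCrawlSpecProvider; the class and method names here are hypothetical, and both the rounding and the order in which the 1.5x modifier is applied are assumptions.

```java
class CrawlLimitSketch {
    static final int FLOOR = 250;               // raised from 200 in this commit
    static final double GROWTH = 1.25;
    static final double RECRAWL_MODIFIER = 1.5;

    /** Crawl limit for a domain, given its KNOWN_URLS count. */
    static int crawlLimit(int knownUrls, boolean recrawl) {
        double limit = Math.max(FLOOR, GROWTH * knownUrls);
        if (recrawl) {
            limit *= RECRAWL_MODIFIER;
        }
        return (int) limit;
    }

    public static void main(String[] args) {
        System.out.println(crawlLimit(100, false));  // floor applies: 250
        System.out.println(crawlLimit(100, true));   // 250 * 1.5 = 375
        System.out.println(crawlLimit(1000, true));  // 1250 * 1.5 = 1875
    }
}
```

Because KNOWN_URLS counts every URL seen rather than only those that processed successfully, it is usually the larger number, so the limit grows faster from one crawl to the next.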
Viktor Lofgren
0454447e41 (executor) Implement process removal for long-absent heartbeats
Added functionality to remove processes from listing that have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the respective entry from the PROCESS_HEARTBEAT table in case heartbeats are absent for more than one day.
2023-12-23 13:18:21 +01:00
Viktor Lofgren
7b40c0bbee (assistant) Clean up similar websites' results 2023-12-22 14:07:01 +01:00
Viktor Lofgren
dc773c5c20 (adjacencies) Clean up AdjacenciesLoader
Make JDBC batching more consistent, also adds a test case for the loader.
2023-12-21 14:14:22 +01:00
Viktor Lofgren
b6253b03c2 (adjacencies) Fix bug in AdjacenciesLoader
This fixes a bug where a prepared statement was created before the table it was supposed to insert into was created.  This fails and does nothing.

Furthermore, added the logging that would have warned about this failure, had it been in place.
2023-12-21 13:12:31 +01:00
Viktor Lofgren
a5bc29245b (cleanup) Remove vestigial support for WARC crawl data streams 2023-12-20 15:46:21 +01:00
Viktor Lofgren
bfae478251 Refactor CrawlerRevisitor for better consistency 2023-12-20 15:21:49 +01:00
Viktor Lofgren
a7cd490593 (minor) Remove dead code. 2023-12-19 18:58:33 +01:00
Viktor Lofgren
283d2caa81 Merge remote-tracking branch 'origin/master' 2023-12-19 18:38:01 +01:00
Viktor Lofgren
dd8fb04886 (converter) Add sideloadSizeAdvice field to several ProcessedDomain
Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking.  This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.
2023-12-19 18:37:51 +01:00
Viktor
ce8dca7659 Update Additional Contributors.md 2023-12-19 12:22:01 +01:00
Viktor
5bd3934d22 Merge pull request #64 from dreimolo/macos_AS_fix
Macos apple silicon fix, and slight improvements to sample downloader
2023-12-18 18:29:14 +01:00
Viktor Lofgren
128f550ee5 (run) Download to a temporary file to avoid corruption from aborted downloads 2023-12-18 18:28:17 +01:00
Viktor Lofgren
3a56a06c4f (warc) Add fields for etags and last-modified headers to the new crawl data formats
Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats.  This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely.

The commit also adds a few tests to this logic.
2023-12-18 17:45:54 +01:00
Viktor Lofgren
126ac3816f (converter) Reduce queue size in ConverterWriter
The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM.  This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.
2023-12-18 13:42:40 +01:00
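The effect of a capacity-1 queue can be seen in isolation with the sketch below. It is not the ConverterWriter code; the class name and the string payloads are placeholders.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedQueueSketch {
    /** A queue that holds at most one pending item, as in the commit above. */
    static BlockingQueue<String> newQueue() {
        return new ArrayBlockingQueue<>(1);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = newQueue();
        System.out.println(queue.offer("domain-a")); // accepted, queue is now full
        System.out.println(queue.offer("domain-b")); // rejected while full
        queue.take();                                 // the writer consumes one item
        System.out.println(queue.offer("domain-b")); // accepted again
    }
}
```

With a blocking `put()` in place of `offer()`, a producer simply waits until the writer has taken the previous item, which is what keeps fully processed domains from piling up in memory.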
Viktor Lofgren
d02bed1a55 (loader) Optimize DomainLoaderService for faster startups
Initialization parameters in DomainLoaderService and DomainIdRegistry have been updated to improve performance. This is done by adding sane default sizes to the hash tables involved, reducing GC churn, but also by setting a sensible fetch size to the queries used, and not fetching irrelevant information such as the domain name.
2023-12-18 13:15:10 +01:00
Viktor Lofgren
b7ed0ce537 (loader) Reset count after executing batch in DomainLoaderService
This should greatly speed up starting the loader process.
2023-12-18 12:43:53 +01:00
Viktor Lofgren
a742503508 (search) Add view for showing mutual links between two websites 2023-12-17 17:50:44 +01:00
Viktor Lofgren
33312ab09e (geo-ip) Update readme 2023-12-17 16:08:33 +01:00
Viktor Lofgren
c422f0b9fb (geo-ip) Tidy up error handling 2023-12-17 16:06:51 +01:00
Viktor
7797de80e3 Merge pull request #65 from MarginaliaSearch/asn-info
Replace the ip2location-LITE IP geolocation data with ASN information from apnic.net
2023-12-17 15:04:29 +01:00
Viktor Lofgren
c92f1b8df8 (geo-ip) Revert removal of ip2location logic
We do both ip2location and ASN data.

The change also adds some keywords based on autonomous system information, on a somewhat experimental basis.  It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.
2023-12-17 15:03:00 +01:00
Viktor Lofgren
bde68ba48b Merge branch 'master' into asn-info 2023-12-17 14:00:23 +01:00
Viktor Lofgren
bf44805e69 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 14:00:07 +01:00
Viktor Lofgren
edf9aa2c23 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 13:59:54 +01:00
Viktor Lofgren
4801c47273 (crawling-model) Fix bug where CrawledDocument.getDomain() trimmed www-prefixes
This had the knock-on effect of breaking the anchor tag loading in the processor for a lot of domains, since they'd grab domains for the wrong domain name.
2023-12-17 13:53:31 +01:00
Viktor Lofgren
bcad6492d6 (sideloader) Fix integration problems with sideloaders
In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment.

Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters.

Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.
2023-12-17 13:28:17 +01:00
Viktor Lofgren
5ab2a22e88 (search) Fix result count back down to 1 per domain 2023-12-17 13:14:23 +01:00
Viktor Lofgren
d7bd540683 (*) Replace the ip2location IP geolocation data with ASN information from apnic.net.
Doesn't really make sense to use ip2location as a middle man for information that is already freely available...
2023-12-16 21:55:04 +01:00
dreimolo
62954f98de adds xl to help output 2023-12-16 19:41:41 +01:00
Viktor Lofgren
722b56c8ca (index) Fix rare bug in the index-switching logic
This is caused by a resource contention with the query code.  The proper way to fix this is to use some form of synchronization, but that will slow the code down.  So we just hammer it a few times and let the GC deal with the problem if it fails.  Not optimal, but fast.
2023-12-16 18:57:35 +01:00
Viktor Lofgren
f3f12058dc (assistant) Fix logic error in filtering related domains 2023-12-16 18:45:53 +01:00
Viktor Lofgren
3da38d0483 (assistant) Fix logic error in filtering related domains 2023-12-16 18:44:25 +01:00
Viktor Lofgren
d715b1f9ca (search) Improve error handling in search parameters parsing
The code now intercepts and deals with potential exceptions during the parsing of search parameters. This is in response to constant bad requests from bots which were cluttering the logs. A catch clause is added that suppresses these errors and redirects to the base URL.
2023-12-16 18:42:13 +01:00
Viktor Lofgren
e13fa25e11 (assistant) Clean up the site info related domains view by filtering viable domains 2023-12-16 18:37:09 +01:00
Viktor Lofgren
34d4834ff6 (assistant) Clean up the site info related domains view by filtering viable domains 2023-12-16 18:27:24 +01:00
Viktor Lofgren
117ddd17d7 (assistant) Fix bugs in IP flag emoji generation 2023-12-16 17:07:17 +01:00
Viktor Lofgren
6f2bf38f0e (index) Fix off-by-1 error in the domain count limiter 2023-12-16 16:57:05 +01:00
Viktor Lofgren
320882c34a (site-info) Try to discover the scheme of the website with a site:-query
The site info view can't blindly assume that every website supports https.  To figure out which scheme to use when linking to a site, execute a single-result search for site:domain.name and then grab the scheme off the result.

To allow this, a count parameter is introduced to doSiteSearch() in SearchOperator.
2023-12-16 16:34:53 +01:00
Viktor
8bbb533c9a Merge pull request #62 from MarginaliaSearch/warc
(WIP) Use WARCs in the crawler
2023-12-16 16:02:46 +01:00
Viktor Lofgren
3113b5a551 (warc) Filter WarcResponses based on X-Robots-Tags
There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.
2023-12-16 15:58:27 +01:00
dreimolo
c0cc05177f corrects protobuf.plugins.grpc 2023-12-16 14:24:41 +01:00
dreimolo
0b34d43804 workaround for failing mac on apple silicon deps 2023-12-16 14:22:11 +01:00
dreimolo
6c7d7427bf Adds check for wget and curl, and valid sample archives 2023-12-16 14:14:58 +01:00
Viktor Lofgren
54ed3b86ba (minor) Remove dead code. 2023-12-15 21:49:35 +01:00
Viktor Lofgren
2001d0f707 (converter) Add @Deprecated annotation to a few fields that should no longer be used. 2023-12-15 21:42:00 +01:00
Viktor Lofgren
0f9cd9c87d (warc) More accurate filtering of advisory records
Further create records for resources that were blocked due to robots.txt; as well as tests to verify this happens.
2023-12-15 21:37:02 +01:00
Viktor Lofgren
2e7db61808 (warc) More accurate filtering of advisory records
We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes.

Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s.

The commit also adds some safety try-catch blocks around the charset handling, as it may sometimes explode when fed incorrect data, and we do be guessing...
2023-12-15 21:31:16 +01:00
Viktor Lofgren
5329968155 (crawler) Update CrawlingThenConvertingIntegrationTest
This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.
2023-12-15 21:04:06 +01:00
Viktor Lofgren
2e536e3141 (crawler) Add timestamp to CrawledDocument records
This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream.

The parquet format stores the timestamp as a 64 bit long, seconds since unix epoch, without a logical type.  This is to avoid having to do format conversions when writing and reading the data.

This parquet field populates the timestamp field in CrawledDocument.
2023-12-15 20:23:27 +01:00
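The timestamp representation described above, a plain 64-bit count of seconds since the Unix epoch, maps onto java.time as in the sketch below. The class and method names are hypothetical and do not come from the commit.

```java
import java.time.Instant;

class EpochSecondsSketch {
    /** Value written to the parquet column: whole seconds since the Unix epoch. */
    static long toStored(Instant timestamp) {
        return timestamp.getEpochSecond();
    }

    /** Value read back from the parquet column. */
    static Instant fromStored(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }

    public static void main(String[] args) {
        Instant fetched = Instant.parse("2023-12-15T19:23:27Z");
        long stored = toStored(fetched);
        System.out.println(fromStored(stored).equals(fetched));
    }
}
```

Sub-second precision is dropped in this representation, which is usually acceptable for crawl timestamps.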
Viktor Lofgren
cf935a5331 (converter) Read cookie information
Add an optional new field to CrawledDocument containing information about whether the domain has cookies.  This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object.

Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.
2023-12-15 18:09:53 +01:00
Viktor Lofgren
fa81e5b8ee (warc) Use a non-standard WARC header to convey information about whether a website uses cookies
This information is then propagated to the parquet file as a boolean.

For documents that are copied from the reference, use whatever value we last saw.  This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.
2023-12-15 16:37:53 +01:00
Viktor Lofgren
9fea22b90d (warc) Further tidying
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.

A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.

Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
2023-12-15 15:38:23 +01:00
Viktor Lofgren
0889b6d247 (warc) Clean up parquet conversion
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, and adds support for information about redirects and errors due to probe failure.

It also refactors the fetch result, body extraction and content type abstractions.
2023-12-14 20:39:40 +01:00
Viktor Lofgren
1328bc4938 (warc) Clean up parquet conversion
This commit cleans up the warc->parquet conversion.  Records with a http status other than 200 are now included.

The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.

The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
2023-12-14 16:05:48 +01:00
Viktor Lofgren
787a20cbaa (crawling-model) Implement a parquet format for crawl data
This is not hooked into anything yet.  The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays.  This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.
2023-12-13 16:22:19 +01:00
Viktor Lofgren
a73f1ab0ac Merge branch 'master' into warc 2023-12-13 15:35:29 +01:00
Viktor Lofgren
30c0dad3ae (gradle) Bump gradle-wrapper version to 8.5
This finally resolves issues with gradle confusing intellij by complaining about java incompatibilities (that were never a problem), so that it doesn't report test errors correctly.
2023-12-13 15:35:01 +01:00
Viktor Lofgren
440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
Viktor Lofgren
b74a3ebd85 (crawler) WIP integration of WARC files into the crawler process.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.

This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.

The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
2023-12-11 19:32:58 +01:00
Viktor Lofgren
45987a1d98 Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
Viktor Lofgren
8f0950fc44 (geoip) Fix incorrect synchronization. 2023-12-11 14:01:39 +01:00
Viktor Lofgren
30bc3f9281 (converter) Use the prefix ip: instead of geoip: for country codes
This is the same as the prefix for the IP address, but I don't think that substantially matters, as the two have such different namespaces there can be no confusion.
2023-12-11 13:59:23 +01:00
Viktor Lofgren
f655ec5a5c (*) Refactor GeoIP-related code
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.

The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.

The converter is modified to query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.

The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
2023-12-10 17:30:43 +01:00
Viktor Lofgren
84b4158555 (minor) Fix broken test 2023-12-10 14:39:20 +01:00
Viktor Lofgren
91dd45cf64 (search) IP and IP geolocation in site info view
This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
37af60254f (search) Better recipe filter
Tune the recipe filter to give better results, by using the 'popular' domains set along with excluding results with heavy tracking.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
f0e736d4ea (search) Update the search profile 'Academia' to strictly filter on academic tlds
The previous version used a personalized pagerank centering on a few academic domains, but this didn't work very well and most results were not very academia-centric.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
e3ebb0c5bb (*) Rename the search filter 'RETRO' into 'POPULAR'
This will make the terminology more consistent between the GUI and the code.  The rankings yaml still uses 'retro' though, to retain compatibility.
2023-12-09 20:06:54 +01:00
Viktor Lofgren
6382f779c3 (search) Revert back to using 'Popular' as the default search filter
Unfiltered is a bit too ... unfiltered, and gives a bad first impression for many queries.
2023-12-09 16:34:12 +01:00
Viktor Lofgren
8ef34883a8 (search) Move site information out of the search service and into assistant.
This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available.  It also permits exposing this information via API in the future if there is interest in this.

The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time.

Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.
2023-12-09 16:30:06 +01:00
Viktor Lofgren
5c46af0edb (converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator
Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g. a database query in parallel and returning the computed results as an iterator.

The iterator was also improved to be more reliable; previous versions of the logic would sometimes deadlock due to false positives in hasMore().
2023-12-09 15:20:53 +01:00
Viktor Lofgren
b6511fbfe2 (converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing
The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance.

It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.
2023-12-09 15:20:52 +01:00
Viktor Lofgren
eccb12b366 (control) Fix spurious state detection in control-side actors
A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor!

To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.
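
A minimal sketch of the idea described above, assuming an in-memory message log rather than the real message queue. The class ActorStateLogSketch and its methods post, trigger and getState are hypothetical names for illustration, and the negative-ID error path is left out:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the actor state log; not the project's ExecutorRemoteActor.
final class ActorStateLogSketch {
    private static final class Message {
        final long id;
        final String state;
        Message(long id, String state) { this.id = id; this.state = state; }
    }

    private final List<Message> messages = new ArrayList<>();
    private long nextId = 1;

    /** Appends a state message and returns its ID. */
    long post(String state) {
        long id = nextId++;
        messages.add(new Message(id, state));
        return id;
    }

    /** Starts a new run; the returned ID marks where this run's messages begin. */
    long trigger() {
        return post("STARTED");
    }

    /** Latest state among messages with id >= sinceId, or null if there is none. */
    String getState(long sinceId) {
        String latest = null;
        for (Message m : messages) {
            if (m.id >= sinceId) latest = m.state;
        }
        return latest;
    }
}
```

Because getState only considers messages at or after the ID returned by trigger, an 'OK' left over from an earlier run can no longer be mistaken for the outcome of the current one.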
2023-12-09 12:50:05 +01:00
Viktor Lofgren
d0982e7ba5 (converter) Add error handling and lazy load external domain links
The converter was not properly initiating the external links for each domain, causing an NPE in conversion.  This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data.

Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.
2023-12-09 12:33:39 +01:00
Viktor Lofgren
fc30da0d48 (converter) Add academia recognition to DomainProcessor
The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or by a regex pattern matching domains like *.ac.ccTld or *.edu.ccTld.

 If these conditions are met, the search term "special:academia" is added to the domain.

 The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well.  The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.
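
A rough sketch of what such a check might look like. The class name, method name and the exact regex are assumptions for illustration, not the code in DomainProcessor, and the ccTLD is assumed to be two letters:

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not the project's DomainProcessor code.
final class AcademiaDomainSketch {
    // Matches names such as "ox.ac.uk" or "unam.edu.mx":
    // anything, then ".ac." or ".edu.", then a two-letter country code.
    private static final Pattern ACADEMIC_CC_TLD =
            Pattern.compile(".*\\.(ac|edu)\\.[a-z]{2}$");

    static boolean isAcademic(String domain) {
        String d = domain.toLowerCase(Locale.ROOT);
        return d.endsWith(".edu") || ACADEMIC_CC_TLD.matcher(d).matches();
    }
}
```

A domain that passes this check would then get the "special:academia" search term attached.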
2023-12-08 20:31:34 +01:00
Viktor Lofgren
e6a1052ba7 Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default. 2023-12-08 20:24:01 +01:00
Viktor Lofgren
968dce50fc (crawler) Refactored IpInterceptingNetworkInterceptor for clarity. 2023-12-08 17:45:46 +01:00
Viktor Lofgren
3bbffd3c22 (crawler) Refactor HttpFetcher to integrate WarcRecorder
Partially hook in the WarcRecorder into the crawler process.  So far it's not read, but should record the crawled documents.

The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.
2023-12-08 17:12:51 +01:00
Viktor Lofgren
072b5fcd12 Implement Warc-recording wrapper for OkHttp3 client
This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted.  This component is currently not hooked into anything.

The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'.

The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.
2023-12-08 13:49:16 +01:00
Viktor Lofgren
fabffa80f0 (warc) Integrate the crawler's content type parsing and charset logic into the WarcSideloader 2023-12-07 15:26:01 +01:00
Viktor Lofgren
064265b0b9 (crawler) Move content type/charset sniffing to a separate microlibrary
This functionality needs to be accessed by the WarcSideloader, which is in the converter.  The resultant microlibrary is tiny, but I think in this case it's justifiable.
2023-12-07 15:16:37 +01:00
Viktor Lofgren
2d5d11645d (warc) Refactor WarcSideloaderTest to not rely on specific test files on the computer 2023-12-06 19:00:29 +01:00
Viktor Lofgren
cc813a5624 (convert) Add basic support for Warc file sideloading
This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.
2023-12-06 18:43:55 +01:00
Viktor Lofgren
156c067f79 (search) Fix mobile issues with browse feature 2023-12-05 21:28:50 +01:00
Viktor Lofgren
b33b013d41 (search) Fix broken script tag
Apparently it can't be called suggestions.js...?
2023-12-05 20:29:13 +01:00
Viktor Lofgren
e74e2f705f (search) Fix broken script tag
suggestions.js became something else.
2023-12-05 20:20:07 +01:00
Viktor Lofgren
2e438847fc (search) Optimize related domains queries
In the future this logic probably needs to move into a separate
service, as it's still quite slow to load.  But this fixes response
times and DOS potential of previous version.
2023-12-05 20:12:03 +01:00
Viktor Lofgren
9301c47d93 (search) Optimize related domains queries 2023-12-05 14:42:03 +01:00
Viktor Lofgren
20ec58b07f (search) Remove layout-breakingly long URLs from the similar domains view.
They're almost all .onion URLs anyway, not really the space we're looking to peer into.
2023-12-05 13:58:15 +01:00
Viktor Lofgren
98983c1015 (search) Hopefully fix race condition that leaves the response with no Content-type header 2023-12-05 13:52:36 +01:00
Viktor Lofgren
67195592c6 (search) Hopefully fix race condition that leaves the response with no Content-type header 2023-12-05 13:48:42 +01:00
Viktor
21abfc6424 Merge pull request #61 from MarginaliaSearch/new-look
Design Revamp For search.marginalia.nu
2023-12-05 13:28:54 +01:00
Viktor Lofgren
d1e88df71e (search) Cleaning up the code a bit 2023-12-05 13:26:05 +01:00
Viktor Lofgren
f36cfe34ab (search) Hackery to get a more balanced view 2023-12-04 22:50:39 +01:00
Viktor Lofgren
8a1934008c (search) Merge similar sites results with the info view.
WIP: This commit needs to be cleaned up.
2023-12-04 22:10:24 +01:00
Viktor Lofgren
b41bb9cfcf (search) Use a &Xi; for mobile button title instead of "Filters".
Makes it easier to distinguish from the search button.
2023-12-03 16:33:25 +01:00
Viktor Lofgren
d58324bbef (search) Clean up filters menu a bit, improve accessibility. 2023-12-02 18:05:30 +01:00
Viktor Lofgren
cbbd45d3e5 (search) Clean up filters menu a bit, improve accessibility. 2023-12-02 18:01:03 +01:00
Viktor Lofgren
b89633ae4b (search) Don't render a filter button on mobile when there are no filters to be presented. 2023-12-02 17:23:45 +01:00
Viktor Lofgren
96357e9bfd (search) Fix typeahead suggestions, as well as improve mobile and desktop UX in small ways. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
d530c3096f (search) GUI tweaks to make the new interface not fall apart on mobile/chrome 2023-12-02 17:06:40 +01:00
Viktor Lofgren
ae0c1c3f2d (control) Adjust search result margins for better visual density 2023-12-02 17:06:40 +01:00
Viktor Lofgren
0cc2564380 (search) CSS tweaks 2023-12-02 17:06:40 +01:00
Viktor Lofgren
38d20022ad (search) Fix script loading for mobile support 2023-12-02 17:06:40 +01:00
Viktor Lofgren
280132dad0 (search) Fix script loading for mobile support 2023-12-02 17:06:40 +01:00
Viktor Lofgren
61de4e2789 (search) Retain filter options when performing a new search from the input field 2023-12-02 17:06:40 +01:00
Viktor Lofgren
f9d3455320 (search) Reduce visual weight of search results 2023-12-02 17:06:40 +01:00
Viktor Lofgren
2ff64c3c12 (search) New toggle for reducing tracking 2023-12-02 17:06:40 +01:00
Viktor Lofgren
902f235b5b (search) Integrate 'similar' tab in site info. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
97d43a6fa2 (search) Revamp browse results with new look. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
9bc65ff0ca (search) Desaturate search result titles according to rank 2023-12-02 17:06:40 +01:00
Viktor Lofgren
6cd6a615fd (search) Add data-filter to body as a data attribute
For future shenanigans ;D
2023-12-02 17:06:40 +01:00
Viktor Lofgren
5639f0653d (search) Rename SearchProfile.name into filterId
Avoid foot-gun caused by name clash with the Enumeration method name(), which returns the Java name of the enumeration value.
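
To illustrate the foot-gun, here is a small sketch with a hypothetical enum. The constants and their IDs are made up for the example and are not the real SearchProfile values:

```java
// Hypothetical enum; the constants and IDs are illustrative only.
enum SearchProfileSketch {
    NO_FILTER("no-filter"),
    POPULAR("popular");

    // Had this field been called "name", it would be easy to confuse
    // with the built-in name() method, which returns "NO_FILTER" / "POPULAR".
    final String filterId;

    SearchProfileSketch(String filterId) {
        this.filterId = filterId;
    }
}
```

With the field called filterId, SearchProfileSketch.POPULAR.name() and SearchProfileSketch.POPULAR.filterId are no longer a single pair of parentheses apart.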
2023-12-02 17:06:40 +01:00
Viktor Lofgren
251174c9a2 (search) Update front page with new look 2023-12-02 17:06:40 +01:00
Viktor Lofgren
42ea87d637 (search) Update conversion results, error page, and dictionary results with new CSS. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
7c8a60b8cf (search) Site info view is mostly done
Also optimize the rendering a bit to avoid having to allocate huge string buffers, writing directly to Spark's response instead.
2023-12-02 17:06:40 +01:00
Viktor Lofgren
2f4500be5a (search) New frontend look 2023-12-02 17:06:40 +01:00
Viktor Lofgren
fa7534a362 (search) Remove dead code 2023-12-02 17:06:40 +01:00
Viktor Lofgren
a258f0af7a (search) Refactor search parameters to include query 2023-12-02 17:06:40 +01:00
Viktor Lofgren
01621c6344 (renderer) Make helpers configurable on a by-service basis. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
c7934342a6 (control) Automatic recrawl 2023-12-02 17:06:24 +01:00
Viktor Lofgren
f5c324c06b (minor) Fix broken test 2023-12-01 17:44:39 +01:00
Viktor Lofgren
f615cf2391 (convert) Loosen up the rules enforcement for documents that have external links. 2023-12-01 17:44:29 +01:00
Viktor Lofgren
c984a97262 (docs) Update crawling.md 2023-11-30 21:53:56 +01:00
Viktor Lofgren
a02c06a837 (docs) Update sideloading-howto.md 2023-11-30 21:51:03 +01:00
Viktor Lofgren
21d6aa421c (docs) Update setup instructions 2023-11-30 21:44:29 +01:00
Viktor Lofgren
e5d274fe1c (docs) Improve architectural documentation 2023-11-30 21:38:57 +01:00
Viktor Lofgren
166a391eae (docs) Improve architectural documentation for the crawler. 2023-11-30 21:30:57 +01:00
Viktor Lofgren
5fb24bb27f (docs) Improve architectural documentation for the converter. 2023-11-30 20:43:22 +01:00
Viktor Lofgren
5a5430b383 (convert) Wiki specialization that should do a better job at removing junk keywords and providing a useful summary. 2023-11-30 20:04:46 +01:00
Viktor Lofgren
67a1e1c874 (control) GUI for triggering control-side actors 2023-11-29 15:31:14 +01:00
Viktor Lofgren
4155fbe94c (control) Reprocess-all actor 2023-11-28 17:58:48 +01:00
Viktor Lofgren
347fe6b7be (control) Reindex-all actor 2023-11-28 16:41:09 +01:00
Viktor Lofgren
ff3ceb981e (control) Button for removing a stale 'NEW' status
If a process is violently terminated, the associated file storage may get stuck in the ephemeral 'NEW' state, preventing future operations on the associated data.

To remedy this without having to dig through the database, a button was added to reset the state.  It's a band-aid, but the situation is rare enough that I think it's fine.
2023-11-28 15:18:24 +01:00
Viktor Lofgren
1dafa0c74d (mqapi/control) Repair repartition endpoint, deprecate notify endpoints.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId.  In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.
2023-11-27 16:01:12 +01:00
Viktor Lofgren
09917837d0 (process) Ensure construction exceptions are logged
Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs.

The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.
2023-11-22 18:32:06 +01:00
Viktor Lofgren
dd507a3808 (db) Fix migrations, bump flyway to 10.0.1
Tricky problem, creating a procedure apparently needs delimiter shenanigans in Flyway, otherwise it will truncate the END statement and mariadb will be sad.
2023-11-21 20:04:35 +01:00
Viktor Lofgren
e67dcf4d68 (docker) Fix image tagging
Should now be possible to push a tagged image with e.g.

./gradlew --info dockerPush -Pdocker-registry=registry.marginalia.nu -Pdocker-tag=test2
2023-11-18 13:27:12 +01:00
Viktor Lofgren
dd9406d0ac (control) Make storage type tabs consistent
This had fallen off in the Create New Specification view, it lacked Exports.
2023-11-17 11:26:45 +01:00
Viktor Lofgren
6a80ac62a5 (doc) Amend crawling documentation 2023-11-17 11:16:06 +01:00
Viktor Lofgren
98efb08e17 (gradle) Make docker image registry and tag configurable
This is to enable running an external repository for production and test.

Use ./gradlew -Pdocker-registry=registry.foo.bar -Pdocker-tag=my-tag while building to accomplish this.  By default, use 'marginalia' for repository and 'latest' as tag.
2023-11-16 21:12:55 +01:00
Viktor Lofgren
f58a9f46be (loader) Don't truncate the entire links table on load
This behavior is an old vestige from the days of only having a single loader process.  We'd truncate the links table because doing inserts/updates was too slow.  This was also important because we had 32-bit IDs, and there's a lot of links between domains to go around...

Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE.

We also update the PRIMARY KEY to a BIGINT.  We'll need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.
2023-11-16 10:30:12 +01:00
Viktor Lofgren
fd77e62a13 set registry to registry.marginalia.nu 2023-11-15 14:04:44 +01:00
Viktor Lofgren
376228e199 (docker) set registry to registry.marginalia.nu 2023-11-15 14:03:22 +01:00
Viktor Lofgren
8a5b853fae Fix experiment runner 2023-11-15 14:03:17 +01:00
Viktor Lofgren
1cbf23e7e7 (test) Don't fail test if atags.parquet is not in ~vlofgren 2023-11-15 09:11:38 +01:00
Viktor Lofgren
63554ba171 (explore2) Add robots.txt 2023-11-14 09:15:32 +01:00
Viktor Lofgren
5de37cb820 (converter) Set feature flags appropriately on stackexchange posts 2023-11-12 15:48:08 +01:00
Viktor Lofgren
e5cee1f46d (sideload) Fix sideloading so that it doesn't get disproportionately good rankings
Also add type flags so that e.g. wikipedia shows up in the wikis filter.
2023-11-12 14:57:57 +01:00
Viktor Lofgren
e9a01caa5c (index) Fix broken metrics 2023-11-11 12:53:47 +01:00
Viktor Lofgren
858357a246 (metrics) Get prometheus up out of disrepair
* Fix bad labels
* Add nodeId where appropriate
* Hopefully fix histogram buckets for index query times
2023-11-08 14:01:28 +01:00
Viktor Lofgren
ef16502159 (doc) Update readme 2023-11-07 16:00:18 +01:00
Viktor Lofgren
29e2c43e01 (gradle) Up to gradle 8.4 since it has better Java 21 compatibility 2023-11-07 16:00:08 +01:00
Viktor Lofgren
7aa2f80117 (domain) id.au should be treated as a TLD 2023-11-06 19:07:47 +01:00
Viktor
d29f9c4ffd Merge pull request #59 from MarginaliaSearch/atags
Support for anchor tag keywords

* Added new (optional) model file in $WMSA_HOME/data/atags.parquet. Due to size limitations on github, this is available at https://downloads.marginalia.nu/exports
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* The ranking algorithm was tweaked to make better use of ngram information as well as weighting the priority BM25
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
* Crawler will also use the anchor tag file to prioritize crawling documents with external links.
2023-11-06 19:02:52 +01:00
Viktor Lofgren
7617b4cbc2 (crawler) Fix NPE in crawler caused by not having fetched the domains list yet 2023-11-06 18:16:38 +01:00
Viktor Lofgren
e0c769fd19 (converter) Integrate atags.parquet with the encyclopedia sideloader
Also clean up stackexchange and dirtree a bit.
2023-11-06 18:03:01 +01:00
Viktor Lofgren
ebd10a5f28 (crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized 2023-11-06 16:14:58 +01:00
Viktor Lofgren
2b77184281 (converter) Integrate atags with the topology field 2023-11-06 13:46:44 +01:00
Viktor Lofgren
e23976f6c4 (search) Fix card title overflow 2023-11-06 13:25:39 +01:00
Viktor Lofgren
0b8dc02eba (result-ranking) Nudge up results with ngram matches a tiny bit 2023-11-06 13:14:22 +01:00
Viktor Lofgren
fde1d0677e (search) Remove unnecessary dependencies 2023-11-06 12:56:32 +01:00
Viktor Lofgren
48986574ae (result-ranking) Use a weighted calculation of priority term importance 2023-11-06 12:56:21 +01:00
Viktor Lofgren
c7a6a71d07 (result-ranking) Use a weighted calculation of priority term importance 2023-11-06 12:48:23 +01:00
Viktor Lofgren
1847845151 Revert "(loader) Optimize INSERT statements"
This reverts commit 7cb92195d1.
2023-11-04 19:32:02 +01:00
Viktor Lofgren
7cb92195d1 (loader) Optimize INSERT statements
INSERT IGNORE is too slow.
2023-11-04 17:43:55 +01:00
Viktor Lofgren
72afa0341f duckdb connection may need to be synchronized? 2023-11-04 14:30:25 +01:00
Viktor Lofgren
0152004c42 Initial Commit Anchor Tags
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
2023-11-04 14:24:17 +01:00
Viktor Lofgren
30ca5046b5 (docker) Route screenshots to dating as well 2023-11-02 15:47:18 +01:00
Viktor Lofgren
8e9698c9a0 (control/search) Add ability to suggest removing a site from random exploration
This is what most complaints have been about.
2023-11-02 15:29:49 +01:00
Viktor Lofgren
3047e2dd7c (screenshot-capture-tool) Make screenshot-capture-tool cooperate with docker 2023-11-01 16:38:55 +01:00
Viktor Lofgren
a8b9d21f2d (executor) Refine atag export logic
* Remove obviously uninteresting tags
* Omit URL schema for more sensible sorting
* Change the column order to put the source domain last
2023-11-01 13:23:14 +01:00
Viktor Lofgren
c77a5b7cb6 (control) GUI for atags export 2023-10-31 17:55:47 +01:00
Viktor Lofgren
23f2068e33 (executor) Actor for exporting anchor tag data from crawl data 2023-10-31 17:32:34 +01:00
Viktor Lofgren
ffadfb4149 (control) Use a partial template for the storage types tabs. 2023-10-31 17:12:14 +01:00
Viktor Lofgren
b7e38cfbae (control) Add exports view 2023-10-31 17:08:48 +01:00
Viktor Lofgren
659743b39c (executor) Export Data actor allocates its own storage 2023-10-31 17:04:07 +01:00
Viktor
cbac42bdd1 Merge pull request #55 from MarginaliaSearch/multinode-index
Multinode index, control GUI redesign
2023-10-31 16:37:00 +01:00
Viktor Lofgren
69758c5859 (control) Nicer redirects acknowledging actions 2023-10-31 16:26:29 +01:00
Viktor Lofgren
81bfd7e5fb (experiment) Utility for exporting atags 2023-10-31 16:10:21 +01:00
Viktor Lofgren
fd8a5e695d (build) Upgrade dependencies with CVEs 2023-10-31 16:09:58 +01:00
Viktor Lofgren
8f74dbdbb4 (crawler) Set more lenient parameters for recrawl 2023-10-30 11:35:30 +01:00
Viktor Lofgren
fd5a7eac87 (crawler) Exit crawler retriever on thread interrupted 2023-10-30 11:34:16 +01:00
Viktor Lofgren
6bac3c75cb (api) API documentation 2023-10-29 16:13:21 +01:00
Viktor Lofgren
5d6e0e3790 (log) Clean up logging
Don't log the PROCESS stream to executor's logs, as it will also be logged in the spawned process' log files.

Also tell the spawned process which "service" it is so that it gets a log file with a name that makes sense.
2023-10-29 15:52:17 +01:00
Viktor Lofgren
2871a326e6 (ctrl/exe) Clean up UX and code 2023-10-29 14:09:39 +01:00
Viktor Lofgren
abb42f0f36 (crawler) Fix bug in SQL statement
Arguments were in the wrong order in inserting fetching sites submitted to be crawled
2023-10-29 13:19:17 +01:00
Viktor Lofgren
f6fcb04817 (experiment) Repair the experiment runner 2023-10-27 16:16:50 +02:00
Viktor Lofgren
b8796d825d (docs) Update documentation 2023-10-27 13:24:49 +02:00
Viktor Lofgren
e97259aca3 (docs) Update documentation 2023-10-27 13:22:11 +02:00
Viktor Lofgren
88f49834fd (docs) Update documentation 2023-10-27 12:45:39 +02:00
Viktor Lofgren
4415f52e18 (keyword-extraction) Fix broken test 2023-10-27 12:19:33 +02:00
Viktor Lofgren
98d742d634 (actor) Code cleanup 2023-10-27 12:19:20 +02:00
Viktor Lofgren
6c1ca10be7 (minor) code cleanup 2023-10-27 11:38:37 +02:00
Viktor Lofgren
aeaf2d546a (search) Fix broken redirect for flagging problems with websites 2023-10-27 11:20:49 +02:00
Viktor Lofgren
c7cb6664b4 (control) Indicate missing services with danger-color instead of having a distracting and constantly updating last-seen number 2023-10-26 18:05:22 +02:00
Viktor Lofgren
79adba9284 (index) Fix bug in dealing with quoted search terms 2023-10-26 16:28:23 +02:00
Viktor Lofgren
37b7f52f2c (minor) Reduce log severity for getTermMeta miss 2023-10-26 15:41:52 +02:00
Viktor Lofgren
c89e0ab255 (minor) Disable ~vlofgren specific debug test 2023-10-26 15:27:59 +02:00
Viktor Lofgren
f613f4f2df (array) Fix spurious search results
This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss.

It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.
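
For context, a minimal sketch of a binary search that encodes a miss as a negative value. It assumes the same convention as java.util.Arrays.binarySearch, which may differ from the encoding the array library actually uses; BinarySearchSketch is a hypothetical name:

```java
// Hypothetical illustration; not the project's array library code.
final class BinarySearchSketch {
    /** Returns the index of key in the sorted array, or -(insertionPoint + 1) on a miss. */
    static int search(long[] sorted, long key) {
        int lo = 0;
        int hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow on large indices
            if (sorted[mid] < key) lo = mid + 1;
            else if (sorted[mid] > key) hi = mid - 1;
            else return mid;
        }
        // Always negative, even when the insertion point is 0.
        return -(lo + 1);
    }
}
```

Callers can then treat any non-negative result as a hit. A miss that comes back non-negative, as in the bug described above, would be read as a valid position and produce a spurious result.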
2023-10-26 15:27:02 +02:00
Viktor Lofgren
a497e4c920 (crawler) Terminate crawler after a few hours of no progress 2023-10-26 12:49:28 +02:00
Viktor Lofgren
0f637fb722 (logging) Better logging configurations 2023-10-26 12:48:10 +02:00
Viktor Lofgren
ba48c8e25b (docker-compose) Add health check to mariadb 2023-10-26 11:10:53 +02:00
Viktor Lofgren
abbadc92a0 (executor) Prevent TriggerAdjacencyCalculationActor from showing up in the actions tab when it isn't running 2023-10-25 21:25:07 +02:00
Viktor Lofgren
97fcbdd6d9 (control) Move storage actions into the actions tab
* Also disable annoying CSS animations
2023-10-25 21:23:56 +02:00
Viktor Lofgren
d7686b665e Refactoring
* Encyclopedia sideloader; permit providing base URL.
* Storage base shows node id in GUI
* ProcessLivenessMonitorActor restarts automatically
* Clean-up of outbox code
2023-10-25 18:51:02 +02:00
Viktor Lofgren
b8855afd10 Route suggestions to assistant 2023-10-25 14:50:21 +02:00
Viktor Lofgren
5de41a3a7f (search-service) Show node affinity in site info tab 2023-10-25 12:44:48 +02:00
Viktor Lofgren
84cdac83d6 (control) Move message queue monitor to control 2023-10-24 16:44:28 +02:00
Viktor Lofgren
436a55ee1e (control) Render UUID tooltip with dashes. 2023-10-24 16:37:40 +02:00
Viktor Lofgren
313cc2965c (index-creation) Print whether full or prio is created
Previous state of saying reverse index for both was pretty confusing.
2023-10-24 16:23:10 +02:00
Viktor Lofgren
95f74c5ea7 (control) Filter out heartbeats that are stopped 2023-10-24 16:09:28 +02:00
Viktor Lofgren
8d1c3c754d Testing development flow with adding a ~tilde search filter 2023-10-24 15:35:15 +02:00
Viktor Lofgren
72152f9d80 Fix bug in handling js parameters 2023-10-24 15:10:02 +02:00
Viktor Lofgren
ebd365a128 Fix exception 2023-10-24 15:04:12 +02:00
Viktor Lofgren
c130d7cf5f (*) Use traefik instead of nginx for reverse proxy 2023-10-24 14:44:19 +02:00
Viktor Lofgren
0406e76889 (api) Remove logging cruft 2023-10-24 13:39:05 +02:00
Viktor Lofgren
c2b28c0f8d (api) Trial streaming API 2023-10-24 13:26:46 +02:00
Viktor Lofgren
9aa5038756 (search) Remove unnecessary filtering operation 2023-10-24 11:43:47 +02:00
Viktor Lofgren
a860f8f1a8 (index/qs) GRPC API for better query performance 2023-10-24 11:38:07 +02:00
Viktor Lofgren
487c016a32 (qs) Speed 2023-10-23 14:03:09 +02:00
Viktor Lofgren
e4bddb4993 (control) Better UUID accessibility 2023-10-23 12:53:53 +02:00
Viktor Lofgren
731afcb864 (qs) Parallel execution 2023-10-23 12:06:03 +02:00
Viktor Lofgren
efb73ff4e7 (qs) Don't blow up if an index node isn't responsive 2023-10-23 11:53:18 +02:00
Viktor Lofgren
2ed2f35a9b (actor) Rewrite of the actor prototype class using record pattern matching 2023-10-23 10:18:20 +02:00
Viktor Lofgren
119151cad3 (converter) Separation of concerns 2023-10-22 14:35:33 +02:00
Viktor Lofgren
758f9b5aa5 (converter) Get UUID pips out of the models
Rendering concerns shouldn't be in the models, it's poor separation of concerns and very difficult to follow.
2023-10-22 14:24:52 +02:00
Viktor Lofgren
e06a8c1de2 (converter) Put upper limit on number of worker threads. 2023-10-22 14:03:09 +02:00
Viktor Lofgren
29ce8ca0cf (db) Reduce db pool size
This is a temporary thing
2023-10-22 14:03:09 +02:00
Viktor Lofgren
eb4158df0b (control) Fix start/stop FSM endpoints 2023-10-22 14:03:09 +02:00
Viktor Lofgren
12fda1a36b (control) Temporarily re-writing the data balancer to get it to work in prod
Need to clean this up later.
2023-10-22 14:03:09 +02:00
Viktor Lofgren
e927f99777 (control) JSON serializes Map<Integer> to Map<Double> and Java gets confused 2023-10-21 16:24:20 +02:00
Viktor Lofgren
044bcf55bd (control) Fix SQL in rebalance actor 2023-10-21 16:13:37 +02:00
Viktor Lofgren
e475af9f49 (control) Initialize controlActorService 2023-10-21 16:06:53 +02:00
Viktor Lofgren
c6abcd91fa (control) Better use of FS states, fix bug with start/stop actors 2023-10-20 16:37:49 +02:00
Viktor Lofgren
10fc489822 (converter) More robust filename resolution 2023-10-20 14:16:03 +02:00
Viktor Lofgren
d76d926c38 (control/executor) Add new configuration options for node
It's now possible to configure prod instance to not retain processed data.
2023-10-20 14:05:19 +02:00
Viktor Lofgren
2b3c167845 (controller) Additional configuration options for node 2023-10-20 13:13:36 +02:00
Viktor Lofgren
1d75b974b5 (loader bugfix) Set DOMAIN_METADATA appropriately 2023-10-20 13:03:27 +02:00
Viktor Lofgren
584bb3a648 (fs) interface cleanup 2023-10-20 12:24:18 +02:00
Viktor Lofgren
7b5ec6b98f (executor-service) Embed dist/ in executor-service's docker image 2023-10-19 17:48:34 +02:00
Viktor Lofgren
23526f6d1a (executor) Executor service now pulls DomainType list for CRAWL on "recrawl"
This is an automatic integration with the submit-site repo on github and also
crawl-queue.
2023-10-19 17:48:34 +02:00
Viktor Lofgren
c0930ead0f (doc) Update conceptual-overview.svg 2023-10-19 17:48:34 +02:00
Viktor Lofgren
809b3ee023 (control) Update GUI for crawl specs. They are now less important than they were before. 2023-10-19 17:48:34 +02:00
Viktor Lofgren
23f0c79fba (control) GUI for data sets/domain types. 2023-10-19 17:48:34 +02:00
Viktor Lofgren
81dd3809e9 (*) WIP Add node affinity to EC_DOMAIN
Very messy commit due to fractalline yak shaving
2023-10-19 17:48:34 +02:00
Viktor Lofgren
2bf0c4497d (*) Tool for unfcking old crawl data so that it aligns with the new style IDs 2023-10-19 17:48:34 +02:00
Viktor Lofgren
93122bdd18 (run) Add two nodes to the demo setup 2023-10-16 17:37:26 +02:00
Viktor Lofgren
978550f809 (executor-service) Retire features-convert and move the corresponding packages into the executor service. 2023-10-16 15:43:46 +02:00
Viktor Lofgren
84fea0fd05 (node) Nodes auto-start their monitor actors. 2023-10-16 15:33:22 +02:00
Viktor Lofgren
2df3e0f881 (node) Nodes auto-configure on start-up instead of requiring manual configuration. 2023-10-16 14:46:35 +02:00
Viktor Lofgren
c98117f69d (actor) FS monitor should pick up stuff in BACKUP as well. 2023-10-16 14:37:36 +02:00
Viktor Lofgren
ede5d1f890 (actor) Give process spawners more easily recognizable names. 2023-10-16 14:19:00 +02:00
Viktor Lofgren
39911e3acd (control) Fix incorrect storage base and clean up GUI for data 2023-10-16 13:30:26 +02:00
Viktor Lofgren
3d1c15ef99 (client) Refactor liveness monitor 2023-10-16 12:34:01 +02:00
Viktor Lofgren
f718482e98 (client) Fix tests 2023-10-16 12:12:16 +02:00
Viktor Lofgren
8dafd13cd7 (client) Fix executor tests 2023-10-16 12:02:57 +02:00
Viktor Lofgren
0b19b28a64 (file-storage) Delete unused code 2023-10-16 12:02:57 +02:00
Viktor Lofgren
c245f7ce3a (control) Bootstrapify review-domains and search-to-ban views. 2023-10-15 22:04:23 +02:00
Viktor Lofgren
607d647483 (control) Remove services listing view 2023-10-15 21:48:55 +02:00
Viktor Lofgren
9a38a455c9 (control/exec) File listings in control GUI 2023-10-15 19:15:44 +02:00
Viktor Lofgren
16e0738731 (*) Get multi-node routing working. 2023-10-15 18:38:30 +02:00
Viktor Lofgren
eacbf87979 (control) New list and form for index nodes. 2023-10-14 21:46:52 +02:00
Viktor Lofgren
108b4cb648 (service) Keep disabled multi-noded services dormant when they are configured to be disabled. 2023-10-14 20:58:55 +02:00
Viktor Lofgren
a9dff407a1 (config/db) Clean up migrations 2023-10-14 20:34:03 +02:00
Viktor Lofgren
9e26109e36 (reverse-index) Don't always POST 2023-10-14 16:48:29 +02:00
Viktor Lofgren
6308a8dfcd (control) Node configuration 2023-10-14 16:47:52 +02:00
Viktor Lofgren
4baf9527d7 (*) WIP Control GUI redesign, executor-service, multi-node mq
This turned out to be very difficult to do in small isolated steps.

* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
2023-10-14 12:08:43 +02:00
Viktor Lofgren
199c459697 (*) Add node-affinity to services, processes and file storage. 2023-10-10 12:32:22 +02:00
Viktor Lofgren
61288c5e68 (service, client) First steps towards multiple nodedness 2023-10-09 22:13:27 +02:00
Viktor Lofgren
8375237de5 (converter) Add special keyword for websites with a tilde url. 2023-10-09 17:02:32 +02:00
Viktor
c8d820c17b Merge pull request #53 from MarginaliaSearch/standalone-index
Move ranking to the index-service, and query parsing to a new query-service; separate out the search-service
2023-10-09 15:42:06 +02:00
Viktor Lofgren
6319b8ef51 (api-service) Improved testability, always set content type to application/json 2023-10-09 15:39:34 +02:00
Viktor Lofgren
397a85eaa4 (query-service) Apply blacklisting to search results 2023-10-09 15:18:53 +02:00
Viktor Lofgren
3889c4bdd9 (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00
Viktor Lofgren
c899f1cb85 (docs) Update documentation to reflect new query service 2023-10-09 14:56:59 +02:00
Viktor Lofgren
d8956c51d0 (refactor) Remove api:search-api
Application services should not have an API, but purely act as clients
to the core services (which should always have an API).
2023-10-09 14:42:33 +02:00
Viktor Lofgren
5dd55c7cad (refactor) Rename satellite services to application services
This is a better descriptor, since they now all implement different applications on top of the core services' APIs.
2023-10-09 13:45:45 +02:00
Viktor Lofgren
c0e61d4c87 (refactor) Move search service into services-satellite 2023-10-09 13:40:01 +02:00
Viktor Lofgren
97e17282ab (query-service) Move query parsing from search-service to the new query service. 2023-10-09 13:27:44 +02:00
Viktor Lofgren
94c882af7d (query-service) Provide delegate of IndexApi's query functionality.
This is an intermediate step in the process of introducing the query-service as a proxy between search and index.
2023-10-08 22:22:26 +02:00
Viktor Lofgren
89c6d85f2f (query-service) Create new empty 'query-service' service 2023-10-08 17:31:50 +02:00
Viktor Lofgren
cf366c602f (search) Refactor SearchQueryIndexService in preparation for feature extraction.
Prefer working on DecoratedSearchResultItem in favor of UrlDetails.
2023-10-08 17:15:41 +02:00
Viktor Lofgren
77ccab7d80 (index) Move linkdb to index from search.
This makes index complete in the sense that you can deploy an index instance and build a complete separate application on top of it, without having to go through the Marginalia-laden search service.
2023-10-08 16:48:35 +02:00
Viktor Lofgren
f51ba63742 (search) Remove dead file 2023-10-07 21:05:06 +02:00
Viktor Lofgren
9044518be5 (search) Fix broken link to git repo 2023-10-07 19:43:22 +02:00
Viktor Lofgren
9e0367eef4 (search) Filter blacklisted items in API query service as well 2023-10-07 16:16:04 +02:00
Viktor Lofgren
235bb6c1b9 (control) Administrative QOL improvement, GUI for banning spam 2023-10-07 15:45:50 +02:00
Viktor Lofgren
49344d7ea8 (control) Administrative QOL improvement, GUI for banning spam 2023-10-07 15:43:18 +02:00
Viktor Lofgren
1b418d77ff (search) We got some new IP ranges to work with for the crawler 2023-10-07 13:41:55 +02:00
Viktor Lofgren
80cc302627 (search) We can't claim to be on PC hardware anymore... 2023-10-07 11:49:29 +02:00
Viktor
8e1abc3f10 (index-reverse) Parallel construction of the reverse indexes. (#52)
* (index-reverse) Parallel construction of the reverse indexes.

* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.

* (index-reverse)  Force changes to disk on close, reduce logging.

* (index-reverse)  Clean up merging process and add back logging

* (run)  Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM

* (index-reverse)  Better logging during processing

* (array) 2GB+ compatible write() function

* (array) 2GB+ compatible write() function

* (index-reverse) We are logging like Bolsonaro and I will not have it.

* (reverse-index) Self-diagnostics

* (btree) Fix bug in btree reader to do with large data sizes
2023-10-07 10:00:00 +02:00
Viktor Lofgren
e498c6907a (forward-index) Don't leak off heap memory 2023-10-05 21:22:13 +02:00
Viktor Lofgren
08e8fc6736 (index-journal) Thread safe IndexJournalReadEntry 2023-10-05 19:39:09 +02:00
Viktor Lofgren
f6e9ef6de9 (array) Fix transferFrom() so it survives larger than 2 GB transfers 2023-10-04 13:57:36 +02:00
Viktor Lofgren
c51159672e (build) Move unit test configuration to root build.gradle 2023-10-04 12:46:22 +02:00
Viktor Lofgren
233b51e29e (test) flag DomainTypesTest as Slow to exclude from regular CI 2023-10-04 12:23:10 +02:00
Viktor Lofgren
54c8e13a68 (term-frequency-dict) Fix memory leak in TermFrequencyDict 2023-10-04 11:55:11 +02:00
Viktor Lofgren
405300b4b2 (control) Fix bug where finishing one process ad hoc task would remove all other tasks from the db 2023-10-04 11:44:31 +02:00
Viktor Lofgren
a6abd31ead (setup) Upgrade mariadb image to one that exists in real life 2023-10-03 11:07:58 +02:00
Viktor Lofgren
4c26674ff4 (setup) Use mirrored lid.176.ftz file that is of a compatible version 2023-10-03 10:29:44 +02:00
Viktor Lofgren
40768e935b (test) Removing /tmp-guardrails as it doesn't hold in CI 2023-10-02 16:52:59 +02:00
Viktor Lofgren
23be648456 (setup) use curl instead of wget for setup.sh 2023-10-02 16:38:23 +02:00
Viktor Lofgren
13ee31770a (file storage) Make it possible to override the value returned by getFileStorage(type) with a JVM property. 2023-10-01 12:57:53 +02:00
Viktor Lofgren
93dc80000c (bugfix) Fix NPE in KeywordExtractor due to bad SoftReference handling 2023-09-26 17:16:41 +02:00
Viktor Lofgren
e0cd3cd991 (converter) Alter StackexchangeSideloader's summary length to align with the rest of the system. 2023-09-26 12:19:43 +02:00
Viktor Lofgren
81ae501e73 (converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well 2023-09-25 18:28:34 +02:00
Viktor Lofgren
9b781f8404 (keyword-extractor) Address very rare race condition in memoization logic 2023-09-25 18:28:04 +02:00
Viktor Lofgren
f797a92f87 (converter, minor) Use domain name in task heartbeat progress 2023-09-25 18:27:04 +02:00
Viktor Lofgren
0a579814a2 (docs) Parquet How-to 2023-09-24 19:40:45 +02:00
Viktor Lofgren
ec6c9bca62 (common) Fix factual error in comments 2023-09-24 19:40:19 +02:00
Viktor Lofgren
a433bbbe45 (converter) Fix rare sentence extractor bug
It was caused by non-thread safe concurrent memory access in SentenceExtractor.
2023-09-24 19:39:48 +02:00
Viktor Lofgren
8ca20f184d (keyword-extraction) Chasing my tail looking for a bug 2023-09-24 19:39:48 +02:00
Viktor Lofgren
d160954080 (index) Two useful debug endpoints 2023-09-24 19:39:48 +02:00
Viktor Lofgren
14372e0ef0 (index) Slightly reduce alloc churn 2023-09-24 19:36:14 +02:00
Viktor Lofgren
03bffa27ac (search) Add combined id to the search result HTML 2023-09-24 19:34:35 +02:00
Viktor Lofgren
028b5a4f0d (minor performance) Reduce GC churn in index 2023-09-24 12:12:08 +02:00
Viktor Lofgren
cd12f49fc0 (long-array) Return slices SegmentLongArray of itself for range() &c 2023-09-24 11:31:54 +02:00
Viktor Lofgren
a144749a8d (jdk21) Add --enable-preview to Java tasks 2023-09-24 11:31:17 +02:00
Viktor Lofgren
1bd146fb8e (minor) Remove dead code 2023-09-24 10:55:20 +02:00
Viktor Lofgren
5f6c3da7a4 (index) Add close methods on the index readers so they clean up their mmaps 2023-09-24 10:54:23 +02:00
Viktor Lofgren
d0aa754252 (long-array) Implement java.lang.foreign.Arena based lifecycle control for LongArray.
Further de-ByteBuffer:ing of these classes is to be done, but this is the smallest most urgently needed benefit.

This commit is a WIP but in a fully working state, pushing due to the importance of the changes to offer lifecycle control over mmaps.
2023-09-24 10:40:06 +02:00
Viktor Lofgren
dbe9235f3a (*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.

Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work.  This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
d78569986b (backups) Fix bug where backup service would zero the linkdb when restoring. 2023-09-22 18:34:34 +02:00
Viktor Lofgren
95323e6caa (backups) Support restore multi-source load data 2023-09-22 18:34:17 +02:00
Viktor Lofgren
f809d22fc6 (loader) Support simultaneous loading of multiple processed data sets 2023-09-22 13:14:58 +02:00
Viktor
763d61db8d Create Additional Contributors.md 2023-09-21 15:38:19 +02:00
Viktor Lofgren
10cad3abb2 (dating) Implementing @samstorment's fantastic design polish 2023-09-21 15:19:50 +02:00
Viktor Lofgren
9338f35cd8 (doc) Remove confusingly outdated ER-diagrams 2023-09-21 15:08:27 +02:00
Viktor Lofgren
ead6fa9daa (doc) Update conceptual-overview.svg to reflect the removal of the lexicon 2023-09-21 13:47:05 +02:00
Viktor Lofgren
ad660cf420 (converter) Bugfix: Don't try to Path.of() on optional field 2023-09-21 13:27:09 +02:00
Viktor Lofgren
75f8ae2815 (file-storage) Use human-readable timestamps in the names of file storage directories 2023-09-21 13:22:53 +02:00
Viktor Lofgren
70aa04c047 (converter, stackexchange-xml) Add the ability to sideload stackexchange data 2023-09-21 12:48:33 +02:00
Viktor Lofgren
4aa47e87f2 (blocking-thread-pool) Add isTerminated convenience function 2023-09-21 12:47:41 +02:00
Viktor Lofgren
f8050816ac (search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash. 2023-09-21 12:47:02 +02:00
Viktor Lofgren
5b0a6d7ec1 (stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s 2023-09-20 15:15:13 +02:00
Viktor Lofgren
3b4d08f52b (stackexchange-integration) Add better comments 2023-09-20 14:43:06 +02:00
Viktor Lofgren
6bbf40d7d2 (stackexchange-integration) Tools for reading stackexchange xml files 2023-09-20 14:17:33 +02:00
Viktor Lofgren
d895f83520 (blocking-thread-pool) Move DumbThreadPool to its own micro-library
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
f6b9e8c5eb (converter) JavadocSpecialization should truncate its summary if it gets too long 2023-09-17 16:25:33 +02:00
Viktor Lofgren
98bcdf6028 (converter) DirtreeSideloader now trims /index.html from the URL if present
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
2023-09-17 16:08:16 +02:00
Viktor Lofgren
9b385ec7cc (converter) Make it possible to sideload documents from a directory tree 2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46 (crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor
46232c7fd4 Merge pull request #48 from MarginaliaSearch/parquet
Converter-Loader communicates via Parquet files
2023-09-15 13:32:06 +02:00
Viktor Lofgren
c67d95c00f (converter) Write dummy processor log when sideloading 2023-09-14 14:13:03 +02:00
Viktor Lofgren
5e5aaf9a7e (converter, control) Re-enable sideloading encyclopedia data 2023-09-14 12:12:07 +02:00
Viktor Lofgren
35996d0adb (docs) Update the documentation with up-to-date information 2023-09-14 11:33:36 +02:00
Viktor Lofgren
eaeb23d41e (refactor) Remove converting-model package completely 2023-09-14 11:21:44 +02:00
Viktor Lofgren
c71f6ad417 (converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup 2023-09-14 10:11:57 +02:00
Viktor Lofgren
87a8593291 (work-log) Fix bug where items weren't added to the current batch on logItem 2023-09-14 10:11:04 +02:00
Viktor Lofgren
4799dd769e (converting) WIP begin to remove converting-model and the old InstructionsCompiler 2023-09-13 19:18:58 +02:00
Viktor Lofgren
24b4606f96 (converter,loader) Converter outputs parquet files instead of compressed json. 2023-09-13 16:13:41 +02:00
Viktor Lofgren
9f672a0cf4 (parquet-floor) Modify the parquet library to permit list-fields. 2023-09-13 15:56:35 +02:00
Viktor Lofgren
064bc5ee76 (processed-data) New parquet-serializable models for converter output 2023-09-11 14:08:40 +02:00
Viktor Lofgren
a52d78c8ee (work-log) New batching work log 2023-09-11 14:08:08 +02:00
Viktor Lofgren
a00cabe223 (parquet-floor) Patch in support for writing and reading repeated values 2023-09-11 14:06:43 +02:00
Viktor Lofgren
dbe974f510 (parquet) Use ZSTD compression by default. 2023-09-11 09:02:58 +02:00
Viktor Lofgren
a284682deb (parquet) Add parquet library
This small library, while great, will require some modifications to fit the project's needs, so it goes into third-party directly.
2023-09-05 10:38:51 +02:00
Viktor Lofgren
07d7507ac6 (control-service) Move Actions up in storage-details
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
2023-09-02 15:41:55 +02:00
Viktor Lofgren
c68d17d482 (keyword-extraction) Fix bug leading to position data missing on some keywords.
This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.
2023-09-02 14:48:55 +02:00
Viktor Lofgren
9e185e80ce (control-service) Add timestamp to file storages. 2023-09-02 14:01:04 +02:00
Viktor Lofgren
676e7c7947 (keywords) Add Serializable properties that went missing as the record became a class 2023-09-02 09:52:01 +02:00
Viktor Lofgren
04212b2cef (btree) Add more consistent asserts on sortedness 2023-09-01 15:45:02 +02:00
Viktor Lofgren
bafc2a1f30 (reverse-index) Force() final docs after being written
Unlikely to be a problem, but we want to ensure it's on disk before we go read it later.
2023-09-01 15:43:53 +02:00
Viktor Lofgren
563e388a45 (reverse-index) Fix parallel documents sorting bug
Bug was caused by parallel sorting capturing the iterator rather than the offsets to sort.
2023-09-01 15:42:45 +02:00
Viktor Lofgren
d31d8ec5b0 (index) Log keyword ids in hex format 2023-09-01 15:40:24 +02:00
Viktor Lofgren
2b00cd632d (process) Propagate environment JVM params to the index constructor 2023-09-01 15:39:42 +02:00
Viktor Lofgren
5f427d2b4c (keywords) Clean up leaky abstractions, clean up tests 2023-09-01 13:52:00 +02:00
Viktor Lofgren
8c0ce4fc1d (index journal; minor) Clean up 2023-09-01 11:32:24 +02:00
Viktor Lofgren
10a74f45ea (index journal; minor) Even cleaner separation of concerns. 2023-09-01 11:28:02 +02:00
Viktor Lofgren
320dad7f1a (index journal) Fix leaky abstraction in IndexJournalReader.
The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.
2023-09-01 11:18:13 +02:00
Viktor Lofgren
88ac72c8eb (journal/reverse index) Working WIP fix over-allocation of documents 2023-08-31 20:16:02 +02:00
Viktor Lofgren
f74b9df0a7 (array) Don't use paging arrays when mapping small files for writing 2023-08-31 20:15:10 +02:00
Viktor Lofgren
a6f1335375 (loader) Fix bug where the loader would omit some meta and words. 2023-08-31 17:48:43 +02:00
Viktor Lofgren
f321fa5ad3 (array) Override to Paging...Array$range()
This is a big performance boost in array.range().get().

Without an override, each access will go through pages[page].get(...) for each get()-operation.  This adds up very quickly.  BTreeReader does a bunch of get():s on a range()'d array during traversal in the queryData... methods.
2023-08-31 13:52:29 +02:00
Viktor Lofgren
03d999444d (ldb) Re-add accidentally removed stmt.addBatch that breaks 2023-08-31 12:06:30 +02:00
Viktor Lofgren
763ed260c3 (ldb) Better handling of null pubYear 2023-08-30 23:08:27 +02:00
Viktor Lofgren
764e7d1315 (index) Add more comprehensive integration tests for the index service. 2023-08-30 10:37:24 +02:00
Viktor Lofgren
048f685073 (ldb) add OR IGNORE to insert status query
Otherwise it will sometimes fail because documents may appear more than once in error scenarios.
2023-08-30 10:34:01 +02:00
Viktor Lofgren
e4d7958379 (control) ProcessLivenessMonitorActor shouldn't reap tasks based on service instance liveness 2023-08-29 18:19:04 +02:00
Viktor
bdcbfb11a8 Merge pull request #42 from MarginaliaSearch/no-downtime-upgrades
Zero downtime upgrades, merge-based index construction
2023-08-29 17:05:48 +02:00
Viktor Lofgren
3f288e264b (minor) Clean up dead endpoints 2023-08-29 17:04:54 +02:00
Viktor Lofgren
dd593c292c (loader) Minor optimizations and bugfixes.
* Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well
* Remove remains of OldDomains
* Ensure LOADER_PROCESS_OPTS gets fed to the processes
* LinkdbStatusWriter won't execute batch after each added item post 100 items
2023-08-29 15:37:52 +02:00
Viktor Lofgren
fa87c7e1b7 (process) Automatic flightrecorder runs for processes when run in docker. 2023-08-29 14:12:51 +02:00
Viktor Lofgren
39c1857c61 (heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction. 2023-08-29 13:07:55 +02:00
Viktor Lofgren
c57a2d0dc3 (control-service) Remove old index journal files when restoring a backup. 2023-08-29 11:58:01 +02:00
Viktor Lofgren
a2e6616100 (index-reverse) Add documentation and clean up code. 2023-08-29 11:35:54 +02:00
Viktor Lofgren
ba4513e82c (loader) Revert accidental experimental changes that slipped by in an earlier commit 2023-08-28 19:54:56 +02:00
Viktor Lofgren
6525b16e1f (minor) Improved logging and error messages 2023-08-28 19:53:55 +02:00
Viktor Lofgren
b6a92506d1 (index) Hook in missing DocIdRewriter
This enables documents to be ranked properly.
2023-08-28 19:53:43 +02:00
Viktor Lofgren
ffa0366deb (minor) Fix typo in ActorStateMachine's logging 2023-08-28 16:11:52 +02:00
Viktor Lofgren
00c4686ef0 (reverse-index) Fix over-allocation of the count array in merging 2023-08-28 14:36:28 +02:00
Viktor Lofgren
3101b74580 (index) Move to a lexicon-free index design
This is a system-wide change.  The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table.  This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader.

The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.

It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
4e694fdff6 (minor) Comment build.gradle 2023-08-25 16:40:53 +02:00
Viktor Lofgren
194a6057dd (index,control) Recoverable index backups 2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2 (db) Remove EC_URL and EC_PAGE_DATA from mariadb database 2023-08-25 13:45:03 +02:00
Viktor Lofgren
28188a6e59 (control) Simplify ConvertAndLoadActor 2023-08-25 13:30:20 +02:00
Viktor Lofgren
70a5df96c8 (control) Display progress of process tasks 2023-08-25 13:05:21 +02:00
Viktor Lofgren
460998d512 (index) Move index construction to separate process.
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service.  It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
e741301417 (search) Remove endpoint flush-search-caches
It's not necessary anymore with the new linkdb.
2023-08-25 09:51:06 +02:00
Viktor Lofgren
5ed5298409 (converter) Update confusing state description
SWAP_LEXICON doesn't instruct the index service to do anything.  It just moves the file.
2023-08-24 18:56:49 +02:00
Viktor Lofgren
b911665691 (index) Clean up and optimize valuator 2023-08-24 18:34:06 +02:00
Viktor Lofgren
56eb83319d (index) Clean up result domain deduplicator 2023-08-24 18:24:55 +02:00
Viktor Lofgren
1e6800565a (system) Remove EdgeId<T> and similar objects
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
c909120ae1 (search) Basic working integration of linkdb in search service 2023-08-24 17:24:56 +02:00
Viktor Lofgren
9894f37412 (index) Implement new URL ID coding scheme.
Also refactor along the way.  Really needs an additional pass, these tests are very hairy.
2023-08-24 16:44:27 +02:00
Viktor
229c63c46d Update readme.md 2023-08-24 13:27:24 +02:00
Viktor Lofgren
6a04cdfddf (loader) Implement new linkdb in loader
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.

For now, we no longer store new URLs in different domains.  We need to re-implement this somehow, probably in a different job or as a different output.
2023-08-24 13:07:54 +02:00
Viktor Lofgren
c70670bacb (common) New UrlIdCodec class
Have a single class responsible for encoding and decoding URL ids, as it's a bit finicky and used all over.
2023-08-24 11:41:07 +02:00
Viktor Lofgren
7bb3e44a76 (common) Deprecate EdgeId and similar 2023-08-24 11:16:28 +02:00
Viktor Lofgren
b958acb76a (file-storage) New File Storage type for linkdb 2023-08-24 09:06:13 +02:00
Viktor Lofgren
b22f4fbb72 (linkdb) New Module for sqlite-backed document db 2023-08-24 09:06:13 +02:00
Viktor Lofgren
e8c0648e04 Fix missing vol/ss dir in setup.sh 2023-08-23 17:59:40 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
8bd9a00c38 Amend setup instructions with command 2023-08-23 14:02:21 +00:00
Viktor Lofgren
972d03efdf Fix error in run/readme where it suggested local dev environment uses HTTPS 2023-08-23 13:47:39 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
4d75fa2908 Upgrade gradle and docker plugin to support native JDK20 environments 2023-08-23 13:30:55 +00:00
Viktor Lofgren
1a05cba60a (keyword lexicon) Use three hash tables to increase the possible number of keywords to 2^31 from 0.75 x 2^30. 2023-08-23 11:25:20 +02:00
Viktor Lofgren
bf92c270dc (language) Rollback language filter change a bit.
It appears to lead to too much junk in the lexicon.
2023-08-23 10:16:57 +02:00
Viktor Lofgren
e507844616 (language) Rollback language filter change a bit.
It appears to lead to too much junk in the lexicon.
2023-08-23 10:03:25 +02:00
Viktor Lofgren
ca12dd59f7 (loader) Fix Cleaner resource leak
Apparently Cleaners have an associated native thread, so the way to use them is to have a single static cleaner.
2023-08-22 18:05:00 +02:00
Viktor Lofgren
6f222b9800 (search) Add refresh link to explore mode.
This is a QOL improvement for mobile users, who otherwise would have to scroll all the way up to refresh.

Also removed the confusing "this is a random set of domains"-message when viewing adjacent websites, as it's not random.
2023-08-22 12:43:44 +02:00
Viktor Lofgren
fca62f261e (mq) Down-tune polling intervals in MQ
Polling 10 times a second across dozens of queues is a bit too aggressive and wasteful.
2023-08-22 11:49:30 +02:00
Viktor Lofgren
c7f0276005 (control) Don't spin on process output printing
This is the "correct" way of copying stdout and stderr to the current process's output.
2023-08-22 11:48:54 +02:00
Viktor Lofgren
46409c4c2d (loader) Use the correct interface for InstructionCounter 2023-08-22 11:11:36 +02:00
Viktor Lofgren
46df58d28b (control-service) Use default value for WMSA_HOME if it is not set 2023-08-22 11:11:01 +02:00
Viktor Lofgren
15912f31d0 (control-service) Basic GUI for deleting bad links from exploration mode 2023-08-21 18:35:26 +02:00
Viktor
dd380a5fb3 (doc) Add control-service to conceptual overview
Not adding every interaction as it would turn into a rat king.
2023-08-20 13:28:32 +02:00
Viktor Lofgren
93f49f1fb3 (search-service) RSS feed for the news feed 2023-08-20 12:58:34 +02:00
Viktor Lofgren
b83bb5a48a (docker) Upgrade to jdk20 image to fix weird mojibake problems.
Super weird encoding bug that only arises on versions below jdk18 causing crawl data to be read incorrectly.

Seems possibly related to the new standard charset of UTF-8. Maybe some library (unknown which) is attempting to be backwards compatible in a way that totally breaks?
2023-08-19 10:58:47 +02:00
Viktor Lofgren
704de50a9b (forward-index, valuator) HTML features in valuator
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
fcfe07fb7d (valuator) Clean up code 2023-08-18 11:26:56 +02:00
Viktor Lofgren
ccf4990add (minor) Clean up code 2023-08-18 11:26:39 +02:00
Viktor Lofgren
f2638dd845 (feature-extractor) More adtech nonsense 2023-08-18 11:26:19 +02:00
Viktor Lofgren
239980ecae (minor) Improve comment 2023-08-18 11:26:05 +02:00
Viktor Lofgren
6cb784df75 (minor) Improve comment 2023-08-18 11:25:36 +02:00
Viktor Lofgren
efee904531 (search) Use the adtech bit instead of ads for ads flag 2023-08-18 11:24:59 +02:00
Viktor Lofgren
bee815b1c4 (converter) Add monsterinsights as an adtech tracker 2023-08-17 17:44:11 +02:00
Viktor Lofgren
e296b02649 (converter) Optimize LSH based within-domain deduplication 2023-08-17 17:43:46 +02:00
Viktor Lofgren
2656fcfe2c (conf) Remove unnecessary JVM flags for processes 2023-08-17 17:42:47 +02:00
Viktor Lofgren
c019a029ec (flags) Documentation and preventative bugfix 2023-08-17 17:42:31 +02:00
Viktor Lofgren
db0216936e (summary) Reduce the chance of expensive operations 2023-08-16 15:48:34 +02:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f (valuation) Penalize wordpress style kebab case urls 2023-08-16 13:11:24 +02:00
Viktor Lofgren
1d486bddee (crawler) Reduce log spam 2023-08-16 11:12:09 +02:00
Viktor Lofgren
606db54dc8 (docs) Fix dead links to message-queue after moving it to libraries 2023-08-15 19:26:40 +02:00
Viktor Lofgren
d8073f0dde (feature-extractor) Add mail.ru counter to non-adtech trackers 2023-08-15 19:10:43 +02:00
Viktor Lofgren
df85468c01 (control) Action for refreshing the blogs definition. 2023-08-15 11:38:52 +02:00
Viktor Lofgren
4404ad98ae (mq) Fix missing @Inject that broke everything in control-service 2023-08-15 11:22:12 +02:00
Viktor Lofgren
e7192a9cad (mq) Refactor mq and actor library and move it to libraries out of common 2023-08-15 10:53:23 +02:00
Viktor Lofgren
019b61b330 (control) Remove message queue listing from actors view. 2023-08-13 13:50:04 +02:00
Viktor Lofgren
f997707049 (control) Move event log out of plumbing 2023-08-13 13:40:50 +02:00
Viktor Lofgren
c56ee10185 (control) Separate [Process] and [Process and Load] actions for crawl data; all SLOW data is deletable. 2023-08-13 13:39:59 +02:00
Viktor Lofgren
8210e49b4e (control) Helpful tooltips for the Actor table. 2023-08-13 12:55:56 +02:00
Viktor
e51bf8619d Merge pull request #40 from MarginaliaSearch/vlofgren-patch-2
Update readme.md
2023-08-12 18:58:32 +02:00
Viktor
69b28fd07d Update readme.md 2023-08-12 18:58:21 +02:00
Viktor
99884c2c7e Update readme.md 2023-08-12 15:39:28 +02:00
Viktor Lofgren
a8f2e9ee2c (control) Tidy up empty tables, remove actors from index view 2023-08-12 15:18:14 +02:00
Viktor Lofgren
a91b909103 (control) Event log on stop actor 2023-08-12 15:02:53 +02:00
Viktor Lofgren
d6b8b38955 (db) Add indices on SERVICE_EVENTLOG 2023-08-12 15:00:15 +02:00
Viktor Lofgren
99e031c529 (control) Remove broken pagination from events and message queue; new "light" events table for some views 2023-08-12 14:57:55 +02:00
Viktor Lofgren
998f239ed9 (control) Filterable event log view 2023-08-12 14:43:11 +02:00
Viktor Lofgren
0961f627b1 (control) Pretty up the nav bar 2023-08-12 14:42:42 +02:00
Viktor Lofgren
6483308bb0 (sql) Update default value for DOMAIN_SELECTION_TYPE 2023-08-11 14:01:15 +02:00
Viktor Lofgren
a42f707b2d (docs) Update readme with up to date instructions 2023-08-11 13:43:00 +02:00
Viktor Lofgren
eef37927ba (docs) Update readme with up to date instructions 2023-08-11 13:42:14 +02:00
Viktor Lofgren
7440da240d (blacklist) Fix broken SQL migration 2023-08-11 13:33:35 +02:00
Viktor
d0239368e2 Merge pull request #39 from MarginaliaSearch/master-control-program
Message Queue, State Machine, and Control Service
2023-08-10 15:42:58 +02:00
Viktor Lofgren
4f8048be31 (blacklist) Blacklist management 2023-08-10 15:40:07 +02:00
Viktor Lofgren
807fb2d052 (service) Task heartbeat creates event log entries 2023-08-09 15:15:16 +02:00
Viktor Lofgren
ce293029c7 (converter) Treat adtech tracking as advertisement. 2023-08-09 14:23:53 +02:00
Viktor Lofgren
b5ed21be21 (mq) MqPersistence no longer relies on autoCommit being enabled 2023-08-09 14:23:22 +02:00
Viktor Lofgren
251fc63b42 (*) Fix merge gore 2023-08-09 13:33:28 +02:00
Viktor Lofgren
47f3855a4b (control) More informative readme.md 2023-08-09 12:42:23 +02:00
Viktor Lofgren
71dfe9f33e (control) Clean up the ControlService, move mq-related endpoints to MessageQueueService. 2023-08-09 12:42:01 +02:00
Viktor Lofgren
afad4f5ebb (*) last touches 2023-08-07 12:59:33 +02:00
Viktor Lofgren
4ab1cd9502 (*) last touches 2023-08-07 12:57:44 +02:00
Viktor
52e2ab45bf Merge branch 'master' into master-control-program 2023-08-07 12:53:43 +02:00
Viktor Lofgren
be444f9172 (control) New actions view, re-arrange navigation menu 2023-08-05 14:45:04 +02:00
Viktor Lofgren
715d61dfea (mq) Fix bug in notice handling where they were registered on the wrong name 2023-08-05 14:45:04 +02:00
Viktor Lofgren
bf37a3eb25 (search-service) Make flushCaches endpoint a notice and not a request 2023-08-05 14:45:04 +02:00
Viktor Lofgren
c2b45bec8d (mq) Rename notify to sendNotice to avoid name clash with the java object function 2023-08-05 14:45:04 +02:00
Viktor Lofgren
cdfe284f9a (file storage) File Storage Type for EXPORT data
2023-08-05 14:45:03 +02:00
Viktor Lofgren
08eed17e66 (api-service) Mq endpoint for flushing caches 2023-08-05 14:42:16 +02:00
Viktor Lofgren
00eb8b90dc (control) Message Queue GUI 2023-08-04 22:05:29 +02:00
Viktor Lofgren
912129311d (control) Message Queue GUI 2023-08-04 17:54:18 +02:00
Viktor Lofgren
624b78ec3a (heartbeat) Task heartbeats 2023-08-04 14:40:06 +02:00
Viktor Lofgren
1d0cea1d55 (converter) GUI for dealing with user complaints 2023-08-03 17:59:57 +02:00
Viktor Lofgren
f01f608474 (blacklist) Support blacklists with subdomain 2023-08-03 17:58:52 +02:00
Viktor Lofgren
c22feaf42e (crawl) Make crawler limiter request a GC when throttling 2023-08-03 17:58:18 +02:00
Viktor Lofgren
63e857f7cd (control) Add basic api key management 2023-08-02 20:14:03 +02:00
Viktor Lofgren
9979c9defe (search/index) Add blogosphere filter 2023-08-02 20:13:30 +02:00
Viktor Lofgren
7763df0715 (docs) Add control-service to the main readme.md 2023-08-01 22:52:41 +02:00
Viktor Lofgren
e088eb9ec8 (scripts|docs) Update scripts and documentations for the new operator's gui and file storage workflows. 2023-08-01 22:50:33 +02:00
Viktor Lofgren
19402772fc (scripts|docs) Update scripts and documentations for the new operator's gui and file storage workflows. 2023-08-01 22:50:05 +02:00
Viktor Lofgren
ba724bc1b2 (scripts|docs) Update scripts and documentations for the new operator's gui and file storage workflows. 2023-08-01 22:47:37 +02:00
Viktor Lofgren
8de3e6ab80 (control) Fix bug where CrawlActor and RecrawlActor would steal each others' mail 2023-08-01 22:33:30 +02:00
Viktor Lofgren
659d2134ba (file-storage) Deprecate mustClean flag 2023-08-01 22:32:30 +02:00
Viktor Lofgren
867410c66b (file-storage) Automatic file storage discovery via manifest file 2023-08-01 18:05:43 +02:00
Viktor Lofgren
483c2dbb44 (conf) Change default user-agent to not associate it with the project; remove unused disks.properties file. 2023-08-01 17:34:25 +02:00
Viktor Lofgren
e5c9791b14 (crawler) Fix rare ConcurrentModificationException due to HashSet 2023-08-01 17:28:29 +02:00
Viktor Lofgren
58556af6c7 (db) Use flyway for database migrations. 2023-08-01 17:08:42 +02:00
Viktor Lofgren
2e29038ecd (db) Fix broken insert statement, move file storage defaults to a separate file. 2023-08-01 15:50:08 +02:00
Viktor Lofgren
36a23707c1 (control) Control service should be a core service. 2023-08-01 15:49:50 +02:00
Viktor Lofgren
c1ea60b399 (db) Default values for storage base 2023-08-01 15:05:04 +02:00
Viktor Lofgren
b08e302dd5 (lexicon) Optimize lexicon by using Murmur3_128's hash function 2023-08-01 15:02:13 +02:00
Viktor Lofgren
ea66195b97 (loader) Optimize loader by using zstd's direct streaming writer and the Murmur3_128 string hash 2023-08-01 15:02:13 +02:00
Viktor Lofgren
86a5cc5c5f (hash) Modified version of commons-codec's Murmur3 hash 2023-08-01 14:57:40 +02:00
Viktor Lofgren
8f0cbf267b (loader) Perform instruction reads in a separate thread for extra vroom vroom 2023-07-31 14:24:08 +02:00
Viktor Lofgren
2f8488610a (loader) Fix bug where trailing deferred domain meta inserts weren't executed 2023-07-31 14:23:23 +02:00
Viktor Lofgren
d95f01b701 (control) Reduce log spam in control svc 2023-07-31 14:21:06 +02:00
Viktor Lofgren
c9d7635370 (control) Aborting an actor that waits on a process request terminates the running job.
2023-07-31 14:21:06 +02:00
Viktor Lofgren
6b5fb0f841 (control) Disable the start button for actors that aren't directly initializable.
2023-07-31 14:21:00 +02:00
Viktor Lofgren
12bd74d4f3 Clean up ProcessService 2023-07-31 10:56:16 +02:00
Viktor Lofgren
37c4cc68ed TODO 2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8 (minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers. 2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f YAGNI filter over ConverterDomainTypes 2023-07-31 10:32:47 +02:00
Viktor Lofgren
9786f82220 Fix environment variables to processes so jmc works 2023-07-31 10:32:23 +02:00
Viktor Lofgren
6f4e767a04 (minor) Re-enable monkey-patch-json for converter 2023-07-31 10:31:46 +02:00
Viktor Lofgren
5411950b87 (minor) Tidy up EdgeDomain class a bit, no functional difference 2023-07-31 10:31:29 +02:00
Viktor Lofgren
6ff7e9648f (crawler) Use and pass the proper environment variables to the processes. 2023-07-30 16:54:02 +02:00
Viktor Lofgren
5c071ce4d3 (crawler) Clean up the code and remove unnecessary logging 2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8 (crawler) Fix rare issue with NPEs if the crawl queue is empty 2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4 (crawler) Even more memory optimizations.
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f (crawler) Reduce log spam 2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0 (crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size. 2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48 (crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended. 2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171 (crawler, converter) Remove monkey patched gson from dependencies 2023-07-29 19:18:12 +02:00
Viktor Lofgren
05ba3bab96 (crawler) Make SitemapRetriever abort on too large sitemaps. 2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c (crawler) Reduce log spam in HttpFetcherImpl 2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d (crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
9ad32ee9c7 (control) Be more clear about when a process exits and why. 2023-07-29 19:16:00 +02:00
Viktor Lofgren
866db6c63f (control) Dialog for updating message state; clean up file view. 2023-07-28 22:02:05 +02:00
Viktor Lofgren
01476577b8 (loader) Speed up loading back to original speeds with a cascading DELETE FROM EC_URL rather than EC_PAGE_DATA.
* Also clean up code and have proper rollbacks for transactions.
2023-07-28 22:00:07 +02:00
Viktor Lofgren
e237df4a10 (converter) Use a dumb thread pool instead of Java's executor service. 2023-07-28 18:15:16 +02:00
Viktor Lofgren
f11103d31d (WIP) Make it possible to sideload encyclopedia data.
This is mostly a pilot track for sideloading other large websites.

Also change converter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
9288d311d4 Add buffering to index journal writer 2023-07-28 18:11:19 +02:00
Viktor Lofgren
77d5e39fe0 Make processed data Serializable 2023-07-28 18:11:19 +02:00
Viktor Lofgren
27e781761d (mq single shot inbox) Flag messages as OK if there is no recipient 2023-07-28 12:04:23 +02:00
Viktor Lofgren
92cac52813 (mq) Add indexes to MESSAGE_QUEUE 2023-07-28 12:03:51 +02:00
Viktor Lofgren
66bb12e55a (converter) File listing and download for file storage 2023-07-26 21:59:35 +02:00
Viktor Lofgren
a5d980ee56 (converter) Hook crawl job extractor and adjacencies calculator into control service. 2023-07-26 15:46:22 +02:00
Viktor Lofgren
19c2ceec9b (converter) Use Marginalia Yellow for control service 2023-07-26 11:50:23 +02:00
Viktor Lofgren
507f26ad47 (converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd (loader) Don't delete the entire link database when the loader runs 2023-07-24 18:37:35 +02:00
Viktor Lofgren
09fd0a1d0e (converter) Automatically clean stale file storage records if they disappear on disk 2023-07-24 17:04:42 +02:00
Viktor Lofgren
667b0ca0b0 (converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798 (converter, WIP) Refactor converter to not have to load everything into RAM. 2023-07-24 15:25:09 +02:00
Viktor Lofgren
7470c170b1 (minor) EdgeUrl.parse() should deal with null 2023-07-24 15:06:57 +02:00
Viktor Lofgren
bc330acfc9 (control) Better refresh script that doesn't cause weird artifacts 2023-07-23 19:26:16 +02:00
Viktor Lofgren
789e8eea85 (crawler) Clean up and refactor the code a bit 2023-07-23 19:08:38 +02:00
Viktor Lofgren
35b29e4f9e (crawler) Clean up and refactor the code a bit 2023-07-23 19:06:37 +02:00
Viktor Lofgren
69f333c0bf (crawler) Clean up and refactor the code a bit 2023-07-23 18:59:14 +02:00
Viktor Lofgren
c069c8c182 (crawler) Clean up crawl data reference and recrawl logic 2023-07-22 18:42:21 +02:00
Viktor Lofgren
9e4aa7da7c (crawler) Support for X-Robots-Tag 2023-07-22 18:42:21 +02:00
Viktor Lofgren
e22e65eee4 (index) Fix bug related to debug print statements 2023-07-22 14:33:58 +02:00
Viktor Lofgren
cb55c76664 (index) Fix bug related to debug print statements 2023-07-22 14:20:52 +02:00
Viktor Lofgren
d6b07e4d01 (controller) Improve the storage interface 2023-07-21 19:56:16 +02:00
Viktor Lofgren
995657c6ce (big-string) Make big-string disable:able 2023-07-21 19:50:35 +02:00
Viktor Lofgren
58f2f86ea8 (crawler) Don't read all the data into RAM when doing a refresh-crawl 2023-07-21 19:47:52 +02:00
Viktor Lofgren
7bc1cff286 (minor) code cleanup 2023-07-21 14:28:37 +02:00
Viktor Lofgren
8f455f3b6d (control) Aborting a process spawner actor cancels the message to the actor. 2023-07-21 14:12:32 +02:00
Viktor Lofgren
f91d92cccb (crawler) WIP 2023-07-20 21:05:16 +02:00
Viktor Lofgren
08ca6399ec (converter) WIP 2023-07-19 17:14:45 +02:00
Viktor Lofgren
c0b5ea0e7d Revert "Less spammy default log settings"
This reverts commit f6e2216b87.
2023-07-18 19:28:42 +02:00
Viktor Lofgren
f21a3983aa Abortable processes 2023-07-18 18:40:12 +02:00
Viktor Lofgren
f6e2216b87 Less spammy default log settings 2023-07-17 21:42:13 +02:00
Viktor Lofgren
92ed513e4f Less spammy default log settings 2023-07-17 21:41:56 +02:00
Viktor Lofgren
d7ab21fe34 (*) Refactor Control Service and processes 2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8 (*) Refactor MQ and MQSM 2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9 (control) Name change process->fsm, new fsm:s
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
6e41e78f36 (control) Highlight missing processes 2023-07-16 12:03:32 +02:00
Viktor Lofgren
c4dd9a0547 (control) Use MQFSMs to monitor and spawn processes when messages are sent to them 2023-07-16 11:58:47 +02:00
Viktor Lofgren
5ec10634d8 (mqfsm) Abortable state machine 2023-07-15 14:12:16 +02:00
Viktor Lofgren
cdae74d395 (control) Working redirects 2023-07-15 14:11:59 +02:00
Viktor Lofgren
8b74e3aa0d (*) File Storage WIP 2023-07-14 17:08:10 +02:00
Viktor Lofgren
23169ad818 (db) Model for file storage areas 2023-07-14 11:40:05 +02:00
Viktor Lofgren
d36e36c8fd (mq) Bugfix lastNMessages; use Lists.reverse properly 2023-07-14 11:39:15 +02:00
Viktor Lofgren
948d4d5f08 (control) Clean up the number of GUI views, abortable FSM tasks 2023-07-13 17:24:21 +02:00
Viktor Lofgren
0960e18f8e (control) Auto-refreshing tables 2023-07-13 15:44:36 +02:00
Viktor Lofgren
825fd10efa (control) Clean up the MQ ui a bit 2023-07-13 15:14:04 +02:00
Viktor Lofgren
1ec6f9cde2 (mq) More robust resume and recovery logic, protection against spurious state changes, minor bugfixes 2023-07-13 14:55:45 +02:00
Viktor Lofgren
a5118fe8f1 (minor) clean-up 2023-07-12 22:46:14 +02:00
Viktor Lofgren
6c88f00a9d (mqsm) guard against spurious transitions from unexpected messages 2023-07-12 22:44:05 +02:00
Viktor Lofgren
bf783dad7a (converter) NPE fix 2023-07-12 20:13:01 +02:00
Viktor Lofgren
8a53e107fa (mq) Synchronous and Asynchronous inboxes. 2023-07-12 20:12:52 +02:00
Viktor Lofgren
0ed938545b (mq) Add single-shot inbox 2023-07-12 18:41:27 +02:00
Viktor Lofgren
480abfe966 (minor) Add limit to poll count in MqPersistence, fix test 2023-07-12 18:16:23 +02:00
Viktor Lofgren
89e4343fdb (minor) Fix test 2023-07-12 18:15:50 +02:00
Viktor Lofgren
8c16a2aede (work-log, minor) Clean up code 2023-07-12 18:10:05 +02:00
Viktor Lofgren
5deec63667 (work-log) Better tests 2023-07-12 18:04:06 +02:00
Viktor Lofgren
363368b150 (converter) Remove auto-refresh. 2023-07-12 17:48:37 +02:00
Viktor Lofgren
74caf9e38a (processes) Remove forEach-constructs in favor of iterators. 2023-07-12 17:47:36 +02:00
Viktor Lofgren
7087ab5f07 (run) Reduce nginx access log noise for local setup 2023-07-11 23:11:34 +02:00
Viktor Lofgren
0b0cf48849 (control) Better looking UUIDs 2023-07-11 23:11:02 +02:00
Viktor Lofgren
00d9773b44 (control) Better looking progress bar 2023-07-11 21:37:32 +02:00
Viktor Lofgren
ac2d7034db (minor) Bugfix in Path handling 2023-07-11 21:24:29 +02:00
Viktor Lofgren
88b9ec70c6 (control, WIP) Run reconvert-load from converter :D 2023-07-11 18:05:37 +02:00
Viktor Lofgren
77261a38cd (control, WIP) MQFSM and ProcessService are sitting in a tree
We're spawning processes from the MQFSM in control service now!
2023-07-11 17:08:43 +02:00
Viktor Lofgren
3c7c77fe21 (minor) Bugfix in Path handling 2023-07-11 17:06:52 +02:00
Viktor Lofgren
4ee3f6ba3f (minor) Refactor ControlService 2023-07-11 14:51:51 +02:00
Viktor Lofgren
4c016b0318 Process monitoring
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor Lofgren
f59cab300e (minor) Javadoc comments for MqPersistence and MqMessageState 2023-07-10 21:59:51 +02:00
Viktor Lofgren
ec7826659a (minor) Javadoc comments for MqPersistence and MqMessageState 2023-07-10 21:52:25 +02:00
Viktor Lofgren
98b5f22104 (control) WIP control service
* Set messages to OK when received so they're cleaned up properly.
2023-07-10 21:33:57 +02:00
Viktor Lofgren
2283ceb77d (control) WIP control service 2023-07-10 18:58:43 +02:00
Viktor Lofgren
fba466d6e2 (crawler) Update URL blocklist
* Don't crawl MDN mirrors
* More mailing list variants
2023-07-10 18:58:43 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
c125d8ab48 (search) Fix a bug where space-like characters weren't normalized in query processing. 2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b (crawler) Fix poor handling of duplicate ids
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor Lofgren
dbb758d1a8 Minor: Better error handling in crawled domain reader 2023-07-10 18:58:43 +02:00
Viktor Lofgren
da8bcc6e24 Minor: Don't blow up the reader on a corrupted file 2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5 Minor: Readability. 2023-07-10 18:58:43 +02:00
Viktor Lofgren
74644d59f3 (crawler) Update URL blocklist
* Don't crawl MDN mirrors
* More mailing list variants
2023-07-10 18:04:43 +02:00
Viktor
0f9b90eb1c Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
ae9537b68e (search) Fix a bug where space-like characters weren't normalized in query processing. 2023-07-07 20:02:05 +02:00
Viktor Lofgren
2619d196bb (crawler) Fix poor handling of duplicate ids
* Also clean up the code a bit
2023-07-07 19:56:14 +02:00
Viktor Lofgren
17db23c2c1 Minor: Better error handling in crawled domain reader 2023-07-07 19:48:32 +02:00
Viktor Lofgren
040bea1f75 Minor: Don't blow up the reader on a corrupted file 2023-07-07 19:48:11 +02:00
Viktor Lofgren
dc8277223a Minor: Readability. 2023-07-06 19:50:13 +02:00
Viktor Lofgren
98d1898610 Bugfix: Don't run the xenforo specialization on phpBB. 2023-07-06 18:12:26 +02:00
Viktor Lofgren
1400fb4a9b Bugfix: Don't run the xenforo specialization on phpBB. 2023-07-06 18:11:19 +02:00
Viktor Lofgren
647bbfa617 Fix so that crawler tests don't sometimes fetch real sitemaps when they're run. 2023-07-06 18:05:23 +02:00
Viktor Lofgren
b73fcc19fe Fix so that crawler tests don't sometimes fetch real sitemaps when they're run. 2023-07-06 18:05:03 +02:00
Viktor Lofgren
d9e6c4f266 Trial integration of MQ-FSM into index service. 2023-07-06 18:04:16 +02:00
Viktor Lofgren
34653f03a2 Temporary bugfix, need to find source 2023-07-06 14:13:03 +02:00
Viktor Lofgren
f0a8ca440f MQFSM Usability WIP 2023-07-06 13:33:11 +02:00
Viktor Lofgren
d89db10645 MQFSM Usability WIP 2023-07-06 13:02:16 +02:00
Viktor
413dc6ced4 Update FUNDING.yml 2023-07-05 18:03:36 +02:00
Adrthegamedev
78f21dd19a (an attempt to) Add wikidot to wiki generators list 2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c Better wordpress fingerprinting 2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no followup string. 2023-07-05 18:03:36 +02:00
Viktor Lofgren
7a17933c65 Control service owns message queue garbage collection. 2023-07-04 19:52:30 +02:00
Viktor
019fa763cd Update FUNDING.yml 2023-07-04 18:46:58 +02:00
Viktor Lofgren
097a163cf5 Getting a skeleton in place for the control service. 2023-07-04 18:25:42 +02:00
Viktor Lofgren
2ae0b8c159 Message queue based state machine 2023-07-04 17:42:06 +02:00
Viktor Lofgren
31ae71c7d6 Message queue WIP 2023-07-04 14:28:14 +02:00
Adrthegamedev
5ce894564c (an attempt to) Add wikidot to wiki generators list 2023-07-03 13:31:42 +02:00
Viktor Lofgren
813fa08bdd Better wordpress fingerprinting 2023-07-03 11:29:27 +02:00
Viktor Lofgren
e5792ba8b3 Bugfix where DocumentGeneratorExtractor went out of bounds for generators starting with 'microsoft' or 'adobe' but having no followup string. 2023-07-03 11:06:39 +02:00
Viktor Lofgren
62cc9df206 Embryo of new control process
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
42375f0e53 Specialization for javadocs 2023-07-01 20:16:56 +02:00
Viktor Lofgren
24dce8c03b Remove link filtering for mediawiki, it's too strict and not every site uses the /wiki/-pattern. 2023-07-01 19:32:25 +02:00
Viktor Lofgren
eda615de0f Add generator fingerprint for invision. 2023-07-01 14:47:57 +02:00
Viktor Lofgren
a000256223 Add generator fingerprint for xenforo.
Also clean up the specializations logic a bit, and add a barebones specialization for phpbb that cleans out paths we aren't interested in but doesn't touch pruning or summarizing logic for now.
2023-07-01 14:43:49 +02:00
Viktor Lofgren
9bd0e3ce58 Add generator fingerprint for xenforo. 2023-07-01 14:04:48 +02:00
Viktor Lofgren
b4d1e0e81e Add generator fingerprints for phpBB and flarum. 2023-07-01 13:44:42 +02:00
Viktor Lofgren
d2fdaafc7a Big brain web developers were using onload and onerror handlers to load JS without script tags... 2023-06-30 17:10:25 +02:00
Viktor Lofgren
7d86586594 Remove annoying log spam in sitemap retriever 2023-06-30 17:08:35 +02:00
Viktor Lofgren
11c26e700e Remove annoying log spam in crawler retriever 2023-06-30 17:08:24 +02:00
Viktor Lofgren
8274e8a953 JVM flags for disabling black and block-lists. 2023-06-30 17:07:47 +02:00
Viktor Lofgren
42afe490b7 Update README with version info 2023-06-30 11:49:17 +02:00
Viktor Lofgren
0f34beb1aa Update search front page 2023-06-29 17:14:27 +02:00
Viktor Lofgren
e853483ef3 Bump Crawler Commons version 2023-06-29 14:14:18 +02:00
Viktor Lofgren
baff83912e Small optimizations that shave an hour off processing time :D 2023-06-28 15:41:10 +02:00
Viktor Lofgren
8e25cfff4f Update README and CONTRIBUTING. 2023-06-27 18:32:47 +02:00
Viktor Lofgren
b7dc748942 Update README to reflect external funding. 2023-06-27 18:20:55 +02:00
Viktor Lofgren
d71124961e Better tests for crawling and processing. 2023-06-27 16:11:27 +02:00
Viktor Lofgren
fbdedf53de Fix bug in CrawlerRetreiver
... where the root URL wasn't always added properly to the front of the crawl queue.
2023-06-27 15:50:38 +02:00
Viktor Lofgren
a6a66c6d8a Improve site info for unknown domains:
* Placeholder screenshot should work
* Add a link to git-repo for submitting the site for crawling
2023-06-27 15:32:11 +02:00
Viktor Lofgren
d167ad2017 Remove sitemap related log spam 2023-06-27 13:59:47 +02:00
Viktor Lofgren
7d741ff499 Fix so crawl plan replay doesn't crash if a file is missing. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
f8f9f04158 Specialized logic for processing Lemmy-based websites. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
b0c7480d06 Set default timeouts for java.net.URL-connections 2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151 Tests for crawler specialization + testdata 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0 Sitemap support, refined crawler specialization 2023-06-27 10:57:54 +02:00
Viktor Lofgren
f92d8a0975 EdgeUrl conversion to/from java.net.URL 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61 Refactor crawler and add special logic for some platforms
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
5abaf13192 Fix serialization bug with CompressedBigString 2023-06-27 10:57:54 +02:00
Viktor Lofgren
d86e8522e2 Add search profiles for wiki, forum and docs. 2023-06-24 12:17:35 +02:00
Viktor Lofgren
bd2c3855ed Add bits and keywords for generator classes (docs, forum, wiki). 2023-06-23 21:35:28 +02:00
Viktor Lofgren
4c627d0e1d Improvements to crawling.md 2023-06-22 18:01:43 +02:00
Viktor Lofgren
c8dd45e37d First draft for crawling documentation. 2023-06-22 17:44:24 +02:00
Viktor Lofgren
54c2be893b TRIVIAL: Remove unused import. 2023-06-22 17:21:47 +02:00
Viktor Lofgren
55c65f0935 Use document generator to complement the document selection.
Will let through e.g. a modern SSG in the small web filter.
2023-06-22 17:21:33 +02:00
Viktor Lofgren
b5ef67ed28 Categorize generators by type
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
f140e7d7c7 Use a default tag for unset or invalid generators. 2023-06-21 17:30:14 +02:00
Viktor Lofgren
a9a2960e86 New synthetic keyword for document generator meta tag. 2023-06-20 16:25:49 +02:00
Viktor Lofgren
7326ba74fe Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right.
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
2023-06-20 14:15:05 +02:00
Viktor Lofgren
a9fabba407 Tell experiment runner to only process some domains.
Updated the experiment runner, as well as the script.
2023-06-20 14:14:01 +02:00
Viktor Lofgren
5d862d119c Bump dependency versions. 2023-06-20 12:03:12 +02:00
Viktor Lofgren
4fc0ddbc45 Improved crawl-job-extractor.
Let crawl-job-extractor run offline and allow it to read domains from file.
Improved docs.
2023-06-20 11:37:52 +02:00
Viktor Lofgren
9455100907 Throw a custom exception when WMSA_HOME isn't found 2023-06-20 11:37:52 +02:00
Viktor Lofgren
32a6735d03 Undo change in requirements for counting as a high tf-idf word 2023-06-19 17:58:19 +02:00
Viktor Lofgren
f0b4acb358 Better logic for summarization. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
67c15a34e6 Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
9579cdd151 Improved heuristic for which words are considered important in selecting the summary text. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
443cf0cf1e Expose additional functionality through WordsTfIdfCounts.
Bump requirements for being flagged as high TF-IDF from 2 occurrences to 3.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
4138233ddf Truncate repeated strings of any non-alnum symbols in SummaryExtractor 2023-06-19 17:58:19 +02:00
Viktor Lofgren
2979f4703e Allocation-free text utility 2023-06-19 17:58:19 +02:00
Viktor Lofgren
77f2ca51af Optimize SentenceExtractor.
Remove String pool because it's not doing much.
Break out constant.
Use a shared RdrPosTagger.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
ffcbc6c1c9 Reduce the odds of re-allocation by AsciiFlattener 2023-06-19 17:58:19 +02:00
Viktor Lofgren
186a02acfd Optimize RDRPosTagger to use integer comparisons instead of string comparisons.
Also reduce the cache-thrashing by deconstructing the tree's nodes into arrays.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
6f2a7977c1 (Minor) Remove character debris in build.gradle 2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de Re-introduce monkey patched GSON to make converter run better.
fixup! Re-introduce monkey patched GSON to make converter run better.

fixup! Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
d1a004bea6 (minor) Clean up StringPool 2023-06-19 17:58:19 +02:00
Viktor Lofgren
e4372289a5 Use fixed buffers for BigString compression and decompression to reduce GC churn.
fixup! Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
379bccc1a3 Disable AdblockSimulator since it's slow and doesn't really work. Just wasting CPU cycles until it's fixed. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
21125206b4 Fix some bugs in JSON+LD-heuristics for pub date. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
44b1fe0e6d Move list-conversion into getDescription method. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
88399e30e2 Consider keyword relevance signals when creating the document summary using the DOM walker. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
7ed3306be3 Make the adjacency calculator behave like it used to in the past, when it gave better results. 2023-06-07 22:03:06 +02:00
Viktor Lofgren
eb2ca942d5 Up the default crawl delay to 1 second. 2023-06-07 22:02:17 +02:00
Viktor Lofgren
2afbdc2269 Adjust the logic for the crawl job extractor to set a relatively low visit limit for websites that are new in the index or have not yielded many good documents previously. 2023-06-07 22:01:35 +02:00
Viktor Lofgren
d82a858491 Don't consider slash to be a sentence separator. 2023-05-31 16:54:30 +02:00
Viktor Lofgren
e332faa07e Fix test that broke when memex.marginalia.nu started redirecting to www.marginalia.nu. 2023-05-28 13:46:24 +02:00
Viktor Lofgren
4e9e79454f Fix broken transformation functions in the PagingArray classes. 2023-05-28 13:31:05 +02:00
Viktor Lofgren
b0bc07b4e7 Insertion sort was *super* busted I don't even know how it worked 2023-05-28 12:17:50 +02:00
Viktor Lofgren
2cda57355a More word metadata tests 2023-05-28 11:57:06 +02:00
Viktor Lofgren
fd192d2791 Fix putative overflow error with a large dictionary 2023-05-28 11:57:06 +02:00
Viktor Lofgren
6814c90625 Fix N-width sorting bug 2023-05-28 11:57:06 +02:00
Viktor
a57ab427b3 Update useful-resources.md 2023-05-27 12:01:45 +02:00
Viktor Lofgren
1e184a8372 (search) Make exploration mode more random 2023-05-25 17:40:28 +02:00
Viktor Lofgren
6fae51a8ef Stopgap fix for a bug in dealing with quote terms containing stop words. 2023-05-02 19:38:59 +02:00
Viktor Lofgren
a9f7b4c457 Add synthetic keywords for same-site files linked from a document (e.g. file:png). Also add category keywords, like file:image or file:document. 2023-04-30 19:29:13 +02:00
Viktor Lofgren
1e3b6934bb Reduce log noise during loading. Bad URLs don't need to be loaded, they can be grepped from the instructions. 2023-04-30 18:36:44 +02:00
Viktor
0a5e85be8f Update README.md 2023-04-22 21:02:25 +02:00
Viktor
7694a15f62 Fix kale's unreasonably high weighting factor 2023-04-22 20:55:09 +02:00
Viktor
d72da01a92 Update readme.md 2023-04-22 16:05:57 +02:00
Viktor
112f43b3a1 Api service response cache (#16)
* Add response caching to the API service to help SearXNG

* Clean up the code a bit.

* Add an endpoint without a terminating slash for getLicense.

* Add tests for API service.
2023-04-22 15:42:32 +02:00
Viktor Lofgren
f12c6fd57e Add a ranking parameter for biasing toward recent or old content. 2023-04-20 16:00:59 +02:00
Viktor
96bac70b85 Tools for merging sorted lists, and merging btrees. (#14)
* Utilities for merging BTrees of entity size 1 and 2.
* Isolate and clean up sorting algorithms. 
* Functions for keeping distinct items in a LongArray
2023-04-20 15:28:09 +02:00
Viktor Lofgren
619fb8ba80 (converter) Adjust the pub-date sniffing heuristics' order. Doing HTML5 tags too early puts some sites too early. Also expanded support for JSON+LD. 2023-04-19 15:28:50 +02:00
Viktor
5a5cdaf70e Improvements to the adjacency calculator and screenshots tool (#13)
* WIP: Improvements to website adjacencies loader tool.

* Improving screenshots capture bot.
2023-04-18 22:21:49 +02:00
Viktor Lofgren
bb587ca47f Reformulate search-header.hdb, s/Support/Donate/; the old wording was apparently leading some people to think they could get support on this page. 2023-04-18 17:04:24 +02:00
Viktor Lofgren
4d298cd5fa Improving screenshots capture bot. 2023-04-17 18:04:22 +02:00
Viktor Lofgren
fbbaf584ba Adjustments to screenshot capture tool. 2023-04-16 08:55:57 +02:00
Viktor Lofgren
df1850bd45 Fix bug in index service where tld: and links:-queries wouldn't work. 2023-04-15 18:39:16 +02:00
Viktor Lofgren
d42ab19166 Issue 5: Fix bug where some IPv6 addresses blew up domain loading. 2023-04-15 14:11:08 +02:00
Viktor Lofgren
2ab26f37b8 Bug fix for document metadata encoding that breaks year based queries. 2023-04-14 16:56:49 +02:00
Viktor
ec7ce7b0b3 Update readme.md 2023-04-11 16:31:11 +02:00
Viktor Lofgren
3e9b37c264 Refactor website screenshot tool and website adjacencies calculator into code/tools. 2023-04-11 16:20:27 +02:00
Viktor Lofgren
502713f7a8 Reduce memory churn 2023-04-10 16:51:17 +02:00
Viktor Lofgren
e19256a6b6 Tune settings to retrieve more results. 2023-04-10 15:39:20 +02:00
Viktor Lofgren
ccc41d1717 Clean up of the index query handling related code. 2023-04-10 14:50:57 +02:00
Viktor Lofgren
e49b1dd155 Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:20:40 +02:00
Viktor Lofgren
fe419b12b4 Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:11:40 +02:00
Viktor Lofgren
810515c08d Clean up artifact extractor. 2023-04-10 13:07:54 +02:00
Viktor Lofgren
535a51a621 Repair broken year query test. 2023-04-08 12:04:09 +02:00
Viktor
a278fc6296 Increase search result relevance (#8)
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor
f1c6525a50 Update setup.sh 2023-04-02 14:44:43 +02:00
Viktor
ace0d19973 Update README.md 2023-04-02 14:43:14 +02:00
Viktor
40b8c8c128 Update README.md 2023-04-02 14:08:17 +02:00
Viktor Lofgren
716ab35b4e Search ranking debuggability improvements. 2023-04-02 13:43:24 +02:00
Viktor Lofgren
3fb249758e Adjust result ordering. 2023-04-02 12:05:22 +02:00
Viktor Lofgren
f7a6ef2179 Smarter queries, better logging. 2023-04-02 12:05:09 +02:00
Viktor Lofgren
105d93cd85 Index query builder automatically ignores redundant predicates. 2023-04-02 12:04:26 +02:00
Viktor Lofgren
1e4157017d More helpful descriptions of index queries. 2023-04-02 12:03:58 +02:00
Viktor Lofgren
5fb75adaae Remove antique result scoring adjustment that makes no sense anymore. 2023-04-02 11:58:04 +02:00
Viktor Lofgren
affcf8cf41 Load test tool 2023-04-02 09:43:43 +02:00
Viktor Lofgren
cc4e089a5d Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc. 2023-03-30 15:46:15 +02:00
Viktor Lofgren
32b9c2e671 Fix SentenceExtractor jank 2023-03-30 15:45:04 +02:00
Viktor Lofgren
4d05be4095 Refactor InternalLinkGraph 2023-03-30 15:44:23 +02:00
Viktor Lofgren
137adb9c3c Bitmask calculation improvement. Take sentence length into consideration, not all lines are equal. 2023-03-30 15:42:06 +02:00
Viktor Lofgren
16e37672fc Bugfix crawl plan, doesn't use rewrite() everywhere 2023-03-30 15:41:07 +02:00
Viktor Lofgren
d0c72ceb7e Improve experiment runner, convenient start script. 2023-03-30 15:40:31 +02:00
Viktor Lofgren
0fcb2b534c Polish Names 2023-03-29 16:51:47 +02:00
Viktor Lofgren
dcf6218cdb Fix bugs related to search result selection in the case with multiple search terms.
* A deduplication filter step ran too early, and removed many good results on the basis that they partially, but did not fully fit another set of search terms.

* Altered the query creation process to prefer documents where multiple terms appear in the priority index.
2023-03-29 15:18:52 +02:00
Viktor Lofgren
8f51345a1d Add experiment runner tool and got rid of experiments module in processes. 2023-03-28 16:58:46 +02:00
Viktor Lofgren
03bd892b95 Improve document processing in conversion.
* Add flags for long and short documents.
* Break out common length logic from plugins.
* Cleaning up of related code.
2023-03-28 16:38:00 +02:00
Viktor Lofgren
1e65ac3940 Improve useful-resources.md 2023-03-28 16:35:58 +02:00
Viktor
e622437560 Create FUNDING.yml 2023-03-28 13:13:49 +02:00
Viktor Lofgren
30584887f9 DictionaryMap changes.
Add new flag to change the default size to make prod index boot faster. Remove option to select OffHeapDictionaryHashMap.
2023-03-27 17:28:39 +02:00
Viktor Lofgren
17ca4f9eea Permit search results that are all synthetic to pass relevancy check. 2023-03-27 17:27:35 +02:00
Viktor Lofgren
7fb3db3249 Fix bug where link on front page news listing wouldn't work.
... also changed order of date and source to make the UI more consistent.
2023-03-27 17:26:46 +02:00
Viktor Lofgren
b60fcd0918 Documentation improvements 2023-03-27 17:25:27 +02:00
Viktor Lofgren
862e925d7c "-Dsmall-ram=TRUE" no longer does anything. Remove references to the flag, which previously reduced the memory footprint of the loader and index service. 2023-03-26 21:37:11 +02:00
Viktor Lofgren
a0027ad32b Fix broken diagram links after doc/ restructuring. 2023-03-25 16:32:10 +01:00
Viktor Lofgren
c5f4cb34bf Documentation for DB 2023-03-25 16:14:16 +01:00
Viktor
2e69179f12 Update readme.md 2023-03-25 15:47:45 +01:00
Viktor
19000ab339 Create readme.md 2023-03-25 15:46:19 +01:00
Viktor
be3ba3ef37 Update readme.md 2023-03-25 15:27:11 +01:00
Viktor
ac1ac3ea57 Move database to a separate module
* Move database to a separate project, break apart sql file into separate entities.
* Fix front page news listing.
2023-03-25 15:26:17 +01:00
Viktor
0b505939ed Update features-convert/readme.md 2023-03-25 12:43:58 +01:00
Viktor
d2a9e1b644 Add processes link to readme.md for code/common 2023-03-25 12:42:44 +01:00
Viktor Lofgren
3464ca514b Fix typeahead suggestions 2023-03-25 10:20:52 +01:00
Viktor Lofgren
2f2c86a9f5 Fix bug where WmsaHome wouldn't look in /var/lib/wmsa as a fallback 2023-03-25 10:20:52 +01:00
Viktor
45dd9fea25 Update readme.md 2023-03-22 17:15:36 +01:00
Viktor
c974d72e7e Update readme.md 2023-03-22 17:09:48 +01:00
Viktor
e3675d2fa9 Update readme.md 2023-03-22 17:02:03 +01:00
Viktor
c4a6bf7672 Update readme.md 2023-03-22 17:01:34 +01:00
Viktor
5edc0c8d52 Add files via upload 2023-03-22 17:00:01 +01:00
Viktor
cb6865924e Update readme.md 2023-03-22 16:59:38 +01:00
Viktor Lofgren
ee50f7422d CONTRIBUTING.md 2023-03-22 15:27:20 +01:00
Viktor Lofgren
964014860a Get suggestions working again 2023-03-22 15:11:22 +01:00
Viktor Lofgren
7c58ddce81 readme.md 2023-03-22 15:10:30 +01:00
Viktor Lofgren
611ba2d35a Break apart WordPatterns class 2023-03-22 15:10:17 +01:00
Viktor
ecd6ed186f Update readme.md 2023-03-21 17:33:02 +01:00
Viktor
b07f84bc01 Update readme.md 2023-03-21 17:32:09 +01:00
Viktor
ad2e939018 Update readme.md 2023-03-21 17:30:44 +01:00
Viktor
2a90ade80f Update readme.md 2023-03-21 17:26:59 +01:00
Viktor
d9c456d772 Create module-taxonomy.md 2023-03-21 17:24:39 +01:00
Viktor
b2599a6d33 Make colors less eye-grating on dark theme. 2023-03-21 17:18:47 +01:00
Viktor
38fd49b271 Update readme.md 2023-03-21 17:11:28 +01:00
Viktor
85fea2ecaa Add files via upload 2023-03-21 17:08:34 +01:00
Viktor
1b9ae7b42d Update readme.md 2023-03-21 16:38:39 +01:00
Viktor Lofgren
46f81aca2f Break apart reverse index into a separate full index and priority index. It did this before using the same code. This will make the priority index about half as big since it no longer needs to keep metadata. 2023-03-21 16:12:31 +01:00
Viktor Lofgren
ca22c287a5 Make use of DocumentFlags' flags 2023-03-21 16:03:15 +01:00
Viktor Lofgren
1bb1248ab0 Optimize array library, jmh benchmarks. 2023-03-21 16:02:31 +01:00
Viktor Lofgren
624e8acd41 Remove copy-pasted application plugin from subprojects that define features. 2023-03-20 17:25:58 +01:00
Viktor Lofgren
b7190ebc69 Don't index local deployment run state in IntelliJ. 2023-03-20 17:11:39 +01:00
vlofgren
a74f899d28 Update LICENSE.md 2023-03-20 16:49:07 +01:00
vlofgren
8e2225e346 Create useful-resources.md 2023-03-20 16:44:02 +01:00
vlofgren
29c76fcdce Add page&brin to domain-ranking readme.md 2023-03-20 16:41:34 +01:00
vlofgren
55d0fa61d7 Update readme.md 2023-03-20 16:39:15 +01:00
vlofgren
554a7fde80 Update readme.md 2023-03-20 16:27:37 +01:00
vlofgren
bdd47ecd03 Update README.md 2023-03-20 16:10:39 +01:00
Viktor Lofgren
72115e490f Put news into a database table instead of keeping them hardcoded, request counter on front page. 2023-03-19 12:54:58 +01:00
Viktor Lofgren
bdd2b4a43e Put news into a database table instead of keeping them hardcoded. 2023-03-19 11:46:13 +01:00
Viktor Lofgren
3402b31c30 Remove junk from docker-compose.yml 2023-03-18 10:43:51 +01:00
Viktor Lofgren
0682550bd2 Clean up summary extractor module. 2023-03-18 10:33:58 +01:00
Viktor Lofgren
6e89377dea Clean up summary extractor module. 2023-03-18 10:29:25 +01:00
Viktor Lofgren
950c49d80f Clean up summary extractor module. 2023-03-18 10:28:48 +01:00
Viktor Lofgren
8def95e849 Clean up summary extractor module. 2023-03-18 10:24:12 +01:00
Viktor Lofgren
43430728aa Clean up summary extractor module. 2023-03-18 10:21:41 +01:00
Viktor Lofgren
6a20b2b678 Trivial reformatting of code. 2023-03-17 22:11:14 +01:00
Viktor Lofgren
3675c7a090 The search-service doesn't speak REST. 2023-03-17 16:21:52 +01:00
Viktor Lofgren
2eb972dea1 Remove unrelated code, break tools into their own directory. 2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076 Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
Viktor Lofgren
5ef17a2a20 Yet more restructuring. 2023-03-13 23:43:09 +01:00
Viktor Lofgren
0ecab53635 Yet more restructuring. 2023-03-13 23:40:26 +01:00
Viktor Lofgren
d82532b7f1 More restructuring, big bug fixes in keyword extraction. 2023-03-13 17:39:53 +01:00
Viktor Lofgren
281f1322a9 Clean up BTreeWriter 2023-03-12 12:49:49 +01:00
Viktor Lofgren
347f16939c Fix broken setup script 2023-03-12 12:21:37 +01:00
Viktor Lofgren
6e1ddca293 Fix broken mariadb setup 2023-03-12 12:11:33 +01:00
Viktor Lofgren
8b8fc49901 The refactoring will continue until morale improves. 2023-03-12 11:42:07 +01:00
Viktor Lofgren
73eaa0865d The refactoring will continue until morale improves. 2023-03-12 10:50:31 +01:00
Viktor Lofgren
616effdb3c The refactoring will continue until morale improves. 2023-03-12 10:04:48 +01:00
Viktor Lofgren
4cec89da91 Fix bug where results would sometimes be presented solely based on the fact that the document is important on the site in general, regardless of whether it's important to the document. 2023-03-11 14:20:32 +01:00
Viktor Lofgren
2e2916cebe Additional code restructuring to get rid of util and misc-style packages. 2023-03-11 13:53:36 +01:00
Viktor Lofgren
6d939175b1 Additional code restructuring to get rid of util and misc-style packages. 2023-03-11 13:48:40 +01:00
Viktor Lofgren
73e412ea5b Clean up search-service and index-api 2023-03-11 12:26:12 +01:00
Viktor Lofgren
c2f9980eba Tidy up. 2023-03-11 12:13:53 +01:00
Viktor Lofgren
0532e8c40e Tidy up. 2023-03-11 11:35:08 +01:00
Viktor Lofgren
919b80b9ab Gradle shouldn't generate dist zips, zipping jar files is slow and also just ridiculous when you realize jar files are zip files and you can't compress a file twice using the same algo. 2023-03-11 11:34:51 +01:00
Viktor Lofgren
1aee6fdc11 Fix docker dependencies warning. 2023-03-10 17:16:44 +01:00
Viktor Lofgren
a62015d5f3 Fix broken test, compiler warning. 2023-03-10 17:12:12 +01:00
Viktor Lofgren
722ff3bffb Word feature bit for words that appear in the URL, new search profile for plain text files, better plain text titles. 2023-03-10 16:46:56 +01:00
Viktor Lofgren
2bc212d65c Refactor DocumentKeyword-related classes 2023-03-09 20:41:38 +01:00
Viktor Lofgren
efb46cc703 Remove count from WordMetadata entirely. 2023-03-09 18:14:14 +01:00
Viktor Lofgren
8fb531c614 Word Metadata's count is hella broken, stopgap fix by bitCounting positions instead as this is messing with the search result ordering very badly. 2023-03-09 17:58:56 +01:00
Viktor Lofgren
9ece07d559 Chasing a result ranking bug 2023-03-09 17:52:35 +01:00
Viktor Lofgren
0ae4731cf1 Add invariant to WordMetadata 2023-03-09 17:27:07 +01:00
Viktor Lofgren
02db999762 Enable assertions in reconvert script. 2023-03-09 17:26:08 +01:00
Viktor Lofgren
2a25b5e8a9 Placeholder screenshots when the domain is missing from the database entirely. 2023-03-08 18:36:41 +01:00
Viktor Lofgren
5c1a59257c Reconvert script broken when code/ moved. 2023-03-08 17:18:04 +01:00
Viktor Lofgren
d4010c76cf Better title extraction for plain text plugin. 2023-03-07 21:53:44 +01:00
Viktor Lofgren
6fb0f77eea Improving search result scoring in index. 2023-03-07 21:53:30 +01:00
Viktor Lofgren
1252f95da5 Fix for valuation bug in index code that wouldn't sort bad-ish items properly. 2023-03-07 21:26:04 +01:00
Viktor Lofgren
f3babde415 Readme for code/ 2023-03-07 17:32:16 +01:00
Viktor Lofgren
ad1be7c835 Move all code to a code directory. 2023-03-07 17:14:32 +01:00
Viktor Lofgren
c47eb25483 Remove refuse pile logic that in practice resulted in a lot fewer results showing up for many queries. 2023-03-07 16:38:33 +01:00
Viktor Lofgren
58fcddedbb Code cleanup ForwardIndexReader 2023-03-07 16:38:03 +01:00
Viktor Lofgren
11af3f3e64 Code cleanup 2023-03-07 16:37:08 +01:00
Viktor Lofgren
549d323f6d Code cleanup 2023-03-07 16:37:05 +01:00
Viktor Lofgren
a2885acdf4 Performance optimization IndexJournalReadEntry.read 2023-03-07 16:36:44 +01:00
Viktor Lofgren
bd84c73e05 Clean up DocumentKeywordExtractor and DocumentKeywordsBuilder 2023-03-07 16:36:12 +01:00
Viktor Lofgren
04f501b8c8 Tidying up the HTML plugin. 2023-03-06 19:41:20 +01:00
Viktor Lofgren
be040419f3 Tidying up the HTML plugin. 2023-03-06 19:39:21 +01:00
Viktor Lofgren
384de2e54b Fixing LSH deduplication bug. 2023-03-06 19:32:37 +01:00
Viktor Lofgren
43f3380cb9 Refactoring converting-process 2023-03-06 19:32:25 +01:00
Viktor Lofgren
bce452fb4f More documentation... 2023-03-06 19:01:36 +01:00
Viktor Lofgren
0553174401 More documentation... 2023-03-06 18:55:28 +01:00
Viktor Lofgren
2d066af5b9 More documentation... 2023-03-06 18:45:01 +01:00
Viktor Lofgren
b945fd7f39 A lot of readmes, some refactoring. 2023-03-06 18:32:13 +01:00
Viktor Lofgren
f19c9a2863 More readmes, cleaning up dead code. 2023-03-05 19:31:43 +01:00
Viktor Lofgren
87767b14bd index service readme refers to index primitives 2023-03-05 19:16:08 +01:00
Viktor Lofgren
fe0d754f2c Make the code run properly without WMSA_HOME set, adding missing test assets. 2023-03-05 14:12:13 +01:00
Viktor Lofgren
fd1b56dbad Make the code run properly without WMSA_HOME set, adding missing test assets. 2023-03-05 13:47:40 +01:00
Viktor Lofgren
ed8ec0990e Cleaning junk from QueryParserTest 2023-03-05 13:19:26 +01:00
Viktor Lofgren
4d94a023c9 More tests for BTree, cleaned up code a bit. 2023-03-05 13:03:55 +01:00
Viktor Lofgren
96f6cd19e9 Repair integration tests 2023-03-05 12:24:12 +01:00
Viktor Lofgren
cf00963e57 Cleaning up the BTree library a bit. 2023-03-05 11:27:56 +01:00
Viktor Lofgren
4464055715 Readme for array 2023-03-04 19:19:47 +01:00
Viktor Lofgren
4972ad4c4f Readme for array 2023-03-04 19:19:12 +01:00
Viktor Lofgren
a0482273e0 Readme for array 2023-03-04 19:17:30 +01:00
Viktor Lofgren
0b7f8e1459 Readme for array 2023-03-04 19:15:51 +01:00
Viktor Lofgren
1264c64a15 Readme for array 2023-03-04 19:14:20 +01:00
Viktor Lofgren
83c32dc1a6 Better request context 2023-03-04 19:07:47 +01:00
Viktor Lofgren
6b10413efe Better request context 2023-03-04 18:28:46 +01:00
Viktor Lofgren
cd476dd243 Docs for array and btree 2023-03-04 18:06:53 +01:00
Viktor Lofgren
c5ccf0681b Docs for array and btree 2023-03-04 17:57:17 +01:00
Viktor Lofgren
bdaeb73ebb Trivial: Removing unnecessary "throws" 2023-03-04 17:44:43 +01:00
Viktor Lofgren
508cadd33f Readme for btree 2023-03-04 17:28:20 +01:00
Viktor Lofgren
449c62b666 Readme for btree 2023-03-04 17:21:39 +01:00
Viktor Lofgren
2423735a20 Readme for btree 2023-03-04 17:21:13 +01:00
Viktor Lofgren
696f791eb5 Readme for features 2023-03-04 17:00:49 +01:00
Viktor Lofgren
7aec667cb7 Separation-of-concerns SearchQueryIndexService/QueryFactory 2023-03-04 16:52:19 +01:00
Viktor Lofgren
cfbd0017e3 More readmes, clean-up of associated code. 2023-03-04 16:42:31 +01:00
Viktor Lofgren
cf386e0fd4 Cleaning up CONTRIBUTING.md 2023-03-04 16:22:35 +01:00
Viktor Lofgren
9a0d1d5d4e Setup readme 2023-03-04 16:14:03 +01:00
Viktor Lofgren
592766ba65 Setup readme 2023-03-04 16:12:37 +01:00
Viktor Lofgren
036adcbe1f Setup readme 2023-03-04 16:06:54 +01:00
Viktor Lofgren
aa24e80c40 Setup readme 2023-03-04 16:06:36 +01:00
Viktor Lofgren
a061a7e1f6 Automatic conversion at start-up given correct conditions 2023-03-04 16:02:02 +01:00
Viktor Lofgren
6130908285 Automatic conversion at start-up given correct conditions 2023-03-04 16:00:17 +01:00
Viktor Lofgren
3b5002aac8 WIP run and setup 2023-03-04 15:56:47 +01:00
Viktor Lofgren
c7014bbc92 WIP run and setup 2023-03-04 15:56:13 +01:00
Viktor Lofgren
5f758cbb0e WIP run and setup 2023-03-04 15:41:54 +01:00
Viktor Lofgren
ef87e123ba WIP run and setup 2023-03-04 15:17:02 +01:00
Viktor Lofgren
3e5e537a6b WIP run and setup 2023-03-04 15:13:00 +01:00
Viktor Lofgren
15754d7ae8 WIP run and setup 2023-03-04 15:02:52 +01:00
Viktor Lofgren
91b4579edc WIP run and setup 2023-03-04 14:59:32 +01:00
Viktor Lofgren
03b7d7bbbe WIP run and setup 2023-03-04 14:58:38 +01:00
Viktor Lofgren
7d3f9c4bab WIP run and setup 2023-03-04 14:52:44 +01:00
Viktor Lofgren
cfd408dbbd WIP run and setup 2023-03-04 14:52:03 +01:00
Viktor Lofgren
d7164ea26f WIP run and setup 2023-03-04 14:50:35 +01:00
Viktor Lofgren
ff115f3331 WIP run and setup 2023-03-04 14:50:08 +01:00
Viktor Lofgren
8d1172f56e WIP run and setup 2023-03-04 14:45:35 +01:00
Viktor Lofgren
6aad8de316 WIP run and setup 2023-03-04 14:42:24 +01:00
Viktor Lofgren
e37e599703 WIP run and setup 2023-03-04 14:38:21 +01:00
Viktor Lofgren
d3fa7d5181 WIP run and setup 2023-03-04 14:35:50 +01:00
Viktor Lofgren
3cebc08826 Move env into run for clarity 2023-03-04 14:24:45 +01:00
Viktor Lofgren
7259c65052 Move env into run for clarity 2023-03-04 14:24:38 +01:00
Viktor Lofgren
b4051c35e1 Remove old unused protobuf crap 2023-03-04 14:17:57 +01:00
Viktor Lofgren
cf1f878a39 Remove old unused protobuf crap 2023-03-04 14:17:13 +01:00
Viktor Lofgren
83d20ccf48 Readme for API-service 2023-03-04 14:10:32 +01:00
Viktor Lofgren
696508034a Readme for assistant-service 2023-03-04 14:08:49 +01:00
Viktor Lofgren
ef1a39862c Restructuring the git repo 2023-03-04 14:05:24 +01:00
Viktor Lofgren
25483adf7f Restructuring the git repo 2023-03-04 14:02:49 +01:00
Viktor Lofgren
81cb6f4ea0 Restructuring the git repo 2023-03-04 14:01:58 +01:00
Viktor Lofgren
1b776b114e Restructuring the git repo 2023-03-04 14:00:46 +01:00
Viktor Lofgren
4fdaaa16ba Restructuring the git repo 2023-03-04 13:19:01 +01:00
2949 changed files with 241613 additions and 59990 deletions

14
.github/FUNDING.yml vendored Normal file

@@ -0,0 +1,14 @@
# These are supported funding model platforms
polar: marginalia-search
github: MarginaliaSearch
patreon: marginalia_nu
open_collective: # Replace with a single Open Collective username
ko_fi: # Replace with a single Ko-fi username
tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
liberapay: # Replace with a single Liberapay username
issuehunt: # Replace with a single IssueHunt username
otechie: # Replace with a single Otechie username
lfx_crowdfunding: # Replace with a single LFX Crowdfunding project-name e.g., cloud-foundry
custom: https://www.buymeacoffee.com/marginalia.nu

4
.gitignore vendored

@@ -4,3 +4,7 @@ build/
 *~
 .gradle/
 .idea/
+lombok.config
+Dockerfile
+run
+jte-classes


@@ -0,0 +1,6 @@
Not everyone shows up in the git commit history, doesn't mean they didn't contribute valuable changes.
In such circumstances, their deeds will be recorded here.
* [@samstorment](https://www.github.com/samstorment) provided a design overhaul for [https://explore.marginalia.nu/](https://explore.marginalia.nu/) in [10cad3](https://github.com/MarginaliaSearch/MarginaliaSearch/commit/10cad3abb29b8a87bf5fd56afbc192335e3e94d7)
via [issue #44](https://github.com/MarginaliaSearch/MarginaliaSearch/issues/44).
* [@dreimolo](https://github.com/dreimolo) provided build script [fixes for apple silicon](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/64)


@@ -1,42 +1,20 @@
 # Contributing
-At present this is mostly a solo project, but
-external contributions are very welcome.
 This is a bit of a special project,
 in part because a search engine isn't
 like a text editor that you can just
 download and tinker with; and in part
 because it's as much a research project
 as it is a search engine.
 If you have an idea for a cool change,
-send an email to <kontakt@marginalia.nu> and
+email <kontakt@marginalia.nu> and
 we can discuss its feasibility.
 Search is essentially a fractal of interesting
 problems, so even if you don't have an idea,
 just a skillset (really any), odds are there's
 something interesting I could point you to.
-## Release and branches
-The search engine has a release cycle of
-once per 6-8 weeks, coinciding with the crawling
-cycle. Where model-breaking changes and changes to
-the crawler can be introduced.
-## Running and set-up
-There is a complementary project, wmsa.local, which
-contains scripts and instructions for running this
-code base.
-[https://git.marginalia.nu/marginalia/wmsa.local](https://git.marginalia.nu/marginalia/wmsa.local)
-## Documentation
-What documentation exists resides here:
-https://git.marginalia.nu/marginalia/marginalia.nu/wiki
+Make sure you check out the [ide-configuration guide](doc/ide-configuration.md)
+to get your IDE set up quickly and easily.


@@ -14,3 +14,4 @@
 You should have received a copy of the GNU Affero General Public License
 along with this program. If not, see <http://www.gnu.org/licenses/>.
+Note that packages under [third-party/](third-party/) have different licenses, and the code in [code/libraries/](code/libraries/) is dual-licensed under MIT.

121
NGI0Entrust_tag.svg Normal file

@@ -0,0 +1,121 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->
<svg
version="1.1"
id="svg2"
xml:space="preserve"
width="1600.5095"
height="502.77777"
viewBox="0 0 480.15286 150.83333"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns="http://www.w3.org/2000/svg"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"><metadata
id="metadata8"><rdf:RDF><cc:Work
rdf:about=""><dc:format>image/svg+xml</dc:format><dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" /></cc:Work></rdf:RDF></metadata><defs
id="defs6"><linearGradient
id="linearGradient1220"><stop
id="stop1216"
offset="0"
style="stop-color:#98bf00;stop-opacity:1;" /><stop
id="stop1218"
offset="1"
style="stop-color:#98bf00;stop-opacity:0.51" /></linearGradient><linearGradient
x1="0"
y1="0"
x2="1"
y2="0"
gradientUnits="userSpaceOnUse"
gradientTransform="matrix(-139.45511,-135.52185,-135.52185,139.45511,177.4727,131.75308)"
spreadMethod="pad"
id="linearGradient28"><stop
style="stop-opacity:1;stop-color:#00afbc"
offset="0"
id="stop24" /><stop
style="stop-opacity:1;stop-color:#205374"
offset="1"
id="stop26" /></linearGradient><clipPath
clipPathUnits="userSpaceOnUse"
id="clipPath38"><path
d="M 0,127.984 H 415.474 V 0 H 0 Z"
id="path36" /></clipPath><linearGradient
xlink:href="#linearGradient1220"
id="linearGradient947"
gradientUnits="userSpaceOnUse"
x1="14.915152"
y1="14.167241"
x2="214.11908"
y2="111.76186"
gradientTransform="matrix(4.4444443,0,0,-4.4444443,-33.008887,535.8)" /><clipPath
clipPathUnits="userSpaceOnUse"
id="clipPath38-9"><path
d="M 0,127.984 H 415.474 V 0 H 0 Z"
id="path36-1" /></clipPath></defs><g
id="g10"
transform="matrix(1.3333333,0,0,-1.3333333,-9.9026662,160.74)"><g
id="g40"
transform="translate(175.9982,95.8645)" /><g
id="g44"
transform="translate(152.1193,64.9934)" />
<g
id="NGI0Entrust"><title
id="title12661">NGI Zero Entrust</title><path
id="path7692"
style="fill:#ffffff;fill-opacity:1;stroke:none;stroke-width:0.999999"
d="m 133.10651,96.933602 c -6.67899,0 -12.68988,-1.41201 -18.02988,-4.23501 -5.344,-2.822 -9.51678,-6.73803 -12.52178,-11.74702 -3.004994,-5.008 -4.507906,-10.66967 -4.507906,-16.982669 0,-6.314995 1.502912,-11.974991 4.507906,-16.983985 3.005,-5.008995 7.14794,-8.924024 12.42993,-11.747021 5.282,-2.823998 11.23084,-4.23501 17.84883,-4.23501 4.613,0 9.19693,0.698875 13.75093,2.094873 0.045,0.014 0.0912,0.02819 0.13623,0.04219 7.10399,2.201999 11.88413,8.859686 11.88413,16.29668 v 9.047022 c 0,3.581996 -2.90333,6.485889 -6.48633,6.485889 h -0.50581 c -0.064,0 -0.12704,-0.0077 -0.19204,-0.0097 -0.064,0.002 -0.12704,0.0097 -0.19204,0.0097 h -7.28306 c -3.92899,0 -7.35908,-2.964914 -7.61308,-6.884912 -0.278,-4.295996 3.12428,-7.86709 7.36128,-7.86709 0.776,0 1.34293,-0.753702 1.11093,-1.493702 -0.65799,-2.087998 -2.34102,-3.751009 -4.54702,-4.333008 -2.07399,-0.546999 -4.27598,-0.820898 -6.60498,-0.820898 -4.00699,0 -7.57381,0.864972 -10.6998,2.594971 -3.127,1.729999 -5.5704,4.143993 -7.3314,7.23999 -1.761,3.095997 -2.64067,6.617018 -2.64067,10.564014 0,4.005996 0.87967,7.557666 2.64067,10.653656 1.761,3.097 4.2191,5.49317 7.3771,7.19517 3.156,1.698 6.76804,2.54883 10.83604,2.54883 4.68099,0 8.8649,-1.26899 12.5499,-3.80699 2.341,-1.61199 5.52423,-1.58761 7.75723,0.17139 3.47999,2.741 3.2889,8.04495 -0.31509,10.45196 -1.7,1.13599 -3.53807,2.11163 -5.51206,2.92763 -4.553,1.881 -9.62316,2.82305 -15.20816,2.82305 z m -93.706345,-1.09248 c -4.022996,0 -7.284815,-3.26081 -7.284815,-7.28482 v -49.17612 c 0,-4.022993 3.261819,-7.284815 7.284815,-7.284815 4.023996,0 7.284814,3.261822 7.284814,7.284815 V 62.34029 c 0,2.842996 3.564362,4.118722 5.36836,1.921728 L 76.282148,34.757135 c 1.383999,-1.685 3.450155,-2.661768 5.631153,-2.661768 h 1.380761 c 4.023997,0 7.286133,3.261822 7.286133,7.284815 v 49.17612 c 0,4.02401 -3.262136,7.28482 -7.286133,7.28482 -4.023995,0 -7.284815,-3.26081 -7.284815,-7.28482 V 65.615095 c 0,-2.844997 -3.568118,-4.119773 -5.370117,-1.917774 L 
46.503925,93.172322 c -1.382997,1.69 -3.45199,2.6688 -5.635987,2.6688 z m 136.597415,-4.4e-4 c -4.074,0 -7.37578,-3.30178 -7.37578,-7.37578 V 39.472027 c 0,-4.073996 3.30178,-7.37622 7.37578,-7.37622 4.074,0 7.37622,3.302224 7.37622,7.37622 v 48.992875 c 0,4.074 -3.30222,7.37578 -7.37622,7.37578 z" /><path
id="path30"
style="fill:url(#linearGradient947);fill-opacity:1;stroke:none;stroke-width:4.44444"
d="M 79.115234 30 C 52.097457 30 30 52.101902 30 79.115234 L 30 423.66211 C 30 450.67989 52.097457 472.77734 79.115234 472.77734 L 812.60352 472.77734 C 839.61685 472.77734 861.7207 450.67544 861.7207 423.66211 L 861.7207 342.50586 C 861.7207 333.51919 865.28844 324.89711 871.64844 318.53711 L 912.07617 278.11133 C 923.36506 266.82688 923.33313 248.52428 912.01758 237.27539 L 871.7207 197.19922 C 865.3207 190.83922 861.7207 182.18238 861.7207 173.16016 L 861.7207 79.115234 C 861.7207 52.101902 839.61685 30 812.60352 30 L 79.115234 30 z M 558.57812 104.87891 C 583.40035 104.87891 605.93437 109.06578 626.16992 117.42578 C 634.94325 121.05245 643.11241 125.38861 650.66797 130.4375 C 666.68575 141.13528 667.53503 164.7084 652.06836 176.89062 C 642.14392 184.7084 627.99624 184.81679 617.5918 177.65234 C 601.21402 166.37234 582.6189 160.73242 561.81445 160.73242 C 543.73445 160.73242 527.68096 164.51388 513.6543 172.06055 C 499.61874 179.62499 488.69385 190.27462 480.86719 204.03906 C 473.04052 217.79906 469.13086 233.58423 469.13086 251.38867 C 469.13086 268.93089 473.04052 284.57984 480.86719 298.33984 C 488.69385 312.09984 499.55339 322.82869 513.45117 330.51758 C 527.3445 338.20647 543.19697 342.05078 561.00586 342.05078 C 571.35697 342.05078 581.14355 340.83345 590.36133 338.40234 C 600.16577 335.81568 607.64587 328.42453 610.57031 319.14453 C 611.60142 315.85564 609.0817 312.50586 605.63281 312.50586 C 586.8017 312.50586 571.68046 296.63435 572.91602 277.54102 C 574.0449 260.11879 589.28973 246.94141 606.75195 246.94141 L 639.12109 246.94141 C 639.40998 246.94141 639.69016 246.97549 639.97461 246.98438 C 640.2635 246.97549 640.54368 246.94141 640.82812 246.94141 L 643.07617 246.94141 C 659.00062 246.94141 671.9043 259.84758 671.9043 275.76758 L 671.9043 315.97656 C 671.9043 349.0299 650.65927 378.61958 619.08594 388.40625 C 618.88594 388.46847 618.68047 388.53153 618.48047 388.59375 C 598.24047 394.79819 577.86746 397.9043 557.36523 397.9043 C 527.9519 397.9043 
501.51266 391.63314 478.03711 379.08203 C 454.56155 366.53536 436.14852 349.13527 422.79297 326.87305 C 409.43741 304.61083 402.75781 279.45534 402.75781 251.38867 C 402.75781 223.33089 409.43741 198.16793 422.79297 175.91016 C 436.14852 153.64793 454.6942 136.24339 478.44531 123.70117 C 502.17865 111.15451 528.89368 104.87891 558.57812 104.87891 z M 142.10547 109.73438 L 148.62891 109.73438 C 158.33557 109.73438 167.53107 114.08459 173.67773 121.5957 L 280.94531 252.5957 C 288.9542 262.38237 304.8125 256.71671 304.8125 244.07227 L 304.8125 142.11133 C 304.8125 124.22688 319.30501 109.73438 337.18945 109.73438 C 355.0739 109.73438 369.57227 124.22688 369.57227 142.11133 L 369.57227 360.67188 C 369.57227 378.55187 355.0739 393.04883 337.18945 393.04883 L 331.05273 393.04883 C 321.3594 393.04883 312.1765 388.70764 306.02539 381.21875 L 198.3418 250.08594 C 190.32402 240.32149 174.48242 245.9914 174.48242 258.62695 L 174.48242 360.67188 C 174.48242 378.55187 159.98991 393.04883 142.10547 393.04883 C 124.22547 393.04883 109.72852 378.55187 109.72852 360.67188 L 109.72852 142.11133 C 109.72852 124.22688 124.22547 109.73438 142.10547 109.73438 z M 749.20508 109.73633 C 767.31174 109.73633 781.98828 124.41091 781.98828 142.51758 L 781.98828 360.26367 C 781.98828 378.37034 767.31174 393.04688 749.20508 393.04688 C 731.09841 393.04688 716.42383 378.37034 716.42383 360.26367 L 716.42383 142.51758 C 716.42383 124.41091 731.09841 109.73633 749.20508 109.73633 z "
transform="matrix(0.22500001,0,0,-0.22500001,7.4269998,120.555)" /><g
aria-label="Z E R O"
transform="scale(1,-1)"
id="text56"
style="font-weight:600;font-size:31.76px;font-family:'Montserrat SemiBold';-inkscape-font-specification:Montserrat-SemiBold;fill:#6f9aa8"><path
d="m 261.75384,-85.665085 -13.08512,15.97528 h 13.498 v 3.4936 H 243.206 v -2.76312 l 13.08512,-15.97528 h -12.8628 v -3.4936 h 18.32552 z"
id="path12603" /><path
d="m 278.84063,-75.787725 v 6.12968 h 12.5452 v 3.46184 h -16.674 v -22.232 h 16.22936 v 3.46184 h -12.10056 v 5.78032 h 10.73488 v 3.39832 z"
id="path12605" /><path
d="m 323.74919,-66.196205 h -4.4464 l -4.54168,-6.5108 q -0.28584,0.03176 -0.85752,0.03176 h -5.01808 v 6.47904 h -4.1288 v -22.232 h 9.14688 q 2.89016,0 5.01808,0.9528 2.15968,0.9528 3.30304,2.73136 1.14336,1.77856 1.14336,4.22408 0,2.50904 -1.23864,4.31936 -1.20688,1.81032 -3.4936,2.6996 z m -4.54168,-14.32376 q 0,-2.12792 -1.39744,-3.27128 -1.39744,-1.14336 -4.09704,-1.14336 h -4.82752 v 8.86104 h 4.82752 q 2.6996,0 4.09704,-1.14336 1.39744,-1.17512 1.39744,-3.30304 z"
id="path12607" /><path
d="m 347.12448,-65.878605 q -3.39832,0 -6.12968,-1.46096 -2.73136,-1.49272 -4.2876,-4.09704 -1.55624,-2.63608 -1.55624,-5.8756 0,-3.23952 1.55624,-5.84384 1.55624,-2.63608 4.2876,-4.09704 2.73136,-1.49272 6.12968,-1.49272 3.39832,0 6.12968,1.49272 2.73136,1.46096 4.2876,4.06528 1.55624,2.60432 1.55624,5.8756 0,3.27128 -1.55624,5.8756 -1.55624,2.60432 -4.2876,4.09704 -2.73136,1.46096 -6.12968,1.46096 z m 0,-3.62064 q 2.2232,0 4.00176,-0.98456 1.77856,-1.01632 2.79488,-2.79488 1.01632,-1.81032 1.01632,-4.03352 0,-2.2232 -1.01632,-4.00176 -1.01632,-1.81032 -2.79488,-2.79488 -1.77856,-1.01632 -4.00176,-1.01632 -2.2232,0 -4.00176,1.01632 -1.77856,0.98456 -2.79488,2.79488 -1.01632,1.77856 -1.01632,4.00176 0,2.2232 1.01632,4.03352 1.01632,1.77856 2.79488,2.79488 1.77856,0.98456 4.00176,0.98456 z"
id="path12609" /></g><g
aria-label="ENTRUST"
transform="scale(0.99994801,-1.000052)"
id="Entrust"
style="font-weight:bold;font-size:20.009px;font-family:'Montserrat SemiBold';-inkscape-font-specification:'Montserrat SemiBold, Bold';letter-spacing:3.55932px;fill:#6f9aa8;stroke-width:0.999947"><path
d="m 245.81989,-41.935548 v 3.861737 h 7.90356 v 2.180981 h -10.50473 v -14.0063 h 10.2246 v 2.180981 h -7.62343 v 3.641638 h 6.76304 v 2.140963 z"
id="path12612" /><path
d="m 270.04847,-40.414864 v -9.484266 h 2.58116 v 14.0063 h -2.14096 l -7.72347,-9.484266 v 9.484266 h -2.58117 v -14.0063 h 2.14097 z"
id="path12614" /><path
d="m 285.39308,-35.89283 h -2.60117 v -11.80531 h -4.64209 v -2.20099 h 11.88535 v 2.20099 h -4.64209 z"
id="path12616" /><path
d="m 307.52074,-35.89283 h -2.80126 l -2.86129,-4.101845 q -0.18008,0.02001 -0.54024,0.02001 h -3.16142 v 4.081836 h -2.60117 v -14.0063 h 5.76259 q 1.82082,0 3.16142,0.60027 1.36061,0.60027 2.08094,1.720774 0.72032,1.120504 0.72032,2.661197 0,1.580711 -0.78035,2.721224 -0.76034,1.140513 -2.20099,1.700765 z m -2.86129,-9.024059 q 0,-1.340603 -0.88039,-2.060927 -0.8804,-0.720324 -2.58116,-0.720324 h -3.04137 v 5.582511 h 3.04137 q 1.70076,0 2.58116,-0.720324 0.88039,-0.740333 0.88039,-2.080936 z"
id="path12618" /><path
d="m 319.76395,-35.69274 q -2.90131,0 -4.52204,-1.620729 -1.62073,-1.640738 -1.62073,-4.682106 v -7.903555 h 2.60117 v 7.80351 q 0,4.121854 3.5616,4.121854 3.5416,0 3.5416,-4.121854 v -7.80351 h 2.56115 v 7.903555 q 0,3.041368 -1.62073,4.682106 -1.60072,1.620729 -4.50202,1.620729 z"
id="path12620" /><path
d="m 337.4296,-35.69274 q -1.62073,0 -3.14141,-0.460207 -1.50068,-0.460207 -2.38107,-1.220549 l 0.9004,-2.020909 q 0.86039,0.680306 2.10095,1.120504 1.26056,0.420189 2.52113,0.420189 1.5607,0 2.32105,-0.500225 0.78035,-0.500225 0.78035,-1.320594 0,-0.60027 -0.4402,-0.980441 -0.42019,-0.40018 -1.08049,-0.620279 -0.66029,-0.220099 -1.80081,-0.500225 -1.60072,-0.380171 -2.60117,-0.760342 -0.98044,-0.380171 -1.70076,-1.180531 -0.70032,-0.820369 -0.70032,-2.20099 0,-1.160522 0.62028,-2.100945 0.64029,-0.960432 1.90086,-1.520684 1.28057,-0.560252 3.1214,-0.560252 1.28058,0 2.52113,0.320144 1.24056,0.320144 2.14097,0.920414 l -0.82037,2.020909 q -0.92042,-0.540243 -1.92087,-0.820369 -1.00045,-0.280126 -1.94087,-0.280126 -1.54069,0 -2.30103,0.520234 -0.74034,0.520234 -0.74034,1.380621 0,0.60027 0.42019,0.980441 0.4402,0.380171 1.1005,0.60027 0.66029,0.220099 1.80081,0.500225 1.5607,0.360162 2.56115,0.760342 1.00045,0.380171 1.70076,1.180531 0.72033,0.80036 0.72033,2.160972 0,1.160522 -0.64029,2.100945 -0.62028,0.940423 -1.90085,1.500675 -1.28058,0.560252 -3.12141,0.560252 z"
id="path12622" /><path
d="m 354.47498,-35.89283 h -2.60117 v -11.80531 h -4.64209 v -2.20099 h 11.88535 v 2.20099 h -4.64209 z"
id="path12624" /></g></g>
<text
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:20.01px;font-family:'Montserrat SemiBold';-inkscape-font-specification:'Montserrat SemiBold, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;writing-mode:lr-tb;text-anchor:start;fill:#6f9aa8;fill-opacity:1;fill-rule:nonzero;stroke:none;stroke-width:1"
id="text2843"
x="240.16206"
y="-35.894695"
transform="scale(1,-1)"><tspan
id="tspan2841"
x="240.16206"
y="-35.894695" /></text></g></svg>


106
README.md

@@ -1,40 +1,104 @@
# marginalia.nu # Marginalia Search
This is the source code for marginalia.nu, including the [search engine](https://search.marginalia.nu), This is the source code for [Marginalia Search](https://search.marginalia.nu).
the [MEMEX/gemini server](https://memex.marginalia.nu), the and the [encyclopedia service](https://encyclopedia.marginalia.nu).
The aim of the project is to develop new and alternative discovery methods for the Internet. The aim of the project is to develop new and alternative discovery methods for the Internet.
It's an experimental workshop as much as it is a public service, the overarching goal is to It's an experimental workshop as much as it is a public service, the overarching goal is to
elevate the more human, non-commercial sides of the Internet. A side-goal is to do this without elevate the more human, non-commercial sides of the Internet.
requiring datacenters and expensive enterprise hardware, to run this operation on affordable hardware.
The canonical git server for this project is [https://git.marginalia.nu](https://git.marginalia.nu). A side-goal is to do this without requiring datacenters and enterprise hardware budgets,
It is fine to mirror it on other hosts, but if you have issues or questions to be able to run this operation on affordable hardware with minimal operational overhead.
git.marginalia.nu is where you want to go.
## Important note about wmsa.local The long term plan is to refine the search engine so that it provides enough public value
that the project can be funded through grants, donations and commercial API licenses
(non-commercial share-alike is always free).
This project has a [sister repository called wmsa.local](https://git.marginalia.nu/marginalia/wmsa.local) The system can both be run as a copy of Marginalia Search, or as a white-label search engine
that contains scripts and configuration files for running and developing the code. for your own data (either crawled or side-loaded). At present the logic isn't very configurable, and a lot of the judgements
made are based on the Marginalia project's goals, but additional configurability is being
worked on!
Without it, development is very unpleasant. Here's a demo of the set-up and operation of the self-hostable barebones mode of the search engine: [🌎&nbsp;https://www.youtube.com/watch?v=PNwMkenQQ24](https://www.youtube.com/watch?v=PNwMkenQQ24)
While developing the code, you will want an environment variable WMSA_HOME pointing to ## Set up
the directory in which wmsa.local is checked out, otherwise the code will not run and
several tests will fail.
## Documentation To set up a local test environment, follow the instructions in [📄 run/readme.md](run/readme.md)!
Documentation is a work in progress. See the [wiki](https://git.marginalia.nu/marginalia/marginalia.nu/wiki). Further documentation is available at [🌎&nbsp;https://docs.marginalia.nu/](https://docs.marginalia.nu/).
## Contributing Before compiling, it's necessary to run [⚙️ run/setup.sh](run/setup.sh).
This will download supplementary model data that is necessary to run the code.
These are also necessary to run the tests.
[CONTRIBUTING.md](CONTRIBUTING.md) If you wish to hack on the code, check out [📄&nbsp;doc/ide-configuration.md](doc/ide-configuration.md).
## Supporting ## Hardware Requirements
Consider [supporting this project](https://memex.marginalia.nu/projects/edge/supporting.gmi). A production-like environment requires a lot of RAM and ideally enterprise SSDs for
the index, as well as some additional terabytes of slower harddrives for storing crawl
data. It can be made to run on smaller hardware by limiting the size of the index.
The system will definitely run on a 32 GB machine, possibly smaller, but at that size it may not perform
very well as it relies on disk caching to be fast.
A local developer's deployment is possible with much smaller hardware (and index size).
## Project Structure
[📁 code/](code/) - The Source Code. See [📄 code/readme.md](code/readme.md) for a further breakdown of the structure and architecture.
[📁 run/](run/) - Scripts and files used to run the search engine locally
[📁 third-party/](third-party/) - Third party code
[📁 doc/](doc/) - Supplementary documentation
[📄 CONTRIBUTING.md](CONTRIBUTING.md) - How to contribute
[📄 LICENSE.md](LICENSE.md) - License terms
## Contact ## Contact
You can email <kontakt@marginalia.nu> with any questions or feedback. You can email <kontakt@marginalia.nu> with any questions or feedback.
## License
The bulk of the project is available with AGPL 3.0, with exceptions. Some parts are co-licensed under MIT,
third party code may have different licenses. See the appropriate readme.md / license.md.
## Versioning
The project uses modified Calendar Versioning, where the first two pairs of numbers are a year and month coinciding
with the latest crawling operation, and the third number is a patch number.
```
      version
      --
yy.mm.VV
-----
crawl
```
For example, `23.03.02` is a release with crawl data from March 2023 (released in May 2023).
It is the second patch for the 23.03 release.
Versions with the same year and month are compatible with each other, or offer an upgrade path where the same
data set can be used, but across different crawl sets data format changes may be introduced, and you're generally
expected to re-crawl the data from scratch as crawler data has shelf life approximately as long as the major release
cycles of this project. After about 2-3 months it gets noticeably stale with many dead links.
For development purposes, crawling is discouraged and sample data is available. See [📄&nbsp;run/readme.md](run/readme.md)
for more information.
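
The versioning scheme can be illustrated with a small sketch. This is not part of the project: the `ReleaseVersion` record and its methods are hypothetical names, and the sketch assumes that "same crawl set" simply means that the year and month components match.

```java
// Hypothetical helper, not part of the Marginalia code base.
record ReleaseVersion(int year, int month, int patch) {

    /** Parse a version string of the form yy.mm.VV, e.g. "23.03.02". */
    static ReleaseVersion parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected yy.mm.VV, got: " + version);
        }
        return new ReleaseVersion(
                Integer.parseInt(parts[0]),   // yy: year of the crawl
                Integer.parseInt(parts[1]),   // mm: month of the crawl
                Integer.parseInt(parts[2]));  // VV: patch number
    }

    /** Assumption: two versions share a crawl set when year and month match. */
    boolean sameCrawlSet(ReleaseVersion other) {
        return year == other.year && month == other.month;
    }
}
```

Under these assumptions, `23.03.01` and `23.03.02` would be treated as sharing a data set, while moving to `23.05.01` would typically imply a re-crawl.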
## Funding
### Donations
Consider [donating to the project](https://www.marginalia.nu/marginalia-search/supporting/).
### Grants
This project was funded through the [NGI0 Entrust Fund](https://nlnet.nl/entrust), a fund established by [NLnet](https://nlnet.nl) with financial support from the European Commission's [Next Generation Internet](https://ngi.eu/) programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 101069594.
![NLnet Foundation](nlnet.png)
![NGI0](NGI0Entrust_tag.svg)

95
ROADMAP.md Normal file
View File

@@ -0,0 +1,95 @@
# Roadmap 2025
This is a roadmap with major features planned for Marginalia Search.
It's not set in any particular order and other features will definitely
be implemented as well.
Major goals:
* Reach 1 billion pages indexed
* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done.)
## Hybridize crawler w/ Common Crawl data
Sometimes Marginalia's relatively obscure crawler is blocked when attempting to crawl a website, or for
other technical reasons it may be prevented from doing so. A possible work-around is to hybridize the
crawler so that it attempts to fetch such inaccessible websites from Common Crawl. This is an important
step on the road to 1 billion pages indexed.
As a rough sketch, the crawler would identify target websites, consume CC's index, and then fetch the WARC data
with byte range queries.
Retaining the ability to independently crawl the web is still strongly desirable so going full CC is not an option.
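
The byte range step in the rough sketch above might look something like the following. This is only an illustration, not crawler code from this repository: the `WarcRangeRequest` class is a hypothetical name, and it assumes the index lookup has already produced a file URL, a byte offset, and a record length.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch, not Marginalia's crawler code.
class WarcRangeRequest {

    /** Build a GET request for `length` bytes starting at `offset`.
     *  HTTP byte ranges are inclusive at both ends, hence the -1. */
    static HttpRequest forRecord(URI warcFile, long offset, long length) {
        if (offset < 0 || length <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and length > 0");
        }
        long lastByte = offset + length - 1;
        return HttpRequest.newBuilder(warcFile)
                .header("Range", "bytes=" + offset + "-" + lastByte)
                .GET()
                .build();
    }
}
```

The request would then be sent with a `java.net.http.HttpClient`, and a server that honours the range is expected to answer with status 206 and only the requested record.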
## Safe Search
The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php))
combined with a naive Bayesian filter would go a long way, or something more sophisticated...?
## Additional Language Support
It would be desirable if the search engine supported more languages than English. This is partially about
rooting out assumptions regarding character encoding, but there's most likely some amount of custom logic
associated with each language added, at least a models file or two, as well as some fine tuning.
It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
## Support for binary formats like PDF
The crawler needs to be modified to retain them, and the conversion logic needs to parse them.
The documents database probably should have some sort of flag indicating it's a PDF as well.
PDF parsing is known to be a bit of a security liability so some thought needs to be put in
that direction as well.
## Custom ranking logic
Stract does an interesting thing where they have configurable search filters.
This looks like a good idea that wouldn't just help clean up the search filters on the main
website, but might be cheap enough that we could go as far as to offer a number of ad-hoc custom search
filters to any API consumer.
I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
## Show favicons next to search results
This is expected from search engines. A basic proof-of-concept sketch of fetching this data has been done, but the feature is some way from being a reality.
## Specialized crawler for GitHub
One of the search engine's biggest limitations right now is that it does not index GitHub at all. A specialized crawler that fetches at least the readme.md would go a long way toward providing search capabilities in this domain.
# Completed
## Web Design Overhaul (COMPLETED 2025-01)
The design is kinda clunky and hard to maintain, and needlessly outdated-looking.
PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127)
## Finalize RSS support (COMPLETED 2024-11)
Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.
Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
## Proper Position Index (COMPLETED 2024-09)
The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
of being very fast to evaluate and works well for what it is, but is inaccurate and has the
drawback of making support for quoted search terms inaccurate and largely reliant on indexing
word n-grams known beforehand. This limits the ability to interpret longer queries.
The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
list, as is the civilized way of doing this.
Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
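
For readers unfamiliar with the technique, a gamma coded positions list can be sketched as below. This is a textbook Elias gamma coding of position deltas, written for readability; it is not the encoding code from PR #99, and the `GammaPositions` class is a hypothetical name. A real index would pack bits into bytes rather than build a string.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the encoding code used by Marginalia.
class GammaPositions {

    /** Encode strictly increasing positions (each >= 1) as gamma coded deltas. */
    static String encode(int[] positions) {
        StringBuilder bits = new StringBuilder();
        int previous = 0;
        for (int position : positions) {
            int delta = position - previous;
            if (delta < 1) {
                throw new IllegalArgumentException("positions must be strictly increasing and >= 1");
            }
            String binary = Integer.toBinaryString(delta);
            // Unary length prefix: one zero for each bit after the leading one
            bits.append("0".repeat(binary.length() - 1));
            bits.append(binary);
            previous = position;
        }
        return bits.toString();
    }

    /** Decode a bit string produced by encode() back into positions. */
    static List<Integer> decode(String bits) {
        List<Integer> positions = new ArrayList<>();
        int i = 0;
        int previous = 0;
        while (i < bits.length()) {
            int zeros = 0;
            while (bits.charAt(i) == '0') { zeros++; i++; }
            // The value occupies zeros + 1 bits, starting at the leading one
            int delta = Integer.parseInt(bits.substring(i, i + zeros + 1), 2);
            i += zeros + 1;
            previous += delta;
            positions.add(previous);
        }
        return positions;
    }
}
```

Small deltas cost few bits (a delta of 1 is a single bit), which is why this family of codes suits dense position lists better than a fixed width mask.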

View File

@@ -1,62 +1,75 @@
plugins { plugins {
id 'java' id 'java'
id("org.jetbrains.gradle.plugin.idea-ext") version "1.0"
id "me.champeau.jmh" version "0.6.6"
id 'com.github.johnrengelman.shadow' version '6.0.0' // This is a workaround for a bug in the Jib plugin that causes it to stall randomly
// https://github.com/GoogleContainerTools/jib/issues/3347
id 'com.google.cloud.tools.jib' version '3.4.4' apply(false)
} }
group 'nu.marginalia' group 'marginalia'
version 'SNAPSHOT' version 'SNAPSHOT'
compileJava.options.encoding = "UTF-8" compileJava.options.encoding = "UTF-8"
compileTestJava.options.encoding = "UTF-8" compileTestJava.options.encoding = "UTF-8"
repositories {
mavenLocal() subprojects.forEach {it ->
maven { url "https://artifactory.cronapp.io/public-release/" } // Enable preview features for the entire project
maven { url "https://repo1.maven.org/maven2/" }
maven { url "https://www2.ph.ed.ac.uk/maven2/" } if (it.path.contains(':code:')) {
maven { url "https://jitpack.io/" } sourceSets.main.java.srcDirs += file('java')
exclusiveContent { sourceSets.main.resources.srcDirs += file('resources')
forRepository { sourceSets.test.java.srcDirs += file('test')
maven { sourceSets.test.resources.srcDirs += file('test-resources')
url = uri("https://jitpack.io")
}
}
filter {
// Only use JitPack for the `gson-record-type-adapter-factory` library
includeModule("com.github.Marcono1234", "gson-record-type-adapter-factory")
}
} }
it.tasks.withType(JavaCompile).configureEach {
options.compilerArgs += ['--enable-preview']
}
it.tasks.withType(JavaExec).configureEach {
jvmArgs += ['--enable-preview']
}
it.tasks.withType(Test).configureEach {
jvmArgs += ['--enable-preview']
}
// Enable reproducible builds for the entire project
it.tasks.withType(AbstractArchiveTask).configureEach {
preserveFileTimestamps = false
reproducibleFileOrder = true
}
} }
shadowJar { ext {
zip64 true jvmVersion=23
} dockerImageBase='container-registry.oracle.com/graalvm/jdk:23'
jar { dockerImageTag='latest'
manifest { dockerImageRegistry='marginalia'
attributes 'Main-Class': "nu.marginalia.wmsa.configuration.ServiceDescriptor" jibVersion = '3.4.4'
}
from {
configurations.shadow.collect { it.isDirectory() ? it : zipTree(it) }
}
} }
idea {
module {
// Exclude these directories from being indexed by IntelliJ
// as they tend to bring the IDE to its knees and use up all
// Inotify spots in a hurry
excludeDirs.add(file("$projectDir/run/node-1"))
excludeDirs.add(file("$projectDir/run/node-2"))
excludeDirs.add(file("$projectDir/run/model"))
excludeDirs.add(file("$projectDir/run/dist"))
excludeDirs.add(file("$projectDir/run/db"))
excludeDirs.add(file("$projectDir/run/logs"))
excludeDirs.add(file("$projectDir/run/data"))
excludeDirs.add(file("$projectDir/run/conf"))
excludeDirs.add(file("$projectDir/run/test-data"))
}
}
java { java {
toolchain { toolchain {
languageVersion.set(JavaLanguageVersion.of(17)) languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
}
}
dependencies {
implementation project(':marginalia_nu')
}
task version() { //
}
test {
maxParallelForks = 16
forkEvery = 1
maxHeapSize = "8G"
useJUnitPlatform {
excludeTags "nobuild"
} }
} }

View File

@@ -0,0 +1,41 @@
plugins {
id 'java'
id 'jvm-test-suite'
}
java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
}
}
apply from: "$rootProject.projectDir/srcsets.gradle"
dependencies {
implementation project(':code:common:db')
implementation project(':code:common:model')
implementation libs.bundles.slf4j
implementation libs.bundles.mariadb
implementation libs.mockito
implementation libs.guava
implementation dependencies.create(libs.guice.get()) {
exclude group: 'com.google.guava'
}
implementation libs.gson
testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation project(':code:libraries:test-helpers')
testImplementation platform('org.testcontainers:testcontainers-bom:1.17.4')
testImplementation libs.commons.codec
testImplementation 'org.testcontainers:mariadb:1.17.4'
testImplementation 'org.testcontainers:junit-jupiter:1.17.4'
}

View File

@@ -0,0 +1,67 @@
package nu.marginalia;
import nu.marginalia.storage.FileStorageService;
import nu.marginalia.storage.model.FileStorageBaseType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.SQLException;
/** The IndexLocations class is responsible for knowledge about the locations
* of various important system paths. The methods take a FileStorageService,
* as these paths are node-dependent.
*/
public class IndexLocations {
private static final Logger logger = LoggerFactory.getLogger(IndexLocations.class);
/** Return the path to the current link database */
public static Path getLinkdbLivePath(FileStorageService fileStorage) {
return getStorage(fileStorage, FileStorageBaseType.CURRENT, "ldbr");
}
/** Return the path to the next link database */
public static Path getLinkdbWritePath(FileStorageService fileStorage) {
return getStorage(fileStorage, FileStorageBaseType.CURRENT, "ldbw");
}
/** Return the path to the current live index */
public static Path getCurrentIndex(FileStorageService fileStorage) {
return getStorage(fileStorage, FileStorageBaseType.CURRENT, "ir");
}
/** Return the path to the designated index construction area */
public static Path getIndexConstructionArea(FileStorageService fileStorage) {
return getStorage(fileStorage, FileStorageBaseType.CURRENT, "iw");
}
/** Return the path to the search sets */
public static Path getSearchSetsPath(FileStorageService fileStorage) {
return getStorage(fileStorage, FileStorageBaseType.CURRENT, "ss");
}
private static Path getStorage(FileStorageService service, FileStorageBaseType baseType, String pathPart) {
try {
var base = service.getStorageBase(baseType);
if (base == null) {
throw new IllegalStateException("File storage base " + baseType + " is not configured!");
}
// Ensure the directory exists
Path ret = base.asPath().resolve(pathPart);
if (!Files.exists(ret)) {
logger.info("Creating system directory {}", ret);
Files.createDirectories(ret);
}
return ret;
}
catch (SQLException | IOException ex) {
throw new IllegalStateException("Error fetching storage " + baseType + " / " + pathPart, ex);
}
}
}

View File

@@ -0,0 +1,27 @@
package nu.marginalia;
import java.nio.file.Path;
public class LanguageModels {
public final Path termFrequencies;
public final Path openNLPSentenceDetectionData;
public final Path posRules;
public final Path posDict;
public final Path fasttextLanguageModel;
public final Path segments;
public LanguageModels(Path termFrequencies,
Path openNLPSentenceDetectionData,
Path posRules,
Path posDict,
Path fasttextLanguageModel,
Path segments) {
this.termFrequencies = termFrequencies;
this.openNLPSentenceDetectionData = openNLPSentenceDetectionData;
this.posRules = posRules;
this.posDict = posDict;
this.fasttextLanguageModel = fasttextLanguageModel;
this.segments = segments;
}
}

View File

@@ -0,0 +1,3 @@
package nu.marginalia;
public record UserAgent(String uaString, String uaIdentifier) {}

View File

@@ -0,0 +1,7 @@
package nu.marginalia;
public record WebsiteUrl(String url) {
public String withPath(String path) {
return url + path;
}
}

View File

@@ -0,0 +1,117 @@
package nu.marginalia;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;
import java.util.Optional;
import java.util.stream.Stream;
public class WmsaHome {
public static UserAgent getUserAgent() {
return new UserAgent(
System.getProperty("crawler.userAgentString", "Mozilla/5.0 (compatible; Marginalia-like bot; +https://git.marginalia.nu/)"),
System.getProperty("crawler.userAgentIdentifier", "search.marginalia.nu")
);
}
public static Path getUploadDir() {
return Path.of(
System.getProperty("executor.uploadDir", "/uploads")
);
}
public static Path getHomePath() {
String[] possibleLocations = new String[] {
System.getenv("WMSA_HOME"),
System.getProperty("system.homePath"),
"/var/lib/wmsa",
"/wmsa"
};
Optional<String> retStr = Stream.of(possibleLocations)
.filter(Objects::nonNull)
.map(Path::of)
.filter(Files::isDirectory)
.map(Path::toString)
.findFirst();
if (retStr.isEmpty()) {
// Check parent directories for a fingerprint of the project's installation boilerplate
var prodRoot = Stream.iterate(Paths.get("").toAbsolutePath(), f -> f != null && Files.exists(f), Path::getParent)
.filter(p -> Files.exists(p.resolve("conf/properties/system.properties")))
.filter(p -> Files.exists(p.resolve("model/tfreq-new-algo3.bin")))
.findAny();
if (prodRoot.isPresent()) {
return prodRoot.get();
}
// Check if we are running in a test environment by looking for fingerprints
// matching the base of the source tree for the project, then looking up the
// run directory which contains a template for the installation we can use as
// though it's the project root for testing purposes
var testRoot = Stream.iterate(Paths.get("").toAbsolutePath(), f -> f != null && Files.exists(f), Path::getParent)
.filter(p -> Files.exists(p.resolve("run/env")))
.filter(p -> Files.exists(p.resolve("run/setup.sh")))
.map(p -> p.resolve("run"))
.findAny();
return testRoot.orElseThrow(() -> new IllegalStateException("""
Could not find $WMSA_HOME, either set environment
variable, the 'system.homePath' java property,
or ensure either /wmsa or /var/lib/wmsa exists
"""));
}
var ret = Path.of(retStr.get());
if (!Files.isDirectory(ret.resolve("model"))) {
throw new IllegalStateException("You need to run 'run/setup.sh' to download models to run/ before this will work!");
}
return ret;
}
public static Path getDataPath() {
return getHomePath().resolve("data");
}
public static Path getAdsDefinition() {
return getHomePath().resolve("data").resolve("adblock.txt");
}
public static Path getIPLocationDatabse() {
return getHomePath().resolve("data").resolve("IP2LOCATION-LITE-DB1.CSV");
}
public static Path getAsnMappingDatabase() {
return getHomePath().resolve("data").resolve("asn-data-raw-table");
}
public static Path getAsnInfoDatabase() {
return getHomePath().resolve("data").resolve("asn-used-autnums");
}
public static LanguageModels getLanguageModels() {
final Path home = getHomePath();
return new LanguageModels(
home.resolve("model/tfreq-new-algo3.bin"),
home.resolve("model/opennlp-sentence.bin"),
home.resolve("model/English.RDR"),
home.resolve("model/English.DICT"),
home.resolve("model/lid.176.ftz"),
home.resolve("model/segments.bin")
);
}
public static Path getAtagsPath() {
return getHomePath().resolve("data/atags.parquet");
}
}
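
The parent-directory search used in `getHomePath()` can be looked at in isolation. The following is a stand-alone sketch of the same `Stream.iterate` pattern, not code from the repository: `AncestorSearch` and `findAncestorContaining` are hypothetical names, and it uses `findFirst()` rather than `findAny()` to make the "nearest ancestor wins" intent explicit.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical stand-alone illustration of the parent-directory search.
class AncestorSearch {

    /** Walk from `start` toward the filesystem root and return the nearest
     *  directory that contains `marker` (a relative path), if any. */
    static Optional<Path> findAncestorContaining(Path start, String marker) {
        // Path::getParent returns null at the root, which ends the iteration
        return Stream.iterate(start.toAbsolutePath(), p -> p != null, Path::getParent)
                .filter(p -> Files.exists(p.resolve(marker)))
                .findFirst();
    }
}
```

Called as `AncestorSearch.findAncestorContaining(Path.of(""), "run/setup.sh")`, this would mirror the test-environment lookup above, minus the second fingerprint check.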

View File

@@ -0,0 +1,123 @@
package nu.marginalia.nodecfg;
import com.google.inject.Inject;
import com.zaxxer.hikari.HikariDataSource;
import nu.marginalia.nodecfg.model.NodeConfiguration;
import nu.marginalia.nodecfg.model.NodeProfile;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
public class NodeConfigurationService {
private final Logger logger = LoggerFactory.getLogger(NodeConfigurationService.class);
private final HikariDataSource dataSource;
@Inject
public NodeConfigurationService(HikariDataSource dataSource) {
this.dataSource = dataSource;
}
public NodeConfiguration create(int id, String description, boolean acceptQueries, boolean keepWarcs, NodeProfile nodeProfile) throws SQLException {
try (var conn = dataSource.getConnection();
var is = conn.prepareStatement("""
INSERT IGNORE INTO NODE_CONFIGURATION(ID, DESCRIPTION, ACCEPT_QUERIES, KEEP_WARCS, NODE_PROFILE) VALUES(?, ?, ?, ?, ?)
""")
)
{
is.setInt(1, id);
is.setString(2, description);
is.setBoolean(3, acceptQueries);
is.setBoolean(4, keepWarcs);
is.setString(5, nodeProfile.name());
if (is.executeUpdate() <= 0) {
throw new IllegalStateException("Failed to insert configuration");
}
return get(id);
}
}
public List<NodeConfiguration> getAll() {
try (var conn = dataSource.getConnection();
var qs = conn.prepareStatement("""
SELECT ID, DESCRIPTION, ACCEPT_QUERIES, AUTO_CLEAN, PRECESSION, KEEP_WARCS, NODE_PROFILE, DISABLED
FROM NODE_CONFIGURATION
""")) {
var rs = qs.executeQuery();
List<NodeConfiguration> ret = new ArrayList<>();
while (rs.next()) {
ret.add(new NodeConfiguration(
rs.getInt("ID"),
rs.getString("DESCRIPTION"),
rs.getBoolean("ACCEPT_QUERIES"),
rs.getBoolean("AUTO_CLEAN"),
rs.getBoolean("PRECESSION"),
rs.getBoolean("KEEP_WARCS"),
NodeProfile.valueOf(rs.getString("NODE_PROFILE")),
rs.getBoolean("DISABLED")
));
}
return ret;
}
catch (SQLException ex) {
logger.warn("Failed to get node configurations", ex);
return List.of();
}
}
public NodeConfiguration get(int nodeId) throws SQLException {
try (var conn = dataSource.getConnection();
var qs = conn.prepareStatement("""
SELECT ID, DESCRIPTION, ACCEPT_QUERIES, AUTO_CLEAN, PRECESSION, KEEP_WARCS, NODE_PROFILE, DISABLED
FROM NODE_CONFIGURATION
WHERE ID=?
""")) {
qs.setInt(1, nodeId);
var rs = qs.executeQuery();
if (rs.next()) {
return new NodeConfiguration(
rs.getInt("ID"),
rs.getString("DESCRIPTION"),
rs.getBoolean("ACCEPT_QUERIES"),
rs.getBoolean("AUTO_CLEAN"),
rs.getBoolean("PRECESSION"),
rs.getBoolean("KEEP_WARCS"),
NodeProfile.valueOf(rs.getString("NODE_PROFILE")),
rs.getBoolean("DISABLED")
);
}
}
return null;
}
public void save(NodeConfiguration config) throws SQLException {
try (var conn = dataSource.getConnection();
var us = conn.prepareStatement("""
UPDATE NODE_CONFIGURATION
SET DESCRIPTION=?, ACCEPT_QUERIES=?, AUTO_CLEAN=?, PRECESSION=?, KEEP_WARCS=?, DISABLED=?, NODE_PROFILE=?
WHERE ID=?
"""))
{
us.setString(1, config.description());
us.setBoolean(2, config.acceptQueries());
us.setBoolean(3, config.autoClean());
us.setBoolean(4, config.includeInPrecession());
us.setBoolean(5, config.keepWarcs());
us.setBoolean(6, config.disabled());
us.setString(7, config.profile().name());
us.setInt(8, config.node());
if (us.executeUpdate() <= 0)
throw new IllegalStateException("Failed to update configuration");
}
}
}

View File

@@ -0,0 +1,16 @@
package nu.marginalia.nodecfg.model;
public record NodeConfiguration(int node,
String description,
boolean acceptQueries,
boolean autoClean,
boolean includeInPrecession,
boolean keepWarcs,
NodeProfile profile,
boolean disabled
)
{
public int getId() {
return node;
}
}

View File

@@ -0,0 +1,28 @@
package nu.marginalia.nodecfg.model;
public enum NodeProfile {
BATCH_CRAWL,
REALTIME,
MIXED,
SIDELOAD;
public boolean isBatchCrawl() {
return this == BATCH_CRAWL;
}
public boolean isRealtime() {
return this == REALTIME;
}
public boolean isMixed() {
return this == MIXED;
}
public boolean isSideload() {
return this == SIDELOAD;
}
public boolean permitBatchCrawl() {
return isBatchCrawl() || isMixed();
}
public boolean permitSideload() {
return isMixed() || isSideload();
}
}

View File

@@ -0,0 +1,51 @@
package nu.marginalia.storage;
import com.google.gson.Gson;
import nu.marginalia.model.gson.GsonFactory;
import nu.marginalia.storage.model.FileStorage;
import nu.marginalia.storage.model.FileStorageType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Optional;
record FileStorageManifest(FileStorageType type, String description) {
private static final Gson gson = GsonFactory.get();
private static final String fileName = "marginalia-manifest.json";
private static final Logger logger = LoggerFactory.getLogger(FileStorageManifest.class);
public static Optional<FileStorageManifest> find(Path directory) {
Path expectedFileName = directory.resolve(fileName);
if (!Files.isRegularFile(expectedFileName) ||
!Files.isReadable(expectedFileName)) {
return Optional.empty();
}
try (var reader = Files.newBufferedReader(expectedFileName)) {
return Optional.of(gson.fromJson(reader, FileStorageManifest.class));
}
catch (Exception e) {
logger.warn("Failed to read manifest " + expectedFileName, e);
return Optional.empty();
}
}
public void write(FileStorage dir) {
Path expectedFileName = dir.asPath().resolve(fileName);
try (var writer = Files.newBufferedWriter(expectedFileName,
StandardOpenOption.CREATE,
StandardOpenOption.TRUNCATE_EXISTING))
{
gson.toJson(this, writer);
}
catch (Exception e) {
logger.warn("Failed to write manifest " + expectedFileName, e);
}
}
}

View File

@@ -0,0 +1,582 @@
package nu.marginalia.storage;
import com.google.inject.name.Named;
import com.zaxxer.hikari.HikariDataSource;
import nu.marginalia.storage.model.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import java.io.File;
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.PosixFilePermissions;
import java.sql.SQLException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;
/** Manages file storage for processes and services
*/
@Singleton
public class FileStorageService {
private final HikariDataSource dataSource;
private final int node;
private final Logger logger = LoggerFactory.getLogger(FileStorageService.class);
private static final DateTimeFormatter dirNameDatePattern = DateTimeFormatter.ofPattern("__uu-MM-dd'T'HH_mm_ss.SSS"); // filesystem safe ISO8601
@Inject
public FileStorageService(HikariDataSource dataSource,
@Named("wmsa-system-node") Integer node) {
this.dataSource = dataSource;
this.node = node;
logger.info("Resolving file storage root into {}", resolveStoragePath("/").toAbsolutePath());
}
/** Resolve a storage path from a relative path, injecting the system configured storage root
* if set */
public static Path resolveStoragePath(String path) {
if (path.startsWith("/")) {
// Since Path.of("ANYTHING").resolve("/foo") = "/foo", we need to strip
// the leading slash
return resolveStoragePath(path.substring(1));
}
return Path
.of(System.getProperty("storage.root", "/"))
.resolve(path);
}
/** @return the storage base with the given id, or null if it does not exist */
public FileStorageBase getStorageBase(FileStorageBaseId id) throws SQLException {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT ID, NAME, NODE, PATH, TYPE
FROM FILE_STORAGE_BASE WHERE ID = ?
""")) {
stmt.setLong(1, id.id());
try (var rs = stmt.executeQuery()) {
if (rs.next()) {
return new FileStorageBase(
new FileStorageBaseId(rs.getLong("ID")),
FileStorageBaseType.valueOf(rs.getString("TYPE")),
rs.getInt("NODE"),
rs.getString("NAME"),
rs.getString("PATH")
);
}
}
}
return null;
}
public void synchronizeStorageManifests(FileStorageBase base) {
Set<String> ignoredPaths = new HashSet<>();
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT FILE_STORAGE.PATH
FROM FILE_STORAGE INNER JOIN FILE_STORAGE_BASE
ON BASE_ID = FILE_STORAGE_BASE.ID
WHERE BASE_ID = ?
AND NODE = ?
""")) {
stmt.setLong(1, base.id().id());
stmt.setInt(2, node);
var rs = stmt.executeQuery();
while (rs.next()) {
ignoredPaths.add(rs.getString(1));
}
} catch (SQLException e) {
throw new RuntimeException(e);
}
File basePathFile = base.asPath().toFile();
File[] files = basePathFile.listFiles(pathname -> pathname.isDirectory() && !ignoredPaths.contains(pathname.getName()));
if (files == null) return;
for (File file : files) {
var maybeManifest = FileStorageManifest.find(file.toPath());
if (maybeManifest.isEmpty()) continue;
var manifest = maybeManifest.get();
logger.info("Discovered new file storage: " + file.getName() + " (" + manifest.type() + ")");
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
INSERT INTO FILE_STORAGE(BASE_ID, PATH, TYPE, DESCRIPTION)
VALUES (?, ?, ?, ?)
""")) {
stmt.setLong(1, base.id().id());
stmt.setString(2, file.getName());
stmt.setString(3, manifest.type().name());
stmt.setString(4, manifest.description());
stmt.execute();
conn.commit();
} catch (SQLException e) {
throw new RuntimeException(e);
}
}
}
public void relateFileStorages(FileStorageId source, FileStorageId target) {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
INSERT INTO FILE_STORAGE_RELATION(SOURCE_ID, TARGET_ID) VALUES (?, ?)
""")) {
stmt.setLong(1, source.id());
stmt.setLong(2, target.id());
stmt.executeUpdate();
} catch (SQLException e) {
throw new RuntimeException(e);
}
}
public List<FileStorage> getSourceFromStorage(FileStorage storage) throws SQLException {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT SOURCE_ID FROM FILE_STORAGE_RELATION WHERE TARGET_ID = ?
""")) {
stmt.setLong(1, storage.id().id());
var rs = stmt.executeQuery();
List<FileStorage> ret = new ArrayList<>();
while (rs.next()) {
ret.add(getStorage(new FileStorageId(rs.getLong(1))));
}
return ret;
}
}
/** @return the storage base with the given type, or null if it does not exist */
public FileStorageBase getStorageBase(FileStorageBaseType type) throws SQLException {
return getStorageBase(type, node);
}
public FileStorageBase getStorageBase(FileStorageBaseType type, int node) throws SQLException {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT ID, NAME, NODE, PATH, TYPE
FROM FILE_STORAGE_BASE WHERE TYPE = ? AND NODE = ?
""")) {
stmt.setString(1, type.name());
stmt.setInt(2, node);
try (var rs = stmt.executeQuery()) {
if (rs.next()) {
return new FileStorageBase(
new FileStorageBaseId(rs.getLong("ID")),
FileStorageBaseType.valueOf(rs.getString("TYPE")),
rs.getInt("NODE"),
rs.getString("NAME"),
rs.getString("PATH")
);
}
}
}
return null;
}
public FileStorageBase createStorageBase(String name, Path path, FileStorageBaseType type) throws SQLException {
return createStorageBase(name, path, node, type);
}
public FileStorageBase createStorageBase(String name, Path path, int node, FileStorageBaseType type) throws SQLException {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
INSERT INTO FILE_STORAGE_BASE(NAME, PATH, TYPE, NODE)
VALUES (?, ?, ?, ?)
""")) {
stmt.setString(1, name);
stmt.setString(2, path.toString());
stmt.setString(3, type.name());
stmt.setInt(4, node);
int update = stmt.executeUpdate();
if (update < 1) {
throw new SQLException("Failed to create storage base");
}
}
return getStorageBase(type, node);
}
private Path allocateDirectory(Path basePath, String prefix) throws IOException {
LocalDateTime now = LocalDateTime.now();
String timestampPart = now.format(dirNameDatePattern);
Path maybePath = basePath.resolve(prefix + timestampPart);
try {
Files.createDirectory(maybePath,
PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rwxr-xr-x"))
);
}
catch (FileAlreadyExistsException ex) {
// in case of a race condition, try again with some random cruft at the end
maybePath = basePath.resolve(prefix + timestampPart + "_" + Long.toHexString(ThreadLocalRandom.current().nextLong()));
Files.createDirectory(maybePath,
PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString("rwxr-xr-x"))
);
}
// Ensure umask didn't mess with the access permissions
Files.setPosixFilePermissions(maybePath, PosixFilePermissions.fromString("rwxr-xr-x"));
return maybePath;
}
/** Allocate a storage area of the given type */
public FileStorage allocateStorage(FileStorageType type,
String prefix,
String description) throws IOException, SQLException
{
var base = getStorageBase(FileStorageBaseType.forFileStorageType(type));
if (null == base)
throw new IllegalStateException("No storage base for type " + type + " on node " + node);
Path newDir = allocateDirectory(base.asPath(), prefix);
String relDir = base.asPath().relativize(newDir).normalize().toString();
try (var conn = dataSource.getConnection();
var insert = conn.prepareStatement("""
INSERT INTO FILE_STORAGE(PATH, TYPE, DESCRIPTION, BASE_ID)
VALUES (?, ?, ?, ?)
""");
var query = conn.prepareStatement("""
SELECT ID FROM FILE_STORAGE WHERE PATH = ? AND BASE_ID = ?
""")
) {
insert.setString(1, relDir);
insert.setString(2, type.name());
insert.setString(3, description);
insert.setLong(4, base.id().id());
if (insert.executeUpdate() < 1) {
throw new SQLException("Failed to insert storage");
}
query.setString(1, relDir);
query.setLong(2, base.id().id());
var rs = query.executeQuery();
if (rs.next()) {
var storage = getStorage(new FileStorageId(rs.getLong("ID")));
// Write a manifest file so we can pick this up later without needing to insert it into DB
// (e.g. when loading from outside the system)
var manifest = new FileStorageManifest(type, description);
manifest.write(storage);
return storage;
}
}
throw new SQLException("Failed to insert storage");
}
public FileStorage getStorageByType(FileStorageType type) throws SQLException {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT PATH, STATE, DESCRIPTION, ID, BASE_ID, CREATE_DATE
FROM FILE_STORAGE_VIEW WHERE TYPE = ? AND NODE = ?
""")) {
stmt.setString(1, type.name());
stmt.setInt(2, node);
long storageId;
long baseId;
String path;
String state;
String description;
LocalDateTime createDateTime;
try (var rs = stmt.executeQuery()) {
if (rs.next()) {
baseId = rs.getLong("BASE_ID");
storageId = rs.getLong("ID");
createDateTime = rs.getTimestamp("CREATE_DATE").toLocalDateTime();
path = rs.getString("PATH");
state = rs.getString("STATE");
description = rs.getString("DESCRIPTION");
}
else {
return null;
}
var base = getStorageBase(new FileStorageBaseId(baseId));
return new FileStorage(
new FileStorageId(storageId),
base,
type,
createDateTime,
path,
FileStorageState.parse(state),
description
);
}
}
}
public List<FileStorage> getStorage(List<FileStorageId> ids) throws SQLException {
List<FileStorage> ret = new ArrayList<>();
for (var id : ids) {
var storage = getStorage(id);
if (storage == null) continue;
ret.add(storage);
}
return ret;
}
/** @return the storage with the given id, or null if it does not exist */
public FileStorage getStorage(FileStorageId id) throws SQLException {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT PATH, TYPE, STATE, DESCRIPTION, CREATE_DATE, ID, BASE_ID
FROM FILE_STORAGE_VIEW WHERE ID = ?
""")) {
stmt.setLong(1, id.id());
long storageId;
long baseId;
String path;
String state;
String description;
FileStorageType type;
LocalDateTime createDateTime;
try (var rs = stmt.executeQuery()) {
if (rs.next()) {
baseId = rs.getLong("BASE_ID");
storageId = rs.getLong("ID");
type = FileStorageType.valueOf(rs.getString("TYPE"));
path = rs.getString("PATH");
state = rs.getString("STATE");
description = rs.getString("DESCRIPTION");
createDateTime = rs.getTimestamp("CREATE_DATE").toLocalDateTime();
}
else {
return null;
}
var base = getStorageBase(new FileStorageBaseId(baseId));
return new FileStorage(
new FileStorageId(storageId),
base,
type,
createDateTime,
path,
FileStorageState.parse(state),
description
);
}
}
}
public void deregisterFileStorage(FileStorageId id) throws SQLException {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
DELETE FROM FILE_STORAGE WHERE ID = ?
""")) {
stmt.setLong(1, id.id());
stmt.executeUpdate();
}
}
public List<FileStorage> getEachFileStorage() {
List<FileStorage> ret = new ArrayList<>();
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT PATH, STATE, TYPE, DESCRIPTION, CREATE_DATE, ID, BASE_ID
FROM FILE_STORAGE_VIEW
WHERE NODE=?
""")) {
stmt.setInt(1, node);
long storageId;
long baseId;
String path;
String state;
String description;
LocalDateTime createDateTime;
FileStorageType type;
try (var rs = stmt.executeQuery()) {
while (rs.next()) {
baseId = rs.getLong("BASE_ID");
storageId = rs.getLong("ID");
path = rs.getString("PATH");
state = rs.getString("STATE");
try {
type = FileStorageType.valueOf(rs.getString("TYPE"));
}
catch (IllegalArgumentException ex) {
logger.warn("Illegal file storage type {} in db", rs.getString("TYPE"));
continue;
}
description = rs.getString("DESCRIPTION");
createDateTime = rs.getTimestamp("CREATE_DATE").toLocalDateTime();
var base = getStorageBase(new FileStorageBaseId(baseId));
ret.add(new FileStorage(
new FileStorageId(storageId),
base,
type,
createDateTime,
path,
FileStorageState.parse(state),
description
));
}
}
} catch (SQLException e) {
logger.error("Failed to list file storages", e);
}
return ret;
}
public List<FileStorage> getEachFileStorage(FileStorageType type) {
return getEachFileStorage(node, type);
}
public List<FileStorage> getEachFileStorage(int node, FileStorageType type) {
List<FileStorage> ret = new ArrayList<>();
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT PATH, STATE, TYPE, DESCRIPTION, CREATE_DATE, ID, BASE_ID
FROM FILE_STORAGE_VIEW
WHERE NODE=? AND TYPE=?
""")) {
stmt.setInt(1, node);
stmt.setString(2, type.name());
long storageId;
long baseId;
String path;
String state;
String description;
LocalDateTime createDateTime;
try (var rs = stmt.executeQuery()) {
while (rs.next()) {
baseId = rs.getLong("BASE_ID");
storageId = rs.getLong("ID");
path = rs.getString("PATH");
state = rs.getString("STATE");
description = rs.getString("DESCRIPTION");
createDateTime = rs.getTimestamp("CREATE_DATE").toLocalDateTime();
var base = getStorageBase(new FileStorageBaseId(baseId));
ret.add(new FileStorage(
new FileStorageId(storageId),
base,
type,
createDateTime,
path,
FileStorageState.parse(state),
description
));
}
}
} catch (SQLException e) {
logger.error("Failed to list file storages", e);
}
return ret;
}
public void flagFileForDeletion(FileStorageId id) throws SQLException {
setFileStorageState(id, FileStorageState.DELETE);
}
public void enableFileStorage(FileStorageId id) throws SQLException {
setFileStorageState(id, FileStorageState.ACTIVE);
}
public void disableFileStorage(FileStorageId id) throws SQLException {
setFileStorageState(id, FileStorageState.UNSET);
}
public void setFileStorageState(FileStorageId id, FileStorageState state) throws SQLException {
try (var conn = dataSource.getConnection();
var flagStmt = conn.prepareStatement("UPDATE FILE_STORAGE SET STATE = ? WHERE ID = ?")) {
String value = state == FileStorageState.UNSET ? "" : state.name();
flagStmt.setString(1, value);
flagStmt.setLong(2, id.id());
flagStmt.executeUpdate();
}
}
public void disableFileStorageOfType(int nodeId, FileStorageType type) throws SQLException {
try (var conn = dataSource.getConnection();
var flagStmt = conn.prepareStatement("""
UPDATE FILE_STORAGE
INNER JOIN FILE_STORAGE_BASE ON BASE_ID=FILE_STORAGE_BASE.ID
SET FILE_STORAGE.STATE = ''
WHERE FILE_STORAGE.TYPE = ?
AND FILE_STORAGE.STATE = 'ACTIVE'
AND FILE_STORAGE_BASE.NODE=?
""")) {
flagStmt.setString(1, type.name());
flagStmt.setInt(2, nodeId);
flagStmt.executeUpdate();
}
}
public List<FileStorageId> getActiveFileStorages(FileStorageType type) throws SQLException {
return getActiveFileStorages(node, type);
}
public Optional<FileStorageId> getOnlyActiveFileStorage(FileStorageType type) throws SQLException {
return getOnlyActiveFileStorage(node, type);
}
public Optional<FileStorageId> getOnlyActiveFileStorage(int nodeId, FileStorageType type) throws SQLException {
var storages = getActiveFileStorages(nodeId, type);
if (storages.size() > 1) {
throw new IllegalStateException("Expected [0,1] instances of FileStorage with type " + type + ", found " + storages.size());
}
return storages.stream().findFirst();
}
public List<FileStorageId> getActiveFileStorages(int nodeId, FileStorageType type) throws SQLException
{
try (var conn = dataSource.getConnection();
var queryStmt = conn.prepareStatement("""
SELECT FILE_STORAGE.ID FROM FILE_STORAGE
INNER JOIN FILE_STORAGE_BASE ON BASE_ID=FILE_STORAGE_BASE.ID
WHERE FILE_STORAGE.TYPE = ?
AND STATE='ACTIVE'
AND FILE_STORAGE_BASE.NODE=?
""")) {
queryStmt.setString(1, type.name());
queryStmt.setInt(2, nodeId);
var rs = queryStmt.executeQuery();
List<FileStorageId> ids = new ArrayList<>();
while (rs.next()) {
ids.add(new FileStorageId(rs.getLong(1)));
}
return ids;
}
}
}
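The collision handling in `allocateDirectory` above can be illustrated in isolation. The following is a minimal sketch, not the project's implementation: the class `DirectoryAllocationSketch` and its `allocate` method are hypothetical names, and the POSIX permission handling is left out so the example also runs on non-POSIX file systems. It relies only on `Files.createDirectory` failing with `FileAlreadyExistsException` when the target exists.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; it mirrors only the retry-on-collision idea.
class DirectoryAllocationSketch {

    /** Create basePath/name, falling back to a randomly suffixed name if it already exists. */
    static Path allocate(Path basePath, String name) throws IOException {
        Path candidate = basePath.resolve(name);
        try {
            // createDirectory fails rather than reusing an existing directory
            return Files.createDirectory(candidate);
        } catch (FileAlreadyExistsException ex) {
            // Another caller got there first: retry once with random hex appended
            String suffix = Long.toHexString(ThreadLocalRandom.current().nextLong());
            return Files.createDirectory(basePath.resolve(name + "_" + suffix));
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("alloc-sketch");
        Path first = allocate(base, "crawl-data");
        Path second = allocate(base, "crawl-data"); // same name, forces the fallback
        System.out.println(first.getFileName());
        System.out.println(second.getFileName());
    }
}
```

As in the original method, the fallback is attempted only once, so a second collision would still surface as an exception to the caller.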

@@ -0,0 +1,80 @@
package nu.marginalia.storage.model;
import nu.marginalia.storage.FileStorageService;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Objects;
/**
* Represents a file storage area
*
* @param id the id of the storage in the database
* @param base the base of the storage
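* @param type the type of data expected
* @param createDateTime the time at which the storage was created
* @param path the full path of the storage on disk
* @param state the state of the storage
* @param description a description of the storage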
*/
public record FileStorage (
FileStorageId id,
FileStorageBase base,
FileStorageType type,
LocalDateTime createDateTime,
String path,
FileStorageState state,
String description)
{
public int node() {
return base.node();
}
public Path asPath() {
return FileStorageService.resolveStoragePath(path);
}
public boolean isActive() {
return FileStorageState.ACTIVE.equals(state);
}
public boolean isNoState() {
return FileStorageState.UNSET.equals(state);
}
public boolean isDelete() {
return FileStorageState.DELETE.equals(state);
}
public boolean isNew() {
return FileStorageState.NEW.equals(state);
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
FileStorage that = (FileStorage) o;
// Exclude the timestamp, as it may differ depending on how the objects
// are constructed
if (!Objects.equals(id, that.id)) return false;
if (!Objects.equals(base, that.base)) return false;
if (type != that.type) return false;
if (!Objects.equals(path, that.path)) return false;
return Objects.equals(description, that.description);
}
@Override
public int hashCode() {
int result = id != null ? id.hashCode() : 0;
result = 31 * result + (base != null ? base.hashCode() : 0);
result = 31 * result + (type != null ? type.hashCode() : 0);
result = 31 * result + (path != null ? path.hashCode() : 0);
result = 31 * result + (description != null ? description.hashCode() : 0);
return result;
}
public String date() {
return createDateTime.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
}
}

@@ -0,0 +1,30 @@
package nu.marginalia.storage.model;
import nu.marginalia.storage.FileStorageService;
import java.nio.file.Path;
/**
* Represents a file storage base directory
*
* @param id the id of the storage base in the database
* @param type the type of the storage base
* @param node the node the storage base belongs to
* @param name the name of the storage base
* @param path the path of the storage base
*/
public record FileStorageBase(FileStorageBaseId id,
FileStorageBaseType type,
int node,
String name,
String path
) {
public Path asPath() {
return FileStorageService.resolveStoragePath(path);
}
public boolean isValid() {
return id.id() >= 0;
}
}

@@ -0,0 +1,8 @@
package nu.marginalia.storage.model;
public record FileStorageBaseId(long id) {
public String toString() {
return Long.toString(id);
}
}

@@ -0,0 +1,17 @@
package nu.marginalia.storage.model;
public enum FileStorageBaseType {
CURRENT,
WORK,
STORAGE,
BACKUP;
public static FileStorageBaseType forFileStorageType(FileStorageType type) {
return switch (type) {
case EXPORT, CRAWL_DATA, PROCESSED_DATA, CRAWL_SPEC -> STORAGE;
case BACKUP -> BACKUP;
};
}
}

@@ -0,0 +1,14 @@
package nu.marginalia.storage.model;
public record FileStorageId(long id) {
public static FileStorageId parse(String str) {
return new FileStorageId(Long.parseLong(str));
}
public static FileStorageId of(long storageId) {
return new FileStorageId(storageId);
}
public String toString() {
return Long.toString(id);
}
}

@@ -0,0 +1,15 @@
package nu.marginalia.storage.model;
public enum FileStorageState {
UNSET,
NEW,
ACTIVE,
DELETE;
public static FileStorageState parse(String value) {
if ("".equals(value)) {
return UNSET;
}
return valueOf(value);
}
}
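The empty string stands for "no state" in the database, and `parse` above maps it back to `UNSET`; the inverse mapping is done inline in `setFileStorageState`. The sketch below puts both directions side by side. `StorageStateSketch`, its nested `State` enum and `toDbValue` are hypothetical stand-ins introduced so the example is self-contained; they are not names from the codebase.

```java
// Stand-in for the FileStorageState enum, copied here so the example compiles on its own.
class StorageStateSketch {
    enum State { UNSET, NEW, ACTIVE, DELETE }

    /** Database value to enum: the empty string means "no state". */
    static State parse(String value) {
        if ("".equals(value)) {
            return State.UNSET;
        }
        return State.valueOf(value);
    }

    /** Enum to database value: UNSET is stored as the empty string. */
    static String toDbValue(State state) {
        return state == State.UNSET ? "" : state.name();
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            // Every state should survive a round trip through its database form
            System.out.println(s + " -> '" + toDbValue(s) + "' -> " + parse(toDbValue(s)));
        }
    }
}
```

Note that a `null` value is not handled by this mapping: `valueOf` throws a `NullPointerException` for `null` and an `IllegalArgumentException` for an unknown name, so callers reading a nullable column would need to guard against that themselves.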

@@ -0,0 +1,11 @@
package nu.marginalia.storage.model;
public enum FileStorageType {
@Deprecated
CRAWL_SPEC,
CRAWL_DATA,
PROCESSED_DATA,
BACKUP,
EXPORT;
}

@@ -0,0 +1,3 @@
# Config
This package contains configuration injectables used by the services.

@@ -0,0 +1,67 @@
package nu.marginalia.nodecfg;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import nu.marginalia.nodecfg.model.NodeProfile;
import nu.marginalia.test.TestMigrationLoader;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.parallel.Execution;
import org.junit.jupiter.api.parallel.ExecutionMode;
import org.testcontainers.containers.MariaDBContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import java.sql.SQLException;
import static org.junit.jupiter.api.Assertions.*;
@Testcontainers
@Execution(ExecutionMode.SAME_THREAD)
@Tag("slow")
public class NodeConfigurationServiceTest {
@Container
static MariaDBContainer<?> mariaDBContainer = new MariaDBContainer<>("mariadb")
.withDatabaseName("WMSA_prod")
.withUsername("wmsa")
.withPassword("wmsa")
.withNetworkAliases("mariadb");
static HikariDataSource dataSource;
static NodeConfigurationService nodeConfigurationService;
@BeforeAll
public static void setup() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(mariaDBContainer.getJdbcUrl());
config.setUsername("wmsa");
config.setPassword("wmsa");
dataSource = new HikariDataSource(config);
TestMigrationLoader.flywayMigration(dataSource);
nodeConfigurationService = new NodeConfigurationService(dataSource);
}
@Test
public void test() throws SQLException {
var a = nodeConfigurationService.create(1, "Test", false, false, NodeProfile.MIXED);
var b = nodeConfigurationService.create(2, "Foo", true, false, NodeProfile.MIXED);
assertEquals(1, a.node());
assertEquals("Test", a.description());
assertFalse(a.acceptQueries());
assertEquals(2, b.node());
assertEquals("Foo", b.description());
assertTrue(b.acceptQueries());
var list = nodeConfigurationService.getAll();
assertEquals(2, list.size());
assertEquals(a, list.get(0));
assertEquals(b, list.get(1));
}
}

@@ -0,0 +1,162 @@
package nu.marginalia.storage;
import com.google.common.collect.Lists;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import nu.marginalia.storage.model.FileStorage;
import nu.marginalia.storage.model.FileStorageBase;
import nu.marginalia.storage.model.FileStorageBaseType;
import nu.marginalia.storage.model.FileStorageType;
import nu.marginalia.test.TestMigrationLoader;
import org.junit.jupiter.api.*;
import org.junit.jupiter.api.parallel.Execution;
import org.junit.jupiter.api.parallel.ExecutionMode;
import org.testcontainers.containers.MariaDBContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
@Testcontainers
@Execution(ExecutionMode.SAME_THREAD)
@Tag("slow")
public class FileStorageServiceTest {
@Container
static MariaDBContainer<?> mariaDBContainer = new MariaDBContainer<>("mariadb")
.withDatabaseName("WMSA_prod")
.withUsername("wmsa")
.withPassword("wmsa")
.withNetworkAliases("mariadb");
static HikariDataSource dataSource;
static FileStorageService fileStorageService;
static List<Path> tempDirs = new ArrayList<>();
@BeforeAll
public static void setup() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(mariaDBContainer.getJdbcUrl());
config.setUsername("wmsa");
config.setPassword("wmsa");
dataSource = new HikariDataSource(config);
TestMigrationLoader.flywayMigration(dataSource);
}
@BeforeEach
public void setupEach() {
fileStorageService = new FileStorageService(dataSource, 0);
}
@AfterEach
public void tearDownEach() {
try (var conn = dataSource.getConnection();
var stmt = conn.createStatement()) {
stmt.execute("DELETE FROM FILE_STORAGE");
stmt.execute("DELETE FROM FILE_STORAGE_BASE");
} catch (SQLException e) {
throw new RuntimeException(e);
}
}
@AfterAll
public static void teardown() {
dataSource.close();
Lists.reverse(tempDirs).forEach(path -> {
try {
System.out.println("Deleting " + path);
Files.delete(path);
} catch (IOException e) {
e.printStackTrace();
}
});
}
private Path createTempDir() {
try {
Path dir = Files.createTempDirectory("file-storage-test");
tempDirs.add(dir);
return dir;
} catch (IOException e) {
throw new RuntimeException(e);
}
}
@Test
public void testPathOverride() {
try {
System.setProperty("storage.root", "/tmp");
var path = new FileStorageBase(null, null, 0, null, "test").asPath();
Assertions.assertEquals(Path.of("/tmp/test"), path);
}
finally {
System.clearProperty("storage.root");
}
}
@Test
public void testPathOverride3() {
try {
System.setProperty("storage.root", "/tmp");
var path = new FileStorageBase(null, null, 0, null, "/test").asPath();
Assertions.assertEquals(Path.of("/tmp/test"), path);
}
finally {
System.clearProperty("storage.root");
}
}
@Test
public void testPathOverride2() {
try {
System.setProperty("storage.root", "/tmp");
var path = new FileStorage(null, null, null, null, "test", null, null).asPath();
Assertions.assertEquals(Path.of("/tmp/test"), path);
}
finally {
System.clearProperty("storage.root");
}
}
@Test
public void testCreateBase() throws SQLException {
String name = "test-" + UUID.randomUUID();
var storage = new FileStorageService(dataSource, 0);
var base = storage.createStorageBase(name, createTempDir(), FileStorageBaseType.WORK);
Assertions.assertEquals(name, base.name());
Assertions.assertEquals(FileStorageBaseType.WORK, base.type());
}
@Test
public void testAllocateTemp() throws IOException, SQLException {
String name = "test-" + UUID.randomUUID();
// ensure a base exists
var base = fileStorageService.createStorageBase(name, createTempDir(), FileStorageBaseType.STORAGE);
tempDirs.add(base.asPath());
var storage = new FileStorageService(dataSource, 0);
var fileStorage = storage.allocateStorage(FileStorageType.CRAWL_DATA, "xyz", "thisShouldSucceed");
System.out.println("Allocated " + fileStorage.asPath());
Assertions.assertTrue(Files.exists(fileStorage.asPath()));
tempDirs.add(fileStorage.asPath());
}
}

@@ -0,0 +1,72 @@
buildscript {
repositories {
mavenCentral()
}
dependencies {
classpath 'org.flywaydb:flyway-mysql:10.0.1'
}
}
plugins {
id 'java'
id 'jvm-test-suite'
id "org.flywaydb.flyway" version "10.0.1"
}
java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
}
}
configurations {
flywayMigration.extendsFrom(implementation)
}
apply from: "$rootProject.projectDir/srcsets.gradle"
dependencies {
implementation project(':code:common:model')
implementation libs.bundles.slf4j
implementation libs.guava
implementation dependencies.create(libs.guice.get()) {
exclude group: 'com.google.guava'
}
implementation libs.bundles.gson
implementation libs.notnull
implementation libs.commons.lang3
implementation libs.trove
implementation libs.bundles.mariadb
flywayMigration 'org.flywaydb:flyway-mysql:10.0.1'
testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation libs.mockito
testImplementation platform('org.testcontainers:testcontainers-bom:1.17.4')
testImplementation libs.commons.codec
testImplementation 'org.testcontainers:mariadb:1.17.4'
testImplementation 'org.testcontainers:junit-jupiter:1.17.4'
testImplementation project(':code:libraries:test-helpers')
}
flyway {
url = 'jdbc:mariadb://localhost:3306/WMSA_prod'
user = 'wmsa'
password = 'wmsa'
schemas = ['WMSA_prod']
configurations = [ 'compileClasspath', 'flywayMigration' ]
locations = ['filesystem:src/main/resources/db/migration']
cleanDisabled = false
}

@@ -0,0 +1,179 @@
package nu.marginalia.db;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.util.concurrent.UncheckedExecutionException;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import com.zaxxer.hikari.HikariDataSource;
import nu.marginalia.model.EdgeDomain;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.SQLException;
import java.util.*;
import java.util.concurrent.ExecutionException;
@Singleton
public class DbDomainQueries {
private final HikariDataSource dataSource;
private static final Logger logger = LoggerFactory.getLogger(DbDomainQueries.class);
private final Cache<EdgeDomain, Integer> domainIdCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
private final Cache<EdgeDomain, DomainIdWithNode> domainWithNodeCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
private final Cache<Integer, EdgeDomain> domainNameCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
private final Cache<String, List<DomainWithNode>> siblingsCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
@Inject
public DbDomainQueries(HikariDataSource dataSource)
{
this.dataSource = dataSource;
}
public Integer getDomainId(EdgeDomain domain) throws NoSuchElementException {
try {
return domainIdCache.get(domain, () -> {
try (var connection = dataSource.getConnection();
var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
return rsp.getInt(1);
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
throw new NoSuchElementException();
});
}
catch (UncheckedExecutionException ex) {
throw new NoSuchElementException();
}
catch (ExecutionException ex) {
throw new RuntimeException(ex.getCause());
}
}
public DomainIdWithNode getDomainIdWithNode(EdgeDomain domain) throws NoSuchElementException {
try {
return domainWithNodeCache.get(domain, () -> {
try (var connection = dataSource.getConnection();
var stmt = connection.prepareStatement("SELECT ID, NODE_AFFINITY FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
return new DomainIdWithNode(rsp.getInt(1), rsp.getInt(2));
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
throw new NoSuchElementException();
});
}
catch (UncheckedExecutionException ex) {
throw new NoSuchElementException();
}
catch (ExecutionException ex) {
throw new RuntimeException(ex.getCause());
}
}
public OptionalInt tryGetDomainId(EdgeDomain domain) {
Integer maybeId = domainIdCache.getIfPresent(domain);
if (maybeId != null) {
return OptionalInt.of(maybeId);
}
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
var id = rsp.getInt(1);
domainIdCache.put(domain, id);
return OptionalInt.of(id);
}
}
return OptionalInt.empty();
}
catch (UncheckedExecutionException ex) {
throw new RuntimeException(ex.getCause());
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
}
public Optional<EdgeDomain> getDomain(int id) {
EdgeDomain existing = domainNameCache.getIfPresent(id);
if (existing != null) {
return Optional.of(existing);
}
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("SELECT DOMAIN_NAME FROM EC_DOMAIN WHERE ID=?")) {
stmt.setInt(1, id);
var rsp = stmt.executeQuery();
if (rsp.next()) {
var val = new EdgeDomain(rsp.getString(1));
domainNameCache.put(id, val);
return Optional.of(val);
}
return Optional.empty();
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
}
public List<DomainWithNode> otherSubdomains(EdgeDomain domain, int cnt) throws ExecutionException {
String topDomain = domain.topDomain;
return siblingsCache.get(topDomain, () -> {
List<DomainWithNode> ret = new ArrayList<>();
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("SELECT DOMAIN_NAME, NODE_AFFINITY FROM EC_DOMAIN WHERE DOMAIN_TOP = ? LIMIT ?")) {
stmt.setString(1, topDomain);
stmt.setInt(2, cnt);
var rs = stmt.executeQuery();
while (rs.next()) {
var sibling = new EdgeDomain(rs.getString(1));
if (sibling.equals(domain))
continue;
ret.add(new DomainWithNode(sibling, rs.getInt(2)));
}
} catch (SQLException e) {
logger.error("Failed to get domain neighbors", e);
}
return ret;
});
}
public record DomainWithNode (EdgeDomain domain, int nodeAffinity) {
public boolean isIndexed() {
return nodeAffinity > 0;
}
}
public record DomainIdWithNode (int domainId, int nodeAffinity) { }
}

@@ -0,0 +1,13 @@
package nu.marginalia.db;
import com.google.inject.ImplementedBy;
import gnu.trove.set.hash.TIntHashSet;
@ImplementedBy(DomainBlacklistImpl.class)
public interface DomainBlacklist {
boolean isBlacklisted(int domainId);
default TIntHashSet getSpamDomains() {
return new TIntHashSet();
}
void waitUntilLoaded() throws InterruptedException;
}

@@ -0,0 +1,126 @@
package nu.marginalia.db;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import com.zaxxer.hikari.HikariDataSource;
import gnu.trove.set.hash.TIntHashSet;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.SQLException;
import java.util.concurrent.TimeUnit;
@Singleton
public class DomainBlacklistImpl implements DomainBlacklist {
private final boolean blacklistDisabled = Boolean.getBoolean("blacklist.disable");
private final HikariDataSource dataSource;
private final Logger logger = LoggerFactory.getLogger(getClass());
private volatile TIntHashSet spamDomainSet = new TIntHashSet();
private volatile boolean isLoaded = false;
@Inject
public DomainBlacklistImpl(HikariDataSource dataSource) {
this.dataSource = dataSource;
Thread.ofPlatform().daemon().name("BlacklistUpdater").start(this::updateSpamList);
}
private void updateSpamList() {
// If the blacklist is disabled, we don't need to do anything
if (blacklistDisabled) {
flagLoaded();
return;
}
for (;;) {
spamDomainSet = getSpamDomains();
// Set the flag to true after the first loading attempt, regardless of success,
// to avoid deadlocking threads that are waiting for this condition
flagLoaded();
// Sleep for 10 minutes before trying again
try {
TimeUnit.MINUTES.sleep(10);
}
catch (InterruptedException ex) {
break;
}
}
}
private void flagLoaded() {
if (!isLoaded) {
synchronized (this) {
isLoaded = true;
notifyAll();
}
}
}
/** Block until the blacklist has been loaded */
@Override
public void waitUntilLoaded() throws InterruptedException {
if (blacklistDisabled)
return;
if (!isLoaded) {
logger.info("Waiting for blacklist to be loaded");
synchronized (this) {
while (!isLoaded) {
wait(5000);
}
}
logger.info("Blacklist loaded, size = {}", spamDomainSet.size());
}
}
public TIntHashSet getSpamDomains() {
final TIntHashSet result = new TIntHashSet(1_000_000);
if (blacklistDisabled) {
return result;
}
try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("""
SELECT EC_DOMAIN.ID
FROM EC_DOMAIN
INNER JOIN EC_DOMAIN_BLACKLIST
ON (EC_DOMAIN_BLACKLIST.URL_DOMAIN = EC_DOMAIN.DOMAIN_TOP
OR EC_DOMAIN_BLACKLIST.URL_DOMAIN = EC_DOMAIN.DOMAIN_NAME)
"""))
{
stmt.setFetchSize(1000);
var rsp = stmt.executeQuery();
while (rsp.next()) {
result.add(rsp.getInt(1));
}
}
} catch (SQLException ex) {
logger.error("Failed to load spam domain list", ex);
}
return result;
}
@Override
public boolean isBlacklisted(int domainId) {
return spamDomainSet.contains(domainId);
}
}
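The `flagLoaded` / `waitUntilLoaded` pair above is a one-shot latch built on the object monitor. A minimal sketch of that pattern, stripped of the database loading, might look as follows. `LoadLatchSketch` is a hypothetical class name, and the short sleep in `main` merely simulates a slow first load.

```java
// Hypothetical reduction of the wait/notify handshake; it performs no database access.
class LoadLatchSketch {
    private volatile boolean isLoaded = false;

    /** Called by the loader thread after its first load attempt. */
    void flagLoaded() {
        synchronized (this) {
            isLoaded = true;
            notifyAll(); // wake every thread blocked in waitUntilLoaded
        }
    }

    /** Block until flagLoaded has been called at least once. */
    void waitUntilLoaded() throws InterruptedException {
        if (isLoaded) {
            return; // fast path, no locking needed once the flag is set
        }
        synchronized (this) {
            // Loop guards against spurious wakeups and the wait timeout elapsing
            while (!isLoaded) {
                wait(5000);
            }
        }
    }

    boolean isLoaded() {
        return isLoaded;
    }

    public static void main(String[] args) throws InterruptedException {
        LoadLatchSketch latch = new LoadLatchSketch();
        Thread loader = new Thread(() -> {
            try {
                Thread.sleep(100); // simulated loading time
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
            latch.flagLoaded();
        });
        loader.start();
        latch.waitUntilLoaded();
        System.out.println("loaded = " + latch.isLoaded());
    }
}
```

Because the flag is re-checked inside the `synchronized` block, a waiter that arrives just as the loader finishes either sees the flag already set or is woken by `notifyAll`; it cannot miss the notification.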

@@ -0,0 +1,162 @@
package nu.marginalia.db;
import com.google.inject.Inject;
import com.zaxxer.hikari.HikariDataSource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.nio.file.Path;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
public class DomainRankingSetsService {
private static final Logger logger = LoggerFactory.getLogger(DomainRankingSetsService.class);
private final HikariDataSource dataSource;
@Inject
public DomainRankingSetsService(HikariDataSource dataSource) {
this.dataSource = dataSource;
}
public Optional<DomainRankingSet> get(String name) throws SQLException {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT NAME, DESCRIPTION, DEPTH, DEFINITION
FROM CONF_DOMAIN_RANKING_SET
WHERE NAME = ?
""")) {
stmt.setString(1, name);
var rs = stmt.executeQuery();
if (!rs.next()) {
return Optional.empty();
}
return Optional.of(new DomainRankingSet(
rs.getString("NAME"),
rs.getString("DESCRIPTION"),
rs.getInt("DEPTH"),
rs.getString("DEFINITION")
));
}
catch (SQLException ex) {
logger.error("Failed to get domain set", ex);
return Optional.empty();
}
}
public void upsert(DomainRankingSet domainRankingSet) {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
REPLACE INTO CONF_DOMAIN_RANKING_SET(NAME, DESCRIPTION, DEPTH, DEFINITION)
VALUES (?, ?, ?, ?)
"""))
{
stmt.setString(1, domainRankingSet.name());
stmt.setString(2, domainRankingSet.description());
stmt.setInt(3, domainRankingSet.depth());
stmt.setString(4, domainRankingSet.definition());
stmt.executeUpdate();
if (!conn.getAutoCommit())
conn.commit();
}
catch (SQLException ex) {
logger.error("Failed to update domain set", ex);
}
}
public void delete(DomainRankingSet domainRankingSet) {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
DELETE FROM CONF_DOMAIN_RANKING_SET
WHERE NAME = ?
"""))
{
stmt.setString(1, domainRankingSet.name());
stmt.executeUpdate();
if (!conn.getAutoCommit())
conn.commit();
}
catch (SQLException ex) {
logger.error("Failed to delete domain set", ex);
}
}
public List<DomainRankingSet> getAll() {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT NAME, DESCRIPTION, DEPTH, DEFINITION
FROM CONF_DOMAIN_RANKING_SET
""")) {
var rs = stmt.executeQuery();
List<DomainRankingSet> ret = new ArrayList<>();
while (rs.next()) {
ret.add(
new DomainRankingSet(
rs.getString("NAME"),
rs.getString("DESCRIPTION"),
rs.getInt("DEPTH"),
rs.getString("DEFINITION"))
);
}
return ret;
}
catch (SQLException ex) {
logger.error("Failed to get domain set", ex);
return List.of();
}
}
/**
* Defines a domain ranking set, parameters for the ranking algorithms.
*
* @param name Key and name of the set
* @param description Human-readable description
* @param depth Depth of the algorithm
* @param definition Definition of the set, typically a list of domains or globs for domain-names
*/
public record DomainRankingSet(String name,
String description,
int depth,
String definition) {
public Path fileName(Path base) {
return base.resolve(name().toLowerCase() + ".dat");
}
public String[] domains() {
return Arrays.stream(definition().split("\n+"))
.map(String::trim)
.filter(s -> !s.isBlank())
.filter(s -> !s.startsWith("#"))
.toArray(String[]::new);
}
public boolean isSpecial() {
return name().equals("BLOGS") || name().equals("NONE") || name().equals("RANK");
}
public DomainRankingSet withName(String name) {
return this.name == name ? this : new DomainRankingSet(name, description, depth, definition);
}
public DomainRankingSet withDescription(String description) {
return this.description == description ? this : new DomainRankingSet(name, description, depth, definition);
}
public DomainRankingSet withDepth(int depth) {
return this.depth == depth ? this : new DomainRankingSet(name, description, depth, definition);
}
public DomainRankingSet withDefinition(String definition) {
return this.definition == definition ? this : new DomainRankingSet(name, description, depth, definition);
}
}
}
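
The Javadoc for DomainRankingSet describes the definition as a list of domains or globs. As a minimal standalone sketch of how such a definition is split into entries, the class below copies the filtering from domains() above. The class name, method name and sample entries are hypothetical and only serve the illustration.

```java
import java.util.Arrays;

public class DefinitionParsingSketch {
    // Same steps as DomainRankingSet.domains(): split on newlines, trim,
    // drop blank lines, drop lines that start with '#'.
    public static String[] parseDefinition(String definition) {
        return Arrays.stream(definition.split("\n+"))
                .map(String::trim)
                .filter(s -> !s.isBlank())
                .filter(s -> !s.startsWith("#"))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Hypothetical definition text, not taken from the project
        String definition = """
                # hypothetical example entries
                www.example.com
                  *.example.org

                """;
        for (String entry : parseDefinition(definition)) {
            System.out.println(entry);
        }
    }
}
```

Note that only lines beginning with '#' are treated as comments here; a '#' later in a line is kept as part of the entry.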


@@ -0,0 +1,217 @@
package nu.marginalia.db;
import com.zaxxer.hikari.HikariDataSource;
import gnu.trove.list.TIntList;
import gnu.trove.list.array.TIntArrayList;
import org.slf4j.LoggerFactory;
import org.slf4j.Logger;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
/** A list of domains that are known to be of a certain type */
@Singleton
public class DomainTypes {
public enum Type {
BLOG,
CRAWL,
TEST
}
private final Logger logger = LoggerFactory.getLogger(DomainTypes.class);
private final HikariDataSource dataSource;
@Inject
public DomainTypes(HikariDataSource dataSource) {
this.dataSource = dataSource;
}
public String getUrlForSelection(Type type) {
try (var conn = dataSource.getConnection();
var qs = conn.prepareStatement("SELECT SOURCE FROM DOMAIN_SELECTION_TYPE WHERE NAME = ?"))
{
qs.setString(1, type.name());
var rs = qs.executeQuery();
if (rs.next()) {
return rs.getString("SOURCE");
}
}
catch (SQLException ex) {
logger.error("Failed to get URL for selection type {}", type, ex);
}
return "";
}
public void updateUrlForSelection(Type type, String newValue) throws SQLException {
try (var conn = dataSource.getConnection();
var us = conn.prepareStatement("REPLACE INTO DOMAIN_SELECTION_TYPE(NAME, SOURCE) VALUES (?, ?)")) {
us.setString(1, type.name());
us.setString(2, newValue);
us.executeUpdate();
}
}
/** Get all domains of a certain type, including domains that are not in the EC_DOMAIN table */
public List<String> getAllDomainsByType(Type type) {
List<String> ret = new ArrayList<>();
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT DOMAIN_NAME
FROM DOMAIN_SELECTION INNER JOIN DOMAIN_SELECTION_TYPE ON DOMAIN_TYPE_ID = DOMAIN_SELECTION_TYPE.ID
WHERE DOMAIN_SELECTION_TYPE.NAME = ?
"""))
{
stmt.setString(1, type.name());
var rs = stmt.executeQuery();
while (rs.next()) {
ret.add(rs.getString(1));
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
return ret;
}
/** Retrieve the domain id of all domains of a certain type,
* ignoring entries that are not in the EC_DOMAIN table */
public TIntList getKnownDomainsByType(Type type) {
TIntList ret = new TIntArrayList();
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT EC_DOMAIN.ID
FROM DOMAIN_SELECTION
INNER JOIN DOMAIN_SELECTION_TYPE ON DOMAIN_TYPE_ID = DOMAIN_SELECTION_TYPE.ID
INNER JOIN EC_DOMAIN ON DOMAIN_SELECTION.DOMAIN_NAME = EC_DOMAIN.DOMAIN_NAME
WHERE DOMAIN_SELECTION_TYPE.NAME = ?
"""))
{
stmt.setString(1, type.name());
var rs = stmt.executeQuery();
while (rs.next()) {
ret.add(rs.getInt(1));
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
return ret;
}
/** Reload the list of domains of a certain type from the source */
public void reloadDomainsList(Type type) throws IOException, SQLException {
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("""
SELECT SOURCE, ID FROM DOMAIN_SELECTION_TYPE WHERE NAME = ?
""");
var deleteStatement = conn.prepareStatement("""
DELETE FROM DOMAIN_SELECTION WHERE DOMAIN_TYPE_ID = ?
""");
var insertStatement = conn.prepareStatement("""
INSERT IGNORE INTO DOMAIN_SELECTION (DOMAIN_NAME, DOMAIN_TYPE_ID) VALUES (?, ?)
""")
)
{
stmt.setString(1, type.name());
var rsp = stmt.executeQuery();
if (!rsp.next()) {
throw new RuntimeException("No such domain selection type: " + type);
}
var source = rsp.getString(1);
int typeId = rsp.getInt(2);
List<String> downloadDomains = downloadDomainsList(source);
try {
conn.setAutoCommit(false);
deleteStatement.setInt(1, typeId);
deleteStatement.executeUpdate();
for (String domain : downloadDomains) {
insertStatement.setString(1, domain);
insertStatement.setInt(2, typeId);
insertStatement.executeUpdate();
// Could use batch insert here, but this executes infrequently, so it's not worth the hassle
}
conn.commit();
}
catch (SQLException ex) {
conn.rollback();
throw ex;
}
finally {
conn.setAutoCommit(true);
}
}
}
public List<String> downloadList(Type type) throws IOException {
var url = getUrlForSelection(type);
if (url.isBlank())
return List.of();
return downloadDomainsList(url);
}
private List<String> downloadDomainsList(String source) throws IOException {
if (source.isBlank())
return List.of();
List<String> ret = new ArrayList<>();
logger.info("Downloading domain list from {}", source);
try (var br = new BufferedReader(new InputStreamReader(new URL(source).openStream()))) {
String line;
while ((line = br.readLine()) != null) {
line = cleanDomainListLine(line);
if (isValidDomainListEntry(line))
ret.add(line);
}
}
logger.info("-- found {}", ret.size());
return ret;
}
private String cleanDomainListLine(String line) {
line = line.trim();
int hashIdx = line.indexOf('#');
if (hashIdx >= 0)
line = line.substring(0, hashIdx).trim();
return line;
}
private boolean isValidDomainListEntry(String line) {
if (line.isBlank())
return false;
if (!line.matches("[a-z0-9\\-.]+"))
return false;
return true;
}
}
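
To make the line handling in downloadDomainsList easier to follow, here is a minimal standalone sketch that applies the same cleaning and validation steps to an in-memory list instead of a downloaded file. The class name, method names and sample lines are hypothetical; the logic is copied from cleanDomainListLine and isValidDomainListEntry above.

```java
import java.util.ArrayList;
import java.util.List;

public class DomainListParsingSketch {
    // Strip surrounding whitespace and anything from the first '#' onwards
    public static String clean(String line) {
        line = line.trim();
        int hashIdx = line.indexOf('#');
        if (hashIdx >= 0)
            line = line.substring(0, hashIdx).trim();
        return line;
    }

    // Accept only non-blank entries made of lowercase letters, digits, '-' and '.'
    public static boolean isValid(String line) {
        return !line.isBlank() && line.matches("[a-z0-9\\-.]+");
    }

    public static List<String> parse(List<String> lines) {
        List<String> ret = new ArrayList<>();
        for (String line : lines) {
            String cleaned = clean(line);
            if (isValid(cleaned))
                ret.add(cleaned);
        }
        return ret;
    }

    public static void main(String[] args) {
        // Hypothetical input lines
        List<String> input = List.of(
                "www.example.com # trailing comment",
                "# only a comment",
                "",
                "Upper.Example.com",
                "blog.example.org");
        System.out.println(parse(input));
    }
}
```

One consequence worth knowing: entries containing uppercase characters are rejected rather than lowercased, so a list source is expected to use lowercase domain names.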

31
code/common/db/readme.md Normal file

@@ -0,0 +1,31 @@
# DB
This module primarily contains SQL files for the URLs database. The most central tables are `EC_DOMAIN`, `EC_URL` and `EC_PAGE_DATA`.
## Flyway
The system uses Flyway to track database changes and allow easy migrations. It is accessible via the following Gradle tasks:
* `flywayMigrate`
* `flywayBaseline`
* `flywayRepair`
* `flywayClean` (dangerous as in wipes your entire database)
Refer to the [Flyway documentation](https://documentation.red-gate.com/fd/flyway-documentation-138346877.html) for guidance.
It's well documented and these are probably the only four tasks you'll ever need.
If you are not running the system via Docker, you need to provide connection details other than
the defaults (TODO: how?).
The migration files are in [resources/db/migration](resources/db/migration). The file names
incorporate the project's CalVer versioning, and the migrations are applied in lexicographical order.
VYY_MM_v_nnn__description.sql
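
As a rough illustration of what lexicographical ordering means for these file names, the sketch below sorts a few made-up names with plain string comparison. The file names and the class are hypothetical, and Flyway's own version handling is not reproduced here; this only shows why zero-padded fields keep the order stable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MigrationOrderSketch {
    // Sort file names by String natural order, which is lexicographical
    public static List<String> inLexicographicalOrder(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical migration file names following VYY_MM_v_nnn__description.sql
        List<String> names = List.of(
                "V23_07_0_001__second_change.sql",
                "V23_06_0_000__base.sql",
                "V23_06_0_002__later_change.sql");
        inLexicographicalOrder(names).forEach(System.out::println);
    }
}
```

With these names, the `V23_06` files sort before the `V23_07` file, and `000` sorts before `002` within the same month.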
## Central Paths
* [migrations](resources/db/migration) - Flyway migrations
## See Also
* [common/service](../service) implements DatabaseModule, which is where the services get their database connections.


@@ -0,0 +1,144 @@
CREATE TABLE IF NOT EXISTS EC_DOMAIN (
ID INT PRIMARY KEY AUTO_INCREMENT,
DOMAIN_NAME VARCHAR(255) UNIQUE NOT NULL,
DOMAIN_TOP VARCHAR(255) NOT NULL,
INDEXED INT DEFAULT 0 NOT NULL COMMENT "~number of documents visited / 100",
STATE ENUM('ACTIVE', 'EXHAUSTED', 'SPECIAL', 'SOCIAL_MEDIA', 'BLOCKED', 'REDIR', 'ERROR', 'UNKNOWN') NOT NULL DEFAULT 'active' COMMENT "@see EdgeDomainIndexingState",
RANK DOUBLE,
DOMAIN_ALIAS INTEGER,
IP VARCHAR(48),
INDEX_DATE TIMESTAMP DEFAULT NOW(),
DISCOVER_DATE TIMESTAMP DEFAULT NOW(),
IS_ALIVE BOOLEAN AS (STATE='ACTIVE' OR STATE='EXHAUSTED' OR STATE='SPECIAL' OR STATE='SOCIAL_MEDIA') VIRTUAL
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
CREATE TABLE IF NOT EXISTS EC_URL (
ID INT PRIMARY KEY AUTO_INCREMENT,
DOMAIN_ID INT NOT NULL,
PROTO ENUM('http','https','gemini') NOT NULL COLLATE utf8mb4_unicode_ci,
PATH VARCHAR(255) NOT NULL,
PORT INT,
PARAM VARCHAR(255),
PATH_HASH BIGINT NOT NULL COMMENT "Hash of PATH for uniqueness check by domain",
VISITED BOOLEAN NOT NULL DEFAULT FALSE,
STATE ENUM('ok', 'redirect', 'dead', 'archived', 'disqualified') NOT NULL DEFAULT 'ok' COLLATE utf8mb4_unicode_ci,
CONSTRAINT CONS UNIQUE (DOMAIN_ID, PATH_HASH),
FOREIGN KEY (DOMAIN_ID) REFERENCES EC_DOMAIN(ID) ON DELETE CASCADE
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_bin;
CREATE TABLE IF NOT EXISTS EC_PAGE_DATA (
ID INT PRIMARY KEY AUTO_INCREMENT,
TITLE VARCHAR(255) NOT NULL,
DESCRIPTION VARCHAR(255) NOT NULL,
WORDS_TOTAL INTEGER NOT NULL,
FORMAT ENUM('PLAIN', 'UNKNOWN', 'HTML123', 'HTML4', 'XHTML', 'HTML5', 'MARKDOWN') NOT NULL,
FEATURES INT COMMENT "Bit-encoded feature set of document, @see HtmlFeature" NOT NULL,
DATA_HASH BIGINT NOT NULL,
QUALITY DOUBLE NOT NULL,
PUB_YEAR SMALLINT,
FOREIGN KEY (ID) REFERENCES EC_URL(ID) ON DELETE CASCADE
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
CREATE TABLE IF NOT EXISTS EC_DOMAIN_LINK (
ID INT PRIMARY KEY AUTO_INCREMENT,
SOURCE_DOMAIN_ID INT NOT NULL,
DEST_DOMAIN_ID INT NOT NULL,
CONSTRAINT CONS UNIQUE (SOURCE_DOMAIN_ID, DEST_DOMAIN_ID),
FOREIGN KEY (SOURCE_DOMAIN_ID) REFERENCES EC_DOMAIN(ID) ON DELETE CASCADE,
FOREIGN KEY (DEST_DOMAIN_ID) REFERENCES EC_DOMAIN(ID) ON DELETE CASCADE
);
CREATE TABLE IF NOT EXISTS DOMAIN_METADATA (
ID INT PRIMARY KEY,
KNOWN_URLS INT DEFAULT 0,
VISITED_URLS INT DEFAULT 0,
GOOD_URLS INT DEFAULT 0,
FOREIGN KEY (ID) REFERENCES EC_DOMAIN(ID) ON DELETE CASCADE
);
CREATE TABLE EC_FEED_URL (
URL VARCHAR(255) PRIMARY KEY,
DOMAIN_ID INT,
FOREIGN KEY (DOMAIN_ID) REFERENCES EC_DOMAIN(ID) ON DELETE CASCADE
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
CREATE OR REPLACE VIEW EC_URL_VIEW AS
SELECT
CONCAT(EC_URL.PROTO,
'://',
EC_DOMAIN.DOMAIN_NAME,
IF(EC_URL.PORT IS NULL, '', CONCAT(':', EC_URL.PORT)),
EC_URL.PATH,
IF(EC_URL.PARAM IS NULL, '', CONCAT('?', EC_URL.PARAM))
) AS URL,
EC_URL.PATH_HASH AS PATH_HASH,
EC_URL.PATH AS PATH,
EC_DOMAIN.DOMAIN_NAME AS DOMAIN_NAME,
EC_DOMAIN.DOMAIN_TOP AS DOMAIN_TOP,
EC_URL.ID AS ID,
EC_DOMAIN.ID AS DOMAIN_ID,
EC_URL.VISITED AS VISITED,
EC_PAGE_DATA.QUALITY AS QUALITY,
EC_PAGE_DATA.DATA_HASH AS DATA_HASH,
EC_PAGE_DATA.TITLE AS TITLE,
EC_PAGE_DATA.DESCRIPTION AS DESCRIPTION,
EC_PAGE_DATA.WORDS_TOTAL AS WORDS_TOTAL,
EC_PAGE_DATA.FORMAT AS FORMAT,
EC_PAGE_DATA.FEATURES AS FEATURES,
EC_DOMAIN.IP AS IP,
EC_URL.STATE AS STATE,
EC_DOMAIN.RANK AS RANK,
EC_DOMAIN.STATE AS DOMAIN_STATE
FROM EC_URL
LEFT JOIN EC_PAGE_DATA
ON EC_PAGE_DATA.ID = EC_URL.ID
INNER JOIN EC_DOMAIN
ON EC_URL.DOMAIN_ID = EC_DOMAIN.ID;
CREATE OR REPLACE VIEW EC_RELATED_LINKS_VIEW AS
SELECT
SOURCE_DOMAIN_ID,
SOURCE_DOMAIN.DOMAIN_NAME AS SOURCE_DOMAIN,
SOURCE_DOMAIN.DOMAIN_TOP AS SOURCE_TOP_DOMAIN,
DEST_DOMAIN_ID,
DEST_DOMAIN.DOMAIN_NAME AS DEST_DOMAIN,
DEST_DOMAIN.DOMAIN_TOP AS DEST_TOP_DOMAIN
FROM EC_DOMAIN_LINK
INNER JOIN EC_DOMAIN AS SOURCE_DOMAIN
ON SOURCE_DOMAIN.ID=SOURCE_DOMAIN_ID
INNER JOIN EC_DOMAIN AS DEST_DOMAIN
ON DEST_DOMAIN.ID=DEST_DOMAIN_ID
;
CREATE INDEX IF NOT EXISTS EC_DOMAIN_INDEXED_INDEX ON EC_DOMAIN (INDEXED);
CREATE INDEX IF NOT EXISTS EC_DOMAIN_TOP_DOMAIN ON EC_DOMAIN (DOMAIN_TOP);
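
The URL column of EC_URL_VIEW is assembled from several columns, with the port and the query parameters included only when they are not NULL. The following Java sketch mirrors that CONCAT expression for readers who find it easier to follow in code. The class and method names are hypothetical and this is not code from the project.

```java
public class UrlViewSketch {
    // Mirrors the CONCAT(...) in EC_URL_VIEW: proto://domain[:port]path[?param]
    public static String assembleUrl(String proto, String domainName,
                                     Integer port, String path, String param) {
        StringBuilder sb = new StringBuilder();
        sb.append(proto).append("://").append(domainName);
        if (port != null)
            sb.append(':').append(port);   // IF(PORT IS NULL, '', CONCAT(':', PORT))
        sb.append(path);
        if (param != null)
            sb.append('?').append(param);  // IF(PARAM IS NULL, '', CONCAT('?', PARAM))
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical rows
        System.out.println(assembleUrl("https", "www.example.com", null, "/index.html", null));
        System.out.println(assembleUrl("http", "www.example.com", 8080, "/search", "q=1"));
    }
}
```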


@@ -0,0 +1,8 @@
CREATE TABLE IF NOT EXISTS EC_DOMAIN_BLACKLIST (
ID INT PRIMARY KEY AUTO_INCREMENT,
URL_DOMAIN VARCHAR(255) UNIQUE NOT NULL,
COMMENT VARCHAR(255) DEFAULT NULL
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;


@@ -0,0 +1,19 @@
CREATE TABLE IF NOT EXISTS REF_DICTIONARY (
TYPE VARCHAR(16),
WORD VARCHAR(255),
DEFINITION VARCHAR(255)
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
CREATE TABLE IF NOT EXISTS REF_WIKI_ARTICLE (
NAME VARCHAR(255) PRIMARY KEY,
REF_NAME VARCHAR(255) COMMENT "If this is a redirect, it redirects to this REF_WIKI_ARTICLE.NAME",
ENTRY LONGBLOB
)
ROW_FORMAT=DYNAMIC
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
CREATE INDEX IF NOT EXISTS REF_DICTIONARY_WORD ON REF_DICTIONARY (WORD);


@@ -0,0 +1,5 @@
CREATE TABLE CRAWL_QUEUE(
DOMAIN_NAME VARCHAR(255) UNIQUE,
SOURCE VARCHAR(255)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;


@@ -0,0 +1,13 @@
CREATE TABLE IF NOT EXISTS DATA_DOMAIN_SCREENSHOT (
DOMAIN_NAME VARCHAR(255) PRIMARY KEY,
CONTENT_TYPE ENUM ('image/png', 'image/webp', 'image/svg+xml') NOT NULL,
DATA LONGBLOB NOT NULL
)
ROW_FORMAT=DYNAMIC
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
CREATE TABLE DATA_DOMAIN_HISTORY (
DOMAIN_NAME VARCHAR(255) PRIMARY KEY,
SCREENSHOT_DATE DATE DEFAULT NOW()
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;


@@ -0,0 +1,15 @@
CREATE TABLE DOMAIN_COMPLAINT(
ID INT PRIMARY KEY AUTO_INCREMENT,
DOMAIN_ID INT NOT NULL,
CATEGORY VARCHAR(255) NOT NULL,
DESCRIPTION TEXT,
SAMPLE VARCHAR(255),
FILE_DATE TIMESTAMP NOT NULL DEFAULT NOW(),
REVIEWED BOOLEAN AS (REVIEW_DATE > 0) VIRTUAL,
DECISION VARCHAR(255),
REVIEW_DATE TIMESTAMP,
FOREIGN KEY (DOMAIN_ID) REFERENCES EC_DOMAIN(ID) ON DELETE CASCADE
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;


@@ -0,0 +1,7 @@
CREATE TABLE IF NOT EXISTS EC_API_KEY (
LICENSE_KEY VARCHAR(255) UNIQUE,
LICENSE VARCHAR(255) NOT NULL,
NAME VARCHAR(255) NOT NULL,
EMAIL VARCHAR(255) NOT NULL,
RATE INT DEFAULT 10
);


@@ -0,0 +1,34 @@
CREATE TABLE EC_DOMAIN_NEIGHBORS (
ID INT PRIMARY KEY AUTO_INCREMENT,
DOMAIN_ID INT NOT NULL,
NEIGHBOR_ID INT NOT NULL,
ADJ_IDX INT NOT NULL,
CONSTRAINT CONS UNIQUE (DOMAIN_ID, ADJ_IDX),
FOREIGN KEY (DOMAIN_ID) REFERENCES EC_DOMAIN(ID) ON DELETE CASCADE
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
CREATE TABLE EC_DOMAIN_NEIGHBORS_2 (
DOMAIN_ID INT NOT NULL,
NEIGHBOR_ID INT NOT NULL,
RELATEDNESS DOUBLE NOT NULL,
PRIMARY KEY (DOMAIN_ID, NEIGHBOR_ID),
FOREIGN KEY (DOMAIN_ID) REFERENCES EC_DOMAIN(ID) ON DELETE CASCADE,
FOREIGN KEY (NEIGHBOR_ID) REFERENCES EC_DOMAIN(ID) ON DELETE CASCADE
);
CREATE OR REPLACE VIEW EC_NEIGHBORS_VIEW AS
SELECT
DOM.DOMAIN_NAME AS DOMAIN_NAME,
DOM.ID AS DOMAIN_ID,
NEIGHBOR.DOMAIN_NAME AS NEIGHBOR_NAME,
NEIGHBOR.ID AS NEIGHBOR_ID,
ROUND(100 * RELATEDNESS) AS RELATEDNESS
FROM EC_DOMAIN_NEIGHBORS_2
INNER JOIN EC_DOMAIN DOM ON DOMAIN_ID=DOM.ID
INNER JOIN EC_DOMAIN NEIGHBOR ON NEIGHBOR_ID=NEIGHBOR.ID;


@@ -0,0 +1,5 @@
CREATE TABLE IF NOT EXISTS EC_RANDOM_DOMAINS (
DOMAIN_ID INT PRIMARY KEY,
DOMAIN_SET INT NOT NULL
);


@@ -0,0 +1,8 @@
CREATE TABLE SEARCH_NEWS_FEED (
ID INT PRIMARY KEY AUTO_INCREMENT,
TITLE VARCHAR(255) NOT NULL,
LINK VARCHAR(255) UNIQUE NOT NULL,
SOURCE VARCHAR(255),
LIST_DATE DATE NOT NULL
) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;


@@ -0,0 +1,19 @@
CREATE TABLE IF NOT EXISTS DOMAIN_SELECTION_TYPE (
ID INT PRIMARY KEY AUTO_INCREMENT,
NAME VARCHAR(255) UNIQUE,
SOURCE VARCHAR(255) NOT NULL
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_bin;
CREATE TABLE DOMAIN_SELECTION (
DOMAIN_NAME VARCHAR(255) PRIMARY KEY,
DOMAIN_TYPE_ID INT,
FOREIGN KEY (DOMAIN_TYPE_ID) REFERENCES DOMAIN_SELECTION_TYPE(ID) ON DELETE CASCADE
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
INSERT IGNORE INTO DOMAIN_SELECTION_TYPE(NAME, SOURCE)
VALUES ('BLOG', 'https://raw.githubusercontent.com/MarginaliaSearch/PublicData/master/sets/blogs.txt'),
('TEST', 'https://downloads.marginalia.nu/domain-list-test.txt');


@@ -0,0 +1,27 @@
CREATE TABLE IF NOT EXISTS SERVICE_HEARTBEAT (
SERVICE_NAME VARCHAR(255) PRIMARY KEY COMMENT "Full name of the service, including node id if applicable, e.g. search-service:0",
SERVICE_BASE VARCHAR(255) NOT NULL COMMENT "Base name of the service, e.g. search-service",
INSTANCE VARCHAR(255) NOT NULL COMMENT "UUID of the service instance",
ALIVE BOOLEAN NOT NULL DEFAULT TRUE COMMENT "Set to false when the service is doing an orderly shutdown",
HEARTBEAT_TIME TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) COMMENT "Service was last seen at this point"
);
CREATE TABLE IF NOT EXISTS PROCESS_HEARTBEAT (
PROCESS_NAME VARCHAR(255) PRIMARY KEY COMMENT "Full name of the process, including node id if applicable, e.g. converter:0",
PROCESS_BASE VARCHAR(255) NOT NULL COMMENT "Base name of the process, e.g. converter",
INSTANCE VARCHAR(255) NOT NULL COMMENT "UUID of the process instance",
STATUS ENUM ('STARTING', 'RUNNING', 'STOPPED') NOT NULL DEFAULT 'STARTING' COMMENT "Status of the process",
PROGRESS INT NOT NULL DEFAULT 0 COMMENT "Progress of the process",
HEARTBEAT_TIME TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) COMMENT "Process was last seen at this point"
);
CREATE TABLE IF NOT EXISTS SERVICE_EVENTLOG(
ID BIGINT AUTO_INCREMENT PRIMARY KEY COMMENT "Unique id",
SERVICE_NAME VARCHAR(255) NOT NULL COMMENT "Full name of the service, including node id if applicable, e.g. search-service:0",
SERVICE_BASE VARCHAR(255) NOT NULL COMMENT "Base name of the service, e.g. search-service",
INSTANCE VARCHAR(255) NOT NULL COMMENT "UUID of the service instance",
EVENT_TIME TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) COMMENT "Event time",
EVENT_TYPE VARCHAR(255) NOT NULL COMMENT "Event type",
EVENT_MESSAGE VARCHAR(255) NOT NULL COMMENT "Event message"
);


@@ -0,0 +1,21 @@
CREATE TABLE IF NOT EXISTS MESSAGE_QUEUE (
ID BIGINT AUTO_INCREMENT PRIMARY KEY COMMENT 'Unique id',
RELATED_ID BIGINT NOT NULL DEFAULT -1 COMMENT 'Unique id of a related message',
SENDER_INBOX VARCHAR(255) COMMENT 'Name of the sender inbox',
RECIPIENT_INBOX VARCHAR(255) NOT NULL COMMENT 'Name of the recipient inbox',
FUNCTION VARCHAR(255) NOT NULL COMMENT 'Which function to run',
PAYLOAD TEXT COMMENT 'Message to recipient',
-- These fields are used to avoid double processing of messages
-- instance marks the unique instance of the party, and the tick marks
-- the current polling iteration. Both are necessary.
OWNER_INSTANCE VARCHAR(255) COMMENT 'Instance UUID corresponding to the party that has claimed the message',
OWNER_TICK BIGINT DEFAULT -1 COMMENT 'Used by recipient to determine which messages it has processed',
STATE ENUM('NEW', 'ACK', 'OK', 'ERR', 'DEAD')
NOT NULL DEFAULT 'NEW' COMMENT 'Processing state',
CREATED_TIME TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) COMMENT 'Time of creation',
UPDATED_TIME TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) COMMENT 'Time of last update',
TTL INT COMMENT 'Time to live in seconds'
);
CREATE INDEX MESSAGE_QUEUE_STATE_IDX ON MESSAGE_QUEUE(STATE);
CREATE INDEX MESSAGE_QUEUE_OI_TICK_IDX ON MESSAGE_QUEUE(OWNER_INSTANCE, OWNER_TICK);


@@ -0,0 +1,42 @@
CREATE TABLE IF NOT EXISTS FILE_STORAGE_BASE (
ID BIGINT PRIMARY KEY AUTO_INCREMENT,
NAME VARCHAR(255) NOT NULL UNIQUE,
PATH VARCHAR(255) NOT NULL UNIQUE COMMENT 'The path to the storage base',
TYPE ENUM ('SSD_INDEX', 'SSD_WORK', 'SLOW', 'BACKUP') NOT NULL,
PERMIT_TEMP BOOLEAN NOT NULL DEFAULT FALSE COMMENT 'If true, the storage can be used for temporary files'
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_bin;
CREATE TABLE IF NOT EXISTS FILE_STORAGE (
ID BIGINT PRIMARY KEY AUTO_INCREMENT,
BASE_ID BIGINT NOT NULL,
PATH VARCHAR(255) NOT NULL COMMENT 'The path to the storage relative to the base',
DESCRIPTION VARCHAR(255) NOT NULL,
TYPE ENUM ('CRAWL_SPEC', 'CRAWL_DATA', 'PROCESSED_DATA', 'INDEX_STAGING', 'LEXICON_STAGING', 'INDEX_LIVE', 'LEXICON_LIVE', 'SEARCH_SETS', 'BACKUP', 'EXPORT') NOT NULL,
DO_PURGE BOOLEAN NOT NULL DEFAULT FALSE COMMENT 'If true, the storage may be cleaned',
CREATE_DATE TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
CONSTRAINT CONS UNIQUE (BASE_ID, PATH),
FOREIGN KEY (BASE_ID) REFERENCES FILE_STORAGE_BASE(ID) ON DELETE CASCADE
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_bin;
CREATE TABLE IF NOT EXISTS FILE_STORAGE_RELATION (
SOURCE_ID BIGINT NOT NULL,
TARGET_ID BIGINT NOT NULL,
CONSTRAINT CONS UNIQUE (SOURCE_ID, TARGET_ID),
FOREIGN KEY (SOURCE_ID) REFERENCES FILE_STORAGE(ID) ON DELETE CASCADE,
FOREIGN KEY (TARGET_ID) REFERENCES FILE_STORAGE(ID) ON DELETE CASCADE
);
CREATE VIEW FILE_STORAGE_VIEW
AS SELECT
CONCAT(BASE.PATH, '/', STORAGE.PATH) AS PATH,
STORAGE.TYPE AS TYPE,
DESCRIPTION AS DESCRIPTION,
CREATE_DATE AS CREATE_DATE,
STORAGE.ID AS ID,
BASE.ID AS BASE_ID
FROM FILE_STORAGE STORAGE
INNER JOIN FILE_STORAGE_BASE BASE ON STORAGE.BASE_ID=BASE.ID;


@@ -0,0 +1,28 @@
INSERT IGNORE INTO FILE_STORAGE_BASE(NAME, PATH, TYPE, PERMIT_TEMP)
VALUES
('Index Storage', '/vol', 'SSD_INDEX', false),
('Data Storage', '/samples', 'SLOW', true);
INSERT IGNORE INTO FILE_STORAGE(BASE_ID, PATH, DESCRIPTION, TYPE)
SELECT ID, 'iw', "Index Staging Area", 'INDEX_STAGING'
FROM FILE_STORAGE_BASE WHERE NAME='Index Storage';
INSERT IGNORE INTO FILE_STORAGE(BASE_ID, PATH, DESCRIPTION, TYPE)
SELECT ID, 'ir', "Index Live Area", 'INDEX_LIVE'
FROM FILE_STORAGE_BASE WHERE NAME='Index Storage';
INSERT IGNORE INTO FILE_STORAGE(BASE_ID, PATH, DESCRIPTION, TYPE)
SELECT ID, 'lw', "Lexicon Staging Area", 'LEXICON_STAGING'
FROM FILE_STORAGE_BASE WHERE NAME='Index Storage';
INSERT IGNORE INTO FILE_STORAGE(BASE_ID, PATH, DESCRIPTION, TYPE)
SELECT ID, 'lr', "Lexicon Live Area", 'LEXICON_LIVE'
FROM FILE_STORAGE_BASE WHERE NAME='Index Storage';
INSERT IGNORE INTO FILE_STORAGE(BASE_ID, PATH, DESCRIPTION, TYPE)
SELECT ID, 'ss', "Search Sets", 'SEARCH_SETS'
FROM FILE_STORAGE_BASE WHERE NAME='Index Storage';
INSERT IGNORE INTO FILE_STORAGE(BASE_ID, PATH, DESCRIPTION, TYPE)
SELECT ID, 'export', "Exported Data", 'EXPORT'
FROM FILE_STORAGE_BASE WHERE TYPE='EXPORT';


@@ -0,0 +1,7 @@
INSERT INTO MESSAGE_QUEUE(RECIPIENT_INBOX,FUNCTION,PAYLOAD) VALUES
('fsm:converter_monitor','INITIAL',''),
('fsm:loader_monitor','INITIAL',''),
('fsm:crawler_monitor','INITIAL',''),
('fsm:message_queue_monitor','INITIAL',''),
('fsm:process_liveness_monitor','INITIAL',''),
('fsm:file_storage_monitor','INITIAL','');


@@ -0,0 +1,10 @@
CREATE TABLE IF NOT EXISTS TASK_HEARTBEAT (
TASK_NAME VARCHAR(255) PRIMARY KEY COMMENT "Full name of the task, including node id if applicable, e.g. reconvert:0",
TASK_BASE VARCHAR(255) NOT NULL COMMENT "Base name of the task, e.g. reconvert",
INSTANCE VARCHAR(255) NOT NULL COMMENT "UUID of the task instance",
SERVICE_INSTANCE VARCHAR(255) NOT NULL COMMENT "UUID of the parent service",
STATUS ENUM ('STARTING', 'RUNNING', 'STOPPED') NOT NULL DEFAULT 'STARTING' COMMENT "Status of the task",
PROGRESS INT NOT NULL DEFAULT 0 COMMENT "Progress of the task",
STAGE_NAME VARCHAR(255) DEFAULT "",
HEARTBEAT_TIME TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) COMMENT "Task was last seen at this point"
);


@@ -0,0 +1,2 @@
CREATE INDEX IF NOT EXISTS SERVICE_EVENTLOG__EVENT_TYPE_IDX ON SERVICE_EVENTLOG (EVENT_TYPE);
CREATE INDEX IF NOT EXISTS SERVICE_EVENTLOG__SERVICE_NAME_IDX ON SERVICE_EVENTLOG (SERVICE_NAME);


@@ -0,0 +1,9 @@
ALTER TABLE FILE_STORAGE MODIFY COLUMN TYPE ENUM ('CRAWL_SPEC', 'CRAWL_DATA', 'PROCESSED_DATA', 'INDEX_STAGING', 'LEXICON_STAGING', 'INDEX_LIVE', 'LEXICON_LIVE', 'SEARCH_SETS', 'BACKUP', 'EXPORT', 'LINKDB_LIVE', 'LINKDB_STAGING') NOT NULL;
INSERT IGNORE INTO FILE_STORAGE(BASE_ID, PATH, DESCRIPTION, TYPE)
SELECT ID, 'ldbr', "Linkdb Current", 'LINKDB_LIVE'
FROM FILE_STORAGE_BASE WHERE NAME='Index Storage';
INSERT IGNORE INTO FILE_STORAGE(BASE_ID, PATH, DESCRIPTION, TYPE)
SELECT ID, 'ldbw', "Linkdb Staging Area", 'LINKDB_STAGING'
FROM FILE_STORAGE_BASE WHERE NAME='Index Storage';


@@ -0,0 +1,3 @@
DROP VIEW EC_URL_VIEW;
DROP TABLE EC_PAGE_DATA;
DROP TABLE EC_URL;


@@ -0,0 +1,3 @@
INSERT IGNORE INTO FILE_STORAGE_BASE(NAME, PATH, TYPE, PERMIT_TEMP)
VALUES
('Backup Storage', '/backup', 'BACKUP', true);


@@ -0,0 +1 @@
DELETE FROM FILE_STORAGE WHERE TYPE IN ('LEXICON_STAGING', 'LEXICON_LIVE');


@@ -0,0 +1,21 @@
ALTER TABLE FILE_STORAGE_BASE MODIFY COLUMN NAME VARCHAR(255) NOT NULL;
ALTER TABLE FILE_STORAGE_BASE MODIFY COLUMN PATH VARCHAR(255) NOT NULL;
DROP INDEX PATH ON FILE_STORAGE_BASE;
DROP INDEX NAME ON FILE_STORAGE_BASE;
ALTER TABLE FILE_STORAGE_BASE ADD COLUMN NODE INT NOT NULL DEFAULT -1;
CREATE UNIQUE INDEX FILE_STORAGE_BASE__NODE_NAME ON FILE_STORAGE_BASE(NODE, NAME);
CREATE UNIQUE INDEX FILE_STORAGE_BASE__NODE_PATH ON FILE_STORAGE_BASE(NODE, PATH);
DROP VIEW FILE_STORAGE_VIEW;
CREATE VIEW FILE_STORAGE_VIEW
AS SELECT
CONCAT(BASE.PATH, '/', STORAGE.PATH) AS PATH,
STORAGE.TYPE AS TYPE,
NODE AS NODE,
DESCRIPTION AS DESCRIPTION,
CREATE_DATE AS CREATE_DATE,
STORAGE.ID AS ID,
BASE.ID AS BASE_ID
FROM FILE_STORAGE STORAGE
INNER JOIN FILE_STORAGE_BASE BASE ON STORAGE.BASE_ID=BASE.ID;


@@ -0,0 +1,3 @@
ALTER TABLE TASK_HEARTBEAT ADD COLUMN NODE INT NOT NULL DEFAULT -1;
ALTER TABLE PROCESS_HEARTBEAT ADD COLUMN NODE INT NOT NULL DEFAULT -1;
ALTER TABLE SERVICE_HEARTBEAT ADD COLUMN NODE INT NOT NULL DEFAULT -1;


@@ -0,0 +1,17 @@
ALTER TABLE FILE_STORAGE ADD COLUMN STATE VARCHAR(255) NOT NULL DEFAULT '';
ALTER TABLE FILE_STORAGE DROP COLUMN DO_PURGE;
DROP VIEW FILE_STORAGE_VIEW;
CREATE VIEW FILE_STORAGE_VIEW
AS SELECT
CONCAT(BASE.PATH, '/', STORAGE.PATH) AS PATH,
STORAGE.TYPE AS TYPE,
STATE AS STATE,
NODE AS NODE,
DESCRIPTION AS DESCRIPTION,
CREATE_DATE AS CREATE_DATE,
STORAGE.ID AS ID,
BASE.ID AS BASE_ID
FROM FILE_STORAGE STORAGE
INNER JOIN FILE_STORAGE_BASE BASE ON STORAGE.BASE_ID=BASE.ID;


@@ -0,0 +1,8 @@
CREATE TABLE NODE_CONFIGURATION (
ID INT PRIMARY KEY,
DESCRIPTION VARCHAR(255),
ACCEPT_QUERIES BOOLEAN,
AUTO_CLEAN BOOLEAN DEFAULT TRUE,
PRECESSION BOOLEAN DEFAULT TRUE,
DISABLED BOOLEAN DEFAULT FALSE
);


@@ -0,0 +1,10 @@
ALTER TABLE FILE_STORAGE_BASE DROP COLUMN PERMIT_TEMP;
ALTER TABLE FILE_STORAGE_BASE ADD COLUMN TYPE_NEW VARCHAR(255) NOT NULL;
UPDATE FILE_STORAGE_BASE SET TYPE_NEW = 'CURRENT' WHERE TYPE='SSD_INDEX';
UPDATE FILE_STORAGE_BASE SET TYPE_NEW = 'WORK' WHERE TYPE='SSD_WORK';
UPDATE FILE_STORAGE_BASE SET TYPE_NEW = 'STORAGE' WHERE TYPE='SLOW';
UPDATE FILE_STORAGE_BASE SET TYPE_NEW = 'BACKUP' WHERE TYPE='BACKUP';
ALTER TABLE FILE_STORAGE_BASE DROP COLUMN TYPE;
ALTER TABLE FILE_STORAGE_BASE CHANGE COLUMN TYPE_NEW TYPE VARCHAR(255) NOT NULL;


@@ -0,0 +1 @@
UPDATE MESSAGE_QUEUE SET STATE='DEAD' WHERE STATE='NEW';


@@ -0,0 +1 @@
DELETE FROM FILE_STORAGE WHERE TYPE IN ('INDEX_STAGING', 'INDEX_LIVE', 'SEARCH_SETS', 'LINKDB_LIVE', 'LINKDB_STAGING');


@@ -0,0 +1 @@
ALTER TABLE EC_DOMAIN ADD COLUMN NODE_AFFINITY INT NOT NULL;


@@ -0,0 +1,9 @@
ALTER TABLE WMSA_prod.EC_DOMAIN_LINK
MODIFY COLUMN ID BIGINT NOT NULL AUTO_INCREMENT;
DELIMITER $$
CREATE OR REPLACE PROCEDURE PURGE_LINKS_TABLE (IN nodeId INT)
BEGIN
DELETE EC_DOMAIN_LINK FROM EC_DOMAIN_LINK INNER JOIN WMSA_prod.EC_DOMAIN ON EC_DOMAIN_LINK.SOURCE_DOMAIN_ID = EC_DOMAIN.ID WHERE NODE_AFFINITY = nodeId;
END$$
DELIMITER ;


@@ -0,0 +1 @@
ALTER TABLE WMSA_prod.NODE_CONFIGURATION ADD COLUMN KEEP_WARCS BOOLEAN DEFAULT FALSE;


@@ -0,0 +1,12 @@
CREATE TABLE IF NOT EXISTS CONF_DOMAIN_RANKING_SET (
NAME VARCHAR(255) PRIMARY KEY COLLATE utf8mb4_unicode_ci,
DESCRIPTION VARCHAR(255) NOT NULL,
ALGORITHM VARCHAR(255) NOT NULL,
DEPTH INT NOT NULL,
DEFINITION LONGTEXT NOT NULL
) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
INSERT IGNORE INTO CONF_DOMAIN_RANKING_SET(NAME, DESCRIPTION, ALGORITHM, DEPTH, DEFINITION) VALUES ('NONE', 'Reserved: No Ranking Algorithm', 'SPECIAL', 50000, '');
INSERT IGNORE INTO CONF_DOMAIN_RANKING_SET(NAME, DESCRIPTION, ALGORITHM, DEPTH, DEFINITION) VALUES ('BLOGS', 'Reserved: Blogs Set', 'SPECIAL', 50000, '');
INSERT IGNORE INTO CONF_DOMAIN_RANKING_SET(NAME, DESCRIPTION, ALGORITHM, DEPTH, DEFINITION) VALUES ('RANK', 'Reserved: Main Domain Ranking', 'SPECIAL', 50000, '');


@@ -0,0 +1 @@
ALTER TABLE MESSAGE_QUEUE ADD COLUMN AUDIT_RELATED_ID LONG NOT NULL DEFAULT -1 COMMENT 'To be applied to any new messages created while handling a message';


@@ -0,0 +1 @@
DROP TABLE EC_DOMAIN_LINK;


@@ -0,0 +1 @@
ALTER TABLE CONF_DOMAIN_RANKING_SET DROP COLUMN ALGORITHM;


@@ -0,0 +1 @@
ALTER TABLE WMSA_prod.NODE_CONFIGURATION ADD COLUMN NODE_PROFILE VARCHAR(255) DEFAULT 'MIXED';


@@ -0,0 +1,91 @@
package nu.marginalia.db;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import nu.marginalia.test.TestMigrationLoader;
import org.junit.jupiter.api.*;
import org.testcontainers.containers.MariaDBContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import static org.junit.jupiter.api.Assertions.*;
@Testcontainers
@Tag("slow")
class DomainRankingSetsServiceTest {
@Container
static MariaDBContainer<?> mariaDBContainer = new MariaDBContainer<>("mariadb")
.withDatabaseName("WMSA_prod")
.withUsername("wmsa")
.withPassword("wmsa")
.withNetworkAliases("mariadb");
static HikariDataSource dataSource;
@BeforeAll
public static void setup() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(mariaDBContainer.getJdbcUrl());
config.setUsername("wmsa");
config.setPassword("wmsa");
dataSource = new HikariDataSource(config);
TestMigrationLoader.flywayMigration(dataSource);
// The migration SQL will insert a few default values, we want to remove them
wipeDomainRankingSets(dataSource);
}
@AfterEach
public void tearDown() {
wipeDomainRankingSets(dataSource);
}
@AfterAll
static void tearDownAll() {
dataSource.close();
mariaDBContainer.close();
}
@Test
public void testScenarios() throws Exception {
var service = new DomainRankingSetsService(dataSource);
var newValue = new DomainRankingSetsService.DomainRankingSet(
"test",
"Test domain set",
10,
"test\\.nu"
);
var newValue2 = new DomainRankingSetsService.DomainRankingSet(
"test2",
"Test domain set 2",
20,
"test\\.nu 2"
);
service.upsert(newValue);
service.upsert(newValue2);
assertEquals(newValue, service.get("test").orElseThrow());
var allValues = service.getAll();
assertEquals(2, allValues.size());
assertTrue(allValues.contains(newValue));
assertTrue(allValues.contains(newValue2));
service.delete(newValue);
assertFalse(service.get("test").isPresent());
service.delete(newValue2);
assertFalse(service.get("test2").isPresent());
allValues = service.getAll();
assertEquals(0, allValues.size());
}
private static void wipeDomainRankingSets(HikariDataSource dataSource) {
var service = new DomainRankingSetsService(dataSource);
service.getAll().forEach(service::delete);
}
}


@@ -0,0 +1,73 @@
package nu.marginalia.db;
import com.google.common.collect.Sets;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import nu.marginalia.test.TestMigrationLoader;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.MariaDBContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import java.io.IOException;
import java.sql.SQLException;
import java.util.HashSet;
import java.util.Set;
import static org.junit.jupiter.api.Assertions.assertEquals;
@Tag("slow")
@Testcontainers
public class DomainTypesTest {
@Container
static MariaDBContainer<?> mariaDBContainer = new MariaDBContainer<>("mariadb")
.withDatabaseName("WMSA_prod")
.withUsername("wmsa")
.withPassword("wmsa")
.withNetworkAliases("mariadb");
static HikariDataSource dataSource;
static DomainTypes domainTypes;
@BeforeAll
public static void setup() {
HikariConfig config = new HikariConfig();
config.setJdbcUrl(mariaDBContainer.getJdbcUrl());
config.setUsername("wmsa");
config.setPassword("wmsa");
dataSource = new HikariDataSource(config);
TestMigrationLoader.flywayMigration(dataSource);
domainTypes = new DomainTypes(dataSource);
}
@AfterAll
public static void teardown() {
dataSource.close();
}
@Test
public void reloadDomainsList() throws SQLException, IOException {
domainTypes.reloadDomainsList(DomainTypes.Type.TEST);
var downloadedDomains = new HashSet<>(domainTypes.getAllDomainsByType(DomainTypes.Type.TEST));
var expectedDomains = Set.of("www.marginalia.nu", "search.marginalia.nu", "docs.marginalia.nu",
"encyclopedia.marginalia.nu", "memex.marginalia.nu");
assertEquals(expectedDomains.size(), downloadedDomains.size());
assertEquals(Set.of(), Sets.symmetricDifference(expectedDomains, downloadedDomains));
}
@Test
public void configure() throws SQLException {
assertEquals("", domainTypes.getUrlForSelection(DomainTypes.Type.CRAWL));
domainTypes.updateUrlForSelection(DomainTypes.Type.CRAWL, "test");
assertEquals("test", domainTypes.getUrlForSelection(DomainTypes.Type.CRAWL));
}
}

View File

@@ -0,0 +1,49 @@
plugins {
id 'java'
id 'jvm-test-suite'
}
java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
}
}
configurations {
flywayMigration.extendsFrom(implementation)
}
apply from: "$rootProject.projectDir/srcsets.gradle"
dependencies {
implementation project(':code:common:model')
implementation project(':code:common:service')
implementation libs.bundles.slf4j
implementation libs.guava
implementation dependencies.create(libs.guice.get()) {
exclude group: 'com.google.guava'
}
implementation libs.bundles.gson
implementation libs.notnull
implementation libs.bundles.mariadb
implementation libs.sqlite
implementation libs.commons.lang3
implementation libs.trove
testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation libs.mockito
testImplementation platform('org.testcontainers:testcontainers-bom:1.17.4')
testImplementation libs.commons.codec
testImplementation 'org.testcontainers:mariadb:1.17.4'
testImplementation 'org.testcontainers:junit-jupiter:1.17.4'
testImplementation project(':code:libraries:test-helpers')
}

View File

@@ -0,0 +1,7 @@
package nu.marginalia.linkdb;
public class LinkdbFileNames {
public static String DEPRECATED_LINKDB_FILE_NAME = "links.db";
public static String DOCDB_FILE_NAME = "documents.db";
public static String DOMAIN_LINKS_FILE_NAME = "domain-links.dat";
}

View File

@@ -0,0 +1,135 @@
package nu.marginalia.linkdb.docs;
import com.google.inject.Inject;
import com.google.inject.Singleton;
import com.google.inject.name.Named;
import gnu.trove.list.TLongList;
import nu.marginalia.linkdb.model.DocdbUrlDetail;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.id.UrlIdCodec;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
/** Reads the document database, which is a SQLite database
* containing the URLs and metadata of the documents in the
* index.
* <p></p>
* The database is created by the DocumentDbWriter class.
* */
@Singleton
public class DocumentDbReader {
private final Path dbFile;
private volatile Connection connection;
private final Logger logger = LoggerFactory.getLogger(getClass());
@Inject
public DocumentDbReader(@Named("docdb-file") Path dbFile) throws SQLException {
this.dbFile = dbFile;
if (Files.exists(dbFile)) {
connection = createConnection();
}
else {
logger.warn("No docdb file {}", dbFile);
}
}
private Connection createConnection() throws SQLException {
try {
String connStr = "jdbc:sqlite:" + dbFile.toString();
return DriverManager.getConnection(connStr);
}
catch (SQLException ex) {
logger.error("Failed to connect to document database {}", dbFile, ex);
return null;
}
}
/** Switches the input database file to a new file.
* <p></p>
* This is used to switch over to a new database file
* when the index is re-indexed.
* */
public void switchInput(Path newDbFile) throws IOException, SQLException {
if (!Files.isRegularFile(newDbFile)) {
logger.error("Source is not a file, refusing switch-over {}", newDbFile);
return;
}
if (connection != null) {
connection.close();
}
logger.info("Moving {} to {}", newDbFile, dbFile);
Files.move(newDbFile, dbFile, StandardCopyOption.REPLACE_EXISTING);
connection = createConnection();
}
/** Re-establishes the connection, useful in tests and not
* much else */
public void reconnect() throws SQLException {
if (connection != null)
connection.close();
connection = createConnection();
}
/** Returns the URL details for the given document ids.
* <p></p>
* This is used to get the URL details for the search
* results.
* */
public List<DocdbUrlDetail> getUrlDetails(TLongList ids) throws SQLException {
List<DocdbUrlDetail> ret = new ArrayList<>(ids.size());
if (connection == null ||
connection.isClosed())
{
throw new RuntimeException("URL query temporarily unavailable due to database switch");
}
try (var stmt = connection.prepareStatement("""
SELECT ID, URL, TITLE, DESCRIPTION, WORDS_TOTAL, FORMAT, FEATURES, DATA_HASH, QUALITY, PUB_YEAR
FROM DOCUMENT WHERE ID = ?
""")) {
for (int i = 0; i < ids.size(); i++) {
long id = ids.get(i);
stmt.setLong(1, id);
var rs = stmt.executeQuery();
if (rs.next()) {
var url = new EdgeUrl(rs.getString("URL"));
ret.add(new DocdbUrlDetail(
rs.getLong("ID"),
url,
rs.getString("TITLE"),
rs.getString("DESCRIPTION"),
rs.getDouble("QUALITY"),
rs.getString("FORMAT"),
rs.getInt("FEATURES"),
rs.getInt("PUB_YEAR"),
rs.getLong("DATA_HASH"),
rs.getInt("WORDS_TOTAL")
));
}
}
} catch (URISyntaxException e) {
throw new RuntimeException(e);
}
return ret;
}
}

View File

@@ -0,0 +1,83 @@
package nu.marginalia.linkdb.docs;
import nu.marginalia.linkdb.model.DocdbUrlDetail;
import java.io.IOException;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;
/** Writes the document database, which is a SQLite database
* containing the URLs and metadata of the documents in the
* index.
* */
public class DocumentDbWriter {
private final Connection connection;
public DocumentDbWriter(Path outputFile) throws SQLException {
String connStr = "jdbc:sqlite:" + outputFile.toString();
connection = DriverManager.getConnection(connStr);
try (var stream = ClassLoader.getSystemResourceAsStream("db/docdb-document.sql");
var stmt = connection.createStatement()
) {
var sql = new String(stream.readAllBytes());
stmt.executeUpdate(sql);
// Disable synchronous writing as this is a one-off operation with no recovery
stmt.execute("PRAGMA synchronous = OFF");
} catch (IOException e) {
throw new RuntimeException(e);
}
}
public void add(DocdbUrlDetail docdbUrlDetail) throws SQLException {
add(List.of(docdbUrlDetail));
}
public void add(List<DocdbUrlDetail> docdbUrlDetail) throws SQLException {
try (var stmt = connection.prepareStatement("""
INSERT OR IGNORE INTO DOCUMENT(ID, URL, TITLE, DESCRIPTION, WORDS_TOTAL, FORMAT, FEATURES, DATA_HASH, QUALITY, PUB_YEAR)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""")) {
int i = 0;
for (var document : docdbUrlDetail) {
var url = document.url();
stmt.setLong(1, document.urlId());
stmt.setString(2, url.toString());
stmt.setString(3, document.title());
stmt.setString(4, document.description());
stmt.setInt(5, document.wordsTotal());
stmt.setString(6, document.format());
stmt.setInt(7, document.features());
stmt.setLong(8, document.dataHash());
stmt.setDouble(9, document.urlQuality());
if (document.pubYear() == null) {
stmt.setInt(10, 0);
} else {
stmt.setInt(10, document.pubYear());
}
stmt.addBatch();
if (++i > 1000) {
stmt.executeBatch();
i = 0;
}
}
if (i != 0) stmt.executeBatch();
}
}
public void close() throws SQLException {
connection.close();
}
}

View File

@@ -0,0 +1,18 @@
package nu.marginalia.linkdb.model;
import nu.marginalia.model.EdgeUrl;
public record DocdbUrlDetail(long urlId,
EdgeUrl url,
String title,
String description,
double urlQuality,
String format,
int features,
Integer pubYear,
long dataHash,
int wordsTotal
)
{
}

View File

@@ -0,0 +1,19 @@
## Document Database
The document database contains information about documents,
such as their ID, their URL, their title, their description,
and so forth.
The document database is a SQLite file. This information is kept
out of the MariaDB database because updates made there would take
effect in production immediately, even before the information was
searchable.
* [DocumentDbWriter](java/nu/marginalia/linkdb/docs/DocumentDbWriter.java)
* [DocumentDbReader](java/nu/marginalia/linkdb/docs/DocumentDbReader.java)
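
The switch-over that keeps a new database file out of production until it is ready can be sketched roughly as below. This is a minimal illustration using only `java.nio.file`, not the module's actual code: the class name `DocdbSwitchSketch`, the `switchOver` helper, and the `documents.db.next` file name are assumptions made for the example, and it leaves out the SQLite connection handling a real reader needs around the move.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: swap a freshly built database file in for the live one.
class DocdbSwitchSketch {

    /** Moves newFile over liveFile. Returns false, leaving the live file
     *  untouched, if newFile is not a regular file. */
    static boolean switchOver(Path newFile, Path liveFile) throws IOException {
        if (!Files.isRegularFile(newFile)) {
            return false; // refuse the switch-over and keep serving the old file
        }
        Files.move(newFile, liveFile, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("docdb-sketch");
        Path live = dir.resolve("documents.db");
        Path next = dir.resolve("documents.db.next"); // assumed staging name

        Files.writeString(live, "old index");
        Files.writeString(next, "new index");

        switchOver(next, live);
        System.out.println(Files.readString(live)); // prints "new index"
    }
}
```

The point of the design is that the new file is built off to the side and only moved into place once it is complete, so the searchable index and the document data change together.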
**TODO**: This module should probably be renamed and moved into some other package.
## See Also
The database is constructed by the [loading-process](../../processes/loading-process), and consumed by the [index-service](../../services-core/index-service).

View File

@@ -0,0 +1,17 @@
CREATE TABLE DOCUMENT (
ID INT8 PRIMARY KEY,
URL TEXT,
STATE INT,
TITLE TEXT NOT NULL,
DESCRIPTION TEXT NOT NULL,
WORDS_TOTAL INTEGER NOT NULL,
FORMAT TEXT NOT NULL,
FEATURES INTEGER NOT NULL,
DATA_HASH INTEGER NOT NULL,
QUALITY REAL NOT NULL,
PUB_YEAR INTEGER NOT NULL
);

View File

@@ -0,0 +1,44 @@
package nu.marginalia.linkdb;
import gnu.trove.list.array.TLongArrayList;
import nu.marginalia.linkdb.docs.DocumentDbReader;
import nu.marginalia.linkdb.docs.DocumentDbWriter;
import nu.marginalia.linkdb.model.DocdbUrlDetail;
import nu.marginalia.model.EdgeDomain;
import org.junit.jupiter.api.Test;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.SQLException;
public class DocumentDbWriterTest {
@Test
public void testCreate() throws IOException {
Path tempPath = Files.createTempFile("docdb", ".db");
try {
var writer = new DocumentDbWriter(tempPath);
writer.add(new DocdbUrlDetail(
1,
new nu.marginalia.model.EdgeUrl("http", new EdgeDomain("example.com"), null, "/", null),
"Test",
"This is a test",
-4.,
"XHTML",
5,
2020,
0xF00BA3,
444
));
writer.close();
var reader = new DocumentDbReader(tempPath);
var deets = reader.getUrlDetails(new TLongArrayList(new long[]{1}));
System.out.println(deets);
} catch (SQLException e) {
throw new RuntimeException(e);
} finally {
Files.deleteIfExists(tempPath);
}
}
}

View File

@@ -0,0 +1,41 @@
plugins {
id 'java'
id 'jvm-test-suite'
}
java {
toolchain {
languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
}
}
apply from: "$rootProject.projectDir/srcsets.gradle"
dependencies {
implementation project(':code:libraries:braille-block-punch-cards')
implementation project(':code:libraries:coded-sequence')
implementation libs.bundles.slf4j
implementation libs.guava
implementation dependencies.create(libs.guice.get()) {
exclude group: 'com.google.guava'
}
implementation libs.bundles.gson
implementation libs.notnull
implementation libs.commons.lang3
implementation libs.trove
implementation libs.fastutil
implementation libs.bundles.mariadb
testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation libs.mockito
}

View File

@@ -0,0 +1,209 @@
package nu.marginalia.model;
import javax.annotation.Nonnull;
import java.io.Serializable;
import java.util.Objects;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.regex.Pattern;
public class EdgeDomain implements Serializable {
@Nonnull
public final String subDomain;
@Nonnull
public final String topDomain;
public EdgeDomain(String host) {
Objects.requireNonNull(host, "domain name must not be null");
host = host.toLowerCase();
// Remove trailing dots, which are allowed in DNS but not in URLs
// (though sometimes still show up in the wild)
while (!host.isBlank() && host.endsWith(".")) {
host = host.substring(0, host.length() - 1);
}
var dot = host.lastIndexOf('.');
if (dot < 0 || looksLikeAnIp(host)) { // IPV6 >.>
subDomain = "";
topDomain = host;
} else {
int dot2 = host.substring(0, dot).lastIndexOf('.');
if (dot2 < 0) {
subDomain = "";
topDomain = host;
} else {
if (looksLikeGovTld(host)) { // Capture .ac.jp, .co.uk
int dot3 = host.substring(0, dot2).lastIndexOf('.');
if (dot3 >= 0) {
dot2 = dot3;
subDomain = host.substring(0, dot2);
topDomain = host.substring(dot2 + 1);
} else {
subDomain = "";
topDomain = host;
}
} else {
subDomain = host.substring(0, dot2);
topDomain = host.substring(dot2 + 1);
}
}
}
}
private static final Predicate<String> govListTest = Pattern.compile(".*\\.(id|ac|co|org|gov|edu|com)\\.[a-z]{2}").asMatchPredicate();
public EdgeDomain(@Nonnull String subDomain, @Nonnull String topDomain) {
this.subDomain = subDomain;
this.topDomain = topDomain;
}
private boolean looksLikeGovTld(String host) {
if (host.length() < 8)
return false;
int cnt = 0;
for (int i = host.length() - 7; i < host.length(); i++) {
if (host.charAt(i) == '.')
cnt++;
}
return cnt >= 2 && govListTest.test(host);
}
private static final Predicate<String> ipPatternTest = Pattern.compile("[\\d]{1,3}\\.[\\d]{1,3}\\.[\\d]{1,3}\\.[\\d]{1,3}").asMatchPredicate();
private boolean looksLikeAnIp(String host) {
if (host.length() < 7)
return false;
char firstChar = host.charAt(0);
int lastChar = host.charAt(host.length() - 1);
return Character.isDigit(firstChar)
&& Character.isDigit(lastChar)
&& ipPatternTest.test(host);
}
public EdgeUrl toRootUrlHttp() {
// Set default protocol to http, as most https websites redirect http->https, but few http websites redirect https->http
return new EdgeUrl("http", this, null, "/", null);
}
public EdgeUrl toRootUrlHttps() {
return new EdgeUrl("https", this, null, "/", null);
}
public String toString() {
return getAddress();
}
public String getAddress() {
if (!subDomain.isEmpty()) {
return subDomain + "." + topDomain;
}
return topDomain;
}
public String getDomainKey() {
int cutPoint = topDomain.indexOf('.');
if (cutPoint < 0) {
return topDomain;
}
return topDomain.substring(0, cutPoint).toLowerCase();
}
public String getLongDomainKey() {
StringBuilder ret = new StringBuilder();
int cutPoint = topDomain.indexOf('.');
if (cutPoint < 0) {
ret.append(topDomain);
} else {
ret.append(topDomain, 0, cutPoint);
}
if (!subDomain.isEmpty() && !"www".equals(subDomain)) {
ret.append(":");
ret.append(subDomain);
}
return ret.toString().toLowerCase();
}
/** If possible, try to provide an alias domain,
* i.e. a domain name that is very likely to link to this one
* */
public Optional<EdgeDomain> aliasDomain() {
if (subDomain.equals("www")) {
return Optional.of(new EdgeDomain("", topDomain));
} else if (subDomain.isBlank()){
return Optional.of(new EdgeDomain("www", topDomain));
}
else return Optional.empty();
}
public boolean hasSameTopDomain(EdgeDomain other) {
if (other == null) return false;
return topDomain.equalsIgnoreCase(other.topDomain);
}
public String getTld() {
int dot = -1;
int length = topDomain.length();
if (ipPatternTest.test(topDomain)) {
return "IP";
}
if (govListTest.test(topDomain)) {
dot = topDomain.indexOf('.', Math.max(0, length - ".edu.uk".length()));
} else {
dot = topDomain.lastIndexOf('.');
}
if (dot < 0 || dot == topDomain.length() - 1) {
return "-";
} else {
return topDomain.substring(dot + 1);
}
}
public boolean equals(final Object o) {
if (o == this) return true;
if (!(o instanceof EdgeDomain other)) return false;
final String this$subDomain = this.getSubDomain();
final String other$subDomain = other.getSubDomain();
if (!Objects.equals(this$subDomain, other$subDomain)) return false;
final String this$domain = this.getTopDomain();
final String other$domain = other.getTopDomain();
if (!Objects.equals(this$domain, other$domain)) return false;
return true;
}
public int hashCode() {
final int PRIME = 59;
int result = 1;
final Object $subDomain = this.getSubDomain().toLowerCase();
result = result * PRIME + $subDomain.hashCode();
final Object $domain = this.getTopDomain().toLowerCase();
result = result * PRIME + $domain.hashCode();
return result;
}
@Nonnull
public String getSubDomain() {
return this.subDomain;
}
@Nonnull
public String getTopDomain() {
return this.topDomain;
}
}

View File

@@ -0,0 +1,249 @@
package nu.marginalia.model;
import nu.marginalia.util.QueryParams;
import javax.annotation.Nullable;
import java.io.Serializable;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Objects;
import java.util.Optional;
import java.util.regex.Pattern;
public class EdgeUrl implements Serializable {
public final String proto;
public final EdgeDomain domain;
public final Integer port;
public final String path;
public final String param;
public EdgeUrl(String proto, EdgeDomain domain, Integer port, String path, String param) {
this.proto = proto;
this.domain = domain;
this.port = port(port, proto);
this.path = path;
this.param = param;
}
public EdgeUrl(String url) throws URISyntaxException {
this(parseURI(url));
}
private static URI parseURI(String url) throws URISyntaxException {
try {
return new URI(urlencodeFixer(url));
} catch (URISyntaxException ex) {
throw new URISyntaxException("Failed to parse URI '" + url + "'", ex.getMessage());
}
}
public static Optional<EdgeUrl> parse(@Nullable String url) {
try {
if (null == url) {
return Optional.empty();
}
return Optional.of(new EdgeUrl(url));
} catch (URISyntaxException e) {
return Optional.empty();
}
}
private static Pattern badCharPattern = Pattern.compile("[ \t\n\"<>\\[\\]()',|]");
/* Java's URI parser is a bit too strict in throwing exceptions when there's an error.
Here on the Internet, standards are like the picture on the box of the frozen pizza,
and what you get is more like what's on the inside. We try to patch things instead,
and make a best-effort attempt at cleaning out broken or unnecessary constructions
like bad or missing URLEncoding.
*/
public static String urlencodeFixer(String url) throws URISyntaxException {
var s = new StringBuilder();
String goodChars = "&.?:/-;+$#";
String hexChars = "0123456789abcdefABCDEF";
int pathIdx = findPathIdx(url);
if (pathIdx < 0) { // url looks like http://marginalia.nu
return url + "/";
}
s.append(url, 0, pathIdx);
// We don't want the fragment, and multiple fragments break the Java URI parser for some reason
int end = url.indexOf("#");
if (end < 0) end = url.length();
for (int i = pathIdx; i < end; i++) {
int c = url.charAt(i);
if (goodChars.indexOf(c) >= 0 || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
s.appendCodePoint(c);
} else if (c == '%' && i + 2 < end) {
int cn = url.charAt(i + 1);
int cnn = url.charAt(i + 2);
if (hexChars.indexOf(cn) >= 0 && hexChars.indexOf(cnn) >= 0) {
s.appendCodePoint(c);
} else {
s.append("%25");
}
} else {
s.append(String.format("%%%02X", c));
}
}
return s.toString();
}
private static int findPathIdx(String url) throws URISyntaxException {
int colonIdx = url.indexOf(':');
if (colonIdx < 0 || colonIdx + 2 >= url.length()) {
throw new URISyntaxException(url, "Lacking protocol");
}
return url.indexOf('/', colonIdx + 2);
}
public EdgeUrl(URI URI) {
try {
String host = URI.getHost();
if (host == null) { // deal with a rare serialization error
host = "parse-error.invalid.example.com";
}
this.domain = new EdgeDomain(host);
this.path = URI.getPath().isEmpty() ? "/" : URI.getPath();
this.proto = URI.getScheme().toLowerCase();
this.port = port(URI.getPort(), proto);
this.param = QueryParams.queryParamsSanitizer(this.path, URI.getQuery());
} catch (Exception ex) {
System.err.println("Failed to parse " + URI);
throw ex;
}
}
public EdgeUrl(URL URL) {
try {
String host = URL.getHost();
if (host == null) { // deal with a rare serialization error
host = "parse-error.invalid.example.com";
}
this.domain = new EdgeDomain(host);
this.path = URL.getPath().isEmpty() ? "/" : URL.getPath();
this.proto = URL.getProtocol().toLowerCase();
this.port = port(URL.getPort(), proto);
this.param = QueryParams.queryParamsSanitizer(this.path, URL.getQuery());
} catch (Exception ex) {
System.err.println("Failed to parse " + URL);
throw ex;
}
}
private static Integer port(Integer port, String protocol) {
if (null == port || port < 1) {
return null;
}
if (protocol.equals("http") && port == 80) {
return null;
} else if (protocol.equals("https") && port == 443) {
return null;
}
return port;
}
public String toString() {
StringBuilder sb = new StringBuilder(256);
sb.append(proto);
sb.append("://");
sb.append(domain);
if (port != null) {
sb.append(':');
sb.append(port);
}
sb.append(path);
if (param != null) {
sb.append('?');
sb.append(param);
}
return sb.toString();
}
public String dir() {
return path.replaceAll("/[^/]+$", "/");
}
public String fileName() {
return path.replaceAll(".*/", "");
}
public int depth() {
return (int) path.chars().filter(c -> c == '/').count();
}
public EdgeUrl withPathAndParam(String path, String param) {
return new EdgeUrl(proto, domain, port, path, param);
}
public boolean equals(Object other) {
if (other == null) return false;
if (other == this) return true;
if (other instanceof EdgeUrl e) {
return Objects.equals(e.domain, domain)
&& Objects.equals(e.path, path)
&& Objects.equals(e.param, param);
}
return false;
}
public boolean equalsExactly(Object other) {
if (other == null) return false;
if (other == this) return true;
if (other instanceof EdgeUrl e) {
return Objects.equals(e.proto, proto)
&& Objects.equals(e.domain, domain)
&& Objects.equals(e.port, port)
&& Objects.equals(e.path, path)
&& Objects.equals(e.param, param);
}
return false;
}
public int hashCode() {
return Objects.hash(domain, path, param);
}
public URL asURL() throws MalformedURLException {
try {
return asURI().toURL();
} catch (URISyntaxException e) {
throw new MalformedURLException(e.getMessage());
}
}
public URI asURI() throws URISyntaxException {
if (port != null) {
return new URI(this.proto, null, this.domain.toString(), this.port, this.path, this.param, null);
}
return new URI(this.proto, this.domain.toString(), this.path, this.param, null);
}
public EdgeDomain getDomain() {
return this.domain;
}
public String getProto() {
return this.proto;
}
}

View File

@@ -0,0 +1,18 @@
package nu.marginalia.model.crawl;
public enum DomainIndexingState {
ACTIVE("Active"),
EXHAUSTED("Fully Crawled"),
SPECIAL("Content is side-loaded"),
SOCIAL_MEDIA("Social media-like website"),
BLOCKED("Blocked"),
REDIR("Redirected to another domain"),
ERROR("Error during crawling"),
UNKNOWN("Unknown");
public String desc;
DomainIndexingState(String desc) {
this.desc = desc;
}
}

View File

@@ -0,0 +1,96 @@
package nu.marginalia.model.crawl;
import java.util.Collection;
public enum HtmlFeature {
// Note, the first 32 of these features are bit encoded in the database
// so be sure to keep anything that's potentially important toward the top
// of the list
MEDIA( "special:media"),
JS("special:scripts"),
AFFILIATE_LINK( "special:affiliate"),
TRACKING("special:tracking"),
TRACKING_ADTECH("special:ads"), // We'll call this ads for now
KEBAB_CASE_URL("special:kcurl"), // https://www.example.com/urls-that-look-like-this/
LONG_URL("special:longurl"),
CLOUDFLARE_FEATURE("special:cloudflare"),
CDN_FEATURE("special:cdn"),
VIEWPORT("special:viewport"),
COOKIES("special:cookies"),
CATEGORY_FOOD("category:food"),
ADVERTISEMENT("special:ads"),
CATEGORY_CRAFTS("category:crafts"),
GA_SPAM("special:gaspam"),
/** For fingerprinting and ranking */
OPENGRAPH("special:opengraph"),
OPENGRAPH_IMAGE("special:opengraph:image"),
TWITTERCARD("special:twittercard"),
TWITTERCARD_IMAGE("special:twittercard:image"),
FONTAWSESOME("special:fontawesome"),
GOOGLEFONTS("special:googlefonts"),
DNS_PREFETCH("special:dnsprefetch"),
PRELOAD("special:preload"),
PRECONNECT("special:preconnect"),
PINGBACK("special:pingback"),
FEED("special:feed"),
WEBMENTION("special:webmention"),
INDIEAUTH("special:indieauth"),
ME_TAG("special:metag"),
NEXT_TAG("special:nexttag"),
AMPHTML("special:amphtml"),
JSON_LD("special:jsonld"),
ORIGIN_TRIAL("special:origintrial"),
PROFILE_GMPG("special:profile-gpmg"),
QUANTCAST("special:quantcast"),
COOKIELAW("special:cookielaw"),
DIDOMI("special:didomi"),
PARDOT("special:pardot"),
ONESIGNAL("special:onesignal"),
DATE_TAG("special:date_tag"),
NOSCRIPT_TAG("special:noscript_tag"),
ROBOTS_INDEX("robots:index"),
ROBOTS_FOLLOW("robots:follow"),
ROBOTS_NOODP("robots:noodp"),
ROBOTS_NOYDIR("robots:noydir"),
DOFOLLOW_LINK("special:dofollow"),
APPLE_TOUCH_ICON("special:appleicon"),
S3_FEATURE("special:s3"),
UNKNOWN("special:uncategorized");
private final String keyword;
HtmlFeature(String keyword) {
this.keyword = keyword;
}
public String getKeyword() {
return keyword;
}
public static int encode(Collection<HtmlFeature> featuresAll) {
int ret = 0;
for (var feature : featuresAll) {
ret |= (1 << (feature.ordinal()));
}
return ret;
}
public static boolean hasFeature(int value, HtmlFeature feature) {
return (value & (1<< feature.ordinal())) != 0;
}
public int getFeatureBit() {
return (1<< ordinal());
}
}

View File

@@ -1,4 +1,4 @@
-package nu.marginalia.wmsa.edge.converting.processor.logic.pubdate;
+package nu.marginalia.model.crawl;
 import java.time.LocalDate;
 import java.time.format.DateTimeFormatter;
@@ -57,5 +57,8 @@ public record PubDate(String dateIso8601, int year) {
 public static int fromYearByte(int yearByte) {
 return yearByte + ENCODING_OFFSET;
 }
+public static int toYearByte(int year) {
+return Math.max(0, year - ENCODING_OFFSET);
+}
 }

View File

@@ -0,0 +1,10 @@
package nu.marginalia.model.crawl;
/** This should correspond to EC_URL.STATE */
public enum UrlIndexingState {
OK,
REDIRECT,
DEAD,
DISQUALIFIED
}

View File

@@ -0,0 +1,27 @@
package nu.marginalia.model.gson;
import com.google.gson.*;
import marcono1234.gson.recordadapter.RecordTypeAdapterFactory;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import java.net.URISyntaxException;
public class GsonFactory {
public static Gson get() {
return new GsonBuilder()
.registerTypeAdapterFactory(RecordTypeAdapterFactory.builder().allowMissingComponentValues().create())
.registerTypeAdapter(EdgeUrl.class, (JsonSerializer<EdgeUrl>) (src, typeOfSrc, context) -> new JsonPrimitive(src.toString()))
.registerTypeAdapter(EdgeDomain.class, (JsonSerializer<EdgeDomain>) (src, typeOfSrc, context) -> new JsonPrimitive(src.toString()))
.registerTypeAdapter(EdgeUrl.class, (JsonDeserializer<EdgeUrl>) (json, typeOfT, context) -> {
try {
return new EdgeUrl(json.getAsString());
} catch (URISyntaxException e) {
throw new JsonParseException("URL Parse Exception", e);
}
})
.registerTypeAdapter(EdgeDomain.class, (JsonDeserializer<EdgeDomain>) (json, typeOfT, context) -> new EdgeDomain(json.getAsString()))
.serializeSpecialFloatingPointValues()
.create();
}
}

View File

@@ -0,0 +1,22 @@
package nu.marginalia.model.html;
// This class really doesn't belong anywhere, but will squat here for now
public enum HtmlStandard {
PLAIN(0, 1),
UNKNOWN(0, 1),
HTML123(0, 1),
HTML4(-0.1, 1.05),
XHTML(-0.1, 1.05),
HTML5(0.5, 1.1);
/** Used to tune quality score */
public final double offset;
/** Used to tune quality score */
public final double scale;
HtmlStandard(double offset, double scale) {
this.offset = offset;
this.scale = scale;
}
}

View File

@@ -0,0 +1,93 @@
package nu.marginalia.model.id;
/** URL id encoding scheme, including an optional ranking part that's used in the indices and washed away
* outside. The ranking part is put in the highest bits so that when we sort the documents by id, they're
* actually sorted by rank. Next is the domain id part, which keeps documents from the same domain clustered.
* Finally is the document ordinal part, which is a non-unique sequence number for within the current set of
* documents loaded. The same ID may be re-used over time as a new index is loaded.
* <p></p>
* <table>
* <tr><th>Part</th><th>Bits</th><th>Cardinality</th></tr>
* <tr>
* <td>rank</td><td>6 bits</td><td>64</td>
* </tr>
* <tr>
* <td>domain</td><td>31 bits</td><td>2 billion</td>
* </tr>
* <tr>
* <td>document</td><td>26 bits</td><td>67 million</td>
* </tr>
* </table>
* <p></p>
* Most significant bit is unused for now because I'm not routing Long.compareUnsigned() all over the codebase.
* <i>If</i> we end up needing more domains, we'll cross that bridge when we come to it.
*
* <h2>Coding Scheme</h2>
* <code><pre>
* [   | rank | domain | url  ]
* 0   1      7        38     64
* </pre></code>
*/
public class UrlIdCodec {
private static final long RANK_MASK = 0xFE00_0000_0000_0000L;
private static final int DOCORD_MASK = 0x03FF_FFFF;
/** Encode a URL id without a ranking element */
public static long encodeId(int domainId, int documentOrdinal) {
assert (domainId & 0x7FFF_FFFF) == domainId : "Domain id must be in [0, 2^31-1], was " + domainId;
assert (documentOrdinal & 0x03FF_FFFF) == documentOrdinal : "Document ordinal must be in [0, 2^26-1], was " + documentOrdinal;
domainId &= 0x7FFF_FFFF;
documentOrdinal &= 0x03FF_FFFF;
return ((long) domainId << 26) | documentOrdinal;
}
/** Encode a URL id with a ranking element */
public static long encodeId(int rank, int domainId, int documentOrdinal) {
assert (rank & 0x3F) == rank : "Rank must be in [0, 63], was " + rank;
assert (domainId & 0x7FFF_FFFF) == domainId : "Domain id must be in [0, 2^31-1], was " + domainId;
assert (documentOrdinal & 0x03FF_FFFF) == documentOrdinal : "Document ordinal must be in [0, 2^26-1], was " + documentOrdinal;
domainId &= 0x7FFF_FFFF;
documentOrdinal &= 0x03FF_FFFF;
rank &= 0x3F;
return ((long) rank << 57) | ((long) domainId << 26) | documentOrdinal;
}
/** Add a ranking element to an existing combined URL id.
*
* @param rank [0,1] the importance of the domain, low is good
* @param urlId the combined URL id; any existing rank is replaced
*/
public static long addRank(float rank, long urlId) {
long rankPart = (int)(rank * (1<<6));
if (rankPart >= 64) rankPart = 63;
if (rankPart < 0) rankPart = 0;
return (urlId&(~RANK_MASK)) | (rankPart << 57);
}
/** Extract the domain component from this URL id */
public static int getDomainId(long combinedId) {
return (int) ((combinedId >>> 26) & 0x7FFF_FFFFL);
}
/** Extract the document ordinal component from this URL id */
public static int getDocumentOrdinal(long combinedId) {
return (int) (combinedId & DOCORD_MASK);
}
/** Extract the rank component from this URL id */
public static int getRank(long combinedId) {
return (int) (combinedId >>> 57) & 0x3F;
}
/** Mask out the ranking element from this URL id */
public static long removeRank(long combinedId) {
return combinedId & ~RANK_MASK;
}
}

View File

@@ -0,0 +1,6 @@
package nu.marginalia.model.idx;
import nu.marginalia.sequence.VarintCodedSequence;
public record CodedWordSpan(byte code, VarintCodedSequence spans) {
}

View File

@@ -0,0 +1,35 @@
package nu.marginalia.model.idx;
import java.util.EnumSet;
public enum DocumentFlags {
Javascript,
PlainText,
GeneratorDocs,
GeneratorForum,
GeneratorWiki,
Sideloaded,
Unused7,
Unused8,
;
public int asBit() {
return 1 << ordinal();
}
public boolean isPresent(long value) {
return (asBit() & value) > 0;
}
public static EnumSet<DocumentFlags> decode(long encodedValue) {
EnumSet<DocumentFlags> ret = EnumSet.noneOf(DocumentFlags.class);
for (DocumentFlags f : values()) {
if ((encodedValue & f.asBit() & 0xff) > 0) {
ret.add(f);
}
}
return ret;
}
}

View File

@@ -0,0 +1,168 @@
package nu.marginalia.model.idx;
import nu.marginalia.model.crawl.PubDate;
import java.io.Serializable;
import java.util.EnumSet;
import java.util.Set;
import static java.lang.Math.max;
import static java.lang.Math.min;
/** Document level metadata designed to fit in a single 64 bit long.
*
* @param avgSentLength average sentence length
* @param rank domain ranking
* @param encDomainSize encoded number of documents in the domain
* @param topology a measure of how important the document is
* @param year encoded publishing year
* @param sets bit mask for search sets
* @param quality quality of the document (0-15); 0 is best, 15 is worst
* @param flags flags (see {@link DocumentFlags})
*/
public record DocumentMetadata(int avgSentLength,
int rank,
int encDomainSize,
int topology,
int year,
int sets,
int quality,
byte flags)
implements Serializable
{
public String toString() {
StringBuilder sb = new StringBuilder(getClass().getSimpleName());
sb.append('[')
.append("avgSentL=").append(avgSentLength).append(", ")
.append("rank=").append(rank).append(", ")
.append("domainSize=").append(ENC_DOMAIN_SIZE_MULTIPLIER * encDomainSize).append(", ")
.append("topology=").append(topology).append(", ")
.append("year=").append(PubDate.fromYearByte(year)).append(", ")
.append("sets=").append(sets).append(", ")
.append("quality=").append(quality).append(", ")
.append("flags=").append(flagSet()).append("]");
return sb.toString();
}
public static final long ASL_MASK = 0x03L;
public static final int ASL_SHIFT = 56;
public static final long RANK_MASK = 0xFFL;
public static final int RANK_SHIFT = 48;
public static final long ENC_DOMAIN_SIZE_MASK = 0xFFL;
public static final int ENC_DOMAIN_SIZE_SHIFT = 40;
public static final int ENC_DOMAIN_SIZE_MULTIPLIER = 5;
public static final long TOPOLOGY_MASK = 0xFFL;
public static final int TOPOLOGY_SHIFT = 32;
public static final long YEAR_MASK = 0xFFL;
public static final int YEAR_SHIFT = 24;
public static final long SETS_MASK = 0xFL;
public static final int SETS_SHIFT = 16;
public static final long QUALITY_MASK = 0xFL;
public static final int QUALITY_SHIFT = 8;
public static long defaultValue() {
return 0L;
}
public DocumentMetadata() {
this(defaultValue());
}
public DocumentMetadata(int avgSentLength, int year, int quality, EnumSet<DocumentFlags> flags) {
this(avgSentLength, 0, 0, 0, year, 0, quality, encodeFlags(flags));
}
public DocumentMetadata withSizeAndTopology(int size, int topology) {
final int encSize = (int) Math.min(ENC_DOMAIN_SIZE_MASK, Math.max(1, size / ENC_DOMAIN_SIZE_MULTIPLIER));
return new DocumentMetadata(avgSentLength, rank, encSize, topology, year, sets, quality, flags);
}
private static byte encodeFlags(Set<DocumentFlags> flags) {
byte ret = 0;
for (var flag : flags) { ret |= flag.asBit(); }
return ret;
}
public boolean hasFlag(DocumentFlags flag) {
return (flags & flag.asBit()) != 0;
}
public DocumentMetadata(long value) {
this(
(int) ((value >>> ASL_SHIFT) & ASL_MASK),
(int) ((value >>> RANK_SHIFT) & RANK_MASK),
(int) ((value >>> ENC_DOMAIN_SIZE_SHIFT) & ENC_DOMAIN_SIZE_MASK),
(int) ((value >>> TOPOLOGY_SHIFT) & TOPOLOGY_MASK),
(int) ((value >>> YEAR_SHIFT) & YEAR_MASK),
(int) ((value >>> SETS_SHIFT) & SETS_MASK),
(int) ((value >>> QUALITY_SHIFT) & QUALITY_MASK),
(byte) (value & 0xFF)
);
}
public static boolean hasFlags(long encoded, long metadataBitMask) {
return ((encoded & 0xFF) & metadataBitMask) == metadataBitMask;
}
public long encode() {
long ret = 0;
ret |= Byte.toUnsignedLong(flags);
ret |= min(QUALITY_MASK, max(0, quality)) << QUALITY_SHIFT;
ret |= min(SETS_MASK, max(0, sets)) << SETS_SHIFT;
ret |= min(YEAR_MASK, max(0, year)) << YEAR_SHIFT;
ret |= min(TOPOLOGY_MASK, max(0, topology)) << TOPOLOGY_SHIFT;
ret |= min(ENC_DOMAIN_SIZE_MASK, max(0, encDomainSize)) << ENC_DOMAIN_SIZE_SHIFT;
ret |= min(RANK_MASK, max(0, rank)) << RANK_SHIFT;
ret |= min(ASL_MASK, max(0, avgSentLength)) << ASL_SHIFT;
return ret;
}
public boolean isEmpty() {
return avgSentLength == 0 && encDomainSize == 0 && topology == 0 && sets == 0 && quality == 0 && year == 0 && flags == 0 && rank == 0;
}
public static int decodeQuality(long encoded) {
return (int) ((encoded >>> QUALITY_SHIFT) & QUALITY_MASK);
}
public static int decodeTopology(long encoded) {
return (int) ((encoded >>> TOPOLOGY_SHIFT) & TOPOLOGY_MASK);
}
public static int decodeAvgSentenceLength(long encoded) {
return (int) ((encoded >>> ASL_SHIFT) & ASL_MASK);
}
public static int decodeYear(long encoded) {
return PubDate.fromYearByte((int) ((encoded >>> YEAR_SHIFT) & YEAR_MASK));
}
public int size() {
return ENC_DOMAIN_SIZE_MULTIPLIER * encDomainSize;
}
public static int decodeSize(long encoded) {
return ENC_DOMAIN_SIZE_MULTIPLIER * (int) ((encoded >>> ENC_DOMAIN_SIZE_SHIFT) & ENC_DOMAIN_SIZE_MASK);
}
public static int decodeRank(long encoded) {
return (int) ((encoded >>> RANK_SHIFT) & RANK_MASK);
}
public static long encodeRank(long encoded, int rank) {
return encoded | min(RANK_MASK, max(0, rank)) << RANK_SHIFT;
}
public EnumSet<DocumentFlags> flagSet() {
return DocumentFlags.decode(flags);
}
}


@@ -0,0 +1,73 @@
package nu.marginalia.model.idx;
import java.util.EnumSet;
public enum WordFlags {
/** Word appears in title */
Title,
/** Word appears to be the subject in several sentences */
Subjects,
/** Word is a likely named object. This is a weaker version of Subjects. */
NamesWords,
/** The word isn't actually a word on page, but a fake keyword from the code
* to aid discovery
*/
Synthetic,
/** Word is important to site
*/
Site,
/** Word is important to adjacent documents
* */
SiteAdjacent,
/** Keyword appears in URL path
*/
UrlPath,
/** Keyword appears in domain name
*/
UrlDomain,
/** Word appears in an external link */
ExternalLink
;
public byte asBit() {
return (byte) (1 << ordinal());
}
public boolean isPresent(byte value) {
return (asBit() & value) != 0;
}
public boolean isAbsent(byte value) {
return (asBit() & value) == 0;
}
public static byte encode(EnumSet<WordFlags> flags) {
byte ret = 0;
for (WordFlags f : flags) {
ret |= f.asBit();
}
return ret;
}
public static EnumSet<WordFlags> decode(byte encodedValue) {
EnumSet<WordFlags> ret = EnumSet.noneOf(WordFlags.class);
for (WordFlags f : values()) {
if ((encodedValue & f.asBit()) != 0) {
ret.add(f);
}
}
return ret;
}
}


@@ -0,0 +1,93 @@
package nu.marginalia.util;
import org.apache.commons.lang3.StringUtils;
import javax.annotation.Nullable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.StringJoiner;
public class QueryParams {
@Nullable
public static String queryParamsSanitizer(String path, @Nullable String queryParams) {
if (queryParams == null) {
return null;
}
String ret;
if (queryParams.indexOf('&') >= 0) {
List<String> parts = new ArrayList<>();
for (var part : StringUtils.split(queryParams, '&')) {
if (QueryParams.isPermittedParam(path, part)) {
parts.add(part);
}
}
if (parts.size() > 1) {
parts.sort(Comparator.naturalOrder());
}
StringJoiner retJoiner = new StringJoiner("&");
parts.forEach(retJoiner::add);
ret = retJoiner.toString();
}
else if (isPermittedParam(path, queryParams)) {
ret = queryParams;
}
else {
return null;
}
if (ret.isBlank())
return null;
return ret;
}
public static boolean isPermittedParam(String path, String param) {
if (path.endsWith(".cgi")) return true;
if (path.endsWith("/posting.php")) return false;
if (param.startsWith("id=")) return true;
if (param.startsWith("p=")) {
// Don't retain forum links with post IDs; they're always non-canonical and eat up a lot of
// crawling bandwidth
if (path.endsWith("showthread.php") || path.endsWith("viewtopic.php")) {
return false;
}
return true;
}
if (param.startsWith("f=")) {
if (path.endsWith("showthread.php") || path.endsWith("viewtopic.php")) {
return false;
}
return true;
}
if (param.startsWith("i=")) return true;
if (param.startsWith("start=")) return true;
if (param.startsWith("t=")) return true;
if (param.startsWith("v=")) return true;
if (param.startsWith("post=")) return true;
if (path.endsWith("index.php")) {
if (param.startsWith("showtopic="))
return true;
if (param.startsWith("showforum="))
return true;
}
if (path.endsWith("StoryView.py")) { // folklore.org is neat
return param.startsWith("project=") || param.startsWith("story=");
}
// www.perseus.tufts.edu:
if (param.startsWith("collection=")) return true;
if (param.startsWith("doc=")) return true;
return false;
}
}


@@ -0,0 +1,11 @@
# Model
This package contains models that are shared across the search engine.
## Central Classes
* [EdgeDomain](java/nu/marginalia/model/EdgeDomain.java)
* [EdgeUrl](java/nu/marginalia/model/EdgeUrl.java)
* [DocumentMetadata](java/nu/marginalia/model/idx/DocumentMetadata.java)
* [DocumentFlags](java/nu/marginalia/model/idx/DocumentFlags.java)
* [WordFlags](java/nu/marginalia/model/idx/WordFlags.java)
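
The bit-packing approach used by `DocumentMetadata` can be illustrated in isolation. The following is a minimal sketch, not code from this package: the class `MetadataPackingSketch` and its method names are hypothetical, only three of the eight fields are shown, and the year is treated as an already-encoded byte. The shift and mask values are the ones `DocumentMetadata` declares.

```java
/** Hypothetical sketch (not part of this package): packs three fields into one long. */
final class MetadataPackingSketch {
    // Shift and mask values as declared in DocumentMetadata
    static final long QUALITY_MASK = 0xFL;
    static final int QUALITY_SHIFT = 8;
    static final long YEAR_MASK = 0xFFL;
    static final int YEAR_SHIFT = 24;
    // The flags occupy the lowest 8 bits and need no shift
    static final long FLAGS_MASK = 0xFFL;

    /** Clamp each field into the range its mask allows, then shift it into position. */
    static long pack(int quality, int year, int flags) {
        long ret = flags & FLAGS_MASK;
        ret |= Math.min(QUALITY_MASK, Math.max(0, quality)) << QUALITY_SHIFT;
        ret |= Math.min(YEAR_MASK, Math.max(0, year)) << YEAR_SHIFT;
        return ret;
    }

    /** Shift the field down to bit 0, then mask away everything above it. */
    static int quality(long packed) {
        return (int) ((packed >>> QUALITY_SHIFT) & QUALITY_MASK);
    }

    static int year(long packed) {
        return (int) ((packed >>> YEAR_SHIFT) & YEAR_MASK);
    }

    static int flags(long packed) {
        return (int) (packed & FLAGS_MASK);
    }

    public static void main(String[] args) {
        // quality 99 exceeds the 4-bit field and is clamped to 15
        long packed = pack(99, 42, 0x81);
        System.out.println(Long.toHexString(packed)); // prints "2a000f81"
        System.out.println(quality(packed));          // prints "15"
        System.out.println(year(packed));             // prints "42"
        System.out.println(flags(packed));            // prints "129"
    }
}
```

Clamping before shifting means an out-of-range value saturates its own field instead of spilling into a neighbouring one, which is why the oversized quality above decodes as 15 rather than corrupting the bits between the fields.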

Some files were not shown because too many files have changed in this diff.