mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-05 21:22:39 +02:00
Commit Graph

3494 Commits

Author SHA1 Message Date
Viktor
abf1186fa7 Merge pull request #231 from johnvonessen/feature/configurable-crawler-timeouts
feat: Make crawler timeouts configurable via system.properties
2025-09-30 13:47:07 +02:00
John Von Essen
94a77ebddf Fix timeout configuration test to expect exceptions for invalid values
- Update testInvalidTimeoutValues to expect Exception when invalid timeout values are provided
- This matches the actual behavior where negative timeouts cause IllegalArgumentException
- All timeout configuration tests now pass
2025-09-30 13:39:58 +02:00
John Von Essen
4e2f76a477 feat: Make crawler timeouts configurable via system.properties
- Add configurable timeout properties for HTTP client operations:
  - crawler.socketTimeout (default: 10s)
  - crawler.connectTimeout (default: 30s)
  - crawler.responseTimeout (default: 10s)
  - crawler.connectionRequestTimeout (default: 5min)
  - crawler.jvmConnectTimeout (default: 30000ms)
  - crawler.jvmReadTimeout (default: 30000ms)
  - crawler.httpClientIdleTimeout (default: 15s)
  - crawler.httpClientConnectionPoolSize (default: 256)

- Update HttpFetcherImpl to use Integer.getInteger() for timeout configuration
- Update CrawlerMain and LiveCrawlerMain to use configurable JVM timeouts
- Add comprehensive documentation in crawler readme.md
- Add test coverage for timeout configuration functionality

This allows users to tune crawler timeouts for their specific network
conditions without requiring code changes, improving operational flexibility.

# Conflicts:
#	code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/HttpFetcherImpl.java
2025-09-30 13:39:52 +02:00
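The commit above (4e2f76a477) says HttpFetcherImpl reads its timeouts with Integer.getInteger(). A minimal sketch of that pattern follows, not the project's actual code. The property names and defaults are taken from the commit message. The class name, the helper timeoutSeconds, and the assumption that the values are given in whole seconds are hypothetical. The negative-value check mirrors the behavior described in 94a77ebddf, where invalid values lead to an IllegalArgumentException.

```java
import java.time.Duration;

public class CrawlerTimeoutSketch {

    // Hypothetical helper: reads an integer system property, assumed to be
    // in seconds, and falls back to the given default when the property is
    // unset or not a valid integer.
    static Duration timeoutSeconds(String property, int defaultSeconds) {
        int seconds = Integer.getInteger(property, defaultSeconds);
        if (seconds < 0) {
            // Reject negative timeouts explicitly.
            throw new IllegalArgumentException(
                    property + " must not be negative: " + seconds);
        }
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        // Equivalent to passing -Dcrawler.socketTimeout=20 on the command line.
        System.setProperty("crawler.socketTimeout", "20");

        Duration socketTimeout = timeoutSeconds("crawler.socketTimeout", 10);
        Duration connectTimeout = timeoutSeconds("crawler.connectTimeout", 30);

        System.out.println("socket=" + socketTimeout + " connect=" + connectTimeout);
    }
}
```

Integer.getInteger silently returns the default for unparseable values, so only values that parse but are out of range reach the explicit check.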
Viktor
4cd1834938 Merge pull request #232 from johnvonessen/socks-support
Add SOCKS proxy support for crawler processes
2025-09-30 13:32:14 +02:00
Viktor Lofgren
5cbbea67ed (docs) Update documentation with more appropriate best practices 2025-09-30 13:31:23 +02:00
Viktor Lofgren
b688f15550 (proxy) Fix late binding of proxy configuration
The code was selecting the proxy too late, so that it ended up being hardcoded for the entire crawl run, thus breaking the proxy selection logic.

There was also a problem where the socket configuration was overwritten by another socket configuration, thus disabling the proxy injection.
2025-09-30 11:48:43 +02:00
Viktor Lofgren
f55af8ef48 (boot) Explicitly stop ndp and ping processes at first boot
The system has sometimes been observed starting the NDP and Ping processes automatically, which is strongly undesirable as these microcrawlers generate real web traffic.

It is not fully understood how this happened, but the first boot handler has been modified to explicitly stop them, which should prevent the problem; and seems to have the desired outcome during testing.
2025-09-30 09:29:04 +02:00
Viktor Lofgren
adc815e282 (language) Add outcome of a simulation of the complete outcome of keyword extraction to the language processing tool 2025-09-28 12:45:25 +02:00
Viktor Lofgren
ca8455e049 (live-capture) Use threads instead of FJP for coordination of sampling 2025-09-25 10:13:46 +02:00
Viktor Lofgren
4ea724d2cb (live-capture) Use threads instead of FJP for coordination of sampling 2025-09-25 10:10:46 +02:00
Viktor Lofgren
40600e7297 (live-capture) Use threads instead of FJP for coordination of sampling 2025-09-25 10:10:05 +02:00
Viktor Lofgren
7795742538 (live-capture) Use threads instead of FJP for coordination of sampling 2025-09-25 10:06:12 +02:00
Viktor Lofgren
82d33ce69b (assistant) Add domain coordination module 2025-09-25 09:57:32 +02:00
Viktor Lofgren
e49cc5c244 (live-capture) Add domain coordination, make sampling parallel 2025-09-25 09:55:50 +02:00
Viktor Lofgren
0af389ad93 (live-capture) Use availability information to select domains for sampling more intelligently 2025-09-24 18:22:37 +02:00
Viktor Lofgren
48791f56bd (index) Put back Chesterton's fence 2025-09-24 16:09:54 +02:00
Viktor Lofgren
be83726427 (query) Remove log noise from query service 2025-09-24 16:06:01 +02:00
Viktor Lofgren
708caa8791 (index) Update verbatim match handling to account for matches that span multiple tags 2025-09-24 15:43:00 +02:00
Viktor Lofgren
32394f42b9 (index) Update verbatim match handling to account for matches that span multiple tags 2025-09-24 15:41:53 +02:00
Viktor Lofgren
b8e3445ce0 (index) Update verbatim match handling to account for matches that span multiple tags 2025-09-24 15:22:50 +02:00
Viktor Lofgren
17a78a7b7e (query) Remove obsolete code 2025-09-24 15:03:08 +02:00
Viktor Lofgren
5a75dd8093 (index) Update james cook test 2025-09-24 15:02:13 +02:00
Viktor Lofgren
a9713347a0 (query) Submit all segmentations as optional matching groups 2025-09-24 15:01:59 +02:00
Viktor Lofgren
4694d36ed2 (index) Tweak ranking bonuses for partial matches 2025-09-24 15:01:29 +02:00
Viktor Lofgren
70bdd1f51e (index) Add test case for 'captain james cook' 2025-09-24 13:27:07 +02:00
Viktor Lofgren
187b4828e6 (index) Sort doc ids passed to re-ranking 2025-09-24 13:26:53 +02:00
Viktor Lofgren
93fc14dc94 (index) Add sanity assertions to SkipListReader 2025-09-24 13:26:31 +02:00
Viktor Lofgren
fbfea8539b (refac) Merge IndexResultScoreCalculator into IndexResultRankingService 2025-09-24 11:51:16 +02:00
Viktor Lofgren
0929d77247 (chore) Remove vestigial Serializable annotation from a few core models
Java serialization was briefly considered a long while ago, but it's a silly and ancient API and not something we want to use.
2025-09-24 10:42:10 +02:00
Viktor Lofgren
db8f8c1f55 (index) Fix bitmask handling in HtmlFeature 2025-09-23 10:15:01 +02:00
Viktor Lofgren
dcb2723386 (index) Fix broken test case in the "slow" collection 2025-09-23 10:13:51 +02:00
Viktor Lofgren
00c1f495f6 (index) Fix incorrect document flag bitmask handling 2025-09-23 10:12:14 +02:00
Viktor Lofgren
73a923983a (language) Fix outdated test assertion 2025-09-22 10:30:06 +02:00
Viktor Lofgren
e9ed0c5669 (language) Fix keyword pattern matching unicode handling 2025-09-22 10:27:46 +02:00
Viktor Lofgren
5b2bec6144 (search) Fix broken tests 2025-09-22 10:17:38 +02:00
Viktor Lofgren
f26bb8e2b1 (loader) Clean up the code
Loader code is still kinda needlessly convoluted for what it does, but this commit makes an effort toward making it a bit easier to follow along.
2025-09-22 10:14:54 +02:00
Viktor Lofgren
4455495dc6 (system) Fix file loggers in the json config 2025-09-21 19:02:18 +02:00
Viktor Lofgren
b84d17aa51 (system) Fix file loggers in the prod config 2025-09-21 14:02:41 +02:00
Viktor Lofgren
9d008390ae (language) Fix unicode issues in keyword extraction 2025-09-21 13:54:01 +02:00
Viktor Lofgren
a40c2a8146 (index) Partition index journal by language to speed up index construction 2025-09-21 13:53:43 +02:00
Viktor Lofgren
a3416bf48e (query) Fix timeout settings to use ms and not s 2025-09-19 22:45:22 +02:00
Viktor Lofgren
ee2461d9fc (query) Fix timeout settings to use ms and not us 2025-09-19 22:19:31 +02:00
Viktor Lofgren
54c91a84e3 (query) Make the query client give up if the request exceeds its configured timeout by 50% 2025-09-19 18:59:35 +02:00
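Commit 54c91a84e3 above describes a client-side deadline set 50% beyond the configured timeout. One way this could look is sketched below, using only java.util.concurrent. The class and method names are hypothetical, and the real query client may implement the deadline differently (for example through its RPC layer).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class QueryDeadlineSketch {

    // The client waits for the configured timeout plus 50% before giving up.
    static long clientDeadlineMs(long configuredTimeoutMs) {
        return configuredTimeoutMs + configuredTimeoutMs / 2;
    }

    // Runs the request on a worker thread and abandons it once the client
    // deadline passes; Future.get throws TimeoutException in that case.
    static <T> T callWithDeadline(Callable<T> request, long configuredTimeoutMs)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(request);
            return future.get(clientDeadlineMs(configuredTimeoutMs), TimeUnit.MILLISECONDS);
        } finally {
            // Interrupts the worker if it is still running.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String result = callWithDeadline(() -> "ok", 200);
        System.out.println(result);
    }
}
```

The extra margin gives the server a chance to honor its own timeout and return partial results before the client abandons the call.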
Viktor Lofgren
a6371fc54c (query) Add a timeout to the query API 2025-09-19 18:52:44 +02:00
Viktor Lofgren
8faa9a572d (live-capture) Fix random puppeteer API churn 2025-09-19 11:15:38 +02:00
Viktor Lofgren
fdce940263 (search) Fix redundant spam in <title> 2025-09-19 10:20:14 +02:00
Viktor Lofgren
af8a13a7fb (index) Correct file name compatibility with previous versions 2025-09-19 09:40:43 +02:00
Viktor
9e332de6b4 Merge pull request #223 from MarginaliaSearch/multilingual
Add support for indexing multiple languages
2025-09-19 09:12:54 +02:00
Viktor Lofgren
d457bb5d44 (index) Fix index actor initialization 2025-09-18 16:06:40 +02:00
Viktor Lofgren
c661ebb619 (refac) Move language-processing into functions
It's long surpassed the single-responsibility library it once was, and is as such out of place in its original location, and fits better among the function-type modules.
2025-09-18 10:30:40 +02:00