Viktor Lofgren
f1a71e9033
(ndp) Deprioritize tumblr in the visitation order
2025-10-05 12:17:46 +02:00
Viktor Lofgren
7b525918c9
(ndp) Deprioritize tumblr in the visitation order
2025-10-05 12:16:05 +02:00
Viktor Lofgren
0f3aede66f
(ndp) Clean up code
2025-10-05 11:56:41 +02:00
Viktor Lofgren
88236f3836
(ndp) Use MariaDB syntax instead of SQLite syntax when querying MariaDB
2025-10-05 11:56:31 +02:00
Viktor Lofgren
ad31a22fbb
(ndp) Refresh the ndp queue on restart
2025-10-05 10:32:05 +02:00
Viktor Lofgren
2785ae8241
(language) Further amend the docs to mention the language configuration files
2025-10-05 09:04:12 +02:00
Viktor Lofgren
1ed1f2f299
(language) Update documentation for the language processing function
2025-10-04 11:20:24 +02:00
Viktor Lofgren
b7d3b67a1d
(language) Fix language configuration stub for German to not use French stemming
2025-10-02 10:15:30 +02:00
Viktor Lofgren
d28010b7e6
(search) Fix pagination in light mode
2025-10-02 09:04:49 +02:00
Viktor Lofgren
2689bd9eaa
(chore) Update to Java 25
Unbreak test suites
2025-10-02 09:04:25 +02:00
Viktor Lofgren
f6d5d7f196
(chore) Update to Java 25
As usual, most of the change is dealing with Gradle churn.
2025-09-30 15:59:35 +02:00
Viktor
abf1186fa7
Merge pull request #231 from johnvonessen/feature/configurable-crawler-timeouts
feat: Make crawler timeouts configurable via system.properties
2025-09-30 13:47:07 +02:00
John Von Essen
94a77ebddf
Fix timeout configuration test to expect exceptions for invalid values
- Update testInvalidTimeoutValues to expect Exception when invalid timeout values are provided
- This matches the actual behavior where negative timeouts cause IllegalArgumentException
- All timeout configuration tests now pass
2025-09-30 13:39:58 +02:00
John Von Essen
4e2f76a477
feat: Make crawler timeouts configurable via system.properties
- Add configurable timeout properties for HTTP client operations:
  - crawler.socketTimeout (default: 10s)
  - crawler.connectTimeout (default: 30s)
  - crawler.responseTimeout (default: 10s)
  - crawler.connectionRequestTimeout (default: 5min)
  - crawler.jvmConnectTimeout (default: 30000ms)
  - crawler.jvmReadTimeout (default: 30000ms)
  - crawler.httpClientIdleTimeout (default: 15s)
  - crawler.httpClientConnectionPoolSize (default: 256)
- Update HttpFetcherImpl to use Integer.getInteger() for timeout configuration
- Update CrawlerMain and LiveCrawlerMain to use configurable JVM timeouts
- Add comprehensive documentation in crawler readme.md
- Add test coverage for timeout configuration functionality
This allows users to tune crawler timeouts for their specific network
conditions without requiring code changes, improving operational flexibility.
# Conflicts:
# code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/HttpFetcherImpl.java
2025-09-30 13:39:52 +02:00
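To illustrate the configuration approach described in 4e2f76a477, here is a minimal sketch. It is not the code from HttpFetcherImpl. It assumes that crawler.socketTimeout is given in whole seconds and crawler.jvmConnectTimeout in milliseconds, following the units shown in the listed defaults. The class and method names (CrawlerTimeoutSketch, seconds, millis, readNonNegative) are hypothetical. The explicit check for negative values is also an assumption; 94a77ebddf only states that negative timeouts end in an IllegalArgumentException, not where it is thrown.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not part of the actual code base.
public class CrawlerTimeoutSketch {

    // Reads an integer system property. Integer.getInteger returns the default
    // when the property is unset or cannot be parsed as an integer.
    public static int readNonNegative(String key, int defaultValue) {
        int value = Integer.getInteger(key, defaultValue);
        if (value < 0) {
            // Assumption: reject negative values up front.
            throw new IllegalArgumentException(key + " must not be negative: " + value);
        }
        return value;
    }

    // Assumption: the property value is expressed in seconds.
    public static Duration seconds(String key, int defaultSeconds) {
        return Duration.ofSeconds(readNonNegative(key, defaultSeconds));
    }

    // Assumption: the property value is expressed in milliseconds.
    public static Duration millis(String key, int defaultMillis) {
        return Duration.ofMillis(readNonNegative(key, defaultMillis));
    }

    public static void main(String[] args) {
        Duration socketTimeout = seconds("crawler.socketTimeout", 10);
        Duration jvmConnectTimeout = millis("crawler.jvmConnectTimeout", 30_000);
        System.out.println("socketTimeout=" + socketTimeout);
        System.out.println("jvmConnectTimeout=" + jvmConnectTimeout);
    }
}
```

The sketch can be tried by passing a JVM system property on the command line, for example `java -Dcrawler.socketTimeout=25 CrawlerTimeoutSketch.java`. How the project itself loads its system.properties file is not shown here.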
Viktor
4cd1834938
Merge pull request #232 from johnvonessen/socks-support
Add SOCKS proxy support for crawler processes
2025-09-30 13:32:14 +02:00
Viktor Lofgren
5cbbea67ed
(docs) Update documentation with more appropriate best practices
2025-09-30 13:31:23 +02:00
Viktor Lofgren
b688f15550
(proxy) Fix late binding of proxy configuration
The code was selecting the proxy too late, so that it ended up being hardcoded for the entire crawl run, thus breaking the proxy selection logic.
There was also a problem where the socket configuration was overwritten by another socket configuration, thus disabling the proxy injection.
2025-09-30 11:48:43 +02:00
Viktor Lofgren
f55af8ef48
(boot) Explicitly stop ndp and ping processes at first boot
The system has sometimes been observed starting the NDP and Ping processes automatically, which is strongly undesirable as these microcrawlers generate real web traffic.
It is not fully understood how this happened, but the first boot handler has been modified to explicitly stop them, which should prevent the problem and seems to have the desired outcome in testing.
2025-09-30 09:29:04 +02:00
Viktor Lofgren
adc815e282
(language) Add the simulated end-to-end outcome of keyword extraction to the language processing tool
2025-09-28 12:45:25 +02:00
Viktor Lofgren
ca8455e049
(live-capture) Use threads instead of FJP for coordination of sampling
2025-09-25 10:13:46 +02:00
Viktor Lofgren
4ea724d2cb
(live-capture) Use threads instead of FJP for coordination of sampling
2025-09-25 10:10:46 +02:00
Viktor Lofgren
40600e7297
(live-capture) Use threads instead of FJP for coordination of sampling
2025-09-25 10:10:05 +02:00
Viktor Lofgren
7795742538
(live-capture) Use threads instead of FJP for coordination of sampling
2025-09-25 10:06:12 +02:00
Viktor Lofgren
82d33ce69b
(assistant) Add domain coordination module
2025-09-25 09:57:32 +02:00
Viktor Lofgren
e49cc5c244
(live-capture) Add domain coordination, make sampling parallel
2025-09-25 09:55:50 +02:00
Viktor Lofgren
0af389ad93
(live-capture) Use availability information to select domains for sampling more intelligently
2025-09-24 18:22:37 +02:00
Viktor Lofgren
48791f56bd
(index) Put back Chesterton's fence
2025-09-24 16:09:54 +02:00
Viktor Lofgren
be83726427
(query) Remove log noise from query service
2025-09-24 16:06:01 +02:00
Viktor Lofgren
708caa8791
(index) Update verbatim match handling to account for matches that span multiple tags
2025-09-24 15:43:00 +02:00
Viktor Lofgren
32394f42b9
(index) Update verbatim match handling to account for matches that span multiple tags
2025-09-24 15:41:53 +02:00
Viktor Lofgren
b8e3445ce0
(index) Update verbatim match handling to account for matches that span multiple tags
2025-09-24 15:22:50 +02:00
Viktor Lofgren
17a78a7b7e
(query) Remove obsolete code
2025-09-24 15:03:08 +02:00
Viktor Lofgren
5a75dd8093
(index) Update james cook test
2025-09-24 15:02:13 +02:00
Viktor Lofgren
a9713347a0
(query) Submit all segmentations as optional matching groups
2025-09-24 15:01:59 +02:00
Viktor Lofgren
4694d36ed2
(index) Tweak ranking bonuses for partial matches
2025-09-24 15:01:29 +02:00
Viktor Lofgren
70bdd1f51e
(index) Add test case for 'captain james cook'
2025-09-24 13:27:07 +02:00
Viktor Lofgren
187b4828e6
(index) Sort doc ids passed to re-ranking
2025-09-24 13:26:53 +02:00
Viktor Lofgren
93fc14dc94
(index) Add sanity assertions to SkipListReader
2025-09-24 13:26:31 +02:00
Viktor Lofgren
fbfea8539b
(refac) Merge IndexResultScoreCalculator into IndexResultRankingService
2025-09-24 11:51:16 +02:00
Viktor Lofgren
0929d77247
(chore) Remove vestigial Serializable annotation from a few core models
Java serialization was briefly considered a long while ago, but it's a silly and ancient API and not something we want to use.
2025-09-24 10:42:10 +02:00
Viktor Lofgren
db8f8c1f55
(index) Fix bitmask handling in HtmlFeature
2025-09-23 10:15:01 +02:00
Viktor Lofgren
dcb2723386
(index) Fix broken test case in the "slow" collection
2025-09-23 10:13:51 +02:00
Viktor Lofgren
00c1f495f6
(index) Fix incorrect document flag bitmask handling
2025-09-23 10:12:14 +02:00
Viktor Lofgren
73a923983a
(language) Fix outdated test assertion
2025-09-22 10:30:06 +02:00
Viktor Lofgren
e9ed0c5669
(language) Fix keyword pattern matching unicode handling
2025-09-22 10:27:46 +02:00
Viktor Lofgren
5b2bec6144
(search) Fix broken tests
2025-09-22 10:17:38 +02:00
Viktor Lofgren
f26bb8e2b1
(loader) Clean up the code
Loader code is still kinda needlessly convoluted for what it does, but this commit makes an effort toward making it a bit easier to follow along.
2025-09-22 10:14:54 +02:00
Viktor Lofgren
4455495dc6
(system) Fix file loggers in the json config
2025-09-21 19:02:18 +02:00
Viktor Lofgren
b84d17aa51
(system) Fix file loggers in the prod config
2025-09-21 14:02:41 +02:00
Viktor Lofgren
9d008390ae
(language) Fix unicode issues in keyword extraction
2025-09-21 13:54:01 +02:00