Viktor Lofgren
b7d3b67a1d
(language) Fix language configuration stub for German to not use French stemming
2025-10-02 10:15:30 +02:00
Viktor Lofgren
d28010b7e6
(search) Fix pagination in light mode
2025-10-02 09:04:49 +02:00
Viktor Lofgren
2689bd9eaa
(chore) Update to Java 25
...
Unbreak test suites
2025-10-02 09:04:25 +02:00
Viktor Lofgren
f6d5d7f196
(chore) Update to Java 25
...
As usual most of the change is dealing with gradle churn.
2025-09-30 15:59:35 +02:00
Viktor
abf1186fa7
Merge pull request #231 from johnvonessen/feature/configurable-crawler-timeouts
...
feat: Make crawler timeouts configurable via system.properties
2025-09-30 13:47:07 +02:00
John Von Essen
94a77ebddf
Fix timeout configuration test to expect exceptions for invalid values
...
- Update testInvalidTimeoutValues to expect Exception when invalid timeout values are provided
- This matches the actual behavior where negative timeouts cause IllegalArgumentException
- All timeout configuration tests now pass
2025-09-30 13:39:58 +02:00
John Von Essen
4e2f76a477
feat: Make crawler timeouts configurable via system.properties
...
- Add configurable timeout properties for HTTP client operations:
  - crawler.socketTimeout (default: 10s)
  - crawler.connectTimeout (default: 30s)
  - crawler.responseTimeout (default: 10s)
  - crawler.connectionRequestTimeout (default: 5min)
  - crawler.jvmConnectTimeout (default: 30000ms)
  - crawler.jvmReadTimeout (default: 30000ms)
  - crawler.httpClientIdleTimeout (default: 15s)
  - crawler.httpClientConnectionPoolSize (default: 256)
- Update HttpFetcherImpl to use Integer.getInteger() for timeout configuration
- Update CrawlerMain and LiveCrawlerMain to use configurable JVM timeouts
- Add comprehensive documentation in crawler readme.md
- Add test coverage for timeout configuration functionality
This allows users to tune crawler timeouts for their specific network
conditions without requiring code changes, improving operational flexibility.
# Conflicts:
# code/processes/crawling-process/java/nu/marginalia/crawl/fetcher/HttpFetcherImpl.java
2025-09-30 13:39:52 +02:00
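The property lookup described in 4e2f76a477 can be sketched roughly as follows. The property name and default come from the commit message; the `CrawlerTimeouts` class, the `timeoutSeconds` helper, and the explicit negative-value check are assumptions made for illustration and are not taken from `HttpFetcherImpl`.

```java
// Hypothetical helper for illustration; not the project's HttpFetcherImpl code.
final class CrawlerTimeouts {
    private CrawlerTimeouts() {}

    /** Reads a timeout (assumed to be in seconds) from a system property. */
    static int timeoutSeconds(String property, int defaultSeconds) {
        // Integer.getInteger returns the default when the property is unset
        // or cannot be parsed as an integer.
        int value = Integer.getInteger(property, defaultSeconds);
        if (value < 0) {
            // Assumption: reject negative values up front, mirroring the
            // IllegalArgumentException mentioned in 94a77ebddf.
            throw new IllegalArgumentException(property + " must not be negative: " + value);
        }
        return value;
    }
}
```

A caller would then write something like `CrawlerTimeouts.timeoutSeconds("crawler.socketTimeout", 10)`, and an operator could override the value with `-Dcrawler.socketTimeout=25` on the JVM command line.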
Viktor
4cd1834938
Merge pull request #232 from johnvonessen/socks-support
...
Add SOCKS proxy support for crawler processes
2025-09-30 13:32:14 +02:00
Viktor Lofgren
5cbbea67ed
(docs) Update documentation with more appropriate best practices
2025-09-30 13:31:23 +02:00
Viktor Lofgren
b688f15550
(proxy) Fix late binding of proxy configuration
...
The code was selecting the proxy too late, so that it ended up being hardcoded for the entire crawl run, thus breaking the proxy selection logic.
There was also a problem where the socket configuration was overwritten by another socket configuration, thus disabling the proxy injection.
2025-09-30 11:48:43 +02:00
Viktor Lofgren
f55af8ef48
(boot) Explicitly stop ndp and ping processes at first boot
...
The system has sometimes been observed starting the NDP and Ping processes automatically, which is strongly undesirable as these microcrawlers generate real web traffic.
It is not fully understood how this happened, but the first boot handler has been modified to explicitly stop them, which should prevent the problem and seems to have the desired outcome during testing.
2025-09-30 09:29:04 +02:00
Viktor Lofgren
adc815e282
(language) Add outcome of a simulation of the complete outcome of keyword extraction to the language processing tool
2025-09-28 12:45:25 +02:00
Viktor Lofgren
ca8455e049
(live-capture) Use threads instead of FJP for coordination of sampling
2025-09-25 10:13:46 +02:00
Viktor Lofgren
4ea724d2cb
(live-capture) Use threads instead of FJP for coordination of sampling
2025-09-25 10:10:46 +02:00
Viktor Lofgren
40600e7297
(live-capture) Use threads instead of FJP for coordination of sampling
2025-09-25 10:10:05 +02:00
Viktor Lofgren
7795742538
(live-capture) Use threads instead of FJP for coordination of sampling
2025-09-25 10:06:12 +02:00
Viktor Lofgren
82d33ce69b
(assistant) Add domain coordination module
2025-09-25 09:57:32 +02:00
Viktor Lofgren
e49cc5c244
(live-capture) Add domain coordination, make sampling parallel
2025-09-25 09:55:50 +02:00
Viktor Lofgren
0af389ad93
(live-capture) Use availability information to select domains for sampling more intelligently
2025-09-24 18:22:37 +02:00
Viktor Lofgren
48791f56bd
(index) Put back Chesterton's fence
2025-09-24 16:09:54 +02:00
Viktor Lofgren
be83726427
(query) Remove log noise from query service
2025-09-24 16:06:01 +02:00
Viktor Lofgren
708caa8791
(index) Update verbatim match handling to account for matches that span multiple tags
2025-09-24 15:43:00 +02:00
Viktor Lofgren
32394f42b9
(index) Update verbatim match handling to account for matches that span multiple tags
2025-09-24 15:41:53 +02:00
Viktor Lofgren
b8e3445ce0
(index) Update verbatim match handling to account for matches that span multiple tags
2025-09-24 15:22:50 +02:00
Viktor Lofgren
17a78a7b7e
(query) Remove obsolete code
2025-09-24 15:03:08 +02:00
Viktor Lofgren
5a75dd8093
(index) Update james cook test
2025-09-24 15:02:13 +02:00
Viktor Lofgren
a9713347a0
(query) Submit all segmentations as optional matching groups
2025-09-24 15:01:59 +02:00
Viktor Lofgren
4694d36ed2
(index) Tweak ranking bonuses for partial matches
2025-09-24 15:01:29 +02:00
Viktor Lofgren
70bdd1f51e
(index) Add test case for 'captain james cook'
2025-09-24 13:27:07 +02:00
Viktor Lofgren
187b4828e6
(index) Sort doc ids passed to re-ranking
2025-09-24 13:26:53 +02:00
Viktor Lofgren
93fc14dc94
(index) Add sanity assertions to SkipListReader
2025-09-24 13:26:31 +02:00
Viktor Lofgren
fbfea8539b
(refac) Merge IndexResultScoreCalculator into IndexResultRankingService
2025-09-24 11:51:16 +02:00
Viktor Lofgren
0929d77247
(chore) Remove vestigial Serializable annotation from a few core models
...
Java serialization was briefly considered a long while ago, but it's a silly and ancient API and not something we want to use.
2025-09-24 10:42:10 +02:00
Viktor Lofgren
db8f8c1f55
(index) Fix bitmask handling in HtmlFeature
2025-09-23 10:15:01 +02:00
Viktor Lofgren
dcb2723386
(index) Fix broken test case in the "slow" collection
2025-09-23 10:13:51 +02:00
Viktor Lofgren
00c1f495f6
(index) Fix incorrect document flag bitmask handling
2025-09-23 10:12:14 +02:00
Viktor Lofgren
73a923983a
(language) Fix outdated test assertion
2025-09-22 10:30:06 +02:00
Viktor Lofgren
e9ed0c5669
(language) Fix keyword pattern matching unicode handling
2025-09-22 10:27:46 +02:00
Viktor Lofgren
5b2bec6144
(search) Fix broken tests
2025-09-22 10:17:38 +02:00
Viktor Lofgren
f26bb8e2b1
(loader) Clean up the code
...
Loader code is still kinda needlessly convoluted for what it does, but this commit makes an effort toward making it a bit easier to follow along.
2025-09-22 10:14:54 +02:00
Viktor Lofgren
4455495dc6
(system) Fix file loggers in the json config
2025-09-21 19:02:18 +02:00
Viktor Lofgren
b84d17aa51
(system) Fix file loggers in the prod config
2025-09-21 14:02:41 +02:00
Viktor Lofgren
9d008390ae
(language) Fix unicode issues in keyword extraction
2025-09-21 13:54:01 +02:00
Viktor Lofgren
a40c2a8146
(index) Partition index journal by language to speed up index construction
2025-09-21 13:53:43 +02:00
Viktor Lofgren
a3416bf48e
(query) Fix timeout settings to use ms and not s
2025-09-19 22:45:22 +02:00
Viktor Lofgren
ee2461d9fc
(query) Fix timeout settings to use ms and not us
2025-09-19 22:19:31 +02:00
Viktor Lofgren
54c91a84e3
(query) Make the query client give up if the request exceeds its configured timeout by 50%
2025-09-19 18:59:35 +02:00
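The client-side cutoff from 54c91a84e3 amounts to waiting 1.5 times the configured timeout before giving up. Here is a minimal sketch of that arithmetic, assuming the timeout is in milliseconds (as the later fixes suggest) and that the response is available as a `CompletableFuture`; `QueryDeadline` and its methods are hypothetical names, not the query client's actual API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; the real query client may be structured differently.
final class QueryDeadline {
    private QueryDeadline() {}

    /** Client-side cutoff: the configured timeout plus 50%. */
    static long clientDeadlineMs(long configuredTimeoutMs) {
        return configuredTimeoutMs + configuredTimeoutMs / 2;
    }

    /** Waits for the response, giving up once the cutoff has passed. */
    static <T> T await(CompletableFuture<T> response, long configuredTimeoutMs)
            throws TimeoutException, ExecutionException, InterruptedException {
        return response.get(clientDeadlineMs(configuredTimeoutMs), TimeUnit.MILLISECONDS);
    }
}
```

The extra margin gives the server a chance to honor its own timeout and return partial results before the client abandons the request.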
Viktor Lofgren
a6371fc54c
(query) Add a timeout to the query API
2025-09-19 18:52:44 +02:00
Viktor Lofgren
8faa9a572d
(live-capture) Fix random puppeteer API churn
2025-09-19 11:15:38 +02:00
Viktor Lofgren
fdce940263
(search) Fix redundant spam in <title>
2025-09-19 10:20:14 +02:00
Viktor Lofgren
af8a13a7fb
(index) Correct file name compatibility with previous versions
2025-09-19 09:40:43 +02:00
Viktor
9e332de6b4
Merge pull request #223 from MarginaliaSearch/multilingual
...
Add support for indexing multiple languages
2025-09-19 09:12:54 +02:00
Viktor Lofgren
d457bb5d44
(index) Fix index actor initialization
2025-09-18 16:06:40 +02:00
Viktor Lofgren
c661ebb619
(refac) Move language-processing into functions
...
It's long surpassed the single-responsibility library it once was, and is as such out of place in its original location, and fits better among the function-type modules.
2025-09-18 10:30:40 +02:00
Viktor Lofgren
53e744398a
Update gitignore to exclude eclipse-generated stuff
2025-09-17 17:14:02 +02:00
Viktor Lofgren
1d71baf3e5
(search) Display search query first in title
2025-09-16 13:16:18 +02:00
Viktor Lofgren
bb5fc0f348
(language) Fix sketchy unicode handling in UnicodeNormalization
2025-09-16 12:15:09 +02:00
Viktor Lofgren
c8f112d040
(lang+search) Clean up LanguageConfiguration initialization and LangCommand
2025-09-16 11:49:46 +02:00
Viktor Lofgren
ae31bc8498
(lang+search) Clean up LanguageConfiguration initialization and LangCommand
2025-09-16 11:47:15 +02:00
Viktor Lofgren
da5046c3bf
(lang) Remove language redirects for languages that are not configured
...
Passing an invalid &lang= to the query service leads to a harmless but ugly stacktrace. This change prevents such a request from being formed.
2025-09-16 11:05:31 +02:00
Viktor Lofgren
f67257baf2
(lang) Remove lang:... keyword during LangCommand
2025-09-16 11:01:11 +02:00
Viktor Lofgren
924fb05661
(config) Fix language config pickup
2025-09-16 10:43:27 +02:00
Viktor Lofgren
c231a82062
(search) Lang redirection works better if it's hooked in
2025-09-16 10:40:24 +02:00
Viktor Lofgren
2c1082d7f0
(search) Add notice about the current language selection to the UI
2025-09-16 10:32:13 +02:00
Viktor Lofgren
06947bd026
(search) Add redirect based on lang:-keyword in search query
...
The change also suppresses the term in the query parser so that it isn't delegated to the index as a keyword.
2025-09-16 10:00:20 +02:00
Viktor Lofgren
519aebd7c6
(process) Make the use of zookeeper based domain coordination optional
...
The zookeeper based domain coordinator has been a bit unstable and led to rare deadlocks. As running multiple instances of the crawler is an unusual configuration, the default behavior that makes the most sense is to disable cross-process coordination and use only local coordination.
2025-09-15 19:13:57 +02:00
Viktor Lofgren
42cc27586e
(process) Reduce connection pool stats log spam
2025-09-15 18:51:43 +02:00
Viktor Lofgren
360881fafd
(setup) Pull POS tags from control svc on first boot
...
This commit also removes the old retrieval from setup.sh
2025-09-15 10:05:17 +02:00
Viktor Lofgren
4c6fdf6ebe
(language) Make language configuration configurable
2025-09-15 09:54:57 +02:00
Viktor Lofgren
554de21f68
(converter) Disable language keyword
2025-09-15 09:49:04 +02:00
Viktor Lofgren
00194acbfe
(search) Add language chooser to UI, clean up search service code
2025-09-13 12:40:42 +02:00
Viktor Lofgren
97dabcefaa
(search) Add language chooser to UI, clean up search service code
2025-09-13 12:34:34 +02:00
Viktor Lofgren
cc790644d4
(search) Persist language choice in the search form
2025-09-12 11:14:54 +02:00
Viktor Lofgren
8f893ee6c0
(search) Add basic support for configuring query language to the search service
...
This is not visible in the UI at this stage, only a query param.
2025-09-11 15:55:09 +02:00
Viktor Lofgren
938721b793
(index) Backwards compatible loading of old words file in index loading
2025-09-11 15:42:31 +02:00
Viktor Lofgren
f68bcefc75
(index) Correct index construction to use the correct files for Fwd index
2025-09-09 11:21:48 +02:00
John Von Essen
164a646af6
Fix SOCKS proxy property propagation to spawned processes
...
- Add SOCKS proxy system properties to ProcessSpawnerService
- Ensures crawler.socksProxy.* properties are passed to spawned crawler processes
- Fixes issue where SOCKS proxy configuration was loaded by control service
but not inherited by spawned crawler processes
This resolves the root cause of SOCKS proxy not working in crawler processes.
2025-09-09 01:02:00 +00:00
Viktor Lofgren
0cfd759f85
(deps) Upgrade slop to 0.17 for better skip performance and faster index construction times
2025-09-08 18:02:34 +02:00
Viktor Lofgren
b53002200c
(index) SkipListWriter should not be in APPEND mode
2025-09-08 17:55:14 +02:00
Viktor Lofgren
78246b9a63
(index) Fix journal language enumeration
2025-09-08 15:38:26 +02:00
Viktor Lofgren
b552e79927
(language) Make LanguageConfiguration a Singleton to avoid duplicate initializations
2025-09-08 13:24:18 +02:00
Viktor Lofgren
bffc159486
(language) Make unicode normalization configurable
2025-09-08 13:18:58 +02:00
John Von Essen
b8000721bd
Implement proper SOCKS proxy support for HTTP Components v5
...
- Replace placeholder implementation with working SocketConfig.setSocksProxyAddress()
- Remove complex ConnectionSocketFactory code that wasn't compatible with v5
- Simplify SocksProxyHttpClientFactory to use correct v5 API
- Fix Timeout vs TimeValue compilation error
- SOCKS proxy configuration now fully functional
Resolves the incomplete implementation and enables proper proxy routing.
2025-09-07 21:49:21 +00:00
John Von Essen
2ee0b0e420
Fix SOCKS proxy implementation for HTTP Components v5
...
- Add missing libs.bundles.httpcomponents dependency to build.gradle
- Fix SocksProxy class references to use SocksProxyConfiguration.SocksProxy
- Update HTTP Components API calls to match v5 interface signatures
- Fix ConnectionSocketFactory method signatures (TimeValue, HttpHost parameters)
- Remove invalid setConnectTimeout() calls on Socket class
- Add placeholder implementation for v5 SOCKS proxy configuration
Resolves compilation errors and provides foundation for proper v5 implementation.
2025-09-06 21:39:20 +00:00
Viktor Lofgren
1432fc87d7
(index) Test languages via integration test
2025-09-06 20:11:41 +02:00
John Von Essen
ec5f32b1d8
Add SOCKS proxy support for crawler processes
...
- Add SocksProxyConfiguration, SocksProxyManager, and SocksProxyHttpClientFactory classes
- Integrate SOCKS proxy support into all crawler HTTP clients
- Support round-robin and random proxy selection strategies
- Add comprehensive documentation for SOCKS proxy configuration
- Configure via system properties for easy deployment
2025-09-05 10:42:58 -04:00
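The two proxy selection strategies named in ec5f32b1d8 could look roughly like this. It is a sketch only: `ProxyPicker`, its `Strategy` enum, and the use of plain `host:port` strings are assumptions and do not reflect the actual `SocksProxyManager` class.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of round-robin and random selection over a proxy list.
final class ProxyPicker {
    enum Strategy { ROUND_ROBIN, RANDOM }

    private final List<String> proxies;
    private final Strategy strategy;
    private final AtomicInteger next = new AtomicInteger();

    ProxyPicker(List<String> proxies, Strategy strategy) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("no proxies configured");
        }
        this.proxies = List.copyOf(proxies);
        this.strategy = strategy;
    }

    /** Picks a proxy for the next connection. */
    String pick() {
        int index = switch (strategy) {
            // floorMod keeps the index non-negative even after the counter overflows.
            case ROUND_ROBIN -> Math.floorMod(next.getAndIncrement(), proxies.size());
            case RANDOM -> ThreadLocalRandom.current().nextInt(proxies.size());
        };
        return proxies.get(index);
    }
}
```

Calling `pick()` per connection rather than once at startup matters here: b688f15550 describes a bug where the proxy was selected too late and ended up fixed for the whole crawl run.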
Viktor Lofgren
edd453531e
(index) Partition keyword lexicons by language
2025-09-04 17:24:48 +02:00
Viktor Lofgren
096496ada1
(refac) Fold ft-anchor-keywords into converting-process
2025-09-03 13:04:30 +02:00
Viktor Lofgren
8ca6209260
(refac) Fold ft-anchor-keywords into converting-process
2025-09-03 13:03:38 +02:00
Viktor Lofgren
673c65d3c9
(refac) Fold term-frequency-dict into language-processing
2025-09-03 12:59:10 +02:00
Viktor Lofgren
acb9ec7b15
(refac) Consistently use 'languageIsoCode' for the language field
2025-09-03 12:54:18 +02:00
Viktor Lofgren
47079e05db
(index) Store language information in the index journal
2025-09-03 12:33:24 +02:00
Viktor Lofgren
c93056e77f
(refac) Clean up index code
2025-09-03 09:51:57 +02:00
Viktor Lofgren
6f7530e807
(refac) Clean up index code
2025-09-02 18:53:58 +02:00
Viktor Lofgren
87ce4a1b52
(refac) Clean up index code
2025-09-02 17:52:38 +02:00
Viktor Lofgren
52194cbe7a
(refac) Clean up index code
2025-09-02 17:44:42 +02:00
Viktor Lofgren
fd1ac03c78
(refac) Clean up index code
2025-09-02 17:30:19 +02:00
Viktor Lofgren
5e5b86efb4
(refac) Clean up index code
2025-09-02 17:24:30 +02:00
Viktor Lofgren
f332ec6191
(refac) Clean up index code
2025-09-02 13:13:10 +02:00
Viktor Lofgren
c25c1af437
(refac) Clean up index code
2025-09-02 13:04:05 +02:00
Viktor Lofgren
eb0c911b45
(refac) Clean up index code
2025-09-02 12:50:07 +02:00
Viktor Lofgren
1979870ce4
(refac) Merge index-forward, index-reverse, index/query into index
...
The project has too many submodules, and it's a bit of a headache to navigate.
2025-09-02 12:30:42 +02:00
Viktor Lofgren
0ba2ea38e1
(index) Move reverse index into a distinct package
2025-09-02 11:59:56 +02:00
Viktor Lofgren
d6cfbceeea
(index) Use a configurable hasher in the index
2025-09-01 13:44:28 +02:00
Viktor Lofgren
e369d200cc
(refac) Simplify index data model by merging SearchParameters, SearchTerms and ResultRankingContext into a new object called SearchContext
...
The previous design was difficult to reason about as similar data was stored in several places, and different functions wanted different nearly identical (but not fully identical) context objects.
This is in preparation for making the keyword hash function configurable, as we want to focus all the code that hashes keywords into one place.
2025-09-01 13:17:11 +02:00
Viktor Lofgren
946d64c8da
(index) Make hash algorithm selection configurable, writer-side
2025-09-01 12:03:01 +02:00
Viktor Lofgren
42f043a60f
(API) Add language parameter to the APIs
2025-09-01 09:33:39 +02:00
Viktor Lofgren
b46f2e1407
(sideload) Remove upper limit on XML entities
...
This unfucks the sideloading of stackexchange definitions.
This broke at some point when we merged the executor service into the index service.
2025-08-31 14:14:09 +02:00
Viktor Lofgren
18aa1b9764
(zim) Fix parsing of modern wikipedia zim files
...
The parser was relying on a deprecated format and
wikipedia has stopped generating zim files that
work with the old assumptions. The new approach should
hopefully work better.
2025-08-31 12:52:44 +02:00
Viktor Lofgren
2f3950e0d5
(language) Roll KeywordExtractor into LanguageDefinition
2025-08-29 10:55:48 +02:00
Viktor Lofgren
61d803869e
(language) Add support for languages with no POS-tagging
...
Clean up previous commit a bit.
2025-08-29 10:55:48 +02:00
Viktor Lofgren
df6434d177
(language) Add support for languages with no POS-tagging
...
This disables a lot of the smart keyword extraction,
which is mostly a crutch for helping English and similar
large languages to find relevant search results.
Smaller languages where a POS-tag model may not be available,
are probably fine with this disabled, as the search engine can
likely just rawdog the entire results list.
2025-08-29 10:55:48 +02:00
Viktor Lofgren
59519ed7c4
(language) Adjust languages.xml
2025-08-29 10:55:47 +02:00
Viktor Lofgren
874fc2d250
(language) Remove debug logging junk
2025-08-29 10:55:47 +02:00
Viktor Lofgren
69e8ec0eef
(language) Fix subject keywords matcher with better rules and correct logic
2025-08-29 10:55:47 +02:00
Viktor Lofgren
a7eb5f54e6
(language) Clean up PosPattern, add tests
2025-08-29 10:55:47 +02:00
Viktor Lofgren
b29ba3e228
(language) Integrate new configurable POS patterns with keyword matchers
2025-08-29 10:55:47 +02:00
Viktor Lofgren
5fa5029c60
(language) Clean up UI
2025-08-29 10:55:47 +02:00
Viktor Lofgren
4257f60f00
(keywords) Fix logic error causing misidentification of some keywords
2025-08-29 10:55:47 +02:00
Viktor Lofgren
ce221d3a0e
(language) Integrate old keyword extraction logic with new test tool
2025-08-29 10:55:47 +02:00
Viktor Lofgren
f0741142a3
(refac) Move keyword extraction into language processing
2025-08-29 10:55:47 +02:00
Viktor Lofgren
0899e4d895
(language) First version of the language processing debug tool
2025-08-29 10:55:47 +02:00
Viktor Lofgren
bbf7c5a1cb
(language) Fix RDRPosTagger back to working order and integrate with SentenceExtractor
2025-08-29 10:55:47 +02:00
Viktor Lofgren
686a40e69b
(language) Update modelling
2025-08-29 10:55:47 +02:00
Viktor Lofgren
8af254f44f
(language) Parse PosPattern tags
2025-08-29 10:55:47 +02:00
Viktor Lofgren
2c21bd9287
(language) Add logging for unknown POS tags in PosPattern
2025-08-29 10:55:47 +02:00
Viktor Lofgren
f9645e2f00
(language) Enhance PosPattern to support wildcard variants in pattern matching
2025-08-29 10:55:47 +02:00
Viktor Lofgren
81e311b558
(language) POS-patterns WIP
2025-08-29 10:55:47 +02:00
Viktor Lofgren
507c09146a
(language) Add support for downloadable resources, parsing POS tag configuration tags
2025-08-29 10:55:47 +02:00
Viktor Lofgren
f682425594
(language) Basic test for LanguageConfiguration
2025-08-29 10:55:47 +02:00
Viktor Lofgren
de67006c4f
(language) Initial integration of new language configuration utility
2025-08-29 10:55:47 +02:00
Viktor Lofgren
eea32bb7b4
(language) Very basic language.xml loading off classpath
2025-08-29 10:55:47 +02:00
Viktor Lofgren
e976940a4e
(config) Move slf4j config files to common:config
2025-08-29 10:55:47 +02:00
Viktor Lofgren
b564b33028
(language) Initial embryo for language configuration
2025-08-29 10:55:47 +02:00
Viktor Lofgren
1cca16a58e
(language) Simplify language filters
2025-08-29 10:55:47 +02:00
Viktor Lofgren
70b4ed6d81
(ldb) Pipe language information into LDB database
2025-08-29 10:55:47 +02:00
Viktor Lofgren
45dc6412c1
(converter) Add language column to slop tables
2025-08-29 10:55:47 +02:00
Viktor Lofgren
b3b95edcb5
(converter) Bypass some of the grammar processing in the keyword extraction depending on language selection
2025-08-29 10:55:47 +02:00
Viktor Lofgren
338d300e1a
(converter) Clean up spans-handling
...
This code was unnecessarily difficult to follow with repeated packing and re-packing of the same data.
2025-08-29 10:55:47 +02:00
Viktor Lofgren
fa685bf1f4
(converter) Add Language field to ProcessedDocumentDetails
2025-08-29 10:55:47 +02:00
Viktor Lofgren
d79a3e2b2a
(converter) Tag documents by language in the index as a keyword
2025-08-29 10:55:47 +02:00
Viktor Lofgren
854382b2be
(language-filter) Experimentally permit Swedish results to pass through the language filter
2025-08-29 10:55:47 +02:00
Viktor Lofgren
8710adbc2a
(build) Reduce log noise during tests
2025-08-29 10:55:32 +02:00
Viktor Lofgren
acdf7b4785
(build) Add test-logger plugin to get better feedback during test execution
2025-08-29 10:41:35 +02:00
Viktor Lofgren
b5d27c1406
(search) Improve unicode support in displayTitle and displaySummary
2025-08-23 13:59:41 +02:00
Viktor Lofgren
55eb7dc116
(search) Improve unicode support in displayTitle and displaySummary
2025-08-23 13:57:51 +02:00
Viktor Lofgren
f0e8bc8baf
(search) Improve unicode support in displayTitle and displaySummary
2025-08-23 13:56:19 +02:00
Viktor Lofgren
91a6ad2337
(search) Improve unicode support in displayTitle and displaySummary
2025-08-23 13:54:48 +02:00
Viktor Lofgren
9a182b9ddb
(search) Use ADVERTISEMENT flag instead of TRACKING_ADVERTISEMENT when choosing to flag a result as having ads
2025-08-21 13:08:25 +02:00
Viktor Lofgren
fefbcf15ce
(site) Make discord link point to chat.marginalia.nu
and let nginx deal with figuring out which discord link to redirect to
2025-08-21 12:46:37 +02:00
Viktor Lofgren
9a789bf62d
(array) Fix broken test
2025-08-18 09:10:58 +02:00
Viktor Lofgren
0525303b68
(index) Add upper limit to span lengths
...
Apparently outliers exist that are larger than SHORT_MAX. This is probably not interesting, so we'll truncate at 8192 for now.
Adding logging statement to get more information about which spans these are so we can address the root cause down the line.
2025-08-17 08:44:57 +02:00
Viktor Lofgren
6953d65de5
(native) Register fixed fd:s for a nice io_uring speed boost
2025-08-16 13:48:11 +02:00
Viktor Lofgren
a7a18ced2e
(native) Register fixed fd:s for a nice io_uring speed boost
2025-08-16 13:46:39 +02:00
Viktor Lofgren
7c94c941b2
(build) Correct rare scenario where root blocks could be generated with a negative size
2025-08-16 11:27:36 +02:00
Viktor Lofgren
ea99b62356
(build) Fix missing junit engine version
2025-08-16 11:01:32 +02:00
Viktor Lofgren
3dc21d34d8
(skiplist) Fix stability of getData fuzz test
2025-08-15 09:17:48 +02:00
Viktor Lofgren
51912e0176
(index) Tweak default values for IndexQueryExecution
2025-08-15 08:07:00 +02:00
Viktor Lofgren
de1b4d5372
(index) Make metrics make more sense by normalizing them by query budget
2025-08-15 03:16:22 +02:00
Viktor Lofgren
50ac926060
(index) Make metrics make more sense by normalizing them by query budget
2025-08-15 03:11:57 +02:00
Viktor Lofgren
d711ee75b5
(index) Add performance metrics
2025-08-15 00:48:52 +02:00
Viktor Lofgren
291ff0c4de
(deps) Upgrade crawler commons to fix robots.txt-parser bug
2025-08-15 00:13:15 +02:00
Viktor
2fd2710355
Merge pull request #218 from MarginaliaSearch/o_direct_index
...
Replace document index btrees with a block based skiplist, get rid of mmap and use O_DIRECT pread instead, use io_uring for positions reads
2025-08-14 23:57:09 +02:00
Viktor Lofgren
e3b957063d
(native) Add fallbacks and configuration options for building on systems lacking liburing
2025-08-14 23:36:13 +02:00
Viktor Lofgren
aee262e5f6
(index) Safeguard against arena-leaks during exceptions
...
The GC would catch these eventually, but it's nice to clean up ourselves in a timely manner.
2025-08-14 19:28:31 +02:00
Viktor Lofgren
4a98a3c711
(skiplist) Move to a separate directory instead of in the btree module
2025-08-14 01:09:46 +02:00
Viktor Lofgren
68f52ca350
(test) Fix tests that work on my machine (TM)
2025-08-14 00:59:58 +02:00
Viktor Lofgren
2a2d951c2f
(index) Fix unhinged default values for index.preparationThreads
2025-08-14 00:54:35 +02:00
Viktor Lofgren
379a1be074
(index) Add better timeout handling in UringQueue, fix slow memory leak on timeout exception
2025-08-14 00:52:50 +02:00
Viktor Lofgren
827aadafcd
(uring) Reintroduce auto-slicing of excessively long read batches
2025-08-13 14:33:35 +02:00
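Auto-slicing as mentioned in 827aadafcd generally means splitting an oversized batch into several submissions that each fit the queue. The sketch below shows the idea on a plain list; `BatchSlicer`, the generic element type, and the `maxBatchSize` parameter are illustrative assumptions and say nothing about how the uring reader actually represents read requests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a long batch into slices of bounded size.
final class BatchSlicer {
    private BatchSlicer() {}

    static <T> List<List<T>> slice(List<T> batch, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> slices = new ArrayList<>();
        for (int start = 0; start < batch.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, batch.size());
            // subList is a view, so no elements are copied.
            slices.add(batch.subList(start, end));
        }
        return slices;
    }
}
```

Each slice can then be submitted and awaited in turn, so a single very long batch never exceeds the queue depth.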
Viktor Lofgren
aa7679d6ce
(pool) Fix bug in exceptionally rare edge case leading to incorrect reads
2025-08-13 14:28:50 +02:00
Viktor Lofgren
6fe6de766d
(pool) Fix SegmentMemoryPage storage
2025-08-13 13:17:14 +02:00
Viktor Lofgren
4245ac4c07
(doc) Update docs to reflect that we now need io_uring
2025-08-12 15:12:54 +02:00
Viktor Lofgren
1c49a0f5ad
(index) Add system properties for toggling O_DIRECT mode for positions and spans
2025-08-12 15:11:13 +02:00
Viktor Lofgren
9a6e5f646d
(docker) Add security_opt: seccomp:unconfined to docker-compose files
...
This is needed to access io_uring via docker.
2025-08-12 15:10:26 +02:00
Viktor Lofgren
fa92994a31
(uring) Fall back to simple I/O planning behavior when buffered mode is selected in UringFileReader
2025-08-11 23:44:38 +02:00
Viktor Lofgren
bc49406881
(build) Compatibility hack for debian server
2025-08-11 23:26:53 +02:00
Viktor Lofgren
90325be447
(minor) Fix comments
2025-08-11 23:19:53 +02:00
Viktor Lofgren
dc89587af3
(index) Improve disk locality of the positions data
2025-08-11 21:17:12 +02:00
Viktor Lofgren
7b552afd6b
(index) Improve disk locality of the positions data
2025-08-11 20:59:11 +02:00
Viktor Lofgren
73557edc67
(index) Improve disk locality of the positions data
2025-08-11 20:57:32 +02:00
Viktor Lofgren
83919e448a
(index) Use O_DIRECT buffered reads for spans
2025-08-11 18:04:25 +02:00
Viktor Lofgren
6f5b75b84d
(cleanup) Remove accidentally committed print stmt
2025-08-11 18:04:25 +02:00
Viktor Lofgren
db315e2813
(index) Use O_DIRECT position reads
2025-08-11 18:04:25 +02:00
Viktor Lofgren
e9977e08b7
(index) Block-align positions data
...
This will make reads more efficient, and possibly pave the way for O_DIRECT reads of this data
2025-08-11 14:36:45 +02:00
Viktor Lofgren
1df3757e5f
(native) Clean up io_uring code and check in execution queue, currently unused but nifty
2025-08-11 13:54:05 +02:00
Viktor Lofgren
ca283f9684
(native) Clean up native helpers and break them into their own library
2025-08-10 20:55:34 +02:00
Viktor Lofgren
85360e61b2
(index) Grow span writer buffer size
...
Apparently outlier spans can grow considerably large.
2025-08-10 17:20:38 +02:00
Viktor Lofgren
e2ccff21bc
(index) Wait until ranking is finished in query execution
2025-08-09 23:40:30 +02:00
Viktor Lofgren
c5b5b0c699
(index) Permit fast termination of rejection filter execution
2025-08-09 23:36:59 +02:00
Viktor Lofgren
9a65946e22
(uring) Reduce queue size to 2048 to avoid ENOMEM on systems with default ulimits
2025-08-09 20:41:24 +02:00
Viktor Lofgren
1d2ab21e27
(index) Aggregate termdata reads into a single io_uring operation instead of one for each term
2025-08-09 17:43:18 +02:00
Viktor Lofgren
0610cc19ad
(index) Fix double close errors
2025-08-09 17:05:38 +02:00
Viktor Lofgren
a676306a7f
(skiplist) Fix bugs in seek operations
2025-08-09 17:00:27 +02:00
Viktor Lofgren
8d68cd14fb
(skiplist) Even more aggressive forward pointers
2025-08-09 16:11:41 +02:00
Viktor Lofgren
4773c5a52b
(index) Backport some changes made during performance evaluations
2025-08-09 15:19:41 +02:00
Viktor Lofgren
74bd562ae4
(index) Move I/O to separate threads to hopefully reduce contention a bit
2025-08-09 15:19:41 +02:00
Viktor Lofgren
c9751287b0
(index) Boost the buffer size used in PrioIndexEntrySource
2025-08-09 01:46:12 +02:00
Viktor Lofgren
5da24e3fc4
(index) Segregate full and priority query ranking
2025-08-09 00:39:31 +02:00
Viktor Lofgren
20a4e86eec
(index) Use a confined arena in IndexResultRankingService
2025-08-08 22:08:35 +02:00
Viktor Lofgren
477a184948
(experiment) Allow early termination of include conditions in lookups
2025-08-08 19:12:54 +02:00
Viktor Lofgren
8940ce99db
(perf) More statistics in perf test
2025-08-08 18:57:25 +02:00
Viktor Lofgren
0ac0fa4dca
(perf) More statistics in perf test
2025-08-08 18:56:17 +02:00
Viktor Lofgren
942f15ef14
(skiplist) Use a linear-quadratic forward pointer scheme instead of an exponential
2025-08-08 16:57:15 +02:00
Viktor Lofgren
f668f33d5b
(index) Tweaks and optimizations
2025-08-08 15:32:23 +02:00
Viktor Lofgren
6789975cd2
(index) Tweaks and optimizations
2025-08-08 15:30:48 +02:00
Viktor Lofgren
c3ba608776
(index) Split up evaluation tasks
2025-08-08 15:20:33 +02:00
Viktor Lofgren
733d2687fe
(skiplist) Roll back the design change that segregated the values associated with documents into a separate file
2025-08-08 14:45:11 +02:00
Viktor Lofgren
f6daac8ed0
(index) MADVISE_RANDOM the index btrees
2025-08-07 21:14:28 +02:00
Viktor Lofgren
c2eeee4a06
(uring) Disable result set combination
2025-08-07 21:13:30 +02:00
Viktor Lofgren
3b0c701df4
(uring) Update uring timeout threshold
2025-08-07 20:13:25 +02:00
Viktor Lofgren
c6fb2db43b
(index) Use a more SLA-aware execution scheduler
2025-08-07 20:13:15 +02:00
Viktor Lofgren
9bc8fe05ae
(skiplist) Clean up search logic
2025-08-07 19:35:25 +02:00
Viktor Lofgren
440ffcf6f8
(skiplist) Fix bug in intersection-like algorithms
2025-08-07 02:18:14 +02:00
Viktor Lofgren
b07709cc72
(native) Disable expensive debug checks from uring code
2025-08-06 21:05:28 +02:00
Viktor Lofgren
9a6acdcbe0
(skiplist) Tag slow fuzz test as "slow"
2025-08-06 20:59:52 +02:00
Viktor Lofgren
23b9b0bf1b
(index) Parametrize skip list block size and buffer pool sizes
2025-08-06 20:59:33 +02:00
Viktor Lofgren
749c8ed954
(pool) Correct buffer pool alignment
2025-08-06 20:56:34 +02:00
Viktor Lofgren
9f4b6939ca
(skiplist) Fix condition for truncated block writing
2025-08-06 16:25:53 +02:00
Viktor Lofgren
1d08e44e8d
(uring) Fadvise random access for uring buffered reads
2025-08-06 15:54:24 +02:00
Viktor Lofgren
fc2e156e78
(skiplist) Ensure docs file is a multiple of BLOCK_SIZE bytes
2025-08-06 15:13:32 +02:00
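Keeping a file size at a multiple of the block size, as in fc2e156e78, comes down to rounding up to the next block boundary. A minimal sketch of that calculation follows; `BlockAlignment` and `alignUp` are hypothetical names, and the 4096-byte figure in the usage note is only an assumption based on the "4K pages" wording in 84d3f6087f.

```java
// Hypothetical sketch of rounding a size up to a block boundary.
final class BlockAlignment {
    private BlockAlignment() {}

    /** Returns the smallest multiple of blockSize that is >= size. */
    static long alignUp(long size, int blockSize) {
        if (size < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("size must be >= 0 and blockSize > 0");
        }
        long remainder = size % blockSize;
        return remainder == 0 ? size : size + (blockSize - remainder);
    }
}
```

For example, `BlockAlignment.alignUp(4097, 4096)` yields 8192, so a writer would pad the file with 4095 bytes. Alignment of this kind matters for O_DIRECT, which generally requires block-aligned offsets and lengths.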
Viktor Lofgren
5e68a89e9f
(index) Improve error handling
2025-08-06 15:05:16 +02:00
Viktor Lofgren
d380661307
(index) Improve error handling
2025-08-06 14:31:06 +02:00
Viktor Lofgren
cccdf5c329
(pool) Check interrupt status in PoolLru's reclamation thread
2025-08-06 13:26:00 +02:00
Viktor Lofgren
f085b4ea12
(skiplist) Fix tests
2025-08-06 13:24:14 +02:00
Viktor Lofgren
e208f7d3ba
(skiplist) Code cleanup and added validation
2025-08-06 12:55:04 +02:00
Viktor Lofgren
b577085cb2
(pool) Use one contiguous memory allocation to encourage a HugePage allocation and reduce TLB thrashing
2025-08-06 12:49:46 +02:00
Viktor Lofgren
b9240476f6
(pool) Use one contiguous memory allocation to encourage a HugePage allocation and reduce TLB thrashing
2025-08-06 12:48:14 +02:00
Viktor Lofgren
8f50f86d0b
(index) Fix error handling
2025-08-05 22:19:23 +02:00
Viktor Lofgren
e3b7ead7a9
(skiplist) Fix aggressive forward pointering
2025-08-05 20:47:38 +02:00
Viktor Lofgren
9a845ba604
(skiplist) EXPERIMENTAL - Store data in a separate file from document ids
2025-08-05 19:10:58 +02:00
Viktor Lofgren
b9381f1603
(skiplist) EXPERIMENTAL - Store data in a separate file from document ids
2025-08-05 17:35:13 +02:00
Viktor Lofgren
6a60127267
(skiplist) EXPERIMENTAL - Store data in a separate file from document ids
2025-08-05 16:54:39 +02:00
Viktor Lofgren
e8ffcfbb19
(skiplist) Correct binary search implementation, fix intersection logic
2025-08-04 14:49:09 +02:00
Viktor Lofgren
caf0850f81
(index) Clean up code
2025-08-04 00:12:35 +02:00
Viktor Lofgren
62e3bb675e
(btree) Remove O_DIRECT btree implementation
2025-08-03 23:43:31 +02:00
Viktor Lofgren
4dc3e7da7a
(perf) Remove warmup from perf test, it's not doing much
2025-08-03 21:19:54 +02:00
Viktor Lofgren
92b09883ec
(index) Switch from AIO to io_uring
...
Turns out AIO is just bad, especially with buffered I/O; io_uring performs strictly better in this scenario.
2025-08-03 21:19:54 +02:00
Viktor Lofgren
87082b4ef8
(index) Use AIO for reading spans and positions
...
This performs slightly worse in benchmarks, but that's likely caused by hitting the page cache.
AIO will tend to perform better when we see cache misses, which is the expected case in production on real-world data.
2025-08-03 21:19:54 +02:00
Viktor Lofgren
84d3f6087f
(skiplist) Parametrize skip list block size, increase to 4K pages
2025-08-03 21:19:54 +02:00
Viktor Lofgren
f93ba371a5
(pool) Fix the LRU to not deadlock and be shit
2025-08-03 21:19:54 +02:00
Viktor Lofgren
5eec27c68d
(pool) Fix for 32 bit rollover in clockHand for LRU
2025-08-03 21:19:54 +02:00
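A clock hand kept in a 32-bit counter eventually wraps to a negative value, and Java's `%` then yields a negative slot index. The following is a minimal sketch of that failure mode and one common remedy, not the project's actual pool code; `ClockHandSketch`, `nextSlotNaive` and `nextSlotSafe` are hypothetical names introduced for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not taken from the PoolLru implementation.
final class ClockHandSketch {
    private final AtomicInteger clockHand;
    private final int poolSize;

    ClockHandSketch(int poolSize, int initialHand) {
        if (poolSize <= 0) {
            throw new IllegalArgumentException("poolSize must be positive");
        }
        this.poolSize = poolSize;
        this.clockHand = new AtomicInteger(initialHand);
    }

    /** Naive version: returns a negative index once the counter has wrapped. */
    int nextSlotNaive() {
        return clockHand.getAndIncrement() % poolSize;
    }

    /** Rollover-safe version: floorMod always returns a value in [0, poolSize). */
    int nextSlotSafe() {
        return Math.floorMod(clockHand.getAndIncrement(), poolSize);
    }
}
```

Starting the hand at `Integer.MAX_VALUE` makes the wrap happen on the second call, which is a convenient way to exercise the edge case in a test.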
Viktor Lofgren
ab01576f91
(pool) Use one global buffer pool instead of many small ones, improved LRU with gclock reclamation, skip list optimization
2025-08-03 21:19:54 +02:00
Viktor Lofgren
054e5ccf44
(pool) Testing synchronized to see if I can find the deadlock
2025-08-03 21:19:54 +02:00
Viktor Lofgren
4351ea5128
(pool) Fix buffer leak
2025-08-03 21:19:54 +02:00
Viktor Lofgren
49cfa3a5e9
(pool) Decrease LQB size
2025-08-03 21:19:54 +02:00
Viktor Lofgren
683854b23f
(pool) Fix logging
2025-08-03 21:19:54 +02:00
Viktor Lofgren
e880fa8945
(pool) Simplify locking in PoolLru
2025-08-03 21:19:54 +02:00
Viktor Lofgren
2482dc572e
(pool) Grow free queue size
2025-08-03 21:19:54 +02:00
Viktor Lofgren
4589f11898
(pool) More stats
2025-08-03 21:19:54 +02:00
Viktor Lofgren
e43b6e610b
(pool) Adjust pool reclamation strategy
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4772117a1f
(skiplist) First stab at a skiplist replacement for btrees in the documents lists
2025-08-03 21:19:53 +02:00
Viktor Lofgren
3fc7ea521c
(pool) Remove readahead and simplify the code
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4372f5af03
(pool) More performant LRU pool + better instructions queue
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4ad89b6c75
(pool) More performant LRU pool
2025-08-03 21:19:53 +02:00
Viktor Lofgren
ad0519e031
(index) Optimizations
2025-08-03 21:19:53 +02:00
Viktor Lofgren
596ece1230
(pool) Fix deadlock during pool starvation
2025-08-03 21:19:53 +02:00
Viktor Lofgren
07b6e1585b
(pool) Bump pool sizes
2025-08-03 21:19:53 +02:00
Viktor Lofgren
cb5e2778eb
(pool) Align the buffers with 512b
2025-08-03 21:19:53 +02:00
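Direct I/O on Linux generally requires buffer addresses, lengths and file offsets to be aligned to the device's logical block size, which is commonly 512 bytes. A minimal sketch of the rounding arithmetic, assuming a power-of-two alignment; `AlignmentSketch` and its methods are hypothetical names and do not come from the project's buffer pool.

```java
// Hypothetical illustration only; overflow near Long.MAX_VALUE is not handled.
final class AlignmentSketch {
    /** Rounds value up to the next multiple of alignment (a power of two). */
    static long alignUp(long value, long alignment) {
        if (alignment <= 0 || (alignment & (alignment - 1)) != 0) {
            throw new IllegalArgumentException("alignment must be a power of two");
        }
        // For a power of two, -alignment is a mask that clears the low bits.
        return (value + alignment - 1) & -alignment;
    }

    /** True if value is a multiple of alignment (a power of two). */
    static boolean isAligned(long value, long alignment) {
        return (value & (alignment - 1)) == 0;
    }
}
```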
Viktor Lofgren
8f5ea7896c
(btree) More debug information on numEntries = 0 scenario
2025-08-03 21:19:53 +02:00
Viktor Lofgren
76c398e0b1
(index) Fix lingering issues with previous optimizations
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4a94f04a8d
(btree) Debug logging
2025-08-03 21:19:53 +02:00
Viktor Lofgren
df72f670d4
(btree) Fix queryData
2025-08-03 21:19:53 +02:00
Viktor Lofgren
eaa22c2f5a
(*) Logging
2025-08-03 21:19:53 +02:00
Viktor Lofgren
7be173aeca
(pool) Only dump statistics if they say anything
2025-08-03 21:19:53 +02:00
Viktor Lofgren
36685bdca7
(btree) Fix retain implementation
2025-08-03 21:19:53 +02:00
Viktor Lofgren
ad04057609
(btree) Add short circuits when retain/rejecting on an empty tree
2025-08-03 21:19:53 +02:00
Viktor Lofgren
eb76ae22e2
(perf) Use lqb size 512 in perf test
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4b858ab341
(btree) Cache retain/reject reads
2025-08-03 21:19:53 +02:00
Viktor Lofgren
c6e3c8aa3b
(index) Focus pools to try to increase reuse
2025-08-03 21:19:53 +02:00
Viktor Lofgren
9128d3907c
(index) Periodically dump buffer metrics
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4ef16d13d4
(index) O_DIRECT based buffer pool for index reads
2025-07-30 15:04:23 +02:00
Viktor Lofgren
838a5626ec
(index) Reduce query buffer size
2025-07-27 21:42:04 +02:00
Viktor Lofgren
6b426209c7
(index) Restore threshold for work stealing in query execution
2025-07-27 21:41:46 +02:00
Viktor Lofgren
452b5731d9
(index) Lower threshold for work stealing in query execution
2025-07-27 21:35:11 +02:00
Viktor Lofgren
c91cf49630
(search) Disable scribe.rip substitution
...
It does not appear to work well
2025-07-27 19:40:58 +02:00
Viktor Lofgren
8503030f18
(search) Fix rare exception in scribe.rip substitution
2025-07-27 19:38:52 +02:00
Viktor Lofgren
744f7d3ef7
(search) Fix rare exception in scribe.rip substitution
2025-07-27 19:34:03 +02:00
Viktor Lofgren
215e12afe9
(index) Shrink query buffer size
2025-07-27 17:33:46 +02:00
Viktor Lofgren
2716bce918
(index) Adjust timeout logic for evaluation
2025-07-27 17:28:34 +02:00
Viktor Lofgren
caf2e6fbb7
(index) Adjust timeout logic for evaluation
2025-07-27 17:27:07 +02:00
Viktor Lofgren
233f0acfb1
(index) Further reduce query buffer size
2025-07-27 17:13:08 +02:00
Viktor Lofgren
e3a4ff02e9
(index) Abandon ongoing evaluation tasks if time is up
2025-07-27 17:04:01 +02:00
Viktor Lofgren
c786283ae1
(index) Reduce query buffer size
2025-07-27 16:57:55 +02:00
Viktor Lofgren
a3f65ac0e0
(deploy) Trigger index deployment
2025-07-27 16:50:23 +02:00
Viktor
aba1a32af0
Merge pull request #217 from MarginaliaSearch/uncompressed-spans-file
...
Index optimizations
2025-07-27 16:49:27 +02:00
Viktor Lofgren
c9c442345b
(perf) Change execution test to use processing rate instead of count
2025-07-27 16:39:51 +02:00
Viktor Lofgren
2e126ba30e
(perf) Change execution test to use processing rate instead of count
2025-07-27 16:37:20 +02:00
Viktor Lofgren
2087985f49
(index) Implement work stealing in IndexQueryExecution as a better approach to backpressure
2025-07-27 16:29:57 +02:00
Viktor Lofgren
2b13ebd18b
(index) Tweak evaluation backlog handling
2025-07-27 16:08:16 +02:00
Viktor Lofgren
6d92c125fe
(perf) Fix perf test
2025-07-27 15:50:28 +02:00
Viktor Lofgren
f638cfa39a
(index) Avoid possibility of negative timeout
2025-07-27 15:39:12 +02:00
Viktor Lofgren
89447c12af
(index) Avoid possibility of negative timeout
2025-07-27 15:24:47 +02:00
Viktor Lofgren
c71fc46f04
(perf) Update perf test with execution scenario
2025-07-27 15:22:07 +02:00
Viktor Lofgren
f96874d828
(sequence) Implement a largestValue abort condition for minDistance()
...
This is something like 3500% faster in certain common scenarios
2025-07-27 15:05:50 +02:00
Viktor Lofgren
583a84d5a0
(index) Clean up of the index query execution logic
2025-07-27 15:05:50 +02:00
Viktor Lofgren
f65b946448
(index) Clean up code
2025-07-27 15:05:50 +02:00
Viktor Lofgren
3682815855
(index) Optimize sequence intersection for the n=1 case
2025-07-26 19:14:32 +02:00
Viktor Lofgren
3a94357660
(index) Perf test tool (WIP!)
2025-07-26 11:49:33 +02:00
Viktor Lofgren
673b0d3de1
(index) Perf test tool (WIP!)
2025-07-26 11:49:31 +02:00
Viktor Lofgren
ea942bc664
(spans) Add signature to the footer of the spans file, including a version byte so we can detect whether to use the old or new decoding logic
2025-07-25 12:07:18 +02:00
Viktor Lofgren
7ed5083c54
(index) Don't split results into chunks
2025-07-25 11:45:07 +02:00
Viktor Lofgren
08bb2c097b
(refac) Clean up the data model used in the index service
2025-07-25 10:54:07 +02:00
Viktor Lofgren
495fb325be
(sequence) Correct sequence intersection bug introduced in optimizations
2025-07-25 10:48:33 +02:00
Viktor Lofgren
05c25bbaec
(chore) Clean up
2025-07-24 23:43:27 +02:00
Viktor Lofgren
2a028b84f3
(chore) Clean up
2025-07-24 20:12:56 +02:00
Viktor Lofgren
a091a23623
(ranking) Remove unnecessary metadata retrievals
2025-07-24 20:08:09 +02:00
Viktor Lofgren
e8897acb45
(ranking) Remove unnecessary metadata retrievals
2025-07-24 20:05:39 +02:00
Viktor Lofgren
b89ffcf2be
(index) Evaluate hash based idx mapping in ForwardIndexReader
2025-07-24 19:47:27 +02:00
Viktor Lofgren
dbcc9055b0
(index) Evaluate using MinMaxPriorityQueue as guts of ResultPriorityQueue
2025-07-24 19:31:51 +02:00
Viktor Lofgren
d9740557f4
(sequence) Optimize intersection logic with a fast abort condition
2025-07-24 19:04:10 +02:00
Viktor Lofgren
0d6cd015fd
(index) Evaluate reading all spans at once
2025-07-24 18:34:11 +02:00
Viktor Lofgren
c6034efcc8
(index) Cache value of bitset cardinality for speed
2025-07-24 17:24:55 +02:00
Viktor Lofgren
76068014ad
(index) More spans optimizations
2025-07-24 15:03:43 +02:00
Viktor Lofgren
1c3ed67127
(index) Byte align document spans
2025-07-24 14:06:14 +02:00
Viktor Lofgren
fc0cb6bd9a
(index) Reserve a larger size for IntArrayList in SeqenceOperations.findIntersections
2025-07-24 14:03:44 +02:00
Viktor Lofgren
c2601bac78
(converter) Remove unnecessary allocation of a 16 KB byte buffer
2025-07-24 13:25:37 +02:00
Viktor Lofgren
f5641b72e9
(index) Fix broken test
2025-07-24 13:21:05 +02:00
Viktor Lofgren
36efe2e219
(index) Optimize PositionsFileReader for concurrent reads
...
In benchmarks this is roughly twice as fast as the previous approach. The main caveat is that we need multiple file descriptors to avoid read instruction serialization by the kernel. This is undesirable since the reads are completely scattershot and can't be reordered by the kernel in a way that optimizes anything.
2025-07-24 13:20:54 +02:00
Viktor Lofgren
983fe3829e
(spans) Evaluate uncompressed spans files
...
Span decompression appears to be somewhat of a performance bottleneck. This change removes compression of the spans file. The spans are still compressed in transit between the converter and index constructor at this stage. The change is intentionally kept small to just evaluate the performance implications, change in file sizes, etc.
2025-07-23 18:10:41 +02:00
Viktor Lofgren
668c87aa86
(ssr) Drop Executor from SSR as it no longer exists
2025-07-23 13:55:41 +02:00
Viktor Lofgren
9d3f9adb05
Force redeploy of everything
2025-07-23 13:36:02 +02:00
Viktor
a43a1773f1
Merge pull request #216 from MarginaliaSearch/deprecate-executor
...
Architecture: Remove the separate executor service and roll it into the index service.
2025-07-23 13:32:42 +02:00
Viktor Lofgren
1e7a3a3c4f
(docs) Update docs to reflect the change
2025-07-23 13:18:23 +02:00
Viktor Lofgren
62b696b1c3
(architecture) Remove the separate executor service and merge it into the index service
...
The primary motivation for this is that in production, the large number of partitioned services has led to an intermittent exhaustion of available database connections, as each service has a connection pool.
The decision to have a separate executor service dates back to when the index service was very slow to start, and the executor didn't always spin off its memory-hungry tasks into separate processes, which meant the executor would sometimes OOM and crash, and it was undesirable to bring the index down with it.
2025-07-23 12:57:13 +02:00
Viktor Lofgren
f1a900f383
(search) Clean up front page mobile design a bit
2025-07-23 12:20:40 +02:00
Viktor Lofgren
700364b86d
(sample) Remove debug logging
...
The problem sat in the desk chair all along
2025-07-21 15:08:20 +02:00
Viktor Lofgren
7e725ddaed
(sample) Remove debug logging
...
The problem sat in the desk chair all along
2025-07-21 14:41:59 +02:00
Viktor Lofgren
120209e138
(sample) Diagnosing compression errors
2025-07-21 14:34:08 +02:00
Viktor Lofgren
a771a5b6ce
(sample) Test different approach to decoding
2025-07-21 14:19:01 +02:00
Viktor Lofgren
dac5b54128
(sample) Better logging for sample errors
2025-07-21 14:03:58 +02:00
Viktor Lofgren
6cfb143c15
(sample) Compress sample HTML data and introduce new API for only getting requests
2025-07-21 13:55:25 +02:00
Viktor Lofgren
23c818281b
(converter) Reduce DomSample logging for NOT_FOUND
2025-07-21 13:37:55 +02:00
Viktor Lofgren
8aad253cf6
(converter) Add more logging around dom sample data retrieval errors
2025-07-21 13:26:38 +02:00
Viktor Lofgren
556d7af9dc
Reapply "(grpc) Use grpc-netty instead of grpc-netty-shaded"
...
This reverts commit b7a5219ed3.
2025-07-21 13:23:32 +02:00
Viktor Lofgren
b7a5219ed3
Revert "(grpc) Use grpc-netty instead of grpc-netty-shaded"
...
Reverting this change to see if it's the cause of some instability issues observed.
2025-07-21 13:10:41 +02:00
Viktor Lofgren
a23ec521fe
(converter) Ensure features is mutable on DetailsWithWords as this is assumed later
2025-07-21 12:50:04 +02:00
Viktor Lofgren
fff3babc6d
(classifier) Add rule for */pixel.gif as likely tracking pixels
2025-07-21 12:35:57 +02:00
Viktor Lofgren
b2bfb8217c
(special) Trigger CD run
2025-07-21 12:28:24 +02:00
Viktor
3b2ac414dc
Merge pull request #210 from MarginaliaSearch/ads-fingerprinting
...
Implement advertisement and popover identification based on DOM sample data
2025-07-21 12:25:31 +02:00
Viktor Lofgren
0ba6515a01
(converter) Ensure converter works well even when dom sample data is unavailable
2025-07-21 12:11:17 +02:00
Viktor Lofgren
16c6b0f151
(search) Add link to new discord community
2025-07-20 20:54:42 +02:00
Viktor Lofgren
e998692900
(converter) Ensure converter works well even when dom sample data is unavailable
2025-07-20 19:24:40 +02:00
Viktor Lofgren
eeb1695a87
(search) Clean up dead code
2025-07-20 19:15:01 +02:00
Viktor Lofgren
a0ab910940
(search) Clean up code
2025-07-20 19:14:13 +02:00
Viktor Lofgren
b9f31048d7
(search) Clean up overlong class names
2025-07-20 19:13:04 +02:00
Viktor Lofgren
12c304289a
(grpc) Use grpc-netty instead of grpc-netty-shaded
...
This will help reduce runaway thread pool sizes
2025-07-20 17:36:25 +02:00
Viktor Lofgren
6ee01dabea
(search) Drastically reduce worker thread count in search-service
2025-07-20 17:16:58 +02:00
Viktor Lofgren
1b80e282a7
(search) Drastically reduce worker thread count in search-service
2025-07-20 16:58:33 +02:00
Viktor Lofgren
a65d18f1d1
(client) Use virtual threads in a few more clients
2025-07-20 14:10:02 +02:00
Viktor Lofgren
90a1ff220b
(ui) Clean up UI
2025-07-19 18:41:36 +02:00
Viktor Lofgren
d6c7092335
(classifier) More rules
2025-07-19 18:41:36 +02:00
Viktor Lofgren
b716333856
(classifier) Match regexes against the path + query only, as well as the full URL
2025-07-19 18:41:36 +02:00
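Matching against the path and query separately matters mostly for anchored rules: a pattern such as `^/pagead/` can never match a full URL that begins with a scheme and host. A minimal sketch of the idea, assuming rules are plain `java.util.regex` patterns applied with `find()`; `UrlRuleSketch` and `matches` are hypothetical names, and the project's classifier may work differently.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the project's request classifier.
final class UrlRuleSketch {
    /** True if the rule matches either the full URL or just its path + query. */
    static boolean matches(Pattern rule, URI uri) {
        String full = uri.toString();

        String path = uri.getRawPath() != null ? uri.getRawPath() : "";
        String query = uri.getRawQuery();
        String pathAndQuery = (query != null) ? path + "?" + query : path;

        return rule.matcher(full).find() || rule.matcher(pathAndQuery).find();
    }
}
```

With this shape, a host-based rule like `^https://tracker\.` and a path-based rule like `^/pagead/` can live in the same rule list.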
Viktor Lofgren
b504b8482c
(classifier) Add new tracker
2025-07-19 18:41:36 +02:00
Viktor Lofgren
80da1e9ad1
(ui) UI cleanup
2025-07-19 18:41:36 +02:00
Viktor Lofgren
d3f744a441
(ui) Add traffic report to overview menu
2025-07-19 18:41:36 +02:00
Viktor Lofgren
60fb539875
(ui) Add explanatory blurb
2025-07-19 18:41:35 +02:00
Viktor Lofgren
7f5094fedf
(ui) Clean up UI
2025-07-19 18:41:35 +02:00
Viktor Lofgren
45066636a5
(classifier) Add classification for domains that make 3rd party requests
2025-07-19 18:41:35 +02:00
Viktor Lofgren
e2d6898c51
(search) Change tag colors to more pleasant ones
2025-07-19 18:41:35 +02:00
Viktor Lofgren
58ef767b94
(search) Improve traffic report UI
2025-07-19 18:41:35 +02:00
Viktor Lofgren
f9f268c67a
(grpc) Improve error handling
2025-07-19 18:41:35 +02:00
Viktor Lofgren
f44c2bdee9
(chore) Cleanup
2025-07-19 18:41:35 +02:00
Viktor Lofgren
6fdf477c18
(refac) Move DomSampleClassification to top level
2025-07-19 18:41:35 +02:00
Viktor Lofgren
6b6e455e3f
(classifier) Clean up xml
2025-07-19 18:41:35 +02:00
Viktor Lofgren
a3a126540c
(classifier) Add README.md
2025-07-19 18:41:35 +02:00
Viktor Lofgren
842b19da40
(search) Mobile layout + phrasing
2025-07-19 18:41:35 +02:00
Viktor Lofgren
2a30e93bf0
(classifier)
2025-07-19 18:41:34 +02:00
Viktor Lofgren
3d998f12c0
(search) Use display name where possible
2025-07-19 18:41:34 +02:00
Viktor Lofgren
cbccc2ac23
(classification) Add /ccm/collect as an ads-related request
2025-07-19 18:41:34 +02:00
Viktor Lofgren
2cfc23f9b7
(search) Fix layout for mobile
2025-07-18 19:06:23 +02:00
Viktor Lofgren
88fe394cdb
(request-classifier) Add rule for /pagead/
2025-07-18 19:01:33 +02:00
Viktor Lofgren
f30fcebd4f
Remove dead code
2025-07-18 18:56:42 +02:00
Viktor Lofgren
5d885927b4
(search) Fix layout and presentation
2025-07-18 17:54:47 +02:00
Viktor Lofgren
7622c8358e
(request-classifier) Adjust flagging of a few hosts
2025-07-18 17:54:46 +02:00
Viktor Lofgren
69ed9aef47
(ddgt) Load global tracker data
2025-07-18 17:02:50 +02:00
Viktor Lofgren
4c78c223da
(search) Fix endpoint collection
2025-07-18 16:59:05 +02:00
Viktor Lofgren
71b9935dd6
(search) Add warmup to programmatic tailwind classes, fix word break
2025-07-18 16:49:31 +02:00
Viktor Lofgren
ad38f2fd83
(search) Hide classification tag on unclassified requests
2025-07-18 15:45:40 +02:00
Viktor Lofgren
9c47388846
(search) Improve display ordering
2025-07-18 15:44:55 +02:00
Viktor Lofgren
d9ab10e33f
(search) Fix tracker data for the correct domain
2025-07-18 15:29:15 +02:00
Viktor Lofgren
e13ea7f42b
(search) Sort results by classifications
2025-07-18 14:51:35 +02:00
Viktor Lofgren
f38daeb036
(WIP) First stab at a GUI for viewing network traffic
...
The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter.
The change also adds a parser for DDG's tracker radar data.
2025-07-18 13:58:57 +02:00
Viktor Lofgren
6e214293e5
(ping) Fix backoff value overflow
2025-07-16 19:50:12 +02:00
Viktor Lofgren
52582a6d7d
(experiment) Also add clients to loom experiment
2025-07-16 18:08:00 +02:00
Viktor Lofgren
ec0e39ad32
(experiment) Also add clients to loom experiment
2025-07-16 17:28:57 +02:00
Viktor Lofgren
6a15aee4b0
(ping) Fix arithmetic errors in backoff strategy due to long overflow
2025-07-16 17:23:36 +02:00
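Exponential backoff computed by repeated doubling or shifting silently overflows a `long` after enough consecutive failures. The sketch below shows one way to cap the delay before the multiplication can overflow; it is an assumption about the general technique, not the project's backoff strategy, and `BackoffSketch.backoffMillis` is a hypothetical name.

```java
// Hypothetical illustration only; not the ping service's backoff code.
final class BackoffSketch {
    /**
     * Doubles baseMillis once per failure, never exceeding maxMillis
     * and never overflowing, regardless of how large failures is.
     */
    static long backoffMillis(long baseMillis, long maxMillis, int failures) {
        if (baseMillis <= 0 || maxMillis <= 0) {
            throw new IllegalArgumentException("durations must be positive");
        }
        long delay = Math.min(baseMillis, maxMillis);
        for (int i = 0; i < failures && delay < maxMillis; i++) {
            // Compare against half the cap first, so delay * 2 cannot overflow.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }
}
```

The loop exits as soon as the cap is reached, so a very large failure count costs at most about 63 iterations.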
Viktor Lofgren
bd5111e8a2
(experimental) Add flag for using loom/virtual threads in gRPC executor
2025-07-16 17:12:07 +02:00
Viktor Lofgren
1ecbeb0272
(doc) Update ROADMAP.md
2025-07-14 13:38:34 +02:00
Viktor Lofgren
b91354925d
(converter) Index documents even when they are short
...
... but assign short documents a special flag and penalize them in index lookups
2025-07-14 12:24:25 +02:00
Viktor Lofgren
3f85c9c154
(refac) Clean up code
2025-07-14 11:55:21 +02:00
Viktor Lofgren
390f053406
(api) Add query parameter 'dc' for specifying the max number of results per domain
2025-07-14 10:09:30 +02:00
Viktor Lofgren
89e03d6914
(chore) Idiomatic error handling in gRPC clients
...
responseObserver.onError(...) should be passed Status.WHATEVER.foo().asRuntimeException() and not random throwables as was done before.
2025-07-13 02:59:22 +02:00
Viktor Lofgren
14e0bc9f26
(index) Add comment about encoding caveat
2025-07-13 02:47:00 +02:00
Viktor Lofgren
7065b46c6f
(index) Add penalties for new feature flags from dom sample
2025-07-13 02:37:30 +02:00
Viktor Lofgren
0372190c90
(index, refac) Move domain ranking to a better named package
2025-07-13 02:37:29 +02:00
Viktor Lofgren
ceaf32fb90
(converter) Integrate dom sample features into the converter
2025-07-13 01:38:28 +02:00
Viktor Lofgren
b03c43224c
(search) Fix redirects in new search UI
2025-07-11 23:44:45 +02:00
Viktor Lofgren
9b4ce9e9eb
(search) Fix !w redirect
2025-07-11 23:28:09 +02:00
Viktor
81ac02a695
Merge pull request #209 from us3r1d/master
...
added converter.insertFoundDomains property
2025-07-11 21:34:04 +02:00
krystal
47f624fb3b
changed converter.insertFoundDomains to loader.insertFoundDomains
2025-07-11 12:13:45 -07:00
Viktor Lofgren
b57db01415
(converter) Clean out some old and redundant advertisement and tracking detection code
2025-07-11 19:32:25 +02:00
Viktor Lofgren
ce7d522608
(converter) First basic hook-in of the new dom sample classifier into the converter workflow
2025-07-11 16:57:37 +02:00
Viktor Lofgren
18649b6ee9
(converter) Move DomSampleClassifier to converter's code tree
2025-07-11 16:12:48 +02:00
Viktor Lofgren
f6417aef1a
(converter) Additional code cleanup
2025-07-11 15:58:48 +02:00
Viktor Lofgren
2aa7e376b0
(converter) Clean up code around document deduplication
2025-07-11 15:54:28 +02:00
Viktor Lofgren
f33bc44860
(dom-sample) Create API for fetching DOM sample data across services
2025-07-11 15:41:10 +02:00
Viktor Lofgren
a2826efd44
(dom-sample) First stab at classifying outgoing requests from DOM sample data
2025-07-11 15:41:10 +02:00
krystal
c866f19cbb
added converter.insertFoundDomains property
2025-07-10 15:36:59 -07:00
Viktor Lofgren
518278493b
(converter) Increase the max byte length when parsing crawled documents to 500 kB from 200 kB.
2025-07-08 21:22:02 +02:00
Viktor Lofgren
1ac0bab0b8
(converter) Also exclude length checks when lenient processing is enabled
2025-07-08 20:37:53 +02:00
Viktor Lofgren
08b45ed10a
(converter) Add system property converter.lenientProcessing to disable most disqualification checks
2025-07-08 19:44:51 +02:00
Viktor Lofgren
f2cfb91973
(converter) Add audit log of converter errors and rejections
2025-07-08 19:15:41 +02:00
Viktor Lofgren
2f79524eb3
(refac) Rename ProcessService to ProcessSpawnerService for clarity
2025-07-07 15:48:44 +02:00
Viktor Lofgren
3b00142c96
(search) Don't say unknown domains are in the crawler queue
2025-07-06 18:42:36 +02:00
Viktor Lofgren
294ab19177
(status) Use old-search for status service instead of marginalia-search.com
2025-07-06 15:40:53 +02:00
Viktor Lofgren
6f1659ecb2
(control) Add GUI for NSFW Filter Update trigger
2025-06-25 16:03:27 +02:00
Viktor Lofgren
982dcb28f0
(live-crawler) Use Apache HttpClient + code cleanup
2025-06-24 13:04:19 +02:00
Viktor Lofgren
fc686d8b2e
(live-crawler) Fix startup race condition
...
The fix makes sure we wait for the feeds API to be available before fetching from it, so that the process doesn't crash on a cold system reboot.
2025-06-24 11:42:41 +02:00
Viktor Lofgren
69ef0f334a
(rss) Make feed fetcher use Apache's HttpClient
2025-06-23 18:49:55 +02:00
Viktor Lofgren
446746f3bd
(control) Fix so that sideload actions show up in Mixed profile nodes
2025-06-23 18:08:09 +02:00
Viktor Lofgren
24ab8398bb
(ndp) Use LinkGraphClient to populate NDP table
2025-06-23 16:44:38 +02:00
Viktor Lofgren
d2ceeff4cf
(ndp) Add toggle for excluding nodes from assignment via NDP
2025-06-23 15:38:02 +02:00
Viktor Lofgren
cf64214b1c
(ndp) Update documentation
2025-06-23 15:18:35 +02:00
Viktor Lofgren
e50d09cc01
(crawler) Remove illegal requests when denied via robots.txt
...
The commit removes attempts at probing the root document, feed URLs, and favicon if we are not permitted to do so via robots.txt
2025-06-22 17:10:44 +02:00
Viktor Lofgren
bce3892ce0
(ndp) Simplify code
2025-06-22 16:08:55 +02:00
Viktor Lofgren
36581b25c2
(ndp) Fix process tracking in domain discovery process
2025-06-21 14:35:25 +02:00
Viktor Lofgren
52ff7fb4dd
(ndp) Add a process for adding new domains to be crawled
...
This is a working "work in progress" commit, will need more refinement, but given the usual difficulties in testing crawler-adjacent code without actually crawling, it needs some maturation time in production.
2025-06-21 14:10:27 +02:00
Viktor Lofgren
a4e49e658a
(ping) Add README for ping
2025-06-19 11:21:52 +02:00
Viktor Lofgren
e2c56dc3ca
(search) Clean up the rate limiting
...
We fail quietly to make life harder for the bot farmers
2025-06-18 11:26:30 +02:00
Viktor Lofgren
470b866008
(search) Clean up the rate limiting
...
We fail quietly to make life harder for the bot farmers
2025-06-18 11:22:26 +02:00
Viktor Lofgren
4895a2ac7a
(search) Clean up the rate limiting
...
We fail quietly to make life harder for the bot farmers
2025-06-18 11:20:24 +02:00
Viktor Lofgren
fd32ae9fa7
(search) Add automatic rate limiting to /site
...
Fix typo
2025-06-18 11:10:08 +02:00
Viktor Lofgren
470651ea4c
(search) Add automatic rate limiting to /site
2025-06-18 11:04:36 +02:00
Viktor Lofgren
8d4829e783
(ping) Change cookie specification to ignore cookies
2025-06-17 12:26:34 +02:00
Viktor Lofgren
1290bc15dc
(ping) Reduce retries for SocketException and pals
2025-06-16 22:35:33 +02:00
Viktor Lofgren
e7fa558954
(ping) Disable some cert validation logic for now
2025-06-16 22:00:32 +02:00
Viktor Lofgren
720685bf3f
(ping) Persist more detailed information about why a cert is invalid
...
The change also alters the validator to be less judgemental, and accept some invalid chains based on looking like we've simply not got access to a (valid) intermediate cert.
2025-06-16 19:44:22 +02:00
Viktor Lofgren
cbec63c7da
(ping) Pull root certificates from cacerts.pem
2025-06-16 19:21:05 +02:00
Viktor Lofgren
b03ca75785
(ping) Correct test so that it does not spam an innocent webmaster with requests
2025-06-16 17:06:14 +02:00
Viktor Lofgren
184aedc071
(ping) Deploy new custom cert validator for fingerprinting purposes
2025-06-16 16:36:23 +02:00
Viktor Lofgren
0275bad281
(ping) Limit SSL certificate validity dates to a maximum timestamp as permitted by database
2025-06-16 00:32:03 +02:00
Viktor Lofgren
fd83a9d0b8
(ping) Handle null case for Subject Alternative Names in SSL certificates
2025-06-16 00:27:37 +02:00
Viktor Lofgren
d556f8ae3a
(ping) Ping server should not validate certificates
2025-06-16 00:08:30 +02:00
Viktor Lofgren
e37559837b
(crawler) Crawler should validate certificates
2025-06-16 00:06:57 +02:00
Viktor Lofgren
3564c4aaee
(ping) Route SSLHandshakeException to ConnectionError as well
...
This will mean we retry these as an unencrypted HTTP connection
2025-06-15 20:31:33 +02:00
Viktor Lofgren
92c54563ab
(ping) Reduce retry count on connection errors
2025-06-15 18:39:54 +02:00
Viktor Lofgren
d7a5d90b07
(ping) Store redirect location in availability record
2025-06-15 18:39:33 +02:00
Viktor Lofgren
0a0e88fd6e
(ping) Fix schema drift between prod and flyway migrations
2025-06-15 17:20:21 +02:00
Viktor Lofgren
b4fc0c4368
(ping) Fix schema drift between prod and flyway migrations
2025-06-15 17:17:11 +02:00
Viktor Lofgren
87ee8765b8
(ping) Ensure ProtocolError->HTTP_CLIENT_ERROR retains its error message information
2025-06-15 16:54:27 +02:00
Viktor Lofgren
1adf4835fa
(ping) Add schema change information to domain security events
...
Particularly the HTTPS->HTTP-change event appears to be a strong indicator of domain parking.
2025-06-15 16:47:49 +02:00
Viktor Lofgren
b7b5d0bf46
(ping) More accurately detect connection errors
2025-06-15 16:47:07 +02:00
Viktor Lofgren
416059adde
(ping) Avoid thread starvation scenario in job scheduling
...
Adjust the queueing strategy to avoid thread starvation from whale domains with many subdomains all locking on the same semaphore and gunking up all threads by implementing a mechanism that returns jobs that can't be executed to the queue.
This will lead to some queue churn, but it should be fairly manageable given the small number of threads involved, and the fairly long job execution times.
2025-06-15 11:04:34 +02:00
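The requeueing mechanism described above can be sketched roughly as follows: a worker tries to acquire the per-domain permit without blocking, and if that fails it puts the job back on the queue and moves on. This is a simplified single-process sketch under stated assumptions; `RequeueSketch`, `Job`, `lockFor` and `processOne` are hypothetical names, and the real implementation may differ substantially.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

// Hypothetical illustration only; not the ping service's scheduler.
final class RequeueSketch {
    record Job(String rootDomain, Runnable task) {}

    private final BlockingQueue<Job> queue = new LinkedBlockingQueue<>();
    private final Map<String, Semaphore> locks = new ConcurrentHashMap<>();

    void submit(Job job) {
        queue.add(job); // unbounded queue, so add() cannot fail here
    }

    int pending() {
        return queue.size();
    }

    /** One permit per root domain, shared by all of its subdomains. */
    Semaphore lockFor(String rootDomain) {
        return locks.computeIfAbsent(rootDomain, d -> new Semaphore(1));
    }

    /** Runs the next job if its domain is free; otherwise returns it to the queue. */
    boolean processOne() {
        Job job = queue.poll();
        if (job == null) {
            return false; // nothing to do
        }
        Semaphore permit = lockFor(job.rootDomain());
        if (!permit.tryAcquire()) {
            queue.add(job); // domain is busy: requeue instead of blocking this thread
            return false;
        }
        try {
            job.task().run();
            return true;
        } finally {
            permit.release();
        }
    }
}
```

The trade-off is the queue churn the commit message mentions: a busy domain's jobs cycle through the queue until the permit frees up, but no worker thread is ever parked waiting on it.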
Viktor Lofgren
db7930016a
(coordination) Trial the use of zookeeper for coordinating semaphores across multiple crawler-like processes
...
+ fix two broken tests
2025-06-14 16:20:01 +02:00
Viktor Lofgren
82456ad673
(coordination) Trial the use of zookeeper for coordinating semaphores across multiple crawler-like processes
...
The performance implication of this needs to be evaluated. If it does not hold water, some other solution may be required instead.
2025-06-14 16:16:10 +02:00
Viktor Lofgren
0882a6d9cd
(ping) Correct retry logic by handling missing Retry-After header
2025-06-14 12:54:07 +02:00
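A server may omit the `Retry-After` header entirely, so retry logic needs a fallback delay rather than assuming the header is present. A minimal sketch, assuming the header value is passed in as a nullable string and only the delay-in-seconds form is parsed (the HTTP-date form falls back to the default); `RetryAfterSketch.retryDelay` is a hypothetical name.

```java
import java.time.Duration;

// Hypothetical illustration only; not the ping service's retry code.
final class RetryAfterSketch {
    /** Returns the delay requested by Retry-After, or fallback if absent or unparseable. */
    static Duration retryDelay(String retryAfterHeader, Duration fallback) {
        if (retryAfterHeader == null || retryAfterHeader.isBlank()) {
            return fallback; // header missing
        }
        try {
            long seconds = Long.parseLong(retryAfterHeader.trim());
            return seconds >= 0 ? Duration.ofSeconds(seconds) : fallback;
        } catch (NumberFormatException e) {
            return fallback; // e.g. an HTTP-date, which this sketch does not parse
        }
    }
}
```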
Viktor Lofgren
5020029c2d
(ping) Fix startup sequence for new primary-only flow
2025-06-14 12:48:09 +02:00
Viktor Lofgren
ac44d0b093
(ping) Fix wait logic to use synchronized block
2025-06-14 12:38:16 +02:00
Viktor Lofgren
4b32b9b10e
Update DomainAvailabilityRecord to use clamped integer for HTTP response time
2025-06-14 12:37:58 +02:00
Viktor Lofgren
9f041d6631
(ping) Drop the concept of primary and secondary ping instances
...
There was an idea of having the ping service duck over to a realtime partition when the partition is crawling, but this hasn't been working out well, so the concept will be retired and all nodes will run as primary.
2025-06-14 12:32:08 +02:00
Viktor Lofgren
13fb1efce4
(ping) Populate ASN field on DomainSecurityInformation
2025-06-13 15:45:43 +02:00
Viktor Lofgren
c1225165b7
(ping) Add summary fields CHANGE_SERIAL_NUMBER and CHANGE_ISSUER to DOMAIN_SECURITY_EVENTS
2025-06-13 15:30:45 +02:00
Viktor Lofgren
67ad7a3bbc
(ping) Enhance HTTP ping logic to retry GET requests for specific status codes and add sleep duration between retries
2025-06-13 12:59:56 +02:00
Viktor Lofgren
ed62ec8a35
(random) Sanitize random search results with DOMAIN_AVAILABILITY_INFORMATION join
2025-06-13 10:38:21 +02:00
Viktor Lofgren
42b24cfa34
(ping) Fix NPE in dnsJobConsumer
2025-06-12 14:22:09 +02:00
Viktor Lofgren
1ffaab2da6
(ping) Mute logging along the happy path now that things are working
2025-06-12 14:15:23 +02:00
Viktor Lofgren
5f93c7f767
(ping) Update PROC_PING_SPAWNER to use REALTIME from SIDELOAD
2025-06-12 14:04:09 +02:00
Viktor Lofgren
4001c68c82
(ping) Update SQL query to include NODE_AFFINITY in historical availability data retrieval
2025-06-12 13:58:50 +02:00
Viktor Lofgren
6b811489c5
(actor) Make ping spawner auto-spawn the process
2025-06-12 13:46:50 +02:00
Viktor Lofgren
e9d317c65d
(ping) Parameterize thread counts for availability and DNS job consumers
2025-06-12 13:34:58 +02:00
Viktor Lofgren
16b05a4737
(ping) Reduce maximum total connections in HttpClientProvider to improve resource management
2025-06-12 13:04:55 +02:00
Viktor Lofgren
021cd73cbb
(ping) Reduce db contention by moving job scheduling out of the database to RAM
2025-06-12 12:56:33 +02:00
Viktor Lofgren
4253bd53b5
(ping) Fix issue where errors were not correctly labeled in availability
2025-06-12 00:18:07 +02:00
Viktor Lofgren
14c87461a5
(ping) Fix issue where errors were not correctly labeled in availability
2025-06-12 00:04:39 +02:00
Viktor Lofgren
9afed0a18e
(ping) Optimize parameters
...
Reduce socket and connection timeouts in HttpClient and adjust thread counts for job consumers
2025-06-11 16:21:45 +02:00
Viktor Lofgren
afad4deb94
(ping) Fix DB query to prioritize DNS information updates correctly
...
This also reduces CPU%
2025-06-11 14:58:28 +02:00
Viktor Lofgren
f071c947e4
(ping) Truncate data before inserting into db
2025-06-11 14:29:30 +02:00
Viktor Lofgren
79996c9348
(ping) Adjust thread counts based on observed processing times
2025-06-11 14:29:17 +02:00
Viktor Lofgren
db907ab06a
(ping) Update availabilityJobQueue to use put method to block rather than blow up
2025-06-11 14:22:24 +02:00
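On a bounded `BlockingQueue`, `add` throws `IllegalStateException` when the queue is full, whereas `put` waits for space. The sketch below contrasts the two; the class and method names are hypothetical and the queue element type is chosen only for illustration.

```java
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; not the ping service's job queue code.
final class QueuePutSketch {
    /** Uses add(): fails immediately when a bounded queue is full. */
    static boolean addOrReport(BlockingQueue<String> queue, String item) {
        try {
            return queue.add(item);
        } catch (IllegalStateException e) {
            return false; // queue full; the caller would otherwise see the exception
        }
    }

    /** Uses put(): blocks until space is available, giving natural backpressure. */
    static boolean putBlocking(BlockingQueue<String> queue, String item) {
        try {
            queue.put(item);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```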
Viktor Lofgren
c49cd9dd95
(ping) Truncate fields in the builder to give consistent comparison without blowing up the database inserts.
2025-06-11 14:20:54 +02:00
Viktor Lofgren
eec9df3b0a
(ping) Truncate X-Frame-Options to 50 characters
2025-06-11 14:17:08 +02:00
Viktor
e5f3288de6
Merge pull request #205 from MarginaliaSearch/ping-server
...
Create domain availability pinging service (WIP)
2025-06-11 14:05:24 +02:00
Viktor Lofgren
d587544d3a
(refac) Rename PingJob classes and methods to AvailabilityJob for improved clarity and consistency
2025-06-11 13:52:18 +02:00
Viktor Lofgren
1a9ae1bc40
(ping) Minor bugfixes
2025-06-11 13:41:17 +02:00
Viktor Lofgren
e0c81e956a
(ping) Remove planned support for actual icmp ping
...
ICMP ping is a pain in the ass from Java, and it would have added at best marginal benefit since so few servers permit it.
2025-06-11 11:10:42 +02:00
Viktor Lofgren
542fb12b38
(ping) Add partitioning to events tables
...
This lets us migrate off the live database into either a columnar database or cold storage without expensive maintenance periods, as TRUNCATE PARTITION is effectively instantaneous.
2025-06-11 10:54:24 +02:00
Viktor Lofgren
65ec734566
(ping, refac) Rename domain ping status to domain availability information
2025-06-11 10:34:31 +02:00
Viktor Lofgren
10b6a25c63
(nsfw) Fix SQL error on duplicate domains
2025-06-11 00:11:26 +02:00
Viktor Lofgren
6260f6bec7
(ping) Refactor DomainPingStatusFactory to consolidate error handling methods and improve code clarity
2025-06-11 00:05:39 +02:00
Viktor Lofgren
d6d5467696
(ping) Add domain pinging service
2025-06-10 18:28:13 +02:00
Viktor Lofgren
034560ca75
(crawler) Add locking mechanism to avoid multiple crawler instances running in parallel on the same node
2025-06-07 16:18:05 +02:00
Viktor Lofgren
e994fddae4
(service) Add process event log object
2025-06-07 16:16:08 +02:00
Viktor Lofgren
345f01f306
(discovery) Add inter-JVM lock via zookeeper
2025-06-07 16:07:27 +02:00
Viktor
5a8e286689
Merge pull request #204 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2025-06-07 14:01:13 +02:00
Viktor
39a055aa94
Update ROADMAP.md
2025-06-07 14:01:01 +02:00
Viktor Lofgren
37aaa90dc9
(deploy) Clean up deploy script
2025-06-07 13:43:56 +02:00
Viktor
24022c5adc
Merge pull request #203 from MarginaliaSearch/nsfw-domain-lists
...
Nsfw blocking via UT1 domain lists
2025-06-07 13:24:05 +02:00
Viktor Lofgren
1de9ecc0b6
(nsfw) Add metrics to the filtering so we can monitor it
2025-06-07 13:17:05 +02:00
Viktor Lofgren
9b80245ea0
(nsfw) Move filtering to the IndexApiClient, and add filtering options to the internal APIs and public API.
2025-06-07 12:54:20 +02:00
Viktor Lofgren
4e1595c1a6
(nsfw) Initial work on adding UT1-based domain filtering
2025-06-06 14:23:37 +02:00
Viktor Lofgren
0be8585fa5
Add tag format hint to deploy script
2025-06-06 10:03:18 +02:00
Viktor Lofgren
a0fe070fe7
Redeploy browserless and assistant.
2025-06-06 09:51:39 +02:00
Viktor Lofgren
abe9da0fc6
(search) Ensure the new search UI sets the correct content-type for opensearch.xml
2025-05-29 12:44:55 +02:00
Viktor Lofgren
56d0128b0a
(dom-sample) Remove redundant code
2025-05-28 17:43:46 +02:00
Viktor Lofgren
840b68ac55
(dom-sample) Minor cleanups
2025-05-28 16:27:27 +02:00
Viktor Lofgren
c34ff6d6c3
(dom-sample) Use WAL journal for dom sample db
2025-05-28 16:16:28 +02:00
Viktor Lofgren
32780967d8
(dom-sample) Initialize dom sampler
2025-05-28 16:06:05 +02:00
Viktor Lofgren
7330bc489d
(deploy) Correct deploy script for browserless
2025-05-28 15:58:12 +02:00
Viktor Lofgren
ea23f33738
(deploy) Correct deploy script for headlesschrome
2025-05-28 15:56:05 +02:00
Viktor Lofgren
4a8a028118
(deploy) Deploy assistant and browserless
2025-05-28 15:50:26 +02:00
Viktor
a25bc647be
Merge pull request #201 from MarginaliaSearch/website-capture
...
Capture website snapshots
2025-05-28 15:49:03 +02:00
Viktor Lofgren
a720dba3a2
(deploy) Add browserless to deploy script
2025-05-28 15:48:32 +02:00
Viktor Lofgren
284f382867
(dom-sample) Fix initialization to work the same as screenshot capture
2025-05-28 15:40:09 +02:00
Viktor Lofgren
a80717f138
(dom-sample) Cleanup
2025-05-28 15:32:54 +02:00
Viktor Lofgren
d6da715fa4
(dom-sample) Add basic retrieval logic
...
First iteration is single threaded for simplicity
2025-05-28 15:18:15 +02:00
Viktor Lofgren
c1ec7aa491
(dom-sample) Add a boolean to the sample db when we've accepted a cookie dialogue
2025-05-28 14:45:19 +02:00
Viktor Lofgren
3daf37e283
(dom-sample) Improve storage of DOM sample data
2025-05-28 14:34:34 +02:00
Viktor Lofgren
44a774d3a8
(browserless) Add --pull option to Docker build command
...
This ensures we fetch the latest base image when we build.
2025-05-28 14:09:32 +02:00
Viktor Lofgren
597aeaf496
(website-capture) Correct manifest
...
run_at is set at the content_script level, not the root object.
2025-05-28 14:05:16 +02:00
Viktor Lofgren
06df7892c2
(website-capture) Clean up code
2025-05-27 15:56:59 +02:00
Viktor Lofgren
dc26854268
(website-capture) Add a marker to the network log when we've accepted a cookie dialog
2025-05-27 15:21:02 +02:00
Viktor Lofgren
9f16326cba
(website-capture) Add logic that automatically identifies and agrees to cookie consent popovers
...
Oftentimes, ads don't load until after you've agreed to the popover.
2025-05-27 15:11:47 +02:00
Viktor Lofgren
ed66d0b3a7
(website-capture) Amend the extension to also capture web request information
2025-05-26 14:00:43 +02:00
Viktor Lofgren
c3afc82dad
(website-capture) Rename scripts to be more consistent with extension terminology
2025-05-26 13:13:11 +02:00
Viktor Lofgren
08e25e539e
(website-capture) Minor cleanups
2025-05-21 14:55:03 +02:00
Viktor Lofgren
4946044dd0
(website-capture) Update BrowserlesClient to use the new image
2025-05-21 14:14:18 +02:00
Viktor Lofgren
edf382e1c5
(website-capture) Add a custom docker image with a new custom extension for DOM capture
...
The original approach of injecting javascript into the page directly didn't work with pages that reloaded themselves. To work around this, a chrome extension is used instead that does the same work, but subscribes to reload events and re-installs the change listener.
2025-05-21 14:13:54 +02:00
Viktor Lofgren
644cba32e4
(website-capture) Remove dead imports
2025-05-20 16:08:48 +02:00
Viktor Lofgren
34b76390b2
(website-capture) Add storage object for DOM samples
2025-05-20 16:05:54 +02:00
Viktor Lofgren
43cd507971
(crawler) Add a migration workaround so we can still open old slop crawl data with the new column added
2025-05-19 14:47:38 +02:00
Viktor Lofgren
cc40e99fdc
(crawler) Add a migration workaround so we can still open old slop crawl data with the new column added
2025-05-19 14:37:59 +02:00
Viktor Lofgren
8a944cf4c6
(crawler) Add request time to crawl data
...
This is an interesting indicator of website quality.
2025-05-19 14:07:41 +02:00
Viktor Lofgren
1c128e6d82
(crawler) Add request time to crawl data
...
This is an interesting indicator of website quality.
2025-05-19 14:02:03 +02:00
Viktor Lofgren
be039d1a8c
(live-capture) Add a new function for capturing the DOM of a website after rendering
...
The new code injects a javascript that attempts to trigger popovers, and then alters the DOM to add attributes containing CSS elements with position and visibility.
2025-05-19 13:26:07 +02:00
Viktor Lofgren
4edc0d3267
(converter) Increase work buffer for converter
...
Conversion on index node 7 in production is crashing ostensibly because this buffer is too small.
2025-05-18 13:22:44 +02:00
Viktor Lofgren
890f521d0d
(pdf) Fix crash for some bold lines
2025-05-18 13:05:05 +02:00
Viktor Lofgren
b1814a30f7
(deploy) Redeploy all services.
2025-05-17 13:11:51 +02:00
Viktor Lofgren
f59a9eb025
(legacy-search) Soften domain limit constraints in URL deduplication
2025-05-17 00:04:27 +02:00
Viktor Lofgren
599534806b
(search) Soften domain limit constraints in URL deduplication
2025-05-17 00:00:42 +02:00
Viktor Lofgren
7e8253dac7
(search) Clean up debug logging
2025-05-17 00:00:28 +02:00
Viktor Lofgren
97a6780ea3
(search) Add debug logging for specific query
2025-05-16 23:41:35 +02:00
Viktor Lofgren
eb634beec8
(search) Add debug logging for specific query
2025-05-16 23:34:03 +02:00
Viktor Lofgren
269ebd1654
Revert "(query) Add debug logging for specific query"
...
This reverts commit 39ce40bfeb.
2025-05-16 23:29:06 +02:00
Viktor Lofgren
39ce40bfeb
(query) Add debug logging for specific query
2025-05-16 23:23:53 +02:00
Viktor Lofgren
c187b2e1c1
(search) Re-enable clustering
2025-05-16 23:20:16 +02:00
Viktor Lofgren
42eaa4588b
(search) Disable clustering for a moment
2025-05-16 23:17:01 +02:00
Viktor Lofgren
4f40a5fbeb
(search) Reduce log spam
2025-05-16 23:15:07 +02:00
Viktor Lofgren
3f3d42bc01
(search) Re-enable deduplication
2025-05-16 23:14:54 +02:00
Viktor Lofgren
61c8d53e1b
(search) Disable deduplication for a moment
2025-05-16 23:10:32 +02:00
Viktor Lofgren
a7a3d85be9
(search) Increase search timeout by 50ms
2025-05-16 22:54:12 +02:00
Viktor Lofgren
306232fb54
(pdf) Fix handling of a few corner cases
...
Deal better with documents which change font on blank spaces.
2025-05-13 18:44:28 +02:00
Viktor Lofgren
5aef844f0d
(dependency) Increase slop version to 0.0.11
...
v0.0.11 uses atomic moves. This ensures we don't encounter a race condition in the backup service with lingering .tmp-files that should have been renamed.
2025-05-12 14:09:16 +02:00
Viktor
d56b5c828a
Merge pull request #198 from MarginaliaSearch/process-pdf-files
...
Add support for processing PDF files. The changeset adds a dependency on pdfbox, and vendors/modifies its PDFTextStripper to extract additional semantics from the documents.
PDF is not a text-based format but a graphical one, which may contain a stream of characters and positions (sometimes overlapping, rotated, or out of order), so identifying something like a header or a paragraph is a non-trivial task, let alone extracting any text at all. A number of heuristics are used to try to accomplish this. They aren't perfect, but they are about as good as you're going to get without going to something like a vision-based LLM, which would be ridiculously expensive to apply at internet search engine scale.
The change also adds format information to the JSON API, as well as indicators in the GUI for PDF files.
2025-05-11 16:43:25 +02:00
Viktor Lofgren
ab58a4636f
(pdf) Disable tests that require specific sample data that can't go in the repo
2025-05-11 16:42:23 +02:00
Viktor Lofgren
00be269238
(search) Add PDF indicator in "also from"-segment
2025-05-11 16:35:52 +02:00
Viktor Lofgren
879e6a9424
(pdf) Identify additional headings based on font weight
2025-05-11 16:35:52 +02:00
Viktor Lofgren
fba3455732
(pdf) Clean up code
2025-05-11 16:35:52 +02:00
Viktor Lofgren
14283da7f5
(pdf) Clean up generated DOM
...
Sometimes empty <p>-tags are inserted, which messes with the header joining process. Removes those nodes.
2025-05-11 15:12:09 +02:00
Viktor Lofgren
93df4d1fc0
(pdf) Improve summary extraction for PDFs
2025-05-11 14:33:11 +02:00
Viktor Lofgren
b12a0b998c
(pdf) Use smarter heuristics for paragraph splitting
...
We look at the median line distance, with outliers removed, to figure out when to break lines, as the original approach works poorly with e.g. double line spaced documents.
2025-05-11 14:29:42 +02:00
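The heuristic described in commit b12a0b998c can be sketched roughly as follows. This is a minimal illustration, not the vendored PDFTextStripper code: the class name, the 10% trimming rule, and the `tolerance` multiplier are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the trimming fraction and tolerance are assumptions.
final class ParagraphSplitSketch {

    /** Median of the gaps after dropping the smallest and largest 10% as outliers. */
    static double trimmedMedian(double[] gaps) {
        double[] sorted = gaps.clone();
        Arrays.sort(sorted);
        int trim = sorted.length / 10;
        double[] kept = Arrays.copyOfRange(sorted, trim, sorted.length - trim);
        int mid = kept.length / 2;
        return kept.length % 2 == 1 ? kept[mid] : (kept[mid - 1] + kept[mid]) / 2.0;
    }

    /** Returns the indices of lines that start a new paragraph, given each line's y-position. */
    static List<Integer> paragraphStarts(double[] lineY, double tolerance) {
        List<Integer> starts = new ArrayList<>();
        if (lineY.length < 2) {
            return starts;
        }
        double[] gaps = new double[lineY.length - 1];
        for (int i = 0; i < gaps.length; i++) {
            gaps[i] = Math.abs(lineY[i + 1] - lineY[i]);
        }
        // A gap clearly larger than the typical line distance marks a paragraph break
        double threshold = trimmedMedian(gaps) * tolerance;
        for (int i = 0; i < gaps.length; i++) {
            if (gaps[i] > threshold) {
                starts.add(i + 1);
            }
        }
        return starts;
    }
}
```

Because the threshold scales with the document's own median line distance, a double-spaced document does not get a paragraph break after every line, which a fixed threshold would produce.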
Viktor Lofgren
3b6f4e321b
(search) Add red PDF indicator to search UI
2025-05-11 13:32:14 +02:00
Viktor Lofgren
8428111771
(pdf) Fix for exception when no text positions are available
2025-05-10 15:12:02 +02:00
Viktor Lofgren
e9fd4415ef
(pdf) Merge consecutive headings.
...
Headings don't follow the same indentation rules as prose and tend to be cut off into multiple "paragraphs" by the text extractor.
2025-05-10 14:38:43 +02:00
Viktor Lofgren
4c95c3dcad
(pdf) Don't look for headings below 75% of the max y-position
2025-05-10 14:38:02 +02:00
Viktor Lofgren
c5281536fb
(api) Add format field to JSON search results
...
API consumers might want to filter out PDF results, etc.
2025-05-10 13:56:22 +02:00
Viktor Lofgren
4431dae7ac
(refac) Rename HtmlStandard -> DocumentFormat
...
The old model made some sense when we only supported HTML and to some extent plain text, but having PDF in an enum called HtmlStandard is a bit of a stretch.
2025-05-10 13:47:26 +02:00
Viktor Lofgren
4df4d0a7a8
(pdf) Increase line spacing tolerance for better paragraph handling
2025-05-10 13:34:04 +02:00
Viktor Lofgren
9f05083b94
(pdf) Add the capability to identify headings
...
This change vendors pdfbox's PDFTextStripper and modifies it to be able to heuristically identify headings based on their font size, as this is a very useful relevance signal for the search engine, and helps identify the correct title of the article.
2025-05-09 14:04:04 +02:00
Viktor Lofgren
fc92e9b9c0
(feeds) Correct link handling in atom feeds
...
This addresses issue #199
2025-05-09 13:00:07 +02:00
Viktor Lofgren
328fb5d927
(feeds) Correct link handling in atom feeds
...
This addresses issue #199
2025-05-09 12:55:28 +02:00
Viktor Lofgren
36889950e8
(pdf) Migrate to PDFBox 3.0.5 and suppress log spam
...
PDFBox 2.x uses commons logging, which does not route through SLF4j, and thus is a hassle to configure; and is extremely verbose in its default logging settings.
Migrating to PDFBox 3.x lets us use slf4j to address the log spam by filtering out the noisy methods.
2025-05-08 18:03:26 +02:00
Viktor Lofgren
c96a94878b
(pdf) Add feature to make pdf-files searchable with format:pdf
2025-05-08 18:03:26 +02:00
Viktor Lofgren
1c57d7d73a
(pdf) Clean up code
2025-05-08 18:03:26 +02:00
Viktor Lofgren
a443d22356
(pdf) Flag the file as a PDF file in the GUI
2025-05-08 18:03:26 +02:00
Viktor Lofgren
aa59d4afa4
(pdf) Somewhat improve title and summary extraction
2025-05-08 18:03:26 +02:00
Viktor Lofgren
df0f18d0e7
(pdf) Read title
2025-05-08 18:03:26 +02:00
Viktor Lofgren
0819d46f97
(pdf) Minimal prototype to get PDFs working
2025-05-08 18:03:26 +02:00
Viktor Lofgren
5e2b63473e
(logging) Change to a terser log format
...
The old log format would often span several screen widths, especially when subprocesses logged. Switching to a terser format that should be much easier to read.
2025-05-08 18:02:22 +02:00
Viktor
f9590703f1
Merge pull request #197 from MarginaliaSearch/crawl-markdown
...
(markdown) Support crawling markdown
2025-05-08 13:35:00 +02:00
Viktor Lofgren
f12fc11337
(markdown) Support crawling markdown
2025-05-08 13:26:22 +02:00
Viktor Lofgren
c309030184
(sample) Ensure we finalize the slop.zip file creation when filtering
2025-05-06 14:52:48 +02:00
Viktor Lofgren
fd5af01629
(sample) Ensure we flush the log before adding it to the tar file
2025-05-06 14:43:47 +02:00
Viktor Lofgren
d4c43c7a79
(crawler) Test case for fetching PDFs
2025-05-06 13:45:16 +02:00
Viktor Lofgren
18700e1919
(sample) Fix bug where slop files would not be saved despite containing data
2025-05-06 13:38:21 +02:00
Viktor Lofgren
120b431998
(crawler) Fix outdated assumptions about content types and http status codes always being 200 when good.
...
We now sometimes get 206 when good.
2025-05-06 13:18:30 +02:00
Viktor Lofgren
71dad99326
(crawler) Revisitor should not demand a 200, but support a 206 as well
2025-05-06 13:11:52 +02:00
Viktor Lofgren
c1e8afdf86
(crawler) Remove domains from pending crawl tasks queue when retrying
2025-05-06 12:56:30 +02:00
Viktor Lofgren
fa32dddc24
(sample-actor) Make content type matching lenient with regard to ct parameters such as charset
2025-05-06 12:48:09 +02:00
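A lenient comparison of the kind commit fa32dddc24 describes could look like the sketch below, which drops everything after the first `;` before comparing. The class and method names are hypothetical, and the actual sample actor may match differently.

```java
import java.util.Locale;

// Hypothetical helper; names are illustrative only.
final class ContentTypeMatchSketch {

    /** Strips parameters such as "; charset=UTF-8" and normalizes case and whitespace. */
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True if both values name the same media type, ignoring parameters. */
    static boolean matches(String actual, String wanted) {
        String wantedBase = baseType(wanted);
        return !wantedBase.isEmpty() && baseType(actual).equals(wantedBase);
    }
}
```

With this, a filter for `text/html` also accepts `text/html; charset=UTF-8`.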
Viktor Lofgren
a266fcbf30
(sample-actor) Clean up debris from previous runs to avoid errors on re-runs
2025-05-05 13:16:37 +02:00
Viktor Lofgren
6e47e58e0e
(sample-actor) Add progress tracking to sample export actor
2025-05-05 13:04:14 +02:00
Viktor Lofgren
9dc43d8b4a
(sample-actor) Update the actor export sample actor to not generate empty files when the filter is not applicable.
2025-05-05 12:56:12 +02:00
Viktor Lofgren
83967e3305
(sample-actor) Update the actor export sample actor to not generate empty files when the filter is not applicable.
2025-05-05 12:50:21 +02:00
Viktor Lofgren
4db980a291
(jooby-service) Set an upper limit on the number of worker threads
2025-05-05 12:40:31 +02:00
Viktor Lofgren
089b177868
(deploy) Executor partition 4.
2025-05-05 12:21:27 +02:00
Viktor Lofgren
9c8e9a68d5
(deploy) Executor partition 4.
2025-05-05 12:00:05 +02:00
Viktor Lofgren
413d5cc788
(url, minor) Fix typo in test
2025-05-04 16:28:30 +02:00
Viktor Lofgren
58539b92ac
(search) Don't show addresses with URLencoding in the UI
2025-05-04 16:26:39 +02:00
Viktor Lofgren
fe72f16df1
(url) Add additional tests for parameter handling
2025-05-04 16:23:39 +02:00
Viktor Lofgren
b49a244a2e
(url) Fix encoding handling of query parameters
2025-05-04 16:18:47 +02:00
Viktor Lofgren
3f0b4c010f
(deploy) Fix deploy script to be aware of the status service
2025-05-04 16:14:07 +02:00
Viktor Lofgren
c6e0cd93f7
(status) Fix status service to poll the new domain
2025-05-04 16:11:08 +02:00
Viktor Lofgren
80a7ccb080
Trigger redeploy of qs, search and api
2025-05-04 16:07:28 +02:00
Viktor Lofgren
54dec347c4
(url) Fix urlencoding issues with certain symbols
...
Optimize the code by adding a simple heuristic for guessing whether we need to repair the URI before we pass it to Java's parser.
2025-05-04 13:39:39 +02:00
Viktor Lofgren
d6ee3f0785
(url) Fix urlencoding issues with certain symbols
...
The urlencoding logic would consider the need to urlencode on an element basis, which is incorrect. Even if we urlencode on an element basis, we should either urlencode or not urlencode, never a mix of the two.
2025-05-04 13:08:49 +02:00
Viktor Lofgren
8be88afcf3
(url) Fix urlencoding issues with certain symbols
...
We also need to apply the fix when performing toString() on the EdgeUrl, the URI class will URLDecode the input.
The change also alters the parseURI method to only run the URLEncode-fixer during parsing if URI throws an exception. This bad path is obviously going to be slower, but realistically, most URLs are valid, so it's probably a significant optimization to do it like this.
2025-05-04 12:58:13 +02:00
Viktor Lofgren
0e3c00d3e1
(url) Fix urlencoding issues with certain symbols
...
Minor fix of issue where url sanitizer would strip some trailing slashes.
2025-05-03 23:58:28 +02:00
Viktor Lofgren
4279a7f1aa
(url) Fix urlencoding issues with certain symbols
...
Minor fix with previously urlencoded codepoints, we need to account for the fact that they are encoded in hexadecimal.
2025-05-03 23:51:39 +02:00
Viktor Lofgren
251006d4f9
(url) Fix urlencoding issues with certain symbols
...
Problems primarily cropped up with sideloaded Wikipedia articles, but the search engine has been returning inconsistently URL-encoded search results for a while; browsers and servers have seemingly magically fixed the issues in many scenarios.
This addresses Issue #195 and Issue #131.
2025-05-03 23:48:45 +02:00
Viktor Lofgren
c3e99dc12a
(service) Limit logging from ad hoc task heartbeats
...
Certain usage patterns of the ad hoc task heartbeats would lead to an incredible amount of log noise, as it would log each update.
Limit log updates to increments of 10% to avoid this problem.
2025-05-03 12:39:58 +02:00
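The throttling rule from commit c3e99dc12a (log only when progress crosses a 10% boundary) can be sketched like this. The class is hypothetical and not the actual heartbeat implementation.

```java
// Hypothetical sketch of logging progress in 10% increments.
final class ProgressLogThrottle {
    private final int total;
    private int lastLoggedStep = 0; // number of 10% steps already logged

    ProgressLogThrottle(int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        this.total = total;
    }

    /** Returns true only when 'completed' has entered a new 10% step since the last log. */
    boolean shouldLog(int completed) {
        int step = (int) ((10L * completed) / total);
        if (step > lastLoggedStep) {
            lastLoggedStep = step;
            return true;
        }
        return false;
    }
}
```

A caller would wrap its log statement in `if (throttle.shouldLog(done))`, so a task with a thousand updates produces ten log lines instead of a thousand.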
Viktor
aaaa2de022
Merge pull request #196 from MarginaliaSearch/filter-export-sample-data
...
Add the ability to filter sample data based on content type
2025-05-02 13:23:49 +02:00
Viktor Lofgren
fc1388422a
(actor) Add the ability to filter sample data based on content type
...
This will help in extracting relevant test sets for PDF processing.
2025-05-02 13:09:22 +02:00
Viktor Lofgren
b07080db16
(crawler) Don't retry requests when encountering UnknownHostException
2025-05-01 16:07:34 +02:00
Viktor Lofgren
e9d86dca4a
(crawler) Add timeout to wrap-up phase of WarcInputBuffer.
2025-05-01 15:57:47 +02:00
Viktor Lofgren
1d693f0efa
(build) Upgrade JIB to 3.4.5
2025-04-30 15:26:52 +02:00
Viktor Lofgren
5874a163dc
(build) Upgrade gradle to 8.14
2025-04-30 15:26:37 +02:00
Viktor Lofgren
5ec7a1deab
(crawler) Fix 80%-ish progress crawler stall
...
The crawl tasks are started in two phases: first in the loop that generates them, and then in a second loop that drains the task list. If the first loop contains a long-running crawl task that is triggered late, the rest of the crawl may halt until that task has finished.
Fixed the problem by draining and re-trying also in the first loop.
2025-04-29 12:23:51 +02:00
Viktor Lofgren
7fea2808ed
(search) Fix error view
...
Fix rendering error when query was null
Fix border on error message.
2025-04-27 12:12:56 +02:00
Viktor Lofgren
8da74484f0
(search) Remove unused count modifier from the footer help
2025-04-27 12:08:34 +02:00
Viktor Lofgren
923d5a7234
(search) Add a note for TUI users pointing them to the old UI
2025-04-27 11:52:07 +02:00
Viktor Lofgren
58f88749b8
(deploy) assistant
2025-04-25 13:25:50 +02:00
Viktor Lofgren
77f727a5ba
(crawler) Alter conditional request logic to avoid sending both If-None-Match and If-Modified-Since
...
It seems like some servers dislike this combination, and may turn a 304 into a 200.
2025-04-25 13:19:07 +02:00
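One way to implement the rule in commit 77f727a5ba is to pick a single validator per request. The sketch below prefers the ETag when both are known; that preference, and the class and method names, are assumptions, since the commit message only says the two headers are no longer sent together.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; preferring the ETag over the date is an assumption.
final class ConditionalHeadersSketch {

    /** Builds at most one conditional request header from the stored validators. */
    static Map<String, String> conditionalHeaders(String etag, String lastModified) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (etag != null && !etag.trim().isEmpty()) {
            headers.put("If-None-Match", etag);
        } else if (lastModified != null && !lastModified.trim().isEmpty()) {
            headers.put("If-Modified-Since", lastModified);
        }
        return headers;
    }
}
```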
Viktor Lofgren
667cfb53dc
(assistant) Remove more link text junk from suggestions at loadtime.
2025-04-24 13:35:29 +02:00
Viktor Lofgren
fe36d4ed20
(deploy) Executor services
2025-04-24 13:23:51 +02:00
Viktor Lofgren
acf4bef98d
(assistant) Improve search suggestions
...
Improve suggestions by loading a secondary suggestions set with link text data.
2025-04-24 13:10:59 +02:00
Viktor Lofgren
2a737c34bb
(search) Improve suggestions UX
...
Fix the highlight colors when arrowing through search suggestions. Also fix the suggestions box for dark mode.
2025-04-24 12:34:05 +02:00
Viktor Lofgren
90a577af82
(search) Improve suggestions UX
2025-04-24 00:32:25 +02:00
Viktor
f0c9b935d8
Merge pull request #192 from MarginaliaSearch/improve-suggestions
...
Improve typeahead suggestions
2025-04-23 20:17:49 +02:00
Viktor Lofgren
7b5493dd51
(assistant) Improve typeahead suggestions
...
Implement a new prefix search structure (not a trie, but hash table based) with a concept of score.
2025-04-23 20:13:53 +02:00
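Commit 7b5493dd51 does not describe the structure in detail. As a rough illustration of what a hash-table-based prefix index with scores might look like, the sketch below maps every short prefix of a term to its candidate completions and ranks them by score. All names, the prefix length cap, and the ranking rule are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; not the assistant service's actual data structure.
final class PrefixSuggestSketch {
    private static final int MAX_PREFIX_LENGTH = 4; // assumption: only short prefixes are indexed

    private static final class Scored {
        private final String term;
        private final int score;

        Scored(String term, int score) {
            this.term = term;
            this.score = score;
        }

        String term() { return term; }
        int score() { return score; }
    }

    private final Map<String, List<Scored>> byPrefix = new HashMap<>();

    /** Registers the term under each of its prefixes up to MAX_PREFIX_LENGTH. */
    void add(String term, int score) {
        Scored entry = new Scored(term, score);
        int longest = Math.min(term.length(), MAX_PREFIX_LENGTH);
        for (int len = 1; len <= longest; len++) {
            byPrefix.computeIfAbsent(term.substring(0, len), k -> new ArrayList<>()).add(entry);
        }
    }

    /** Returns up to 'limit' terms starting with the prefix, highest score first. */
    List<String> suggest(String prefix, int limit) {
        if (prefix.isEmpty()) {
            return new ArrayList<>();
        }
        String key = prefix.length() > MAX_PREFIX_LENGTH
                ? prefix.substring(0, MAX_PREFIX_LENGTH)
                : prefix;
        return byPrefix.getOrDefault(key, new ArrayList<>()).stream()
                .filter(s -> s.term().startsWith(prefix)) // needed when the prefix exceeds the cap
                .sorted(Comparator.comparingInt(Scored::score).reversed()
                        .thenComparing(Scored::term))
                .limit(limit)
                .map(Scored::term)
                .collect(Collectors.toList());
    }
}
```

Compared with a trie, a lookup here is a single hash probe, at the cost of storing each term once per indexed prefix.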
Viktor Lofgren
c246a59158
(search) Make it clearer that it's a search engine
2025-04-22 16:03:42 +02:00
Viktor
0b99781d24
Merge pull request #191 from MarginaliaSearch/pdf-support-in-crawler
...
Pdf support in crawler
2025-04-22 15:52:41 +02:00
Viktor Lofgren
39db9620c1
(crawler) Increase maximum permitted file size to 32 MB
2025-04-22 15:51:03 +02:00
Viktor Lofgren
1781599363
(crawler) Add support for crawling PDF files
2025-04-22 15:50:05 +02:00
Viktor Lofgren
6b2d18fb9b
(crawler) Adjust domain limits to be generally more permissive.
2025-04-22 15:27:57 +02:00
Viktor
59b1d200ab
Merge pull request #190 from MarginaliaSearch/download-sample-chores
...
Download sample chores
2025-04-22 13:29:49 +02:00
Viktor Lofgren
897010a2cf
(control) Update download sample data actor with better UI
...
The original implementation didn't really give a lot of feedback about what it was doing. Adding a progress bar to the download step.
Relates to issue 189.
2025-04-22 13:27:22 +02:00
Viktor Lofgren
602af7a77e
(control) Update UI with new sample sizes
...
Relates to issue 189.
2025-04-22 13:27:13 +02:00
Viktor Lofgren
a7d91c8527
(crawler) Clean up fetcher detailed logging
2025-04-21 12:53:52 +02:00
Viktor Lofgren
7151602124
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
...
Cleaning up after changes.
2025-04-21 12:47:03 +02:00
Viktor Lofgren
884e33bd4a
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
...
Change back to an unbounded queue, tighten sleep times a bit.
2025-04-21 11:48:15 +02:00
Viktor Lofgren
e84d5c497a
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
...
Change to a bounded queue and add a sleep to reduce the number of effectively busy-looping threads.
2025-04-21 00:39:26 +02:00
Viktor Lofgren
2d2d3e2466
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
...
Change to a bounded queue and add a sleep to reduce the number of effectively busy-looping threads.
2025-04-21 00:36:48 +02:00
Viktor Lofgren
647dd9b12f
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
2025-04-21 00:24:30 +02:00
Viktor Lofgren
de4e2849ce
(crawler) Tweak request retry counts
...
Increase the default number of tries to 3, but don't retry on SSL errors as they are unlikely to fix themselves in the short term.
2025-04-19 00:19:48 +02:00
Viktor Lofgren
3c43f1954e
(crawler) Add custom cookie store implementation
...
Apache HttpClient's cookie implementation builds an enormous concurrent hashmap with every cookie for every domain ever crawled. This is a big waste of resources.
Replacing it with a fairly crude domain-isolated instance, as we are primarily interested in answering whether a cookie is set, and we will never retain cookies long term.
2025-04-18 13:04:22 +02:00
Viktor Lofgren
fa2462ec39
(crawler) Re-enable aborts on timeout
2025-04-18 12:59:34 +02:00
Viktor Lofgren
f4ad7145db
(crawler) Disable SO_LINGER
2025-04-18 01:42:02 +02:00
Viktor Lofgren
068b450180
(crawler) Temporarily disable request.abort()
2025-04-18 01:25:56 +02:00
Viktor Lofgren
05b909a21f
(crawler) Add logging to get more info about connection leak
2025-04-18 01:06:52 +02:00
Viktor Lofgren
3d179cddce
(crawler) Correctly consume entity in sitemap retrieval
2025-04-18 00:32:21 +02:00
Viktor Lofgren
1a2aae496a
(crawler) Correct handling and abortion of HttpClient's requests
...
There was a resource leak in the initial implementation of the Apache HttpClient WarcInputBuffer that failed to free up resources.
Using HttpGet objects instead of the Classic...Request objects, as the latter fail to expose an abort()-method.
2025-04-18 00:16:26 +02:00
Viktor Lofgren
353cdffb3f
(crawler) Increase connection request timeout, restore congestion timeout
2025-04-17 21:32:06 +02:00
Viktor Lofgren
2e3f1313c7
(crawler) Log exceptions while crawling in crawler audit log
2025-04-17 21:18:09 +02:00
Viktor Lofgren
58e6f141ce
(crawler) Reduce congestion throttle go-rate
2025-04-17 20:36:58 +02:00
Viktor Lofgren
500f63e921
(crawler) Lower max conn per route
2025-04-17 18:36:16 +02:00
Viktor Lofgren
6dfbedda1e
(crawler) Increase max conn per route and connection timeout
2025-04-17 18:31:46 +02:00
Viktor Lofgren
9715ddb105
(crawler) Increase max pool size to a large value
2025-04-17 18:22:58 +02:00
Viktor Lofgren
1fc6313a77
(crawler) Remove log noise when retrying a bad URL
2025-04-17 17:10:46 +02:00
Viktor Lofgren
b1249d5b8a
(crawler) Fix broken test.
2025-04-17 17:01:42 +02:00
Viktor
ef95d59b07
Merge pull request #161 from MarginaliaSearch/apache-httpclient-in-crawler
...
The previously used Java HttpClient seems unsuitable for crawler usage, which led to issues like send()-operations sometimes hanging forever, with clunky workarounds such as running each send operation in a separate Future that can be cancelled on a timeout.
The most damning flaw is that it does not offer socket timeouts. If a server responds in a timely manner, but then stops sending data, whether due to high load or malice, Java's builtin HttpClient will hang forever.
It simply has too many assumptions that break, and fails to adequately expose the inner workings of the connection pool to a degree that makes it possible to configure in a satisfactory manner, such as setting a SO_LINGER value or limiting the number of concurrent connections to a host.
Apache's HttpClient solves all these problems.
The change also includes a new battery of tests for the HttpFetcher, and refactors the retriever class a bit to move stuff into the HttpFetcher, leading to a better separation of concerns.
The crawler will also be a bit more clever when fetching documents, and attempt to use range queries where supported to limit the number of bytes, as interrupting connections is undesirable and leads to connection storms and bufferbloat.
2025-04-17 16:57:19 +02:00
Viktor Lofgren
acdd8664f5
(crawler) More logging for the crawler, in a separate file.
2025-04-17 16:55:50 +02:00
Viktor Lofgren
6b12eac58a
(crawler) Fix crawler retriever test to use the slop format
2025-04-17 16:35:13 +02:00
Viktor Lofgren
bb3f1f395a
(crawler) Fix bug where headers were not stored correctly
...
This was the result of refactoring to Apache HttpClient.
2025-04-17 16:34:41 +02:00
Viktor Lofgren
b661beef41
(crawler) Amend recrawl logic to match redirects as being unchanged if their Location is the same.
2025-04-17 16:34:05 +02:00
Viktor Lofgren
9888c47f19
(crawler) Add custom Keep-Alive settings for HttpClient with max keep-alive of 30s
2025-04-17 15:25:46 +02:00
Viktor Lofgren
dcef7e955b
(crawler) Try to avoid unnecessary connection resets
...
In order to keep connections alive, the crawler will consume data past its max size (but hope and pray the server supports range queries) as long as we've not exceeded the timeout.
This permits us to keep the connection alive in more scenarios, which is helpful for the health of the network stack, as constant TCP handshakes can lead to quite a lot of buffer bloat.
This will increase the bandwidth requirements in some scenarios, but on the other hand, it will increase the available bandwidth as well.
2025-04-17 14:51:33 +02:00
Viktor Lofgren
b3973a1dd7
(crawler) Remove unnecessary crawl delay when not ct-probing
...
The crawler would *always* incur the crawl delay penalty associated with content type probing, even if it wasn't actually probing. Removing this delay when we are not probing.
2025-04-17 14:39:04 +02:00
Viktor Lofgren
8bd05d6d90
(crawler) Attempt to use range queries where available
...
This might help in some circumstances to avoid fetching more data than we are interested in.
2025-04-17 14:37:55 +02:00
Viktor Lofgren
59df8e356e
(crawler) Do not fail domain and content type probe on 405
...
Some endpoints do not support the HEAD method. This has historically broken the crawler when it attempts to use HEAD to probe certain URLs that are suspected of being e.g. binary.
The change makes it so that we bypass the probing on 405 instead, and for the domain probe logic, we switch to a small range queried GET.
2025-04-17 13:54:28 +02:00
Viktor Lofgren
7161162a35
(crawler) Write WARC records in a sane order
2025-04-17 13:36:39 +02:00
Viktor Lofgren
d7c4c5141f
(crawler) Migrate to Apache HttpClient for crawler
...
The previously used Java HttpClient seems unsuitable for crawler usage,
which led to issues like send()-operations sometimes hanging forever,
with clunky workarounds such as running each send operation in a separate
Future that can be cancelled on a timeout.
It has too many assumptions that break, and fails to adequately expose
the inner workings of the connection pool to a degree that makes it possible
to configure in a satisfactory manner.
Apache's HttpClient solves all these problems.
The change also includes a new battery of tests for the HttpFetcher,
and refactors the retriever class a bit to move stuff into the HttpFetcher,
leading to a better separation of concerns.
2025-04-17 12:51:08 +02:00
Viktor Lofgren
88e9b8fb05
(crawler) Throttle the establishment of new connections
...
To avoid network congestion from the packet storm created when establishing hundreds or thousands of connections at the same time, pace the opening of new connections.
2025-04-08 22:53:02 +02:00
Viktor Lofgren
b6265cee11
(feeds) Add timeout code to send()
...
Due to the unique way java's HttpClient implements timeouts, we must always wrap it in an executor to catch the scenario that a server stops sending data mid-response, which would otherwise hang the send method forever.
2025-04-08 22:09:59 +02:00
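The workaround described in commit b6265cee11 amounts to running the blocking call on another thread and enforcing a hard deadline on the `Future`. The helper below is a generic sketch of that pattern; its name, signature, and fallback behavior are assumptions, and a real fetcher would reuse a shared executor instead of creating one per call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper illustrating a hard timeout around a blocking call.
final class HardTimeoutSketch {

    /** Runs the task, returning 'fallback' if it has not finished within timeoutMs. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMs, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck call
            return fallback;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In the feed fetcher's case the task would be the `send()` call itself, so a server that stops transmitting mid-response can no longer block the caller past the deadline.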
Viktor Lofgren
c91af247e9
(rate-limit) Fix rate limiting logic
...
The rate limiter was misconfigured to regenerate tokens at a fixed rate of 1 per refillRate; not refillRate per minute. Additionally increasing the default bucket size to 4x refill rate.
2025-04-05 12:26:26 +02:00
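The corrected behavior from commit c91af247e9 (tokens regenerate at refillRate per minute, with a bucket of 4x the refill rate) can be sketched as a token bucket. This is an illustration under assumed names, with the clock passed in explicitly to keep the example deterministic; it is not the project's rate limiter.

```java
// Hypothetical token bucket; the clock is injected as nanosecond timestamps.
final class TokenBucketSketch {
    private static final double NANOS_PER_MINUTE = 60e9;

    private final int refillPerMinute;
    private final int capacity;
    private double tokens;
    private long lastRefillNanos;

    TokenBucketSketch(int refillPerMinute, long nowNanos) {
        this.refillPerMinute = refillPerMinute;
        this.capacity = 4 * refillPerMinute; // bucket size of 4x the refill rate
        this.tokens = capacity;
        this.lastRefillNanos = nowNanos;
    }

    /** Refills in proportion to elapsed time, then takes one token if available. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedMinutes = (nowNanos - lastRefillNanos) / NANOS_PER_MINUTE;
        tokens = Math.min(capacity, tokens + elapsedMinutes * refillPerMinute);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production code the timestamps would come from `System.nanoTime()`. The misconfiguration the commit fixes corresponds to adding a fixed single token per interval instead of `elapsedMinutes * refillPerMinute`.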
Viktor Lofgren
7a31227de1
(crawler) Filter out robots.txt-sitemaps that belong to different domains
2025-04-02 13:35:37 +02:00
Viktor Lofgren
4f477604c5
(crawler) Improve error handling in parquet->slop conversion
...
Parquet code throws a RuntimeException, which was not correctly caught, leading to a failure to crawl.
2025-04-02 13:16:01 +02:00
Viktor Lofgren
2970f4395b
(minor) Test code cleanup
2025-04-02 13:16:01 +02:00
Viktor Lofgren
d1ec909b36
(crawler) Improve handling of timeouts to prevent crawler from getting stuck
2025-04-02 12:57:21 +02:00
Viktor Lofgren
c67c5bbf42
(crawler) Experimentally drop to HTTP 1.1 for crawler to see if this solves stuck send()s
2025-04-01 12:05:21 +02:00
Viktor Lofgren
ecb0e57a1a
(crawler) Make the use of virtual threads in the crawler configurable via system properties
2025-03-27 21:26:05 +01:00
Viktor Lofgren
8c61f61b46
(crawler) Add crawling metadata to domainstate db
2025-03-27 16:38:37 +01:00
Viktor Lofgren
662a18c933
Revert "(crawler) Further rearrange crawl order"
...
This reverts commit 1c2426a052.
The change does not appear necessary to avoid problems.
2025-03-27 11:25:08 +01:00
Viktor Lofgren
1c2426a052
(crawler) Further rearrange crawl order
...
Limit crawl order preference to edu domains, to avoid hitting stuff like medium and wordpress with shotgun requests.
2025-03-27 11:19:20 +01:00
Viktor Lofgren
34df7441ac
(crawler) Add some jitter to crawl delay to avoid accidentally synchronized requests
2025-03-27 11:15:16 +01:00
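A small sketch of the jitter idea from the commit above: a random offset is added to the base delay so that crawler threads started together drift apart. `CrawlDelay`, `withJitter`, and the jitter bound are hypothetical; the actual amounts used by the crawler are not stated in the log.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: base delay plus a uniformly random extra of 0..maxJitter.
final class CrawlDelay {
    static Duration withJitter(Duration base, Duration maxJitter) {
        // nextLong(bound) is exclusive of the bound, hence the + 1
        long jitterMs = ThreadLocalRandom.current().nextLong(maxJitter.toMillis() + 1);
        return base.plusMillis(jitterMs);
    }
}
```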
Viktor Lofgren
5387e2bd80
(crawler) Adjust crawl order to get a better mixture of domains
2025-03-27 11:12:48 +01:00
Viktor Lofgren
0f3b24d0f8
(crawler) Evaluate virtual threads for the crawler
...
The change also alters SimpleBlockingThreadPool to add the option to use virtual threads instead of platform threads.
2025-03-27 11:02:21 +01:00
Viktor Lofgren
a732095d2a
(crawler) Improve crawl task ordering
...
Further improve the ordering of the crawl tasks in order to ensure that potentially blocking tasks are enqueued as soon as possible.
2025-03-26 16:51:37 +01:00
Viktor Lofgren
6607f0112f
(crawler) Improve how the crawler deals with interruptions
...
In some cases, threads would previously fail to terminate when interrupted.
2025-03-26 16:19:57 +01:00
Viktor Lofgren
4913730de9
(jdk) Upgrade to Java 24
2025-03-26 13:26:06 +01:00
Viktor Lofgren
1db64f9d56
(chore) Fix zookeeper test by upgrading zk image version.
...
Test suddenly broke due to the increasing entropy of the universe.
2025-03-26 11:47:14 +01:00
Viktor Lofgren
4dcff14498
(search) Improve contrast with light mode
2025-03-25 13:15:31 +01:00
Viktor Lofgren
426658f64e
(search) Improve contrast with light mode
2025-03-25 11:54:54 +01:00
Viktor Lofgren
2181b22f05
(crawler) Change default maxConcurrentRequests to 512
...
This seems like a more sensible default after testing a bit. May need local tuning.
2025-03-22 12:11:09 +01:00
Viktor Lofgren
42bd79a609
(crawler) Experimentally throttle the number of active retrievals to see how this affects the network performance
...
There have been some indications that request storms lead to buffer bloat and bad throughput.
This adds a configurable semaphore, by default permitting 100 active requests.
2025-03-22 11:50:37 +01:00
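The semaphore-based throttle described above can be sketched as follows. `RequestThrottle` and its methods are hypothetical names; the limit is passed in by the caller because the name of the configuring system property is not given in the log.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical throttle: at most maxConcurrentRequests retrievals run at the same time.
final class RequestThrottle {
    private final Semaphore permits;

    RequestThrottle(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    <T> T run(Supplier<T> request) {
        permits.acquireUninterruptibly(); // blocks while the limit is reached
        try {
            return request.get();
        } finally {
            permits.release(); // always hand the permit back, even on failure
        }
    }

    int available() {
        return permits.availablePermits();
    }
}
```

With this shape, `new RequestThrottle(100)` would correspond to the default mentioned in the commit body.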
Viktor Lofgren
b91c1e528a
(favicon) Send dummy svg result when image is missing
...
This prevents the browser from rendering a "broken image" in this scenario.
2025-03-21 15:15:14 +01:00
Viktor Lofgren
b1130d7a04
(domainstatedb) Allow creation of disconnected db
...
This is required for executor services that do not have crawl data to still be able to initialize.
2025-03-21 14:59:36 +01:00
Viktor Lofgren
8364bcdc97
(favicon) Add favicons to the matchograms
2025-03-21 14:30:40 +01:00
Viktor Lofgren
626cab5fab
(favicon) Add favicon to site overview
2025-03-21 14:15:23 +01:00
Viktor Lofgren
cfd4712191
(favicon) Add capability for fetching favicons
2025-03-21 13:38:58 +01:00
Viktor Lofgren
9f18ced73d
(crawler) Improve deferred task behavior
2025-03-18 12:54:18 +01:00
Viktor Lofgren
18e91269ab
(crawler) Improve deferred task behavior
2025-03-18 12:25:22 +01:00
Viktor Lofgren
e315ca5758
(search) Change icon for small web filter
...
The previous icon was of an irregular size and shifted the layout in an unaesthetic way.
2025-03-17 12:07:34 +01:00
Viktor Lofgren
3ceea17c1d
(search) Adjustments to device detection in CSS
...
Use pointer:fine media query to better distinguish between mobile devices and PCs with a window in portrait orientation.
With this, we never show mobile filtering functionality on desktop, and never show the touch-inaccessible minimized sidebar on mobile.
2025-03-17 12:04:34 +01:00
Viktor Lofgren
b34527c1a3
(search) Add small web filter for new UI
2025-03-17 11:39:19 +01:00
Viktor Lofgren
185bf28fca
(crawler) Correct issue leading to parquet files not being correctly preconverted
...
Path.endsWith("str") != String.endsWith(".str")
2025-03-10 13:48:12 +01:00
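The pitfall named in the commit body is that `Path.endsWith(String)` compares whole path elements, whereas `String.endsWith` compares characters. A short sketch, where `PathSuffix`, `hasExtension`, and the file names are hypothetical:

```java
import java.nio.file.Path;

// Hypothetical helper: checks a file extension on the file name string, not on path elements.
final class PathSuffix {
    static boolean hasExtension(Path path, String extension) {
        Path fileName = path.getFileName();
        // Convert to String first; Path.endsWith would only match a complete trailing element.
        return fileName != null && fileName.toString().endsWith(extension);
    }
}
```

For a path such as `crawl/data.parquet`, `path.endsWith(".parquet")` is false because the last element is `data.parquet`, while `hasExtension(path, ".parquet")` is true.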
Viktor Lofgren
78cc25584a
(crawler) Add error logging when entering bad path for historical crawl data
2025-03-10 13:38:40 +01:00
Viktor Lofgren
62ba30bacf
(common) Log info about metrics server
2025-03-10 13:12:39 +01:00
Viktor Lofgren
3bb84eb206
(common) Log info about metrics server
2025-03-10 13:03:48 +01:00
Viktor Lofgren
be7d13ccce
(crawler) Correct task execution logic in crawler
...
The old behavior would flag domains as pending too soon, leading to them being omitted from execution if they were not immediately available to run.
2025-03-09 13:47:51 +01:00
Viktor Lofgren
8c088a7c0b
(crawler) Remove custom thread factory
...
This was causing issues, and not really doing much of benefit.
2025-03-09 11:50:52 +01:00
Viktor Lofgren
ea9a642b9b
(crawler) More effective task scheduling in the crawler
...
This should hopefully allow more threads to be busy
2025-03-09 11:44:59 +01:00
Viktor Lofgren
27f528af6a
(search) Fix "Remove Javascript" toggle
...
A bug was introduced at some point where the special keyword for filtering on javascript was changed to special:scripts, from js:true/js:false.
Solves issue #155
2025-02-28 12:03:04 +01:00
Viktor Lofgren
20ca41ec95
(processed model) Use String columns instead of Txt columns for SlopDocumentRecord
...
It's very likely TxtStringColumn is the culprit of the bug seen in https://github.com/MarginaliaSearch/MarginaliaSearch/issues/154 where the wrong URL was shown for a search result.
2025-02-24 11:41:51 +01:00
Viktor Lofgren
7671f0d9e4
(search) Display message when no search results are found
2025-02-24 11:15:55 +01:00
Viktor Lofgren
44d6bc71b7
(assistant) Migrate to Jooby framework
2025-02-15 13:28:12 +01:00
Viktor Lofgren
9d302e2973
(assistant) Migrate to Jooby framework
2025-02-15 13:26:04 +01:00
Viktor Lofgren
f553701224
(assistant) Migrate to Jooby framework
2025-02-15 13:21:48 +01:00
Viktor Lofgren
f076d05595
(deps) Upgrade slf4j to latest
2025-02-15 12:50:16 +01:00
Viktor Lofgren
b513809710
(*) Stopgap fix for metrics server initialization errors bringing down services
2025-02-14 17:09:48 +01:00
Viktor Lofgren
7519b28e21
(search) Correct exception from misbehaving bots feeding invalid urls
2025-02-14 17:05:24 +01:00
Viktor Lofgren
3eac4dd57f
(search) Correct exception in error handler when page is missing
2025-02-14 17:00:21 +01:00
Viktor Lofgren
4c2810720a
(search) Add redirect handler for full URLs in the /site endpoint
2025-02-14 16:31:11 +01:00
Viktor Lofgren
8480ba8daa
(live-capture) Code cleanup
2025-02-04 14:05:36 +01:00
Viktor Lofgren
fbba392491
(live-capture) Send a UA-string from the browserless fetcher as well
...
The change also introduces a somewhat convoluted wiremock test to intercept and verify that these headers are in fact sent
2025-02-04 13:36:49 +01:00
Viktor Lofgren
530eb35949
(update-rss) Do not fail the feed fetcher control actor if it takes a long time to complete.
2025-02-03 11:35:32 +01:00
Viktor Lofgren
c2dd2175a2
(search) Add new query expansion rule contracting WORD NUM pairs into WORD-NUM and WORDNUM
2025-02-01 13:13:30 +01:00
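As a rough illustration of the expansion rule above, adjacent WORD NUM pairs produce a hyphenated and a joined variant. The class, the token representation, and the definition of "word" and "number" used here are assumptions; the real query parser may apply different rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emits WORD-NUM and WORDNUM for each adjacent (WORD, NUM) token pair.
final class WordNumContraction {
    static List<String> contractions(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String word = tokens.get(i);
            String num = tokens.get(i + 1);
            if (isWord(word) && isNumber(num)) {
                out.add(word + "-" + num);
                out.add(word + num);
            }
        }
        return out;
    }

    // Assumption: a "word" is all letters and a "number" is all digits.
    private static boolean isWord(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isLetter);
    }

    private static boolean isNumber(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }
}
```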
Viktor Lofgren
b8581b0f56
(crawler) Safe sanitization of headers during warc->slop conversion
...
The warc->slop converter was rejecting some items because they had headers that were representable in the Warc code's MessageHeader map implementation, but illegal in the HttpHeaders' implementation.
Fixing this by manually filtering these out. Ostensibly the constructor has a filtering predicate, but this annoyingly runs too late and fails to prevent the problem.
2025-01-31 12:47:42 +01:00
Viktor Lofgren
2ea34767d8
(crawler) Use the response URL when resolving relative links
...
The crawler was incorrectly using the request URL as the base URL when resolving relative links. This caused problems when encountering redirects.
For example if we fetch /log, redirecting to /log/ and find links to foo/, and bar/; these would resolve to /foo and /bar, and not /log/foo and /log/bar.
2025-01-31 12:40:13 +01:00
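The /log example from the commit body can be reproduced with `java.net.URI.resolve`. This is only a sketch: `LinkResolver` is a hypothetical name and `example.com` a placeholder host.

```java
import java.net.URI;

// Hypothetical helper: relative links are resolved against the URL the response came from.
final class LinkResolver {
    static URI resolveAgainstResponse(URI responseUri, String href) {
        // After a redirect the response URI differs from the request URI, and it is the correct base.
        return responseUri.resolve(href);
    }
}
```

Resolving `foo/` against the request URL `https://example.com/log` yields `https://example.com/foo/`, while resolving it against the redirected response URL `https://example.com/log/` yields `https://example.com/log/foo/`.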
Viktor Lofgren
e9af838231
(actor) Fix migration actor final steps
2025-01-30 11:48:21 +01:00
Viktor Lofgren
ae0cad47c4
(actor) Utility method for getting a json prototype for actor states
...
If we can hook this into the control gui somehow, it'll make for a nice QOL upgrade when manually interacting with the actors.
2025-01-29 15:20:25 +01:00
Viktor Lofgren
5fbc8ef998
(misc) Tidying
2025-01-29 15:17:04 +01:00
Viktor Lofgren
32c6dd9e6a
(actor) Delete old data in the migration actor
2025-01-29 14:51:46 +01:00
Viktor Lofgren
6ece6a6cfb
(actor) Improve resilience for the migration actor
2025-01-29 14:43:09 +01:00
Viktor Lofgren
39cd1c18f8
Automatically run npm install tailwindcss@3 via setup.sh, as the new default version of the package is incompatible with the project
2025-01-29 12:21:08 +01:00
Viktor
eb65daaa88
Merge pull request #151 from Lionstiger/master
...
fix small grammar error in footerLegal.jte
2025-01-28 21:49:50 +01:00
Viktor
0bebdb6e33
Merge branch 'master' into master
2025-01-28 21:49:36 +01:00
Viktor Lofgren
1e50e392c6
(actor) Improve logging and error handling for data migration actor
2025-01-28 15:34:36 +01:00
Viktor Lofgren
fb673de370
(crawler) Change the header 'User-agent' to 'User-Agent'
2025-01-28 15:34:16 +01:00
Viktor Lofgren
eee73ab16c
(crawler) Be more lenient when performing a domain probe
2025-01-28 15:24:30 +01:00
Viktor Lofgren
5354e034bf
(search) Minor grammar fix
2025-01-27 18:36:31 +01:00
Magnus Wulf
72384ad6ca
fix small grammar error
2025-01-27 15:04:57 +01:00
Viktor Lofgren
a2b076f9be
(converter) Add progress tracking for big domains in converter
2025-01-26 18:03:59 +01:00
Viktor Lofgren
c8b0a32c0f
(crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams
2025-01-26 15:40:17 +01:00
Viktor Lofgren
f0d74aa3bb
(converter) Fix close() ordering to prevent converter crash
2025-01-26 14:47:36 +01:00
Viktor Lofgren
74a1f100f4
(converter) Refactor to remove CrawledDomainReader and move its functionality into SerializableCrawlDataStream
2025-01-26 14:46:50 +01:00
Viktor Lofgren
eb049658e4
(converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents
...
Refactor to do this without introducing additional copies
2025-01-26 14:28:53 +01:00
Viktor Lofgren
db138b2a6f
(converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents
2025-01-26 14:25:57 +01:00
Viktor Lofgren
1673fc284c
(converter) Reduce lock contention in converter by separating the processing of full and simple-track domains
2025-01-26 13:21:46 +01:00
Viktor Lofgren
503ea57d5b
(converter) Reduce lock contention in converter by separating the processing of full and simple-track domains
2025-01-26 13:18:14 +01:00
Viktor Lofgren
18ca926c7f
(converter) Truncate excessively long strings in SentenceExtractor, malformed data was effectively DOS:ing the converter
2025-01-26 12:52:54 +01:00
Viktor Lofgren
db99242db2
(converter) Adding some logging around the simple processing track to investigate an issue with the converter stalling
2025-01-26 12:02:00 +01:00
Viktor Lofgren
2b9d2985ba
(doc) Update readme with up-to-date install instructions.
2025-01-24 18:51:41 +01:00
Viktor Lofgren
eeb6ecd711
(search) Make it clearer that the affiliate marker applies to the result, and not the search engine's relation to the result.
2025-01-24 18:50:00 +01:00
Viktor Lofgren
1f58aeadbf
(build) Upgrade JIB
2025-01-24 18:49:28 +01:00
Viktor Lofgren
3d68be64da
(crawler) Add default CT when it's missing for icons
2025-01-22 13:55:47 +01:00
Viktor Lofgren
668f3b16ef
(search) Redirect ^/site/$ to /site
2025-01-22 13:35:18 +01:00
Viktor Lofgren
98a340a0d1
(crawler) Add favicon data to domain state db in its own table
2025-01-22 11:41:20 +01:00
Viktor Lofgren
8862100f7e
(crawler) Improve logging and error handling
2025-01-21 21:44:21 +01:00
Viktor Lofgren
274941f6de
(crawler) Smarter parquet->slop crawl data migration
2025-01-21 21:26:12 +01:00
Viktor Lofgren
abec83582d
Fix refactoring gore
2025-01-21 15:08:04 +01:00
Viktor Lofgren
569520c9b6
(index) Add manual adjustments for rankings based on domain
2025-01-21 15:07:43 +01:00
Viktor Lofgren
088310e998
(converter) Improve simple processing performance
...
The recent slop migration changes introduced a performance regression in the simple conversion track. This change fixes the regression.
2025-01-21 14:13:33 +01:00
Viktor
270cab874b
Merge pull request #134 from MarginaliaSearch/slop-crawl-data-spike
...
Store crawl data in slop instead of parquet
2025-01-21 13:34:22 +01:00
Viktor Lofgren
4c74e280d3
(crawler) Fix urlencoding in sitemap fetcher
2025-01-21 13:33:35 +01:00
Viktor Lofgren
5b347e17ac
(crawler) Automatically migrate to slop from parquet when crawling
2025-01-21 13:33:14 +01:00
Viktor Lofgren
55d6ab933f
Merge branch 'master' into slop-crawl-data-spike
2025-01-21 13:32:58 +01:00
Viktor Lofgren
43b74e9706
(crawler) Fix exception handler and resource leak in WarcRecorder
2025-01-20 23:45:28 +01:00
Viktor Lofgren
579a115243
(crawler) Reduce log spam from error handling in new sitemap fetcher
2025-01-20 23:17:13 +01:00
Viktor
2c67f50a43
Merge pull request #150 from MarginaliaSearch/httpclient-in-crawler
...
Reduce the use of 3rd party code in the crawler
2025-01-20 19:35:30 +01:00
Viktor Lofgren
78a958e2b0
(crawler) Fix broken test that started failing after the search engine moved to a new domain
2025-01-20 18:52:14 +01:00
Viktor Lofgren
4e939389b2
(crawler) New Jsoup based sitemap parser
2025-01-20 14:37:44 +01:00
Viktor Lofgren
e67a9bdb91
(crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead.
2025-01-19 15:07:11 +01:00
Viktor Lofgren
567e4e1237
(crawler) Fast detection and bail-out for crawler traps
...
Improve logging and exclude robots.txt from this logic.
2025-01-18 15:28:54 +01:00
Viktor Lofgren
4342e42722
(crawler) Fast detection and bail-out for crawler traps
...
Nepenthes has been doing the rounds on social media, so this adds an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly. Out-of-the-box crawl limits will also deal with this type of attack, but this fix is faster.
2025-01-17 13:02:57 +01:00
Viktor Lofgren
bc818056e6
(run) Fix templates for mariadb
...
Apparently the docker image contract changed at some point, and now we should spawn mariadbd and not mysqld; mariadb-admin and not mysqladmin.
2025-01-16 15:27:02 +01:00
Viktor Lofgren
de2feac238
(chore) Upgrade jib from 3.4.3 to 3.4.4
2025-01-16 15:10:45 +01:00
Viktor Lofgren
1e770205a5
(search) Dyslexia fix
2025-01-12 20:40:14 +01:00
Viktor
e44ecd6d69
Merge pull request #149 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2025-01-12 20:38:36 +01:00
Viktor
5b93a0e633
Update ROADMAP.md
2025-01-12 20:38:11 +01:00
Viktor
08fb0e5efe
Update ROADMAP.md
2025-01-12 20:37:43 +01:00
Viktor
bcf67782ea
Update ROADMAP.md
2025-01-12 20:37:09 +01:00
Viktor Lofgren
ef3f175ede
(search) Don't clobber the search query URL with default values
2025-01-10 15:57:30 +01:00
Viktor Lofgren
bbe4b5d9fd
Revert experimental changes
2025-01-10 15:52:02 +01:00
Viktor Lofgren
c67a635103
(search, experimental) Add a few debugging tracks to the search UI
2025-01-10 15:44:44 +01:00
Viktor Lofgren
20b24133fb
(search, experimental) Add a few debugging tracks to the search UI
2025-01-10 15:34:48 +01:00
Viktor Lofgren
f2567677e8
(index-client) Clean up index client code
...
Improve error handling. This should be a relatively rare case, but we don't want one bad index partition to blow up the entire query.
2025-01-10 15:17:07 +01:00
Viktor Lofgren
bc2c2061f2
(index-client) Clean up index client code
...
This should have the rpc stream reception be performed in parallel in separate threads, rather than blocking sequentially in the main thread, hopefully giving a slight performance boost.
2025-01-10 15:14:42 +01:00
Viktor Lofgren
1c7f5a31a5
(search) Further reduce the number of db queries by adding more caching to DbDomainQueries.
2025-01-10 14:17:29 +01:00
Viktor Lofgren
59a8ea60f7
(search) Further reduce the number of db queries by adding more caching to DbDomainQueries.
2025-01-10 14:15:22 +01:00
Viktor Lofgren
aa9b1244ea
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:56:04 +01:00
Viktor Lofgren
2d17233366
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:53:56 +01:00
Viktor Lofgren
b245cc9f38
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:46:19 +01:00
Viktor Lofgren
6614d05bdf
(db) Make db pool size configurable
2025-01-09 20:20:51 +01:00
Viktor Lofgren
55aeb03c4a
(feeds) Replace rssreader based parsing with a custom jsoup based rss parser
...
This solves some issues with the rssreader based parser, which was very picky about the XML being valid. Jsoup is much more lenient when parsing malformed XML.
2025-01-09 18:29:55 +01:00
Viktor Lofgren
faa589962f
(live-capture) Browserless now requires a token
2025-01-09 14:51:11 +01:00
Viktor Lofgren
c7edd6b39f
(live-capture) Browserless now requires a token
2025-01-09 14:46:05 +01:00
Viktor Lofgren
79da622e3b
(search) Update front page with new banner about move
2025-01-08 21:38:19 +01:00
Viktor Lofgren
3da8337ba6
(feeds) Add system property for exporting fetched feeds to a slop table for debugging
2025-01-08 20:49:16 +01:00
Viktor Lofgren
a32d230f0a
(special) Trigger deployment
2025-01-08 20:07:54 +01:00
Viktor Lofgren
3772bfd387
(query) Fix handling of optional ranking parameters
2025-01-08 17:11:22 +01:00
Viktor Lofgren
02a7900d1a
(search) Correct search-in-title toggle in search UI
2025-01-08 16:51:10 +01:00
Viktor Lofgren
a1fb92468f
(refac) Remove ResultRankingParameters, QueryLimits class and use protobuf classes directly instead
...
This is primarily to make the code a bit easier to reason about, and will reduce the level of indirection and data copying in the search-service->query-service->index-service communication chain.
2025-01-08 16:15:57 +01:00
Viktor Lofgren
b7f0a2a98e
(search-service) Fix metrics for errors and request times
...
This was previously in place, but broke during the jooby migration.
2025-01-08 14:10:43 +01:00
Viktor Lofgren
5fb76b2e79
(search-service) Fix metrics for errors and request times
...
This was previously in place, but broke during the jooby migration.
2025-01-08 14:06:03 +01:00
Viktor Lofgren
ad8c97f342
(search-service) Begin replacement of the crawl queue mechanism with node_affinity flagging
...
Previously a special db table was used to hold domains slated for crawling, but this is deprecated, and instead now each domain has a node_affinity flag that decides its indexing state, where a value of -1 indicates it shouldn't be crawled, a value of 0 means it's slated for crawling by the next index partition to be crawled, and a positive value means it's assigned to an index partition.
The change set also adds a test case validating the modified behavior.
2025-01-08 13:25:56 +01:00
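The node_affinity semantics described above can be summarized in code. The enum and class names are hypothetical, and treating every negative value like -1 is an assumption of this sketch; the commit only documents -1, 0, and positive values.

```java
// Hypothetical mapping of the node_affinity column to an indexing state.
final class NodeAffinity {
    enum State { NOT_CRAWLED, QUEUED_FOR_NEXT_PARTITION, ASSIGNED_TO_PARTITION }

    static State classify(int nodeAffinity) {
        if (nodeAffinity > 0) {
            return State.ASSIGNED_TO_PARTITION;     // value is the index partition
        }
        if (nodeAffinity == 0) {
            return State.QUEUED_FOR_NEXT_PARTITION; // picked up by the next partition to crawl
        }
        return State.NOT_CRAWLED;                   // -1 (other negatives assumed equivalent)
    }
}
```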
Viktor Lofgren
dc1b6373eb
(search-service) Clean up readme
2025-01-08 13:04:39 +01:00
Viktor Lofgren
983d6d067c
(search-service) Add indexing indicator to sibling domains listing
2025-01-08 12:58:34 +01:00
Viktor Lofgren
a84a06975c
(ranking-params) Add disable penalties flag to ranking params
...
This will help debugging ranking issues. Later it may be added to some filters.
2025-01-08 00:16:49 +01:00
Viktor Lofgren
d2864c13ec
(query-params) Add additional permitted query params
2025-01-07 20:21:44 +01:00
Viktor Lofgren
03ba53ce51
(legacy-search) Update nav bar with correct links
2025-01-07 17:44:52 +01:00
Viktor Lofgren
d4a6684931
(specialization) Soften length requirements for wiki-specialized documents (incl. cppreference)
2025-01-07 15:53:25 +01:00
Viktor
6f0485287a
Merge pull request #145 from MarginaliaSearch/cppreference_fixes
...
Cppreference fixes
2025-01-07 15:43:19 +01:00
Viktor Lofgren
59e2dd4c26
(specialization) Soften length requirements for wiki-specialized documents (incl. cppreference)
2025-01-07 15:41:30 +01:00
Viktor Lofgren
ca1807caae
(specialization) Add new specialization for cppreference.com
...
Give this reference website some synthetically generated tokens to improve the likelihood of a good match.
2025-01-07 15:41:05 +01:00
Viktor Lofgren
26c20e18ac
(keyword-extraction) Soften constraints on keyword patterns, allowing for longer segmented words
2025-01-07 15:20:50 +01:00
Viktor Lofgren
7c90b6b414
(query) Don't blindly make tokens containing a colon into a non-ranking advice term
2025-01-07 15:18:05 +01:00
Viktor Lofgren
b63c54c4ce
(search) Update opensearch.xml to point to non-redirecting domains.
2025-01-07 00:23:09 +01:00
Viktor Lofgren
fecd2f4ec3
(deploy) Add legacy search service to deploy script
2025-01-07 00:21:13 +01:00
Viktor Lofgren
39e420de88
(search) Add wayback machine link to siteinfo
2025-01-06 20:33:10 +01:00
Viktor Lofgren
dc83619861
(rssreader) Further suppress logging
2025-01-06 20:20:37 +01:00
Viktor Lofgren
87d1c89701
(search) Add listing of sibling subdomains to site overview
2025-01-06 20:17:36 +01:00
Viktor Lofgren
a42a7769e2
(legacy-search) Remove legacy paperdoll class
2025-01-06 20:17:36 +01:00
Viktor
202bda884f
Update readme.md
...
Add note about installing tailwindcss via npm
2025-01-06 18:35:13 +01:00
Viktor Lofgren
2315fdc731
(search) Vendor rssreader and modify it to be able to consume the nlnet atom feed
...
Also dial down the logging a bit for the rssreader package.
2025-01-06 17:58:50 +01:00
Viktor Lofgren
b5469bd8a1
(search) Turn relative feed URLs absolute when dealing with RSS/Atom item URLs
2025-01-06 16:56:24 +01:00
Viktor Lofgren
6a6318d04c
(search) Add separate websiteUrl property to legacy service
2025-01-06 16:26:08 +01:00
Viktor Lofgren
55933f8d40
(search) Ensure we respect old URL contracts
...
/explore/random should be equivalent to /explore
2025-01-06 16:20:53 +01:00
Viktor
be6382e0d0
Merge pull request #127 from MarginaliaSearch/serp-redesign
...
Web UI redesign
2025-01-06 16:08:14 +01:00
Viktor Lofgren
45e771f96b
(api) Update the / API redirect to the new documentation stub.
2025-01-06 16:07:32 +01:00
Viktor Lofgren
8dde502cc9
Merge branch 'master' into serp-redesign
2025-01-05 23:33:35 +01:00
Viktor Lofgren
3e66767af3
(search) Adjust query parsing to trim tokens in quoted search terms
...
Quoted search queries that contained keywords with possessive 's endings were not returning any results, as the index does not retain that suffix, and the query parser was not stripping it away in this code path.
This solves issue #143.
2025-01-05 23:33:09 +01:00
Viktor Lofgren
9ec9d1b338
Merge branch 'master' into serp-redesign
2025-01-05 21:10:20 +01:00
Viktor Lofgren
dcad0d7863
(search) Tweak token formation.
2025-01-05 21:01:09 +01:00
Viktor Lofgren
94e1aa0baf
(search) Tweak token formation to still break apart emails in brackets.
2025-01-05 20:55:44 +01:00
Viktor Lofgren
b62f043910
(search) Adjust token formation rules to be more lenient to C++ and PHP code.
...
This addresses Issue #142
2025-01-05 20:50:27 +01:00
Viktor Lofgren
6ea22d0d21
(search) Update front page with work-in-progress note
2025-01-05 19:08:02 +01:00
Viktor Lofgren
8c69dc31b8
Merge branch 'master' into serp-redesign
2025-01-05 18:52:51 +01:00
Viktor Lofgren
00734ea87f
(search) Add hover text for matchogram
2025-01-05 18:50:44 +01:00
Viktor Lofgren
3009713db4
(search) Fix broken tests
2025-01-05 18:50:27 +01:00
Viktor
9b2ceaf37c
Merge pull request #141 from MarginaliaSearch/vlofgren-patch-1
...
Update FUNDING.yml
2025-01-05 18:40:20 +01:00
Viktor
8019c2ce18
Update FUNDING.yml
2025-01-05 18:40:06 +01:00
Viktor Lofgren
a9e312b8b1
(service) Add links to marginalia-search.com where appropriate
2025-01-05 16:56:38 +01:00
Viktor Lofgren
4da3563d8a
(service) Clean up exceptions when requestScreengrab is not available
2025-01-04 14:45:51 +01:00
Viktor Lofgren
48d0a3089a
(service) Improve logging around grpc
...
This change adds a marker for the gRPC-specific logging, as well as improves the clarity and meaningfulness of the log messages.
2025-01-02 20:40:53 +01:00
Viktor Lofgren
594df64b20
(domain-info) Use appropriate sqlite database when fetching feed status
2025-01-02 20:20:36 +01:00
Viktor Lofgren
06efb5abfc
Merge branch 'master' into serp-redesign
2025-01-02 18:42:12 +01:00
Viktor Lofgren
78eb1417a7
(service) Only block on SingleNodeChannelPool creation in QueryClient
...
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.
This should make most services start faster as a result.
2025-01-02 18:42:01 +01:00
Viktor Lofgren
8c8f2ad5ee
(search) Add an indicator when a link has a feed in the similar/linked domains views
2025-01-02 18:11:57 +01:00
Viktor Lofgren
f71e79d10f
(search) Add a copy of the old UI as a separate service, search-service-legacy
2025-01-02 18:03:42 +01:00
Viktor Lofgren
1b27c5cf06
(search) Add a copy of the old UI as a separate service, search-service-legacy
2025-01-02 18:02:17 +01:00
Viktor Lofgren
67edc8f90d
(domain-info) Only flag domains with rss feed items as having a feed
2025-01-02 17:41:52 +01:00
Viktor Lofgren
5f576b7d0c
(query-parser) Strip leading underlines
...
This addresses issue #140, where __builtin_ffs gives no results.
2025-01-02 14:39:03 +01:00
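A minimal sketch of the stripping step named in the commit above, using the `__builtin_ffs` query from the commit body as the example input. `QueryTokens` and `stripLeadingUnderscores` are hypothetical names, not the parser's API.

```java
// Hypothetical helper: removes underscores at the start of a query token only.
final class QueryTokens {
    static String stripLeadingUnderscores(String token) {
        int start = 0;
        while (start < token.length() && token.charAt(start) == '_') {
            start++;
        }
        return token.substring(start); // interior underscores are left untouched
    }
}
```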
Viktor Lofgren
8b05c788fd
(Search) Enable gzip compression of responses
2025-01-01 18:34:42 +01:00
Viktor Lofgren
236f033bc9
(Search) Reduce whitespace in explore view on all resolutions
2025-01-01 18:23:35 +01:00
Viktor Lofgren
510fc75121
(Search) Reduce whitespace in explorer view on mobile
2025-01-01 18:18:09 +01:00
Viktor Lofgren
0376f2e6e3
Merge branch 'master' into serp-redesign
...
# Conflicts:
# code/services-application/search-service/resources/templates/search/index/index.hdb
2025-01-01 18:15:09 +01:00
Viktor Lofgren
0b65164f60
(chore) Fix broken test
2025-01-01 18:06:29 +01:00
Viktor Lofgren
9be477de33
(domain-info) Add a feed flag to domain info
...
This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.
2025-01-01 18:02:33 +01:00
Viktor Lofgren
84f55b84ff
(search) Add experimental OPML-export function for feed subscriptions
2025-01-01 17:17:54 +01:00
Viktor Lofgren
ab5c30ad51
(search) Fix site info view for completely unknown domains
...
Also correct the DbDomainQueries.getDomainId so that it throws NoSuchElementException when domain id is missing, and not UncheckedExecutionException via Cache.
2025-01-01 16:29:01 +01:00
Viktor Lofgren
0c839453c5
(search) Fix crosstalk link
2025-01-01 16:09:19 +01:00
Viktor Lofgren
5e4c5d03ae
(search) Clean up breakpoints in site overview
2025-01-01 16:06:08 +01:00
Viktor Lofgren
710af4999a
(feed-fetcher) Add &quot; entity mapping in feed fetcher
2025-01-01 15:45:17 +01:00
Viktor Lofgren
a5b0a1ae62
(search) Move linked/similar domains to a popover style menu on mobile
...
Fix scroll
2025-01-01 15:37:35 +01:00
Viktor Lofgren
e9f71ee39b
(search) Move linked/similar domains to a popover style menu on mobile
2025-01-01 15:23:25 +01:00
Viktor Lofgren
baeb4a46cd
(search) Reintroduce query rewriting for recipes, add rules for wikis and forums
2024-12-31 16:05:00 +01:00
Viktor Lofgren
5e2a8e9f27
(deploy) Add capability of adding tags to deploy script
2024-12-31 16:04:13 +01:00
Viktor
cc1a5bdf90
Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2024-12-31 14:41:02 +01:00
Viktor
7f7b1ffaba
Update ROADMAP.md
2024-12-31 14:40:34 +01:00
Viktor Lofgren
0ea8092350
(search) Add link promoting the redesign beta
2024-12-30 15:47:13 +01:00
Viktor Lofgren
483d29497e
(deploy) Add hashbang to deploy script
2024-12-30 15:47:13 +01:00
Viktor Lofgren
bae44497fe
(crawler) Add a new system property crawler.maxFetchSize
...
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items that are too large, and the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
Viktor Lofgren
0d59202aca
(crawler) Do not remove W/-prefix on weak e-tags
...
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
Viktor Lofgren
0ca43f0c9c
(live-crawler) Improve live crawler short-circuit logic
...
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
Viktor Lofgren
3bc99639a0
(feed-fetcher) Make feed fetcher requests conditional
...
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests. On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.
A new table was added to the FeedDb to hold one etag per domain.
If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.
This completes the changes for Issue #136.
2024-12-27 15:10:15 +01:00
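A conditional request of the kind described above could be built with Java's `HttpRequest` as follows. This is a sketch under assumptions: `ConditionalRequests` and `build` are hypothetical names, and the stored ETag and cutoff date are supplied by the caller instead of being read from the FeedDb.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical builder for a GET that lets the server answer 304 Not Modified.
final class ConditionalRequests {
    // HTTP-date in the fixed-length format with a two-digit day of month.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    static HttpRequest build(URI uri, String etag, Instant ifModifiedSince) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(uri).GET();
        if (etag != null) {
            // Sent back exactly as received, including any W/ prefix on weak ETags.
            builder.header("If-None-Match", etag);
        }
        if (ifModifiedSince != null) {
            builder.header("If-Modified-Since", HTTP_DATE.format(ifModifiedSince));
        }
        return builder.build();
    }
}
```

A 304 response to such a request means the stored copy is still current and no body needs to be downloaded or parsed.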
Viktor Lofgren
927bc0b63c
(live-crawler) Add Accept-Encoding: gzip to outbound requests
...
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.
The change addresses issue #136, save for making the fetcher's requests conditional.
2024-12-27 03:59:34 +01:00
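The decoding side of the gzip change might look like the sketch below: the request carries `Accept-Encoding: gzip`, and the body is decompressed only when the response's `Content-Encoding` says so. `GzipBody` and `decode` are hypothetical names, and holding the body as a byte array is an assumption of the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: decompresses a response body if the server marked it as gzip.
final class GzipBody {
    static byte[] decode(byte[] body, String contentEncoding) {
        if (contentEncoding == null || !"gzip".equalsIgnoreCase(contentEncoding.trim())) {
            return body; // server ignored Accept-Encoding; body is already plain
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```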
Viktor Lofgren
d968801dc1
(converter) Drop feed data from SlopDomainRecord
...
Also remove feed extraction from converter. This is the crawler's responsibility now.
2024-12-26 17:57:08 +01:00
Viktor Lofgren
89db69d360
(crawler) Correct feed URLs in domain state db
...
Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
2024-12-26 15:18:31 +01:00
Viktor Lofgren
895cee7004
(crawler) Improved feed discovery, new domain state db per crawlset
...
Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.
Solves issue #135
2024-12-26 15:05:52 +01:00
Viktor Lofgren
4bb71b8439
(crawler) Correct content type probing to only run on URLs that are suspected to be binary
2024-12-26 14:26:23 +01:00
Viktor Lofgren
81cdd6385d
Add rendering tests for most major views
...
This will prevent accidentally deploying a broken search service
2024-12-25 15:22:26 +01:00
Viktor Lofgren
e76c42329f
Correct dark mode for infobox in site focused search
2024-12-25 15:06:05 +01:00
Viktor Lofgren
e6ef4734ea
Fix tests
2024-12-25 15:05:41 +01:00
Viktor Lofgren
df4bc1d7e9
Add update time to front page subscriptions
2024-12-25 14:42:00 +01:00
Viktor Lofgren
2b222efa75
Merge branch 'master' into serp-redesign
2024-12-25 14:22:42 +01:00
Viktor Lofgren
6d18e6d840
(search) Add clustering to subscriptions view
2024-12-18 15:36:05 +01:00
Viktor Lofgren
2a3c63f209
(search) Exclude generated style.css from git
2024-12-18 15:24:31 +01:00
Viktor Lofgren
9f70cecaef
(search) Add site subscription feature that puts RSS updates on the front page
2024-12-18 15:24:31 +01:00
Viktor Lofgren
47e58a21c6
Refactor documentBody method and ContentType charset handling
...
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
2024-12-17 17:11:37 +01:00
Viktor Lofgren
3714104976
Add loader for slop data in converter.
...
Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00
Viktor Lofgren
f6f036b9b1
Switch to new Slop format for crawl data storage and processing.
...
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
2024-12-15 19:34:03 +01:00
Viktor Lofgren
b510b7feb8
Spike for storing crawl data in slop instead of parquet
...
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds. On-disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
c08203e2ed
(search) Prevent paperdoll from being run as a test by CI
2024-12-14 20:35:57 +01:00
Viktor Lofgren
86497fd32f
(site-info) Mobile layout fix
2024-12-14 16:19:56 +01:00
Viktor Lofgren
3b998573fd
Adjust colors on dark mode for site overview
2024-12-13 21:51:25 +01:00
Viktor Lofgren
e161882ec7
(search) Fix layout for light mode
2024-12-13 21:47:29 +01:00
Viktor Lofgren
357f349e30
(search) Table layout fixes for dictionary lookup
2024-12-13 21:47:08 +01:00
Viktor Lofgren
e4769f541d
(search) Sort and deduplicate search results for better relevance.
...
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
2024-12-13 21:47:08 +01:00
Viktor Lofgren
2a173e2861
(search) Dark Mode
2024-12-13 21:47:07 +01:00
Viktor Lofgren
a6a900266c
(search) Fix redirects
2024-12-13 02:40:51 +01:00
Viktor Lofgren
bdba53f055
(site) Update domain parameter type from PathParam to QueryParam
2024-12-13 02:15:35 +01:00
Viktor Lofgren
bbdde789e7
Merge branch 'master' into serp-redesign
2024-12-11 19:45:17 +01:00
Viktor Lofgren
eab61cd48a
Merge branch 'master' into serp-redesign
2024-12-11 17:09:27 +01:00
Viktor Lofgren
0ce2ba9ad9
(jooby) Fix asset handler
2024-12-11 14:38:04 +01:00
Viktor Lofgren
3ddcebaa36
(search) Give serp/start a name more consistent with siteinfo/start
...
The change also cleans up the layout a bit.
2024-12-11 14:33:57 +01:00
Viktor Lofgren
b91463383e
(jooby) Clean up initialization process
2024-12-11 14:33:18 +01:00
Viktor Lofgren
7444a2f36c
(site-info) Add placeholder when a feed item lacks a title.
2024-12-10 22:46:12 +01:00
Viktor Lofgren
fdee07048d
(search) Remove Spark and migrate to Jooby for the search service
2024-12-10 19:13:13 +01:00
Viktor Lofgren
2fbf201761
(search) Adjust crosstalk flex-basis
2024-12-10 15:12:51 +01:00
Viktor Lofgren
4018e4c434
(search) Add crosstalk to paperdoll
2024-12-10 15:12:39 +01:00
Viktor Lofgren
f3382b5bd8
(search) Completely remove all old hdb templates
...
Create new views for conversion results, dictionary results, and site crosstalk.
2024-12-10 15:04:49 +01:00
Viktor Lofgren
9287ee0141
(search) Improve hyphenation logic for titles
2024-12-09 15:29:10 +01:00
Viktor Lofgren
2769c8f869
(search) Remove sticky search bar to aid with performance on Firefox (and iOS?)
2024-12-09 15:20:33 +01:00
Viktor Lofgren
ddb66f33ba
(search) Add more feedback when pressing some buttons
2024-12-09 15:07:23 +01:00
Viktor Lofgren
79500b8fbc
(search) Move search bar back up top on mobile, put filter button at the bottom instead.
2024-12-09 14:55:37 +01:00
Viktor Lofgren
187eea43a4
(search) Remove redundant @if
2024-12-09 14:46:02 +01:00
Viktor Lofgren
a89ed6fa9f
(search) Fix rendering on site overview, more dense serp layout on mobile
2024-12-09 14:45:45 +01:00
Viktor Lofgren
8d168be138
(search) Typeahead search, etc.
2024-12-07 15:47:01 +01:00
Viktor Lofgren
6e1aa7b391
(search) Make style.css depend on jte file changes
...
Also add a hack to ensure classes generated from java code get included in the stylesheet as intended.
2024-12-07 14:11:22 +01:00
Viktor Lofgren
deab9b9516
(search) Clean up start views for search and site-info
2024-12-07 14:11:22 +01:00
Viktor Lofgren
39d99a906a
(search) Add proper tailwind build and host fontawesome locally
2024-12-07 14:11:22 +01:00
Viktor Lofgren
6f72e6e0d3
(explore) Add lazy loading and alt attributes to images
2024-12-07 14:11:22 +01:00
Viktor Lofgren
d786d79483
(site-info) Add whitespace-nowrap to pubDay span in overview.jte
2024-12-07 14:11:22 +01:00
Viktor Lofgren
01510f6c2e
(serp) Add wayback link to search results
2024-12-07 14:11:22 +01:00
Viktor Lofgren
7ba43e9e3f
(site) Adjust sizing of navbars
2024-12-07 14:11:16 +01:00
Viktor Lofgren
97bfcd1353
(site) Layout changes site-info
2024-12-07 14:11:16 +01:00
Viktor Lofgren
aa3c85c196
(site) Mobile layout fixes
2024-12-07 14:11:16 +01:00
Viktor Lofgren
fb75a3827d
(site) Adjust coloration of search results
2024-12-05 16:58:00 +01:00
Viktor Lofgren
7d546d0e2a
(site) Make SearchParameters generate relative URLs instead of absolute
2024-12-05 16:47:22 +01:00
Viktor Lofgren
8fcb6ffd7a
(site-info) Increase contrast in search results for forums, wikis
2024-12-05 16:42:16 +01:00
Viktor Lofgren
f97de0c15a
(site-info) Fix layout
2024-12-05 16:33:46 +01:00
Viktor Lofgren
be9e192b78
(site-info) Fix pagination in backlinks and documents views
2024-12-05 16:26:11 +01:00
Viktor Lofgren
75ae1c9526
(site-info) Do not show 'suggest for crawling' when the node affinity is already set to 0
...
This indicates the domain is already slated for crawling.
2024-12-05 16:18:46 +01:00
Viktor Lofgren
33761a0236
(site-info) Make the search box in the site viewer functional
2024-12-05 16:16:29 +01:00
Viktor Lofgren
19b69b1764
(site-info) Only show samples if feed is absent, never both.
2024-12-05 16:05:03 +01:00
Viktor Lofgren
8b804359a9
(serp) Layout fixes for mobile
2024-12-05 15:59:33 +01:00
Viktor Lofgren
f050bf5c4c
(WIP) Initial semi-working transformation to new tailwind UI
...
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
2024-12-05 14:00:17 +01:00