Viktor Lofgren
4a98a3c711
(skiplist) Move to a separate directory instead of in the btree module
2025-08-14 01:09:46 +02:00
Viktor Lofgren
68f52ca350
(test) Fix tests that work on my machine (TM)
2025-08-14 00:59:58 +02:00
Viktor Lofgren
2a2d951c2f
(index) Fix unhinged default values for index.preparationThreads
2025-08-14 00:54:35 +02:00
Viktor Lofgren
379a1be074
(index) Add better timeout handling in UringQueue, fix slow memory leak on timeout exception
2025-08-14 00:52:50 +02:00
Viktor Lofgren
827aadafcd
(uring) Reintroduce auto-slicing of excessively long read batches
2025-08-13 14:33:35 +02:00
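As a rough illustration of what auto-slicing a long read batch can look like, the sketch below splits a list of pending reads into consecutive slices of bounded size. The class name `BatchSlicer`, the method `slice`, and the idea of a fixed maximum slice size are assumptions made for this example; the commit does not say how the actual UringFileReader code does it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the repository.
final class BatchSlicer {
    /** Splits items into consecutive slices of at most maxSliceSize elements each. */
    static <T> List<List<T>> slice(List<T> items, int maxSliceSize) {
        if (maxSliceSize <= 0) {
            throw new IllegalArgumentException("maxSliceSize must be positive");
        }
        List<List<T>> slices = new ArrayList<>();
        int start = 0;
        while (start < items.size()) {
            // Never read past the end of the list; the last slice may be shorter
            int end = start + Math.min(maxSliceSize, items.size() - start);
            slices.add(items.subList(start, end)); // a view, no copying
            start = end;
        }
        return slices;
    }
}
```

Each slice could then be submitted as its own batch, so that no single submission exceeds the queue capacity.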
Viktor Lofgren
aa7679d6ce
(pool) Fix bug in exceptionally rare edge case leading to incorrect reads
2025-08-13 14:28:50 +02:00
Viktor Lofgren
6fe6de766d
(pool) Fix SegmentMemoryPage storage
2025-08-13 13:17:14 +02:00
Viktor Lofgren
4245ac4c07
(doc) Update docs to reflect that we now need io_uring
2025-08-12 15:12:54 +02:00
Viktor Lofgren
1c49a0f5ad
(index) Add system properties for toggling O_DIRECT mode for positions and spans
2025-08-12 15:11:13 +02:00
Viktor Lofgren
9a6e5f646d
(docker) Add security_opt: seccomp:unconfined to docker-compose files
...
This is needed to access io_uring via docker.
2025-08-12 15:10:26 +02:00
Viktor Lofgren
fa92994a31
(uring) Fall back to simple I/O planning behavior when buffered mode is selected in UringFileReader
2025-08-11 23:44:38 +02:00
Viktor Lofgren
bc49406881
(build) Compatibility hack for Debian server
2025-08-11 23:26:53 +02:00
Viktor Lofgren
90325be447
(minor) Fix comments
2025-08-11 23:19:53 +02:00
Viktor Lofgren
dc89587af3
(index) Improve disk locality of the positions data
2025-08-11 21:17:12 +02:00
Viktor Lofgren
7b552afd6b
(index) Improve disk locality of the positions data
2025-08-11 20:59:11 +02:00
Viktor Lofgren
73557edc67
(index) Improve disk locality of the positions data
2025-08-11 20:57:32 +02:00
Viktor Lofgren
83919e448a
(index) Use O_DIRECT buffered reads for spans
2025-08-11 18:04:25 +02:00
Viktor Lofgren
6f5b75b84d
(cleanup) Remove accidentally committed print stmt
2025-08-11 18:04:25 +02:00
Viktor Lofgren
db315e2813
(index) Use O_DIRECT position reads
2025-08-11 18:04:25 +02:00
Viktor Lofgren
e9977e08b7
(index) Block-align positions data
...
This will make reads more efficient, and possibly pave the way for O_DIRECT reads of this data
2025-08-11 14:36:45 +02:00
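Block alignment usually comes down to rounding a file offset up to the next block boundary before writing a record. The sketch below shows that arithmetic for power-of-two block sizes. The class name `BlockAlign` and the 4096-byte block size in the usage note are assumptions for illustration; the commit does not state the block size or the layout it uses.

```java
// Hypothetical helper, not taken from the repository.
final class BlockAlign {
    /** Rounds offset up to the next multiple of blockSize (a power of two). */
    static long alignUp(long offset, long blockSize) {
        if (blockSize <= 0 || (blockSize & (blockSize - 1)) != 0) {
            throw new IllegalArgumentException("blockSize must be a positive power of two");
        }
        // Adding blockSize - 1 and clearing the low bits rounds up to a boundary
        return (offset + blockSize - 1) & ~(blockSize - 1);
    }

    /** Number of padding bytes needed to move offset to the next block boundary. */
    static long paddingFor(long offset, long blockSize) {
        return alignUp(offset, blockSize) - offset;
    }
}
```

For example, with an assumed 4096-byte block, `BlockAlign.alignUp(4097, 4096)` yields 8192, and an offset that is already aligned is returned unchanged. O_DIRECT reads generally require offsets and lengths aligned in this way.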
Viktor Lofgren
1df3757e5f
(native) Clean up io_uring code and check in execution queue, currently unused but nifty
2025-08-11 13:54:05 +02:00
Viktor Lofgren
ca283f9684
(native) Clean up native helpers and break them into their own library
2025-08-10 20:55:34 +02:00
Viktor Lofgren
85360e61b2
(index) Grow span writer buffer size
...
Apparently outlier spans can grow considerably large.
2025-08-10 17:20:38 +02:00
Viktor Lofgren
e2ccff21bc
(index) Wait until ranking is finished in query execution
2025-08-09 23:40:30 +02:00
Viktor Lofgren
c5b5b0c699
(index) Permit fast termination of rejection filter execution
2025-08-09 23:36:59 +02:00
Viktor Lofgren
9a65946e22
(uring) Reduce queue size to 2048 to avoid ENOMEM on systems with default ulimits
2025-08-09 20:41:24 +02:00
Viktor Lofgren
1d2ab21e27
(index) Aggregate termdata reads into a single io_uring operation instead of one for each term
2025-08-09 17:43:18 +02:00
Viktor Lofgren
0610cc19ad
(index) Fix double close errors
2025-08-09 17:05:38 +02:00
Viktor Lofgren
a676306a7f
(skiplist) Fix bugs in seek operations
2025-08-09 17:00:27 +02:00
Viktor Lofgren
8d68cd14fb
(skiplist) Even more aggressive forward pointers
2025-08-09 16:11:41 +02:00
Viktor Lofgren
4773c5a52b
(index) Backport some changes made during performance evaluations
2025-08-09 15:19:41 +02:00
Viktor Lofgren
74bd562ae4
(index) Move I/O to separate threads to hopefully reduce contention a bit
2025-08-09 15:19:41 +02:00
Viktor Lofgren
c9751287b0
(index) Boost the buffer size used in PrioIndexEntrySource
2025-08-09 01:46:12 +02:00
Viktor Lofgren
5da24e3fc4
(index) Segregate full and priority query ranking
2025-08-09 00:39:31 +02:00
Viktor Lofgren
20a4e86eec
(index) Use a confined arena in IndexResultRankingService
2025-08-08 22:08:35 +02:00
Viktor Lofgren
477a184948
(experiment) Allow early termination of include conditions in lookups
2025-08-08 19:12:54 +02:00
Viktor Lofgren
8940ce99db
(perf) More statistics in perf test
2025-08-08 18:57:25 +02:00
Viktor Lofgren
0ac0fa4dca
(perf) More statistics in perf test
2025-08-08 18:56:17 +02:00
Viktor Lofgren
942f15ef14
(skiplist) Use a linear-quadratic forward pointer scheme instead of an exponential
2025-08-08 16:57:15 +02:00
Viktor Lofgren
f668f33d5b
(index) Tweaks and optimizations
2025-08-08 15:32:23 +02:00
Viktor Lofgren
6789975cd2
(index) Tweaks and optimizations
2025-08-08 15:30:48 +02:00
Viktor Lofgren
c3ba608776
(index) Split up evaluation tasks
2025-08-08 15:20:33 +02:00
Viktor Lofgren
733d2687fe
(skiplist) Roll back the design change that segregated the values associated with documents into a separate file
2025-08-08 14:45:11 +02:00
Viktor Lofgren
f6daac8ed0
(index) MADVISE_RANDOM the index btrees
2025-08-07 21:14:28 +02:00
Viktor Lofgren
c2eeee4a06
(uring) Disable result set combination
2025-08-07 21:13:30 +02:00
Viktor Lofgren
3b0c701df4
(uring) Update uring timeout threshold
2025-08-07 20:13:25 +02:00
Viktor Lofgren
c6fb2db43b
(index) Use a more SLA-aware execution scheduler
2025-08-07 20:13:15 +02:00
Viktor Lofgren
9bc8fe05ae
(skiplist) Clean up search logic
2025-08-07 19:35:25 +02:00
Viktor Lofgren
440ffcf6f8
(skiplist) Fix bug in intersection-like algorithms
2025-08-07 02:18:14 +02:00
Viktor Lofgren
b07709cc72
(native) Disable expensive debug checks from uring code
2025-08-06 21:05:28 +02:00
Viktor Lofgren
9a6acdcbe0
(skiplist) Tag slow fuzz test as "slow"
2025-08-06 20:59:52 +02:00
Viktor Lofgren
23b9b0bf1b
(index) Parametrize skip list block size and buffer pool sizes
2025-08-06 20:59:33 +02:00
Viktor Lofgren
749c8ed954
(pool) Correct buffer pool alignment
2025-08-06 20:56:34 +02:00
Viktor Lofgren
9f4b6939ca
(skiplist) Fix condition for truncated block writing
2025-08-06 16:25:53 +02:00
Viktor Lofgren
1d08e44e8d
(uring) Fadvise random access for uring buffered reads
2025-08-06 15:54:24 +02:00
Viktor Lofgren
fc2e156e78
(skiplist) Ensure docs file is a multiple of BLOCK_SIZE bytes
2025-08-06 15:13:32 +02:00
Viktor Lofgren
5e68a89e9f
(index) Improve error handling
2025-08-06 15:05:16 +02:00
Viktor Lofgren
d380661307
(index) Improve error handling
2025-08-06 14:31:06 +02:00
Viktor Lofgren
cccdf5c329
(pool) Check interrupt status in PoolLru's reclamation thread
2025-08-06 13:26:00 +02:00
Viktor Lofgren
f085b4ea12
(skiplist) Fix tests
2025-08-06 13:24:14 +02:00
Viktor Lofgren
e208f7d3ba
(skiplist) Code cleanup and added validation
2025-08-06 12:55:04 +02:00
Viktor Lofgren
b577085cb2
(pool) Use one contiguous memory allocation to encourage a HugePage allocation and reduce TLB thrashing
2025-08-06 12:49:46 +02:00
Viktor Lofgren
b9240476f6
(pool) Use one contiguous memory allocation to encourage a HugePage allocation and reduce TLB thrashing
2025-08-06 12:48:14 +02:00
Viktor Lofgren
8f50f86d0b
(index) Fix error handling
2025-08-05 22:19:23 +02:00
Viktor Lofgren
e3b7ead7a9
(skiplist) Fix aggressive forward pointers
2025-08-05 20:47:38 +02:00
Viktor Lofgren
9a845ba604
(skiplist) EXPERIMENTAL - Store data in a separate file from document ids
2025-08-05 19:10:58 +02:00
Viktor Lofgren
b9381f1603
(skiplist) EXPERIMENTAL - Store data in a separate file from document ids
2025-08-05 17:35:13 +02:00
Viktor Lofgren
6a60127267
(skiplist) EXPERIMENTAL - Store data in a separate file from document ids
2025-08-05 16:54:39 +02:00
Viktor Lofgren
e8ffcfbb19
(skiplist) Correct binary search implementation, fix intersection logic
2025-08-04 14:49:09 +02:00
Viktor Lofgren
caf0850f81
(index) Clean up code
2025-08-04 00:12:35 +02:00
Viktor Lofgren
62e3bb675e
(btree) Remove O_DIRECT btree implementation
2025-08-03 23:43:31 +02:00
Viktor Lofgren
4dc3e7da7a
(perf) Remove warmup from perf test, it's not doing much
2025-08-03 21:19:54 +02:00
Viktor Lofgren
92b09883ec
(index) Switch from AIO to io_uring
...
Turns out AIO is just bad, especially with buffered I/O; io_uring performs strictly better in this scenario.
2025-08-03 21:19:54 +02:00
Viktor Lofgren
87082b4ef8
(index) Use AIO for reading spans and positions
...
This performs slightly worse in benchmarks, but that's likely caused by hitting the page cache.
AIO will tend to perform better when we see cache misses, which is the expected case in production on real-world data.
2025-08-03 21:19:54 +02:00
Viktor Lofgren
84d3f6087f
(skiplist) Parametrize skip list block size, increase to 4K pages
2025-08-03 21:19:54 +02:00
Viktor Lofgren
f93ba371a5
(pool) Fix the LRU to not deadlock and be shit
2025-08-03 21:19:54 +02:00
Viktor Lofgren
5eec27c68d
(pool) Fix for 32 bit rollover in clockHand for LRU
2025-08-03 21:19:54 +02:00
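A 32-bit rollover bug in a clock-style eviction scheme typically appears when an ever-increasing int counter is mapped to a slot with `%`, which returns a negative index once the counter wraps past `Integer.MAX_VALUE`. The sketch below shows one common remedy using `Math.floorMod`. The class `ClockHand` and its fields are hypothetical, and this is not necessarily how the commit fixes the issue in PoolLru.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration, not taken from the repository.
final class ClockHand {
    private final AtomicInteger ticks;
    private final int slots;

    ClockHand(int slots, int initialTicks) {
        if (slots <= 0) {
            throw new IllegalArgumentException("slots must be positive");
        }
        this.slots = slots;
        this.ticks = new AtomicInteger(initialTicks);
    }

    /** Returns the next slot index, always in [0, slots), even after the counter wraps. */
    int next() {
        // floorMod never returns a negative result for a positive divisor,
        // whereas ticks % slots would be negative once ticks has wrapped around
        return Math.floorMod(ticks.getAndIncrement(), slots);
    }
}
```

Starting the counter near `Integer.MAX_VALUE` in a test is a cheap way to exercise the wrap-around path without performing two billion increments.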
Viktor Lofgren
ab01576f91
(pool) Use one global buffer pool instead of many small ones, improved LRU with gclock reclamation, skip list optimization
2025-08-03 21:19:54 +02:00
Viktor Lofgren
054e5ccf44
(pool) Testing synchronized to see if I can find the deadlock
2025-08-03 21:19:54 +02:00
Viktor Lofgren
4351ea5128
(pool) Fix buffer leak
2025-08-03 21:19:54 +02:00
Viktor Lofgren
49cfa3a5e9
(pool) Decrease LQB size
2025-08-03 21:19:54 +02:00
Viktor Lofgren
683854b23f
(pool) Fix logging
2025-08-03 21:19:54 +02:00
Viktor Lofgren
e880fa8945
(pool) Simplify locking in PoolLru
2025-08-03 21:19:54 +02:00
Viktor Lofgren
2482dc572e
(pool) Grow free queue size
2025-08-03 21:19:54 +02:00
Viktor Lofgren
4589f11898
(pool) More stats
2025-08-03 21:19:54 +02:00
Viktor Lofgren
e43b6e610b
(pool) Adjust pool reclamation strategy
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4772117a1f
(skiplist) First stab at a skiplist replacement for btrees in the documents lists
2025-08-03 21:19:53 +02:00
Viktor Lofgren
3fc7ea521c
(pool) Remove readahead and simplify the code
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4372f5af03
(pool) More performant LRU pool + better instructions queue
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4ad89b6c75
(pool) More performant LRU pool
2025-08-03 21:19:53 +02:00
Viktor Lofgren
ad0519e031
(index) Optimizations
2025-08-03 21:19:53 +02:00
Viktor Lofgren
596ece1230
(pool) Fix deadlock during pool starvation
2025-08-03 21:19:53 +02:00
Viktor Lofgren
07b6e1585b
(pool) Bump pool sizes
2025-08-03 21:19:53 +02:00
Viktor Lofgren
cb5e2778eb
(pool) Align the buffers with 512b
2025-08-03 21:19:53 +02:00
Viktor Lofgren
8f5ea7896c
(btree) More debug information on numEntries = 0 scenario
2025-08-03 21:19:53 +02:00
Viktor Lofgren
76c398e0b1
(index) Fix lingering issues with previous optimizations
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4a94f04a8d
(btree) Debug logging
2025-08-03 21:19:53 +02:00
Viktor Lofgren
df72f670d4
(btree) Fix queryData
2025-08-03 21:19:53 +02:00
Viktor Lofgren
eaa22c2f5a
(*) Logging
2025-08-03 21:19:53 +02:00
Viktor Lofgren
7be173aeca
(pool) Only dump statistics if they say anything
2025-08-03 21:19:53 +02:00
Viktor Lofgren
36685bdca7
(btree) Fix retain implementation
2025-08-03 21:19:53 +02:00
Viktor Lofgren
ad04057609
(btree) Add short circuits when retain/rejecting on an empty tree
2025-08-03 21:19:53 +02:00
Viktor Lofgren
eb76ae22e2
(perf) Use lqb size 512 in perf test
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4b858ab341
(btree) Cache retain/reject reads
2025-08-03 21:19:53 +02:00
Viktor Lofgren
c6e3c8aa3b
(index) Focus pools to try to increase reuse
2025-08-03 21:19:53 +02:00
Viktor Lofgren
9128d3907c
(index) Periodically dump buffer metrics
2025-08-03 21:19:53 +02:00
Viktor Lofgren
4ef16d13d4
(index) O_DIRECT based buffer pool for index reads
2025-07-30 15:04:23 +02:00
Viktor Lofgren
838a5626ec
(index) Reduce query buffer size
2025-07-27 21:42:04 +02:00
Viktor Lofgren
6b426209c7
(index) Restore threshold for work stealing in query execution
2025-07-27 21:41:46 +02:00
Viktor Lofgren
452b5731d9
(index) Lower threshold for work stealing in query execution
2025-07-27 21:35:11 +02:00
Viktor Lofgren
c91cf49630
(search) Disable scribe.rip substitution
...
It does not appear to work well
2025-07-27 19:40:58 +02:00
Viktor Lofgren
8503030f18
(search) Fix rare exception in scribe.rip substitution
2025-07-27 19:38:52 +02:00
Viktor Lofgren
744f7d3ef7
(search) Fix rare exception in scribe.rip substitution
2025-07-27 19:34:03 +02:00
Viktor Lofgren
215e12afe9
(index) Shrink query buffer size
2025-07-27 17:33:46 +02:00
Viktor Lofgren
2716bce918
(index) Adjust timeout logic for evaluation
2025-07-27 17:28:34 +02:00
Viktor Lofgren
caf2e6fbb7
(index) Adjust timeout logic for evaluation
2025-07-27 17:27:07 +02:00
Viktor Lofgren
233f0acfb1
(index) Further reduce query buffer size
2025-07-27 17:13:08 +02:00
Viktor Lofgren
e3a4ff02e9
(index) Abandon ongoing evaluation tasks if time is up
2025-07-27 17:04:01 +02:00
Viktor Lofgren
c786283ae1
(index) Reduce query buffer size
2025-07-27 16:57:55 +02:00
Viktor Lofgren
a3f65ac0e0
(deploy) Trigger index deployment
2025-07-27 16:50:23 +02:00
Viktor
aba1a32af0
Merge pull request #217 from MarginaliaSearch/uncompressed-spans-file
...
Index optimizations
2025-07-27 16:49:27 +02:00
Viktor Lofgren
c9c442345b
(perf) Change execution test to use processing rate instead of count
2025-07-27 16:39:51 +02:00
Viktor Lofgren
2e126ba30e
(perf) Change execution test to use processing rate instead of count
2025-07-27 16:37:20 +02:00
Viktor Lofgren
2087985f49
(index) Implement work stealing in IndexQueryExecution as a better approach to backpressure
2025-07-27 16:29:57 +02:00
Viktor Lofgren
2b13ebd18b
(index) Tweak evaluation backlog handling
2025-07-27 16:08:16 +02:00
Viktor Lofgren
6d92c125fe
(perf) Fix perf test
2025-07-27 15:50:28 +02:00
Viktor Lofgren
f638cfa39a
(index) Avoid possibility of negative timeout
2025-07-27 15:39:12 +02:00
Viktor Lofgren
89447c12af
(index) Avoid possibility of negative timeout
2025-07-27 15:24:47 +02:00
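A negative timeout usually arises when the remaining time is computed as deadline minus current time after the deadline has already passed. A minimal sketch of the usual guard follows; the class `Deadlines` and its method are hypothetical and only illustrate the clamping idea, not the actual change in the index code.

```java
// Hypothetical helper, not taken from the repository.
final class Deadlines {
    /** Milliseconds left until deadlineMillis, clamped so it is never negative. */
    static long remainingMillis(long deadlineMillis, long nowMillis) {
        // Once the deadline has passed the difference is negative; report zero instead
        return Math.max(0L, deadlineMillis - nowMillis);
    }
}
```

Clamping matters because several JDK calls reject negative values outright; `Thread.sleep`, for instance, throws `IllegalArgumentException` when given a negative duration.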
Viktor Lofgren
c71fc46f04
(perf) Update perf test with execution scenario
2025-07-27 15:22:07 +02:00
Viktor Lofgren
f96874d828
(sequence) Implement a largestValue abort condition for minDistance()
...
This is something like 3500% faster in certain common scenarios
2025-07-27 15:05:50 +02:00
Viktor Lofgren
583a84d5a0
(index) Clean up of the index query execution logic
2025-07-27 15:05:50 +02:00
Viktor Lofgren
f65b946448
(index) Clean up code
2025-07-27 15:05:50 +02:00
Viktor Lofgren
3682815855
(index) Optimize sequence intersection for the n=1 case
2025-07-26 19:14:32 +02:00
Viktor Lofgren
3a94357660
(index) Perf test tool (WIP!)
2025-07-26 11:49:33 +02:00
Viktor Lofgren
673b0d3de1
(index) Perf test tool (WIP!)
2025-07-26 11:49:31 +02:00
Viktor Lofgren
ea942bc664
(spans) Add signature to the footer of the spans file, including a version byte so we can detect whether to use the old or new decoding logic
2025-07-25 12:07:18 +02:00
Viktor Lofgren
7ed5083c54
(index) Don't split results into chunks
2025-07-25 11:45:07 +02:00
Viktor Lofgren
08bb2c097b
(refac) Clean up the data model used in the index service
2025-07-25 10:54:07 +02:00
Viktor Lofgren
495fb325be
(sequence) Correct sequence intersection bug introduced in optimizations
2025-07-25 10:48:33 +02:00
Viktor Lofgren
05c25bbaec
(chore) Clean up
2025-07-24 23:43:27 +02:00
Viktor Lofgren
2a028b84f3
(chore) Clean up
2025-07-24 20:12:56 +02:00
Viktor Lofgren
a091a23623
(ranking) Remove unnecessary metadata retrievals
2025-07-24 20:08:09 +02:00
Viktor Lofgren
e8897acb45
(ranking) Remove unnecessary metadata retrievals
2025-07-24 20:05:39 +02:00
Viktor Lofgren
b89ffcf2be
(index) Evaluate hash based idx mapping in ForwardIndexReader
2025-07-24 19:47:27 +02:00
Viktor Lofgren
dbcc9055b0
(index) Evaluate using MinMaxPriorityQueue as guts of ResultPriorityQueue
2025-07-24 19:31:51 +02:00
Viktor Lofgren
d9740557f4
(sequence) Optimize intersection logic with a fast abort condition
2025-07-24 19:04:10 +02:00
Viktor Lofgren
0d6cd015fd
(index) Evaluate reading all spans at once
2025-07-24 18:34:11 +02:00
Viktor Lofgren
c6034efcc8
(index) Cache value of bitset cardinality for speed
2025-07-24 17:24:55 +02:00
Viktor Lofgren
76068014ad
(index) More spans optimizations
2025-07-24 15:03:43 +02:00
Viktor Lofgren
1c3ed67127
(index) Byte align document spans
2025-07-24 14:06:14 +02:00
Viktor Lofgren
fc0cb6bd9a
(index) Reserve a larger size for IntArrayList in SequenceOperations.findIntersections
2025-07-24 14:03:44 +02:00
Viktor Lofgren
c2601bac78
(converter) Remove unnecessary allocation of a 16 KB byte buffer
2025-07-24 13:25:37 +02:00
Viktor Lofgren
f5641b72e9
(index) Fix broken test
2025-07-24 13:21:05 +02:00
Viktor Lofgren
36efe2e219
(index) Optimize PositionsFileReader for concurrent reads
...
In benchmarks this is roughly twice as fast as the previous approach. The main caveat is that we need multiple file descriptors to avoid read instruction serialization by the kernel. This is undesirable since the reads are completely scattershot and can't be reordered by the kernel in a way that optimizes anything.
2025-07-24 13:20:54 +02:00
Viktor Lofgren
983fe3829e
(spans) Evaluate uncompressed spans files
...
Span decompression appears to be somewhat of a performance bottleneck. This change removes compression of the spans file. The spans are still compressed in transit between the converter and index constructor at this stage. The change is intentionally kept small to just evaluate the performance implications, change in file sizes, etc.
2025-07-23 18:10:41 +02:00
Viktor Lofgren
668c87aa86
(ssr) Drop Executor from SSR as it no longer exists
2025-07-23 13:55:41 +02:00
Viktor Lofgren
9d3f9adb05
Force redeploy of everything
2025-07-23 13:36:02 +02:00
Viktor
a43a1773f1
Merge pull request #216 from MarginaliaSearch/deprecate-executor
...
Architecture: Remove the separate executor service and roll it into the index service.
2025-07-23 13:32:42 +02:00
Viktor Lofgren
1e7a3a3c4f
(docs) Update docs to reflect the change
2025-07-23 13:18:23 +02:00
Viktor Lofgren
62b696b1c3
(architecture) Remove the separate executor service and merge it into the index service
...
The primary motivation for this is that in production, the large number of partitioned services has led to an intermittent exhaustion of available database connections, as each service has a connection pool.
The decision to have a separate executor service dates back to when the index service was very slow to start, and the executor didn't always spin off its memory-hungry tasks into separate processes, which meant the executor would sometimes OOM and crash, and it was undesirable to bring the index down with it.
2025-07-23 12:57:13 +02:00
Viktor Lofgren
f1a900f383
(search) Clean up front page mobile design a bit
2025-07-23 12:20:40 +02:00
Viktor Lofgren
700364b86d
(sample) Remove debug logging
...
The problem sat in the desk chair all along
2025-07-21 15:08:20 +02:00
Viktor Lofgren
7e725ddaed
(sample) Remove debug logging
...
The problem sat in the desk chair all along
2025-07-21 14:41:59 +02:00
Viktor Lofgren
120209e138
(sample) Diagnosing compression errors
2025-07-21 14:34:08 +02:00
Viktor Lofgren
a771a5b6ce
(sample) Test different approach to decoding
2025-07-21 14:19:01 +02:00
Viktor Lofgren
dac5b54128
(sample) Better logging for sample errors
2025-07-21 14:03:58 +02:00
Viktor Lofgren
6cfb143c15
(sample) Compress sample HTML data and introduce new API for only getting requests
2025-07-21 13:55:25 +02:00
Viktor Lofgren
23c818281b
(converter) Reduce DomSample logging for NOT_FOUND
2025-07-21 13:37:55 +02:00
Viktor Lofgren
8aad253cf6
(converter) Add more logging around dom sample data retrieval errors
2025-07-21 13:26:38 +02:00
Viktor Lofgren
556d7af9dc
Reapply "(grpc) Use grpc-netty instead of grpc-netty-shaded"
...
This reverts commit b7a5219ed3.
2025-07-21 13:23:32 +02:00
Viktor Lofgren
b7a5219ed3
Revert "(grpc) Use grpc-netty instead of grpc-netty-shaded"
...
Reverting this change to see if it's the cause of some instability issues observed.
2025-07-21 13:10:41 +02:00
Viktor Lofgren
a23ec521fe
(converter) Ensure features is mutable on DetailsWithWords as this is assumed later
2025-07-21 12:50:04 +02:00
Viktor Lofgren
fff3babc6d
(classifier) Add rule for */pixel.gif as likely tracking pixels
2025-07-21 12:35:57 +02:00
Viktor Lofgren
b2bfb8217c
(special) Trigger CD run
2025-07-21 12:28:24 +02:00
Viktor
3b2ac414dc
Merge pull request #210 from MarginaliaSearch/ads-fingerprinting
...
Implement advertisement and popover identification based on DOM sample data
2025-07-21 12:25:31 +02:00
Viktor Lofgren
0ba6515a01
(converter) Ensure converter works well even when dom sample data is unavailable
2025-07-21 12:11:17 +02:00
Viktor Lofgren
16c6b0f151
(search) Add link to new discord community
2025-07-20 20:54:42 +02:00
Viktor Lofgren
e998692900
(converter) Ensure converter works well even when dom sample data is unavailable
2025-07-20 19:24:40 +02:00
Viktor Lofgren
eeb1695a87
(search) Clean up dead code
2025-07-20 19:15:01 +02:00
Viktor Lofgren
a0ab910940
(search) Clean up code
2025-07-20 19:14:13 +02:00
Viktor Lofgren
b9f31048d7
(search) Clean up overlong class names
2025-07-20 19:13:04 +02:00
Viktor Lofgren
12c304289a
(grpc) Use grpc-netty instead of grpc-netty-shaded
...
This will help reduce runaway thread pool sizes
2025-07-20 17:36:25 +02:00
Viktor Lofgren
6ee01dabea
(search) Drastically reduce worker thread count in search-service
2025-07-20 17:16:58 +02:00
Viktor Lofgren
1b80e282a7
(search) Drastically reduce worker thread count in search-service
2025-07-20 16:58:33 +02:00
Viktor Lofgren
a65d18f1d1
(client) Use virtual threads in a few more clients
2025-07-20 14:10:02 +02:00
Viktor Lofgren
90a1ff220b
(ui) Clean up UI
2025-07-19 18:41:36 +02:00
Viktor Lofgren
d6c7092335
(classifier) More rules
2025-07-19 18:41:36 +02:00
Viktor Lofgren
b716333856
(classifier) Match regexes against the path + query only, as well as the full URL
2025-07-19 18:41:36 +02:00
Viktor Lofgren
b504b8482c
(classifier) Add new tracker
2025-07-19 18:41:36 +02:00
Viktor Lofgren
80da1e9ad1
(ui) UI cleanup
2025-07-19 18:41:36 +02:00
Viktor Lofgren
d3f744a441
(ui) Add traffic report to overview menu
2025-07-19 18:41:36 +02:00
Viktor Lofgren
60fb539875
(ui) Add explanatory blurb
2025-07-19 18:41:35 +02:00
Viktor Lofgren
7f5094fedf
(ui) Clean up UI
2025-07-19 18:41:35 +02:00
Viktor Lofgren
45066636a5
(classifier) Add classification for domains that make 3rd party requests
2025-07-19 18:41:35 +02:00
Viktor Lofgren
e2d6898c51
(search) Change tag colors to more pleasant ones
2025-07-19 18:41:35 +02:00
Viktor Lofgren
58ef767b94
(search) Improve traffic report UI
2025-07-19 18:41:35 +02:00
Viktor Lofgren
f9f268c67a
(grpc) Improve error handling
2025-07-19 18:41:35 +02:00
Viktor Lofgren
f44c2bdee9
(chore) Cleanup
2025-07-19 18:41:35 +02:00
Viktor Lofgren
6fdf477c18
(refac) Move DomSampleClassification to top level
2025-07-19 18:41:35 +02:00
Viktor Lofgren
6b6e455e3f
(classifier) Clean up xml
2025-07-19 18:41:35 +02:00
Viktor Lofgren
a3a126540c
(classifier) Add README.md
2025-07-19 18:41:35 +02:00
Viktor Lofgren
842b19da40
(search) Mobile layout + phrasing
2025-07-19 18:41:35 +02:00
Viktor Lofgren
2a30e93bf0
(classifier)
2025-07-19 18:41:34 +02:00
Viktor Lofgren
3d998f12c0
(search) Use display name where possible
2025-07-19 18:41:34 +02:00
Viktor Lofgren
cbccc2ac23
(classification) Add /ccm/collect as an ads-related request
2025-07-19 18:41:34 +02:00
Viktor Lofgren
2cfc23f9b7
(search) Fix layout for mobile
2025-07-18 19:06:23 +02:00
Viktor Lofgren
88fe394cdb
(request-classifier) Add rule for /pagead/
2025-07-18 19:01:33 +02:00
Viktor Lofgren
f30fcebd4f
Remove dead code
2025-07-18 18:56:42 +02:00
Viktor Lofgren
5d885927b4
(search) Fix layout and presentation
2025-07-18 17:54:47 +02:00
Viktor Lofgren
7622c8358e
(request-classifier) Adjust flagging of a few hosts
2025-07-18 17:54:46 +02:00
Viktor Lofgren
69ed9aef47
(ddgt) Load global tracker data
2025-07-18 17:02:50 +02:00
Viktor Lofgren
4c78c223da
(search) Fix endpoint collection
2025-07-18 16:59:05 +02:00
Viktor Lofgren
71b9935dd6
(search) Add warmup to programmatic tailwind classes, fix word break
2025-07-18 16:49:31 +02:00
Viktor Lofgren
ad38f2fd83
(search) Hide classification tag on unclassified requests
2025-07-18 15:45:40 +02:00
Viktor Lofgren
9c47388846
(search) Improve display ordering
2025-07-18 15:44:55 +02:00
Viktor Lofgren
d9ab10e33f
(search) Fix tracker data for the correct domain
2025-07-18 15:29:15 +02:00
Viktor Lofgren
e13ea7f42b
(search) Sort results by classifications
2025-07-18 14:51:35 +02:00
Viktor Lofgren
f38daeb036
(WIP) First stab at a GUI for viewing network traffic
...
The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter.
The change also adds a parser for DDG's tracker radar data.
2025-07-18 13:58:57 +02:00
Viktor Lofgren
6e214293e5
(ping) Fix backoff value overflow
2025-07-16 19:50:12 +02:00
Viktor Lofgren
52582a6d7d
(experiment) Also add clients to loom experiment
2025-07-16 18:08:00 +02:00
Viktor Lofgren
ec0e39ad32
(experiment) Also add clients to loom experiment
2025-07-16 17:28:57 +02:00
Viktor Lofgren
6a15aee4b0
(ping) Fix arithmetic errors in backoff strategy due to long overflow
2025-07-16 17:23:36 +02:00
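Overflow in an exponential backoff is easy to hit in Java because `long` arithmetic wraps silently and shift distances are taken modulo 64, so `1000L << 64` is 1000 again rather than a huge number. The sketch below shows one way to saturate at a maximum delay instead. The class `Backoff`, its parameters, and the doubling policy are assumptions for illustration and are not taken from the ping service.

```java
// Hypothetical helper, not taken from the repository.
final class Backoff {
    /** Delay for the given attempt: baseMillis * 2^attempt, capped at maxMillis. */
    static long delayMillis(long baseMillis, int attempt, long maxMillis) {
        if (baseMillis <= 0 || maxMillis <= 0 || attempt < 0) {
            throw new IllegalArgumentException("arguments must be positive (attempt may be 0)");
        }
        // If the shift would push bits into or past the sign bit, the true value
        // exceeds Long.MAX_VALUE, so saturate at the cap rather than wrap around
        if (attempt >= Long.numberOfLeadingZeros(baseMillis)) {
            return maxMillis;
        }
        return Math.min(maxMillis, baseMillis << attempt);
    }
}
```

With a one-second base and a one-minute cap, attempts 0 and 3 give 1000 ms and 8000 ms, and every attempt from 6 onwards stays at 60000 ms regardless of how large the attempt counter grows.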
Viktor Lofgren
bd5111e8a2
(experimental) Add flag for using loom/virtual threads in gRPC executor
2025-07-16 17:12:07 +02:00
Viktor Lofgren
1ecbeb0272
(doc) Update ROADMAP.md
2025-07-14 13:38:34 +02:00
Viktor Lofgren
b91354925d
(converter) Index documents even when they are short
...
... but assign short documents a special flag and penalize them in index lookups
2025-07-14 12:24:25 +02:00
Viktor Lofgren
3f85c9c154
(refac) Clean up code
2025-07-14 11:55:21 +02:00
Viktor Lofgren
390f053406
(api) Add query parameter 'dc' for specifying the max number of results per domain
2025-07-14 10:09:30 +02:00
Viktor Lofgren
89e03d6914
(chore) Idiomatic error handling in gRPC clients
...
responseObserver.onError(...) should be passed Status.WHATEVER.foo().asRuntimeException() and not random throwables as was done before.
2025-07-13 02:59:22 +02:00
Viktor Lofgren
14e0bc9f26
(index) Add comment about encoding caveat
2025-07-13 02:47:00 +02:00
Viktor Lofgren
7065b46c6f
(index) Add penalties for new feature flags from dom sample
2025-07-13 02:37:30 +02:00
Viktor Lofgren
0372190c90
(index, refac) Move domain ranking to a better named package
2025-07-13 02:37:29 +02:00
Viktor Lofgren
ceaf32fb90
(converter) Integrate dom sample features into the converter
2025-07-13 01:38:28 +02:00
Viktor Lofgren
b03c43224c
(search) Fix redirects in new search UI
2025-07-11 23:44:45 +02:00
Viktor Lofgren
9b4ce9e9eb
(search) Fix !w redirect
2025-07-11 23:28:09 +02:00
Viktor
81ac02a695
Merge pull request #209 from us3r1d/master
...
added converter.insertFoundDomains property
2025-07-11 21:34:04 +02:00
krystal
47f624fb3b
changed converter.insertFoundDomains to loader.insertFoundDomains
2025-07-11 12:13:45 -07:00
Viktor Lofgren
b57db01415
(converter) Clean out some old and redundant advertisement and tracking detection code
2025-07-11 19:32:25 +02:00
Viktor Lofgren
ce7d522608
(converter) First basic hook-in of the new dom sample classifier into the converter workflow
2025-07-11 16:57:37 +02:00
Viktor Lofgren
18649b6ee9
(converter) Move DomSampleClassifier to converter's code tree
2025-07-11 16:12:48 +02:00
Viktor Lofgren
f6417aef1a
(converter) Additional code cleanup
2025-07-11 15:58:48 +02:00
Viktor Lofgren
2aa7e376b0
(converter) Clean up code around document deduplication
2025-07-11 15:54:28 +02:00
Viktor Lofgren
f33bc44860
(dom-sample) Create API for fetching DOM sample data across services
2025-07-11 15:41:10 +02:00
Viktor Lofgren
a2826efd44
(dom-sample) First stab at classifying outgoing requests from DOM sample data
2025-07-11 15:41:10 +02:00
krystal
c866f19cbb
added converter.insertFoundDomains property
2025-07-10 15:36:59 -07:00
Viktor Lofgren
518278493b
(converter) Increase the max byte length when parsing crawled documents to 500 kB from 200 kB.
2025-07-08 21:22:02 +02:00
Viktor Lofgren
1ac0bab0b8
(converter) Also exclude length checks when lenient processing is enabled
2025-07-08 20:37:53 +02:00
Viktor Lofgren
08b45ed10a
(converter) Add system property converter.lenientProcessing to disable most disqualification checks
2025-07-08 19:44:51 +02:00
Viktor Lofgren
f2cfb91973
(converter) Add audit log of converter errors and rejections
2025-07-08 19:15:41 +02:00
Viktor Lofgren
2f79524eb3
(refac) Rename ProcessService to ProcessSpawnerService for clarity
2025-07-07 15:48:44 +02:00
Viktor Lofgren
3b00142c96
(search) Don't say unknown domains are in the crawler queue
2025-07-06 18:42:36 +02:00
Viktor Lofgren
294ab19177
(status) Use old-search for status service instead of marginalia-search.com
2025-07-06 15:40:53 +02:00
Viktor Lofgren
6f1659ecb2
(control) Add GUI for NSFW Filter Update trigger
2025-06-25 16:03:27 +02:00
Viktor Lofgren
982dcb28f0
(live-crawler) Use Apache HttpClient + code cleanup
2025-06-24 13:04:19 +02:00
Viktor Lofgren
fc686d8b2e
(live-crawler) Fix startup race condition
...
The fix makes sure we wait for the feeds API to be available before fetching from it, so that the process doesn't crash on a cold system reboot.
2025-06-24 11:42:41 +02:00
Viktor Lofgren
69ef0f334a
(rss) Make feed fetcher use Apache's HttpClient
2025-06-23 18:49:55 +02:00
Viktor Lofgren
446746f3bd
(control) Fix so that sideload actions show up in Mixed profile nodes
2025-06-23 18:08:09 +02:00
Viktor Lofgren
24ab8398bb
(ndp) Use LinkGraphClient to populate NDP table
2025-06-23 16:44:38 +02:00
Viktor Lofgren
d2ceeff4cf
(ndp) Add toggle for excluding nodes from assignment via NDP
2025-06-23 15:38:02 +02:00
Viktor Lofgren
cf64214b1c
(ndp) Update documentation
2025-06-23 15:18:35 +02:00
Viktor Lofgren
e50d09cc01
(crawler) Remove illegal requests when denied via robots.txt
...
The commit removes attempts at probing the root document, feed URLs, and favicon if we are not permitted to do so via robots.txt
2025-06-22 17:10:44 +02:00
Viktor Lofgren
bce3892ce0
(ndp) Simplify code
2025-06-22 16:08:55 +02:00
Viktor Lofgren
36581b25c2
(ndp) Fix process tracking in domain discovery process
2025-06-21 14:35:25 +02:00
Viktor Lofgren
52ff7fb4dd
(ndp) Add a process for adding new domains to be crawled
...
This is a working "work in progress" commit. It will need more refinement, but given the usual difficulties in testing crawler-adjacent code without actually crawling, it needs some maturation time in production.
2025-06-21 14:10:27 +02:00
Viktor Lofgren
a4e49e658a
(ping) Add README for ping
2025-06-19 11:21:52 +02:00
Viktor Lofgren
e2c56dc3ca
(search) Clean up the rate limiting
...
We fail quietly to make life harder for the bot farmers
2025-06-18 11:26:30 +02:00
Viktor Lofgren
470b866008
(search) Clean up the rate limiting
...
We fail quietly to make life harder for the bot farmers
2025-06-18 11:22:26 +02:00
Viktor Lofgren
4895a2ac7a
(search) Clean up the rate limiting
...
We fail quietly to make life harder for the bot farmers
2025-06-18 11:20:24 +02:00
Viktor Lofgren
fd32ae9fa7
(search) Add automatic rate limiting to /site
...
Fix typo
2025-06-18 11:10:08 +02:00
Viktor Lofgren
470651ea4c
(search) Add automatic rate limiting to /site
2025-06-18 11:04:36 +02:00
Viktor Lofgren
8d4829e783
(ping) Change cookie specification to ignore cookies
2025-06-17 12:26:34 +02:00
Viktor Lofgren
1290bc15dc
(ping) Reduce retries for SocketException and pals
2025-06-16 22:35:33 +02:00
Viktor Lofgren
e7fa558954
(ping) Disable some cert validation logic for now
2025-06-16 22:00:32 +02:00
Viktor Lofgren
720685bf3f
(ping) Persist more detailed information about why a cert is invalid
...
The change also alters the validator to be less judgemental, and accept some invalid chains when it looks like we simply lack access to a (valid) intermediate cert.
2025-06-16 19:44:22 +02:00
Viktor Lofgren
cbec63c7da
(ping) Pull root certificates from cacerts.pem
2025-06-16 19:21:05 +02:00
Viktor Lofgren
b03ca75785
(ping) Correct test so that it does not spam an innocent webmaster with requests
2025-06-16 17:06:14 +02:00
Viktor Lofgren
184aedc071
(ping) Deploy new custom cert validator for fingerprinting purposes
2025-06-16 16:36:23 +02:00
Viktor Lofgren
0275bad281
(ping) Limit SSL certificate validity dates to a maximum timestamp as permitted by database
2025-06-16 00:32:03 +02:00
Viktor Lofgren
fd83a9d0b8
(ping) Handle null case for Subject Alternative Names in SSL certificates
2025-06-16 00:27:37 +02:00
Viktor Lofgren
d556f8ae3a
(ping) Ping server should not validate certificates
2025-06-16 00:08:30 +02:00
Viktor Lofgren
e37559837b
(crawler) Crawler should validate certificates
2025-06-16 00:06:57 +02:00
Viktor Lofgren
3564c4aaee
(ping) Route SSLHandshakeException to ConnectionError as well
...
This means we retry these as an unencrypted HTTP connection
2025-06-15 20:31:33 +02:00
Viktor Lofgren
92c54563ab
(ping) Reduce retry count on connection errors
2025-06-15 18:39:54 +02:00
Viktor Lofgren
d7a5d90b07
(ping) Store redirect location in availability record
2025-06-15 18:39:33 +02:00
Viktor Lofgren
0a0e88fd6e
(ping) Fix schema drift between prod and flyway migrations
2025-06-15 17:20:21 +02:00
Viktor Lofgren
b4fc0c4368
(ping) Fix schema drift between prod and flyway migrations
2025-06-15 17:17:11 +02:00
Viktor Lofgren
87ee8765b8
(ping) Ensure ProtocolError->HTTP_CLIENT_ERROR retains its error message information
2025-06-15 16:54:27 +02:00
Viktor Lofgren
1adf4835fa
(ping) Add schema change information to domain security events
...
Particularly the HTTPS->HTTP-change event appears to be a strong indicator of domain parking.
2025-06-15 16:47:49 +02:00
Viktor Lofgren
b7b5d0bf46
(ping) More accurately detect connection errors
2025-06-15 16:47:07 +02:00
Viktor Lofgren
416059adde
(ping) Avoid thread starvation scenario in job scheduling
...
Adjust the queueing strategy to avoid thread starvation, where whale domains with many subdomains all lock on the same semaphore and gunk up all threads. The fix is a mechanism that returns jobs that can't be executed to the queue.
This will lead to some queue churn, but it should be fairly manageable given the small number of threads involved, and the fairly long job execution times.
2025-06-15 11:04:34 +02:00
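The requeueing idea described in 416059adde can be sketched roughly as follows. This is a minimal illustration, not the ping service's code: `DomainJob`, `RequeueingWorker` and `runNext` are hypothetical names, and the one-permit-per-root-domain semaphore is an assumption.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.Semaphore;

// Hypothetical job type, used for illustration only.
final class DomainJob {
    final String rootDomain;
    final Runnable work;

    DomainJob(String rootDomain, Runnable work) {
        this.rootDomain = rootDomain;
        this.work = work;
    }
}

final class RequeueingWorker {
    /**
     * Takes the next job off the queue. If the semaphore for its root domain
     * is unavailable, the job is put back at the tail of the queue and the
     * method returns false, so the calling thread stays free for other work
     * instead of blocking on the semaphore.
     */
    static boolean runNext(Queue<DomainJob> queue, Map<String, Semaphore> locks) {
        DomainJob job = queue.poll();
        if (job == null) {
            return false; // nothing to do
        }
        // Assumption: one permit per root domain.
        Semaphore lock = locks.computeIfAbsent(job.rootDomain, d -> new Semaphore(1));
        if (!lock.tryAcquire()) {
            queue.add(job); // return the job rather than waiting
            return false;
        }
        try {
            job.work.run();
            return true;
        } finally {
            lock.release();
        }
    }
}
```

The cost is the queue churn the commit message mentions: a job for a busy domain may travel around the queue several times before it runs.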
Viktor Lofgren
db7930016a
(coordination) Trial the use of zookeeper for coordinating semaphores across multiple crawler-like processes
...
+ fix two broken tests
2025-06-14 16:20:01 +02:00
Viktor Lofgren
82456ad673
(coordination) Trial the use of zookeeper for coordinating semaphores across multiple crawler-like processes
...
The performance implications of this need to be evaluated. If it does not hold water, some other solution may be required instead.
2025-06-14 16:16:10 +02:00
Viktor Lofgren
0882a6d9cd
(ping) Correct retry logic by handling missing Retry-After header
2025-06-14 12:54:07 +02:00
Viktor Lofgren
5020029c2d
(ping) Fix startup sequence for new primary-only flow
2025-06-14 12:48:09 +02:00
Viktor Lofgren
ac44d0b093
(ping) Fix wait logic to use synchronized block
2025-06-14 12:38:16 +02:00
Viktor Lofgren
4b32b9b10e
Update DomainAvailabilityRecord to use clamped integer for HTTP response time
2025-06-14 12:37:58 +02:00
Viktor Lofgren
9f041d6631
(ping) Drop the concept of primary and secondary ping instances
...
There was an idea of having the ping service duck over to a realtime partition when the partition is crawling, but this hasn't been working out well, so the concept will be retired and all nodes will run as primary.
2025-06-14 12:32:08 +02:00
Viktor Lofgren
13fb1efce4
(ping) Populate ASN field on DomainSecurityInformation
2025-06-13 15:45:43 +02:00
Viktor Lofgren
c1225165b7
(ping) Add summary fields CHANGE_SERIAL_NUMBER and CHANGE_ISSUER to DOMAIN_SECURITY_EVENTS
2025-06-13 15:30:45 +02:00
Viktor Lofgren
67ad7a3bbc
(ping) Enhance HTTP ping logic to retry GET requests for specific status codes and add sleep duration between retries
2025-06-13 12:59:56 +02:00
Viktor Lofgren
ed62ec8a35
(random) Sanitize random search results with DOMAIN_AVAILABILITY_INFORMATION join
2025-06-13 10:38:21 +02:00
Viktor Lofgren
42b24cfa34
(ping) Fix NPE in dnsJobConsumer
2025-06-12 14:22:09 +02:00
Viktor Lofgren
1ffaab2da6
(ping) Mute logging along the happy path now that things are working
2025-06-12 14:15:23 +02:00
Viktor Lofgren
5f93c7f767
(ping) Update PROC_PING_SPAWNER to use REALTIME from SIDELOAD
2025-06-12 14:04:09 +02:00
Viktor Lofgren
4001c68c82
(ping) Update SQL query to include NODE_AFFINITY in historical availability data retrieval
2025-06-12 13:58:50 +02:00
Viktor Lofgren
6b811489c5
(actor) Make ping spawner auto-spawn the process
2025-06-12 13:46:50 +02:00
Viktor Lofgren
e9d317c65d
(ping) Parameterize thread counts for availability and DNS job consumers
2025-06-12 13:34:58 +02:00
Viktor Lofgren
16b05a4737
(ping) Reduce maximum total connections in HttpClientProvider to improve resource management
2025-06-12 13:04:55 +02:00
Viktor Lofgren
021cd73cbb
(ping) Reduce db contention by moving job scheduling out of the database to RAM
2025-06-12 12:56:33 +02:00
Viktor Lofgren
4253bd53b5
(ping) Fix issue where errors were not correctly labeled in availability
2025-06-12 00:18:07 +02:00
Viktor Lofgren
14c87461a5
(ping) Fix issue where errors were not correctly labeled in availability
2025-06-12 00:04:39 +02:00
Viktor Lofgren
9afed0a18e
(ping) Optimize parameters
...
Reduce socket and connection timeouts in HttpClient and adjust thread counts for job consumers
2025-06-11 16:21:45 +02:00
Viktor Lofgren
afad4deb94
(ping) Fix DB query to prioritize DNS information updates correctly
...
This also reduces CPU%
2025-06-11 14:58:28 +02:00
Viktor Lofgren
f071c947e4
(ping) Truncate data before inserting into db
2025-06-11 14:29:30 +02:00
Viktor Lofgren
79996c9348
(ping) Adjust thread counts based on observed processing times
2025-06-11 14:29:17 +02:00
Viktor Lofgren
db907ab06a
(ping) Update availabilityJobQueue to use put method to block rather than blow up
2025-06-11 14:22:24 +02:00
Viktor Lofgren
c49cd9dd95
(ping) Truncate fields in the builder to give consistent comparison without blowing up the database inserts.
2025-06-11 14:20:54 +02:00
Viktor Lofgren
eec9df3b0a
(ping) Truncate X-Frame-Options to 50 characters
2025-06-11 14:17:08 +02:00
Viktor
e5f3288de6
Merge pull request #205 from MarginaliaSearch/ping-server
...
Create domain availability pinging service (WIP)
2025-06-11 14:05:24 +02:00
Viktor Lofgren
d587544d3a
(refac) Rename PingJob classes and methods to AvailabilityJob for improved clarity and consistency
2025-06-11 13:52:18 +02:00
Viktor Lofgren
1a9ae1bc40
(ping) Minor bugfixes
2025-06-11 13:41:17 +02:00
Viktor Lofgren
e0c81e956a
(ping) Remove planned support for actual icmp ping
...
ICMP ping is a pain in the ass from Java, and it would have added at best marginal benefit since so few servers permit it.
2025-06-11 11:10:42 +02:00
Viktor Lofgren
542fb12b38
(ping) Add partitioning to events tables
...
This lets us migrate off the live database into either a columnar database or cold storage without expensive maintenance periods, as TRUNCATE PARTITION is effectively instantaneous.
2025-06-11 10:54:24 +02:00
Viktor Lofgren
65ec734566
(ping, refac) Rename domain ping status to domain availability information
2025-06-11 10:34:31 +02:00
Viktor Lofgren
10b6a25c63
(nsfw) Fix SQL error on duplicate domains
2025-06-11 00:11:26 +02:00
Viktor Lofgren
6260f6bec7
(ping) Refactor DomainPingStatusFactory to consolidate error handling methods and improve code clarity
2025-06-11 00:05:39 +02:00
Viktor Lofgren
d6d5467696
(ping) Add domain pinging service
2025-06-10 18:28:13 +02:00
Viktor Lofgren
034560ca75
(crawler) Add locking mechanism to avoid multiple crawler instances running in parallel on the same node
2025-06-07 16:18:05 +02:00
Viktor Lofgren
e994fddae4
(service) Add process event log object
2025-06-07 16:16:08 +02:00
Viktor Lofgren
345f01f306
(discovery) Add inter-JVM lock via zookeeper
2025-06-07 16:07:27 +02:00
Viktor
5a8e286689
Merge pull request #204 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2025-06-07 14:01:13 +02:00
Viktor
39a055aa94
Update ROADMAP.md
2025-06-07 14:01:01 +02:00
Viktor Lofgren
37aaa90dc9
(deploy) Clean up deploy script
2025-06-07 13:43:56 +02:00
Viktor
24022c5adc
Merge pull request #203 from MarginaliaSearch/nsfw-domain-lists
...
Nsfw blocking via UT1 domain lists
2025-06-07 13:24:05 +02:00
Viktor Lofgren
1de9ecc0b6
(nsfw) Add metrics to the filtering so we can monitor it
2025-06-07 13:17:05 +02:00
Viktor Lofgren
9b80245ea0
(nsfw) Move filtering to the IndexApiClient, and add filtering options to the internal APIs and public API.
2025-06-07 12:54:20 +02:00
Viktor Lofgren
4e1595c1a6
(nsfw) Initial work on adding UT1-based domain filtering
2025-06-06 14:23:37 +02:00
Viktor Lofgren
0be8585fa5
Add tag format hint to deploy script
2025-06-06 10:03:18 +02:00
Viktor Lofgren
a0fe070fe7
Redeploy browserless and assistant.
2025-06-06 09:51:39 +02:00
Viktor Lofgren
abe9da0fc6
(search) Ensure the new search UI sets the correct content-type for opensearch.xml
2025-05-29 12:44:55 +02:00
Viktor Lofgren
56d0128b0a
(dom-sample) Remove redundant code
2025-05-28 17:43:46 +02:00
Viktor Lofgren
840b68ac55
(dom-sample) Minor cleanups
2025-05-28 16:27:27 +02:00
Viktor Lofgren
c34ff6d6c3
(dom-sample) Use WAL journal for dom sample db
2025-05-28 16:16:28 +02:00
Viktor Lofgren
32780967d8
(dom-sample) Initialize dom sampler
2025-05-28 16:06:05 +02:00
Viktor Lofgren
7330bc489d
(deploy) Correct deploy script for browserless
2025-05-28 15:58:12 +02:00
Viktor Lofgren
ea23f33738
(deploy) Correct deploy script for headlesschrome
2025-05-28 15:56:05 +02:00
Viktor Lofgren
4a8a028118
(deploy) Deploy assistant and browserless
2025-05-28 15:50:26 +02:00
Viktor
a25bc647be
Merge pull request #201 from MarginaliaSearch/website-capture
...
Capture website snapshots
2025-05-28 15:49:03 +02:00
Viktor Lofgren
a720dba3a2
(deploy) Add browserless to deploy script
2025-05-28 15:48:32 +02:00
Viktor Lofgren
284f382867
(dom-sample) Fix initialization to work the same as screenshot capture
2025-05-28 15:40:09 +02:00
Viktor Lofgren
a80717f138
(dom-sample) Cleanup
2025-05-28 15:32:54 +02:00
Viktor Lofgren
d6da715fa4
(dom-sample) Add basic retrieval logic
...
First iteration is single threaded for simplicity
2025-05-28 15:18:15 +02:00
Viktor Lofgren
c1ec7aa491
(dom-sample) Add a boolean to the sample db when we've accepted a cookie dialogue
2025-05-28 14:45:19 +02:00
Viktor Lofgren
3daf37e283
(dom-sample) Improve storage of DOM sample data
2025-05-28 14:34:34 +02:00
Viktor Lofgren
44a774d3a8
(browserless) Add --pull option to Docker build command
...
This ensures we fetch the latest base image when we build.
2025-05-28 14:09:32 +02:00
Viktor Lofgren
597aeaf496
(website-capture) Correct manifest
...
run_at is set at the content_script level, not the root object.
2025-05-28 14:05:16 +02:00
Viktor Lofgren
06df7892c2
(website-capture) Clean up code
2025-05-27 15:56:59 +02:00
Viktor Lofgren
dc26854268
(website-capture) Add a marker to the network log when we've accepted a cookie dialog
2025-05-27 15:21:02 +02:00
Viktor Lofgren
9f16326cba
(website-capture) Add logic that automatically identifies and agrees to cookie consent popovers
...
Oftentimes, ads don't load until after you've agreed to the popover.
2025-05-27 15:11:47 +02:00
Viktor Lofgren
ed66d0b3a7
(website-capture) Amend the extension to also capture web request information
2025-05-26 14:00:43 +02:00
Viktor Lofgren
c3afc82dad
(website-capture) Rename scripts to be more consistent with extension terminology
2025-05-26 13:13:11 +02:00
Viktor Lofgren
08e25e539e
(website-capture) Minor cleanups
2025-05-21 14:55:03 +02:00
Viktor Lofgren
4946044dd0
(website-capture) Update BrowserlesClient to use the new image
2025-05-21 14:14:18 +02:00
Viktor Lofgren
edf382e1c5
(website-capture) Add a custom docker image with a new custom extension for DOM capture
...
The original approach of injecting javascript into the page directly didn't work with pages that reloaded themselves. To work around this, a chrome extension is used instead that does the same work, but subscribes to reload events and re-installs the change listener.
2025-05-21 14:13:54 +02:00
Viktor Lofgren
644cba32e4
(website-capture) Remove dead imports
2025-05-20 16:08:48 +02:00
Viktor Lofgren
34b76390b2
(website-capture) Add storage object for DOM samples
2025-05-20 16:05:54 +02:00
Viktor Lofgren
43cd507971
(crawler) Add a migration workaround so we can still open old slop crawl data with the new column added
2025-05-19 14:47:38 +02:00
Viktor Lofgren
cc40e99fdc
(crawler) Add a migration workaround so we can still open old slop crawl data with the new column added
2025-05-19 14:37:59 +02:00
Viktor Lofgren
8a944cf4c6
(crawler) Add request time to crawl data
...
This is an interesting indicator of website quality.
2025-05-19 14:07:41 +02:00
Viktor Lofgren
1c128e6d82
(crawler) Add request time to crawl data
...
This is an interesting indicator of website quality.
2025-05-19 14:02:03 +02:00
Viktor Lofgren
be039d1a8c
(live-capture) Add a new function for capturing the DOM of a website after rendering
...
The new code injects JavaScript that attempts to trigger popovers, and then alters the DOM to add attributes containing CSS properties such as position and visibility.
2025-05-19 13:26:07 +02:00
Viktor Lofgren
4edc0d3267
(converter) Increase work buffer for converter
...
Conversion on index node 7 in production is crashing ostensibly because this buffer is too small.
2025-05-18 13:22:44 +02:00
Viktor Lofgren
890f521d0d
(pdf) Fix crash for some bold lines
2025-05-18 13:05:05 +02:00
Viktor Lofgren
b1814a30f7
(deploy) Redeploy all services.
2025-05-17 13:11:51 +02:00
Viktor Lofgren
f59a9eb025
(legacy-search) Soften domain limit constraints in URL deduplication
2025-05-17 00:04:27 +02:00
Viktor Lofgren
599534806b
(search) Soften domain limit constraints in URL deduplication
2025-05-17 00:00:42 +02:00
Viktor Lofgren
7e8253dac7
(search) Clean up debug logging
2025-05-17 00:00:28 +02:00
Viktor Lofgren
97a6780ea3
(search) Add debug logging for specific query
2025-05-16 23:41:35 +02:00
Viktor Lofgren
eb634beec8
(search) Add debug logging for specific query
2025-05-16 23:34:03 +02:00
Viktor Lofgren
269ebd1654
Revert "(query) Add debug logging for specific query"
...
This reverts commit 39ce40bfeb.
2025-05-16 23:29:06 +02:00
Viktor Lofgren
39ce40bfeb
(query) Add debug logging for specific query
2025-05-16 23:23:53 +02:00
Viktor Lofgren
c187b2e1c1
(search) Re-enable clustering
2025-05-16 23:20:16 +02:00
Viktor Lofgren
42eaa4588b
(search) Disable clustering for a moment
2025-05-16 23:17:01 +02:00
Viktor Lofgren
4f40a5fbeb
(search) Reduce log spam
2025-05-16 23:15:07 +02:00
Viktor Lofgren
3f3d42bc01
(search) Re-enable deduplication
2025-05-16 23:14:54 +02:00
Viktor Lofgren
61c8d53e1b
(search) Disable deduplication for a moment
2025-05-16 23:10:32 +02:00
Viktor Lofgren
a7a3d85be9
(search) Increase search timeout by 50ms
2025-05-16 22:54:12 +02:00
Viktor Lofgren
306232fb54
(pdf) Fix handling of a few corner cases
...
Deal better with documents which change font on blank spaces.
2025-05-13 18:44:28 +02:00
Viktor Lofgren
5aef844f0d
(dependency) Increase slop version to 0.0.11
...
v0.0.11 uses atomic moves. This ensures we don't encounter a race condition in the backup service with lingering .tmp-files that should have been renamed.
2025-05-12 14:09:16 +02:00
Viktor
d56b5c828a
Merge pull request #198 from MarginaliaSearch/process-pdf-files
...
Add support for processing PDF files. The changeset adds a dependency on pdfbox, and vendors/modifies its PDFTextStripper to extract additional semantics from the documents.
PDF is not a text-based format but a graphical one, which may contain a stream of characters and positions (sometimes overlapping, rotated, or out of order). Identifying something like a header or a paragraph is therefore a non-trivial task, let alone extracting any text at all. A number of heuristics are used to try to accomplish this. They aren't perfect, but they are about as good as you're going to get without going to something like a vision-based LLM, which would be ridiculously expensive to apply at internet search engine scale.
The change also adds format information to the JSON API, as well as indicators in the GUI for PDF files.
2025-05-11 16:43:25 +02:00
Viktor Lofgren
ab58a4636f
(pdf) Disable tests that require specific sample data that can't go in the repo
2025-05-11 16:42:23 +02:00
Viktor Lofgren
00be269238
(search) Add PDF indicator in "also from"-segment
2025-05-11 16:35:52 +02:00
Viktor Lofgren
879e6a9424
(pdf) Identify additional headings based on font weight
2025-05-11 16:35:52 +02:00
Viktor Lofgren
fba3455732
(pdf) Clean up code
2025-05-11 16:35:52 +02:00
Viktor Lofgren
14283da7f5
(pdf) Clean up generated DOM
...
Sometimes empty <p>-tags are inserted, which messes with the header joining process. Removes those nodes.
2025-05-11 15:12:09 +02:00
Viktor Lofgren
93df4d1fc0
(pdf) Improve summary extraction for PDFs
2025-05-11 14:33:11 +02:00
Viktor Lofgren
b12a0b998c
(pdf) Use smarter heuristics for paragraph splitting
...
We look at the median line distance, with outliers removed, to figure out when to break lines, as the original approach works poorly with e.g. double line spaced documents.
2025-05-11 14:29:42 +02:00
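The heuristic in b12a0b998c might look something like the sketch below. It assumes line y-positions increase down the page; `ParagraphSplitter`, `OUTLIER_FACTOR` and `BREAK_FACTOR` are hypothetical names and values chosen for illustration, not taken from the vendored PDFTextStripper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParagraphSplitter {
    // Assumed thresholds, picked for the example only.
    static final double OUTLIER_FACTOR = 3.0;
    static final double BREAK_FACTOR = 1.5;

    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Median gap after discarding gaps far larger than the raw median. */
    static double typicalGap(double[] gaps) {
        double raw = median(gaps);
        double[] kept = Arrays.stream(gaps)
                .filter(g -> g <= OUTLIER_FACTOR * raw)
                .toArray();
        return median(kept);
    }

    /** Returns the indexes of lines that start a new paragraph. */
    static List<Integer> paragraphBreaks(double[] lineY) {
        List<Integer> breaks = new ArrayList<>();
        if (lineY.length < 2) {
            return breaks;
        }
        double[] gaps = new double[lineY.length - 1];
        for (int i = 0; i < gaps.length; i++) {
            gaps[i] = lineY[i + 1] - lineY[i];
        }
        double limit = BREAK_FACTOR * typicalGap(gaps);
        for (int i = 0; i < gaps.length; i++) {
            if (gaps[i] > limit) {
                breaks.add(i + 1);
            }
        }
        return breaks;
    }
}
```

Because the threshold scales with the document's own typical line gap, a double-spaced document with a 24-unit line distance only breaks on gaps that are large relative to 24, rather than on every line.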
Viktor Lofgren
3b6f4e321b
(search) Add red PDF indicator to search UI
2025-05-11 13:32:14 +02:00
Viktor Lofgren
8428111771
(pdf) Fix for exception when no text positions are available
2025-05-10 15:12:02 +02:00
Viktor Lofgren
e9fd4415ef
(pdf) Merge consecutive headings.
...
Headings don't follow the same indentation rules as prose and tend to be cut off into multiple "paragraphs" by the text extractor.
2025-05-10 14:38:43 +02:00
Viktor Lofgren
4c95c3dcad
(pdf) Don't look for headings below 75% of the max y-position
2025-05-10 14:38:02 +02:00
Viktor Lofgren
c5281536fb
(api) Add format field to JSON search results
...
API consumers might want to filter out PDF results, etc.
2025-05-10 13:56:22 +02:00
Viktor Lofgren
4431dae7ac
(refac) Rename HtmlStandard -> DocumentFormat
...
The old model made some sense when we only supported HTML and to some extent plain text, but having PDF in an enum called HtmlStandard is a bit of a stretch.
2025-05-10 13:47:26 +02:00
Viktor Lofgren
4df4d0a7a8
(pdf) Increase line spacing tolerance for better paragraph handling
2025-05-10 13:34:04 +02:00
Viktor Lofgren
9f05083b94
(pdf) Add the capability to identify headings
...
This change vendors pdfbox's PDFTextStripper and modifies it to heuristically identify headings based on their font size, as this is a very useful relevance signal for the search engine, and helps identify the correct title of the article.
2025-05-09 14:04:04 +02:00
Viktor Lofgren
fc92e9b9c0
(feeds) Correct link handling in atom feeds
...
This addresses issue #199
2025-05-09 13:00:07 +02:00
Viktor Lofgren
328fb5d927
(feeds) Correct link handling in atom feeds
...
This addresses issue #199
2025-05-09 12:55:28 +02:00
Viktor Lofgren
36889950e8
(pdf) Migrate to PDFBox 3.0.5 and suppress log spam
...
PDFBox 2.x uses commons logging, which does not route through SLF4J and is thus a hassle to configure. It is also extremely verbose in its default logging settings.
Migrating to PDFBox 3.x lets us use SLF4J to address the log spam by filtering out the noisy methods.
2025-05-08 18:03:26 +02:00
Viktor Lofgren
c96a94878b
(pdf) Add feature to make pdf-files searchable with format:pdf
2025-05-08 18:03:26 +02:00
Viktor Lofgren
1c57d7d73a
(pdf) Clean up code
2025-05-08 18:03:26 +02:00
Viktor Lofgren
a443d22356
(pdf) Flag the file as a PDF file in the GUI
2025-05-08 18:03:26 +02:00
Viktor Lofgren
aa59d4afa4
(pdf) Somewhat improve title and summary extraction
2025-05-08 18:03:26 +02:00
Viktor Lofgren
df0f18d0e7
(pdf) Read title
2025-05-08 18:03:26 +02:00
Viktor Lofgren
0819d46f97
(pdf) Minimal prototype to get PDFs working
2025-05-08 18:03:26 +02:00
Viktor Lofgren
5e2b63473e
(logging) Change to a terser log format
...
The old log format would often span several screen widths, especially when subprocesses logged. Switching to a terser format that should be much easier to read.
2025-05-08 18:02:22 +02:00
Viktor
f9590703f1
Merge pull request #197 from MarginaliaSearch/crawl-markdown
...
(markdown) Support crawling markdown
2025-05-08 13:35:00 +02:00
Viktor Lofgren
f12fc11337
(markdown) Support crawling markdown
2025-05-08 13:26:22 +02:00
Viktor Lofgren
c309030184
(sample) Ensure we finalize the slop.zip file creation when filtering
2025-05-06 14:52:48 +02:00
Viktor Lofgren
fd5af01629
(sample) Ensure we flush the log before adding it to the tar file
2025-05-06 14:43:47 +02:00
Viktor Lofgren
d4c43c7a79
(crawler) Test case for fetching PDFs
2025-05-06 13:45:16 +02:00
Viktor Lofgren
18700e1919
(sample) Fix bug where slop files would not be saved despite containing data
2025-05-06 13:38:21 +02:00
Viktor Lofgren
120b431998
(crawler) Fix outdated assumptions about content types and http status codes always being 200 when good.
...
We now sometimes get 206 when good.
2025-05-06 13:18:30 +02:00
Viktor Lofgren
71dad99326
(crawler) Revisitor should not demand a 200, but support a 206 as well
2025-05-06 13:11:52 +02:00
Viktor Lofgren
c1e8afdf86
(crawler) Remove domains from pending crawl tasks queue when retrying
2025-05-06 12:56:30 +02:00
Viktor Lofgren
fa32dddc24
(sample-actor) Make content type matching lenient with regard to ct parameters such as charset
2025-05-06 12:48:09 +02:00
Viktor Lofgren
a266fcbf30
(sample-actor) Clean up debris from previous runs to avoid errors on re-runs
2025-05-05 13:16:37 +02:00
Viktor Lofgren
6e47e58e0e
(sample-actor) Add progress tracking to sample export actor
2025-05-05 13:04:14 +02:00
Viktor Lofgren
9dc43d8b4a
(sample-actor) Update the export sample actor to not generate empty files when the filter is not applicable.
2025-05-05 12:56:12 +02:00
Viktor Lofgren
83967e3305
(sample-actor) Update the export sample actor to not generate empty files when the filter is not applicable.
2025-05-05 12:50:21 +02:00
Viktor Lofgren
4db980a291
(jooby-service) Set an upper limit on the number of worker threads
2025-05-05 12:40:31 +02:00
Viktor Lofgren
089b177868
(deploy) Executor partition 4.
2025-05-05 12:21:27 +02:00
Viktor Lofgren
9c8e9a68d5
(deploy) Executor partition 4.
2025-05-05 12:00:05 +02:00
Viktor Lofgren
413d5cc788
(url, minor) Fix typo in test
2025-05-04 16:28:30 +02:00
Viktor Lofgren
58539b92ac
(search) Don't show addresses with URLencoding in the UI
2025-05-04 16:26:39 +02:00
Viktor Lofgren
fe72f16df1
(url) Add additional tests for parameter handling
2025-05-04 16:23:39 +02:00
Viktor Lofgren
b49a244a2e
(url) Fix encoding handling of query parameters
2025-05-04 16:18:47 +02:00
Viktor Lofgren
3f0b4c010f
(deploy) Fix deploy script to be aware of the status service
2025-05-04 16:14:07 +02:00
Viktor Lofgren
c6e0cd93f7
(status) Fix status service to poll the new domain
2025-05-04 16:11:08 +02:00
Viktor Lofgren
80a7ccb080
Trigger redeploy of qs, search and api
2025-05-04 16:07:28 +02:00
Viktor Lofgren
54dec347c4
(url) Fix urlencoding issues with certain symbols
...
Optimize the code by adding a simple heuristic for guessing whether we need to repair the URI before we pass it to Java's parser.
2025-05-04 13:39:39 +02:00
Viktor Lofgren
d6ee3f0785
(url) Fix urlencoding issues with certain symbols
...
The urlencoding logic would consider the need to urlencode on an element basis, which is incorrect. Even if we urlencode on an element basis, we should either urlencode or not urlencode, never a mix of the two.
2025-05-04 13:08:49 +02:00
Viktor Lofgren
8be88afcf3
(url) Fix urlencoding issues with certain symbols
...
We also need to apply the fix when performing toString() on the EdgeUrl, the URI class will URLDecode the input.
The change also alters the parseURI method to only run the URLEncode-fixer during parsing if URI throws an exception. This bad path is obviously going to be slower, but realistically, most URLs are valid, so it's probably a significant optimization to do it like this.
2025-05-04 12:58:13 +02:00
Viktor Lofgren
0e3c00d3e1
(url) Fix urlencoding issues with certain symbols
...
Minor fix of issue where url sanitizer would strip some trailing slashes.
2025-05-03 23:58:28 +02:00
Viktor Lofgren
4279a7f1aa
(url) Fix urlencoding issues with certain symbols
...
Minor fix with previously urlencoded codepoints, we need to account for the fact that they are encoded in hexadecimal.
2025-05-03 23:51:39 +02:00
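The point made in 4279a7f1aa, that an already-encoded octet is a `%` followed by two hexadecimal digits rather than two decimal ones, can be illustrated with a small sketch. `PercentEncodingRepair` is a hypothetical class written for this example, not the EdgeUrl code, and it deliberately handles only spaces, control characters, non-ASCII characters and stray `%` signs.

```java
import java.nio.charset.StandardCharsets;

final class PercentEncodingRepair {
    static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    /** True if position i starts an existing %XX escape with XX in hexadecimal. */
    static boolean isEncodedOctet(String s, int i) {
        return s.charAt(i) == '%'
            && i + 2 < s.length()
            && isHexDigit(s.charAt(i + 1))
            && isHexDigit(s.charAt(i + 2));
    }

    /** Encodes unsafe characters, leaving existing %XX escapes untouched. */
    static String repair(String path) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < path.length()) {
            if (isEncodedOctet(path, i)) {
                sb.append(path, i, i + 3); // keep the escape as-is
                i += 3;
                continue;
            }
            int cp = path.codePointAt(i);
            if (cp > 0x20 && cp < 0x7F && cp != '%') {
                sb.append((char) cp); // printable ASCII passes through
            } else {
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    sb.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

A check that only accepted decimal digits after `%` would treat `%C3%A9` as two stray percent signs and double-encode it, which is the kind of inconsistency the fix addresses.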
Viktor Lofgren
251006d4f9
(url) Fix urlencoding issues with certain symbols
...
Problems primarily cropped up with sideloaded Wikipedia articles, but the search engine has been returning inconsistently URL-encoded search results for a while. Browsers and servers have seemingly magically fixed the issues in many scenarios.
This addresses Issue #195 and Issue #131.
2025-05-03 23:48:45 +02:00
Viktor Lofgren
c3e99dc12a
(service) Limit logging from ad hoc task heartbeats
...
Certain usage patterns of the ad hoc task heartbeats would lead to an incredible amount of log noise, as it would log each update.
Limit log updates to increments of 10% to avoid this problem.
2025-05-03 12:39:58 +02:00
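One way to limit logging to 10% increments, as c3e99dc12a describes, is to remember the last bucket that was logged. The following is a sketch under that assumption; `ProgressLogThrottle` is a hypothetical name and not the heartbeat class itself.

```java
final class ProgressLogThrottle {
    private final int total;
    private int lastLoggedBucket = -1;

    /** @param total the number of steps in the task, assumed to be positive */
    ProgressLogThrottle(int total) {
        this.total = total;
    }

    /**
     * Returns true only when the progress has moved into a 10% bucket that
     * has not been logged yet, so the caller logs at most 11 times per task
     * (0%, 10%, ..., 100%) no matter how often progress is reported.
     */
    synchronized boolean shouldLog(int progress) {
        int bucket = (int) (10L * progress / total);
        if (bucket <= lastLoggedBucket) {
            return false;
        }
        lastLoggedBucket = bucket;
        return true;
    }
}
```

A caller would wrap its log statement in `if (throttle.shouldLog(progress))`, leaving the progress update itself untouched.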
Viktor
aaaa2de022
Merge pull request #196 from MarginaliaSearch/filter-export-sample-data
...
Add the ability to filter sample data based on content type
2025-05-02 13:23:49 +02:00
Viktor Lofgren
fc1388422a
(actor) Add the ability to filter sample data based on content type
...
This will help in extracting relevant test sets for PDF processing.
2025-05-02 13:09:22 +02:00
Viktor Lofgren
b07080db16
(crawler) Don't retry requests when encountering UnknownHostException
2025-05-01 16:07:34 +02:00
Viktor Lofgren
e9d86dca4a
(crawler) Add timeout to wrap-up phase of WarcInputBuffer.
2025-05-01 15:57:47 +02:00
Viktor Lofgren
1d693f0efa
(build) Upgrade JIB to 3.4.5
2025-04-30 15:26:52 +02:00
Viktor Lofgren
5874a163dc
(build) Upgrade gradle to 8.14
2025-04-30 15:26:37 +02:00
Viktor Lofgren
5ec7a1deab
(crawler) Fix 80%-ish progress crawler stall
...
The crawl tasks are started in two phases: first when generating them in one loop, and then in a second loop that drains the task list. If the first loop contains a long-running crawl task that is triggered late, the rest of the crawl may halt until that task has finished.
Fixed the problem by draining and re-trying also in the first loop.
2025-04-29 12:23:51 +02:00
Viktor Lofgren
7fea2808ed
(search) Fix error view
...
Fix rendering error when query was null
Fix border on error message.
2025-04-27 12:12:56 +02:00
Viktor Lofgren
8da74484f0
(search) Remove unused count modifier from the footer help
2025-04-27 12:08:34 +02:00
Viktor Lofgren
923d5a7234
(search) Add a note for TUI users pointing them to the old UI
2025-04-27 11:52:07 +02:00
Viktor Lofgren
58f88749b8
(deploy) assistant
2025-04-25 13:25:50 +02:00
Viktor Lofgren
77f727a5ba
(crawler) Alter conditional request logic to avoid sending both If-None-Match and If-Modified-Since
...
It seems like some servers dislike this combination, and may turn a 304 into a 200.
2025-04-25 13:19:07 +02:00
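The change in 77f727a5ba amounts to picking one validator per revisit. A minimal sketch follows; `ConditionalHeaders` and `forRevisit` are hypothetical names, and preferring the ETag over the Last-Modified date when both are known is an assumption, since the commit message does not say which one the crawler favours.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ConditionalHeaders {
    /**
     * Builds the conditional request headers for re-fetching a document,
     * never emitting both If-None-Match and If-Modified-Since.
     *
     * @param etag         the ETag from the previous fetch, or null
     * @param lastModified the Last-Modified value from the previous fetch, or null
     */
    static Map<String, String> forRevisit(String etag, String lastModified) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (etag != null && !etag.isBlank()) {
            headers.put("If-None-Match", etag);
        } else if (lastModified != null && !lastModified.isBlank()) {
            headers.put("If-Modified-Since", lastModified);
        }
        return headers;
    }
}
```

If neither validator is known, the map is empty and the request becomes an ordinary unconditional GET.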
Viktor Lofgren
667cfb53dc
(assistant) Remove more link text junk from suggestions at loadtime.
2025-04-24 13:35:29 +02:00
Viktor Lofgren
fe36d4ed20
(deploy) Executor services
2025-04-24 13:23:51 +02:00
Viktor Lofgren
acf4bef98d
(assistant) Improve search suggestions
...
Improve suggestions by loading a secondary suggestions set with link text data.
2025-04-24 13:10:59 +02:00
Viktor Lofgren
2a737c34bb
(search) Improve suggestions UX
...
Fix the highlight colors when arrowing through search suggestions. Also fix the suggestions box for dark mode.
2025-04-24 12:34:05 +02:00
Viktor Lofgren
90a577af82
(search) Improve suggestions UX
2025-04-24 00:32:25 +02:00
Viktor
f0c9b935d8
Merge pull request #192 from MarginaliaSearch/improve-suggestions
...
Improve typeahead suggestions
2025-04-23 20:17:49 +02:00
Viktor Lofgren
7b5493dd51
(assistant) Improve typeahead suggestions
...
Implement a new prefix search structure (not a trie, but hash table based) with a concept of score.
2025-04-23 20:13:53 +02:00
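A hash-table-based prefix structure with scores, of the general kind 7b5493dd51 mentions, could be sketched like this. It is not the assistant's implementation: `PrefixSuggestions`, `MAX_PREFIX` and the integer score are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PrefixSuggestions {
    // Assumed cap on indexed prefix length, to bound memory use.
    private static final int MAX_PREFIX = 8;

    private static final class Scored {
        final String term;
        final int score;

        Scored(String term, int score) {
            this.term = term;
            this.score = score;
        }
    }

    private final Map<String, List<Scored>> byPrefix = new HashMap<>();

    /** Indexes the term under each of its prefixes up to MAX_PREFIX characters. */
    void add(String term, int score) {
        Scored entry = new Scored(term, score);
        int limit = Math.min(term.length(), MAX_PREFIX);
        for (int len = 1; len <= limit; len++) {
            byPrefix.computeIfAbsent(term.substring(0, len), k -> new ArrayList<>()).add(entry);
        }
    }

    /** Returns up to count terms starting with the prefix, highest score first. */
    List<String> suggest(String prefix, int count) {
        String key = prefix.length() > MAX_PREFIX ? prefix.substring(0, MAX_PREFIX) : prefix;
        List<String> out = new ArrayList<>();
        byPrefix.getOrDefault(key, List.of()).stream()
                .filter(e -> e.term.startsWith(prefix)) // needed when the prefix exceeds MAX_PREFIX
                .sorted(Comparator.comparingInt((Scored e) -> e.score).reversed())
                .limit(count)
                .forEach(e -> out.add(e.term));
        return out;
    }
}
```

Compared with a trie, a lookup is a single hash probe followed by a sort of one bucket, at the cost of storing each term once per indexed prefix.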
Viktor Lofgren
c246a59158
(search) Make it clearer that it's a search engine
2025-04-22 16:03:42 +02:00
Viktor
0b99781d24
Merge pull request #191 from MarginaliaSearch/pdf-support-in-crawler
...
Pdf support in crawler
2025-04-22 15:52:41 +02:00
Viktor Lofgren
39db9620c1
(crawler) Increase maximum permitted file size to 32 MB
2025-04-22 15:51:03 +02:00
Viktor Lofgren
1781599363
(crawler) Add support for crawling PDF files
2025-04-22 15:50:05 +02:00
Viktor Lofgren
6b2d18fb9b
(crawler) Adjust domain limits to be generally more permissive.
2025-04-22 15:27:57 +02:00
Viktor
59b1d200ab
Merge pull request #190 from MarginaliaSearch/download-sample-chores
...
Download sample chores
2025-04-22 13:29:49 +02:00
Viktor Lofgren
897010a2cf
(control) Update download sample data actor with better UI
...
The original implementation didn't really give a lot of feedback about what it was doing. Adding a progress bar to the download step.
Relates to issue 189.
2025-04-22 13:27:22 +02:00
Viktor Lofgren
602af7a77e
(control) Update UI with new sample sizes
...
Relates to issue 189.
2025-04-22 13:27:13 +02:00
Viktor Lofgren
a7d91c8527
(crawler) Clean up fetcher detailed logging
2025-04-21 12:53:52 +02:00
Viktor Lofgren
7151602124
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
...
Cleaning up after changes.
2025-04-21 12:47:03 +02:00
Viktor Lofgren
884e33bd4a
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
...
Change back to an unbounded queue, tighten sleep times a bit.
2025-04-21 11:48:15 +02:00
Viktor Lofgren
e84d5c497a
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
...
Change to a bounded queue and add a sleep to reduce the number of effectively busy-looping threads.
2025-04-21 00:39:26 +02:00
Viktor Lofgren
2d2d3e2466
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
...
Change to a bounded queue and add a sleep to reduce the number of effectively busy-looping threads.
2025-04-21 00:36:48 +02:00
Viktor Lofgren
647dd9b12f
(crawler) Reduce the likelihood of crawler tasks locking on domains before they are ready
2025-04-21 00:24:30 +02:00
Viktor Lofgren
de4e2849ce
(crawler) Tweak request retry counts
...
Increase the default number of tries to 3, but don't retry on SSL errors as they are unlikely to fix themselves in the short term.
2025-04-19 00:19:48 +02:00
Viktor Lofgren
3c43f1954e
(crawler) Add custom cookie store implementation
...
Apache HttpClient's cookie implementation builds an enormous concurrent hashmap with every cookie for every domain ever crawled. This is a big waste of resources.
Replacing it with a fairly crude domain-isolated instance, as we are primarily interested in answering whether a cookie is set, and we will never retain cookies long term.
2025-04-18 13:04:22 +02:00
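The domain-isolated store described in 3c43f1954e might be as simple as the sketch below: one short-lived instance per crawled domain, keeping just enough state to answer whether a cookie was set. `DomainCookieJar` is a hypothetical class that does not implement Apache HttpClient's cookie store interface, and the rule for accepting subdomain cookies is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-domain cookie store.
final class DomainCookieJar {
    private final String domain;
    private final List<String> cookieNames = new ArrayList<>();

    DomainCookieJar(String domain) {
        this.domain = domain.toLowerCase();
    }

    /** Records the cookie if it belongs to this jar's domain or a subdomain of it. */
    synchronized void add(String cookieDomain, String name) {
        String d = cookieDomain.toLowerCase();
        if (d.equals(domain) || d.endsWith("." + domain)) {
            cookieNames.add(name);
        }
    }

    /** The main question the crawler cares about: was any cookie set? */
    synchronized boolean hasCookies() {
        return !cookieNames.isEmpty();
    }

    /** Drops everything, for example when the crawl of the domain ends. */
    synchronized void clear() {
        cookieNames.clear();
    }
}
```

Since each jar is discarded along with its crawl task, memory use is bounded by the domains currently being crawled rather than by every domain ever visited.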
Viktor Lofgren
fa2462ec39
(crawler) Re-enable aborts on timeout
2025-04-18 12:59:34 +02:00
Viktor Lofgren
f4ad7145db
(crawler) Disable SO_LINGER
2025-04-18 01:42:02 +02:00
Viktor Lofgren
068b450180
(crawler) Temporarily disable request.abort()
2025-04-18 01:25:56 +02:00
Viktor Lofgren
05b909a21f
(crawler) Add logging to get more info about connection leak
2025-04-18 01:06:52 +02:00
Viktor Lofgren
3d179cddce
(crawler) Correctly consume entity in sitemap retrieval
2025-04-18 00:32:21 +02:00
Viktor Lofgren
1a2aae496a
(crawler) Correct handling and abortion of HttpClient's requests
...
There was a resource leak in the initial implementation of the Apache HttpClient WarcInputBuffer that failed to free up resources.
Using HttpGet objects instead of the Classic...Request objects, as the latter fail to expose an abort()-method.
2025-04-18 00:16:26 +02:00
Viktor Lofgren
353cdffb3f
(crawler) Increase connection request timeout, restore congestion timeout
2025-04-17 21:32:06 +02:00
Viktor Lofgren
2e3f1313c7
(crawler) Log exceptions while crawling in crawler audit log
2025-04-17 21:18:09 +02:00
Viktor Lofgren
58e6f141ce
(crawler) Reduce congestion throttle go-rate
2025-04-17 20:36:58 +02:00
Viktor Lofgren
500f63e921
(crawler) Lower max conn per route
2025-04-17 18:36:16 +02:00
Viktor Lofgren
6dfbedda1e
(crawler) Increase max conn per route and connection timeout
2025-04-17 18:31:46 +02:00
Viktor Lofgren
9715ddb105
(crawler) Increase max pool size to a large value
2025-04-17 18:22:58 +02:00
Viktor Lofgren
1fc6313a77
(crawler) Remove log noise when retrying a bad URL
2025-04-17 17:10:46 +02:00
Viktor Lofgren
b1249d5b8a
(crawler) Fix broken test.
2025-04-17 17:01:42 +02:00
Viktor
ef95d59b07
Merge pull request #161 from MarginaliaSearch/apache-httpclient-in-crawler
...
The previously used Java HttpClient seems unsuitable for crawler usage, which led to issues like send()-operations sometimes hanging forever, with clunky workarounds such as running each send operation in a separate Future that can be cancelled on a timeout.
The most damning flaw is that it does not offer socket timeouts. If a server responds in a timely manner, but then stops sending data, whether due to high load or malice, Java's builtin HttpClient will hang forever.
It simply has too many assumptions that break, and fails to adequately expose the inner workings of the connection pool to a degree that makes it possible to configure in a satisfactory manner, such as setting a SO_LINGER value or limiting the number of concurrent connections to a host.
Apache's HttpClient solves all these problems.
The change also includes a new battery of tests for the HttpFetcher, and refactors the retriever class a bit to move stuff into the HttpFetcher, leading to a better separation of concerns.
The crawler will also be a bit more clever when fetching documents, and attempt to use range queries where supported to limit the number of bytes, as interrupting connections is undesirable and leads to connection storms and bufferbloat.
2025-04-17 16:57:19 +02:00
Viktor Lofgren
acdd8664f5
(crawler) More logging for the crawler, in a separate file.
2025-04-17 16:55:50 +02:00
Viktor Lofgren
6b12eac58a
(crawler) Fix crawler retriever test to use the slop format
2025-04-17 16:35:13 +02:00
Viktor Lofgren
bb3f1f395a
(crawler) Fix bug where headers were not stored correctly
...
This was the result of refactoring to Apache HttpClient.
2025-04-17 16:34:41 +02:00
Viktor Lofgren
b661beef41
(crawler) Amend recrawl logic to match redirects as being unchanged if their Location is the same.
2025-04-17 16:34:05 +02:00
Viktor Lofgren
9888c47f19
(crawler) Add custom Keep-Alive settings for HttpClient with max keep-alive of 30s
2025-04-17 15:25:46 +02:00
Viktor Lofgren
dcef7e955b
(crawler) Try to avoid unnecessary connection resets
...
In order to keep connections alive, the crawler will consume data past its max size (but hope and pray the server supports range queries) as long as we've not exceeded the timeout.
This permits us to keep the connection alive in more scenarios, which is helpful for the health of the network stack, as constant TCP handshakes can lead to quite a lot of buffer bloat.
This will increase the bandwidth requirements in some scenarios, but on the other hand, it will increase the available bandwidth as well.
2025-04-17 14:51:33 +02:00
Viktor Lofgren
b3973a1dd7
(crawler) Remove unnecessary crawl delay when not ct-probing
...
The crawler would *always* incur the crawl delay penalty associated with content type probing, even if it wasn't actually probing. Removing this delay when we are not probing.
2025-04-17 14:39:04 +02:00
Viktor Lofgren
8bd05d6d90
(crawler) Attempt to use range queries where available
...
This might help in some circumstances to avoid fetching more data than we are interested in.
2025-04-17 14:37:55 +02:00
Viktor Lofgren
59df8e356e
(crawler) Do not fail domain and content type probe on 405
...
Some endpoints do not support the HEAD method. This has historically broken the crawler when it attempts to use HEAD to probe certain URLs that are suspected of being e.g. binary.
The change makes it so that we bypass the probing on 405 instead, and for the domain probe logic, we switch to a small range queried GET.
2025-04-17 13:54:28 +02:00
Viktor Lofgren
7161162a35
(crawler) Write WARC records in a sane order
2025-04-17 13:36:39 +02:00
Viktor Lofgren
d7c4c5141f
(crawler) Migrate to Apache HttpClient for crawler
...
The previously used Java HttpClient seems unsuitable for crawler usage,
which led to issues like send()-operations sometimes hanging forever,
with clunky workarounds such as running each send operation in a separate
Future that can be cancelled on a timeout.
It has too many assumptions that break, and fails to adequately expose
the inner workings of the connection pool to a degree that makes it possible
to configure in a satisfactory manner.
Apache's HttpClient solves all these problems.
The change also includes a new battery of tests for the HttpFetcher,
and refactors the retriever class a bit to move stuff into the HttpFetcher,
leading to a better separation of concerns.
2025-04-17 12:51:08 +02:00
Viktor Lofgren
88e9b8fb05
(crawler) Throttle the establishment of new connections
...
To avoid network congestion from the packet storm created when establishing hundreds or thousands of connections at the same time, pace the opening of new connections.
2025-04-08 22:53:02 +02:00
Viktor Lofgren
b6265cee11
(feeds) Add timeout code to send()
...
Due to the unique way java's HttpClient implements timeouts, we must always wrap it in an executor to catch the scenario that a server stops sending data mid-response, which would otherwise hang the send method forever.
2025-04-08 22:09:59 +02:00
Viktor Lofgren
c91af247e9
(rate-limit) Fix rate limiting logic
...
The rate limiter was misconfigured to regenerate tokens at a fixed rate of 1 per refillRate; not refillRate per minute. Additionally increasing the default bucket size to 4x refill rate.
2025-04-05 12:26:26 +02:00
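A minimal sketch of the behaviour described in the rate-limit fix above, assuming a simple token bucket. The class and method names here are hypothetical and are not taken from the project; only the two stated properties (refill at refillRate tokens per minute, default capacity of 4x the refill rate) come from the commit message.

```java
// Hypothetical token bucket illustrating "refillRate per minute" with a
// bucket size of 4x the refill rate. Not the project's actual class.
class TokenBucketSketch {
    private final int refillRatePerMinute;
    private final int capacity;
    private double tokens;
    private long lastRefillNanos;

    TokenBucketSketch(int refillRatePerMinute, long nowNanos) {
        this.refillRatePerMinute = refillRatePerMinute;
        this.capacity = 4 * refillRatePerMinute; // assumed default: 4x refill rate
        this.tokens = capacity;                  // start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    // Returns true if a token was available at the given time.
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedMinutes = (nowNanos - lastRefillNanos) / 60e9;
        // Regenerate refillRatePerMinute tokens per elapsed minute, capped at capacity.
        tokens = Math.min(capacity, tokens + elapsedMinutes * refillRatePerMinute);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucketSketch bucket = new TokenBucketSketch(60, System.nanoTime());
        System.out.println(bucket.tryAcquire(System.nanoTime()));
    }
}
```

The clock is passed in explicitly so the refill arithmetic can be checked without sleeping.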
Viktor Lofgren
7a31227de1
(crawler) Filter out robots.txt-sitemaps that belong to different domains
2025-04-02 13:35:37 +02:00
Viktor Lofgren
4f477604c5
(crawler) Improve error handling in parquet->slop conversion
...
Parquet code throws a RuntimeException, which was not correctly caught, leading to a failure to crawl.
2025-04-02 13:16:01 +02:00
Viktor Lofgren
2970f4395b
(minor) Test code cleanup
2025-04-02 13:16:01 +02:00
Viktor Lofgren
d1ec909b36
(crawler) Improve handling of timeouts to prevent crawler from getting stuck
2025-04-02 12:57:21 +02:00
Viktor Lofgren
c67c5bbf42
(crawler) Experimentally drop to HTTP 1.1 for crawler to see if this solves stuck send()s
2025-04-01 12:05:21 +02:00
Viktor Lofgren
ecb0e57a1a
(crawler) Make the use of virtual threads in the crawler configurable via system properties
2025-03-27 21:26:05 +01:00
Viktor Lofgren
8c61f61b46
(crawler) Add crawling metadata to domainstate db
2025-03-27 16:38:37 +01:00
Viktor Lofgren
662a18c933
Revert "(crawler) Further rearrange crawl order"
...
This reverts commit 1c2426a052.
The change does not appear necessary to avoid problems.
2025-03-27 11:25:08 +01:00
Viktor Lofgren
1c2426a052
(crawler) Further rearrange crawl order
...
Limit crawl order preference to edu domains, to avoid hitting stuff like medium and wordpress with shotgun requests.
2025-03-27 11:19:20 +01:00
Viktor Lofgren
34df7441ac
(crawler) Add some jitter to crawl delay to avoid accidentally synchronized requests
2025-03-27 11:15:16 +01:00
Viktor Lofgren
5387e2bd80
(crawler) Adjust crawl order to get a better mixture of domains
2025-03-27 11:12:48 +01:00
Viktor Lofgren
0f3b24d0f8
(crawler) Evaluate virtual threads for the crawler
...
The change also alters SimpleBlockingThreadPool to add the option to use virtual threads instead of platform threads.
2025-03-27 11:02:21 +01:00
Viktor Lofgren
a732095d2a
(crawler) Improve crawl task ordering
...
Further improve the ordering of the crawl tasks in order to ensure that potentially blocking tasks are enqueued as soon as possible.
2025-03-26 16:51:37 +01:00
Viktor Lofgren
6607f0112f
(crawler) Improve how the crawler deals with interruptions
...
In some cases, threads would previously fail to terminate when interrupted.
2025-03-26 16:19:57 +01:00
Viktor Lofgren
4913730de9
(jdk) Upgrade to Java 24
2025-03-26 13:26:06 +01:00
Viktor Lofgren
1db64f9d56
(chore) Fix zookeeper test by upgrading zk image version.
...
Test suddenly broke due to the increasing entropy of the universe.
2025-03-26 11:47:14 +01:00
Viktor Lofgren
4dcff14498
(search) Improve contrast with light mode
2025-03-25 13:15:31 +01:00
Viktor Lofgren
426658f64e
(search) Improve contrast with light mode
2025-03-25 11:54:54 +01:00
Viktor Lofgren
2181b22f05
(crawler) Change default maxConcurrentRequests to 512
...
This seems like a more sensible default after testing a bit. May need local tuning.
2025-03-22 12:11:09 +01:00
Viktor Lofgren
42bd79a609
(crawler) Experimentally throttle the number of active retrievals to see how this affects the network performance
...
There have been some indications that request storms lead to buffer bloat and bad throughput.
This adds a configurable semaphore, by default permitting 100 active requests.
2025-03-22 11:50:37 +01:00
Viktor Lofgren
b91c1e528a
(favicon) Send dummy svg result when image is missing
...
This prevents the browser from rendering a "broken image" in this scenario.
2025-03-21 15:15:14 +01:00
Viktor Lofgren
b1130d7a04
(domainstatedb) Allow creation of disconnected db
...
This is required for executor services that do not have crawl data to still be able to initialize.
2025-03-21 14:59:36 +01:00
Viktor Lofgren
8364bcdc97
(favicon) Add favicons to the matchograms
2025-03-21 14:30:40 +01:00
Viktor Lofgren
626cab5fab
(favicon) Add favicon to site overview
2025-03-21 14:15:23 +01:00
Viktor Lofgren
cfd4712191
(favicon) Add capability for fetching favicons
2025-03-21 13:38:58 +01:00
Viktor Lofgren
9f18ced73d
(crawler) Improve deferred task behavior
2025-03-18 12:54:18 +01:00
Viktor Lofgren
18e91269ab
(crawler) Improve deferred task behavior
2025-03-18 12:25:22 +01:00
Viktor Lofgren
e315ca5758
(search) Change icon for small web filter
...
The previous icon was of an irregular size and shifted the layout in an unaesthetic way.
2025-03-17 12:07:34 +01:00
Viktor Lofgren
3ceea17c1d
(search) Adjustments to device detection in CSS
...
Use pointer:fine media query to better distinguish between mobile devices and PCs with a window in portrait orientation.
With this, we never show mobile filtering functionality on mobile; and never show the touch-inaccessible minimized sidebar on mobile.
2025-03-17 12:04:34 +01:00
Viktor Lofgren
b34527c1a3
(search) Add small web filter for new UI
2025-03-17 11:39:19 +01:00
Viktor Lofgren
185bf28fca
(crawler) Correct issue leading to parquet files not being correctly preconverted
...
Path.endsWith("str") != String.endsWith(".str")
2025-03-10 13:48:12 +01:00
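The pitfall named in the commit above can be reproduced with the standard library alone. In this sketch the file name is made up for illustration: `Path.endsWith` compares whole path elements, while `String.endsWith` compares characters, so a suffix check for a file extension only works on the string form.

```java
import java.nio.file.Path;

// Demonstrates that Path.endsWith matches whole name elements,
// whereas String.endsWith matches a character suffix.
class EndsWithDemo {
    static boolean hasSuffixByPath(Path path, String suffix) {
        return path.endsWith(suffix);            // element-wise comparison
    }

    static boolean hasSuffixByString(Path path, String suffix) {
        return path.toString().endsWith(suffix); // character-wise comparison
    }

    public static void main(String[] args) {
        Path file = Path.of("crawl-data", "example.parquet"); // hypothetical file name

        System.out.println(hasSuffixByPath(file, ".parquet"));        // false
        System.out.println(hasSuffixByString(file, ".parquet"));      // true
        System.out.println(hasSuffixByPath(file, "example.parquet")); // true
    }
}
```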
Viktor Lofgren
78cc25584a
(crawler) Add error logging when entering bad path for historical crawl data
2025-03-10 13:38:40 +01:00
Viktor Lofgren
62ba30bacf
(common) Log info about metrics server
2025-03-10 13:12:39 +01:00
Viktor Lofgren
3bb84eb206
(common) Log info about metrics server
2025-03-10 13:03:48 +01:00
Viktor Lofgren
be7d13ccce
(crawler) Correct task execution logic in crawler
...
The old behavior would flag domains as pending too soon, leading to them being omitted from execution if they were not immediately available to run.
2025-03-09 13:47:51 +01:00
Viktor Lofgren
8c088a7c0b
(crawler) Remove custom thread factory
...
This was causing issues, and not really doing much of benefit.
2025-03-09 11:50:52 +01:00
Viktor Lofgren
ea9a642b9b
(crawler) More effective task scheduling in the crawler
...
This should hopefully allow more threads to be busy
2025-03-09 11:44:59 +01:00
Viktor Lofgren
27f528af6a
(search) Fix "Remove Javascript" toggle
...
A bug was introduced at some point where the special keyword for filtering on javascript was changed to special:scripts, from js:true/js:false.
Solves issue #155
2025-02-28 12:03:04 +01:00
Viktor Lofgren
20ca41ec95
(processed model) Use String columns instead of Txt columns for SlopDocumentRecord
...
It's very likely TxtStringColumn is the culprit of the bug seen in https://github.com/MarginaliaSearch/MarginaliaSearch/issues/154 where the wrong URL was shown for a search result.
2025-02-24 11:41:51 +01:00
Viktor Lofgren
7671f0d9e4
(search) Display message when no search results are found
2025-02-24 11:15:55 +01:00
Viktor Lofgren
44d6bc71b7
(assistant) Migrate to Jooby framework
2025-02-15 13:28:12 +01:00
Viktor Lofgren
9d302e2973
(assistant) Migrate to Jooby framework
2025-02-15 13:26:04 +01:00
Viktor Lofgren
f553701224
(assistant) Migrate to Jooby framework
2025-02-15 13:21:48 +01:00
Viktor Lofgren
f076d05595
(deps) Upgrade slf4j to latest
2025-02-15 12:50:16 +01:00
Viktor Lofgren
b513809710
(*) Stopgap fix for metrics server initialization errors bringing down services
2025-02-14 17:09:48 +01:00
Viktor Lofgren
7519b28e21
(search) Correct exception from misbehaving bots feeding invalid urls
2025-02-14 17:05:24 +01:00
Viktor Lofgren
3eac4dd57f
(search) Correct exception in error handler when page is missing
2025-02-14 17:00:21 +01:00
Viktor Lofgren
4c2810720a
(search) Add redirect handler for full URLs in the /site endpoint
2025-02-14 16:31:11 +01:00
Viktor Lofgren
8480ba8daa
(live-capture) Code cleanup
2025-02-04 14:05:36 +01:00
Viktor Lofgren
fbba392491
(live-capture) Send a UA-string from the browserless fetcher as well
...
The change also introduces a somewhat convoluted wiremock test to intercept and verify that these headers are in fact sent
2025-02-04 13:36:49 +01:00
Viktor Lofgren
530eb35949
(update-rss) Do not fail the feed fetcher control actor if it takes a long time to complete.
2025-02-03 11:35:32 +01:00
Viktor Lofgren
c2dd2175a2
(search) Add new query expansion rule contracting WORD NUM pairs into WORD-NUM and WORDNUM
2025-02-01 13:13:30 +01:00
Viktor Lofgren
b8581b0f56
(crawler) Safe sanitization of headers during warc->slop conversion
...
The warc->slop converter was rejecting some items because they had headers that were representable in the Warc code's MessageHeader map implementation, but illegal in the HttpHeaders' implementation.
Fixing this by manually filtering these out. Ostensibly the constructor has a filtering predicate, but this annoyingly runs too late and fails to prevent the problem.
2025-01-31 12:47:42 +01:00
Viktor Lofgren
2ea34767d8
(crawler) Use the response URL when resolving relative links
...
The crawler was incorrectly using the request URL as the base URL when resolving relative links. This caused problems when encountering redirects.
For example if we fetch /log, redirecting to /log/ and find links to foo/, and bar/; these would resolve to /foo and /bar, and not /log/foo and /log/bar.
2025-01-31 12:40:13 +01:00
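The redirect problem described in the commit above can be illustrated with `java.net.URI`. This is only a sketch: the crawler may well use its own URL type, and the host name is a placeholder. Resolving a relative link against the request URL (`/log`) drops the last path segment, while resolving against the response URL (`/log/`) keeps it.

```java
import java.net.URI;

// Shows how the choice of base URL changes relative link resolution.
class RelativeLinkDemo {
    static URI resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href);
    }

    public static void main(String[] args) {
        String requestUrl = "https://www.example.com/log";   // what was requested
        String responseUrl = "https://www.example.com/log/"; // where the redirect ended up

        System.out.println(resolve(requestUrl, "foo/"));  // https://www.example.com/foo/
        System.out.println(resolve(responseUrl, "foo/")); // https://www.example.com/log/foo/
    }
}
```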
Viktor Lofgren
e9af838231
(actor) Fix migration actor final steps
2025-01-30 11:48:21 +01:00
Viktor Lofgren
ae0cad47c4
(actor) Utility method for getting a json prototype for actor states
...
If we can hook this into the control gui somehow, it'll make for a nice QOL upgrade when manually interacting with the actors.
2025-01-29 15:20:25 +01:00
Viktor Lofgren
5fbc8ef998
(misc) Tidying
2025-01-29 15:17:04 +01:00
Viktor Lofgren
32c6dd9e6a
(actor) Delete old data in the migration actor
2025-01-29 14:51:46 +01:00
Viktor Lofgren
6ece6a6cfb
(actor) Improve resilience for the migration actor
2025-01-29 14:43:09 +01:00
Viktor Lofgren
39cd1c18f8
Automatically run npm install tailwindcss@3 via setup.sh, as the new default version of the package is incompatible with the project
2025-01-29 12:21:08 +01:00
Viktor
eb65daaa88
Merge pull request #151 from Lionstiger/master
...
fix small grammar error in footerLegal.jte
2025-01-28 21:49:50 +01:00
Viktor
0bebdb6e33
Merge branch 'master' into master
2025-01-28 21:49:36 +01:00
Viktor Lofgren
1e50e392c6
(actor) Improve logging and error handling for data migration actor
2025-01-28 15:34:36 +01:00
Viktor Lofgren
fb673de370
(crawler) Change the header 'User-agent' to 'User-Agent'
2025-01-28 15:34:16 +01:00
Viktor Lofgren
eee73ab16c
(crawler) Be more lenient when performing a domain probe
2025-01-28 15:24:30 +01:00
Viktor Lofgren
5354e034bf
(search) Minor grammar fix
2025-01-27 18:36:31 +01:00
Magnus Wulf
72384ad6ca
fix small grammar error
2025-01-27 15:04:57 +01:00
Viktor Lofgren
a2b076f9be
(converter) Add progress tracking for big domains in converter
2025-01-26 18:03:59 +01:00
Viktor Lofgren
c8b0a32c0f
(crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams
2025-01-26 15:40:17 +01:00
Viktor Lofgren
f0d74aa3bb
(converter) Fix close() ordering to prevent converter crash
2025-01-26 14:47:36 +01:00
Viktor Lofgren
74a1f100f4
(converter) Refactor to remove CrawledDomainReader and move its functionality into SerializableCrawlDataStream
2025-01-26 14:46:50 +01:00
Viktor Lofgren
eb049658e4
(converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents
...
Refactor to do this without introducing additional copies
2025-01-26 14:28:53 +01:00
Viktor Lofgren
db138b2a6f
(converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents
2025-01-26 14:25:57 +01:00
Viktor Lofgren
1673fc284c
(converter) Reduce lock contention in converter by separating the processing of full and simple-track domains
2025-01-26 13:21:46 +01:00
Viktor Lofgren
503ea57d5b
(converter) Reduce lock contention in converter by separating the processing of full and simple-track domains
2025-01-26 13:18:14 +01:00
Viktor Lofgren
18ca926c7f
(converter) Truncate excessively long strings in SentenceExtractor, malformed data was effectively DOS:ing the converter
2025-01-26 12:52:54 +01:00
Viktor Lofgren
db99242db2
(converter) Adding some logging around the simple processing track to investigate an issue with the converter stalling
2025-01-26 12:02:00 +01:00
Viktor Lofgren
2b9d2985ba
(doc) Update readme with up-to-date install instructions.
2025-01-24 18:51:41 +01:00
Viktor Lofgren
eeb6ecd711
(search) Make it clearer that the affiliate marker applies to the result, and not the search engine's relation to the result.
2025-01-24 18:50:00 +01:00
Viktor Lofgren
1f58aeadbf
(build) Upgrade JIB
2025-01-24 18:49:28 +01:00
Viktor Lofgren
3d68be64da
(crawler) Add default CT when it's missing for icons
2025-01-22 13:55:47 +01:00
Viktor Lofgren
668f3b16ef
(search) Redirect ^/site/$ to /site
2025-01-22 13:35:18 +01:00
Viktor Lofgren
98a340a0d1
(crawler) Add favicon data to domain state db in its own table
2025-01-22 11:41:20 +01:00
Viktor Lofgren
8862100f7e
(crawler) Improve logging and error handling
2025-01-21 21:44:21 +01:00
Viktor Lofgren
274941f6de
(crawler) Smarter parquet->slop crawl data migration
2025-01-21 21:26:12 +01:00
Viktor Lofgren
abec83582d
Fix refactoring gore
2025-01-21 15:08:04 +01:00
Viktor Lofgren
569520c9b6
(index) Add manual adjustments for rankings based on domain
2025-01-21 15:07:43 +01:00
Viktor Lofgren
088310e998
(converter) Improve simple processing performance
...
The recent slop migration changes introduced a performance regression in the simple conversion track. This change fixes that regression.
2025-01-21 14:13:33 +01:00
Viktor
270cab874b
Merge pull request #134 from MarginaliaSearch/slop-crawl-data-spike
...
Store crawl data in slop instead of parquet
2025-01-21 13:34:22 +01:00
Viktor Lofgren
4c74e280d3
(crawler) Fix urlencoding in sitemap fetcher
2025-01-21 13:33:35 +01:00
Viktor Lofgren
5b347e17ac
(crawler) Automatically migrate to slop from parquet when crawling
2025-01-21 13:33:14 +01:00
Viktor Lofgren
55d6ab933f
Merge branch 'master' into slop-crawl-data-spike
2025-01-21 13:32:58 +01:00
Viktor Lofgren
43b74e9706
(crawler) Fix exception handler and resource leak in WarcRecorder
2025-01-20 23:45:28 +01:00
Viktor Lofgren
579a115243
(crawler) Reduce log spam from error handling in new sitemap fetcher
2025-01-20 23:17:13 +01:00
Viktor
2c67f50a43
Merge pull request #150 from MarginaliaSearch/httpclient-in-crawler
...
Reduce the use of 3rd party code in the crawler
2025-01-20 19:35:30 +01:00
Viktor Lofgren
78a958e2b0
(crawler) Fix broken test that started failing after the search engine moved to a new domain
2025-01-20 18:52:14 +01:00
Viktor Lofgren
4e939389b2
(crawler) New Jsoup based sitemap parser
2025-01-20 14:37:44 +01:00
Viktor Lofgren
e67a9bdb91
(crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead.
2025-01-19 15:07:11 +01:00
Viktor Lofgren
567e4e1237
(crawler) Fast detection and bail-out for crawler traps
...
Improve logging and exclude robots.txt from this logic.
2025-01-18 15:28:54 +01:00
Viktor Lofgren
4342e42722
(crawler) Fast detection and bail-out for crawler traps
...
Nepenthes has been doing the rounds on social media. This adds an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly. Out-of-the-box crawl limits will also deal with this type of attack, but this fix is faster.
2025-01-17 13:02:57 +01:00
Viktor Lofgren
bc818056e6
(run) Fix templates for mariadb
...
Apparently the docker image contract changed at some point, and now we should spawn mariadbd and not mysqld; mariadb-admin and not mysqladmin.
2025-01-16 15:27:02 +01:00
Viktor Lofgren
de2feac238
(chore) Upgrade jib from 3.4.3 to 3.4.4
2025-01-16 15:10:45 +01:00
Viktor Lofgren
1e770205a5
(search) Dyslexia fix
2025-01-12 20:40:14 +01:00
Viktor
e44ecd6d69
Merge pull request #149 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2025-01-12 20:38:36 +01:00
Viktor
5b93a0e633
Update ROADMAP.md
2025-01-12 20:38:11 +01:00
Viktor
08fb0e5efe
Update ROADMAP.md
2025-01-12 20:37:43 +01:00
Viktor
bcf67782ea
Update ROADMAP.md
2025-01-12 20:37:09 +01:00
Viktor Lofgren
ef3f175ede
(search) Don't clobber the search query URL with default values
2025-01-10 15:57:30 +01:00
Viktor Lofgren
bbe4b5d9fd
Revert experimental changes
2025-01-10 15:52:02 +01:00
Viktor Lofgren
c67a635103
(search, experimental) Add a few debugging tracks to the search UI
2025-01-10 15:44:44 +01:00
Viktor Lofgren
20b24133fb
(search, experimental) Add a few debugging tracks to the search UI
2025-01-10 15:34:48 +01:00
Viktor Lofgren
f2567677e8
(index-client) Clean up index client code
...
Improve error handling. This should be a relatively rare case, but we don't want one bad index partition to blow up the entire query.
2025-01-10 15:17:07 +01:00
Viktor Lofgren
bc2c2061f2
(index-client) Clean up index client code
...
This should have the rpc stream reception be performed in parallel in separate threads, rather than blocking sequentially in the main thread, hopefully giving a slight performance boost.
2025-01-10 15:14:42 +01:00
Viktor Lofgren
1c7f5a31a5
(search) Further reduce the number of db queries by adding more caching to DbDomainQueries.
2025-01-10 14:17:29 +01:00
Viktor Lofgren
59a8ea60f7
(search) Further reduce the number of db queries by adding more caching to DbDomainQueries.
2025-01-10 14:15:22 +01:00
Viktor Lofgren
aa9b1244ea
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:56:04 +01:00
Viktor Lofgren
2d17233366
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:53:56 +01:00
Viktor Lofgren
b245cc9f38
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:46:19 +01:00
Viktor Lofgren
6614d05bdf
(db) Make db pool size configurable
2025-01-09 20:20:51 +01:00
Viktor Lofgren
55aeb03c4a
(feeds) Replace rssreader based parsing with a custom jsoup based rss parser
...
This solves some issues with the rssreader based parser, which was very picky about the XML being valid. Jsoup is much more lenient when parsing malformed XML.
2025-01-09 18:29:55 +01:00
Viktor Lofgren
faa589962f
(live-capture) Browserless now requires a token
2025-01-09 14:51:11 +01:00
Viktor Lofgren
c7edd6b39f
(live-capture) Browserless now requires a token
2025-01-09 14:46:05 +01:00
Viktor Lofgren
79da622e3b
(search) Update front page with new banner about move
2025-01-08 21:38:19 +01:00
Viktor Lofgren
3da8337ba6
(feeds) Add system property for exporting fetched feeds to a slop table for debugging
2025-01-08 20:49:16 +01:00
Viktor Lofgren
a32d230f0a
(special) Trigger deployment
2025-01-08 20:07:54 +01:00
Viktor Lofgren
3772bfd387
(query) Fix handling of optional ranking parameters
2025-01-08 17:11:22 +01:00
Viktor Lofgren
02a7900d1a
(search) Correct search-in-title toggle in search UI
2025-01-08 16:51:10 +01:00
Viktor Lofgren
a1fb92468f
(refac) Remove ResultRankingParameters, QueryLimits class and use protobuf classes directly instead
...
This is primarily to make the code a bit easier to reason about, and will reduce the level of indirection and data copying in the search-service->query-service->index-service communication chain.
2025-01-08 16:15:57 +01:00
Viktor Lofgren
b7f0a2a98e
(search-service) Fix metrics for errors and request times
...
This was previously in place, but broke during the jooby migration.
2025-01-08 14:10:43 +01:00
Viktor Lofgren
5fb76b2e79
(search-service) Fix metrics for errors and request times
...
This was previously in place, but broke during the jooby migration.
2025-01-08 14:06:03 +01:00
Viktor Lofgren
ad8c97f342
(search-service) Begin replacement of the crawl queue mechanism with node_affinity flagging
...
Previously a special db table was used to hold domains slated for crawling, but this is deprecated, and instead now each domain has a node_affinity flag that decides its indexing state, where a value of -1 indicates it shouldn't be crawled, a value of 0 means it's slated for crawling by the next index partition to be crawled, and a positive value means it's assigned to an index partition.
The change set also adds a test case validating the modified behavior.
2025-01-08 13:25:56 +01:00
Viktor Lofgren
dc1b6373eb
(search-service) Clean up readme
2025-01-08 13:04:39 +01:00
Viktor Lofgren
983d6d067c
(search-service) Add indexing indicator to sibling domains listing
2025-01-08 12:58:34 +01:00
Viktor Lofgren
a84a06975c
(ranking-params) Add disable penalties flag to ranking params
...
This will help debugging ranking issues. Later it may be added to some filters.
2025-01-08 00:16:49 +01:00
Viktor Lofgren
d2864c13ec
(query-params) Add additional permitted query params
2025-01-07 20:21:44 +01:00
Viktor Lofgren
03ba53ce51
(legacy-search) Update nav bar with correct links
2025-01-07 17:44:52 +01:00
Viktor Lofgren
d4a6684931
(specialization) Soften length requirements for wiki-specialized documents (incl. cppreference)
2025-01-07 15:53:25 +01:00
Viktor
6f0485287a
Merge pull request #145 from MarginaliaSearch/cppreference_fixes
...
Cppreference fixes
2025-01-07 15:43:19 +01:00
Viktor Lofgren
59e2dd4c26
(specialization) Soften length requirements for wiki-specialized documents (incl. cppreference)
2025-01-07 15:41:30 +01:00
Viktor Lofgren
ca1807caae
(specialization) Add new specialization for cppreference.com
...
Give this reference website some synthetically generated tokens to improve the likelihood of a good match.
2025-01-07 15:41:05 +01:00
Viktor Lofgren
26c20e18ac
(keyword-extraction) Soften constraints on keyword patterns, allowing for longer segmented words
2025-01-07 15:20:50 +01:00
Viktor Lofgren
7c90b6b414
(query) Don't blindly make tokens containing a colon into a non-ranking advice term
2025-01-07 15:18:05 +01:00
Viktor Lofgren
b63c54c4ce
(search) Update opensearch.xml to point to non-redirecting domains.
2025-01-07 00:23:09 +01:00
Viktor Lofgren
fecd2f4ec3
(deploy) Add legacy search service to deploy script
2025-01-07 00:21:13 +01:00
Viktor Lofgren
39e420de88
(search) Add wayback machine link to siteinfo
2025-01-06 20:33:10 +01:00
Viktor Lofgren
dc83619861
(rssreader) Further suppress logging
2025-01-06 20:20:37 +01:00
Viktor Lofgren
87d1c89701
(search) Add listing of sibling subdomains to site overview
2025-01-06 20:17:36 +01:00
Viktor Lofgren
a42a7769e2
(legacy-search) Remove legacy paperdoll class
2025-01-06 20:17:36 +01:00
Viktor
202bda884f
Update readme.md
...
Add note about installing tailwindcss via npm
2025-01-06 18:35:13 +01:00
Viktor Lofgren
2315fdc731
(search) Vendor rssreader and modify it to be able to consume the nlnet atom feed
...
Also dial down the logging a bit for the rssreader package.
2025-01-06 17:58:50 +01:00
Viktor Lofgren
b5469bd8a1
(search) Turn relative feed URLs absolute when dealing with RSS/Atom item URLs
2025-01-06 16:56:24 +01:00
Viktor Lofgren
6a6318d04c
(search) Add separate websiteUrl property to legacy service
2025-01-06 16:26:08 +01:00
Viktor Lofgren
55933f8d40
(search) Ensure we respect old URL contracts
...
/explore/random should be equivalent to /explore
2025-01-06 16:20:53 +01:00
Viktor
be6382e0d0
Merge pull request #127 from MarginaliaSearch/serp-redesign
...
Web UI redesign
2025-01-06 16:08:14 +01:00
Viktor Lofgren
45e771f96b
(api) Update the / API redirect to the new documentation stub.
2025-01-06 16:07:32 +01:00
Viktor Lofgren
8dde502cc9
Merge branch 'master' into serp-redesign
2025-01-05 23:33:35 +01:00
Viktor Lofgren
3e66767af3
(search) Adjust query parsing to trim tokens in quoted search terms
...
Quoted search queries that contained keywords with possessive 's endings were not returning any results, as the index does not retain that suffix, and the query parser was not stripping it away in this code path.
This solves issue #143.
2025-01-05 23:33:09 +01:00
Viktor Lofgren
9ec9d1b338
Merge branch 'master' into serp-redesign
2025-01-05 21:10:20 +01:00
Viktor Lofgren
dcad0d7863
(search) Tweak token formation.
2025-01-05 21:01:09 +01:00
Viktor Lofgren
94e1aa0baf
(search) Tweak token formation to still break apart emails in brackets.
2025-01-05 20:55:44 +01:00
Viktor Lofgren
b62f043910
(search) Adjust token formation rules to be more lenient to C++ and PHP code.
...
This addresses Issue #142
2025-01-05 20:50:27 +01:00
Viktor Lofgren
6ea22d0d21
(search) Update front page with work-in-progress note
2025-01-05 19:08:02 +01:00
Viktor Lofgren
8c69dc31b8
Merge branch 'master' into serp-redesign
2025-01-05 18:52:51 +01:00
Viktor Lofgren
00734ea87f
(search) Add hover text for matchogram
2025-01-05 18:50:44 +01:00
Viktor Lofgren
3009713db4
(search) Fix broken tests
2025-01-05 18:50:27 +01:00
Viktor
9b2ceaf37c
Merge pull request #141 from MarginaliaSearch/vlofgren-patch-1
...
Update FUNDING.yml
2025-01-05 18:40:20 +01:00
Viktor
8019c2ce18
Update FUNDING.yml
2025-01-05 18:40:06 +01:00
Viktor Lofgren
a9e312b8b1
(service) Add links to marginalia-search.com where appropriate
2025-01-05 16:56:38 +01:00
Viktor Lofgren
4da3563d8a
(service) Clean up exceptions when requestScreengrab is not available
2025-01-04 14:45:51 +01:00
Viktor Lofgren
48d0a3089a
(service) Improve logging around grpc
...
This change adds a marker for the gRPC-specific logging, as well as improves the clarity and meaningfulness of the log messages.
2025-01-02 20:40:53 +01:00
Viktor Lofgren
594df64b20
(domain-info) Use appropriate sqlite database when fetching feed status
2025-01-02 20:20:36 +01:00
Viktor Lofgren
06efb5abfc
Merge branch 'master' into serp-redesign
2025-01-02 18:42:12 +01:00
Viktor Lofgren
78eb1417a7
(service) Only block on SingleNodeChannelPool creation in QueryClient
...
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.
This should make most services start faster as a result.
2025-01-02 18:42:01 +01:00
Viktor Lofgren
8c8f2ad5ee
(search) Add an indicator when a link has a feed in the similar/linked domains views
2025-01-02 18:11:57 +01:00
Viktor Lofgren
f71e79d10f
(search) Add a copy of the old UI as a separate service, search-service-legacy
2025-01-02 18:03:42 +01:00
Viktor Lofgren
1b27c5cf06
(search) Add a copy of the old UI as a separate service, search-service-legacy
2025-01-02 18:02:17 +01:00
Viktor Lofgren
67edc8f90d
(domain-info) Only flag domains with rss feed items as having a feed
2025-01-02 17:41:52 +01:00
Viktor Lofgren
5f576b7d0c
(query-parser) Strip leading underscores
...
This addresses issue #140, where __builtin_ffs gives no results.
2025-01-02 14:39:03 +01:00
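A minimal sketch of the behaviour described in the commit above, assuming the stripping is applied per query token. The class and method names are hypothetical and are not taken from the repository.

```java
// Hypothetical helper, not the project's actual query parser code.
final class QueryTokenCleaner {
    /** Removes any run of '_' at the start of a token; the rest is left untouched. */
    static String stripLeadingUnderscores(String token) {
        int start = 0;
        while (start < token.length() && token.charAt(start) == '_') {
            start++;
        }
        return token.substring(start);
    }
}
```

Under this assumption a query for `__builtin_ffs` is looked up as `builtin_ffs`, while underscores inside the token are preserved.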
Viktor Lofgren
8b05c788fd
(Search) Enable gzip compression of responses
2025-01-01 18:34:42 +01:00
Viktor Lofgren
236f033bc9
(Search) Reduce whitespace in explore view on all resolutions
2025-01-01 18:23:35 +01:00
Viktor Lofgren
510fc75121
(Search) Reduce whitespace in explorer view on mobile
2025-01-01 18:18:09 +01:00
Viktor Lofgren
0376f2e6e3
Merge branch 'master' into serp-redesign
...
# Conflicts:
# code/services-application/search-service/resources/templates/search/index/index.hdb
2025-01-01 18:15:09 +01:00
Viktor Lofgren
0b65164f60
(chore) Fix broken test
2025-01-01 18:06:29 +01:00
Viktor Lofgren
9be477de33
(domain-info) Add a feed flag to domain info
...
This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.
2025-01-01 18:02:33 +01:00
Viktor Lofgren
84f55b84ff
(search) Add experimental OPML-export function for feed subscriptions
2025-01-01 17:17:54 +01:00
Viktor Lofgren
ab5c30ad51
(search) Fix site info view for completely unknown domains
...
Also correct the DbDomainQueries.getDomainId so that it throws NoSuchElementException when domain id is missing, and not UncheckedExecutionException via Cache.
2025-01-01 16:29:01 +01:00
Viktor Lofgren
0c839453c5
(search) Fix crosstalk link
2025-01-01 16:09:19 +01:00
Viktor Lofgren
5e4c5d03ae
(search) Clean up breakpoints in site overview
2025-01-01 16:06:08 +01:00
Viktor Lofgren
710af4999a
(feed-fetcher) Add &quot; entity mapping in feed fetcher
2025-01-01 15:45:17 +01:00
Viktor Lofgren
a5b0a1ae62
(search) Move linked/similar domains to a popover style menu on mobile
...
Fix scroll
2025-01-01 15:37:35 +01:00
Viktor Lofgren
e9f71ee39b
(search) Move linked/similar domains to a popover style menu on mobile
2025-01-01 15:23:25 +01:00
Viktor Lofgren
baeb4a46cd
(search) Reintroduce query rewriting for recipes, add rules for wikis and forums
2024-12-31 16:05:00 +01:00
Viktor Lofgren
5e2a8e9f27
(deploy) Add capability of adding tags to deploy script
2024-12-31 16:04:13 +01:00
Viktor
cc1a5bdf90
Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2024-12-31 14:41:02 +01:00
Viktor
7f7b1ffaba
Update ROADMAP.md
2024-12-31 14:40:34 +01:00
Viktor Lofgren
0ea8092350
(search) Add link promoting the redesign beta
2024-12-30 15:47:13 +01:00
Viktor Lofgren
483d29497e
(deploy) Add hashbang to deploy script
2024-12-30 15:47:13 +01:00
Viktor Lofgren
bae44497fe
(crawler) Add a new system property crawler.maxFetchSize
...
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
Viktor Lofgren
0d59202aca
(crawler) Do not remove W/-prefix on weak e-tags
...
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
Viktor Lofgren
0ca43f0c9c
(live-crawler) Improve live crawler short-circuit logic
...
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
Viktor Lofgren
3bc99639a0
(feed-fetcher) Make feed fetcher requests conditional
...
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests. On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.
A new table was added to the FeedDb to hold one etag per domain.
If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.
This completes the changes for Issue #136.
2024-12-27 15:10:15 +01:00
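The conditional-request logic described in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the FeedFetcherService code: the class name, the method name, and the choice to return a plain header map are all hypothetical. The stored etag is passed back exactly as received (including any `W/` prefix), and the feed database creation time is formatted as an HTTP date.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; names and structure are assumptions.
final class ConditionalFeedRequest {
    // HTTP-date (IMF-fixdate) with a two-digit day, always expressed in GMT
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    /** Builds the conditional headers; either argument may be null if unknown. */
    static Map<String, String> conditionalHeaders(String storedEtag, Instant feedDbCreated) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (storedEtag != null && !storedEtag.isBlank()) {
            // Sent back verbatim, weak-validator prefix included
            headers.put("If-None-Match", storedEtag);
        }
        if (feedDbCreated != null) {
            headers.put("If-Modified-Since", HTTP_DATE.format(feedDbCreated));
        }
        return headers;
    }
}
```

A server that honours these headers can answer with 304 Not Modified and an empty body, which is where the bandwidth saving comes from.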
Viktor Lofgren
927bc0b63c
(live-crawler) Add Accept-Encoding: gzip to outbound requests
...
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.
The change addresses issue #136, save for making the fetcher's requests conditional.
2024-12-27 03:59:34 +01:00
Viktor Lofgren
d968801dc1
(converter) Drop feed data from SlopDomainRecord
...
Also remove feed extraction from converter. This is the crawler's responsibility now.
2024-12-26 17:57:08 +01:00
Viktor Lofgren
89db69d360
(crawler) Correct feed URLs in domain state db
...
Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
2024-12-26 15:18:31 +01:00
Viktor Lofgren
895cee7004
(crawler) Improved feed discovery, new domain state db per crawlset
...
Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.
Solves issue #135
2024-12-26 15:05:52 +01:00
Viktor Lofgren
4bb71b8439
(crawler) Correct content type probing to only run on URLs that are suspected to be binary
2024-12-26 14:26:23 +01:00
Viktor Lofgren
e4a41f7dd1
(crawler) Correct content type probing to only run on URLs that are suspected to be binary
2024-12-26 14:13:17 +01:00
Viktor
69ad6287b1
Update ROADMAP.md
2024-12-25 21:16:38 +00:00
Viktor Lofgren
81cdd6385d
Add rendering tests for most major views
...
This will prevent accidentally deploying a broken search service
2024-12-25 15:22:26 +01:00
Viktor Lofgren
e76c42329f
Correct dark mode for infobox in site focused search
2024-12-25 15:06:05 +01:00
Viktor Lofgren
e6ef4734ea
Fix tests
2024-12-25 15:05:41 +01:00
Viktor Lofgren
41a59dcf45
(feed) Sanitize illegal HTML entities out of the feed XML before parsing
2024-12-25 14:53:28 +01:00
Viktor Lofgren
df4bc1d7e9
Add update time to front page subscriptions
2024-12-25 14:42:00 +01:00
Viktor Lofgren
2b222efa75
Merge branch 'master' into serp-redesign
2024-12-25 14:22:42 +01:00
Viktor Lofgren
94d4d2edb7
(live-crawler) Add refresh date to feeds API
...
For now this is just the ctime for the feeds db. We may want to store this per-record in the future.
2024-12-25 14:20:48 +01:00
Viktor Lofgren
7ae19a92ba
(deploy) Improve deployment script to allow specification of partitions
2024-12-24 11:16:15 +01:00
Viktor Lofgren
56d14e56d7
(live-crawler) Improve LiveCrawlActor resilience to FeedService outages
2024-12-23 23:33:54 +01:00
Viktor Lofgren
a557c7ae7f
(live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler
2024-12-23 23:31:03 +01:00
Viktor Lofgren
b66879ccb1
(feed) Add support for date discovery through atom:issued and atom:created
...
This is specifically to help parse monadnock.net's Atom feed.
2024-12-23 20:05:58 +01:00
Viktor Lofgren
f1b7157ca2
(deploy) Add basic linting ability to deployment script.
2024-12-23 16:21:29 +01:00
Viktor Lofgren
7622335e84
(deploy) Correct deploy script, set correct name for assistant
2024-12-23 15:59:02 +01:00
Viktor Lofgren
0da2047eae
(live-capture) Correctly update processed count, disable poll rate adjustment based on freshness.
2024-12-23 15:56:27 +01:00
Viktor Lofgren
5ee4321110
(ci) Correct deploy script
2024-12-22 20:08:37 +01:00
Viktor Lofgren
9459b9933b
(ci) Correct deploy script
2024-12-22 19:40:32 +01:00
Viktor Lofgren
87fb564f89
(ci) Add script for automatic deployment based on git tags
2024-12-22 19:24:54 +01:00
Viktor Lofgren
5ca8523220
(math) Reduce log error spam from null unit conversions
2024-12-21 18:51:45 +01:00
Viktor Lofgren
1118657ffd
(system) Supply local IP to service discovery if multiFace is enabled
2024-12-19 22:20:19 +01:00
Viktor Lofgren
b1f970152d
(system) To support configurations with multiple docker networks, bind to the "most local" interface.
...
Make the behavior optional.
2024-12-19 20:26:31 +01:00
Viktor Lofgren
e1783891ab
(system) To support configurations with multiple docker networks, bind to the "most local" interface.
2024-12-19 20:18:57 +01:00
Viktor Lofgren
64d32471dd
(deploy) Deploy executor test
2024-12-19 17:45:47 +01:00
Viktor Lofgren
232cc465d9
(deploy) Deploy executor test
2024-12-19 17:35:38 +01:00
Viktor Lofgren
8c963bd4ba
(feeds) Remove Content-Encoding: gzip from feed fetcher
...
We don't support decompressing gzip, so this just gives us errors at this point should the server support it.
2024-12-18 22:23:44 +01:00
Viktor Lofgren
6a079c1c75
(feeds) Add per-domain throttling for feed fetcher.
2024-12-18 22:06:46 +01:00
Viktor Lofgren
2dc9f2e639
(feeds) Make feed XML parsing more lenient
...
... by consuming BOM markers and leading whitespace.
2024-12-18 17:18:41 +01:00
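One way to consume a BOM marker and leading whitespace before handing feed XML to a parser is sketched below. The class and method names are hypothetical, and the sketch assumes the feed has already been decoded to a String.

```java
// Hypothetical sketch; the actual feed parsing code may differ.
final class FeedXmlPreamble {
    /** Drops any leading U+FEFF byte order marks and whitespace characters. */
    static String stripBomAndLeadingWhitespace(String xml) {
        int start = 0;
        while (start < xml.length()) {
            char c = xml.charAt(start);
            if (c == '\uFEFF' || Character.isWhitespace(c)) {
                start++;
            } else {
                break;
            }
        }
        return xml.substring(start);
    }
}
```

XML parsers generally reject anything before the `<?xml` declaration, so trimming this prefix lets otherwise valid feeds through.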
Viktor Lofgren
b66fb9caf6
(feeds) Improve error handling in the feed fetcher.
2024-12-18 17:02:13 +01:00
Viktor Lofgren
6d18e6d840
(search) Add clustering to subscriptions view
2024-12-18 15:36:05 +01:00
Viktor Lofgren
2a3c63f209
(search) Exclude generated style.css from git
2024-12-18 15:24:31 +01:00
Viktor Lofgren
9f70cecaef
(search) Add site subscription feature that puts RSS updates on the front page
2024-12-18 15:24:31 +01:00
Viktor Lofgren
47e58a21c6
Refactor documentBody method and ContentType charset handling
...
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
2024-12-17 17:11:37 +01:00
Viktor Lofgren
3714104976
Add loader for slop data in converter.
...
Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00
Viktor Lofgren
f6f036b9b1
Switch to new Slop format for crawl data storage and processing.
...
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
2024-12-15 19:34:03 +01:00
Viktor Lofgren
b510b7feb8
Spike for storing crawl data in slop instead of parquet
...
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds. On-disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
c08203e2ed
(search) Prevent paperdoll from being run as a test by CI
2024-12-14 20:35:57 +01:00
Viktor Lofgren
86497fd32f
(site-info) Mobile layout fix
2024-12-14 16:19:56 +01:00
Viktor Lofgren
3b998573fd
Adjust colors on dark mode for site overview
2024-12-13 21:51:25 +01:00
Viktor Lofgren
e161882ec7
(search) Fix layout for light mode
2024-12-13 21:47:29 +01:00
Viktor Lofgren
357f349e30
(search) Table layout fixes for dictionary lookup
2024-12-13 21:47:08 +01:00
Viktor Lofgren
e4769f541d
(search) Sort and deduplicate search results for better relevance.
...
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
2024-12-13 21:47:08 +01:00
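The preference rules in the commit above (HTTPS over HTTP, domain names over raw IPs) might be sketched as below. The `Result` record, the use of a content hash as the duplicate key, and the IPv4-only host check are all assumptions made for illustration; the search service's own types and duplicate detection are not shown in this log.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; types and the duplicate key are assumptions.
final class ResultDedup {
    record Result(String url, long contentHash) {}

    static boolean isHttps(Result r) {
        return r.url().startsWith("https://");
    }

    static boolean isRawIp(Result r) {
        String host = URI.create(r.url()).getHost();
        return host != null && host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    // Smaller is better: false sorts before true for Boolean keys
    static final Comparator<Result> PREFERENCE =
            Comparator.comparing((Result r) -> !isHttps(r))
                    .thenComparing((Result r) -> isRawIp(r));

    /** Keeps one result per content hash, in order of first appearance. */
    static List<Result> dedup(List<Result> results) {
        Map<Long, Result> best = new LinkedHashMap<>();
        for (Result r : results) {
            // Replace the kept duplicate only if the new one is strictly preferred
            best.merge(r.contentHash(), r,
                    (old, cur) -> PREFERENCE.compare(cur, old) < 0 ? cur : old);
        }
        return new ArrayList<>(best.values());
    }
}
```

Because `LinkedHashMap` keeps the position of a key's first insertion, swapping in a better duplicate does not reorder the list, which is one way to discard "bad duplicates" while keeping the presentation order.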
Viktor Lofgren
2a173e2861
(search) Dark Mode
2024-12-13 21:47:07 +01:00
Viktor Lofgren
a6a900266c
(search) Fix redirects
2024-12-13 02:40:51 +01:00
Viktor Lofgren
bdba53f055
(site) Update domain parameter type from PathParam to QueryParam
2024-12-13 02:15:35 +01:00
Viktor Lofgren
eb2fe18867
(sideload) Add LSH generation for sideloaded StackExchange data
...
Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.
2024-12-13 02:10:52 +01:00
Viktor Lofgren
a7468c8d23
(converter) Ensure paths are created for converter batch writer
2024-12-13 01:35:07 +01:00
Viktor Lofgren
fb2beb1eac
(converter) Fix data-loss bug where the converter writer would remove all but the last batch of processed data
2024-12-13 01:19:30 +01:00
Viktor Lofgren
0fb03e3d62
(export) Add logging to AtagExporter for error handling
2024-12-12 22:54:32 +01:00
Viktor Lofgren
67db3f295e
(index) Revert some optimization changes
2024-12-12 22:14:24 +01:00
Viktor Lofgren
dafaab3ef7
(index) Additional optimization pass
2024-12-12 18:57:33 +01:00
Viktor Lofgren
3f11ca409f
(index) Increase thread limit and optimize search result handling
...
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 17:07:06 +01:00
Viktor Lofgren
694eed79ef
(index) Increase thread limit and optimize search result handling
...
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:32:31 +01:00
Viktor Lofgren
4220169119
(index) Increase thread limit and optimize search result handling
...
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:31:11 +01:00
Viktor Lofgren
bbdde789e7
Merge branch 'master' into serp-redesign
2024-12-11 19:45:17 +01:00
Viktor Lofgren
0a53ac68a0
Add specialization for steam store and GOG
2024-12-11 18:32:45 +01:00
Viktor Lofgren
eab61cd48a
Merge branch 'master' into serp-redesign
2024-12-11 17:09:27 +01:00
Viktor Lofgren
e65d75a0f9
(crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets
2024-12-11 17:01:52 +01:00
Viktor Lofgren
3b99cffb3d
(link-parser) Filter out URLs with binary file suffixes in LinkParser
...
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
2024-12-11 16:42:47 +01:00
Viktor Lofgren
a97c05107e
Add synthetic meta flag for root path documents
...
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D
2024-12-11 16:10:44 +01:00
Viktor Lofgren
5002870d1f
(converter) Refactor sideloaders to improve feature handling and keyword logic
...
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
2024-12-11 16:01:38 +01:00
Viktor Lofgren
73861e613f
(ranking) Downtune score boost for unordered heading matches
2024-12-11 15:44:29 +01:00
Viktor Lofgren
0ce2ba9ad9
(jooby) Fix asset handler
2024-12-11 14:38:04 +01:00
Viktor Lofgren
3ddcebaa36
(search) Give serp/start a more consistent name to siteinfo/start
...
The change also cleans up the layout a bit.
2024-12-11 14:33:57 +01:00
Viktor Lofgren
b91463383e
(jooby) Clean up initialization process
2024-12-11 14:33:18 +01:00
Viktor Lofgren
7444a2f36c
(site-info) Add placeholder when a feed item lacks a title.
2024-12-10 22:46:12 +01:00
Viktor Lofgren
461bc3eb1a
(generator) Add special workaround to flag fextralife as a wiki
2024-12-10 22:22:52 +01:00
Viktor Lofgren
cf7f84f033
(rank) Reduce the impact of domain rank bonus, and only apply it to cancel out negative penalties, never to increase the ranking
2024-12-10 22:04:12 +01:00
Viktor Lofgren
fdee07048d
(search) Remove Spark and migrate to Jooby for the search service
2024-12-10 19:13:13 +01:00
Viktor Lofgren
2fbf201761
(search) Adjust crosstalk flex-basis
2024-12-10 15:12:51 +01:00
Viktor Lofgren
4018e4c434
(search) Add crosstalk to paperdoll
2024-12-10 15:12:39 +01:00
Viktor Lofgren
f3382b5bd8
(search) Completely remove all old hdb templates
...
Create new views for conversion results, dictionary results, and site crosstalk.
2024-12-10 15:04:49 +01:00
Viktor Lofgren
9fc82574f0
(fingerprint) Add FluxGarden as a wiki generator
...
#130
2024-12-10 13:51:42 +01:00
Viktor
589f4dafb9
Merge pull request #129 from MarginaliaSearch/atags-counts
...
(WIP) Improve atag sentence matching
2024-12-10 12:42:34 +00:00
Viktor Lofgren
c5d657ef98
(live-crawler) Flag live crawled documents with a special keyword
2024-12-10 13:42:10 +01:00
Viktor Lofgren
3c2bb566da
(converter) Wipe the converter output path on initialization to avoid lingering stale data.
2024-12-10 13:41:05 +01:00
Viktor Lofgren
9287ee0141
(search) Improve hyphenation logic for titles
2024-12-09 15:29:10 +01:00
Viktor Lofgren
2769c8f869
(search) Remove sticky search bar to aid with performance on firefox (and iOS?)
2024-12-09 15:20:33 +01:00
Viktor Lofgren
ddb66f33ba
(search) Add more feedback when pressing some buttons
2024-12-09 15:07:23 +01:00
Viktor Lofgren
79500b8fbc
(search) Move search bar back up top on mobile, put filter button at the bottom instead.
2024-12-09 14:55:37 +01:00
Viktor Lofgren
187eea43a4
(search) Remove redundant @if
2024-12-09 14:46:02 +01:00
Viktor Lofgren
a89ed6fa9f
(search) Fix rendering on site overview, more dense serp layout on mobile
2024-12-09 14:45:45 +01:00
Viktor Lofgren
e0c0ed27bc
(keyword-extraction) Clean up code and add tests for position and spans calculation
...
This code has been a bit of a mess and historically significantly flaky, so some test coverage is more than overdue.
2024-12-08 14:14:52 +01:00
Viktor Lofgren
20abb91657
(loader) Correct DocumentLoaderService to properly do bulk inserts
...
Fixes issue #128
2024-12-08 13:12:52 +01:00
Viktor Lofgren
291ca8daf1
(converter/index) Improve atag sentence matching by taking into consideration how many times a sentence appears in the links
...
This change breaks the format of the atags.parquet file.
2024-12-08 00:27:11 +01:00
Viktor Lofgren
8d168be138
(search) Typeahead search, etc.
2024-12-07 15:47:01 +01:00
Viktor Lofgren
6e1aa7b391
(search) Make style.css depend on jte file changes
...
Also add a hack to ensure classes generated from java code get included in the stylesheet as intended.
2024-12-07 14:11:22 +01:00
Viktor Lofgren
deab9b9516
(search) Clean up start views for search and site-info
2024-12-07 14:11:22 +01:00
Viktor Lofgren
39d99a906a
(search) Add proper tailwind build and host fontawesome locally
2024-12-07 14:11:22 +01:00
Viktor Lofgren
6f72e6e0d3
(explore) Add lazy loading and alt attributes to images
2024-12-07 14:11:22 +01:00
Viktor Lofgren
d786d79483
(site-info) Add whitespace-nowrap to pubDay span in overview.jte
2024-12-07 14:11:22 +01:00
Viktor Lofgren
01510f6c2e
(serp) Add wayback link to search results
2024-12-07 14:11:22 +01:00
Viktor Lofgren
7ba43e9e3f
(site) Adjust sizing of navbars
2024-12-07 14:11:16 +01:00
Viktor Lofgren
97bfcd1353
(site) Layout changes site-info
2024-12-07 14:11:16 +01:00
Viktor Lofgren
aa3c85c196
(site) Mobile layout fixes
2024-12-07 14:11:16 +01:00
Viktor Lofgren
ee2d5496d0
Revert "(experiment) Modify atags exporter to permit duplicates from different source domains"
...
This reverts commit 5c858a2b94.
2024-12-07 14:01:50 +01:00
Viktor Lofgren
5c858a2b94
(experiment) Modify atags exporter to permit duplicates from different source domains
...
This is an attempt to provide higher resolution term frequency data that will need evaluation when the data is processed.
2024-12-06 14:10:15 +01:00
Viktor Lofgren
fb75a3827d
(site) Adjust coloration of search results
2024-12-05 16:58:00 +01:00
Viktor Lofgren
7d546d0e2a
(site) Make SearchParameters generate relative URLs instead of absolute
2024-12-05 16:47:22 +01:00
Viktor Lofgren
8fcb6ffd7a
(site-info) Increase contrast in search results for forums, wikis
2024-12-05 16:42:16 +01:00
Viktor Lofgren
f97de0c15a
(site-info) Fix layout
2024-12-05 16:33:46 +01:00
Viktor Lofgren
be9e192b78
(site-info) Fix pagination in backlinks and documents views
2024-12-05 16:26:11 +01:00
Viktor Lofgren
75ae1c9526
(site-info) Do not show 'suggest for crawling' when the node affinity is already set to 0
...
This indicates the domain is already slated for crawling.
2024-12-05 16:18:46 +01:00
Viktor Lofgren
33761a0236
(site-info) Make the search box in the site viewer functional
2024-12-05 16:16:29 +01:00
Viktor Lofgren
19b69b1764
(site-info) Only show samples if feed is absent, never both.
2024-12-05 16:05:03 +01:00
Viktor Lofgren
8b804359a9
(serp) Layout fixes for mobile
2024-12-05 15:59:33 +01:00
Viktor Lofgren
f050bf5c4c
(WIP) Initial semi-working transformation to new tailwind UI
...
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
2024-12-05 14:00:17 +01:00
Viktor Lofgren
fdc3efa250
(setup) Remove OpenNLP tokenization model
...
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
2024-11-28 16:03:05 +01:00
Viktor Lofgren
5fdd2c71f8
(setup) Update OpenNLP model URLs to archive.apache.org
...
Changed the URLs for downloading OpenNLP sentence and tokens models from downloads.apache.org to archive.apache.org, as the previous link has died.
2024-11-28 15:58:25 +01:00
Viktor Lofgren
c97c66a41c
(ranking) Reduce the verbatim score multiplier
2024-11-28 13:37:11 +01:00
Viktor Lofgren
7b64377fd6
(ranking) Promote documents with multiple phrase matches with a log-scale bonus
2024-11-28 13:36:56 +01:00
Viktor Lofgren
e11ebf18e5
(span) Correct intersection counting logic, add comprehensive tests
2024-11-28 13:36:25 +01:00
Viktor Lofgren
ba47d72bf4
(ranking) Adjust scores for external link matches
2024-11-27 14:27:23 +01:00
Viktor Lofgren
52bc0272f8
(atag) Add alias domain support and improve domain handling
...
Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.
2024-11-27 14:26:44 +01:00
Viktor Lofgren
d4bce13a03
(export) Add export actors to precession
...
Adding a tracking message to the export actor means it's possible to run them in a precession.
Adding a new precession actor, and some GUI components for triggering exports.
The change also adds a heartbeat to the export process.
2024-11-26 15:07:03 +01:00
Viktor Lofgren
b9842b57e0
(encyclopedia-sideloader) Add test suite and clean up urlencoding logic
2024-11-26 13:34:15 +01:00
Viktor Lofgren
95776e9bee
(encyclopedia) Fix commit gore resulting in bad SQL query
2024-11-26 12:44:49 +01:00
Viktor Lofgren
077d8dcd11
(result-score) Adjust ranking parameters a tiny bit
2024-11-25 18:30:59 +01:00
Viktor Lofgren
9ec41e27c6
(keyword-extractor) Fix bug where external link keywords weren't generating document spans as intended
2024-11-25 18:30:22 +01:00
Viktor Lofgren
200743c84f
(minor) Remove delombok debris
2024-11-25 18:29:21 +01:00
Viktor Lofgren
6d7998e349
(index) Correct behavior of debug function positionValues(), which was misleadingly incorrect
2024-11-25 18:28:53 +01:00
Viktor Lofgren
7d1ef08a0f
(index) Correct ranking bonus for external linktext appearances
2024-11-25 17:40:15 +01:00
Viktor Lofgren
ea6b148df2
(docker) Add restart: always to executor nodes
...
The system will perform a janitor reset on these nodes when the node profile is switched, so it's important they restart automatically.
2024-11-25 15:31:45 +01:00
Viktor Lofgren
3ec9c4c5fa
(export) Filter non-HTML documents in exporters
...
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.
2024-11-25 15:06:42 +01:00
Viktor Lofgren
0b6b5dab07
(index) Add score bonuses for single-word anchor tag spans
...
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
2024-11-25 14:44:41 +01:00
Viktor Lofgren
ff17473105
Fix UTF-8 URL normalization issue in sideloader.
...
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.
Fixes issue #109.
2024-11-25 14:25:47 +01:00
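The commit above names a `normalizeUtf8` method but does not show its body. A minimal sketch of the en-dash replacement it describes, with a hypothetical enclosing class and an assumed signature, could look like this:

```java
// Hypothetical sketch; only the method name comes from the commit message.
final class SideloadUrlNormalizer {
    /** Replaces the en-dash (U+2013) with an ASCII hyphen-minus. */
    static String normalizeUtf8(String urlPath) {
        return urlPath.replace('\u2013', '-');
    }
}
```

Whether the real method handles other characters as well is not stated in the log.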
Viktor Lofgren
dc5f97e737
(index) Add bonus for single-word title matches when the title is also a single word
2024-11-25 13:24:12 +01:00
Viktor Lofgren
d919179ba3
(index) Correct off-by-1 error in DocumentSpan.containsRange
2024-11-25 13:24:03 +01:00
Viktor Lofgren
f09669a5b0
(index) Correct usage of DocumentSpan.length() instead of DocumentSpan.size()
...
The latter counts the number of spans, and is not what you want here.
2024-11-25 13:11:55 +01:00
Viktor Lofgren
b3b0f6fed3
(actor) Add side-load profile to PROC_CONVERTER_SPAWNER.
...
This fell off during the profile split, but is necessary for sideloading.
2024-11-25 12:40:14 +01:00
Viktor Lofgren
88caca60f9
(live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list
2024-11-23 17:07:16 +01:00
Viktor Lofgren
923ebbac81
(feeds) Add logic to handle URI fragments in feed items
...
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
2024-11-23 16:38:56 +01:00
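A rough sketch of a uniqueness-based fragment policy follows. It assumes that "uniqueness" means fragments are kept only when dropping them would make two distinct item URLs identical; the commit does not spell out the exact rule, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the rule implemented here is an assumption.
final class FeedFragmentPolicy {
    static String withoutFragment(String url) {
        int idx = url.indexOf('#');
        return idx < 0 ? url : url.substring(0, idx);
    }

    /** True if stripping fragments would collapse distinct item URLs into one. */
    static boolean shouldKeepFragments(List<String> itemUrls) {
        Set<String> distinct = new HashSet<>(itemUrls);
        Set<String> stripped = new HashSet<>();
        for (String url : distinct) {
            stripped.add(withoutFragment(url));
        }
        return stripped.size() < distinct.size();
    }

    /** Returns the item URLs with fragments removed where that is lossless. */
    static List<String> cleanUrls(List<String> itemUrls) {
        if (shouldKeepFragments(itemUrls)) {
            return List.copyOf(itemUrls);
        }
        List<String> out = new ArrayList<>();
        for (String url : itemUrls) {
            out.add(withoutFragment(url));
        }
        return out;
    }
}
```

Under this rule a feed whose entries are anchors on one page keeps its fragments, while a feed that merely appends something like `#top` to otherwise distinct URLs gets clean links.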
Viktor
df298df852
Merge pull request #125 from MarginaliaSearch/live-search
...
Add near real-time crawling from RSS feeds to supplement the slower batch based crawls
2024-11-22 16:38:37 +00:00
Viktor Lofgren
552b246099
(live-crawl) Improve error handling for errors during robots.txt-retrieval
...
Reduce log-spam and don't treat errors other than 404 as "all is permitted".
2024-11-22 14:15:32 +01:00
Viktor Lofgren
80e6d0069c
(live-crawl-actor) Clear index journal before starting live crawl
...
This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
2024-11-22 14:04:57 +01:00
Viktor Lofgren
b941604135
(live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with.
2024-11-22 13:58:57 +01:00
Viktor Lofgren
52eb5bc84f
(live-crawler) Keep track of bad URLs
...
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are on a dice roll added to a bad URLs table, that prevents further attempts at fetching them.
2024-11-22 00:55:46 +01:00
Viktor Lofgren
4d23fe6261
(feeds) Simplify RSS User-Agent header
...
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.
2024-11-21 16:43:56 +01:00
Viktor Lofgren
14519294d2
Merge branch 'master' into live-search
2024-11-21 16:00:20 +01:00
Viktor Lofgren
51e46ad2b0
(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents
...
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.
While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them. Some effort was also put toward making the boot sequence of the processes a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.
2024-11-21 16:00:09 +01:00
Viktor Lofgren
665c8831a3
(model) Fix resource leak in partially read crawl data streams.
...
Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.
2024-11-20 19:29:13 +01:00
Viktor Lofgren
47dfbacb00
(conf) Introduce a new concept of node profiles
...
Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
2024-11-20 18:15:22 +01:00
Viktor Lofgren
f94911541a
(live-crawl) Reduce the risk of id collisions with the main indexes
...
This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
2024-11-20 16:01:10 +01:00
Viktor Lofgren
89d8af640d
(live-crawl) Rename the live crawler code module to be more consistent with the other processes
2024-11-20 15:55:15 +01:00
Viktor Lofgren
6e4252cf4c
(live-crawl) Make the actor poll for feeds changes instead of being a one-shot thing.
...
Also changes the live crawl process to store the live crawl data in a fixed directory in the storage base rather than versioned directories.
2024-11-20 15:36:25 +01:00
Viktor Lofgren
79ce4de2ab
(model) Remove deprecated fields from CrawledDocument and CrawledDomain
2024-11-20 15:27:05 +01:00
Viktor Lofgren
d6575dfee4
(live-crawler) Crude first-try process for live crawling #WIP
...
Some refactoring is still needed, but a dummy actor is in place, as is a process that crawls URLs from the livecapture service's RSS endpoints and makes it all the way to being indexable.
2024-11-19 21:00:18 +01:00
Viktor Lofgren
a91ab4c203
(live-crawler) Crude first-try process for live crawling #WIP
...
Some refactoring is still needed, but a dummy actor is in place, as is a process that crawls URLs from the livecapture service's RSS endpoints and makes it all the way to being indexable.
2024-11-19 19:35:01 +01:00
Viktor Lofgren
6a3079a167
(search) Fix missing getter for proto
2024-11-18 21:05:22 +01:00
Viktor Lofgren
c728a1e2f2
(rss) Add endpoint for extracting URLs changed within a timespan.
2024-11-18 14:59:32 +01:00
Viktor Lofgren
d874d76a09
(rss) Add an endpoint that can be used for identifying when RSS data has changed
2024-11-18 14:22:17 +01:00
Viktor Lofgren
70bc8831f5
(test) Fix excludeTags
2024-11-17 20:07:49 +01:00
Viktor Lofgren
41c11be075
(status) Clean up the status page a bit
2024-11-17 20:00:44 +01:00
Viktor Lofgren
163ce19846
(test) Tag status service endpoint tests as flaky
...
These tests have outside dependencies that inherently make them unreliable and unsuitable for CI.
2024-11-17 19:48:01 +01:00
Viktor Lofgren
9eb16cb667
(test) Remove tests from fast suite
...
Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI.
Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
2024-11-17 19:45:59 +01:00
Viktor Lofgren
af40fa327b
(status-service) Correct measurement pruning to use correct sqlite datetimes, so as not to delete the database
2024-11-17 18:35:34 +01:00
Viktor Lofgren
cf6d28e71e
(status-service) Enable auto-commit
2024-11-17 18:25:15 +01:00
Viktor Lofgren
3791ea1e18
(service) Add a new application service for external liveness monitoring
...
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
2024-11-17 18:01:08 +01:00
Viktor
34258b92d1
Merge pull request #124 from MarginaliaSearch/jdk-23+delombok
...
Friendship with lombok over, now JDK 23 is my best friend
2024-11-16 14:00:49 +00:00
Viktor Lofgren
e5db3f11e1
(chore) Clean up some of the uglier delomboking artifacts
2024-11-15 13:57:20 +01:00
Viktor Lofgren
9f47ce8d15
(chore) Remove lombok
...
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23
(chore) Remove use of deprecated STR.-style string templates
2024-11-11 18:02:28 +01:00
Viktor Lofgren
8b8bf0748f
(feature-extraction) Add new DocumentHeaders class encapsulating Html headers.
...
Also adds a few new html features for CDNs and S3 hosting for use in ranking and query refinement.
2024-11-11 13:26:15 +01:00
Viktor
5cc71ae586
Merge pull request #123 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2024-11-10 18:57:49 +01:00
Viktor
33fcfe4b63
Update ROADMAP.md
2024-11-10 18:57:15 +01:00
Viktor
a31a3b53c4
Merge pull request #122 from MarginaliaSearch/fetch-rss-feeds
...
Automatic RSS feed polling
2024-11-10 18:35:28 +01:00
Viktor Lofgren
a456ec9599
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 18:30:28 +01:00
Viktor Lofgren
a2bc9a98c0
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 17:45:20 +01:00
Viktor Lofgren
e24a98390c
(feed) Update API to allow specifying clean vs refresh update
...
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
2024-11-09 18:43:47 +01:00
Viktor Lofgren
6f858cd627
(feed) Decrease update interval to 24 hours
2024-11-09 18:17:51 +01:00
Viktor Lofgren
a293266ccd
(feed) Wipe the feeds db and start over from system URLs periodically.
2024-11-09 18:17:16 +01:00
Viktor Lofgren
b8e0dc93d7
(search) Correctly show the feeds view when items are present
...
... otherwise show samples. This commit also removes the (Experimental) bit, as this is getting fairly mature.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
d774c39031
(feeds) Reduce log spam
2024-11-09 17:56:43 +01:00
Viktor Lofgren
ab17af99da
(feeds) Refresh the feed db using the previous db, when it is available.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
b0ac3c586f
(feeds) Correct parallelism using SimpleBlockingThreadPool
2024-11-09 17:56:43 +01:00
Viktor Lofgren
139fa85b18
(feeds) Add working heartbeat tracking progress
2024-11-09 17:56:43 +01:00
Viktor Lofgren
bfeb9a4538
(feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service
2024-11-09 17:56:43 +01:00
Viktor
3d6c79ae5f
Merge pull request #121 from MarginaliaSearch/headless-setup
...
Headless deterministic setup
2024-11-08 13:50:54 +01:00
Viktor Lofgren
c9e9f73ea9
(setup) Break out installation action into non-interactive script
2024-11-08 13:38:40 +01:00
Viktor Lofgren
80e482b155
(setup) Add progress bar to downloads for better feedback
2024-11-08 13:38:40 +01:00
Viktor Lofgren
9351593495
(setup) Use huggingface for versioned hosting of language models
2024-11-08 13:38:40 +01:00
Viktor Lofgren
d74436f546
(setup) Use checksums for rdrpostagger and opennlp files
...
Also use versioned URLs for rdrpostagger
2024-11-08 13:38:40 +01:00
Viktor Lofgren
76e9053dd0
(setup) Move some file-downloads from setup script to the first boot of the control node of the system
...
We can only do this for files that are not required for unit tests.
As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions. The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
2024-11-06 15:28:20 +01:00
Viktor Lofgren
dbb8bcdd8e
(crawler) Use a better hashInt implementation in CrawlDataReference
...
Guava's hash functions are slow as hell.
2024-10-15 18:25:55 +02:00
Viktor Lofgren
7305afa0f8
(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris
2024-10-15 17:27:59 +02:00
Viktor Lofgren
481f999b70
(crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
...
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
Viktor Lofgren
4b16022556
(crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains
2024-10-15 14:21:59 +02:00
Viktor Lofgren
89dd201a7b
(link-parser) Make mailing list blocking optional
2024-10-15 13:48:32 +02:00
Viktor Lofgren
ab486323f2
(converter) Increase the number of links the converter will pick up per document
2024-10-15 13:46:19 +02:00
Viktor Lofgren
6460c11107
(index) Short-circuit rankResults when there are no results
2024-10-14 13:47:35 +02:00
Viktor Lofgren
89f7f3c17c
(query-parser) Fix regression where advice terms weren't parsed properly
2024-10-14 13:46:37 +02:00
Viktor Lofgren
fe800b3af7
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:57:27 +02:00
Viktor Lofgren
01a16ff388
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:55:59 +02:00
Viktor Lofgren
eb60ddb729
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:49:39 +02:00
Viktor Lofgren
db5faeceee
(download-sample) Break apart actor for better error recovery
...
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:39:43 +02:00
Viktor Lofgren
45d3e6aa71
(download-sample) Break apart actor for better error recovery
...
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:19:09 +02:00
Viktor Lofgren
d84a2c183f
(*) Remove the crawl spec abstraction
...
The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled.
A while back, a new domains listing view was added to the control view, allowing direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs.
This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae
(crawler, EXPERIMENT) Disable content type probing and use Accept header instead
...
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
2024-09-30 14:53:01 +02:00
Viktor Lofgren
90a2d4ae38
(index) Fix partial buffer writing in PrioDocIdsTransformer
...
Ensure all data is written to writeChannel by looping until the buffer is fully drained. This prevents potential data loss during the close operation and maintains data integrity.
2024-09-29 17:53:40 +02:00
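The drain loop described in the commit above can be sketched roughly as follows. This is a minimal illustration and not the actual PrioDocIdsTransformer code: the class ChannelDrain, the method writeFully, and the trickle helper (a channel that accepts at most 3 bytes per call, to simulate partial writes) are hypothetical names introduced here. It relies only on the documented behaviour that WritableByteChannel.write may write fewer bytes than remain in the buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

final class ChannelDrain {
    /** Writes everything remaining in the buffer, looping because write() may be partial. */
    static void writeFully(WritableByteChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }

    /** Hypothetical test channel that accepts at most 3 bytes per call. */
    static WritableByteChannel trickle(ByteArrayOutputStream sink) {
        return new WritableByteChannel() {
            @Override public int write(ByteBuffer src) {
                int n = Math.min(3, src.remaining());
                for (int i = 0; i < n; i++) sink.write(src.get());
                return n;
            }
            @Override public boolean isOpen() { return true; }
            @Override public void close() { }
        };
    }

    /** Pushes 10 bytes through the trickling channel and returns how many arrived. */
    static int demo() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            writeFully(trickle(sink), ByteBuffer.wrap(new byte[10]));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }
}
```

A single channel.write(buffer) call against the trickling channel would deliver only the first 3 bytes, which is the kind of silent truncation the commit describes.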
Viktor Lofgren
2b8ab97ec1
(bit-writer) Do not clear buffer when creating a bit writer
2024-09-29 17:52:43 +02:00
Viktor Lofgren
43ca9c8a12
(sequence) Return Integer.MAX_VALUE for empty position lists.
...
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and addresses edge cases where an empty list is encountered.
2024-09-29 17:21:17 +02:00
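To illustrate why Integer.MAX_VALUE is a safer sentinel than 0 for an empty list, here is a simplified two-list sketch. It is an assumption-laden stand-in, not the project's sequence code: MinDistSketch and its minDistance method are hypothetical, and the real method may accept more than two lists. Returning 0 for an empty list would look like a perfect match to any caller that treats small distances as good.

```java
final class MinDistSketch {
    /**
     * Smallest absolute difference between any position in a and any position in b.
     * Both arrays are assumed to be sorted ascending.
     */
    static int minDistance(int[] a, int[] b) {
        // An empty list has no positions, so there is no meaningful distance;
        // report "infinitely far apart" rather than 0.
        if (a.length == 0 || b.length == 0) {
            return Integer.MAX_VALUE;
        }
        int i = 0, j = 0, best = Integer.MAX_VALUE;
        while (i < a.length && j < b.length) {
            best = Math.min(best, Math.abs(a[i] - b[j]));
            // Advance whichever list currently points at the smaller position.
            if (a[i] < b[j]) i++; else j++;
        }
        return best;
    }
}
```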
Viktor Lofgren
69d99c91dd
(index) Optimize buffer handling in PrioDocIdsTransformer
2024-09-29 17:20:49 +02:00
Viktor Lofgren
a8cc98a0f6
(index) Fix write offset calculation in PrioDocIdsTransformer
...
Adjust the write offset calculation by adding the position of the write buffer. Updated the test to validate the transformation process and ensure correctness of output file positions.
2024-09-29 17:20:29 +02:00
Viktor Lofgren
2ee58f4bc9
(index) Adjust ranking parameters to dial down the importance of tcfProximity and firstPosition
2024-09-29 15:33:12 +02:00
Viktor Lofgren
938431e514
(scrape-feeds-actor) Add deduplication of insertion data
...
To avoid unnecessary db churn, the domains to be added are put in a set instead of a list, ensuring that they are unique.
2024-09-28 14:41:14 +02:00
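The list-to-set change above amounts to something like the following sketch. DomainDedup and uniqueDomains are hypothetical names, and the choice of LinkedHashSet (to keep first-seen order) is an assumption; the commit only says that a set is used.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DomainDedup {
    /** Collapses repeated domain names so each is inserted at most once. */
    static Set<String> uniqueDomains(List<String> scraped) {
        // LinkedHashSet drops duplicates while preserving first-seen order.
        return new LinkedHashSet<>(scraped);
    }
}
```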
Viktor Lofgren
b2de3c70fa
(scrape-feeds-actor) Add explicit commit in case it's disabled
2024-09-28 14:36:57 +02:00
Viktor Lofgren
542690d9f6
(search-service) Hide pagination when there is only 1 page of results
2024-09-28 13:48:09 +02:00
Viktor Lofgren
596a7fb4ea
(actor) Disable the feed scraper on all nodes but the first
2024-09-28 12:36:16 +02:00
Viktor Lofgren
c3f726a01f
(actor) Add a feed scraping actor
...
Add a new actor that polls a URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job.
The URLs are specified in data/scrape-urls.txt. If this file is absent, the actor shuts down.
2024-09-28 12:33:29 +02:00
Viktor Lofgren
4538ade156
(live-capture) Add readme to live-capture function
2024-09-28 11:35:46 +02:00
Viktor Lofgren
f4709d8f32
(live-capture) Handle case when screenshot bytes are empty.
...
Add logic to flag the domain as fetched when the pngBytes array is empty. This ensures we won't try to re-fetch this domain again for a while.
2024-09-27 15:53:17 +02:00
Viktor Lofgren
3dda8c228c
(live-capture) Handle failed screenshot fetch in BrowserlessClient
...
Return an empty byte array when screenshot fetch fails, ensuring downstream processes are not impacted by null responses. Additionally, only attempt to upload the screenshot if the byte array is non-empty, preventing invalid data from being stored.
2024-09-27 14:52:05 +02:00
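The "empty array instead of null" convention described above could look roughly like this. Everything named here (ScreenshotFlow, fetchOrEmpty, uploadIfPresent, and the Supplier/Consumer stand-ins for the fetcher and uploader) is hypothetical and is not the BrowserlessClient API.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

final class ScreenshotFlow {
    private static final byte[] EMPTY = new byte[0];

    /** Returns the fetched bytes, or an empty array if the fetch fails or yields null. */
    static byte[] fetchOrEmpty(Supplier<byte[]> fetcher) {
        try {
            byte[] png = fetcher.get();
            return png != null ? png : EMPTY;
        } catch (RuntimeException e) {
            return EMPTY;
        }
    }

    /** Uploads only non-empty screenshots; returns whether an upload happened. */
    static boolean uploadIfPresent(byte[] png, Consumer<byte[]> uploader) {
        if (png.length == 0) {
            return false;
        }
        uploader.accept(png);
        return true;
    }
}
```

Callers never need a null check, and the empty array doubles as the "nothing to store" signal.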
Viktor Lofgren
ccf6b7caf3
(assistant) Refactor scheduling of tasks within SimilarDomainsService
...
Changed the scheduling function to use a single schedule call instead of a fixed delay for the init task. The updateScreenshotInfo method was also moved and slightly refactored for clearer readability and consistency.
2024-09-27 14:43:19 +02:00
Viktor Lofgren
fed33ed64a
(search-service) Update screenshot request handling
...
Always request the main site screenshot to ensure staleness checks and necessary updates. Limit additional screenshot requests for similar and linking domains to a maximum of 5 per view, to avoid overloading the service.
2024-09-27 14:27:25 +02:00
Viktor Lofgren
ca27d95ce1
(assistant) Add bounds checks for domain idx
2024-09-27 14:24:04 +02:00
Viktor Lofgren
3566fe296a
(assistant) Add scheduled update job for screenshot information
2024-09-27 14:16:28 +02:00
Viktor Lofgren
c91435e314
(assistant) Don't attempt to respond to similarity and linkedness queries before the data is ready
...
This will reduce the number of exceptions in the assistant logs quite significantly.
2024-09-27 14:08:08 +02:00
Viktor Lofgren
31f30069a4
(live-capture) Dial down logging a bit
2024-09-27 14:00:55 +02:00
Viktor
e5726a75d2
Merge pull request #120 from MarginaliaSearch/live-capture-function
...
Add a new function 'Live Capture' for on-demand screenshot capture
2024-09-27 13:48:53 +02:00
Viktor Lofgren
c757d116bf
(misc) Fix Broken Tests
2024-09-27 13:46:34 +02:00
Viktor Lofgren
23cce0c78a
Add a new function 'Live Capture' for on-demand screenshot capture
...
The screenshots are requested by the site-service, and triggered via the site-info view.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
1bd29a586c
(service-discovery) Add common base interface to all Grpc services
...
To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService extends BindableService was added.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
4565bfe359
(crawler) Make the crawler report crawling progress correctly when stopped and resumed.
2024-09-26 18:30:29 +02:00
Viktor Lofgren
336d6fdd14
(index-client) Fix error when zero results are found
2024-09-25 20:23:13 +02:00
Viktor Lofgren
95cde242ca
(assistant) Fix NPE when IP information is absent
2024-09-25 20:19:17 +02:00
Viktor
9224176202
Merge pull request #119 from MarginaliaSearch/result-pagination
...
Add pagination support for the search results
2024-09-25 14:29:24 +02:00
Viktor Lofgren
0d2390fd13
(search-service) Only autofocus on the query when the query is empty
2024-09-25 14:27:03 +02:00
Viktor Lofgren
4a0356e26f
(search-service) Add pagination support to the search GUI
2024-09-25 14:26:49 +02:00
Viktor Lofgren
73f973cc06
(search-query) Add pagination to search query API and the direct query-service interface
2024-09-25 14:20:59 +02:00
Viktor Lofgren
e9e8580913
(converter) Fix NPE bugs in converter due to the reintroduction of CrawledDocument.headers
2024-09-25 12:18:56 +02:00
Viktor Lofgren
8b85a58fea
(search UX) Autofocus on the search form
2024-09-24 15:56:03 +02:00
Viktor Lofgren
40512511af
(crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl
...
This code is still a bit too complex, but it's slowly getting better.
2024-09-24 15:08:22 +02:00
Viktor
10d8fc4fe7
Update ROADMAP.md
2024-09-24 14:57:30 +02:00
Viktor
9899d45ea8
Merge pull request #118 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2024-09-24 14:13:47 +02:00
Viktor
3eea471ca6
Update ROADMAP.md
2024-09-24 14:13:32 +02:00
Viktor Lofgren
3dec4b6b34
(index) Fix bug where tcfFirstPosition lit up because one term was in the title and the other was missing from the document
...
This was because firstPosition calculation was not invalidated when positions were missing.
2024-09-24 13:33:37 +02:00
Viktor Lofgren
162fc25ebc
(minor) Fix accidental commit errors
2024-09-23 18:03:09 +02:00
Viktor Lofgren
e9854f194c
(crawler) Refactor
...
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong
2024-09-23 17:51:07 +02:00
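For context on the retry-after fix: the numeric form of the HTTP Retry-After header is expressed in seconds, so treating it as milliseconds shrinks every delay by a factor of 1000. A minimal sketch of unit-correct parsing follows. The class RetryAfter, its parse method, and the fallback and maximum values are assumptions for illustration, not the crawler's actual code, and the HTTP-date form of the header is deliberately not handled.

```java
import java.time.Duration;

final class RetryAfter {
    /**
     * Parses the delay-seconds form of Retry-After.
     * Falls back when the header is missing, negative, or not a plain integer
     * (for example the HTTP-date form, which this sketch does not handle).
     */
    static Duration parse(String headerValue, Duration fallback, Duration max) {
        if (headerValue == null) {
            return fallback;
        }
        try {
            long seconds = Long.parseLong(headerValue.trim());
            if (seconds < 0) {
                return fallback;
            }
            Duration delay = Duration.ofSeconds(seconds); // seconds, not milliseconds
            return delay.compareTo(max) > 0 ? max : delay;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

With the old interpretation, a header of "2" would have been read as 2 ms and then raised to the 500 ms floor, well short of the 2 seconds the server asked for.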
Viktor Lofgren
9c292a4f62
(doc) Fix outdated links in documentation
2024-09-22 13:56:17 +02:00
Viktor Lofgren
edb42836da
(vcs) Fix shared state issues with VarintCodedSequence's iterators.
...
Also cleans up the code a bit.
2024-09-21 16:09:15 +02:00
Viktor Lofgren
1ff88ff0bc
(vcs) Stopgap fix for quoted queries with the same term appearing multiple times
...
There are reentrance issues with VarintCodedSequence; this hides the symptom, but the underlying issues still need to be corrected properly.
2024-09-21 14:07:59 +02:00
Viktor Lofgren
28e7c8e5e0
Increase temporal bias weight to give the recent results filter a bit more recency
2024-09-17 18:11:40 +02:00
Viktor
463b3ed0ce
Merge pull request #99 from MarginaliaSearch/term-positions
...
Improve term positions accuracy
2024-09-17 15:30:04 +02:00
Viktor Lofgren
8e78286068
Merge branch 'master' into term-positions
2024-09-17 15:20:46 +02:00
Viktor Lofgren
f4eeef145e
(index) Reduce fetch size to improve timeout characteristics
2024-09-17 15:20:41 +02:00
Viktor Lofgren
87aa869338
(index) Correct positions mask to take into account offsets when overlapping
2024-09-17 14:40:37 +02:00
Viktor Lofgren
60ad4786bc
(index) Use MemorySegment.copy for LongArray->LongArray transfers
2024-09-17 13:56:31 +02:00
Viktor Lofgren
a74df7f905
(index) Increase buffer size for PrioDocIdsTransformer
2024-09-17 13:52:52 +02:00
Viktor Lofgren
9f9c6736ab
(index) Use MemorySegment.copy for LongArray->LongArray transfers
2024-09-17 13:49:02 +02:00
Viktor Lofgren
b95646625f
(index) Correct prio index construction with mmap
...
Accidentally snuck in behavior from full index
2024-09-17 13:39:08 +02:00
Viktor Lofgren
6e47eae903
(index) Correct strange close handling of PositionsFileConstructor
2024-09-13 16:34:14 +02:00
Viktor Lofgren
934af0dd4b
(index) Correct units in log message when shrinking the documents file
2024-09-13 16:33:19 +02:00
Viktor Lofgren
a8bec13ed9
(index) Evaluate using mmap reads during index construction in favor of filechannel reads
...
It's likely that this will be faster, as the reads are on average small and sequential, and can't be buffered easily.
2024-09-13 16:14:56 +02:00
Viktor Lofgren
1cf62f5850
(doc) Correct dead links and stale information in the docs
2024-09-13 11:02:13 +02:00
Viktor Lofgren
8047e77757
(doc) Correct dead links and stale information in the docs
2024-09-13 11:01:05 +02:00
Viktor Lofgren
2a92de29ce
(loader) Fix it so that the loader doesn't explode if it sees an invalid URL
2024-09-12 11:36:00 +02:00
Viktor Lofgren
99523ca079
(query-parser) Remove test that is no longer relevant
2024-09-10 10:35:56 +02:00
Viktor Lofgren
35f49bbb60
(coded-sequence) Add equals and hashCode to VCS
2024-09-10 10:33:56 +02:00
Viktor Lofgren
50ec922c2b
(index) Fix broken index tests
...
Also cleaned up the tests to be less fragile to ranking algorithm changes.
2024-09-10 10:23:46 +02:00
Viktor Lofgren
cfbbeaa26e
(ranking) Clean up ranking test code
2024-09-08 15:46:51 +02:00
Viktor Lofgren
a3b0189934
Fix build errors after merge
2024-09-08 10:22:32 +02:00
Viktor Lofgren
8f367d96f8
Merge branch 'master' into term-positions
...
# Conflicts:
# code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java
# code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
# code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
# code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java
# code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java
# code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java
# code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java
# code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java
2024-09-08 10:14:43 +02:00
Viktor Lofgren
f78ef36cd4
(slop) Upgrade to 0.0.8, add encodings to string columns.
2024-09-04 15:19:00 +02:00
Viktor Lofgren
dc67c81f99
(summary) Fix a few cases where noscript tags would sometimes be used for document summary
2024-09-04 15:00:40 +02:00
Viktor Lofgren
50ba8fd099
(query-parsing) Correct handling of trailing parentheses
2024-09-03 11:45:14 +02:00
Viktor Lofgren
99b3b00b68
(query-parsing) Merge QueryTokenizer into QueryParser and add escaping of query grammar
2024-09-03 11:35:32 +02:00
Viktor Lofgren
f6d981761d
(query-parsing) Drop search term elements that aren't indexed by the search engine
2024-09-03 11:24:05 +02:00
Viktor Lofgren
8290c19e24
(query-parsing) Drop search term elements that aren't indexed by the search engine
2024-09-03 11:21:01 +02:00
Viktor Lofgren
7a69dff6cf
(search) Correct handling of languages on fandom
2024-09-01 13:46:01 +02:00
Viktor Lofgren
bfb7ed2c99
(search) Translate cursed medium URLs to scribe.rip links via the search application
2024-09-01 13:32:14 +02:00
Viktor Lofgren
e19dc9b13e
(search) Translate cursed fandom URLs to breezewiki links via the search application
2024-09-01 13:23:35 +02:00
Viktor Lofgren
74148c790e
(crawler) Pull additional new domains from node-affinity 0
...
Previously a bit ambiguously defined, node affinity 0 is now indicative that a domain is up for grabs for the next crawler
2024-09-01 13:00:36 +02:00
Viktor Lofgren
3d77456110
(*) Add domain parking service to ip blocklist
2024-09-01 12:53:22 +02:00
Viktor Lofgren
ab6a4b1749
(control) Correct id value for domain addition tool
2024-09-01 12:25:15 +02:00
Viktor Lofgren
aeeb1d0cb7
(control) Add utility for adding domains from an external URL
2024-09-01 12:14:21 +02:00
Viktor Lofgren
185b79f2a5
(converter) Fix bug where sideloaded reddit content was erroneously categorized as wiki-generated.
2024-09-01 11:30:25 +02:00
Viktor Lofgren
8d0f9652c7
(crawler) Correct RSS-sitemap behavior
2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6
(crawler) Correct RSS-sitemap behavior
2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650
(crawler) Grab favicons as part of root sniff
2024-08-31 11:32:56 +02:00
Viktor Lofgren
b1bfe6f76e
(control) New view for domains
...
Add capability to assign domains, and bulk-add new domains.
2024-08-30 17:06:48 +02:00
Viktor Lofgren
74e25370ca
(control) New view for domains
...
Still a work in progress, but at this point it's possible to use for viewing domains
2024-08-29 15:40:40 +02:00
Viktor Lofgren
bb5d946c26
(index, EXPERIMENTAL) Clean up ranking code
2024-08-29 11:34:23 +02:00
Viktor Lofgren
abab5bdc8a
(index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data
2024-08-26 14:20:39 +02:00
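As background for the Varint experiment, one common way to varint-code an ascending list of positions is to store the gaps between consecutive values, 7 bits per byte with the high bit as a continuation flag. The sketch below shows that general technique only; VarintPositions is a hypothetical class, and the byte layout is an assumption that may differ from the project's VarintCodedSequence.

```java
import java.io.ByteArrayOutputStream;

final class VarintPositions {
    /** Encodes non-negative, ascending positions as varint-coded deltas. */
    static byte[] encode(int[] ascendingPositions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int pos : ascendingPositions) {
            int delta = pos - prev;
            prev = pos;
            // Emit 7 bits at a time, low bits first; 0x80 marks "more bytes follow".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    /** Decodes count positions from data produced by encode(). */
    static int[] decode(byte[] data, int count) {
        int[] result = new int[count];
        int prev = 0;
        int i = 0;
        for (int n = 0; n < count; n++) {
            int delta = 0;
            int shift = 0;
            int b;
            do {
                b = data[i++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta; // undo the delta coding
            result[n] = prev;
        }
        return result;
    }
}
```

Small gaps cost one byte each, so densely packed positions compress well, and decoding is a simple byte-at-a-time loop.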
Viktor Lofgren
30bf845c81
(index) Speed up minDist calculations by excluding large lists
2024-08-26 13:04:15 +02:00
Viktor Lofgren
77efce0673
(paper-doll) Fix compilation
2024-08-26 12:51:29 +02:00
Viktor Lofgren
67a98fb0b0
(coded-sequence) Handle weird legacy HTML that puts everything in a heading
2024-08-26 12:49:15 +02:00
Viktor Lofgren
7d471ec30d
(coded-sequence) Evaluate new minDist implementation
2024-08-26 12:45:11 +02:00
Viktor Lofgren
f3182a9264
(coded-sequence) Evaluate new minDist implementation
2024-08-26 12:02:37 +02:00
Viktor Lofgren
805cb5ad58
(coded-sequence) Correct behavior of findIntersections
2024-08-25 14:54:17 +02:00
Viktor Lofgren
fdf05cedae
(index) Optimize DocumentSpan.countIntersections
2024-08-25 14:12:30 +02:00
Viktor Lofgren
9c5f463775
(index) Optimize DocumentSpan.countIntersections
2024-08-25 13:59:11 +02:00
Viktor Lofgren
893fae6d59
(index) Optimize DocumentSpan.countIntersections
2024-08-25 13:51:43 +02:00
Viktor Lofgren
5660f291af
(index) Optimize DocumentSpan.countIntersections
2024-08-25 13:43:29 +02:00
Viktor Lofgren
efd56efc63
(index) Optimize SequenceOperations.minDistance
2024-08-25 13:28:06 +02:00
Viktor Lofgren
d94373f4b1
(index) Optimize calculatePositionsMask
2024-08-25 13:24:37 +02:00
Viktor Lofgren
0d01a48260
(index) Optimize SequenceOperations
2024-08-25 13:19:37 +02:00
Viktor Lofgren
00ab2684fa
(index) Optimize SequenceOperations
2024-08-25 13:17:38 +02:00
Viktor Lofgren
a5585110a6
(index) Optimize SequenceOperations
2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e
(index) Optimize DocumentSpan
2024-08-25 12:44:33 +02:00
Viktor Lofgren
982b03382b
(index) Optimize DocumentSpan
2024-08-25 12:31:15 +02:00
Viktor Lofgren
24b805472a
(index) Evaluate performance implication of decoding gcs early
2024-08-25 12:23:09 +02:00
Viktor Lofgren
6ce029b317
(index) Remove vestigial parameter
2024-08-25 12:14:12 +02:00
Viktor Lofgren
63e5b0ab18
(index) Correct weightedCounts calculations
2024-08-25 12:06:56 +02:00
Viktor Lofgren
6dda2c2d83
(coded-sequence) Reduce allocations in GCS.values()
2024-08-25 12:06:31 +02:00
Viktor Lofgren
3fb3c0b92e
(index) Optimize ranking calculations
2024-08-25 11:56:11 +02:00
Viktor Lofgren
aa2c960b74
(index) Optimize ranking calculations
2024-08-25 11:53:44 +02:00
Viktor Lofgren
4fbcc02f96
(index) Adjust sensible defaults for ranking parameters
2024-08-25 11:24:16 +02:00
Viktor Lofgren
9aa8f13731
(index) Remove tcfAvgDist ranking parameter
...
This is captured by tcfProximity already
2024-08-25 11:20:19 +02:00
Viktor Lofgren
65bee366dc
(index) Try harmonic mean for avgMinDist
2024-08-25 11:11:52 +02:00
Viktor Lofgren
53700e6667
(index) Try harmonic mean for avgMinDist
2024-08-25 11:08:41 +02:00
Viktor Lofgren
7f498e10b7
(index) Adjust proximity score
2024-08-25 11:01:35 +02:00
Viktor Lofgren
6eb0f13411
(index) Adjust handling of full phrase matches to prioritize full query matches over large partial matches
2024-08-25 10:54:04 +02:00
Viktor Lofgren
773377fe84
(index) Correct handling of full phrase match group
2024-08-25 10:48:34 +02:00
Viktor Lofgren
4372c8c835
(index) Give ranking components more consistent names
2024-08-25 10:44:27 +02:00
Viktor Lofgren
099133bdbc
(index) Fix verbatim match score after moving full phrase group to a separate entity
2024-08-25 10:43:35 +02:00
Viktor Lofgren
b09e2dbeb7
(build) Fix dependency churn from testcontainers
...
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
96bcf03ad5
(index) Address broken tests
...
They are still broken, but less so.
2024-08-25 10:34:36 +02:00
Viktor Lofgren
0999f07320
(search-query) Add new ranking parameters for proximity and verbatim matches
2024-08-25 10:34:12 +02:00
Viktor Lofgren
5d2b455572
(search) Clean up inconsistent usage of MathClient in SearchOperator
...
Also clean up SearchOperator and adjacent code
2024-08-24 10:39:31 +02:00
Viktor Lofgren
ea75ddc0e0
(search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator
2024-08-22 11:50:52 +02:00
Viktor Lofgren
2db0e446cb
(search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator
2024-08-22 11:49:29 +02:00
Viktor Lofgren
557bdaa694
(search) Clean up SearchQueryIndexService and surrounding code
2024-08-22 11:45:28 +02:00
Viktor Lofgren
9eb1f120fc
(index) Repair positions bitmask for search result presentation
2024-08-22 11:28:23 +02:00
Viktor Lofgren
266d6e4bea
(slop) Replace SlopPageRef<T> with SlopTable.Ref<T>
2024-08-21 10:13:49 +02:00
Viktor Lofgren
e4c97a91d8
(*) Comment clarity
2024-08-21 10:12:00 +02:00
Viktor Lofgren
b0a874a842
(*) Upgrade slop library -> 0.0.5
2024-08-18 11:05:27 +02:00
Viktor Lofgren
bca40de107
(*) Upgrade slop library
2024-08-18 10:43:41 +02:00
Viktor Lofgren
93652e0937
(qdebug) Accurately display positions when intersecting with spans
2024-08-15 11:55:48 +02:00
Viktor Lofgren
0a383a712d
(qdebug) Accurately display positions when intersecting with spans
2024-08-15 11:44:17 +02:00
Viktor Lofgren
03d5dec24c
(*) Refactor termCoherences and rename them to phrase constraints.
2024-08-15 11:02:19 +02:00
Viktor Lofgren
b2a3cac351
(*) Remove broken imports
2024-08-15 11:01:34 +02:00
Viktor Lofgren
a18edad04c
(index) Remove stopword list from converter
...
We want to index all words in the document; stopword handling is moved to the index, where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
2024-08-15 09:36:50 +02:00
Viktor Lofgren
92522e8d97
(index) Attenuate bm25 score based on query length
2024-08-15 08:41:38 +02:00
Viktor Lofgren
049d94ce31
(index) Add body position match to qdebug fields
2024-08-15 08:39:37 +02:00
Viktor Lofgren
dbc6a95276
(index) Consume the new 'body' span in index to make it used in ranking
2024-08-15 08:33:43 +02:00
Viktor Lofgren
75b0888032
(slop) Migrate to latest Slop version
2024-08-14 11:44:35 +02:00
Viktor Lofgren
2ad93ad41a
(*) Clean up
2024-08-14 11:43:45 +02:00
Viktor Lofgren
623ee5570f
(slop) Break slop out into its own repository
2024-08-13 09:50:05 +02:00
Viktor Lofgren
fd2bad39f3
(keyword-extraction) Add body field for terms that are not otherwise part of a field
2024-08-13 09:49:26 +02:00
Viktor Lofgren
e6c8a6febe
(index) Add index-side deduplication in selectBestResults
2024-08-10 10:51:59 +02:00
Viktor Lofgren
4ece5f847b
(index) Add more qdebug factors
2024-08-10 10:45:30 +02:00
Viktor Lofgren
e4f04af044
(index) Give BODY matches a verbatim match value
2024-08-10 10:22:19 +02:00
Viktor Lofgren
b730b17f52
(index) Correct handling of firstPosition to avoid division by zero
2024-08-10 10:21:59 +02:00
Viktor Lofgren
98c40958ab
(index) Simplify verbatim match calculation
2024-08-10 09:54:56 +02:00
Viktor Lofgren
41b52f5bcd
(index) Simplify verbatim match calculation
2024-08-10 09:51:03 +02:00
Viktor Lofgren
4264fb9f49
(query-service) Clean up qdebug UI a bit
2024-08-10 09:51:03 +02:00
Viktor Lofgren
016a4c62e1
(index) Bugs and error fixes, chasing and fixing mystery results that did not contain all relevant keywords
2024-08-10 09:51:03 +02:00
Viktor Lofgren
2f38c95886
(index) Backport bugfix from term-positions branch
...
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search. This is no bueno.
This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
2024-08-09 21:17:02 +02:00
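One way to avoid the kind of bug described above is to leave the caller's ordering untouched and keep a separate sorted copy purely for lookups. The sketch below illustrates that idea under stated assumptions: OrderedIdList and its methods are hypothetical, and the commit does not say the actual fix took this form.

```java
import java.util.Arrays;

final class OrderedIdList {
    private final long[] ids;    // caller's order, never reordered
    private final long[] sorted; // private sorted copy, used only by contains()

    OrderedIdList(long[] ids) {
        this.ids = ids.clone();
        this.sorted = ids.clone();
        Arrays.sort(this.sorted);
    }

    int size() {
        return ids.length;
    }

    /** Returns the id at index i in the order the caller supplied. */
    long at(int i) {
        return ids[i];
    }

    /** Membership test via binary search over the sorted copy. */
    boolean contains(long id) {
        return Arrays.binarySearch(sorted, id) >= 0;
    }
}
```

The cost is one extra array per list, in exchange for index-aligned access that stays consistent with whatever parallel arrays the surrounding code maintains.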
Viktor Lofgren
df89661ed2
(index) In SearchResultItem, populate combinedId with combinedId and not its ranking-removed documentId cousin
2024-08-09 16:32:32 +02:00
Viktor Lofgren
41da4f422d
(search-query) Always generate the "all"-segmentation
2024-08-09 13:20:00 +02:00
Viktor Lofgren
2e89b55593
(wip) Repair qdebug utility and show new ranking details
2024-08-09 12:57:25 +02:00
Viktor Lofgren
7babdb87d5
(index) Remove intermediate models
2024-08-07 10:10:44 +02:00
Viktor Lofgren
680ad19c7d
(keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors
2024-08-06 11:16:56 +02:00
Viktor Lofgren
f01267bc6b
(index) Don't load fwd index offsets into a hash table at start.
...
This makes the service take forever to start up. Memory map the data instead and binary search. This is a bit slower, but not by much.
2024-08-06 11:16:28 +02:00
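The trade-off described above (no up-front hash table, at the price of a binary search per lookup) can be sketched as follows. This is an illustration only: MappedIdIndex, its methods, and the file layout (a flat array of sorted big-endian longs) are assumptions and do not reflect the real forward index format.

```java
import java.io.IOException;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedIdIndex {
    /** Maps a file of sorted 8-byte ids read-only; nothing is loaded eagerly. */
    static LongBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asLongBuffer();
        }
    }

    /** Binary search over the sorted ids; returns the index, or -1 if absent. */
    static int indexOf(LongBuffer sortedIds, long key) {
        int lo = 0;
        int hi = sortedIds.limit() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long value = sortedIds.get(mid);
            if (value < key) {
                lo = mid + 1;
            } else if (value > key) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }
}
```

Startup cost becomes a single map call, while each lookup takes O(log n) reads that the operating system serves from the page cache once the relevant pages are warm.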
Viktor Lofgren
df6a05b9a7
(index) Avoid hypothetical divide-by-zero in tcfAvgDist
2024-08-06 10:55:57 +02:00
Viktor Lofgren
8569bb8e11
(index) Avoid divide-by-zero when minDist returns 0
2024-08-06 10:34:05 +02:00
Viktor Lofgren
ca6e2db2b9
(index) Include external link texts in verbatim score
2024-08-06 10:23:23 +02:00
Viktor Lofgren
2080e31616
(converter) Store link text positions
...
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends. Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
Viktor Lofgren
c379be846c
(slop) Update readme
2024-08-04 10:58:23 +02:00
Viktor Lofgren
9bc665628b
(slop) VarintLE implementation, correct enum8 column
2024-08-04 10:57:52 +02:00
Viktor Lofgren
ee49c01d86
(index) Tune ranking for verbatim matches in the title, rewarding shorter titles
2024-08-03 14:47:23 +02:00
Viktor Lofgren
b21f8538a8
(index) Tune ranking for verbatim matches in the title, rewarding shorter titles
2024-08-03 14:41:38 +02:00
Viktor Lofgren
dd15676d33
(index) Tune ranking for verbatim matches in the title, rewarding shorter titles
2024-08-03 14:18:04 +02:00
Viktor Lofgren
ec5a17ad13
(index) Tune ranking for verbatim matches in the title, rewarding shorter titles
2024-08-03 14:07:02 +02:00
Viktor Lofgren
e48f52faba
(experiment) Add ad-hoc filter runner
2024-08-03 13:24:03 +02:00
Viktor Lofgren
8462e88b8f
(index) Add min-dist factor and adjust rankings
2024-08-03 13:07:00 +02:00
Viktor Lofgren
bf26ead010
(index) Remove hasPrioTerm check as we should sort this out in ranking
2024-08-03 13:06:50 +02:00
Viktor Lofgren
c2cedfa83c
(index) Experimental ranking signals
2024-08-03 10:33:41 +02:00
Viktor Lofgren
eba2844361
(index) Experimental ranking signals
2024-08-03 10:32:46 +02:00
Viktor Lofgren
c6c8b059bf
(index) Return some variant of the previously removed 'Bm25PrioGraphVisitor'
2024-08-03 10:10:12 +02:00
Viktor Lofgren
d8a99784e5
(index) Adding a few experimental relevance signals
2024-08-02 20:26:07 +02:00
Viktor Lofgren
57929ff242
(coded-sequence) Varint sequence
2024-08-02 20:22:56 +02:00
Viktor Lofgren
4430a39120
(loader) Clean up
2024-08-02 12:32:47 +02:00
Viktor Lofgren
6228f46af1
(loader) Reduce log spam
2024-08-02 12:21:03 +02:00
Viktor Lofgren
ac67b6b5da
(converter) Fix exception handling while reading crawl data
2024-08-02 10:39:49 +02:00
Viktor Lofgren
1a268c24c8
(perf) Reduce DomPruningFilter hash table recalculation
2024-08-01 12:04:55 +02:00
Viktor Lofgren
38e2089c3f
(perf) Code was still spending a lot of time resolving charsets
...
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
e2107901ec
(index) Add span information for anchor tags, tweak ranking params
2024-08-01 11:46:30 +02:00
Viktor Lofgren
15745b692e
(index) Coherences need to be able to deal with null values among positions
2024-07-31 22:00:14 +02:00
Viktor Lofgren
696fd8909d
(screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones
2024-07-31 21:44:10 +02:00
Viktor Lofgren
02b1c4b172
(screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones
2024-07-31 20:21:23 +02:00
Viktor Lofgren
285e657f68
Merge branch 'master' into term-positions
...
# Conflicts:
# code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
# code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
046ffc7752
(build) Upgrade jib to 3.4.3
2024-07-31 10:39:50 +02:00
Viktor Lofgren
2ef66ce0ca
(actor) Reset NEW flag earlier when auto-deletion is disabled
...
Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.
2024-07-31 10:31:03 +02:00
Viktor Lofgren
dc5c668940
(index) Re-enable parallelization of index construction, disable parallel sorting during construction
...
The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance. It got worse, so the change is reverted.
Though it's been noted that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so this is changed to be done sequentially.
2024-07-31 10:06:53 +02:00
Viktor Lofgren
f19148132a
(search) Restrict site-search by passing domain id along with the site:-term
...
This will help these queries deal with domains that do not have a subdomain so that they do not drag up subdomains as well, as they are also given the special site:-keyword for their corresponding parent domain.
2024-07-30 21:41:07 +02:00
Viktor Lofgren
6d7b886aaa
(converter) Correct sort order of files in control storage GUI
...
Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today were typically shown first, followed by the rest of the files with the oldest date first.
2024-07-30 19:43:27 +02:00
Viktor Lofgren
b316b55be9
(index) Experimental initial integration of document spans into index
2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7
(restructure) Clean up repo by moving stray features into converter-process and crawler-process
2024-07-30 10:14:00 +02:00
Viktor Lofgren
7e4efa45b8
(converter/loader) Simplify document record writing to not require predicated reads
2024-07-29 14:21:21 +02:00
Viktor Lofgren
86ea28d6bc
(converter/loader) Simplify document record writing to not require predicated reads
2024-07-29 14:18:52 +02:00
Viktor Lofgren
34703da144
(slop) Support for nested array types and array-of-object types
...
Also adding very basic support for filtered reads via SlopTable. This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
1282f78bc5
(slop-models) Fix incorrect column grouping leading to errors in converter
2024-07-29 11:01:18 +02:00
Viktor Lofgren
2d5d965f7f
(slop-models) Fix incorrect column grouping leading to errors in converter
2024-07-29 10:34:33 +02:00
Viktor Lofgren
afe56c7cf1
(loader) Tidy up code
2024-07-28 21:36:42 +02:00
Viktor Lofgren
7d51cf882f
(loader) Move rssFeeds to a different column group to avoid errors
2024-07-28 21:30:10 +02:00
Viktor Lofgren
499deac2ef
(slop) Fix test that broke when we split get into int get() and long getLong()
2024-07-28 21:20:37 +02:00
Viktor Lofgren
9685993adb
(loader) Add spans to a different column group from spanCodes, as they are not in sync
2024-07-28 21:20:09 +02:00
Viktor Lofgren
261dcdadc8
(loader) Additional tracking for the control GUI
2024-07-28 21:19:45 +02:00
Viktor Lofgren
314a901bf0
(slop) Clean up build.gradle from unnecessary copy-paste garbage
2024-07-28 13:22:20 +02:00
Viktor Lofgren
1caad7e19e
(slop) Update existing code to use the altered Slop interfaces
2024-07-28 13:21:08 +02:00
Viktor Lofgren
e585116dab
(slop) Add 32 bit read method for Varint along with the old 64 bit version
2024-07-28 13:20:18 +02:00
Viktor Lofgren
40f42bf654
(slop) Add signed 16 bit column type "short"
2024-07-28 13:19:44 +02:00
Viktor Lofgren
eaf7fbb9e9
(slop) Improve Conveniences for Enum
...
* New fixed width 8 bit version of Enum
* Access to the enum's dictionary, and a method for reading the ordinal directly to reduce GC churn
2024-07-28 13:19:15 +02:00
Viktor Lofgren
d05a2e57e9
(index-forward) Spans Writer should not be in the index page loop context
2024-07-27 15:17:04 +02:00
Viktor Lofgren
f8684118f3
(slop) Add columnDesc information to the column readers and writers, and correct a few broken position() implementations
...
Added a test that should find any additional broken implementations, as it's very important that this function is correct.
2024-07-27 14:35:30 +02:00
Viktor Lofgren
2e1f669aea
(slop) Remove additional vestigial seek() implementations
2024-07-27 14:35:30 +02:00
Viktor Lofgren
6c3abff664
(slop) Move GCS Slop column to the coded-sequence package
...
This lets the slop library be stand-alone without dependence on coded-sequence.
The change also gets rid of the vestigial seek() method in ColumnReader.
2024-07-27 13:58:45 +02:00
Viktor Lofgren
dcb43a3308
(slop) Introduce table concept to keep track of positions and simplify closing
...
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.
The second most common error is forgetting to close one of the columns in a reader or writer.
To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
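
The idea can be sketched roughly as follows. This is only an illustration under assumptions: `TableSketch`, `Column`, `register` and `position` are hypothetical names chosen for the example and are not taken from the Slop library itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical types showing the idea, not Slop's actual interfaces. */
final class TableSketch implements AutoCloseable {
    /** Minimal stand-in for a column that knows how many records it has consumed. */
    interface Column extends AutoCloseable {
        long position();
        @Override void close();
    }

    private final List<Column> columns = new ArrayList<>();

    /** Registers a column so the table owns its lifecycle. */
    <T extends Column> T register(T column) {
        columns.add(column);
        return column;
    }

    /** Closes every registered column, then fails if their positions disagree. */
    @Override
    public void close() {
        long expected = columns.isEmpty() ? 0 : columns.get(0).position();
        boolean inSync = true;
        for (Column c : columns) {
            inSync &= c.position() == expected; // compare before closing
            c.close();                          // every column gets closed regardless
        }
        if (!inSync) throw new IllegalStateException("columns are out of sync");
    }
}
```

Used with try-with-resources, a forgotten skip on one column would surface as an exception when the table is closed rather than as silently misaligned data.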
2024-07-27 13:47:47 +02:00
Viktor Lofgren
ec600b967d
(crawler) Adjust domain locking
...
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress. Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
aebb2652e8
(wip) Extract and encode spans data
...
Refactoring keyword extraction to extract spans information.
Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
52a9a0d410
(slop) Translate nulls to empty strings when passed to the StringColumnWriters.
2024-07-25 18:26:41 +02:00
Viktor Lofgren
4123e99469
(slop) Handle empty compressed files correctly
...
The CompressingStorageReader would incorrectly report having data when a file was empty. Preemptively attempting to fill the backing buffer fixes the behavior.
2024-07-25 18:26:13 +02:00
Viktor Lofgren
51a8a242ac
(slop) First commit of slop library
...
Slop is a low-abstraction data storage convention for column based storage of complex data.
2024-07-25 15:08:41 +02:00
Viktor Lofgren
60ef826e07
(loader) Add heartbeat to update domain-ids step
2024-07-25 15:08:41 +02:00
Viktor Lofgren
2ad564404e
(loader) Add heartbeat to update domain-ids step
2024-07-23 15:28:52 +02:00
Viktor Lofgren
2bb9f18411
(dld) Refactor DocumentLanguageData
...
Reduce the usage of raw arrays
2024-07-19 12:24:55 +02:00
Viktor Lofgren
7a1edc0880
(term-freq) Reduce the number of low-relevance words in the dictionary
...
Using a statistical trick to reduce the number of low-frequency words in the dictionary, as they are numerous and not very informative.
2024-07-19 12:23:28 +02:00
Viktor Lofgren
b812e96c6d
(language-processing) Select the appropriate language filter
...
The incorrect filter was selected based on the provided parameter, this has been corrected.
2024-07-19 12:22:32 +02:00
Viktor Lofgren
22b35d5d91
(sentence-extractor) Add tag information to document language data
...
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers.
The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant amount of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
Viktor Lofgren
d36055a2d0
(keyword-extractor) Retire TfIdfHigh WordFlag
...
This will bring the word flags count down to 8, and let us pack every value in a byte.
2024-07-17 13:54:39 +02:00
Viktor Lofgren
0d227f3543
(cleanup) Remove next-prime library only used in tests
2024-07-17 13:48:03 +02:00
Viktor Lofgren
accc598967
(crawler) Add 1 second pause after probing domain to reduce request pressure
2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba
(crawler) Add a per-domain mutex for crawling
...
To let up the pressure on domains with lots of subdomains such as substack, medium, neocities, etc., a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
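
A minimal sketch of such a per-domain mutex, assuming a registry keyed by the top domain name. `DomainLocksSketch` and `lockFor` are hypothetical names for illustration, not the crawler's actual classes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Illustrative per-domain lock registry; names are hypothetical, not the crawler's classes. */
final class DomainLocksSketch {
    private final ConcurrentHashMap<String, Semaphore> locks = new ConcurrentHashMap<>();

    /** Returns the semaphore shared by everything crawled under the given top domain. */
    Semaphore lockFor(String topDomain) {
        // A single permit means one crawler thread per domain at a time
        return locks.computeIfAbsent(topDomain, d -> new Semaphore(1));
    }
}
```

A crawler thread would call `acquire()` on the returned semaphore before fetching and `release()` in a `finally` block afterwards. Using a semaphore rather than a plain lock leaves room for raising the permit count for very large hosts.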
2024-07-16 16:44:59 +02:00
Viktor Lofgren
6665e447aa
(crawler) Add crawl delays around probe call and deal with 429:s properly during this phase
2024-07-16 15:33:24 +02:00
Viktor Lofgren
7eb955cc42
(setup) Change mirror for opennlp
...
Seems like the estointernet mirror no longer works. Use apache.org instead.
2024-07-16 15:19:13 +02:00
Viktor Lofgren
f4d79c203d
(crawler) Adjust revisit logic
...
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.
Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4
(crawler) Introduce absolute upper limit to crawl depth growth
2024-07-16 14:40:45 +02:00
Viktor Lofgren
0b31c4cfbb
(coded-sequence) Replace GCS usage with an interface
2024-07-16 14:37:50 +02:00
Viktor Lofgren
5c098005cc
(index) Fix broken test
...
Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.
2024-07-16 12:37:59 +02:00
Viktor Lofgren
ae87e41cec
(index) Fix rare BitReader.takeWhileZero bug
...
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer. This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.
The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.
Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
2024-07-16 11:03:56 +02:00
Viktor Lofgren
dfd19b5eb9
(index) Reduce the number of abstractions around result ranking
...
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
2024-07-16 08:18:54 +02:00
Viktor
8ed5b51a32
Merge branch 'master' into term-positions
2024-07-15 07:05:31 +02:00
Viktor Lofgren
9d0e5dee02
Fix gitignore issue causing .so files not to be ignored correctly.
2024-07-15 05:18:10 +02:00
Viktor Lofgren
ffd970036d
(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
...
How'd This Ever Work? (tm)
TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:16:17 +02:00
Viktor Lofgren
fa162698c2
(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
...
How'd This Ever Work? (tm)
TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:15:30 +02:00
Viktor Lofgren
ad3857938d
(search-api, ranking) Update with new ranking parameters
...
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.
The change also cleans out several parameters that no longer filled any function.
2024-07-15 04:49:40 +02:00
Viktor Lofgren
179a6002c2
(coded-sequence) Add a callback for re-filling underlying buffer
2024-07-12 23:50:28 +02:00
Viktor Lofgren
d28fc86956
(index-prio) Add fuzz test for prio index
2024-07-11 19:22:36 +02:00
Viktor Lofgren
6303977e9c
(index-prio) Fail louder when size is 0 in PrioDocIdsTransformer
...
We can't deal with this scenario and should complain very loudly
2024-07-11 19:22:05 +02:00
Viktor Lofgren
97695693f2
(index-prio) Don't increment readItems counter when the output buffer is full
...
This behavior was causing the reader to sometimes discard trailing entries in the list.
2024-07-11 19:21:36 +02:00
Viktor Lofgren
1ab875a75d
(test) Correcting flaky tests
...
Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
31881874a9
(coded-sequence) Correct indicator of next-value
...
It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamma code. This is incorrect in this case, as we're able to provide a negative offset. Changing to using Integer.MIN_VALUE as indicator that a value is absent instead, as this will never be used.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
f090f0101b
(index-construction) Gather up preindex writes
...
Use fewer writes when finalizing the preindex documents.dat file, as this was getting too slow.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
9881cac2da
(index-reader) Correctly handle negative offset values
...
When wordOffset(...) returns a negative value, it means the word isn't present in the index, and we should abort.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
12590d3449
(index-reverse) Added compression to priority index
...
The priority index documents file can be trivially compressed to a large degree.
Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
2024-07-11 16:13:23 +02:00
Viktor Lofgren
abf7a8d78d
(coded-sequence) Correct implementation of Elias gamma
...
Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.
2024-07-10 14:28:28 +02:00
Viktor Lofgren
ecfe17521a
(coded-sequence) Correct implementation of Elias gamma
...
The implementation was incorrectly using 1 bit more than it should. The change also adds a put method for Elias delta; and cleans up the interface a bit.
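
For reference, a minimal sketch of textbook Elias gamma coding, not the coded-sequence implementation: a positive integer with N significant bits is written as N-1 zero bits followed by the N bits of the value, 2N-1 bits in total. The class and method names are hypothetical, and bits are kept in a `String` purely for readability.

```java
/** Illustrative Elias gamma codec; hypothetical names, not the coded-sequence API. */
final class EliasGammaSketch {
    /** Encodes a positive integer as a string of '0' and '1' characters. */
    static String encode(int value) {
        if (value < 1) throw new IllegalArgumentException("Elias gamma encodes positive integers only");
        String binary = Integer.toBinaryString(value);   // N significant bits, leading bit is 1
        return "0".repeat(binary.length() - 1) + binary; // N-1 zeros, then the value itself
    }

    /** Decodes a single value from the start of the bit string. */
    static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') zeros++;       // unary length prefix
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }
}
```

With this layout the value 5 (binary `101`) takes 5 bits; an encoder that emitted one extra bit per value would still decode consistently with a matching decoder, but would waste space on every entry.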
2024-07-09 17:28:21 +02:00
Viktor Lofgren
0d29e2a39d
(index-reverse) Entry Sources reset() their LongQueryBuffer
...
Previously this was the responsibility of the caller, which led to the possibility of passing in improperly prepared buffers and receiving bad results.
2024-07-09 01:39:40 +02:00
Viktor Lofgren
12a2ab93db
(actor) Improve error messages for convert-and-load
...
Some copy-and-paste errors had snuck in and every index construction error was reported as "repartitioned failed"; updated with more useful messages.
2024-07-08 19:19:30 +02:00
Viktor Lofgren
d90bd340bb
(index-reverse) Removing btree indexes from prio documents file
...
Btree index adds overhead and disk space and doesn't fill any function for the prio index.
* Update finalize logic with a new IO transformer that copies the data and prepends a size
* Update the reader to read the new format
* Added a test
2024-07-08 17:20:17 +02:00
Viktor Lofgren
21afe94096
(index-reverse) Don't use 128 bit merge function for prio index
2024-07-07 21:36:10 +02:00
Viktor Lofgren
fa36689597
(index-reverse) Simplify priority index
...
* Do not emit a documents file
* Do not interlace metadata or offsets with doc ids
2024-07-06 18:04:08 +02:00
Viktor Lofgren
85c99ae808
(index-reverse) Split index construction into separate packages for full and priority index
2024-07-06 15:44:47 +02:00
Viktor Lofgren
a4ecd5f4ce
(minor) Fix non-compiling test due to previous refactor
2024-07-06 15:11:43 +02:00
Viktor Lofgren
6401a513d7
(crawl) Fix onsubmit confirm dialog for single-site recrawl
2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f
(crawl) Add new functionality for re-crawling a single domain
2024-07-05 15:31:55 +02:00
Viktor Lofgren
a6b03a66dc
(crawl) Reduce Charset.forName() object churn
...
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again. Charset.forName(...) can be surprisingly expensive, and its built-in caching strategy, which just caches the last 2 values seen, doesn't cope well with how we're hitting it with a wide array of random charsets.
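
A minimal sketch of this kind of cache, assuming a simple unbounded map keyed by the charset name as given. `CharsetCacheSketch` is a hypothetical name, not the class used in the crawler.

```java
import java.nio.charset.Charset;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative charset cache; the class name is hypothetical. */
final class CharsetCacheSketch {
    private static final ConcurrentHashMap<String, Charset> CACHE = new ConcurrentHashMap<>();

    /** Resolves a charset by name, reusing the result of earlier successful lookups. */
    static Charset forName(String name) {
        // If Charset.forName throws, no mapping is recorded and the exception propagates,
        // so failed lookups are NOT remembered by this sketch
        return CACHE.computeIfAbsent(name, Charset::forName);
    }
}
```

Note that the key is the raw name, so `utf-8` and `UTF-8` would occupy separate entries that map to the same `Charset`.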
2024-07-04 20:49:07 +02:00
Viktor Lofgren
d023e399d2
(index) Remove unnecessary allocations in journal reader
...
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.
Replacing with a fixed pointer alias that can be repositioned to the relevant data.
The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.
Removed this unnecessary step and move to copying the buffer directly instead.
2024-07-04 15:38:22 +02:00
Viktor Lofgren
e8ab1e14e0
(keyword-extraction) Update upper limit to number of positions per word
...
After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.
2024-07-02 20:52:32 +02:00
Viktor Lofgren
a6e15cb338
(keyword-extraction) Update upper limit to number of positions per word
...
100 was a bit too low, let's try 256.
2024-06-30 22:46:56 +02:00
Viktor Lofgren
4fbb863a10
(keyword-extraction) Add upper limit to number of positions per word
...
Also adding some logging for this event to get a feel for how big these lists get with realistic data. To be cleaned up later.
2024-06-30 22:41:38 +02:00
Viktor Lofgren
6ee4d1eb90
(keyword) Increase the work area for position encoding
...
The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.
2024-06-28 16:42:39 +02:00
Viktor Lofgren
738e0e5fed
(process) Add option for automatic profiling
...
The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns. By default, these are put in the log directory.
The change also adds a JVM parameter that makes it shut up about native access.
2024-06-27 13:58:36 +02:00
Viktor Lofgren
0e4dd3d76d
(minor) Remove accidentally committed debug printf
2024-06-27 13:40:53 +02:00
Viktor Lofgren
10fe5a78cb
(log) Prevent tests from trying to log to file
...
They would never have succeeded, but it adds an annoying preamble of error spam in the console window.
2024-06-27 13:19:48 +02:00
Viktor Lofgren
975b8ae2e9
(minor) Tidy code
2024-06-27 13:15:31 +02:00
Viktor Lofgren
935234939c
(test) Add query parsing to IntegrationTest
2024-06-27 13:15:20 +02:00
Viktor Lofgren
87e38e6181
(search-query) refac: Move query factory
2024-06-27 13:14:47 +02:00
Viktor Lofgren
f73fc8dd57
(search-query) Fix end-inclusion bug in QWordGraphIterator
2024-06-27 13:13:42 +02:00
Viktor Lofgren
3faa5bf521
(search-query) Tidy up QueryGRPCService and IndexClient
2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480
(query) Tidy up code
2024-06-26 13:40:06 +02:00
Viktor Lofgren
02df421c94
(*) Trim the stopwords list
...
Having an overlong stopwords list leads to quoted terms not performing well. For now we'll slash it to just "a" and "the".
2024-06-26 12:22:57 +02:00
Viktor Lofgren
95b9af92a0
(index) Implement working optional TermCoherences
2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771
(index) Correct TermCoherence requirements
2024-06-25 22:18:10 +02:00
Viktor Lofgren
b805f6daa8
(gamma) Fix readCount() behavior in EGC
2024-06-25 22:17:54 +02:00
Viktor Lofgren
dae22ccbe0
(test) Integration test from crawl->query
2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f
(index) Partial re-implementation of position constraints
2024-06-24 15:55:54 +02:00
Viktor Lofgren
5461634616
(doc) Add readme.md for coded-sequence library
...
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
2024-06-24 14:28:51 +02:00
Viktor Lofgren
40bca93884
(gamma) Minor clean-up
2024-06-24 13:56:43 +02:00
Viktor Lofgren
b798f28443
(journal) Fixing journal encoding
...
Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.
2024-06-24 13:56:27 +02:00
Viktor Lofgren
fff2ce5721
(gamma) Correctly decode zero-length sequences
2024-06-24 13:11:41 +02:00
Viktor
69f88255e9
Merge pull request #101 from MarginaliaSearch/security-scan
...
Address security scan findings
2024-06-17 13:18:36 +02:00
Viktor
08ff79827e
Merge branch 'master' into security-scan
2024-06-17 13:18:25 +02:00
Viktor Lofgren
67703e2274
(run) Update install.sh with stronger warnings against non-docker install.
2024-06-17 13:15:15 +02:00
Viktor Lofgren
d0d6bb173c
(control) Fix warc data http status filter default value
2024-06-17 12:40:25 +02:00
Viktor Lofgren
54caf17107
(docs) Amend install instructions for non-docker install
2024-06-16 10:22:07 +02:00
Viktor Lofgren
2168b7cf7d
(docs) Update docs with clearer references to the full guide
...
The commit also mentions the non-docker install
2024-06-16 10:01:19 +02:00
Viktor Lofgren
90744433c9
Merge branch 'master' into security-scan
...
# Conflicts:
# code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Viktor
5371f078f7
Merge pull request #102 from jaseemabid/jabid/macos-build
...
Make the project buildable on macOS
2024-06-12 14:45:03 +02:00
Jaseem Abid
0dd14a4bd0
Specify C++ standard in build command
...
The default C++ language standard on macOS is gnu++98, which won't build
this module.
Full error:
```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
[](const p64x2& fst, const p64x2& snd) {
^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09
Don't track build files (libcpp.so) with git
2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9
(crawler) Update WARC builder to use SHA-256 for digests
2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b
(crawler) Fetch TLS instead of SSL context
2024-06-12 09:07:54 +02:00
Viktor Lofgren
55f3ac4846
(atags) Fix duckdb SQL injection
...
The input comes from the config file so this isn't a very realistic threat vector, and even if it wasn't it's a query in an empty duckdb instance; but adding a validation check to provide a better error message.
2024-06-12 09:05:57 +02:00
Viktor Lofgren
801cf4b5da
(search) Fix bad practice usage of innerHTML to set what should be text content.
2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d
(build) Upgrade parquet dependencies to 1.14.0
...
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
23759a7243
(loader) Correctly clamp document size
2024-06-10 18:29:14 +02:00
Viktor Lofgren
55b2b7636b
(loader) Correctly load the positions column in the keyword projection
2024-06-10 18:27:15 +02:00
Viktor Lofgren
36160988e2
(index) Integrate positions data with indexes WIP
...
This change integrates the new positions data with the forward and reverse indexes.
The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d
(index) Integrate positions file properly
2024-06-06 16:45:42 +02:00
Viktor Lofgren
dcbec9414f
(index) Fix non-compiling tests
2024-06-06 16:35:09 +02:00
Viktor Lofgren
a07cf1ba93
(array/cpp) Update gitignore to properly exclude libcpp.so
2024-06-06 13:06:08 +02:00
Viktor Lofgren
4a8afa6b9f
(index, WIP) Position data partially integrated with forward and reverse indexes.
...
There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Viktor
bb06cc9ff3
Merge pull request #98 from samstorment/ThemeSwitcher
...
OS Independent Theme Switcher
2024-06-06 12:51:19 +02:00
Sam Storment
9c06f446fb
(search) Styling tweaks. Make the filter button near the top right corner a bit bigger so it's easier to press on mobile
2024-06-05 19:55:17 -05:00
Sam Storment
2d076cbd67
(search) move data-has-js attribute from body to html element
2024-06-05 18:20:33 -05:00
Sam Storment
fb2eef24d6
Handle theming when javascript is disabled. Hide the theme select and fall back to dark media query instead of data-theme attribute
2024-06-03 14:15:35 -05:00
Sam Storment
e2f68d9ccf
Add a theme select to the header that lets users toggle their theme independent of their OS theme
2024-06-02 21:02:52 -05:00
Viktor Lofgren
d4f4d751c0
Merge remote-tracking branch 'origin/master'
2024-06-02 16:30:41 +02:00
Viktor Lofgren
b4eac2516e
(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results
2024-06-02 16:30:34 +02:00
Viktor
4435f6245c
Merge pull request #94 from samstorment/search-dark-theme
...
Search Dark Theme
2024-06-02 16:21:52 +02:00
Viktor Lofgren
9b922af075
(converter) Amend existing modifications to use gamma coded positions lists
...
... instead of serialized RoaringBitmaps as was the initial take on the problem.
2024-05-30 14:20:36 +02:00
Viktor Lofgren
0112ae725c
(gamma) Implement a small library for Elias gamma coding an integer sequence
2024-05-30 14:19:13 +02:00
Viktor Lofgren
619392edf9
(keywords) Add position information to keywords
2024-05-28 16:54:53 +02:00
Viktor Lofgren
0894822b68
(converter) Add position information to serialized document data
...
This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
2024-05-28 14:18:03 +02:00
Viktor Lofgren
206a7ce6c1
Merge remote-tracking branch 'origin/master'
2024-05-28 14:15:57 +02:00
Viktor Lofgren
a69ab311c7
(qword) Fix tests that broke due to stopword removal
2024-05-28 14:15:45 +02:00
Viktor
a61327fa0b
Update ROADMAP.md
2024-05-24 13:57:50 +02:00
Viktor Lofgren
6985ab762a
(query) Improve handling of stopwords in queries
2024-05-23 20:50:55 +02:00
Viktor Lofgren
0e8300979b
(search) Update the no result text to request bug reports.
2024-05-23 20:18:16 +02:00
Viktor Lofgren
0b60411e5f
(query) Bugfix stopword issue
...
Add a new rule that creates an alternative path that omits a word if it's a stopword.
In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
Viktor Lofgren
f83f777fff
(converter) Experimental support for searching by URL
...
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60
(*) Lift jetty and guava-dependencies
2024-05-23 14:20:01 +02:00
Viktor Lofgren
65b74f9cab
(registry) Fix broken test
2024-05-23 14:15:01 +02:00
Sam Storment
7543e98035
Merge branch 'MarginaliaSearch:master' into search-dark-theme
2024-05-22 18:06:37 -05:00
Viktor Lofgren
59ec70eb73
(*) Clean up code related to crawl parquet inspection
2024-05-22 12:55:08 +02:00
Viktor Lofgren
365229991b
(control) Improve pagination for crawl data inspector
2024-05-21 19:44:48 +02:00
Viktor Lofgren
959a8e29ee
(control) Improve pagination for crawl data inspector
2024-05-21 19:27:25 +02:00
Viktor Lofgren
197c82acd4
(control) Add filter functionality for crawl data inspector
2024-05-21 19:05:44 +02:00
Viktor Lofgren
9539fdb53c
(control) Clean up UX for crawl data inspector
2024-05-21 18:27:24 +02:00
Sam Storment
5659df4388
(search) Set link and form field colors manually to override browser defaults with poor dark mode contrast
2024-05-21 00:03:46 -05:00
Viktor Lofgren
24bf29d369
(*) Upgrade opennlp and deprecate the monkey patched version of the code as it's no longer needed
2024-05-20 18:03:21 +02:00
Viktor Lofgren
17dc00d05f
(control) Partial implementation of inspection utility for crawl data
...
Uses duckdb and range queries to read the parquet files directly from the index partitions.
UX is a bit rough but is in working order.
2024-05-20 18:02:46 +02:00
Viktor Lofgren
4fcd4a8197
(index) Refactor to reduce the level of indirection
2024-05-19 12:40:33 +02:00
Viktor Lofgren
daf2a8df54
(btree) Roll back optimization of queryDataWithIndex
...
It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect.
The reason retain is so fast is due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already intersected documents, where the data is more sparse.
2024-05-19 11:29:28 +02:00
Sam Storment
43489c98d8
(search) Minor dark theme tweaks after the new mocked UI elements were added
2024-05-19 01:06:54 -05:00
Viktor Lofgren
88997a1c4f
(btree) Clean up code
2024-05-18 18:38:46 +02:00
Viktor Lofgren
d12c77305c
(btree) Clean up code
2024-05-18 18:03:17 +02:00
Viktor Lofgren
ab4e2b222e
(array) Fix broken benchmarks
2024-05-18 13:41:24 +02:00
Viktor Lofgren
b867eadbef
(big-string) Remove the unused bigstring library
2024-05-18 13:40:03 +02:00
Viktor Lofgren
19163fa883
(array) Clean up the Array library
...
IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it)
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well, this is an attempt to help it out a bit.
2024-05-18 13:23:06 +02:00
Sam Storment
a7c33809c4
Merge branch 'master' into search-dark-theme
2024-05-17 22:52:19 -05:00
Viktor Lofgren
650f3843bb
(array) Clean up search function jungle
...
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
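For illustration, a branchless binary search can be sketched roughly as below. This is not the code from the array library: the class name `BranchlessSearch`, the method `lowerBound`, and the lower-bound return convention (index of the first element not less than the key, or the array length on a miss) are assumptions made for this example. Whether the ternary compiles to a conditional move depends on the JVM.

```java
final class BranchlessSearch {
    /**
     * Returns the index of the first element >= key in a sorted array,
     * or sorted.length if every element is smaller than key.
     * (Hypothetical helper; the return convention is an assumption.)
     */
    static int lowerBound(long[] sorted, long key) {
        int base = 0;
        int len = sorted.length;
        while (len > 1) {
            int half = len >>> 1;
            // The loop always runs about log2(n) times regardless of the data;
            // the only data-dependent step is this select, which a JIT may
            // turn into a conditional move instead of a jump.
            base += (sorted[base + half - 1] < key) ? half : 0;
            len -= half;
        }
        if (len == 1 && sorted[base] < key) {
            base++;
        }
        return base;
    }
}
```

A miss is reported as an ordinary insertion index rather than as a negative value, so the caller compares `sorted[i]` with the key to tell a hit from a miss.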
2024-05-17 14:31:02 +02:00
Viktor Lofgren
9e766bc056
(array) Clean up search function jungle
...
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
2024-05-17 14:30:06 +02:00
Viktor Lofgren
48aff52e00
(array) Increase LongArray on-heap alignment to 16 bytes
...
This primarily affects benchmarks, making performance more consistent for the 128 bit operations, as the system mostly works with memory mapped data.
2024-05-16 19:12:36 +02:00
Viktor Lofgren
9d7616317e
(array) Clean up native code a bit
2024-05-16 14:47:10 +02:00
Viktor Lofgren
d227a09fb1
(search) Extend paperdoll service mock with site info data and screenshots
...
It's a bit of a hack job but will do. Random exploration is available, but only through a "browse:random"-style query.
2024-05-15 12:40:55 +02:00
Viktor Lofgren
f48cf77c4d
(array, experimental) Add benchmark results for quicksort
2024-05-14 18:15:30 +02:00
Viktor Lofgren
3549be216f
(array, experimental) Documentation for native algos
2024-05-14 17:43:05 +02:00
Viktor Lofgren
c3e3a3dbc5
(search) Fix problem list in clustered search results
2024-05-14 13:05:52 +02:00
Viktor Lofgren
55a7c1db00
(array, experimental) Call C++ helper methods to do some low level stuff a bit faster than is possible with Java
2024-05-14 12:54:14 +02:00
Sam Storment
bb315221ab
(search, WIP) Make the dark theme look generally nicer. Rename CSS custom properties a bit. Switch a lot of background colors to HSL to make it easy to change colors relative to one another.
2024-05-14 01:32:40 -05:00
Sam Storment
c38766c5a6
(search, WIP) Convert SCSS variables to CSS custom properties for dynamic theming
2024-05-08 22:13:24 -05:00
Viktor Lofgren
c837321df1
(search) Provide a notification when no search results are found.
2024-05-06 20:11:39 +02:00
Viktor Lofgren
af7f6b89ec
(search) Delete vestigial stylesheet from the old design.
2024-05-06 19:52:29 +02:00
Viktor Lofgren
29a4d3df23
(search) Improve search-service paperdoll by mocking suggestions and news
2024-05-06 19:52:13 +02:00
Viktor
bcbb9afac0
Merge pull request #93 from MarginaliaSearch/accessibility-improvements
...
Accessibility improvements
2024-05-04 15:45:26 +02:00
Viktor Lofgren
7d1cafc070
(control) Add skip link for navigation in control GUI
2024-05-04 12:36:44 +02:00
Viktor Lofgren
5951c67a8b
(search) Center the search results page
2024-05-04 12:23:21 +02:00
Viktor Lofgren
c454007730
(search) Increase contrast for some UI elements
2024-05-04 12:02:52 +02:00
Viktor Lofgren
4e49cca43d
(search) Clean up SCSS code a bit
2024-05-04 11:58:54 +02:00
Viktor Lofgren
49a8c06095
(search) Improve contrast for text on random button
2024-05-04 11:51:19 +02:00
Viktor Lofgren
d01d9fa670
(search) Add screenreader-specific notification remark about when search results start.
2024-05-04 11:41:06 +02:00
Viktor Lofgren
a53a32f006
(search) Spell out website problems with "atomic elements" instead of having a hover that's inaccessible with keyboard navigation
2024-05-04 11:41:05 +02:00
Viktor Lofgren
3548d54cf6
(search) Add a screenreader-only alert when the search filters are updated to make it easier to understand what happens.
2024-05-04 11:41:04 +02:00
Viktor Lofgren
01f242ac7e
(search) Add stylesheet class for screenreader-only items
2024-05-04 11:41:03 +02:00
Viktor Lofgren
2840d9d403
(search) Add screenreader-only positions count text to search results
2024-05-04 11:41:03 +02:00
Viktor Lofgren
9fecfc5025
(search) Add autocomplete attribute to search-form
2024-05-04 11:41:02 +02:00
Viktor Lofgren
1b901e01f2
(search) Add bypass link that skips navigation
2024-05-04 11:41:01 +02:00
Viktor Lofgren
974aa35558
(search) Add proper alt-text to random exploration mode
2024-05-04 11:41:00 +02:00
Viktor Lofgren
4021a0ae98
(search) Add en-US language tags to all templates
2024-05-04 11:40:59 +02:00
Viktor Lofgren
b7a95be731
(search) Create a small mocking framework for running the search service in isolation.
2024-05-04 11:40:59 +02:00
Viktor Lofgren
616649f040
(logs) Fix logdir location
2024-05-04 11:40:59 +02:00
Viktor
ac3c692b5f
Merge pull request #92 from MarginaliaSearch/no-docker-v2
...
(WIP) Changes to make the system runnable outside of docker
2024-05-01 13:00:56 +02:00
Viktor Lofgren
6087f9635c
(qs) Move index.html out of public directory
...
It was put there to simulate the /public interface paradigm that is now deprecated.
2024-05-01 12:56:12 +02:00
Viktor Lofgren
2ad0bfda1e
(*) Fix boot orchestration for the services
...
This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.
A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor.
The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
2024-05-01 12:39:48 +02:00
Viktor Lofgren
cf8b12bcdc
Update install.sh with refined service descriptions
2024-05-01 12:07:30 +02:00
Viktor Lofgren
08f8b6e022
(system) Log loaded properties to the console
2024-04-30 18:29:11 +02:00
Viktor Lofgren
800ed6b1e9
(zk) Terminate immediately if zookeeper isn't found
...
This makes debugging easier
2024-04-30 18:28:49 +02:00
Viktor Lofgren
df93e57a9a
(install) Add new option to install locally outside of docker
2024-04-30 18:28:21 +02:00
Viktor Lofgren
908535a3a0
(single-service) Ensure single-service spawner can specify the node
2024-04-30 18:27:46 +02:00
Viktor Lofgren
7fe2ab6f39
(file-storage) Ensure file storage root location can be overridden when running outside of docker
2024-04-30 18:26:15 +02:00
Viktor Lofgren
c9ee0c909e
(download-sample) Set +x permissions on directories created during this job
2024-04-30 18:25:07 +02:00
Viktor Lofgren
38aedb50ac
(converter) Do not suppress exceptions in the converter
2024-04-30 18:24:35 +02:00
Viktor Lofgren
4772e0b59d
(service) Deprecate /public prefix on HTTP
...
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.
Since internal requests no longer meaningfully use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.
The change removes the path prefix, and updates the docker templates to reflect the change. This will require a migration for existing systems.
2024-04-30 14:46:18 +02:00
Viktor Lofgren
9c49e876d5
(conf) Update the setup.sh script to also be able to perform model upgrades
2024-04-29 17:46:20 +02:00
Viktor Lofgren
152007cd5c
(docker) Add missing zookeeper service to full marginalia config
2024-04-29 11:44:53 +02:00
Viktor Lofgren
70e2e41955
(crawler) Content type prober should not swallow exceptions
2024-04-27 18:27:23 +02:00
Viktor Lofgren
4d71c776fc
(crawler) Modify crawl set growth to grow small domains faster than larger ones
2024-04-27 17:36:27 +02:00
Viktor
0f41105436
Merge pull request #90 from MarginaliaSearch/run-outside-docker
...
Run outside of Docker
2024-04-25 18:55:26 +02:00
Viktor
2d49071e96
Merge branch 'master' into run-outside-docker
2024-04-25 18:53:26 +02:00
Viktor Lofgren
89889ecbbd
(single-service) Skip starting Prometheus if it's not explicitly enabled
2024-04-25 17:54:07 +02:00
Viktor Lofgren
41576e74d4
(doc) Clean up ROADMAP.md
2024-04-25 15:53:46 +02:00
Viktor Lofgren
c8ee354d0b
(log) Make log dir configurable via environment variable
2024-04-25 15:09:18 +02:00
Viktor Lofgren
4e5f069809
(build) Migrate ssr to the new root setting schema of java lang version
2024-04-25 15:08:56 +02:00
Viktor Lofgren
6690e9bde8
(service) Ensure the service discovery starts early
...
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
2024-04-25 15:08:33 +02:00
Viktor Lofgren
e4b34b6ee6
(index) Correctly detect the presence of an all-virtual path through the query
2024-04-25 14:01:46 +02:00
Viktor Lofgren
3952ef6ca5
(service) Let singleservice configure ports and bind addresses
2024-04-25 13:49:57 +02:00
Viktor Lofgren
463d333846
(proj) Add ROADMAP.md
2024-04-25 13:07:35 +02:00
Viktor Lofgren
7eb5e6aa66
(crawler) Abort recrawl if error count is too high
2024-04-24 21:46:40 +02:00
Viktor Lofgren
282022d64e
(crawler) Remove unnecessary double-fetch of the root document
2024-04-24 14:44:39 +02:00
Viktor Lofgren
91a98a8807
(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber
2024-04-24 14:44:39 +02:00
Viktor Lofgren
32fe864a33
(build) Java 22 and its consequences have been a disaster for Marginalia Search
...
Roll back to JDK 21 for now, and make the Java version configurable in the root build.gradle.
The project has run into no fewer than three distinct show-stopping bugs in JDK 22, across multiple vendors, and Gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e1c9313396
(crawler) Emulate if-modified-since for domains that don't support the header
...
This will help reduce the strain on some server software, in particular Discourse.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f430a084e8
(crawler) Remove accidental log spam
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a86b596897
(crawler) Code quality
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6dd87b0378
(crawler) Use the probe-result to reduce the likelihood of crawling both http and https
...
This should drastically reduce the number of fetched documents on many domains
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c9f029c214
(crawler) Strip W/-prefix from the etag when supplied as If-None-Match
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6b88db10ad
(crawler) Ensure all appropriate headers are recorded on the request
2024-04-24 14:44:39 +02:00
Viktor Lofgren
8a891c2159
(crawler/converter) Remove legacy junk from parquet migration
2024-04-24 14:44:39 +02:00
Viktor Lofgren
ad2ac8eee3
(query) Mark flaky test, correct assert on test
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f46733a47a
(ranking) TermCoherenceFactory should be run for size=2 queries
2024-04-24 14:44:39 +02:00
Viktor Lofgren
934167323d
(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
64baa41e64
(query) Always generate an ngram alternative; suppress generation of multiple identical query branches
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5165cf6d15
(ranking) Set regularMask correctly
2024-04-24 14:44:39 +02:00
Viktor Lofgren
4489b21528
(ranking) Cleanup
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f623b37577
(ranking) Suppress NaNs in ranking output
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f4a2fea451
(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a748fc5448
(index, bugfix) Pass url quality to query service
2024-04-24 14:44:39 +02:00
Viktor Lofgren
0dcca0cb83
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-24 14:44:39 +02:00
Viktor Lofgren
b80a83339b
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
eb74d08f2a
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e79ab0c70e
(qs) Basic query debug feature
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e419e26f3a
(proto) Improve handling of omitted parameters
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6102fd99bf
(qs) Improve logging
2024-04-24 14:44:39 +02:00
Viktor Lofgren
def36719d3
(query) Minor code cleanup
2024-04-24 14:44:39 +02:00
Viktor Lofgren
462aa9af26
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
...
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a09c84e1b8
(query) Modify tokenizer to match the behavior of the sentence extractor
...
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44b33798f3
(index) Clean up jaccard index term code and down-tune the parameter's importance a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
2f0b648fad
(index) Add jaccard index term to boost results based on term overlap
2024-04-24 14:44:39 +02:00
Viktor Lofgren
de0e56f027
(index) Remove position overlap check, coherences will do the work instead
2024-04-24 14:44:39 +02:00
Viktor Lofgren
973ced7b13
(index) Omit absent terms from coherence checks
2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb4b824a85
(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c583a538b1
(search) Add implicit coherence constraints based on segmentation
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e0224085b4
(index) Improve recall for small queries
...
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44c1e1d6d9
(index) Remove dead code
...
Since the performance fix in 3359f72239
had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c620e9c026
(index) Experimental performance regression fix
2024-04-24 14:44:39 +02:00
Viktor Lofgren
1bb88968c5
(test) Fix broken test
2024-04-24 14:44:39 +02:00
Viktor Lofgren
df75e8f4aa
(index) Explicitly free LongQueryBuffers
2024-04-24 14:44:39 +02:00
Viktor Lofgren
adf846bfd2
(index) Fix term coherence evaluation
...
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
1748fcc5ac
(valuation) Impose stronger constraints on locality of terms
...
Clean up logic a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
08416393e0
(valuation) Impose stronger constraints on locality of terms
2024-04-24 14:44:39 +02:00
Viktor Lofgren
fce26015c9
(encyclopedia) Index the full articles
...
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
155be1078d
(index) Fix priority search terms
...
This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6efc0f21fe
(index) Clean up data model
...
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f3255e080d
(ngram) Grab titles separately when extracting ngrams from wiki data
2024-04-24 14:44:39 +02:00
Viktor Lofgren
0da03d4cfc
(zim) Fix title extractor
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5f6a3ef9d0
(ngram) Correct |s|^|s|-normalization to use length and not count
2024-04-24 14:44:39 +02:00
Viktor Lofgren
afc4fed591
(ngram) Correct size value in ngram lexicon generation, trim the terms better
2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb505f98ef
(ngram) Use simple blocking pool instead of FJP; split on underscores in article names.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a0b3634cb6
(ngram) Only extract frequencies of title words, but use the body to increment the counters...
...
The sign of the counter is used to indicate whether a term has appeared as title. Until it's seen in the title, it's provisionally saved as a negative count.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e23359bae9
(query, minor) Remove debug statement
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5531ed632a
(query, minor) Remove debug statement
2024-04-24 14:44:39 +02:00
Viktor Lofgren
150ee21f3c
(ngram) Clean up ngram lexicon code
...
This is both an optimization that removes some GC churn and a clean-up of the code that removes references to outdated concepts.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
c96da0ce1e
(segmentation) Pick best segmentation using |s|^|s|-style normalization
...
This is better than doing all segmentations possible at the same time.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
a0d9e66ff7
(ngram) Fix index range in NgramLexicon to avoid an exception
2024-04-24 14:44:38 +02:00
Viktor Lofgren
55f627ed4c
(index) Clean up the code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
7dd8c78c6b
(ngrams) Remove the vestigial logic for capturing permutations of n-grams
...
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8bf7d090fd
(qs) Clean up parsing code using new record matching
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6bfe04b609
(term-freq-exporter) Reduce thread count and memory usage
2024-04-24 14:44:38 +02:00
Viktor Lofgren
491d6bec46
(term-freq-exporter) Extract ngrams in term-frequency-exporter
2024-04-24 14:44:38 +02:00
Viktor Lofgren
4fb86ac692
(search) Fix outdated assumptions about the results
...
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6cba6aef3b
(minor) Remove dead code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
7e216db463
(index) Add origin trace information for index readers
...
This used to be supported by the system but got lost in refactoring at some point.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
adc90c8f1e
(sentence-extractor) Fix resource leak in sentence extractor
...
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.
The modified behavior checks for nullity before creating a new instance.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
e3316a3672
(index) Clean up new index query code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
a3a6d6292b
(qs, index) New query model integrated with index service.
...
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8cb9455c32
(qs, WIP) Fix edge cases in query compilation
...
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
dc65b2ee01
(qs, WIP) Clean up dead code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
98a1adbf81
(qs, WIP) Tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
0bd1e15cce
(qs, WIP) Tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
eda926767e
(qs, WIP) Tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
cd1a18c045
(qs, WIP) Break up code and tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6f567fbea8
(qs, WIP) Fix output determinism, fix tests
2024-04-24 14:44:38 +02:00
Viktor Lofgren
0ebadd03a5
(WIP) Query rendering finally beginning to look like it works
2024-04-24 14:44:38 +02:00
Viktor Lofgren
2253b556b2
WIP
2024-04-24 14:44:17 +02:00
Viktor Lofgren
6a7a7009c7
(convert) Initial integration of segmentation data into the converter's keyword extraction logic
2024-04-24 14:44:17 +02:00
Viktor Lofgren
3c75057dcd
(qs) Retire NGramBloomFilter, integrate new segmentation model instead
2024-04-24 14:44:17 +02:00
Viktor Lofgren
212d101727
(control) GUI for exporting segmentation data from a wikipedia zim
2024-04-24 14:44:17 +02:00
Viktor Lofgren
760b80659d
(WIP) Partial integration of new query expansion code into the query-service
2024-04-24 14:44:17 +02:00
Viktor Lofgren
04879c005d
(WIP) Improve data extraction from wikipedia data
2024-04-24 14:44:17 +02:00
Viktor Lofgren
cb82927756
(WIP) Implement first take of new query segmentation algorithm
2024-04-24 14:44:17 +02:00
Viktor Lofgren
8b9629f2f6
(crawler) Remove unnecessary double-fetch of the root document
2024-04-24 14:38:59 +02:00
Viktor Lofgren
f6db16b313
(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber
2024-04-24 14:10:03 +02:00
Viktor Lofgren
4668b1ddcb
(build) Java 22 and its consequences have been a disaster for Marginalia Search
...
Roll back to JDK 21 for now, and make the Java version configurable in the root build.gradle.
The project has run into no fewer than three distinct show-stopping bugs in JDK 22, across multiple vendors, and Gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
dcf9d9caad
(crawler) Emulate if-modified-since for domains that don't support the header
...
This will help reduce the strain on some server software, in particular Discourse.
2024-04-22 17:26:31 +02:00
Viktor Lofgren
7a69b76001
(crawler) Remove accidental log spam
2024-04-22 15:51:37 +02:00
Viktor Lofgren
ac07ef822f
(crawler) Code quality
2024-04-22 15:37:35 +02:00
Viktor Lofgren
e7d4bcd872
(crawler) Use the probe-result to reduce the likelihood of crawling both http and https
...
This should drastically reduce the number of fetched documents on many domains
2024-04-22 15:36:43 +02:00
Viktor Lofgren
a28c6d7cfe
(crawler) Strip W/-prefix from the etag when supplied as If-None-Match
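As a rough illustration of the idea (not the crawler's actual code), stripping the weak-validator marker from a stored ETag before sending it back in If-None-Match could look like this. The class and method names are hypothetical.

```java
final class EtagSketch {
    /**
     * Removes a leading weak-validator marker ("W/") from an ETag value,
     * leaving the quoted opaque tag untouched. Null is passed through.
     * (Hypothetical helper for illustration.)
     */
    static String stripWeakPrefix(String etag) {
        if (etag != null && etag.startsWith("W/")) {
            return etag.substring(2);
        }
        return etag;
    }
}
```

For example, `W/"abc123"` becomes `"abc123"`, while a strong tag such as `"abc123"` is returned unchanged.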
2024-04-22 14:31:05 +02:00
Viktor Lofgren
d816f048f5
(crawler) Ensure all appropriate headers are recorded on the request
2024-04-22 14:14:24 +02:00
Viktor Lofgren
b09ddd0036
(crawler/converter) Remove legacy junk from parquet migration
2024-04-22 12:34:28 +02:00
Viktor Lofgren
0a73b02a00
(query) Mark flaky test, correct assert on test
2024-04-21 12:30:14 +02:00
Viktor Lofgren
8769704462
(ranking) TermCoherenceFactory should be run for size=2 queries
2024-04-21 12:29:25 +02:00
Viktor Lofgren
214551f1df
(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.
2024-04-19 20:36:01 +02:00
Viktor Lofgren
2cc74c005a
(query) Always generate an ngram alternative; suppress generation of multiple identical query branches
2024-04-19 19:42:30 +02:00
Viktor Lofgren
ed250f57f2
(ranking) Set regularMask correctly
2024-04-19 14:31:57 +02:00
Viktor Lofgren
e92c25f7e0
(ranking) Cleanup
2024-04-19 14:13:12 +02:00
Viktor Lofgren
3ab563f314
(ranking) Suppress NaNs in ranking output
2024-04-19 13:58:28 +02:00
Viktor Lofgren
426338cb45
(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N
2024-04-19 12:41:48 +02:00
Viktor Lofgren
5fa2375898
(index, bugfix) Pass url quality to query service
2024-04-19 12:41:26 +02:00
Viktor Lofgren
41782a0ab5
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-19 12:19:26 +02:00
Viktor Lofgren
9b06433b82
(qs) Additional info in query debug UI
2024-04-19 12:18:53 +02:00
Viktor Lofgren
def607d840
(qs) Additional info in query debug UI
2024-04-19 11:46:27 +02:00
Viktor Lofgren
2b811fb422
(qs) Basic query debug feature
2024-04-19 11:00:56 +02:00
Viktor Lofgren
36cc62c10c
(proto) Improve handling of omitted parameters
2024-04-18 10:47:12 +02:00
Viktor Lofgren
975d92912c
(qs) Improve logging
2024-04-18 10:44:08 +02:00
Viktor Lofgren
8bbaf457de
(query) Minor code cleanup
2024-04-18 10:37:51 +02:00
Viktor Lofgren
7641a02f31
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
...
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-18 10:36:15 +02:00
Viktor Lofgren
ce16239e34
(query) Modify tokenizer to match the behavior of the sentence extractor
...
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-17 17:54:32 +02:00
Viktor Lofgren
d64bd227cf
(index) Clean up jaccard index term code and down-tune the parameter's importance a bit
2024-04-17 17:40:16 +02:00
Viktor Lofgren
c5ab0a9054
(index) Add jaccard index term to boost results based on term overlap
2024-04-17 16:50:26 +02:00
Viktor Lofgren
dac948973d
(index) Remove position overlap check, coherences will do the work instead
2024-04-17 14:20:01 +02:00
Viktor Lofgren
9d008d1d6f
(index) Omit absent terms from coherence checks
2024-04-17 14:12:16 +02:00
Viktor Lofgren
f52457213e
(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus
2024-04-17 14:05:02 +02:00
Viktor Lofgren
579295a673
(search) Add implicit coherence constraints based on segmentation
2024-04-17 14:03:35 +02:00
Viktor Lofgren
af8ff8ce99
(index) Improve recall for small queries
...
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-16 22:51:03 +02:00
Viktor Lofgren
7fa3e86e64
(index) Remove dead code
...
Since the performance fix in 3359f72239
had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-16 19:59:27 +02:00
Viktor Lofgren
3359f72239
(index) Experimental performance regression fix
2024-04-16 19:48:14 +02:00
Viktor Lofgren
41fa154aa6
(test) Fix broken test
2024-04-16 19:48:14 +02:00
Viktor Lofgren
deaba0152d
(index) Explicitly free LongQueryBuffers
2024-04-16 19:23:00 +02:00
Viktor Lofgren
feaef6093e
(index) Fix term coherence evaluation
...
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-16 18:07:43 +02:00
Viktor Lofgren
078fa4fdd0
(valuation) Impose stronger constraints on locality of terms
...
Clean up logic a bit
2024-04-16 17:22:58 +02:00
Viktor Lofgren
2dc77a0638
(valuation) Impose stronger constraints on locality of terms
2024-04-16 17:15:21 +02:00
Viktor
cfd9a7187f
(query-segmentation) Merge pull request #89 from MarginaliaSearch/query-segmentation
...
The changeset cleans up the query parsing logic in the query service. It gets rid of a lot of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and adds a new cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term.
A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model.
A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This makes it easy to write rules for generating alternative queries, e.g. using the new segmentation data.
The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine".
This compiled query is passed to the index, which parses the expression and uses it to execute the search and rank the results.
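To make the expansion rule concrete, here is a minimal sketch of a recursive-descent expander for expressions of this shape. It is not the query service's implementation: the class name `QueryExpander`, the grammar comments, and the requirement that parentheses and bars are separated by whitespace are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class QueryExpander {
    private final String[] tokens;
    private int pos = 0;

    private QueryExpander(String expression) {
        String trimmed = expression.trim();
        // Assumption: "(", ")" and "|" are always surrounded by whitespace
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Expands the expression into every plain query it describes. */
    static List<String> expand(String expression) {
        QueryExpander parser = new QueryExpander(expression);
        List<String> result = parser.parseAlternatives();
        if (parser.pos != parser.tokens.length) {
            throw new IllegalArgumentException("Unexpected token: " + parser.tokens[parser.pos]);
        }
        return result;
    }

    // alternatives := sequence ( '|' sequence )*
    private List<String> parseAlternatives() {
        List<String> out = new ArrayList<>(parseSequence());
        while (pos < tokens.length && tokens[pos].equals("|")) {
            pos++;
            out.addAll(parseSequence());
        }
        return out;
    }

    // sequence := ( word | '(' alternatives ')' )*
    private List<String> parseSequence() {
        List<String> prefixes = new ArrayList<>(List.of(""));
        while (pos < tokens.length && !tokens[pos].equals("|") && !tokens[pos].equals(")")) {
            List<String> next;
            if (tokens[pos].equals("(")) {
                pos++;
                next = parseAlternatives();
                if (pos >= tokens.length || !tokens[pos].equals(")")) {
                    throw new IllegalArgumentException("Missing ')'");
                }
                pos++;
            } else {
                next = List.of(tokens[pos++]);
            }
            // Cross product: every prefix so far continues with every alternative
            List<String> combined = new ArrayList<>();
            for (String p : prefixes) {
                for (String n : next) {
                    if (p.isEmpty()) combined.add(n);
                    else if (n.isEmpty()) combined.add(p);
                    else combined.add(p + " " + n);
                }
            }
            prefixes = combined;
        }
        return prefixes;
    }
}
```

Calling `QueryExpander.expand("( wiby | marginalia | kagi ) ( search engine | searchengine )")` yields the six queries listed above, in the same order. One token of lookahead is enough to choose the next rule, which is what keeps the parser this small.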
2024-04-16 15:31:05 +02:00
Viktor Lofgren
f434a8b492
(build) Upgrade jib plugin version
2024-04-16 15:25:23 +02:00
Viktor Lofgren
d2658d6f84
(sys) Add springboard service that can spawn multiple different marginalia services to make distribution easier.
2024-04-16 13:25:15 +02:00
Viktor Lofgren
8c559c8121
(conf) Add additional logic for discovering system root
2024-04-16 12:37:18 +02:00
Viktor Lofgren
2353c73c57
(encyclopedia) Index the full articles
...
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.
2024-04-16 12:10:13 +02:00
Viktor Lofgren
599e719ad4
(index) Fix priority search terms
...
This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-15 16:44:08 +02:00
Viktor Lofgren
b6d365bacd
(index) Clean up data model
...
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
2024-04-15 16:04:07 +02:00
Viktor Lofgren
52f0c0d336
(ngram) Grab titles separately when extracting ngrams from wiki data
2024-04-13 19:34:16 +02:00
Viktor Lofgren
be55f3f937
(zim) Fix title extractor
2024-04-13 19:33:47 +02:00
Viktor Lofgren
fda1c05164
(ngram) Correct |s|^|s|-normalization to use length and not count
2024-04-13 18:05:30 +02:00
Viktor Lofgren
1329d4abd8
(ngram) Correct size value in ngram lexicon generation, trim the terms better
2024-04-13 17:51:02 +02:00
Viktor Lofgren
f064992137
(ngram) Use simple blocking pool instead of FJP; split on underscores in article names.
2024-04-13 17:07:23 +02:00
Viktor Lofgren
8a81a480a1
(ngram) Only extract frequencies of title words, but use the body to increment the counters...
...
The sign of the counter is used to indicate whether a term has appeared as title. Until it's seen in the title, it's provisionally saved as a negative count.
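A minimal sketch of the sign trick, under stated assumptions: the class `TitleAwareCounter` and its methods are hypothetical, a plain `HashMap` stands in for whatever structure the lexicon uses, and counting the title occurrence itself as one hit is a choice made for this example only.

```java
import java.util.HashMap;
import java.util.Map;

final class TitleAwareCounter {
    // Negative value: seen only in body text so far (provisional).
    // Positive value: the term has also appeared as a title (confirmed).
    private final Map<String, Integer> counts = new HashMap<>();

    /** Body occurrence: grow the magnitude, keep the current sign. */
    void addBodyOccurrence(String term) {
        counts.merge(term, -1, (old, ignored) -> old > 0 ? old + 1 : old - 1);
    }

    /** Title occurrence: confirm the term by making its count positive. */
    void addTitleOccurrence(String term) {
        counts.merge(term, 1, (old, ignored) -> Math.abs(old) + 1);
    }

    /** Only confirmed (positive) counts are exported. */
    Map<String, Integer> confirmedCounts() {
        Map<String, Integer> out = new HashMap<>();
        counts.forEach((term, count) -> {
            if (count > 0) out.put(term, count);
        });
        return out;
    }
}
```

Body occurrences seen before the title is encountered are not lost; they are carried in the magnitude and become visible once the sign flips.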
2024-04-12 18:08:31 +02:00
Viktor Lofgren
d729c400e5
(query, minor) Remove debug statement
2024-04-12 17:52:55 +02:00
Viktor Lofgren
ad4810d991
(query, minor) Remove debug statement
2024-04-12 17:45:26 +02:00
Viktor Lofgren
6a67043537
(ngram) Clean up ngram lexicon code
...
This is both an optimization that removes some GC churn and a clean-up of the code that removes references to outdated concepts.
2024-04-12 17:45:06 +02:00
Viktor Lofgren
864d6c28e7
(segmentation) Pick best segmentation using |s|^|s|-style normalization
...
This is better than doing all segmentations possible at the same time.
2024-04-12 17:44:14 +02:00
Viktor Lofgren
bb6b51ad91
(ngram) Fix index range in NgramLexicon to avoid an exception
2024-04-12 10:13:25 +02:00
Viktor Lofgren
65e3caf402
(index) Clean up the code
2024-04-11 18:50:21 +02:00
Viktor Lofgren
b7d9a7ae89
(ngrams) Remove the vestigial logic for capturing permutations of n-grams
...
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-11 18:12:01 +02:00
Viktor Lofgren
ed73d79ec1
(qs) Clean up parsing code using new record matching
2024-04-11 17:36:08 +02:00
Viktor Lofgren
c538c25008
(term-freq-exporter) Reduce thread count and memory usage
2024-04-10 17:11:23 +02:00
Viktor Lofgren
4b47fadbab
(term-freq-exporter) Extract ngrams in term-frequency-exporter
2024-04-10 16:58:05 +02:00
Viktor Lofgren
fcdc843c15
(search) Fix outdated assumptions about the results
...
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-07 12:09:44 +02:00
Viktor Lofgren
dbdcf459a7
(minor) Remove dead code
2024-04-06 16:27:16 +02:00
Viktor Lofgren
ef25d60666
(index) Add origin trace information for index readers
...
This used to be supported by the system but got lost in refactoring at some point.
2024-04-06 13:28:14 +02:00
Viktor Lofgren
7f7021ce64
(sentence-extractor) Fix resource leak in sentence extractor
...
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.
The modified behavior checks for nullity before creating a new instance.
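The null check described above might look roughly like this; `ExpensiveModel` is a hypothetical stand-in for the lexicon and tagger, and the `synchronized` accessor is an assumption rather than something the commit states:

```java
// Hypothetical sketch of guarding a static field against repeated initialization.
class LazyStaticHolder {
    static int constructions = 0; // only here to make the effect observable

    static final class ExpensiveModel {
        ExpensiveModel() { constructions++; }
    }

    private static ExpensiveModel model;

    /** Creates the shared instance on first use only; later calls reuse it. */
    static synchronized ExpensiveModel getModel() {
        if (model == null) {
            model = new ExpensiveModel();
        }
        return model;
    }
}
```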
2024-04-05 18:52:58 +02:00
Viktor Lofgren
448a941de2
(encyclopedia) Fix memory issue in preconversion step
...
Use SimpleBlockingThreadPool instead of Java's work-stealing pool, as the latter causes runaway memory consumption in some circumstances, while SimpleBlockingThreadPool uses a bounded queue and always pushes back against the supplier if it can't hold any more tasks.
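To illustrate the push-back behavior, here is a sketch built on the JDK's `ThreadPoolExecutor` with a bounded queue. It is not the project's SimpleBlockingThreadPool; the class name, pool size and queue size are made up for the example:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in: a fixed pool whose submitters block when the queue is full.
class BoundedPoolSketch {
    static ThreadPoolExecutor newBoundedPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                (task, pool) -> {
                    try {
                        // Queue is full: block the submitting thread until there is room
                        pool.getQueue().put(task);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new RejectedExecutionException(e);
                    }
                });
    }

    /** Submits taskCount trivial tasks and returns how many actually ran. */
    static int runAll(int taskCount) {
        AtomicInteger done = new AtomicInteger();
        ThreadPoolExecutor pool = newBoundedPool(2, 4);
        for (int i = 0; i < taskCount; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

Because at most `queueSize` tasks are ever waiting, memory use stays bounded regardless of how fast the supplier produces work.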
2024-04-05 16:57:53 +02:00
Viktor Lofgren
5766da69ec
(gradle) Upgrade to Gradle 8.7
...
This will reduce the hassle of juggling JDK versions for JDK 22, which was not supported by Gradle 8.5.
2024-04-05 15:15:49 +02:00
Joshua Holland
617e633d7a
Update keywords docs use of explore to browse
...
I can't tell when this happened, but the proper keyword now seems to be browse and not explore.
2024-04-05 15:15:49 +02:00
Viktor Lofgren
b770a1143f
(run) Fix traefik middleware configuration
2024-04-05 15:15:49 +02:00
Viktor Lofgren
e1151ecf2a
(gradle) Upgrade to Gradle 8.7
...
This will reduce the hassle of juggling JDK versions for JDK 22, which was not supported by Gradle 8.5.
2024-04-05 15:12:38 +02:00
Viktor Lofgren
ae7c760772
(index) Clean up new index query code
2024-04-05 13:30:49 +02:00
Viktor Lofgren
81815f3e0a
(qs, index) New query model integrated with index service.
...
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-04 20:17:58 +02:00
Viktor
3890c413a3
Merge pull request #88 from jmholla/patch-1
...
Update keywords docs use of explore to browse
2024-04-01 09:14:02 +02:00
Joshua Holland
8e02f567d7
Update keywords docs use of explore to browse
...
I can't tell when this happened, but the proper keyword now seems to be browse and not explore.
2024-04-01 00:04:12 -05:00
Viktor Lofgren
87bb93e1d4
(qs, WIP) Fix edge cases in query compilation
...
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-03-29 12:40:27 +01:00
Viktor Lofgren
e596c929ac
(qs, WIP) Clean up dead code
2024-03-28 16:37:23 +01:00
Viktor Lofgren
9852b0e609
(qs, WIP) Tidy it up a bit
2024-03-28 14:18:26 +01:00
Viktor Lofgren
51b0d6c0d3
(qs, WIP) Tidy it up a bit
2024-03-28 14:09:17 +01:00
Viktor Lofgren
15391c7a88
(qs, WIP) Tidy it up a bit
2024-03-28 13:54:30 +01:00
Viktor Lofgren
fe62593286
(qs, WIP) Break up code and tidy it up a bit
2024-03-28 13:26:54 +01:00
Viktor Lofgren
4cc11e183c
(qs, WIP) Fix output determinism, fix tests
2024-03-28 13:11:26 +01:00
Viktor Lofgren
de8e753fc8
(run) Fix traefik middleware configuration
2024-03-28 13:03:12 +01:00
Viktor Lofgren
f82ebd7716
(WIP) Query rendering finally beginning to look like it works
2024-03-28 13:01:21 +01:00
Viktor Lofgren
bd0704d5a4
(*) Fix JDK22 migration issues
...
A few bizarre build errors cropped up when migrating to JDK22. Not at all sure what caused them, but they were easy to mitigate.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
1968485881
(docs) Upgrade to JDK22
2024-03-21 14:33:27 +01:00
Viktor Lofgren
002afca1c5
(sys) Upgrade to JDK22
...
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:33:27 +01:00
Your Name
411b3f3138
(run/install.sh) fix docker compose file
...
I was following the release demo video for v2024.01.0
https://www.youtube.com/watch?v=PNwMkenQQ24 and when I did 'docker
compose up' the containers couldn't resolve the DNS name for 'zookeeper'
I realized this was because the zookeeper container was using the
default docker network, so I specified the wmsa network explicitly.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
a4b810f511
WIP
2024-03-21 14:33:26 +01:00
Viktor
cd8f33f830
Merge pull request #86 from MarginaliaSearch/jdk-22
...
Lift JDK version to 22
2024-03-21 14:29:41 +01:00
Viktor Lofgren
824765b1ee
(*) Fix JDK22 migration issues
...
A few bizarre build errors cropped up when migrating to JDK22. Not at all sure what caused them, but they were easy to mitigate.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
9e8138f853
(docs) Upgrade to JDK22
2024-03-21 14:27:13 +01:00
Viktor Lofgren
fe8d583fdd
(sys) Upgrade to JDK22
...
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
0bd3365c24
(convert) Initial integration of segmentation data into the converter's keyword extraction logic
2024-03-19 14:28:42 +01:00
Viktor Lofgren
d8f4e7d72b
(qs) Retire NGramBloomFilter, integrate new segmentation model instead
2024-03-19 10:42:09 +01:00
Viktor Lofgren
afc047cd27
(control) GUI for exporting segmentation data from a wikipedia zim
2024-03-18 13:45:23 +01:00
Viktor Lofgren
00ef4f9803
(WIP) Partial integration of new query expansion code into the query-service
2024-03-18 13:16:49 +01:00
Viktor Lofgren
07e4d7ec6d
(WIP) Improve data extraction from wikipedia data
2024-03-18 13:16:00 +01:00
Viktor
258a344810
Merge pull request #85 from patrickbreen/master
...
(run/install.sh) fix docker compose file
2024-03-18 13:09:30 +01:00
Your Name
2a03014652
(run/install.sh) fix docker compose file
...
I was following the release demo video for v2024.01.0
https://www.youtube.com/watch?v=PNwMkenQQ24 and when I did 'docker
compose up' the containers couldn't resolve the DNS name for 'zookeeper'
I realized this was because the zookeeper container was using the
default docker network, so I specified the wmsa network explicitly.
2024-03-17 15:33:19 -04:00
Viktor Lofgren
8ae1f08095
(WIP) Implement first take of new query segmentation algorithm
2024-03-12 13:12:50 +01:00
Viktor Lofgren
57e6a12d08
(registry) Correct registerMonitor() behavior
...
The previous behavior would listen to too many changes and, based on zookeeper rather than curator assumptions about behavior, add an additional monitor on each invocation of each monitor (which always triggers on service state changes). Each monitor thus re-registered itself, effectively doubling the number of monitors whenever a service stopped or started, which in turn meant a lot of bizarre thrashing behavior even on changes in services that don't explicitly talk to each other.
This re-registering behavior is no longer done.
2024-03-06 12:22:15 +01:00
Viktor Lofgren
46423612e3
(refac) Merge service-discovery and service modules
...
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
29bf473d74
(encyclopedia) Add URLencoding to path element
...
This prevents corruption of the links to the sideloaded encyclopedia data when the article path contains characters that are not valid in a URL.
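A sketch of encoding a single path element with the JDK's `URLEncoder`; the helper names are hypothetical and the commit does not say which encoder the code actually uses:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building links to sideloaded articles.
class PathEncodingSketch {
    static String encodePathElement(String element) {
        // URLEncoder does form encoding, so a space becomes '+';
        // rewrite it to %20, which is what a URL path expects.
        // A literal '+' in the input is already escaped as %2B at this point.
        return URLEncoder.encode(element, StandardCharsets.UTF_8).replace("+", "%20");
    }

    static String articleUrl(String base, String articleName) {
        return base + "/" + encodePathElement(articleName);
    }
}
```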
2024-03-01 17:28:09 +01:00
Viktor Lofgren
9689f3faee
(domain-info) Fix incorrect array indexing
2024-02-29 18:56:09 +01:00
Viktor Lofgren
93fa58c93d
(domain-info) Fix incorrect array indexing
...
Using the id instead of idx when addressing the ranksArray caused exceptions.
2024-02-29 17:54:23 +01:00
Viktor Lofgren
186a98cc99
(doc) Fix wonky bullet lists
2024-02-28 17:43:05 +01:00
Viktor Lofgren
9993f265ca
(doc) Remove irrelevant text
2024-02-28 17:40:05 +01:00
Viktor Lofgren
144f967dbf
(misc) Tweak pool sizes
2024-02-28 16:23:02 +01:00
Viktor Lofgren
b31c9bb726
(docs) Update process docs
2024-02-28 15:21:33 +01:00
Viktor Lofgren
c0820b5e5c
(docs) Update service docs
2024-02-28 15:19:31 +01:00
Viktor Lofgren
65b8a1d5d9
(grpc) Reduce error spam
2024-02-28 14:44:48 +01:00
Viktor Lofgren
a0648844fb
(grpc) Reduce error spam
2024-02-28 14:35:29 +01:00
Viktor Lofgren
c4a27003c6
(docs) Fix formatting
2024-02-28 14:22:57 +01:00
Viktor Lofgren
41abd8982f
(math) Clean up error handling
2024-02-28 14:19:50 +01:00
Viktor Lofgren
86bbc1043e
(service) Clean up thread pool creation
2024-02-28 14:06:32 +01:00
Viktor Lofgren
9a045a0588
(index) Clean up index code
2024-02-28 13:09:47 +01:00
Viktor Lofgren
9415539b38
(docs) Update docs
2024-02-28 12:25:19 +01:00
Viktor Lofgren
84bab2783d
(docs) Fix fake news in docs
2024-02-28 12:16:45 +01:00
Viktor
0d6e7673e4
Merge pull request #81 from MarginaliaSearch/service-discovery
...
Zookeeper for service-discovery, kill service-client lib, refactor everything
2024-02-28 12:15:25 +01:00
Viktor Lofgren
d78e9e715f
(misc) Fix broken tests
2024-02-28 12:12:43 +01:00
Viktor Lofgren
a8ec59eb75
(conf) Add migration warning when ZOOKEEPER_HOSTS is not set.
2024-02-28 12:09:38 +01:00
Viktor Lofgren
20fc0ef13c
(gradle) Add task alias 'docker' for 'jibDockerBuild'
...
The change also moves the jib boilerplate to an include.
2024-02-28 11:59:15 +01:00
Viktor Lofgren
37ae8cb33c
Migrate the docker compose files
2024-02-28 11:48:16 +01:00
Viktor Lofgren
9f1649636e
Clean up documentation and rename domain-links
to link-graph
2024-02-28 11:40:39 +01:00
Viktor Lofgren
3a65fe8917
Add offload executor to GrpcChannelPoolFactory
2024-02-27 22:08:39 +01:00
Viktor Lofgren
99a6e56e99
(index-client) Increase thread count in index client
...
This should be a fair bit larger than the number of index nodes
2024-02-27 22:00:29 +01:00
Viktor Lofgren
e696fd9e92
(docs) Begin un-fucking the docs after refactoring
2024-02-27 21:22:21 +01:00
Viktor Lofgren
c943954bb4
(domain-info) Reduce memory usage
2024-02-27 21:22:21 +01:00
Viktor Lofgren
eaf836dc66
(service/grpc) Reduce thread count
...
Netty and GRPC by default spawn an incredible number of threads on high-core CPUs, which amounts to a fair bit of RAM usage.
Add custom executors that throttle this behavior.
2024-02-27 21:22:21 +01:00
Viktor Lofgren
dbf64b0987
(logs) Add the option for json logging
2024-02-27 21:22:20 +01:00
Viktor Lofgren
8d0af9548b
(search) Bot mitigation
...
Add the ability to indicate to the search service that a request is malicious, and to poison the results by providing randomly reordered old results instead.
2024-02-27 21:22:19 +01:00
Viktor Lofgren
67aa20ea2c
(array) Attempting to debug strange errors
2024-02-27 21:22:18 +01:00
Viktor Lofgren
5604e9f531
(query) Bump query length, see what happens :P
2024-02-27 21:22:17 +01:00
Viktor Lofgren
1a51ec2d69
(index) Index optimization
2024-02-27 21:22:17 +01:00
Viktor Lofgren
3eb0800742
(index) Improve granularity of candidate queue polling
2024-02-27 21:22:17 +01:00
Viktor Lofgren
427f3e922f
(index) Retire count operation, clean up index code.
2024-02-27 21:22:17 +01:00
Viktor Lofgren
823ca73a3f
(domain-ranking) Fix a crash during ranking when the edges of the similarity graph don't quite match the vertices of the link graph.
2024-02-27 21:22:17 +01:00
Viktor Lofgren
7fc0d4d786
(index) Observability for query execution queues
2024-02-27 21:22:17 +01:00
Viktor Lofgren
b8e336e809
(index) Reduce time allocation a bit
2024-02-27 21:22:17 +01:00
Viktor Lofgren
9429bf5c45
(index) Clean up
2024-02-27 21:22:17 +01:00
Viktor Lofgren
f7f0100174
(build) Make docker image registry and tag configurable in root build.gradle
2024-02-25 11:08:49 +01:00
Viktor Lofgren
fc00701a1e
(index) Experimental refactoring of the indexing functionality
2024-02-25 11:05:10 +01:00
Viktor Lofgren
09447f2ad2
(process service) Inherit parent's assertion status
2024-02-24 18:32:37 +01:00
Viktor Lofgren
ff0ef1eebc
(cleanup) Minor cleanups
2024-02-24 15:33:56 +01:00
Viktor Lofgren
1d34224416
(refac) Remove src/main from all source code paths.
...
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
2024-02-23 16:13:40 +01:00
Viktor Lofgren
56d35aa596
(refac) Move execution API out of executor service
2024-02-23 13:26:11 +01:00
Viktor Lofgren
2201b1a506
(refac) Clean up code issues
2024-02-23 11:39:19 +01:00
Viktor Lofgren
5cdb07023b
(refac) Clean up unused imports
2024-02-23 11:27:20 +01:00
Viktor Lofgren
6154e16951
(refac) Remove "distPath"
2024-02-23 11:22:02 +01:00
Viktor Lofgren
f4ff7185f0
(refac) Move process-mqapi out of api directory
2024-02-23 11:18:29 +01:00
Viktor Lofgren
6357d30ea0
Clean up docs
2024-02-22 19:53:20 +01:00
Viktor Lofgren
8d4ef982d0
Clean up docs
2024-02-22 19:37:59 +01:00
Viktor Lofgren
4740156cfa
Clean up docs
2024-02-22 18:18:58 +01:00
Viktor Lofgren
f8e7f75831
Move index to top level of code
2024-02-22 18:01:35 +01:00
Viktor Lofgren
085137ca63
* Extract the index functionality
2024-02-22 17:31:25 +01:00
Viktor Lofgren
3fd2a83184
* Extract the search-query function
2024-02-22 15:27:39 +01:00
Viktor Lofgren
66c1281301
(zk-registry) epic yak shaving WIP
...
Cleaning out a lot of old junk from the code, and one thing led to another...
* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```
channelPool.call(grpcStub::method)
           .async(executor) // <-- optional
           .run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services. Theoretically means we could ship the same code as either a monolith or a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP!
Missing is documentation and testing, and some more breaking apart of code.
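For readers unfamiliar with the fluent call style mentioned in the list above, it could be shaped roughly as follows. This is a hypothetical sketch with no gRPC dependency; the class names are invented, and the real channel pool also deals with ManagedChannel errors, which is omitted here:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.BiFunction;

// Hypothetical shape of a call(...).async(...).run(...) API; S plays the role of the stub.
class FluentCallSketch<S> {
    private final S stub;

    FluentCallSketch(S stub) { this.stub = stub; }

    /** Entry point: takes an unbound method reference such as Stub::method. */
    <A, R> Call<A, R> call(BiFunction<S, A, R> method) {
        return new Call<>(method);
    }

    final class Call<A, R> {
        private final BiFunction<S, A, R> method;

        Call(BiFunction<S, A, R> method) { this.method = method; }

        /** Blocking invocation; a real implementation would handle channel errors here. */
        R run(A argument) { return method.apply(stub, argument); }

        /** Optional step: switch to asynchronous execution on the given executor. */
        AsyncCall<A, R> async(Executor executor) { return new AsyncCall<>(this, executor); }
    }

    final class AsyncCall<A, R> {
        private final Call<A, R> call;
        private final Executor executor;

        AsyncCall(Call<A, R> call, Executor executor) {
            this.call = call;
            this.executor = executor;
        }

        CompletableFuture<R> run(A argument) {
            return CompletableFuture.supplyAsync(() -> call.run(argument), executor);
        }
    }
}
```

Keeping the method reference unbound until `run` means one wrapper can apply the same call to whichever stub or channel is currently healthy.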
2024-02-22 14:01:23 +01:00
Viktor Lofgren
73947d9eca
(zk-registry) Filter out phantom addresses in the registry
...
The change adds a hostname validation step to remove endpoints from the ZkServiceRegistry when they do not resolve. This is a scenario that primarily happens when running in docker, and the entire system is started and stopped.
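A sketch of such a validation step, assuming endpoints are plain hostnames. The names are hypothetical, and the resolver is passed in as a parameter so the filtering logic can be exercised without touching DNS:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: drop registry entries whose hostname no longer resolves.
class PhantomFilterSketch {
    /** Default resolver: true if the name resolves to some address. */
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /** Keeps only the hosts accepted by the resolver, preserving order. */
    static List<String> liveHosts(List<String> hosts, Predicate<String> resolver) {
        return hosts.stream().filter(resolver).collect(Collectors.toList());
    }
}
```

In production use the second argument would be `PhantomFilterSketch::resolves`.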
2024-02-20 18:09:11 +01:00
Viktor Lofgren
a69c0b2718
(grpc-client) Fix warmup crash
...
The warmup would sometimes crash during a cold start-up, because it could not get an API. Changed the warmup to just create a GrpcSingleNodeChannelPool for the node.
2024-02-20 18:03:57 +01:00
Viktor Lofgren
6c764bceeb
(doc) Update documentation for service-discovery
2024-02-20 16:09:49 +01:00
Viktor Lofgren
273aeb7bae
(doc) Update documentation with new gRPC service setup
2024-02-20 16:06:05 +01:00
Viktor Lofgren
d185858266
(minor) Add missing query parameter to ServiceEndpoint.toURL
2024-02-20 15:49:43 +01:00
Viktor Lofgren
453bd6064b
(minor) Add warm-up to GrpcMultiNodeChannelPool to speed up the initial messages
...
Without doing this, connections would be created lazily, which is probably never desirable.
2024-02-20 15:45:16 +01:00
Viktor Lofgren
904f2587cd
(minor) Add default ZOOKEEPER_HOSTS to service.env
2024-02-20 15:44:26 +01:00
Viktor Lofgren
14172312dc
(query-client) Fix query client
...
The query service delegates and aggregates IndexDomainLinksApiGrpc
messages to the index services. The query client was accidentally
also doing this, instead of talking to the query service.
Fixed so it correctly talks to the query service and nothing else.
2024-02-20 15:44:07 +01:00
Viktor Lofgren
c600d7aa47
(refac) Inject ServiceRegistry into WebsiteAdjacenciesCalculator
2024-02-20 15:42:32 +01:00
Viktor Lofgren
3c9234078a
(refac) Propagate ZOOKEEPER_HOSTS to spawned processes
2024-02-20 15:42:16 +01:00
Viktor Lofgren
ee8e0497ae
(refac) Move service discovery injection to a separate guice module
2024-02-20 15:41:04 +01:00
Viktor Lofgren
fd5d121648
(minor) Add WMSA_IN_DOCKER to all docker files
2024-02-20 15:39:46 +01:00
Viktor Lofgren
30bdb4b4e9
(config) Clean up service configuration for IP addresses
...
Adds new ways to configure the bind and external IP addresses for a service. Notably, if the environment variable WMSA_IN_DOCKER is present, the system will grab the HOSTNAME variable and announce that as the external address in the service registry.
The default bind address is also changed to be 0.0.0.0 only if WMSA_IN_DOCKER is present, otherwise 127.0.0.1; as this is a more secure default.
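The defaults described above can be sketched as a pair of pure functions over the environment. The class name and the `fallback` parameter are assumptions, and the other configuration options the commit adds are not modeled:

```java
import java.util.Map;

// Hypothetical sketch of the default address selection.
class AddressConfigSketch {
    /** Bind to all interfaces only inside docker; otherwise stay on loopback. */
    static String bindAddress(Map<String, String> env) {
        return env.containsKey("WMSA_IN_DOCKER") ? "0.0.0.0" : "127.0.0.1";
    }

    /** In docker, announce the container's HOSTNAME; else use the caller's fallback. */
    static String externalAddress(Map<String, String> env, String fallback) {
        if (env.containsKey("WMSA_IN_DOCKER")) {
            return env.getOrDefault("HOSTNAME", fallback);
        }
        return fallback;
    }
}
```

Passing `System.getenv()` as the map gives the live behavior.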
2024-02-20 14:22:48 +01:00
Viktor Lofgren
2ee492fb74
(gRPC) Bind gRPC services to an interface
...
By default, gRPC magically decides on an interface. The change will explicitly tell it what to use.
2024-02-20 14:22:47 +01:00
Viktor Lofgren
36a5c8b44c
(cleanup) Clean up code
2024-02-20 14:22:47 +01:00
Viktor Lofgren
07b625c58d
(query-client) Add support for fault-tolerant requests to single node services
...
Adding a method importantCall that will retry a failing request on each route until it succeeds or the routes run out.
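The retry loop might look something like this sketch. Only the name `importantCall` comes from the commit; the signature, the generic route type and the choice of exception are assumptions:

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: try each route in order until one call succeeds.
class RetryOnRoutesSketch {
    static <R, T> T importantCall(List<R> routes, Function<R, T> call) {
        RuntimeException last = null;
        for (R route : routes) {
            try {
                return call.apply(route);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next route
            }
        }
        // Every route failed (or there were none); surface the last error as the cause
        throw new IllegalStateException("All routes failed", last);
    }
}
```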
2024-02-20 14:16:05 +01:00
Viktor Lofgren
746a865106
(client) Fix handling of channel refreshes
...
The previous code made an incorrect assumption that all routes refer to the same node, and would overwrite the route list on each update. This led to storms of closing and opening channels whenever an update was received.
The new code is correctly aware that we may talk to multiple nodes.
2024-02-20 14:14:09 +01:00
Viktor
f85ec28a16
Merge branch 'master' into service-discovery
2024-02-20 11:44:12 +01:00
Viktor Lofgren
0307c55f9f
(refac) Zookeeper for service-discovery, kill service-client lib (WIP)
...
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.
A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
The last remaining REST service, the assistant-service, has been migrated to gRPC.
This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels.
Since it's no longer used by anything, RxJava has been removed as a dependency from the project.
Although the current state seems reasonably stable, this is a work-in-progress commit.
2024-02-20 11:41:14 +01:00
Viktor
d05c916491
Merge pull request #80 from MarginaliaSearch/ranking-algorithms
...
Clean up domain ranking code
2024-02-18 09:52:34 +01:00
Viktor Lofgren
c73e43f5c9
(recrawl) Mitigate recrawl-before-load footgun
...
In the scenario where an operator
* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data
The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file,
irrecoverably losing the crawl log making it impossible to load!
To mitigate the impact of similar problems, the change saves a backup of the old crawl log, as well as complains about this happening.
More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider to find them regardless of loaded state.
This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.
2024-02-18 09:23:20 +01:00
Viktor Lofgren
e61e7f44b9
(blacklist) Delay startup of blacklist
...
To help services start faster, the blacklist will no longer block until it's loaded. If such a behavior is desirable, a method was added to explicitly wait for the data.
2024-02-18 09:23:20 +01:00
Viktor Lofgren
f9b6ac03c6
(api) Clean up incorrect error handling in GrpcChannelPool
2024-02-18 08:45:35 +01:00
Viktor Lofgren
296ccc5f8e
(blacklist) Clean up blacklist impl
...
The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod.
This change moves the loading to a separate thread entirely. For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.
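A sketch of the load-in-background pattern using `CompletableFuture`. The class and method names are hypothetical, and treating every domain as not blacklisted until loading finishes is an assumption about the intended behavior:

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch: the constructor returns immediately, loading happens elsewhere.
class AsyncBlacklistSketch {
    private final CompletableFuture<Set<String>> loaded;

    AsyncBlacklistSketch(Supplier<Set<String>> loader) {
        // Runs the loader on a background thread (the common pool in this sketch)
        this.loaded = CompletableFuture.supplyAsync(loader);
    }

    /** Non-blocking check; answers false while the data is still loading. */
    boolean isBlacklisted(String domain) {
        return loaded.isDone()
                && !loaded.isCompletedExceptionally()
                && loaded.join().contains(domain);
    }

    /** For callers that need the blacklist to be definitely loaded. */
    void waitUntilLoaded() {
        loaded.join();
    }
}
```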
2024-02-18 08:16:48 +01:00
Viktor Lofgren
8cb5825617
(search) Temporarily disable the Popular filter
...
This filter currently does not distinguish itself very much from the unfiltered results, and lends the impression that the filters don't "do anything".
It may come back in some shape or form in the future, with some additional tweaking of the rankings...
2024-02-18 08:02:01 +01:00
Viktor Lofgren
cee707abd8
(crawler) Implement domain shuffling in DbCrawlSpecProvider
...
Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This change prevents overload of crawling the same server in parallel from different subdomains or crawling big domains all at once.
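As a rough illustration, shuffling a loaded domain list takes only a few lines with the JDK; the class name is hypothetical and the real provider works on richer records than plain strings:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: randomize crawl order so related domains are spread out.
class DomainShuffleSketch {
    static List<String> shuffled(List<String> domains, Random rng) {
        List<String> copy = new ArrayList<>(domains); // leave the input untouched
        Collections.shuffle(copy, rng);
        return copy;
    }
}
```

A database query typically returns sibling subdomains next to each other, which is why consuming the list in order tends to hit the same server from several crawler threads at once.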
2024-02-17 17:47:38 +01:00
Viktor Lofgren
92717a4832
(client) Refactor GrpcStubPool to handle error states
...
Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub.
The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.
2024-02-17 14:42:26 +01:00
Viktor Lofgren
37a7296759
(sideload) Clean up the sideloading code
...
Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach.
The reddit sideloader now uses the SideloaderProcessing class. It also properly sets js-attributes for the sideloaded documents.
The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.
2024-02-17 14:32:36 +01:00
Viktor Lofgren
ebbe49d17b
(sideload) Fix sideloading of explicitly selected stackexchange files
...
Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.
2024-02-17 13:24:04 +01:00
Viktor Lofgren
b7e330855f
(control) Update descriptive text in the control GUI
2024-02-16 20:32:31 +01:00
Viktor Lofgren
ac89224fb0
(domain-ranking) Remove lingering mentions of the algorithms field from the GUI
2024-02-16 20:28:37 +01:00
Viktor Lofgren
9ec262ae00
(domain-ranking) Integrate new ranking logic
...
The change deprecates the 'algorithm' field from the domain ranking set configuration. Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.
2024-02-16 20:22:01 +01:00
Viktor Lofgren
64acdb5f2a
(domain-ranking) Clean up domain ranking
...
The domain ranking code was admittedly a bit of a clown fiesta; at the same time buggy, fragile and inscrutable.
Migrating over to use JGraphT to store the link graph
when doing rankings, and using their PageRank implementation. Also added a modified version that does PersonalizedPageRank.
2024-02-16 18:04:58 +01:00
Viktor Lofgren
a175b36382
(search) Correct accidental regression of the SmallWeb filter
2024-02-15 18:16:56 +01:00
Viktor Lofgren
16526d283c
(search) Correct accidental regression of the Vintage filter
2024-02-15 18:13:34 +01:00
Viktor Lofgren
752e677555
(search) Expose getSearchTitle in DecoratedSearchResults
2024-02-15 13:56:44 +01:00
Viktor Lofgren
f796af1ae8
(search) Fix failed refactoring
2024-02-15 13:53:19 +01:00
Viktor Lofgren
2515993536
(search) Fix issue where searchTitle setting gets lost when searching again
...
It's important that the field names in SearchParameters match the fields referenced in search-form.hdb, otherwise they will get lost in transit.
2024-02-15 13:52:11 +01:00
Viktor Lofgren
66b3e71e56
(search) Expose more search options
...
This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias.
The QueryStrategy makes it possible to e.g. require a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period.
These options are added to the search interface. The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well.
The vintage filter is modified to add a temporal bias for the past.
2024-02-15 13:39:51 +01:00
Viktor Lofgren
652d151373
(process-models) Improve documentation
2024-02-15 12:21:12 +01:00
Viktor Lofgren
300b1a1b84
(index-query) Add some tests for the QueryFilter code
2024-02-15 12:03:30 +01:00
Viktor Lofgren
6c3b49417f
(index-query) Improve documentation and code quality
2024-02-15 11:33:50 +01:00
Viktor Lofgren
dcc5cfb7c0
(index-journal) Improve documentation and code quality
2024-02-15 10:51:49 +01:00
Viktor
d970836605
Merge pull request #79 from MarginaliaSearch/reddit
...
(converter) Loader for reddit data
Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.
Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.
Tests were written for this, but all require local reddit data which can't be distributed with the source code. If the data cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run.
The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.
2024-02-15 09:17:56 +01:00
Viktor Lofgren
8021bd0aae
(control) Sort upload listing results
...
Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename.
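The ordering can be expressed as a two-level comparator. In this sketch the `Entry` record is hypothetical, and listing directories before files is an assumption, since the commit only says directory status is the first sort key:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the upload listing order.
class UploadListingSketch {
    record Entry(String name, boolean directory) {}

    // false sorts before true, so negating the flag puts directories first
    static final Comparator<Entry> LISTING_ORDER =
            Comparator.comparing((Entry e) -> !e.directory())
                      .thenComparing(Entry::name);

    static List<String> sortedNames(List<Entry> entries) {
        return entries.stream()
                .sorted(LISTING_ORDER)
                .map(Entry::name)
                .collect(Collectors.toList());
    }
}
```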
The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.
2024-02-15 09:13:40 +01:00
Viktor Lofgren
8f91156d80
(control) Improve sideload UX
...
The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable.
Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc. It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.
2024-02-14 18:38:20 +01:00
Viktor Lofgren
fab36d6e63
(converter) Loader for reddit data
...
Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.
Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.
Tests were written for this, but all require local reddit data which can't be distributed with the source code. If the data cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run.
The change also refactors the sideloading a bit since it was a bit messy.
2024-02-14 17:35:44 +01:00
Viktor Lofgren
3d54879c14
(API, minor) Clean up comments.
2024-02-14 12:09:16 +01:00
Viktor Lofgren
e17fcde865
(API, minor) Remove unnecessary inject.
2024-02-14 12:05:50 +01:00
Viktor Lofgren
6950dffcb4
(API) Fix result order in API results
...
These results should be presented in the same order as their ranking score.
2024-02-14 11:47:14 +01:00