1
1
mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

120 Commits

Author SHA1 Message Date
Viktor Lofgren
554de21f68 (converter) Disable language keyword 2025-09-15 09:49:04 +02:00
Viktor Lofgren
8ca6209260 (refac) Fold ft-anchor-keywords into converting-process 2025-09-03 13:03:38 +02:00
Viktor Lofgren
acb9ec7b15 (refac) Consistently use 'languageIsoCode' for the language field 2025-09-03 12:54:18 +02:00
Viktor Lofgren
b29ba3e228 (language) Integrate new configurable POS patterns with keyword matchers 2025-08-29 10:55:47 +02:00
Viktor Lofgren
de67006c4f (language) Initial integration of new language configuration utility 2025-08-29 10:55:47 +02:00
Viktor Lofgren
45dc6412c1 (converter) Add language column to slop tables 2025-08-29 10:55:47 +02:00
Viktor Lofgren
b3b95edcb5 (converter) Bypass some of the grammar processing in the keyword extraction depending on language selection 2025-08-29 10:55:47 +02:00
Viktor Lofgren
338d300e1a (converter) Clean up spans-handling
This code was unnecessarily difficult to follow with repeated packing and re-packing of the same data.
2025-08-29 10:55:47 +02:00
Viktor Lofgren
fa685bf1f4 (converter) Add Language field to ProcessedDocumentDetails 2025-08-29 10:55:47 +02:00
Viktor Lofgren
d79a3e2b2a (converter) Tag documents by language in the index as a keyword 2025-08-29 10:55:47 +02:00
Viktor Lofgren
291ff0c4de (deps) Upgrade crawler commons to fix robots.txt-parser bug 2025-08-15 00:13:15 +02:00
Viktor Lofgren
c2601bac78 (converter) Remove unnecessary allocation of a 16 KB byte buffer 2025-07-24 13:25:37 +02:00
Viktor Lofgren
6cfb143c15 (sample) Compress sample HTML data and introduce new API for only getting requests 2025-07-21 13:55:25 +02:00
Viktor Lofgren
23c818281b (converter) Reduce DomSample logging for NOT_FOUND 2025-07-21 13:37:55 +02:00
Viktor Lofgren
8aad253cf6 (converter) Add more logging around dom sample data retrieval errors 2025-07-21 13:26:38 +02:00
Viktor Lofgren
a23ec521fe (converter) Ensure features is mutable on DetailsWithWords as this is assumed later 2025-07-21 12:50:04 +02:00
Viktor Lofgren
e998692900 (converter) Ensure converter works well even when dom sample data is unavailable 2025-07-20 19:24:40 +02:00
Viktor Lofgren
6fdf477c18 (refac) Move DomSampleClassification to top level 2025-07-19 18:41:35 +02:00
Viktor Lofgren
f38daeb036 (WIP) First stab at a GUI for viewing network traffic
The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter.

The change also adds a parser for DDG's tracker radar data.
2025-07-18 13:58:57 +02:00
Viktor Lofgren
b91354925d (converter) Index documents even when they are short
... but assign short documents a special flag and penalize them in index lookups
2025-07-14 12:24:25 +02:00
Viktor Lofgren
3f85c9c154 (refac) Clean up code 2025-07-14 11:55:21 +02:00
Viktor Lofgren
ceaf32fb90 (converter) Integrate dom sample features into the converter 2025-07-13 01:38:28 +02:00
Viktor Lofgren
b57db01415 (converter) Clean out some old and redundant advertisement and tracking detection code 2025-07-11 19:32:25 +02:00
Viktor Lofgren
ce7d522608 (converter) First basic hook-in of the new dom sample classifier into the converter workflow 2025-07-11 16:57:37 +02:00
Viktor Lofgren
18649b6ee9 (converter) Move DomSampleClassifier to converter's code tree 2025-07-11 16:12:48 +02:00
Viktor Lofgren
f6417aef1a (converter) Additional code cleanup 2025-07-11 15:58:48 +02:00
Viktor Lofgren
2aa7e376b0 (converter) Clean up code around document deduplication 2025-07-11 15:54:28 +02:00
Viktor Lofgren
1ac0bab0b8 (converter) Also exclude length checks when lenient processing is enabled 2025-07-08 20:37:53 +02:00
Viktor Lofgren
08b45ed10a (converter) Add system property converter.lenientProcessing to disable most disqualification checks 2025-07-08 19:44:51 +02:00
Viktor Lofgren
f2cfb91973 (converter) Add audit log of converter errors and rejections 2025-07-08 19:15:41 +02:00
Viktor Lofgren
1c128e6d82 (crawler) Add request time to crawl data
This is an interesting indicator of website quality.
2025-05-19 14:02:03 +02:00
Viktor Lofgren
890f521d0d (pdf) Fix crash for some bold lines 2025-05-18 13:05:05 +02:00
Viktor Lofgren
306232fb54 (pdf) Fix handling of a few corner cases
Deal better with documents which change font on blank spaces.
2025-05-13 18:44:28 +02:00
Viktor Lofgren
879e6a9424 (pdf) Identify additional headings based on font weight 2025-05-11 16:35:52 +02:00
Viktor Lofgren
fba3455732 (pdf) Clean up code 2025-05-11 16:35:52 +02:00
Viktor Lofgren
14283da7f5 (pdf) Clean up generated DOM
Sometimes empty <p>-tags are inserted, which messes with the header joining process.  Removes those nodes.
2025-05-11 15:12:09 +02:00
Viktor Lofgren
93df4d1fc0 (pdf) Improve summary extraction for PDFs 2025-05-11 14:33:11 +02:00
Viktor Lofgren
b12a0b998c (pdf) Use smarter heuristics for paragraph splitting
We look at the median line distance, with outliers removed, to figure out when to break lines, as the original approach works poorly with e.g. double line spaced documents.
2025-05-11 14:29:42 +02:00
Viktor Lofgren
8428111771 (pdf) Fix for exception when no text positions are available 2025-05-10 15:12:02 +02:00
Viktor Lofgren
e9fd4415ef (pdf) Merge consecutive headings.
Headings don't follow the same indentation rules as prose and tend to be cut off into multiple "paragraphs" by the text extractor.
2025-05-10 14:38:43 +02:00
Viktor Lofgren
4c95c3dcad (pdf) Don't look for headings below 75% of the max y-position 2025-05-10 14:38:02 +02:00
Viktor Lofgren
4431dae7ac (refac) Rename HtmlStandard -> DocumentFormat
The old model made some sense when we only supported HTML and to some extent plain text, but having PDF in an enum called HtmlFormat is a bit of a stretch.
2025-05-10 13:47:26 +02:00
Viktor Lofgren
4df4d0a7a8 (pdf) Increase line spacing tolerance for better paragraph handling 2025-05-10 13:34:04 +02:00
Viktor Lofgren
9f05083b94 (pdf) Add the capability to identify headings
This change vendors pdfbox'es PDFTextStripper and modifies it to be able to heuristically identify headings based on their font size, as this is a very useful relevance signal for the search engine, and helps identify the correct title of the article.
2025-05-09 14:04:04 +02:00
Viktor Lofgren
36889950e8 (pdf) Migrate to PDFBox 3.0.5 and suppress log spam
PDFBox 2.x uses commons logging, which does not route through SLF4j, and thus is a hassle to configure; and is extremely verbose in its default logging settings.

Migrating to PDFBox 3.x lets us use slf4j to address the log spam by filtering out the noisy methods.
2025-05-08 18:03:26 +02:00
Viktor Lofgren
c96a94878b (pdf) Add feature to make pdf-files searchable with format:pdf 2025-05-08 18:03:26 +02:00
Viktor Lofgren
1c57d7d73a (pdf) Clean up code 2025-05-08 18:03:26 +02:00
Viktor Lofgren
a443d22356 (pdf) Flag the file as a PDF file in the GUI 2025-05-08 18:03:26 +02:00
Viktor Lofgren
aa59d4afa4 (pdf) Somewhat improve title and summary extraction 2025-05-08 18:03:26 +02:00
Viktor Lofgren
df0f18d0e7 (pdf) Read title 2025-05-08 18:03:26 +02:00