Viktor Lofgren
554de21f68
(converter) Disable language keyword
2025-09-15 09:49:04 +02:00
Viktor Lofgren
8ca6209260
(refac) Fold ft-anchor-keywords into converting-process
2025-09-03 13:03:38 +02:00
Viktor Lofgren
acb9ec7b15
(refac) Consistently use 'languageIsoCode' for the language field
2025-09-03 12:54:18 +02:00
Viktor Lofgren
b29ba3e228
(language) Integrate new configurable POS patterns with keyword matchers
2025-08-29 10:55:47 +02:00
Viktor Lofgren
de67006c4f
(language) Initial integration of new language configuration utility
2025-08-29 10:55:47 +02:00
Viktor Lofgren
45dc6412c1
(converter) Add language column to slop tables
2025-08-29 10:55:47 +02:00
Viktor Lofgren
b3b95edcb5
(converter) Bypass some of the grammar processing in the keyword extraction depending on language selection
2025-08-29 10:55:47 +02:00
Viktor Lofgren
338d300e1a
(converter) Clean up spans-handling
This code was unnecessarily difficult to follow with repeated packing and re-packing of the same data.
2025-08-29 10:55:47 +02:00
Viktor Lofgren
fa685bf1f4
(converter) Add Language field to ProcessedDocumentDetails
2025-08-29 10:55:47 +02:00
Viktor Lofgren
d79a3e2b2a
(converter) Tag documents by language in the index as a keyword
2025-08-29 10:55:47 +02:00
Viktor Lofgren
291ff0c4de
(deps) Upgrade crawler commons to fix robots.txt-parser bug
2025-08-15 00:13:15 +02:00
Viktor Lofgren
c2601bac78
(converter) Remove unnecessary allocation of a 16 KB byte buffer
2025-07-24 13:25:37 +02:00
Viktor Lofgren
6cfb143c15
(sample) Compress sample HTML data and introduce new API for only getting requests
2025-07-21 13:55:25 +02:00
Viktor Lofgren
23c818281b
(converter) Reduce DomSample logging for NOT_FOUND
2025-07-21 13:37:55 +02:00
Viktor Lofgren
8aad253cf6
(converter) Add more logging around dom sample data retrieval errors
2025-07-21 13:26:38 +02:00
Viktor Lofgren
a23ec521fe
(converter) Ensure features is mutable on DetailsWithWords as this is assumed later
2025-07-21 12:50:04 +02:00
Viktor Lofgren
e998692900
(converter) Ensure converter works well even when dom sample data is unavailable
2025-07-20 19:24:40 +02:00
Viktor Lofgren
6fdf477c18
(refac) Move DomSampleClassification to top level
2025-07-19 18:41:35 +02:00
Viktor Lofgren
f38daeb036
(WIP) First stab at a GUI for viewing network traffic
The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter.
The change also adds a parser for DuckDuckGo's Tracker Radar data.
2025-07-18 13:58:57 +02:00
Viktor Lofgren
b91354925d
(converter) Index documents even when they are short
... but assign short documents a special flag and penalize them in index lookups
2025-07-14 12:24:25 +02:00
Viktor Lofgren
3f85c9c154
(refac) Clean up code
2025-07-14 11:55:21 +02:00
Viktor Lofgren
ceaf32fb90
(converter) Integrate dom sample features into the converter
2025-07-13 01:38:28 +02:00
Viktor Lofgren
b57db01415
(converter) Clean out some old and redundant advertisement and tracking detection code
2025-07-11 19:32:25 +02:00
Viktor Lofgren
ce7d522608
(converter) First basic hook-in of the new dom sample classifier into the converter workflow
2025-07-11 16:57:37 +02:00
Viktor Lofgren
18649b6ee9
(converter) Move DomSampleClassifier to converter's code tree
2025-07-11 16:12:48 +02:00
Viktor Lofgren
f6417aef1a
(converter) Additional code cleanup
2025-07-11 15:58:48 +02:00
Viktor Lofgren
2aa7e376b0
(converter) Clean up code around document deduplication
2025-07-11 15:54:28 +02:00
Viktor Lofgren
1ac0bab0b8
(converter) Also exclude length checks when lenient processing is enabled
2025-07-08 20:37:53 +02:00
Viktor Lofgren
08b45ed10a
(converter) Add system property converter.lenientProcessing to disable most disqualification checks
2025-07-08 19:44:51 +02:00
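A minimal sketch of how a flag like converter.lenientProcessing (commit 08b45ed10a above) could gate a disqualification check. Only the property name comes from the commit message; the class name, the MIN_LENGTH threshold, and the choice of Boolean.getBoolean to read the property are assumptions for illustration, not the converter's actual code.

```java
final class LenientProcessingSketch {
    // Hypothetical minimum length; the converter's real thresholds are not shown in this log.
    static final int MIN_LENGTH = 100;

    /** True only when the JVM was started with -Dconverter.lenientProcessing=true. */
    public static boolean isLenient() {
        return Boolean.getBoolean("converter.lenientProcessing");
    }

    /** Returns true if the document should be rejected for being too short. */
    public static boolean rejectForLength(String text, boolean lenient) {
        if (lenient) {
            return false; // lenient mode skips the length check entirely
        }
        return text.length() < MIN_LENGTH;
    }

    public static void main(String[] args) {
        String shortDoc = "tiny";
        System.out.println("rejected: " + rejectForLength(shortDoc, isLenient()));
    }
}
```

Reading the flag once and passing it in as a parameter keeps the check itself easy to test without touching global JVM state.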
Viktor Lofgren
f2cfb91973
(converter) Add audit log of converter errors and rejections
2025-07-08 19:15:41 +02:00
Viktor Lofgren
1c128e6d82
(crawler) Add request time to crawl data
This is an interesting indicator of website quality.
2025-05-19 14:02:03 +02:00
Viktor Lofgren
890f521d0d
(pdf) Fix crash for some bold lines
2025-05-18 13:05:05 +02:00
Viktor Lofgren
306232fb54
(pdf) Fix handling of a few corner cases
Deal better with documents which change font on blank spaces.
2025-05-13 18:44:28 +02:00
Viktor Lofgren
879e6a9424
(pdf) Identify additional headings based on font weight
2025-05-11 16:35:52 +02:00
Viktor Lofgren
fba3455732
(pdf) Clean up code
2025-05-11 16:35:52 +02:00
Viktor Lofgren
14283da7f5
(pdf) Clean up generated DOM
Sometimes empty <p> tags are inserted, which interferes with the heading joining process. This change removes those nodes.
2025-05-11 15:12:09 +02:00
Viktor Lofgren
93df4d1fc0
(pdf) Improve summary extraction for PDFs
2025-05-11 14:33:11 +02:00
Viktor Lofgren
b12a0b998c
(pdf) Use smarter heuristics for paragraph splitting
We look at the median line distance, with outliers removed, to decide where paragraphs break, as the original approach works poorly with e.g. double-spaced documents.
2025-05-11 14:29:42 +02:00
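The median-based splitting described in commit b12a0b998c above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the class and method names are hypothetical, y-positions are assumed to increase down the page, the outlier rule (drop gaps larger than twice the raw median) is an assumed choice, and the tolerance factor is left to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParagraphSplitSketch {
    /** Median of a non-empty array; the input is not modified. */
    public static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Median gap after discarding gaps larger than twice the raw median (assumed outlier rule). */
    public static double typicalGap(double[] gaps) {
        double raw = median(gaps);
        double[] kept = Arrays.stream(gaps).filter(g -> g <= 2 * raw).toArray();
        return kept.length == 0 ? raw : median(kept);
    }

    /** Indices of lines that start a new paragraph, given each line's y-position in reading order. */
    public static List<Integer> paragraphStarts(double[] lineY, double tolerance) {
        List<Integer> starts = new ArrayList<>();
        if (lineY.length == 0) return starts;
        starts.add(0); // the first line always opens a paragraph
        if (lineY.length == 1) return starts;

        double[] gaps = new double[lineY.length - 1];
        for (int i = 1; i < lineY.length; i++) {
            gaps[i - 1] = lineY[i] - lineY[i - 1];
        }
        double typical = typicalGap(gaps);
        for (int i = 1; i < lineY.length; i++) {
            // A gap clearly larger than the typical line distance marks a paragraph break.
            if (gaps[i - 1] > tolerance * typical) starts.add(i);
        }
        return starts;
    }
}
```

For a double-spaced page with lines at y = 0, 24, 48, 96, 120 and a tolerance of 1.5, the typical gap is 24, so only the 48-unit gap counts as a break and the result is [0, 3]. Because the threshold scales with the document's own spacing, double spacing does not turn every line into its own paragraph, which a fixed threshold tuned for single-spaced text would.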
Viktor Lofgren
8428111771
(pdf) Fix for exception when no text positions are available
2025-05-10 15:12:02 +02:00
Viktor Lofgren
e9fd4415ef
(pdf) Merge consecutive headings
Headings don't follow the same indentation rules as prose and tend to be cut off into multiple "paragraphs" by the text extractor.
2025-05-10 14:38:43 +02:00
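A rough sketch of the heading merge described in commit e9fd4415ef above. The Block record and the flat list representation are hypothetical stand-ins chosen for illustration; the actual change presumably operates on the DOM generated from the PDF, which this log does not show.

```java
import java.util.ArrayList;
import java.util.List;

final class HeadingMergeSketch {
    /** Hypothetical block type: a run of extracted text flagged as heading or body. */
    record Block(boolean heading, String text) {}

    /** Joins each run of adjacent heading blocks into one heading; body blocks pass through unchanged. */
    public static List<Block> mergeHeadings(List<Block> blocks) {
        List<Block> out = new ArrayList<>();
        for (Block b : blocks) {
            int last = out.size() - 1;
            if (b.heading() && last >= 0 && out.get(last).heading()) {
                // The previous block is also a heading: extend it instead of starting a new one.
                out.set(last, new Block(true, out.get(last).text() + " " + b.text()));
            } else {
                out.add(b);
            }
        }
        return out;
    }
}
```

With this sketch, a title that the text extractor cut into "A Study of" and "PDF Parsing" comes back as the single heading "A Study of PDF Parsing", while any body block in between would keep two headings apart.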
Viktor Lofgren
4c95c3dcad
(pdf) Don't look for headings below 75% of the max y-position
2025-05-10 14:38:02 +02:00
Viktor Lofgren
4431dae7ac
(refac) Rename HtmlStandard -> DocumentFormat
The old model made some sense when we only supported HTML and, to some extent, plain text, but having PDF in an enum called HtmlStandard is a bit of a stretch.
2025-05-10 13:47:26 +02:00
Viktor Lofgren
4df4d0a7a8
(pdf) Increase line spacing tolerance for better paragraph handling
2025-05-10 13:34:04 +02:00
Viktor Lofgren
9f05083b94
(pdf) Add the capability to identify headings
This change vendors PDFBox's PDFTextStripper and modifies it to heuristically identify headings based on their font size. This is a very useful relevance signal for the search engine, and helps identify the correct title of the article.
2025-05-09 14:04:04 +02:00
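One way a font-size heuristic like the one in commit 9f05083b94 above could work, as a sketch only. The commit message says no more than that headings are identified from font size; the Line record, the rule that the body size is the size covering the most characters, and the 1.2 factor are all assumptions made for this example, and the code does not use or reproduce the vendored PDFTextStripper.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FontSizeHeadingSketch {
    /** Hypothetical extracted line: its text and the font size it was set in. */
    record Line(String text, float fontSize) {}

    /** Font size covering the most characters, taken here to be the body text size. */
    public static float bodyFontSize(List<Line> lines) {
        Map<Float, Integer> charsBySize = new HashMap<>();
        for (Line l : lines) {
            charsBySize.merge(l.fontSize(), l.text().length(), Integer::sum);
        }
        float best = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> e : charsBySize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Assumed rule: a line set at least 20% larger than the body text is a heading candidate. */
    public static boolean isHeading(Line line, float bodySize) {
        return line.fontSize() >= 1.2f * bodySize;
    }
}
```

Measuring against the document's own dominant font size, rather than a fixed point size, lets the same rule apply to documents typeset at different scales.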
Viktor Lofgren
36889950e8
(pdf) Migrate to PDFBox 3.0.5 and suppress log spam
PDFBox 2.x uses Commons Logging, which does not route through SLF4J; this makes it a hassle to configure, and its default logging settings are extremely verbose.
Migrating to PDFBox 3.x lets us use SLF4J to address the log spam by filtering out the noisy methods.
2025-05-08 18:03:26 +02:00
Viktor Lofgren
c96a94878b
(pdf) Add feature to make pdf-files searchable with format:pdf
2025-05-08 18:03:26 +02:00
Viktor Lofgren
1c57d7d73a
(pdf) Clean up code
2025-05-08 18:03:26 +02:00
Viktor Lofgren
a443d22356
(pdf) Flag the file as a PDF file in the GUI
2025-05-08 18:03:26 +02:00
Viktor Lofgren
aa59d4afa4
(pdf) Somewhat improve title and summary extraction
2025-05-08 18:03:26 +02:00
Viktor Lofgren
df0f18d0e7
(pdf) Read title
2025-05-08 18:03:26 +02:00