MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

Author	SHA1	Message	Date
Viktor Lofgren	554de21f68	(converter) Disable language keyword	2025-09-15 09:49:04 +02:00
Viktor Lofgren	8ca6209260	(refac) Fold ft-anchor-keywords into converting-process	2025-09-03 13:03:38 +02:00
Viktor Lofgren	acb9ec7b15	(refac) Consistently use 'languageIsoCode' for the language field	2025-09-03 12:54:18 +02:00
Viktor Lofgren	b29ba3e228	(language) Integrate new configurable POS patterns with keyword matchers	2025-08-29 10:55:47 +02:00
Viktor Lofgren	de67006c4f	(language) Initial integration of new language configuration utility	2025-08-29 10:55:47 +02:00
Viktor Lofgren	45dc6412c1	(converter) Add language column to slop tables	2025-08-29 10:55:47 +02:00
Viktor Lofgren	b3b95edcb5	(converter) Bypass some of the grammar processing in the keyword extraction depending on language selection	2025-08-29 10:55:47 +02:00
Viktor Lofgren	338d300e1a	(converter) Clean up spans-handling. This code was unnecessarily difficult to follow with repeated packing and re-packing of the same data.	2025-08-29 10:55:47 +02:00
Viktor Lofgren	fa685bf1f4	(converter) Add Language field to ProcessedDocumentDetails	2025-08-29 10:55:47 +02:00
Viktor Lofgren	d79a3e2b2a	(converter) Tag documents by language in the index as a keyword	2025-08-29 10:55:47 +02:00
Viktor Lofgren	291ff0c4de	(deps) Upgrade crawler commons to fix robots.txt-parser bug	2025-08-15 00:13:15 +02:00
Viktor Lofgren	c2601bac78	(converter) Remove unnecessary allocation of a 16 KB byte buffer	2025-07-24 13:25:37 +02:00
Viktor Lofgren	6cfb143c15	(sample) Compress sample HTML data and introduce new API for only getting requests	2025-07-21 13:55:25 +02:00
Viktor Lofgren	23c818281b	(converter) Reduce DomSample logging for NOT_FOUND	2025-07-21 13:37:55 +02:00
Viktor Lofgren	8aad253cf6	(converter) Add more logging around dom sample data retrieval errors	2025-07-21 13:26:38 +02:00
Viktor Lofgren	a23ec521fe	(converter) Ensure features is mutable on DetailsWithWords as this is assumed later	2025-07-21 12:50:04 +02:00
Viktor Lofgren	e998692900	(converter) Ensure converter works well even when dom sample data is unavailable	2025-07-20 19:24:40 +02:00
Viktor Lofgren	6fdf477c18	(refac) Move DomSampleClassification to top level	2025-07-19 18:41:35 +02:00
Viktor Lofgren	f38daeb036	(WIP) First stab at a GUI for viewing network traffic. The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter. The change also adds a parser for DDG's tracker radar data.	2025-07-18 13:58:57 +02:00
Viktor Lofgren	b91354925d	(converter) Index documents even when they are short ... but assign short documents a special flag and penalize them in index lookups	2025-07-14 12:24:25 +02:00
Viktor Lofgren	3f85c9c154	(refac) Clean up code	2025-07-14 11:55:21 +02:00
Viktor Lofgren	ceaf32fb90	(converter) Integrate dom sample features into the converter	2025-07-13 01:38:28 +02:00
Viktor Lofgren	b57db01415	(converter) Clean out some old and redundant advertisement and tracking detection code	2025-07-11 19:32:25 +02:00
Viktor Lofgren	ce7d522608	(converter) First basic hook-in of the new dom sample classifier into the converter workflow	2025-07-11 16:57:37 +02:00
Viktor Lofgren	18649b6ee9	(converter) Move DomSampleClassifier to converter's code tree	2025-07-11 16:12:48 +02:00
Viktor Lofgren	f6417aef1a	(converter) Additional code cleanup	2025-07-11 15:58:48 +02:00
Viktor Lofgren	2aa7e376b0	(converter) Clean up code around document deduplication	2025-07-11 15:54:28 +02:00
Viktor Lofgren	1ac0bab0b8	(converter) Also exclude length checks when lenient processing is enabled	2025-07-08 20:37:53 +02:00
Viktor Lofgren	08b45ed10a	(converter) Add system property converter.lenientProcessing to disable most disqualification checks	2025-07-08 19:44:51 +02:00
Viktor Lofgren	f2cfb91973	(converter) Add audit log of converter errors and rejections	2025-07-08 19:15:41 +02:00
Viktor Lofgren	1c128e6d82	(crawler) Add request time to crawl data. This is an interesting indicator of website quality.	2025-05-19 14:02:03 +02:00
Viktor Lofgren	890f521d0d	(pdf) Fix crash for some bold lines	2025-05-18 13:05:05 +02:00
Viktor Lofgren	306232fb54	(pdf) Fix handling of a few corner cases. Deal better with documents which change font on blank spaces.	2025-05-13 18:44:28 +02:00
Viktor Lofgren	879e6a9424	(pdf) Identify additional headings based on font weight	2025-05-11 16:35:52 +02:00
Viktor Lofgren	fba3455732	(pdf) Clean up code	2025-05-11 16:35:52 +02:00
Viktor Lofgren	14283da7f5	(pdf) Clean up generated DOM. Sometimes empty <p>-tags are inserted, which messes with the header joining process. Removes those nodes.	2025-05-11 15:12:09 +02:00
Viktor Lofgren	93df4d1fc0	(pdf) Improve summary extraction for PDFs	2025-05-11 14:33:11 +02:00
Viktor Lofgren	b12a0b998c	(pdf) Use smarter heuristics for paragraph splitting. We look at the median line distance, with outliers removed, to figure out when to break lines, as the original approach works poorly with e.g. double line spaced documents.	2025-05-11 14:29:42 +02:00
Viktor Lofgren	8428111771	(pdf) Fix for exception when no text positions are available	2025-05-10 15:12:02 +02:00
Viktor Lofgren	e9fd4415ef	(pdf) Merge consecutive headings. Headings don't follow the same indentation rules as prose and tend to be cut off into multiple "paragraphs" by the text extractor.	2025-05-10 14:38:43 +02:00
Viktor Lofgren	4c95c3dcad	(pdf) Don't look for headings below 75% of the max y-position	2025-05-10 14:38:02 +02:00
Viktor Lofgren	4431dae7ac	(refac) Rename HtmlStandard -> DocumentFormat. The old model made some sense when we only supported HTML and to some extent plain text, but having PDF in an enum called HtmlFormat is a bit of a stretch.	2025-05-10 13:47:26 +02:00
Viktor Lofgren	4df4d0a7a8	(pdf) Increase line spacing tolerance for better paragraph handling	2025-05-10 13:34:04 +02:00
Viktor Lofgren	9f05083b94	(pdf) Add the capability to identify headings. This change vendors pdfbox'es PDFTextStripper and modifies it to be able to heuristically identify headings based on their font size, as this is a very useful relevance signal for the search engine, and helps identify the correct title of the article.	2025-05-09 14:04:04 +02:00
Viktor Lofgren	36889950e8	(pdf) Migrate to PDFBox 3.0.5 and suppress log spam. PDFBox 2.x uses commons logging, which does not route through SLF4j, and thus is a hassle to configure; and is extremely verbose in its default logging settings. Migrating to PDFBox 3.x lets us use slf4j to address the log spam by filtering out the noisy methods.	2025-05-08 18:03:26 +02:00
Viktor Lofgren	c96a94878b	(pdf) Add feature to make pdf-files searchable with format:pdf	2025-05-08 18:03:26 +02:00
Viktor Lofgren	1c57d7d73a	(pdf) Clean up code	2025-05-08 18:03:26 +02:00
Viktor Lofgren	a443d22356	(pdf) Flag the file as a PDF file in the GUI	2025-05-08 18:03:26 +02:00
Viktor Lofgren	aa59d4afa4	(pdf) Somewhat improve title and summary extraction	2025-05-08 18:03:26 +02:00
Viktor Lofgren	df0f18d0e7	(pdf) Read title	2025-05-08 18:03:26 +02:00

120 Commits