# Processes
## 1. Crawl Process
The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, and then
converts them into parquet models. Both formats are described in [crawling-process/model](crawling-process/model/).
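
For orientation, a single crawled document can be pictured as a record with roughly the following shape. This is a hedged sketch with hypothetical field names, not the actual parquet schema (see the model module linked above for that):

```java
// Illustrative sketch only: the field names are hypothetical; the real
// schema lives in crawling-process/model. One record per fetched URL.
public record CrawledDocumentSketch(
        String domain,       // e.g. "www.example.com"
        String url,          // the URL that was fetched
        int httpStatus,      // HTTP status code returned by the fetch
        String contentType,  // e.g. "text/html"
        byte[] body          // document body captured from the WARC recording
) { }
```
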
## 2. Converting Process
The [converting-process](converting-process/) reads crawl data from the crawling step and
processes it, extracting keywords and metadata, and saves the result as parquet files
described in [converting-process/model](converting-process/model/).
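
To make "extracting keywords" concrete, here is a minimal, JDK-only sketch of turning document text into (keyword, frequency) pairs. It is an illustration of the idea only; the real converting process applies far richer language processing:

```java
// Minimal keyword-extraction sketch using only the JDK. The real converting
// process does much more; this only shows the idea of reducing document text
// to (keyword, frequency) pairs.
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KeywordSketch {
    private static final Pattern WORD = Pattern.compile("[a-z0-9]+");

    static Map<String, Long> extractKeywords(String documentText) {
        return WORD.matcher(documentText.toLowerCase())
                .results()
                .map(m -> m.group())
                .filter(w -> w.length() > 2)   // drop very short tokens
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(extractKeywords("Search engines index the web; the web is big."));
    }
}
```
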
## 3. Loading Process
The [loading-process](loading-process/) reads the processed data.
It creates an [index journal](../index/index-journal)
and a [link database](../common/linkdb),
and loads domains and domain-links
into the [MariaDB database](../common/db).
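
As a hedged illustration of the domain-loading step, a plain-JDBC batch insert into MariaDB might look like the sketch below. The table and column names are assumptions made for the example, and the MariaDB JDBC driver is assumed to be on the classpath:

```java
// Hypothetical sketch of loading domains into MariaDB with plain JDBC.
// Table and column names are assumptions, not necessarily the real schema.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

class DomainLoaderSketch {
    static void loadDomains(String jdbcUrl, List<String> domains) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT IGNORE INTO EC_DOMAIN (DOMAIN_NAME) VALUES (?)")) {
            for (String domain : domains) {
                ps.setString(1, domain);
                ps.addBatch();          // batch the inserts for throughput
            }
            ps.executeBatch();
        }
    }
}
```
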
## 4. Index Construction Process
The index-construction-process constructs indices from the data generated by the loader.
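
Conceptually, index construction inverts the (document, keywords) pairs produced by the loader into (keyword, posting list). The sketch below shows that idea with in-memory JDK collections only; the real index is built into compressed on-disk structures:

```java
// Conceptual sketch of index construction: invert (document -> keywords)
// into (keyword -> sorted posting list of document ids). Illustration only;
// the actual index uses on-disk structures, not in-memory maps.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IndexConstructionSketch {
    static Map<String, List<Long>> invert(Map<Long, List<String>> docKeywords) {
        Map<String, List<Long>> postings = new TreeMap<>();
        docKeywords.forEach((docId, keywords) -> {
            for (String keyword : keywords) {
                postings.computeIfAbsent(keyword, k -> new ArrayList<>()).add(docId);
            }
        });
        postings.values().forEach(list -> list.sort(null));  // sorted posting lists
        return postings;
    }
}
```
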
## 5. Other Processes
* Ping Process: The [ping-process](ping-process/) keeps track of the liveness of websites, gathering fingerprint information about each site's security posture as well as DNS information.
* New Domain Process (NDP): The [new-domain-process](new-domain-process/) evaluates new domains for inclusion in the search engine index.
* Live-Crawling Process: The [live-crawling-process](live-crawling-process/) crawls websites in real time based on RSS feeds, updating a smaller index with the latest content (a minimal sketch of the RSS-polling idea follows this list).
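
To illustrate the RSS-polling idea behind live crawling, the JDK-only sketch below fetches a hypothetical feed URL and prints the link elements it contains. The real live-crawling-process additionally handles scheduling, deduplication, and indexing of the fetched documents:

```java
// Minimal RSS-polling sketch using only the JDK. The feed URL is hypothetical
// and the behaviour is illustrative; the real live-crawling-process handles
// scheduling, deduplication and indexing of what it fetches.
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

class LiveCrawlSketch {
    public static void main(String[] args) throws Exception {
        var feedUrl = "https://www.example.com/feed.xml";   // hypothetical feed

        var response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(feedUrl)).GET().build(),
                HttpResponse.BodyHandlers.ofByteArray());

        var doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(response.body()));

        NodeList links = doc.getElementsByTagName("link");  // channel and item links
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getTextContent());
        }
    }
}
```
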
## Overview
Schematically the crawling and loading process looks like this:
```
 +-----------+
 |  CRAWLING |  Fetch each URL and
 |    STEP   |  output to file
 +-----------+
       |
 //========================\\
 ||  Parquet:              ||  Crawl
 ||  Status, HTML[], ...   ||  Files
 ||  Status, HTML[], ...   ||
 ||  Status, HTML[], ...   ||
 ||  ...                   ||
 \\========================//
       |
 +------------+
 | CONVERTING |  Analyze HTML and
 |    STEP    |  extract keywords
 +------------+  features, links, URLs
       |
 //==================\\
 ||  Slop:           ||  Processed
 ||  Documents[]     ||  Files
 ||  Domains[]       ||
 ||  Links[]         ||
 \\==================//
       |
 +------------+  Insert domains into mariadb
 |  LOADING   |  Insert URLs, titles in link DB
 |    STEP    |  Insert keywords in Index
 +------------+
       |
 +------------+
 | CONSTRUCT  |  Make the data searchable
 |   INDEX    |
 +------------+
```