# Processes
## 1. Crawl Process

The [crawling-process](crawling-process/) fetches website contents, temporarily saves them as WARC files, and then
converts them into parquet files. Both formats are described in [crawling-process/model](crawling-process/model/).
## 2. Converting Process

The [converting-process](converting-process/) reads the crawl data produced by the crawling step,
extracts keywords and metadata from it, and saves the result as parquet files
described in [converting-process/model](converting-process/model/).
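
To make the converting step more concrete, here is a minimal Python sketch of pulling a title, keywords and links out of an HTML document. It is not the converting-process implementation: `ProcessedDocument` and `extract_document` are hypothetical names, and treating every lowercased word as a keyword is a deliberate simplification.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class ProcessedDocument:
    """Hypothetical record for one converted document."""
    url: str
    title: str = ""
    keywords: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)


class _Extractor(HTMLParser):
    """Collects the title, visible text and outgoing links of a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def extract_document(url: str, html: str) -> ProcessedDocument:
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9]+", " ".join(parser.text_parts).lower())
    # Deduplicate while keeping first-seen order.
    keywords = list(dict.fromkeys(words))
    return ProcessedDocument(
        url=url,
        title="".join(parser.title_parts).strip(),
        keywords=keywords,
        links=parser.links,
    )
```

Calling `extract_document(url, html)` once per crawled page would yield the kind of per-document records that a later step can load.
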
## 3. Loading Process

The [loading-process](loading-process/) reads the processed data.
It creates an [index journal](../index/index-journal) and
a [link database](../common/linkdb),
and loads domains and domain-links
into the [MariaDB database](../common/db).
## 4. Index Construction Process

The index-construction-process constructs indices from
the data generated by the loader.
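
As a rough illustration of what constructing an index involves, the sketch below builds a small in-memory inverted index that maps each keyword to the sorted IDs of the documents containing it. The input shape (document ID to keyword list) and both function names are assumptions made for this example; the actual index-construction-process and its on-disk formats are not described here.

```python
from collections import defaultdict


def build_inverted_index(doc_keywords: dict[int, list[str]]) -> dict[str, list[int]]:
    """Map each keyword to a sorted list of the document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            postings[keyword].add(doc_id)
    return {keyword: sorted(ids) for keyword, ids in postings.items()}


def search(index: dict[str, list[int]], *terms: str) -> list[int]:
    """Return the IDs of documents that contain every query term."""
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)
```

With this layout a multi-term query reduces to intersecting posting lists, which is the main reason the data is reorganized by keyword rather than by document.
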
## 5. Other Processes

* Ping Process: The [ping-process](ping-process/) tracks whether websites are alive, and gathers fingerprint information about each website's security posture, as well as DNS information.
* New Domain Process (NDP): The [new-domain-process](new-domain-process/) evaluates new domains for inclusion in the search engine index.
* Live-Crawling Process: The [live-crawling-process](live-crawling-process/) crawls websites in real time based on RSS feeds, updating a smaller index with the latest content.
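
The feed-driven approach of the live-crawling process can be sketched as follows: parse an RSS document and collect the item links that would then be fetched. This is an illustration using Python's standard library, not the live-crawling-process code; `links_from_rss` is a hypothetical helper and only handles plain RSS 2.0 `<item><link>` elements.

```python
import xml.etree.ElementTree as ET


def links_from_rss(rss_xml: str) -> list[str]:
    """Return the <link> URL of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links
```

A real feed reader would also need to handle Atom feeds, namespaces and malformed XML, none of which this sketch attempts.
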
## Overview

Schematically, the crawling and loading process looks like this:
```
    +-----------+
    |  CRAWLING |  Fetch each URL and
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and
    |    STEP    |  extract keywords
    +------------+  features, links, URLs
          |
    //==================\\
    || Slop :           ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||
    \\==================//
          |
    +------------+  Insert domains into mariadb
    |  LOADING   |  Insert URLs, titles in link DB
    |    STEP    |  Insert keywords in Index
    +------------+
          |
    +------------+
    | CONSTRUCT  |  Make the data searchable
    |   INDEX    |
    +------------+
```
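
The data flow in the diagram can also be read as a chain of functions, each consuming the previous step's output. The sketch below is a toy model under stated assumptions: all class and function names are hypothetical, the stores are plain in-memory Python containers rather than parquet files, Slop files or MariaDB tables, and skipping non-200 responses is an assumption of this example rather than documented behavior.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlRecord:          # stand-in for one row of the crawl files
    url: str
    status: int
    html: str


@dataclass
class ProcessedRecord:      # stand-in for one processed document
    url: str
    title: str
    keywords: list[str]


@dataclass
class LoadedData:           # stand-ins for the three loader targets
    domains: set[str] = field(default_factory=set)         # domain table
    link_db: dict[str, str] = field(default_factory=dict)  # URL -> title
    index_journal: list[tuple[str, str]] = field(default_factory=list)  # (keyword, URL)


def convert(crawled: list[CrawlRecord]) -> list[ProcessedRecord]:
    """Converting step: analyze HTML and extract a title and keywords."""
    processed = []
    for doc in crawled:
        if doc.status != 200:  # assumption: only successful fetches are converted
            continue
        match = re.search(r"<title>(.*?)</title>", doc.html, re.IGNORECASE | re.DOTALL)
        title = match.group(1).strip() if match else ""
        text = re.sub(r"<[^>]+>", " ", doc.html)  # crude tag stripping
        keywords = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
        processed.append(ProcessedRecord(doc.url, title, keywords))
    return processed


def load(processed: list[ProcessedRecord]) -> LoadedData:
    """Loading step: fan the processed records out to the three stores."""
    data = LoadedData()
    for doc in processed:
        data.domains.add(urlparse(doc.url).netloc)
        data.link_db[doc.url] = doc.title
        data.index_journal.extend((kw, doc.url) for kw in doc.keywords)
    return data


def construct_index(data: LoadedData) -> dict[str, list[str]]:
    """Index construction: regroup the journal by keyword to make it searchable."""
    index: dict[str, list[str]] = {}
    for keyword, url in data.index_journal:
        index.setdefault(keyword, []).append(url)
    return index
```

Chaining the steps as `construct_index(load(convert(crawled)))` mirrors the top-to-bottom order of the diagram.
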