# Processes

## 1. Crawl Process

The [crawling-process](crawling-process/) fetches website contents, temporarily saves them as WARC files, and then
converts them into parquet models.  Both formats are described in [crawling-process/model](crawling-process/model/).

## 2. Converting Process

The [converting-process](converting-process/) reads the crawl data from the crawling step and 
processes it, extracting keywords and metadata, and saves the results as parquet files 
described in [converting-process/model](converting-process/model/).

## 3. Loading Process

The [loading-process](loading-process/) reads the processed data.

It creates an [index journal](../index/index-journal), 
a [link database](../common/linkdb), 
and loads domains and domain-links 
into the [MariaDB database](../common/db).
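As a rough illustration of this fan-out, the sketch below routes each kind of processed record to one of the three destinations named above. The record kinds, the `DESTINATIONS` table and the `route` function are hypothetical names chosen for the example; the actual loader is not structured this way, and the mapping only restates what this readme says.

```python
# Hypothetical mapping from record kind to destination, following the
# description above: domains and domain-links go to MariaDB, URLs and
# titles go to the link database, keywords go to the index journal.
DESTINATIONS = {
    "domain": "mariadb",
    "domain_link": "mariadb",
    "document": "linkdb",
    "keywords": "index_journal",
}


def route(records):
    """Group (kind, payload) pairs by the destination they belong to."""
    routed = {"mariadb": [], "linkdb": [], "index_journal": []}
    for kind, payload in records:
        routed[DESTINATIONS[kind]].append(payload)
    return routed
```

An unknown record kind raises a `KeyError` here, which is a deliberate choice for the sketch: silently dropping records would hide a mismatch between the converter's output and the loader.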

## 4. Index Construction Process

The index-construction-process constructs indices from
the data generated by the loader.

## 5. Other Processes

* Ping Process: The [ping-process](ping-process/) tracks whether websites are alive, gathering fingerprint information about each website's security posture as well as DNS information.
* New Domain Process (NDP): The [new-domain-process](new-domain-process/) evaluates new domains for inclusion in the search engine index.
* Live-Crawling Process: The [live-crawling-process](live-crawling-process/) crawls websites in real time based on RSS feeds, updating a smaller index with the latest content.

## Overview 

Schematically, the crawling and loading process looks like this:

```
    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Slop   :         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into mariadb
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+
```
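
The same flow can be written down as data, which makes the hand-offs between steps explicit. This is a minimal sketch for orientation only: the `Stage` type, its field names and the artifact labels are assumptions made for the example (the labels paraphrase the diagram), not part of the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    consumes: str
    produces: str


# Stage order and artifacts paraphrase the diagram; all labels are illustrative.
PIPELINE = [
    Stage("crawling", "URLs", "crawl files"),
    Stage("converting", "crawl files", "processed files"),
    Stage("loading", "processed files", "loaded data"),
    Stage("index construction", "loaded data", "searchable index"),
]


def check_chain(stages):
    """Return True if every stage consumes what the previous one produced."""
    return all(prev.produces == cur.consumes
               for prev, cur in zip(stages, stages[1:]))


def describe(stages):
    """Render one 'name: input -> output' line per stage."""
    return [f"{s.name}: {s.consumes} -> {s.produces}" for s in stages]
```

Calling `describe(PIPELINE)` yields one line per step in the order shown in the diagram, and `check_chain(PIPELINE)` confirms that each step's input matches the previous step's output.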