Mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git (synced 2025-10-06 07:32:38 +02:00)

Compare commits: deploy-004 ... deploy-015 (195 commits)

SHA1:

8be88afcf3, 0e3c00d3e1, 4279a7f1aa, 251006d4f9, c3e99dc12a, aaaa2de022, fc1388422a, b07080db16, e9d86dca4a, 1d693f0efa,
5874a163dc, 5ec7a1deab, 7fea2808ed, 8da74484f0, 923d5a7234, 58f88749b8, 77f727a5ba, 667cfb53dc, fe36d4ed20, acf4bef98d,
2a737c34bb, 90a577af82, f0c9b935d8, 7b5493dd51, c246a59158, 0b99781d24, 39db9620c1, 1781599363, 6b2d18fb9b, 59b1d200ab,
897010a2cf, 602af7a77e, a7d91c8527, 7151602124, 884e33bd4a, e84d5c497a, 2d2d3e2466, 647dd9b12f, de4e2849ce, 3c43f1954e,
fa2462ec39, f4ad7145db, 068b450180, 05b909a21f, 3d179cddce, 1a2aae496a, 353cdffb3f, 2e3f1313c7, 58e6f141ce, 500f63e921,
6dfbedda1e, 9715ddb105, 1fc6313a77, b1249d5b8a, ef95d59b07, acdd8664f5, 6b12eac58a, bb3f1f395a, b661beef41, 9888c47f19,
dcef7e955b, b3973a1dd7, 8bd05d6d90, 59df8e356e, 7161162a35, d7c4c5141f, 88e9b8fb05, b6265cee11, c91af247e9, 7a31227de1,
4f477604c5, 2970f4395b, d1ec909b36, c67c5bbf42, ecb0e57a1a, 8c61f61b46, 662a18c933, 1c2426a052, 34df7441ac, 5387e2bd80,
0f3b24d0f8, a732095d2a, 6607f0112f, 4913730de9, 1db64f9d56, 4dcff14498, 426658f64e, 2181b22f05, 42bd79a609, b91c1e528a,
b1130d7a04, 8364bcdc97, 626cab5fab, cfd4712191, 9f18ced73d, 18e91269ab, e315ca5758, 3ceea17c1d, b34527c1a3, 185bf28fca,
78cc25584a, 62ba30bacf, 3bb84eb206, be7d13ccce, 8c088a7c0b, ea9a642b9b, 27f528af6a, 20ca41ec95, 7671f0d9e4, 44d6bc71b7,
9d302e2973, f553701224, f076d05595, b513809710, 7519b28e21, 3eac4dd57f, 4c2810720a, 8480ba8daa, fbba392491, 530eb35949,
c2dd2175a2, b8581b0f56, 2ea34767d8, e9af838231, ae0cad47c4, 5fbc8ef998, 32c6dd9e6a, 6ece6a6cfb, 39cd1c18f8, eb65daaa88,
0bebdb6e33, 1e50e392c6, fb673de370, eee73ab16c, 5354e034bf, 72384ad6ca, a2b076f9be, c8b0a32c0f, f0d74aa3bb, 74a1f100f4,
eb049658e4, db138b2a6f, 1673fc284c, 503ea57d5b, 18ca926c7f, db99242db2, 2b9d2985ba, eeb6ecd711, 1f58aeadbf, 3d68be64da,
668f3b16ef, 98a340a0d1, 8862100f7e, 274941f6de, abec83582d, 569520c9b6, 088310e998, 270cab874b, 4c74e280d3, 5b347e17ac,
55d6ab933f, 43b74e9706, 579a115243, 2c67f50a43, 78a958e2b0, 4e939389b2, e67a9bdb91, 567e4e1237, 4342e42722, bc818056e6,
de2feac238, 1e770205a5, e44ecd6d69, 5b93a0e633, 08fb0e5efe, bcf67782ea, ef3f175ede, bbe4b5d9fd, c67a635103, 20b24133fb,
f2567677e8, bc2c2061f2, 1c7f5a31a5, 59a8ea60f7, aa9b1244ea, 2d17233366, b245cc9f38, 6614d05bdf, 55aeb03c4a, faa589962f,
c7edd6b39f, 47e58a21c6, 3714104976, f6f036b9b1, b510b7feb8

ROADMAP.md

@@ -1,4 +1,4 @@
# Roadmap 2024-2025
# Roadmap 2025

This is a roadmap with major features planned for Marginalia Search.

@@ -30,12 +30,6 @@ Retaining the ability to independently crawl the web is still strongly desirable
The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
combined with naive bayesian filter would go a long way, or something more sophisticated...?

## Web Design Overhaul

The design is kinda clunky and hard to maintain, and needlessly outdated-looking.

In progress: PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127) -- demo available at https://test.marginalia.nu/

## Additional Language Support

It would be desirable if the search engine supported more languages than English. This is partially about
@@ -62,8 +56,31 @@ filter for any API consumer.

I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.

## Show favicons next to search results

This is expected from search engines. Basic proof of concept sketch of fetching this data has been done, but the feature is some way from being reality.

## Specialized crawler for github

One of the search engine's biggest limitations right now is that it does not index github at all. A specialized crawler that fetches at least the readme.md would go a long way toward providing search capabilities in this domain.

# Completed

## Web Design Overhaul (COMPLETED 2025-01)

The design is kinda clunky and hard to maintain, and needlessly outdated-looking.

PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127)

## Finalize RSS support (COMPLETED 2024-11)

Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.

Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)

## Proper Position Index (COMPLETED 2024-09)

The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
@@ -76,11 +93,3 @@ list, as is the civilized way of doing this.

Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)

## Finalize RSS support (COMPLETED 2024-11)

Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.

Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)

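The roadmap hunk above suggests pairing a URL blacklist with a naive Bayesian filter to keep adult content out of result pages. As a rough illustration of that idea only (this is not code from the repository; the class name, token handling, and smoothing are all made up), such a classifier could look like the following sketch:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the "naive bayesian filter" mentioned in the roadmap above.
    class NaiveBayesSpicyFilter {
        private final Map<String, Integer> spicyCounts = new HashMap<>();
        private final Map<String, Integer> cleanCounts = new HashMap<>();
        private int spicyDocs = 0, cleanDocs = 0;

        void train(Iterable<String> tokens, boolean spicy) {
            var counts = spicy ? spicyCounts : cleanCounts;
            for (String t : tokens) counts.merge(t, 1, Integer::sum);
            if (spicy) spicyDocs++; else cleanDocs++;
        }

        /** True if the tokens look more like the "spicy" class under
         *  log-space naive Bayes with add-one smoothing. */
        boolean isSpicy(Iterable<String> tokens) {
            double logSpicy = Math.log((spicyDocs + 1.0) / (spicyDocs + cleanDocs + 2.0));
            double logClean = Math.log((cleanDocs + 1.0) / (spicyDocs + cleanDocs + 2.0));
            int spicyTotal = spicyCounts.values().stream().mapToInt(Integer::intValue).sum();
            int cleanTotal = cleanCounts.values().stream().mapToInt(Integer::intValue).sum();
            int vocab = spicyCounts.size() + cleanCounts.size() + 1;
            for (String t : tokens) {
                logSpicy += Math.log((spicyCounts.getOrDefault(t, 0) + 1.0) / (spicyTotal + vocab));
                logClean += Math.log((cleanCounts.getOrDefault(t, 0) + 1.0) / (cleanTotal + vocab));
            }
            return logSpicy > logClean;
        }
    }

In practice such a filter would sit behind the blacklist as a second line of defense, trained on pages the blacklist already flags.
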
@@ -5,7 +5,7 @@ plugins {

// This is a workaround for a bug in the Jib plugin that causes it to stall randomly
// https://github.com/GoogleContainerTools/jib/issues/3347
id 'com.google.cloud.tools.jib' version '3.4.3' apply(false)
id 'com.google.cloud.tools.jib' version '3.4.5' apply(false)
}

group 'marginalia'
@@ -43,12 +43,11 @@ subprojects.forEach {it ->
}

ext {
jvmVersion=23
dockerImageBase='container-registry.oracle.com/graalvm/jdk:23'
jvmVersion = 24
dockerImageBase='container-registry.oracle.com/graalvm/jdk:24'
dockerImageTag='latest'
dockerImageRegistry='marginalia'
jibVersion = '3.4.3'

jibVersion = '3.4.5'
}

idea {

@@ -24,58 +24,4 @@ public class LanguageModels {
this.fasttextLanguageModel = fasttextLanguageModel;
this.segments = segments;
}

public static LanguageModelsBuilder builder() {
return new LanguageModelsBuilder();
}

public static class LanguageModelsBuilder {
private Path termFrequencies;
private Path openNLPSentenceDetectionData;
private Path posRules;
private Path posDict;
private Path fasttextLanguageModel;
private Path segments;

LanguageModelsBuilder() {
}

public LanguageModelsBuilder termFrequencies(Path termFrequencies) {
this.termFrequencies = termFrequencies;
return this;
}

public LanguageModelsBuilder openNLPSentenceDetectionData(Path openNLPSentenceDetectionData) {
this.openNLPSentenceDetectionData = openNLPSentenceDetectionData;
return this;
}

public LanguageModelsBuilder posRules(Path posRules) {
this.posRules = posRules;
return this;
}

public LanguageModelsBuilder posDict(Path posDict) {
this.posDict = posDict;
return this;
}

public LanguageModelsBuilder fasttextLanguageModel(Path fasttextLanguageModel) {
this.fasttextLanguageModel = fasttextLanguageModel;
return this;
}

public LanguageModelsBuilder segments(Path segments) {
this.segments = segments;
return this;
}

public LanguageModels build() {
return new LanguageModels(this.termFrequencies, this.openNLPSentenceDetectionData, this.posRules, this.posDict, this.fasttextLanguageModel, this.segments);
}

public String toString() {
return "LanguageModels.LanguageModelsBuilder(termFrequencies=" + this.termFrequencies + ", openNLPSentenceDetectionData=" + this.openNLPSentenceDetectionData + ", posRules=" + this.posRules + ", posDict=" + this.posDict + ", fasttextLanguageModel=" + this.fasttextLanguageModel + ", segments=" + this.segments + ")";
}
}
}

@@ -20,7 +20,11 @@ public class DbDomainQueries {
private final HikariDataSource dataSource;

private static final Logger logger = LoggerFactory.getLogger(DbDomainQueries.class);

private final Cache<EdgeDomain, Integer> domainIdCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
private final Cache<EdgeDomain, DomainIdWithNode> domainWithNodeCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
private final Cache<Integer, EdgeDomain> domainNameCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
private final Cache<String, List<DomainWithNode>> siblingsCache = CacheBuilder.newBuilder().maximumSize(10_000).build();

@Inject
public DbDomainQueries(HikariDataSource dataSource)
@@ -30,16 +34,21 @@ public class DbDomainQueries {

public Integer getDomainId(EdgeDomain domain) throws NoSuchElementException {
try (var connection = dataSource.getConnection()) {

try {
return domainIdCache.get(domain, () -> {
try (var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
try (var connection = dataSource.getConnection();
var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {

stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
return rsp.getInt(1);
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}

throw new NoSuchElementException();
});
}
@@ -49,8 +58,33 @@ public class DbDomainQueries {
catch (ExecutionException ex) {
throw new RuntimeException(ex.getCause());
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}

public DomainIdWithNode getDomainIdWithNode(EdgeDomain domain) throws NoSuchElementException {
try {
return domainWithNodeCache.get(domain, () -> {
try (var connection = dataSource.getConnection();
var stmt = connection.prepareStatement("SELECT ID, NODE_AFFINITY FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {

stmt.setString(1, domain.toString());
var rsp = stmt.executeQuery();
if (rsp.next()) {
return new DomainIdWithNode(rsp.getInt(1), rsp.getInt(2));
}
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}

throw new NoSuchElementException();
});
}
catch (UncheckedExecutionException ex) {
throw new NoSuchElementException();
}
catch (ExecutionException ex) {
throw new RuntimeException(ex.getCause());
}
}

@@ -84,47 +118,55 @@ public class DbDomainQueries {
}

public Optional<EdgeDomain> getDomain(int id) {
try (var connection = dataSource.getConnection()) {

EdgeDomain existing = domainNameCache.getIfPresent(id);
if (existing != null) {
return Optional.of(existing);
}

try (var connection = dataSource.getConnection()) {
try (var stmt = connection.prepareStatement("SELECT DOMAIN_NAME FROM EC_DOMAIN WHERE ID=?")) {
stmt.setInt(1, id);
var rsp = stmt.executeQuery();
if (rsp.next()) {
return Optional.of(new EdgeDomain(rsp.getString(1)));
var val = new EdgeDomain(rsp.getString(1));
domainNameCache.put(id, val);
return Optional.of(val);
}
return Optional.empty();
}
}
catch (UncheckedExecutionException ex) {
throw new RuntimeException(ex.getCause());
}
catch (SQLException ex) {
throw new RuntimeException(ex);
}
}

public List<DomainWithNode> otherSubdomains(EdgeDomain domain, int cnt) {
List<DomainWithNode> ret = new ArrayList<>();
public List<DomainWithNode> otherSubdomains(EdgeDomain domain, int cnt) throws ExecutionException {
String topDomain = domain.topDomain;

try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("SELECT DOMAIN_NAME, NODE_AFFINITY FROM EC_DOMAIN WHERE DOMAIN_TOP = ? LIMIT ?")) {
stmt.setString(1, domain.topDomain);
stmt.setInt(2, cnt);
return siblingsCache.get(topDomain, () -> {
List<DomainWithNode> ret = new ArrayList<>();

var rs = stmt.executeQuery();
while (rs.next()) {
var sibling = new EdgeDomain(rs.getString(1));
try (var conn = dataSource.getConnection();
var stmt = conn.prepareStatement("SELECT DOMAIN_NAME, NODE_AFFINITY FROM EC_DOMAIN WHERE DOMAIN_TOP = ? LIMIT ?")) {
stmt.setString(1, topDomain);
stmt.setInt(2, cnt);

if (sibling.equals(domain))
continue;
var rs = stmt.executeQuery();
while (rs.next()) {
var sibling = new EdgeDomain(rs.getString(1));

ret.add(new DomainWithNode(sibling, rs.getInt(2)));
if (sibling.equals(domain))
continue;

ret.add(new DomainWithNode(sibling, rs.getInt(2)));
}
} catch (SQLException e) {
logger.error("Failed to get domain neighbors");
}
} catch (SQLException e) {
logger.error("Failed to get domain neighbors");
}
return ret;
});

return ret;
}

public record DomainWithNode (EdgeDomain domain, int nodeAffinity) {
@@ -132,4 +174,6 @@ public class DbDomainQueries {
return nodeAffinity > 0;
}
}

public record DomainIdWithNode (int domainId, int nodeAffinity) { }
}

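The DbDomainQueries hunks above move the SQL lookups inside Guava's Cache.get(key, loader), so a cache miss runs the query and stores the result, while a hit skips the database entirely. A stripped-down sketch of that cache-aside pattern (illustrative only, not the repository's code; the placeholder loader stands in for the real SELECT):

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import java.util.concurrent.ExecutionException;

    class DomainIdLookup {
        // Same shape as the caches added in the diff: bounded, keyed on the domain name.
        private final Cache<String, Integer> domainIdCache =
                CacheBuilder.newBuilder().maximumSize(10_000).build();

        int getDomainId(String domainName) throws ExecutionException {
            // The loader only runs on a cache miss; its result is remembered for later calls.
            return domainIdCache.get(domainName, () -> queryDatabase(domainName));
        }

        private int queryDatabase(String domainName) {
            // Placeholder for the real "SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?" lookup.
            return domainName.hashCode();
        }
    }

Note how the diff also opens the connection inside the loader rather than outside the cache call, so no connection is borrowed from the pool when the value is already cached.
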
@@ -1,118 +0,0 @@
package nu.marginalia.db;

import com.zaxxer.hikari.HikariDataSource;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

/** Class used in exporting data. This is intended to be used for a brief time
 * and then discarded, not kept around as a service.
 */
public class DbDomainStatsExportMultitool implements AutoCloseable {
private final Connection connection;
private final int nodeId;
private final PreparedStatement knownUrlsQuery;
private final PreparedStatement visitedUrlsQuery;
private final PreparedStatement goodUrlsQuery;
private final PreparedStatement domainNameToId;

private final PreparedStatement allDomainsQuery;
private final PreparedStatement crawlQueueDomains;
private final PreparedStatement indexedDomainsQuery;

public DbDomainStatsExportMultitool(HikariDataSource dataSource, int nodeId) throws SQLException {
this.connection = dataSource.getConnection();
this.nodeId = nodeId;

knownUrlsQuery = connection.prepareStatement("""
SELECT KNOWN_URLS
FROM EC_DOMAIN INNER JOIN DOMAIN_METADATA
ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
WHERE DOMAIN_NAME=?
""");
visitedUrlsQuery = connection.prepareStatement("""
SELECT VISITED_URLS
FROM EC_DOMAIN INNER JOIN DOMAIN_METADATA
ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
WHERE DOMAIN_NAME=?
""");
goodUrlsQuery = connection.prepareStatement("""
SELECT GOOD_URLS
FROM EC_DOMAIN INNER JOIN DOMAIN_METADATA
ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
WHERE DOMAIN_NAME=?
""");
domainNameToId = connection.prepareStatement("""
SELECT ID
FROM EC_DOMAIN
WHERE DOMAIN_NAME=?
""");
allDomainsQuery = connection.prepareStatement("""
SELECT DOMAIN_NAME
FROM EC_DOMAIN
""");
crawlQueueDomains = connection.prepareStatement("""
SELECT DOMAIN_NAME
FROM CRAWL_QUEUE
""");
indexedDomainsQuery = connection.prepareStatement("""
SELECT DOMAIN_NAME
FROM EC_DOMAIN
WHERE INDEXED > 0
""");
}

public OptionalInt getVisitedUrls(String domainName) throws SQLException {
return executeNameToIntQuery(domainName, visitedUrlsQuery);
}

public OptionalInt getDomainId(String domainName) throws SQLException {
return executeNameToIntQuery(domainName, domainNameToId);
}

public List<String> getCrawlQueueDomains() throws SQLException {
return executeListQuery(crawlQueueDomains, 100);
}
public List<String> getAllIndexedDomains() throws SQLException {
return executeListQuery(indexedDomainsQuery, 100_000);
}

private OptionalInt executeNameToIntQuery(String domainName, PreparedStatement statement)
throws SQLException {
statement.setString(1, domainName);
var rs = statement.executeQuery();

if (rs.next()) {
return OptionalInt.of(rs.getInt(1));
}

return OptionalInt.empty();
}

private List<String> executeListQuery(PreparedStatement statement, int sizeHint) throws SQLException {
List<String> ret = new ArrayList<>(sizeHint);

var rs = statement.executeQuery();

while (rs.next()) {
ret.add(rs.getString(1));
}

return ret;
}

@Override
public void close() throws SQLException {
knownUrlsQuery.close();
goodUrlsQuery.close();
visitedUrlsQuery.close();
allDomainsQuery.close();
crawlQueueDomains.close();
domainNameToId.close();
connection.close();
}
}

@@ -14,7 +14,7 @@ public class EdgeDomain implements Serializable {
@Nonnull
public final String topDomain;

public EdgeDomain(String host) {
public EdgeDomain(@Nonnull String host) {
Objects.requireNonNull(host, "domain name must not be null");

host = host.toLowerCase();
@@ -61,6 +61,10 @@ public class EdgeDomain implements Serializable {
this.topDomain = topDomain;
}

public static String getTopDomain(String host) {
return new EdgeDomain(host).topDomain;
}

private boolean looksLikeGovTld(String host) {
if (host.length() < 8)
return false;
@@ -116,24 +120,6 @@ public class EdgeDomain implements Serializable {
return topDomain.substring(0, cutPoint).toLowerCase();
}

public String getLongDomainKey() {
StringBuilder ret = new StringBuilder();

int cutPoint = topDomain.indexOf('.');
if (cutPoint < 0) {
ret.append(topDomain);
} else {
ret.append(topDomain, 0, cutPoint);
}

if (!subDomain.isEmpty() && !"www".equals(subDomain)) {
ret.append(":");
ret.append(subDomain);
}

return ret.toString().toLowerCase();
}

/** If possible, try to provide an alias domain,
 * i.e. a domain name that is very likely to link to this one
 * */

@@ -1,16 +1,14 @@
package nu.marginalia.model;

import nu.marginalia.util.QueryParams;
import org.apache.commons.lang3.StringUtils;

import javax.annotation.Nullable;
import java.io.Serializable;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.*;
import java.nio.charset.StandardCharsets;
import java.util.Objects;
import java.util.Optional;
import java.util.regex.Pattern;

public class EdgeUrl implements Serializable {
public final String proto;
@@ -33,9 +31,21 @@ public class EdgeUrl implements Serializable {

private static URI parseURI(String url) throws URISyntaxException {
try {
return new URI(urlencodeFixer(url));
} catch (URISyntaxException ex) {
throw new URISyntaxException("Failed to parse URI '" + url + "'", ex.getMessage());
return new URI(url);
} catch (URISyntaxException _) {
try {
/* Java's URI parser is a bit too strict in throwing exceptions when there's an error.

Here on the Internet, standards are like the picture on the box of the frozen pizza,
and what you get is more like what's on the inside, we try to patch things instead,
just give it a best-effort attempt att cleaning out broken or unnecessary constructions
like bad or missing URLEncoding
*/
return EdgeUriFactory.parseURILenient(url);
}
catch (URISyntaxException ex2) {
throw new URISyntaxException("Failed to parse URI '" + url + "'", ex2.getMessage());
}
}
}

@@ -51,58 +61,6 @@ public class EdgeUrl implements Serializable {
}
}

private static Pattern badCharPattern = Pattern.compile("[ \t\n\"<>\\[\\]()',|]");

/* Java's URI parser is a bit too strict in throwing exceptions when there's an error.

Here on the Internet, standards are like the picture on the box of the frozen pizza,
and what you get is more like what's on the inside, we try to patch things instead,
just give it a best-effort attempt att cleaning out broken or unnecessary constructions
like bad or missing URLEncoding
*/
public static String urlencodeFixer(String url) throws URISyntaxException {
var s = new StringBuilder();
String goodChars = "&.?:/-;+$#";
String hexChars = "0123456789abcdefABCDEF";

int pathIdx = findPathIdx(url);
if (pathIdx < 0) { // url looks like http://marginalia.nu
return url + "/";
}
s.append(url, 0, pathIdx);

// We don't want the fragment, and multiple fragments breaks the Java URIParser for some reason
int end = url.indexOf("#");
if (end < 0) end = url.length();

for (int i = pathIdx; i < end; i++) {
int c = url.charAt(i);

if (goodChars.indexOf(c) >= 0 || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')) {
s.appendCodePoint(c);
} else if (c == '%' && i + 2 < end) {
int cn = url.charAt(i + 1);
int cnn = url.charAt(i + 2);
if (hexChars.indexOf(cn) >= 0 && hexChars.indexOf(cnn) >= 0) {
s.appendCodePoint(c);
} else {
s.append("%25");
}
} else {
s.append(String.format("%%%02X", c));
}
}

return s.toString();
}

private static int findPathIdx(String url) throws URISyntaxException {
int colonIdx = url.indexOf(':');
if (colonIdx < 0 || colonIdx + 2 >= url.length()) {
throw new URISyntaxException(url, "Lacking protocol");
}
return url.indexOf('/', colonIdx + 2);
}

public EdgeUrl(URI URI) {
try {
@@ -166,11 +124,10 @@ public class EdgeUrl implements Serializable {
sb.append(port);
}

sb.append(path);
EdgeUriFactory.urlencodePath(sb, path);

if (param != null) {
sb.append('?');
sb.append(param);
EdgeUriFactory.urlencodeQuery(sb, param);
}

return sb.toString();
@@ -247,3 +204,138 @@ public class EdgeUrl implements Serializable {
}

}

class EdgeUriFactory {
public static URI parseURILenient(String url) throws URISyntaxException {
var s = new StringBuilder();

int pathIdx = findPathIdx(url);
if (pathIdx < 0) { // url looks like http://marginalia.nu
return new URI(url + "/");
}
s.append(url, 0, pathIdx);

// We don't want the fragment, and multiple fragments breaks the Java URIParser for some reason
int end = url.indexOf("#");
if (end < 0) end = url.length();

int queryIdx = url.indexOf('?');
if (queryIdx < 0) queryIdx = end;

urlencodePath(s, url.substring(pathIdx, queryIdx));
if (queryIdx < end) {
urlencodeQuery(s, url.substring(queryIdx + 1, end));
}
return new URI(s.toString());
}

/** Break apart the path element of an URI into its components, and then
 * urlencode any component that needs it, and recombine it into a single
 * path element again.
 */
public static void urlencodePath(StringBuilder sb, String path) {
if (path == null || path.isEmpty()) {
return;
}

String[] pathParts = StringUtils.split(path, '/');
if (pathParts.length == 0) {
sb.append('/');
return;
}

for (String pathPart : pathParts) {
if (pathPart.isEmpty()) continue;

if (needsUrlEncode(pathPart)) {
sb.append('/');
sb.append(URLEncoder.encode(pathPart, StandardCharsets.UTF_8).replace("+", "%20"));
} else {
sb.append('/');
sb.append(pathPart);
}
}

if (path.endsWith("/")) {
sb.append('/');
}

}

/** Break apart the query element of a URI into its components, and then
 * urlencode any component that needs it, and recombine it into a single
 * query element again.
 */
public static void urlencodeQuery(StringBuilder sb, String param) {
if (param == null || param.isEmpty()) {
return;
}

String[] pathParts = StringUtils.split(param, '&');
boolean first = true;
for (String queryPart : pathParts) {
if (queryPart.isEmpty()) continue;

if (first) {
sb.append('?');
first = false;
} else {
sb.append('&');
}

if (needsUrlEncode(queryPart)) {
sb.append(URLEncoder.encode(queryPart, StandardCharsets.UTF_8));
} else {
sb.append(queryPart);
}
}
}

/** Test if the url element needs URL encoding.
 * <p></p>
 * Note we may have been given an already encoded path element,
 * so we include % and + in the list of good characters
 */
static boolean needsUrlEncode(String urlElement) {
for (int i = 0; i < urlElement.length(); i++) {
char c = urlElement.charAt(i);

if (c >= 'a' && c <= 'z') continue;
if (c >= 'A' && c <= 'Z') continue;
if (c >= '0' && c <= '9') continue;
if ("-_.~+?=&".indexOf(c) >= 0) continue;
if (c == '%' && i + 2 < urlElement.length()) {
char c1 = urlElement.charAt(i + 1);
char c2 = urlElement.charAt(i + 2);
if (isHexDigit(c1) && isHexDigit(c2)) {
i += 2;
continue;
}
}

return true;
}

return false;
}

private static boolean isHexDigit(char c) {
return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}

/** Find the index of the path element in a URL.
 * <p></p>
 * The path element starts after the scheme and authority part of the URL,
 * which is everything up to and including the first slash after the colon.
 */
private static int findPathIdx(String url) throws URISyntaxException {
int colonIdx = url.indexOf(':');
if (colonIdx < 0 || colonIdx + 3 >= url.length()) {
throw new URISyntaxException(url, "Lacking scheme");
}
return url.indexOf('/', colonIdx + 3);
}

}

@@ -1,6 +1,6 @@
package nu.marginalia.model;

import nu.marginalia.model.EdgeUrl;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

import java.net.URISyntaxException;
@@ -21,25 +21,50 @@ class EdgeUrlTest {
new EdgeUrl("https://memex.marginalia.nu/#here")
);
}

@Test
public void testParam() throws URISyntaxException {
System.out.println(new EdgeUrl("https://memex.marginalia.nu/index.php?id=1").toString());
System.out.println(new EdgeUrl("https://memex.marginalia.nu/showthread.php?id=1&count=5&tracking=123").toString());
}
@Test
void urlencodeFixer() throws URISyntaxException {
System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/#heredoc"));
System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/%-sign"));
System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/%22-sign"));
System.out.println(EdgeUrl.urlencodeFixer("https://www.example.com/\n \"huh\""));
void testUriFromString() throws URISyntaxException {
// We test these URLs several times as we perform URLEncode-fixing both when parsing the URL and when
// converting it back to a string, we want to ensure there is no changes along the way.

Assertions.assertEquals("/", EdgeUriFactory.parseURILenient("https://www.example.com/").getPath());
Assertions.assertEquals("https://www.example.com/", EdgeUriFactory.parseURILenient("https://www.example.com/").toString());
Assertions.assertEquals("https://www.example.com/", new EdgeUrl("https://www.example.com/").toString());

Assertions.assertEquals("/", EdgeUriFactory.parseURILenient("https://www.example.com/#heredoc").getPath());
Assertions.assertEquals("https://www.example.com/", EdgeUriFactory.parseURILenient("https://www.example.com/#heredoc").toString());
Assertions.assertEquals("https://www.example.com/", new EdgeUrl("https://www.example.com/#heredoc").toString());

Assertions.assertEquals("/trailingslash/", EdgeUriFactory.parseURILenient("https://www.example.com/trailingslash/").getPath());
Assertions.assertEquals("https://www.example.com/trailingslash/", EdgeUriFactory.parseURILenient("https://www.example.com/trailingslash/").toString());
Assertions.assertEquals("https://www.example.com/trailingslash/", new EdgeUrl("https://www.example.com/trailingslash/").toString());

Assertions.assertEquals("/%-sign", EdgeUriFactory.parseURILenient("https://www.example.com/%-sign").getPath());
Assertions.assertEquals("https://www.example.com/%25-sign", EdgeUriFactory.parseURILenient("https://www.example.com/%-sign").toString());
Assertions.assertEquals("https://www.example.com/%25-sign", new EdgeUrl("https://www.example.com/%-sign").toString());

Assertions.assertEquals("/\"-sign", EdgeUriFactory.parseURILenient("https://www.example.com/%22-sign").getPath());
Assertions.assertEquals("https://www.example.com/%22-sign", EdgeUriFactory.parseURILenient("https://www.example.com/%22-sign").toString());
Assertions.assertEquals("https://www.example.com/%22-sign", new EdgeUrl("https://www.example.com/%22-sign").toString());

Assertions.assertEquals("/\n \"huh\"", EdgeUriFactory.parseURILenient("https://www.example.com/\n \"huh\"").getPath());
Assertions.assertEquals("https://www.example.com/%0A%20%22huh%22", EdgeUriFactory.parseURILenient("https://www.example.com/\n \"huh\"").toString());
Assertions.assertEquals("https://www.example.com/%0A%20%22huh%22", new EdgeUrl("https://www.example.com/\n \"huh\"").toString());

Assertions.assertEquals("/wiki/Sámi", EdgeUriFactory.parseURILenient("https://en.wikipedia.org/wiki/Sámi").getPath());
Assertions.assertEquals("https://en.wikipedia.org/wiki/S%C3%A1mi", EdgeUriFactory.parseURILenient("https://en.wikipedia.org/wiki/Sámi").toString());
Assertions.assertEquals("https://en.wikipedia.org/wiki/S%C3%A1mi", new EdgeUrl("https://en.wikipedia.org/wiki/Sámi").toString());
}

@Test
void testParms() throws URISyntaxException {
System.out.println(new EdgeUrl("https://search.marginalia.nu/?id=123"));
System.out.println(new EdgeUrl("https://search.marginalia.nu/?t=123"));
System.out.println(new EdgeUrl("https://search.marginalia.nu/?v=123"));
System.out.println(new EdgeUrl("https://search.marginalia.nu/?m=123"));
System.out.println(new EdgeUrl("https://search.marginalia.nu/?follow=123"));
Assertions.assertEquals("id=123", new EdgeUrl("https://search.marginalia.nu/?id=123").param);
Assertions.assertEquals("t=123", new EdgeUrl("https://search.marginalia.nu/?t=123").param);
Assertions.assertEquals("v=123", new EdgeUrl("https://search.marginalia.nu/?v=123").param);
Assertions.assertEquals("id=1", new EdgeUrl("https://memex.marginalia.nu/showthread.php?id=1&count=5&tracking=123").param);
Assertions.assertEquals("id=1&t=5", new EdgeUrl("https://memex.marginalia.nu/shöwthrëad.php?id=1&t=5&tracking=123").param);
Assertions.assertEquals("id=1&t=5", new EdgeUrl("https://memex.marginalia.nu/shöwthrëad.php?trëaking=123&id=1&t=5&").param);
Assertions.assertNull(new EdgeUrl("https://search.marginalia.nu/?m=123").param);
Assertions.assertNull(new EdgeUrl("https://search.marginalia.nu/?follow=123").param);
}
}

@@ -59,16 +59,13 @@ public class ProcessAdHocTaskHeartbeatImpl implements AutoCloseable, ProcessAdHo
*/
@Override
public void progress(String step, int stepProgress, int stepCount) {
int lastProgress = this.progress;
this.step = step;

// off by one since we calculate the progress based on the number of steps,
// and Enum.ordinal() is zero-based (so the 5th step in a 5 step task is 4, not 5; resulting in the
// final progress being 80% and not 100%)

this.progress = (int) Math.round(100. * stepProgress / (double) stepCount);

logger.info("ProcessTask {} progress: {}%", taskBase, progress);
if (this.progress / 10 != lastProgress / 10) {
logger.info("ProcessTask {} progress: {}%", taskBase, progress);
}
}

/** Wrap a collection to provide heartbeat progress updates as it's iterated through */

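The heartbeat change above stops logging every single step: a line is only emitted when the integer-division bucket progress / 10 changes, i.e. when progress crosses a 10% boundary. A tiny standalone illustration of that guard (hypothetical demo class, not repository code):

    public class ProgressLogDemo {
        public static void main(String[] args) {
            int lastProgress = 0;
            for (int step = 1; step <= 50; step++) {
                int progress = (int) Math.round(100.0 * step / 50);
                // Same guard as in the heartbeat implementations: only log when the tens digit changes.
                if (progress / 10 != lastProgress / 10) {
                    System.out.println("progress: " + progress + "%");
                }
                lastProgress = progress;
            }
        }
    }

Run as-is, this prints ten lines (10% through 100%) instead of fifty.
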
@@ -10,7 +10,9 @@ import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.util.*;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

/** WorkLog is a journal of work done by a process,
@@ -61,6 +63,12 @@ public class WorkLog implements AutoCloseable, Closeable {
return new WorkLoadIterable<>(logFile, mapper);
}

public static int countEntries(Path crawlerLog) throws IOException{
try (var linesStream = Files.lines(crawlerLog)) {
return (int) linesStream.filter(WorkLogEntry::isJobId).count();
}
}

// Use synchro over concurrent set to avoid competing writes
// - correct is better than fast here, it's sketchy enough to use
// a PrintWriter

@@ -57,16 +57,13 @@ public class ServiceAdHocTaskHeartbeatImpl implements AutoCloseable, ServiceAdHo
*/
@Override
public void progress(String step, int stepProgress, int stepCount) {
int lastProgress = this.progress;
this.step = step;

// off by one since we calculate the progress based on the number of steps,
// and Enum.ordinal() is zero-based (so the 5th step in a 5 step task is 4, not 5; resulting in the
// final progress being 80% and not 100%)

this.progress = (int) Math.round(100. * stepProgress / (double) stepCount);

logger.info("ServiceTask {} progress: {}%", taskBase, progress);
if (this.progress / 10 != lastProgress / 10) {
logger.info("ProcessTask {} progress: {}%", taskBase, progress);
}
}

public void shutDown() {

@@ -89,7 +89,7 @@ public class DatabaseModule extends AbstractModule {
config.addDataSourceProperty("prepStmtCacheSize", "250");
config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");

config.setMaximumPoolSize(5);
config.setMaximumPoolSize(Integer.getInteger("db.poolSize", 5));
config.setMinimumIdle(2);

config.setMaxLifetime(Duration.ofMinutes(9).toMillis());

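The new pool-size line reads a JVM system property with a fallback, so the Hikari pool can be enlarged at deploy time without a rebuild. A minimal sketch of how that lookup behaves (the property name db.poolSize comes from the diff above; the value 20 and the demo class are made up):

    public class PoolSizeExample {
        public static void main(String[] args) {
            // Run with: java -Ddb.poolSize=20 PoolSizeExample
            // Integer.getInteger reads the system property and falls back to 5 when it is unset.
            int poolSize = Integer.getInteger("db.poolSize", 5);
            System.out.println("Hikari maximumPoolSize would be " + poolSize);
        }
    }
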
@@ -6,6 +6,7 @@ import nu.marginalia.service.ServiceId;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Enumeration;
@@ -115,11 +116,12 @@ public class ServiceConfigurationModule extends AbstractModule {
}
}

public static String getLocalNetworkIP() throws Exception {
public static String getLocalNetworkIP() throws IOException {
Enumeration<NetworkInterface> nets = NetworkInterface.getNetworkInterfaces();

while (nets.hasMoreElements()) {
NetworkInterface netif = nets.nextElement();
logger.info("Considering network interface {}: Up? {}, Loopback? {}", netif.getDisplayName(), netif.isUp(), netif.isLoopback());
if (!netif.isUp() || netif.isLoopback()) {
continue;
}
@@ -127,6 +129,7 @@ public class ServiceConfigurationModule extends AbstractModule {
Enumeration<InetAddress> inetAddresses = netif.getInetAddresses();
while (inetAddresses.hasMoreElements()) {
InetAddress addr = inetAddresses.nextElement();
logger.info("Considering address {}: SiteLocal? {}, Loopback? {}", addr.getHostAddress(), addr.isSiteLocalAddress(), addr.isLoopbackAddress());
if (addr.isSiteLocalAddress() && !addr.isLoopbackAddress()) {
return addr.getHostAddress();
}

@@ -15,6 +15,7 @@ import org.slf4j.LoggerFactory;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
@@ -106,9 +107,12 @@ public class JoobyService {
config.externalAddress());

// FIXME: This won't work outside of docker, may need to submit a PR to jooby to allow classpaths here
jooby.install(new JteModule(Path.of("/app/resources/jte"), Path.of("/app/classes/jte-precompiled")));
jooby.assets("/*", Paths.get("/app/resources/static"));

if (Files.exists(Path.of("/app/resources/jte")) || Files.exists(Path.of("/app/classes/jte-precompiled"))) {
jooby.install(new JteModule(Path.of("/app/resources/jte"), Path.of("/app/classes/jte-precompiled")));
}
if (Files.exists(Path.of("/app/resources/static"))) {
jooby.assets("/*", Paths.get("/app/resources/static"));
}
var options = new ServerOptions();
options.setHost(config.bindAddress());
options.setPort(restEndpoint.port());

@@ -6,25 +6,36 @@ import nu.marginalia.service.module.ServiceConfiguration;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlet.ServletHolder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.net.InetSocketAddress;

public class MetricsServer {

private static final Logger logger = LoggerFactory.getLogger(MetricsServer.class);

@Inject
public MetricsServer(ServiceConfiguration configuration) throws Exception {
public MetricsServer(ServiceConfiguration configuration) {
// If less than zero, we forego setting up a metrics server
if (configuration.metricsPort() < 0)
return;

Server server = new Server(new InetSocketAddress(configuration.bindAddress(), configuration.metricsPort()));
try {
Server server = new Server(new InetSocketAddress(configuration.bindAddress(), configuration.metricsPort()));

ServletContextHandler context = new ServletContextHandler();
context.setContextPath("/");
server.setHandler(context);
ServletContextHandler context = new ServletContextHandler();
context.setContextPath("/");
server.setHandler(context);

context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");

server.start();
logger.info("MetricsServer listening on {}:{}", configuration.bindAddress(), configuration.metricsPort());

server.start();
}
catch (Exception|NoSuchMethodError ex) {
logger.error("Failed to set up metrics server", ex);
}
}
}

@@ -35,21 +35,8 @@ public class RateLimiter {
}

public static RateLimiter forExpensiveRequest() {
return new RateLimiter(5, 10);
}

public static RateLimiter custom(int perMinute) {
return new RateLimiter(perMinute, 60);
}

public static RateLimiter forSpamBots() {
return new RateLimiter(120, 3600);
}

public static RateLimiter forLogin() {
return new RateLimiter(3, 15);
return new RateLimiter(4 * perMinute, perMinute);
}

private void cleanIdleBuckets() {
@@ -62,7 +49,7 @@ public class RateLimiter {
}

private Bucket createBucket() {
var refill = Refill.greedy(1, Duration.ofSeconds(refillRate));
var refill = Refill.greedy(refillRate, Duration.ofSeconds(60));
var bw = Bandwidth.classic(capacity, refill);
return Bucket.builder().addLimit(bw).build();
}

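After this change the remaining factory expresses everything in per-minute terms: custom(perMinute) builds a bucket holding four minutes' worth of tokens, refilled greedily at perMinute tokens per 60 seconds. A self-contained Bucket4j sketch of that bucket shape, using the same Refill/Bandwidth/Bucket calls as the diff (the demo class and the perMinute value of 60 are illustrative, not repository code):

    import io.github.bucket4j.Bandwidth;
    import io.github.bucket4j.Bucket;
    import io.github.bucket4j.Refill;
    import java.time.Duration;

    public class RateLimitSketch {
        public static void main(String[] args) {
            int perMinute = 60;
            // Mirrors createBucket() after the change: capacity of 4x the per-minute rate,
            // refilled greedily with perMinute tokens every 60 seconds.
            Refill refill = Refill.greedy(perMinute, Duration.ofSeconds(60));
            Bandwidth limit = Bandwidth.classic(4 * perMinute, refill);
            Bucket bucket = Bucket.builder().addLimit(limit).build();

            // Each request consumes one token; tryConsume returns false once the bucket is empty.
            System.out.println("first request allowed: " + bucket.tryConsume(1));
        }
    }

The previous code refilled a single token every refillRate seconds, which made short bursts much harsher than the configured per-minute rate implied; the new shape allows a burst of up to four minutes of budget while converging to the same average rate.
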
@@ -5,6 +5,7 @@
<Filters>
<MarkerFilter marker="QUERY" onMatch="DENY" onMismatch="NEUTRAL" />
<MarkerFilter marker="HTTP" onMatch="DENY" onMismatch="NEUTRAL" />
<MarkerFilter marker="CRAWLER" onMatch="DENY" onMismatch="NEUTRAL" />
</Filters>
</Console>
<RollingFile name="LogToFile" fileName="${env:WMSA_LOG_DIR:-/var/log/wmsa}/wmsa-${sys:service-name}-${env:WMSA_SERVICE_NODE:-0}.log" filePattern="/var/log/wmsa/wmsa-${sys:service-name}-${env:WMSA_SERVICE_NODE:-0}-log-%d{MM-dd-yy-HH-mm-ss}-%i.log.gz"
@@ -13,9 +14,20 @@
<Filters>
<MarkerFilter marker="QUERY" onMatch="DENY" onMismatch="NEUTRAL" />
<MarkerFilter marker="HTTP" onMatch="DENY" onMismatch="NEUTRAL" />
<MarkerFilter marker="CRAWLER" onMatch="DENY" onMismatch="NEUTRAL" />
</Filters>
<SizeBasedTriggeringPolicy size="10MB" />
</RollingFile>
<RollingFile name="LogToFile" fileName="${env:WMSA_LOG_DIR:-/var/log/wmsa}/crawler-audit-${env:WMSA_SERVICE_NODE:-0}.log" filePattern="/var/log/wmsa/crawler-audit-${env:WMSA_SERVICE_NODE:-0}-log-%d{MM-dd-yy-HH-mm-ss}-%i.log.gz"
ignoreExceptions="false">
<PatternLayout>
<Pattern>%d{yyyy-MM-dd HH:mm:ss,SSS}: %msg{nolookups}%n</Pattern>
</PatternLayout>
<SizeBasedTriggeringPolicy size="100MB" />
<Filters>
<MarkerFilter marker="CRAWLER" onMatch="ALLOW" onMismatch="DENY" />
</Filters>
</RollingFile>
</Appenders>
<Loggers>
<Logger name="org.apache.zookeeper" level="WARN" />

@@ -5,6 +5,7 @@
<Filters>
<MarkerFilter marker="QUERY" onMatch="DENY" onMismatch="NEUTRAL" />
<MarkerFilter marker="HTTP" onMatch="DENY" onMismatch="NEUTRAL" />
<MarkerFilter marker="CRAWLER" onMatch="DENY" onMismatch="NEUTRAL" />
</Filters>
</Console>
<RollingFile name="LogToFile" fileName="${env:WMSA_LOG_DIR:-/var/log/wmsa}/wmsa-${sys:service-name}-${env:WMSA_SERVICE_NODE:-0}.log" filePattern="/var/log/wmsa/wmsa-${sys:service-name}-${env:WMSA_SERVICE_NODE:-0}-log-%d{MM-dd-yy-HH-mm-ss}-%i.log.gz"
@@ -17,6 +18,17 @@
<MarkerFilter marker="PROCESS" onMatch="DENY" onMismatch="NEUTRAL" />
<MarkerFilter marker="QUERY" onMatch="DENY" onMismatch="NEUTRAL" />
<MarkerFilter marker="HTTP" onMatch="DENY" onMismatch="NEUTRAL" />
<MarkerFilter marker="CRAWLER" onMatch="DENY" onMismatch="NEUTRAL" />
</Filters>
</RollingFile>
<RollingFile name="LogToFile" fileName="${env:WMSA_LOG_DIR:-/var/log/wmsa}/crawler-audit-${env:WMSA_SERVICE_NODE:-0}.log" filePattern="/var/log/wmsa/crawler-audit-${env:WMSA_SERVICE_NODE:-0}-log-%d{MM-dd-yy-HH-mm-ss}-%i.log.gz"
ignoreExceptions="false">
<PatternLayout>
<Pattern>%d{yyyy-MM-dd HH:mm:ss,SSS}: %msg{nolookups}%n</Pattern>
</PatternLayout>
<SizeBasedTriggeringPolicy size="100MB" />
<Filters>
<MarkerFilter marker="CRAWLER" onMatch="ALLOW" onMismatch="DENY" />
</Filters>
</RollingFile>
</Appenders>

@@ -25,7 +25,7 @@ import static org.mockito.Mockito.when;
class ZkServiceRegistryTest {
private static final int ZOOKEEPER_PORT = 2181;
private static final GenericContainer<?> zookeeper =
new GenericContainer<>("zookeeper:3.8.0")
new GenericContainer<>("zookeeper:3.8")
.withExposedPorts(ZOOKEEPER_PORT);

List<ZkServiceRegistry> registries = new ArrayList<>();

@@ -48,12 +48,13 @@ public class ExecutorExportClient {
return msgId;
}

public void exportSampleData(int node, FileStorageId fid, int size, String name) {
public void exportSampleData(int node, FileStorageId fid, int size, String ctFilter, String name) {
channelPool.call(ExecutorExportApiBlockingStub::exportSampleData)
.forNode(node)
.run(RpcExportSampleData.newBuilder()
.setFileStorageId(fid.id())
.setSize(size)
.setCtFilter(ctFilter)
.setName(name)
.build());
}

@@ -100,6 +100,7 @@ message RpcExportSampleData {
int64 fileStorageId = 1;
int32 size = 2;
string name = 3;
string ctFilter = 4;
}
message RpcDownloadSampleData {
string sampleSet = 1;

@@ -20,6 +20,7 @@ public enum ExecutorActor {
EXPORT_FEEDS(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
EXPORT_SAMPLE_DATA(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
DOWNLOAD_SAMPLE(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
MIGRATE_CRAWL_DATA(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),

PROC_CONVERTER_SPAWNER(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED, NodeProfile.SIDELOAD),
PROC_LOADER_SPAWNER(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED, NodeProfile.SIDELOAD),

@@ -66,6 +66,7 @@ public class ExecutorActorControlService {
DownloadSampleActor downloadSampleActor,
ScrapeFeedsActor scrapeFeedsActor,
ExecutorActorStateMachines stateMachines,
MigrateCrawlDataActor migrateCrawlDataActor,
ExportAllPrecessionActor exportAllPrecessionActor,
UpdateRssActor updateRssActor) throws SQLException {
this.messageQueueFactory = messageQueueFactory;
@@ -107,6 +108,8 @@ public class ExecutorActorControlService {
register(ExecutorActor.SCRAPE_FEEDS, scrapeFeedsActor);
register(ExecutorActor.UPDATE_RSS, updateRssActor);

register(ExecutorActor.MIGRATE_CRAWL_DATA, migrateCrawlDataActor);

if (serviceConfiguration.node() == 1) {
register(ExecutorActor.PREC_EXPORT_ALL, exportAllPrecessionActor);
}

@@ -14,6 +14,8 @@ import nu.marginalia.mq.persistence.MqPersistence;
import nu.marginalia.nodecfg.NodeConfigurationService;
import nu.marginalia.nodecfg.model.NodeProfile;
import nu.marginalia.service.module.ServiceConfiguration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.time.Duration;
import java.time.LocalDateTime;
@@ -29,6 +31,7 @@ public class UpdateRssActor extends RecordActorPrototype {

private final NodeConfigurationService nodeConfigurationService;
private final MqPersistence persistence;
private static final Logger logger = LoggerFactory.getLogger(UpdateRssActor.class);

@Inject
public UpdateRssActor(Gson gson,
@@ -101,8 +104,8 @@ public class UpdateRssActor extends RecordActorPrototype {
case UpdateRefresh(int count, long msgId) -> {
MqMessage msg = persistence.waitForMessageTerminalState(msgId, Duration.ofSeconds(10), Duration.ofHours(12));
if (msg == null) {
// Retry the update
yield new Error("Failed to update feeds: message not found");
logger.warn("UpdateRefresh is taking a very long time");
yield new UpdateRefresh(count, msgId);
} else if (msg.state() != MqMessageState.OK) {
// Retry the update
yield new Error("Failed to update feeds: " + msg.state());
@@ -119,8 +122,8 @@ public class UpdateRssActor extends RecordActorPrototype {
case UpdateClean(long msgId) -> {
MqMessage msg = persistence.waitForMessageTerminalState(msgId, Duration.ofSeconds(10), Duration.ofHours(12));
if (msg == null) {
// Retry the update
yield new Error("Failed to update feeds: message not found");
logger.warn("UpdateClean is taking a very long time");
yield new UpdateClean(msgId);
} else if (msg.state() != MqMessageState.OK) {
// Retry the update
yield new Error("Failed to update feeds: " + msg.state());

@@ -8,6 +8,7 @@ import nu.marginalia.actor.state.ActorResumeBehavior;
import nu.marginalia.actor.state.ActorStep;
import nu.marginalia.actor.state.Resume;
import nu.marginalia.service.control.ServiceEventLog;
import nu.marginalia.service.control.ServiceHeartbeat;
import nu.marginalia.storage.FileStorageService;
import nu.marginalia.storage.model.FileStorage;
import nu.marginalia.storage.model.FileStorageId;
@@ -19,6 +20,7 @@ import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
@@ -32,6 +34,7 @@ public class DownloadSampleActor extends RecordActorPrototype {

private final FileStorageService storageService;
private final ServiceEventLog eventLog;
private final ServiceHeartbeat heartbeat;
private final Logger logger = LoggerFactory.getLogger(getClass());

@Resume(behavior = ActorResumeBehavior.ERROR)
@@ -66,15 +69,39 @@ public class DownloadSampleActor extends RecordActorPrototype {

Files.deleteIfExists(Path.of(tarFileName));

try (var is = new BufferedInputStream(new URI(downloadURI).toURL().openStream());
var os = new BufferedOutputStream(Files.newOutputStream(Path.of(tarFileName), StandardOpenOption.CREATE))) {
is.transferTo(os);
HttpURLConnection urlConnection = (HttpURLConnection) new URI(downloadURI).toURL().openConnection();

try (var hb = heartbeat.createServiceAdHocTaskHeartbeat("Downloading sample")) {
long size = urlConnection.getContentLengthLong();
byte[] buffer = new byte[8192];

try (var is = new BufferedInputStream(urlConnection.getInputStream());
var os = new BufferedOutputStream(Files.newOutputStream(Path.of(tarFileName), StandardOpenOption.CREATE))) {
long copiedSize = 0;

while (copiedSize < size) {
int read = is.read(buffer);

if (read < 0) // We've been promised a file of length 'size'
throw new IOException("Unexpected end of stream");

os.write(buffer, 0, read);
copiedSize += read;

// Update progress bar
hb.progress(String.format("%d MB", copiedSize / 1024 / 1024), (int) (copiedSize / 1024), (int) (size / 1024));
}
}

}
catch (Exception ex) {
eventLog.logEvent(DownloadSampleActor.class, "Error downloading sample");
logger.error("Error downloading sample", ex);
yield new Error();
}
finally {
urlConnection.disconnect();
}

eventLog.logEvent(DownloadSampleActor.class, "Download complete");
yield new Extract(fileStorageId, tarFileName);
@@ -170,11 +197,12 @@ public class DownloadSampleActor extends RecordActorPrototype {
@Inject
public DownloadSampleActor(Gson gson,
FileStorageService storageService,
ServiceEventLog eventLog)
ServiceEventLog eventLog, ServiceHeartbeat heartbeat)
{
super(gson);
this.storageService = storageService;
this.eventLog = eventLog;
this.heartbeat = heartbeat;
}

}

@@ -26,32 +26,32 @@ public class ExportSampleDataActor extends RecordActorPrototype {
private final MqOutbox exportTasksOutbox;
private final Logger logger = LoggerFactory.getLogger(getClass());

public record Export(FileStorageId crawlId, int size, String name) implements ActorStep {}
public record Run(FileStorageId crawlId, FileStorageId destId, int size, String name, long msgId) implements ActorStep {
public Run(FileStorageId crawlId, FileStorageId destId, int size, String name) {
this(crawlId, destId, size, name, -1);
public record Export(FileStorageId crawlId, int size, String ctFilter, String name) implements ActorStep {}
public record Run(FileStorageId crawlId, FileStorageId destId, int size, String ctFilter, String name, long msgId) implements ActorStep {
public Run(FileStorageId crawlId, FileStorageId destId, int size, String name, String ctFilter) {
this(crawlId, destId, size, name, ctFilter,-1);
}
}

@Override
public ActorStep transition(ActorStep self) throws Exception {
return switch(self) {
case Export(FileStorageId crawlId, int size, String name) -> {
case Export(FileStorageId crawlId, int size, String ctFilter, String name) -> {
var storage = storageService.allocateStorage(FileStorageType.EXPORT,
"crawl-sample-export",
"Crawl Data Sample " + name + "/" + size + " " + LocalDateTime.now()
);

if (storage == null) yield new Error("Bad storage id");
yield new Run(crawlId, storage.id(), size, name);
yield new Run(crawlId, storage.id(), size, ctFilter, name);
}
case Run(FileStorageId crawlId, FileStorageId destId, int size, String name, long msgId) when msgId < 0 -> {
case Run(FileStorageId crawlId, FileStorageId destId, int size, String ctFilter, String name, long msgId) when msgId < 0 -> {
storageService.setFileStorageState(destId, FileStorageState.NEW);

long newMsgId = exportTasksOutbox.sendAsync(ExportTaskRequest.sampleData(crawlId, destId, size, name));
yield new Run(crawlId, destId, size, name, newMsgId);
long newMsgId = exportTasksOutbox.sendAsync(ExportTaskRequest.sampleData(crawlId, destId, ctFilter, size, name));
yield new Run(crawlId, destId, size, ctFilter, name, newMsgId);
}
case Run(_, FileStorageId destId, _, _, long msgId) -> {
case Run(_, FileStorageId destId, _, _, _, long msgId) -> {
var rsp = processWatcher.waitResponse(exportTasksOutbox, ProcessService.ProcessId.EXPORT_TASKS, msgId);

if (rsp.state() != MqMessageState.OK) {
@@ -70,7 +70,7 @@ public class ExportSampleDataActor extends RecordActorPrototype {

@Override
public String describe() {
return "Export RSS/Atom feeds from crawl data";
return "Export sample crawl data";
}

@Inject

@@ -0,0 +1,150 @@
|
||||
package nu.marginalia.actor.task;
|
||||
|
||||
import com.google.gson.Gson;
|
||||
import jakarta.inject.Inject;
|
||||
import jakarta.inject.Singleton;
|
||||
import nu.marginalia.actor.prototype.RecordActorPrototype;
|
||||
import nu.marginalia.actor.state.ActorStep;
|
||||
import nu.marginalia.io.CrawlerOutputFile;
|
||||
import nu.marginalia.process.log.WorkLog;
|
||||
import nu.marginalia.process.log.WorkLogEntry;
|
||||
import nu.marginalia.service.control.ServiceHeartbeat;
|
||||
import nu.marginalia.slop.SlopCrawlDataRecord;
|
||||
import nu.marginalia.storage.FileStorageService;
|
||||
import nu.marginalia.storage.model.FileStorage;
|
||||
import nu.marginalia.storage.model.FileStorageId;
|
||||
import org.apache.logging.log4j.util.Strings;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Path;
|
||||
import java.nio.file.StandardCopyOption;
|
||||
import java.util.Map;
|
||||
import java.util.Optional;
|
||||
import java.util.function.Function;
|
||||
|
||||
@Singleton
|
||||
public class MigrateCrawlDataActor extends RecordActorPrototype {
|
||||
|
||||
private final FileStorageService fileStorageService;
|
||||
private final ServiceHeartbeat serviceHeartbeat;
|
||||
private static final Logger logger = LoggerFactory.getLogger(MigrateCrawlDataActor.class);
|
||||
|
||||
@Inject
|
||||
public MigrateCrawlDataActor(Gson gson, FileStorageService fileStorageService, ServiceHeartbeat serviceHeartbeat) {
|
||||
super(gson);
|
||||
|
||||
this.fileStorageService = fileStorageService;
|
||||
this.serviceHeartbeat = serviceHeartbeat;
|
||||
}
|
||||
|
||||
public record Run(long fileStorageId) implements ActorStep {}
|
||||
|
||||
@Override
|
||||
public ActorStep transition(ActorStep self) throws Exception {
|
||||
return switch (self) {
|
||||
case Run(long fileStorageId) -> {
|
||||
|
||||
FileStorage storage = fileStorageService.getStorage(FileStorageId.of(fileStorageId));
|
||||
Path root = storage.asPath();
|
||||
|
||||
Path crawlerLog = root.resolve("crawler.log");
|
||||
Path newCrawlerLog = Files.createTempFile(root, "crawler", ".migrate.log");
|
||||
|
||||
int totalEntries = WorkLog.countEntries(crawlerLog);
|
||||
|
||||
try (WorkLog workLog = new WorkLog(newCrawlerLog);
|
||||
var heartbeat = serviceHeartbeat.createServiceAdHocTaskHeartbeat("Migrating")
|
||||
) {
|
||||
int entryIdx = 0;
|
||||
|
||||
for (Map.Entry<WorkLogEntry, Path> item : WorkLog.iterableMap(crawlerLog, new CrawlDataLocator(root))) {
|
||||
|
||||
final WorkLogEntry entry = item.getKey();
|
||||
final Path inputPath = item.getValue();
|
||||
|
||||
Path outputPath = inputPath;
|
||||
heartbeat.progress("Migrating" + inputPath.getFileName(), entryIdx++, totalEntries);
|
||||
|
||||
if (inputPath.toString().endsWith(".parquet")) {
|
||||
String domain = entry.id();
|
||||
String id = Integer.toHexString(domain.hashCode());
|
||||
|
||||
outputPath = CrawlerOutputFile.createSlopPath(root, id, domain);
|
||||
|
||||
if (Files.exists(inputPath)) {
|
||||
try {
|
||||
SlopCrawlDataRecord.convertFromParquet(inputPath, outputPath);
|
||||
Files.deleteIfExists(inputPath);
|
||||
} catch (Exception ex) {
|
||||
outputPath = inputPath; // don't update the work log on error
|
||||
logger.error("Failed to convert " + inputPath, ex);
|
||||
}
|
||||
}
|
||||
else if (!Files.exists(inputPath) && !Files.exists(outputPath)) {
|
||||
// if the input file is missing, and the output file is missing, we just write the log
|
||||
// record identical to the old one
|
||||
outputPath = inputPath;
|
||||
}
|
||||
}
|
||||
|
||||
// Write a log entry for the (possibly) converted file
|
||||
workLog.setJobToFinished(entry.id(), outputPath.toString(), entry.cnt());
|
||||
}
|
||||
}
|
||||
|
||||
Path oldCrawlerLog = Files.createTempFile(root, "crawler-", ".migrate.old.log");
|
||||
Files.move(crawlerLog, oldCrawlerLog, StandardCopyOption.REPLACE_EXISTING);
|
||||
Files.move(newCrawlerLog, crawlerLog);
|
||||
|
||||
yield new End();
|
||||
}
|
||||
default -> new Error();
|
||||
};
|
||||
}
|
||||
|
||||
private static class CrawlDataLocator implements Function<WorkLogEntry, Optional<Map.Entry<WorkLogEntry, Path>>> {
|
||||
|
||||
private final Path crawlRootDir;
|
||||
|
||||
CrawlDataLocator(Path crawlRootDir) {
|
||||
this.crawlRootDir = crawlRootDir;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Optional<Map.Entry<WorkLogEntry, Path>> apply(WorkLogEntry entry) {
|
||||
var path = getCrawledFilePath(crawlRootDir, entry.path());
|
||||
|
||||
if (!Files.exists(path)) {
|
||||
return Optional.empty();
|
||||
}
|
||||
|
||||
try {
|
||||
return Optional.of(Map.entry(entry, path));
|
||||
}
|
||||
catch (Exception ex) {
|
||||
return Optional.empty();
|
||||
}
|
||||
}
|
||||
|
||||
private Path getCrawledFilePath(Path crawlDir, String fileName) {
|
||||
int sp = fileName.lastIndexOf('/');
|
||||
|
||||
// Normalize the filename
|
||||
if (sp >= 0 && sp + 1 < fileName.length())
|
||||
fileName = fileName.substring(sp + 1);
|
||||
if (fileName.length() < 4)
|
||||
fileName = Strings.repeat("0", 4 - fileName.length()) + fileName;
|
||||
|
||||
String sp1 = fileName.substring(0, 2);
|
||||
String sp2 = fileName.substring(2, 4);
|
||||
return crawlDir.resolve(sp1).resolve(sp2).resolve(fileName);
|
||||
}
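
// Worked example of the path normalization above (a sketch, not part of the codebase):
// a work-log entry whose path ends in "1b2f.slop.zstd" is reduced to its file name,
// left-padded to at least four characters if needed, and resolved under crawlRoot/<c1c2>/<c3c4>/.
import java.nio.file.Path;

class ShardedPathExample {
    public static void main(String[] args) {
        Path crawlRoot = Path.of("/storage/crawl-data");   // illustrative root
        String fileName = "1b2f.slop.zstd";
        String sp1 = fileName.substring(0, 2);   // "1b"
        String sp2 = fileName.substring(2, 4);   // "2f"
        // prints /storage/crawl-data/1b/2f/1b2f.slop.zstd
        System.out.println(crawlRoot.resolve(sp1).resolve(sp2).resolve(fileName));
    }
}
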
|
||||
}
|
||||
|
||||
@Override
|
||||
public String describe() {
|
||||
return "Migrates crawl data to the latest format";
|
||||
}
|
||||
}
|
@@ -49,6 +49,7 @@ public class ExecutorExportGrpcService
new ExportSampleDataActor.Export(
FileStorageId.of(request.getFileStorageId()),
request.getSize(),
request.getCtFilter(),
request.getName()
)
);

code/functions/favicon/api/build.gradle (new file, 47 lines)
@@ -0,0 +1,47 @@
|
||||
plugins {
|
||||
id 'java'
|
||||
|
||||
id "com.google.protobuf" version "0.9.4"
|
||||
id 'jvm-test-suite'
|
||||
}
|
||||
|
||||
java {
|
||||
toolchain {
|
||||
languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
|
||||
}
|
||||
}
|
||||
|
||||
jar.archiveBaseName = 'favicon-api'
|
||||
|
||||
apply from: "$rootProject.projectDir/protobuf.gradle"
|
||||
apply from: "$rootProject.projectDir/srcsets.gradle"
|
||||
|
||||
dependencies {
|
||||
implementation project(':code:common:model')
|
||||
implementation project(':code:common:config')
|
||||
implementation project(':code:common:service')
|
||||
|
||||
implementation libs.bundles.slf4j
|
||||
|
||||
implementation libs.prometheus
|
||||
implementation libs.notnull
|
||||
implementation libs.guava
|
||||
implementation dependencies.create(libs.guice.get()) {
|
||||
exclude group: 'com.google.guava'
|
||||
}
|
||||
implementation libs.gson
|
||||
implementation libs.bundles.protobuf
|
||||
implementation libs.guava
|
||||
libs.bundles.grpc.get().each {
|
||||
implementation dependencies.create(it) {
|
||||
exclude group: 'com.google.guava'
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
|
||||
testImplementation libs.bundles.slf4j.test
|
||||
testImplementation libs.bundles.junit
|
||||
testImplementation libs.mockito
|
||||
|
||||
}
|
@@ -0,0 +1,39 @@
|
||||
package nu.marginalia.api.favicon;
|
||||
|
||||
import com.google.inject.Inject;
|
||||
import nu.marginalia.service.client.GrpcChannelPoolFactory;
|
||||
import nu.marginalia.service.client.GrpcMultiNodeChannelPool;
|
||||
import nu.marginalia.service.discovery.property.ServiceKey;
|
||||
import nu.marginalia.service.discovery.property.ServicePartition;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import java.util.Optional;
|
||||
|
||||
public class FaviconClient {
|
||||
private static final Logger logger = LoggerFactory.getLogger(FaviconClient.class);
|
||||
|
||||
private final GrpcMultiNodeChannelPool<FaviconAPIGrpc.FaviconAPIBlockingStub> channelPool;
|
||||
|
||||
@Inject
|
||||
public FaviconClient(GrpcChannelPoolFactory factory) {
|
||||
this.channelPool = factory.createMulti(
|
||||
ServiceKey.forGrpcApi(FaviconAPIGrpc.class, ServicePartition.multi()),
|
||||
FaviconAPIGrpc::newBlockingStub);
|
||||
}
|
||||
|
||||
public record FaviconData(byte[] bytes, String contentType) {}
|
||||
|
||||
|
||||
public Optional<FaviconData> getFavicon(String domain, int node) {
|
||||
RpcFaviconResponse rsp = channelPool.call(FaviconAPIGrpc.FaviconAPIBlockingStub::getFavicon)
|
||||
.forNode(node)
|
||||
.run(RpcFaviconRequest.newBuilder().setDomain(domain).build());
|
||||
|
||||
if (rsp.getData().isEmpty())
|
||||
return Optional.empty();
|
||||
|
||||
return Optional.of(new FaviconData(rsp.getData().toByteArray(), rsp.getContentType()));
|
||||
}
|
||||
|
||||
}
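
// A sketch of how the new FaviconClient might be consumed by another service; the surrounding
// wiring (injection of the client, the choice of node id, and the response handling) is assumed
// rather than taken from the repository.
class FaviconUsageSketch {
    void serveFavicon(nu.marginalia.api.favicon.FaviconClient faviconClient, String domain, int node) {
        faviconClient.getFavicon(domain, node).ifPresentOrElse(
                data -> System.out.println("Got " + data.bytes().length + " bytes of " + data.contentType()),
                () -> System.out.println("No favicon stored for " + domain)
        );
    }
}
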
|
code/functions/favicon/api/src/main/protobuf/favicon.proto (new file, 20 lines)
@@ -0,0 +1,20 @@
|
||||
syntax="proto3";
|
||||
package marginalia.api.favicon;
|
||||
|
||||
option java_package="nu.marginalia.api.favicon";
|
||||
option java_multiple_files=true;
|
||||
|
||||
service FaviconAPI {
|
||||
/** Fetches information about a domain. */
|
||||
rpc getFavicon(RpcFaviconRequest) returns (RpcFaviconResponse) {}
|
||||
}
|
||||
|
||||
message RpcFaviconRequest {
|
||||
string domain = 1;
|
||||
}
|
||||
|
||||
message RpcFaviconResponse {
|
||||
string domain = 1;
|
||||
bytes data = 2;
|
||||
string contentType = 3;
|
||||
}
|
code/functions/favicon/build.gradle (new file, 49 lines)
@@ -0,0 +1,49 @@
|
||||
plugins {
|
||||
id 'java'
|
||||
|
||||
id 'application'
|
||||
id 'jvm-test-suite'
|
||||
}
|
||||
|
||||
java {
|
||||
toolchain {
|
||||
languageVersion.set(JavaLanguageVersion.of(rootProject.ext.jvmVersion))
|
||||
}
|
||||
}
|
||||
|
||||
apply from: "$rootProject.projectDir/srcsets.gradle"
|
||||
|
||||
dependencies {
|
||||
implementation project(':code:common:config')
|
||||
implementation project(':code:common:service')
|
||||
implementation project(':code:common:model')
|
||||
implementation project(':code:common:db')
|
||||
implementation project(':code:functions:favicon:api')
|
||||
implementation project(':code:processes:crawling-process')
|
||||
|
||||
implementation libs.bundles.slf4j
|
||||
|
||||
implementation libs.prometheus
|
||||
implementation libs.guava
|
||||
libs.bundles.grpc.get().each {
|
||||
implementation dependencies.create(it) {
|
||||
exclude group: 'com.google.guava'
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
implementation libs.notnull
|
||||
implementation libs.guava
|
||||
implementation dependencies.create(libs.guice.get()) {
|
||||
exclude group: 'com.google.guava'
|
||||
}
|
||||
implementation dependencies.create(libs.spark.get()) {
|
||||
exclude group: 'org.eclipse.jetty'
|
||||
}
|
||||
|
||||
testImplementation libs.bundles.slf4j.test
|
||||
testImplementation libs.bundles.junit
|
||||
testImplementation libs.mockito
|
||||
|
||||
|
||||
}
|
@@ -0,0 +1,48 @@
|
||||
package nu.marginalia.functions.favicon;
|
||||
|
||||
import com.google.inject.Inject;
|
||||
import com.google.inject.Singleton;
|
||||
import com.google.protobuf.ByteString;
|
||||
import io.grpc.stub.StreamObserver;
|
||||
import nu.marginalia.api.favicon.FaviconAPIGrpc;
|
||||
import nu.marginalia.api.favicon.RpcFaviconRequest;
|
||||
import nu.marginalia.api.favicon.RpcFaviconResponse;
|
||||
import nu.marginalia.crawl.DomainStateDb;
|
||||
import nu.marginalia.service.server.DiscoverableService;
|
||||
|
||||
import java.util.Optional;
|
||||
|
||||
@Singleton
|
||||
public class FaviconGrpcService extends FaviconAPIGrpc.FaviconAPIImplBase implements DiscoverableService {
|
||||
private final DomainStateDb domainStateDb;
|
||||
|
||||
@Inject
|
||||
public FaviconGrpcService(DomainStateDb domainStateDb) {
|
||||
this.domainStateDb = domainStateDb;
|
||||
}
|
||||
|
||||
public boolean shouldRegisterService() {
|
||||
return domainStateDb.isAvailable();
|
||||
}
|
||||
|
||||
@Override
|
||||
public void getFavicon(RpcFaviconRequest request, StreamObserver<RpcFaviconResponse> responseObserver) {
|
||||
Optional<DomainStateDb.FaviconRecord> icon = domainStateDb.getIcon(request.getDomain());
|
||||
|
||||
RpcFaviconResponse response;
|
||||
if (icon.isEmpty()) {
|
||||
response = RpcFaviconResponse.newBuilder().build();
|
||||
}
|
||||
else {
|
||||
var iconRecord = icon.get();
|
||||
response = RpcFaviconResponse.newBuilder()
|
||||
.setContentType(iconRecord.contentType())
|
||||
.setDomain(request.getDomain())
|
||||
.setData(ByteString.copyFrom(iconRecord.imageData()))
|
||||
.build();
|
||||
}
|
||||
|
||||
responseObserver.onNext(response);
|
||||
responseObserver.onCompleted();
|
||||
}
|
||||
}
|
@@ -34,6 +34,7 @@ dependencies {
|
||||
implementation libs.bundles.slf4j
|
||||
implementation libs.commons.lang3
|
||||
implementation libs.commons.io
|
||||
implementation libs.wiremock
|
||||
|
||||
implementation libs.prometheus
|
||||
implementation libs.guava
|
||||
|
@@ -1,6 +1,7 @@
|
||||
package nu.marginalia.livecapture;
|
||||
|
||||
import com.google.gson.Gson;
|
||||
import nu.marginalia.WmsaHome;
|
||||
import nu.marginalia.model.gson.GsonFactory;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
@@ -12,10 +13,13 @@ import java.net.http.HttpRequest;
|
||||
import java.net.http.HttpResponse;
|
||||
import java.time.Duration;
|
||||
import java.util.Map;
|
||||
import java.util.Optional;
|
||||
|
||||
/** Client for local browserless.io API */
|
||||
public class BrowserlessClient implements AutoCloseable {
|
||||
|
||||
private static final Logger logger = LoggerFactory.getLogger(BrowserlessClient.class);
|
||||
private static final String BROWSERLESS_TOKEN = System.getProperty("live-capture.browserless-token", "BROWSERLESS_TOKEN");
|
||||
|
||||
private final HttpClient httpClient = HttpClient.newBuilder()
|
||||
.version(HttpClient.Version.HTTP_1_1)
|
||||
@@ -25,18 +29,21 @@ public class BrowserlessClient implements AutoCloseable {
|
||||
private final URI browserlessURI;
|
||||
private final Gson gson = GsonFactory.get();
|
||||
|
||||
private final String userAgent = WmsaHome.getUserAgent().uaString();
|
||||
|
||||
public BrowserlessClient(URI browserlessURI) {
|
||||
this.browserlessURI = browserlessURI;
|
||||
}
|
||||
|
||||
public String content(String url, GotoOptions gotoOptions) throws IOException, InterruptedException {
|
||||
public Optional<String> content(String url, GotoOptions gotoOptions) throws IOException, InterruptedException {
|
||||
Map<String, Object> requestData = Map.of(
|
||||
"url", url,
|
||||
"userAgent", userAgent,
|
||||
"gotoOptions", gotoOptions
|
||||
);
|
||||
|
||||
var request = HttpRequest.newBuilder()
|
||||
.uri(browserlessURI.resolve("/content"))
|
||||
.uri(browserlessURI.resolve("/content?token="+BROWSERLESS_TOKEN))
|
||||
.method("POST", HttpRequest.BodyPublishers.ofString(
|
||||
gson.toJson(requestData)
|
||||
))
|
||||
@@ -47,10 +54,10 @@ public class BrowserlessClient implements AutoCloseable {
|
||||
|
||||
if (rsp.statusCode() >= 300) {
|
||||
logger.info("Failed to fetch content for {}, status {}", url, rsp.statusCode());
|
||||
return null;
|
||||
return Optional.empty();
|
||||
}
|
||||
|
||||
return rsp.body();
|
||||
return Optional.of(rsp.body());
|
||||
}
|
||||
|
||||
public byte[] screenshot(String url, GotoOptions gotoOptions, ScreenshotOptions screenshotOptions)
|
||||
@@ -58,12 +65,13 @@ public class BrowserlessClient implements AutoCloseable {
|
||||
|
||||
Map<String, Object> requestData = Map.of(
|
||||
"url", url,
|
||||
"userAgent", userAgent,
|
||||
"options", screenshotOptions,
|
||||
"gotoOptions", gotoOptions
|
||||
);
|
||||
|
||||
var request = HttpRequest.newBuilder()
|
||||
.uri(browserlessURI.resolve("/screenshot"))
|
||||
.uri(browserlessURI.resolve("/screenshot?token="+BROWSERLESS_TOKEN))
|
||||
.method("POST", HttpRequest.BodyPublishers.ofString(
|
||||
gson.toJson(requestData)
|
||||
))
|
||||
@@ -82,7 +90,7 @@ public class BrowserlessClient implements AutoCloseable {
|
||||
}
|
||||
|
||||
@Override
|
||||
public void close() throws Exception {
|
||||
public void close() {
|
||||
httpClient.shutdownNow();
|
||||
}
|
||||
|
||||
|
@@ -1,6 +1,6 @@
|
||||
package nu.marginalia.rss.model;
|
||||
|
||||
import com.apptasticsoftware.rssreader.Item;
|
||||
import nu.marginalia.rss.svc.SimpleFeedParser;
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
import org.jetbrains.annotations.NotNull;
|
||||
import org.jsoup.Jsoup;
|
||||
@@ -18,37 +18,33 @@ public record FeedItem(String title,
|
||||
public static final int MAX_DESC_LENGTH = 255;
|
||||
public static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");
|
||||
|
||||
public static FeedItem fromItem(Item item, boolean keepFragment) {
|
||||
String title = item.getTitle().orElse("");
|
||||
public static FeedItem fromItem(SimpleFeedParser.ItemData item, boolean keepFragment) {
|
||||
String title = item.title();
|
||||
String date = getItemDate(item);
|
||||
String description = getItemDescription(item);
|
||||
String url;
|
||||
|
||||
if (keepFragment || item.getLink().isEmpty()) {
|
||||
url = item.getLink().orElse("");
|
||||
if (keepFragment) {
|
||||
url = item.url();
|
||||
}
|
||||
else {
|
||||
try {
|
||||
String link = item.getLink().get();
|
||||
String link = item.url();
|
||||
var linkUri = new URI(link);
|
||||
var cleanUri = new URI(linkUri.getScheme(), linkUri.getAuthority(), linkUri.getPath(), linkUri.getQuery(), null);
|
||||
url = cleanUri.toString();
|
||||
}
|
||||
catch (Exception e) {
|
||||
// fallback to original link if we can't clean it, this is not a very important step
|
||||
url = item.getLink().get();
|
||||
url = item.url();
|
||||
}
|
||||
}
|
||||
|
||||
return new FeedItem(title, date, description, url);
|
||||
}
|
||||
|
||||
private static String getItemDescription(Item item) {
|
||||
Optional<String> description = item.getDescription();
|
||||
if (description.isEmpty())
|
||||
return "";
|
||||
|
||||
String rawDescription = description.get();
|
||||
private static String getItemDescription(SimpleFeedParser.ItemData item) {
|
||||
String rawDescription = item.description();
|
||||
if (rawDescription.indexOf('<') >= 0) {
|
||||
rawDescription = Jsoup.parseBodyFragment(rawDescription).text();
|
||||
}
|
||||
@@ -58,15 +54,18 @@ public record FeedItem(String title,
|
||||
|
||||
// e.g. http://fabiensanglard.net/rss.xml does dates like this: 1 Apr 2021 00:00:00 +0000
|
||||
private static final DateTimeFormatter extraFormatter = DateTimeFormatter.ofPattern("d MMM yyyy HH:mm:ss Z");
|
||||
private static String getItemDate(Item item) {
|
||||
private static String getItemDate(SimpleFeedParser.ItemData item) {
|
||||
Optional<ZonedDateTime> zonedDateTime = Optional.empty();
|
||||
try {
|
||||
zonedDateTime = item.getPubDateZonedDateTime();
|
||||
}
|
||||
catch (Exception e) {
|
||||
zonedDateTime = item.getPubDate()
|
||||
.map(extraFormatter::parse)
|
||||
.map(ZonedDateTime::from);
|
||||
try {
|
||||
zonedDateTime = Optional.of(ZonedDateTime.from(extraFormatter.parse(item.pubDate())));
|
||||
}
|
||||
catch (Exception e2) {
|
||||
// ignore
|
||||
}
|
||||
}
|
||||
|
||||
return zonedDateTime.map(date -> date.format(DATE_FORMAT)).orElse("");
|
||||
|
@@ -1,7 +1,5 @@
|
||||
package nu.marginalia.rss.svc;
|
||||
|
||||
import com.apptasticsoftware.rssreader.Item;
|
||||
import com.apptasticsoftware.rssreader.RssReader;
|
||||
import com.google.inject.Inject;
|
||||
import com.opencsv.CSVReader;
|
||||
import nu.marginalia.WmsaHome;
|
||||
@@ -20,7 +18,6 @@ import nu.marginalia.storage.FileStorageService;
|
||||
import nu.marginalia.storage.model.FileStorage;
|
||||
import nu.marginalia.storage.model.FileStorageType;
|
||||
import nu.marginalia.util.SimpleBlockingThreadPool;
|
||||
import org.apache.commons.io.input.BOMInputStream;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
@@ -32,11 +29,11 @@ import java.net.URISyntaxException;
|
||||
import java.net.http.HttpClient;
|
||||
import java.net.http.HttpRequest;
|
||||
import java.net.http.HttpResponse;
|
||||
import java.nio.charset.StandardCharsets;
|
||||
import java.sql.SQLException;
|
||||
import java.time.*;
|
||||
import java.time.format.DateTimeFormatter;
|
||||
import java.util.*;
|
||||
import java.util.concurrent.ExecutorService;
|
||||
import java.util.concurrent.Executors;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
import java.util.concurrent.atomic.AtomicInteger;
|
||||
@@ -48,8 +45,6 @@ public class FeedFetcherService {
|
||||
private static final int MAX_FEED_ITEMS = 10;
|
||||
private static final Logger logger = LoggerFactory.getLogger(FeedFetcherService.class);
|
||||
|
||||
private final RssReader rssReader = new RssReader();
|
||||
|
||||
private final FeedDb feedDb;
|
||||
private final FileStorageService fileStorageService;
|
||||
private final NodeConfigurationService nodeConfigurationService;
|
||||
@@ -72,23 +67,12 @@ public class FeedFetcherService {
|
||||
this.nodeConfigurationService = nodeConfigurationService;
|
||||
this.serviceHeartbeat = serviceHeartbeat;
|
||||
this.executorClient = executorClient;
|
||||
|
||||
|
||||
// Add support for some alternate date tags for atom
|
||||
rssReader.addItemExtension("issued", this::setDateFallback);
|
||||
rssReader.addItemExtension("created", this::setDateFallback);
|
||||
}
|
||||
|
||||
private void setDateFallback(Item item, String value) {
|
||||
if (item.getPubDate().isEmpty()) {
|
||||
item.setPubDate(value);
|
||||
}
|
||||
}
|
||||
|
||||
public enum UpdateMode {
|
||||
CLEAN,
|
||||
REFRESH
|
||||
};
|
||||
}
|
||||
|
||||
public void updateFeeds(UpdateMode updateMode) throws IOException {
|
||||
if (updating) // Prevent concurrent updates
|
||||
@@ -104,6 +88,7 @@ public class FeedFetcherService {
|
||||
.followRedirects(HttpClient.Redirect.NORMAL)
|
||||
.version(HttpClient.Version.HTTP_2)
|
||||
.build();
|
||||
ExecutorService fetchExecutor = Executors.newCachedThreadPool();
|
||||
FeedJournal feedJournal = FeedJournal.create();
|
||||
var heartbeat = serviceHeartbeat.createServiceAdHocTaskHeartbeat("Update Rss Feeds")
|
||||
) {
|
||||
@@ -148,7 +133,7 @@ public class FeedFetcherService {
|
||||
|
||||
FetchResult feedData;
|
||||
try (DomainLocks.DomainLock domainLock = domainLocks.lockDomain(new EdgeDomain(feed.domain()))) {
|
||||
feedData = fetchFeedData(feed, client, ifModifiedSinceDate, ifNoneMatchTag);
|
||||
feedData = fetchFeedData(feed, client, fetchExecutor, ifModifiedSinceDate, ifNoneMatchTag);
|
||||
} catch (Exception ex) {
|
||||
feedData = new FetchResult.TransientError();
|
||||
}
|
||||
@@ -228,6 +213,7 @@ public class FeedFetcherService {
|
||||
|
||||
private FetchResult fetchFeedData(FeedDefinition feed,
|
||||
HttpClient client,
|
||||
ExecutorService executorService,
|
||||
@Nullable String ifModifiedSinceDate,
|
||||
@Nullable String ifNoneMatchTag)
|
||||
{
|
||||
@@ -243,18 +229,27 @@ public class FeedFetcherService {
|
||||
.timeout(Duration.ofSeconds(15))
|
||||
;
|
||||
|
||||
if (ifModifiedSinceDate != null) {
|
||||
// Set the If-Modified-Since or If-None-Match headers if we have them
|
||||
// though since there are certain idiosyncrasies in server implementations,
|
||||
// we avoid setting both at the same time as that may turn a 304 into a 200.
|
||||
if (ifNoneMatchTag != null) {
|
||||
requestBuilder.header("If-None-Match", ifNoneMatchTag);
|
||||
} else if (ifModifiedSinceDate != null) {
|
||||
requestBuilder.header("If-Modified-Since", ifModifiedSinceDate);
|
||||
}
|
||||
|
||||
if (ifNoneMatchTag != null) {
|
||||
requestBuilder.header("If-None-Match", ifNoneMatchTag);
|
||||
}
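
// Sketch of the header-selection rule described above, pulled into a helper for clarity
// (hypothetical, not part of the repository): prefer the ETag validator when both values are
// known, and never send both headers, since some servers then answer 200 instead of 304.
static void addConditionalHeaders(java.net.http.HttpRequest.Builder builder,
                                  String ifNoneMatchTag, String ifModifiedSinceDate) {
    if (ifNoneMatchTag != null) {
        builder.header("If-None-Match", ifNoneMatchTag);
    } else if (ifModifiedSinceDate != null) {
        builder.header("If-Modified-Since", ifModifiedSinceDate);
    }
}
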
|
||||
|
||||
HttpRequest getRequest = requestBuilder.build();
|
||||
|
||||
for (int i = 0; i < 3; i++) {
|
||||
HttpResponse<byte[]> rs = client.send(getRequest, HttpResponse.BodyHandlers.ofByteArray());
|
||||
|
||||
/* Note we need to use an executor to time-limit the send() method in HttpClient, as
|
||||
* its support for timeouts only applies to the time until response starts to be received,
|
||||
* and does not catch the case when the server starts to send data but then hangs.
|
||||
*/
|
||||
HttpResponse<byte[]> rs = executorService.submit(
|
||||
() -> client.send(getRequest, HttpResponse.BodyHandlers.ofByteArray()))
|
||||
.get(15, TimeUnit.SECONDS);
|
||||
|
||||
if (rs.statusCode() == 429) { // Too Many Requests
|
||||
int retryAfter = Integer.parseInt(rs.headers().firstValue("Retry-After").orElse("2"));
|
||||
@@ -371,12 +366,7 @@ public class FeedFetcherService {
|
||||
|
||||
public FeedItems parseFeed(String feedData, FeedDefinition definition) {
|
||||
try {
|
||||
feedData = sanitizeEntities(feedData);
|
||||
|
||||
List<Item> rawItems = rssReader.read(
|
||||
// Massage the data to maximize the possibility of the flaky XML parser consuming it
|
||||
new BOMInputStream(new ByteArrayInputStream(feedData.trim().getBytes(StandardCharsets.UTF_8)), false)
|
||||
).toList();
|
||||
List<SimpleFeedParser.ItemData> rawItems = SimpleFeedParser.parse(feedData);
|
||||
|
||||
boolean keepUriFragment = rawItems.size() < 2 || areFragmentsDisparate(rawItems);
|
||||
|
||||
@@ -399,33 +389,6 @@ public class FeedFetcherService {
|
||||
}
|
||||
}
|
||||
|
||||
private static final Map<String, String> HTML_ENTITIES = Map.of(
"&raquo;", "»",
"&laquo;", "«",
"&mdash;", "--",
"&ndash;", "-",
"&rsquo;", "'",
"&lsquo;", "'",
"&quot;", "\"",
"&nbsp;", ""
);
|
||||
|
||||
/** The XML parser will blow up if you insert HTML entities in the feed XML,
|
||||
* which is unfortunately relatively common. Replace them as far as is possible
|
||||
* with their corresponding characters
|
||||
*/
|
||||
static String sanitizeEntities(String feedData) {
|
||||
String result = feedData;
|
||||
for (Map.Entry<String, String> entry : HTML_ENTITIES.entrySet()) {
|
||||
result = result.replace(entry.getKey(), entry.getValue());
|
||||
}
|
||||
|
||||
// Handle lone ampersands not part of a recognized XML entity
|
||||
result = result.replaceAll("&(?!(amp|lt|gt|apos|quot);)", "&");
|
||||
|
||||
return result;
|
||||
}
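
// Example of what the sanitizer above is meant to achieve (assumed behaviour, based on the
// entity table and the ampersand regex): named HTML entities that XML does not know are
// replaced with plain characters, stray ampersands are escaped, and XML's own entities pass through.
//
//   sanitizeEntities("Fish &amp; Chips &mdash; a &quot;classic&quot;")
//       returns: Fish &amp; Chips -- a "classic"
//   sanitizeEntities("Bed & Breakfast")
//       returns: Bed &amp; Breakfast
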
|
||||
|
||||
/** Decide whether to keep URI fragments in the feed items.
|
||||
* <p></p>
|
||||
* We keep fragments if there are multiple different fragments in the items.
|
||||
@@ -433,16 +396,16 @@ public class FeedFetcherService {
|
||||
* @param items The items to check
|
||||
* @return True if we should keep the fragments, false otherwise
|
||||
*/
|
||||
private boolean areFragmentsDisparate(List<Item> items) {
|
||||
private boolean areFragmentsDisparate(List<SimpleFeedParser.ItemData> items) {
|
||||
Set<String> seenFragments = new HashSet<>();
|
||||
|
||||
try {
|
||||
for (var item : items) {
|
||||
if (item.getLink().isEmpty()) {
|
||||
if (item.url().isBlank()) {
|
||||
continue;
|
||||
}
|
||||
|
||||
var link = item.getLink().get();
|
||||
var link = item.url();
|
||||
if (!link.contains("#")) {
|
||||
continue;
|
||||
}
|
||||
|
@@ -10,6 +10,7 @@ import java.io.IOException;
|
||||
import java.nio.charset.StandardCharsets;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Path;
|
||||
import java.util.function.BiConsumer;
|
||||
|
||||
/** Utility for recording fetched feeds to a journal, useful in debugging feed parser issues.
|
||||
*/
|
||||
@@ -59,6 +60,17 @@ public interface FeedJournal extends AutoCloseable {
|
||||
urlWriter.put(url);
|
||||
contentsWriter.put(contents);
|
||||
}
|
||||
}
|
||||
|
||||
static void replay(Path journalPath, BiConsumer<String, String> urlAndContent) throws IOException {
|
||||
try (SlopTable table = new SlopTable(journalPath)) {
|
||||
final StringColumn.Reader urlReader = urlColumn.open(table);
|
||||
final StringColumn.Reader contentsReader = contentsColumn.open(table);
|
||||
|
||||
while (urlReader.hasRemaining()) {
|
||||
urlAndContent.accept(urlReader.get(), contentsReader.get());
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
}
|
||||
|
@@ -0,0 +1,94 @@
|
||||
package nu.marginalia.rss.svc;
|
||||
|
||||
import com.apptasticsoftware.rssreader.DateTimeParser;
|
||||
import com.apptasticsoftware.rssreader.util.Default;
|
||||
import org.jsoup.Jsoup;
|
||||
import org.jsoup.parser.Parser;
|
||||
|
||||
import java.time.ZonedDateTime;
|
||||
import java.util.ArrayList;
|
||||
import java.util.List;
|
||||
import java.util.Optional;
|
||||
|
||||
public class SimpleFeedParser {
|
||||
|
||||
private static final DateTimeParser dateTimeParser = Default.getDateTimeParser();
|
||||
|
||||
public record ItemData (
|
||||
String title,
|
||||
String description,
|
||||
String url,
|
||||
String pubDate
|
||||
) {
|
||||
public boolean isWellFormed() {
|
||||
return title != null && !title.isBlank() &&
|
||||
description != null && !description.isBlank() &&
|
||||
url != null && !url.isBlank() &&
|
||||
pubDate != null && !pubDate.isBlank();
|
||||
}
|
||||
|
||||
public Optional<ZonedDateTime> getPubDateZonedDateTime() {
|
||||
try {
|
||||
return Optional.ofNullable(dateTimeParser.parse(pubDate()));
|
||||
}
|
||||
catch (Exception e) {
|
||||
return Optional.empty();
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
public static List<ItemData> parse(String content) {
|
||||
var doc = Jsoup.parse(content, Parser.xmlParser());
|
||||
List<ItemData> ret = new ArrayList<>();
|
||||
|
||||
doc.select("item, entry").forEach(element -> {
|
||||
String link = "";
|
||||
String title = "";
|
||||
String description = "";
|
||||
String pubDate = "";
|
||||
|
||||
for (String attr : List.of("title", "dc:title")) {
|
||||
if (!title.isBlank())
|
||||
break;
|
||||
var tag = element.getElementsByTag(attr).first();
|
||||
if (tag != null) {
|
||||
title = tag.text();
|
||||
}
|
||||
}
|
||||
|
||||
for (String attr : List.of("title", "summary", "content", "description", "dc:description")) {
|
||||
if (!description.isBlank())
|
||||
break;
|
||||
var tag = element.getElementsByTag(attr).first();
|
||||
if (tag != null) {
|
||||
description = tag.text();
|
||||
}
|
||||
}
|
||||
|
||||
for (String attr : List.of("pubDate", "published", "updated", "issued", "created", "dc:date")) {
|
||||
if (!pubDate.isBlank())
|
||||
break;
|
||||
var tag = element.getElementsByTag(attr).first();
|
||||
if (tag != null) {
|
||||
pubDate = tag.text();
|
||||
}
|
||||
}
|
||||
|
||||
for (String attr : List.of("link", "url")) {
|
||||
if (!link.isBlank())
|
||||
break;
|
||||
var tag = element.getElementsByTag(attr).first();
|
||||
if (tag != null) {
|
||||
link = tag.text();
|
||||
}
|
||||
}
|
||||
|
||||
ret.add(new ItemData(title, description, link, pubDate));
|
||||
});
|
||||
|
||||
|
||||
return ret;
|
||||
}
|
||||
|
||||
}
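
// A small usage sketch for the parser above, with an illustrative feed snippet (not taken from
// the repository); both RSS <item> and Atom <entry> elements are picked up by the same selector.
class SimpleFeedParserExample {
    public static void main(String[] args) {
        String rss = """
                <rss><channel>
                  <item>
                    <title>Hello</title>
                    <link>https://example.com/hello</link>
                    <description>First post</description>
                    <pubDate>1 Apr 2021 00:00:00 +0000</pubDate>
                  </item>
                </channel></rss>
                """;
        for (var item : nu.marginalia.rss.svc.SimpleFeedParser.parse(rss)) {
            System.out.println(item.title() + " -> " + item.url() + " (" + item.pubDate() + ")");
        }
    }
}
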
|
@@ -1,36 +1,97 @@
|
||||
package nu.marginalia.livecapture;
|
||||
|
||||
import com.github.tomakehurst.wiremock.WireMockServer;
|
||||
import com.github.tomakehurst.wiremock.core.WireMockConfiguration;
|
||||
import nu.marginalia.WmsaHome;
|
||||
import nu.marginalia.service.module.ServiceConfigurationModule;
|
||||
import org.junit.jupiter.api.Assertions;
|
||||
import org.junit.jupiter.api.BeforeAll;
|
||||
import org.junit.jupiter.api.Tag;
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.testcontainers.containers.GenericContainer;
|
||||
import org.testcontainers.junit.jupiter.Testcontainers;
|
||||
import org.testcontainers.utility.DockerImageName;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.net.URI;
|
||||
import java.util.Map;
|
||||
|
||||
import static com.github.tomakehurst.wiremock.client.WireMock.*;
|
||||
|
||||
|
||||
@Testcontainers
|
||||
@Tag("slow")
|
||||
public class BrowserlessClientTest {
|
||||
static GenericContainer<?> container = new GenericContainer<>(DockerImageName.parse("browserless/chrome")).withExposedPorts(3000);
|
||||
static GenericContainer<?> container = new GenericContainer<>(DockerImageName.parse("browserless/chrome"))
|
||||
.withEnv(Map.of("TOKEN", "BROWSERLESS_TOKEN"))
|
||||
.withNetworkMode("bridge")
|
||||
.withExposedPorts(3000);
|
||||
|
||||
static WireMockServer wireMockServer =
|
||||
new WireMockServer(WireMockConfiguration.wireMockConfig()
|
||||
.port(18089));
|
||||
|
||||
static String localIp;
|
||||
|
||||
static URI browserlessURI;
|
||||
|
||||
@BeforeAll
|
||||
public static void setup() {
|
||||
public static void setup() throws IOException {
|
||||
container.start();
|
||||
|
||||
browserlessURI = URI.create(String.format("http://%s:%d/",
|
||||
container.getHost(),
|
||||
container.getMappedPort(3000))
|
||||
);
|
||||
|
||||
wireMockServer.start();
|
||||
wireMockServer.stubFor(get("/").willReturn(aResponse().withStatus(200).withBody("Ok")));
|
||||
|
||||
localIp = ServiceConfigurationModule.getLocalNetworkIP();
|
||||
|
||||
}
|
||||
|
||||
@Tag("flaky")
|
||||
@Test
|
||||
public void testInspectContentUA__Flaky() throws Exception {
|
||||
try (var client = new BrowserlessClient(browserlessURI)) {
|
||||
client.content("http://" + localIp + ":18089/",
|
||||
BrowserlessClient.GotoOptions.defaultValues()
|
||||
);
|
||||
}
|
||||
|
||||
wireMockServer.verify(getRequestedFor(urlEqualTo("/")).withHeader("User-Agent", equalTo(WmsaHome.getUserAgent().uaString())));
|
||||
}
|
||||
|
||||
@Tag("flaky")
|
||||
@Test
|
||||
public void testInspectScreenshotUA__Flaky() throws Exception {
|
||||
try (var client = new BrowserlessClient(browserlessURI)) {
|
||||
client.screenshot("http://" + localIp + ":18089/",
|
||||
BrowserlessClient.GotoOptions.defaultValues(),
|
||||
BrowserlessClient.ScreenshotOptions.defaultValues()
|
||||
);
|
||||
}
|
||||
|
||||
wireMockServer.verify(getRequestedFor(urlEqualTo("/")).withHeader("User-Agent", equalTo(WmsaHome.getUserAgent().uaString())));
|
||||
}
|
||||
|
||||
@Test
|
||||
public void testContent() throws Exception {
|
||||
try (var client = new BrowserlessClient(URI.create("http://" + container.getHost() + ":" + container.getMappedPort(3000)))) {
|
||||
var content = client.content("https://www.marginalia.nu/", BrowserlessClient.GotoOptions.defaultValues());
|
||||
Assertions.assertNotNull(content, "Content should not be null");
|
||||
try (var client = new BrowserlessClient(browserlessURI)) {
|
||||
var content = client.content("https://www.marginalia.nu/", BrowserlessClient.GotoOptions.defaultValues()).orElseThrow();
|
||||
|
||||
Assertions.assertFalse(content.isBlank(), "Content should not be empty");
|
||||
}
|
||||
}
|
||||
|
||||
@Test
|
||||
public void testScreenshot() throws Exception {
|
||||
try (var client = new BrowserlessClient(URI.create("http://" + container.getHost() + ":" + container.getMappedPort(3000)))) {
|
||||
var screenshot = client.screenshot("https://www.marginalia.nu/", BrowserlessClient.GotoOptions.defaultValues(), BrowserlessClient.ScreenshotOptions.defaultValues());
|
||||
try (var client = new BrowserlessClient(browserlessURI)) {
|
||||
var screenshot = client.screenshot("https://www.marginalia.nu/",
|
||||
BrowserlessClient.GotoOptions.defaultValues(),
|
||||
BrowserlessClient.ScreenshotOptions.defaultValues());
|
||||
|
||||
Assertions.assertNotNull(screenshot, "Screenshot should not be null");
|
||||
}
|
||||
}
|
||||
|
@@ -1,50 +0,0 @@
|
||||
package nu.marginalia.rss.svc;
|
||||
|
||||
import com.apptasticsoftware.rssreader.Item;
|
||||
import com.apptasticsoftware.rssreader.RssReader;
|
||||
import org.junit.jupiter.api.Assertions;
|
||||
import org.junit.jupiter.api.Test;
|
||||
|
||||
import java.util.List;
|
||||
import java.util.Optional;
|
||||
|
||||
public class TestXmlSanitization {
|
||||
|
||||
@Test
|
||||
public void testPreservedEntities() {
|
||||
Assertions.assertEquals("&", FeedFetcherService.sanitizeEntities("&"));
|
||||
Assertions.assertEquals("<", FeedFetcherService.sanitizeEntities("<"));
|
||||
Assertions.assertEquals(">", FeedFetcherService.sanitizeEntities(">"));
|
||||
Assertions.assertEquals("'", FeedFetcherService.sanitizeEntities("'"));
|
||||
}
|
||||
|
||||
@Test
|
||||
public void testNlnetTitleTag() {
|
||||
// The NLnet atom feed puts HTML tags in the entry/title tags, which breaks the vanilla RssReader code
|
||||
|
||||
// Verify we're able to consume and strip out the HTML tags
|
||||
RssReader r = new RssReader();
|
||||
|
||||
List<Item> items = r.read(ClassLoader.getSystemResourceAsStream("nlnet.atom")).toList();
|
||||
|
||||
Assertions.assertEquals(1, items.size());
|
||||
for (var item : items) {
|
||||
Assertions.assertEquals(Optional.of("50 Free and Open Source Projects Selected for NGI Zero grants"), item.getTitle());
|
||||
}
|
||||
}
|
||||
|
||||
@Test
|
||||
public void testStrayAmpersand() {
|
||||
Assertions.assertEquals("Bed & Breakfast", FeedFetcherService.sanitizeEntities("Bed & Breakfast"));
|
||||
}
|
||||
|
||||
@Test
|
||||
public void testTranslatedHtmlEntity() {
|
||||
Assertions.assertEquals("Foo -- Bar", FeedFetcherService.sanitizeEntities("Foo — Bar"));
|
||||
}
|
||||
|
||||
@Test
|
||||
public void testTranslatedHtmlEntityQuot() {
|
||||
Assertions.assertEquals("\"Bob\"", FeedFetcherService.sanitizeEntities(""Bob""));
|
||||
}
|
||||
}
|
@@ -134,6 +134,10 @@ public class QueryExpansion {
if (scoreCombo > scoreA + scoreB || scoreCombo > 1000) {
graph.addVariantForSpan(prev, qw, joinedWord);
}
else if (StringUtils.isAlpha(prev.word()) && StringUtils.isNumeric(qw.word())) { // join e.g. trs 80 to trs80 and trs-80
graph.addVariantForSpan(prev, qw, prev.word() + qw.word());
graph.addVariantForSpan(prev, qw, prev.word() + "-" + qw.word());
}
}

prev = qw;

@@ -213,6 +213,18 @@ public class QueryFactoryTest {
System.out.println(subquery);
}

@Test
public void testContractionWordNum() {
var subquery = parseAndGetSpecs("glove 80");

Assertions.assertTrue(subquery.query.compiledQuery.contains(" glove "));
Assertions.assertTrue(subquery.query.compiledQuery.contains(" 80 "));
Assertions.assertTrue(subquery.query.compiledQuery.contains(" glove-80 "));
Assertions.assertTrue(subquery.query.compiledQuery.contains(" glove80 "));
}

@Test
public void testCplusPlus() {
var subquery = parseAndGetSpecs("std::vector::push_back vector");

@@ -16,20 +16,19 @@ import org.slf4j.LoggerFactory;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.Comparator;
|
||||
import java.util.Iterator;
|
||||
import java.util.List;
|
||||
import java.util.concurrent.CompletableFuture;
|
||||
import java.util.concurrent.ExecutorService;
|
||||
import java.util.concurrent.Executors;
|
||||
|
||||
import static java.lang.Math.clamp;
|
||||
import java.util.concurrent.atomic.AtomicInteger;
|
||||
import java.util.function.Consumer;
|
||||
|
||||
@Singleton
|
||||
public class IndexClient {
|
||||
private static final Logger logger = LoggerFactory.getLogger(IndexClient.class);
|
||||
private final GrpcMultiNodeChannelPool<IndexApiGrpc.IndexApiBlockingStub> channelPool;
|
||||
private final DomainBlacklistImpl blacklist;
|
||||
private static final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
|
||||
private static final ExecutorService executor = Executors.newCachedThreadPool();
|
||||
|
||||
@Inject
|
||||
public IndexClient(GrpcChannelPoolFactory channelPoolFactory, DomainBlacklistImpl blacklist) {
|
||||
@@ -51,40 +50,37 @@ public class IndexClient {
|
||||
|
||||
/** Execute a query on the index partitions and return the combined results. */
|
||||
public AggregateQueryResponse executeQueries(RpcIndexQuery indexRequest, Pagination pagination) {
|
||||
List<CompletableFuture<Iterator<RpcDecoratedResultItem>>> futures =
|
||||
channelPool.call(IndexApiGrpc.IndexApiBlockingStub::query)
|
||||
.async(executor)
|
||||
.runEach(indexRequest);
|
||||
|
||||
final int requestedMaxResults = indexRequest.getQueryLimits().getResultsTotal();
|
||||
final int resultsUpperBound = requestedMaxResults * channelPool.getNumNodes();
|
||||
|
||||
List<RpcDecoratedResultItem> results = new ArrayList<>(resultsUpperBound);
|
||||
AtomicInteger totalNumResults = new AtomicInteger(0);
|
||||
|
||||
for (var future : futures) {
|
||||
try {
|
||||
future.get().forEachRemaining(results::add);
|
||||
}
|
||||
catch (Exception e) {
|
||||
logger.error("Downstream exception", e);
|
||||
}
|
||||
}
|
||||
List<RpcDecoratedResultItem> results =
|
||||
channelPool.call(IndexApiGrpc.IndexApiBlockingStub::query)
|
||||
.async(executor)
|
||||
.runEach(indexRequest)
|
||||
.stream()
|
||||
.map(future -> future.thenApply(iterator -> {
|
||||
List<RpcDecoratedResultItem> ret = new ArrayList<>(requestedMaxResults);
|
||||
iterator.forEachRemaining(ret::add);
|
||||
totalNumResults.addAndGet(ret.size());
|
||||
return ret;
|
||||
}))
|
||||
.mapMulti((CompletableFuture<List<RpcDecoratedResultItem>> fut, Consumer<List<RpcDecoratedResultItem>> c) ->{
|
||||
try {
|
||||
c.accept(fut.join());
|
||||
} catch (Exception e) {
|
||||
logger.error("Error while fetching results", e);
|
||||
}
|
||||
})
|
||||
.flatMap(List::stream)
|
||||
.filter(item -> !isBlacklisted(item))
|
||||
.sorted(comparator)
|
||||
.skip(Math.max(0, (pagination.page - 1) * pagination.pageSize))
|
||||
.limit(pagination.pageSize)
|
||||
.toList();
|
||||
|
||||
// Sort the results by ranking score and remove blacklisted domains
|
||||
results.sort(comparator);
|
||||
results.removeIf(this::isBlacklisted);
|
||||
|
||||
int numReceivedResults = results.size();
|
||||
|
||||
// pagination is typically 1-indexed, so we need to adjust the start and end indices
|
||||
int indexStart = (pagination.page - 1) * pagination.pageSize;
|
||||
int indexEnd = (pagination.page) * pagination.pageSize;
|
||||
|
||||
results = results.subList(
|
||||
clamp(indexStart, 0, Math.max(0, results.size() - 1)), // from is inclusive, so subtract 1 from size()
|
||||
clamp(indexEnd, 0, results.size()));
|
||||
|
||||
return new AggregateQueryResponse(results, pagination.page(), numReceivedResults);
|
||||
return new AggregateQueryResponse(results, pagination.page(), totalNumResults.get());
|
||||
}
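
// Worked example of the pagination arithmetic above (pages are 1-indexed): with pageSize = 10,
// page = 2 skips the first 10 merged results and keeps at most the next 10, i.e. skip((2 - 1) * 10)
// followed by limit(10); totalNumResults still counts every result returned by the partitions
// before blacklisting and paging. The standalone helper below mirrors that arithmetic and is
// purely illustrative.
static <T> java.util.List<T> pageOf(java.util.stream.Stream<T> merged, int page, int pageSize) {
    return merged.skip(Math.max(0, (long) (page - 1) * pageSize))
                 .limit(pageSize)
                 .toList();
}
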
|
||||
|
||||
private boolean isBlacklisted(RpcDecoratedResultItem item) {
|
||||
|
@@ -0,0 +1,119 @@
|
||||
package nu.marginalia.index.results;
|
||||
|
||||
import com.google.inject.Inject;
|
||||
import com.google.inject.Singleton;
|
||||
import gnu.trove.map.hash.TIntDoubleHashMap;
|
||||
import nu.marginalia.WmsaHome;
|
||||
import nu.marginalia.db.DbDomainQueries;
|
||||
import nu.marginalia.model.EdgeDomain;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Path;
|
||||
import java.util.List;
|
||||
import java.util.OptionalInt;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
|
||||
@Singleton
|
||||
public class DomainRankingOverrides {
|
||||
private final DbDomainQueries domainQueries;
|
||||
|
||||
private volatile TIntDoubleHashMap rankingFactors = new TIntDoubleHashMap(100, 0.75f, -1, 1.);
|
||||
|
||||
private static final Logger logger = LoggerFactory.getLogger(DomainRankingOverrides.class);
|
||||
|
||||
private final Path overrideFilePath;
|
||||
|
||||
@Inject
|
||||
public DomainRankingOverrides(DbDomainQueries domainQueries) {
|
||||
this.domainQueries = domainQueries;
|
||||
|
||||
overrideFilePath = WmsaHome.getDataPath().resolve("domain-ranking-factors.txt");
|
||||
|
||||
Thread.ofPlatform().start(this::updateRunner);
|
||||
}
|
||||
|
||||
// for test access
|
||||
public DomainRankingOverrides(DbDomainQueries domainQueries, Path overrideFilePath)
|
||||
{
|
||||
this.domainQueries = domainQueries;
|
||||
this.overrideFilePath = overrideFilePath;
|
||||
}
|
||||
|
||||
|
||||
public double getRankingFactor(int domainId) {
|
||||
return rankingFactors.get(domainId);
|
||||
}
|
||||
|
||||
private void updateRunner() {
|
||||
for (;;) {
|
||||
reloadFile();
|
||||
|
||||
try {
|
||||
TimeUnit.MINUTES.sleep(5);
|
||||
} catch (InterruptedException ex) {
|
||||
logger.warn("Thread interrupted", ex);
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void reloadFile() {
|
||||
if (!Files.exists(overrideFilePath)) {
|
||||
return;
|
||||
}
|
||||
|
||||
try {
|
||||
List<String> lines = Files.readAllLines(overrideFilePath);
|
||||
|
||||
double factor = 1.;
|
||||
|
||||
var newRankingFactors = new TIntDoubleHashMap(lines.size(), 0.75f, -1, 1.);
|
||||
|
||||
for (var line : lines) {
|
||||
if (line.isBlank()) continue;
|
||||
if (line.startsWith("#")) continue;
|
||||
|
||||
String[] parts = line.split("\\s+");
|
||||
if (parts.length != 2) {
|
||||
logger.warn("Unrecognized format for domain overrides file: {}", line);
|
||||
continue;
|
||||
}
|
||||
|
||||
try {
|
||||
switch (parts[0]) {
|
||||
case "value" -> {
|
||||
// error handle me
|
||||
factor = Double.parseDouble(parts[1]);
|
||||
if (factor < 0) {
|
||||
logger.error("Negative values are not permitted, found {}", factor);
|
||||
factor = 1;
|
||||
}
|
||||
}
|
||||
case "domain" -> {
|
||||
// error handle
|
||||
OptionalInt domainId = domainQueries.tryGetDomainId(new EdgeDomain(parts[1]));
|
||||
if (domainId.isPresent()) {
|
||||
newRankingFactors.put(domainId.getAsInt(), factor);
|
||||
}
|
||||
else {
|
||||
logger.warn("Unrecognized domain id {}", parts[1]);
|
||||
}
|
||||
}
|
||||
default -> {
|
||||
logger.warn("Unrecognized format {}", line);
|
||||
}
|
||||
}
|
||||
} catch (Exception ex) {
|
||||
logger.warn("Error in parsing domain overrides file: {} ({})", line, ex.getClass().getSimpleName());
|
||||
}
|
||||
}
|
||||
|
||||
rankingFactors = newRankingFactors;
|
||||
} catch (IOException ex) {
|
||||
logger.error("Failed to read " + overrideFilePath, ex);
|
||||
}
|
||||
}
|
||||
}
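
// Example of the override file format parsed above ("value" sets the factor applied to every
// following "domain" line, '#' starts a comment, blank lines are ignored):
//
//   # boost a couple of domains, demote one
//   value 1.25
//   domain www.example.com
//   domain blog.example.com
//
//   value 0.5
//   domain spammy.example.org
//
// At query time the factor is looked up per result, as in the ranking code further down:
//   double factor = domainRankingOverrides.getRankingFactor(UrlIdCodec.getDomainId(combinedId));
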
|
@@ -40,13 +40,16 @@ public class IndexResultRankingService {
|
||||
|
||||
private final DocumentDbReader documentDbReader;
|
||||
private final StatefulIndex statefulIndex;
|
||||
private final DomainRankingOverrides domainRankingOverrides;
|
||||
|
||||
@Inject
|
||||
public IndexResultRankingService(DocumentDbReader documentDbReader,
|
||||
StatefulIndex statefulIndex)
|
||||
StatefulIndex statefulIndex,
|
||||
DomainRankingOverrides domainRankingOverrides)
|
||||
{
|
||||
this.documentDbReader = documentDbReader;
|
||||
this.statefulIndex = statefulIndex;
|
||||
this.domainRankingOverrides = domainRankingOverrides;
|
||||
}
|
||||
|
||||
public List<SearchResultItem> rankResults(SearchParameters params,
|
||||
@@ -57,7 +60,7 @@ public class IndexResultRankingService {
|
||||
if (resultIds.isEmpty())
|
||||
return List.of();
|
||||
|
||||
IndexResultScoreCalculator resultRanker = new IndexResultScoreCalculator(statefulIndex, rankingContext, params);
|
||||
IndexResultScoreCalculator resultRanker = new IndexResultScoreCalculator(statefulIndex, domainRankingOverrides, rankingContext, params);
|
||||
|
||||
List<SearchResultItem> results = new ArrayList<>(resultIds.size());
|
||||
|
||||
|
@@ -41,14 +41,17 @@ public class IndexResultScoreCalculator {
|
||||
private final CombinedIndexReader index;
|
||||
private final QueryParams queryParams;
|
||||
|
||||
private final DomainRankingOverrides domainRankingOverrides;
|
||||
private final ResultRankingContext rankingContext;
|
||||
private final CompiledQuery<String> compiledQuery;
|
||||
|
||||
public IndexResultScoreCalculator(StatefulIndex statefulIndex,
|
||||
DomainRankingOverrides domainRankingOverrides,
|
||||
ResultRankingContext rankingContext,
|
||||
SearchParameters params)
|
||||
{
|
||||
this.index = statefulIndex.get();
|
||||
this.domainRankingOverrides = domainRankingOverrides;
|
||||
this.rankingContext = rankingContext;
|
||||
|
||||
this.queryParams = params.queryParams;
|
||||
@@ -127,10 +130,10 @@ public class IndexResultScoreCalculator {
|
||||
* wordFlagsQuery.root.visit(new TermFlagsGraphVisitor(params.getBm25K(), wordFlagsQuery.data, unorderedMatches.getWeightedCounts(), rankingContext))
|
||||
/ (Math.sqrt(unorderedMatches.searchableKeywordCount + 1));
|
||||
|
||||
double rankingAdjustment = domainRankingOverrides.getRankingFactor(UrlIdCodec.getDomainId(combinedId));
|
||||
|
||||
double score = normalize(
|
||||
score_firstPosition + score_proximity + score_verbatim
|
||||
+ score_bM25
|
||||
+ score_bFlags,
|
||||
rankingAdjustment * (score_firstPosition + score_proximity + score_verbatim + score_bM25 + score_bFlags),
|
||||
-Math.min(0, documentBonus) // The magnitude of documentBonus, if it is negative; otherwise 0
|
||||
);
|
||||
|
||||
@@ -580,3 +583,4 @@ public class IndexResultScoreCalculator {
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
|
@@ -0,0 +1,103 @@
|
||||
package nu.marginalia.index.results;
|
||||
|
||||
import com.zaxxer.hikari.HikariConfig;
|
||||
import com.zaxxer.hikari.HikariDataSource;
|
||||
import nu.marginalia.db.DbDomainQueries;
|
||||
import nu.marginalia.model.EdgeDomain;
|
||||
import nu.marginalia.test.TestMigrationLoader;
|
||||
import org.junit.jupiter.api.Assertions;
|
||||
import org.junit.jupiter.api.BeforeAll;
|
||||
import org.junit.jupiter.api.Tag;
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.junit.jupiter.api.parallel.Execution;
|
||||
import org.junit.jupiter.api.parallel.ExecutionMode;
|
||||
import org.testcontainers.containers.MariaDBContainer;
|
||||
import org.testcontainers.junit.jupiter.Container;
|
||||
import org.testcontainers.junit.jupiter.Testcontainers;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Path;
|
||||
import java.nio.file.StandardOpenOption;
|
||||
import java.sql.SQLException;
|
||||
|
||||
@Testcontainers
|
||||
@Execution(ExecutionMode.SAME_THREAD)
|
||||
@Tag("slow")
|
||||
class DomainRankingOverridesTest {
|
||||
@Container
|
||||
static MariaDBContainer<?> mariaDBContainer = new MariaDBContainer<>("mariadb")
|
||||
.withDatabaseName("WMSA_prod")
|
||||
.withUsername("wmsa")
|
||||
.withPassword("wmsa")
|
||||
.withNetworkAliases("mariadb");
|
||||
|
||||
private static DbDomainQueries domainQueries;
|
||||
|
||||
@BeforeAll
|
||||
public static void setup() throws SQLException {
|
||||
HikariConfig config = new HikariConfig();
|
||||
config.setJdbcUrl(mariaDBContainer.getJdbcUrl());
|
||||
config.setUsername("wmsa");
|
||||
config.setPassword("wmsa");
|
||||
|
||||
var dataSource = new HikariDataSource(config);
|
||||
|
||||
TestMigrationLoader.flywayMigration(dataSource);
|
||||
|
||||
try (var conn = dataSource.getConnection();
|
||||
var stmt = conn.createStatement()) {
|
||||
stmt.executeQuery("DELETE FROM EC_DOMAIN"); // Wipe any old state from other test runs
|
||||
|
||||
stmt.executeQuery("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('first.example.com', 'example.com', 1)");
|
||||
stmt.executeQuery("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('second.example.com', 'example.com', 1)");
|
||||
stmt.executeQuery("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('third.example.com', 'example.com', 1)");
|
||||
stmt.executeQuery("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('not-added.example.com', 'example.com', 1)");
|
||||
}
|
||||
|
||||
domainQueries = new DbDomainQueries(dataSource);
|
||||
|
||||
}
|
||||
|
||||
@Test
|
||||
public void test() throws IOException {
|
||||
|
||||
Path overridesFile = Files.createTempFile(getClass().getSimpleName(), ".txt");
|
||||
try {
|
||||
|
||||
Files.writeString(overridesFile, """
|
||||
# A comment
|
||||
value 0.75
|
||||
domain first.example.com
|
||||
domain second.example.com
|
||||
|
||||
value 1.1
|
||||
domain third.example.com
|
||||
""",
|
||||
StandardOpenOption.APPEND);
|
||||
|
||||
var overrides = new DomainRankingOverrides(domainQueries, overridesFile);
|
||||
|
||||
overrides.reloadFile();
|
||||
|
||||
Assertions.assertEquals(0.75, overrides.getRankingFactor(
|
||||
domainQueries.getDomainId(new EdgeDomain("first.example.com"))
|
||||
));
|
||||
Assertions.assertEquals(0.75, overrides.getRankingFactor(
|
||||
domainQueries.getDomainId(new EdgeDomain("second.example.com"))
|
||||
));
|
||||
Assertions.assertEquals(1.1, overrides.getRankingFactor(
|
||||
domainQueries.getDomainId(new EdgeDomain("third.example.com"))
|
||||
));
|
||||
Assertions.assertEquals(1.0, overrides.getRankingFactor(
|
||||
domainQueries.getDomainId(new EdgeDomain("not-added.example.com"))
|
||||
));
|
||||
Assertions.assertEquals(1.0, overrides.getRankingFactor(1<<23));
|
||||
|
||||
}
|
||||
finally {
|
||||
Files.deleteIfExists(overridesFile);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
@@ -23,16 +23,33 @@ public class SimpleBlockingThreadPool {
private final Logger logger = LoggerFactory.getLogger(SimpleBlockingThreadPool.class);

public SimpleBlockingThreadPool(String name, int poolSize, int queueSize) {
this(name, poolSize, queueSize, ThreadType.PLATFORM);
}

public SimpleBlockingThreadPool(String name, int poolSize, int queueSize, ThreadType threadType) {
tasks = new ArrayBlockingQueue<>(queueSize);

for (int i = 0; i < poolSize; i++) {
Thread worker = new Thread(this::worker, name + "[" + i + "]");
worker.setDaemon(true);
worker.start();

Thread.Builder threadBuilder = switch (threadType) {
case VIRTUAL -> Thread.ofVirtual();
case PLATFORM -> Thread.ofPlatform().daemon(true);
};

Thread worker = threadBuilder
.name(name + "[" + i + "]")
.start(this::worker);

workers.add(worker);
}

}

public enum ThreadType {
VIRTUAL,
PLATFORM
}

public void submit(Task task) throws InterruptedException {
tasks.put(task);
}
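
// Usage sketch for the new thread-type option: an I/O-heavy caller can ask for virtual threads,
// while the single-argument constructor keeps the old daemon platform-thread behaviour. The pool
// name and sizes are illustrative, and Task is assumed here to be a functional interface.
void exampleUsage() throws InterruptedException {
    var fetchPool = new SimpleBlockingThreadPool("feed-fetcher", 64, 512,
            SimpleBlockingThreadPool.ThreadType.VIRTUAL);
    fetchPool.submit(() -> {
        // blocking work runs on a virtual thread
    });
}
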
@@ -45,6 +45,11 @@ public class GammaCodedSequenceArrayColumn extends AbstractObjectColumn<List<Gam
);
}

@Override
public int alignmentSize() {
return 1;
}

public Reader openUnregistered(URI uri, int page) throws IOException {
return new Reader(
dataColumn.openUnregistered(uri, page),

@@ -109,6 +114,11 @@ public class GammaCodedSequenceArrayColumn extends AbstractObjectColumn<List<Gam
dataReader.skip(toSkip);
}

@Override
public boolean isDirect() {
return dataReader.isDirect();
}

@Override
public boolean hasRemaining() throws IOException {
return groupsReader.hasRemaining();

@@ -44,6 +44,11 @@ public class GammaCodedSequenceColumn extends AbstractObjectColumn<GammaCodedSeq
);
}

@Override
public int alignmentSize() {
return 1;
}

public Reader openUnregistered(URI uri, int page) throws IOException {
return new Reader(
Storage.reader(uri, this, page, false),

@@ -96,6 +101,11 @@ public class GammaCodedSequenceColumn extends AbstractObjectColumn<GammaCodedSeq
this.indexReader = indexReader;
}

@Override
public boolean isDirect() {
return storage.isDirect();
}

@Override
public AbstractColumn<?, ?> columnDesc() {
return GammaCodedSequenceColumn.this;

@@ -45,6 +45,11 @@ public class VarintCodedSequenceArrayColumn extends AbstractObjectColumn<List<Va
);
}

@Override
public int alignmentSize() {
return 0;
}

public Reader openUnregistered(URI uri, int page) throws IOException {
return new Reader(
dataColumn.openUnregistered(uri, page),

@@ -109,6 +114,11 @@ public class VarintCodedSequenceArrayColumn extends AbstractObjectColumn<List<Va
dataReader.skip(toSkip);
}

@Override
public boolean isDirect() {
return dataReader.isDirect();
}

@Override
public boolean hasRemaining() throws IOException {
return groupsReader.hasRemaining();

@@ -44,6 +44,11 @@ public class VarintCodedSequenceColumn extends AbstractObjectColumn<VarintCodedS
);
}

@Override
public int alignmentSize() {
return 1;
}

public Reader openUnregistered(URI uri, int page) throws IOException {
return new Reader(
Storage.reader(uri, this, page, false),

@@ -101,6 +106,11 @@ public class VarintCodedSequenceColumn extends AbstractObjectColumn<VarintCodedS
return VarintCodedSequenceColumn.this;
}

@Override
public boolean isDirect() {
return storage.isDirect();
}

@Override
public void skip(long positions) throws IOException {
for (int i = 0; i < positions; i++) {
@@ -155,8 +155,15 @@ public class SentenceExtractor {
public List<DocumentSentence> extractSentencesFromString(String text, EnumSet<HtmlTag> htmlTags) {
String[] sentences;

// Normalize spaces
// Safety net against malformed data DOS attacks,
// found 5+ MB <p>-tags in the wild that just break
// the sentence extractor causing it to stall forever.
if (text.length() > 50_000) {
// 50k chars can hold a small novel, let alone single html tags
text = text.substring(0, 50_000);
}

// Normalize spaces
text = normalizeSpaces(text);

// Split into sentences
@@ -5,9 +5,7 @@ import nu.marginalia.actor.state.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.*;

public abstract class RecordActorPrototype implements ActorPrototype {

@@ -118,7 +116,7 @@ public abstract class RecordActorPrototype implements ActorPrototype {
}

private String functionName(Class<? extends ActorStep> functionClass) {
return functionClass.getSimpleName().toUpperCase();
return ActorStep.functionName(functionClass);
}

private ActorStep constructState(String message) throws ReflectiveOperationException {

@@ -145,4 +143,43 @@ public abstract class RecordActorPrototype implements ActorPrototype {
}
}

/** Get a list of JSON prototypes for each actor step declared by this actor */
@SuppressWarnings("unchecked")
public Map<String, String> getMessagePrototypes() {
Map<String, String> messagePrototypes = new HashMap<>();

for (var clazz : getClass().getDeclaredClasses()) {
if (!clazz.isRecord() || !ActorStep.class.isAssignableFrom(clazz))
continue;

StringJoiner sj = new StringJoiner(",\n\t", "{\n\t", "\n}");

renderToJsonPrototype(sj, (Class<? extends Record>) clazz);

messagePrototypes.put(ActorStep.functionName((Class<? extends ActorStep>) clazz), sj.toString());
}

return messagePrototypes;
}

@SuppressWarnings("unchecked")
private void renderToJsonPrototype(StringJoiner sj, Class<? extends Record> recordType) {
for (var field : recordType.getDeclaredFields()) {
String typeName = field.getType().getSimpleName();

if ("List".equals(typeName)) {
sj.add(String.format("\"%s\": [ ]", field.getName()));
}
else if (field.getType().isRecord()) {
var innerSj = new StringJoiner(",", "{", "}");
renderToJsonPrototype(innerSj, (Class<? extends Record>) field.getType());
sj.add(String.format("\"%s\": %s", field.getName(), sj));
}
else {
sj.add(String.format("\"%s\": \"%s\"", field.getName(), typeName));
}
}
}
}
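To illustrate the new getMessagePrototypes() helper above (this example is not in the diff; the Greet record is hypothetical):

// A hypothetical step record declared as a nested class of some actor:
record Greet(String name, int times) implements ActorStep {}

// getMessagePrototypes() would then map the step's upper-cased simple name to a JSON prototype, roughly:
// "GREET" -> {
//     "name": "String",
//     "times": "int"
// }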
@@ -1,3 +1,7 @@
package nu.marginalia.actor.state;

public interface ActorStep {}
public interface ActorStep {
static String functionName(Class<? extends ActorStep> type) {
return type.getSimpleName().toUpperCase();
}
}

@@ -87,6 +87,8 @@ dependencies {
implementation libs.commons.compress
implementation libs.sqlite

implementation libs.bundles.httpcomponents

testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation libs.mockito
@@ -12,7 +12,6 @@ import nu.marginalia.converting.sideload.SideloadSourceFactory;
import nu.marginalia.converting.writer.ConverterBatchWritableIf;
import nu.marginalia.converting.writer.ConverterBatchWriter;
import nu.marginalia.converting.writer.ConverterWriter;
import nu.marginalia.io.CrawledDomainReader;
import nu.marginalia.io.SerializableCrawlDataStream;
import nu.marginalia.mq.MessageQueueFactory;
import nu.marginalia.mqapi.converting.ConvertRequest;

@@ -36,6 +35,7 @@ import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Optional;

@@ -51,6 +51,7 @@ public class ConverterMain extends ProcessMainClass {
private final ProcessHeartbeat heartbeat;
private final FileStorageService fileStorageService;
private final SideloadSourceFactory sideloadSourceFactory;
private static final int SIDELOAD_THRESHOLD = Integer.getInteger("converter.sideloadThreshold", 10_000);

public static void main(String... args) throws Exception {

@@ -201,12 +202,26 @@ public class ConverterMain extends ProcessMainClass {
processedDomains.set(batchingWorkLog.size());
heartbeat.setProgress(processedDomains.get() / (double) totalDomains);

for (var domain : WorkLog.iterableMap(crawlDir.getLogFile(),
logger.info("Processing small items");

// We separate the large and small domains to reduce the number of critical sections,
// as the large domains have a separate processing track that doesn't store everything
// in memory

final List<Path> bigTasks = new ArrayList<>();

// First process the small items
for (var dataPath : WorkLog.iterableMap(crawlDir.getLogFile(),
new CrawlDataLocator(crawlDir.getDir(), batchingWorkLog)))
{
if (SerializableCrawlDataStream.getSizeHint(dataPath) >= SIDELOAD_THRESHOLD) {
bigTasks.add(dataPath);
continue;
}

pool.submit(() -> {
try {
ConverterBatchWritableIf writable = processor.createWritable(domain);
try (var dataStream = SerializableCrawlDataStream.openDataStream(dataPath)) {
ConverterBatchWritableIf writable = processor.fullProcessing(dataStream) ;
converterWriter.accept(writable);
}
catch (Exception ex) {

@@ -225,10 +240,39 @@ public class ConverterMain extends ProcessMainClass {
do {
System.out.println("Waiting for pool to terminate... " + pool.getActiveCount() + " remaining");
} while (!pool.awaitTermination(60, TimeUnit.SECONDS));

logger.info("Processing large items");

try (var hb = heartbeat.createAdHocTaskHeartbeat("Large Domains")) {
int bigTaskIdx = 0;
// Next the big items domain-by-domain
for (var dataPath : bigTasks) {
hb.progress(dataPath.toFile().getName(), bigTaskIdx++, bigTasks.size());

try {
// SerializableCrawlDataStream is autocloseable, we can't try-with-resources because then it will be
// closed before it's consumed by the converterWriter. Instead, the converterWriter guarantees it
// will close it after it's consumed.

var stream = SerializableCrawlDataStream.openDataStream(dataPath);
ConverterBatchWritableIf writable = processor.simpleProcessing(stream, SerializableCrawlDataStream.getSizeHint(dataPath));

converterWriter.accept(writable);
}
catch (Exception ex) {
logger.info("Error in processing", ex);
}
finally {
heartbeat.setProgress(processedDomains.incrementAndGet() / (double) totalDomains);
}
}
}

logger.info("Processing complete");
}
}

private static class CrawlDataLocator implements Function<WorkLogEntry, Optional<SerializableCrawlDataStream>> {
private static class CrawlDataLocator implements Function<WorkLogEntry, Optional<Path>> {

private final Path crawlRootDir;
private final BatchingWorkLog batchingWorkLog;

@@ -239,7 +283,7 @@ public class ConverterMain extends ProcessMainClass {
}

@Override
public Optional<SerializableCrawlDataStream> apply(WorkLogEntry entry) {
public Optional<Path> apply(WorkLogEntry entry) {
if (batchingWorkLog.isItemProcessed(entry.id())) {
return Optional.empty();
}

@@ -252,7 +296,7 @@ public class ConverterMain extends ProcessMainClass {
}

try {
return Optional.of(CrawledDomainReader.createDataStream(path));
return Optional.of(path);
}
catch (Exception ex) {
return Optional.empty();
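Condensed, the new small/large dispatch introduced above works roughly like this (a sketch, not the literal diff; the names follow the code in the hunk):

// Small domains are converted fully in memory on the worker pool; domains at or above
// converter.sideloadThreshold are deferred to bigTasks and streamed one at a time afterwards.
if (SerializableCrawlDataStream.getSizeHint(dataPath) >= SIDELOAD_THRESHOLD) {
    bigTasks.add(dataPath); // later: processor.simpleProcessing(stream, sizeHint)
}
else {
    pool.submit(() -> { /* processor.fullProcessing(dataStream) -> converterWriter.accept(writable) */ });
}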
@@ -19,6 +19,7 @@ import nu.marginalia.model.idx.WordFlags;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

@@ -91,7 +92,7 @@ public class DocumentProcessor {
DocumentClass documentClass,
DocumentDecorator documentDecorator,
DomainLinks externalDomainLinks,
ProcessedDocument ret) throws URISyntaxException, DisqualifiedException
ProcessedDocument ret) throws URISyntaxException, IOException, DisqualifiedException
{
var crawlerStatus = CrawlerDocumentStatus.valueOf(crawledDocument.crawlerStatus);

@@ -109,7 +110,7 @@ public class DocumentProcessor {
ret.state = crawlerStatusToUrlState(crawledDocument.crawlerStatus, crawledDocument.httpStatus);

final var plugin = findPlugin(crawledDocument);
AbstractDocumentProcessorPlugin plugin = findPlugin(crawledDocument);

EdgeUrl url = new EdgeUrl(crawledDocument.url);
LinkTexts linkTexts = anchorTextKeywords.getAnchorTextKeywords(externalDomainLinks, url);
@@ -32,7 +32,6 @@ import java.util.*;
import java.util.regex.Pattern;

public class DomainProcessor {
private static final int SIDELOAD_THRESHOLD = Integer.getInteger("converter.sideloadThreshold", 10_000);
private final DocumentProcessor documentProcessor;
private final SiteWords siteWords;
private final AnchorTagsSource anchorTagsSource;

@@ -54,21 +53,9 @@ public class DomainProcessor {
geoIpDictionary.waitReady();
}

public ConverterBatchWritableIf createWritable(SerializableCrawlDataStream domain) {
final int sizeHint = domain.sizeHint();

if (sizeHint > SIDELOAD_THRESHOLD) {
// If the file is too big, we run a processing mode that doesn't
// require loading the entire dataset into RAM
return sideloadProcessing(domain, sizeHint);
}

return fullProcessing(domain);
}

public SideloadProcessing sideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) {
public SimpleProcessing simpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) {
try {
return new SideloadProcessing(dataStream, sizeHint, extraKeywords);
return new SimpleProcessing(dataStream, sizeHint, extraKeywords);
}
catch (Exception ex) {
logger.warn("Failed to process domain sideload", ex);

@@ -76,9 +63,9 @@ public class DomainProcessor {
}
}

public SideloadProcessing sideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint) {
public SimpleProcessing simpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint) {
try {
return new SideloadProcessing(dataStream, sizeHint);
return new SimpleProcessing(dataStream, sizeHint);
}
catch (Exception ex) {
logger.warn("Failed to process domain sideload", ex);

@@ -86,22 +73,84 @@ public class DomainProcessor {
}
}

public class SideloadProcessing implements ConverterBatchWritableIf, SideloadSource {
@Nullable
public ProcessedDomain fullProcessing(SerializableCrawlDataStream dataStream) {
try {
if (!dataStream.hasNext()) {
return null;
}

List<ProcessedDocument> docs = new ArrayList<>();
Set<String> processedUrls = new HashSet<>();

if (!(dataStream.next() instanceof CrawledDomain crawledDomain)) {
throw new IllegalStateException("First record must be a domain, was " + dataStream.next().getClass().getSimpleName());
}

DomainLinks externalDomainLinks = anchorTagsSource.getAnchorTags(crawledDomain.getDomain());
DocumentDecorator documentDecorator = new DocumentDecorator();

// Process Domain Record

ProcessedDomain ret = new ProcessedDomain();
processDomain(crawledDomain, ret, documentDecorator);
ret.documents = docs;

// Process Documents

try (var deduplicator = new LshDocumentDeduplicator()) {
while (dataStream.hasNext()) {
if (!(dataStream.next() instanceof CrawledDocument doc))
continue;
if (doc.url == null)
continue;
if (doc.documentBodyBytes.length == 0)
continue;
if (!processedUrls.add(doc.url))
continue;

try {
var processedDoc = documentProcessor.process(doc, ret.domain, externalDomainLinks, documentDecorator);
deduplicator.markIfDuplicate(processedDoc);
docs.add(processedDoc);
} catch (Exception ex) {
logger.warn("Failed to process " + doc.url, ex);
}
}
}

// Add late keywords and features from domain-level information

calculateStatistics(ret, externalDomainLinks);

return ret;
}
catch (Exception ex) {
logger.warn("Failed to process domain", ex);
return null;
}
}

/** The simple processing track processes documents individually, and does not perform any domain-level analysis.
* This is needed to process extremely large domains, which would otherwise eat up too much RAM.
*/
public class SimpleProcessing implements ConverterBatchWritableIf, SideloadSource {
private final SerializableCrawlDataStream dataStream;
private final ProcessedDomain domain;
private final DocumentDecorator documentDecorator;
private final Set<String> processedUrls = new HashSet<>();
private final DomainLinks externalDomainLinks;
private final LshDocumentDeduplicator deduplicator = new LshDocumentDeduplicator();

private static final ProcessingIterator.Factory iteratorFactory = ProcessingIterator.factory(8,
Integer.getInteger("java.util.concurrent.ForkJoinPool.common.parallelism", Runtime.getRuntime().availableProcessors())
);

SideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint) throws IOException {
SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint) throws IOException {
this(dataStream, sizeHint, List.of());
}

SideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) throws IOException {
SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) throws IOException {
this.dataStream = dataStream;

if (!dataStream.hasNext() || !(dataStream.next() instanceof CrawledDomain crawledDomain))

@@ -128,6 +177,7 @@ public class DomainProcessor {
@Override
public Iterator<ProcessedDocument> getDocumentsStream() {
return iteratorFactory.create((taskConsumer) -> {

while (dataStream.hasNext())
{
if (!(dataStream.next() instanceof CrawledDocument doc))

@@ -172,65 +222,6 @@ public class DomainProcessor {
}
}

@Nullable
public ProcessedDomain fullProcessing(SerializableCrawlDataStream dataStream) {
try {
if (!dataStream.hasNext()) {
return null;
}

List<ProcessedDocument> docs = new ArrayList<>();
Set<String> processedUrls = new HashSet<>();

if (!(dataStream.next() instanceof CrawledDomain crawledDomain)) {
throw new IllegalStateException("First record must be a domain, was " + dataStream.next().getClass().getSimpleName());
}

DomainLinks externalDomainLinks = anchorTagsSource.getAnchorTags(crawledDomain.getDomain());
DocumentDecorator documentDecorator = new DocumentDecorator();

// Process Domain Record

ProcessedDomain ret = new ProcessedDomain();
processDomain(crawledDomain, ret, documentDecorator);
ret.documents = docs;

// Process Documents

try (var deduplicator = new LshDocumentDeduplicator()) {
while (dataStream.hasNext()) {
if (!(dataStream.next() instanceof CrawledDocument doc))
continue;
if (doc.url == null)
continue;
if (doc.documentBody.isBlank())
continue;
if (!processedUrls.add(doc.url))
continue;

try {
var processedDoc = documentProcessor.process(doc, ret.domain, externalDomainLinks, documentDecorator);
deduplicator.markIfDuplicate(processedDoc);
docs.add(processedDoc);
} catch (Exception ex) {
logger.warn("Failed to process " + doc.url, ex);
}
}
}

// Add late keywords and features from domain-level information

calculateStatistics(ret, externalDomainLinks);

return ret;
}
catch (Exception ex) {
logger.warn("Failed to process domain", ex);
return null;
}
}

private void processDomain(CrawledDomain crawledDomain,
ProcessedDomain domain,
DocumentDecorator decorator)
@@ -116,7 +116,7 @@ public class AdblockSimulator {

// Refrain from cleaning up this code, it's very hot code and needs to be fast.
// This version is about 100x faster than the a "clean" first stab implementation.
// This version is about 100x faster than a "clean" first stab implementation.

class RuleVisitor implements NodeFilter {
public boolean sawAds;

@@ -23,7 +23,7 @@ public class DocumentGeneratorExtractor {
var tags = doc.select("meta[name=generator]");

if (tags.size() == 0) {
if (tags.isEmpty()) {
// Some sites have a comment in the head instead of a meta tag
return fingerprintServerTech(doc, responseHeaders);
}

@@ -24,7 +24,7 @@ public class DocumentValuator {
double scriptPenalty = getScriptPenalty(parsedDocument);
double chatGptPenalty = getChatGptContentFarmPenalty(parsedDocument);

int rawLength = crawledDocument.documentBody.length();
int rawLength = crawledDocument.documentBodyBytes.length;

if (textLength == 0) {
throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.LENGTH);

@@ -218,7 +218,10 @@ public class FeatureExtractor {
}
}

if (features.contains(HtmlFeature.JS) && adblockSimulator.hasAds(doc.clone())) {
if (features.contains(HtmlFeature.JS)
// remove while disabled to get rid of expensive clone() call:
// adblockSimulator.hasAds(doc.clone())
) {
features.add(HtmlFeature.ADVERTISEMENT);
}

@@ -14,6 +14,7 @@ import nu.marginalia.model.crawldata.CrawledDocument;
import nu.marginalia.model.html.HtmlStandard;

import javax.annotation.Nullable;
import java.io.IOException;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.List;

@@ -25,7 +26,7 @@ public abstract class AbstractDocumentProcessorPlugin {
this.languageFilter = languageFilter;
}

public abstract DetailsWithWords createDetails(CrawledDocument crawledDocument, LinkTexts linkTexts, DocumentClass documentClass) throws DisqualifiedException, URISyntaxException;
public abstract DetailsWithWords createDetails(CrawledDocument crawledDocument, LinkTexts linkTexts, DocumentClass documentClass) throws DisqualifiedException, URISyntaxException, IOException;
public abstract boolean isApplicable(CrawledDocument doc);

protected void checkDocumentLanguage(DocumentLanguageData dld) throws DisqualifiedException {

@@ -86,6 +87,7 @@ public abstract class AbstractDocumentProcessorPlugin {
return this;
}

public MetaTagsBuilder addPubDate(PubDate pubDate) {
if (pubDate.year() > 1900) {
@@ -6,6 +6,7 @@ import nu.marginalia.converting.model.DisqualifiedException;
import nu.marginalia.converting.model.DocumentHeaders;
import nu.marginalia.converting.model.GeneratorType;
import nu.marginalia.converting.model.ProcessedDocumentDetails;
import nu.marginalia.converting.processor.AcceptableAds;
import nu.marginalia.converting.processor.DocumentClass;
import nu.marginalia.converting.processor.MetaRobotsTag;
import nu.marginalia.converting.processor.logic.*;

@@ -32,11 +33,11 @@ import nu.marginalia.model.crawldata.CrawledDocument;
import nu.marginalia.model.html.HtmlStandard;
import nu.marginalia.model.idx.DocumentFlags;
import nu.marginalia.model.idx.DocumentMetadata;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.net.URISyntaxException;
import java.util.EnumSet;
import java.util.HashSet;

@@ -51,7 +52,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
private final double minDocumentQuality;

private final FeatureExtractor featureExtractor;
private final TitleExtractor titleExtractor;
private final DocumentKeywordExtractor keywordExtractor;
private final PubDateSniffer pubDateSniffer;

@@ -74,7 +74,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
@Named("min-document-quality") Double minDocumentQuality,
LanguageFilter languageFilter,
FeatureExtractor featureExtractor,
TitleExtractor titleExtractor,
DocumentKeywordExtractor keywordExtractor,
PubDateSniffer pubDateSniffer,
DocumentLengthLogic documentLengthLogic,

@@ -89,7 +88,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
this.minDocumentQuality = minDocumentQuality;
this.featureExtractor = featureExtractor;

this.titleExtractor = titleExtractor;
this.keywordExtractor = keywordExtractor;
this.pubDateSniffer = pubDateSniffer;
this.metaRobotsTag = metaRobotsTag;

@@ -108,19 +106,17 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
public DetailsWithWords createDetails(CrawledDocument crawledDocument,
LinkTexts linkTexts,
DocumentClass documentClass)
throws DisqualifiedException, URISyntaxException {
throws DisqualifiedException, URISyntaxException, IOException {

String documentBody = crawledDocument.documentBody;

if (languageFilter.isBlockedUnicodeRange(documentBody)) {
if (languageFilter.isBlockedUnicodeRange(crawledDocument.documentBody(512))) {
throw new DisqualifiedException(DisqualificationReason.LANGUAGE);
}

if (documentBody.length() > MAX_DOCUMENT_LENGTH_BYTES) { // 128kb
documentBody = documentBody.substring(0, MAX_DOCUMENT_LENGTH_BYTES);
}
Document doc = crawledDocument.parseBody();

Document doc = Jsoup.parse(documentBody);
if (AcceptableAds.hasAcceptableAdsTag(doc)) {
throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.ACCEPTABLE_ADS);
}

if (!metaRobotsTag.allowIndexingByMetaTag(doc)) {
throw new DisqualifiedException(DisqualificationReason.FORBIDDEN);

@@ -138,32 +134,33 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
}

var prunedDoc = specialization.prune(doc);
DocumentLanguageData dld = sentenceExtractorProvider.get().extractSentences(prunedDoc);

checkDocumentLanguage(dld);

var ret = new ProcessedDocumentDetails();

final int length = getLength(doc);
final HtmlStandard standard = getHtmlStandard(doc);
final double quality = documentValuator.getQuality(crawledDocument, standard, doc, length);

if (isDisqualified(documentClass, url, quality, doc.title())) {
throw new DisqualifiedException(DisqualificationReason.QUALITY);
}

DocumentLanguageData dld = sentenceExtractorProvider.get().extractSentences(prunedDoc);

checkDocumentLanguage(dld);
documentLengthLogic.validateLength(dld, specialization.lengthModifier() * documentClass.lengthLimitModifier());

var ret = new ProcessedDocumentDetails();

ret.length = length;
ret.standard = standard;
ret.title = specialization.getTitle(doc, dld, crawledDocument.url);

documentLengthLogic.validateLength(dld, specialization.lengthModifier() * documentClass.lengthLimitModifier());

final Set<HtmlFeature> features = featureExtractor.getFeatures(url, doc, documentHeaders, dld);

ret.features = features;
ret.quality = documentValuator.adjustQuality(quality, features);
ret.hashCode = dld.localitySensitiveHashCode();

if (isDisqualified(documentClass, url, quality, ret.title)) {
throw new DisqualifiedException(DisqualificationReason.QUALITY);
}

PubDate pubDate = pubDateSniffer.getPubDate(documentHeaders, url, doc, standard, true);

EnumSet<DocumentFlags> documentFlags = documentFlags(features, generatorParts.type());
@@ -71,7 +71,7 @@ public class PlainTextDocumentProcessorPlugin extends AbstractDocumentProcessorP
DocumentClass documentClass)
throws DisqualifiedException, URISyntaxException {

String documentBody = crawledDocument.documentBody;
String documentBody = crawledDocument.documentBody();

if (languageFilter.isBlockedUnicodeRange(documentBody)) {
throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.LANGUAGE);

@@ -19,6 +19,7 @@ import nu.marginalia.model.idx.DocumentMetadata;
import nu.marginalia.model.idx.WordFlags;

import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.util.EnumSet;
import java.util.List;

@@ -50,7 +51,7 @@ public class SideloaderProcessing {
"OK",
"NP",
"",
body,
body.getBytes(StandardCharsets.UTF_8),
false,
null,
null

@@ -127,7 +127,7 @@ public class EncyclopediaMarginaliaNuSideloader implements SideloadSource, AutoC
}
fullHtml.append("</div></body></html>");

var doc = sideloaderProcessing
return sideloaderProcessing
.processDocument(fullUrl,
fullHtml.toString(),
List.of("encyclopedia", "wiki"),

@@ -137,8 +137,6 @@ public class EncyclopediaMarginaliaNuSideloader implements SideloadSource, AutoC
anchorTextKeywords.getAnchorTextKeywords(domainLinks, new EdgeUrl(fullUrl)),
LocalDate.now().getYear(),
10_000_000);

return doc;
}

private String normalizeUtf8(String url) {

@@ -106,11 +106,7 @@ public class WarcSideloader implements SideloadSource, AutoCloseable {
return false;

var url = new EdgeUrl(warcResponse.target());
if (!Objects.equals(url.getDomain(), domain)) {
return false;
}

return true;
return Objects.equals(url.getDomain(), domain);
} catch (Exception e) {
logger.warn("Failed to process response", e);
}
@@ -39,6 +39,9 @@ public class ConverterWriter implements AutoCloseable {
workerThread.start();
}

/** Queue and eventually write the domain into the converter journal
* The domain object will be closed after it's processed.
* */
public void accept(@Nullable ConverterBatchWritableIf domain) {
if (null == domain)
return;

@@ -72,15 +75,15 @@ public class ConverterWriter implements AutoCloseable {
if (workLog.isItemCommitted(id) || workLog.isItemInCurrentBatch(id)) {
logger.warn("Skipping already logged item {}", id);
}
else {
currentWriter.write(data);
workLog.logItem(id);
data.close();
continue;
}

currentWriter.write(data);

workLog.logItem(id);

switcher.tick();
data.close();
}
}
catch (Exception ex) {
@@ -11,7 +11,6 @@ import nu.marginalia.slop.column.primitive.IntColumn;
import nu.marginalia.slop.column.primitive.LongColumn;
import nu.marginalia.slop.column.string.EnumColumn;
import nu.marginalia.slop.column.string.StringColumn;
import nu.marginalia.slop.column.string.TxtStringColumn;
import nu.marginalia.slop.desc.StorageType;
import org.jetbrains.annotations.Nullable;

@@ -182,8 +181,8 @@ public record SlopDocumentRecord(
}

// Basic information
private static final TxtStringColumn domainsColumn = new TxtStringColumn("domain", StandardCharsets.UTF_8, StorageType.GZIP);
private static final TxtStringColumn urlsColumn = new TxtStringColumn("url", StandardCharsets.UTF_8, StorageType.GZIP);
private static final StringColumn domainsColumn = new StringColumn("domain", StandardCharsets.UTF_8, StorageType.GZIP);
private static final StringColumn urlsColumn = new StringColumn("url", StandardCharsets.UTF_8, StorageType.GZIP);
private static final VarintColumn ordinalsColumn = new VarintColumn("ordinal", StorageType.PLAIN);
private static final EnumColumn statesColumn = new EnumColumn("state", StandardCharsets.US_ASCII, StorageType.PLAIN);
private static final StringColumn stateReasonsColumn = new StringColumn("stateReason", StandardCharsets.US_ASCII, StorageType.GZIP);

@@ -211,7 +210,7 @@ public record SlopDocumentRecord(
private static final VarintCodedSequenceArrayColumn spansColumn = new VarintCodedSequenceArrayColumn("spans", StorageType.ZSTD);

public static class KeywordsProjectionReader extends SlopTable {
private final TxtStringColumn.Reader domainsReader;
private final StringColumn.Reader domainsReader;
private final VarintColumn.Reader ordinalsReader;
private final IntColumn.Reader htmlFeaturesReader;
private final LongColumn.Reader domainMetadataReader;

@@ -275,8 +274,8 @@ public record SlopDocumentRecord(
}

public static class MetadataReader extends SlopTable {
private final TxtStringColumn.Reader domainsReader;
private final TxtStringColumn.Reader urlsReader;
private final StringColumn.Reader domainsReader;
private final StringColumn.Reader urlsReader;
private final VarintColumn.Reader ordinalsReader;
private final StringColumn.Reader titlesReader;
private final StringColumn.Reader descriptionsReader;

@@ -332,8 +331,8 @@ public record SlopDocumentRecord(
}

public static class Writer extends SlopTable {
private final TxtStringColumn.Writer domainsWriter;
private final TxtStringColumn.Writer urlsWriter;
private final StringColumn.Writer domainsWriter;
private final StringColumn.Writer urlsWriter;
private final VarintColumn.Writer ordinalsWriter;
private final EnumColumn.Writer statesWriter;
private final StringColumn.Writer stateReasonsWriter;
@@ -98,7 +98,7 @@ public class ConvertingIntegrationTest {

@Test
public void testMemexMarginaliaNuSideloadProcessing() throws IOException {
var ret = domainProcessor.sideloadProcessing(asSerializableCrawlData(readMarginaliaWorkingSet()), 100);
var ret = domainProcessor.simpleProcessing(asSerializableCrawlData(readMarginaliaWorkingSet()), 100);
assertNotNull(ret);
assertEquals("memex.marginalia.nu", ret.id());

@@ -146,7 +146,7 @@ public class ConvertingIntegrationTest {
"OK",
"",
"",
readClassPathFile(p.toString()),
readClassPathFile(p.toString()).getBytes(),
false,
null,
null

@@ -200,23 +200,23 @@ public class CrawlingThenConvertingIntegrationTest {

@Test
public void crawlRobotsTxt() throws Exception {
var specs = new CrawlerMain.CrawlSpecRecord("search.marginalia.nu", 5,
List.of("https://search.marginalia.nu/search?q=hello+world")
var specs = new CrawlerMain.CrawlSpecRecord("marginalia-search.com", 5,
List.of("https://marginalia-search.com/search?q=hello+world")
);

CrawledDomain domain = crawl(specs);
assertFalse(domain.doc.isEmpty());
assertEquals("OK", domain.crawlerStatus);
assertEquals("search.marginalia.nu", domain.domain);
assertEquals("marginalia-search.com", domain.domain);

Set<String> allUrls = domain.doc.stream().map(doc -> doc.url).collect(Collectors.toSet());
assertTrue(allUrls.contains("https://search.marginalia.nu/search"), "We expect a record for entities that are forbidden");
assertTrue(allUrls.contains("https://marginalia-search.com/search"), "We expect a record for entities that are forbidden");

var output = process();

assertNotNull(output);
assertFalse(output.documents.isEmpty());
assertEquals(new EdgeDomain("search.marginalia.nu"), output.domain);
assertEquals(new EdgeDomain("marginalia-search.com"), output.domain);
assertEquals(DomainIndexingState.ACTIVE, output.state);

for (var doc : output.documents) {
@@ -55,16 +55,19 @@ dependencies {
implementation libs.zstd
implementation libs.jwarc
implementation libs.crawlercommons
implementation libs.okhttp3
implementation libs.jsoup
implementation libs.opencsv
implementation libs.fastutil

implementation libs.bundles.mariadb
implementation libs.bundles.httpcomponents

testImplementation libs.bundles.slf4j.test
testImplementation libs.bundles.junit
testImplementation libs.mockito
testImplementation libs.wiremock

testImplementation project(':code:processes:test-data')
}
@@ -2,11 +2,16 @@ package nu.marginalia.contenttype;

import org.apache.commons.lang3.StringUtils;

import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

/** Content type and charset of a document
* @param contentType The content type, e.g. "text/html"
* @param charset The charset, e.g. "UTF-8"
*/
public record ContentType(String contentType, String charset) {

public static ContentType parse(String contentTypeHeader) {
if (contentTypeHeader == null || contentTypeHeader.isBlank())
return new ContentType(null, null);

@@ -15,9 +20,31 @@ public record ContentType(String contentType, String charset) {
String contentType = parts[0].trim();
String charset = parts.length > 1 ? parts[1].trim() : "UTF-8";

if (charset.toLowerCase().startsWith("charset=")) {
charset = charset.substring("charset=".length());
}

return new ContentType(contentType, charset);
}

/** Best effort method for turning the provided charset string into a Java charset method,
* with some guesswork-heuristics for when it doesn't work
*/
public Charset asCharset() {
try {
if (Charset.isSupported(charset)) {
return Charset.forName(charset);
} else if (charset.equalsIgnoreCase("macintosh-latin")) {
return StandardCharsets.ISO_8859_1;
} else {
return StandardCharsets.UTF_8;
}
}
catch (IllegalCharsetNameException ex) { // thrown by Charset.isSupported()
return StandardCharsets.UTF_8;
}
}

public boolean is(String contentType) {
return this.contentType.equalsIgnoreCase(contentType);
}
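An illustrative use of the new asCharset() helper (not part of the diff; the header value is an example, and it assumes the usual semicolon split in parse() that the hunk does not show):

ContentType ct = ContentType.parse("text/html; charset=macintosh-latin");
// ct.contentType() -> "text/html", ct.charset() -> "macintosh-latin"
Charset cs = ct.asCharset(); // resolves to ISO-8859-1 via the macintosh-latin special case above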
@@ -1,9 +1,12 @@
package nu.marginalia.contenttype;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

@@ -23,24 +26,25 @@ public class DocumentBodyToString {
return new String(data, charset);
}

public static Document getParsedData(ContentType type, byte[] data, int maxLength, String url) throws IOException {
final Charset charset;

if (type.charset() == null || type.charset().isBlank()) {
charset = StandardCharsets.UTF_8;
} else {
charset = charsetMap.computeIfAbsent(type, DocumentBodyToString::computeCharset);
}

ByteArrayInputStream bais = new ByteArrayInputStream(data, 0, Math.min(data.length, maxLength));

return Jsoup.parse(bais, charset.name(), url);
}

private static Charset computeCharset(ContentType type) {
try {
if (type.charset() == null || type.charset().isBlank())
return StandardCharsets.UTF_8;
else {
return Charset.forName(type.charset());
}
}
catch (IllegalCharsetNameException ex) {
// Fall back to UTF-8 if we don't understand what this is. It's *probably* fine? Maybe?
if (type.charset() == null || type.charset().isBlank())
return StandardCharsets.UTF_8;
}
catch (UnsupportedCharsetException ex) {
// This is usually like Macintosh Latin
// (https://en.wikipedia.org/wiki/Macintosh_Latin_encoding)
//
// It's close enough to 8859-1 to serve
return StandardCharsets.ISO_8859_1;
else {
return type.asCharset();
}
}
}
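A small usage sketch of the new getParsedData() method (not from the diff; the URL, byte budget and document are invented for the example, and Document is the org.jsoup.nodes.Document already imported above):

ContentType type = ContentType.parse("text/html; charset=UTF-8");
byte[] body = "<html><body><p>Hello</p></body></html>".getBytes(StandardCharsets.UTF_8);
// Parses at most 128 000 bytes of the body with the resolved charset; throws IOException on failure.
Document doc = DocumentBodyToString.getParsedData(type, body, 128_000, "https://example.com/");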
@@ -19,22 +19,19 @@ import nu.marginalia.crawl.retreival.DomainProber;
import nu.marginalia.crawl.warc.WarcArchiverFactory;
import nu.marginalia.crawl.warc.WarcArchiverIf;
import nu.marginalia.db.DomainBlacklist;
import nu.marginalia.io.CrawledDomainReader;
import nu.marginalia.io.CrawlerOutputFile;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.mq.MessageQueueFactory;
import nu.marginalia.parquet.crawldata.CrawledDocumentParquetRecordFileWriter;
import nu.marginalia.process.ProcessConfiguration;
import nu.marginalia.process.ProcessConfigurationModule;
import nu.marginalia.process.ProcessMainClass;
import nu.marginalia.process.control.ProcessHeartbeatImpl;
import nu.marginalia.process.log.WorkLog;
import nu.marginalia.service.module.DatabaseModule;
import nu.marginalia.slop.SlopCrawlDataRecord;
import nu.marginalia.storage.FileStorageService;
import nu.marginalia.storage.model.FileStorageId;
import nu.marginalia.util.SimpleBlockingThreadPool;
import okhttp3.ConnectionPool;
import okhttp3.Dispatcher;
import org.jetbrains.annotations.NotNull;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@@ -44,11 +41,9 @@ import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.Security;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

@@ -72,6 +67,8 @@ public class CrawlerMain extends ProcessMainClass {
private final Map<String, CrawlTask> pendingCrawlTasks = new ConcurrentHashMap<>();

private final LinkedBlockingQueue<CrawlTask> retryQueue = new LinkedBlockingQueue<>();

private final AtomicInteger tasksDone = new AtomicInteger(0);
private final HttpFetcherImpl fetcher;

@@ -85,6 +82,7 @@ public class CrawlerMain extends ProcessMainClass {
@Inject
public CrawlerMain(UserAgent userAgent,
HttpFetcherImpl httpFetcher,
ProcessHeartbeatImpl heartbeat,
MessageQueueFactory messageQueueFactory, DomainProber domainProber,
FileStorageService fileStorageService,

@@ -98,6 +96,7 @@ public class CrawlerMain extends ProcessMainClass {
super(messageQueueFactory, processConfiguration, gson, CRAWLER_INBOX);

this.userAgent = userAgent;
this.fetcher = httpFetcher;
this.heartbeat = heartbeat;
this.domainProber = domainProber;
this.fileStorageService = fileStorageService;

@@ -107,14 +106,19 @@ public class CrawlerMain extends ProcessMainClass {
this.blacklist = blacklist;
this.node = processConfiguration.node();

SimpleBlockingThreadPool.ThreadType threadType;
if (Boolean.getBoolean("crawler.useVirtualThreads")) {
threadType = SimpleBlockingThreadPool.ThreadType.VIRTUAL;
}
else {
threadType = SimpleBlockingThreadPool.ThreadType.PLATFORM;
}

pool = new SimpleBlockingThreadPool("CrawlerPool",
Integer.getInteger("crawler.poolSize", 256),
1);
1,
threadType);

fetcher = new HttpFetcherImpl(userAgent,
new Dispatcher(),
new ConnectionPool(5, 10, TimeUnit.SECONDS)
);

// Wait for the blacklist to be loaded before starting the crawl
blacklist.waitUntilLoaded();

@@ -132,6 +136,10 @@ public class CrawlerMain extends ProcessMainClass {
System.setProperty("sun.net.client.defaultConnectTimeout", "30000");
System.setProperty("sun.net.client.defaultReadTimeout", "30000");

// Set the maximum number of connections to keep alive in the connection pool
System.setProperty("jdk.httpclient.idleTimeout", "15"); // 15 seconds
System.setProperty("jdk.httpclient.connectionPoolSize", "256");

// We don't want to use too much memory caching sessions for https
System.setProperty("javax.net.ssl.sessionCacheSize", "2048");

@@ -225,10 +233,7 @@ public class CrawlerMain extends ProcessMainClass {
logger.info("Loaded {} domains", crawlSpecRecords.size());

// Shuffle the domains to ensure we get a good mix of domains in each crawl,
// so that e.g. the big domains don't get all crawled at once, or we end up
// crawling the same server in parallel from different subdomains...
Collections.shuffle(crawlSpecRecords);
crawlSpecRecords.sort(crawlSpecArrangement(crawlSpecRecords));

// First a validation run to ensure the file is all good to parse
if (crawlSpecRecords.isEmpty()) {
@@ -249,21 +254,53 @@ public class CrawlerMain extends ProcessMainClass {
// (this happens when the process is restarted after a crash or a shutdown)
tasksDone.set(workLog.countFinishedJobs());

// Create crawl tasks and submit them to the pool for execution
// List of deferred tasks used to ensure beneficial scheduling of domains with regard to DomainLocks,
// merely shuffling the domains tends to lead to a lot of threads being blocked waiting for a semphore,
// this will more aggressively attempt to schedule the jobs to avoid blocking
List<CrawlTask> taskList = new ArrayList<>();

// Create crawl tasks
for (CrawlSpecRecord crawlSpec : crawlSpecRecords) {
if (workLog.isJobFinished(crawlSpec.domain()))
if (workLog.isJobFinished(crawlSpec.domain))
continue;

var task = new CrawlTask(
crawlSpec,
anchorTagsSource,
outputDir,
warcArchiver,
domainStateDb,
workLog);
var task = new CrawlTask(crawlSpec, anchorTagsSource, outputDir, warcArchiver, domainStateDb, workLog);

if (pendingCrawlTasks.putIfAbsent(crawlSpec.domain(), task) == null) {
pool.submitQuietly(task);
// Try to run immediately, to avoid unnecessarily keeping the entire work set in RAM
if (!trySubmitDeferredTask(task)) {

// Drain the retry queue to the taskList, and try to submit any tasks that are in the retry queue
retryQueue.drainTo(taskList);
taskList.removeIf(this::trySubmitDeferredTask);

// Then add this new task to the retry queue
taskList.add(task);
}
}

// Schedule viable tasks for execution until list is empty
for (int emptyRuns = 0;emptyRuns < 300;) {
boolean hasTasks = !taskList.isEmpty();

// The order of these checks very important to avoid a race condition
// where we miss a task that is put into the retry queue
boolean hasRunningTasks = pool.getActiveCount() > 0;
boolean hasRetryTasks = !retryQueue.isEmpty();

if (hasTasks || hasRetryTasks || hasRunningTasks) {
retryQueue.drainTo(taskList);

// Try to submit any tasks that are in the retry queue (this will block if the pool is full)
taskList.removeIf(this::trySubmitDeferredTask);

// Add a small pause here to avoid busy looping toward the end of the execution cycle when
// we might have no new viable tasks to run for hours on end
TimeUnit.MILLISECONDS.sleep(5);
} else {
// We have no tasks to run, and no tasks in the retry queue
// but we wait a bit to see if any new tasks come in via the retry queue
emptyRuns++;
TimeUnit.SECONDS.sleep(1);
}
}
@@ -291,6 +328,51 @@ public class CrawlerMain extends ProcessMainClass {
}
}

/** Create a comparator that sorts the crawl specs in a way that is beneficial for the crawl,
* we want to enqueue domains that have common top domains first, but otherwise have a random
* order.
* <p></p>
* Note, we can't use hash codes for randomization as it is not desirable to have the same order
* every time the process is restarted (and CrawlSpecRecord is a record, which defines equals and
* hashcode based on the fields).
* */
private Comparator<CrawlSpecRecord> crawlSpecArrangement(List<CrawlSpecRecord> records) {
Random r = new Random();
Map<String, Integer> topDomainCounts = new HashMap<>(4 + (int) Math.sqrt(records.size()));
Map<String, Integer> randomOrder = new HashMap<>(records.size());

for (var spec : records) {
topDomainCounts.merge(EdgeDomain.getTopDomain(spec.domain), 1, Integer::sum);
randomOrder.put(spec.domain, r.nextInt());
}

return Comparator.comparing((CrawlSpecRecord spec) -> topDomainCounts.getOrDefault(EdgeDomain.getTopDomain(spec.domain), 0) >= 8)
.reversed()
.thenComparing(spec -> randomOrder.get(spec.domain))
.thenComparing(Record::hashCode); // non-deterministic tie-breaker to
}

/** Submit a task for execution if it can be run, returns true if it was submitted
* or if it can be discarded */
private boolean trySubmitDeferredTask(CrawlTask task) {
if (!task.canRun()) {
return false;
}

if (pendingCrawlTasks.putIfAbsent(task.domain, task) != null) {
return true; // task has already run, duplicate in crawl specs
}

try {
// This blocks the caller when the pool is full
pool.submitQuietly(task);
return true;
}
catch (RuntimeException ex) {
logger.error("Failed to submit task " + task.domain, ex);
return false;
}
}

public void runForSingleDomain(String targetDomainName, FileStorageId fileStorageId) throws Exception {
runForSingleDomain(targetDomainName, fileStorageService.getStorage(fileStorageId).asPath());
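As a rough illustration of crawlSpecArrangement() above (invented domains, assuming EdgeDomain.getTopDomain() maps subdomains to their registrable domain):

crawlSpecRecords.sort(crawlSpecArrangement(crawlSpecRecords));
// Input specs: blog1.example.com ... blog9.example.com, lonely-site.org, another-site.net
// example.com occurs 9 times (>= 8), so the nine *.example.com specs sort ahead of the rest;
// within each group the order is randomized anew on every run via the randomOrder map.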
@@ -348,79 +430,117 @@ public class CrawlerMain extends ProcessMainClass {
|
||||
this.id = Integer.toHexString(domain.hashCode());
|
||||
}
|
||||
|
||||
/** Best effort indicator whether we could start this now without getting stuck in
|
||||
* DomainLocks purgatory */
|
||||
public boolean canRun() {
|
||||
return domainLocks.isLockableHint(new EdgeDomain(domain));
|
||||
}
|
||||
|
||||
@Override
|
||||
public void run() throws Exception {
|
||||
|
||||
Path newWarcFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.LIVE);
|
||||
Path tempFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.TEMP);
|
||||
Path parquetFile = CrawlerOutputFile.createParquetPath(outputDir, id, domain);
|
||||
|
||||
// Move the WARC file to a temp file if it exists, so we can resume the crawl using the old data
|
||||
// while writing to the same file name as before
|
||||
if (Files.exists(newWarcFile)) {
|
||||
Files.move(newWarcFile, tempFile, StandardCopyOption.REPLACE_EXISTING);
|
||||
}
|
||||
else {
|
||||
Files.deleteIfExists(tempFile);
|
||||
if (workLog.isJobFinished(domain)) { // No-Op
|
||||
logger.info("Omitting task {}, as it is already run", domain);
|
||||
return;
|
||||
}
|
||||
|
||||
try (var warcRecorder = new WarcRecorder(newWarcFile); // write to a temp file for now
|
||||
var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, domainStateDb, warcRecorder);
|
||||
CrawlDataReference reference = getReference();
|
||||
)
|
||||
{
|
||||
// Resume the crawl if it was aborted
|
||||
if (Files.exists(tempFile)) {
|
||||
retriever.syncAbortedRun(tempFile);
|
||||
Files.delete(tempFile);
|
||||
Optional<DomainLocks.DomainLock> lock = domainLocks.tryLockDomain(new EdgeDomain(domain));
|
||||
// We don't have a lock, so we can't run this task
|
||||
// we return to avoid blocking the pool for too long
|
||||
if (lock.isEmpty()) {
|
||||
if (retryQueue.remainingCapacity() > 0) {
|
||||
// Sleep a moment to avoid busy looping via the retry queue
|
||||
// in the case when few tasks remain and almost all are ineligible for
|
||||
// immediate restart
|
||||
Thread.sleep(5);
|
||||
}
|
||||
|
||||
DomainLinks domainLinks = anchorTagsSource.getAnchorTags(domain);
|
||||
retryQueue.put(this);
|
||||
return;
|
||||
}
|
||||
DomainLocks.DomainLock domainLock = lock.get();
|
||||
|
||||
int size;
|
||||
try (var lock = domainLocks.lockDomain(new EdgeDomain(domain))) {
|
||||
size = retriever.crawlDomain(domainLinks, reference);
|
||||
try (domainLock) {
|
||||
Thread.currentThread().setName("crawling:" + domain);
|
||||
|
||||
Path newWarcFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.LIVE);
|
||||
Path tempFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.TEMP);
|
||||
Path slopFile = CrawlerOutputFile.createSlopPath(outputDir, id, domain);
|
||||
|
||||
// Move the WARC file to a temp file if it exists, so we can resume the crawl using the old data
|
||||
// while writing to the same file name as before
|
||||
if (Files.exists(newWarcFile)) {
|
||||
Files.move(newWarcFile, tempFile, StandardCopyOption.REPLACE_EXISTING);
|
||||
}
|
||||
else {
|
||||
Files.deleteIfExists(tempFile);
|
||||
}
|
||||
|
||||
// Delete the reference crawl data if it's not the same as the new one
|
||||
// (mostly a case when migrating from legacy->warc)
|
||||
reference.delete();
|
||||
try (var warcRecorder = new WarcRecorder(newWarcFile); // write to a temp file for now
|
||||
var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, domainStateDb, warcRecorder);
|
||||
CrawlDataReference reference = getReference())
|
||||
{
|
||||
// Resume the crawl if it was aborted
|
||||
if (Files.exists(tempFile)) {
|
||||
retriever.syncAbortedRun(tempFile);
|
||||
Files.delete(tempFile);
|
||||
}
|
||||
|
||||
// Convert the WARC file to Parquet
|
||||
CrawledDocumentParquetRecordFileWriter
|
||||
.convertWarc(domain, userAgent, newWarcFile, parquetFile);
|
||||
DomainLinks domainLinks = anchorTagsSource.getAnchorTags(domain);
|
||||
|
||||
// Optionally archive the WARC file if full retention is enabled,
|
||||
// otherwise delete it:
|
||||
warcArchiver.consumeWarc(newWarcFile, domain);
|
||||
int size = retriever.crawlDomain(domainLinks, reference);
|
||||
|
||||
// Mark the domain as finished in the work log
|
||||
workLog.setJobToFinished(domain, parquetFile.toString(), size);
|
||||
// Delete the reference crawl data if it's not the same as the new one
|
||||
// (mostly a case when migrating from legacy->warc)
|
||||
reference.delete();
|
||||
|
||||
// Update the progress bar
|
||||
heartbeat.setProgress(tasksDone.incrementAndGet() / (double) totalTasks);
|
||||
// Convert the WARC file to Slop
|
||||
SlopCrawlDataRecord
|
||||
.convertWarc(domain, userAgent, newWarcFile, slopFile);
|
||||
|
||||
logger.info("Fetched {}", domain);
|
||||
} catch (Exception e) {
|
||||
logger.error("Error fetching domain " + domain, e);
|
||||
}
|
||||
finally {
|
||||
// We don't need to double-count these; it's also kept int he workLog
|
||||
pendingCrawlTasks.remove(domain);
|
||||
Thread.currentThread().setName("[idle]");
|
||||
// Optionally archive the WARC file if full retention is enabled,
|
||||
// otherwise delete it:
|
||||
warcArchiver.consumeWarc(newWarcFile, domain);
|
||||
|
||||
Files.deleteIfExists(newWarcFile);
|
||||
Files.deleteIfExists(tempFile);
|
||||
// Mark the domain as finished in the work log
|
||||
workLog.setJobToFinished(domain, slopFile.toString(), size);
|
||||
|
||||
// Update the progress bar
|
||||
heartbeat.setProgress(tasksDone.incrementAndGet() / (double) totalTasks);
|
||||
|
||||
logger.info("Fetched {}", domain);
|
||||
} catch (Exception e) {
|
||||
logger.error("Error fetching domain " + domain, e);
|
||||
}
|
||||
finally {
|
||||
// We don't need to double-count these; it's also kept in the workLog
|
||||
pendingCrawlTasks.remove(domain);
|
||||
Thread.currentThread().setName("[idle]");
|
||||
|
||||
Files.deleteIfExists(newWarcFile);
|
||||
Files.deleteIfExists(tempFile);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private CrawlDataReference getReference() {
try {
return new CrawlDataReference(CrawledDomainReader.createDataStream(outputDir, domain, id));
} catch (IOException e) {
Path slopPath = CrawlerOutputFile.getSlopPath(outputDir, id, domain);
if (Files.exists(slopPath)) {
return new CrawlDataReference(slopPath);
}

Path parquetPath = CrawlerOutputFile.getParquetPath(outputDir, id, domain);
if (Files.exists(parquetPath)) {
slopPath = migrateParquetData(parquetPath, domain, outputDir);
return new CrawlDataReference(slopPath);
}

} catch (Exception e) {
logger.debug("Failed to read previous crawl data for {}", specification.domain());
return new CrawlDataReference();
}

return new CrawlDataReference();
}

}
|
||||
@@ -480,4 +600,20 @@ public class CrawlerMain extends ProcessMainClass {
}
}
}

// Migrate from parquet to slop if necessary
//
// This must be synchronized as chewing through parquet files in parallel leads to enormous memory overhead
private synchronized Path migrateParquetData(Path inputPath, String domain, Path crawlDataRoot) throws IOException {
if (!inputPath.toString().endsWith(".parquet")) {
return inputPath;
}

Path outputFile = CrawlerOutputFile.createSlopPath(crawlDataRoot, Integer.toHexString(domain.hashCode()), domain);

SlopCrawlDataRecord.convertFromParquet(inputPath, outputFile);

return outputFile;
}

}
|
||||
|
@@ -1,5 +1,8 @@
|
||||
package nu.marginalia.crawl;
|
||||
|
||||
import com.google.inject.Inject;
|
||||
import nu.marginalia.storage.FileStorageService;
|
||||
import nu.marginalia.storage.model.FileStorageType;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
@@ -8,7 +11,9 @@ import java.nio.file.Path;
|
||||
import java.sql.Connection;
|
||||
import java.sql.DriverManager;
|
||||
import java.sql.SQLException;
|
||||
import java.time.Duration;
|
||||
import java.time.Instant;
|
||||
import java.util.Objects;
|
||||
import java.util.Optional;
|
||||
|
||||
/** Supplemental sqlite database for storing the summary of a crawl.
|
||||
@@ -20,6 +25,17 @@ public class DomainStateDb implements AutoCloseable {
|
||||
|
||||
private final Connection connection;
|
||||
|
||||
|
||||
public record CrawlMeta(
|
||||
String domainName,
|
||||
Instant lastFullCrawl,
|
||||
Duration recrawlTime,
|
||||
Duration crawlTime,
|
||||
int recrawlErrors,
|
||||
int crawlChanges,
|
||||
int totalCrawlSize
|
||||
) {}
|
||||
|
||||
public record SummaryRecord(
|
||||
String domainName,
|
||||
Instant lastUpdated,
|
||||
@@ -60,7 +76,31 @@ public class DomainStateDb implements AutoCloseable {
|
||||
|
||||
}
|
||||
|
||||
public DomainStateDb(Path filename) throws SQLException {
|
||||
public record FaviconRecord(String contentType, byte[] imageData) {}
|
||||
|
||||
@Inject
|
||||
public DomainStateDb(FileStorageService fileStorageService) throws SQLException {
|
||||
this(findFilename(fileStorageService));
|
||||
}
|
||||
|
||||
private static Path findFilename(FileStorageService fileStorageService) throws SQLException {
|
||||
var fsId = fileStorageService.getOnlyActiveFileStorage(FileStorageType.CRAWL_DATA);
|
||||
|
||||
if (fsId.isPresent()) {
|
||||
var fs = fileStorageService.getStorage(fsId.get());
|
||||
return fs.asPath().resolve("domainstate.db");
|
||||
}
|
||||
else {
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
public DomainStateDb(@Nullable Path filename) throws SQLException {
|
||||
if (null == filename) {
|
||||
connection = null;
|
||||
return;
|
||||
}
|
||||
|
||||
String sqliteDbString = "jdbc:sqlite:" + filename.toString();
|
||||
connection = DriverManager.getConnection(sqliteDbString);
|
||||
|
||||
@@ -74,18 +114,102 @@ public class DomainStateDb implements AutoCloseable {
|
||||
feedUrl TEXT
|
||||
)
|
||||
""");
|
||||
|
||||
stmt.executeUpdate("""
|
||||
CREATE TABLE IF NOT EXISTS crawl_meta (
|
||||
domain TEXT PRIMARY KEY,
|
||||
lastFullCrawlEpochMs LONG NOT NULL,
|
||||
recrawlTimeMs LONG NOT NULL,
|
||||
recrawlErrors INTEGER NOT NULL,
|
||||
crawlTimeMs LONG NOT NULL,
|
||||
crawlChanges INTEGER NOT NULL,
|
||||
totalCrawlSize INTEGER NOT NULL
|
||||
)
|
||||
""");
|
||||
stmt.executeUpdate("""
|
||||
CREATE TABLE IF NOT EXISTS favicon (
|
||||
domain TEXT PRIMARY KEY,
|
||||
contentType TEXT NOT NULL,
|
||||
icon BLOB NOT NULL
|
||||
)
|
||||
""");
|
||||
stmt.execute("PRAGMA journal_mode=WAL");
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public void close() throws SQLException {
|
||||
connection.close();
|
||||
if (connection != null) {
|
||||
connection.close();
|
||||
}
|
||||
}
|
||||
|
||||
public boolean isAvailable() {
|
||||
return connection != null;
|
||||
}
|
||||
|
||||
public void saveIcon(String domain, FaviconRecord faviconRecord) {
|
||||
if (connection == null) throw new IllegalStateException("No connection to domainstate db");
|
||||
|
||||
try (var stmt = connection.prepareStatement("""
|
||||
INSERT OR REPLACE INTO favicon (domain, contentType, icon)
|
||||
VALUES(?, ?, ?)
|
||||
""")) {
|
||||
stmt.setString(1, domain);
|
||||
stmt.setString(2, Objects.requireNonNullElse(faviconRecord.contentType, "application/octet-stream"));
|
||||
stmt.setBytes(3, faviconRecord.imageData);
|
||||
stmt.executeUpdate();
|
||||
}
|
||||
catch (SQLException ex) {
|
||||
logger.error("Failed to insert favicon", ex);
|
||||
}
|
||||
}
|
||||
|
||||
public Optional<FaviconRecord> getIcon(String domain) {
|
||||
if (connection == null)
|
||||
return Optional.empty();
|
||||
|
||||
try (var stmt = connection.prepareStatement("SELECT contentType, icon FROM favicon WHERE DOMAIN = ?")) {
|
||||
stmt.setString(1, domain);
|
||||
var rs = stmt.executeQuery();
|
||||
|
||||
if (rs.next()) {
|
||||
return Optional.of(
|
||||
new FaviconRecord(
|
||||
rs.getString("contentType"),
|
||||
rs.getBytes("icon")
|
||||
)
|
||||
);
|
||||
}
|
||||
} catch (SQLException e) {
|
||||
logger.error("Failed to retrieve favicon", e);
|
||||
}
|
||||
|
||||
return Optional.empty();
|
||||
}
|
||||
|
||||
public void save(CrawlMeta crawlMeta) {
|
||||
if (connection == null) throw new IllegalStateException("No connection to domainstate db");
|
||||
|
||||
try (var stmt = connection.prepareStatement("""
|
||||
INSERT OR REPLACE INTO crawl_meta (domain, lastFullCrawlEpochMs, recrawlTimeMs, recrawlErrors, crawlTimeMs, crawlChanges, totalCrawlSize)
|
||||
VALUES (?, ?, ?, ?, ?, ?, ?)
|
||||
""")) {
|
||||
stmt.setString(1, crawlMeta.domainName());
|
||||
stmt.setLong(2, crawlMeta.lastFullCrawl.toEpochMilli());
|
||||
stmt.setLong(3, crawlMeta.recrawlTime.toMillis());
|
||||
stmt.setInt(4, crawlMeta.recrawlErrors);
|
||||
stmt.setLong(5, crawlMeta.crawlTime.toMillis());
|
||||
stmt.setInt(6, crawlMeta.crawlChanges);
|
||||
stmt.setInt(7, crawlMeta.totalCrawlSize);
|
||||
stmt.executeUpdate();
|
||||
} catch (SQLException e) {
|
||||
logger.error("Failed to insert crawl meta record", e);
|
||||
}
|
||||
}
|
||||
|
||||
public void save(SummaryRecord record) {
|
||||
if (connection == null) throw new IllegalStateException("No connection to domainstate db");
|
||||
|
||||
try (var stmt = connection.prepareStatement("""
|
||||
INSERT OR REPLACE INTO summary (domain, lastUpdatedEpochMs, state, stateDesc, feedUrl)
|
||||
VALUES (?, ?, ?, ?, ?)
|
||||
@@ -101,7 +225,38 @@ public class DomainStateDb implements AutoCloseable {
|
||||
}
|
||||
}
|
||||
|
||||
public Optional<SummaryRecord> get(String domainName) {
public Optional<CrawlMeta> getMeta(String domainName) {
if (connection == null)
return Optional.empty();

try (var stmt = connection.prepareStatement("""
SELECT domain, lastFullCrawlEpochMs, recrawlTimeMs, recrawlErrors, crawlTimeMs, crawlChanges, totalCrawlSize
FROM crawl_meta
WHERE domain = ?
""")) {
stmt.setString(1, domainName);
var rs = stmt.executeQuery();
if (rs.next()) {
return Optional.of(new CrawlMeta(
rs.getString("domain"),
Instant.ofEpochMilli(rs.getLong("lastFullCrawlEpochMs")),
Duration.ofMillis(rs.getLong("recrawlTimeMs")),
Duration.ofMillis(rs.getLong("crawlTimeMs")),
rs.getInt("recrawlErrors"),
rs.getInt("crawlChanges"),
rs.getInt("totalCrawlSize")
));
}
} catch (SQLException ex) {
logger.error("Failed to get crawl meta record", ex);
}
return Optional.empty();
}
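
// Illustrative usage sketch, not part of the change set: the database path and the
// metric values below are invented for the example.
private static void crawlMetaUsageSketch() throws Exception {
    try (var db = new DomainStateDb(Path.of("/tmp/domainstate.db"))) {
        db.save(new CrawlMeta("www.marginalia.nu", Instant.now(),
                Duration.ofMinutes(30),  // recrawlTime
                Duration.ofHours(2),     // crawlTime
                0,                       // recrawlErrors
                42,                      // crawlChanges
                1000));                  // totalCrawlSize
        Optional<CrawlMeta> meta = db.getMeta("www.marginalia.nu");
        meta.ifPresent(m -> System.out.println(m.crawlChanges() + " documents changed"));
    }
}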
|
||||
|
||||
public Optional<SummaryRecord> getSummary(String domainName) {
|
||||
if (connection == null)
|
||||
return Optional.empty();
|
||||
|
||||
try (var stmt = connection.prepareStatement("""
|
||||
SELECT domain, lastUpdatedEpochMs, state, stateDesc, feedUrl
|
||||
FROM summary
|
||||
|
@@ -1,6 +1,6 @@
package nu.marginalia.crawl.fetcher;

import okhttp3.Request;
import org.apache.hc.client5.http.classic.methods.HttpGet;

/** Encapsulates request modifiers; the ETag and Last-Modified tags for a resource */
public record ContentTags(String etag, String lastMod) {
@@ -17,14 +17,16 @@ public record ContentTags(String etag, String lastMod) {
}

/** Paints the tags onto the request builder. */
public void paint(Request.Builder getBuilder) {
public void paint(HttpGet request) {

// Paint the ETag header if present,
// otherwise paint the Last-Modified header
// (but not both at the same time due to some servers not liking it)

if (etag != null) {
getBuilder.addHeader("If-None-Match", etag);
}

if (lastMod != null) {
getBuilder.addHeader("If-Modified-Since", lastMod);
request.addHeader("If-None-Match", etag);
} else if (lastMod != null) {
request.addHeader("If-Modified-Since", lastMod);
}
}
}
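// Illustrative sketch (URL and tag values invented; imports omitted): painting
// conditional-request headers with the new HttpGet-based signature.
class ContentTagsUsageSketch {
    static void paintExample() {
        HttpGet request = new HttpGet("https://www.marginalia.nu/");
        new ContentTags("\"abc123\"", "Tue, 01 Apr 2025 00:00:00 GMT").paint(request);
        // Only If-None-Match is added here: when an ETag is present it takes precedence
        // over Last-Modified, so the two headers are never sent together.
    }
}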
|
@@ -1,43 +0,0 @@
|
||||
package nu.marginalia.crawl.fetcher;
|
||||
|
||||
import okhttp3.Cookie;
|
||||
import okhttp3.CookieJar;
|
||||
import okhttp3.HttpUrl;
|
||||
|
||||
import java.util.Collections;
|
||||
import java.util.List;
|
||||
import java.util.concurrent.ConcurrentHashMap;
|
||||
|
||||
public class Cookies {
|
||||
final ThreadLocal<ConcurrentHashMap<String, List<Cookie>>> cookieJar = ThreadLocal.withInitial(ConcurrentHashMap::new);
|
||||
|
||||
public CookieJar getJar() {
|
||||
return new CookieJar() {
|
||||
|
||||
@Override
|
||||
public void saveFromResponse(HttpUrl url, List<Cookie> cookies) {
|
||||
|
||||
if (!cookies.isEmpty()) {
|
||||
cookieJar.get().put(url.host(), cookies);
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<Cookie> loadForRequest(HttpUrl url) {
|
||||
return cookieJar.get().getOrDefault(url.host(), Collections.emptyList());
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
public void clear() {
|
||||
cookieJar.get().clear();
|
||||
}
|
||||
|
||||
public boolean hasCookies() {
|
||||
return !cookieJar.get().isEmpty();
|
||||
}
|
||||
|
||||
public List<String> getCookies() {
|
||||
return cookieJar.get().values().stream().flatMap(List::stream).map(Cookie::toString).toList();
|
||||
}
|
||||
}
|
@@ -0,0 +1,56 @@
package nu.marginalia.crawl.fetcher;

import org.apache.hc.client5.http.classic.methods.HttpUriRequestBase;
import org.apache.hc.core5.http.ClassicHttpRequest;
import org.apache.hc.core5.http.HttpResponse;

import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

public class DomainCookies {
private final Map<String, String> cookies = new HashMap<>();

public boolean hasCookies() {
return !cookies.isEmpty();
}

public void updateCookieStore(HttpResponse response) {
for (var header : response.getHeaders()) {
if (header.getName().equalsIgnoreCase("Set-Cookie")) {
parseCookieHeader(header.getValue());
}
}
}

private void parseCookieHeader(String value) {
// Parse the Set-Cookie header value and extract the cookies

String[] parts = value.split(";");
String cookie = parts[0].trim();

if (cookie.contains("=")) {
String[] cookieParts = cookie.split("=");
String name = cookieParts[0].trim();
String val = cookieParts[1].trim();
cookies.put(name, val);
}
}

public void paintRequest(HttpUriRequestBase request) {
request.addHeader("Cookie", createCookieHeader());
}

public void paintRequest(ClassicHttpRequest request) {
request.addHeader("Cookie", createCookieHeader());
}

private String createCookieHeader() {
StringJoiner sj = new StringJoiner("; ");
for (var cookie : cookies.entrySet()) {
sj.add(cookie.getKey() + "=" + cookie.getValue());
}
return sj.toString();
}

}
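// Illustrative sketch (flow and URL are assumptions; imports omitted): carrying cookies
// across two requests to the same domain. In the change set the DomainCookies instance
// is handed to warcRecorder.fetch() and probeContentType() in HttpFetcherImpl further
// down; here the calls are shown inline.
class DomainCookiesUsageSketch {
    static void followUp(HttpResponse firstResponse) {
        DomainCookies jar = new DomainCookies();
        jar.updateCookieStore(firstResponse);          // records any Set-Cookie headers
        HttpGet followUp = new HttpGet("https://www.marginalia.nu/about/");
        if (jar.hasCookies()) {
            jar.paintRequest(followUp);                // replays them as one Cookie header
        }
    }
}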
@@ -3,31 +3,32 @@ package nu.marginalia.crawl.fetcher;
|
||||
import com.google.inject.ImplementedBy;
|
||||
import crawlercommons.robots.SimpleRobotRules;
|
||||
import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
|
||||
import nu.marginalia.crawl.retreival.CrawlDelayTimer;
|
||||
import nu.marginalia.model.EdgeDomain;
|
||||
import nu.marginalia.model.EdgeUrl;
|
||||
import nu.marginalia.model.body.HttpFetchResult;
|
||||
import nu.marginalia.model.crawldata.CrawlerDomainStatus;
|
||||
import org.apache.hc.client5.http.cookie.CookieStore;
|
||||
|
||||
import java.util.List;
|
||||
|
||||
@ImplementedBy(HttpFetcherImpl.class)
|
||||
public interface HttpFetcher {
|
||||
public interface HttpFetcher extends AutoCloseable {
|
||||
void setAllowAllContentTypes(boolean allowAllContentTypes);
|
||||
|
||||
List<String> getCookies();
|
||||
CookieStore getCookies();
|
||||
void clearCookies();
|
||||
|
||||
DomainProbeResult probeDomain(EdgeUrl url);
|
||||
|
||||
ContentTypeProbeResult probeContentType(
|
||||
EdgeUrl url,
|
||||
WarcRecorder recorder,
|
||||
ContentTags tags) throws HttpFetcherImpl.RateLimitException;
|
||||
|
||||
HttpFetchResult fetchContent(EdgeUrl url,
|
||||
WarcRecorder recorder,
|
||||
DomainCookies cookies,
|
||||
CrawlDelayTimer timer,
|
||||
ContentTags tags,
|
||||
ProbeType probeType) throws HttpFetcherImpl.RateLimitException, Exception;
|
||||
ProbeType probeType);
|
||||
|
||||
List<EdgeUrl> fetchSitemapUrls(String rootSitemapUrl, CrawlDelayTimer delayTimer);
|
||||
|
||||
SimpleRobotRules fetchRobotRules(EdgeDomain domain, WarcRecorder recorder);
|
||||
|
||||
@@ -43,6 +44,7 @@ public interface HttpFetcher {
|
||||
|
||||
/** This domain redirects to another domain */
|
||||
record Redirect(EdgeDomain domain) implements DomainProbeResult {}
|
||||
record RedirectSameDomain_Internal(EdgeUrl domain) implements DomainProbeResult {}
|
||||
|
||||
/** If the retrieval of the probed url was successful, return the url as it was fetched
|
||||
* (which may be different from the url we probed, if we attempted another URL schema).
|
||||
@@ -53,7 +55,10 @@ public interface HttpFetcher {
|
||||
}
|
||||
|
||||
sealed interface ContentTypeProbeResult {
|
||||
record NoOp() implements ContentTypeProbeResult {}
|
||||
record Ok(EdgeUrl resolvedUrl) implements ContentTypeProbeResult { }
|
||||
record HttpError(int statusCode, String message) implements ContentTypeProbeResult { }
|
||||
record Redirect(EdgeUrl location) implements ContentTypeProbeResult { }
|
||||
record BadContentType(String contentType, int statusCode) implements ContentTypeProbeResult { }
|
||||
record Timeout(java.lang.Exception ex) implements ContentTypeProbeResult { }
|
||||
record Exception(java.lang.Exception ex) implements ContentTypeProbeResult { }
|
||||
|
@@ -1,78 +1,174 @@
|
||||
package nu.marginalia.crawl.fetcher;
|
||||
|
||||
import com.google.inject.Inject;
|
||||
import com.google.inject.Singleton;
|
||||
import crawlercommons.robots.SimpleRobotRules;
|
||||
import crawlercommons.robots.SimpleRobotRulesParser;
|
||||
import nu.marginalia.UserAgent;
|
||||
import nu.marginalia.crawl.fetcher.socket.FastTerminatingSocketFactory;
|
||||
import nu.marginalia.crawl.fetcher.socket.IpInterceptingNetworkInterceptor;
|
||||
import nu.marginalia.crawl.fetcher.socket.NoSecuritySSL;
|
||||
import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
|
||||
import nu.marginalia.crawl.retreival.CrawlDelayTimer;
|
||||
import nu.marginalia.link_parser.LinkParser;
|
||||
import nu.marginalia.model.EdgeDomain;
|
||||
import nu.marginalia.model.EdgeUrl;
|
||||
import nu.marginalia.model.body.ContentTypeLogic;
|
||||
import nu.marginalia.model.body.DocumentBodyExtractor;
|
||||
import nu.marginalia.model.body.HttpFetchResult;
|
||||
import nu.marginalia.model.crawldata.CrawlerDomainStatus;
|
||||
import okhttp3.ConnectionPool;
|
||||
import okhttp3.Dispatcher;
|
||||
import okhttp3.OkHttpClient;
|
||||
import okhttp3.Request;
|
||||
import org.apache.hc.client5.http.ConnectionKeepAliveStrategy;
|
||||
import org.apache.hc.client5.http.HttpRequestRetryStrategy;
|
||||
import org.apache.hc.client5.http.classic.HttpClient;
|
||||
import org.apache.hc.client5.http.classic.methods.HttpGet;
|
||||
import org.apache.hc.client5.http.config.ConnectionConfig;
|
||||
import org.apache.hc.client5.http.config.RequestConfig;
|
||||
import org.apache.hc.client5.http.cookie.BasicCookieStore;
|
||||
import org.apache.hc.client5.http.cookie.CookieStore;
|
||||
import org.apache.hc.client5.http.cookie.StandardCookieSpec;
|
||||
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
|
||||
import org.apache.hc.client5.http.impl.classic.HttpClients;
|
||||
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
|
||||
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManagerBuilder;
|
||||
import org.apache.hc.client5.http.ssl.DefaultClientTlsStrategy;
|
||||
import org.apache.hc.core5.http.*;
|
||||
import org.apache.hc.core5.http.io.HttpClientResponseHandler;
|
||||
import org.apache.hc.core5.http.io.SocketConfig;
|
||||
import org.apache.hc.core5.http.io.entity.EntityUtils;
|
||||
import org.apache.hc.core5.http.io.support.ClassicRequestBuilder;
|
||||
import org.apache.hc.core5.http.message.MessageSupport;
|
||||
import org.apache.hc.core5.http.protocol.HttpContext;
|
||||
import org.apache.hc.core5.pool.PoolStats;
|
||||
import org.apache.hc.core5.util.TimeValue;
|
||||
import org.apache.hc.core5.util.Timeout;
|
||||
import org.jsoup.Jsoup;
|
||||
import org.jsoup.nodes.Document;
|
||||
import org.jsoup.parser.Parser;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
import org.slf4j.Marker;
|
||||
import org.slf4j.MarkerFactory;
|
||||
|
||||
import javax.net.ssl.X509TrustManager;
|
||||
import java.io.InterruptedIOException;
|
||||
import javax.net.ssl.SSLContext;
|
||||
import javax.net.ssl.SSLException;
|
||||
import java.io.IOException;
|
||||
import java.net.SocketTimeoutException;
|
||||
import java.net.URISyntaxException;
|
||||
import java.net.UnknownHostException;
|
||||
import java.security.NoSuchAlgorithmException;
|
||||
import java.time.Duration;
|
||||
import java.util.List;
|
||||
import java.util.Objects;
|
||||
import java.util.Optional;
|
||||
import java.time.Instant;
|
||||
import java.util.*;
|
||||
import java.util.concurrent.Semaphore;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
import java.util.concurrent.atomic.AtomicBoolean;
|
||||
|
||||
|
||||
public class HttpFetcherImpl implements HttpFetcher {
|
||||
@Singleton
|
||||
public class HttpFetcherImpl implements HttpFetcher, HttpRequestRetryStrategy {
|
||||
|
||||
private final Logger logger = LoggerFactory.getLogger(getClass());
|
||||
private final String userAgentString;
|
||||
private final String userAgentIdentifier;
|
||||
private final Cookies cookies = new Cookies();
|
||||
|
||||
private final CookieStore cookies = new BasicCookieStore();
|
||||
|
||||
private static final SimpleRobotRulesParser robotsParser = new SimpleRobotRulesParser();
|
||||
private static final ContentTypeLogic contentTypeLogic = new ContentTypeLogic();
|
||||
private final Marker crawlerAuditMarker = MarkerFactory.getMarker("CRAWLER");
|
||||
|
||||
private final LinkParser linkParser = new LinkParser();
|
||||
@Override
|
||||
public void setAllowAllContentTypes(boolean allowAllContentTypes) {
|
||||
contentTypeLogic.setAllowAllContentTypes(allowAllContentTypes);
|
||||
}
|
||||
|
||||
private final OkHttpClient client;
|
||||
private final CloseableHttpClient client;
|
||||
private PoolingHttpClientConnectionManager connectionManager;
|
||||
|
||||
private static final FastTerminatingSocketFactory ftSocketFactory = new FastTerminatingSocketFactory();
|
||||
public PoolStats getPoolStats() {
|
||||
return connectionManager.getTotalStats();
|
||||
}
|
||||
|
||||
private OkHttpClient createClient(Dispatcher dispatcher, ConnectionPool pool) {
|
||||
var builder = new OkHttpClient.Builder();
|
||||
if (dispatcher != null) {
|
||||
builder.dispatcher(dispatcher);
|
||||
}
|
||||
private CloseableHttpClient createClient() throws NoSuchAlgorithmException {
|
||||
final ConnectionConfig connectionConfig = ConnectionConfig.custom()
|
||||
.setSocketTimeout(10, TimeUnit.SECONDS)
|
||||
.setConnectTimeout(30, TimeUnit.SECONDS)
|
||||
.setValidateAfterInactivity(TimeValue.ofSeconds(5))
|
||||
.build();
|
||||
|
||||
return builder.sslSocketFactory(NoSecuritySSL.buildSocketFactory(), (X509TrustManager) NoSecuritySSL.trustAllCerts[0])
|
||||
.socketFactory(ftSocketFactory)
|
||||
.hostnameVerifier(NoSecuritySSL.buildHostnameVerifyer())
|
||||
.addNetworkInterceptor(new IpInterceptingNetworkInterceptor())
|
||||
.connectionPool(pool)
|
||||
.cookieJar(cookies.getJar())
|
||||
.followRedirects(true)
|
||||
.followSslRedirects(true)
|
||||
.connectTimeout(8, TimeUnit.SECONDS)
|
||||
.readTimeout(10, TimeUnit.SECONDS)
|
||||
.writeTimeout(10, TimeUnit.SECONDS)
|
||||
.build();
|
||||
connectionManager = PoolingHttpClientConnectionManagerBuilder.create()
|
||||
.setMaxConnPerRoute(2)
|
||||
.setMaxConnTotal(5000)
|
||||
.setDefaultConnectionConfig(connectionConfig)
|
||||
.setTlsSocketStrategy(new DefaultClientTlsStrategy(SSLContext.getDefault()))
|
||||
.build();
|
||||
|
||||
connectionManager.setDefaultSocketConfig(SocketConfig.custom()
|
||||
.setSoLinger(TimeValue.ofSeconds(-1))
|
||||
.setSoTimeout(Timeout.ofSeconds(10))
|
||||
.build()
|
||||
);
|
||||
|
||||
Thread.ofPlatform().daemon(true).start(() -> {
|
||||
try {
|
||||
for (;;) {
|
||||
TimeUnit.SECONDS.sleep(15);
|
||||
logger.info("Connection pool stats: {}", connectionManager.getTotalStats());
|
||||
}
|
||||
}
|
||||
catch (InterruptedException e) {
|
||||
Thread.currentThread().interrupt();
|
||||
}
|
||||
});
|
||||
|
||||
final RequestConfig defaultRequestConfig = RequestConfig.custom()
|
||||
.setCookieSpec(StandardCookieSpec.RELAXED)
|
||||
.setResponseTimeout(10, TimeUnit.SECONDS)
|
||||
.setConnectionRequestTimeout(5, TimeUnit.MINUTES)
|
||||
.build();
|
||||
|
||||
return HttpClients.custom()
|
||||
.setDefaultCookieStore(cookies)
|
||||
.setConnectionManager(connectionManager)
|
||||
.setRetryStrategy(this)
|
||||
.setKeepAliveStrategy(new ConnectionKeepAliveStrategy() {
|
||||
// Default keep-alive duration is 3 minutes, but this is too long for us,
|
||||
// as we are either going to re-use it fairly quickly or close it for a long time.
|
||||
//
|
||||
// So we set it to 30 seconds or clamp the server-provided value to a minimum of 10 seconds.
|
||||
private static final TimeValue defaultValue = TimeValue.ofSeconds(30);
|
||||
|
||||
@Override
|
||||
public TimeValue getKeepAliveDuration(HttpResponse response, HttpContext context) {
|
||||
final Iterator<HeaderElement> it = MessageSupport.iterate(response, HeaderElements.KEEP_ALIVE);
|
||||
|
||||
while (it.hasNext()) {
|
||||
final HeaderElement he = it.next();
|
||||
final String param = he.getName();
|
||||
final String value = he.getValue();
|
||||
|
||||
if (value == null)
|
||||
continue;
|
||||
if (!"timeout".equalsIgnoreCase(param))
|
||||
continue;
|
||||
|
||||
try {
|
||||
long timeout = Long.parseLong(value);
|
||||
timeout = Math.clamp(timeout, 30, defaultValue.toSeconds());
|
||||
return TimeValue.ofSeconds(timeout);
|
||||
} catch (final NumberFormatException ignore) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
return defaultValue;
|
||||
}
|
||||
})
|
||||
.disableRedirectHandling()
|
||||
.setDefaultRequestConfig(defaultRequestConfig)
|
||||
.build();
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<String> getCookies() {
|
||||
return cookies.getCookies();
|
||||
public CookieStore getCookies() {
|
||||
return cookies;
|
||||
}
|
||||
|
||||
@Override
|
||||
@@ -81,26 +177,32 @@ public class HttpFetcherImpl implements HttpFetcher {
|
||||
}
|
||||
|
||||
@Inject
|
||||
public HttpFetcherImpl(UserAgent userAgent,
|
||||
Dispatcher dispatcher,
|
||||
ConnectionPool connectionPool)
|
||||
public HttpFetcherImpl(UserAgent userAgent)
|
||||
{
|
||||
this.client = createClient(dispatcher, connectionPool);
|
||||
try {
|
||||
this.client = createClient();
|
||||
} catch (NoSuchAlgorithmException e) {
|
||||
throw new RuntimeException(e);
|
||||
}
|
||||
this.userAgentString = userAgent.uaString();
|
||||
this.userAgentIdentifier = userAgent.uaIdentifier();
|
||||
}
|
||||
|
||||
public HttpFetcherImpl(String userAgent) {
|
||||
this.client = createClient(null, new ConnectionPool());
|
||||
try {
|
||||
this.client = createClient();
|
||||
} catch (NoSuchAlgorithmException e) {
|
||||
throw new RuntimeException(e);
|
||||
}
|
||||
this.userAgentString = userAgent;
|
||||
this.userAgentIdentifier = userAgent;
|
||||
}
|
||||
|
||||
// Not necessary in prod, but useful in test
|
||||
public void close() {
|
||||
client.dispatcher().executorService().shutdown();
|
||||
client.connectionPool().evictAll();
|
||||
public void close() throws IOException {
|
||||
client.close();
|
||||
}
|
||||
|
||||
/**
|
||||
* Probe the domain to see if it is reachable, attempting to identify which schema to use,
|
||||
* and if there are any redirects. This is done by one or more HEAD requests.
|
||||
@@ -110,23 +212,94 @@ public class HttpFetcherImpl implements HttpFetcher {
|
||||
*/
|
||||
@Override
|
||||
public DomainProbeResult probeDomain(EdgeUrl url) {
|
||||
var head = new Request.Builder().head().addHeader("User-agent", userAgentString)
|
||||
.url(url.toString())
|
||||
.build();
|
||||
List<EdgeUrl> urls = new ArrayList<>();
|
||||
urls.add(url);
|
||||
|
||||
var call = client.newCall(head);
|
||||
int redirects = 0;
|
||||
AtomicBoolean tryGet = new AtomicBoolean(false);
|
||||
|
||||
try (var rsp = call.execute()) {
|
||||
EdgeUrl requestUrl = new EdgeUrl(rsp.request().url().toString());
|
||||
while (!urls.isEmpty() && ++redirects < 5) {
|
||||
ClassicHttpRequest request;
|
||||
|
||||
if (!Objects.equals(requestUrl.domain, url.domain)) {
|
||||
return new DomainProbeResult.Redirect(requestUrl.domain);
|
||||
EdgeUrl topUrl = urls.removeFirst();
|
||||
try {
|
||||
if (tryGet.get()) {
|
||||
request = ClassicRequestBuilder.get(topUrl.asURI())
|
||||
.addHeader("User-Agent", userAgentString)
|
||||
.addHeader("Accept-Encoding", "gzip")
|
||||
.addHeader("Range", "bytes=0-255")
|
||||
.build();
|
||||
} else {
|
||||
request = ClassicRequestBuilder.head(topUrl.asURI())
|
||||
.addHeader("User-Agent", userAgentString)
|
||||
.addHeader("Accept-Encoding", "gzip")
|
||||
.build();
|
||||
}
|
||||
} catch (URISyntaxException e) {
|
||||
return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Invalid URL");
|
||||
}
|
||||
return new DomainProbeResult.Ok(requestUrl);
|
||||
}
|
||||
catch (Exception ex) {
|
||||
return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, ex.getMessage());
|
||||
|
||||
try {
|
||||
var result = SendLock.wrapSend(client, request, response -> {
|
||||
EntityUtils.consume(response.getEntity());
|
||||
|
||||
return switch (response.getCode()) {
|
||||
case 200 -> new DomainProbeResult.Ok(url);
|
||||
case 405 -> {
|
||||
if (!tryGet.get()) {
|
||||
tryGet.set(true);
|
||||
yield new DomainProbeResult.RedirectSameDomain_Internal(url);
|
||||
}
|
||||
else {
|
||||
yield new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "HTTP status 405, tried HEAD and GET?!");
|
||||
}
|
||||
}
|
||||
case 301, 302, 307 -> {
|
||||
var location = response.getFirstHeader("Location");
|
||||
|
||||
if (location != null) {
|
||||
Optional<EdgeUrl> newUrl = linkParser.parseLink(topUrl, location.getValue());
|
||||
if (newUrl.isEmpty()) {
|
||||
yield new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Invalid location header on redirect");
|
||||
}
|
||||
EdgeUrl newEdgeUrl = newUrl.get();
|
||||
if (newEdgeUrl.domain.equals(topUrl.domain)) {
|
||||
yield new DomainProbeResult.RedirectSameDomain_Internal(newEdgeUrl);
|
||||
}
|
||||
else {
|
||||
yield new DomainProbeResult.Redirect(newEdgeUrl.domain);
|
||||
}
|
||||
}
|
||||
|
||||
yield new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "No location header on redirect");
|
||||
|
||||
}
|
||||
default ->
|
||||
new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "HTTP status " + response.getCode());
|
||||
};
|
||||
});
|
||||
|
||||
if (result instanceof DomainProbeResult.RedirectSameDomain_Internal(EdgeUrl redirUrl)) {
|
||||
urls.add(redirUrl);
|
||||
}
|
||||
else {
|
||||
return result;
|
||||
}
|
||||
|
||||
// We don't have robots.txt yet, so we'll assume a request delay of 1 second
|
||||
TimeUnit.SECONDS.sleep(1);
|
||||
}
|
||||
catch (SocketTimeoutException ex) {
|
||||
return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Timeout during domain probe");
|
||||
}
|
||||
catch (Exception ex) {
|
||||
return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Error during domain probe");
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Failed to resolve domain root");
|
||||
|
||||
}
|
||||
|
||||
/** Perform a HEAD request to fetch the content type of a URL.
|
||||
@@ -137,66 +310,73 @@ public class HttpFetcherImpl implements HttpFetcher {
|
||||
* recorded in the WARC file on failure.
|
||||
*/
|
||||
public ContentTypeProbeResult probeContentType(EdgeUrl url,
|
||||
WarcRecorder warcRecorder,
|
||||
ContentTags tags) throws RateLimitException {
|
||||
if (tags.isEmpty() && contentTypeLogic.isUrlLikeBinary(url)) {
|
||||
var headBuilder = new Request.Builder().head()
|
||||
.addHeader("User-agent", userAgentString)
|
||||
.addHeader("Accept-Encoding", "gzip")
|
||||
.url(url.toString());
|
||||
|
||||
var head = headBuilder.build();
|
||||
var call = client.newCall(head);
|
||||
|
||||
try (var rsp = call.execute()) {
|
||||
var contentTypeHeader = rsp.header("Content-type");
|
||||
|
||||
if (contentTypeHeader != null && !contentTypeLogic.isAllowableContentType(contentTypeHeader)) {
|
||||
warcRecorder.flagAsFailedContentTypeProbe(url, contentTypeHeader, rsp.code());
|
||||
|
||||
return new ContentTypeProbeResult.BadContentType(contentTypeHeader, rsp.code());
|
||||
}
|
||||
|
||||
// Update the URL to the final URL of the HEAD request, otherwise we might end up doing
|
||||
|
||||
// HEAD 301 url1 -> url2
|
||||
// HEAD 200 url2
|
||||
// GET 301 url1 -> url2
|
||||
// GET 200 url2
|
||||
|
||||
// which is not what we want. Overall we want to do as few requests as possible to not raise
|
||||
// too many eyebrows when looking at the logs on the target server. Overall it's probably desirable
|
||||
// that it looks like the traffic makes sense, as opposed to looking like a broken bot.
|
||||
|
||||
var redirectUrl = new EdgeUrl(rsp.request().url().toString());
|
||||
EdgeUrl ret;
|
||||
|
||||
if (Objects.equals(redirectUrl.domain, url.domain)) ret = redirectUrl;
|
||||
else ret = url;
|
||||
|
||||
// Intercept rate limiting
|
||||
if (rsp.code() == 429) {
|
||||
throw new HttpFetcherImpl.RateLimitException(Objects.requireNonNullElse(rsp.header("Retry-After"), "1"));
|
||||
}
|
||||
|
||||
return new ContentTypeProbeResult.Ok(ret);
|
||||
}
|
||||
catch (RateLimitException ex) {
|
||||
throw ex;
|
||||
}
|
||||
catch (InterruptedIOException ex) {
|
||||
warcRecorder.flagAsTimeout(url);
|
||||
|
||||
return new ContentTypeProbeResult.Timeout(ex);
|
||||
} catch (Exception ex) {
|
||||
logger.error("Error during fetching {}[{}]", ex.getClass().getSimpleName(), ex.getMessage());
|
||||
|
||||
warcRecorder.flagAsError(url, ex);
|
||||
|
||||
return new ContentTypeProbeResult.Exception(ex);
|
||||
}
|
||||
DomainCookies cookies,
|
||||
CrawlDelayTimer timer,
|
||||
ContentTags tags) {
|
||||
if (!tags.isEmpty() || !contentTypeLogic.isUrlLikeBinary(url)) {
|
||||
return new ContentTypeProbeResult.NoOp();
|
||||
}
|
||||
|
||||
try {
|
||||
ClassicHttpRequest head = ClassicRequestBuilder.head(url.asURI())
|
||||
.addHeader("User-Agent", userAgentString)
|
||||
.addHeader("Accept-Encoding", "gzip")
|
||||
.build();
|
||||
|
||||
cookies.paintRequest(head);
|
||||
|
||||
return SendLock.wrapSend(client, head, (rsp) -> {
|
||||
cookies.updateCookieStore(rsp);
|
||||
EntityUtils.consume(rsp.getEntity());
|
||||
int statusCode = rsp.getCode();
|
||||
|
||||
// Handle redirects
|
||||
if (statusCode == 301 || statusCode == 302 || statusCode == 307) {
|
||||
var location = rsp.getFirstHeader("Location");
|
||||
if (location != null) {
|
||||
Optional<EdgeUrl> newUrl = linkParser.parseLink(url, location.getValue());
|
||||
if (newUrl.isEmpty())
|
||||
return new ContentTypeProbeResult.HttpError(statusCode, "Invalid location header on redirect");
|
||||
return new ContentTypeProbeResult.Redirect(newUrl.get());
|
||||
}
|
||||
}
|
||||
|
||||
if (statusCode == 405) {
|
||||
// If we get a 405, we can't probe the content type with HEAD, so we'll just say it's ok
|
||||
return new ContentTypeProbeResult.Ok(url);
|
||||
}
|
||||
|
||||
// Handle errors
|
||||
if (statusCode < 200 || statusCode > 300) {
|
||||
return new ContentTypeProbeResult.HttpError(statusCode, "Bad status code");
|
||||
}
|
||||
|
||||
// Handle missing content type
|
||||
var ctHeader = rsp.getFirstHeader("Content-Type");
|
||||
if (ctHeader == null) {
|
||||
return new ContentTypeProbeResult.HttpError(statusCode, "Missing Content-Type header");
|
||||
}
|
||||
var contentType = ctHeader.getValue();
|
||||
|
||||
// Check if the content type is allowed
|
||||
if (contentTypeLogic.isAllowableContentType(contentType)) {
|
||||
return new ContentTypeProbeResult.Ok(url);
|
||||
} else {
|
||||
return new ContentTypeProbeResult.BadContentType(contentType, statusCode);
|
||||
}
|
||||
});
|
||||
}
|
||||
catch (SocketTimeoutException ex) {
|
||||
|
||||
return new ContentTypeProbeResult.Timeout(ex);
|
||||
}
|
||||
catch (Exception ex) {
|
||||
logger.error("Error during fetching {}[{}]", ex.getClass().getSimpleName(), ex.getMessage());
|
||||
return new ContentTypeProbeResult.Exception(ex);
|
||||
}
|
||||
finally {
|
||||
timer.waitFetchDelay();
|
||||
}
|
||||
return new ContentTypeProbeResult.Ok(url);
|
||||
}
|
||||
|
||||
/** Fetch the content of a URL, and record it in a WARC file,
|
||||
@@ -206,35 +386,87 @@ public class HttpFetcherImpl implements HttpFetcher {
|
||||
@Override
|
||||
public HttpFetchResult fetchContent(EdgeUrl url,
|
||||
WarcRecorder warcRecorder,
|
||||
DomainCookies cookies,
|
||||
CrawlDelayTimer timer,
|
||||
ContentTags contentTags,
|
||||
ProbeType probeType)
|
||||
throws Exception
|
||||
{
|
||||
var getBuilder = new Request.Builder().get();
|
||||
try {
|
||||
if (probeType == HttpFetcher.ProbeType.FULL) {
|
||||
try {
|
||||
var probeResult = probeContentType(url, cookies, timer, contentTags);
|
||||
|
||||
getBuilder.url(url.toString())
|
||||
.addHeader("Accept-Encoding", "gzip")
|
||||
.addHeader("Accept-Language", "en,*;q=0.5")
|
||||
.addHeader("Accept", "text/html, application/xhtml+xml, text/*;q=0.8")
|
||||
.addHeader("User-agent", userAgentString);
|
||||
switch (probeResult) {
|
||||
case HttpFetcher.ContentTypeProbeResult.NoOp():
|
||||
break; //
|
||||
case HttpFetcher.ContentTypeProbeResult.Ok(EdgeUrl resolvedUrl):
|
||||
logger.info(crawlerAuditMarker, "Probe result OK for {}", url);
|
||||
url = resolvedUrl; // If we were redirected while probing, use the final URL for fetching
|
||||
break;
|
||||
case ContentTypeProbeResult.BadContentType badContentType:
|
||||
warcRecorder.flagAsFailedContentTypeProbe(url, badContentType.contentType(), badContentType.statusCode());
|
||||
logger.info(crawlerAuditMarker, "Probe result Bad ContenType ({}) for {}", badContentType.contentType(), url);
|
||||
return new HttpFetchResult.ResultNone();
|
||||
case ContentTypeProbeResult.BadContentType.Timeout(Exception ex):
|
||||
logger.info(crawlerAuditMarker, "Probe result Timeout for {}", url);
|
||||
warcRecorder.flagAsTimeout(url);
|
||||
return new HttpFetchResult.ResultException(ex);
|
||||
case ContentTypeProbeResult.Exception(Exception ex):
|
||||
logger.info(crawlerAuditMarker, "Probe result Exception({}) for {}", ex.getClass().getSimpleName(), url);
|
||||
warcRecorder.flagAsError(url, ex);
|
||||
return new HttpFetchResult.ResultException(ex);
|
||||
case ContentTypeProbeResult.HttpError httpError:
|
||||
logger.info(crawlerAuditMarker, "Probe result HTTP Error ({}) for {}", httpError.statusCode(), url);
|
||||
return new HttpFetchResult.ResultException(new HttpException("HTTP status code " + httpError.statusCode() + ": " + httpError.message()));
|
||||
case ContentTypeProbeResult.Redirect redirect:
|
||||
logger.info(crawlerAuditMarker, "Probe result redirect for {} -> {}", url, redirect.location());
|
||||
return new HttpFetchResult.ResultRedirect(redirect.location());
|
||||
}
|
||||
} catch (Exception ex) {
|
||||
logger.warn("Failed to fetch {}", url, ex);
|
||||
return new HttpFetchResult.ResultException(ex);
|
||||
}
|
||||
|
||||
contentTags.paint(getBuilder);
|
||||
|
||||
HttpFetchResult result = warcRecorder.fetch(client, getBuilder.build());
|
||||
|
||||
if (result instanceof HttpFetchResult.ResultOk ok) {
|
||||
if (ok.statusCode() == 429) {
|
||||
throw new RateLimitException(Objects.requireNonNullElse(ok.header("Retry-After"), "1"));
|
||||
}
|
||||
if (ok.statusCode() == 304) {
|
||||
return new HttpFetchResult.Result304Raw();
|
||||
}
|
||||
if (ok.statusCode() == 200) {
|
||||
return ok;
|
||||
|
||||
HttpGet request = new HttpGet(url.asURI());
|
||||
request.addHeader("User-Agent", userAgentString);
|
||||
request.addHeader("Accept-Encoding", "gzip");
|
||||
request.addHeader("Accept-Language", "en,*;q=0.5");
|
||||
request.addHeader("Accept", "text/html, application/xhtml+xml, text/*;q=0.8");
|
||||
|
||||
contentTags.paint(request);
|
||||
|
||||
try (var sl = new SendLock()) {
|
||||
Instant start = Instant.now();
|
||||
HttpFetchResult result = warcRecorder.fetch(client, cookies, request);
|
||||
|
||||
Duration fetchDuration = Duration.between(start, Instant.now());
|
||||
|
||||
if (result instanceof HttpFetchResult.ResultOk ok) {
|
||||
if (ok.statusCode() == 304) {
|
||||
result = new HttpFetchResult.Result304Raw();
|
||||
}
|
||||
}
|
||||
|
||||
switch (result) {
|
||||
case HttpFetchResult.ResultOk ok -> logger.info(crawlerAuditMarker, "Fetch result OK {} for {} ({} ms)", ok.statusCode(), url, fetchDuration.toMillis());
|
||||
case HttpFetchResult.ResultRedirect redirect -> logger.info(crawlerAuditMarker, "Fetch result redirect: {} for {}", redirect.url(), url);
|
||||
case HttpFetchResult.ResultNone none -> logger.info(crawlerAuditMarker, "Fetch result none for {}", url);
|
||||
case HttpFetchResult.ResultException ex -> logger.error(crawlerAuditMarker, "Fetch result exception for {}", url, ex.ex());
|
||||
case HttpFetchResult.Result304Raw raw -> logger.info(crawlerAuditMarker, "Fetch result: 304 Raw for {}", url);
|
||||
case HttpFetchResult.Result304ReplacedWithReference ref -> logger.info(crawlerAuditMarker, "Fetch result: 304 With reference for {}", url);
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
}
|
||||
catch (Exception ex) {
|
||||
logger.error(crawlerAuditMarker, "Fetch result exception for {}", url, ex);
|
||||
|
||||
return new HttpFetchResult.ResultException(ex);
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
@Override
|
||||
@@ -242,6 +474,131 @@ public class HttpFetcherImpl implements HttpFetcher {
|
||||
return new SitemapRetriever();
|
||||
}
|
||||
|
||||
/** Recursively fetch sitemaps */
|
||||
@Override
|
||||
public List<EdgeUrl> fetchSitemapUrls(String root, CrawlDelayTimer delayTimer) {
|
||||
try {
|
||||
List<EdgeUrl> ret = new ArrayList<>();
|
||||
|
||||
Set<String> seenUrls = new HashSet<>();
|
||||
Set<String> seenSitemaps = new HashSet<>();
|
||||
|
||||
Deque<EdgeUrl> sitemapQueue = new LinkedList<>();
|
||||
|
||||
EdgeUrl rootSitemapUrl = new EdgeUrl(root);
|
||||
|
||||
sitemapQueue.add(rootSitemapUrl);
|
||||
|
||||
int fetchedSitemaps = 0;
|
||||
|
||||
while (!sitemapQueue.isEmpty() && ret.size() < 20_000 && ++fetchedSitemaps < 10) {
|
||||
var head = sitemapQueue.removeFirst();
|
||||
|
||||
switch (fetchSingleSitemap(head)) {
|
||||
case SitemapResult.SitemapUrls(List<String> urls) -> {
|
||||
|
||||
for (var url : urls) {
|
||||
if (seenUrls.add(url)) {
|
||||
EdgeUrl.parse(url)
|
||||
.filter(u -> u.domain.equals(rootSitemapUrl.domain))
|
||||
.ifPresent(ret::add);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
case SitemapResult.SitemapReferences(List<String> refs) -> {
|
||||
for (var ref : refs) {
|
||||
if (seenSitemaps.add(ref)) {
|
||||
EdgeUrl.parse(ref)
|
||||
.filter(url -> url.domain.equals(rootSitemapUrl.domain))
|
||||
.ifPresent(sitemapQueue::addFirst);
|
||||
}
|
||||
}
|
||||
}
|
||||
case SitemapResult.SitemapError() -> {}
|
||||
}
|
||||
|
||||
delayTimer.waitFetchDelay();
|
||||
}
|
||||
|
||||
return ret;
|
||||
}
|
||||
catch (Exception ex) {
|
||||
logger.error("Error while fetching sitemaps via {}: {} ({})", root, ex.getClass().getSimpleName(), ex.getMessage());
|
||||
return List.of();
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
private SitemapResult fetchSingleSitemap(EdgeUrl sitemapUrl) throws URISyntaxException {
|
||||
HttpGet getRequest = new HttpGet(sitemapUrl.asURI());
|
||||
|
||||
getRequest.addHeader("User-Agent", userAgentString);
|
||||
getRequest.addHeader("Accept-Encoding", "gzip");
|
||||
getRequest.addHeader("Accept", "text/*, */*;q=0.9");
|
||||
getRequest.addHeader("User-Agent", userAgentString);
|
||||
|
||||
try (var sl = new SendLock()) {
|
||||
return client.execute(getRequest, response -> {
|
||||
try {
|
||||
if (response.getCode() != 200) {
|
||||
return new SitemapResult.SitemapError();
|
||||
}
|
||||
|
||||
Document parsedSitemap = Jsoup.parse(
|
||||
EntityUtils.toString(response.getEntity()),
|
||||
sitemapUrl.toString(),
|
||||
Parser.xmlParser()
|
||||
);
|
||||
|
||||
if (parsedSitemap.childrenSize() == 0) {
|
||||
return new SitemapResult.SitemapError();
|
||||
}
|
||||
|
||||
String rootTagName = parsedSitemap.child(0).tagName();
|
||||
|
||||
return switch (rootTagName.toLowerCase()) {
|
||||
case "sitemapindex" -> {
|
||||
List<String> references = new ArrayList<>();
|
||||
for (var locTag : parsedSitemap.getElementsByTag("loc")) {
|
||||
references.add(locTag.text().trim());
|
||||
}
|
||||
yield new SitemapResult.SitemapReferences(Collections.unmodifiableList(references));
|
||||
}
|
||||
case "urlset" -> {
|
||||
List<String> urls = new ArrayList<>();
|
||||
for (var locTag : parsedSitemap.select("url > loc")) {
|
||||
urls.add(locTag.text().trim());
|
||||
}
|
||||
yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
|
||||
}
|
||||
case "rss", "atom" -> {
|
||||
List<String> urls = new ArrayList<>();
|
||||
for (var locTag : parsedSitemap.select("link, url")) {
|
||||
urls.add(locTag.text().trim());
|
||||
}
|
||||
yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
|
||||
}
|
||||
default -> new SitemapResult.SitemapError();
|
||||
};
|
||||
}
|
||||
finally {
|
||||
EntityUtils.consume(response.getEntity());
|
||||
}
|
||||
});
|
||||
}
|
||||
catch (Exception ex) {
|
||||
logger.warn("Error while fetching sitemap {}: {} ({})", sitemapUrl, ex.getClass().getSimpleName(), ex.getMessage());
|
||||
return new SitemapResult.SitemapError();
|
||||
}
|
||||
}
|
||||
|
||||
private sealed interface SitemapResult {
|
||||
record SitemapUrls(List<String> urls) implements SitemapResult {}
|
||||
record SitemapReferences(List<String> sitemapRefs) implements SitemapResult {}
|
||||
record SitemapError() implements SitemapResult {}
|
||||
}
|
||||
|
||||
@Override
|
||||
public SimpleRobotRules fetchRobotRules(EdgeDomain domain, WarcRecorder recorder) {
|
||||
var ret = fetchAndParseRobotsTxt(new EdgeUrl("https", domain, null, "/robots.txt", null), recorder);
|
||||
@@ -256,15 +613,14 @@ public class HttpFetcherImpl implements HttpFetcher {
|
||||
}
|
||||
|
||||
private Optional<SimpleRobotRules> fetchAndParseRobotsTxt(EdgeUrl url, WarcRecorder recorder) {
|
||||
try {
|
||||
var getBuilder = new Request.Builder().get();
|
||||
try (var sl = new SendLock()) {
|
||||
|
||||
getBuilder.url(url.toString())
|
||||
.addHeader("Accept-Encoding", "gzip")
|
||||
.addHeader("Accept", "text/*, */*;q=0.9")
|
||||
.addHeader("User-agent", userAgentString);
|
||||
HttpGet request = new HttpGet(url.asURI());
|
||||
request.addHeader("User-Agent", userAgentString);
|
||||
request.addHeader("Accept-Encoding", "gzip");
|
||||
request.addHeader("Accept", "text/*, */*;q=0.9");
|
||||
|
||||
HttpFetchResult result = recorder.fetch(client, getBuilder.build());
|
||||
HttpFetchResult result = recorder.fetch(client, new DomainCookies(), request);
|
||||
|
||||
return DocumentBodyExtractor.asBytes(result).mapOpt((contentType, body) ->
|
||||
robotsParser.parseContent(url.toString(),
|
||||
@@ -278,6 +634,57 @@ public class HttpFetcherImpl implements HttpFetcher {
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public boolean retryRequest(HttpRequest request, IOException exception, int executionCount, HttpContext context) {
|
||||
return switch (exception) {
|
||||
case SocketTimeoutException ste -> false;
|
||||
case SSLException ssle -> false;
|
||||
case UnknownHostException uhe -> false;
|
||||
default -> executionCount <= 3;
|
||||
};
|
||||
}
|
||||
|
||||
@Override
|
||||
public boolean retryRequest(HttpResponse response, int executionCount, HttpContext context) {
|
||||
return switch (response.getCode()) {
|
||||
case 500, 503 -> executionCount <= 2;
|
||||
case 429 -> executionCount <= 3;
|
||||
default -> false;
|
||||
};
|
||||
}
|
||||
|
||||
@Override
|
||||
public TimeValue getRetryInterval(HttpRequest request, IOException exception, int executionCount, HttpContext context) {
|
||||
return TimeValue.ofSeconds(1);
|
||||
}
|
||||
|
||||
@Override
|
||||
public TimeValue getRetryInterval(HttpResponse response, int executionCount, HttpContext context) {
|
||||
|
||||
int statusCode = response.getCode();
|
||||
|
||||
// Give 503 a bit more time
|
||||
if (statusCode == 503) return TimeValue.ofSeconds(5);
|
||||
|
||||
if (statusCode == 429) {
|
||||
// get the Retry-After header
|
||||
String retryAfter = response.getFirstHeader("Retry-After").getValue();
|
||||
if (retryAfter == null) {
|
||||
return TimeValue.ofSeconds(2);
|
||||
}
|
||||
|
||||
try {
|
||||
int retryAfterTime = Integer.parseInt(retryAfter);
|
||||
retryAfterTime = Math.clamp(retryAfterTime, 1, 5);
|
||||
|
||||
return TimeValue.ofSeconds(retryAfterTime);
|
||||
} catch (NumberFormatException e) {
|
||||
logger.warn("Invalid Retry-After header: {}", retryAfter);
|
||||
}
|
||||
}
|
||||
|
||||
return TimeValue.ofSeconds(2);
|
||||
}
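
// Illustrative sketch (agent string and header values invented, fetcher.close() omitted):
// exercising the retry strategy directly with a synthetic 429 response.
private static void retryStrategySketch() {
    var fetcher = new HttpFetcherImpl("test-agent");
    var tooManyRequests = new org.apache.hc.core5.http.message.BasicHttpResponse(429);
    tooManyRequests.addHeader("Retry-After", "3");

    // 429 responses are retried for the first three attempts...
    boolean retry = fetcher.retryRequest(tooManyRequests, 1, null);
    // ...and the Retry-After value is honoured, clamped to the 1..5 second range.
    TimeValue interval = fetcher.getRetryInterval(tooManyRequests, 1, null);
}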
|
||||
|
||||
public static class RateLimitException extends Exception {
|
||||
private final String retryAfter;
|
||||
@@ -298,5 +705,31 @@ public class HttpFetcherImpl implements HttpFetcher {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
class SendLock implements AutoCloseable {

private static final Semaphore maxConcurrentRequests = new Semaphore(Integer.getInteger("crawler.maxConcurrentRequests", 512));
boolean closed = false;

public SendLock() {
maxConcurrentRequests.acquireUninterruptibly();
}

public static <T> T wrapSend(HttpClient client, final ClassicHttpRequest request,
final HttpClientResponseHandler<? extends T> responseHandler) throws IOException {
try (var lock = new SendLock()) {
return client.execute(request, responseHandler);
}
}

@Override
public void close() {
if (!closed) {
maxConcurrentRequests.release();
closed = true;
}
}
}
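
// Illustrative sketch (class name and URL invented; imports omitted): every outbound
// request shares the SendLock semaphore, sized via the crawler.maxConcurrentRequests
// system property (default 512); wrapSend() acquires and releases one permit around a
// single execute() call.
class SendLockUsageSketch {
    static String fetchRobotsTxt(HttpClient client) throws IOException {
        return SendLock.wrapSend(client,
                new HttpGet("https://www.marginalia.nu/robots.txt"),
                rsp -> EntityUtils.toString(rsp.getEntity()));
    }
}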
|
||||
|
||||
|
@@ -1,31 +0,0 @@
|
||||
package nu.marginalia.crawl.fetcher.socket;
|
||||
|
||||
import okhttp3.Interceptor;
|
||||
import okhttp3.Response;
|
||||
import org.jetbrains.annotations.NotNull;
|
||||
|
||||
import java.io.IOException;
|
||||
|
||||
|
||||
/** An interceptor that intercepts network requests and adds the remote IP address as
|
||||
* a header in the response. This is used to pass the remote IP address to the Warc
|
||||
* writer, as this information is not available in the response.
|
||||
*/
|
||||
public class IpInterceptingNetworkInterceptor implements Interceptor {
|
||||
private static final String pseudoHeaderName = "X-Marginalia-Remote-IP";
|
||||
|
||||
@NotNull
|
||||
@Override
|
||||
public Response intercept(@NotNull Interceptor.Chain chain) throws IOException {
|
||||
String IP = chain.connection().socket().getInetAddress().getHostAddress();
|
||||
|
||||
return chain.proceed(chain.request())
|
||||
.newBuilder()
|
||||
.addHeader(pseudoHeaderName, IP)
|
||||
.build();
|
||||
}
|
||||
|
||||
public static String getIpFromResponse(Response response) {
|
||||
return response.header(pseudoHeaderName);
|
||||
}
|
||||
}
|
@@ -27,7 +27,7 @@ public class NoSecuritySSL {
|
||||
}
|
||||
};
|
||||
|
||||
public static SSLSocketFactory buildSocketFactory() {
|
||||
public static SSLContext buildSslContext() {
|
||||
try {
|
||||
// Install the all-trusting trust manager
|
||||
final SSLContext sslContext = SSLContext.getInstance("TLS");
|
||||
@@ -40,14 +40,11 @@ public class NoSecuritySSL {
|
||||
clientSessionContext.setSessionCacheSize(2048);
|
||||
|
||||
// Create a ssl socket factory with our all-trusting manager
|
||||
return sslContext.getSocketFactory();
|
||||
return sslContext;
|
||||
}
|
||||
catch (Exception e) {
|
||||
throw new RuntimeException(e);
|
||||
}
|
||||
}
|
||||
|
||||
public static HostnameVerifier buildHostnameVerifyer() {
|
||||
return (hn, session) -> true;
|
||||
}
|
||||
}
|
||||
|
@@ -1,15 +1,20 @@
|
||||
package nu.marginalia.crawl.fetcher.warc;
|
||||
|
||||
import okhttp3.Headers;
|
||||
import okhttp3.Response;
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.commons.io.input.BOMInputStream;
|
||||
import org.apache.hc.client5.http.classic.methods.HttpGet;
|
||||
import org.apache.hc.core5.http.ClassicHttpResponse;
|
||||
import org.apache.hc.core5.http.Header;
|
||||
import org.netpreserve.jwarc.WarcTruncationReason;
|
||||
|
||||
import java.io.*;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Path;
|
||||
import java.util.Objects;
|
||||
import java.util.zip.GZIPInputStream;
|
||||
import java.time.Duration;
|
||||
import java.time.Instant;
|
||||
import java.util.Arrays;
|
||||
|
||||
import static nu.marginalia.crawl.fetcher.warc.ErrorBuffer.suppressContentEncoding;
|
||||
|
||||
/** Input buffer for temporary storage of a HTTP response
|
||||
* This may be in-memory or on-disk, at the discretion of
|
||||
@@ -17,8 +22,9 @@ import java.util.zip.GZIPInputStream;
|
||||
* */
|
||||
public abstract class WarcInputBuffer implements AutoCloseable {
|
||||
protected WarcTruncationReason truncationReason = WarcTruncationReason.NOT_TRUNCATED;
|
||||
protected Headers headers;
|
||||
WarcInputBuffer(Headers headers) {
|
||||
protected Header[] headers;
|
||||
|
||||
WarcInputBuffer(Header[] headers) {
|
||||
this.headers = headers;
|
||||
}
|
||||
|
||||
@@ -30,7 +36,7 @@ public abstract class WarcInputBuffer implements AutoCloseable {
|
||||
|
||||
public final WarcTruncationReason truncationReason() { return truncationReason; }
|
||||
|
||||
public final Headers headers() { return headers; }
|
||||
public final Header[] headers() { return headers; }
|
||||
|
||||
/** Create a buffer for a response.
|
||||
* If the response is small and not compressed, it will be stored in memory.
|
||||
@@ -38,33 +44,70 @@ public abstract class WarcInputBuffer implements AutoCloseable {
|
||||
* and suppressed from the headers.
|
||||
* If an error occurs, a buffer will be created with no content and an error status.
|
||||
*/
|
||||
static WarcInputBuffer forResponse(Response rsp) {
|
||||
if (rsp == null)
|
||||
static WarcInputBuffer forResponse(ClassicHttpResponse response,
|
||||
HttpGet request,
|
||||
Duration timeLimit) throws IOException {
|
||||
if (response == null)
|
||||
return new ErrorBuffer();
|
||||
|
||||
try {
|
||||
String contentLengthHeader = Objects.requireNonNullElse(rsp.header("Content-Length"), "-1");
|
||||
int contentLength = Integer.parseInt(contentLengthHeader);
|
||||
String contentEncoding = rsp.header("Content-Encoding");
|
||||
|
||||
if (contentEncoding == null && contentLength > 0 && contentLength < 8192) {
|
||||
var entity = response.getEntity();
|
||||
|
||||
if (null == entity) {
|
||||
return new ErrorBuffer();
|
||||
}
|
||||
|
||||
Instant start = Instant.now();
|
||||
InputStream is = null;
|
||||
try {
|
||||
is = entity.getContent();
|
||||
long length = entity.getContentLength();
|
||||
|
||||
if (length > 0 && length < 8192) {
|
||||
// If the content is small and not compressed, we can just read it into memory
|
||||
return new MemoryBuffer(rsp, contentLength);
|
||||
}
|
||||
else {
|
||||
return new MemoryBuffer(response.getHeaders(), request, timeLimit, is, (int) length);
|
||||
} else {
|
||||
// Otherwise, we unpack it into a file and read it from there
|
||||
return new FileBuffer(rsp);
|
||||
return new FileBuffer(response.getHeaders(), request, timeLimit, is);
|
||||
}
|
||||
}
|
||||
catch (Exception ex) {
|
||||
return new ErrorBuffer(rsp);
|
||||
finally {
|
||||
// We're required to consume the stream to avoid leaking connections,
|
||||
// but we also don't want to get stuck on slow or malicious connections
|
||||
// forever, so we set a time limit on this phase and call abort() if it's exceeded.
|
||||
try {
|
||||
while (is != null) {
|
||||
// Consume some data
|
||||
if (is.skip(65536) == 0) {
|
||||
// Note that skip may return 0 if the stream is empty
|
||||
// or for other unspecified reasons, so we need to check
|
||||
// with read() as well to determine if the stream is done
|
||||
if (is.read() == -1)
|
||||
is = null;
|
||||
}
|
||||
// Check if the time limit has been exceeded
|
||||
else if (Duration.between(start, Instant.now()).compareTo(timeLimit) > 0) {
|
||||
request.abort();
|
||||
is = null;
|
||||
}
|
||||
}
|
||||
}
|
||||
catch (IOException e) {
|
||||
// Ignore the exception
|
||||
}
|
||||
finally {
|
||||
// Close the input stream
|
||||
IOUtils.closeQuietly(is);
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
|
||||
/** Copy an input stream to an output stream, with a maximum size and time limit */
|
||||
protected void copy(InputStream is, OutputStream os) {
|
||||
long startTime = System.currentTimeMillis();
|
||||
protected void copy(InputStream is, HttpGet request, OutputStream os, Duration timeLimit) {
|
||||
Instant start = Instant.now();
|
||||
Instant timeout = start.plus(timeLimit);
|
||||
long size = 0;
|
||||
|
||||
byte[] buffer = new byte[8192];
|
||||
@@ -74,24 +117,105 @@ public abstract class WarcInputBuffer implements AutoCloseable {
|
||||
|
||||
while (true) {
|
||||
try {
|
||||
Duration remaining = Duration.between(Instant.now(), timeout);
|
||||
if (remaining.isNegative()) {
|
||||
truncationReason = WarcTruncationReason.TIME;
|
||||
// Abort the request if the time limit is exceeded
|
||||
// so we don't keep the connection open forever or get forced to consume
// the stream to the end
|
||||
|
||||
request.abort();
|
||||
break;
|
||||
}
|
||||
|
||||
int n = is.read(buffer);
|
||||
|
||||
if (n < 0) break;
|
||||
size += n;
|
||||
os.write(buffer, 0, n);
|
||||
|
||||
if (size > WarcRecorder.MAX_SIZE) {
|
||||
// Even if we've exceeded the max length,
|
||||
// we keep consuming the stream up until the end or a timeout,
|
||||
// as closing the stream means resetting the connection, and
|
||||
// that's generally not desirable.
|
||||
|
||||
if (size < WarcRecorder.MAX_SIZE) {
|
||||
os.write(buffer, 0, n);
|
||||
}
|
||||
else if (truncationReason != WarcTruncationReason.LENGTH) {
|
||||
truncationReason = WarcTruncationReason.LENGTH;
|
||||
break;
|
||||
}
|
||||
|
||||
if (System.currentTimeMillis() - startTime > WarcRecorder.MAX_TIME) {
|
||||
truncationReason = WarcTruncationReason.TIME;
|
||||
break;
|
||||
}
|
||||
} catch (IOException e) {
|
||||
throw new RuntimeException(e);
|
||||
truncationReason = WarcTruncationReason.UNSPECIFIED;
|
||||
}
|
||||
}
|
||||
|
||||
}
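The old and new versions of copy() are interleaved above, which makes the control flow hard to follow. Below is a standalone sketch of one plausible reading of the new time- and size-limited loop; the class name, the maxSize parameter, and the choice to stop (rather than keep draining) once the size cap is hit are assumptions for illustration, not taken from the repository.

    import org.apache.hc.client5.http.classic.methods.HttpGet;
    import org.netpreserve.jwarc.WarcTruncationReason;

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.time.Duration;
    import java.time.Instant;

    class CopyLoopSketch {
        static WarcTruncationReason copyWithLimits(InputStream is, HttpGet request,
                                                   OutputStream os, Duration timeLimit,
                                                   long maxSize) {
            Instant timeout = Instant.now().plus(timeLimit);
            WarcTruncationReason truncationReason = WarcTruncationReason.NOT_TRUNCATED;
            byte[] buffer = new byte[8192];
            long size = 0;

            while (true) {
                try {
                    if (Instant.now().isAfter(timeout)) {
                        // Abort instead of draining a slow or malicious server forever
                        truncationReason = WarcTruncationReason.TIME;
                        request.abort();
                        break;
                    }

                    int n = is.read(buffer);
                    if (n < 0)
                        break;
                    size += n;

                    if (size <= maxSize) {
                        os.write(buffer, 0, n);
                    }
                    else {
                        // The stored record is cut short once the cap is reached
                        truncationReason = WarcTruncationReason.LENGTH;
                        break;
                    }
                } catch (IOException e) {
                    truncationReason = WarcTruncationReason.UNSPECIFIED;
                    break;
                }
            }
            return truncationReason;
        }
    }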
|
||||
|
||||
/** Takes a Content-Range header and checks if it is complete.
* A complete range is one that covers the entire resource.
* For example, "bytes 0-2047/2048" is a complete range,
* while "bytes 0-1023/2048" and "bytes 0-1023/*" are not.
*/
|
||||
public boolean isRangeComplete(Header[] headers) {
|
||||
// Find the Content-Range header
|
||||
String contentRangeHeader = null;
|
||||
for (var header : headers) {
|
||||
if ("Content-Range".equalsIgnoreCase(header.getName())) {
|
||||
contentRangeHeader = header.getValue();
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// Return true if header is null or empty
|
||||
if (contentRangeHeader == null || contentRangeHeader.isEmpty()) {
|
||||
return true;
|
||||
}
|
||||
|
||||
try {
|
||||
// Content-Range format: "bytes range-start-range-end/size"
|
||||
// e.g., "bytes 0-1023/2048" or "bytes 0-1023/*"
|
||||
|
||||
// Get the part after "bytes "
|
||||
String[] parts = contentRangeHeader.split(" ", 2);
|
||||
if (parts.length < 2) {
|
||||
return false;
|
||||
}
|
||||
|
||||
// Get the range and size parts (e.g., "0-1023/2048")
|
||||
String rangeAndSize = parts[1];
|
||||
String[] rangeAndSizeParts = rangeAndSize.split("/", 2);
|
||||
if (rangeAndSizeParts.length < 2) {
|
||||
return false;
|
||||
}
|
||||
|
||||
// Get the range (e.g., "0-1023")
|
||||
String range = rangeAndSizeParts[0];
|
||||
String[] rangeParts = range.split("-", 2);
|
||||
if (rangeParts.length < 2) {
|
||||
return false;
|
||||
}
|
||||
|
||||
// Get the size (e.g., "2048" or "*")
|
||||
String size = rangeAndSizeParts[1];
|
||||
|
||||
// If size is "*", we don't know the total size, so return false
|
||||
if ("*".equals(size)) {
|
||||
return false;
|
||||
}
|
||||
|
||||
// Parse as long to handle large files
|
||||
long rangeStart = Long.parseLong(rangeParts[0]);
|
||||
long rangeEnd = Long.parseLong(rangeParts[1]);
|
||||
long totalSize = Long.parseLong(size);
|
||||
|
||||
// Check if the range covers the entire resource
|
||||
return rangeStart == 0 && rangeEnd == totalSize - 1;
|
||||
|
||||
} catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
}
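To make the rule above concrete, these are the outcomes the parser produces for a few representative header values (derived by reading the code above, not from the project's test suite):

    // Illustrative expectations for isRangeComplete():
    //   (no Content-Range header)   -> true   (nothing was truncated by the server)
    //   "bytes 0-2047/2048"         -> true   (the range covers the whole resource)
    //   "bytes 0-1023/2048"         -> false  (only the first half was returned)
    //   "bytes 0-1023/*"            -> false  (total size unknown)
    //   "malformed"                 -> false  (unparseable ranges are treated as incomplete)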
|
||||
@@ -99,12 +223,8 @@ public abstract class WarcInputBuffer implements AutoCloseable {
|
||||
/** Pseudo-buffer for when we have an error */
|
||||
class ErrorBuffer extends WarcInputBuffer {
|
||||
public ErrorBuffer() {
|
||||
super(Headers.of());
|
||||
truncationReason = WarcTruncationReason.UNSPECIFIED;
|
||||
}
|
||||
super(new Header[0]);
|
||||
|
||||
public ErrorBuffer(Response rsp) {
|
||||
super(rsp.headers());
|
||||
truncationReason = WarcTruncationReason.UNSPECIFIED;
|
||||
}
|
||||
|
||||
@@ -120,17 +240,29 @@ class ErrorBuffer extends WarcInputBuffer {
|
||||
|
||||
@Override
|
||||
public void close() throws Exception {}
|
||||
|
||||
|
||||
static Header[] suppressContentEncoding(Header[] headers) {
|
||||
return Arrays.stream(headers).filter(header -> !"Content-Encoding".equalsIgnoreCase(header.getName())).toArray(Header[]::new);
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
/** Buffer for when we have the response in memory */
|
||||
class MemoryBuffer extends WarcInputBuffer {
|
||||
byte[] data;
|
||||
public MemoryBuffer(Response response, int size) {
|
||||
super(response.headers());
|
||||
public MemoryBuffer(Header[] headers, HttpGet request, Duration timeLimit, InputStream responseStream, int size) {
|
||||
super(suppressContentEncoding(headers));
|
||||
|
||||
if (!isRangeComplete(headers)) {
|
||||
truncationReason = WarcTruncationReason.LENGTH;
|
||||
} else {
|
||||
truncationReason = WarcTruncationReason.NOT_TRUNCATED;
|
||||
}
|
||||
|
||||
var outputStream = new ByteArrayOutputStream(size);
|
||||
|
||||
copy(response.body().byteStream(), outputStream);
|
||||
copy(responseStream, request, outputStream, timeLimit);
|
||||
|
||||
data = outputStream.toByteArray();
|
||||
}
|
||||
@@ -154,53 +286,25 @@ class MemoryBuffer extends WarcInputBuffer {
|
||||
class FileBuffer extends WarcInputBuffer {
|
||||
private final Path tempFile;
|
||||
|
||||
public FileBuffer(Response response) throws IOException {
|
||||
super(suppressContentEncoding(response.headers()));
|
||||
public FileBuffer(Header[] headers, HttpGet request, Duration timeLimit, InputStream responseStream) throws IOException {
|
||||
super(suppressContentEncoding(headers));
|
||||
|
||||
if (!isRangeComplete(headers)) {
|
||||
truncationReason = WarcTruncationReason.LENGTH;
|
||||
} else {
|
||||
truncationReason = WarcTruncationReason.NOT_TRUNCATED;
|
||||
}
|
||||
|
||||
this.tempFile = Files.createTempFile("rsp", ".html");
|
||||
|
||||
if (response.body() == null) {
|
||||
truncationReason = WarcTruncationReason.DISCONNECT;
|
||||
return;
|
||||
try (var out = Files.newOutputStream(tempFile)) {
|
||||
copy(responseStream, request, out, timeLimit);
|
||||
}
|
||||
|
||||
if ("gzip".equals(response.header("Content-Encoding"))) {
|
||||
try (var out = Files.newOutputStream(tempFile)) {
|
||||
copy(new GZIPInputStream(response.body().byteStream()), out);
|
||||
}
|
||||
catch (Exception ex) {
|
||||
truncationReason = WarcTruncationReason.UNSPECIFIED;
|
||||
}
|
||||
}
|
||||
else {
|
||||
try (var out = Files.newOutputStream(tempFile)) {
|
||||
copy(response.body().byteStream(), out);
|
||||
}
|
||||
catch (Exception ex) {
|
||||
truncationReason = WarcTruncationReason.UNSPECIFIED;
|
||||
}
|
||||
catch (Exception ex) {
|
||||
truncationReason = WarcTruncationReason.UNSPECIFIED;
|
||||
}
|
||||
}
|
||||
|
||||
private static Headers suppressContentEncoding(Headers headers) {
|
||||
var builder = new Headers.Builder();
|
||||
|
||||
headers.toMultimap().forEach((k, values) -> {
|
||||
if ("Content-Encoding".equalsIgnoreCase(k)) {
|
||||
return;
|
||||
}
|
||||
if ("Transfer-Encoding".equalsIgnoreCase(k)) {
|
||||
return;
|
||||
}
|
||||
for (var value : values) {
|
||||
builder.add(k, value);
|
||||
}
|
||||
});
|
||||
|
||||
return builder.build();
|
||||
}
|
||||
|
||||
|
||||
public InputStream read() throws IOException {
|
||||
return Files.newInputStream(tempFile);
|
||||
}
|
||||
|
@@ -1,11 +1,14 @@
|
||||
package nu.marginalia.crawl.fetcher.warc;
|
||||
|
||||
import okhttp3.Protocol;
|
||||
import okhttp3.Response;
|
||||
import org.apache.commons.lang3.StringUtils;
|
||||
import org.apache.hc.core5.http.ClassicHttpResponse;
|
||||
import org.apache.hc.core5.http.Header;
|
||||
|
||||
import java.net.URI;
|
||||
import java.net.URLEncoder;
|
||||
import java.net.http.HttpClient;
|
||||
import java.net.http.HttpHeaders;
|
||||
import java.net.http.HttpResponse;
|
||||
import java.nio.charset.StandardCharsets;
|
||||
import java.util.*;
|
||||
import java.util.stream.Collectors;
|
||||
@@ -16,7 +19,7 @@ import java.util.stream.Collectors;
|
||||
public class WarcProtocolReconstructor {
|
||||
|
||||
static String getHttpRequestString(String method,
|
||||
Map<String, List<String>> mainHeaders,
|
||||
Header[] mainHeaders,
|
||||
Map<String, List<String>> extraHeaders,
|
||||
URI uri) {
|
||||
StringBuilder requestStringBuilder = new StringBuilder();
|
||||
@@ -33,12 +36,13 @@ public class WarcProtocolReconstructor {
|
||||
|
||||
Set<String> addedHeaders = new HashSet<>();
|
||||
|
||||
mainHeaders.forEach((k, values) -> {
|
||||
for (var value : values) {
|
||||
addedHeaders.add(k);
|
||||
requestStringBuilder.append(capitalizeHeader(k)).append(": ").append(value).append("\r\n");
|
||||
}
|
||||
});
|
||||
for (var header : mainHeaders) {
|
||||
String k = header.getName();
|
||||
String v = header.getValue();
|
||||
|
||||
addedHeaders.add(k);
|
||||
requestStringBuilder.append(capitalizeHeader(k)).append(": ").append(v).append("\r\n");
|
||||
}
|
||||
|
||||
extraHeaders.forEach((k, values) -> {
|
||||
if (!addedHeaders.contains(k)) {
|
||||
@@ -75,17 +79,23 @@ public class WarcProtocolReconstructor {
|
||||
return "HTTP/" + version + " " + statusCode + " " + statusMessage + "\r\n" + headerString + "\r\n\r\n";
|
||||
}
|
||||
|
||||
static String getResponseHeader(Response response, long size) {
|
||||
String version = response.protocol() == Protocol.HTTP_1_1 ? "1.1" : "2.0";
|
||||
static String getResponseHeader(HttpResponse<?> response, long size) {
|
||||
String version = response.version() == HttpClient.Version.HTTP_1_1 ? "1.1" : "2.0";
|
||||
|
||||
String statusCode = String.valueOf(response.code());
|
||||
String statusMessage = STATUS_CODE_MAP.getOrDefault(response.code(), "Unknown");
|
||||
String statusCode = String.valueOf(response.statusCode());
|
||||
String statusMessage = STATUS_CODE_MAP.getOrDefault(response.statusCode(), "Unknown");
|
||||
|
||||
String headerString = getHeadersAsString(response, size);
|
||||
String headerString = getHeadersAsString(response.headers(), size);
|
||||
|
||||
return "HTTP/" + version + " " + statusCode + " " + statusMessage + "\r\n" + headerString + "\r\n\r\n";
|
||||
}
|
||||
|
||||
static String getResponseHeader(ClassicHttpResponse response, long size) {
|
||||
String headerString = getHeadersAsString(response.getHeaders(), size);
|
||||
|
||||
return response.getVersion().format() + " " + response.getCode() + " " + response.getReasonPhrase() + "\r\n" + headerString + "\r\n\r\n";
|
||||
}
|
||||
|
||||
private static final Map<Integer, String> STATUS_CODE_MAP = Map.ofEntries(
|
||||
Map.entry(200, "OK"),
|
||||
Map.entry(201, "Created"),
|
||||
@@ -148,10 +158,41 @@ public class WarcProtocolReconstructor {
|
||||
return joiner.toString();
|
||||
}
|
||||
|
||||
static private String getHeadersAsString(Response response, long responseSize) {
|
||||
|
||||
|
||||
static private String getHeadersAsString(Header[] headers, long responseSize) {
|
||||
StringJoiner joiner = new StringJoiner("\r\n");
|
||||
|
||||
response.headers().toMultimap().forEach((k, values) -> {
|
||||
for (var header : headers) {
|
||||
String headerCapitalized = capitalizeHeader(header.getName());
|
||||
|
||||
// Omit pseudoheaders injected by the crawler itself
|
||||
if (headerCapitalized.startsWith("X-Marginalia"))
|
||||
continue;
|
||||
|
||||
// Omit Transfer-Encoding and Content-Encoding headers
|
||||
if (headerCapitalized.equals("Transfer-Encoding"))
|
||||
continue;
|
||||
if (headerCapitalized.equals("Content-Encoding"))
|
||||
continue;
|
||||
|
||||
// Since we're transparently decoding gzip, we need to update the Content-Length header
|
||||
// to reflect the actual size of the response body. We'll do this at the end.
|
||||
if (headerCapitalized.equals("Content-Length"))
|
||||
continue;
|
||||
|
||||
joiner.add(headerCapitalized + ": " + header.getValue());
|
||||
}
|
||||
|
||||
joiner.add("Content-Length: " + responseSize);
|
||||
|
||||
return joiner.toString();
|
||||
}
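As a concrete illustration of the reconstruction above (the header values are invented for the example): a response that arrived with the raw headers below, whose decoded body is 5678 bytes long, would be serialized with the encoding headers dropped, the crawler pseudoheader omitted, and Content-Length rewritten to the decoded size.

    //   raw:   content-type: text/html; charset=utf-8
    //          content-encoding: gzip
    //          transfer-encoding: chunked
    //          content-length: 1234
    //          x-marginalia-probe: ...        (crawler pseudoheader)
    //
    //   warc:  Content-Type: text/html; charset=utf-8
    //          Content-Length: 5678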
|
||||
|
||||
static private String getHeadersAsString(HttpHeaders headers, long responseSize) {
|
||||
StringJoiner joiner = new StringJoiner("\r\n");
|
||||
|
||||
headers.map().forEach((k, values) -> {
|
||||
String headerCapitalized = capitalizeHeader(k);
|
||||
|
||||
// Omit pseudoheaders injected by the crawler itself
|
||||
@@ -179,8 +220,8 @@ public class WarcProtocolReconstructor {
|
||||
return joiner.toString();
|
||||
}
|
||||
|
||||
// okhttp gives us flattened headers, so we need to reconstruct Camel-Kebab-Case style
|
||||
// for the WARC parser's sake...
|
||||
// okhttp gave us flattened headers, so we need to reconstruct Camel-Kebab-Case style
|
||||
// for the WARC parser's sake... (do we still need this, mr chesterton?)
|
||||
static private String capitalizeHeader(String k) {
|
||||
return Arrays.stream(StringUtils.split(k, '-'))
|
||||
.map(StringUtils::capitalize)
|
||||
|
@@ -1,13 +1,16 @@
|
||||
package nu.marginalia.crawl.fetcher.warc;
|
||||
|
||||
import nu.marginalia.crawl.fetcher.ContentTags;
|
||||
import nu.marginalia.crawl.fetcher.DomainCookies;
|
||||
import nu.marginalia.crawl.fetcher.HttpFetcher;
|
||||
import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
|
||||
import nu.marginalia.crawl.fetcher.socket.IpInterceptingNetworkInterceptor;
|
||||
import nu.marginalia.link_parser.LinkParser;
|
||||
import nu.marginalia.model.EdgeDomain;
|
||||
import nu.marginalia.model.EdgeUrl;
|
||||
import nu.marginalia.model.body.HttpFetchResult;
|
||||
import okhttp3.OkHttpClient;
|
||||
import okhttp3.Request;
|
||||
import org.apache.hc.client5.http.classic.HttpClient;
|
||||
import org.apache.hc.client5.http.classic.methods.HttpGet;
|
||||
import org.apache.hc.core5.http.NameValuePair;
|
||||
import org.jetbrains.annotations.Nullable;
|
||||
import org.netpreserve.jwarc.*;
|
||||
import org.slf4j.Logger;
|
||||
@@ -16,18 +19,20 @@ import org.slf4j.LoggerFactory;
|
||||
import java.io.IOException;
|
||||
import java.io.InputStream;
|
||||
import java.net.InetAddress;
|
||||
import java.net.SocketTimeoutException;
|
||||
import java.net.URI;
|
||||
import java.net.URISyntaxException;
|
||||
import java.nio.charset.StandardCharsets;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Path;
|
||||
import java.security.NoSuchAlgorithmException;
|
||||
import java.time.Duration;
|
||||
import java.time.Instant;
|
||||
import java.util.*;
|
||||
|
||||
/** Based on JWarc's fetch method, APL 2.0 license
|
||||
* <p></p>
|
||||
* This class wraps OkHttp's OkHttpClient and records the HTTP request and response in a WARC file,
|
||||
* This class wraps HttpClient and records the HTTP request and response in a WARC file,
|
||||
* as best is possible given not all the data is available at the same time and needs to
|
||||
* be reconstructed.
|
||||
*/
|
||||
@@ -36,7 +41,7 @@ public class WarcRecorder implements AutoCloseable {
|
||||
static final int MAX_TIME = 30_000;
|
||||
|
||||
/** Maximum (decompressed) size we'll save */
|
||||
static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);
|
||||
static final int MAX_SIZE = Integer.getInteger("crawler.maxFetchSize", 32 * 1024 * 1024);
|
||||
|
||||
private final WarcWriter writer;
|
||||
private final Path warcFile;
|
||||
@@ -47,12 +52,7 @@ public class WarcRecorder implements AutoCloseable {
|
||||
// Affix a version string in case we need to change the format in the future
|
||||
// in some way
|
||||
private final String warcRecorderVersion = "1.0";
|
||||
|
||||
// We need to know if the site uses cookies so this can be reported among the search results
|
||||
// -- flip this to true if we see any cookies. This information will also be painted on any
|
||||
// revisited pages. It's not 100% perfect and a bit order dependent, but it's good enough.
|
||||
private final WarcXCookieInformationHeader cookieInformation = new WarcXCookieInformationHeader();
|
||||
|
||||
private final LinkParser linkParser = new LinkParser();
|
||||
/**
|
||||
* Create a new WarcRecorder that will write to the given file
|
||||
*
|
||||
@@ -74,108 +74,173 @@ public class WarcRecorder implements AutoCloseable {
|
||||
temporaryFile = true;
|
||||
}
|
||||
|
||||
public HttpFetchResult fetch(OkHttpClient client, Request request) throws NoSuchAlgorithmException,
|
||||
IOException,
|
||||
URISyntaxException,
|
||||
InterruptedException
|
||||
public HttpFetchResult fetch(HttpClient client,
|
||||
DomainCookies cookies,
|
||||
HttpGet request)
|
||||
throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException
|
||||
{
|
||||
URI requestUri = request.url().uri();
|
||||
return fetch(client, cookies, request, Duration.ofMillis(MAX_TIME));
|
||||
}
|
||||
|
||||
public HttpFetchResult fetch(HttpClient client,
|
||||
DomainCookies cookies,
|
||||
HttpGet request,
|
||||
Duration timeout)
|
||||
throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException
|
||||
{
|
||||
URI requestUri = request.getUri();
|
||||
|
||||
WarcDigestBuilder responseDigestBuilder = new WarcDigestBuilder();
|
||||
WarcDigestBuilder payloadDigestBuilder = new WarcDigestBuilder();
|
||||
|
||||
String ip;
|
||||
Instant date = Instant.now();
|
||||
|
||||
var call = client.newCall(request);
|
||||
// Not entirely sure why we need to do this, but keeping it due to Chesterton's Fence
|
||||
Map<String, List<String>> extraHeaders = new HashMap<>(request.getHeaders().length);
|
||||
|
||||
cookieInformation.update(client, request.url());
|
||||
// Inject a range header to attempt to limit the size of the response
|
||||
// to the maximum size we want to store, if the server supports it.
|
||||
request.addHeader("Range", "bytes=0-"+MAX_SIZE);
|
||||
cookies.paintRequest(request);
|
||||
try {
|
||||
return client.execute(request,response -> {
|
||||
|
||||
try (var response = call.execute();
|
||||
WarcInputBuffer inputBuffer = WarcInputBuffer.forResponse(response))
|
||||
{
|
||||
byte[] responseHeaders = WarcProtocolReconstructor.getResponseHeader(response, inputBuffer.size()).getBytes(StandardCharsets.UTF_8);
|
||||
try (WarcInputBuffer inputBuffer = WarcInputBuffer.forResponse(response, request, timeout);
|
||||
InputStream inputStream = inputBuffer.read()) {
|
||||
|
||||
ResponseDataBuffer responseDataBuffer = new ResponseDataBuffer(inputBuffer.size() + responseHeaders.length);
|
||||
InputStream inputStream = inputBuffer.read();
|
||||
cookies.updateCookieStore(response);
|
||||
|
||||
ip = IpInterceptingNetworkInterceptor.getIpFromResponse(response);
|
||||
// Build and write the request
|
||||
|
||||
responseDataBuffer.put(responseHeaders);
|
||||
responseDataBuffer.updateDigest(responseDigestBuilder, 0, responseHeaders.length);
|
||||
WarcDigestBuilder requestDigestBuilder = new WarcDigestBuilder();
|
||||
|
||||
int dataStart = responseDataBuffer.pos();
|
||||
byte[] httpRequestString = WarcProtocolReconstructor
|
||||
.getHttpRequestString(
|
||||
request.getMethod(),
|
||||
request.getHeaders(),
|
||||
extraHeaders,
|
||||
requestUri)
|
||||
.getBytes();
|
||||
|
||||
for (;;) {
|
||||
int remainingLength = responseDataBuffer.remaining();
|
||||
if (remainingLength == 0)
|
||||
break;
|
||||
requestDigestBuilder.update(httpRequestString);
|
||||
|
||||
int startPos = responseDataBuffer.pos();
|
||||
WarcRequest warcRequest = new WarcRequest.Builder(requestUri)
|
||||
.blockDigest(requestDigestBuilder.build())
|
||||
.date(date)
|
||||
.body(MediaType.HTTP_REQUEST, httpRequestString)
|
||||
.build();
|
||||
|
||||
int n = responseDataBuffer.readFrom(inputStream, remainingLength);
|
||||
if (n < 0)
|
||||
break;
|
||||
warcRequest.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
|
||||
writer.write(warcRequest);
|
||||
|
||||
responseDataBuffer.updateDigest(responseDigestBuilder, startPos, n);
|
||||
responseDataBuffer.updateDigest(payloadDigestBuilder, startPos, n);
|
||||
}
|
||||
|
||||
// It looks like this might be the same as requestUri, but it's not;
|
||||
// it's the URI after resolving redirects.
|
||||
final URI responseUri = response.request().url().uri();
|
||||
if (cookies.hasCookies()) {
|
||||
response.addHeader("X-Has-Cookies", 1);
|
||||
}
|
||||
|
||||
WarcResponse.Builder responseBuilder = new WarcResponse.Builder(responseUri)
|
||||
.blockDigest(responseDigestBuilder.build())
|
||||
.date(date)
|
||||
.body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());
|
||||
byte[] responseHeaders = WarcProtocolReconstructor.getResponseHeader(response, inputBuffer.size()).getBytes(StandardCharsets.UTF_8);
|
||||
|
||||
cookieInformation.paint(responseBuilder);
|
||||
ResponseDataBuffer responseDataBuffer = new ResponseDataBuffer(inputBuffer.size() + responseHeaders.length);
|
||||
|
||||
if (ip != null) responseBuilder.ipAddress(InetAddress.getByName(ip));
|
||||
responseDataBuffer.put(responseHeaders);
|
||||
responseDataBuffer.updateDigest(responseDigestBuilder, 0, responseHeaders.length);
|
||||
|
||||
responseBuilder.payloadDigest(payloadDigestBuilder.build());
|
||||
responseBuilder.truncated(inputBuffer.truncationReason());
|
||||
int dataStart = responseDataBuffer.pos();
|
||||
|
||||
// Build and write the response
|
||||
for (;;) {
|
||||
int remainingLength = responseDataBuffer.remaining();
|
||||
if (remainingLength == 0)
|
||||
break;
|
||||
|
||||
var warcResponse = responseBuilder.build();
|
||||
warcResponse.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
|
||||
writer.write(warcResponse);
|
||||
int startPos = responseDataBuffer.pos();
|
||||
|
||||
// Build and write the request
|
||||
int n = responseDataBuffer.readFrom(inputStream, remainingLength);
|
||||
if (n < 0)
|
||||
break;
|
||||
|
||||
WarcDigestBuilder requestDigestBuilder = new WarcDigestBuilder();
|
||||
responseDataBuffer.updateDigest(responseDigestBuilder, startPos, n);
|
||||
responseDataBuffer.updateDigest(payloadDigestBuilder, startPos, n);
|
||||
}
|
||||
|
||||
byte[] httpRequestString = WarcProtocolReconstructor
|
||||
.getHttpRequestString(
|
||||
response.request().method(),
|
||||
response.request().headers().toMultimap(),
|
||||
request.headers().toMultimap(),
|
||||
requestUri)
|
||||
.getBytes();
|
||||
// with some http client libraries that resolve redirects transparently, this might be different
// from the request URI, but currently we don't have transparent redirect resolution, so it's always
// the same (though let's keep the variables separate in case this changes)
|
||||
final URI responseUri = requestUri;
|
||||
|
||||
requestDigestBuilder.update(httpRequestString);
|
||||
WarcResponse.Builder responseBuilder = new WarcResponse.Builder(responseUri)
|
||||
.blockDigest(responseDigestBuilder.build())
|
||||
.date(date)
|
||||
.concurrentTo(warcRequest.id())
|
||||
.body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());
|
||||
|
||||
WarcRequest warcRequest = new WarcRequest.Builder(requestUri)
|
||||
.blockDigest(requestDigestBuilder.build())
|
||||
.date(date)
|
||||
.body(MediaType.HTTP_REQUEST, httpRequestString)
|
||||
.concurrentTo(warcResponse.id())
|
||||
.build();
|
||||
InetAddress inetAddress = InetAddress.getByName(responseUri.getHost());
|
||||
responseBuilder.ipAddress(inetAddress);
|
||||
responseBuilder.payloadDigest(payloadDigestBuilder.build());
|
||||
responseBuilder.truncated(inputBuffer.truncationReason());
|
||||
|
||||
warcRequest.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
|
||||
writer.write(warcRequest);
|
||||
// Build and write the response
|
||||
|
||||
return new HttpFetchResult.ResultOk(responseUri,
|
||||
response.code(),
|
||||
inputBuffer.headers(),
|
||||
ip,
|
||||
responseDataBuffer.data,
|
||||
dataStart,
|
||||
responseDataBuffer.length() - dataStart);
|
||||
}
|
||||
catch (Exception ex) {
|
||||
var warcResponse = responseBuilder.build();
|
||||
warcResponse.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
|
||||
writer.write(warcResponse);
|
||||
|
||||
if (Duration.between(date, Instant.now()).compareTo(Duration.ofSeconds(9)) > 0
|
||||
&& inputBuffer.size() < 2048
|
||||
&& !requestUri.getPath().endsWith("robots.txt")) // don't bail on robots.txt
|
||||
{
|
||||
// Fast detection and mitigation of crawler traps that respond with slow
|
||||
// small responses, with a high branching factor
|
||||
|
||||
// Note that we bail *after* writing the warc records; this will effectively only
// prevent link extraction from the document.
|
||||
|
||||
logger.warn("URL {} took too long to fetch ({}s) and was too small for the effort ({}b)",
|
||||
requestUri,
|
||||
Duration.between(date, Instant.now()).getSeconds(),
|
||||
inputBuffer.size()
|
||||
);
|
||||
|
||||
return new HttpFetchResult.ResultException(new IOException("Likely crawler trap"));
|
||||
}
|
||||
|
||||
if (response.getCode() == 301 || response.getCode() == 302 || response.getCode() == 307) {
|
||||
// If the server responds with a redirect, we need to
|
||||
// update the request URI to the new location
|
||||
EdgeUrl redirectLocation = Optional.ofNullable(response.getFirstHeader("Location"))
|
||||
.map(NameValuePair::getValue)
|
||||
.flatMap(location -> linkParser.parseLink(new EdgeUrl(requestUri), location))
|
||||
.orElse(null);
|
||||
if (redirectLocation != null) {
|
||||
// If the redirect location is a valid URL, we need to update the request URI
|
||||
return new HttpFetchResult.ResultRedirect(redirectLocation);
|
||||
} else {
|
||||
// If the redirect location is not a valid URL, we need to throw an exception
|
||||
return new HttpFetchResult.ResultException(new IOException("Invalid redirect location: " + response.getFirstHeader("Location")));
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
return new HttpFetchResult.ResultOk(responseUri,
|
||||
response.getCode(),
|
||||
inputBuffer.headers(),
|
||||
inetAddress.getHostAddress(),
|
||||
responseDataBuffer.data,
|
||||
dataStart,
|
||||
responseDataBuffer.length() - dataStart);
|
||||
} catch (Exception ex) {
|
||||
flagAsError(new EdgeUrl(requestUri), ex); // write a WARC record to indicate the error
|
||||
logger.warn("Failed to fetch URL {}: {}", requestUri, ex.getMessage());
|
||||
return new HttpFetchResult.ResultException(ex);
|
||||
}
|
||||
});
|
||||
// the client.execute() method will throw an exception if the request times out
|
||||
// or on other IO exceptions, so we need to catch those here as well as having
|
||||
// exception handling in the response handler
|
||||
} catch (SocketTimeoutException ex) {
|
||||
flagAsTimeout(new EdgeUrl(requestUri)); // write a WARC record to indicate the timeout
|
||||
return new HttpFetchResult.ResultException(ex);
|
||||
} catch (IOException ex) {
|
||||
flagAsError(new EdgeUrl(requestUri), ex); // write a WARC record to indicate the error
|
||||
logger.warn("Failed to fetch URL {}: {}", requestUri, ex.getMessage());
|
||||
return new HttpFetchResult.ResultException(ex);
|
||||
}
|
||||
@@ -185,7 +250,7 @@ public class WarcRecorder implements AutoCloseable {
|
||||
writer.write(item);
|
||||
}
|
||||
|
||||
private void saveOldResponse(EdgeUrl url, String contentType, int statusCode, String documentBody, @Nullable String headers, ContentTags contentTags) {
|
||||
private void saveOldResponse(EdgeUrl url, DomainCookies domainCookies, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags contentTags) {
|
||||
try {
|
||||
WarcDigestBuilder responseDigestBuilder = new WarcDigestBuilder();
|
||||
WarcDigestBuilder payloadDigestBuilder = new WarcDigestBuilder();
|
||||
@@ -195,7 +260,7 @@ public class WarcRecorder implements AutoCloseable {
|
||||
if (documentBody == null) {
|
||||
bytes = new byte[0];
|
||||
} else {
|
||||
bytes = documentBody.getBytes();
|
||||
bytes = documentBody;
|
||||
}
|
||||
|
||||
// Create a synthesis of custom headers and the original headers
|
||||
@@ -246,7 +311,9 @@ public class WarcRecorder implements AutoCloseable {
|
||||
.date(Instant.now())
|
||||
.body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());
|
||||
|
||||
cookieInformation.paint(builder);
|
||||
if (domainCookies.hasCookies() || (headers != null && headers.contains("Set-Cookie:"))) {
|
||||
builder.addHeader("X-Has-Cookies", "1");
|
||||
}
|
||||
|
||||
var reference = builder.build();
|
||||
|
||||
@@ -264,8 +331,8 @@ public class WarcRecorder implements AutoCloseable {
|
||||
* an E-Tag or Last-Modified header, and the server responds with a 304 Not Modified. In this
|
||||
* scenario we want to record the data as it was in the previous crawl, but not re-fetch it.
|
||||
*/
|
||||
public void writeReferenceCopy(EdgeUrl url, String contentType, int statusCode, String documentBody, @Nullable String headers, ContentTags ctags) {
|
||||
saveOldResponse(url, contentType, statusCode, documentBody, headers, ctags);
|
||||
public void writeReferenceCopy(EdgeUrl url, DomainCookies cookies, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags ctags) {
|
||||
saveOldResponse(url, cookies, contentType, statusCode, documentBody, headers, ctags);
|
||||
}
|
||||
|
||||
public void writeWarcinfoHeader(String ip, EdgeDomain domain, HttpFetcherImpl.DomainProbeResult result) throws IOException {
|
||||
@@ -285,6 +352,9 @@ public class WarcRecorder implements AutoCloseable {
|
||||
case HttpFetcherImpl.DomainProbeResult.Ok ok:
|
||||
fields.put("X-WARC-Probe-Status", List.of("OK"));
|
||||
break;
|
||||
case HttpFetcher.DomainProbeResult.RedirectSameDomain_Internal redirectSameDomain:
|
||||
fields.put("X-WARC-Probe-Status", List.of("REDIR-INTERNAL"));
|
||||
break;
|
||||
}
|
||||
|
||||
var warcinfo = new Warcinfo.Builder()
|
||||
|
@@ -3,6 +3,7 @@ package nu.marginalia.crawl.logic;
|
||||
import nu.marginalia.model.EdgeDomain;
|
||||
|
||||
import java.util.Map;
|
||||
import java.util.Optional;
|
||||
import java.util.concurrent.ConcurrentHashMap;
|
||||
import java.util.concurrent.Semaphore;
|
||||
|
||||
@@ -19,8 +20,22 @@ public class DomainLocks {
|
||||
* and may be held by another thread. The caller is responsible for locking and releasing the lock.
|
||||
*/
|
||||
public DomainLock lockDomain(EdgeDomain domain) throws InterruptedException {
|
||||
return new DomainLock(domain.toString(),
|
||||
locks.computeIfAbsent(domain.topDomain.toLowerCase(), this::defaultPermits));
|
||||
var sem = locks.computeIfAbsent(domain.topDomain.toLowerCase(), this::defaultPermits);
|
||||
|
||||
sem.acquire();
|
||||
|
||||
return new DomainLock(sem);
|
||||
}
|
||||
|
||||
public Optional<DomainLock> tryLockDomain(EdgeDomain domain) {
|
||||
var sem = locks.computeIfAbsent(domain.topDomain.toLowerCase(), this::defaultPermits);
|
||||
if (sem.tryAcquire(1)) {
|
||||
return Optional.of(new DomainLock(sem));
|
||||
}
|
||||
else {
|
||||
// We don't have a lock, so we return an empty optional
|
||||
return Optional.empty();
|
||||
}
|
||||
}
|
||||
|
||||
private Semaphore defaultPermits(String topDomain) {
|
||||
@@ -28,39 +43,45 @@ public class DomainLocks {
|
||||
return new Semaphore(16);
|
||||
if (topDomain.equals("blogspot.com"))
|
||||
return new Semaphore(8);
|
||||
|
||||
if (topDomain.equals("tumblr.com"))
|
||||
return new Semaphore(8);
|
||||
if (topDomain.equals("neocities.org"))
|
||||
return new Semaphore(4);
|
||||
return new Semaphore(8);
|
||||
if (topDomain.equals("github.io"))
|
||||
return new Semaphore(4);
|
||||
return new Semaphore(8);
|
||||
|
||||
// Substack really dislikes broad-scale crawlers, so we need to be careful
|
||||
// to not get blocked.
|
||||
if (topDomain.equals("substack.com")) {
|
||||
return new Semaphore(1);
|
||||
}
|
||||
if (topDomain.endsWith(".edu")) {
|
||||
return new Semaphore(1);
|
||||
}
|
||||
|
||||
return new Semaphore(2);
|
||||
}
|
||||
|
||||
/** Returns true if the domain is lockable, i.e. if it is not already locked by another thread.
|
||||
* (this is just a hint, and does not guarantee that the domain is actually lockable any time
|
||||
* after this method returns true)
|
||||
*/
|
||||
public boolean isLockableHint(EdgeDomain domain) {
|
||||
Semaphore sem = locks.get(domain.topDomain.toLowerCase());
|
||||
if (null == sem)
|
||||
return true;
|
||||
else
|
||||
return sem.availablePermits() > 0;
|
||||
}
|
||||
|
||||
public static class DomainLock implements AutoCloseable {
|
||||
private final String domainName;
|
||||
private final Semaphore semaphore;
|
||||
|
||||
DomainLock(String domainName, Semaphore semaphore) throws InterruptedException {
|
||||
this.domainName = domainName;
|
||||
DomainLock(Semaphore semaphore) {
|
||||
this.semaphore = semaphore;
|
||||
|
||||
Thread.currentThread().setName("crawling:" + domainName + " [await domain lock]");
|
||||
semaphore.acquire();
|
||||
Thread.currentThread().setName("crawling:" + domainName);
|
||||
}
|
||||
|
||||
@Override
|
||||
public void close() throws Exception {
|
||||
semaphore.release();
|
||||
Thread.currentThread().setName("crawling:" + domainName + " [wrapping up]");
|
||||
Thread.currentThread().setName("[idle]");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@@ -4,6 +4,7 @@ import nu.marginalia.ContentTypes;
|
||||
import nu.marginalia.io.SerializableCrawlDataStream;
|
||||
import nu.marginalia.lsh.EasyLSH;
|
||||
import nu.marginalia.model.crawldata.CrawledDocument;
|
||||
import org.jetbrains.annotations.NotNull;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
@@ -11,54 +12,76 @@ import javax.annotation.Nullable;
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Files;
|
||||
import java.nio.file.Path;
|
||||
import java.util.Iterator;
|
||||
import java.util.Objects;
|
||||
import java.util.Optional;
|
||||
|
||||
/** A reference to a domain that has been crawled before. */
|
||||
public class CrawlDataReference implements AutoCloseable {
|
||||
public class CrawlDataReference implements AutoCloseable, Iterable<CrawledDocument> {
|
||||
|
||||
private boolean closed = false;
|
||||
|
||||
@Nullable
|
||||
private final Path path;
|
||||
|
||||
@Nullable
|
||||
private SerializableCrawlDataStream data = null;
|
||||
|
||||
private final SerializableCrawlDataStream data;
|
||||
private static final Logger logger = LoggerFactory.getLogger(CrawlDataReference.class);
|
||||
|
||||
public CrawlDataReference(SerializableCrawlDataStream data) {
|
||||
this.data = data;
|
||||
public CrawlDataReference(@Nullable Path path) {
|
||||
this.path = path;
|
||||
}
|
||||
|
||||
public CrawlDataReference() {
|
||||
this(SerializableCrawlDataStream.empty());
|
||||
this(null);
|
||||
}
|
||||
|
||||
/** Delete the associated data from disk, if it exists */
|
||||
public void delete() throws IOException {
|
||||
Path filePath = data.path();
|
||||
|
||||
if (filePath != null) {
|
||||
Files.deleteIfExists(filePath);
|
||||
if (path != null) {
|
||||
Files.deleteIfExists(path);
|
||||
}
|
||||
}
|
||||
|
||||
/** Get the next document from the crawl data,
|
||||
* returning null when there are no more documents
|
||||
* available
|
||||
*/
|
||||
@Nullable
|
||||
public CrawledDocument nextDocument() {
|
||||
try {
|
||||
while (data.hasNext()) {
|
||||
if (data.next() instanceof CrawledDocument doc) {
|
||||
if (!ContentTypes.isAccepted(doc.contentType))
|
||||
continue;
|
||||
public @NotNull Iterator<CrawledDocument> iterator() {
|
||||
|
||||
return doc;
|
||||
requireStream();
|
||||
// Guaranteed by requireStream, but helps java
|
||||
Objects.requireNonNull(data);
|
||||
|
||||
return data.map(next -> {
|
||||
if (next instanceof CrawledDocument doc && ContentTypes.isAccepted(doc.contentType)) {
|
||||
return Optional.of(doc);
|
||||
}
|
||||
else {
|
||||
return Optional.empty();
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
/** After calling this method, data is guaranteed to be non-null */
|
||||
private void requireStream() {
|
||||
if (closed) {
|
||||
throw new IllegalStateException("Use after close()");
|
||||
}
|
||||
|
||||
if (data == null) {
|
||||
try {
|
||||
if (path != null) {
|
||||
data = SerializableCrawlDataStream.openDataStream(path);
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
catch (IOException ex) {
|
||||
logger.error("Failed to read next document", ex);
|
||||
}
|
||||
catch (Exception ex) {
|
||||
logger.error("Failed to open stream", ex);
|
||||
}
|
||||
|
||||
return null;
|
||||
data = SerializableCrawlDataStream.empty();
|
||||
}
|
||||
}
|
||||
|
||||
public static boolean isContentBodySame(String one, String other) {
|
||||
public static boolean isContentBodySame(byte[] one, byte[] other) {
|
||||
|
||||
final long contentHashOne = contentHash(one);
|
||||
final long contentHashOther = contentHash(other);
|
||||
@@ -66,7 +89,7 @@ public class CrawlDataReference implements AutoCloseable {
|
||||
return EasyLSH.hammingDistance(contentHashOne, contentHashOther) < 4;
|
||||
}
|
||||
|
||||
private static long contentHash(String content) {
|
||||
private static long contentHash(byte[] content) {
|
||||
EasyLSH hash = new EasyLSH();
|
||||
int next = 0;
|
||||
|
||||
@@ -74,8 +97,8 @@ public class CrawlDataReference implements AutoCloseable {
|
||||
|
||||
// In a naive best-effort fashion, extract the text
|
||||
// content of the document and feed it into the LSH
|
||||
for (int i = 0; i < content.length(); i++) {
|
||||
char c = content.charAt(i);
|
||||
for (byte b : content) {
|
||||
char c = (char) b;
|
||||
if (c == '<') {
|
||||
isInTag = true;
|
||||
} else if (c == '>') {
|
||||
@@ -98,7 +121,12 @@ public class CrawlDataReference implements AutoCloseable {
|
||||
}
|
||||
|
||||
@Override
|
||||
public void close() throws Exception {
|
||||
data.close();
|
||||
public void close() throws IOException {
|
||||
if (!closed) {
|
||||
if (data != null) {
|
||||
data.close();
|
||||
}
|
||||
closed = true;
|
||||
}
|
||||
}
|
||||
}
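A short sketch of how the reworked, path-backed CrawlDataReference is meant to be consumed; the file path and the body of the loop are placeholders, the CrawledDocument import matches the diff, and the sketch is assumed to live alongside CrawlDataReference in the same package.

    import nu.marginalia.model.crawldata.CrawledDocument;

    import java.nio.file.Path;

    class CrawlDataReferenceUsageSketch {
        void replayOldCrawl(Path oldCrawlDataFile) throws Exception {
            // The underlying data stream is only opened lazily, on first iteration
            try (var reference = new CrawlDataReference(oldCrawlDataFile)) {
                for (CrawledDocument doc : reference) {
                    // the iterator already filters out non-accepted content types
                    // ... compare etag / last-modified against the live document here ...
                }
            }
        }
    }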
|
||||
|
@@ -3,6 +3,7 @@ package nu.marginalia.crawl.retreival;
|
||||
import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
|
||||
|
||||
import java.time.Duration;
|
||||
import java.util.concurrent.ThreadLocalRandom;
|
||||
|
||||
import static java.lang.Math.max;
|
||||
import static java.lang.Math.min;
|
||||
@@ -50,15 +51,20 @@ public class CrawlDelayTimer {
|
||||
waitFetchDelay(0);
|
||||
}
|
||||
|
||||
public void waitFetchDelay(Duration spentTime) {
|
||||
waitFetchDelay(spentTime.toMillis());
|
||||
}
|
||||
|
||||
public void waitFetchDelay(long spentTime) {
|
||||
long sleepTime = delayTime;
|
||||
|
||||
long jitter = ThreadLocalRandom.current().nextLong(0, 150);
|
||||
try {
|
||||
if (sleepTime >= 1) {
|
||||
if (spentTime > sleepTime)
|
||||
return;
|
||||
|
||||
Thread.sleep(min(sleepTime - spentTime, 5000));
|
||||
Thread.sleep(min(sleepTime - spentTime, 5000) + jitter);
|
||||
} else {
|
||||
// When no crawl delay is specified, lean toward twice the fetch+process time,
|
||||
// within sane limits. This means slower servers get slower crawling, and faster
|
||||
@@ -71,17 +77,17 @@ public class CrawlDelayTimer {
|
||||
if (spentTime > sleepTime)
|
||||
return;
|
||||
|
||||
Thread.sleep(sleepTime - spentTime);
|
||||
Thread.sleep(sleepTime - spentTime + jitter);
|
||||
}
|
||||
|
||||
if (slowDown) {
|
||||
// Additional delay when the server is signalling it wants slower requests
|
||||
Thread.sleep(DEFAULT_CRAWL_DELAY_MIN_MS);
|
||||
Thread.sleep(DEFAULT_CRAWL_DELAY_MIN_MS + jitter);
|
||||
}
|
||||
}
|
||||
catch (InterruptedException e) {
|
||||
Thread.currentThread().interrupt();
|
||||
throw new RuntimeException();
|
||||
throw new RuntimeException("Interrupted", e);
|
||||
}
|
||||
}
|
||||
}
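Restating the jittered delay rule above as a tiny pure function may make the effective sleep time easier to see; this is a sketch of the explicit-crawl-delay branch only, and the real method also handles the no-crawl-delay case and the slow-down flag.

    import java.util.concurrent.ThreadLocalRandom;

    class FetchDelaySketch {
        // Honour the robots.txt crawl delay (capped at 5s per wait), subtract the
        // time already spent fetching, and add 0-150 ms of jitter so that requests
        // from many crawler threads don't align into bursts.
        static long sleepMillis(long crawlDelayMillis, long spentMillis) {
            if (spentMillis > crawlDelayMillis)
                return 0;
            long jitter = ThreadLocalRandom.current().nextLong(0, 150);
            return Math.min(crawlDelayMillis - spentMillis, 5000) + jitter;
        }
    }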
|
||||
|
@@ -0,0 +1,42 @@
|
||||
package nu.marginalia.crawl.retreival;
|
||||
|
||||
import java.time.Duration;
|
||||
import java.time.Instant;
|
||||
import java.util.concurrent.Semaphore;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
|
||||
/**
|
||||
* This class is used to stagger the rate at which connections are created.
|
||||
* <p></p>
|
||||
* It is used to ensure that we do not create too many connections at once,
* which can lead to network congestion and other issues. Since the connections
* tend to be very long-lived, we can afford to wait a bit before creating the next one,
* even if it adds a bit of build-up time when the crawl starts.
*/
public class CrawlerConnectionThrottle {
private Instant lastCrawlStart = Instant.EPOCH;
private final Semaphore launchSemaphore = new Semaphore(1);

private final Duration launchInterval;

public CrawlerConnectionThrottle(Duration launchInterval) {
this.launchInterval = launchInterval;
}

public void waitForConnectionPermission() throws InterruptedException {
try {
launchSemaphore.acquire();
Instant nextPermittedLaunch = lastCrawlStart.plus(launchInterval);

if (nextPermittedLaunch.isAfter(Instant.now())) {
long waitTime = Duration.between(Instant.now(), nextPermittedLaunch).toMillis();
TimeUnit.MILLISECONDS.sleep(waitTime);
}

lastCrawlStart = Instant.now();
}
finally {
launchSemaphore.release();
}
}
|
||||
}
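A brief usage sketch of the new throttle; the shared instance and one-second interval mirror the static field added to CrawlerRetreiver further down, while the surrounding class and method names are illustrative.

    import nu.marginalia.crawl.retreival.CrawlerConnectionThrottle;

    import java.time.Duration;

    class ThrottleUsageSketch {
        // One shared throttle for all crawler tasks
        private static final CrawlerConnectionThrottle throttle =
                new CrawlerConnectionThrottle(Duration.ofSeconds(1));

        void startCrawlTask() throws InterruptedException {
            // Each task waits its turn before opening its first connection, so
            // thousands of TCP handshakes are spread out instead of bursting at startup.
            throttle.waitForConnectionPermission();
            // ... open the connection and begin crawling ...
        }
    }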
|
@@ -6,13 +6,12 @@ import nu.marginalia.contenttype.ContentType;
|
||||
import nu.marginalia.crawl.CrawlerMain;
|
||||
import nu.marginalia.crawl.DomainStateDb;
|
||||
import nu.marginalia.crawl.fetcher.ContentTags;
|
||||
import nu.marginalia.crawl.fetcher.DomainCookies;
|
||||
import nu.marginalia.crawl.fetcher.HttpFetcher;
|
||||
import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
|
||||
import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
|
||||
import nu.marginalia.crawl.logic.LinkFilterSelector;
|
||||
import nu.marginalia.crawl.retreival.revisit.CrawlerRevisitor;
|
||||
import nu.marginalia.crawl.retreival.revisit.DocumentWithReference;
|
||||
import nu.marginalia.crawl.retreival.sitemap.SitemapFetcher;
|
||||
import nu.marginalia.ip_blocklist.UrlBlocklist;
|
||||
import nu.marginalia.link_parser.LinkParser;
|
||||
import nu.marginalia.model.EdgeDomain;
|
||||
@@ -20,7 +19,6 @@ import nu.marginalia.model.EdgeUrl;
|
||||
import nu.marginalia.model.body.DocumentBodyExtractor;
|
||||
import nu.marginalia.model.body.HttpFetchResult;
|
||||
import nu.marginalia.model.crawldata.CrawlerDomainStatus;
|
||||
import org.jsoup.Jsoup;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
@@ -28,14 +26,16 @@ import java.io.IOException;
|
||||
import java.net.InetAddress;
|
||||
import java.net.UnknownHostException;
|
||||
import java.nio.file.Path;
|
||||
import java.time.Duration;
|
||||
import java.time.Instant;
|
||||
import java.util.List;
|
||||
import java.util.Objects;
|
||||
import java.util.Optional;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
|
||||
public class CrawlerRetreiver implements AutoCloseable {
|
||||
|
||||
private static final int MAX_ERRORS = 20;
|
||||
private static final int HTTP_429_RETRY_LIMIT = 1; // Retry 429s once
|
||||
|
||||
private final HttpFetcher fetcher;
|
||||
|
||||
@@ -52,8 +52,12 @@ public class CrawlerRetreiver implements AutoCloseable {
|
||||
private final DomainStateDb domainStateDb;
|
||||
private final WarcRecorder warcRecorder;
|
||||
private final CrawlerRevisitor crawlerRevisitor;
|
||||
private final DomainCookies cookies = new DomainCookies();
|
||||
|
||||
private static final CrawlerConnectionThrottle connectionThrottle = new CrawlerConnectionThrottle(
|
||||
Duration.ofSeconds(1) // pace the connections to avoid network congestion at startup
|
||||
);
|
||||
|
||||
private final SitemapFetcher sitemapFetcher;
|
||||
int errorCount = 0;
|
||||
|
||||
public CrawlerRetreiver(HttpFetcher fetcher,
|
||||
@@ -71,7 +75,6 @@ public class CrawlerRetreiver implements AutoCloseable {
|
||||
|
||||
crawlFrontier = new DomainCrawlFrontier(new EdgeDomain(domain), specs.urls(), specs.crawlDepth());
|
||||
crawlerRevisitor = new CrawlerRevisitor(crawlFrontier, this, warcRecorder);
|
||||
sitemapFetcher = new SitemapFetcher(crawlFrontier, fetcher.createSitemapRetriever());
|
||||
|
||||
// We must always crawl the index page first, this is assumed when fingerprinting the server
|
||||
var fst = crawlFrontier.peek();
|
||||
@@ -93,30 +96,63 @@ public class CrawlerRetreiver implements AutoCloseable {
|
||||
}
|
||||
|
||||
public int crawlDomain(DomainLinks domainLinks, CrawlDataReference oldCrawlData) {
|
||||
try {
|
||||
// Do an initial domain probe to determine the root URL
|
||||
EdgeUrl rootUrl;
|
||||
try (oldCrawlData) {
|
||||
|
||||
// Wait for permission to open a connection to avoid network congestion
|
||||
// from hundreds/thousands of TCP handshakes
|
||||
connectionThrottle.waitForConnectionPermission();
|
||||
|
||||
// Do an initial domain probe to determine the root URL
|
||||
var probeResult = probeRootUrl();
|
||||
switch (probeResult) {
|
||||
|
||||
return switch (probeResult) {
|
||||
case HttpFetcher.DomainProbeResult.Ok(EdgeUrl probedUrl) -> {
|
||||
rootUrl = probedUrl; // Good track
|
||||
|
||||
// Sleep after the initial probe, we don't have access to the robots.txt yet
|
||||
// so we don't know the crawl delay
|
||||
TimeUnit.SECONDS.sleep(1);
|
||||
|
||||
final SimpleRobotRules robotsRules = fetcher.fetchRobotRules(probedUrl.domain, warcRecorder);
|
||||
final CrawlDelayTimer delayTimer = new CrawlDelayTimer(robotsRules.getCrawlDelay());
|
||||
|
||||
delayTimer.waitFetchDelay(0); // initial delay after robots.txt
|
||||
|
||||
DomainStateDb.SummaryRecord summaryRecord = sniffRootDocument(probedUrl, delayTimer);
|
||||
domainStateDb.save(summaryRecord);
|
||||
|
||||
if (Thread.interrupted()) {
|
||||
// There's a small chance we're interrupted during the sniffing portion
|
||||
throw new InterruptedException();
|
||||
}
|
||||
|
||||
Instant recrawlStart = Instant.now();
|
||||
CrawlerRevisitor.RecrawlMetadata recrawlMetadata = crawlerRevisitor.recrawl(oldCrawlData, cookies, robotsRules, delayTimer);
|
||||
Duration recrawlTime = Duration.between(recrawlStart, Instant.now());
|
||||
|
||||
// Play back the old crawl data (if present) and fetch the documents comparing etags and last-modified
|
||||
if (recrawlMetadata.size() > 0) {
|
||||
// If we have reference data, we will always grow the crawl depth a bit
|
||||
crawlFrontier.increaseDepth(1.5, 2500);
|
||||
}
|
||||
|
||||
oldCrawlData.close(); // proactively close the crawl data reference here to not hold onto expensive resources
|
||||
|
||||
yield crawlDomain(probedUrl, robotsRules, delayTimer, domainLinks, recrawlMetadata, recrawlTime);
|
||||
}
|
||||
case HttpFetcher.DomainProbeResult.Redirect(EdgeDomain domain1) -> {
|
||||
domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, "Redirect", domain1.toString()));
|
||||
return 1;
|
||||
yield 1;
|
||||
}
|
||||
case HttpFetcher.DomainProbeResult.Error(CrawlerDomainStatus status, String desc) -> {
|
||||
domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, status.toString(), desc));
|
||||
return 1;
|
||||
yield 1;
|
||||
}
|
||||
}
|
||||
default -> {
|
||||
logger.error("Unexpected domain probe result {}", probeResult);
|
||||
yield 1;
|
||||
}
|
||||
};
|
||||
|
||||
// Sleep after the initial probe, we don't have access to the robots.txt yet
|
||||
// so we don't know the crawl delay
|
||||
TimeUnit.SECONDS.sleep(1);
|
||||
|
||||
return crawlDomain(oldCrawlData, rootUrl, domainLinks);
|
||||
}
|
||||
catch (Exception ex) {
|
||||
logger.error("Error crawling domain {}", domain, ex);
|
||||
@@ -124,30 +160,31 @@ public class CrawlerRetreiver implements AutoCloseable {
|
||||
}
|
||||
}
|
||||
|
||||
private int crawlDomain(CrawlDataReference oldCrawlData,
|
||||
EdgeUrl rootUrl,
|
||||
DomainLinks domainLinks) throws InterruptedException {
|
||||
private int crawlDomain(EdgeUrl rootUrl,
|
||||
SimpleRobotRules robotsRules,
|
||||
CrawlDelayTimer delayTimer,
|
||||
DomainLinks domainLinks,
|
||||
CrawlerRevisitor.RecrawlMetadata recrawlMetadata,
|
||||
Duration recrawlTime) {
|
||||
|
||||
final SimpleRobotRules robotsRules = fetcher.fetchRobotRules(rootUrl.domain, warcRecorder);
|
||||
final CrawlDelayTimer delayTimer = new CrawlDelayTimer(robotsRules.getCrawlDelay());
|
||||
|
||||
delayTimer.waitFetchDelay(0); // initial delay after robots.txt
|
||||
|
||||
DomainStateDb.SummaryRecord summaryRecord = sniffRootDocument(rootUrl, delayTimer);
|
||||
domainStateDb.save(summaryRecord);
|
||||
|
||||
// Play back the old crawl data (if present) and fetch the documents comparing etags and last-modified
|
||||
if (crawlerRevisitor.recrawl(oldCrawlData, robotsRules, delayTimer) > 0) {
|
||||
// If we have reference data, we will always grow the crawl depth a bit
|
||||
crawlFrontier.increaseDepth(1.5, 2500);
|
||||
}
|
||||
Instant crawlStart = Instant.now();
|
||||
|
||||
// Add external links to the crawl frontier
|
||||
crawlFrontier.addAllToQueue(domainLinks.getUrls(rootUrl.proto));
|
||||
|
||||
// Add links from the sitemap to the crawl frontier
|
||||
sitemapFetcher.downloadSitemaps(robotsRules, rootUrl);
|
||||
// Fetch sitemaps
|
||||
for (var sitemap : robotsRules.getSitemaps()) {
|
||||
|
||||
// Validate the sitemap URL and check if it belongs to the domain as the root URL
|
||||
if (EdgeUrl.parse(sitemap)
|
||||
.map(url -> url.getDomain().equals(rootUrl.domain))
|
||||
.orElse(false)) {
|
||||
|
||||
crawlFrontier.addAllToQueue(fetcher.fetchSitemapUrls(sitemap, delayTimer));
|
||||
}
|
||||
}
|
||||
|
||||
int crawlerAdditions = 0;
|
||||
|
||||
while (!crawlFrontier.isEmpty()
|
||||
&& !crawlFrontier.isCrawlDepthReached()
|
||||
@@ -180,7 +217,11 @@ public class CrawlerRetreiver implements AutoCloseable {
|
||||
continue;
|
||||
|
||||
try {
|
||||
fetchContentWithReference(top, delayTimer, DocumentWithReference.empty());
|
||||
var result = fetchContentWithReference(top, delayTimer, DocumentWithReference.empty());
|
||||
|
||||
if (result.isOk()) {
|
||||
crawlerAdditions++;
|
||||
}
|
||||
}
|
||||
catch (InterruptedException ex) {
|
||||
Thread.currentThread().interrupt();
|
||||
@@ -188,6 +229,17 @@ public class CrawlerRetreiver implements AutoCloseable {
|
||||
}
|
||||
}
|
||||
|
||||
Duration crawlTime = Duration.between(crawlStart, Instant.now());
|
||||
domainStateDb.save(new DomainStateDb.CrawlMeta(
|
||||
domain,
|
||||
Instant.now(),
|
||||
recrawlTime,
|
||||
crawlTime,
|
||||
recrawlMetadata.errors(),
|
||||
crawlerAdditions,
|
||||
recrawlMetadata.size() + crawlerAdditions
|
||||
));
|
||||
|
||||
return crawlFrontier.visitedSize();
|
||||
}
|
||||
|
||||
@@ -216,17 +268,29 @@ public class CrawlerRetreiver implements AutoCloseable {
|
||||
return domainProbeResult;
|
||||
}
|
||||
|
||||
|
||||
|
||||
private DomainStateDb.SummaryRecord sniffRootDocument(EdgeUrl rootUrl, CrawlDelayTimer timer) {
|
||||
Optional<String> feedLink = Optional.empty();
|
||||
|
||||
try {
|
||||
var url = rootUrl.withPathAndParam("/", null);
|
||||
|
||||
HttpFetchResult result = fetchWithRetry(url, timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
|
||||
HttpFetchResult result = fetcher.fetchContent(url, warcRecorder, cookies, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
|
||||
timer.waitFetchDelay(0);
|
||||
|
||||
if (!(result instanceof HttpFetchResult.ResultOk ok))
|
||||
if (result instanceof HttpFetchResult.ResultRedirect(EdgeUrl location)) {
|
||||
if (Objects.equals(location.domain, url.domain)) {
|
||||
// TODO: Follow the redirect to the new location and sniff the document
|
||||
crawlFrontier.addFirst(location);
|
||||
}
|
||||
|
||||
return DomainStateDb.SummaryRecord.forSuccess(domain);
|
||||
}
|
||||
|
||||
if (!(result instanceof HttpFetchResult.ResultOk ok)) {
|
||||
return DomainStateDb.SummaryRecord.forSuccess(domain);
|
||||
}
|
||||
|
||||
var optDoc = ok.parseDocument();
|
||||
if (optDoc.isEmpty())
|
||||
@@ -271,18 +335,28 @@ public class CrawlerRetreiver implements AutoCloseable {
|
||||
}
|
||||
|
||||
// Download the sitemap if available
|
||||
if (feedLink.isPresent()) {
|
||||
sitemapFetcher.downloadSitemaps(List.of(feedLink.get()));
|
||||
timer.waitFetchDelay(0);
|
||||
}
|
||||
feedLink.ifPresent(s -> fetcher.fetchSitemapUrls(s, timer));
|
||||
|
||||
// Grab the favicon if it exists
|
||||
fetchWithRetry(faviconUrl, timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
|
||||
|
||||
if (fetcher.fetchContent(faviconUrl, warcRecorder, cookies, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED) instanceof HttpFetchResult.ResultOk iconResult) {
|
||||
String contentType = iconResult.header("Content-Type");
|
||||
byte[] iconData = iconResult.getBodyBytes();
|
||||
|
||||
domainStateDb.saveIcon(
|
||||
domain,
|
||||
new DomainStateDb.FaviconRecord(contentType, iconData)
|
||||
);
|
||||
}
|
||||
timer.waitFetchDelay(0);
|
||||
|
||||
}
|
||||
catch (Exception ex) {
|
||||
logger.error("Error configuring link filter", ex);
|
||||
if (Thread.interrupted()) {
|
||||
Thread.currentThread().interrupt();
|
||||
return DomainStateDb.SummaryRecord.forError(domain, "Crawler Interrupted", ex.getMessage());
|
||||
}
|
||||
}
|
||||
finally {
|
||||
crawlFrontier.addVisited(rootUrl);
|
||||
@@ -310,7 +384,7 @@ public class CrawlerRetreiver implements AutoCloseable {
         );
 
     private Optional<String> guessFeedUrl(CrawlDelayTimer timer) throws InterruptedException {
-        var oldDomainStateRecord = domainStateDb.get(domain);
+        var oldDomainStateRecord = domainStateDb.getSummary(domain);
 
         // If we are already aware of an old feed URL, then we can just revalidate it
         if (oldDomainStateRecord.isPresent()) {
@@ -335,7 +409,7 @@ public class CrawlerRetreiver implements AutoCloseable {
         if (parsedOpt.isEmpty())
             return false;
 
-        HttpFetchResult result = fetchWithRetry(parsedOpt.get(), timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
+        HttpFetchResult result = fetcher.fetchContent(parsedOpt.get(), warcRecorder, cookies, timer, ContentTags.empty(), HttpFetcher.ProbeType.DISABLED);
         timer.waitFetchDelay(0);
 
         if (!(result instanceof HttpFetchResult.ResultOk ok)) {
@@ -361,110 +435,63 @@ public class CrawlerRetreiver implements AutoCloseable {
                                               CrawlDelayTimer timer,
                                               DocumentWithReference reference) throws InterruptedException
     {
         logger.debug("Fetching {}", top);
 
         long startTime = System.currentTimeMillis();
         var contentTags = reference.getContentTags();
 
-        HttpFetchResult fetchedDoc = fetchWithRetry(top, timer, HttpFetcher.ProbeType.FULL, contentTags);
+        HttpFetchResult fetchedDoc = fetcher.fetchContent(top, warcRecorder, cookies, timer, contentTags, HttpFetcher.ProbeType.FULL);
+        timer.waitFetchDelay();
 
         if (Thread.interrupted()) {
             Thread.currentThread().interrupt();
             throw new InterruptedException();
         }
 
         // Parse the document and enqueue links
         try {
-            if (fetchedDoc instanceof HttpFetchResult.ResultOk ok) {
-                var docOpt = ok.parseDocument();
-                if (docOpt.isPresent()) {
-                    var doc = docOpt.get();
+            switch (fetchedDoc) {
+                case HttpFetchResult.ResultOk ok -> {
+                    var docOpt = ok.parseDocument();
+                    if (docOpt.isPresent()) {
+                        var doc = docOpt.get();
 
-                    crawlFrontier.enqueueLinksFromDocument(top, doc);
-                    crawlFrontier.addVisited(new EdgeUrl(ok.uri()));
+                        var responseUrl = new EdgeUrl(ok.uri());
+
+                        crawlFrontier.enqueueLinksFromDocument(responseUrl, doc);
+                        crawlFrontier.addVisited(responseUrl);
+                    }
                 }
-            }
-            else if (fetchedDoc instanceof HttpFetchResult.Result304Raw && reference.doc() != null) {
-                var doc = reference.doc();
+                case HttpFetchResult.Result304Raw ref when reference.doc() != null ->
+                {
+                    var doc = reference.doc();
 
-                warcRecorder.writeReferenceCopy(top, doc.contentType, doc.httpStatus, doc.documentBody, doc.headers, contentTags);
+                    warcRecorder.writeReferenceCopy(top, cookies, doc.contentType, doc.httpStatus, doc.documentBodyBytes, doc.headers, contentTags);
 
-                fetchedDoc = new HttpFetchResult.Result304ReplacedWithReference(doc.url,
-                        new ContentType(doc.contentType, "UTF-8"),
-                        doc.documentBody);
+                    fetchedDoc = new HttpFetchResult.Result304ReplacedWithReference(doc.url,
+                            new ContentType(doc.contentType, "UTF-8"),
+                            doc.documentBodyBytes);
 
-                if (doc.documentBody != null) {
-                    var parsed = Jsoup.parse(doc.documentBody);
+                    if (doc.documentBodyBytes != null) {
+                        var parsed = doc.parseBody();
 
-                    crawlFrontier.enqueueLinksFromDocument(top, parsed);
-                    crawlFrontier.addVisited(top);
+                        crawlFrontier.enqueueLinksFromDocument(top, parsed);
+                        crawlFrontier.addVisited(top);
+                    }
                 }
-            }
-            else if (fetchedDoc instanceof HttpFetchResult.ResultException) {
-                errorCount ++;
+                case HttpFetchResult.ResultRedirect(EdgeUrl location) -> {
+                    if (Objects.equals(location.domain, top.domain)) {
+                        crawlFrontier.addFirst(location);
+                    }
+                }
+                case HttpFetchResult.ResultException ex -> errorCount++;
+                default -> {} // Ignore other types
             }
         }
         catch (Exception ex) {
             logger.error("Error parsing document {}", top, ex);
         }
 
-        timer.waitFetchDelay(System.currentTimeMillis() - startTime);
-
         return fetchedDoc;
     }
 
-    /** Fetch a document and retry on 429s */
-    private HttpFetchResult fetchWithRetry(EdgeUrl url,
-                                           CrawlDelayTimer timer,
-                                           HttpFetcher.ProbeType probeType,
-                                           ContentTags contentTags) throws InterruptedException {
-
-        long probeStart = System.currentTimeMillis();
-
-        if (probeType == HttpFetcher.ProbeType.FULL) {
-            retryLoop:
-            for (int i = 0; i <= HTTP_429_RETRY_LIMIT; i++) {
-                try {
-                    var probeResult = fetcher.probeContentType(url, warcRecorder, contentTags);
-
-                    switch (probeResult) {
-                        case HttpFetcher.ContentTypeProbeResult.Ok(EdgeUrl resolvedUrl):
-                            url = resolvedUrl; // If we were redirected while probing, use the final URL for fetching
-                            break retryLoop;
-                        case HttpFetcher.ContentTypeProbeResult.BadContentType badContentType:
-                            return new HttpFetchResult.ResultNone();
-                        case HttpFetcher.ContentTypeProbeResult.BadContentType.Timeout timeout:
-                            return new HttpFetchResult.ResultException(timeout.ex());
-                        case HttpFetcher.ContentTypeProbeResult.Exception exception:
-                            return new HttpFetchResult.ResultException(exception.ex());
-                        default: // should be unreachable
-                            throw new IllegalStateException("Unknown probe result");
-                    }
-                }
-                catch (HttpFetcherImpl.RateLimitException ex) {
-                    timer.waitRetryDelay(ex);
-                }
-                catch (Exception ex) {
-                    logger.warn("Failed to fetch {}", url, ex);
-                    return new HttpFetchResult.ResultException(ex);
-                }
-            }
-
-            timer.waitFetchDelay(System.currentTimeMillis() - probeStart);
-        }
-
-        for (int i = 0; i <= HTTP_429_RETRY_LIMIT; i++) {
-            try {
-                return fetcher.fetchContent(url, warcRecorder, contentTags, probeType);
-            }
-            catch (HttpFetcherImpl.RateLimitException ex) {
-                timer.waitRetryDelay(ex);
-            }
-            catch (Exception ex) {
-                logger.warn("Failed to fetch {}", url, ex);
-                return new HttpFetchResult.ResultException(ex);
-            }
-        }
-
-        return new HttpFetchResult.ResultNone();
-    }
-
     private boolean isAllowedProtocol(String proto) {
         return proto.equalsIgnoreCase("http")
                 || proto.equalsIgnoreCase("https");
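The largest hunk above drops the local fetchWithRetry() helper in favour of calling fetcher.fetchContent() directly, and rewrites the result handling as an exhaustive switch over the HttpFetchResult variants, including a new redirect case that re-queues same-domain locations. The standalone sketch below illustrates the sealed-type pattern-matching idiom in isolation; FetchResult and its variants are simplified stand-ins for illustration, not the project's actual classes.

// Minimal, self-contained sketch of the pattern-matching dispatch style used
// in the new CrawlerRetreiver code. The types here are simplified stand-ins,
// not the project's real HttpFetchResult hierarchy.
public class FetchDispatchSketch {
    sealed interface FetchResult permits Ok, NotModified, Redirect, Error {}
    record Ok(String body) implements FetchResult {}
    record NotModified() implements FetchResult {}
    record Redirect(String location) implements FetchResult {}
    record Error(Exception cause) implements FetchResult {}

    static String describe(FetchResult result) {
        // Exhaustive switch over the sealed hierarchy, mirroring the
        // "case ... ->" structure in the diff above; no default branch needed.
        return switch (result) {
            case Ok ok -> "fetched " + ok.body().length() + " chars";
            case NotModified nm -> "304, reuse the reference copy";
            case Redirect(String location) -> "follow redirect to " + location;
            case Error error -> "error: " + error.cause().getMessage();
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(new Redirect("https://example.com/")));
    }
}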
@@ -55,6 +55,9 @@ public class DomainCrawlFrontier {
         }
     }
 
+    public EdgeDomain getDomain() {
+        return thisDomain;
+    }
     /** Increase the depth of the crawl by a factor. If the current depth is smaller
      * than the number of already visited documents, the base depth will be adjusted
      * to the visited count first.
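The javadoc above describes the depth-scaling rule. Below is a minimal sketch of that rule under the assumption of plain int fields; the real DomainCrawlFrontier keeps more state and its method name and signature may differ.

// Sketch only: raise the base depth to the visited count first, then scale.
class CrawlDepthSketch {
    private int depth;
    private final int visitedCount;

    CrawlDepthSketch(int depth, int visitedCount) {
        this.depth = depth;
        this.visitedCount = visitedCount;
    }

    // Increase the crawl depth by a factor, per the rule in the javadoc above.
    void increaseDepth(double factor) {
        if (depth < visitedCount) {
            depth = visitedCount;   // adjust the base depth to the visited count first
        }
        depth = (int) (depth * factor);
    }

    int depth() { return depth; }

    public static void main(String[] args) {
        var sketch = new CrawlDepthSketch(100, 250);
        sketch.increaseDepth(1.5);
        System.out.println(sketch.depth()); // 375, since the base was first raised to 250
    }
}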
@@ -1,8 +1,8 @@
 package nu.marginalia.crawl.retreival.revisit;
 
-import com.google.common.base.Strings;
 import crawlercommons.robots.SimpleRobotRules;
 import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.DomainCookies;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import nu.marginalia.crawl.retreival.CrawlDataReference;
 import nu.marginalia.crawl.retreival.CrawlDelayTimer;
@@ -11,17 +11,23 @@ import nu.marginalia.crawl.retreival.DomainCrawlFrontier;
 import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawledDocument;
-import org.jsoup.Jsoup;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
 
 /** This class encapsulates the logic for re-visiting a domain that has already been crawled.
  * We may use information from the previous crawl to inform the next crawl, specifically the
  * E-Tag and Last-Modified headers.
  */
 public class CrawlerRevisitor {
 
     private final DomainCrawlFrontier crawlFrontier;
     private final CrawlerRetreiver crawlerRetreiver;
     private final WarcRecorder warcRecorder;
 
+    private static final Logger logger = LoggerFactory.getLogger(CrawlerRevisitor.class);
+
     public CrawlerRevisitor(DomainCrawlFrontier crawlFrontier,
                             CrawlerRetreiver crawlerRetreiver,
                             WarcRecorder warcRecorder) {
@@ -31,7 +37,8 @@ public class CrawlerRevisitor {
     }
 
     /** Performs a re-crawl of old documents, comparing etags and last-modified */
-    public int recrawl(CrawlDataReference oldCrawlData,
+    public RecrawlMetadata recrawl(CrawlDataReference oldCrawlData,
+                       DomainCookies cookies,
                        SimpleRobotRules robotsRules,
                        CrawlDelayTimer delayTimer)
             throws InterruptedException {
@@ -39,19 +46,18 @@ public class CrawlerRevisitor {
         int retained = 0;
         int errors = 0;
         int skipped = 0;
+        int size = 0;
 
-        for (;;) {
+        for (CrawledDocument doc : oldCrawlData) {
             if (errors > 20) {
                 // If we've had too many errors, we'll stop trying to recrawl
                 break;
             }
 
-            CrawledDocument doc = oldCrawlData.nextDocument();
             if (Thread.interrupted()) {
                 throw new InterruptedException();
             }
 
-            if (doc == null)
-                break;
-
             // This Shouldn't Happen (TM)
             var urlMaybe = EdgeUrl.parse(doc.url);
             if (urlMaybe.isEmpty())
                 continue;
@@ -70,7 +76,7 @@ public class CrawlerRevisitor {
             // unlikely to produce anything meaningful for us.
             if (doc.httpStatus != 200)
                 continue;
-            if (Strings.isNullOrEmpty(doc.documentBody))
+            if (!doc.hasBody())
                 continue;
 
             if (!crawlFrontier.filterLink(url))
@@ -84,6 +90,7 @@ public class CrawlerRevisitor {
                 continue;
             }
 
+            size++;
 
             double skipProb;
 
@@ -117,14 +124,20 @@ public class CrawlerRevisitor {
                     // fashion to make sure we eventually catch changes over time
                     // and ensure we discover new links
 
-                    // Hoover up any links from the document
-                    crawlFrontier.enqueueLinksFromDocument(url, Jsoup.parse(doc.documentBody));
+                    try {
+                        // Hoover up any links from the document
+                        crawlFrontier.enqueueLinksFromDocument(url, doc.parseBody());
+
+                    }
+                    catch (IOException ex) {
+                        //
+                    }
                     // Add a WARC record so we don't repeat this
                     warcRecorder.writeReferenceCopy(url,
+                            cookies,
                             doc.contentType,
                             doc.httpStatus,
-                            doc.documentBody,
+                            doc.documentBodyBytes,
                             doc.headers,
                             new ContentTags(doc.etagMaybe, doc.lastModifiedMaybe)
                     );
@@ -146,11 +159,15 @@ public class CrawlerRevisitor {
                 else if (result instanceof HttpFetchResult.ResultException) {
                     errors++;
                 }
 
                 recrawled++;
             }
         }
 
-        return recrawled;
+        logger.info("Recrawl summary {}: {} recrawled, {} retained, {} errors, {} skipped",
+                crawlFrontier.getDomain(), recrawled, retained, errors, skipped);
+
+        return new RecrawlMetadata(size, errors, skipped);
     }
 
+    public record RecrawlMetadata(int size, int errors, int skipped) {}
 }
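As the CrawlerRevisitor javadoc above says, re-visits are driven by the E-Tag and Last-Modified headers seen on the previous crawl. The snippet below is only an illustration of that conditional-revisit idea using the JDK's own HttpClient; the crawler itself goes through its HttpFetcher and WARC recorder, and the URL and header values here are placeholders.

// Illustrative sketch: send the previously seen ETag / Last-Modified values
// and treat a 304 response as "document unchanged".
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConditionalRevisitSketch {
    public static void main(String[] args) throws Exception {
        String previousEtag = "\"abc123\"";                         // from the last crawl (placeholder)
        String previousLastModified = "Tue, 01 Oct 2024 00:00:00 GMT"; // placeholder

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/"))
                .header("If-None-Match", previousEtag)
                .header("If-Modified-Since", previousLastModified)
                .GET()
                .build();

        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

        if (response.statusCode() == 304) {
            System.out.println("Not modified; reuse the reference copy from the previous crawl");
        } else {
            System.out.println("Changed; got " + response.body().length + " bytes to reprocess");
        }
    }
}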
@@ -2,12 +2,11 @@ package nu.marginalia.crawl.retreival.revisit;
 
 import nu.marginalia.crawl.fetcher.ContentTags;
 import nu.marginalia.crawl.retreival.CrawlDataReference;
-import nu.marginalia.model.body.DocumentBodyExtractor;
-import nu.marginalia.model.body.DocumentBodyResult;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawledDocument;
 
 import javax.annotation.Nullable;
+import java.util.Objects;
 
 public record DocumentWithReference(
         @Nullable CrawledDocument doc,
@@ -35,21 +34,31 @@
             return false;
         if (doc == null)
             return false;
-        if (doc.documentBody == null)
-            return false;
+        if (doc.documentBodyBytes.length == 0) {
+            if (doc.httpStatus < 300) {
+                return resultOk.bytesLength() == 0;
+            }
+            else if (doc.httpStatus == 301 || doc.httpStatus == 302 || doc.httpStatus == 307) {
+                @Nullable
+                String docLocation = doc.getHeader("Location");
+                @Nullable
+                String resultLocation = resultOk.header("Location");
 
-        if (!(DocumentBodyExtractor.asString(resultOk) instanceof DocumentBodyResult.Ok<String> bodyOk)) {
-            return false;
+                return Objects.equals(docLocation, resultLocation);
+            }
+            else {
+                return doc.httpStatus == resultOk.statusCode();
+            }
         }
 
-        return CrawlDataReference.isContentBodySame(doc.documentBody, bodyOk.body());
+        return CrawlDataReference.isContentBodySame(doc.documentBodyBytes, resultOk.bytesRaw());
     }
 
     public ContentTags getContentTags() {
         if (null == doc)
             return ContentTags.empty();
 
-        if (doc.documentBody == null || doc.httpStatus != 200)
+        if (doc.documentBodyBytes.length == 0 || doc.httpStatus != 200)
             return ContentTags.empty();
 
         String lastmod = doc.getLastModified();
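The rewritten comparison above treats two empty-bodied documents with a 301/302/307 status as identical only when their Location headers agree, using a null-safe check. A tiny standalone sketch of that comparison follows; the method and class names are illustrative, not the project's.

import java.util.Objects;

public class RedirectComparisonSketch {
    // Objects.equals handles the case where either Location header is missing.
    static boolean sameRedirect(String oldLocation, String newLocation) {
        return Objects.equals(oldLocation, newLocation);
    }

    public static void main(String[] args) {
        System.out.println(sameRedirect("/new-path", "/new-path")); // true
        System.out.println(sameRedirect("/new-path", null));        // false
        System.out.println(sameRedirect(null, null));               // true
    }
}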
@@ -1,72 +0,0 @@
-package nu.marginalia.crawl.retreival.sitemap;
-
-import crawlercommons.robots.SimpleRobotRules;
-import nu.marginalia.crawl.fetcher.SitemapRetriever;
-import nu.marginalia.crawl.retreival.DomainCrawlFrontier;
-import nu.marginalia.model.EdgeUrl;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import java.util.HashSet;
-import java.util.List;
-import java.util.Optional;
-import java.util.Set;
-
-public class SitemapFetcher {
-
-    private final DomainCrawlFrontier crawlFrontier;
-    private final SitemapRetriever sitemapRetriever;
-    private static final Logger logger = LoggerFactory.getLogger(SitemapFetcher.class);
-
-    public SitemapFetcher(DomainCrawlFrontier crawlFrontier, SitemapRetriever sitemapRetriever) {
-        this.crawlFrontier = crawlFrontier;
-        this.sitemapRetriever = sitemapRetriever;
-    }
-
-    public void downloadSitemaps(SimpleRobotRules robotsRules, EdgeUrl rootUrl) {
-        List<String> urls = robotsRules.getSitemaps();
-
-        if (urls.isEmpty()) {
-            urls = List.of(rootUrl.withPathAndParam("/sitemap.xml", null).toString());
-        }
-
-        downloadSitemaps(urls);
-    }
-
-    public void downloadSitemaps(List<String> urls) {
-
-        Set<String> checkedSitemaps = new HashSet<>();
-
-        for (var rawUrl : urls) {
-            Optional<EdgeUrl> parsedUrl = EdgeUrl.parse(rawUrl);
-            if (parsedUrl.isEmpty()) {
-                continue;
-            }
-
-            EdgeUrl url = parsedUrl.get();
-
-            // Let's not download sitemaps from other domains for now
-            if (!crawlFrontier.isSameDomain(url)) {
-                continue;
-            }
-
-            if (checkedSitemaps.contains(url.path))
-                continue;
-
-            var sitemap = sitemapRetriever.fetchSitemap(url);
-            if (sitemap.isEmpty()) {
-                continue;
-            }
-
-            // ensure we don't try to download this sitemap again
-            // (don't move this up, as we may want to check the same
-            // path with different protocols until we find one that works)
-
-            checkedSitemaps.add(url.path);
-
-            crawlFrontier.addAllToQueue(sitemap);
-        }
-
-        logger.debug("Queue is now {}", crawlFrontier.queueSize());
-    }
-}
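The deleted SitemapFetcher above handled sitemap discovery: take the sitemap URLs listed in robots.txt, fall back to /sitemap.xml at the domain root, keep to the same domain, and de-duplicate so the same sitemap is not fetched twice. The sketch below restates that discovery behaviour in a simplified, standalone form, using plain strings instead of the project's EdgeUrl and SimpleRobotRules types.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SitemapDiscoverySketch {
    // Prefer robots.txt-listed sitemaps; otherwise fall back to /sitemap.xml.
    static List<String> sitemapsToFetch(List<String> robotsTxtSitemaps, String rootUrl) {
        List<String> candidates = robotsTxtSitemaps.isEmpty()
                ? List.of(rootUrl + "/sitemap.xml")
                : robotsTxtSitemaps;

        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String url : candidates) {
            if (seen.add(url)) {   // skip duplicates, mirroring the checkedSitemaps set above
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sitemapsToFetch(List.of(), "https://example.com"));
        System.out.println(sitemapsToFetch(
                List.of("https://example.com/sitemap.xml", "https://example.com/sitemap.xml"),
                "https://example.com"));
    }
}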
@@ -32,15 +32,17 @@ dependencies {
     implementation libs.bundles.parquet
 
     implementation libs.trove
     implementation libs.slop
     implementation libs.jwarc
     implementation libs.gson
     implementation libs.commons.io
     implementation libs.commons.lang3
     implementation libs.okhttp3
     implementation libs.jsoup
     implementation libs.snakeyaml
     implementation libs.zstd
 
+    implementation libs.bundles.httpcomponents
+
     testImplementation libs.bundles.slf4j.test
     testImplementation libs.bundles.junit
     testImplementation libs.mockito
@@ -6,6 +6,7 @@ public class ContentTypes {
     public static final Set<String> acceptedContentTypes = Set.of("application/xhtml+xml",
             "application/xhtml",
             "text/html",
+            "application/pdf",
             "image/x-icon",
             "text/plain");
 
@@ -19,4 +20,9 @@
         return false;
     }
 
+    public static boolean isBinary(String contentTypeHeader) {
+        String lcHeader = contentTypeHeader.toLowerCase();
+        return lcHeader.startsWith("application/pdf");
+    }
+
 }
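The new ContentTypes.isBinary check above treats only PDF responses as binary for now; everything else stays on the text path. Below is a standalone sketch of the same check and how a caller might branch on it; the class name and the branching code are illustrative, not taken from the project.

public class BinaryContentSketch {
    // Same rule as the isBinary() added in the diff: only application/pdf counts as binary.
    static boolean isBinary(String contentTypeHeader) {
        return contentTypeHeader.toLowerCase().startsWith("application/pdf");
    }

    public static void main(String[] args) {
        System.out.println(isBinary("application/pdf"));                 // true
        System.out.println(isBinary("Application/PDF; charset=binary")); // true (case-insensitive, ignores parameters)
        System.out.println(isBinary("text/html; charset=utf-8"));        // false: decode and parse as HTML
    }
}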
Some files were not shown because too many files have changed in this diff.