Mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git (synced 2025-10-06 07:32:38 +02:00)

Comparing deploy-004...deploy-008 (100 commits)
Commits (SHA1; author, date, and message columns were empty in this view):
7671f0d9e4, 44d6bc71b7, 9d302e2973, f553701224, f076d05595, b513809710, 7519b28e21, 3eac4dd57f, 4c2810720a, 8480ba8daa,
fbba392491, 530eb35949, c2dd2175a2, b8581b0f56, 2ea34767d8, e9af838231, ae0cad47c4, 5fbc8ef998, 32c6dd9e6a, 6ece6a6cfb,
39cd1c18f8, eb65daaa88, 0bebdb6e33, 1e50e392c6, fb673de370, eee73ab16c, 5354e034bf, 72384ad6ca, a2b076f9be, c8b0a32c0f,
f0d74aa3bb, 74a1f100f4, eb049658e4, db138b2a6f, 1673fc284c, 503ea57d5b, 18ca926c7f, db99242db2, 2b9d2985ba, eeb6ecd711,
1f58aeadbf, 3d68be64da, 668f3b16ef, 98a340a0d1, 8862100f7e, 274941f6de, abec83582d, 569520c9b6, 088310e998, 270cab874b,
4c74e280d3, 5b347e17ac, 55d6ab933f, 43b74e9706, 579a115243, 2c67f50a43, 78a958e2b0, 4e939389b2, e67a9bdb91, 567e4e1237,
4342e42722, bc818056e6, de2feac238, 1e770205a5, e44ecd6d69, 5b93a0e633, 08fb0e5efe, bcf67782ea, ef3f175ede, bbe4b5d9fd,
c67a635103, 20b24133fb, f2567677e8, bc2c2061f2, 1c7f5a31a5, 59a8ea60f7, aa9b1244ea, 2d17233366, b245cc9f38, 6614d05bdf,
55aeb03c4a, faa589962f, c7edd6b39f, 79da622e3b, 3da8337ba6, a32d230f0a, 3772bfd387, 02a7900d1a, a1fb92468f, b7f0a2a98e,
5fb76b2e79, ad8c97f342, dc1b6373eb, 983d6d067c, a84a06975c, d2864c13ec, 47e58a21c6, 3714104976, f6f036b9b1, b510b7feb8
ROADMAP.md (39 changes)
@@ -1,4 +1,4 @@
-# Roadmap 2024-2025
+# Roadmap 2025
 
 This is a roadmap with major features planned for Marginalia Search.
 
@@ -30,12 +30,6 @@ Retaining the ability to independently crawl the web is still strongly desirable
 The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
 combined with naive bayesian filter would go a long way, or something more sophisticated...?
 
-## Web Design Overhaul
-
-The design is kinda clunky and hard to maintain, and needlessly outdated-looking.
-
-In progress: PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127) -- demo available at https://test.marginalia.nu/
-
 ## Additional Language Support
 
 It would be desirable if the search engine supported more languages than English. This is partially about
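As an aside on the content-filtering item above: the blacklist would handle known-bad hosts, while a naive Bayes classifier could score everything else. The following is a minimal illustrative sketch only, not code from the repository; the class, its training data, and the class labels are hypothetical:

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical two-class naive Bayes over page tokens ("spicy" vs. "safe").
    class NaiveBayesFilter {
        private final Map<String, int[]> counts = new HashMap<>(); // token -> {spicy, safe}
        private int spicyDocs = 0;
        private int safeDocs = 0;

        void train(Collection<String> tokens, boolean spicy) {
            if (spicy) spicyDocs++; else safeDocs++;
            for (String t : tokens) {
                counts.computeIfAbsent(t, k -> new int[2])[spicy ? 0 : 1]++;
            }
        }

        /** Positive score means the page looks more "spicy" than "safe". */
        double score(Collection<String> tokens) {
            // Log prior ratio, with add-one smoothing on the class counts
            double s = Math.log((spicyDocs + 1.0) / (safeDocs + 1.0));
            for (String t : tokens) {
                int[] c = counts.getOrDefault(t, new int[2]);
                // Laplace-smoothed per-token likelihood ratio, so unseen
                // tokens don't zero out the probability
                s += Math.log((c[0] + 1.0) / (spicyDocs + 2.0));
                s -= Math.log((c[1] + 1.0) / (safeDocs + 2.0));
            }
            return s;
        }
    }

A threshold on score() would then decide whether a result is suppressed or down-ranked.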
@@ -62,8 +56,31 @@ filter for any API consumer.
 
 I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
 
+## Show favicons next to search results
+
+This is expected from search engines. Basic proof of concept sketch of fetching this data has been done, but the feature is some way from being reality.
+
+## Specialized crawler for github
+
+One of the search engine's biggest limitations right now is that it does not index github at all. A specialized crawler that fetches at least the readme.md would go a long way toward providing search capabilities in this domain.
+
 # Completed
 
+## Web Design Overhaul (COMPLETED 2025-01)
+
+The design is kinda clunky and hard to maintain, and needlessly outdated-looking.
+
+PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127)
+
+## Finalize RSS support (COMPLETED 2024-11)
+
+Marginalia has experimental RSS preview support for a few domains. This works well and
+it should be extended to all domains. It would also be interesting to offer search of the
+RSS data itself, or use the RSS set to feed a special live index that updates faster than the
+main dataset.
+
+Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
+
 ## Proper Position Index (COMPLETED 2024-09)
 
 The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
@@ -76,11 +93,3 @@ list, as is the civilized way of doing this.
 
 Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
 
-## Finalize RSS support (COMPLETED 2024-11)
-
-Marginalia has experimental RSS preview support for a few domains. This works well and
-it should be extended to all domains. It would also be interesting to offer search of the
-RSS data itself, or use the RSS set to feed a special live index that updates faster than the
-main dataset.
-
-Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
@@ -5,7 +5,7 @@ plugins {
 
     // This is a workaround for a bug in the Jib plugin that causes it to stall randomly
     // https://github.com/GoogleContainerTools/jib/issues/3347
-    id 'com.google.cloud.tools.jib' version '3.4.3' apply(false)
+    id 'com.google.cloud.tools.jib' version '3.4.4' apply(false)
 }
 
 group 'marginalia'
@@ -47,7 +47,7 @@ ext {
     dockerImageBase='container-registry.oracle.com/graalvm/jdk:23'
     dockerImageTag='latest'
     dockerImageRegistry='marginalia'
-    jibVersion = '3.4.3'
+    jibVersion = '3.4.4'
 
 }
 
@@ -24,58 +24,4 @@ public class LanguageModels {
         this.fasttextLanguageModel = fasttextLanguageModel;
         this.segments = segments;
     }
-
-    public static LanguageModelsBuilder builder() {
-        return new LanguageModelsBuilder();
-    }
-
-    public static class LanguageModelsBuilder {
-        private Path termFrequencies;
-        private Path openNLPSentenceDetectionData;
-        private Path posRules;
-        private Path posDict;
-        private Path fasttextLanguageModel;
-        private Path segments;
-
-        LanguageModelsBuilder() {
-        }
-
-        public LanguageModelsBuilder termFrequencies(Path termFrequencies) {
-            this.termFrequencies = termFrequencies;
-            return this;
-        }
-
-        public LanguageModelsBuilder openNLPSentenceDetectionData(Path openNLPSentenceDetectionData) {
-            this.openNLPSentenceDetectionData = openNLPSentenceDetectionData;
-            return this;
-        }
-
-        public LanguageModelsBuilder posRules(Path posRules) {
-            this.posRules = posRules;
-            return this;
-        }
-
-        public LanguageModelsBuilder posDict(Path posDict) {
-            this.posDict = posDict;
-            return this;
-        }
-
-        public LanguageModelsBuilder fasttextLanguageModel(Path fasttextLanguageModel) {
-            this.fasttextLanguageModel = fasttextLanguageModel;
-            return this;
-        }
-
-        public LanguageModelsBuilder segments(Path segments) {
-            this.segments = segments;
-            return this;
-        }
-
-        public LanguageModels build() {
-            return new LanguageModels(this.termFrequencies, this.openNLPSentenceDetectionData, this.posRules, this.posDict, this.fasttextLanguageModel, this.segments);
-        }
-
-        public String toString() {
-            return "LanguageModels.LanguageModelsBuilder(termFrequencies=" + this.termFrequencies + ", openNLPSentenceDetectionData=" + this.openNLPSentenceDetectionData + ", posRules=" + this.posRules + ", posDict=" + this.posDict + ", fasttextLanguageModel=" + this.fasttextLanguageModel + ", segments=" + this.segments + ")";
-        }
-    }
 }
@@ -20,7 +20,10 @@ public class DbDomainQueries {
     private final HikariDataSource dataSource;
 
     private static final Logger logger = LoggerFactory.getLogger(DbDomainQueries.class);
 
     private final Cache<EdgeDomain, Integer> domainIdCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
+    private final Cache<Integer, EdgeDomain> domainNameCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
+    private final Cache<String, List<DomainWithNode>> siblingsCache = CacheBuilder.newBuilder().maximumSize(10_000).build();
 
     @Inject
     public DbDomainQueries(HikariDataSource dataSource)
@@ -30,16 +33,21 @@ public class DbDomainQueries {
 
 
     public Integer getDomainId(EdgeDomain domain) throws NoSuchElementException {
-        try (var connection = dataSource.getConnection()) {
+        try {
 
             return domainIdCache.get(domain, () -> {
-                try (var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
+                try (var connection = dataSource.getConnection();
+                     var stmt = connection.prepareStatement("SELECT ID FROM EC_DOMAIN WHERE DOMAIN_NAME=?")) {
 
                     stmt.setString(1, domain.toString());
                     var rsp = stmt.executeQuery();
                    if (rsp.next()) {
                         return rsp.getInt(1);
                     }
                 }
+                catch (SQLException ex) {
+                    throw new RuntimeException(ex);
+                }
 
                 throw new NoSuchElementException();
             });
         }
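A note on the restructuring above: Guava's Cache.get(key, loader) runs the loader only on a cache miss, and a checked exception thrown inside the loader surfaces at the call site wrapped in an ExecutionException; that is why the SQLException handling moves inside the loader and the outer try-with-resources on the connection goes away. A minimal illustrative sketch (expensiveLookup is a hypothetical stand-in):

    Cache<String, Integer> cache = CacheBuilder.newBuilder().maximumSize(10_000).build();

    // Runs the loader only if "somekey" is absent; a checked exception thrown
    // inside the loader is rethrown here wrapped in a checked ExecutionException.
    Integer id = cache.get("somekey", () -> expensiveLookup("somekey"));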
@@ -49,9 +57,6 @@ public class DbDomainQueries {
         catch (ExecutionException ex) {
             throw new RuntimeException(ex.getCause());
         }
-        catch (SQLException ex) {
-            throw new RuntimeException(ex);
-        }
     }
 
     public OptionalInt tryGetDomainId(EdgeDomain domain) {
@@ -84,31 +89,38 @@ public class DbDomainQueries {
     }
 
     public Optional<EdgeDomain> getDomain(int id) {
-        try (var connection = dataSource.getConnection()) {
+
+        EdgeDomain existing = domainNameCache.getIfPresent(id);
+        if (existing != null) {
+            return Optional.of(existing);
+        }
 
+        try (var connection = dataSource.getConnection()) {
             try (var stmt = connection.prepareStatement("SELECT DOMAIN_NAME FROM EC_DOMAIN WHERE ID=?")) {
                 stmt.setInt(1, id);
                 var rsp = stmt.executeQuery();
                 if (rsp.next()) {
-                    return Optional.of(new EdgeDomain(rsp.getString(1)));
+                    var val = new EdgeDomain(rsp.getString(1));
+                    domainNameCache.put(id, val);
+                    return Optional.of(val);
                 }
                 return Optional.empty();
             }
         }
-        catch (UncheckedExecutionException ex) {
-            throw new RuntimeException(ex.getCause());
-        }
         catch (SQLException ex) {
             throw new RuntimeException(ex);
         }
     }
 
-    public List<EdgeDomain> otherSubdomains(EdgeDomain domain, int cnt) {
-        List<EdgeDomain> ret = new ArrayList<>();
+    public List<DomainWithNode> otherSubdomains(EdgeDomain domain, int cnt) throws ExecutionException {
+        String topDomain = domain.topDomain;
 
+        return siblingsCache.get(topDomain, () -> {
+            List<DomainWithNode> ret = new ArrayList<>();
 
             try (var conn = dataSource.getConnection();
-                 var stmt = conn.prepareStatement("SELECT DOMAIN_NAME FROM EC_DOMAIN WHERE DOMAIN_TOP = ? LIMIT ?")) {
-                stmt.setString(1, domain.topDomain);
+                 var stmt = conn.prepareStatement("SELECT DOMAIN_NAME, NODE_AFFINITY FROM EC_DOMAIN WHERE DOMAIN_TOP = ? LIMIT ?")) {
+                stmt.setString(1, topDomain);
                 stmt.setInt(2, cnt);
 
                 var rs = stmt.executeQuery();
@@ -118,12 +130,19 @@ public class DbDomainQueries {
                     if (sibling.equals(domain))
                         continue;
 
-                    ret.add(sibling);
+                    ret.add(new DomainWithNode(sibling, rs.getInt(2)));
                 }
             } catch (SQLException e) {
                 logger.error("Failed to get domain neighbors");
             }
 
             return ret;
+        });
+
+    }
+
+    public record DomainWithNode (EdgeDomain domain, int nodeAffinity) {
+        public boolean isIndexed() {
+            return nodeAffinity > 0;
+        }
     }
 }
@@ -1,118 +0,0 @@
-package nu.marginalia.db;
-
-import com.zaxxer.hikari.HikariDataSource;
-
-import java.sql.Connection;
-import java.sql.PreparedStatement;
-import java.sql.SQLException;
-import java.util.ArrayList;
-import java.util.List;
-import java.util.OptionalInt;
-
-/** Class used in exporting data. This is intended to be used for a brief time
- * and then discarded, not kept around as a service.
- */
-public class DbDomainStatsExportMultitool implements AutoCloseable {
-    private final Connection connection;
-    private final int nodeId;
-    private final PreparedStatement knownUrlsQuery;
-    private final PreparedStatement visitedUrlsQuery;
-    private final PreparedStatement goodUrlsQuery;
-    private final PreparedStatement domainNameToId;
-
-    private final PreparedStatement allDomainsQuery;
-    private final PreparedStatement crawlQueueDomains;
-    private final PreparedStatement indexedDomainsQuery;
-
-    public DbDomainStatsExportMultitool(HikariDataSource dataSource, int nodeId) throws SQLException {
-        this.connection = dataSource.getConnection();
-        this.nodeId = nodeId;
-
-        knownUrlsQuery = connection.prepareStatement("""
-                SELECT KNOWN_URLS
-                FROM EC_DOMAIN INNER JOIN DOMAIN_METADATA
-                ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
-                WHERE DOMAIN_NAME=?
-                """);
-        visitedUrlsQuery = connection.prepareStatement("""
-                SELECT VISITED_URLS
-                FROM EC_DOMAIN INNER JOIN DOMAIN_METADATA
-                ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
-                WHERE DOMAIN_NAME=?
-                """);
-        goodUrlsQuery = connection.prepareStatement("""
-                SELECT GOOD_URLS
-                FROM EC_DOMAIN INNER JOIN DOMAIN_METADATA
-                ON EC_DOMAIN.ID=DOMAIN_METADATA.ID
-                WHERE DOMAIN_NAME=?
-                """);
-        domainNameToId = connection.prepareStatement("""
-                SELECT ID
-                FROM EC_DOMAIN
-                WHERE DOMAIN_NAME=?
-                """);
-        allDomainsQuery = connection.prepareStatement("""
-                SELECT DOMAIN_NAME
-                FROM EC_DOMAIN
-                """);
-        crawlQueueDomains = connection.prepareStatement("""
-                SELECT DOMAIN_NAME
-                FROM CRAWL_QUEUE
-                """);
-        indexedDomainsQuery = connection.prepareStatement("""
-                SELECT DOMAIN_NAME
-                FROM EC_DOMAIN
-                WHERE INDEXED > 0
-                """);
-    }
-
-    public OptionalInt getVisitedUrls(String domainName) throws SQLException {
-        return executeNameToIntQuery(domainName, visitedUrlsQuery);
-    }
-
-    public OptionalInt getDomainId(String domainName) throws SQLException {
-        return executeNameToIntQuery(domainName, domainNameToId);
-    }
-
-    public List<String> getCrawlQueueDomains() throws SQLException {
-        return executeListQuery(crawlQueueDomains, 100);
-    }
-    public List<String> getAllIndexedDomains() throws SQLException {
-        return executeListQuery(indexedDomainsQuery, 100_000);
-    }
-
-    private OptionalInt executeNameToIntQuery(String domainName, PreparedStatement statement)
-            throws SQLException {
-        statement.setString(1, domainName);
-        var rs = statement.executeQuery();
-
-        if (rs.next()) {
-            return OptionalInt.of(rs.getInt(1));
-        }
-
-        return OptionalInt.empty();
-    }
-
-    private List<String> executeListQuery(PreparedStatement statement, int sizeHint) throws SQLException {
-        List<String> ret = new ArrayList<>(sizeHint);
-
-        var rs = statement.executeQuery();
-
-        while (rs.next()) {
-            ret.add(rs.getString(1));
-        }
-
-        return ret;
-    }
-
-    @Override
-    public void close() throws SQLException {
-        knownUrlsQuery.close();
-        goodUrlsQuery.close();
-        visitedUrlsQuery.close();
-        allDomainsQuery.close();
-        crawlQueueDomains.close();
-        domainNameToId.close();
-        connection.close();
-    }
-}
@@ -83,6 +83,11 @@ public class QueryParams {
         if (path.endsWith("StoryView.py")) { // folklore.org is neat
             return param.startsWith("project=") || param.startsWith("story=");
         }
+
+        // www.perseus.tufts.edu:
+        if (param.startsWith("collection=")) return true;
+        if (param.startsWith("doc=")) return true;
+
         return false;
     }
 }
@@ -10,7 +10,9 @@ import java.nio.charset.StandardCharsets;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.time.LocalDateTime;
-import java.util.*;
+import java.util.HashSet;
+import java.util.Optional;
+import java.util.Set;
 import java.util.function.Function;
 
 /** WorkLog is a journal of work done by a process,
@@ -61,6 +63,12 @@ public class WorkLog implements AutoCloseable, Closeable {
         return new WorkLoadIterable<>(logFile, mapper);
     }
 
+    public static int countEntries(Path crawlerLog) throws IOException {
+        try (var linesStream = Files.lines(crawlerLog)) {
+            return (int) linesStream.filter(WorkLogEntry::isJobId).count();
+        }
+    }
+
     // Use synchro over concurrent set to avoid competing writes
     // - correct is better than fast here, it's sketchy enough to use
     // a PrintWriter
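The new countEntries helper streams the log lazily and counts only the lines that are actual job entries (WorkLogEntry::isJobId), so callers can size a progress indicator without loading the whole file; the try-with-resources is needed because Files.lines keeps the underlying file open until the stream is closed.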
@@ -89,7 +89,7 @@ public class DatabaseModule extends AbstractModule {
         config.addDataSourceProperty("prepStmtCacheSize", "250");
         config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");
 
-        config.setMaximumPoolSize(5);
+        config.setMaximumPoolSize(Integer.getInteger("db.poolSize", 5));
         config.setMinimumIdle(2);
 
         config.setMaxLifetime(Duration.ofMinutes(9).toMillis());
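Integer.getInteger reads a JVM system property, so the HikariCP pool size can now be overridden at launch without a rebuild, e.g. java -Ddb.poolSize=10 ..., falling back to the previous hard-coded value of 5 when the property is unset.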
@@ -6,6 +6,7 @@ import nu.marginalia.service.ServiceId;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import java.io.IOException;
 import java.net.InetAddress;
 import java.net.NetworkInterface;
 import java.util.Enumeration;
@@ -115,7 +116,7 @@ public class ServiceConfigurationModule extends AbstractModule {
         }
     }
 
-    public static String getLocalNetworkIP() throws Exception {
+    public static String getLocalNetworkIP() throws IOException {
         Enumeration<NetworkInterface> nets = NetworkInterface.getNetworkInterfaces();
 
         while (nets.hasMoreElements()) {
@@ -15,6 +15,7 @@ import org.slf4j.LoggerFactory;
 import org.slf4j.Marker;
 import org.slf4j.MarkerFactory;
 
+import java.nio.file.Files;
 import java.nio.file.Path;
 import java.nio.file.Paths;
 import java.util.List;
@@ -106,9 +107,12 @@ public class JoobyService {
                 config.externalAddress());
 
         // FIXME: This won't work outside of docker, may need to submit a PR to jooby to allow classpaths here
+        if (Files.exists(Path.of("/app/resources/jte")) || Files.exists(Path.of("/app/classes/jte-precompiled"))) {
             jooby.install(new JteModule(Path.of("/app/resources/jte"), Path.of("/app/classes/jte-precompiled")));
+        }
+        if (Files.exists(Path.of("/app/resources/static"))) {
             jooby.assets("/*", Paths.get("/app/resources/static"));
+        }
         var options = new ServerOptions();
         options.setHost(config.bindAddress());
         options.setPort(restEndpoint.port());
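Guarding the JTE and static-asset installation behind Files.exists checks means the service can also start outside the Docker image, where the hard-coded /app paths mentioned in the FIXME above do not exist.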
@@ -6,17 +6,22 @@ import nu.marginalia.service.module.ServiceConfiguration;
 import org.eclipse.jetty.server.Server;
 import org.eclipse.jetty.servlet.ServletContextHandler;
 import org.eclipse.jetty.servlet.ServletHolder;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 import java.net.InetSocketAddress;
 
 public class MetricsServer {
 
+    private static Logger logger = LoggerFactory.getLogger(MetricsServer.class);
+
     @Inject
-    public MetricsServer(ServiceConfiguration configuration) throws Exception {
+    public MetricsServer(ServiceConfiguration configuration) {
         // If less than zero, we forego setting up a metrics server
         if (configuration.metricsPort() < 0)
             return;
 
+        try {
             Server server = new Server(new InetSocketAddress(configuration.bindAddress(), configuration.metricsPort()));
 
             ServletContextHandler context = new ServletContextHandler();
@@ -27,4 +32,8 @@ public class MetricsServer {
 
             server.start();
         }
+        catch (Exception|NoSuchMethodError ex) {
+            logger.error("Failed to set up metrics server", ex);
+        }
+    }
 }
@@ -20,6 +20,7 @@ public enum ExecutorActor {
     EXPORT_FEEDS(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
     EXPORT_SAMPLE_DATA(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
     DOWNLOAD_SAMPLE(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
+    MIGRATE_CRAWL_DATA(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED),
 
     PROC_CONVERTER_SPAWNER(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED, NodeProfile.SIDELOAD),
     PROC_LOADER_SPAWNER(NodeProfile.BATCH_CRAWL, NodeProfile.MIXED, NodeProfile.SIDELOAD),
@@ -66,6 +66,7 @@ public class ExecutorActorControlService {
                     DownloadSampleActor downloadSampleActor,
                     ScrapeFeedsActor scrapeFeedsActor,
                     ExecutorActorStateMachines stateMachines,
+                    MigrateCrawlDataActor migrateCrawlDataActor,
                     ExportAllPrecessionActor exportAllPrecessionActor,
                     UpdateRssActor updateRssActor) throws SQLException {
         this.messageQueueFactory = messageQueueFactory;
@@ -107,6 +108,8 @@ public class ExecutorActorControlService {
         register(ExecutorActor.SCRAPE_FEEDS, scrapeFeedsActor);
         register(ExecutorActor.UPDATE_RSS, updateRssActor);
 
+        register(ExecutorActor.MIGRATE_CRAWL_DATA, migrateCrawlDataActor);
+
         if (serviceConfiguration.node() == 1) {
             register(ExecutorActor.PREC_EXPORT_ALL, exportAllPrecessionActor);
         }
@@ -14,6 +14,8 @@ import nu.marginalia.mq.persistence.MqPersistence;
 import nu.marginalia.nodecfg.NodeConfigurationService;
 import nu.marginalia.nodecfg.model.NodeProfile;
 import nu.marginalia.service.module.ServiceConfiguration;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 import java.time.Duration;
 import java.time.LocalDateTime;
@@ -29,6 +31,7 @@ public class UpdateRssActor extends RecordActorPrototype {
 
     private final NodeConfigurationService nodeConfigurationService;
     private final MqPersistence persistence;
+    private static final Logger logger = LoggerFactory.getLogger(UpdateRssActor.class);
 
     @Inject
     public UpdateRssActor(Gson gson,
@@ -101,8 +104,8 @@ public class UpdateRssActor extends RecordActorPrototype {
             case UpdateRefresh(int count, long msgId) -> {
                 MqMessage msg = persistence.waitForMessageTerminalState(msgId, Duration.ofSeconds(10), Duration.ofHours(12));
                 if (msg == null) {
-                    // Retry the update
-                    yield new Error("Failed to update feeds: message not found");
+                    logger.warn("UpdateRefresh is taking a very long time");
+                    yield new UpdateRefresh(count, msgId);
                 } else if (msg.state() != MqMessageState.OK) {
                     // Retry the update
                     yield new Error("Failed to update feeds: " + msg.state());
@@ -119,8 +122,8 @@ public class UpdateRssActor extends RecordActorPrototype {
             case UpdateClean(long msgId) -> {
                 MqMessage msg = persistence.waitForMessageTerminalState(msgId, Duration.ofSeconds(10), Duration.ofHours(12));
                 if (msg == null) {
-                    // Retry the update
-                    yield new Error("Failed to update feeds: message not found");
+                    logger.warn("UpdateClean is taking a very long time");
+                    yield new UpdateClean(msgId);
                 } else if (msg.state() != MqMessageState.OK) {
                     // Retry the update
                     yield new Error("Failed to update feeds: " + msg.state());
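The two hunks above change the timeout behavior: when waitForMessageTerminalState returns null (the wait elapsed without the message reaching a terminal state), the actor now re-yields the same step and keeps waiting for a slow feed update instead of failing it, logging a warning on each pass; a message that does terminate in a non-OK state still yields an Error as before.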
@@ -0,0 +1,150 @@
+package nu.marginalia.actor.task;
+
+import com.google.gson.Gson;
+import jakarta.inject.Inject;
+import jakarta.inject.Singleton;
+import nu.marginalia.actor.prototype.RecordActorPrototype;
+import nu.marginalia.actor.state.ActorStep;
+import nu.marginalia.io.CrawlerOutputFile;
+import nu.marginalia.process.log.WorkLog;
+import nu.marginalia.process.log.WorkLogEntry;
+import nu.marginalia.service.control.ServiceHeartbeat;
+import nu.marginalia.slop.SlopCrawlDataRecord;
+import nu.marginalia.storage.FileStorageService;
+import nu.marginalia.storage.model.FileStorage;
+import nu.marginalia.storage.model.FileStorageId;
+import org.apache.logging.log4j.util.Strings;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardCopyOption;
+import java.util.Map;
+import java.util.Optional;
+import java.util.function.Function;
+
+@Singleton
+public class MigrateCrawlDataActor extends RecordActorPrototype {
+
+    private final FileStorageService fileStorageService;
+    private final ServiceHeartbeat serviceHeartbeat;
+    private static final Logger logger = LoggerFactory.getLogger(MigrateCrawlDataActor.class);
+
+    @Inject
+    public MigrateCrawlDataActor(Gson gson, FileStorageService fileStorageService, ServiceHeartbeat serviceHeartbeat) {
+        super(gson);
+
+        this.fileStorageService = fileStorageService;
+        this.serviceHeartbeat = serviceHeartbeat;
+    }
+
+    public record Run(long fileStorageId) implements ActorStep {}
+
+    @Override
+    public ActorStep transition(ActorStep self) throws Exception {
+        return switch (self) {
+            case Run(long fileStorageId) -> {
+
+                FileStorage storage = fileStorageService.getStorage(FileStorageId.of(fileStorageId));
+                Path root = storage.asPath();
+
+                Path crawlerLog = root.resolve("crawler.log");
+                Path newCrawlerLog = Files.createTempFile(root, "crawler", ".migrate.log");
+
+                int totalEntries = WorkLog.countEntries(crawlerLog);
+
+                try (WorkLog workLog = new WorkLog(newCrawlerLog);
+                     var heartbeat = serviceHeartbeat.createServiceAdHocTaskHeartbeat("Migrating")
+                ) {
+                    int entryIdx = 0;
+
+                    for (Map.Entry<WorkLogEntry, Path> item : WorkLog.iterableMap(crawlerLog, new CrawlDataLocator(root))) {
+
+                        final WorkLogEntry entry = item.getKey();
+                        final Path inputPath = item.getValue();
+
+                        Path outputPath = inputPath;
+                        heartbeat.progress("Migrating" + inputPath.getFileName(), entryIdx++, totalEntries);
+
+                        if (inputPath.toString().endsWith(".parquet")) {
+                            String domain = entry.id();
+                            String id = Integer.toHexString(domain.hashCode());
+
+                            outputPath = CrawlerOutputFile.createSlopPath(root, id, domain);
+
+                            if (Files.exists(inputPath)) {
+                                try {
+                                    SlopCrawlDataRecord.convertFromParquet(inputPath, outputPath);
+                                    Files.deleteIfExists(inputPath);
+                                } catch (Exception ex) {
+                                    outputPath = inputPath; // don't update the work log on error
+                                    logger.error("Failed to convert " + inputPath, ex);
+                                }
+                            }
+                            else if (!Files.exists(inputPath) && !Files.exists(outputPath)) {
+                                // if the input file is missing, and the output file is missing, we just write the log
+                                // record identical to the old one
+                                outputPath = inputPath;
+                            }
+                        }
+
+                        // Write a log entry for the (possibly) converted file
+                        workLog.setJobToFinished(entry.id(), outputPath.toString(), entry.cnt());
+                    }
+                }
+
+                Path oldCrawlerLog = Files.createTempFile(root, "crawler-", ".migrate.old.log");
+                Files.move(crawlerLog, oldCrawlerLog, StandardCopyOption.REPLACE_EXISTING);
+                Files.move(newCrawlerLog, crawlerLog);
+
+                yield new End();
+            }
+            default -> new Error();
+        };
+    }
+
+    private static class CrawlDataLocator implements Function<WorkLogEntry, Optional<Map.Entry<WorkLogEntry, Path>>> {
+
+        private final Path crawlRootDir;
+
+        CrawlDataLocator(Path crawlRootDir) {
+            this.crawlRootDir = crawlRootDir;
+        }
+
+        @Override
+        public Optional<Map.Entry<WorkLogEntry, Path>> apply(WorkLogEntry entry) {
+            var path = getCrawledFilePath(crawlRootDir, entry.path());
+
+            if (!Files.exists(path)) {
+                return Optional.empty();
+            }
+
+            try {
+                return Optional.of(Map.entry(entry, path));
+            }
+            catch (Exception ex) {
+                return Optional.empty();
+            }
+        }
+
+        private Path getCrawledFilePath(Path crawlDir, String fileName) {
+            int sp = fileName.lastIndexOf('/');
+
+            // Normalize the filename
+            if (sp >= 0 && sp + 1 < fileName.length())
+                fileName = fileName.substring(sp + 1);
+            if (fileName.length() < 4)
+                fileName = Strings.repeat("0", 4 - fileName.length()) + fileName;
+
+            String sp1 = fileName.substring(0, 2);
+            String sp2 = fileName.substring(2, 4);
+            return crawlDir.resolve(sp1).resolve(sp2).resolve(fileName);
+        }
+    }
+
+    @Override
+    public String describe() {
+        return "Migrates crawl data to the latest format";
+    }
+}
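Two details worth noting in the new actor: the work log is rewritten via a temp file and two renames (the original crawler.log is kept as a .migrate.old.log backup), so an interrupted migration never leaves a half-written log behind; and getCrawledFilePath shows the on-disk sharding scheme, where the first two and next two characters of the zero-padded file name select two levels of subdirectory under the crawl root.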
@@ -29,10 +29,12 @@ dependencies {
     implementation libs.jsoup
     implementation project(':third-party:rssreader')
    implementation libs.opencsv
+    implementation libs.slop
     implementation libs.sqlite
     implementation libs.bundles.slf4j
     implementation libs.commons.lang3
     implementation libs.commons.io
+    implementation libs.wiremock
 
     implementation libs.prometheus
     implementation libs.guava
@@ -1,6 +1,7 @@
 package nu.marginalia.livecapture;
 
 import com.google.gson.Gson;
+import nu.marginalia.WmsaHome;
 import nu.marginalia.model.gson.GsonFactory;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -12,10 +13,13 @@ import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 import java.time.Duration;
 import java.util.Map;
+import java.util.Optional;
 
 /** Client for local browserless.io API */
 public class BrowserlessClient implements AutoCloseable {
 
     private static final Logger logger = LoggerFactory.getLogger(BrowserlessClient.class);
+    private static final String BROWSERLESS_TOKEN = System.getProperty("live-capture.browserless-token", "BROWSERLESS_TOKEN");
+
     private final HttpClient httpClient = HttpClient.newBuilder()
             .version(HttpClient.Version.HTTP_1_1)
@@ -25,18 +29,21 @@ public class BrowserlessClient implements AutoCloseable {
     private final URI browserlessURI;
     private final Gson gson = GsonFactory.get();
 
+    private final String userAgent = WmsaHome.getUserAgent().uaString();
+
     public BrowserlessClient(URI browserlessURI) {
         this.browserlessURI = browserlessURI;
     }
 
-    public String content(String url, GotoOptions gotoOptions) throws IOException, InterruptedException {
+    public Optional<String> content(String url, GotoOptions gotoOptions) throws IOException, InterruptedException {
         Map<String, Object> requestData = Map.of(
                 "url", url,
+                "userAgent", userAgent,
                 "gotoOptions", gotoOptions
         );
 
         var request = HttpRequest.newBuilder()
-                .uri(browserlessURI.resolve("/content"))
+                .uri(browserlessURI.resolve("/content?token="+BROWSERLESS_TOKEN))
                 .method("POST", HttpRequest.BodyPublishers.ofString(
                         gson.toJson(requestData)
                 ))
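The token is read once from the live-capture.browserless-token system property (with a placeholder default) and appended to each endpoint as a token= query parameter, a form the browserless API accepts; it can be supplied at launch with -Dlive-capture.browserless-token=.... The requests now also forward the crawler's own user agent string, and content() returns Optional.empty() instead of null on failure.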
@@ -47,10 +54,10 @@ public class BrowserlessClient implements AutoCloseable {
 
         if (rsp.statusCode() >= 300) {
             logger.info("Failed to fetch content for {}, status {}", url, rsp.statusCode());
-            return null;
+            return Optional.empty();
         }
 
-        return rsp.body();
+        return Optional.of(rsp.body());
     }
 
     public byte[] screenshot(String url, GotoOptions gotoOptions, ScreenshotOptions screenshotOptions)
@@ -58,12 +65,13 @@ public class BrowserlessClient implements AutoCloseable {
 
         Map<String, Object> requestData = Map.of(
                 "url", url,
+                "userAgent", userAgent,
                 "options", screenshotOptions,
                 "gotoOptions", gotoOptions
         );
 
         var request = HttpRequest.newBuilder()
-                .uri(browserlessURI.resolve("/screenshot"))
+                .uri(browserlessURI.resolve("/screenshot?token="+BROWSERLESS_TOKEN))
                 .method("POST", HttpRequest.BodyPublishers.ofString(
                         gson.toJson(requestData)
                 ))
@@ -82,7 +90,7 @@ public class BrowserlessClient implements AutoCloseable {
     }
 
     @Override
-    public void close() throws Exception {
+    public void close() {
         httpClient.shutdownNow();
     }
 
@@ -1,6 +1,6 @@
 package nu.marginalia.rss.model;
 
-import com.apptasticsoftware.rssreader.Item;
+import nu.marginalia.rss.svc.SimpleFeedParser;
 import org.apache.commons.lang3.StringUtils;
 import org.jetbrains.annotations.NotNull;
 import org.jsoup.Jsoup;
@@ -18,37 +18,33 @@ public record FeedItem(String title,
     public static final int MAX_DESC_LENGTH = 255;
     public static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");
 
-    public static FeedItem fromItem(Item item, boolean keepFragment) {
-        String title = item.getTitle().orElse("");
+    public static FeedItem fromItem(SimpleFeedParser.ItemData item, boolean keepFragment) {
+        String title = item.title();
         String date = getItemDate(item);
         String description = getItemDescription(item);
         String url;
 
-        if (keepFragment || item.getLink().isEmpty()) {
-            url = item.getLink().orElse("");
+        if (keepFragment) {
+            url = item.url();
         }
         else {
             try {
-                String link = item.getLink().get();
+                String link = item.url();
                 var linkUri = new URI(link);
                 var cleanUri = new URI(linkUri.getScheme(), linkUri.getAuthority(), linkUri.getPath(), linkUri.getQuery(), null);
                 url = cleanUri.toString();
             }
             catch (Exception e) {
                 // fallback to original link if we can't clean it, this is not a very important step
-                url = item.getLink().get();
+                url = item.url();
             }
         }
 
         return new FeedItem(title, date, description, url);
     }
 
-    private static String getItemDescription(Item item) {
-        Optional<String> description = item.getDescription();
-        if (description.isEmpty())
-            return "";
-
-        String rawDescription = description.get();
+    private static String getItemDescription(SimpleFeedParser.ItemData item) {
+        String rawDescription = item.description();
         if (rawDescription.indexOf('<') >= 0) {
             rawDescription = Jsoup.parseBodyFragment(rawDescription).text();
         }
@@ -58,15 +54,18 @@ public record FeedItem(String title,
 
     // e.g. http://fabiensanglard.net/rss.xml does dates like this: 1 Apr 2021 00:00:00 +0000
     private static final DateTimeFormatter extraFormatter = DateTimeFormatter.ofPattern("d MMM yyyy HH:mm:ss Z");
-    private static String getItemDate(Item item) {
+    private static String getItemDate(SimpleFeedParser.ItemData item) {
         Optional<ZonedDateTime> zonedDateTime = Optional.empty();
         try {
             zonedDateTime = item.getPubDateZonedDateTime();
         }
         catch (Exception e) {
-            zonedDateTime = item.getPubDate()
-                    .map(extraFormatter::parse)
-                    .map(ZonedDateTime::from);
+            try {
+                zonedDateTime = Optional.of(ZonedDateTime.from(extraFormatter.parse(item.pubDate())));
+            }
+            catch (Exception e2) {
+                // ignore
+            }
         }
 
         return zonedDateTime.map(date -> date.format(DATE_FORMAT)).orElse("");
@@ -1,7 +1,5 @@
 package nu.marginalia.rss.svc;
 
-import com.apptasticsoftware.rssreader.Item;
-import com.apptasticsoftware.rssreader.RssReader;
 import com.google.inject.Inject;
 import com.opencsv.CSVReader;
 import nu.marginalia.WmsaHome;
@@ -20,7 +18,6 @@ import nu.marginalia.storage.FileStorageService;
 import nu.marginalia.storage.model.FileStorage;
 import nu.marginalia.storage.model.FileStorageType;
 import nu.marginalia.util.SimpleBlockingThreadPool;
-import org.apache.commons.io.input.BOMInputStream;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -32,7 +29,6 @@ import java.net.URISyntaxException;
 import java.net.http.HttpClient;
 import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
-import java.nio.charset.StandardCharsets;
 import java.sql.SQLException;
 import java.time.*;
 import java.time.format.DateTimeFormatter;
@@ -48,8 +44,6 @@ public class FeedFetcherService {
     private static final int MAX_FEED_ITEMS = 10;
     private static final Logger logger = LoggerFactory.getLogger(FeedFetcherService.class);
 
-    private final RssReader rssReader = new RssReader();
-
     private final FeedDb feedDb;
     private final FileStorageService fileStorageService;
     private final NodeConfigurationService nodeConfigurationService;
@@ -72,17 +66,6 @@ public class FeedFetcherService {
         this.nodeConfigurationService = nodeConfigurationService;
         this.serviceHeartbeat = serviceHeartbeat;
         this.executorClient = executorClient;
-
-
-        // Add support for some alternate date tags for atom
-        rssReader.addItemExtension("issued", this::setDateFallback);
-        rssReader.addItemExtension("created", this::setDateFallback);
-    }
-
-    private void setDateFallback(Item item, String value) {
-        if (item.getPubDate().isEmpty()) {
-            item.setPubDate(value);
-        }
     }
 
     public enum UpdateMode {
@@ -96,6 +79,7 @@ public class FeedFetcherService {
             throw new IllegalStateException("Already updating feeds, refusing to start another update");
         }
 
+
         try (FeedDbWriter writer = feedDb.createWriter();
              HttpClient client = HttpClient.newBuilder()
                      .connectTimeout(Duration.ofSeconds(15))
@@ -103,6 +87,7 @@ public class FeedFetcherService {
                      .followRedirects(HttpClient.Redirect.NORMAL)
                      .version(HttpClient.Version.HTTP_2)
                      .build();
+             FeedJournal feedJournal = FeedJournal.create();
              var heartbeat = serviceHeartbeat.createServiceAdHocTaskHeartbeat("Update Rss Feeds")
         ) {
             updating = true;
@@ -155,6 +140,8 @@ public class FeedFetcherService {
                         case FetchResult.Success(String value, String etag) -> {
                             writer.saveEtag(feed.domain(), etag);
                             writer.saveFeed(parseFeed(value, feed));
+
+                            feedJournal.record(feed.feedUrl(), value);
                         }
                         case FetchResult.NotModified() -> {
                             writer.saveEtag(feed.domain(), ifNoneMatchTag);
@@ -367,12 +354,7 @@ public class FeedFetcherService {
 
     public FeedItems parseFeed(String feedData, FeedDefinition definition) {
         try {
-            feedData = sanitizeEntities(feedData);
-
-            List<Item> rawItems = rssReader.read(
-                    // Massage the data to maximize the possibility of the flaky XML parser consuming it
-                    new BOMInputStream(new ByteArrayInputStream(feedData.trim().getBytes(StandardCharsets.UTF_8)), false)
-            ).toList();
+            List<SimpleFeedParser.ItemData> rawItems = SimpleFeedParser.parse(feedData);
 
             boolean keepUriFragment = rawItems.size() < 2 || areFragmentsDisparate(rawItems);
 
@@ -395,33 +377,6 @@ public class FeedFetcherService {
         }
     }
 
-    private static final Map<String, String> HTML_ENTITIES = Map.of(
-            "&raquo;", "»",
-            "&laquo;", "«",
-            "&mdash;", "--",
-            "&ndash;", "-",
-            "&rsquo;", "'",
-            "&lsquo;", "'",
-            "&quot;", "\"",
-            "&nbsp;", ""
-    );
-
-    /** The XML parser will blow up if you insert HTML entities in the feed XML,
-     * which is unfortunately relatively common. Replace them as far as is possible
-     * with their corresponding characters
-     */
-    static String sanitizeEntities(String feedData) {
-        String result = feedData;
-        for (Map.Entry<String, String> entry : HTML_ENTITIES.entrySet()) {
-            result = result.replace(entry.getKey(), entry.getValue());
-        }
-
-        // Handle lone ampersands not part of a recognized XML entity
-        result = result.replaceAll("&(?!(amp|lt|gt|apos|quot);)", "&amp;");
-
-        return result;
-    }
-
     /** Decide whether to keep URI fragments in the feed items.
      * <p></p>
     * We keep fragments if there are multiple different fragments in the items.
@@ -429,16 +384,16 @@ public class FeedFetcherService {
|
|||||||
* @param items The items to check
|
* @param items The items to check
|
||||||
* @return True if we should keep the fragments, false otherwise
|
* @return True if we should keep the fragments, false otherwise
|
||||||
*/
|
*/
|
||||||
private boolean areFragmentsDisparate(List<Item> items) {
|
private boolean areFragmentsDisparate(List<SimpleFeedParser.ItemData> items) {
|
||||||
Set<String> seenFragments = new HashSet<>();
|
Set<String> seenFragments = new HashSet<>();
|
||||||
|
|
||||||
try {
|
try {
|
||||||
for (var item : items) {
|
for (var item : items) {
|
||||||
if (item.getLink().isEmpty()) {
|
if (item.url().isBlank()) {
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
|
|
||||||
var link = item.getLink().get();
|
var link = item.url();
|
||||||
if (!link.contains("#")) {
|
if (!link.contains("#")) {
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
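
Note: the fragment heuristic above is easy to exercise in isolation. Below is a minimal sketch of the same idea, using only the ItemData record from the new parser further down; the helper name fragmentsAreDisparate and the example URLs are illustrative, not part of the change:

import java.util.HashSet;
import java.util.Set;

class FragmentHeuristicSketch {
    record ItemData(String url) {}

    // Keep #fragments only when they actually distinguish items; a feed where
    // every link carries the same fragment is treated as noise and stripped.
    static boolean fragmentsAreDisparate(Iterable<ItemData> items) {
        Set<String> seenFragments = new HashSet<>();
        for (var item : items) {
            String link = item.url();
            int hash = link.indexOf('#');
            if (link.isBlank() || hash < 0)
                continue;
            seenFragments.add(link.substring(hash + 1));
        }
        return seenFragments.size() > 1;
    }

    public static void main(String[] args) {
        var sameFragment = java.util.List.of(
                new ItemData("https://example.com/a#comments"),
                new ItemData("https://example.com/b#comments"));
        var differentFragments = java.util.List.of(
                new ItemData("https://example.com/page#part-1"),
                new ItemData("https://example.com/page#part-2"));

        System.out.println(fragmentsAreDisparate(sameFragment));       // false -> strip fragments
        System.out.println(fragmentsAreDisparate(differentFragments)); // true  -> keep fragments
    }
}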
@@ -0,0 +1,76 @@
+package nu.marginalia.rss.svc;
+
+import nu.marginalia.WmsaHome;
+import nu.marginalia.slop.SlopTable;
+import nu.marginalia.slop.column.string.StringColumn;
+import nu.marginalia.slop.desc.StorageType;
+import org.apache.commons.io.FileUtils;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.function.BiConsumer;
+
+/** Utility for recording fetched feeds to a journal, useful in debugging feed parser issues.
+ */
+public interface FeedJournal extends AutoCloseable {
+    StringColumn urlColumn = new StringColumn("url");
+    StringColumn contentsColumn = new StringColumn("contents", StandardCharsets.UTF_8, StorageType.ZSTD);
+
+    void record(String url, String contents) throws IOException;
+    void close() throws IOException;
+
+
+    static FeedJournal create() throws IOException {
+        if (Boolean.getBoolean("feedFetcher.persistJournal")) {
+            Path journalPath = WmsaHome.getDataPath().resolve("feed-journal");
+            if (Files.isDirectory(journalPath)) {
+                FileUtils.deleteDirectory(journalPath.toFile());
+            }
+            Files.createDirectories(journalPath);
+            return new RecordingFeedJournal(journalPath);
+        }
+        else {
+            return new NoOpFeedJournal();
+        }
+    }
+
+    class NoOpFeedJournal implements FeedJournal {
+        @Override
+        public void record(String url, String contents) {}
+
+        @Override
+        public void close() {}
+    }
+
+    class RecordingFeedJournal extends SlopTable implements FeedJournal {
+
+        private final StringColumn.Writer urlWriter;
+        private final StringColumn.Writer contentsWriter;
+
+        public RecordingFeedJournal(Path path) throws IOException {
+            super(path, SlopTable.getNumPages(path, FeedJournal.urlColumn));
+
+            urlWriter = urlColumn.create(this);
+            contentsWriter = contentsColumn.create(this);
+        }
+
+        public synchronized void record(String url, String contents) throws IOException {
+            urlWriter.put(url);
+            contentsWriter.put(contents);
+        }
+    }
+
+    static void replay(Path journalPath, BiConsumer<String, String> urlAndContent) throws IOException {
+        try (SlopTable table = new SlopTable(journalPath)) {
+            final StringColumn.Reader urlReader = urlColumn.open(table);
+            final StringColumn.Reader contentsReader = contentsColumn.open(table);
+
+            while (urlReader.hasRemaining()) {
+                urlAndContent.accept(urlReader.get(), contentsReader.get());
+            }
+        }
+
+    }
+}
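
Note: recording is gated on the feedFetcher.persistJournal system property, per create() above, so record() is a no-op in normal operation. A rough usage sketch follows; the journal path literal and feed data are illustrative (the real path is WmsaHome.getDataPath().resolve("feed-journal")):

import java.nio.file.Path;

class FeedJournalDemo {
    public static void main(String[] args) throws Exception {
        // Run with -DfeedFetcher.persistJournal=true to get a recording journal.
        try (FeedJournal journal = FeedJournal.create()) {
            journal.record("https://example.com/feed.xml", "<rss>...</rss>");
        }

        // Later, a parser bug can be reproduced offline by replaying the journal:
        FeedJournal.replay(Path.of("/path/to/feed-journal"), (url, contents) -> {
            System.out.println(url + " -> " + contents.length() + " chars");
        });
    }
}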
@@ -0,0 +1,94 @@
+package nu.marginalia.rss.svc;
+
+import com.apptasticsoftware.rssreader.DateTimeParser;
+import com.apptasticsoftware.rssreader.util.Default;
+import org.jsoup.Jsoup;
+import org.jsoup.parser.Parser;
+
+import java.time.ZonedDateTime;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Optional;
+
+public class SimpleFeedParser {
+
+    private static final DateTimeParser dateTimeParser = Default.getDateTimeParser();
+
+    public record ItemData (
+            String title,
+            String description,
+            String url,
+            String pubDate
+    ) {
+        public boolean isWellFormed() {
+            return title != null && !title.isBlank() &&
+                    description != null && !description.isBlank() &&
+                    url != null && !url.isBlank() &&
+                    pubDate != null && !pubDate.isBlank();
+        }
+
+        public Optional<ZonedDateTime> getPubDateZonedDateTime() {
+            try {
+                return Optional.ofNullable(dateTimeParser.parse(pubDate()));
+            }
+            catch (Exception e) {
+                return Optional.empty();
+            }
+        }
+
+    }
+
+    public static List<ItemData> parse(String content) {
+        var doc = Jsoup.parse(content, Parser.xmlParser());
+        List<ItemData> ret = new ArrayList<>();
+
+        doc.select("item, entry").forEach(element -> {
+            String link = "";
+            String title = "";
+            String description = "";
+            String pubDate = "";
+
+            for (String attr : List.of("title", "dc:title")) {
+                if (!title.isBlank())
+                    break;
+                var tag = element.getElementsByTag(attr).first();
+                if (tag != null) {
+                    title = tag.text();
+                }
+            }
+
+            for (String attr : List.of("title", "summary", "content", "description", "dc:description")) {
+                if (!description.isBlank())
+                    break;
+                var tag = element.getElementsByTag(attr).first();
+                if (tag != null) {
+                    description = tag.text();
+                }
+            }
+
+            for (String attr : List.of("pubDate", "published", "updated", "issued", "created", "dc:date")) {
+                if (!pubDate.isBlank())
+                    break;
+                var tag = element.getElementsByTag(attr).first();
+                if (tag != null) {
+                    pubDate = tag.text();
+                }
+            }
+
+            for (String attr : List.of("link", "url")) {
+                if (!link.isBlank())
+                    break;
+                var tag = element.getElementsByTag(attr).first();
+                if (tag != null) {
+                    link = tag.text();
+                }
+            }
+
+            ret.add(new ItemData(title, description, link, pubDate));
+        });
+
+
+        return ret;
+    }
+
+}
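
Note: a small usage sketch for the parser above, assuming jsoup and the rssreader date utilities are on the classpath; the feed snippet is made up:

class SimpleFeedParserDemo {
    public static void main(String[] args) {
        String rss = """
                <rss version="2.0"><channel>
                  <item>
                    <title>Hello &amp; welcome</title>
                    <description>First post</description>
                    <link>https://example.com/hello</link>
                    <pubDate>Mon, 06 Jan 2025 10:00:00 GMT</pubDate>
                  </item>
                </channel></rss>
                """;

        for (var item : SimpleFeedParser.parse(rss)) {
            // Accessors come straight from the ItemData record above.
            System.out.println(item.title() + " -> " + item.url());
            System.out.println("well-formed: " + item.isWellFormed());
            System.out.println("parsed date: " + item.getPubDateZonedDateTime());
        }
    }
}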
@@ -1,36 +1,97 @@
 package nu.marginalia.livecapture;
 
+import com.github.tomakehurst.wiremock.WireMockServer;
+import com.github.tomakehurst.wiremock.core.WireMockConfiguration;
+import nu.marginalia.WmsaHome;
+import nu.marginalia.service.module.ServiceConfigurationModule;
 import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Tag;
 import org.junit.jupiter.api.Test;
 import org.testcontainers.containers.GenericContainer;
 import org.testcontainers.junit.jupiter.Testcontainers;
 import org.testcontainers.utility.DockerImageName;
 
+import java.io.IOException;
 import java.net.URI;
+import java.util.Map;
+
+import static com.github.tomakehurst.wiremock.client.WireMock.*;
+
 
 @Testcontainers
+@Tag("slow")
 public class BrowserlessClientTest {
-    static GenericContainer<?> container = new GenericContainer<>(DockerImageName.parse("browserless/chrome")).withExposedPorts(3000);
+    static GenericContainer<?> container = new GenericContainer<>(DockerImageName.parse("browserless/chrome"))
+            .withEnv(Map.of("TOKEN", "BROWSERLESS_TOKEN"))
+            .withNetworkMode("bridge")
+            .withExposedPorts(3000);
+
+    static WireMockServer wireMockServer =
+            new WireMockServer(WireMockConfiguration.wireMockConfig()
+                    .port(18089));
+
+    static String localIp;
+
+    static URI browserlessURI;
+
     @BeforeAll
-    public static void setup() {
+    public static void setup() throws IOException {
         container.start();
+
+        browserlessURI = URI.create(String.format("http://%s:%d/",
+                container.getHost(),
+                container.getMappedPort(3000))
+        );
+
+        wireMockServer.start();
+        wireMockServer.stubFor(get("/").willReturn(aResponse().withStatus(200).withBody("Ok")));
+
+        localIp = ServiceConfigurationModule.getLocalNetworkIP();
+
     }
+
+    @Tag("flaky")
+    @Test
+    public void testInspectContentUA__Flaky() throws Exception {
+        try (var client = new BrowserlessClient(browserlessURI)) {
+            client.content("http://" + localIp + ":18089/",
+                    BrowserlessClient.GotoOptions.defaultValues()
+            );
+        }
+
+        wireMockServer.verify(getRequestedFor(urlEqualTo("/")).withHeader("User-Agent", equalTo(WmsaHome.getUserAgent().uaString())));
+    }
+
+    @Tag("flaky")
+    @Test
+    public void testInspectScreenshotUA__Flaky() throws Exception {
+        try (var client = new BrowserlessClient(browserlessURI)) {
+            client.screenshot("http://" + localIp + ":18089/",
+                    BrowserlessClient.GotoOptions.defaultValues(),
+                    BrowserlessClient.ScreenshotOptions.defaultValues()
+            );
+        }
+
+        wireMockServer.verify(getRequestedFor(urlEqualTo("/")).withHeader("User-Agent", equalTo(WmsaHome.getUserAgent().uaString())));
+    }
 
     @Test
     public void testContent() throws Exception {
-        try (var client = new BrowserlessClient(URI.create("http://" + container.getHost() + ":" + container.getMappedPort(3000)))) {
-            var content = client.content("https://www.marginalia.nu/", BrowserlessClient.GotoOptions.defaultValues());
-            Assertions.assertNotNull(content, "Content should not be null");
+        try (var client = new BrowserlessClient(browserlessURI)) {
+            var content = client.content("https://www.marginalia.nu/", BrowserlessClient.GotoOptions.defaultValues()).orElseThrow();
             Assertions.assertFalse(content.isBlank(), "Content should not be empty");
         }
     }
 
     @Test
     public void testScreenshot() throws Exception {
-        try (var client = new BrowserlessClient(URI.create("http://" + container.getHost() + ":" + container.getMappedPort(3000)))) {
-            var screenshot = client.screenshot("https://www.marginalia.nu/", BrowserlessClient.GotoOptions.defaultValues(), BrowserlessClient.ScreenshotOptions.defaultValues());
+        try (var client = new BrowserlessClient(browserlessURI)) {
+            var screenshot = client.screenshot("https://www.marginalia.nu/",
+                    BrowserlessClient.GotoOptions.defaultValues(),
+                    BrowserlessClient.ScreenshotOptions.defaultValues());
+
             Assertions.assertNotNull(screenshot, "Screenshot should not be null");
         }
     }
@@ -1,50 +0,0 @@
-package nu.marginalia.rss.svc;
-
-import com.apptasticsoftware.rssreader.Item;
-import com.apptasticsoftware.rssreader.RssReader;
-import org.junit.jupiter.api.Assertions;
-import org.junit.jupiter.api.Test;
-
-import java.util.List;
-import java.util.Optional;
-
-public class TestXmlSanitization {
-
-    @Test
-    public void testPreservedEntities() {
-        Assertions.assertEquals("&amp;", FeedFetcherService.sanitizeEntities("&amp;"));
-        Assertions.assertEquals("&lt;", FeedFetcherService.sanitizeEntities("&lt;"));
-        Assertions.assertEquals("&gt;", FeedFetcherService.sanitizeEntities("&gt;"));
-        Assertions.assertEquals("&apos;", FeedFetcherService.sanitizeEntities("&apos;"));
-    }
-
-    @Test
-    public void testNlnetTitleTag() {
-        // The NLnet atom feed puts HTML tags in the entry/title tags, which breaks the vanilla RssReader code
-
-        // Verify we're able to consume and strip out the HTML tags
-        RssReader r = new RssReader();
-
-        List<Item> items = r.read(ClassLoader.getSystemResourceAsStream("nlnet.atom")).toList();
-
-        Assertions.assertEquals(1, items.size());
-        for (var item : items) {
-            Assertions.assertEquals(Optional.of("50 Free and Open Source Projects Selected for NGI Zero grants"), item.getTitle());
-        }
-    }
-
-    @Test
-    public void testStrayAmpersand() {
-        Assertions.assertEquals("Bed &amp; Breakfast", FeedFetcherService.sanitizeEntities("Bed & Breakfast"));
-    }
-
-    @Test
-    public void testTranslatedHtmlEntity() {
-        Assertions.assertEquals("Foo -- Bar", FeedFetcherService.sanitizeEntities("Foo &mdash; Bar"));
-    }
-
-    @Test
-    public void testTranslatedHtmlEntityQuot() {
-        Assertions.assertEquals("\"Bob\"", FeedFetcherService.sanitizeEntities("&quot;Bob&quot;"));
-    }
-}
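
Note: these tests disappear together with sanitizeEntities; the new jsoup-based SimpleFeedParser is lenient about the entity soup that used to break the DOM-based reader. An equivalent check against the new parser might look like this sketch (test name and feed snippet are hypothetical, and the leniency claim rests on jsoup's forgiving XML parser):

import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

class SimpleFeedParserLenienceTest {
    @Test
    public void testStrayAmpersandDoesNotBreakParsing() {
        // A stray ampersand should still yield a usable item rather than a parse error.
        var items = SimpleFeedParser.parse(
                "<rss><channel><item><title>Bed & Breakfast</title>" +
                "<link>https://example.com/bnb</link></item></channel></rss>");

        Assertions.assertEquals(1, items.size());
    }
}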
@@ -2,9 +2,6 @@ package nu.marginalia.api.searchquery;
 
 import nu.marginalia.api.searchquery.model.query.SearchPhraseConstraint;
 import nu.marginalia.api.searchquery.model.query.SearchQuery;
-import nu.marginalia.api.searchquery.model.results.Bm25Parameters;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
-import nu.marginalia.index.query.limit.QueryLimits;
 import nu.marginalia.index.query.limit.SpecificationLimit;
 import nu.marginalia.index.query.limit.SpecificationLimitType;
 
@@ -27,37 +24,19 @@ public class IndexProtobufCodec {
                 .build();
     }
 
-    public static QueryLimits convertQueryLimits(RpcQueryLimits queryLimits) {
-        return new QueryLimits(
-                queryLimits.getResultsByDomain(),
-                queryLimits.getResultsTotal(),
-                queryLimits.getTimeoutMs(),
-                queryLimits.getFetchSize()
-        );
-    }
-
-    public static RpcQueryLimits convertQueryLimits(QueryLimits queryLimits) {
-        return RpcQueryLimits.newBuilder()
-                .setResultsByDomain(queryLimits.resultsByDomain())
-                .setResultsTotal(queryLimits.resultsTotal())
-                .setTimeoutMs(queryLimits.timeoutMs())
-                .setFetchSize(queryLimits.fetchSize())
-                .build();
-    }
-
     public static SearchQuery convertRpcQuery(RpcQuery query) {
-        List<SearchPhraseConstraint> phraeConstraints = new ArrayList<>();
+        List<SearchPhraseConstraint> phraseConstraints = new ArrayList<>();
 
         for (int j = 0; j < query.getPhrasesCount(); j++) {
            var coh = query.getPhrases(j);
            if (coh.getType() == RpcPhrases.TYPE.OPTIONAL) {
-               phraeConstraints.add(new SearchPhraseConstraint.Optional(List.copyOf(coh.getTermsList())));
+               phraseConstraints.add(new SearchPhraseConstraint.Optional(List.copyOf(coh.getTermsList())));
            }
            else if (coh.getType() == RpcPhrases.TYPE.MANDATORY) {
-               phraeConstraints.add(new SearchPhraseConstraint.Mandatory(List.copyOf(coh.getTermsList())));
+               phraseConstraints.add(new SearchPhraseConstraint.Mandatory(List.copyOf(coh.getTermsList())));
            }
            else if (coh.getType() == RpcPhrases.TYPE.FULL) {
-               phraeConstraints.add(new SearchPhraseConstraint.Full(List.copyOf(coh.getTermsList())));
+               phraseConstraints.add(new SearchPhraseConstraint.Full(List.copyOf(coh.getTermsList())));
            }
            else {
                throw new IllegalArgumentException("Unknown phrase constraint type: " + coh.getType());
@@ -70,7 +49,7 @@ public class IndexProtobufCodec {
                 query.getExcludeList(),
                 query.getAdviceList(),
                 query.getPriorityList(),
-                phraeConstraints
+                phraseConstraints
         );
     }
 
@@ -103,60 +82,4 @@ public class IndexProtobufCodec {
         return subqueryBuilder.build();
     }
 
-    public static ResultRankingParameters convertRankingParameterss(RpcResultRankingParameters params) {
-        if (params == null)
-            return ResultRankingParameters.sensibleDefaults();
-
-        return new ResultRankingParameters(
-                new Bm25Parameters(params.getBm25K(), params.getBm25B()),
-                params.getShortDocumentThreshold(),
-                params.getShortDocumentPenalty(),
-                params.getDomainRankBonus(),
-                params.getQualityPenalty(),
-                params.getShortSentenceThreshold(),
-                params.getShortSentencePenalty(),
-                params.getBm25Weight(),
-                params.getTcfFirstPositionWeight(),
-                params.getTcfVerbatimWeight(),
-                params.getTcfProximityWeight(),
-                ResultRankingParameters.TemporalBias.valueOf(params.getTemporalBias().getBias().name()),
-                params.getTemporalBiasWeight(),
-                params.getExportDebugData()
-        );
-    }
-
-    public static RpcResultRankingParameters convertRankingParameterss(ResultRankingParameters rankingParams,
-                                                                       RpcTemporalBias temporalBias)
-    {
-        if (rankingParams == null) {
-            rankingParams = ResultRankingParameters.sensibleDefaults();
-        }
-
-        var builder = RpcResultRankingParameters.newBuilder()
-                .setBm25B(rankingParams.bm25Params.b())
-                .setBm25K(rankingParams.bm25Params.k())
-                .setShortDocumentThreshold(rankingParams.shortDocumentThreshold)
-                .setShortDocumentPenalty(rankingParams.shortDocumentPenalty)
-                .setDomainRankBonus(rankingParams.domainRankBonus)
-                .setQualityPenalty(rankingParams.qualityPenalty)
-                .setShortSentenceThreshold(rankingParams.shortSentenceThreshold)
-                .setShortSentencePenalty(rankingParams.shortSentencePenalty)
-                .setBm25Weight(rankingParams.bm25Weight)
-                .setTcfFirstPositionWeight(rankingParams.tcfFirstPosition)
-                .setTcfProximityWeight(rankingParams.tcfProximity)
-                .setTcfVerbatimWeight(rankingParams.tcfVerbatim)
-                .setTemporalBiasWeight(rankingParams.temporalBiasWeight)
-                .setExportDebugData(rankingParams.exportDebugData);
-
-        if (temporalBias != null && temporalBias.getBias() != RpcTemporalBias.Bias.NONE) {
-            builder.setTemporalBias(temporalBias);
-        }
-        else {
-            builder.setTemporalBias(RpcTemporalBias.newBuilder()
-                    .setBias(RpcTemporalBias.Bias.valueOf(rankingParams.temporalBias.name())));
-        }
-
-        return builder.build();
-    }
-
 }
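
Note: with both converters gone, callers construct the generated protobuf message directly instead of round-tripping through the old QueryLimits record. A minimal sketch using the same builder calls that appear in the updated tests further down; the values are arbitrary:

import nu.marginalia.api.searchquery.RpcQueryLimits;

class QueryLimitsDemo {
    public static void main(String[] args) {
        // The generated protobuf builder replaces the old QueryLimits record
        // and the two codec round-trip methods removed above.
        RpcQueryLimits limits = RpcQueryLimits.newBuilder()
                .setResultsByDomain(10)
                .setResultsTotal(100)
                .setTimeoutMs(250)
                .setFetchSize(8192)
                .build();

        System.out.println(limits.getResultsTotal()); // 100
    }
}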
@@ -5,7 +5,7 @@ import nu.marginalia.api.searchquery.model.query.QueryParams;
 import nu.marginalia.api.searchquery.model.query.QueryResponse;
 import nu.marginalia.api.searchquery.model.query.SearchSpecification;
 import nu.marginalia.api.searchquery.model.results.DecoratedSearchResultItem;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
+import nu.marginalia.api.searchquery.model.results.PrototypeRankingParameters;
 import nu.marginalia.api.searchquery.model.results.SearchResultItem;
 import nu.marginalia.api.searchquery.model.results.SearchResultKeywordScore;
 import nu.marginalia.api.searchquery.model.results.debug.DebugFactor;
@@ -37,7 +37,7 @@ public class QueryProtobufCodec {
         builder.setSize(IndexProtobufCodec.convertSpecLimit(query.specs.size));
         builder.setRank(IndexProtobufCodec.convertSpecLimit(query.specs.rank));
 
-        builder.setQueryLimits(IndexProtobufCodec.convertQueryLimits(query.specs.queryLimits));
+        builder.setQueryLimits(query.specs.queryLimits);
 
         // Query strategy may be overridden by the query, but if not, use the one from the request
         if (query.specs.queryStrategy != null && query.specs.queryStrategy != QueryStrategy.AUTO)
@@ -45,9 +45,27 @@ public class QueryProtobufCodec {
         else
             builder.setQueryStrategy(request.getQueryStrategy());
 
+        if (request.getTemporalBias().getBias() != RpcTemporalBias.Bias.NONE) {
             if (query.specs.rankingParams != null) {
-                builder.setParameters(IndexProtobufCodec.convertRankingParameterss(query.specs.rankingParams, request.getTemporalBias()));
+                builder.setParameters(
+                        RpcResultRankingParameters.newBuilder(query.specs.rankingParams)
+                                .setTemporalBias(request.getTemporalBias())
+                                .build()
+                );
+            } else {
+                builder.setParameters(
+                        RpcResultRankingParameters.newBuilder(PrototypeRankingParameters.sensibleDefaults())
+                                .setTemporalBias(request.getTemporalBias())
+                                .build()
+                );
             }
+        } else if (query.specs.rankingParams != null) {
+            builder.setParameters(query.specs.rankingParams);
+        }
+        // else {
+        //     if we have no ranking params, we don't need to set them, the client check and use the default values
+        //     so we don't need to send this huge object over the wire
+        // }
 
         return builder.build();
     }
@@ -65,18 +83,13 @@ public class QueryProtobufCodec {
         builder.setSize(IndexProtobufCodec.convertSpecLimit(query.specs.size));
         builder.setRank(IndexProtobufCodec.convertSpecLimit(query.specs.rank));
 
-        builder.setQueryLimits(IndexProtobufCodec.convertQueryLimits(query.specs.queryLimits));
+        builder.setQueryLimits(query.specs.queryLimits);
 
         // Query strategy may be overridden by the query, but if not, use the one from the request
         builder.setQueryStrategy(query.specs.queryStrategy.name());
 
         if (query.specs.rankingParams != null) {
-            builder.setParameters(IndexProtobufCodec.convertRankingParameterss(
-                    query.specs.rankingParams,
-                    RpcTemporalBias.newBuilder().setBias(
-                            RpcTemporalBias.Bias.NONE)
-                            .build())
-            );
+            builder.setParameters(query.specs.rankingParams);
         }
 
         return builder.build();
@@ -95,10 +108,10 @@ public class QueryProtobufCodec {
                 IndexProtobufCodec.convertSpecLimit(request.getSize()),
                 IndexProtobufCodec.convertSpecLimit(request.getRank()),
                 request.getDomainIdsList(),
-                IndexProtobufCodec.convertQueryLimits(request.getQueryLimits()),
+                request.getQueryLimits(),
                 request.getSearchSetIdentifier(),
                 QueryStrategy.valueOf(request.getQueryStrategy()),
-                ResultRankingParameters.TemporalBias.valueOf(request.getTemporalBias().getBias().name()),
+                RpcTemporalBias.Bias.valueOf(request.getTemporalBias().getBias().name()),
                 request.getPagination().getPage()
         );
     }
@@ -294,9 +307,9 @@ public class QueryProtobufCodec {
                 IndexProtobufCodec.convertSpecLimit(specs.getYear()),
                 IndexProtobufCodec.convertSpecLimit(specs.getSize()),
                 IndexProtobufCodec.convertSpecLimit(specs.getRank()),
-                IndexProtobufCodec.convertQueryLimits(specs.getQueryLimits()),
+                specs.getQueryLimits(),
                 QueryStrategy.valueOf(specs.getQueryStrategy()),
-                IndexProtobufCodec.convertRankingParameterss(specs.getParameters())
+                specs.hasParameters() ? specs.getParameters() : null
         );
     }
 
@@ -307,7 +320,7 @@ public class QueryProtobufCodec {
                 .addAllTacitExcludes(params.tacitExcludes())
                 .addAllTacitPriority(params.tacitPriority())
                 .setHumanQuery(params.humanQuery())
-                .setQueryLimits(IndexProtobufCodec.convertQueryLimits(params.limits()))
+                .setQueryLimits(params.limits())
                 .setQuality(IndexProtobufCodec.convertSpecLimit(params.quality()))
                 .setYear(IndexProtobufCodec.convertSpecLimit(params.year()))
                 .setSize(IndexProtobufCodec.convertSpecLimit(params.size()))
@@ -319,7 +332,7 @@ public class QueryProtobufCodec {
                 .build())
                 .setPagination(RpcQsQueryPagination.newBuilder()
                         .setPage(params.page())
-                        .setPageSize(Math.min(100, params.limits().resultsTotal()))
+                        .setPageSize(Math.min(100, params.limits().getResultsTotal()))
                         .build());
 
         if (params.nearDomain() != null)
@@ -1,7 +1,7 @@
 package nu.marginalia.api.searchquery.model.query;
 
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
-import nu.marginalia.index.query.limit.QueryLimits;
+import nu.marginalia.api.searchquery.RpcQueryLimits;
+import nu.marginalia.api.searchquery.RpcTemporalBias;
 import nu.marginalia.index.query.limit.QueryStrategy;
 import nu.marginalia.index.query.limit.SpecificationLimit;
 
@@ -21,14 +21,14 @@ public record QueryParams(
         SpecificationLimit size,
         SpecificationLimit rank,
         List<Integer> domainIds,
-        QueryLimits limits,
+        RpcQueryLimits limits,
         String identifier,
         QueryStrategy queryStrategy,
-        ResultRankingParameters.TemporalBias temporalBias,
+        RpcTemporalBias.Bias temporalBias,
         int page
 )
 {
-    public QueryParams(String query, QueryLimits limits, String identifier) {
+    public QueryParams(String query, RpcQueryLimits limits, String identifier) {
         this(query, null,
                 List.of(),
                 List.of(),
@@ -42,7 +42,7 @@ public record QueryParams(
                 limits,
                 identifier,
                 QueryStrategy.AUTO,
-                ResultRankingParameters.TemporalBias.NONE,
+                RpcTemporalBias.Bias.NONE,
                 1 // page
         );
     }
@@ -1,10 +1,11 @@
 package nu.marginalia.api.searchquery.model.query;
 
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
-import nu.marginalia.index.query.limit.QueryLimits;
+import nu.marginalia.api.searchquery.RpcQueryLimits;
+import nu.marginalia.api.searchquery.RpcResultRankingParameters;
 import nu.marginalia.index.query.limit.QueryStrategy;
 import nu.marginalia.index.query.limit.SpecificationLimit;
 
+import javax.annotation.Nullable;
 import java.util.List;
 
 public class SearchSpecification {
@@ -24,11 +25,12 @@ public class SearchSpecification {
     public SpecificationLimit size;
     public SpecificationLimit rank;
 
-    public final QueryLimits queryLimits;
+    public final RpcQueryLimits queryLimits;
 
     public final QueryStrategy queryStrategy;
 
-    public final ResultRankingParameters rankingParams;
+    @Nullable
+    public final RpcResultRankingParameters rankingParams;
 
     public SearchSpecification(SearchQuery query,
                                List<Integer> domains,
@@ -38,9 +40,9 @@ public class SearchSpecification {
                                SpecificationLimit year,
                                SpecificationLimit size,
                                SpecificationLimit rank,
-                               QueryLimits queryLimits,
+                               RpcQueryLimits queryLimits,
                                QueryStrategy queryStrategy,
-                               ResultRankingParameters rankingParams)
+                               @Nullable RpcResultRankingParameters rankingParams)
     {
         this.query = query;
         this.domains = domains;
@@ -91,7 +93,7 @@ public class SearchSpecification {
         return this.rank;
     }
 
-    public QueryLimits getQueryLimits() {
+    public RpcQueryLimits getQueryLimits() {
         return this.queryLimits;
     }
 
@@ -99,7 +101,7 @@ public class SearchSpecification {
         return this.queryStrategy;
     }
 
-    public ResultRankingParameters getRankingParams() {
+    public RpcResultRankingParameters getRankingParams() {
         return this.rankingParams;
     }
 
@@ -120,9 +122,9 @@ public class SearchSpecification {
         private boolean size$set;
         private SpecificationLimit rank$value;
         private boolean rank$set;
-        private QueryLimits queryLimits;
+        private RpcQueryLimits queryLimits;
         private QueryStrategy queryStrategy;
-        private ResultRankingParameters rankingParams;
+        private RpcResultRankingParameters rankingParams;
 
         SearchSpecificationBuilder() {
         }
@@ -171,7 +173,7 @@ public class SearchSpecification {
             return this;
         }
 
-        public SearchSpecificationBuilder queryLimits(QueryLimits queryLimits) {
+        public SearchSpecificationBuilder queryLimits(RpcQueryLimits queryLimits) {
             this.queryLimits = queryLimits;
             return this;
         }
@@ -181,7 +183,7 @@ public class SearchSpecification {
             return this;
         }
 
-        public SearchSpecificationBuilder rankingParams(ResultRankingParameters rankingParams) {
+        public SearchSpecificationBuilder rankingParams(RpcResultRankingParameters rankingParams) {
             this.rankingParams = rankingParams;
             return this;
        }
@@ -0,0 +1,33 @@
+package nu.marginalia.api.searchquery.model.results;
+
+import nu.marginalia.api.searchquery.RpcResultRankingParameters;
+import nu.marginalia.api.searchquery.RpcTemporalBias;
+
+public class PrototypeRankingParameters {
+
+    /** These are the default ranking parameters that are used when no parameters are specified. */
+
+    private static final RpcResultRankingParameters _sensibleDefaults = RpcResultRankingParameters.newBuilder()
+            .setBm25B(0.5)
+            .setBm25K(1.2)
+            .setShortDocumentThreshold(2000)
+            .setShortDocumentPenalty(2.)
+            .setDomainRankBonus(1 / 100.)
+            .setQualityPenalty(1 / 15.)
+            .setShortSentenceThreshold(2)
+            .setShortSentencePenalty(5)
+            .setBm25Weight(1.)
+            .setTcfVerbatimWeight(1.)
+            .setTcfProximityWeight(1.)
+            .setTcfFirstPositionWeight(5)
+            .setTemporalBias(RpcTemporalBias.newBuilder().setBias(RpcTemporalBias.Bias.NONE))
+            .setTemporalBiasWeight(5.0)
+            .setExportDebugData(false)
+            .setDisablePenalties(false)
+            .build();
+
+    public static RpcResultRankingParameters sensibleDefaults() {
+        return _sensibleDefaults;
+    }
+
+}
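
Note: because the defaults are now a protobuf message, a per-request tweak is a copy-builder away, which is the same pattern QueryProtobufCodec uses above. A sketch, where the RECENT bias choice is only an example value:

import nu.marginalia.api.searchquery.RpcResultRankingParameters;
import nu.marginalia.api.searchquery.RpcTemporalBias;

class RankingDefaultsDemo {
    public static void main(String[] args) {
        // newBuilder(prototype) is the stock protobuf copy-builder, so the
        // shared defaults object itself is never mutated.
        RpcResultRankingParameters recencyBiased =
                RpcResultRankingParameters.newBuilder(PrototypeRankingParameters.sensibleDefaults())
                        .setTemporalBias(RpcTemporalBias.newBuilder()
                                .setBias(RpcTemporalBias.Bias.RECENT))
                        .build();

        System.out.println(recencyBiased.getTemporalBiasWeight()); // still 5.0 from the defaults
    }
}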
@@ -1,12 +1,13 @@
 package nu.marginalia.api.searchquery.model.results;
 
+import nu.marginalia.api.searchquery.RpcResultRankingParameters;
 import nu.marginalia.api.searchquery.model.compiled.CqDataInt;
 
 import java.util.BitSet;
 
 public class ResultRankingContext {
     private final int docCount;
-    public final ResultRankingParameters params;
+    public final RpcResultRankingParameters params;
 
 
     public final BitSet regularMask;
@@ -21,7 +22,7 @@ public class ResultRankingContext {
     public final CqDataInt priorityCounts;
 
     public ResultRankingContext(int docCount,
-                                ResultRankingParameters params,
+                                RpcResultRankingParameters params,
                                 BitSet ngramsMask,
                                 BitSet regularMask,
                                 CqDataInt fullCounts,
@@ -1,278 +0,0 @@
-package nu.marginalia.api.searchquery.model.results;
-
-import java.util.Objects;
-
-public class ResultRankingParameters {
-
-    /**
-     * Tuning for BM25 when applied to full document matches
-     */
-    public final Bm25Parameters bm25Params;
-
-    /**
-     * Documents below this length are penalized
-     */
-    public int shortDocumentThreshold;
-
-    public double shortDocumentPenalty;
-
-
-    /**
-     * Scaling factor associated with domain rank (unscaled rank value is 0-255; high is good)
-     */
-    public double domainRankBonus;
-
-    /**
-     * Scaling factor associated with document quality (unscaled rank value is 0-15; high is bad)
-     */
-    public double qualityPenalty;
-
-    /**
-     * Average sentence length values below this threshold are penalized, range [0-4), 2 or 3 is probably what you want
-     */
-    public int shortSentenceThreshold;
-
-    /**
-     * Magnitude of penalty for documents with low average sentence length
-     */
-    public double shortSentencePenalty;
-
-    public double bm25Weight;
-    public double tcfFirstPosition;
-    public double tcfVerbatim;
-    public double tcfProximity;
-
-    public TemporalBias temporalBias;
-    public double temporalBiasWeight;
-
-    public boolean exportDebugData;
-
-    public ResultRankingParameters(Bm25Parameters bm25Params, int shortDocumentThreshold, double shortDocumentPenalty, double domainRankBonus, double qualityPenalty, int shortSentenceThreshold, double shortSentencePenalty, double bm25Weight, double tcfFirstPosition, double tcfVerbatim, double tcfProximity, TemporalBias temporalBias, double temporalBiasWeight, boolean exportDebugData) {
-        this.bm25Params = bm25Params;
-        this.shortDocumentThreshold = shortDocumentThreshold;
-        this.shortDocumentPenalty = shortDocumentPenalty;
-        this.domainRankBonus = domainRankBonus;
-        this.qualityPenalty = qualityPenalty;
-        this.shortSentenceThreshold = shortSentenceThreshold;
-        this.shortSentencePenalty = shortSentencePenalty;
-        this.bm25Weight = bm25Weight;
-        this.tcfFirstPosition = tcfFirstPosition;
-        this.tcfVerbatim = tcfVerbatim;
-        this.tcfProximity = tcfProximity;
-        this.temporalBias = temporalBias;
-        this.temporalBiasWeight = temporalBiasWeight;
-        this.exportDebugData = exportDebugData;
-    }
-
-    public static ResultRankingParameters sensibleDefaults() {
-        return builder()
-                .bm25Params(new Bm25Parameters(1.2, 0.5))
-                .shortDocumentThreshold(2000)
-                .shortDocumentPenalty(2.)
-                .domainRankBonus(1 / 100.)
-                .qualityPenalty(1 / 15.)
-                .shortSentenceThreshold(2)
-                .shortSentencePenalty(5)
-                .bm25Weight(1.)
-                .tcfVerbatim(1.)
-                .tcfProximity(1.)
-                .tcfFirstPosition(5)
-                .temporalBias(TemporalBias.NONE)
-                .temporalBiasWeight(5.0)
-                .exportDebugData(false)
-                .build();
-    }
-
-    public static ResultRankingParametersBuilder builder() {
-        return new ResultRankingParametersBuilder();
-    }
-
-    public Bm25Parameters getBm25Params() {
-        return this.bm25Params;
-    }
-
-    public int getShortDocumentThreshold() {
-        return this.shortDocumentThreshold;
-    }
-
-    public double getShortDocumentPenalty() {
-        return this.shortDocumentPenalty;
-    }
-
-    public double getDomainRankBonus() {
-        return this.domainRankBonus;
-    }
-
-    public double getQualityPenalty() {
-        return this.qualityPenalty;
-    }
-
-    public int getShortSentenceThreshold() {
-        return this.shortSentenceThreshold;
-    }
-
-    public double getShortSentencePenalty() {
-        return this.shortSentencePenalty;
-    }
-
-    public double getBm25Weight() {
-        return this.bm25Weight;
-    }
-
-    public double getTcfFirstPosition() {
-        return this.tcfFirstPosition;
-    }
-
-    public double getTcfVerbatim() {
-        return this.tcfVerbatim;
-    }
-
-    public double getTcfProximity() {
-        return this.tcfProximity;
-    }
-
-    public TemporalBias getTemporalBias() {
-        return this.temporalBias;
-    }
-
-    public double getTemporalBiasWeight() {
-        return this.temporalBiasWeight;
-    }
-
-    public boolean isExportDebugData() {
-        return this.exportDebugData;
-    }
-
-    @Override
-    public final boolean equals(Object o) {
-        if (this == o) return true;
-        if (!(o instanceof ResultRankingParameters that)) return false;
-
-        return shortDocumentThreshold == that.shortDocumentThreshold && Double.compare(shortDocumentPenalty, that.shortDocumentPenalty) == 0 && Double.compare(domainRankBonus, that.domainRankBonus) == 0 && Double.compare(qualityPenalty, that.qualityPenalty) == 0 && shortSentenceThreshold == that.shortSentenceThreshold && Double.compare(shortSentencePenalty, that.shortSentencePenalty) == 0 && Double.compare(bm25Weight, that.bm25Weight) == 0 && Double.compare(tcfFirstPosition, that.tcfFirstPosition) == 0 && Double.compare(tcfVerbatim, that.tcfVerbatim) == 0 && Double.compare(tcfProximity, that.tcfProximity) == 0 && Double.compare(temporalBiasWeight, that.temporalBiasWeight) == 0 && exportDebugData == that.exportDebugData && Objects.equals(bm25Params, that.bm25Params) && temporalBias == that.temporalBias;
-    }
-
-    @Override
-    public int hashCode() {
-        int result = Objects.hashCode(bm25Params);
-        result = 31 * result + shortDocumentThreshold;
-        result = 31 * result + Double.hashCode(shortDocumentPenalty);
-        result = 31 * result + Double.hashCode(domainRankBonus);
-        result = 31 * result + Double.hashCode(qualityPenalty);
-        result = 31 * result + shortSentenceThreshold;
-        result = 31 * result + Double.hashCode(shortSentencePenalty);
-        result = 31 * result + Double.hashCode(bm25Weight);
-        result = 31 * result + Double.hashCode(tcfFirstPosition);
-        result = 31 * result + Double.hashCode(tcfVerbatim);
-        result = 31 * result + Double.hashCode(tcfProximity);
-        result = 31 * result + Objects.hashCode(temporalBias);
-        result = 31 * result + Double.hashCode(temporalBiasWeight);
-        result = 31 * result + Boolean.hashCode(exportDebugData);
-        return result;
-    }
-
-    public String toString() {
-        return "ResultRankingParameters(bm25Params=" + this.getBm25Params() + ", shortDocumentThreshold=" + this.getShortDocumentThreshold() + ", shortDocumentPenalty=" + this.getShortDocumentPenalty() + ", domainRankBonus=" + this.getDomainRankBonus() + ", qualityPenalty=" + this.getQualityPenalty() + ", shortSentenceThreshold=" + this.getShortSentenceThreshold() + ", shortSentencePenalty=" + this.getShortSentencePenalty() + ", bm25Weight=" + this.getBm25Weight() + ", tcfFirstPosition=" + this.getTcfFirstPosition() + ", tcfVerbatim=" + this.getTcfVerbatim() + ", tcfProximity=" + this.getTcfProximity() + ", temporalBias=" + this.getTemporalBias() + ", temporalBiasWeight=" + this.getTemporalBiasWeight() + ", exportDebugData=" + this.isExportDebugData() + ")";
-    }
-
-    public enum TemporalBias {
-        RECENT, OLD, NONE
-    }
-
-    public static class ResultRankingParametersBuilder {
-        private Bm25Parameters bm25Params;
-        private int shortDocumentThreshold;
-        private double shortDocumentPenalty;
-        private double domainRankBonus;
-        private double qualityPenalty;
-        private int shortSentenceThreshold;
-        private double shortSentencePenalty;
-        private double bm25Weight;
-        private double tcfFirstPosition;
-        private double tcfVerbatim;
-        private double tcfProximity;
-        private TemporalBias temporalBias;
-        private double temporalBiasWeight;
-        private boolean exportDebugData;
-
-        ResultRankingParametersBuilder() {
-        }
-
-        public ResultRankingParametersBuilder bm25Params(Bm25Parameters bm25Params) {
-            this.bm25Params = bm25Params;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder shortDocumentThreshold(int shortDocumentThreshold) {
-            this.shortDocumentThreshold = shortDocumentThreshold;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder shortDocumentPenalty(double shortDocumentPenalty) {
-            this.shortDocumentPenalty = shortDocumentPenalty;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder domainRankBonus(double domainRankBonus) {
-            this.domainRankBonus = domainRankBonus;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder qualityPenalty(double qualityPenalty) {
-            this.qualityPenalty = qualityPenalty;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder shortSentenceThreshold(int shortSentenceThreshold) {
-            this.shortSentenceThreshold = shortSentenceThreshold;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder shortSentencePenalty(double shortSentencePenalty) {
-            this.shortSentencePenalty = shortSentencePenalty;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder bm25Weight(double bm25Weight) {
-            this.bm25Weight = bm25Weight;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder tcfFirstPosition(double tcfFirstPosition) {
-            this.tcfFirstPosition = tcfFirstPosition;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder tcfVerbatim(double tcfVerbatim) {
-            this.tcfVerbatim = tcfVerbatim;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder tcfProximity(double tcfProximity) {
-            this.tcfProximity = tcfProximity;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder temporalBias(TemporalBias temporalBias) {
-            this.temporalBias = temporalBias;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder temporalBiasWeight(double temporalBiasWeight) {
-            this.temporalBiasWeight = temporalBiasWeight;
-            return this;
-        }
-
-        public ResultRankingParametersBuilder exportDebugData(boolean exportDebugData) {
-            this.exportDebugData = exportDebugData;
-            return this;
-        }
-
-        public ResultRankingParameters build() {
-            return new ResultRankingParameters(this.bm25Params, this.shortDocumentThreshold, this.shortDocumentPenalty, this.domainRankBonus, this.qualityPenalty, this.shortSentenceThreshold, this.shortSentencePenalty, this.bm25Weight, this.tcfFirstPosition, this.tcfVerbatim, this.tcfProximity, this.temporalBias, this.temporalBiasWeight, this.exportDebugData);
-        }
-
-        public String toString() {
-            return "ResultRankingParameters.ResultRankingParametersBuilder(bm25Params=" + this.bm25Params + ", shortDocumentThreshold=" + this.shortDocumentThreshold + ", shortDocumentPenalty=" + this.shortDocumentPenalty + ", domainRankBonus=" + this.domainRankBonus + ", qualityPenalty=" + this.qualityPenalty + ", shortSentenceThreshold=" + this.shortSentenceThreshold + ", shortSentencePenalty=" + this.shortSentencePenalty + ", bm25Weight=" + this.bm25Weight + ", tcfFirstPosition=" + this.tcfFirstPosition + ", tcfVerbatim=" + this.tcfVerbatim + ", tcfProximity=" + this.tcfProximity + ", temporalBias=" + this.temporalBias + ", temporalBiasWeight=" + this.temporalBiasWeight + ", exportDebugData=" + this.exportDebugData + ")";
-        }
-    }
-}
@@ -162,6 +162,7 @@ message RpcResultRankingParameters {
   double temporalBiasWeight = 17;
 
   bool exportDebugData = 18;
+  bool disablePenalties = 19;
 
 }
 
@@ -3,8 +3,6 @@ package nu.marginalia.index.client;
 import nu.marginalia.api.searchquery.IndexProtobufCodec;
 import nu.marginalia.api.searchquery.model.query.SearchPhraseConstraint;
 import nu.marginalia.api.searchquery.model.query.SearchQuery;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
-import nu.marginalia.index.query.limit.QueryLimits;
 import nu.marginalia.index.query.limit.SpecificationLimit;
 import org.junit.jupiter.api.Test;
 
@@ -22,18 +20,6 @@ class IndexProtobufCodecTest {
         verifyIsIdentityTransformation(SpecificationLimit.lessThan(1), l -> IndexProtobufCodec.convertSpecLimit(IndexProtobufCodec.convertSpecLimit(l)));
     }
 
-    @Test
-    public void testRankingParameters() {
-        verifyIsIdentityTransformation(ResultRankingParameters.sensibleDefaults(),
-                p -> IndexProtobufCodec.convertRankingParameterss(IndexProtobufCodec.convertRankingParameterss(p, null)));
-    }
-
-    @Test
-    public void testQueryLimits() {
-        verifyIsIdentityTransformation(new QueryLimits(1,2,3,4),
-                l -> IndexProtobufCodec.convertQueryLimits(IndexProtobufCodec.convertQueryLimits(l))
-        );
-    }
     @Test
     public void testSubqery() {
         verifyIsIdentityTransformation(new SearchQuery(
@@ -2,8 +2,9 @@ package nu.marginalia.functions.searchquery;
 
 import com.google.inject.Inject;
 import com.google.inject.Singleton;
+import nu.marginalia.api.searchquery.RpcQueryLimits;
+import nu.marginalia.api.searchquery.RpcResultRankingParameters;
 import nu.marginalia.api.searchquery.model.query.*;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
 import nu.marginalia.functions.searchquery.query_parser.QueryExpansion;
 import nu.marginalia.functions.searchquery.query_parser.QueryParser;
 import nu.marginalia.functions.searchquery.query_parser.token.QueryToken;
@@ -36,7 +37,7 @@ public class QueryFactory {
 
 
     public ProcessedQuery createQuery(QueryParams params,
-                                      @Nullable ResultRankingParameters rankingParams) {
+                                      @Nullable RpcResultRankingParameters rankingParams) {
         final var query = params.humanQuery();
 
         if (query.length() > 1000) {
@@ -132,7 +133,9 @@ public class QueryFactory {
         var limits = params.limits();
         // Disable limits on number of results per domain if we're searching with a site:-type term
         if (domain != null) {
-            limits = limits.forSingleDomain();
+            limits = RpcQueryLimits.newBuilder(limits)
+                    .setResultsByDomain(limits.getResultsTotal())
+                    .build();
         }
 
         var expansion = queryExpansion.expandQuery(queryBuilder.searchTermsInclude);
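
Note: this inlines what the old QueryLimits.forSingleDomain() did: when a site:-type term pins results to one domain, the per-domain cap is raised to the total cap so it cannot starve the result set. A standalone sketch of the same transformation (the helper name is illustrative):

import nu.marginalia.api.searchquery.RpcQueryLimits;

class SingleDomainLimitsDemo {
    static RpcQueryLimits forSingleDomain(RpcQueryLimits limits) {
        // Copy the limits and lift the per-domain cap to the total cap.
        return RpcQueryLimits.newBuilder(limits)
                .setResultsByDomain(limits.getResultsTotal())
                .build();
    }

    public static void main(String[] args) {
        var limits = RpcQueryLimits.newBuilder()
                .setResultsByDomain(10)
                .setResultsTotal(100)
                .setTimeoutMs(250)
                .setFetchSize(8192)
                .build();

        System.out.println(forSingleDomain(limits).getResultsByDomain()); // 100
    }
}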
@@ -9,7 +9,7 @@ import nu.marginalia.api.searchquery.*;
 import nu.marginalia.api.searchquery.model.query.ProcessedQuery;
 import nu.marginalia.api.searchquery.model.query.QueryParams;
 import nu.marginalia.api.searchquery.model.results.DecoratedSearchResultItem;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
+import nu.marginalia.api.searchquery.model.results.PrototypeRankingParameters;
 import nu.marginalia.index.api.IndexClient;
 import nu.marginalia.service.server.DiscoverableService;
 import org.slf4j.Logger;
@@ -55,7 +55,7 @@ public class QueryGRPCService
                     .time(() -> {
 
                         var params = QueryProtobufCodec.convertRequest(request);
-                        var query = queryFactory.createQuery(params, ResultRankingParameters.sensibleDefaults());
+                        var query = queryFactory.createQuery(params, PrototypeRankingParameters.sensibleDefaults());
 
                         var indexRequest = QueryProtobufCodec.convertQuery(request, query);
 
@@ -102,7 +102,7 @@ public class QueryGRPCService
             String originalQuery,
             QueryParams params,
             IndexClient.Pagination pagination,
-            ResultRankingParameters rankingParameters) {
+            RpcResultRankingParameters rankingParameters) {
 
         var query = queryFactory.createQuery(params, rankingParameters);
         IndexClient.AggregateQueryResponse response = indexClient.executeQueries(QueryProtobufCodec.convertQuery(originalQuery, query), pagination);
@@ -134,6 +134,10 @@ public class QueryExpansion {
             if (scoreCombo > scoreA + scoreB || scoreCombo > 1000) {
                 graph.addVariantForSpan(prev, qw, joinedWord);
             }
+            else if (StringUtils.isAlpha(prev.word()) && StringUtils.isNumeric(qw.word())) { // join e.g. trs 80 to trs80 and trs-80
+                graph.addVariantForSpan(prev, qw, prev.word() + qw.word());
+                graph.addVariantForSpan(prev, qw, prev.word() + "-" + qw.word());
+            }
         }
 
         prev = qw;
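Note: the new branch expands bigrams where an alphabetic token is followed by a numeric one. A rough string-level sketch of the variants it generates (the real code operates on the query graph, not plain strings):

    String prev = "glove", next = "80";   // hypothetical query "glove 80"
    String joined = prev + next;          // "glove80"
    String hyphened = prev + "-" + next;  // "glove-80"

Both variants are added alongside the original tokens, as the `testContractionWordNum` test below verifies.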
@@ -1,12 +1,12 @@
 package nu.marginalia.query.svc;
 
 import nu.marginalia.WmsaHome;
+import nu.marginalia.api.searchquery.RpcQueryLimits;
+import nu.marginalia.api.searchquery.RpcTemporalBias;
 import nu.marginalia.api.searchquery.model.query.QueryParams;
 import nu.marginalia.api.searchquery.model.query.SearchSpecification;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
 import nu.marginalia.functions.searchquery.QueryFactory;
 import nu.marginalia.functions.searchquery.query_parser.QueryExpansion;
-import nu.marginalia.index.query.limit.QueryLimits;
 import nu.marginalia.index.query.limit.QueryStrategy;
 import nu.marginalia.index.query.limit.SpecificationLimit;
 import nu.marginalia.index.query.limit.SpecificationLimitType;
@@ -49,10 +49,15 @@ public class QueryFactoryTest {
                 SpecificationLimit.none(),
                 SpecificationLimit.none(),
                 null,
-                new QueryLimits(100, 100, 100, 100),
+                RpcQueryLimits.newBuilder()
+                        .setResultsTotal(100)
+                        .setResultsByDomain(100)
+                        .setTimeoutMs(100)
+                        .setFetchSize(100)
+                        .build(),
                 "NONE",
                 QueryStrategy.AUTO,
-                ResultRankingParameters.TemporalBias.NONE,
+                RpcTemporalBias.Bias.NONE,
                 0), null).specs;
     }
 
@@ -208,6 +213,18 @@ public class QueryFactoryTest {
         System.out.println(subquery);
     }
 
+
+    @Test
+    public void testContractionWordNum() {
+        var subquery = parseAndGetSpecs("glove 80");
+
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" glove "));
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" 80 "));
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" glove-80 "));
+        Assertions.assertTrue(subquery.query.compiledQuery.contains(" glove80 "));
+    }
+
     @Test
     public void testCplusPlus() {
         var subquery = parseAndGetSpecs("std::vector::push_back vector");
@@ -16,20 +16,19 @@ import org.slf4j.LoggerFactory;
 
 import java.util.ArrayList;
 import java.util.Comparator;
-import java.util.Iterator;
 import java.util.List;
 import java.util.concurrent.CompletableFuture;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
-
-import static java.lang.Math.clamp;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.function.Consumer;
 
 @Singleton
 public class IndexClient {
     private static final Logger logger = LoggerFactory.getLogger(IndexClient.class);
     private final GrpcMultiNodeChannelPool<IndexApiGrpc.IndexApiBlockingStub> channelPool;
     private final DomainBlacklistImpl blacklist;
-    private static final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
+    private static final ExecutorService executor = Executors.newCachedThreadPool();
 
     @Inject
     public IndexClient(GrpcChannelPoolFactory channelPoolFactory, DomainBlacklistImpl blacklist) {
@@ -51,40 +50,37 @@ public class IndexClient {
 
     /** Execute a query on the index partitions and return the combined results. */
    public AggregateQueryResponse executeQueries(RpcIndexQuery indexRequest, Pagination pagination) {
-        List<CompletableFuture<Iterator<RpcDecoratedResultItem>>> futures =
-                channelPool.call(IndexApiGrpc.IndexApiBlockingStub::query)
-                        .async(executor)
-                        .runEach(indexRequest);
 
         final int requestedMaxResults = indexRequest.getQueryLimits().getResultsTotal();
-        final int resultsUpperBound = requestedMaxResults * channelPool.getNumNodes();
-
-        List<RpcDecoratedResultItem> results = new ArrayList<>(resultsUpperBound);
-
-        for (var future : futures) {
-            try {
-                future.get().forEachRemaining(results::add);
-            }
-            catch (Exception e) {
-                logger.error("Downstream exception", e);
-            }
-        }
-
-        // Sort the results by ranking score and remove blacklisted domains
-        results.sort(comparator);
-        results.removeIf(this::isBlacklisted);
-
-        int numReceivedResults = results.size();
-
-        // pagination is typically 1-indexed, so we need to adjust the start and end indices
-        int indexStart = (pagination.page - 1) * pagination.pageSize;
-        int indexEnd = (pagination.page) * pagination.pageSize;
-
-        results = results.subList(
-                clamp(indexStart, 0, Math.max(0, results.size() - 1)), // from is inclusive, so subtract 1 from size()
-                clamp(indexEnd, 0, results.size()));
-
-        return new AggregateQueryResponse(results, pagination.page(), numReceivedResults);
+        AtomicInteger totalNumResults = new AtomicInteger(0);
+
+        List<RpcDecoratedResultItem> results =
+                channelPool.call(IndexApiGrpc.IndexApiBlockingStub::query)
+                        .async(executor)
+                        .runEach(indexRequest)
+                        .stream()
+                        .map(future -> future.thenApply(iterator -> {
+                            List<RpcDecoratedResultItem> ret = new ArrayList<>(requestedMaxResults);
+                            iterator.forEachRemaining(ret::add);
+                            totalNumResults.addAndGet(ret.size());
+                            return ret;
+                        }))
+                        .mapMulti((CompletableFuture<List<RpcDecoratedResultItem>> fut, Consumer<List<RpcDecoratedResultItem>> c) -> {
+                            try {
+                                c.accept(fut.join());
+                            } catch (Exception e) {
+                                logger.error("Error while fetching results", e);
+                            }
+                        })
+                        .flatMap(List::stream)
+                        .filter(item -> !isBlacklisted(item))
+                        .sorted(comparator)
+                        .skip(Math.max(0, (pagination.page - 1) * pagination.pageSize))
+                        .limit(pagination.pageSize)
+                        .toList();
+
+        return new AggregateQueryResponse(results, pagination.page(), totalNumResults.get());
 
     }
 
     private boolean isBlacklisted(RpcDecoratedResultItem item) {
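Note: the rewrite replaces the collect-then-subList pagination with a single stream pipeline, and the reported total is now counted before blacklist filtering (previously `numReceivedResults` was taken after it). The 1-indexed pagination arithmetic is unchanged; with hypothetical values:

    int page = 2, pageSize = 10;                      // second result page
    long toSkip = Math.max(0, (page - 1) * pageSize); // 10: skips items 1..10
    // .skip(10).limit(10) then yields items 11..20 of the sorted, filtered stream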
@@ -10,12 +10,12 @@ import it.unimi.dsi.fastutil.longs.LongArrayList;
 import nu.marginalia.api.searchquery.IndexApiGrpc;
 import nu.marginalia.api.searchquery.RpcDecoratedResultItem;
 import nu.marginalia.api.searchquery.RpcIndexQuery;
+import nu.marginalia.api.searchquery.RpcResultRankingParameters;
 import nu.marginalia.api.searchquery.model.compiled.CompiledQuery;
 import nu.marginalia.api.searchquery.model.compiled.CompiledQueryLong;
 import nu.marginalia.api.searchquery.model.compiled.CqDataInt;
 import nu.marginalia.api.searchquery.model.query.SearchSpecification;
 import nu.marginalia.api.searchquery.model.results.ResultRankingContext;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
 import nu.marginalia.array.page.LongQueryBuffer;
 import nu.marginalia.index.index.StatefulIndex;
 import nu.marginalia.index.model.SearchParameters;
@@ -211,7 +211,7 @@ public class IndexGrpcService
     /** This class is responsible for ranking the results and adding the best results to the
      * resultHeap, which depending on the state of the indexLookup threads may or may not block
      */
-    private ResultRankingContext createRankingContext(ResultRankingParameters rankingParams,
+    private ResultRankingContext createRankingContext(RpcResultRankingParameters rankingParams,
                                                       CompiledQuery<String> compiledQuery,
                                                       CompiledQueryLong compiledQueryIds)
     {
@@ -2,12 +2,13 @@ package nu.marginalia.index.model;
 
 import nu.marginalia.api.searchquery.IndexProtobufCodec;
 import nu.marginalia.api.searchquery.RpcIndexQuery;
+import nu.marginalia.api.searchquery.RpcResultRankingParameters;
 import nu.marginalia.api.searchquery.model.compiled.CompiledQuery;
 import nu.marginalia.api.searchquery.model.compiled.CompiledQueryLong;
 import nu.marginalia.api.searchquery.model.compiled.CompiledQueryParser;
-import nu.marginalia.api.searchquery.model.query.SearchSpecification;
 import nu.marginalia.api.searchquery.model.query.SearchQuery;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
+import nu.marginalia.api.searchquery.model.query.SearchSpecification;
+import nu.marginalia.api.searchquery.model.results.PrototypeRankingParameters;
 import nu.marginalia.index.query.IndexSearchBudget;
 import nu.marginalia.index.query.limit.QueryStrategy;
 import nu.marginalia.index.searchset.SearchSet;
@@ -23,7 +24,7 @@ public class SearchParameters {
     public final IndexSearchBudget budget;
     public final SearchQuery query;
     public final QueryParams queryParams;
-    public final ResultRankingParameters rankingParams;
+    public final RpcResultRankingParameters rankingParams;
 
     public final int limitByDomain;
     public final int limitTotal;
@@ -41,11 +42,11 @@ public class SearchParameters {
     public SearchParameters(SearchSpecification specsSet, SearchSet searchSet) {
         var limits = specsSet.queryLimits;
 
-        this.fetchSize = limits.fetchSize();
-        this.budget = new IndexSearchBudget(limits.timeoutMs());
+        this.fetchSize = limits.getFetchSize();
+        this.budget = new IndexSearchBudget(limits.getTimeoutMs());
         this.query = specsSet.query;
-        this.limitByDomain = limits.resultsByDomain();
-        this.limitTotal = limits.resultsTotal();
+        this.limitByDomain = limits.getResultsByDomain();
+        this.limitTotal = limits.getResultsTotal();
 
         queryParams = new QueryParams(
                 specsSet.quality,
@@ -62,17 +63,17 @@ public class SearchParameters {
     }
 
     public SearchParameters(RpcIndexQuery request, SearchSet searchSet) {
-        var limits = IndexProtobufCodec.convertQueryLimits(request.getQueryLimits());
+        var limits = request.getQueryLimits();
 
-        this.fetchSize = limits.fetchSize();
+        this.fetchSize = limits.getFetchSize();
 
         // The time budget is halved because this is the point when we start to
         // wrap up the search and return the results.
-        this.budget = new IndexSearchBudget(limits.timeoutMs() / 2);
+        this.budget = new IndexSearchBudget(limits.getTimeoutMs() / 2);
         this.query = IndexProtobufCodec.convertRpcQuery(request.getQuery());
 
-        this.limitByDomain = limits.resultsByDomain();
-        this.limitTotal = limits.resultsTotal();
+        this.limitByDomain = limits.getResultsByDomain();
+        this.limitTotal = limits.getResultsTotal();
 
         queryParams = new QueryParams(
                 convertSpecLimit(request.getQuality()),
@@ -85,7 +86,7 @@ public class SearchParameters {
         compiledQuery = CompiledQueryParser.parse(this.query.compiledQuery);
         compiledQueryIds = compiledQuery.mapToLong(SearchTermsUtil::getWordId);
 
-        rankingParams = IndexProtobufCodec.convertRankingParameterss(request.getParameters());
+        rankingParams = request.hasParameters() ? request.getParameters() : PrototypeRankingParameters.sensibleDefaults();
     }
 
@@ -2,7 +2,6 @@ package nu.marginalia.index.results;
 
 import nu.marginalia.api.searchquery.model.compiled.CqDataInt;
 import nu.marginalia.api.searchquery.model.compiled.CqExpression;
-import nu.marginalia.api.searchquery.model.results.Bm25Parameters;
 import nu.marginalia.api.searchquery.model.results.ResultRankingContext;
 
 import java.util.BitSet;
@@ -24,14 +23,14 @@ public class Bm25GraphVisitor implements CqExpression.DoubleVisitor {
 
     private final BitSet mask;
 
-    public Bm25GraphVisitor(Bm25Parameters bm25Parameters,
+    public Bm25GraphVisitor(double k1, double b,
                             float[] counts,
                             int length,
                             ResultRankingContext ctx) {
         this.length = length;
 
-        this.k1 = bm25Parameters.k();
-        this.b = bm25Parameters.b();
+        this.k1 = k1;
+        this.b = b;
 
         this.docCount = ctx.termFreqDocCount();
         this.counts = counts;
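Note: the `Bm25Parameters` holder is dissolved into the two scalar BM25 constants. For reference, these are the free parameters of the textbook BM25 term score (the visitor's exact weighting may differ in detail):

    \mathrm{score}(t,d) = \mathrm{IDF}(t)\cdot \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}

Here k1 caps term-frequency saturation and b controls document-length normalization; b = 0 disables it, which is why the priority-term visitor in TermFlagsGraphVisitor below passes 0 in place of b.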
@@ -0,0 +1,119 @@
+package nu.marginalia.index.results;
+
+import com.google.inject.Inject;
+import com.google.inject.Singleton;
+import gnu.trove.map.hash.TIntDoubleHashMap;
+import nu.marginalia.WmsaHome;
+import nu.marginalia.db.DbDomainQueries;
+import nu.marginalia.model.EdgeDomain;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.List;
+import java.util.OptionalInt;
+import java.util.concurrent.TimeUnit;
+
+@Singleton
+public class DomainRankingOverrides {
+    private final DbDomainQueries domainQueries;
+
+    private volatile TIntDoubleHashMap rankingFactors = new TIntDoubleHashMap(100, 0.75f, -1, 1.);
+
+    private static final Logger logger = LoggerFactory.getLogger(DomainRankingOverrides.class);
+
+    private final Path overrideFilePath;
+
+    @Inject
+    public DomainRankingOverrides(DbDomainQueries domainQueries) {
+        this.domainQueries = domainQueries;
+
+        overrideFilePath = WmsaHome.getDataPath().resolve("domain-ranking-factors.txt");
+
+        Thread.ofPlatform().start(this::updateRunner);
+    }
+
+    // for test access
+    public DomainRankingOverrides(DbDomainQueries domainQueries, Path overrideFilePath)
+    {
+        this.domainQueries = domainQueries;
+        this.overrideFilePath = overrideFilePath;
+    }
+
+    public double getRankingFactor(int domainId) {
+        return rankingFactors.get(domainId);
+    }
+
+    private void updateRunner() {
+        for (;;) {
+            reloadFile();
+
+            try {
+                TimeUnit.MINUTES.sleep(5);
+            } catch (InterruptedException ex) {
+                logger.warn("Thread interrupted", ex);
+                break;
+            }
+        }
+    }
+
+    void reloadFile() {
+        if (!Files.exists(overrideFilePath)) {
+            return;
+        }
+
+        try {
+            List<String> lines = Files.readAllLines(overrideFilePath);
+
+            double factor = 1.;
+
+            var newRankingFactors = new TIntDoubleHashMap(lines.size(), 0.75f, -1, 1.);
+
+            for (var line : lines) {
+                if (line.isBlank()) continue;
+                if (line.startsWith("#")) continue;
+
+                String[] parts = line.split("\\s+");
+                if (parts.length != 2) {
+                    logger.warn("Unrecognized format for domain overrides file: {}", line);
+                    continue;
+                }
+
+                try {
+                    switch (parts[0]) {
+                        case "value" -> {
+                            // error handle me
+                            factor = Double.parseDouble(parts[1]);
+                            if (factor < 0) {
+                                logger.error("Negative values are not permitted, found {}", factor);
+                                factor = 1;
+                            }
+                        }
+                        case "domain" -> {
+                            // error handle
+                            OptionalInt domainId = domainQueries.tryGetDomainId(new EdgeDomain(parts[1]));
+                            if (domainId.isPresent()) {
+                                newRankingFactors.put(domainId.getAsInt(), factor);
+                            }
+                            else {
+                                logger.warn("Unrecognized domain id {}", parts[1]);
+                            }
+                        }
+                        default -> {
+                            logger.warn("Unrecognized format {}", line);
+                        }
+                    }
+                } catch (Exception ex) {
+                    logger.warn("Error in parsing domain overrides file: {} ({})", line, ex.getClass().getSimpleName());
+                }
+            }
+
+            rankingFactors = newRankingFactors;
+        } catch (IOException ex) {
+            logger.error("Failed to read " + overrideFilePath, ex);
+        }
+    }
+}
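Note: the override file format, as implied by the parser above: `value <factor>` sets the current multiplier and each following `domain <name>` line binds it; `#` lines and blank lines are skipped, and unknown domains only produce a warning. An example `domain-ranking-factors.txt` (mirroring the test fixture later in this changeset):

    # de-prioritize these
    value 0.75
    domain first.example.com
    domain second.example.com

    # boost this one
    value 1.1
    domain third.example.com

Domains not listed fall back to the map's no-entry value of 1.0.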
@@ -40,13 +40,16 @@ public class IndexResultRankingService {
 
     private final DocumentDbReader documentDbReader;
     private final StatefulIndex statefulIndex;
+    private final DomainRankingOverrides domainRankingOverrides;
 
     @Inject
     public IndexResultRankingService(DocumentDbReader documentDbReader,
-                                     StatefulIndex statefulIndex)
+                                     StatefulIndex statefulIndex,
+                                     DomainRankingOverrides domainRankingOverrides)
     {
         this.documentDbReader = documentDbReader;
         this.statefulIndex = statefulIndex;
+        this.domainRankingOverrides = domainRankingOverrides;
     }
 
     public List<SearchResultItem> rankResults(SearchParameters params,
@@ -57,7 +60,7 @@ public class IndexResultRankingService {
         if (resultIds.isEmpty())
             return List.of();
 
-        IndexResultScoreCalculator resultRanker = new IndexResultScoreCalculator(statefulIndex, rankingContext, params);
+        IndexResultScoreCalculator resultRanker = new IndexResultScoreCalculator(statefulIndex, domainRankingOverrides, rankingContext, params);
 
         List<SearchResultItem> results = new ArrayList<>(resultIds.size());
 
@@ -156,7 +159,7 @@ public class IndexResultRankingService {
         // for the selected results, as this would be comically expensive to do for all the results we
         // discard along the way
 
-        if (params.rankingParams.exportDebugData) {
+        if (params.rankingParams.getExportDebugData()) {
             var combinedIdsList = new LongArrayList(resultsList.size());
             for (var item : resultsList) {
                 combinedIdsList.add(item.combinedId);
@@ -2,10 +2,11 @@ package nu.marginalia.index.results;
 
 import it.unimi.dsi.fastutil.ints.IntIterator;
 import it.unimi.dsi.fastutil.ints.IntList;
+import nu.marginalia.api.searchquery.RpcResultRankingParameters;
+import nu.marginalia.api.searchquery.RpcTemporalBias;
 import nu.marginalia.api.searchquery.model.compiled.CompiledQuery;
 import nu.marginalia.api.searchquery.model.compiled.CompiledQueryLong;
 import nu.marginalia.api.searchquery.model.results.ResultRankingContext;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
 import nu.marginalia.api.searchquery.model.results.SearchResultItem;
 import nu.marginalia.api.searchquery.model.results.debug.DebugRankingFactors;
 import nu.marginalia.index.forward.spans.DocumentSpans;
@@ -40,14 +41,17 @@ public class IndexResultScoreCalculator {
     private final CombinedIndexReader index;
     private final QueryParams queryParams;
 
+    private final DomainRankingOverrides domainRankingOverrides;
     private final ResultRankingContext rankingContext;
     private final CompiledQuery<String> compiledQuery;
 
     public IndexResultScoreCalculator(StatefulIndex statefulIndex,
+                                      DomainRankingOverrides domainRankingOverrides,
                                       ResultRankingContext rankingContext,
                                       SearchParameters params)
     {
         this.index = statefulIndex.get();
+        this.domainRankingOverrides = domainRankingOverrides;
         this.rankingContext = rankingContext;
 
         this.queryParams = params.queryParams;
@@ -116,20 +120,20 @@ public class IndexResultScoreCalculator {
 
         float proximitiyFac = getProximitiyFac(decodedPositions, searchTerms.phraseConstraints, verbatimMatches, unorderedMatches, spans);
 
-        double score_firstPosition = params.tcfFirstPosition * (1.0 / Math.sqrt(unorderedMatches.firstPosition));
-        double score_verbatim = params.tcfVerbatim * verbatimMatches.getScore();
-        double score_proximity = params.tcfProximity * proximitiyFac;
-        double score_bM25 = params.bm25Weight
-                * wordFlagsQuery.root.visit(new Bm25GraphVisitor(params.bm25Params, unorderedMatches.getWeightedCounts(), docSize, rankingContext))
+        double score_firstPosition = params.getTcfFirstPositionWeight() * (1.0 / Math.sqrt(unorderedMatches.firstPosition));
+        double score_verbatim = params.getTcfVerbatimWeight() * verbatimMatches.getScore();
+        double score_proximity = params.getTcfProximityWeight() * proximitiyFac;
+        double score_bM25 = params.getBm25Weight()
+                * wordFlagsQuery.root.visit(new Bm25GraphVisitor(params.getBm25K(), params.getBm25B(), unorderedMatches.getWeightedCounts(), docSize, rankingContext))
                 / (Math.sqrt(unorderedMatches.searchableKeywordCount + 1));
-        double score_bFlags = params.bm25Weight
-                * wordFlagsQuery.root.visit(new TermFlagsGraphVisitor(params.bm25Params, wordFlagsQuery.data, unorderedMatches.getWeightedCounts(), rankingContext))
+        double score_bFlags = params.getBm25Weight()
+                * wordFlagsQuery.root.visit(new TermFlagsGraphVisitor(params.getBm25K(), wordFlagsQuery.data, unorderedMatches.getWeightedCounts(), rankingContext))
                 / (Math.sqrt(unorderedMatches.searchableKeywordCount + 1));
 
+        double rankingAdjustment = domainRankingOverrides.getRankingFactor(UrlIdCodec.getDomainId(combinedId));
+
         double score = normalize(
-                score_firstPosition + score_proximity + score_verbatim
-                        + score_bM25
-                        + score_bFlags,
+                rankingAdjustment * (score_firstPosition + score_proximity + score_verbatim + score_bM25 + score_bFlags),
                 -Math.min(0, documentBonus) // The magnitude of documentBonus, if it is negative; otherwise 0
         );
 
@@ -245,9 +249,13 @@ public class IndexResultScoreCalculator {
     private double calculateDocumentBonus(long documentMetadata,
                                           int features,
                                           int length,
-                                          ResultRankingParameters rankingParams,
+                                          RpcResultRankingParameters rankingParams,
                                           @Nullable DebugRankingFactors debugRankingFactors) {
 
+        if (rankingParams.getDisablePenalties()) {
+            return 0.;
+        }
+
         int rank = DocumentMetadata.decodeRank(documentMetadata);
         int asl = DocumentMetadata.decodeAvgSentenceLength(documentMetadata);
         int quality = DocumentMetadata.decodeQuality(documentMetadata);
@@ -256,18 +264,18 @@ public class IndexResultScoreCalculator {
         int topology = DocumentMetadata.decodeTopology(documentMetadata);
         int year = DocumentMetadata.decodeYear(documentMetadata);
 
-        double averageSentenceLengthPenalty = (asl >= rankingParams.shortSentenceThreshold ? 0 : -rankingParams.shortSentencePenalty);
+        double averageSentenceLengthPenalty = (asl >= rankingParams.getShortSentenceThreshold() ? 0 : -rankingParams.getShortSentencePenalty());
 
         final double qualityPenalty = calculateQualityPenalty(size, quality, rankingParams);
-        final double rankingBonus = (255. - rank) * rankingParams.domainRankBonus;
+        final double rankingBonus = (255. - rank) * rankingParams.getDomainRankBonus();
         final double topologyBonus = Math.log(1 + topology);
-        final double documentLengthPenalty = length > rankingParams.shortDocumentThreshold ? 0 : -rankingParams.shortDocumentPenalty;
+        final double documentLengthPenalty = length > rankingParams.getShortDocumentThreshold() ? 0 : -rankingParams.getShortDocumentPenalty();
         final double temporalBias;
 
-        if (rankingParams.temporalBias == ResultRankingParameters.TemporalBias.RECENT) {
-            temporalBias = - Math.abs(year - PubDate.MAX_YEAR) * rankingParams.temporalBiasWeight;
-        } else if (rankingParams.temporalBias == ResultRankingParameters.TemporalBias.OLD) {
-            temporalBias = - Math.abs(year - PubDate.MIN_YEAR) * rankingParams.temporalBiasWeight;
+        if (rankingParams.getTemporalBias().getBias() == RpcTemporalBias.Bias.RECENT) {
+            temporalBias = - Math.abs(year - PubDate.MAX_YEAR) * rankingParams.getTemporalBiasWeight();
+        } else if (rankingParams.getTemporalBias().getBias() == RpcTemporalBias.Bias.OLD) {
+            temporalBias = - Math.abs(year - PubDate.MIN_YEAR) * rankingParams.getTemporalBiasWeight();
         } else {
             temporalBias = 0;
         }
@@ -506,14 +514,14 @@ public class IndexResultScoreCalculator {
     }
 
 
-    private double calculateQualityPenalty(int size, int quality, ResultRankingParameters rankingParams) {
+    private double calculateQualityPenalty(int size, int quality, RpcResultRankingParameters rankingParams) {
         if (size < 400) {
             if (quality < 5)
                 return 0;
-            return -quality * rankingParams.qualityPenalty;
+            return -quality * rankingParams.getQualityPenalty();
         }
         else {
-            return -quality * rankingParams.qualityPenalty * 20;
+            return -quality * rankingParams.getQualityPenalty() * 20;
         }
     }
 
@@ -575,3 +583,4 @@ public class IndexResultScoreCalculator {
     }
 
 }
+
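Note: the domain override multiplies only the positive relevance signal; the document penalty term passed to `normalize` is unaffected. With hypothetical component scores:

    // positive signal: 1.0 + 0.5 + 0.25 + 2.0 + 0.25 = 4.0
    // a domain under "value 0.75" contributes 0.75 * 4.0 = 3.0,
    // an unlisted domain keeps the map's default factor of 1.0 (4.0 unchanged)
    double adjusted = 0.75 * (1.0 + 0.5 + 0.25 + 2.0 + 0.25); // == 3.0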
@@ -3,7 +3,6 @@ package nu.marginalia.index.results;
 import nu.marginalia.api.searchquery.model.compiled.CqDataInt;
 import nu.marginalia.api.searchquery.model.compiled.CqDataLong;
 import nu.marginalia.api.searchquery.model.compiled.CqExpression;
-import nu.marginalia.api.searchquery.model.results.Bm25Parameters;
 import nu.marginalia.api.searchquery.model.results.ResultRankingContext;
 import nu.marginalia.model.idx.WordFlags;
 
@@ -15,15 +14,14 @@ public class TermFlagsGraphVisitor implements CqExpression.DoubleVisitor {
     private final CqDataLong wordMetaData;
     private final CqDataInt frequencies;
     private final float[] counts;
-    private final Bm25Parameters bm25Parameters;
+    private final double k1;
 
     private final int docCount;
 
-    public TermFlagsGraphVisitor(Bm25Parameters bm25Parameters,
+    public TermFlagsGraphVisitor(double k1,
                                  CqDataLong wordMetaData,
                                  float[] counts,
                                  ResultRankingContext ctx) {
-        this.bm25Parameters = bm25Parameters;
+        this.k1 = k1;
         this.counts = counts;
         this.docCount = ctx.termFreqDocCount();
         this.wordMetaData = wordMetaData;
@@ -55,7 +53,7 @@ public class TermFlagsGraphVisitor implements CqExpression.DoubleVisitor {
         int freq = frequencies.get(idx);
 
         // note we override b to zero for priority terms as they are independent of document length
-        return invFreq(docCount, freq) * f(bm25Parameters.k(), 0, count, 0);
+        return invFreq(docCount, freq) * f(k1, 0, count, 0);
     }
 
     private double evaluatePriorityScore(int idx) {
@@ -1,7 +0,0 @@
-package nu.marginalia.index.query.limit;
-
-public record QueryLimits(int resultsByDomain, int resultsTotal, int timeoutMs, int fetchSize) {
-    public QueryLimits forSingleDomain() {
-        return new QueryLimits(resultsTotal, resultsTotal, timeoutMs, fetchSize);
-    }
-}
@@ -4,10 +4,11 @@ import com.google.inject.Guice;
 import com.google.inject.Inject;
 import nu.marginalia.IndexLocations;
 import nu.marginalia.api.searchquery.RpcDecoratedResultItem;
+import nu.marginalia.api.searchquery.RpcQueryLimits;
 import nu.marginalia.api.searchquery.model.query.SearchPhraseConstraint;
 import nu.marginalia.api.searchquery.model.query.SearchQuery;
 import nu.marginalia.api.searchquery.model.query.SearchSpecification;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
+import nu.marginalia.api.searchquery.model.results.PrototypeRankingParameters;
 import nu.marginalia.index.construction.DocIdRewriter;
 import nu.marginalia.index.construction.full.FullIndexConstructor;
 import nu.marginalia.index.construction.prio.PrioIndexConstructor;
@@ -17,7 +18,6 @@ import nu.marginalia.index.forward.construction.ForwardIndexConverter;
 import nu.marginalia.index.index.StatefulIndex;
 import nu.marginalia.index.journal.IndexJournal;
 import nu.marginalia.index.journal.IndexJournalSlopWriter;
-import nu.marginalia.index.query.limit.QueryLimits;
 import nu.marginalia.index.query.limit.QueryStrategy;
 import nu.marginalia.index.query.limit.SpecificationLimit;
 import nu.marginalia.linkdb.docs.DocumentDbReader;
@@ -115,9 +115,16 @@ public class IndexQueryServiceIntegrationSmokeTest {
 
         var rsp = queryService.justQuery(
                 SearchSpecification.builder()
-                        .queryLimits(new QueryLimits(10, 10, Integer.MAX_VALUE, 4000))
+                        .queryLimits(
+                                RpcQueryLimits.newBuilder()
+                                        .setResultsByDomain(10)
+                                        .setResultsTotal(10)
+                                        .setTimeoutMs(Integer.MAX_VALUE)
+                                        .setFetchSize(4000)
+                                        .build()
+                        )
                         .queryStrategy(QueryStrategy.SENTENCE)
-                        .rankingParams(ResultRankingParameters.sensibleDefaults())
+                        .rankingParams(PrototypeRankingParameters.sensibleDefaults())
                         .domains(new ArrayList<>())
                         .searchSetIdentifier("NONE")
                         .query(
@@ -171,9 +178,16 @@ public class IndexQueryServiceIntegrationSmokeTest {
 
         var rsp = queryService.justQuery(
                 SearchSpecification.builder()
-                        .queryLimits(new QueryLimits(10, 10, Integer.MAX_VALUE, 4000))
+                        .queryLimits(
+                                RpcQueryLimits.newBuilder()
+                                        .setResultsByDomain(10)
+                                        .setResultsTotal(10)
+                                        .setTimeoutMs(Integer.MAX_VALUE)
+                                        .setFetchSize(4000)
+                                        .build()
+                        )
                         .queryStrategy(QueryStrategy.SENTENCE)
-                        .rankingParams(ResultRankingParameters.sensibleDefaults())
+                        .rankingParams(PrototypeRankingParameters.sensibleDefaults())
                         .domains(new ArrayList<>())
                         .searchSetIdentifier("NONE")
                         .query(
@@ -225,8 +239,15 @@ public class IndexQueryServiceIntegrationSmokeTest {
 
         var rsp = queryService.justQuery(
                 SearchSpecification.builder()
-                        .queryLimits(new QueryLimits(10, 10, Integer.MAX_VALUE, 4000))
-                        .rankingParams(ResultRankingParameters.sensibleDefaults())
+                        .queryLimits(
+                                RpcQueryLimits.newBuilder()
+                                        .setResultsByDomain(10)
+                                        .setResultsTotal(10)
+                                        .setTimeoutMs(Integer.MAX_VALUE)
+                                        .setFetchSize(4000)
+                                        .build()
+                        )
+                        .rankingParams(PrototypeRankingParameters.sensibleDefaults())
                         .queryStrategy(QueryStrategy.SENTENCE)
                         .domains(List.of(2))
                         .query(
@@ -282,11 +303,18 @@ public class IndexQueryServiceIntegrationSmokeTest {
 
         var rsp = queryService.justQuery(
                 SearchSpecification.builder()
-                        .queryLimits(new QueryLimits(10, 10, Integer.MAX_VALUE, 4000))
+                        .queryLimits(
+                                RpcQueryLimits.newBuilder()
+                                        .setResultsByDomain(10)
+                                        .setResultsTotal(10)
+                                        .setTimeoutMs(Integer.MAX_VALUE)
+                                        .setFetchSize(4000)
+                                        .build()
+                        )
                         .year(SpecificationLimit.equals(1998))
                         .queryStrategy(QueryStrategy.SENTENCE)
                         .searchSetIdentifier("NONE")
-                        .rankingParams(ResultRankingParameters.sensibleDefaults())
+                        .rankingParams(PrototypeRankingParameters.sensibleDefaults())
                         .query(
                                 SearchQuery.builder()
                                         .compiledQuery("4")
@@ -4,10 +4,11 @@ import com.google.inject.Guice;
 import com.google.inject.Inject;
 import it.unimi.dsi.fastutil.ints.IntList;
 import nu.marginalia.IndexLocations;
+import nu.marginalia.api.searchquery.RpcQueryLimits;
 import nu.marginalia.api.searchquery.model.query.SearchPhraseConstraint;
 import nu.marginalia.api.searchquery.model.query.SearchQuery;
 import nu.marginalia.api.searchquery.model.query.SearchSpecification;
-import nu.marginalia.api.searchquery.model.results.ResultRankingParameters;
+import nu.marginalia.api.searchquery.model.results.PrototypeRankingParameters;
 import nu.marginalia.hash.MurmurHash3_128;
 import nu.marginalia.index.construction.DocIdRewriter;
 import nu.marginalia.index.construction.full.FullIndexConstructor;
@@ -18,7 +19,6 @@ import nu.marginalia.index.forward.construction.ForwardIndexConverter;
 import nu.marginalia.index.index.StatefulIndex;
 import nu.marginalia.index.journal.IndexJournal;
 import nu.marginalia.index.journal.IndexJournalSlopWriter;
-import nu.marginalia.index.query.limit.QueryLimits;
 import nu.marginalia.index.query.limit.QueryStrategy;
 import nu.marginalia.index.query.limit.SpecificationLimit;
 import nu.marginalia.linkdb.docs.DocumentDbReader;
@@ -389,13 +389,20 @@ public class IndexQueryServiceIntegrationTest {
     SearchSpecification basicQuery(Function<SearchSpecification.SearchSpecificationBuilder, SearchSpecification.SearchSpecificationBuilder> mutator)
     {
         var builder = SearchSpecification.builder()
-                .queryLimits(new QueryLimits(10, 10, Integer.MAX_VALUE, 4000))
+                .queryLimits(
+                        RpcQueryLimits.newBuilder()
+                                .setResultsByDomain(10)
+                                .setResultsTotal(10)
+                                .setTimeoutMs(Integer.MAX_VALUE)
+                                .setFetchSize(4000)
+                                .build()
+                )
                 .queryStrategy(QueryStrategy.SENTENCE)
                 .year(SpecificationLimit.none())
                 .quality(SpecificationLimit.none())
                 .size(SpecificationLimit.none())
                 .rank(SpecificationLimit.none())
-                .rankingParams(ResultRankingParameters.sensibleDefaults())
+                .rankingParams(PrototypeRankingParameters.sensibleDefaults())
                 .domains(new ArrayList<>())
                 .searchSetIdentifier("NONE");
 
@@ -0,0 +1,103 @@
+package nu.marginalia.index.results;
+
+import com.zaxxer.hikari.HikariConfig;
+import com.zaxxer.hikari.HikariDataSource;
+import nu.marginalia.db.DbDomainQueries;
+import nu.marginalia.model.EdgeDomain;
+import nu.marginalia.test.TestMigrationLoader;
+import org.junit.jupiter.api.Assertions;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.parallel.Execution;
+import org.junit.jupiter.api.parallel.ExecutionMode;
+import org.testcontainers.containers.MariaDBContainer;
+import org.testcontainers.junit.jupiter.Container;
+import org.testcontainers.junit.jupiter.Testcontainers;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+import java.sql.SQLException;
+
+@Testcontainers
+@Execution(ExecutionMode.SAME_THREAD)
+@Tag("slow")
+class DomainRankingOverridesTest {
+    @Container
+    static MariaDBContainer<?> mariaDBContainer = new MariaDBContainer<>("mariadb")
+            .withDatabaseName("WMSA_prod")
+            .withUsername("wmsa")
+            .withPassword("wmsa")
+            .withNetworkAliases("mariadb");
+
+    private static DbDomainQueries domainQueries;
+
+    @BeforeAll
+    public static void setup() throws SQLException {
+        HikariConfig config = new HikariConfig();
+        config.setJdbcUrl(mariaDBContainer.getJdbcUrl());
+        config.setUsername("wmsa");
+        config.setPassword("wmsa");
+
+        var dataSource = new HikariDataSource(config);
+
+        TestMigrationLoader.flywayMigration(dataSource);
+
+        try (var conn = dataSource.getConnection();
+             var stmt = conn.createStatement()) {
+            stmt.executeQuery("DELETE FROM EC_DOMAIN"); // Wipe any old state from other test runs
+
+            stmt.executeQuery("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('first.example.com', 'example.com', 1)");
+            stmt.executeQuery("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('second.example.com', 'example.com', 1)");
+            stmt.executeQuery("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('third.example.com', 'example.com', 1)");
+            stmt.executeQuery("INSERT INTO EC_DOMAIN (DOMAIN_NAME, DOMAIN_TOP, NODE_AFFINITY) VALUES ('not-added.example.com', 'example.com', 1)");
+        }
+
+        domainQueries = new DbDomainQueries(dataSource);
+    }
+
+    @Test
+    public void test() throws IOException {
+        Path overridesFile = Files.createTempFile(getClass().getSimpleName(), ".txt");
+        try {
+            Files.writeString(overridesFile, """
+                    # A comment
+                    value 0.75
+                    domain first.example.com
+                    domain second.example.com
+
+                    value 1.1
+                    domain third.example.com
+                    """,
+                    StandardOpenOption.APPEND);
+
+            var overrides = new DomainRankingOverrides(domainQueries, overridesFile);
+
+            overrides.reloadFile();
+
+            Assertions.assertEquals(0.75, overrides.getRankingFactor(
+                    domainQueries.getDomainId(new EdgeDomain("first.example.com"))
+            ));
+            Assertions.assertEquals(0.75, overrides.getRankingFactor(
+                    domainQueries.getDomainId(new EdgeDomain("second.example.com"))
+            ));
+            Assertions.assertEquals(1.1, overrides.getRankingFactor(
+                    domainQueries.getDomainId(new EdgeDomain("third.example.com"))
+            ));
+            Assertions.assertEquals(1.0, overrides.getRankingFactor(
+                    domainQueries.getDomainId(new EdgeDomain("not-added.example.com"))
+            ));
+            Assertions.assertEquals(1.0, overrides.getRankingFactor(1<<23));
+        }
+        finally {
+            Files.deleteIfExists(overridesFile);
+        }
+    }
+}
@@ -45,6 +45,11 @@ public class GammaCodedSequenceArrayColumn extends AbstractObjectColumn<List<Gam
         );
     }
 
+    @Override
+    public int alignmentSize() {
+        return 1;
+    }
+
     public Reader openUnregistered(URI uri, int page) throws IOException {
         return new Reader(
                 dataColumn.openUnregistered(uri, page),
@@ -109,6 +114,11 @@ public class GammaCodedSequenceArrayColumn extends AbstractObjectColumn<List<Gam
             dataReader.skip(toSkip);
         }
 
+        @Override
+        public boolean isDirect() {
+            return dataReader.isDirect();
+        }
+
         @Override
         public boolean hasRemaining() throws IOException {
             return groupsReader.hasRemaining();
@@ -44,6 +44,11 @@ public class GammaCodedSequenceColumn extends AbstractObjectColumn<GammaCodedSeq
         );
     }
 
+    @Override
+    public int alignmentSize() {
+        return 1;
+    }
+
     public Reader openUnregistered(URI uri, int page) throws IOException {
         return new Reader(
                 Storage.reader(uri, this, page, false),
@@ -96,6 +101,11 @@ public class GammaCodedSequenceColumn extends AbstractObjectColumn<GammaCodedSeq
             this.indexReader = indexReader;
         }
 
+        @Override
+        public boolean isDirect() {
+            return storage.isDirect();
+        }
+
         @Override
         public AbstractColumn<?, ?> columnDesc() {
             return GammaCodedSequenceColumn.this;
@@ -45,6 +45,11 @@ public class VarintCodedSequenceArrayColumn extends AbstractObjectColumn<List<Va
         );
     }
 
+    @Override
+    public int alignmentSize() {
+        return 0;
+    }
+
     public Reader openUnregistered(URI uri, int page) throws IOException {
         return new Reader(
                 dataColumn.openUnregistered(uri, page),
@@ -109,6 +114,11 @@ public class VarintCodedSequenceArrayColumn extends AbstractObjectColumn<List<Va
             dataReader.skip(toSkip);
         }
 
+        @Override
+        public boolean isDirect() {
+            return dataReader.isDirect();
+        }
+
         @Override
         public boolean hasRemaining() throws IOException {
             return groupsReader.hasRemaining();
@@ -44,6 +44,11 @@ public class VarintCodedSequenceColumn extends AbstractObjectColumn<VarintCodedS
         );
     }
 
+    @Override
+    public int alignmentSize() {
+        return 1;
+    }
+
     public Reader openUnregistered(URI uri, int page) throws IOException {
         return new Reader(
                 Storage.reader(uri, this, page, false),
@@ -101,6 +106,11 @@ public class VarintCodedSequenceColumn extends AbstractObjectColumn<VarintCodedS
             return VarintCodedSequenceColumn.this;
         }
 
+        @Override
+        public boolean isDirect() {
+            return storage.isDirect();
+        }
+
         @Override
         public void skip(long positions) throws IOException {
             for (int i = 0; i < positions; i++) {
@@ -155,8 +155,15 @@ public class SentenceExtractor {
     public List<DocumentSentence> extractSentencesFromString(String text, EnumSet<HtmlTag> htmlTags) {
         String[] sentences;

-        // Normalize spaces
+        // Safety net against malformed data DOS attacks,
+        // found 5+ MB <p>-tags in the wild that just break
+        // the sentence extractor causing it to stall forever.
+        if (text.length() > 50_000) {
+            // 50k chars can hold a small novel, let alone single html tags
+            text = text.substring(0, 50_000);
+        }

+        // Normalize spaces
         text = normalizeSpaces(text);

         // Split into sentences
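The guard is a plain hard cap applied before any tokenization work. Its observable behaviour, using the values from the hunk above (the helper name is illustrative):

    // Mirrors the safety net above: inputs beyond 50k chars are truncated.
    static String capLength(String text) {
        return text.length() > 50_000 ? text.substring(0, 50_000) : text;
    }

    // capLength("x".repeat(5_000_000)).length() == 50_000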
@@ -5,9 +5,7 @@ import nu.marginalia.actor.state.*;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.List;
+import java.util.*;

 public abstract class RecordActorPrototype implements ActorPrototype {

@@ -118,7 +116,7 @@ public abstract class RecordActorPrototype implements ActorPrototype {
     }

     private String functionName(Class<? extends ActorStep> functionClass) {
-        return functionClass.getSimpleName().toUpperCase();
+        return ActorStep.functionName(functionClass);
     }

     private ActorStep constructState(String message) throws ReflectiveOperationException {
@@ -145,4 +143,43 @@ public abstract class RecordActorPrototype implements ActorPrototype {
         }
     }

+    /** Get a list of JSON prototypes for each actor step declared by this actor */
+    @SuppressWarnings("unchecked")
+    public Map<String, String> getMessagePrototypes() {
+        Map<String, String> messagePrototypes = new HashMap<>();
+
+        for (var clazz : getClass().getDeclaredClasses()) {
+            if (!clazz.isRecord() || !ActorStep.class.isAssignableFrom(clazz))
+                continue;
+
+            StringJoiner sj = new StringJoiner(",\n\t", "{\n\t", "\n}");
+
+            renderToJsonPrototype(sj, (Class<? extends Record>) clazz);
+
+            messagePrototypes.put(ActorStep.functionName((Class<? extends ActorStep>) clazz), sj.toString());
+        }
+
+        return messagePrototypes;
+    }
+
+    @SuppressWarnings("unchecked")
+    private void renderToJsonPrototype(StringJoiner sj, Class<? extends Record> recordType) {
+        for (var field : recordType.getDeclaredFields()) {
+            String typeName = field.getType().getSimpleName();
+
+            if ("List".equals(typeName)) {
+                sj.add(String.format("\"%s\": [ ]", field.getName()));
+            }
+            else if (field.getType().isRecord()) {
+                var innerSj = new StringJoiner(",", "{", "}");
+                renderToJsonPrototype(innerSj, (Class<? extends Record>) field.getType());
+                sj.add(String.format("\"%s\": %s", field.getName(), innerSj));
+            }
+            else {
+                sj.add(String.format("\"%s\": \"%s\"", field.getName(), typeName));
+            }
+        }
+
+    }
+
 }

@@ -1,3 +1,7 @@
 package nu.marginalia.actor.state;

-public interface ActorStep {}
+public interface ActorStep {
+    static String functionName(Class<? extends ActorStep> type) {
+        return type.getSimpleName().toUpperCase();
+    }
+}
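Together these hunks let callers ask an actor for a JSON skeleton of each of its step messages. A rough illustration with a hypothetical step record (not one from the codebase), assuming the rendering logic above and a java.util.List import:

    // Hypothetical actor step; getMessagePrototypes() reflects over records like this.
    record Convert(String source, int priority, List<String> tags) implements ActorStep {}

    // ActorStep.functionName(Convert.class) yields "CONVERT", and the rendered
    // prototype would look roughly like:
    // {
    //     "source": "String",
    //     "priority": "int",
    //     "tags": [ ]
    // }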
@@ -12,7 +12,6 @@ import nu.marginalia.converting.sideload.SideloadSourceFactory;
 import nu.marginalia.converting.writer.ConverterBatchWritableIf;
 import nu.marginalia.converting.writer.ConverterBatchWriter;
 import nu.marginalia.converting.writer.ConverterWriter;
-import nu.marginalia.io.CrawledDomainReader;
 import nu.marginalia.io.SerializableCrawlDataStream;
 import nu.marginalia.mq.MessageQueueFactory;
 import nu.marginalia.mqapi.converting.ConvertRequest;
@@ -36,6 +35,7 @@ import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.sql.SQLException;
+import java.util.ArrayList;
 import java.util.Collection;
 import java.util.List;
 import java.util.Optional;
@@ -51,6 +51,7 @@ public class ConverterMain extends ProcessMainClass {
     private final ProcessHeartbeat heartbeat;
     private final FileStorageService fileStorageService;
     private final SideloadSourceFactory sideloadSourceFactory;
+    private static final int SIDELOAD_THRESHOLD = Integer.getInteger("converter.sideloadThreshold", 10_000);

     public static void main(String... args) throws Exception {

@@ -201,12 +202,26 @@ public class ConverterMain extends ProcessMainClass {
             processedDomains.set(batchingWorkLog.size());
             heartbeat.setProgress(processedDomains.get() / (double) totalDomains);

-            for (var domain : WorkLog.iterableMap(crawlDir.getLogFile(),
+            logger.info("Processing small items");
+
+            // We separate the large and small domains to reduce the number of critical sections,
+            // as the large domains have a separate processing track that doesn't store everything
+            // in memory
+
+            final List<Path> bigTasks = new ArrayList<>();
+
+            // First process the small items
+            for (var dataPath : WorkLog.iterableMap(crawlDir.getLogFile(),
                     new CrawlDataLocator(crawlDir.getDir(), batchingWorkLog)))
             {
+                if (SerializableCrawlDataStream.getSizeHint(dataPath) >= SIDELOAD_THRESHOLD) {
+                    bigTasks.add(dataPath);
+                    continue;
+                }
+
                 pool.submit(() -> {
-                    try {
-                        ConverterBatchWritableIf writable = processor.createWritable(domain);
+                    try (var dataStream = SerializableCrawlDataStream.openDataStream(dataPath)) {
+                        ConverterBatchWritableIf writable = processor.fullProcessing(dataStream);
                         converterWriter.accept(writable);
                     }
                     catch (Exception ex) {
@@ -225,10 +240,39 @@ public class ConverterMain extends ProcessMainClass {
             do {
                 System.out.println("Waiting for pool to terminate... " + pool.getActiveCount() + " remaining");
             } while (!pool.awaitTermination(60, TimeUnit.SECONDS));
+
+            logger.info("Processing large items");
+
+            try (var hb = heartbeat.createAdHocTaskHeartbeat("Large Domains")) {
+                int bigTaskIdx = 0;
+                // Next the big items domain-by-domain
+                for (var dataPath : bigTasks) {
+                    hb.progress(dataPath.toFile().getName(), bigTaskIdx++, bigTasks.size());
+
+                    try {
+                        // SerializableCrawlDataStream is autocloseable, we can't try-with-resources because then it will be
+                        // closed before it's consumed by the converterWriter.  Instead, the converterWriter guarantees it
+                        // will close it after it's consumed.
+
+                        var stream = SerializableCrawlDataStream.openDataStream(dataPath);
+                        ConverterBatchWritableIf writable = processor.simpleProcessing(stream, SerializableCrawlDataStream.getSizeHint(dataPath));
+
+                        converterWriter.accept(writable);
+                    }
+                    catch (Exception ex) {
+                        logger.info("Error in processing", ex);
+                    }
+                    finally {
+                        heartbeat.setProgress(processedDomains.incrementAndGet() / (double) totalDomains);
+                    }
+                }
+            }
+
+            logger.info("Processing complete");
         }
     }

-    private static class CrawlDataLocator implements Function<WorkLogEntry, Optional<SerializableCrawlDataStream>> {
+    private static class CrawlDataLocator implements Function<WorkLogEntry, Optional<Path>> {

         private final Path crawlRootDir;
         private final BatchingWorkLog batchingWorkLog;
@@ -239,7 +283,7 @@ public class ConverterMain extends ProcessMainClass {
         }

         @Override
-        public Optional<SerializableCrawlDataStream> apply(WorkLogEntry entry) {
+        public Optional<Path> apply(WorkLogEntry entry) {
             if (batchingWorkLog.isItemProcessed(entry.id())) {
                 return Optional.empty();
             }
@@ -252,7 +296,7 @@ public class ConverterMain extends ProcessMainClass {
         }

         try {
-            return Optional.of(CrawledDomainReader.createDataStream(path));
+            return Optional.of(path);
         }
         catch (Exception ex) {
             return Optional.empty();
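The routing rule in the hunk above is a two-way partition on a size hint: anything at or above the sideload threshold skips the parallel pool and is deferred to a sequential, memory-frugal pass. The same pattern in isolation, with illustrative names (assuming java.nio.file.Path, java.util and java.util.function imports):

    record Partitioned(List<Path> small, List<Path> big) {}

    // Small items go to the parallel pool; large ones are deferred.
    static Partitioned partitionBySize(List<Path> items, int threshold, ToIntFunction<Path> sizeHint) {
        List<Path> small = new ArrayList<>(), big = new ArrayList<>();
        for (Path item : items) {
            (sizeHint.applyAsInt(item) >= threshold ? big : small).add(item);
        }
        return new Partitioned(small, big);
    }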
@@ -19,6 +19,7 @@ import nu.marginalia.model.idx.WordFlags;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

+import java.io.IOException;
 import java.net.URISyntaxException;
 import java.util.ArrayList;
 import java.util.List;
@@ -91,7 +92,7 @@ public class DocumentProcessor {
                                  DocumentClass documentClass,
                                  DocumentDecorator documentDecorator,
                                  DomainLinks externalDomainLinks,
-                                 ProcessedDocument ret) throws URISyntaxException, DisqualifiedException
+                                 ProcessedDocument ret) throws URISyntaxException, IOException, DisqualifiedException
     {

         var crawlerStatus = CrawlerDocumentStatus.valueOf(crawledDocument.crawlerStatus);
@@ -109,7 +110,7 @@ public class DocumentProcessor {

         ret.state = crawlerStatusToUrlState(crawledDocument.crawlerStatus, crawledDocument.httpStatus);

-        final var plugin = findPlugin(crawledDocument);
+        AbstractDocumentProcessorPlugin plugin = findPlugin(crawledDocument);

         EdgeUrl url = new EdgeUrl(crawledDocument.url);
         LinkTexts linkTexts = anchorTextKeywords.getAnchorTextKeywords(externalDomainLinks, url);

@@ -32,7 +32,6 @@ import java.util.*;
 import java.util.regex.Pattern;

 public class DomainProcessor {
-    private static final int SIDELOAD_THRESHOLD = Integer.getInteger("converter.sideloadThreshold", 10_000);
     private final DocumentProcessor documentProcessor;
     private final SiteWords siteWords;
     private final AnchorTagsSource anchorTagsSource;
@@ -54,21 +53,9 @@ public class DomainProcessor {
         geoIpDictionary.waitReady();
     }

-    public ConverterBatchWritableIf createWritable(SerializableCrawlDataStream domain) {
-        final int sizeHint = domain.sizeHint();
-
-        if (sizeHint > SIDELOAD_THRESHOLD) {
-            // If the file is too big, we run a processing mode that doesn't
-            // require loading the entire dataset into RAM
-            return sideloadProcessing(domain, sizeHint);
-        }
-
-        return fullProcessing(domain);
-    }
-
-    public SideloadProcessing sideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) {
+    public SimpleProcessing simpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) {
         try {
-            return new SideloadProcessing(dataStream, sizeHint, extraKeywords);
+            return new SimpleProcessing(dataStream, sizeHint, extraKeywords);
         }
         catch (Exception ex) {
             logger.warn("Failed to process domain sideload", ex);
@@ -76,9 +63,9 @@ public class DomainProcessor {
         }
     }

-    public SideloadProcessing sideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint) {
+    public SimpleProcessing simpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint) {
         try {
-            return new SideloadProcessing(dataStream, sizeHint);
+            return new SimpleProcessing(dataStream, sizeHint);
         }
         catch (Exception ex) {
             logger.warn("Failed to process domain sideload", ex);
@@ -86,22 +73,84 @@ public class DomainProcessor {
         }
     }

-    public class SideloadProcessing implements ConverterBatchWritableIf, SideloadSource {
+    @Nullable
+    public ProcessedDomain fullProcessing(SerializableCrawlDataStream dataStream) {
+        try {
+            if (!dataStream.hasNext()) {
+                return null;
+            }
+
+            List<ProcessedDocument> docs = new ArrayList<>();
+            Set<String> processedUrls = new HashSet<>();
+
+            if (!(dataStream.next() instanceof CrawledDomain crawledDomain)) {
+                throw new IllegalStateException("First record must be a domain, was " + dataStream.next().getClass().getSimpleName());
+            }
+
+            DomainLinks externalDomainLinks = anchorTagsSource.getAnchorTags(crawledDomain.getDomain());
+            DocumentDecorator documentDecorator = new DocumentDecorator();
+
+            // Process Domain Record
+
+            ProcessedDomain ret = new ProcessedDomain();
+            processDomain(crawledDomain, ret, documentDecorator);
+            ret.documents = docs;
+
+            // Process Documents
+
+            try (var deduplicator = new LshDocumentDeduplicator()) {
+                while (dataStream.hasNext()) {
+                    if (!(dataStream.next() instanceof CrawledDocument doc))
+                        continue;
+                    if (doc.url == null)
+                        continue;
+                    if (doc.documentBodyBytes.length == 0)
+                        continue;
+                    if (!processedUrls.add(doc.url))
+                        continue;
+
+                    try {
+                        var processedDoc = documentProcessor.process(doc, ret.domain, externalDomainLinks, documentDecorator);
+                        deduplicator.markIfDuplicate(processedDoc);
+                        docs.add(processedDoc);
+                    } catch (Exception ex) {
+                        logger.warn("Failed to process " + doc.url, ex);
+                    }
+                }
+            }
+
+            // Add late keywords and features from domain-level information
+
+            calculateStatistics(ret, externalDomainLinks);
+
+            return ret;
+        }
+        catch (Exception ex) {
+            logger.warn("Failed to process domain", ex);
+            return null;
+        }
+    }
+
+    /** The simple processing track processes documents individually, and does not perform any domain-level analysis.
+     *  This is needed to process extremely large domains, which would otherwise eat up too much RAM.
+     */
+    public class SimpleProcessing implements ConverterBatchWritableIf, SideloadSource {
         private final SerializableCrawlDataStream dataStream;
         private final ProcessedDomain domain;
         private final DocumentDecorator documentDecorator;
         private final Set<String> processedUrls = new HashSet<>();
         private final DomainLinks externalDomainLinks;
         private final LshDocumentDeduplicator deduplicator = new LshDocumentDeduplicator();

         private static final ProcessingIterator.Factory iteratorFactory = ProcessingIterator.factory(8,
                 Integer.getInteger("java.util.concurrent.ForkJoinPool.common.parallelism", Runtime.getRuntime().availableProcessors())
         );

-        SideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint) throws IOException {
+        SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint) throws IOException {
             this(dataStream, sizeHint, List.of());
         }

-        SideloadProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) throws IOException {
+        SimpleProcessing(SerializableCrawlDataStream dataStream, int sizeHint, Collection<String> extraKeywords) throws IOException {
             this.dataStream = dataStream;

             if (!dataStream.hasNext() || !(dataStream.next() instanceof CrawledDomain crawledDomain))
@@ -128,6 +177,7 @@ public class DomainProcessor {
         @Override
         public Iterator<ProcessedDocument> getDocumentsStream() {
             return iteratorFactory.create((taskConsumer) -> {
+
                 while (dataStream.hasNext())
                 {
                     if (!(dataStream.next() instanceof CrawledDocument doc))
@@ -172,65 +222,6 @@ public class DomainProcessor {
         }
     }

-
-    @Nullable
-    public ProcessedDomain fullProcessing(SerializableCrawlDataStream dataStream) {
-        try {
-            if (!dataStream.hasNext()) {
-                return null;
-            }
-
-            List<ProcessedDocument> docs = new ArrayList<>();
-            Set<String> processedUrls = new HashSet<>();
-
-            if (!(dataStream.next() instanceof CrawledDomain crawledDomain)) {
-                throw new IllegalStateException("First record must be a domain, was " + dataStream.next().getClass().getSimpleName());
-            }
-
-            DomainLinks externalDomainLinks = anchorTagsSource.getAnchorTags(crawledDomain.getDomain());
-            DocumentDecorator documentDecorator = new DocumentDecorator();
-
-            // Process Domain Record
-
-            ProcessedDomain ret = new ProcessedDomain();
-            processDomain(crawledDomain, ret, documentDecorator);
-            ret.documents = docs;
-
-            // Process Documents
-
-            try (var deduplicator = new LshDocumentDeduplicator()) {
-                while (dataStream.hasNext()) {
-                    if (!(dataStream.next() instanceof CrawledDocument doc))
-                        continue;
-                    if (doc.url == null)
-                        continue;
-                    if (doc.documentBody.isBlank())
-                        continue;
-                    if (!processedUrls.add(doc.url))
-                        continue;
-
-                    try {
-                        var processedDoc = documentProcessor.process(doc, ret.domain, externalDomainLinks, documentDecorator);
-                        deduplicator.markIfDuplicate(processedDoc);
-                        docs.add(processedDoc);
-                    } catch (Exception ex) {
-                        logger.warn("Failed to process " + doc.url, ex);
-                    }
-                }
-            }
-
-            // Add late keywords and features from domain-level information
-
-            calculateStatistics(ret, externalDomainLinks);
-
-            return ret;
-        }
-        catch (Exception ex) {
-            logger.warn("Failed to process domain", ex);
-            return null;
-        }
-    }
-
     private void processDomain(CrawledDomain crawledDomain,
                                ProcessedDomain domain,
                                DocumentDecorator decorator)
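The practical difference between the two tracks is memory residency: fullProcessing materializes every processed document for a domain in a list so domain-level statistics can be computed, while SimpleProcessing hands documents out one at a time through an iterator. In sketch form, with simplified generic types standing in for ProcessedDocument (assuming java.util and java.util.function imports):

    // Full track: the whole domain is resident while statistics are computed.
    static <T> List<T> fullTrack(Iterator<T> stream) {
        List<T> docs = new ArrayList<>();
        stream.forEachRemaining(docs::add);     // O(domain size) memory
        return docs;
    }

    // Simple track: one document at a time; nothing accumulates.
    static <T> void simpleTrack(Iterator<T> stream, Consumer<T> writer) {
        while (stream.hasNext())
            writer.accept(stream.next());       // O(1) memory for the caller
    }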
@@ -116,7 +116,7 @@ public class AdblockSimulator {


     // Refrain from cleaning up this code, it's very hot code and needs to be fast.
-    // This version is about 100x faster than the a "clean" first stab implementation.
+    // This version is about 100x faster than a "clean" first stab implementation.

     class RuleVisitor implements NodeFilter {
         public boolean sawAds;

@@ -23,7 +23,7 @@ public class DocumentGeneratorExtractor {

         var tags = doc.select("meta[name=generator]");

-        if (tags.size() == 0) {
+        if (tags.isEmpty()) {
             // Some sites have a comment in the head instead of a meta tag
             return fingerprintServerTech(doc, responseHeaders);
         }

@@ -24,7 +24,7 @@ public class DocumentValuator {
         double scriptPenalty = getScriptPenalty(parsedDocument);
         double chatGptPenalty = getChatGptContentFarmPenalty(parsedDocument);

-        int rawLength = crawledDocument.documentBody.length();
+        int rawLength = crawledDocument.documentBodyBytes.length;

         if (textLength == 0) {
             throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.LENGTH);

@@ -218,7 +218,10 @@ public class FeatureExtractor {
             }
         }

-        if (features.contains(HtmlFeature.JS) && adblockSimulator.hasAds(doc.clone())) {
+        if (features.contains(HtmlFeature.JS)
+            // remove while disabled to get rid of expensive clone() call:
+            // adblockSimulator.hasAds(doc.clone())
+        ) {
             features.add(HtmlFeature.ADVERTISEMENT);
         }

@@ -14,6 +14,7 @@ import nu.marginalia.model.crawldata.CrawledDocument;
 import nu.marginalia.model.html.HtmlStandard;

 import javax.annotation.Nullable;
+import java.io.IOException;
 import java.net.URISyntaxException;
 import java.util.HashSet;
 import java.util.List;
@@ -25,7 +26,7 @@ public abstract class AbstractDocumentProcessorPlugin {
         this.languageFilter = languageFilter;
     }

-    public abstract DetailsWithWords createDetails(CrawledDocument crawledDocument, LinkTexts linkTexts, DocumentClass documentClass) throws DisqualifiedException, URISyntaxException;
+    public abstract DetailsWithWords createDetails(CrawledDocument crawledDocument, LinkTexts linkTexts, DocumentClass documentClass) throws DisqualifiedException, URISyntaxException, IOException;
     public abstract boolean isApplicable(CrawledDocument doc);

     protected void checkDocumentLanguage(DocumentLanguageData dld) throws DisqualifiedException {
@@ -86,6 +87,7 @@ public abstract class AbstractDocumentProcessorPlugin {

             return this;
         }
+
         public MetaTagsBuilder addPubDate(PubDate pubDate) {

             if (pubDate.year() > 1900) {

@@ -6,6 +6,7 @@ import nu.marginalia.converting.model.DisqualifiedException;
 import nu.marginalia.converting.model.DocumentHeaders;
 import nu.marginalia.converting.model.GeneratorType;
 import nu.marginalia.converting.model.ProcessedDocumentDetails;
+import nu.marginalia.converting.processor.AcceptableAds;
 import nu.marginalia.converting.processor.DocumentClass;
 import nu.marginalia.converting.processor.MetaRobotsTag;
 import nu.marginalia.converting.processor.logic.*;
@@ -32,11 +33,11 @@ import nu.marginalia.model.crawldata.CrawledDocument;
 import nu.marginalia.model.html.HtmlStandard;
 import nu.marginalia.model.idx.DocumentFlags;
 import nu.marginalia.model.idx.DocumentMetadata;
-import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

+import java.io.IOException;
 import java.net.URISyntaxException;
 import java.util.EnumSet;
 import java.util.HashSet;
@@ -51,7 +52,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
     private final double minDocumentQuality;

     private final FeatureExtractor featureExtractor;
-    private final TitleExtractor titleExtractor;
     private final DocumentKeywordExtractor keywordExtractor;
     private final PubDateSniffer pubDateSniffer;

@@ -74,7 +74,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
                                        @Named("min-document-quality") Double minDocumentQuality,
                                        LanguageFilter languageFilter,
                                        FeatureExtractor featureExtractor,
-                                       TitleExtractor titleExtractor,
                                        DocumentKeywordExtractor keywordExtractor,
                                        PubDateSniffer pubDateSniffer,
                                        DocumentLengthLogic documentLengthLogic,
@@ -89,7 +88,6 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
         this.minDocumentQuality = minDocumentQuality;
         this.featureExtractor = featureExtractor;

-        this.titleExtractor = titleExtractor;
         this.keywordExtractor = keywordExtractor;
         this.pubDateSniffer = pubDateSniffer;
         this.metaRobotsTag = metaRobotsTag;
@@ -108,19 +106,17 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
     public DetailsWithWords createDetails(CrawledDocument crawledDocument,
                                           LinkTexts linkTexts,
                                           DocumentClass documentClass)
-            throws DisqualifiedException, URISyntaxException {
+            throws DisqualifiedException, URISyntaxException, IOException {

-        String documentBody = crawledDocument.documentBody;
-
-        if (languageFilter.isBlockedUnicodeRange(documentBody)) {
+        if (languageFilter.isBlockedUnicodeRange(crawledDocument.documentBody(512))) {
             throw new DisqualifiedException(DisqualificationReason.LANGUAGE);
         }

-        if (documentBody.length() > MAX_DOCUMENT_LENGTH_BYTES) { // 128kb
-            documentBody = documentBody.substring(0, MAX_DOCUMENT_LENGTH_BYTES);
-        }
+        Document doc = crawledDocument.parseBody();

-        Document doc = Jsoup.parse(documentBody);
+        if (AcceptableAds.hasAcceptableAdsTag(doc)) {
+            throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.ACCEPTABLE_ADS);
+        }

         if (!metaRobotsTag.allowIndexingByMetaTag(doc)) {
             throw new DisqualifiedException(DisqualificationReason.FORBIDDEN);
@@ -138,32 +134,33 @@ public class HtmlDocumentProcessorPlugin extends AbstractDocumentProcessorPlugin
         }

         var prunedDoc = specialization.prune(doc);
-        DocumentLanguageData dld = sentenceExtractorProvider.get().extractSentences(prunedDoc);
-
-        checkDocumentLanguage(dld);
-
-        var ret = new ProcessedDocumentDetails();

         final int length = getLength(doc);
         final HtmlStandard standard = getHtmlStandard(doc);
         final double quality = documentValuator.getQuality(crawledDocument, standard, doc, length);

+        if (isDisqualified(documentClass, url, quality, doc.title())) {
+            throw new DisqualifiedException(DisqualificationReason.QUALITY);
+        }
+
+        DocumentLanguageData dld = sentenceExtractorProvider.get().extractSentences(prunedDoc);
+
+        checkDocumentLanguage(dld);
+
+        documentLengthLogic.validateLength(dld, specialization.lengthModifier() * documentClass.lengthLimitModifier());
+
+        var ret = new ProcessedDocumentDetails();
+
         ret.length = length;
         ret.standard = standard;
         ret.title = specialization.getTitle(doc, dld, crawledDocument.url);

-        documentLengthLogic.validateLength(dld, specialization.lengthModifier() * documentClass.lengthLimitModifier());
-
         final Set<HtmlFeature> features = featureExtractor.getFeatures(url, doc, documentHeaders, dld);

         ret.features = features;
         ret.quality = documentValuator.adjustQuality(quality, features);
         ret.hashCode = dld.localitySensitiveHashCode();

-        if (isDisqualified(documentClass, url, quality, ret.title)) {
-            throw new DisqualifiedException(DisqualificationReason.QUALITY);
-        }
-
         PubDate pubDate = pubDateSniffer.getPubDate(documentHeaders, url, doc, standard, true);

         EnumSet<DocumentFlags> documentFlags = documentFlags(features, generatorParts.type());
@@ -71,7 +71,7 @@ public class PlainTextDocumentProcessorPlugin extends AbstractDocumentProcessorP
                                           DocumentClass documentClass)
             throws DisqualifiedException, URISyntaxException {

-        String documentBody = crawledDocument.documentBody;
+        String documentBody = crawledDocument.documentBody();

         if (languageFilter.isBlockedUnicodeRange(documentBody)) {
             throw new DisqualifiedException(DisqualifiedException.DisqualificationReason.LANGUAGE);

@@ -19,6 +19,7 @@ import nu.marginalia.model.idx.DocumentMetadata;
 import nu.marginalia.model.idx.WordFlags;

 import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
 import java.time.LocalDateTime;
 import java.util.EnumSet;
 import java.util.List;
@@ -50,7 +51,7 @@ public class SideloaderProcessing {
                 "OK",
                 "NP",
                 "",
-                body,
+                body.getBytes(StandardCharsets.UTF_8),
                 false,
                 null,
                 null

@@ -127,7 +127,7 @@ public class EncyclopediaMarginaliaNuSideloader implements SideloadSource, AutoC
         }
         fullHtml.append("</div></body></html>");

-        var doc = sideloaderProcessing
+        return sideloaderProcessing
                 .processDocument(fullUrl,
                         fullHtml.toString(),
                         List.of("encyclopedia", "wiki"),
@@ -137,8 +137,6 @@ public class EncyclopediaMarginaliaNuSideloader implements SideloadSource, AutoC
                         anchorTextKeywords.getAnchorTextKeywords(domainLinks, new EdgeUrl(fullUrl)),
                         LocalDate.now().getYear(),
                         10_000_000);
-
-        return doc;
     }

     private String normalizeUtf8(String url) {

@@ -106,11 +106,7 @@ public class WarcSideloader implements SideloadSource, AutoCloseable {
                 return false;

             var url = new EdgeUrl(warcResponse.target());
-            if (!Objects.equals(url.getDomain(), domain)) {
-                return false;
-            }
-
-            return true;
+            return Objects.equals(url.getDomain(), domain);
         } catch (Exception e) {
             logger.warn("Failed to process response", e);
         }

@@ -39,6 +39,9 @@ public class ConverterWriter implements AutoCloseable {
         workerThread.start();
     }

+    /** Queue and eventually write the domain into the converter journal
+     *  The domain object will be closed after it's processed.
+     * */
     public void accept(@Nullable ConverterBatchWritableIf domain) {
         if (null == domain)
             return;
@@ -72,15 +75,15 @@ public class ConverterWriter implements AutoCloseable {

             if (workLog.isItemCommitted(id) || workLog.isItemInCurrentBatch(id)) {
                 logger.warn("Skipping already logged item {}", id);
+            }
+            else {
+                currentWriter.write(data);
+                workLog.logItem(id);
                 data.close();
-                continue;
             }

-            currentWriter.write(data);
-
-            workLog.logItem(id);
-
             switcher.tick();
+            data.close();
         }
     }
     catch (Exception ex) {
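The new javadoc makes the ownership rule explicit: whatever is handed to accept() is closed by the writer's worker thread, never by the producer. The contract in miniature, with illustrative types (assuming a java.util.concurrent.BlockingQueue import):

    // The producer must not close what it hands over; the consumer thread
    // closes each item exactly once after handling it.
    static void consumerLoop(BlockingQueue<AutoCloseable> queue) throws Exception {
        for (;;) {
            AutoCloseable item = queue.take();
            try {
                // write or skip the item here
            }
            finally {
                item.close();   // ownership ends on the consumer side
            }
        }
    }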
@@ -98,7 +98,7 @@ public class ConvertingIntegrationTest {

     @Test
     public void testMemexMarginaliaNuSideloadProcessing() throws IOException {
-        var ret = domainProcessor.sideloadProcessing(asSerializableCrawlData(readMarginaliaWorkingSet()), 100);
+        var ret = domainProcessor.simpleProcessing(asSerializableCrawlData(readMarginaliaWorkingSet()), 100);
         assertNotNull(ret);
         assertEquals("memex.marginalia.nu", ret.id());

@@ -146,7 +146,7 @@ public class ConvertingIntegrationTest {
                 "OK",
                 "",
                 "",
-                readClassPathFile(p.toString()),
+                readClassPathFile(p.toString()).getBytes(),
                 false,
                 null,
                 null

@@ -8,6 +8,7 @@ import nu.marginalia.converting.model.ProcessedDomain;
 import nu.marginalia.converting.processor.DomainProcessor;
 import nu.marginalia.crawl.CrawlerMain;
 import nu.marginalia.crawl.DomainStateDb;
+import nu.marginalia.crawl.fetcher.Cookies;
 import nu.marginalia.crawl.fetcher.HttpFetcher;
 import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
@@ -200,23 +201,23 @@ public class CrawlingThenConvertingIntegrationTest {

     @Test
     public void crawlRobotsTxt() throws Exception {
-        var specs = new CrawlerMain.CrawlSpecRecord("search.marginalia.nu", 5,
-                List.of("https://search.marginalia.nu/search?q=hello+world")
+        var specs = new CrawlerMain.CrawlSpecRecord("marginalia-search.com", 5,
+                List.of("https://marginalia-search.com/search?q=hello+world")
         );

         CrawledDomain domain = crawl(specs);
         assertFalse(domain.doc.isEmpty());
         assertEquals("OK", domain.crawlerStatus);
-        assertEquals("search.marginalia.nu", domain.domain);
+        assertEquals("marginalia-search.com", domain.domain);

         Set<String> allUrls = domain.doc.stream().map(doc -> doc.url).collect(Collectors.toSet());
-        assertTrue(allUrls.contains("https://search.marginalia.nu/search"), "We expect a record for entities that are forbidden");
+        assertTrue(allUrls.contains("https://marginalia-search.com/search"), "We expect a record for entities that are forbidden");

         var output = process();

         assertNotNull(output);
         assertFalse(output.documents.isEmpty());
-        assertEquals(new EdgeDomain("search.marginalia.nu"), output.domain);
+        assertEquals(new EdgeDomain("marginalia-search.com"), output.domain);
         assertEquals(DomainIndexingState.ACTIVE, output.state);

         for (var doc : output.documents) {
@@ -246,7 +247,7 @@ public class CrawlingThenConvertingIntegrationTest {
     private CrawledDomain crawl(CrawlerMain.CrawlSpecRecord specs, Predicate<EdgeDomain> domainBlacklist) throws Exception {
         List<SerializableCrawlData> data = new ArrayList<>();

-        try (var recorder = new WarcRecorder(fileName);
+        try (var recorder = new WarcRecorder(fileName, new Cookies());
             var db = new DomainStateDb(dbTempFile))
        {
            new CrawlerRetreiver(httpFetcher, new DomainProber(domainBlacklist), specs, db, recorder).crawlDomain();

@@ -55,7 +55,6 @@ dependencies {
     implementation libs.zstd
     implementation libs.jwarc
     implementation libs.crawlercommons
-    implementation libs.okhttp3
     implementation libs.jsoup
     implementation libs.opencsv
     implementation libs.fastutil
@@ -2,11 +2,16 @@ package nu.marginalia.contenttype;

 import org.apache.commons.lang3.StringUtils;

+import java.nio.charset.Charset;
+import java.nio.charset.IllegalCharsetNameException;
+import java.nio.charset.StandardCharsets;
+
 /** Content type and charset of a document
  * @param contentType The content type, e.g. "text/html"
  * @param charset The charset, e.g. "UTF-8"
  */
 public record ContentType(String contentType, String charset) {
+
     public static ContentType parse(String contentTypeHeader) {
         if (contentTypeHeader == null || contentTypeHeader.isBlank())
             return new ContentType(null, null);
@@ -15,9 +20,31 @@ public record ContentType(String contentType, String charset) {
         String contentType = parts[0].trim();
         String charset = parts.length > 1 ? parts[1].trim() : "UTF-8";

+        if (charset.toLowerCase().startsWith("charset=")) {
+            charset = charset.substring("charset=".length());
+        }
+
         return new ContentType(contentType, charset);
     }

+    /** Best effort method for turning the provided charset string into a Java Charset,
+     * with some guesswork-heuristics for when it doesn't work
+     */
+    public Charset asCharset() {
+        try {
+            if (Charset.isSupported(charset)) {
+                return Charset.forName(charset);
+            } else if (charset.equalsIgnoreCase("macintosh-latin")) {
+                return StandardCharsets.ISO_8859_1;
+            } else {
+                return StandardCharsets.UTF_8;
+            }
+        }
+        catch (IllegalCharsetNameException ex) { // thrown by Charset.isSupported()
+            return StandardCharsets.UTF_8;
+        }
+    }
+
     public boolean is(String contentType) {
         return this.contentType.equalsIgnoreCase(contentType);
     }
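The effect of the new parsing rule and fallback heuristics, using values that appear in the hunks above:

    ContentType ct = ContentType.parse("text/html; charset=UTF-8");
    // ct.contentType() -> "text/html", ct.charset() -> "UTF-8"
    //                     (the "charset=" prefix is now stripped)

    ContentType.parse("text/html; charset=macintosh-latin").asCharset();  // ISO-8859-1
    ContentType.parse("text/html; charset=bogus-name").asCharset();       // UTF-8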
@@ -1,9 +1,12 @@
 package nu.marginalia.contenttype;

+import org.jsoup.Jsoup;
+import org.jsoup.nodes.Document;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
 import java.nio.charset.Charset;
-import java.nio.charset.IllegalCharsetNameException;
 import java.nio.charset.StandardCharsets;
-import java.nio.charset.UnsupportedCharsetException;
 import java.util.Map;
 import java.util.concurrent.ConcurrentHashMap;

@@ -23,24 +26,25 @@ public class DocumentBodyToString {
         return new String(data, charset);
     }

+    public static Document getParsedData(ContentType type, byte[] data, int maxLength, String url) throws IOException {
+        final Charset charset;
+
+        if (type.charset() == null || type.charset().isBlank()) {
+            charset = StandardCharsets.UTF_8;
+        } else {
+            charset = charsetMap.computeIfAbsent(type, DocumentBodyToString::computeCharset);
+        }
+
+        ByteArrayInputStream bais = new ByteArrayInputStream(data, 0, Math.min(data.length, maxLength));
+
+        return Jsoup.parse(bais, charset.name(), url);
+    }
+
     private static Charset computeCharset(ContentType type) {
-        try {
-            if (type.charset() == null || type.charset().isBlank())
-                return StandardCharsets.UTF_8;
-            else {
-                return Charset.forName(type.charset());
-            }
-        }
-        catch (IllegalCharsetNameException ex) {
-            // Fall back to UTF-8 if we don't understand what this is. It's *probably* fine? Maybe?
-            return StandardCharsets.UTF_8;
-        }
-        catch (UnsupportedCharsetException ex) {
-            // This is usually like Macintosh Latin
-            // (https://en.wikipedia.org/wiki/Macintosh_Latin_encoding)
-            //
-            // It's close enough to 8859-1 to serve
-            return StandardCharsets.ISO_8859_1;
-        }
+        if (type.charset() == null || type.charset().isBlank())
+            return StandardCharsets.UTF_8;
+        else {
+            return type.asCharset();
+        }
     }
 }
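A sketch of how the new parse path is driven; the sample values are illustrative, though the 128 kB cap matches the document length limit referenced elsewhere in this changeset:

    byte[] body = "<html><head><title>Hi</title></head></html>".getBytes(StandardCharsets.UTF_8);
    ContentType type = ContentType.parse("text/html; charset=UTF-8");

    // Decode and parse in one step; jsoup decodes the capped byte stream
    // with the charset chosen by the fallback logic above.
    Document doc = DocumentBodyToString.getParsedData(type, body, 128 * 1024, "https://example.com/");
    String title = doc.title();   // "Hi"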
@@ -19,22 +19,19 @@ import nu.marginalia.crawl.retreival.DomainProber;
|
|||||||
import nu.marginalia.crawl.warc.WarcArchiverFactory;
|
import nu.marginalia.crawl.warc.WarcArchiverFactory;
|
||||||
import nu.marginalia.crawl.warc.WarcArchiverIf;
|
import nu.marginalia.crawl.warc.WarcArchiverIf;
|
||||||
import nu.marginalia.db.DomainBlacklist;
|
import nu.marginalia.db.DomainBlacklist;
|
||||||
import nu.marginalia.io.CrawledDomainReader;
|
|
||||||
import nu.marginalia.io.CrawlerOutputFile;
|
import nu.marginalia.io.CrawlerOutputFile;
|
||||||
import nu.marginalia.model.EdgeDomain;
|
import nu.marginalia.model.EdgeDomain;
|
||||||
import nu.marginalia.mq.MessageQueueFactory;
|
import nu.marginalia.mq.MessageQueueFactory;
|
||||||
import nu.marginalia.parquet.crawldata.CrawledDocumentParquetRecordFileWriter;
|
|
||||||
import nu.marginalia.process.ProcessConfiguration;
|
import nu.marginalia.process.ProcessConfiguration;
|
||||||
import nu.marginalia.process.ProcessConfigurationModule;
|
import nu.marginalia.process.ProcessConfigurationModule;
|
||||||
import nu.marginalia.process.ProcessMainClass;
|
import nu.marginalia.process.ProcessMainClass;
|
||||||
import nu.marginalia.process.control.ProcessHeartbeatImpl;
|
import nu.marginalia.process.control.ProcessHeartbeatImpl;
|
||||||
import nu.marginalia.process.log.WorkLog;
|
import nu.marginalia.process.log.WorkLog;
|
||||||
import nu.marginalia.service.module.DatabaseModule;
|
import nu.marginalia.service.module.DatabaseModule;
|
||||||
|
import nu.marginalia.slop.SlopCrawlDataRecord;
|
||||||
import nu.marginalia.storage.FileStorageService;
|
import nu.marginalia.storage.FileStorageService;
|
||||||
import nu.marginalia.storage.model.FileStorageId;
|
import nu.marginalia.storage.model.FileStorageId;
|
||||||
import nu.marginalia.util.SimpleBlockingThreadPool;
|
import nu.marginalia.util.SimpleBlockingThreadPool;
|
||||||
import okhttp3.ConnectionPool;
|
|
||||||
import okhttp3.Dispatcher;
|
|
||||||
import org.jetbrains.annotations.NotNull;
|
import org.jetbrains.annotations.NotNull;
|
||||||
import org.slf4j.Logger;
|
import org.slf4j.Logger;
|
||||||
import org.slf4j.LoggerFactory;
|
import org.slf4j.LoggerFactory;
|
||||||
@@ -85,6 +82,7 @@ public class CrawlerMain extends ProcessMainClass {
|
|||||||
|
|
||||||
@Inject
|
@Inject
|
||||||
public CrawlerMain(UserAgent userAgent,
|
public CrawlerMain(UserAgent userAgent,
|
||||||
|
HttpFetcherImpl httpFetcher,
|
||||||
ProcessHeartbeatImpl heartbeat,
|
ProcessHeartbeatImpl heartbeat,
|
||||||
MessageQueueFactory messageQueueFactory, DomainProber domainProber,
|
MessageQueueFactory messageQueueFactory, DomainProber domainProber,
|
||||||
FileStorageService fileStorageService,
|
FileStorageService fileStorageService,
|
||||||
@@ -98,6 +96,7 @@ public class CrawlerMain extends ProcessMainClass {
|
|||||||
super(messageQueueFactory, processConfiguration, gson, CRAWLER_INBOX);
|
super(messageQueueFactory, processConfiguration, gson, CRAWLER_INBOX);
|
||||||
|
|
||||||
this.userAgent = userAgent;
|
this.userAgent = userAgent;
|
||||||
|
this.fetcher = httpFetcher;
|
||||||
this.heartbeat = heartbeat;
|
this.heartbeat = heartbeat;
|
||||||
this.domainProber = domainProber;
|
this.domainProber = domainProber;
|
||||||
this.fileStorageService = fileStorageService;
|
this.fileStorageService = fileStorageService;
|
||||||
@@ -111,10 +110,6 @@ public class CrawlerMain extends ProcessMainClass {
|
|||||||
Integer.getInteger("crawler.poolSize", 256),
|
Integer.getInteger("crawler.poolSize", 256),
|
||||||
1);
|
1);
|
||||||
|
|
||||||
fetcher = new HttpFetcherImpl(userAgent,
|
|
||||||
new Dispatcher(),
|
|
||||||
new ConnectionPool(5, 10, TimeUnit.SECONDS)
|
|
||||||
);
|
|
||||||
|
|
||||||
// Wait for the blacklist to be loaded before starting the crawl
|
// Wait for the blacklist to be loaded before starting the crawl
|
||||||
blacklist.waitUntilLoaded();
|
blacklist.waitUntilLoaded();
|
||||||
@@ -132,6 +127,10 @@ public class CrawlerMain extends ProcessMainClass {
|
|||||||
System.setProperty("sun.net.client.defaultConnectTimeout", "30000");
|
System.setProperty("sun.net.client.defaultConnectTimeout", "30000");
|
||||||
System.setProperty("sun.net.client.defaultReadTimeout", "30000");
|
System.setProperty("sun.net.client.defaultReadTimeout", "30000");
|
||||||
|
|
||||||
|
// Set the maximum number of connections to keep alive in the connection pool
|
||||||
|
System.setProperty("jdk.httpclient.idleTimeout", "15"); // 15 seconds
|
||||||
|
System.setProperty("jdk.httpclient.connectionPoolSize", "256");
|
||||||
|
|
||||||
// We don't want to use too much memory caching sessions for https
|
// We don't want to use too much memory caching sessions for https
|
||||||
System.setProperty("javax.net.ssl.sessionCacheSize", "2048");
|
System.setProperty("javax.net.ssl.sessionCacheSize", "2048");
|
||||||
|
|
||||||
@@ -291,7 +290,6 @@ public class CrawlerMain extends ProcessMainClass {
        }
    }

-
    public void runForSingleDomain(String targetDomainName, FileStorageId fileStorageId) throws Exception {
        runForSingleDomain(targetDomainName, fileStorageService.getStorage(fileStorageId).asPath());
    }
@@ -353,7 +351,7 @@ public class CrawlerMain extends ProcessMainClass {

            Path newWarcFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.LIVE);
            Path tempFile = CrawlerOutputFile.createWarcPath(outputDir, id, domain, CrawlerOutputFile.WarcFileVersion.TEMP);
-            Path parquetFile = CrawlerOutputFile.createParquetPath(outputDir, id, domain);
+            Path slopFile = CrawlerOutputFile.createSlopPath(outputDir, id, domain);

            // Move the WARC file to a temp file if it exists, so we can resume the crawl using the old data
            // while writing to the same file name as before
@@ -364,9 +362,9 @@ public class CrawlerMain extends ProcessMainClass {
                Files.deleteIfExists(tempFile);
            }

-            try (var warcRecorder = new WarcRecorder(newWarcFile); // write to a temp file for now
+            try (var warcRecorder = new WarcRecorder(newWarcFile, fetcher); // write to a temp file for now
                 var retriever = new CrawlerRetreiver(fetcher, domainProber, specification, domainStateDb, warcRecorder);
-                 CrawlDataReference reference = getReference();
+                 CrawlDataReference reference = getReference()
            )
            {
                // Resume the crawl if it was aborted
@@ -387,15 +385,15 @@ public class CrawlerMain extends ProcessMainClass {
                reference.delete();

                // Convert the WARC file to Parquet
-                CrawledDocumentParquetRecordFileWriter
-                        .convertWarc(domain, userAgent, newWarcFile, parquetFile);
+                SlopCrawlDataRecord
+                        .convertWarc(domain, userAgent, newWarcFile, slopFile);

                // Optionally archive the WARC file if full retention is enabled,
                // otherwise delete it:
                warcArchiver.consumeWarc(newWarcFile, domain);

                // Mark the domain as finished in the work log
-                workLog.setJobToFinished(domain, parquetFile.toString(), size);
+                workLog.setJobToFinished(domain, slopFile.toString(), size);

                // Update the progress bar
                heartbeat.setProgress(tasksDone.incrementAndGet() / (double) totalTasks);
@@ -416,11 +414,22 @@ public class CrawlerMain extends ProcessMainClass {

        private CrawlDataReference getReference() {
            try {
-                return new CrawlDataReference(CrawledDomainReader.createDataStream(outputDir, domain, id));
+                Path slopPath = CrawlerOutputFile.getSlopPath(outputDir, id, domain);
+                if (Files.exists(slopPath)) {
+                    return new CrawlDataReference(slopPath);
+                }
+
+                Path parquetPath = CrawlerOutputFile.getParquetPath(outputDir, id, domain);
+                if (Files.exists(parquetPath)) {
+                    slopPath = migrateParquetData(parquetPath, domain, outputDir);
+                    return new CrawlDataReference(slopPath);
+                }
+
            } catch (IOException e) {
                logger.debug("Failed to read previous crawl data for {}", specification.domain());
-                return new CrawlDataReference();
            }
+
+            return new CrawlDataReference();
        }

    }
@@ -480,4 +489,20 @@ public class CrawlerMain extends ProcessMainClass {
            }
        }
    }
+
+    // Migrate from parquet to slop if necessary
+    //
+    // This must be synchronized as chewing through parquet files in parallel leads to enormous memory overhead
+    private synchronized Path migrateParquetData(Path inputPath, String domain, Path crawlDataRoot) throws IOException {
+        if (!inputPath.endsWith(".parquet")) {
+            return inputPath;
+        }
+
+        Path outputFile = CrawlerOutputFile.createSlopPath(crawlDataRoot, Integer.toHexString(domain.hashCode()), domain);
+
+        SlopCrawlDataRecord.convertFromParquet(inputPath, outputFile);
+
+        return outputFile;
+    }
+
}
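One thing worth noting about the guard in migrateParquetData above: java.nio.file.Path.endsWith compares whole trailing path elements rather than string suffixes, so a filename-extension check normally goes through getFileName().toString(). A standalone illustration (the path literal is made up):

import java.nio.file.Path;

class PathSuffixSketch {
    public static void main(String[] args) {
        Path p = Path.of("crawl-data", "example.parquet");

        // Path.endsWith matches trailing path *elements*, so this prints false:
        System.out.println(p.endsWith(".parquet"));

        // A string-level suffix test on the file name prints true:
        System.out.println(p.getFileName().toString().endsWith(".parquet"));
    }
}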
@@ -9,6 +9,7 @@ import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.time.Instant;
+import java.util.Objects;
import java.util.Optional;

/** Supplemental sqlite database for storing the summary of a crawl.
@@ -60,6 +61,8 @@ public class DomainStateDb implements AutoCloseable {

    }

+    public record FaviconRecord(String contentType, byte[] imageData) {}
+
    public DomainStateDb(Path filename) throws SQLException {
        String sqliteDbString = "jdbc:sqlite:" + filename.toString();
        connection = DriverManager.getConnection(sqliteDbString);
@@ -74,7 +77,13 @@
                        feedUrl TEXT
                    )
                    """);
+            stmt.executeUpdate("""
+                    CREATE TABLE IF NOT EXISTS favicon (
+                        domain TEXT PRIMARY KEY,
+                        contentType TEXT NOT NULL,
+                        icon BLOB NOT NULL
+                    )
+                    """);
            stmt.execute("PRAGMA journal_mode=WAL");
        }
    }
@@ -85,6 +94,41 @@
    }


+    public void saveIcon(String domain, FaviconRecord faviconRecord) {
+        try (var stmt = connection.prepareStatement("""
+                INSERT OR REPLACE INTO favicon (domain, contentType, icon)
+                VALUES(?, ?, ?)
+                """)) {
+            stmt.setString(1, domain);
+            stmt.setString(2, Objects.requireNonNullElse(faviconRecord.contentType, "application/octet-stream"));
+            stmt.setBytes(3, faviconRecord.imageData);
+            stmt.executeUpdate();
+        }
+        catch (SQLException ex) {
+            logger.error("Failed to insert favicon", ex);
+        }
+    }
+
+    public Optional<FaviconRecord> getIcon(String domain) {
+        try (var stmt = connection.prepareStatement("SELECT contentType, icon FROM favicon WHERE DOMAIN = ?")) {
+            stmt.setString(1, domain);
+            var rs = stmt.executeQuery();
+
+            if (rs.next()) {
+                return Optional.of(
+                        new FaviconRecord(
+                                rs.getString("contentType"),
+                                rs.getBytes("icon")
+                        )
+                );
+            }
+        } catch (SQLException e) {
+            logger.error("Failed to retrieve favicon", e);
+        }
+
+        return Optional.empty();
+    }
+
    public void save(SummaryRecord record) {
        try (var stmt = connection.prepareStatement("""
                INSERT OR REPLACE INTO summary (domain, lastUpdatedEpochMs, state, stateDesc, feedUrl)
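A sketch of how the new favicon table might be used from calling code; the database path is a made-up example, everything else follows the signatures added in this hunk:

import java.nio.file.Path;

class FaviconUsageSketch {
    public static void main(String[] args) throws Exception {
        try (var db = new DomainStateDb(Path.of("/tmp/crawler-state.db"))) {
            // Store a (content type, bytes) pair keyed on the domain name
            db.saveIcon("www.example.com",
                    new DomainStateDb.FaviconRecord("image/x-icon", new byte[] {0, 1, 2, 3}));

            // ... and read it back
            db.getIcon("www.example.com").ifPresent(icon ->
                    System.out.println(icon.contentType() + ", " + icon.imageData().length + " bytes"));
        }
    }
}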
@@ -1,6 +1,6 @@
package nu.marginalia.crawl.fetcher;

-import okhttp3.Request;
+import java.net.http.HttpRequest;

/** Encapsulates request modifiers; the ETag and Last-Modified tags for a resource */
public record ContentTags(String etag, String lastMod) {
@@ -17,14 +17,14 @@ public record ContentTags(String etag, String lastMod) {
    }

    /** Paints the tags onto the request builder. */
-    public void paint(Request.Builder getBuilder) {
+    public void paint(HttpRequest.Builder getBuilder) {

        if (etag != null) {
-            getBuilder.addHeader("If-None-Match", etag);
+            getBuilder.header("If-None-Match", etag);
        }

        if (lastMod != null) {
-            getBuilder.addHeader("If-Modified-Since", lastMod);
+            getBuilder.header("If-Modified-Since", lastMod);
        }
    }
}
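The paint() migration above is ContentTags' only contact point with the HTTP client library; a quick usage sketch against the JDK builder (the URL and tag values are placeholders):

import java.net.URI;
import java.net.http.HttpRequest;

class ContentTagsSketch {
    public static void main(String[] args) {
        var tags = new ContentTags("\"abc123\"", "Sat, 01 Feb 2025 00:00:00 GMT");

        var getBuilder = HttpRequest.newBuilder()
                .GET()
                .uri(URI.create("https://www.example.com/"));

        // Adds If-None-Match and If-Modified-Since for a conditional revisit
        tags.paint(getBuilder);

        System.out.println(getBuilder.build().headers().map());
    }
}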
@@ -1,33 +1,14 @@
package nu.marginalia.crawl.fetcher;

-import okhttp3.Cookie;
-import okhttp3.CookieJar;
-import okhttp3.HttpUrl;
+import java.io.IOException;
+import java.net.CookieHandler;
+import java.net.URI;

-import java.util.Collections;
import java.util.List;
+import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

-public class Cookies {
-    final ThreadLocal<ConcurrentHashMap<String, List<Cookie>>> cookieJar = ThreadLocal.withInitial(ConcurrentHashMap::new);
-
-    public CookieJar getJar() {
-        return new CookieJar() {
-
-            @Override
-            public void saveFromResponse(HttpUrl url, List<Cookie> cookies) {
-
-                if (!cookies.isEmpty()) {
-                    cookieJar.get().put(url.host(), cookies);
-                }
-            }
-
-            @Override
-            public List<Cookie> loadForRequest(HttpUrl url) {
-                return cookieJar.get().getOrDefault(url.host(), Collections.emptyList());
-            }
-        };
-    }
+public class Cookies extends CookieHandler {
+    final ThreadLocal<ConcurrentHashMap<String, List<String>>> cookieJar = ThreadLocal.withInitial(ConcurrentHashMap::new);

    public void clear() {
        cookieJar.get().clear();
@@ -38,6 +19,16 @@
    }

    public List<String> getCookies() {
-        return cookieJar.get().values().stream().flatMap(List::stream).map(Cookie::toString).toList();
+        return cookieJar.get().values().stream().flatMap(List::stream).toList();
+    }
+
+    @Override
+    public Map<String, List<String>> get(URI uri, Map<String, List<String>> requestHeaders) throws IOException {
+        return cookieJar.get();
+    }
+
+    @Override
+    public void put(URI uri, Map<String, List<String>> responseHeaders) throws IOException {
+        cookieJar.get().putAll(responseHeaders);
    }
}
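Because Cookies now extends java.net.CookieHandler, it plugs straight into the JDK client, which is how createClient() uses it further down. A wiring sketch:

import java.net.http.HttpClient;

class CookiesWiringSketch {
    public static void main(String[] args) {
        var cookies = new Cookies();

        // The client now routes each exchange through cookies.get()/put()
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(cookies)
                .build();

        System.out.println("observed cookies: " + cookies.getCookies());
    }
}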
@@ -3,6 +3,7 @@ package nu.marginalia.crawl.fetcher;
import com.google.inject.ImplementedBy;
import crawlercommons.robots.SimpleRobotRules;
import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
+import nu.marginalia.crawl.retreival.CrawlDelayTimer;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.body.HttpFetchResult;
@@ -11,10 +12,10 @@ import nu.marginalia.model.crawldata.CrawlerDomainStatus;
import java.util.List;

@ImplementedBy(HttpFetcherImpl.class)
-public interface HttpFetcher {
+public interface HttpFetcher extends AutoCloseable {
    void setAllowAllContentTypes(boolean allowAllContentTypes);

-    List<String> getCookies();
+    Cookies getCookies();
    void clearCookies();

    DomainProbeResult probeDomain(EdgeUrl url);
@@ -27,7 +28,9 @@ public interface HttpFetcher {
    HttpFetchResult fetchContent(EdgeUrl url,
                                 WarcRecorder recorder,
                                 ContentTags tags,
-                                 ProbeType probeType) throws HttpFetcherImpl.RateLimitException, Exception;
+                                 ProbeType probeType) throws Exception;
+
+    List<EdgeUrl> fetchSitemapUrls(String rootSitemapUrl, CrawlDelayTimer delayTimer);

    SimpleRobotRules fetchRobotRules(EdgeDomain domain, WarcRecorder recorder);

@@ -1,35 +1,39 @@
package nu.marginalia.crawl.fetcher;

import com.google.inject.Inject;
+import com.google.inject.Singleton;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;
import nu.marginalia.UserAgent;
-import nu.marginalia.crawl.fetcher.socket.FastTerminatingSocketFactory;
-import nu.marginalia.crawl.fetcher.socket.IpInterceptingNetworkInterceptor;
import nu.marginalia.crawl.fetcher.socket.NoSecuritySSL;
import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
+import nu.marginalia.crawl.retreival.CrawlDelayTimer;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.body.ContentTypeLogic;
import nu.marginalia.model.body.DocumentBodyExtractor;
import nu.marginalia.model.body.HttpFetchResult;
import nu.marginalia.model.crawldata.CrawlerDomainStatus;
-import okhttp3.ConnectionPool;
-import okhttp3.Dispatcher;
-import okhttp3.OkHttpClient;
-import okhttp3.Request;
+import org.jsoup.Jsoup;
+import org.jsoup.nodes.Document;
+import org.jsoup.parser.Parser;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

-import javax.net.ssl.X509TrustManager;
-import java.io.InterruptedIOException;
+import java.io.IOException;
+import java.io.InputStream;
+import java.net.URISyntaxException;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
+import java.net.http.HttpResponse;
+import java.net.http.HttpTimeoutException;
import java.time.Duration;
-import java.util.List;
-import java.util.Objects;
-import java.util.Optional;
-import java.util.concurrent.TimeUnit;
+import java.util.*;
+import java.util.concurrent.Executors;
+import java.util.zip.GZIPInputStream;


+@Singleton
public class HttpFetcherImpl implements HttpFetcher {

    private final Logger logger = LoggerFactory.getLogger(getClass());
@@ -40,39 +44,29 @@ public class HttpFetcherImpl implements HttpFetcher {
    private static final SimpleRobotRulesParser robotsParser = new SimpleRobotRulesParser();
    private static final ContentTypeLogic contentTypeLogic = new ContentTypeLogic();

+    private final Duration requestTimeout = Duration.ofSeconds(10);
+    private final Duration probeTimeout = Duration.ofSeconds(30);
+
    @Override
    public void setAllowAllContentTypes(boolean allowAllContentTypes) {
        contentTypeLogic.setAllowAllContentTypes(allowAllContentTypes);
    }

-    private final OkHttpClient client;
-
-    private static final FastTerminatingSocketFactory ftSocketFactory = new FastTerminatingSocketFactory();
-
-    private OkHttpClient createClient(Dispatcher dispatcher, ConnectionPool pool) {
-        var builder = new OkHttpClient.Builder();
-        if (dispatcher != null) {
-            builder.dispatcher(dispatcher);
-        }
+    private final HttpClient client;

-        return builder.sslSocketFactory(NoSecuritySSL.buildSocketFactory(), (X509TrustManager) NoSecuritySSL.trustAllCerts[0])
-                .socketFactory(ftSocketFactory)
-                .hostnameVerifier(NoSecuritySSL.buildHostnameVerifyer())
-                .addNetworkInterceptor(new IpInterceptingNetworkInterceptor())
-                .connectionPool(pool)
-                .cookieJar(cookies.getJar())
-                .followRedirects(true)
-                .followSslRedirects(true)
-                .connectTimeout(8, TimeUnit.SECONDS)
-                .readTimeout(10, TimeUnit.SECONDS)
-                .writeTimeout(10, TimeUnit.SECONDS)
+    private HttpClient createClient() {
+        return HttpClient.newBuilder()
+                .sslContext(NoSecuritySSL.buildSslContext())
+                .cookieHandler(cookies)
+                .followRedirects(HttpClient.Redirect.NORMAL)
+                .connectTimeout(Duration.ofSeconds(8))
+                .executor(Executors.newCachedThreadPool())
                .build();
    }

    @Override
-    public List<String> getCookies() {
-        return cookies.getCookies();
+    public Cookies getCookies() {
+        return cookies;
    }

    @Override
@@ -81,26 +75,24 @@
    }

    @Inject
-    public HttpFetcherImpl(UserAgent userAgent,
-                           Dispatcher dispatcher,
-                           ConnectionPool connectionPool)
+    public HttpFetcherImpl(UserAgent userAgent)
    {
-        this.client = createClient(dispatcher, connectionPool);
+        this.client = createClient();
        this.userAgentString = userAgent.uaString();
        this.userAgentIdentifier = userAgent.uaIdentifier();
    }

    public HttpFetcherImpl(String userAgent) {
-        this.client = createClient(null, new ConnectionPool());
+        this.client = createClient();
        this.userAgentString = userAgent;
        this.userAgentIdentifier = userAgent;
    }

    // Not necessary in prod, but useful in test
    public void close() {
-        client.dispatcher().executorService().shutdown();
-        client.connectionPool().evictAll();
+        client.close();
    }

    /**
     * Probe the domain to see if it is reachable, attempting to identify which schema to use,
     * and if there are any redirects. This is done by one or more HEAD requests.
@@ -110,23 +102,34 @@
     */
    @Override
    public DomainProbeResult probeDomain(EdgeUrl url) {
-        var head = new Request.Builder().head().addHeader("User-agent", userAgentString)
-                .url(url.toString())
-                .build();
-
-        var call = client.newCall(head);
-
-        try (var rsp = call.execute()) {
-            EdgeUrl requestUrl = new EdgeUrl(rsp.request().url().toString());
-
-            if (!Objects.equals(requestUrl.domain, url.domain)) {
-                return new DomainProbeResult.Redirect(requestUrl.domain);
-            }
-            return new DomainProbeResult.Ok(requestUrl);
-        }
-        catch (Exception ex) {
-            return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, ex.getMessage());
+        HttpRequest head;
+        try {
+            head = HttpRequest.newBuilder()
+                    .HEAD()
+                    .uri(url.asURI())
+                    .header("User-agent", userAgentString)
+                    .timeout(probeTimeout)
+                    .build();
+        } catch (URISyntaxException e) {
+            return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, "Invalid URL");
+        }
+
+        for (int tries = 0;; tries++) {
+            try {
+                var rsp = client.send(head, HttpResponse.BodyHandlers.discarding());
+                EdgeUrl rspUri = new EdgeUrl(rsp.uri());
+
+                if (!Objects.equals(rspUri.domain, url.domain)) {
+                    return new DomainProbeResult.Redirect(rspUri.domain);
+                }
+                return new DomainProbeResult.Ok(rspUri);
+            } catch (Exception ex) {
+                if (tries > 3) {
+                    return new DomainProbeResult.Error(CrawlerDomainStatus.ERROR, ex.getMessage());
+                }
+                // else try again ...
+            }
        }
    }

    /** Perform a HEAD request to fetch the content type of a URL.
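The rewritten probeDomain above adds an explicit bounded retry: with the tries > 3 guard, the probe makes up to five attempts before surfacing the error. The shape in isolation, with a hypothetical Probe interface standing in for the HTTP call:

class RetrySketch {
    interface Probe {
        String attempt() throws Exception;
    }

    static String probeWithRetries(Probe probe) {
        for (int tries = 0;; tries++) {
            try {
                return probe.attempt();
            }
            catch (Exception ex) {
                if (tries > 3) {
                    // fifth consecutive failure: give up
                    return "error: " + ex.getMessage();
                }
                // else try again ...
            }
        }
    }
}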
@@ -140,21 +143,25 @@
                                                 WarcRecorder warcRecorder,
                                                 ContentTags tags) throws RateLimitException {
        if (tags.isEmpty() && contentTypeLogic.isUrlLikeBinary(url)) {
-            var headBuilder = new Request.Builder().head()
-                    .addHeader("User-agent", userAgentString)
-                    .addHeader("Accept-Encoding", "gzip")
-                    .url(url.toString());

-            var head = headBuilder.build();
-            var call = client.newCall(head);
-
-            try (var rsp = call.execute()) {
-                var contentTypeHeader = rsp.header("Content-type");
+            try {
+                var headBuilder = HttpRequest.newBuilder()
+                        .HEAD()
+                        .uri(url.asURI())
+                        .header("User-Agent", userAgentString)
+                        .header("Accept-Encoding", "gzip")
+                        .timeout(requestTimeout)
+                        ;
+
+                var rsp = client.send(headBuilder.build(), HttpResponse.BodyHandlers.discarding());
+                var headers = rsp.headers();
+
+                var contentTypeHeader = headers.firstValue("Content-Type").orElse(null);
+
                if (contentTypeHeader != null && !contentTypeLogic.isAllowableContentType(contentTypeHeader)) {
-                    warcRecorder.flagAsFailedContentTypeProbe(url, contentTypeHeader, rsp.code());
+                    warcRecorder.flagAsFailedContentTypeProbe(url, contentTypeHeader, rsp.statusCode());

-                    return new ContentTypeProbeResult.BadContentType(contentTypeHeader, rsp.code());
+                    return new ContentTypeProbeResult.BadContentType(contentTypeHeader, rsp.statusCode());
                }

                // Update the URL to the final URL of the HEAD request, otherwise we might end up doing
@@ -168,27 +175,27 @@
                // too many eyebrows when looking at the logs on the target server. Overall it's probably desirable
                // that it looks like the traffic makes sense, as opposed to looking like a broken bot.

-                var redirectUrl = new EdgeUrl(rsp.request().url().toString());
+                var redirectUrl = new EdgeUrl(rsp.uri());
                EdgeUrl ret;

                if (Objects.equals(redirectUrl.domain, url.domain)) ret = redirectUrl;
                else ret = url;

                // Intercept rate limiting
-                if (rsp.code() == 429) {
-                    throw new HttpFetcherImpl.RateLimitException(Objects.requireNonNullElse(rsp.header("Retry-After"), "1"));
+                if (rsp.statusCode() == 429) {
+                    throw new HttpFetcherImpl.RateLimitException(headers.firstValue("Retry-After").orElse("1"));
                }

                return new ContentTypeProbeResult.Ok(ret);
            }
+            catch (HttpTimeoutException ex) {
+                warcRecorder.flagAsTimeout(url);
+                return new ContentTypeProbeResult.Timeout(ex);
+            }
            catch (RateLimitException ex) {
                throw ex;
            }
-            catch (InterruptedIOException ex) {
-                warcRecorder.flagAsTimeout(url);
-
-                return new ContentTypeProbeResult.Timeout(ex);
-            } catch (Exception ex) {
+            catch (Exception ex) {
                logger.error("Error during fetching {}[{}]", ex.getClass().getSimpleName(), ex.getMessage());

                warcRecorder.flagAsError(url, ex);
@@ -210,13 +217,15 @@
                                        ProbeType probeType)
            throws Exception
    {
-        var getBuilder = new Request.Builder().get();
-
-        getBuilder.url(url.toString())
-                .addHeader("Accept-Encoding", "gzip")
-                .addHeader("Accept-Language", "en,*;q=0.5")
-                .addHeader("Accept", "text/html, application/xhtml+xml, text/*;q=0.8")
-                .addHeader("User-agent", userAgentString);
+        var getBuilder = HttpRequest.newBuilder()
+                .GET()
+                .uri(url.asURI())
+                .header("User-Agent", userAgentString)
+                .header("Accept-Encoding", "gzip")
+                .header("Accept-Language", "en,*;q=0.5")
+                .header("Accept", "text/html, application/xhtml+xml, text/*;q=0.8")
+                .timeout(requestTimeout)
+                ;

        contentTags.paint(getBuilder);

@@ -242,6 +251,126 @@
        return new SitemapRetriever();
    }

+    @Override
+    public List<EdgeUrl> fetchSitemapUrls(String root, CrawlDelayTimer delayTimer) {
+        try {
+            List<EdgeUrl> ret = new ArrayList<>();
+
+            Set<String> seenUrls = new HashSet<>();
+            Set<String> seenSitemaps = new HashSet<>();
+
+            Deque<EdgeUrl> sitemapQueue = new LinkedList<>();
+
+            EdgeUrl rootSitemapUrl = new EdgeUrl(root);
+
+            sitemapQueue.add(rootSitemapUrl);
+
+            int fetchedSitemaps = 0;
+
+            while (!sitemapQueue.isEmpty() && ret.size() < 20_000 && ++fetchedSitemaps < 10) {
+                var head = sitemapQueue.removeFirst();
+
+                switch (fetchSitemap(head)) {
+                    case SitemapResult.SitemapUrls(List<String> urls) -> {
+
+                        for (var url : urls) {
+                            if (seenUrls.add(url)) {
+                                EdgeUrl.parse(url)
+                                        .filter(u -> u.domain.equals(rootSitemapUrl.domain))
+                                        .ifPresent(ret::add);
+                            }
+                        }
+
+                    }
+                    case SitemapResult.SitemapReferences(List<String> refs) -> {
+                        for (var ref : refs) {
+                            if (seenSitemaps.add(ref)) {
+                                EdgeUrl.parse(ref)
+                                        .filter(url -> url.domain.equals(rootSitemapUrl.domain))
+                                        .ifPresent(sitemapQueue::addFirst);
+                            }
+                        }
+                    }
+                    case SitemapResult.SitemapError() -> {}
+                }
+
+                delayTimer.waitFetchDelay();
+            }
+
+            return ret;
+        }
+        catch (Exception ex) {
+            logger.error("Error while fetching sitemaps via {}: {} ({})", root, ex.getClass().getSimpleName(), ex.getMessage());
+            return List.of();
+        }
+    }
+
+
+    private SitemapResult fetchSitemap(EdgeUrl sitemapUrl) throws URISyntaxException, IOException, InterruptedException {
+        HttpRequest getRequest = HttpRequest.newBuilder()
+                .GET()
+                .uri(sitemapUrl.asURI())
+                .header("Accept-Encoding", "gzip")
+                .header("Accept", "text/*, */*;q=0.9")
+                .header("User-Agent", userAgentString)
+                .timeout(requestTimeout)
+                .build();
+
+        var response = client.send(getRequest, HttpResponse.BodyHandlers.ofInputStream());
+        if (response.statusCode() != 200) {
+            return new SitemapResult.SitemapError();
+        }
+
+        try (InputStream inputStream = response.body()) {
+
+            InputStream parserStream;
+            if (sitemapUrl.path.endsWith(".gz")) {
+                parserStream = new GZIPInputStream(inputStream);
+            }
+            else {
+                parserStream = inputStream;
+            }
+
+            Document parsedSitemap = Jsoup.parse(parserStream, "UTF-8", sitemapUrl.toString(), Parser.xmlParser());
+            if (parsedSitemap.childrenSize() == 0) {
+                return new SitemapResult.SitemapError();
+            }
+
+            String rootTagName = parsedSitemap.child(0).tagName();
+
+            return switch (rootTagName.toLowerCase()) {
+                case "sitemapindex" -> {
+                    List<String> references = new ArrayList<>();
+                    for (var locTag : parsedSitemap.getElementsByTag("loc")) {
+                        references.add(locTag.text().trim());
+                    }
+                    yield new SitemapResult.SitemapReferences(Collections.unmodifiableList(references));
+                }
+                case "urlset" -> {
+                    List<String> urls = new ArrayList<>();
+                    for (var locTag : parsedSitemap.select("url > loc")) {
+                        urls.add(locTag.text().trim());
+                    }
+                    yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
+                }
+                case "rss", "atom" -> {
+                    List<String> urls = new ArrayList<>();
+                    for (var locTag : parsedSitemap.select("link, url")) {
+                        urls.add(locTag.text().trim());
+                    }
+                    yield new SitemapResult.SitemapUrls(Collections.unmodifiableList(urls));
+                }
+                default -> new SitemapResult.SitemapError();
+            };
+        }
+    }
+
+    private sealed interface SitemapResult {
+        record SitemapUrls(List<String> urls) implements SitemapResult {}
+        record SitemapReferences(List<String> sitemapRefs) implements SitemapResult {}
+        record SitemapError() implements SitemapResult {}
+    }
+
    @Override
    public SimpleRobotRules fetchRobotRules(EdgeDomain domain, WarcRecorder recorder) {
        var ret = fetchAndParseRobotsTxt(new EdgeUrl("https", domain, null, "/robots.txt", null), recorder);
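The fetchSitemap dispatch above keys off the root tag of the parsed document. A self-contained sketch of that part, using an inline XML snippet in place of a fetched sitemap (the URLs are placeholders):

import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;

class SitemapParseSketch {
    public static void main(String[] args) {
        String xml = """
                <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
                  <url><loc>https://www.example.com/a</loc></url>
                  <url><loc>https://www.example.com/b</loc></url>
                </urlset>
                """;

        // xmlParser() keeps the literal root element, unlike the HTML parser,
        // which would wrap everything in <html><body>...
        var doc = Jsoup.parse(xml, "https://www.example.com/sitemap.xml", Parser.xmlParser());

        System.out.println(doc.child(0).tagName()); // urlset
        for (var loc : doc.select("url > loc")) {
            System.out.println(loc.text().trim());
        }
    }
}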
@@ -257,14 +386,15 @@

    private Optional<SimpleRobotRules> fetchAndParseRobotsTxt(EdgeUrl url, WarcRecorder recorder) {
        try {
-            var getBuilder = new Request.Builder().get();
+            var getRequest = HttpRequest.newBuilder()
+                    .GET()
+                    .uri(url.asURI())
+                    .header("Accept-Encoding", "gzip")
+                    .header("Accept", "text/*, */*;q=0.9")
+                    .header("User-Agent", userAgentString)
+                    .timeout(requestTimeout);

-            getBuilder.url(url.toString())
-                    .addHeader("Accept-Encoding", "gzip")
-                    .addHeader("Accept", "text/*, */*;q=0.9")
-                    .addHeader("User-agent", userAgentString);
-
-            HttpFetchResult result = recorder.fetch(client, getBuilder.build());
+            HttpFetchResult result = recorder.fetch(client, getRequest.build());

            return DocumentBodyExtractor.asBytes(result).mapOpt((contentType, body) ->
                    robotsParser.parseContent(url.toString(),
@@ -1,31 +0,0 @@
-package nu.marginalia.crawl.fetcher.socket;
-
-import okhttp3.Interceptor;
-import okhttp3.Response;
-import org.jetbrains.annotations.NotNull;
-
-import java.io.IOException;
-
-
-/** An interceptor that intercepts network requests and adds the remote IP address as
- * a header in the response. This is used to pass the remote IP address to the Warc
- * writer, as this information is not available in the response.
- */
-public class IpInterceptingNetworkInterceptor implements Interceptor {
-    private static final String pseudoHeaderName = "X-Marginalia-Remote-IP";
-
-    @NotNull
-    @Override
-    public Response intercept(@NotNull Interceptor.Chain chain) throws IOException {
-        String IP = chain.connection().socket().getInetAddress().getHostAddress();
-
-        return chain.proceed(chain.request())
-                .newBuilder()
-                .addHeader(pseudoHeaderName, IP)
-                .build();
-    }
-
-    public static String getIpFromResponse(Response response) {
-        return response.header(pseudoHeaderName);
-    }
-}
@@ -27,7 +27,7 @@
        }
    };

-    public static SSLSocketFactory buildSocketFactory() {
+    public static SSLContext buildSslContext() {
        try {
            // Install the all-trusting trust manager
            final SSLContext sslContext = SSLContext.getInstance("TLS");
@@ -40,14 +40,11 @@
            clientSessionContext.setSessionCacheSize(2048);

            // Create a ssl socket factory with our all-trusting manager
-            return sslContext.getSocketFactory();
+            return sslContext;
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

-    public static HostnameVerifier buildHostnameVerifyer() {
-        return (hn, session) -> true;
-    }
}
@@ -1,14 +1,14 @@
package nu.marginalia.crawl.fetcher.warc;

-import okhttp3.Headers;
-import okhttp3.Response;
import org.apache.commons.io.input.BOMInputStream;
import org.netpreserve.jwarc.WarcTruncationReason;

import java.io.*;
+import java.net.http.HttpHeaders;
+import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
-import java.util.Objects;
+import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Input buffer for temporary storage of a HTTP response
@@ -17,8 +17,9 @@
 * */
public abstract class WarcInputBuffer implements AutoCloseable {
    protected WarcTruncationReason truncationReason = WarcTruncationReason.NOT_TRUNCATED;
-    protected Headers headers;
-    WarcInputBuffer(Headers headers) {
+    protected HttpHeaders headers;
+
+    WarcInputBuffer(HttpHeaders headers) {
        this.headers = headers;
    }

@@ -30,7 +31,7 @@

    public final WarcTruncationReason truncationReason() { return truncationReason; }

-    public final Headers headers() { return headers; }
+    public final HttpHeaders headers() { return headers; }

    /** Create a buffer for a response.
     * If the response is small and not compressed, it will be stored in memory.
@@ -38,26 +39,27 @@
     * and suppressed from the headers.
     * If an error occurs, a buffer will be created with no content and an error status.
     */
-    static WarcInputBuffer forResponse(Response rsp) {
+    static WarcInputBuffer forResponse(HttpResponse<InputStream> rsp) {
        if (rsp == null)
            return new ErrorBuffer();

-        try {
-            String contentLengthHeader = Objects.requireNonNullElse(rsp.header("Content-Length"), "-1");
-            int contentLength = Integer.parseInt(contentLengthHeader);
-            String contentEncoding = rsp.header("Content-Encoding");
+        var headers = rsp.headers();
+
+        try (var is = rsp.body()) {
+            int contentLength = (int) headers.firstValueAsLong("Content-Length").orElse(-1L);
+            String contentEncoding = headers.firstValue("Content-Encoding").orElse(null);

            if (contentEncoding == null && contentLength > 0 && contentLength < 8192) {
                // If the content is small and not compressed, we can just read it into memory
-                return new MemoryBuffer(rsp, contentLength);
+                return new MemoryBuffer(headers, is, contentLength);
            }
            else {
                // Otherwise, we unpack it into a file and read it from there
-                return new FileBuffer(rsp);
+                return new FileBuffer(headers, is);
            }
        }
        catch (Exception ex) {
-            return new ErrorBuffer(rsp);
+            return new ErrorBuffer();
        }

    }
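The new forResponse probes the JDK's immutable HttpHeaders instead of okhttp's Headers; the Optional-based accessors make the missing-header cases explicit. In isolation, with made-up header values:

import java.net.http.HttpHeaders;
import java.util.List;
import java.util.Map;

class HeaderProbeSketch {
    public static void main(String[] args) {
        var headers = HttpHeaders.of(
                Map.of("Content-Length", List.of("4096")),
                (k, v) -> true);

        // OptionalLong.orElse supplies the -1 sentinel for a missing Content-Length
        int contentLength = (int) headers.firstValueAsLong("Content-Length").orElse(-1L);
        String contentEncoding = headers.firstValue("Content-Encoding").orElse(null);

        System.out.println(contentLength + " / " + contentEncoding); // 4096 / null
    }
}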
@@ -99,12 +101,8 @@
/** Pseudo-buffer for when we have an error */
class ErrorBuffer extends WarcInputBuffer {
    public ErrorBuffer() {
-        super(Headers.of());
-        truncationReason = WarcTruncationReason.UNSPECIFIED;
-    }
-
-    public ErrorBuffer(Response rsp) {
-        super(rsp.headers());
+        super(HttpHeaders.of(Map.of(), (k,v)->false));
        truncationReason = WarcTruncationReason.UNSPECIFIED;
    }

@@ -125,12 +123,12 @@
/** Buffer for when we have the response in memory */
class MemoryBuffer extends WarcInputBuffer {
    byte[] data;
-    public MemoryBuffer(Response response, int size) {
-        super(response.headers());
+    public MemoryBuffer(HttpHeaders headers, InputStream responseStream, int size) {
+        super(headers);

        var outputStream = new ByteArrayOutputStream(size);

-        copy(response.body().byteStream(), outputStream);
+        copy(responseStream, outputStream);

        data = outputStream.toByteArray();
    }
@@ -154,19 +152,15 @@
class FileBuffer extends WarcInputBuffer {
    private final Path tempFile;

-    public FileBuffer(Response response) throws IOException {
-        super(suppressContentEncoding(response.headers()));
+    public FileBuffer(HttpHeaders headers, InputStream responseStream) throws IOException {
+        super(suppressContentEncoding(headers));

        this.tempFile = Files.createTempFile("rsp", ".html");

-        if (response.body() == null) {
-            truncationReason = WarcTruncationReason.DISCONNECT;
-            return;
-        }
-
-        if ("gzip".equals(response.header("Content-Encoding"))) {
+        if ("gzip".equalsIgnoreCase(headers.firstValue("Content-Encoding").orElse(""))) {
            try (var out = Files.newOutputStream(tempFile)) {
-                copy(new GZIPInputStream(response.body().byteStream()), out);
+                copy(new GZIPInputStream(responseStream), out);
            }
            catch (Exception ex) {
                truncationReason = WarcTruncationReason.UNSPECIFIED;
@@ -174,7 +168,7 @@
        }
        else {
            try (var out = Files.newOutputStream(tempFile)) {
-                copy(response.body().byteStream(), out);
+                copy(responseStream, out);
            }
            catch (Exception ex) {
                truncationReason = WarcTruncationReason.UNSPECIFIED;
@@ -182,22 +176,13 @@
        }
    }

-    private static Headers suppressContentEncoding(Headers headers) {
-        var builder = new Headers.Builder();
-
-        headers.toMultimap().forEach((k, values) -> {
+    private static HttpHeaders suppressContentEncoding(HttpHeaders headers) {
+        return HttpHeaders.of(headers.map(), (k, v) -> {
            if ("Content-Encoding".equalsIgnoreCase(k)) {
-                return;
+                return false;
            }
-            if ("Transfer-Encoding".equalsIgnoreCase(k)) {
-                return;
-            }
-            for (var value : values) {
-                builder.add(k, value);
-            }
+            return !"Transfer-Encoding".equalsIgnoreCase(k);
        });
-
-        return builder.build();
    }


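suppressContentEncoding above leans on the BiPredicate filter that HttpHeaders.of applies to every (name, value) pair, which collapses the old copy-loop into a single expression. Demonstrated standalone, with made-up header values:

import java.net.http.HttpHeaders;
import java.util.List;
import java.util.Map;

class HeaderFilterSketch {
    public static void main(String[] args) {
        var original = HttpHeaders.of(Map.of(
                "Content-Type", List.of("text/html"),
                "Content-Encoding", List.of("gzip"),
                "Transfer-Encoding", List.of("chunked")),
                (k, v) -> true);

        // Drop the encoding headers, since the buffer stores the decoded body
        var filtered = HttpHeaders.of(original.map(), (k, v) ->
                !"Content-Encoding".equalsIgnoreCase(k)
                        && !"Transfer-Encoding".equalsIgnoreCase(k));

        System.out.println(filtered.map().keySet()); // [Content-Type]
    }
}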
@@ -1,11 +1,12 @@
package nu.marginalia.crawl.fetcher.warc;

-import okhttp3.Protocol;
-import okhttp3.Response;
import org.apache.commons.lang3.StringUtils;

import java.net.URI;
import java.net.URLEncoder;
+import java.net.http.HttpClient;
+import java.net.http.HttpHeaders;
+import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.stream.Collectors;
@@ -75,13 +76,13 @@ public class WarcProtocolReconstructor {
        return "HTTP/" + version + " " + statusCode + " " + statusMessage + "\r\n" + headerString + "\r\n\r\n";
    }

-    static String getResponseHeader(Response response, long size) {
-        String version = response.protocol() == Protocol.HTTP_1_1 ? "1.1" : "2.0";
+    static String getResponseHeader(HttpResponse<?> response, long size) {
+        String version = response.version() == HttpClient.Version.HTTP_1_1 ? "1.1" : "2.0";

-        String statusCode = String.valueOf(response.code());
-        String statusMessage = STATUS_CODE_MAP.getOrDefault(response.code(), "Unknown");
+        String statusCode = String.valueOf(response.statusCode());
+        String statusMessage = STATUS_CODE_MAP.getOrDefault(response.statusCode(), "Unknown");

-        String headerString = getHeadersAsString(response, size);
+        String headerString = getHeadersAsString(response.headers(), size);

        return "HTTP/" + version + " " + statusCode + " " + statusMessage + "\r\n" + headerString + "\r\n\r\n";
    }
@@ -148,10 +149,10 @@ public class WarcProtocolReconstructor {
        return joiner.toString();
    }

-    static private String getHeadersAsString(Response response, long responseSize) {
+    static private String getHeadersAsString(HttpHeaders headers, long responseSize) {
        StringJoiner joiner = new StringJoiner("\r\n");

-        response.headers().toMultimap().forEach((k, values) -> {
+        headers.map().forEach((k, values) -> {
            String headerCapitalized = capitalizeHeader(k);

            // Omit pseudoheaders injected by the crawler itself
@@ -179,8 +180,8 @@ public class WarcProtocolReconstructor {
        return joiner.toString();
    }

-    // okhttp gives us flattened headers, so we need to reconstruct Camel-Kebab-Case style
-    // for the WARC parser's sake...
+    // okhttp gave us flattened headers, so we need to reconstruct Camel-Kebab-Case style
+    // for the WARC parser's sake... (do we still need this, mr chesterton?)
    static private String capitalizeHeader(String k) {
        return Arrays.stream(StringUtils.split(k, '-'))
                .map(StringUtils::capitalize)
@@ -1,13 +1,11 @@
package nu.marginalia.crawl.fetcher.warc;

import nu.marginalia.crawl.fetcher.ContentTags;
+import nu.marginalia.crawl.fetcher.Cookies;
import nu.marginalia.crawl.fetcher.HttpFetcherImpl;
-import nu.marginalia.crawl.fetcher.socket.IpInterceptingNetworkInterceptor;
import nu.marginalia.model.EdgeDomain;
import nu.marginalia.model.EdgeUrl;
import nu.marginalia.model.body.HttpFetchResult;
-import okhttp3.OkHttpClient;
-import okhttp3.Request;
import org.jetbrains.annotations.Nullable;
import org.netpreserve.jwarc.*;
import org.slf4j.Logger;
@@ -18,16 +16,19 @@ import java.io.InputStream;
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
+import java.net.http.HttpClient;
+import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.NoSuchAlgorithmException;
+import java.time.Duration;
import java.time.Instant;
import java.util.*;

/** Based on JWarc's fetch method, APL 2.0 license
 * <p></p>
- * This class wraps OkHttp's OkHttpClient and records the HTTP request and response in a WARC file,
+ * This class wraps HttpClient and records the HTTP request and response in a WARC file,
 * as best is possible given not all the data is available at the same time and needs to
 * be reconstructed.
 */
@@ -47,20 +48,22 @@ public class WarcRecorder implements AutoCloseable {
    // Affix a version string in case we need to change the format in the future
    // in some way
    private final String warcRecorderVersion = "1.0";
+    private final Cookies cookies;

-    // We need to know if the site uses cookies so this can be reported among the search results
-    // -- flip this to true if we see any cookies. This information will also be painted on any
-    // revisited pages. It's not 100% perfect and a bit order dependent, but it's good enough.
-    private final WarcXCookieInformationHeader cookieInformation = new WarcXCookieInformationHeader();

    /**
     * Create a new WarcRecorder that will write to the given file
     *
     * @param warcFile The file to write to
     */
-    public WarcRecorder(Path warcFile) throws IOException {
+    public WarcRecorder(Path warcFile, HttpFetcherImpl fetcher) throws IOException {
        this.warcFile = warcFile;
        this.writer = new WarcWriter(warcFile);
+        this.cookies = fetcher.getCookies();
+    }
+
+    public WarcRecorder(Path warcFile, Cookies cookies) throws IOException {
+        this.warcFile = warcFile;
+        this.writer = new WarcWriter(warcFile);
+        this.cookies = cookies;
    }

    /**
@@ -70,36 +73,45 @@ public class WarcRecorder implements AutoCloseable {
    public WarcRecorder() throws IOException {
        this.warcFile = Files.createTempFile("warc", ".warc.gz");
        this.writer = new WarcWriter(this.warcFile);
+        this.cookies = new Cookies();

        temporaryFile = true;
    }

-    public HttpFetchResult fetch(OkHttpClient client, Request request) throws NoSuchAlgorithmException,
-            IOException,
-            URISyntaxException,
-            InterruptedException
+    public HttpFetchResult fetch(HttpClient client,
+                                 java.net.http.HttpRequest request)
+            throws NoSuchAlgorithmException, IOException, URISyntaxException, InterruptedException
    {
-        URI requestUri = request.url().uri();
+        URI requestUri = request.uri();

        WarcDigestBuilder responseDigestBuilder = new WarcDigestBuilder();
        WarcDigestBuilder payloadDigestBuilder = new WarcDigestBuilder();

-        String ip;
        Instant date = Instant.now();

-        var call = client.newCall(request);
+        // Not entirely sure why we need to do this, but keeping it due to Chesterton's Fence
+        Map<String, List<String>> extraHeaders = new HashMap<>(request.headers().map());

-        cookieInformation.update(client, request.url());
+        HttpResponse<InputStream> response;
+        try {
+            response = client.send(request, java.net.http.HttpResponse.BodyHandlers.ofInputStream());
+        }
+        catch (Exception ex) {
+            logger.warn("Failed to fetch URL {}: {}", requestUri, ex.getMessage());
+            return new HttpFetchResult.ResultException(ex);
+        }

-        try (var response = call.execute();
-             WarcInputBuffer inputBuffer = WarcInputBuffer.forResponse(response))
+        try (WarcInputBuffer inputBuffer = WarcInputBuffer.forResponse(response);
+             InputStream inputStream = inputBuffer.read())
        {
+            if (cookies.hasCookies()) {
+                extraHeaders.put("X-Has-Cookies", List.of("1"));
+            }
+
            byte[] responseHeaders = WarcProtocolReconstructor.getResponseHeader(response, inputBuffer.size()).getBytes(StandardCharsets.UTF_8);

            ResponseDataBuffer responseDataBuffer = new ResponseDataBuffer(inputBuffer.size() + responseHeaders.length);
-            InputStream inputStream = inputBuffer.read();
-
-            ip = IpInterceptingNetworkInterceptor.getIpFromResponse(response);

            responseDataBuffer.put(responseHeaders);
            responseDataBuffer.updateDigest(responseDigestBuilder, 0, responseHeaders.length);
@@ -123,17 +135,15 @@ public class WarcRecorder implements AutoCloseable {

            // It looks like this might be the same as requestUri, but it's not;
            // it's the URI after resolving redirects.
-            final URI responseUri = response.request().url().uri();
+            final URI responseUri = response.uri();

            WarcResponse.Builder responseBuilder = new WarcResponse.Builder(responseUri)
                    .blockDigest(responseDigestBuilder.build())
                    .date(date)
                    .body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());

-            cookieInformation.paint(responseBuilder);
-
-            if (ip != null) responseBuilder.ipAddress(InetAddress.getByName(ip));
+            InetAddress inetAddress = InetAddress.getByName(responseUri.getHost());
+            responseBuilder.ipAddress(inetAddress);

            responseBuilder.payloadDigest(payloadDigestBuilder.build());
            responseBuilder.truncated(inputBuffer.truncationReason());

@@ -150,8 +160,8 @@ public class WarcRecorder implements AutoCloseable {
            byte[] httpRequestString = WarcProtocolReconstructor
                    .getHttpRequestString(
                            response.request().method(),
-                            response.request().headers().toMultimap(),
-                            request.headers().toMultimap(),
+                            response.request().headers().map(),
+                            extraHeaders,
                            requestUri)
                    .getBytes();

@@ -167,10 +177,29 @@ public class WarcRecorder implements AutoCloseable {
            warcRequest.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
            writer.write(warcRequest);

+            if (Duration.between(date, Instant.now()).compareTo(Duration.ofSeconds(9)) > 0
+                    && inputBuffer.size() < 2048
+                    && !request.uri().getPath().endsWith("robots.txt")) // don't bail on robots.txt
+            {
+                // Fast detection and mitigation of crawler traps that respond with slow
+                // small responses, with a high branching factor
+
+                // Note we bail *after* writing the warc records, this will effectively only
+                // prevent link extraction from the document.
+
+                logger.warn("URL {} took too long to fetch ({}s) and was too small for the effort ({}b)",
+                        requestUri,
+                        Duration.between(date, Instant.now()).getSeconds(),
+                        inputBuffer.size()
+                );
+
+                return new HttpFetchResult.ResultException(new IOException("Likely crawler trap"));
+            }
+
            return new HttpFetchResult.ResultOk(responseUri,
-                    response.code(),
+                    response.statusCode(),
                    inputBuffer.headers(),
-                    ip,
+                    inetAddress.getHostAddress(),
                    responseDataBuffer.data,
                    dataStart,
                    responseDataBuffer.length() - dataStart);
@@ -185,7 +214,7 @@ public class WarcRecorder implements AutoCloseable {
|
|||||||
writer.write(item);
|
writer.write(item);
|
||||||
}
|
}
|
||||||
|
|
||||||
private void saveOldResponse(EdgeUrl url, String contentType, int statusCode, String documentBody, @Nullable String headers, ContentTags contentTags) {
|
private void saveOldResponse(EdgeUrl url, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags contentTags) {
|
||||||
try {
|
try {
|
||||||
WarcDigestBuilder responseDigestBuilder = new WarcDigestBuilder();
|
WarcDigestBuilder responseDigestBuilder = new WarcDigestBuilder();
|
||||||
WarcDigestBuilder payloadDigestBuilder = new WarcDigestBuilder();
|
WarcDigestBuilder payloadDigestBuilder = new WarcDigestBuilder();
|
||||||
@@ -195,7 +224,7 @@ public class WarcRecorder implements AutoCloseable {
|
|||||||
if (documentBody == null) {
|
if (documentBody == null) {
|
||||||
bytes = new byte[0];
|
bytes = new byte[0];
|
||||||
} else {
|
} else {
|
||||||
bytes = documentBody.getBytes();
|
bytes = documentBody;
|
||||||
}
|
}
|
||||||
|
|
||||||
// Create a synthesis of custom headers and the original headers
|
// Create a synthesis of custom headers and the original headers
|
||||||
@@ -246,7 +275,9 @@ public class WarcRecorder implements AutoCloseable {
|
|||||||
.date(Instant.now())
|
.date(Instant.now())
|
||||||
.body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());
|
.body(MediaType.HTTP_RESPONSE, responseDataBuffer.copyBytes());
|
||||||
|
|
||||||
cookieInformation.paint(builder);
|
if (cookies.hasCookies()) {
|
||||||
|
builder.addHeader("X-Has-Cookies", "1");
|
||||||
|
}
|
||||||
|
|
||||||
var reference = builder.build();
|
var reference = builder.build();
|
||||||
|
|
||||||
@@ -264,7 +295,7 @@ public class WarcRecorder implements AutoCloseable {
|
|||||||
* an E-Tag or Last-Modified header, and the server responds with a 304 Not Modified. In this
|
* an E-Tag or Last-Modified header, and the server responds with a 304 Not Modified. In this
|
||||||
* scenario we want to record the data as it was in the previous crawl, but not re-fetch it.
|
* scenario we want to record the data as it was in the previous crawl, but not re-fetch it.
|
||||||
*/
|
*/
|
||||||
public void writeReferenceCopy(EdgeUrl url, String contentType, int statusCode, String documentBody, @Nullable String headers, ContentTags ctags) {
|
public void writeReferenceCopy(EdgeUrl url, String contentType, int statusCode, byte[] documentBody, @Nullable String headers, ContentTags ctags) {
|
||||||
saveOldResponse(url, contentType, statusCode, documentBody, headers, ctags);
|
saveOldResponse(url, contentType, statusCode, documentBody, headers, ctags);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
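For context, the hunks above replace OkHttp with the JDK's built-in java.net.http client. A minimal sketch of the request/response pattern the new fetch() builds on; the client configuration and example URL are illustrative assumptions, not part of this changeset:

    import java.io.InputStream;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    class JdkHttpClientSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .followRedirects(HttpClient.Redirect.NORMAL) // so response.uri() reflects redirects
                    .connectTimeout(Duration.ofSeconds(10))
                    .build();

            HttpRequest request = HttpRequest.newBuilder(URI.create("https://www.example.com/"))
                    .GET()
                    .build();

            // Streaming body handler, as in the fetch() method above
            HttpResponse<InputStream> response =
                    client.send(request, HttpResponse.BodyHandlers.ofInputStream());

            int status = response.statusCode(); // replaces OkHttp's response.code()
            URI finalUri = response.uri();      // URI after redirects; replaces response.request().url().uri()

            try (InputStream body = response.body()) {
                System.out.println(status + " " + finalUri + " " + body.readAllBytes().length + " bytes");
            }
        }
    }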
@@ -4,6 +4,7 @@ import nu.marginalia.ContentTypes;
 import nu.marginalia.io.SerializableCrawlDataStream;
 import nu.marginalia.lsh.EasyLSH;
 import nu.marginalia.model.crawldata.CrawledDocument;
+import org.jetbrains.annotations.NotNull;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -11,54 +12,76 @@ import javax.annotation.Nullable;
 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
+import java.util.Iterator;
+import java.util.Objects;
+import java.util.Optional;
 
 /** A reference to a domain that has been crawled before. */
-public class CrawlDataReference implements AutoCloseable {
+public class CrawlDataReference implements AutoCloseable, Iterable<CrawledDocument> {
+
+    private boolean closed = false;
+
+    @Nullable
+    private final Path path;
+
+    @Nullable
+    private SerializableCrawlDataStream data = null;
 
-    private final SerializableCrawlDataStream data;
     private static final Logger logger = LoggerFactory.getLogger(CrawlDataReference.class);
 
-    public CrawlDataReference(SerializableCrawlDataStream data) {
-        this.data = data;
+    public CrawlDataReference(@Nullable Path path) {
+        this.path = path;
     }
 
     public CrawlDataReference() {
-        this(SerializableCrawlDataStream.empty());
+        this(null);
    }
 
     /** Delete the associated data from disk, if it exists */
     public void delete() throws IOException {
-        Path filePath = data.path();
-
-        if (filePath != null) {
-            Files.deleteIfExists(filePath);
+        if (path != null) {
+            Files.deleteIfExists(path);
         }
     }
 
-    /** Get the next document from the crawl data,
-     * returning null when there are no more documents
-     * available
-     */
-    @Nullable
-    public CrawledDocument nextDocument() {
+    public @NotNull Iterator<CrawledDocument> iterator() {
+
+        requireStream();
+        // Guaranteed by requireStream, but helps java
+        Objects.requireNonNull(data);
+
+        return data.map(next -> {
+            if (next instanceof CrawledDocument doc && ContentTypes.isAccepted(doc.contentType)) {
+                return Optional.of(doc);
+            }
+            else {
+                return Optional.empty();
+            }
+        });
+    }
+
+    /** After calling this method, data is guaranteed to be non-null */
+    private void requireStream() {
+        if (closed) {
+            throw new IllegalStateException("Use after close()");
+        }
+
+        if (data == null) {
             try {
-                while (data.hasNext()) {
-                    if (data.next() instanceof CrawledDocument doc) {
-                        if (!ContentTypes.isAccepted(doc.contentType))
-                            continue;
-
-                        return doc;
-                    }
+                if (path != null) {
+                    data = SerializableCrawlDataStream.openDataStream(path);
+                    return;
                 }
             }
-            catch (IOException ex) {
-                logger.error("Failed to read next document", ex);
+            catch (Exception ex) {
+                logger.error("Failed to open stream", ex);
             }
 
-        return null;
+            data = SerializableCrawlDataStream.empty();
+        }
     }
 
-    public static boolean isContentBodySame(String one, String other) {
+    public static boolean isContentBodySame(byte[] one, byte[] other) {
 
         final long contentHashOne = contentHash(one);
         final long contentHashOther = contentHash(other);
@@ -66,7 +89,7 @@ public class CrawlDataReference implements AutoCloseable {
         return EasyLSH.hammingDistance(contentHashOne, contentHashOther) < 4;
     }
 
-    private static long contentHash(String content) {
+    private static long contentHash(byte[] content) {
         EasyLSH hash = new EasyLSH();
         int next = 0;
 
@@ -74,8 +97,8 @@ public class CrawlDataReference implements AutoCloseable {
 
         // In a naive best-effort fashion, extract the text
         // content of the document and feed it into the LSH
-        for (int i = 0; i < content.length(); i++) {
-            char c = content.charAt(i);
+        for (byte b : content) {
+            char c = (char) b;
             if (c == '<') {
                 isInTag = true;
             } else if (c == '>') {
@@ -98,7 +121,12 @@ public class CrawlDataReference implements AutoCloseable {
     }
 
     @Override
-    public void close() throws Exception {
+    public void close() throws IOException {
+        if (!closed) {
+            if (data != null) {
                 data.close();
             }
+            closed = true;
+        }
     }
 }
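The byte[]-based near-duplicate check above can be exercised in isolation. A hedged sketch, assuming CrawlDataReference is imported from nu.marginalia.crawl.retreival; the sample bodies are invented, and the verdict depends on EasyLSH's fingerprints:

    import java.nio.charset.StandardCharsets;

    class NearDuplicateSketch {
        public static void main(String[] args) {
            byte[] oldBody = "<html><body>Hello world</body></html>".getBytes(StandardCharsets.UTF_8);
            byte[] newBody = "<html><body>Hello world!</body></html>".getBytes(StandardCharsets.UTF_8);

            // True when the two LSH fingerprints are within hamming distance 4,
            // i.e. the page counts as unchanged for recrawl purposes; text inside
            // angle-bracket tags is skipped by the naive extraction in contentHash()
            boolean unchanged = CrawlDataReference.isContentBodySame(oldBody, newBody);
            System.out.println("unchanged = " + unchanged);
        }
    }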
@@ -12,7 +12,6 @@ import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
 import nu.marginalia.crawl.logic.LinkFilterSelector;
 import nu.marginalia.crawl.retreival.revisit.CrawlerRevisitor;
 import nu.marginalia.crawl.retreival.revisit.DocumentWithReference;
-import nu.marginalia.crawl.retreival.sitemap.SitemapFetcher;
 import nu.marginalia.ip_blocklist.UrlBlocklist;
 import nu.marginalia.link_parser.LinkParser;
 import nu.marginalia.model.EdgeDomain;
@@ -20,7 +19,6 @@ import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.body.DocumentBodyExtractor;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawlerDomainStatus;
-import org.jsoup.Jsoup;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -53,7 +51,6 @@ public class CrawlerRetreiver implements AutoCloseable {
     private final WarcRecorder warcRecorder;
     private final CrawlerRevisitor crawlerRevisitor;
 
-    private final SitemapFetcher sitemapFetcher;
     int errorCount = 0;
 
     public CrawlerRetreiver(HttpFetcher fetcher,
@@ -71,7 +68,6 @@ public class CrawlerRetreiver implements AutoCloseable {
 
         crawlFrontier = new DomainCrawlFrontier(new EdgeDomain(domain), specs.urls(), specs.crawlDepth());
         crawlerRevisitor = new CrawlerRevisitor(crawlFrontier, this, warcRecorder);
-        sitemapFetcher = new SitemapFetcher(crawlFrontier, fetcher.createSitemapRetriever());
 
         // We must always crawl the index page first, this is assumed when fingerprinting the server
         var fst = crawlFrontier.peek();
@@ -93,47 +89,23 @@ public class CrawlerRetreiver implements AutoCloseable {
     }
 
     public int crawlDomain(DomainLinks domainLinks, CrawlDataReference oldCrawlData) {
-        try {
+        try (oldCrawlData) {
             // Do an initial domain probe to determine the root URL
-            EdgeUrl rootUrl;
 
             var probeResult = probeRootUrl();
-            switch (probeResult) {
+
+            return switch (probeResult) {
                 case HttpFetcher.DomainProbeResult.Ok(EdgeUrl probedUrl) -> {
-                    rootUrl = probedUrl; // Good track
-                }
-                case HttpFetcher.DomainProbeResult.Redirect(EdgeDomain domain1) -> {
-                    domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, "Redirect", domain1.toString()));
-                    return 1;
-                }
-                case HttpFetcher.DomainProbeResult.Error(CrawlerDomainStatus status, String desc) -> {
-                    domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, status.toString(), desc));
-                    return 1;
-                }
-            }
 
                     // Sleep after the initial probe, we don't have access to the robots.txt yet
                     // so we don't know the crawl delay
                     TimeUnit.SECONDS.sleep(1);
 
-            return crawlDomain(oldCrawlData, rootUrl, domainLinks);
-        }
-        catch (Exception ex) {
-            logger.error("Error crawling domain {}", domain, ex);
-            return 0;
-        }
-    }
-
-    private int crawlDomain(CrawlDataReference oldCrawlData,
-                            EdgeUrl rootUrl,
-                            DomainLinks domainLinks) throws InterruptedException {
-
-        final SimpleRobotRules robotsRules = fetcher.fetchRobotRules(rootUrl.domain, warcRecorder);
+                    final SimpleRobotRules robotsRules = fetcher.fetchRobotRules(probedUrl.domain, warcRecorder);
                     final CrawlDelayTimer delayTimer = new CrawlDelayTimer(robotsRules.getCrawlDelay());
 
                     delayTimer.waitFetchDelay(0); // initial delay after robots.txt
 
-        DomainStateDb.SummaryRecord summaryRecord = sniffRootDocument(rootUrl, delayTimer);
+                    DomainStateDb.SummaryRecord summaryRecord = sniffRootDocument(probedUrl, delayTimer);
                     domainStateDb.save(summaryRecord);
 
         // Play back the old crawl data (if present) and fetch the documents comparing etags and last-modified
@@ -142,12 +114,40 @@ public class CrawlerRetreiver implements AutoCloseable {
             crawlFrontier.increaseDepth(1.5, 2500);
         }
 
+                    oldCrawlData.close(); // proactively close the crawl data reference here to not hold onto expensive resources
+
+                    yield crawlDomain(probedUrl, robotsRules, delayTimer, domainLinks);
+                }
+                case HttpFetcher.DomainProbeResult.Redirect(EdgeDomain domain1) -> {
+                    domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, "Redirect", domain1.toString()));
+                    yield 1;
+                }
+                case HttpFetcher.DomainProbeResult.Error(CrawlerDomainStatus status, String desc) -> {
+                    domainStateDb.save(DomainStateDb.SummaryRecord.forError(domain, status.toString(), desc));
+                    yield 1;
+                }
+            };
+        }
+        catch (Exception ex) {
+            logger.error("Error crawling domain {}", domain, ex);
+            return 0;
+        }
+    }
+
+    private int crawlDomain(EdgeUrl rootUrl,
+                            SimpleRobotRules robotsRules,
+                            CrawlDelayTimer delayTimer,
+                            DomainLinks domainLinks) {
 
         // Add external links to the crawl frontier
         crawlFrontier.addAllToQueue(domainLinks.getUrls(rootUrl.proto));
 
-        // Add links from the sitemap to the crawl frontier
-        sitemapFetcher.downloadSitemaps(robotsRules, rootUrl);
+        // Fetch sitemaps
+        for (var sitemap : robotsRules.getSitemaps()) {
+            crawlFrontier.addAllToQueue(fetcher.fetchSitemapUrls(sitemap, delayTimer));
+        }
 
         while (!crawlFrontier.isEmpty()
             && !crawlFrontier.isCrawlDepthReached()
@@ -271,13 +271,19 @@ public class CrawlerRetreiver implements AutoCloseable {
         }
 
         // Download the sitemap if available
-        if (feedLink.isPresent()) {
-            sitemapFetcher.downloadSitemaps(List.of(feedLink.get()));
-            timer.waitFetchDelay(0);
-        }
+        feedLink.ifPresent(s -> fetcher.fetchSitemapUrls(s, timer));
 
         // Grab the favicon if it exists
-        fetchWithRetry(faviconUrl, timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty());
+        if (fetchWithRetry(faviconUrl, timer, HttpFetcher.ProbeType.DISABLED, ContentTags.empty()) instanceof HttpFetchResult.ResultOk iconResult) {
+            String contentType = iconResult.header("Content-Type");
+            byte[] iconData = iconResult.getBodyBytes();
+
+            domainStateDb.saveIcon(
+                    domain,
+                    new DomainStateDb.FaviconRecord(contentType, iconData)
+            );
+        }
         timer.waitFetchDelay(0);
 
     }
@@ -375,21 +381,23 @@ public class CrawlerRetreiver implements AutoCloseable {
         if (docOpt.isPresent()) {
             var doc = docOpt.get();
 
-            crawlFrontier.enqueueLinksFromDocument(top, doc);
-            crawlFrontier.addVisited(new EdgeUrl(ok.uri()));
+            var responseUrl = new EdgeUrl(ok.uri());
+
+            crawlFrontier.enqueueLinksFromDocument(responseUrl, doc);
+            crawlFrontier.addVisited(responseUrl);
         }
     }
     else if (fetchedDoc instanceof HttpFetchResult.Result304Raw && reference.doc() != null) {
         var doc = reference.doc();
 
-        warcRecorder.writeReferenceCopy(top, doc.contentType, doc.httpStatus, doc.documentBody, doc.headers, contentTags);
+        warcRecorder.writeReferenceCopy(top, doc.contentType, doc.httpStatus, doc.documentBodyBytes, doc.headers, contentTags);
 
         fetchedDoc = new HttpFetchResult.Result304ReplacedWithReference(doc.url,
                 new ContentType(doc.contentType, "UTF-8"),
-                doc.documentBody);
+                doc.documentBodyBytes);
 
-        if (doc.documentBody != null) {
-            var parsed = Jsoup.parse(doc.documentBody);
+        if (doc.documentBodyBytes != null) {
+            var parsed = doc.parseBody();
 
             crawlFrontier.enqueueLinksFromDocument(top, parsed);
             crawlFrontier.addVisited(top);
@@ -1,6 +1,5 @@
 package nu.marginalia.crawl.retreival.revisit;
 
-import com.google.common.base.Strings;
 import crawlercommons.robots.SimpleRobotRules;
 import nu.marginalia.crawl.fetcher.ContentTags;
 import nu.marginalia.crawl.fetcher.warc.WarcRecorder;
@@ -11,7 +10,8 @@ import nu.marginalia.crawl.retreival.DomainCrawlFrontier;
 import nu.marginalia.model.EdgeUrl;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawledDocument;
-import org.jsoup.Jsoup;
+
+import java.io.IOException;
 
 /** This class encapsulates the logic for re-visiting a domain that has already been crawled.
  * We may use information from the previous crawl to inform the next crawl, specifically the
@@ -40,18 +40,12 @@ public class CrawlerRevisitor {
         int errors = 0;
         int skipped = 0;
 
-        for (;;) {
+        for (CrawledDocument doc : oldCrawlData) {
             if (errors > 20) {
                 // If we've had too many errors, we'll stop trying to recrawl
                 break;
             }
 
-            CrawledDocument doc = oldCrawlData.nextDocument();
-
-            if (doc == null)
-                break;
-
-            // This Shouldn't Happen (TM)
             var urlMaybe = EdgeUrl.parse(doc.url);
             if (urlMaybe.isEmpty())
                 continue;
@@ -70,7 +64,7 @@ public class CrawlerRevisitor {
             // unlikely to produce anything meaningful for us.
             if (doc.httpStatus != 200)
                 continue;
-            if (Strings.isNullOrEmpty(doc.documentBody))
+            if (!doc.hasBody())
                 continue;
 
             if (!crawlFrontier.filterLink(url))
@@ -117,14 +111,19 @@ public class CrawlerRevisitor {
                 // fashion to make sure we eventually catch changes over time
                 // and ensure we discover new links
 
+                try {
                     // Hoover up any links from the document
-                crawlFrontier.enqueueLinksFromDocument(url, Jsoup.parse(doc.documentBody));
+                    crawlFrontier.enqueueLinksFromDocument(url, doc.parseBody());
+                }
+                catch (IOException ex) {
+                    //
+                }
 
                 // Add a WARC record so we don't repeat this
                 warcRecorder.writeReferenceCopy(url,
                         doc.contentType,
                         doc.httpStatus,
-                        doc.documentBody,
+                        doc.documentBodyBytes,
                         doc.headers,
                         new ContentTags(doc.etagMaybe, doc.lastModifiedMaybe)
                 );
@@ -2,8 +2,6 @@ package nu.marginalia.crawl.retreival.revisit;
 
 import nu.marginalia.crawl.fetcher.ContentTags;
 import nu.marginalia.crawl.retreival.CrawlDataReference;
-import nu.marginalia.model.body.DocumentBodyExtractor;
-import nu.marginalia.model.body.DocumentBodyResult;
 import nu.marginalia.model.body.HttpFetchResult;
 import nu.marginalia.model.crawldata.CrawledDocument;
 
@@ -35,21 +33,17 @@ public record DocumentWithReference(
             return false;
         if (doc == null)
             return false;
-        if (doc.documentBody == null)
+        if (doc.documentBodyBytes.length == 0)
             return false;
 
-        if (!(DocumentBodyExtractor.asString(resultOk) instanceof DocumentBodyResult.Ok<String> bodyOk)) {
-            return false;
-        }
-
-        return CrawlDataReference.isContentBodySame(doc.documentBody, bodyOk.body());
+        return CrawlDataReference.isContentBodySame(doc.documentBodyBytes, resultOk.bytesRaw());
     }
 
     public ContentTags getContentTags() {
         if (null == doc)
             return ContentTags.empty();
 
-        if (doc.documentBody == null || doc.httpStatus != 200)
+        if (doc.documentBodyBytes.length == 0 || doc.httpStatus != 200)
             return ContentTags.empty();
 
         String lastmod = doc.getLastModified();
@@ -1,72 +0,0 @@
-package nu.marginalia.crawl.retreival.sitemap;
-
-import crawlercommons.robots.SimpleRobotRules;
-import nu.marginalia.crawl.fetcher.SitemapRetriever;
-import nu.marginalia.crawl.retreival.DomainCrawlFrontier;
-import nu.marginalia.model.EdgeUrl;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import java.util.HashSet;
-import java.util.List;
-import java.util.Optional;
-import java.util.Set;
-
-public class SitemapFetcher {
-
-    private final DomainCrawlFrontier crawlFrontier;
-    private final SitemapRetriever sitemapRetriever;
-    private static final Logger logger = LoggerFactory.getLogger(SitemapFetcher.class);
-
-    public SitemapFetcher(DomainCrawlFrontier crawlFrontier, SitemapRetriever sitemapRetriever) {
-        this.crawlFrontier = crawlFrontier;
-        this.sitemapRetriever = sitemapRetriever;
-    }
-
-    public void downloadSitemaps(SimpleRobotRules robotsRules, EdgeUrl rootUrl) {
-        List<String> urls = robotsRules.getSitemaps();
-
-        if (urls.isEmpty()) {
-            urls = List.of(rootUrl.withPathAndParam("/sitemap.xml", null).toString());
-        }
-
-        downloadSitemaps(urls);
-    }
-
-    public void downloadSitemaps(List<String> urls) {
-
-        Set<String> checkedSitemaps = new HashSet<>();
-
-        for (var rawUrl : urls) {
-            Optional<EdgeUrl> parsedUrl = EdgeUrl.parse(rawUrl);
-            if (parsedUrl.isEmpty()) {
-                continue;
-            }
-
-            EdgeUrl url = parsedUrl.get();
-
-            // Let's not download sitemaps from other domains for now
-            if (!crawlFrontier.isSameDomain(url)) {
-                continue;
-            }
-
-            if (checkedSitemaps.contains(url.path))
-                continue;
-
-            var sitemap = sitemapRetriever.fetchSitemap(url);
-            if (sitemap.isEmpty()) {
-                continue;
-            }
-
-            // ensure we don't try to download this sitemap again
-            // (don't move this up, as we may want to check the same
-            // path with different protocols until we find one that works)
-
-            checkedSitemaps.add(url.path);
-
-            crawlFrontier.addAllToQueue(sitemap);
-        }
-
-        logger.debug("Queue is now {}", crawlFrontier.queueSize());
-    }
-}
@@ -32,11 +32,11 @@ dependencies {
     implementation libs.bundles.parquet
 
     implementation libs.trove
+    implementation libs.slop
     implementation libs.jwarc
     implementation libs.gson
     implementation libs.commons.io
     implementation libs.commons.lang3
-    implementation libs.okhttp3
     implementation libs.jsoup
     implementation libs.snakeyaml
     implementation libs.zstd
@@ -1,45 +0,0 @@
-package nu.marginalia.io;
-
-import nu.marginalia.io.crawldata.format.ParquetSerializableCrawlDataStream;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-import java.io.FileNotFoundException;
-import java.io.IOException;
-import java.nio.file.Files;
-import java.nio.file.Path;
-
-public class CrawledDomainReader {
-    private static final Logger logger = LoggerFactory.getLogger(CrawledDomainReader.class);
-
-    /** An iterator-like access to domain data  This must be closed otherwise it will leak off-heap memory! */
-    public static SerializableCrawlDataStream createDataStream(Path fullPath) throws IOException
-    {
-        String fileName = fullPath.getFileName().toString();
-        if (fileName.endsWith(".parquet")) {
-            try {
-                return new ParquetSerializableCrawlDataStream(fullPath);
-            } catch (Exception ex) {
-                logger.error("Error reading domain data from " + fullPath, ex);
-                return SerializableCrawlDataStream.empty();
-            }
-        } else {
-            logger.error("Unknown file type: {}", fullPath);
-            return SerializableCrawlDataStream.empty();
-        }
-    }
-
-    /** An iterator-like access to domain data. This must be closed otherwise it will leak off-heap memory! */
-    public static SerializableCrawlDataStream createDataStream(Path basePath, String domain, String id) throws IOException {
-        Path parquetPath = CrawlerOutputFile.getParquetPath(basePath, id, domain);
-
-        if (Files.exists(parquetPath)) {
-            return createDataStream(parquetPath);
-        }
-        else {
-            throw new FileNotFoundException("No such file: " + parquetPath);
-        }
-    }
-
-}
@@ -35,7 +35,7 @@ public class CrawlerOutputFile {
         return destDir.resolve(id + "-" + filesystemSafeName(domain) + "-" + version.suffix + ".warc.gz");
     }
 
-    public static Path createParquetPath(Path basePath, String id, String domain) throws IOException {
+    public static Path createSlopPath(Path basePath, String id, String domain) throws IOException {
         id = padId(id);
 
         String first = id.substring(0, 2);
@@ -45,8 +45,9 @@ public class CrawlerOutputFile {
         if (!Files.exists(destDir)) {
             Files.createDirectories(destDir);
         }
-        return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".parquet");
+        return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".slop.zip");
     }
 
     public static Path getParquetPath(Path basePath, String id, String domain) {
         id = padId(id);
 
@@ -56,16 +57,18 @@ public class CrawlerOutputFile {
         Path destDir = basePath.resolve(first).resolve(second);
         return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".parquet");
     }
-    public static Path getWarcPath(Path basePath, String id, String domain, WarcFileVersion version) {
+
+    public static Path getSlopPath(Path basePath, String id, String domain) {
         id = padId(id);
 
         String first = id.substring(0, 2);
         String second = id.substring(2, 4);
 
         Path destDir = basePath.resolve(first).resolve(second);
-        return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".warc" + version.suffix);
+        return destDir.resolve(id + "-" + filesystemSafeName(domain) + ".slop.zip");
     }
 
 
     /**
     * Pads the given ID with leading zeros to ensure it has a length of 4 characters.
     */
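A worked example of the sharded naming scheme above, assuming filesystemSafeName() passes this domain through unchanged (the base path and values are invented for illustration):

    import java.nio.file.Path;

    class SlopPathSketch {
        public static void main(String[] args) {
            // "123" is zero-padded to "0123"; the directory shards are "01" and "23"
            Path p = CrawlerOutputFile.getSlopPath(Path.of("/crawl-data"), "123", "www.example.com");

            // Expected: /crawl-data/01/23/0123-www.example.com.slop.zip
            System.out.println(p);
        }
    }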
@@ -1,35 +1,120 @@
 package nu.marginalia.io;
 
+import nu.marginalia.io.crawldata.format.ParquetSerializableCrawlDataStream;
+import nu.marginalia.io.crawldata.format.SlopSerializableCrawlDataStream;
 import nu.marginalia.model.crawldata.CrawledDocument;
 import nu.marginalia.model.crawldata.CrawledDomain;
 import nu.marginalia.model.crawldata.SerializableCrawlData;
 import org.jetbrains.annotations.Nullable;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 import java.io.IOException;
 import java.nio.file.Path;
 import java.util.ArrayList;
 import java.util.Iterator;
 import java.util.List;
+import java.util.Optional;
+import java.util.function.Function;
 
 /** Closable iterator exceptional over serialized crawl data
  * The data may appear in any order, and the iterator must be closed.
  *
- * @see CrawledDomainReader
  * */
 public interface SerializableCrawlDataStream extends AutoCloseable {
+    Logger logger = LoggerFactory.getLogger(SerializableCrawlDataStream.class);
 
     SerializableCrawlData next() throws IOException;
 
     /** Return a size hint for the stream.  0 is returned if the hint is not available,
      * or if the file is seemed too small to bother */
-    default int sizeHint() { return 0; }
+    default int getSizeHint() { return 0; }
 
     boolean hasNext() throws IOException;
 
     @Nullable
     default Path path() { return null; }
 
+    void close() throws IOException;
+
+    /** An iterator-like access to domain data  This must be closed otherwise it will leak off-heap memory! */
+    static SerializableCrawlDataStream openDataStream(Path fullPath) throws IOException
+    {
+        String fileName = fullPath.getFileName().toString();
+        if (fileName.endsWith(".parquet")) {
+            try {
+                return new ParquetSerializableCrawlDataStream(fullPath);
+            } catch (Exception ex) {
+                logger.error("Error reading domain data from " + fullPath, ex);
+                return SerializableCrawlDataStream.empty();
+            }
+        }
+
+        if (fileName.endsWith(".slop.zip")) {
+            try {
+                return new SlopSerializableCrawlDataStream(fullPath);
+            } catch (Exception ex) {
+                logger.error("Error reading domain data from " + fullPath, ex);
+                return SerializableCrawlDataStream.empty();
+            }
+        }
+
+        logger.error("Unknown file type: {}", fullPath);
+        return SerializableCrawlDataStream.empty();
+    }
+
+    /** Get an idication of the size of the stream.  This is used to determine whether to
+     * load the stream into memory or not.  0 is returned if the hint is not available,
+     * or if the file is seemed too small to bother */
+    static int getSizeHint(Path fullPath) {
+        String fileName = fullPath.getFileName().toString();
+        if (fileName.endsWith(".parquet")) {
+            return ParquetSerializableCrawlDataStream.sizeHint(fullPath);
+        }
+        else if (fileName.endsWith(".slop.zip")) {
+            return SlopSerializableCrawlDataStream.sizeHint(fullPath);
+        }
+        else {
+            return 0;
+        }
+    }
+
+    default <T> Iterator<T> map(Function<SerializableCrawlData, Optional<T>> mapper) {
+        return new Iterator<>() {
+            T next = null;
+
+            public boolean hasNext() {
+                if (next != null)
+                    return true;
+                try {
+                    while (SerializableCrawlDataStream.this.hasNext()) {
+                        var val = mapper.apply(SerializableCrawlDataStream.this.next());
+                        if (val.isPresent()) {
+                            next = val.get();
+                            return true;
+                        }
+                    }
+                }
+                catch (IOException ex) {
+                    logger.error("Error during stream", ex);
+                }
+
+                return false;
+            }
+
+            public T next() {
+                if (next == null && !hasNext())
+                    throw new IllegalStateException("No more data to read");
+
+                T ret = next;
+                next = null;
+                return ret;
+            }
+        };
+    }
+
     /** For tests */
     default List<SerializableCrawlData> asList() throws IOException {
         List<SerializableCrawlData> data = new ArrayList<>();
@@ -81,7 +166,6 @@ public interface SerializableCrawlDataStream extends AutoCloseable {
             public boolean hasNext() { return iterator.hasNext(); }
             public void close() {}
         };
-
     }
 
 }
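A hedged sketch of how a consumer might combine the new openDataStream() factory with the map() helper; the file name is invented, and CrawlDataReference.iterator() above uses the same pattern:

    import nu.marginalia.io.SerializableCrawlDataStream;
    import nu.marginalia.model.crawldata.CrawledDocument;

    import java.io.IOException;
    import java.nio.file.Path;
    import java.util.Iterator;
    import java.util.Optional;

    class StreamConsumerSketch {
        static void readDocuments(Path crawlDataFile) throws IOException {
            // e.g. Path.of("0123-www.example.com.slop.zip")
            try (SerializableCrawlDataStream stream = SerializableCrawlDataStream.openDataStream(crawlDataFile)) {
                // map() filters out anything that isn't a CrawledDocument
                Iterator<CrawledDocument> docs = stream.map(next -> {
                    if (next instanceof CrawledDocument doc) {
                        return Optional.of(doc);
                    }
                    return Optional.empty();
                });

                while (docs.hasNext()) {
                    CrawledDocument doc = docs.next();
                    // ... process the document
                }
            }
        }
    }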
@@ -1,7 +1,6 @@
 package nu.marginalia.io.crawldata.format;
 
 import nu.marginalia.contenttype.ContentType;
-import nu.marginalia.contenttype.DocumentBodyToString;
 import nu.marginalia.hash.MurmurHash3_128;
 import nu.marginalia.io.SerializableCrawlDataStream;
 import nu.marginalia.model.EdgeUrl;
@@ -18,6 +17,7 @@ import java.nio.file.Path;
 import java.util.*;
 import java.util.stream.Stream;
 
+@Deprecated
 public class ParquetSerializableCrawlDataStream implements AutoCloseable, SerializableCrawlDataStream {
     private static final Logger logger = LoggerFactory.getLogger(ParquetSerializableCrawlDataStream.class);
 
@@ -40,7 +40,7 @@ public class ParquetSerializableCrawlDataStream implements AutoCloseable, Serial
         return path;
     }
 
-    public int sizeHint() {
+    public static int sizeHint(Path path) {
         // Only calculate size hint for large files
         // (the reason we calculate them in the first place is to assess whether it is large
         // because it has many documents, or because it is a small number of large documents)
@@ -124,9 +124,7 @@ public class ParquetSerializableCrawlDataStream implements AutoCloseable, Serial
         }
         else if (nextRecord.body != null) {
             try {
-                bodyString = DocumentBodyToString.getStringData(
-                        ContentType.parse(nextRecord.contentType),
-                        nextRecord.body);
+                ContentType.parse(nextRecord.contentType);
             } catch (Exception ex) {
                 logger.error("Failed to convert body to string", ex);
                 status = CrawlerDocumentStatus.BAD_CHARSET;
@@ -147,7 +145,7 @@ public class ParquetSerializableCrawlDataStream implements AutoCloseable, Serial
                 status.toString(),
                 "",
                 nextRecord.headers,
-                bodyString,
+                nextRecord.body,
                 // this field isn't actually used, maybe we can skip calculating it?
                 nextRecord.cookies,
                 lastModified,
@@ -0,0 +1,181 @@
+package nu.marginalia.io.crawldata.format;
+
+import nu.marginalia.contenttype.ContentType;
+import nu.marginalia.io.SerializableCrawlDataStream;
+import nu.marginalia.model.EdgeUrl;
+import nu.marginalia.model.crawldata.*;
+import nu.marginalia.slop.SlopCrawlDataRecord;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.time.Instant;
+import java.util.ArrayDeque;
+import java.util.ArrayList;
+import java.util.Deque;
+import java.util.NoSuchElementException;
+
+public class SlopSerializableCrawlDataStream implements AutoCloseable, SerializableCrawlDataStream {
+    private static final Logger logger = LoggerFactory.getLogger(SlopSerializableCrawlDataStream.class);
+
+    private final SlopCrawlDataRecord.FilteringReader reader;
+
+    // Holds the next value.  This is not a buffer, but to deal with the fact that
+    // we sometimes generate multiple SerializableCrawlData records for a single input
+    private final Deque<SerializableCrawlData> nextQ = new ArrayDeque<>();
+
+    private boolean wroteDomainRecord = false;
+    private final Path path;
+
+    public SlopSerializableCrawlDataStream(Path file) throws IOException {
+        path = file;
+        reader = new SlopCrawlDataRecord.FilteringReader(file) {
+            @Override
+            public boolean filter(String url, int status, String contentType) {
+                String ctLc = contentType.toLowerCase();
+
+                if (ctLc.startsWith("text/"))
+                    return true;
+                else if (ctLc.startsWith("x-marginalia/"))
+                    return true;
+
+                return false;
+            }
+        };
+    }
+
+    @Override
+    public Path path() {
+        return path;
+    }
+
+    public static int sizeHint(Path path) {
+        // Only calculate size hint for large files
+        // (the reason we calculate them in the first place is to assess whether it is large
+        // because it has many documents, or because it is a small number of large documents)
+        try {
+            if (Files.size(path) > 10_000_000) {
+                return SlopCrawlDataRecord.countGoodStatusCodes(path);
+            }
+        } catch (IOException e) {
+            // suppressed
+        }
+
+        return 0;
+    }
+
+    @Override
+    public boolean hasNext() {
+        try {
+            while (reader.hasRemaining() && nextQ.isEmpty()) {
+                try {
+                    var nextRecord = reader.get();
+                    if (!wroteDomainRecord) {
+                        createDomainRecord(nextRecord);
+                        wroteDomainRecord = true;
+                    }
+
+                    createDocumentRecord(nextRecord);
+                } catch (Exception ex) {
+                    logger.error("Failed to create document record", ex);
+                }
+            }
+            return !nextQ.isEmpty();
+        }
+        catch (IOException ex) {
+            return false;
+        }
+    }
+
+    private void createDomainRecord(SlopCrawlDataRecord parquetRecord) throws URISyntaxException {
+
+        CrawlerDomainStatus status = CrawlerDomainStatus.OK;
+        String statusReason = "";
+
+        String redirectDomain = null;
+
+        // The advisory content types are used to signal various states of the crawl
+        // that are not actual crawled documents.
+
+        switch (parquetRecord.contentType()) {
+            case "x-marginalia/advisory;state=redirect" -> {
+                EdgeUrl crawledUrl = new EdgeUrl(parquetRecord.url());
+                redirectDomain = crawledUrl.getDomain().toString();
+                status = CrawlerDomainStatus.REDIRECT;
+            }
+            case "x-marginalia/advisory;state=blocked" -> {
+                status = CrawlerDomainStatus.BLOCKED;
+            }
+            case "x-marginalia/advisory;state=error" -> {
+                status = CrawlerDomainStatus.ERROR;
+                statusReason = new String(parquetRecord.body());
+            }
+        }
+
+        nextQ.add(new CrawledDomain(
+                parquetRecord.domain(),
+                redirectDomain,
+                status.toString(),
+                statusReason,
+                parquetRecord.ip(),
+                new ArrayList<>(),
+                new ArrayList<>()
+        ));
+    }
+
+    private void createDocumentRecord(SlopCrawlDataRecord nextRecord) {
+        CrawlerDocumentStatus status = CrawlerDocumentStatus.OK;
+
+        if (nextRecord.contentType().startsWith("x-marginalia/advisory;state=content-type-failed-probe")) {
+            status = CrawlerDocumentStatus.BAD_CONTENT_TYPE;
+        }
+        else if (nextRecord.contentType().startsWith("x-marginalia/advisory;state=robots-txt-skipped")) {
+            status = CrawlerDocumentStatus.ROBOTS_TXT;
+        }
+        else if (nextRecord.contentType().startsWith("x-marginalia/advisory")) {
+            // we don't care about the other advisory content types here
+            return;
+        }
+        else if (nextRecord.body() != null) {
+            try {
+                ContentType.parse(nextRecord.contentType());
+            } catch (Exception ex) {
+                logger.error("Failed to convert body to string", ex);
+                status = CrawlerDocumentStatus.BAD_CHARSET;
+            }
+        }
+        else {
+            status = CrawlerDocumentStatus.ERROR;
+        }
+
+        nextQ.add(new CrawledDocument("",
+                nextRecord.url(),
+                nextRecord.contentType(),
+                Instant.ofEpochMilli(nextRecord.timestamp()).toString(),
+                nextRecord.httpStatus(),
+                status.toString(),
+                "",
+                nextRecord.headers(),
+                nextRecord.body(),
+                // this field isn't actually used, maybe we can skip calculating it?
+                nextRecord.cookies(),
+                null,
+                null));
+    }
+
+    public void close() throws IOException {
+        reader.close();
+    }
+
+    @Override
+    public SerializableCrawlData next() throws IOException {
+        if (!hasNext())
+            throw new NoSuchElementException();
+
+        return nextQ.poll();
+    }
+
+}
@@ -18,7 +18,7 @@ public class DocumentBodyExtractor {
         return asBytes(fetchOk);
     }
     else if (result instanceof HttpFetchResult.Result304ReplacedWithReference retained) {
-        return new DocumentBodyResult.Ok<>(retained.contentType(), retained.body().getBytes());
+        return new DocumentBodyResult.Ok<>(retained.contentType(), retained.body());
     }
 
     return new DocumentBodyResult.Error<>(CrawlerDocumentStatus.ERROR, "Fetch Result Not Ok");
@@ -1,18 +1,18 @@
|
|||||||
package nu.marginalia.model.body;
|
package nu.marginalia.model.body;
|
||||||
|
|
||||||
import nu.marginalia.contenttype.ContentType;
|
import nu.marginalia.contenttype.ContentType;
|
||||||
import okhttp3.Headers;
|
import org.jetbrains.annotations.Nullable;
|
||||||
import org.jsoup.Jsoup;
|
import org.jsoup.Jsoup;
|
||||||
import org.jsoup.nodes.Document;
|
import org.jsoup.nodes.Document;
|
||||||
import org.netpreserve.jwarc.MessageHeaders;
|
import org.netpreserve.jwarc.MessageHeaders;
|
||||||
import org.netpreserve.jwarc.WarcResponse;
|
import org.netpreserve.jwarc.WarcResponse;
|
||||||
|
|
||||||
import java.io.ByteArrayInputStream;
|
import java.io.ByteArrayInputStream;
|
||||||
import java.io.IOException;
|
|
||||||
import java.io.InputStream;
|
import java.io.InputStream;
|
||||||
import java.net.InetAddress;
|
import java.net.InetAddress;
|
||||||
import java.net.URI;
|
import java.net.URI;
|
||||||
import java.util.Optional;
|
import java.net.http.HttpHeaders;
|
||||||
|
import java.util.*;
|
||||||
|
|
||||||
/* FIXME: This interface has a very unfortunate name that is not very descriptive.
|
/* FIXME: This interface has a very unfortunate name that is not very descriptive.
|
||||||
*/
|
*/
|
||||||
@@ -56,42 +56,46 @@ public sealed interface HttpFetchResult {
  */
 record ResultOk(URI uri,
                 int statusCode,
-                Headers headers,
+                HttpHeaders headers,
                 String ipAddress,
-                byte[] bytesRaw,
+                byte[] bytesRaw, // raw data for the entire response including headers
                 int bytesStart,
                 int bytesLength
 ) implements HttpFetchResult {
 
+    public ResultOk(URI uri, int status, MessageHeaders headers, String ipAddress, byte[] bytes, int bytesStart, int length) {
+        this(uri, status, convertHeaders(headers), ipAddress, bytes, bytesStart, length);
+    }
+
+    private static HttpHeaders convertHeaders(MessageHeaders messageHeaders) {
+        Map<String, List<String>> inputMap = messageHeaders.map();
+        Map<String, List<String>> filteredMap = new HashMap<>(Math.max(4, inputMap.size()));
+
+        inputMap.forEach((k, v) -> {
+            if (k.isBlank()) return;
+            if (!Character.isAlphabetic(k.charAt(0))) return;
+
+            filteredMap.put(k, v);
+        });
+
+        return HttpHeaders.of(filteredMap, (k,v) -> true);
+    }
+
     public boolean isOk() {
         return statusCode >= 200 && statusCode < 300;
     }
 
-    public ResultOk(URI uri,
-                    int statusCode,
-                    MessageHeaders headers,
-                    String ipAddress,
-                    byte[] bytesRaw,
-                    int bytesStart,
-                    int bytesLength) {
-        this(uri, statusCode, convertHeaders(headers), ipAddress, bytesRaw, bytesStart, bytesLength);
-    }
-
-    private static Headers convertHeaders(MessageHeaders headers) {
-        var ret = new Headers.Builder();
-        for (var header : headers.map().entrySet()) {
-            for (var value : header.getValue()) {
-                ret.add(header.getKey(), value);
-            }
-        }
-        return ret.build();
-    }
-
     public InputStream getInputStream() {
         return new ByteArrayInputStream(bytesRaw, bytesStart, bytesLength);
     }
 
-    public Optional<Document> parseDocument() throws IOException {
+    /** Copy the byte range corresponding to the payload of the response,
+     Warning: Copies the data, use getInputStream() for zero copy access */
+    public byte[] getBodyBytes() {
+        return Arrays.copyOfRange(bytesRaw, bytesStart, bytesStart + bytesLength);
+    }
+
+    public Optional<Document> parseDocument() {
         return DocumentBodyExtractor.asString(this).flatMapOpt((contentType, body) -> {
             if (contentType.is("text/html")) {
                 return Optional.of(Jsoup.parse(body));
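
The new delegating constructor above bridges jwarc's MessageHeaders to the JDK's immutable java.net.http.HttpHeaders, dropping blank and non-alphabetic keys before calling HttpHeaders.of. A minimal standalone sketch of the same pattern follows; the ":status" pseudo-header is a hypothetical example of the kind of key the filter presumably guards against:

```java
import java.net.http.HttpHeaders;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeaderConversionDemo {
    public static void main(String[] args) {
        // Raw header map as a WARC parser might surface it; ":status" is a
        // hypothetical pseudo-header of the sort the filter below rejects.
        Map<String, List<String>> raw = Map.of(
                ":status", List.of("200"),
                "Content-Type", List.of("text/html; charset=UTF-8"),
                "ETag", List.of("\"abc123\"")
        );

        Map<String, List<String>> filtered = new HashMap<>();
        raw.forEach((k, v) -> {
            if (k.isBlank()) return;                          // drop empty keys
            if (!Character.isAlphabetic(k.charAt(0))) return; // drop pseudo-headers
            filtered.put(k, v);
        });

        // The map is pre-filtered, so the BiPredicate can accept everything,
        // mirroring the patch's (k,v) -> true
        HttpHeaders headers = HttpHeaders.of(filtered, (k, v) -> true);

        System.out.println(headers.firstValue("Content-Type").orElse("?"));
        System.out.println(headers.allValues("ETag"));
    }
}
```
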
@@ -102,8 +106,9 @@ public sealed interface HttpFetchResult {
             });
     }
 
+    @Nullable
     public String header(String name) {
-        return headers.get(name);
+        return headers.firstValue(name).orElse(null);
     }
 
 }
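
With header(String) now null-returning via firstValue(name).orElse(null), a caller can pair it with the new getBodyBytes() accessor. A hedged sketch (assumes the project classes are on the classpath; real body decoding goes through DocumentBodyExtractor rather than a hard-coded UTF-8):

```java
import java.nio.charset.StandardCharsets;

class BodyDecodeSketch {
    // Sketch only: the charset handling is deliberately simplified.
    static String bodyAsString(HttpFetchResult.ResultOk result) {
        String contentType = result.header("Content-Type"); // now @Nullable
        if (contentType == null || !contentType.startsWith("text/")) {
            throw new IllegalArgumentException("not a text payload: " + contentType);
        }
        // getBodyBytes() copies just the payload range out of bytesRaw,
        // skipping the response headers that precede it in the buffer
        return new String(result.getBodyBytes(), StandardCharsets.UTF_8);
    }
}
```
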
@@ -114,20 +119,10 @@ public sealed interface HttpFetchResult {
  *
  * @see Result304Raw for the case where the document has not yet been replaced with the reference data.
  */
-record Result304ReplacedWithReference(String url, ContentType contentType, String body) implements HttpFetchResult {
+record Result304ReplacedWithReference(String url, ContentType contentType, byte[] body) implements HttpFetchResult {
 
     public boolean isOk() {
         return true;
     }
 
-    public Optional<Document> parseDocument() {
-        try {
-            return Optional.of(Jsoup.parse(body));
-        }
-        catch (Exception ex) {
-            return Optional.empty();
-        }
-    }
 }
 
 /** Fetching resulted in an exception */
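
Result304ReplacedWithReference now carries the reference body as raw bytes, and the record no longer parses itself. One way a caller might reconstruct a Document from those bytes; letting Jsoup sniff the charset is an assumption here, not necessarily what the calling code does:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;

class Reference304Sketch {
    // Passing null as the charset name makes Jsoup detect the encoding
    // from the BOM or <meta charset>; the URL serves as the base URI.
    static Document parse(byte[] body, String url) throws IOException {
        return Jsoup.parse(new ByteArrayInputStream(body), null, url);
    }
}
```
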
@@ -1,8 +1,16 @@
 package nu.marginalia.model.crawldata;
 
+import nu.marginalia.contenttype.ContentType;
+import nu.marginalia.contenttype.DocumentBodyToString;
 import nu.marginalia.model.EdgeUrl;
 import org.apache.commons.lang3.StringUtils;
 import org.jetbrains.annotations.Nullable;
+import org.jsoup.nodes.Document;
+
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.Arrays;
+import java.util.Objects;
 
 public final class CrawledDocument implements SerializableCrawlData {
     public String crawlId;
@@ -19,8 +27,52 @@ public final class CrawledDocument implements SerializableCrawlData {
     @Nullable
     public String headers;
 
-    public String documentBody;
+    public String documentBody() {
+        return DocumentBodyToString.getStringData(
+                ContentType.parse(contentType),
+                documentBodyBytes);
+    }
+
+    /** Attempt to parse the first sampleSize bytes of the document body into a string */
+    public String documentBody(int sampleSize) {
+        if (sampleSize >= documentBodyBytes.length) {
+            return documentBody();
+        }
+
+        // Truncating the string at an unlucky point *may* lead to a parsing error
+        // ... so we try again with a longer length
+        for (int i = 0; i <= 3 && sampleSize + i < documentBodyBytes.length; i++) {
+            try {
+                byte[] bytes = new byte[sampleSize + i];
+                System.arraycopy(documentBodyBytes, 0, bytes, 0, bytes.length);
+
+                return DocumentBodyToString.getStringData(
+                        ContentType.parse(contentType),
+                        bytes);
+            }
+            catch (RuntimeException ex) {
+                // Try again with i + 1
+            }
+        }
+
+        throw new IllegalArgumentException("Failed to parse substring");
+    }
+
+    public Document parseBody() throws IOException {
+        // Prevent stalls from parsing excessively large documents
+
+        return DocumentBodyToString.getParsedData(
+                ContentType.parse(contentType),
+                documentBodyBytes,
+                200_000,
+                url);
+    }
+
+    public boolean hasBody() {
+        return documentBodyBytes.length > 0;
+    }
+
+    public byte[] documentBodyBytes;
     /**
      * This is not guaranteed to be set in all versions of the format,
      * information may come in CrawledDomain instead
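
The retry loop in documentBody(int) compensates for a truncation point landing inside a multi-byte UTF-8 sequence: a code point is at most four bytes, so extending the sample by up to three bytes can complete it. The demo below shows the failure and the recovery using a strict CharsetDecoder; whether DocumentBodyToString actually rejects malformed input this way is an assumption:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TruncationDemo {
    public static void main(String[] args) throws CharacterCodingException {
        byte[] full = "smörgåsbord".getBytes(StandardCharsets.UTF_8);

        // Cut inside the two-byte sequence for 'ö' (0xC3 0xB6)
        byte[] truncated = Arrays.copyOf(full, 3);

        var decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(truncated));
        } catch (CharacterCodingException ex) {
            System.out.println("strict decode failed: " + ex);
        }

        // One extra byte completes the code point and decoding succeeds;
        // this is why documentBody(int) retries with sampleSize + 1..3
        byte[] padded = Arrays.copyOf(full, 4);
        System.out.println(decoder.reset().decode(ByteBuffer.wrap(padded)));
    }
}
```
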
@@ -30,7 +82,7 @@ public final class CrawledDocument implements SerializableCrawlData {
     public String lastModifiedMaybe;
     public String etagMaybe;
 
-    public CrawledDocument(String crawlId, String url, String contentType, String timestamp, int httpStatus, String crawlerStatus, String crawlerStatusDesc, @Nullable String headers, String documentBody, Boolean hasCookies, String lastModifiedMaybe, String etagMaybe) {
+    public CrawledDocument(String crawlId, String url, String contentType, String timestamp, int httpStatus, String crawlerStatus, String crawlerStatusDesc, @Nullable String headers, byte[] documentBodyBytes, Boolean hasCookies, String lastModifiedMaybe, String etagMaybe) {
         this.crawlId = crawlId;
         this.url = url;
         this.contentType = contentType;
@@ -39,7 +91,7 @@ public final class CrawledDocument implements SerializableCrawlData {
         this.crawlerStatus = crawlerStatus;
         this.crawlerStatusDesc = crawlerStatusDesc;
         this.headers = headers;
-        this.documentBody = documentBody;
+        this.documentBodyBytes = Objects.requireNonNullElse(documentBodyBytes, new byte[] {});
         this.hasCookies = hasCookies;
         this.lastModifiedMaybe = lastModifiedMaybe;
         this.etagMaybe = etagMaybe;
@@ -106,7 +158,7 @@ public final class CrawledDocument implements SerializableCrawlData {
     }
 
     public String toString() {
-        return "CrawledDocument(crawlId=" + this.crawlId + ", url=" + this.url + ", contentType=" + this.contentType + ", timestamp=" + this.timestamp + ", httpStatus=" + this.httpStatus + ", crawlerStatus=" + this.crawlerStatus + ", crawlerStatusDesc=" + this.crawlerStatusDesc + ", headers=" + this.headers + ", documentBody=" + this.documentBody + ", hasCookies=" + this.hasCookies + ", lastModifiedMaybe=" + this.lastModifiedMaybe + ", etagMaybe=" + this.etagMaybe + ")";
+        return "CrawledDocument(crawlId=" + this.crawlId + ", url=" + this.url + ", contentType=" + this.contentType + ", timestamp=" + this.timestamp + ", httpStatus=" + this.httpStatus + ", crawlerStatus=" + this.crawlerStatus + ", crawlerStatusDesc=" + this.crawlerStatusDesc + ", headers=" + this.headers + ", documentBody=" + documentBody() + ", hasCookies=" + this.hasCookies + ", lastModifiedMaybe=" + this.lastModifiedMaybe + ", etagMaybe=" + this.etagMaybe + ")";
     }
 
     public static class CrawledDocumentBuilder {
@@ -118,7 +170,7 @@ public final class CrawledDocument implements SerializableCrawlData {
         private String crawlerStatus;
         private String crawlerStatusDesc;
         private @Nullable String headers;
-        private String documentBody;
+        private byte[] documentBodyBytes = new byte[0];
         private String recrawlState;
         private Boolean hasCookies;
         private String lastModifiedMaybe;
@@ -168,10 +220,13 @@ public final class CrawledDocument implements SerializableCrawlData {
         }
 
         public CrawledDocumentBuilder documentBody(String documentBody) {
-            this.documentBody = documentBody;
+            this.documentBodyBytes = documentBody.getBytes(StandardCharsets.UTF_8);
+            return this;
+        }
+        public CrawledDocumentBuilder documentBodyBytes(byte[] documentBodyBytes) {
+            this.documentBodyBytes = documentBodyBytes;
             return this;
         }
 
         @Deprecated
         public CrawledDocumentBuilder recrawlState(String recrawlState) {
             this.recrawlState = recrawlState;
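
A hypothetical builder usage showing the two body setters side by side; the builder() factory method and the setter names other than documentBody/documentBodyBytes are assumptions, since only the builder class body appears in this diff:

```java
// Hypothetical usage; illustrative values throughout.
CrawledDocument doc = CrawledDocument.builder()
        .crawlId("crawl-2025-01")
        .url("https://www.example.com/")
        .contentType("text/html; charset=UTF-8")
        // Either set a String (encoded to UTF-8 under the hood)...
        .documentBody("<html><body>Hello</body></html>")
        // ...or hand over raw bytes directly, e.g. straight from the fetcher:
        //.documentBodyBytes(fetchOk.getBodyBytes())
        .build();

String html = doc.documentBody();  // decoded on demand via contentType
boolean nonEmpty = doc.hasBody();  // documentBodyBytes.length > 0
```
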
@@ -194,11 +249,11 @@ public final class CrawledDocument implements SerializableCrawlData {
         }
 
         public CrawledDocument build() {
-            return new CrawledDocument(this.crawlId, this.url, this.contentType, this.timestamp, this.httpStatus, this.crawlerStatus, this.crawlerStatusDesc, this.headers, this.documentBody, this.hasCookies, this.lastModifiedMaybe, this.etagMaybe);
+            return new CrawledDocument(this.crawlId, this.url, this.contentType, this.timestamp, this.httpStatus, this.crawlerStatus, this.crawlerStatusDesc, this.headers, this.documentBodyBytes, this.hasCookies, this.lastModifiedMaybe, this.etagMaybe);
         }
 
         public String toString() {
-            return "CrawledDocument.CrawledDocumentBuilder(crawlId=" + this.crawlId + ", url=" + this.url + ", contentType=" + this.contentType + ", timestamp=" + this.timestamp + ", httpStatus=" + this.httpStatus + ", crawlerStatus=" + this.crawlerStatus + ", crawlerStatusDesc=" + this.crawlerStatusDesc + ", headers=" + this.headers + ", documentBody=" + this.documentBody + ", recrawlState=" + this.recrawlState + ", hasCookies=" + this.hasCookies + ", lastModifiedMaybe=" + this.lastModifiedMaybe + ", etagMaybe=" + this.etagMaybe + ")";
+            return "CrawledDocument.CrawledDocumentBuilder(crawlId=" + this.crawlId + ", url=" + this.url + ", contentType=" + this.contentType + ", timestamp=" + this.timestamp + ", httpStatus=" + this.httpStatus + ", crawlerStatus=" + this.crawlerStatus + ", crawlerStatusDesc=" + this.crawlerStatusDesc + ", headers=" + this.headers + ", documentBodyBytes=" + Arrays.toString(this.documentBodyBytes) + ", recrawlState=" + this.recrawlState + ", hasCookies=" + this.hasCookies + ", lastModifiedMaybe=" + this.lastModifiedMaybe + ", etagMaybe=" + this.etagMaybe + ")";
         }
     }
 }
@@ -165,27 +165,28 @@ public class CrawledDocumentParquetRecordFileWriter implements AutoCloseable {
             contentType = "";
         }
 
-        String headersStr = null;
         StringJoiner headersStrBuilder = new StringJoiner("\n");
-        for (var header : headers) {
-            headersStrBuilder.add(header.getFirst() + ": " + header.getSecond());
+        for (var header : headers.map().entrySet()) {
+            for (var value : header.getValue()) {
+                headersStrBuilder.add(header.getKey() + ": " + value);
+            }
         }
-        headersStr = headersStrBuilder.toString();
+        String headersStr = headersStrBuilder.toString();
 
 
         write(new CrawledDocumentParquetRecord(
                 domain,
                 response.target(),
                 fetchOk.ipAddress(),
-                WarcXCookieInformationHeader.hasCookies(response),
+                headers.firstValue("X-Has-Cookies").orElse("0").equals("1"),
                 fetchOk.statusCode(),
                 response.date(),
                 contentType,
                 bodyBytes,
                 headersStr,
-                headers.get("ETag"),
-                headers.get("Last-Modified"))
-        );
+                headers.firstValue("ETag").orElse(null),
+                headers.firstValue("Last-Modified").orElse(null)
+        ));
     }
 
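
The writer now flattens java.net.http.HttpHeaders into one "Name: value" line per value, and reads the cookie flag from a synthetic X-Has-Cookies header rather than via WarcXCookieInformationHeader. A standalone sketch of that flattening:

```java
import java.net.http.HttpHeaders;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class HeaderFlattenDemo {
    public static void main(String[] args) {
        HttpHeaders headers = HttpHeaders.of(Map.of(
                "Content-Type", List.of("text/html"),
                "Set-Cookie", List.of("a=1", "b=2"),  // multi-valued header
                "X-Has-Cookies", List.of("1")
        ), (k, v) -> true);

        // Same flattening as the writer: one "Name: value" line per value,
        // so multi-valued headers produce multiple lines
        StringJoiner joined = new StringJoiner("\n");
        for (var e : headers.map().entrySet()) {
            for (var value : e.getValue()) {
                joined.add(e.getKey() + ": " + value);
            }
        }
        System.out.println(joined);

        boolean hasCookies = headers.firstValue("X-Has-Cookies").orElse("0").equals("1");
        System.out.println("hasCookies = " + hasCookies);
    }
}
```
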
Some files were not shown because too many files have changed in this diff.