mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-10-06 07:32:38 +02:00

Compare commits

...

12 Commits

Author SHA1 Message Date
Viktor Lofgren
567e4e1237 (crawler) Fast detection and bail-out for crawler traps
Improve logging and exclude robots.txt from this logic.
2025-01-18 15:28:54 +01:00
Viktor Lofgren
4342e42722 (crawler) Fast detection and bail-out for crawler traps
Nepenthes has been making the rounds on social media, so this adds an easy detection and mitigation mechanism for this type of trap, since sadly not all webmasters set up their robots.txt correctly.  Out-of-the-box crawl limits will also deal with this type of attack, but this fix is faster.
2025-01-17 13:02:57 +01:00
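The heuristic this commit describes can be sketched as a standalone predicate: a response that took a long time to arrive yet contains very little data is likely a tarpit, unless it is robots.txt. The class and thresholds below mirror the diff further down but are a simplified, hypothetical stand-in for the real `WarcRecorder` logic.

```java
import java.time.Duration;
import java.time.Instant;

// Hedged sketch of the crawler-trap bail-out heuristic: slow, small,
// non-robots.txt responses are treated as likely traps.
class TrapHeuristic {
    static boolean looksLikeTrap(Instant fetchStart, int bodyBytes, String path) {
        boolean tooSlow = Duration.between(fetchStart, Instant.now())
                .compareTo(Duration.ofSeconds(9)) > 0;   // took longer than 9s
        boolean tooSmall = bodyBytes < 2048;             // under 2 KiB of payload
        boolean isRobots = path.endsWith("robots.txt");  // never bail on robots.txt
        return tooSlow && tooSmall && !isRobots;
    }
}
```

As in the real change, the caller would still write the WARC records and only skip link extraction for the flagged document.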
Viktor Lofgren
bc818056e6 (run) Fix templates for mariadb
Apparently the docker image contract changed at some point: we should now spawn mariadbd rather than mysqld, and use mariadb-admin rather than mysqladmin.
2025-01-16 15:27:02 +01:00
Viktor Lofgren
de2feac238 (chore) Upgrade jib from 3.4.3 to 3.4.4 2025-01-16 15:10:45 +01:00
Viktor Lofgren
1e770205a5 (search) Dyslexia fix 2025-01-12 20:40:14 +01:00
Viktor
e44ecd6d69 Merge pull request #149 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2025-01-12 20:38:36 +01:00
Viktor
5b93a0e633 Update ROADMAP.md 2025-01-12 20:38:11 +01:00
Viktor
08fb0e5efe Update ROADMAP.md 2025-01-12 20:37:43 +01:00
Viktor
bcf67782ea Update ROADMAP.md 2025-01-12 20:37:09 +01:00
Viktor Lofgren
ef3f175ede (search) Don't clobber the search query URL with default values 2025-01-10 15:57:30 +01:00
Viktor Lofgren
bbe4b5d9fd Revert experimental changes 2025-01-10 15:52:02 +01:00
Viktor Lofgren
c67a635103 (search, experimental) Add a few debugging tracks to the search UI 2025-01-10 15:44:44 +01:00
15 changed files with 105 additions and 183 deletions

View File

@@ -1,4 +1,4 @@
# Roadmap 2024-2025
# Roadmap 2025
This is a roadmap with major features planned for Marginalia Search.
@@ -30,12 +30,6 @@ Retaining the ability to independently crawl the web is still strongly desirable
The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
combined with a naive Bayesian filter would go a long way, or something more sophisticated...?
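The roadmap item above can be illustrated with a toy combination of the two ideas: a host blacklist that wins outright, backed by a naive Bayes log-likelihood score over page tokens. Everything here, names and training data alike, is invented for the sketch; it is not part of the codebase.

```java
import java.util.*;

// Hypothetical sketch: blacklist check first, then a naive Bayes
// log-likelihood ratio with add-one smoothing over page tokens.
class SpicyFilter {
    private final Set<String> blacklist;
    private final Map<String, Integer> spicyCounts = new HashMap<>();
    private final Map<String, Integer> cleanCounts = new HashMap<>();
    private int spicyTotal = 0, cleanTotal = 0;

    SpicyFilter(Set<String> blacklist) { this.blacklist = blacklist; }

    void train(String[] tokens, boolean spicy) {
        for (String t : tokens)
            (spicy ? spicyCounts : cleanCounts).merge(t, 1, Integer::sum);
        if (spicy) spicyTotal += tokens.length; else cleanTotal += tokens.length;
    }

    boolean shouldFilter(String host, String[] tokens) {
        if (blacklist.contains(host)) return true; // blacklist wins outright
        double score = 0;
        for (String t : tokens) {
            double pSpicy = (spicyCounts.getOrDefault(t, 0) + 1.0) / (spicyTotal + 2.0);
            double pClean = (cleanCounts.getOrDefault(t, 0) + 1.0) / (cleanTotal + 2.0);
            score += Math.log(pSpicy / pClean); // positive leans spicy
        }
        return score > 0;
    }
}
```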
## Web Design Overhaul
The design is kinda clunky and hard to maintain, and needlessly outdated-looking.
In progress: PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127) -- demo available at https://test.marginalia.nu/
## Additional Language Support
It would be desirable if the search engine supported more languages than English. This is partially about
@@ -62,8 +56,31 @@ filter for any API consumer.
I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
## Show favicons next to search results
This is expected from search engines. Basic proof of concept sketch of fetching this data has been done, but the feature is some way from being reality.
## Specialized crawler for github
One of the search engine's biggest limitations right now is that it does not index github at all. A specialized crawler that fetches at least the readme.md would go a long way toward providing search capabilities in this domain.
# Completed
## Web Design Overhaul (COMPLETED 2025-01)
The design is kinda clunky and hard to maintain, and needlessly outdated-looking.
PR [#127](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/127)
## Finalize RSS support (COMPLETED 2024-11)
Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.
Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
## Proper Position Index (COMPLETED 2024-09)
The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
@@ -76,11 +93,3 @@ list, as is the civilized way of doing this.
Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
## Finalize RSS support (COMPLETED 2024-11)
Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.
Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)

View File

@@ -47,7 +47,7 @@ ext {
dockerImageBase='container-registry.oracle.com/graalvm/jdk:23'
dockerImageTag='latest'
dockerImageRegistry='marginalia'
jibVersion = '3.4.3'
jibVersion = '3.4.4'
}

View File

@@ -22,6 +22,7 @@ import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.time.Instant;
import java.util.*;
@@ -89,6 +90,7 @@ public class WarcRecorder implements AutoCloseable {
var call = client.newCall(request);
cookieInformation.update(client, request.url());
try (var response = call.execute();
@@ -167,6 +169,25 @@ public class WarcRecorder implements AutoCloseable {
warcRequest.http(); // force HTTP header to be parsed before body is consumed so that caller can use it
writer.write(warcRequest);
if (Duration.between(date, Instant.now()).compareTo(Duration.ofSeconds(9)) > 0
&& inputBuffer.size() < 2048
&& !request.url().encodedPath().endsWith("robots.txt")) // don't bail on robots.txt
{
// Fast detection and mitigation of crawler traps that respond with slow
// small responses, with a high branching factor
// Note we bail *after* writing the warc records, this will effectively only
// prevent link extraction from the document.
logger.warn("URL {} took too long to fetch ({}s) and was too small for the effort ({}b)",
requestUri,
Duration.between(date, Instant.now()).getSeconds(),
inputBuffer.size()
);
return new HttpFetchResult.ResultException(new IOException("Likely crawler trap"));
}
return new HttpFetchResult.ResultOk(responseUri,
response.code(),
inputBuffer.headers(),

View File

@@ -294,99 +294,4 @@ public class SearchOperator {
}
}
public DecoratedSearchResults doSearchFastTrack1(SearchParameters userParams) {
var queryParams = paramFactory.forRegularSearch(userParams);
QueryResponse queryResponse = queryClient.search(queryParams);
var queryResults = getResultsFromQuery(queryResponse).results;
// Cluster the results based on the query response
List<ClusteredUrlDetails> clusteredResults = SearchResultClusterer
.selectStrategy(queryResponse)
.clusterResults(queryResults, 25);
String focusDomain = queryResponse.domain();
int focusDomainId = (focusDomain == null || focusDomain.isBlank())
? -1
: domainQueries.tryGetDomainId(new EdgeDomain(focusDomain)).orElse(0);
List<ResultsPage> resultPages = IntStream.rangeClosed(1, queryResponse.totalPages())
.mapToObj(number -> new ResultsPage(
number,
number == userParams.page(),
userParams.withPage(number).renderUrl()
))
.toList();
// Return the results to the user
return DecoratedSearchResults.builder()
.params(userParams)
.results(clusteredResults)
.filters(new SearchFilters(userParams))
.focusDomain(focusDomain)
.focusDomainId(focusDomainId)
.resultPages(resultPages)
.build();
}
public DecoratedSearchResults doSearchFastTrack2(SearchParameters userParams) {
var queryParams = paramFactory.forRegularSearch(userParams);
QueryResponse queryResponse = queryClient.search(queryParams);
var queryResults = getResultsFromQuery(queryResponse).results;
// Cluster the results based on the query response
List<ClusteredUrlDetails> clusteredResults = SearchResultClusterer
.noOpClustering(queryResults, queryResults.size());
String focusDomain = queryResponse.domain();
int focusDomainId = (focusDomain == null || focusDomain.isBlank())
? -1
: domainQueries.tryGetDomainId(new EdgeDomain(focusDomain)).orElse(0);
List<ResultsPage> resultPages = IntStream.rangeClosed(1, queryResponse.totalPages())
.mapToObj(number -> new ResultsPage(
number,
number == userParams.page(),
userParams.withPage(number).renderUrl()
))
.toList();
// Return the results to the user
return DecoratedSearchResults.builder()
.params(userParams)
.results(clusteredResults)
.filters(new SearchFilters(userParams))
.focusDomain(focusDomain)
.focusDomainId(focusDomainId)
.resultPages(resultPages)
.build();
}
public DecoratedSearchResults doSearchFastTrack3(SearchParameters userParams) {
var queryParams = paramFactory.forRegularSearch(userParams);
QueryResponse queryResponse = queryClient.search(queryParams);
var queryResults = getResultsFromQuery(queryResponse).results;
// Cluster the results based on the query response
List<ClusteredUrlDetails> clusteredResults = SearchResultClusterer
.noOpClustering(queryResults, queryResults.size());
List<ResultsPage> resultPages = IntStream.rangeClosed(1, queryResponse.totalPages())
.mapToObj(number -> new ResultsPage(
number,
number == userParams.page(),
userParams.withPage(number).renderUrl()
))
.toList();
// Return the results to the user
return DecoratedSearchResults.builder()
.params(userParams)
.results(clusteredResults)
.filters(new SearchFilters(userParams))
.focusDomain(null)
.focusDomainId(-1)
.resultPages(resultPages)
.build();
}
}

View File

@@ -17,13 +17,13 @@ public class SearchResultClusterer {
public static SearchResultClusterStrategy selectStrategy(QueryResponse response) {
if (response.domain() != null && !response.domain().isBlank())
return SearchResultClusterer::noOpClustering;
return SearchResultClusterer::noOp;
return SearchResultClusterer::byDomain;
}
/** No clustering, just return the results as is */
public static List<ClusteredUrlDetails> noOpClustering(List<UrlDetails> results, int total) {
private static List<ClusteredUrlDetails> noOp(List<UrlDetails> results, int total) {
if (results.isEmpty())
return List.of();

View File

@@ -21,8 +21,7 @@ public record SearchParameters(WebsiteUrl url,
SearchTitleParameter searchTitle,
SearchAdtechParameter adtech,
boolean newFilter,
int page,
int debug
int page
) {
public static SearchParameters defaultsForQuery(WebsiteUrl url, String query, int page) {
@@ -35,8 +34,7 @@ public record SearchParameters(WebsiteUrl url,
SearchTitleParameter.DEFAULT,
SearchAdtechParameter.DEFAULT,
false,
page,
0);
page);
}
public String profileStr() {
@@ -44,30 +42,30 @@ public record SearchParameters(WebsiteUrl url,
}
public SearchParameters withProfile(SearchProfile profile) {
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, true, page, debug);
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, true, page);
}
public SearchParameters withJs(SearchJsParameter js) {
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, true, page, debug);
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, true, page);
}
public SearchParameters withAdtech(SearchAdtechParameter adtech) {
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, true, page, debug);
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, true, page);
}
public SearchParameters withRecent(SearchRecentParameter recent) {
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, true, page, debug);
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, true, page);
}
public SearchParameters withTitle(SearchTitleParameter title) {
return new SearchParameters(url, query, profile, js, recent, title, adtech, true, page, debug);
return new SearchParameters(url, query, profile, js, recent, title, adtech, true, page);
}
public SearchParameters withPage(int page) {
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, false, page, debug);
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, false, page);
}
public SearchParameters withQuery(String query) {
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, false, page, debug);
return new SearchParameters(url, query, profile, js, recent, searchTitle, adtech, false, page);
}
public String renderUrlWithoutSiteFocus() {
@@ -86,18 +84,33 @@ public record SearchParameters(WebsiteUrl url,
}
public String renderUrl() {
String path = String.format("/search?query=%s&profile=%s&js=%s&adtech=%s&recent=%s&searchTitle=%s&newfilter=%s&page=%d",
URLEncoder.encode(query, StandardCharsets.UTF_8),
URLEncoder.encode(profile.filterId, StandardCharsets.UTF_8),
URLEncoder.encode(js.value, StandardCharsets.UTF_8),
URLEncoder.encode(adtech.value, StandardCharsets.UTF_8),
URLEncoder.encode(recent.value, StandardCharsets.UTF_8),
URLEncoder.encode(searchTitle.value, StandardCharsets.UTF_8),
Boolean.valueOf(newFilter).toString(),
page
);
return path;
StringBuilder pathBuilder = new StringBuilder("/search?");
pathBuilder.append("query=").append(URLEncoder.encode(query, StandardCharsets.UTF_8));
if (profile != SearchProfile.NO_FILTER) {
pathBuilder.append("&profile=").append(URLEncoder.encode(profile.filterId, StandardCharsets.UTF_8));
}
if (js != SearchJsParameter.DEFAULT) {
pathBuilder.append("&js=").append(URLEncoder.encode(js.value, StandardCharsets.UTF_8));
}
if (adtech != SearchAdtechParameter.DEFAULT) {
pathBuilder.append("&adtech=").append(URLEncoder.encode(adtech.value, StandardCharsets.UTF_8));
}
if (recent != SearchRecentParameter.DEFAULT) {
pathBuilder.append("&recent=").append(URLEncoder.encode(recent.value, StandardCharsets.UTF_8));
}
if (searchTitle != SearchTitleParameter.DEFAULT) {
pathBuilder.append("&searchTitle=").append(URLEncoder.encode(searchTitle.value, StandardCharsets.UTF_8));
}
if (page != 1) {
pathBuilder.append("&page=").append(page);
}
if (newFilter) {
pathBuilder.append("&newfilter=").append(Boolean.valueOf(newFilter).toString());
}
return pathBuilder.toString();
}
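The effect of the rewritten `renderUrl` above is that parameters at their default values no longer appear in the query string. A reduced, hypothetical version of the same pattern (field names simplified from the real record) shows the behavior:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of "don't clobber the URL with defaults": only append
// parameters that differ from their default values.
class UrlRenderer {
    static String render(String query, int page, boolean newFilter) {
        StringBuilder sb = new StringBuilder("/search?");
        sb.append("query=").append(URLEncoder.encode(query, StandardCharsets.UTF_8));
        if (page != 1) sb.append("&page=").append(page);          // default page is 1
        if (newFilter) sb.append("&newfilter=true");              // default is false
        return sb.toString();
    }
}
```

A query with all defaults renders as just `/search?query=…`, which keeps shared links short and stable.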
public RpcTemporalBias.Bias temporalBias() {

View File

@@ -17,38 +17,15 @@ public class SearchCommand implements SearchCommandInterface {
@Inject
public SearchCommand(SearchOperator searchOperator) {
public SearchCommand(SearchOperator searchOperator){
this.searchOperator = searchOperator;
}
@Override
public Optional<ModelAndView<?>> process(SearchParameters parameters) throws InterruptedException {
if (parameters.debug() == 0) {
DecoratedSearchResults results = searchOperator.doSearch(parameters);
return Optional.of(new MapModelAndView("serp/main.jte",
Map.of("results", results, "navbar", NavbarModel.SEARCH)
));
}
else if (parameters.debug() == 1) {
DecoratedSearchResults results = searchOperator.doSearchFastTrack1(parameters);
return Optional.of(new MapModelAndView("serp/main.jte",
Map.of("parameters", results, "navbar", NavbarModel.SEARCH)
));
}
else if (parameters.debug() == 2) {
DecoratedSearchResults results = searchOperator.doSearchFastTrack2(parameters);
return Optional.of(new MapModelAndView("serp/main.jte",
Map.of("parameters", results, "navbar", NavbarModel.SEARCH)
));
}
else if (parameters.debug() == 3) {
DecoratedSearchResults results = searchOperator.doSearchFastTrack3(parameters);
return Optional.of(new MapModelAndView("serp/main.jte",
Map.of("parameters", results, "navbar", NavbarModel.SEARCH)
));
}
else {
return Optional.empty();
}
DecoratedSearchResults results = searchOperator.doSearch(parameters);
return Optional.of(new MapModelAndView("serp/main.jte",
Map.of("results", results, "navbar", NavbarModel.SEARCH)
));
}
}

View File

@@ -60,8 +60,7 @@ public class SearchFilters {
SearchTitleParameter.DEFAULT,
SearchAdtechParameter.DEFAULT,
false,
1,
0));
1));
}
public SearchFilters(SearchParameters parameters) {

View File

@@ -39,8 +39,7 @@ public class SearchQueryService {
@QueryParam String recent,
@QueryParam String searchTitle,
@QueryParam String adtech,
@QueryParam Integer page,
@QueryParam Integer debug
@QueryParam Integer page
) {
try {
SearchParameters parameters = new SearchParameters(websiteUrl,
@@ -51,9 +50,7 @@ public class SearchQueryService {
SearchTitleParameter.parse(searchTitle),
SearchAdtechParameter.parse(adtech),
false,
Objects.requireNonNullElse(page,1),
Objects.requireNonNullElse(debug,0)
);
Objects.requireNonNullElse(page,1));
return searchCommandEvaulator.eval(parameters);
}

View File

@@ -36,10 +36,11 @@
</div>
@if (filters.showRecentOption.isSet()) <input type="hidden" name="js" value="${filters.removeJsOption.value()}"> @endif
@if (filters.reduceAdtechOption.isSet()) <input type="hidden" name="adtech" value="${filters.reduceAdtechOption.value()}"> @endif
@if (filters.searchTitleOption.isSet()) <input type="hidden" name="searchTitle" value="${filters.searchTitleOption.value()}"> @endif
@if (filters.showRecentOption.isSet()) <input type="hidden" name="recent" value="${filters.showRecentOption.value()}"> @endif
<input type="hidden" name="js" value="${filters.removeJsOption.value()}">
<input type="hidden" name="adtech" value="${filters.reduceAdtechOption.value()}">
<input type="hidden" name="searchTitle" value="${filters.searchTitleOption.value()}">
<input type="hidden" name="profile" value="${profile}">
<input type="hidden" name="recent" value="${filters.showRecentOption.value()}">
</form>

View File

@@ -36,7 +36,7 @@
<div class="text-slate-700 dark:text-white text-sm p-4">
<div class="fas fa-gift mr-1 text-margeblue dark:text-slate-200"></div>
This is the new design and home of Marginalia Search.
You can about what this entails <a href="https://about.marginalia-search.com/article/redesign/" class="underline text-liteblue dark:text-blue-200">here</a>.
You can read about what this entails <a href="https://about.marginalia-search.com/article/redesign/" class="underline text-liteblue dark:text-blue-200">here</a>.
<p class="my-4"></p>
The old version of Marginalia Search remains available at
<a href="https://old-search.marginalia.nu/" class="underline text-liteblue dark:text-blue-200">https://old-search.marginalia.nu/</a>.

View File

@@ -72,11 +72,11 @@ services:
image: "mariadb:lts"
container_name: "mariadb"
env_file: "${INSTALL_DIR}/env/mariadb.env"
command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
command: ['mariadbd', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
ports:
- "127.0.0.1:3306:3306/tcp"
healthcheck:
test: mysqladmin ping -h 127.0.0.1 -u ${uval} --password=${pval}
test: mariadb-admin ping -h 127.0.0.1 -u ${uval} --password=${pval}
start_period: 5s
interval: 5s
timeout: 5s

View File

@@ -103,11 +103,11 @@ services:
image: "mariadb:lts"
container_name: "mariadb"
env_file: "${INSTALL_DIR}/env/mariadb.env"
command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
command: ['mariadbd', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
ports:
- "127.0.0.1:3306:3306/tcp"
healthcheck:
test: mysqladmin ping -h 127.0.0.1 -u ${uval} --password=${pval}
test: mariadb-admin ping -h 127.0.0.1 -u ${uval} --password=${pval}
start_period: 5s
interval: 5s
timeout: 5s

View File

@@ -129,11 +129,11 @@ services:
image: "mariadb:lts"
container_name: "mariadb"
env_file: "${INSTALL_DIR}/env/mariadb.env"
command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
command: ['mariadbd', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
ports:
- "127.0.0.1:3306:3306/tcp"
healthcheck:
test: mysqladmin ping -h 127.0.0.1 -u ${uval} --password=${pval}
test: mariadb-admin ping -h 127.0.0.1 -u ${uval} --password=${pval}
start_period: 5s
interval: 5s
timeout: 5s

View File

@@ -3,11 +3,11 @@ services:
image: "mariadb:lts"
container_name: "mariadb"
env_file: "${INSTALL_DIR}/env/mariadb.env"
command: ['mysqld', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
command: ['mariadbd', '--character-set-server=utf8mb4', '--collation-server=utf8mb4_unicode_ci']
ports:
- "127.0.0.1:3306:3306/tcp"
healthcheck:
test: mysqladmin ping -h 127.0.0.1 -u ${uval} --password=${pval}
test: mariadb-admin ping -h 127.0.0.1 -u ${uval} --password=${pval}
start_period: 5s
interval: 5s
timeout: 5s