mirror of http://git.sesse.net/plocate synced 2025-10-06 03:32:43 +02:00

9 Commits

Author SHA1 Message Date
Steinar H. Gunderson
82f5f11a22 Fix a comment in db.h. 2021-12-28 14:20:04 +01:00
Steinar H. Gunderson
d0f2469aed Fix an issue where the database could be built with the wrong check_visibility flag.
The check_visibility flag would never be set in the header, and thus be set to some
random value instead of what the user wanted.
2020-12-05 10:50:49 +01:00
Steinar H. Gunderson
23668b1483 Honor the “require visibility” flag (in the negative). 2020-11-28 18:17:23 +01:00
Steinar H. Gunderson
63fd24efd7 Add a native updatedb.
This incorporates some code from mlocate's updatedb, and thus is compatible
with /etc/updatedb.conf, and supports all the pruning options from it.
All the code has been heavily modified, e.g. the gnulib dependency has been
removed and replaced with STL code (kicking 10k+ lines of code), the bind
mount code has been fixed (it was all broken since the switch from /etc/mtab
to /proc/self/mountinfo) and everything has been reformatted. Like with mlocate,
plocate's updatedb is merging, i.e., it can skip readdir() on unchanged
directories. (The logic here is also copied pretty verbatim from mlocate.)
updatedb reads plocate's native format; there's a new max_version 2 that
contains directory timestamps (without it, updatedb will fall back to a full
scan). The timestamps increase the database size by only about 1%, which is a
good tradeoff when we're getting rid of the entire mlocate database.
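
As a rough illustration of the merging idea above, the sketch below decides whether readdir() can be skipped for a directory. It is not plocate's code: DirTime, newest() and can_skip_readdir() are hypothetical names, and taking the later of mtime and ctime as the directory's timestamp is an assumption made for the example.

```cpp
#include <time.h>

#include <optional>

// Hypothetical stored timestamp for one directory in the old database.
struct DirTime {
    time_t sec;
    long nsec;
};

// Assumption for this sketch: a directory counts as changed when either
// its mtime or its ctime moves, so we track the later of the two.
DirTime newest(const timespec &mtime, const timespec &ctime) {
    bool ctime_later = ctime.tv_sec > mtime.tv_sec ||
                       (ctime.tv_sec == mtime.tv_sec && ctime.tv_nsec > mtime.tv_nsec);
    const timespec &t = ctime_later ? ctime : mtime;
    return DirTime{t.tv_sec, t.tv_nsec};
}

// True if the old entries for this directory can be reused without readdir().
// A database without timestamps (std::nullopt) always forces a full scan.
bool can_skip_readdir(const std::optional<DirTime> &stored,
                      const timespec &mtime, const timespec &ctime) {
    if (!stored) return false;
    DirTime now = newest(mtime, ctime);
    return stored->sec == now.sec && stored->nsec == now.nsec;
}
```

In a real scanner the two timespec values would come from a stat() call on the directory, and a false result would trigger a fresh readdir() for that directory only.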

We liberally use modern features to simplify the implementation; in particular,
openat() to avoid race conditions, instead of mlocate's complicated chdir() dance.
Unfortunately, the combination of the slightly strange storage order from mlocate,
and openat(), means we may need to keep a bunch of file descriptors open,
but they are not an expensive resource these days, and we try to bump the
limit ourselves if we are allowed to. We also use O_TMPFILE, to make sure we
never leave a half-finished file lying around (mlocate's updatedb tries to
catch signals instead). All of this may hinder portability, so we might ease up
on the requirements later. We don't use io_uring for updatedb at this point.
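
To make the openat() point concrete, here is a minimal sketch of listing a subdirectory relative to an already-open parent descriptor, so the lookup cannot be redirected by a concurrent rename of a path prefix. list_subdir() is a hypothetical helper written for this illustration, not a function from plocate, and the O_TMPFILE handling mentioned above is not shown.

```cpp
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Lists the entries of `name`, resolved relative to the directory that
// `parent_fd` refers to (or the current directory for AT_FDCWD).
// Returns std::nullopt if the directory cannot be opened.
std::optional<std::vector<std::string>> list_subdir(int parent_fd, const char *name) {
    int fd = openat(parent_fd, name, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd == -1) return std::nullopt;

    DIR *dir = fdopendir(fd);  // on success, the DIR stream owns fd
    if (dir == nullptr) {
        close(fd);
        return std::nullopt;
    }

    std::vector<std::string> names;
    while (dirent *de = readdir(dir)) {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) continue;
        names.emplace_back(de->d_name);
    }
    closedir(dir);  // also closes fd
    return names;
}
```

A recursive scanner built this way would pass the descriptor of each directory down to its children, which is where the need to keep several descriptors open at once comes from.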

plocate-build does not write the needed timestamps, so the first upgrade from
mlocate to native plocate requires a full rescan.

NOTE: The format is _not_ frozen yet, and won't be until actual release.
2020-11-25 00:58:09 +01:00
Steinar H. Gunderson
15235ad941 Use zstd dictionaries.
Since we have small strings, they can benefit from some shared context,
and zstd supports this. plocate-build now reads the mlocate database
twice; the first pass samples 1000 random blocks, which it uses to train
a 1 kB dictionary. (zstd recommends much larger dictionaries, but practical
testing seems to indicate this doesn't help us much, and might actually
be harmful.)

We get ~20% slower builds and ~7% smaller .db files -- but more
interestingly, linear search speed is up ~20% (which indicates that
decompression in itself benefits more). We need to read the 1 kB
dictionary, but it's practically free since it's stored next to the
header and so small.

This is a version bump (to version 1), so we're not forward-compatible,
but we're backward-compatible (plocate still reads version 0 files
just fine). Since we're adding more fields to the header anyway,
we can add a new “max_version” field that allows for marking
backwards-compatible changes in the future, i.e., if plocate-build
adds more information that plocate would like to use but that older
plocate versions can simply ignore.
2020-10-13 17:53:02 +02:00
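
One way to picture the version / max_version split described in this commit is the check below. It is a sketch under assumptions: the Header struct contains only the two fields named in the message, and the Compat enum, kReaderVersion and check() are hypothetical names, not plocate's actual definitions.

```cpp
#include <cstdint>

// Hypothetical cut-down header with only the two fields from the message.
struct Header {
    uint32_t version;      // format version a reader must understand
    uint32_t max_version;  // newest backward-compatible additions in the file
};

// The newest format this imaginary reader knows about.
constexpr uint32_t kReaderVersion = 1;

enum class Compat {
    Unreadable,              // file needs a newer reader
    Readable,                // reader understands everything in the file
    ReadableIgnoringExtras,  // readable, but some newer extras go unused
};

Compat check(const Header &h) {
    if (h.version > kReaderVersion) return Compat::Unreadable;
    if (h.max_version > kReaderVersion) return Compat::ReadableIgnoringExtras;
    return Compat::Readable;
}
```

With this split, a builder can add optional data by raising only max_version, and older readers keep working because they look at version alone when deciding whether to refuse a file.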
Steinar H. Gunderson
d5f6c3c0a4 Fix searching for very short (1 or 2 bytes) queries.
plocate had assumptions about the layout of the file that no longer
held. Use the pad field to simplify things.

This requires a database rebuild, but only for short queries.
Normal queries will continue to work, so there's no version bump.
2020-10-03 10:49:10 +02:00
Steinar H. Gunderson
96d1b7ab7a Make some padding in the header explicit. 2020-10-02 18:36:46 +02:00
Steinar H. Gunderson
94cd925830 Rerun clang-format. 2020-09-30 21:52:16 +02:00
Steinar H. Gunderson
c41f998855 Switch trigram lookup from binary search to a hash table.
Binary search was fine when we just wanted simplicity, but for I/O
optimization on rotating media, we want as few seeks as possible.
A hash table with open addressing gives us just that; Robin Hood
hashing makes it possible for us to guarantee maximum probe length,
so we can just read 256 bytes (plus a little slop) for each lookup
and that's it. This kills ~30 ms or so cold-cache.
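
The sketch below illustrates the general technique: open addressing with Robin Hood insertion, which keeps probe lengths short and lets a lookup stop after a known maximum number of slots. It is an in-memory toy, not plocate's on-disk format; the Slot layout, the hash function and the RobinHoodTable class are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

struct Slot {
    uint32_t key = 0;    // e.g. a packed trigram
    uint64_t value = 0;  // e.g. an offset into the file
    bool used = false;
};

class RobinHoodTable {
public:
    // capacity must be non-zero and larger than the number of inserted keys.
    explicit RobinHoodTable(std::size_t capacity) : slots_(capacity) {}

    // Assumes distinct keys and at least one free slot.
    void insert(uint32_t key, uint64_t value) {
        Slot cur{key, value, true};
        std::size_t pos = home(key);
        std::size_t dist = 0;  // how far `cur` is from its home slot
        for (;;) {
            Slot &s = slots_[pos];
            if (!s.used) {
                s = cur;
                max_probe_ = std::max(max_probe_, dist);
                return;
            }
            std::size_t resident = (pos + slots_.size() - home(s.key)) % slots_.size();
            if (resident < dist) {
                // The resident is closer to home than we are: take its slot
                // and carry the displaced element onward instead.
                std::swap(s, cur);
                max_probe_ = std::max(max_probe_, dist);
                dist = resident;
            }
            pos = (pos + 1) % slots_.size();
            ++dist;
        }
    }

    // Scans at most max_probe() + 1 consecutive slots, i.e. one bounded window.
    std::optional<uint64_t> find(uint32_t key) const {
        std::size_t pos = home(key);
        for (std::size_t dist = 0; dist <= max_probe_; ++dist) {
            const Slot &s = slots_[pos];
            if (s.used && s.key == key) return s.value;
            pos = (pos + 1) % slots_.size();
        }
        return std::nullopt;
    }

    std::size_t max_probe() const { return max_probe_; }

private:
    std::size_t home(uint32_t key) const {
        return (key * 2654435761u) % slots_.size();  // simple multiplicative hash
    }

    std::vector<Slot> slots_;
    std::size_t max_probe_ = 0;
};
```

Because the longest probe is known once the table is built, a reader can fetch one fixed-size window starting at the home slot and be sure the key is either inside it or absent, which is what makes a single read per lookup possible.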

This breaks the format, so we use the chance to add a magic and
a proper header to provide some more flexibility in case we want
to change the builder.
2020-09-30 19:46:53 +02:00