mirror of https://github.com/Byron/gitoxide synced 2025-10-06 01:52:40 +02:00
Eliah Kagan 90a3dcbddc doc: Strengthen and adjust copy-royal usage guidance and caveats
The `copy-royal` algorithm maintains the patterns and "shape" of
text sufficiently to keep diffs the same (in the vast majority of
cases). It is used in `internal-tools` to help prepare test cases
with what is important and relevant to a regression test of diff
behavior, rather than the exact original repository content in a
tree that has been found to trigger a bug. It avoids needless
verbatim reproduction, while preserving aspects that are useful and
necessary for testing. It keeps the focus on patterns, preventing
irrelevant details of code in a tree that triggered a bug from
being confused with the logic of gitoxide itself, and makes it less
likely to be touched inadvertently in efforts to fix bugs or
improve style (which, in test data, would cause subtle breakage).

Although these benefits are substantial and we intend to continue
using copy-royal in the preparation of test cases as needed if or
when regressions arise, some of the guidance and rationale we had
given for its use was inaccurate or misleading. Most importantly,
copy-royal cannot be used in practice to redact sensitive
information: if you have a repository whose contents should not be
made public, then it is not safe to share the output of copy-royal
run on that repository either.

Copy-royal is implemented (roughly speaking) by mapping alphabetic
characters down to ten letters. This removes some information, at
least in principle: that is, if it were given totally random
letters as input, then it would be impossible to reverse it to get
those letters back. Even on input that is much more structured and
predictable, such as real-world input, it obfuscates it, making it
look garbled and nonsensical. However, even when one intuitively
feels that it has destroyed information, it is possible to reverse
it in many cases, and possibly even in all practical cases.

The reason is that, in real world source code and natural language,
some sequences of letters are overwhelmingly more likely to occur
than others, both in general and (especially) contextually given
what surrounding text is present. The information that is removed
by mapping into ten letters could often be reconstructed by:

1. Building a grammar of possible inputs, which can be done in a
   simple manner by translating the copy-royal output one wishes to
   reverse into a regular expression in which every symbol in the
   copy-royal output becomes a character class of characters that
   map to it. In effect, for every output of the copy-royal
   algorithm, there is a regex that matches the possible inputs.

2. Predicting, stepwise, what code or text is likely to have arisen
   that matches that grammar. In principle this could be done with
   a variety of techniques or even manually. But one fruitful
   approach would be to use an autoregressive large language model,
   and apply constrained decoding[1] to sample only logits
   consistent with the regex. Small experiments carried out so far
   suggest[2] this to be a workable technique when combined with
   beam search[3]. (This technique does not require the specific
   text or code being reconstructed to have existed when the model
   was trained.)
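
As a rough illustration of step 1, the following is a minimal sketch
in Rust. It assumes a hypothetical mapping in which each ASCII letter
is folded onto one of ten letters by its alphabet position modulo 10.
That mapping, and the names `remap` and `preimage_regex`, are
assumptions made for this example and are not the actual copy-royal
implementation in `internal-tools`. The sketch shows that several
inputs collapse to the same output, and how an output string can be
turned into a regex of character classes that matches all of its
possible inputs.

```rust
/// The ten output letters of this hypothetical mapping.
const OUTPUT: [char; 10] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'];

/// Hypothetical stand-in for copy-royal: fold each ASCII letter onto one of
/// ten letters by alphabet position modulo 10, keep its case, and leave all
/// other characters untouched.
fn remap(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            if c.is_ascii_alphabetic() {
                let idx = (c.to_ascii_lowercase() as u8 - b'a') % 10;
                let out = OUTPUT[idx as usize];
                if c.is_ascii_uppercase() { out.to_ascii_uppercase() } else { out }
            } else {
                c
            }
        })
        .collect()
}

/// Build the source of a regex that matches every input which `remap`
/// sends to `output`: each output letter becomes a character class of the
/// letters that map to it, and every other character matches itself.
fn preimage_regex(output: &str) -> String {
    let mut re = String::from("^");
    for c in output.chars() {
        let lower = c.to_ascii_lowercase();
        if ('a'..='j').contains(&lower) {
            let idx = lower as u8 - b'a';
            re.push('[');
            for candidate in 'a'..='z' {
                if (candidate as u8 - b'a') % 10 == idx {
                    re.push(if c.is_ascii_uppercase() {
                        candidate.to_ascii_uppercase()
                    } else {
                        candidate
                    });
                }
            }
            re.push(']');
        } else {
            // Escape regex metacharacters so they only match themselves.
            if r"\.+*?()|[]{}^$".contains(c) {
                re.push('\\');
            }
            re.push(c);
        }
    }
    re.push('$');
    re
}
```

Under this assumed mapping, `remap("fn")` and `remap("px")` both give
`fd`, and `preimage_regex("fd")` gives `^[fpz][dnx]$`. Each class is
small, which is why the likelihood of letter sequences in real text
can narrow the candidates so effectively.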

Accordingly, this modifies the documentation of copy-royal to avoid
claiming that the input of copy-royal cannot be recovered, or
anything that recommends or may appear to recommend the use of
copy-royal to redact sensitive information. It also clarifies and
adjusts the explanation of when it makes sense to use copy-royal,
and describes some of its benefits that do not rely on the
assumption that it is infeasible (or even difficult) to reverse.

In the comment documenting `BlameCopyRoyal`, which is among those
edited in the above ways, this also edits its top line to make
clear more generally how `BlameCopyRoyal` relates to `git blame`.

[1]: https://github.com/Saibo-creator/Awesome-LLM-Constrained-Decoding
[2]: See link(s) in https://github.com/GitoxideLabs/gitoxide/pull/2180
[3]: https://en.wikipedia.org/wiki/Beam_search

Co-authored-by: Sebastian Thiel <sebastian.thiel@icloud.com>
2025-09-21 02:34:49 -04:00

gitoxide is an implementation of git written in Rust for developing future-proof applications which strive for correctness and performance while providing a pleasant and unsurprising developer experience.

There are two primary ways to use gitoxide:

  1. As a Rust library: Use the gix crate as a Cargo dependency for API access.
  2. As a command-line tool: The gix binary is a development tool that helps test the API in real repositories, and the ein binary provides workflow-enhancing tools. Both binaries may remain unstable forever, so do not rely on them in scripts.


Development Status

The command-line tools as well as the status of each crate are described in the crate status document.

For use in applications, look for the gix crate, which serves as the entry point to the functionality provided by various lower-level plumbing crates like gix-config.

Feature Discovery

Can gix do what I need it to do?

The above can be hard to answer and this paragraph is here to help with feature discovery.

Look at crate-status.md for a rather exhaustive document that contains both implemented and planned features.

Further, the gix crate documentation with the git2 search term helps to find all currently known git2 equivalent method calls. Please note that this list is definitely not exhaustive yet, but might help if you are coming from git2.

What follows is a high-level list of features and those which are planned:

  • clone
  • fetch
  • push
  • blame (plumbing)
  • status
  • blob and tree-diff
  • merge
    • blobs
    • trees
    • commits
  • commit
    • hooks
  • commit-graph traversal
  • rebase
  • worktree checkout and worktree stream
  • reset
  • reading and writing of objects
  • reading and writing of refs
  • reading and writing of .git/index
  • reading and writing of git configuration
  • pathspecs
  • revspecs
  • .gitignore and .gitattributes

Crates

Follow linked crate name for detailed status. Please note that all crates follow semver as well as the stability guide.

Production Grade

Stabilization Candidates

Crates that seem feature complete and need to see some more use before they can be released as 1.0. Documentation is complete and was reviewed at least once.

Initial Development

These crates may be missing some features and thus are somewhat incomplete, but what's there is usable to some extent.

Stress Testing

  • Verify huge packs
  • Explode a pack to disk
  • Generate and verify large commit graphs
  • Generate huge pack from a lot of loose objects

Stability and MSRV

Our stability guide helps to judge how much churn can be expected when depending on crates in this workspace.

Installation

Download a Binary Release

Using cargo binstall, one is able to fetch binary releases. You can install it via cargo install cargo-binstall, assuming the Rust toolchain is present.

Then install gitoxide with cargo binstall gitoxide.

See the releases section for manual installation and various alternative builds that are slimmer or smaller, depending on your needs, for Linux, macOS and Windows.

Download from Arch Linux repository

For Arch Linux you can download gitoxide from the community repository:

pacman -S gitoxide

Download from Exherbo Linux Rust repository

For Exherbo Linux you can download gitoxide from the Rust repository:

cave resolve -x repository/rust
cave resolve -x gitoxide

From Source via Cargo

cargo is the Rust package manager which can easily be obtained through rustup. With it, you can build your own binary effortlessly and for your particular CPU for additional performance gains.

The minimum supported Rust version is documented in the CI configuration; the latest stable version will work as well.

There are various build configurations, all of them are documented here. The documentation should also be useful for packagers who need to tune external dependencies.

# A way to install `gitoxide` with just Rust and a C compiler installed.
# If there are problems with SSL certificates during clones, try to omit `--locked`.
cargo install gitoxide --locked --no-default-features --features max-pure

# The default installation, 'max', is the fastest, but also needs `cmake` to build successfully.
# Installing it is platform-dependent.
cargo install gitoxide

# For smaller binaries and even faster build times that are traded for a less fancy CLI implementation,
# use the `lean` feature.
cargo install gitoxide --locked --no-default-features --features lean

The following installs the latest unpublished max release directly from git:

cargo install --git https://github.com/GitoxideLabs/gitoxide gitoxide

How to deal with build failures

On some platforms, installation may fail due to lack of tools required by C toolchains. This can generally be avoided by installation with:

cargo install gitoxide --no-default-features --features max-pure

What follows is a list of known failures.

  • On Fedora, perl needs to be installed for OpenSSL to build properly. This can be done with the following command (see issue #592):

    dnf install perl
    

Using Docker

Some CI/CD pipelines leverage repository cloning. Below is a copy-paste-able example to build docker images for such workflows. As no official image exists (at this time), an image must first be built.

Note

The Dockerfile isn't continuously tested as that would cost too much time, and thus it might already be broken. PRs are welcome.

Building the most compatible base image

docker build -f etc/docker/Dockerfile.alpine -t gitoxide:latest --compress . --target=pipeline

Basic usage in a Pipeline

For example, if a Dockerfile currently uses something like RUN git clone https://github.com/GitoxideLabs/gitoxide, first build the image:

docker build -f etc/docker/Dockerfile.alpine -t gitoxide:latest --compress .

Then copy the binaries into your image and replace the git directive with a gix equivalent.

COPY --from=gitoxide:latest /bin/gix /usr/local/bin/
COPY --from=gitoxide:latest /bin/ein /usr/local/bin/

RUN /usr/local/bin/gix clone --depth 1 https://github.com/GitoxideLabs/gitoxide gitoxide

Usage

Once installed, there are two binaries:

  • ein
    • high level commands, porcelain, for every-day use, optimized for a pleasant user experience
  • gix
    • low level commands, plumbing, for use in more specialized cases and to validate newly written code in real-world scenarios

Project Goals

Project goals can change over time as we learn more, and they can be challenged.

  • a pure-rust implementation of git
    • including transport, object database, references, cli and tui
    • a simple command-line interface is provided for the most common git operations, optimized for user experience. A simple-git, if you will.
    • be the go-to implementation for anyone who wants to solve problems around git, and become the alternative to GitPython and libgit2 in the process.
    • become the foundation for a distributed alternative to GitHub, and maybe even for use within GitHub itself
  • learn from the best to write the best possible idiomatic Rust
    • libgit2 is a fantastic resource to see what abstractions work, we will use them
    • use Rust's type system to make misuse impossible
  • be the best performing implementation
    • use Rust's type system to optimize for work not done without being hard to use
    • make use of parallelism from the get-go
    • sparse checkout support from day one
  • assure on-disk consistency
    • assure reads never interfere with concurrent writes
    • assure multiple concurrent writes don't cause trouble
  • take shortcuts, but not in quality
    • binaries may use anyhow::Error exhaustively, knowing these errors are solely user-facing.
    • libraries use light-weight custom errors implemented using quick-error or thiserror.
    • internationalization is nothing we are concerned with right now.
    • IO errors due to insufficient amount of open file handles don't always lead to operation failure
  • Cross platform support, including Windows
    • With the tools and experience available here there is no reason not to support Windows.
    • Windows is tested on CI and failures do prevent releases.
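
The on-disk consistency goal above can be illustrated with the lock-file pattern that git itself uses for files such as refs and the index: a writer creates `<name>.lock` exclusively, writes the new content there, and renames it over the target, so readers only ever see the old or the new file. The following is a minimal sketch of that pattern using only the Rust standard library. The function name `write_atomically` is hypothetical, and this is not how gitoxide's own locking crates are implemented.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace `target` by writing to `<target>.lock` and renaming it into place.
/// Hypothetical helper for illustration only.
fn write_atomically(target: &Path, contents: &[u8]) -> io::Result<()> {
    let mut lock_name = target.as_os_str().to_owned();
    lock_name.push(".lock");
    let lock_path = PathBuf::from(lock_name);

    // `create_new` fails if the lock file already exists, so at most one
    // writer gets past this point at a time.
    let mut lock = OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(&lock_path)?;

    // Write and flush to disk before the new content becomes visible.
    let written = lock.write_all(contents).and_then(|()| lock.sync_all());
    drop(lock);

    // The rename publishes the new content in one step; on any failure the
    // lock file we created is removed and the target stays untouched.
    if let Err(err) = written.and_then(|()| fs::rename(&lock_path, target)) {
        let _ = fs::remove_file(&lock_path);
        return Err(err);
    }
    Ok(())
}
```

A second writer that finds the lock file present gets an `AlreadyExists` error instead of interleaving its output with the first writer's, which is the property the two "assure" bullets describe.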

Non-Goals

Project non-goals can change over time as we learn more, and they can be challenged.

  • replicate git command functionality perfectly
    • git is git, and there is no reason to not use it. Our path is the one of simplicity to make getting started with git easy.
  • be incompatible with git
    • the on-disk format must remain compatible, and we will never contend with it.
  • use async IO everywhere
    • for the most part, git operations are heavily reliant on memory mapped IO as well as CPU to decompress data, which doesn't lend itself well to async IO out of the box.
    • Use blocking as well as gix-features::interrupt to bring operations into the async world and to control long running operations.
    • When connecting or streaming over TCP connections, especially when receiving on the server, async seems like a must though, but behind a feature flag.
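
The idea of running blocking work off to the side while keeping it interruptible can be sketched with the standard library alone. In the example below, a plain thread stands in for a blocking-task executor and a shared `AtomicBool` stands in for the interrupt flag. The function names and the summing workload are hypothetical, and this is not the API of gix-features.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// A stand-in for a long-running operation: sum `items`, but check the
/// interrupt flag before each step and give up with `None` if it is set.
fn sum_unless_interrupted(items: &[u64], should_interrupt: &AtomicBool) -> Option<u64> {
    let mut total = 0u64;
    for item in items {
        if should_interrupt.load(Ordering::Relaxed) {
            return None;
        }
        total += item;
    }
    Some(total)
}

/// Run the blocking operation on its own thread so the caller is not
/// blocked; the caller keeps a clone of the flag to request cancellation.
fn spawn_blocking_sum(
    items: Vec<u64>,
    should_interrupt: Arc<AtomicBool>,
) -> std::thread::JoinHandle<Option<u64>> {
    std::thread::spawn(move || sum_unless_interrupted(&items, &should_interrupt))
}
```

A caller would keep one `Arc<AtomicBool>`, pass a clone to `spawn_blocking_sum`, and call `store(true, Ordering::Relaxed)` on it to stop the operation early. The work itself stays synchronous, which suits memory-mapped and CPU-bound code.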

Contributions

If what you have seen so far sparked your interest to contribute, then let us say: We are happy to have you and help you to get started.

We recommend running just test during development to ensure CI is green before pushing.

A backlog for work ready to be picked up is available in the Project's Kanban board, which contains instructions on how to pick a task. If it's empty or you have other questions, feel free to start a discussion or reach out to @Byron privately.

For additional details, also take a look at the collaboration guide.

Getting started with Video Tutorials

  • Learning Rust with Gitoxide
    • In 17 episodes you can learn all you need to meaningfully contribute to gitoxide.
  • Getting into Gitoxide
    • Get an introduction to gitoxide itself which should be a good foundation for any contribution, but isn't a requirement for contributions either.
  • Gifting Gitoxide
    • See how PRs are reviewed along with a lot of inner monologue.

Other Media

Roadmap

Features for 1.0

Provide a CLI for the most basic user journey:

  • initialize a repository
  • fetch
    • and update worktree
  • clone a repository
    • bare
    • with working tree
  • create a commit after adding worktree files
  • add a remote
  • push
    • create (thin) pack

Ideas for Examples

  • gix tool open-remote open the URL of the remote, possibly after applying known transformations to go from ssh to https.
  • tix as example implementation of tig, displaying a version of the commit graph, useful for practicing how highly responsive GUIs can be made.
  • Something like git-sizer, but leveraging extreme decompression speeds of indexed packs.
  • Open up SQL for git using sqlite virtual tables. Check out gitqlite as well. What would an MVP look like? Maybe even something that could ship with gitoxide. See this go implementation as example.
  • A truly awesome history rewriter which makes it easy to understand what happened while avoiding all pitfalls. Think BFG, but more awesome, if that's possible.
  • gix-tui should learn a lot from fossil-scm regarding the presentation of data. Maybe this can be used for prompts. Probably magit has a lot to offer, too.
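
For the open-remote idea in the list above, the ssh-to-https transformation might look roughly like the following sketch. It assumes only two common URL shapes, the scp-like `user@host:path` and `ssh://user@host[:port]/path`, and a hosting service that serves the same path over https. The function name `ssh_to_https` is hypothetical, and edge cases such as IPv6 hosts or Windows drive paths are not handled.

```rust
/// Turn an SSH remote URL into an https URL that a browser can open.
/// Returns `None` if `url` does not look like one of the two SSH forms.
fn ssh_to_https(url: &str) -> Option<String> {
    let (host, path) = if let Some(rest) = url.strip_prefix("ssh://") {
        // ssh://[user@]host[:port]/path
        let (authority, path) = rest.split_once('/')?;
        let host_port = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
        let host = host_port.split_once(':').map_or(host_port, |(h, _)| h);
        (host, path)
    } else {
        // scp-like form: [user@]host:path
        let (user_host, path) = url.split_once(':')?;
        // Anything with a scheme such as `https://` is not an scp-like URL.
        if user_host.contains('/') || path.starts_with("//") {
            return None;
        }
        let host = user_host.rsplit_once('@').map_or(user_host, |(_, h)| h);
        (host, path)
    };
    if host.is_empty() || path.is_empty() {
        return None;
    }
    // Drop a leading slash and the conventional `.git` suffix.
    let path = path.trim_start_matches('/');
    let path = path.strip_suffix(".git").unwrap_or(path);
    Some(format!("https://{host}/{path}"))
}
```

With this sketch, `git@github.com:GitoxideLabs/gitoxide.git` becomes `https://github.com/GitoxideLabs/gitoxide`, while a URL that is already https yields `None` and can be opened unchanged.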

Ideas for Spin-Offs

  • A system to integrate tightly with gix-lfs to allow a multi-tier architecture so that assets can be stored in git and are accessible quickly from an intranet location (for example by accessing the storage read-only over the network) while changes are pushed immediately by the server to other edge locations, like the cloud or backups. Sparse checkouts along with explorer/finder integrations make it convenient to only work on a small subset of files locally. Clones can contain all configuration somebody would need to work efficiently from their location, and authentication for the git history as well as LFS resources make the system secure. One could imagine encryption support for untrusted locations in the cloud even though more research would have to be done to make it truly secure.
  • A syncthing like client/server application. This is to demonstrate how lower-level crates can be combined into custom applications that use only part of git's technology to achieve their very own thing. Watch out for big file support, multi-device cross-syncing, the possibility for untrusted destinations using full-encryption, case-insensitive and sensitive filesystems, and extended file attributes as well as ignore files.
  • An event-based database that uses commit messages to store deltas, while occasionally aggregating the actual state in a tree. Of course it's distributed by nature, allowing people to work offline.
    • It's abstracted to completely hide the actual data model behind it, allowing for all kinds of things to be implemented on top.
    • Commits probably need a nanosecond component for the timestamp, which can be added via custom header field.
    • Having recorded all changes allows for perfect merging, both on the client and on the server, while keeping a natural audit log, which makes it useful for mission-critical databases in business.
    • Applications
      • Can markdown be used as database so issue-trackers along with meta-data could just be markdown files which are mostly human-editable? Could user interfaces be meta-data aware and just hide the meta-data chunks which are now editable in the GUI itself? Doing this would make conflicts easier to resolve than an sqlite database.
      • A time tracker - simple data, very likely naturally conflict-free, and interesting to see in terms of teams or companies using it, with maybe GitHub as backing for authentication.
        • How about supporting multiple different trackers, as in different remotes?

Shortcomings & Limitations

Please take a look at the SHORTCOMINGS.md file for details.

Credits

  • itertools (MIT Licensed)
    • We use the izip! macro in code
  • flate2 (MIT Licensed)
    • We use the high-level flate2 library to implement decompression and compression, which builds on the high-performance zlib-rs crate.

🙏 Special Thanks 🙏

At least for now, this section exclusively highlights the incredible support that Josh Triplett has provided to me in the form of advice, sponsorship and countless other benefits that were incredibly meaningful. Going full time with gitoxide would hardly have been feasible without his involvement, and I couldn't be more grateful 😌.

License

This project is licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Fun facts

  • Originally @Byron was really fascinated by this problem and believes that with gitoxide it will be possible to provide the fastest solution for it.
  • @Byron has been absolutely blown away by git from the first time he experienced git more than 13 years ago, and has tried to implement it in various shapes and forms multiple times. Now, with Rust, @Byron finally feels he has found the right tool for the job!