doc: Strengthen and adjust copy-royal usage guidance and caveats

The `copy-royal` algorithm maintains the patterns and "shape" of text sufficiently to keep diffs the same (in the vast majority of cases). It is used in `internal-tools` to help prepare test cases with what is important and relevant to a regression test of diff behavior, rather than the exact original repository content in a tree that has been found to trigger a bug. It avoids needless verbatim reproduction, while preserving aspects that are useful and necessary for testing. It keeps the focus on patterns, preventing irrelevant details of code in a tree that triggered a bug from being confused with the logic of gitoxide itself, and makes it less likely to be touched inadvertently in efforts to fix bugs or improve style (which, in test data, would cause subtle breakage).

Although these benefits are substantial and we intend to continue using copy-royal in the preparation of test cases as needed if or when regressions arise, some of the guidance and rationale we had given for its use was inaccurate or misleading. Most importantly, copy-royal cannot be used in practice to redact sensitive information: if you have a repository whose contents should not be made public, then it is not safe to share the output of copy-royal run on that repository either.

Copy-royal is implemented (roughly speaking) by mapping alphabetic characters down to ten letters. This removes some information, at least in principle: that is, if it were given totally random letters as input, then it would be impossible to reverse it to get those letters back. Even on input that is much more structured and predictable, such as real-world input, it obfuscates it, making it look garbled and nonsensical. However, even when one intuitively feels that it has destroyed information, it is possible to reverse it in many cases, and possibly even in all practical cases.

The reason is that, in real-world source code and natural language, some sequences of letters are overwhelmingly more likely to occur than others, both in general and (especially) contextually given what surrounding text is present. The information that is removed by mapping into ten letters could often be reconstructed by:

1. Building a grammar of possible inputs, which can be done in a simple manner by translating the copy-royal output one wishes to reverse into a regular expression in which every symbol in the copy-royal output becomes a character class of the characters that map to it. In effect, for every output of the copy-royal algorithm, there is a regex that matches the possible inputs.

2. Predicting, stepwise, what code or text is likely to have arisen that matches that grammar. In principle this could be done with a variety of techniques or even manually. But one fruitful approach would be to use an autoregressive large language model, and apply constrained decoding[1] to sample only tokens consistent with the regex. Small experiments carried out so far suggest[2] this to be a workable technique when combined with beam search[3].

(This technique does not require the specific text or code being reconstructed to have existed when the model was trained.)

Accordingly, this modifies the documentation of copy-royal to avoid claiming that the input of copy-royal cannot be recovered, or anything that recommends or may appear to recommend the use of copy-royal to redact sensitive information. It also clarifies and adjusts the explanation of when it makes sense to use copy-royal, and describes some of its benefits that do not rely on the assumption that it is infeasible (or even difficult) to reverse.

In the comment documenting `BlameCopyRoyal`, which is among those edited in the above ways, this also edits its top line to make clear more generally how `BlameCopyRoyal` relates to `git blame`.
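
To make step 1 above concrete, here is a minimal sketch in Rust. It assumes a hypothetical stand-in mapping, `fold_to_ten`, that folds each ASCII letter onto one of ten letters by its alphabet position modulo 10; the actual mapping used by copy-royal in `internal-tools` may differ. `preimage_pattern` is likewise a hypothetical helper. It only builds the regex as a string and does not perform the prediction in step 2.

```rust
/// Hypothetical stand-in for the copy-royal mapping (an assumption, not the
/// real implementation): ASCII letters are folded onto `a`..=`j` or
/// `A`..=`J` by alphabet position modulo 10; all other characters are kept.
fn fold_to_ten(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            'a'..='z' => (b'a' + (c as u8 - b'a') % 10) as char,
            'A'..='Z' => (b'A' + (c as u8 - b'A') % 10) as char,
            other => other,
        })
        .collect()
}

/// Build a regex pattern that matches every input which `fold_to_ten`
/// would map to `output`. Each folded letter becomes a character class of
/// the letters that fold onto it; other characters are matched literally.
fn preimage_pattern(output: &str) -> String {
    let mut pattern = String::new();
    for c in output.chars() {
        match c {
            'a'..='j' | 'A'..='J' => {
                let base = if c.is_ascii_lowercase() { b'a' } else { b'A' };
                let offset = c as u8 - base;
                pattern.push('[');
                // Alphabet positions offset, offset + 10, offset + 20 (if < 26).
                for k in (offset..26).step_by(10) {
                    pattern.push((base + k) as char);
                }
                pattern.push(']');
            }
            // Escape characters that have a special meaning in a regex.
            '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{' | '}' | '^' | '$' => {
                pattern.push('\\');
                pattern.push(c);
            }
            // Whitespace, digits, and other punctuation pass through. Letters
            // outside the ten-letter range never appear in folded output.
            other => pattern.push(other),
        }
    }
    pattern
}
```

Under this assumed mapping, `fold_to_ten("fn main")` yields `fd caid`, and `preimage_pattern("fd caid")` yields `[fpz][dnx] [cmw][aku][is][dnx]`. Each class holds at most three letters, which suggests how little a predictor has to choose between at each position.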

[1]: https://github.com/Saibo-creator/Awesome-LLM-Constrained-Decoding
[2]: See link(s) in https://github.com/GitoxideLabs/gitoxide/pull/2180
[3]: https://en.wikipedia.org/wiki/Beam_search

Co-authored-by: Sebastian Thiel <sebastian.thiel@icloud.com>
2025-10-06 01:52:40 +02:00 · 2025-09-01 14:38:06 -04:00
parent 442f800026
commit 90a3dcbddc
1 changed files with 25 additions and 16 deletions
--- a/tests/it/src/args.rs
+++ b/tests/it/src/args.rs
@@ -16,19 +16,16 @@ pub struct Args {
 #[derive(Debug, clap::Subcommand)]
 pub enum Subcommands {
    /// Generate a shell script that creates a git repository containing all commits that are
-    /// traversed when a blame is generated.
+    /// traversed when following a given file through the Git history just as `git blame` would.
    ///
    /// This command extracts the file’s history so that blame, when run on the repository created
    /// by the script, shows the same characteristics, in particular bugs, as the original, but in
-    /// a way that the original source file's content cannot be reconstructed.
+    /// a way that does not resemble the original source file's content to any greater extent than
+    /// is useful and necessary.
    ///
-    /// The idea is that by obfuscating the file's content we make it easier for people to share
-    /// the subset of data that's required for debugging purposes from repositories that are not
-    /// public.
-    ///
-    /// Note that the obfuscation leaves certain properties of the source intact, so they can still
-    /// be inferred from the extracted history. Among these properties are directory structure
-    /// (though not the directories' names), renames, number of lines, and whitespace.
+    /// Note that this should not be used to redact sensitive information. The obfuscation leaves
+    /// numerous properties of the source intact, such that it may be feasible to reconstruct the
+    /// input.
    ///
    /// This command can also be helpful in debugging the blame algorithm itself.
    ///
@@ -59,15 +56,27 @@ pub enum Subcommands {
        file: std::ffi::OsString,
        /// Do not use `copy-royal` to obfuscate the content of blobs, but copy it verbatim.
        ///
-        /// Note that this should only be done if the source history does not contain information
-        /// you're not willing to share.
+        /// Note that, for producing cases for the gitoxide test suite, we usually prefer only to
+        /// take blobs verbatim if the source repository was purely for testing.
        #[clap(long)]
        verbatim: bool,
    },
-    /// Copy a tree so that it diffs the same but can't be traced back uniquely to its source.
+    /// Copy a tree so that it diffs the same but does not resemble the original files' content to
+    /// any greater extent than is useful and necessary.
    ///
-    /// The idea is that we don't want to deal with licensing, it's more about patterns in order to
-    /// reproduce cases for tests.
+    /// The idea is that this preserves the patterns that are usually sufficient to reproduce cases
+    /// for tests of diffs, both for making the tests work and for keeping the diffs understandable
+    /// to developers working on the tests, while avoiding keeping large verbatim fragments of code
+    /// based on which the test cases were created. The benefits of "reducing" the code to these
+    /// patterns include that the original meaning and function of code will not be confused with
+    /// the code of gitoxide itself, will not distract from the effects observed in their diffs,
+    /// and will not inadvertently be caught up in code cleanup efforts (e.g. attempting to adjust
+    /// style or fix bugs) that would make sense in code of gitoxide itself but that would subtly
+    /// break data test fixtures if done on their data.
+    ///
+    /// Note that this should not be used to redact sensitive information. The obfuscation leaves
+    /// numerous properties of the source intact, such that it may be feasible to reconstruct the
+    /// input.
    #[clap(visible_alias = "cr")]
    CopyRoyal {
        /// Don't really copy anything.
@@ -93,8 +102,8 @@ pub enum Subcommands {
        count: usize,
        /// Do not use `copy-royal` to degenerate information of blobs, but take blobs verbatim.
        ///
-        /// Note that this should only be done if the source repository is purely for testing
-        /// or was created by yourself.
+        /// Note that, for producing cases for the gitoxide test suite, we usually prefer only to
+        /// take blobs verbatim if the source repository was purely for testing.
        #[clap(long)]
        verbatim: bool,
        /// The directory into which the blobs and tree declarations will be written.