mirror of https://0xacab.org/jvoisin/mat2 synced 2025-10-07 17:12:50 +02:00

61 Commits

Author SHA1 Message Date
jvoisin
235403bc11 Edit README.md 2025-09-04 15:10:12 +02:00
jvoisin
102f08cd28 Switch the project from 0xacab to github
While the folks running 0xacab are much more lovely than the github ones, this
project has outgrown the former:

- Github offers beefy continuous integration, making it easier to run the
  testsuite on every Python version, instead of using a weird docker-based
  contraption. Moreover, I'd rather burn some Microsoft money than 0xacab's.
- Opening an account on 0xacab is non-trivial (by design), making it tedious
  for people to report issues and contribute to mat2.
- Gitlab is becoming unbearably slow and convoluted, even compared to Github's
  awful Copilot/AI push.

It's a sad state of affairs, but it's a pragmatic decision. People who don't
have a Github account can still report issues and send patches by sending me an
email.
2025-09-04 14:35:36 +02:00
jvoisin
7a8ea224bc Fix issue introduced in f073444
The continuous integration on 0xacab didn't run, so it didn't catch this issue.
It seems like we'll have to move to github or whatever instead, sigh.
2025-09-01 23:52:43 +02:00
jvoisin
504efb2448 Remove mypy from the CI
It has always been useless at best, and a nuisance most of the time.
2025-09-01 14:35:25 +02:00
jvoisin
f07344444d Fix a broken test
Reported-By: https://github.com/NixOS/nixpkgs/issues/436421
2025-08-25 12:07:15 +02:00
jvoisin
473903b70e Fix HEIC parsing with the latest exiftool 2025-04-03 17:34:44 +02:00
jvoisin
1438cf7bd4 Disable webp tests for now
```
======================================================================
ERROR: test_all_parametred (tests.test_libmat2.TestCleaning.test_all_parametred) (case={'name': 'webp', 'parser': <class 'libmat2.images.WEBPParser'>, 'meta': {'Warning': '[minor] Improper EXIF header'}, 'expected_meta': {}})
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/builds/jvoisin/mat2/libmat2/images.py", line 109, in __init__
    GdkPixbuf.Pixbuf.new_from_file(self.filename)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
gi.repository.GLib.GError: gdk-pixbuf-error-quark: Couldn’t recognize the image file format for file “./tests/data/clean.webp” (3)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/builds/jvoisin/mat2/tests/test_libmat2.py", line 557, in test_all_parametred
    p1 = case['parser'](target)
  File "/builds/jvoisin/mat2/libmat2/images.py", line 111, in __init__
    raise ValueError
ValueError
```

Pending on https://0xacab.org/georg/mat2-ci-images/-/issues/14
2025-04-03 17:34:40 +02:00
jvoisin
e740a9559f Properly handle an exception
```
Traceback (most recent call last):
  File "/builds/jvoisin/mat2/tests/test_deep_cleaning.py", line 147, in test_office
    meta = p.get_meta()
  File "/builds/jvoisin/mat2/libmat2/archive.py", line 155, in get_meta
    zin.extract(member=item, path=temp_folder)
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/zipfile/__init__.py", line 1762, in extract
    return self._extract_member(member, path, pwd)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/zipfile/__init__.py", line 1829, in _extract_member
    os.makedirs(upperdirs, exist_ok=True)
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen os>", line 227, in makedirs
OSError: [Errno 28] No space left on device: '/tmp/tmptl1ibyv6/word/theme'
```

This should never happen™, but just in case…
2025-04-03 15:24:34 +02:00
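The defensive pattern described in the commit above can be sketched as follows; `safe_extract` is a hypothetical helper illustrating the approach, not mat2's actual code:

```python
import logging
import zipfile

def safe_extract(zin: zipfile.ZipFile, member: zipfile.ZipInfo, dest: str) -> bool:
    """Extract a single archive member, treating environmental failures
    (e.g. ENOSPC, 'No space left on device') as recoverable errors
    instead of letting the whole metadata pass crash."""
    try:
        zin.extract(member=member, path=dest)
    except OSError as e:
        logging.error("Unable to extract %s: %s", member.filename, e)
        return False
    return True
```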
Vincent Deffontaines
2b58eece50 Add webp support 2025-03-18 22:20:17 +01:00
georg
29f404bce3 CI: run tests via python3.{13,14} 2025-01-09 09:52:47 +00:00
jvoisin
6c966f2afa Significantly improve portability 2025-01-09 02:36:16 +01:00
jvoisin
70d236a062 Bump the changelog 2025-01-09 00:43:12 +01:00
Alex Marchant
d61fb7f77a Wait to remove elements until they are all processed 2024-09-13 14:28:57 +02:00
jvoisin
1aed4ff2a5 Catch a MemoryError in cairo
This should close #202
2024-09-13 14:28:50 +02:00
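The fix above boils down to wrapping the rendering call; a minimal sketch, where `render_page_safely` and its signature are hypothetical:

```python
import logging

def render_page_safely(render, page) -> bool:
    """Call a rendering function, treating MemoryError (which cairo
    can raise on pathological documents) as a failed parse rather
    than a fatal crash."""
    try:
        render(page)
    except MemoryError as e:
        logging.error("Unable to render page: %s", e)
        return False
    return True
```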
matiargs
75c0a750c1 Keep orientation metadata 2024-07-18 15:04:24 +00:00
jvoisin
a47ac01eb6 Remove a duplicate function
This is a leftover from today's best-effort merges.
2024-04-05 19:51:14 +02:00
Alex Marchant
156855ab7e Remove dangling references from document.xml.rels
The file `word/_rels/document.xml.rels` is similar to `[Content_Types].xml` and
has references to other files in the archive. If those references aren't
removed, Word refuses to open the document.
2024-04-05 18:45:58 +02:00
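The idea can be sketched with the standard library; `drop_dangling_relationships` is a hypothetical illustration, not the actual mat2 implementation, and it assumes relationship targets are relative to `word/`:

```python
import xml.etree.ElementTree as ET

def drop_dangling_relationships(rels_xml: str, kept_files: set) -> str:
    """Remove <Relationship> entries whose Target no longer exists in
    the archive; Word refuses to open documents with such dangling
    references."""
    root = ET.fromstring(rels_xml)
    for rel in list(root):
        target = rel.get('Target', '').lstrip('/')
        if 'word/' + target not in kept_files:
            root.remove(rel)
    return ET.tostring(root, encoding='unicode')
```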
jvoisin
09672a2dcc Merge branch 'alexmarchant-utf-8-encode-all' 2024-04-05 18:33:30 +02:00
Alex Marchant
f2c898c92d Strip comment references from document.xml 2024-04-05 18:31:49 +02:00
Alex Marchant
f931a0ecee Make utf-8 explicit in all tree.write calls 2024-04-03 15:27:48 -04:00
Alex Marchant
61f39c4bd0 Strip comment references from document.xml 2024-04-03 15:20:00 -04:00
Alex Marchant
1b9ce34e2c Add test that checks if comments.xml is removed without errors 2024-04-03 15:03:33 -04:00
Alex Marchant
17e76ab6f0 Update comments file regex 2024-04-03 14:49:39 -04:00
jvoisin
94ef57c994 Add python3.12 in the CI 2024-01-02 02:50:44 +00:00
jvoisin
05d1ca5841 Improve the pyproject.yaml file
Prompted by !113
2023-12-31 18:34:39 +01:00
jvoisin
55b468ded7 Update Arch Linux package URL in INSTALL.md
Patch by https://github.com/felixonmars
2023-11-21 12:27:45 +01:00
jvoisin
0fcafa2edd Raise a ValueError for invalid FLAC files to please mypy 2023-11-13 15:03:42 +01:00
Romain Vigier
7405955ab5 parsers: Inherit the sandbox option when creating additional parsers 2023-11-13 13:11:35 +01:00
Romain Vigier
e6564509e1 mat2: Fix the --no-sandbox argument
The --no-sandbox argument was parsed incorrectly, meaning no sandbox was
used when the flag was absent, and the sandbox was used when it was present.
2023-11-13 13:06:38 +01:00
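The corrected behaviour reduces to a few lines of argparse; this is a hypothetical reduction of the logic, not mat2's actual argument handling:

```python
import argparse

def sandbox_enabled(argv) -> bool:
    """The flag should only disable the sandbox when it is present:
    sandboxing stays on by default, and --no-sandbox turns it off."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--no-sandbox', action='store_true',
                        help="Disable bubblewrap's sandboxing")
    args = parser.parse_args(argv)
    return not args.no_sandbox
```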
jvoisin
bbd5b2817c Fix the CI on Debian 2023-11-08 15:44:33 +01:00
jvoisin
73f2a87aa0 Provide a name for the loggers 2023-09-08 22:16:45 +02:00
jvoisin
abcdf07ef4 Properly handle a cairo exception 2023-09-07 16:31:34 +02:00
Rui Chen
a3081bce47 setup: use share/man/man1 for man1 2023-08-31 19:44:28 -04:00
georg
47d5529840 tests: drop duplicate dirty.epub file; it's stored below data/ as well 2023-08-03 13:42:15 +00:00
jvoisin
fa44794dfd Fix the project name in pyproject.toml 2023-08-02 21:21:44 +02:00
jvoisin
04786d75da Bump the changelog 2023-08-02 21:09:12 +02:00
jvoisin
cb7b5747a8 Add the manpage to the PyPI package
This should close #192
2023-07-11 22:03:56 +02:00
Jason Smalls
8c26020f67 Add more files to ignore for MSOffice documents 2023-07-11 21:38:22 +02:00
Jason Smalls
a0c97b25c4 Add a variant mimetype for bmp 2023-07-11 21:35:04 +02:00
Jason Smalls
1bcb945360 Harden get_meta in archive.py against variants of CVE-2022-35410 2023-07-11 21:31:53 +02:00
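Hardening of this kind typically means rejecting archive members whose path resolves outside the extraction root (a zip-slip check); a minimal sketch, with a hypothetical helper name:

```python
import os

def is_safe_member(temp_folder: str, member_name: str) -> bool:
    """Return False for members whose resolved path escapes the
    extraction root, e.g. via '../' components. A stricter check
    would also compare path components to reject sibling directories
    sharing a prefix."""
    full_path = os.path.join(temp_folder, member_name)
    return os.path.abspath(full_path).startswith(os.path.abspath(temp_folder))
```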
jvoisin
9159fe8705 Mention wp-mat in the readme 2023-06-05 19:52:13 +02:00
jvoisin
1b9608aecf Use proper type annotations instead of comments 2023-05-03 22:28:02 +02:00
jvoisin
2ac8c24dac Make use of is_dir/isdir for archives 2023-05-03 22:19:19 +02:00
jvoisin
71ecac85b0 Add some documentation about OSX 2023-04-11 21:35:25 +02:00
georg
b9677d8655 CI: codespell: drop obsolete list of ignored words
codespell was dropped via a63011b3f6.
Accordingly, this commit does some cleanup.
2023-03-21 13:18:54 +00:00
georg
6fde80d3e3 CI: shallow clone repository and limit depth to 5
The previous commit changed the strategy to 'clone', instead of 'fetch'
as before. While this fixes permission errors, it is also slower, as an
existing checkout of the repository will be ignored. To overcome this,
this commit limits the depth to 5.
2023-03-20 15:11:02 +00:00
georg
6c05360afa CI: 'clone' git repository instead of 'fetch'
While the former is slower, the latter might lead to errors such as
"fatal: detected dubious ownership in repository at", which is fixed
upstream in GitLab via
https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/3538, but
not yet released.

Closes #191
2023-03-20 15:10:56 +00:00
georg
596696dfbc CI: Add python3.{7,8,9,10,11} test jobs
Closes #187
2023-03-15 23:38:39 +00:00
jvoisin
daa17a3e9c Fix the CI on Archlinux 2023-03-12 13:29:46 +01:00
Gu1nn3zz
6061f47231 fix: Typing in the parser factory 2023-03-07 17:37:56 +00:00
georg
8b41764a3e CI: linting: ruff: specify image
Otherwise, this job might fail, depending on the runner which executes
the job, due to different configurations, especially wrt the default
image.

Ref https://0xacab.org/jvoisin/mat2/-/merge_requests/105
2023-03-07 11:25:17 +00:00
Rui Chen
ed0ffa5693 Update pyproject.toml to include version 2023-02-24 09:12:06 +00:00
jvoisin
b1c03bce72 Bump the changelog 2023-02-23 21:36:46 +01:00
jvoisin
a63011b3f6 Improve the CI
- Remove some useless linters
- Make use of ruff
2023-02-20 21:15:07 +01:00
jvoisin
e41390eb64 Explicitly pass a parameter to functools.lru_cache 2023-01-31 20:42:39 +01:00
jvoisin
66a36f6b15 Bump the changelog 2023-01-28 17:55:02 +01:00
jvoisin
3cb3f58084 Another typing pass 2023-01-28 17:22:26 +01:00
jvoisin
39fb254e01 Fix the type annotations 2023-01-28 15:57:20 +00:00
jvoisin
1f73a16ef3 imghdr is deprecated 2023-01-14 15:38:12 +01:00
jvoisin
e8b38f1101 Revert "Simplify a bit the typing annotations of ./mat2"
This reverts commit 29057d6cdf.
2023-01-14 15:35:21 +01:00
jvoisin
8d7230ba16 Fix -l output 2023-01-07 17:10:02 +01:00
35 changed files with 603 additions and 471 deletions

.github/workflows/builds.yaml vendored Normal file

@@ -0,0 +1,45 @@
name: CI for Python versions
on:
  pull_request:
  push:
  schedule:
    - cron: '0 16 * * 5'
jobs:
  linting:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v5
    - uses: actions/setup-python@v5
    - run: pip install ruff
    - run: |
        ruff check .
  build:
    needs: linting
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12", "3.13", "3.14.0-rc.2"]
    steps:
    - uses: actions/checkout@v5
    - name: Setup Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        sudo apt-get install --no-install-recommends --no-install-suggests --yes \
          ffmpeg \
          gir1.2-gdkpixbuf-2.0 \
          gir1.2-poppler-0.18 \
          gir1.2-rsvg-2.0 \
          libimage-exiftool-perl \
          python3-gi-cairo \
          libcairo2-dev \
          libgirepository-2.0-dev \
          libgirepository1.0-dev \
          gobject-introspection \
          python3-mutagen
        pip install .
    - name: Build and run the testsuite
      run: python3 -m unittest discover -v


@@ -1,79 +0,0 @@
include:
  - template: Security/SAST.gitlab-ci.yml

variables:
  CONTAINER_REGISTRY: $CI_REGISTRY/georg/mat2-ci-images

stages:
  - linting
  - test

.prepare_env: &prepare_env
  before_script:  # This is needed to not run the testsuite as root
    - useradd --home-dir ${CI_PROJECT_DIR} mat2
    - chown -R mat2 .

linting:bandit:
  image: $CONTAINER_REGISTRY:linting
  stage: linting
  script:  # TODO: remove B405 and B314
    - bandit ./mat2 --format txt --skip B101
    - bandit -r ./libmat2 --format txt --skip B101,B404,B603,B405,B314,B108,B311

linting:codespell:
  image: $CONTAINER_REGISTRY:linting
  stage: linting
  script:
    # Run codespell to check for spelling errors; ignore errors about binary
    # files, use a config with ignored words and exclude the git directory,
    # which might contain false positives
    - codespell -q 2 -I utils/ci/codespell/ignored_words.txt -S .git

linting:pylint:
  image: $CONTAINER_REGISTRY:linting
  stage: linting
  script:
    - pylint --disable=no-else-return,no-else-raise,no-else-continue,unnecessary-comprehension,raise-missing-from,unsubscriptable-object,use-dict-literal,unspecified-encoding,consider-using-f-string,use-list-literal,too-many-statements --extension-pkg-whitelist=cairo,gi ./libmat2 ./mat2

linting:mypy:
  image: $CONTAINER_REGISTRY:linting
  stage: linting
  script:
    - mypy --ignore-missing-imports mat2 libmat2/*.py

tests:archlinux:
  image: $CONTAINER_REGISTRY:archlinux
  stage: test
  script:
    - python3 -m unittest discover -v

tests:debian:
  image: $CONTAINER_REGISTRY:debian
  stage: test
  <<: *prepare_env
  script:
    - apt-get -qqy purge bubblewrap
    - su - mat2 -c "python3-coverage run --branch -m unittest discover -s tests/"
    - su - mat2 -c "python3-coverage report --fail-under=95 -m --include 'libmat2/*'"

tests:debian_with_bubblewrap:
  image: $CONTAINER_REGISTRY:debian
  stage: test
  allow_failure: true
  <<: *prepare_env
  script:
    - apt-get -qqy install bubblewrap
    - python3 -m unittest discover -v

tests:fedora:
  image: $CONTAINER_REGISTRY:fedora
  stage: test
  script:
    - python3 -m unittest discover -v

tests:gentoo:
  image: $CONTAINER_REGISTRY:gentoo
  stage: test
  <<: *prepare_env
  script:
    - su - mat2 -c "python3 -m unittest discover -v"


@@ -1,18 +0,0 @@
[FORMAT]
good-names=e,f,i,x,s
max-locals=20
[MESSAGES CONTROL]
disable=
fixme,
invalid-name,
duplicate-code,
missing-docstring,
protected-access,
abstract-method,
wrong-import-position,
catching-non-exception,
cell-var-from-loop,
locally-disabled,
raise-missing-from,
invalid-sequence-index, # pylint doesn't like things like `Tuple[int, bytes]` in type annotation


@@ -1,3 +1,28 @@
# 0.13.5 - 2025-01-09
- Keep orientation metadata on jpeg and tiff files
- Improve cairo-related error/exceptions handling
- Improve the logging
- Improve the sandboxing
- Improve Python3.12 support
- Improve MSOffice documents handling
# 0.13.4 - 2023-08-02
- Add documentation about mat2 on OSX
- Make use of python3.7 constructs to simplify code
- Use more modern type annotations
- Harden get_meta in archive.py against variants of CVE-2022-35410
- Improve MSOffice document support
- Package the manpage on pypi
# 0.13.3 - 2023-02-23
- Fix a decorator argument
# 0.13.2 - 2023-01-28
- Fix a crash on some python versions
# 0.13.1 - 2023-01-07
- Improve xlsx support


@@ -1,9 +1,9 @@
# Contributing to mat2
The main repository for mat2 is on [0xacab]( https://0xacab.org/jvoisin/mat2 ),
The main repository for mat2 is on [github]( https://github.com/jvoisin/mat2 ),
but you can send patches to jvoisin by [email](https://dustri.org/) if you prefer.
Do feel free to pick up [an issue]( https://0xacab.org/jvoisin/mat2/issues )
Do feel free to pick up [an issue]( https://github.com/jvoisin/mat2/issues )
and to send a pull-request.
Before sending the pull-request, please do check that everything is fine by
@@ -27,18 +27,19 @@ Since mat2 is written in Python3, please conform as much as possible to the
# Doing a release
1. Update the [changelog](https://0xacab.org/jvoisin/mat2/blob/master/CHANGELOG.md)
2. Update the version in the [mat2](https://0xacab.org/jvoisin/mat2/blob/master/mat2) file
3. Update the version in the [setup.py](https://0xacab.org/jvoisin/mat2/blob/master/setup.py) file
4. Update the version and date in the [man page](https://0xacab.org/jvoisin/mat2/blob/master/doc/mat2.1)
5. Commit the changelog, man page, mat2 and setup.py files
6. Create a tag with `git tag -s $VERSION`
7. Push the commit with `git push origin master`
8. Push the tag with `git push --tags`
9. Download the gitlab archive of the release
10. Diff it against the local copy
11. If there is no difference, sign the archive with `gpg --armor --detach-sign mat2-$VERSION.tar.xz`
12. Upload the signature on Gitlab's [tag page](https://0xacab.org/jvoisin/mat2/tags) and add the changelog there
13. Announce the release on the [mailing list](https://mailman.boum.org/listinfo/mat-dev)
14. Sign'n'upload the new version on pypi with `python3 setup.py sdist bdist_wheel` then `twine upload -s dist/*`
15. Do the secret release dance
1. Update the [changelog](https://github.com/jvoisin/mat2/blob/master/CHANGELOG.md)
2. Update the version in the [mat2](https://github.com/jvoisin/mat2/blob/master/mat2) file
3. Update the version in the [setup.py](https://github.com/jvoisin/mat2/blob/master/setup.py) file
4. Update the version in the [pyproject.toml](https://github.com/jvoisin/mat2/blob/master/pyproject.toml) file
5. Update the version and date in the [man page](https://github.com/jvoisin/mat2/blob/master/doc/mat2.1)
6. Commit the modified files
7. Create a tag with `git tag -s $VERSION`
8. Push the commit with `git push origin master`
9. Push the tag with `git push --tags`
10. Download the gitlab archive of the release
11. Diff it against the local copy
12. If there is no difference, sign the archive with `gpg --armor --detach-sign mat2-$VERSION.tar.xz`
13. Upload the signature on Gitlab's [tag page](https://github.com/jvoisin/mat2/tags) and add the changelog there
14. Announce the release on the [mailing list](https://mailman.boum.org/listinfo/mat-dev)
15. Sign'n'upload the new version on pypi with `python3 setup.py sdist bdist_wheel` then `twine upload -s dist/*`
16. Do the secret release dance


@@ -19,7 +19,7 @@ installed, mat2 uses it to sandbox any external processes it invokes.
## Arch Linux
Thanks to [kpcyrd](https://archlinux.org/packages/?maintainer=kpcyrd), there is a package available on
[Arch linux's AUR](https://archlinux.org/packages/community/any/mat2/).
[Arch linux's AUR](https://archlinux.org/packages/extra/any/mat2/).
## Debian
@@ -49,3 +49,22 @@ dnf -y install mat2
## Gentoo
mat2 is available in the [torbrowser overlay](https://github.com/MeisterP/torbrowser-overlay).
# OSX
## Homebrew
mat2 is [available on homebrew](https://formulae.brew.sh/formula/mat2):
```
brew install mat2
```
## MacPorts
mat2 is [available on MacPorts](https://ports.macports.org/port/mat2/):
```
port install mat2
```

README.md

@@ -1,190 +1 @@
```
_____ _____ _____ ___
| | _ |_ _|_ | Keep your data,
| | | | |_| | | | | _| trash your meta!
|_|_|_|_| |_| |_| |___|
```
# Metadata and privacy
Metadata consist of information that characterizes data.
Metadata are used to provide documentation for data products.
In essence, metadata answer who, what, when, where, why, and how about
every facet of the data that are being documented.
Metadata within a file can tell a lot about you.
Cameras record data about when a picture was taken and what
camera was used. Office documents such as PDFs and Word files automatically
embed author and company information.
Maybe you don't want to disclose that information.
This is precisely the job of mat2: getting rid, as much as possible, of
metadata.
mat2 provides:
- a library called `libmat2`;
- a command line tool called `mat2`;
- a service menu for Dolphin, KDE's default file manager.
If you prefer a regular graphical user interface, you might be interested in
[Metadata Cleaner](https://metadatacleaner.romainvigier.fr/), which uses
`mat2` under the hood.
# Requirements
- `python3-mutagen` for audio support
- `python3-gi-cairo` and `gir1.2-poppler-0.18` for PDF support
- `gir1.2-gdkpixbuf-2.0` for images support
- `gir1.2-rsvg-2.0` for svg support
- `FFmpeg`, optionally, for video support
- `libimage-exiftool-perl` for everything else
- `bubblewrap`, optionally, for sandboxing
Please note that mat2 requires at least Python3.5.
# Requirements setup on macOS (OS X) using [Homebrew](https://brew.sh/)
```bash
brew install exiftool cairo pygobject3 poppler gdk-pixbuf librsvg ffmpeg
```
# Running the test suite
```bash
$ python3 -m unittest discover -v
```
And if you want to see the coverage:
```bash
$ python3-coverage run --branch -m unittest discover -s tests/
$ python3-coverage report -m --include 'libmat2/*'
```
# How to use mat2
```
usage: mat2 [-h] [-V] [--unknown-members policy] [--inplace] [--no-sandbox]
[-v] [-l] [--check-dependencies] [-L | -s]
[files [files ...]]
Metadata anonymisation toolkit 2
positional arguments:
files the files to process
optional arguments:
-h, --help show this help message and exit
-V, --verbose show more verbose status information
--unknown-members policy
how to handle unknown members of archive-style files
(policy should be one of: abort, omit, keep) [Default:
abort]
--inplace clean in place, without backup
--no-sandbox Disable bubblewrap's sandboxing
-v, --version show program's version number and exit
-l, --list list all supported fileformats
--check-dependencies check if mat2 has all the dependencies it needs
-L, --lightweight remove SOME metadata
-s, --show list harmful metadata detectable by mat2 without
removing them
```
Note that mat2 **will not** clean files in-place, but will produce, for
example, with a file named "myfile.png" a cleaned version named
"myfile.cleaned.png".
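The naming convention can be expressed in a couple of lines; `cleaned_filename` is a hypothetical helper illustrating the convention, not part of mat2's API:

```python
import os

def cleaned_filename(path: str) -> str:
    """mat2 writes the cleaned copy next to the original,
    inserting '.cleaned' before the extension."""
    root, ext = os.path.splitext(path)
    return root + '.cleaned' + ext

print(cleaned_filename('myfile.png'))  # myfile.cleaned.png
```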
## Web interface
It's possible to run mat2 as a web service, via
[mat2-web](https://0xacab.org/jvoisin/mat2-web).
## Desktop GUI
For GNU/Linux desktops, it's possible to use the
[Metadata Cleaner](https://gitlab.com/rmnvgr/metadata-cleaner) GTK application.
# Supported formats
The following formats are supported: avi, bmp, css, epub/ncx, flac, gif, jpeg,
m4a/mp2/mp3/…, mp4, odc/odf/odg/odi/odp/ods/odt/…, off/opus/oga/spx/…, pdf,
png, ppm, pptx/xlsx/docx/…, svg/svgz/…, tar/tar.gz/tar.bz2/tar.xz/…, tiff,
torrent, wav, wmv, zip, …
# Notes about detecting metadata
While mat2 is doing its very best to display metadata when the `--show` flag is
passed, it doesn't mean that a file is clean from any metadata if mat2 doesn't
show any. There is no reliable way to detect every single possible metadata for
complex file formats.
This is why you shouldn't rely on metadata's presence to decide if your file must
be cleaned or not.
# Notes about the lightweight mode
By default, mat2 might slightly alter the data in your files in order to remove
as much metadata as possible. For example, text in a PDF might no longer be
selectable, compressed images might get recompressed, …
Since some users might be willing to accept some residual metadata in exchange
for the guarantee that mat2 won't modify the data in their files, the
`-L` flag does precisely that.
# Related software
- The first iteration of [MAT](https://mat.boum.org)
- [Exiftool](https://sno.phy.queensu.ca/~phil/exiftool/)
- [pdf-redact-tools](https://github.com/firstlookmedia/pdf-redact-tools), that
tries to deal with *printer dots* too.
- [pdfparanoia](https://github.com/kanzure/pdfparanoia), that removes
watermarks from PDF.
- [Scrambled Exif](https://f-droid.org/packages/com.jarsilio.android.scrambledeggsif/),
an open-source Android application to remove metadata from pictures.
- [Dangerzone](https://dangerzone.rocks/), designed to sanitize harmful documents
into harmless ones.
# Contact
If possible, use the [issues system](https://0xacab.org/jvoisin/mat2/issues)
or the [mailing list](https://www.autistici.org/mailman/listinfo/mat-dev)
Should a more private contact be needed (eg. for reporting security issues),
you can email Julien (jvoisin) Voisin at `julien.voisin+mat2@dustri.org`,
using the gpg key `9FCDEE9E1A381F311EA62A7404D041E8171901CC`.
# Donations
If you want to donate some money, please give it to [Tails]( https://tails.boum.org/donate/?r=contribute ).
# License
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU Lesser General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Copyright 2018 Julien (jvoisin) Voisin <julien.voisin+mat2@dustri.org>
Copyright 2016 Marie-Rose for mat2's logo
The `tests/data/dirty_with_nsid.docx` file is licensed under GPLv3,
and was borrowed from the Calibre project: https://calibre-ebook.com/downloads/demos/demo.docx
The `narrated_powerpoint_presentation.pptx` file is in the public domain.
# Thanks
mat2 wouldn't exist without:
- the [Google Summer of Code](https://summerofcode.withgoogle.com/);
- the fine people from [Tails]( https://tails.boum.org);
- friends
Many thanks to them!
# This repository is deprecated, please use https://github.com/jvoisin/mat2 instead


@@ -19,14 +19,14 @@ details.
# jpegoptim, optipng, …
While designed to reduce as much as possible the size of pictures,
those software can be used to remove metadata. They usually have very good
those software can be used to remove metadata. They usually have excellent
support for a single picture format, and can be used in place of mat2 for them.
# PDF Redact Tools
[PDF Redact Tools](https://github.com/firstlookmedia/pdf-redact-tools) is
a software developed by the people from [First Look
software developed by the people from [First Look
Media](https://firstlook.media/), the entity behind, amongst other things,
[The Intercept](https://theintercept.com/).
@@ -34,13 +34,13 @@ The tool uses roughly the same approach than mat2 to deal with PDF,
which is unfortunately the only fileformat that it does support.
It's interesting to note that it has counter-measures against
[yellow dots](https://en.wikipedia.org/wiki/Machine_Identification_Code),
a capacity that mat2 [doesn't possess yet](https://0xacab.org/jvoisin/mat2/issues/43).
a capacity that mat2 doesn't have.
# Exiv2
[Exiv2](https://www.exiv2.org/) was considered for mat2,
but it currently [misses a lot of metadata](https://0xacab.org/jvoisin/mat2/issues/85)
but it currently misses a lot of metadata.
# Other non-open-source software / online services


@@ -1,4 +1,4 @@
.TH mat2 "1" "January 2023" "mat2 0.13.1" "User Commands"
.TH mat2 "1" "January 2025" "mat2 0.13.5" "User Commands"
.SH NAME
mat2 \- the metadata anonymisation toolkit 2
@@ -84,7 +84,7 @@ but keep in mind by doing so, some metadata \fBwon't be cleaned\fR.
While mat2 does its very best to remove every single metadata,
it's still in beta, and \fBsome\fR might remain. Should you encounter
some issues, check the bugtracker: https://0xacab.org/jvoisin/mat2/issues
some issues, check the bugtracker: https://github.com/jvoisin/mat2/issues
.PP
Please use accordingly and be careful.


@@ -2,14 +2,10 @@
import enum
import importlib
from typing import Optional, Union
from typing import Dict
from . import exiftool, video
# make pyflakes happy
assert Optional
assert Union
# A set of extension that aren't supported, despite matching a supported mimetype
UNSUPPORTED_EXTENSIONS = {
'.asc',
@@ -66,8 +62,9 @@ CMD_DEPENDENCIES = {
},
}
def check_dependencies() -> dict[str, dict[str, bool]]:
ret = dict() # type: dict[str, dict]
def check_dependencies() -> Dict[str, Dict[str, bool]]:
ret: Dict[str, Dict] = dict()
for key, value in DEPENDENCIES.items():
ret[key] = {


@@ -1,7 +1,7 @@
import abc
import os
import re
from typing import Union
from typing import Union, Set, Dict
class AbstractParser(abc.ABC):
@@ -9,8 +9,8 @@ class AbstractParser(abc.ABC):
It might yield `ValueError` on instantiation on invalid files,
and `RuntimeError` when something went wrong in `remove_all`.
"""
meta_list = set() # type: set[str]
mimetypes = set() # type: set[str]
meta_list: Set[str] = set()
mimetypes: Set[str] = set()
def __init__(self, filename: str) -> None:
"""
@@ -33,8 +33,11 @@ class AbstractParser(abc.ABC):
self.sandbox = True
@abc.abstractmethod
def get_meta(self) -> dict[str, Union[str, dict]]:
"""Return all the metadata of the current file"""
def get_meta(self) -> Dict[str, Union[str, Dict]]:
"""Return all the metadata of the current file
:raises RuntimeError: Raised if the cleaning process went wrong.
"""
@abc.abstractmethod
def remove_all(self) -> bool:


@@ -7,7 +7,7 @@ import tempfile
import os
import logging
import shutil
from typing import Pattern, Union, Any
from typing import Pattern, Union, Any, Set, Dict, List
from . import abstract, UnknownMemberPolicy, parser_factory
@@ -44,20 +44,20 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
def __init__(self, filename):
super().__init__(filename)
# We ignore typing here because mypy is too stupid
self.archive_class = None # type: ignore
self.member_class = None # type: ignore
self.archive_class = None # type: ignore
self.member_class = None # type: ignore
# Those are the files that have a format that _isn't_
# supported by mat2, but that we want to keep anyway.
self.files_to_keep = set() # type: set[Pattern]
self.files_to_keep: Set[Pattern] = set()
# Those are the files that we _do not_ want to keep,
# no matter if they are supported or not.
self.files_to_omit = set() # type: set[Pattern]
self.files_to_omit: Set[Pattern] = set()
# what should the parser do if it encounters an unknown file in
# the archive?
self.unknown_member_policy = UnknownMemberPolicy.ABORT # type: UnknownMemberPolicy
self.unknown_member_policy: UnknownMemberPolicy = UnknownMemberPolicy.ABORT
# The LGTM comment is to mask a false-positive,
# see https://lgtm.com/projects/g/jvoisin/mat2/
@@ -72,7 +72,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
# pylint: disable=unused-argument
return True # pragma: no cover
def _specific_get_meta(self, full_path: str, file_path: str) -> dict[str, Any]:
def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
""" This method can be used to extract specific metadata
from files present in the archive."""
# pylint: disable=unused-argument
@@ -87,7 +87,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
@staticmethod
@abc.abstractmethod
def _get_all_members(archive: ArchiveClass) -> list[ArchiveMember]:
def _get_all_members(archive: ArchiveClass) -> List[ArchiveMember]:
"""Return all the members of the archive."""
@staticmethod
@@ -97,7 +97,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
@staticmethod
@abc.abstractmethod
def _get_member_meta(member: ArchiveMember) -> dict[str, str]:
def _get_member_meta(member: ArchiveMember) -> Dict[str, str]:
"""Return all the metadata of a given member."""
@staticmethod
@@ -105,6 +105,11 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
def _get_member_name(member: ArchiveMember) -> str:
"""Return the name of the given member."""
@staticmethod
@abc.abstractmethod
def _is_dir(member: ArchiveMember) -> bool:
"""Return true is the given member is a directory."""
@abc.abstractmethod
def _add_file_to_archive(self, archive: ArchiveClass, member: ArchiveMember,
full_path: str):
@@ -128,8 +133,8 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
# pylint: disable=unused-argument
return member
def get_meta(self) -> dict[str, Union[str, dict]]:
meta = dict() # type: dict[str, Union[str, dict]]
def get_meta(self) -> Dict[str, Union[str, Dict]]:
meta: Dict[str, Union[str, Dict]] = dict()
with self.archive_class(self.filename) as zin:
temp_folder = tempfile.mkdtemp()
@@ -138,12 +143,20 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
local_meta = self._get_member_meta(item)
member_name = self._get_member_name(item)
if member_name[-1] == '/': # pragma: no cover
# `is_dir` is added in Python3.6
if self._is_dir(item): # pragma: no cover
continue # don't keep empty folders
zin.extract(member=item, path=temp_folder)
full_path = os.path.join(temp_folder, member_name)
if not os.path.abspath(full_path).startswith(temp_folder):
logging.error("%s contains a file (%s) pointing outside (%s) of its root.",
self.filename, member_name, full_path)
break
try:
zin.extract(member=item, path=temp_folder)
except OSError as e:
logging.error("Unable to extraxt %s from %s: %s", item, self.filename, e)
os.chmod(full_path, stat.S_IRUSR)
specific_meta = self._specific_get_meta(full_path, member_name)
@@ -151,6 +164,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
member_parser, _ = parser_factory.get_parser(full_path) # type: ignore
if member_parser:
member_parser.sandbox = self.sandbox
local_meta = {**local_meta, **member_parser.get_meta()}
if local_meta:
@@ -170,7 +184,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
# Sort the items to process, to reduce fingerprinting,
# and keep them in the `items` variable.
items = list() # type: list[ArchiveMember]
items: List[ArchiveMember] = list()
for item in sorted(self._get_all_members(zin), key=self._get_member_name):
# Some fileformats do require to have the `mimetype` file
# as the first file in the archive.
@@ -183,7 +197,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
# we're iterating (and thus inserting) them in lexicographic order.
for item in items:
member_name = self._get_member_name(item)
if member_name[-1] == '/': # `is_dir` is added in Python3.6
if self._is_dir(item):
continue # don't keep empty folders
full_path = os.path.join(temp_folder, member_name)
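The ordering logic above re-adds any member that must come first (some formats require a `mimetype` entry as the first file in the archive), then the rest in lexicographic order, so the rebuilt archive leaks nothing about the original insertion order. A standalone sketch of that ordering (function name illustrative):

```python
def order_members(names, must_be_first=('mimetype',)):
    """Deterministic member order: mandatory-first entries, then sorted rest."""
    head = [n for n in must_be_first if n in names]
    rest = sorted(n for n in names if n not in must_be_first)
    return head + rest
```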
@@ -238,6 +252,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
abort = True
continue
else:
member_parser.sandbox = self.sandbox
if member_parser.remove_all() is False:
logging.warning("In file %s, something went wrong \
with the cleaning of %s \
@@ -264,6 +279,7 @@ class ArchiveBasedAbstractParser(abstract.AbstractParser):
class TarParser(ArchiveBasedAbstractParser):
mimetypes = {'application/x-tar'}
def __init__(self, filename):
super().__init__(filename)
# yes, it's tarfile.open and not tarfile.TarFile,
@@ -336,7 +352,7 @@ class TarParser(ArchiveBasedAbstractParser):
return member
@staticmethod
def _get_member_meta(member: ArchiveMember) -> dict[str, str]:
def _get_member_meta(member: ArchiveMember) -> Dict[str, str]:
assert isinstance(member, tarfile.TarInfo) # please mypy
metadata = {}
if member.mtime != 0:
@@ -358,7 +374,7 @@ class TarParser(ArchiveBasedAbstractParser):
archive.add(full_path, member.name, filter=TarParser._clean_member) # type: ignore
@staticmethod
def _get_all_members(archive: ArchiveClass) -> list[ArchiveMember]:
def _get_all_members(archive: ArchiveClass) -> List[ArchiveMember]:
assert isinstance(archive, tarfile.TarFile) # please mypy
return archive.getmembers() # type: ignore
@@ -373,6 +389,11 @@ class TarParser(ArchiveBasedAbstractParser):
member.mode = permissions
return member
@staticmethod
def _is_dir(member: ArchiveMember) -> bool:
assert isinstance(member, tarfile.TarInfo) # please mypy
return member.isdir()
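`_clean_member` above is passed as the `filter=` argument to `TarFile.add`, zeroing out what the tar header records about each member. A minimal standalone version of that filter:

```python
import tarfile

def clean_member(member):
    """Strip the per-member metadata a tar header records."""
    member.mtime = 0
    member.uid = member.gid = 0
    member.uname = member.gname = ''
    return member
```

It would be used as `archive.add(full_path, arcname, filter=clean_member)`.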
class TarGzParser(TarParser):
compression = ':gz'
@@ -391,7 +412,8 @@ class TarXzParser(TarParser):
class ZipParser(ArchiveBasedAbstractParser):
mimetypes = {'application/zip'}
def __init__(self, filename):
def __init__(self, filename: str):
super().__init__(filename)
self.archive_class = zipfile.ZipFile
self.member_class = zipfile.ZipInfo
@@ -412,7 +434,7 @@ class ZipParser(ArchiveBasedAbstractParser):
return member
@staticmethod
def _get_member_meta(member: ArchiveMember) -> dict[str, str]:
def _get_member_meta(member: ArchiveMember) -> Dict[str, str]:
assert isinstance(member, zipfile.ZipInfo) # please mypy
metadata = {}
if member.create_system == 3: # this is Linux
@@ -439,7 +461,7 @@ class ZipParser(ArchiveBasedAbstractParser):
compress_type=member.compress_type)
@staticmethod
def _get_all_members(archive: ArchiveClass) -> list[ArchiveMember]:
def _get_all_members(archive: ArchiveClass) -> List[ArchiveMember]:
assert isinstance(archive, zipfile.ZipFile) # please mypy
return archive.infolist() # type: ignore
@@ -458,3 +480,8 @@ class ZipParser(ArchiveBasedAbstractParser):
assert isinstance(member, zipfile.ZipInfo) # please mypy
member.compress_type = compression
return member
@staticmethod
def _is_dir(member: ArchiveMember) -> bool:
assert isinstance(member, zipfile.ZipInfo) # please mypy
return member.is_dir()
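`zipfile.ZipInfo.is_dir()` (available since Python 3.6) replaces the manual trailing-slash check on the member name used before this change; a quick illustration:

```python
import zipfile

# ZipInfo.is_dir() just checks for a trailing '/', but reads better
# than inspecting member_name[-1] by hand.
directory = zipfile.ZipInfo('docs/')
regular = zipfile.ZipInfo('docs/readme.txt')
```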

View File

@@ -2,7 +2,7 @@ import mimetypes
import os
import shutil
import tempfile
from typing import Union
from typing import Union, Dict
import mutagen
@@ -18,10 +18,10 @@ class MutagenParser(abstract.AbstractParser):
except mutagen.MutagenError:
raise ValueError
def get_meta(self) -> dict[str, Union[str, dict]]:
def get_meta(self) -> Dict[str, Union[str, Dict]]:
f = mutagen.File(self.filename)
if f.tags:
return {k:', '.join(map(str, v)) for k, v in f.tags.items()}
return {k: ', '.join(map(str, v)) for k, v in f.tags.items()}
return {}
def remove_all(self) -> bool:
@@ -38,8 +38,8 @@ class MutagenParser(abstract.AbstractParser):
class MP3Parser(MutagenParser):
mimetypes = {'audio/mpeg', }
def get_meta(self) -> dict[str, Union[str, dict]]:
metadata = {} # type: dict[str, Union[str, dict]]
def get_meta(self) -> Dict[str, Union[str, Dict]]:
metadata: Dict[str, Union[str, Dict]] = dict()
meta = mutagen.File(self.filename).tags
if not meta:
return metadata
@@ -68,12 +68,12 @@ class FLACParser(MutagenParser):
f.save(deleteid3=True)
return True
def get_meta(self) -> dict[str, Union[str, dict]]:
def get_meta(self) -> Dict[str, Union[str, Dict]]:
meta = super().get_meta()
for num, picture in enumerate(mutagen.File(self.filename).pictures):
name = picture.desc if picture.desc else 'Cover %d' % num
extension = mimetypes.guess_extension(picture.mime)
if extension is None: # pragma: no cover
if extension is None: # pragma: no cover
meta[name] = 'harmful data'
continue
@@ -82,6 +82,9 @@ class FLACParser(MutagenParser):
with open(fname, 'wb') as f:
f.write(picture.data)
p, _ = parser_factory.get_parser(fname) # type: ignore
if p is None:
raise ValueError
p.sandbox = self.sandbox
# Mypy chokes on ternaries :/
meta[name] = p.get_meta() if p else 'harmful data' # type: ignore
os.remove(fname)
@@ -98,6 +101,7 @@ class WAVParser(video.AbstractFFmpegParser):
'MIMEType', 'NumChannels', 'SampleRate', 'SourceFile',
}
class AIFFParser(video.AbstractFFmpegParser):
mimetypes = {'audio/aiff', 'audio/x-aiff'}
meta_allowlist = {'AvgBytesPerSec', 'BitsPerSample', 'Directory',

View File

@@ -12,7 +12,7 @@ import shutil
import subprocess
import tempfile
import functools
from typing import Optional
from typing import Optional, List
__all__ = ['PIPE', 'run', 'CalledProcessError']
@@ -22,7 +22,7 @@ CalledProcessError = subprocess.CalledProcessError
# pylint: disable=subprocess-run-check
@functools.lru_cache
@functools.lru_cache(maxsize=None)
def _get_bwrap_path() -> str:
which_path = shutil.which('bwrap')
if which_path:
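The bare `@functools.lru_cache` decorator form only works on Python 3.8 and later; calling it as `lru_cache(maxsize=None)` behaves identically while also running on older interpreters, which is what this hunk switches to. A small sketch (the function name and path are illustrative, not mat2's):

```python
import functools

@functools.lru_cache(maxsize=None)
def tool_path(name):
    # Stand-in for a shutil.which()-style lookup; the cache means
    # repeated calls with the same name don't redo the work.
    return '/usr/bin/' + name
```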
@@ -33,7 +33,7 @@ def _get_bwrap_path() -> str:
def _get_bwrap_args(tempdir: str,
input_filename: str,
output_filename: Optional[str] = None) -> list[str]:
output_filename: Optional[str] = None) -> List[str]:
ro_bind_args = []
cwd = os.getcwd()
@@ -78,7 +78,7 @@ def _get_bwrap_args(tempdir: str,
return args
def run(args: list[str],
def run(args: List[str],
input_filename: str,
output_filename: Optional[str] = None,
**kwargs) -> subprocess.CompletedProcess:

View File

@@ -3,10 +3,11 @@ import re
import uuid
import zipfile
import xml.etree.ElementTree as ET # type: ignore
from typing import Any
from typing import Any, Dict
from . import archive, office
class EPUBParser(archive.ZipParser):
mimetypes = {'application/epub+zip', }
metadata_namespace = '{http://purl.org/dc/elements/1.1/}'
@@ -28,7 +29,6 @@ class EPUBParser(archive.ZipParser):
}))
self.uniqid = uuid.uuid4()
def is_archive_valid(self):
super().is_archive_valid()
with zipfile.ZipFile(self.filename) as zin:
@@ -37,7 +37,7 @@ class EPUBParser(archive.ZipParser):
if member_name.endswith('META-INF/encryption.xml'):
raise ValueError('the file contains encrypted fonts')
def _specific_get_meta(self, full_path, file_path) -> dict[str, Any]:
def _specific_get_meta(self, full_path, file_path) -> Dict[str, Any]:
if not file_path.endswith('.opf'):
return {}
@@ -73,7 +73,6 @@ class EPUBParser(archive.ZipParser):
short_empty_elements=False)
return True
def __handle_tocncx(self, full_path: str) -> bool:
try:
tree, namespace = office._parse_xml(full_path)

View File

@@ -4,7 +4,7 @@ import logging
import os
import shutil
import subprocess
from typing import Union
from typing import Union, Set, Dict
from . import abstract
from . import bubblewrap
@@ -15,9 +15,9 @@ class ExiftoolParser(abstract.AbstractParser):
from an input file, hence why several parsers are re-using its `get_meta`
method.
"""
meta_allowlist = set() # type: set[str]
meta_allowlist: Set[str] = set()
def get_meta(self) -> dict[str, Union[str, dict]]:
def get_meta(self) -> Dict[str, Union[str, Dict]]:
try:
if self.sandbox:
out = bubblewrap.run([_get_exiftool_path(), '-json',
@@ -67,7 +67,7 @@ class ExiftoolParser(abstract.AbstractParser):
return False
return True
@functools.lru_cache
@functools.lru_cache(maxsize=None)
def _get_exiftool_path() -> str: # pragma: no cover
which_path = shutil.which('exiftool')
if which_path:

View File

@@ -1,13 +1,13 @@
import shutil
from typing import Union
from typing import Union, Dict
from . import abstract
class HarmlessParser(abstract.AbstractParser):
""" This is the parser for filetypes that can not contain metadata. """
mimetypes = {'text/plain', 'image/x-ms-bmp'}
mimetypes = {'text/plain', 'image/x-ms-bmp', 'image/bmp'}
def get_meta(self) -> dict[str, Union[str, dict]]:
def get_meta(self) -> Dict[str, Union[str, Dict]]:
return dict()
def remove_all(self) -> bool:

View File

@@ -1,7 +1,6 @@
import imghdr
import os
import re
from typing import Union, Any
from typing import Union, Any, Dict
import cairo
@@ -12,9 +11,6 @@ from gi.repository import GdkPixbuf, GLib, Rsvg
from . import exiftool, abstract
# Make pyflakes happy
assert Any
class SVGParser(exiftool.ExiftoolParser):
mimetypes = {'image/svg+xml', }
meta_allowlist = {'Directory', 'ExifToolVersion', 'FileAccessDate',
@@ -49,7 +45,7 @@ class SVGParser(exiftool.ExiftoolParser):
surface.finish()
return True
def get_meta(self) -> dict[str, Union[str, dict]]:
def get_meta(self) -> Dict[str, Union[str, Dict]]:
meta = super().get_meta()
# The namespace is mandatory, but only the …/2000/svg is valid.
@@ -58,6 +54,7 @@ class SVGParser(exiftool.ExiftoolParser):
meta.pop('Xmlns')
return meta
class PNGParser(exiftool.ExiftoolParser):
mimetypes = {'image/png', }
meta_allowlist = {'SourceFile', 'ExifToolVersion', 'FileName',
@@ -71,9 +68,6 @@ class PNGParser(exiftool.ExiftoolParser):
def __init__(self, filename):
super().__init__(filename)
if imghdr.what(filename) != 'png':
raise ValueError
try: # better fail here than later
cairo.ImageSurface.create_from_png(self.filename)
except: # pragma: no cover
@@ -111,7 +105,6 @@ class GdkPixbufAbstractParser(exiftool.ExiftoolParser):
def __init__(self, filename):
super().__init__(filename)
# we can't use imghdr here because of https://bugs.python.org/issue28591
try:
GdkPixbuf.Pixbuf.new_from_file(self.filename)
except GLib.GError:
@@ -123,6 +116,7 @@ class GdkPixbufAbstractParser(exiftool.ExiftoolParser):
_, extension = os.path.splitext(self.filename)
pixbuf = GdkPixbuf.Pixbuf.new_from_file(self.filename)
pixbuf = GdkPixbuf.Pixbuf.apply_embedded_orientation(pixbuf)
if extension.lower() == '.jpg':
extension = '.jpeg' # gdk is picky
elif extension.lower() == '.tif':
@@ -145,7 +139,7 @@ class JPGParser(GdkPixbufAbstractParser):
'MIMEType', 'ImageWidth', 'ImageSize', 'BitsPerSample',
'ColorComponents', 'EncodingProcess', 'JFIFVersion',
'ResolutionUnit', 'XResolution', 'YCbCrSubSampling',
'YResolution', 'Megapixels', 'ImageHeight'}
'YResolution', 'Megapixels', 'ImageHeight', 'Orientation'}
class TiffParser(GdkPixbufAbstractParser):
@@ -159,13 +153,14 @@ class TiffParser(GdkPixbufAbstractParser):
'FileInodeChangeDate', 'FileModifyDate', 'FileName',
'FilePermissions', 'FileSize', 'FileType',
'FileTypeExtension', 'ImageHeight', 'ImageSize',
'ImageWidth', 'MIMEType', 'Megapixels', 'SourceFile'}
'ImageWidth', 'MIMEType', 'Megapixels', 'SourceFile', 'Orientation'}
class PPMParser(abstract.AbstractParser):
mimetypes = {'image/x-portable-pixmap'}
def get_meta(self) -> dict[str, Union[str, dict]]:
meta = {} # type: dict[str, Union[str, dict[Any, Any]]]
def get_meta(self) -> Dict[str, Union[str, Dict]]:
meta: Dict[str, Union[str, Dict[Any, Any]]] = dict()
with open(self.filename) as f:
for idx, line in enumerate(f):
if line.lstrip().startswith('#'):
@@ -181,9 +176,10 @@ class PPMParser(abstract.AbstractParser):
fout.write(line)
return True
class HEICParser(exiftool.ExiftoolParser):
mimetypes = {'image/heic'}
meta_allowlist = {'SourceFile', 'ExifToolVersion', 'FileName','Directory',
meta_allowlist = {'SourceFile', 'ExifToolVersion', 'FileName', 'Directory',
'FileSize', 'FileModifyDate', 'FileAccessDate',
'FileInodeChangeDate', 'FilePermissions', 'FileType',
'FileTypeExtension', 'MIMEType', 'MajorBrand', 'MinorVersion',
@@ -200,3 +196,15 @@ class HEICParser(exiftool.ExiftoolParser):
def remove_all(self) -> bool:
return self._lightweight_cleanup()
class WEBPParser(GdkPixbufAbstractParser):
mimetypes = {'image/webp'}
meta_allowlist = {'SourceFile', 'ExifToolVersion', 'FileName',
'Directory', 'FileSize', 'FileModifyDate',
'FileAccessDate', "FileInodeChangeDate",
'FilePermissions', 'FileType', 'FileTypeExtension',
'MIMEType', 'ImageWidth', 'ImageSize', 'BitsPerSample',
'ColorComponents', 'EncodingProcess', 'JFIFVersion',
'ResolutionUnit', 'XResolution', 'YCbCrSubSampling',
'YResolution', 'Megapixels', 'ImageHeight', 'Orientation',
'HorizontalScale', 'VerticalScale', 'VP8Version'}

View File

@@ -4,7 +4,7 @@ import logging
import os
import re
import zipfile
from typing import Pattern, Any
from typing import Pattern, Any, Tuple, Dict
import xml.etree.ElementTree as ET # type: ignore
@@ -12,7 +12,8 @@ from .archive import ZipParser
# pylint: disable=line-too-long
def _parse_xml(full_path: str) -> tuple[ET.ElementTree, dict[str, str]]:
def _parse_xml(full_path: str) -> Tuple[ET.ElementTree, Dict[str, str]]:
""" This function parses XML, with namespace support. """
namespace_map = dict()
for _, (key, value) in ET.iterparse(full_path, ("start-ns", )):
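`_parse_xml` walks the file once with `start-ns` events to collect the prefix→URI mappings before doing the real parse, since `ElementTree` otherwise discards them. A self-contained sketch of the same idea (helper name illustrative):

```python
import xml.etree.ElementTree as ET

def parse_with_namespaces(path):
    """Return (tree, {prefix: uri}) for an XML file, like mat2's _parse_xml."""
    namespaces = {}
    # "start-ns" events fire once per namespace declaration encountered.
    for _, (prefix, uri) in ET.iterparse(path, ("start-ns",)):
        namespaces[prefix] = uri
    return ET.parse(path), namespaces
```

The returned mapping can then be passed straight to `tree.find('.//w:t', namespaces)`-style lookups.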
@@ -37,7 +38,7 @@ def _sort_xml_attributes(full_path: str) -> bool:
for c in tree.getroot():
c[:] = sorted(c, key=lambda child: (child.tag, child.get('desc')))
tree.write(full_path, xml_declaration=True)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
@@ -62,13 +63,24 @@ class MSOfficeParser(ZipParser):
'application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml', # /word/footer.xml
'application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml', # /word/header.xml
'application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml', # /word/styles.xml
'application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml', # /word/numbering.xml (used for bullet point formatting)
'application/vnd.openxmlformats-officedocument.theme+xml', # /word/theme/theme[0-9].xml (used for font and background coloring, etc.)
'application/vnd.openxmlformats-package.core-properties+xml', # /docProps/core.xml
# for more complicated powerpoints
'application/vnd.openxmlformats-officedocument.presentationml.notesSlide+xml',
'application/vnd.openxmlformats-officedocument.presentationml.notesMaster+xml',
'application/vnd.openxmlformats-officedocument.presentationml.handoutMaster+xml',
'application/vnd.openxmlformats-officedocument.drawingml.diagramData+xml',
'application/vnd.openxmlformats-officedocument.drawingml.diagramLayout+xml',
'application/vnd.openxmlformats-officedocument.drawingml.diagramStyle+xml',
'application/vnd.openxmlformats-officedocument.drawingml.diagramColors+xml',
'application/vnd.ms-office.drawingml.diagramDrawing+xml',
# Do we want to keep the following ones?
'application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml',
}
def __init__(self, filename):
super().__init__(filename)
@@ -85,7 +97,7 @@ class MSOfficeParser(ZipParser):
r'^_rels/\.rels$',
r'^xl/sharedStrings\.xml$', # https://docs.microsoft.com/en-us/office/open-xml/working-with-the-shared-string-table
r'^xl/calcChain\.xml$',
r'^(?:word|ppt|xl)/_rels/document\.xml\.rels$',
r'^(?:word|ppt|xl)/_rels/(document|workbook|presentation)\.xml\.rels$',
r'^(?:word|ppt|xl)/_rels/footer[0-9]*\.xml\.rels$',
r'^(?:word|ppt|xl)/_rels/header[0-9]*\.xml\.rels$',
r'^(?:word|ppt|xl)/charts/_rels/chart[0-9]+\.xml\.rels$',
@@ -100,6 +112,7 @@ class MSOfficeParser(ZipParser):
r'^ppt/slideLayouts/_rels/slideLayout[0-9]+\.xml\.rels$',
r'^ppt/slideLayouts/slideLayout[0-9]+\.xml$',
r'^(?:word|ppt|xl)/tableStyles\.xml$',
r'^(?:word|ppt|xl)/tables/table[0-9]+\.xml$',
r'^ppt/slides/_rels/slide[0-9]*\.xml\.rels$',
r'^ppt/slides/slide[0-9]*\.xml$',
# https://msdn.microsoft.com/en-us/library/dd908153(v=office.12).aspx
@@ -109,8 +122,13 @@ class MSOfficeParser(ZipParser):
r'^ppt/slideMasters/slideMaster[0-9]+\.xml',
r'^ppt/slideMasters/_rels/slideMaster[0-9]+\.xml\.rels',
r'^xl/worksheets/_rels/sheet[0-9]+\.xml\.rels',
r'^xl/drawings/vmlDrawing[0-9]+\.vml',
r'^xl/drawings/drawing[0-9]+\.xml',
r'^(?:word|ppt|xl)/drawings/vmlDrawing[0-9]+\.vml',
r'^(?:word|ppt|xl)/drawings/drawing[0-9]+\.xml',
r'^(?:word|ppt|xl)/embeddings/Microsoft_Excel_Worksheet[0-9]+\.xlsx',
# rels for complicated powerpoints
r'^ppt/notesSlides/_rels/notesSlide[0-9]+\.xml\.rels',
r'^ppt/notesMasters/_rels/notesMaster[0-9]+\.xml\.rels',
r'^ppt/handoutMasters/_rels/handoutMaster[0-9]+\.xml\.rels',
}))
self.files_to_omit = set(map(re.compile, { # type: ignore
r'^\[trash\]/',
@@ -120,18 +138,24 @@ class MSOfficeParser(ZipParser):
r'^(?:word|ppt|xl)/printerSettings/',
r'^(?:word|ppt|xl)/theme',
r'^(?:word|ppt|xl)/people\.xml$',
r'^(?:word|ppt|xl)/persons/person\.xml$',
r'^(?:word|ppt|xl)/numbering\.xml$',
r'^(?:word|ppt|xl)/tags/',
r'^(?:word|ppt|xl)/glossary/',
# View properties like view mode, last viewed slide etc
r'^(?:word|ppt|xl)/viewProps\.xml$',
# Additional presentation-wide properties like printing properties,
# presentation show properties etc.
r'^(?:word|ppt|xl)/presProps\.xml$',
r'^(?:word|ppt|xl)/comments[0-9]+\.xml$',
r'^(?:word|ppt|xl)/comments[0-9]*\.xml$',
r'^(?:word|ppt|xl)/threadedComments/threadedComment[0-9]*\.xml$',
r'^(?:word|ppt|xl)/commentsExtended\.xml$',
r'^(?:word|ppt|xl)/commentsExtensible\.xml$',
r'^(?:word|ppt|xl)/commentsIds\.xml$',
# we have an allowlist in self.files_to_keep,
# so we can trash everything else
r'^(?:word|ppt|xl)/_rels/',
r'docMetadata/LabelInfo\.xml$'
}))
if self.__fill_files_to_keep_via_content_types() is False:
@@ -148,7 +172,7 @@ class MSOfficeParser(ZipParser):
return False
xml_data = zin.read('[Content_Types].xml')
self.content_types = dict() # type: dict[str, str]
self.content_types: Dict[str, str] = dict()
try:
tree = ET.fromstring(xml_data)
except ET.ParseError:
@@ -196,7 +220,7 @@ class MSOfficeParser(ZipParser):
for element in elements_to_remove:
parent_map[element].remove(element)
tree.write(full_path, xml_declaration=True)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
@staticmethod
@@ -218,7 +242,7 @@ class MSOfficeParser(ZipParser):
if 'w' not in namespace:
return True
parent_map = {c:p for p in tree.iter() for c in p}
parent_map = {c: p for p in tree.iter() for c in p}
elements_to_remove = list()
for element in tree.iterfind('.//w:nsid', namespace):
@@ -226,10 +250,9 @@ class MSOfficeParser(ZipParser):
for element in elements_to_remove:
parent_map[element].remove(element)
tree.write(full_path, xml_declaration=True)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
@staticmethod
def __remove_revisions(full_path: str) -> bool:
try:
@@ -260,11 +283,82 @@ class MSOfficeParser(ZipParser):
for children in element.iterfind('./*'):
elements_ins.append((element, position, children))
break
for (element, position, children) in elements_ins:
parent_map[element].insert(position, children)
# the list can sometimes contain duplicate elements, so don't remove
# until all children have been processed
for (element, position, children) in elements_ins:
if element in parent_map[element]:
parent_map[element].remove(element)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
@staticmethod
def __remove_document_comment_meta(full_path: str) -> bool:
try:
tree, namespace = _parse_xml(full_path)
except ET.ParseError as e: # pragma: no cover
logging.error("Unable to parse %s: %s", full_path, e)
return False
# search the docs to see if we can bail early
range_start = tree.find('.//w:commentRangeStart', namespace)
range_end = tree.find('.//w:commentRangeEnd', namespace)
references = tree.find('.//w:commentReference', namespace)
if range_start is None and range_end is None and references is None:
return True # No comment meta tags are present
parent_map = {c:p for p in tree.iter() for c in p}
# iterate over the elements and add them to list
elements_del = list()
for element in tree.iterfind('.//w:commentRangeStart', namespace):
elements_del.append(element)
for element in tree.iterfind('.//w:commentRangeEnd', namespace):
elements_del.append(element)
for element in tree.iterfind('.//w:commentReference', namespace):
elements_del.append(element)
# remove the elements
for element in elements_del:
parent_map[element].remove(element)
tree.write(full_path, xml_declaration=True)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
def __remove_document_xml_rels_members(self, full_path: str) -> bool:
""" Remove the dangling references from the word/_rels/document.xml.rels file, since MS office doesn't like them.
"""
try:
tree, namespace = _parse_xml(full_path)
except ET.ParseError as e: # pragma: no cover
logging.error("Unable to parse %s: %s", full_path, e)
return False
if len(namespace.items()) != 1: # pragma: no cover
logging.debug("Got several namespaces for Types: %s", namespace.items())
removed_fnames = set()
with zipfile.ZipFile(self.filename) as zin:
for fname in [item.filename for item in zin.infolist()]:
for file_to_omit in self.files_to_omit:
if file_to_omit.search(fname):
matches = map(lambda r: r.search(fname), self.files_to_keep)
if any(matches): # the file is in the allowlist
continue
removed_fnames.add(fname)
break
root = tree.getroot()
for item in root.findall('{%s}Relationship' % namespace['']):
name = 'word/' + item.attrib['Target'] # add the word/ prefix to the path, since all document rels are in the word/ directory
if name in removed_fnames:
root.remove(item)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
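The method above prunes `word/_rels/document.xml.rels` so it no longer references files that were dropped from the archive. An illustrative distillation of that step, with the OPC relationships namespace hardcoded for the sketch:

```python
import xml.etree.ElementTree as ET

# The OPC package-relationships namespace; assumed constant for this sketch.
RELS_NS = 'http://schemas.openxmlformats.org/package/2006/relationships'

def drop_dangling_rels(xml_bytes, removed_fnames):
    """Remove <Relationship> entries whose Target was deleted from the zip."""
    root = ET.fromstring(xml_bytes)
    for item in list(root.findall('{%s}Relationship' % RELS_NS)):
        # Document rels are relative to word/, so prefix the Target.
        if 'word/' + item.attrib['Target'] in removed_fnames:
            root.remove(item)
    return ET.tostring(root)
```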
def __remove_content_type_members(self, full_path: str) -> bool:
@@ -297,7 +391,7 @@ class MSOfficeParser(ZipParser):
if name in removed_fnames:
root.remove(item)
tree.write(full_path, xml_declaration=True)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
def _final_checks(self) -> bool:
@@ -319,7 +413,6 @@ class MSOfficeParser(ZipParser):
for i in re.findall(r'<p:cNvPr id="([0-9]+)"', content):
self.__counters['cNvPr'].add(int(i))
@staticmethod
def __randomize_creationId(full_path: str) -> bool:
try:
@@ -333,7 +426,7 @@ class MSOfficeParser(ZipParser):
for item in tree.iterfind('.//p14:creationId', namespace):
item.set('val', '%s' % random.randint(0, 2**32))
tree.write(full_path, xml_declaration=True)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
@staticmethod
@@ -349,7 +442,7 @@ class MSOfficeParser(ZipParser):
for item in tree.iterfind('.//p:sldMasterId', namespace):
item.set('id', '%s' % random.randint(0, 2**32))
tree.write(full_path, xml_declaration=True)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
def _specific_cleanup(self, full_path: str) -> bool:
@@ -357,7 +450,7 @@ class MSOfficeParser(ZipParser):
if os.stat(full_path).st_size == 0: # Don't process empty files
return True
if not full_path.endswith('.xml'):
if not full_path.endswith(('.xml', '.xml.rels')):
return True
if self.__randomize_creationId(full_path) is False:
@@ -374,6 +467,13 @@ class MSOfficeParser(ZipParser):
# this file contains the revisions
if self.__remove_revisions(full_path) is False:
return False # pragma: no cover
# remove comment references and ranges
if self.__remove_document_comment_meta(full_path) is False:
return False # pragma: no cover
elif full_path.endswith('/word/_rels/document.xml.rels'):
# similar to the above, but for the document.xml.rels file
if self.__remove_document_xml_rels_members(full_path) is False: # pragma: no cover
return False
elif full_path.endswith('/docProps/app.xml'):
# This file must be present and valid,
# so we're removing as much as we can.
@@ -425,13 +525,13 @@ class MSOfficeParser(ZipParser):
# see: https://docs.microsoft.com/en-us/dotnet/framework/wpf/advanced/mc-ignorable-attribute
with open(full_path, 'rb') as f:
text = f.read()
out = re.sub(b'mc:Ignorable="[^"]*"', b'', text, 1)
out = re.sub(b'mc:Ignorable="[^"]*"', b'', text, count=1)
with open(full_path, 'wb') as f:
f.write(out)
return True
def _specific_get_meta(self, full_path: str, file_path: str) -> dict[str, Any]:
def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
"""
Yes, I know that parsing xml with regexp ain't pretty,
be my guest and fix it if you want.
@@ -441,8 +541,8 @@ class MSOfficeParser(ZipParser):
with open(full_path, encoding='utf-8') as f:
try:
results = re.findall(r"<(.+)>(.+)</\1>", f.read(), re.I|re.M)
return {k:v for (k, v) in results}
results = re.findall(r"<(.+)>(.+)</\1>", f.read(), re.I | re.M)
return {k: v for (k, v) in results}
except (TypeError, UnicodeDecodeError):
# We didn't manage to parse the xml file
return {file_path: 'harmful content', }
@@ -459,7 +559,6 @@ class LibreOfficeParser(ZipParser):
'application/vnd.oasis.opendocument.image',
}
def __init__(self, filename):
super().__init__(filename)
@@ -493,7 +592,7 @@ class LibreOfficeParser(ZipParser):
for changes in text.iterfind('.//text:tracked-changes', namespace):
text.remove(changes)
tree.write(full_path, xml_declaration=True)
tree.write(full_path, xml_declaration=True, encoding='utf-8')
return True
def _specific_cleanup(self, full_path: str) -> bool:
@@ -512,7 +611,7 @@ class LibreOfficeParser(ZipParser):
return False
return True
def _specific_get_meta(self, full_path: str, file_path: str) -> dict[str, Any]:
def _specific_get_meta(self, full_path: str, file_path: str) -> Dict[str, Any]:
"""
Yes, I know that parsing xml with regexp ain't pretty,
be my guest and fix it if you want.

View File

@@ -2,7 +2,7 @@ import glob
import os
import mimetypes
import importlib
from typing import TypeVar, Optional
from typing import TypeVar, Optional, List, Tuple
from . import abstract, UNSUPPORTED_EXTENSIONS
@@ -34,7 +34,7 @@ def __load_all_parsers():
__load_all_parsers()
def _get_parsers() -> list[T]:
def _get_parsers() -> List[T]:
""" Get all our parsers!"""
def __get_parsers(cls):
return cls.__subclasses__() + \
@@ -42,7 +42,7 @@ def _get_parsers() -> list[T]:
return __get_parsers(abstract.AbstractParser)
def get_parser(filename: str) -> tuple[Optional[T], Optional[str]]:
def get_parser(filename: str) -> Tuple[Optional[T], Optional[str]]:
""" Return the appropriate parser for a given filename.
:raises ValueError: Raised if the instantiation of the parser went wrong.

View File

@@ -7,7 +7,7 @@ import re
import logging
import tempfile
import io
from typing import Union
from typing import Union, Dict
import cairo
import gi
@@ -18,6 +18,7 @@ from . import abstract
FIXED_PDF_VERSION = cairo.PDFVersion.VERSION_1_5
class PDFParser(abstract.AbstractParser):
mimetypes = {'application/pdf', }
meta_list = {'author', 'creation-date', 'creator', 'format', 'keywords',
@@ -35,7 +36,10 @@ class PDFParser(abstract.AbstractParser):
def remove_all(self) -> bool:
if self.lightweight_cleaning is True:
return self.__remove_all_lightweight()
try:
return self.__remove_all_lightweight()
except (cairo.Error, MemoryError) as e:
raise RuntimeError(e)
return self.__remove_all_thorough()
def __remove_all_lightweight(self) -> bool:
@@ -132,21 +136,21 @@ class PDFParser(abstract.AbstractParser):
# It should(tm) be alright though, because cairo's output format
# for metadata is fixed.
with open(out_file, 'rb') as f:
out = re.sub(rb'<<[\s\n]*/Producer.*?>>', b' << >>', f.read(), 0,
re.DOTALL | re.IGNORECASE)
out = re.sub(rb'<<[\s\n]*/Producer.*?>>', b' << >>', f.read(),
count=0, flags=re.DOTALL | re.IGNORECASE)
with open(out_file, 'wb') as f:
f.write(out)
return True
@staticmethod
def __parse_metadata_field(data: str) -> dict[str, str]:
def __parse_metadata_field(data: str) -> Dict[str, str]:
metadata = {}
for (_, key, value) in re.findall(r"<(xmp|pdfx|pdf|xmpMM):(.+)>(.+)</\1:\2>", data, re.I):
metadata[key] = value
return metadata
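The regex above grabs `<prefix:Key>value</prefix:Key>` pairs for the `xmp`/`pdf`/`pdfx`/`xmpMM` prefixes, using backreferences to force the closing tag to match the opening one. Here it is as a standalone function you can poke at:

```python
import re

def parse_xmp_fields(data):
    """Extract key/value pairs from serialized XMP-style metadata."""
    metadata = {}
    # \1 and \2 force the closing tag to reuse the same prefix and key.
    for (_, key, value) in re.findall(
            r"<(xmp|pdfx|pdf|xmpMM):(.+)>(.+)</\1:\2>", data, re.I):
        metadata[key] = value
    return metadata
```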
def get_meta(self) -> dict[str, Union[str, dict]]:
def get_meta(self) -> Dict[str, Union[str, Dict]]:
""" Return a dict with all the meta of the file
"""
metadata = {}

View File

@@ -1,5 +1,5 @@
import logging
from typing import Union
from typing import Union, Dict, List, Tuple
from . import abstract
@@ -15,7 +15,7 @@ class TorrentParser(abstract.AbstractParser):
if self.dict_repr is None:
raise ValueError
def get_meta(self) -> dict[str, Union[str, dict]]:
def get_meta(self) -> Dict[str, Union[str, Dict]]:
metadata = {}
for key, value in self.dict_repr.items():
if key not in self.allowlist:
@@ -56,7 +56,7 @@ class _BencodeHandler:
}
@staticmethod
def __decode_int(s: bytes) -> tuple[int, bytes]:
def __decode_int(s: bytes) -> Tuple[int, bytes]:
s = s[1:]
next_idx = s.index(b'e')
if s.startswith(b'-0'):
@@ -66,7 +66,7 @@ class _BencodeHandler:
return int(s[:next_idx]), s[next_idx+1:]
@staticmethod
def __decode_string(s: bytes) -> tuple[bytes, bytes]:
def __decode_string(s: bytes) -> Tuple[bytes, bytes]:
colon = s.index(b':')
# FIXME Python3 is broken here, the call to `ord` shouldn't be needed,
# but apparently it is. This is utterly idiotic.
@@ -76,7 +76,7 @@ class _BencodeHandler:
s = s[1:]
return s[colon:colon+str_len], s[colon+str_len:]
def __decode_list(self, s: bytes) -> tuple[list, bytes]:
def __decode_list(self, s: bytes) -> Tuple[List, bytes]:
ret = list()
s = s[1:] # skip leading `l`
while s[0] != ord('e'):
@@ -84,7 +84,7 @@ class _BencodeHandler:
ret.append(value)
return ret, s[1:]
def __decode_dict(self, s: bytes) -> tuple[dict, bytes]:
def __decode_dict(self, s: bytes) -> Tuple[Dict, bytes]:
ret = dict()
s = s[1:] # skip leading `d`
while s[0] != ord(b'e'):
@@ -113,10 +113,10 @@ class _BencodeHandler:
ret += self.__encode_func[type(value)](value)
return b'd' + ret + b'e'
def bencode(self, s: Union[dict, list, bytes, int]) -> bytes:
def bencode(self, s: Union[Dict, List, bytes, int]) -> bytes:
return self.__encode_func[type(s)](s)
def bdecode(self, s: bytes) -> Union[dict, None]:
def bdecode(self, s: bytes) -> Union[Dict, None]:
try:
ret, trail = self.__decode_func[s[0]](s)
except (IndexError, KeyError, ValueError) as e:
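The `__decode_*` methods above implement the four bencode cases: integers (`i42e`), byte strings (`4:spam`), lists (`l…e`) and dicts (`d…e`), each returning the decoded value plus the unconsumed tail. A compact recursive sketch of the same decoder, independent of `_BencodeHandler`:

```python
def bdecode(s):
    """Minimal recursive bencode decoder (illustrative, not mat2's)."""
    def decode(s):
        if s[0:1] == b'i':  # integer: i<digits>e
            end = s.index(b'e')
            return int(s[1:end]), s[end + 1:]
        if s[0:1] == b'l':  # list: l<items>e
            s, ret = s[1:], []
            while s[0:1] != b'e':
                value, s = decode(s)
                ret.append(value)
            return ret, s[1:]
        if s[0:1] == b'd':  # dict: d<key><value>...e
            s, ret = s[1:], {}
            while s[0:1] != b'e':
                key, s = decode(s)
                value, s = decode(s)
                ret[key] = value
            return ret, s[1:]
        colon = s.index(b':')  # string: <length>:<bytes>
        length = int(s[:colon])
        return s[colon + 1:colon + 1 + length], s[colon + 1 + length:]
    value, _trailing = decode(s)
    return value
```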

View File

@@ -3,7 +3,7 @@ import functools
import shutil
import logging
from typing import Union
from typing import Union, Dict
from . import exiftool
from . import bubblewrap
@@ -12,7 +12,7 @@ from . import bubblewrap
class AbstractFFmpegParser(exiftool.ExiftoolParser):
""" Abstract parser for all FFmpeg-based ones, mainly for video. """
# Some fileformats have mandatory metadata fields
meta_key_value_allowlist = {} # type: dict[str, Union[str, int]]
meta_key_value_allowlist: Dict[str, Union[str, int]] = dict()
def remove_all(self) -> bool:
if self.meta_key_value_allowlist:
@@ -45,10 +45,10 @@ class AbstractFFmpegParser(exiftool.ExiftoolParser):
return False
return True
def get_meta(self) -> dict[str, Union[str, dict]]:
def get_meta(self) -> Dict[str, Union[str, Dict]]:
meta = super().get_meta()
ret = dict() # type: dict[str, Union[str, dict]]
ret: Dict[str, Union[str, Dict]] = dict()
for key, value in meta.items():
if key in self.meta_key_value_allowlist:
if value == self.meta_key_value_allowlist[key]:
@@ -135,7 +135,7 @@ class MP4Parser(AbstractFFmpegParser):
}
@functools.lru_cache()
@functools.lru_cache(maxsize=None)
def _get_ffmpeg_path() -> str: # pragma: no cover
which_path = shutil.which('ffmpeg')
if which_path:

View File

@@ -1,5 +1,5 @@
from html import parser, escape
from typing import Any, Optional
from typing import Any, Optional, Dict, List, Tuple, Set
import re
import string
@@ -20,12 +20,12 @@ class CSSParser(abstract.AbstractParser):
content = f.read()
except UnicodeDecodeError: # pragma: no cover
raise ValueError
cleaned = re.sub(r'/\*.*?\*/', '', content, 0, self.flags)
cleaned = re.sub(r'/\*.*?\*/', '', content, count=0, flags=self.flags)
with open(self.output_filename, 'w', encoding='utf-8') as f:
f.write(cleaned)
return True
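The `re.sub` change above matters because passing `count` and `flags` positionally is deprecated as of Python 3.13; the keyword form behaves identically on all supported versions. A self-contained sketch with an illustrative CSS snippet:

```python
import re

flags = re.MULTILINE | re.DOTALL
content = "body { color: red; } /* author: alice */"
# keyword arguments work everywhere; re.sub(pat, repl, s, 0, flags)
# emits a DeprecationWarning since Python 3.13
cleaned = re.sub(r'/\*.*?\*/', '', content, count=0, flags=flags)
print(cleaned)  # "body { color: red; } "
```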
def get_meta(self) -> dict[str, Any]:
def get_meta(self) -> Dict[str, Any]:
metadata = {}
with open(self.filename, encoding='utf-8') as f:
try:
@@ -44,10 +44,10 @@ class CSSParser(abstract.AbstractParser):
class AbstractHTMLParser(abstract.AbstractParser):
tags_blocklist = set() # type: set[str]
tags_blocklist: Set[str] = set()
# In some html/xml-based formats some tags are mandatory,
# so we're keeping them, but are discarding their content
tags_required_blocklist = set() # type: set[str]
tags_required_blocklist: Set[str] = set()
def __init__(self, filename):
super().__init__(filename)
@@ -57,7 +57,7 @@ class AbstractHTMLParser(abstract.AbstractParser):
self.__parser.feed(f.read())
self.__parser.close()
def get_meta(self) -> dict[str, Any]:
def get_meta(self) -> Dict[str, Any]:
return self.__parser.get_meta()
def remove_all(self) -> bool:
@@ -91,7 +91,7 @@ class _HTMLParser(parser.HTMLParser):
self.filename = filename
self.__textrepr = ''
self.__meta = {}
self.__validation_queue = [] # type: list[str]
self.__validation_queue: List[str] = list()
# We're using counters instead of booleans, to handle nested tags
self.__in_dangerous_but_required_tag = 0
@@ -112,7 +112,7 @@ class _HTMLParser(parser.HTMLParser):
"""
raise ValueError(message)
def handle_starttag(self, tag: str, attrs: list[tuple[str, Optional[str]]]):
def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]):
# Ignore the type, because mypy is too stupid to infer
# that get_starttag_text() can't return None.
original_tag = self.get_starttag_text() # type: ignore
@@ -159,7 +159,7 @@ class _HTMLParser(parser.HTMLParser):
self.__textrepr += escape(data)
def handle_startendtag(self, tag: str,
attrs: list[tuple[str, Optional[str]]]):
attrs: List[Tuple[str, Optional[str]]]):
if tag in self.tag_required_blocklist | self.tag_blocklist:
meta = {k:v for k, v in attrs}
name = meta.get('name', 'harmful metadata')
@@ -184,7 +184,7 @@ class _HTMLParser(parser.HTMLParser):
f.write(self.__textrepr)
return True
def get_meta(self) -> dict[str, Any]:
def get_meta(self) -> Dict[str, Any]:
if self.__validation_queue:
raise ValueError("Some tags (%s) were left unclosed in %s" % (
', '.join(self.__validation_queue),

mat2

@@ -2,6 +2,7 @@
import os
import shutil
from typing import List, Set, Dict
import sys
import mimetypes
import argparse
@@ -16,7 +17,7 @@ except ValueError as ex:
print(ex)
sys.exit(1)
__version__ = '0.13.1'
__version__ = '0.13.5'
logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.WARNING)
@@ -35,7 +36,7 @@ def __check_file(filename: str, mode: int = os.R_OK) -> bool:
__print_without_chars("[-] %s is not a regular file." % filename)
return False
elif not os.access(filename, mode):
mode_str = [] # type: list[str]
mode_str: List[str] = list()
if mode & os.R_OK:
mode_str += 'readable'
if mode & os.W_OK:
@@ -56,8 +57,8 @@ def create_arg_parser() -> argparse.ArgumentParser:
', '.join(p.value for p in UnknownMemberPolicy))
parser.add_argument('--inplace', action='store_true',
help='clean in place, without backup')
parser.add_argument('--no-sandbox', dest='sandbox', action='store_true',
default=False, help='Disable bubblewrap\'s sandboxing')
parser.add_argument('--no-sandbox', dest='sandbox', action='store_false',
default=True, help='Disable bubblewrap\'s sandboxing')
excl_group = parser.add_mutually_exclusive_group()
excl_group.add_argument('files', nargs='*', help='the files to process',
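The `--no-sandbox` hunk above fixes a genuine logic bug: with `dest='sandbox'`, `action='store_true'` and `default=False`, sandboxing was off by default and passing `--no-sandbox` turned it *on*. A minimal reproduction of the corrected behaviour:

```python
import argparse

parser = argparse.ArgumentParser()
# --no-sandbox must *clear* the flag: store_false on dest='sandbox',
# with sandboxing enabled (True) by default
parser.add_argument('--no-sandbox', dest='sandbox', action='store_false',
                    default=True, help="Disable bubblewrap's sandboxing")

assert parser.parse_args([]).sandbox is True               # sandboxed by default
assert parser.parse_args(['--no-sandbox']).sandbox is False  # flag disables it
```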
@@ -97,7 +98,7 @@ def show_meta(filename: str, sandbox: bool):
__print_meta(filename, p.get_meta())
def __print_meta(filename: str, metadata: dict, depth: int = 1):
def __print_meta(filename: str, metadata: Dict, depth: int = 1):
padding = " " * depth*2
if not metadata:
__print_without_chars(padding + "No metadata found in %s." % filename)
@@ -151,10 +152,10 @@ def clean_meta(filename: str, is_lightweight: bool, inplace: bool, sandbox: bool
def show_parsers():
print('[+] Supported formats:')
formats = set() # set[str]
formats = set() # Set[str]
for parser in parser_factory._get_parsers(): # type: ignore
for mtype in parser.mimetypes:
extensions = set() # set[str]
extensions = set() # Set[str]
for extension in mimetypes.guess_all_extensions(mtype):
if extension not in UNSUPPORTED_EXTENSIONS:
extensions.add(extension)
@@ -163,11 +164,11 @@ def show_parsers():
# mimetype, so there is not point in showing the mimetype at all
continue
formats.add(' - %s (%s)' % (mtype, ', '.join(extensions)))
__print_without_chars('\n'.join(sorted(formats)))
print('\n'.join(sorted(formats)))
def __get_files_recursively(files: list[str]) -> list[str]:
ret = set() # type: set[str]
def __get_files_recursively(files: List[str]) -> List[str]:
ret: Set[str] = set()
for f in files:
if os.path.isdir(f):
for path, _, _files in os.walk(f):
@@ -185,7 +186,7 @@ def main() -> int:
args = arg_parser.parse_args()
if args.verbose:
logging.getLogger().setLevel(logging.DEBUG)
logging.getLogger(__name__).setLevel(logging.DEBUG)
if not args.files:
if args.list:
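The `logging.getLogger(__name__)` change in the hunk above narrows the effect of `--verbose`: raising the level on a named logger scopes the change to this module's messages, while `logging.getLogger()` would reconfigure the root logger and, with it, every third-party library's output. A sketch (the logger name is illustrative):

```python
import logging

root = logging.getLogger()            # the root logger: shared by everything
named = logging.getLogger('mat2.cli') # a named logger: scoped to this module

named.setLevel(logging.DEBUG)
print(named.getEffectiveLevel())      # 10 (DEBUG); root keeps its own level
```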

pyproject.toml Normal file

@@ -0,0 +1,21 @@
[project]
name = "mat2"
version = "0.13.5"
description = "mat2 is a metadata removal tool, supporting a wide range of commonly used file formats, written in python3: at its core, it's a library, used by an eponymous command-line interface, as well as several file manager extensions."
readme = "README.md"
license = {file = "LICENSE"}
requires-python = ">=3.9"
dependencies = [
'mutagen',
'PyGObject',
'pycairo',
]
[project.urls]
Repository = "https://github.com/jvoisin/mat2"
Issues = "https://github.com/jvoisin/mat2/issues"
Changelog = "https://github.com/jvoisin/mat2/blob/master/CHANGELOG.md"
[tool.ruff]
target-version = "py39"
# E501 Line too long
ignore = ["E501", "F401", "E402", "E722"]


@@ -5,13 +5,13 @@ with open("README.md", encoding='utf-8') as fh:
setuptools.setup(
name="mat2",
version='0.13.1',
version='0.13.5',
author="Julien (jvoisin) Voisin",
author_email="julien.voisin+mat2@dustri.org",
description="A handy tool to trash your metadata",
long_description=long_description,
long_description_content_type="text/markdown",
url="https://0xacab.org/jvoisin/mat2",
url="https://github.com/jvoisin/mat2",
python_requires = '>=3.5.0',
scripts=['mat2'],
install_requires=[
@@ -20,6 +20,7 @@ setuptools.setup(
'pycairo',
],
packages=setuptools.find_packages(exclude=('tests', )),
data_files = [('share/man/man1', ['doc/mat2.1'])],
classifiers=[
"Development Status :: 3 - Alpha",
"Environment :: Console",
@@ -30,6 +31,6 @@ setuptools.setup(
"Intended Audience :: End Users/Desktop",
],
project_urls={
'bugtacker': 'https://0xacab.org/jvoisin/mat2/issues',
'bugtacker': 'https://github.com/jvoisin/mat2/issues',
},
)

BIN
tests/data/comment.docx Normal file

Binary file not shown.

BIN
tests/data/dirty.webp Normal file

Binary file not shown.

Binary file not shown.


@@ -236,6 +236,11 @@ class TestGetMeta(unittest.TestCase):
self.assertIn(b'i am a : various comment', stdout)
self.assertIn(b'artist: jvoisin', stdout)
#def test_webp(self):
# proc = subprocess.Popen(mat2_binary + ['--show', './tests/data/dirty.webp'],
# stdout=subprocess.PIPE)
# stdout, _ = proc.communicate()
# self.assertIn(b'Warning: [minor] Improper EXIF header', stdout)
class TestControlCharInjection(unittest.TestCase):
def test_jpg(self):


@@ -14,7 +14,7 @@ from libmat2 import harmless, video, web, archive
# No need to logging messages, should something go wrong,
# the testsuite _will_ fail.
logger = logging.getLogger()
logger = logging.getLogger(__name__)
logger.setLevel(logging.FATAL)


@@ -4,6 +4,7 @@ import unittest
import shutil
import os
import re
import sys
import tarfile
import tempfile
import zipfile
@@ -113,6 +114,11 @@ class TestGetMeta(unittest.TestCase):
meta = p.get_meta()
self.assertEqual(meta['Comment'], 'Created with GIMP')
#def test_webp(self):
# p = images.WEBPParser('./tests/data/dirty.webp')
# meta = p.get_meta()
# self.assertEqual(meta['Warning'], '[minor] Improper EXIF header')
def test_ppm(self):
p = images.PPMParser('./tests/data/dirty.ppm')
meta = p.get_meta()
@@ -333,6 +339,11 @@ class TestCleaning(unittest.TestCase):
'parser': images.JPGParser,
'meta': {'Comment': 'Created with GIMP'},
'expected_meta': {},
#}, {
# 'name': 'webp',
# 'parser': images.WEBPParser,
# 'meta': {'Warning': '[minor] Improper EXIF header'},
# 'expected_meta': {},
}, {
'name': 'wav',
'parser': audio.WAVParser,
@@ -480,11 +491,12 @@ class TestCleaning(unittest.TestCase):
'expected_meta': {
'AverageBitrate': 465641,
'BufferSize': 0,
'CompatibleBrands': ['isom', 'iso2', 'avc1', 'mp41'],
'ColorProfiles': 'nclx',
'ColorPrimaries': 'BT.709',
'ColorProfiles': 'nclx',
'ColorRepresentation': 'nclx 1 1 1',
'CompatibleBrands': ['isom', 'iso2', 'avc1', 'mp41'],
'CompressorID': 'avc1',
'CompressorName': 'JVT/AVC Coding',
'GraphicsMode': 'srcCopy',
'HandlerDescription': 'SoundHandler',
'HandlerType': 'Metadata',
@@ -495,6 +507,7 @@ class TestCleaning(unittest.TestCase):
'MediaDataOffset': 48,
'MediaDataSize': 379872,
'MediaHeaderVersion': 0,
'MediaLanguageCode': 'eng',
'MinorVersion': '0.2.0',
'MovieDataOffset': 48,
'MovieHeaderVersion': 0,
@@ -506,7 +519,11 @@ class TestCleaning(unittest.TestCase):
'TrackID': 1,
'TrackLayer': 0,
'TransferCharacteristics': 'BT.709',
'VideoFullRangeFlag': 'Limited',
},
'extra_expected_meta': {
'VideoFullRangeFlag': 0,
}
},{
'name': 'wmv',
'ffmpeg': 1,
@@ -519,7 +536,43 @@ class TestCleaning(unittest.TestCase):
'name': 'heic',
'parser': images.HEICParser,
'meta': {},
'expected_meta': {},
'expected_meta': {
'BlueMatrixColumn': '0.14305 0.06061 0.71393',
'BlueTRC': '(Binary data 32 bytes, use -b option to extract)',
'CMMFlags': 'Not Embedded, Independent',
'ChromaticAdaptation': '1.04788 0.02292 -0.05022 0.02959 0.99048 -0.01707 -0.00925 0.01508 0.75168',
'ChromaticityChannel1': '0.64 0.33002',
'ChromaticityChannel2': '0.3 0.60001',
'ChromaticityChannel3': '0.15001 0.06',
'ChromaticityChannels': 3,
'ChromaticityColorant': 'Unknown',
'ColorSpaceData': 'RGB ',
'ConnectionSpaceIlluminant': '0.9642 1 0.82491',
'DeviceAttributes': 'Reflective, Glossy, Positive, Color',
'DeviceManufacturer': '',
'DeviceMfgDesc': 'GIMP',
'DeviceModel': '',
'DeviceModelDesc': 'sRGB',
'ExifByteOrder': 'Big-endian (Motorola, MM)',
'GreenMatrixColumn': '0.38512 0.7169 0.09706',
'GreenTRC': '(Binary data 32 bytes, use -b option to extract)',
'MediaWhitePoint': '0.9642 1 0.82491',
'PrimaryPlatform': 'Apple Computer Inc.',
'ProfileCMMType': 'Little CMS',
'ProfileClass': 'Display Device Profile',
'ProfileConnectionSpace': 'XYZ ',
'ProfileCopyright': 'Public Domain',
'ProfileCreator': 'Little CMS',
'ProfileDateTime': '2022:05:15 16:29:22',
'ProfileDescription': 'GIMP built-in sRGB',
'ProfileFileSignature': 'acsp',
'ProfileID': 0,
'ProfileVersion': '4.3.0',
'RedMatrixColumn': '0.43604 0.22249 0.01392',
'RedTRC': '(Binary data 32 bytes, use -b option to extract)',
'RenderingIntent': 'Perceptual',
'Warning': 'Bad IFD0 directory',
},
}
]
@@ -554,8 +607,13 @@ class TestCleaning(unittest.TestCase):
meta = p2.get_meta()
if meta:
for k, v in p2.get_meta().items():
self.assertIn(k, case['expected_meta'], '"%s" is not in "%s" (%s)' % (k, case['expected_meta'], case['name']))
self.assertIn(str(case['expected_meta'][k]), str(v))
self.assertIn(k, case['expected_meta'], '"%s" is not in "%s" (%s), with all of them being %s' % (k, case['expected_meta'], case['name'], p2.get_meta().items()))
if str(case['expected_meta'][k]) in str(v):
continue
if 'extra_expected_meta' in case and k in case['extra_expected_meta']:
if str(case['extra_expected_meta'][k]) in str(v):
continue
self.assertTrue(False, "got a different value (%s) than excepted (%s) for %s, with all of them being %s" % (str(v), meta, k, p2.get_meta().items()))
self.assertTrue(p2.remove_all())
os.remove(target)
@@ -581,14 +639,20 @@ class TestCleaning(unittest.TestCase):
os.remove('./tests/data/clean.cleaned.html')
os.remove('./tests/data/clean.cleaned.cleaned.html')
with open('./tests/data/clean.html', 'w') as f:
f.write('<title><title><pouet/><meta/></title></title><test/>')
p = web.HTMLParser('./tests/data/clean.html')
self.assertTrue(p.remove_all())
with open('./tests/data/clean.cleaned.html', 'r') as f:
self.assertEqual(f.read(), '<title></title><test/>')
if sys.version_info >= (3, 13):
with open('./tests/data/clean.html', 'w') as f:
f.write('<title><title><pouet/><meta/></title></title><test/>')
with self.assertRaises(ValueError):
p = web.HTMLParser('./tests/data/clean.html')
else:
with open('./tests/data/clean.html', 'w') as f:
f.write('<title><title><pouet/><meta/></title></title><test/>')
p = web.HTMLParser('./tests/data/clean.html')
self.assertTrue(p.remove_all())
with open('./tests/data/clean.cleaned.html', 'r') as f:
self.assertEqual(f.read(), '<title></title><test/>')
os.remove('./tests/data/clean.cleaned.html')
os.remove('./tests/data/clean.html')
os.remove('./tests/data/clean.cleaned.html')
with open('./tests/data/clean.html', 'w') as f:
f.write('<test><title>Some<b>metadata</b><br/></title></test>')
@@ -855,3 +919,97 @@ class TestComplexOfficeFiles(unittest.TestCase):
os.remove(target)
os.remove(p.output_filename)
class TextDocx(unittest.TestCase):
def test_comment_xml_is_removed(self):
with zipfile.ZipFile('./tests/data/comment.docx') as zipin:
# Check if 'word/comments.xml' exists in the zip
self.assertIn('word/comments.xml', zipin.namelist())
shutil.copy('./tests/data/comment.docx', './tests/data/comment_clean.docx')
p = office.MSOfficeParser('./tests/data/comment_clean.docx')
self.assertTrue(p.remove_all())
with zipfile.ZipFile('./tests/data/comment_clean.cleaned.docx') as zipin:
# Check if 'word/comments.xml' exists in the zip
self.assertNotIn('word/comments.xml', zipin.namelist())
os.remove('./tests/data/comment_clean.docx')
os.remove('./tests/data/comment_clean.cleaned.docx')
def test_xml_is_utf8(self):
with zipfile.ZipFile('./tests/data/comment.docx') as zipin:
c = zipin.open('word/document.xml')
content = c.read()
# ensure encoding is utf-8
r = b'encoding=(\'|\")UTF-8(\'|\")'
match = re.search(r, content, re.IGNORECASE)
self.assertIsNotNone(match)
shutil.copy('./tests/data/comment.docx', './tests/data/comment_clean.docx')
p = office.MSOfficeParser('./tests/data/comment_clean.docx')
self.assertTrue(p.remove_all())
with zipfile.ZipFile('./tests/data/comment_clean.cleaned.docx') as zipin:
c = zipin.open('word/document.xml')
content = c.read()
# ensure encoding is still utf-8
r = b'encoding=(\'|\")UTF-8(\'|\")'
match = re.search(r, content, re.IGNORECASE)
self.assertIsNotNone(match)
os.remove('./tests/data/comment_clean.docx')
os.remove('./tests/data/comment_clean.cleaned.docx')
def test_comment_references_are_removed(self):
with zipfile.ZipFile('./tests/data/comment.docx') as zipin:
c = zipin.open('word/document.xml')
content = c.read()
r = b'w:commentRangeStart'
self.assertIn(r, content)
r = b'w:commentRangeEnd'
self.assertIn(r, content)
r = b'w:commentReference'
self.assertIn(r, content)
shutil.copy('./tests/data/comment.docx', './tests/data/comment_clean.docx')
p = office.MSOfficeParser('./tests/data/comment_clean.docx')
self.assertTrue(p.remove_all())
with zipfile.ZipFile('./tests/data/comment_clean.cleaned.docx') as zipin:
c = zipin.open('word/document.xml')
content = c.read()
r = b'w:commentRangeStart'
self.assertNotIn(r, content)
r = b'w:commentRangeEnd'
self.assertNotIn(r, content)
r = b'w:commentReference'
self.assertNotIn(r, content)
os.remove('./tests/data/comment_clean.docx')
os.remove('./tests/data/comment_clean.cleaned.docx')
def test_clean_document_xml_rels(self):
with zipfile.ZipFile('./tests/data/comment.docx') as zipin:
c = zipin.open('word/_rels/document.xml.rels')
content = c.read()
r = b'Target="comments.xml"'
self.assertIn(r, content)
shutil.copy('./tests/data/comment.docx', './tests/data/comment_clean.docx')
p = office.MSOfficeParser('./tests/data/comment_clean.docx')
self.assertTrue(p.remove_all())
with zipfile.ZipFile('./tests/data/comment_clean.cleaned.docx') as zipin:
c = zipin.open('word/_rels/document.xml.rels')
content = c.read()
r = b'Target="comments.xml"'
self.assertNotIn(r, content)
os.remove('./tests/data/comment_clean.docx')
os.remove('./tests/data/comment_clean.cleaned.docx')
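The four docx tests above all follow the same shape: open the .docx as a zip archive, assert a marker member or byte pattern is present, clean, then assert it is gone. The zip-inspection half can be sketched with an in-memory archive (the member names mirror the OOXML layout; the fixture is illustrative):

```python
import io
import zipfile

# build a throwaway "docx" in memory: a zip containing a comments part
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('word/document.xml', '<w:document/>')
    z.writestr('word/comments.xml', '<w:comments/>')

# reopen it read-only and inspect the member list, as the tests do
with zipfile.ZipFile(buf) as z:
    names = z.namelist()

print('word/comments.xml' in names)  # True
```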


@@ -23,6 +23,11 @@ class TestLightWeightCleaning(unittest.TestCase):
'parser': images.JPGParser,
'meta': {'Comment': 'Created with GIMP'},
'expected_meta': {},
#}, {
# 'name': 'webp',
# 'parser': images.WEBPParser,
# 'meta': {'Warning': '[minor] Improper EXIF header'},
# 'expected_meta': {},
}, {
'name': 'torrent',
'parser': torrent.TorrentParser,
@@ -33,7 +38,6 @@ class TestLightWeightCleaning(unittest.TestCase):
'parser': images.TiffParser,
'meta': {'ImageDescription': 'OLYMPUS DIGITAL CAMERA '},
'expected_meta': {
'Orientation': 'Horizontal (normal)',
'ResolutionUnit': 'inches',
'XResolution': 72,
'YResolution': 72


@@ -1,3 +0,0 @@
# Words to be ignored by codespell.
# Put one word per line and sort alphabetically.
process'