Tarfile Extractall Without Filter
Code example
import tarfile

def install_archive(archive_path: str, install_dir: str) -> None:
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(install_dir)
The function looks fluent. The archive is opened with the canonical with block; the extraction is one line. For trusted archives the function works perfectly.
The defect is invisible until the archive is attacker-controlled. A malicious tar file can contain entries with relative-traversal paths like ../../etc/passwd or absolute paths like /root/.ssh/authorized_keys. Without a filter, TarFile.extractall(install_dir) writes each entry to its declared path, which lands outside install_dir whenever the declared path traverses upward. This is CVE-2007-4559, the canonical path-traversal flaw in Python's tarfile module; the broader archive-extraction flaw class across formats was later popularized by Snyk (2018) as "Zip Slip". The CVE has been open for nearly 20 years.
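Python 3.12 (PEP 706, 2023) added extraction filters, with tarfile.data_filter as the recommended safe-extraction approach. The feature was also backported to security releases of older supported versions, so hasattr(tarfile, 'data_filter') is the reliable availability check. The tightened version becomes a one-line fix: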
def install_archive(archive_path: str, install_dir: str) -> None:
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(install_dir, filter='data')  # rejects path-traversal entries
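The filter='data' argument (also 'tar' and 'fully_trusted' exist) tells Python to apply the data-filter rules: leading slashes are stripped from member names, and any member that is still absolute or that would resolve outside the destination is refused; special files (devices, FIFOs) are rejected; symbolic and hard links that point to absolute paths or outside the destination are refused; and ownership and unsafe permission bits are reset.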
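The rejection behaviour described above can be observed with a small self-contained sketch. It builds a deliberately hostile archive in memory and attempts a filtered extraction. The helper names build_traversal_archive and extract_with_data_filter are hypothetical, invented for this illustration; the hasattr guard is an assumption that some interpreters may predate extraction filters.

```python
import io
import os
import tarfile
import tempfile


def build_traversal_archive() -> bytes:
    """Return an in-memory .tar.gz whose single member tries to escape upward."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        payload = b"owned\n"
        info = tarfile.TarInfo(name="../escaped.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def extract_with_data_filter(archive: bytes, dest: str) -> str:
    """Attempt a filtered extraction and report what happened."""
    if not hasattr(tarfile, "data_filter"):
        # Interpreter predates extraction filters; nothing is extracted.
        return "filters unavailable"
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        try:
            tar.extractall(dest, filter="data")
        except tarfile.FilterError as exc:
            # FilterError is the base class of the filter rejections.
            return f"rejected: {type(exc).__name__}"
    return "extracted"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as outer:
        dest = os.path.join(outer, "dest")
        os.mkdir(dest)
        print(extract_with_data_filter(build_traversal_archive(), dest))
```

Catching tarfile.FilterError narrowly, rather than a blanket Exception, keeps the rejection visible to the caller instead of silently absorbing it.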
On interpreters that lack extraction filters (the codebase may need to support older versions), the cure is a manual safe-extract wrapper:
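import os
import tarfile

def safe_extract(tar: tarfile.TarFile, path: str = ".") -> None:
    # Resolve the destination once so every member is compared against the same root.
    abs_dest = os.path.realpath(path)
    for member in tar.getmembers():
        # Where would this member land once ".." segments and absolute names are resolved?
        member_path = os.path.realpath(os.path.join(path, member.name))
        if not member_path.startswith(abs_dest + os.sep) and member_path != abs_dest:
            raise ValueError(f"Path traversal in archive: {member.name}")
    # Only member names are validated here; symlink and hard-link targets are not inspected.
    tar.extractall(path)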
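The wrapper above validates member names only. A stricter variant, sketched here under the assumption that the archives in question never legitimately contain links or device nodes, also refuses those member types and layers the built-in filter on top wherever the interpreter provides it. The names extract_untrusted and make_tar are hypothetical helpers for this illustration, not part of tarfile.

```python
import io
import os
import tarfile
import tempfile


def extract_untrusted(tar: tarfile.TarFile, dest: str) -> None:
    """Validate every member, then extract; raises ValueError on anything suspicious."""
    abs_dest = os.path.realpath(dest)
    members = tar.getmembers()
    for member in members:
        # Links and device nodes can redirect or abuse later writes, so refuse them.
        if member.issym() or member.islnk() or member.isdev():
            raise ValueError(f"Refusing link or device member: {member.name}")
        # Resolve where the member would land and require it to stay under dest.
        target = os.path.realpath(os.path.join(abs_dest, member.name))
        if os.path.commonpath([abs_dest, target]) != abs_dest:
            raise ValueError(f"Path traversal in archive: {member.name}")
    # Use the built-in data filter as a second layer when it is available.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    tar.extractall(dest, members=members, **extra)


def make_tar(entries):
    """Demo helper: build an in-memory tar from (TarInfo, bytes) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for info, data in entries:
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    link = tarfile.TarInfo("innocent")
    link.type = tarfile.SYMTYPE
    link.linkname = "/etc"
    with tempfile.TemporaryDirectory() as dest:
        with tarfile.open(fileobj=make_tar([(link, b"")])) as tar:
            try:
                extract_untrusted(tar, dest)
            except ValueError as exc:
                print(exc)
```

Rejecting every link is deliberately blunt. Archives that legitimately ship symlinks would need a finer check on each link target, which is the case the built-in data filter handles.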
The pattern has several visible sub-shapes in captured specimens:
- Established MLOps platform — the defect exists in long-established Python projects that haven't migrated to the Python 3.12 data_filter idiom. Captured in mlrun/mlrun (Bandit B202 HIGH; part of a 2026-03-16 batch-audit across 5+ Python repos).
- AI/research tooling — arXiv source-archive extractors operating on user-supplied untrusted archives. The defect is particularly costly here because the trust boundary is paper author → automated pipeline. Captured in kpollz/daily_papers_tool (Critical; AGENTS.md 23KB).
- Cross-platform desktop utility — auto-installers extracting binaries from internet downloads. Captured in iAmGiG/FFmpeg-AutoToolkit (HIGH; references CVE-2007-4559 explicitly; provides both Python 3.12+ and pre-3.12 fix recipes).
All sub-shapes share the same root mechanism: the model produced tar.extractall(path) without the filter='data' argument that Python 3.12 added.
Mechanism
A language model generates archive-extraction code from local context. The training corpus contains:
- Pre-Python-3.12 form (the corpus's dominant shape): with tarfile.open(path) as tar: tar.extractall(dest). This is the canonical "how to extract a tar archive in Python" Stack Overflow answer, tutorial example, and library README. The vast majority of training-data examples use this form because (a) Python 3.12 is recent (released October 2023) and (b) the corpus weight of pre-3.12 Python content dwarfs post-3.12 content.
- Post-Python-3.12 form (the corpus's emerging shape): with tarfile.open(path) as tar: tar.extractall(dest, filter='data'). PEP 706 (2023) added the argument; community guidance is migrating; the corpus is slowly catching up.
- Manual safe_extract wrapper (the corpus's hand-rolled shape): security-conscious projects wrote their own member-validation function before Python 3.12 made it built-in. The wrapper code appears in security blog posts and CVE-2007-4559 remediation guides.
The defective shape is over-represented per-token in two corpus segments:
- Python's own tarfile documentation, pre-Python-3.12. The Python docs themselves used tar.extractall(path) as the canonical example for years. The model has seen the canonical Python-docs form, often without the surrounding CVE-2007-4559 context.
- Stack Overflow answers about "how to extract a tar in Python." Questions like "how do I extract a .tar.gz in Python" get answers using the bare extractall(dest) form because the asker's question is what does the API look like, not how do I extract safely. Security-conscious follow-ups exist in separate Q&A threads.
The model knows about the filter='data' argument in the abstract — it can describe what it does when asked directly. What it does not do reliably during local generation is choose the filter argument when producing an extractall call. The token-level prediction follows the corpus-dominant pre-Python-3.12 form.
This is the deployment-context-blind failure mode applied to a recent ecosystem deprecation. The defect path requires:
1. The function runs on a deployment that handles externally-supplied archives.
2. The Python version is 3.12+ (where the filter argument is available and the unfiltered form emits a DeprecationWarning).
3. The codebase has not been migrated to the filter idiom.
All three are common in 2026 — Python 3.12 adoption is widespread; AI-generated code is being deployed; archives come from many sources.
The defect path is direct: arbitrary file write on the extraction host. Concrete attack scenarios from the captured specimens:
- mlrun: malicious archive in an artifact store overwrites files on the MLRun execution host
- daily_papers_tool: malicious arXiv source tarball overwrites files in the user's home directory during paper-processing pipeline
- FFmpeg-AutoToolkit: man-in-the-middle on the FFmpeg download URL serves a malicious archive that overwrites system files
This pattern is AI-amplified, not AI-exclusive. Human Python programmers wrote bare extractall calls for years (the corpus is full of them). The AI-amplified differential rests on:
- Initial-state authorship after Python 3.12: AI-generated codebases produce the pre-3.12 form even when the project targets Python 3.12+ (where the filter argument is the documented recommendation).
- Codified-guidance-is-insufficient at the bandit/CVE layer: Bandit B202 catches the pattern; CVE-2007-4559 is documented; Python 3.12 emits a DeprecationWarning. AI-generated codebases still produce the unfiltered form.
- Trust-boundary-shift on archive sources: AI-generated code routinely fetches archives from third-party sources (paper repos, model registries, dataset hubs, package indexes) — exactly where the trust assumption fails first.
Evidence / incident
Three captured specimens from different AI/Python project domains. Detailed specimen notes are not included in the public repository.
- mlrun/mlrun: established MLOps platform; mlrun/package/utils/_archiver.py extracts archives without member validation. Bandit B202 HIGH. Part of a 2026-03-16 batch audit filed across 5+ Python repos (mlrun, PyCQA/bandit, alephdata/memorious, eliasgranderubio/dagda, crate/crate-python) with identical structure, which suggests an automated security-audit service. References CWE-22. AGENTS.md (6513 bytes).
- kpollz/daily_papers_tool: arXiv source-archive figure extractor; summary_utils/extract_figure.py uses bare extractall on user-supplied archives. Critical severity; self-audit by project author. References Python 3.12 deprecation. AGENTS.md (23367 bytes).
- iAmGiG/FFmpeg-AutoToolkit: cross-platform FFmpeg auto-installer; ffmpeg_manager.py:91-92 extracts internet-downloaded archives. HIGH severity. References CVE-2007-4559 (the canonical Python tarfile CVE) explicitly. Provides both a Python 3.12+ filter recipe and a pre-3.12 manual safe_extract recipe, which is methodologically thorough.
Three different defect surfaces (MLOps execution host / paper-processing pipeline host / desktop-utility user system), three different trust-boundary contexts (artifact store / paper-author-supplied source tarball / mirror MITM), three different audit framings (batch-audit by automated service / self-audit / self-audit with canonical-CVE reference).
Supplementary references:
- PyCQA/bandit itself: the bandit project's own example file examples/tarfile_extractall.py was caught by the 2026-03-16 batch audit. Bandit's example is intentionally a B202 violation (the file demonstrates what B202 catches), but it's a small irony that the canonical Python security-linter's own example file got flagged.
- alephdata/memorious: memorious/operations/extract.py in a web-scraping toolkit. Adjacent specimen; same batch-audit shape.
- crate/crate-python: src/crate/testing/layer.py. Adjacent specimen; same batch-audit shape.
Bandit has rule B202 (tarfile_unsafe_members). Python 3.12 added tarfile.data_filter (PEP 706); Python 3.12+ emits a DeprecationWarning for extractall() without a filter; in Python 3.14+ the default changes so an unfiltered extractall() applies the safe data filter automatically. Wide community recognition; the AI-amplified observation is that AI-generated code continues to produce the pre-filter form despite the deprecation roadmap.
Detection cues
What to look for in a diff or completion:
- tar.extractall(path) or tarfile.open(...).extractall(path) without a filter= argument. The most direct signal. Python 3.12+ should always use filter='data'; pre-3.12 should use a manual safe_extract wrapper.
- zipfile.ZipFile(path).extractall(dest): the ZIP analogue, with an important asymmetry. zipfile.extractall already sanitizes member paths (it strips leading slashes and drops .. components), and PEP 706 deliberately did not add a filter= mechanism for zipfile. So the tarfile cure (filter='data') does not exist for ZIPs; still treat untrusted ZIP extraction with care, but the built-in path handling differs.
- Archive-extracting code in artifact-store, paper-processing, model-registry, or dataset-fetching contexts. These are the AI-typical surfaces where untrusted archives flow.
- Auto-installer / package-manager code that downloads tarballs from a URL. The download URL is the trust boundary; a compromised mirror / MITM allows malicious archives.
- Functions whose archive input comes from user upload, third-party API, or external storage. The trust boundary is whoever can produce an archive that ends up in this function; verify by tracing the data flow to the function's archive parameter.
- # nosec B202 annotations (Bandit's suppression marker; linters that use # noqa codes have equivalents). If the suppression has no justifying comment, the lint rule has been silenced without addressing the underlying defect.
The diagnostic question for any archive-extraction call: who can supply the archive, and what does an attacker-controlled member path look like? If the answer to "who" is anything other than "code I trust completely," the filter is required.
Bandit B202 catches the pattern mechanically. Python 3.12+ also produces a runtime DeprecationWarning — if the test suite captures warnings, the defect surfaces automatically.
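Where Bandit is not wired into a project, the most direct cue can be approximated with the standard-library ast module. The following is a minimal sketch, not a replacement for B202: the function name find_unfiltered_extractall is hypothetical, and it matches on the method name alone, so it will also flag zipfile extractall calls and any unrelated method that happens to share the name.

```python
import ast


def find_unfiltered_extractall(source: str) -> list[int]:
    """Return line numbers of .extractall(...) calls that pass no filter= keyword."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "extractall"
            # A **kwargs splat has arg None, so it is conservatively treated as unfiltered.
            and not any(kw.arg == "filter" for kw in node.keywords)
        ):
            hits.append(node.lineno)
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "import tarfile\n"
        "with tarfile.open('a.tar') as t:\n"
        "    t.extractall('d')\n"
        "    t.extractall('d', filter='data')\n"
    )
    print(find_unfiltered_extractall(sample))
```

Each hit still needs the diagnostic question applied to it, since the scan says nothing about who can supply the archive.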
Notes
Category security. Both this entry and string-built-sql are CVE-class defects with widely-adopted lint rules (Bandit B202, B608) that AI-generated code reproduces despite ecosystem recognition.
Difficulty rated low. Spotting extractall(...) without filter= is visually trivial. Bandit B202 catches it mechanically. Python 3.12+ DeprecationWarning surfaces it at runtime. The reason this is in the taxonomy is AI-amplification dimensions (post-3.12 codebases still produce the pre-3.12 form) and defect surface (archive-extraction contexts in AI tooling).
The pattern is AI-amplified, not AI-exclusive. Restated: every Python developer who has worked with tarfile has written a bare extractall(path) at some point. CVE-2007-4559 has been open for nearly 20 years. The AI-amplified differential rests on initial-state authorship after Python 3.12 (the corpus-dominant pre-3.12 form persists), codified-guidance-insufficient at multiple layers (Bandit + Python deprecation + CVE documentation), and trust-boundary-drift in AI-tooling contexts.
False-positive shapes. Be cautious before flagging:
- Genuinely-trusted archives. Internal-only build pipelines, CI artifacts produced by your own infrastructure with cryptographic verification, archives whose contents have been pre-vetted. If the archive's provenance is internally controlled and verified, the filter is belt-and-suspenders rather than required. The cue is whether the archive could ever be produced by an external party.
- Python < 3.12 codebases where the filter argument isn't available. The pre-3.12 form requires a manual safe_extract wrapper. The AI-typical form is the bare extractall regardless of Python version — but the cure differs.
- Test fixtures or scripts that extract known-content archives. extractall(test_dir) in a unit test where the archive is checked into the repo is fine.
- Migration code with documented version constraint. If the project explicitly targets pre-3.12 and the bare form is wrapped in a documented manual safe_extract, the absence of filter='data' is by design.
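For the genuinely-trusted case, "cryptographic verification" can be as simple as comparing the archive against a pinned digest before extraction. This is a minimal sketch under the assumption that the expected SHA-256 value reaches the code through a channel the archive's supplier cannot modify; the function name sha256_matches is hypothetical.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_matches(archive_path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file through SHA-256 and compare against a pinned hex digest."""
    digest = hashlib.sha256()
    with open(archive_path, "rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected_hex.lower())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "artifact.tar.gz")
        with open(path, "wb") as fh:
            fh.write(b"example bytes")
        pinned = hashlib.sha256(b"example bytes").hexdigest()
        print(sha256_matches(path, pinned))
```

A matching digest establishes only that the archive is the one that was pinned. It says nothing about whether the pinned archive was safe, so the filter remains the cheaper default.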
Mutation operator hint. A deterministic mutation that introduces the pattern from clean code:
- Take tar.extractall(dest, filter='data') and remove the filter argument
- Take a safe_extract(tar, dest) call and replace with tar.extractall(dest)
- Take a zipfile.ZipFile.extractall(dest, members=safe_members) and remove the members argument
- Wrap a bare extractall in # nosec B202 without justification
These compose with swallowed-exceptions — a bare extractall inside try: ...; except Exception: pass silently absorbs path-traversal errors that would otherwise surface, allowing the attack to succeed silently. Also composes with hardcoded-config-values when the archive source URL is hardcoded (no way to verify integrity at fetch time).
Connection to codified-guidance-is-insufficient note. Bandit B202 + Python 3.12 DeprecationWarning + CVE-2007-4559 + Python documentation guidance — all are codified, well-known, and widely-cited. AI-generated codebases reproduce the pre-filter form despite all four layers of codified guidance. The 2026-03-16 batch audit catching the bandit project's own example file is the comedic-but-real instance of codified-guidance-insufficient (the canonical lint-rule project's own example demonstrates the lint-rule violation).
Connection to deployment-context-blind defects cluster. This entry joins missing-network-timeout, assert-for-runtime-validation, resource-leak-no-context-manager, async-await-mismatch, print-instead-of-logging, f-string-in-logger-call, and string-built-sql in the cluster of patterns whose defect surfaces only in production-deployment contexts. Here the deployment-context is receives-untrusted-archives — which AI tooling routinely does (model registries, paper extractors, dataset fetchers).
Connection to trust-boundary-shift methodological observation. The kpollz specimen's framing — "Dataset registries tend to grow over time, and the trust boundary moves" (a phrase from the asi-build specimen for shell=True) — applies here at the archive-source layer. A codebase that today extracts archives only from a trusted internal store may tomorrow extract from a community registry. The bare extractall doesn't survive the trust-boundary shift; the filter form does. This is methodologically important for evaluating latent security defects in AI-generated code — the question isn't only "is the code safe today" but "does it survive normal evolution of the trust boundary."