Hi guys,
I've been working with MLOps pipelines lately, and it always bothered me that torch.load() (and Pickle in general) is basically an RCE vulnerability we've all just accepted. We download gigabytes of opaque weights from Hugging Face and run them in production, often with full privileges.
I looked for existing tools, but many relied on simple regex matching (easy to bypass) or didn't verify whether the file had been tampered with in transit.
So I built Veritensor. It’s a CLI tool to gatekeep models before they hit your runtime.
How it works under the hood:
1. Pickle Emulation: Instead of grepping for os.system, it emulates the Pickle VM stack. This catches obfuscated payloads (like STACK_GLOBAL assembly) without actually executing the code. (There's a rough sketch of the idea right after this list.)
2. Identity Check: It hashes your local file and queries the Hugging Face Hub API to ensure it matches the upstream version bit-for-bit (detects MITM or corruption).
3. License Headers: It parses metadata from Safetensors/GGUF to detect restrictive licenses (like CC-BY-NC or AGPL) so you don't accidentally ship them in a commercial product. (Also sketched after the list.)
4. Signing: Integrates with Sigstore Cosign to sign the container if the scan passes.
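To make check 1 concrete, here's roughly the idea, stripped down for illustration (not the exact code in the repo; the denylist and the file name are placeholders): walk the opcode stream with pickletools instead of unpickling, keep just enough of a value stack to resolve what STACK_GLOBAL would import, and compare the resulting module/name pairs against a policy.

```python
import io
import pickletools

# Placeholder denylist; the real policy is much larger (ideally an allowlist).
DANGEROUS = {
    ("os", "system"), ("posix", "system"), ("subprocess", "Popen"),
    ("builtins", "eval"), ("builtins", "exec"), ("builtins", "getattr"),
}

# Opcodes that push a plain string constant onto the stack.
STRING_OPS = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE",
              "STRING", "SHORT_BINSTRING", "BINSTRING"}

def scan_pickle(data: bytes):
    """Return the (module, name) pairs the pickle would import that hit the denylist."""
    imports, stack = [], []
    for opcode, arg, _pos in pickletools.genops(io.BytesIO(data)):
        if opcode.name in STRING_OPS:
            stack.append(str(arg))
        elif opcode.name == "GLOBAL":
            # GLOBAL carries "module name" inline as its argument
            module, name = str(arg).split(" ", 1)
            imports.append((module, name))
        elif opcode.name == "STACK_GLOBAL" and len(stack) >= 2:
            # STACK_GLOBAL pops the module and name that were pushed as strings earlier;
            # a fuller emulator also models memo ops (PUT/GET/MEMOIZE) and stack pops.
            imports.append((stack[-2], stack[-1]))
    return [pair for pair in imports if pair in DANGEROUS]

with open("data.pkl", "rb") as f:   # hypothetical: the pickle extracted from a .pt archive
    print(scan_pickle(f.read()))
```

Nothing from the file is ever executed; the scanner only reasons about what the unpickler *would* import.

For check 3, the safetensors format makes this easy: the file starts with an 8-byte little-endian length followed by a JSON header, whose optional __metadata__ table is a free-form string map where producers can stash things like a license tag. A sketch (the "license" key and the substrings to flag are assumptions; GGUF needs its own reader):

```python
import json
import struct

RESTRICTIVE = ("cc-by-nc", "agpl")   # assumed substrings to flag

def safetensors_metadata(path: str) -> dict:
    """Return the optional __metadata__ table from a .safetensors header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # u64 little-endian header size
        header = json.loads(f.read(header_len))
    return header.get("__metadata__") or {}

meta = safetensors_metadata("model.safetensors")
license_tag = meta.get("license", "")                    # key name is an assumption
if any(tag in license_tag.lower() for tag in RESTRICTIVE):
    print(f"Restrictive license in metadata: {license_tag}")
```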
It supports PyTorch, Keras (checks for Lambda layers), and GGUF. Written in Python, Apache 2.0.
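The Keras check matters because a Lambda layer carries arbitrary marshalled Python that runs at load time. For the newer .keras archive format (a zip with the model's config.json inside), detection can be a simple walk over that config; here's a stripped-down sketch (legacy HDF5 models keep the same config in a root attribute and need a separate path, and the file name is a placeholder):

```python
import json
import zipfile

def find_lambda_layers(path: str) -> list:
    """Return the names of Lambda layers declared in a .keras archive's config.json."""
    with zipfile.ZipFile(path) as zf:
        config = json.loads(zf.read("config.json"))

    found = []

    def walk(node):
        if isinstance(node, dict):
            if node.get("class_name") == "Lambda":
                found.append(node.get("config", {}).get("name", "<unnamed>"))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(config)
    return found

print(find_lambda_layers("model.keras"))   # hypothetical path
```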
I’d love to hear your feedback on the detection logic or edge cases I might have missed with the Pickle emulation.
Repo: https://github.com/ArseniiBrazhnyk/Veritensor
PyPI: pip install veritensor
OP here. One of the annoying edge cases I hit was handling zip bombs in PyTorch files (since a .pt file is just a zip archive). I had to implement a stream reader with strict memory limits to prevent the scanner itself from OOMing on malicious archives.
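For anyone curious what that guard looks like, here's a simplified sketch (the limits are placeholders, not the ones Veritensor actually uses): reject archives whose central directory already declares absurd sizes or compression ratios, then stream each member in bounded chunks anyway, because the declared sizes can themselves lie.

```python
import zipfile

MAX_TOTAL_UNCOMPRESSED = 20 * 1024**3   # placeholder caps
MAX_RATIO = 100
CHUNK = 1 << 20

def open_checked(path: str) -> zipfile.ZipFile:
    """Refuse archives whose central directory declares absurd sizes or ratios."""
    zf = zipfile.ZipFile(path)
    infos = zf.infolist()
    if sum(i.file_size for i in infos) > MAX_TOTAL_UNCOMPRESSED:
        raise ValueError("archive declares too many uncompressed bytes")
    for i in infos:
        if i.compress_size and i.file_size / i.compress_size > MAX_RATIO:
            raise ValueError(f"suspicious compression ratio for {i.filename}")
    return zf

def read_member_bounded(zf: zipfile.ZipFile, name: str, limit: int) -> bytes:
    """Stream a member in chunks and abort once it grows past `limit` bytes."""
    out = bytearray()
    with zf.open(name) as fh:
        while chunk := fh.read(CHUNK):
            out.extend(chunk)
            if len(out) > limit:
                raise ValueError(f"{name} exceeded {limit} bytes while decompressing")
    return bytes(out)
```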
Also, the "Identity Check" was tricky because people often rename files locally (e.g., model.bin instead of pytorch_model.bin). The tool now queries the HF API to check whether any file in the repo matches the local hash, rather than relying on the filename alone.
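Roughly, that lookup can be done with the huggingface_hub client: model_info(..., files_metadata=True) exposes each file's LFS sha256, which you can compare against a locally computed digest regardless of what the file is called on disk. A simplified sketch (assumes a recent huggingface_hub where LFS info is an object with a sha256 attribute; small non-LFS files have no LFS entry and would need the git blob hash instead; the paths and repo id are placeholders):

```python
import hashlib

from huggingface_hub import HfApi

def sha256_of(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_upstream_match(local_path: str, repo_id: str):
    """Return the repo filename whose LFS sha256 matches the local file, else None."""
    local_hash = sha256_of(local_path)
    info = HfApi().model_info(repo_id, files_metadata=True)
    for sibling in info.siblings:
        lfs = sibling.lfs   # populated for LFS-tracked files when files_metadata=True
        if lfs is not None and lfs.sha256 == local_hash:
            return sibling.rfilename
    return None

print(find_upstream_match("model.bin", "some-org/some-model"))   # placeholder repo id
```

Happy to answer any questions!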