On Thursday, November 7 (local time), Google announced the release of Magika 1.0, the first stable version of its AI-based file type detection system. Rewritten in Rust for improved speed and memory safety, Magika has been widely adopted by the open-source community since its public release early last year and now sees over 1 million downloads per month. This update brings a completely new architecture, performance improvements, and support for more file types.
The biggest change in Magika 1.0 is that its core engine has been completely rewritten in Rust for better performance and memory safety. The new release also ships a native Rust command-line tool that can identify hundreds of files per second on a single core and scale to thousands per second on multi-core CPUs.
The system uses ONNX Runtime for model inference and the Tokio framework for asynchronous parallel processing. According to Google's test data, Magika can process approximately 1,000 files per second on a MacBook Pro (M4).
In terms of file type support, Magika 1.0 expands its detection capabilities to over 200 file formats, double the number supported by the initial version. New categories include:
Data Science and Machine Learning: Supports Jupyter Notebooks (ipynb), NumPy (npy, npz), PyTorch (pytorch), ONNX (onnx), Apache Parquet (parquet), and HDF5 (h5) files;
Modern Programming and Web Development: Adds support for Swift, Kotlin, TypeScript, Dart, Solidity, WebAssembly (wasm), and Zig;
DevOps and Configuration Files: Supports Dockerfile, TOML, HashiCorp HCL, Bazel build files, and YARA rules;
Databases and Graphics Formats: Adds support for SQLite, AutoCAD (dwg, dxf), Photoshop (psd), and modern web fonts (woff, woff2).
Magika 1.0 also improves its ability to distinguish between similar formats, such as JSONL vs. JSON, TSV vs. CSV, Apple binary plist vs. XML plist, C vs. C++, and JavaScript vs. TypeScript.
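To illustrate why pairs such as JSON and JSONL are easy to confuse, here is a minimal rule-based sketch in Python. It is not how Magika works (Magika relies on a trained model); the function name `classify_json_like` and its three return labels are hypothetical and chosen only for this illustration.

```python
import json


def classify_json_like(text: str) -> str:
    """Return 'json', 'jsonl', or 'unknown' using plain parsing rules.

    Hypothetical baseline for illustration only; not Magika's method.
    """
    # A document that parses as one JSON value is treated as JSON.
    try:
        json.loads(text)
        return "json"
    except json.JSONDecodeError:
        pass

    # Otherwise, require every non-empty line to be a JSON value on its own.
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    try:
        for line in lines:
            json.loads(line)
    except json.JSONDecodeError:
        return "unknown"
    return "jsonl"
```

Note that a JSONL file containing a single record is also a valid JSON document, so this rule labels it as JSON. Ambiguities of this kind are one reason simple rules struggle with closely related formats.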
On the technical side, the team faced two major challenges: the massive scale of the training data and the scarcity of samples for some file types. The uncompressed dataset exceeded 3TB, so Google used its in-house SedPack dataset library, which streams and decompresses data during loading, to train efficiently. For file types with too few samples, the research team used the generative AI tool Gemini to create high-quality synthetic training data, converting existing code and structured files into other formats to improve the model's generalization ability.
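The general idea of streaming loading and decompression can be sketched with Python's standard library. This does not use SedPack or reflect its API; the length-prefixed gzip shard layout and the helper names `write_shard` and `stream_records` are assumptions made for this example only.

```python
import gzip
import io
from typing import Iterable, Iterator


def write_shard(records: Iterable[bytes]) -> bytes:
    """Compress length-prefixed records into one in-memory gzip shard."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for rec in records:
            gz.write(len(rec).to_bytes(4, "big"))  # 4-byte length prefix
            gz.write(rec)
    return buf.getvalue()


def stream_records(shards: Iterable[bytes]) -> Iterator[bytes]:
    """Lazily decompress shards and yield one record at a time."""
    for shard in shards:
        with gzip.GzipFile(fileobj=io.BytesIO(shard), mode="rb") as gz:
            while True:
                header = gz.read(4)
                if len(header) < 4:  # end of this shard
                    break
                size = int.from_bytes(header, "big")
                yield gz.read(size)
```

Because `stream_records` is a generator, a training loop can consume samples as they are decompressed instead of holding the whole uncompressed dataset in memory.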
The new version of Magika also updates the Python and TypeScript modules, simplifying integration for developers working in different languages. Users can install the native client on Linux, macOS, or Windows with a simple command, or run `pipx install magika` to install the Python package and use the Rust command-line tool. Google stated that Magika's future development will continue to focus on performance optimization and support for more file types. The team encourages the developer community to contribute through testing, feature requests, and code submissions.
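For developers using the Python module, a minimal integration sketch might look like the following. It assumes the `magika` package is installed and exposes `Magika().identify_bytes(...)` with the label available as `result.output.label`, as in the project's README at the time of writing; the wrapper `detect_label` is a hypothetical helper that returns `None` when the package is missing.

```python
from typing import Optional


def detect_label(data: bytes) -> Optional[str]:
    """Return Magika's content-type label for `data`.

    Returns None if the `magika` package is not installed.
    Hypothetical helper; the attribute names are assumed from the README.
    """
    try:
        from magika import Magika  # installed via `pip install magika`
    except ImportError:
        return None

    result = Magika().identify_bytes(data)
    return str(result.output.label)


if __name__ == "__main__":
    print(detect_label(b"function log(msg) { console.log(msg); }"))
```

Creating a `Magika` instance loads the model, so code that checks many inputs would likely create it once and reuse it rather than constructing it on every call as this short sketch does.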