Google has announced the release of Magika 1.0, a toolkit that identifies a file's content type by analyzing the data it contains. Magika can accurately detect programming languages, compression methods, installation packages, executable code, markup types, and audio, video, document, and image formats. The tools and the machine learning model are available under the Apache 2.0 license, with bindings for Rust, Python, JavaScript/TypeScript, and Go.
Unlike similar projects that determine MIME types from content, Magika uses machine learning to achieve high performance and accuracy. The model was trained with the Keras framework on a dataset of more than 100 million sample files (over 3 TB) and recognizes 200 data types with an accuracy of at least 99%. Exported to the ONNX format, the model is compact and improves detection accuracy by 50% over Google’s previous rule-based system.
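For contrast, the rule-based approach that Magika is compared against can be sketched as a lookup of well-known leading "magic" bytes. This is only a minimal illustration of the general technique, not Google's earlier system or any part of Magika; the function name and the signature table are hypothetical.

```python
from typing import Optional

# A few widely documented file signatures ("magic" bytes) and their MIME types.
# The table is deliberately tiny; real rule-based detectors carry far more rules.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
]


def sniff_by_signature(data: bytes) -> Optional[str]:
    """Return a MIME type if the data starts with a known signature, else None."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None


if __name__ == "__main__":
    print(sniff_by_signature(b"%PDF-1.7\n"))              # binary format with a signature
    print(sniff_by_signature(b"def main():\n    pass\n"))  # source code has no signature
```

Formats without a fixed signature, such as source code or markup, fall through such a table, which is the kind of case where a learned model has room to do better.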
Google uses Magika in services such as Gmail, Drive, Code Insight, and Safe Browsing for security checks and compliance monitoring. It is also integrated into platforms such as VirusTotal and abuse.ch, where it serves as an initial filter before files undergo detailed analysis. Running on Google’s infrastructure, Magika can process millions of files per second and billions of files per week, and producing a result takes 5 ms on a single CPU core, regardless of file size.
Google provides a command-line utility, packages for Python, Rust, and Go, and a JavaScript library that works both in the browser and in Node.js projects. The command-line interface and the API support batch operations (scanning multiple files in one request), recursive directory scanning, and three prediction modes that adjust error tolerance.
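One way to picture how a prediction mode could trade accuracy against coverage is as a confidence threshold applied to the model's top guess. The sketch below is not Magika's API: the mode names, the threshold values, and the "unknown" fallback label are all assumptions made for illustration.

```python
from enum import Enum
from typing import List, Tuple


class ToleranceMode(Enum):
    """Hypothetical mode names; not taken from Magika."""
    STRICT = "strict"
    BALANCED = "balanced"
    LENIENT = "lenient"


# Hypothetical thresholds: the stricter the mode, the more confident the
# model must be before its label is accepted.
THRESHOLDS = {
    ToleranceMode.STRICT: 0.95,
    ToleranceMode.BALANCED: 0.50,
    ToleranceMode.LENIENT: 0.0,
}


def resolve_label(label: str, score: float, mode: ToleranceMode,
                  fallback: str = "unknown") -> str:
    """Accept the top label only if its score meets the mode's threshold."""
    return label if score >= THRESHOLDS[mode] else fallback


def resolve_batch(predictions: List[Tuple[str, float]],
                  mode: ToleranceMode) -> List[str]:
    """Apply the same mode to a batch of (label, score) predictions."""
    return [resolve_label(label, score, mode) for label, score in predictions]


if __name__ == "__main__":
    batch = [("pdf", 0.99), ("python", 0.80), ("ini", 0.30)]
    for mode in ToleranceMode:
        print(mode.value, resolve_batch(batch, mode))
```

Under these assumptions a strict mode returns fewer but safer answers, while a lenient mode always reports the model's best guess.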
Originally developed in Python, the content type determination engine for Mag