Copyright Issues When Rewriting ScanCode In Rust Using AI

The maintainer of ScanCode Toolkit, a tool designed to scan code for copyright infringements, identify licenses, and detect vulnerabilities, has criticized Provenant, a project that used AI to create a clone of ScanCode by rewriting it from Python to Rust. Provenant is accused of violating ScanCode’s trademark, removing copyright and license references, and continuing to use key ScanCode algorithms while retaining the original project’s architecture and code structure.

The ScanCode maintainer pointed out that the clone could be created successfully because the project has a comprehensive automated test suite of more than 90,000 tests, 40,000 of which are dedicated to detecting specific licenses in code. Although the rewritten version claims a 10-100x increase in performance, the ScanCode maintainer found that the speed came at the cost of an incomplete test pass rate, reduced accuracy, and information missing from the analysis results.

Although the Rust clone outperformed ScanCode in certain performance tests, it was slower on the standard test set, even with some checks skipped. After optimizations such as caching were added to ScanCode, its code-scanning performance matched that of the Rust clone with no loss of accuracy.

The authors of the clone have also been accused of infringing copyright and violating the Apache 2.0 license under which ScanCode is distributed. It was noted that the rewrite violated four key license requirements: including the original NOTICE file, retaining copyright notices, marking the changes made, and using a different name for the derivative work. The authors added the NOTICE file and renamed their project after being notified, but two violations remain unresolved.
