FlexGen – an engine for running ChatGPT-like AI bots on systems with a single GPU

A group of researchers from Stanford University, the University of California at Berkeley, ETH Zurich, the Higher School of Economics and Carnegie Mellon University, as well as Yandex and Meta, has published the source code of FlexGen, an engine for running large language models on systems with limited resources. For example, the engine makes it possible to build functionality resembling ChatGPT and Copilot by running the pre-trained OPT-175B model, which covers 175 billion parameters, on an ordinary computer with an NVIDIA RTX 3090 gaming video card equipped with 24 GB of video memory. The code is written in Python, uses the PyTorch framework and is distributed under the Apache 2.0 license.

The distribution includes an example chatbot script that lets you download one of the publicly available language models and start chatting right away (for example, by running the command "python apps/chatbot.py --model facebook/opt-30b --percent 0 100 100 0 100 0"). As a base, it is proposed to use a large language model published by Facebook and trained on the BookCorpus collection (10 thousand books), CC-Stories, The Pile (OpenSubtitles, Wikipedia, DM Mathematics, HackerNews, etc.), Pushshift.io (based on Reddit data) and CCNewsV2 (a news archive). The model covers about 180 billion tokens (800 GB of data). Training the model took 33 days on a cluster of 992 NVIDIA A100 80GB GPUs.
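The six numbers passed via --percent define the placement policy: according to the project documentation they specify, in order, the percentage of model weights on the GPU and in CPU RAM, the percentage of the attention (KV) cache on the GPU and in CPU RAM, and the percentage of activations on the GPU and in CPU RAM; whatever is not placed on the GPU or CPU is offloaded to disk. Below is a minimal illustrative Python sketch of such a policy (not FlexGen's actual API; the PlacementPolicy class and its fields are hypothetical):

    # Illustrative sketch only: models the meaning of the six --percent values.
    from dataclasses import dataclass

    @dataclass
    class PlacementPolicy:
        weight_gpu: int   # % of model weights kept in GPU memory
        weight_cpu: int   # % of model weights kept in CPU RAM
        cache_gpu: int    # % of the attention KV cache on the GPU
        cache_cpu: int    # % of the attention KV cache in CPU RAM
        act_gpu: int      # % of activations on the GPU
        act_cpu: int      # % of activations in CPU RAM

        @staticmethod
        def disk_share(gpu: int, cpu: int) -> int:
            # Whatever is not assigned to the GPU or CPU ends up on disk.
            return 100 - gpu - cpu

    # "--percent 0 100 100 0 100 0": all weights in CPU RAM, KV cache and
    # activations on the GPU -- a plausible split for OPT-30B on a 24 GB card.
    policy = PlacementPolicy(0, 100, 100, 0, 100, 0)
    print(PlacementPolicy.disk_share(policy.weight_gpu, policy.weight_cpu))  # 0, nothing on disk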

When running the OPT-175B model on a system with a single NVIDIA T4 GPU (16 GB), the FlexGen engine demonstrated throughput up to 100 times higher than previously proposed solutions, which makes large language models more accessible and allows them to be run on systems without specialized accelerators. At the same time, FlexGen can scale to distributed computing when several GPUs are available. To reduce the size of the model, it additionally applies its own parameter compression scheme and a model caching mechanism.
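The general idea behind running a model that does not fit into video memory is to keep most of the weights in CPU RAM (or on disk) and stream them to the GPU layer by layer during generation, so that only roughly one layer is resident in GPU memory at a time. The following PyTorch sketch illustrates this offloading scheme in a simplified form; it is a conceptual example, not FlexGen's actual implementation:

    # Conceptual sketch of per-layer weight offloading (not FlexGen's code):
    # layers live in CPU RAM and are moved to the GPU only for their forward pass.
    import torch
    import torch.nn as nn

    def offloaded_forward(layers: nn.ModuleList, hidden: torch.Tensor) -> torch.Tensor:
        device = hidden.device           # e.g. "cuda"
        for layer in layers:             # layers stay on the CPU between calls
            layer.to(device)             # stream this layer's weights to the GPU
            with torch.no_grad():
                hidden = layer(hidden)   # run only this layer on the GPU
            layer.to("cpu")              # free GPU memory before the next layer
        return hidden

    # Toy example: four 1024-wide linear "layers", one resident on the GPU at a time.
    if torch.cuda.is_available():
        layers = nn.ModuleList(nn.Linear(1024, 1024) for _ in range(4))
        x = torch.randn(8, 1024, device="cuda")
        print(offloaded_forward(layers, x).shape)  # torch.Size([8, 1024])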

Currently, FlexGen supports only OPT language models, but in the future the developers also promise to add support for BLOOM (176 billion parameters, supports 46 natural languages and 13 programming languages), CodeGen (can generate code in 22 programming languages) and GLM. An example of a dialogue with a bot based on FlexGen and the OPT-30B model:

Human: What is the name of the tallest mountain in the world?

Assistant: Everest.

Human: I am planning a trip for our anniversary. What things can we do?

Assistant: Well, there are a number of things you can do for your anniversary. First, you can play cards. Second, you can go for a hike. Third, you can go to a museum.
