RedPajama Project Creates Open Data Set for AI

The joint project RedPajama has been announced, aimed at creating open machine-learning models and the open source data needed to train them, which can be used to build intelligent assistants competing with commercial products such as ChatGPT. The availability of open training data and open large language models is expected to free independent machine-learning research teams from dependence on proprietary resources and to simplify the creation of specialized dialogue systems. Organizations and communities including Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research and the MILA Québec AI Institute have joined the work on the project.

The first step was the publication of the RedPajama-Data-1T dataset for training dialogue models, containing 1.2 trillion tokens. The RedPajama set reproduces the public data sources used by Facebook to create its LLaMA model (trained on 1.25 trillion tokens), but is distributed under an open license that does not restrict the field of use (the LLaMA data and models were provided only to researchers, by special request, for non-commercial use). The prepared download is 2.67 TB in size and includes text from Common Crawl web crawls, GitHub archives, public-domain books from the Gutenberg library, scientific articles from the ArXiv archive, and discussions from Stack Overflow and other Stack Exchange sites.
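For context, the project's published breakdown of the dataset also includes C4 and Wikipedia slices alongside the sources listed above. The following sketch sums the announced per-slice token counts (approximate, rounded figures from the project announcement) to show how they add up to roughly 1.2 trillion tokens:

```python
# Approximate per-source token counts (in billions of tokens) for
# RedPajama-Data-1T, as stated in the project announcement.
# These are rounded figures, not exact counts.
slices = {
    "CommonCrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "ArXiv": 28,
    "Books": 26,
    "Wikipedia": 24,
    "StackExchange": 20,
}

total_billions = sum(slices.values())
print(f"Total: {total_billions} billion tokens (~{total_billions / 1000:.2f}T)")
# prints "Total: 1210 billion tokens (~1.21T)"
```

The Common Crawl slice dominates the mix, which mirrors the composition reported for the original LLaMA training data.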

Ready-made models trained on the prepared dataset and fine-tuned on ready examples of instruction-style dialogues from the Alpaca and OpenChatKit projects are expected in the next few weeks. Among similar initiatives to create language models, the partially open projects LLaMA, Alpaca, Vicuna and Koala can be mentioned, as well as the fully open initiatives Pythia, OpenChatKit, Open Assistant and Dolly.

Additionally, several new projects related to the creation of chatbots can be noted:
