Meta AI recently published a preliminary study demonstrating a radical new “Megabyte” framework for building generative pre-trained transformer (GPT) systems.
Called “promising” by Andrej Karpathy of OpenAI, formerly the director of artificial intelligence at Tesla, the new architecture is designed to process large amounts of data such as images, novels and video files without using a process known as tokenization.
Promising. Everyone should hope that we can throw away tokenization in LLMs. Doing so naively creates (byte-level) sequences that are too long, so the devil is in the details.
Tokenization means that LLMs are not actually fully end-to-end. There is a whole separate stage with… https://t.co/t240ZPxPm7
— Andrej Karpathy (@karpathy) May 15, 2023
Tokenization is a lossy process comparable to file compression. To process large amounts of data, GPT models convert bytes into tokens. The tokens are then processed by the transformer and used to generate output tokens, which are then decoded.
The tokenization process allows an AI system to handle large strings of data as numbers. For example, the words “my favorite color is red,” when processed by OpenAI’s ChatGPT, are converted to the token string “3666, 4004, 3124, 318, 2266, 13” for processing.
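The text-to-tokens-and-back round trip can be illustrated with OpenAI’s open-source tiktoken tokenizer. A minimal sketch follows; the exact IDs printed depend on which encoding or model version is used, so they may differ from the figures quoted above.

```python
# Sketch of the text -> token IDs -> text round trip using OpenAI's tiktoken
# tokenizer. Exact token IDs vary by encoding/model version, so the printed
# numbers may not match the example string quoted in the article.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

tokens = enc.encode("my favorite color is red")   # text -> integer token IDs
print(tokens)

text = enc.decode(tokens)                         # token IDs -> original text
print(text)                                       # "my favorite color is red"
```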
Unfortunately, even with the help of tokenization, the amount of data that today’s systems can handle still has a hard limit. For GPT-3.5, the limit is just over 4,000 tokens, or around 3,000 words, while for GPT-4, the maximum is around 32,000 tokens, or around 24,000 words.
Meta’s new Megabyte system eschews tokenization in favor of a new multi-level prediction architecture capable of end-to-end modeling over 1 million bytes of data.
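The multi-scale idea can be sketched in a few lines of PyTorch: bytes are grouped into fixed-size patches, a “global” model contextualizes the patch representations, and a smaller “local” model predicts the individual bytes within each patch. The patch length, module sizes and the omission of causal masking below are simplifications for illustration, not Meta’s implementation.

```python
# Illustrative sketch of a patch-based, byte-level model in the spirit of
# Megabyte (NOT Meta's code): a global model over patch embeddings plus a
# local model over the bytes of each patch. Causal masking and the offsets
# needed for true autoregressive training are omitted for brevity.
import torch
import torch.nn as nn

PATCH_SIZE = 8                 # bytes per patch (assumed for illustration)
D_GLOBAL, D_LOCAL = 256, 128   # model widths (assumed for illustration)

class MegabyteSketch(nn.Module):
    def __init__(self, vocab_size=256):
        super().__init__()
        self.byte_embed = nn.Embedding(vocab_size, D_LOCAL)
        # project a whole patch of byte embeddings into one "global" token
        self.patch_proj = nn.Linear(PATCH_SIZE * D_LOCAL, D_GLOBAL)
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_GLOBAL, nhead=4, batch_first=True),
            num_layers=2,
        )
        # the local model conditions each patch's bytes on its global context
        self.global_to_local = nn.Linear(D_GLOBAL, D_LOCAL)
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_LOCAL, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.to_logits = nn.Linear(D_LOCAL, vocab_size)

    def forward(self, byte_ids):                       # (batch, seq_len)
        b, t = byte_ids.shape
        p = t // PATCH_SIZE
        x = self.byte_embed(byte_ids)                  # (b, t, D_LOCAL)
        patches = x.view(b, p, PATCH_SIZE * D_LOCAL)   # group bytes into patches
        g = self.global_model(self.patch_proj(patches))    # (b, p, D_GLOBAL)
        # broadcast each patch's global representation back to its bytes
        local_in = x.view(b * p, PATCH_SIZE, D_LOCAL) + \
                   self.global_to_local(g).view(b * p, 1, D_LOCAL)
        out = self.local_model(local_in)               # (b*p, PATCH_SIZE, D_LOCAL)
        return self.to_logits(out).view(b, t, -1)      # per-byte logits

model = MegabyteSketch()
sample = torch.randint(0, 256, (1, 64))   # 64 raw bytes, no tokenizer involved
print(model(sample).shape)                # torch.Size([1, 64, 256])
```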
Most standard English text encodings use 8 bits per character, so each character occupies a single byte of data. An AI system able to process 1 million bytes of data without tokenization could therefore work with text documents of around 750,000 words, which is 3,025% more than GPT-4.
By comparison, GPT-4 can currently process about 10 full-length news articles per prompt, while Megabyte would be able to analyze the whole of Leo Tolstoy’s War and Peace plus two more medium-length novels.
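These comparisons rest on two rough rules of thumb, about 0.75 words per token and one byte per character of 8-bit English text. A quick back-of-the-envelope check, using the article’s own estimates rather than exact values:

```python
# Back-of-the-envelope check of the figures above. The 0.75 words-per-token
# ratio and the 750,000-word estimate are rough rules of thumb, not exact values.
WORDS_PER_TOKEN = 0.75

print(4_000 * WORDS_PER_TOKEN)     # ~3,000 words in a 4,000-token GPT-3.5 context
print(32_000 * WORDS_PER_TOKEN)    # ~24,000 words in a 32,000-token GPT-4 context

megabyte_words = 750_000           # the article's estimate for 1 million bytes of text
gpt4_words = 24_000
print(f"{(megabyte_words / gpt4_words - 1) * 100:.0f}% more")   # ~3025% more
```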
Meta’s Megabyte model also performed well in ImageNet tests and in benchmarks related to audio file processing, either matching or outperforming existing byte-based transformer models such as DeepMind’s Perceiver AR in both cases:
“Megabyte matches the state-of-the-art performance of PerceiverAR using only half of the computation.”
The implications of this research could be far-reaching. Tokenization is seen as a bottleneck in the field because of the hard limits it places on how much data a model can handle and the energy and time required to train systems.
Without tokenization, it should be possible to train AI models with stronger fundamental support for non-English languages, especially those that cannot be easily encoded with standard 8-bit characters.
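A brief sketch of why this matters (the sample words below are arbitrary): many non-English scripts need several UTF-8 bytes per character, and tokenizers built mainly around English text tend to fragment them, whereas a byte-level model simply consumes the raw bytes.

```python
# Sketch: non-English scripts often take several UTF-8 bytes per character.
# A byte-level model consumes these bytes directly, with no tokenizer in between.
for word in ["red", "červená", "красный", "赤"]:
    raw = word.encode("utf-8")
    print(f"{word!r}: {len(word)} characters -> {len(raw)} bytes {list(raw)}")
```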
This could lead to further democratization of these technologies, allowing everything from cryptocurrency trading bots to decentralized autonomous organization tooling to be built in native-language code around the world.
Related: Sam Altman’s Worldcoin Secures $115M for Decentralized Identity
It would also enhance the ability of models like ChatGPT to work with image, video and audio files, creating media clips with roughly the same time and power consumption as text.