Mamba - The Next Evolution in AI Language Processing
Mamba represents a significant advancement in language modelling, in many cases matching or surpassing traditional Transformer models while being markedly faster and cheaper to run.
What is Mamba?
Mamba is a new method for modelling sequences (like sentences in a language or steps in a process) in a more efficient way compared to current deep learning models.
Recurrent Neural Networks (RNNs) used to be a popular way of modelling text because of their ability to process sequences, but they were largely displaced after the Transformer architecture was introduced in 2017. Today, most large language models are built on Transformers. Transformers are a type of model that is very good at handling sequences, as they can look at all parts of a sequence at once to understand the context (self-attention is key here - see our Blog on AI Large Language Models). However, they struggle with very long sequences: the cost of self-attention grows quadratically with sequence length, so they demand a lot of computing power and memory and become slow as inputs get longer.
Mamba is designed to work well with both short and long sequences by using a technique called "selective state space models" (SSMs). The model decides which parts of the sequence are important and focuses on them, while ignoring less relevant parts. This is similar to how a human reads a long article, focusing only on the key sentences to grasp the main idea.
Understanding the difference between Transformers and Mamba
A good way to understand the difference between the two architectures is to use the analogy of reading a book.
To understand a specific event in "Harry Potter and the Deathly Hallows", a Transformer model would need to recall an enormous amount of detail from the entire series - every character, spell, and plot twist from the previous six books. This is like a reader having to remember every single aspect of Harry's journey, from his first day at Hogwarts to the intricate details of every Quidditch match, to understand the significance of a specific moment in the final book. Technically, this means a Transformer attends over every token in its context when processing new text, which becomes increasingly expensive and cumbersome as the context grows.
In contrast, Mamba reads "Harry Potter" more like a human reader. While progressing through the story, key themes and characters, such as Harry's conflict with Voldemort, the significance of the Horcruxes, or Dumbledore's guidance, are retained. However, less pivotal details, like the colour of every character's robes or the menu at the Hogwarts feast, are not actively remembered. By adopting this selective memory approach, Mamba focuses on the crucial elements of the story and disregards the minutiae. For example, if 'Harry' is mentioned, Mamba, using its stored context, might predict related concepts like 'wand' or 'Voldemort', drawing upon the key narrative elements it has 'learned' throughout the series.
How SSMs Work
The fundamental concept behind SSMs involves mapping each element in an input sequence through a 'state space', which can be imagined as an abstract, high-dimensional space. In this space, the transformation of the data is governed by a set of parameters that define how each input influences the next state. These transformations are captured mathematically by a series of matrices, each representing different aspects of the transformation, such as how the state evolves over time (A matrix), how inputs affect the state (B matrix), and how the state is transformed into an output (C matrix).
In summary, SSMs work by moving each element of a sequence through a sophisticated, multi-dimensional space (the state space), where it undergoes a series of transformations based on a set of rules (represented by matrices A, B, and C). These transformations are designed to capture the essence and context of the sequence, allowing for accurate predictions or classifications based on the data. This method is particularly powerful for handling sequences where context and long-range dependencies are crucial for understanding the whole picture.
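The recurrence described above can be sketched in a few lines of code. This is a minimal, non-selective linear SSM with randomly initialised matrices; all names and dimensions here are illustrative placeholders, not Mamba's actual trained parameters.

```python
import numpy as np

# Toy dimensions: a 4-dimensional state, scalar input and output, 6 steps.
d_state, d_in, d_out, seq_len = 4, 1, 1, 6

rng = np.random.default_rng(0)
A = rng.normal(scale=0.3, size=(d_state, d_state))  # A: state evolution
B = rng.normal(size=(d_state, d_in))                # B: input influence
C = rng.normal(size=(d_out, d_state))               # C: output generation

x = rng.normal(size=(seq_len, d_in))  # the input sequence

h = np.zeros(d_state)   # hidden state, initially empty
outputs = []
for t in range(seq_len):
    h = A @ h + B @ x[t]    # state update: h_t = A h_{t-1} + B x_t
    outputs.append(C @ h)   # readout:      y_t = C h_t
y = np.array(outputs)
print(y.shape)  # (6, 1): one output per input step
```

Because the state `h` is a fixed-size vector updated step by step, the cost of processing a sequence grows only linearly with its length, in contrast to the quadratic cost of self-attention.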
What this means:
Input Sequence and State Space: Imagine you have a sequence of data points, like words in a sentence or notes in a piece of music. An SSM takes each of these data points and transforms it in a high-dimensional space called the 'state space'. This state space isn't a physical space but an abstract mathematical construct where each dimension can represent a different feature or aspect of the data.
Transformation Governed by Parameters: The way each data point (like a word or note) is transformed within this state space is not random. It's governed by a specific set of parameters. These parameters are like rules that dictate how one point in the sequence influences the next. The idea is that each data point affects the 'state' of the model, and this state carries information that influences how future data points are processed.
Mathematical Representation with Matrices:
A Matrix (State Evolution): This matrix represents how the state itself evolves over time. In simpler terms, it's like a set of rules that determine how the current information in the state space will change or transition to the next moment, independent of the new incoming data. In the analogy, this is like the rules determining how the story's context evolves.
B Matrix (Input Influence): This matrix defines how the new input (like the next word in a sentence) affects the state. It captures the way incoming data alters or contributes to the current state of the model. In the analogy, this is how each new plot development shapes the ongoing narrative.
C Matrix (Output Generation): After processing through the state space, the C matrix is used to transform the state into the actual output. This could be a prediction, like the next word in a sentence, or a classification, like identifying the genre of a piece of music. Essentially, the C matrix translates the complex, high-dimensional state back into a meaningful output that we can understand and use. In the analogy, this is like drawing a conclusion based on the story so far.
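What makes Mamba's SSMs "selective" is that the influence of B and C depends on the current input, so the model can decide how strongly each token writes to and reads from the state. The sketch below illustrates that idea with a simple input-dependent gate; the gating function and all weights are illustrative inventions for this example, not Mamba's actual parameterisation or discretisation.

```python
import numpy as np

d_state, seq_len = 4, 5
rng = np.random.default_rng(1)

A = np.eye(d_state) * 0.9          # fixed state-evolution matrix (toy choice)
w_B = rng.normal(size=d_state)     # base "input influence" direction
w_C = rng.normal(size=d_state)     # base "output readout" direction

u = rng.normal(size=seq_len)       # a scalar input sequence

h = np.zeros(d_state)
ys = []
for t in range(seq_len):
    gate = 1.0 / (1.0 + np.exp(-u[t]))  # input-dependent selection gate in (0, 1)
    B_t = w_B * gate                    # B varies with the input: how much to write
    C_t = w_C * gate                    # C varies with the input: how much to read
    h = A @ h + B_t * u[t]              # state update with the selected influence
    ys.append(float(C_t @ h))           # scalar output per step
print(len(ys))  # 5: one output per input
```

When the gate is near zero, a token barely touches the state (like a forgettable detail about robe colours); when it is near one, the token is written strongly into memory (like a Horcrux revelation).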
In tests, Mamba performed exceptionally well, in some cases outperforming Transformers, especially on very long sequences such as those found in language modelling, audio processing, and genomics (the study of DNA sequences). This makes it a promising tool for a wide range of applications, from understanding human language to analysing complex biological data.
Overall, Mamba represents a significant step forward in efficiently handling and understanding sequences, offering improved performance and speed, particularly for very long sequences.