How Cube’s Universal Semantic Layer Unlocks a New Generation of AI Apps
Cube is the standard for providing semantic consistency to LLMs, and we are investing in its new $25M financing after leading the seed round in 2020.
As AI is increasingly used in production scenarios, costs are mounting. Are alternative architectures the solution?
One of the great drawbacks of the Transformer's attention-based architecture is the computational cost of inference. Unlike training, which can be parallelized across the sequence, attention-based inference has a cost that grows quadratically with input sequence length, and slows down accordingly, which limits context window size. Flash Attention (recently updated for the H100) helps and has become a standard, but it doesn't entirely remove the quadratic scaling constraint.
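To make the quadratic term concrete, here is a minimal NumPy sketch of scaled dot-product attention (names and shapes are illustrative, not taken from any particular library). The n × n score matrix is what grows quadratically with sequence length; FlashAttention reorders this computation so the full matrix is never materialized, but the number of score entries it must process is still n × n.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Vanilla attention over a length-n sequence.

    Q, K, V have shape (n, d). The `scores` matrix has shape (n, n),
    so compute and memory both grow quadratically with n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d)

# Doubling n roughly quadruples the work spent on `scores`.
n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = scaled_dot_product_attention(Q, K, V)
```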
So, when the original Mamba SSM paper promised both parallelizable training and inference that scales linearly with input length, there was understandable excitement. Structured state space models (SSMs) existed before Mamba, but they suffered from the inverse of the Transformer's problem: their inference scaled linearly, but their training could not be parallelized. Moreover, their state matrices were fixed after training and could not respond dynamically to changing inputs, a critical shortcoming that only attention seemed to solve. It appeared you could have one or the other: quality or cost.
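For intuition, a classical (non-selective) SSM generates outputs through a fixed linear recurrence. The sketch below, with illustrative names and dimensions, shows why each step is cheap regardless of context length but inherently sequential, and why fixed A and B cannot react to the content of the input.

```python
import numpy as np

def ssm_generate(A, B, C, xs):
    """Run a linear state space model: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    A: (d_state, d_state), B: (d_state, d_in), C: (d_out, d_state).
    Inference cost is constant per step (linear in total sequence length),
    but each step depends on the previous hidden state, so the loop is
    inherently sequential, and A, B, C never change with the input.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:                      # sequential recurrence
        h = A @ h + B @ x             # fixed A and B: no input-dependent gating
        ys.append(C @ h)
    return np.stack(ys)

d_state, d_in, d_out, seq_len = 16, 4, 4, 128
rng = np.random.default_rng(0)
A = 0.9 * np.eye(d_state)             # stable, static transition matrix
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
ys = ssm_generate(A, B, C, rng.normal(size=(seq_len, d_in)))
```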
Mamba (and selective SSMs more broadly) challenged this paradigm: it offers state transitions that vary with the input (akin to attention), training that can be parallelized, and inference that scales linearly with sequence length. The result is a model that generates with transformer-level fidelity but with dramatically better inference efficiency.
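Here is a rough sketch of the selectivity idea, heavily simplified from Mamba's actual parameterization (single input channel, diagonal state matrix, and a plain loop instead of the hardware-aware parallel scan the paper uses): the step size and the B/C projections become functions of the current token, so the recurrence can emphasize or ignore individual inputs.

```python
import numpy as np

def softplus(z):
    """Smooth positive function used for the input-dependent step size."""
    return np.log1p(np.exp(z))

def selective_ssm(x, A_log, w_delta, w_B, w_C):
    """Toy selective SSM over a scalar input sequence x.

    Unlike the fixed-matrix recurrence above, delta, B_t, and C_t are
    computed from each token, so the state update can amplify or ignore
    individual inputs -- the "selection" that plays the role attention
    plays in transformers. Mamba evaluates this recurrence with a
    parallel scan; the explicit loop here is only for readability.
    """
    A = -np.exp(A_log)                     # negative diagonal transition, shape (d_state,)
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        delta = softplus(w_delta * x_t)    # input-dependent step size
        B_t = w_B * x_t                    # input-dependent input projection
        C_t = w_C * x_t                    # input-dependent output projection
        A_bar = np.exp(delta * A)          # discretized (diagonal) transition
        h = A_bar * h + delta * B_t * x_t  # elementwise state update
        ys.append(float(np.dot(C_t, h)))
    return np.array(ys)

d_state, seq_len = 16, 64
rng = np.random.default_rng(1)
y = selective_ssm(rng.normal(size=seq_len),
                  A_log=rng.normal(size=d_state),
                  w_delta=float(rng.normal()),
                  w_B=rng.normal(size=d_state),
                  w_C=rng.normal(size=d_state))
```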
Development has continued since the paper was published, with researchers experimenting with an improved Mamba-2 model, the compact Zamba model (well suited to edge deployment), and hybrid attention + SSM models that claim state-of-the-art performance. These early results have been promising enough to motivate further research, but it remains to be seen whether selective SSMs will become the accepted answer to the inference cost-quality tradeoff.
The answer will likely depend more on transformers than on SSMs. Incremental improvements continue to be eked out through updated Flash and sparse attention mechanisms, as well as speculative decoding algorithms. But will these methods hit a theoretical limit? And if they do, will the cost-quality tradeoff become severe enough to drive widespread adoption of an alternative architecture? Only time will tell.
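For readers unfamiliar with speculative decoding, here is a hedged sketch of one greedy round; `draft_model` and `target_model` are assumed callables that map a token list to per-position next-token logits, not a real API, and the toy models at the bottom exist only to show the call pattern.

```python
import numpy as np

def speculative_decode_step(draft_model, target_model, prefix, k=4):
    """One greedy round of speculative decoding (illustrative only).

    A cheap draft model proposes k tokens one by one; the expensive
    target model then scores the whole extended sequence in a single
    parallel pass and keeps the proposals up to the first disagreement,
    plus its own corrected token. In this greedy form the accepted
    tokens match what the target model would have produced alone, but
    with fewer sequential calls to it.
    """
    # 1. Draft k tokens sequentially with the small model.
    draft = list(prefix)
    for _ in range(k):
        draft.append(int(np.argmax(draft_model(draft)[-1])))

    # 2. Verify all proposals with one pass of the large model.
    target_logits = target_model(draft)
    accepted = list(prefix)
    for i in range(len(prefix), len(draft)):
        best = int(np.argmax(target_logits[i - 1]))  # target's choice given draft[:i]
        accepted.append(best)
        if best != draft[i]:                         # first mismatch: stop here
            break
    return accepted

# Toy stand-ins (random logits) just to show the call pattern.
rng = np.random.default_rng(0)
toy_model = lambda seq: rng.normal(size=(len(seq), 100))
print(speculative_decode_step(toy_model, toy_model, prefix=[1, 2, 3]))
```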
If you’re using these models in the field, send me your thoughts at dlabruna@baincapital.com.
In this edition of “In the Lab,” Amit Aggarwal explains why he’s building an AI startup in BCV Labs after selling his company The Yes to Pinterest.
For now, most GenAI startups are focused on completing paperwork and are built on little more than prompts. That may change in the months ahead.