A message to founders of AI apps companies: don’t optimize for gross margin too early.
I call it the “gross margin myth”: the idea that the highest quality apps companies are those with the highest gross margins. This was generally true for SaaS companies that dominated the venture successes of the past two decades, but is a red herring for AI apps companies.
AI apps companies are built differently from SaaS companies. Of course there are similarities, but holding to all the same heuristics is flawed. In the case of gross margin: low gross margin today is not indicative of low gross margin at-scale, and can actually be an important signal of business strength. There are three reasons why this is true: structural, strategic, and financial.
It’s time we dispel this gross margin myth.
How gross margin came to define the health of a business
We can all agree that businesses exist to make money (eventually). Venture-backed businesses often burn capital for many years, spending on operating expenses like product development and customer acquisition ahead of revenue in the interest of creating enterprise value at-scale. Still, the goal is, eventually, cash flow.
Well before a company is profitable, a useful indicator of eventual business quality is gross margin. Gross margin is a powerful metric because it is persistent over time. Whereas operating expenses decrease as a percentage of revenue as a company scales, gross margin more or less stays the same.
Gross margin is the percentage of a company’s sales retained after subtracting the direct costs of delivering the product, also known as the cost of goods sold (COGS). COGS includes all variable costs associated with delivering a unit of revenue, such as hosting, implementation services, or customer support.
The last platform shift moved us from on-premise software to cloud and SaaS. High gross margin is one of the most beautiful and most differentiating characteristics of SaaS companies. Compared to on-prem software companies, SaaS companies can expand their customer base without significantly increasing direct costs, which helps produce high gross margins. Best-in-class SaaS companies have 80%+ gross margin. It’s also useful to know when SaaS companies do not have high gross margin profiles. A low gross margin may indicate that the product is lower quality, either because it requires high customer support costs or because the company has little differentiation from competitors, so the marginal profit is competed away.
Why the gross margin game has changed
For AI apps, gross margin today is not indicative of terminal gross margin. In the prior world of SaaS apps, gross margin today was a good proxy for gross margin tomorrow, but the same is not true for AI apps. There are three primary reasons for this.
Reason #1: Structural
AI apps companies typically have lower gross margin than pure SaaS companies because they have all of the same COGS (hosting, customer support), but also layer on an additional cost: compute. Compute is the cost of model inference, or prompting the LLM as embedded within the product. Today, we’ve seen compute costs for AI apps companies at ~1-3x software hosting costs. For AI apps companies to become as profitable as SaaS companies, compute costs have to decline. This is happening.
First, we’re seeing the price for an equivalent unit of “work” drop at the infrastructure layer: the underlying models are getting more powerful with each new release, while the costs are coming down. The mechanisms include hardware optimizations, such as new GPU development, data center energy efficiencies, and direct investment in GPU clusters. Additionally, software optimizations hold significant promise, including speculative decoding, model quantization, model pruning/sparsification, and prompt caching, which Anthropic recently released and estimates can reduce cost by approximately 90%. It’s likely that competition among LLMs (and between LLMs and other types of models) will bring down prices further. In the future, the hyperscalers may even choose to subsidize the cost of inference below cost, offset by higher data hosting fees.
Second, the GPU crunch is a cost driver that has already begun to loosen, and we expect it to dissipate further. Today, companies need to make GPU buying decisions months ahead of demand to ensure there’s enough supply. Those in a position to do so, like Abridge, would rather err on the side of overages so there is no interruption to customer experience, even though it can add cost. As GPU capacity becomes more liquid, with more chips coming to market and absolute supply increasing, companies will be able to make buying decisions in real time to match customer demand (just as we’ve seen with hosting costs over time).
Third, AI-native apps companies are building a new layer of infrastructure to maximize model performance while minimizing cost. Put simply: companies do not need to use the most expensive models every time since some tasks can handle more latency or lower performance. The total cost of inference comes down when models of varying cost are blended within the product. For example, MagicSchool uses different models for different prompts. Infrastructure layer companies are popping up to support this work, such as low-cost model routing by Martian or the open-source framework RouteLLM, as well as model (and product) evaluation companies like Braintrust, Galileo, and Patronus, and model “orchestration” companies.
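The economics of blending models can be made concrete. The sketch below shows the cost-aware routing idea under stated assumptions: the model names, prices, and the length-based difficulty heuristic are all illustrative, not real vendor pricing or any specific router’s logic (a production router such as RouteLLM scores prompt difficulty with a trained classifier).

```python
# Minimal sketch of cost-aware model routing: send easy or latency-tolerant
# prompts to a cheap model, reserve the expensive model for hard tasks.
# Model names and $ prices are illustrative assumptions, not real pricing.

from dataclasses import dataclass

@dataclass
class Model:
    name: str
    price_per_mtok: float  # $ per 1M tokens

CHEAP = Model("small-model", 0.25)
PREMIUM = Model("frontier-model", 15.00)

def route(prompt: str, needs_high_accuracy: bool) -> Model:
    """Pick the cheapest model that can plausibly handle the task."""
    # A real router would score prompt difficulty with a classifier;
    # here we use a crude length-plus-flag heuristic for illustration.
    if needs_high_accuracy or len(prompt) > 2000:
        return PREMIUM
    return CHEAP

def blended_cost(tokens_m: float, premium_share: float) -> float:
    """Average $ cost when only a fraction of traffic hits the premium model."""
    return tokens_m * (premium_share * PREMIUM.price_per_mtok
                       + (1 - premium_share) * CHEAP.price_per_mtok)

# Routing only 20% of traffic to the premium model cuts the blended cost
# of 1M tokens from $15.00 to $3.20, roughly a 79% reduction.
print(blended_cost(1.0, premium_share=0.2))
```

The design point is that the router only needs to be good enough to keep quality-sensitive traffic on the strong model; every prompt it safely diverts to the cheap model drops straight to the COGS line.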
These optimizations are not just theoretical but are already having a real impact. For example, we can follow the cratering inference costs in the span of only a few days from Mistral AI’s release of the Mixtral-8x7B sparse mixture of experts model (SMoE). On December 11, 2023, Mistral AI released the new model with open weights and pricing of $2 per 1 million tokens. Just hours later, Together AI used the same weights and dropped pricing by 70% to $0.60 per 1 million tokens. Days later on December 14, Abacus AI cut a further 50% to $0.30 per 1 million tokens, and then Deep Infra reached $0.24 per 1 million tokens. Putting this in terms of gross margin, assuming other COGS stay constant, this is the equivalent of increasing gross margin from 60% to 78%!
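The margin arithmetic above can be sketched directly. The revenue split below (roughly 19.5% of revenue on other COGS, ~10M tokens consumed per $100 of revenue) is an illustrative assumption chosen to reproduce the 60% starting point, not anyone’s disclosed financials; only the $2.00 and $0.24 token prices come from the Mixtral example.

```python
# Back-of-envelope: how a drop in inference price flows through to gross
# margin, holding other COGS constant. The revenue split and token volume
# are illustrative assumptions; only the token prices are from the example.

def gross_margin(revenue: float, other_cogs: float,
                 tokens_m: float, price_per_mtok: float) -> float:
    """Gross margin = 1 - (non-compute COGS + inference cost) / revenue."""
    cogs = other_cogs + tokens_m * price_per_mtok
    return 1 - cogs / revenue

REVENUE = 100.0    # $ per period (illustrative)
OTHER_COGS = 19.5  # hosting, support, etc. (illustrative)
TOKENS_M = 10.25   # millions of tokens consumed per period (illustrative)

# At Mixtral's launch price of $2.00 per 1M tokens...
before = gross_margin(REVENUE, OTHER_COGS, TOKENS_M, 2.00)
# ...and at Deep Infra's $0.24 per 1M tokens a few days later.
after = gross_margin(REVENUE, OTHER_COGS, TOKENS_M, 0.24)

print(f"{before:.0%} -> {after:.0%}")  # roughly 60% -> 78%
```

Nothing about the business changed in those few days except the market price of inference, yet the modeled gross margin jumps by 18 points.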
Reason #2: Strategic
One of the reasons why some AI apps companies have lower gross margin today is a greater reliance on human labor. For those less entrenched in generative AI, this may come as a surprise. “Isn’t the point of AI automation to rely less on humans?” Yes and no: while it is possible today to build a product that’s 100% automated using AI, we’re finding that these products run the risk of being indefensible. In contrast, AI apps companies that include humans as part of the product delivery process are definitionally doing something that isn’t yet possible with technology alone. While this harms gross margin in the short-term, it is a strategic choice that produces a better and more enduring company in the long-term.
What does this look like in practice? For some AI apps companies, human input prepares the product: effectively data labeling to create high-quality datasets for model finetuning or RLHF. For others, human input complements the product experience: catching edge cases or providing quality control at the end of a largely automated process. For example, Norm AI uses a combination of technology plus attorneys and compliance experts to map regulations to a graph. Then they use LLMs to ingest the graph and relevant customer artifacts to adjudicate compliance with the regulation. Finally, they validate the findings internally before releasing them to customers.
Founders, I understand that this can feel like a terrifying cliff to peer over: include humans and build something groundbreaking but simultaneously run the risk of being categorized as a “services” company. When is this worth the jump? Consider a two-point test to assess whether or not humans-in-the-loop is worthwhile:
Unique data asset with network effects: It’s worth using humans for data labeling or end output quality control to build datasets that have not before existed, are disparate, and accrue more value with scale. Examples include undigitized data literally stored in file cabinets, data that has never been written down and lives in the minds of experts, and workflow data that defines a process or edits a process. In each of these examples, the datasets cannot be “bought” because they don’t yet exist, and once developed, each node has more value as part of the whole because the collective whole provides context and meaning. Copycats can’t compete because they just can’t catch up. Judgment-based, yet deterministic processes are ripe for these characteristics, such as in law, accounting, finance, manufacturing and maintenance, healthcare, medicine, government/intelligence and psychology. It’s worth noting that the humans-in-the-loop can be within the company or on the customer side (a discussion for another post!).
Reliance on humans decreases over time: You also need to prove that the activities humans are doing can be captured and built into the product, producing higher gross margin over time. You should be able to study the services work, identify the steps requiring the greatest time, and build technology that automates those steps away. The proof is in the pudding here: over time, a smaller percentage of total units produced should require any human review, review time per unit should decrease, and so on.
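This test is easy to instrument. The sketch below tracks two series per monthly cohort: the share of units shipped that needed any human review, and review minutes per unit shipped. All the numbers are made up for illustration; the point is that both series must fall even as volume grows.

```python
# Sketch of the reliance-decreases-over-time test: per monthly cohort,
# measure (a) the share of units needing any human review and
# (b) review minutes per unit shipped. All figures are illustrative.

def review_metrics(units: int, reviewed: int, minutes: float):
    """Return (share of units needing any human review, minutes per unit)."""
    return reviewed / units, minutes / units

monthly = {
    # month: (units_shipped, units_human_reviewed, total_review_minutes)
    "2024-01": (1000, 800, 12000),
    "2024-04": (1800, 990, 10800),
    "2024-07": (3000, 1080, 8100),
}

for month, row in monthly.items():
    rate, mins = review_metrics(*row)
    print(f"{month}: {rate:.0%} reviewed, {mins:.1f} min/unit")

# Passing the test means both series fall as volume grows:
# here, 80% -> 55% -> 36% reviewed, and 12.0 -> 6.0 -> 2.7 min/unit.
```

If the company’s services work is truly being captured into the product, this chart trends down; if it stays flat, point 2 fails and the business is part-technology, part-services.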
If you fail point 1, then you’re burning capital but not building defensibility. You run the risk of proving out a model that others will fast follow, dooming your gross margins to persist at low levels due to price competition.
If you fail point 2 and there’s no measurable improvement in the time required for services for a unit of product delivered at constant quality, then you just have a product that is part-technology, part-services.
Passing both tests, I think companies have a good argument to view these labor costs as an asset. While accounting standards direct us to categorize these investments as a cost today, perhaps a more accurate depiction of the P&L would capitalize these costs, as we learned to capitalize R&D expenses in the SaaS era.
To make this more real, EvenUp is an example of a vertical-specific AI apps company that leverages human expertise to both curate a high-quality dataset and control quality. In their demand letter product, team members review injury claim files and distill them into a structured format that enhances the accuracy of AI-generated content. Critically, all of the human work occurs on “platform,” so the product and engineering team can track, and therefore gradually automate, more tasks over time, giving the company a speed, accuracy, and cost edge over a human performing the tasks by hand. For instance, earlier this year they launched “Rosie,” their intelligent review bot, which passes each document through rigorous quality checks, automating one of the most time-intensive steps in the review process. Critics point out that this is a technically and operationally complex process today. Yes, it is, and that is the point. Not only has EvenUp’s unique approach allowed them to deliver high-quality outputs that are difficult and costly for competitors to replicate, but their process also passes the two-point test toward building defensibility: they’re building novel datasets that did not exist before, and they are automating processes that significantly boost gross margins.
Reason #3: Financial
This point is the most controversial.
First, consider the financial choice of your customers: You can use low price as an explicit strategy to win market share. Across the economy, buyers are shopping for AI products, trialing many and implementing those with proven ROI. Price is not the most important consideration for many buyers, but it is critical and at least sets the ROI hurdle rate. Once a buyer selects a vendor, that incumbent AI vendor has an advantage in subsequent purchasing decisions. Some fields lend themselves to higher vendor lock-in, such as those with high data privacy or security requirements, integrations into highly complex or custom-built systems or data services, and multi-year license agreement standards. We see these dynamics in health/pharma, biotech, government, manufacturing, and energy/utility industries.
Second, your own ability to keep prices low is a function of how long you can keep burning capital. For those companies in the enviable group deemed “AI winners” by the VC community, financing is pretty cheap today. It’s worth it to raise, burn and build, and optimize the company tomorrow.
So, when does gross margin matter?
Of course, across both SaaS and AI apps companies, eventually generating cash matters. But what matters is the gross margin at the time a company gets to scale. For SaaS companies, the gross margin profile looks approximately the same over time, so gross margin early on is a good indicator of company value at IPO. We know the same is not true for AI apps companies because the gross margin profile is rapidly changing over time from structural, strategic, and financial considerations.
AI apps founders: Don’t optimize for gross margin too early or you may undermine your company’s potential. And, if investors or board members push you to increase gross margin? In the words of a leader at one of the most revered AI apps companies: “Cry BS! This is a good way to weed out spreadsheet investors who don’t understand AI.” I couldn’t have put it as eloquently.