This is part three of our three-part AI Core Insights series. Click here for part one, “Foundation models: To open-source or not to open-source?”, and here for part two, “Discovering holistic infrastructure strategies for compute-intensive startups.”
On the road to LLM-driven use cases, startups are leading the way. The road can be bumpy, with hiccups in GPU allocation, reserved-capacity availability, API rate limits, and more. Then there are the many priorities of an LLM pipeline that need to be sequenced across the different stages of your product build.
In this final part of our AI Core Insights series, we’ll summarize a few decisions you need to consider at various stages to make your journey easier.
Experimenting with models
At the experimentation stage, you’re testing and comparing several models, both open- and closed-source. For the OpenAI APIs, Microsoft for Startups provides $2,500 in OpenAI credits, which gives you rapid access to the APIs for experimentation.
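As a quick illustration, here is a minimal sketch of calling the OpenAI chat completions API from Python to test a prompt. The model name and prompts are placeholders; the client reads your API key from the OPENAI_API_KEY environment variable.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: use whichever model you're evaluating
    messages=[
        {"role": "system", "content": "You are a support assistant for a fintech startup."},
        {"role": "user", "content": "Summarize this ticket: my card was charged twice."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```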
A model catalog can be a great way to experiment with several models through simple pipelines and find the best-performing model for your use cases. The refreshed Azure ML model catalog lists top models from Hugging Face, as well as a few curated by Azure.
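For open models, a rough first comparison can be as simple as running the same prompt through each shortlisted model. Here is a minimal sketch using the Hugging Face transformers library; the small models named below are placeholders for whatever you shortlist from the catalog.

```python
# pip install transformers torch
from transformers import pipeline

candidates = ["gpt2", "distilgpt2"]  # placeholders for your shortlisted models
prompt = "Summarize: the customer was double-charged and wants a refund."

for name in candidates:
    generator = pipeline("text-generation", model=name)
    output = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
    print(f"--- {name} ---\n{output}\n")
```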
The compute target for this stage can be either a CPU or a GPU, with no major need for a highly performant system at scale. GPU options include V100s, A100s, or RTX GPUs. For inference, the most widely used SKUs are A10s and V100s, while A100s are also used in some cases. It is important to line up alternatives to ensure access at scale, since availability depends on variables such as region and quota.
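If you provision your own experimentation cluster, a sketch along these lines (using the Azure ML Python SDK v2) creates a small GPU cluster that scales to zero when idle. The workspace details and VM size are placeholders, and the SKU you can actually get depends on region and quota.

```python
# pip install azure-ai-ml azure-identity
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Standard_NC6s_v3 is a V100 SKU; swap for an A10/A100 size available in your region.
gpu_cluster = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC6s_v3",
    min_instances=0,          # scale to zero when idle to control cost
    max_instances=2,
    idle_time_before_scale_down=120,
)
ml_client.begin_create_or_update(gpu_cluster).result()
```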
Considerations after choosing a model
After completing experimentation, you’ve settled on a use case and the right model configuration to go with it. The model configuration, however, is usually a set of models rather than just one. Here are a few considerations to keep in mind:
- Papers like FrugalGPT outline various techniques for matching model choice to use-case success and picking the best-fit deployment (see the cascade sketch after this list). This is a bit like malloc allocation strategies: you can settle for the first fit, but oftentimes the most efficient products come out of the best fit.
- Serverless compute offerings can help you deploy ML jobs without the overhead of ML job management or of understanding compute types.
- For deployment comparisons, setting up jobs via Azure ML Studio can help benchmark and evaluate performance.
- Creating multiple pipelines is easy with reusable components in Azure ML (see the pipeline sketch below).
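To make the first-fit versus best-fit idea concrete, here is a minimal sketch of a FrugalGPT-style cascade: try a cheaper model first and escalate to a stronger one only when a scoring function is not confident. The model names are placeholders, and score_fn is a hypothetical scorer (a small verifier model or heuristic) you would supply.

```python
from openai import OpenAI

client = OpenAI()

def cascade_completion(prompt: str, score_fn, threshold: float = 0.8) -> str:
    """Try cheaper models first; escalate only when the answer scores below threshold."""
    answer = ""
    for model in ["gpt-4o-mini", "gpt-4o"]:  # ordered cheapest to most capable (placeholders)
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        if score_fn(prompt, answer) >= threshold:
            return answer
    return answer  # fall back to the strongest model's answer
```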
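And here is a sketch, under similar assumptions, of reusing one command component to benchmark several candidate models in a single Azure ML pipeline. The evaluate.py script, model names, environment, and data path are all placeholders.

```python
from azure.ai.ml import MLClient, Input, Output, command, dsl
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# A reusable command component that evaluates one model (evaluate.py is hypothetical).
evaluate = command(
    code="./src",
    command="python evaluate.py --model ${{inputs.model_name}} "
            "--data ${{inputs.eval_data}} --out ${{outputs.metrics}}",
    inputs={"model_name": "gpt2", "eval_data": Input(type="uri_folder")},
    outputs={"metrics": Output(type="uri_folder")},
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",
)

# Reuse the same component once per candidate model in one pipeline run.
@dsl.pipeline(description="Benchmark candidate models side by side")
def benchmark(eval_data):
    for name in ["gpt2", "distilgpt2"]:
        evaluate(model_name=name, eval_data=eval_data)

job = ml_client.jobs.create_or_update(
    benchmark(eval_data=Input(type="uri_folder",
                              path="azureml://datastores/workspaceblobstore/paths/eval-data/"))
)
```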
On the road to rapid growth
With a few customers under your belt, your LLM pipeline starts scaling fast. At this stage, there are additional considerations:
- Content safety becomes key, since your model’s outputs are going to customers. Azure Content Safety Studio can be a great place to get ready for deployment to customers (see the sketch after this list).
- Autoscaling your ML endpoints lets them scale up and down based on demand and alerts, which helps optimize cost as customer workloads vary.
- Building on top of an infrastructure like Azure addresses several growth needs up front, such as service reliability, adherence to compliance regulations like HIPAA, and more.
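As a rough sketch of what a pre-deployment safety check can look like in code, the Azure AI Content Safety SDK can screen a completion before it reaches the customer. The endpoint, key, severity threshold, and sample text below are placeholders, and the exact response fields may differ across SDK versions.

```python
# pip install azure-ai-contentsafety
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com",
    credential=AzureKeyCredential("<content-safety-key>"),
)

completion = "model output about to be returned to the customer"  # placeholder
result = client.analyze_text(AnalyzeTextOptions(text=completion))

# Block or rewrite the response if any harm category exceeds your severity threshold.
if any((item.severity or 0) >= 2 for item in result.categories_analysis):
    completion = "Sorry, I can't share that response."
```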
As large-model-driven use cases become more mainstream, it is clear that, except for a few large players, your model is not your product. However, a few considerations made early on help you prioritize the right problem statements so you can build, deploy, and scale your product quickly while the industry keeps expanding.
For ongoing learning and building around AI, sign up today for Microsoft for Startups Founders Hub.