Maria Korolov
Contributing writer

7 ways to deploy your own large language model

Feature
Nov 20, 2024 | 10 mins
Artificial Intelligence | Data and Information Security | Databases

Building a new large language model from scratch is an option, but the cost can be too much for many companies to bear. Luckily, there are several other ways to deploy customized LLMs that are faster, easier, and, most importantly, cheaper.


Generative AI is the fastest-moving new technology in history. It’s transforming the world, and according to the Real-Time Population Survey of more than 3,000 working adults in the US, released in September, a quarter of them had used gen AI for work at least once in the previous week, with nearly 11% using it every day.

That adoption rate is roughly twice as fast as the internet’s was. And in a recent report by S&P Global Market Intelligence on behalf of Weka, 88% of organizations surveyed use gen AI, and 24% have it as an integrated capability across workflows.

More specifically, an LLM is a type of gen AI that focuses on text and code instead of images or audio, although some have begun to integrate different modalities. The most popular LLMs in the enterprise today are ChatGPT and other OpenAI GPT models, Anthropic’s Claude, Google’s Gemini, Meta’s Llama, and the open source models from Mistral, a company founded by former employees of Meta and Google’s DeepMind.

So with the momentum increasing, and industry pressure to engage with gen AI more directly, we examine seven ways companies are deploying LLMs today, ranging widely in complexity, and the lengths they’ll go to in order to ensure a competitive advantage.

Send in the chatbots

Chatbots are the simplest way to start using gen AI at a company. There are free, public options for the lowest-risk use cases, such as AI-powered internet searches or summarizing public documents. There are also enterprise versions of these chatbots, where the vendors promise to keep all conversations secure and not use them to train their AIs.

According to a July report from Netskope Threat Labs, 96% of organizations use gen AI chatbots, up from 74% a year ago. ChatGPT is the most popular general-purpose gen AI chatbot platform in the enterprise, with an 80% usage rate, followed by Microsoft Copilot at 67% and Google Gemini at 51%.

In addition, enterprise software vendors are increasingly embedding gen AI functionality in their platforms. Grammarly, for example, has gen AI functionality, and so does Salesforce. Most major enterprise software vendors have either already rolled out some gen AI functionality or have it on their roadmaps.

“Certainly, most of the value generation you can attribute to generative AI this year — and over the next two, probably — will be as copilots or assistants, in your search engines, applications, and tools,” says Nick Kramer, leader of applied solutions at SSA & Company, a global consulting firm.

And in its assessment, Gartner predicts more than 80% of enterprise software vendors will have gen AI capabilities by 2026, up from less than 5% in March.

APIs

The next most common gen AI deployment strategy is to add APIs into corporate platforms. For example, if an employee uses an application to track meetings, an API can be used to automatically generate transcript summaries. And Gartner says by 2026, more than 30% of the increase in demand for APIs will come from gen AI.

“The commercial LLMs, the ones created by the big tech companies, you can access through APIs with a usage pricing model,” says Bharath Thota, partner in the digital and analytics practice at Kearney. “Lots of cloud providers make it easy to access those LLMs.”

For something like summarizing a report, the LLM can be used as is, he says, without retrieval augmented generation (RAG), embeddings, or fine tuning, just the prompt itself, though it depends on the business problem that needs solving. This is a low-risk, low-cost way to add gen AI functionality to enterprise systems without significant overhead. It’s also an opportunity for companies to learn how these APIs work and how to create effective prompts.
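As a rough illustration, a prompt-only integration like the meeting-summary example above can be little more than a single API call. The sketch below uses the OpenAI Python client; the model name and the transcript variable are placeholders, not details from any particular deployment.

```python
# Minimal sketch: summarizing a meeting transcript with a prompt-only API call.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set in the
# environment; the model name and transcript text are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

transcript = "..."  # raw transcript pulled from the meeting-tracking application

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model exposed through the API would do
    messages=[
        {"role": "system", "content": "You summarize meeting transcripts into concise bullet points."},
        {"role": "user", "content": f"Summarize this meeting:\n\n{transcript}"},
    ],
)

print(response.choices[0].message.content)
```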

According to OpenAI, 92% of Fortune 500 companies are using its API — and the use of it has doubled since July due to the release of new models, lower costs, and better performance.

Vector databases and RAG

For most companies looking to customize their LLMs, RAG is the way to go. If someone talks about embeddings or vector databases, this is what they normally mean. The way it works is that if a user asks a question about, say, a company policy or product, that question isn’t sent to the LLM right away. Instead, it’s processed first to determine whether the user has the right to access that information. If the access rights are there, then all potentially relevant information is retrieved, usually from a vector database. The question and the relevant information are then embedded into an optimized prompt, which might also specify the preferred format of the answer and the tone the LLM should use, and sent to the LLM.
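In rough Python terms, that flow might look something like the sketch below. The helper functions are trivial stand-ins; a real system would back them with an access-control service, a vector database, and an LLM API.

```python
# Sketch of the RAG flow described above. The helpers are stand-ins for an auth
# service, a vector database query, and an LLM API call.

def user_has_access(user: str, question: str) -> bool:
    return True  # stand-in for a real access-control check

def vector_db_search(question: str, top_k: int) -> list[str]:
    # Stand-in retrieval; a real system would run a similarity search.
    return ["Employees accrue 20 days of paid vacation per year."][:top_k]

def call_llm(prompt: str) -> str:
    return "You get 20 days of paid vacation per year."  # stand-in for the LLM call

def answer_question(user: str, question: str) -> str:
    # 1. Check that the user may see the material the question touches.
    if not user_has_access(user, question):
        return "Sorry, you don't have access to that information."
    # 2. Retrieve potentially relevant passages, usually from a vector database.
    passages = vector_db_search(question, top_k=5)
    # 3. Embed the question and context into an optimized prompt that also
    #    specifies the preferred answer format and tone.
    context = "\n".join(passages)
    prompt = (
        "Answer using only the context below, in a friendly tone, as short bullets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 4. Send the assembled prompt to the LLM and return its answer.
    return call_llm(prompt)

print(answer_question("alice", "How much vacation do I get?"))
```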

A vector database is a way of organizing information in a series of lists, each one sorted by a different attribute. For instance, if there’s a list that’s alphabetical, the closer your responses are in that order, the more relevant they are. An alphabetical list is a one-dimensional vector database, but they can have unlimited dimensions, allowing you to search for related answers based on proximity to any number of factors. That makes them perfect to use in conjunction with LLMs.
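Here is a toy example of that proximity-across-many-dimensions idea, using plain NumPy cosine similarity instead of a real vector database; the three-dimensional vectors are invented purely to show the mechanics, since real embeddings have hundreds or thousands of dimensions.

```python
# Toy illustration of multi-dimensional proximity search with NumPy.
# The vectors are made up; real systems use learned embeddings.
import numpy as np

documents = ["refund policy", "shipping times", "password reset"]
doc_vectors = np.array([
    [0.9, 0.1, 0.0],   # pretend embedding for "refund policy"
    [0.2, 0.8, 0.1],   # pretend embedding for "shipping times"
    [0.0, 0.1, 0.9],   # pretend embedding for "password reset"
])

query_vector = np.array([0.8, 0.2, 0.1])  # pretend embedding of the user's question

# Cosine similarity: the closer a document sits to the query in vector space,
# the more relevant it is considered.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(documents[int(np.argmax(scores))])  # prints "refund policy"
```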

“Right now, we’re converting everything to a vector database,” says Ellie Fields, chief product and engineering officer at Salesloft, a sales engagement platform vendor. “And yes, they’re working.”

And it’s more effective than using simple documents to provide context for LLM queries, she says. The company primarily uses ChromaDB, an open-source vector store built chiefly for LLM workloads. Another vector database Salesloft uses is PGVector, a vector similarity search extension for the PostgreSQL database.
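For a sense of what working with such a store looks like, here is a minimal ChromaDB round trip; the collection name and documents are invented for the example.

```python
# Minimal ChromaDB sketch: add a few documents, then query by similarity.
# Collection name and documents are illustrative only.
import chromadb

client = chromadb.Client()  # in-memory client; persistent clients are also available
collection = client.create_collection("company_policies")

collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Employees accrue 20 days of paid vacation per year.",
        "Expense reports must be filed within 30 days of purchase.",
    ],
)

results = collection.query(query_texts=["How much vacation do I get?"], n_results=1)
print(results["documents"][0][0])
```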

“But we’ve also done some research using FAISS and Pinecone,” Fields says. FAISS, or Facebook AI Similarity Search, is an open-source library provided by Meta that supports similarity searches in multimedia documents.

And Pinecone is a proprietary cloud-based vector database that’s also become popular with developers, and its free tier supports up to 100,000 vectors. Once the relevant information is retrieved from the vector database and embedded into a prompt, it gets sent to OpenAI running in a private instance on Microsoft Azure.
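For comparison, FAISS works directly on arrays of vectors rather than documents. A small index might be built like this; the dimensions and random data are arbitrary stand-ins for real embeddings.

```python
# Small FAISS sketch: build a flat L2 index and run a nearest-neighbor search.
# The 64-dimensional random vectors stand in for real embeddings.
import faiss
import numpy as np

dim = 64
vectors = np.random.random((1000, dim)).astype("float32")  # pretend document embeddings
query = np.random.random((1, dim)).astype("float32")       # pretend query embedding

index = faiss.IndexFlatL2(dim)  # exact L2-distance index
index.add(vectors)

distances, ids = index.search(query, 5)  # the five nearest neighbors
print(ids[0])
```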

“We had Azure certified as a new sub-processor on our platform,” says Fields. “We always let customers know when we have a new processor for their information.”

But Salesloft also works with Google and IBM, and is building gen AI functionality that uses those platforms as well.

“We’ll definitely work with different providers and different models,” she says. “Things are changing week by week. If you’re not looking at different models, you’re missing the boat.” So RAG allows enterprises to separate their proprietary data from the model itself, making it much easier to swap models in and out as better models are released. In addition, the vector database can be updated, even in real time, without any need to do more fine-tuning or retraining of the model.

Sometimes different models have different APIs. But switching out a model is still easier than retraining. “We haven’t yet found a use case that’s better served by fine tuning rather than a vector database,” Fields adds. “I believe there are use cases out there, but so far, we haven’t found one that performs better.”

One of the first applications of LLMs that Salesloft rolled out was adding a feature that lets customers generate a sales email to a prospect. “Customers were taking a lot of time to write those emails,” says Fields. “It was hard to start, and there’s a lot of writer’s block.” So now customers can specify the target persona, their value proposition, and the call to action — and they get three different draft emails back they can personalize.
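A simple version of such a templated prompt might look like the following; the field names and wording are illustrative, not Salesloft’s actual prompts.

```python
# Illustrative prompt template for the sales-email use case described above.
# The wording and fields are invented; the vendor's real prompts are not public.
def build_email_prompt(persona: str, value_prop: str, call_to_action: str) -> str:
    return (
        f"Write three distinct draft sales emails to a {persona}.\n"
        f"Value proposition: {value_prop}\n"
        f"Call to action: {call_to_action}\n"
        "Keep each email under 120 words, in a friendly, professional tone."
    )

prompt = build_email_prompt(
    persona="VP of sales at a mid-size SaaS company",
    value_prop="cut the time spent writing outreach emails in half",
    call_to_action="book a 20-minute demo",
)
print(prompt)  # this string would then be sent to the LLM of choice
```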

Locally run open source models

It’s clear to Andy Thurai, VP and principal analyst at Constellation Research, that open source LLMs have become very powerful. For example, Meta has just released the Llama 3.2 model in several sizes with new vision capabilities, and says its Llama models have been downloaded nearly 350 million times, a 10-fold increase over the course of a single year, and have spawned more than 60,000 derivative models fine-tuned for specific use cases.

According to the Chatbot Arena LLM Leaderboard, Meta’s top Llama model is comparable to OpenAI’s GPT 4 and Anthropic’s Claude 3.5 Sonnet in quality.

“While Llama has the early advantage, many other enterprise companies are creating their own version of an open source LLM,” says Thurai, pointing to IBM’s Granite models, AWS’ Titan, and Google’s several open models. In light of this growth, API company Kong recently released a survey of hundreds of IT professionals and business leaders showing that most enterprises use OpenAI, either directly or through Azure AI, followed by Google Gemini, with Meta’s Llama in third place.

The fact that the open source models come in many sizes is a benefit to enterprises since smaller models are cheaper and faster.

“Many enterprises are moving toward deployment mode, away from experimentation, and the cost of inferencing and optimization is becoming a major issue,” says Thurai. “A lot of them are getting sticker shock at the scale they’re looking to deploy.”

Boston-based Ikigai Labs also offers a platform that allows companies to build custom large graphical models, or AI models designed to work with structured data. But to make the interface easier to use, Ikigai powers its front end with LLMs. For example, the company uses the seven billion parameter version of the Falcon open source LLM, and runs it in its own environment for some clients.

To feed information into the LLM, Ikigai uses a vector database, also run locally, says co-founder and co-CEO Devavrat Shah. “At MIT four years ago, some of my students and I experimented with a ton of vector databases,” says Shah, who’s also a professor of AI at MIT. “I knew it would be useful, but not this useful.”

Keeping both the model and the vector database local means no data can leak out to third parties, he says. “For clients okay with sending queries to others, we use OpenAI,” says Shah. “We’re LLM agnostic.”

Then there’s PricewaterhouseCoopers, which built its own ChatPwC tool and is also LLM agnostic. “ChatPwC makes our associates more capable,” says Bret Greenstein, the firm’s partner and leader of the gen AI go-to-market strategy. For example, it includes pre-built prompts and embeddings to implement use cases like generating job descriptions. “This is implemented to use our formats, templates, and terminology,” he says. “To create that, we have HR, data, and prompt experts, and we optimize the use case to generate good, consistent job postings. Now, end users don’t need to know how to do the prompting to generate job descriptions.”

The tool is built on top of Microsoft Azure, and the company also built it for Google Cloud Platform and AWS. “We have to serve our clients, and they exist on every cloud,” says Greenstein. Similarly, it’s optimized to use different models on the back end because that’s how clients want it. “We have every major model working,” he adds. “Claude, Anthropic, OpenAI, Llama, Falcon — we have everything.”

The market is changing quickly, of course, and Greenstein suggests enterprises adopt a no regrets policy to their AI deployments.

“There’s a lot people can do, like build up their data independent of models, and build up governance,” he says. Then, when the market changes and new models and technologies come out, the data and governance structure will still be relevant.

Fine tuning

Management consulting company AArete uses few-shot learning-based fine tuning on Claude 3.5 Sonnet via AWS Bedrock. “We’re US East-1 region’s highest users of AWS Bedrock,” says Priya Iragavarapu, the company’s VP of digital technology services. “We’ve been able to scale our generative AI application into production effectively.”
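As a rough sketch of what few-shot prompting against a Bedrock-hosted Claude model involves, the snippet below uses boto3’s Converse API; the model ID, region, examples, and classification task are placeholders rather than AArete’s actual setup.

```python
# Sketch of few-shot prompting on AWS Bedrock via boto3's Converse API.
# The model ID, region, and examples are illustrative placeholders.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

few_shot_examples = (
    "Classify each invoice line item.\n"
    "Item: 'AWS EC2 usage, March' -> Category: Cloud infrastructure\n"
    "Item: 'Team offsite catering' -> Category: Travel & entertainment\n"
)

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example Bedrock model ID
    messages=[{
        "role": "user",
        "content": [{"text": few_shot_examples + "Item: 'Design software licenses' -> Category:"}],
    }],
)
print(response["output"]["message"]["content"][0]["text"])
```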

If AArete used a hosted model and connected to it via API, trust issues would arise. “We’re concerned where the data from the prompting might end up,” she says. “We don’t want to take those risks.”

When choosing an open source model, she looks at how many times it was previously downloaded, its community support, and its hardware requirements.

“The foundational models have become so powerful from where they started last year that we don’t need to worry about the efficacy output for task relevancy,” she says. “The only difference now is how the models differ by number of tokens they can take and versioning.”

Many companies in the financial world and in the health care industry are fine-tuning LLMs based on their own additional data sets. Basic LLMs are trained on the whole internet, but with fine tuning, a company can create a model specifically targeted at its business use case. A common way of doing this is to create a list of questions and answers and fine-tune a model on them. In fact, OpenAI began allowing fine tuning of its GPT 3.5 model in August 2023 using a Q&A approach, and rolled out a suite of new fine tuning, customization, and RAG options for GPT 4 at its November DevDay. This is particularly useful for customer service and help desk applications, where a company might already have a data bank of FAQs.
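The Q&A training data for that kind of fine tuning is typically a JSONL file of short conversations, which is then uploaded and referenced in a fine-tuning job. The sketch below uses the OpenAI Python client; the questions, answers, and file name are invented for illustration.

```python
# Sketch of Q&A-style fine-tuning data and job creation with the OpenAI API.
# The example questions, answers, and file name are illustrative only.
import json
from openai import OpenAI

qa_pairs = [
    ("How do I reset my password?", "Go to Settings > Security and choose 'Reset password'."),
    ("What is the return window?", "Items can be returned within 30 days of delivery."),
]

# Each training example is one short conversation in chat format.
with open("helpdesk_finetune.jsonl", "w") as f:
    for question, answer in qa_pairs:
        f.write(json.dumps({"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}) + "\n")

client = OpenAI()
training_file = client.files.create(file=open("helpdesk_finetune.jsonl", "rb"), purpose="fine-tune")
client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
```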

Software companies building applications such as SaaS apps might use fine tuning, says PricewaterhouseCoopers’ Greenstein. “If you have a highly repeatable pattern, fine tuning can drive down your costs,” he says, but for enterprise deployments, RAG is more efficient in up to 95% of cases.

Start from the ground up

Few companies are going to build their own LLM from scratch. OpenAI’s GPT 3 has 175 billion parameters, was trained on a 45-terabyte data set, and cost $4.6 million to train. And according to OpenAI CEO Sam Altman, GPT 4 cost over $100 million. That size is what gives LLMs their magic: the ability to process human language with a degree of common sense and to follow instructions.

“While you can create your own LLM, it requires a significant investment of data and processing power,” says Carm Taglienti, chief data officer at Insight. “Training a model from scratch requires a sufficient volume of data to be able to execute the LLM tasks you expect based on your data.”

Then, once the base training is done, there’s the reinforcement learning from human feedback step, or RLHF, which is necessary for the model to interact with users in an appropriate way.

Today, nearly all LLMs come from the big hyperscalers or AI-focused startups like OpenAI and Anthropic. Even companies with extensive experience building their own models are staying away from creating their own LLMs. Salesloft, for example, has been building its own AI and ML models for years, including gen AI models using earlier technologies, but is hesitant about building a brand-new, cutting-edge foundation model from scratch.

“It’s a massive computational step that, at least at this stage, I don’t see us embarking on,” says Fields.

Model gardens

For the most mature companies, a single gen AI model isn’t enough. Different models are good for different kinds of use cases, and have different costs and performance metrics associated with them. And new players constantly enter the space, leapfrogging over established giants. Plus, some models can be run on-prem or in colocation data centers, which can reduce costs for companies or provide additional security or flexibility. To take advantage of these options, companies create curated model gardens, private collections of carefully vetted LLMs including custom or fine-tuned models, and use a routing system to funnel requests to the most appropriate ones.

“Not many companies are there yet,” says Kearney’s Thota. “It’s complicated, but I believe that’s what the future will be.”
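A routing layer in front of such a garden can start out as little more than a lookup from request type to vetted model; the sketch below is a deliberately simplified illustration, with invented categories, model names, and dispatch logic.

```python
# Toy sketch of a model-garden router: map request categories to vetted models,
# then dispatch. Categories, model names, and the dispatch helper are invented.
MODEL_GARDEN = {
    "code":      "in-house-code-model-v2",   # fine-tuned on internal codebases
    "contracts": "legal-llm-onprem",         # runs in a colocation data center
    "general":   "hosted-frontier-model",    # commercial API for everything else
}

def dispatch_to_model(model_name: str, prompt: str) -> str:
    # In a real system this would call the API or local runtime behind each model.
    return f"[{model_name}] response to: {prompt[:40]}..."

def route_request(category: str, prompt: str) -> str:
    model_name = MODEL_GARDEN.get(category, MODEL_GARDEN["general"])
    return dispatch_to_model(model_name, prompt)

print(route_request("contracts", "Summarize the indemnification clause in this MSA."))
```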