Demand for generative AI (GenAI) and large language models (LLMs) is rising rapidly, driven by the emergence of ChatGPT, a chatbot developed by OpenAI. Because of LLMs' large scale and the massive datasets and compute resources required to train them, cloud service providers (CSPs) are generally combining inference with prompt engineering in their AI solutions to support clients' customization needs, according to DIGITIMES Research's latest report on the AI server industry.
Cloud inference has thus become the primary operating model for LLMs. However, because language applications mostly require instant responses and must support large numbers of simultaneous users, only large clusters of high-speed interconnected AI servers can perform LLM inference at a level that satisfies most usage scenarios, the report shows.
First-tier CSPs are aggressively deploying GenAI cloud services. Apart from the commonly known creation of content such as text, images, documents, and code, CSPs have also been actively promoting GenAI platform as a service (PaaS), providing users with pre-trained models, prompt engineering tools, and all types of APIs that allow enterprises to quickly create customized application tools.
Pre-trained models, also referred to as foundation models, play a crucial role in determining an LLM's basic quality. CSPs usually offer multiple LLMs for users to choose from, but the best-performing proprietary models, such as ChatGPT or GPT-4, are generally not open for users' customized training; users are instead given prompt engineering tools to achieve customized results similar to training. Users unsatisfied with the prompt engineering tools can choose to fine-tune or train partially or fully open-source models.
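The distinction between the two customization paths can be sketched as follows. This is an illustrative outline only: `generate` and `finetune` are hypothetical stand-ins for a CSP's hosted-model endpoint and an open-source training toolkit, and the model names and data are invented for the example.

```python
# Illustrative sketch: prompt engineering on a closed model vs. fine-tuning an
# open-source one. The functions below are hypothetical placeholders, not real APIs.

def generate(model: str, prompt: str) -> str:
    """Hypothetical call to a hosted, closed pre-trained model."""
    raise NotImplementedError("stand-in for a CSP inference endpoint")

def finetune(base_model: str, dataset_path: str) -> str:
    """Hypothetical fine-tuning job on an open-source foundation model."""
    raise NotImplementedError("stand-in for an open-source training toolkit")

# Path 1: prompt engineering -- customization lives entirely in the prompt,
# e.g. instructions plus a few in-context examples, with no weight updates.
few_shot_prompt = (
    "Classify the sentiment of each support ticket as positive or negative.\n"
    "Ticket: 'The new dashboard is great.' -> positive\n"
    "Ticket: 'Checkout keeps failing.' -> negative\n"
    "Ticket: 'Shipping was faster than expected.' ->"
)
# answer = generate(model="hosted-proprietary-llm", prompt=few_shot_prompt)

# Path 2: fine-tuning -- customization is baked into the weights by training
# an open-source base model on the enterprise's own labeled data.
# custom_model = finetune(base_model="open-source-7b", dataset_path="tickets.jsonl")
```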
Compared with other deep learning models, LLMs rely more heavily on inference, but only large clusters of high-performance AI servers can support their enormous parameter counts, vector processing loads, and language applications' demanding usage scenario of instant responses and large numbers of concurrent users.
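A back-of-envelope calculation shows why parameter counts alone push inference onto clusters. The figures below (a 175-billion-parameter model, 16-bit weights, 80 GB per accelerator) are assumptions for illustration, not numbers from the report.

```python
# Illustrative arithmetic: memory needed just to hold the weights of a large LLM.
params = 175e9          # assumed parameter count (GPT-3-class model)
bytes_per_param = 2     # FP16/BF16 weights
gpu_memory_gb = 80      # assumed memory of a single high-end accelerator

weights_gb = params * bytes_per_param / 1e9
min_gpus = -(-weights_gb // gpu_memory_gb)   # ceiling division

print(f"Weights alone: ~{weights_gb:.0f} GB")
print(f"Minimum accelerators just to hold the weights: {int(min_gpus)}")
# -> ~350 GB of weights, i.e. at least 5 such accelerators before any memory is
#    spent on activations or key-value caches for batches of concurrent users.
```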
The trend is driving server manufacturers to transition from CPU-based system architectures to accelerator-based ones. When a large cluster of AI servers is treated as a single computation unit, the network interconnecting those servers also becomes a performance bottleneck that needs to be addressed.
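The interconnect bottleneck can be illustrated with a rough estimate. When a model is split across accelerators, each inference step triggers collective communication (such as an all-reduce) among them, so link bandwidth bounds throughput. The node count, payload size, and bandwidths below are assumed values for the sketch, not figures from the report.

```python
# Illustrative estimate of collective-communication time per step at two
# assumed link speeds; all numbers are assumptions for the example.
nodes = 8                 # assumed number of accelerators sharing one model
payload_gb = 0.5          # assumed data reduced per collective step, in GB

# Assumed effective per-link bandwidths, converted from Gb/s to GB/s.
links = {
    "commodity Ethernet (10 Gb/s)": 10 / 8,
    "high-speed fabric (400 Gb/s)": 400 / 8,
}

for name, gb_per_s in links.items():
    # A ring all-reduce moves roughly 2*(n-1)/n of the payload per node.
    transfer_gb = 2 * (nodes - 1) / nodes * payload_gb
    print(f"{name}: ~{transfer_gb / gb_per_s * 1e3:.0f} ms per collective step")
# The gap of well over an order of magnitude is why the interconnect, and not
# only the accelerators, becomes the bottleneck once a cluster acts as one unit.
```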