AI Controller Interface: Generative AI with a lightweight, LLM-integrated VM


By a Senior Principal Research Manager and a Principal Research Software Development Engineer


The emergence of large language models (LLMs) has revolutionized the way people create text and interact with computing. However, these models are limited in ensuring the accuracy of the content they generate and enforcing strict compliance with specific formats, such as JSON and other computer programming languages. Additionally, LLMs that process information from multiple sources face notable challenges in preserving confidentiality and security. In sectors like healthcare, finance, and science, where information confidentiality and reliability are critical, the success of LLMs relies heavily on meeting strict privacy and accuracy standards. Current strategies to address these issues, such as constrained decoding and agent-based approaches, pose practical challenges, including significant performance costs or the need for direct model integration, which is difficult.

The AI Controller Interface and program

To make these approaches more feasible, we created the AI Controller Interface (AICI). The AICI goes beyond the standard “text-in/text-out” API for cloud-based tools with a “prompt-as-program” interface. It’s designed to allow user-level code to integrate with LLM output generation seamlessly in the cloud. It also provides support for existing security frameworks, application-specific functionalities, fast experimentation, and various strategies for improving accuracy, privacy, and adherence to specific formats. By providing granular-level access to the generative AI infrastructure, AICI allows for customized control over LLM processing, whether it’s run locally or in the cloud.

A lightweight virtual machine (VM), the AI Controller, sits atop this interface. AICI conceals the LLM processing engine’s specific implementation, providing the right mechanisms to enable developers and researchers to agilely and efficiently work with the LLM, allowing them to more easily develop and experiment. With features that allow for adjustments in decision-making processes, efficient memory use, handling multiple requests at once, and coordinating tasks simultaneously, users can finely tune the output, controlling it step by step.


An individual user, tenant, or platform can develop the AI Controller program using a customizable interface designed for specific applications or prompt-completion tasks. The AICI is designed for the AI Controller to run on the CPU in parallel with model processing on the GPU, enabling advanced control over LLM behavior without impacting its performance. Additionally, multiple AI Controllers can run simultaneously. Figure 1 illustrates the AI Controller architecture.

Architecture stack of the AI Controller Interface system: a copilot or application calls an AI Controller (DeclCtrl, PyCtrl, JSCtrl, or a custom controller), which runs on top of the AI Controller Interface; the AICI in turn integrates directly with an LLM serving engine such as rLLM, llama.cpp, or another engine.
Figure 1. Applications send instructions to an AI Controller, which provides a high-level API. The AICI allows the Controller to execute efficiently in the cloud in parallel with model inference.

AI Controllers are implemented as WebAssembly VMs, most easily written as Rust programs. However, they can also be written in any language that can be compiled into or interpreted as WebAssembly. We have already developed several sample AI Controllers, available as open source. These controllers provide built-in tools for controlled text generation, allowing on-the-fly changes to prompts and generated text. They also enable efficient management of tasks that involve multiple stages or batch processing.

High-level execution flow

Let’s walk through an example to illustrate how an AI Controller shapes the output of an LLM. Suppose a user requests the completion of a task, such as solving a mathematical equation, with the expectation of receiving a numeric answer. The following program ensures that the LLM’s response is numeric. The process unfolds as follows:

1. Setup. The user or platform owner first sets up the AICI-enabled LLM engine and then deploys the provided AI Controller, DeclCtrl, to the cloud via a REST API.

2. Request. The user initiates LLM inference with a REST request specifying the AI Controller (DeclCtrl) and a JSON-formatted declarative program, such as the following example.

{"steps": [
    {"Fixed":{"text":"Please tell me what is 122.3*140.4?"}},
    {"Gen": {"rx":" ^(([1-9][0-9]*)|(([0-9]*)\.([0-9]*)))$"}}
]}

Once the server receives this request, it creates an instance of the requested DeclCtrl AI Controller and passes the declarative program into it. The AI Controller parses its input, initializes its internal state, and LLM inference begins.
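
To make this parsing step concrete, here is a minimal Rust sketch of how a DeclCtrl-style program could be deserialized with serde. The Program and Step types are hypothetical and simply mirror the JSON above (with a simplified regular expression); the actual DeclCtrl schema in the AICI repository may differ.

// Hypothetical types that mirror the JSON program shown above (with a
// simplified regular expression); the real DeclCtrl schema may differ.
// Requires the serde (with the "derive" feature) and serde_json crates.
use serde::Deserialize;

#[derive(Deserialize, Debug)]
enum Step {
    // {"Fixed":{"text":"..."}} -- text appended verbatim to the prompt.
    Fixed { text: String },
    // {"Gen":{"rx":"..."}} -- generation constrained by a regular expression.
    Gen { rx: String },
}

#[derive(Deserialize, Debug)]
struct Program {
    steps: Vec<Step>,
}

fn main() -> serde_json::Result<()> {
    let request = r#"{"steps": [
        {"Fixed": {"text": "Please tell me what is 122.3*140.4?"}},
        {"Gen": {"rx": "([0-9]+)(\\.[0-9]+)?"}}
    ]}"#;
    // Parse the declarative program; a controller would initialize its
    // internal state from the resulting steps before inference begins.
    let program: Program = serde_json::from_str(request)?;
    println!("{:?}", program);
    Ok(())
}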

3. Token generation. The server generates tokens sequentially, with the AICI making calls to the DeclCtrl AI Controller before, during, and after each token generation.

  • pre_process() is called before token generation. At this point, the AI Controller may stop generating (e.g., if it is complete), fork parallel generations, suspend, or continue.
  • mid_process() is called during token generation and is the main entry point for computation in the AI Controller. During this call, the AI Controller can return logit biases to constrain generation, backtrack in the generation, or fast forward through a set of fixed or zero-entropy tokens. The mid_process() function runs in parallel with model inference on the GPU and its computation (e.g., of logit biases) is incorporated into the model’s token sampling on the GPU.
  • post_process() is called once the model has generated the next token. Here, the AI Controller may, for example, perform simple bookkeeping, updating its state based on the sampled token.

During these calls, the DeclCtrl AI Controller executes the necessary logic to ensure that the LLM generation conforms to the declarative program provided by the user. This ensures the LLM response is a numeric solution to the math problem. 
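
As a rough illustration of these calls, the sketch below defines a toy controller that only ever allows “digit” tokens, in the spirit of the DeclCtrl example. The Controller trait, result types, vocabulary, and biases here are hypothetical stand-ins; the real aici_abi crate defines its own, richer interfaces.

// Toy sketch of the three hooks described above. The Controller trait,
// result types, vocabulary, and biases are hypothetical stand-ins; the
// real aici_abi crate defines its own, richer interfaces.

const VOCAB_SIZE: usize = 8;
const NEG_INF: f32 = f32::NEG_INFINITY;

enum PreResult {
    Continue,
    Stop,
}

enum MidResult {
    // One bias per vocabulary entry, added to the model's logits on the GPU.
    // (The real interface can also request backtracking or fast-forwarding.)
    Bias(Vec<f32>),
}

trait Controller {
    fn pre_process(&mut self) -> PreResult; // before token generation
    fn mid_process(&mut self) -> MidResult; // in parallel with the GPU forward pass
    fn post_process(&mut self, token: u32); // after the next token is sampled
}

// Toy controller that only ever allows "digit" tokens.
struct DigitsOnly {
    digit_tokens: Vec<bool>, // indexed by token id
    generated: Vec<u32>,
    max_len: usize,
}

impl Controller for DigitsOnly {
    fn pre_process(&mut self) -> PreResult {
        if self.generated.len() >= self.max_len {
            PreResult::Stop
        } else {
            PreResult::Continue
        }
    }

    fn mid_process(&mut self) -> MidResult {
        // Disallowed tokens get a bias of negative infinity, so the sampler
        // can never pick them.
        let bias = (0..VOCAB_SIZE)
            .map(|t| if self.digit_tokens[t] { 0.0 } else { NEG_INF })
            .collect();
        MidResult::Bias(bias)
    }

    fn post_process(&mut self, token: u32) {
        // Simple bookkeeping: record the sampled token.
        self.generated.push(token);
    }
}

fn main() {
    let mut ctrl = DigitsOnly {
        digit_tokens: vec![false, true, true, true, false, false, true, false],
        generated: vec![],
        max_len: 4,
    };
    if let PreResult::Continue = ctrl.pre_process() {
        let MidResult::Bias(bias) = ctrl.mid_process();
        println!("logit biases: {:?}", bias);
        ctrl.post_process(2); // pretend the sampler picked token id 2
    }
}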

4. Response. Once DeclCtrl completes its program, it assembles the results, which might include intermediate outputs, debug information, and computed variables. These can be returned as a final response or streamed to show progress. Finally, the AI Controller is deallocated.

Flow between an AI Controller and the LLM during constrained decoding: Step 0 uploads the desired AI Controller to the LLM service, if necessary; Step 1 sends an LLM request to the server; Step 2, repeated for every generated token, calls the AI Controller before, during, and after each token generation to control the LLM’s behavior; Step 3 returns the resulting generated text.
Figure 2. AI Controllers incorporate custom logic during the token-by-token decoding, working in parallel with the LLM to support fast, flexible, and secure controlled generation.

Use cases

Efficient constrained decoding

For Rust-based AI Controllers, we’ve developed an efficient way to check and enforce formatting rules (constraints) during text generation within the aici_abi library. The approach combines a search tree over the token vocabulary (a trie) with checks based on patterns (regular expressions) or rules (context-free grammars) to determine which candidate tokens keep the output within the specified constraints. Because these checks are fast, the controller keeps pace with the GPU’s token sampling without affecting performance.
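
The sketch below illustrates the idea under simplifying assumptions: a toy vocabulary, a hand-rolled automaton for “digits with at most one decimal point” in place of a full regular-expression or grammar engine, and a plain HashMap trie rather than the optimized token trie in aici_abi. A single walk over the trie yields the set of tokens that keep the output valid; everything else would receive a negative-infinity bias.

use std::collections::HashMap;

// Sketch only: a toy vocabulary and a hand-rolled automaton standing in for
// the token trie and regex/CFG machinery in the aici_abi library.

#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    token_id: Option<u32>, // set if a token ends at this node
}

struct Trie {
    root: TrieNode,
}

impl Trie {
    fn new(vocab: &[(u32, &str)]) -> Self {
        let mut root = TrieNode::default();
        for &(id, text) in vocab {
            let mut node = &mut root;
            for &b in text.as_bytes() {
                node = node.children.entry(b).or_default();
            }
            node.token_id = Some(id);
        }
        Trie { root }
    }
}

// Automaton states: 0 = before the decimal point, 1 = after it, None = dead.
fn step(state: u32, byte: u8) -> Option<u32> {
    match (state, byte) {
        (_, b'0'..=b'9') => Some(state),
        (0, b'.') => Some(1),
        _ => None,
    }
}

// One walk over the trie yields the set of tokens allowed in the current state.
fn allowed_tokens(node: &TrieNode, state: u32, out: &mut Vec<u32>) {
    if let Some(id) = node.token_id {
        out.push(id);
    }
    for (&b, child) in &node.children {
        if let Some(next) = step(state, b) {
            allowed_tokens(child, next, out);
        }
    }
}

fn main() {
    // Toy multi-character tokens, as produced by a BPE-style tokenizer.
    let vocab: &[(u32, &str)] = &[(0, "12"), (1, "3.4"), (2, "."), (3, " the"), (4, "7.")];
    let trie = Trie::new(vocab);

    let mut ok = Vec::new();
    allowed_tokens(&trie.root, 0, &mut ok); // state 0: no decimal point seen yet
    ok.sort();
    println!("allowed token ids: {:?}", ok); // " the" (id 3) is excluded
}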

While AI Controllers currently enforce hard constraints, for example by assigning negative-infinity logit biases to disallow invalid tokens, we anticipate that future versions will support more flexible guidance.

Information flow constraints

Furthermore, the AI Controller VM gives users the power to control when and how prompts, background data, and intermediate generations affect subsequent outputs. This is achieved through backtracking, editing, and prompt processing.

This functionality can be useful in a number of scenarios. For example, it allows users to selectively influence one part of a structured chain-of-thought process but not another. It can also be applied to preprocessing background data to remove irrelevant or potentially sensitive details before starting an LLM analysis. Currently, achieving this level of control requires multiple independent calls to LLMs.
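
Here is a toy sketch of one such pattern: the controller lets a scratch analysis influence a single step, then backtracks over it so that the scratch text never reaches later outputs. The Instruction type and controller structure are hypothetical; the actual AICI interfaces express backtracking and fast-forwarding differently.

// Toy sketch of an information-flow pattern: generate a scratch analysis,
// let it influence one step, then backtrack so the scratch text never
// reaches later outputs. Types are hypothetical, not the aici_abi API.

#[derive(Debug)]
enum Instruction {
    // Append fixed tokens without sampling (fast-forward).
    FastForward(Vec<u32>),
    // Remove the last `n` tokens from the generation before continuing.
    Backtrack { n: usize },
}

struct InfoFlowController {
    transcript: Vec<u32>, // tokens visible to subsequent generation
    scratch_len: usize,   // how many trailing tokens are "scratch"
}

impl InfoFlowController {
    fn note_generated(&mut self, token: u32, scratch: bool) {
        self.transcript.push(token);
        if scratch {
            self.scratch_len += 1;
        }
    }

    // Called once the scratch analysis has been generated and used.
    fn seal_scratch(&mut self) -> Instruction {
        let n = self.scratch_len;
        self.transcript.truncate(self.transcript.len() - n);
        self.scratch_len = 0;
        Instruction::Backtrack { n }
    }
}

fn main() {
    let mut ctrl = InfoFlowController { transcript: vec![], scratch_len: 0 };

    // Pretend the model generated four tokens of scratch analysis.
    for t in [101, 102, 103, 104] {
        ctrl.note_generated(t, true);
    }

    // The controller keeps only a distilled result (here, a fixed token
    // sequence) and erases the scratch from the visible generation.
    let erase = ctrl.seal_scratch();
    let keep = Instruction::FastForward(vec![7, 8]);
    println!("{:?} then {:?}", erase, keep);
    println!("visible transcript: {:?}", ctrl.transcript);
}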

Looking ahead

Our work with AICI has led to a successful implementation on a reference LLM serving engine (rLLM) and integrations with llama.cpp. Currently, we’re working to provide a small set of standard AI Controllers for popular libraries like Guidance. In the near future, we plan to work with a variety of LLM infrastructures, and we’re excited to use the open-source ecosystem of LLM serving engines to integrate the AICI, providing portability for AI Controllers across environments.

Resources

Code, detailed descriptions of the AICI, and tutorials are available on GitHub. We encourage developers and researchers to create and share their own custom AI Controllers.
