Version: v0.14 🚧

Provide LLM Service with Inference Module for AI Application

In the wave of Artificial Intelligence (AI), Large Language Models (LLMs) are gradually becoming a key factor in driving innovation and productivity. As a result, researchers and developers are looking for a more efficient way to deploy and manage complex LLM models and AI applications.

To simplify the process from model construction, deployment and interaction with applications, the KusionStack community has provided an inference module. We will explore in detail how to deploy an AI application using LLM service provided by this module in this article.

info

The module definition and implementation, as well as the example application we are about to show can be found here.

Prerequisites

Before we begin, we need to perform the following steps to set up the environment required by Kusion:

Install Kusion
Running Kubernetes cluster

For more details, please refer to the prerequisites in the guide for deploying an application with Kusion.

Initializing and Managing Workspace Configuration

For information on how to initialize and switch a workspace with kusion workspace create and kusion workspace switch, please refer to this document.

For the current version of the inference module, an empty configuration for workspace initialization is enough, and users may need to configure the network module as an accessory to provide the network service for the AI application, whose workload is described with service module. Users can also add other modules' platform configurations in the workspace according to their need.

An example is shown below:

modules: 
    service: 
        path: oci://ghcr.io/kusionstack/service
        version: 0.2.0
        configs: 
            default: {}
    network: 
        path: oci://ghcr.io/kusionstack/network
        version: 0.2.0
        configs: 
            default: {}
    inference: 
        path: oci://ghcr.io/kusionstack/inference
        version: 0.1.0-beta.4
        configs: 
            default: {}

Example

After creating and switching to the workspace shown above, we can initialize the example Project and Stack with kusion project create and kusion stack create. Please refer to this document for more details.

The directory structure, and configuration file contents of the example project is shown below:

example/
.
├── default
│   ├── kcl.mod
│   ├── main.k
│   └── stack.yaml
└── project.yaml

project.yaml:

name: example

stack.yaml:

name: default

kcl.mod:

[dependencies]
kam = { git = "https://github.com/KusionStack/kam.git", tag = "0.2.0" }
service = {oci = "oci://ghcr.io/kusionstack/service", tag = "0.1.0" }
network = { oci = "oci://ghcr.io/kusionstack/network", tag = "0.2.0" }
inference = { oci = "oci://ghcr.io/kusionstack/inference", tag = "0.1.0-beta.4" } 

main.k:

import kam.v1.app_configuration as ac
import service
import service.container as c
import network as n
import inference.v1.inference

inference: ac.AppConfiguration {
    # Declare the workload configurations. 
    workload: service.Service {
        containers: {
            myct: c.Container {image: "kangy126/app"}
        }
        replicas: 1
    }
    # Declare the inference module configurations. 
    accessories: {
        "inference": inference.Inference {
            model: "llama3"
            framework: "Ollama"
        }
        "network": n.Network {ports: [n.Port {
            port: 80
            targetPort: 5000
        }]}
    }
}

In the above example, we configure the model and framework item of the inference module, which are two required configuration items for this module. The inference service of different models with different inference frameworks could be quickly built up by changing these two configuration items.

As for how the AI application use the LLM service provided by the inference module, an environment variable named INFERENCE_URL will be injected by the module and the application can call the LLM service with the address.

Which model used in the application is transparent, and you only need to provide the prompt parameter to the request address. Of course, you can directly modify the model and other configuration items in the main.k file and update the deployment resources by kusion apply.

There are also some optional configuration items in the inference module for adjusting the LLM service, whose details can be found here.

Deployment

Now we can generate and deploy the Spec containing all the relevant resources the AI application needs with Kusion.

First, we should navigate to the folder example/default and execute the kusion generate command, and a Spec will be generated.

➜  default git:(main) ✗ kusion generate
 ✔︎  Generating Spec in the Stack default...
resources:
    - id: v1:Namespace:example
      type: Kubernetes
      attributes:
        apiVersion: v1
        kind: Namespace
        metadata:
            creationTimestamp: null
            name: example
        spec: {}
        status: {}
      extensions:
        GVK: /v1, Kind=Namespace
    - id: apps/v1:Deployment:example:example-default-inference
      type: Kubernetes
      attributes:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
            creationTimestamp: null
            labels:
                app.kubernetes.io/name: inference
                app.kubernetes.io/part-of: example
            name: example-default-inference
            namespace: example
        spec:
            replicas: 1
            selector:
                matchLabels:
                    app.kubernetes.io/name: inference
                    app.kubernetes.io/part-of: example
            strategy: {}
            template:
                metadata:
                    creationTimestamp: null
                    labels:
                        app.kubernetes.io/name: inference
                        app.kubernetes.io/part-of: example
                spec:
                    containers:
                        - env:
                            - name: INFERENCE_URL
                              value: ollama-infer-service
                          image: kangy126/app
                          name: myct
                          resources: {}
        status: {}
      dependsOn:
        - v1:Namespace:example
        - v1:Service:example:ollama-infer-service
        - v1:Service:example:example-default-inference-private
      extensions:
        GVK: apps/v1, Kind=Deployment
        kusion.io/is-workload: true
    - id: apps/v1:Deployment:example:ollama-infer-deployment
      type: Kubernetes
      attributes:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
            creationTimestamp: null
            name: ollama-infer-deployment
            namespace: example
        spec:
            selector:
                matchLabels:
                    accessory: ollama
            strategy: {}
            template:
                metadata:
                    creationTimestamp: null
                    labels:
                        accessory: ollama
                spec:
                    containers:
                        - command:
                            - /bin/sh
                            - -c
                            - |-
                              echo 'FROM llama3
                              PARAMETER top_k 40
                              PARAMETER top_p 0.900000
                              PARAMETER temperature 0.800000
                              PARAMETER num_predict 128
                              PARAMETER num_ctx 2048
                              ' > Modelfile && ollama serve & OLLAMA_SERVE_PID=$! && sleep 5 && ollama create llama3 -f Modelfile && wait $OLLAMA_SERVE_PID
                          image: ollama/ollama
                          name: ollama-infer-container
                          ports:
                            - containerPort: 11434
                              name: ollama-port
                          resources: {}
                          volumeMounts:
                            - mountPath: /root/.ollama
                              name: ollama-infer-storage
                    volumes:
                        - emptyDir: {}
                          name: ollama-infer-storage
        status: {}
      dependsOn:
        - v1:Namespace:example
        - v1:Service:example:ollama-infer-service
        - v1:Service:example:example-default-inference-private
      extensions:
        GVK: apps/v1, Kind=Deployment
    - id: v1:Service:example:ollama-infer-service
      type: Kubernetes
      attributes:
        apiVersion: v1
        kind: Service
        metadata:
            creationTimestamp: null
            labels:
                accessory: ollama
            name: ollama-infer-service
            namespace: example
        spec:
            ports:
                - port: 80
                  targetPort: 11434
            selector:
                accessory: ollama
            type: ClusterIP
        status:
            loadBalancer: {}
      dependsOn:
        - v1:Namespace:example
      extensions:
        GVK: /v1, Kind=Service
    - id: v1:Service:example:example-default-inference-private
      type: Kubernetes
      attributes:
        apiVersion: v1
        kind: Service
        metadata:
            creationTimestamp: null
            labels:
                app.kubernetes.io/name: inference
                app.kubernetes.io/part-of: example
            name: example-default-inference-private
            namespace: example
        spec:
            ports:
                - name: example-default-inference-private-80-tcp
                  port: 80
                  protocol: TCP
                  targetPort: 5000
            selector:
                app.kubernetes.io/name: inference
                app.kubernetes.io/part-of: example
            type: ClusterIP
        status:
            loadBalancer: {}
      dependsOn:
        - v1:Namespace:example
      extensions:
        GVK: /v1, Kind=Service
secretStore: null
context: {}

Next, we can execute the kusion preview command and review the resource three-way diffs for a more secure deployment.

➜  default git:(main) ✗ kusion preview
 ✔︎  Generating Spec in the Stack default...
Stack: default
ID                                                    Action
v1:Namespace:example                                  Create
v1:Service:example:ollama-infer-service               Create
v1:Service:example:example-default-inference-private  Create
apps/v1:Deployment:example:example-default-inference  Create
apps/v1:Deployment:example:ollama-infer-deployment    Create


Which diff detail do you want to see?:
> all
  v1:Namespace:example Create
  v1:Service:example:ollama-infer-service Create
  v1:Service:example:example-default-inference-private Create
  apps/v1:Deployment:example:example-default-inference Create

Finally, execute the kusion apply command to deploy the related Kubernetes resources.

➜  default git:(main) ✗ kusion apply
 ✔︎  Generating Spec in the Stack default...
Stack: default
ID                                                    Action
v1:Namespace:example                                  Create
v1:Service:example:ollama-infer-service               Create
v1:Service:example:example-default-inference-private  Create
apps/v1:Deployment:example:ollama-infer-deployment    Create
apps/v1:Deployment:example:example-default-inference  Create


Do you want to apply these diffs?:
  > yes

Start applying diffs ...
 ✔︎  Succeeded v1:Namespace:example
 ✔︎  Succeeded v1:Service:example:ollama-infer-service
 ✔︎  Succeeded v1:Service:example:example-default-inference-private
 ✔︎  Succeeded apps/v1:Deployment:example:ollama-infer-deployment
 ✔︎  Succeeded apps/v1:Deployment:example:example-default-inference
Apply complete! Resources: 5 created, 0 updated, 0 deleted.

Testing

Execute the kubectl get all -n example command, and the deployed Kubernetes resources will be shown.

➜  ~ kubectl get all -n example
NAME                                           READY   STATUS    RESTARTS   AGE
pod/example-dev-inference-5cf6c74574-7w92f     1/1     Running   0          2d6h
pod/mynginx                                    1/1     Running   0          2d6h
pod/ollama-infer-deployment-7c56845496-s5snb   1/1     Running   0          2d6h

NAME                                   TYPE           CLUSTER-IP        EXTERNAL-IP     PORT(S)        AGE
service/example-dev-inference-public   ClusterIP      192.168.116.121   <none>          80:32693/TCP         2d6h
service/ollama-infer-service           ClusterIP      192.168.28.208    <none>          80/TCP         2d6h

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/example-dev-inference     1/1     1            1           2d6h
deployment.apps/ollama-infer-deployment   1/1     1            1           2d6h

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/example-dev-inference-5cf6c74574     1         1         1       2d6h
replicaset.apps/ollama-infer-deployment-7c56845496   1         1         1       2d6h

The AI application in the example provides a simple service that returns the LLM responses when sending a GET request with the prompt parameter.

We can test the application service locally by port-forward, allowing us to directly send requests to the application via our browser.

kubectl port-forward service/example-dev-inference-public 8080:80 -n example

The test results are shown in the figure below.

By modifying the model parameter in the main.k file, you can switch to a different model without having to change the application itself.

For example, we change the value of model from llama3 to qwen. Then we execute the kusion apply command to update the K8S resources.

❯ kusion apply                                                         
 ✔︎  Generating Spec in the Stack dev...                                                                     
Stack: dev                                          
ID                                                  Action
v1:Namespace:example                                UnChanged
v1:Service:example:ollama-infer-service             UnChanged
v1:Service:example:proxy-infer-service              UnChanged
v1:Service:example:example-dev-inference-public     UnChanged
apps/v1:Deployment:example:example-dev-inference    UnChanged
apps/v1:Deployment:example:proxy-infer-deployment   Update
apps/v1:Deployment:example:ollama-infer-deployment  Update


Do you want to apply these diffs?:
  yes
> details
  no

We repeat to send the request to the application via the browser, and the new results are as follows.

Provide LLM Service with Inference Module for AI Application

Prerequisites​

Initializing and Managing Workspace Configuration​

Example​

Deployment​

Testing​

Prerequisites

Initializing and Managing Workspace Configuration

Example

Deployment

Testing