Version: v0.14

Provide LLM Service with Inference Module for AI Application

As Artificial Intelligence (AI) advances, Large Language Models (LLMs) are becoming a key driver of innovation and productivity. As a result, researchers and developers are looking for more efficient ways to deploy and manage LLMs and the AI applications built on them.

To simplify the path from model construction through deployment to interaction with applications, the KusionStack community provides an inference module. This article walks through deploying an AI application that uses the LLM service provided by this module.

Info: The module definition and implementation, as well as the example application shown in this article, can be found here.

Prerequisites

Before we begin, we need to perform the following steps to set up the environment required by Kusion:

  • Install Kusion
  • Run a Kubernetes cluster

For more details, please refer to the prerequisites in the guide for deploying an application with Kusion.

Initializing and Managing Workspace Configuration

For information on how to initialize and switch a workspace with kusion workspace create and kusion workspace switch, please refer to this document.

For the current version of the inference module, an empty workspace configuration is sufficient. The AI application's workload is described with the service module, and users may also need to configure the network module as an accessory to provide the network service for the application. Platform configurations for other modules can be added to the workspace as needed.

An example is shown below:

modules:
  service:
    path: oci://ghcr.io/kusionstack/service
    version: 0.2.0
    configs:
      default: {}
  network:
    path: oci://ghcr.io/kusionstack/network
    version: 0.2.0
    configs:
      default: {}
  inference:
    path: oci://ghcr.io/kusionstack/inference
    version: 0.1.0-beta.4
    configs:
      default: {}

Example

After creating and switching to the workspace shown above, we can initialize the example Project and Stack with kusion project create and kusion stack create. Please refer to this document for more details.

The directory structure and configuration file contents of the example project are shown below:

example/
.
├── default
│   ├── kcl.mod
│   ├── main.k
│   └── stack.yaml
└── project.yaml

project.yaml:

name: example

stack.yaml:

name: default

kcl.mod:

[dependencies]
kam = { git = "https://github.com/KusionStack/kam.git", tag = "0.2.0" }
service = { oci = "oci://ghcr.io/kusionstack/service", tag = "0.1.0" }
network = { oci = "oci://ghcr.io/kusionstack/network", tag = "0.2.0" }
inference = { oci = "oci://ghcr.io/kusionstack/inference", tag = "0.1.0-beta.4" }

main.k:

import kam.v1.app_configuration as ac
import service
import service.container as c
import network as n
import inference.v1.inference

inference: ac.AppConfiguration {
    # Declare the workload configurations.
    workload: service.Service {
        containers: {
            myct: c.Container {image: "kangy126/app"}
        }
        replicas: 1
    }
    # Declare the inference module configurations.
    accessories: {
        "inference": inference.Inference {
            model: "llama3"
            framework: "Ollama"
        }
        "network": n.Network {ports: [n.Port {
            port: 80
            targetPort: 5000
        }]}
    }
}

In the above example, we configure the model and framework items of the inference module, which are the two required configuration items for this module. An inference service for a different model or inference framework can be built quickly by changing these two items.

To let the AI application use the LLM service, the inference module injects an environment variable named INFERENCE_URL into the application container, and the application calls the LLM service at that address.

The model in use is transparent to the application: it only needs to send the prompt parameter to the request address. You can also modify the model and other configuration items directly in the main.k file and update the deployed resources with kusion apply.
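As a rough illustration of the application side, the sketch below reads INFERENCE_URL and builds a request for the LLM service. It is not the code of the example application, whose wire format this article does not show. The /api/generate path, the JSON payload, and the response field are assumptions based on Ollama's HTTP API, and the helper names build_generate_request and ask are hypothetical. The model field is included because that Ollama endpoint expects it; it may be unnecessary if the service fills it in.

```python
import json
import os
import urllib.request


def build_generate_request(prompt, model="llama3", base=None):
    """Build (but do not send) a request to the LLM service."""
    # INFERENCE_URL is injected by the inference module. In the generated Spec
    # its value is a bare service name, so a scheme is added when missing.
    base = base or os.environ.get("INFERENCE_URL", "ollama-infer-service")
    if not base.startswith(("http://", "https://")):
        base = "http://" + base
    # Assumption: the service exposes Ollama's /api/generate endpoint.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt and return the generated text (needs a live service)."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=60) as resp:
        # Assumption: a non-streaming Ollama reply carries the text in "response".
        return json.loads(resp.read())["response"]
```

Keeping request construction separate from sending makes the address handling easy to check without a running cluster; only ask needs the service to be reachable.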

There are also some optional configuration items in the inference module for adjusting the LLM service, whose details can be found here.

Deployment

Now we can generate and deploy the Spec containing all the relevant resources the AI application needs with Kusion.

First, navigate to the example/default folder and execute the kusion generate command to generate the Spec.

➜  default git:(main) ✗ kusion generate
✔︎ Generating Spec in the Stack default...
resources:
- id: v1:Namespace:example
  type: Kubernetes
  attributes:
    apiVersion: v1
    kind: Namespace
    metadata:
      creationTimestamp: null
      name: example
    spec: {}
    status: {}
  extensions:
    GVK: /v1, Kind=Namespace
- id: apps/v1:Deployment:example:example-default-inference
  type: Kubernetes
  attributes:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: inference
        app.kubernetes.io/part-of: example
      name: example-default-inference
      namespace: example
    spec:
      replicas: 1
      selector:
        matchLabels:
          app.kubernetes.io/name: inference
          app.kubernetes.io/part-of: example
      strategy: {}
      template:
        metadata:
          creationTimestamp: null
          labels:
            app.kubernetes.io/name: inference
            app.kubernetes.io/part-of: example
        spec:
          containers:
          - env:
            - name: INFERENCE_URL
              value: ollama-infer-service
            image: kangy126/app
            name: myct
            resources: {}
    status: {}
  dependsOn:
  - v1:Namespace:example
  - v1:Service:example:ollama-infer-service
  - v1:Service:example:example-default-inference-private
  extensions:
    GVK: apps/v1, Kind=Deployment
    kusion.io/is-workload: true
- id: apps/v1:Deployment:example:ollama-infer-deployment
  type: Kubernetes
  attributes:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      creationTimestamp: null
      name: ollama-infer-deployment
      namespace: example
    spec:
      selector:
        matchLabels:
          accessory: ollama
      strategy: {}
      template:
        metadata:
          creationTimestamp: null
          labels:
            accessory: ollama
        spec:
          containers:
          - command:
            - /bin/sh
            - -c
            - |-
              echo 'FROM llama3
              PARAMETER top_k 40
              PARAMETER top_p 0.900000
              PARAMETER temperature 0.800000
              PARAMETER num_predict 128
              PARAMETER num_ctx 2048
              ' > Modelfile && ollama serve & OLLAMA_SERVE_PID=$! && sleep 5 && ollama create llama3 -f Modelfile && wait $OLLAMA_SERVE_PID
            image: ollama/ollama
            name: ollama-infer-container
            ports:
            - containerPort: 11434
              name: ollama-port
            resources: {}
            volumeMounts:
            - mountPath: /root/.ollama
              name: ollama-infer-storage
          volumes:
          - emptyDir: {}
            name: ollama-infer-storage
    status: {}
  dependsOn:
  - v1:Namespace:example
  - v1:Service:example:ollama-infer-service
  - v1:Service:example:example-default-inference-private
  extensions:
    GVK: apps/v1, Kind=Deployment
- id: v1:Service:example:ollama-infer-service
  type: Kubernetes
  attributes:
    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: null
      labels:
        accessory: ollama
      name: ollama-infer-service
      namespace: example
    spec:
      ports:
      - port: 80
        targetPort: 11434
      selector:
        accessory: ollama
      type: ClusterIP
    status:
      loadBalancer: {}
  dependsOn:
  - v1:Namespace:example
  extensions:
    GVK: /v1, Kind=Service
- id: v1:Service:example:example-default-inference-private
  type: Kubernetes
  attributes:
    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: inference
        app.kubernetes.io/part-of: example
      name: example-default-inference-private
      namespace: example
    spec:
      ports:
      - name: example-default-inference-private-80-tcp
        port: 80
        protocol: TCP
        targetPort: 5000
      selector:
        app.kubernetes.io/name: inference
        app.kubernetes.io/part-of: example
      type: ClusterIP
    status:
      loadBalancer: {}
  dependsOn:
  - v1:Namespace:example
  extensions:
    GVK: /v1, Kind=Service
secretStore: null
context: {}
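The command of the ollama-infer-container in the Spec above writes an Ollama Modelfile (a FROM line followed by PARAMETER lines) before creating the model. The sketch below shows how such text could be rendered from a model name and a set of parameters. It is only an illustration: render_modelfile is a hypothetical helper, not the module's implementation, and the six-decimal float formatting is chosen to mirror the Spec output.

```python
def render_modelfile(model, params):
    """Render Ollama Modelfile text from a base model and its parameters."""
    lines = [f"FROM {model}"]
    for name, value in params.items():
        # Floats are printed with six decimals to mirror the Spec above;
        # integers are printed as-is.
        text = f"{value:f}" if isinstance(value, float) else str(value)
        lines.append(f"PARAMETER {name} {text}")
    return "\n".join(lines) + "\n"


# Values taken from the generated Spec above.
print(render_modelfile("llama3", {
    "top_k": 40,
    "top_p": 0.9,
    "temperature": 0.8,
    "num_predict": 128,
    "num_ctx": 2048,
}), end="")
```

Running the sketch prints the same six lines that appear between the quotes of the echo command in the Spec.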

Next, we can execute the kusion preview command and review the three-way resource diffs for a safer deployment.

➜  default git:(main) ✗ kusion preview
✔︎ Generating Spec in the Stack default...
Stack: default
ID Action
v1:Namespace:example Create
v1:Service:example:ollama-infer-service Create
v1:Service:example:example-default-inference-private Create
apps/v1:Deployment:example:example-default-inference Create
apps/v1:Deployment:example:ollama-infer-deployment Create


Which diff detail do you want to see?:
> all
v1:Namespace:example Create
v1:Service:example:ollama-infer-service Create
v1:Service:example:example-default-inference-private Create
apps/v1:Deployment:example:example-default-inference Create

Finally, execute the kusion apply command to deploy the related Kubernetes resources.

➜  default git:(main) ✗ kusion apply
✔︎ Generating Spec in the Stack default...
Stack: default
ID Action
v1:Namespace:example Create
v1:Service:example:ollama-infer-service Create
v1:Service:example:example-default-inference-private Create
apps/v1:Deployment:example:ollama-infer-deployment Create
apps/v1:Deployment:example:example-default-inference Create


Do you want to apply these diffs?:
> yes

Start applying diffs ...
✔︎ Succeeded v1:Namespace:example
✔︎ Succeeded v1:Service:example:ollama-infer-service
✔︎ Succeeded v1:Service:example:example-default-inference-private
✔︎ Succeeded apps/v1:Deployment:example:ollama-infer-deployment
✔︎ Succeeded apps/v1:Deployment:example:example-default-inference
Apply complete! Resources: 5 created, 0 updated, 0 deleted.

Testing

Execute the kubectl get all -n example command, and the deployed Kubernetes resources will be shown.

➜  ~ kubectl get all -n example
NAME                                           READY   STATUS    RESTARTS   AGE
pod/example-dev-inference-5cf6c74574-7w92f     1/1     Running   0          2d6h
pod/mynginx                                    1/1     Running   0          2d6h
pod/ollama-infer-deployment-7c56845496-s5snb   1/1     Running   0          2d6h

NAME                                   TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)        AGE
service/example-dev-inference-public   ClusterIP   192.168.116.121   <none>        80:32693/TCP   2d6h
service/ollama-infer-service           ClusterIP   192.168.28.208    <none>        80/TCP         2d6h

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/example-dev-inference     1/1     1            1           2d6h
deployment.apps/ollama-infer-deployment   1/1     1            1           2d6h

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/example-dev-inference-5cf6c74574     1         1         1       2d6h
replicaset.apps/ollama-infer-deployment-7c56845496   1         1         1       2d6h

The AI application in the example provides a simple service that returns the LLM's response when it receives a GET request with the prompt parameter.

We can test the application service locally with kubectl port-forward, which lets us send requests to the application directly from a browser.

kubectl port-forward service/example-dev-inference-public 8080:80 -n example
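With the port-forward running, the same request can also be sent from a script instead of a browser. The sketch below builds a GET URL carrying the prompt parameter and fetches it. The root path "/" and the plain-text handling of the response are assumptions, since the article does not state the application's route or response format; build_test_url and send_prompt are hypothetical helper names.

```python
import urllib.parse
import urllib.request


def build_test_url(prompt, host="localhost", port=8080, path="/"):
    """Build the GET URL for the forwarded application service."""
    # Assumption: the application serves the prompt endpoint at the root path.
    query = urllib.parse.urlencode({"prompt": prompt})
    return f"http://{host}:{port}{path}?{query}"


def send_prompt(prompt):
    """Send the prompt through the port-forward and return the raw body."""
    with urllib.request.urlopen(build_test_url(prompt), timeout=120) as resp:
        return resp.read().decode("utf-8")
```

For example, build_test_url("Why is the sky blue?") yields http://localhost:8080/?prompt=Why+is+the+sky+blue%3F, which can also be pasted into the browser's address bar.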

The test results are shown in the figure below.

By modifying the model parameter in the main.k file, you can switch to a different model without having to change the application itself.

For example, we change the value of model from llama3 to qwen, then execute the kusion apply command to update the Kubernetes resources.

❯ kusion apply
✔︎ Generating Spec in the Stack dev...
Stack: dev
ID Action
v1:Namespace:example UnChanged
v1:Service:example:ollama-infer-service UnChanged
v1:Service:example:proxy-infer-service UnChanged
v1:Service:example:example-dev-inference-public UnChanged
apps/v1:Deployment:example:example-dev-inference UnChanged
apps/v1:Deployment:example:proxy-infer-deployment Update
apps/v1:Deployment:example:ollama-infer-deployment Update


Do you want to apply these diffs?:
yes
> details
no

We send the request to the application again from the browser, and the new results are as follows.