# 让 K8sgpt 作为 K8S 的 AI 助手


## 简介

[k8sgpt](https://github.com/k8sgpt-ai/k8sgpt.git) 是一个开源的二进制工具，用于扫描 Kubernetes Cluster，以及用简单的英语诊断和分类问题的工具。结合大模型 AI 能力它将 SRE 经验植入其分析仪，并帮助我们提取最有价值的相关信息，以及基于人工智能进行丰富、完善，以支撑问题的解决。 

![k8sgpt](k8sgpt.png "k8sgpt")

## K8sgpt 简单原理

简单说明 K8sgpt 的原理，执行 K8sgpt 命令时，首先 K8sgpt 会扫描 K8s 集群的资源，获取资源相关事件和缺失的相关资源等错误，然后将这些信息发送给提前设置好的 gpt，默认是 openai，gpt 会给出相关解释和解决办法返回，K8sgpt 得到结果格式化友好的形式返回给用户。

所以要想使用 K8sgpt，前提得拥有一个 gpt 的后端。如果要使用 openai，那么得注册 openai 账号拿到 key。或者使用 localai，即部署一个本地大模型使用。如果是国内环境的话，那么 localai 的方式是比较稳妥的，毕竟其他方式都需要魔法。

想要使用 K8sgpt 有两种方式：

第一种直接使用 K8sgpt 二进制工具，直接在 K8S 集群执行命令即可得到结果

第二种使用 K8sgpt-operator，使用声明式 API 自动执行并获取结果

## K8sgpt 使用

这里分别演示 openai 和 localai 作为后端 gpt  

### 快速使用

提前注册号 openai 的账号，本篇文章不做阐述。

1、首先下载 k8sgpt 二进制工具

```bash
$ wget https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.13/k8sgpt_Linux_arm64.tar.gz
```

2、获取 openai api key，https://beta.openai.com/account/api-keys

![k8sgpt](openai.png "openai")

3、认证 openai，并输入上述 key

```bash
$ ./k8sgpt auth add --backend openai -m gpt-3.5-turbo
Enter openai Key:
```

4、认证成功后，即可使用 k8sgpt 来分析 k8s 集群了。

```bash
# --explain 表示向 gpt 发送请求, 默认不发送请求
$ ./k8sgpt analyze --explain
```

### 过滤资源

在 K8sgpt 中，过滤条件用于管理要分析的资源。

1、默认 k8sgpt 会获取下面 active 类型的分析结果，Unused 表示会被过滤掉，即不分析该资源

```bash
$ ./k8sgpt filters list
Active: 
> Pod
> ReplicaSet
> PersistentVolumeClaim
> CronJob
> MutatingWebhookConfiguration
> ValidatingWebhookConfiguration
> Deployment
> Service
> Ingress
> StatefulSet
> Node
Unused: 
> HorizontalPodAutoScaler
> PodDisruptionBudget
> NetworkPolicy
```

2、如果您不想扫描 Pod 和 Service 类型资源，您可以使用以下命令过滤掉资源。

```bash
# 多个类型用逗号隔开
$ ./k8sgpt filters remove Pod,Service
Filter(s) Pod, Service removed

# 会发现 Pod 和 Service 类型已经在 Unused 下
$ ./k8sgpt filters list
Active: 
> ValidatingWebhookConfiguration
> ReplicaSet
> PersistentVolumeClaim
> Node
> Ingress
> CronJob
> MutatingWebhookConfiguration
> Deployment
> StatefulSet
Unused: 
> Pod
> Service
> HorizontalPodAutoScaler
> PodDisruptionBudget
> NetworkPolicy
```

3、下面看看分析的结果，发现将 Pod，Service 类型已经过滤掉了

```bash
$ ./k8sgpt analyze 
AI Provider: openai

0 kube-system/snapshot-controller(snapshot-controller)
- Error: StatefulSet uses the service kube-system/snapshot-controller which does not exist.

1 kubesphere-logging-system/elasticsearch-logging-discovery(elasticsearch-logging-discovery)
- Error: StatefulSet uses the service kubesphere-logging-system/elasticsearch-logging-master which does not exist.

2 kubesphere-monitoring-system/thanos-ruler-kubesphere(thanos-ruler-kubesphere)
- Error: StatefulSet uses the service kubesphere-monitoring-system/ which does not exist.
```

4、也可以通过以下命令将 Pod ，Service 增加回来

```bash
$ ./k8sgpt filters add Pod,Service
Filter Pod, Service added

$ ./k8sgpt filters list
Active: 
> CronJob
> Pod
> Node
> Ingress
> PersistentVolumeClaim
> MutatingWebhookConfiguration
> Deployment
> StatefulSet
> Service
> ValidatingWebhookConfiguration
> ReplicaSet
Unused: 
> HorizontalPodAutoScaler
> PodDisruptionBudget
> NetworkPolicy
```

### 内置分析器

K8sgpt 有默认的分析器和可选分析器

**默认分析器**

- podAnalyzer
- pvcAnalyzer
- rsAnalyzer
- serviceAnalyzer
- eventAnalyzer
- ingressAnalyzer
- statefulSetAnalyzer
- deploymentAnalyzer
- cronJobAnalyzer
- nodeAnalyzer

**可选分析器**

- hpaAnalyzer
- pdbAnalyzer
- networkPolicyAnalyzer

1、用 “Service” 这类特定的资源过滤结果，只会分析 Service 类型的资源

```bash
$ ./k8sgpt analyze  --filter=Service
Service openebs/openebs.io-local does not exist
AI Provider: openai

0 istio-system/jaeger-operator-metrics(jaeger-operator-metrics)
- Error: Service has no endpoints, expected label name=jaeger-operator
```

2、指定命名空间过滤结果

```bash
$ ./k8sgpt analyze  --filter=Service --namespace istio-system
AI Provider: openai

0 istio-system/jaeger-operator-metrics(jaeger-operator-metrics)
- Error: Service has no endpoints, expected label name=jaeger-operator
```

3、以 JSON 格式输出结果：

```bash
$ ./k8sgpt analyze  --filter=Service --namespace istio-system -o json
{
  "provider": "openai",
  "errors": null,
  "status": "ProblemDetected",
  "problems": 1,
  "results": [
    {
      "kind": "Service",
      "name": "istio-system/jaeger-operator-metrics",
      "error": [
        {
          "Text": "Service has no endpoints, expected label name=jaeger-operator",
          "KubernetesDoc": "",
          "Sensitive": [
            {
              "Unmasked": "name",
              "Masked": "ayUtcw=="
            },
            {
              "Unmasked": "jaeger-operator",
              "Masked": "JWpfejBDMU1MMF9dKER2"
            }
          ]
        }
      ],
      "details": "",
      "parentObject": "jaeger-operator-metrics"
    }
  ]
}
```

### 匿名分析

匿名分析会将敏感数据（如 Kubernetes 对象名称和标签） 发送到 AI 后端进行分析之前对其进行屏蔽。这意味着您的数据将安全可靠地保护起来，没有人能窥探不该看的东西。

在分析过程中，K8sgpt 会检索敏感数据，然后对其进行屏蔽，然后再发送到 AI 后端。后端接收到屏蔽后的数据，进行处理，并向用户返回解决方案。

一旦解决方案返回给用户，屏蔽的数据会被实际的 Kubernetes 对象名称和标签所替换。

可使用以下命令开启匿名分析

```bash
$ ./k8sgpt analyze --explain --anonymize
```

### 集成命令

k8sgpt 可以集成一些额外的工具结合起来对集群进行扫描分析，可以通过以下命令查看 k8sgpt 支持的集成工具

```bash
$ ./k8sgpt integrations list
Active:
Unused: 
> trivy
```

可以看到目前支持 trivy 这一种工具，trivy 是一个容器安全扫描工具。

2、集成 trivy 扫描集群，它将在集群上安装 Trivy 的 Helm chart。

```bash
$ ./k8sgpt integration activate trivy
2023/08/17 21:23:15 creating 1 resource(s)
2023/08/17 21:23:16 creating 1 resource(s)
2023/08/17 21:23:16 creating 1 resource(s)
2023/08/17 21:23:16 creating 1 resource(s)
2023/08/17 21:23:16 creating 1 resource(s)
2023/08/17 21:23:16 creating 1 resource(s)
2023/08/17 21:23:16 creating 1 resource(s)
2023/08/17 21:23:16 creating 1 resource(s)
2023/08/17 21:23:16 creating 1 resource(s)
2023/08/17 21:23:16 creating 1 resource(s)
2023/08/17 21:23:16 beginning wait for 10 resources with timeout of 1m0s
2023/08/17 21:23:18 Clearing REST mapper cache
2023/08/17 21:23:20 creating 21 resource(s)
2023/08/17 21:23:21 release installed successfully: trivy-operator-k8sgpt/trivy-operator-0.15.1
Activated integration trivy
```

3、会发现 k8sgpt 多了一种过滤资源 VulnerabilityReport (integration)

```bash
$ ./k8sgpt filters list
Active: 
> Node
> Ingress
> StatefulSet
> ValidatingWebhookConfiguration
> VulnerabilityReport (integration)
> CronJob
> PersistentVolumeClaim
> MutatingWebhookConfiguration
> Deployment
> Service
> ReplicaSet
> Pod
Unused: 
> NetworkPolicy
> HorizontalPodAutoScaler
> PodDisruptionBudget
```

4、这样就可以单独扫描该 VulnerabilityReport (integration) 资源了

```bash
$ ./k8sgpt analyze --filter VulnerabilityReport
AI Provider: openai

No problems detected
```

5、删除 trivy 集成组件

```bash
$ ./k8sgpt integration deactivate trivy
```

## 结合 localai 使用

对于大部分国内环境是访问不了 openai 的 api，除非拥有魔法。那么使用 localai 在本地环境部署一个大模型供 K8sgpt 使用即可。

### 部署 localai

部署 localai 有多种方式，这里使用 chart 方式部署到 K8S 集群中。

1、下载 local-ai chart 到本地

```bash
$ helm repo add go-skynet https://go-skynet.github.io/helm-charts/
$ helm pull go-skynet/local-ai --version 2.1.1 
```

2、更改 values.yaml

```bash
replicaCount: 1

deployment:
	# 镜像最好提前下载下来，这个镜像非常大，有 12G
  image: quay.io/go-skynet/local-ai:latest
  env:
    # cpu 核数，最好 8c
    threads: 8
    context_size: 512
  # 大模型在容器中的目录
  modelsPath: "/models"

resources:
  {}
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

# Prompt templates to include
# Note: the keys of this map will be the names of the prompt template files
promptTemplates:
  {}
  # ggml-gpt4all-j.tmpl: |
  #   The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
  #   ### Prompt:
  #   {{.Input}}
  #   ### Response:

# Models to download at runtime
models:
  # Whether to force download models even if they already exist
  forceDownload: false

  # The list of URLs to download models from
  # Note: the name of the file will be the name of the loaded model
  list:
    # 指定大模型，这里使用 ggml-gpt4all-j
    # 指定大模型后, 部署完之后, 会启动一个 initContainer 来下载这个大模型, 有3.5G, 所以建议提前下载下来, 放到pv目录下
    - url: "https://gpt4all.io/models/ggml-gpt4all-j.bin"
      # basicAuth: base64EncodedCredentials

  # 开启 pvc, 持续存储大模型
  persistence:
    pvc:
      enabled: true
      size: 6Gi
      accessModes:
        - ReadWriteOnce

      annotations: {}

      # Optional
      storageClass: ~

    hostPath:
      enabled: false
      path: "/models"

service:
  type: ClusterIP
  port: 80
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"

ingress:
  enabled: false
  className: ""
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local
# 指定运行节点, 因为容器镜像非常大, 所以提前将镜像下载下来放到某个节点上, 然后指定节点运行会比较块
nodeSelector:
  kubernetes.io/hostname: master-172-31-97-104
```

3、将 local-ai deployment.yaml initContainers 删除，提前将大模型文件下载下来并放到 local-ai 的 pv 目录下，这样 initContainers 就不需要了

```yaml
initContainers:
  {{- if .Values.promptTemplates }}
  - name: prompt-templates
    image: busybox
    command: ["/bin/sh", "-c"]
    args:
      - |
        cp -fL /prompt-templates/* /models
    volumeMounts:
      - mountPath: /prompt-templates
        name: prompt-templates
      - mountPath: /models
        name: models
  {{- end }}
  - name: download-model
    image: busybox
    command: ["/bin/sh", "-c"]
    args:
      - |
        MODEL_DIR={{ .Values.deployment.modelsPath }}
        FORCE_DOWNLOAD={{ .Values.models.forceDownload }}
        URLS="{{ $urls }}"

        mkdir -p "$MODEL_DIR"

        # Split urls on commas
        echo "$URLS" | awk -F, '{for (i=1; i<=NF; i++) print $i}' | while read -r line; do
            url=$(echo "$line" | awk '{print $1}')
            auth=$(echo "$line" | awk '{print $2}')

            if [ -n "$url" ]; then
                filename=$(basename "$url")

                if [ "$FORCE_DOWNLOAD" = false ] && [ -f "$MODEL_DIR/$filename" ]; then
                    echo "File $filename already exists. Skipping download."
                    continue
                fi

                rm -f "$MODEL_DIR/$filename"

                echo "Downloading $filename"

                if [ -n "$auth" ]; then
                    wget -P "$MODEL_DIR" --header "Authorization: Basic $auth" "$url"
                else
                    wget -P "$MODEL_DIR" "$url"
                fi

                if [ "$?" -ne 0 ]; then
                    echo "Download failed."
                else
                    echo "Download completed."
                fi
            fi
        done
    volumeMounts:
    - mountPath: {{ .Values.deployment.modelsPath }}
      name: models
```

4、部署 local-ai

```bash
$ helm install local-ai go-skynet/local-ai -f values.yaml
```

5、本地使用 curl 命令简单测试下

```bash
$ curl http://10.233.59.69/v1/models
# 返回:
{"object":"list","data":[{"id":"ggml-gpt4all-j","object":"model"}]}

$ curl --location --request POST 'http://172.31.94.65:30470/v1/chat/completions' --header 'Content-Type: application/json' --data '{
    "model": "ggml-gpt4all-j",
    "messages": [
        {
            "role": "user",
            "content": "How are you?"
        }
    ],
    "temperature": 0.9
}'
# 返回:
{"object":"chat.completion","model":"ggml-gpt4all-j","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"I'm doing well, thank you."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
```

### K8sgpt 使用

下面就利用刚刚部署的 local-ai 作为 K8sgpt 大模型来使用 K8sgpt。

1、认证 local-ai

```bash
# --model: 模型名称
# --baseurl: local-ai svc
$ ./k8sgpt auth add --backend localai --model ggml-gpt4all-j --baseurl http://10.233.59.69/v1
localai added to the AI backend provider list
```

2、分析 K8S 集群

```bash
# 使用 localai 只分析 K8S 集群中 service 资源的问题
$ ./k8sgpt analyze --explain -b localai --filter Service
# 输出:
AI Provider: localai

0 swimming-demo/consumer(consumer)
- Error: Service has no endpoints, expected label app=consumer
- Error: Service has no endpoints, expected label app.kubernetes.io/name=consumer
- Error: Service has no endpoints, expected label app.kubernetes.io/version=v1
Example:  {A sample error message with a solution}
	Output: {The expected output with the solution}
	Please note: {Some important information about the error}
	Thank you.
1 test1/test1(test1)
- Error: Service has no endpoints, expected label app=test1
- Error: Service has no endpoints, expected label app.kubernetes.io/name=test1
- Error: Service has no endpoints, expected label app.kubernetes.io/version=v1
This error message indicates that the Service has no endpoint, expected label app=test1. To resolve this, you need to provide the most possible solution in a step by step style. 
	Here are the steps you can follow: 
	1. Make sure you have Kubernetes installed and running. 
	2. Check the Kubernetes configuration file (kubeconfig) and ensure that the service name is set correctly. 
	3. Check the labels on your Deployment and ReplicaSet. Ensure that the labels are set correctly and match your Service name. 
	4. Check that your Deployment and ReplicaSet have a replica set defined. 
	5. Ensure that your ReplicaSet has a Deployment defined. 
	6. Ensure that your Deployment has a replicas set defined. 
	7. Ensure that your Deployment has a selector defined, which matches the labels on your ReplicaSet. 
	8. Ensure that your Deployment has a replicas set defined. 
	9. Ensure that your Deployment has a selector defined, which matches the labels on your ReplicaSet. 
	10. Ensure that your Deployment has a replicas set defined. 
	11. Ensure that your ReplicaSet has a selector defined, which matches the labels on your Deployment. 
	12. Ensure that your ReplicaSet has a replicas set defined. 
	13. Ensure that your ReplicaSet has a selector defined, which matches the labels on your Deployment. 
	14. Ensure that your ReplicaSet has a replicas set defined. 
	15. Ensure that your Deployment has a replicas set defined. 
	16. Ensure that your Deployment has a selector defined, which matches the labels on your ReplicaSet. 
	17. Ensure that your Deployment has a replicas set defined. 
	18. Ensure that your ReplicaSet has a selector defined, which matches the labels on your Deployment. 
	19. Ensure that your ReplicaSet has a replicas set defined. 
	20. Ensure that your Deployment has a replicas set defined. 
	21. Ensure that your Deployment has a selector defined, which matches the labels on your ReplicaSet. 
	22. Ensure that your Deployment has a replicas set defined.
```

可以发现 local-ai 不仅给出了具体错误信息，还给出了问题的解释以及修复的步骤。

local-ai 的作用不仅仅是给 K8sgpt 使用，还可以部署一个 gpt ui 搭建自己的 gpt 。

## K8sgpt-operator 使用

K8sgpt 需要手动执行命令然后得到结果，那么 [K8sgpt-operator](https://github.com/k8sgpt-ai/k8sgpt-operator) 即使用声明式 API 来自动执行并获取结果。

### K8sgpt-operator 原理

简单来说 k8sgpt-operator 可以在集群中开启自动化的 k8sgpt。它提供了两个 CRD： `K8sGPT` 和 `Result`。前者可以用来设置 k8sgpt 及其行为；而后者则是用来展示问题资源的诊断结果。

当创建好 K8sgpt 资源后，K8sgpt-operator 会 watch 到该资源，然后会创建一个 K8sgpt-deployment Pod，这个 Pod 就是启动一个 K8sgpt Server。

然后 K8sgpt-operator 会拿到这个 Pod 的 svc 并调用拿到结果，且周期性重复执行上述操作，这样就能较实时获取到 K8S 集群的分析结果。

### 部署 K8sgpt-operator

1、使用 chart 形式部署 K8sgpt-operator

```bash
$ helm repo add k8sgpt https://charts.k8sgpt.ai/
$ helm pull k8sgpt/k8sgpt-operator --version 0.0.20
$ helm install k8sgpt-operator ./k8sgpt-operator 
```

2、等待 k8sgpt-operator 运行成功

```bash
$ kubectl get pods
NAME                                                 READY   STATUS    RESTARTS   AGE
k8sgpt-operator-controller-manager-948bdbfd9-f7p6k   2/2     Running   0          170m
local-ai-696bf4f754-ptdjg                            1/1     Running   0          4h51m
```

### 测试

1、创建 K8sgpt CRD 描述后端 gpt 信息

```yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-local-ai
  namespace: default
spec:
  ai:
    enabled: true
    model: ggml-gpt4all-j
    backend: localai
    baseUrl: http://local-ai.svc.cluster.local:8080/v1
  noCache: false
  # k8sgpt 镜像tag
  version: v0.3.8
  filters:
  - Pod
```

2、创建完 K8sgpt 资源之后，K8S 集群中会创建一个 K8sgpt 的 deployment 用于此次扫描任务，大概过一会可以通过查看 Result 资源来查看扫描结果。

```bash
$ kubectl get result -A
NAMESPACE   NAME                KIND   BACKEND
default     apisixapisixetcd2   Pod    localai
$ kubectl get result apisixapisixetcd2 -o yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: Result
metadata:
  creationTimestamp: "2023-08-22T06:01:03Z"
  generation: 1
  name: apisixapisixetcd2
  namespace: default
  resourceVersion: "57982314"
  uid: b9c4e004-0655-4a9c-80af-fcfb00c976a7
spec:
  backend: localai
  details: "Example:  Error: Could not start container etcd pod=api-sx-etcdd(6e3a8b48-4f11-49c6-9a15-2f4e3c6c9d1b)
    with exit code: 1\nThe error message suggests that the container for etcd pod
    is failing with exit code 1. The error message also indicates that the pod is
    being restarted due to a backoff failure. \nThe solution suggests that the error
    could be caused by a few possible reasons, such as incorrect configuration of
    etcd or networking issues. \nThe steps to resolve the error are not specified
    in this message."
  error:
  - text: back-off 5m0s restarting failed container=etcd pod=apisix-etcd-2_apisix(d84a9fe9-f02a-41d2-a5a8-fd8dd859603b)
  kind: Pod
  name: apisix/apisix-etcd-2
  parentObject: ""
status: {}
```

## 总结

利用好 K8sgpt 可以作为运维人员的助手，可以作为 K8S 集群巡检工具，也可以帮助运维人员解决 K8S 问题。

但是国内要想使用 K8sgpt，最方便还是私有化部署一个 local-ai，其成本也可控。

当然结合 K8sgpt-operator 自动给出建议是最丝滑的了。

最佳实践：K8sgpt-operator + local-ai