一、概述
1.1 背景介绍
生产环境中一个微服务体系动辄几十个 Deployment、Service、ConfigMap、Secret、Ingress,如果全部用裸 YAML 手工维护,版本迭代时改错一个 label selector 就能导致滚动更新断流。Helm 作为 Kubernetes 的包管理器,把一组关联资源打包成 Chart,通过模板变量实现环境差异化渲染,通过 Release 管理实现一键部署、升级、回滚。从 Helm v3 开始去掉了集群端的 Tiller 组件,直接使用 kubeconfig 鉴权,安全模型大幅简化。自 v3.8 起,OCI Registry 成为正式(GA)的原生 Chart 存储后端,Chart 的分发方式与容器镜像完全对齐。
本文面向一线运维和 SRE,以排障手册的形式覆盖 Helm Chart 从设计、开发、测试到部署上线的全链路,重点放在企业级场景下的 Chart 模板化设计模式、多环境分层覆盖策略、OCI Registry 集成、Helmfile 编排以及常见部署故障的根因定位。
1.2 核心语义
| 概念 | 语义 | 排障关联 |
|---|---|---|
| Chart | 一组 Kubernetes 资源模板的打包单元 | Chart 结构错误直接导致 helm install 失败 |
| Release | Chart 的一次安装实例,带版本历史 | Release 状态异常是回滚和升级故障的入口 |
| Values | 模板渲染参数,支持多层覆盖 | Values 层级错误导致渲染结果不符合预期 |
| Repository / OCI Registry | Chart 的存储和分发通道 | 认证、网络问题导致 chart pull 失败 |
| Hook | Release 生命周期中的自定义动作 | Hook 执行失败会阻塞整个 install/upgrade 流程 |
1.3 适用场景
多环境(dev/staging/prod)同一套应用需要差异化配置部署
微服务体系需要统一模板、批量发布、版本化管理
需要在 CI/CD 流水线中实现声明式部署并支持自动回滚
跨团队共享基础设施组件(中间件、监控栈、日志采集器)的标准化交付
需要将内部 Chart 推送到私有 OCI Registry 进行版本管理
1.4 环境要求
| 组件 | 版本要求 | 说明 |
|---|---|---|
| Kubernetes | 1.32+ | 目标集群版本,需与 Chart apiVersion 兼容 |
| Helm | v3.17+ | OCI registry 默认支持,无需额外 feature gate |
| kubectl | 匹配集群版本 | helm 底层依赖 kubectl 的 kubeconfig |
| Docker Engine | 27.x | 用于构建镜像和本地 OCI registry 测试 |
| Container Registry | Harbor 2.x / ECR / ACR / GCR | OCI 兼容的 Chart 存储后端 |
| Helmfile | v0.169+ | 多 Release 批量编排(可选) |
1.5 排障坐标系
Helm 部署故障的排查可以沿两个轴展开:
时间轴(Release 生命周期):
helm install   → pre-install hooks → 资源创建 → post-install hooks → Ready 检查
helm upgrade   → pre-upgrade hooks → 资源更新 → post-upgrade hooks → Ready 检查
helm rollback  → pre-rollback hooks → 资源回退 → post-rollback hooks
helm uninstall → pre-delete hooks → 资源删除 → post-delete hooks
空间轴(问题域分层):
Layer 0: Chart 结构 / 语法错误 → helm lint / helm template 可发现
Layer 1: Values 渲染结果异常 → helm template + diff 可发现
Layer 2: Kubernetes API 拒绝 → kubectl apply --dry-run=server 级别
Layer 3: 资源运行时异常 → Pod/Service/Ingress 层面的运行时排查
Layer 4: Hook 执行失败 → Job/Pod 日志层面
二、详细步骤
2.1 观测面:Chart 结构与 Release 状态
2.1.1 Chart 目录结构
一个标准的 Helm Chart 目录结构:
mychart/
├── Chart.yaml              # Chart 元数据:名称、版本、依赖
├── Chart.lock              # 依赖锁定文件
├── values.yaml             # 默认参数
├── values-dev.yaml         # 开发环境覆盖(可选,推荐放 Chart 外部)
├── values-prod.yaml        # 生产环境覆盖
├── charts/                 # 子 Chart(依赖)
├── crds/                   # CRD 定义(install 时自动应用)
├── templates/              # 模板文件
│   ├── _helpers.tpl        # 命名模板(partial templates)
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml
│   ├── serviceaccount.yaml
│   ├── NOTES.txt           # install 后的提示信息
│   └── tests/
│       └── test-connection.yaml
└── .helmignore             # 打包时忽略的文件
2.1.2 Chart.yaml 详解
apiVersion: v2              # Helm v3 固定为 v2
name: myapp
description: A production-grade web application chart
type: application           # application 或 library
version: 1.4.2              # Chart 版本,遵循 SemVer
appVersion: "3.8.1"         # 应用程序版本,仅展示用途
kubeVersion: ">=1.28.0-0"   # 兼容的 K8s 版本范围
home: https://github.com/myorg/myapp
maintainers:
  - name: sre-team
    email: sre@myorg.com
dependencies:
  - name: postgresql
    version: "15.5.x"
    repository: "oci://registry.myorg.com/charts"
    condition: postgresql.enabled
    tags:
      - database
  - name: redis
    version: "19.x.x"
    repository: "oci://registry.myorg.com/charts"
    condition: redis.enabled
    tags:
      - cache
  - name: common
    version: "2.x.x"
    repository: "oci://registry.myorg.com/charts"
    tags:
      - infra
version 和 appVersion 的区别经常混淆:version 是 Chart 自身的版本号,每次模板变更都应递增;appVersion 是 Chart 部署的应用版本,变更应用镜像 tag 时更新。CI/CD 中应该自动同步这两个版本号。
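上面提到 CI/CD 中应自动同步这两个版本号,下面给出一个假设性的最小示例(文件内容与新版本号均为演示值;生产中更推荐用 yq 操作 YAML,此处用 sed 只为避免额外依赖):

```shell
# 演示:在 CI 中用 sed 同步 Chart.yaml 的 version 与 appVersion(假设性示例)
set -euo pipefail
CHART_YAML=$(mktemp)
cat > "${CHART_YAML}" <<'EOF'
apiVersion: v2
name: myapp
version: 1.4.2
appVersion: "3.8.1"
EOF
NEW_CHART_VERSION="1.4.3"   # 模板有变更 → 递增 Chart version
NEW_APP_VERSION="3.8.2"     # 镜像 tag 变更 → 同步更新 appVersion
sed -i "s/^version: .*/version: ${NEW_CHART_VERSION}/" "${CHART_YAML}"
sed -i "s/^appVersion: .*/appVersion: \"${NEW_APP_VERSION}\"/" "${CHART_YAML}"
grep '^version:' "${CHART_YAML}"
grep '^appVersion:' "${CHART_YAML}"
```

在流水线中,这两个新版本号通常分别来自 git tag 与镜像构建 stage 的产物变量。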
2.1.3 Release 状态查看
# 列出所有 namespace 的 release
helm list -A
# 查看特定 release 的详细信息
helm status myapp -n production
# 查看 release 历史版本
helm history myapp -n production
# 输出示例
# REVISION  UPDATED            STATUS      CHART        APP VERSION  DESCRIPTION
# 1         2026-02-10 10:00   superseded  myapp-1.2.0  3.6.0        Install complete
# 2         2026-02-15 14:00   superseded  myapp-1.3.0  3.7.0        Upgrade complete
# 3         2026-03-01 09:00   deployed    myapp-1.4.2  3.8.1        Upgrade complete
Release 的 STATUS 字段是排障第一入口:
| STATUS | 含义 | 下一步动作 |
|---|---|---|
| deployed | 当前活跃版本 | 正常,无需处理 |
| superseded | 已被新版本取代 | 保留用于回滚参考 |
| failed | 安装或升级失败 | 查看 helm status 中的 NOTES 和 events |
| pending-install | 安装中 | 检查 hook Job 是否卡住 |
| pending-upgrade | 升级中 | 检查 hook Job 和资源创建状态 |
| pending-rollback | 回滚中 | 检查回滚目标版本的资源兼容性 |
| uninstalling | 删除中 | 检查 finalizer 和 pre-delete hook |
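补充一个与 Release 状态排查相关的底层事实:Helm v3 把每个 Revision 的完整 Release 数据存为目标 namespace 下名为 sh.helm.release.v1.<name>.v<N> 的 Secret,其 release 字段是 gzip 压缩后再 base64 编码的 JSON(经 kubectl 读取时再叠加一层 Secret 自身的 base64)。当 helm 客户端不可用时,可以直接解码该 Secret 排查 Release 元数据。下面用本地构造的样例数据演示这条解码管道(样例内容为虚构):

```shell
# 演示 Helm Release Secret 的解码管道(样例数据为本地构造,非真实集群输出)
set -euo pipefail
SAMPLE='{"name":"myapp","version":3,"info":{"status":"deployed"}}'
# 模拟集群中存储的形态:gzip → base64(Helm 层) → base64(Secret 层)
ENCODED=$(printf '%s' "${SAMPLE}" | gzip -c | base64 -w0 | base64 -w0)
# 真实排查时等价于:
# kubectl get secret sh.helm.release.v1.myapp.v3 -n production \
#   -o jsonpath='{.data.release}' | base64 -d | base64 -d | gunzip
DECODED=$(printf '%s' "${ENCODED}" | base64 -d | base64 -d | gunzip)
echo "${DECODED}"
```

`helm list` 看不到的"幽灵" Release(如 Secret 残留但客户端报错)往往要靠这种方式定位。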
2.2 第一轮判断:模板渲染与语法检查
2.2.1 helm lint — 静态检查
# 基础 lint
helm lint ./mychart
# 带 values 文件的 lint
helm lint ./mychart -f values-prod.yaml
# 严格模式
helm lint ./mychart --strict
# 输出示例(有问题时)
# ==> Linting ./mychart
# [ERROR] Chart.yaml: version is required
# [WARNING] templates/deployment.yaml: object name does not conform to Kubernetes naming requirements
# [INFO] Chart.yaml: icon is recommended
#
# Error: 1 chart(s) linted, 1 chart(s) failed
helm lint 能捕获的问题:Chart.yaml 必填字段缺失、模板语法错误(括号未闭合、函数不存在)、资源名称不符合 K8s 命名规范。它不能捕获的问题:Values 值类型不匹配、运行时资源冲突、镜像不存在。
2.2.2 helm template — 本地渲染
# 完整渲染输出
helm template myapp ./mychart -f values-prod.yaml -n production
# 只渲染特定模板
helm template myapp ./mychart -s templates/deployment.yaml -f values-prod.yaml
# 渲染并校验 API 版本(连接集群)
helm template myapp ./mychart -f values-prod.yaml --validate
# 渲染结果重定向到文件,方便 diff
helm template myapp ./mychart -f values-dev.yaml > rendered-dev.yaml
helm template myapp ./mychart -f values-prod.yaml > rendered-prod.yaml
diff rendered-dev.yaml rendered-prod.yaml
helm template 是排查 Values 渲染问题的核心工具。在 CI/CD 中,建议每次 merge request 都执行 helm template 并与主分支的渲染结果做 diff,任何意外的资源变更都应触发人工审核。
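这种渲染 diff 门禁的判定逻辑可以脱离 helm 先行验证。下面的示例用两个本地文件模拟主分支与 MR 分支的 helm template 输出(文件内容为虚构),演示"有差异即要求人工审核"的骨架:

```shell
# 渲染结果 diff 门禁骨架(假设性示例:两个文件模拟两个分支的 helm template 输出)
set -euo pipefail
WORK=$(mktemp -d)
cat > "${WORK}/main.yaml" <<'EOF'
image: registry.myorg.com/myapp:3.7.0
replicas: 3
EOF
cat > "${WORK}/mr.yaml" <<'EOF'
image: registry.myorg.com/myapp:3.8.1
replicas: 3
EOF
# diff 无差异时返回 0;有差异时返回非零,但赋值仍会保留 diff 内容
if DIFF=$(diff -u "${WORK}/main.yaml" "${WORK}/mr.yaml"); then
  echo "no manifest changes"
else
  echo "manifest changed, requires review:"
  echo "${DIFF}"
fi
```

实际流水线中,两个输入文件分别来自两个分支上执行 helm template 的结果;是否直接让流水线失败(exit 1)还是仅打标签提示,取决于团队的审批流程。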
2.2.3 helm diff 插件
# 安装 diff 插件
helm plugin install https://github.com/databus23/helm-diff
# 查看升级将产生的差异(不实际执行)
helm diff upgrade myapp ./mychart -f values-prod.yaml -n production
# 输出示例
# production, myapp-deployment, Deployment (apps) has changed:
#   spec:
#     template:
#       spec:
#         containers:
#         - name: myapp
# -         image: registry.myorg.com/myapp:3.7.0
# +         image: registry.myorg.com/myapp:3.8.1
# -         resources:
# -           limits:
# -             memory: 512Mi
# +         resources:
# +           limits:
# +             memory: 1Gi
helm diff 在执行 helm upgrade 之前展示所有即将变更的资源,是生产环境变更审批的必备环节。
2.3 第二轮下钻:Go 模板语法与 Values 设计
2.3.1 Go 模板语法核心
变量引用与内置对象:
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "myapp.fullname" . }}
labels:
{{- include "myapp.labels" . | nindent 4 }}
annotations:
helm.sh/chart: {{ include "myapp.chart" . }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
spec:
{{- if not .Values.autoscaling.enabled }}
replicas: {{ .Values.replicaCount }}
{{- end }}
selector:
matchLabels:
{{- include "myapp.selectorLabels" . | nindent 6 }}
template:
metadata:
annotations:
checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
labels:
{{- include "myapp.selectorLabels" . | nindent 8 }}
spec:
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
serviceAccountName: {{ include "myapp.serviceAccountName" . }}
containers:
- name: {{ .Chart.Name }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
- name: http
containerPort: {{ .Values.service.port }}
protocol: TCP
{{- if .Values.probes.liveness.enabled }}
livenessProbe:
httpGet:
path: {{ .Values.probes.liveness.path }}
port: http
initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
failureThreshold: {{ .Values.probes.liveness.failureThreshold }}
{{- end }}
{{- if .Values.probes.readiness.enabled }}
readinessProbe:
httpGet:
path: {{ .Values.probes.readiness.path }}
port: http
initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }}
periodSeconds: {{ .Values.probes.readiness.periodSeconds }}
{{- end }}
resources:
{{- toYaml .Values.resources | nindent 12 }}
{{- if .Values.extraEnv }}
env:
{{- range .Values.extraEnv }}
- name: {{ .name }}
value: {{ .value | quote }}
{{- end }}
{{- end }}
volumeMounts:
- name: config
mountPath: /etc/myapp
readOnly: true
volumes:
- name: config
configMap:
name: {{ include "myapp.fullname" . }}-config
{{- with .Values.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.tolerations }}
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.affinity }}
affinity:
{{- toYaml . | nindent 8 }}
{{- end }}
关键语法点:
| 语法 | 作用 | 常见错误 |
|---|---|---|
| {{ .Values.xxx }} | 引用 values 值 | 路径写错不报错,渲染为空 |
| {{- ... -}} | 去除左/右空白 | 过度使用导致 YAML 缩进错误 |
| {{ include "tpl" . }} | 调用命名模板 | 忘记传递 . 上下文 |
| {{ toYaml . \| nindent N }} | 转 YAML 并缩进 N 格 | nindent 值错误导致 YAML 解析失败 |
| {{ with .Values.xxx }} | 条件块并切换上下文 | 块内无法访问 .Release 等顶级对象 |
| {{ range .Values.list }} | 遍历列表 | 块内 . 指向当前元素,需 $.Values 访问顶级 |
| {{ default "val" .Values.xxx }} | 默认值 | 空字符串、0、false 也被视为"空"而触发 default,合法的零值会被意外覆盖 |
| {{ required "msg" .Values.xxx }} | 必填校验 | 仅在渲染时触发,lint 不检查 |
| {{ quote .Values.xxx }} | 加引号 | 数字类型意外变字符串 |
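针对表中 default 与 required 两行,补一个假设性的模板片段说明差异(字段名沿用本文示例 values):

```yaml
# default:空字符串也被视为"空",tag: "" 时会回退到 Chart.AppVersion
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
# required:值缺失时渲染阶段直接报错并输出提示信息;helm lint 不会提前发现
port: {{ required "service.port is required" .Values.service.port }}
```

反过来说,如果某字段的 0 / false / "" 是合法取值,就不能用 default,而应改用 hasKey 或在 values.yaml 中显式给默认值。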
2.3.2 _helpers.tpl 命名模板
# templates/_helpers.tpl
{{/*
Chart 完整名称,带 release name 前缀
*/}}
{{- define "myapp.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Chart 标识
*/}}
{{- define "myapp.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
通用标签
*/}}
{{- define "myapp.labels" -}}
helm.sh/chart: {{ include "myapp.chart" . }}
{{ include "myapp.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
{{/*
Selector 标签 — 不可变,用于 Deployment selector
*/}}
{{- define "myapp.selectorLabels" -}}
app.kubernetes.io/name: {{ include "myapp.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
{{/*
ServiceAccount 名称
*/}}
{{- define "myapp.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "myapp.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}
{{/*
名称截断
*/}}
{{- define "myapp.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
63 字符截断是 Kubernetes 对 label value 的硬限制,在命名模板中必须始终保留这个约束。
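这条 63 字符约束可以在本地用纯 bash 模拟验证。下面是一个假设性示例,仅演示 trunc 63 | trimSuffix "-" 的语义(名称为虚构),不依赖 helm:

```shell
# 用纯 bash 模拟模板中 trunc 63 | trimSuffix "-" 的效果(假设性示例)
set -euo pipefail
truncate_name() {
  local name="$1"
  name="${name:0:63}"   # 对应 trunc 63:截到前 63 个字符
  name="${name%-}"      # 对应 trimSuffix "-":只去掉结尾的一个 "-"
  printf '%s' "${name}"
}
LONG="very-long-release-name-for-a-microservice-platform-environment-x-myapp"
truncate_name "${LONG}"
```

当 release 名称本身接近 63 字符时,fullname 模板拼接 "-<chart name>" 后截断,再去掉悬空的结尾连字符,这正是 _helpers.tpl 中那串链式调用的目的。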
2.3.3 Values 分层覆盖设计
Helm Values 的合并顺序(后者覆盖前者):
Chart 内 values.yaml(默认值)
  ↓ 被覆盖
父 Chart 的 values.yaml(子 Chart 依赖场景)
  ↓ 被覆盖
-f / --values 指定的文件(可多个,后指定的优先)
  ↓ 被覆盖
--set / --set-string / --set-json(命令行直接设置)
values.yaml 设计原则:
# values.yaml — 默认值,适配 dev 环境
replicaCount: 1
image:
repository: registry.myorg.com/myapp
pullPolicy: IfNotPresent
tag: "" # 留空,默认使用 Chart.AppVersion
imagePullSecrets:
- name: registry-credentials
serviceAccount:
create: true
name: ""
annotations: {}
service:
type: ClusterIP
port: 8080
ingress:
enabled: false
className: nginx
annotations: {}
hosts:
- host: myapp.dev.myorg.com
paths:
- path: /
pathType: Prefix
tls: []
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 100m
memory: 128Mi
autoscaling:
enabled: false
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
probes:
liveness:
enabled: true
path: /healthz
initialDelaySeconds: 10
periodSeconds: 15
failureThreshold: 3
readiness:
enabled: true
path: /readyz
initialDelaySeconds: 5
periodSeconds: 10
extraEnv: []
nodeSelector: {}
tolerations: []
affinity: {}
# 子 Chart 开关
postgresql:
enabled: true
auth:
database: myapp
username: myapp
primary:
persistence:
size: 10Gi
redis:
enabled: false
values-prod.yaml — 生产环境覆盖:
# values-prod.yaml
replicaCount: 3
image:
  tag: "3.8.1"
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
  hosts:
    - host: myapp.myorg.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: myapp-tls
      hosts:
        - myapp.myorg.com
resources:
  limits:
    cpu: "2"
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 512Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 65
probes:
  liveness:
    initialDelaySeconds: 30
    failureThreshold: 5
nodeSelector:
  node-role: app
tolerations:
  - key: dedicated
    operator: Equal
    value: app
    effect: NoSchedule
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - myapp
          topologyKey: kubernetes.io/hostname
postgresql:
  primary:
    persistence:
      size: 100Gi
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
      requests:
        cpu: "1"
        memory: 2Gi
敏感值处理:
生产环境的 Secret 不应写在 values 文件中。推荐方案:
# 方案一:通过 --set 在 CI/CD 中注入
helm upgrade myapp ./mychart \
  -f values-prod.yaml \
  --set secrets.dbPassword="${DB_PASSWORD}" \
  --set secrets.apiKey="${API_KEY}" \
  -n production
# 方案二:External Secrets Operator(推荐)
# 在 Chart 模板中引用 ExternalSecret 资源,从 AWS Secrets Manager / Vault 同步
# templates/external-secret.yaml
{{- if .Values.externalSecrets.enabled }}
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: {{ include "myapp.fullname" . }}-secrets
labels:
{{- include "myapp.labels" . | nindent 4 }}
spec:
refreshInterval: 1h
secretStoreRef:
name: {{ .Values.externalSecrets.storeRef }}
kind: ClusterSecretStore
target:
name: {{ include "myapp.fullname" . }}-secrets
data:
{{- range .Values.externalSecrets.keys }}
- secretKey: {{ .secretKey }}
remoteRef:
key: {{ .remoteKey }}
property: {{ .property }}
{{- end }}
{{- end }}
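上面的 ExternalSecret 模板需要 values 中提供对应结构。下面是一个假设性的配套片段(storeRef 与各 key 名称均为演示值):

```yaml
externalSecrets:
  enabled: true
  storeRef: aws-secrets-manager   # 对应集群中已创建的 ClusterSecretStore
  keys:
    - secretKey: database-url     # 生成的 K8s Secret 中的 key
      remoteKey: prod/myapp       # 远端(如 Secrets Manager)中的路径
      property: database_url
    - secretKey: api-key
      remoteKey: prod/myapp
      property: api_key
```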
2.4 根因矩阵:Helm 部署常见故障
| 故障现象 | 层级 | 可能原因 | 诊断命令 | 修复方向 |
|---|---|---|---|---|
| helm install 报 YAML 解析错误 | L0 | 模板缩进错误、nindent 值不对 | helm template 查看渲染结果 | 修正模板缩进 |
| helm install 报 "cannot re-use a name" | L2 | 同名 Release 已存在(可能是 failed 状态) | helm list -A --all | helm uninstall 旧 Release 或换名 |
| upgrade 报 "has no deployed releases" | L2 | 首次 install 失败,Release 停在 pending-install | helm list --pending -A | helm uninstall --no-hooks 清理后重新 install |
| upgrade 后 Pod 未更新 | L1 | image.tag 未变更,Deployment spec 无变化 | helm diff upgrade 检查差异 | 在 annotation 中加 checksum/config 触发滚动 |
| hook Job 卡在 Pending | L4 | 资源配额不足、nodeSelector 不匹配 | kubectl describe job | 调整 Job 资源或调度约束 |
| "lookup function not supported" | L0 | 使用 lookup 函数但在 helm template 中运行 | 改用 helm install --dry-run | lookup 只在连接集群时可用 |
| "chart requires kubeVersion >=1.28" | L0 | 集群版本低于 Chart 要求 | kubectl version | 升级集群或调整 Chart.yaml |
| OCI pull 401 Unauthorized | L2 | Registry 认证失败 | helm registry login 测试 | 检查凭证配置 |
| values 中的 list 被 --set 覆盖而非追加 | L1 | Helm --set 对 list 是替换行为 | helm template 对比渲染结果 | 使用 --set-json 或 -f 文件 |
| CRD 升级未生效 | L0 | Helm 不管理 CRD 的升级(只负责首次安装) | kubectl get crd -o yaml | 手动 kubectl apply CRD |
2.5 处理与验证
2.5.1 Release 部署操作
# 首次安装
helm install myapp ./mychart -f values-prod.yaml -n production \
  --create-namespace --wait --timeout 10m
# 升级(带原子操作:失败自动回滚)
helm upgrade myapp ./mychart -f values-prod.yaml -n production \
  --atomic --timeout 10m
# 手动回滚到上一个版本
helm rollback myapp 0 -n production --wait
# 回滚到指定版本
helm rollback myapp 2 -n production --wait
# 卸载
helm uninstall myapp -n production --keep-history
关键参数说明:
--wait:等待所有 Pod 就绪后才标记 Release 为 deployed,否则即使 Pod 还没启动完就返回成功
--atomic:如果升级失败(--wait 超时、hook 失败),自动回滚到上一个成功版本
--timeout:与 --wait 配合,设置等待超时。默认 5m,复杂应用建议调到 10-15m
--keep-history:uninstall 时保留历史,后续可以 rollback
2.5.2 部署后验证清单
# 1. 确认 Release 状态
helm status myapp -n production
# 2. 确认渲染结果符合预期
helm get values myapp -n production    # 查看当前生效的 values
helm get manifest myapp -n production  # 查看当前生效的完整 manifest
# 3. 确认 Pod 状态
kubectl get pods -n production -l app.kubernetes.io/instance=myapp
# 4. 确认 Service 端点
kubectl get endpoints -n production -l app.kubernetes.io/instance=myapp
# 5. 确认 Ingress
kubectl get ingress -n production -l app.kubernetes.io/instance=myapp
# 6. 运行 Chart 测试
helm test myapp -n production
2.5.3 Helm Hooks 详解
# templates/hooks/db-migration.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "myapp.fullname" . }}-db-migrate
labels:
{{- include "myapp.labels" . | nindent 4 }}
annotations:
"helm.sh/hook": pre-upgrade,pre-install
"helm.sh/hook-weight": "-5" # 权重越小越先执行
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
backoffLimit: 3
activeDeadlineSeconds: 600
template:
metadata:
labels:
{{- include "myapp.selectorLabels" . | nindent 8 }}
spec:
restartPolicy: Never
containers:
- name: migrate
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
command: ["./migrate", "--direction", "up"]
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: {{ include "myapp.fullname" . }}-secrets
key: database-url
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
Hook 类型与执行时机:
| Hook | 触发时机 | 典型用途 |
|---|---|---|
| pre-install | install 前,资源创建前 | 数据库初始化、前置检查 |
| post-install | install 后,所有资源创建后 | 注册服务、发送通知 |
| pre-upgrade | upgrade 前 | 数据库迁移、数据备份 |
| post-upgrade | upgrade 后 | 缓存预热、健康检查 |
| pre-rollback | rollback 前 | 数据库回退迁移 |
| post-rollback | rollback 后 | 状态清理 |
| pre-delete | uninstall 前 | 资源清理、数据导出 |
| post-delete | uninstall 后 | 外部资源清理 |
| test | helm test 执行时 | 连通性测试、功能验证 |
Hook 删除策略:
| 策略 | 含义 |
|---|---|
| before-hook-creation | 执行新 hook 前删除旧的同名资源 |
| hook-succeeded | hook 成功后删除 |
| hook-failed | hook 失败后删除 |
生产建议:始终设置 before-hook-creation,否则第二次 upgrade 时会因为同名 Job 已存在而失败。
三、示例代码和配置
3.1 OCI Registry 操作
Helm 自 v3.8 起将 OCI Registry 支持转为正式功能(GA),不再需要设置环境变量 HELM_EXPERIMENTAL_OCI=1;本文使用的 v3.17+ 默认即为该行为。
# 登录 OCI Registry
helm registry login registry.myorg.com \
  --username admin \
  --password-stdin <<< "${REGISTRY_PASSWORD}"
# 打包 Chart
helm package ./mychart
# 输出: Successfully packaged chart and saved it to: /path/to/myapp-1.4.2.tgz
# 推送 Chart 到 OCI Registry
helm push myapp-1.4.2.tgz oci://registry.myorg.com/charts
# 从 OCI Registry 拉取 Chart
helm pull oci://registry.myorg.com/charts/myapp --version 1.4.2
# 查看远程 Chart 信息
helm show chart oci://registry.myorg.com/charts/myapp --version 1.4.2
helm show values oci://registry.myorg.com/charts/myapp --version 1.4.2
# 直接从 OCI Registry 安装
helm install myapp oci://registry.myorg.com/charts/myapp
--version 1.4.2
-f values-prod.yaml
-n production
3.2 完整 CI/CD 集成示例
# .gitlab-ci.yml — Helm Chart CI/CD 流水线
stages:
- lint
- package
- deploy-staging
- deploy-prod
variables:
CHART_DIR: "./charts/myapp"
REGISTRY: "registry.myorg.com"
CHART_REPO: "oci://${REGISTRY}/charts"
lint:
stage: lint
image: alpine/helm:3.17.0
script:
- helm lint ${CHART_DIR} --strict
- helm template test-release ${CHART_DIR} -f ${CHART_DIR}/values.yaml > /dev/null
- helm template test-release ${CHART_DIR} -f values-staging.yaml > /dev/null
- helm template test-release ${CHART_DIR} -f values-prod.yaml > /dev/null
rules:
- changes:
- "charts/**/*"
package-and-push:
stage: package
image: alpine/helm:3.17.0
script:
- helm dependency update ${CHART_DIR}
- helm package ${CHART_DIR}
- helm registry login ${REGISTRY} --username ${REGISTRY_USER} --password ${REGISTRY_PASSWORD}
- helm push myapp-*.tgz ${CHART_REPO}
rules:
- if: '$CI_COMMIT_BRANCH == "main"'
changes:
- "charts/**/*"
deploy-staging:
stage: deploy-staging
image: alpine/helm:3.17.0
script:
- helm registry login ${REGISTRY} --username ${REGISTRY_USER} --password ${REGISTRY_PASSWORD}
- helm diff upgrade myapp ${CHART_REPO}/myapp
--version ${CHART_VERSION}
-f values-staging.yaml
-n staging || true
- helm upgrade --install myapp ${CHART_REPO}/myapp
--version ${CHART_VERSION}
-f values-staging.yaml
-n staging
--create-namespace
--atomic
--timeout 10m
- helm test myapp -n staging
environment:
name: staging
rules:
- if: '$CI_COMMIT_BRANCH == "main"'
deploy-prod:
stage: deploy-prod
image: alpine/helm:3.17.0
script:
- helm registry login ${REGISTRY} --username ${REGISTRY_USER} --password ${REGISTRY_PASSWORD}
- helm diff upgrade myapp ${CHART_REPO}/myapp
--version ${CHART_VERSION}
-f values-prod.yaml
-n production
- helm upgrade --install myapp ${CHART_REPO}/myapp
--version ${CHART_VERSION}
-f values-prod.yaml
-n production
--atomic
--timeout 15m
--set secrets.dbPassword="${DB_PASSWORD_PROD}"
- helm test myapp -n production
environment:
name: production
rules:
- if: '$CI_COMMIT_BRANCH == "main"'
when: manual
3.3 Helmfile 多环境批量编排
# helmfile.yaml
---
repositories: [] # OCI registry 不需要 repositories 声明
environments:
dev:
values:
- environments/dev/defaults.yaml
staging:
values:
- environments/staging/defaults.yaml
production:
values:
- environments/production/defaults.yaml
secrets:
- environments/production/secrets.yaml # sops 加密
---
helmDefaults:
wait: true
timeout: 600
atomic: true
createNamespace: true
releases:
- name: myapp
namespace: {{ .Environment.Name }}
chart: oci://registry.myorg.com/charts/myapp
version: 1.4.2
values:
- values-{{ .Environment.Name }}.yaml
- image:
tag: {{ requiredEnv "APP_VERSION" }}
hooks:
- events: ["presync"]
showlogs: true
command: "kubectl"
args:
- "get"
- "ns"
- "{{ .Environment.Name }}"
- name: postgresql
namespace: {{ .Environment.Name }}
chart: oci://registry.myorg.com/charts/postgresql
version: 15.5.38
condition: postgresql.enabled
values:
- postgresql-values-{{ .Environment.Name }}.yaml
- name: redis
namespace: {{ .Environment.Name }}
chart: oci://registry.myorg.com/charts/redis
version: 19.6.4
condition: redis.enabled
values:
- redis-values-{{ .Environment.Name }}.yaml
- name: prometheus-stack
namespace: monitoring
chart: oci://registry.myorg.com/charts/kube-prometheus-stack
version: 65.8.1
values:
- monitoring-values.yaml
# Helmfile 操作命令
# 查看 diff
helmfile -e production diff
# 部署到指定环境
helmfile -e production apply
# 只部署特定 release
helmfile -e production -l name=myapp apply
# 销毁环境
helmfile -e staging destroy
3.4 Library Chart 设计
Library Chart 不直接生成 Kubernetes 资源,而是提供可复用的模板片段给其他 Chart 引用。
# common-library/Chart.yaml
apiVersion: v2
name: common-library
version: 2.1.0
type: library   # 关键:声明为 library 类型
description: Common templates for all application charts
# common-library/templates/_deployment.tpl
{{- define "common.deployment" -}}
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "common.fullname" . }}
labels:
{{- include "common.labels" . | nindent 4 }}
spec:
{{- if not .Values.autoscaling.enabled }}
replicas: {{ .Values.replicaCount | default 1 }}
{{- end }}
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 25%
maxUnavailable: 0
selector:
matchLabels:
{{- include "common.selectorLabels" . | nindent 6 }}
template:
metadata:
annotations:
{{- if .Values.configHash }}
checksum/config: {{ .Values.configHash }}
{{- end }}
labels:
{{- include "common.selectorLabels" . | nindent 8 }}
spec:
serviceAccountName: {{ include "common.serviceAccountName" . }}
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
containers:
- name: {{ .Chart.Name }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy | default "IfNotPresent" }}
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
{{- with .Values.ports }}
ports:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- with .Values.resources }}
resources:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- with .Values.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- end -}}
在应用 Chart 中引用 Library Chart:
# myapp/Chart.yaml
dependencies:
  - name: common-library
    version: "2.x.x"
    repository: "oci://registry.myorg.com/charts"
# myapp/templates/deployment.yaml
{{- include "common.deployment" . }}
这样组织内所有微服务 Chart 只需维护 values 差异,Deployment 的基础结构由 Library Chart 统一管理。
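作为参考,引用 common.deployment 的应用 Chart 需要在自己的 values.yaml 中提供模板引用到的字段。下面是一个假设性的最小片段(字段名以 3.4 节的 library 模板为准,仓库地址为演示值):

```yaml
# myapp/values.yaml — common.deployment 依赖的最小字段集
replicaCount: 2
image:
  repository: registry.myorg.com/myapp-api
  tag: "1.0.0"
  pullPolicy: IfNotPresent
autoscaling:
  enabled: false
ports:
  - name: http
    containerPort: 8080
resources:
  requests:
    cpu: 100m
    memory: 128Mi
```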
3.5 Chart 测试模板
# templates/tests/test-connection.yaml
apiVersion: v1
kind: Pod
metadata:
name: "{{ include "myapp.fullname" . }}-test-connection"
labels:
{{- include "myapp.labels" . | nindent 4 }}
annotations:
"helm.sh/hook": test
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
restartPolicy: Never
containers:
- name: wget
image: busybox:1.37
command: ['wget']
args: ['{{ include "myapp.fullname" . }}:{{ .Values.service.port }}/healthz', '-q', '-O-', '-T', '10']
- name: curl-api
image: curlimages/curl:8.10.1
command: ['curl']
args: ['-sf', '--max-time', '10', 'http://{{ include "myapp.fullname" . }}:{{ .Values.service.port }}/api/v1/status']
# 运行测试
helm test myapp -n production --timeout 5m
# 查看测试 Pod 日志
kubectl logs myapp-test-connection -n production -c wget
kubectl logs myapp-test-connection -n production -c curl-api
3.6 自动化诊断脚本
#!/usr/bin/env bash
set -euo pipefail
# 文件名:helm-release-diagnose.sh
# 作用:对指定 Helm Release 进行全面健康检查,输出诊断报告
# 适用场景:Release 部署后状态异常、Pod 未就绪、Service 无端点时使用
# 使用方法:./helm-release-diagnose.sh <release> <namespace>
# 输入参数:$1=Release名称 $2=命名空间
# 输出结果:终端输出诊断报告,包含 Release 状态、Pod 状态、事件、日志摘要
# 风险提示:只读操作,不修改任何资源。需要 helm 和 kubectl 权限
RELEASE="${1:?Usage: $0 <release> <namespace>}"
NAMESPACE="${2:?Usage: $0 <release> <namespace>}"
echo "=========================================="
echo "Helm Release 诊断报告"
echo "Release:   ${RELEASE}"
echo "Namespace: ${NAMESPACE}"
echo "Time:      $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "=========================================="
echo ""
echo "--- 1. Release 状态 ---"
helm status "${RELEASE}" -n "${NAMESPACE}" 2>&1 || echo "ERROR: Release 不存在或无法访问"
echo ""
echo "--- 2. Release 历史 ---"
helm history "${RELEASE}" -n "${NAMESPACE}" --max 5 2>&1 || echo "ERROR: 无法获取历史"
echo ""
echo "--- 3. 当前生效 Values(非默认值) ---"
helm get values "${RELEASE}" -n "${NAMESPACE}" 2>&1 || echo "ERROR: 无法获取 values"
echo ""
echo "--- 4. Release 关联的 Pod ---"
PODS=$(kubectl get pods -n "${NAMESPACE}" -l "app.kubernetes.io/instance=${RELEASE}" -o wide 2>&1)
echo "${PODS}"
# 检查是否有异常 Pod
echo ""
echo "--- 5. 异常 Pod 详情 ---"
ABNORMAL_PODS=$(kubectl get pods -n "${NAMESPACE}" -l "app.kubernetes.io/instance=${RELEASE}" \
  --field-selector='status.phase!=Running,status.phase!=Succeeded' -o name 2>/dev/null || true)
if [[ -z "${ABNORMAL_PODS}" ]]; then
  echo "所有 Pod 状态正常"
else
  for pod in ${ABNORMAL_PODS}; do
    echo "--- ${pod} ---"
    kubectl describe "${pod}" -n "${NAMESPACE}" | tail -30
    echo ""
    kubectl logs "${pod}" -n "${NAMESPACE}" --tail=20 2>/dev/null || echo "无法获取日志"
    echo ""
  done
fi
echo ""
echo "--- 6. 近期事件(Warning) ---"
kubectl get events -n "${NAMESPACE}" --field-selector="type=Warning" --sort-by='.lastTimestamp' 2>&1 | tail -20
echo ""
echo "--- 7. Service 端点 ---"
kubectl get endpoints -n "${NAMESPACE}" -l "app.kubernetes.io/instance=${RELEASE}" 2>&1
echo ""
echo "--- 8. Ingress 状态 ---"
kubectl get ingress -n "${NAMESPACE}" -l "app.kubernetes.io/instance=${RELEASE}" 2>&1 || echo "无 Ingress"
echo ""
echo "=========================================="
echo "诊断完成"
echo "=========================================="
#!/usr/bin/env bash
set -euo pipefail
# 文件名:helm-chart-ci-validate.sh
# 作用:在 CI 流水线中对 Chart 执行完整的静态验证
# 适用场景:MR/PR 提交时自动触发,防止有语法错误或渲染异常的 Chart 合入主分支
# 使用方法:./helm-chart-ci-validate.sh <chart-dir> [values-file1] [values-file2] ...
# 输入参数:$1=Chart目录 $2+=可选的 values 文件列表
# 输出结果:通过则退出码 0,失败则退出码 1 并输出错误详情
# 风险提示:只读操作。需要 helm 3.17+
CHART_DIR="${1:?Usage: $0 <chart-dir> [values-files...]}"
shift
VALUES_FILES=("$@")
ERRORS=0
echo "=== Helm Chart CI Validation ==="
echo "Chart: ${CHART_DIR}"
echo "Values files: ${VALUES_FILES[*]:-none}"
echo ""
# Step 1: lint
echo "--- Step 1: helm lint (strict) ---"
if ! helm lint "${CHART_DIR}" --strict; then
  echo "FAIL: helm lint failed"
  ERRORS=$((ERRORS + 1))   # 注意:不用 ((ERRORS++)),否则首次自增在 set -e 下会因返回非零而中断脚本
fi
# Step 2: dependency check
echo ""
echo "--- Step 2: dependency update ---"
if ! helm dependency update "${CHART_DIR}"; then
  echo "FAIL: dependency update failed"
  ERRORS=$((ERRORS + 1))
fi
# Step 3: template render with default values
echo ""
echo "--- Step 3: template render (default values) ---"
if ! helm template ci-test "${CHART_DIR}" > /dev/null; then
  echo "FAIL: template render with default values failed"
  ERRORS=$((ERRORS + 1))
fi
# Step 4: template render with each values file
for vf in "${VALUES_FILES[@]}"; do
  echo ""
  echo "--- Step 4: template render with ${vf} ---"
  if ! helm template ci-test "${CHART_DIR}" -f "${vf}" > /dev/null; then
    echo "FAIL: template render with ${vf} failed"
    ERRORS=$((ERRORS + 1))
  fi
done
# Step 5: check for deprecated API versions
echo ""
echo "--- Step 5: deprecated API check ---"
RENDERED=$(helm template ci-test "${CHART_DIR}" 2>/dev/null || true)
DEPRECATED_APIS=("extensions/v1beta1" "apps/v1beta1" "apps/v1beta2" "networking.k8s.io/v1beta1")
for api in "${DEPRECATED_APIS[@]}"; do
  if echo "${RENDERED}" | grep -q "apiVersion: ${api}"; then
    echo "WARN: Deprecated API found: ${api}"
  fi
done
echo ""
if [[ ${ERRORS} -gt 0 ]]; then
  echo "RESULT: FAILED (${ERRORS} errors)"
  exit 1
else
  echo "RESULT: PASSED"
  exit 0
fi
#!/usr/bin/env bash
set -euo pipefail
# 文件名:helm-bulk-upgrade.sh
# 作用:批量升级多个 Helm Release,支持 dry-run 和回滚
# 适用场景:基础组件(如 common-library)升级后需要批量更新所有引用该 library 的应用 Chart
# 使用方法:./helm-bulk-upgrade.sh --env production --chart-version 1.4.2 [--dry-run]
# 输入参数:--env=目标环境 --chart-version=Chart版本 --dry-run=仅模拟不执行
# 输出结果:每个 Release 的升级结果日志
# 风险提示:生产环境务必先 --dry-run。批量操作前确认变更窗口。升级顺序:基础设施 → 中间件 → 应用
ENV=""
CHART_VERSION=""
DRY_RUN=false
while [[ $# -gt 0 ]]; do
case $1 in
--env) ENV="$2"; shift 2 ;;
--chart-version) CHART_VERSION="$2"; shift 2 ;;
--dry-run) DRY_RUN=true; shift ;;
*) echo "Unknown option: $1"; exit 1 ;;
esac
done
[[ -z "${ENV}" ]] && { echo "Error: --env is required"; exit 1; }
[[ -z "${CHART_VERSION}" ]] && { echo "Error: --chart-version is required"; exit 1; }
REGISTRY="oci://registry.myorg.com/charts"
RELEASES=("myapp-api" "myapp-web" "myapp-worker" "myapp-scheduler")
NAMESPACE="${ENV}"
echo "Bulk upgrade started"
echo "Environment: ${ENV}"
echo "Chart version: ${CHART_VERSION}"
echo "Dry run: ${DRY_RUN}"
echo ""
FAILED=()
SUCCEEDED=()
for release in "${RELEASES[@]}"; do
echo "--- Upgrading ${release} ---"
VALUES_FILE="values/${release}-${ENV}.yaml"
if [[ ! -f "${VALUES_FILE}" ]]; then
echo "SKIP: ${VALUES_FILE} not found"
continue
fi
CMD="helm upgrade --install ${release} ${REGISTRY}/myapp \
  --version ${CHART_VERSION} \
  -f ${VALUES_FILE} \
  -n ${NAMESPACE} \
  --atomic \
  --timeout 10m"
if [[ "${DRY_RUN}" == "true" ]]; then
CMD="${CMD} --dry-run"
fi
if eval "${CMD}"; then
SUCCEEDED+=("${release}")
echo "OK: ${release} upgraded successfully"
else
FAILED+=("${release}")
echo "FAIL: ${release} upgrade failed"
if [[ "${DRY_RUN}" == "false" ]]; then
echo "Note: --atomic flag should have triggered auto-rollback"
fi
fi
echo ""
done
echo "=========================================="
echo "Summary"
echo "Succeeded: ${SUCCEEDED[*]:-none}"
echo "Failed: ${FAILED[*]:-none}"
echo "=========================================="
[[ ${#FAILED[@]} -eq 0 ]] || exit 1
#!/usr/bin/env bash
set -euo pipefail
# 文件名:helm-values-diff.sh
# 作用:对比两个环境的 values 渲染差异,输出人类可读的 diff 报告
# 适用场景:新环境上线前确认与已有环境的配置差异,防止遗漏关键配置
# 使用方法:./helm-values-diff.sh <chart-dir> <values-a> <values-b>
# 输入参数:$1=Chart目录 $2=环境A的values文件 $3=环境B的values文件
# 输出结果:两个环境渲染后的资源 diff(类似 kubectl diff 格式)
# 风险提示:只读操作。大型 Chart 渲染可能消耗较多内存
CHART_DIR="${1:?Usage: $0 <chart-dir> <values-a> <values-b>}"
VALUES_A="${2:?Usage: $0 <chart-dir> <values-a> <values-b>}"
VALUES_B="${3:?Usage: $0 <chart-dir> <values-a> <values-b>}"
TMPDIR=$(mktemp -d)
trap 'rm -rf "${TMPDIR}"' EXIT
echo "Rendering ${VALUES_A}..."
helm template compare "${CHART_DIR}" -f "${VALUES_A}" > "${TMPDIR}/a.yaml"
echo "Rendering ${VALUES_B}..."
helm template compare "${CHART_DIR}" -f "${VALUES_B}" > "${TMPDIR}/b.yaml"
echo ""
echo "=== Diff: ${VALUES_A} vs ${VALUES_B} ==="
diff -u "${TMPDIR}/a.yaml" "${TMPDIR}/b.yaml" --label "${VALUES_A}" --label "${VALUES_B}" || true
echo ""
echo "=== End of diff ==="
四、实际应用案例
4.1 案例一:helm upgrade 后 Pod 未滚动更新
现场现象:执行 helm upgrade myapp ./mychart -f values-prod.yaml -n production 返回成功,但 Pod 仍在运行旧版本镜像。helm history 显示新 Revision 为 deployed 状态。
第一轮判断:
# 查看当前渲染的 manifest 中的镜像
helm get manifest myapp -n production | grep "image:"
# 输出:image: "registry.myorg.com/myapp:3.8.1" — 镜像 tag 确实已更新
# 查看 Deployment 是否有变更
kubectl get deployment myapp -n production -o jsonpath='{.spec.template.spec.containers[0].image}'
# 输出:registry.myorg.com/myapp:3.8.1 — Deployment spec 已更新
# 查看 Pod 的镜像
kubectl get pods -n production -l app.kubernetes.io/instance=myapp -o jsonpath='{.items[*].spec.containers[0].image}'
# 输出:registry.myorg.com/myapp:3.7.0 registry.myorg.com/myapp:3.7.0 — Pod 仍是旧版本
第二轮下钻:
# 查看 Deployment 的 rollout 状态
kubectl rollout status deployment/myapp -n production
# 输出:Waiting for deployment "myapp" rollout to finish: 0 of 3 updated replicas are available...

# 查看 ReplicaSet
kubectl get rs -n production -l app.kubernetes.io/instance=myapp
# NAME              DESIRED   CURRENT   READY   AGE
# myapp-6d4f8b7c9   3         3         3       10d   ← 旧 RS
# myapp-7a5e9c3d1   3         3         0       2m    ← 新 RS,Pod 未 Ready

# 查看新 RS 的 Pod 为什么未 Ready
kubectl describe pod myapp-7a5e9c3d1-xxxxx -n production
# Events:
#   Warning  Failed  2m  kubelet  Failed to pull image "registry.myorg.com/myapp:3.8.1":
#            rpc error: code = NotFound desc = failed to pull and unpack image: not found
关键证据:镜像 3.8.1 在 Registry 中不存在。CI/CD 流水线在镜像构建完成前就触发了 Helm upgrade。
根因:流水线中 build 和 deploy stage 的依赖关系配置错误,deploy 未等待 build 完成。
修复动作:
先推送正确的镜像到 Registry
新 Pod 会自动拉取成功并变为 Ready
修复 CI/CD 流水线中 stage 之间的依赖关系
修复后验证:
kubectl rollout status deployment/myapp -n production --timeout=5m
# deployment "myapp" successfully rolled out
kubectl get pods -n production -l app.kubernetes.io/instance=myapp -o jsonpath='{.items[*].spec.containers[0].image}'
# registry.myorg.com/myapp:3.8.1 registry.myorg.com/myapp:3.8.1 registry.myorg.com/myapp:3.8.1
防再发建议:
CI/CD 中在 deploy 之前加镜像存在性检查:docker manifest inspect registry.myorg.com/myapp:3.8.1
Helm upgrade 使用 --atomic,镜像拉取失败会自动回滚
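第一条建议中的镜像存在性检查,可以封装成流水线里的一个小函数。以下是一个示意草图:check_image 是假设的函数名,镜像引用沿用文中示例;第二个参数允许注入替代的探测命令,便于在没有 docker 的环境中演练。

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_image:部署前校验镜像 tag 是否已存在于 Registry
# 参数1:镜像引用
# 参数2(可选):探测命令,默认 docker manifest inspect;
#              该参数仅为便于测试的示意设计,实际流水线可直接写死命令
check_image() {
  local image="$1"
  local inspect_cmd="${2:-docker manifest inspect}"
  if ${inspect_cmd} "${image}" >/dev/null 2>&1; then
    echo "image-exists: ${image}"
    return 0
  fi
  echo "image-missing: ${image}" >&2
  return 1
}

# 流水线中的典型用法(registry.myorg.com/myapp 沿用文中示例):
# check_image "registry.myorg.com/myapp:3.8.1" || exit 1
```

把这个检查放在 deploy stage 的第一步,build 未完成时部署会立即失败,而不是等到 Pod 拉取镜像时才暴露。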
4.2 案例二:pre-upgrade Hook Job 卡住导致 upgrade 超时
现场现象:helm upgrade myapp ./mychart -f values-prod.yaml -n production --timeout 10m --atomic 执行 10 分钟后超时,Release 状态变为 failed,然后自动回滚。但数据库迁移 Job 一直卡在 Pending。
第一轮判断:
helm history myapp -n production
# REVISION  STATUS    DESCRIPTION
# 5         deployed  Upgrade complete
# 6         failed    pre-upgrade hook "myapp-db-migrate" timed out

kubectl get jobs -n production | grep migrate
# myapp-db-migrate   0/1   10m   10m
第二轮下钻:
kubectl describe job myapp-db-migrate -n production
# Events:
#   Warning  FailedCreate  10m  job-controller  Error creating: pods "myapp-db-migrate-xxxxx"
#            is forbidden: exceeded quota: compute-quota, requested: cpu=2, used: cpu=14, limited: cpu=16

kubectl get resourcequota -n production
# NAME            AGE   REQUEST                    LIMIT
# compute-quota   30d   requests.cpu: 14/16, ...   limits.cpu: 28/32, ...
关键证据:数据库迁移 Job 请求 2 CPU,而命名空间 ResourceQuota 的 requests.cpu 上限为 16;升级期间新旧 Pod 并存,加上其他 Pending Pod 的请求,Job 创建时 used + requested 已超过上限,被配额控制器拒绝。
根因:迁移 Job 的资源请求过高,且在 upgrade 期间新旧 Pod 同时存在导致 CPU 请求总量超限。
修复动作:
# 降低迁移 Job 的资源请求
# templates/hooks/db-migration.yaml
spec:
  template:
    spec:
      containers:
        - name: migrate
          resources:
            requests:
              cpu: 200m        # 从 2 降到 200m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
或者调整升级策略,确保新旧 Pod 不会同时大量并存:
# Deployment 使用 Recreate 策略或更保守的 RollingUpdate
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0         # 不额外创建新 Pod
      maxUnavailable: 1   # 一次只替换一个
修复后验证:
helm upgrade myapp ./mychart -f values-prod.yaml -n production --atomic --timeout 10m
# Release "myapp" has been upgraded. Happy Helming!

kubectl get jobs -n production | grep migrate
# myapp-db-migrate   1/1   45s   1m
防再发建议:
Hook Job 的资源请求应与 ResourceQuota 预留空间匹配
在 CI 中加入 ResourceQuota 余量检查
考虑将数据库迁移从 Helm Hook 移到独立的 Job,在 upgrade 之前手动执行
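第二条建议中的 ResourceQuota 余量检查,可以用一个小函数草拟如下。quota_headroom 是假设的函数名;kubectl 查询部分仅为示意,compute-quota、production 沿用文中示例。

```shell
#!/usr/bin/env bash
set -euo pipefail

# quota_headroom:输入 "used/hard" 形式的配额值,输出剩余额度
# 注意:仅处理整数核数;带 m 后缀的毫核需先归一化为同一单位
quota_headroom() {
  local used="${1%%/*}"
  local hard="${1##*/}"
  echo $(( hard - used ))
}

# CI 中的典型用法(示意):
# usage=$(kubectl get resourcequota compute-quota -n production \
#   -o jsonpath='{.status.used.requests\.cpu}/{.status.hard.requests\.cpu}')
# [ "$(quota_headroom "${usage}")" -ge 2 ] || { echo "CPU 配额余量不足,中止发布"; exit 1; }
```

把余量阈值设为 Hook Job 请求量加上滚动更新期间的 maxSurge 增量,能更贴近真实的峰值需求。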
4.3 案例三:Helmfile 多 Release 部署顺序导致依赖未就绪
现场现象:使用 Helmfile 部署三个组件(postgresql、redis、myapp),myapp 部署后 Pod 反复 CrashLoopBackOff,日志显示 "connection refused" 无法连接 PostgreSQL。
第一轮判断:
helmfile -e production status
# NAME        NAMESPACE    REVISION  STATUS
# postgresql  production   1         deployed
# redis       production   1         deployed
# myapp       production   1         deployed

kubectl get pods -n production
# postgresql-0     1/1   Running            0   2m
# redis-master-0   1/1   Running            0   2m
# myapp-xxx        0/1   CrashLoopBackOff   3   2m

kubectl logs myapp-xxx -n production
# level=fatal msg="failed to connect to database" error="dial tcp 10.96.45.123:5432: connect: connection refused"
第二轮下钻:
# PostgreSQL Pod 虽然 Running,但实际还在初始化
kubectl logs postgresql-0 -n production | tail -5
# 2026-03-01 09:30 UTC [1] LOG:  database system is ready to accept connections
# 上面这行出现在 myapp 启动之后

# 查看时间线
kubectl get events -n production --sort-by='.firstTimestamp' | grep -E "(postgresql|myapp)"
# 09:00  postgresql-0  Created
# 09:05  myapp-xxx     Created   ← myapp 在 postgresql 初始化完成前就启动了
# 09:20  myapp-xxx     BackOff
# 09:30  postgresql-0  Ready
关键证据:Helmfile 默认并发安装所有 Release,myapp 在 PostgreSQL 完成初始化之前就开始连接数据库。
根因:Helmfile 缺少 Release 之间的依赖和顺序声明。
修复动作:
# helmfile.yaml — 添加 needs 声明
releases:
  - name: postgresql
    namespace: production
    chart: oci://registry.myorg.com/charts/postgresql
    version: 15.5.38
    values:
      - postgresql-values-production.yaml
  - name: redis
    namespace: production
    chart: oci://registry.myorg.com/charts/redis
    version: 19.6.4
    values:
      - redis-values-production.yaml
  - name: myapp
    namespace: production
    chart: oci://registry.myorg.com/charts/myapp
    version: 1.4.2
    needs:                      # 声明依赖:myapp 在两者 deployed 之后才安装
      - production/postgresql
      - production/redis
    values:
      - values-production.yaml
同时在 myapp 的 Chart 中增加 initContainer 做连接等待:
# values-production.yaml
initContainers:
  - name: wait-for-db
    image: busybox:1.37
    command: ['sh', '-c', 'until nc -z postgresql 5432; do echo waiting for db; sleep 2; done']
修复后验证:
helmfile -e production apply
# postgresql deployed first, then redis, then myapp

kubectl get pods -n production
# postgresql-0     1/1   Running   0   3m
# redis-master-0   1/1   Running   0   2m
# myapp-xxx        1/1   Running   0   1m
防再发建议:
始终在 Helmfile 中用 needs 声明 Release 之间的启动依赖
应用端使用 initContainer 或连接重试机制,不依赖部署顺序保证可用性
设置合理的 readinessProbe,让 K8s 在应用真正就绪后才接入流量
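第三条建议中的 readinessProbe,在 Deployment 模板里大致形如下面的片段(路径 /healthz 与端口 8080 是示意值,按应用实际的健康检查接口替换):

```yaml
# deployment 模板片段(示意):探针通过前,Service 不会把流量转发到该 Pod
readinessProbe:
  httpGet:
    path: /healthz        # 假设应用暴露的健康检查路径
    port: 8080            # 假设的应用端口
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

配合 initContainer 使用时,readinessProbe 负责"就绪后才接流量",initContainer 负责"依赖可达后才启动",两者解决的是不同阶段的问题。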
4.4 案例四:OCI Registry 推送 Chart 时 401 Unauthorized
现场现象:CI/CD 流水线中执行 helm push myapp-1.4.2.tgz oci://registry.myorg.com/charts 报错 Error: unexpected status from HEAD request to https://registry.myorg.com/v2/charts/myapp/blobs/sha256 401 Unauthorized。
第一轮判断:
# 测试 registry 登录
helm registry login registry.myorg.com --username ci-bot --password "${REGISTRY_PASSWORD}"
# Login Succeeded — 登录没问题
# 手动推送
helm push myapp-1.4.2.tgz oci://registry.myorg.com/charts
# Error: unexpected status from HEAD request... 401 Unauthorized
第二轮下钻:
# 检查 OCI 兼容性
curl -u "ci-bot:${REGISTRY_PASSWORD}" https://registry.myorg.com/v2/
# {"errors":[{"code":"UNAUTHORIZED","message":"authentication required"}]}
# 检查是否需要 Bearer token
curl -v https://registry.myorg.com/v2/ 2>&1 | grep -i www-authenticate
# Www-Authenticate: Bearer realm="https://auth.myorg.com/service/token",service="registry.myorg.com"
# 检查 helm 的认证配置
cat ~/.config/helm/registry/config.json
# 发现 auth 字段存在但对应的 credsStore 配置指向了 docker-credential-desktop
关键证据:Helm 的 registry 认证走的是 Docker credential store,在 CI 环境中 docker-credential-desktop 不存在,导致虽然 helm registry login 显示成功(写入了 config.json),但实际推送时 credential helper 找不到凭证。
根因:CI 环境缺少 Docker credential helper,Helm 的 registry 认证依赖 Docker 的凭证存储体系。
修复动作:
# 方案一:在 CI 中禁用 credential helper,使用纯文件认证
export DOCKER_CONFIG="${HOME}/.docker-helm"
mkdir -p "${DOCKER_CONFIG}"
# 直接写入认证信息
echo '{"auths":{"registry.myorg.com":{"auth":"'$(echo -n "ci-bot:${REGISTRY_PASSWORD}" | base64)'"}}}' \
  > "${DOCKER_CONFIG}/config.json"
# 重新执行 helm push
helm push myapp-1.4.2.tgz oci://registry.myorg.com/charts
# Pushed: registry.myorg.com/charts/myapp:1.4.2
# 方案二:使用 HELM_REGISTRY_CONFIG 环境变量
export HELM_REGISTRY_CONFIG="/tmp/helm-registry-config.json"
helm registry login registry.myorg.com --username ci-bot --password "${REGISTRY_PASSWORD}"
helm push myapp-1.4.2.tgz oci://registry.myorg.com/charts
修复后验证:
# 确认推送成功
helm pull oci://registry.myorg.com/charts/myapp --version 1.4.2
# Pulled: registry.myorg.com/charts/myapp:1.4.2

helm show chart oci://registry.myorg.com/charts/myapp --version 1.4.2
# apiVersion: v2
# name: myapp
# version: 1.4.2
防再发建议:
CI 环境中统一使用 HELM_REGISTRY_CONFIG 或 DOCKER_CONFIG 环境变量指定认证配置文件
不依赖系统级的 credential helper
在流水线开始时加 helm registry login 的连通性验证步骤
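前两条建议中的"纯文件认证"可以封装成一个小函数,在流水线起始阶段生成 config.json。以下是一个示意草图:make_docker_auth 是假设的函数名,registry 与账号沿用文中示例。

```shell
#!/usr/bin/env bash
set -euo pipefail

# make_docker_auth:生成绕过 credential helper 的纯文件 registry 认证 JSON
# 输出内容可直接写入 ${DOCKER_CONFIG}/config.json
make_docker_auth() {
  local registry="$1" user="$2" pass="$3"
  local token
  # auth 字段是 "user:pass" 的 base64,tr 去掉部分 base64 实现自动插入的换行
  token=$(printf '%s:%s' "${user}" "${pass}" | base64 | tr -d '\n')
  printf '{"auths":{"%s":{"auth":"%s"}}}\n' "${registry}" "${token}"
}

# 流水线用法(示意):
# export DOCKER_CONFIG="${HOME}/.docker-helm"
# mkdir -p "${DOCKER_CONFIG}"
# make_docker_auth registry.myorg.com ci-bot "${REGISTRY_PASSWORD}" > "${DOCKER_CONFIG}/config.json"
```

用 printf 而不是 echo -n 拼接凭证,可以避免某些 shell 中 echo 对转义字符的差异化处理。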
五、最佳实践和注意事项
5.1 Chart 设计最佳实践
5.1.1 版本管理
Chart version 遵循 SemVer:Breaking change 升主版本,新功能升次版本,Bug 修复升补丁版本
appVersion 与应用镜像 tag 保持同步
使用 Git tag 触发 Chart 打包和推送
Chart.lock 提交到版本库,保证依赖版本可复现
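上述版本约定落到 Chart.yaml 中大致如下(myapp、版本号与依赖仓库均为示意值):

```yaml
# Chart.yaml(示意)
apiVersion: v2
name: myapp
version: 1.4.2          # Chart 自身版本,遵循 SemVer,与应用版本独立演进
appVersion: "3.8.1"     # 与应用镜像 tag 同步
dependencies:
  - name: common
    version: 2.x.x      # 范围约束;实际解析结果锁定在 Chart.lock 中
    repository: oci://registry.myorg.com/charts
```

version 与 appVersion 分离的好处是:仅修改模板或默认值时只升 Chart version,镜像不变,回滚语义更清晰。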
5.1.2 Values 设计规范
所有可配置项都应有默认值,helm install 裸装必须能跑起来(至少在 dev 环境)
使用 required 函数标记生产环境必填项
敏感值不写入 values 文件,使用 External Secrets Operator 或 --set 注入
布尔开关命名统一:xxx.enabled
资源配置使用嵌套结构:resources.requests.cpu,不要拍平成 requestsCpu
# values.yaml 中用 required 标记必填项的模板用法
# templates/deployment.yaml
env:
  - name: DATABASE_URL
    value: {{ required "database.url is required for production" .Values.database.url }}
5.1.3 模板组织
公共逻辑提取到 _helpers.tpl
每个 Kubernetes 资源一个模板文件,命名与资源类型对应
复杂的条件渲染用命名模板封装,而不是在资源模板中嵌套大量 if-else
使用 {{- include ... | nindent N }} 而非 {{ template ... }}(template 不能管道传递)
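上面几条可以用一个最小的 _helpers.tpl 片段串起来(myapp.labels 为示意的模板名,标签集合按项目约定调整):

```yaml
# templates/_helpers.tpl(示意):把公共 label 封装成命名模板
{{- define "myapp.labels" -}}
app.kubernetes.io/name: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}

# 资源模板中通过 include + nindent 引用(template 无法接管道,include 可以):
# metadata:
#   labels: {{- include "myapp.labels" . | nindent 4 }}
```

nindent 的数值必须与引用处的嵌套深度一致,这正是 5.3.2 节中最常见的缩进错误来源。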
5.1.4 Umbrella Chart 架构
platform-chart/
├── Chart.yaml                  # type: application
├── values.yaml                 # 全局默认值
├── charts/
│   ├── api/                    # 子 Chart: API 服务
│   ├── web/                    # 子 Chart: Web 前端
│   ├── worker/                 # 子 Chart: 后台任务
│   └── common/                 # Library Chart
└── templates/
    ├── _helpers.tpl
    └── shared-configmap.yaml   # 跨子 Chart 共享的资源
Umbrella Chart 的 values.yaml 中通过子 Chart 名称作为 key 传递参数:
# platform-chart/values.yaml
global:
  imageRegistry: registry.myorg.com
  environment: production

api:
  replicaCount: 3
  image:
    repository: registry.myorg.com/myapp-api
    tag: "2.1.0"

web:
  replicaCount: 2
  image:
    repository: registry.myorg.com/myapp-web
    tag: "1.8.0"

worker:
  replicaCount: 5
  image:
    repository: registry.myorg.com/myapp-worker
    tag: "2.1.0"
5.2 安全加固
5.2.1 RBAC 最小权限
# 为 Helm 操作创建专用 ServiceAccount(CI/CD 场景)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: helm-deployer
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: helm-deployer
rules:
  - apiGroups: ["", "apps", "batch", "networking.k8s.io", "autoscaling"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]    # Helm 存储 Release 信息到 Secret
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: helm-deployer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: helm-deployer
subjects:
  - kind: ServiceAccount
    name: helm-deployer
    namespace: kube-system
5.2.2 Chart 安全扫描
# 使用 kubeaudit 扫描渲染后的 manifest
helm template myapp ./mychart -f values-prod.yaml | kubeaudit all -f -

# 使用 checkov 做合规扫描
helm template myapp ./mychart -f values-prod.yaml > rendered.yaml
checkov -f rendered.yaml --framework kubernetes

# 使用 trivy 扫描 Chart 中的配置问题
trivy config ./mychart
5.3 注意事项
5.3.1 CRD 管理陷阱
Helm 对 CRD 的处理有特殊限制:
CRD 放在 crds/ 目录下,仅在首次 helm install 时安装
helm upgrade 不会更新 CRD
helm uninstall 不会删除 CRD
因此 CRD 的生命周期管理需要额外处理:
# 手动更新 CRD
kubectl apply -f mychart/crds/

# 或者将 CRD 从 crds/ 移到 templates/ 中,但需要加注解防止意外删除
# templates/crd.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: myresources.myorg.com
  annotations:
    "helm.sh/resource-policy": keep   # uninstall 时保留
5.3.2 常见配置错误
| 错误模式 | 表现 | 修正方法 |
|---|---|---|
| nindent 值错误 | YAML 渲染后缩进不对,K8s API 拒绝 | 用 helm template 检查输出格式 |
| with 块内访问顶级对象 | {{ .Values.xxx }} 变成空 | 使用 {{ $.Values.xxx }} |
| range 块内的 . | . 指向当前迭代元素而非根 | 使用 $ 引用根上下文 |
| 数字值未加 quote | YAML 解析为科学计数法 | 用 {{ .Values.xxx \| quote }} |
| list 值用 --set 覆盖 | 整个 list 被替换而非追加 | 使用 --set-json 或 -f 文件 |
| Release 历史膨胀 | etcd 存储压力增大 | 设置 --history-max 10 |
5.3.3 Release 存储与 etcd 压力
Helm v3 将 Release 信息存储在 Kubernetes Secret 中(Base64 编码 + gzip 压缩)。每次 upgrade/rollback 都会创建新的 Secret。对于频繁发布的应用,需要限制历史版本数:
# 全局设置最大历史版本
helm upgrade myapp ./mychart -f values-prod.yaml -n production --history-max 10

# 清理旧 Release Secret
kubectl get secrets -n production -l owner=helm -l name=myapp --sort-by=.metadata.creationTimestamp
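Release Secret 的 data.release 字段是"gzip 压缩后再 base64"的结果(K8s Secret 取出时还要先解一层 Secret 自身的 base64)。这个编码链可以在本地用纯 shell 验证,下面是一个示意草图,payload 为假设的简化内容:

```shell
#!/usr/bin/env bash
set -euo pipefail

# 模拟 Helm 的存储编码:gzip 压缩 → base64
payload='{"name":"myapp","version":6,"status":"deployed"}'
encoded=$(printf '%s' "${payload}" | gzip -c | base64 | tr -d '\n')

# 解码还原。对应的真实排查命令大致为(示意):
#   kubectl get secret sh.helm.release.v1.myapp.v6 -o jsonpath='{.data.release}' \
#     | base64 -d | base64 -d | gzip -d
decoded=$(printf '%s' "${encoded}" | base64 -d | gzip -d)
echo "${decoded}"
```

理解这个编码链后,即使 helm 客户端不可用,也能直接从 Secret 还原出 Release 的渲染内容做对比。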
六、故障排查和监控
6.1 故障排查决策树
helm install/upgrade 失败
├── 报错 "YAML parse error" / "template render error"
│   ├── 检查 helm template 输出
│   ├── 定位到具体模板文件和行号
│   └── 常见:nindent 错误、括号未闭合、函数名拼写
├── 报错 "cannot re-use a name" / "has no deployed releases"
│   ├── helm list --all -n <namespace> 查看现有 Release
│   ├── 如果是 failed 状态:helm uninstall --no-hooks
│   └── 如果是 pending-* 状态:等待或强制清理
├── 报错 "timed out waiting for the condition"
│   ├── 检查 Pod 状态:kubectl get pods
│   ├── Hook Job 卡住:kubectl describe job
│   ├── Pod 未 Ready:kubectl describe pod / kubectl logs
│   └── 资源配额不足:kubectl get resourcequota
├── 报错 "admission webhook denied"
│   ├── 检查 ValidatingWebhookConfiguration
│   ├── 查看 webhook 服务是否可用
│   └── 临时绕过:删除 webhook 或加 exclude annotation
├── 部署成功但功能异常
│   ├── helm get values 确认 values 生效
│   ├── helm get manifest 确认渲染结果
│   ├── kubectl describe / logs 查看运行时状态
│   └── Service/Endpoints/Ingress 链路检查
└── OCI Registry 相关
    ├── 401 Unauthorized:检查 helm registry login
    ├── not found:确认 Chart 名称和版本号
    └── timeout:检查网络和 DNS
6.2 Helm 操作监控
6.2.1 Release 状态监控
# Prometheus 告警规则:检测 Helm Release 异常状态
# 需要 helm-exporter(https://github.com/sstarter/helm-exporter)
groups:
  - name: helm-release-alerts
    rules:
      - alert: HelmReleaseFailed
        expr: helm_chart_info{status="failed"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Helm Release {{ $labels.release }} 处于 failed 状态"
          description: "Release {{ $labels.release }} in namespace {{ $labels.namespace }} has been in failed state for 5 minutes. Chart: {{ $labels.chart }}"
      - alert: HelmReleasePending
        expr: helm_chart_info{status=~"pending-.*"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Helm Release {{ $labels.release }} 处于 pending 状态"
          description: "Release {{ $labels.release }} in namespace {{ $labels.namespace }} has been pending for 15 minutes."
6.2.2 CI/CD 部署指标
在 CI/CD 流水线中记录 Helm 部署的关键指标:
# 部署耗时统计
START_TIME=$(date +%s)
helm upgrade --install myapp ./mychart -f values-prod.yaml -n production --atomic --timeout 10m
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

# 推送指标到 Prometheus Pushgateway(示意,Pushgateway 地址与指标名按环境调整)
cat <<EOF | curl --data-binary @- http://pushgateway.monitoring:9091/metrics/job/helm_deploy/instance/myapp
helm_deploy_duration_seconds ${DURATION}
helm_deploy_timestamp_seconds ${END_TIME}
EOF
6.3 日志排查路径
# Helm 自身调试日志
helm upgrade myapp ./mychart -f values-prod.yaml -n production --debug 2>&1 | tee helm-debug.log

# 从 debug 日志中提取关键信息
grep -E "(error|Error|FAIL|WARNING)" helm-debug.log

# 查看 Helm 发送给 K8s API 的请求
helm upgrade myapp ./mychart -f values-prod.yaml -n production --debug --dry-run 2>&1 | head -100
6.4 常见问题速查
| 问题 | 快速诊断 | 快速修复 |
|---|---|---|
| "release: already exists" | helm list --all -n <namespace> | helm uninstall <release> -n <namespace> |
| "UPGRADE FAILED: has no deployed releases" | helm list --pending -A | helm uninstall --no-hooks -n <namespace> 后重新 install |
| Secret 超过 1MB(etcd 限制) | helm get manifest \| wc -c | 拆分 Chart 或减少嵌入数据 |
| lookup 在 CI 中失败 | helm template 模式不支持 lookup | 改用 --dry-run=server 或移除 lookup |
| 子 Chart values 未生效 | helm get values 检查层级 | 确认 values 中用子 Chart 名称作为 key |
七、总结
7.1 技术要点回顾
Helm v3 移除 Tiller,直接使用 kubeconfig 鉴权;v3.17+ 将 OCI Registry 作为原生 Chart 存储后端
Chart 结构的核心三要素:Chart.yaml(元数据)、values.yaml(参数化)、templates/(模板渲染)
Go 模板语法中最常见的坑:with 块上下文切换、range 块中 . 的含义变化、nindent 缩进值
Values 分层覆盖遵循「默认值 → 父 Chart → -f 文件 → --set」的优先级链
Hook 的 before-hook-creation 删除策略是防止 Job 冲突的关键配置
--atomic 参数保证升级失败时自动回滚,是生产环境部署的必选项
Library Chart + Umbrella Chart 是企业级多微服务 Chart 管理的标准架构模式
Release 历史存储在 etcd 中,必须通过 --history-max 限制版本数量
7.2 排障链路总结
完整的 Helm 部署排障链路:
helm lint → helm template → helm diff → helm upgrade --dry-run → helm upgrade(实际执行)
  → 失败时:helm status → helm history → kubectl describe/logs
  → 定位到具体层级(Chart 语法 / Values 渲染 / K8s API / 运行时 / Hook)
  → 修复后:helm upgrade --atomic → helm test → 验证
7.3 进阶学习方向
Helm SDK 集成:在 Go 程序中直接调用 Helm SDK 实现自定义部署控制器
Chart Testing(ct)工具:在 CI 中自动检测 Chart 变更并运行集成测试
ArgoCD + Helm:GitOps 模式下的 Helm Chart 自动同步和漂移检测
Sigstore 签名验证:对 Chart 进行数字签名,在部署前验证完整性
7.4 参考资料
Helm 官方文档 — Chart 开发、最佳实践、命令参考
Helm GitHub — 源码、Issue 跟踪
Artifact Hub — 公共 Chart 搜索
Helmfile 文档 — 多 Release 编排
附录
A. Helm 命令速查表
# Chart 开发
helm create mychart                                # 创建 Chart 脚手架
helm lint ./mychart --strict                       # 静态检查
helm template myapp ./mychart -f values.yaml       # 本地渲染
helm package ./mychart                             # 打包为 .tgz
helm dependency update ./mychart                   # 更新依赖
helm dependency list ./mychart                     # 查看依赖状态
helm show chart ./mychart                          # 查看 Chart.yaml
helm show values ./mychart                         # 查看默认 values

# OCI Registry
helm registry login <registry>                     # 登录
helm push <chart>.tgz oci://<registry>/<path>      # 推送
helm pull oci://<registry>/<path>/<chart> --version X    # 拉取
helm show chart oci://<registry>/<path>/<chart>    # 查看远程 Chart

# Release 管理
helm install <release> <chart> -f <values> -n <namespace>              # 安装
helm upgrade --install <release> <chart> -f <values> -n <namespace>    # 升级(不存在则安装)
helm rollback <release> <revision> -n <namespace>                      # 回滚
helm uninstall <release> -n <namespace>                                # 卸载
helm list -A                                       # 列出所有 Release
helm status <release> -n <namespace>               # 查看状态
helm history <release> -n <namespace>              # 查看历史

# 调试
helm upgrade <release> <chart> --dry-run --debug   # 模拟升级
helm get values <release> -n <namespace>           # 查看当前 values
helm get manifest <release> -n <namespace>         # 查看当前 manifest
helm get hooks <release> -n <namespace>            # 查看 hooks
helm test <release> -n <namespace>                 # 运行测试