Building a Cloud-Native Monitoring System: A Hands-On Guide

张开发
2026/4/16 11:57:54, 15 min read

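Introduction: Why Monitoring Matters

Folks, skip the fancy stuff! As a front-end developer who moonlights as a rock drummer, nothing annoys me more than a system that breaks without anyone noticing. In the cloud-native era, a monitoring system is like a band's sound engineer: it catches problems early and makes adjustments so the whole system keeps running steadily. Today I'll walk you through a hardcore cloud-native monitoring setup. Straight to the code, no fluff!

I. Monitoring System Architecture

1. Core Components

- Prometheus: time-series database that stores monitoring metrics
- Grafana: visualization platform that displays monitoring data
- Alertmanager: alert management; processes and sends alerts
- Node Exporter: node monitoring; collects host metrics
- kube-state-metrics: Kubernetes state monitoring
- Prometheus Adapter: lets the HPA scale on custom metrics

2. Architecture Diagram

```
+---------------------+   +---------------------+   +---------------------+
|    Node Exporter    |   | kube-state-metrics  |   | Application Metrics |
+---------------------+   +---------------------+   +---------------------+
           |                         |                         |
           +-------------------------+-------------------------+
                                     |
                                     v
                          +---------------------+
                          |     Prometheus      |
                          +---------------------+
                                     |
                        +------------+------------+
                        |                         |
                        v                         v
                +---------------+         +---------------+
                | Alertmanager  |         |    Grafana    |
                +---------------+         +---------------+
```

Prometheus collects metrics from the exporters and the application, sends firing alerts to Alertmanager, and serves as the data source that Grafana queries.

II. Deploying Prometheus

1. Deploy with Helm

```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Deploy the Prometheus stack
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

# Check the deployment status
kubectl get pods -n monitoring
```

2. Configure Prometheus

```yaml
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    # Only pick up ServiceMonitors labelled with this release name
    serviceMonitorSelector:
      matchLabels:
        release: prometheus
    resources:
      requests:
        cpu: 1
        memory: 2Gi
      limits:
        cpu: 2
        memory: 4Gi
    # Retention is a duration string: keep 15 days of data
    retention: 15d
```

```bash
# Deploy with the custom configuration.
# "upgrade --install" also works when the release from step 1 already exists.
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace --values prometheus-values.yaml
```

III. Configuring Grafana

1. Access Grafana

```bash
# Port-forward the Grafana service
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring

# URL: http://localhost:3000
# Default username: admin
# Default password: prom-operator
```

2. Import Dashboards

- Node monitoring: import dashboard ID 1860
- Kubernetes cluster: import dashboard ID 12070
- Application monitoring: import dashboard ID 405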
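If an imported dashboard shows no data, one quick check is to ask Prometheus directly whether it is scraping anything. The sketch below is a minimal, unofficial example: it builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and flattens a vector result into rows. The base URL `http://localhost:9090` (for instance, reached through a port-forward to the Prometheus service) and the helper names `buildQueryUrl`, `flattenVector`, and `listTargets` are assumptions for illustration, not part of the setup above.

```javascript
// Sketch only: helper names and the localhost:9090 base URL are assumptions.

// Build an instant-query URL for the Prometheus HTTP API.
function buildQueryUrl(baseUrl, promql) {
  const url = new URL('/api/v1/query', baseUrl);
  url.searchParams.set('query', promql);
  return url.toString();
}

// Flatten an instant-vector response into { instance, job, value } rows.
function flattenVector(body) {
  if (body.status !== 'success' || body.data.resultType !== 'vector') {
    throw new Error('unexpected Prometheus response');
  }
  return body.data.result.map(({ metric, value }) => ({
    instance: metric.instance,
    job: metric.job,
    value: Number(value[1]), // each value is [unixTimestamp, "stringValue"]
  }));
}

// Live usage (needs Node 18+ for the global fetch and a reachable Prometheus).
async function listTargets(baseUrl = 'http://localhost:9090') {
  const res = await fetch(buildQueryUrl(baseUrl, 'up'));
  return flattenVector(await res.json());
}

// Offline demonstration with a hand-written sample response.
const sample = {
  status: 'success',
  data: {
    resultType: 'vector',
    result: [
      { metric: { __name__: 'up', instance: '10.0.0.1:9100', job: 'node-exporter' }, value: [1700000000, '1'] },
      { metric: { __name__: 'up', instance: '10.0.0.2:9100', job: 'node-exporter' }, value: [1700000000, '0'] },
    ],
  },
};

// Targets whose "up" value is 0 are the ones Prometheus failed to scrape.
const down = flattenVector(sample).filter((row) => row.value === 0);
console.log(down.map((row) => row.instance));
```

In the `up` series a value of 1 means the last scrape succeeded and 0 means it failed, so an empty or all-zero result usually points at a scrape or selector problem rather than at Grafana.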
3. Custom Dashboard

```json
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": { "alertThreshold": true },
      "percentage": false,
      "pluginVersion": "7.5.7",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "node_cpu_seconds_total{mode=\"idle\"}",
          "interval": "",
          "legendFormat": "{{ instance }}",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "CPU idle time",
      "tooltip": { "shared": true, "sort": 0, "value_type": "individual" },
      "type": "graph",
      "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] },
      "yaxes": [
        { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true },
        { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true }
      ],
      "yaxis": { "align": false, "alignLevel": null }
    }
  ],
  "schemaVersion": 26,
  "style": "dark",
  "tags": [],
  "templating": { "list": [] },
  "time": { "from": "now-6h", "to": "now" },
  "timepicker": {},
  "timezone": "",
  "title": "Custom Monitoring Dashboard",
  "uid": "custom-dashboard",
  "version": 1
}
```

IV. Alerting Configuration
1. Alertmanager Configuration

```yaml
# alertmanager-config.yaml
# Note: with kube-prometheus-stack, Alertmanager normally reads its configuration
# from the chart's alertmanager.config values, so a standalone ConfigMap like this
# one is likely not picked up automatically and has to be wired in.
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'  # placeholder; do not commit real credentials
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'email'
    receivers:
      - name: 'email'
        email_configs:
          - to: 'admin@example.com'
            send_resolved: true
    # Suppress warning alerts while a matching critical alert is firing
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
```

2. Custom Alert Rules

```yaml
# custom-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    # Assumption: the chart's default rule selector matches on the release label
    release: prometheus
spec:
  groups:
    - name: node-alerts
      rules:
        - alert: NodeHighCPUUsage
          expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node high CPU usage"
            description: "Node {{ $labels.instance }} has CPU usage above 80% for 5 minutes"
        - alert: NodeHighMemoryUsage
          expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node high memory usage"
            description: "Node {{ $labels.instance }} has memory usage above 80% for 5 minutes"
    - name: kubernetes-alerts
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod crash looping"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
        - alert: PersistentVolumeClaimNearFull
          expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PersistentVolumeClaim near full"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has less than 10% space left"
```

V. Application Monitoring
1. Exposing Application Metrics

```javascript
// Node.js application example
const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Define metrics
const counter = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

const histogram = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Middleware: record count and duration once each response has finished
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = { method: req.method, route: req.path, status: res.statusCode };
    counter.inc(labels);
    histogram.observe(labels, duration);
  });
  next();
});

// Expose metrics; register.metrics() returns a Promise in recent prom-client versions
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});

// Application route
app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});
```

2. Configure a ServiceMonitor

```yaml
# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics   # must match the Service port name
      interval: 15s
```

3. Deploy the Application

```yaml
# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 3000
      targetPort: 3000
```

VI. Best Practices

1. Choosing Metrics

- Infrastructure metrics: CPU, memory, disk, network
- Kubernetes metrics: Pod status, resource usage, cluster health
- Application metrics: request volume, response time, error rate
- Business metrics: user count, order volume, revenue
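For the response-time metric, a histogram such as `http_request_duration_seconds` is usually read as a percentile rather than an average. The following is a rough sketch, assuming cumulative bucket counts shaped like Prometheus `le` buckets, of how a quantile can be estimated by linear interpolation inside the bucket where the target rank falls. It mirrors the idea behind PromQL's `histogram_quantile`, but it is not the Prometheus implementation; the function name `estimateQuantile` and the sample counts are made up for illustration.

```javascript
// Sketch only: estimateQuantile and the sample counts below are hypothetical.

// buckets: cumulative counts sorted by upper bound `le`, ending with le = Infinity.
function estimateQuantile(q, buckets) {
  if (!(q > 0 && q <= 1)) throw new RangeError('q must be in (0, 1]');
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total; // number of observations at or below the quantile
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      // The +Inf bucket has no upper bound, so fall back to the last finite bound.
      if (le === Infinity) return prevLe;
      // Assume observations are spread evenly inside this bucket.
      return prevLe + ((le - prevLe) * (rank - prevCount)) / (count - prevCount);
    }
    prevLe = le;
    prevCount = count;
  }
  return NaN; // not reached when the last bucket is +Inf
}

// Cumulative counts for the bucket bounds from the application example: [0.1, 0.5, 1, 2, 5].
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 80 },
  { le: 1, count: 90 },
  { le: 2, count: 96 },
  { le: 5, count: 99 },
  { le: Infinity, count: 100 },
];

console.log('p50 ~', estimateQuantile(0.5, buckets).toFixed(3), 's');
console.log('p95 ~', estimateQuantile(0.95, buckets).toFixed(3), 's');
```

With these sample counts the median lands at the edge of the first bucket (about 0.1 s) while the 95th percentile falls between 1 s and 2 s. The estimate can only be as precise as the bucket boundaries, which is one reason to pick buckets around the latencies you actually care about.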
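2. Alerting Strategy

- Tiered alerts: classify by severity into warning and critical
- Alert inhibition: avoid alert storms
- Alert routing: send alerts to different receivers depending on alert type
- Self-healing: automatically remediate common problems

3. Maintaining the Monitoring System

- Data retention: set a retention period that matches your needs
- Resource allocation: size resources according to the scale of what you monitor
- Backups: back up monitoring data regularly
- Upgrades: keep monitoring components up to date

4. Visualization Best Practices

- Dashboard organization: group dashboards by function or by service
- Metric presentation: choose a chart type that suits the data
- Thresholds: show alert thresholds on the charts
- Auto-refresh: set a refresh interval that fits the use case

VII. Case Study

Case: Monitoring System for an E-Commerce Platform

Environment:

- Kubernetes cluster
- Microservice architecture
- High-concurrency workload

Monitoring requirements:

- Infrastructure monitoring
- Service health monitoring
- Business metrics monitoring
- Real-time alerting

Implementation:

- Deploy the Prometheus stack
- Configure node and Kubernetes monitoring
- Expose metrics from every microservice
- Configure custom alert rules
- Build dashboards for business metrics

Results:

- Early warning of system failures
- Performance bottlenecks found promptly
- Business metrics monitored in real time
- Significantly improved operations efficiency

Conclusion: The Future of Monitoring Systems

Boom! Cloud-native monitoring has become an indispensable part of modern applications. With tools like Prometheus and Grafana we can monitor every layer of the stack and keep the system running reliably. For a front-end developer, knowing how to set up and configure monitoring not only makes the system more reliable but also gives the business data to base decisions on.

Remember: go straight to the code and skip the fancy stuff. A cloud-native monitoring system should be hardcore, efficient, and stable. That is where technology gets its vitality.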
