Building a Cloud-Native Monitoring System in Practice

张开发

• 2026/4/16 11:57:54 • 15-minute read


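Introduction: why monitoring matters

Bro, skip the fancy stuff! As a front-end developer who also drums in a rock band, nothing annoys me more than a system breaking without anyone noticing. In the cloud-native era, a monitoring system is like a band's sound engineer: it catches problems early and adjusts so the whole system keeps running steadily. Today I'm giving you a hardcore cloud-native monitoring setup: straight to the code, no fluff!

I. Monitoring system architecture

1. Core components

- Prometheus: time-series database that stores monitoring metrics
- Grafana: visualization platform that displays monitoring data
- Alertmanager: alert management; processes and sends alerts
- Node Exporter: node monitoring; collects host metrics
- kube-state-metrics: Kubernetes object state monitoring
- Prometheus Adapter: exposes custom metrics so the HPA can scale on them

2. Architecture diagram

```
+------------------+   +--------------------+   +------------------+
|  Node Exporter   |   | kube-state-metrics |   |   Application    |
|                  |   |                    |   |     Metrics      |
+--------+---------+   +---------+----------+   +--------+---------+
         |                       |                       |
         +-----------------------+-----------------------+
                                 |
                                 v
                        +------------------+
                        |    Prometheus    |
                        +------------------+
                                 |
                                 v
                        +------------------+
                        |   Alertmanager   |
                        +------------------+
                                 |
                                 v
                        +------------------+
                        |     Grafana      |
                        +------------------+
```

II. Prometheus deployment

1. Deploy with Helm

```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Deploy the Prometheus stack
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

# Check the deployment status
kubectl get pods -n monitoring
```

2. Configure Prometheus

```yaml
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        release: prometheus
    resources:
      requests:
        cpu: 1
        memory: 2Gi
      limits:
        cpu: 2
        memory: 4Gi
    # Retention is a duration string, not a nested "days" field
    retention: 15d
```

```bash
# Apply the custom values. The release already exists from step 1,
# so use "upgrade --install" instead of a second "install".
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace --values prometheus-values.yaml
```

III. Grafana configuration

1. Access Grafana

```bash
# Port forwarding
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring

# URL: http://localhost:3000
# Default username: admin
# Default password: prom-operator
```

2. Import dashboards

- Node monitoring: import dashboard ID 1860
- Kubernetes cluster: import dashboard ID 12070
- Application monitoring: import dashboard ID 405

3. Custom dashboard

```json
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": { "defaults": { "custom": {} }, "overrides": [] },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "hiddenSeries": false,
      "id": 2,
      "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": { "alertThreshold": true },
      "percentage": false,
      "pluginVersion": "7.5.7",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "node_cpu_seconds_total{mode=\"idle\"}",
          "interval": "",
          "legendFormat": "{{ instance }}",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "CPU idle time",
      "tooltip": { "shared": true, "sort": 0, "value_type": "individual" },
      "type": "graph",
      "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] },
      "yaxes": [
        { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true },
        { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true }
      ],
      "yaxis": { "align": false, "alignLevel": null }
    }
  ],
  "schemaVersion": 26,
  "style": "dark",
  "tags": [],
  "templating": { "list": [] },
  "time": { "from": "now-6h", "to": "now" },
  "timepicker": {},
  "timezone": "",
  "title": "Custom monitoring dashboard",
  "uid": "custom-dashboard",
  "version": 1
}
```

IV. Alerting configuration

1. Alertmanager configuration

Note: with kube-prometheus-stack, Alertmanager settings are usually supplied through the chart's `alertmanager.config` value, so treat the ConfigMap below as the shape of the configuration rather than something the operator picks up on its own.

```yaml
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      receiver: 'email'
    receivers:
      - name: 'email'
        email_configs:
          - to: 'admin@example.com'
            send_resolved: true
    inhibit_rules:
      # A firing critical alert mutes the matching warning alert
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
```

2. Custom alert rules

```yaml
# custom-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
  labels:
    # Needed so the chart's default rule selector picks this rule up
    release: prometheus
spec:
  groups:
    - name: node-alerts
      rules:
        - alert: NodeHighCPUUsage
          expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node high CPU usage"
            description: "Node {{ $labels.instance }} has CPU usage above 80% for 5 minutes"
        - alert: NodeHighMemoryUsage
          expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node high memory usage"
            description: "Node {{ $labels.instance }} has memory usage above 80% for 5 minutes"
    - name: kubernetes-alerts
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod crash looping"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
        - alert: PersistentVolumeClaimNearFull
          expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PersistentVolumeClaim near full"
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} has less than 10% space left"
```

V. Application monitoring

1. Expose application metrics

```javascript
// Node.js application example
const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Define metrics
const counter = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

const histogram = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

// Middleware: record count and duration once the response has finished
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = { method: req.method, route: req.path, status: res.statusCode };
    counter.inc(labels);
    // prom-client takes the labels first, then the observed value
    histogram.observe(labels, duration);
  });
  next();
});

// Expose metrics; register.metrics() returns a Promise in current prom-client versions
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', prometheus.register.contentType);
  res.end(await prometheus.register.metrics());
});

// Application route
app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});
```

2. Configure the ServiceMonitor

```yaml
# service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics
      interval: 15s
```

3. Deploy the application

```yaml
# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 3000
      targetPort: 3000
```

VI. Best practices

1. Choosing metrics

- Infrastructure metrics: CPU, memory, disk, network
- Kubernetes metrics: Pod status, resource usage, cluster health
- Application metrics: request volume, response time, error rate
- Business metrics: user count, order volume, revenue

2. Alerting strategy

- Tiered alerts: split by severity into warning and critical
- Alert inhibition: avoid alert storms
- Alert routing: send alerts to different receivers depending on their type
- Self-healing: fix common problems automatically

3. Maintaining the monitoring system

- Data retention: set a retention period that matches your needs
- Resource configuration: size resources to the scale of what you monitor
- Backups: back up monitoring data regularly
- Upgrades: keep the monitoring components up to date

4. Visualization best practices

- Dashboard organization: group dashboards by function or by service
- Metric display: pick a chart type that suits the metric
- Thresholds: show alert thresholds on the charts
- Auto-refresh: set a refresh interval that fits the use case

VII. Case study

Case: monitoring system for an e-commerce platform

Environment:

- Kubernetes cluster
- Microservice architecture
- High-concurrency traffic

Monitoring requirements:

- Infrastructure monitoring
- Service health monitoring
- Business metric monitoring
- Real-time alerting

Implementation:

- Deploy the Prometheus stack
- Configure node and Kubernetes monitoring
- Add metric exposure to every microservice
- Configure custom alert rules
- Build business metric dashboards

Results:

- System failures flagged before they escalate
- Performance bottlenecks found promptly
- Business metrics monitored in real time
- Noticeably more efficient operations

Conclusion: the future of monitoring systems

This rocks! Cloud-native monitoring has become an indispensable part of modern applications. With tools like Prometheus and Grafana we can monitor every layer and keep the system running steadily. For a front-end developer, knowing how to set up and configure a monitoring system not only improves reliability but also gives the business data to base decisions on.

Remember: straight to the code, skip the fancy stuff! A cloud-native monitoring system has to be hardcore, efficient, and stable. That's where the vitality of technology lies.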
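One last bit of code, since alert inhibition only got a one-liner in the best practices. The sketch below mimics, in plain Node.js, what the `inhibit_rules` entry in the Alertmanager configuration is meant to do: a firing critical alert mutes a warning alert that agrees on the `alertname`, `dev`, and `instance` labels. It is an illustration under assumptions, not Alertmanager's own code: the `applyInhibition` and `matches` helpers and the sample alerts are hypothetical, and a missing label is treated as an empty value.

```javascript
// Minimal sketch of Alertmanager-style inhibition (not Alertmanager's actual code).
// `rule` mirrors the inhibit_rules entry from the Alertmanager configuration.
const rule = {
  sourceMatch: { severity: 'critical' },
  targetMatch: { severity: 'warning' },
  equal: ['alertname', 'dev', 'instance'],
};

// True when every key/value pair in `matcher` is present in `labels`.
function matches(labels, matcher) {
  return Object.entries(matcher).every(([key, value]) => labels[key] === value);
}

// Returns the alerts that would still be notified after applying the rule.
function applyInhibition(alerts, inhibitRule) {
  const sources = alerts.filter((a) => matches(a.labels, inhibitRule.sourceMatch));
  return alerts.filter((alert) => {
    // Alerts that are not targets of the rule always pass through.
    if (!matches(alert.labels, inhibitRule.targetMatch)) return true;
    // A target is muted when some other source alert agrees on every label in `equal`.
    // Assumption: a missing label counts as an empty value.
    const inhibited = sources.some(
      (source) =>
        source !== alert &&
        inhibitRule.equal.every(
          (name) => (source.labels[name] ?? '') === (alert.labels[name] ?? '')
        )
    );
    return !inhibited;
  });
}

// Hypothetical alerts, for illustration only.
const firing = [
  { labels: { alertname: 'NodeHighCPUUsage', severity: 'critical', instance: 'node-1' } },
  { labels: { alertname: 'NodeHighCPUUsage', severity: 'warning', instance: 'node-1' } },
  { labels: { alertname: 'NodeHighCPUUsage', severity: 'warning', instance: 'node-2' } },
];

const notified = applyInhibition(firing, rule);
const summary = notified.map((a) => `${a.labels.severity}@${a.labels.instance}`);
console.log(summary); // [ 'critical@node-1', 'warning@node-2' ]
```

The warning on `node-1` is dropped because a critical alert with the same name fires on the same instance, while the warning on `node-2` still goes out. That is the "avoid alert storms" idea in miniature.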
