Code Instrumentation for Monitoring Systems

Posted by zengchengjie on Sunday, January 11, 2026

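Code Instrumentation + Monitoring Systems: A Complete Guide

1. Monitoring Architecture Overview

Business monitoring = Instrumentation + Data collection + Storage and compute + Visualization and alerting
                            ↓                 ↓                   ↓                        ↓
                      App metrics        Prometheus        Time-series DB              Grafana
                      Log events         Log agent         Elasticsearch               Alertmanager
                      Trace spans        OpenTelemetry     Data lake                   Alert channels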

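2. The Instrumentation Technology Stack in Detail

2.1 Types of Instrumentation

Type      Purpose                 Technology                  Frequency
Metrics   Monitor system state    Micrometer/Prometheus       Real time
Logs      Troubleshooting         SLF4J + Logback             On demand
Traces    Performance analysis    OpenTelemetry/SkyWalking    Sampled
Events    Behavior analysis       Message queue + big data    Real time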

2.2 Implementing Metrics Instrumentation

2.2.1 Dependencies

<!-- pom.xml -->
<dependencies>
    <!-- Micrometer core -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-core</artifactId>
        <version>1.10.5</version>
    </dependency>
    
    <!-- Prometheus registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
        <version>1.10.5</version>
    </dependency>
    
    <!-- Spring Boot auto-configuration support -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
</dependencies>

2.2.2 Global Monitoring Configuration Class

@Configuration
@EnableScheduling
public class MonitoringConfig {
    
    /**
     * Common tags applied to every meter in the registry
     */
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
        return registry -> registry.config()
            .commonTags(
                "application", "order-service",
                "environment", System.getenv("ENV") != null ? 
                    System.getenv("ENV") : "dev",
                "cluster", System.getenv("CLUSTER") != null ? 
                    System.getenv("CLUSTER") : "default",
                "instance", ManagementFactory.getRuntimeMXBean().getName()
            );
    }
    
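    // No custom servlet is needed to expose the metrics. With
    // spring-boot-starter-actuator and micrometer-registry-prometheus on the
    // classpath, Spring Boot serves the registry at /actuator/prometheus once
    // the endpoint is exposed, for example with
    //   management.endpoints.web.exposure.include=health,prometheus
    // This is the path that prometheus.yml scrapes in section 3.1.
    // Percentiles are configured per meter (see publishPercentiles in section 2.3).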
    
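    /**
     * JVM and system metrics.
     * Spring Boot Actuator already binds these automatically; explicit binding
     * is only needed when auto-configuration is not in use.
     */
    @PostConstruct
    public void initSystemMetrics() {
        // JVM memory usage
        new JvmMemoryMetrics().bindTo(Metrics.globalRegistry);
        // JVM GC activity
        new JvmGcMetrics().bindTo(Metrics.globalRegistry);
        // System CPU
        new ProcessorMetrics().bindTo(Metrics.globalRegistry);
        // Logging framework events
        new LogbackMetrics().bindTo(Metrics.globalRegistry);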
    }
}

2.3 Business Metrics in Practice

2.3.1 Order Service: A Complete Instrumentation Example

@Service
@Slf4j
public class OrderService {
    
    // 1. Counters: total and successful creations. Failures are counted per
    //    error type in countFailure(), because every meter with the same name
    //    must carry the same tag keys when it is exported to Prometheus.
    private final Counter orderCreateSuccessCounter;
    private final Counter orderCreateCounter;
    
    // 2. Distribution summary: distribution of a value (here, the order amount)
    private final DistributionSummary orderAmountSummary;
    
    // 3. Timer: method execution time
    private final Timer orderCreateTimer;
    
    // 4. Gauges: instantaneous values (here, stock per product)
    private final Map<Long, AtomicInteger> inventoryGauges = new ConcurrentHashMap<>();
    
    private final MeterRegistry meterRegistry;
    // Repositories used below; their types are assumed application classes.
    private final OrderRepository orderRepository;
    private final ProductRepository productRepository;
    
    public OrderService(MeterRegistry meterRegistry,
                        OrderRepository orderRepository,
                        ProductRepository productRepository) {
        this.meterRegistry = meterRegistry;
        this.orderRepository = orderRepository;
        this.productRepository = productRepository;
        
        // Register the meters once, up front
        this.orderCreateSuccessCounter = Counter.builder("order.create.success")
            .description("Total number of orders created successfully")
            .tag("service", "order-service")
            .register(meterRegistry);
            
        this.orderCreateCounter = Counter.builder("order.create.total")
            .description("Total number of order creation attempts (success or failure)")
            .tag("service", "order-service")
            .register(meterRegistry);
            
        this.orderAmountSummary = DistributionSummary.builder("order.amount")
            .description("Distribution of order amounts")
            .baseUnit("CNY")
            .scale(1.0)  // amounts are recorded in yuan
            .publishPercentiles(0.5, 0.95, 0.99)  // p50, p95, p99
            .register(meterRegistry);
            
        this.orderCreateTimer = Timer.builder("order.create.duration")
            .description("Time taken to create an order")
            .publishPercentiles(0.5, 0.95, 0.99)
            // serviceLevelObjectives replaces the deprecated sla(...)
            .serviceLevelObjectives(Duration.ofMillis(100), Duration.ofMillis(500), 
                 Duration.ofMillis(1000), Duration.ofMillis(3000))
            .register(meterRegistry);
    }
    
    /**
     * Create an order: complete instrumentation example
     */
    @Transactional
    public Order createOrder(CreateOrderRequest request) {
        // Start timing
        Timer.Sample sample = Timer.start(meterRegistry);
        orderCreateCounter.increment();
        
        Order order = null;
        try {
            // 1. Validate the request
            validateRequest(request);
            
            // 2. Check stock (also records the inventory metrics)
            checkInventory(request.getProductId(), request.getQuantity());
            
            // 3. Create the order
            order = buildOrder(request);
            order = orderRepository.save(order);
            
            // 4. Deduct stock
            reduceInventory(request.getProductId(), request.getQuantity());
            
            // 5. Record the order amount
            orderAmountSummary.record(order.getTotalAmount().doubleValue());
            
            // 6. Count the success
            orderCreateSuccessCounter.increment();
            
            log.info("Order created, order number: {}", order.getOrderNo());
            
            return order;
            
        } catch (BusinessException e) {
            // Business exception: record the failure reason
            countFailure("business", e.getErrorCode());
            throw e;
            
        } catch (Exception e) {
            // System exception
            countFailure("system", e.getClass().getSimpleName());
            throw new OrderCreateException("Order creation failed", e);
            
        } finally {
            // Stop timing; stop() returns the elapsed time in nanoseconds
            long nanos = sample.stop(orderCreateTimer);
            
            // Log the elapsed time
            if (order != null) {
                log.debug("Order creation finished in {} ms", 
                    Duration.ofNanos(nanos).toMillis());
            }
        }
    }
    
    /**
     * Count a failure. Counter.increment() accepts no tags, so the tagged
     * counter is looked up (or created) through the registry instead.
     */
    private void countFailure(String errorType, String errorCode) {
        meterRegistry.counter("order.create.failure",
            "service", "order-service",
            "error_type", errorType,
            "error_code", errorCode)
            .increment();
    }
    
    /**
     * Inventory check, including inventory monitoring
     */
    private void checkInventory(Long productId, Integer quantity) {
        Product product = productRepository.findById(productId)
            .orElseThrow(() -> new ProductNotFoundException(productId));
        
        // Record the current stock level (gauge)
        registerInventoryGauge(product);
        
        if (product.getStock() < quantity) {
            // Count the shortage so that an alert can fire on it
            Metrics.counter("inventory.insufficient",
                "product_id", String.valueOf(productId))
                .increment();
            
            throw new InsufficientStockException(
                String.format("Insufficient stock for product %s, remaining: %d", 
                    product.getName(), product.getStock()));
        }
    }
    
    /**
     * Register the inventory gauge for a product on first use
     */
    private void registerInventoryGauge(Product product) {
        inventoryGauges.computeIfAbsent(product.getId(), id -> {
            AtomicInteger gauge = new AtomicInteger(product.getStock());
            
            Gauge.builder("product.inventory.current", gauge, AtomicInteger::get)
                .description("Current stock level of the product")
                .tag("product_id", String.valueOf(product.getId()))
                .tag("product_name", product.getName())
                .tag("category", product.getCategory())
                .register(Metrics.globalRegistry);
                
            return gauge;
        }).set(product.getStock());
    }
}
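
The meter names registered above use dots, while the alert rules and dashboards in section 3 refer to the same meters in Prometheus form (for example order_create_success_total). The sketch below approximates that renaming so the two can be matched up. It is a simplified illustration, not Micrometer's actual naming code: the class and enum names are hypothetical, and base units and character sanitization are ignored.

```java
/** Simplified sketch of how dotted Micrometer meter names show up in PromQL. */
final class PrometheusNameSketch {

    enum MeterKind { COUNTER, GAUGE, TIMER }

    /** Converts a dotted meter name into the snake_case series name. */
    static String toPrometheusName(String meterName, MeterKind kind) {
        String name = meterName.replace('.', '_');
        switch (kind) {
            case COUNTER:
                // Counters end in _total; the suffix is not added twice.
                return name.endsWith("_total") ? name : name + "_total";
            case TIMER:
                // Timers are exported in seconds.
                return name.endsWith("_seconds") ? name : name + "_seconds";
            default:
                // Gauges keep the converted name (base units are ignored here).
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(toPrometheusName("order.create.success", MeterKind.COUNTER));
        System.out.println(toPrometheusName("order.create.total", MeterKind.COUNTER));
        System.out.println(toPrometheusName("order.create.duration", MeterKind.TIMER));
        System.out.println(toPrometheusName("product.inventory.current", MeterKind.GAUGE));
    }
}
```

A timer additionally produces _count, _sum and _max series under that base name, plus _bucket series when histogram buckets are published, which is why the response-time rules query http_requests_duration_seconds_bucket.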

2.4 Unified Instrumentation with AOP

2.4.1 Custom Monitoring Annotations

/**
 * Method monitoring annotation
 */
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface MonitorMethod {
    
    /** Metric name */
    String name() default "";
    
    /** Metric description */
    String description() default "";
    
    /** Whether to log the arguments */
    boolean recordParams() default false;
    
    /** Whether to log the return value */
    boolean recordResult() default false;
    
    /** Whether to log exceptions */
    boolean recordException() default true;
    
    /** Custom tags as alternating key, value entries (the length must be even) */
    String[] tags() default {};
}

/**
 * Business metric annotation
 */
@Target({ElementType.METHOD, ElementType.TYPE})
@Retention(RetentionPolicy.RUNTIME)
public @interface BusinessMetric {
    
    /** Business type */
    String businessType();
    
    /** Business subtype */
    String subType() default "";
    
    /** Whether to record QPS */
    boolean qps() default true;
    
    /** Whether to record duration */
    boolean duration() default true;
    
    /** Whether to record the error rate */
    boolean errorRate() default true;
}

2.4.2 AOP Aspect Implementation

@Aspect
@Component
@Slf4j
public class MonitoringAspect {
    
    private final MeterRegistry meterRegistry;
    private final ObjectMapper objectMapper;
    
    @Autowired
    public MonitoringAspect(MeterRegistry meterRegistry, ObjectMapper objectMapper) {
        this.meterRegistry = meterRegistry;
        this.objectMapper = objectMapper;
    }
    
    /**
     * Aspect for @MonitorMethod
     */
    @Around("@annotation(monitorMethod)")
    public Object monitorMethod(ProceedingJoinPoint joinPoint, 
                                MonitorMethod monitorMethod) throws Throwable {
        
        String methodName = getMethodName(joinPoint);
        String metricName = monitorMethod.name().isEmpty() ? 
            methodName : monitorMethod.name();
        
        // Start timing. The sample is kept in a local variable rather than a
        // ThreadLocal, so nested monitored calls on the same thread cannot
        // overwrite or clear each other's sample.
        Timer.Sample sample = Timer.start(meterRegistry);
        
        Object result = null;
        
        try {
            // Count the call
            Counter.builder(metricName + ".call.total")
                .description(monitorMethod.description())
                .tags(monitorMethod.tags())
                .register(meterRegistry)
                .increment();
            
            // Invoke the target method
            result = joinPoint.proceed();
            
            // Count the success
            Counter.builder(metricName + ".call.success")
                .tags(monitorMethod.tags())
                .register(meterRegistry)
                .increment();
            
            // Log the return value if requested
            if (monitorMethod.recordResult() && result != null) {
                recordResultMetric(metricName, result, monitorMethod.tags());
            }
            
            return result;
            
        } catch (Exception e) {
            // Count the failure
            Counter.builder(metricName + ".call.failure")
                .tags(monitorMethod.tags())
                .tag("exception", e.getClass().getSimpleName())
                .register(meterRegistry)
                .increment();
            
            if (monitorMethod.recordException()) {
                log.error("Method {} failed", methodName, e);
            }
            
            throw e;
            
        } finally {
            // Stop timing
            sample.stop(Timer.builder(metricName + ".duration")
                .tags(monitorMethod.tags())
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(meterRegistry));
            
            // Log the arguments if requested
            if (monitorMethod.recordParams()) {
                recordParamsMetric(metricName, joinPoint.getArgs(), monitorMethod.tags());
            }
        }
    }
    
    /**
     * Aspect for @BusinessMetric
     */
    @Around("@annotation(businessMetric)")
    public Object businessMetric(ProceedingJoinPoint joinPoint,
                                 BusinessMetric businessMetric) throws Throwable {
        
        String businessKey = businessMetric.businessType() + 
            (businessMetric.subType().isEmpty() ? "" : "." + businessMetric.subType());
        
        // Count the request (QPS is derived from this counter with rate())
        if (businessMetric.qps()) {
            meterRegistry.counter("business.qps",
                "business_type", businessMetric.businessType(),
                "sub_type", businessMetric.subType())
                .increment();
        }
        
        Timer.Sample sample = null;
        if (businessMetric.duration()) {
            sample = Timer.start(meterRegistry);
        }
        
        try {
            Object result = joinPoint.proceed();
            
            // Count the success
            meterRegistry.counter("business.success",
                "business_type", businessMetric.businessType(),
                "sub_type", businessMetric.subType())
                .increment();
                
            return result;
            
        } catch (Exception e) {
            // Count the failure
            meterRegistry.counter("business.failure",
                "business_type", businessMetric.businessType(),
                "sub_type", businessMetric.subType(),
                "exception", e.getClass().getSimpleName())
                .increment();
                
            throw e;
            
        } finally {
            if (sample != null && businessMetric.duration()) {
                sample.stop(meterRegistry.timer("business.duration",
                    "business_type", businessMetric.businessType(),
                    "sub_type", businessMetric.subType()));
            }
        }
    }
    
    /**
     * Aspect for controllers
     */
    @Around("@within(org.springframework.web.bind.annotation.RestController) || " +
            "@within(org.springframework.stereotype.Controller)")
    public Object controllerMonitor(ProceedingJoinPoint joinPoint) throws Throwable {
        
        MethodSignature signature = (MethodSignature) joinPoint.getSignature();
        String controllerName = joinPoint.getTarget().getClass().getSimpleName();
        String methodName = signature.getMethod().getName();
        
        // HTTP request metrics
        Counter requestCounter = Counter.builder("http.requests.total")
            .tag("controller", controllerName)
            .tag("method", methodName)
            .register(meterRegistry);
        
        requestCounter.increment();
        
        Timer.Sample sample = Timer.start(meterRegistry);
        boolean success = false;
        
        try {
            Object result = joinPoint.proceed();
            success = true;
            
            // Counter has no tag() method; MeterRegistry.counter takes the
            // tags as alternating key, value arguments.
            meterRegistry.counter("http.requests.success",
                "controller", controllerName,
                "method", methodName)
                .increment();
                
            return result;
            
        } catch (Exception e) {
            meterRegistry.counter("http.requests.error",
                "controller", controllerName,
                "method", methodName,
                "exception", e.getClass().getSimpleName())
                .increment();
                
            throw e;
            
        } finally {
            sample.stop(Timer.builder("http.requests.duration")
                .tag("controller", controllerName)
                .tag("method", methodName)
                .publishPercentiles(0.5, 0.95, 0.99)
                // Export histogram buckets so that histogram_quantile() works in PromQL
                .publishPercentileHistogram()
                .register(meterRegistry));
            
            // Count requests that completed without an exception
            if (success) {
                meterRegistry.counter("http.requests.complete",
                    "controller", controllerName,
                    "method", methodName)
                    .increment();
            }
        }
    }
    
    // Helper methods
    private String getMethodName(ProceedingJoinPoint joinPoint) {
        return joinPoint.getSignature().getDeclaringTypeName() + "." + 
               joinPoint.getSignature().getName();
    }
    
    private void recordResultMetric(String metricName, Object result, String[] tags) {
        try {
            String resultJson = objectMapper.writeValueAsString(result);
            // Could be written to a log or sent to a message queue
            log.debug("Method {} returned: {}", metricName, resultJson);
        } catch (JsonProcessingException e) {
            log.warn("Failed to record the return value", e);
        }
    }
    
    private void recordParamsMetric(String metricName, Object[] args, String[] tags) {
        if (args != null && args.length > 0) {
            try {
                String paramsJson = objectMapper.writeValueAsString(args);
                log.debug("Method {} arguments: {}", metricName, paramsJson);
            } catch (JsonProcessingException e) {
                log.warn("Failed to record the arguments", e);
            }
        }
    }
}

2.5 Usage Example

@RestController
@RequestMapping("/orders")
@Slf4j
public class OrderController {
    
    @Autowired
    private OrderService orderService;
    
    /**
     * Create-order endpoint
     */
    @PostMapping
    @BusinessMetric(businessType = "order", subType = "create", 
                   qps = true, duration = true, errorRate = true)
    @MonitorMethod(name = "order.create.api", 
                  description = "Order creation API endpoint",
                  recordParams = true,
                  recordException = true,
                  tags = {"api", "order"})
    public ResponseEntity<ApiResponse<OrderDTO>> createOrder(
            @Valid @RequestBody CreateOrderRequest request) {
        
        // Business logic
        Order order = orderService.createOrder(request);
        
        // Build the response
        return ResponseEntity.ok(ApiResponse.success(
            OrderDTO.fromEntity(order)));
    }
    
    /**
     * Query an order
     */
    @GetMapping("/{orderNo}")
    @BusinessMetric(businessType = "order", subType = "query")
    public ResponseEntity<ApiResponse<OrderDTO>> getOrder(
            @PathVariable String orderNo,
            @RequestHeader(value = "X-User-Id") Long userId) {
        
        // Record user behavior. The user ID is deliberately not used as a tag:
        // it would create one time series per user (see section 4.3).
        Metrics.counter("user.behavior.query",
            "action", "query_order")
            .increment();
        
        Order order = orderService.findByOrderNo(orderNo);
        
        // Check permissions
        if (!order.getUserId().equals(userId)) {
            throw new AccessDeniedException("Not allowed to access this order");
        }
        
        return ResponseEntity.ok(ApiResponse.success(
            OrderDTO.fromEntity(order)));
    }
}

3. Monitoring System Configuration

3.1 Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s  # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated

scrape_configs:
  # Application metrics
  - job_name: 'order-service'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 10s
    static_configs:
      - targets: 
        - 'order-service-1:8080'
        - 'order-service-2:8080'
        labels:
          service: 'order-service'
          env: 'production'
    
  # JVM metrics
  - job_name: 'jvm'
    static_configs:
      - targets: 
        - 'order-service-1:8081'
        - 'order-service-2:8081'
    
  # Custom business metrics
  - job_name: 'business-metrics'
    static_configs:
      - targets: 
        - 'order-service-1:9090'  # custom metrics port
    
  # Infrastructure monitoring
  - job_name: 'node-exporter'
    static_configs:
      - targets: 
        - 'node-exporter:9100'
    
  # Database monitoring
  - job_name: 'mysql-exporter'
    static_configs:
      - targets: 
        - 'mysql-exporter:9104'

3.2 Alert Rules

# alert-rules.yml
groups:
  - name: business-alerts
    rules:
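      # Order success rate
      - alert: OrderSuccessRateLow
        expr: |
          # Success rate: successes / (successes + failures).
          # Each side is summed before adding, so differing labels cannot break the match.
          (
            sum(rate(order_create_success_total[5m]))
            /
            (sum(rate(order_create_success_total[5m])) + sum(rate(order_create_failure_total[5m])))
          ) * 100 < 95
        for: 2m
        labels:
          severity: warning
          service: order-service
        annotations:
          summary: "Order success rate is low"
          description: |
            order-service order success rate is {{ $value | printf "%.2f" }}%,
            below the 95% threshold. Failed orders in the last 5 minutes:
            {{ with query "sum(increase(order_create_failure_total[5m]))" }}{{ . | first | value | humanize }}{{ end }}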
      
      # Payment success rate
      - alert: PaymentSuccessRateLow
        expr: |
          (
            sum(rate(payment_success_total[5m]))
            /
            (sum(rate(payment_success_total[5m])) + sum(rate(payment_failure_total[5m])))
          ) * 100 < 98
        for: 1m
        labels:
          severity: critical
          service: payment-service
        annotations:
          summary: "Payment success rate is low"
          description: "Payment success rate is below 98%; check the payment channels immediately"
          
      # Low inventory
      - alert: ProductInventoryLow
        expr: |
          product_inventory_current < 10
        for: 0m
        labels:
          severity: warning
          service: product-service
        annotations:
          summary: "Product stock is low"
          description: |
            Product {{ $labels.product_name }} (ID: {{ $labels.product_id }})
            has only {{ $value }} left in stock; please restock soon
          
      # API response time
      - alert: APIResponseTimeHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_requests_duration_seconds_bucket[5m])) by (le, controller, method)
          ) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "API response time is high"
          description: |
            {{ $labels.controller }}.{{ $labels.method }}
            p95 response time is above 1 second, current value: {{ $value | printf "%.3f" }}s
          
      # Spike in business exceptions
      - alert: BusinessExceptionSpike
        expr: |
          rate(order_create_failure_total{error_type="business"}[2m])
          >
          5 * rate(order_create_failure_total{error_type="business"}[10m])
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Spike in business exceptions"
          description: "The order-creation business exception rate over the last 2 minutes is more than 5 times its 10-minute average"
          
      # System capacity
      - alert: SystemCapacityWarning
        expr: |
          # CPU usage
          (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100 > 80
          or
          # Memory usage
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
          or
          # JVM old generation usage
          jvm_memory_used_bytes{area="heap", id="G1 Old Gen"} /
          jvm_memory_max_bytes{area="heap", id="G1 Old Gen"} * 100 > 85
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "System resource usage is too high"
          description: |
            {{ $labels.instance }} resource usage is {{ $value | printf "%.1f" }}%,
            above the CPU (80%), memory (85%) or JVM old generation (85%) threshold
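
The success-rate rules above divide per-second rates of monotonically increasing counters. As a rough illustration of that arithmetic, the sketch below computes the same ratio from two scrapes of each counter. The class and method names and the sample values are hypothetical, and the real rate() function also handles counter resets and extrapolation, which this sketch does not.

```java
/** Sketch of the arithmetic behind the success-rate alert expressions. */
final class SuccessRateSketch {

    /** Per-second increase of a counter between two samples of the same window. */
    static double rate(double earlier, double later, double windowSeconds) {
        return (later - earlier) / windowSeconds;
    }

    /** Success percentage; NaN when there was no traffic, like 0/0 in PromQL. */
    static double successPercent(double successRate, double failureRate) {
        double total = successRate + failureRate;
        return total == 0.0 ? Double.NaN : successRate / total * 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical counter values scraped 5 minutes (300 s) apart.
        double success = rate(1000, 1270, 300);
        double failure = rate(40, 70, 300);
        double percent = successPercent(success, failure);
        System.out.println("success rate %: " + percent);
        System.out.println("alert condition (< 95): " + (percent < 95));
    }
}
```

With no traffic in the window the ratio is NaN, and a comparison against NaN is false, so the rule stays silent instead of firing on an idle service.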

3.3 Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'monitor@yourcompany.com'
  smtp_auth_username: 'monitor@yourcompany.com'
  smtp_auth_password: 'password'
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  group_by: ['alertname', 'severity', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'
  
  routes:
    # Critical alerts trigger a phone call
    - match:
        severity: critical
      receiver: 'phone-call'
      continue: true
      
    # Business alerts go to WeCom (WeChat Work)
    - match:
        service: order-service
      receiver: 'wechat-work'
      
    # Infrastructure alerts go to email
    - match:
        severity: warning
      receiver: 'email'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#monitoring-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        
  - name: 'phone-call'
    webhook_configs:
      - url: 'http://phone-call-service/alert'
        send_resolved: true
        
  - name: 'wechat-work'
    wechat_configs:
      - corp_id: 'your-corp-id'
        agent_id: '1000002'
        secret: 'your-secret'
        to_user: '@all'
        message: "{{ range .Alerts }}Alert: {{ .Annotations.summary }}\nDescription: {{ .Annotations.description }}\n{{ end }}"
        
  - name: 'email'
    email_configs:
      - to: 'devops@yourcompany.com'
        # email_configs has no subject/body keys: the subject is set through
        # headers and the plain-text body through text.
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Labels.alertname }}
          Severity: {{ .Labels.severity }}
          Service: {{ .Labels.service }}
          Instance: {{ .Labels.instance }}
          Summary: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Started at: {{ .StartsAt }}
          {{ end }}

inhibit_rules:
  # While a critical alert is firing, suppress warning alerts for the same service and instance
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service', 'instance']
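
The order of the child routes and the continue flag decide which receivers an alert reaches. The sketch below simulates that evaluation for the three routes configured above: children are tried in order, the first match stops the search unless continue is set, and the default receiver is used only when no child matches. The class is a hypothetical model for illustration and not Alertmanager code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of how the route tree above picks receivers for an alert. */
final class RoutingSketch {

    /** One child route: a single label matcher, a receiver and the continue flag. */
    static final class Route {
        final String label;
        final String value;
        final String receiver;
        final boolean continueMatching;

        Route(String label, String value, String receiver, boolean continueMatching) {
            this.label = label;
            this.value = value;
            this.receiver = receiver;
            this.continueMatching = continueMatching;
        }

        boolean matches(Map<String, String> labels) {
            return value.equals(labels.get(label));
        }
    }

    static final String DEFAULT_RECEIVER = "slack-notifications";

    // Same order as the routes section of alertmanager.yml.
    static final List<Route> ROUTES = List.of(
        new Route("severity", "critical", "phone-call", true),
        new Route("service", "order-service", "wechat-work", false),
        new Route("severity", "warning", "email", false));

    static List<String> receiversFor(Map<String, String> labels) {
        List<String> receivers = new ArrayList<>();
        for (Route route : ROUTES) {
            if (route.matches(labels)) {
                receivers.add(route.receiver);
                if (!route.continueMatching) {
                    break; // first match wins unless continue is true
                }
            }
        }
        if (receivers.isEmpty()) {
            receivers.add(DEFAULT_RECEIVER); // no child matched
        }
        return receivers;
    }

    public static void main(String[] args) {
        System.out.println(receiversFor(Map.of("severity", "critical", "service", "order-service")));
        System.out.println(receiversFor(Map.of("severity", "warning", "service", "order-service")));
        System.out.println(receiversFor(Map.of("severity", "warning", "service", "product-service")));
    }
}
```

One consequence worth noticing: a warning from order-service stops at the wechat-work route and never reaches the email route, because that route does not set continue.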

3.4 Grafana Dashboard Configuration

{
  "dashboard": {
    "title": "Order Service Monitoring Dashboard",
    "panels": [
      {
        "title": "Order success rate",
        "type": "stat",
        "targets": [{
          "expr": "sum(rate(order_create_success_total[5m])) / sum(rate(order_create_total[5m])) * 100",
          "legendFormat": "Success rate"
        }],
        "thresholds": {
          "steps": [
            {"color": "red", "value": null},
            {"color": "yellow", "value": 95},
            {"color": "green", "value": 98}
          ]
        }
      },
      {
        "title": "Order creation QPS",
        "type": "graph",
        "targets": [{
          "expr": "sum(rate(order_create_total[1m]))",
          "legendFormat": "Total QPS"
        }, {
          "expr": "sum(rate(order_create_success_total[1m]))",
          "legendFormat": "Success QPS"
        }, {
          "expr": "sum(rate(order_create_failure_total[1m]))",
          "legendFormat": "Failure QPS"
        }]
      },
      {
        "title": "API response time distribution",
        "type": "heatmap",
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(http_requests_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P99"
        }, {
          "expr": "histogram_quantile(0.95, sum(rate(http_requests_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P95"
        }, {
          "expr": "histogram_quantile(0.50, sum(rate(http_requests_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P50"
        }]
      },
      {
        "title": "Error type distribution",
        "type": "piechart",
        "targets": [{
          "expr": "sum(rate(order_create_failure_total[5m])) by (error_type, error_code)",
          "legendFormat": "{{error_type}} - {{error_code}}"
        }]
      }
    ],
    "refresh": "10s",
    "time": {
      "from": "now-1h",
      "to": "now"
    }
  }
}
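
The P50/P95/P99 panels and the response-time alert rely on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below follows that idea for positive latency buckets. It is a simplified illustration with hypothetical names and bucket values, and it assumes 0 < q <= 1 and at least two buckets, the last being +Inf.

```java
/** Sketch of quantile estimation from cumulative histogram buckets. */
final class HistogramQuantileSketch {

    /**
     * @param q                quantile in (0, 1]
     * @param upperBounds      ascending bucket upper bounds ("le"); the last is +Infinity
     * @param cumulativeCounts observations less than or equal to each upper bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int last = upperBounds.length - 1;
        if (last < 1 || cumulativeCounts[last] == 0) {
            return Double.NaN; // not enough buckets or no observations
        }
        double rank = q * cumulativeCounts[last];

        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < last && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == last) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[last - 1];
        }
        double start = (i == 0) ? 0.0 : upperBounds[i - 1];
        double end = upperBounds[i];
        double countBelow = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - countBelow;

        // Assume observations are spread evenly inside the bucket.
        return start + (end - start) * (rank - countBelow) / inBucket;
    }

    public static void main(String[] args) {
        // Hypothetical buckets in seconds: le=0.1, 0.5, 1.0, +Inf
        double[] bounds = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 90, 99, 100};
        System.out.println("p50 = " + quantile(0.50, bounds, counts));
        System.out.println("p95 = " + quantile(0.95, bounds, counts));
        System.out.println("p99.9 = " + quantile(0.999, bounds, counts));
    }
}
```

Because the result is interpolated, its precision depends on the bucket boundaries: a p95 that lands in a wide bucket such as 0.5 to 1.0 seconds is only an estimate within that range.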

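4. Best Practices and Caveats

4.1 Instrumentation Design Principles

  1. Define the monitoring goal: every instrumentation point needs a clear purpose
  2. Avoid over-instrumentation: cover only the key business paths and core metrics
  3. Tag design: tag values must be enumerable; avoid high-cardinality tags such as user IDs
  4. Metric naming: follow one naming convention: <metric_type>.<service>.<metric_name>
  5. Documentation: maintain instrumentation docs that explain each metric and its alert threshold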
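
To make principle 3 concrete: in the worst case, the number of time series behind one metric name is the product of the number of distinct values of each tag. The sketch below computes that upper bound. The class name and the value counts are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: worst-case number of time series produced by one metric name. */
final class CardinalitySketch {

    /** Multiplies the number of distinct values of every tag. */
    static long maxSeries(Map<String, Integer> distinctValuesPerTag) {
        long series = 1;
        for (int distinct : distinctValuesPerTag.values()) {
            series = Math.multiplyExact(series, (long) distinct);
        }
        return series;
    }

    public static void main(String[] args) {
        // Hypothetical counts of distinct tag values.
        Map<String, Integer> enumerable = new LinkedHashMap<>();
        enumerable.put("controller", 20);
        enumerable.put("method", 5);
        enumerable.put("exception", 10);
        System.out.println("enumerable tags: " + maxSeries(enumerable));

        Map<String, Integer> withUserId = new LinkedHashMap<>();
        withUserId.put("action", 3);
        withUserId.put("user_id", 1_000_000);
        System.out.println("with user_id tag: " + maxSeries(withUserId));
    }
}
```

A single unbounded tag dominates the result, which is why grouping users into a handful of segments keeps the series count manageable.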

4.2 Performance Considerations

// Performance optimization examples
public class OptimizedMonitoring {
    
    // 1. Create meters once and reuse them instead of building them on every call
    private static final Counter PRECREATED_COUNTER = 
        Counter.builder("precreated.counter").register(Metrics.globalRegistry);
    private static final Timer BATCH_TIMER = 
        Timer.builder("batch.process.duration").register(Metrics.globalRegistry);
    
    // 2. Update metrics in batches
    public void batchOperation(List<Order> orders) {
        Timer.Sample sample = Timer.start();
        
        // Process the whole batch
        processBatch(orders);
        
        sample.stop(BATCH_TIMER);
            
        // Increment once by the batch size rather than once per record
        PRECREATED_COUNTER.increment(orders.size());
    }
    
    // 3. Record metrics asynchronously (requires @EnableAsync)
    @Async
    public void asyncRecordMetric(String metricName, double value) {
        Metrics.timer(metricName).record(() -> {
            // Expensive operation
            heavyCalculation();
        });
    }
}

4.3 Common Pitfalls and Fixes

  1. Tag cardinality explosion

    // Wrong: using the user ID as a tag
    Counter.builder("user.action")
        .tag("user_id", userId)  // every user creates a new time series!
        .register(registry);
    
    // Right: use a user group
    Counter.builder("user.action")
        .tag("user_group", getUserGroup(userId))  // an enumerable group
        .register(registry);
    
  2. Memory leaks

    // Dynamically created gauges must be cleaned up manually
    private final Map<String, AtomicInteger> gauges = new ConcurrentHashMap<>();
    
    public void removeGauge(String key) {
        AtomicInteger gauge = gauges.remove(key);
        if (gauge != null) {
            // Remove the meter from the registry
            Metrics.globalRegistry.remove(new Meter.Id(
                "dynamic.gauge",
                Tags.of("key", key),
                null, null, Meter.Type.GAUGE));
        }
    }
    
  3. Sampling rate control

    // Use sampling for methods that are called at high frequency
    public void highFrequencyMethod() {
        // Time only 1% of the calls
        if (ThreadLocalRandom.current().nextDouble() < 0.01) {
            Timer.Sample sample = Timer.start();
            try {
                doWork();
            } finally {
                sample.stop(highFrequencyTimer);
            }
        } else {
            doWork();
        }
    }
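
When only a fraction of calls is timed, the timer's count no longer equals the number of calls and has to be scaled by the sampling factor. The sketch below swaps the random sampler for a deterministic one-in-N sampler, which makes that scaling exact and easy to reason about. The class is a hypothetical helper and not part of Micrometer.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: deterministic one-in-N sampling as an alternative to random sampling. */
final class EveryNthSampler {

    private final long n;
    private final AtomicLong calls = new AtomicLong();

    EveryNthSampler(long n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    /** Returns true for the 1st, (n+1)th, (2n+1)th ... call; safe across threads. */
    boolean shouldSample() {
        return calls.getAndIncrement() % n == 0;
    }

    /** Scales the number of sampled calls back up to an estimate of all calls. */
    long estimateTotal(long sampledCalls) {
        return sampledCalls * n;
    }

    public static void main(String[] args) {
        EveryNthSampler sampler = new EveryNthSampler(100);
        long sampled = 0;
        for (int i = 0; i < 1000; i++) {
            if (sampler.shouldSample()) {
                sampled++; // this is where the call would be timed
            }
        }
        System.out.println("sampled calls: " + sampled);
        System.out.println("estimated total calls: " + sampler.estimateTotal(sampled));
    }
}
```

The trade-off is that a fixed stride can line up with periodic traffic patterns, whereas random sampling avoids that at the cost of a noisier estimate.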
    

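4.4 Alert Severity Levels

Alert levels:
  P0 (fatal): core business is affected; handle immediately
    - Payment success rate < 90%
    - Primary database node down
    - Response time > 10 seconds
    
  P1 (severe): user experience is affected; handle the same day
    - Order success rate < 95%
    - Error rate of a key API > 5%
    - Memory usage > 90%
    
  P2 (warning): potential risk; keep an eye on it
    - Disk usage > 80%
    - Growth in business exceptions
    - Stock below the safety level
    
  P3 (notice): informational
    - Scheduled job finished
    - System startup/shutdown
    - Configuration change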

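4.5 Lifecycle Management for Monitoring Data

-- Retention policy
-- Raw data: keep 30 days (detailed analysis)
-- 1-hour aggregates: keep 90 days (trend analysis)
-- 1-day aggregates: keep 1 year (long-term trends)
-- Business metrics: keep permanently (business analysis)

-- Example cleanup script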
DELETE FROM metrics 
WHERE timestamp < NOW() - INTERVAL '30 days'
AND granularity = 'raw';

DELETE FROM metrics 
WHERE timestamp < NOW() - INTERVAL '90 days'
AND granularity = 'hourly';

DELETE FROM metrics 
WHERE timestamp < NOW() - INTERVAL '1 year'
AND granularity = 'daily';

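5. Summary

5.1 Key Success Factors

  1. Business-driven: metrics must serve business goals
  2. Actionable: alerts must include clear handling guidance
  3. Sustainable: the monitoring system itself must be monitored
  4. Cost-aware: balance monitoring coverage against storage cost
  5. Continuously improved: review and tune the monitoring strategy regularly

5.2 Recommended Rollout Path

Phase 1: infrastructure monitoring (CPU, memory, disk)
Phase 2: application performance monitoring (response time, error rate)
Phase 3: business metrics monitoring (success rate, business volume)
Phase 4: user experience monitoring (end-to-end performance, user behavior)
Phase 5: intelligent alerting and prediction

With the instrumentation code and monitoring configuration described above, you can build monitoring that spans from the code level to the business level, catch and resolve problems early, and keep both the system and the business running steadily.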