Kubernetes CronJob missed its schedule

Problem description

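About 50 CronJobs run in an EKS cluster. I want to find out why a CronJob missed a scheduled run. Checking the schedule, concurrency policy, active jobs, and startingDeadlineSeconds is a tedious process, and even after all of these checks the cause is sometimes still unclear. I could not find any useful information in the controller logs. Is there a straightforward way to find the reason for a missed schedule from the logs?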

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  creationTimestamp: "2021-03-02T20:19:23Z"
  name: <name>
  namespace: <namespace>
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: <key>
                    operator: In
                    values:
                    - "true"
          containers:
          - image: <image-name>
            imagePullPolicy: Always
            name: solution-info
            resources:
              limits:
                cpu: 300m
                memory: 300Mi
              requests:
                cpu: 300m
                memory: 300Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
          tolerations:
          - effect: NoSchedule
            key: assets
            operator: Equal
            value: "true"
  schedule: 0 */6 * * *
  startingDeadlineSeconds: 10
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: "2021-03-10T12:00:00Z"

Solution

I did some digging, and there are a few points I want to cover for this case:

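  1. Control plane components use the klog library for logging. kube-controller-manager can write each severity level to a separate file inside a given directory when run with the --log-dir flag, or write everything to a single file when run with the --log-file flag. Keep in mind that the two flags are mutually exclusive, and make sure you are checking the correct logs.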

  2. The CronJob controller runs every 10 seconds:


 // Check things every 10 second.
 go wait.Until(jm.syncAll, 10*time.Second, stopCh)

If a run misses its schedule because it is too late to start, the controller logs it:


 scheduledTime := times[len(times)-1]
 tooLate := false
 if sj.Spec.StartingDeadlineSeconds != nil {
     tooLate = scheduledTime.Add(time.Second * time.Duration(*sj.Spec.StartingDeadlineSeconds)).Before(now)
 }
 if tooLate {
     glog.V(4).Infof("Missed starting window for %s", nameForLog)
     recorder.Eventf(sj, v1.EventTypeWarning, "MissSchedule", "Missed scheduled time to start a job: %s", scheduledTime.Format(time.RFC1123Z))
     // TODO: Since we don't set LastScheduleTime when not scheduling, we are going to keep noticing
     // the miss every cycle.  In order to avoid sending multiple events, and to avoid processing
     // the sj again and again, we could set a Status.LastMissedTime when we notice a miss.
     // Then, when we call getRecentUnmetScheduleTimes, we can take max(creationTimestamp,
     // Status.LastScheduleTime, Status.LastMissedTime), and then so we won't generate
     // and event the next time we process it, and also so the user looking at the status
     // can see easily that there was a missed execution.
     return
 }

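So searching the logs for "Missed starting window" or similar should give the expected results. Note that this message is logged at verbosity level 4 (glog.V(4)), so it only appears when the controller manager runs with that verbosity or higher. The same code path also records a Warning event with the reason MissSchedule on the CronJob.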

  3. I strongly recommend reading about CronJob limitations:

Note: If startingDeadlineSeconds is set to a value less than 10 seconds, the CronJob may not be scheduled. This is because the CronJob controller checks things every 10 seconds.

More details on the possible reasons behind a missed schedule can be found in the linked documentation.