|
|
@@ -1,184 +0,0 @@
|
|
|
-# S8-SCHED-EXEC-1 Execution Record
|
|
|
-
|
|
|
-> Date: 2026-04-28
|
|
|
-> Scope: S8 automated monitoring scheduler execution layer (DB-driven dispatch + lease acquisition + anti-flap trigger/recovery + failure governance)
|
|
|
-
|
|
|
-## Task Status
|
|
|
-
|
|
|
-- Task: S8-SCHED-EXEC-1
|
|
|
-- Start: 2026-04-28 12:08
|
|
|
-- End: 2026-04-28 12:50
|
|
|
-- Duration: about 42 minutes
|
|
|
-
|
|
|
-## Modified Files
|
|
|
-
|
|
|
-- `server/Plugins/Admin.NET.Plugin.AiDOP/Service/S8/S8WatchSchedulerService.cs`
|
|
|
- - Added dependency `_detectionStateRep`
|
|
|
- - Added public lease/dispatch methods: `ResetExpiredLeasesAsync` / `PickReadyRulesAsync` / `RunSingleRuleAsync` / `OnRuleCompletedAsync` / `RunDispatchTickAsync`
|
|
|
- - Extracted `ProcessSingleRuleAsync` as the single-rule processing body (shares the evaluator → reconcile → hit loop semantics with the legacy `ProcessRulesByTypeAsync`)
|
|
|
- - `RefreshDetectionAsync` now applies `consecutive_hit_count += 1` / `consecutive_miss_count = 0` / `recovered_at = null` (cleared on recurrence)
|
|
|
- - The create branch of `ProcessRulesByTypeAsync` now calls `UpsertDetectionStateOnHitAsync` for anti-flap accumulation (`trigger_count_required` falls back to the range [1,10])
|
|
|
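- - `ReconcileRecoveriesForRuleAsync` now accumulates misses and applies `recover_count_required` anti-flap (while the count is below the threshold it only accumulates and writes neither `recovered_at` nor RECOVERED)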
|
|
|
- - Added DTOs: `S8RuleLease` / `S8RuleRunResult` / `S8RuleRunStats` / `S8DispatchTickResult`
|
|
|
-- `server/Plugins/Admin.NET.Plugin.AiDOP/Job/S8WatchSchedulerJob.cs`
|
|
|
- - `IntervalMs`: 300000 → 60000 (5 min → 1 min)
|
|
|
- - Main flow replaced with `RunDispatchTickAsync` (removed the in-process `_consecutiveFailureCount` / `TryAutoPause`)
|
|
|
- - `BuildLockedBy()` = "{MachineName}-{ProcessId}", used as the lease owner
|
|
|
- - Extended trace fields: tickId / runId / picked / created / refreshed / pending / perRuleFailed / leaseReleased
|
|
|
-
|
|
|
-## Key Changes
|
|
|
-
|
|
|
-### Lease Model
|
|
|
-- Optimistic UPDATE lock acquisition: candidate SELECT → per-row `WHERE id=? AND enabled AND (paused_until<=NOW) AND (next_run_at<=NOW) AND (lock_until<=NOW)` → a rule enters the lease list only when affectedRows=1
|
|
|
-- LeaseDuration = 5 min
|
|
|
-- `lock_token` is a GUID, `locked_by = $"{MachineName}-{Pid}"`, `running_started_at = NOW`, `last_run_id = runId`
|
|
|
-- OnRuleCompletedAsync must use `WHERE id=? AND lock_token=?`; if the lock was lost it logs a Warning and does not overwrite state
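The lease steps above can be sketched as follows. This is a minimal illustration in Python with an in-memory SQLite table, not the C# service code: the table `watch_rule`, the integer timestamps, the NULL handling in the `WHERE` clause, and the helper names `try_acquire_lease` / `complete_rule` are all assumptions made for the example.

```python
import sqlite3
import uuid

# Hypothetical, simplified schema; timestamps are plain integer seconds.
SCHEMA = """
CREATE TABLE watch_rule (
    id INTEGER PRIMARY KEY,
    enabled INTEGER NOT NULL,
    paused_until INTEGER,
    next_run_at INTEGER,
    lock_until INTEGER,
    lock_token TEXT,
    locked_by TEXT
)
"""
LEASE_SECONDS = 300  # LeaseDuration = 5 min


def try_acquire_lease(conn, rule_id, now, locked_by):
    """Optimistic UPDATE: the caller owns the lease only if exactly one row changed."""
    token = uuid.uuid4().hex
    cur = conn.execute(
        """
        UPDATE watch_rule
           SET lock_token = ?, locked_by = ?, lock_until = ?
         WHERE id = ? AND enabled = 1
           AND (paused_until IS NULL OR paused_until <= ?)
           AND (next_run_at IS NULL OR next_run_at <= ?)
           AND (lock_until IS NULL OR lock_until <= ?)
        """,
        (token, locked_by, now + LEASE_SECONDS, rule_id, now, now, now),
    )
    return token if cur.rowcount == 1 else None


def complete_rule(conn, rule_id, token, next_run_at):
    """Release the lease only if the token still matches; False means the lock was lost."""
    cur = conn.execute(
        """
        UPDATE watch_rule
           SET lock_token = NULL, locked_by = NULL, lock_until = NULL, next_run_at = ?
         WHERE id = ? AND lock_token = ?
        """,
        (next_run_at, rule_id, token),
    )
    return cur.rowcount == 1
```

In this sketch a worker whose lease has expired and been taken over gets `False` from `complete_rule`, which corresponds to the "log a Warning, do not overwrite state" branch.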
|
|
|
-
|
|
|
-### Anti-Flap
|
|
|
-- Before creation: `UpsertDetectionStateOnHitAsync` increments `consecutive_hit_count` on each hit; below `trigger_count_required` it writes an `antiflap_pending_hit` skip result (no exception created, no CREATED written)
|
|
|
-- Refresh of an existing exception: RefreshDetectionAsync increments `consecutive_hit_count` and clears the miss count and recovered_at
|
|
|
-- Recovery anti-flap: on each miss the exception gets `consecutive_miss_count += 1` and `consecutive_hit_count = 0`; detection_state is synchronized; below `recover_count_required` it writes `antiflap_pending_recovery` and writes neither recovered_at nor RECOVERED
|
|
|
-- Fallback: a trigger / recover count that is null, < 1, or > 10 is always treated as 1
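The fallback and threshold checks above can be illustrated with a short Python sketch. It is not the C# `NormalizeAntiflapCount` itself; the function names and the `>=` comparison against the threshold are assumptions inferred from the test tables in this record.

```python
from typing import Optional

ANTIFLAP_MIN = 1
ANTIFLAP_MAX = 10


def normalize_antiflap_count(value: Optional[int]) -> int:
    """Return the configured count, or 1 when it is null or outside [1, 10]."""
    if value is None or value < ANTIFLAP_MIN or value > ANTIFLAP_MAX:
        return 1
    return value


def should_create(consecutive_hit_count: int, trigger_count_required: Optional[int]) -> bool:
    """An exception is created once hits reach the normalized trigger threshold (assumed >=)."""
    return consecutive_hit_count >= normalize_antiflap_count(trigger_count_required)


def should_recover(consecutive_miss_count: int, recover_count_required: Optional[int]) -> bool:
    """recovered_at / RECOVERED are written once misses reach the normalized threshold (assumed >=)."""
    return consecutive_miss_count >= normalize_antiflap_count(recover_count_required)
```

With `trigger_count_required=3`, the first two hits stay pending and the third creates the exception, which matches the tick-by-tick results recorded in test F.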
|
|
|
-
|
|
|
-### Failure Governance
|
|
|
-- The evaluator throws `S8RuleEvaluatorException` → `Result.Success=false` → `consecutive_failure_count += 1`
|
|
|
-- On reaching 3 failures → `paused_until = NOW + 1h` / `pause_reason = "AUTO_PAUSED_AFTER_3_FAILURES: {error summary}"` (truncated to 64 characters)
|
|
|
-- During the pause, PickReadyRulesAsync filters the rule out naturally
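The auto-pause flow above can be sketched in Python as follows. This is an illustration only: `RuleState`, `record_failure`, and `is_pickable` are hypothetical names, and truncating the whole `pause_reason` string (rather than only the error summary) to 64 characters is an assumption, since the record does not say which part is truncated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

FAILURE_THRESHOLD = 3
PAUSE_DURATION = timedelta(hours=1)
PAUSE_REASON_MAX_LEN = 64  # assumption: the limit applies to the whole string


@dataclass
class RuleState:
    consecutive_failure_count: int = 0
    paused_until: Optional[datetime] = None
    pause_reason: Optional[str] = None


def record_failure(state: RuleState, error_summary: str, now: datetime) -> RuleState:
    """Count one failed run; pause the rule for an hour once the threshold is reached."""
    state.consecutive_failure_count += 1
    if state.consecutive_failure_count >= FAILURE_THRESHOLD:
        state.paused_until = now + PAUSE_DURATION
        reason = f"AUTO_PAUSED_AFTER_3_FAILURES: {error_summary}"
        state.pause_reason = reason[:PAUSE_REASON_MAX_LEN]
    return state


def is_pickable(state: RuleState, now: datetime) -> bool:
    """A rule is eligible for pickup when it is not paused or the pause has expired."""
    return state.paused_until is None or state.paused_until <= now
```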
|
|
|
-
|
|
|
-### Job Cadence
|
|
|
-- 5 min → 1 min; per-rule cadence is determined by `poll_interval_seconds` + `next_run_at`, and rules that are not yet due are filtered out in the PickReady stage
|
|
|
-
|
|
|
-### Scheduling Entry Points
|
|
|
-- Job: `RunDispatchTickAsync(tenantId=1, factoryId=1, batchSize=32, lockedBy)`
|
|
|
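-- Debug `run-once`: keeps the old `CreateExceptionsAsync`; the path still goes through the legacy `ProcessRulesByTypeAsync` (anti-flap already injected; with trigger=1/recover=1 the behavior is equivalent to the old version)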
|
|
|
-
|
|
|
-## Test Commands and Results
|
|
|
-
|
|
|
-```
|
|
|
-# Build
|
|
|
-dotnet build server/Plugins/Admin.NET.Plugin.AiDOP/Admin.NET.Plugin.AiDOP.csproj -f net10.0
|
|
|
- → 0 Error, 64 Warning (pre-existing XML doc)
|
|
|
-
|
|
|
-# Restart
|
|
|
-bash restart_aidop.sh
|
|
|
- → backend 5005 + frontend 8888 started successfully; first-activation log:
|
|
|
- "S8WatchSchedulerJob 首次激活:IntervalMs=60000 BatchSize=32 ..."
|
|
|
-```
|
|
|
-
|
|
|
-### A. Build and Startup
|
|
|
-- ✅ 0 Error
|
|
|
-- ✅ Backend started with no ERROR
|
|
|
-
|
|
|
-### B. PickReady + lease (rules 10/11/12, with next_run_at nudged into the past)
|
|
|
-| ruleId | last_run_at | next_run_at | last_status | last_run_id | lock_token | duration_ms |
|
|
|
-|---|---|---|---|---|---|---|
|
|
|
-| 10 | 12:22:40 | 12:23:40 | SUCCESS | 6e329636829e4f2f | NULL | 656 |
|
|
|
-| 11 | 12:22:41 | 12:27:41 | SUCCESS | 6e329636829e4f2f | NULL | 665 |
|
|
|
-| 12 | 12:22:41 | 12:27:41 | SUCCESS | 6e329636829e4f2f | NULL | 675 |
|
|
|
-- ✅ All three lease fields are NULL
|
|
|
-- ✅ last_duration_ms is populated
|
|
|
-- ✅ last_run_id is populated
|
|
|
-
|
|
|
-### C. Per-rule poll_interval_seconds
|
|
|
-- rule 10 (poll=60): next - last = 60s ✅
|
|
|
-- rule 11 (poll=300): next - last = 300s ✅
|
|
|
-- rule 12 (poll=300): next - last = 300s ✅
|
|
|
-
|
|
|
-### D. Lease Duplicate Prevention
|
|
|
-1. `UPDATE WHERE id=10 SET lock_token='TEST_LOCK_BLOCK', lock_until=NOW+5min` → wait 1 tick
|
|
|
- - ✅ rule 10 last_run_at unchanged (not picked)
|
|
|
-2. Set `lock_until = NOW - 1min` → wait 1 tick
|
|
|
- - ✅ Released by ResetExpiredLeasesAsync (log: `lease_reset releasedCount=1`)
|
|
|
- - ✅ Picked again on the next tick → SUCCESS
|
|
|
-
|
|
|
-### F. Pre-Creation Anti-Flap (trigger_count_required=3)
|
|
|
-Setup: `UPDATE exception 64 SET status='CLOSED'`, `DELETE detection_state`, `SET trigger_count_required=3 ON rule 10`
|
|
|
-
|
|
|
-| tick | hit_count | active_exception_id | created |
|
|
|
-|---|---|---|---|
|
|
|
-| 1 | 1 | NULL | ❌ pending |
|
|
|
-| 2 | 2 | NULL | ❌ pending |
|
|
|
-| 3 | 3 | 70 | ✅ CREATED written to detection_log |
|
|
|
-
|
|
|
-New exception 70: dedup_key matches the original; `consecutive_hit_count=3`, `consecutive_miss_count=0`
|
|
|
-
|
|
|
-### G. Recovery Anti-Flap (recover_count_required=3)
|
|
|
-Setup: `UPDATE demo_test_order id=1 SET status='COMPLETED'` (the rule no longer hits)
|
|
|
-
|
|
|
-| tick | exception 70 miss_count | recovered_at | RECOVERED log |
|
|
|
-|---|---|---|---|
|
|
|
-| 1 | 1 | NULL | No |
|
|
|
-| 2 | 2 | NULL | No |
|
|
|
-| 3 | 3 | 12:42:39 | ✅ |
|
|
|
-
|
|
|
-### H. Recurrence Clears recovered_at
|
|
|
-Setup: `UPDATE exception 70 SET recovered_at=NOW`, then restore demo_test_order so the rule hits again
|
|
|
-
|
|
|
-| Item | Result |
|
|
|
-|---|---|
|
|
|
-| recovered_at | NOW → NULL ✅ |
|
|
|
-| last_detected_at | 12:38:39 ✅ |
|
|
|
-| consecutive_hit_count | 3 → 4 ✅ |
|
|
|
-| status | NEW (not changed automatically) ✅ |
|
|
|
-| detection_log | REFRESHED ✅ |
|
|
|
-
|
|
|
-### Failure Governance (TEMP_SCHED_BAD_FAILURE, params_json='{}')
|
|
|
-| tick | last_status | consecutive_failure_count | paused_until | pause_reason |
|
|
|
-|---|---|---|---|---|
|
|
|
-| 1 | FAILED | 1 | NULL | NULL |
|
|
|
-| 2 | FAILED | 2 | NULL | NULL |
|
|
|
-| 3 | FAILED | 3 | 13:46:39 (NOW+1h) | AUTO_PAUSED_AFTER_3_FAILURES: TIMEOUT 规则 TEMP_SCHED_BAD_FAILURE |
|
|
|
-| 4 | (not picked) | 3 | 13:46:39 | unchanged |
|
|
|
-
|
|
|
-During the same period, last_run_at of the other rules (10/11/12) kept advancing → a single failing rule does not affect the others.
|
|
|
-
|
|
|
-### I. Baseline & Demo Preserved
|
|
|
-- baseline = `SELECT COUNT(*) FROM ado_s8_exception_type WHERE tenant_id=0 AND factory_id=0 AND enabled=1` = 3 ✅
|
|
|
-- rule 10/11/12 enabled=1, paused_until=NULL, trigger_count_required=1, recover_count_required=1 ✅
|
|
|
-
|
|
|
-## SQL Write Summary (dev/aidopdev only)
|
|
|
-
|
|
|
-```
|
|
|
--- F setup
|
|
|
-UPDATE ado_s8_exception SET status='CLOSED' WHERE id=64;
|
|
|
-DELETE FROM ado_s8_rule_detection_state WHERE rule_code='DEMO_ORDER_DELIVERY_TIMEOUT';
|
|
|
-UPDATE ado_s8_watch_rule SET trigger_count_required=3, recover_count_required=3 WHERE id=10;
|
|
|
-
|
|
|
--- Failure governance fixture
|
|
|
-INSERT INTO ado_s8_watch_rule (...) VALUES (..., 'TEMP_SCHED_BAD_FAILURE', ..., '{}', ...);
|
|
|
-
|
|
|
--- H setup
|
|
|
-UPDATE ado_s8_exception SET recovered_at=NOW() WHERE id=70;
|
|
|
-
|
|
|
--- G setup
|
|
|
-UPDATE demo_test_order SET status='COMPLETED' WHERE id=1;
|
|
|
-
|
|
|
--- Cleanup
|
|
|
-DELETE FROM ado_s8_watch_rule WHERE rule_code='TEMP_SCHED_BAD_FAILURE';
|
|
|
-UPDATE demo_test_order SET status='PENDING' WHERE id=1;
|
|
|
-UPDATE ado_s8_watch_rule SET trigger_count_required=1, recover_count_required=1 WHERE id=10;
|
|
|
-DELETE FROM ado_s8_rule_detection_state WHERE rule_code='DEMO_ORDER_DELIVERY_TIMEOUT';
|
|
|
-```
|
|
|
-
|
|
|
-Note: exception 64 remains CLOSED; exception 70 is kept as demo data (a sample of the recurrence path).
|
|
|
-
|
|
|
-## Unresolved Risks
|
|
|
-
|
|
|
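-1. **The new scheduling path no longer scans "unclassified" legacy AlertRule rules**: the legacy `LoadExecutionRulesAsync` takes the `RuleType IS NULL OR ''` branch, which is reachable only through the debug `CreateExceptionsAsync` call chain. The Job now goes only through the dispatch tick + RuleType dispatch. If unclassified rules still exist in production/staging, they need a RuleType assigned; otherwise the Job will no longer pick them up. All three demo rules in dev already have a RuleType, so there is no impact there.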
|
|
|
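-2. **Scheduler batchSize is hard-coded to 32**: dev has only 3 demo rules; as the rule count grows this should move to appsettings or watch_rule configuration. This round treats it as "make it usable first".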
|
|
|
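-3. **Extreme trigger / recover values > 10 are treated as 1**: the `NormalizeAntiflapCount` fallback is conservative; if anti-flap counts above 10 are needed in the future, the bounds must be adjusted first.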
|
|
|
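-4. **Multiple Job process instances**: the lease model already supports this, but this round verified only a single instance; concurrent lock contention across instances still needs load testing once the setup is replicated in a test environment.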
|
|
|
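-5. **detection_state has no soft delete**: state rows are only appended/accumulated, with no cleanup mechanism. As the number of dedup_key values grows, table size becomes unbounded; a retention task (expiry by last_seen_at) is planned for the next round.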
|
|
|
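-6. **The legacy `ProcessRulesByTypeAsync` and the new `ProcessSingleRuleAsync` share ~80% duplicated code**: this round keeps both paths ("dual-write, do not delete the old one") to preserve the semantics of debug run-once. The next cleanup round opens `S8-SCHED-CLEANUP-LEGACY-PATH-1` to turn the legacy path into a wrapper.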
|
|
|
-7. **Regression scripts are still out of sync**: rule-evaluator-regression / r6-detection-log-edge-regression hard-code baseline=13 + G01_TEST_* and cannot run; `S8-REGRESSION-FIXTURE-1` stays on the backlog.
|
|
|
-
|
|
|
-## CTO Verdict
|
|
|
-
|
|
|
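-**Approved**. lease + dispatch + anti-flap trigger + anti-flap recovery + recurrence + failure governance + 1 min cadence were all verified end to end; baseline / demo rules are preserved; other rules are unaffected by a single rule's failure.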
|
|
|
-
|
|
|
-## Next Steps
|
|
|
-
|
|
|
-- **S8-SCHED-FRONTEND-1**: expose poll_interval_seconds / trigger_count_required / recover_count_required on the watch_rule configuration page; add last_run_at / next_run_at / last_status to the list; add run-now / pause / resume controller endpoints
|
|
|
-- **S8-SCHED-CLEANUP-LEGACY-PATH-1**: turn `ProcessRulesByTypeAsync` into a thin wrapper around `ProcessSingleRuleAsync`
|
|
|
-- **S8-DETECTION-STATE-RETENTION-1**: retention task for the state table (expiry by last_seen_at)
|
|
|
-- **S8-REGRESSION-FIXTURE-1**: inject G01_TEST_* fixture seed; parameterize the driver baseline
|