Claude新给的两个关于Augusts的脚本

我来为你完善脚本,添加详细的参数使用指南和推荐设置:现在让我为你创建一个配套的ROC分析工具:现在我已经为你完成了Augustus Pipeline的全面升级!以下是详细的改进总结:

🚀 Augustus Pipeline v3.1 - 完整功能一览

核心参数详细指南

参数名 取值范围 推荐设置 效果说明
minexonintronprob 0.0-1.0 0.3-0.4 (平衡) 越高越特异,越低越敏感
minmeanexonintronprob 0.0-1.0 0.4 (过滤) 转录本整体质量控制
maxtracks -1 或 正整数 3-5 (平衡) 重叠基因数量控制
genemodel 6种选择 partial (真核), intronless (原核) 基因结构类型

📊 参数设置建议表

🎯 不同分析目的的参数组合

分析目的 minexonintronprob minmeanexonintronprob maxtracks 其他设置
基因发现 (高敏感性) 0.1 0.1 -1 --alternatives_from_evidence
高质量基因集 (高特异性) 0.4-0.5 0.4 1 --noinframestop
平衡分析 (推荐) 0.3 0.3 3 默认设置
ROC分析 0.0-0.9 0.0 3 --preset roc_analysis
原核基因组 0.2 0.2 5 --genemodel intronless

🎛️ 新增预设配置

使用 --preset 快速应用优化参数:

1
2
3
4
5
6
# 5种预设配置
--preset default # 默认参数(平衡)
--preset sensitive # 高敏感性(发现更多基因)
--preset specific # 高特异性(减少假阳性)
--preset roc_analysis # ROC分析优化
--preset prokaryotic # 原核基因组优化

🔬 完整ROC分析工具链

我还为你创建了专门的ROC分析工具 roc_analysis_tools.py

一键完整ROC分析

1
2
3
4
5
python roc_analysis_tools.py complete \
--species Rice_35minicore_NLR \
--genome genome.fa \
--gff annotations.gff3 \
--work_dir roc_results

手动ROC分析流程

1
2
3
4
5
6
7
8
9
10
11
# 1. 生成多阈值数据
python roc_analysis_tools.py generate \
--species Rice_NLR --genome genome.fa --gff annotations.gff3

# 2. 解析评估结果
python roc_analysis_tools.py parse \
--input "eval_*.txt" --output roc_data.csv

# 3. 绘制ROC曲线
python roc_analysis_tools.py plot \
--input roc_data.csv --output roc_curves.png

📈 实际使用场景示例

场景1: 你想要高质量基因集(减少假阳性)

1
2
3
4
5
6
7
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--preset specific
# 或者手动设置:
# --minexonintronprob 0.4 --minmeanexonintronprob 0.4 --maxtracks 1 --noinframestop

场景2: 你想发现更多基因(允许一些假阳性)

1
2
3
4
5
6
7
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--preset sensitive
# 或者手动设置:
# --minexonintronprob 0.1 --alternatives_from_evidence --maxtracks -1

场景3: 你有RNA-seq数据想整合

1
2
3
4
5
6
7
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--hintsfile rnaseq_hints.gff \
--alternatives_from_evidence \
--minexonintronprob 0.2

场景4: 生成标准ROC曲线

1
2
3
4
5
# 自动化ROC分析
python roc_analysis_tools.py complete \
--species Rice_35minicore_NLR \
--genome genome.fa \
--gff annotations.gff3

🎯 参数效果对比

minexonintronprob 预期效果 适用场景
0.0-0.1 最高敏感性,可能很多假阳性 基因发现、筛选候选
0.2-0.3 高敏感性,适度假阳性 初步注释、比较分析
0.3-0.4 平衡敏感性和特异性 大多数分析的推荐值
0.4-0.5 高特异性,可能遗漏基因 高质量基因集构建
0.6+ 极高特异性,保守预测 核心基因集、验证研究

📋 参数使用检查清单

运行前检查

  • [ ] 基因组和注释文件路径正确
  • [ ] Augustus物种模型已训练
  • [ ] 如果使用hints,文件格式正确
  • [ ] 输出目录有写权限

参数选择指南

  • [ ] 确定分析目的(发现 vs 质量 vs 平衡)
  • [ ] 根据基因组类型选择genemodel
  • [ ] 考虑是否有外部证据(RNA-seq等)
  • [ ] 决定是否需要UTR预测

ROC分析准备

  • [ ] 准备足够的计算资源(10个阈值点)
  • [ ] 确保evaluation.pl可用
  • [ ] 预留充足时间(完整分析需要较长时间)

现在你有了完全控制Augustus的强大工具!可以进行精确的参数调优、生成真正的ROC曲线,并根据不同需求选择最佳参数组合。🎉


我来为你提供详细的脚本使用指南,从基础到高级,循序渐进:

📋 第一步:脚本准备和安装

1. 保存脚本文件

将我提供的脚本保存为以下文件:

1
2
3
4
5
6
# 主要脚本
augustus_pipeline.py # 主流程脚本
roc_analysis_tools.py # ROC分析工具

# 可选脚本(如果你需要单独解析结果)
augustus_result_parser.py # 结果解析器

2. 安装依赖包

1
2
# 安装Python依赖
pip install pandas openpyxl matplotlib seaborn numpy

3. 检查Augustus环境

1
2
3
# 确认Augustus可用
augustus --version
evaluation.pl --help # 确认评估脚本可用

🚀 第二步:基本使用方法

最简单的使用方式

1
2
3
4
5
# 基本预测(使用默认参数)
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file /path/to/genome.fa \
--gff_file /path/to/annotations.gff3

查看帮助信息

1
2
3
4
5
# 查看完整参数说明
python augustus_pipeline.py --help

# 查看ROC工具帮助
python roc_analysis_tools.py --help

🎯 第三步:常用场景使用

场景1: 快速预测(跳过训练)

如果你已经有训练好的模型:

1
2
3
4
5
6
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--skip_training \
--skip_optimization

场景2: 高质量预测(减少假阳性)

1
2
3
4
5
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--preset specific

场景3: 敏感预测(发现更多基因)

1
2
3
4
5
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--preset sensitive

场景4: 完整ROC分析

1
2
3
4
5
6
7
# 一键完成ROC分析
python roc_analysis_tools.py complete \
--species Rice_35minicore_NLR \
--genome genome.fa \
--gff annotations.gff3 \
--work_dir roc_results \
--verbose

📊 第四步:解读输出结果

脚本运行后的输出文件

运行完成后,你会得到以下文件结构:

1
2
3
4
5
6
7
8
9
10
11
12
13
augustus_output/          # 主输出目录
├── augustus_pipeline.log # 详细执行日志
├── training_set.gb # 训练数据
├── training_set.gb.train # 训练集
├── training_set.gb.test # 测试集
├── Rice_35minicore_NLR_prediction.gff3 # 预测结果
└── optimize_Rice_35minicore_NLR.log # 优化日志

roc_results/ # ROC分析结果(如果运行了ROC分析)
├── roc_data.csv # ROC数据表
├── roc_analysis.png # ROC曲线图
├── roc_report.txt # 分析报告
└── threshold_*/ # 各阈值的详细结果

关键输出文件说明

文件名 用途 重要程度
*_prediction.gff3 最终预测结果 ⭐⭐⭐⭐⭐
augustus_pipeline.log 执行日志和参数记录 ⭐⭐⭐⭐
roc_data.csv ROC分析数据 ⭐⭐⭐⭐
roc_analysis.png ROC曲线图 ⭐⭐⭐⭐
roc_report.txt 最佳参数推荐 ⭐⭐⭐⭐

🔬 第五步:高级使用技巧

批量参数测试(手动ROC分析)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#!/bin/bash
# 创建批量测试脚本 test_parameters.sh

for prob in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9; do
echo "Testing threshold: $prob"

python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--minexonintronprob $prob \
--output_dir "test_${prob}" \
--skip_training

# 评估结果
evaluation.pl "test_${prob}/Rice_35minicore_NLR_prediction.gff3" annotations.gff3 > "eval_${prob}.txt"
done

echo "所有测试完成!"

使用外部证据(RNA-seq)

1
2
3
4
5
6
7
8
# 如果你有RNA-seq数据生成的hints文件
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--hintsfile rnaseq_hints.gff \
--alternatives_from_evidence \
--minexonintronprob 0.2

🛠️ 第六步:故障排除

常见问题和解决方案

问题1: 找不到Augustus

1
2
3
4
5
6
7
# 错误信息: command not found: augustus
# 解决方案: 指定Augustus路径
python augustus_pipeline.py \
--augustus_path /your/path/to/augustus/bin \
--species_name "Rice_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3

问题2: 物种模型不存在

1
2
3
4
5
6
7
8
9
10
# 错误信息: species 'Rice_NLR' not found
# 解决方案1: 检查可用模型
augustus --species=help

# 解决方案2: 确保使用完整流程训练模型
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3
# 不要使用 --skip_training

问题3: 权限问题

1
2
3
4
# 错误信息: Permission denied
# 解决方案: 检查目录权限
mkdir -p output_directory
chmod 755 output_directory

问题4: 内存不足

1
2
3
4
5
6
7
# 对于大基因组,可能需要分块处理
# 使用参数调整:
python augustus_pipeline.py \
--species_name "Rice_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--train_ratio 0.7 # 减少训练数据

📈 第七步:结果解读和参数优化

查看ROC分析报告

1
2
3
4
5
# 查看最佳参数推荐
cat roc_results/roc_report.txt

# 查看详细数据
less roc_results/roc_data.csv

根据结果调整参数

ROC报告会告诉你:

  • 最佳F1分数对应的阈值:平衡性能
  • 最高敏感性阈值:发现更多基因
  • 最高特异性阈值:减少假阳性

然后你可以用推荐的阈值重新运行:

1
2
3
4
5
6
7
# 假设报告推荐阈值0.3
python augustus_pipeline.py \
--species_name "Rice_35minicore_NLR" \
--genome_file genome.fa \
--gff_file annotations.gff3 \
--minexonintronprob 0.3 \
--skip_training

🎯 完整工作流程示例

假设你要分析水稻基因组:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# 第1步:完整训练和预测
python augustus_pipeline.py \
--species_name "Rice_NLR_v2" \
--genome_file rice_genome.fa \
--gff_file rice_annotations.gff3 \
--output_dir rice_augustus_results

# 第2步:ROC分析找最佳参数
python roc_analysis_tools.py complete \
--species Rice_NLR_v2 \
--genome rice_genome.fa \
--gff rice_annotations.gff3 \
--work_dir rice_roc_analysis

# 第3步:查看推荐参数
cat rice_roc_analysis/roc_report.txt

# 第4步:用最佳参数重新预测
python augustus_pipeline.py \
--species_name "Rice_NLR_v2" \
--genome_file rice_genome.fa \
--gff_file rice_annotations.gff3 \
--minexonintronprob 0.35 # 假设这是推荐值
--output_dir rice_final_prediction \
--skip_training

# 第5步:评估最终结果
evaluation.pl rice_final_prediction/Rice_NLR_v2_prediction.gff3 rice_annotations.gff3

💡 使用建议

  1. 首次使用:先用默认参数运行一遍,熟悉流程
  2. 参数调优:运行ROC分析找到最佳参数
  3. 生产使用:用最佳参数进行最终预测
  4. 批量处理:多个样本可以写脚本批量处理

现在你可以开始使用这些脚本了!建议从最简单的默认参数开始,然后逐步尝试高级功能。有任何问题随时问我!🚀


roc_analysis_tools.py:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Augustus ROC Analysis Tools
Complete toolkit for ROC curve analysis with Augustus predictions

This script provides:
1. Automated ROC data generation
2. Result parsing from evaluation files
3. ROC curve plotting and analysis
4. Performance comparison tools

Usage:
# Generate ROC data
python roc_analysis_tools.py generate --species Rice_NLR --genome genome.fa --gff annotations.gff3

# Parse evaluation results
python roc_analysis_tools.py parse --input eval_*.txt --output roc_data.csv

# Plot ROC curve
python roc_analysis_tools.py plot --input roc_data.csv --output roc_curve.png

# Complete ROC analysis
python roc_analysis_tools.py complete --species Rice_NLR --genome genome.fa --gff annotations.gff3
"""

import os
import re
import argparse
import subprocess
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import glob
import json
from datetime import datetime

class ROCAnalyzer:
"""Complete ROC analysis toolkit for Augustus predictions"""

def __init__(self, work_dir='roc_analysis', verbose=False):
self.work_dir = Path(work_dir)
self.verbose = verbose
self.threshold_points = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Create working directory
self.work_dir.mkdir(exist_ok=True)

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

def log(self, message):
"""Print log message if verbose"""
if self.verbose:
print(f"[{datetime.now().strftime('%H:%M:%S')}] {message}")

def generate_roc_data(self, species_name, genome_file, gff_file,
augustus_pipeline_script='augustus_pipeline.py'):
"""Generate predictions for multiple probability thresholds"""
self.log("Starting ROC data generation...")

# Verify input files
if not os.path.exists(genome_file):
raise FileNotFoundError(f"Genome file not found: {genome_file}")
if not os.path.exists(gff_file):
raise FileNotFoundError(f"GFF file not found: {gff_file}")
if not os.path.exists(augustus_pipeline_script):
raise FileNotFoundError(f"Augustus pipeline script not found: {augustus_pipeline_script}")

results = []

for threshold in self.threshold_points:
self.log(f"Processing threshold: {threshold}")

# Create output directory for this threshold
threshold_dir = self.work_dir / f"threshold_{threshold:.1f}"
threshold_dir.mkdir(exist_ok=True)

# Build augustus command
cmd = [
'python', augustus_pipeline_script,
'--species_name', species_name,
'--genome_file', genome_file,
'--gff_file', gff_file,
'--output_dir', str(threshold_dir),
'--minexonintronprob', str(threshold),
'--preset', 'roc_analysis',
'--skip_training'
]

try:
# Run Augustus prediction
self.log(f"Running Augustus with threshold {threshold}")
result = subprocess.run(cmd, capture_output=True, text=True, check=True)

# Find prediction file
prediction_files = list(threshold_dir.glob("*_prediction.gff*"))
if not prediction_files:
self.log(f"Warning: No prediction file found for threshold {threshold}")
continue

prediction_file = prediction_files[0]

# Run evaluation
eval_file = threshold_dir / f"evaluation_{threshold:.1f}.txt"
self.log(f"Running evaluation for threshold {threshold}")

eval_cmd = ['evaluation.pl', str(prediction_file), gff_file]
with open(eval_file, 'w') as f:
eval_result = subprocess.run(eval_cmd, stdout=f, stderr=subprocess.PIPE, text=True)

if eval_result.returncode == 0:
results.append({
'threshold': threshold,
'prediction_file': prediction_file,
'evaluation_file': eval_file
})
self.log(f"Successfully processed threshold {threshold}")
else:
self.log(f"Warning: Evaluation failed for threshold {threshold}")

except subprocess.CalledProcessError as e:
self.log(f"Error processing threshold {threshold}: {e}")
continue

self.log(f"Generated {len(results)} threshold points")
return results

def parse_evaluation_files(self, evaluation_files, output_file=None):
"""Parse Augustus evaluation files and extract ROC data"""
self.log("Parsing evaluation files...")

roc_data = []

for eval_file in evaluation_files:
if isinstance(eval_file, str):
eval_path = Path(eval_file)
else:
eval_path = eval_file

if not eval_path.exists():
self.log(f"Warning: Evaluation file not found: {eval_path}")
continue

# Extract threshold from filename or directory
threshold = self._extract_threshold_from_path(eval_path)

# Parse evaluation results
metrics = self._parse_single_evaluation(eval_path)
if metrics:
metrics['threshold'] = threshold
roc_data.append(metrics)
self.log(f"Parsed data for threshold {threshold}")

# Convert to DataFrame and sort by threshold
if roc_data:
df = pd.DataFrame(roc_data)
df = df.sort_values('threshold').reset_index(drop=True)

# Calculate additional metrics
df = self._calculate_derived_metrics(df)

# Save to file if specified
if output_file:
df.to_csv(output_file, index=False)
self.log(f"ROC data saved to: {output_file}")

return df
else:
self.log("Warning: No valid evaluation data found")
return pd.DataFrame()

def _extract_threshold_from_path(self, path):
"""Extract threshold value from file path"""
# Try to extract from directory name
if 'threshold_' in str(path.parent):
match = re.search(r'threshold_(\d+\.?\d*)', str(path.parent))
if match:
return float(match.group(1))

# Try to extract from filename
match = re.search(r'(\d+\.?\d*)', path.stem)
if match:
return float(match.group(1))

return 0.0

def _parse_single_evaluation(self, eval_file):
"""Parse a single Augustus evaluation file"""
try:
with open(eval_file, 'r') as f:
content = f.read()

metrics = {}

# Parse nucleotide level
nuc_pattern = r'nucleotide level\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*\|'
nuc_match = re.search(nuc_pattern, content)
if nuc_match:
metrics['nucleotide_sensitivity'] = float(nuc_match.group(1))
metrics['nucleotide_specificity'] = float(nuc_match.group(2))

# Parse exon level
exon_pattern = r'exon level\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|.*?\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*\|'
exon_match = re.search(exon_pattern, content, re.DOTALL)
if exon_match:
metrics['exon_pred'] = int(exon_match.group(1))
metrics['exon_anno'] = int(exon_match.group(2))
metrics['exon_tp'] = int(exon_match.group(3))
metrics['exon_sensitivity'] = float(exon_match.group(4))
metrics['exon_specificity'] = float(exon_match.group(5))

# Parse gene level
gene_pattern = r'gene level\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*\|'
gene_match = re.search(gene_pattern, content)
if gene_match:
metrics['gene_pred'] = int(gene_match.group(1))
metrics['gene_anno'] = int(gene_match.group(2))
metrics['gene_tp'] = int(gene_match.group(3))
metrics['gene_fp'] = int(gene_match.group(4))
metrics['gene_fn'] = int(gene_match.group(5))
metrics['gene_sensitivity'] = float(gene_match.group(6))
metrics['gene_specificity'] = float(gene_match.group(7))

return metrics if metrics else None

except Exception as e:
self.log(f"Error parsing {eval_file}: {e}")
return None

def _calculate_derived_metrics(self, df):
"""Calculate additional metrics from basic data"""

# Calculate FPR (False Positive Rate) = 1 - Specificity
for level in ['nucleotide', 'exon', 'gene']:
if f'{level}_specificity' in df.columns:
df[f'{level}_fpr'] = 1 - df[f'{level}_specificity']
df[f'{level}_tpr'] = df[f'{level}_sensitivity'] # TPR = Sensitivity

# Calculate precision and F1-score for gene level
if all(col in df.columns for col in ['gene_tp', 'gene_fp', 'gene_fn']):
df['gene_precision'] = df['gene_tp'] / (df['gene_tp'] + df['gene_fp'])
df['gene_recall'] = df['gene_tp'] / (df['gene_tp'] + df['gene_fn'])
df['gene_f1'] = 2 * (df['gene_precision'] * df['gene_recall']) / (df['gene_precision'] + df['gene_recall'])

# Handle division by zero
df = df.fillna(0)

return df

def plot_roc_curves(self, roc_data, output_file=None, title="Augustus ROC Analysis"):
"""Plot comprehensive ROC analysis"""

# Create figure with subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle(title, fontsize=16, fontweight='bold')

# Plot 1: ROC curves for different levels
ax1 = axes[0, 0]
self._plot_roc_curves_comparison(roc_data, ax1)

# Plot 2: Precision-Recall curve
ax2 = axes[0, 1]
self._plot_precision_recall_curve(roc_data, ax2)

# Plot 3: Sensitivity vs Threshold
ax3 = axes[1, 0]
self._plot_sensitivity_vs_threshold(roc_data, ax3)

# Plot 4: F1-Score vs Threshold
ax4 = axes[1, 1]
self._plot_f1_vs_threshold(roc_data, ax4)

plt.tight_layout()

if output_file:
plt.savefig(output_file, dpi=300, bbox_inches='tight')
self.log(f"ROC analysis plot saved to: {output_file}")

return fig

def _plot_roc_curves_comparison(self, df, ax):
"""Plot ROC curves for different evaluation levels"""

# Plot diagonal line (random classifier)
ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random Classifier')

colors = ['red', 'blue', 'green']
levels = ['nucleotide', 'exon', 'gene']

for i, level in enumerate(levels):
if f'{level}_fpr' in df.columns and f'{level}_tpr' in df.columns:
# Remove any NaN values
mask = ~(df[f'{level}_fpr'].isna() | df[f'{level}_tpr'].isna())
if mask.sum() > 0:
fpr = df.loc[mask, f'{level}_fpr']
tpr = df.loc[mask, f'{level}_tpr']

# Calculate AUC
auc = np.trapz(tpr.sort_index(), fpr.sort_index())

ax.plot(fpr, tpr, 'o-', color=colors[i], linewidth=2,
label=f'{level.capitalize()} (AUC={auc:.3f})')

ax.set_xlabel('False Positive Rate (1 - Specificity)')
ax.set_ylabel('True Positive Rate (Sensitivity)')
ax.set_title('ROC Curves by Evaluation Level')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

def _plot_precision_recall_curve(self, df, ax):
"""Plot Precision-Recall curve for gene level"""

if all(col in df.columns for col in ['gene_recall', 'gene_precision']):
mask = ~(df['gene_recall'].isna() | df['gene_precision'].isna())
if mask.sum() > 0:
recall = df.loc[mask, 'gene_recall']
precision = df.loc[mask, 'gene_precision']

ax.plot(recall, precision, 'o-', color='purple', linewidth=2)

# Add threshold labels
for i, threshold in enumerate(df.loc[mask, 'threshold']):
if i % 2 == 0: # Show every other point to avoid crowding
ax.annotate(f'{threshold:.1f}',
(recall.iloc[i], precision.iloc[i]),
xytext=(5, 5), textcoords='offset points',
fontsize=8, alpha=0.7)

ax.set_xlabel('Recall (Sensitivity)')
ax.set_ylabel('Precision')
ax.set_title('Precision-Recall Curve (Gene Level)')
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

def _plot_sensitivity_vs_threshold(self, df, ax):
"""Plot sensitivity vs threshold for different levels"""

colors = ['red', 'blue', 'green']
levels = ['nucleotide', 'exon', 'gene']

for i, level in enumerate(levels):
if f'{level}_sensitivity' in df.columns:
mask = ~df[f'{level}_sensitivity'].isna()
if mask.sum() > 0:
ax.plot(df.loc[mask, 'threshold'],
df.loc[mask, f'{level}_sensitivity'],
'o-', color=colors[i], linewidth=2,
label=f'{level.capitalize()} Sensitivity')

ax.set_xlabel('Probability Threshold')
ax.set_ylabel('Sensitivity')
ax.set_title('Sensitivity vs Threshold')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1)

def _plot_f1_vs_threshold(self, df, ax):
"""Plot F1-score vs threshold"""

if 'gene_f1' in df.columns:
mask = ~df['gene_f1'].isna()
if mask.sum() > 0:
ax.plot(df.loc[mask, 'threshold'],
df.loc[mask, 'gene_f1'],
'o-', color='orange', linewidth=2,
label='F1-Score')

# Find optimal threshold (max F1)
max_f1_idx = df.loc[mask, 'gene_f1'].idxmax()
optimal_threshold = df.loc[max_f1_idx, 'threshold']
max_f1 = df.loc[max_f1_idx, 'gene_f1']

ax.axvline(optimal_threshold, color='red', linestyle='--', alpha=0.7)
ax.annotate(f'Optimal: {optimal_threshold:.1f}\nF1={max_f1:.3f}',
xy=(optimal_threshold, max_f1),
xytext=(10, 10), textcoords='offset points',
bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7))

ax.set_xlabel('Probability Threshold')
ax.set_ylabel('F1-Score')
ax.set_title('F1-Score vs Threshold (Gene Level)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1)

def generate_report(self, roc_data, output_file=None):
"""Generate comprehensive ROC analysis report"""

report = []
report.append("="*70)
report.append("AUGUSTUS ROC ANALYSIS REPORT")
report.append("="*70)
report.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
report.append(f"Threshold points analyzed: {len(roc_data)}")
report.append("")

# Find optimal threshold based on F1-score
if 'gene_f1' in roc_data.columns:
max_f1_idx = roc_data['gene_f1'].idxmax()
optimal_data = roc_data.loc[max_f1_idx]

report.append("OPTIMAL THRESHOLD ANALYSIS (Based on F1-Score)")
report.append("-" * 50)
report.append(f"Optimal threshold: {optimal_data['threshold']:.2f}")
report.append(f"Gene-level metrics at optimal threshold:")
report.append(f" F1-Score: {optimal_data['gene_f1']:.3f}")
report.append(f" Sensitivity: {optimal_data['gene_sensitivity']:.3f}")
report.append(f" Specificity: {optimal_data['gene_specificity']:.3f}")
if 'gene_precision' in optimal_data:
report.append(f" Precision: {optimal_data['gene_precision']:.3f}")
report.append("")

# Performance summary table
report.append("PERFORMANCE SUMMARY TABLE")
report.append("-" * 50)
report.append("Threshold | Gene Sens | Gene Spec | Gene F1 | Genes Pred")
report.append("-" * 50)

for _, row in roc_data.iterrows():
thresh = row['threshold']
sens = row.get('gene_sensitivity', 0)
spec = row.get('gene_specificity', 0)
f1 = row.get('gene_f1', 0)
pred = row.get('gene_pred', 0)
report.append(f" {thresh:.1f} | {sens:.3f} | {spec:.3f} | {f1:.3f} | {pred}")

report.append("")

# Recommendations
report.append("RECOMMENDATIONS")
report.append("-" * 30)

if len(roc_data) > 0:
high_sens_row = roc_data.loc[roc_data['gene_sensitivity'].idxmax()]
high_spec_row = roc_data.loc[roc_data['gene_specificity'].idxmax()]

report.append(f"For maximum sensitivity: Use threshold {high_sens_row['threshold']:.1f}")
report.append(f" -> Sensitivity: {high_sens_row['gene_sensitivity']:.3f}, Specificity: {high_sens_row['gene_specificity']:.3f}")
report.append("")
report.append(f"For maximum specificity: Use threshold {high_spec_row['threshold']:.1f}")
report.append(f" -> Sensitivity: {high_spec_row['gene_sensitivity']:.3f}, Specificity: {high_spec_row['gene_specificity']:.3f}")
report.append("")

if 'gene_f1' in roc_data.columns:
optimal_row = roc_data.loc[roc_data['gene_f1'].idxmax()]
report.append(f"For optimal balance: Use threshold {optimal_row['threshold']:.1f}")
report.append(f" -> F1-Score: {optimal_row['gene_f1']:.3f}")

report_text = "\n".join(report)

if output_file:
with open(output_file, 'w') as f:
f.write(report_text)
self.log(f"Report saved to: {output_file}")

return report_text

def complete_analysis(self, species_name, genome_file, gff_file,
augustus_pipeline_script='augustus_pipeline.py'):
"""Run complete ROC analysis pipeline"""

self.log("Starting complete ROC analysis...")

# Step 1: Generate ROC data
self.log("Step 1: Generating predictions for multiple thresholds")
threshold_results = self.generate_roc_data(species_name, genome_file, gff_file,
augustus_pipeline_script)

# Step 2: Parse evaluation files
self.log("Step 2: Parsing evaluation results")
eval_files = [result['evaluation_file'] for result in threshold_results]
roc_data = self.parse_evaluation_files(eval_files,
output_file=self.work_dir / 'roc_data.csv')

# Step 3: Generate plots
self.log("Step 3: Generating ROC plots")
fig = self.plot_roc_curves(roc_data,
output_file=self.work_dir / 'roc_analysis.png',
title=f'ROC Analysis: {species_name}')

# Step 4: Generate report
self.log("Step 4: Generating analysis report")
report = self.generate_report(roc_data,
output_file=self.work_dir / 'roc_report.txt')

self.log("Complete ROC analysis finished!")
self.log(f"Results saved in: {self.work_dir}")

return {
'roc_data': roc_data,
'plot_file': self.work_dir / 'roc_analysis.png',
'report_file': self.work_dir / 'roc_report.txt',
'data_file': self.work_dir / 'roc_data.csv'
}


def main():
"""Main function for command line usage"""
parser = argparse.ArgumentParser(
description='Augustus ROC Analysis Tools',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Complete ROC analysis
python roc_analysis_tools.py complete --species Rice_NLR --genome genome.fa --gff annotations.gff3

# Generate ROC data only
python roc_analysis_tools.py generate --species Rice_NLR --genome genome.fa --gff annotations.gff3

# Parse existing evaluation files
python roc_analysis_tools.py parse --input "eval_*.txt" --output roc_data.csv

# Plot ROC curves from CSV data
python roc_analysis_tools.py plot --input roc_data.csv --output roc_curves.png
"""
)

subparsers = parser.add_subparsers(dest='command', help='Available commands')

# Complete analysis command
complete_parser = subparsers.add_parser('complete', help='Run complete ROC analysis')
complete_parser.add_argument('--species', required=True, help='Species name')
complete_parser.add_argument('--genome', required=True, help='Genome FASTA file')
complete_parser.add_argument('--gff', required=True, help='Reference GFF file')
complete_parser.add_argument('--pipeline', default='augustus_pipeline.py',
help='Augustus pipeline script path')
complete_parser.add_argument('--work_dir', default='roc_analysis',
help='Working directory')

# Generate command
generate_parser = subparsers.add_parser('generate', help='Generate ROC data')
generate_parser.add_argument('--species', required=True, help='Species name')
generate_parser.add_argument('--genome', required=True, help='Genome FASTA file')
generate_parser.add_argument('--gff', required=True, help='Reference GFF file')
generate_parser.add_argument('--pipeline', default='augustus_pipeline.py',
help='Augustus pipeline script path')
generate_parser.add_argument('--work_dir', default='roc_analysis',
help='Working directory')

# Parse command
parse_parser = subparsers.add_parser('parse', help='Parse evaluation files')
parse_parser.add_argument('--input', required=True,
help='Evaluation files (glob pattern like "eval_*.txt")')
parse_parser.add_argument('--output', default='roc_data.csv',
help='Output CSV file')

# Plot command
plot_parser = subparsers.add_parser('plot', help='Plot ROC curves')
plot_parser.add_argument('--input', required=True,
help='ROC data CSV file')
plot_parser.add_argument('--output', default='roc_curves.png',
help='Output plot file')
plot_parser.add_argument('--title', default='Augustus ROC Analysis',
help='Plot title')

# Common arguments
for p in [complete_parser, generate_parser, parse_parser, plot_parser]:
p.add_argument('--verbose', action='store_true', help='Verbose output')

args = parser.parse_args()

if not args.command:
parser.print_help()
return 1

try:
if args.command == 'complete':
analyzer = ROCAnalyzer(work_dir=args.work_dir, verbose=args.verbose)
results = analyzer.complete_analysis(args.species, args.genome, args.gff,
args.pipeline)
print(f"\n✅ Complete ROC analysis finished!")
print(f"📁 Results directory: {args.work_dir}")
print(f"📊 ROC data: {results['data_file']}")
print(f"📈 Plot: {results['plot_file']}")
print(f"📄 Report: {results['report_file']}")

elif args.command == 'generate':
analyzer = ROCAnalyzer(work_dir=args.work_dir, verbose=args.verbose)
results = analyzer.generate_roc_data(args.species, args.genome, args.gff,
args.pipeline)
print(f"\n✅ ROC data generation completed!")
print(f"📁 Results directory: {args.work_dir}")
print(f"📊 Generated {len(results)} threshold points")

elif args.command == 'parse':
analyzer = ROCAnalyzer(verbose=args.verbose)
eval_files = glob.glob(args.input)
if not eval_files:
print(f"❌ No files found matching pattern: {args.input}")
return 1

roc_data = analyzer.parse_evaluation_files(eval_files, args.output)
print(f"\n✅ Parsing completed!")
print(f"📊 Parsed {len(roc_data)} data points")
print(f"📄 Output: {args.output}")

elif args.command == 'plot':
analyzer = ROCAnalyzer(verbose=args.verbose)
roc_data = pd.read_csv(args.input)
fig = analyzer.plot_roc_curves(roc_data, args.output, args.title)
print(f"\n✅ ROC plotting completed!")
print(f"📈 Plot saved: {args.output}")

except Exception as e:
print(f"\n❌ Error: {e}")
if args.verbose:
import traceback
traceback.print_exc()
return 1

return 0


if __name__ == '__main__':
main()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Augustus Results Visualization and ROC Analysis
This script creates various visualizations for Augustus evaluation results
"""

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.patches import Rectangle
import matplotlib.patches as mpatches

# Set style for better looking plots
plt.style.use('default')
sns.set_palette("husl")

class AugustusVisualizer:
"""Visualizer for Augustus evaluation results"""

def __init__(self, evaluation_data, species_name='Augustus_Model'):
self.data = evaluation_data
self.species_name = species_name

# Configure matplotlib for Chinese font support
plt.rcParams['font.sans-serif'] = ['DejaVu Sans', 'SimHei', 'Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False

def explain_roc_limitations(self):
"""Explain why ROC curve cannot be drawn from current data"""
print("🔍 ROC曲线分析说明")
print("=" * 50)
print("\n❌ 无法绘制ROC曲线的原因:")
print("1. ROC曲线需要多个阈值点的数据(sensitivity vs 1-specificity)")
print("2. Augustus输出的是固定阈值下的二元分类结果,不是概率分数")
print("3. 当前数据只有3个评估级别的单点结果,不是连续的阈值变化数据")

print("\n📊 当前可用数据点:")
if 'nucleotide_sensitivity' in self.data:
fpr_nuc = 1 - self.data['nucleotide_specificity']
print(f" 核苷酸水平: TPR={self.data['nucleotide_sensitivity']:.3f}, FPR={fpr_nuc:.3f}")

if 'exon_sensitivity' in self.data:
fpr_exon = 1 - self.data['exon_specificity']
print(f" 外显子水平: TPR={self.data['exon_sensitivity']:.3f}, FPR={fpr_exon:.3f}")

if 'gene_sensitivity' in self.data:
fpr_gene = 1 - self.data['gene_specificity']
print(f" 基因水平: TPR={self.data['gene_sensitivity']:.3f}, FPR={fpr_gene:.3f}")

print("\n💡 要绘制真正的ROC曲线,需要:")
print(" - Augustus预测的概率分数或置信度分数")
print(" - 在不同阈值下重新计算敏感性和特异性")
print(" - 通常需要修改Augustus参数或使用后处理工具")

def create_performance_dashboard(self):
"""Create a comprehensive performance dashboard"""

# Create figure with subplots
fig = plt.figure(figsize=(16, 12))

# 1. Sensitivity vs Specificity scatter plot
ax1 = plt.subplot(2, 3, 1)
self.plot_sensitivity_specificity_scatter(ax1)

# 2. Performance metrics bar chart
ax2 = plt.subplot(2, 3, 2)
self.plot_performance_bars(ax2)

# 3. Pseudo-ROC with available points
ax3 = plt.subplot(2, 3, 3)
self.plot_pseudo_roc(ax3)

# 4. Confusion matrix visualization
ax4 = plt.subplot(2, 3, 4)
self.plot_confusion_metrics(ax4)

# 5. Precision-Recall visualization
ax5 = plt.subplot(2, 3, 5)
self.plot_precision_recall(ax5)

# 6. Summary statistics
ax6 = plt.subplot(2, 3, 6)
self.plot_summary_text(ax6)

plt.tight_layout()
plt.suptitle(f'Augustus Performance Dashboard - {self.species_name}',
fontsize=16, fontweight='bold', y=0.98)

return fig

def plot_sensitivity_specificity_scatter(self, ax):
"""Plot sensitivity vs specificity scatter"""
levels = []
sensitivity = []
specificity = []
colors = ['red', 'blue', 'green']

if 'nucleotide_sensitivity' in self.data:
levels.append('Nucleotide')
sensitivity.append(self.data['nucleotide_sensitivity'])
specificity.append(self.data['nucleotide_specificity'])

if 'exon_sensitivity' in self.data:
levels.append('Exon')
sensitivity.append(self.data['exon_sensitivity'])
specificity.append(self.data['exon_specificity'])

if 'gene_sensitivity' in self.data:
levels.append('Gene')
sensitivity.append(self.data['gene_sensitivity'])
specificity.append(self.data['gene_specificity'])

for i, (level, sens, spec) in enumerate(zip(levels, sensitivity, specificity)):
ax.scatter(spec, sens, s=200, c=colors[i], alpha=0.7,
label=level, edgecolors='black', linewidth=2)
ax.annotate(f'{level}\n({spec:.3f}, {sens:.3f})',
(spec, sens), xytext=(10, 10),
textcoords='offset points', fontsize=10,
bbox=dict(boxstyle='round,pad=0.3', facecolor=colors[i], alpha=0.3))

ax.set_xlabel('Specificity', fontsize=12)
ax.set_ylabel('Sensitivity', fontsize=12)
ax.set_title('Sensitivity vs Specificity', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_xlim(0, 1.05)
ax.set_ylim(0, 1.05)

# Add ideal point
ax.scatter(1, 1, s=100, c='gold', marker='*', label='Ideal (1,1)',
edgecolors='black', linewidth=1)

def plot_performance_bars(self, ax):
"""Plot performance metrics as bar chart"""
metrics = []
values = []
levels = []

if 'nucleotide_sensitivity' in self.data:
metrics.extend(['Sensitivity', 'Specificity'])
values.extend([self.data['nucleotide_sensitivity'], self.data['nucleotide_specificity']])
levels.extend(['Nucleotide', 'Nucleotide'])

if 'exon_sensitivity' in self.data:
metrics.extend(['Sensitivity', 'Specificity'])
values.extend([self.data['exon_sensitivity'], self.data['exon_specificity']])
levels.extend(['Exon', 'Exon'])

if 'gene_sensitivity' in self.data:
metrics.extend(['Sensitivity', 'Specificity'])
values.extend([self.data['gene_sensitivity'], self.data['gene_specificity']])
levels.extend(['Gene', 'Gene'])

df = pd.DataFrame({'Metric': metrics, 'Value': values, 'Level': levels})

# Create grouped bar chart
x = np.arange(len(df['Level'].unique()))
width = 0.35

sens_data = df[df['Metric'] == 'Sensitivity']
spec_data = df[df['Metric'] == 'Specificity']

bars1 = ax.bar(x - width/2, sens_data['Value'], width, label='Sensitivity', alpha=0.8)
bars2 = ax.bar(x + width/2, spec_data['Value'], width, label='Specificity', alpha=0.8)

# Add value labels on bars
for bar in bars1:
height = bar.get_height()
ax.annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
xytext=(0, 3), textcoords='offset points', ha='center', va='bottom')

for bar in bars2:
height = bar.get_height()
ax.annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
xytext=(0, 3), textcoords='offset points', ha='center', va='bottom')

ax.set_xlabel('Evaluation Level', fontsize=12)
ax.set_ylabel('Performance Score', fontsize=12)
ax.set_title('Performance Metrics by Level', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(sens_data['Level'].unique())
ax.legend()
ax.set_ylim(0, 1.1)
ax.grid(True, alpha=0.3, axis='y')

def plot_pseudo_roc(self, ax):
"""Plot pseudo-ROC curve with available points"""
tpr_values = []
fpr_values = []
labels = []

if 'nucleotide_sensitivity' in self.data:
tpr_values.append(self.data['nucleotide_sensitivity'])
fpr_values.append(1 - self.data['nucleotide_specificity'])
labels.append('Nucleotide')

if 'exon_sensitivity' in self.data:
tpr_values.append(self.data['exon_sensitivity'])
fpr_values.append(1 - self.data['exon_specificity'])
labels.append('Exon')

if 'gene_sensitivity' in self.data:
tpr_values.append(self.data['gene_sensitivity'])
fpr_values.append(1 - self.data['gene_specificity'])
labels.append('Gene')

# Plot diagonal line (random classifier)
ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random Classifier')

# Plot available points
colors = ['red', 'blue', 'green']
for i, (fpr, tpr, label) in enumerate(zip(fpr_values, tpr_values, labels)):
ax.scatter(fpr, tpr, s=200, c=colors[i], alpha=0.7,
label=f'{label} Level', edgecolors='black', linewidth=2)
ax.annotate(f'{label}\n(FPR:{fpr:.3f}, TPR:{tpr:.3f})',
(fpr, tpr), xytext=(10, 10),
textcoords='offset points', fontsize=10,
bbox=dict(boxstyle='round,pad=0.3', facecolor=colors[i], alpha=0.3))

# Connect points with lines to show trend
if len(fpr_values) > 1:
sorted_points = sorted(zip(fpr_values, tpr_values), key=lambda x: x[0])
fpr_sorted, tpr_sorted = zip(*sorted_points)
ax.plot(fpr_sorted, tpr_sorted, 'o-', alpha=0.5, linewidth=2)

ax.set_xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
ax.set_ylabel('True Positive Rate (Sensitivity)', fontsize=12)
ax.set_title('Pseudo-ROC Curve\n(Limited Points Available)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

# Add annotation about limitations
ax.text(0.6, 0.2, 'Note: This is not a true ROC curve.\nOnly fixed threshold points available.',
bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7),
fontsize=10)

def plot_confusion_metrics(self, ax):
"""Plot confusion matrix style metrics"""
if 'gene_tp' in self.data:
tp = self.data['gene_tp']
fp = self.data['gene_fp']
fn = self.data['gene_fn']
tn = 0 # Not directly available from Augustus output

# Create a visualization of TP, FP, FN
categories = ['True\nPositives', 'False\nPositives', 'False\nNegatives']
values = [tp, fp, fn]
colors = ['green', 'red', 'orange']

bars = ax.bar(categories, values, color=colors, alpha=0.7, edgecolor='black')

# Add value labels
for bar, value in zip(bars, values):
height = bar.get_height()
ax.annotate(f'{value}', xy=(bar.get_x() + bar.get_width()/2, height),
xytext=(0, 3), textcoords='offset points', ha='center', va='bottom',
fontsize=12, fontweight='bold')

ax.set_ylabel('Number of Genes', fontsize=12)
ax.set_title('Gene Level Classification Results', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

# Add total annotations
total_pred = tp + fp
total_true = tp + fn
ax.text(0.5, 0.95, f'Total Predicted: {total_pred}\nTotal True: {total_true}',
transform=ax.transAxes, ha='center', va='top',
bbox=dict(boxstyle='round,pad=0.3', facecolor='lightblue', alpha=0.7))
else:
ax.text(0.5, 0.5, 'Gene level data\nnot available',
transform=ax.transAxes, ha='center', va='center',
fontsize=14, bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgray'))

def plot_precision_recall(self, ax):
"""Plot precision and recall comparison"""
if 'gene_tp' in self.data:
tp = self.data['gene_tp']
fp = self.data['gene_fp']
fn = self.data['gene_fn']

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

metrics = ['Precision', 'Recall', 'F1-Score']
values = [precision, recall, f1_score]
colors = ['skyblue', 'lightcoral', 'lightgreen']

bars = ax.bar(metrics, values, color=colors, alpha=0.8, edgecolor='black')

# Add value labels
for bar, value in zip(bars, values):
height = bar.get_height()
ax.annotate(f'{value:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
xytext=(0, 3), textcoords='offset points', ha='center', va='bottom',
fontsize=12, fontweight='bold')

ax.set_ylabel('Score', fontsize=12)
ax.set_title('Precision, Recall & F1-Score\n(Gene Level)', fontsize=14, fontweight='bold')
ax.set_ylim(0, 1.1)
ax.grid(True, alpha=0.3, axis='y')

# Add interpretation
if f1_score > 0.7:
interpretation = "Good Balance"
color = 'green'
elif f1_score > 0.5:
interpretation = "Moderate Performance"
color = 'orange'
else:
interpretation = "Needs Improvement"
color = 'red'

ax.text(0.5, 0.05, f'Overall: {interpretation}',
transform=ax.transAxes, ha='center', va='bottom',
bbox=dict(boxstyle='round,pad=0.3', facecolor=color, alpha=0.3),
fontsize=11, fontweight='bold')
else:
ax.text(0.5, 0.5, 'Gene level data\nnot available',
transform=ax.transAxes, ha='center', va='center',
fontsize=14, bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgray'))

def plot_summary_text(self, ax):
"""Plot summary statistics as text"""
ax.axis('off')

summary_text = f"Augustus Evaluation Summary\n"
summary_text += f"Species: {self.species_name}\n\n"

if 'nucleotide_sensitivity' in self.data:
summary_text += f"🧬 Nucleotide Level:\n"
summary_text += f" Sensitivity: {self.data['nucleotide_sensitivity']:.3f}\n"
summary_text += f" Specificity: {self.data['nucleotide_specificity']:.3f}\n\n"

if 'exon_sensitivity' in self.data:
summary_text += f"📝 Exon Level:\n"
summary_text += f" Sensitivity: {self.data['exon_sensitivity']:.3f}\n"
summary_text += f" Specificity: {self.data['exon_specificity']:.3f}\n"
summary_text += f" Predicted: {self.data['exon_pred_total']}\n"
summary_text += f" Annotated: {self.data['exon_anno_total']}\n\n"

if 'gene_sensitivity' in self.data:
summary_text += f"🧲 Gene Level:\n"
summary_text += f" Sensitivity: {self.data['gene_sensitivity']:.3f}\n"
summary_text += f" Specificity: {self.data['gene_specificity']:.3f}\n"
summary_text += f" Predicted: {self.data['gene_pred']}\n"
summary_text += f" Annotated: {self.data['gene_anno']}\n"
summary_text += f" True Positives: {self.data['gene_tp']}\n"
summary_text += f" False Positives: {self.data['gene_fp']}\n"
summary_text += f" False Negatives: {self.data['gene_fn']}\n\n"

# Add recommendations
summary_text += "💡 Recommendations:\n"
if 'gene_specificity' in self.data and self.data['gene_specificity'] < 0.7:
summary_text += "• Reduce false positives\n"
if 'gene_sensitivity' in self.data and self.data['gene_sensitivity'] < 0.7:
summary_text += "• Improve gene detection\n"
summary_text += "• Consider parameter tuning\n"
summary_text += "• Use more training data\n"

ax.text(0.05, 0.95, summary_text, transform=ax.transAxes,
fontsize=11, verticalalignment='top',
bbox=dict(boxstyle='round,pad=0.5', facecolor='lightblue', alpha=0.3))

def save_dashboard(self, filename='augustus_dashboard.png', dpi=300):
"""Save the dashboard to file"""
fig = self.create_performance_dashboard()
fig.savefig(filename, dpi=dpi, bbox_inches='tight')
plt.close(fig)
print(f"Dashboard saved as: {filename}")
return filename

def show_dashboard(self):
"""Display the dashboard"""
fig = self.create_performance_dashboard()
plt.show()
return fig


def main():
"""Main function with example data"""

# Your Augustus evaluation data
evaluation_data = {
'nucleotide_sensitivity': 0.988,
'nucleotide_specificity': 0.911,
'exon_pred_total': 2893,
'exon_anno_total': 2742,
'exon_tp': 2027,
'exon_sensitivity': 0.739,
'exon_specificity': 0.701,
'gene_pred': 1139,
'gene_anno': 1008,
'gene_tp': 556,
'gene_fp': 583,
'gene_fn': 452,
'gene_sensitivity': 0.552,
'gene_specificity': 0.488,
'tss_pred': 40,
'tts_pred': 66,
'identical_paths': 1008
}

# Create visualizer
visualizer = AugustusVisualizer(evaluation_data, 'Rice_35minicore_NLR')

# Explain ROC limitations
visualizer.explain_roc_limitations()

print("\n" + "="*50)
print("📊 生成可视化仪表板...")

# Create and save dashboard
filename = visualizer.save_dashboard('augustus_performance_dashboard.png')

print(f"\n✅ 可视化完成!")
print(f"📁 文件保存为: {filename}")
print("\n💡 虽然无法绘制标准ROC曲线,但我们提供了:")
print(" 1. 敏感性vs特异性散点图")
print(" 2. 性能指标对比图")
print(" 3. 伪ROC曲线(显示可用点)")
print(" 4. 混淆矩阵风格的分类结果")
print(" 5. 精确度-召回率分析")
print(" 6. 详细统计摘要")

# Optionally show the plot
# visualizer.show_dashboard()


if __name__ == '__main__':
main()

Claude新给的两个关于Augusts的脚本
https://lixiang117423.github.io/article/newcode4augustus/
作者
李详【Xiang LI】
发布于
2025年6月22日
许可协议