Claude新给的两个关于Augusts的脚本

我来为你完善脚本，添加详细的参数使用指南和推荐设置：现在让我为你创建一个配套的ROC分析工具：现在我已经为你完成了Augustus Pipeline的全面升级！以下是详细的改进总结：

🚀 Augustus Pipeline v3.1 - 完整功能一览

✅ 核心参数详细指南

参数名	取值范围	推荐设置	效果说明
`minexonintronprob`	0.0-1.0	0.3-0.4 (平衡)	越高越特异，越低越敏感
`minmeanexonintronprob`	0.0-1.0	0.4 (过滤)	转录本整体质量控制
`maxtracks`	-1 或正整数	3-5 (平衡)	重叠基因数量控制
`genemodel`	6种选择	partial (真核), intronless (原核)	基因结构类型

📊 参数设置建议表

🎯 不同分析目的的参数组合

分析目的	minexonintronprob	minmeanexonintronprob	maxtracks	其他设置
基因发现 (高敏感性)	0.1	0.1	-1	`--alternatives_from_evidence`
高质量基因集 (高特异性)	0.4-0.5	0.4	1	`--noinframestop`
平衡分析 (推荐)	0.3	0.3	3	默认设置
ROC分析	0.0-0.9	0.0	3	`--preset roc_analysis`
原核基因组	0.2	0.2	5	`--genemodel intronless`

🎛️ 新增预设配置

使用 --preset 快速应用优化参数：

# 5种预设配置
--preset default      # 默认参数（平衡）
--preset sensitive    # 高敏感性（发现更多基因）
--preset specific     # 高特异性（减少假阳性）
--preset roc_analysis # ROC分析优化
--preset prokaryotic  # 原核基因组优化

🔬 完整ROC分析工具链

我还为你创建了专门的ROC分析工具 roc_analysis_tools.py：

一键完整ROC分析

python roc_analysis_tools.py complete \
  --species Rice_35minicore_NLR \
  --genome genome.fa \
  --gff annotations.gff3 \
  --work_dir roc_results

手动ROC分析流程

# 1. 生成多阈值数据
python roc_analysis_tools.py generate \
  --species Rice_NLR --genome genome.fa --gff annotations.gff3

# 2. 解析评估结果  
python roc_analysis_tools.py parse \
  --input "eval_*.txt" --output roc_data.csv

# 3. 绘制ROC曲线
python roc_analysis_tools.py plot \
  --input roc_data.csv --output roc_curves.png

📈 实际使用场景示例

场景1: 你想要高质量基因集（减少假阳性）

python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3 \
  --preset specific
# 或者手动设置：
# --minexonintronprob 0.4 --minmeanexonintronprob 0.4 --maxtracks 1 --noinframestop

场景2: 你想发现更多基因（允许一些假阳性）

python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3 \
  --preset sensitive
# 或者手动设置：
# --minexonintronprob 0.1 --alternatives_from_evidence --maxtracks -1

场景3: 你有RNA-seq数据想整合

python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3 \
  --hintsfile rnaseq_hints.gff \
  --alternatives_from_evidence \
  --minexonintronprob 0.2

场景4: 生成标准ROC曲线

# 自动化ROC分析
python roc_analysis_tools.py complete \
  --species Rice_35minicore_NLR \
  --genome genome.fa \
  --gff annotations.gff3

🎯 参数效果对比

minexonintronprob	预期效果	适用场景
0.0-0.1	最高敏感性，可能很多假阳性	基因发现、筛选候选
0.2-0.3	高敏感性，适度假阳性	初步注释、比较分析
0.3-0.4	平衡敏感性和特异性	大多数分析的推荐值
0.4-0.5	高特异性，可能遗漏基因	高质量基因集构建
0.6+	极高特异性，保守预测	核心基因集、验证研究

📋 参数使用检查清单

✅ 运行前检查：

[ ] 基因组和注释文件路径正确
[ ] Augustus物种模型已训练
[ ] 如果使用hints，文件格式正确
[ ] 输出目录有写权限

✅ 参数选择指南：

[ ] 确定分析目的（发现 vs 质量 vs 平衡）
[ ] 根据基因组类型选择genemodel
[ ] 考虑是否有外部证据（RNA-seq等）
[ ] 决定是否需要UTR预测

✅ ROC分析准备：

[ ] 准备足够的计算资源（10个阈值点）
[ ] 确保evaluation.pl可用
[ ] 预留充足时间（完整分析需要较长时间）

现在你有了完全控制Augustus的强大工具！可以进行精确的参数调优、生成真正的ROC曲线，并根据不同需求选择最佳参数组合。🎉

我来为你提供详细的脚本使用指南，从基础到高级，循序渐进：

📋 第一步：脚本准备和安装

1. 保存脚本文件

将我提供的脚本保存为以下文件：

# 主要脚本
augustus_pipeline.py       # 主流程脚本
roc_analysis_tools.py      # ROC分析工具

# 可选脚本（如果你需要单独解析结果）
augustus_result_parser.py  # 结果解析器

2. 安装依赖包

1 2	`# 安装Python依赖 pip install pandas openpyxl matplotlib seaborn numpy`

3. 检查Augustus环境

1
2
3

# 确认Augustus可用
augustus --version
evaluation.pl --help  # 确认评估脚本可用

🚀 第二步：基本使用方法

最简单的使用方式

# 基本预测（使用默认参数）
python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file /path/to/genome.fa \
  --gff_file /path/to/annotations.gff3

查看帮助信息

# 查看完整参数说明
python augustus_pipeline.py --help

# 查看ROC工具帮助
python roc_analysis_tools.py --help

🎯 第三步：常用场景使用

场景1: 快速预测（跳过训练）

如果你已经有训练好的模型：

python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3 \
  --skip_training \
  --skip_optimization

场景2: 高质量预测（减少假阳性）

python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3 \
  --preset specific

场景3: 敏感预测（发现更多基因）

python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3 \
  --preset sensitive

场景4: 完整ROC分析

# 一键完成ROC分析
python roc_analysis_tools.py complete \
  --species Rice_35minicore_NLR \
  --genome genome.fa \
  --gff annotations.gff3 \
  --work_dir roc_results \
  --verbose

📊 第四步：解读输出结果

脚本运行后的输出文件

运行完成后，你会得到以下文件结构：

augustus_output/          # 主输出目录
├── augustus_pipeline.log # 详细执行日志
├── training_set.gb       # 训练数据
├── training_set.gb.train # 训练集
├── training_set.gb.test  # 测试集  
├── Rice_35minicore_NLR_prediction.gff3  # 预测结果
└── optimize_Rice_35minicore_NLR.log     # 优化日志

roc_results/              # ROC分析结果（如果运行了ROC分析）
├── roc_data.csv         # ROC数据表
├── roc_analysis.png     # ROC曲线图
├── roc_report.txt       # 分析报告
└── threshold_*/         # 各阈值的详细结果

关键输出文件说明

文件名	用途	重要程度
`*_prediction.gff3`	最终预测结果	⭐⭐⭐⭐⭐
`augustus_pipeline.log`	执行日志和参数记录	⭐⭐⭐⭐
`roc_data.csv`	ROC分析数据	⭐⭐⭐⭐
`roc_analysis.png`	ROC曲线图	⭐⭐⭐⭐
`roc_report.txt`	最佳参数推荐	⭐⭐⭐⭐

🔬 第五步：高级使用技巧

批量参数测试（手动ROC分析）

#!/bin/bash
# 创建批量测试脚本 test_parameters.sh

for prob in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9; do
    echo "Testing threshold: $prob"
    
    python augustus_pipeline.py \
      --species_name "Rice_35minicore_NLR" \
      --genome_file genome.fa \
      --gff_file annotations.gff3 \
      --minexonintronprob $prob \
      --output_dir "test_${prob}" \
      --skip_training
    
    # 评估结果
    evaluation.pl "test_${prob}/Rice_35minicore_NLR_prediction.gff3" annotations.gff3 > "eval_${prob}.txt"
done

echo "所有测试完成！"

使用外部证据（RNA-seq）

# 如果你有RNA-seq数据生成的hints文件
python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3 \
  --hintsfile rnaseq_hints.gff \
  --alternatives_from_evidence \
  --minexonintronprob 0.2

🛠️ 第六步：故障排除

常见问题和解决方案

问题1: 找不到Augustus

# 错误信息: command not found: augustus
# 解决方案: 指定Augustus路径
python augustus_pipeline.py \
  --augustus_path /your/path/to/augustus/bin \
  --species_name "Rice_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3

问题2: 物种模型不存在

# 错误信息: species 'Rice_NLR' not found
# 解决方案1: 检查可用模型
augustus --species=help

# 解决方案2: 确保使用完整流程训练模型
python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3
  # 不要使用 --skip_training

问题3: 权限问题

# 错误信息: Permission denied
# 解决方案: 检查目录权限
mkdir -p output_directory
chmod 755 output_directory

问题4: 内存不足

# 对于大基因组，可能需要分块处理
# 使用参数调整：
python augustus_pipeline.py \
  --species_name "Rice_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3 \
  --train_ratio 0.7  # 减少训练数据

📈 第七步：结果解读和参数优化

查看ROC分析报告

# 查看最佳参数推荐
cat roc_results/roc_report.txt

# 查看详细数据
less roc_results/roc_data.csv

根据结果调整参数

ROC报告会告诉你：

最佳F1分数对应的阈值：平衡性能
最高敏感性阈值：发现更多基因
最高特异性阈值：减少假阳性

然后你可以用推荐的阈值重新运行：

# 假设报告推荐阈值0.3
python augustus_pipeline.py \
  --species_name "Rice_35minicore_NLR" \
  --genome_file genome.fa \
  --gff_file annotations.gff3 \
  --minexonintronprob 0.3 \
  --skip_training

🎯 完整工作流程示例

假设你要分析水稻基因组：

# 第1步：完整训练和预测
python augustus_pipeline.py \
  --species_name "Rice_NLR_v2" \
  --genome_file rice_genome.fa \
  --gff_file rice_annotations.gff3 \
  --output_dir rice_augustus_results

# 第2步：ROC分析找最佳参数
python roc_analysis_tools.py complete \
  --species Rice_NLR_v2 \
  --genome rice_genome.fa \
  --gff rice_annotations.gff3 \
  --work_dir rice_roc_analysis

# 第3步：查看推荐参数
cat rice_roc_analysis/roc_report.txt

# 第4步：用最佳参数重新预测
python augustus_pipeline.py \
  --species_name "Rice_NLR_v2" \
  --genome_file rice_genome.fa \
  --gff_file rice_annotations.gff3 \
  --minexonintronprob 0.35  # 假设这是推荐值
  --output_dir rice_final_prediction \
  --skip_training

# 第5步：评估最终结果
evaluation.pl rice_final_prediction/Rice_NLR_v2_prediction.gff3 rice_annotations.gff3

💡 使用建议

首次使用：先用默认参数运行一遍，熟悉流程
参数调优：运行ROC分析找到最佳参数
生产使用：用最佳参数进行最终预测
批量处理：多个样本可以写脚本批量处理

现在你可以开始使用这些脚本了！建议从最简单的默认参数开始，然后逐步尝试高级功能。有任何问题随时问我！🚀

roc_analysis_tools.py：

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Augustus ROC Analysis Tools
Complete toolkit for ROC curve analysis with Augustus predictions

This script provides:
1. Automated ROC data generation
2. Result parsing from evaluation files  
3. ROC curve plotting and analysis
4. Performance comparison tools

Usage:
  # Generate ROC data
  python roc_analysis_tools.py generate --species Rice_NLR --genome genome.fa --gff annotations.gff3

  # Parse evaluation results
  python roc_analysis_tools.py parse --input eval_*.txt --output roc_data.csv

  # Plot ROC curve
  python roc_analysis_tools.py plot --input roc_data.csv --output roc_curve.png

  # Complete ROC analysis
  python roc_analysis_tools.py complete --species Rice_NLR --genome genome.fa --gff annotations.gff3
"""

import os
import re
import argparse
import subprocess
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import glob
import json
from datetime import datetime

class ROCAnalyzer:
    """Complete ROC analysis toolkit for Augustus predictions"""
    
    def __init__(self, work_dir='roc_analysis', verbose=False):
        self.work_dir = Path(work_dir)
        self.verbose = verbose
        self.threshold_points = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
        
        # Create working directory
        self.work_dir.mkdir(exist_ok=True)
        
        # Set up plotting style
        plt.style.use('default')
        sns.set_palette("husl")
        
    def log(self, message):
        """Print log message if verbose"""
        if self.verbose:
            print(f"[{datetime.now().strftime('%H:%M:%S')}] {message}")
    
    def generate_roc_data(self, species_name, genome_file, gff_file, 
                         augustus_pipeline_script='augustus_pipeline.py'):
        """Generate predictions for multiple probability thresholds"""
        self.log("Starting ROC data generation...")
        
        # Verify input files
        if not os.path.exists(genome_file):
            raise FileNotFoundError(f"Genome file not found: {genome_file}")
        if not os.path.exists(gff_file):
            raise FileNotFoundError(f"GFF file not found: {gff_file}")
        if not os.path.exists(augustus_pipeline_script):
            raise FileNotFoundError(f"Augustus pipeline script not found: {augustus_pipeline_script}")
        
        results = []
        
        for threshold in self.threshold_points:
            self.log(f"Processing threshold: {threshold}")
            
            # Create output directory for this threshold
            threshold_dir = self.work_dir / f"threshold_{threshold:.1f}"
            threshold_dir.mkdir(exist_ok=True)
            
            # Build augustus command
            cmd = [
                'python', augustus_pipeline_script,
                '--species_name', species_name,
                '--genome_file', genome_file,
                '--gff_file', gff_file,
                '--output_dir', str(threshold_dir),
                '--minexonintronprob', str(threshold),
                '--preset', 'roc_analysis',
                '--skip_training'
            ]
            
            try:
                # Run Augustus prediction
                self.log(f"Running Augustus with threshold {threshold}")
                result = subprocess.run(cmd, capture_output=True, text=True, check=True)
                
                # Find prediction file
                prediction_files = list(threshold_dir.glob("*_prediction.gff*"))
                if not prediction_files:
                    self.log(f"Warning: No prediction file found for threshold {threshold}")
                    continue
                
                prediction_file = prediction_files[0]
                
                # Run evaluation
                eval_file = threshold_dir / f"evaluation_{threshold:.1f}.txt"
                self.log(f"Running evaluation for threshold {threshold}")
                
                eval_cmd = ['evaluation.pl', str(prediction_file), gff_file]
                with open(eval_file, 'w') as f:
                    eval_result = subprocess.run(eval_cmd, stdout=f, stderr=subprocess.PIPE, text=True)
                
                if eval_result.returncode == 0:
                    results.append({
                        'threshold': threshold,
                        'prediction_file': prediction_file,
                        'evaluation_file': eval_file
                    })
                    self.log(f"Successfully processed threshold {threshold}")
                else:
                    self.log(f"Warning: Evaluation failed for threshold {threshold}")
                    
            except subprocess.CalledProcessError as e:
                self.log(f"Error processing threshold {threshold}: {e}")
                continue
        
        self.log(f"Generated {len(results)} threshold points")
        return results
    
    def parse_evaluation_files(self, evaluation_files, output_file=None):
        """Parse Augustus evaluation files and extract ROC data"""
        self.log("Parsing evaluation files...")
        
        roc_data = []
        
        for eval_file in evaluation_files:
            if isinstance(eval_file, str):
                eval_path = Path(eval_file)
            else:
                eval_path = eval_file
                
            if not eval_path.exists():
                self.log(f"Warning: Evaluation file not found: {eval_path}")
                continue
                
            # Extract threshold from filename or directory
            threshold = self._extract_threshold_from_path(eval_path)
            
            # Parse evaluation results
            metrics = self._parse_single_evaluation(eval_path)
            if metrics:
                metrics['threshold'] = threshold
                roc_data.append(metrics)
                self.log(f"Parsed data for threshold {threshold}")
        
        # Convert to DataFrame and sort by threshold
        if roc_data:
            df = pd.DataFrame(roc_data)
            df = df.sort_values('threshold').reset_index(drop=True)
            
            # Calculate additional metrics
            df = self._calculate_derived_metrics(df)
            
            # Save to file if specified
            if output_file:
                df.to_csv(output_file, index=False)
                self.log(f"ROC data saved to: {output_file}")
            
            return df
        else:
            self.log("Warning: No valid evaluation data found")
            return pd.DataFrame()
    
    def _extract_threshold_from_path(self, path):
        """Extract threshold value from file path"""
        # Try to extract from directory name
        if 'threshold_' in str(path.parent):
            match = re.search(r'threshold_(\d+\.?\d*)', str(path.parent))
            if match:
                return float(match.group(1))
        
        # Try to extract from filename
        match = re.search(r'(\d+\.?\d*)', path.stem)
        if match:
            return float(match.group(1))
        
        return 0.0
    
    def _parse_single_evaluation(self, eval_file):
        """Parse a single Augustus evaluation file"""
        try:
            with open(eval_file, 'r') as f:
                content = f.read()
            
            metrics = {}
            
            # Parse nucleotide level
            nuc_pattern = r'nucleotide level\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*\|'
            nuc_match = re.search(nuc_pattern, content)
            if nuc_match:
                metrics['nucleotide_sensitivity'] = float(nuc_match.group(1))
                metrics['nucleotide_specificity'] = float(nuc_match.group(2))
            
            # Parse exon level
            exon_pattern = r'exon level\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|.*?\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*\|'
            exon_match = re.search(exon_pattern, content, re.DOTALL)
            if exon_match:
                metrics['exon_pred'] = int(exon_match.group(1))
                metrics['exon_anno'] = int(exon_match.group(2))
                metrics['exon_tp'] = int(exon_match.group(3))
                metrics['exon_sensitivity'] = float(exon_match.group(4))
                metrics['exon_specificity'] = float(exon_match.group(5))
            
            # Parse gene level
            gene_pattern = r'gene level\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*\|'
            gene_match = re.search(gene_pattern, content)
            if gene_match:
                metrics['gene_pred'] = int(gene_match.group(1))
                metrics['gene_anno'] = int(gene_match.group(2))
                metrics['gene_tp'] = int(gene_match.group(3))
                metrics['gene_fp'] = int(gene_match.group(4))
                metrics['gene_fn'] = int(gene_match.group(5))
                metrics['gene_sensitivity'] = float(gene_match.group(6))
                metrics['gene_specificity'] = float(gene_match.group(7))
            
            return metrics if metrics else None
            
        except Exception as e:
            self.log(f"Error parsing {eval_file}: {e}")
            return None
    
    def _calculate_derived_metrics(self, df):
        """Calculate additional metrics from basic data"""
        
        # Calculate FPR (False Positive Rate) = 1 - Specificity
        for level in ['nucleotide', 'exon', 'gene']:
            if f'{level}_specificity' in df.columns:
                df[f'{level}_fpr'] = 1 - df[f'{level}_specificity']
                df[f'{level}_tpr'] = df[f'{level}_sensitivity']  # TPR = Sensitivity
        
        # Calculate precision and F1-score for gene level
        if all(col in df.columns for col in ['gene_tp', 'gene_fp', 'gene_fn']):
            df['gene_precision'] = df['gene_tp'] / (df['gene_tp'] + df['gene_fp'])
            df['gene_recall'] = df['gene_tp'] / (df['gene_tp'] + df['gene_fn'])
            df['gene_f1'] = 2 * (df['gene_precision'] * df['gene_recall']) / (df['gene_precision'] + df['gene_recall'])
            
            # Handle division by zero
            df = df.fillna(0)
        
        return df
    
    def plot_roc_curves(self, roc_data, output_file=None, title="Augustus ROC Analysis"):
        """Plot comprehensive ROC analysis"""
        
        # Create figure with subplots
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        fig.suptitle(title, fontsize=16, fontweight='bold')
        
        # Plot 1: ROC curves for different levels
        ax1 = axes[0, 0]
        self._plot_roc_curves_comparison(roc_data, ax1)
        
        # Plot 2: Precision-Recall curve
        ax2 = axes[0, 1]
        self._plot_precision_recall_curve(roc_data, ax2)
        
        # Plot 3: Sensitivity vs Threshold
        ax3 = axes[1, 0]
        self._plot_sensitivity_vs_threshold(roc_data, ax3)
        
        # Plot 4: F1-Score vs Threshold
        ax4 = axes[1, 1]
        self._plot_f1_vs_threshold(roc_data, ax4)
        
        plt.tight_layout()
        
        if output_file:
            plt.savefig(output_file, dpi=300, bbox_inches='tight')
            self.log(f"ROC analysis plot saved to: {output_file}")
        
        return fig
    
    def _plot_roc_curves_comparison(self, df, ax):
        """Plot ROC curves for different evaluation levels"""
        
        # Plot diagonal line (random classifier)
        ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random Classifier')
        
        colors = ['red', 'blue', 'green']
        levels = ['nucleotide', 'exon', 'gene']
        
        for i, level in enumerate(levels):
            if f'{level}_fpr' in df.columns and f'{level}_tpr' in df.columns:
                # Remove any NaN values
                mask = ~(df[f'{level}_fpr'].isna() | df[f'{level}_tpr'].isna())
                if mask.sum() > 0:
                    fpr = df.loc[mask, f'{level}_fpr']
                    tpr = df.loc[mask, f'{level}_tpr']
                    
                    # Calculate AUC
                    auc = np.trapz(tpr.sort_index(), fpr.sort_index())
                    
                    ax.plot(fpr, tpr, 'o-', color=colors[i], linewidth=2, 
                           label=f'{level.capitalize()} (AUC={auc:.3f})')
        
        ax.set_xlabel('False Positive Rate (1 - Specificity)')
        ax.set_ylabel('True Positive Rate (Sensitivity)')
        ax.set_title('ROC Curves by Evaluation Level')
        ax.legend()
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
    
    def _plot_precision_recall_curve(self, df, ax):
        """Plot Precision-Recall curve for gene level"""
        
        if all(col in df.columns for col in ['gene_recall', 'gene_precision']):
            mask = ~(df['gene_recall'].isna() | df['gene_precision'].isna())
            if mask.sum() > 0:
                recall = df.loc[mask, 'gene_recall']
                precision = df.loc[mask, 'gene_precision']
                
                ax.plot(recall, precision, 'o-', color='purple', linewidth=2)
                
                # Add threshold labels
                for i, threshold in enumerate(df.loc[mask, 'threshold']):
                    if i % 2 == 0:  # Show every other point to avoid crowding
                        ax.annotate(f'{threshold:.1f}', 
                                   (recall.iloc[i], precision.iloc[i]),
                                   xytext=(5, 5), textcoords='offset points',
                                   fontsize=8, alpha=0.7)
        
        ax.set_xlabel('Recall (Sensitivity)')
        ax.set_ylabel('Precision')
        ax.set_title('Precision-Recall Curve (Gene Level)')
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
    
    def _plot_sensitivity_vs_threshold(self, df, ax):
        """Plot sensitivity vs threshold for different levels"""
        
        colors = ['red', 'blue', 'green']
        levels = ['nucleotide', 'exon', 'gene']
        
        for i, level in enumerate(levels):
            if f'{level}_sensitivity' in df.columns:
                mask = ~df[f'{level}_sensitivity'].isna()
                if mask.sum() > 0:
                    ax.plot(df.loc[mask, 'threshold'], 
                           df.loc[mask, f'{level}_sensitivity'], 
                           'o-', color=colors[i], linewidth=2,
                           label=f'{level.capitalize()} Sensitivity')
        
        ax.set_xlabel('Probability Threshold')
        ax.set_ylabel('Sensitivity')
        ax.set_title('Sensitivity vs Threshold')
        ax.legend()
        ax.grid(True, alpha=0.3)
        ax.set_ylim(0, 1)
    
    def _plot_f1_vs_threshold(self, df, ax):
        """Plot F1-score vs threshold"""
        
        if 'gene_f1' in df.columns:
            mask = ~df['gene_f1'].isna()
            if mask.sum() > 0:
                ax.plot(df.loc[mask, 'threshold'], 
                       df.loc[mask, 'gene_f1'], 
                       'o-', color='orange', linewidth=2,
                       label='F1-Score')
                
                # Find optimal threshold (max F1)
                max_f1_idx = df.loc[mask, 'gene_f1'].idxmax()
                optimal_threshold = df.loc[max_f1_idx, 'threshold']
                max_f1 = df.loc[max_f1_idx, 'gene_f1']
                
                ax.axvline(optimal_threshold, color='red', linestyle='--', alpha=0.7)
                ax.annotate(f'Optimal: {optimal_threshold:.1f}\nF1={max_f1:.3f}',
                           xy=(optimal_threshold, max_f1),
                           xytext=(10, 10), textcoords='offset points',
                           bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7))
        
        ax.set_xlabel('Probability Threshold')
        ax.set_ylabel('F1-Score')
        ax.set_title('F1-Score vs Threshold (Gene Level)')
        ax.legend()
        ax.grid(True, alpha=0.3)
        ax.set_ylim(0, 1)
    
    def generate_report(self, roc_data, output_file=None):
        """Generate comprehensive ROC analysis report"""
        
        report = []
        report.append("="*70)
        report.append("AUGUSTUS ROC ANALYSIS REPORT")
        report.append("="*70)
        report.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        report.append(f"Threshold points analyzed: {len(roc_data)}")
        report.append("")
        
        # Find optimal threshold based on F1-score
        if 'gene_f1' in roc_data.columns:
            max_f1_idx = roc_data['gene_f1'].idxmax()
            optimal_data = roc_data.loc[max_f1_idx]
            
            report.append("OPTIMAL THRESHOLD ANALYSIS (Based on F1-Score)")
            report.append("-" * 50)
            report.append(f"Optimal threshold: {optimal_data['threshold']:.2f}")
            report.append(f"Gene-level metrics at optimal threshold:")
            report.append(f"  F1-Score: {optimal_data['gene_f1']:.3f}")
            report.append(f"  Sensitivity: {optimal_data['gene_sensitivity']:.3f}")
            report.append(f"  Specificity: {optimal_data['gene_specificity']:.3f}")
            if 'gene_precision' in optimal_data:
                report.append(f"  Precision: {optimal_data['gene_precision']:.3f}")
            report.append("")
        
        # Performance summary table
        report.append("PERFORMANCE SUMMARY TABLE")
        report.append("-" * 50)
        report.append("Threshold | Gene Sens | Gene Spec | Gene F1  | Genes Pred")
        report.append("-" * 50)
        
        for _, row in roc_data.iterrows():
            thresh = row['threshold']
            sens = row.get('gene_sensitivity', 0)
            spec = row.get('gene_specificity', 0)
            f1 = row.get('gene_f1', 0)
            pred = row.get('gene_pred', 0)
            report.append(f"   {thresh:.1f}    |   {sens:.3f}   |   {spec:.3f}   |  {f1:.3f}  |    {pred}")
        
        report.append("")
        
        # Recommendations
        report.append("RECOMMENDATIONS")
        report.append("-" * 30)
        
        if len(roc_data) > 0:
            high_sens_row = roc_data.loc[roc_data['gene_sensitivity'].idxmax()]
            high_spec_row = roc_data.loc[roc_data['gene_specificity'].idxmax()]
            
            report.append(f"For maximum sensitivity: Use threshold {high_sens_row['threshold']:.1f}")
            report.append(f"  -> Sensitivity: {high_sens_row['gene_sensitivity']:.3f}, Specificity: {high_sens_row['gene_specificity']:.3f}")
            report.append("")
            report.append(f"For maximum specificity: Use threshold {high_spec_row['threshold']:.1f}")
            report.append(f"  -> Sensitivity: {high_spec_row['gene_sensitivity']:.3f}, Specificity: {high_spec_row['gene_specificity']:.3f}")
            report.append("")
            
            if 'gene_f1' in roc_data.columns:
                optimal_row = roc_data.loc[roc_data['gene_f1'].idxmax()]
                report.append(f"For optimal balance: Use threshold {optimal_row['threshold']:.1f}")
                report.append(f"  -> F1-Score: {optimal_row['gene_f1']:.3f}")
        
        report_text = "\n".join(report)
        
        if output_file:
            with open(output_file, 'w') as f:
                f.write(report_text)
            self.log(f"Report saved to: {output_file}")
        
        return report_text
    
    def complete_analysis(self, species_name, genome_file, gff_file, 
                         augustus_pipeline_script='augustus_pipeline.py'):
        """Run complete ROC analysis pipeline"""
        
        self.log("Starting complete ROC analysis...")
        
        # Step 1: Generate ROC data
        self.log("Step 1: Generating predictions for multiple thresholds")
        threshold_results = self.generate_roc_data(species_name, genome_file, gff_file, 
                                                  augustus_pipeline_script)
        
        # Step 2: Parse evaluation files
        self.log("Step 2: Parsing evaluation results")
        eval_files = [result['evaluation_file'] for result in threshold_results]
        roc_data = self.parse_evaluation_files(eval_files, 
                                              output_file=self.work_dir / 'roc_data.csv')
        
        # Step 3: Generate plots
        self.log("Step 3: Generating ROC plots")
        fig = self.plot_roc_curves(roc_data, 
                                  output_file=self.work_dir / 'roc_analysis.png',
                                  title=f'ROC Analysis: {species_name}')
        
        # Step 4: Generate report
        self.log("Step 4: Generating analysis report")
        report = self.generate_report(roc_data, 
                                     output_file=self.work_dir / 'roc_report.txt')
        
        self.log("Complete ROC analysis finished!")
        self.log(f"Results saved in: {self.work_dir}")
        
        return {
            'roc_data': roc_data,
            'plot_file': self.work_dir / 'roc_analysis.png',
            'report_file': self.work_dir / 'roc_report.txt',
            'data_file': self.work_dir / 'roc_data.csv'
        }


def main():
    """Main function for command line usage"""
    parser = argparse.ArgumentParser(
        description='Augustus ROC Analysis Tools',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Complete ROC analysis
  python roc_analysis_tools.py complete --species Rice_NLR --genome genome.fa --gff annotations.gff3
  
  # Generate ROC data only
  python roc_analysis_tools.py generate --species Rice_NLR --genome genome.fa --gff annotations.gff3
  
  # Parse existing evaluation files
  python roc_analysis_tools.py parse --input "eval_*.txt" --output roc_data.csv
  
  # Plot ROC curves from CSV data
  python roc_analysis_tools.py plot --input roc_data.csv --output roc_curves.png
        """
    )
    
    subparsers = parser.add_subparsers(dest='command', help='Available commands')
    
    # Complete analysis command
    complete_parser = subparsers.add_parser('complete', help='Run complete ROC analysis')
    complete_parser.add_argument('--species', required=True, help='Species name')
    complete_parser.add_argument('--genome', required=True, help='Genome FASTA file')
    complete_parser.add_argument('--gff', required=True, help='Reference GFF file')
    complete_parser.add_argument('--pipeline', default='augustus_pipeline.py', 
                                help='Augustus pipeline script path')
    complete_parser.add_argument('--work_dir', default='roc_analysis', 
                                help='Working directory')
    
    # Generate command
    generate_parser = subparsers.add_parser('generate', help='Generate ROC data')
    generate_parser.add_argument('--species', required=True, help='Species name')
    generate_parser.add_argument('--genome', required=True, help='Genome FASTA file')
    generate_parser.add_argument('--gff', required=True, help='Reference GFF file')
    generate_parser.add_argument('--pipeline', default='augustus_pipeline.py', 
                                help='Augustus pipeline script path')
    generate_parser.add_argument('--work_dir', default='roc_analysis', 
                                help='Working directory')
    
    # Parse command
    parse_parser = subparsers.add_parser('parse', help='Parse evaluation files')
    parse_parser.add_argument('--input', required=True, 
                             help='Evaluation files (glob pattern like "eval_*.txt")')
    parse_parser.add_argument('--output', default='roc_data.csv', 
                             help='Output CSV file')
    
    # Plot command
    plot_parser = subparsers.add_parser('plot', help='Plot ROC curves')
    plot_parser.add_argument('--input', required=True, 
                           help='ROC data CSV file')
    plot_parser.add_argument('--output', default='roc_curves.png', 
                           help='Output plot file')
    plot_parser.add_argument('--title', default='Augustus ROC Analysis', 
                           help='Plot title')
    
    # Common arguments
    for p in [complete_parser, generate_parser, parse_parser, plot_parser]:
        p.add_argument('--verbose', action='store_true', help='Verbose output')
    
    args = parser.parse_args()
    
    if not args.command:
        parser.print_help()
        return 1
    
    try:
        if args.command == 'complete':
            analyzer = ROCAnalyzer(work_dir=args.work_dir, verbose=args.verbose)
            results = analyzer.complete_analysis(args.species, args.genome, args.gff, 
                                                args.pipeline)
            print(f"\n✅ Complete ROC analysis finished!")
            print(f"📁 Results directory: {args.work_dir}")
            print(f"📊 ROC data: {results['data_file']}")
            print(f"📈 Plot: {results['plot_file']}")
            print(f"📄 Report: {results['report_file']}")
            
        elif args.command == 'generate':
            analyzer = ROCAnalyzer(work_dir=args.work_dir, verbose=args.verbose)
            results = analyzer.generate_roc_data(args.species, args.genome, args.gff, 
                                                args.pipeline)
            print(f"\n✅ ROC data generation completed!")
            print(f"📁 Results directory: {args.work_dir}")
            print(f"📊 Generated {len(results)} threshold points")
            
        elif args.command == 'parse':
            analyzer = ROCAnalyzer(verbose=args.verbose)
            eval_files = glob.glob(args.input)
            if not eval_files:
                print(f"❌ No files found matching pattern: {args.input}")
                return 1
            
            roc_data = analyzer.parse_evaluation_files(eval_files, args.output)
            print(f"\n✅ Parsing completed!")
            print(f"📊 Parsed {len(roc_data)} data points")
            print(f"📄 Output: {args.output}")
            
        elif args.command == 'plot':
            analyzer = ROCAnalyzer(verbose=args.verbose)
            roc_data = pd.read_csv(args.input)
            fig = analyzer.plot_roc_curves(roc_data, args.output, args.title)
            print(f"\n✅ ROC plotting completed!")
            print(f"📈 Plot saved: {args.output}")
            
    except Exception as e:
        print(f"\n❌ Error: {e}")
        if args.verbose:
            import traceback
            traceback.print_exc()
        return 1
    
    return 0


if __name__ == '__main__':
    main()

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Augustus Results Visualization and ROC Analysis
This script creates various visualizations for Augustus evaluation results
"""

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.patches import Rectangle
import matplotlib.patches as mpatches

# Set style for better looking plots
plt.style.use('default')
sns.set_palette("husl")

class AugustusVisualizer:
    """Visualizer for Augustus evaluation results"""
    
    def __init__(self, evaluation_data, species_name='Augustus_Model'):
        self.data = evaluation_data
        self.species_name = species_name
        
        # Configure matplotlib for Chinese font support
        plt.rcParams['font.sans-serif'] = ['DejaVu Sans', 'SimHei', 'Arial Unicode MS']
        plt.rcParams['axes.unicode_minus'] = False
        
    def explain_roc_limitations(self):
        """Explain why ROC curve cannot be drawn from current data"""
        print("🔍 ROC曲线分析说明")
        print("=" * 50)
        print("\n❌ 无法绘制ROC曲线的原因：")
        print("1. ROC曲线需要多个阈值点的数据（sensitivity vs 1-specificity）")
        print("2. Augustus输出的是固定阈值下的二元分类结果，不是概率分数")
        print("3. 当前数据只有3个评估级别的单点结果，不是连续的阈值变化数据")
        
        print("\n📊 当前可用数据点：")
        if 'nucleotide_sensitivity' in self.data:
            fpr_nuc = 1 - self.data['nucleotide_specificity']
            print(f"  核苷酸水平: TPR={self.data['nucleotide_sensitivity']:.3f}, FPR={fpr_nuc:.3f}")
        
        if 'exon_sensitivity' in self.data:
            fpr_exon = 1 - self.data['exon_specificity']
            print(f"  外显子水平: TPR={self.data['exon_sensitivity']:.3f}, FPR={fpr_exon:.3f}")
            
        if 'gene_sensitivity' in self.data:
            fpr_gene = 1 - self.data['gene_specificity']
            print(f"  基因水平: TPR={self.data['gene_sensitivity']:.3f}, FPR={fpr_gene:.3f}")
            
        print("\n💡 要绘制真正的ROC曲线，需要：")
        print("  - Augustus预测的概率分数或置信度分数")
        print("  - 在不同阈值下重新计算敏感性和特异性")
        print("  - 通常需要修改Augustus参数或使用后处理工具")
        
    def create_performance_dashboard(self):
        """Create a comprehensive performance dashboard"""
        
        # Create figure with subplots
        fig = plt.figure(figsize=(16, 12))
        
        # 1. Sensitivity vs Specificity scatter plot
        ax1 = plt.subplot(2, 3, 1)
        self.plot_sensitivity_specificity_scatter(ax1)
        
        # 2. Performance metrics bar chart
        ax2 = plt.subplot(2, 3, 2)
        self.plot_performance_bars(ax2)
        
        # 3. Pseudo-ROC with available points
        ax3 = plt.subplot(2, 3, 3)
        self.plot_pseudo_roc(ax3)
        
        # 4. Confusion matrix visualization
        ax4 = plt.subplot(2, 3, 4)
        self.plot_confusion_metrics(ax4)
        
        # 5. Precision-Recall visualization
        ax5 = plt.subplot(2, 3, 5)
        self.plot_precision_recall(ax5)
        
        # 6. Summary statistics
        ax6 = plt.subplot(2, 3, 6)
        self.plot_summary_text(ax6)
        
        plt.tight_layout()
        plt.suptitle(f'Augustus Performance Dashboard - {self.species_name}', 
                    fontsize=16, fontweight='bold', y=0.98)
        
        return fig
    
    def plot_sensitivity_specificity_scatter(self, ax):
        """Plot sensitivity vs specificity scatter"""
        levels = []
        sensitivity = []
        specificity = []
        colors = ['red', 'blue', 'green']
        
        if 'nucleotide_sensitivity' in self.data:
            levels.append('Nucleotide')
            sensitivity.append(self.data['nucleotide_sensitivity'])
            specificity.append(self.data['nucleotide_specificity'])
            
        if 'exon_sensitivity' in self.data:
            levels.append('Exon')
            sensitivity.append(self.data['exon_sensitivity'])
            specificity.append(self.data['exon_specificity'])
            
        if 'gene_sensitivity' in self.data:
            levels.append('Gene')
            sensitivity.append(self.data['gene_sensitivity'])
            specificity.append(self.data['gene_specificity'])
        
        for i, (level, sens, spec) in enumerate(zip(levels, sensitivity, specificity)):
            ax.scatter(spec, sens, s=200, c=colors[i], alpha=0.7, 
                      label=level, edgecolors='black', linewidth=2)
            ax.annotate(f'{level}\n({spec:.3f}, {sens:.3f})', 
                       (spec, sens), xytext=(10, 10), 
                       textcoords='offset points', fontsize=10,
                       bbox=dict(boxstyle='round,pad=0.3', facecolor=colors[i], alpha=0.3))
        
        ax.set_xlabel('Specificity', fontsize=12)
        ax.set_ylabel('Sensitivity', fontsize=12)
        ax.set_title('Sensitivity vs Specificity', fontsize=14, fontweight='bold')
        ax.grid(True, alpha=0.3)
        ax.legend()
        ax.set_xlim(0, 1.05)
        ax.set_ylim(0, 1.05)
        
        # Add ideal point
        ax.scatter(1, 1, s=100, c='gold', marker='*', label='Ideal (1,1)', 
                  edgecolors='black', linewidth=1)
        
    def plot_performance_bars(self, ax):
        """Plot performance metrics as bar chart"""
        metrics = []
        values = []
        levels = []
        
        if 'nucleotide_sensitivity' in self.data:
            metrics.extend(['Sensitivity', 'Specificity'])
            values.extend([self.data['nucleotide_sensitivity'], self.data['nucleotide_specificity']])
            levels.extend(['Nucleotide', 'Nucleotide'])
            
        if 'exon_sensitivity' in self.data:
            metrics.extend(['Sensitivity', 'Specificity'])
            values.extend([self.data['exon_sensitivity'], self.data['exon_specificity']])
            levels.extend(['Exon', 'Exon'])
            
        if 'gene_sensitivity' in self.data:
            metrics.extend(['Sensitivity', 'Specificity'])
            values.extend([self.data['gene_sensitivity'], self.data['gene_specificity']])
            levels.extend(['Gene', 'Gene'])
        
        df = pd.DataFrame({'Metric': metrics, 'Value': values, 'Level': levels})
        
        # Create grouped bar chart
        x = np.arange(len(df['Level'].unique()))
        width = 0.35
        
        sens_data = df[df['Metric'] == 'Sensitivity']
        spec_data = df[df['Metric'] == 'Specificity']
        
        bars1 = ax.bar(x - width/2, sens_data['Value'], width, label='Sensitivity', alpha=0.8)
        bars2 = ax.bar(x + width/2, spec_data['Value'], width, label='Specificity', alpha=0.8)
        
        # Add value labels on bars
        for bar in bars1:
            height = bar.get_height()
            ax.annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                       xytext=(0, 3), textcoords='offset points', ha='center', va='bottom')
                       
        for bar in bars2:
            height = bar.get_height()
            ax.annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                       xytext=(0, 3), textcoords='offset points', ha='center', va='bottom')
        
        ax.set_xlabel('Evaluation Level', fontsize=12)
        ax.set_ylabel('Performance Score', fontsize=12)
        ax.set_title('Performance Metrics by Level', fontsize=14, fontweight='bold')
        ax.set_xticks(x)
        ax.set_xticklabels(sens_data['Level'].unique())
        ax.legend()
        ax.set_ylim(0, 1.1)
        ax.grid(True, alpha=0.3, axis='y')
    
    def plot_pseudo_roc(self, ax):
        """Plot pseudo-ROC curve with available points"""
        tpr_values = []
        fpr_values = []
        labels = []
        
        if 'nucleotide_sensitivity' in self.data:
            tpr_values.append(self.data['nucleotide_sensitivity'])
            fpr_values.append(1 - self.data['nucleotide_specificity'])
            labels.append('Nucleotide')
            
        if 'exon_sensitivity' in self.data:
            tpr_values.append(self.data['exon_sensitivity'])
            fpr_values.append(1 - self.data['exon_specificity'])
            labels.append('Exon')
            
        if 'gene_sensitivity' in self.data:
            tpr_values.append(self.data['gene_sensitivity'])
            fpr_values.append(1 - self.data['gene_specificity'])
            labels.append('Gene')
        
        # Plot diagonal line (random classifier)
        ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random Classifier')
        
        # Plot available points
        colors = ['red', 'blue', 'green']
        for i, (fpr, tpr, label) in enumerate(zip(fpr_values, tpr_values, labels)):
            ax.scatter(fpr, tpr, s=200, c=colors[i], alpha=0.7, 
                      label=f'{label} Level', edgecolors='black', linewidth=2)
            ax.annotate(f'{label}\n(FPR:{fpr:.3f}, TPR:{tpr:.3f})', 
                       (fpr, tpr), xytext=(10, 10), 
                       textcoords='offset points', fontsize=10,
                       bbox=dict(boxstyle='round,pad=0.3', facecolor=colors[i], alpha=0.3))
        
        # Connect points with lines to show trend
        if len(fpr_values) > 1:
            sorted_points = sorted(zip(fpr_values, tpr_values), key=lambda x: x[0])
            fpr_sorted, tpr_sorted = zip(*sorted_points)
            ax.plot(fpr_sorted, tpr_sorted, 'o-', alpha=0.5, linewidth=2)
        
        ax.set_xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
        ax.set_ylabel('True Positive Rate (Sensitivity)', fontsize=12)
        ax.set_title('Pseudo-ROC Curve\n(Limited Points Available)', fontsize=14, fontweight='bold')
        ax.grid(True, alpha=0.3)
        ax.legend()
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        
        # Add annotation about limitations
        ax.text(0.6, 0.2, 'Note: This is not a true ROC curve.\nOnly fixed threshold points available.', 
               bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7),
               fontsize=10)
    
    def plot_confusion_metrics(self, ax):
        """Plot confusion matrix style metrics"""
        if 'gene_tp' in self.data:
            tp = self.data['gene_tp']
            fp = self.data['gene_fp']
            fn = self.data['gene_fn']
            tn = 0  # Not directly available from Augustus output
            
            # Create a visualization of TP, FP, FN
            categories = ['True\nPositives', 'False\nPositives', 'False\nNegatives']
            values = [tp, fp, fn]
            colors = ['green', 'red', 'orange']
            
            bars = ax.bar(categories, values, color=colors, alpha=0.7, edgecolor='black')
            
            # Add value labels
            for bar, value in zip(bars, values):
                height = bar.get_height()
                ax.annotate(f'{value}', xy=(bar.get_x() + bar.get_width()/2, height),
                           xytext=(0, 3), textcoords='offset points', ha='center', va='bottom',
                           fontsize=12, fontweight='bold')
            
            ax.set_ylabel('Number of Genes', fontsize=12)
            ax.set_title('Gene Level Classification Results', fontsize=14, fontweight='bold')
            ax.grid(True, alpha=0.3, axis='y')
            
            # Add total annotations
            total_pred = tp + fp
            total_true = tp + fn
            ax.text(0.5, 0.95, f'Total Predicted: {total_pred}\nTotal True: {total_true}', 
                   transform=ax.transAxes, ha='center', va='top',
                   bbox=dict(boxstyle='round,pad=0.3', facecolor='lightblue', alpha=0.7))
        else:
            ax.text(0.5, 0.5, 'Gene level data\nnot available', 
                   transform=ax.transAxes, ha='center', va='center',
                   fontsize=14, bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgray'))
    
    def plot_precision_recall(self, ax):
        """Plot precision and recall comparison"""
        if 'gene_tp' in self.data:
            tp = self.data['gene_tp']
            fp = self.data['gene_fp']
            fn = self.data['gene_fn']
            
            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0
            f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
            
            metrics = ['Precision', 'Recall', 'F1-Score']
            values = [precision, recall, f1_score]
            colors = ['skyblue', 'lightcoral', 'lightgreen']
            
            bars = ax.bar(metrics, values, color=colors, alpha=0.8, edgecolor='black')
            
            # Add value labels
            for bar, value in zip(bars, values):
                height = bar.get_height()
                ax.annotate(f'{value:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                           xytext=(0, 3), textcoords='offset points', ha='center', va='bottom',
                           fontsize=12, fontweight='bold')
            
            ax.set_ylabel('Score', fontsize=12)
            ax.set_title('Precision, Recall & F1-Score\n(Gene Level)', fontsize=14, fontweight='bold')
            ax.set_ylim(0, 1.1)
            ax.grid(True, alpha=0.3, axis='y')
            
            # Add interpretation
            if f1_score > 0.7:
                interpretation = "Good Balance"
                color = 'green'
            elif f1_score > 0.5:
                interpretation = "Moderate Performance"
                color = 'orange'
            else:
                interpretation = "Needs Improvement"
                color = 'red'
                
            ax.text(0.5, 0.05, f'Overall: {interpretation}', 
                   transform=ax.transAxes, ha='center', va='bottom',
                   bbox=dict(boxstyle='round,pad=0.3', facecolor=color, alpha=0.3),
                   fontsize=11, fontweight='bold')
        else:
            ax.text(0.5, 0.5, 'Gene level data\nnot available', 
                   transform=ax.transAxes, ha='center', va='center',
                   fontsize=14, bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgray'))
    
    def plot_summary_text(self, ax):
        """Plot summary statistics as text"""
        ax.axis('off')
        
        summary_text = f"Augustus Evaluation Summary\n"
        summary_text += f"Species: {self.species_name}\n\n"
        
        if 'nucleotide_sensitivity' in self.data:
            summary_text += f"🧬 Nucleotide Level:\n"
            summary_text += f"  Sensitivity: {self.data['nucleotide_sensitivity']:.3f}\n"
            summary_text += f"  Specificity: {self.data['nucleotide_specificity']:.3f}\n\n"
        
        if 'exon_sensitivity' in self.data:
            summary_text += f"📝 Exon Level:\n"
            summary_text += f"  Sensitivity: {self.data['exon_sensitivity']:.3f}\n"
            summary_text += f"  Specificity: {self.data['exon_specificity']:.3f}\n"
            summary_text += f"  Predicted: {self.data['exon_pred_total']}\n"
            summary_text += f"  Annotated: {self.data['exon_anno_total']}\n\n"
        
        if 'gene_sensitivity' in self.data:
            summary_text += f"🧲 Gene Level:\n"
            summary_text += f"  Sensitivity: {self.data['gene_sensitivity']:.3f}\n"
            summary_text += f"  Specificity: {self.data['gene_specificity']:.3f}\n"
            summary_text += f"  Predicted: {self.data['gene_pred']}\n"
            summary_text += f"  Annotated: {self.data['gene_anno']}\n"
            summary_text += f"  True Positives: {self.data['gene_tp']}\n"
            summary_text += f"  False Positives: {self.data['gene_fp']}\n"
            summary_text += f"  False Negatives: {self.data['gene_fn']}\n\n"
        
        # Add recommendations
        summary_text += "💡 Recommendations:\n"
        if 'gene_specificity' in self.data and self.data['gene_specificity'] < 0.7:
            summary_text += "• Reduce false positives\n"
        if 'gene_sensitivity' in self.data and self.data['gene_sensitivity'] < 0.7:
            summary_text += "• Improve gene detection\n"
        summary_text += "• Consider parameter tuning\n"
        summary_text += "• Use more training data\n"
        
        ax.text(0.05, 0.95, summary_text, transform=ax.transAxes, 
                fontsize=11, verticalalignment='top',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='lightblue', alpha=0.3))
    
    def save_dashboard(self, filename='augustus_dashboard.png', dpi=300):
        """Save the dashboard to file"""
        fig = self.create_performance_dashboard()
        fig.savefig(filename, dpi=dpi, bbox_inches='tight')
        plt.close(fig)
        print(f"Dashboard saved as: {filename}")
        return filename
        
    def show_dashboard(self):
        """Display the dashboard"""
        fig = self.create_performance_dashboard()
        plt.show()
        return fig


def main():
    """Main function with example data"""
    
    # Your Augustus evaluation data
    evaluation_data = {
        'nucleotide_sensitivity': 0.988,
        'nucleotide_specificity': 0.911,
        'exon_pred_total': 2893,
        'exon_anno_total': 2742,
        'exon_tp': 2027,
        'exon_sensitivity': 0.739,
        'exon_specificity': 0.701,
        'gene_pred': 1139,
        'gene_anno': 1008,
        'gene_tp': 556,
        'gene_fp': 583,
        'gene_fn': 452,
        'gene_sensitivity': 0.552,
        'gene_specificity': 0.488,
        'tss_pred': 40,
        'tts_pred': 66,
        'identical_paths': 1008
    }
    
    # Create visualizer
    visualizer = AugustusVisualizer(evaluation_data, 'Rice_35minicore_NLR')
    
    # Explain ROC limitations
    visualizer.explain_roc_limitations()
    
    print("\n" + "="*50)
    print("📊 生成可视化仪表板...")
    
    # Create and save dashboard
    filename = visualizer.save_dashboard('augustus_performance_dashboard.png')
    
    print(f"\n✅ 可视化完成！")
    print(f"📁 文件保存为: {filename}")
    print("\n💡 虽然无法绘制标准ROC曲线，但我们提供了:")
    print("  1. 敏感性vs特异性散点图")
    print("  2. 性能指标对比图")
    print("  3. 伪ROC曲线（显示可用点）")
    print("  4. 混淆矩阵风格的分类结果")
    print("  5. 精确度-召回率分析")
    print("  6. 详细统计摘要")
    
    # Optionally show the plot
    # visualizer.show_dashboard()


if __name__ == '__main__':
    main()

生物信息学

#生物信息学

Claude新给的两个关于Augusts的脚本

https://lixiang117423.github.io/article/newcode4augustus/

作者

李详【Xiang LI】

发布于

2025年6月22日

许可协议

博后期间文献阅读记录上一篇

Augustus的核心参数解读_Claude 下一篇