RNA-Seq Automation Script

Chinese Version

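# RNA-seq Analysis Pipeline

This is an automated RNA-seq analysis pipeline based on HISAT2 and StringTie. It wraps the complete workflow from raw FASTQ files to the final expression matrix, and aims to make the analysis simple, efficient, and reproducible.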

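## Features

- **🚀 Automated Pipeline**: Runs the whole workflow (index building, sequence alignment, transcript quantification, and expression matrix merging) with a single command.
- **🧩 Modular Design**: The code is clearly structured, so it is easy to understand, maintain, and extend.
- **🎯 Flexible Input**: Supports two input modes:
    1.  Automatically scan a FASTQ directory and identify sample pairs.
    2.  Specify the files for each sample precisely through a sample sheet.
- **🔧 Configurable Parameters**: Supports a custom thread count, FASTQ file naming pattern, and whether to keep intermediate BAM files.
- **📊 Clear Output**: Generates a consistently formatted expression matrix (`all.fpkm.tpm.txt`), a detailed run log, and a summary report.
- **🐍 Python API**: Provides an `RNASeqAnalyzer` class for integration into other Python projects.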

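## Dependencies

Before running this pipeline, make sure the following software is installed and available on your `PATH`.

1.  **HISAT2**: For sequence alignment.
2.  **StringTie**: For transcript assembly and quantification.
3.  **Samtools**: For processing SAM/BAM files.
4.  **Python 3.x**:
    - **pandas**: For data processing and matrix merging.

You can install these dependencies with Conda: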
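```bash
conda create -n rnaseq-env -c bioconda hisat2 stringtie samtools pandas
conda activate rnaseq-env
```

## Usage

### Basic Command

```bash
python run_rnaseq.py -g <genome.fa> -f <genes.gtf> -i <input_path> -o <output_dir> [OPTIONS]
```

### Parameters

| Parameter | Short | Description | Required |
| --- | --- | --- | --- |
| `--genome` | `-g` | Path to the reference genome FASTA file. | Yes |
| `--gtf` | `-f` | Path to the gene annotation GTF file. | Yes |
| `--input` | `-i` | Input path. Either a directory containing FASTQ files or a sample sheet file. | Yes |
| `--output` | `-o` | Directory for output results. | Yes |
| `--pattern` | `-p` | [Optional] FASTQ file naming pattern, where `*` stands for the sample name, e.g. `"*_R1.fq.gz"`. If not provided, the script detects common patterns automatically. | No |
| `--remove` | `-r` | [Optional] Whether to delete intermediate BAM files after processing to save space. Choices: `yes` or `no`. Default: `no`. | No |
| `--threads` | `-t` | [Optional] Number of threads to use. Default: `8`. | No |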
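## Input Formats

### 1. FASTQ Directory Mode

When the `-i` argument is a directory, the script scans it automatically for paired FASTQ files.

Example directory structure:

```text
fastq_data/
├── sampleA_1.fq.gz
├── sampleA_2.fq.gz
├── sampleB_1.fq.gz
└── sampleB_2.fq.gz
```

By default the script recognizes common suffixes such as `_1`/`_2` and `_R1`/`_R2`. If your file names do not follow these patterns, use the `--pattern` argument.

### 2. Sample Sheet Mode

When the `-i` argument is a file, the script parses it as a sample sheet. This is useful when files are spread across different locations or have complex names.

File format (tab-separated):

```text
sample_name<TAB>path/to/read1.fq.gz<TAB>path/to/read2.fq.gz
```

Example `samples.txt`:

```text
sample_A	/path/to/data1/A_read1.fastq.gz	/path/to/data1/A_read2.fastq.gz
sample_B	/path/to/data2/B.R1.fq.gz	/path/to/data2/B.R2.fq.gz
control_C	/path/to/data3/C-1.fq.gz	/path/to/data3/C-2.fq.gz
```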

## Examples

### Example 1: Basic Usage (Auto-scan Directory)

```bash
python run_rnaseq.py \
    -g /data/genomes/human/hg38.fa \
    -f /data/annotations/human/gencode.v40.annotation.gtf \
    -i ./raw_fastq \
    -o ./rnaseq_results
```

### Example 2: Specify Threads and File Pattern

```bash
python run_rnaseq.py \
    -g hg38.fa \
    -f gencode.gtf \
    -i ./fastq_files \
    -o ./results \
    -t 16 \
    -p "*_R1.fastq.gz"
```

### Example 3: Use a Sample Sheet and Remove BAM Files

```bash
python run_rnaseq.py \
    -g hg38.fa \
    -f gencode.gtf \
    -i samples.txt \
    -o ./results_from_sheet \
    -r yes
```

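## Output File Structure

After the analysis finishes, the output directory is structured as follows:

```text
rnaseq_results/
├── sampleA.sorted.bam         # Aligned BAM file (if --remove=no)
├── sampleB.sorted.bam         # ...
│
├── stringtie_output/
│   ├── sampleA.gtf            # GTF produced by StringTie for each sample
│   └── sampleB.gtf
│
├── fpkm_output/
│   ├── sampleA.fpkm.txt       # FPKM/TPM values extracted for each sample
│   └── sampleB.fpkm.txt
│
├── all.fpkm.tpm.txt           # **[MAIN RESULT]** Expression matrix merged across all samples
├── rnaseq_analysis.log        # Detailed run log
└── rnaseq_summary.txt         # Analysis summary report
```

### Key Result Files

- `all.fpkm.tpm.txt`: A tab-separated file with the FPKM and TPM values of every transcript in all samples, ready for downstream differential expression analysis.
- `rnaseq_summary.txt`: A concise text report summarizing the input files, parameters, and sample information used in the run.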

## Python API Usage

You can also import this pipeline as a library in your own Python scripts.

```python
from rnaseq import RNASeqAnalyzer, RNASeqConfig

# Run the analysis
try:
    analyzer = RNASeqAnalyzer(
        genome_file="/path/to/genome.fa",
        gtf_file="/path/to/genes.gtf",
        input_path="/path/to/fastq_dir",
        output_dir="./api_results",
        threads=12,
        remove_bam="yes"
    )

    analyzer.run_analysis()
    print("RNA-seq analysis completed successfully!")

except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
```

English Version

# RNA-seq Analysis Pipeline

This is an automated RNA-seq analysis pipeline based on HISAT2 and StringTie. It encapsulates the complete workflow from raw FASTQ files to the final expression matrix, designed to provide a simple, efficient, and reproducible analysis experience.

## Features

-   **🚀 Automated Pipeline**: Executes the entire workflow—from index building, sequence alignment, and transcript quantification to expression matrix merging—with a single command.
-   **🧩 Modular Design**: The code is well-structured, making it easy to understand, maintain, and extend.
-   **🎯 Flexible Input**: Supports two input modes:
    1.  Automatically scans a directory for FASTQ files and identifies paired-end reads.
    2.  Uses a sample sheet to precisely specify files for each sample.
-   **🔧 Configurable Parameters**: Allows customization of thread count, FASTQ file naming patterns, and whether to retain intermediate BAM files.
-   **📊 Clear Output**: Generates a consistently formatted expression matrix (`all.fpkm.tpm.txt`), a detailed run log, and a summary report.
-   **🐍 Python API**: Provides an `RNASeqAnalyzer` class for easy integration into other Python projects.
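The four stages listed above map onto standard command-line calls. The sketch below only builds the argument lists for one sample and does not run anything. The function name `build_commands`, the index location, and the exact flags are assumptions for illustration; the README does not say which options `run_rnaseq.py` actually passes to each tool.

```python
from pathlib import Path


def build_commands(genome, gtf, sample, read1, read2, out_dir, threads=8):
    """Return the per-sample commands as argument lists (nothing is executed)."""
    out = Path(out_dir)
    index = out / "hisat2_index" / "genome"  # hypothetical index prefix
    sam = out / f"{sample}.sam"
    bam = out / f"{sample}.sorted.bam"
    sample_gtf = out / "stringtie_output" / f"{sample}.gtf"
    t = str(threads)
    return [
        # 1. Build the HISAT2 index from the reference FASTA.
        ["hisat2-build", "-p", t, str(genome), str(index)],
        # 2. Align paired-end reads; --dta tailors output for transcript assemblers.
        ["hisat2", "-p", t, "--dta", "-x", str(index),
         "-1", str(read1), "-2", str(read2), "-S", str(sam)],
        # 3. Coordinate-sort the alignments into a BAM file.
        ["samtools", "sort", "-@", t, "-o", str(bam), str(sam)],
        # 4. Quantify against the reference annotation (-e: reference transcripts only).
        ["stringtie", str(bam), "-G", str(gtf), "-e", "-p", t, "-o", str(sample_gtf)],
    ]


if __name__ == "__main__":
    for cmd in build_commands("hg38.fa", "gencode.gtf", "sampleA",
                              "sampleA_1.fq.gz", "sampleA_2.fq.gz", "results"):
        print(" ".join(cmd))
```

Keeping the commands as lists makes it straightforward to pass each one to `subprocess.run(cmd, check=True)` without shell quoting problems.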

## Dependencies

Before running this pipeline, please ensure the following software is installed and accessible from your system's `PATH`.

1.  **HISAT2**: For sequence alignment.
2.  **StringTie**: For transcript assembly and quantification.
3.  **Samtools**: For processing SAM/BAM files.
4.  **Python 3.x**:
    -   **pandas**: For data manipulation and matrix merging.
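Before a long run it can help to confirm that the external tools are really reachable. A minimal sketch using only the standard library; the helper name `missing_tools` and the tool list are illustrative and not part of the pipeline itself.

```python
import shutil

# Executables the pipeline is expected to call (hisat2-build ships with HISAT2).
REQUIRED_TOOLS = ["hisat2", "hisat2-build", "stringtie", "samtools"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```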

You can easily install these dependencies using Conda:
```bash
conda create -n rnaseq-env -c bioconda hisat2 stringtie samtools pandas
conda activate rnaseq-env
```

## Usage

### Basic Command

```bash
python run_rnaseq.py -g <genome.fa> -f <genes.gtf> -i <input_path> -o <output_dir> [OPTIONS]
```

### Parameters

| Parameter | Short | Description | Required |
| --- | --- | --- | --- |
| `--genome` | `-g` | Path to the reference genome FASTA file. | Yes |
| `--gtf` | `-f` | Path to the gene annotation GTF file. | Yes |
| `--input` | `-i` | Input path. Can be a directory containing FASTQ files or a sample information file. | Yes |
| `--output` | `-o` | Directory for output results. | Yes |
| `--pattern` | `-p` | [Optional] FASTQ file naming pattern, where `*` is a wildcard for the sample name, e.g. `"*_R1.fq.gz"`. If not provided, common patterns are auto-detected. | No |
| `--remove` | `-r` | [Optional] Whether to remove intermediate BAM files after processing to save space. Choices: `yes` or `no`. Default: `no`. | No |
| `--threads` | `-t` | [Optional] Number of threads to use. Default: `8`. | No |
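For readers who want to mirror this interface in their own wrapper, the options above can be expressed with `argparse` roughly as follows. This is a sketch reconstructed from the parameter table, not the parser that `run_rnaseq.py` actually uses; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="run_rnaseq.py",
        description="HISAT2 + StringTie RNA-seq pipeline (interface sketch).",
    )
    parser.add_argument("-g", "--genome", required=True,
                        help="reference genome FASTA file")
    parser.add_argument("-f", "--gtf", required=True,
                        help="gene annotation GTF file")
    parser.add_argument("-i", "--input", required=True,
                        help="FASTQ directory or sample sheet file")
    parser.add_argument("-o", "--output", required=True,
                        help="output directory")
    parser.add_argument("-p", "--pattern", default=None,
                        help='FASTQ naming pattern, e.g. "*_R1.fq.gz"')
    parser.add_argument("-r", "--remove", choices=["yes", "no"], default="no",
                        help="delete intermediate BAM files after processing")
    parser.add_argument("-t", "--threads", type=int, default=8,
                        help="number of threads")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-g", "hg38.fa", "-f", "gencode.gtf", "-i", "./fastq", "-o", "./results"]
    )
    print(args)
```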

## Input Formats

### 1. FASTQ Directory Mode

When the `-i` argument is a directory, the script automatically scans it for paired FASTQ files.

Example directory structure:

```text
fastq_data/
├── sampleA_1.fq.gz
├── sampleA_2.fq.gz
├── sampleB_1.fq.gz
└── sampleB_2.fq.gz
```

By default the script recognizes common suffixes such as `_1`/`_2` and `_R1`/`_R2`. If your filenames do not follow these patterns, use the `--pattern` argument.
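To make the pairing rule concrete, here is one way such suffix detection could work. It is a sketch under stated assumptions: only the `_1`/`_2` and `_R1`/`_R2` suffixes named above are handled, the accepted extensions (`.fq`, `.fastq`, optionally `.gz`) are a guess, and `find_pairs` is a hypothetical helper rather than the script's own function.

```python
import re
from pathlib import Path

# Read-1 file name: <sample>_1.<ext> or <sample>_R1.<ext>, with a FASTQ extension.
R1_PATTERN = re.compile(
    r"^(?P<sample>.+)_(?P<tag>R?)1\.(?P<ext>f(?:ast)?q(?:\.gz)?)$"
)


def find_pairs(file_names):
    """Map sample name -> (read1, read2) for read-1 files whose mate is present."""
    names = set(file_names)
    pairs = {}
    for name in sorted(names):
        match = R1_PATTERN.match(name)
        if match is None:
            continue
        # The mate has the same sample name, tag style and extension, with 2 for 1.
        mate = f"{match['sample']}_{match['tag']}2.{match['ext']}"
        if mate in names:
            pairs[match["sample"]] = (name, mate)
    return pairs


def scan_directory(directory):
    """Apply find_pairs to the regular files directly inside `directory`."""
    return find_pairs(p.name for p in Path(directory).iterdir() if p.is_file())


if __name__ == "__main__":
    print(find_pairs(["sampleA_1.fq.gz", "sampleA_2.fq.gz",
                      "sampleB_R1.fastq.gz", "sampleB_R2.fastq.gz"]))
```

Files without a mate are silently skipped here; a real pipeline would more likely log them as warnings.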

### 2. Sample Sheet Mode

When the `-i` argument is a file, it is parsed as a sample sheet. This is useful when files are in different locations or have complex names.

File format (tab-separated):

```text
sample_name<TAB>path/to/read1.fq.gz<TAB>path/to/read2.fq.gz
```

Example `samples.txt`:

```text
sample_A	/path/to/data1/A_read1.fastq.gz	/path/to/data1/A_read2.fastq.gz
sample_B	/path/to/data2/B.R1.fq.gz	/path/to/data2/B.R2.fq.gz
control_C	/path/to/data3/C-1.fq.gz	/path/to/data3/C-2.fq.gz
```
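A sample sheet in this three-column layout can be read with the standard `csv` module. The sketch below is illustrative: `parse_sample_sheet` is a hypothetical helper, and skipping blank lines and `#` comment lines is an assumption, since the README does not say how the script treats them.

```python
import csv
import io


def parse_sample_sheet(handle):
    """Read tab-separated (sample_name, read1, read2) rows from an open text file."""
    samples = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Assumption: blank lines and '#' comment lines are ignored.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, got {len(row)}"
            )
        name, read1, read2 = (field.strip() for field in row)
        if name in samples:
            raise ValueError(f"line {line_no}: duplicate sample name {name!r}")
        samples[name] = (read1, read2)
    return samples


if __name__ == "__main__":
    sheet = io.StringIO(
        "sample_A\t/path/to/data1/A_read1.fastq.gz\t/path/to/data1/A_read2.fastq.gz\n"
        "sample_B\t/path/to/data2/B.R1.fq.gz\t/path/to/data2/B.R2.fq.gz\n"
    )
    for name, (read1, read2) in parse_sample_sheet(sheet).items():
        print(name, read1, read2)
```

Raising `ValueError` on malformed rows matches the error type that the Python API example in this README catches as a configuration error.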

## Examples

### Example 1: Basic Usage (Auto-scan Directory)

```bash
python run_rnaseq.py \
    -g /data/genomes/human/hg38.fa \
    -f /data/annotations/human/gencode.v40.annotation.gtf \
    -i ./raw_fastq \
    -o ./rnaseq_results
```

### Example 2: Specify Threads and File Pattern

```bash
python run_rnaseq.py \
    -g hg38.fa \
    -f gencode.gtf \
    -i ./fastq_files \
    -o ./results \
    -t 16 \
    -p "*_R1.fastq.gz"
```

### Example 3: Use a Sample Sheet and Remove BAM Files

```bash
python run_rnaseq.py \
    -g hg38.fa \
    -f gencode.gtf \
    -i samples.txt \
    -o ./results_from_sheet \
    -r yes
```

## Output File Structure

After the analysis is complete, the output directory will be structured as follows:

```text
rnaseq_results/
├── sampleA.sorted.bam         # Aligned BAM file (if --remove=no)
├── sampleB.sorted.bam         # ...
│
├── stringtie_output/
│   ├── sampleA.gtf            # StringTie's GTF output for each sample
│   └── sampleB.gtf
│
├── fpkm_output/
│   ├── sampleA.fpkm.txt       # Extracted FPKM/TPM values for each sample
│   └── sampleB.fpkm.txt
│
├── all.fpkm.tpm.txt           # **[MAIN RESULT]** Merged expression matrix
├── rnaseq_analysis.log        # Detailed run log file
└── rnaseq_summary.txt         # Analysis summary report
```

### Key Result Files

- `all.fpkm.tpm.txt`: A tab-separated file containing FPKM and TPM values for every transcript across all samples, ready for downstream differential expression analysis.
- `rnaseq_summary.txt`: A concise text report summarizing the input files, parameters, and sample information used in the analysis.
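The per-sample FPKM and TPM values originate from the GTF files that StringTie writes, where each `transcript` line carries `FPKM` and `TPM` attributes. The following sketch shows how those values could be pulled from a single line. It is not the pipeline's own extraction code, and the column layout of `sampleA.fpkm.txt` and `all.fpkm.tpm.txt` is not specified in this README; `parse_transcript_line` and the sample line are made up for illustration.

```python
import re

# GTF attributes look like: key "value"; key "value"; ...
ATTRIBUTE = re.compile(r'(\w+) "([^"]*)"')


def parse_transcript_line(line):
    """Return (transcript_id, fpkm, tpm) for a StringTie transcript line, else None."""
    fields = line.rstrip("\n").split("\t")
    # A GTF record has 9 tab-separated columns; column 3 is the feature type.
    if len(fields) != 9 or fields[2] != "transcript":
        return None
    attrs = dict(ATTRIBUTE.findall(fields[8]))
    if not {"transcript_id", "FPKM", "TPM"} <= attrs.keys():
        return None
    return attrs["transcript_id"], float(attrs["FPKM"]), float(attrs["TPM"])


if __name__ == "__main__":
    example = (
        "chr1\tStringTie\ttranscript\t11869\t14409\t1000\t+\t.\t"
        'gene_id "G1"; transcript_id "T1"; cov "2.5"; FPKM "1.25"; TPM "3.5";'
    )
    print(parse_transcript_line(example))
```

Collecting these tuples per sample and joining them on the transcript ID (for example with pandas) would give a matrix of the kind described above.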

## Python API Usage

You can also import this pipeline as a library into your own Python scripts.

```python
from rnaseq import RNASeqAnalyzer, RNASeqConfig

# Run the analysis
try:
    analyzer = RNASeqAnalyzer(
        genome_file="/path/to/genome.fa",
        gtf_file="/path/to/genes.gtf",
        input_path="/path/to/fastq_dir",
        output_dir="./api_results",
        threads=12,
        remove_bam="yes"
    )

    analyzer.run_analysis()
    print("RNA-seq analysis completed successfully!")

except ValueError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
```

https://lixiang117423.github.io/article/rnaseq/

Author: 李详 (Xiang LI)

Published: July 11, 2025
