This is an automated RNA-seq analysis pipeline based on HISAT2 and StringTie. It wraps the complete workflow, from raw FASTQ files to the final expression matrix, and is designed to make the analysis simple, efficient, and reproducible.
## Features
- **🚀 Automated Pipeline**: Executes the entire workflow (index building, sequence alignment, transcript quantification, and expression matrix merging) with a single command.
- **🧩 Modular Design**: The code is well-structured, making it easy to understand, maintain, and extend.
- **🎯 Flexible Input**: Supports two input modes:
  1. Automatically scans a directory for FASTQ files and identifies paired-end reads.
  2. Uses a sample sheet to precisely specify files for each sample.
- **🔧 Configurable Parameters**: Allows customization of thread count, FASTQ file naming patterns, and whether to retain intermediate BAM files.
- **📊 Clear Output**: Generates a consistently formatted expression matrix (`all.fpkm.tpm.txt`), a detailed run log, and a summary report.
- **🐍 Python API**: Provides an `RNASeqAnalyzer` class for easy integration into other Python projects.
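To make the workflow stages concrete, here is a minimal sketch of the kind of commands such a pipeline might assemble for one sample. It is not the pipeline's actual code: the function names, the output file layout, and the `genome_fasta`, `index_base`, and `annotation_gtf` parameters are assumptions made for illustration. Only the tool flags themselves (`hisat2-build -p`, `hisat2 -p/--dta/-x/-1/-2/-S`, `samtools sort -@/-o`, `stringtie -p/-G/-o`) are standard options of those programs.

```python
import shlex


def build_index_command(genome_fasta, index_base, threads=8):
    """Command that builds a HISAT2 index (done once per reference genome)."""
    return ["hisat2-build", "-p", str(threads), genome_fasta, index_base]


def build_commands(sample, read1, read2, index_base, annotation_gtf, out_dir, threads=8):
    """Return align, sort, and quantify commands for one paired-end sample.

    The file naming under out_dir is hypothetical, not the pipeline's real layout.
    """
    sam = f"{out_dir}/{sample}.sam"
    bam = f"{out_dir}/{sample}.sorted.bam"
    gtf = f"{out_dir}/{sample}.gtf"
    return [
        # --dta tailors the alignments for downstream transcript assemblers
        ["hisat2", "-p", str(threads), "--dta", "-x", index_base,
         "-1", read1, "-2", read2, "-S", sam],
        # StringTie expects a coordinate-sorted BAM file
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # -G supplies the reference annotation that guides assembly
        ["stringtie", bam, "-p", str(threads), "-G", annotation_gtf, "-o", gtf],
    ]


for command in build_commands("ctrl_1", "ctrl_1_R1.fq.gz", "ctrl_1_R2.fq.gz",
                              "index/genome", "genes.gtf", "results"):
    print(shlex.join(command))
```

Each command is kept as a list of arguments so it could be passed to `subprocess.run` without shell quoting problems; `shlex.join` is used here only to print a readable form.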
## Dependencies
Before running this pipeline, please ensure the following software is installed and accessible from your system's `PATH`.
1. **HISAT2**: For sequence alignment.
2. **StringTie**: For transcript assembly and quantification.
3. **Samtools**: For processing SAM/BAM files.
4. **Python 3.x**:
   - **pandas**: For data manipulation and matrix merging.
You can easily install these dependencies using Conda:

```bash
conda create -n rnaseq-env -c bioconda hisat2 stringtie samtools pandas
conda activate rnaseq-env
```
## Command-Line Arguments

| Argument | Short | Description | Required |
|---|---|---|---|
| | `-i` | Input path. Can be a directory containing FASTQ files or a sample information file. | Yes |
| `--output` | `-o` | Directory for output results. | Yes |
| `--pattern` | `-p` | [Optional] FASTQ file naming pattern, where `*` is a wildcard for the sample name, e.g. `"*_R1.fq.gz"`. If not provided, common patterns are auto-detected. | No |
| `--remove` | `-r` | [Optional] Whether to remove intermediate BAM files after processing to save space. Choices: `yes` or `no`. Default: `no`. | No |
| `--threads` | `-t` | [Optional] Number of threads to use. Default: `8`. | No |
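
The options above map naturally onto Python's standard `argparse` module. The sketch below is one way such a parser could be written, not the pipeline's actual source. The `build_parser` name and the help strings are illustrative, and since only the short form `-i` appears in this README, the input option is declared with that form alone (storing it under `input` is an assumption).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="HISAT2 + StringTie RNA-seq pipeline")
    # Only the short form -i is confirmed by this README; dest="input" is an assumption.
    parser.add_argument("-i", dest="input", required=True,
                        help="FASTQ directory or sample sheet file")
    parser.add_argument("-o", "--output", required=True,
                        help="directory for output results")
    parser.add_argument("-p", "--pattern", default=None,
                        help='FASTQ naming pattern, e.g. "*_R1.fq.gz"')
    parser.add_argument("-r", "--remove", choices=["yes", "no"], default="no",
                        help="remove intermediate BAM files after processing")
    parser.add_argument("-t", "--threads", type=int, default=8,
                        help="number of threads to use")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["-i", "fastq_dir", "-o", "results"])
print(args.threads, args.remove, args.pattern)
```

Using `choices=["yes", "no"]` makes `argparse` reject any other value for `--remove`, which matches the documented behavior of that option.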
## Input Formats

### 1. FASTQ Directory Mode

When the `-i` argument is a directory, the script automatically scans it for paired FASTQ files.

By default, the script recognizes common suffixes such as `_1`/`_2` and `_R1`/`_R2`. If your filenames do not follow these patterns, use the `--pattern` argument.
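
The pairing logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the `SUFFIX_PAIRS` and `EXTENSIONS` lists, the `pair_fastq` function, and the rule that the mate pattern is derived by swapping the last `1` in the pattern for `2` are all hypothetical.

```python
# Hypothetical defaults; the pipeline's real suffix and extension lists are not shown here.
SUFFIX_PAIRS = [("_R1", "_R2"), ("_1", "_2")]
EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def strip_extension(name):
    """Split a file name into (stem, extension), or (None, None) if not FASTQ."""
    for ext in EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)], ext
    return None, None


def pair_fastq(filenames, pattern=None):
    """Return {sample: (read1, read2)} for samples that have both mates."""
    names = set(filenames)
    pairs = {}
    if pattern is not None:
        # '*' stands for the sample name, e.g. "*_R1.fq.gz"
        prefix, _, suffix = pattern.partition("*")
        if "1" not in suffix:
            raise ValueError("pattern must contain a read-1 marker after '*'")
        # Assumption: the mate pattern swaps the last "1" for "2".
        head, _, tail = suffix.rpartition("1")
        mate_suffix = head + "2" + tail
        for name in sorted(names):
            if name.startswith(prefix) and name.endswith(suffix):
                sample = name[len(prefix): len(name) - len(suffix)]
                mate = prefix + sample + mate_suffix
                if sample and mate in names:
                    pairs[sample] = (name, mate)
        return pairs
    for name in sorted(names):
        stem, ext = strip_extension(name)
        if stem is None:
            continue
        for tag1, tag2 in SUFFIX_PAIRS:
            if stem.endswith(tag1):
                sample = stem[: -len(tag1)]
                mate = sample + tag2 + ext
                if sample and mate in names:
                    pairs[sample] = (name, mate)
                break
    return pairs


files = ["liver_R1.fq.gz", "liver_R2.fq.gz", "brain_1.fastq", "brain_2.fastq",
         "orphan_R1.fq.gz"]
print(pair_fastq(files))
print(pair_fastq(["s1.read1.fq.gz", "s1.read2.fq.gz"], pattern="*.read1.fq.gz"))
```

In this sketch a read-1 file without a matching read-2 file (`orphan_R1.fq.gz`) is silently skipped; a real pipeline might prefer to warn about it. To apply the function to a directory, the file names could be collected with `[p.name for p in pathlib.Path(directory).iterdir()]`.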
### 2. Sample Sheet Mode

When the `-i` argument is a file, it is parsed as a sample sheet. This is useful when files are in different locations or have complex names.

File format (tab-separated): `sample_name<TAB>path/to/read1.fq.gz<TAB>path/to/read2.fq.gz`
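
A sample sheet in this three-column format could be read with the standard `csv` module, as in the sketch below. The `read_sample_sheet` function is hypothetical, and skipping blank lines and `#` comment lines, as well as rejecting duplicate sample names, are assumptions rather than documented behavior of the pipeline.

```python
import csv
import io


def read_sample_sheet(handle):
    """Parse tab-separated rows of: sample_name, read1 path, read2 path."""
    samples = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Skipping blank lines and '#' comments is an assumption, not documented behavior.
        if not row or not row[0].strip() or row[0].startswith("#"):
            continue
        if len(row) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, got {len(row)}"
            )
        name, read1, read2 = (field.strip() for field in row)
        if name in samples:
            raise ValueError(f"line {line_no}: duplicate sample name {name!r}")
        samples[name] = (read1, read2)
    return samples


# An in-memory sheet stands in for a real file; the paths are made up.
sheet = io.StringIO(
    "ctrl_1\t/data/run1/ctrl_1_R1.fq.gz\t/data/run1/ctrl_1_R2.fq.gz\n"
    "treat_1\t/data/run2/treat_1_R1.fq.gz\t/data/run2/treat_1_R2.fq.gz\n"
)
samples = read_sample_sheet(sheet)
for name, (read1, read2) in samples.items():
    print(name, read1, read2)
```

For a file on disk, the same function would be called as `read_sample_sheet(open(path, newline=""))`, ideally inside a `with` block.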
## Output Files

- `all.fpkm.tpm.txt`: A tab-separated file containing FPKM and TPM values for every transcript across all samples, ready for downstream differential expression analysis.
- `rnaseq_summary.txt`: A concise text report summarizing the input files, parameters, and sample information used in the analysis.
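
To illustrate what merging per-sample results into one matrix involves, the sketch below combines FPKM and TPM values from several samples into a single wide, tab-separated table. It uses only the standard library and made-up numbers. The `merge_expression` function, the `<sample>_FPKM` / `<sample>_TPM` column naming, and filling missing transcripts with `0.0` are assumptions; the actual header of `all.fpkm.tpm.txt` is not shown in this README.

```python
import csv
import io


def merge_expression(per_sample):
    """Merge {sample: {transcript_id: (fpkm, tpm)}} into a header and wide rows."""
    samples = sorted(per_sample)
    transcripts = sorted({t for values in per_sample.values() for t in values})
    header = ["transcript_id"]
    for sample in samples:
        # Hypothetical column naming; the real file's header is not shown here.
        header += [f"{sample}_FPKM", f"{sample}_TPM"]
    rows = []
    for transcript in transcripts:
        row = [transcript]
        for sample in samples:
            # Transcripts missing from a sample are filled with 0.0 (an assumption).
            fpkm, tpm = per_sample[sample].get(transcript, (0.0, 0.0))
            row += [fpkm, tpm]
        rows.append(row)
    return header, rows


# Made-up values for two samples; tx2 is absent from treat_1.
per_sample = {
    "ctrl_1": {"tx1": (12.5, 30.1), "tx2": (0.8, 1.9)},
    "treat_1": {"tx1": (20.0, 48.2)},
}
header, rows = merge_expression(per_sample)

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue(), end="")
```

Taking the union of transcript IDs keeps a transcript in the matrix even when only some samples report it, which is usually what downstream tools expect.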
## Python API Usage
You can also import this pipeline as a library into your own Python scripts.