A Script for Running fastp

Feature Overview

The fastp QC module is the FASTQ quality control tool in the biopytools toolkit. It wraps fastp and adds batch processing and configurable parameters for both single-end and paired-end sequencing data.

Key Features

  • 🚀 Efficient Batch Processing: Automatically discovers and processes all FASTQ files in a directory.
  • 🔧 Flexible Configuration: Supports customization of various quality control parameters.
  • 📊 Quality Reports: Automatically generates quality control reports in HTML and JSON formats.
  • 💾 Paired-End Support: Full support for processing paired-end sequencing data.
  • 🎯 Smart Pairing: Automatically identifies paired-end file relationships.

Installation

System Dependencies

Ensure that fastp is installed on your system:

# For Ubuntu/Debian
sudo apt-get install fastp

# For CentOS/RHEL
sudo yum install fastp

# For macOS (with Homebrew)
brew install fastp

# Alternatively, using conda
conda install -c bioconda fastp

Install biopytools

# Clone the repository
git clone https://github.com/yourusername/biopytools.git
cd biopytools

# Install the package
pip install -e .

# Verify the installation
run_fastp --help

Usage

Basic Syntax

run_fastp -i INPUT_DIR -o OUTPUT_DIR [OPTIONS]

Required Arguments

| Argument | Description |
| --- | --- |
| -i, --input-dir | Input directory containing raw FASTQ files. |
| -o, --output-dir | Output directory for cleaned FASTQ files. |

Optional Arguments

| Argument | Default | Description |
| --- | --- | --- |
| --fastp-path | fastp | Path to the fastp executable. |
| -t, --threads | 12 | Number of threads to use. |
| -q, --quality-threshold | 30 | Quality threshold (Phred score). |
| -l, --min-length | 50 | Minimum length required for a read. |
| -u, --unqualified-percent | 40 | Percentage threshold of unqualified bases. |
| -n, --n-base-limit | 10 | Maximum number of N bases allowed. |
| --read1-suffix | _1.fq.gz | Suffix for Read1 files. |
| --read2-suffix | _2.fq.gz | Suffix for Read2 files. |
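
The article does not show how these options reach fastp. As a rough sketch, assuming the wrapper passes them straight through to fastp's own long options (`--in1`/`--in2`, `--out1`/`--out2`, `--qualified_quality_phred`, `--length_required`, `--unqualified_percent_limit`, `--n_base_limit`, `--thread`), one paired-end sample might translate into the command built below. The function `build_fastp_command` is hypothetical and not part of biopytools; the output file names follow the layout described in the Output section.

```python
from pathlib import Path


def build_fastp_command(sample, input_dir, output_dir, *,
                        fastp_path="fastp", threads=12, quality=30,
                        min_length=50, unqualified_percent=40, n_base_limit=10,
                        read1_suffix="_1.fq.gz", read2_suffix="_2.fq.gz"):
    """Return a fastp argument list for one paired-end sample (hypothetical helper)."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    return [
        fastp_path,
        "--in1", str(in_dir / f"{sample}{read1_suffix}"),
        "--in2", str(in_dir / f"{sample}{read2_suffix}"),
        "--out1", str(out_dir / f"{sample}_1.clean.fq.gz"),
        "--out2", str(out_dir / f"{sample}_2.clean.fq.gz"),
        "--html", str(out_dir / f"{sample}.fastp.html"),
        "--json", str(out_dir / f"{sample}.fastp.json"),
        "--thread", str(threads),
        "--qualified_quality_phred", str(quality),
        "--length_required", str(min_length),
        "--unqualified_percent_limit", str(unqualified_percent),
        "--n_base_limit", str(n_base_limit),
    ]


# Defaults mirror the table above; the list can be handed to subprocess.run().
cmd = build_fastp_command("sample1", "raw_data", "clean_data")
print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when sample names or paths contain spaces.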

Examples

1. Basic Usage

# Process all FASTQ files in a directory
run_fastp -i ./raw_data -o ./clean_data

2. Custom Parameters

# Use stricter quality control standards
run_fastp -i ./raw_data -o ./clean_data \
    -q 35 \
    -l 75 \
    -u 30 \
    -t 16

3. Different File Suffixes

# Process files ending with .R1.fastq.gz and .R2.fastq.gz
run_fastp -i ./raw_data -o ./clean_data \
    --read1-suffix .R1.fastq.gz \
    --read2-suffix .R2.fastq.gz

4. Specify fastp Path

# Use a custom path for the fastp executable
run_fastp -i ./raw_data -o ./clean_data \
    --fastp-path /usr/local/bin/fastp

5. Full Parameter Example

run_fastp \
    -i /path/to/raw_data \
    -o /path/to/clean_data \
    --fastp-path /usr/local/bin/fastp \
    -t 20 \
    -q 25 \
    -l 40 \
    -u 50 \
    -n 5 \
    --read1-suffix _R1.fq.gz \
    --read2-suffix _R2.fq.gz

Input File Format

Directory Structure Requirement

raw_data/
├── sample1_1.fq.gz # Read1
├── sample1_2.fq.gz # Read2
├── sample2_1.fq.gz
├── sample2_2.fq.gz
└── ...

Supported File Formats

  • .fq.gz / .fastq.gz: Compressed FASTQ files
  • .fq / .fastq: Uncompressed FASTQ files

File Naming Convention

The program automatically identifies paired files based on the --read1-suffix and --read2-suffix arguments:

  • sample_1.fq.gz and sample_2.fq.gz
  • sample_R1.fastq.gz and sample_R2.fastq.gz
  • sample.R1.fq.gz and sample.R2.fq.gz
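
The pairing rule can be illustrated with a few lines of Python. This is only a sketch of suffix-based matching under the assumption that the sample name is whatever precedes the Read1 suffix; the helper `pair_fastq_files` is hypothetical and the module's real implementation may differ.

```python
def pair_fastq_files(filenames, read1_suffix="_1.fq.gz", read2_suffix="_2.fq.gz"):
    """Group file names into (sample, read1, read2) tuples by suffix (hypothetical helper)."""
    names = set(filenames)
    pairs, unpaired = [], []
    for name in sorted(names):
        if not name.endswith(read1_suffix):
            continue  # not a Read1 file, so it cannot start a pair
        sample = name[: -len(read1_suffix)]
        mate = sample + read2_suffix
        if mate in names:
            pairs.append((sample, name, mate))
        else:
            unpaired.append(name)  # Read1 file without a matching Read2 file
    return pairs, unpaired


files = ["sample1_1.fq.gz", "sample1_2.fq.gz", "sample2_1.fq.gz", "notes.txt"]
pairs, unpaired = pair_fastq_files(files)
print(pairs)     # [('sample1', 'sample1_1.fq.gz', 'sample1_2.fq.gz')]
print(unpaired)  # ['sample2_1.fq.gz']
```

Reporting unpaired Read1 files separately makes it easier to spot a wrong suffix setting before a long batch run starts.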

Output

Output Directory Structure

clean_data/
├── sample1_1.clean.fq.gz # Cleaned Read1 file
├── sample1_2.clean.fq.gz # Cleaned Read2 file
├── sample1.fastp.html # HTML quality control report
├── sample1.fastp.json # JSON quality control report
├── sample2_1.clean.fq.gz
├── sample2_2.clean.fq.gz
├── sample2.fastp.html
├── sample2.fastp.json
└── batch_summary.txt # Batch processing summary report

Description of QC Reports

  • HTML Report: Visualized quality control results, including quality distribution plots, GC content, etc.
  • JSON Report: Machine-readable quality control statistics.
  • Batch Summary: Processing status and statistics for all samples.
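
Because the JSON report is machine-readable, per-sample statistics can be pulled out with a short script. The sketch below assumes the usual fastp report layout, where `summary.before_filtering` and `summary.after_filtering` each carry `total_reads` and `q30_rate`; the function name `summarize_fastp_report` and the example numbers are made up for illustration.

```python
import json


def summarize_fastp_report(report):
    """Return read retention and Q30 rates from a parsed fastp JSON report."""
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    total = before["total_reads"]
    retained = after["total_reads"] / total if total else 0.0
    return {
        "reads_before": total,
        "reads_after": after["total_reads"],
        "retained_fraction": round(retained, 4),
        "q30_before": before["q30_rate"],
        "q30_after": after["q30_rate"],
    }


# A trimmed-down, invented report. A real one would be loaded from a file
# such as clean_data/sample1.fastp.json with json.load().
example = json.loads("""
{"summary": {
  "before_filtering": {"total_reads": 1000, "q30_rate": 0.90},
  "after_filtering":  {"total_reads": 950,  "q30_rate": 0.95}
}}
""")
print(summarize_fastp_report(example))
```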

Explanation of QC Parameters

Quality Threshold (-q, --quality-threshold)

  • Default: 30
  • Description: The Phred quality score threshold. Bases with a quality score below this value are considered low-quality.
  • Recommended Values:
    • Strict: 35+
    • Standard: 30
    • Lenient: 20-25

Minimum Length (-l, --min-length)

  • Default: 50
  • Description: The minimum length of a read after trimming. Reads shorter than this length will be discarded.
  • Recommended Values:
    • RNA-seq: 50-75
    • DNA-seq: 30-50
    • 16S rRNA: 200+

Unqualified Base Percentage (-u, --unqualified-percent)

  • Default: 40
  • Description: If the percentage of low-quality bases in a read exceeds this threshold, the entire read will be discarded.
  • Recommended Values: 30-50%

N Base Limit (-n, --n-base-limit)

  • Default: 10
  • Description: The maximum number of N bases allowed in a read.
  • Recommended Values: 5-15
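
Taken together, the four thresholds amount to a pass/fail decision for each read. The following is a simplified sketch of that decision, not fastp's actual implementation: it assumes Phred+33 quality strings, ignores trimming and adapter removal, and uses the hypothetical name `passes_filters`.

```python
def passes_filters(sequence, quality, *, quality_threshold=30, min_length=50,
                   unqualified_percent=40, n_base_limit=10):
    """Apply the four thresholds to one read; quality is a Phred+33 string."""
    if not sequence or len(sequence) < min_length:
        return False  # too short (-l)
    if sequence.upper().count("N") > n_base_limit:
        return False  # too many N bases (-n)
    # Count bases whose Phred score falls below the quality threshold (-q)
    low = sum(1 for ch in quality if ord(ch) - 33 < quality_threshold)
    # Discard the read if the low-quality share exceeds the limit (-u)
    return 100.0 * low / len(quality) <= unqualified_percent


good = passes_filters("ACGT" * 15, "I" * 60)               # all bases at Q40
noisy = passes_filters("ACGT" * 15, "I" * 30 + "#" * 30)   # half the bases at Q2
short = passes_filters("ACGT" * 5, "I" * 20)               # only 20 bases long
print(good, noisy, short)  # True False False
```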

Performance Optimization

Thread Settings

# Set threads based on the number of CPU cores
run_fastp -i input -o output -t $(nproc)

# Or set it to 80% of the core count
run_fastp -i input -o output -t $(($(nproc) * 4 / 5))

Memory Usage

  • Each thread uses approximately 500 MB to 1 GB of memory.
  • Keep total memory usage below 80% of the system's available memory.
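
These two rules of thumb can be combined into a single thread estimate. The sketch below takes the upper figure of 1 GB per thread and the 80% memory ceiling from the guidance above; the function `suggest_threads` and its parameter names are hypothetical.

```python
import os


def suggest_threads(total_mem_gb, cores=None, per_thread_gb=1.0, mem_fraction=0.8):
    """Pick a thread count limited by both CPU cores and the memory budget."""
    cores = cores or os.cpu_count() or 1
    # Number of threads that fit into the allowed share of memory
    by_memory = int(total_mem_gb * mem_fraction // per_thread_gb)
    return max(1, min(cores, by_memory))


print(suggest_threads(16, cores=32))  # 12: memory is the limiting factor
print(suggest_threads(64, cores=8))   # 8: cores are the limiting factor
```

The result can be passed to run_fastp with the -t option.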

Troubleshooting

Common Issues

1. fastp command not found

# Check if fastp is installed and in your PATH
which fastp

# If not installed, use conda to install it
conda install -c bioconda fastp

2. Permission error

# Ensure you have write permissions for the output directory
chmod 755 /path/to/output_dir

3. File not found

# Check if the input directory exists
ls -la /path/to/input_dir

# Check if the file suffixes are correct
ls /path/to/input_dir/*_1.fq.gz

4. Insufficient memory

# Reduce the number of threads
run_fastp -i input -o output -t 4

Debug Mode

# View detailed (verbose) output
run_fastp -i input -o output --verbose

# Check the fastp version
fastp --version

Best Practices

1. Pre-QC Check

# Check the quality of the raw data
fastqc raw_data/*.fq.gz -o qc_reports/

# Count the Read1 files (one per sample pair)
find raw_data -name "*_1.fq.gz" | wc -l

2. Parameter Selection Recommendations

| Application | Quality Threshold | Min Length | Unqualified Base % |
| --- | --- | --- | --- |
| RNA-seq | 25-30 | 50-75 | 40-50 |
| WGS | 30-35 | 50-100 | 30-40 |
| 16S rRNA | 25 | 200+ | 50 |
| ChIP-seq | 20-25 | 30-50 | 50 |

3. Post-QC Validation

# Check the quality of the cleaned data
fastqc clean_data/*.clean.fq.gz -o qc_reports_after/

# Compare statistics before and after QC using MultiQC
multiqc qc_reports/ qc_reports_after/

Using the Python API

from biopytools.fastp import FastpProcessor, FastpConfig

# Create a configuration object
config = FastpConfig(
    input_dir="./raw_data",
    output_dir="./clean_data",
    threads=16,
    quality_threshold=30,
    min_length=50,
)

# Initialize and run the processor
processor = FastpProcessor(config)
results = processor.run_batch_processing()

# View the results
print(f"Processed {results['total_samples']} samples")
print(f"Succeeded: {results['success_count']}")
print(f"Failed: {results['failed_count']}")
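
In a pipeline it can be useful to turn the returned dictionary into a success rate and an exit status. The sketch below relies only on the three keys printed above; the helper `check_batch_results` is hypothetical, and the dictionary is a stand-in for what `run_batch_processing()` returns.

```python
def check_batch_results(results):
    """Return (success_rate, ok) computed from the batch results dictionary."""
    total = results["total_samples"]
    rate = results["success_count"] / total if total else 0.0
    return rate, results["failed_count"] == 0


# Stand-in values for illustration only
batch = {"total_samples": 4, "success_count": 3, "failed_count": 1}
rate, ok = check_batch_results(batch)
exit_code = 0 if ok else 1  # could be passed to sys.exit() in a real script
print(f"Success rate: {rate:.0%}, exit code: {exit_code}")
```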

Original article: https://lixiang117423.github.io/article/run-fastp/
Author: 李详 (Xiang LI)
Published: July 15, 2025