运行fastp的脚本

中文

功能概述

fastp 质控模块是 biopytools 工具包中的高效 FASTQ 数据质量控制工具，支持单端和双端测序数据的批量处理。该模块封装了 fastp 工具，提供了便捷的批处理功能和灵活的参数配置。

主要特性

🚀 高效批处理：自动识别和处理整个目录下的 FASTQ 文件
🔧 灵活配置：支持多种质控参数的自定义设置
📊 质量报告：自动生成 HTML 和 JSON 格式的质控报告
💾 双端支持：完整支持双端测序数据（Paired-end）处理
🎯 智能识别：自动识别文件配对关系

安装方法

系统依赖

确保系统已安装 fastp：

# Ubuntu/Debian
sudo apt-get install fastp

# CentOS/RHEL
sudo yum install fastp

# macOS (使用 Homebrew)
brew install fastp

# 或者使用 conda
conda install -c bioconda fastp

安装 biopytools

# 克隆项目
git clone https://github.com/yourusername/biopytools.git
cd biopytools

# 安装包
pip install -e .

# 验证安装
run_fastp --help

使用方法

基本语法

`run_fastp -i INPUT_DIR -o OUTPUT_DIR [OPTIONS]`

必需参数

参数	说明
`-i, --input-dir`	输入原始 FASTQ 数据目录
`-o, --output-dir`	输出清洁 FASTQ 数据目录

可选参数

参数	默认值	说明
`--fastp-path`	`fastp`	fastp 可执行文件路径
`-t, --threads`	`12`	线程数
`-q, --quality-threshold`	`30`	质量阈值
`-l, --min-length`	`50`	最小长度
`-u, --unqualified-percent`	`40`	不合格碱基百分比阈值
`-n, --n-base-limit`	`10`	N 碱基数量限制
`--read1-suffix`	`_1.fq.gz`	Read1 文件后缀
`--read2-suffix`	`_2.fq.gz`	Read2 文件后缀

使用示例

1. 基本用法

# 处理目录下的所有 FASTQ 文件
run_fastp -i ./raw_data -o ./clean_data

2. 自定义参数

# 使用更严格的质控标准
run_fastp -i ./raw_data -o ./clean_data \
    -q 35 \
    -l 75 \
    -u 30 \
    -t 16

3. 不同文件后缀

# 处理以 .R1.fastq.gz 和 .R2.fastq.gz 结尾的文件
run_fastp -i ./raw_data -o ./clean_data \
    --read1-suffix .R1.fastq.gz \
    --read2-suffix .R2.fastq.gz

4. 指定 fastp 路径

# 使用自定义路径的 fastp
run_fastp -i ./raw_data -o ./clean_data \
    --fastp-path /usr/local/bin/fastp

5. 完整参数示例

run_fastp \
    -i /path/to/raw_data \
    -o /path/to/clean_data \
    --fastp-path /usr/local/bin/fastp \
    -t 20 \
    -q 25 \
    -l 40 \
    -u 50 \
    -n 5 \
    --read1-suffix _R1.fq.gz \
    --read2-suffix _R2.fq.gz

输入文件格式

目录结构要求

raw_data/
├── sample1_1.fq.gz    # Read1
├── sample1_2.fq.gz    # Read2  
├── sample2_1.fq.gz
├── sample2_2.fq.gz
└── ...

支持的文件格式

.fq.gz / .fastq.gz：压缩的 FASTQ 文件
.fq / .fastq：未压缩的 FASTQ 文件

文件命名规则

程序会根据 --read1-suffix 和 --read2-suffix 参数自动识别配对文件：

sample_1.fq.gz ↔ sample_2.fq.gz
sample_R1.fastq.gz ↔ sample_R2.fastq.gz
sample.R1.fq.gz ↔ sample.R2.fq.gz

输出结果

输出目录结构

clean_data/
├── sample1_1.clean.fq.gz     # 清洁的 Read1 文件
├── sample1_2.clean.fq.gz     # 清洁的 Read2 文件
├── sample1.fastp.html        # HTML 质控报告
├── sample1.fastp.json        # JSON 质控报告
├── sample2_1.clean.fq.gz
├── sample2_2.clean.fq.gz
├── sample2.fastp.html
├── sample2.fastp.json
└── batch_summary.txt         # 批处理总结报告

质控报告说明

HTML 报告：可视化的质控结果，包含质量分布图、GC含量等
JSON 报告：机器可读的质控统计数据
批处理总结：所有样本的处理状态和统计信息

质控参数说明

质量阈值 (`-q`, `--quality-threshold`)

默认值：30
说明：Phred 质量值阈值，低于此值的碱基被认为是低质量碱基
建议值：
- 严格：35+
- 标准：30
- 宽松：20-25

最小长度 (`-l`, `--min-length`)

默认值：50
说明：过滤后序列的最小长度，短于此长度的序列将被丢弃
建议值：
- RNA-seq：50-75
- DNA-seq：30-50
- 16S rRNA：200+

不合格碱基百分比 (`-u`, `--unqualified-percent`)

默认值：40
说明：如果序列中低质量碱基的百分比超过此阈值，整条序列将被丢弃
建议值：30-50%

N 碱基限制 (`-n`, `--n-base-limit`)

默认值：10
说明：序列中允许的最大 N 碱基数量
建议值：5-15

性能优化

线程设置

# 根据 CPU 核心数设置线程
run_fastp -i input -o output -t $(nproc)

# 或者设置为核心数的 80%
run_fastp -i input -o output -t $(($(nproc) * 4 / 5))

内存使用

每个线程大约使用 500MB-1GB 内存
建议总内存使用量不超过系统内存的 80%

故障排除

常见问题

1. 找不到 fastp 命令

# 检查 fastp 是否安装
which fastp

# 如果未安装，使用 conda 安装
conda install -c bioconda fastp

2. 权限错误

# 确保输出目录有写入权限
chmod 755 /path/to/output_dir

3. 文件未找到

# 检查输入目录是否存在
ls -la /path/to/input_dir

# 检查文件后缀是否正确
ls /path/to/input_dir/*_1.fq.gz

4. 内存不足

# 减少线程数
run_fastp -i input -o output -t 4

调试模式

# 查看详细输出（注意：--verbose 未列在上文的可选参数表中，请先用 run_fastp --help 确认是否支持）
run_fastp -i input -o output --verbose

# 检查 fastp 版本
fastp --version

最佳实践

1. 质控前检查

# 检查原始数据质量
fastqc raw_data/*.fq.gz -o qc_reports/

# 统计文件数量
find raw_data -name "*_1.fq.gz" | wc -l

2. 参数选择建议

应用场景	质量阈值	最小长度	不合格碱基%
RNA-seq	25-30	50-75	40-50
WGS	30-35	50-100	30-40
16S rRNA	25	200+	50
ChIP-seq	20-25	30-50	50

3. 质控后验证

# 检查处理结果
fastqc clean_data/*.clean.fq.gz -o qc_reports_after/

# 比较处理前后的统计
multiqc qc_reports/ qc_reports_after/

Python API 使用

from biopytools.fastp import FastpProcessor, FastpConfig

# 创建配置
config = FastpConfig(
    input_dir="./raw_data",
    output_dir="./clean_data",
    threads=16,
    quality_threshold=30,
    min_length=50
)

# 运行处理
processor = FastpProcessor(config)
results = processor.run_batch_processing()

# 查看结果
print(f"处理了 {results['total_samples']} 个样本")
print(f"成功：{results['success_count']}")
print(f"失败：{results['failed_count']}")

English

Feature Overview

The fastp QC module is an efficient FASTQ data quality control tool within the biopytools toolkit, supporting batch processing for both single-end and paired-end sequencing data. This module wraps the fastp tool, providing convenient batch processing capabilities and flexible parameter configuration.

Key Features

🚀 Efficient Batch Processing: Automatically discovers and processes all FASTQ files in a directory.
🔧 Flexible Configuration: Supports customization of various quality control parameters.
📊 Quality Reports: Automatically generates quality control reports in HTML and JSON formats.
💾 Paired-End Support: Full support for processing paired-end sequencing data.
🎯 Smart Pairing: Automatically identifies paired-end file relationships.

Installation

System Dependencies

Ensure that fastp is installed on your system:

# For Ubuntu/Debian
sudo apt-get install fastp

# For CentOS/RHEL
sudo yum install fastp

# For macOS (with Homebrew)
brew install fastp

# Alternatively, using conda
conda install -c bioconda fastp

Install biopytools

# Clone the repository
git clone https://github.com/yourusername/biopytools.git
cd biopytools

# Install the package
pip install -e .

# Verify the installation
run_fastp --help

Usage

Basic Syntax

`run_fastp -i INPUT_DIR -o OUTPUT_DIR [OPTIONS]`

Required Arguments

Argument	Description
`-i, --input-dir`	Input directory containing raw FASTQ files.
`-o, --output-dir`	Output directory for cleaned FASTQ files.

Optional Arguments

Argument	Default	Description
`--fastp-path`	`fastp`	Path to the fastp executable.
`-t, --threads`	`12`	Number of threads to use.
`-q, --quality-threshold`	`30`	Quality threshold (Phred score).
`-l, --min-length`	`50`	Minimum length required for a read.
`-u, --unqualified-percent`	`40`	Percentage threshold of unqualified bases.
`-n, --n-base-limit`	`10`	Maximum number of N bases allowed.
`--read1-suffix`	`_1.fq.gz`	Suffix for Read1 files.
`--read2-suffix`	`_2.fq.gz`	Suffix for Read2 files.

Examples

1. Basic Usage

# Process all FASTQ files in a directory
run_fastp -i ./raw_data -o ./clean_data

2. Custom Parameters

# Use stricter quality control standards
run_fastp -i ./raw_data -o ./clean_data \
    -q 35 \
    -l 75 \
    -u 30 \
    -t 16

3. Different File Suffixes

# Process files ending with .R1.fastq.gz and .R2.fastq.gz
run_fastp -i ./raw_data -o ./clean_data \
    --read1-suffix .R1.fastq.gz \
    --read2-suffix .R2.fastq.gz

4. Specify fastp Path

# Use a custom path for the fastp executable
run_fastp -i ./raw_data -o ./clean_data \
    --fastp-path /usr/local/bin/fastp

5. Full Parameter Example

run_fastp \
    -i /path/to/raw_data \
    -o /path/to/clean_data \
    --fastp-path /usr/local/bin/fastp \
    -t 20 \
    -q 25 \
    -l 40 \
    -u 50 \
    -n 5 \
    --read1-suffix _R1.fq.gz \
    --read2-suffix _R2.fq.gz
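
To make the option mapping concrete, here is a minimal sketch of how a wrapper like `run_fastp` might translate its options into a single fastp invocation for one sample. The function name `build_fastp_command` and the output naming are assumptions made for illustration (the naming follows the output layout described in this article); the fastp flags themselves (`-i/-I`, `-o/-O`, `-q`, `-u`, `-n`, `-l`, `-w`, `-h`, `-j`) are fastp's own. This is not necessarily how biopytools builds the command internally.

```python
from pathlib import Path


def build_fastp_command(read1, read2, out_dir, sample, *, fastp_path="fastp",
                        threads=12, quality=30, min_length=50,
                        unqualified_percent=40, n_base_limit=10):
    """Return the fastp argument list for one paired-end sample (hypothetical helper)."""
    out = Path(out_dir)
    return [
        fastp_path,
        "-i", str(read1), "-I", str(read2),            # raw Read1 / Read2
        "-o", str(out / f"{sample}_1.clean.fq.gz"),    # cleaned Read1
        "-O", str(out / f"{sample}_2.clean.fq.gz"),    # cleaned Read2
        "-q", str(quality),                            # qualified Phred threshold
        "-u", str(unqualified_percent),                # max % of unqualified bases
        "-n", str(n_base_limit),                       # max number of N bases
        "-l", str(min_length),                         # minimum read length
        "-w", str(threads),                            # worker threads
        "-h", str(out / f"{sample}.fastp.html"),       # HTML report
        "-j", str(out / f"{sample}.fastp.json"),       # JSON report
    ]


cmd = build_fastp_command("raw_data/sample1_1.fq.gz", "raw_data/sample1_2.fq.gz",
                          "clean_data", "sample1", quality=25, threads=20)
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual file names.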

Input File Format

Directory Structure Requirement

raw_data/
├── sample1_1.fq.gz    # Read1
├── sample1_2.fq.gz    # Read2  
├── sample2_1.fq.gz
├── sample2_2.fq.gz
└── ...

Supported File Formats

.fq.gz / .fastq.gz: Compressed FASTQ files
.fq / .fastq: Uncompressed FASTQ files

File Naming Convention

The program automatically identifies paired files based on the --read1-suffix and --read2-suffix arguments:

sample_1.fq.gz ↔ sample_2.fq.gz
sample_R1.fastq.gz ↔ sample_R2.fastq.gz
sample.R1.fq.gz ↔ sample.R2.fq.gz
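
The suffix-based pairing described above can be sketched as follows. This is an illustrative assumption about the matching logic, not the actual biopytools implementation; `find_pairs` is a hypothetical name. The sample name is taken to be the Read1 file name with the Read1 suffix removed, and a sample is kept only if the matching Read2 file exists.

```python
from pathlib import Path


def find_pairs(input_dir, read1_suffix="_1.fq.gz", read2_suffix="_2.fq.gz"):
    """Map sample name -> (read1_path, read2_path) for complete pairs only."""
    pairs = {}
    for r1 in sorted(Path(input_dir).glob(f"*{read1_suffix}")):
        # Strip the Read1 suffix to recover the sample name.
        sample = r1.name[: -len(read1_suffix)]
        r2 = r1.with_name(sample + read2_suffix)
        if r2.exists():  # skip Read1 files that have no mate
            pairs[sample] = (r1, r2)
    return pairs
```

Calling `find_pairs("./raw_data", ".R1.fastq.gz", ".R2.fastq.gz")` would apply the same logic to the alternative naming scheme.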

Output

Output Directory Structure

clean_data/
├── sample1_1.clean.fq.gz     # Cleaned Read1 file
├── sample1_2.clean.fq.gz     # Cleaned Read2 file
├── sample1.fastp.html        # HTML quality control report
├── sample1.fastp.json        # JSON quality control report
├── sample2_1.clean.fq.gz
├── sample2_2.clean.fq.gz
├── sample2.fastp.html
├── sample2.fastp.json
└── batch_summary.txt         # Batch processing summary report

Description of QC Reports

HTML Report: Visualized quality control results, including quality distribution plots, GC content, etc.
JSON Report: Machine-readable quality control statistics.
Batch Summary: Processing status and statistics for all samples.
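
Because the JSON report is machine-readable, per-sample statistics can be pulled out with a few lines of Python. The sketch below assumes the usual fastp report layout, in which read counts sit under `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads`; the helper names are hypothetical.

```python
import json


def summarize_report(report):
    """Extract read counts and the retained fraction from a parsed fastp JSON report."""
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    return {
        "reads_before": before,
        "reads_after": after,
        "retained_fraction": after / before if before else 0.0,
    }


def summarize_report_file(path):
    """Load a report such as clean_data/sample1.fastp.json and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_report(json.load(handle))
```

A retained fraction far below the other samples in a batch is often a sign that the sample deserves a closer look in its HTML report.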

Explanation of QC Parameters

Quality Threshold (`-q`, `--quality-threshold`)

Default: 30
Description: The Phred quality score threshold. Bases with a quality score below this value are considered low-quality.
Recommended Values:
- Strict: 35+
- Standard: 30
- Lenient: 20-25
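
To see what these thresholds mean in practice, recall the standard Phred definition: a quality score Q corresponds to an estimated base-call error probability of 10^(-Q/10). A short sketch (the function name is chosen here for illustration):

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the estimated base-call error probability."""
    return 10 ** (-q / 10)


for q in (20, 25, 30, 35):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_probability(q))} bases")
```

So the default of 30 treats any base with an estimated error rate above roughly 1 in 1000 as low quality, while a lenient threshold of 20 tolerates about 1 in 100.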

Minimum Length (`-l`, `--min-length`)

Default: 50
Description: The minimum length of a read after trimming. Reads shorter than this length will be discarded.
Recommended Values:
- RNA-seq: 50-75
- DNA-seq: 30-50
- 16S rRNA: 200+

Unqualified Base Percentage (`-u`, `--unqualified-percent`)

Default: 40
Description: If the percentage of low-quality bases in a read exceeds this threshold, the entire read will be discarded.
Recommended Values: 30-50%

N Base Limit (`-n`, `--n-base-limit`)

Default: 10
Description: The maximum number of N bases allowed in a read.
Recommended Values: 5-15
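
The four parameters above combine into a per-read pass/fail decision. The following is a simplified sketch of that decision as described in this section, not fastp's actual implementation (fastp also performs trimming and adapter removal, which are ignored here). `passes_filters` is a hypothetical name, and `quals` is assumed to be a list of integer Phred scores, one per base.

```python
def passes_filters(seq, quals, *, quality=30, unqualified_percent=40,
                   n_base_limit=10, min_length=50):
    """Return True if a read survives the length, N-base, and quality filters."""
    if not seq or len(seq) < min_length:
        return False  # too short (or empty)
    if seq.upper().count("N") > n_base_limit:
        return False  # too many ambiguous bases
    low_quality = sum(1 for q in quals if q < quality)
    # Discard the read if the share of low-quality bases exceeds the threshold.
    return 100 * low_quality / len(quals) <= unqualified_percent
```

With the defaults, a 60-base read in which half of the bases fall below Q30 is discarded, because 50% exceeds the 40% limit.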

Performance Optimization

Thread Settings

# Set threads based on the number of CPU cores
run_fastp -i input -o output -t $(nproc)

# Or set it to 80% of the core count
run_fastp -i input -o output -t $(($(nproc) * 4 / 5))

Memory Usage

Each thread uses approximately 500MB-1GB of memory.
It is recommended that total memory usage does not exceed 80% of the system’s available memory.
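
The two rules of thumb above can be combined into a quick estimate of a safe thread count. This sketch simply applies the figures quoted in this section (up to about 1 GB per thread, at most 80% of memory); both numbers are the article's estimates rather than measured values, and `suggest_threads` is a hypothetical helper.

```python
import os


def suggest_threads(total_mem_gb, cores=None, per_thread_gb=1.0, mem_fraction=0.8):
    """Pick a thread count limited by both CPU cores and the memory budget."""
    cores = cores or os.cpu_count() or 1
    by_memory = int(total_mem_gb * mem_fraction / per_thread_gb)
    return max(1, min(cores, by_memory))


print(suggest_threads(16, cores=32))  # memory-bound machine
print(suggest_threads(64, cores=8))   # CPU-bound machine
```

The result can be passed to `run_fastp` through `-t`.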

Troubleshooting

Common Issues

1. fastp command not found

# Check if fastp is installed and in your PATH
which fastp

# If not installed, use conda to install it
conda install -c bioconda fastp
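
When driving the tool from Python, the same check can be done before launching a batch. A minimal sketch using the standard library's `shutil.which`; the function name `resolve_fastp` is an assumption for illustration:

```python
import shutil


def resolve_fastp(fastp_path="fastp"):
    """Return the full path to the fastp executable, or raise if it cannot be found."""
    resolved = shutil.which(fastp_path)
    if resolved is None:
        raise FileNotFoundError(
            f"'{fastp_path}' was not found on PATH or is not executable; "
            "install fastp or pass its location via --fastp-path"
        )
    return resolved
```

Failing early like this avoids discovering the missing dependency only after the first sample has been queued.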

2. Permission error

# Ensure you have write permissions for the output directory
chmod 755 /path/to/output_dir

3. File not found

# Check if the input directory exists
ls -la /path/to/input_dir

# Check if the file suffixes are correct
ls /path/to/input_dir/*_1.fq.gz

4. Insufficient memory

# Reduce the number of threads
run_fastp -i input -o output -t 4

Debug Mode

# View detailed (verbose) output (note: --verbose is not listed in the options table above; confirm it with run_fastp --help)
run_fastp -i input -o output --verbose

# Check the fastp version
fastp --version

Best Practices

1. Pre-QC Check

# Check the quality of the raw data
fastqc raw_data/*.fq.gz -o qc_reports/

# Count the Read1 files (one per expected pair, assuming the default _1.fq.gz suffix)
find raw_data -name "*_1.fq.gz" | wc -l

2. Parameter Selection Recommendations

Application	Quality Threshold	Min Length	Unqualified Base %
RNA-seq	25-30	50-75	40-50
WGS	30-35	50-100	30-40
16S rRNA	25	200+	50
ChIP-seq	20-25	30-50	50

3. Post-QC Validation

# Check the quality of the cleaned data
fastqc clean_data/*.clean.fq.gz -o qc_reports_after/

# Compare statistics before and after QC using MultiQC
multiqc qc_reports/ qc_reports_after/

Using the Python API

from biopytools.fastp import FastpProcessor, FastpConfig

# Create a configuration object
config = FastpConfig(
    input_dir="./raw_data",
    output_dir="./clean_data",
    threads=16,
    quality_threshold=30,
    min_length=50
)

# Initialize and run the processor
processor = FastpProcessor(config)
results = processor.run_batch_processing()

# View the results
print(f"Processed {results['total_samples']} samples")
print(f"Succeeded: {results['success_count']}")
print(f"Failed: {results['failed_count']}")


https://lixiang117423.github.io/article/run-fastp/

作者

李详【Xiang LI】

发布于

2025年7月15日
