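Feature Overview
The fastp QC module is a FASTQ quality-control tool in the biopytools toolkit that batch-processes single-end and paired-end sequencing data. It wraps the fastp tool and adds convenient batch processing and flexible parameter configuration.
Key Features
🚀 Efficient batch processing: automatically finds and processes the FASTQ files in a directory
🔧 Flexible configuration: quality-control parameters can be customized
📊 Quality reports: HTML and JSON QC reports are generated automatically
💾 Paired-end support: full support for paired-end sequencing data
🎯 Smart pairing: paired files are matched automatically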
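Installation
System Dependencies
Make sure fastp is installed on your system:
```bash
# Debian/Ubuntu
sudo apt-get install fastp

# CentOS/RHEL
sudo yum install fastp

# macOS (Homebrew)
brew install fastp

# Conda
conda install -c bioconda fastp
```
```bash
# Install biopytools from source
git clone https://github.com/yourusername/biopytools.git
cd biopytools
pip install -e .

# Verify the installation
run_fastp --help
```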
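Usage
Basic Syntax
```bash
run_fastp -i INPUT_DIR -o OUTPUT_DIR [OPTIONS]
```
Required Arguments
| Argument | Description |
| --- | --- |
| `-i, --input-dir` | Input directory containing the raw FASTQ files |
| `-o, --output-dir` | Output directory for the cleaned FASTQ files |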
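Optional Arguments
| Argument | Default | Description |
| --- | --- | --- |
| `--fastp-path` | `fastp` | Path to the fastp executable |
| `-t, --threads` | 12 | Number of threads |
| `-q, --quality-threshold` | 30 | Quality threshold (Phred score) |
| `-l, --min-length` | 50 | Minimum read length |
| `-u, --unqualified-percent` | 40 | Percentage threshold of unqualified bases |
| `-n, --n-base-limit` | 10 | Maximum number of N bases |
| `--read1-suffix` | `_1.fq.gz` | Suffix of Read1 files |
| `--read2-suffix` | `_2.fq.gz` | Suffix of Read2 files |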
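The page does not show how the wrapper invokes fastp, so the following is only a sketch of one plausible mapping from the options above to a paired-end fastp command line. The helper name `build_fastp_command` is hypothetical, and the output file names are assumed to follow the output layout documented on this page. The fastp flags themselves (`-i`, `-I`, `-o`, `-O`, `-h`, `-j`, `-w`, `-q`, `-l`, `-u`, `-n`) are fastp's own short options.

```python
from pathlib import Path


def build_fastp_command(
    read1,
    read2,
    out_dir,
    sample,
    fastp_path="fastp",
    threads=12,
    quality_threshold=30,
    min_length=50,
    unqualified_percent=40,
    n_base_limit=10,
):
    """Return the argv list for one paired-end fastp call (hypothetical helper).

    Defaults mirror the run_fastp defaults listed in the table above.
    """
    out_dir = Path(out_dir)
    return [
        fastp_path,
        "-i", str(read1),                                  # Read1 input
        "-I", str(read2),                                  # Read2 input
        "-o", str(out_dir / f"{sample}_1.clean.fq.gz"),    # Read1 output
        "-O", str(out_dir / f"{sample}_2.clean.fq.gz"),    # Read2 output
        "-h", str(out_dir / f"{sample}.fastp.html"),       # HTML report
        "-j", str(out_dir / f"{sample}.fastp.json"),       # JSON report
        "-w", str(threads),                                # worker threads
        "-q", str(quality_threshold),                      # qualified Phred score
        "-l", str(min_length),                             # required read length
        "-u", str(unqualified_percent),                    # unqualified percent limit
        "-n", str(n_base_limit),                           # N base limit
    ]


cmd = build_fastp_command(
    "raw_data/sample1_1.fq.gz", "raw_data/sample1_2.fq.gz", "clean_data", "sample1"
)
print(" ".join(cmd))
```

Building the command as a list rather than a single string means it can be passed to `subprocess.run` without shell quoting concerns.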
Examples
1. Basic Usage
```bash
# Run with the default parameters
run_fastp -i ./raw_data -o ./clean_data
```
2. Custom Parameters
```bash
# Stricter quality and length filters, 16 threads
run_fastp -i ./raw_data -o ./clean_data \
    -q 35 \
    -l 75 \
    -u 30 \
    -t 16
```
3. Different File Suffixes
```bash
# Input files named like sample.R1.fastq.gz / sample.R2.fastq.gz
run_fastp -i ./raw_data -o ./clean_data \
    --read1-suffix .R1.fastq.gz \
    --read2-suffix .R2.fastq.gz
```
4. Specify the fastp Path
```bash
# Use a specific fastp binary
run_fastp -i ./raw_data -o ./clean_data \
    --fastp-path /usr/local/bin/fastp
```
5. Full Parameter Example
```bash
run_fastp \
    -i /path/to/raw_data \
    -o /path/to/clean_data \
    --fastp-path /usr/local/bin/fastp \
    -t 20 \
    -q 25 \
    -l 40 \
    -u 50 \
    -n 5 \
    --read1-suffix _R1.fq.gz \
    --read2-suffix _R2.fq.gz
```
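Input File Format
Directory Structure Requirement
```text
raw_data/
├── sample1_1.fq.gz
├── sample1_2.fq.gz
├── sample2_1.fq.gz
├── sample2_2.fq.gz
└── ...
```
Supported File Formats
.fq.gz / .fastq.gz: compressed FASTQ files
.fq / .fastq: uncompressed FASTQ files
File Naming Convention
The program pairs files according to the --read1-suffix and --read2-suffix arguments. With the suffixes set to match, it pairs, for example:
sample_1.fq.gz ↔ sample_2.fq.gz
sample_R1.fastq.gz ↔ sample_R2.fastq.gz
sample.R1.fq.gz ↔ sample.R2.fq.gz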
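The suffix-based pairing described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `find_pairs` is hypothetical, and it assumes the sample name is whatever precedes the Read1 suffix and that a Read1 file without a matching Read2 file is skipped.

```python
import tempfile
from pathlib import Path


def find_pairs(input_dir, read1_suffix="_1.fq.gz", read2_suffix="_2.fq.gz"):
    """Map sample name -> (read1, read2) for every complete pair in input_dir."""
    input_dir = Path(input_dir)
    pairs = {}
    for read1 in sorted(input_dir.iterdir()):
        if not read1.name.endswith(read1_suffix):
            continue
        # The sample name is the file name with the Read1 suffix removed.
        sample = read1.name[: -len(read1_suffix)]
        read2 = input_dir / f"{sample}{read2_suffix}"
        if read2.is_file():
            pairs[sample] = (read1, read2)
    return pairs


# Small demonstration in a throwaway directory: s2 has no Read2 mate.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("s1_1.fq.gz", "s1_2.fq.gz", "s2_1.fq.gz"):
        (Path(tmp) / name).touch()
    print(sorted(find_pairs(tmp)))  # ['s1']
```

Listing the unpaired Read1 files separately before a run is a cheap way to catch a wrong `--read1-suffix` or `--read2-suffix` value.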
Output
Output Directory Structure
```text
clean_data/
├── sample1_1.clean.fq.gz
├── sample1_2.clean.fq.gz
├── sample1.fastp.html
├── sample1.fastp.json
├── sample2_1.clean.fq.gz
├── sample2_2.clean.fq.gz
├── sample2.fastp.html
├── sample2.fastp.json
└── batch_summary.txt
```
Description of QC Reports
HTML report: visual QC results, including quality distribution plots and GC content
JSON report: machine-readable QC statistics
Batch summary: processing status and statistics for all samples
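Explanation of QC Parameters
Quality Threshold (-q, --quality-threshold)
Default: 30
Description: Phred quality score threshold. Bases scoring below this value are counted as low-quality.
Recommended values:
Strict: 35+
Standard: 30
Lenient: 20-25
Minimum Length (-l, --min-length)
Default: 50
Description: minimum read length after filtering. Shorter reads are discarded.
Recommended values:
RNA-seq: 50-75
DNA-seq: 30-50
16S rRNA: 200+
Unqualified Base Percentage (-u, --unqualified-percent)
Default: 40
Description: if the percentage of low-quality bases in a read exceeds this threshold, the whole read is discarded.
Recommended values: 30-50%
N Base Limit (-n, --n-base-limit)
Default: 10
Description: maximum number of N bases allowed in a read.
Recommended values: 5-15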
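To make the interaction of these four parameters concrete, here is a simplified per-read sketch of the rules as they are described above. It is not fastp's implementation: it ignores trimming and adapter removal, and it assumes Phred+33 quality encoding. The function name `passes_filters` is hypothetical.

```python
def passes_filters(
    seq,
    qual,
    quality_threshold=30,
    min_length=50,
    unqualified_percent=40,
    n_base_limit=10,
    phred_offset=33,
):
    """Return True if a read survives the length, N-base and quality checks."""
    # Minimum length: reads shorter than min_length are discarded.
    if len(seq) < min_length:
        return False
    # N base limit: at most n_base_limit N bases are allowed.
    if seq.upper().count("N") > n_base_limit:
        return False
    # Unqualified percent: count bases whose Phred score is below the threshold.
    low_quality = sum(1 for ch in qual if ord(ch) - phred_offset < quality_threshold)
    return low_quality * 100 <= unqualified_percent * len(seq)


good = passes_filters("A" * 60, "I" * 60)                 # every base is Q40
half_bad = passes_filters("A" * 60, "I" * 30 + "#" * 30)  # 50% of bases are Q2
too_many_n = passes_filters("N" * 11 + "A" * 49, "I" * 60)
print(good, half_bad, too_many_n)  # True False False
```

With the defaults, the second read fails because 50% of its bases fall below Q30, which exceeds the 40% limit; raising `-u` to 50 would let it through.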
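Performance Optimization
Thread Settings
```bash
# Use all available CPU cores
run_fastp -i input -o output -t $(nproc)

# Use 80% of the CPU cores
run_fastp -i input -o output -t $(($(nproc) * 4 / 5))
```
Memory Usage
Each thread uses roughly 500MB-1GB of memory.
Keep total memory usage below about 80% of system memory.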
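Troubleshooting
Common Issues
1. fastp command not found
```bash
# Check whether fastp is on the PATH
which fastp

# Install it if it is missing
conda install -c bioconda fastp
```
2. Permission error
```bash
# Make the output directory writable
chmod 755 /path/to/output_dir
```
3. File not found
```bash
# Check that the input directory exists and list its contents
ls -la /path/to/input_dir

# Check that files match the expected Read1 suffix
ls /path/to/input_dir/*_1.fq.gz
```
4. Insufficient memory
```bash
# Reduce the number of threads
run_fastp -i input -o output -t 4
```
Debug Mode
```bash
# Show verbose output
run_fastp -i input -o output --verbose

# Check the fastp version
fastp --version
```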
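Best Practices
1. Pre-QC Check
```bash
# Inspect raw data quality with FastQC (the output directory must exist)
mkdir -p qc_reports
fastqc raw_data/*.fq.gz -o qc_reports/

# Count the Read1 files
find raw_data -name "*_1.fq.gz" | wc -l
```
2. Parameter Selection Recommendations
| Application | Quality Threshold | Min Length | Unqualified Base % |
| --- | --- | --- | --- |
| RNA-seq | 25-30 | 50-75 | 40-50 |
| WGS | 30-35 | 50-100 | 30-40 |
| 16S rRNA | 25 | 200+ | 50 |
| ChIP-seq | 20-25 | 30-50 | 50 |

3. Post-QC Validation
```bash
# Re-run FastQC on the cleaned data (the output directory must exist)
mkdir -p qc_reports_after
fastqc clean_data/*.clean.fq.gz -o qc_reports_after/

# Combine the before and after reports with MultiQC
multiqc qc_reports/ qc_reports_after/
```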
Using the Python API
```python
from biopytools.fastp import FastpProcessor, FastpConfig

# Create the configuration
config = FastpConfig(
    input_dir="./raw_data",
    output_dir="./clean_data",
    threads=16,
    quality_threshold=30,
    min_length=50,
)

# Create the processor and run the batch
processor = FastpProcessor(config)
results = processor.run_batch_processing()

# Report the outcome
print(f"Processed {results['total_samples']} samples")
print(f"Succeeded: {results['success_count']}")
print(f"Failed: {results['failed_count']}")
```
English
Feature Overview
The fastp QC module is an efficient FASTQ data quality control tool within the biopytools toolkit, supporting batch processing for both single-end and paired-end sequencing data. This module wraps the fastp tool, providing convenient batch processing capabilities and flexible parameter configuration.
Key Features
🚀 Efficient Batch Processing : Automatically discovers and processes all FASTQ files in a directory.
🔧 Flexible Configuration : Supports customization of various quality control parameters.
📊 Quality Reports : Automatically generates quality control reports in HTML and JSON formats.
💾 Paired-End Support : Full support for processing paired-end sequencing data.
🎯 Smart Pairing : Automatically identifies paired-end file relationships.
Installation
System Dependencies
Ensure that fastp is installed on your system:
```bash
# Debian/Ubuntu
sudo apt-get install fastp

# CentOS/RHEL
sudo yum install fastp

# macOS (Homebrew)
brew install fastp

# Conda
conda install -c bioconda fastp
```
```bash
# Install biopytools from source
git clone https://github.com/yourusername/biopytools.git
cd biopytools
pip install -e .

# Verify the installation
run_fastp --help
```
Usage
Basic Syntax
```bash
run_fastp -i INPUT_DIR -o OUTPUT_DIR [OPTIONS]
```
Required Arguments
| Argument | Description |
| --- | --- |
| `-i, --input-dir` | Input directory containing raw FASTQ files. |
| `-o, --output-dir` | Output directory for cleaned FASTQ files. |
Optional Arguments
| Argument | Default | Description |
| --- | --- | --- |
| `--fastp-path` | `fastp` | Path to the fastp executable. |
| `-t, --threads` | 12 | Number of threads to use. |
| `-q, --quality-threshold` | 30 | Quality threshold (Phred score). |
| `-l, --min-length` | 50 | Minimum length required for a read. |
| `-u, --unqualified-percent` | 40 | Percentage threshold of unqualified bases. |
| `-n, --n-base-limit` | 10 | Maximum number of N bases allowed. |
| `--read1-suffix` | `_1.fq.gz` | Suffix for Read1 files. |
| `--read2-suffix` | `_2.fq.gz` | Suffix for Read2 files. |
Examples
1. Basic Usage
```bash
# Run with the default parameters
run_fastp -i ./raw_data -o ./clean_data
```
2. Custom Parameters
```bash
# Stricter quality and length filters, 16 threads
run_fastp -i ./raw_data -o ./clean_data \
    -q 35 \
    -l 75 \
    -u 30 \
    -t 16
```
3. Different File Suffixes
```bash
# Input files named like sample.R1.fastq.gz / sample.R2.fastq.gz
run_fastp -i ./raw_data -o ./clean_data \
    --read1-suffix .R1.fastq.gz \
    --read2-suffix .R2.fastq.gz
```
4. Specify fastp Path
```bash
# Use a specific fastp binary
run_fastp -i ./raw_data -o ./clean_data \
    --fastp-path /usr/local/bin/fastp
```
5. Full Parameter Example
```bash
run_fastp \
    -i /path/to/raw_data \
    -o /path/to/clean_data \
    --fastp-path /usr/local/bin/fastp \
    -t 20 \
    -q 25 \
    -l 40 \
    -u 50 \
    -n 5 \
    --read1-suffix _R1.fq.gz \
    --read2-suffix _R2.fq.gz
```
Input File Format
Directory Structure Requirement
```text
raw_data/
├── sample1_1.fq.gz
├── sample1_2.fq.gz
├── sample2_1.fq.gz
├── sample2_2.fq.gz
└── ...
```
Supported File Formats
.fq.gz / .fastq.gz : Compressed FASTQ files
.fq / .fastq : Uncompressed FASTQ files
File Naming Convention
The program automatically identifies paired files based on the --read1-suffix and --read2-suffix arguments. With the suffixes set to match, it pairs, for example:
sample_1.fq.gz ↔ sample_2.fq.gz
sample_R1.fastq.gz ↔ sample_R2.fastq.gz
sample.R1.fq.gz ↔ sample.R2.fq.gz
Output
Output Directory Structure
```text
clean_data/
├── sample1_1.clean.fq.gz
├── sample1_2.clean.fq.gz
├── sample1.fastp.html
├── sample1.fastp.json
├── sample2_1.clean.fq.gz
├── sample2_2.clean.fq.gz
├── sample2.fastp.html
├── sample2.fastp.json
└── batch_summary.txt
```
Description of QC Reports
HTML Report : Visualized quality control results, including quality distribution plots, GC content, etc.
JSON Report : Machine-readable quality control statistics.
Batch Summary : Processing status and statistics for all samples.
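Because the JSON reports are machine-readable, per-sample statistics can be collected with a few lines of Python. The sketch below reads the `summary.before_filtering` and `summary.after_filtering` sections that fastp writes to its JSON report. The function name `summarize_report` and the choice of fields are illustrative, and the `clean_data` directory name is taken from the examples on this page.

```python
import json
from pathlib import Path


def summarize_report(json_path):
    """Extract a few headline numbers from one fastp JSON report."""
    report = json.loads(Path(json_path).read_text())
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    reads_before = before["total_reads"]
    reads_after = after["total_reads"]
    return {
        "reads_before": reads_before,
        "reads_after": reads_after,
        # Guard against empty inputs to avoid dividing by zero.
        "retained_fraction": reads_after / reads_before if reads_before else 0.0,
        "q30_rate_after": after["q30_rate"],
    }


# Summarize every report in the output directory, if it exists.
report_dir = Path("clean_data")
if report_dir.is_dir():
    for path in sorted(report_dir.glob("*.fastp.json")):
        print(path.name, summarize_report(path))
```

A sample whose retained fraction is far below the others is usually worth opening in the HTML report before downstream analysis.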
Explanation of QC Parameters
Quality Threshold (-q, --quality-threshold)
Default : 30
Description : The Phred quality score threshold. Bases with a quality score below this value are considered low-quality.
Recommended Values :
Strict: 35+
Standard: 30
Lenient: 20-25
Minimum Length (-l, --min-length)
Default : 50
Description : The minimum length of a read after trimming. Reads shorter than this length will be discarded.
Recommended Values :
RNA-seq: 50-75
DNA-seq: 30-50
16S rRNA: 200+
Unqualified Base Percentage (-u, --unqualified-percent)
Default : 40
Description : If the percentage of low-quality bases in a read exceeds this threshold, the entire read will be discarded.
Recommended Values : 30-50%
N Base Limit (-n, --n-base-limit)
Default : 10
Description : The maximum number of N bases allowed in a read.
Recommended Values : 5-15
Performance Optimization
Thread Settings
```bash
# Use all available CPU cores
run_fastp -i input -o output -t $(nproc)

# Use 80% of the CPU cores
run_fastp -i input -o output -t $(($(nproc) * 4 / 5))
```
Memory Usage
Each thread uses approximately 500MB-1GB of memory.
It is recommended that total memory usage does not exceed 80% of the system’s available memory.
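The two guidelines above (use about 80% of the cores, keep memory under about 80% of what the machine has) can be combined into a single thread count. The following is a rough sketch only: the function name `suggest_threads` is hypothetical, and the 1 GB-per-thread default is simply the upper end of the estimate quoted above, not a measured value.

```python
import os


def suggest_threads(total_memory_gb, gb_per_thread=1.0, cpu_count=None):
    """Pick a thread count limited by both the CPU and the memory guideline."""
    cpus = cpu_count or os.cpu_count() or 1
    # CPU guideline: 80% of the cores, same arithmetic as $(nproc) * 4 / 5.
    by_cpu = max(1, cpus * 4 // 5)
    # Memory guideline: stay within 80% of total memory.
    by_memory = max(1, int(total_memory_gb * 0.8 / gb_per_thread))
    return min(by_cpu, by_memory)


# A 20-core machine with 16 GB of RAM is memory-bound under these assumptions.
print(suggest_threads(16, cpu_count=20))  # 12
# An 8-core machine with 64 GB of RAM is CPU-bound.
print(suggest_threads(64, cpu_count=8))   # 6
```

The result can be passed directly as the `-t` value.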
Troubleshooting
Common Issues
1. fastp command not found
```bash
# Check whether fastp is on the PATH
which fastp

# Install it if it is missing
conda install -c bioconda fastp
```
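When driving the tool from Python, the same check can be made before starting a batch. This is a small sketch using the standard library's `shutil.which`; the helper name `resolve_fastp` is hypothetical, and its argument plays the role of the `--fastp-path` value.

```python
import shutil


def resolve_fastp(fastp_path="fastp"):
    """Return the resolved path to the fastp executable or raise a clear error."""
    resolved = shutil.which(fastp_path)
    if resolved is None:
        raise FileNotFoundError(
            f"fastp executable not found: {fastp_path!r}. "
            "Install it or pass an explicit path."
        )
    return resolved


try:
    print(resolve_fastp())
except FileNotFoundError as exc:
    print(exc)
```

Failing early with a clear message avoids discovering the missing executable only after the first sample has been queued.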
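2. Permission error
```bash
# Make the output directory writable
chmod 755 /path/to/output_dir
```
3. File not found
```bash
# Check that the input directory exists and list its contents
ls -la /path/to/input_dir

# Check that files match the expected Read1 suffix
ls /path/to/input_dir/*_1.fq.gz
```
4. Insufficient memory
```bash
# Reduce the number of threads
run_fastp -i input -o output -t 4
```
Debug Mode
```bash
# Show verbose output
run_fastp -i input -o output --verbose

# Check the fastp version
fastp --version
```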
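Best Practices
1. Pre-QC Check
```bash
# Inspect raw data quality with FastQC (the output directory must exist)
mkdir -p qc_reports
fastqc raw_data/*.fq.gz -o qc_reports/

# Count the Read1 files
find raw_data -name "*_1.fq.gz" | wc -l
```
2. Parameter Selection Recommendations
| Application | Quality Threshold | Min Length | Unqualified Base % |
| --- | --- | --- | --- |
| RNA-seq | 25-30 | 50-75 | 40-50 |
| WGS | 30-35 | 50-100 | 30-40 |
| 16S rRNA | 25 | 200+ | 50 |
| ChIP-seq | 20-25 | 30-50 | 50 |

3. Post-QC Validation
```bash
# Re-run FastQC on the cleaned data (the output directory must exist)
mkdir -p qc_reports_after
fastqc clean_data/*.clean.fq.gz -o qc_reports_after/

# Combine the before and after reports with MultiQC
multiqc qc_reports/ qc_reports_after/
```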
Using the Python API
```python
from biopytools.fastp import FastpProcessor, FastpConfig

# Create the configuration
config = FastpConfig(
    input_dir="./raw_data",
    output_dir="./clean_data",
    threads=16,
    quality_threshold=30,
    min_length=50,
)

# Create the processor and run the batch
processor = FastpProcessor(config)
results = processor.run_batch_processing()

# Report the outcome
print(f"Processed {results['total_samples']} samples")
print(f"Succeeded: {results['success_count']}")
print(f"Failed: {results['failed_count']}")
```