Using biopytools

Installation

git clone https://github.com/lixiang117423/biopytools.git
cd biopytools
pip install -e .

pip install -e ".[dev]"

run_annovar

Annotates VCF files with ANNOVAR.

Help documentation

 run_annovar -h
usage: run_annovar [-h] -g GFF3 -f GENOME -v VCF -b BUILD_VER [-a ANNOVAR_PATH] [-d DATABASE_PATH] [-o OUTPUT_DIR] [-q QUAL_THRESHOLD] [-s {1,2,3,4}]
[--skip-gff-fix] [--skip-vcf-filter] [--enable-vcf-filter]

ANNOVAR VCF注释自动化脚本 (模块化版本) | ANNOVAR VCF Annotation Automation Script (Modular Version)

options:
-h, --help show this help message and exit
-g, --gff3 GFF3 GFF3注释文件路径 | GFF3 annotation file path (default: None)
-f, --genome GENOME 基因组序列文件路径 | Genome sequence file path (default: None)
-v, --vcf VCF VCF变异文件路径 | VCF variant file path (default: None)
-b, --build-ver BUILD_VER
基因组构建版本标识符 (如: OV, KY131) - 不应包含路径分隔符 | Genome build version identifier (e.g., OV, KY131) - should not contain path separators
(default: None)
-a, --annovar-path ANNOVAR_PATH
ANNOVAR软件安装路径 | ANNOVAR software installation path (default: /share/org/YZWL/yzwl_lixg/software/annovar/annovar)
-d, --database-path DATABASE_PATH
ANNOVAR数据库路径 | ANNOVAR database path (default: ./database)
-o, --output-dir OUTPUT_DIR
输出目录 | Output directory (default: ./annovar_output)
-q, --qual-threshold QUAL_THRESHOLD
VCF质量过滤阈值 (仅在启用VCF过滤时生效) | VCF quality filtering threshold (only effective when VCF filtering is enabled) (default: 20)
-s, --step {1,2,3,4} 只运行指定步骤 | Run only specified step (1:gff3转换 | gff3 conversion, 2:提取序列 | extract sequences, 3:VCF处理 | VCF processing, 4:注释 |
annotation) (default: None)
--skip-gff-fix 跳过GFF3文件的自动修复(CDS phase等问题) | Skip automatic GFF3 file fixes (CDS phase and other issues) (default: False)
--skip-vcf-filter 跳过VCF过滤步骤,直接使用输入的VCF文件(默认启用) | Skip VCF filtering step, use input VCF file directly (enabled by default) (default: True)
--enable-vcf-filter 启用VCF过滤步骤(使用bcftools) | Enable VCF filtering step (using bcftools) (default: False)

Example run

run_annovar --gff3 genome.gff --genome genome.fa --vcf final_filtered.recode.chr.vcf.gz --build-ver test --annovar-path /share/org/YZWL/yzwl_lixg/software/annovar/annovar --database-path ./ --output-dir ./
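
The `-q` threshold only takes effect together with `--enable-vcf-filter`, and the help text says the filtering itself is done with bcftools. As a rough illustration of what a QUAL cutoff does (a sketch, not the tool's actual implementation), the snippet below keeps header lines and any record whose QUAL column reaches the threshold. The function name `filter_vcf_lines_by_qual` is hypothetical, and dropping records with a missing QUAL (`.`) is an assumption of this sketch.

```python
def filter_vcf_lines_by_qual(lines, qual_threshold=20.0):
    """Yield VCF header lines plus records whose QUAL (6th column) >= threshold."""
    for line in lines:
        if line.startswith("#"):
            # Header and column lines are always kept.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        # "." means QUAL is missing; this sketch drops such records (assumption).
        if qual != "." and float(qual) >= qual_threshold:
            yield line


example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t35.2\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t12.0\tPASS\t.\n",
    "chr1\t300\t.\tG\tA\t.\tPASS\t.\n",
]
kept = list(filter_vcf_lines_by_qual(example_vcf, qual_threshold=20))
```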

run_fastp

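Runs fastp to filter FASTQ files. Sample information is detected automatically from the FASTQ file names, and all samples are filtered in one batch.

Help documentation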

run_fastp -h                                                                                                                                           
usage: run_fastp [-h] -i INPUT_DIR -o OUTPUT_DIR [--fastp-path FASTP_PATH] [-t THREADS] [-q QUALITY_THRESHOLD] [-l MIN_LENGTH] [-u UNQUALIFIED_PERCENT]
[-n N_BASE_LIMIT] [--read1-suffix READ1_SUFFIX] [--read2-suffix READ2_SUFFIX]

FASTQ数据质控批处理脚本 | FASTQ Data Quality Control Batch Processing Script

options:
-h, --help show this help message and exit
-i, --input-dir INPUT_DIR
输入原始FASTQ数据目录 | Input raw FASTQ data directory (default: None)
-o, --output-dir OUTPUT_DIR
输出清洁FASTQ数据目录 | Output clean FASTQ data directory (default: None)
--fastp-path FASTP_PATH
fastp可执行文件路径 | fastp executable path (default: fastp)
-t, --threads THREADS
线程数 | Number of threads (default: 12)
-q, --quality-threshold QUALITY_THRESHOLD
质量阈值 | Quality threshold (default: 30)
-l, --min-length MIN_LENGTH
最小长度 | Minimum length (default: 50)
-u, --unqualified-percent UNQUALIFIED_PERCENT
不合格碱基百分比阈值 | Unqualified base percentage threshold (default: 40)
-n, --n-base-limit N_BASE_LIMIT
N碱基数量限制 | N base count limit (default: 10)
--read1-suffix READ1_SUFFIX
Read1文件后缀 | Read1 file suffix (default: _1.fq.gz)
--read2-suffix READ2_SUFFIX
Read2文件后缀 | Read2 file suffix (default: _2.fq.gz)

Example run

run_fastp -i raw -o clean --read1-suffix _1.clean.fq.gz --read2-suffix _2.clean.fq.gz --fastp-path /home/lixiang/miniforge3/envs/RNA_Seq/bin/fastp
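
How the tool detects samples is not shown here, but the `--read1-suffix` and `--read2-suffix` options suggest that read pairs are matched by file-name suffix. The following is a minimal sketch of that idea under this assumption; `pair_fastq_files` is a hypothetical helper, not part of biopytools.

```python
def pair_fastq_files(file_names, read1_suffix="_1.fq.gz", read2_suffix="_2.fq.gz"):
    """Map sample name -> (read1, read2) for files that have both mates."""
    names = set(file_names)
    pairs = {}
    for name in sorted(names):
        if not name.endswith(read1_suffix):
            continue
        # Everything before the read1 suffix is treated as the sample name.
        sample = name[: -len(read1_suffix)]
        mate = sample + read2_suffix
        if mate in names:
            pairs[sample] = (name, mate)
    return pairs


files = ["A_1.clean.fq.gz", "A_2.clean.fq.gz", "B_1.clean.fq.gz", "notes.txt"]
pairs = pair_fastq_files(files, "_1.clean.fq.gz", "_2.clean.fq.gz")
```

In this sketch, sample `B` is skipped because its read2 file is missing.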

run_rnaseq

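Runs the HISAT2 + StringTie pipeline automatically and outputs TPM and FPKM expression matrices. It does not include a FASTQ filtering step, so the FASTQ files must be filtered beforehand; the program builds the index automatically.

Help documentation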

run_rnaseq -h
usage: run_rnaseq [-h] -g GENOME -f GTF -i INPUT -o OUTPUT [-p PATTERN] [-r {yes,y,no,n}] [-t THREADS]

RNA-seq分析流程:HISAT2 + StringTie (模块化版本) | RNA-seq analysis pipeline: HISAT2 + StringTie (Modular Version)

options:
-h, --help show this help message and exit
-g, --genome GENOME 基因组fasta文件路径 | Genome fasta file path (default: None)
-f, --gtf GTF 基因注释GTF文件路径 | Gene annotation GTF file path (default: None)
-i, --input INPUT 输入fastq文件目录或样本信息文件 | Input fastq file directory or sample information file (default: None)
-o, --output OUTPUT 输出目录 | Output directory (default: None)
-p, --pattern PATTERN
Fastq文件命名模式,例如 "*.R1.fastq.gz""*_1.fq.gz",*代表样本名 | Fastq file naming pattern, e.g., "*.R1.fastq.gz" or "*_1.fq.gz", *
represents sample name (default: None)
-r, --remove {yes,y,no,n}
处理后删除BAM文件 | Remove BAM files after processing (default: no)
-t, --threads THREADS
线程数 | Number of threads (default: 8)

Example run

run_rnaseq -g genome/T2T70-15_chr.fa -f genome/T2T70-15_chr.gtf -i clean -o output -p "*_1.clean.fq.gz"       
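
The help text states that `*` in the `-p` pattern stands for the sample name. One way to read a sample name out of a file name with such a pattern is sketched below; it assumes a single `*` in the pattern, and `sample_name_from_pattern` is a hypothetical helper rather than the tool's own code.

```python
import re


def sample_name_from_pattern(file_name, pattern):
    """Return the part of file_name matched by '*' in pattern, or None."""
    # Split the pattern around the first '*' and escape the literal parts.
    prefix, _, suffix = pattern.partition("*")
    regex = re.compile(re.escape(prefix) + "(.+)" + re.escape(suffix) + r"\Z")
    match = regex.match(file_name)
    return match.group(1) if match else None


name_a = sample_name_from_pattern("CRR071751_1.clean.fq.gz", "*_1.clean.fq.gz")
name_b = sample_name_from_pattern("leaf.R1.fastq.gz", "*.R1.fastq.gz")
name_c = sample_name_from_pattern("readme.txt", "*_1.clean.fq.gz")
```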

run_vcf_extractor

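Extracts genotype information from a VCF file.

Installing cyvcf2 is recommended, but it cannot be installed under Python 3.13, so choose a Python version that cyvcf2 supports when creating the environment.

Help documentation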

run_vcf_extractor -h
usage: run_vcf_extractor [-h] [-o OUTPUT] [-s SAMPLES] [--biallelic-only] [-e {yes,y,no,n}] [-t {txt,csv,excel}] [--output-dir OUTPUT_DIR] vcf_file

VCF基因型提取工具 | VCF Genotype Extraction Tool

positional arguments:
vcf_file VCF文件路径(支持.gz压缩格式) | VCF file path (supports .gz compressed format)

options:
-h, --help show this help message and exit
-o, --output OUTPUT 输出文件前缀 | Output file prefix (default: vcf_genotype)
-s, --samples SAMPLES
样本选择:all(所有样本)或逗号分隔的样本名称 | Sample selection: all (all samples) or comma-separated sample names (default: all)
--biallelic-only 只保留双等位位点 | Keep only biallelic sites (default: False)
-e, --each {yes,y,no,n}
按染色体拆分输出文件:yes/y(是)或no/n(否) | Split output files by chromosome: yes/y or no/n (default: n)
-t, --output-type {txt,csv,excel}
输出文件格式 | Output file format (default: txt)
--output-dir OUTPUT_DIR
输出目录 | Output directory (default: ./)

Usage example

run_vcf_extractor final_filtered.recode.chr.vcf.gz -o OV_snp -e y -t csv 

Log output:

2025-07-16 10:01:38,146 - INFO - 依赖检查 | Dependency check:
2025-07-16 10:01:38,147 - INFO - cyvcf2: 不可用 | Not available
2025-07-16 10:01:38,147 - INFO - pandas: 可用 | Available
2025-07-16 10:01:38,147 - INFO - 使用原生Python解析VCF文件 | Using native Python for VCF parsing
2025-07-16 10:01:38,148 - INFO - 开始VCF基因型提取 | Starting VCF genotype extraction
2025-07-16 10:01:38,148 - INFO - 输入文件 | Input file: /mnt/f/project/04.诸葛菜/genotype/final_filtered.recode.chr.vcf.gz
2025-07-16 10:01:38,148 - INFO - 输出前缀 | Output prefix: OV_snp
2025-07-16 10:01:38,149 - INFO - 输出格式 | Output format: csv
2025-07-16 10:01:38,159 - INFO - 目标样本数 | Number of target samples: 100
2025-07-16 10:01:39,282 - INFO - 已处理 | Processed 10000 variants
2025-07-16 10:01:40,303 - INFO - 已处理 | Processed 20000 variants
2025-07-16 10:01:41,251 - INFO - 已处理 | Processed 30000 variants
2025-07-16 10:01:42,229 - INFO - 已处理 | Processed 40000 variants
2025-07-16 10:01:43,171 - INFO - 已处理 | Processed 50000 variants
2025-07-16 10:01:44,109 - INFO - 已处理 | Processed 60000 variants
2025-07-16 10:01:44,990 - INFO - 已处理 | Processed 70000 variants
2025-07-16 10:01:45,863 - INFO - 已处理 | Processed 80000 variants
2025-07-16 10:01:46,727 - INFO - 已处理 | Processed 90000 variants
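
The log above shows the tool falling back to native Python parsing when cyvcf2 is unavailable. The sketch below shows roughly what extracting the GT field per sample looks like in plain Python; it is an illustration only, and the function name and the returned row layout are assumptions, not the tool's actual output format.

```python
def extract_genotypes(vcf_lines, biallelic_only=False):
    """Return (sample_names, rows); each row is [chrom, pos, ref, alt, gt1, gt2, ...]."""
    samples = []
    rows = []
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines carry no genotypes
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start at column 10
            continue
        chrom, pos, _variant_id, ref, alt = fields[:5]
        if biallelic_only and "," in alt:
            continue  # more than one ALT allele
        gt_index = fields[8].split(":").index("GT")
        genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
        rows.append([chrom, int(pos), ref, alt] + genotypes)
    return samples, rows


example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t1/1:9\n",
    "chr1\t200\t.\tC\tT,G\t50\tPASS\t.\tGT:DP\t0/2:7\t./.:0\n",
]
samples, rows = extract_genotypes(example_vcf, biallelic_only=True)
```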

run_plink_gwas

Runs a GWAS analysis with PLINK.

Help documentation

run_plink_gwas -h
usage: run_plink_gwas [-h] -v VCF_FILE -p PHENOTYPE_FILE [-t {qualitative,quantitative}] [-o OUTPUT_DIR] [--mind MIND] [--geno GENO] [--maf MAF]
[--hwe HWE] [--ld-window-size LD_WINDOW_SIZE] [--ld-step-size LD_STEP_SIZE] [--ld-r2-threshold LD_R2_THRESHOLD]
[--pca-components PCA_COMPONENTS] [--pca-use PCA_USE] [--correction-method {bonferroni,suggestive,fdr,all}]
[--bonferroni-alpha BONFERRONI_ALPHA] [--suggestive-threshold SUGGESTIVE_THRESHOLD] [--fdr-alpha FDR_ALPHA] [--threads THREADS]

完整的PLINK GWAS分析流程 (模块化版本) - 支持质量性状和数量性状,多种显著性校正方法 | Complete PLINK GWAS Analysis Pipeline (Modular Version) - Supporting both qualitative and quantitative
traits, multiple significance correction methods

options:
-h, --help show this help message and exit
-v, --vcf-file VCF_FILE
输入VCF文件路径(支持.gz压缩) | Input VCF file path (supports .gz compression) (default: None)
-p, --phenotype-file PHENOTYPE_FILE
表型文件路径(样本ID和表型值,以空格或制表符分隔) | Phenotype file path (sample ID and phenotype value, space or tab separated) (default: None)
-t, --trait-type {qualitative,quantitative}
表型类型 | Trait type: 'qualitative' for binary traits (0/1 -> 1/2), 'quantitative' for continuous traits (keep original values)
(default: qualitative)
-o, --output-dir OUTPUT_DIR
输出目录 | Output directory (default: plink_results)
--mind MIND 个体缺失率阈值(移除缺失率大于此值的个体) | Individual missing rate threshold (default: 0.05)
--geno GENO SNP缺失率阈值(移除缺失率大于此值的SNP) | SNP missing rate threshold (default: 0.05)
--maf MAF 最小等位基因频率阈值 | Minor allele frequency threshold (default: 0.01)
--hwe HWE Hardy-Weinberg平衡检验P值阈值 | Hardy-Weinberg equilibrium p-value threshold (default: 1e-06)
--ld-window-size LD_WINDOW_SIZE
LD剪枝窗口大小(kb) | LD pruning window size (kb) (default: 50)
--ld-step-size LD_STEP_SIZE
LD剪枝步长(SNP数) | LD pruning step size (number of SNPs) (default: 5)
--ld-r2-threshold LD_R2_THRESHOLD
LD剪枝r²阈值 | LD pruning r² threshold (default: 0.2)
--pca-components PCA_COMPONENTS
计算的主成分数量 | Number of principal components to compute (default: 10)
--pca-use PCA_USE 关联分析中使用的主成分数量 | Number of PCs to use in association analysis (default: 5)
--correction-method {bonferroni,suggestive,fdr,all}
显著性校正方法 | Significance correction method: 'bonferroni' for Bonferroni correction, 'suggestive' for suggestive threshold, 'fdr' for
false discovery rate, 'all' for all methods (default: all)
--bonferroni-alpha BONFERRONI_ALPHA
Bonferroni校正的alpha水平 | Alpha level for Bonferroni correction (default: 0.05)
--suggestive-threshold SUGGESTIVE_THRESHOLD
提示性关联阈值 | Suggestive association threshold (default: 1e-05)
--fdr-alpha FDR_ALPHA
FDR校正的q值阈值 | q-value threshold for FDR correction (default: 0.05)
--threads THREADS 使用的线程数 | Number of threads to use (default: 1)

Usage example

run_plink_gwas -v final_filtered.recode.chr.vcf.gz -p phe.txt -o ./
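
According to the help text, qualitative traits are recoded from 0/1 to PLINK's 1/2 (control/case) coding, while quantitative values are kept as they are. A minimal sketch of that conversion follows; `convert_phenotype` is a hypothetical name, and using the sample ID for both the family ID and the individual ID is an assumption of this example.

```python
def convert_phenotype(records, trait_type="qualitative"):
    """Convert (sample_id, value) pairs to (FID, IID, phenotype) rows."""
    recode = {0: 1, 1: 2}  # 0 -> 1 (control), 1 -> 2 (case)
    converted = []
    for sample_id, value in records:
        if trait_type == "qualitative":
            phenotype = recode[int(value)]
        else:
            phenotype = float(value)  # quantitative traits keep their value
        # Assumption: the sample ID doubles as family ID and individual ID.
        converted.append((sample_id, sample_id, phenotype))
    return converted


binary_rows = convert_phenotype([("S1", "0"), ("S2", "1")])
quantitative_rows = convert_phenotype([("S1", "3.5")], trait_type="quantitative")
```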

Log output:

2025-07-16 15:10:34,967 - INFO - 开始PLINK GWAS分析流程 | Starting PLINK GWAS analysis pipeline...
2025-07-16 15:10:34,967 - INFO - 表型类型 | Trait type: qualitative
2025-07-16 15:10:34,968 - INFO - 显著性校正方法 | Correction method: all
2025-07-16 15:10:34,970 - INFO - ==================================================
2025-07-16 15:10:34,973 - INFO - 步骤1: 检查输入文件 | Step 1: Checking input files
2025-07-16 15:10:34,973 - INFO - 复制输入文件到工作目录 | Copying input files to working directory...
2025-07-16 15:12:09,104 - INFO - VCF文件已复制 | VCF file copied: input.vcf.gz
2025-07-16 15:12:09,135 - INFO - 表型文件已复制 | Phenotype file copied: phenotype.txt
2025-07-16 15:12:09,137 - INFO - 步骤2: 转换表型文件 | Step 2: Converting phenotype file
2025-07-16 15:12:09,138 - INFO - 转换表型文件格式 | Converting phenotype file format...
2025-07-16 15:12:09,141 - INFO - 表型类型 | Trait type: qualitative
2025-07-16 15:12:09,157 - INFO - 原始表型文件形状 | Original phenotype file shape: (100, 2)
2025-07-16 15:12:09,162 - INFO - 原始表型分布 | Original phenotype distribution: {1: np.int64(50), 0: np.int64(50)}
2025-07-16 15:12:09,165 - INFO - 处理质量性状:将0转换为1(对照),1转换为2(病例) | Processing qualitative trait: converting 0 to 1 (control), 1 to 2 (case)
2025-07-16 15:12:09,166 - INFO - 转换后表型分布 | Converted phenotype distribution: {2: np.int64(50), 1: np.int64(50)}
2025-07-16 15:12:09,168 - INFO - 对照数 (抗病, 原值0) | Controls (resistant, original 0): 50
2025-07-16 15:12:09,168 - INFO - 病例数 (感病, 原值1) | Cases (susceptible, original 1): 50
2025-07-16 15:12:09,180 - INFO - 表型文件转换完成 | Phenotype file conversion completed
2025-07-16 15:12:09,181 - INFO - 步骤3: 转换VCF文件 | Step 3: Converting VCF file
2025-07-16 15:12:09,181 - INFO - 转换VCF文件为PLINK格式 | Converting VCF file to PLINK format...
2025-07-16 15:12:09,181 - INFO - 执行步骤 | Executing step: VCF转换为PLINK格式 | Converting VCF to PLINK format
2025-07-16 15:12:09,182 - INFO - 命令 | Command: plink --vcf input.vcf.gz --make-bed --out raw_data --allow-extra-chr --set-missing-var-ids @:# --keep-allele-order
......
2025-07-16 15:40:34,137 - INFO - PLINK GWAS分析流程完成! | PLINK GWAS analysis pipeline completed!
2025-07-16 15:40:34,138 - INFO - 表型类型 | Trait type: qualitative
2025-07-16 15:40:34,139 - INFO - 显著性校正方法 | Correction method: all
2025-07-16 15:40:34,139 - INFO - 所有结果保存在 | All results saved in: /mnt/f/biopytools_test/gwas
2025-07-16 15:40:34,141 - INFO - 主要输出文件 | Main output files:
2025-07-16 15:40:34,143 - INFO - - analysis_report.txt: 分析报告 | Analysis report
2025-07-16 15:40:34,144 - INFO - - gwas_results_ADD.txt: 主要关联结果 | Main association results
2025-07-16 15:40:34,144 - INFO - - significant_bonferroni.txt: Bonferroni校正显著位点 | Bonferroni significant loci
2025-07-16 15:40:34,144 - INFO - - bonferroni_info.txt: Bonferroni校正信息 | Bonferroni correction information
2025-07-16 15:40:34,144 - INFO - - significant_suggestive.txt: 提示性关联位点 | Suggestive association loci
2025-07-16 15:40:34,145 - INFO - - suggestive_info.txt: 提示性关联信息 | Suggestive association information
2025-07-16 15:40:34,145 - INFO - - significant_fdr.txt: FDR校正显著位点 | FDR significant loci
2025-07-16 15:40:34,146 - INFO - - fdr_info.txt: FDR校正信息 | FDR correction information
2025-07-16 15:40:34,147 - INFO - - manhattan_plot.png: Manhattan图 | Manhattan plot
2025-07-16 15:40:34,148 - INFO - - qq_plot.png: QQ图 | QQ plot
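
The three correction methods listed in the help text can be illustrated on a small list of p-values. The sketch below assumes that the Bonferroni cutoff is alpha divided by the number of tested SNPs and that the FDR method is Benjamini-Hochberg; neither is confirmed by the help text, and `significance_calls` is a hypothetical function written for this example.

```python
def significance_calls(p_values, bonferroni_alpha=0.05,
                       suggestive_threshold=1e-5, fdr_alpha=0.05):
    """Return per-SNP True/False calls for three significance criteria."""
    n = len(p_values)
    bonferroni_cutoff = bonferroni_alpha / n
    bonferroni = [p <= bonferroni_cutoff for p in p_values]
    suggestive = [p <= suggestive_threshold for p in p_values]

    # Benjamini-Hochberg: find the largest rank k with p_(k) <= alpha * k / n,
    # then call every SNP ranked at or below k significant.
    order = sorted(range(n), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= fdr_alpha * rank / n:
            largest_rank = rank
    fdr = [False] * n
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            fdr[index] = True

    return {"bonferroni": bonferroni, "suggestive": suggestive, "fdr": fdr}


calls = significance_calls([1e-8, 0.004, 0.03, 0.5])
```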

run_kmer_analysis

Builds a k-mer presence/absence matrix for specific sequences from FASTQ files.

Help documentation

run_kmer_analysis -h
usage: run_kmer_analysis [-h] -g GENE_FASTA -f FASTQ_DIR -o OUTPUT_DIR [-k KMER_SIZE] [-t THREADS] [-m HARD_MIN] [-p PROJECT_NAME] [--skip-build]
[--run-haplotype] [-n N_COMPONENTS]

高性能k-mer数据库查询流水线 | High-Performance K-mer Database Query Pipeline
基于kmtricks + RocksDB的大规模k-mer分析系统 | Large-scale k-mer analysis system based on kmtricks + RocksDB

专为大规模数据集设计 (支持数千个样本) | Designed for large-scale datasets (supporting thousands of samples):
1. 一次构建全基因组k-mer数据库 | Build genome-wide k-mer database once
2. 支持快速查询任意基因的k-mer模式 | Support fast queries of k-mer patterns for any genes
3. 查询速度: 秒级到分钟级 (vs 小时级的暴力搜索) | Query speed: seconds to minutes (vs hours of brute force)

适用场景 | Use Cases:
- 大规模群体基因组学研究 | Large-scale population genomics studies
- 需要重复查询不同基因的场景 | Scenarios requiring repeated queries of different genes
- 对查询速度有高要求的项目 | Projects with high query speed requirements

性能对比 (5000个样本) | Performance Comparison (5000 samples):
- 传统方法 | Traditional method: 每次查询数周 | weeks per query
- 本方法 | This method: 构建一次(1-2天) + 查询(1-5分钟) | build once (1-2 days) + query (1-5 minutes)


options:
-h, --help show this help message and exit

必需参数 | Required Arguments:
-g GENE_FASTA, --gene-fasta GENE_FASTA
目标基因FASTA文件路径 | Target gene FASTA file path
-f FASTQ_DIR, --fastq-dir FASTQ_DIR
FASTQ文件目录路径 | FASTQ file directory path
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
输出目录路径 | Output directory path

k-mer分析参数 | K-mer Analysis Parameters:
-k KMER_SIZE, --kmer-size KMER_SIZE
k-mer大小 (默认: 51) | k-mer size (default: 51)
-t THREADS, --threads THREADS
线程数 (默认: 32,建议使用较多线程) | Thread count (default: 32, recommend using more threads)
-m HARD_MIN, --hard-min HARD_MIN
最小k-mer频次阈值 (默认: 2) | Minimum k-mer frequency threshold (default: 2)

流程控制参数 | Process Control Parameters:
-p PROJECT_NAME, --project-name PROJECT_NAME
项目名称 (默认: 从输出目录名获取) | Project name (default: derived from output directory name)
--skip-build 跳过数据库构建步骤 (用于已有数据库的查询) | Skip database build step (for querying existing database)
--run-haplotype 运行单倍型聚类分析 | Run haplotype clustering analysis
-n N_COMPONENTS, --n-components N_COMPONENTS
BGMM最大聚类数 (默认: 5) | Maximum number of BGMM clusters (default: 5)

Usage example

run_kmer_analysis -g gene/gene.fa -f fastq -o output

Log output:

2025-07-16 17:17:51,743 - INFO - 开始高性能k-mer数据库分析流水线 | Starting high-performance k-mer database analysis pipeline...
2025-07-16 17:17:51,744 - INFO -
============================================================
2025-07-16 17:17:51,744 - INFO - 步骤1: 创建FOF文件 | Step 1: Create FOF file
2025-07-16 17:17:51,744 - INFO - ============================================================
2025-07-16 17:17:51,744 - INFO - 创建FOF文件 | Creating FOF file...
2025-07-16 17:17:51,746 - INFO - 自动检测FASTQ文件 | Auto detecting FASTQ files...
2025-07-16 17:17:51,787 - INFO - 找到8个FASTQ文件 | Found 8 FASTQ files
2025-07-16 17:17:51,787 - INFO - 检测到4个样本 | Detected 4 samples
2025-07-16 17:17:51,805 - INFO - FOF文件已创建 | FOF file created: output/samples.fof
2025-07-16 17:17:51,807 - INFO - 包含 4 个样本 | Contains 4 samples
2025-07-16 17:17:51,808 - INFO -
============================================================
2025-07-16 17:17:51,808 - INFO - 步骤2: 构建kmtricks数据库 | Step 2: Build kmtricks database
2025-07-16 17:17:51,810 - INFO - ============================================================
2025-07-16 17:17:51,810 - INFO - 开始运行kmtricks k-mer构建流水线 | Starting kmtricks k-mer construction pipeline...
2025-07-16 17:17:51,811 - INFO - 警告: 这一步可能需要数小时到一天时间,但只需要执行一次 | Warning: This step may take hours to a day, but only needs to be run once
2025-07-16 17:17:51,835 - INFO - FOF文件路径 | FOF file path: /mnt/f/biopytools_test/kmer/output/samples.fof
2025-07-16 17:17:51,836 - INFO - kmtricks运行目录 | kmtricks run directory: /mnt/f/biopytools_test/kmer/output/output.k51
2025-07-16 17:17:51,837 - INFO - 这可能需要很长时间,请耐心等待 | This may take a long time, please be patient...
2025-07-16 17:17:51,840 - INFO - 执行步骤 | Executing step: kmtricks k-mer构建 | kmtricks k-mer construction
2025-07-16 17:17:51,842 - INFO - 命令 | Command: kmtricks pipeline -t 32 --file /mnt/f/biopytools_test/kmer/output/samples.fof --run-dir /mnt/f/biopytools_test/kmer/output/output.k51 --mode kmer:pa:bin --hard-min 2 --kmer-size 51 --cpr
2025-07-16 17:17:51,856 - INFO - ✓ FOF文件存在 | FOF file exists: /mnt/f/biopytools_test/kmer/output/samples.fof
2025-07-16 17:17:51,864 - INFO - 输出 | Output: [2025-07-16 17:17:51.864] [info] Run with Kmer<64> - __uint128_t implementation
2025-07-16 17:17:51,925 - INFO - 输出 | Output: [2025-07-16 17:17:51.925] [info] Compute configuration...
2025-07-16 17:17:51,927 - INFO - 输出 | Output: [2025-07-16 17:17:51.925] [info] 4 samples found (8 read files).
2025-07-16 17:17:53,371 - INFO - 输出 | Output: [2025-07-16 17:17:53.371] [info] Use 46 partitions.
2025-07-16 17:17:53,728 - INFO - 输出 | Output: [2025-07-16 17:17:53.728] [info] Compute minimizer repartition...
2025-07-16 17:18:02,168 - INFO - 输出 | Output:
2025-07-16 17:18:02,169 - INFO - 输出 | Output: Compute SuperK [> ] [00:00s]
2025-07-16 17:18:02,169 - INFO - 输出 | Output: Compute SuperK [> ] [00m:00s]
2025-07-16 17:18:02,169 - INFO - 输出 | Output:
......
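
Step 1 in the log writes a `samples.fof` file that kmtricks reads. As far as I know, kmtricks expects one line per sample in the form `sample_id : file1 ; file2`; treat that layout as an assumption and check it against the kmtricks documentation. The sketch below builds such lines from a sample-to-files mapping; `build_fof_lines` is a hypothetical helper, not the tool's own function.

```python
def build_fof_lines(sample_files):
    """Build kmtricks-style FOF lines: '<sample> : <file1> ; <file2>'."""
    lines = []
    for sample in sorted(sample_files):
        # Assumed layout: sample ID, a colon, then files separated by semicolons.
        files = " ; ".join(sample_files[sample])
        lines.append(f"{sample} : {files}")
    return lines


fof_lines = build_fof_lines({
    "S2": ["fastq/S2_1.fq.gz", "fastq/S2_2.fq.gz"],
    "S1": ["fastq/S1_1.fq.gz", "fastq/S1_2.fq.gz"],
})
```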

run_augustus_multi_rnaseq

Annotates a genome using an Augustus model and short-read (second-generation) RNA-seq data.

Help documentation

run_augustus_multi_rnaseq -h
usage: run_augustus_multi_rnaseq.py [-h] -g GENOME -s SPECIES (-i INPUT_DIR | -c CONFIG) [-p PATTERN] [-o OUTPUT] [-t THREADS] [-x HISAT2_INDEX]
[-m MIN_INTRON_SUPPORT] [-f] [-a] [-z {atac,gtag,gcag}] [--skip-deps-check]

多转录组Augustus基因预测脚本 | Multiple RNA-seq Augustus Gene Prediction Script

options:
-h, --help show this help message and exit
-g GENOME, --genome GENOME
基因组fasta文件路径 | Genome fasta file path (default: None)
-s SPECIES, --species SPECIES
Augustus训练的物种模型名称 | Augustus trained species model name (default: None)
-i INPUT_DIR, --input-dir INPUT_DIR
输入FASTQ文件目录 | Input FASTQ files directory (default: None)
-c CONFIG, --config CONFIG
样本配置文件路径 | Sample configuration file path (default: None)
-p PATTERN, --pattern PATTERN
R1文件匹配模式 | R1 file matching pattern (only used with -i/--input-dir) (default: *.R1.fastq.gz)
-o OUTPUT, --output OUTPUT
输出目录 | Output directory (default: augustus_multi_rnaseq)
-t THREADS, --threads THREADS
线程数 | Number of threads (default: 8)
-x HISAT2_INDEX, --hisat2-index HISAT2_INDEX
HISAT2索引路径前缀 | HISAT2 index path prefix (default: None)
-m MIN_INTRON_SUPPORT, --min-intron-support MIN_INTRON_SUPPORT
内含子hints的最小支持度 | Minimum support for intron hints (default: 2)
-f, --no-filter-bam 跳过BAM文件过滤步骤 | Skip BAM filtering step (default: False)
-a, --no-alternatives
不使用转录组证据生成可变剪切 | Do not use RNA-seq evidence for alternative splicing (default: False)
-z {atac,gtag,gcag}, --splicesites {atac,gtag,gcag}
允许的剪切位点类型 | Allowed splice site types (default: atac)
--skip-deps-check 跳过依赖检查 | Skip dependency check (default: False)

Usage example

run_augustus_multi_rnaseq -g genome/tp309.genone.fa -s Rice_35minicore_NLR_0.5 -i genome -p "*_R1.fq" -o model_rnaseq

Log output:

🔍 检查工具 | Checking tool: hisat2
测试 | Testing: hisat2 --help
返回码 | Return code: 0
输出长度 | Output length - stdout: 9473, stderr: 0
✅ 直接运行: 工具可用

🔍 检查工具 | Checking tool: hisat2-build
测试 | Testing: hisat2-build --help
返回码 | Return code: 0
输出长度 | Output length - stdout: 2232, stderr: 0
✅ 直接运行: 工具可用

🔍 检查工具 | Checking tool: samtools
测试 | Testing: samtools --help
返回码 | Return code: 0
输出长度 | Output length - stdout: 2294, stderr: 0
✅ 直接运行: 工具可用

🔍 检查工具 | Checking tool: augustus
测试 | Testing: augustus --help
返回码 | Return code: 0
输出长度 | Output length - stdout: 0, stderr: 3008
✅ 直接运行: 工具可用

🔍 检查工具 | Checking tool: bedtools
测试 | Testing: bedtools --help
返回码 | Return code: 0
输出长度 | Output length - stdout: 3875, stderr: 0
✅ 直接运行: 工具可用

🔍 检查工具 | Checking tool: filterBam
测试 | Testing: filterBam -h
返回码 | Return code: 1
输出长度 | Output length - stdout: 2944, stderr: 0
✅ 直接运行: 找到使用说明,工具可用

🔍 检查工具 | Checking tool: bam2hints
测试 | Testing: bam2hints --help
返回码 | Return code: 0
输出长度 | Output length - stdout: 2449, stderr: 0
✅ 直接运行: 找到使用说明,工具可用
2025-07-17 13:45:21,241 - INFO - ============================================================
2025-07-17 13:45:21,241 - INFO - 开始多转录组Augustus基因预测流程 | Starting multiple RNA-seq Augustus gene prediction pipeline
2025-07-17 13:45:21,241 - INFO - ============================================================
2025-07-17 13:45:21,241 - INFO - 基因组文件 | Genome file: /share/org/YZWL/yzwl_lixg/project/01.NLR_prediction/05.tp309/genome/tp309.genone.fa
2025-07-17 13:45:21,241 - INFO - Augustus模型 | Augustus model: Rice_35minicore_NLR_0.5
2025-07-17 13:45:21,241 - INFO - 转录组样本数 | RNA-seq samples: 2
2025-07-17 13:45:21,241 - INFO - 输出目录 | Output directory: /share/org/YZWL/yzwl_lixg/project/01.NLR_prediction/05.tp309/model_rnaseq
2025-07-17 13:45:21,241 - INFO -
============================================================
2025-07-17 13:45:21,241 - INFO - 步骤1: 检查和构建HISAT2索引 | Step 1: Check and build HISAT2 index
2025-07-17 13:45:21,241 - INFO - ============================================================
2025-07-17 13:45:21,241 - INFO - ✓ HISAT2索引 | HISAT2 index已存在 | already exists: /share/org/YZWL/yzwl_lixg/project/01.NLR_prediction/05.tp309/genome/tp309.genone_hisat2_index.1.ht2
2025-07-17 13:45:21,241 - INFO -
============================================================
2025-07-17 13:45:21,241 - INFO - 步骤2: 处理转录组样本 | Step 2: Process RNA-seq samples
2025-07-17 13:45:21,241 - INFO - ============================================================
......
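
The `-m/--min-intron-support` option sets a minimum support for intron hints. How the tool applies it is not shown in this log. One plausible reading, sketched below, is to drop intron hints whose multiplicity falls below the threshold. The sketch assumes hints are GFF lines whose ninth column may carry a `mult=N` attribute, with a missing `mult` meaning a support of 1. Both the attribute layout and the helper name `filter_intron_hints` are assumptions.

```python
import re

MULT_PATTERN = re.compile(r"(?:^|;)mult=(\d+)")


def filter_intron_hints(gff_lines, min_support=2):
    """Keep intron hints with mult >= min_support; pass other lines through."""
    kept = []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "intron":
            kept.append(line)  # non-intron hints are left untouched
            continue
        match = MULT_PATTERN.search(fields[8])
        support = int(match.group(1)) if match else 1  # assumed default of 1
        if support >= min_support:
            kept.append(line)
    return kept


hints = [
    "chr1\tb2h\tintron\t1000\t2000\t0\t.\t.\tmult=5;pri=4;src=E\n",
    "chr1\tb2h\tintron\t3000\t3500\t0\t.\t.\tpri=4;src=E\n",
    "chr1\tb2h\tep\t100\t200\t0\t.\t.\tpri=4;src=W\n",
]
filtered = filter_intron_hints(hints, min_support=2)
```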

run_admixture

Runs an ADMIXTURE analysis from a VCF file.

Help documentation

run_admixture -h                             
usage: run_admixture [-h] --vcf VCF [--output OUTPUT] [--min-k MIN_K] [--max-k MAX_K] [--cv-folds CV_FOLDS] [--threads THREADS] [--maf MAF]
[--missing MISSING] [--hwe HWE] [--skip-preprocessing] [--keep-intermediate]

ADMIXTURE群体结构分析工具 (模块化版本) | ADMIXTURE Population Structure Analysis Tool (Modular Version)

options:
-h, --help show this help message and exit
--vcf VCF 输入VCF文件路径 | Input VCF file path (default: None)
--output OUTPUT 输出目录 | Output directory (default: admixture_results)
--min-k MIN_K 最小K值 | Minimum K value (default: 2)
--max-k MAX_K 最大K值 | Maximum K value (default: 10)
--cv-folds CV_FOLDS 交叉验证折数 | Cross-validation folds (default: 5)
--threads THREADS 线程数 | Number of threads (default: 4)
--maf MAF MAF阈值 | MAF threshold (default: 0.01)
--missing MISSING 缺失率阈值 | Missing rate threshold (default: 0.1)
--hwe HWE HWE p值阈值 | HWE p-value threshold (default: 1e-06)
--skip-preprocessing 跳过VCF预处理 | Skip VCF preprocessing (default: False)
--keep-intermediate 保留中间文件 | Keep intermediate files (default: False)

Usage example

run_admixture --vcf final_filtered.recode.chr.vcf.gz --output ./

Log output:

2025-07-17 16:57:25,799 - INFO - === 软件环境检查 | Software Environment Check ===
2025-07-17 16:57:25,799 - INFO - ✓ plink 已安装 | plink is installed
2025-07-17 16:57:25,799 - INFO - ✓ admixture 已安装 | admixture is installed
2025-07-17 16:57:25,799 - INFO - ✓ bcftools 已安装 | bcftools is installed
2025-07-17 16:57:25,799 - INFO - ✓ Rscript 已安装 | Rscript is installed
2025-07-17 16:57:25,799 - INFO - ============================================================
2025-07-17 16:57:25,800 - INFO - 开始ADMIXTURE群体结构分析 | Starting ADMIXTURE Population Structure Analysis
2025-07-17 16:57:25,800 - INFO - ============================================================
2025-07-17 16:57:25,800 - INFO -
步骤1: VCF文件预处理 | Step 1: VCF file preprocessing
2025-07-17 16:57:25,800 - INFO - 预处理VCF文件 | Preprocessing VCF file: /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/18.admixturec/final_filtered.recode.chr.vcf.gz
2025-07-17 16:57:25,800 - INFO - 开始 | Starting: 检查VCF文件头信息 | Check VCF header
2025-07-17 16:57:25,800 - INFO - 执行命令 | Executing command: zcat /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/18.admixturec/final_filtered.recode.chr.vcf.gz | head -20
2025-07-17 16:57:25,803 - INFO - 命令输出 | Command output:
......
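
With `--min-k`, `--max-k` and `--cv-folds`, the run produces a cross-validation error for each K, and the K with the lowest error is usually taken as the best fit. The sketch below picks that K from ADMIXTURE log text, assuming lines of the form `CV error (K=3): 0.52`. How run_admixture itself collects and reports these values is not shown here, and the function names are hypothetical.

```python
import re

CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def cv_errors(log_text):
    """Map each K to its cross-validation error found in the log text."""
    return {int(k): float(error) for k, error in CV_LINE.findall(log_text)}


def best_k(log_text):
    """Return the K with the lowest cross-validation error."""
    errors = cv_errors(log_text)
    return min(errors, key=errors.get)


example_log = """
CV error (K=2): 0.60123
CV error (K=3): 0.52210
CV error (K=4): 0.55987
"""
chosen_k = best_k(example_log)
```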

run_kmer_pav

Quickly computes k-mer frequencies and a PAV matrix from FASTQ or FASTA files.

Help documentation

run_kmer_pav -h
usage: run_kmer_pav [-h] --database-input DATABASE_INPUT --query-input QUERY_INPUT [-o OUTPUT_PREFIX] [--output-dir OUTPUT_DIR] [-s SIZE] [-r]
[--min-count MIN_COUNT] [--max-count MAX_COUNT] [--database-pattern DATABASE_PATTERN] [--query-pattern QUERY_PATTERN] [-t THREADS]
[--kmc-memory KMC_MEMORY] [--kmc-tmp-dir KMC_TMP_DIR] [--keep-intermediate] [--kmc-path KMC_PATH] [--kmc-tools-path KMC_TOOLS_PATH]

K-mer PAV (Presence/Absence Variation) 分析工具 (双阶段设计) | K-mer PAV Analysis Tool (Two-Stage Design)

options:
-h, --help show this help message and exit
--database-input DATABASE_INPUT
数据库输入文件/目录 (用于构建k-mer数据库) | Database input file/directory (for building k-mer database) (default: None)
--query-input QUERY_INPUT
查询输入文件/目录 (用于样本比较) | Query input file/directory (for sample comparison) (default: None)
-o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
输出文件前缀 | Output file prefix (default: kmer_analysis)
--output-dir OUTPUT_DIR
输出目录 | Output directory (default: ./kmer_output)
-s SIZE, --size SIZE K-mer大小 | K-mer size (default: 31)
-r, --reverse-complement
包含反向互补序列 | Include reverse complement sequences (default: False)
--min-count MIN_COUNT
最小k-mer计数阈值 | Minimum k-mer count threshold (default: 1)
--max-count MAX_COUNT
最大k-mer计数阈值 | Maximum k-mer count threshold (default: 1000000)
--database-pattern DATABASE_PATTERN
数据库文件匹配模式 | Database file matching pattern (default: None)
--query-pattern QUERY_PATTERN
查询文件匹配模式 | Query file matching pattern (default: None)
-t THREADS, --threads THREADS
线程数 | Number of threads (default: 8)
--kmc-memory KMC_MEMORY
KMC内存限制(GB) | KMC memory limit (GB) (default: 16)
--kmc-tmp-dir KMC_TMP_DIR
KMC临时目录 | KMC temporary directory (default: kmc_tmp)
--keep-intermediate 保留中间文件 | Keep intermediate files (default: False)
--kmc-path KMC_PATH KMC可执行文件路径 | KMC executable path (default: kmc)
--kmc-tools-path KMC_TOOLS_PATH
kmc_tools可执行文件路径 | kmc_tools executable path (default: kmc_tools)

双阶段设计说明 | Two-Stage Design Description: 阶段1 | Phase 1: 从数据库文件构建统一k-mer数据库 阶段2 | Phase 2: 查询文件与k-mer数据库比较分析 样本处理逻辑 | Sample Processing Logic: - FASTQ文件:
整个文件作为一个样本 - FASTA文件: 每条序列作为一个样本 依赖工具 | Required Tools: - KMC: K-mer计数工具 | K-mer counting tool - kmc_tools: KMC工具套件 | KMC tools suite - BioPython:
FASTA序列处理 | FASTA sequence processing 安装方法 | Installation: Ubuntu/Debian: sudo apt-get install kmc Conda: conda install -c bioconda kmc Python: pip
install biopython pandas numpy 示例 | Examples: run_kmer_pav --database-input db_files/ --query-input samples/ -o analysis run_kmer_pav --database-input
db_files/ --query-input query.fasta -s 25 run_kmer_pav --database-input ref.fasta --query-input samples/ --threads 16

Usage example

run_kmer_pav --database-input fastq --query-input gene.fa -o test

Log output:

2025-07-18 10:45:41,004 - INFO - KMC工具检查通过 | KMC tools check passed
2025-07-18 10:45:41,005 - INFO - KMC工具检查通过 | KMC tools check passed
2025-07-18 10:45:41,005 - INFO - ==========================================================================================
2025-07-18 10:45:41,005 - INFO - 开始K-mer PAV分析 (双阶段设计) | Starting K-mer PAV analysis (Two-Stage Design)
2025-07-18 10:45:41,005 - INFO - ==========================================================================================
2025-07-18 10:45:41,005 - INFO - ============================================================
2025-07-18 10:45:41,005 - INFO - 阶段1: 构建k-mer数据库 | Phase 1: Building k-mer database
2025-07-18 10:45:41,005 - INFO - ============================================================
2025-07-18 10:45:41,018 - INFO - 找到 4 个数据库文件 | Found 4 database files
2025-07-18 10:45:41,018 - INFO - 开始构建各个文件的k-mer数据库 | Starting to build individual k-mer databases
2025-07-18 10:45:41,018 - INFO - 处理数据库文件 1/4: CRR071751_1.clean.fq
2025-07-18 10:45:41,018 - INFO - 执行 | Executing: 构建数据库 | Build database: CRR071751_1.clean.fq
......
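
The tool relies on KMC for counting, but the idea of a k-mer presence/absence (PAV) matrix can be shown in a few lines of plain Python. The sketch below is a toy illustration under simplifying assumptions (exact matching, no reverse complements, everything held in memory); the function names are hypothetical and it is not how run_kmer_pav is implemented.

```python
def kmer_set(sequences, k):
    """Collect every k-mer that occurs in any of the given sequences."""
    kmers = set()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            kmers.add(seq[start:start + k])
    return kmers


def pav_matrix(target_sequence, sample_reads, k):
    """Return {kmer: {sample: 0 or 1}} for every k-mer of the target sequence."""
    targets = sorted(kmer_set([target_sequence], k))
    sample_kmers = {name: kmer_set(reads, k) for name, reads in sample_reads.items()}
    return {
        kmer: {name: int(kmer in kmers) for name, kmers in sample_kmers.items()}
        for kmer in targets
    }


matrix = pav_matrix(
    "ACGTA",
    {"sample1": ["ACGT"], "sample2": ["GTAC"]},
    k=3,
)
```

Each row of `matrix` is one k-mer of the target sequence, and each column is a sample, with 1 for present and 0 for absent.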

run_minimap2

Aligns genomes with minimap2 and extracts the sequences that do not align to the reference genome.

Help documentation

run_minimap2 -h                                           
usage: run_minimap2 [-h] -t TARGET -q QUERY [-o OUTPUT_DIR] [-x {asm5,asm10,asm20,map-ont,map-pb}] [-p THREADS] [-m MIN_MATCH] [-u MIN_UNMAPPED]
[--tp-type {S,P,SP}] [-M MINIMAP2_PATH] [-S SEQKIT_PATH]

Minimap2全基因组比对和未比对区间提取工具 | Minimap2 Whole Genome Alignment and Unmapped Region Extraction Tool

options:
-h, --help show this help message and exit
-t TARGET, --target TARGET
目标基因组文件路径 | Target genome file path (default: None)
-q QUERY, --query QUERY
查询基因组文件路径 | Query genome file path (default: None)
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
输出目录 | Output directory (default: ./minimap2_output)
-x {asm5,asm10,asm20,map-ont,map-pb}, --preset {asm5,asm10,asm20,map-ont,map-pb}
Minimap2预设参数 | Minimap2 preset parameters (default: asm5)
-p THREADS, --threads THREADS
线程数 | Number of threads (default: 8)
-m MIN_MATCH, --min-match MIN_MATCH
最小匹配长度阈值 | Minimum match length threshold (default: 1000)
-u MIN_UNMAPPED, --min-unmapped MIN_UNMAPPED
最小未比对区间长度阈值 | Minimum unmapped region length threshold (default: 1000)
--tp-type {S,P,SP} 保留的tp类型 | tp type to keep: S(secondary), P(primary), SP(both) - 默认P | default P (default: P)
-M MINIMAP2_PATH, --minimap2-path MINIMAP2_PATH
minimap2可执行文件路径 | minimap2 executable path (default: minimap2)
-S SEQKIT_PATH, --seqkit-path SEQKIT_PATH
seqkit可执行文件路径 | seqkit executable path (default: seqkit)

Usage example

run_minimap2 -q 02428.genome.fa -t T2T_NIP.genome.fa -o ./

Log output:

开始处理基因组文件...
处理样品: 02428
输入文件: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/01.data/genome/02428.genome.fa
输出目录: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/04.minimap_filter/02428
运行minimap2...
2025-07-18 22:48:25,009 - INFO - 开始Minimap2分析流程 | Starting Minimap2 analysis pipeline
2025-07-18 22:48:25,010 - INFO - 目标基因组 | Target genome: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/01.data/genome/T2T_NIP.genome.fa
2025-07-18 22:48:25,010 - INFO - 查询基因组 | Query genome: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/01.data/genome/02428.genome.fa
2025-07-18 22:48:25,010 - INFO - 开始Minimap2分析流程 | Starting Minimap2 analysis pipeline
2025-07-18 22:48:25,010 - INFO - 目标基因组 | Target genome: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/01.data/genome/T2T_NIP.genome.fa
2025-07-18 22:48:25,010 - INFO - 查询基因组 | Query genome: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/01.data/genome/02428.genome.fa
2025-07-18 22:48:25,010 - INFO - 输出目录 | Output directory: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/04.minimap_filter/02428
2025-07-18 22:48:25,010 - INFO - 预设参数 | Preset: asm5
2025-07-18 22:48:25,010 - INFO - tp类型过滤 | tp type filter: P
2025-07-18 22:48:25,010 - INFO - 线程数 | Threads: 88
2025-07-18 22:48:25,011 - INFO - 最小匹配长度 | Min match length: 1000
2025-07-18 22:48:25,011 - INFO - 最小未比对长度 | Min unmapped length: 1000
2025-07-18 22:48:25,011 - INFO -
============================================================
2025-07-18 22:48:25,011 - INFO - 步骤1: 运行Minimap2比对 | Step 1: Running Minimap2 alignment
2025-07-18 22:48:25,011 - INFO - ============================================================
2025-07-18 22:48:25,011 - INFO - 目标基因组 | Target genome: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/01.data/genome/T2T_NIP.genome.fa
2025-07-18 22:48:25,011 - INFO - 查询基因组 | Query genome: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/01.data/genome/02428.genome.fa
2025-07-18 22:48:25,011 - INFO - 输出PAF文件 | Output PAF file: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/04.minimap_filter/02428/02428.genome_alignment.paf
2025-07-18 22:48:25,011 - INFO - 预设参数 | Preset: asm5
2025-07-18 22:48:25,011 - INFO - 线程数 | Threads: 88
2025-07-18 22:48:25,011 - INFO - 执行步骤 | Executing step: Minimap2全基因组比对 | Minimap2 whole genome alignment
2025-07-18 22:48:25,011 - INFO - 命令 | Command: minimap2 -x asm5 /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/01.data/genome/T2T_NIP.genome.fa /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/01.data/genome/02428.genome.fa -t 88 > /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/04.minimap_filter/02428/02428.genome_alignment.paf
2025-07-18 22:49:46,267 - INFO - 命令执行成功 | Command executed successfully: Minimap2全基因组比对 | Minimap2 whole genome alignment
2025-07-18 22:49:46,268 - INFO - 比对完成,PAF文件已生成 | Alignment completed, PAF file generated: /share/org/YZWL/yzwl_lixg/project/08.rice_pangenome/04.minimap_filter/02428/02428.genome_alignment.paf
2025-07-18 22:49:46,268 - INFO -
......
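The help text above mentions a minimum match length, a minimum unmapped length, and a tp-type filter. How run_minimap2 applies them internally is not shown here, so the following is only a minimal sketch of one plausible approach. It assumes that "match length" means PAF column 10 (number of matching bases) and that unmapped regions are the gaps left on each query sequence after the kept alignments are merged. The function name `unmapped_regions` is hypothetical.

```python
from collections import defaultdict


def unmapped_regions(paf_lines, min_match=1000, min_unmapped=1000, tp_types=("P",)):
    """Return {query_name: [(start, end), ...]} of unaligned query intervals.

    Coordinates are 0-based, half-open, as in the PAF format.
    Assumption: an alignment is kept when its tp tag is in tp_types and
    PAF column 10 (matching bases) is at least min_match.
    Query sequences with no PAF record at all do not appear in the result.
    """
    lengths = {}
    covered = defaultdict(list)
    for line in paf_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        qname, qlen = fields[0], int(fields[1])
        qstart, qend = int(fields[2]), int(fields[3])
        n_match = int(fields[9])
        lengths[qname] = qlen
        # Optional tags start at column 13; tp:A:P is primary, tp:A:S secondary.
        tp = next((t[5:] for t in fields[12:] if t.startswith("tp:A:")), None)
        if tp not in tp_types or n_match < min_match:
            continue
        covered[qname].append((qstart, qend))

    result = {}
    for qname, qlen in lengths.items():
        gaps, pos = [], 0
        for start, end in sorted(covered[qname]):
            if start - pos >= min_unmapped:
                gaps.append((pos, start))
            pos = max(pos, end)
        if qlen - pos >= min_unmapped:
            gaps.append((pos, qlen))
        result[qname] = gaps
    return result
```

Passing `tp_types=("S", "P")` would correspond to the `SP` choice of `--tp-type`. The returned intervals could then be written as a BED file for a tool such as seqkit to extract.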

run_repeat_masker

A script for identifying repetitive sequences in a genome.

Help

run_repeat_masker -h
usage: run_repeat_masker [-h] -g GENOME [-o OUTPUT] [-s SPECIES] [-l LIB] [-t THREADS] [--hard-mask] [--no-gff] [--include-low-complexity]
[-m MODELER_THREADS] [--no-ltr-struct] [--no-modeler] [--trf-match TRF_MATCH] [--trf-mismatch TRF_MISMATCH]
[--trf-indel TRF_INDEL] [--trf-min-score TRF_MIN_SCORE] [--trf-max-period TRF_MAX_PERIOD] [--trf-min-copies TRF_MIN_COPIES]
[--trf-max-length TRF_MAX_LENGTH] [--no-trf] [-e EDTA_THREADS] [--edta-species {rice,maize,others}]
[--edta-step {all,filter,final,anno}] [--edta-sensitive] [--no-edta-anno] [--no-edta-eval] [--edta-overwrite] [--no-edta]
[--no-de-novo] [--no-database] [--min-length MIN_LENGTH] [--max-divergence MAX_DIVERGENCE]

基因组重复序列分析工具 (模块化版本) | Genome Repeat Sequence Analysis Tool (Modular Version)

options:
-h, --help show this help message and exit
-g GENOME, --genome GENOME
输入基因组FASTA文件 | Input genome FASTA file (default: None)
-o OUTPUT, --output OUTPUT
输出目录 | Output directory (default: ./repeat_output)
-s SPECIES, --species SPECIES
物种名称 | Species name (default: human)
-l LIB, --lib LIB 自定义重复序列库文件 | Custom repeat library file (default: None)
-t THREADS, --threads THREADS
RepeatMasker线程数 | RepeatMasker thread count (default: 8)
--hard-mask 使用硬屏蔽(N字符)而不是软屏蔽 | Use hard masking (N) instead of soft masking (default: False)
--no-gff 不生成GFF注释文件 | Do not generate GFF annotation file (default: False)
--include-low-complexity
包含低复杂度序列 | Include low complexity sequences (default: False)
-m MODELER_THREADS, --modeler-threads MODELER_THREADS
RepeatModeler线程数 | RepeatModeler thread count (default: 8)
--no-ltr-struct 不使用LTRStruct | Do not use LTRStruct (default: False)
--no-modeler 跳过RepeatModeler从头预测 | Skip RepeatModeler de novo prediction (default: False)
--trf-match TRF_MATCH
TRF匹配权重 | TRF match weight (default: 2)
--trf-mismatch TRF_MISMATCH
TRF错配惩罚 | TRF mismatch penalty (default: 7)
--trf-indel TRF_INDEL
TRF插入缺失惩罚 | TRF indel penalty (default: 7)
--trf-min-score TRF_MIN_SCORE
TRF最小比对得分 | TRF minimum alignment score (default: 80)
--trf-max-period TRF_MAX_PERIOD
TRF最大周期长度 | TRF maximum period size (default: 10)
--trf-min-copies TRF_MIN_COPIES
TRF最小重复次数 | TRF minimum copy number (default: 50)
--trf-max-length TRF_MAX_LENGTH
TRF最大期望长度 | TRF maximum expected length (default: 500)
--no-trf 跳过TRF串联重复分析 | Skip TRF tandem repeat analysis (default: False)
-e EDTA_THREADS, --edta-threads EDTA_THREADS
EDTA线程数 | EDTA thread count (default: 8)
--edta-species {rice,maize,others}
EDTA物种类型 | EDTA species type (default: others)
--edta-step {all,filter,final,anno}
EDTA执行步骤 | EDTA execution step (default: all)
--edta-sensitive EDTA敏感模式 | EDTA sensitive mode (default: False)
--no-edta-anno 不进行EDTA注释 | Do not perform EDTA annotation (default: False)
--no-edta-eval 不进行EDTA评估 | Do not perform EDTA evaluation (default: False)
--edta-overwrite 覆盖现有EDTA结果 | Overwrite existing EDTA results (default: False)
--no-edta 跳过EDTA转座元件分析 | Skip EDTA transposable element analysis (default: False)
--no-de-novo 跳过从头预测分析 | Skip de novo prediction analysis (default: False)
--no-database 跳过数据库搜索 | Skip database search (default: False)
--min-length MIN_LENGTH
最小重复序列长度 | Minimum repeat length (default: 50)
--max-divergence MAX_DIVERGENCE
最大分化度 (0-1) | Maximum divergence (0-1) (default: 0.25)

Usage example

run_repeat_masker -g genome.fa -s human -o results

run_vcf_pca

Goes directly from a VCF file to PCA results.

Help

 run_vcf_pca -h                                                               
usage: run_vcf_pca [-h] -v VCF [-o OUTPUT] [-s SAMPLE_INFO] [-c COMPONENTS] [-m MAF] [--missing MISSING] [--hwe HWE] [--skip-qc] [-p] [-g GROUP_COLUMN]
[--plink-path PLINK_PATH] [--bcftools-path BCFTOOLS_PATH]

VCF PCA分析脚本 (模块化版本) | VCF PCA Analysis Script (Modular Version)

options:
-h, --help show this help message and exit
-v VCF, --vcf VCF 输入VCF文件路径 (支持压缩和未压缩) | Input VCF file path (supports compressed and uncompressed) (default: None)
-o OUTPUT, --output OUTPUT
输出目录 | Output directory (default: ./pca_output)
-s SAMPLE_INFO, --sample-info SAMPLE_INFO
样本信息文件 (制表符分隔) | Sample information file (tab-separated) (default: None)
-c COMPONENTS, --components COMPONENTS
主成分数量 | Number of principal components (default: 10)
-m MAF, --maf MAF 最小等位基因频率阈值 | Minor allele frequency threshold (default: 0.05)
--missing MISSING 最大缺失率阈值 | Maximum missing rate threshold (default: 0.1)
--hwe HWE Hardy-Weinberg平衡p值阈值 | Hardy-Weinberg equilibrium p-value threshold (default: 1e-06)
--skip-qc 跳过质量控制过滤,直接使用输入VCF文件 | Skip quality control filtering, use input VCF file directly (default: False)
-p, --plot 生成PCA可视化图表 | Generate PCA visualization plots (default: False)
-g GROUP_COLUMN, --group-column GROUP_COLUMN
样本信息文件中用于分组的列名 | Column name for grouping in sample info file (default: None)
--plink-path PLINK_PATH
PLINK软件路径 | PLINK software path (default: plink)
--bcftools-path BCFTOOLS_PATH
BCFtools软件路径 | BCFtools software path (default: bcftools)

示例 | Examples:
run_vcf_pca -v variants.vcf -o pca_results
run_vcf_pca -v data.vcf.gz -o results -c 15 -p
run_vcf_pca -v variants.vcf -o pca_out -s samples.txt -g population -p
run_vcf_pca -v filtered_variants.vcf -o pca_results --skip-qc -p

Usage example

run_vcf_pca -v 01.data/wild.snp.new.record.vcf.recode.vcf -o 04.pca --skip-qc

run_vcf_ld_heatmap

Draws a linkage disequilibrium (LD) heatmap from a VCF file.

Help

run_vcf_ld_heatmap -h
usage: run_vcf_ld_heatmap.py [-h] -i INPUT [-o OUTPUT] [--region REGION] [--maf MAF] [--max-snps MAX_SNPS] [--samples SAMPLES [SAMPLES ...]]
[--figsize FIGSIZE FIGSIZE] [--dpi DPI] [--colormap COLORMAP] [--title TITLE] [--save-matrix SAVE_MATRIX]
[--ld-threshold LD_THRESHOLD] [--verbose] [--triangle-only]

VCF连锁不平衡热图生成器 (模块化版本) | VCF Linkage Disequilibrium Heatmap Generator (Modular Version)

options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
输入VCF文件路径 | Input VCF file path (default: None)
-o OUTPUT, --output OUTPUT
输出图片文件路径 | Output image file path (default: ld_heatmap.png)
--region REGION 指定基因组区域 (格式: chr:start-end, 例如: chr1:1000000-2000000) | Specify genomic region (format: chr:start-end, e.g.: chr1:1000000-2000000)
(default: None)
--maf MAF 最小等位基因频率阈值 | Minor allele frequency threshold (default: 0.01)
--max-snps MAX_SNPS 最大SNP数量限制 | Maximum SNP count limit (default: 1000)
--samples SAMPLES [SAMPLES ...]
指定样本名称列表 | Specify sample name list (default: None)
--figsize FIGSIZE FIGSIZE
图形尺寸 (宽度 高度) | Figure size (width height) (default: [10, 8])
--dpi DPI 图像分辨率 | Image resolution (default: 300)
--colormap COLORMAP 颜色映射 | Color map (default: RdYlGn_r)
--title TITLE 图表标题 | Chart title (default: None)
--save-matrix SAVE_MATRIX
保存LD矩阵到CSV文件 | Save LD matrix to CSV file (default: None)
--ld-threshold LD_THRESHOLD
LD阈值,低于此值显示为白色 | LD threshold, values below shown as white (default: 0.0)
--verbose, -v 显示详细信息 | Show verbose information (default: False)
--triangle-only 只显示上三角矩阵 | Show upper triangle only (default: False)

使用示例 | Examples:
run_vcf_ld_heatmap.py -i input.vcf -o output.png
run_vcf_ld_heatmap.py -i input.vcf -o output.pdf --region chr1:1000000-2000000
run_vcf_ld_heatmap.py -i input.vcf --maf 0.05 --max-snps 500 --figsize 12 10
run_vcf_ld_heatmap.py -i input.vcf -o heatmap.png --save-matrix ld_matrix.csv --triangle-only

Usage example

run_vcf_ld_heatmap -i final_filtered.recode.chr.vcf.gz --region OV01:1000000-2000000 -o region_ld.png --save-matrix ld_matrix.csv
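The help text does not say which LD statistic ends up in the matrix written by `--save-matrix`. A common choice is the squared Pearson correlation (r²) between genotype dosages. The sketch below illustrates that calculation under this assumption, using only the standard library. The function names `dosage`, `r_squared`, and `ld_matrix` are hypothetical and are not part of biopytools.

```python
def dosage(gt):
    """Convert a biallelic GT string to an alternate-allele count.

    '0/0' -> 0, '0|1' -> 1, '1/1' -> 2; any missing allele -> None.
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for a in alleles if a != "0")


def r_squared(x, y):
    """Squared Pearson correlation of two dosage vectors, skipping missing pairs."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        # A monomorphic site has no variance, so r^2 is undefined.
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


def ld_matrix(genotypes):
    """genotypes: one list of GT strings per SNP, samples in the same order."""
    dosages = [[dosage(gt) for gt in snp] for snp in genotypes]
    return [[r_squared(a, b) for b in dosages] for a in dosages]
```

Each inner list holds the genotypes of one SNP across all samples, so the result is a symmetric SNP-by-SNP matrix of the kind a heatmap would display.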

run_vcf_filter

Filters a VCF file.

Help

run_vcf_filter -h
usage: run_vcf_filter [-h] -i INPUT [-o OUTPUT] [-c CHR] [-s START] [-e END] [--convert-format] [--plink-path PLINK_PATH] [--allow-extra-chr] [--maf MAF]
[--max-missing MAX_MISSING] [--quality-threshold QUALITY_THRESHOLD] [--min-depth MIN_DEPTH] [--max-depth MAX_DEPTH]
[--keep-samples KEEP_SAMPLES] [--remove-samples REMOVE_SAMPLES] [--keep-ids KEEP_IDS] [--remove-ids REMOVE_IDS] [--biallelic-only]
[--remove-indels] [--verbose]

VCF文件筛选工具 (模块化版本) | VCF File Filtering Tool (Modular Version)

options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
输入VCF文件路径 | Input VCF file path (default: None)
-o OUTPUT, --output OUTPUT
输出VCF文件路径 | Output VCF file path (default: None)
-c CHR, --chr CHR, --chromosome CHR
染色体名称 (支持逗号分隔的多个染色体) | Chromosome name(s) (comma-separated for multiple) (default: None)
-s START, --start START
起始位置 | Start position (default: None)
-e END, --end END 结束位置 | End position (default: None)
--convert-format 使用PLINK进行格式转换 | Use PLINK for format conversion (default: False)
--plink-path PLINK_PATH
PLINK可执行文件路径 | PLINK executable path (default: plink)
--allow-extra-chr 允许额外染色体 | Allow extra chromosomes (default: True)
--maf MAF 最小等位基因频率 | Minimum allele frequency (default: None)
--max-missing MAX_MISSING
最大缺失率 | Maximum missing rate (default: None)
--quality-threshold QUALITY_THRESHOLD
质量阈值 | Quality threshold (default: None)
--min-depth MIN_DEPTH
最小深度 | Minimum depth (default: None)
--max-depth MAX_DEPTH
最大深度 | Maximum depth (default: None)
--keep-samples KEEP_SAMPLES
保留的样本名称 (逗号分隔) | Sample names to keep (comma-separated) (default: None)
--remove-samples REMOVE_SAMPLES
移除的样本名称 (逗号分隔) | Sample names to remove (comma-separated) (default: None)
--keep-ids KEEP_IDS 保留的变异位点ID (逗号分隔) | Variant IDs to keep (comma-separated) (default: None)
--remove-ids REMOVE_IDS
移除的变异位点ID (逗号分隔) | Variant IDs to remove (comma-separated) (default: None)
--biallelic-only 只保留双等位基因位点 | Keep only biallelic sites (default: False)
--remove-indels 移除插入缺失变异 | Remove indel variants (default: False)
--verbose, -v 显示详细信息 | Show verbose information (default: False)

使用示例 | Examples:
run_vcf_filter -i input.vcf -o output.vcf
run_vcf_filter -i input.vcf -c chr1 -s 1000 -e 2000
run_vcf_filter -i input.vcf --convert-format --maf 0.05
run_vcf_filter -i input.vcf --keep-samples sample1,sample2,sample3
run_vcf_filter -i input.vcf --quality-threshold 30 --biallelic-only

Usage example

run_vcf_filter -i final_filtered.recode.chr.vcf.gz -o filtered_vcf_ov01 -c OV01 -s 100000 -e 300000 --convert-format --allow-extra-chr

run_ena_downloader

Retrieves sequencing-data metadata and download links from the ENA database.

Help

run_ena_downloader -h
usage: run_ena_downloader [-h] --accession ACCESSION [--output-dir OUTPUT_DIR] [--create-dir] [--metadata-format {tsv,csv,xlsx}] [--protocol {ftp,aspera}]
[--aspera-key ASPERA_KEY] [--method {save,run}] [--metadata-only] [--fields FIELDS [FIELDS ...]] [--max-retries MAX_RETRIES]

ENA数据下载工具 | ENA Data Download Tool

options:
-h, --help show this help message and exit
--accession ACCESSION, -a ACCESSION
ENA项目编号 | ENA accession number (required) 格式示例 | Format examples: PRJNA661210, SRP000123 支持ENA/NCBI标准编号格式 | Supports ENA/NCBI
standard accession formats (default: None)
--output-dir OUTPUT_DIR, -o OUTPUT_DIR
输出目录 | Output directory 默认使用当前目录 | Default uses current directory (default: None)
--create-dir, -d 创建专门的输出目录 | Create dedicated output directory 格式: [accession].ena.download | Format: [accession].ena.download (default: False)
--metadata-format {tsv,csv,xlsx}, -f {tsv,csv,xlsx}
元数据文件格式 | Metadata file format (default: tsv)
--protocol {ftp,aspera}, -p {ftp,aspera}
下载协议类型 | Download protocol type ftp: 标准FTP下载 | Standard FTP download aspera: 高速传输协议 | High-speed transfer protocol (requires
private key) (default: ftp)
--aspera-key ASPERA_KEY, -k ASPERA_KEY
Aspera私钥路径 | Path to aspera private key 使用aspera协议时必需 | Required when using aspera protocol 默认位置 | Default location:
~/.aspera/connect/etc/asperaweb_id_dsa.openssh (default: None)
--method {save,run}, -m {save,run}
执行模式 | Execution mode save: 生成下载脚本 | Generate download script (default) run: 直接执行下载命令 | Execute download commands directly
(default: save)
--metadata-only, -M 仅下载元数据,不处理FASTQ文件 | Only download metadata, do not process FASTQ files (default: False)
--fields FIELDS [FIELDS ...], -F FIELDS [FIELDS ...]
自定义元数据字段 | Custom metadata fields 使用 "all" 获取所有字段 | Use "all" to get all fields 示例 | Examples: --fields fastq_ftp fastq_md5
study_title (default: None)
--max-retries MAX_RETRIES, -r MAX_RETRIES
API请求最大重试次数 | Maximum API request retries (default: 3)

使用示例 | Usage Examples:
# 仅下载元数据 | Download metadata only
python run_ena_downloader -a PRJNA661210 -M
# 下载元数据并生成FTP下载脚本 | Download metadata and generate FTP download script
python run_ena_downloader -a PRJNA661210 -p ftp -m save
# 下载元数据并生成Aspera下载脚本 | Download metadata and generate Aspera download script
python run_ena_downloader -a PRJNA661210 -p aspera -k ~/.aspera/connect/etc/asperaweb_id_dsa.openssh
# 直接执行FTP下载 | Execute FTP download directly
python run_ena_downloader -a PRJNA661210 -p ftp -m run
# 自定义输出目录和格式 | Custom output directory and format
python run_ena_downloader -a PRJNA661210 -o my_results -f xlsx
# 自定义字段 | Custom fields
python run_ena_downloader -a PRJNA661210 -F fastq_ftp study_title -f csv -o results
# 创建专门目录 | Create dedicated directory
python run_ena_downloader -a PRJNA661210 -d

Usage example

run_ena_downloader -a PRJNA505193 -M

Log output:

2025-07-23 22:24:27 - ena_downloader - INFO - 开始元数据下载流程 | Starting metadata download pipeline
2025-07-23 22:24:27 - ena_downloader - INFO - 开始下载元数据 | Starting metadata download for accession: PRJDB10313
2025-07-23 22:24:27 - ena_downloader - INFO - 正在请求ENA API | Requesting ENA API: https://www.ebi.ac.uk/ena/portal/api/filereport
2025-07-23 22:24:28 - ena_downloader - INFO - 元数据文件已保存 | Metadata file saved: PRJDB10313.meta.tsv
2025-07-23 22:24:28 - ena_downloader - INFO - 元数据下载完成 | Metadata download completed
2025-07-23 22:24:28 - ena_downloader - INFO - 汇总报告已生成 | Summary report generated: download_summary.txt

元数据下载完成 | Metadata download completed
元数据文件 | Metadata file: PRJDB10313.meta.tsv
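The log above shows the tool calling the ENA Portal API `filereport` endpoint, but not the query it sends. As a rough illustration, a request for run-level metadata could be assembled as below. The query parameters (`result=read_run`, the default field list, `format=tsv`) are assumptions made for this sketch and may differ from what run_ena_downloader actually requests. The function name `filereport_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the log output above.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("fastq_ftp", "fastq_md5"), fmt="tsv"):
    """Build a filereport URL for one accession (sketch; parameters are assumed)."""
    params = {
        "accession": accession,
        "result": "read_run",        # assumption: run-level records
        "fields": ",".join(fields),  # field names taken from the --fields help text
        "format": fmt,
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"
```

The resulting URL can be fetched with any HTTP client; the `fastq_ftp` column of the response then carries the download links that a generated script would use.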

run_vcf_njtree

Builds a neighbor-joining (NJ) tree from a VCF file.

Help

run_vcf_njtree -h                                       
usage: run_vcf_njtree [-h] [-i VCF_FILE] [-d DISTANCE_MATRIX] [-o OUTPUT_PREFIX] [-t TREE_OUTPUT] [--vcf2dis-path VCF2DIS_PATH] [-w WORKING_DIR]
[--skip-vcf2dis]

VCF系统发育分析工具 v2.0 | VCF Phylogenetic Analysis Tool v2.0

options:
-h, --help show this help message and exit

输入文件 | Input Files:
-i VCF_FILE, --input VCF_FILE
输入VCF文件路径 | Input VCF file path
-d DISTANCE_MATRIX, --distance-matrix DISTANCE_MATRIX
已有的距离矩阵文件路径(用于跳过VCF2Dis步骤)| Existing distance matrix file path (for skipping VCF2Dis step)

输出文件 | Output Files:
-o OUTPUT_PREFIX, --output OUTPUT_PREFIX
输出文件前缀(默认: phylo_analysis)| Output file prefix (default: phylo_analysis)
-t TREE_OUTPUT, --tree-output TREE_OUTPUT
系统发育树输出文件路径(默认: OUTPUT.nwk)| Phylogenetic tree output file path (default: OUTPUT.nwk)

工具设置 | Tool Settings:
--vcf2dis-path VCF2DIS_PATH
VCF2Dis程序路径(默认: VCF2Dis)| VCF2Dis program path (default: VCF2Dis)
-w WORKING_DIR, --working-dir WORKING_DIR
工作目录(默认: 当前目录)| Working directory (default: current directory)

行为控制 | Behavior Control:
--skip-vcf2dis 跳过VCF2Dis步骤,直接从距离矩阵构建树 | Skip VCF2Dis step, build tree directly from distance matrix

示例 | Examples:
# 基本分析 | Basic analysis
run_vcf_njtree -i wild.snp.vcf -o wild_snp

# 指定完整路径 | Specify full paths
run_vcf_njtree --input 01.data/wild.snp.new.record.vcf.recode.vcf \
--output 02.tree/wild_snp_dis_mat \
--tree-output 02.tree/wild_snp_dis.nwk

# 从已有距离矩阵构建树 | Build tree from existing matrix
run_vcf_njtree --distance-matrix existing_matrix.txt \
--tree-output tree.nwk \
--skip-vcf2dis

Usage example

run_vcf_njtree -i final_filtered.recode.chr.vcf.gz -o ov  

Log output:

run_vcf_njtree -i final_filtered.recode.chr.vcf.gz -o ov
2025-07-25 10:28:31,505 - INFO - ================================================================================
2025-07-25 10:28:31,505 - INFO - 开始VCF系统发育分析 | Starting VCF phylogenetic analysis
2025-07-25 10:28:31,505 - INFO - ================================================================================
2025-07-25 10:28:31,505 - INFO - 检查依赖软件 | Checking dependencies
2025-07-25 10:28:31,507 - INFO - ✓ VCF2Dis 可用 | available
2025-07-25 10:28:31,507 - INFO - ✓ Python包 numpy 可用 | Python package numpy available
2025-07-25 10:28:31,707 - INFO - ✓ Python包 pandas 可用 | Python package pandas available
2025-07-25 10:28:31,710 - INFO - ✓ Python包 scipy 可用 | Python package scipy available
2025-07-25 10:28:32,102 - INFO - ✓ Python包 scikit-bio 可用 | Python package scikit-bio available
2025-07-25 10:28:32,102 - INFO - ----------------------------------------
2025-07-25 10:28:32,102 - INFO - 步骤1: 计算距离矩阵 | Step 1: Calculate distance matrix
2025-07-25 10:28:32,102 - INFO - ----------------------------------------
2025-07-25 10:28:32,102 - INFO - 开始计算VCF距离矩阵 | Starting VCF distance matrix calculation
2025-07-25 10:28:32,102 - WARNING - 无法获取VCF统计信息 | Cannot get VCF statistics: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
2025-07-25 10:28:32,102 - INFO - 执行步骤 | Executing step: 使用VCF2Dis计算距离矩阵 | Calculate distance matrix using VCF2Dis
2025-07-25 10:28:32,102 - INFO - 命令 | Command: VCF2Dis -InPut /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/98.test/final_filtered.recode.chr.vcf.gz -OutPut /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/98.test/ov
2025-07-25 10:28:32,102 - INFO - 工作目录 | Working directory: /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/98.test
2025-07-25 10:31:06,393 - INFO - 命令执行成功 | Command executed successfully: 使用VCF2Dis计算距离矩阵 | Calculate distance matrix using VCF2Dis
2025-07-25 10:31:06,393 - INFO - 距离矩阵已生成 | Distance matrix generated: /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/98.test/ov
2025-07-25 10:31:06,393 - INFO - 距离矩阵包含 100 个样本 | Distance matrix contains 100 samples
2025-07-25 10:31:06,393 - INFO - ----------------------------------------
2025-07-25 10:31:06,393 - INFO - 步骤2: 构建系统发育树 | Step 2: Build phylogenetic tree
2025-07-25 10:31:06,393 - INFO - ----------------------------------------
2025-07-25 10:31:06,393 - INFO - 开始构建NJ系统发育树 | Starting NJ phylogenetic tree construction
2025-07-25 10:31:06,393 - INFO - ✓ 成功导入 scikit-bio | Successfully imported scikit-bio
2025-07-25 10:31:06,393 - INFO - 读取距离矩阵文件 | Reading distance matrix file: /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/98.test/ov
2025-07-25 10:31:06,394 - INFO - 检测到矩阵维度信息: 100 | Detected matrix dimension: 100
2025-07-25 10:31:06,395 - INFO - 验证距离矩阵 | Validating distance matrix: (100, 100)
2025-07-25 10:31:06,396 - INFO - 距离矩阵验证完成 | Distance matrix validation completed
2025-07-25 10:31:06,396 - INFO - 矩阵统计 | Matrix statistics: min=0.000000, max=0.486191, mean=0.313095
2025-07-25 10:31:06,396 - INFO - 成功读取 100 个样本的距离矩阵 | Successfully read distance matrix for 100 samples
2025-07-25 10:31:06,396 - INFO - 样本列表 | Sample list: ['OV8-1', 'OV8-105', 'OV8-106', 'OV8-107', 'OV8-108']...
2025-07-25 10:31:06,396 - INFO - 创建 scikit-bio DistanceMatrix 对象 | Creating scikit-bio DistanceMatrix object
2025-07-25 10:31:06,396 - INFO - 使用NJ算法构建系统发育树 | Building phylogenetic tree using NJ algorithm
2025-07-25 10:31:06,399 - INFO - 成功构建NJ树,树长度: 4941 字符 | Successfully built NJ tree, tree length: 4941 characters
2025-07-25 10:31:06,400 - INFO - 系统发育树已保存到 | Phylogenetic tree saved to: /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/98.test/ov.nwk
2025-07-25 10:31:06,402 - INFO - NJ系统发育树已保存 | NJ phylogenetic tree saved: /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/98.test/ov.nwk
2025-07-25 10:31:06,402 - INFO - 系统发育树统计 | Phylogenetic tree statistics:
2025-07-25 10:31:06,402 - INFO - - 样本数 | Sample count: 100
2025-07-25 10:31:06,402 - INFO - - 树格式 | Tree format: Newick
2025-07-25 10:31:06,402 - INFO - - 构建方法 | Construction method: Neighbor-Joining (NJ)
2025-07-25 10:31:06,402 - INFO - - 使用库 | Library used: scikit-bio
2025-07-25 10:31:06,402 - INFO - - 文件大小 | File size: 4941 bytes
2025-07-25 10:31:06,402 - INFO - - 树格式验证 | Tree format validation: ✓ 有效的Newick格式 | Valid Newick format
2025-07-25 10:31:06,402 - INFO - ----------------------------------------
2025-07-25 10:31:06,402 - INFO - 步骤3: 生成结果报告 | Step 3: Generate results report
2025-07-25 10:31:06,402 - INFO - ----------------------------------------
2025-07-25 10:31:06,402 - INFO - 验证输出结果 | Validating output results
2025-07-25 10:31:06,402 - INFO - ✓ 距离矩阵文件存在 | Distance matrix file exists
2025-07-25 10:31:06,402 - INFO - ✓ 系统发育树文件存在 | Phylogenetic tree file exists
2025-07-25 10:31:06,402 - INFO - ✓ 系统发育树文件格式正确 | Phylogenetic tree file format is correct
2025-07-25 10:31:06,402 - INFO - 生成分析总结报告 | Generating analysis summary report
2025-07-25 10:31:06,402 - INFO - 总结报告已生成 | Summary report generated: ov_summary.txt
2025-07-25 10:31:06,402 - INFO - ================================================================================
2025-07-25 10:31:06,402 - INFO - VCF系统发育分析完成 | VCF phylogenetic analysis completed successfully
2025-07-25 10:31:06,402 - INFO - ================================================================================
2025-07-25 10:31:06,403 - INFO - 输出文件 | Output files:
2025-07-25 10:31:06,403 - INFO - - 距离矩阵 | Distance matrix: /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/98.test/ov (91306 bytes)
2025-07-25 10:31:06,403 - INFO - - 系统发育树 | Phylogenetic tree: /share/org/YZWL/yzwl_lixg/project/06.longliuxing_BSA/98.test/ov.nwk (4941 bytes)
2025-07-25 10:31:06,403 - INFO - - 日志文件 | Log file: ov.log (6322 bytes)
2025-07-25 10:31:06,403 - INFO - - 总结报告 | Summary report: ov_summary.txt (1688 bytes)
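The log above reports reading the distance matrix, detecting its dimension, and validating it before the tree is built. The sketch below shows what such a read-and-validate step could look like. It assumes the matrix is a PHYLIP-style square matrix (first line is the sample count, then one row per sample starting with its name), which matches the "matrix dimension" message but is not confirmed by the source. The function name `read_phylip_matrix` is hypothetical.

```python
def read_phylip_matrix(lines):
    """Parse a PHYLIP-style square distance matrix into (ids, matrix).

    Assumed layout: the first non-empty line holds the number of samples n,
    followed by n rows of the form '<name> d1 d2 ... dn'.
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    n = int(rows[0][0])
    ids, matrix = [], []
    for row in rows[1:n + 1]:
        ids.append(row[0])
        matrix.append([float(v) for v in row[1:]])
    if len(ids) != n or any(len(r) != n for r in matrix):
        raise ValueError("matrix shape does not match the declared dimension")
    # A valid distance matrix has a zero diagonal and is symmetric.
    for i in range(n):
        if matrix[i][i] != 0.0:
            raise ValueError(f"non-zero diagonal for sample {ids[i]}")
        for j in range(i):
            if abs(matrix[i][j] - matrix[j][i]) > 1e-9:
                raise ValueError(f"asymmetric entry for {ids[i]} / {ids[j]}")
    return ids, matrix
```

With scikit-bio installed, the parsed values could be wrapped as `skbio.DistanceMatrix(matrix, ids)` and passed to `skbio.tree.nj`, which are the library and algorithm named in the log.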

parse_longest_mrna

Extracts the longest transcript of each gene. It calls gffread, and the log below shows that it also calls seqkit.

Help

parse_longest_mrna -h 
usage: parse_longest_mrna.py [-h] -g GENOME -f GFF3 -o OUTPUT [--gene-info GENE_INFO]

最长转录本提取工具 (模块化版本) | Longest mRNA Extraction Tool (Modular Version)

options:
-h, --help show this help message and exit
-g GENOME, --genome GENOME
输入基因组FASTA文件 | Input genome FASTA file (default: None)
-f GFF3, --gff3 GFF3 输入GFF3注释文件 | Input GFF3 annotation file (default: None)
-o OUTPUT, --output OUTPUT
输出FASTA文件 | Output FASTA file (default: None)
--gene-info GENE_INFO
基因信息输出文件 (默认自动生成) | Gene info output file (auto-generated by default) (default: None)

Usage example

parse_longest_mrna -g T2T70-15_chr.fa -f T2T70-15_chr.gff -o longest.mrna.pep.fa

Log output:

2025-07-17 08:47:30,359 - INFO - 开始最长转录本提取流程 | Starting longest mRNA extraction pipeline
2025-07-17 08:47:30,359 - INFO - 基因组文件 | Genome file: /mnt/f/biopytools_test/longest_mran/T2T70-15_chr.fa
2025-07-17 08:47:30,359 - INFO - GFF3文件 | GFF3 file: /mnt/f/biopytools_test/longest_mran/T2T70-15_chr.gff
2025-07-17 08:47:30,359 - INFO - 输出文件 | Output file: /mnt/f/biopytools_test/longest_mran/longest.mrna.pep.fa
2025-07-17 08:47:30,359 - INFO -
步骤1: 分析GFF3文件,计算最长转录本 | Step 1: Analyzing GFF3 file, calculating longest transcripts
2025-07-17 08:47:30,359 - INFO - 计算CDS长度 | Calculating CDS lengths from: /mnt/f/biopytools_test/longest_mran/T2T70-15_chr.gff
2025-07-17 08:47:30,807 - INFO - 找到 12100 个基因的最长转录本 | Found longest transcripts for 12100 genes
2025-07-17 08:47:30,807 - INFO -
步骤2: 生成基因信息文件 | Step 2: Generating gene info file
2025-07-17 08:47:30,807 - INFO - 生成基因信息文件 | Generating gene info file: /mnt/f/biopytools_test/longest_mran/T2T70-15_chr.gene.info.txt
2025-07-17 08:47:30,830 - INFO - ✓ 基因信息文件生成完成 | Gene info file generated: /mnt/f/biopytools_test/longest_mran/T2T70-15_chr.gene.info.txt
2025-07-17 08:47:30,830 - INFO -
步骤3: 提取蛋白质序列 | Step 3: Extracting protein sequences
2025-07-17 08:47:30,832 - INFO - 执行 | Executing: 使用gffread生成蛋白质序列 | Generate protein sequences using gffread
2025-07-17 08:47:30,833 - INFO - 命令 | Command: gffread "/mnt/f/biopytools_test/longest_mran/T2T70-15_chr.gff" -g "/mnt/f/biopytools_test/longest_mran/T2T70-15_chr.fa" -y "/tmp/tmpnnf71ggw.fa"
2025-07-17 08:47:36,724 - INFO - ✓ 命令执行成功 | Command executed successfully: 使用gffread生成蛋白质序列 | Generate protein sequences using gffread
2025-07-17 08:47:36,724 - INFO - 执行 | Executing: 使用seqkit筛选最长转录本序列 | Filter longest transcript sequences using seqkit
2025-07-17 08:47:36,724 - INFO - 命令 | Command: seqkit grep -f "/tmp/tmpf299bf64" "/tmp/tmpnnf71ggw.fa" -o "/mnt/f/biopytools_test/longest_mran/longest.mrna.pep.fa"
2025-07-17 08:47:36,832 - INFO - ✓ 命令执行成功 | Command executed successfully: 使用seqkit筛选最长转录本序列 | Filter longest transcript sequences using seqkit
2025-07-17 08:47:36,832 - INFO - ✓ 成功提取 12100 个最长转录本序列 | Successfully extracted 12100 longest transcript sequences
2025-07-17 08:47:36,832 - INFO - ✓ 输出文件 | Output file: /mnt/f/biopytools_test/longest_mran/longest.mrna.pep.fa
2025-07-17 08:47:36,832 - INFO -
步骤4: 生成统计信息 | Step 4: Generating statistics
2025-07-17 08:47:36,832 - INFO - 解析GFF3文件 | Parsing GFF3 file: /mnt/f/biopytools_test/longest_mran/T2T70-15_chr.gff
2025-07-17 08:47:46,499 - INFO - 解析完成,找到 12100 个基因 | Parsing completed, found 12100 genes
2025-07-17 08:47:46,499 - INFO - 计算统计信息 | Calculating statistics
2025-07-17 08:47:46,501 - INFO -
步骤5: 显示总结 | Step 5: Displaying summary
2025-07-17 08:47:46,501 - INFO - ==================================================
2025-07-17 08:47:46,501 - INFO - 提取完成总结 | Extraction Summary
2025-07-17 08:47:46,501 - INFO - ==================================================
2025-07-17 08:47:46,501 - INFO - 总基因数 | Total genes processed: 12100
2025-07-17 08:47:46,501 - INFO - 多转录本基因数 | Genes with multiple transcripts: 0
2025-07-17 08:47:46,501 - INFO - 平均转录本长度 | Average transcript length: 1625.63
2025-07-17 08:47:46,501 - INFO - 提取的最长转录本数 | Longest transcripts extracted: 12100
2025-07-17 08:47:46,501 - INFO - 输出文件 | Output files:
2025-07-17 08:47:46,501 - INFO - - 蛋白质序列文件 | Protein sequences: /mnt/f/biopytools_test/longest_mran/longest.mrna.pep.fa
2025-07-17 08:47:46,501 - INFO - - 基因信息文件 | Gene info: /mnt/f/biopytools_test/longest_mran/T2T70-15_chr.gene.info.txt
2025-07-17 08:47:46,501 - INFO - ==================================================
2025-07-17 08:47:46,501 - INFO -
✓ 最长转录本提取流程完成 | Longest mRNA extraction pipeline completed successfully
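Step 1 in the log computes CDS lengths from the GFF3 file and picks one longest transcript per gene. The tool's own code is not shown, so the following is a minimal sketch of that idea. It assumes standard GFF3 `ID` and `Parent` attributes, with CDS features pointing to their transcript and transcripts pointing to their gene. The function name `longest_transcripts` is hypothetical.

```python
from collections import defaultdict


def _attributes(column9):
    """Parse the GFF3 attribute column into a dict."""
    return dict(kv.split("=", 1) for kv in column9.strip().split(";") if "=" in kv)


def longest_transcripts(gff3_lines, transcript_types=("mRNA", "transcript")):
    """Return {gene_id: (transcript_id, total_cds_length)}.

    Ties are resolved in favour of the transcript that appears first.
    """
    transcript_gene = {}
    cds_length = defaultdict(int)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        ftype, start, end = fields[2], int(fields[3]), int(fields[4])
        attrs = _attributes(fields[8])
        if ftype in transcript_types and "ID" in attrs and "Parent" in attrs:
            transcript_gene[attrs["ID"]] = attrs["Parent"]
        elif ftype == "CDS" and "Parent" in attrs:
            # GFF3 coordinates are 1-based and inclusive.
            for parent in attrs["Parent"].split(","):
                cds_length[parent] += end - start + 1

    best = {}
    for transcript, gene in transcript_gene.items():
        length = cds_length.get(transcript, 0)
        if gene not in best or length > best[gene][1]:
            best[gene] = (transcript, length)
    return best
```

The selected transcript IDs could be written one per line to a file and used as the pattern list for `seqkit grep -f`, which is the filtering command visible in the log.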

parse_sample_hete

Computes per-sample heterozygous and homozygous genotype statistics from a VCF file.

Help

parse_sample_hete -h
usage: parse_sample_hete.py [-h] -v VCF [-o OUTPUT] [-d MIN_DEPTH] [-q MIN_QUAL] [-e] [-D] [-S]

VCF基因型统计分析脚本 | VCF Genotype Statistics Analysis Script 支持短参数和长参数格式 | Supports both short and long parameter formats

options:
-h, --help show this help message and exit
-v VCF, --vcf VCF 输入VCF文件路径 (支持.gz压缩格式) | Input VCF file path (supports .gz compressed format) (default: None)
-o OUTPUT, --output OUTPUT
输出目录 | Output directory (default: vcf_stats_output)
-d MIN_DEPTH, --min-depth MIN_DEPTH
最小深度过滤阈值 (0表示不过滤) | Minimum depth filter threshold (0 = no filter) (default: 0)
-q MIN_QUAL, --min-qual MIN_QUAL
最小质量分数过滤阈值 (0.0表示不过滤) | Minimum quality score filter threshold (0.0 = no filter) (default: 0.0)
-e, --exclude-missing
排除缺失基因型 (./..) 的统计 | Exclude missing genotypes (./..) from statistics (default: False)
-D, --no-detailed 不输出详细统计结果 | Do not output detailed statistics (default: False)
-S, --no-summary 不输出汇总统计结果 | Do not output summary statistics (default: False)

参数对照表 | Parameter Reference:
-v/--vcf 输入VCF文件 | Input VCF file
-o/--output 输出目录 | Output directory
-d/--min-depth 最小深度 | Minimum depth
-q/--min-qual 最小质量 | Minimum quality
-e/--exclude-missing 排除缺失 | Exclude missing
-D/--no-detailed 无详细输出 | No detailed output
-S/--no-summary 无汇总输出 | No summary output

示例用法 | Example Usage:
# 基本分析 | Basic analysis
python run_vcf_stats.py -v variants.vcf -o vcf_stats_output
# 应用质量和深度过滤 | Apply quality and depth filters
python run_vcf_stats.py -v variants.vcf.gz -o results -d 10 -q 30.0
# 排除缺失基因型,仅输出汇总统计 | Exclude missing genotypes, summary only
python run_vcf_stats.py -v variants.vcf -o results -e -D
# 长参数格式 | Long parameter format
python run_vcf_stats.py --vcf variants.vcf --output results \
--min-depth 10 --min-qual 30.0 --exclude-missing --no-detailed

支持的基因型格式 | Supported genotype formats:
- 未定相: 0/0, 0/1, 1/1, ./. | Unphased: 0/0, 0/1, 1/1, ./.
- 已定相: 0|0, 0|1, 1|1, .|. | Phased: 0|0, 0|1, 1|1, .|.
- 多等位基因: 0/2, 1/2等 | Multi-allelic: 0/2, 1/2, etc.

Usage example

parse_sample_hete --vcf final_filtered.recode.chr.vcf.gz --output 样品杂合度统计信息
2025-07-17 10:53:47,215 - INFO - ==================== 开始VCF基因型统计分析 | Starting VCF Genotype Statistics Analysis ====================
2025-07-17 10:53:47,215 - INFO - VCF文件 | VCF file: /mnt/f/project/04.诸葛菜/基因型/final_filtered.recode.chr.vcf.gz
2025-07-17 10:53:47,215 - INFO - 输出目录 | Output directory: /mnt/f/project/04.诸葛菜/基因型/样品杂合度统计信息.txt
2025-07-17 10:53:47,216 - INFO -
步骤 1/3: 处理VCF文件 | Step 1/3: Processing VCF file
2025-07-17 10:53:47,216 - INFO - 开始处理VCF文件 | Starting VCF file processing
2025-07-17 10:53:47,247 - INFO - 检测到 100 个样本 | Detected 100 samples
2025-07-17 10:53:47,248 - INFO - 样本名称 | Sample names: OV8-1, OV8-105, OV8-106, OV8-107, OV8-108...
2025-07-17 10:53:48,590 - INFO - 已处理 10000 个变异位点 | Processed 10000 variants
2025-07-17 10:53:51,630 - INFO - 已处理 20000 个变异位点 | Processed 20000 variants
......
2025-07-17 11:17:32,047 - INFO - 已处理 5670000 个变异位点 | Processed 5670000 variants
2025-07-17 11:17:34,437 - INFO - VCF处理完成 | VCF processing completed
2025-07-17 11:17:34,437 - INFO - 总变异位点数 | Total variants: 5677437
2025-07-17 11:17:34,437 - INFO - 通过过滤的位点数 | Variants passed filters: 5677437
2025-07-17 11:17:34,438 - INFO -
步骤 2/3: 计算统计结果 | Step 2/3: Calculating statistics
2025-07-17 11:17:34,438 - INFO - 成功计算了 100 个样本的统计结果 | Successfully calculated statistics for 100 samples
2025-07-17 11:17:34,438 - INFO -
步骤 3/3: 导出结果 | Step 3/3: Exporting results
2025-07-17 11:17:34,439 - INFO - 导出汇总统计结果 | Exporting summary statistics
2025-07-17 11:17:34,451 - INFO - 汇总统计已保存 | Summary statistics saved: /mnt/f/project/04.诸葛菜/基因型/样品杂合度统计信息.txt/genotype_summary_statistics.txt
2025-07-17 11:17:34,458 - INFO - 简化统计已保存 | Simplified statistics saved: /mnt/f/project/04.诸葛菜/基因型/样品杂合度统计信息.txt/genotype_rates_simple.txt
2025-07-17 11:17:34,458 - INFO - 导出详细统计结果 | Exporting detailed statistics
2025-07-17 11:17:34,468 - INFO - 详细统计已保存 | Detailed statistics saved: /mnt/f/project/04.诸葛菜/基因型/样品杂合度统计信息.txt/genotype_detailed_statistics.txt
2025-07-17 11:17:34,469 - INFO - 为每个样本导出单独统计文件 | Exporting individual sample statistics
2025-07-17 11:17:34,828 - INFO - 分析总结报告已生成 | Analysis summary report generated: /mnt/f/project/04.诸葛菜/基因型/样品杂合度统计信息.txt/analysis_summary.txt
2025-07-17 11:17:34,828 - INFO -
==================== VCF基因型统计分析完成 | VCF Genotype Statistics Analysis Completed ====================
2025-07-17 11:17:34,833 - INFO - 结果文件已保存到 | Results saved to: /mnt/f/project/04.诸葛菜/基因型/样品杂合度统计信息.txt

parse_gene_info

Extracts integrated gene and transcript information, including the chromosome, for each transcript in a GFF3 file.

Help documentation

parse_gene_info -h
usage: parse_gene_info [-h] --gff3 GFF3 --output OUTPUT [--gene-type GENE_TYPE] [--transcript-types TRANSCRIPT_TYPES [TRANSCRIPT_TYPES ...]]

从GFF3文件中为每个转录本提取整合的基因和转录本信息 | Extract integrated gene and transcript information for each transcript from GFF3 files

options:
-h, --help show this help message and exit
--gff3, -g GFF3 输入的GFF3文件路径 | Input GFF3 file path (default: None)
--output, -o OUTPUT 输出的TSV文件路径 | Output TSV file path (default: None)
--gene-type GENE_TYPE
基因特征类型 | Gene feature type (default: gene)
--transcript-types TRANSCRIPT_TYPES [TRANSCRIPT_TYPES ...]
转录本特征类型列表 | Transcript feature types list (default: ['mRNA', 'transcript'])

示例 | Example: parse_gene_info -g input.gff3 -o gene_transcript_info.tsv

Usage example

parse_gene_info -g genome.gff -o gene_info.txt

Log output:

2025-07-16 14:10:44,332 - INFO - 开始GFF3基因转录本提取分析 | Starting GFF3 gene transcript extraction analysis
2025-07-16 14:10:44,333 - INFO - 输入文件 | Input file: /mnt/f/biopytools_test/geneinfo/genome.gff
2025-07-16 14:10:44,333 - INFO - 输出文件 | Output file: /mnt/f/biopytools_test/geneinfo/gene_info.txt
2025-07-16 14:10:44,333 - INFO - 开始提取基因和转录本信息 | Starting to extract gene and transcript information
2025-07-16 14:10:44,334 - INFO - 输入文件 | Input file: /mnt/f/biopytools_test/geneinfo/genome.gff
2025-07-16 14:10:44,334 - INFO - 基因类型 | Gene type: gene
2025-07-16 14:10:44,334 - INFO - 转录本类型 | Transcript types: transcript, mRNA
2025-07-16 14:10:44,334 - INFO - 开始收集基因信息 | Starting to collect gene information
2025-07-16 14:10:47,572 - INFO - 收集完成,共发现 51840 个基因 | Collection completed, found 51840 genes
2025-07-16 14:10:47,572 - INFO - 开始处理转录本信息 | Starting to process transcript information
2025-07-16 14:10:49,882 - INFO - 处理完成,共处理 51840 个转录本 | Processing completed, processed 51840 transcripts
2025-07-16 14:10:49,882 - INFO - 提取完成 | Extraction completed
2025-07-16 14:10:49,886 - INFO - 写入结果文件 | Writing results file: /mnt/f/biopytools_test/geneinfo/gene_info.txt
2025-07-16 14:10:50,270 - INFO - 结果已保存 | Results saved: /mnt/f/biopytools_test/geneinfo/gene_info.txt
2025-07-16 14:10:50,280 - INFO - 总结报告已生成 | Summary report generated: /mnt/f/biopytools_test/geneinfo/gff_extraction_summary.txt

基因转录本提取统计摘要 | Gene Transcript Extraction Summary:
============================================================
总转录本数 | Total transcripts: 51840
涉及基因数 | Genes involved: 51840
染色体数 | Chromosomes: 147
链方向 | Strands: +, -
孤儿转录本数 | Orphan transcripts: 0
染色体列表 | Chromosome list: OV01, OV02, OV03, OV04, OV05, OV06, OV07, OV08, OV09, OV10, OV11, OV12, ctg000030, ctg000040, ctg000070, ctg000080, ctg000090, ctg000100, ctg000160, ctg000180, ctg000190, ctg000210, ctg000220, ctg000240, ctg000260, ctg000270, ctg000310, ctg000330, ctg000360, ctg000400, ctg000410, ctg000420, ctg000430, ctg000440, ctg000450, ctg000470, ctg000510, ctg000520, ctg000530, ctg000550, ctg000590, ctg000610, ctg000640, ctg000650, ctg000680, ctg000720, ctg000740, ctg000760, ctg000850, ctg000890, ctg000910, ctg000920, ctg000930, ctg000940, ctg000960, ctg000980, ctg000990, ctg001060, ctg001070, ctg001100, ctg001110, ctg001120, ctg001140, ctg001150, ctg001200, ctg001210, ctg001220, ctg001250, ctg001270, ctg001300, ctg001320, ctg001330, ctg001350, ctg001390, ctg001400, ctg001440, ctg001450, ctg001470, ctg001510, ctg001560, ctg001750, ctg001770, ctg001810, ctg001820, ctg001830, ctg001860, ctg001870, ctg001880, ctg001900, ctg001910, ctg001930, ctg001940, ctg001950, ctg001960, ctg001970, ctg002000, ctg002010, ctg002020, ctg002030, ctg002080, ctg002090, ctg002110, ctg002130, ctg002180, ctg002210, ctg002230, ctg002240, ctg002260, ctg002310, ctg002350, ctg002370, ctg002380, ctg002390, ctg002400, ctg002410, ctg002420, ctg002480, ctg002530, ctg002580, ctg002590, ctg002630, ctg002640, ctg002650, ctg002670, ctg002680, ctg002760, ctg002780, ctg002840, ctg002870, ctg002900, ctg003030, ctg003110, ctg003180, ctg003220, ctg003230, ctg003240, ctg003250, ctg003260, ctg003270, ctg003330, ctg003340, ctg003350, ctg003380, ctg003390, ctg003400, ctg003410, ctg003490
============================================================
2025-07-16 14:10:50,289 - INFO - 提取完成 | Extraction completed successfully
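
The summary above reports transcripts, genes, and orphan transcripts. A minimal Python sketch of how such a linkage could be computed from GFF3 lines follows. It is an illustration under assumptions, not the tool's implementation: `parse_attributes` and `extract_transcripts` are hypothetical names, the output fields are invented for the example, each transcript is assumed to have a single `Parent`, and an "orphan" is taken to mean a transcript whose parent gene is absent from the file. The `gene_type` and `transcript_types` defaults mirror the `--gene-type` and `--transcript-types` defaults in the help text.

```python
def parse_attributes(column9):
    """Turn a GFF3 attributes string such as 'ID=t1;Parent=g1' into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def extract_transcripts(lines, gene_type="gene", transcript_types=("mRNA", "transcript")):
    """Return (rows, orphans): one row per transcript with a known parent gene."""
    genes = {}
    transcripts = []
    for line in lines:
        # Skip blank lines, directives and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        chrom, _source, ftype, start, end, _score, strand, _phase, attr_text = cols
        attrs = parse_attributes(attr_text)
        if ftype == gene_type:
            genes[attrs.get("ID")] = (int(start), int(end))
        elif ftype in transcript_types:
            transcripts.append({
                "transcript_id": attrs.get("ID"),
                "gene_id": attrs.get("Parent"),  # assumption: single parent
                "chrom": chrom,
                "start": int(start),
                "end": int(end),
                "strand": strand,
            })
    rows, orphans = [], []
    for transcript in transcripts:
        gene = genes.get(transcript["gene_id"])
        if gene is None:
            orphans.append(transcript["transcript_id"])
        else:
            rows.append(dict(transcript, gene_start=gene[0], gene_end=gene[1]))
    return rows, orphans
```

Linking is done after all lines are read, so a transcript that appears before its gene in the file is still matched.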

parse_sequence_vcf

Extracts haplotype information from a VCF file for a specified region, using the genome FASTA file as the reference.

Help documentation

parse_sequence_vcf -h
usage: parse_sequence_vcf.py [-h] -v VCF -g GENOME -c CHROM -s START -e END [-o OUTPUT_DIR] [--format {tab,fasta,csv}] [--second-allele] [--no-reference]
[--min-qual MIN_QUAL] [--samples SAMPLES] [--exclude-samples EXCLUDE_SAMPLES]

从VCF文件和基因组文件中提取特定区间的序列变异信息 | Extract sequence variation information from VCF and genome files for specific regions

options:
-h, --help show this help message and exit
-v VCF, --vcf VCF VCF文件路径 | VCF file path (default: None)
-g GENOME, --genome GENOME
基因组FASTA文件路径 | Genome FASTA file path (default: None)
-c CHROM, --chrom CHROM
染色体名称 | Chromosome name (e.g., chr1, 1) (default: None)
-s START, --start START
起始位置 (1-based) | Start position (1-based) (default: None)
-e END, --end END 结束位置 (1-based, inclusive) | End position (1-based, inclusive) (default: None)
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
输出目录 | Output directory (default: ./sequence_output)
--format {tab,fasta,csv}
输出格式 | Output format (default: tab)
--second-allele 使用第二个等位基因而不是第一个 | Use second allele instead of first (default: False)
--no-reference 不包含参考序列 | Do not include reference sequence (default: False)
--min-qual MIN_QUAL 最小质量值过滤 | Minimum quality filter (default: None)
--samples SAMPLES 指定样品列表文件或逗号分隔的样品名称 | Sample list file or comma-separated sample names (default: None)
--exclude-samples EXCLUDE_SAMPLES
排除样品列表文件或逗号分隔的样品名称 | Exclude sample list file or comma-separated sample names (default: None)
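
The `-s/--start` and `-e/--end` options are 1-based and inclusive, and `--second-allele` selects which allele of each genotype is used. A rough Python sketch of both ideas is given below. It is not the tool's code: `region_sequence` and `sample_haplotype` are hypothetical helpers, only SNPs are handled, and writing `N` for a missing allele is an assumption made for this example.

```python
def region_sequence(chrom_seq, start, end):
    """Slice a 1-based, inclusive region out of a chromosome sequence."""
    return chrom_seq[start - 1:end]


def sample_haplotype(chrom_seq, start, end, snps, genotypes, second_allele=False):
    """Build one sample's sequence for the region.

    snps:      {pos: (ref_base, [alt_base, ...])}, 1-based positions, SNPs only
    genotypes: {pos: GT string such as '0/1' or '1|0'} for a single sample
    """
    bases = list(region_sequence(chrom_seq, start, end))
    for pos, (_ref, alts) in snps.items():
        if not start <= pos <= end:
            continue  # variant lies outside the requested region
        alleles = genotypes.get(pos, "./.").replace("|", "/").split("/")
        # Pick the first allele by default, or the second one on request.
        chosen = alleles[1] if second_allele and len(alleles) > 1 else alleles[0]
        if chosen == ".":
            bases[pos - start] = "N"  # assumption: mark missing calls as N
        elif chosen != "0":
            bases[pos - start] = alts[int(chosen) - 1]
    return "".join(bases)
```

With the reference `ACGTACGTAC`, region 3 to 7 is `GTACG`. A sample that is `0/1` for a T>C SNP at position 4 keeps `T` by default and gets `C` when `second_allele=True`.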

Usage example

# Basic extraction
python run_sequence_extractor.py -v variants.vcf -g genome.fa -c chr1 -s 1000 -e 1050

# Advanced usage
python run_sequence_extractor.py \
-v variants.vcf.gz \
-g genome.fa \
-c chr1 -s 1000 -e 1050 \
-o results --format fasta \
--min-qual 30 \
--samples "sample1,sample2"

Usage of biopytools
https://lixiang117423.github.io/article/biopytools-readme/
Author
李详【Xiang LI】
Published
July 15, 2025
License