Using biopytools

Installation

git clone https://github.com/lixiang117423/biopytools.git
cd biopytools
pip install -e .

pip install -e ".[dev]"
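
After installation, a quick way to confirm that the command-line tools are on PATH is to look them up from Python. This is only a sketch: the command names are the ones used as section titles in this post, and it assumes the package installs each of them as a console script.

```python
import shutil

# Command names taken from the section titles of this post (assumed to be
# installed as console scripts by `pip install -e .`).
COMMANDS = [
    "run_annovar",
    "run_fastp",
    "run_rnaseq",
    "run_vcf_extractor",
    "parse_gene_info",
    "run_plink_gwas",
    "run_kmer_analysis",
]


def missing_commands(commands=COMMANDS):
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_commands()
    print("all commands found" if not missing else f"missing: {missing}")
```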

run_annovar

Its main function is to annotate VCF files with ANNOVAR.

Help

 run_annovar -h
usage: run_annovar [-h] -g GFF3 -f GENOME -v VCF -b BUILD_VER [-a ANNOVAR_PATH] [-d DATABASE_PATH] [-o OUTPUT_DIR] [-q QUAL_THRESHOLD] [-s {1,2,3,4}]
[--skip-gff-fix] [--skip-vcf-filter] [--enable-vcf-filter]

ANNOVAR VCF注释自动化脚本 (模块化版本) | ANNOVAR VCF Annotation Automation Script (Modular Version)

options:
-h, --help show this help message and exit
-g, --gff3 GFF3 GFF3注释文件路径 | GFF3 annotation file path (default: None)
-f, --genome GENOME 基因组序列文件路径 | Genome sequence file path (default: None)
-v, --vcf VCF VCF变异文件路径 | VCF variant file path (default: None)
-b, --build-ver BUILD_VER
基因组构建版本标识符 (如: OV, KY131) - 不应包含路径分隔符 | Genome build version identifier (e.g., OV, KY131) - should not contain path separators
(default: None)
-a, --annovar-path ANNOVAR_PATH
ANNOVAR软件安装路径 | ANNOVAR software installation path (default: /share/org/YZWL/yzwl_lixg/software/annovar/annovar)
-d, --database-path DATABASE_PATH
ANNOVAR数据库路径 | ANNOVAR database path (default: ./database)
-o, --output-dir OUTPUT_DIR
输出目录 | Output directory (default: ./annovar_output)
-q, --qual-threshold QUAL_THRESHOLD
VCF质量过滤阈值 (仅在启用VCF过滤时生效) | VCF quality filtering threshold (only effective when VCF filtering is enabled) (default: 20)
-s, --step {1,2,3,4} 只运行指定步骤 | Run only specified step (1:gff3转换 | gff3 conversion, 2:提取序列 | extract sequences, 3:VCF处理 | VCF processing, 4:注释 |
annotation) (default: None)
--skip-gff-fix 跳过GFF3文件的自动修复(CDS phase等问题) | Skip automatic GFF3 file fixes (CDS phase and other issues) (default: False)
--skip-vcf-filter 跳过VCF过滤步骤,直接使用输入的VCF文件(默认启用) | Skip VCF filtering step, use input VCF file directly (enabled by default) (default: True)
--enable-vcf-filter 启用VCF过滤步骤(使用bcftools) | Enable VCF filtering step (using bcftools) (default: False)

Example

run_annovar --gff3 genome.gff --genome genome.fa --vcf final_filtered.recode.chr.vcf.gz --build-ver test --annovar-path /share/org/YZWL/yzwl_lixg/software/annovar/annovar --database-path ./ --output-dir ./
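
When the command is launched from a script, it can help to assemble the argument list in Python and check the build version first, since the help text says it must not contain path separators. The sketch below uses only flags listed in the help output; the helper name `build_annovar_command` is hypothetical and not part of biopytools.

```python
import shlex


def build_annovar_command(gff3, genome, vcf, build_ver,
                          output_dir="./annovar_output", extra=()):
    """Assemble a run_annovar argument list (hypothetical helper).

    Only flags shown in the help text are used.
    """
    # The help text states that the build version must not contain path separators.
    if any(sep in build_ver for sep in ("/", "\\")):
        raise ValueError(f"build version must not contain path separators: {build_ver!r}")
    return [
        "run_annovar",
        "--gff3", gff3,
        "--genome", genome,
        "--vcf", vcf,
        "--build-ver", build_ver,
        "--output-dir", output_dir,
        *extra,
    ]


cmd = build_annovar_command(
    "genome.gff", "genome.fa", "final_filtered.recode.chr.vcf.gz", "OV"
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```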

run_fastp

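Runs fastp to filter FASTQ files. Sample names are detected automatically from the FASTQ file names, so a whole directory can be filtered in one batch.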

Help

run_fastp -h                                                                                                                                           
usage: run_fastp [-h] -i INPUT_DIR -o OUTPUT_DIR [--fastp-path FASTP_PATH] [-t THREADS] [-q QUALITY_THRESHOLD] [-l MIN_LENGTH] [-u UNQUALIFIED_PERCENT]
[-n N_BASE_LIMIT] [--read1-suffix READ1_SUFFIX] [--read2-suffix READ2_SUFFIX]

FASTQ数据质控批处理脚本 | FASTQ Data Quality Control Batch Processing Script

options:
-h, --help show this help message and exit
-i, --input-dir INPUT_DIR
输入原始FASTQ数据目录 | Input raw FASTQ data directory (default: None)
-o, --output-dir OUTPUT_DIR
输出清洁FASTQ数据目录 | Output clean FASTQ data directory (default: None)
--fastp-path FASTP_PATH
fastp可执行文件路径 | fastp executable path (default: fastp)
-t, --threads THREADS
线程数 | Number of threads (default: 12)
-q, --quality-threshold QUALITY_THRESHOLD
质量阈值 | Quality threshold (default: 30)
-l, --min-length MIN_LENGTH
最小长度 | Minimum length (default: 50)
-u, --unqualified-percent UNQUALIFIED_PERCENT
不合格碱基百分比阈值 | Unqualified base percentage threshold (default: 40)
-n, --n-base-limit N_BASE_LIMIT
N碱基数量限制 | N base count limit (default: 10)
--read1-suffix READ1_SUFFIX
Read1文件后缀 | Read1 file suffix (default: _1.fq.gz)
--read2-suffix READ2_SUFFIX
Read2文件后缀 | Read2 file suffix (default: _2.fq.gz)

Example

run_fastp -i raw -o clean --read1-suffix _1.clean.fq.gz --read2-suffix _2.clean.fq.gz --fastp-path /home/lixiang/miniforge3/envs/RNA_Seq/bin/fastp
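
The suffix options decide how files are grouped into samples. One plausible way to pair reads by suffix is sketched below; this illustrates the idea and is not necessarily how run_fastp does it internally. The function name `pair_samples` is hypothetical.

```python
def pair_samples(file_names, read1_suffix="_1.fq.gz", read2_suffix="_2.fq.gz"):
    """Group paired-end files by sample name, using the read1/read2 suffixes.

    Returns {sample_name: (read1_file, read2_file)}; files without a mate are skipped.
    """
    names = set(file_names)
    pairs = {}
    for name in sorted(names):
        if name.endswith(read1_suffix):
            # Everything before the read1 suffix is treated as the sample name.
            sample = name[: -len(read1_suffix)]
            mate = sample + read2_suffix
            if mate in names:
                pairs[sample] = (name, mate)
    return pairs


files = ["A_1.clean.fq.gz", "A_2.clean.fq.gz", "B_1.clean.fq.gz", "notes.txt"]
print(pair_samples(files, "_1.clean.fq.gz", "_2.clean.fq.gz"))
```

With the suffixes from the example command, only sample A is paired here, because B has no read2 file.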

run_rnaseq

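Runs the HISAT2 + StringTie pipeline automatically and outputs TPM and FPKM expression matrices. It does not include a FASTQ filtering step, so the FASTQ files need to be filtered first. The program can create the index automatically.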

Help

run_rnaseq -h
usage: run_rnaseq [-h] -g GENOME -f GTF -i INPUT -o OUTPUT [-p PATTERN] [-r {yes,y,no,n}] [-t THREADS]

RNA-seq分析流程:HISAT2 + StringTie (模块化版本) | RNA-seq analysis pipeline: HISAT2 + StringTie (Modular Version)

options:
-h, --help show this help message and exit
-g, --genome GENOME 基因组fasta文件路径 | Genome fasta file path (default: None)
-f, --gtf GTF 基因注释GTF文件路径 | Gene annotation GTF file path (default: None)
-i, --input INPUT 输入fastq文件目录或样本信息文件 | Input fastq file directory or sample information file (default: None)
-o, --output OUTPUT 输出目录 | Output directory (default: None)
-p, --pattern PATTERN
Fastq文件命名模式,例如 "*.R1.fastq.gz""*_1.fq.gz",*代表样本名 | Fastq file naming pattern, e.g., "*.R1.fastq.gz" or "*_1.fq.gz", *
represents sample name (default: None)
-r, --remove {yes,y,no,n}
处理后删除BAM文件 | Remove BAM files after processing (default: no)
-t, --threads THREADS
线程数 | Number of threads (default: 8)

Example

run_rnaseq -g genome/T2T70-15_chr.fa -f genome/T2T70-15_chr.gtf -i clean -o output -p "*_1.clean.fq.gz"       
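
In the `-p` pattern the `*` stands for the sample name. The sketch below shows how such a pattern could be turned into a sample name for a given file; it assumes a single `*` in the pattern, and the helper `sample_name` is hypothetical rather than taken from biopytools.

```python
import re


def sample_name(file_name, pattern):
    """Return the text matched by '*' in the pattern, or None if there is no match."""
    prefix, star, suffix = pattern.partition("*")
    if not star:
        raise ValueError("pattern must contain '*'")
    # Escape the literal parts so that '.' in the suffix is not a regex wildcard.
    regex = re.escape(prefix) + "(.+)" + re.escape(suffix)
    match = re.fullmatch(regex, file_name)
    return match.group(1) if match else None


print(sample_name("leaf_rep1_1.clean.fq.gz", "*_1.clean.fq.gz"))
print(sample_name("leaf_rep1_2.clean.fq.gz", "*_1.clean.fq.gz"))
```

The first call yields the sample name and the second yields `None`, because the read2 file does not match the read1 pattern.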

run_vcf_extractor

Extracts genotype information from a VCF file.

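Installing cyvcf2 is recommended, but it cannot be installed under Python 3.13. When creating the environment, choose a Python version that cyvcf2 supports.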
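
Whether cyvcf2 is available in the current environment can be checked without importing it. This is a minimal sketch of such a check; the function name and the "native" label are illustrative assumptions, not biopytools API.

```python
import importlib.util


def choose_vcf_backend():
    """Return 'cyvcf2' when the package is importable, otherwise 'native'."""
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec("cyvcf2") is not None:
        return "cyvcf2"
    return "native"


print(choose_vcf_backend())
```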

Help

run_vcf_extractor -h
usage: run_vcf_extractor [-h] [-o OUTPUT] [-s SAMPLES] [--biallelic-only] [-e {yes,y,no,n}] [-t {txt,csv,excel}] [--output-dir OUTPUT_DIR] vcf_file

VCF基因型提取工具 | VCF Genotype Extraction Tool

positional arguments:
vcf_file VCF文件路径(支持.gz压缩格式) | VCF file path (supports .gz compressed format)

options:
-h, --help show this help message and exit
-o, --output OUTPUT 输出文件前缀 | Output file prefix (default: vcf_genotype)
-s, --samples SAMPLES
样本选择:all(所有样本)或逗号分隔的样本名称 | Sample selection: all (all samples) or comma-separated sample names (default: all)
--biallelic-only 只保留双等位位点 | Keep only biallelic sites (default: False)
-e, --each {yes,y,no,n}
按染色体拆分输出文件:yes/y(是)或no/n(否) | Split output files by chromosome: yes/y or no/n (default: n)
-t, --output-type {txt,csv,excel}
输出文件格式 | Output file format (default: txt)
--output-dir OUTPUT_DIR
输出目录 | Output directory (default: ./)

Usage example

run_vcf_extractor final_filtered.recode.chr.vcf.gz -o OV_snp -e y -t csv 
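
To make "extracting genotypes" concrete, here is a small pure-Python sketch that pulls the GT value for each sample out of one VCF data line. It follows the standard VCF column layout (FORMAT in column 9, samples from column 10) and is not the tool's actual implementation. The function name is hypothetical.

```python
def genotypes_from_line(line, sample_names):
    """Extract (chrom, pos, ref, alt, {sample: GT}) from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    # Column 9 lists the per-sample keys; GT is looked up by position.
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")
    calls = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        parts = sample_field.split(":")
        # Trailing per-sample fields may be dropped, so guard the index.
        calls[name] = parts[gt_index] if gt_index < len(parts) else "./."
    return chrom, int(pos), ref, alt, calls


example = "OV01\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t1/1:8"
print(genotypes_from_line(example, ["s1", "s2"]))
```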

Log output:

2025-07-16 10:01:38,146 - INFO - 依赖检查 | Dependency check:
2025-07-16 10:01:38,147 - INFO - cyvcf2: 不可用 | Not available
2025-07-16 10:01:38,147 - INFO - pandas: 可用 | Available
2025-07-16 10:01:38,147 - INFO - 使用原生Python解析VCF文件 | Using native Python for VCF parsing
2025-07-16 10:01:38,148 - INFO - 开始VCF基因型提取 | Starting VCF genotype extraction
2025-07-16 10:01:38,148 - INFO - 输入文件 | Input file: /mnt/f/project/04.诸葛菜/genotype/final_filtered.recode.chr.vcf.gz
2025-07-16 10:01:38,148 - INFO - 输出前缀 | Output prefix: OV_snp
2025-07-16 10:01:38,149 - INFO - 输出格式 | Output format: csv
2025-07-16 10:01:38,159 - INFO - 目标样本数 | Number of target samples: 100
2025-07-16 10:01:39,282 - INFO - 已处理 | Processed 10000 variants
2025-07-16 10:01:40,303 - INFO - 已处理 | Processed 20000 variants
2025-07-16 10:01:41,251 - INFO - 已处理 | Processed 30000 variants
2025-07-16 10:01:42,229 - INFO - 已处理 | Processed 40000 variants
2025-07-16 10:01:43,171 - INFO - 已处理 | Processed 50000 variants
2025-07-16 10:01:44,109 - INFO - 已处理 | Processed 60000 variants
2025-07-16 10:01:44,990 - INFO - 已处理 | Processed 70000 variants
2025-07-16 10:01:45,863 - INFO - 已处理 | Processed 80000 variants
2025-07-16 10:01:46,727 - INFO - 已处理 | Processed 90000 variants

parse_gene_info

Extracts gene and chromosome information from a GFF file.

Help

parse_gene_info -h
usage: parse_gene_info [-h] --gff3 GFF3 --output OUTPUT [--gene-type GENE_TYPE] [--transcript-types TRANSCRIPT_TYPES [TRANSCRIPT_TYPES ...]]

从GFF3文件中为每个转录本提取整合的基因和转录本信息 | Extract integrated gene and transcript information for each transcript from GFF3 files

options:
-h, --help show this help message and exit
--gff3, -g GFF3 输入的GFF3文件路径 | Input GFF3 file path (default: None)
--output, -o OUTPUT 输出的TSV文件路径 | Output TSV file path (default: None)
--gene-type GENE_TYPE
基因特征类型 | Gene feature type (default: gene)
--transcript-types TRANSCRIPT_TYPES [TRANSCRIPT_TYPES ...]
转录本特征类型列表 | Transcript feature types list (default: ['mRNA', 'transcript'])

示例 | Example: parse_gene_info -g input.gff3 -o gene_transcript_info.tsv

Usage example

parse_gene_info -g genome.gff -o gene_info.txt
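
The general idea of linking each transcript to its gene can be sketched in a few lines: collect the gene features first, then resolve each transcript's `Parent` attribute. This is an assumption about the approach based on the help text (gene type, transcript types), not the tool's code. The function names and output fields are illustrative.

```python
def parse_attributes(column9):
    """Parse the GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcripts_with_genes(lines, gene_type="gene",
                           transcript_types=("mRNA", "transcript")):
    """Return one record per transcript, linked to its parent gene."""
    records = [
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")
    ]
    records = [cols for cols in records if len(cols) == 9]
    # First pass: remember every gene ID.
    genes = {parse_attributes(c[8]).get("ID") for c in records if c[2] == gene_type}
    rows = []
    # Second pass: resolve each transcript's Parent attribute.
    for cols in records:
        if cols[2] in transcript_types:
            attrs = parse_attributes(cols[8])
            parent = attrs.get("Parent")
            rows.append({
                "transcript": attrs.get("ID"),
                "gene": parent,
                "chrom": cols[0],
                "strand": cols[6],
                "orphan": parent not in genes,  # no matching gene feature
            })
    return rows


demo = [
    "##gff-version 3",
    "OV01\t.\tgene\t100\t900\t.\t+\t.\tID=g1;Name=demo",
    "OV01\t.\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "OV02\t.\tmRNA\t5\t50\t.\t-\t.\tID=x.t1;Parent=missing",
]
for row in transcripts_with_genes(demo):
    print(row)
```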

Log output:

2025-07-16 14:10:44,332 - INFO - 开始GFF3基因转录本提取分析 | Starting GFF3 gene transcript extraction analysis
2025-07-16 14:10:44,333 - INFO - 输入文件 | Input file: /mnt/f/biopytools_test/geneinfo/genome.gff
2025-07-16 14:10:44,333 - INFO - 输出文件 | Output file: /mnt/f/biopytools_test/geneinfo/gene_info.txt
2025-07-16 14:10:44,333 - INFO - 开始提取基因和转录本信息 | Starting to extract gene and transcript information
2025-07-16 14:10:44,334 - INFO - 输入文件 | Input file: /mnt/f/biopytools_test/geneinfo/genome.gff
2025-07-16 14:10:44,334 - INFO - 基因类型 | Gene type: gene
2025-07-16 14:10:44,334 - INFO - 转录本类型 | Transcript types: transcript, mRNA
2025-07-16 14:10:44,334 - INFO - 开始收集基因信息 | Starting to collect gene information
2025-07-16 14:10:47,572 - INFO - 收集完成,共发现 51840 个基因 | Collection completed, found 51840 genes
2025-07-16 14:10:47,572 - INFO - 开始处理转录本信息 | Starting to process transcript information
2025-07-16 14:10:49,882 - INFO - 处理完成,共处理 51840 个转录本 | Processing completed, processed 51840 transcripts
2025-07-16 14:10:49,882 - INFO - 提取完成 | Extraction completed
2025-07-16 14:10:49,886 - INFO - 写入结果文件 | Writing results file: /mnt/f/biopytools_test/geneinfo/gene_info.txt
2025-07-16 14:10:50,270 - INFO - 结果已保存 | Results saved: /mnt/f/biopytools_test/geneinfo/gene_info.txt
2025-07-16 14:10:50,280 - INFO - 总结报告已生成 | Summary report generated: /mnt/f/biopytools_test/geneinfo/gff_extraction_summary.txt

基因转录本提取统计摘要 | Gene Transcript Extraction Summary:
============================================================
总转录本数 | Total transcripts: 51840
涉及基因数 | Genes involved: 51840
染色体数 | Chromosomes: 147
链方向 | Strands: +, -
孤儿转录本数 | Orphan transcripts: 0
染色体列表 | Chromosome list: OV01, OV02, OV03, OV04, OV05, OV06, OV07, OV08, OV09, OV10, OV11, OV12, ctg000030, ctg000040, ctg000070, ctg000080, ctg000090, ctg000100, ctg000160, ctg000180, ctg000190, ctg000210, ctg000220, ctg000240, ctg000260, ctg000270, ctg000310, ctg000330, ctg000360, ctg000400, ctg000410, ctg000420, ctg000430, ctg000440, ctg000450, ctg000470, ctg000510, ctg000520, ctg000530, ctg000550, ctg000590, ctg000610, ctg000640, ctg000650, ctg000680, ctg000720, ctg000740, ctg000760, ctg000850, ctg000890, ctg000910, ctg000920, ctg000930, ctg000940, ctg000960, ctg000980, ctg000990, ctg001060, ctg001070, ctg001100, ctg001110, ctg001120, ctg001140, ctg001150, ctg001200, ctg001210, ctg001220, ctg001250, ctg001270, ctg001300, ctg001320, ctg001330, ctg001350, ctg001390, ctg001400, ctg001440, ctg001450, ctg001470, ctg001510, ctg001560, ctg001750, ctg001770, ctg001810, ctg001820, ctg001830, ctg001860, ctg001870, ctg001880, ctg001900, ctg001910, ctg001930, ctg001940, ctg001950, ctg001960, ctg001970, ctg002000, ctg002010, ctg002020, ctg002030, ctg002080, ctg002090, ctg002110, ctg002130, ctg002180, ctg002210, ctg002230, ctg002240, ctg002260, ctg002310, ctg002350, ctg002370, ctg002380, ctg002390, ctg002400, ctg002410, ctg002420, ctg002480, ctg002530, ctg002580, ctg002590, ctg002630, ctg002640, ctg002650, ctg002670, ctg002680, ctg002760, ctg002780, ctg002840, ctg002870, ctg002900, ctg003030, ctg003110, ctg003180, ctg003220, ctg003230, ctg003240, ctg003250, ctg003260, ctg003270, ctg003330, ctg003340, ctg003350, ctg003380, ctg003390, ctg003400, ctg003410, ctg003490
============================================================
2025-07-16 14:10:50,289 - INFO - 提取完成 | Extraction completed successfully

run_plink_gwas

Runs a GWAS analysis with PLINK.

Help

run_plink_gwas -h
usage: run_plink_gwas [-h] -v VCF_FILE -p PHENOTYPE_FILE [-t {qualitative,quantitative}] [-o OUTPUT_DIR] [--mind MIND] [--geno GENO] [--maf MAF]
[--hwe HWE] [--ld-window-size LD_WINDOW_SIZE] [--ld-step-size LD_STEP_SIZE] [--ld-r2-threshold LD_R2_THRESHOLD]
[--pca-components PCA_COMPONENTS] [--pca-use PCA_USE] [--correction-method {bonferroni,suggestive,fdr,all}]
[--bonferroni-alpha BONFERRONI_ALPHA] [--suggestive-threshold SUGGESTIVE_THRESHOLD] [--fdr-alpha FDR_ALPHA] [--threads THREADS]

完整的PLINK GWAS分析流程 (模块化版本) - 支持质量性状和数量性状,多种显著性校正方法 | Complete PLINK GWAS Analysis Pipeline (Modular Version) - Supporting both qualitative and quantitative
traits, multiple significance correction methods

options:
-h, --help show this help message and exit
-v, --vcf-file VCF_FILE
输入VCF文件路径(支持.gz压缩) | Input VCF file path (supports .gz compression) (default: None)
-p, --phenotype-file PHENOTYPE_FILE
表型文件路径(样本ID和表型值,以空格或制表符分隔) | Phenotype file path (sample ID and phenotype value, space or tab separated) (default: None)
-t, --trait-type {qualitative,quantitative}
表型类型 | Trait type: 'qualitative' for binary traits (0/1 -> 1/2), 'quantitative' for continuous traits (keep original values)
(default: qualitative)
-o, --output-dir OUTPUT_DIR
输出目录 | Output directory (default: plink_results)
--mind MIND 个体缺失率阈值(移除缺失率大于此值的个体) | Individual missing rate threshold (default: 0.05)
--geno GENO SNP缺失率阈值(移除缺失率大于此值的SNP) | SNP missing rate threshold (default: 0.05)
--maf MAF 最小等位基因频率阈值 | Minor allele frequency threshold (default: 0.01)
--hwe HWE Hardy-Weinberg平衡检验P值阈值 | Hardy-Weinberg equilibrium p-value threshold (default: 1e-06)
--ld-window-size LD_WINDOW_SIZE
LD剪枝窗口大小(kb) | LD pruning window size (kb) (default: 50)
--ld-step-size LD_STEP_SIZE
LD剪枝步长(SNP数) | LD pruning step size (number of SNPs) (default: 5)
--ld-r2-threshold LD_R2_THRESHOLD
LD剪枝r²阈值 | LD pruning r² threshold (default: 0.2)
--pca-components PCA_COMPONENTS
计算的主成分数量 | Number of principal components to compute (default: 10)
--pca-use PCA_USE 关联分析中使用的主成分数量 | Number of PCs to use in association analysis (default: 5)
--correction-method {bonferroni,suggestive,fdr,all}
显著性校正方法 | Significance correction method: 'bonferroni' for Bonferroni correction, 'suggestive' for suggestive threshold, 'fdr' for
false discovery rate, 'all' for all methods (default: all)
--bonferroni-alpha BONFERRONI_ALPHA
Bonferroni校正的alpha水平 | Alpha level for Bonferroni correction (default: 0.05)
--suggestive-threshold SUGGESTIVE_THRESHOLD
提示性关联阈值 | Suggestive association threshold (default: 1e-05)
--fdr-alpha FDR_ALPHA
FDR校正的q值阈值 | q-value threshold for FDR correction (default: 0.05)
--threads THREADS 使用的线程数 | Number of threads to use (default: 1)

Usage example

run_plink_gwas -v final_filtered.recode.chr.vcf.gz -p phe.txt -o ./
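
According to the help text, qualitative phenotypes coded 0/1 are converted to PLINK's 1/2 coding, while quantitative values are kept as they are. A minimal sketch of that recoding is shown below. The function name is hypothetical, and rejecting other values with an error is an assumption.

```python
def recode_phenotype(values, trait_type="qualitative"):
    """Recode phenotype values as described in the help text.

    qualitative:  0 -> 1 (control), 1 -> 2 (case)
    quantitative: values are kept unchanged (as floats)
    """
    if trait_type == "quantitative":
        return [float(v) for v in values]
    mapping = {0: 1, 1: 2}
    recoded = []
    for v in values:
        key = int(v)
        if key not in mapping:
            # Assumption: anything other than 0/1 is treated as invalid input.
            raise ValueError(f"qualitative phenotype must be 0 or 1, got {v!r}")
        recoded.append(mapping[key])
    return recoded


print(recode_phenotype([0, 1, 1, 0]))
print(recode_phenotype(["3.2", "4.8"], trait_type="quantitative"))
```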

Log output:

2025-07-16 15:10:34,967 - INFO - 开始PLINK GWAS分析流程 | Starting PLINK GWAS analysis pipeline...
2025-07-16 15:10:34,967 - INFO - 表型类型 | Trait type: qualitative
2025-07-16 15:10:34,968 - INFO - 显著性校正方法 | Correction method: all
2025-07-16 15:10:34,970 - INFO - ==================================================
2025-07-16 15:10:34,973 - INFO - 步骤1: 检查输入文件 | Step 1: Checking input files
2025-07-16 15:10:34,973 - INFO - 复制输入文件到工作目录 | Copying input files to working directory...
2025-07-16 15:12:09,104 - INFO - VCF文件已复制 | VCF file copied: input.vcf.gz
2025-07-16 15:12:09,135 - INFO - 表型文件已复制 | Phenotype file copied: phenotype.txt
2025-07-16 15:12:09,137 - INFO - 步骤2: 转换表型文件 | Step 2: Converting phenotype file
2025-07-16 15:12:09,138 - INFO - 转换表型文件格式 | Converting phenotype file format...
2025-07-16 15:12:09,141 - INFO - 表型类型 | Trait type: qualitative
2025-07-16 15:12:09,157 - INFO - 原始表型文件形状 | Original phenotype file shape: (100, 2)
2025-07-16 15:12:09,162 - INFO - 原始表型分布 | Original phenotype distribution: {1: np.int64(50), 0: np.int64(50)}
2025-07-16 15:12:09,165 - INFO - 处理质量性状:将0转换为1(对照),1转换为2(病例) | Processing qualitative trait: converting 0 to 1 (control), 1 to 2 (case)
2025-07-16 15:12:09,166 - INFO - 转换后表型分布 | Converted phenotype distribution: {2: np.int64(50), 1: np.int64(50)}
2025-07-16 15:12:09,168 - INFO - 对照数 (抗病, 原值0) | Controls (resistant, original 0): 50
2025-07-16 15:12:09,168 - INFO - 病例数 (感病, 原值1) | Cases (susceptible, original 1): 50
2025-07-16 15:12:09,180 - INFO - 表型文件转换完成 | Phenotype file conversion completed
2025-07-16 15:12:09,181 - INFO - 步骤3: 转换VCF文件 | Step 3: Converting VCF file
2025-07-16 15:12:09,181 - INFO - 转换VCF文件为PLINK格式 | Converting VCF file to PLINK format...
2025-07-16 15:12:09,181 - INFO - 执行步骤 | Executing step: VCF转换为PLINK格式 | Converting VCF to PLINK format
2025-07-16 15:12:09,182 - INFO - 命令 | Command: plink --vcf input.vcf.gz --make-bed --out raw_data --allow-extra-chr --set-missing-var-ids @:# --keep-allele-order
......
2025-07-16 15:40:34,137 - INFO - PLINK GWAS分析流程完成! | PLINK GWAS analysis pipeline completed!
2025-07-16 15:40:34,138 - INFO - 表型类型 | Trait type: qualitative
2025-07-16 15:40:34,139 - INFO - 显著性校正方法 | Correction method: all
2025-07-16 15:40:34,139 - INFO - 所有结果保存在 | All results saved in: /mnt/f/biopytools_test/gwas
2025-07-16 15:40:34,141 - INFO - 主要输出文件 | Main output files:
2025-07-16 15:40:34,143 - INFO - - analysis_report.txt: 分析报告 | Analysis report
2025-07-16 15:40:34,144 - INFO - - gwas_results_ADD.txt: 主要关联结果 | Main association results
2025-07-16 15:40:34,144 - INFO - - significant_bonferroni.txt: Bonferroni校正显著位点 | Bonferroni significant loci
2025-07-16 15:40:34,144 - INFO - - bonferroni_info.txt: Bonferroni校正信息 | Bonferroni correction information
2025-07-16 15:40:34,144 - INFO - - significant_suggestive.txt: 提示性关联位点 | Suggestive association loci
2025-07-16 15:40:34,145 - INFO - - suggestive_info.txt: 提示性关联信息 | Suggestive association information
2025-07-16 15:40:34,145 - INFO - - significant_fdr.txt: FDR校正显著位点 | FDR significant loci
2025-07-16 15:40:34,146 - INFO - - fdr_info.txt: FDR校正信息 | FDR correction information
2025-07-16 15:40:34,147 - INFO - - manhattan_plot.png: Manhattan图 | Manhattan plot
2025-07-16 15:40:34,148 - INFO - - qq_plot.png: QQ图 | QQ plot
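
For reference, the correction methods named in the options can be illustrated as follows. The Bonferroni threshold is alpha divided by the number of tests. For FDR, the sketch assumes the Benjamini-Hochberg procedure, since the post does not say which FDR method the tool uses. Both function names are illustrative.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """P-value cutoff after Bonferroni correction: alpha / number of tests."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of tests declared significant by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its BH critical value.
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


p = [0.001, 0.008, 0.039, 0.041, 0.9]
print(bonferroni_threshold(len(p)))
print(benjamini_hochberg(p))
```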

run_kmer_analysis

Builds a k-mer presence/absence matrix for specific sequences from FASTQ files.

Help

run_kmer_analysis -h
usage: run_kmer_analysis [-h] -g GENE_FASTA -f FASTQ_DIR -o OUTPUT_DIR [-k KMER_SIZE] [-t THREADS] [-m HARD_MIN] [-p PROJECT_NAME] [--skip-build]
[--run-haplotype] [-n N_COMPONENTS]

高性能k-mer数据库查询流水线 | High-Performance K-mer Database Query Pipeline
基于kmtricks + RocksDB的大规模k-mer分析系统 | Large-scale k-mer analysis system based on kmtricks + RocksDB

专为大规模数据集设计 (支持数千个样本) | Designed for large-scale datasets (supporting thousands of samples):
1. 一次构建全基因组k-mer数据库 | Build genome-wide k-mer database once
2. 支持快速查询任意基因的k-mer模式 | Support fast queries of k-mer patterns for any genes
3. 查询速度: 秒级到分钟级 (vs 小时级的暴力搜索) | Query speed: seconds to minutes (vs hours of brute force)

适用场景 | Use Cases:
- 大规模群体基因组学研究 | Large-scale population genomics studies
- 需要重复查询不同基因的场景 | Scenarios requiring repeated queries of different genes
- 对查询速度有高要求的项目 | Projects with high query speed requirements

性能对比 (5000个样本) | Performance Comparison (5000 samples):
- 传统方法 | Traditional method: 每次查询数周 | weeks per query
- 本方法 | This method: 构建一次(1-2天) + 查询(1-5分钟) | build once (1-2 days) + query (1-5 minutes)


options:
-h, --help show this help message and exit

必需参数 | Required Arguments:
-g GENE_FASTA, --gene-fasta GENE_FASTA
目标基因FASTA文件路径 | Target gene FASTA file path
-f FASTQ_DIR, --fastq-dir FASTQ_DIR
FASTQ文件目录路径 | FASTQ file directory path
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
输出目录路径 | Output directory path

k-mer分析参数 | K-mer Analysis Parameters:
-k KMER_SIZE, --kmer-size KMER_SIZE
k-mer大小 (默认: 51) | k-mer size (default: 51)
-t THREADS, --threads THREADS
线程数 (默认: 32,建议使用较多线程) | Thread count (default: 32, recommend using more threads)
-m HARD_MIN, --hard-min HARD_MIN
最小k-mer频次阈值 (默认: 2) | Minimum k-mer frequency threshold (default: 2)

流程控制参数 | Process Control Parameters:
-p PROJECT_NAME, --project-name PROJECT_NAME
项目名称 (默认: 从输出目录名获取) | Project name (default: derived from output directory name)
--skip-build 跳过数据库构建步骤 (用于已有数据库的查询) | Skip database build step (for querying existing database)
--run-haplotype 运行单倍型聚类分析 | Run haplotype clustering analysis
-n N_COMPONENTS, --n-components N_COMPONENTS
BGMM最大聚类数 (默认: 5) | Maximum number of BGMM clusters (default: 5)

Usage example

run_kmer_analysis -g gene/gene.fa -f fastq -o output
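
What a presence/absence matrix means can be shown with a brute-force toy version. This is exactly the slow approach that the kmtricks-based pipeline is meant to replace, and it ignores reverse complements and the minimum-count threshold, so treat it purely as an illustration. The function names are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def presence_matrix(gene_seq, sample_reads, k):
    """Build {sample: [0/1, ...]} over the sorted k-mers of the gene."""
    gene_kmers = sorted(kmers(gene_seq, k))
    matrix = {}
    for sample, reads in sample_reads.items():
        seen = set()
        for read in reads:
            seen |= kmers(read, k)
        # 1 if the gene k-mer occurs in any read of this sample, else 0.
        matrix[sample] = [1 if kmer in seen else 0 for kmer in gene_kmers]
    return gene_kmers, matrix


gene_kmers, matrix = presence_matrix(
    "ACGTA", {"s1": ["ACGT"], "s2": ["GTAC"]}, k=3
)
print(gene_kmers)
print(matrix)
```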

Log output:

2025-07-16 17:17:51,743 - INFO - 开始高性能k-mer数据库分析流水线 | Starting high-performance k-mer database analysis pipeline...
2025-07-16 17:17:51,744 - INFO -
============================================================
2025-07-16 17:17:51,744 - INFO - 步骤1: 创建FOF文件 | Step 1: Create FOF file
2025-07-16 17:17:51,744 - INFO - ============================================================
2025-07-16 17:17:51,744 - INFO - 创建FOF文件 | Creating FOF file...
2025-07-16 17:17:51,746 - INFO - 自动检测FASTQ文件 | Auto detecting FASTQ files...
2025-07-16 17:17:51,787 - INFO - 找到8个FASTQ文件 | Found 8 FASTQ files
2025-07-16 17:17:51,787 - INFO - 检测到4个样本 | Detected 4 samples
2025-07-16 17:17:51,805 - INFO - FOF文件已创建 | FOF file created: output/samples.fof
2025-07-16 17:17:51,807 - INFO - 包含 4 个样本 | Contains 4 samples
2025-07-16 17:17:51,808 - INFO -
============================================================
2025-07-16 17:17:51,808 - INFO - 步骤2: 构建kmtricks数据库 | Step 2: Build kmtricks database
2025-07-16 17:17:51,810 - INFO - ============================================================
2025-07-16 17:17:51,810 - INFO - 开始运行kmtricks k-mer构建流水线 | Starting kmtricks k-mer construction pipeline...
2025-07-16 17:17:51,811 - INFO - 警告: 这一步可能需要数小时到一天时间,但只需要执行一次 | Warning: This step may take hours to a day, but only needs to be run once
2025-07-16 17:17:51,835 - INFO - FOF文件路径 | FOF file path: /mnt/f/biopytools_test/kmer/output/samples.fof
2025-07-16 17:17:51,836 - INFO - kmtricks运行目录 | kmtricks run directory: /mnt/f/biopytools_test/kmer/output/output.k51
2025-07-16 17:17:51,837 - INFO - 这可能需要很长时间,请耐心等待 | This may take a long time, please be patient...
2025-07-16 17:17:51,840 - INFO - 执行步骤 | Executing step: kmtricks k-mer构建 | kmtricks k-mer construction
2025-07-16 17:17:51,842 - INFO - 命令 | Command: kmtricks pipeline -t 32 --file /mnt/f/biopytools_test/kmer/output/samples.fof --run-dir /mnt/f/biopytools_test/kmer/output/output.k51 --mode kmer:pa:bin --hard-min 2 --kmer-size 51 --cpr
2025-07-16 17:17:51,856 - INFO - ✓ FOF文件存在 | FOF file exists: /mnt/f/biopytools_test/kmer/output/samples.fof
2025-07-16 17:17:51,864 - INFO - 输出 | Output: [2025-07-16 17:17:51.864] [info] Run with Kmer<64> - __uint128_t implementation
2025-07-16 17:17:51,925 - INFO - 输出 | Output: [2025-07-16 17:17:51.925] [info] Compute configuration...
2025-07-16 17:17:51,927 - INFO - 输出 | Output: [2025-07-16 17:17:51.925] [info] 4 samples found (8 read files).
2025-07-16 17:17:53,371 - INFO - 输出 | Output: [2025-07-16 17:17:53.371] [info] Use 46 partitions.
2025-07-16 17:17:53,728 - INFO - 输出 | Output: [2025-07-16 17:17:53.728] [info] Compute minimizer repartition...
2025-07-16 17:18:02,168 - INFO - 输出 | Output:
2025-07-16 17:18:02,169 - INFO - 输出 | Output: Compute SuperK [> ] [00:00s]
2025-07-16 17:18:02,169 - INFO - 输出 | Output: Compute SuperK [> ] [00m:00s]
2025-07-16 17:18:02,169 - INFO - 输出 | Output:
......


Using biopytools
https://lixiang117423.github.io/article/biopytools-readme/
Author
李详 (Xiang LI)
Published
July 15, 2025
License