数据准备和质控
1. 数据质量评估
1 2 3 4 5 6 7 8 9 10 11
| fastqc short_reads_R1.fastq.gz short_reads_R2.fastq.gz -o qc_reports/
NanoPlot --fastq nanopore_reads.fastq.gz --outdir nanopore_qc/
fastqc hifi_reads.fastq.gz -o qc_reports/
fastqc hic_R1.fastq.gz hic_R2.fastq.gz -o qc_reports/
|
2. 数据预处理
1 2 3 4 5 6 7 8 9 10 11 12 13
| fastp -i short_reads_R1.fastq.gz -I short_reads_R2.fastq.gz \ -o cleaned_R1.fastq.gz -O cleaned_R2.fastq.gz \ --thread 16 --html fastp_report.html
filtlong --min_length 1000 --keep_percent 90 \ nanopore_reads.fastq.gz > filtered_nanopore.fastq.gz
fastp -i hic_R1.fastq.gz -I hic_R2.fastq.gz \ -o hic_clean_R1.fastq.gz -O hic_clean_R2.fastq.gz
|
基因组组装流程
3. 初步组装(使用长读长数据)
使用Flye进行Nanopore组装
1 2 3 4 5
| flye --nano-raw filtered_nanopore.fastq.gz \ --out-dir flye_assembly \ --threads 32 \ --genome-size 500m
|
使用Hifiasm进行Hi-Fi组装
1 2 3 4 5
| hifiasm -o hifi_assembly.asm -t 32 hifi_reads.fastq.gz
awk '/^S/{print ">"$2"\n"$3}' hifi_assembly.asm.bp.p_ctg.gfa > hifi_primary.fasta
|
4. 组装结果比较和选择
1 2 3 4 5 6 7 8 9 10
| quast.py flye_assembly/assembly.fasta hifi_primary.fasta \ -o assembly_comparison \ --threads 16
busco -i flye_assembly/assembly.fasta -l eukaryota_odb10 \ -o flye_busco -m genome busco -i hifi_primary.fasta -l eukaryota_odb10 \ -o hifi_busco -m genome
|
5. 组装优化和纠错
使用短读长数据进行纠错
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
| ASSEMBLY="hifi_primary.fasta"
bwa index $ASSEMBLY
bwa mem -t 16 $ASSEMBLY cleaned_R1.fastq.gz cleaned_R2.fastq.gz | \ samtools sort -@ 16 -o aligned_short.bam
samtools index aligned_short.bam
pilon --genome $ASSEMBLY --frags aligned_short.bam \ --output polished_assembly --threads 16
|
使用长读长数据进一步纠错
1 2 3 4 5 6 7 8
| minimap2 -ax map-ont polished_assembly.fasta filtered_nanopore.fastq.gz | \ samtools sort -@ 16 -o nanopore_aligned.bam
samtools index nanopore_aligned.bam
racon filtered_nanopore.fastq.gz nanopore_aligned.bam \ polished_assembly.fasta > racon_polished.fasta
|
6. Hi-C辅助的染色体级别组装
1 2 3 4 5 6 7 8 9 10 11 12
| POLISHED_ASSEMBLY="racon_polished.fasta"
bwa index $POLISHED_ASSEMBLY
bwa mem -5SP -t 16 $POLISHED_ASSEMBLY hic_clean_R1.fastq.gz hic_clean_R2.fastq.gz | \ samtools view -Sb - > hic_aligned.bam
run-asm-pipeline.sh $POLISHED_ASSEMBLY hic_aligned.bam
|
基因组注释流程
7. 重复序列注释
1 2 3 4 5 6 7 8
| BuildDatabase -name genome_db final_assembly.fasta
RepeatModeler -database genome_db -pa 16 -LTRStruct
RepeatMasker -pa 16 -lib consensi.fa.classified \ -gff -dir repeat_annotation final_assembly.fasta
|
8. 基因结构预测
Ab initio基因预测
1 2 3 4 5 6
| augustus --species=generic --gff3=on \ final_assembly.fasta > augustus_genes.gff3
gmes_petap.pl --ES --sequence final_assembly.fasta
|
基于同源性的基因预测
1 2 3 4 5 6
|
GeMoMa -Xmx100g GeMoMaPipeline threads=16 \ t=final_assembly.fasta \ i=species annotation=reference_proteins.fasta \ outdir=gemoma_results
|
基于转录组的基因预测
1 2 3 4 5 6 7 8 9 10 11 12 13 14
|
hisat2-build final_assembly.fasta genome_index
hisat2 -x genome_index -1 rnaseq_R1.fastq.gz -2 rnaseq_R2.fastq.gz \ -S rnaseq_aligned.sam --threads 16
samtools sort -@ 16 rnaseq_aligned.sam -o rnaseq_aligned.bam
stringtie rnaseq_aligned.bam -o transcripts.gtf -p 16
gffread transcripts.gtf -o transcripts.gff3
|
9. 基因预测结果整合
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
|
cat > weights.txt << EOF ABINITIO_PREDICTION augustus 1 ABINITIO_PREDICTION genemark 1 PROTEIN gemoma 5 TRANSCRIPT stringtie 10 EOF
EVidenceModeler --genome final_assembly.fasta \ --weights weights.txt \ --gene_predictions augustus_genes.gff3 \ --gene_predictions genemark.gtf \ --protein_alignments gemoma_results/final_annotation.gff \ --transcript_alignments transcripts.gff3 \ --segmentSize 100000 --overlapSize 10000 \ --CPU 16 > evm_consensus.gff3
|
10. 功能注释
蛋白质序列提取
1 2
| gffread -g final_assembly.fasta -y proteins.fasta evm_consensus.gff3
|
数据库比对注释
1 2 3 4 5 6 7 8 9 10 11 12 13 14
| makeblastdb -in nr.fa -dbtype prot -out nr_db
blastp -query proteins.fasta -db nr_db -outfmt 6 \ -evalue 1e-5 -num_threads 16 -out blast_nr.txt
interproscan.sh -i proteins.fasta -f tsv -o interpro_results.tsv -cpu 16
python3 scripts/interpro2go.py interpro_results.tsv > go_annotations.txt
|
11. 非编码RNA注释
1 2 3 4 5 6 7 8
| tRNAscan-SE -o trna_results.txt final_assembly.fasta
barrnap --kingdom euk --threads 16 final_assembly.fasta > rrna_results.gff3
|
12. 最终注释整合和格式转换
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
| python3 scripts/integrate_annotations.py \ --genome final_assembly.fasta \ --genes evm_consensus.gff3 \ --blast blast_nr.txt \ --interpro interpro_results.tsv \ --go go_annotations.txt \ --trna trna_results.txt \ --rrna rrna_results.gff3 \ --output final_annotation.gff3
python3 scripts/annotation_stats.py final_annotation.gff3 > annotation_summary.txt
|
质量评估和验证
13. 最终质量评估
1 2 3 4 5 6 7 8
| quast.py final_assembly.fasta -o final_quast_report --threads 16
busco -i proteins.fasta -l eukaryota_odb10 -o final_busco -m proteins
assembly-stats final_assembly.fasta > assembly_stats.txt
|
重要脚本和配置文件
辅助脚本示例
interpro2go.py
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
| import sys
go_mapping = {} with open('interpro2go', 'r') as f: for line in f: if line.startswith('InterPro:'): parts = line.strip().split(' ; ') if len(parts) >= 2: interpro_id = parts[0].split(':')[1] go_id = parts[1].split(':')[1] if interpro_id not in go_mapping: go_mapping[interpro_id] = [] go_mapping[interpro_id].append(go_id)
with open(sys.argv[1], 'r') as f: for line in f: fields = line.strip().split('\t') protein_id = fields[0] interpro_id = fields[11] if len(fields) > 11 and fields[11] != '-' else None if interpro_id and interpro_id in go_mapping: for go_id in go_mapping[interpro_id]: print(f"{protein_id}\t{go_id}")
|
环境配置
1 2 3 4 5 6 7 8 9 10
| conda create -n genome_assembly -c bioconda -c conda-forge \ fastqc fastp flye hifiasm quast busco \ bwa samtools pilon racon \ augustus genemark gemoma \ blast interproscan \ trnascan-se barrnap
conda activate genome_assembly
|
流程总结
这个完整的流程整合了:
- 多种测序数据:Hi-Fi(高质量长读长)、Nanopore(长读长)、Illumina(短读长)、Hi-C(染色体构象)
- 组装策略:长读长初步组装 → 短读长纠错 → Hi-C辅助scaffold
- 注释方法:重复序列 → 基因结构预测 → 功能注释 → 非编码RNA
- 质量控制:每个步骤都包含质量评估和验证
根据具体的基因组大小、复杂性和可用计算资源,可以调整参数和选择适当的工具组合。