结构变异检测软件SVDSS使用

¶参考文献

Denti L, Khorsand P, Bonizzoni P, et al. SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads[J]. Nature Methods, 2023, 20(4): 550-558.

点击观看官方视屏.

¶计算原理

SVDSS融合了基于比对、无比对和基于组装这三种方法的优势。整个流程可以分为如下几步：

read-smoothing：基于比对到参考基因组上的方法将测序错误碱基、SNP和小片段插入（<20bp）的序列去除。
构建SFS superstring：在完成上一步的基础上对样品特异性片段（sample-specific string，SFS）进行提取，然后再将SFS组装成superstrings.

什么是specific strings？官方视屏中的介绍是这样的：也就是在A中有在B中没有的片段。
SV预测：基于这些SFS在参考基因组上的位置对其进行聚类，然后鉴定SV.

¶实战案例

¶## 软件安装

1
2
3

mamba create --name svdss
mamba activate svdss
mamba install -c bioconda svdss

¶数据下载

官方使用的是人类基因组数据，我选择水稻的数据。

从[RiceSuperPIRdb](http://www.ricesuperpir.com/web/download)下载日本晴的T2T基因组和注释信息。从Qin P, Lu H, Du H, et al. Pan-genome analysis of 33 genetically diverse rice accessions reveals hidden genomic variations[J]. Cell, 2021, 184(13): 3542-3558. e16.提供的数据中下载2个样品全基因组测序数据。下载地址：

https://ngdc.cncb.ac.cn/search/?dbId=gsa&q=PRJCA002103.&page=1

剩下的步骤大概可以分为如下几步：

Build FMD index of reference genome (SVDSS index)
Smooth the input BAM file (SVDSS smooth)
Extract SFS from smoothed BAM file (SVDSS search)
Assemble SFS into superstrings (SVDSS assemble)
Genotype SVs from the assembled superstrings (SVDSS call)

但是软件输出的帮助文档的顺序和GitHub上的不一样：

SVDSS, Structural Variant Discovery from Sample-specific Strings.

Usage: 
        * Index reference/sample:
                SVDSS index --fastq/--fasta /path/to/genome/file --index /path/to/output/index/file

                Optional arguments: 
                        -b, --binary                            output index in binary format. Allows for another index to be appended to this index later.
                        -a, --append /path/to/binary/index      append to existing binary index.

        * Extract SFS from BAM/FASTQ/FASTA files:
                SVDSS search --index /path/to/index --fastq/--bam /path/to/input --workdir /output/directory

                Optional arguments: 
                        --assemble                              automatically runs SVDSS assemble on output

        * Assemble SFS into superstrings:
                SVDSS assemble --workdir /path/to/.sfs/files --batches /number/of/SFS/batches

        * Reconstruct sample:
                SVDSS smooth --workdir /output/file/direcotry --bam /path/to/input/bam/file --reference /path/to/reference/genome/fasta

        * Call SVs:
                SVDSS call --workdir /path/to/assembled/.sfs/files --bam /path/to/input/bam/file --reference /path/to/reference/genome/fasta

                Optional arguments: 
                        --clipped                               calls SVs from clipped SFS.
                        --min-cluster-weight                    minimum number of supporting superstrings for a call to be reported.
                        --min-sv-length                         minimum length of reported SVs. Default is 25. Values < 25 are ignored.

General options: 
        --threads                                               sets number of threads, default 4.
        --version                                               print version information.
        --help                                                  print this help message.

先跟着GitHub上的步骤操作一波试试。

All of SVDSS steps must be run in the same directory so we always pass --workdir $PWD for every command.

¶构建参考基因组索引

1 2	`mkdir 1.refer.index SVDSS index --fasta 00.data/genome/nip.t2t.fa --index 01.refer.index/refer.genome.fmds`

373M的基因组运行结果如下：

SVDSS, Structural Variant Discovery from Sample-specific Strings.
Mode: index
[I] . 
[M::mr_insert_multi] Turn off parallelization for this batch as too few strings are left.
[I] Complete. Runtime: 876 seconds.

¶Mapping

使用Minimap2将序列比对到参考基因组上获得BAM文件。

1
2
3

minimap2 -ax map-pb  ref.fa pacbio-reads.fq > aln.sam   # for PacBio CLR reads
minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam      # for Oxford Nanopore reads
minimap2 -ax map-iclr ref.fa iclr-reads.fq > aln.sam    # for Illumina Complete Long Reads

1	`minimap2 -t 70 -ax map-pb 00.data/genome/nip.t2t.fa 00.data/sample/zh11.fastq > zh11.sam`

57G数据运行了25分钟左右。

将SAM转为BAM排序后构建索引：

1 2	`samtools view -@ 70 -b zh11.sam > zh11.bam samtools sort --write-index -@ 70 zh11.bam -o zh11.sorted.bam`

¶Smooth样品

剔除样品中的测序错误碱基、SNP和小片段插入（<20bp）。

1	`SVDSS smooth --threads 70 --reference nip.t2t.fa --bam zh11.sorted.bam`

构建索引。

1	`samtools index -@ 70 -b smoothed.selective.bam`

¶提取样品的SFS并组装

看了下一步的代码里面有一个参数是batches，猜测不同的样本的batches不同，批量处理就不好处理，那就直接一步完成了。

1	`SVDSS search --threads 70 --index refer.genome.fmd --bam smoothed.selective.bam`

¶预测SVs

1	`SVDSS call --threads 70 --reference nip.t2t.fa --bam smoothed.selective.bam > zh11.sv.vcf`

¶汇总结果

mkdir 02.each/zh11
mv zh11* 02.each/zh11 
mv s* 02.each/zh11 
mv ignored_reads.txt 02.each/zh11 
mv poa.sam 02.each/zh11

¶另一种方法

我尝试了上面的方法，不行，最终得不到结构变异。选择直接从fastq文件开始，不用比对了。

mkdir 1.refer.index
SVDSS index --fasta 00.data/genome/nip.t2t.fa --index 01.refer.index/refer.genome.fmds
minimap2 -t 70 -ax map-pb 00.data/genome/nip.t2t.fa 00.data/sample/zh11.fastq > zh11.sam
samtools view -@ 70 -b zh11.sam > zh11.bam
samtools sort --write-index -@ 70 zh11.bam -o zh11.sorted.bam 
SVDSS search --threads 70 --assemble --index refer.genome.fmd --fastq 00.data/sample/zh11.fastq
SVDSS call --min-sv-length 50 --threads 70 --reference nip.t2t.fa --bam zh11.sorted.bam  > zh11.sv.vcf

mkdir 02.each/zh11
mv zh11* 02.each/zh11 
mv s* 02.each/zh11 
mv ignored_reads.txt 02.each/zh11 
mv poa.sam 02.each/zh11

只得到200个SV，不知道对不对。得到这样一个图：

¶结构变异可视化

在Ahsan M U, Liu Q, Perdomo J E, et al. A survey of algorithms for the detection of genomic structural variants from long-read sequencing data[J]. Nature Methods, 2023, 20(8): 1143-1158.这个文章里面看到这个图：

找到这个文章：

Maria Nattestad, Robert Aboukhalil, Chen-Shan Chin, Michael C Schatz, Ribbon: intuitive visualization for complex genomic variation, Bioinformatics, Volume 37, Issue 3, February 2021, Pages 413–415

点击查看GitHub代码。

直接部署一个到自己的服务器上，接个域名方便自己使用。

https://www.xxx.com/ribbon/

生物信息学

#生物信息学

结构变异检测软件SVDSS使用

https://lixiang117423.github.io/article/svdss/

作者

李详【Xiang LI】

发布于

2023年10月3日

许可协议

从二代测序数据中鉴定是否含有特定基因上一篇

调控植物-植物互作的地下部分信号下一篇