微生物组VSEARCH学习笔记

文献

Rognes T, Flouri T, Nichols B, et al. VSEARCH: a versatile open source tool for metagenomics[J]. PeerJ, 2016, 4: e2584.

安装

直接使用mamba安装:

1
mamba install bioconda::vsearch

流程

常规的扩增子流程:

序列双端合并;去除两端接头,Fastqc 质量检测;序列去重复;嵌合体检测;OTU 聚类;分类信息注释等步骤。

合并数据

将双末端数据合并为一个,对bash批处理不是很熟悉,我选择使用R时长批量处理代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
rm(list = ls())

dir("./01.data/") %>%
as_data_frame() %>%
magrittr::set_names("file") %>%
dplyr::mutate(sample = stringr::str_split(file, ".338") %>% sapply("[", 1)) -> df

comm = NULL

for (i in unique(df$sample)) {
f.file = sprintf("./01.data/%s.338F_806R.R1.clean.fastq", i)
r.file = sprintf("./01.data/%s.338F_806R.R2.clean.fastq", i)
out.file = sprintf("./02.merge/%s.merged.fq",i)

sprintf("vsearch --threads 60 --fastq_mergepairs %s --reverse %s --fastqout %s --relabel %s-", f.file, r.file, out.file, i) %>%
as_data_frame() %>%
dplyr::bind_rows(comm) -> comm
}

comm %>%
write.table("./00.code/01.vsearch.merge.sh", col.names = FALSE, row.names = FALSE, quote = FALSE)

生成的批处理代码:

1
2
3
4
5
6
7
vsearch --threads 60 --fastq_mergepairs ./01.data/LXZR_6.338F_806R.R1.clean.fastq --reverse ./01.data/LXZR_6.338F_806R.R2.clean.fastq --fastqout ./02.merge/LXZR_6.merged.fq --relabel LXZR_6
vsearch --threads 60 --fastq_mergepairs ./01.data/LXZR_5.338F_806R.R1.clean.fastq --reverse ./01.data/LXZR_5.338F_806R.R2.clean.fastq --fastqout ./02.merge/LXZR_5.merged.fq --relabel LXZR_5
vsearch --threads 60 --fastq_mergepairs ./01.data/LXZR_4.338F_806R.R1.clean.fastq --reverse ./01.data/LXZR_4.338F_806R.R2.clean.fastq --fastqout ./02.merge/LXZR_4.merged.fq --relabel LXZR_4
vsearch --threads 60 --fastq_mergepairs ./01.data/LXZR_3.338F_806R.R1.clean.fastq --reverse ./01.data/LXZR_3.338F_806R.R2.clean.fastq --fastqout ./02.merge/LXZR_3.merged.fq --relabel LXZR_3
vsearch --threads 60 --fastq_mergepairs ./01.data/LXZR_2.338F_806R.R1.clean.fastq --reverse ./01.data/LXZR_2.338F_806R.R2.clean.fastq --fastqout ./02.merge/LXZR_2.merged.fq --relabel LXZR_2
vsearch --threads 60 --fastq_mergepairs ./01.data/LXZR_1.338F_806R.R1.clean.fastq --reverse ./01.data/LXZR_1.338F_806R.R2.clean.fastq --fastqout ./02.merge/LXZR_1.merged.fq --relabel LXZR_1
vsearch --threads 60 --fastq_mergepairs ./01.data/LXXG_6.338F_806R.R1.clean.fastq --reverse ./01.data/LXXG_6.338F_806R.R2.clean.fastq --fastqout ./02.merge/LXXG_6.merged.fq --relabel LXXG_6

切除引物接头

由于我用fastp质控过数据,就省略了这一步。参考的代码是:

1
2
3
4
5
cat 02.merge/* > 02.merge/all.fq

time vsearch --fastx_filter temp/all.fq \
--fastq_stripleft 29 --fastq_stripright 18 \
--fastqout temp/stripped.fq # 34s

质控

1
2
# cat 02.merge/* > 02.merge/all.fq
vsearch --fastx_filter 02.merge/all.fq --fastq_maxee_rate 0.01 --fastaout 02.merge/all.filtered.fq
1
2
3
4
5
vsearch v2.27.0_linux_x86_64, 503.3GB RAM, 80 cores
https://github.com/torognes/vsearch

Reading input file 100%
6566012 sequences kept (of which 0 truncated), 66629 sequences discarded.

102个样品,5.9G数据质控后还有3.0G.

去冗余

1
vsearch --derep_fulllength 02.merge/all.filtered.fq --sizeout --minuniquesize 8 --output 02.merge/unique.fa
1
2
3
4
5
6
7
8
9
10
vsearch v2.27.0_linux_x86_64, 503.3GB RAM, 80 cores
https://github.com/torognes/vsearch

Dereplicating file 02.merge/all.filtered.fq 100%
3053614330 nt in 6566009 seqs, min 32, max 578, avg 465
minseqlength 32: 3 sequences discarded.
Sorting 100%
4570066 unique sequences, avg cluster 1.4, median 1, max 897
Writing FASTA output file 100%
30033 uniques written, 4540033 clusters discarded (99.3%)

生成OTU

1
vsearch --cluster_fast 02.merge/unique.fa --id 0.97 --centroids 04.otu/otus.fa --relabel OTU_
1
2
3
4
5
6
7
8
9
10
11
12
13
vsearch v2.27.0_linux_x86_64, 503.3GB RAM, 80 cores
https://github.com/torognes/vsearch

Reading file 02.merge/unique.fa 100%
13862140 nt in 30033 seqs, min 141, max 500, avg 462
Masking 100%
Sorting by length 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 1025 Size min 1, max 856, avg 29.3
Singletons: 318, 1.1% of seqs, 31.0% of clusters

细菌可以使用Usearch作者整理的RDP Gold数据库去除嵌合体:

下载:

1
wget http://drive5.com/uchime/rdp_gold.fa
1
vsearch --uchime_ref 04.otu/otus.fa --db rdp_gold.fa --nonchimeras 04.otu/otus.nochimer.fa
1
2
3
4
5
6
7
8
9
10
11
12
13
14
vsearch v2.27.0_linux_x86_64, 503.3GB RAM, 80 cores
https://github.com/torognes/vsearch

Reading file rdp_gold.fa 100%
29007378 nt in 20098 seqs, min 320, max 2210, avg 1443
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Detecting chimeras 100%
Found 31 (3.0%) chimeras, 992 (96.8%) non-chimeras,
and 2 (0.2%) borderline sequences in 1025 unique sequences.
Taking abundance information into account, this corresponds to
31 (3.0%) chimeras, 992 (96.8%) non-chimeras,
and 2 (0.2%) borderline sequences in 1025 total sequences.

生成OTU表

1
vsearch --usearch_global 02.merge/all.filtered.fq --db 04.otu/otus.nochimer.fa --id 0.97 --otutabout 04.otu/bacteria.otu.abun.txt --threads 60
1
2
3
4
5
6
7
8
9
10
11
vsearch v2.27.0_linux_x86_64, 503.3GB RAM, 80 cores
https://github.com/torognes/vsearch

Reading file 04.otu/otus.nochimer.fa 100%
460361 nt in 992 seqs, min 141, max 500, avg 464
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 2383628 of 6566012 (36.30%)
Writing OTU table (classic) 100%

OTU注释

1
vsearch --sintax 04.otu/otus.nochimer.fa -db silva_16s_v123.fa --sintax_cutoff 0.1 --tabbedout 04.otu/bacteria.otu.tax.txt --threads 60

所有代码

1
2
3
4
5
6
vsearch --fastx_filter 02.merge/all.fq --fastq_maxee_rate 0.01 --fastaout 02.merge/all.filtered.fq
vsearch --derep_fulllength 02.merge/all.filtered.fq --sizeout --minuniquesize 4 --output 02.merge/unique.fa
vsearch --cluster_fast 02.merge/unique.fa --id 0.97 --centroids 04.otu/otus.fa --relabel OTU_
vsearch --uchime_ref 04.otu/otus.fa --db rdp_gold.fa --nonchimeras 04.otu/otus.nochimer.fa
vsearch --usearch_global 02.merge/all.filtered.fq --db 04.otu/otus.nochimer.fa --id 0.97 --otutabout 04.otu/bacteria.otu.abun.txt --threads 60
vsearch --sintax 04.otu/otus.nochimer.fa -db silva_16s_v123.fa --sintax_cutoff 0.1 --tabbedout 04.otu/bacteria.otu.tax.txt --threads 60

真菌参数

1
2
3
4
5
6
vsearch --fastx_filter 02.merge/all.fq --fastq_maxee_rate 0.01 --fastaout 02.merge/all.filtered.fq
vsearch --derep_fulllength 02.merge/all.filtered.fq --sizeout --minuniquesize 4 --output 02.merge/unique.fa
vsearch --cluster_fast 02.merge/unique.fa --id 0.97 --centroids 04.otu/otus.fa --relabel OTU_
vsearch --uchime_ref 04.otu/otus.fa --db rdp_gold.fa --nonchimeras 04.otu/fungi.otus.nochimer.fa
vsearch --usearch_global 02.merge/all.filtered.fq --db 04.otu/fungi.otus.nochimer.fa --id 0.97 --otutabout 04.otu/fungi.otu.abun.txt --threads 60
vsearch --sintax 04.otu/fungi.otus.nochimer.fa -db fungi.unite.9.0.fa --sintax_cutoff 0.1 --tabbedout 04.otu/fungi.otu.tax.txt --threads 60

微生物组VSEARCH学习笔记
https://lixiang117423.github.io/article/vsearch/
作者
李详【Xiang LI】
发布于
2024年2月20日
许可协议