文献
Rognes T, Flouri T, Nichols B, et al. VSEARCH: a versatile open source tool for metagenomics[J]. PeerJ, 2016, 4: e2584.
安装 直接使用mamba
安装:
1 mamba install bioconda::vsearch
流程 常规的扩增子流程:
序列双端合并;去除两端接头,Fastqc 质量检测;序列去重复;嵌合体检测;OTU 聚类;分类信息注释等步骤。
合并数据 将双末端数据合并为一个,对bash
批处理不是很熟悉,我选择使用R
时长批量处理代码:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 rm( list = ls( ) ) dir( "./01.data/" ) %>% as_data_frame( ) %>% magrittr:: set_names( "file" ) %>% dplyr:: mutate( sample = stringr:: str_split( file, ".338" ) %>% sapply( "[" , 1 ) ) -> df comm = NULL for ( i in unique( df$ sample) ) { f.file = sprintf( "./01.data/%s.338F_806R.R1.clean.fastq" , i) r.file = sprintf( "./01.data/%s.338F_806R.R2.clean.fastq" , i) out.file = sprintf( "./02.merge/%s.merged.fq" , i) sprintf( "vsearch --threads 60 --fastq_mergepairs %s --reverse %s --fastqout %s --relabel %s-" , f.file, r.file, out.file, i) %>% as_data_frame( ) %>% dplyr:: bind_rows( comm) -> comm} comm %>% write.table( "./00.code/01.vsearch.merge.sh" , col.names = FALSE , row.names = FALSE , quote = FALSE )
生成的批处理代码:
1 2 3 4 5 6 7 vsearch - - threads 60 - - fastq_mergepairs ./ 01. data/ LXZR_6.338F_806R.R1.clean.fastq - - reverse ./ 01. data/ LXZR_6.338F_806R.R2.clean.fastq - - fastqout ./ 02. merge/ LXZR_6.merged.fq - - relabel LXZR_6 vsearch - - threads 60 - - fastq_mergepairs ./ 01. data/ LXZR_5.338F_806R.R1.clean.fastq - - reverse ./ 01. data/ LXZR_5.338F_806R.R2.clean.fastq - - fastqout ./ 02. merge/ LXZR_5.merged.fq - - relabel LXZR_5 vsearch - - threads 60 - - fastq_mergepairs ./ 01. data/ LXZR_4.338F_806R.R1.clean.fastq - - reverse ./ 01. data/ LXZR_4.338F_806R.R2.clean.fastq - - fastqout ./ 02. merge/ LXZR_4.merged.fq - - relabel LXZR_4 vsearch - - threads 60 - - fastq_mergepairs ./ 01. data/ LXZR_3.338F_806R.R1.clean.fastq - - reverse ./ 01. data/ LXZR_3.338F_806R.R2.clean.fastq - - fastqout ./ 02. merge/ LXZR_3.merged.fq - - relabel LXZR_3 vsearch - - threads 60 - - fastq_mergepairs ./ 01. data/ LXZR_2.338F_806R.R1.clean.fastq - - reverse ./ 01. data/ LXZR_2.338F_806R.R2.clean.fastq - - fastqout ./ 02. merge/ LXZR_2.merged.fq - - relabel LXZR_2 vsearch - - threads 60 - - fastq_mergepairs ./ 01. data/ LXZR_1.338F_806R.R1.clean.fastq - - reverse ./ 01. data/ LXZR_1.338F_806R.R2.clean.fastq - - fastqout ./ 02. merge/ LXZR_1.merged.fq - - relabel LXZR_1 vsearch - - threads 60 - - fastq_mergepairs ./ 01. data/ LXXG_6.338F_806R.R1.clean.fastq - - reverse ./ 01. data/ LXXG_6.338F_806R.R2.clean.fastq - - fastqout ./ 02. merge/ LXXG_6.merged.fq - - relabel LXXG_6
切除引物接头 由于我用fastp
质控过数据,就省略了这一步。参考的代码是:
1 2 3 4 5 cat 02. merge/ * > 02. merge/ all.fq time vsearch - - fastx_filter temp/ all.fq \ - - fastq_stripleft 29 - - fastq_stripright 18 \ - - fastqout temp/ stripped.fq
质控 1 2 vsearch - - fastx_filter 02. merge/ all.fq - - fastq_maxee_rate 0.01 - - fastaout 02. merge/ all.filtered.fq
1 2 3 4 5 vsearch v2.27.0_linux_x86_64, 503.3 GB RAM, 80 cores https: / / github.com/ torognes/ vsearch Reading input file 100 % 6566012 sequences kept ( of which 0 truncated) , 66629 sequences discarded.
102个样品,5.9G数据质控后还有3.0G.
去冗余 1 vsearch --derep_fulllength 02.merge/all.filtered.fq --sizeout --minuniquesize 8 --output 02.merge/unique.fa
1 2 3 4 5 6 7 8 9 10 vsearch v2.27.0_linux_x86_64, 503.3GB RAM, 80 cores https://github.com/torognes/vsearch Dereplicating file 02.merge/all.filtered.fq 100% 3053614330 nt in 6566009 seqs, min 32, max 578, avg 465 minseqlength 32: 3 sequences discarded. Sorting 100% 4570066 unique sequences, avg cluster 1.4, median 1, max 897 Writing FASTA output file 100% 30033 uniques written, 4540033 clusters discarded (99.3%)
生成OTU 1 vsearch --cluster_fast 02.merge/unique.fa --id 0.97 --centroids 04.otu/otus.fa --relabel OTU_
1 2 3 4 5 6 7 8 9 10 11 12 13 vsearch v2.27.0_linux_x86_64, 503.3GB RAM, 80 cores https://github.com/torognes/vsearch Reading file 02.merge/unique.fa 100% 13862140 nt in 30033 seqs, min 141, max 500, avg 462 Masking 100% Sorting by length 100% Counting k-mers 100% Clustering 100% Sorting clusters 100% Writing clusters 100% Clusters: 1025 Size min 1, max 856, avg 29.3 Singletons: 318, 1.1% of seqs, 31.0% of clusters
细菌可以使用Usearch作者整理的RDP Gold数据库去除嵌合体:
下载:
1 wget http://drive5.com/uchime/rdp_gold.fa
1 vsearch --uchime_ref 04.otu/otus.fa --db rdp_gold.fa --nonchimeras 04.otu/otus.nochimer.fa
1 2 3 4 5 6 7 8 9 10 11 12 13 14 vsearch v2.27.0_linux_x86_64, 503.3GB RAM, 80 cores https://github.com/torognes/vsearch Reading file rdp_gold.fa 100% 29007378 nt in 20098 seqs, min 320, max 2210, avg 1443 Masking 100% Counting k-mers 100% Creating k-mer index 100% Detecting chimeras 100% Found 31 (3.0%) chimeras, 992 (96.8%) non-chimeras, and 2 (0.2%) borderline sequences in 1025 unique sequences. Taking abundance information into account, this corresponds to 31 (3.0%) chimeras, 992 (96.8%) non-chimeras, and 2 (0.2%) borderline sequences in 1025 total sequences.
生成OTU表 1 vsearch --usearch_global 02.merge/all.filtered.fq --db 04.otu/otus.nochimer.fa --id 0.97 --otutabout 04.otu/bacteria.otu.abun.txt --threads 60
1 2 3 4 5 6 7 8 9 10 11 vsearch v2.27.0_linux_x86_64, 503.3GB RAM, 80 cores https://github.com/torognes/vsearch Reading file 04.otu/otus.nochimer.fa 100% 460361 nt in 992 seqs, min 141, max 500, avg 464 Masking 100% Counting k-mers 100% Creating k-mer index 100% Searching 100% Matching unique query sequences: 2383628 of 6566012 (36.30%) Writing OTU table (classic) 100%
OTU注释 1 vsearch --sintax 04.otu/otus.nochimer.fa -db silva_16s_v123.fa --sintax_cutoff 0.1 --tabbedout 04.otu/bacteria.otu.tax.txt --threads 60
所有代码 1 2 3 4 5 6 vsearch --fastx_filter 02.merge/all.fq --fastq_maxee_rate 0.01 --fastaout 02.merge/all.filtered.fq vsearch --derep_fulllength 02.merge/all.filtered.fq --sizeout --minuniquesize 4 --output 02.merge/unique.fa vsearch --cluster_fast 02.merge/unique.fa --id 0.97 --centroids 04.otu/otus.fa --relabel OTU_ vsearch --uchime_ref 04.otu/otus.fa --db rdp_gold.fa --nonchimeras 04.otu/otus.nochimer.fa vsearch --usearch_global 02.merge/all.filtered.fq --db 04.otu/otus.nochimer.fa --id 0.97 --otutabout 04.otu/bacteria.otu.abun.txt --threads 60 vsearch --sintax 04.otu/otus.nochimer.fa -db silva_16s_v123.fa --sintax_cutoff 0.1 --tabbedout 04.otu/bacteria.otu.tax.txt --threads 60
真菌参数 1 2 3 4 5 6 vsearch --fastx_filter 02.merge/all.fq --fastq_maxee_rate 0.01 --fastaout 02.merge/all.filtered.fq vsearch --derep_fulllength 02.merge/all.filtered.fq --sizeout --minuniquesize 4 --output 02.merge/unique.fa vsearch --cluster_fast 02.merge/unique.fa --id 0.97 --centroids 04.otu/otus.fa --relabel OTU_ vsearch --uchime_ref 04.otu/otus.fa --db rdp_gold.fa --nonchimeras 04.otu/fungi.otus.nochimer.fa vsearch --usearch_global 02.merge/all.filtered.fq --db 04.otu/fungi.otus.nochimer.fa --id 0.97 --otutabout 04.otu/fungi.otu.abun.txt --threads 60 vsearch --sintax 04.otu/fungi.otus.nochimer.fa -db fungi.unite.9.0.fa --sintax_cutoff 0.1 --tabbedout 04.otu/fungi.otu.tax.txt --threads 60