How to batch-download genome data from NCBI

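I wanted to download the genomes of several thousand bacteria for a pan-genome analysis, but the downloads kept failing with network errors. So I searched around and found this: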

Genomes Download (FTP) FAQ

The download links turn out to follow a very regular naming pattern.
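
To make the pattern concrete, here is a minimal shell sketch of how one directory URL can be assembled from an assembly accession and an assembly name. The function name `ncbi_url` is made up for this illustration, and the sketch assumes the assembly name appears in the path exactly as given (no spaces or other characters that would need replacing).

```shell
# ncbi_url ACCESSION ASSEMBLY_NAME
# Hypothetical helper: prints the genomes/all directory URL for one assembly.
ncbi_url() {
  acc=$1                  # e.g. GCA_016728825.1
  name=$2                 # e.g. ASM1672882v1
  prefix=${acc%%_*}       # part before the first underscore: GCA or GCF
  digits=${acc#*_}        # part after the first underscore: 016728825.1
  # The first nine digits become three 3-digit directory levels.
  d1=$(printf '%s' "$digits" | cut -c1-3)
  d2=$(printf '%s' "$digits" | cut -c4-6)
  d3=$(printf '%s' "$digits" | cut -c7-9)
  printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s\n' \
    "$prefix" "$d1" "$d2" "$d3" "$acc" "$name"
}

ncbi_url GCA_016728825.1 ASM1672882v1
# prints https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/728/825/GCA_016728825.1_ASM1672882v1
```

Doing this for thousands of assemblies is just a matter of applying the same split to every row of the assembly table.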

So, R to the rescue:

library(magrittr)  # provides the %>% pipe

# Build the FTP directory URL and a wget command for every assembly.
down.comm <- df.bins.pan.info %>%
  dplyr::select(`Assembly Accession`, `Assembly Name`) %>%
  magrittr::set_names(c("acc", "name")) %>%
  # Split e.g. "GCA_016728825.1" into "GCA" and the digit blocks "016", "728", "825"
  dplyr::mutate(temp0 = stringr::str_split(acc, "_") %>% sapply("[", 1),
                temp1 = stringr::str_split(acc, "_") %>% sapply("[", 2) %>% stringr::str_sub(1, 3),
                temp2 = stringr::str_split(acc, "_") %>% sapply("[", 2) %>% stringr::str_sub(4, 6),
                temp3 = stringr::str_split(acc, "_") %>% sapply("[", 2) %>% stringr::str_sub(7, 9)) %>%
  dplyr::mutate(link = sprintf("https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s", temp0, temp1, temp2, temp3, acc, name),
                comm = sprintf('wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 %s -P ./', link))

# Write the commands to a text file, one wget call per line.
down.comm %>%
  dplyr::select(comm) %>%
  write.table(file = "./data/sanqimetagenome/results/分箱/13.泛基因组/基因组下载链接.txt",
              col.names = FALSE, row.names = FALSE, quote = FALSE)

The resulting file contains one wget command per line:

wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/728/825/GCA_016728825.1_ASM1672882v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/728/825/GCF_016728825.1_ASM1672882v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/008/245/125/GCA_008245125.1_ASM824512v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/008/245/125/GCF_008245125.1_ASM824512v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/013/267/375/GCA_013267375.1_ASM1326737v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/267/375/GCF_013267375.1_ASM1326737v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/636/675/GCA_900636675.1_43781_F01 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/636/675/GCF_900636675.1_43781_F01 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/019/968/625/GCA_019968625.1_ASM1996862v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/968/625/GCF_019968625.1_ASM1996862v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/082/135/GCA_002082135.1_ASM208213v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/082/135/GCF_002082135.1_ASM208213v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/870/085/GCA_022870085.1_ASM2287008v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/022/870/085/GCF_022870085.1_ASM2287008v1 -P ./

Put it on the server, start it in the background, and let it take its time.
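
Since the original problem was downloads failing on network errors, it can help to record which commands failed while the job runs unattended. The following is a minimal sketch, not the setup used in this post: `run_commands` and the file names are hypothetical, and it relies only on wget returning a non-zero exit status when a download goes wrong.

```shell
# run_commands COMMAND_FILE FAILED_FILE
# Hypothetical helper: runs each line of COMMAND_FILE as a shell command and
# appends every line that exits non-zero to FAILED_FILE for a later retry.
run_commands() {
  cmd_file=$1
  failed_file=$2
  : > "$failed_file"                      # start with an empty failure list
  while IFS= read -r cmd || [ -n "$cmd" ]; do
    [ -z "$cmd" ] && continue             # skip blank lines
    # stdin is redirected so the command cannot swallow the rest of the list
    if ! sh -c "$cmd" </dev/null; then
      printf '%s\n' "$cmd" >> "$failed_file"
    fi
  done < "$cmd_file"
}
```

Saved in a script that ends with `run_commands "$1" "$2"`, this could be started with something like `nohup sh run_downloads.sh commands.txt failed.txt > download.log 2>&1 &` (script and file names are placeholders). Afterwards, `failed.txt` can be fed back in as the next command file.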

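The -np (--no-parent) option is required. Without it, the recursive download also climbs into the parent directories, which is a nuisance and fetches nothing useful.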


https://lixiang117423.github.io/article/sheng-wu-xin-xi-xue/
Author
李详【Xiang LI】
Published
July 6, 2023
License