How to batch-download genome data from NCBI

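I wanted to download several thousand bacterial genomes for a pan-genome analysis, but the downloads kept failing with network errors. So I searched around and found this:

It turns out that the download links follow a very regular naming pattern.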

So let R do the work:

library(magrittr)  # provides the %>% pipe used below

# df.bins.pan.info is the assembly metadata table; only two of its columns are needed.
df.bins.pan.info %>% 
  dplyr::select(`Assembly Accession`,  `Assembly Name`) %>% 
  magrittr::set_names(c("acc", "name")) %>% 
  # Split e.g. "GCA_016728825.1" into the prefix "GCA" and the digit blocks "016", "728", "825".
  dplyr::mutate(temp0 = stringr::str_split(acc, "_") %>% sapply("[", 1),
                temp1 = stringr::str_split(acc, "_") %>% sapply("[", 2) %>% stringr::str_sub(1,3),
                temp2 = stringr::str_split(acc, "_") %>% sapply("[", 2) %>% stringr::str_sub(4,6),
                temp3 = stringr::str_split(acc, "_") %>% sapply("[", 2) %>% stringr::str_sub(7,9))  %>% 
  # Assemble the FTP directory URL, then the wget command that mirrors that directory.
  dplyr::mutate(link = sprintf("https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s",temp0, temp1, temp2, temp3, acc, name),
                comm = sprintf('wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 %s -P ./',link)) -> down.comm

# Write the commands to a plain-text file, one per line, without header, row names or quotes.
down.comm %>% 
  dplyr::select(comm) %>% 
  write.table(file = "./data/sanqimetagenome/results/分箱/13.泛基因组/基因组下载链接.txt", 
              col.names = FALSE, row.names = FALSE, quote = FALSE)

The resulting file contains one wget command per line, for example:

wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/728/825/GCA_016728825.1_ASM1672882v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/728/825/GCF_016728825.1_ASM1672882v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/008/245/125/GCA_008245125.1_ASM824512v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/008/245/125/GCF_008245125.1_ASM824512v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/013/267/375/GCA_013267375.1_ASM1326737v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/013/267/375/GCF_013267375.1_ASM1326737v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/636/675/GCA_900636675.1_43781_F01 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/636/675/GCF_900636675.1_43781_F01 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/019/968/625/GCA_019968625.1_ASM1996862v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/968/625/GCF_019968625.1_ASM1996862v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/082/135/GCA_002082135.1_ASM208213v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/082/135/GCF_002082135.1_ASM208213v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/870/085/GCA_022870085.1_ASM2287008v1 -P ./
wget -np --recursive -e robots=off --reject "index.html" --no-host-directories --cut-dirs=6 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/022/870/085/GCF_022870085.1_ASM2287008v1 -P ./
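
To make the naming pattern concrete, here is a minimal shell sketch that rebuilds the directory URL for a single assembly by hand. The function name `ncbi_ftp_url` is made up for this illustration, and the sketch assumes the accession has the form prefix, underscore, nine digits, version, as in the commands above.

```shell
# ncbi_ftp_url is a hypothetical helper name used only for this illustration.
ncbi_ftp_url() {
  acc=$1    # assembly accession, e.g. GCA_016728825.1
  name=$2   # assembly name, e.g. ASM1672882v1
  prefix=${acc%%_*}   # text before the first underscore: GCA or GCF
  digits=${acc#*_}    # text after it: 016728825.1
  # Split the first nine digits into three 3-character directory levels.
  d1=$(printf '%s' "$digits" | cut -c1-3)
  d2=$(printf '%s' "$digits" | cut -c4-6)
  d3=$(printf '%s' "$digits" | cut -c7-9)
  printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s\n' \
    "$prefix" "$d1" "$d2" "$d3" "$acc" "$name"
}

# Print the URL for the first assembly in the list above.
ncbi_ftp_url GCA_016728825.1 ASM1672882v1
```

The printed URL is the same one used in the first wget command above, and its six leading path components (genomes, all, GCA, 016, 728, 825) are what `--cut-dirs=6` strips from the local directory structure.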

Then run the command file in the background on a server and let it take its time.
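
Since network errors were the original problem, it may help to record which commands fail so they can be rerun later. The following is only a sketch of one way to do that: `run_commands`, `download_all.sh` and the file names are hypothetical, and it relies on wget returning a non-zero exit status when a download fails.

```shell
# run_commands is a hypothetical helper; it runs each line of a file as a shell command.
run_commands() {
  cmd_file=$1     # text file with one command per line
  failed_file=$2  # commands that exit non-zero are written here
  : > "$failed_file"   # start with an empty failure list
  while IFS= read -r cmd || [ -n "$cmd" ]; do
    [ -z "$cmd" ] && continue   # skip blank lines
    # Redirect stdin so the command cannot swallow the rest of the list.
    sh -c "$cmd" </dev/null || printf '%s\n' "$cmd" >> "$failed_file"
  done < "$cmd_file"
}

# Example, assuming the function and a call to it are saved as download_all.sh:
#   nohup sh download_all.sh > download.log 2>&1 &
```

After a run, the failure file can itself be passed back to `run_commands` as the command file (with a different output file) to retry only the downloads that did not complete.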

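The -np (--no-parent) option is required. Without it, the recursive download also climbs into the parent directory and crawls whatever is there, which is a nuisance and of no use.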

#Bioinformatics

https://lixiang117423.github.io/article/sheng-wu-xin-xi-xue/

Author

李详 (Xiang LI)

Published on

July 6, 2023
