Fungal Secreted Protein Prediction Pipeline

Software Setup

DeepLoc 2.0

Reference:

Thumuluri V, Almagro Armenteros J J, Johansen A R, et al. DeepLoc 2.0: multi-label subcellular localization prediction using protein language models[J]. Nucleic acids research, 2022, 50(W1): W228-W234.

DeepLoc 2.0 can only be downloaded after submitting a request with an academic email address. The request form is at: https://services.healthtech.dtu.dk/cgi-bin/sw_request?software=deeploc&version=2.0&packageversion=2.0&platform=All

After unpacking, the package can be installed directly. The officially recommended installation methods are:

# Option 1:
pip install deeploc2.tar.gz

# Option 2, from within the deeploc2_package directory:
pip install .

However, I tried many times and the installation failed every time with this error:

ModuleNotFoundError: No module named '_sysconfigdata_x86_64_conda_cos7_linux_gnu'

In the end I simply created a separate, fresh environment just for DeepLoc.

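Because several machine-learning frameworks are required, I installed the dependencies one at a time; installing them all at once kept producing errors. The setup.py file is as follows: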

import setuptools
import atexit
from setuptools.command.develop import develop
from setuptools.command.install import install

def _download_models():
    from esm import pretrained
    from transformers import T5Tokenizer, T5EncoderModel, logging
    logging.set_verbosity_error()
    _, _ = pretrained.load_model_and_alphabet("esm1b_t33_650M_UR50S")

class PostDevelopCommand(develop):
    """Post-installation for development mode."""
    def __init__(self, *args, **kwargs):
        super(PostDevelopCommand, self).__init__(*args, **kwargs)
        atexit.register(_download_models)

class PostInstallCommand(install):
    """Post-installation for installation mode."""
    def __init__(self, *args, **kwargs):
        super(PostInstallCommand, self).__init__(*args, **kwargs)
        atexit.register(_download_models)

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

setup_requires = ['numpy','transformers','fair-esm','sentencepiece','torch>=1.6']

install_requires = [
    'numpy',
    'matplotlib',
    'pandas',
    'scipy',
    'Bio',
    'torch>=1.6',
    'onnxruntime>=1.7.0',
    'fair-esm',
    'transformers',
    'pytorch_lightning',
    'sentencepiece'
]

setuptools.setup(
    name="DeepLoc2",
    version="1.0.0",
    author="Jose Juan Almagro Armenteros",
    author_email="jjaa@stanford.edu",
    description="Prediction of subcellular localization",
    #scripts=['bin/deeploc2'],
    entry_points={'console_scripts':['deeploc2=DeepLoc2.deeploc2:predict']},
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://services.healthtech.dtu.dk/service.php?DeepLoc-2.0",
    project_urls={
        "Bug Tracker": "https://services.healthtech.dtu.dk/service.php?DeepLoc-2.0",
    },
    install_requires=install_requires,
    setup_requires=setup_requires,
    cmdclass={'develop':PostDevelopCommand,'install':PostInstallCommand},
    packages=setuptools.find_packages(),
    package_data={'DeepLoc2': ['models/*', 'models/models_esm1b/*','models/models_prott5/*',
                               'models/models_esm1b/signaltype/*','models/models_prott5/signaltype/*']},
    python_requires=">=3.6",
)
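When installing the dependencies one at a time, it helps to see which ones are still missing from the new environment. The following is a minimal sketch, not part of DeepLoc itself: the function name missing_packages and the mapping from package names to importable module names are my own assumptions (for example, fair-esm is imported as esm, as the setup.py above does).

```python
import importlib.util

# Assumed mapping: name in install_requires -> module name used in `import`.
# Only 'fair-esm' -> 'esm' differs; this table is illustrative, not official.
REQUIRED_MODULES = {
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "pandas": "pandas",
    "scipy": "scipy",
    "Bio": "Bio",
    "torch": "torch",
    "onnxruntime": "onnxruntime",
    "fair-esm": "esm",
    "transformers": "transformers",
    "pytorch_lightning": "pytorch_lightning",
    "sentencepiece": "sentencepiece",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the package names whose module cannot be located in this environment."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    # Print one line per dependency that still has to be installed.
    for package in missing_packages():
        print(f"missing: {package}")
```

Run it inside the DeepLoc environment after each pip install; an empty output means every module in the table can be found. It only checks that a module is locatable, not that its version satisfies constraints such as torch>=1.6.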

Some model files also need to be downloaded: click here for the download location


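These model files could only be downloaded through a proxy or VPN, so I started a cloud server in Singapore, downloaded them there, and copied them back to the lab server. During the transfer I ran into this error: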

OpenSSL version mismatch. Built against 30000020, you have 30200000

The fix is to update the SSH client:

sudo apt remove --purge openssh-client
sudo apt install openssh-client

Once these files are in place, DeepLoc is essentially ready to use.
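Because the model files are downloaded on one machine and then copied to another, it can be worth confirming that they arrived intact. The sketch below is one way to do that, assuming you run it on both servers and compare the output; the function names sha256sum and checksum_table are hypothetical, and the directory you pass in is whatever folder holds your downloaded files.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_table(directory):
    """Map each file's path (relative to `directory`) to its SHA-256 digest."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): sha256sum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


# Example (hypothetical directory name):
# for name, digest in checksum_table("deeploc_models").items():
#     print(digest, name)
```

If the printed tables match on the cloud server and the lab server, the transfer did not corrupt or truncate any file.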

SignalP v6.0

This one is really painful to set up.

mamba create -n signalp.6
mamba activate signalp.6

mm install predector::signalp6

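This alone does not complete the installation. You still need to request the software download at https://services.healthtech.dtu.dk/services/SignalP-6.0/ (an academic email address is required) and then register the downloaded package: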

signalp6-register signalp-6.0h.fast.tar.gz

After that it is ready to use:

nohup signalp6 --fastafile rice_brownspot.pep.fa --output_dir ./ --organism eukarya --mode fast &

TargetP v2.0

TargetP likewise requires a download request, at https://services.healthtech.dtu.dk/services/TargetP-2.0/. Then install and register it:

mm install predector::targetp2
targetp2-register targetp-2.0.Linux.tar.gz

Then run it:

nohup targetp2 -batch 1000 -fasta rice_brownspot.pep.fa -gff3 -org non-pl -prefix heban &

TMHMM v2.0

TMHMM also requires a download request before it can be installed:

https://services.healthtech.dtu.dk/services/TMHMM-2.0/

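Note that the first line of two files in the bin directory, tmhmm and tmhmmformat.pl, must be changed to point to the absolute path of perl. Mine looks like this: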

#!/home/xxx/mambaforge/envs/tools4bioinf/bin/perl

# This is version 2.0c of tmhmm


# Give ONE fasta file on cmdline OR use stdin
# A single sequence can be given WITHOUT the ID line (">ID")
# Such a sequence will be called "WEBSEQUENCE"

# OPTION PARSING ##########################################
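The first-line edit described above can also be scripted rather than done by hand. This is a minimal sketch under my own assumptions: set_shebang is a hypothetical helper, not something shipped with TMHMM, and the bin directory path in the usage comment is a placeholder for wherever you unpacked the package.

```python
from pathlib import Path


def set_shebang(script, interpreter):
    """Replace the first line of `script` with '#!<interpreter>'.

    Refuses to touch a file whose first line is not already a shebang,
    so an unrelated file is not modified by accident.
    """
    path = Path(script)
    data = path.read_bytes()
    first_line, _, rest = data.partition(b"\n")
    if not first_line.startswith(b"#!"):
        raise ValueError(f"{path} does not start with a shebang line")
    # Write the new interpreter line followed by the untouched remainder.
    path.write_bytes(b"#!" + interpreter.encode() + b"\n" + rest)


# Example (placeholder paths; shutil.which finds the perl of the active environment):
# import shutil
# perl = shutil.which("perl")
# for name in ("tmhmm", "tmhmmformat.pl"):
#     set_shebang(Path("path/to/tmhmm/bin") / name, perl)
```

Working on bytes avoids any encoding assumptions about the Perl scripts, and rewriting the file in place keeps its existing executable permissions.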

Processing SignalP Results
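As a starting point, here is a hedged sketch of how the SignalP output could be filtered for proteins with a predicted signal peptide. It rests on assumptions that you should check against your own run: that the results are a tab-separated text file (I believe it is written as prediction_results.txt in the output directory), that comment lines start with "#", and that the first two columns are the sequence ID and the prediction, with "OTHER" meaning no signal peptide. The function name and the two sample rows are made up for illustration.

```python
import io


def signal_peptide_ids(handle):
    """Yield the IDs of sequences whose prediction is anything other than 'OTHER'.

    Assumes tab-separated rows of the form: ID, Prediction, ...
    Lines starting with '#' are treated as header/comment lines.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0].strip():
            continue
        seq_id, prediction = fields[0], fields[1]
        if prediction != "OTHER":
            # Keep only the first token in case the ID column holds a full FASTA header.
            yield seq_id.split()[0]


# Made-up rows in the assumed layout, for demonstration only.
SAMPLE = (
    "# ID\tPrediction\tOTHER\tSP(Sec/SPI)\tCS Position\n"
    "protA some description\tSP\t0.01\t0.99\tCS pos: 19-20\n"
    "protB\tOTHER\t0.98\t0.02\t\n"
)

if __name__ == "__main__":
    for seq_id in signal_peptide_ids(io.StringIO(SAMPLE)):
        print(seq_id)
```

To use it on a real run, open the results file and pass the file handle instead of the in-memory sample; the resulting ID list can then be used to subset rice_brownspot.pep.fa.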


https://lixiang117423.github.io/article/secretome4fungi/
Author
李详 (Xiang LI)
Published
August 29, 2024