生物信息学知识库

浅言前言

软件安装

Linux基础

报错及解决方案

  • Some index files failed to download. They have been ignored, or old ones used instead.:编辑vi /etc/resolv.conf,加入:
1
2
nameserver 8.8.8.8
nameserver 8.8.4.4

Nginx

  • 不同域名指定端口
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
# Default server configuration
#
server {
listen 80 default_server;
listen [::]:80 default_server;

root /var/www/html;

# Add index.php to the list if you are using PHP
index index.html index.htm index.nginx-debian.html;

server_name _;

location / {
# First attempt to serve request as file, then
# as directory, then fall back to displaying a 404.
try_files $uri $uri/ =404;
}
}

server {
listen 80;
server_name xxx.xxx.com;
location / {
proxy_pass http://localhost:3000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}

server {
listen 80;
server_name xxx.xxx.com;
location / {
proxy_pass http://localhost:1288;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}

数据库

序列处理

从基因组提取蛋白序列

1
gffread genome.gff -g genome.fa -y 100.pep

从基因组提取基因序列

搜索半天找不到合适的工具,让ChatGPT写了一个:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
from Bio import SeqIO
import gffutils
import argparse

# 解析命令行参数
parser = argparse.ArgumentParser(description="Extract gene sequences from genome and GFF files.")
parser.add_argument("-genome", dest="genome_file", required=True, help="Path to the genome FASTA file")
parser.add_argument("-gff", dest="gff_file", required=True, help="Path to the GFF3 file")
parser.add_argument("-out", dest="out_file", required=True, help="Path to the output file")
args = parser.parse_args()

# 读取基因组序列
genome_seq = SeqIO.to_dict(SeqIO.parse(args.genome_file, "fasta"))

# 读取GFF文件并建立数据库
db_file = args.gff_file + ".db"
db = gffutils.create_db(args.gff_file, dbfn=db_file, force=True)

# 遍历所有基因,提取核酸序列并输出到文件
with open(args.out_file, "w") as out_file:
for gene in db.features_of_type("gene"):
# 获取基因序列的位置信息
seq_region = gene.seqid
start = gene.start
end = gene.end
strand = gene.strand

# 从基因组序列中提取基因核酸序列
if seq_region in genome_seq:
gene_seq = genome_seq[seq_region][start-1:end].seq
if strand == "-":
gene_seq = gene_seq.reverse_complement()

# 输出基因核酸序列到文件
out_file.write(">" + gene.id + "\n")
out_file.write(str(gene_seq) + "\n")

提取启动子序列

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
# 导入所需的包
from Bio import SeqIO
import argparse

# 创建参数解析器
parser = argparse.ArgumentParser(description='从GFF文件中提取每个转录本的信息,并在此基础上提取指定长度的启动子序列。')
parser.add_argument(
'-i',
'--input',
dest='gff_file',
type=str,
required=True,
help='输入GFF文件路径。'
)
parser.add_argument(
'-g',
'--genome',
dest='genome_file',
type=str,
required=True,
help='输入基因组序列文件路径。'
)
parser.add_argument(
'-o',
'--output',
dest='output_file',
type=str,
required=True,
help='输出文件路径。'
)
parser.add_argument(
'-l',
'--length',
dest='promoter_length',
type=int,
default=1500,
help='指定启动子序列的长度,默认为1500bp。'
)

# 解析参数
args = parser.parse_args()

# 初始化基因组序列
genome_dict = SeqIO.to_dict(SeqIO.parse(args.genome_file, 'fasta'))

# 从GFF文件中提取转录本位置信息,并从基因组序列中提取启动子序列
with open(args.output_file, 'w') as out_file, open(args.gff_file) as in_file:
for line in in_file:
fields = line.strip().split('\t')
if fields[2] == 'mRNA':
transcript_id = fields[8].split(';')[0].split('=')[1]
gene_id = fields[8].split(';')[1].split('=')[1]
chr_num = fields[0]
start = int(fields[3])
end = int(fields[4])
strand = fields[6]
# 根据正负链计算起始位置和终止位置
if strand == '+':
promoter_start = max(0, start - args.promoter_length)
promoter_end = start
else:
promoter_start = end
promoter_end = min(len(genome_dict[chr_num]), end + args.promoter_length)
# 从基因组序列中提取启动子序列
promoter_seq = str(genome_dict[chr_num].seq[promoter_start:promoter_end]).upper()
# 将启动子序列写入输出文件
out_file.write(f'>{transcript_id}\n{promoter_seq}\n')

用法如下:

1
python getpromoterseqfromgenome.py -genome 1.fa -gff temp.gff -length 5 -out test.fa    

💌lixiang117423@foxmail.com
💌lixiang117423@gmail.com


生物信息学知识库
https://lixiang117423.github.io/article/mybioinf/
作者
李详【Xiang LI】
发布于
2022年8月13日
许可协议