Fixing errors when running the NCBI genome annotation pipeline (EGAPx) with third-generation (long-read) transcriptome data

Software version: v.0.4.1-alpha

Error message:

[0a/15eb38] NOTE: Process egapx:rnaseq_long_plane:rename_fasta_ids (1) terminated with an error exit status (1) -- Execution is retried (1)

The error roughly means that the step converting the long-read transcriptome FASTQ files to FASTA failed.

This is the original processing logic in egapx/nf/subworkflows/ncbi/setup/main.nf:

process rename_fasta_ids {
    input:
        tuple val(sampleID), path(fastx, stageAs: "reads/*")
        val srr_id
    output:
        tuple val(sampleID), path ('output/*') , emit: 'fasta_pair_list'
    script:
        file_name = fastx.getBaseName() + '.fasta'
    """
    #!/usr/bin/env python3
    import os
    os.makedirs('output', exist_ok=True)
    with open('${fastx}', 'r') as infile, open('output/${file_name}', 'w') as outfile:
        rec_cnt = 1
        skip_next = False
        for line in infile:
            line = line.lstrip()
            if not line:
                continue
            if line[0] in {'>', '@', '+'}:
                new_id = f"gnl|SRA|SRR{${srr_id}:08d}.{rec_cnt}.1"
                if line[0] in {'>', '@'}:
                    outfile.write(f">{new_id}{os.linesep}")
                if line[0] in {'>', '+'}:
                    rec_cnt += 1
                if line[0] == '+':
                    skip_next = True
            elif skip_next:
                skip_next = False
            else:
                outfile.write(line)
    """
    stub:
        file_name = fastx.getBaseName() + '.fasta'
    """
    mkdir -p output
    echo $srr_id > output/$file_name
    """
}

This script opens its input as plain text, so it can only handle uncompressed files, but my input files were compressed.
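To confirm that an input really is gzip-compressed regardless of its file name, one option is to look at the first two bytes of the file, since every gzip stream starts with 0x1f 0x8b. The sketch below is only an illustration: the helper name `is_gzip` and the demo files are hypothetical and not part of EGAPx.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    # Hypothetical demo files, created in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "reads.fastq")
        packed = os.path.join(tmp, "reads.fastq.gz")
        record = "@r1\nACGT\n+\nIIII\n"
        with open(plain, "w") as out:
            out.write(record)
        with gzip.open(packed, "wt") as out:
            out.write(record)
        print(is_gzip(plain), is_gzip(packed))
```

Checking the content rather than the suffix also catches compressed files that were renamed without a `.gz` extension.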

Fix: replace the code above with the code below, which checks the file suffix to decide whether the input is compressed or uncompressed.

process rename_fasta_ids {
    input:
        tuple val(sampleID), path(fastx, stageAs: "reads/*")
        val srr_id
    output:
        tuple val(sampleID), path ('output/*') , emit: 'fasta_pair_list'
    script:
        file_name = fastx.getBaseName() + '.fasta'
    """
    #!/usr/bin/env python3
    import os
    import gzip  # [added] import the gzip module

    os.makedirs('output', exist_ok=True)

    input_path = '${fastx}'

    # [added] choose how to open the file based on its suffix
    if input_path.endswith('.gz'):
        open_func = gzip.open
        mode = 'rt'  # read as text
    else:
        open_func = open
        mode = 'r'

    # [changed] use open_func instead of open
    with open_func(input_path, mode) as infile, open('output/${file_name}', 'w') as outfile:
        rec_cnt = 1
        skip_next = False
        for line in infile:
            line = line.lstrip()
            if not line:
                continue
            if line[0] in {'>', '@', '+'}:
                # Note: ${srr_id} here is Nextflow variable interpolation; keep it unchanged
                new_id = f"gnl|SRA|SRR{${srr_id}:08d}.{rec_cnt}.1"
                if line[0] in {'>', '@'}:
                    outfile.write(f">{new_id}{os.linesep}")
                if line[0] in {'>', '+'}:
                    rec_cnt += 1
                if line[0] == '+':
                    skip_next = True
            elif skip_next:
                skip_next = False
            else:
                outfile.write(line)
    """
    stub:
        file_name = fastx.getBaseName() + '.fasta'
    """
    mkdir -p output
    echo $srr_id > output/$file_name
    """
}
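One thing worth noting about the script above: it treats any line starting with '>', '@' or '+' as a header or separator, and that test runs before `skip_next` is consulted. All three characters are valid Phred+33 quality characters, so a quality string that happens to begin with one of them would be misread. Whether this contributes to any downstream failure is not established here. The sketch below shows an alternative that reads FASTQ in fixed four-line records instead; it assumes unwrapped four-line records, and the function name `fastq_to_renamed_fasta` and the sample reads are hypothetical.

```python
import io
from itertools import zip_longest


def fastq_to_renamed_fasta(lines, srr_id):
    """Yield FASTA lines with IDs of the form gnl|SRA|SRR<8 digits>.<n>.1.

    Assumes unwrapped four-line FASTQ records:
    header, sequence, '+' separator, quality.
    """
    # Skip blank lines, then consume the input four lines at a time.
    non_blank = (line.strip() for line in lines if line.strip())
    records = zip_longest(*[non_blank] * 4)
    for rec_cnt, (header, sequence, separator, quality) in enumerate(records, start=1):
        if quality is None or not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed or truncated FASTQ record {rec_cnt}")
        yield f">gnl|SRA|SRR{srr_id:08d}.{rec_cnt}.1"
        yield sequence


if __name__ == "__main__":
    # Hypothetical input: both quality strings start with a "marker" character.
    demo = io.StringIO("@read_a\nACGT\n+\n@III\n@read_b\nGGCC\n+\n+III\n")
    for out_line in fastq_to_renamed_fasta(demo, srr_id=1):
        print(out_line)
```

Because the position of each line within its record decides how it is interpreted, the content of the quality string no longer matters. The trade-off is that wrapped (multi-line) FASTQ and FASTA input are not handled by this sketch.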

With this change in place, the downstream minimap2 step still fails:

NOTE: Process `egapx:rnaseq_long_plane:minimap2:minimap2_wnode (50)` terminated with an error exit status (3) -- Execution is retried (3)

I found a solution in https://github.com/ncbi/egapx/issues/166, which replaces the inline Python script with seqkit:

process rename_fasta_ids {
    input:
        tuple val(sampleID), path(fastx, stageAs: "reads/*")
        val srr_id
    output:
        tuple val(sampleID), path ('output/*') , emit: 'fasta_pair_list'
    script:
        file_name = fastx.getBaseName() + '.fasta'
        def srrFmt = String.format('SRR%08d', (srr_id as int))
    """
    mkdir -p output
    seqkit fq2fa -j 32 '${fastx}' | seqkit replace -j 32 -w 0 -p '.*' -r 'gnl|SRA|${srrFmt}.{nr}.1' > output/${file_name}
    """
    stub:
        file_name = fastx.getBaseName() + '.fasta'
    """
    mkdir -p output
    echo $srr_id > output/$file_name
    """
}
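After swapping in the seqkit version, it can be useful to check that the generated headers follow the expected `gnl|SRA|SRR<8 digits>.<n>.1` pattern and are numbered consecutively from 1. The following is a minimal sketch under those assumptions. The pattern is taken from the `-r` replacement string above, while the function name `check_renamed_ids` and the sample lines are hypothetical.

```python
import re

# Header pattern implied by 'gnl|SRA|${srrFmt}.{nr}.1' with srrFmt = 'SRR%08d'.
ID_PATTERN = re.compile(r"^>gnl\|SRA\|SRR(\d{8,})\.(\d+)\.1$")


def check_renamed_ids(lines):
    """Return the number of records; raise ValueError on an unexpected header."""
    expected = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        expected += 1
        match = ID_PATTERN.match(line.rstrip("\r\n"))
        if match is None or int(match.group(2)) != expected:
            raise ValueError(f"unexpected header at record {expected}: {line.strip()}")
    return expected


if __name__ == "__main__":
    # Hypothetical output of the renaming step for two reads.
    sample = [
        ">gnl|SRA|SRR00000001.1.1\n",
        "ACGT\n",
        ">gnl|SRA|SRR00000001.2.1\n",
        "GGCC\n",
    ]
    print(check_renamed_ids(sample))
```

In practice the function could be fed an open file handle for one of the files under `output/`, since it only iterates over lines.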

https://lixiang117423.github.io/article/egapx-3rd-rnaseq/
Author: 李详 (Xiang LI)
Published: December 11, 2025