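Fixing errors in the NCBI genome annotation pipeline (EGAPx) when using long-read (third-generation) transcriptome data

Software version: v.0.4.1-alpha

Error message: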

`[0a/15eb38] NOTE: Process egapx:rnaseq_long_plane:rename_fasta_ids (1) terminated with an error exit status (1) -- Execution is retried (1)`

Roughly, the error means that the step converting FASTQ files to FASTA failed while processing the long-read transcriptome data.

The original script logic in egapx/nf/subworkflows/ncbi/setup/main.nf:

process rename_fasta_ids {
    input:
        tuple val(sampleID), path(fastx, stageAs: "reads/*")
        val  srr_id
    output:
        tuple val(sampleID),  path ('output/*')  , emit: 'fasta_pair_list'
    script:
        file_name = fastx.getBaseName() + '.fasta'
    """
    #!/usr/bin/env python3
    import os
    os.makedirs('output', exist_ok=True)
    with open('${fastx}', 'r') as infile, open('output/${file_name}', 'w') as outfile:
        rec_cnt = 1
        skip_next = False
        for line in infile:
            line = line.lstrip()
            if not line:
                continue
            if line[0] in {'>', '@', '+'}:
                new_id = f"gnl|SRA|SRR{${srr_id}:08d}.{rec_cnt}.1"
                if line[0] in {'>', '@'}:
                    outfile.write(f">{new_id}{os.linesep}")
                if line[0] in {'>', '+'}:
                    rec_cnt += 1
                if line[0] == '+':
                    skip_next = True
            elif skip_next:
                skip_next = False
            else:
                outfile.write(line)
    """
    stub:
        file_name = fastx.getBaseName() + '.fasta'
    """
    mkdir -p output
    echo $srr_id > output/$file_name
    """
}

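This script can only handle uncompressed files, but my input files were compressed.

Solution: replace the code above with the version below, which detects whether the input is compressed (by checking for a `.gz` suffix) or uncompressed.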
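
As a rough illustration of why a gzip file trips up a plain text-mode `open()`, the sketch below writes a tiny gzip-compressed FASTQ file and reads it both ways. The file name, the helper `try_plain_text_read`, and the explicit `utf-8` encoding are assumptions made for the demo; inside the pipeline the exact exception may differ depending on the container's locale.

```python
import gzip
import os
import tempfile


def try_plain_text_read(path):
    """Read the first line with a plain text-mode open(), as the original script does."""
    try:
        # utf-8 is set explicitly (an assumption) so the demo behaves the same everywhere
        with open(path, "r", encoding="utf-8") as handle:
            return handle.readline()
    except UnicodeDecodeError as err:
        return f"UnicodeDecodeError: {err.reason}"


# Write a one-record FASTQ file, gzip-compressed (hypothetical file name)
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "reads.fastq.gz")
with gzip.open(gz_path, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")

# Plain open() sees the raw gzip bytes, which are not valid text
plain_result = try_plain_text_read(gz_path)

# gzip.open() in 'rt' mode decompresses first, then decodes
with gzip.open(gz_path, "rt") as handle:
    gzip_result = handle.readline()

print(plain_result)
print(repr(gzip_result))
```

The second byte of every gzip stream is `0x8b`, which is not a valid UTF-8 start byte, so the plain read fails before it ever reaches a FASTQ header.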

process rename_fasta_ids {
    input:
        tuple val(sampleID), path(fastx, stageAs: "reads/*")
        val  srr_id
    output:
        tuple val(sampleID),  path ('output/*')  , emit: 'fasta_pair_list'
    script:
        file_name = fastx.getBaseName() + '.fasta'
    """
    #!/usr/bin/env python3
    import os
    import gzip  # <--- [added] import the gzip module

    os.makedirs('output', exist_ok=True)

    input_path = '${fastx}'
    
    # <--- [added] check the file suffix and pick the matching open function
    if input_path.endswith('.gz'):
        open_func = gzip.open
        mode = 'rt' # read text mode
    else:
        open_func = open
        mode = 'r'

    # <--- [changed] use open_func instead of open
    with open_func(input_path, mode) as infile, open('output/${file_name}', 'w') as outfile:
        rec_cnt = 1
        skip_next = False
        for line in infile:
            line = line.lstrip()
            if not line:
                continue
            if line[0] in {'>', '@', '+'}:
                # Note: ${srr_id} here is Nextflow variable interpolation; keep it as is
                new_id = f"gnl|SRA|SRR{${srr_id}:08d}.{rec_cnt}.1"
                if line[0] in {'>', '@'}:
                    outfile.write(f">{new_id}{os.linesep}")
                if line[0] in {'>', '+'}:
                    rec_cnt += 1
                if line[0] == '+':
                    skip_next = True
            elif skip_next:
                skip_next = False
            else:
                outfile.write(line)
    """
    stub:
        file_name = fastx.getBaseName() + '.fasta'
    """
    mkdir -p output
    echo $srr_id > output/$file_name
    """
}
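
The patch above decides by file suffix. A possible variation, which is not part of the fix in this post, is to look at the first two bytes of the file instead, since every gzip stream starts with `0x1f 0x8b`. This is only a sketch: `open_text_auto` and the file names are hypothetical, and it would still need to be pasted into the Nextflow script block to be used in the pipeline.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text_auto(path):
    """Open path for text reading, using gzip if the file starts with the gzip magic bytes."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")


# Demo with hypothetical files: one plain file, one gzip file that lacks a .gz suffix
tmp_dir = tempfile.mkdtemp()
record = "@read1\nACGT\n+\nIIII\n"
plain_path = os.path.join(tmp_dir, "plain.fastq")
hidden_gz_path = os.path.join(tmp_dir, "compressed.fastq")  # gzip content, no .gz suffix
with open(plain_path, "w") as handle:
    handle.write(record)
with gzip.open(hidden_gz_path, "wt") as handle:
    handle.write(record)

with open_text_auto(plain_path) as handle:
    plain_text = handle.read()
with open_text_auto(hidden_gz_path) as handle:
    gz_text = handle.read()

print(plain_text == record, gz_text == record)
```

Checking content rather than the name means a compressed file staged under a name without `.gz` is still read correctly.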

The subsequent minimap2 step still fails, though:

NOTE: Process `egapx:rnaseq_long_plane:minimap2:minimap2_wnode (50)` terminated with an error exit status (3) -- Execution is retried (3)
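
The cause of this failure is not pinned down here. One thing that may be worth checking, stated as an assumption rather than a confirmed diagnosis: the Python script above classifies lines by their first character, and FASTQ quality strings can legitimately begin with `@`, `+`, or `>`. The sketch below mirrors that logic on two made-up reads and compares it with a version that relies on line position instead (assuming strict 4-line FASTQ records). The function names and the toy data are hypothetical.

```python
def convert_marker_based(lines, srr_id):
    """Mirror the first-character logic of the rename_fasta_ids Python script."""
    out = []
    rec_cnt = 1
    skip_next = False
    for line in lines:
        line = line.lstrip()
        if not line:
            continue
        if line[0] in {">", "@", "+"}:
            new_id = f"gnl|SRA|SRR{srr_id:08d}.{rec_cnt}.1"
            if line[0] in {">", "@"}:
                out.append(f">{new_id}\n")
            if line[0] in {">", "+"}:
                rec_cnt += 1
            if line[0] == "+":
                skip_next = True
        elif skip_next:
            skip_next = False
        else:
            out.append(line)
    return out


def convert_by_position(lines, srr_id):
    """Alternative sketch: treat the input as strict 4-line FASTQ records."""
    out = []
    records = [line for line in lines if line.strip()]
    # Only complete 4-line records are converted; a truncated tail is ignored
    for rec_cnt, start in enumerate(range(0, len(records) - 3, 4), start=1):
        out.append(f">gnl|SRA|SRR{srr_id:08d}.{rec_cnt}.1\n")
        out.append(records[start + 1])  # second line of the record is the sequence
    return out


# Two toy reads; the quality string of the first one starts with '+'
fastq = ["@r1\n", "ACGT\n", "+\n", "+III\n", "@r2\n", "GGCC\n", "+\n", "IIII\n"]
marker_out = convert_marker_based(fastq, 1)
position_out = convert_by_position(fastq, 1)

print("".join(marker_out))
print("".join(position_out))
```

On this input the marker-based version emits a header for the second read but drops its sequence and skips an ID number, while the position-based version keeps both reads. Whether this is what actually upsets minimap2 in a given run would need to be checked against the real output files.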

I found a workaround at https://github.com/ncbi/egapx/issues/166: replace the embedded Python script with seqkit, as follows.

process rename_fasta_ids {
    input:
        tuple val(sampleID), path(fastx, stageAs: "reads/*")
        val  srr_id
    output:
        tuple val(sampleID),  path ('output/*')  , emit: 'fasta_pair_list'
    script:
        file_name = fastx.getBaseName() + '.fasta'
        def srrFmt = String.format('SRR%08d', (srr_id as int))
    """
    mkdir -p output
    seqkit fq2fa -j 32 '${fastx}' | seqkit replace -j 32 -w 0 -p '.*' -r 'gnl|SRA|${srrFmt}.{nr}.1' > output/${file_name}
    """
    stub:
        file_name = fastx.getBaseName() + '.fasta'
    """
    mkdir -p output
    echo $srr_id > output/$file_name
    """
}
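
Before rerunning the whole pipeline, it can help to sanity-check the FASTA this process writes. The sketch below is one way to do that, under the assumption that headers should look like `>gnl|SRA|SRR` plus eight digits, a record number, and `.1`, as the code above builds them. The function name `check_fasta_lines` and the sample data are hypothetical.

```python
import re

# Header shape taken from the process above: gnl|SRA|SRR%08d.<record number>.1
HEADER_RE = re.compile(r"^>gnl\|SRA\|SRR\d{8}\.\d+\.1$")


def check_fasta_lines(lines):
    """Return a list of problems: malformed headers or headers without a sequence."""
    problems = []
    current_header = None
    seq_len = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(">"):
            # Close out the previous record before starting a new one
            if current_header is not None and seq_len == 0:
                problems.append(f"empty sequence: {current_header}")
            if not HEADER_RE.match(line):
                problems.append(f"unexpected header: {line}")
            current_header, seq_len = line, 0
        else:
            if current_header is None:
                problems.append("sequence before first header")
            seq_len += len(line)
    # The last record has no following header, so check it here
    if current_header is not None and seq_len == 0:
        problems.append(f"empty sequence: {current_header}")
    return problems


good = [">gnl|SRA|SRR00000001.1.1\n", "ACGT\n"]
bad = [">gnl|SRA|SRR00000001.1.1\n", ">gnl|SRA|SRR00000001.2.1\n", "ACGT\n"]
good_problems = check_fasta_lines(good)
bad_problems = check_fasta_lines(bad)
print(good_problems)
print(bad_problems)
```

For a real file, the same function can be fed an open file handle, for example `check_fasta_lines(open("output/reads.fasta"))`, where the path is a placeholder for whatever the process produced.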


Original post: https://lixiang117423.github.io/article/egapx-3rd-rnaseq/

Author: 李详 (Xiang LI)

Published: December 11, 2025
