自学内容网 自学内容网

RoseTTAFold parse_hhr函数解读

parse_hhr 函数用于解析 .hhr 文件并结合 ffindex 数据库提取比对信息。它提取每个比对的详细信息,包括匹配的位置信息、相似性评分、置信度和其他统计信息,最终返回所有有效比对的结果。

在蛋白质结构计算中,该函数的主要作用是从 .hhr 文件中提取与查询序列相关的模板信息,用于下游分析,比如蛋白质的同源建模、功能预测、或结构比对优化。

源代码:

from collections import namedtuple
import re

FFindexEntry = namedtuple("FFindexEntry", "name, offset, length")

def read_index(ffindex_filename):
    entries = []
    
    fh = open(ffindex_filename)
    for line in fh:
        tokens = line.split("\t")
        entries.append(FFindexEntry(tokens[0], int(tokens[1]), int(tokens[2])))
    fh.close()
    
    return entries


# parse HHsearch output
def parse_hhr(filename, ffindex, idmax=105.5):

    # labels present in the database
    label_set = set([i.name for i in ffindex])

    out = []

    with open(filename, "r") as hhr:

        # read .hhr into a list of lines
        lines = [s.rstrip() for _,s in enumerate(hhr)]

        # read list of all hits, 前两个空行之间。
        start = lines.index("") + 2
        stop = lines[start:].index("") + start
        hits = []
        for line in lines[start:stop]:

            # ID of the hit
            #label = re.sub('_','',line[4:10].strip())
            label = line[4:10].strip()

            # position in the query where the alignment starts
            qstart = int(line[75:84].strip().split("-")[0])-1

            # position in the template where the alignment starts
            tstart = int(line[85:94].strip().split("-")[0])-1

            hits.append([label, qstart, tstart, int(line[69:75])]) # line[69:75] 比对上的序列数


        #print(f"hits:{hits}")
        
        # get line numbers where each hit starts
        start = [i for i,l in enumerate(lines) if l and l[0]==">"] # and l[1:].strip() in label_set]

        # process hits
        for idx,i in enumerate(start):

            # skip if hit is too short
            if hits[idx][3] < 10:
                continue

            # skip if template is not in the database
            if hits[idx][0] not in label_set:
                continue

            # get hit statistics
            p,e,s,_,seqid,sim,_,neff = [float(s) for s in re.sub('[=%]', ' ', lines[i+1]).split()[1::2]]
           
            #print(f"p:{p}")
            #print(f"seqid:{seqid}")
            #print(f"s:{s}")
            #print(f"sim:{sim}")
            #print(f"neff:{neff}")
            
            # skip too similar hits
            if seqid > idmax:
                continue

            query = np.array(list(lines[i+3].split()[3]), dtype='|S1')
            tmplt = np.array(list(lines[i+7].split()[3]), dtype='|S1')

            simlr = np.array(list(lines[i+7][22:]), dtype='|S1').view(np.uint8)
            abc = np.array(list(" =-.+|"), dtype='|S1').view(np.uint8)
            
            for k in range(abc.shape[0]):
                simlr[simlr == abc[k]] = k

            confd = np.array(list(lines[i+10][22:]), dtype='|S1

原文地址:https://blog.csdn.net/qq_27390023/article/details/143828354

免责声明:本站文章内容转载自网络资源,如本站内容侵犯了原著者的合法权益,可联系本站删除。更多内容请关注自学内容网(zxcms.com)!