RoseTTAFold parse_hhr函数解读
parse_hhr
函数用于解析 .hhr
文件并结合 ffindex
数据库提取比对信息。它提取每个比对的详细信息,包括匹配的位置信息、相似性评分、置信度和其他统计信息,最终返回所有有效比对的结果。
在蛋白质结构计算中,该函数的主要作用是从 .hhr
文件中提取与查询序列相关的模板信息,用于下游分析,比如蛋白质的同源建模、功能预测、或结构比对优化。
源代码:
from collections import namedtuple
import re
FFindexEntry = namedtuple("FFindexEntry", "name, offset, length")
def read_index(ffindex_filename):
entries = []
fh = open(ffindex_filename)
for line in fh:
tokens = line.split("\t")
entries.append(FFindexEntry(tokens[0], int(tokens[1]), int(tokens[2])))
fh.close()
return entries
# parse HHsearch output
def parse_hhr(filename, ffindex, idmax=105.5):
# labels present in the database
label_set = set([i.name for i in ffindex])
out = []
with open(filename, "r") as hhr:
# read .hhr into a list of lines
lines = [s.rstrip() for _,s in enumerate(hhr)]
# read list of all hits, 前两个空行之间。
start = lines.index("") + 2
stop = lines[start:].index("") + start
hits = []
for line in lines[start:stop]:
# ID of the hit
#label = re.sub('_','',line[4:10].strip())
label = line[4:10].strip()
# position in the query where the alignment starts
qstart = int(line[75:84].strip().split("-")[0])-1
# position in the template where the alignment starts
tstart = int(line[85:94].strip().split("-")[0])-1
hits.append([label, qstart, tstart, int(line[69:75])]) # line[69:75] 比对上的序列数
#print(f"hits:{hits}")
# get line numbers where each hit starts
start = [i for i,l in enumerate(lines) if l and l[0]==">"] # and l[1:].strip() in label_set]
# process hits
for idx,i in enumerate(start):
# skip if hit is too short
if hits[idx][3] < 10:
continue
# skip if template is not in the database
if hits[idx][0] not in label_set:
continue
# get hit statistics
p,e,s,_,seqid,sim,_,neff = [float(s) for s in re.sub('[=%]', ' ', lines[i+1]).split()[1::2]]
#print(f"p:{p}")
#print(f"seqid:{seqid}")
#print(f"s:{s}")
#print(f"sim:{sim}")
#print(f"neff:{neff}")
# skip too similar hits
if seqid > idmax:
continue
query = np.array(list(lines[i+3].split()[3]), dtype='|S1')
tmplt = np.array(list(lines[i+7].split()[3]), dtype='|S1')
simlr = np.array(list(lines[i+7][22:]), dtype='|S1').view(np.uint8)
abc = np.array(list(" =-.+|"), dtype='|S1').view(np.uint8)
for k in range(abc.shape[0]):
simlr[simlr == abc[k]] = k
confd = np.array(list(lines[i+10][22:]), dtype='|S1
原文地址:https://blog.csdn.net/qq_27390023/article/details/143828354
免责声明:本站文章内容转载自网络资源,如本站内容侵犯了原著者的合法权益,可联系本站删除。更多内容请关注自学内容网(zxcms.com)!