Problem description

I have a very large text file. It looks like this:
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000
.....
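Now I want to write a Python script that finds header lines such as > <Enzymologic: Ki nM 1> and > <Enzymologic: EC50/IC50 nM 1>, and prints the line following each of them in tab-separated format, like this: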
> <Enzymologic: Ki nM 1> > <Enzymologic: EC50/IC50 nM 1>
257000 n/a
5000 1000
....
I tried the following code:
infile = path of the file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"
for i, line in enumerate(lines):
    if searchtxt in line and i+1 < len(lines):
        print lines[i+1]
But it does not work properly. Can anyone suggest some code to achieve this?
Thanks in advance.

Solutions

s = '''Enzymologic: Ki nM 1
257000
Enzymologic: IC50 nM 1
n/a
ITC: Delta_G0 kJ/mole 1
n/a
Enzymologic: Ki nM 1
5000
Enzymologic: IC50 nM 1
1000'''
from collections import defaultdict
lines = [x for x in s.splitlines() if x]  # drop empty lines
keys = lines[::2]      # header lines
values = lines[1::2]   # the line that follows each header
result = defaultdict(list)
for key, value in zip(keys, values):
    result[key].append(value)
print(dict(result))
>>> {'Enzymologic: Ki nM 1': ['257000', '5000'], 'Enzymologic: IC50 nM 1': ['n/a', '1000'], 'ITC: Delta_G0 kJ/mole 1': ['n/a']}
Then format the output as needed.
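
One way the formatting step could look is sketched below. This is only an illustration, not part of the answer above: the helper name to_tsv and the empty-string padding for short columns are assumptions, and the dictionary is the one printed by the snippet.

```python
from itertools import zip_longest

def to_tsv(result, columns, fill=""):
    """Render the chosen columns of a {header: [values]} mapping as tab-separated text."""
    rows = ["\t".join(columns)]
    # zip_longest pads shorter columns with `fill` so every row has the same width
    for row in zip_longest(*(result.get(c, []) for c in columns), fillvalue=fill):
        rows.append("\t".join(row))
    return "\n".join(rows)

result = {
    "Enzymologic: Ki nM 1": ["257000", "5000"],
    "Enzymologic: IC50 nM 1": ["n/a", "1000"],
    "ITC: Delta_G0 kJ/mole 1": ["n/a"],
}
print(to_tsv(result, ["Enzymologic: Ki nM 1", "Enzymologic: IC50 nM 1"]))
```

Passing the column names explicitly keeps the column order under your control and lets you leave out headers you do not care about.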

I think your problem comes from the fact that you do if searchtxt in line instead of doing if pattern in line for each pattern in your searchtxt. Here is what I would do:
>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1", "Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i, line in enumerate(lines):
...     for pattern in searchtxt:
...         if pattern in line and i+1 < len(lines):
...             dict_patterns[pattern].append(lines[i+1])
...
>>> dict_patterns
defaultdict(<class 'list'>, {'Enzymologic: Ki nM 1': ['257000\n', '5000\n'], 'Enzymologic: IC50 nM 1': ['n/a\n', '1000']})
Using a dict lets you group the results by pattern (defaultdict is a convenient way to avoid having to initialize each entry yourself).
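
To see why testing the whole tuple against a line cannot work, here is a minimal sketch. The variable names outcome and matches are only illustrative; the strings are the ones from the question.

```python
line = "> <Enzymologic: Ki nM 1>\n"
searchtxt = ("> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>")

try:
    # A tuple on the left of `in` with a str on the right is not a substring test
    searchtxt in line
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

# Testing each pattern separately is what was intended
matches = [pattern for pattern in searchtxt if pattern in line]

print(outcome)
print(matches)
```

If you only need to know whether any pattern matched, any(pattern in line for pattern in searchtxt) expresses the same test in one expression.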

You actually have two separate problems:

Parsing the file and extracting the data from it
# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')

def iterate_on_couple(iterable):
    """
    Iterate on two elements, by two elements
    """
    iterable = iter(iterable)
    for x in iterable:
        yield x, next(iterable)

plain_lines = (l for l in pseudo_file if l.strip())  # ignore empty lines
results = {}
# store all results in a dictionary
for name, value in iterate_on_couple(plain_lines):
    results.setdefault(name, []).append(value)
# now you got a dictionary with all values linked to a name
print(results)
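Now, this code assumes that your file is not corrupted and that it always has the following structure:
blank line
name
value
If that is not the case, you may need something more robust.
Secondly, this stores all the values in memory, which can be a problem if you have a lot of values. In that case, you should look at a storage solution such as the shelve module or sqlite.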
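
As a rough idea of what "more robust" could mean, the sketch below only pairs a value with a line that really looks like a header. The function name parse_records and the assumption that every header starts with "> <" are mine, based on the sample in the question; adjust them to your real file.

```python
def parse_records(lines, marker="> <"):
    """Collect {header: [values]} while tolerating missing or stray lines."""
    results = {}
    pending = None  # header still waiting for its value
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(marker):
            # a header without a value is simply replaced by the next header
            pending = line
        elif pending is not None:
            results.setdefault(pending, []).append(line)
            pending = None
        # any other line is a value without a header and is ignored
    return results

sample = """
> <Enzymologic: Ki nM 1>
257000
stray line
> <Enzymologic: IC50 nM 1>
> <Enzymologic: Ki nM 1>
5000
""".split("\n")

print(parse_records(sample))
```

Because the function only iterates over lines once, it also accepts an open file object directly, so the file itself never has to be loaded as a whole. The collected values are still kept in memory.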

Saving the results to a file
import csv

def get(iterable, index, default):
    """
    Return an item from array or default if IndexError
    """
    try:
        return iterable[index]
    except IndexError:
        return default

names = list(results.keys())  # get a list of all names
# now we write our tab separated file using the csv module
with open('/tmp/test.csv', 'w', newline='') as f:
    out = csv.writer(f, delimiter='\t')
    # first the header
    out.writerow(names)
    # get the size of the longest column
    max_size = max(len(results[name]) for name in names)
    # then write the lines one by one
    for i in range(max_size):
        line = [get(results[name], i, "-") for name in names]
        out.writerow(line)
Since I am writing the whole code for you, I deliberately used some advanced Python idioms so that you have something to think about while using it.

import itertools

def search(lines, terms):
    results = [[t] for t in terms]  # one column per term, header first
    lines = iter(lines)
    for l in lines:
        for i, t in enumerate(terms):
            if t in l:
                # the value is the line right after the matching header
                results[i].append(next(lines, "").strip())
                break
    return results

def format(results):
    s = []
    rows = list(itertools.zip_longest(*results, fillvalue=""))
    for row in rows:
        s.append("\t".join(row))
        s.append('\n')
    return ''.join(s)
Here is how to call the functions:
example = """> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000"""

def test():
    terms = ["> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"]
    lines = example.split('\n')
    result = search(lines, terms)
    print(format(result))

>>> test()
> <Enzymologic: IC50 nM 1>	> <Enzymologic: Ki nM 1>
n/a	257000
	5000
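The example above separates the columns with a single tab. If you need fancier formatting (like in your example), the format function gets a bit more complicated: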
import itertools
import math

def format(results):
    maxcolwidth = [0] * len(results)
    rows = list(itertools.zip_longest(*results, fillvalue=""))
    for row in rows:
        for i, col in enumerate(row):
            # next 8-character tab stop strictly after the text,
            # so every cell is followed by at least one tab
            w = (len(col) // 8 + 1) * 8
            maxcolwidth[i] = max(maxcolwidth[i], w)
    s = []
    for row in rows:
        for i, col in enumerate(row):
            s.append(col)
            padding = maxcolwidth[i] - len(col)
            tabs = int(math.ceil(padding / 8.0))
            s.append('\t' * tabs)
        s.append('\n')
    return ''.join(s)

import re

pseudo_file = """
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000"""
searchtxt = "nzymologic: Ki nM 1>", "<Enzymologic: IC50 nM 1>"
# split each search string into "prefix: ", first word and remainder
regx_AAA = re.compile(r'([^:]+: )([^ \t]+)(.*)')
# allow extra characters around the first word (e.g. "EC50/" before "IC50")
tu = tuple(regx_AAA.sub(r'\1.*?\2.*?\3', x) for x in searchtxt)
model = '%%-%ss %%s\n' % len(searchtxt[0])
regx_BBB = re.compile((r'%s[ \t\r\n]+(.+)[ \t\r\n]+'
                       r'.+?%s[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)') % tu)
print('tu ==', tu)
print('model==', model)
print('regx_BBB.findall(pseudo_file)==')
print(regx_BBB.findall(pseudo_file))
with open('woof.txt', 'w') as f:
    f.write(model % searchtxt)
    f.writelines(model % x for x in regx_BBB.findall(pseudo_file))
Result
tu == ('nzymologic: .*?Ki.*? nM 1>', '<Enzymologic: .*?IC50.*? nM 1>')
model== %-20s %s
regx_BBB.findall(pseudo_file)==
[('257000', 'n/a'), ('5000', '1000')]
The content of the file 'woof.txt' is:
> <Enzymologic: Ki nM 1> > <Enzymologic: IC50 nM 1>
257000 n/a
5000 1000
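To build regx_BBB, I first compute a tuple tu, because you want to capture a line '> <Enzymologic: EC50/IC50 nM 1>' while searchtxt only contains '> <Enzymologic: IC50 nM 1>'.
So the tuple tu introduces .*? into the strings of searchtxt, so that the regex regx_BBB can capture lines containing IC50 and not only lines that are strictly equal to the elements of searchtxt.
Note that, instead of the strings you used, I put the strings "nzymologic: Ki nM 1>" and "<Enzymologic: IC50 nM 1>" in searchtxt, to show that the regex is built in such a way that the result is still obtained.
The only condition is that there must be at least one character before the ':' in each string of searchtxt.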
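
Edit 1
I assumed that in the file, a line '> <Enzymologic: IC50 nM 1>' or '> <Enzymologic: EC50/IC50 nM 1>' always follows a line '> <Enzymologic: Ki nM 1>'.
But after reading the other answers, I think this is not obvious (it is the usual problem with questions: they do not give enough information or precision).
If each line must be captured independently, the following simpler regex regx_BBB can be used: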
regx_AAA = re.compile(r'([^:]+: )([^ \t]+)(.*)')
li = [regx_AAA.sub(r'\1.*?\2.*?\3', x) for x in searchtxt]
regx_BBB = re.compile('|'.join(li).join('()') + r'[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')
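But formatting the output file will then be harder. I am tired of writing a whole new piece of code without knowing exactly what is wanted.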

Probably the simplest way to find a string in a line and then print the next line is to use itertools.islice:
from itertools import islice
searchtxt = "<Enzymologic: IC50 nM 1>"
with open('file.txt', 'r') as itfile:
    for line in itfile:
        if searchtxt in line:
            print(line, end='')
            # islice(itfile, 1) takes the single line that follows the match
            print(''.join(islice(itfile, 1)), end='')
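
The same idea of pulling the next line from the iterator that is already being looped over can be extended to several search strings. The sketch below is only an illustration: the function name values_after is hypothetical, and io.StringIO stands in for the real file so that the example runs on its own.

```python
import io

def values_after(handle, terms):
    """Return {term: [line following each line that contains term]}."""
    found = {term: [] for term in terms}
    for line in handle:
        for term in terms:
            if term in line:
                # next() advances the same iterator, so the value line is consumed here;
                # the "" default covers a header on the very last line
                found[term].append(next(handle, "").strip())
                break
    return found

fake_file = io.StringIO(
    "> <Enzymologic: Ki nM 1>\n257000\n"
    "> <Enzymologic: IC50 nM 1>\nn/a\n"
    "> <Enzymologic: Ki nM 1>\n5000\n"
    "> <Enzymologic: EC50/IC50 nM 1>\n1000\n"
)
terms = ["<Enzymologic: Ki nM 1>", "<Enzymologic: EC50/IC50 nM 1>"]
found = values_after(fake_file, terms)
print(found)
```

With a real file you would pass the object returned by open() instead of fake_file. Only the matched values are kept, so memory use stays small even for a very large input.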