Find multiple words and print the next line using Python

Question

I have a large text file. It looks like this:
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000

.....
Now I want to write a Python script that finds strings such as (
> <Enzymologic: Ki nM 1>
> <Enzymologic: EC50/IC50 nM 1>
) and prints the line following each of them in tab-separated format, like this:
> <Enzymologic: Ki nM 1>     > <Enzymologic: EC50/IC50 nM 1>
257000                       n/a
5000                         1000
.... 
I tried the following code:
infile = path of the file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>","> <Enzymologic: Ki nM 1>"
for i,line in enumerate(lines): 
     if searchtxt in line and i+1 < len(lines):
         print lines[i+1]
But it does not work properly. Can anyone suggest some code to achieve this? Thanks in advance.

Answers

        
s = '''Enzymologic: Ki nM 1

257000

Enzymologic: IC50 nM 1

n/a

ITC: Delta_G0 kJ/mole 1

n/a

Enzymologic: Ki nM 1

5000

Enzymologic: IC50 nM 1

1000'''
from collections import defaultdict

# drop the empty lines; names and values then alternate
lines = [x for x in s.splitlines() if x]
keys = lines[::2]     # every even line is a name
values = lines[1::2]  # every odd line is the value that follows it
result = defaultdict(list)
for key, value in zip(keys, values):
    result[key].append(value)
print(dict(result))

Output:
{'Enzymologic: Ki nM 1': ['257000', '5000'], 'Enzymologic: IC50 nM 1': ['n/a', '1000'], 'ITC: Delta_G0 kJ/mole 1': ['n/a']}
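Then format the output as needed.

I think your problem comes from the fact that you test
if searchtxt in line
which checks the whole tuple at once, instead of testing each pattern in searchtxt against the line. Here is what I would do: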
>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1", "Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i, line in enumerate(lines):
...     for pattern in searchtxt:
...         if pattern in line and i+1 < len(lines):
...             dict_patterns[pattern].append(lines[i+1])
...
>>> dict_patterns
defaultdict(<class 'list'>, {'Enzymologic: Ki nM 1': ['257000\n', '5000\n'], 'Enzymologic: IC50 nM 1': ['n/a\n', '1000']})
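The difference between the two tests can be checked in isolation. This is a minimal sketch, assuming Python 3 and a made-up sample line: testing the whole tuple with in raises a TypeError, while testing each pattern separately works.

```python
searchtxt = ("> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>")
line = "> <Enzymologic: Ki nM 1>\n"  # hypothetical sample line

# Testing the whole tuple against a string is not a substring test
try:
    searchtxt in line
    tuple_test = "no error"
except TypeError:
    tuple_test = "TypeError"

# Testing each pattern on its own, as in the loop above
matches = [pattern for pattern in searchtxt if pattern in line]

print(tuple_test)  # TypeError
print(matches)     # ['> <Enzymologic: Ki nM 1>']
```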
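Using a dict lets you group the results by pattern (defaultdict is a convenient way to avoid initializing each entry yourself).

You really have two separate problems here.

Parsing the file and extracting the data from it: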
# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')

def iterate_on_couple(iterable):
  """
    Iterate on two elements, by two elements
  """
  iterable = iter(iterable)
  for x in iterable:
    y = next(iterable, None)
    if y is None:  # odd number of items: no value for the last name
      return
    yield x, y

# ignore empty lines and strip surrounding whitespace
plain_lines = (l.strip() for l in pseudo_file if l.strip())

results = {}

# store all results in a dictionary
for name, value in iterate_on_couple(plain_lines):
  results.setdefault(name, []).append(value)

# now you have a dictionary with all values linked to a name
print(results)
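Now, this code assumes that your file is not corrupted and that you always have the following structure:
blank
name
value
If not, you may need something more robust. Second, this stores all the values in memory, which can be a problem if you have a lot of values. In that case, you need to look at a storage solution such as the shelve module or sqlite.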
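If the values do not fit in memory, the pairs could be written to a database as they are read instead of being collected in a dictionary. Below is a minimal sketch using the standard sqlite3 module. The table name results, the helper store_pairs and the sample lines are assumptions made for illustration, and the input is assumed to follow the same blank / name / value structure.

```python
import sqlite3

def store_pairs(lines, db_path=":memory:"):
    """Store (name, value) pairs in SQLite as they are read (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (name TEXT, value TEXT)")
    plain = (l.strip() for l in lines if l.strip())  # skip blank lines
    # zip(plain, plain) pairs each name line with the value line after it
    conn.executemany("INSERT INTO results VALUES (?, ?)", zip(plain, plain))
    conn.commit()
    return conn

# made-up sample standing in for the real file
sample = """> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <Enzymologic: Ki nM 1>
5000
""".split("\n")

conn = store_pairs(sample)
ki_values = [row[0] for row in conn.execute(
    "SELECT value FROM results WHERE name = ? ORDER BY rowid",
    ("> <Enzymologic: Ki nM 1>",))]
print(ki_values)  # ['257000', '5000']
```

For a real file, an open file object can be passed as lines and a file path as db_path, so that only one line at a time is held in memory.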
Saving the result to a file:
import csv

def get(iterable, index, default):
  """
    Return an item from a sequence, or default on IndexError
  """
  try:
      return iterable[index]
  except IndexError:
      return default

names = list(results)  # get a list of all names

# get the size of the longest column
max_size = max(len(results[name]) for name in names)

# now we write our tab separated file using the csv module
# (newline='' is what the csv module expects for files it writes to)
with open('/tmp/test.csv', 'w', newline='') as f:
    out = csv.writer(f, delimiter='\t')

    # first the header
    out.writerow(names)

    # then write the lines one by one, padding short columns with "-"
    for i in range(max_size):
        line = [get(results[name], i, "-") for name in names]
        out.writerow(line)
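Since I am writing the whole code for you, I deliberately used some advanced Python idioms, to give you something to think about while using it.

import itertools

def search(lines, terms):
    # one column per term, with the term itself as header
    results = [[t] for t in terms]
    lines = iter(lines)
    for l in lines:
        for i, t in enumerate(terms):
            if t in l:
                # take the following line; "" if the match is on the last line
                results[i].append(next(lines, "").strip())
                break
    return results

def format(results):
    s = []
    # pad the shorter columns with empty strings
    rows = list(itertools.zip_longest(*results, fillvalue=""))
    for row in rows:
        s.append("\t".join(row))
        s.append('\n')
    return ''.join(s)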
Here is how to call the functions:
example = """> <Enzymologic: Ki nM 1>
257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

def test():
    terms = ["> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"]
    lines = example.split('\n')
    result = search(lines, terms)
    print(format(result))
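>>> test()
> <Enzymologic: IC50 nM 1>	> <Enzymologic: Ki nM 1>
n/a	257000
The example above separates each column with a single tab. If you need fancier formatting (like in your example), the format function gets a bit more complicated: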
import itertools
import math

def format(results):
    rows = list(itertools.zip_longest(*results, fillvalue=""))

    # width of each column, rounded up to the next tab stop (8 characters);
    # len(col) // 8 + 1 guarantees at least one tab after the longest entry
    maxcolwidth = [0] * len(results)
    for row in rows:
        for i, col in enumerate(row):
            w = (len(col) // 8 + 1) * 8
            maxcolwidth[i] = max(maxcolwidth[i], w)

    s = []
    for row in rows:
        for i, col in enumerate(row):
            s.append(col)
            padding = maxcolwidth[i] - len(col)
            tabs = int(math.ceil(padding / 8.0))
            s.append('\t' * tabs)
        s.append('\n')

    return ''.join(s)

import re

pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

searchtxt = "nzymologic: Ki nM 1>", "<Enzymologic: IC50 nM 1>"

# splits a search string into 'prefix: ', first word, rest
regx_AAA = re.compile(r'([^:]+: )([^ \t]+)(.*)')

# loosen each search string by inserting .*? around its first word
tu = tuple(regx_AAA.sub(r'\1.*?\2.*?\3', x) for x in searchtxt)

model = '%%-%ss  %%s\n' % len(searchtxt[0])

regx_BBB = re.compile((r'%s[ \t\r\n]+(.+)[ \t\r\n]+'
                       r'.+?%s[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)') % tu)


print('tu   ==', tu)
print('model==', model)
print('regx_BBB.findall(pseudo_file)==')
print(regx_BBB.findall(pseudo_file))



with open('woof.txt', 'w') as f:
    f.write(model % searchtxt)
    f.writelines(model % x for x in regx_BBB.findall(pseudo_file))
Result:
tu   == ('nzymologic: .*?Ki.*? nM 1>', '<Enzymologic: .*?IC50.*? nM 1>')
model== %-20s  %s

regx_BBB.findall(pseudo_file)==
[('257000', 'n/a'), ('5000', '1000')]
The content of the file 'woof.txt' is:
> <Enzymologic: Ki nM 1>  > <Enzymologic: IC50 nM 1>
257000                    n/a
5000                      1000
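To build regx_BBB, I first compute a tuple tu, because the line you want to catch (the EC50/IC50 one) is not exactly equal to the IC50 string in searchtxt. The tuple tu therefore introduces .*? into the strings of searchtxt, so that the regex regx_BBB can catch lines containing IC50 and not only lines strictly equal to the elements of searchtxt.
Note that, instead of the strings you used, I put the strings
"nzymologic: Ki nM 1>"
"<Enzymologic: IC50 nM 1>"
in searchtxt, to show that the regex is built in such a way that it still gets the results. The only condition is that there must be at least one character before the ':' in each string of searchtxt.

Edit 1
I had assumed that, in the file, an IC50 line such as
'> <Enzymologic: EC50/IC50 nM 1>'
should always follow a Ki line. But after reading the other answers, I think this is not obvious (the usual problem: the question does not give enough information and precision). If each line must be caught independently, the following simpler regex regx_BBB can be used: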
regx_AAA = re.compile(r'([^:]+: )([^ \t]+)(.*)')

li = [regx_AAA.sub(r'\1.*?\2.*?\3', x) for x in searchtxt]

# one alternation of the loosened search strings, then the value on the next line
regx_BBB = re.compile('|'.join(li).join('()') + r'[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')
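As a rough sketch of how an alternation pattern of this shape could be used: each match carries the header it came from, so the values can be grouped per search string afterwards. The sample text and the grouping loop are assumptions added for illustration; the pattern itself is built the same way as above.

```python
import re
from collections import defaultdict

# made-up sample standing in for the file content
text = """> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

searchtxt = ("nzymologic: Ki nM 1>", "<Enzymologic: IC50 nM 1>")

regx_AAA = re.compile(r'([^:]+: )([^ \t]+)(.*)')
li = [regx_AAA.sub(r'\1.*?\2.*?\3', x) for x in searchtxt]
regx_BBB = re.compile('(' + '|'.join(li) + ')'
                      + r'[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')

grouped = defaultdict(list)
for header, value in regx_BBB.findall(text):
    # file the value under the first search string whose loosened pattern fits the header
    for term, pat in zip(searchtxt, li):
        if re.search(pat, header):
            grouped[term].append(value)
            break

print(dict(grouped))
```

With this sample, the Ki values and the IC50 values (including the EC50/IC50 line) end up in two separate lists.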
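But writing the output file in the desired format will be harder. I am tired of writing new complete code without knowing what is precisely wanted.

The simplest way to find a string in a line and then print the next line is probably to use itertools islice: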
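from itertools import islice

searchtxt = "<Enzymologic: IC50 nM 1>"
with open('file.txt', 'r') as itfile:
    for line in itfile:
        if searchtxt in line:
            # lines read from a file keep their newline, hence end=''
            print(line, end='')
            # islice takes the next line from the same file iterator
            print(''.join(islice(itfile, 1)), end='')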
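The reason this works is that islice pulls from the same iterator the for loop is consuming, so the line it takes is skipped by the loop. A small sketch with an in-memory stand-in for the file (fake_file and its contents are made up for illustration):

```python
import io
from itertools import islice

# in-memory stand-in for an open text file
fake_file = io.StringIO("header A\nvalue A\nheader B\nvalue B\n")

found = []
for line in fake_file:
    if "header B" in line:
        # islice advances the shared iterator by one line
        found.append(''.join(islice(fake_file, 1)).strip())

print(found)  # ['value B']
```

If the search string is on the last line, islice yields nothing and the joined result is an empty string rather than an error.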
    
