Find multiple words and print the next line using Python

Problem description

I have a large text file. It looks like this:

> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000

.....

Now I want to create a Python script that finds words like (> <Enzymologic: Ki nM 1>, > <Enzymologic: EC50/IC50 nM 1>) and prints the line following each of them in tab-delimited format, as follows:

> <Enzymologic: Ki nM 1>     > <Enzymologic: EC50/IC50 nM 1>
257000                       n/a
5000                         1000
....

I tried the following code:

infile = path of the file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>","> <Enzymologic: Ki nM 1>"
for i,line in enumerate(lines): 
     if searchtxt in line and i+1 < len(lines):
         print lines[i+1]

But it does not work. Can anyone suggest some code to achieve this? Thanks in advance.

Solutions

s = '''Enzymologic: Ki nM 1

257000

Enzymologic: IC50 nM 1

n/a

ITC: Delta_G0 kJ/mole 1

n/a

Enzymologic: Ki nM 1

5000

Enzymologic: IC50 nM 1

1000'''
from collections import defaultdict

lines = [x for x in s.splitlines() if x]
keys = lines[::2]
values = lines[1::2]
result = defaultdict(list)
for key,value in zip(keys,values):
    result[key].append(value)
print(dict(result))

>>> {'ITC: Delta_G0 kJ/mole 1': ['n/a'],'Enzymologic: Ki nM 1': ['257000','5000'],'Enzymologic: IC50 nM 1': ['n/a','1000']}

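Then format the output as needed.

I think your problem comes from the fact that you do if searchtxt in line instead of checking if pattern in line for each pattern in your searchtxt. Here is what I would do: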

>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1","Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i,line in enumerate(lines):
    for pattern in searchtxt:
        if pattern in line and i+1 < len(lines):
            dict_patterns[pattern].append(lines[i+1])

>>> dict_patterns
defaultdict(<type 'list'>,{'Enzymologic: Ki nM 1': ['257000\n','5000\n'],'Enzymologic: IC50 nM 1': ['n/a\n','1000']})
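
As a side note on that diagnosis, here is a minimal Python 3 sketch of why the original test fails. Because searchtxt is a tuple, the expression searchtxt in line raises a TypeError, whereas testing each pattern separately works. The sample line and the variable names tuple_check_failed, matched and any_match are made up for this illustration.

```python
# Python 3 sketch; the sample line and the result names are illustrative only.
searchtxt = "> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"  # a tuple
line = "> <Enzymologic: Ki nM 1>\n"

try:
    searchtxt in line              # tuple on the left, str on the right
    tuple_check_failed = False
except TypeError:
    tuple_check_failed = True      # a str can only be searched for another str

# Testing each pattern on its own works as intended.
matched = [pattern for pattern in searchtxt if pattern in line]
any_match = any(pattern in line for pattern in searchtxt)

print(tuple_check_failed, matched, any_match)
```

The any() form is enough when you only need to know that some pattern matched; the list form also tells you which one.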

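Using a dict lets you group the results by pattern (defaultdict is a convenient way to avoid having to initialize your objects yourself).

You really have two separate problems:

Parsing the file and extracting data from it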

import itertools

# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')

def iterate_on_couple(iterable):
  """
    Iterate on two elements, by two elements
  """
  iterable = iter(iterable)
  for x in iterable:
    yield x,next(iterable)

plain_lines = (l for l in pseudo_file  if l.strip()) # ignore empty lines

results = {}

# store all results in a dictionary
for name,value in iterate_on_couple(plain_lines):
  results.setdefault(name,[]).append(value)

# now you have a dictionary with all the values linked to a name
print(results)

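Now, this code assumes that your file is not corrupted and that you always have the structure:

blank
name
value

If not, you may want something more robust. Secondly, this stores all the values in memory, which can be a problem if you have a lot of values. In that case, you will need to look at a storage solution such as the shelve module or sqlite.

Saving the result to a file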

import csv

def get(iterable,index,default):
  """
    Return an item from array or default if IndexError
  """
  try:
      return iterable[index]
  except IndexError:
      return default

names = list(results) # get a list of all names

# now we write our tab separated file using the csv module
# (the with block makes sure the file is closed once everything is written)
with open('/tmp/test.csv','w') as f:
    out = csv.writer(f,delimiter='\t')

    # first the header
    out.writerow(names)

    # get the size of the longest column
    max_size = max(len(results[name]) for name in names)

    # then write the lines one by one
    for i in range(max_size):
        line = [get(results[name],i,"-") for name in names]
        out.writerow(line)

Since I am writing the whole code for you, I deliberately used some advanced Python idioms, so that you have something to think about while using it.
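
The caveat about corrupted files can be handled by pairing each value with the header line that precedes it, instead of assuming a strict name/value alternation. The following is only a Python 3 sketch under one assumption taken from the sample in the question: header lines start with "> <". The function name parse_records and the sample data are hypothetical.

```python
from collections import defaultdict

def parse_records(lines):
    """Group values by header; a header is assumed to start with '> <'."""
    results = defaultdict(list)
    current = None                      # header still waiting for its value
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                    # skip blank lines
        if line.startswith("> <"):
            current = line              # a header without a value is simply replaced
        elif current is not None:
            results[current].append(line)
            current = None              # only the first line after a header counts
    return dict(results)

# The IC50 header below has no value on purpose, to imitate a damaged file.
sample = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>

> <ITC: Delta_G0 kJ/mole 1>
n/a
""".split("\n")

parsed = parse_records(sample)
print(parsed)
```

A header with a missing value is dropped here rather than shifting every later pair by one line, which is what a strict two-by-two iteration would do.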

import itertools

def search(lines,terms):
    results = [[t] for t in terms]
    lines = iter(lines)
    for l in lines:
        for i,t in enumerate(terms):
            if t in l:
                # next(lines, "") avoids a StopIteration when a term is on the last line
                results[i].append(next(lines, "").strip())
                break
    return results

def format(results):
    s = []
    rows = list(itertools.zip_longest(*results,fillvalue=""))  # izip_longest in Python 2
    for row in rows:
        s.append("\t".join(row))
        s.append('\n')
    return ''.join(s)

Here is how to call the functions:

example = """> <Enzymologic: Ki nM 1>
257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

def test():
    terms = ["> <Enzymologic: IC50 nM 1>","> <Enzymologic: Ki nM 1>"]
    lines = example.split('\n')
    result = search(lines,terms)
    print(format(result))

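>>> test()
> <Enzymologic: IC50 nM 1>	> <Enzymologic: Ki nM 1>
n/a	257000

The example above separates each column with a single tab. If you need fancier formatting (like in your example), then the format function becomes a bit more complex: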

import math

def format(results):
    # width of each column, rounded up to the next tab stop (multiples of 8)
    maxcolwidth = [0] * len(results)
    rows = list(itertools.zip_longest(*results,fillvalue=""))  # izip_longest in Python 2
    for row in rows:
        for i,col in enumerate(row):
            # always leave room for at least one tab after the widest cell
            w = (len(col)//8 + 1)*8
            maxcolwidth[i] = max(maxcolwidth[i],w)

    s = []
    for row in rows:
        for i,col in enumerate(row):
            s.append(col)
            padding = maxcolwidth[i]-len(col)
            tabs = int(math.ceil(padding/8.0))
            s.append('\t' * tabs)
        s.append('\n')

    return ''.join(s)
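
If real tab characters are not required, padding with spaces gives the same aligned layout without the tab-stop arithmetic. This is a Python 3 sketch of that alternative, not the method used above; the name format_columns, the gap parameter and the sample data are assumptions made for the example.

```python
from itertools import zip_longest

def format_columns(results, gap=2):
    """Left-justify every column to its widest cell, padding with spaces."""
    rows = list(zip_longest(*results, fillvalue=""))
    widths = [max(len(cell) for cell in column) for column in results]
    lines = []
    for row in rows:
        cells = [cell.ljust(width + gap) for cell, width in zip(row, widths)]
        lines.append("".join(cells).rstrip())   # drop trailing padding
    return "\n".join(lines) + "\n"

# Each inner list is one column: the header first, then its values.
results = [
    ["> <Enzymologic: Ki nM 1>", "257000", "5000"],
    ["> <Enzymologic: IC50 nM 1>", "n/a"],
]
text = format_columns(results)
print(text)
```

Space padding only lines up in a fixed-width font, so keep the tab-based version if the output is meant to be imported as tab-delimited data.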

import re

pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

searchtxt = "nzymologic: Ki nM 1>","<Enzymologic: IC50 nM 1>"

# raw strings keep the regex escapes (\t, \Z, \1 ...) intact
regx_AAA = re.compile(r'([^:]+: )([^ \t]+)(.*)')

tu = tuple(regx_AAA.sub(r'\1.*?\2.*?\3',x) for x in searchtxt)

model = '%%-%ss  %%s\n' % len(searchtxt[0])

regx_BBB = re.compile((r'%s[ \t\r\n]+(.+)[ \t\r\n]+'
                       r'.+?%s[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)') % tu)


print('tu   ==',tu)
print('model==',model)
print('regx_BBB.findall(pseudo_file)==')
print(regx_BBB.findall(pseudo_file))



with open('woof.txt','w') as f:
    f.write(model % searchtxt)
    f.writelines(model % x for x in regx_BBB.findall(pseudo_file))

Result

tu   == ('nzymologic: .*?Ki.*? nM 1>','<Enzymologic: .*?IC50.*? nM 1>')
model== %-20s  %s

regx_BBB.findall(pseudo_file)==
[('257000','n/a'),('5000','1000')]

The content of the file 'woof.txt' is:

> <Enzymologic: Ki nM 1>  > <Enzymologic: IC50 nM 1>
257000                    n/a
5000                      1000

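To obtain regx_BBB, I first compute a tuple tu, because you want to catch a line > <Enzymologic: EC50/IC50 nM 1> but there is only "> <Enzymologic: IC50 nM 1>" in searchtxt.

So the tuple tu introduces .*? into the strings of searchtxt, so that the regex regx_BBB is able to catch lines containing IC50 and not only lines strictly equal to the elements of searchtxt.

Note that instead of the strings you use, I put the strings "nzymologic: Ki nM 1>" and "<Enzymologic: IC50 nM 1>" in searchtxt, to show that the regexes are built in such a way that the result is still obtained. The only condition is that there must be at least one character before the ':' in each of the strings of searchtxt.

EDIT 1

I thought that in the file, a line '> <Enzymologic: IC50 nM 1>' or '> <Enzymologic: EC50/IC50 nM 1>' should always follow a line '> <Enzymologic: Ki nM 1>'. But after reading the other answers, I think this is not evident (that is the common problem with questions: they do not give enough information and precision). If every line must be caught independently, the following simpler regex regx_BBB can be used: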

regx_AAA = re.compile(r'([^:]+: )([^ \t]+)(.*)')

li = [ regx_AAA.sub(r'\1.*?\2.*?\3',x) for x in searchtxt]

regx_BBB = re.compile('|'.join(li).join('()') + r'[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')

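But formatting the output file will be harder. I am tired of writing a new complete code without knowing precisely what is wanted.

The simplest way to look for a string in a line and then print the next line is probably to use itertools islice: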

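    from itertools import islice
    searchtxt = "<Enzymologic: IC50 nM 1>"
    with open('file.txt','r') as itfile:
        for line in itfile:
            if searchtxt in line:
                # lines read from a file already end with a newline
                print(line, end='')
                # islice(itfile,1) consumes exactly the next line
                print(''.join(islice(itfile,1)), end='')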
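
The same idea extends to several search terms. This is a small Python 3 sketch that uses next() with a default value on the line iterator; it runs on an in-memory list so that it is self-contained, but an open file object could be passed instead. The name next_lines and the sample data are hypothetical.

```python
def next_lines(lines, terms):
    """Yield (term, following line) for every line containing one of the terms."""
    it = iter(lines)
    for line in it:
        for term in terms:
            if term in line:
                # consume the following line; "" if the input ends here
                yield term, next(it, "").strip()
                break

sample = [
    "> <Enzymologic: Ki nM 1>", " 257000", "",
    "> <Enzymologic: IC50 nM 1>", "n/a", "",
    "> <Enzymologic: Ki nM 1>", "5000",
]
terms = ("> <Enzymologic: Ki nM 1>", "> <Enzymologic: IC50 nM 1>")

pairs = list(next_lines(sample, terms))
for term, value in pairs:
    print(term, value, sep="\t")
```

Because the generator yields pairs one at a time, it does not need to hold the whole file in memory.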
