Extracting data from lines of a text file based on a specific pattern using Python

Problem description

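I have a huge report file containing some data, and I have to do some processing on the lines whose code is "MLT-TRR". So far, my script extracts every line that starts with that code and places it in a separate file. The new file, Rules.txt, looks like this: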

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  63   10   Port is not registered [Folder: 'Picture']

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\tree.png 315  10   Port is not registered [Folder: 'Picture.first_inst']

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  315  10   Port is not registered [Folder: 'Picture.second_inst']

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\tree.png 317  10   Port is not registered [Folder: 'Picture.third_inst']

MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  317  10   Port is not registered [Folder: 'Picture.fourth_inst']

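For each line, I have to extract the data that follows "[Folder: 'Picture". If there is no data after "[Folder: 'Picture", as in my first line, the line should be skipped and processing should move on to the next one. I also want to extract the file name from each line: top.png, tree.png.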

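I can't think of a simple way to do this, since it involves loops and gets messy. Is there a way to do it, extracting only the file path and the trailing data from each line?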

import os
import sys


folder_path = os.path.dirname(os.path.abspath(__file__))
inFile1 = 'Rules.txt'
inFile2 = 'TopRules.txt'

def open_file(filename):
    try:
        # Collect every line that contains the MLT-TRR code.
        with open(filename, 'r') as f:
            targets = [line for line in f if "MLT-TRR" in line]
        print(targets)
        # Each line already ends with a newline, so write it unchanged.
        with open(inFile1, "w") as f2:
            f2.writelines(targets)
    except Exception as e:
        # Report the problem and exit with a non-zero status only on failure.
        print(str(e))
        sys.exit(1)


if __name__ == '__main__':
    filename = sys.argv[1]
    open_file(filename)

Solution

To extract the file name and the other data, you should be able to use a regular expression:

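import re

# Compile once; group 1 is the Windows path, group 2 the text after "Picture."
pattern = re.compile(r"^MLT-TRR.*([A-Za-z]:\\[-A-Za-z0-9_:\\.]+).*\[Folder: 'Picture\.(\w+)']")

# Rules.txt is the file of MLT-TRR lines written by the script in the question.
with open('Rules.txt') as f:
    for line in f:
        match = pattern.match(line)
        if match:
            filename = match.group(1)
            data = match.group(2)
            print(filename, data)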

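This assumes that the data after 'Picture. contains only alphanumeric characters and underscores. If you have unusual file names, you may have to change the allowed characters in the file-name part, [-A-Za-z0-9_:\\.]. It also assumes that the file name starts with a Windows drive letter (so it is an absolute path), which makes it easier to tell apart from the other data on the line.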
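
To make the skipping behaviour concrete, here is a small sketch that runs the same pattern over two of the sample lines from the question. The helper name `extract` and the list `sample_lines` are made up for this illustration; a line without anything after 'Picture simply does not match and comes back as None.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(
    r"^MLT-TRR.*([A-Za-z]:\\[-A-Za-z0-9_:\\.]+).*\[Folder: 'Picture\.(\w+)'\]"
)

def extract(line):
    """Return (path, data) for a matching line, or None if the line should be skipped."""
    match = PATTERN.match(line)
    return match.groups() if match else None

# Two sample lines copied from Rules.txt in the question.
sample_lines = [
    r"MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  63   10   Port is not registered [Folder: 'Picture']",
    r"MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\tree.png 315  10   Port is not registered [Folder: 'Picture.first_inst']",
]

results = [extract(line) for line in sample_lines]
for line_result in results:
    print(line_result)
```

The first line yields None because there is no dot and suffix after Picture, so it can be filtered out with a plain `if` check.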

If you only want the base name of the file, you can use os.path.basename or pathlib.Path.name after extracting the file name.
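
One caveat worth sketching: os.path.basename only treats backslashes as separators when the script itself runs on Windows. Assuming the report might also be processed on Linux or macOS, the Windows-specific variants from the standard library behave the same everywhere. The variable names below are illustrative only.

```python
import ntpath
from pathlib import PureWindowsPath

# A path as it appears in the report lines.
full_path = r"C:\Users\Di\Pictures\SavedPictures\top.png"

# PureWindowsPath parses backslash-separated paths on any operating system.
name_from_pathlib = PureWindowsPath(full_path).name

# ntpath is the Windows flavour of os.path and can be imported on any platform.
name_from_ntpath = ntpath.basename(full_path)

print(name_from_pathlib, name_from_ntpath)
```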

I had a very similar problem and solved it by using a regex to search for the specific line "key" (MLT-TRR in your case) and then specifying which "bytes" to take from that line. The selected data is then appended to an array.

import re  # regular expression module

# Path to the input file (placeholder; point this at your own file)
import_file_path = 'P190.txt'

# Make empty lists:
P190 = []       # my file
shot = []       # events in my file (multiple lines of text for each event)
S011east = []   # what I want
S011north = []  # another thing I want

# Create your regex:
S011 = re.compile(r"^S0\w*\W*11\b")

# Search and append:
# Open P190 file
with open(import_file_path, 'rt') as infile:
    for lines in infile:
        P190.append(lines.rstrip('\n'))

# Locate specific lines and extract data by fixed character positions
for line in P190:
    if S011.search(line) is not None:
        easting = float(line[47:55])
        S011east.append(easting)
        northing = float(line[55:64])
        S011north.append(northing)

If you set your regex to look for "MLT-TRR ????? Folder: 'Picture.'", it will skip any lines that have no further information.

As for the second part of your question: I doubt your file names have a fixed length, so the method above won't work, since you can't specify the number of bytes to extract. Instead, extract the name and extension from the file path; you can apply that to whatever you pull out of each line.
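
For the file-name part, a minimal sketch of one way to split the name and extension without relying on fixed positions. It assumes the path is the third whitespace-separated field and contains no spaces; `LINE_RE` and `parse_line` are hypothetical names introduced here, not part of the original answer.

```python
import ntpath
import re

# Assumed layout: MLT-TRR key, severity, path (no spaces), then a non-empty
# suffix after "Picture." inside the [Folder: '...'] part.
LINE_RE = re.compile(r"^MLT-TRR\s+\S+\s+(\S+).*\[Folder: 'Picture\.([^']+)'\]")

def parse_line(line):
    """Return (name, extension, suffix), or None for lines to skip."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    path, suffix = match.groups()
    # ntpath handles backslash-separated paths on any operating system.
    name, extension = ntpath.splitext(ntpath.basename(path))
    return name, extension, suffix

skipped = parse_line(
    r"MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\top.png  63   10   Port is not registered [Folder: 'Picture']"
)
parsed = parse_line(
    r"MLT-TRR                         Warning     C:\Users\Di\Pictures\SavedPictures\tree.png 317  10   Port is not registered [Folder: 'Picture.third_inst']"
)
print(skipped, parsed)
```

If your paths can contain spaces, the `(\S+)` group would need to be replaced with something stricter, such as the drive-letter pattern from the first answer.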
