循环遍历文件时python枚举超出范围

问题描述

我有一个名为 test.txt 的路径文件

/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz

请注意,行数是偶数且始终是偶数,我的最终目标是解析此文件并创建一个新的文件,以两两为基础循环遍历这些路径。我正在尝试 enumerate 函数,但这不会两两解析。此外,我超出了范围,因为我正在做的索引方式是错误的。如果有人能告诉我如何使用 enumerate 正确编制索引,那也太好了。

 with open('./src/test.txt') as f:
    for index,line in enumerate(f):
        sample = re.search(r'pfg[\dGT]+',line)
        sample_string = sample.group(0)
        #print(sample_string)
        print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(sample_string,line,line[index+1]))

结果是这样的:

{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"g","library":"pfg001G"},{"name":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"r",{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pfg001T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"o","library":"pfg001T"},"fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"u",{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pfg002G","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"p","library":"pfg002G"},"fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"s",{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/","library":"pfg002T"},"fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz","fastq_2":"c",

显然索引是错误的,因为它遍历了我路径的每个元素,例如 g r 等,而不是打印下一个路径。对于第一次迭代,打印的下一条路径应该是:"fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz".

我相信问题本身可以用 itertools 更优雅地解决我只是不知道该怎么做。如果有人能告诉我 enumerate 的索引是否也可以工作,那就太好了。

解决方法

一个问题是,您试图在读取数据对之前访问它的第二行中的数据。此外,您不能使用 line[index + 1] 访问第二行,因为它指的是当前行中的一个字符,而不是尚未读取的下一行。

所以你需要跟踪成对的行。您可以使用 enumerate() 提供的索引来确定当前行是第一行(因为它是偶数)还是第二行(因为它是奇数)。阅读第一行时存储 fastq_1 的名称和路径。只在第二行写输出。像这样:

重新导入

with open('test.txt') as f:
    for index,line in enumerate(f):
        if index % 2 == 0:      # even,so this is the first line of a pair
            name = re.search(r'pfg[\dGT]+',line).group(0)
            fastq_1 = line.rstrip()
        else:                   # odd,so second line. Emit result
            fastq_2 = line.rstrip()
            print('{{"name":"{0}","readgroup":"{0}","platform_unit":"{0}","fastq_1":"{1}","fastq_2":"{2}","library":"{0}"}},'.format(name,fastq_1,fastq_2))

line.rstrip() 用于删除每行末尾的尾随换行符。

,

@mhawke 已经提供了一个很好的解决方案,但要提供另一种方法,可以使用 more_itertools 库中的 more_itertools.chunked function 或 {{ 3}} 来自 Python 手册。

这也提供了当最后一行是奇数行时应该发生什么的选项;是否应该引发错误或将其与默认值配对。

,

您可能需要考虑,当您为变量分配索引时,您获得的是该字符串的索引字符,而不是它的索引。

您可以做的是将文件分配给一个列表,然后获取索引位置,这样您就可以根据需要在行之间切换。

还是没看懂,你是想在fastq_1fastq_2中的行之间切换还是你每个路径都根据它的键?

代码语法

with open(path) as f:
    lis = list(f)
    for index,line in enumerate(lis):
        try:
            sample = re.search(r'pfg[\dGT]+',line)
            sample_string = sample.group(0)

            print(f'{{"name":"{sample_string}","readgroup":"{sample_string}","platform_unit":"{sample_string}","fastq_1":"{line}","fastq_2":"{lis[index+1]}","library":"{sample_string}"}},')
        except IndexError:
                break

输出

{"name":"pfg001G","readgroup":"pfg001G","platform_unit":"pf
g001G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Ta
rgeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","library":"pfg001G"},{"name":"pfg001G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
",{"name":"pfg001T","readgroup":"pfg001T","platform_unit":"pf
g001T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","library":"pfg001T"},"fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg001T_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
",{"name":"pfg002G","readgroup":"pfg002G","platform_unit":"pf
g002G","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg002G_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","library":"pfg002G"},"fastq_1":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002G_2_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
",{"name":"pfg002T","readgroup":"pfg002T","platform_unit":"pfg002T","fastq_1":"/groups/cgsd/javed/validation_set/Leung
SY_Targeted_SS-190528-01a/Clean/pfg002T_1_Clean.fastq.gz
","fastq_2":"/groups/cgsd/javed/validation_set/LeungSY_Targeted_SS-190528-01a/Clean/pfg002T_2_Clean.fastq.gz
","library":"pfg002T"},[Program finished]

相关问答

Selenium Web驱动程序和Java。元素在(x,y)点处不可单击。其...
Python-如何使用点“。” 访问字典成员?
Java 字符串是不可变的。到底是什么意思?
Java中的“ final”关键字如何工作?(我仍然可以修改对象。...
“loop:”在Java代码中。这是什么,为什么要编译?
java.lang.ClassNotFoundException:sun.jdbc.odbc.JdbcOdbc...