How to split a CSV column by position in Python

Problem description

I have data in a single column of a CSV file that I want to split into 3 columns. The original file looks like this:

0400000006340000000000965871       
0700000007850000000000336487    
0100000003360000000000444444

I want to split the column so that it looks like the listing below, while still keeping the leading zeros:

04 0000000634 0000000000965871   
07 0000000785 0000000000336487   
01 0000000336 0000000000444444

I can load the file into Python, but I don't know which delimiter or positions I have to use. My code so far:

import pandas as pd
df = pd.read_csv('new_numbers.txt', header=None)

Thanks for your help.

Solution

Use the pandas read_fwf method ("fwf" stands for "fixed-width formatted"):

pd.read_fwf('new_numbers.txt',widths=[2,10,16],header=None)

This strips the leading zeros, because the fields are parsed as integers:

   0    1       2
0  4  634  965871
1  7  785  336487
2  1  336  444444

To keep them, pass dtype=object so that the fields are read as strings:

pd.read_fwf('new_numbers.txt', widths=[2,10,16], dtype=object, header=None)

Output:

    0           1                 2
0  04  0000000634  0000000000965871
1  07  0000000785  0000000000336487
2  01  0000000336  0000000000444444

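It looks like there is no delimiter and you are working with fixed-length fields.

You can access fixed-length fields by their position using slice notation.

For example: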

str1 = "0400000006340000000000965871"

str1A = str1[:2]     # first 2 characters
str1B = str1[2:12]   # next 10 characters
str1C = str1[12:]    # remaining 16 characters
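
The slicing above can be generalized to any list of field widths. The following is a minimal sketch; the helper name split_fixed_width is hypothetical, and the widths 2, 10 and 16 are assumed from the sample data in the question.

```python
from itertools import accumulate


def split_fixed_width(line, widths):
    """Split `line` into consecutive fields of the given widths."""
    # Running totals give the end offset of each field,
    # e.g. [2, 12, 28] for widths [2, 10, 16].
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [line[start:end] for start, end in zip(starts, ends)]


# Fields stay strings, so leading zeros are kept.
fields = split_fixed_width("0400000006340000000000965871", [2, 10, 16])
print(" ".join(fields))
```

Because nothing is converted to a number, no extra step is needed to preserve the zeros.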

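Unless you need a DataFrame at the end, I wouldn't bother with pandas in particular.

You don't need pandas to load your text file and read its contents (and besides, what you are loading is not a CSV file).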

with open("new_numbers.txt") as f:
    lines = f.readlines()

I suggest you use the re module.

import re

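# Each group is a run of zeros followed by a run of non-zero digits.
# Note: this only works if no field contains a 0 after its first non-zero digit.
PATTERN = re.compile(r"(0*[1-9]+)(0*[1-9]+)(0*[1-9]+)")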

You can check the result of this expression against your examples.
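
The pattern above splits on runs of zeros, so a value with a zero inside it (for example 1020) would not be split at the intended positions. A possible alternative, sketched here under the assumption that every line has fields of exactly 2, 10 and 16 digits as in the sample, is to match by width instead. The name FIXED_PATTERN and the sample value are made up for illustration.

```python
import re

# Hypothetical alternative: match by field width (2, 10, 16 digits),
# widths assumed from the sample data in the question.
FIXED_PATTERN = re.compile(r"(\d{2})(\d{10})(\d{16})")

# A made-up line whose fields contain zeros after the first non-zero digit.
sample = "05" + "0000001020" + "0000000000700305"

match = FIXED_PATTERN.match(sample)
first, second, third = match.groups()
print(first, second, third)
```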

Then you need to get the matches from your lines and join them with spaces.

matches = []
for line in lines:
    match = PATTERN.match(line)
    if match is None:
        continue  # skip blank or malformed lines
    first, second, third = match.group(1, 2, 3)
    matches.append(" ".join([first, second, third]))

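In the end, matches will be a list of strings, each holding the three numbers (with their leading zeros) separated by spaces.

At this point you can write them to another file, or do whatever else you need with them.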

towrite = "\n".join(matches)

with open("output.txt","w") as f:
    f.write(towrite)
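
If the goal is a real CSV file with three columns rather than space-separated text, the standard csv module can write the split fields. This is only a sketch: it uses in-memory io.StringIO objects in place of new_numbers.txt and the output file so that it runs on its own, and the widths are again assumed from the sample data.

```python
import csv
import io

WIDTHS = (2, 10, 16)  # assumed from the sample data

# Stand-ins for the input and output files.
raw = io.StringIO(
    "0400000006340000000000965871\n"
    "0700000007850000000000336487\n"
)
out = io.StringIO()

writer = csv.writer(out, lineterminator="\n")
for line in raw:
    line = line.strip()  # drop the newline and any trailing spaces
    if not line:
        continue
    row, start = [], 0
    for width in WIDTHS:
        row.append(line[start:start + width])  # fields stay strings
        start += width
    writer.writerow(row)

print(out.getvalue())
```

With real files, open the output with newline="" as the csv module documentation recommends. Keep in mind that a spreadsheet program may still drop the leading zeros when it opens the file unless the columns are imported as text.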
