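PowerShell command to remove enter/return characters in the middle of a row while keeping the line breaks

Problem description

We have huge CSV files (comma-separated, with every value enclosed in double quotes).

Some of the data contains enter/return characters, which causes problems when loading. For example: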

ID,Name,dob,Gender
"1","John Smith","01-01-1980","M"
"2","Craig 
Davis","02-02-1990","M"
"3","David Smith","05-05-1970","M"

I tried to solve this with the following command:

((Get-Content -path C:\Users\10145097\work\RnD\sample_data.txt -Raw) -replace "`n|`r","")

But it also removes the line separators between records, which we do not want:

Wrong output

ID,Name,dob,Gender"1","John Smith","01-01-1980","M""2","Craig Davis","02-02-1990","M""3","David Smith","05-05-1970","M"

The required output is:

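ID,Name,dob,Gender
"1","John Smith","01-01-1980","M"
"2","Craig Davis","02-02-1990","M"
"3","David Smith","05-05-1970","M"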

Could you please help?

Many thanks.

Solutions

@ECHO Off
SETLOCAL ENABLEDELAYEDEXPANSION
rem The following settings for the source directory, destination directory, target directory,
rem batch directory, filenames, output filename and temporary filename [if shown] are names
rem that I use for testing and deliberately include names which include spaces to make sure
rem that the process works using such names. These will need to be changed to suit your situation.

SET "sourcedir=u:\your files"
SET "destdir=u:\your results"
SET "filename1=%sourcedir%\q65605302.txt"
SET "outfile=%destdir%\outfile.txt"

SET "line="
SET "unbalanced="

(
 FOR /f "usebackqdelims=" %%a IN ("%filename1%") DO (
  IF DEFINED unbalanced (ECHO !line!%%a&SET "unbalanced=") ELSE (
   SET "line=%%a"
   CALL :countquotes
   IF DEFINED unbalanced (SET "line=%%a") ELSE ECHO %%a
  )
)
)>"%outfile%"

GOTO :EOF

:countquotes
IF NOT DEFINED line GOTO :EOF
SET "char=%line:~0,1%"
SET "line=%line:~1%"
SET "char=%char:"=%"
IF DEFINED char GOTO countquotes
IF DEFINED unbalanced (SET "unbalanced=") ELSE SET "unbalanced=y"
GOTO countquotes

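Personally, I would use sed or (g)awk to try to fix your file.

Since I cannot actually analyse the original source data, I can only conclude from the posted data that the problem is a newline inserted at random after a space inside a quoted field.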
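Under that reading of the data, the `-replace` attempt from the question could be narrowed so that it only touches line breaks inside a quoted field. The following is a minimal sketch, not a tested solution for the real files: it assumes the quotes are balanced across the whole text and that the text fits in memory, and the sample lines and the variable names (`$text`, `$evaluator`, `$fixed`) are only illustrative.

```powershell
# Sample text reproducing the posted data (CRLF line endings are an assumption).
$text = @(
    'ID,Name,dob,Gender'
    '"1","John Smith","01-01-1980","M"'
    '"2","Craig '
    'Davis","02-02-1990","M"'
    '"3","David Smith","05-05-1970","M"'
) -join "`r`n"

# For every quoted run, strip CR/LF characters found inside it.
# Line breaks between records lie outside the quotes and are left untouched.
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($m)
    $m.Value -replace '[\r\n]+', ''
}

$fixed = [regex]::Replace($text, '"[^"]*"', $evaluator)
$fixed
```

For a real file, `$text` would come from `Get-Content -Raw`, as in the question, which is the part that may not scale to very large files.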

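The batch file above simply reads the file line by line.

If the flag unbalanced is set, the previous line was unbalanced, so the current %%a is appended to the previous line saved in line and the result is output; unbalanced is then set to false (no value).

If unbalanced is not set, the quotes on the current line are "counted". The "counting" is done by simply toggling the unbalanced flag for each double quote found, so with an odd number of double quotes unbalanced ends up set, and with an even number it ends up clear.

When the :countquotes routine returns, if unbalanced is not set (i.e. the line read has balanced quotes), the line is regurgitated as is; otherwise the line is saved in line again (because :countquotes has destroyed it), ready to be concatenated with the next line read.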
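Since the question asks for PowerShell, the same quote-parity idea can be sketched there as well. This is only an illustration under the same assumption as the batch file (a record is complete once the number of double quotes seen so far is even); `Join-BrokenCsvLine` is a hypothetical helper name, not an existing cmdlet, and unlike the batch file it keeps accumulating if a field is broken across more than two lines.

```powershell
# Hypothetical helper: joins physical lines until the accumulated text
# contains an even number of double quotes, then emits it as one record.
function Join-BrokenCsvLine {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [string]$Line
    )
    begin { $pending = '' }
    process {
        $current = $pending + $Line
        # Count the double quotes by comparing lengths with and without them.
        $quotes = $current.Length - $current.Replace('"', '').Length
        if ($quotes % 2 -eq 1) {
            # Odd count: a quoted field is still open, so hold the text back.
            $pending = $current
        }
        else {
            $pending = ''
            $current
        }
    }
    end {
        # Emit whatever is left if the input ended inside a quoted field.
        if ($pending) { $pending }
    }
}

# Sample lines taken from the question.
$sample = @(
    'ID,Name,dob,Gender'
    '"1","John Smith","01-01-1980","M"'
    '"2","Craig '
    'Davis","02-02-1990","M"'
    '"3","David Smith","05-05-1970","M"'
)
$joined = @($sample | Join-BrokenCsvLine)
$joined
```

Because the function works on the pipeline, it could be fed from `Get-Content` without `-Raw` and piped to `Set-Content`, so the whole file never has to be held in memory at once.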


If you have enough memory to load and parse the whole file, you can clear out the newline characters like this:

$data = Import-Csv -Path 'D:\Test\TheFile.csv'
foreach ($item in $data) {
    # replace all fields that contain Newline characters
    foreach ($prop in ($item.PSObject.Properties.Name | Where-Object { $item.$_ -match '\r?\n'})) {
        # remove the Newline characters:
        $item.$prop = $item.$prop -replace '[\r\n]'

        # OR
        # to normalize all multiple whitespace characters into a single space use:
        # $item.$prop = $item.$prop -replace '\s+',' '
    }
}

$data | Export-Csv -Path 'D:\Test\TheUpdatedFile.csv' -NoTypeInformation

If you can be sure that the newline characters occur only in the Name column, you can simplify the code above to:

$data = Import-Csv -Path 'D:\Test\TheFile.csv'
foreach ($item in $data) {
    # replace all fields in column 'Name' that contain Newline characters
    if ($item.Name -match '\r?\n') {
        # remove the Newline characters:
        $item.Name = $item.Name -replace '[\r\n]'

        # OR
        # to normalize all multiple whitespace characters into a single space use:
        # $item.Name = $item.Name -replace '\s+',' '
    }
}

$data | Export-Csv -Path 'D:\Test\TheUpdatedFile.csv' -NoTypeInformation

Using your example, the CSV output is as follows:

"ID","Name","DOB","Gender"
"1","John Smith","01-01-1980","M"
"2","Craig Davis","02-02-1990","M"
"3","David Smith","05-05-1970","M"