Regular Expressions and File Formatting

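Put simply, a regular expression is a way of processing strings. It works line by line, and with the help of a few special symbols it lets users easily search for, delete, or replace specific strings.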

A regular expression is essentially a notation. Any program that supports this notation can be used as a regex-based string processing tool.

1. Basic regular expressions

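Because a regular expression is a notation for processing strings, locale data that affects character ordering also affects what a regular expression matches.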

1.1 How the locale affects regular expressions

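When using regular expressions, pay close attention to the locale of the current environment; otherwise you may get different matches from someone else running the same command.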
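As a small illustration of the point above, here is a sketch that pins the locale for a single command. It assumes a POSIX shell and a grep that honors LC_ALL; the sample words are made up for the demonstration.

```shell
# Hypothetical sample input: one lowercase word and one capitalized word.
words=$(printf 'apple\nBanana\n')

# Prefixing the command with LC_ALL=C applies the C locale to this one
# invocation only, so [a-z] covers exactly the ASCII lowercase letters.
lower_only=$(printf '%s\n' "$words" | LC_ALL=C grep '^[a-z]')

printf '%s\n' "$lower_only"
```

Setting the variable on the command line like this leaves the rest of the session's locale untouched, which can be handy when only one command needs predictable ranges.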

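To avoid locale-dependent mismatches when selecting letters and digits, use the POSIX character classes. Commonly used ones include:

[:alnum:]   letters and digits
[:alpha:]   letters
[:digit:]   digits
[:lower:]   lowercase letters
[:upper:]   uppercase letters
[:space:]   whitespace characters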

1.2 Some advanced grep options

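grep [-A n] [-B n] [--color=auto] 'search string' filename
Options and arguments:
-A: takes a number n (after); besides the matching line, the n lines that follow it are also printed
-B: takes a number n (before); besides the matching line, the n lines that precede it are also printed
--color=auto: highlights the matched text in color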
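The context options can be tried on a throwaway file. This is a minimal sketch; the temporary file and its five lines are assumptions made up for the illustration.

```shell
# Hypothetical five-line sample file, created only for this illustration.
sample=$(mktemp)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$sample"

# -B1 prints one line before each match and -A1 prints one line after it.
# With -n, matching lines show ':' after the line number, context lines show '-'.
context=$(grep -n -B1 -A1 'three' "$sample")
printf '%s\n' "$context"

rm -f "$sample"
```

The different separators after the line number make it easy to tell the real match apart from its surrounding context.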

1.3 Basic regular expression exercises

Practice environment:

The locale is set with "export LANG=C; export LC_ALL=C"
grep is aliased to "grep --color=auto"

Download the practice file

wget "http://linux.vbird.org/linux_basic/0330regularex/regular_express.txt"

Example 1: Searching for a specific string

Find lines that contain the string 'the' and show their line numbers
[root@VM_0_8_centos ~]# grep -n 'the' regular_express.txt
8:I can't finish the test.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.

Inverted match: lines that do not contain 'the'
[root@VM_0_8_centos ~]# grep -vn 'the' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
5:However, this dress is about $ 3183 dollars.
6:GNU is free air not free beer.
7:Her hair is very beauty.
9:Oh! The soup taste good.
10:motorcycle is cheap than car.
11:This window is clear.
13:Oh!	My god!
14:The gd software is a library for drafting programs.
17:I like dog.
19:goooooogle yes!
20:go! go! Let's go.
21:# I am VBird
22:

Case-insensitive match
[root@VM_0_8_centos ~]# grep -in "the" regular_express.txt
8:I can't finish the test.
9:Oh! The soup taste good.
12:the symbol '*' is represented as start.
14:The gd software is a library for drafting programs.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.

Example 2: Matching a set of characters with brackets []

Find lines that contain test or taste
[root@VM_0_8_centos ~]# grep -n 't[ae]st' regular_express.txt
8:I can't finish the test.
9:Oh! The soup taste good.

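No matter how many characters appear inside the brackets [], they stand for exactly one character.

Use the negated set [^] to find lines where oo is preceded by a character other than g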
[root@VM_0_8_centos ~]# grep -n '[^g]oo' regular_express.txt
2:apple is my favorite food.
3:Football game is not use feet only.
18:google is the best tools for search keyword.
19:goooooogle yes!

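Find lines where oo is preceded by a character that is not a lowercase letter; [:lower:] avoids the effect of the locale on character ranges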
[root@VM_0_8_centos ~]# grep -n '[^a-z]oo' regular_express.txt 
3:Football game is not use feet only.
[root@VM_0_8_centos ~]# grep -n '[^[:lower:]]oo' regular_express.txt 
3:Football game is not use feet only.


Find lines that contain a digit; [:digit:] avoids the effect of the locale on character ranges
[root@VM_0_8_centos ~]# grep -n '[0-9]' regular_express.txt 
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.
[root@VM_0_8_centos ~]# grep -n '[[:digit:]]' regular_express.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.

Example 3: Line anchors ^ and $

Find lines that start with the
[root@VM_0_8_centos ~]# grep -n '^the' regular_express.txt 
12:the symbol '*' is represented as start.


Find lines that start with a lowercase letter; '^[a-z]' can be replaced with '^[[:lower:]]'
[root@VM_0_8_centos ~]# grep -n '^[a-z]' regular_express.txt
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.


Find lines that do not start with a letter; '^[^A-Za-z]' can be replaced with '^[^[:alpha:]]'
[root@VM_0_8_centos ~]# grep -n '^[^A-Za-z]' regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
21:# I am VBird

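The ^ symbol means different things inside and outside a bracket expression []. As the first character inside [] it negates the set; outside [] it anchors the match to the start of the line.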
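The two roles of ^ can be seen side by side in a short sketch. The three input lines are invented for the illustration and are not part of regular_express.txt.

```shell
# Hypothetical three-line input.
input=$(printf 'the cat\nThe dog\n42 birds\n')

# Outside brackets, ^ anchors the match to the start of the line.
anchored=$(printf '%s\n' "$input" | grep '^the')

# Inside brackets, a leading ^ negates the set; the outer ^ still anchors,
# so this keeps lines whose first character is not a letter.
negated=$(printf '%s\n' "$input" | grep '^[^[:alpha:]]')

printf '%s\n%s\n' "$anchored" "$negated"
```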

Find lines that end with a period (.)
[root@VM_0_8_centos ~]# grep -n '\.$' regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
11:This window is clear.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
17:I like dog.
18:google is the best tools for search keyword.
20:go! go! Let's go.

Find blank lines
[root@VM_0_8_centos ~]# grep -n '^$' regular_express.txt 
22:


Show the lines of /etc/rsyslog.conf that are neither blank nor start with #
[root@VM_0_8_centos ~]# grep -v '^$' /etc/rsyslog.conf | grep -vn '^#'
6:$ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
7:$ModLoad imjournal # provides access to the systemd journal
18:$workdirectory /var/lib/rsyslog
20:$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
25:$IncludeConfig /etc/rsyslog.d/*.conf
28:$OmitLocalLogging on
30:$IMJournalStateFile imjournal.state
37:*.info;mail.none;authpriv.none;cron.none                /var/log/messages
39:authpriv.*                                              /var/log/secure
41:mail.*                                                  -/var/log/maillog
43:cron.*                                                  /var/log/cron
45:*.emerg                                                 :omusrmsg:*
47:uucp,news.crit                                          /var/log/spooler
49:local7.*                                                /var/log/boot.log

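Example 4: Any single character (.) and repetition (*)

. (period): matches exactly one character of any kind
* (asterisk): repeats the preceding character zero or more times; it always works in combination with the character before it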
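To make the difference between . and * concrete, here is a minimal sketch on four made-up words (they are assumptions for the demonstration, not lines from the practice file).

```shell
# Hypothetical input: g and d with zero to three o characters in between.
input=$(printf 'gd\ngod\ngood\ngoood\n')

# 'g.d' needs exactly one character between g and d.
one_char=$(printf '%s\n' "$input" | grep 'g.d')

# 'go*d' allows zero or more o characters, so even "gd" matches.
zero_or_more=$(printf '%s\n' "$input" | grep 'go*d')

# 'goo*d' needs one literal o followed by zero or more o, so "gd" drops out.
at_least_one=$(printf '%s\n' "$input" | grep 'goo*d')

printf '%s\n---\n%s\n---\n%s\n' "$one_char" "$zero_or_more" "$at_least_one"
```

This is why the exercises write 'ooo*' rather than 'oo*' when at least two o characters are required.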

Find lines with at least two consecutive o characters
[root@VM_0_8_centos ~]# grep -n 'ooo*' regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!


Find lines that contain g...g; .* stands for zero or more arbitrary characters
[root@VM_0_8_centos ~]# grep -n 'g.*g' regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
14:The gd software is a library for drafting programs.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.


Find lines that contain one or more digits
[root@VM_0_8_centos ~]# grep -n '[0-9][0-9]*' regular_express.txt 
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.

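Example 5: Limiting the repetition count with {}

In basic regular expressions, unescaped { and } are ordinary literal characters, so a repetition interval must be written with backslashes, as \{ and \}, for the braces to take on their special meaning.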
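The sketch below compares the escaped interval of basic regular expressions with the unescaped form accepted by grep -E. The three input words are assumptions invented for the comparison.

```shell
# Hypothetical input: g, then one to three o characters, then g.
input=$(printf 'gog\ngoog\ngooog\n')

# Basic regular expression: the braces must be escaped.
bre=$(printf '%s\n' "$input" | grep 'go\{2,\}g')

# Extended regular expression (grep -E): the braces are written bare.
ere=$(printf '%s\n' "$input" | grep -E 'go{2,}g')

printf '%s\n---\n%s\n' "$bre" "$ere"
```

Both commands select the same lines; only the spelling of the interval differs between the two syntaxes.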

Find lines that contain two consecutive o characters
[root@VM_0_8_centos ~]# grep -n 'o\{2\}' regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

Find lines where g is followed by 2 to 5 o characters and then another g
[root@VM_0_8_centos ~]# grep -n 'go\{2,5\}g' regular_express.txt 
18:google is the best tools for search keyword.


Find lines where g is followed by 2 or more o characters and then another g
[root@VM_0_8_centos ~]# grep -n 'go\{2,\}g' regular_express.txt 
18:google is the best tools for search keyword.
19:goooooogle yes!

1.4 The sed tool

sed is itself a pipeline command that can process standard input. It can also replace, delete, insert, and select specific lines of data.

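sed [-nefri] [action]
Options and arguments:
-n: quiet (silent) mode. Normally sed prints every line it reads from stdin to the screen;
	with -n, only the lines explicitly printed by a sed action (such as p) are shown.
-e: give the sed action directly on the command line
-f: read the sed actions from a file; -f filename runs the sed actions stored in filename
-r: use extended regular expression syntax in the sed actions (basic regular expression syntax is the default)
-i: edit the file in place instead of printing the result to the screen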

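Action syntax: [n1[,n2]]function
n1,n2: optional; they usually give the line numbers the action applies to. For example, to act
	   on lines 10 through 20, write "10,20[action]"
The available functions are:
a: append; the text after a appears on a new line below the current line
c: change; the text after c replaces the lines between n1 and n2
d: delete; since it deletes lines, d usually takes no argument
i: insert; the text after i appears on a new line above the current line
p: print; prints the selected data. p is usually used together with sed -n
s: substitute; performs a search and replace, usually with a regular expression, for example 1,20s/old/new/g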
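Before moving on to /etc/passwd, the address-plus-function form can be tried on a tiny input. This is only a sketch; the four Greek-letter lines are assumptions made up for the demonstration.

```shell
# Hypothetical four-line input.
input=$(printf 'alpha\nbeta\ngamma\ndelta\n')

# 2,3d deletes lines 2 through 3 and prints the rest.
deleted=$(printf '%s\n' "$input" | sed '2,3d')

# -n suppresses the default output, so 2,3p prints only lines 2 through 3.
printed=$(printf '%s\n' "$input" | sed -n '2,3p')

# 1,2s/a/A/g replaces every a with A, but only on lines 1 through 2.
replaced=$(printf '%s\n' "$input" | sed '1,2s/a/A/g')

printf '%s\n---\n%s\n---\n%s\n' "$deleted" "$printed" "$replaced"
```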

Adding and deleting by line

Example 1: Print /etc/passwd with lines 2 to 5 deleted
[root@VM_0_8_centos ~]# nl /etc/passwd | sed '2,5d'
     1	root:x:0:0:root:/root:/bin/bash
     6	sync:x:5:0:sync:/sbin:/bin/sync
     7	shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
     8	halt:x:7:0:halt:/sbin:/sbin/halt
     9	mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
...

Example 2: Delete only line 2
[root@VM_0_8_centos ~]# nl /etc/passwd | sed '2d'
     1	root:x:0:0:root:/root:/bin/bash
     3	daemon:x:2:2:daemon:/sbin:/sbin/nologin
...

Example 3: Delete from line 3 through the last line; $ stands for the last line
[root@VM_0_8_centos ~]# nl /etc/passwd | sed '3,$d'
     1	root:x:0:0:root:/root:/bin/bash
     2	bin:x:1:1:bin:/bin:/sbin/nologin

Example 4: Append "drink tea" after line 2
[root@VM_0_8_centos ~]# nl /etc/passwd | sed '2a drink tea'
     1	root:x:0:0:root:/root:/bin/bash
     2	bin:x:1:1:bin:/bin:/sbin/nologin
drink tea
     3	daemon:x:2:2:daemon:/sbin:/sbin/nologin
...

Example 5: Insert "drink tea" before line 2
[root@VM_0_8_centos ~]# nl /etc/passwd | sed '2i drink tea'
     1	root:x:0:0:root:/root:/bin/bash
drink tea
     2	bin:x:1:1:bin:/bin:/sbin/nologin
...

Example 6: Append several lines after line 2, ending each line except the last with a backslash (\) to continue onto a new line
[root@VM_0_8_centos ~]# nl /etc/passwd | sed '2a drink tea...\
> drink apple ...\
> drink beer...'
     1	root:x:0:0:root:/root:/bin/bash
     2	bin:x:1:1:bin:/bin:/sbin/nologin
drink tea...
drink apple ...
drink beer...
     3	daemon:x:2:2:daemon:/sbin:/sbin/nologin
...

Replacing and printing by line

Example 1: Replace lines 2 to 5 with "No 2-5 number"
[root@VM_0_8_centos ~]# nl /etc/passwd | sed '2,5c No 2-5 number'
     1	root:x:0:0:root:/root:/bin/bash
No 2-5 number
     6	sync:x:5:0:sync:/sbin:/bin/sync
...


Example 2: Print only lines 5 to 7 of /etc/passwd
[root@VM_0_8_centos ~]# nl /etc/passwd | sed -n '5,7p'
     5	lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
     6	sync:x:5:0:sync:/sbin:/bin/sync
     7	shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown

Searching and replacing part of a line

Example 1: Use ifconfig with grep to get the line that holds the network interface's IP address
[root@VM_0_8_centos ~]# ifconfig eth0 | grep 'inet '
        inet 10.206.0.8  netmask 255.255.240.0  broadcast 10.206.15.255

Example 2: Delete the part before the IP address (up to and including inet)
[root@VM_0_8_centos ~]# ifconfig eth0 | grep 'inet ' | sed 's/^.*inet //g'
10.206.0.8  netmask 255.255.240.0  broadcast 10.206.15.255

Example 3: Also delete the part after the IP address
[root@VM_0_8_centos ~]# ifconfig eth0 | grep 'inet ' | sed 's/^.*inet //g' \
> | sed 's/ *netmask.*$//g'
10.206.0.8

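Keep only the lines of /etc/man_db.conf that contain MAN, dropping comments (everything from '#' onward) and blank lines
Step 1: Use grep to extract the lines that contain the keyword MAN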
[root@VM_0_8_centos ~]# cat /etc/man_db.conf | grep 'MAN'
# MANDATORY_MANPATH			manpath_element
# MANPATH_MAP		path_element	manpath_element
# MANDB_MAP		global_manpath	[relative_catpath]
...

Step 2: Delete the comments (everything from # to the end of the line)
[root@VM_0_8_centos ~]# cat /etc/man_db.conf | grep 'MAN' | sed 's/#.*$//g'
MANDATORY_MANPATH			/usr/man
MANDATORY_MANPATH			/usr/share/man
...

Step 3: Delete the blank lines
[root@VM_0_8_centos ~]# cat /etc/man_db.conf | grep 'MAN' | sed 's/#.*$//g' | sed '/^$/d'
MANDATORY_MANPATH			/usr/man
MANDATORY_MANPATH			/usr/share/man
MANDATORY_MANPATH			/usr/local/share/man
MANPATH_MAP	/bin			/usr/share/man
MANPATH_MAP	/usr/bin		/usr/share/man
...

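Editing a file in place (dangerous)

sed can modify a file directly, without a pipe or output redirection.

Example 1: Use sed to replace the trailing . of every line in regular_express.txt with !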
[root@VM_0_8_centos ~]# sed -i 's/\.$/\!/g' regular_express.txt 

Example 2: Use sed to append '# This is a test' after the last line of regular_express.txt; $ stands for the last line
[root@VM_0_8_centos ~]# sed -i  '$a # This is a test' regular_express.txt
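Since in-place editing is risky, one possible precaution is to give -i a backup suffix so that sed keeps a copy of the original. The sketch below is not from the original examples: the temporary directory and demo.txt are hypothetical names, and it assumes a sed that accepts a suffix attached to -i (GNU sed does).

```shell
# Hypothetical scratch directory and file, created only for this illustration.
work=$(mktemp -d)
printf 'first.\nsecond\n' > "$work/demo.txt"

# -i.bak edits demo.txt in place and saves the original as demo.txt.bak.
sed -i.bak 's/\.$/!/' "$work/demo.txt"

edited=$(cat "$work/demo.txt")
backup=$(cat "$work/demo.txt.bak")
printf '%s\n---\n%s\n' "$edited" "$backup"

rm -rf "$work"
```

If the substitution turns out to be wrong, the .bak file can simply be copied back over the edited one.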

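2. Extended regular expressions

By default grep supports only basic regular expressions; to use extended regular expressions, use grep -E. egrep is an older name that behaves like an alias of grep -E and may be easier to remember, but recent GNU grep releases mark egrep as obsolescent, so grep -E is the safer choice.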
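As a brief sketch of what grep -E adds, the commands below try the +, ?, | and () operators of extended regular expressions on four made-up words (the words are assumptions for the demonstration).

```shell
# Hypothetical four-word input.
input=$(printf 'gd\ngod\ngood\nglad\n')

# + repeats the preceding character one or more times.
plus=$(printf '%s\n' "$input" | grep -E 'go+d')

# ? makes the preceding character optional (zero or one time).
optional=$(printf '%s\n' "$input" | grep -E 'go?d')

# | matches either alternative.
either=$(printf '%s\n' "$input" | grep -E 'gd|glad')

# () groups alternatives inside a larger pattern.
grouped=$(printf '%s\n' "$input" | grep -E 'g(la|oo)d')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$plus" "$optional" "$either" "$grouped"
```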


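3. File formatting and related processing

awk: a handy data processing tool

awk tends to split each line into fields and process them, which makes it well suited to small text data sets.

The usual form of an awk command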

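awk 'condition1 {action1} condition2 {action2} ...' filename

awk is followed by a pair of single quotes enclosing braces {} that hold the actions to perform on the data. awk can process the files named after it, or read the standard output of the previous command.

awk mainly works on the fields within each line; the default field separator is a space or a [Tab].

Print the account name and the login IP address, separated by a [Tab]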
[root@VM_0_8_centos ~]# last -n 5 | awk '{print $1 "\t" $3}'
root	112.10.73.144
root	112.10.73.144
root	112.10.73.144
root	112.10.73.144
root	112.10.73.144
wtmp	Sun

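From the example above, the awk processing flow is:

1. Read the first line and store its data in the variables $0, $1, $2, and so on
2. Check the "condition" to decide whether the following "action" should run
3. Carry out all the actions and conditions
4. If more lines remain, repeat steps 1 to 3 until all the data has been read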
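The steps above can be sketched on a two-line input. The names and addresses here are invented for the illustration; they are not taken from the last output.

```shell
# Hypothetical input: an account name and an address on each line.
input=$(printf 'alice 10.0.0.1\nbob 10.0.0.2\n')

# For every line awk fills $0, $1, $2, ..., tests the condition,
# and runs the action only when the condition holds.
report=$(printf '%s\n' "$input" | awk '$1 != "alice" {print NR, $1, $2}')

printf '%s\n' "$report"
```

The first line is read and split like any other, but its condition is false, so only the second line reaches the print action.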

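awk built-in variables:

NF: the number of fields in the current line ($0)
NR: the number of the line awk is currently processing
FS: the current field separator; the default is a space, which awk treats as any run of spaces or tabs

Continuing with the last -n 5 example:
1. Print the account on each line ($1)
2. Print the number of the line being processed (the awk variable NR)
3. Show how many fields the line has (the awk variable NF)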
[root@VM_0_8_centos ~]# last -n 5 | awk '{print $1 "\t lines: " NR "\t columns:  " NF}'
root	 lines: 1	 columns:  10
root	 lines: 2	 columns:  10
root	 lines: 3	 columns:  10
root	 lines: 4	 columns:  10
root	 lines: 5	 columns:  10
	 lines: 6	 columns:  0
wtmp	 lines: 7	 columns:  7

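awk comparison operators

Common comparison operators: > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to), == (equal to), != (not equal to)

Example 1: In /etc/passwd, select the entries whose third field is less than 10 and print only the account and the third field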
[root@VM_0_8_centos ~]# cat /etc/passwd | awk '{FS=":"} $3<10 {print $1 "\t" $3}'
root:x:0:0:root:/root:/bin/bash	
bin	1
daemon	2
...

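Example 2: Use BEGIN to set the awk variable in advance so that the first line is also handled correctly. In Example 1, FS=":" took effect only after the first line had already been read and split on whitespace.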
[root@VM_0_8_centos ~]# cat /etc/passwd | awk 'BEGIN {FS=":"} $3<10 {print $1 "\t" $3}'
root	0
bin	1
daemon	2
...
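Another common way to set the separator before any line is read is the -F option. The sketch below compares it with the BEGIN form on three passwd-style lines that are made up for the illustration.

```shell
# Hypothetical passwd-style input with four colon-separated fields.
input=$(printf 'root:x:0:0\nbin:x:1:1\nuser:x:1000:1000\n')

# Setting FS in BEGIN takes effect before the first line is split.
with_begin=$(printf '%s\n' "$input" | awk 'BEGIN {FS=":"} $3<10 {print $1 "\t" $3}')

# -F: sets the same separator from the command line.
with_flag=$(printf '%s\n' "$input" | awk -F: '$3<10 {print $1 "\t" $3}')

printf '%s\n---\n%s\n' "$with_begin" "$with_flag"
```

Both forms give the same result here; -F is shorter, while BEGIN is handy when other variables need to be initialized as well.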

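Example 3: Compute each person's total pay
1. The first line is only a header, so it is not summed (handled when NR==1)
2. From the second line on, the values are summed (handled when NR>=2)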
[root@VM_0_8_centos ~]# cat pay.txt 
Name	1st	2nd	3th
cjw1	23000	24000	25000
cjw2	21000	20000	23000
cjw3	43000	42000	41000
[root@VM_0_8_centos ~]# cat pay.txt | \
> awk 'NR==1 {printf "%10s %10s %10s %10s %10s\n", $1, $2,$3,$4, "Total"}
> NR>=2 { total = $2 + $3 + $4
> printf "%10s %10d %10d %10d %10.2f\n", $1, $2, $3, $4, total}'
      Name        1st        2nd        3th      Total
      cjw1      23000      24000      25000   72000.00
      cjw2      21000      20000      23000   64000.00
      cjw3      43000      42000      41000  126000.00

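Key points from the example above:

Separating awk commands: when an action inside {} needs several commands, separate them with a semicolon (;) or simply with [Enter].
In comparisons, equality must be written with two equals signs (==).
In formatted output, the printf format string must include \n to produce a line break.
Unlike bash shell variables, awk variables are used directly, without a $ prefix.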
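The points above can be seen together in one short sketch. The input is a reduced, made-up version of pay.txt, and the END block and the variable name grand are additions for the illustration that do not appear in the original example.

```shell
# Hypothetical input: a header line and two data lines.
input=$(printf 'Name 1st 2nd\ncjw1 100 200\ncjw2 300 400\n')

# == tests equality; total and grand are plain awk variables (no $ prefix);
# commands inside {} are separated by ';' or by a newline;
# printf needs an explicit \n to end each output line.
summary=$(printf '%s\n' "$input" | awk 'NR == 1 { next }
NR >= 2 { total = $2 + $3; grand += total
printf "%s %d\n", $1, total }
END { printf "all %d\n", grand }')

printf '%s\n' "$summary"
```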