Removing redundant URLs with bash

Problem description

I have the following lines:

http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

I want to delete every line whose parameters are all present in another line. Take these two lines:

http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

I want to keep only this one:

http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

because it is the one with more parameters, which makes the first one redundant.

From the list above, I want to keep these:

http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

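In other words, I want to delete the lines whose parameters all appear in another line, keeping the lines with more parameters rather than the ones with fewer.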
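To make the rule concrete, here is a minimal bash sketch of the redundancy test for a single pair of URLs. It is only an illustration: the function name is_redundant is hypothetical, and it assumes that "same parameters" means identical name=value items on the same resource (everything before the "?").

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (returns 0) when URL $1 is made redundant
# by URL $2, i.e. same resource and every parameter of $1 is also in $2.
# Assumption: parameters are compared as whole name=value items.
is_redundant() {
    local a=$1 b=$2 p q found
    local -a pa pb
    [[ ${a%%\?*} == "${b%%\?*}" ]] || return 1   # different resource: never redundant
    IFS='&' read -ra pa <<< "${a#*\?}"           # parameters of the first URL
    IFS='&' read -ra pb <<< "${b#*\?}"           # parameters of the second URL
    for p in "${pa[@]}"; do
        [[ -n $p ]] || continue                  # ignore empty items from "&&" or a trailing "&"
        found=0
        for q in "${pb[@]}"; do
            [[ $p == "$q" ]] && { found=1; break; }
        done
        (( found )) || return 1                  # $1 has a parameter that $2 lacks
    done
    return 0
}
```

With the two store lines above, is_redundant succeeds when the shorter URL is passed first and fails when the arguments are swapped. Comparing by parameter name only would need the values stripped before the comparison.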

Another example:

I want to convert this:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123

into this:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103

The same parameters on different resources must remain separate lines.

If I have:

http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content2/index.cfm?ID=123

I want to keep both of them.

Edit, August 19:

Another example of URLs and how I would like them processed:

https://es.answers.search.yahoo.com/search?p=mixmail+correo&fr2=piv-web
https://es.answers.search.yahoo.com/search?p=educastur+campus&fr2=piv-web
https://techvalidation.dell.com/Default.aspx?id=9d459f5c-8a26-4268-b37c-23980a6ba577&Key=%2fuKb2WS3da4lk%2f34VSXE4F02YqS5LfvbKFGcDXNQxgIvvbodU3o3lHoNm09M67Ut&SRC=QuoteCenter&newsession=true
https://techvalidation.dell.com/technicalvalidationlist.aspx?key=6ivAYJco9bouAJBNkQ8rgtGWPdfLVRumAScf7bIb6DMpj6SYVdWy6bd4ITEPF4tQMkNzNpGshERZndX3Ia%2bbqhJ3CnrC46qJkHJ4TdiyN78%3d&PartnerAffinityId=3341728904&SRC=QuoteCenter
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
https://wwwmg.pandacn.ford.com/forms/frmservlet?config=pandacn3
https://www.panda.ford.com/forms/frmservlet?config=pandain4
https://www.panda.ford.com/forms/frmservlet?config=pandain3

It should output:

https://es.answers.search.yahoo.com/search?p=mixmail+correo&fr2=piv-web
https://techvalidation.dell.com/Default.aspx?id=9d459f5c-8a26-4268-b37c-23980a6ba577&Key=%2fuKb2WS3da4lk%2f34VSXE4F02YqS5LfvbKFGcDXNQxgIvvbodU3o3lHoNm09M67Ut&SRC=QuoteCenter&newsession=true
https://techvalidation.dell.com/technicalvalidationlist.aspx?key=6ivAYJco9bouAJBNkQ8rgtGWPdfLVRumAScf7bIb6DMpj6SYVdWy6bd4ITEPF4tQMkNzNpGshERZndX3Ia%2bbqhJ3CnrC46qJkHJ4TdiyN78%3d&PartnerAffinityId=3341728904&SRC=QuoteCenter
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
https://wwwmg.pandacn.ford.com/forms/frmservlet?config=pandacn3
https://www.panda.ford.com/forms/frmservlet?config=pandain4

My approach only works for URLs that have a single parameter:

https://www.panda.ford.com/forms/frmservlet?config=pandain4
https://www.panda.ford.com/forms/frmservlet?config=pandain3

I run cat list.txt | sort -u -t "=" -k 1,1 and it outputs:

https://www.panda.ford.com/forms/frmservlet?config=pandain4

But it fails for these:

https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr

where I get

https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF

from cat list.txt | sort -u -t "=" -k 1,1, when what I actually want is the other line

https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr

because it has the same parameter plus more.
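Why the sort command cannot tell these two lines apart can be seen by printing the key it compares. The sketch below is only an illustration; sort_key is a hypothetical helper that extracts the same field that -t "=" -k 1,1 selects, namely the text before the first "=".

```shell
#!/usr/bin/env bash
# Print the sort key used by `sort -t "=" -k 1,1`: field 1 with "=" as separator.
sort_key() { cut -d '=' -f 1 <<< "$1"; }

# Both calls print: https://web.xnet.ford.com/3.0/samlerror?faultreason
sort_key 'https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF'
sort_key 'https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr'
```

Since the keys are identical, sort -u treats the two lines as duplicates and keeps only one of them, without looking at how many parameters follow.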

Thanks!


Solution

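Doing this properly requires a lot of internal sorting, which inside a bash loop spawns a large number of processes and slows the job down too much.

So I switched to perl. Note that this rearranges both the parameters and the lines; if you need the original lines unmodified and/or in their original order, we would have to add another step or three. You should also note that your data has Knowledge in both upper and lower case. A URL is case-insensitive up through the port, but the path after that is case-sensitive, so those two will not register as the same resource even when they have the same parameters.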

#!/usr/bin/env perl

use strict;     # I ALWAYS use strict and warnings unless
use warnings;   # there is some compelling reason not to.

my $file = @ARGV ? shift : 'urls';         # input file from the command line, default 'urls'
open my $fh, '<', $file or die "$file: $!";
my %urlsOUT;
foreach ( <$fh> ) { chomp;
    my %args;                              # clean for each record
    m!^(https?://[^/]+)(/[^?]+)[?](.*)!i   # catch the base in separate case sensitivities
        or next;                           # skip lines that are not URLs with a query string
    my $base = lc($1).$2;                  # always lowercase the case insensitive part
    @args{ split /[?&]+/,$3 } = ();        # removes duplicate args in a url
    my $args = join '&',reverse sort keys %args; # reassemble ORDERED
    $urlsOUT{"$base?$args"}='';            # now a unique key
}

my $urlsOUT='';
REC: foreach my $url (reverse sort keys %urlsOUT ) { # ORDERED
       for ( split /[?&]/,$url ) {                  # for each arg
         if ( $urlsOUT !~ /\b\Q$_\E\b/ ) {           # if new (\Q..\E matches the text literally)
           $urlsOUT .= "$url\n";                     # keep this
           next REC;                                 # check next
         }
       }
}

print $urlsOUT;

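This consistently reorders and de-duplicates all the arguments within each URL, de-duplicates the resulting records, and then checks each remaining record (in descending order), dropping any record that contains nothing the records before it did not already have.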
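For readers who want to see just the per-URL normalization step in isolation, here is a rough bash equivalent. It is a sketch under assumptions, not a replacement for the perl program: normalize_url is a hypothetical name, bash 4 or later is assumed for the lowercase expansion, and every input is assumed to contain a "?".

```shell
#!/usr/bin/env bash
# Sketch of the normalization step only: lowercase scheme://host:port,
# keep the path as is, then de-duplicate the parameters and put them in
# reverse byte order, dropping empty items.
normalize_url() {
    local url=$1 base query re='^([^/]+//[^/]+)(.*)$'
    base=${url%%\?*}                     # everything before the first "?"
    query=${url#*\?}                     # everything after it
    if [[ $base =~ $re ]]; then          # split off scheme://host:port from the path
        base=${BASH_REMATCH[1],,}${BASH_REMATCH[2]}
    fi
    # one parameter per line, drop empty lines, unique reverse sort, rejoin with "&"
    query=$(tr '&' '\n' <<< "$query" | sed '/^$/d' | LC_ALL=C sort -ru | paste -sd '&' -)
    printf '%s?%s\n' "$base" "$query"
}

normalize_url 'http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS'
```

This starts several processes per URL, which is the cost described above, so it is useful for understanding the step rather than for processing large files.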

I named the program file tst and created two input files, tst1 and urls.

$: cat tst1
http://test/foo?foo
http://test/foo?bar
http://test/foo?foo
http://test2/foo?foo
http://test2/foo?baz
http://test2/foo?foo&bar
http://test2/foo?baz
http://test/foo?foo&bar
http://test/foo?bar&foo
http://test2/foo?bar&foo
http://test3/foo?bar
http://test3/foo?foo&bar&baz
http://test2/foo?foo&bar&baz
http://test/foo?foo&bar&baz

$: ./tst tst1
http://test3/foo?foo&baz&bar
http://test2/foo?foo&baz&bar
http://test/foo?foo&baz&bar

$: cat urls
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123

$: ./tst urls
http://grouplogic.com:80/store/index.cfm?upTp=2&ptype=FS&prTpID=5&fa=upgrade&UpNewType=2
http://grouplogic.com:80/store/index.cfm?prTpID=5&id=532&fa=PrtSlt
http://grouplogic.com:80/store/index.cfm?fa=conre&cftoken=26157811&cfid=11812682
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/news-events/index.cfm?prod=2&fa=viewRelease&ID=21
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&archive=1&ProdID=1
http://grouplogic.com:80/content/index.cfm?foo=bar&ID=123
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp

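Note that the output is in case-sensitive ASCII sort order, and that trailing and repeated ampersands have been cleaned up.

Doing this in perl, with the reading and sorting done internally, is also much faster.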

real    0m0.170s
user    0m0.046s
sys     0m0.092s

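Old version

While you can at least eliminate the redundant comparisons in the nested loop, I don't think there is a more elegant approach than a brute-force two-pass.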

mapfile -t lst < <(sort -ru x)   # unique reverse sort once to eliminate simple dups
n=${#lst[@]}                     # fix the bound now, since unset shrinks ${#lst[@]}

for (( ndx1=0; ndx1<n-1; ndx1++ ))                # walk thru once in outer loop
do [[ -n "${lst[ndx1]}" ]] || continue            # ignore removed
   for (( ndx2=ndx1+1; ndx2<n; ndx2++ ))          # inner skips prev,no redux
   do case "${lst[ndx1]}" in                      # case statement string match
      "${lst[ndx2]}"*) unset 'lst[ndx2]' ;;       # remove shorter versions
                    *) continue 2      ;;         # no match,skip ahead
      esac
   done
done

printf "%s\n" "${lst[@]}"                         # print out what's left

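The sort is unique and in reverse order, which eliminates simple duplicates and sets up the comparison, and the result is stored in an array to make the nested loops convenient.

The outer loop walks the array once; it does not bother with the last record, because the inner loop will handle that one. The inner loop starts at the record after the outer loop's current record. Since the records are sorted, there is no need to check the earlier ones again.

Because the inner loop removes records, the outer loop first checks whether the record at its current index is empty and, if so, skips it entirely.

The case statement checks each record after the current one against the outer loop's current record. If the inner record is contained at the start of the current outer record, the shorter version is removed from the array with unset, and the loop moves on to the next record to check.

Once an inner-loop record is no longer part of the outer-loop record, we know we have moved past the related records (because they are sorted), so we skip the rest of the list, which would be pointless to check, and move to the next outer record with continue 2.

This moving window of related records should keep wasted work to a minimum.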
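The prefix test at the heart of the loop can be tried on its own. The sketch below wraps it in a hypothetical function, is_shorter_version, to show one detail that matters for URLs: the pattern is quoted, so a "?" or "*" inside a record is matched literally and only the trailing, unquoted "*" acts as a wildcard.

```shell
#!/usr/bin/env bash
# Succeeds when $2 is a literal prefix of $1 (a "shorter version" of it).
# is_shorter_version is a hypothetical name used for illustration only.
is_shorter_version() {
    case "$1" in
        "$2"*) return 0 ;;   # quoted "$2" is literal, the bare * matches the rest
        *)     return 1 ;;
    esac
}

is_shorter_version 'test2?foo&bar' 'test2?foo' && echo "shorter version"
is_shorter_version 'test2?foo&bar' 'test2?baz' || echo "unrelated"
```

Note that this is a plain string prefix test, so it only recognizes a shorter URL whose parameters appear in the same order at the start of the longer one.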


Finally, it looks like I have got it working. With this test file:

$ cat file2
test?foo
test?bar
test?foo
test2?foo
test2?baz
test2?foo&bar
test2?baz
test?foo&bar
test?bar&foo
test2?bar&foo
test3?bar
test3?foo&bar&baz
test2?foo&bar&baz
test?foo&bar&baz

Script

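#!/bin/bash
declare -A resorces                       # resource (text before "?") -> its query strings
mapfile -t raw < <(sort -u "$1")          # unique lines, one array element per URL
for url in "${raw[@]}"; do
    resorces[${url%%\?*}]+=" ${url#*\?}"  # group the query strings by resource
done
for res in "${!resorces[@]}"; do
    list=( ${resorces[$res]} )            # this resource's query strings, split on spaces
    for i in "${!list[@]}"; do
        par=${list[$i]}
        unset 'list[i]'                   # take the current query string out of the list
        # keep it only if it does not occur inside a query string still in the list
        [[ ${list[*]} =~ "$par" ]] || result+=("$res?$par")
    done
done
printf '%s\n' "${result[@]}"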

Result

$ ./test2 file2
test2?bar&foo
test2?foo&bar&baz
test3?foo&bar&baz
test?bar&foo
test?foo&bar&baz

And with this test file:

$ cat file
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar

Result

$ ./test2 file
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2

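I can see the following algorithm that would do this (unfortunately, I do not know how to implement it):

First, sort the file alphabetically.

Then read the file line by line, and if a line is a substring of the next line, do not put it into the result file.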
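One possible way to implement these two steps is with sort and awk, sketched below. This is an interpretation rather than a definitive implementation: drop_prefixes is a hypothetical name, and "substring" is read as "prefix", since after sorting a line sits directly before its longer versions only when it is a prefix of them.

```shell
#!/usr/bin/env bash
# Sort the input, then print a line only if it is not a prefix of the next one.
# Reads the files given as arguments, or standard input when there are none.
drop_prefixes() {
    LC_ALL=C sort -u "$@" | awk '
        NR > 1 && index($0, prev) != 1 { print prev }  # prev does not start this line: keep it
        { prev = $0 }
        END { if (NR) print prev }                      # the last line is always kept
    '
}
```

A possible invocation would be drop_prefixes list.txt. Because it compares raw strings, it only catches URLs whose extra parameters are appended at the end, and it can misfire when one value is a prefix of another, such as ID=1 and ID=12.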
