正则表达式对小字符串的暴力

在测试小字符串(例如isPhoneNumber或isHexadecimal)时,使用正则表达式会带来性能上的好处,还是会强制它们更快？不会通过检查给定字符串的字符是否在指定范围内比使用正则表达式更快来强制它们吗？

例如：

public static boolean isHexadecimal(String value)
{
    if (value.startsWith("-"))
    {
        value = value.substring(1);
    }

    value = value.toLowerCase();

    if (value.length() <= 2 || !value.startsWith("0x"))
    {
        return false;
    }

    for (int i = 2; i < value.length(); i++)
    {
        char c = value.charat(i);

        if (!(c >= '0' && c <= '9' || c >= 'a' && c <= 'f'))
        {
            return false;
        }
    }

    return true;
}

与

Regex.match(/0x[0-9a-f]+/,"0x123fa") // returns true if regex matches whole given expression

似乎有一些与正则表达式相关的开销,即使模式是预编译的,只是因为正则表达式必须在许多一般情况下工作.相比之下,蛮力方法完全符合要求而不再需要.我错过了正则表达式的一些优化吗？

我写了一个小基准来估计：

> nop 方法(了解基线迭代速度);
>原始方法,由OP提供;
> RegExp;
>编译正则表达式;
> @maraca提供的版本(没有toLowerCase和substring);
>“fastIsHex”版本(基于开关),我添加的只是为了好玩.

测试机配置如下：

> JVM：Java(TM)SE运行时环境(版本1.8.0_101-b13)
> cpu：Intel(R)Core(TM)i5-2500 cpu @ 3.30GHz

以下是我为原始测试字符串“0x123fa”和10.000.000次迭代得到的结果：

Method "nop" => #10000000 iterations in 9ms
Method "isHexadecimal (OP)" => #10000000 iterations in 300ms
Method "RegExp" => #10000000 iterations in 4270ms
Method "RegExp (Compiled)" => #10000000 iterations in 1025ms
Method "isHexadecimal (maraca)" => #10000000 iterations in 135ms
Method "fastIsHex" => #10000000 iterations in 107ms

正如您所看到的,OP的原始方法比RegExp方法更快(至少在使用JDK提供的RegExp实现时).

(供你参考)

基准代码：

public static void main(String[] argv) throws Exception {
    //Number of IteraTIONS
    final int IteraTIONS = 10000000;

    //nop
    benchmark(IteraTIONS,"nop",() -> nop(longHexText));

    //isHexadecimal
    benchmark(IteraTIONS,"isHexadecimal (OP)",() -> isHexadecimal(longHexText));

    //Un-compiled regexp
    benchmark(IteraTIONS,"RegExp",() -> longHexText.matches("0x[0-9a-fA-F]+"));

    //Pre-compiled regexp
    final Pattern pattern = Pattern.compile("0x[0-9a-fA-F]+");
    benchmark(IteraTIONS,"RegExp (Compiled)",() -> {
        pattern.matcher(longHexText).matches();
    });

    //isHexadecimal (maraca)
    benchmark(IteraTIONS,"isHexadecimal (maraca)",() -> isHexadecimalMaraca(longHexText));

    //FastIsHex
    benchmark(IteraTIONS,"fastIsHex",() -> fastIsHex(longHexText));
}

public static void benchmark(int iterations,String name,Runnable block) {
    //Start Time
    long stime = System.currentTimeMillis();

    //Benchmark
    for(int i = 0; i < iterations; i++) {
        block.run();
    }

    //Done
    System.out.println(
        String.format("Method \"%s\" => #%d iterations in %dms",name,iterations,(System.currentTimeMillis()-stime))
    );
}

nop 方法：

public static boolean nop(String value) { return true; }

fastIsHex方法：

public static boolean fastIsHex(String value) {

    //Value must be at least 4 characters long (0x00)
    if(value.length() < 4) {
        return false;
    }

    //Compute where the data starts
    int start = ((value.charat(0) == '-') ? 1 : 0) + 2;

    //Check prefix
    if(value.charat(start-2) != '0' || value.charat(start-1) != 'x') {
        return false;
    }

    //Verify data
    for(int i = start; i < value.length(); i++) {
        switch(value.charat(i)) {
            case '0':case '1':case '2':case '3':case '4':case '5':case '6':case '7':case '8':case '9':
            case 'a':case 'b':case 'c':case 'd':case 'e':case 'f':
            case 'A':case 'B':case 'C':case 'D':case 'E':case 'F':
                continue;

            default:
                return false;
        }
    }

    return true;
}

所以,答案是否定的,对于短字符串和手头的任务,RegExp并不快.

当谈到更长的弦乐时,平衡是完全不同的,
下面是8192长十六进制字符串的结果,我用以下代码生成：

hexdump -n 8196 -v -e '/1 "%02X"' /dev/urandom

和10.000次迭代：

Method "nop" => #10000 iterations in 2ms
Method "isHexadecimal (OP)" => #10000 iterations in 1512ms
Method "RegExp" => #10000 iterations in 1303ms
Method "RegExp (Compiled)" => #10000 iterations in 1263ms
Method "isHexadecimal (maraca)" => #10000 iterations in 553ms
Method "fastIsHex" => #10000 iterations in 530ms

正如你所看到的,手写方法(由macara和我的fastIsHex编写的方法)仍然击败RegExp,但原始方法没有,
(由于substring()和toLowerCase()).

边注：

这个基准测试非常简单,只测试“最坏情况”场景(即一个完全有效的字符串),实际结果,混合数据长度和非0有效无效比率可能完全不同.

更新：

我还尝试了char []数组版本：

char[] chars = value.tochararray();
 for (idx += 2; idx < chars.length; idx++) { ... }

它甚至比getCharat(i)版本慢一点：

Method "isHexadecimal (maraca) char[] array version" => #10000000 iterations in 194ms
  Method "fastIsHex,char[] array version" => #10000000 iterations in 164ms

我的猜测是由于toch ararray中的数组复制.

更新(#2)：

我已经运行了额外的8k / 100.000迭代测试,看看“maraca”和“fastIsHex”方法之间的速度是否存在任何真正的差异,并且还将它们标准化为使用完全相同的前置条件代码：

运行#1

Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5341ms
Method "fastIsHex" => #100000 iterations in 5313ms

运行#2

Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5313ms
Method "fastIsHex" => #100000 iterations in 5334ms

即这两种方法之间的速度差异最多是微不足道的,可能是由于测量错误(因为我在我的工作站上运行它而不是专门设置的清洁测试环境).

正则表达式对小字符串的暴力

相关文章