Where can I find algorithms that evaluate spellings with misplaced characters more accurately than levenshtein() and PHP's similar_text()?
Example:
    similar_text('jonas', 'xxjon', $similar); echo $similar; // returns 60
    similar_text('jonas', 'asjon', $similar); echo $similar; // returns 60 <- although more similar!
    echo levenshtein('jonas', 'xxjon'); // returns 4
    echo levenshtein('jonas', 'asjon'); // returns 4 <- although more similar!
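/Jonas
Solution
Here is the solution I came up with. It is based on Tim's suggestion to compare the order of consecutive characters. Some results: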
> jonas / jonax:0.8
> jonas / sjona:0.68
> jonas / sjonas:0.66
> jonas / asjon:0.52
> jonas / xxjon:0.36
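I'm sure it isn't perfect and that it could be optimized, but it seems to produce the results I'm after.
One weakness: when the two strings have different lengths, swapping the arguments produces a different result.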
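One possible workaround for that asymmetry, as a minimal sketch rather than part of the original solution: score the pair in both directions and average the two results. The helper name `symmetric_score` is hypothetical, the sketch assumes PHP 7 or later for the type declarations, and `$toy_score` is only a stand-in scorer so the example runs on its own.

```php
<?php
// Hypothetical helper (not from the original answer): makes any
// two-argument scoring function order-independent by averaging
// the score of (a, b) and the score of (b, a).
function symmetric_score(callable $score, string $a, string $b): float
{
    return ($score($a, $b) + $score($b, $a)) / 2;
}

// Stand-in asymmetric scorer, used only for illustration:
// the share of characters in $a that also occur somewhere in $b.
$toy_score = function (string $a, string $b): float {
    if ($a === '') {
        return 0.0;
    }
    $hits = 0;
    foreach (str_split($a) as $char) {
        if (strpos($b, $char) !== false) {
            $hits++;
        }
    }
    return $hits / strlen($a);
};

// The toy scorer gives 0.6 for ('jonas', 'jon') and 1.0 for ('jon', 'jonas');
// the wrapper returns the same value for either argument order.
printf("%.2f\n", symmetric_score($toy_score, 'jonas', 'jon')); // prints 0.80
printf("%.2f\n", symmetric_score($toy_score, 'jon', 'jonas')); // prints 0.80
```

Any callable should work in place of the stand-in, for example a static method passed as `array('YourClass', 'string_compare')`, where `YourClass` is a placeholder for whatever class holds the method. Averaging is only one choice; taking the minimum or maximum of the two directions would also be symmetric.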
    static public function string_compare($str_a, $str_b)
    {
        $length = strlen($str_a);
        $length_b = strlen($str_b);

        $i = 0;
        $segmentcount = 0;
        $segmentsinfo = array();
        $segment = '';
        while ($i < $length) {
            $char = substr($str_a, $i, 1);
            if (strpos($str_b, $char) !== FALSE) {
                // Try to extend the current run of matching characters.
                $segment = $segment.$char;
                if (strpos($str_b, $segment) !== FALSE) {
                    $segmentpos_a = $i - strlen($segment) + 1;
                    $segmentpos_b = strpos($str_b, $segment);
                    $positiondiff = abs($segmentpos_a - $segmentpos_b);
                    $posfactor = ($length - $positiondiff) / $length_b; // <-- ?
                    $lengthfactor = strlen($segment) / $length;
                    // Overwrites this segment's entry each time the segment grows.
                    $segmentsinfo[$segmentcount] = array(
                        'segment' => $segment,
                        'score' => ($posfactor * $lengthfactor)
                    );
                } else {
                    // Run broken: start a new segment and re-read this character.
                    $segment = '';
                    $i--;
                    $segmentcount++;
                }
            } else {
                // Character does not occur in $str_b at all: close the segment.
                $segment = '';
                $segmentcount++;
            }
            $i++;
        }

        // PHP 5.3 lambda in array_map
        $totalscore = array_sum(array_map(function($v) { return $v['score']; }, $segmentsinfo));
        return $totalscore;
    }
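To make the scoring concrete, here is the arithmetic for the `jonas` / `asjon` pair, traced by hand through the function above. Both strings have length 5. The function finds two segments: `jon` (position 0 in the first string, position 2 in the second) and `as` (position 3 in the first, position 0 in the second). Each segment's score is its position factor times its length factor, and only the last score stored for a segment counts, because the entry is overwritten as the segment grows.

```latex
\begin{aligned}
\text{score}(\texttt{jon}) &= \frac{5 - |0 - 2|}{5} \cdot \frac{3}{5} = 0.6 \cdot 0.6 = 0.36 \\
\text{score}(\texttt{as})  &= \frac{5 - |3 - 0|}{5} \cdot \frac{2}{5} = 0.4 \cdot 0.4 = 0.16 \\
\text{total} &= 0.36 + 0.16 = 0.52
\end{aligned}
```

This matches the 0.52 listed in the results. For `jonas` / `xxjon` only the `jon` segment matches, which gives the listed 0.36.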