PHP: wrong Levenshtein distance for UTF-8 strings

cbeh67ev  posted on 2023-10-15 in PHP

Following the question "Levenshtein distance on diacritic characters", I am using this PHP function to compute the Levenshtein distance between UTF-8 strings.

function levenshtein_php($str1, $str2){
  $length1 = mb_strlen( $str1, 'UTF-8');
  $length2 = mb_strlen( $str2, 'UTF-8');
  if( $length1 < $length2) return levenshtein_php($str2, $str1);
  if( $length1 == 0 ) return $length2;
  if( $str1 === $str2) return 0;
  $prevRow = range( 0, $length2);
  $currentRow = array();
  for ( $i = 0; $i < $length1; $i++ ) {
    $currentRow=array();
    $currentRow[0] = $i + 1;
    $c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
    for ( $j = 0; $j < $length2; $j++ ) {
      $c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
      $insertions = $prevRow[$j+1] + 1;
      $deletions = $currentRow[$j] + 1;
      $substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
      $currentRow[] = min($insertions, $deletions, $substitutions);
    }
    $prevRow = $currentRow;
  }
  return $prevRow[$length2];
}

When I use it with $string1 = 'încat de counștițe'; $string2 = 'încântat';
I get a Levenshtein distance of 13, which I believe is wrong.
The only two options I see are the following changes to $string1:

1)
    a) add the characters 'ânt' to reach $string1 = 'încântat de counștițe';
    b) delete the characters ' de counștițe' to reach $string1 = 'încântat';

which would lead to a difference of 17 changes

2)
    a) replace 'a' with 'â' to reach $string1 = 'încât de counștițe';
    b) delete 't de cou' to reach $string1 = 'încânștițe';
    c) delete 'ș' to reach $string1 = 'încântițe';
    d) add characters 'at' to reach $string1 = 'încântatițe';
    e) remove characters 'ițe' to reach $string1 = 'încântat';

with a Levenshtein distance of 15

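Can you help me correct the levenshtein_php code above so that it returns the correct distance? If there is any other existing PHP function, I would be happy to use it, but as I understand it there is no mb_levenshtein function.
In case it matters, I am running this on PHP 7.4.33.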

lztngnrs #1

It seems to me that your second worked example of the distance contains a mistake:

a) replace 'a' with 'â' to reach $string1 = 'încât de counștițe';
b) delete 't de cou' to reach $string1 = 'încânștițe';
c) delete 'ș' to reach $string1 = 'încântițe';
d) add characters 'at' to reach $string1 = 'încântatițe';
e) remove characters 'ițe' to reach $string1 = 'încântat';

The last two lines should instead be:

d) replace characters 'iț' with 'at' to reach $string1 = 'încântate';
e) remove character 'e' to reach $string1 = 'încântat';

So the result is 13 rather than 15, and the function works correctly.
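
To check the figure of 13 independently, here is a minimal sketch of the same two-row algorithm that splits each string into code points once instead of calling mb_substr inside the loops. The name utf8_levenshtein is made up for this illustration, and the sketch assumes both inputs are valid, precomposed UTF-8.

```php
<?php
// Hypothetical helper (not a built-in): Levenshtein distance counted in
// Unicode code points rather than bytes. Assumes valid UTF-8 input.
function utf8_levenshtein(string $a, string $b): int
{
    // The /u modifier makes PCRE split on code points, not bytes.
    $charsA = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $charsB = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    $lenB = count($charsB);

    // Row 0: turning the empty prefix of $a into each prefix of $b.
    $prev = range(0, $lenB);
    foreach ($charsA as $i => $ca) {
        $curr = [$i + 1];
        foreach ($charsB as $j => $cb) {
            $curr[] = min(
                $prev[$j + 1] + 1,                  // drop $ca
                $curr[$j] + 1,                      // add $cb
                $prev[$j] + ($ca === $cb ? 0 : 1)   // keep or substitute
            );
        }
        $prev = $curr;
    }
    return $prev[$lenB];
}

echo utf8_levenshtein('încat de counștițe', 'încântat'), "\n"; // prints 13
```

The value 13 is also the lower bound for this pair: the strings differ in length by 10 characters, and at most 5 characters ('î', 'n', 'c', 'n', 't') can be kept in order, which leaves 3 substitutions.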

ntjbwcob #2

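Your sample text suggests that the character set is limited to certain Western European letters. I suggest converting from UTF-8 to Latin-1 (or another suitable encoding) with an appropriate conversion routine. The characters will then be single-byte, and the "mb" routines are not needed.
Try MySQL's CHARACTER SET cp1250 and latin2; they seem to have the required s and t with cedilla. If you need case and accent folding, try the _general_ci and _croatian_ci COLLATIONs, or _bin for case- and accent-sensitive comparison.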
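
The transcoding idea can also be tried directly in PHP. The sketch below is an illustration under assumptions, not part of the answer: the function name levenshtein_single_byte is invented, and ISO-8859-16 is chosen as the default target because it contains 'ș' and 'ț' with comma below; pass a different single-byte encoding if your text uses the cedilla forms. It needs the mbstring extension.

```php
<?php
// Sketch only: the function name and the default encoding are assumptions.
function levenshtein_single_byte(string $a, string $b, string $encoding = 'ISO-8859-16'): int
{
    $bytes = [];
    foreach ([$a, $b] as $s) {
        $converted = mb_convert_encoding($s, $encoding, 'UTF-8');
        // mbstring silently substitutes characters the target encoding
        // lacks, so a round trip is used to detect a lossy conversion.
        if (mb_convert_encoding($converted, 'UTF-8', $encoding) !== $s) {
            throw new InvalidArgumentException("String is not representable in $encoding");
        }
        $bytes[] = $converted;
    }
    // Each character is now one byte, so the byte-wise built-in applies.
    return levenshtein($bytes[0], $bytes[1]);
}

echo levenshtein_single_byte('încat de counștițe', 'încântat'), "\n";
```

The round-trip check matters because a lossy conversion would otherwise change the distance without any warning. Note also that before PHP 8.0 the built-in levenshtein() returns -1 when an argument is longer than 255 characters, so on PHP 7.4 this approach only suits short strings.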
