Unicode Ranges for Han Characters and How to Match Them in PHP

December 11, 2018

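I was building a small toy project that needed some rather unusual Han characters, so I put together the Unicode ranges that relate to Han characters.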

Name                                    Qty     Unicode Range
CJK Unified Ideographs                  20976   4E00-9FEF
CJK Unified Ideographs Extension A      6582    3400-4DB5
CJK Unified Ideographs Extension B      42711   20000-2A6D6
CJK Unified Ideographs Extension C      4149    2A700-2B734
CJK Unified Ideographs Extension D      222     2B740-2B81D
CJK Unified Ideographs Extension E      5762    2B820-2CEA1
CJK Unified Ideographs Extension F      7473    2CEB0-2EBE0
CJK Compatibility Ideographs            477     F900-FAD9
Ideographic number zero (〇)            1       3007
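
To make the table easier to play with, here is a minimal sketch that reports which of the listed ranges a single character falls into. The function name `cjk_block_of` is hypothetical, and the ranges are copied from the table as-is, so a code point outside them simply returns null.

```php
<?php
// Hypothetical helper: name of the table row whose range contains $char,
// or null when the character is outside every listed range.
function cjk_block_of(string $char): ?string {
    // Ranges copied from the table above.
    $blocks = array(
        'CJK Unified Ideographs'             => '\x{4e00}-\x{9fef}',
        'CJK Unified Ideographs Extension A' => '\x{3400}-\x{4db5}',
        'CJK Unified Ideographs Extension B' => '\x{20000}-\x{2a6d6}',
        'CJK Unified Ideographs Extension C' => '\x{2a700}-\x{2b734}',
        'CJK Unified Ideographs Extension D' => '\x{2b740}-\x{2b81d}',
        'CJK Unified Ideographs Extension E' => '\x{2b820}-\x{2cea1}',
        'CJK Unified Ideographs Extension F' => '\x{2ceb0}-\x{2ebe0}',
        'CJK Compatibility Ideographs'       => '\x{f900}-\x{fad9}',
        'Ideographic number zero'            => '\x{3007}',
    );
    foreach ($blocks as $name => $range) {
        // Anchored so that exactly one character from the range must match.
        if (preg_match('/^[' . $range . ']\z/u', $char) === 1) {
            return $name;
        }
    }
    return null;
}

echo cjk_block_of("覅"), PHP_EOL;        // CJK Unified Ideographs
echo cjk_block_of("\u{20000}"), PHP_EOL; // CJK Unified Ideographs Extension B
```

The `"\u{20000}"` escape needs PHP 7.0 or later, and the nullable return type needs 7.1; on older versions the literal character and an untyped function would do the same job.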

Characters in everyday use normally all sit in CJK Unified Ideographs, and that block alone is usually enough. To match them, the following line will do.

preg_match_all('/[\x{4e00}-\x{9fef}]/u' , $string, $result);

Even the Wu Chinese character “覅” (U+8985) falls within that range.
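
As a quick check of that one-liner, here is a minimal sketch. The helper name `basic_han_chars` and the sample string are made up for illustration. It keeps “覅” but drops a character from Extension B (U+20000), which lies outside the basic range.

```php
<?php
// Hypothetical helper: keep only characters in the basic
// CJK Unified Ideographs range (U+4E00 to U+9FEF).
function basic_han_chars(string $string): string {
    preg_match_all('/[\x{4e00}-\x{9fef}]/u', $string, $result);
    return implode('', $result[0]);
}

// "覅" (U+8985) is inside the range; U+20000 (Extension B) is not.
echo basic_han_chars("abc覅\u{20000}123"), PHP_EOL; // 覅
```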

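But I needed some unusual characters that the approach above does not cover, so I wrote my own function. The idea is exactly the same; the only difference is a larger character class.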

Here it is, for reference.

<?php
// Return every Han character found in $string, concatenated in order.
function match_cjk_ideographs($string) {
    $cjk_unified_ideographs = '\x{4e00}-\x{9fef}';
    // Extensions A to F. A and B end at 4DBF and 2A6DF here, slightly wider than the table.
    $cjk_unified_ideographs_extension = array('\x{3400}-\x{4dbf}', '\x{20000}-\x{2a6df}', '\x{2a700}-\x{2b734}', '\x{2b740}-\x{2b81d}', '\x{2b820}-\x{2cea1}', '\x{2ceb0}-\x{2ebe0}');
    $cjk_compatibility_ideographs = '\x{f900}-\x{fad9}';
    $zero = '\x{3007}'; // 〇
    // One character class holding all ranges; the /u modifier enables UTF-8 mode.
    $pattern = '/[' . $cjk_unified_ideographs . implode('', $cjk_unified_ideographs_extension) . $cjk_compatibility_ideographs . $zero . ']/u';
    preg_match_all($pattern, $string, $result);
    return implode('', $result[0]);
}

Running it gives the result shown in the screenshot.
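
A comparable run can also be sketched in text form. The snippet below rebuilds the same character class inline so that it runs on its own; the sample string and the variable names are my own, not taken from the post.

```php
<?php
// Same ranges as match_cjk_ideographs(), rebuilt inline so this
// snippet is self-contained. The sample string is made up.
$ranges = array(
    '\x{4e00}-\x{9fef}',   // CJK Unified Ideographs
    '\x{3400}-\x{4dbf}',   // Extension A
    '\x{20000}-\x{2a6df}', // Extension B
    '\x{2a700}-\x{2b734}', // Extension C
    '\x{2b740}-\x{2b81d}', // Extension D
    '\x{2b820}-\x{2cea1}', // Extension E
    '\x{2ceb0}-\x{2ebe0}', // Extension F
    '\x{f900}-\x{fad9}',   // CJK Compatibility Ideographs
    '\x{3007}',            // 〇
);
$pattern = '/[' . implode('', $ranges) . ']/u';

// Mixed input: ASCII text, 覅 (U+8985), U+3400 (Extension A),
// U+20000 (Extension B) and 〇 (U+3007).
$sample = "PHP 7: 覅\u{3400}\u{20000}\u{3007}, ok?";
preg_match_all($pattern, $sample, $result);
$matched = implode('', $result[0]);

echo $matched, PHP_EOL; // 覅㐀𠀀〇
```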

Beyond these, there are also Han-related components, radicals, Bopomofo (Zhuyin) symbols and so on, which can be added as needed.
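
As a sketch of what such an addition might look like, the snippet below appends a few more blocks to the basic range. Which blocks to include is my own pick, not the author's list: CJK Radicals Supplement, Kangxi Radicals, Bopomofo and Bopomofo Extended.

```php
<?php
// Extra blocks chosen for illustration only; adjust to taste.
$extra_ranges = array(
    '\x{2e80}-\x{2eff}', // CJK Radicals Supplement
    '\x{2f00}-\x{2fdf}', // Kangxi Radicals
    '\x{3100}-\x{312f}', // Bopomofo
    '\x{31a0}-\x{31bf}', // Bopomofo Extended
);
$pattern = '/[' . '\x{4e00}-\x{9fef}' . implode('', $extra_ranges) . ']/u';

// U+2F00 is the Kangxi radical "one" (distinct from the ideograph U+4E00);
// U+3105 and U+3106 are the Bopomofo letters ㄅ and ㄆ.
$sample = "abc \u{2f00} \u{3105}\u{3106} xyz";
preg_match_all($pattern, $sample, $result);
$extra_matched = implode('', $result[0]);

echo $extra_matched, PHP_EOL; // ⼀ㄅㄆ
```

Kangxi radicals look identical to ordinary ideographs but are separate code points, so whether they should count as matches depends on what the text is being used for.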


