
A C++ Function for Correctly Getting the Length of a UTF-8 Character

2024-04-09 16:27:03 373

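In C++, char* and std::string hold a plain stream of bytes, where sizeof(char) == 1.

In other words, C++ itself does not distinguish between character encodings.

A valid UTF-8 character, however, can be 1 to 4 bytes long.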
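To make the difference between bytes and characters concrete, here is a minimal sketch showing that std::string::size() reports bytes, not characters. The helper name byte_length and the sample constants are illustrative assumptions, not part of the original post. The non-ASCII characters are written as explicit byte escapes so the result does not depend on the encoding of the source file.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the number of bytes in a UTF-8 string.
// std::string::size() counts char units (bytes), not characters.
std::size_t byte_length(const std::string& s)
{
    return s.size();
}

// Sample UTF-8 strings, each holding exactly one character:
const std::string kAscii     = "A";                // U+0041, 1 byte
const std::string kTwoByte   = "\xC3\xA9";         // U+00E9, 2 bytes
const std::string kThreeByte = "\xE4\xB8\xAD";     // U+4E2D, 3 bytes
const std::string kFourByte  = "\xF0\x9F\x98\x80"; // U+1F600, 4 bytes
```

Each constant is a single character to a reader, yet byte_length reports a different size for each one, which is why indexing a UTF-8 string by byte position can land in the middle of a character.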

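Now suppose the input is UTF-8 encoded. How can we locate the start of each UTF-8 character (each code point) precisely, without splitting a character in the middle?

See this page for reference: http://www.nubaria.com/en/blog/?p=289

Based on it, the following function can be derived: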

const unsigned char kFirstBitMask  = 128; // 10000000
const unsigned char kSecondBitMask = 64;  // 01000000
const unsigned char kThirdBitMask  = 32;  // 00100000
const unsigned char kFourthBitMask = 16;  // 00010000
const unsigned char kFifthBitMask  = 8;   // 00001000

// Returns the length in bytes of the UTF-8 character that starts with firstByte.
// Assumes firstByte is a valid lead byte, not a continuation byte (10xxxxxx).
int utf8_char_len(char firstByte)
{
    int offset = 1;

    if (firstByte & kFirstBitMask) // The first byte is 128 or greater, so it is beyond the ASCII range.
    {
        if (firstByte & kThirdBitMask) // The first byte is 224 or greater, so it must be at least a three-octet code point.
        {
            if (firstByte & kFourthBitMask) // The first byte is 240 or greater, so it must be a four-octet code point.
                offset = 4;
            else
                offset = 3;
        }
        else
        {
            offset = 2;
        }
    }
    return offset;
}
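As a usage sketch, a function like this can drive a loop that steps through a UTF-8 string one character at a time. The names lead_byte_len and split_utf8 are hypothetical and do not come from the original post. lead_byte_len is a compact stand-in that applies the same bit tests as utf8_char_len, included only so the block compiles by itself. The sketch assumes well-formed input and guards only against a truncated final character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for utf8_char_len: length in bytes of the
// character that starts with lead byte c (assumes a valid lead byte).
int lead_byte_len(char c)
{
    const unsigned char b = static_cast<unsigned char>(c);
    if ((b & 0x80) == 0) return 1; // 0xxxxxxx: ASCII
    if ((b & 0x20) == 0) return 2; // 110xxxxx
    if ((b & 0x10) == 0) return 3; // 1110xxxx
    return 4;                      // 11110xxx
}

// Hypothetical helper: splits a UTF-8 string into one substring per character.
std::vector<std::string> split_utf8(const std::string& s)
{
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < s.size())
    {
        std::size_t len = static_cast<std::size_t>(lead_byte_len(s[i]));
        if (len > s.size() - i)
            len = s.size() - i; // truncated input: keep whatever remains
        chars.push_back(s.substr(i, len));
        i += len;
    }
    return chars;
}
```

Because the index always advances by the length taken from the lead byte, each substring begins on a character boundary. Input that contains stray continuation bytes or invalid lead bytes would need additional validation, which this sketch leaves out.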
