How do you count the characters in a UTF-8 encoded string?

Time: 2024-11-06 14:21:44 | Views: 32

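In C++, a `std::string` holding UTF-8 text stores a sequence of bytes, and each Unicode character (code point) is encoded as one to four of those bytes. `str.length()` therefore returns the number of bytes, not the number of characters. To count the characters, you need to walk the string and tell the bytes that start a character apart from the bytes that continue one.

The UTF-8 encoding rules are:

- Values in the ASCII range (0 to 127) occupy 1 byte, whose highest bit is 0 (`0xxxxxxx`).
- Every other character occupies 2 to 4 bytes: a lead byte followed by 1 to 3 continuation bytes, each of the form `10xxxxxx`.
- A lead byte beginning with `110` is followed by 1 continuation byte (2 bytes in total); one beginning with `1110` is followed by 2 (3 bytes in total); one beginning with `11110` is followed by 3 (4 bytes in total).

You can use the following algorithm to count the characters in a `std::string`:

```cpp
#include <cstddef>
#include <string>

std::size_t countUtf8Chars(const std::string& str) {
    std::size_t charCount = 0;
    std::size_t i = 0;
    while (i < str.length()) {
        // Each iteration starts on the first byte of a character:
        // either an ASCII byte or the lead byte of a multi-byte sequence.
        ++charCount;
        ++i;
        // Skip the continuation bytes (10xxxxxx) that belong to it.
        while (i < str.length() &&
               (static_cast<unsigned char>(str[i]) & 0b11000000) == 0b10000000) {
            ++i;
        }
    }
    return charCount;
}
```

This function walks the `std::string` and increments the counter at every single-byte character and at the lead byte of every multi-byte character, then skips the continuation bytes that follow. It assumes the input is well-formed UTF-8 and does not validate it. It counts code points, so a character built from several code points (a base letter plus a combining accent, for example) is counted more than once.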
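If the input might not be well-formed, the lead-byte patterns can also be used to check each sequence while counting. The following is only a minimal sketch under stated assumptions: `utf8SequenceLength` and `countUtf8CharsStrict` are hypothetical names introduced here for illustration, and the check is purely structural (a valid lead byte followed by the right number of `10xxxxxx` bytes). It does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: total length in bytes of the sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence.
std::size_t utf8SequenceLength(unsigned char lead) {
    if ((lead & 0b10000000) == 0b00000000) return 1;  // 0xxxxxxx (ASCII)
    if ((lead & 0b11100000) == 0b11000000) return 2;  // 110xxxxx
    if ((lead & 0b11110000) == 0b11100000) return 3;  // 1110xxxx
    if ((lead & 0b11111000) == 0b11110000) return 4;  // 11110xxx
    return 0;  // continuation byte (10xxxxxx) or invalid lead byte
}

// Hypothetical strict counter: writes the character count to `count` and
// returns true, or returns false on structurally malformed input
// (bad lead byte, truncated sequence, or missing continuation byte).
bool countUtf8CharsStrict(const std::string& str, std::size_t& count) {
    std::size_t n = 0;
    std::size_t i = 0;
    while (i < str.size()) {
        const std::size_t len =
            utf8SequenceLength(static_cast<unsigned char>(str[i]));
        if (len == 0 || len > str.size() - i) return false;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char next = static_cast<unsigned char>(str[i + k]);
            if ((next & 0b11000000) != 0b10000000) return false;
        }
        ++n;
        i += len;
    }
    count = n;
    return true;
}
```

For example, calling `countUtf8CharsStrict("h\xC3\xA9llo", n)` returns `true` and sets `n` to 5, while the same string's `size()` is 6, because `é` takes two bytes. Returning a `bool` with an out-parameter keeps the sketch compatible with older C++ standards; `std::optional` would be a natural alternative in C++17.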
