Code for converting between UTF-8 and unsigned char *, implemented in C/C++
Posted: 2024-10-16 13:11:05 | Views: 40
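In C/C++, UTF-8 is a variable-length character encoding, and `unsigned char *` is typically used to point at an array of bytes. A `std::string` that holds UTF-8 text already stores the encoded bytes, so converting it to `unsigned char` data means copying (or reinterpreting) those bytes one at a time; no per-code-point work is required. In the other direction, from `unsigned char *` back to a UTF-8 string, the bytes can be copied as they are, and parsing them byte by byte is only needed to check that the sequence is valid. Here is a simple example: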
**UTF-8 to `unsigned char *`**
```cpp
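#include <string>
#include <vector>

// Copies the bytes of a UTF-8 encoded std::string into a byte vector.
std::vector<unsigned char> utf8_to_bytes(const std::string& str) {
    std::vector<unsigned char> result;
    result.reserve(str.size());
    for (char c : str) {
        // char may be signed, so convert explicitly to the unsigned byte value.
        result.push_back(static_cast<unsigned char>(c));
    }
    return result;
}

// Example
std::string utf8_str = "你好,世界!";  // assumes the source file is saved as UTF-8
auto bytes = utf8_to_bytes(utf8_str);  // convert the UTF-8 string to a byte array; bytes.data() is an unsigned char*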
```
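When a copy is not needed, the same bytes can be viewed in place. The following is a minimal sketch, assuming a hypothetical helper named `as_bytes` that is not part of the original answer. It relies on the language rule that any object may be inspected through an `unsigned char` pointer. The returned pointer is only valid while the string is alive and unmodified.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration): exposes the bytes of a
// std::string as const unsigned char* without copying them.
// The caller must keep `s` alive and unmodified while using the pointer,
// and should pair the pointer with s.size() as the byte count.
inline const unsigned char* as_bytes(const std::string& s) {
    return reinterpret_cast<const unsigned char*>(s.data());
}
```

Compared with `utf8_to_bytes`, this avoids an allocation, at the cost of tying the pointer's lifetime to the string.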
**`unsigned char *` to UTF-8**
```cpp
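#include <cstddef>
#include <stdexcept>
#include <string>

// Checks that [ptr, ptr + len) is structurally well-formed UTF-8 and copies the
// bytes into a std::string. The input is already UTF-8, so the bytes are
// appended unchanged; decoding each code point and casting it to char would
// truncate every non-ASCII character.
std::string bytes_to_utf8(const unsigned char* ptr, std::size_t len) {
    std::string result;
    result.reserve(len);
    const unsigned char* end = ptr + len;
    while (ptr < end) {
        unsigned char byte1 = *ptr;
        std::size_t count;  // total number of bytes in this sequence
        if (byte1 <= 0x7F) {                  // ASCII character (1 byte)
            count = 1;
        } else if ((byte1 & 0xE0) == 0xC0) {  // 2-byte character
            count = 2;
        } else if ((byte1 & 0xF0) == 0xE0) {  // 3-byte character
            count = 3;
        } else if ((byte1 & 0xF8) == 0xF0) {  // 4-byte character
            count = 4;
        } else {
            throw std::runtime_error("Invalid UTF-8 sequence");
        }
        // Do not read past the end of the buffer.
        if (static_cast<std::size_t>(end - ptr) < count) {
            throw std::runtime_error("Truncated UTF-8 sequence");
        }
        // Every continuation byte must have the form 10xxxxxx.
        for (std::size_t i = 1; i < count; ++i) {
            if ((ptr[i] & 0xC0) != 0x80) {
                throw std::runtime_error("Invalid UTF-8 continuation byte");
            }
        }
        result.append(reinterpret_cast<const char*>(ptr), count);
        ptr += count;
    }
    return result;
}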
// Example
const unsigned char* byte_array = ...; // assume you have a byte array
size_t length = ...; // length of the byte array
std::string utf8_string = bytes_to_utf8(byte_array, length);
```
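Note that this example does not fully validate its input. A complete UTF-8 validator must also reject overlong encodings, surrogate code points (U+D800 to U+DFFF), and values above U+10FFFF, so real applications may need additional error checking.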
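If the goal is to obtain Unicode code points rather than a byte-for-byte copy, the payload bits of each sequence have to be combined into a wider integer. The following is a minimal sketch under stated assumptions: the function name `decode_code_points` and the choice of `std::vector<std::uint32_t>` as the output type are illustrative and not part of the original answer. It also performs the extra checks mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (an assumption for illustration): decodes UTF-8 bytes
// into Unicode code points, rejecting malformed input.
std::vector<std::uint32_t> decode_code_points(const unsigned char* ptr, std::size_t len) {
    std::vector<std::uint32_t> out;
    const unsigned char* end = ptr + len;
    while (ptr < end) {
        std::uint32_t cp = *ptr++;
        std::size_t extra = 0;      // number of continuation bytes expected
        std::uint32_t min_cp = 0;   // smallest value allowed for this length
        if (cp <= 0x7F) {
            // 0xxxxxxx: the byte is the code point.
        } else if ((cp & 0xE0) == 0xC0) {
            cp &= 0x1F; extra = 1; min_cp = 0x80;
        } else if ((cp & 0xF0) == 0xE0) {
            cp &= 0x0F; extra = 2; min_cp = 0x800;
        } else if ((cp & 0xF8) == 0xF0) {
            cp &= 0x07; extra = 3; min_cp = 0x10000;
        } else {
            throw std::runtime_error("Invalid UTF-8 lead byte");
        }
        if (static_cast<std::size_t>(end - ptr) < extra) {
            throw std::runtime_error("Truncated UTF-8 sequence");
        }
        for (std::size_t i = 0; i < extra; ++i) {
            unsigned char b = *ptr++;
            if ((b & 0xC0) != 0x80) {
                throw std::runtime_error("Invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (b & 0x3Fu);  // append 6 payload bits
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::runtime_error("Invalid UTF-8 code point");
        }
        out.push_back(cp);
    }
    return out;
}
```

A 32-bit element type is used because a code point can need up to 21 bits, which does not fit in a `char`.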