29、哈夫曼树编码文件压缩 【问题描述】采用哈夫曼编码思想实现文件的压缩和恢复(解压缩)功能。 【基本要求】 (1)实现文本文件的压缩和解压缩功能。 (2)运行时的压缩原文件的规模应不小于5KB。 (3)给出文件的压缩比。 (4)提供恢复文件与原文件的相同性对比功能。
时间: 2024-02-01 16:14:26 浏览: 93
实现文本文件的压缩和解压缩功能,主要步骤如下:
1. 统计文件中每个字符出现的次数,构建字符频率表。
2. 根据字符频率表构建哈夫曼树。
3. 根据哈夫曼树生成编码表,将每个字符的编码存储到压缩文件中。
4. 读取原文件中的每个字符,将其对应的编码写入压缩文件中。
5. 压缩文件的大小为编码后的二进制文件大小。
6. 解压缩时,读取压缩文件中的编码表和编码后的二进制文件,根据编码表将二进制文件转换为原始文本文件。
以下是一个可能的实现代码示例:
```python
import heapq
import os
class HuffmanCoding:
def __init__(self, filepath):
self.filepath = filepath
self.freq = {}
self.heap = []
self.codes = {}
self.reverse_codes = {}
class HeapNode:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other):
return self.freq < other.freq
def make_frequency_dict(self):
with open(self.filepath, 'r') as f:
for line in f:
for char in line:
if char in self.freq:
self.freq[char] += 1
else:
self.freq[char] = 1
def make_heap(self):
for char in self.freq:
node = self.HeapNode(char, self.freq[char])
heapq.heappush(self.heap, node)
def merge_nodes(self):
while len(self.heap) > 1:
node1 = heapq.heappop(self.heap)
node2 = heapq.heappop(self.heap)
merged = self.HeapNode(None, node1.freq + node2.freq)
merged.left = node1
merged.right = node2
heapq.heappush(self.heap, merged)
def make_codes_helper(self, root, current_code):
if root is None:
return
if root.char is not None:
self.codes[root.char] = current_code
self.reverse_codes[current_code] = root.char
return
self.make_codes_helper(root.left, current_code + "0")
self.make_codes_helper(root.right, current_code + "1")
def make_codes(self):
root = heapq.heappop(self.heap)
current_code = ""
self.make_codes_helper(root, current_code)
def get_encoded_text(self):
encoded_text = ""
with open(self.filepath, 'r') as f:
for line in f:
for char in line:
encoded_text += self.codes[char]
return encoded_text
def pad_encoded_text(self, encoded_text):
padding = 8 - len(encoded_text) % 8
for i in range(padding):
encoded_text += "0"
padded_info = "{0:08b}".format(padding)
return padded_info + encoded_text
def get_byte_array(self, padded_encoded_text):
if len(padded_encoded_text) % 8 != 0:
print("Encoded text not padded properly")
exit(0)
b = bytearray()
for i in range(0, len(padded_encoded_text), 8):
byte = padded_encoded_text[i:i+8]
b.append(int(byte, 2))
return b
def compress(self):
filename, file_extension = os.path.splitext(self.filepath)
output_path = filename + ".bin"
with open(output_path, 'wb') as output:
self.make_frequency_dict()
self.make_heap()
self.merge_nodes()
self.make_codes()
encoded_text = self.get_encoded_text()
padded_encoded_text = self.pad_encoded_text(encoded_text)
b = self.get_byte_array(padded_encoded_text)
output.write(bytes(b))
print("Compressed file saved as:", output_path)
def remove_padding(self, padded_encoded_text):
padded_info = padded_encoded_text[:8]
padding = int(padded_info, 2)
padded_encoded_text = padded_encoded_text[8:]
encoded_text = padded_encoded_text[:-1*padding]
return encoded_text
def decode_text(self, encoded_text):
current_code = ""
decoded_text = ""
for bit in encoded_text:
current_code += bit
if current_code in self.reverse_codes:
char = self.reverse_codes[current_code]
decoded_text += char
current_code = ""
return decoded_text
def decompress(self, input_path):
filename, file_extension = os.path.splitext(input_path)
output_path = filename + "_decompressed.txt"
with open(input_path, 'rb') as file, open(output_path, 'w') as output:
bit_string = ""
byte = file.read(1)
while len(byte) > 0:
byte = ord(byte)
bits = bin(byte)[2:].rjust(8, '0')
bit_string += bits
byte = file.read(1)
encoded_text = self.remove_padding(bit_string)
decompressed_text = self.decode_text(encoded_text)
output.write(decompressed_text)
print("Decompressed file saved as:", output_path)
if __name__ == '__main__':
hc = HuffmanCoding("example.txt")
hc.compress()
hc.decompress("example.bin")
```
其中,`HuffmanCoding` 类的 `compress` 方法实现了文本文件的压缩功能,`decompress` 方法实现了解压缩功能。在 `compress` 方法中,会先统计文件中每个字符出现的次数,然后构建哈夫曼树,并根据哈夫曼树生成编码表,将每个字符的编码存储到压缩文件中。接着,读取原文件中的每个字符,将其对应的编码写入压缩文件中。压缩文件的大小为编码后的二进制文件大小。在 `decompress` 方法中,会读取压缩文件中的编码表和编码后的二进制文件,根据编码表将二进制文件转换为原始文本文件。
可以通过比较原文件和解压缩后的文件大小来计算文件的压缩比。另外,在解压缩时,还可以提供恢复文件与原文件的相同性对比功能,检查解压缩后的文件是否与原文件完全一致。
阅读全文