Huffman Codes (30)
In 1953, David A. Huffman published his paper "A Method for the Construction of Minimum-Redundancy Codes", and hence printed his name in the history of computer science. As a professor who gives the final exam problem on Huffman codes, I am encountering a big problem: the Huffman codes are NOT unique. For example, given a string "aaaxuaxz", we can observe that the frequencies of the characters 'a', 'x', 'u' and 'z' are 4, 2, 1 and 1, respectively. We may either encode the symbols as {'a'=0, 'x'=10, 'u'=110, 'z'=111}, or in another way as {'a'=1, 'x'=01, 'u'=001, 'z'=000}, both compress the string into 14 bits. Another set of code can be given as {'a'=0, 'x'=11, 'u'=100, 'z'=101}, but {'a'=0, 'x'=01, 'u'=011, 'z'=001} is NOT correct since "aaaxuaxz" and "aazuaxax" can both be decoded from the code 00001011001001. The students are submitting all kinds of codes, and I need a computer program to help me determine which ones are correct and which ones are not.
Input Specification:
Each input file contains one test case. For each case, the first line gives an integer N (2 <= N <= 63), then followed by a line that contains all the N distinct characters and their frequencies in the following format:
c[1] f[1] c[2] f[2] ... c[N] f[N]
where c[i] is a character chosen from {'0' - '9', 'a' - 'z', 'A' - 'Z', '_'}, and f[i] is the frequency of c[i] and is an integer no more than 1000. The next line gives a positive integer M (<=1000), then followed by M student submissions. Each student submission consists of N lines, each in the format:
c[i] code[i]
where c[i] is the i-th character and code[i] is a string of '0's and '1's.
Output Specification:
For each test case, print in each line either “Yes” if the student’s submission is correct, or “No” if not.
Sample Input:7A 1 B 1 C 1 D 3 E 3 F 6 G 64A 00000B 00001C 0001D 001E 01F 10G 11A 01010B 01011C 0100D 011E 10F 11G 00A 000B 001C 010D 011E 100F 101G 110A 00000B 00001C 0001D 001E 00F 10G 11Sample Output:
YesYesNoNo
#include#include #include
Huffman code的特征有两点
1. WPL(带权路径长度)最小,这个我们可以构建一个Huffman树来计算。
2. 每个编码唯一不具二义性,也就是每个编码都不会是另一个编码的前缀。
我的代码思路是这样
我们首先把输入的权值保存在一个map里面,然后交给
unsigned int computeWPLwithFrequencyMap(map&f)
来计算出WPL,函数是通过建立一个Huffman树然后遍历得到WPL值的,应该有更好的方法。
之后用得到的WPL来和每个同学的WPL比较,比我们大的肯定不是Huffman编码了。
之后检测是否有二义性有两个办法(我目前想到两个):
1.按编码长度排序(升序),穷举每个编码是否是后面编码的子串,也就是字符串比较,可以通过kmp比较。
2.建立trie树,看看每个编码的路径中是否有其他的编码。
我用的是办法2,我先用sort()函数进行了按编码长度的升序排列(偷懒了,也可以不用排序),然后依次建立trie,
把编码最后一位所在的节点进行标记。按照Huffman编码,这个肯定是叶节点,如果以后有编码路径经过它,
那这个肯定不是Huffman编码了。