2013-06-14 19:46:46Morris

[UVA][字串分析] 860 - Entropy Text Analyzer

Problem I

Entropy Text Analyzer

Claude Shannon, mathematician and computer scientist, born on April 30 1916, and died on February 24 2001, was the one who create the mathematical foundations which laid down the general rules of modern information theory. In his fundamental paper of 1948, A Mathematical Theory of Communication, a measure of the uncertainty associated with a random memoryless source, called entropy, is proposed. Here we are interested in the use of the entropy concept to analyze texts at the level of its words variety.

We define the entropy of a text T, with words and n different ones, by the formula

where pi, i = 1,...,n, is the frequency of each i-word in the text T, that is, pi is the number of times that the i-word happens to occur in the given text. If we consider that a text of length (a text with words) is as much richer as much larger is the number n of different words and, among the texts with the same number of words and the same number n of different words, is richer the one where the words have less variation in frequency, one can easily conclude that the entropy is indeed a very useful measure to compare the richness of two or more texts. To compare texts with different number of words , we introduce a kind of ``relative entropy'' Erel, defined as the quotient between the entropy ET of the text and the maximum entropy Emax, and multiplying by 100 if one wants a percentage:

Erel = ET

Emax

×100 .

The maximum entropy Emax is just the entropy of a text with the same number of words and in which each word occurs exactly once (i.e., n : = , pi : = 1):


Problem

Given a text T, write a program that computes the total number of words in T, the entropy ET of the text, and its relative entropy Erel. In order to determine the required numbers, your program must be case insensitive (for example, words like ``House'', ``house'' or ``HOUSE'' must be considered to be the same). Also, in the context of this program, a word is a consecutive sequence of characters different of the punctuation marks , . : ; ! ? " ( ) as well as spaces, tabs and newlines ('\n'). Words with only one letter are to be considered.

Input

The input contains several texts T, each one necessarily with more than one word ( > 1). You can assume that the maximum length of the words is 20 characters long and that a single text does not have more than 100 000 words.

A line containing only ****END_OF_TEXT*** marks the end of each text, and a line containing ****END_OF_INPUT**** marks the end of input. You can be certain that these reserved words wil not appear inside a text. Besides thats, everything can appear on a text, including blank lines.

Output

In the output write one line for each test, each one containing three numbers: the first with the total number of words in T; the second with the text entropy ET rounded to one decimal digit; and the last one with the relative entropy Erel, in percentage, and rounded to be an integer.

Sample Input1

Midnight, not a sound from the pavement
Has the moon lost her memory?
She is smiling alone
In the lamplight, the withered leaves collect at my feet
And the wind begins to moan
****END_OF_TEXT****
Memory, all alone in the moonlight
I can dream of the old days
Life was beautiful then
I remember the time I knew what happiness was
Let the memory live again
****END_OF_TEXT****
****END_OF_INPUT****

Sample Output

33 1.4 93
31 1.3 89

Footnotes:

1Barbra Streisand, Memory (first two verses).



Delfim Marado Torres, MIUP'2002


根據題目意思去切割單字,並且計算每個單字出現的頻率。
必須將
, . : ; ! ? " ( ) 視為空白。

之後就是計算公式部分而已。雖然我不知道公式的含意。


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <map>
#include <iostream>
#include <sstream>
using namespace std;
int main() {
    char s[10005];
    int i, j;
    while(gets(s)) {
        if(!strcmp(s, "****END_OF_INPUT****"))
            break;
        map<string, int> R;
        int wordcnt = 0;
        do {
            if(!strcmp(s, "****END_OF_TEXT****"))
                break;
            for(i = 0; s[i]; i++) {
                switch(s[i]) {
                    case ',':case '.':case ':':
                    case ';':case '!':case '?':
                    case '\"':case '(':case ')':
                        s[i] = ' ';
                }
                if(s[i] >= 'A' && s[i] <= 'Z')
                    s[i] += 32;
            }
            stringstream sin(s);
            string token;
            while(sin >> token) {
                int &v = R[token];
                v++, wordcnt++;
            }

        } while(gets(s));
        double Et = 0, Emax = 0, Erel;
        for(map<string, int>::iterator it = R.begin();
            it != R.end(); it++) {
            Et += (it->second)*(log10(wordcnt)-log10(it->second));
        }
        Et /= wordcnt;
        Emax = log10(wordcnt);
        Erel = Et / Emax;
        printf("%d %.1lf %.0lf\n", wordcnt, Et, Erel*100);
    }
    return 0;
}