IMPROVED SUMMARIZATION OF CHINESE SPOKEN DOCUMENTS BY PROBABILISTIC LATENT SEMANTIC ANALYSIS (PLSA) WITH FURTHER ANALYSIS AND INTEGRATED SCORING

Sheng-yi Kong and Lin-shan Lee
Speech Lab., College of EECS, National Taiwan University, Taipei, Taiwan, Republic of China
[email protected], [email protected]

ABSTRACT

In a previous paper [1], two new scoring measures, Topic Significance (TS) and Topic Entropy (TE), obtained from Probabilistic Latent Semantic Analysis (PLSA), were shown to outperform the very successful baseline Significance Score (SS) in selecting the important sentences for summarization of spoken documents. In this paper, extensive experiments using ROUGE scores with respect to different parameters at different summarization ratios are carefully analyzed. It was also found that integrating these two scoring measures offers further improvements, and that special consideration of the structure of the Chinese language is also helpful when summarizing Chinese spoken documents.

Index Terms— Summarization, Spoken document, Probabilistic Latent Semantic Analysis

1. INTRODUCTION

In the future network era, digital content over the network will include all the information activities of human life. The most attractive form of network content will be multi-media including speech information, and it is in such speech information that we usually find the subjects, topics, and concepts of the associated multi-media content. However, multimedia/spoken documents are just video/audio signals, or very long sequences of words including errors if automatically transcribed. They are much more difficult to retrieve and browse, because they cannot be easily displayed on the screen, and the user cannot simply "skim through" each of them from the beginning to the end. As a result, spoken document summarization becomes very important [2]. Automatic summarization of documents has been actively investigated. Many approaches select a number of indicative sentences or passages from the original document according to a target summarization ratio and sequence them to form a summary. Some approaches try to identify sentences carrying concepts closer to those of the complete document [3]. Spoken documents also carry intrinsic difficulties such as recognition errors, problems with spontaneous speech, and the lack of correct sentence or paragraph boundaries. In recent years, a general approach has been found very successful [4, 5], in which each sentence in the document, S = t1 t2 . . . tj . . . tn, represented as a sequence of terms tj, is given a score:

I(S) = (1/n) Σ_{j=1}^{n} [λ1 s(tj) + λ2 l(tj) + λ3 c(tj) + λ4 g(tj)] + λ5 b(S),   (1)


where s(tj), l(tj), c(tj), g(tj) are respectively a statistical measure (such as TF/IDF), a linguistic measure (e.g., different parts-of-speech (POSs) are given different weights), a confidence score, and an N-gram score for the term tj; b(S) is calculated from the grammatical structure of the sentence S; and λ1, λ2, λ3, λ4 and λ5 are weighting parameters. The sentences used in the summary are then selected based on this score I(S). In a recent paper we found that the topical information obtained with Probabilistic Latent Semantic Analysis (PLSA) is very useful in estimating the statistical measure s(tj) in equation (1) above to identify the important sentences [6]. In this approach, a set of latent topics Tk, k = 1, 2, . . . , K, is assumed, and the relationships among all the terms, documents and these topics are modeled by a probabilistic framework, with all probabilities trained by the EM algorithm [1]. Below, the proposed approach is described in section 2, and the experiments and results are presented in sections 3 and 4.

2. PROPOSED APPROACH

The approach proposed in this paper uses a simplified version of equation (1) above. We basically follow the successful methods reported recently [7], while focusing on the use of the different scoring measures obtained with PLSA to estimate the statistical measure s(tj) in equation (1). One approach for evaluating this statistical measure s(tj), which has proved extremely useful, is the "significance score" (hereafter referred to as the baseline Significance Score, SS) [7],

s(tj) = n(tj, di) · log(FA / Ftj),   (2)

where n(tj, di) is the number of occurrences of the term tj in the given document di, Ftj is the number of occurrences of tj in a large corpus, and FA is the number of occurrences of all terms or content words in the corpus. The basic idea is that terms with fewer occurrences are more semantically significant.
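To make this concrete, below is a minimal sketch (not the authors' implementation) of the baseline Significance Score of equation (2) and of sentence selection with a simplified equation (1) in which only the statistical measure s(tj) is kept; the corpus statistics Ftj and FA are assumed to have been precomputed from a large text corpus, and all function names are illustrative.

```python
import math
from collections import Counter

def significance_score(term, doc_counts, corpus_counts, corpus_total):
    """Baseline Significance Score of equation (2): s(t_j) = n(t_j, d_i) * log(F_A / F_tj)."""
    n_td = doc_counts[term]                  # n(t_j, d_i): occurrences of the term in this document
    f_t = corpus_counts.get(term, 1)         # F_tj from a large corpus; fall back to 1 if unseen
    return n_td * math.log(corpus_total / f_t)

def rank_sentences(sentences, corpus_counts, corpus_total, ratio=0.1):
    """Simplified equation (1): keep only s(t_j), average over the terms of each sentence,
    and select the top sentences according to the target summarization ratio."""
    doc_counts = Counter(t for sent in sentences for t in sent)
    scored = []
    for idx, sent in enumerate(sentences):
        score = sum(significance_score(t, doc_counts, corpus_counts, corpus_total)
                    for t in sent) / max(len(sent), 1)
        scored.append((score, idx))
    n_keep = max(1, int(ratio * len(sentences)))
    return sorted(idx for _, idx in sorted(scored, reverse=True)[:n_keep])

# Toy usage (terms are already segmented words):
# rank_sentences([["台灣", "新聞"], ["天氣", "報告"]],
#                {"台灣": 50, "新聞": 200, "天氣": 80, "報告": 120}, 100000, ratio=0.5)
```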


The approaches proposed in this paper based on PLSA [1] are briefly summarized below.

2.1. Probabilistic Latent Semantic Analysis (PLSA)

The set of documents {di, i = 1, 2, . . . , N} has conventionally been analyzed through the terms {tj, j = 1, 2, . . . , L} they may include, usually with statistical approaches. In recent years, efforts have also been made to establish a probabilistic framework for such purposes with improved model training algorithms, of which Probabilistic Latent Semantic Analysis (PLSA) [6] is often considered a representative. In PLSA, a set of latent topic variables {Tk, k = 1, 2, . . . , K} is defined to characterize the "term-document" co-occurrence relationships. Both the document di and the term tj are assumed to be independently conditioned on an associated latent topic Tk. The conditional probability of a document di generating a term tj can thus be parameterized by

P(tj|di) = Σ_{k=1}^{K} P(tj|Tk) P(Tk|di).   (3)

Notice that this probability is not obtained directly from the frequency of the term tj occurring in di, but instead through P(tj|Tk), the probability of observing tj in the latent topic Tk, as well as P(Tk|di), the likelihood that di addresses the latent topic Tk. The PLSA model can be optimized with the EM algorithm by maximizing a carefully defined likelihood function [6].
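The EM updates are not spelled out in the paper; the following is a minimal sketch of plain (untempered) EM for the PLSA model of equation (3), assuming a precomputed document-by-term count matrix. Variable names and the stopping rule (a fixed number of iterations) are illustrative choices, not taken from the paper.

```python
import numpy as np

def train_plsa(counts, K, n_iter=50, seed=0):
    """Plain EM for the PLSA model of equation (3).

    counts: (N, L) array with n(t_j, d_i) for N documents and L terms.
    Returns P(t_j|T_k) with shape (K, L) and P(T_k|d_i) with shape (N, K).
    """
    rng = np.random.default_rng(seed)
    N, L = counts.shape
    p_w_given_z = rng.random((K, L))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((N, K))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(T_k | d_i, t_j), shape (N, L, K)
        joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
        resp = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        weighted = counts[:, :, None] * resp          # n(t_j, d_i) * P(T_k | d_i, t_j)
        # M-step: re-estimate P(t_j | T_k) and P(T_k | d_i)
        p_w_given_z = weighted.sum(axis=0).T
        p_w_given_z /= np.maximum(p_w_given_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_given_d = weighted.sum(axis=1)
        p_z_given_d /= np.maximum(p_z_given_d.sum(axis=1, keepdims=True), 1e-12)
    return p_w_given_z, p_z_given_d
```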

2.2. Topic Significance (TS)

The Topic Significance (TS) of a term tj with respect to a topic Tk, Stj(Tk), is defined as:

Stj(Tk) = [Σ_{di∈D} n(tj, di) P(Tk|di)] / [Σ_{di∈D} n(tj, di) (1 − P(Tk|di))],   (4)

where n(tj, di) has been defined in equation (2) above, and P(Tk|di) for a latent topic Tk is obtained from the PLSA modeling. In the numerator of equation (4), n(tj, di) is weighted by a factor reflecting how strongly the document di is focused on the topic Tk, while in the denominator it is weighted by the probability that the document di addresses all topics other than Tk. After summation over all documents di, a higher Stj(Tk) in equation (4) implies that the term tj has a higher frequency in the latent topic Tk than in other latent topics, and is thus more important in the latent topic Tk. Given this topic significance in equation (4), the statistical measure s(tj) to be used in equation (1) based on topic significance can be defined as:

sTS(tj) = Σ_{k=1}^{K} Stj(Tk) P(Tk|di).   (5)

That is, the topic significance of a term tj for a topic Tk, Stj(Tk), is further weighted by the topic distribution of the document di and summed over all topics. The term P(Tk|di) can be better estimated by folding in the probabilities P(Tk|tj). A higher sTS(tj) implies that the term is more important and should be given a higher priority when extracting sentences for summarization.
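As a direct transcription of equations (4) and (5), the sketch below computes the Topic Significance of a term and the resulting measure sTS(tj), assuming the count matrix and the PLSA posteriors P(Tk|di) from the previous sketch are available; it is illustrative only.

```python
import numpy as np

def topic_significance(counts, p_z_given_d, j, k):
    """Equation (4): S_tj(T_k), how strongly term t_j is associated with latent topic T_k."""
    n_t = counts[:, j]                                     # n(t_j, d_i) over all documents d_i in D
    num = float(np.sum(n_t * p_z_given_d[:, k]))           # weighted toward documents focused on T_k
    den = float(np.sum(n_t * (1.0 - p_z_given_d[:, k])))   # weighted toward all other topics
    return num / max(den, 1e-12)

def s_ts(counts, p_z_given_d, j, i):
    """Equation (5): topic significance weighted by the topic distribution P(T_k | d_i)."""
    K = p_z_given_d.shape[1]
    return sum(topic_significance(counts, p_z_given_d, j, k) * p_z_given_d[i, k]
               for k in range(K))
```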

2.3. Topic Entropy (TE)

Topic Entropy (TE, previously referred to as term entropy in [1]) can be obtained from the topic distribution P(Tk|tj) for each term tj, estimated as follows:

P(Tk|tj) = P(tj|Tk) P(Tk) / P(tj) ≈ P(tj|Tk) / P(tj),   (6)

where the probability P(Tk) is left out because a good approach to estimate it is not yet available, while P(tj) can be obtained from a large corpus. The topic entropy for a term tj is then defined as

H(tj) = − Σ_{k=1}^{K} P(Tk|tj) log P(Tk|tj),   (7)

which is a measure of how strongly the term is focused on a few topics, so a lower topic entropy implies the term carries more topical information. The statistical measure s(tj) to be used in equation (1) based on topic entropy can then be defined as:

sEN(tj) = α n(tj, di) / H(tj),   (8)

where α is a scaling factor. The statistical measure sEN(tj) is thus inversely proportional to the topic entropy H(tj), i.e., higher for a lower H(tj).
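Equations (6)-(8) can be transcribed in the same spirit; here P(tj) is assumed to be a unigram probability estimated from a large corpus, and α is a free scaling factor as in the text. Again, this is only a sketch under those assumptions.

```python
import numpy as np

def topic_entropy(p_w_given_z, p_t, j):
    """Equations (6) and (7): H(t_j) with P(T_k | t_j) ≈ P(t_j | T_k) / P(t_j); P(T_k) is left out."""
    p_z_given_t = p_w_given_z[:, j] / max(p_t[j], 1e-12)   # p_t[j]: corpus unigram probability of t_j
    p_z_given_t = np.clip(p_z_given_t, 1e-12, None)
    return float(-np.sum(p_z_given_t * np.log(p_z_given_t)))

def s_en(counts, p_w_given_z, p_t, j, i, alpha=1.0):
    """Equation (8): s_EN(t_j) = alpha * n(t_j, d_i) / H(t_j); lower entropy gives a higher score."""
    return alpha * counts[i, j] / max(topic_entropy(p_w_given_z, p_t, j), 1e-12)
```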


2.4. Integrated Scoring

It is possible to improve the performance by integrating the two scores proposed above by linear interpolation:

sI(tj) = (1 − ω) sTS(tj) + ω sEN(tj),   (9)

where ω is a weighting parameter ranging from 0 to 1.
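The interpolation of equation (9) itself is a one-liner; a small illustrative sketch, with a hypothetical grid search over ω, is shown below.

```python
import numpy as np

def s_integrated(s_ts_value, s_en_value, omega):
    """Equation (9): linear interpolation of the two PLSA-based measures, with 0 <= omega <= 1."""
    return (1.0 - omega) * s_ts_value + omega * s_en_value

# A hypothetical sweep over omega; in practice the weight would be tuned on held-out data.
for omega in np.linspace(0.0, 1.0, 11):
    _ = s_integrated(0.8, 1.3, omega)   # 0.8 and 1.3 are made-up s_TS and s_EN values
```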

2.5. Considerations of the Special Structure of the Chinese Language

The Chinese language is quite different from many Western languages in its very special monosyllabic structure [8]. It is not alphabetical. Every character has its own meaning and is pronounced as a monosyllable, and a word is composed of one to several characters. A monosyllable is usually shared by many different homonym characters. As a result, in speech recognition the syllable accuracy is very often the highest but syllables carry the highest ambiguity, the word accuracy is the lowest due to the out-of-vocabulary (OOV) problem but words carry the most semantics, and characters are in between. For the purposes here, various definitions of the term tj in the above formulation can therefore be chosen to replace the role of words (W), including characters (C), overlapping segments of two syllables (S(2)), and various combinations thereof, e.g., words plus overlapping segments of two syllables (W+S(2)), etc.
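As an illustration of these term-unit choices, the sketch below derives characters (C), overlapping syllable bigrams (S(2)), and the combination W+S(2) from a recognized word sequence; the word-to-syllable lookup is assumed to come from the recognizer's pronunciation lexicon, and all names are hypothetical.

```python
def character_terms(words):
    """C: every Chinese character in the word sequence becomes a term."""
    return [ch for w in words for ch in w]

def syllable_bigram_terms(words, lexicon):
    """S(2): overlapping segments of two syllables over the whole utterance.
    `lexicon` maps each word to its syllable sequence (e.g., from the pronunciation dictionary)."""
    syllables = [s for w in words for s in lexicon[w]]
    return [tuple(syllables[i:i + 2]) for i in range(len(syllables) - 1)]

def combined_terms(words, lexicon):
    """W+S(2): words plus overlapping syllable bigrams, one of the combinations in section 2.5."""
    return list(words) + syllable_bigram_terms(words, lexicon)

# Example with a hypothetical two-word hypothesis and toy lexicon:
# lexicon = {"台灣": ["tai2", "wan1"], "大學": ["da4", "xue2"]}
# combined_terms(["台灣", "大學"], lexicon)
# -> ["台灣", "大學", ("tai2", "wan1"), ("wan1", "da4"), ("da4", "xue2")]
```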

3. EXPERIMENTAL SETUP

3.1. Training and Testing Corpus

The experiments were performed with broadcast news stories in Mandarin Chinese. The training corpus included 15,000 news stories in text form, without word errors, collected in August 2001 and provided by the Central News Agency of Taipei. They were used to calculate Ftj and FA in equation (2) and to train the PLSA models under various numbers of latent topics. The testing corpus included 200 news stories broadcast in August 2001 by a few radio stations in Taipei. The average length of each story was about 29 sec, and both manual transcriptions without word errors and ASR results were used. In the ASR results, the accuracies for words, characters and syllables were 66.46%, 74.95% and 81.70% respectively. Three human subjects (students at National Taiwan University) were requested to produce three reference summaries for each news story by ranking the importance of the sentences in each story from "the most important" to "of average importance."

Fig. 1. ROUGE scores (R-N: ROUGE-N, R-L: ROUGE-L) for different scoring measures (SS: Significance Score, TS: Topic Significance, and TE: Topic Entropy) using correct manual transcriptions at (a) 10% and (b) 30% summarization ratios, or using ASR results at (c) 10% and (d) 30% summarization ratios. Words (W) were used as terms in all experiments here.

3.2. Evaluation Metrics

The well-known and useful evaluation package ROUGE [9] was used in this research, including ROUGE-N (N = 1, 2, 3) and ROUGE-L scores. ROUGEr-N is an N-gram recall between an automatically generated summary and a set of manually generated reference summaries, calculated as follows:

ROUGEr-N = [Σ_{S∈(Ref. Summaries)} Σ_{gramN∈S} Countmatch(gramN)] / [Σ_{S∈(Ref. Summaries)} Σ_{gramN∈S} Count(gramN)],   (10)

where N stands for the length of the N-grams, gramN, considered, Countmatch(gramN) is the maximum number of N-grams co-occurring in the automatically generated summary and the set of reference summaries S, and Count(gramN) is the total number of N-grams in the reference summaries S. ROUGE-L is obtained similarly, but counting the "longest common subsequence" (LCS) between the automatically generated summary and the reference summaries. ROUGE-1 and ROUGE-L were reported to have very good correlation with human evaluation of summaries when the summaries are extremely short. F-measures for ROUGE-N and ROUGE-L can be evaluated in exactly the same way, and these F-measures were used in all the following results.
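For illustration, a minimal sketch of the ROUGE-N recall of equation (10) over multiple reference summaries is given below; the official ROUGE package [9] was what was actually used, and details such as stemming, word segmentation, and the jackknifing behind the F-measures are ignored here.

```python
from collections import Counter

def ngrams(tokens, n):
    """All overlapping N-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, references, n):
    """Equation (10): clipped N-gram matches over the total N-grams in the reference summaries."""
    cand_counts = Counter(ngrams(candidate, n))
    matched, total = 0, 0
    for ref in references:
        ref_counts = Counter(ngrams(ref, n))
        total += sum(ref_counts.values())
        # Count_match: an N-gram is credited at most as often as it appears in the candidate
        matched += sum(min(cnt, cand_counts[g]) for g, cnt in ref_counts.items())
    return matched / total if total else 0.0
```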

4. RESULTS AND DISCUSSIONS

In the experiments presented below, the summarization ratio was set to 10% and 30% respectively. The number of latent topics in the PLSA modeling was manually set to 16, 32, 64, and 128. Considering the special structure of the Chinese language discussed in section 2.5, we used not only words (W) as the terms tj in the above formulation but also characters (C), overlapping segments of two syllables (S(2)), and combinations of them, e.g., words plus overlapping segments of two syllables (W+S(2)), etc.


Extensive experiments were performed, and it was found that the F-measures obtained for ROUGE-1, 2, 3 and ROUGE-L behaved very similarly in most of the experiments when we varied all the different parameters, including the scoring measures, the number of latent topics for the PLSA models, the choices of terms, and the summarization ratios. In other words, the performance ranking of the different parameter choices using one ROUGE score is usually very similar to that using the others. In particular, the F-measures from ROUGE-1 and ROUGE-L are very close when the summarization ratio is 10%. Figure 1 compares the results for manual transcriptions (Figure 1(a)(b)) and ASR results (Figure 1(c)(d)) using words (W) as terms at 10% and 30% summarization ratios respectively. The two proposed scoring measures, Topic Significance (TS) and Topic Entropy (TE), both outperformed the baseline Significance Score (SS) in all cases. Topic Significance (TS) was the best at the 10% summarization ratio for both the transcriptions and the ASR results in Figure 1 (a) and (c) (with minor exceptions in ROUGE-2, 3 in Figure 1(a), which are less important as discussed in the literature). Topic Entropy (TE) was the best at the 30% summarization ratio for both the transcriptions and the ASR results in Figure 1 (b) and (d). The performance for ASR results is very close to that for manual transcriptions, so the measures proposed here are reasonably robust with respect to recognition errors.

Figure 2 shows the results with each of the three scoring measures (SS ((a)(b)), TS ((c)(d)), TE ((e)(f))) when different choices of the term tj were used for summarizing ASR results at 10% ((a), (c) and (e)) and 30% ((b), (d) and (f)) summarization ratios. The scores for ASR results previously presented in Figure 1 (c) and (d) were obtained using words (W) as the term, so they are exactly the leftmost dark blue bars in each of the groups in Figure 2. As discussed in section 2.5, different choices of the term carry different levels of information: words (W) have the most semantic meaning but the lowest recognition accuracy, syllables are the contrary, and characters are in between. It turned out that the best choice of term in fact depends on the scoring measure and the summarization ratio. The best choice of term is the character (C) for the baseline Significance Score (SS) in Figure 2 (a)(b), the word (W) for Topic Significance (TS) in Figure 2 (c)(d), and the word plus the overlapping segment of two syllables (W+S(2)) for Topic Entropy (TE) at the 10% summarization ratio in Figure 2 (e) and the character (C) at the 30% summarization ratio in Figure 2 (f). Across all the results in Figure 2, the best performance using a single scoring measure at the 10% summarization ratio was obtained with Topic Significance (TS), words (W) as the term, and 128 latent topics in Figure 2(c). At the 30% summarization ratio, Topic Entropy (TE) with characters (C) as the term and 64 latent topics gave the highest score in Figure 2(f). The proposed measures can also be more robust to noisy conditions or sparseness of the training data: topic-specific words can receive higher scores from the topical information, while they receive low rankings with significance scores or the conventional TF/IDF measure.

As mentioned in section 2.4, we also tried to improve the performance by integrating the two scoring measures TS and TE. Figure 3 presents two typical examples of such integration, using W+S(2) and C+S(2) as the choices of the term for summarization ratios of 10% and 30% in Figure 3 (a) and (b) respectively. In each case the left end is the result for TS and the right end for TE, corresponding to the results in Figure 2 (c)(d)(e)(f). Only ROUGE-1 and ROUGE-L are shown. The results show that reasonable improvements are obtainable with such integrated scoring.

Fig. 2. The results with different choices of terms (W: words, C: characters, S(2): overlapping segments of two syllables) for summarizing ASR results at 10% and 30% summarization ratios, using SS ((a) and (b)), TS ((c) and (d)), and TE ((e) and (f)).

Fig. 3. Integration of the two scoring measures: (a) W+S(2), 128 latent topics, 10% summarization ratio; (b) C+S(2), 32 latent topics, 30% summarization ratio. The horizontal axis is the weight of Topic Entropy (ω in equation (9)); ROUGE-1 and ROUGE-L F-measures are shown.

5. CONCLUSIONS AND FUTURE WORK

The proposed scoring measures, Topic Significance (TS) and Topic Entropy (TE), were shown to outperform the baseline Significance Score (SS) in extensive experiments evaluated with ROUGE metrics. The performance with respect to different choices of parameters at different summarization ratios was carefully analyzed. It seems that the performance with Topic Significance (TS) is slightly less stable but sometimes gives very high scores, while that with Topic Entropy (TE) is much more stable and very good in most cases.


6. REFERENCES

[1] S.-y. Kong and L.-s. Lee, "Improved spoken document summarization using probabilistic latent semantic analysis (PLSA)," in Proc. ICASSP, 2006.
[2] L.-s. Lee and B. Chen, "Spoken document understanding and organization," IEEE Signal Processing Magazine (Special Section), 2005.
[3] Y. Gong and X. Liu, "Generic text summarization using relevance measure and latent semantic analysis," in Proc. ACM SIGIR Conference on R&D in Information Retrieval, 2001, pp. 19–25.
[4] J. Goldstein, M. Kantrowitz, and J. Carbonell, "Summarizing text documents: Sentence selection and evaluation metrics," in Proc. ACM SIGIR Conference on R&D in Information Retrieval, 1999, pp. 121–128.
[5] S. Furui, T. Kikuchi, Y. Shinnaka, and C. Hori, "Speech-to-text and speech-to-speech summarization of spontaneous speech," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 4, pp. 401–408, 2004.
[6] T. Hofmann, "Probabilistic latent semantic analysis," in Proceedings of the 15th Conference on Uncertainty in AI, 1999.
[7] M. Hirohata, Y. Shinnaka, K. Iwano, and S. Furui, "Sentence extraction-based presentation summarization techniques and evaluation metrics," in Proc. ICASSP, 2005, pp. SP–P16.14.
[8] L.-s. Lee, "Voice dictation of Mandarin Chinese," IEEE Signal Processing Magazine, vol. 14, no. 4, pp. 63–101, 1997.
[9] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proc. of Workshop on Text Summarization Branches Out, 2004, pp. 74–81.


