Statistical Natural Language Processing
Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
September 1999
Honors AI
First offering in Spring 2010, MW 7:00-8:15pm
Whether it will be offered again depends on enrollment
Isn't more difficult than the regular section, but covers more topics at a faster pace
Web search technologies
Small-scale project to be done in teams of 1-2 people
Undergraduate AI Courses
COGS/CS 4314: Intelligent Systems Analysis
COGS/CS 4315: Intelligent Systems Design
CS 4365: Artificial Intelligence
CS 4365 (Honors): Artificial Intelligence
CS 4375: Introduction to Machine Learning
CS 4391: Introduction to Computer Vision
Undergraduate AI Courses
COGS/CS 4314: Intelligent Systems Analysis (not offered every year)
COGS/CS 4315: Intelligent Systems Design (not offered every year)
CS 4365: Artificial Intelligence (Spring and probably Fall)
CS 4365 (Honors): Artificial Intelligence (Spring only)
CS 4375: Introduction to Machine Learning (Fall only)
CS 4391: Introduction to Computer Vision (not offered every year, may be offered Fall or Spring)
Graduate AI Courses
CS 6320: Natural Language Processing
CS 6322: Information Retrieval
CS 6364: Artificial Intelligence
CS 6373: Intelligent Systems
CS 6375: Machine Learning
CS 6384: Computer Vision
CS 6395: Speech Recognition, Synthesis & Understanding
CS 6v81: Statistical Natural Language Processing
The Intelligent Systems Group
Dr. Sanda Harabagiu: information retrieval, natural language processing
Dr. Vasileios Hatzivassiloglu: natural language processing, bioinformatics
Dr. Yang Liu: speech and language processing
Dr. Dan Moldovan: natural language processing, knowledge representation
Dr. Vincent Ng: natural language processing
Dr. Haim Schweitzer: computer vision
Statistical Natural Language Processing
Where are the Flying Cars?
According to science fiction, the future has talking machines.
Metropolis (1926): "false Maria"
Star Wars: Episode IV: C3PO (Maria's influence?)
2001: A Space Odyssey (1968): HAL (the HAL-9000)

Dave: Open the pod bay doors, HAL.
HAL: I'm sorry Dave, I'm afraid I can't do that.
Dave: What's the problem?
HAL: I think you know what the problem is just as well as I do.

Requires both understanding and generation
Natural Language Processing (NLP)
"Natural" language: languages that people use to communicate with one another
Ultimate goal: to build computer systems that perform as well at using natural languages as humans do
Immediate goal: to build computer systems that can process text and speech more intelligently
(Diagram: language understanding into the computer, language generation out of it)
Why NLP?
Lots of information is in natural language format: documents, news broadcasts, user utterances
Lots of users want to communicate in natural language: "Do what I mean!"
NLP is Useful Application: Text Summarization
Summarize the public commentary regarding the prohibition of potassium hydroxide for peeling peaches.
Input: e-mail, letters, editorials, technical reports, newswires
Output: multi-document summary
NLP is Useful Application: Information Retrieval
Topic: Advantages of using potassium hydroxide in any aspect of organic farming, especially…
(Diagram: an information need is matched against a text collection, returning relevant documents ranked by score: doc 1, doc 2, doc 3, …, doc n)
NLP is Useful Application: Question Answering
Retrieve not just relevant documents, but return the answer
Query: Which country has the largest part of the Amazon forest?
Answer (from text collection): Brazil
NLP is Useful Application: Information Extraction
AFGHANISTAN MAY BE PREPARING FOR ANOTHER TEST
Thousands of people are feared dead following... (voice-over) ...a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale. (on camera) Details now hard to come by, but reports say entire villages were buried by the quake.

Extracted template:
Disaster Type: earthquake
  location: Afghanistan
  date: today
  magnitude: 6.9
  magnitude-confidence: high
  damage:
    human-effect:
      victim: Thousands of people
      number: Thousands
      outcome: dead
    physical-effect:
      object: entire villages
      outcome: damaged
NLP is Useful Application: Machine Translation
Input: 日文章鱼您怎么说?
[Japanese-to-English Translator]
Output: How do you say octopus in Japanese?

Bill Gates, 1997: "… now we're betting the company on these natural interface technologies"
NLP is … Interdisciplinary …
Linguistics: models of language; emphasizes 100% accuracy
Psychology: models of cognitive processes; emphasizes biological and/or cognitive plausibility
Mathematics and statistics: properties of models; emphasizes formal aspects

vs.

NLP: computational study of language use
Definite engineering aspect in addition to a scientific one
  Scientific: to explore the nature of linguistic communication
  Engineering: to enable effective human-machine communication
Emphasis on computational, not cognitive, plausibility
Models of language: 95% correct is OK
Why study NLP? Challenging …
AI-complete: borrows from the notion of NP-completeness; to solve NLP, you'd need to solve all of the problems in AI
Turing test: Turing (1950), "Computing Machinery and Intelligence", posits that engaging effectively in linguistic behavior is a sufficient condition for having achieved intelligence.
The Turing Test
Turing predicted that by 2000, a machine might have a 30% chance of fooling a lay person for 5 minutes
… But little kids can "do" NLP …
Why is NLP hard?
Why is NLP hard? Ambiguity!!! … at all levels of analysis
Phonetics and phonology: concerns how words are related to the sounds that realize them
Important for speech-based systems
"I scream" vs. "ice cream"
"It's very hard to recognize speech" vs. "It's very hard to wreck a nice beach"
Why is NLP hard? Ambiguity!!! … at all levels of analysis
Morphology: concerns how words are constructed from sub-word units
"Unionized": union-ized? Or un-ionized, in chemistry?
Why is NLP hard? Ambiguity!!! … at all levels of analysis
Syntax: concerns sentence structure
Different syntactic structure implies different interpretation
Squad helps dog bite victim.
[NP squad] [VP helps [NP dog bite victim]]
[NP squad] [VP helps [NP dog] [INF-CLAUSE bite victim]]
Why is NLP hard? Ambiguity!!! … at all levels of analysis
Semantics: concerns what words mean and how these meanings combine to form sentence meanings.
Jack invited Mary to the Halloween ball.
ball: a dance, or some big sphere with Halloween decorations?
Why is NLP hard? Ambiguity!!! … at all levels of analysis
Discourse: concerns how the immediately preceding sentences affect the interpretation of the next sentence
The city council refused to give the women a permit because they feared violence.
The city council refused to give the women a permit because they advocated violence.
("they" refers to the council in the first sentence, but to the women in the second)
I'm Afraid I Can't Do That
The task seems so difficult! What resources do we need?
Knowledge about language
Knowledge about the world
An Idea
Have computers learn models of language
Statistical NLP: learns statistical models that capture language properties from a corpus (text samples)
Helps ease the knowledge acquisition bottleneck
Why is statistical language learning possible? Usage of words exhibits statistical regularities.
Probabilities are Realistic
"It's hard to recognize speech" vs. "It's hard to wreck a nice beach"
Which is more likely? (Both are grammatical.)
Applications: speech recognition, handwriting recognition, spelling correction, …
General problem in statistical NLP: density estimation
P("It's hard to recognize speech")
P("It's hard to wreck a nice beach")
No, Really, It's a Crazy Idea
Late 50's-80's: statistical NLP in disfavor
"It is fair to assume that neither sentence
(1) Colorless green ideas sleep furiously
nor
(2) Furiously sleep ideas green colorless
… has ever occurred … Hence, in any statistical model … these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." [Chomsky 1957]
Who Are You Calling Crazy?
"I don't believe in this statistics stuff"
"That's not learning, that's statistics"
vs.
Knowledge-intensive NLP "is going nowhere fast"
"Every time I fire a linguist, my performance goes up"
Which sentence is more likely?
"It's hard to recognize speech" vs. "It's hard to wreck a nice beach"
Statistical approach: density estimation
P("It's hard to recognize speech")
P("It's hard to wreck a nice beach")
Estimate these probabilities from a corpus (text sample):
Count the number of times each sentence appears in the corpus
Divide the count by the total number of sentences in the corpus
Is there any problem with this approach?
Solution 1: Use a larger corpus
Many sentences may still not appear in a larger corpus: their probability will be zero!
Problems
We may not be able to find these sentences even in a very large corpus
Even if we do, each of them may appear only once or twice
Solution 2: Use a language model
A language model assigns a probability to a sentence. How?
A Simple Two-Step Approach
Goal: assign a probability to a sentence: Let's take a talk.
Step 1: Compute the probability of each word in the sentence using the previous N-1 words. Assume N=3. Compute:
P("let's" | 2nd-prev-word="null", prev-word="null")
P("take" | 2nd-prev-word="null", prev-word="let's")
P("a" | 2nd-prev-word="let's", prev-word="take")
P("talk" | 2nd-prev-word="take", prev-word="a")
A Simple Two-Step Approach
How to compute P("talk" | 2nd-prev-word="take", prev-word="a")?
Collect statistics from corpus!
Count the number of times we see "take a talk" in the corpus
Count the number of times we see "take a" in the corpus
Divide these two numbers
Now we know how to compute the probability of each word
Step 2: Multiply the probabilities of the words to get the probability of the sentence
An Example
P("Let's take a talk")
= P("Let's" | null, null) * P("take" | null, "Let's") * P("a" | "Let's take") * P("talk" | "take a")

Does language modeling solve the problems of (1) not seeing a sentence in a corpus at all? (2) not seeing a sentence frequently enough?
Problems Solved??? To some extent
More likely to be able to find short word sequences than long word sequences in a corpus
Still, there is no guarantee that we will be able to find "Let's take a"
If we cannot, the probability of the sentence will be zero, even if the sentence is sensible
Solution 3: Use a Language Model with Small N
Use N=2:
P("Let's take a talk") = P("Let's" | null) * P("take" | "Let's") * P("a" | "take") * P("talk" | "a")
Use N=1:
P("Let's take a talk") = P("Let's") * P("take") * P("a") * P("talk")
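The N=2 (bigram) estimation described above can be sketched in a few lines. This is a minimal illustration under assumed names; the toy corpus, the "<null>" padding token, and the function names are inventions for the example, not the lecture's implementation.

```python
from collections import Counter

def train_ngram(sentences, n=2):
    """Count n-grams and their (n-1)-word histories over a tokenized corpus."""
    ngrams, histories = Counter(), Counter()
    for sent in sentences:
        tokens = ["<null>"] * (n - 1) + sent.lower().split()
        for i in range(n - 1, len(tokens)):
            ngrams[tuple(tokens[i - n + 1 : i + 1])] += 1
            histories[tuple(tokens[i - n + 1 : i])] += 1
    return ngrams, histories

def sentence_prob(sentence, ngrams, histories, n=2):
    """P(sentence) = product over words of P(word | previous n-1 words), by MLE."""
    tokens = ["<null>"] * (n - 1) + sentence.lower().split()
    prob = 1.0
    for i in range(n - 1, len(tokens)):
        history = tuple(tokens[i - n + 1 : i])
        gram = tuple(tokens[i - n + 1 : i + 1])
        if histories[history] == 0:
            return 0.0  # unseen history: zero probability without smoothing
        prob *= ngrams[gram] / histories[history]
    return prob

corpus = ["let's take a walk", "let's take a break", "take a walk"]
ngrams, histories = train_ngram(corpus, n=2)
print(sentence_prob("let's take a walk", ngrams, histories))  # nonzero
print(sentence_prob("let's take a talk", ngrams, histories))  # 0.0: "a talk" unseen
```

Note how the single unseen bigram "a talk" still drives the whole sentence probability to zero, which is the problem smoothing addresses.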
Problems Solved??? To a larger extent
It is less likely, though not impossible, for word sequences of length one or two not to appear in a corpus
Other problems?
Comparing Language Models
Is a language model with N=3 better than one with N=2? If yes, how to compare?
Generate a sentence using the language model:
Generate each word from left to right
At each point, we are in a different state
Roll a die to determine which word to output
Example
To generate "Let's go outside and take a talk" with N=3:
Current state: <null, null>. Roll a die that generates "Let's" with probability P("Let's" | null, null)
Current state: <null, "Let's">. Roll a die that generates "go" with probability P("go" | null, "Let's")
Current state: <"Let's", "go">. Roll a die that generates "outside" with probability P("outside" | "Let's", "go")
…
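The left-to-right generation procedure above can be sketched as follows. This is an illustrative sketch, not the lecture's code; the toy corpus and the "<end>" stop token are assumptions added to make the example self-contained.

```python
import random
from collections import Counter, defaultdict

def train_trigram(sentences):
    """Map each state (2nd-prev word, prev word) to a counter of next words."""
    model = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<null>", "<null>"] + sent.split() + ["<end>"]
        for i in range(2, len(tokens)):
            model[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1
    return model

def generate(model, seed=None):
    """Generate a sentence left to right, rolling a weighted die at each state."""
    rng = random.Random(seed)
    state, out = ("<null>", "<null>"), []
    while True:
        words = list(model[state])
        weights = [model[state][w] for w in words]
        word = rng.choices(words, weights=weights)[0]
        if word == "<end>":
            return " ".join(out)
        out.append(word)
        state = (state[1], word)

corpus = ["let's go outside and take a walk", "let's take a break"]
model = train_trigram(corpus)
print(generate(model, seed=0))
```

Because both training sentences begin with "let's", the die at the start state always lands on "let's"; later states may branch, so different seeds yield different sentences.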
Experimental Results
Corpus: complete works of Shakespeare
N=1: Will rash been and by I the me loves gentle me not slavish page, the and hour; ill let
N=2: What means, sir. I confess she? Then all sorts, he is trim, captain.
N=3: Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.
N=4: They say all lovers swear more performance than they are wont to keep obliged faith unforfeited!
Solution 4: Smoothing
Goal: make sure no N-gram (i.e., word sequence of length N) has zero probability
Currently:
P("take a talk") = (number of times "take a talk" is seen in corpus) / (number of 3-grams in corpus)
After smoothing:
P("take a talk") = (number of times "take a talk" is seen in corpus + 1) / (number of 3-grams in corpus + number of distinct 3-grams)
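The add-one (Laplace) smoothing formula above can be computed directly from trigram counts. A minimal sketch, with a made-up toy corpus; only the formula itself comes from the slide.

```python
from collections import Counter

def add_one_prob(trigram, trigram_counts):
    """Add-one smoothed trigram probability, following the slide's formula:
    (count(trigram) + 1) / (total 3-grams + number of distinct 3-grams)."""
    total = sum(trigram_counts.values())
    distinct = len(trigram_counts)
    return (trigram_counts[trigram] + 1) / (total + distinct)

tokens = "let's take a walk and take a break".split()
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

seen = add_one_prob(("take", "a", "walk"), trigram_counts)
unseen = add_one_prob(("take", "a", "talk"), trigram_counts)
print(seen, unseen)  # the unseen trigram now gets a small nonzero probability
```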
Linguistics vs. Statistics
"It is not our intention to ignore linguistics, neither to replace it. Rather, we hope to enfold it in the embrace of a secure probabilistic framework so that the two together may draw strength from one another and guide us to better natural language processing systems in general and to better machine translation systems in particular."
--- The Mathematics of Statistical Machine Translation [Brown, Della Pietra, Della Pietra, and Mercer, 1993]
Summary
Different NLP tasks require the collection of different statistics from a text corpus. Just need to be creative.
Language learning is in general a difficult task due to ambiguities at different levels, and is complicated by the fact that there are almost always exceptions to a general rule.
It is relatively easy to start doing research in NLP: only basic knowledge of statistics and linguistics is needed.
Some Famous Imitation Games
1960s: ELIZA
1990s: ALICE
Loebner Prize: win $100,000 if you pass the Turing test
ELIZA: Impersonating a Psychotherapist
1960s: ELIZA (Joseph Weizenbaum) http://www-ai.ijs.si/eliza/eliza.html

Eliza: Hello. I am ELIZA. How can I help you?
You: Well I feel sad
Eliza: Do you often feel sad?
You: not very often, but it's becoming more common
Eliza: Please go on.
ALICE
Human: hi my name is Carla
ALICE: How are you doing, Carla.
Human: what is your name?
ALICE: My name is ALICE.
Human: what is your full name?
ALICE: ALICE = Artificial Linguistic Internet Computer Entity.
Human: when were you born?
ALICE: I was connected to the net on November 23, 1995.
Human: Have you passed the Loebner Prize?
ALICE: The Loebner Prize is an annual Turing Test sponsored by Hugh Loebner. Dr. Wallace and ALICE won it in 2000. The ALICE program was ranked "most human computer" by the panel of judges.
Human: who is Dr. Wallace?
ALICE: He is a famous computer scientist, the author of ALICE, and a graduate of Carnegie Mellon.
Human: Where does Dr. Wallace work?
ALICE: He works from his home in San Francisco.
The Loebner Contest
A modern version of the Turing Test, held annually, with a $100,000 cash prize.
Restricted topic and limited time.
Participants include a set of humans, a set of computers, and a set of judges.
Scoring: rank from least human to most human. If a computer ranks better than a human, it wins $100,000. (Nobody yet…)
Morphological Segmentation
Segmentation of words into prefixes, suffixes and roots.
unfriendly = un + friend + ly
Some words do not have a prefix: validate = valid + ate
Some words do not have a suffix: devalue = de + value
How to automatically segment a word by computing statistics from a corpus?
Morphological Segmentation
Input: text corpus (words with frequencies)
Output: segmented words

Word          Frequency   Segmentation
aback         157         aback
abacus        6           abacus
abacuses      3           abacus+es
abalone       77          abalone
abandon       2781        abandon
abandoned     4696        abandon+ed
abandoning    1082        abandon+ing
abandonment   378         abandon+ment
abandonments  23          abandon+ment+s
abandons      117         abandon+s
…
A Word Segmentation Algorithm
Basic idea:
1. Learn prefixes, suffixes and roots from the corpus
2. Segment the words using the learned prefixes and suffixes
A Word Segmentation Algorithm
Let V be the vocabulary (i.e., the set of words in the corpus).
Let A and B be two character sequences, and AB their concatenation.
Prefix and suffix learning algorithm:
If AB and A are in V, then B is a suffix ("singing" and "sing" in V, so "ing" is a suffix)
If AB and B are in V, then A is a prefix ("preset" and "set" in V, so "pre" is a prefix)
A Word Segmentation Algorithm
Problem: the assumption does not always hold
"diverge" and "diver" are in V, so "ge" is a suffix. Wrong!
Many of the learned prefixes and suffixes are erroneous
Solution: score each learned prefix and suffix and retain only those whose scores are above a pre-defined threshold
After learning, we can try to use them to segment words. Suppose we learn that "ate" is a suffix. Then: candidate = candid + ate
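The suffix-learning idea above can be sketched as follows. This is an illustrative sketch, not the lecture's algorithm: the support-count threshold stands in for the slide's (unspecified) scoring function, and the toy vocabulary and length cap are assumptions.

```python
from collections import Counter

def learn_suffixes(vocab, min_count=2, max_len=4):
    """Propose B as a suffix whenever AB and A are both vocabulary words.
    Each learned suffix is scored by how many word pairs support it; the
    min_count threshold stands in for the slide's score threshold."""
    vocab = set(vocab)
    support = Counter()
    for word in vocab:
        for i in range(1, min(max_len, len(word) - 1) + 1):
            stem, suffix = word[:-i], word[-i:]
            if stem in vocab:
                support[suffix] += 1
    return {s for s, c in support.items() if c >= min_count}

def segment(word, vocab, suffixes):
    """Strip one learned suffix if the remaining stem is a vocabulary word."""
    vocab = set(vocab)
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and word[: -len(suffix)] in vocab:
            return word[: -len(suffix)] + "+" + suffix
    return word

vocab = ["sing", "singing", "jump", "jumping", "walk", "walking", "abacus"]
suffixes = learn_suffixes(vocab)
print(suffixes)                              # {'ing'}, supported by three pairs
print(segment("walking", vocab, suffixes))   # walk+ing
```

With a realistic vocabulary this would also propose spurious suffixes like "ge" from "diverge"/"diver"; such candidates have low support and fall below the threshold.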
Determining Most Frequent Part-of-Speech
Task: determine the most frequent POS of a word
"a": DET; "buy": VERB; "mother": NOUN; "beautiful": ADJECTIVE; "beautifully": ADVERB; "carry": VERB
Useful for part-of-speech tagging
Too time-consuming to do this by hand, so let's learn
Determining Most Frequent Part-of-Speech
Approach:
Group words that are likely to have the same POS together (to form, e.g., 100 groups)
Hand-label each group with a POS tag
How to generate groups of words with similar POS?
Idea: use contextual information
The boy is going to the library.
The lady went to the market.
("boy" and "lady" are both nouns)
The left word and/or the right word are useful indicators.
Determining Most Frequent Part-of-Speech
Create a "profile" for each word w in the vocabulary that tells us whether a word has ever appeared to the left/right of w.
Example profile for "boy":
"the"-left: yes, "a"-left: yes, "happy"-left: no, "cry"-left: no
"the"-right: no, "a"-right: no, "happy"-right: no, "cry"-right: no
Compare profiles: words with similar profiles tend to have the same POS?
Determining Most Frequent Part-of-Speech
Profiles are too big
Use only the most frequent N left-words and N right-words
Determiners are more likely to remain in the profile than verbs, for instance
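The context-profile idea above can be sketched as follows. This is a simplified illustration, not the lecture's method: profiles are binary sets over the N most frequent context words, compared with Jaccard overlap (an assumed choice), and a real system would then cluster the profile vectors rather than compare pairs.

```python
from collections import Counter

def build_profiles(sentences, n_context=2):
    """Binary left/right context profiles, restricted to the n_context
    most frequent words in the corpus."""
    freq = Counter(w for s in sentences for w in s.split())
    context = {w for w, _ in freq.most_common(n_context)}
    profiles = {w: set() for w in freq}
    for sent in sentences:
        tokens = sent.split()
        for i, w in enumerate(tokens):
            if i > 0 and tokens[i - 1] in context:
                profiles[w].add(tokens[i - 1] + "-left")
            if i + 1 < len(tokens) and tokens[i + 1] in context:
                profiles[w].add(tokens[i + 1] + "-right")
    return profiles

def similarity(p1, p2):
    """Jaccard overlap between two profiles."""
    if not p1 and not p2:
        return 0.0
    return len(p1 & p2) / len(p1 | p2)

sentences = ["the boy is going to the library", "the lady went to the market"]
profiles = build_profiles(sentences)
print(similarity(profiles["boy"], profiles["lady"]))    # nouns: both after "the"
print(similarity(profiles["going"], profiles["went"]))  # verbs: both before "to"
```

Even on two sentences the frequent context words "the" and "to" group the nouns together and the verbs together, which is the intuition behind hand-labeling whole clusters.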
Identifying Semantically Similar Nouns
Task: identify/group nouns that are semantically similar
How do we know that "boy" is more similar to words like "girl", "man", "woman", "individual" than to "car", "ship", "aeroplane", etc.?
Idea: use contextual information
Similar words tend to occur in similar contexts
What kind of context is useful to capture?
Identifying Semantically Similar Nouns
For each noun, collect all verbs for which the noun can serve as subject
"boy": "speak", "play", "cry", "laugh", "jump", …
Capture context using the governing verbs
The profile for each noun consists of these verbs
Compare profiles: words with similar profiles tend to be semantically similar?
Pronoun Resolution
Task: find the noun phrase to which "it" refers
They know full well that companies held tax money aside for collection later on the basis that the government said it1 was going to collect it2.
Given a corpus, what kind of statistics can we collect that can help us resolve occurrences of "it" correctly?
it1 is the subject of "collect"; it2 is the object of "collect"
Pronoun Resolution
Using the corpus, compute the number of times each noun phrase in the paragraph serves as the subject of "collect"
The ones that have high counts are likely to be the referent of it1
Similarly, compute the number of times each noun phrase in the paragraph serves as the object of "collect"
The ones that have high counts are likely to be the referent of it2
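The counting scheme above can be sketched as follows. This is an illustrative sketch only: the (subject, verb, object) tuples below are hypothetical, as if extracted from a parsed corpus, and the counts are invented for the example, not real corpus statistics.

```python
from collections import Counter

# Hypothetical (subject, verb, object) tuples, as might be extracted
# from a parsed corpus. The data is illustrative, not real.
parsed_corpus = [
    ("government", "collect", "money"),
    ("government", "collect", "tax"),
    ("companies", "collect", "money"),
    ("government", "say", "statement"),
]

def referent_scores(candidates, verb, role, corpus):
    """Score each candidate NP by how often it fills the given
    grammatical role (subject or object) of the verb in the corpus."""
    idx = 0 if role == "subject" else 2
    counts = Counter(t[idx] for t in corpus if t[1] == verb)
    return {c: counts[c] for c in candidates}

candidates = ["companies", "money", "government"]
print(referent_scores(candidates, "collect", "subject", parsed_corpus))
# "government" is most often the subject of "collect": likely referent of it1
print(referent_scores(candidates, "collect", "object", parsed_corpus))
# "money" is most often the object of "collect": likely referent of it2
```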
Supervised Learning
Learning from an annotated text corpus, e.g., a corpus annotated with part-of-speech tags
Human knowledge is encoded in the form of annotations
Machine learning algorithms can be used to learn from annotated corpora
Supervised methods typically outperform unsupervised methods
Learning for a Resource-Scarce Language
Project annotations from a resource-rich language to a resource-scarce language
(Diagram: the English sentence "[That] perhaps was [the happiest moment] of [his life]", with its NP brackets projected onto the corresponding phrases of a sentence in the resource-scarce language)
Learning
"I like candy" vs. "I candy like"
F-measure
MUC scoring program
Recall: % of coref links correctly found by the system (measure of coverage; want high recall)
Precision: % of coref links found by the system that are correct (measure of accuracy; want high precision)
F-measure := harmonic mean of Recall and Precision (want high F-measure)
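The harmonic mean above is one line of code; the recall and precision values in the example are made up to show the computation.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (balanced F-measure)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# e.g. a system that recovers 8 of 10 true coref links (recall 0.8),
# while only 8 of its 16 proposed links are correct (precision 0.5)
print(f_measure(0.8, 0.5))
```

The harmonic mean punishes imbalance: a system with recall 0.8 and precision 0.5 scores about 0.615, well below the arithmetic mean of 0.65.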
NLP is Challenging It is often said that NLP is “AI-complete”: All the difficult problems in artificial intelligence manifest themselves in NLP problems.
This idea dates back at least to the Turing Test: “The question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include” [Turing, “Computing Machinery and Intelligence”, 1950]
NLP is Cross-Disciplinary
Excellent opportunities for interdisciplinary work
Linguistics: models of language; emphasizes 100% accuracy
Psychology: models of cognitive processes; emphasizes biological/cognitive plausibility
Mathematics and statistics: properties of models; emphasizes formal aspects
On the whole, NLP tends to be applications-oriented:
95% is OK
Models need be neither biologically plausible nor mathematically satisfying
Statistical NLP
Statistical NLP: infer language properties from text samples
Helps ease the knowledge acquisition bottleneck
The Turing Test
Three rooms contain a person, a computer, and an interrogator. The interrogator can communicate with the other two by teleprinter. The interrogator tries to determine which is the person and which is the machine. The machine tries to fool the interrogator into believing that it is the person. If the machine succeeds, then we conclude that the machine has exhibited intelligence.
Turing predicted that by 2000, a machine might have a 30% chance of fooling a lay person for 5 minutes