A Comprehensive Bangla Spelling Checker

Naushad UzZaman and Mumit Khan
Center for Research on Bangla Language Processing
BRAC University, Bangladesh

Naushad UzZaman

International Conference on Computer Processing of Bangla (ICCPB 2006), 17 February 2006, Dhaka, Bangladesh


Outline
- Introduction
- Previous work in each step and proposed solutions for each step
- Performance of our proposed solution


Introduction
- Spelling checker
  - Detect misspelled words
  - Generate suggestions for each misspelled word
  - Rank the suggestions
- Can be used in
  - Word processors
  - Optical Character Recognition (OCR)
  - Text To Speech (TTS)
  - Automatic Speech Recognition (ASR)
  - Machine Translation
  - Many more…


Detecting misspelled words
- Kukich (1992) breaks down human typing errors into two classes
  - Typographical error
    - Mistakes people make while typing
    - E.g. spell as speel
  - Cognitive error
    - The writer does not know how to spell the word
    - E.g. separate as seperate


Detecting misspelled words
- Cognitive error
  - Phonetic error
    - Substituting a phonetically equivalent sequence of letters
    - E.g. separate as seperate
  - Homonym error
    - Accidentally producing a real word, a.k.a. real-word error
    - E.g. there as their


Previous work on detecting misspelled words
- Detecting typographical errors and cognitive phonetic errors is trivial
- Detecting cognitive homonym errors is not trivial
- B. B. Chaudhuri (2001) and Abdullah and Rahman (2004) use direct dictionary look-up and approximate string matching to detect misspelled words
- We used dictionary look-up
  - Cognitive homonym errors are not solved
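A minimal sketch of detection by dictionary look-up, using the English examples from the earlier slides. The function name `find_misspelled` and the toy lexicon are illustrative assumptions, not the authors' implementation. It also shows why homonym errors slip through: a real word is always found in the lexicon.

```python
def find_misspelled(words, lexicon):
    """Return the words that are absent from the lexicon, in input order."""
    return [w for w in words if w not in lexicon]


# Hypothetical toy lexicon; a real checker would load a full word list.
lexicon = {"there", "their", "separate", "spell"}

# "speel" (typographical) and "seperate" (phonetic) are flagged.
# "their" typed in place of "there" is a real word, so it is not flagged.
print(find_misspelled(["speel", "seperate", "their"], lexicon))
```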


Generating suggestions for misspelled words
- Error patterns of typographical error:
  - Damerau (1964) found that in English 80% of misspellings are single-error misspellings (insertion, deletion, substitution or transposition error)
  - B. B. Chaudhuri (2001) found, in more than 15 million Bangla words:
    - 41.36% of misspelled words are due to a single error (error zone = 1)
    - 32.94% have error zone = 2
  - Error zone is a subset of edit distance
- Solution: Edit distance 2
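The four single-error operations can be sketched as follows. This is only an illustration on the English example "speel", assuming a lowercase Latin alphabet; for Bangla the `alphabet` argument would hold Bangla characters. The names `single_edits` and `suggestions` are hypothetical.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    insertions = {a + c + b for a, b in splits for c in alphabet}
    return (deletions | transpositions | substitutions | insertions) - {word}


def suggestions(word, lexicon, max_distance=2):
    """Lexicon words reachable from `word` in at most `max_distance` single edits."""
    frontier, seen = {word}, {word}
    for _ in range(max_distance):
        # Expand every string found so far by one more edit operation.
        frontier = {e for w in frontier for e in single_edits(w)} - seen
        seen |= frontier
    return sorted(seen & set(lexicon))


print(suggestions("speel", {"spell", "speed", "steel", "separate"}))
```

Generating every candidate string grows quickly with word length and alphabet size, which is one reason to compare the query against lexicon entries instead.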


Previous work on typographical error
- Existing work considers edit distance 2
  - B. B. Chaudhuri (2001) handles it using error zone 2
  - Abdullah and Rahman (2004) handle it using a recursive simulation method


Previous work on typographical error
- B. B. Chaudhuri's (2001) method needs twice the amount of memory, for the reverse dictionary
- Abdullah and Rahman's (2004) method requires m^(2*n+1) dictionary lookups
  - n = length of the word
  - m = average number of words in the circular list


Our proposal for typographical error
- Given a query word of length n
- Take the subset of the lexicon whose word lengths lie between n-2 and n+2
- Compute the edit distance between the query word and each word in the subset
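The three steps above can be sketched as follows. The slides do not say which edit distance variant is used, so this sketch assumes one that counts insertion, deletion, substitution and adjacent transposition (the restricted Damerau-Levenshtein, or optimal string alignment, distance). The names `osa_distance` and `candidates` are hypothetical.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def candidates(query, lexicon, max_distance=2):
    """Filter the lexicon by length first, then by edit distance to the query."""
    n = len(query)
    subset = [w for w in lexicon if abs(len(w) - n) <= max_distance]
    return [w for w in subset if osa_distance(query, w) <= max_distance]


# "separate" and "sp" are skipped by the length filter without computing a distance.
print(candidates("speel", ["spell", "separate", "steel", "sp"]))
```

The length filter is safe because two words whose lengths differ by more than 2 cannot be within edit distance 2 of each other.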


Error pattern for phonetic error
- There are groups of phonetically similar characters in Bangla:
  - NA (ন) and NNA (ণ)
  - SA (স), SHA (শ) and SSA (ষ)
- Bangla has many consonant clusters or conjuncts with unusual pronunciations (for example ক্ষ and conjuncts beginning with হ):
  - Let us consider ক্ষ. ক্ষ = ক + ◌্ + ষ; ক্ষত is pronounced as খত, where ষ does not have any sound.


Error pattern for phonetic error
- Different pronunciation of letters or conjuncts in different contexts: consider again ক্ষ.
  - At the beginning of a word, it is pronounced as খ (ক্ষত → খত).
  - In the middle or at the end of a word, it is pronounced as কখ (দক্ষ → দকখ).
- Multiple pronunciations of some letters in the same context, such as হ with ব:
  - ভ: আহ্বান → আওভান /aovan/
  - আহ্বান is usually pronounced as আহভান /ahobhan/.
  - Both pronunciations are considered correct.


Previous work on phonetic error
- B. B. Chaudhuri (2001), Abdullah and Rahman (2004), Hoque and Kaykobad's Soundex (2002) and UzZaman and Khan's Soundex (2004)
  - Deal with phonetic errors on a small scale
  - Mostly consider the first case shown before
    - Groups of phonetically similar characters in Bangla
      - NA (ন) and NNA (ণ)
      - SA (স), SHA (শ) and SSA (ষ)

Proposed solution for phonetic error
- অততন্ত - অত্যন্ত
- দকখ - দক্ষ
- সনধা - সন্ধ্যা
- বেবোহার - ব্যবহার
- Solution: Phonetic encoding


Phonetic encoding
- Code a word based on its pronunciation.
  - অত্যন্ত → অততন্ত
  - সন্ধ্যা → সনধা
  - ব্যবহার → বেবোহার

Naushad UzZaman and Mumit Khan, A Double Metaphone Encoding for Bangla and its Application in Spelling Checker, Proc. IEEE NLP KE, Wuhan, China, 2005


Example of Spelling Checker Using Phonetic Encoding

Dictionary:

Word List     Encoded Word List
অকালপক্ব      "okalpkk"
সকাল         "skal"
পাষাণ         "pasan"
দগ্ধ          "dgd"

Test Word: শকাল
Encoded Test Word: "skal"
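The look-up in this example can be sketched in a few lines. The code table below is a toy stand-in that covers only the letters of সকাল, শকাল and পাষাণ; the actual encoding rules are those of the cited Double Metaphone paper and are not reproduced here. `TOY_CODES`, `encode`, `build_index` and `suggest` are hypothetical names.

```python
from collections import defaultdict

# Toy code table (an assumption for illustration only).
TOY_CODES = {
    "স": "s", "শ": "s", "ষ": "s",  # SA, SHA and SSA share one code
    "ন": "n", "ণ": "n",            # NA and NNA share one code
    "ক": "k", "ল": "l", "প": "p",
    "া": "a",                       # vowel sign AA
}


def encode(word):
    """Map each character to its toy code; unknown characters pass through unchanged."""
    return "".join(TOY_CODES.get(ch, ch) for ch in word)


def build_index(lexicon):
    """Group lexicon words by their phonetic code."""
    index = defaultdict(list)
    for word in lexicon:
        index[encode(word)].append(word)
    return index


def suggest(word, index):
    """Return the lexicon words that share the query word's phonetic code."""
    return index.get(encode(word), [])


index = build_index(["সকাল", "পাষাণ"])
print(encode("শকাল"))          # skal
print(suggest("শকাল", index))  # ['সকাল']
```

Because the misspelled শকাল and the correct সকাল receive the same code, an exact match on the code is enough to retrieve the suggestion.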

Ranking the suggestions
- Solution: Edit distance
- B. B. Chaudhuri (2001) and Abdullah and Rahman (2004) used edit distance
  - Cannot rank suggestions phonetically
- We used a combination: edit distance computed on the phonetic encoding
  - Able to rank the suggestions phonetically and typographically
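One way to picture this ranking is sketched below. It assumes the phonetic codes are already available (the codes used are those from the example slide) and sorts candidates first by edit distance between codes, then by edit distance between the written forms. The tie-breaking rule and the names `levenshtein` and `rank` are assumptions, not the authors' stated method.

```python
def levenshtein(a, b):
    """Smallest number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def rank(query, query_code, candidates):
    """Sort (word, code) pairs by code distance, then by surface distance."""
    return sorted(
        candidates,
        key=lambda wc: (levenshtein(query_code, wc[1]), levenshtein(query, wc[0])),
    )


ranked = rank("শকাল", "skal", [("পাষাণ", "pasan"), ("সকাল", "skal")])
print([word for word, _ in ranked])
```

Here সকাল comes first because its code is identical to the query's code (distance 0).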


Performance on 1607 common misspelled Bangla words

No of words                  1607*
Correct (Edit Distance 0)    1473
Error                        134
Rate of accuracy             91.67%
Rate of error                8.33%

*Source of words: Bangla Banan Obhidhan, Dr. Khurshid Alam, Mirnava, Dhaka, Bangladesh.

Performance on 1607 common misspelled Bangla words

Error              134    8.33%
Edit Distance 1    107    6.65%
Edit Distance 2    27     1.68%

Summary
- Showed the steps of a spelling checker
- Showed existing solutions for each step
- Proposed solutions for each step
- For our particular sample set we get 100% accuracy by using a combination of phonetic encoding and edit distance 2.


Acknowledgment
- Supported in part by the PAN Localization Project (www.panl10n.net), grant from the International Development Research Centre, Ottawa, Canada, and BRAC University.


References
- Karen Kukich, "Techniques for automatically correcting words in text", ACM Computing Surveys, 24(4), pp. 377-439, 1992.
- B. B. Chaudhuri, "Reversed word dictionary and phonetically similar word grouping based spell-checker to Bangla text", Proc. LESAL Workshop, Mumbai, 2001.
- Arif Billah Al-Mahmud Abdullah and Ashfaq Rahman, "A Different Approach in Spell Checking for South Asian Languages", Proc. 2nd International Conference on Information Technology for Applications (ICITA), China, 2004.
- F. J. Damerau, "A technique for computer detection and correction of spelling errors", Communications of the ACM, 7(3), pp. 171-176, 1964.
- Md. Tamjidul Haque and M. Kaykobad, "Use of Phonetic Similarity for Bangla Spell Checker", Proc. 5th International Conference on Computer and Information Technology, Dhaka, December 2002, pp. 182-185.
- Naushad UzZaman and Mumit Khan, "A Bangla Phonetic Encoding for Better Spelling Suggestion", Proc. 7th International Conference on Computer and Information Technology, Dhaka, Bangladesh, December 2004.

Edit distance
- Definition: The smallest number of insertions, deletions, and substitutions required to change one string into another.
