Caverphone Revisited Caversham Project Occasional Technical Paper Code Number:CTP150804 Author: David Hood Original Publication Date:15th August 2004 Keywords: phonetic encoding, procedures, comparisons Last Updated: 10th December 2004 E-mail: caversham@otago.ac.nz Original location: http://caversham.otago.ac.nz/files/working/ctp150804.pdf The Technical papers series is intended to document specific technical solutions to problems encountered within the project. It is being made available to the public in the event the techniques outlined may be of use to others.Synopsis: The original Caverphone algorithm was provide to document the phonetic matching procedures used in linking the Caversham Project databases, and to give some guidelines for those attempting similar projects. It was not intended as a general phonetic matching code. Version two of Caverphone has been created as a general purpose English phonetic matching system. Description of the problem: When the original caversham phonetic matching algorithm was made available1, it was intended to document the process used by the project and to provide some guidelines for others undertaking similar tasks. Because the rules phonetic matching were intended for a very specific database, the process of matching was able to draw on characteristics of the specific data to supplement a ‘good enough’ set of phonetic matching rules, rather than treating it as an exercise in pure phonetic matching. However, since publication of the original Caverphone working paper there has been an interest in this approach of a cascading series of rules to other data sets. The original rules were, in my opinion, not suited to this. Caverphone 2.0 has been developed as a more general phonetic match. Procedure: In creating a general purpose English language matching version of Caverphone from the original, a number of changes were made to it. However, as with the first version, a theoretical implementation would begin with a list of potential words that the word in question could be matched to. This word list would have both the true spelling and the phonetic code of each word stored. When looking for matches to a target word, the target word is converted into its phonetic code. The word list is then searched, and all entries with the same phonetic code are produced as a result.
Caverphone 2.0 is being made available for free use for the benefit of anyone who has a use for it,with the proviso that the Caversham Project at the University of Otago should be acknowledge as theoriginal source.
Start with a wordConvert to lowercaseRemove anything not in the standard alphabet (typically a-z)2Remove final eIf the name starts with cough make it cou2f
Available from http://caversham.otago.ac.nz/files/working/ctp060902.pdf
This may vary if the set of letters includes characters such as æ, ā, or ø
If the name starts with rough make it rou2fIf the name starts with tough make it tou2fIf the name starts with enough make it enou2fIf the name starts with enough make it trou2fIf the name starts with gn make it 2nIf the name ends with mb make it m2replace cq with 2qreplace ci with sireplace ce with sereplace cy with syreplace tch with 2chreplace c with kreplace q with kreplace x with kreplace v with freplace dg with 2greplace tio with sioreplace tia with siareplace d with treplace ph with fhreplace b with preplace sh with s2replace z with sreplace an initial vowel3 with an Areplace all other vowels with a 3replace j with yreplace an initial y3 with Y3replace an initial y with Areplace y with 3replace 3gh3 with 3kh3replace gh with 22replace g with kreplace groups of the letter s with a Sreplace groups of the letter t with a Treplace groups of the letter p with a Preplace groups of the letter k with a Kreplace groups of the letter f with a Freplace groups of the letter m with a Mreplace groups of the letter n with a Nreplace w3 with W3replace wh3 with Wh3if the name ends in w replace the final w with 3replace w with 2replace an initial h with an Areplace all other occurrences of h with a 2replace r3 with R3if the name ends in r replace the replace final r with 3
Vowels are normally a,e,i,o,u but depending on the data might include characters such as æ, ā,or ø
replace r with 2replace l3 with L3if the name ends in l replace the replace final l with 3replace l with 2remove all 2sif the name end in 3, replace the final 3 with Aremove all 3sput ten 1s on the endtake the first ten characters as the code
The key aspects of the algorithm are that vocalic and non-vocalic versions of consonants are blended together, vowels are significant if initial or final (though they will also affect the coding of surrounding letters).
I decided that a fixed length code would be useful for subsequent work on transcription errors, and tenplaces was decided on as being a length sufficient for every name that I could test it on. Examples:
Stevenson - take a namestevenson - convert to lower casestevenson - remove nonstandard lettersstefenson - replace v with fst3f3ns3n - replace vowels with 3St3f3nS3n - replace s with SST3f3nS3n - replace t with TST3F3nS3n - replace f with FST3F3NS3N- replace n with NSTFNSN - remove all 3sSTFNSN1111- add ones
Peter - take a namepeter - convert to lower casepeter - remove nonstandard lettersp3t3r - replace vowels with 3p3T3r - replace t with TP3T3r - replace p with PP3T33 - replace final r with 3P3T3A - replace final 3 with APTA - remove all 3sPTA1111111- add ones
Testing:
To get a sense of the number of potential matches verses number of false positive results whenmatching, this algorithm was compared to soundex and metaphone4 using files from theMoby word list collection5. I have included my own python implementation of the matching
This paper used the implementations of soundex and metaphone present in php 4.x
Project Gutenberg etext 3201, http://www.gutenberg.net/etext/3201
algorithm as a appendix. Note that this is written for ease of reading the algorithm, not forefficient implementation.
Tests were run using the Moby list of 1000 most frequent words and the Moby surnames list. Prior to running the tests the files were edited to remove apparent duplicates, this reduced the frequent words list to 900 entries. Moby list of 1000 frequently used words:
Number of unique codes:Soundex: 552 codes, Metaphone 680 codes, Caverphone 542 codes
Among this small sample Metaphone has the most unique codes, while Caverphone has marginally less than Soundex.
Table 1: Percentage distribution of number of entries per code for Moby frequent words list
Soundex's reputation for being a loose match compared to Metaphone seems to stem from the numberof unique entries (the early part of the distribution) rather than the tail of the distribution (which is similar in both cases). While the tail of the Caverphone distribution extends to more shared codes thaneither other distribution, the distribution itself is close to that of Metaphone.
Commonest codes were:Soundex: (7 entries) L200, R300, T200, T600L200 words were: leach, leg, less, like, lookCaverphone: (15 entries) AT11111111AT11111111 words were: add, aid, at, art, eat, earth, head, hit, hot, hold, hard, heart, it, out, oldMetaphone: (8 entries) FR, RT
FR words were: far, fear, fire, for, four, free, vary, very
Word: ready, Soundex: R300, Caverphone: RTA1111111, Metaphone: RTSoundex matches: radio, rate, read, ready, red, ride, roadCaverphone matches: rather, ready, writerMetaphone matches: radio, rate, read, ready, red, ride, road, writeCommentary: Caverphone's sensitivity to the final syllable produced significantly different matches.
Word: social, Soundex: S240, Caverphone: SSA1111111, Metaphone: SXLSoundex matches: socialCaverphone matches: socialMetaphone matches: socialCommentary: This example happens to be one where all three codes do not share words.
Word: able, Soundex: A140, Caverphone: APA1111111, Metaphone: ABLSoundex matches: able, applyCaverphone matches: able, appearMetaphone matches: ableCommentary: Here is an example of Metaphone returning fewer matches than the other codes. Moby List of Surnames (21986 common surnames):
Number of unique codes:Soundex: 2911 codes, Caverphone 4339 codes, Metaphone 6791 codes
Distribution of codes:Table 2: Percentage distribution of number of entries per code for Moby surname list
Using a large set of material that the three coding systems were designed for, the distribution of Caverphone still lies between Soundex, and Metaphone (though closer to Metaphone than when usingthe smaller set of data).
Though not obvious from the table, while the distribution of the tail is shallower for Caverphone than Soundex, it is also longer as can be seen in the maximum number of names per code with each system.
Commonest codes were:Soundex: D500 113 entriesMetaphone: TN 115 entriesCaverphone: ATA1111111 174 codes
Matches of randomly selected names:Tedder Soundex:T360 Metaphone:TTR Caverphone:TTA1111111Soundex: Teador, Tedder, Tedra, Teeter, Teodoor, Teodor, Teodora, Teodoro, Theadora, Theodor, Theodora, Theodore, Tuddor, TudorMetaphone: Daughtry, Dedra, Deidre, Didier, Dieter, Ditter, Dituri, Teador, Tedder, Tedra, Teeter, Teodoor, Teodoor, Teodor, Teodora, Teodoro, Tuddor, TudorCaverphone: Darda, Datha, Dedie, Deedee, Deerdre, Deidre, Deirdre, Detta, Didi, Didier, Dido, Dierdre, Dieter, Dita, Ditter, Dodi, Dodie, Dody, Doherty, Dorthea, Dorthy, Doti, Dotti, Dottie, Dotty,Doty, Doughty, Douty, Dowdell, Duthie, Tada, Taddeo, Tadeo, Tadio, Tati, Teador, Tedda, Tedder, Teddi, Teddie, Teddy, Tedi, Tedie, Teeter, Teodoor, Teodor, Terti, Theda, Theodor, Theodore, Theta,Thilda, Thordia, Tilda, Tildi, Tildie, Tildy, Tita, Tito, Tjader, Toddie, Toddy, Torto, Tuddor, Tudor, Turtle, Tuttle, Tutto
Karleen Soundex:K645 Metaphone:KRLN Caverphone:KLN1111111Soundex: Kara-Lynn, Karalynn, Karilynn, Karlan, Karleen, Karlen, Karlene, Karlens, Karlin, Karlyn, Karolina, Karoline, Karolyn, Karylin, Koerlin, Krahling, KurlandMetaphone: Carilyn, Carleen, Carlen, Carlene, Carlin, Carlina, Carline, Carlyn, Carlynn, Carlynne, Carolan, Carolann, Carolin, Carolina, Caroline, Carolyn, Carolyne, Carolynn, Carolynne, Coraline, Coralyn, Crelin, Crellen, Garlan, Garlen, Gorlin, Kara-Lynn, Karalynn, Karilynn, Karlan, Karleen, Karlen, Karlene, Karlin, Karlyn, Karolina, Karoline, Karolyn, Karylin, KoerlinCaverphone: Cailean, Calan, Calen, Callahan, Callan, Callean, Carleen, Carlen, Carlene, Carlin, Carline, Carlyn, Carlynn, Carlynne, Charlean, Charleen, Charlene, Charline, Cherlyn, Chirlin, Clein, Cleon, Cline, Cohleen, Colan, Coleen, Colene, Colin, Colleen, Collen, Collin, Colline, Colon, Cullan, Cullen, Cullin, Gaelan, Galan, Galen, Garlan, Garlen, Gaulin, Gayleen, Gaylene, Giliane, Gillan, Gillian, Glen, Glenn, Glyn, Glynn, Gollin, Gorlin, Kalin, Karlan, Karleen, Karlen, Karlene, Karlin, Karlyn, Kaylyn, Keelin, Kellen, Kellene, Kellyann, Kellyn, Khalin, Kilan, Kilian, Killen, Killian, Killion, Klein, Kleon, Kline, Koerlin, Kylen, Kylynn, Quillan, Quillon, Qulllon, Xylon
Dyun Soundex:D500 Metaphone:TYN Caverphone:TN11111111Soundex: Dam, Dame, Dan, Dana, Danae, Dane, Daney, Dani, Dania, Danie, Danieu, Dann, Danna, Danni, Dannie, Danny, Dannye, Danya, Daune, Dawn, Dawna, Dayna, Ddene, Dean, Deana, Deane, Deanna, Deanne, DeeAnn, Deeann, Deeanne, Deena, Deenya, Deeyn, Deina, Demmy, Demy, Den, Dena, Denae, Dene, Deni, Denie, Denn, Denna, Denney, Denni, Dennie, Denny, Deny, Deonne, Deuno, Dewain, Dewayne, Dhumma, Diahann, Dian, Diana, Diane, Diann, Dianna, Dianne, Diannne, Diena, Dina, Dinah, Dine, Dinnie, Dinny, Dino, Dion, Dione, Dionne, Doane, Doehne, Dom, Don, Dona, Donahoe, Donahue, Donia, Donn, Donna, Donni, Donnie, Donny, Donoho, Donohue, Doone, Down, Downe, Downey, Duane, Duma, Dumah, Dumm, Dun, Dunn, Duyne, Dwain, Dwaine, Dwan, Dwane, Dwayne, Dyan, Dyana, Dyane, Dyann, Dyanna, Dyanne, Dyna, Dynah, DyunMetaphone: Dyan, Dyana, Dyane, Dyann, Dyanna, Dyanne, DyunCaverphone: Dan, Dane, Dann, Darn, Daune, Dawn, Ddene, Dean, Deane, Deanne, DeeAnn, Deeann,Deeanne, Deeyn, Den, Dene, Denn, Deonne, Diahann, Dian, Diane, Diann, Dianne, Diannne, Dine,
Dion, Dione, Dionne, Doane, Doehne, Don, Donn, Doone, Dorn, Down, Downe, Duane, Dun, Dunn, Duyne, Dyan, Dyane, Dyann, Dyanne, Dyun, Tan, Tann, Teahan, Ten, Tenn, Terhune, Thain, Thaine,Thane, Thanh, Thayne, Theone, Thin, Thorn, Thorne, Thun, Thynne, Tien, Tine, Tjon, Town, Towne,Turne, Tyne
To assess the strengths of all three algorithms at making true positive matches I compared the codes generated by 393 known name variations within the records of the Caversham project (for the test file see http://caversham.otago.ac.nz/files/variantNames.csv). This is in no sense a ‘blind’ test as the namevariation was established using a combination of hand-matching and linking using the first version of the Caverphone algorithm. As a result of this Caverphone 2.0 was expected to perform well at matching different versions of the same name
Of the 393 name pairs Soundex matched 328 of them (83.50%), Metaphone matched 303 pairs (77.10%), and Caverphone matched 373 pairs (94.91%). From these results it is indicated that Metaphone’s reduced false positives when compared to Soundex does come at a cost of decreased successful matches.
In glancing through those matches solely made by Caverphone, the successes seem to come in cases of differing initial vowels and in looser handling of post-vocalic consonants than metaphone. Those matches missed by all three seem to be the hand corrections where the variation is the result of transcription rather than phonetic error.
While Caverphone's sensitivity to false positives lies between Metaphone and Soundex (tending to be similar to Metaphone), it does seem to be better at including true positives within the set of results. In using this within the project, phonetic codes are used when we cannot find a match by searching with other criteria. Comparing unmatched entries in two sources reduces the number of false positives dramatically than matching against a complete source. This process of potential match minimization prior to phonetic testing enables much greater automation of the process. Appendix 1: Python Caverphone implementation used in testing (non-optimised).
# make the word lowercaseworkingCopyOfWordToConvert = string.lower(wordToConvert)
# remove nonstandard characterspartsOfName = re.split('[^a-z]+',workingCopyOfWordToConvert)workingCopyOfWordToConvert = string.join(partsOfName,'')
# remove final eif workingCopyOfWordToConvert.endswith('e'):
workingCopyOfWordToConvert = workingCopyOfWordToConvert[0:-1]
# If the name starts with cough make it cou2fif workingCopyOfWordToConvert.startswith('cough'):
workingCopyOfWordToConvert = 'cou2f' + workingCopyOfWordToConvert[5:]
# If the name starts with rough make it rou2fif workingCopyOfWordToConvert.startswith('rough'):
workingCopyOfWordToConvert = 'rou2f' + workingCopyOfWordToConvert[5:]
# If the name starts with tough make it tou2fif workingCopyOfWordToConvert.startswith('tough'):
workingCopyOfWordToConvert = 'tou2f' + workingCopyOfWordToConvert[5:]
# If the name starts with enough make it enou2fif workingCopyOfWordToConvert.startswith('enough'):
workingCopyOfWordToConvert = 'enou2f' + workingCopyOfWordToConvert[6:]
# If the name starts with trough make it trou2fif workingCopyOfWordToConvert.startswith('trough'):
workingCopyOfWordToConvert = 'trou2f' + workingCopyOfWordToConvert[6:]
# If the name starts with gn make it 2nif workingCopyOfWordToConvert.startswith('gn'):
workingCopyOfWordToConvert = '2n' + workingCopyOfWordToConvert[2:]
# If the name ends with mb make it m2if workingCopyOfWordToConvert.endswith('mb'):
workingCopyOfWordToConvert = workingCopyOfWordToConvert[0:-1] + '2'
# replace cq with 2qworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('cq','2q')
# replace ci with siworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('ci','si')
# replace ce with seworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('ce','se')
# replace cy with syworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('cy','sy')
# replace tch with 2chworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('tch','2ch')
# replace c with kworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('c','k')
# replace q with kworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('q','k')
# replace x with kworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('x','k')
# replace v with fworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('v','f')
# replace dg with 2gworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('dg','2g')
# replace tio with sioworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('tio','sio')
# replace tia with siaworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('tia','sia')
# replace d with tworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('d','t')
# replace ph with fhworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('ph','fh')
# replace b with pworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('b','p')
# replace sh with s2workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('sh','s2')
# replace z with sworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('z','s')
# replace an initial vowel with an Aif workingCopyOfWordToConvert.startswith('a'):
workingCopyOfWordToConvert = 'A' + workingCopyOfWordToConvert[1:]
if workingCopyOfWordToConvert.startswith('e'):
workingCopyOfWordToConvert = 'A' + workingCopyOfWordToConvert[1:]
if workingCopyOfWordToConvert.startswith('i'):
workingCopyOfWordToConvert = 'A' + workingCopyOfWordToConvert[1:]
if workingCopyOfWordToConvert.startswith('o'):
workingCopyOfWordToConvert = 'A' + workingCopyOfWordToConvert[1:]
if workingCopyOfWordToConvert.startswith('u'):
workingCopyOfWordToConvert = 'A' + workingCopyOfWordToConvert[1:]
# replace all other vowels with a 3workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('a','3')workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('e','3')workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('i','3')workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('o','3')workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('u','3')
# replace j with yworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('j','y')
# replace an initial y3 with Y3if workingCopyOfWordToConvert.startswith('y3'):
workingCopyOfWordToConvert = 'Y3' + workingCopyOfWordToConvert[2:]
# replace an initial y with Aif workingCopyOfWordToConvert.startswith('y'):
workingCopyOfWordToConvert = 'A' + workingCopyOfWordToConvert[1:]
# replace y with 3workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('y','3')
# replace 3gh3 with 3kh3workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('3gh3','3kh3')
# replace gh with 22workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('gh','22')
# replace g with kworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('g','k')
# replace groups of the letter s with a SotherPartsOfName = re.split('s+',workingCopyOfWordToConvert)workingCopyOfWordToConvert = string.join(otherPartsOfName,'S')
# replace groups of the letter t with a TotherPartsOfName = re.split('t+',workingCopyOfWordToConvert)workingCopyOfWordToConvert = string.join(otherPartsOfName,'T')
# replace groups of the letter p with a PotherPartsOfName = re.split('p+',workingCopyOfWordToConvert)workingCopyOfWordToConvert = string.join(otherPartsOfName,'P')
# replace groups of the letter k with a KotherPartsOfName = re.split('k+',workingCopyOfWordToConvert)workingCopyOfWordToConvert = string.join(otherPartsOfName,'K')
# replace groups of the letter f with a FotherPartsOfName = re.split('f+',workingCopyOfWordToConvert)workingCopyOfWordToConvert = string.join(otherPartsOfName,'F')
# replace groups of the letter m with a MotherPartsOfName = re.split('m+',workingCopyOfWordToConvert)workingCopyOfWordToConvert = string.join(otherPartsOfName,'M')
# replace groups of the letter n with a NotherPartsOfName = re.split('n+',workingCopyOfWordToConvert)workingCopyOfWordToConvert = string.join(otherPartsOfName,'N')
# replace w3 with W3workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('w3','W3')
# replace wh3 with Wh3workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('wh3','Wh3')
# if the name ends in w replace the final w with 3if workingCopyOfWordToConvert.endswith('w'):
workingCopyOfWordToConvert = workingCopyOfWordToConvert[0:-1] + '3'
# replace w with 2workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('w','2')
# replace an initial h with an Aif workingCopyOfWordToConvert.startswith('h'):
workingCopyOfWordToConvert = 'A' + workingCopyOfWordToConvert[1:]
# replace all other occurrences of h with a 2workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('h','2')
# replace r3 with R3workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('r3','R3')
# if the name ends in r replace the replace final r with 3if workingCopyOfWordToConvert.endswith('r'):
workingCopyOfWordToConvert = workingCopyOfWordToConvert[0:-1] + '3'
# replace r with 2workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('r','2')
# replace l3 with L3workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('l3','L3')
# if the name ends in l replace the replace final l with 3if workingCopyOfWordToConvert.endswith('l'):
workingCopyOfWordToConvert = workingCopyOfWordToConvert[0:-1] + '3'
# replace l with 2workingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('l','2')
# remove all 2sworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('2','')
# if the name end in 3, replace the final 3 with Aif workingCopyOfWordToConvert.endswith('3'):
workingCopyOfWordToConvert = workingCopyOfWordToConvert[0:-1] + 'A'
# remove all 3sworkingCopyOfWordToConvert = workingCopyOfWordToConvert.replace('3','')
# put ten 1s on the endworkingCopyOfWordToConvert = workingCopyOfWordToConvert + '1111111111'
# take the first ten characters as the codeworkingCopyOfWordToConvert = workingCopyOfWordToConvert[0:10]
You Shall Not Commit Exodus 20:14 The Propaganda And Marketing Of Sex questionnaire about sex, assertions about sex and problems with sex• Penile enlargement, Cialis, Viagra• Latest – “car wash” II. The Corruption Of Sex • Sex drive is virtually equal to our will to live• God designed man with sexual desire –• Corruption of sex – online pornograph