~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-
Texting / 2010-12-26
~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-~-

Being most of the time a late adopter of new technologies, I’m still composing text messages on a traditional 12-button telephone keypad with 26 letters spread across 8 keys. The classical entry mode requires pressing each key up to several times to specify which letter was intended, while the more clever entry mode lets you press one key per letter and makes use of the redundancy in English to guess which letters you intended. Of course I use the latter since it’s much faster. Anybody who has used such a thing knows that not too infrequently there are collisions, where two English words are represented by the same sequence of keystrokes (the most common one that I come across being “go” and “in”). Only recently did it occur to me that there could be some interesting computation to be done with these collisions.

1
ABC
2
DEF
3
GHI
4
JKL
5
MNO
6
PQRS
7
TUV
8
WXYZ
9

So I grabbed the /usr/share/dict/american-english word list file and scrapped together some code. The first thing I wanted to do was explore the collisions in the normal key layout. My word list has 73895 words in it (I filtered out anything that included punctuation). Of those, 13186 (or 17.8%) collide with some other word. The largest set of mutually colliding words consists of the following twelve: acres, bards, barer, bares, barfs, baser, bases, caper, capes, cards, cares, and cases. To anybody who uses all twelve words in a sentence I will award one hundred points divided by the number of words in the sentence.

I also looked for some of the most dissimilar colliding words, where dissimilarity is measured by number of positions with differing letters. The winner, at eight, is “Khoikhoi” and “jingling”. Since I hadn’t heard of the proper noun “Khoikhoi” until a moment ago, I’ll include the sevens as well:

  • astride, brushed, crushed
  • amounts, contour
  • hotbeds, invader, invades
  • rampages, scorcher, scorches
  • Smirnoff, pogromed, poisoned

The next question is how much better other key layouts can do. This is an interesting computational problem, since searching all possible layouts is practically infeasible. If we describe a layout as a partitioning of the alphabet into eight sets, and we limit ourselves to two sets of four and six sets of three, my sloppy and careless calculations put the number of combinations at 10421411600600000. My computer gets tired just thinking about looping through all of the English words that many times.

The most straightforward approach seems to be a hill-climbing algorithm. I started from the standard layout, and at each step made the most helpful swap of two letters out of all possible swaps. Usually about ten steps would produce a layout with no improving swaps. The state I reached from the standard layout has a collision percentage of only 12.7%.

1
BKN
2
DHI
3
FGR
4
CLY
5
AJM
6
EPVX
7
OQT
8
SUWZ
9

Its collisions include (chant, limbo), (scanty, symbol), (ghoul, ritzy), (speedy, weevil), and (hold, itch). Of course we couldn’t switch to something like this because it would mess up all those phone number mnemonics. And anyhow nobody writes text messages this way anymore.

Also I think “The Speedy Ritzy Weevil-Ghouls” could be a promising name for a rock band.

--------------------
:::Comments:::