Outline of development of FW03

Last update: 2008/1/31

8 Development of FW03

Word candidates

13,607 word candidates were selected from a word familiarity database (i) of about 88,000 entries, which were derived from all the entries in a medium-sized Japanese dictionary. The conditions for selection were as follows.

The word length was set at four moras because four-mora words are the most frequent type in Japanese.
The accent type was Low-High-High-High (i.e., the first mora has a low pitch and the following moras have a high pitch) because this is the most common accent type among four-mora words. Words with more than one accent type were excluded to avoid accent type ambiguity, which might affect the word intelligibility score.
Homophones (i.e., a set of words that have the same sequence of phonemes) were regarded as a single word because homophones have the same word familiarity and they are not distinguished in the word familiarity database.
Words with a negative image, antisocial words, and disease-related words were excluded because these kinds of words might be affected by social suppression or other kinds of inhibition, which would result in unexpectedly low word intelligibility scores.
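The four conditions above can be read as a simple filter over database entries. The following is a minimal sketch only: the `Entry` record, its field names, the `"LHHH"` accent label, and the `excluded` flag are hypothetical stand-ins and do not reflect the actual schema of the word familiarity database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical database record (field names are assumptions)."""
    moras: tuple            # mora sequence of the word
    accents: frozenset      # all accent types recorded for the word
    familiarity: float      # word familiarity score
    excluded: bool = False  # negative-image, antisocial, or disease-related


def select_candidates(entries):
    """Apply the four selection conditions in order of appearance."""
    seen = set()
    kept = []
    for entry in entries:
        if len(entry.moras) != 4:
            continue  # condition 1: four moras only
        if entry.accents != frozenset({"LHHH"}):
            continue  # condition 2: exactly one accent type, Low-High-High-High
        if entry.excluded:
            continue  # condition 4: socially inhibited words are dropped
        if entry.moras in seen:
            continue  # condition 3: homophones count as a single word
        seen.add(entry.moras)
        kept.append(entry)
    return kept
```

Here homophones are detected by comparing mora sequences, which is an assumption made for brevity; the source defines them by phoneme sequence.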

Word selection

The word candidates were divided into four sets according to word-familiarity rank: low familiarity (1.0-2.5), lower-middle familiarity (2.5-4.0), upper-middle familiarity (4.0-5.5), and high familiarity (5.5-7.0). These sets contain 2501, 4108, 4885, and 2113 words, respectively.
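The division into four sets can be sketched as follows. The source does not say which set a word on a boundary value (2.5, 4.0, or 5.5) belongs to, so the half-open intervals used here are an assumption, as are the function and constant names.

```python
# Set sizes reported in the text; they add up to the 13,607 candidates.
SET_SIZES = {
    "low": 2501,
    "lower-middle": 4108,
    "upper-middle": 4885,
    "high": 2113,
}


def familiarity_rank(score):
    """Map a familiarity score (1.0 to 7.0) to one of the four sets.

    Assumption: each interval includes its lower bound and excludes its
    upper bound, except the top set, which includes 7.0.
    """
    if not 1.0 <= score <= 7.0:
        raise ValueError("familiarity score must lie between 1.0 and 7.0")
    if score < 2.5:
        return "low"
    if score < 4.0:
        return "lower-middle"
    if score < 5.5:
        return "upper-middle"
    return "high"
```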
From each of the four sets, 20 lists of 50 words (i.e., 1000 words) were selected by considering the phonetic balance. This phonetic balance was achieved by taking account of "entropy."
Two kinds of entropy, H1 and H2, were used for the phonetic balance. H1 was calculated as in Eq. (1).
$$H_1 = -\sum_{m} p(m) \log_2 p(m), \tag{1}$$
where p(m) is the occurrence probability of a word-initial mora m.
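As an illustration, H1 can be estimated from a word list by counting word-initial moras. This is a minimal sketch that assumes each word is given as a tuple of mora strings and that p(m) is estimated by relative frequency within the list; the function name and the toy data are illustrative only.

```python
import math
from collections import Counter


def h1(words):
    """Entropy (in bits) of the word-initial mora distribution."""
    counts = Counter(word[0] for word in words)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Toy data: four words with four different initial moras.
toy = [("a", "x", "x", "x"), ("ka", "x", "x", "x"),
       ("sa", "x", "x", "x"), ("ta", "x", "x", "x")]
print(h1(toy))  # 2.0, the maximum for four equally likely initial moras
```

A list in which every initial mora is equally frequent maximizes H1, which is why a higher value indicates a better-balanced list.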
H2 was calculated from the transitional probability of two successive phonemes within a word as in Eq. (2).
$$H_2 = -\sum_{v} \sum_{c} p(v)\, p(c \mid v) \log_2 p(c \mid v), \tag{2}$$
where p(v) is the probability of vowel v, and p(c|v) is the conditional occurrence probability of consonant c preceded by vowel v.
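H2 can be sketched in the same way by counting vowel-to-consonant transitions inside each word. Several details here are assumptions that the source does not specify: words are given as phoneme sequences, the vowels are a, i, u, e, and o, every other phoneme is treated as a consonant, vowel-to-vowel transitions are ignored, and p(v) is estimated from the counted transitions only.

```python
import math
from collections import Counter

VOWELS = frozenset("aiueo")  # assumed vowel inventory


def h2(words):
    """Conditional entropy (in bits) of a consonant given the preceding vowel."""
    pair_counts = Counter()
    for phonemes in words:
        # Walk over successive phoneme pairs within one word.
        for prev, nxt in zip(phonemes, phonemes[1:]):
            if prev in VOWELS and nxt not in VOWELS:
                pair_counts[(prev, nxt)] += 1

    total = sum(pair_counts.values())
    vowel_counts = Counter()
    for (v, _c), n in pair_counts.items():
        vowel_counts[v] += n

    entropy = 0.0
    for (v, c), n in pair_counts.items():
        p_v = vowel_counts[v] / total
        p_c_given_v = n / vowel_counts[v]
        entropy -= p_v * p_c_given_v * math.log2(p_c_given_v)
    return entropy


# Toy data: after "a", the consonants "s" and "t" are equally likely.
print(h2([["k", "a", "s", "a"], ["k", "a", "t", "a"]]))  # 1.0
```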
The total entropy, Htotal, is defined as in Eq. (3).
$$H_{\mathrm{total}} = H_1 + H_2. \tag{3}$$
For each word-familiarity rank, the lists were obtained by maximizing the sum of Htotal for 1000 words by employing the "Add & Delete" method (ii). The procedure is as follows.

A. Add words one at a time, each time choosing the word that gives the maximum gain in Htotal, until the word set reaches 1000 words.
B. Search for the pair of words that gives the maximum gain in Htotal when one word is deleted from the word set and the other is added to it.
C. Exchange the words found in Step B.
D. Repeat Steps B and C until the gain in Htotal reaches zero.
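The procedure can be sketched as a greedy construction followed by pairwise exchanges. This is an illustrative sketch, not the implementation of reference (ii): the objective is passed in as a black-box `score` function standing in for Htotal, it is recomputed from scratch for every trial, and the names `add_and_delete`, `score`, and `eps` are assumptions. The tolerance `eps` treats gains below floating-point noise as zero.

```python
def add_and_delete(candidates, size, score, eps=1e-12):
    """Select `size` items from `candidates` so as to maximize `score`."""
    pool = list(candidates)
    selected = []

    # Step A: add the single best word until the set is full.
    while len(selected) < size and pool:
        best = max(pool, key=lambda w: score(selected + [w]))
        selected.append(best)
        pool.remove(best)

    while True:
        current = score(selected)
        best_gain, best_swap = eps, None

        # Step B: find the delete/add pair with the largest gain.
        for i in range(len(selected)):
            for j in range(len(pool)):
                trial = selected[:i] + [pool[j]] + selected[i + 1:]
                gain = score(trial) - current
                if gain > best_gain:
                    best_gain, best_swap = gain, (i, j)

        # Step D: stop when no exchange improves the score.
        if best_swap is None:
            return selected

        # Step C: exchange the two words.
        i, j = best_swap
        selected[i], pool[j] = pool[j], selected[i]


# Toy objective: get the sum of two numbers as close to 6 as possible.
closeness = lambda xs: -abs(sum(xs) - 6)
print(sorted(add_and_delete([5, 4, 3, 3], 2, closeness)))  # [3, 3]
```

In the toy run, Step A picks 5 and then 3, and a single exchange in Steps B and C replaces 5 with the remaining 3. For 1000 words drawn from several thousand candidates, rescoring every trial set from scratch would be slow, so a practical implementation would presumably update the mora and phoneme counts incrementally.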

References

(i) Amano, S., Kondo, T., 1999. Lexical Properties of Japanese, Vol. 1. Sanseido, Tokyo (in Japanese).
(ii) Shikano, K., 1984. Phonetically balanced word list based on information entropy. Proceedings of Spring Meeting of the Acoustical Society of Japan, 211-212 (in Japanese).