Languages Left Behind


Hokkien, the colloquial speech of millions of people in southern Fújiàn and Táiwān, has no status as an official language, and successive Chinese governments have discouraged any attempt at officialization.

Furthermore, it has no standardized writing system.

Should these facts, jointly or separately, keep it from representation in Unicode, the world-wide standard for script encoding for computer use, given that other unofficial or under-standardized languages are represented? To what extent is the issue technical and to what extent is it political?

Because of its large number of speakers, Hokkien is a dramatic example of a problem potentially faced by many languages.

The following article, originally a 1998 conference paper, was published in 2002. The situation has not changed significantly since that time. For present purposes, some long paragraphs have been divided into shorter ones. All bibliographic references are on a separate page.


Languages Left Behind

Keeping Taiwanese off the World Wide Web

David K. Jordan
University of California, San Diego

Reprinted from: Language Problems & Language Planning 26(2): 111-127 (2002)


  1. Orthographic Standardization
  2. Unicode: Letters and Computer Codes
  3. Double-Byte Codes
  4. Chinese Characters and Computers
  5. Non-Mandarin Chinese: The Politics of Being Left Out
  6. The Nature of Non-Mandarin
  7. The Case in Favor of Person-Reason-Yīn
  8. The Case Against Person-Reason-Yīn
  9. Conclusion

Abstract: The Unicode standard is an enormous step toward realizing the goal of a single computer encoding scheme for virtually all of the world’s scripts. Although not all computers will necessarily have the type fonts to display all characters, at least all computers will be able to recognize what characters are required for proper display of text in almost any language.

However the Unicode standard presupposes that each language has a script consisting of a finite number of agreed-upon characters. Some languages still lack such agreement.

As planning has gone forward for Unicode, more and more code points are being assigned, leaving ever fewer conveniently accessed code points for future expansion.

This article describes the Unicode project. Then it describes the special challenge of encoding Chinese characters. Finally it uses the example of Hokkien, a “dialect” of Chinese spoken by most people in Taiwan, to explore the problem of unorthodox, unstable, or unofficial scripts.

Political forces and technical considerations make it difficult to include such scripts in Unicode. As Unicode becomes the de facto standard for writing human languages, script innovations will presumably become less and less likely to receive wide use.

1. Orthographic Standardization

Standardization is very much a part of the world as we know it today, and the benefits of common standards are obvious, despite occasional nostalgia for cranky localisms. There is a brief period when selected decision makers collaborate to generate and then legitimize the new standard. After that, changes are difficult. It is too late to change the length of a centimeter: the alternatives are no longer viable proposals.

We live today during the period when the world’s orthographies are being standardized, and in particular are being integrated into a single set of numerical codes designed to allow all future computers around the globe to recognize, manipulate, and display all included languages with equal ease and with no variation beyond the artistic flourishes of particular type fonts. The coding system is usually referred to as Unicode, and its basic features are already in widespread, if not yet universal, use. If one uses a web browser, one uses Unicode, or anyway one can. If one uses the popular Microsoft business applications, one uses Unicode, even if one is unaware of it.

Most of the decisions involved with the creation of Unicode are extremely technical and have been, appropriately, undertaken by computer specialists building on the work of other computer specialists. [Note 1] In this paper I will briefly describe the Unicode system. Then I will turn to how different orthographies are included in it. Some orthographies, by reason of political and historical accident, are also excluded from it and probably always will be. And that will be the principal point I wish to make.

Note 1. Unicode is a cooperative project involving such major computer companies as IBM, Microsoft, Apple, Novell, Lotus, Xerox, and Hewlett-Packard; such research groups as the Research Libraries Group, the Getty Art History Information Project, and Knight-Ridder Information; and individuals from organizations like the Tibetan Languages Institute, the Asian Classics Input Project, and the Korea Industrial Advancement Administration. The Java computer programming language already uses Unicode as the basis of its text representations, and others seem likely to follow. The Unicode Consortium cooperates with the International Organization for Standardization and other standards bodies. The earlier ISO/IEC 10646 standard has been merged with the Unicode project.

Unicode supposes that each language has a standard orthography (or orthographies) that must be accommodated. However there is in the end a finite number of code points in the system, and they are in fact rapidly coming to be claimed. There will soon come a time when there is no room for new characters in the system, and orthographic creativity will be severely curtailed (alternatively, Unicode will need to be expanded, a point I will return to later). Who defines the standard form of an orthography, the characters which Unicode must accommodate? For most modern languages, governments do. For dead languages, scholars do. Hokkien, a Chinese “dialect” (also known as Minnan, Amoy, and Taiwanese), is a language that, for political reasons, has no standard orthography and is excluded from Unicode, despite being the native tongue of perhaps fifty million people. In describing the case of Hokkien, my more general point is to call attention to the disconcerting fact that international computer standardization locks insiders in and outsiders out, including, of course, the determination of which languages can easily be represented on the increasingly important World Wide Web.

2. Unicode: Letters and Computer Codes

Computers represent data as a series of binary digits, that is, the digits 0 and 1. Each binary digit is referred to as a “bit” of information. By convention these bits are blocked in groups of eight; eight bits taken together are called a byte. A byte may vary in value from 00000000 to 11111111. Expressed as decimal values, this is the range from 0 to 255, for a total of 256 (= 2^8) possible values. For programming purposes it is convenient to express each byte as a two-digit hexadecimal number from 00 to FF, with FF of course equal to 255. The computer reads the eight bits making up the code, looks up the number in a character table, and sends the appropriate graphic representation to the screen or printer. Different type fonts can be swapped in the computer’s memory, and the exact form of a character may vary without limit, but the standard system defines a letter J, for example, as occupying position 74 (hexadecimal 4A). Although some type fonts may prefer to place an unrelated graphic at that position, compatibility with the world’s computer systems is maximized by using fonts that use 74 for J.
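
The relation between a character, its decimal position, its hexadecimal form, and its eight bits can be checked in a few lines. What follows is only an illustrative sketch in Python, which is not a language the article itself uses; the variable names are arbitrary.

```python
# Illustrative only: inspect the byte value behind the letter J.
letter = "J"
code = ord(letter)                  # numeric position of the character

print(code)                         # decimal form: 74
print(format(code, "02X"))          # two-digit hexadecimal form: 4A
print(format(code, "08b"))          # the eight bits of the byte: 01001010

# A byte has 2**8 = 256 possible values, numbered 0 through 255 (hex 00 to FF).
byte_values = 2 ** 8
largest = int("FF", 16)             # same number as binary 11111111
print(byte_values, largest, int("11111111", 2))
```

The point of the sketch is simply that decimal 74, hexadecimal 4A, and the bit pattern 01001010 are three spellings of one number, and that the number, not the shape of the letter, is what the computer stores.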

At present, computers represent text by mapping the upper and lower case Latin alphabets, the Arabic numerals, and a variety of control codes and punctuation marks onto the values 0 to 127 (hex. 00-7F), a code range referred to as “Basic Latin.” (Historically this derives from older American teletype codes, or so the story goes. Position 7, for example, does not print a letter: it rings the bell on the teletype machine.) Standards initially varied about what was mapped onto the values 128-255 (hex. 80-FF), but these codes (known as “Latin-1”) today include additional characters used in Western European languages other than English, especially Latin letters with diacritical marks. (The letter ñ, for example, is in position 241 [hex. F1].) Clearly, however, 256 positions are too few to allow the representation even of all the variants of the Latin alphabet, let alone the many other orthographies that the world uses.
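
The limits of a single-byte table can be seen by asking a program to store characters in one. The sketch below uses Python's built-in `ascii` and `latin-1` codecs as stand-ins for the code ranges described above; the helper name `fits_latin1` is made up for this illustration.

```python
# Illustrative only: single-byte encodings as seen through Python's codecs.
bell = "\a"                               # the teletype "bell" control code
n_tilde = chr(0xF1)                       # the letter ñ

bell_byte = bell.encode("ascii")          # one byte, value 7
n_byte = n_tilde.encode("latin-1")        # one byte, value 241 (hex F1)
print(bell_byte[0], n_byte[0], format(n_byte[0], "02X"))

def fits_latin1(ch: str) -> bool:
    """Return True if ch has a position in the 256-slot Latin-1 table."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# ñ has a slot; Ĉ (built here from its code, hex 0108) does not.
print(fits_latin1(n_tilde), fits_latin1(chr(0x0108)))
```

A letter such as Ĉ, which has no slot in the table, cannot be stored in a single byte at all under this scheme, however many fonts happen to be installed.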

3. Double-Byte Codes

Unicode uses double-byte codes, in which each character, rather than being coded between 00 and FF, has a code between 0000 and FFFF. This produces 65,536 (= 2^16) possible unique codes, which, it is proposed, can be used to accommodate all or nearly all of the world’s scripts, depending in part upon how much space is allocated to Chinese. The letter J is then no longer represented by the code 4A, but 004A; ñ is not F1 but 00F1. And there is now space for Ĉ at 0108, which is 264 in decimal figuring, just beyond the reach of the 256 positions of single-byte encoding. There is even room for 龕 (kān, a Chinese character meaning a niche in the wall serving as a shrine) at position 9F95 (decimal 40,853). Theoretically the last position available is of course position number 65,535 (hex. FFFF), although in fact many spaces are reserved or still unused. [Note 2]

Note 2. The procedures for inputting these codes using a standard computer keyboard are an entirely separate issue not addressed by the Unicode Consortium. Character input systems have a less compelling need for standardization, since they can vary according to the convenience of the user with no effect on the finished result. Character 40,853 can be read as 龕 regardless of how the typist got the number into the file.
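
The double-byte arithmetic can be verified with a short sketch. It treats each code as a plain 16-bit number, as the discussion above does, and builds the sample characters from their code values so that nothing depends on how the characters were typed. The list name `codes` and the layout of the output are arbitrary choices for illustration.

```python
# Illustrative only: 16-bit code values for the characters named in the text.
codes = [0x004A, 0x00F1, 0x0108, 0x9F95]    # J, ñ, Ĉ, and the "shrine niche" character

rows = []
for value in codes:
    two_bytes = value.to_bytes(2, "big")     # the code as two bytes, high byte first
    rows.append((chr(value), format(value, "04X"), value, two_bytes.hex().upper()))

for row in rows:
    print(*row)

code_space = 2 ** 16                          # number of distinct double-byte codes
last_position = code_space - 1
print(code_space, last_position, format(last_position, "04X"))
```

Each row shows the character, its four-digit hexadecimal code, the same code in decimal, and the two bytes that would hold it.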

Numbered character codes refer to the computer’s internal representation of characters as numbers and thus only indirectly to the graphic forms, referred to by computer specialists as glyphs, that it places on the screen based on its look-up tables. Unicode does not concern itself with the graphic form of a code, defining U+00FF (ÿ) [Note 3], for example, as “Latin small letter y with dieresis” or U+0636 (ض) as “Arabic letter dad,” and leaving it to the type-font designer to worry about the graphic details. (Font designers are of course free to assign any graphic to any code position, but this is impractical if standardization is desired.)

Note 3. It is conventional to precede a Unicode hexadecimal code with the prefix “U+” to distinguish it from other numbers or letters.
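
The separation between a code, its name, and its glyph can be illustrated with Python's standard `unicodedata` module, which ships a copy of the Unicode character names. This is a sketch only; the helper name `describe` is invented here. The names are printed as the module reports them, so spelling and capitalization may differ slightly from the wording quoted above.

```python
import unicodedata

def describe(code_point: int) -> str:
    """Format a code point in U+ notation followed by its standard name."""
    ch = chr(code_point)
    return f"U+{code_point:04X} {unicodedata.name(ch)}"

# The two examples mentioned in the text.
for cp in (0x00FF, 0x0636):
    print(describe(cp))
```

Nothing in the output says what the character looks like; the drawing is left entirely to whatever font is in use.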

Version 2.0 of the Unicode Standard was released in printed form in February of 1997 by the Unicode Consortium (http://www.unicode.org), making the system easily available to the Anglophone public. It was and is a work in progress. When Version 3.0 was published in 2000, it included Sinhala, Ethiopic, Cherokee, “Canadian Aboriginal Syllabics,” and other new entries, as well as Braille patterns and over twelve hundred codes for the Yi (Lolo) syllabary.

In various states of consideration, approval, or proposal (and therefore not yet in the standard) were alternative scripts proposed but no longer used for English (Deseret, Shavian), as well as ancient scripts such as Etruscan, Gothic, Linear B, and Cypriot syllabary. “Proposed scripts” included: Pollard, Phaistos disk, Ugaritic Cuneiform, Old Persian Cuneiform, Meroitic, and basic Egyptian hieroglyphics. [Note 4]

Note 4. Updated details may be found at http://www.unicode.org/unicode/alloc/Pipeline.html.

In the case of Ethiopic, Unicode represents the first actual standard for its computer representation. “Prior to the Unicode 3.0 standard,” writes one commentator, “no fewer than 60 disparate encoding systems have appeared” for Ethiopic (Yacob 2000: 30). [Note 5]

Note 5. Inclusion in the Unicode standard can require a certain amount of housekeeping in unstandardized or no longer used scripts. Inclusion of Deseret, a script for English developed by early Mormon settlers in Utah, has required the establishment of sorting rules (alphabetical order) for these characters. Sorting rules are required by Unicode, but were never established for Deseret because, so far as is known, nothing was ever alphabetized in that script (Jenkins 2000: 36).

4. Chinese Characters and Computers

Chinese characters have traditional forms as well as some standardized variants in Korean and Japanese, and recent simplifications officialized in mainland China. Computer specialists use the expression “CJK” (Chinese-Japanese-Korean) to refer to the collectivity of Chinese and Chinese-derived characters, and “CJKV” when they include pre-reform Vietnamese as well. National computer coding standards for CJK characters vary from one country to another, and from Taiwan to the mainland within the Chinese-speaking world. None of these standards actually incorporates all of the characters in occasional use in these countries. (For details see Vine 1999.)

Many characters are used in more than one of these CJK languages. The character for “person,” for example, occurs in Korean and Japanese as well as both Chinese character sets, with a different double-byte code in each of the four different national standards. In Unicode a shared character is assigned a single code for use in all languages. [Note 6]

Note 6. Because of this, the ordering of the characters in Unicode is not the same as in any of these CJK national standards, and the original codes are preserved for none of them. Because the Korean standard in particular includes some duplicate representations of the same character depending upon its pronunciation, Unicode has in fact duplicated some characters in order to retain maximum two-way translatability with each of the CJK national standards. There are minor differences in conventional type design (similar to serifs and ligatures in Roman typography) that make the conventional representation of a character slightly different from one language to another and between simplified and traditional Chinese characters. Font designers and specialists in globalization find that they must take this into account in producing text that is comfortable to read rather than merely intelligible, but it does not affect the representation of these characters in Unicode. See Lunde 1999, Meyer 1999, Hudson 2000.

In Version 2.0 of the Unicode Standard, the “CJK Unified Ideographs” occupied 20,901 codes (U+4E00 through U+9FA5). [Note 7] Ominously, perhaps, by Version 3.0 it had been found desirable to allocate an additional 6,655 spaces for CJK characters (U+3400 through U+4DFF).

Note 7. Technical considerations require some duplicate characters stored as F900 through FA2D, and the difficulties of composing the Korean Hangul alphabet into its traditional character-shaped entities stimulated the allocation of codes U+AC00 through U+D7A3 (11,172 spaces) for pre-composed versions of them. In addition spaces U+3000 through U+3400 are allocated to East Asian punctuation signs, individual phonetic and syllabary symbols, diacritics, currency symbols, alternate number symbols, etc. commonly used in the CJK scripts.
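
The figure of 11,172 pre-composed Hangul codes is not arbitrary. The sketch below assumes the standard arithmetic layout of that block, which the article does not spell out: 19 initial consonants, 21 vowels, and 28 finals (counting “no final” as one of the 28), laid out in order from U+AC00. The function name `compose` and the index values are supplied for illustration.

```python
# Illustrative only: how the pre-composed Hangul block is laid out.
HANGUL_BASE = 0xAC00
N_INITIAL, N_VOWEL, N_FINAL = 19, 21, 28     # the 28 finals include "no final"

def compose(initial: int, vowel: int, final: int = 0) -> str:
    """Return the pre-composed syllable for the given zero-based letter indices."""
    index = (initial * N_VOWEL + vowel) * N_FINAL + final
    return chr(HANGUL_BASE + index)

total = N_INITIAL * N_VOWEL * N_FINAL         # every possible combination
last = HANGUL_BASE + total - 1                # code of the final syllable in the block
print(total, format(last, "04X"))

# Indices 18, 0, 4 select the letters h, a, n, giving the syllable "han".
han = compose(18, 0, 4)
print(han, format(ord(han), "04X"))
```

Because every combination gets its own code, whether or not the syllable occurs in real Korean, the block is large but entirely predictable. The open-ended inventory of Chinese characters is the opposite case.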

Is this enough space? In actual practice, Chinese characters have always been an open set, and folk characters thrive today as they always have, some becoming the de facto standard despite efforts at centralized control. The largest general Chinese dictionary (Zhāng Qíyún 1973) contains 49,905 characters, not including shorthand forms, dialect forms, or Japanese, Korean, or mainland simplified variants.

Nearly twice that number has proved necessary for use by the international consortium engaged in computerizing the Chinese Buddhist canon.

Non-standard characters and non-standard simplifications are constantly being invented and used, especially in shorthand, private letters, informal signs, and occasionally in specialized ways. [Note 8] (See Bokset 1988 for several hundred of these collected across China.)

Note 8. The standard Chinese term for restaurant is cāntīng, traditionally written 餐廳. The famous Chinese character simplification scheme officializes a simplification for one of these and not the other: 餐厅. It is a rare restaurant in contemporary China that does not utilize a rogue simplification of the first, usually consisting only of its upper left-hand corner. The rogue simplification is unlikely to turn up in the Unicode scheme, since it has no official support, and hence no legitimate advocates.

Some variants are matters of more than merely shorthand. A specialized form of the character 母, “mother,” is used exclusively by the Unity Sect (Yīguàn Dào 一貫道) of Taiwan. Since the sect has several million initiates, and produces books in which “mother” is always printed in this way, it is not obvious that the character should not be regarded as part of normal Chinese in Taiwan, except that it has no official status and can, as a practical matter, easily enough be written 母. An opposite argument would be that it is actually more comparable to a religious symbol such as the yīn-yáng (U+262F ☯) or the overlaid chi-rho of the Christian world (U+2627 ☧), or to a linguistically irrelevant graphic flourish, like the stylized C in the Coca-Cola emblem. The inclusion of a small number of religious symbols in the Unicode set probably invites future claims of religious discrimination. Arguing that the Unity Sect mother emblem belongs there if chi-rho belongs there would be an example.

And finally several “dialects” of Chinese have rich but unstandardized traditions for writing terms that do not occur in standard Mandarin. Thus the 27,556 spaces now assigned to CJK characters in Unicode cannot be expected to represent everything writable in Chinese. Chinese would be capable of filling all sixty-five thousand spaces and would still have characters unrepresented. [Note 9]

Note 9. I write from the perspective of Chinese. CJKV obviously includes some characters used only in Japanese, Korean, or Vietnamese and some variants or simplifications not used in China. (One character, U+5F41, seems not to occur in any of these languages.) The character simplification in Japan after World War II did not produce the same simplifications as the more radical reform in China, for example. However modern Japanese severely restricts its use of characters, and modern Korean and Vietnamese omit them. Thus the problems discussed in this paper are of most significance in Chinese.

So what is to be left out? Essentially only characters that are incorporated into the national standards used as source documents are included. Omitted are local and “substandard” simplified characters (shorthand), characters no longer in active use and not occurring in the standard literary corpora, and — most interesting for present purposes — characters used only to represent Chinese languages other than modern Mandarin or standard literary Chinese. In particular, there has been no attempt to incorporate the characters used to write Shanghainese, Xiang, Hakka, Hokkien, or other “regional dialects” of Chinese that are in fact separate, though closely related, languages, each with millions of speakers. To understand this we need to review some facts about these kinds of Chinese.

5. Non-Mandarin Chinese: The Politics of Being Left Out

Mandarin is but one of the regional Chinese languages (euphemistically but traditionally referred to as “dialects”), and indeed Mandarin itself contains broad enough regional variation that some Mandarin dialects could reasonably be called independent languages. Since the syntax of literary Chinese was peculiar to itself, all speakers of Chinese languages formerly shared it as a common writing system with little expectation that it would precisely correspond with spoken usage anywhere. The demise of literary Chinese early in the twentieth century brought this situation to an end, and elevated Mandarin, the majority dialect, to the position of a national standard. (In theory Beijing Mandarin is the standard. There is of course some slippage, but the details are not relevant here.) Modern written language is expected to follow spoken Mandarin usage.

Each of these dialects has, over the last centuries, produced various attempts to represent colloquial speech in Chinese characters. When colloquial words seemed to have no equivalent, new characters were invented on the analogy of others, [Note 10] or existing homonyms were used exclusively for their sound value (a usage dubbed “white characters” in Chinese).

Note 10. From the earliest times for which we have evidence, all variants of Chinese were written with graphic characters. Most characters in historic periods have been composed, consisting of a “radical” or “meaning element” plus a “phonetic element.” For example fàn 飯, “rice,” is made of 飠, a radical relating to food (and occurring as the separate character shí 食, “to eat, food”) and fǎn 反, “to turn back,” borrowed for its approximate sound value. Using these principles, new characters are easily created and readily understood, in context, by other speakers of the same dialect. In principle, the 214 radicals of the Qīng period Kāngxī 康熙 dictionary combined with the 890 or so combining phonetics could produce about 190,000 compound characters of this kind. In view of the constant invention of shorthand forms, special radicals, and nonce and alternative writings, the total number of graphically distinct characters ever used in CJK languages is almost certainly well in excess of 100,000, possibly in excess of 200,000.

None of the non-Mandarin dialects has ever had its own standardized and widely used colloquial writing system, and none is routinely used for continuous written text today. [Note 11] An exception is Cantonese, which developed local conventions (if not exactly standards) in part because the long British occupation of Hong Kong did not include ideological efforts to manage how Chinese was written. (It did establish a semi-official Hong Kong Government Chinese Character Set, although other standards compete with it. See Meyer 1998.)

Note 11. Missionary efforts to introduce alphabetical writing systems won little acceptance in China. A modified Latin script for Hokkien once taught in Christian churches was prohibited for use in running text during Taiwan’s long post-war period of martial law and is effectively extinct except to show pronunciation in a few Hokkien dictionaries. Efforts to create and use non-Mandarin Pinyin have also been stillborn.

In contrast, the language reforms of the twentieth century were associated with nationalist sentiment suspicious of the use of anything other than the “national language” (guóyǔ, a Nationalist phrase) or the “common language” (pǔtōnghuà, a Communist phrase) as vaguely treasonous. The force of the campaign against non-Mandarin Chinese varied, but the issue was never ignored. Non-Mandarin radio and television broadcasts were (and are) prohibited or limited, depending on the place and period, and a constant barrage of slogans and propaganda marked much of the twentieth century. (A colleague brought me a poster from a theoretically multilingual Singapore public office that reads: “Speak more in Mandarin, speak less in regional dialects” [多说华语,少说方言!].)

These attitudes and policies inhibited the development of vernacular writing systems in non-Mandarin regions, where today one reads and writes in Mandarin, translating orally if there are still non-Mandarin speakers who need to be informed. The situation is roughly comparable to what would ensue if South Americans, both Spanish and Portuguese speakers, were required to conduct all written communication in French. At stake politically is linguistic uniformity as a symbol of nationhood.

6. The Nature of Non-Mandarin

If a different dialect had been chosen as standard, written language would have been different, in some cases quite different. By way of example, here are written forms of three non-Mandarin languages compared with the same sentence as written in colloquial Mandarin. [Note 12] (Plus signs have been inserted to show sentence structure. The dummy character ☹ (U+2639) is used where no Unicode character is available.) [Note 13]

Note 12. These examples are based in part on Gunn 1993. For the sake of simplicity, the non-Mandarin examples are here printed in traditional (non-simplified) characters. The use of simplified characters would create minor variants, as can be seen here for Mandarin. These writings may alternate with more Mandarin-like usages for some writers.

Note 13. Shanghainese and a number of related “Wu dialects” are spoken in the lower Yangzi region. Hokkien is spoken in southern Fujian and is the home language of most people in Taiwan. Cantonese is the language of Guangdong province, including Hong Kong. In Taiwan the status of Hokkien, locally often called Taiwanese, is a politically charged issue.

The non-Mandarin sentences as written here make use of quite a lot of characters that are not part of the Mandarin version. In most cases, these characters have Mandarin readings and occur in classical Chinese, but are not colloquial in Mandarin in this sentence. The non-Mandarin sentences here might not be written identically by all speakers, since these are not standardized orthographies.

Four characters (one in Shanghainese, one in Hokkien, shown here as ☹, and two in Cantonese) are not to be found in the basic standard B5 (Taiwan) or GB (Mainland) non-Unicode computer code sets. [Note 14]

Note 14. Of these, number 4 is quite usual in Cantonese, although some authors write one of two alternative characters instead (e.g. Chiang Ker Chiu nd-b). Number 3 is rare enough in Cantonese not to occur in some desk Cantonese dictionaries (Huang Gangsheng 1994, Qiáo Yànnóng 1965). I am unfamiliar with Shanghainese and cannot comment on the first character.
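
Whether a given character has a place in a particular legacy code set can be tested mechanically. The sketch below uses Python's built-in `big5` and `gb2312` codecs, which approximate the Taiwan and mainland sets and may not match exactly the “basic standard” versions meant above. The helper name `in_code_set` is hypothetical, and the sample characters are the ones for “restaurant” from Note 8, not the characters of the example sentences.

```python
# Illustrative only: test membership of a character in a legacy code set.
def in_code_set(ch: str, codec: str) -> bool:
    """Return True if the named codec has a code for the character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

samples = ["餐", "廳", "厅"]      # shared form, traditional form, simplified form
for ch in samples:
    report = (ch, f"U+{ord(ch):04X}",
              in_code_set(ch, "big5"), in_code_set(ch, "gb2312"))
    print(*report)
```

A character that fails such a test in every national standard has no route into a scheme that draws its repertoire from those standards.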

Because three of them are rare but extant characters in Mandarin or literary Chinese, they have made their way into Unicode. [Note 15] However, character 2 (Hokkien: “in”), here written with the placeholder ☹, is not in the Unicode scheme (unlike the placeholder), and therefore is of particular interest to us here. As most usually written, it is composed of a “person” element 亻 (Mandarin: rén) on the left plus a phonetic 因 (Mandarin: yīn) which, as an independent character, means “reason.” Chinese describing the combined character represented by ☹ would call this character the “person-reason-yīn,” meaning that it has the radical “person,” the phonetic “reason,” and the sound “yīn.” It is a third-person plural pronoun, pronounced “in” in Hokkien. (The Hokkien third-person singular pronoun is “i.”)

Note 15. In an acknowledgment of the normalization achieved in Hong Kong, and perhaps as a political statement about Hong Kong’s successful autonomy within the larger PRC polity, many of the additional characters introduced in Unicode Version 3.0 are Hong Kong usages introduced with the apparent blessing of the Beijing government, although this example does not include any of those additions. My impression is that many occur only in place names.

Although far more standardized among Hokkien speakers than many other vernacular characters, person-reason-yīn is very clearly not Mandarin. Neither are many other characters, although how many depends on who is doing the writing, and what level of tolerance is allowed for using homophones as stand-in characters. Literate Hokkien speakers have apparently always produced a considerable number of distinctive characters not found in other forms of Chinese. Campbell’s Dictionary of Amoy [Hokkien] Vernacular, published in Tainan in 1913, circulates in a Taiwan martial-law-period reprint with Mandarin pronunciations written into its character index, apparently a fig leaf to avoid its being censored. The reprinters were able to provide Mandarin readings for only about 60 percent of the characters in Campbell’s index. Similarly an edition of Douglas’ 1873 Hokkien dictionary reprinted in the same period with characters written into the margins eschews non-Mandarin characters, with the result that no characters at all are written in for the many words that have no obvious Mandarin form. Since the martial-law period, Taiwanese lexicographers (e.g., Chén Xiū 1991, Yáng Qīngchù 1992) and promoters (e.g., Fāng Nánqiáng 1994, 1996) have continued and expanded experiments in promoting the Hokkien language, using characters many of which do not occur in other kinds of Chinese. In some cases the desire to differentiate a Taiwanese from a Mandarin text has stimulated the selection or invention of maximally different characters even when a Mandarin-congruent choice has been available.

Should person-reason-yīn be in Unicode? Not necessarily, although a successful negotiation of Taiwan independence could very well make it politically necessary for a Taiwan government to argue in favor of it rather than against it. First of all, not all Hokkien writers use it. In competition with person-reason-yīn as the character of choice for the third person plural pronoun are several other, less common but less “offensive” forms, as follows (sources for each are noted):

  1. Person-reason-yīn itself, built of the person radical plus the phonetic 因 (Mandarin: yīn, “reason”).
  2. Use of the same graph for both singular (“i”) and plural (“in”).
  3. A character built of the singular third-person pronoun plus a “heart” radical (心) underneath. [Note 16] Like person-reason-yīn, this character is also not in the Unicode scheme.
  4. Use of an etymologically unrelated character of similar sense arbitrarily read as “in.”
  5. Use of an etymologically unrelated but homonymous character, such as 因 (Mandarin: yīn, “reason”).
  6. Use of the bi-syllabic Mandarin third-person plural form, 他們 tāmen, translated into Hokkien when read aloud.

Note 16. A Mandarin analogy and possible inspiration for this is the Mandarin second-person pronoun nǐ 你, which has a formal variant nín 您, written by adding the heart below 你. Since in Hokkien plural pronouns add -n to singular ones, the use of the “heart” radical to represent the sound of a final -n can become a productive device for converting characters for singular Hokkien pronouns into characters for plural ones.

Note 17. Sources published under martial law were ill advised to propose separate writing systems for Taiwanese Hokkien, since Taiwanese “separatism” was considered a significant threat to the cause of national unification. Accordingly, few all-Chinese Hokkien dictionaries (e.g., Shěn Fùjìn 1954, Lǐ Mùqí 1963, Cài Wénhuī 1972) were produced, nearly always simply providing both Hokkien and Mandarin readings for characters shared by both languages, and nearly always in the name of linguistic research or with the avowed purpose of helping to teach Mandarin to Hokkien speakers, and never with the overt goal of rendering Hokkien a viable written idiom. Occasional dictionaries (e.g., Gāo Shùfān 1985) included a few characters used only in Taiwan place names, but treated them as Mandarin characters and gave them only Mandarin readings. The few dictionaries intended for foreigners normally represented Hokkien only in Romanization and provided only orthodox Mandarin Chinese characters.

7. The Case in Favor of Person-Reason-Yīn

The case in favor of including person-reason-yīn in Unicode is simple:

  1. Taiwanese Hokkien is the language of a people for whom nationalistic sentiments appear to be rising, and person-reason-yīn appears today to be the traditional and preferred character for at least some of the most recent and most serious script reformers who would like to see Taiwanese people write at least sometimes in their native Hokkien rather than in Mandarin.
  2. The Unicode Consortium has tried to provide codes for as many scripts as possible, including some (like ancient Greek) used by no living speakers and some (like Shavian and Deseret) with no significant history of use. A modern language with millions of speakers should not be excluded. Indeed, the allocation of a considerable amount of valuable code space to the Yi (Lolo) syllabary suggests that exactly this logic has been convincing in the case of a non-Han minority population in China.
  3. The particular sin of the Hokkien speakers appears to be that they are of Han ethnicity, and Han has been defined as necessarily Mandarin- (or, grudgingly, Cantonese-) speaking.

8. The Case Against Person-Reason-Yīn

The case against including person-reason-yīn is also simple, but has more parts:

  1. First, it is not part of the Taiwan (or mainland) computer standard, so it is not part of an existing, bounded writing system already defined as one of the Chinese sources. It is therefore not included in the corpus of texts on the basis of which the Unicode standard establishes what Chinese characters to include. It is not obvious what expansion of the corpus would successfully capture the characters most appropriate for Hokkien in the absence of its prior standardization.
  2. Person-reason-yīn is not in (much) active use in Taiwan (or elsewhere in the CJK world) except as a proposal by a tiny handful of people, since Hokkien speakers normally write in Mandarin. Thus it is by definition a low-frequency character.
  3. Both the PRC and the ROC governments oppose recognition of written colloquial Hokkien as a legitimate written standard, so as a practical matter there would be overwhelming political objection to introducing Hokkien-only characters from two countries whose participation is essential if Unicode is to work. “Who is entitled to define what Chinese is if not ‘the Chinese’?” it could reasonably be asked. Unlike ancient Egyptian, where a few responsible scholars can reasonably dictate what will and will not be enshrined in the Unicode Standard (http://anubis.dkuug.dk/jtcl/sc2/wg2/docs/n1637/n 1637.htm), Chinese carries with it considerable political baggage. The Consortium must tread delicately.
  4. The acknowledgment of person-reason-yīn (and other non-Mandarin Chinese characters) raises the issue of including hundreds or even thousands of other Chinese characters now excluded from the standard. For example:
  5. Chinese, as we have seen, is an open character set that could potentially gobble up the entire double-byte capacity if no limits are set. It is easy to incorporate the unused Shavian alternative alphabet for English, since it comes at relatively little cost because it is a small and closed set of symbols. Chinese, on the other hand, is an endless ocean. Arguably Unicode has already been overly generous to Chinese. Modern simplified Chinese merges some characters, treating them as merely insignificant graphic variants. For example, and , both simplified to lòu in Mainland China, are treated by Chinese dictionaries as interchangeable. Is it really necessary for Unicode to observe a distinction? Why?
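
The contrast in scale drawn in the last point can be made concrete with a little arithmetic. The following is only a rough sketch: the figures are assumptions drawn from the published Unicode code charts rather than from this article, namely the base block of unified Han ideographs at U+4E00 through U+9FA5 and the 48 letters of the Shavian alphabet, and the function name is hypothetical.

```python
# Illustrative figures only: assumptions taken from the Unicode code charts,
# not from the article itself.
DOUBLE_BYTE_CAPACITY = 2 ** 16           # values addressable with 16 bits
HAN_BASE_BLOCK = 0x9FA5 - 0x4E00 + 1     # unified Han ideographs U+4E00..U+9FA5
SHAVIAN_LETTERS = 48                     # letters in the Shavian alphabet


def share_of_16_bit_space(count: int) -> float:
    """Return `count` as a percentage of the 16-bit code space."""
    return 100.0 * count / DOUBLE_BYTE_CAPACITY


if __name__ == "__main__":
    for label, count in (("Han base block", HAN_BASE_BLOCK),
                         ("Shavian", SHAVIAN_LETTERS)):
        share = share_of_16_bit_space(count)
        print(f"{label}: {count} code points, {share:.2f}% of the 16-bit space")
```

On these assumed figures the Han base block alone occupies nearly a third of the 16-bit space, while a closed alphabet the size of Shavian would occupy well under a tenth of one percent.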

9. Conclusion

The exclusion of unofficial Chinese characters underlines the fact that script standardization is a process entwined with politics and national self-image. That is hardly a surprise.

But it also suggests a larger issue: When the new global standard is finally fully in place, when our operating systems are finally Unicode-based and our browsers and word processors routinely provided with full Unicode type fonts that offer Burmese and Russian and Arabic and Mongol on the same page (a moment that has already arrived for some of us), will it be too late to stabilize a new writing system for Hokkien? Will the age of innovative Chinese (and other) orthographies have drawn to a close? Will we have standardized at least some orthographies, perhaps some languages, into corners as curiosities that cannot be seriously used in a world in which literacy has become intimately bound to electronic information flow?

Or, viewing the matter in the opposite direction, to the extent that local language activists in the Hokkien-speaking regions insist on the use of non-standard graphs like person-reason-yīn, will they contribute to the defeat of their own goals by rendering colloquial Hokkien unusable in normal printing or computerized communication?

I like standardization. It is an important basis of modernity. On the other hand there is a democratic, populist, localist voice in me that worries about person-reason-yīn and similar vernacular innovations, whether for Hokkien or for any other language, being defined forever as outside the world of civilized discourse. [Note 18]

Note 18. Foreseeing the problem of running out of space, the Unicode scheme provides a portal to a larger, 32-bit world (the so-called “UCS-4 character set”) with room to encode vastly more characters than human ingenuity has yet imagined: 4,294,967,296 (=2^32), to be exact. Ideally, computer input and interpretation systems could make the cumbersome addressing in these outer circles invisible to the user (Graham 2000:81-82). However, the wide range of computer applications that still make it harder to deal with Czech than with English today does not give one confidence that any of us will live to see easy use of 32-bit character sets, especially if most major languages are adequately supported within the 16-bit world of mainstream Unicode. The imposition of additional technical requirements on an orthography can force unpopular compromises; the grudging use of cx, gx, etc. in the absence of convenient access to ĉ, ĝ, etc. in computerized Esperanto is a ready example. This will probably always constitute a degree of discrimination, though less than the stigma of outright exclusion.
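
The points in this note can be illustrated with a short sketch. It assumes the widely used Esperanto “x-convention” (only cx and gx are named above; the other four digraphs are supplied from common practice), and it uses Python's built-in UTF-16 codec to show the extra addressing needed outside the 16-bit range. The function names are hypothetical, and the conversion is deliberately naive: it handles only lowercase and initial-capital digraphs and would wrongly convert any genuine letter-plus-x sequence in non-Esperanto words.

```python
UCS4_CAPACITY = 2 ** 32  # values addressable with 32 bits


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units UTF-16 needs for a single character."""
    return len(ch.encode("utf-16-be")) // 2


# Assumed x-convention table: a plain letter followed by x stands in for
# the accented letter that is hard to type or display.
X_SYSTEM = {"cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ"}


def from_x_system(text: str) -> str:
    """Replace x-convention digraphs with the accented Esperanto letters."""
    for digraph, letter in X_SYSTEM.items():
        text = text.replace(digraph, letter)
        # Also handle an initial capital, e.g. "Gx" -> "Ĝ".
        text = text.replace(digraph.capitalize(), letter.upper())
    return text


if __name__ == "__main__":
    print(UCS4_CAPACITY)
    print(utf16_units("ĉ"))           # a letter inside the 16-bit range
    print(utf16_units("\U00010450"))  # a Shavian letter outside it
    print(from_x_system("Gxis cxirkaux"))
```

A character inside the 16-bit range takes one code unit, while one outside it takes two, which is the kind of “cumbersome addressing” that software would ideally hide from the user.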
