Unicode Keyboards on Windows
Michael S. Kaplan
Cathy Wissink
Windows Globalization, Microsoft Corporation
1. Introduction
To implementers, it seems inputting data into applications via keyboards should be one of the
fundamentally simple features on Windows. However, once additional complexities like fonts
and rendering engines are taken into consideration, input appears to be not quite so simple
anymore. Adding many different keyboard layouts on top of over 135 locales further complicates
the issue. And finally, once you include the ability to define keyboard layouts (whether by
Microsoft interfaces or third party products) where all of Unicode can be supported, it becomes
downright complex!
This paper will discuss the many features that keyboard layouts support (such as dead keys, shift
states and ligatures), the interaction between input, fonts, and rendering engines, the issue of
code pages vs. Unicode, when IMEs are preferred and when they are not, and the collation issues
that enter into the equation. In the end, it will be clear that on Windows, the input of virtually
any characters in Unicode is possible, even if in some cases more work is required than was
originally expected.
The OEM keys are keys that add punctuation and symbols. The ones that commonly change
with different keyboards are OEM_1 through OEM_8, OEM_102, OEM_COMMA,
OEM_PERIOD, OEM_PLUS, and OEM_MINUS. On these keyboard layout maps they are
abbreviated with an O* prefix, followed by enough information to uniquely identify the key (e.g.
O2 for OEM_2 and OP for OEM_PERIOD).
Processing keystrokes
When a Windows message loop handles a VK in the WM_KEYDOWN message, it can pass the
VK to the DefWindowProc API. To handle the message, the code in the USER subsystem will
process the keystroke and convert it (when appropriate) to a character, passed as a WM_CHAR
message. This processing requires a great deal of information:
Once the USER subsystem has collected all of this information (that is, the keyboard layout is
known for each thread and the WM_KEYDOWN message contains the VK and shift state), the
code is able to come up with the appropriate character, taking the shift states, VKs, and
current layout into account. (Arrow keys, for example, are not expected to insert characters,
so USER does not run any of this extra character-based work for them.) You can mimic this
behavior with several different Win32 APIs (see Table 1 for a list of the APIs that can be
useful for this).
Table 1: Keyboard input functions and what they do

Function           Description
keybd_event
MapVirtualKey      Maps between scan codes, VKs, and characters for the current
                   keyboard layout
MapVirtualKeyEx    Maps between scan codes, VKs, and characters for a specified
                   keyboard layout (layout must be loaded)
OemKeyScan
SendInput
ToAscii
ToUnicode
ToUnicodeEx
VkKeyScan
VkKeyScanEx
The functions in Table 1 are interesting in that when you read the descriptions, the functions
appear to be duplicates of each other. However, once you start needing these functions in an
application, you will see that the small differences between them can actually have a great
deal of importance for obtaining the features you need.
As an example, the definitions of MapVirtualKey and VkKeyScan seem similar, but the former
does not handle shifted characters while the latter does. For more information, you can look at
the Platform SDK:
http://msdn.microsoft.com/library/en-us/winui/WinUI/WindowsUserInterface/UserInput/KeyboardInput.asp
In any case, your code has now passed a character onto an application and inserted text! You can
look at a few of the many keyboards supported on Windows (Figures 5-8) to help you see the
wide variety of possible characters to be inserted.
ligatures
dead keys
shift states
Ligatures
There are many times that a single keystroke needs to enter more than one character. In keyboard
nomenclature, these 1:many mappings are called ligatures.
Note that this definition of ligature is not identical to the one used in typography or in language
orthographies; "ligature" here is used to identify multiple UTF-16 code points that are input by a
single keystroke. This could be used in a number of ways: to represent a linguistic character
consisting of multiple UTF-16 code points (such as Sri and Ksa seen on the Tamil keyboard,
shown in Figure 9); to represent multiple linguistic characters which often work together in the
language; or to develop a keyboard layout to handle a language represented by supplementary
characters (such as the Deseret keyboard layout in Figure 10). (Technically, one could even
create a keyboard with a keystroke that would insert "mike" or "cath" or "hiya" using a legal
keyboard layout ligature -- as seen in the silly keyboard layout in Figure 11.)
Figure 9: The Tamil keyboard in the shifted state, showing linguistic characters Sri and Ksa as ligatures
Since keyboards support UTF-16 code points on Windows, the only way to handle
supplementary characters on keyboards is via ligatures (the high surrogate and the low surrogate
make a ligature). The process is seamless from the user perspective; the user will not experience
any difference between supplementary characters and characters on the BMP, aside from a
limitation of 4 UTF-16 code points on a single key.
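The surrogate arithmetic behind such a ligature can be sketched in a few lines of Python. This is only an illustrative model, not how a keyboard layout DLL is actually built; the helper names utf16_units and fits_on_one_key are hypothetical, and the limit of 4 is taken from the text above.

```python
MAX_UNITS_PER_KEY = 4  # per-key limit stated in the text above


def utf16_units(text):
    """Return the UTF-16 values (as ints) needed to represent text."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary character: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
        else:
            # BMP character: one UTF-16 value.
            units.append(cp)
    return units


def fits_on_one_key(text):
    """Hypothetical check: could this string be one key's ligature?"""
    return len(utf16_units(text)) <= MAX_UNITS_PER_KEY


# U+10400 is a Deseret letter, i.e. a supplementary character.
print([hex(u) for u in utf16_units("\U00010400")])
print(fits_on_one_key("mike"), fits_on_one_key("hello"))
```

Under this model a single Deseret letter occupies two of the four available slots, so at most two supplementary characters can share one keystroke.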
Figure 10: A keyboard layout for Deseret, a language using supplementary characters (each represented by
"ligatures" of UTF-16 high and low surrogates)
Figure 11: A very silly (but real!) keyboard layout (created by a developer for personal use). This shows the
4 UTF-16 character limit for a single keystroke.
Dead keys
The dead key mechanism is either very intuitive or incredibly confusing, depending on your
experience with legacy European keyboards. The basic concept is that you type a character
defined on the particular keyboard as a dead key, then type a specific second character known as
a base character. Rather than displaying these two characters, a unique third character known as
a combining character will be shown. The reason the first character is defined as a "dead" key is
that this character is not shown, and the cursor does not advance.
Dead keys are most commonly used in European keyboard layouts; a diacritic is generally used
as the dead key. An example of this can be found on the Finnish keyboard, where typing a
diaeresis (U+00A8) will initially do nothing, but then typing any of the characters in the first
column in Table 2 will cause the character in the second column of Table 2 to be displayed. For
example, if a user types a diaeresis, followed by a small letter a, Latin small letter a with
diaeresis (ä) will be displayed.
Base Character        Combining Character
U+0020                U+00A8 (¨)
                      U+00A8+other character
The last two rows in gray of the above table are important to note. The first gray row is a
common convention on most keyboards with dead keys; if you type the dead key and then a
space, you will get the spacing version of the character. The second one is not a part of the
keyboard layout definition, but is simply what happens if you type a dead key followed by a
character that is not defined in the keyboard layout as a base character for that dead key: the
dead key is printed (input), followed by that second character. For example, Latin small letter c
is not defined in the keyboard layout as being a base character for the diaeresis dead key. If
U+00A8 is typed, followed by c, those two code points will be input. No combining character
will be created.
While dead keys are not limited to European keyboard layouts, that is where they are most
commonly used.
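The dead key behavior described in this section, including the two fallback cases, can be modeled as a simple table lookup. The following Python sketch is an assumption-laden illustration: DEAD_KEY_TABLE is a hypothetical three-letter excerpt rather than the full Finnish layout data, and resolve_dead_key is an invented helper name, not a Windows API.

```python
DIAERESIS = "\u00A8"  # the dead key used in the Finnish example

# Hypothetical excerpt of a dead key table: (dead key, base) -> result.
DEAD_KEY_TABLE = {
    (DIAERESIS, "a"): "\u00E4",  # a with diaeresis
    (DIAERESIS, "o"): "\u00F6",  # o with diaeresis
    (DIAERESIS, "u"): "\u00FC",  # u with diaeresis
    (DIAERESIS, " "): DIAERESIS,  # space yields the spacing diacritic
}


def resolve_dead_key(dead, base):
    """Return the text that is input after a dead key and a second key."""
    # If the pair is not defined, both characters are input unchanged.
    return DEAD_KEY_TABLE.get((dead, base), dead + base)


print(resolve_dead_key(DIAERESIS, "a"))
print(resolve_dead_key(DIAERESIS, "c"))
```

The fallback in resolve_dead_key mirrors the second gray row of the table: the layout defines nothing for the pair, so the dead key's own character is emitted, followed by the second character.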
Shift states
A keyboard layout typically has only 47 or 48 assigned physical keys on it; even the English
alphabet would not fit, if you wanted both uppercase and lowercase A to Z (there wouldn't even
be room for punctuation characters). Therefore, keyboards usually contain another set of 47 or 48
keys that can be accessed by pressing Shift in tandem with a character (for examples, see Figures
12 and 13 for the Greek keyboard in both the unshifted and shifted states).
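One way to picture shift states is as a second dimension on the key table: the same key yields a different character depending on whether Shift is held. The Python sketch below assumes a tiny two-key excerpt of a Greek layout; the LAYOUT table, the string key names, and char_for are hypothetical and only model the idea.

```python
# Hypothetical excerpt of a Greek layout: (key, shift held?) -> character.
LAYOUT = {
    ("VK_A", False): "\u03B1",  # Greek small letter alpha
    ("VK_A", True): "\u0391",   # Greek capital letter alpha
    ("VK_B", False): "\u03B2",  # Greek small letter beta
    ("VK_B", True): "\u0392",   # Greek capital letter beta
}


def char_for(vk, shift):
    """Return the character for a key in a shift state, or None."""
    return LAYOUT.get((vk, shift))


def capacity(keys, shift_states):
    """Number of characters directly reachable with the given states."""
    return keys * shift_states


print(char_for("VK_A", False), char_for("VK_A", True))
print(capacity(48, 2))
```

With 48 keys and the two states discussed here, 96 characters are directly reachable, which is why further characters have to come from mechanisms such as dead keys or additional shift states.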
A complex script is any writing system that needs additional processing in order to properly
display. For example, Arabic needs contextual shaping as well as bidirectional behavior,
Vietnamese needs diacritic positioning, and Indic scripts sometimes need rearrangement of
vowel marks. Uniscribe handles this kind of processing.
6 This includes any NT-based version of Windows after Windows NT 4 (Windows 2000,
Windows XP, and the upcoming Windows .NET Server 2003).
[Figure 17 diagram labels: Input method (Keyboard.dll, here Kbdinkan.dll; unshifted VK_I and unshifted VK_F); code points U+0C97, U+0CBF; to storage, collation, etc.; Uniscribe (Language? Kannada; Script? Indic; Basis of analysis? Syllable); shaping engine; glyphs; glyph positioning; to display.]
Figure 17: The relationship between a keyboard, the rendering engine and display in a complex
script (Kannada, an Indic script language).
Fonts
What has diminished the importance of the HKL of a keyboard has been the increased selection
of fonts available, as well as font linking (the borrowing of information from multiple fonts to
obtain glyphs not in the current font), which was introduced in Windows 2000 and improved for
Windows XP. Obviously, for a keyboard to work well, it is assumed that there will be at least one
font somewhere on the machine to assist in displaying the inputted text, lest every character be
replaced by a null glyph.
Is there some kind of keyboard standard for the region or country? It is sometimes
required to have an input method which is sanctioned by the government or an
appropriate governing body. Implementers should consider contacting their local or
See http://msdn.microsoft.com/library/en-us/intl/ime_5tiq.asp.
For more detailed information on collation on Windows, please see our talk "Sorting it all out: an
introduction to collation," available at
http://www.microsoft.com/globaldev/Presentations/unicode22/016.doc
What languages will the keyboard support? This should be explicitly determined before
allocating keys to characters.
Does the keyboard provide input of all needed linguistic characters for the appropriate
language(s)? This requirement can be met in a number of ways: via dead keys or
additional shift states, for example (not all characters need to be on the unshifted state).
High frequency linguistic characters should be positioned where they are easy to type,
ideally in the unshifted state. (Note that if the keyboard supports multiple languages, the
high frequency keys may change.)
Does the keyboard focus on code points, and not glyphs? It is important to not place the
burden of display or shaping onto the keyboard. All technologies related to visual
display are decoupled from the keyboard (and should be handled by fonts and a
rendering engine if needed; see section 4 for more information).
Do all characters on the keyboard exist in Unicode? Since all input on Windows is based
on Unicode (UTF-16), any code points not encoded in Unicode cannot be handled.
The ALT+X option, also available in Office. (Typing ALT+X after a character gives you
the Unicode value; typing ALT+X after a Unicode value gives you the character.)
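The two directions of the ALT+X conversion amount to a round trip between a character and its hexadecimal Unicode value. The Python sketch below models only that conversion; it makes no claim about how Office decides which preceding text to convert, and the function names char_to_hex and hex_to_char are hypothetical.

```python
def char_to_hex(ch):
    """Character -> hexadecimal Unicode value, at least 4 digits."""
    return format(ord(ch), "04X")


def hex_to_char(hex_text):
    """Hexadecimal Unicode value -> character."""
    return chr(int(hex_text, 16))


print(char_to_hex("\u00E4"))   # a with diaeresis
print(hex_to_char("0C97"))     # the Kannada code point from Figure 17
```

Applying one function and then the other returns the starting value, which is the toggle behavior the bullet above describes.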
I want to make sure I have every single visual variant of my characters on the keyboard; the canonical (or
isolate) version of the code point is not sufficient.
As is discussed in the other technologies section, keyboards on Windows only deal with code
points, not with glyphs. Code points are used exclusively for text processing, except for display.
At the point of display, technologies such as fonts and rendering engines map between code
points and glyphs. There is an important technical boundary between code points and glyphs,
and this exists in order to maintain at least a modicum of simplicity within the system. (Imagine if
every single visual variant of a code point had to be maintained for text processing!) For this
reason, keyboards focus exclusively on code points, and leave the work of linking code points to
the appropriate visual display to fonts and shaping engines.
I want to have an IME rather than a keyboard for my language.
This is generally heard from customers working with complex script languages who feel that
they need to have all visual variants of a code point on an input method. Input Method Editors
really make sense with ideographic languages such as Chinese or Korean, where there are
literally thousands of characters needed for the language. Each of these ideographic characters is
semantically distinct. Compare this with complex scripts, where the number of semantically
distinct characters is generally fewer than 100, but the number of visually distinct characters is
considerable (into the hundreds). Again, keyboards work with code points, not with glyphs.
Since code points are semantically distinct and not visually distinct, a complex script language
can easily be handled via a keyboard; as noted earlier, the code points are linked to the
appropriate visual display by other non-keyboard technologies.
8. Summary
As has been described in this paper, the inner workings of keyboards are more complicated than
a developer would probably like them to be. What is crucial is understanding the association
between the virtual keys, the scan codes and the shift states in a keyboard. In addition,
developers should understand the relationship input has to other technologies, once the
keyboard passes on the code points (e.g., Uniscribe, font technologies and IMEs). This paper has
only touched upon many of the issues, but we hope that it has provided implementers enough
knowledge to avoid pitfalls, and provide customers with a seamless input experience.