Linux Internationalization HOWTO

What do you do when you want your Linux box to talk to you in Japanese, Korean, Norwegian, English and possibly some Swedish? How about inputting foreign alphabets? As with anything else in Linux it doesn't quite work straight out of the box, but in this howto I'll try to explain how to get it working at least half-decently.

Initial preparations
Using Linux in your native language
Input methods
 Smart Common Input Method
 Universal Input Method
The UTF-8 text encoding
Getting hold of internationalized apps
Contact details

Initial preparations

The first thing to do is to make sure your C library has support for foreign locales, the data that tells your system how to display numbers, letters and lists in different languages. If it isn't there, you need to recompile your C library with foreign language support. This will take quite a while, so sit down with your favorite manga (or manhwa) and a good cup of coffee.

(On Gentoo Linux, if support for locales has been compiled in, the USE flag nls should be found in your /etc/make.conf file. If it isn't there, add nls to the USE flag section and emerge glibc again.)

Using Linux in your native language

Though I personally hate seeing improvised Norwegian translations of popular English computer terms, a lot of people find it easier to use Linux in their native language. Although localization settings for software like Gnome or KDE may be altered simply by using a control panel, users of other software solutions may need to change some settings called environment variables. The names of these variables begin with LC_, and LC_CTYPE should be the most important one for enabling input methods. Additionally, LANG and in rare cases LANGUAGE might be needed to change the language in some programs.

Changing these variables is as simple as finding the locale code for your language and country and setting it on a command line. First, run the command locale -a and find the appropriate code for your language in the list. Try Google if you're unsure of which one it is.
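As a small sketch of that lookup, the helper below filters a list of locale names by language prefix. The function name list_locales_for and the nb prefix (Norwegian Bokmål) are only illustrations and not part of any standard tool; substitute the code for your own language.

```shell
# Hypothetical helper: read locale names on standard input and print
# those that start with the given language code (case-insensitive).
list_locales_for() {
    grep -i "^$1"
}

# Typical use, left commented out because the result depends on your system:
# locale -a | list_locales_for nb
```

If the command prints nothing, no locale for that language is installed yet.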

The system language then needs to be set by a script that runs every time the system starts up. On Gentoo Linux, the correct file is apparently called /etc/env.d/02locale. For a lot of systems, though, a quick and almost certainly incorrect way is to enter the necessary codes into the startup script rc.local, which unfortunately will probably only be run near the very end of the boot procedure. Assuming that our language of choice is the beautiful "Example Language", the code might be as follows:

export LC_CTYPE=example_LANGUAGE
export LANG=example_LANGUAGE
export LANGUAGE=example_LANGUAGE

Et voila. Upon rebooting your machine a lot of applications should start appearing in the language of your choice.

Another neat feature is the ability to run just one program in another language without changing your settings. Simply specify the variables before your program name on a command line:

LANG=example_LANGUAGE LC_CTYPE=example_LANGUAGE LANGUAGE=example_LANGUAGE my_program
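To see that such an assignment really applies to one command only, here is a minimal sketch. It uses the locale name C, the built-in default locale, instead of example_LANGUAGE so that it runs on any system, and a child shell that simply prints what it received.

```shell
# Remember the current setting so we can compare afterwards.
before=${LANG-unset}

# The assignment in front of the command is visible only to that command.
child=$(LANG=C sh -c 'printf "%s" "$LANG"')

echo "child saw:   $child"
echo "shell still: ${LANG-unset}"
```

Unless your session already uses C, the two lines differ: the child shell saw the temporary value while your own shell kept its setting.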

Input methods

Although most text input is carried out simply by using a certain keyboard layout (like an American one or a Norwegian one), some languages require the user to use a special input method for entering text. These often work by transliterating the text that the user is entering from the Roman alphabet to some other script, like Hiragana or Traditional Chinese.

There are currently two important implementations of input methods on Unix-like systems: the traditional X Input Method interface (XIM), and the input modules seen in GTK2 and Qt.

This howto deals with two software packages that support all three of these interfaces (XIM, GTK2 and Qt): SCIM and UIM.

Smart Common Input Method

Smart Common Input Method is a multilingual input method framework that can be used both through the old X Input Method interface and as a module for GTK2 or Qt.

The first step is installing SCIM and any input methods you might need. Make sure to also install the GTK2 immodule and Qt immodule interfaces.

SCIM's web site can be found here: http://www.scim-im.org/

If you have GTK2, you might need to run the following command after the installation:

gtk-query-immodules-2.0 > /etc/gtk-2.0/gtk.immodules
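If you want to confirm that the module was registered, one rough check is to look for its name in the generated file. This is only a sketch: the helper name immodule_listed is made up for this example, and it assumes that the file puts each module name in double quotes, as the output of gtk-query-immodules-2.0 normally does.

```shell
# Hypothetical helper: succeed if the quoted module name appears in the file.
# $1 = path to gtk.immodules, $2 = module name (for example scim or uim)
immodule_listed() {
    [ -r "$1" ] && grep -qF "\"$2\"" "$1"
}

# Report the result for the path used in this howto.
if immodule_listed /etc/gtk-2.0/gtk.immodules scim; then
    echo "scim is listed"
else
    echo "scim is not listed (or the file is missing)"
fi
```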

SCIM needs a backend application to make it work with applications that don't use GTK2 or Qt. It's a good idea to start it whenever you start X, so put the following command line somewhere before the last command in your ~/.xinitrc (the last command is normally the one that starts your window manager or desktop session):

scim -d

To make SCIM your default input method, go back to your ~/.xinitrc and add the following lines:

export XMODIFIERS=@im=SCIM
export GTK_IM_MODULE="scim"
export QT_IM_MODULE=scim

In most GTK2 and some Qt programs you'll be able to select SCIM by right-clicking on a text input field without using any special setup.

To activate SCIM, press control + space. Given that it's running properly, you should now see a menu in the bottom right corner of your screen where you can choose between several input methods. There should also be a menu option to hide the SCIM panel, which some users prefer not to have on their screen. You can find out more at Yukiko Bando's mini guide.

Universal Input Method

Universal Input Method is a highly extendable input method framework that focuses on Asian input methods for GTK2 and Qt. It can be used as a module for GTK2 or Qt, through X Input Method (XIM), and as a module for SCIM.

Start by installing uim itself and any UIM-based input methods you need. Make sure to also install the GTK2 immodule and Qt immodule interfaces.

UIM's web site can be found here: http://uim.freedesktop.org/wiki/

If you have GTK2, you might need to run the following command after the installation:

gtk-query-immodules-2.0 > /etc/gtk-2.0/gtk.immodules

UIM also needs a helper application to work with applications that don't use GTK2 or Qt. It's a good idea to start it whenever you start X, so put the following command line somewhere before the last command in your ~/.xinitrc:

uim-xim &

To make UIM your default input method, go back to your ~/.xinitrc and add the following lines:

export XMODIFIERS=@im=uim
export GTK_IM_MODULE="uim"
export QT_IM_MODULE=uim

To switch to another input method (within UIM), use the application uim-im-switcher. In GTK2 and some Qt programs you'll be able to select input methods by right-clicking on a text input field without using any special setup.

UIM is activated by pressing shift + space. For more information, have a look at UIM's home page and wiki.
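If neither key combination does anything, a first thing to check is whether the three variables from ~/.xinitrc actually reached your session. The sketch below works the same way for SCIM and UIM; the function name check_im_env is made up for this example.

```shell
# Hypothetical helper: print the three input method variables and return
# non-zero if any of them is unset or empty.
check_im_env() {
    status=0
    for name in XMODIFIERS GTK_IM_MODULE QT_IM_MODULE; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set"
            status=1
        fi
    done
    return $status
}

check_im_env || echo "some variables are missing; check your ~/.xinitrc"
```

Run it from a terminal inside the X session you want to test, since a text console or a remote login may have a different environment.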

The UTF-8 text encoding

Many systems represent their local alphabets as binary data using character encodings that support only one or a very few languages, meaning that a string of text written on a Korean system, for instance, can't be read on a Japanese system. However, it's possible to make Linux use the UTF-8 (Unicode Transformation Format, 8-bit) encoding for any language. UTF-8 supports a lot of languages without any of them interfering with one another, but it's not compatible with the older character encodings, except for the ASCII characters used for English, which are encoded the same way.
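The difference can be made visible by counting bytes. The sketch below compares the letter a with the Norwegian letter æ (U+00E6); it assumes that the iconv tool is installed, and the helper name count_bytes is made up for this example.

```shell
# Count the bytes arriving on standard input (tr strips the padding
# that some versions of wc add).
count_bytes() {
    wc -c | tr -d ' '
}

# The letter "a" is a single byte in UTF-8, exactly as in ASCII.
ascii_len=$(printf 'a' | count_bytes)

# U+00E6 is the two bytes C3 A6 in UTF-8, written here as octal escapes
# so the example does not depend on the encoding of this file.
utf8_len=$(printf '\303\246' | count_bytes)

# Converted to the older ISO-8859-1 encoding, the same letter is one byte.
latin1_len=$(printf '\303\246' | iconv -f UTF-8 -t ISO-8859-1 | count_bytes)

echo "a: $ascii_len, U+00E6 in UTF-8: $utf8_len, in ISO-8859-1: $latin1_len"
```

The byte that ISO-8859-1 uses for this letter is not valid UTF-8 on its own, which is why text in an older encoding has to be converted and can't just be reinterpreted.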

The system locale settings, which cover the system language among other things, can be made to use UTF-8 with relative ease. Unfortunately, many distros lack UTF-8 locales for most languages. Run the command locale -a and see if you can find a locale for your language with a name ending in .utf8. If not, log in as root (or use su) and convert an already existing locale to the UTF-8 encoding:

localedef -f UTF-8 -i example_LANGUAGE example_LANGUAGE.utf8
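The check that precedes this command can be sketched as follows. The helper name has_utf8_locale is made up for this example, and example_LANGUAGE is the same placeholder as above; the pattern accepts both the .utf8 and the .UTF-8 spelling.

```shell
# Hypothetical helper: read locale names on standard input and succeed if
# a UTF-8 variant of the given locale is among them.
has_utf8_locale() {
    grep -qiE "^$1\.utf-?8$"
}

# Decide whether localedef has to be run at all.
if locale -a 2>/dev/null | has_utf8_locale example_LANGUAGE; then
    echo "a UTF-8 locale already exists"
else
    echo "no UTF-8 locale found; localedef is needed"
fi
```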

Once you know you have a locale that allows you to use UTF-8, you need to tell the system to use it. This might involve adding ".utf8" to the name of the locale you're using, for instance.

export LC_CTYPE="example_LANGUAGE.utf8"
export LANG="example_LANGUAGE.utf8"
export LANGUAGE="example_LANGUAGE.utf8"
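To check from a script whether the new setting took effect, something like the following sketch might do. It only looks at the locale name, so it is a heuristic; the command locale charmap reports the encoding actually in use. The function name ctype_is_utf8 is made up, and the sketch relies on the standard rule that LC_ALL, if set, overrides LC_CTYPE, which in turn overrides LANG.

```shell
# Hypothetical helper: succeed if the locale name that decides the
# character encoding ends in a UTF-8 suffix.
ctype_is_utf8() {
    # LC_ALL wins over LC_CTYPE, which wins over LANG.
    effective=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}
    case $effective in
        *.utf8|*.UTF-8|*.utf-8|*.UTF8) return 0 ;;
        *) return 1 ;;
    esac
}

if ctype_is_utf8; then
    echo "the session uses a UTF-8 locale name"
else
    echo "the session does not use a UTF-8 locale name"
fi
```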

Note that input methods that need to be started with special locale settings, such as the Korean input method ami, still need to be started in their old locale settings.

UTF-8 is already used by most GTK2 and Qt applications, and Java also uses its own Unicode-based encoding. However, to use UTF-8 in console applications you need a UTF-8 enabled terminal emulator or console. Konsole and gnome-terminal are excellent choices for terminal emulators, but if you don't have Qt or GTK2, then mlterm is for you. To enable UTF-8 on the actual console, use unicode_start from kbd or console-tools.

Some programs, like screen or irssi, need special options in order to work with UTF-8. screen, for example, is started in UTF-8 mode with its -U option.

Contact details

If you have further questions about what languages and scripts are supported by SCIM and uim, or about how to enable other input methods such as kinput2, please e-mail me and I'll see if I can help.

Thanks to Martin Swift, Botond Botyanszki, Matt Doughty, Scott Robbins, Tokunaga Hiroyuki, James Su and the Scandinavian Gentoo forums for contributing to the information contained within this document.