

Setting a correct locale on Unix for your preferred (8bit) character set

or: how to get accented characters and umlauts

This document is not a source of fix-it-fast help, but tries to assist in solving even nasty problems. Thus, it contains my own experiences as well as numerous basic explanations and pointers to external documentation.

My advice: skim the following once first, and then read it thoroughly.


Mainly, this page deals with the printability of characters, i.e., the locale category "LC_CTYPE". The examples emphasize the 8-bit ISO 8859 family (link to a PDF of the corresponding standard, see p.7), especially ISO 8859-1 or ISO 8859-15 aka Latin 9. The latter link points to a comparison by J. Korpela.

Nevertheless, this page also contains hints and pointers about locale handling in general.

However: you might be interested in Unicode as well, which is not covered by basic 8-bit locales. Then see below for some pointers to good documentation. For now, keep the command iconv(1) in mind, the "codeset converter", in case you ever run into UTF-8 (or another "incompatible" codeset) unprepared.
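As a minimal sketch of what iconv(1) does, the following converts a single Latin-1 byte to UTF-8 and back. The helper names are hypothetical, and the codeset names "ISO-8859-1" and "UTF-8" are assumptions: many implementations accept them, but valid names differ among systems just like locale values do.

```shell
# Hypothetical helpers; the codeset names are assumptions, check your iconv.
latin1_to_utf8() { iconv -f ISO-8859-1 -t UTF-8; }
utf8_to_latin1() { iconv -f UTF-8 -t ISO-8859-1; }

# a-umlaut is the single byte 0xE4 (octal 344) in Latin-1 and becomes
# the two bytes 0xC3 0xA4 in UTF-8. od(1) shows the resulting bytes in hex.
printf '\344\n' | latin1_to_utf8 | od -An -tx1
```

Looking at the bytes with od(1) instead of printing them avoids depending on what the terminal itself is able to display.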




Introduction

By default, many Unix flavours only process ASCII without any problems (the standardized minimum 7bit-character-set). This is mainly for historical reasons, portability and usability.

If you want to use national characters on these systems (using 8-bit or multi-byte characters), then you must tell most programs so. The general way to do that is by setting some environment variables to appropriate values, i.e., setting the "locale".

Unfortunately these values are not standardized and differ among the various Unix flavours.

There is not even a common, portable value for ISO8859-1 (aka "Latin 1"), which extends ASCII and is the most frequently used replacement in the western world. You have to find a valid value for your needs on your system.

The usual error messages, such as "Couldn't set locale correctly" (if you get one at all), are not nearly meaningful enough. If your settings are wrong, all you know is that "apparently" it doesn't work.

Almost all values provided on a particular system work properly. But in special cases you're likely to run into problems again: the only command for finding out which values your system supports, "locale -a", doesn't tell you which categories/environment variables are actually supported for a particular value. (Some systems have a flag to find that out, sometimes with rather complicated output, but some implementations are broken.) You might even have to look into system-specific directories to find out about the supported categories.

And: several systems don't provide the command locale(1) at all.

Now, even a program paying attention only to the single category "printability of characters" might be affected by most of the above. And although setting the locale usually is rather simple and a feature in general (offering numerous categories and regulating subtle differences between the various languages), you might have become annoyed at first.

Fix it once and forever - that's Unix. An example of a standard program paying attention to the locale on most (commercial) systems is the traditional vi(1). Other examples are tin(1), mutt(1) and perl(1).

By the way, you don't need a national keyboard to type national characters. Using the "AltGr" mechanism is even rather comfortable. See general ways to use special characters in X11 if you're windowed.

On a few occasions the following deals with the mail client mutt, which was the original cause for writing a page about this subject. (For German-language readers: since the newsgroup for mutt, comp.mail.mutt, is English-speaking, this document is not written in my mother tongue.)


Widespread documentation: You shouldn't have missed other documents


Why do I have to care about the locale

In the Usenet newsgroup comp.mail.mutt, for instance, questions regularly show up about how to correctly display 8-bit characters with the internal pager of this mailer, especially since other related programs, mainly editors or external pagers like vim(1), less(1) and several GNU tools, apparently "can do it" by default.

The reason is that mutt(1) strictly follows the settings of the corresponding environment variables.

mutt(1) does not fall back to a common character set like the Western European ISO 8859-1, which provides accents and umlauts. Instead, mutt stays with the standard locale (C or POSIX), which usually means 'no features'. The names derive from the former being defined by ANSI C, the latter by POSIX / SUS.

In fact this can even be considered a feature: keep in mind that when you really work with language-specific characters, numerous things might behave differently:

"Ok, setting the locale is a feature", but if you only wanted to fix printability of characters, you might be annoyed by having to get busy with the category LC_CTYPE. That's the Unix way - having set it properly once, all reasonable applications suddenly know what you want.

Let's have a look at various programs:

For example vim(1), emacs(1), jed(1) and partly joe(1) "know" how to print 8-bit characters out of the box. Some of them don't care about the locale, because they have their own configuration options. A few do so to support encodings which might not be supported by a system locale (e.g. Unicode), and some just don't care because they don't know any better, like pre-223 versions of less(1), for example.

However, other programs do not ignore the locale:

And as another example, some shells also behave according to the locale:

So in general, numerous applications will consider the locale.

Nearly all Unix systems know about locales, but the valid values are not the same on all systems. Some systems recognize only very few and special values. So find out and set the appropriate value/s for the locale. But before looking at system-specific things, what does a certain value mean at all?


In short - how to set the locale?

First look at your current settings:

  $ env | egrep 'LANG|LC_'
  LC_CTYPE=en_US 

You could also try

  $ locale
  LANG=
  LC_CTYPE=en_US
  LC_NUMERIC="C"
  LC_TIME="C"
  LC_COLLATE="C"
  LC_MONETARY="C"
  LC_MESSAGES="C"
  LC_ALL=

(All general categories will be reported, even those which are not explicitly set. In most implementations the double quotes signal an implicit setting.)

However, the latter way is not as robust as it may seem: if a category is set to an invalid value, setting the locale fails completely, yet locale(1) won't report an error.

(By the way: some applications might use their own, very special variables, but those are not of concern here.)

Now try "locale -a" to see the available values for the locale on your system.
This command doesn't exist on a few Unix flavours -- see below then.
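Since locale(1) itself won't complain about an invalid value, a value can be compared against the output of "locale -a" before it is used. This is only a sketch with a hypothetical helper name. It assumes a grep(1) that knows the options -F and -x, and a listed value still says nothing about which categories it supports.

```shell
# locale_listed VALUE: succeed if VALUE appears verbatim in "locale -a".
# Hypothetical helper; returns 2 if there is no locale command at all.
locale_listed() {
    command -v locale >/dev/null 2>&1 || return 2
    # -F: fixed string, -x: match the whole line only
    locale -a 2>/dev/null | grep -Fx -- "$1" >/dev/null
}

# The candidate values are just examples taken from various systems.
for value in C en_US iso_8859_1 en_US.ISO8859-1; do
    if locale_listed "$value"; then
        echo "$value: listed"
    else
        echo "$value: not listed (or no locale command)"
    fi
done
```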

The main point is to set LC_CTYPE to an appropriate and valid name for your system (see below).
Then you should get your 8-bit characters printed.
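In a Bourne-style shell, setting the variable could look like the following sketch. The value "en_US.ISO8859-1" is only an assumed example and has to be replaced with a value that is valid on your system. Some shells warn on stderr if it is not.

```shell
# Session-wide, Bourne-style shells: assign, then export.
# "en_US.ISO8859-1" is only an assumed example value.
LC_CTYPE=en_US.ISO8859-1
export LC_CTYPE
unset LC_ALL        # a set LC_ALL would override LC_CTYPE

# A child process now sees the setting:
sh -c 'echo "child sees LC_CTYPE=$LC_CTYPE"'
```

To try a value for a single command only, the assignment can be put in front of the command instead, as in the checklocale example further down.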

A correct value for Western European 8-bit likely "sounds" like "iso88591" or "en_US", because on the numerous Unix systems I tried, it was always one of the following:
iso_8859_1, en_US, en_US.iso88591, en_US.ISO8859-1, en_US.ISO_8859-1.

No rule without exception. Vincent Lefèvre reports: on Maemo (Nokia's Linux distribution for phones), en_US actually implements UTF-8, not Latin-1.
And in the future, more implementations might move to implement UTF-8 by default.

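There are other variables for other meanings -- but two of them are special: they don't stand for a real single category but influence all other categories in a general way: LC_ALL, if set, overrides every individual category, and LANG provides the default for every category that is not set explicitly.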

(So if you only want to adjust printability by setting LC_CTYPE, be sure that LC_ALL and LANG are unset.)
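The interplay of these variables can be sketched as a small shell function. It assumes the lookup order specified by POSIX (LC_ALL first, then the variable of the category itself, then LANG, otherwise the default "C") and treats empty variables as unset. The function name is hypothetical.

```shell
# effective_locale CATEGORY: print the value a program would look up for
# CATEGORY (e.g. LC_CTYPE): LC_ALL first, then the category's own
# variable, then LANG, else "C". Hypothetical helper.
# Only pass plain variable names, because the argument goes through eval.
effective_locale() {
    eval "_own=\${$1:-}"
    if   [ -n "${LC_ALL:-}" ]; then echo "$LC_ALL"
    elif [ -n "$_own" ];       then echo "$_own"
    elif [ -n "${LANG:-}" ];   then echo "$LANG"
    else                            echo "C"
    fi
}

effective_locale LC_CTYPE
```

This only shows which value is requested. Whether the system actually accepts that value for the category is a separate question.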

However, the output of "locale -a" only means that there is some support for those values, no matter for which categories exactly. This means: if you have set LANG but still have problems, then possibly only a single category, e.g. LC_CTYPE or LC_MESSAGES, supports this value.

If you want to read messages and menus in your mother tongue, set LC_MESSAGES. (This already happens implicitly if you set LANG.) Keep in mind that the application must come with the appropriate translations itself, properly installed, because the system certainly can't know them.

Look for manual pages like 'environ(5)/(7), locale(1)/(7)/(5), setlocale(3C)/(3), localedef(4), i18n_intro(5), l10n_intro(5),' etc, and find out about all the corresponding environment variables, the most important ones being LC_ALL, LC_CTYPE, LC_MESSAGES and LANG.

Pay attention to choose the proper section, because there might be several entries with the same name. This means for example "man 5 environ" (or "man -s5 environ" on Solaris). The numbers in parentheses above are suggestions for sections in which you might find them. -- It's time to do "man man" now, if you didn't know that by heart.


Locale value syntax

In general, a value for a locale category (i.e., for the corresponding environment variable) is constructed like this:

"xy[_XY][.codeset][@modifier]"
xy: language abbreviation, ISO 639-1 (2 characters) [ISO 639-2 (3 characters) might be used for languages without a two-letter code]
XY: country/territory, ISO 3166 [three-letter ISO 3166-1 codes might be possible]
codeset: e.g. "iso88591", "ISO8859-1", "UTF-8", "greek8", "roman", etc.
modifier: anything else refining it, for example "euro" for the currency symbol, or "phone" for a different sorting order (LC_COLLATE).

Unfortunately there is no standardization.
The exact values sometimes noticeably differ among the various Unix flavours.
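To illustrate the layout, here is a minimal sketch of a shell function that takes such a value apart. The function name is hypothetical, and it simply assumes the xy[_XY][.codeset][@modifier] form. Values that don't follow it (like "iso_8859_1" or "C") are not split in any meaningful way.

```shell
# split_locale VALUE: print the four components of a locale value,
# one "name=value" pair per line; missing parts are printed empty.
# Hypothetical helper, assuming the xy[_XY][.codeset][@modifier] layout.
split_locale() {
    rest=$1
    modifier= codeset= territory=
    # Strip from the right: first @modifier, then .codeset, then _XY.
    case $rest in *@*) modifier=${rest#*@};  rest=${rest%%@*} ;; esac
    case $rest in *.*) codeset=${rest#*.};   rest=${rest%%.*} ;; esac
    case $rest in *_*) territory=${rest#*_}; rest=${rest%%_*} ;; esac
    printf 'language=%s\nterritory=%s\ncodeset=%s\nmodifier=%s\n' \
        "$rest" "$territory" "$codeset" "$modifier"
}

split_locale de_DE.ISO8859-15@euro
```

For the example value this prints language=de, territory=DE, codeset=ISO8859-15 and modifier=euro, one per line.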

Some examples of such values:


Collection of hints how to solve all the possible problems.

Essential, the "preparatory work" in advance:

Back to the locale:


A small debugging program as example

See the code for a program which gives an example of how to make use of the locale.

First it tries to set the locale like other programs do. Then it additionally inspects LC_CTYPE and LC_MESSAGES more thoroughly, indicating the printable characters according to isprint(3), and issuing an error message with perror(3) to see the language of system messages (which will be English in most cases, though). But it doesn't try to print nationalized messages of its own or of another utility (because you might have to install these messages in a system directory).

It will complain about all problems that occur.

It will warn if a call to "setlocale" returns a value different from the value it was (implicitly) called with: some locale implementations internally also try modified values, particularly if your value contains a charset or modifier suffix. (And if you set both LANG and another explicit category, then setlocale() will return a "composite value".) -- Thanks to Alain Bench for pointing this out to me!

Note: Depending on your font settings and your browser, you might not be able to see the latin1 characters contained in the following quote. (Also note that you usually shouldn't mix different locale values - certainly with the exception of "unsetting" some categories with the value "C".)

Example:

  $ uname -sr
  SunOS 5.9

  $ LC_CTYPE=iso_8859_1 LANG=nonsense LC_MESSAGES=POSIX ./checklocale

  [Latin1/9] If there's no literal copyrightsymbol at the end of this sentence,
  then your terminal/terminalemulator/font is not ISO8859-1/15 ready: ©
  
- Current environment settings:
  LANG        = "nonsense"
  LC_CTYPE    = "iso_8859_1"
  LC_MESSAGES = "POSIX"

- Implicitly setting all locale categories with LANG failed.
  You might want to unset/fix it now and/or set supported categories instead.

- Setting LC_CTYPE to "iso_8859_1" succeeded.

  Testing LC_CTYPE with isprint():

  # # # # # # # # # # # # # # # # 
  # # # # # # # # # # # # # # # # 
    ! " # $ % & ' ( ) * + , - . / 
  0 1 2 3 4 5 6 7 8 9 : ; < = > ? 
  @ A B C D E F G H I J K L M N O 
  P Q R S T U V W X Y Z [ \ ] ^ _ 
  ` a b c d e f g h i j k l m n o 
  p q r s t u v w x y z { | } ~ # 
  # # # # # # # # # # # # # # # # 
  # # # # # # # # # # # # # # # # 
    ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ 
  ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ 
  À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
  Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
  à á â ã ä å æ ç è é ê ë ì í î ï
  ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

  
- Implicitly setting   LC_NUMERIC by LANG failed.
- Implicitly setting      LC_TIME by LANG failed.
- Implicitly setting   LC_COLLATE by LANG failed.
- Implicitly setting  LC_MONETARY by LANG failed.
- Setting LC_MESSAGES to "POSIX" succeeded.
- Testing LC_MESSAGES with perror() for EAGAIN, that is, a libc message.
  Message catalogs are in /usr/share/locale, according to bindtextdomain().
  perror()   tells: Resource temporarily unavailable
  strerror() tells: Resource temporarily unavailable


Post Scriptum


Unicode:

Ultimately, Unicode is the way to go for a really useful encoding (however, the problem with locales would still remain to some degree). Don't miss the following:



With credits to: Jürgen Dollinger, Christian Weisgerber, Heiko Schlenker, Rudolf Hommer, Chris Green, Sven Guckes, Olav Kvittem, Thomas Schultz, Vincent Lefèvre and especially to Mark Glassberg and Alain Bench.

Your own experiences and other feedback are most welcome

comments to <mascheck@in-ulm.de>
started about 1999-11-18, last update 2010-07-13