Locale and 8bit Characters (Accented characters, Umlaute, etc)

Setting a correct locale on Unix for your preferred (8bit) character set

or: how to get accented characters and umlauts

Update remark: This document has probably become outdated by now. It deals with 8-bit character sets rather than Unicode.
It originates from a time when it could be exceptionally difficult to get locales running on some traditional Unix variants.

This document is not a source of fix-it-fast help, but tries to assist in solving even nasty problems. Thus, it contains both my experiences as well as numerous basic explanations and pointers to external documentation.

My advice: skim the following once first, and then read it properly.

Mainly, this page deals with printability of characters, i.e., the locale category "LC_CTYPE". The examples emphasize the 8-bit ISO 8859 family (link to a PDF of the corresponding standard, see p.7), especially ISO 8859-1 or ISO 8859-15 aka Latin 9. The latter link points to a comparison by J. Korpela.

Nevertheless, this page also contains hints and pointers about locale handling in general.

However: you might be interested in Unicode as well, which is not covered by basic 8-bit locales. See below for some pointers to good documentation. For now, keep the command iconv(1), the "codeset converter", in mind in case you ever run into UTF-8 (or another "incompatible" codeset) unprepared.
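
As a quick illustration of iconv(1), here is a minimal sketch that converts UTF-8 input to ISO 8859-1. The function name utf8_to_latin1 is made up for this example, and the codeset names "UTF-8" and "ISO-8859-1" are an assumption: GNU iconv accepts them, but other systems may spell them differently (many implementations list their names with "iconv -l").

```shell
# Convert standard input from UTF-8 to ISO 8859-1.
# The codeset names are an assumption; check what your iconv accepts.
utf8_to_latin1() {
    iconv -f UTF-8 -t ISO-8859-1
}

# "ü" is two bytes in UTF-8 (octal 303 274) and a single byte in
# ISO 8859-1; od(1) shows the bytes that remain after the conversion.
printf '\303\274\n' | utf8_to_latin1 | od -An -t o1
```

Characters that do not exist in the target codeset make iconv fail or substitute, depending on the implementation, so check the exit status when converting real data.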

Content of this page:

Introduction
widespread documentation
why do I have to care about the locale
how to set the locale, short introduction (not covering other possible problems on your system)
locale value syntax
collection of hints how to find and solve all the possible problems
a small program to debug locale settings (get more detailed error messages)
post scriptum (be careful with LC_COLLATE, more system specific stuff, vendor documentation)
Unicode pointers

Introduction

By default, many Unix flavours only process ASCII without any problems (the standardized minimum 7bit-character-set). This is mainly for historical reasons, portability and usability.

If you want to use national characters on these systems (using 8-bit or multi-byte characters), then you must tell that to most programs. The general way to do that is by setting some environment variables to appropriate values, i.e., setting the "locale".

Unfortunately these values are not standardized and differ among the various Unix flavours.

There is not even a common, portable value for ISO8859-1 (aka "Latin 1"), which extends ASCII and is the most frequently used replacement in the western world. You have to find a valid value for your needs on your system.

The usual error messages, such as "Couldn't set locale correctly" (if you get it at all), are not nearly meaningful enough. If your settings are wrong you only know: "apparently", it doesn't work.

Almost all values provided on a particular system work properly. But in special cases you're likely to run into problems again: The only command to find out about which values are supported by your system, "locale -a", doesn't tell which categories/environment variables are actually supported for a particular value. (On some systems there's a flag to find out about that - with rather complicated output on some systems - however some implementations are broken.) You even might have to look into system specific directories to find out about the supported categories.

And: several systems don't provide the command locale(1) at all.

Now, even a program paying attention only to the single category "printability of characters" might be affected by most of the above. And although setting the locale usually is rather simple and a feature in general (offering numerous categories and regulating subtle differences between the various languages), you might have become annoyed at first.

Fix it once and forever - that's Unix. An example of a standard program paying attention to the locale on most (commercial) systems is the traditional vi(1). Other examples are tin(1), mutt(1) and perl(1).

By the way, you don't need a national keyboard to type national characters. Using the "AltGr" mechanism is even rather comfortable. See general ways to use special characters in X11 if you're windowed.

On a few occasions the following deals with the mail client mutt, which was the original cause for writing a page about this subject. (For German-speaking readers: since the newsgroup for mutt, comp.mail.mutt, is English-speaking, this document is not written in my mother tongue.)

Widespread documentation: You shouldn't have missed other documents

See the ISO 8859-1 FAQ. That ftp server has not been responding to me for a long time now. Alternatively see a version from the Google Usenet archive (or a local copy). The above document is outdated concerning X11, see a response in Usenet (local copy).
Certainly, the locale can also serve other character sets. I will use ISO8859-1 just as a placeholder in the following.
As the appropriate Linux HOWTOs are actually not very Linux-specific (I haven't even used Linux much myself for a long time), you might want to read some of them (find your mirror on linuxdoc.org):
The Danish-HOWTO is written in English and useful for all Europeans on any Unix in general, as it provides tips for many Unix applications. Also, there's a short Linux-specific description about regenerating the locale (localedef).
(For German-speaking readers: German-HOWTO.)
FreeBSD Handbook, Chapter 13. Localization and The euro symbol are also interesting.
(For german-lang readers: Benutzung von Umlauten auf FreeBSD-Systemen.)
FreeBSD provides the login.conf(5) mechanism, which offers a very general - and recommended - way to do "localization". Open- and NetBSD provide login.conf as well, but there's no support for internationalization.
(For german-lang readers and not Unix specific)
- Umlaute in E-Mail und Netnews FAQ
- Umlaute im deutschsprachigen Teil des Usenet
(For russian lang readers)
- Home of KOI8-R - Russian Net Character Set with numerous specialized subpages containing valuable and specific hints (pointed out by Francois Zellinger in c.m.mutt).
- The Cyrillic Charset Soup about KOI8-R / ISO8859-5.
Roman Czyborra made an impressive collection about character sets and fonts, Unicode (standards, characters, encoding, etc), ISO8859-x, Cyrillic, Chinese/Japanese/Korean, DOS and Windows codepages (and a few other proprietary ones)...

Why do I have to care about the locale

In the Usenet newsgroup comp.mail.mutt, for instance, questions regularly show up about how to correctly display 8bit characters with the internal pager of this mailer, in particular because other programs used alongside it, mainly editors or external pagers like vim(1), less(1) and several GNU tools, apparently "can do it" by default.

The reason is that mutt(1) strictly follows the settings of the corresponding environment variables.

mutt(1) does not fall back to a common character set like the western European ISO 8859-1, which provides accents and umlauts. Instead it stays with the standard locale (C or POSIX), which usually means 'no features'. The two names stem from where the locale is defined: "C" by ANSI C, "POSIX" by POSIX / SUS.

In fact this can even be considered a feature: keep in mind that when you really work with language-specific characters, numerous things might behave differently:

Many eight bit characters are considered to be printable, but not all. (category LC_CTYPE)
Sorting order for characters matters (LC_COLLATE). In some languages "z" is not the last character in the alphabet. There are even locale values on some systems that let you distinguish between telephone-book and dictionary sorting.
National language support: messages and menus in your mother tongue (LC_MESSAGES)
Time and money formatting (LC_TIME, LC_MONETARY)

"Ok, setting the locale is a feature", but if you only wanted to fix printability of characters, you might be annoyed by having to get busy with the category LC_CTYPE. That's the Unix way - having set it properly once, all reasonable applications suddenly know what you want.

Let's have a look at various programs:

For example vim(1), emacs(1), jed(1) and partly joe(1) "know" printing 8bit characters out of the box. Some of them don't care about the locale, because they have their own configuration options. A few do so to support encodings which might not be supported by a system locale (e.g. unicode), and some just don't care because they didn't know better, like in pre-223 versions of less(1) for example.

However, other programs do not ignore the locale:

On most systems, "vanilla" vi (that is, the traditional vi) needs a proper setting for one category of the locale: LC_CTYPE.
Some mailers and newsreaders follow your settings of LC_CTYPE and LC_MESSAGES, for example mutt(1) and tin(1).
A few terminal emulators pay attention to LC_CTYPE and might even silently refuse to accept/print eight-bit characters if the setting isn't appropriate.

And as another example, some shells also behave according to the locale:

In bash-2.x, the readline library initializes according to LC_CTYPE at startup. If you don't go this way, you have to fiddle with the readline settings themselves to be able to type 8bit characters. (On Linux, your distributor might already have done this for you, so that you never had to adjust the readline settings yourself, but it was necessary all the same.) More about bash/readline: see the post scriptum below.
Many versions of ksh88 and ksh93, as well as tcsh, even track LC_CTYPE at run time.

So in general, numerous applications will consider the locale.

Nearly all Unix-systems know about locales, but the valid values are not the same on all systems. Some systems recognize only very few and special values. So find out and set the appropriate value/s for the locale. But before looking at system specific things, what does a certain value mean at all?

In short - how to set the locale?

First look at your current settings:

  $ env | egrep 'LANG|LC_'
  LC_CTYPE=en_US

You could also try

  $ locale
  LANG=
  LC_CTYPE=en_US
  LC_NUMERIC="C"
  LC_TIME="C"
  LC_COLLATE="C"
  LC_MONETARY="C"
  LC_MESSAGES="C"
  LC_ALL=

(All general categories will be reported, even those which are not explicitly set. In most implementations the double quotes signal an implicit setting.)

However, the latter way is not as robust as it may seem: if a category is set to an invalid value, setting the locale fails completely, yet locale(1) won't report an error.

(By the way: Note that some applications certainly might use their own very special variables, but that's not of concern here then.)

Now try "locale -a" to see the available values for the locale on your system.
This command doesn't exist on a few Unix flavours -- see below then.
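
For use in scripts, the check against "locale -a" can be wrapped in a small function. This is only a sketch; the name locale_listed and its return codes are made up for this example. It merely tests whether a value is literally listed and says nothing about which categories are actually supported.

```shell
# Return 0 if VALUE is listed by "locale -a",
#        1 if it is not listed,
#        2 if there is no locale(1) command at all.
locale_listed() {
    command -v locale >/dev/null 2>&1 || return 2
    locale -a 2>/dev/null | grep -Fx -- "$1" >/dev/null
}

if locale_listed en_US.ISO8859-1; then
    echo "en_US.ISO8859-1 is listed"
else
    echo "en_US.ISO8859-1 is not listed (or locale(1) is missing)"
fi
```

Some systems list a normalized spelling of the codeset (glibc, for instance, prints "en_US.utf8" although "en_US.UTF-8" is accepted), so a miss is not proof that a value is invalid.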

Main point is to set LC_CTYPE to an appropriate and legal name for your system (see below).
Then you should get your 8bit characters printed.

A correct value for western-europe 8-bit likely "sounds" like "iso88591" or "en_US", because on numerous Unix systems I tried it was always one of the following:
iso_8859_1, en_US, en_US.iso88591, en_US.ISO8859-1, en_US.ISO_8859-1.

No rule without exception. Vincent Lefèvre reports: on Maemo (Nokia's Linux distribution for phones), en_US actually implements UTF-8, not Latin-1.
And in the future, more implementations might move to implement UTF-8 by default.

There are other variables for other meanings -- but two of them are special: They don't mean a real single category but influence all other categories in a general way:

LC_ALL, which overrides all others. Thus it should be set for debugging purposes only (e.g., enforcing a fall back to 7-bit ASCII with the value "C").
LANG, which has lower priority than all others. It doesn't override any value, that's its very purpose in contrast to LC_ALL.
It is used to completely preset the locale with a certain default, as it influences all other categories which are not set.
This variable is useful if you want as much nationalized behaviour as possible.

(So if you only want to adjust printability by setting LC_CTYPE, be sure that LC_ALL and LANG are unset.)
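
The precedence described above (LC_ALL first, then the specific category, then LANG) can be sketched as a small shell function. The name effective_locale is made up for this example, and the final fallback to "C" is an assumption, since the default without any settings is up to the implementation. The function only looks at the environment; it doesn't check whether the resulting value is valid on your system.

```shell
# Print the value that should apply to one category, e.g. LC_CTYPE,
# following the order LC_ALL > category variable > LANG.
effective_locale() {
    # Fetch the value of the variable whose name was passed as $1.
    eval "category_value=\${$1-}"
    if [ -n "${LC_ALL-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumed default; implementations usually use "C" or "POSIX".
        printf 'C\n'
    fi
}

effective_locale LC_CTYPE
```

An empty variable is treated like an unset one here, which matches how the variables are commonly interpreted.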

However, the output of "locale -a" only means that there is some support for those values, no matter for what categories exactly. This means: if you have set LANG but still have problems, then only one category, e.g. LC_CTYPE or LC_MESSAGES, might support this value.

If you want to read messages and menus in your mother tongue, set LC_MESSAGES. (This already happens implicitly if you set LANG.) Keep in mind that the application must come with the appropriate translations itself, properly installed, because the system certainly can't know them.

Look for manual pages like 'environ(5)/(7), locale(1)/(7)/(5), setlocale(3C)/(3), localedef(4), i18n_intro(5), l10n_intro(5),' etc, and find out about all the relevant environment variables, the most important ones being LC_ALL, LC_CTYPE, LC_MESSAGES and LANG.

Take care to choose the proper section, because there might be several entries with the same name. This means for example "man 5 environ" (or "man -s5 environ" on Solaris). The numbers in parentheses above are suggestions for sections in which you might find them. -- It's time to do "man man" now, if you didn't know that by heart.

Locale value syntax

In general, a value for a locale category (the corresponding environment variable) is constructed like this:

"xy[_XY][.codeset][@modifier]"

 xy:       language abbreviation, ISO 639-1 (2 characters), [ISO 639-2 (3 characters) might be used for languages without a two-letter code]
 XY:       country/territory, ISO 3166                      [ISO 3166-1 alpha-3 three-letter codes might be possible]
 codeset:  e.g. "iso88591", "ISO8859-1", "UTF-8", "greek8", "roman", etc
 modifier: anything else refining it.
           for example "euro" for the currency symbol,
           or "phone" for a different sorting order (LC_COLLATE).

Unfortunately there is no standardization.
The exact values sometimes differ noticeably among the various Unix flavours.
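
To make the syntax more tangible, here is a minimal sketch that splits a value into its four parts with plain parameter expansion. The function name parse_locale is made up for this example. It assumes the value really follows the pattern above; a traditional name like "iso_8859_1" does not, and would be split at its first underscore.

```shell
# Split "xy[_XY][.codeset][@modifier]" into its parts.
parse_locale() {
    value=$1
    territory= codeset= modifier=
    # Strip the optional parts from right to left.
    case $value in
        *@*) modifier=${value#*@}; value=${value%%@*} ;;
    esac
    case $value in
        *.*) codeset=${value#*.}; value=${value%%.*} ;;
    esac
    case $value in
        *_*) territory=${value#*_}; value=${value%%_*} ;;
    esac
    printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
        "$value" "$territory" "$codeset" "$modifier"
}

parse_locale en_US.ISO8859-15@euro
# prints: language=en territory=US codeset=ISO8859-15 modifier=euro
```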

Some examples for such values:

"C" - the standard value, usually the default, the same as not setting the category at all. 7-bit ASCII charset, no goodies. Ironically enough, at least one vendor (HP) apparently felt the need to provide "C.iso88591". The name C is associated with ANSI C.
"en_US", "en_US.iso88591" - ASCII and the western European characters.
en_US (if available) traditionally implies ISO 8859-1, even without the codeset suffix.
However, HP-UX 10/11 provide only values with the codeset, for example en_US.iso88591.
You see, it's essential to find out about the valid values instead of only guessing.
"de_DE" - although this is valid syntax, the value doesn't exist on many Solaris versions (only "de" does).
"fr_CA.roman8" - might be appropriate for Canadians.
"zh_TW.big5" - traditional Chinese in Taiwan with the BIG5 codeset (not an eight-bit locale, but a good example).
"en_US.ISO8859-15@euro" - example from Solaris supporting the "euro sign" instead of the dollar sign as currency character (apart from that, @euro is the default for iso8859-15).

Collection of hints how to solve all the possible problems.

Essential, the "preparatory work" in advance:

Make sure that your tty is "eight-bit-clean". Generally set "stty cs8 -istrip". Many shells and TTYs require this.
If you use telnet(1), confirm that you run in 8bit mode: Press <CTRL-]> and then "set ?" and "toggle ?". See the variables inbinary and outbinary. Fix them or start telnet with the right options. Adjust ~/.telnetrc. (Note that ssh(1) is 8bit-clean.)
Verify that your font actually does contain the wanted special characters (see also post scriptum and the debugging program).
Several terminal emulators pay attention to the locale. Examples are dtterm(1) (not on Solaris, but on HP-UX, AIX), hpterm(1), aixterm(1). You might have to start them with correct settings. Yes, this can be sort of a crux...
A really unlikely problem: Verify that your Xresources don't explicitly prohibit eight-bit characters. Use xrdb(1), "xrdb -q", to see the general settings and appres(1) for application specific resources. An example: xterm ('appres XTerm xterm') knows the resource "XTerm*eightBitOutput", it correctly defaults to True. (Note that the resource "eightBitInput" has a completely different meaning and is not of concern here).
If you have perl(1) there's an elegant way to print the 8bit characters:
perl -e 'for$i(160..255){printf"%c%c",$i,($i%16==15)?10:32}'
(Christian Weisgerber)
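
If perl is not at hand, roughly the same table can be produced with a plain POSIX shell. This is only a sketch with a made-up function name; it relies on printf(1) understanding octal escapes in its format string.

```shell
# Print the bytes 160..255, sixteen per line, separated by blanks.
print_high_chars() {
    i=160
    while [ "$i" -le 255 ]; do
        # Build an octal escape like \240 and let printf emit that byte.
        printf "\\$(printf '%03o' "$i")"
        if [ $((i % 16)) -eq 15 ]; then
            printf '\n'
        else
            printf ' '
        fi
        i=$((i + 1))
    done
}

print_high_chars
```

The bytes are written as they are. In a terminal running with a UTF-8 locale they will not show up as Latin 1 characters, which is a useful hint in itself.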

Back to the locale:

Ok, you have found a promising locale value. It's essential that the call of setlocale(3) in your program succeeds. Use the example below to get detailed information about this step.
If you set a category, better don't forget to "export" it. If you want to remove it again, use "unset" or "unsetenv".
- bourne-like shells:
  $ unset LANG LC_ALL; LC_CTYPE=<value>; export LC_CTYPE
- csh-like shells:
  % unsetenv LANG LC_ALL; setenv LC_CTYPE <value>
Check also system wide configuration files which tend to set LANG or even LC_ALL.
(This might be /etc/*profile and /etc/default/[i18n|lang] for example.)
Setting different locale categories to incompatible values might cause problems. This is most likely to happen if a login script had also set LANG and you haven't noticed this.
Some systems don't provide the command locale(1), e.g., Free/Open/NetBSD, SunOS 4.x and Irix 5.3. In this case, or if "locale -a" didn't help and you still have problems, search for directories like /usr/[local|share|lib]/[nls|loc|locale]/ and inspect them yourself.
As a simple example: For setting LC_CTYPE there is an entry like one of these on almost all systems:
- /<path-to-locale-directory>/<locale-value>/LC_CTYPE/ctype (e.g. Solaris)
- /<path-to-locale-directory>/<locale-value>/LC_CTYPE (e.g. Linux glibc2)
- /<path-to-locale-directory>/<locale-value> (e.g. HP-UX)
On Linux, pay attention to /usr/share/locale vs. /usr/lib/locale. Both might exist due to an upgrade (with only one containing a locale).
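
The manual search can be sketched in shell as follows. Both function names are made up for this example; the candidate directories are just the combinations of the path pattern mentioned above, and the three tests correspond to the three layouts listed. Your system may keep its locale data somewhere else entirely (newer glibc versions, for instance, may use a single archive file), in which case this finds nothing.

```shell
# Print those candidate locale directories that exist on this system.
locale_dirs() {
    for prefix in /usr/local /usr/share /usr/lib; do
        for name in nls loc locale; do
            if [ -d "$prefix/$name" ]; then
                printf '%s\n' "$prefix/$name"
            fi
        done
    done
}

# Succeed if directory DIR holds an LC_CTYPE entry for VALUE
# in one of the three layouts.
has_ctype_entry() {
    dir=$1
    value=$2
    [ -f "$dir/$value/LC_CTYPE/ctype" ] ||
    [ -f "$dir/$value/LC_CTYPE" ] ||
    [ -f "$dir/$value" ]
}

# Report every directory that knows LC_CTYPE for a given value.
locale_dirs | while read -r d; do
    if has_ctype_entry "$d" en_US; then
        echo "$d: LC_CTYPE entry found for en_US"
    fi
done
```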
If you cannot solve your problems, some programs might provide a workaround:
For mutt this is the configure switch "--enable-locales-fix", so you have to recompile mutt. Another example: tin provides "--disable-locale". Also, some programs might not handle the wide-character support of glibc. Pre-mutt-1.3 in connection with such a glibc is an example. Recompiling with said option should help.
For mutt-1.3 (developer versions), if you still have problems, also use "--without-wc-funcs" (without wide character functions). You should have seen it already in INSTALL and "configure --help".
On most systems: For all categories but LC_MESSAGES, your value must include the country (see syntax above). Thus, it's not "en, de, ..." but "en_US, de_DE, ...", even if the former are reported as valid values by "locale -a".
Why? In the directories named by language abbreviations (i.e., "de", "fr", etc.), you'll usually only find the translations of messages (LC_MESSAGES) for various programs. But LC_MESSAGES stuff is accessed by a mechanism different from the other categories.
And: Don't confuse a language abbreviation (fr, es, de) with a locale alias (like french, spanish, german) from the file /usr/[share|lib]/locale/locale.alias. Be very careful about using these aliases, as well.
RedHat Linux 8.0 is using UTF-8 (Unicode) locales by default. For some programs you might have to switch back to common 8-bit locales (e.g., the acrobat reader at the time of this writing).
HP-UX 10/11: Your value must contain the "codeset". So it's not "en_US" but "en_US.iso88591".
(Note that there are no separate LC_CTYPE files in the system; there's only one common entry for each value, like /usr/lib/nls/loc/locales.1/en_US.iso88591. You can confirm that by looking into /usr/lib/nls/loc/src/en_US.iso1.src.)
Solaris (and SunOS 4.x as well): No locale values are contained in the minimal installation. However, even then you can get support for 8bit by setting only LC_CTYPE to the value "iso_8859_1".
Your preferred locale might just not be installed.
E.g.: in the Usenet posting Message-ID: <slrn8trvil.1qm.hschlen@humbert.ddns.org>, Heiko Schlenker mentions that on Debian GNU/Linux one might still have to post-install a package - like "user-de" for "de" support.
There are some older Linux distributions with broken locale support in the libc, i.e., isprint(3) doesn't work as expected.
Unlikely, but possible: something has gone wrong with your locale installation. Use a system call tracer [truss(1), ktruss(1), ktrace(1)+kdump(1), strace(1), trace(1), tusc(1), par(1), sctrace(1)] to find out if all needed stuff is found by the C library or your program.
See localedef(1) for fixing or rebuilding a locale installation on the lower level (pointed out by Jürgen Dollinger). This command is - like locale(1) - available on practically every Unix, except Free/Open/NetBSD, SunOS 4 and Irix5.
You'll find an example in the Danish HOWTO.
Some Linux distributions come with their own way to do this (e.g., Debian).
OpenBSD has limited locale support.
- Originally, there was almost no support.
- OpenBSD 2.9 implemented rather hardcoded iso8859-1 behaviour concerning printability, that is, LC_CTYPE (unless you have disabled 8bit support by recompiling libc with -DUSE7BIT).
- Since about OpenBSD 3.8, LC_CTYPE is supported and you'll find the supported values in /usr/share/locale/, see mklocale(1).
See also src/lib/libc/gen/ctype_.c.
Output on the console ttys might be limited to ASCII before OpenBSD 2.9:
```
	From: "Arvid Grøtting"
	Newsgroups: comp.unix.bsd.openbsd.misc
	Subject: Re: Problem with ASCII representations
	Date: Fri, 16 Mar 2001 09:43:26 GMT
	Message-ID: <l8u24udp6p.fsf@gorgon.netfonds.no> 
```
Concerning console drivers, see pcvt(4) up to 2.8 and wscons(4) from 2.9 on.
If you use the on-board vi (in fact it's "nvi") on systems before 2.8, see the post scriptum below about nvi.
Apart from the above: Programs certainly might install their own messages, using LC_MESSAGES then. However, this is done with a mechanism completely different from setlocale(3), so it's not affected by the above limitation.
You might try resorting to Linux emulation, if you ever need something very special.
NetBSD:
NetBSD 1.5 and earlier have very little support for locales. Be aware of setlocale(3) being a stub for all categories but LC_CTYPE. setlocale(3) implies that there is support for LC_CTYPE (see its BUGS section), but AFAIK there is none. A look into the system directories (/usr/share/[locale|nls]) will confirm this.
See http://www.netbsd.org/Documentation/misc/index.html#locales for LC_CTYPE support. But the link therein was not accessible under special circumstances (firewall configurations) at the time of this writing. Thus, see also ftp://ftp2.fr.netbsd.org/pub/NetBSD/arch/i386/french-1.4/locale.tgz as a fallback.
Planning for multi-byte support has been started, but I haven't been following that, as I don't run NetBSD myself:
```
	> From: itojun@iijlab.net (itojun@iijlab.net)
	> Subject: multibyte LC_CTYPE locale support from Citrus XPG4DL repository
	> Newsgroups: comp.unix.bsd.netbsd.announce
	> Date: 2001-01-25 07:59:59 PST
	>
	> NetBSD-current now integrates multibyte LC_CTYPE locale support,
	> from the Citrus XPG4DL codebase.
	> [...]
	> http://citrus.bsdclub.org/index-en.html
```
NetBSD 1.6 comes with several locales installed.
FreeBSD:
From setlocale(3): The current implementation supports only the "C" and "POSIX" locales for all but the LC_COLLATE, LC_CTYPE, and LC_TIME categories.
However you can set the other variables anyway. The libc will only stat the locale directory itself, but not try to access category specific files then. (Yet, this dummy-stat() certainly fails for invalid values.) Specific applications might make use of those categories in their own way.
And, as mentioned at the top, don't forget about login.conf(5), e.g. using a ~/.login_conf with
```
    me:\
    :charset=iso-8859-1:\
    :lang=en_US.ISO8859-1:
	
```
Search again in the documentation for your OS
Search dejanews/google with "locale", "8bit", "accent", "umlaut", "charset", etc - in an appropriate group for your OS.

A small debugging program as example

See the code for a program which gives an example of how to make use of the locale.

First it tries to set the locale like other programs. Then it additionally inspects LC_CTYPE and LC_MESSAGES more thoroughly, indicating the printable characters according to isprint(3), and issuing an error message with perror(3) to see the language of system messages (which will be English in most cases, though). But it doesn't try to print a nationalized message of its own or of another utility (because you might have to install these messages in a system directory).

It will complain about all problems that occur.

It will warn if a call to "setlocale" returns with a value different from the value it was (implicitly) called with: some locale implementations internally try additional, modified values, particularly if your value contains a charset or modifier suffix. (And if you set both LANG and another explicit category, then setlocale() will return a "composite value".) -- Thanks to Alain Bench for pointing this out to me!

Note: Depending on your font settings and your browser, you might not be able to see the latin1 characters contained in the following quote. (Also note that you usually shouldn't mix different locale values - certainly with the exception of "unsetting" some categories with the value "C".)

Example:

  $ uname -sr
  SunOS 5.9

  $ LC_CTYPE=iso_8859_1 LANG=nonsense LC_MESSAGES=POSIX ./checklocale

  [Latin1/9] If there's no literal copyrightsymbol at the end of this sentence,
  then your terminal/terminalemulator/font is not ISO8859-1/15 ready: ©
  
- Current environment settings:
  LANG        = "nonsense"
  LC_CTYPE    = "iso_8859_1"
  LC_MESSAGES = "POSIX"

- Implicitly setting all locale categories with LANG failed.
  You might want to unset/fix it now and/or set supported categories instead.

- Setting LC_CTYPE to "iso_8859_1" succeeded.

  Testing LC_CTYPE with isprint():

  # # # # # # # # # # # # # # # # 
  # # # # # # # # # # # # # # # # 
    ! " # $ % & ' ( ) * + , - . / 
  0 1 2 3 4 5 6 7 8 9 : ; < = > ? 
  @ A B C D E F G H I J K L M N O 
  P Q R S T U V W X Y Z [ \ ] ^ _ 
  ` a b c d e f g h i j k l m n o 
  p q r s t u v w x y z { | } ~ # 
  # # # # # # # # # # # # # # # # 
  # # # # # # # # # # # # # # # # 
    ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬  ® ¯ 
  ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ 
  À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
  Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
  à á â ã ä å æ ç è é ê ë ì í î ï
  ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

  
- Implicitly setting   LC_NUMERIC by LANG failed.
- Implicitly setting      LC_TIME by LANG failed.
- Implicitly setting   LC_COLLATE by LANG failed.
- Implicitly setting  LC_MONETARY by LANG failed.
- Setting LC_MESSAGES to "POSIX" succeeded.
- Testing LC_MESSAGES with perror() for EAGAIN, that is, a libc message.
  Message catalogs are in /usr/share/locale, according to bindtextdomain().
  perror()   tells: Resource temporarily unavailable
  strerror() tells: Resource temporarily unavailable

Post Scriptum

Be certain to use a font that contains the characters you want to see.
If you're using X11/XFree86, use xfd(1) to display a font, use xset(1) and xlsfonts(1) to find out (or limit or extend) the locations (fontpath) and thus the names of available fonts. The default font is named "fixed", btw.
See " Using eight-bit-characters in X11" about how to insert eight-bit characters.
If using LC_COLLATE, be very careful with sort(1), tr(1), and character ranges (shell -c 'echo [A-Z]*').
Likely, you get unexpected but probably valid results, depending on shell, system and certainly locale:
```
    $ touch A B C a b c
    $ LC_COLLATE=C     shell -c 'echo [A-Z]*'
    A B C
    $ LC_COLLATE=en_US shell -c 'echo [A-Z]*'
    A a B b C c

    $ ls * | LC_COLLATE=C sort
    A
    B
    C
    a
    b
    c

    $ ls * | LC_COLLATE=en_US sort
    A
    a
    B
    b
    C
    c   
```
If you have set LANG, then you might want to add LC_COLLATE=C.
(German lang readers: [cert.uni-stuttgart.de] Locale-Einstellungen mit überraschenden Auswirkungen. Und ein Thread speziell zum eigentlichen LC_COLLATE-Problem in de.comp.os.unix.shell, startend mit <9qd9ot0e1b4cpboupq7p78ch9o4ub6vcb1@4ax.com>.)
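
Two defensive habits for scripts follow from the LC_COLLATE behaviour shown above; here they are as a small sketch with made-up function names. The first forces plain byte order for one sort(1) call regardless of the caller's settings, the second matches by character class instead of by range, so the result doesn't depend on the collation order. Support for character classes in filename patterns is assumed; very old shells may lack it.

```shell
# Sort standard input in byte order, whatever LANG or LC_COLLATE say.
sort_bytewise() {
    LC_ALL=C sort
}

# List the names in directory DIR that start with an upper-case letter,
# using a character class instead of the range [A-Z].
upper_names() {
    ( cd "$1" && echo [[:upper:]]* )
}

printf 'b\nA\na\nB\n' | sort_bytewise
```

LC_ALL is used in the function because it overrides everything else; it is set for that single command only, so the rest of the session keeps its locale.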
Using LC_CTYPE will do no harm here usually. But be careful about security issues. One might imagine security relevant characters to be encoded in a way, that a program doesn't recognize them, e.g. "../" using Unicode instead of ASCII.
Strings used with LC_MESSAGES might be vulnerable to buffer overflows. See also www.cert.org, "Vulnerability in Natural Language Service" about this.
If you still have problems with bash and printable characters, verify the following settings with "bind -v":
set convert-meta off, set input-meta on, set output-meta on, set meta-flag on (synonym for input-meta).
See bash(1) about INPUTRC then. This is required if you cannot get working locale support for any reason.
If setlocale(3) is not available at all, readline accepts the special values "iso8859[1-10]" and "koi8r", see bash-2.x/lib/readline/nls.c, "legal_lang_values[]".
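
If you want to make the readline settings above permanent, they can go into your readline init file. The following sketch appends them with a small shell function; the function name is made up for this example, and it assumes that readline reads the file named by INPUTRC or otherwise ~/.inputrc. It doesn't check whether the lines are already present.

```shell
# Append the 8-bit related readline settings to the given file.
add_8bit_inputrc() {
    cat >> "$1" <<'EOF'
set convert-meta off
set input-meta on
set output-meta on
EOF
}

# Example call (commented out so nothing is modified by accident):
# add_8bit_inputrc "${INPUTRC:-$HOME/.inputrc}"
```

Afterwards start a new bash and check the result with "bind -v".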
About nvi (which is the on-board vi for OpenBSD for example). It inspects LC_CTYPE which is not supported on OpenBSD, though. Additionally nvi knows an option to force printing of certain characters anyway. However most versions of nvi suffer from a bug and you need the following in ~/.nexrc or alike:
```
    set print="<printable characters>"
    set print=
    
```
where <printable characters> is just all the literal characters you want, e.g. äåâàáöôø...
It's fixed in OpenBSD 2.8.
On Solaris there's a fast way to confirm if LC_CTYPE works as expected: dumpcs(1) prints all printable characters.
Global system settings on Solaris are also found in /etc/default/init (and not in /etc/default/login).
On systems with dtlogin(1), see the according manpage, /usr/dt/config/ and /etc/dt/config/.
Global system settings are found, for example, in /etc/sysconfig/i18n on Red Hat Linux and in /etc/environment on SuSE Linux. Et cetera.
Your system might support "locale -kc LC_CTYPE", and tell you which characters are printable.
However, on several systems (Solaris 2.6 - Solaris 8, HP-UX 10.x, Irix6.5.10) its output is just wrong.
On several other systems it doesn't tell about printability (OSF1/V4.0, AIX 3, AIX 4).
I could only find useful implementations on Solaris2.5 and Linux Glibc2 so far.
For german-lang SuSE-Linux Users: "Console-Fonts haben keine Umlaute", pointed out by Heiko Schlenker (dead link, instead see the page on archive.org)

For german-lang OpenBSD 2.8 users with pcvt(4):

	From: Christian Weisgerber
	Newsgroups: de.comp.os.unix.bsd
	Subject: Re: Umlaute im vi (pcvt) ?
	Date: Mon, 26 Mar 2001 16:01:47 +0000 (UTC)
	Message-ID: <99np5b$1bup$1@kemoauc.mips.inka.de>
	(original link)

For Linux users: in Usenet, upgrades of some distributions have repeatedly been reported to break the locale setup. I don't know any details, but in that case you should reinstall the locale anyway. Verify your locale installation (see the example program above) if in doubt.
ksh93-l (2001-07-04) has problems with many locale values, as it tries to canonicalize a value before calling setlocale(3). No easy fix known yet. On many systems using a value like xx_YY (e.g. en_US) instead of xx_XX (fr_FR, de_DE) helps, as ksh93-l then tries the longer version instead of only fr or de. If your system only provides values with codeset (en_US.ISO8859-1, etc), then you've lost with this version.
The 2nd edition of ksh93-l (l+) (still called 2001-07-04) fixes this problem.
XFree86 Xlib calls setlocale(3), but with values in /usr/X11R6/lib/X11/locale/. If you - particularly after an upgrade - get a "Warning: locale not supported by Xlib, locale set to C", then have a look into locale.alias in that directory and adjust it if necessary. (Details from Olav Kvittem.)
Be careful about the exact version of libslang. Some Linux distributors apply patches for UTF-8 support, but these may then fail with simple 8-bit locales. (Pointed out by Thomas Schultz in <slrnagbqa1.1mp.tststs@starflower.tststs.ddns.org>.)
Some vendor specific documents, apart from the manpages:
- Sun: FAQ Locales (dead link, instead see the page on archive.org)
- Sun: FAQ Running Localized Sessions (dead link, instead see the page on archive.org)
- Sun: Solaris 8 I18N Whitepapers: Unicode Support (dead link, instead see the page on archive.org)
- Sun: Solaris 8 I18N Whitepapers: Euro Currency Support (dead link, instead see the page on archive.org)
- Sun: International Language Environment Guide, docs.sun.com: Solaris 8, pointed out by Rudolf Hommer (dead link, instead see the page on archive.org)
- Sun: Solaris Internationalization Guide For Developers, docs.sun.com: Solaris7, Solaris6, (dead links, instead see the pages on archive.org here and here)
- SGI: Internationalization FAQ (dead link, instead see the page on archive.org)
- HP-UX 10.01/.10/.20, 11.11/.22: /usr/lib/nls/README.nls.10.01
- Digital UNIX: Using Internationalization Features (dead link, instead see the page on archive.org)
From the Single Unix Specification, Version 2 (or Version 4):
- locale - internal realisation (or v4)
- Character Set - portable and control character set, wide-characters (or v4)
- setlocale(3) (or v4), isprint(3) (or v4), localedef(1) (or v4),
- Environment Variables (or v4),

Unicode:

Eventually, Unicode is the way to go for a really useful encoding (however the problem with locales would still remain to some degree). Don't miss the following:

Markus Kuhn's Unicode Site - HOWTO, his X11 fonts, and much more.
Jukka Korpela's tutorial on character code issues
Roman Czyborra's Unicode in the Unix Environment, featuring national character sets as well.

and

the Unicode CLDR (Common Locale Data Repository).
This project aims to provide XML-based cross-platform locale information.

With credits to: Jürgen Dollinger, Christian Weisgerber, Heiko Schlenker, Rudolf Hommer, Chris Green, Sven Guckes, Olav Kvittem, Thomas Schultz, Vincent Lefèvre and especially to Mark Glassberg and Alain Bench.

Your own experiences and other feedback are most welcome

comments to <mascheck@in-ulm.de>
started about 1999-11-18, last update 2016-04-24