Character Data Types


A character may be declared as follows.

char c;

Actually, there are three distinct character types: signed char, unsigned char and plain char. All three occupy the same amount of memory, namely one byte. The inbuilt C++ operator sizeof returns the size of an object or type measured in units of char, so sizeof(char) is 1 by definition rather than by accident of the architecture. What is architecturally dependent is the number of bits in that byte: it is at least 8, and is exactly 8 on the Intel architecture.
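The size relationships can be checked at compile time. The following is a minimal sketch; the helper name bits_per_char is an assumption made for this illustration, whereas CHAR_BIT is the standard macro from the header climits.

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>  // std::size_t

// sizeof measures in units of char, so each character type has size 1.
static_assert(sizeof(char) == 1, "char is the unit of sizeof");
static_assert(sizeof(signed char) == 1, "same size as char");
static_assert(sizeof(unsigned char) == 1, "same size as char");

// Hypothetical helper: the number of bits in one char on this platform.
constexpr std::size_t bits_per_char() { return CHAR_BIT; }
```

On most current platforms bits_per_char() would be expected to return 8, although the language only guarantees a value of at least 8.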

The distinction between signed and unsigned characters is significant when interpreting the results of arithmetic operations on characters and when extending characters to integer data types. If a character occupies 8 bits of storage, the numeric range for signed characters is -128 to 127 (assuming the usual two's-complement representation), and for unsigned characters is 0 to 255.
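A short sketch may make the effect of signedness on widening and arithmetic more concrete. All four function names are hypothetical and exist only for this illustration; the final result assumes an 8-bit char.

```cpp
// Hypothetical helpers, named for this illustration only.

// Widening to int preserves the value, so the sign of the source type matters.
int widen_signed(signed char c)     { return c; }
int widen_unsigned(unsigned char c) { return c; }

// Operands of + are promoted to int before the addition takes place.
int sum_as_int(unsigned char a, unsigned char b) { return a + b; }

// Converting the int result back to unsigned char reduces it
// modulo 2^CHAR_BIT (modulo 256 when a char has 8 bits).
unsigned char sum_as_uchar(unsigned char a, unsigned char b) {
    return static_cast<unsigned char>(a + b);
}
```

Here widen_signed(-1) yields -1 whereas widen_unsigned(255) yields 255, even though both arguments have every bit set on a two's-complement machine with 8-bit characters. Similarly, sum_as_int(200, 100) is 300, while sum_as_uchar(200, 100) wraps around to 44.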

Depending upon the environment, a char may be signed or unsigned by default. Two unsigned characters may be declared in a single statement as follows.

unsigned char a, b;
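Whether plain char is signed in a given environment can be queried rather than guessed. The sketch below uses std::numeric_limits from the standard header limits; the three function names are assumptions introduced for this example.

```cpp
#include <limits>

// Hypothetical helpers reporting the properties of plain char.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}
constexpr int plain_char_min() { return std::numeric_limits<char>::min(); }
constexpr int plain_char_max() { return std::numeric_limits<char>::max(); }
```

If plain_char_is_signed() returns true, plain_char_min() is negative; otherwise it is 0.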

Synonyms for Types

A mechanism exists for defining a synonym of a type name. For example, the type unsigned char may be abbreviated as follows:

typedef unsigned char uchar;

whence the previous declaration may be specified as follows.

uchar a, b;

Note that the typedef specifier does not create a new data type; rather, it assigns an alternative name to an existing type.
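That a typedef names an existing type, rather than creating a new one, can be confirmed with std::is_same from the standard header type_traits. This is a sketch only; it also illustrates the earlier point that plain char is distinct from both signed char and unsigned char.

```cpp
#include <type_traits>

typedef unsigned char uchar;

// uchar is merely another name for unsigned char.
static_assert(std::is_same<uchar, unsigned char>::value,
              "a typedef does not create a new type");

// By contrast, plain char is a type in its own right.
static_assert(!std::is_same<char, signed char>::value,
              "char differs from signed char");
static_assert(!std::is_same<char, unsigned char>::value,
              "char differs from unsigned char");
```

A consequence is that a function declared to take an unsigned char accepts a uchar without any conversion, and the two names cannot be used to overload a function.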

Character Sets

If a character occupies one byte (i.e. 8 bits) of memory, there are 256 possible bit combinations for the character. Standards exist for character representation, and one such standard is ASCII, which defines the first 128 codes (0x00-0x7f), referred to as Standard ASCII Codes. The remaining 128 codes (0x80-0xff) are not part of ASCII itself; they are commonly referred to as Extended ASCII Codes, and their meaning depends on the code page in use. One widely used assignment is the so-called ANSI character set of Windows, which includes accented characters for European languages. Unicode is today's standard for representing text that does not fit in a single byte. It is not a fixed double-byte scheme: each character is assigned a code point, which is stored using an encoding such as UTF-8, UTF-16 or UTF-32.

Character Constants

A character constant consists of one or more characters enclosed within single quotes. For example, 'A' is a character constant as is 'AB'. When a single character is enclosed within quotes the character constant is of type char. If multiple characters are enclosed within quotes, the character constant has data type int. The value of a multiple character constant is implementation dependent.
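The types stated above can be verified by the compiler. In the sketch below, note that multiple character constants are only conditionally supported, and that compilers such as GCC and Clang typically issue a warning for 'AB'.

```cpp
#include <type_traits>

// A single character constant has type char in C++.
static_assert(std::is_same<decltype('A'), char>::value,
              "'A' has type char");

// A multiple character constant has type int; its value is
// implementation dependent, so only its type is examined here.
static_assert(std::is_same<decltype('AB'), int>::value,
              "'AB' has type int");
```

It follows that sizeof('A') is 1, whereas sizeof('AB') equals sizeof(int).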

A character can be declared and initialized in a single step as follows:

char test = 'A';

in which case, if the ASCII character set is in use then the byte test will contain 0x41 or 65 decimal. For the ASCII character set, a character may be converted from upper to lower case as follows.

char difference = 'a' - 'A';      // 'a' = 0x61, 'A' = 0x41, difference = 0x20
char upper = 'L';                 // 'L' = 0x4c
char lower = upper + difference;  // 'l' = 0x6c = 0x4c + 0x20

To assist with program portability, enclosing a letter in a pair of single quotes is preferable to using a raw number, although the latter is permitted. If a different character set is in effect, say EBCDIC, then 'A' would be assigned the numeric value appropriate to that character set. Note that the above calculation may or may not work for character sets other than ASCII.
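Where independence from the character set is wanted, one option is to leave case conversion to the standard library instead of doing the arithmetic by hand. The following sketch wraps std::tolower from the header cctype; the wrapper name to_lower_portable is an assumption made for this example.

```cpp
#include <cctype>

// Hypothetical wrapper around std::tolower.
// The argument is converted to unsigned char first, because passing a
// negative value (other than EOF) to std::tolower is undefined.
char to_lower_portable(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}
```

In the default "C" locale, to_lower_portable('L') yields 'l', and characters that are not upper case letters are returned unchanged.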

Escape Characters

The question arises as to how to declare a character consisting of a single quote (without confusing the compiler). The answer is as follows:

char single_quote = '\'';

where the backslash character precedes a single quote and is used as an escape character. That is, the backslash is used to indicate that a special character is to follow, and that the backslash is combined with the special character to form a single character. There are a number of special characters for which escape sequences are defined. These are listed in the table that follows.

Meaning            Escape
New Line           \n
Horizontal Tab     \t
Vertical Tab       \v
Backspace          \b
Carriage Return    \r
Form Feed          \f
Alert              \a
Backslash          \\
Question Mark      \?
Single Quote       \'
Double Quote       \"
Null Character     \0
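The point that a backslash and the character following it together form a single character can be demonstrated briefly. This is a sketch; the function name escaped_length is hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Each escape sequence denotes exactly one character.
static_assert(sizeof('\n') == 1, "an escape is a single char");
static_assert('\0' == 0, "the null character has the numeric value zero");

// Hypothetical helper: the length of a string holding a tab and a backslash.
// The five characters are: a, horizontal tab, b, backslash, c.
std::size_t escaped_length() { return std::strlen("a\tb\\c"); }
```

Although the string literal is written with seven characters between the double quotes, escaped_length() returns 5.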

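In addition to these, a hexadecimal or octal number may be escaped. For example, '\x41' and '\101' are both equivalent to an ASCII 'A'. For hexadecimal escapes, the backslash and the letter x are followed by one or more hex digits, with no upper limit on their number. The first non-hex character terminates the escape sequence. For example, "\x69\x2b+" is "i++" in ASCII, because the second backslash ends the first escape and the + ends the second. By contrast, "\x692b+" would be read as the single escape \x692b, which is out of range for a char, followed by +. For octal escapes, the backslash is followed by one, two or three octal digits. The first non-octal character, or a fourth digit, terminates an octal character specification.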
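The termination rules for numeric escapes are easy to get wrong, so a small sketch may help. The function names are hypothetical, and the comparisons with letters assume that the ASCII character set is in use.

```cpp
#include <cstring>

// Hex 41 and octal 101 both denote the value 65.
static_assert('\x41' == '\101', "the same number written in two bases");

// Assumes ASCII: 0x69 is 'i' and 0x2b is '+'. Each hex escape is ended
// by the first character that is not a hex digit.
bool hex_escapes_spell_i_plus_plus() {
    return std::strcmp("\x69\x2b+", "i++") == 0;
}

// Assumes ASCII: an octal escape takes at most three digits, so "\1012"
// is '\101' followed by the ordinary character '2'.
bool octal_stops_after_three_digits() {
    return std::strcmp("\1012", "A2") == 0;
}

// Assumes ASCII: adjacent string literals are joined only after escapes
// have been processed, which allows a hex escape to be followed by a
// character that happens to be a hex digit.
bool split_literal_ends_hex_escape() {
    return std::strcmp("\x41" "B", "AB") == 0;
}
```

Under the ASCII assumption each of the three functions returns true.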

A character constant that is prefixed with the letter L (e.g. L'A') is treated as a wide character constant. A wide character constant has type wchar_t. In C++, wchar_t is a distinct built-in type and requires no header; in C it is instead a synonym for an integer type, defined in the header stddef.h.
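The type of a wide character constant can be checked in the same manner as for ordinary character constants. The following is a sketch only; the size of wchar_t varies between platforms, so no particular size is assumed.

```cpp
#include <type_traits>

// In C++ the constant L'A' has the built-in type wchar_t.
static_assert(std::is_same<decltype(L'A'), wchar_t>::value,
              "L'A' has type wchar_t");

// wchar_t is a type of its own, distinct from plain char.
static_assert(!std::is_same<wchar_t, char>::value,
              "wchar_t differs from char");
```

Consequently sizeof(L'A') equals sizeof(wchar_t), which is commonly 2 or 4 depending on the platform.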