Full Sail: Power User Tips
Creating Web Pages With Character(s)
Character Entities
by
Norman L. De Forest
Beacon Correspondent
If you have used the information in
Part 1 of this series on Characters
On the Web, you can now view web pages with accented letters and other
characters without seeing something entirely inappropriate (such as an
upper-case Sigma for the ISO-8859-1 character 228, 'ä',
'a' with an umlaut). You might want to incorporate such characters
in your web page (using "résumé" (with e
acute) instead of "resume" or
"leçon" (with c cedilla) instead of
"lecon"). Or you might just want to display the
characters '"', '&', '<', and
'>' and find that they don't (always) get displayed properly.
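The Sigma mix-up described above can be reproduced by decoding one byte value under two character sets. This is only a sketch in Python, assuming the "PC character set" is code page 437; the codec names are Python's own and do not come from this article.

```python
# Decode the same byte value under two different character sets.
BYTE = bytes([228])

as_latin1 = BYTE.decode("latin-1")  # ISO-8859-1: 'a' with an umlaut
as_cp437 = BYTE.decode("cp437")     # original IBM PC set: upper-case Sigma

# ascii() keeps the output printable on any terminal.
print(ascii(as_latin1), ascii(as_cp437))
```

The byte never changes; only the table used to interpret it does.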
HTML has a standard method of specifying such characters so that they
are properly recognised by browsers that support them. This is done by
using "entities" to specify the characters. An entity begins
with the '&' (ampersand) character and ends with a semicolon,
';'. In some cases, the semicolon may be omitted but it is always
correct (and safer) to include it. Between the ampersand and the semicolon
you can include one of three things:
- The decimal value of the character in the Unicode standard, preceded by
'#', the number sign. This is identical to the ISO-8859-1 character
set for those characters in the range of 32 to 126 and 160 to 255. An example
is &#231; for the character 'ç' (c cedilla).
- The hexadecimal value of the character, preceded by "#x".
Currently, Lynx supports this format (which eliminates the necessity of
converting from the Unicode standard, specified in hexadecimal, to a decimal
numeric value) but most graphical browsers do not, although it is in
the HTML standard. Examples are &#xE7;, 'ç'
(c cedilla again) and &#x30AB;, 'カ'
(the Katakana syllabic character 'Ka') which would otherwise have
to be converted to the decimal entities &#231; and
&#12459;.
- The standard name for the character if there is one (not all characters
have standard HTML entity names). Our persistent friend, c cedilla,
can be expressed as &ccedil;, 'ç'. Unlike most
HTML tags, the entity names are case-sensitive: &Ograve;
('Ò') and &ograve; ('ò') are
two different characters. (I would have used &Ccedil;
and &ccedil; as examples but some programs cannot
display character 128, which is what &Ccedil; would be translated
to if your system uses the PC character set or code page 850. They treat it
as an unprintable NUL character with the high bit set.) The appropriate entity
names can also be used to display the quote, &quot;
("), the ampersand, &amp; (&), the
less-than sign, &lt; (<), and the greater-than sign,
&gt; (>), because the literal use of these
characters could be misinterpreted as special HTML code.
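The three forms listed above can be generated mechanically. The following is a minimal sketch, assuming Python's standard `html.entities.codepoint2name` table (which covers the HTML 4 names) as the source of entity names; the function name `entity_forms` is made up for illustration.

```python
from html.entities import codepoint2name

def entity_forms(ch):
    """Return the decimal, hexadecimal, and named forms for one character."""
    cp = ord(ch)
    name = codepoint2name.get(cp)  # None when the table defines no name
    return {
        "decimal": f"&#{cp};",
        "hex": f"&#x{cp:X};",
        "named": f"&{name};" if name else None,
    }

# c cedilla has all three forms; Katakana 'Ka' (U+30AB) has no name.
print(entity_forms("\u00e7"))
print(entity_forms("\u30ab"))
```

Because the lookup is by code point, upper-case and lower-case letters get different names, which matches the case-sensitivity noted above.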
All of the ISO-8859-1 characters are supported by browsers as numeric codes
and &amp;, &lt;, and &gt; are also
recognised by all browsers as the representations for &, <, and >.
The &quot; entity for the double-quote character, ", should
be recognised by all browsers but that entity name was
inadvertently left out of the HTML 3.2 standard and
browsers do not have to recognise this entity to be certified as conforming
to that standard. For this reason, the numeric entity &#34; is
slightly safer to use than &quot; but, fortunately, neither is
necessary except in circumstances where the use of a literal double-quote,
", might be misinterpreted as a string delimiter. Such a case is when the
double-quote is part of the ALT (alternate) text for an image:
<IMG SRC="rover.gif" ALT="[A picture of Rover saying &#34;Arf!&#34;]">
An alternative is to use single quotes to delimit the ALT string and
then use embedded double quotes within the string:
<IMG SRC="rover.gif" ALT='[A picture of Rover saying "Arf!"]'>
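When pages are generated by a program, the quoting can be done automatically. This sketch assumes Python's standard `html.escape`; the helper name `img_tag` is hypothetical, and the swap to the numeric form simply follows the "slightly safer" advice above.

```python
from html import escape

def img_tag(src, alt):
    """Build an IMG tag whose ALT text is safe inside double quotes."""
    # escape() converts & < > and, with quote=True, the quote characters.
    safe_alt = escape(alt, quote=True)
    # Swap the named form for the numeric one recommended in the text.
    safe_alt = safe_alt.replace("&quot;", "&#34;")
    return f'<IMG SRC="{escape(src, quote=True)}" ALT="{safe_alt}">'

print(img_tag("rover.gif", '[A picture of Rover saying "Arf!"]'))
```

Escaping the attribute value means the author never has to decide by hand which delimiter is safe.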
Most browsers also recognise the entity names for the upper characters
in the ISO-8859-1 character set (see the
chart in the last Beacon).
In addition to those, the Lynx browser recognises almost all of the defined
entity names (and numbers) for general symbols, math symbols, and Greek
letters for mathematical use and will provide a substitute for them. Many
other non-Latin characters are also recognised. Unfortunately, unless
someone has a special multilingual version of Netscape Navigator or Internet Explorer
that can handle Unicode characters, those special characters will just be
displayed as '?' or mis-displayed by some browsers by ignoring all but the
lower eight bits of the character.
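To see what "ignoring all but the lower eight bits" does to a character, here is a small Python sketch. The function name is invented for illustration and models only the truncation itself, not any particular browser.

```python
def truncate_to_eight_bits(ch):
    """Mimic a browser that keeps only the low eight bits of a code point."""
    return chr(ord(ch) & 0xFF)

# Katakana 'Ka' is U+30AB; dropping the high bits leaves 0xAB.
shown = truncate_to_eight_bits("\u30ab")
print(hex(ord(shown)))
```

In ISO-8859-1, 0xAB is the left-pointing double angle quotation mark, so the reader sees a character that has nothing to do with the one intended.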
Some Special Cases.
You may be wondering why there are two different spaces and two different
hyphens in the ISO-8859-1 character set. The reason for this is that it is
sometimes necessary to control where the browser will break a line.
Except in preformatted text, the location of a regular space is fair
game for the browser to break a line. There are times when some text needs
a space but it is undesirable to have the line break at this point. If I
included my name, Norman De Forest, in a paragraph and used all ordinary
spaces, it is quite possible that the "De Forest" part could be split up
with "De" at the end of a line and the "Forest" at the beginning of the
next line. If I wished to prevent this, I could use the non-breaking space,
&#160; or &nbsp;, and include my name in the text as
"Norman De&nbsp;Forest" and the browser would not break up the "De" and
the "Forest" but would, instead, keep them together.
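If a phrase has to stay together everywhere it appears, the substitution can be scripted. A minimal sketch in Python; `keep_together` is a hypothetical helper, and it assumes the phrase appears as plain text rather than inside a tag or attribute.

```python
def keep_together(text, phrase):
    """Replace the spaces inside `phrase` with &nbsp; wherever it occurs."""
    return text.replace(phrase, phrase.replace(" ", "&nbsp;"))

print(keep_together("My name is Norman De Forest.", "De Forest"))
```

Only the space inside the phrase is changed; the ordinary spaces around it remain available as break points.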
The ordinary hyphen or minus sign, '-' is treated the same as any other
non-space character in a word. There are times when you may prefer a long
word to be broken between syllables if it would otherwise go past the edge
of the screen instead of wrapping the line at the beginning of the word.
This may be necessary to prevent the display of a pathologically-short line
just before the long word. If the word is split across two lines, you
would like to have a hyphen at the end of the first part to show the word is
continued on the next line. Otherwise, you don't want any hyphens visible.
For that you can use the
soft hyphen,
&#173; or &shy;, to indicate where a word may be broken. With a
conforming browser the hyphen is invisible unless the browser finds
it necessary to break up the word. Then, and only then, a
hyphen will be displayed. (Unfortunately, both Netscape Navigator and
Internet Explorer
fail to conform
to the HTML standards.) This:
<p>
A very long imaginary word is the word from "Mary Poppins",
"super&shy;calli&shy;fragilistic&shy;expiali&shy;docious".
If you say it six times, it sounds like
"super&shy;calli&shy;fragilistic&shy;expiali&shy;docious,
super&shy;calli&shy;fragilistic&shy;expiali&shy;docious,
super&shy;calli&shy;fragilistic&shy;expiali&shy;docious,
super&shy;calli&shy;fragilistic&shy;expiali&shy;docious,
super&shy;calli&shy;fragilistic&shy;expiali&shy;docious,
super&shy;calli&shy;fragilistic&shy;expiali&shy;docious."
</p>
displays as this with your browser:
A very long imaginary word is the word from "Mary Poppins",
"supercallifragilisticexpialidocious".
If you say it six times, it sounds like
"supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious."
Here's how it looks with Lynx (which conforms to the standards):
Notice how the "word" was hyphenated only where it was broken?
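Typing the soft hyphen entity by hand at every break point is tedious, so it may help to build the word from its fragments. A sketch in Python, assuming you supply the permitted break points yourself (nothing here works out syllables); `hyphenate` is an illustrative name.

```python
from html import unescape

SOFT_HYPHEN = "&shy;"

def hyphenate(parts):
    """Join word fragments with soft hyphens at the permitted break points."""
    return SOFT_HYPHEN.join(parts)

word = hyphenate(["super", "calli", "fragilistic", "expiali", "docious"])
print(word)

# Decoding the entity and dropping U+00AD gives back the plain word.
plain = unescape(word).replace("\u00ad", "")
```

The plain form is what a conforming browser would show when no line break is needed.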
Sometimes you may want a word that is hyphenated to be broken at the
hyphen but don't want the hyphen to disappear when the line is not broken. Or you may want a URL
to be broken between directories and not in the middle of a directory
name but you don't want any hyphen to be displayed. Another invisible
character to the rescue. It is a character that never gets displayed.
It merely gives the browser permission to break a line at that point, even
if it is in the middle of a word. The character is called a "zero-width
non-joiner" and can be represented by &#8204;. Let's see what that
word from "Mary Poppins" looks like when we use the zero-width non-joiner:
<p>
A very long imaginary word is the word from "Mary Poppins",
"super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious".
If you say it six times, it sounds like
"super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious."
</p>
displays as this on your browser:
A very long imaginary word is the word from "Mary Poppins",
"supercallifragilisticexpialidocious".
If you say it six times, it sounds like
"supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious,
supercallifragilisticexpialidocious."
Here's how it looks with Lynx (conforming to the standards):
Notice how the word is broken only where permission was given but
this time, if you are using Lynx, no hyphen was displayed?
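For the URL case mentioned above, the break hint can be inserted after each directory separator. This Python sketch simply follows the article's approach of using the zero-width non-joiner as the hint; `breakable_path` and the sample path are made up for illustration, and how a given browser treats the character is not something the code can guarantee.

```python
BREAK_HINT = "&#8204;"  # the zero-width non-joiner entity named in the article

def breakable_path(path):
    """Insert the break hint after each '/' that separates two segments."""
    return ("/" + BREAK_HINT).join(path.split("/"))

print(breakable_path("docs/guides/entities.html"))
```

Directory names themselves are left untouched, so a break can only fall between them.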
Those pesky Windows characters!
Lynx users, don't you just hate it when you visit a site and
see text like:
A spokesman for the penguins said, “We don’t mind Linux
users using a picture of us as their logo. We just want our cut.”
"But the ASCII quotes look so horrible. We can't use
those," cry the webmasters. Perhaps you could tell them about the
perfectly valid representations for those Windows characters that are not
part of the ISO-8859-1 character set. All of the defined Windows
characters have a Unicode value assigned and those Unicode characters are
supported by virtually all modern browsers -- including Lynx.
(Some older versions of Lynx, Netscape Navigator, and Internet Explorer do not
recognise the Unicode characters and the newest version of Netscape Navigator can
have problems if fonts are not installed correctly.) By
using the Unicode entities, the page looks the same with the latest Windows-based
graphical browsers (when they are properly configured and
fonts are installed correctly) as it did before, but the characters are recognised
by Lynx and displayed with your computer's character set, making
substitutions if necessary. Here is the same text, first using the
invalid Windows characters, then using Unicode, and then again using plain
ASCII. The text in the first two may look the same with graphical
browsers but there will be a major difference in appearance when Lynx is
used:
A spokesman for the penguins said, We dont mind Linux users
using a picture of us as their logo. We just want our cut.
A spokesman for the penguins said, “We don’t mind Linux users
using a picture of us as their logo. We just want our cut.”
A spokesman for the penguins said, "We don't mind Linux users
using a picture of us as their logo. We just want our cut."
This is what the three paragraphs above look like with Lynx when you
have the default IBM character set or have an ISO-8859-1 font loaded into
your VGA card and have Lynx configured to know which character set you
are using:
This is what the three paragraphs above look like with Lynx when you
have a cp1252 font loaded into your VGA card and have Lynx configured to
know you are using the cp1252 font:
Here is a list of the preferred substitutes for Windows cp1252
(code page 1252) characters that are not in the ISO-8859-1 character
set -- in cp1252 order. (Shown in single quotes except where
parentheses are clearer -- such as when quoting quotation marks.)
- 0x80 '€' -- instead, use '&#8364;' or '&euro;' -- euro sign, NEW [not recognised by Lynx yet]
- 0x81 -- undefined
- 0x82 (‚) -- instead, use (&#8218;) or (&sbquo;) -- single low-9 quotation mark, NEW
- 0x83 'ƒ' -- instead, use '&#402;' or '&fnof;' -- latin small f with hook, florin
- 0x84 („) -- instead, use (&#8222;) or (&bdquo;) -- double low-9 quotation mark, NEW
- 0x85 '…' -- instead, use '&#8230;' or '&hellip;' -- horizontal ellipsis = three dot leader
- 0x86 '†' -- instead, use '&#8224;' or '&dagger;' -- dagger
- 0x87 '‡' -- instead, use '&#8225;' or '&Dagger;' -- double dagger
- 0x88 'ˆ' -- instead, use '&#710;' or '&circ;' -- modifier letter circumflex accent
- 0x89 '‰' -- instead, use '&#8240;' or '&permil;' -- per mille sign
- 0x8A 'Š' -- instead, use '&#352;' or '&Scaron;' -- latin capital letter S with caron
- 0x8B '‹' -- instead, use '&lt;' ('<') for now, or '&#8249;' or '&lsaquo;' -- single left-pointing angle quotation mark (ISO proposed but not yet ISO standardized)
- 0x8C 'Œ' -- instead, use '&#338;' or '&OElig;' -- latin capital ligature OE
- 0x8D -- undefined
- 0x8E 'Ž' -- instead, use '&#381;' or '&Zcaron;' -- latin capital letter Z with caron
- 0x8F -- undefined
- 0x90 -- undefined
- 0x91 (‘) -- instead, use (&#8216;) or (&lsquo;) -- left single quotation mark
- 0x92 (’) -- instead, use (&#8217;) or (&rsquo;) -- right single quotation mark
- 0x93 (“) -- instead, use (&#8220;) or (&ldquo;) -- left double quotation mark
- 0x94 (”) -- instead, use (&#8221;) or (&rdquo;) -- right double quotation mark
- 0x95 '•' -- instead, use '&#8226;' or '&bull;' -- bullet = black small circle
- 0x96 '–' -- instead, use '&#8211;' or '&ndash;' -- en dash
- 0x97 '—' -- instead, use '&#8212;' or '&mdash;' -- em dash
- 0x98 '˜' -- instead, use '&#732;' or '&tilde;' -- small tilde
- 0x99 '™' -- instead, use '&#8482;' or '&trade;' -- trade mark sign
- 0x9A 'š' -- instead, use '&#353;' or '&scaron;' -- latin small letter s with caron
- 0x9B '›' -- instead, use '&gt;' ('>') for now, or '&#8250;' or '&rsaquo;' -- single right-pointing angle quotation mark (ISO proposed but not yet ISO standardized)
- 0x9C 'œ' -- instead, use '&#339;' or '&oelig;' -- latin small ligature oe (ligature is a misnomer, this is a separate character in some languages)
- 0x9D -- undefined
- 0x9E 'ž' -- instead, use '&#382;' or '&zcaron;' -- latin small letter z with caron
- 0x9F 'Ÿ' -- instead, use '&#376;' or '&Yuml;' -- latin capital letter Y with diaeresis
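Rather than looking up each substitute by hand, a page saved with Windows characters can be converted in one pass. This is a sketch only, assuming Python's built-in `cp1252` codec and its `xmlcharrefreplace` error handler; `cp1252_to_entities` is an illustrative name, and the function does not escape '&', '<', or '>'.

```python
def cp1252_to_entities(raw):
    """Turn Windows cp1252 bytes into ASCII text with numeric entities."""
    # Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252;
    # errors="replace" turns them into U+FFFD instead of raising.
    text = raw.decode("cp1252", errors="replace")
    # xmlcharrefreplace writes every non-ASCII character as &#NNNN;.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(cp1252_to_entities(b"\x93We don\x92t mind\x94"))
```

Note that this also turns the upper ISO-8859-1 characters (160 to 255) into decimal entities, which is harmless since the values are identical in Unicode.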
Typing special characters into email
I am not sure what changed with the last system upgrade to affect this
but it is no longer possible to type high characters directly into email.
It is possible to include them into messages, however. There are two
methods.
- The hard way:
- Type the characters into a file on your computer, upload the file
to CCN, and use Control-R, Control-T to select the file and press ENTER
to paste the file into your document.
- The easy way:
- While editing a document, an email message, or a newsposting with the
PINE or PICO editor, press your ESCAPE key (it may be marked "Esc") twice
and then type the three-digit decimal numeric value of the character with
the top row of number keys on your keyboard. Last month's chart and the
chart above will tell you what number to use for each character. For
example, Esc Esc 2 2 5 (without the spaces) will generate character 225,
'á', an 'a' with an acute accent.
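If you do not have the charts at hand, the three-digit value can be worked out from the character itself. A small Python sketch, assuming the character is in ISO-8859-1; `pine_keystrokes` is a made-up helper that only formats the key sequence described above and does not talk to PINE or PICO.

```python
def pine_keystrokes(ch):
    """Return the Esc Esc sequence for one ISO-8859-1 character."""
    code = ch.encode("latin-1")[0]  # raises UnicodeEncodeError outside Latin-1
    # Pad to three digits and space them out as in the example above.
    return "Esc Esc " + " ".join(f"{code:03d}")

print(pine_keystrokes("\u00e1"))
```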
Some additional references:
- ISO 8859 Alphabet Soup
- ISO-8859 briefing and resources
- Notes on HTML Internationalisation (i18n)
- escape notation for double-quote
- demoroniser - correct moronic and gratuitously incompatible Microsoft HTML
- Using national and special characters in HTML
You may direct comments or suggestions about this column to:
Norman L. De Forest,
af380@chebucto.ns.ca