Full Sail: Power User Tips

Creating Web Pages
With Character(s)

Character Entities

by
Norman L. De Forest
Beacon Correspondent

If you have used the information in Part 1 of this series on Characters On the Web, you can now view web pages with accented letters and other characters without seeing something entirely inappropriate (such as an upper-case Sigma for the ISO-8859-1 character 228, 'ä', 'a' with an umlaut). You might want to incorporate such characters in your web page (using "résumé" (with e acute) instead of "resume" or "leçon" (with c cedilla) instead of "lecon"). Or you might just want to display the characters '"', '&', '<', and '>' and find that they don't (always) get displayed properly.
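To see where that stray Sigma comes from, here is a small sketch in Python (my choice of language for illustration only). It assumes the misbehaving display is using the original IBM PC character set, code page 437, rather than ISO-8859-1.

```python
# Byte value 228 means different things in different character sets.
raw = bytes([228])

as_latin1 = raw.decode("latin-1")  # ISO-8859-1: 'a' with an umlaut
as_cp437 = raw.decode("cp437")     # IBM PC code page 437: upper-case Sigma

# Print the Unicode values rather than the characters themselves,
# so the output is safe on any terminal.
print(ord(as_latin1), ord(as_cp437))
```

The same byte is read as two unrelated characters, which is why the browser needs to know which character set is in use.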

HTML has a standard method of specifying such characters so that they are properly recognised by browsers that support them. This is done by using "entities" to specify the characters. An entity begins with the '&' (ampersand) character and ends with a semicolon, ';'. In some cases, the semicolon may be omitted but it is always correct (and safer) to include it. Between the ampersand and the semicolon you can include one of three things:

The decimal value of the character in the Unicode standard, preceded by '#', the number sign. This is identical to the ISO-8859-1 character set for those characters in the range of 32 to 126 and 160 to 255. An example is &#231; for the character 'ç' (c cedilla).
The hexadecimal value of the character, preceded by "#x". Currently, Lynx supports this format (which eliminates the necessity of converting from the Unicode standard, specified in hexadecimal, to a decimal numeric value) but most graphical browsers do not, although it is in the HTML standard. Examples are &#xE7;, 'ç' (c cedilla again) and &#x30AB;, 'カ' (the Katakana syllabic character 'Ka') which would otherwise have to be converted to the decimal entities &#231; and &#12459;.
The standard name for the character if there is one (not all characters have standard HTML entity names). Our persistent friend, c cedilla, can be expressed as &ccedil;, 'ç'. Unlike most HTML tags, the entity names are case-sensitive: &Ograve; ('Ò') and &ograve; ('ò') are two different characters. (I would have used &Ccedil; and &ccedil; as examples but some programs cannot display character 128, which is what &Ccedil; would be translated to if your system uses the PC character set or code page 850. They treat it as an unprintable NUL character with the high bit set.) The appropriate entity names can also be used to display the quote, &quot; ("), the ampersand, &amp; (&), the less-than sign, &lt; (<), and the greater-than sign, &gt; (>), because the literal use of these characters could be misinterpreted as special HTML code.
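The three forms can be checked against each other with a short Python sketch. It uses the standard library's html.unescape function as a stand-in for a browser that supports all three forms; that is an assumption for illustration, not a description of any particular browser.

```python
import html

# Three spellings of the same character, c cedilla (Unicode 231, hex E7).
decimal_form = html.unescape("&#231;")
hex_form = html.unescape("&#xE7;")
named_form = html.unescape("&ccedil;")

# Converting a hexadecimal Unicode value to a decimal entity by hand.
katakana_ka_decimal = int("30AB", 16)
decimal_entity = f"&#{katakana_ka_decimal};"

# Entity names are case-sensitive: these are two different characters.
upper_o_grave = html.unescape("&Ograve;")
lower_o_grave = html.unescape("&ograve;")

print(decimal_entity, ord(upper_o_grave), ord(lower_o_grave))
```

The hexadecimal form saves the int("30AB", 16) step, which is the conversion you would otherwise do by hand.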

All of the ISO-8859-1 characters are supported by browsers as numeric codes, and &amp;, &lt;, and &gt; are also recognised by all browsers as the representations for &, <, and >. The &quot; entity for the double-quote character, ", should be recognised by all browsers but that entity name was inadvertently left out of the HTML 3.2 standard and browsers do not have to recognise this entity to be certified as conforming to that standard. For this reason, the numeric entity &#34; is slightly safer to use than &quot; but, fortunately, neither is necessary except in circumstances where the use of a literal double-quote, ", might be misinterpreted as a string delimiter. Such a case is when the double-quote is part of the ALT (alternate) text for an image, where &quot; or &#34; stands in for each embedded double-quote.

An alternative is to use single quotes to delimit the ALT string and then use embedded double quotes within the string.
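The ALT approaches described above can be sketched in Python. The file name beacon.gif, the ALT text, and the helper names are made up for illustration; html.escape is the standard library function that writes &quot; for a double quote.

```python
import html

def img_tag(src, alt):
    """Build an IMG tag whose ALT text may contain double quotes."""
    # html.escape replaces &, < and >; with quote=True it also replaces
    # " (as &quot;) and ' (as &#x27;).
    safe_src = html.escape(src, quote=True)
    safe_alt = html.escape(alt, quote=True)
    return f'<img src="{safe_src}" alt="{safe_alt}">'

def img_tag_numeric(src, alt):
    """Same idea, using the numeric entity &#34; for the double quote."""
    # The ampersand must be replaced first, or the later entities
    # would have their own ampersands escaped.
    safe = alt.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    safe = safe.replace('"', "&#34;")
    return f'<img src="{src}" alt="{safe}">'

named = img_tag("beacon.gif", 'The "Beacon" logo')
numeric = img_tag_numeric("beacon.gif", 'The "Beacon" logo')

# The single-quote alternative needs no entities at all, provided the
# ALT text itself contains no single quote.
single_quoted = "<img src='beacon.gif' alt='The \"Beacon\" logo'>"

print(named)
print(numeric)
print(single_quoted)
```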

Most browsers also recognise the entity names for the upper characters in the ISO-8859-1 character set (see the chart in the last Beacon).

In addition to those, the Lynx browser recognises almost all of the defined entity names (and numbers) for general symbols, math symbols, and Greek letters for mathematical use and will provide a substitute for them. Many other non-Latin characters are also recognised. Unfortunately, unless someone has a special multilingual version of Netscape Navigator or Internet Explorer that can handle Unicode characters, those special characters will just be displayed as '?' or mis-displayed by some browsers by ignoring all but the lower eight bits of the character.
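The two failure modes just mentioned can be imitated in a few lines of Python. This is only a model of what such a browser might do, using the Katakana 'Ka' from earlier as the test character.

```python
katakana_ka = "\u30ab"   # Unicode 12459 (hex 30AB)

# A browser that cannot show the character may substitute '?'.
substituted = katakana_ka.encode("latin-1", errors="replace").decode("latin-1")

# A browser that keeps only the lower eight bits shows an unrelated
# ISO-8859-1 character instead: hex 30AB becomes hex AB.
truncated = chr(ord(katakana_ka) & 0xFF)

print(substituted, hex(ord(truncated)))
```

Here the truncated value, hex AB, is the left-pointing double angle quotation mark, which has nothing to do with the character the author intended.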

Some Special Cases.

You may be wondering why there are two different spaces and two different hyphens in the ISO-8859-1 character set. The reason for this is that it is sometimes necessary to control where the browser will break a line.

Except in preformatted text, the location of a regular space is fair game for the browser to break a line. There are times when some text needs a space but it is undesirable to have the line break at this point. If I included my name, Norman De Forest, in a paragraph and used all ordinary spaces, it is quite possible that the "De Forest" part could be split up with "De" at the end of a line and the "Forest" at the beginning of the next line. If I wished to prevent this, I could use the non-breaking space, &#160; or &nbsp;, and include my name in the text as "Norman De&nbsp;Forest". The browser would then not break up the "De" and the "Forest" but would, instead, keep them together.
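Python's textwrap module happens to treat the non-breaking space the same way, so it can serve as a rough model of the browser's line breaking. The 20-column width and the sample phrase are assumptions chosen so that the break falls inside the name.

```python
import html
import textwrap

NBSP = html.unescape("&nbsp;")   # the same character as &#160;

plain = "written by Norman De Forest"
protected = "written by Norman De" + NBSP + "Forest"

# textwrap only breaks lines at ordinary ASCII whitespace,
# so the non-breaking space holds "De" and "Forest" together.
plain_lines = textwrap.wrap(plain, width=20)
protected_lines = textwrap.wrap(protected, width=20)

print(len(plain_lines[0]), len(protected_lines[0]))
```

With ordinary spaces the first line ends in "De"; with the non-breaking space the whole of "De Forest" moves to the second line.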

The ordinary hyphen or minus sign, '-', is treated the same as any other non-space character in a word. There are times when you may prefer a long word to be broken between syllables if it would otherwise go past the edge of the screen, instead of having the line wrap at the beginning of the word. This may be necessary to prevent the display of a pathologically short line just before the long word. If the word is split across two lines, you would like to have a hyphen at the end of the first part to show the word is continued on the next line. Otherwise, you don't want any hyphens visible. For that you can use the soft hyphen, &#173; or &shy;, to indicate where a word may be broken. With a conforming browser the hyphen is invisible unless the browser finds it necessary to break up the word. Then, and only then, a hyphen will be displayed. (Unfortunately, both Netscape Navigator and Internet Explorer fail to conform to the HTML standards.) This:

<p>
A very long imaginary word is the word from "Mary Poppins", 
"super&#173;calli&#173;fragilistic&#173;expiali&#173;docious".
If you say it six times, it sounds like
"super&#173;calli&#173;fragilistic&#173;expiali&#173;docious,
super&#173;calli&#173;fragilistic&#173;expiali&#173;docious,
super&#173;calli&#173;fragilistic&#173;expiali&#173;docious,
super&#173;calli&#173;fragilistic&#173;expiali&#173;docious,
super&#173;calli&#173;fragilistic&#173;expiali&#173;docious,
super&#173;calli&#173;fragilistic&#173;expiali&#173;docious."
</p>

displays as this with your browser:

A very long imaginary word is the word from "Mary Poppins", "supercallifragilisticexpialidocious". If you say it six times, it sounds like "supercallifragilisticexpialidocious, supercallifragilisticexpialidocious, supercallifragilisticexpialidocious, supercallifragilisticexpialidocious, supercallifragilisticexpialidocious, supercallifragilisticexpialidocious."

Here's how it looks with Lynx (which conforms to the standards):

[ A screen snapshot of the paragraph above. ]

Notice how the "word" was hyphenated only where it was broken?
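For anyone generating such markup with a program, here is a rough Python sketch. The function names are my own, and break_at_soft_hyphen is only a toy imitation of what a conforming browser does, not how Lynx is actually written.

```python
import html

SHY = html.unescape("&shy;")   # the soft hyphen, the same as &#173;

def with_soft_hyphens(syllables):
    """Join syllables with the numeric soft-hyphen entity for use in HTML."""
    return "&#173;".join(syllables)

def break_at_soft_hyphen(word, room):
    """Split word at the last soft hyphen that lets the first part, plus a
    visible hyphen, fit in `room` columns. Returns (first_part, rest), with
    first_part empty if no break point fits. For simplicity the soft
    hyphens remaining in `rest` are dropped."""
    pieces = word.split(SHY)
    keep = 0
    for count in range(1, len(pieces)):
        if len("".join(pieces[:count])) + 1 <= room:
            keep = count
    if keep == 0:
        return "", "".join(pieces)
    return "".join(pieces[:keep]) + "-", "".join(pieces[keep:])

syllables = ["super", "calli", "fragilistic", "expiali", "docious"]
markup = with_soft_hyphens(syllables)
first, rest = break_at_soft_hyphen(html.unescape(markup), room=12)

print(markup)
print(first)
print(rest)
```

With only 12 columns left on the line, the word is split after "supercalli" and a single visible hyphen appears at the break.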

Sometimes you may want a hyphenated word to be broken at its hyphen, which must of course stay visible. Or you may want a URL to be broken between directories and not in the middle of a directory name but you don't want any hyphen to be displayed. Another invisible character to the rescue. It is a character that never gets displayed. It merely gives the browser permission to break a line at that point, even if it is in the middle of a word. The character is called a "zero-width non-joiner" and can be represented by &#8204;. (Strictly speaking, Unicode defines the non-joiner to keep adjacent letters from joining in cursive scripts, and gives the job of invisible break point to the zero-width space, &#8203;, so browsers other than Lynx may not break a line at &#8204;.) Let's see what that word from "Mary Poppins" looks like when we use the zero-width non-joiner:

<p>
A very long imaginary word is the word from "Mary Poppins", 
"super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious".
If you say it six times, it sounds like
"super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious,
super&#8204;calli&#8204;fragilistic&#8204;expiali&#8204;docious."
</p>

displays as this on your browser:

A very long imaginary word is the word from "Mary Poppins", "super‌calli‌fragilistic‌expiali‌docious". If you say it six times, it sounds like "super‌calli‌fragilistic‌expiali‌docious, super‌calli‌fragilistic‌expiali‌docious, super‌calli‌fragilistic‌expiali‌docious, super‌calli‌fragilistic‌expiali‌docious, super‌calli‌fragilistic‌expiali‌docious, super‌calli‌fragilistic‌expiali‌docious."

Here's how it looks with Lynx (conforming to the standards):

Notice how the word is broken only where permission was given but this time, if you are using Lynx, no hyphen was displayed?
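The URL case mentioned earlier can be sketched the same way in Python. The address below is a made-up example and breakable_url is a hypothetical helper name; the sketch only inserts the entity and makes no claim about which browsers will honour it.

```python
import html
import re

ZWNJ_ENTITY = "&#8204;"

def breakable_url(url):
    """Permit a line break after each '/' that is followed by more text."""
    # The lookahead skips the first slash of "//" so that pair stays together.
    return re.sub(r"/(?=[^/])", "/" + ZWNJ_ENTITY, url)

marked = breakable_url("http://www.example.org/dir/page.html")

# Removing the invisible character gives back the original address,
# which is what the reader sees when no break is needed.
visible = html.unescape(marked).replace("\u200c", "")

print(marked)
print(visible)
```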

Those pesky Windows characters!

Lynx users, don't you just hate it when you visit a site and see text like:

A spokesman for the penguins said, We dont mind Linux users using a picture of us as their logo. We just want our cut.

"But the ASCII quotes look so horrible. We can't use those," cry the webmasters. Perhaps you could tell them about the perfectly valid representations for those Windows characters that are not part of the ISO-8859-1 character set. All of the defined Windows characters have a Unicode value assigned and those Unicode characters are supported by virtually all modern browsers -- including Lynx. (Some older versions of Lynx, Netscape Navigator, and Internet Explorer do not recognise the Unicode characters and the newest version of Netscape Navigator can have problems if fonts are not installed correctly.) By using the Unicode entities, the page looks the same with the latest Windows-based graphical browsers (when they are properly configured and fonts are installed correctly) as they did before but the characters are recognised by Lynx and displayed with your computer's character set, making substitutions if necessary. Here is the same text, first using the invalid Windows characters, then using Unicode, and then again using plain ASCII. The text in the first two may look the same with graphical browsers but there will be a major difference in appearance when Lynx is used:

A spokesman for the penguins said, “We don’t mind Linux users using a picture of us as their logo. We just want our cut.”

A spokesman for the penguins said, "We don't mind Linux users using a picture of us as their logo. We just want our cut."

This is what the three paragraphs above look like with Lynx when you have the default IBM character set or have an ISO-8859-1 font loaded into your VGA card and have Lynx configured to know which character set you are using:

[ A screen snapshot of the paragraphs. ]

This is what the three paragraphs above look like with Lynx when you have a cp1252 font loaded into your VGA card and have Lynx configured to know you are using the cp1252 font:

[ Another screen snapshot of the paragraphs. ]

Here is a list of the preferred substitutes for Windows cp1252 (code page 1252) characters that are not in the ISO-8859-1 character set -- in cp1252 order. (Shown in single quotes except where parentheses are clearer -- such as when quoting quotation marks.)

0x80 '€' -- instead, use '&#8364;' or '&euro;' -- euro sign, NEW [not recognised by Lynx yet]
0x81 -- undefined
0x82 (‚) -- instead, use (&#8218;) or (&sbquo;) -- single low-9 quotation mark, NEW
0x83 'ƒ' -- instead, use '&#402;' or '&fnof;' -- latin small f with hook, florin
0x84 („) -- instead, use (&#8222;) or (&bdquo;) -- double low-9 quotation mark, NEW
0x85 '…' -- instead, use '&#8230;' or '&hellip;' -- horizontal ellipsis = three dot leader
0x86 '†' -- instead, use '&#8224;' or '&dagger;' -- dagger
0x87 '‡' -- instead, use '&#8225;' or '&Dagger;' -- double dagger
0x88 'ˆ' -- instead, use '&#710;' or '&circ;' -- modifier letter circumflex accent
0x89 '‰' -- instead, use '&#8240;' or '&permil;' -- per mille sign
0x8A 'Š' -- instead, use '&#352;' or '&Scaron;' -- latin capital letter S with caron
0x8B '‹' -- instead, use '&lt;' ('<') for now, or '&#8249;' or '&lsaquo;' -- single left-pointing angle quotation mark (ISO proposed but not yet ISO standardized)
0x8C 'Œ' -- instead, use '&#338;' or '&OElig;' -- latin capital ligature OE
0x8D -- undefined
0x8E 'Ž' -- instead, use '&#381;' or '&Zcaron;' -- latin capital letter Z with caron
0x8F -- undefined
0x90 -- undefined
0x91 (‘) -- instead, use (&#8216;) or (&lsquo;) -- left single quotation mark
0x92 (’) -- instead, use (&#8217;) or (&rsquo;) -- right single quotation mark
0x93 (“) -- instead, use (&#8220;) or (&ldquo;) -- left double quotation mark
0x94 (”) -- instead, use (&#8221;) or (&rdquo;) -- right double quotation mark
0x95 '•' -- instead, use '&#8226;' or '&bull;' -- bullet = black small circle
0x96 '–' -- instead, use '&#8211;' or '&ndash;' -- en dash
0x97 '—' -- instead, use '&#8212;' or '&mdash;' -- em dash
0x98 '˜' -- instead, use '&#732;' or '&tilde;' -- small tilde
0x99 '™' -- instead, use '&#8482;' or '&trade;' -- trade mark sign
0x9A 'š' -- instead, use '&#353;' or '&scaron;' -- latin small letter s with caron
0x9B '›' -- instead, use '&gt;' ('>') for now, or '&#8250;' or '&rsaquo;' -- single right-pointing angle quotation mark (ISO proposed but not yet ISO standardized)
0x9C 'œ' -- instead, use '&#339;' or '&oelig;' -- latin small ligature oe (ligature is a misnomer, this is a separate character in some languages)
0x9D -- undefined
0x9E 'ž' -- instead, use '&#382;' or '&zcaron;' -- latin small letter z with caron
0x9F 'Ÿ' -- instead, use '&#376;' or '&Yuml;' -- latin capital letter Y with diaeresis
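If you have a whole page of Windows characters to repair, the lookup can be automated. The following Python sketch is one possible approach, assuming the page was typed in code page 1252; windows_to_entities is a hypothetical helper name, and the function does not touch &, < or >.

```python
import html.entities

def windows_to_entities(raw):
    """Turn bytes typed in Windows code page 1252 into ASCII-only HTML text."""
    # errors="replace" covers the five undefined bytes (0x81, 0x8D, 0x8F,
    # 0x90, 0x9D), which Python's cp1252 codec refuses to decode.
    text = raw.decode("cp1252", errors="replace")
    # xmlcharrefreplace writes every non-ASCII character as &#NNNN;.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

fixed = windows_to_entities(b"\x93We don\x92t mind.\x94")

# Look up the entity name for a character, where HTML defines one.
left_double_quote = bytes([0x93]).decode("cp1252")
name = html.entities.codepoint2name.get(ord(left_double_quote))

print(fixed)
print(name)
```

The numeric values it produces should agree with the decimal entities in the list above.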

Typing special characters into email

I am not sure what changed with the last system upgrade to affect this but it is no longer possible to type high characters directly into email. It is possible to include them into messages, however. There are two methods.

The hard way: Type the characters into a file on your computer, upload the file to CCN, and use Control-R, Control-T to select the file and press ENTER to paste the file into your document.
The easy way: While editing a document, an email message, or a news posting with the PINE or PICO editor, press your ESCAPE key (it may be marked "Esc") twice and then type the three-digit decimal numeric value of the character with the top row of number keys on your keyboard. Last month's chart and the chart above will tell you what number to use for each character. For example, Esc Esc 2 2 5 (without the spaces) will generate character 225, &#225;, á, an 'a' with an acute accent.
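If you would rather not look the number up in a chart, a few lines of Python can work it out. This is only a convenience sketch; pine_key_sequence is a made-up name and it assumes the character is one of the ISO-8859-1 characters.

```python
def pine_key_sequence(character):
    """Describe the Esc Esc sequence for one ISO-8859-1 character."""
    code = character.encode("latin-1")[0]   # raises an error if not in ISO-8859-1
    digits = " ".join(f"{code:03d}")        # always three decimal digits
    return f"Esc Esc {digits}"

sequence = pine_key_sequence("\u00e1")      # 'a' with an acute accent

print(sequence)
```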


You may direct comments or suggestions about this column to:

Norman L. De Forest, af380@chebucto.ns.ca
