Webforumz Newsletter - September 2007
<br> is semantic markup. Yes, really.
Everyone talks about "semantic web design" these days. But do you really know what it is?
This article is a fairly deep exploration of the meaning of semantic web design. By the end of it, you should really understand what semantic web design is, and (perhaps more importantly) what it is not.
You'll also learn some of the philosophy behind semantic web design, which will help position the ideas in a much broader context.
To finish, we'll apply this knowledge to explain why <br> and even <i> and <b> are semantic elements, not presentational. Don't believe me? Read and learn!
Let's start with an easy example
Here are two bad ways to make a heading:
<p class="heading1">Level one heading</p>
<p><font style="font-size: 150%;">Level one heading</font></p>
Web designers often call such code presentational, meaning that the markup has been chosen without concern for the meaning of the item, but only for how it looks.
Here is a good way to make a heading:
<h1>Level one heading</h1>
This way is better than the others, because it informs browsers and search engines that the item is a heading. This is good for users, because they can navigate the page by jumping between headings (particularly useful for blind people). Search engines will assign extra weight to the heading text, helping them to discover what the page is about (and hence index it well).
Web designers often call such code semantic.
Web designers use "semantic" for too many things
I like the word "semantic". It's an impressive word. It makes you sound intellectual, one of the enlightened. It's only natural, therefore, to overuse it.
"Semantic web design" is often used to mean any of the following:
1. Separating content and structure from presentation (using CSS);
2. Avoiding unnecessary markup (such as excessive numbers of <div>s for presentation);
3. Choosing CSS class and ID names based on the function of items, not on their intended appearance;
4. Using "good" elements such as <p> instead of "bad" elements such as <br> (or <em> instead of <i>).
None of these has anything to do with semantic web design.
(1), (2), and (3) are about efficient coding practices. (1) and (2) save you time, help you avoid confusing yourself with bloated code, and reduce the file size of your web pages. (3) just makes it easier to keep track of your IDs and classes; it will make no difference to your users whether you choose div.contactBox or div.slushPuppyMoonCowHertfordshire.
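To make point (3) concrete, here is a minimal sketch. The class names, colour, and phone number are invented for illustration. Both boxes look identical to users; only the designer benefits from the functional name.

```html
<!-- In the <head>: -->
<style type="text/css">
  /* Named after function: still makes sense if the design changes. */
  .contactBox   { border: 1px solid #900; padding: 1em; }
  /* Named after appearance: misleading once the border stops being red. */
  .redBorderBox { border: 1px solid #900; padding: 1em; }
</style>

<!-- In the <body>: -->
<div class="contactBox">Call us on 555 0100.</div>
<div class="redBorderBox">Call us on 555 0100.</div>
```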
(4) is along the right lines, but still fundamentally confused. More on this later.
This is what "semantic web design" really means:
- Choosing markup that conveys the function (or meaning) of each item, and thereby gives your document structure.
The meaning of "meaning"
(No apologies to Hilary Putnam; semantic externalism is a silly theory.)
The adjective "semantic" means "of or relating to meaning".
The abstract noun "semantics" means "the branch of linguistics concerned with meaning".
In the looser parlance of web design, "semantic" code is meaningful code: the markup has been well-chosen to reflect the intended meaning of the items.
Logical languages and formal grammars
Erk! What just happened? Are you suddenly in a logic class?
Actually, yes you are. Web designers are supposed to be multi-talented: you do coding, copywriting, information architecture (ugh! horrible term), graphic design, customer support, sales, usability, accessibility, and so on.
So why not add another string to your bow? Become a logician! We get all the girls, you know. Oh yes: I just mention Skolem's paradox, and they're putty in my hands.
(X)HTML is a logical language. Sound scary? Not really: a logical language is simply one that has been designed to work logically. By contrast, English (or Chinese, or whatever) is a natural language.
We all speak at least one natural language, and they are much more complex than logical languages. Natural languages evolve chaotically, organically. They are disorganised: every rule has exceptions, spellings can't be predicted from sounds, and regional or cultural dialects alter meaning. Plus you have to worry about irony, sarcasm, and the rest. It's a sprawling, heaving city of bright lights and slimy gutters.
Logical languages, on the other hand, are like carefully planned towns. All the roads are straight. Everyone drives at the same speed. The flowers are spaced exactly one metre apart. No-one ever lies. Nothing unexpected ever happens, because everyone is the same.
Logical languages are good for order and precision; natural languages are good for life. (X)HTML allows you to combine the best of each: write your content in vibrant natural language (let them smell that gutter!), but use markup (logical language) to give it the order that allows browsers and search engines to understand its structure.
Syntax vs. semantics
(X)HTML is actually a set of logical languages: each version is a little different. HTML 4.01 Strict is a different language to XHTML 1.1, or even HTML 4.01 Transitional. Each of them defines its own formal grammar.
A formal grammar is just like ordinary grammar, but -- guess what? -- formal. Just as English grammar has (loose) rules about writing grammatical sentences, HTML has (strict) rules about writing grammatical code. Feeling lost now? This should help:
- Grammatical code = valid code.
- Ungrammatical code = invalid code.
Grammar rules are syntax rules. Syntax is about grammar, not meaning. The (X)HTML validator is a syntax checker -- a grammar checker. It knows nothing of meaning.
Syntax and semantics are orthogonal: like oil and water, they never mix. But all (useful) languages need both of them, and (X)HTML has semantics too.
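As a hedged illustration (my own fragment, not taken from any specification): both halves below should pass an HTML 4.01 Strict validator when placed inside a valid document, yet only the second tells a browser what the content actually is.

```html
<!-- Valid, but the markup says nothing true about the content:
     a heading disguised as a bold paragraph, and a blockquote
     used only for its default indentation. -->
<p><b>Opening hours</b></p>
<blockquote><p>We are open from nine to five.</p></blockquote>

<!-- Equally valid, and the elements match the content. -->
<h2>Opening hours</h2>
<p>We are open from nine to five.</p>
```

The validator accepts both because it checks grammar only; choosing between them is a semantic decision that remains yours.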
Elements have meanings
In (X)HTML, elements have defined meanings. For example, <p> is defined as a paragraph. Some elements have less precise definitions, but in (X)HTML Strict, every element has some meaning.
Except, that is, for two special elements: <span> and <div>. These are semantically neutral: they have no meaning whatsoever. <span> is the semantically neutral inline element; <div> is the semantically neutral block-level element. These are extremely useful, in somewhat the same manner as the numbers zero and one are useful in mathematics (as additive and multiplicative identities, in case you were wondering).
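A short sketch of how the two neutral elements are typically used; the id and class names are hypothetical. Neither element adds meaning: they only give CSS (or scripts) something to attach to.

```html
<!-- <div> groups block-level content without claiming what it is. -->
<div id="sidebar">
  <h2>Latest articles</h2>
  <!-- <span> marks an inline run of text, again without meaning. -->
  <p>Read about <span class="product">Widget Pro</span> in this issue.</p>
</div>
```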
Transitional is different
<font> has no meaning. It is purely presentational. The same is true of <center>. These could be applied to any content; there's no way to determine from them (or even guess) anything about the content.
It's precisely because these elements (and others) have no meaning that they were stripped from Strict (X)HTML. Presentation is much better controlled using CSS. CSS is the samurai sword to <font>'s rusty knuckle duster.
In itself, however, <font> does not harm semantic web design. A page that uses <font> tags to control presentation is no less semantic than the same page using CSS. It just makes for messy, nasty code.
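To illustrate, the two fragments below sketch the same paragraph styled both ways (the class name and text are invented). A browser or search engine learns exactly the same thing from each: this is a paragraph.

```html
<!-- Transitional: presentation mixed into the markup. -->
<p><font color="gray" size="2">Prices exclude delivery.</font></p>

<!-- Strict: the same paragraph, with presentation moved to CSS. -->
<!-- In the <head>: -->
<style type="text/css">
  p.smallPrint { color: gray; font-size: small; }
</style>
<!-- In the <body>: -->
<p class="smallPrint">Prices exclude delivery.</p>
```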
But let's return to our original examples:
<p><font style="font-size: 150%;">Level one heading</font></p>
This is unsemantic, in the sense that you should use more meaningful markup instead:
<h1>Level one heading</h1>
As promised: <br> is semantic.
First, let's remind ourselves why so many web designers think that <br> is unsemantic:
<p>A paragraph. Now I wanted to make some space underneath it.</p>
<br>
<p>The next paragraph. Now I want to add a bigger gap.</p>
<br><br><br>
<h2>I added all those br's because the margin was too small</h2>
<br>
<p>BR is so useful!</p>
<br><br><br><br><br><br><br><br><br>
<p>Footer text here</p>
That code is junk. The spacing between and around elements should be controlled by CSS, using properties such as margin and padding.
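A minimal sketch of the same page with the spacing moved into a style sheet. The margin values and the footer id are arbitrary choices of mine, not recommendations.

```html
<!-- In the <head>: -->
<style type="text/css">
  p       { margin: 0 0 1em 0; }
  h2      { margin: 3em 0 1em 0; }
  #footer { margin-top: 9em; }
</style>

<!-- In the <body>: no <br> needed for spacing. -->
<p>A paragraph. The space underneath it comes from the style sheet.</p>
<p>The next paragraph. The bigger gap is the heading's top margin.</p>
<h2>A heading with room to breathe</h2>
<p>Body text.</p>
<p id="footer">Footer text here</p>
```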
It's understandable that many designers, after seeing such abuses of <br>, dismiss it as unsemantic -- a "bad" element. But you could use any block-level element in its place. Here is a similar abuse:
<p>A paragraph. Now I wanted to make some space underneath it.</p>
<p>&nbsp;</p>
<p>The next paragraph.</p>
The second example is no better than the first. But then how can you condemn <br> as "unsemantic" without equally condemning <p>?
Perhaps you could say that <p> has meaning, whereas <br> does not. This would be wrong. <br> has a precise meaning: it denotes a deliberate or forced line break.
"But line breaks are presentational, not semantic," I hear you cry. Not so fast! Line breaks are not presentational. Their presentation is independent of their meaning. A line break indicates a division within a block of text, which is usually but not always denoted by a carriage return (starting a new line).
Here is an example of a good ("semantic") use of <br>:
<p>
I strove with none, for none was worth my strife.<br>
Nature I loved and, next to Nature, Art:<br>
I warmed both hands before the fire of Life;<br>
It sinks, and I am ready to depart.
</p>
Here, <br> indicates line breaks in a poem. Using a new <p> for each line would be semantically incorrect, because they are line breaks, not paragraphs. I have used <p> to enclose the poem, however, because (X)HTML lacks <stanza> and <poem> elements. When the ideal element does not exist, choose the best available alternative.
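Postal addresses are another case where the line divisions belong to the content itself rather than to its presentation. A sketch, with an invented address:

```html
<p>
  Example Widgets Ltd<br>
  1 Example Street<br>
  Exampleton
</p>
```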
Conventions are meaningful: <i> and <b>
I hope I've convinced you that <br> is a respectable semantic element. Now I want to restore <i> and <b> to the same status.
Most of the time, italics are used for emphasis. The correct element, therefore, is <em>. Similarly, we normally use bold for strong emphasis; the correct element here is <strong>.
But emphasis is not the only use of italicised or bold text: they are also used to distinguish an item from surrounding content. Unlike emphasis, distinguishing an item does not mark it as more important, but only as different.
Italics are used to distinguish:
- The titles of books (and other items such as journals and newspapers).
- Foreign words.
- Syntactic occurrences of words. For example: the is the definite article; a is the indefinite article.
- Onomatopoeia: whoosh! went the rocket.
Bold text is used to distinguish:
- Examples (such as fragments of HTML code in a tutorial).
- First occurrences of names in an article.
In all of these cases, it would be semantically incorrect to use <em> or <strong>, because the formatting is intended to distinguish items, not to emphasise them. The correct markup would be <i> or <b>.
Again, it would be better to use <booktitle> and <syntacticword> instead, but these elements don't exist.
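The fragment below sketches some of these uses side by side, with <em> included for contrast. The sentences are my own examples.

```html
<!-- Distinguishing, not emphasising: <i> and <b>. -->
<p>She reviewed <i>Middlemarch</i> for the local paper.</p>
<p>He had a certain <i>je ne sais quoi</i>.</p>
<p><i>The</i> is the definite article; <i>a</i> is the indefinite article.</p>
<p><i>Whoosh!</i> went the rocket.</p>
<p><b>Walter Savage Landor</b> was an English poet. Landor wrote both verse and prose.</p>

<!-- Emphasising: <em>. -->
<p>You <em>must</em> validate your code.</p>
```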
Some of you might prefer <span class="italics"> or <span class="booktitle">. These are unsemantic elements; by using them, you are replacing meaningful markup (<i>) with purely presentational markup (remember: classes and IDs mean nothing to users).
Ironically, many designers think that <i> is presentational and <span class="italics"> is semantic. That's the wrong way around. <i> inherits meaning from its long history of conventional uses in print; <span class="italics"> does not.
And don't forget: you can change the appearance of <i> and <b> using CSS. For markup, use <i> and <b> as print conventions dictate (except where more precise elements such as <em> are appropriate); but feel free to change their presentation.
For example, <i> could render in a different colour, and <b> could change font size. You could even use classes to assign different presentations of <i> for different uses: <i class="booktitle"> or <i class="onomatopoeia">. I'm not suggesting this would look nice, but it does demonstrate that <i> is not a presentational element. Its meaning is: "this text would conventionally be rendered in italics".
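Here is a rough sketch of that idea. The colours and sizes are arbitrary, and the class names follow the ones suggested above.

```html
<!-- In the <head>: -->
<style type="text/css">
  i.booktitle    { font-style: normal; color: #006; }
  i.onomatopoeia { font-style: italic; letter-spacing: 0.1em; }
  b              { font-weight: normal; font-size: 120%; }
</style>

<!-- In the <body>: -->
<p><i class="booktitle">Middlemarch</i> landed on the desk with
a <i class="onomatopoeia">thud</i>, startling <b>George</b>.</p>
```

The markup still says "conventionally italic" and "conventionally bold"; only the rendering has changed.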
What about <u>?
<u> is deprecated: it does not exist in strict flavours of (X)HTML.
Unlike <i> and <b>, <u> does not have strong typographic conventions that give it meaning. Underlining is generally avoided in print, except for items such as headings. In fact, the traditional use of underlining was as a poor man's substitute for italics in handwritten or typewritten manuscripts. Publishers still prefer to receive manuscripts with underlining instead of italics -- it's easier to scan for editing -- but the production copy will always revert to italics.
You might use underlining for emphasis, but then <em> would be correct (adjusted with CSS). Since every conventional use of underlining is covered by more precise elements, <u> is never the most appropriate choice.
I suspect this is why <u> was dropped from (X)HTML, yet <i> and <b> were retained.
(Underlining should be used with care. In most contexts, users expect underlined text to be a link. Avoid surprising them.)
What about <hr>?
<hr> is a generic separator. It means, "these bits of content are somehow separate". I never use this element, but it's not semantically neutral. Like <b> and <i>, <hr> has a long history of use in printed text as a section divider. In academic papers, the horizontal rule is often just that: a line. In books, especially fiction, you will often see one or more asterisks (* * *) or other glyphs used as a horizontal rule, to separate story threads.
Although <hr> (like <b> and <i>) is spurned by wannabe standards advocates, it is arguably "more semantic" than using a purely presentational CSS effect (such as a border). If your border is supposed to separate different areas or types of content, then <hr> conveys this meaning.
Using <hr> is rarely worth the hassle; but if an important separation of content is not otherwise conveyed in your markup, then use <hr> instead of a CSS border. You can, of course, use CSS to style your <hr>s.
Using a <div> to mimic <hr> would be an outright mistake (semantically speaking).
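A sketch of a restyled <hr>, with arbitrary values. Browsers differ in how they draw <hr> (some take the line's colour from color, others from background-color), so treat this as a starting point rather than a tested recipe.

```html
<!-- In the <head>: -->
<style type="text/css">
  hr {
    border: 0;
    height: 1px;
    color: #999;            /* used by some older browsers */
    background-color: #999; /* used by others */
    margin: 2em 0;
  }
</style>

<!-- In the <body>: -->
<p>The first story thread ends here.</p>
<hr>
<p>A different thread begins.</p>
```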
Back to our example
Armed with this understanding of <i> and <b>, let's use them in our example:
<p><i>I strove with none</i> is an existential poem by <b>Walter Savage Landor</b>:</p>
<blockquote><p>
I strove with none, for none was worth my strife.<br>
Nature I loved and, next to Nature, Art:<br>
I warmed both hands before the fire of Life;<br>
It sinks, and I am ready to depart.
</p></blockquote>
You can see the result of this code in my first demo page.
Styling <br>
<br> is normally rendered as a carriage return (starting a new line). This is the default presentation, just as the default presentation of <li> is an indented bullet.
But you can present line breaks in other ways. In-line poetry quotes, for example, usually separate lines with a " / " (and no carriage return).
In my second demo page, I use the unusual CSS display: inline-block to remove the carriage return, and add the " / " using generated content:
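br {
  display: inline-block; /* intended to suppress the default line break */
  margin: 0;
}
br:before {
  /* generated content: no-break space, slash, space */
  content: "\00A0\002F\0020";
}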
Sadly this is not a practical example to use on your web pages, because:
- It works in Opera, but not in Firefox, IE, or Safari.
- I can't be sure whether Opera's presentation is correct according to the W3C specification. It's possible that the other browsers are correct, or that the proper rendering is indeterminate.
Nonetheless, it proves the point: you can use style sheets to alter the default presentation of <br>.
You could make this work in all browsers by using a <span> and making the <br> disappear with display: none. Unfortunately that would be unsemantic, because display: none completely removes an element from the document flow. Screen readers, for example, will properly ignore the <br>. display: none is not a visual formatting control: it's a structural formatting control.
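For reference, the workaround just described might look like the sketch below (the class names are hypothetical). The separator is ordinary text inside a <span>, so it displays everywhere; but, for the reason given above, the line break itself is lost to anything that honours display: none.

```html
<!-- In the <head>: -->
<style type="text/css">
  p.inlineVerse br { display: none; }
</style>

<!-- In the <body>: -->
<p class="inlineVerse">
  I strove with none, for none was worth my strife.<span class="sep"> / </span><br>
  Nature I loved and, next to Nature, Art:
</p>
```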
Conclusions
Semantic web design is about choosing markup that conveys the function (or meaning) of each item, and thereby helping both users and search engines to understand the page.
(X)HTML is a logical language with predefined syntax and semantics. In Strict (X)HTML, the only semantically neutral elements are <span> and <div>. All other elements carry some meaning, if only that which is inherited from their conventional print uses.
(X)HTML does not have elements for every conceivable purpose. When the ideal element does not exist, choose the best available alternative. If no satisfactory element exists, use a <div> or <span> (no meaning is better than misleading meaning).
<br>, <i>, and <b> are perfectly valid (X)HTML elements with good, meaningful (semantic) uses. <u> is not. <em> and <strong> are not replacements for <i> and <b>.
All these elements can be restyled with CSS, but <br> is a strange element that responds unpredictably to restyling. Nonetheless, <br> is semantic, not presentational.