Converting Documents to Web Pages
Much of the work of setting up and maintaining a website is taking text that was prepared for print media (brochures, reports, press releases, etc.) and preparing it for electronic distribution. Most material is prepared with word processors or desktop publishing applications. But shoveling words onto the web requires a shift in approach and care to catch the glitches.
First, remember that printed formats are not the same as digital/electronic formats. The web is a medium in its own right and has conventions and good practices for usability. Give up the idea that you need to duplicate the appearance of a printed document.
Old Habits Die Hard
Some facets of manuscripts are actually a legacy of the mechanical typewriter that had limited means of highlighting and formatting text.
Drawing the Line: Underlining is no longer the only means of emphasizing text. Modern word processors and computer displays can use color and degrees of bold to draw attention to text. In fact, underlining is detrimental to effective reading because the descenders (the parts of letters like y and g that go below the baseline of the text) are overwritten by the underlining. It becomes hard to distinguish between a "q" and a "g." In fact, the HTML code for underlining will be deprecated (or eliminated) in future versions of the markup language. Another reason to avoid underlining is that it is used to mark hyperlinks in the display; many users will mistakenly click on the underlined text, thinking that it will send them to another page.
Another example of a print feature that makes a poor transition to digital display is italics, which are not as readable as the standard face, especially on low-definition monitors.
Left-Align Your Text, Not Center: Some of us have come to expect titles and headings to be neatly aligned in the center of the page. The web medium requires that most of your text be left aligned.
Avoid ALL CAPS: Another holdover is capitalizing WORDS or ENTIRE PHRASES FOR EMPHASIS. Again, it is harder to read than mixed case. A reader scans text for hints about the structure of the sentence and mixed case allows him/her to decipher the message more quickly. In addition, capitalized text takes up more space on a screen, which can be important in headings.
Microsoft Word, Corel WordPerfect and most other word processors allow you to save documents in HTML format. Even though they claim to be "Web-aware", they have the bad habit of converting all formatting, even formatting that is not visible (you can apply bold or underline to blank lines and it will not be visible in your document). At their worst (e.g. Microsoft Word 98 and especially 2000), they insert unnecessary code and even garbage that is only recognized by a Windows 2000 server. In other cases, formatting from automatic conversion violates the HTML specifications and can change the appearance of your page. Word 2000 has an HTML converter.
Tip: If you use Microsoft FrontPage, you can use it to strip fonts and other coding by selecting the text and then pressing Ctrl and the space bar at the same time. This operation will reset the text to the default formatting, without stripping paragraph and list coding.
Tip: MS Word uses a feature called SmartQuotes (the quote marks curl outwards on both sides). When Word saves to HTML, it does not eliminate the SmartQuotes; the characters are not displayed correctly on non-Windows computers, so you should eliminate the SmartQuotes before conversion or else remove them manually in the HTML.
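If the SmartQuotes have already made it into a saved file, the manual clean-up can be scripted. The following is a minimal Python sketch, assuming the text has been read into a Unicode string; the function name straighten_quotes and the choice of which characters to map are illustrative, not part of Word or any converter.

```python
# Typographic ("smart") quote characters mapped to plain ASCII equivalents.
# The set of characters handled here is an assumption; extend it as needed.
SMART_QUOTES = {
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote / apostrophe
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
}


def straighten_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight quotes."""
    return text.translate(str.maketrans(SMART_QUOTES))


if __name__ == "__main__":
    sample = "\u201cIt\u2019s ready,\u201d she said."
    print(straighten_quotes(sample))
```

Running the same function over the text before pasting it into an HTML page avoids having to hunt for the characters by eye.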
You can actually leave a lot of this junk code in the web pages because most browsers will ignore it. However, some browsers may choke on it. Your pages will definitely load much more slowly.
Quick Solutions - Not: In the case of MS Word or WordPerfect, you can leave the documents in the native format. If you're lucky, the visitor will already have the word processor installed on his/her computer. If you're not lucky, the user can download and install a free viewer. Corel Office Suite has a free plug-in for your browser. In any case, the user's computer will churn away for a few extra seconds while it loads the application, utility or plug-in.
On the other hand, if you want a user to grab the file and use it (a letter-writing campaign, for instance), it might be wise to include a polished draft in a word processor format. If you do, use an older version of the file format (for instance, Word 95). Not everyone updates to the latest release of Word or WordPerfect. This is especially true for international audiences.
Warning: For your average user, requiring plug-ins and viewers in addition to the web browser can be intimidating and puts another barrier between your material and your potential audience. Document file formats are not indexed by search engines, the most common way that users locate information on the web.
For those who have to convert large numbers of sophisticated documents, Adobe Acrobat and its Portable Document Format offer powerful features and compelling advantages. Acrobat can handle complex layouts and do it quickly. Conversion of file formats (not just word processor files) takes minutes while hand-coding web pages can take hours or even days. At about $230, it costs less than some of the converters in the sidebar. The price is roughly equivalent to a day's pay for a low-end HTML coder.
Visitors who want to view and print the PDF documents will have to obtain the free Adobe Acrobat Reader that acts like a plug-in by integrating with the web browser automatically. For some users, that's asking a lot.
Some layouts just do not lend themselves to HTML formatting. Version 4 also allows you to incorporate hyperlinks and searches. Because PDF files cannot be edited or copied easily, you are guaranteed that content will not be put to other uses (but that could be a drawback in those cases when you want users to take advantage of your material). Adobe Acrobat excels at printing content, something that web browsers have yet to master fully.
If you are going to use Adobe Acrobat, make sure that you give visitors full details about the documents you are offering them, especially since the files require a plug-in and may take extra time to download over a modem. Acrobat files average about twice the size of HTML files. Indicate the size of the file (e.g. 300 KB). Also provide a link to the Adobe site to obtain the plug-in.
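Labeling each download with its size is easy to automate when the links are generated by a script. Here is a small Python sketch under stated assumptions: the helper names format_size and pdf_link, the "(PDF, 300 KB)" label style, and the sample path are all hypothetical choices for illustration.

```python
import html


def format_size(num_bytes: int) -> str:
    """Return a short, human-readable size such as '300 KB'."""
    if num_bytes < 1024:
        return f"{num_bytes} bytes"
    kilobytes = num_bytes / 1024
    if kilobytes < 1024:
        return f"{kilobytes:.0f} KB"
    return f"{kilobytes / 1024:.1f} MB"


def pdf_link(href: str, title: str, num_bytes: int) -> str:
    """Build an HTML link followed by a format and size note."""
    safe_href = html.escape(href, quote=True)
    safe_title = html.escape(title)
    return f'<a href="{safe_href}">{safe_title}</a> (PDF, {format_size(num_bytes)})'


if __name__ == "__main__":
    # Hypothetical file path and size, for illustration only.
    print(pdf_link("reports/annual.pdf", "Annual Report", 307200))
```

In practice the byte count could come from os.path.getsize on the file being published, so the label stays accurate when the document is replaced.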
Is Adobe Acrobat right for your purposes? The key question is whether your user will want to read the items on the computer or print the documents out. Acrobat wins the printer test. When you want to cover all bases, offer both PDF and HTML files.
The Hard Way - By Hand
In many cases, you may decide to work with the substandard code produced by word processors and converters. A second option is to save the files as ASCII. A third choice is to cut and paste the text and insert it directly into the HTML page. Here are some tips that should improve your experience:
- Have a template with your boilerplate (header, titles, other standard layout items) already in place. Remember that if you are going to place documents in subfolders in the website, you must adjust the URLs of the hyperlinks and image links so that they work properly.
- You may want to break a large document into smaller, logical parts.
- Increase the value of your documents by linking them to other resources on your website or elsewhere. Your visitors are not going to know your material as well as you do so don't expect them to know you have related material. Avoid making your hyperlink too long because it interferes with the readability of the text. (see
- One common problem in converting documents is that most word processor users set off paragraphs by hitting Enter twice. Most conversion schemes will translate that into two <p> space </p> tags. The result is that you get twice as much space between paragraphs as you might intend. Screen space is at a premium so you want to use "air" judiciously. You can eliminate the extra <p> space </p> quickly by using a search and replace operation for those characters and leaving the replace option empty.
- Justified text has not reached the web yet because browsers do not have the ability to insert hyphens automatically or make micro-adjustments between words. In fact, you can only insert one space between words, unless you use the non-breaking space entity (&nbsp;). If your word processor inserts align="justify" in all your paragraph tags, you can eliminate them with a quick search and replace operation.
- Avoid using odd typefaces. Remember that most users have a limited selection of fonts installed on their computers. The best bets are Verdana and Georgia, which are part of the Internet Explorer installation. See our page on web font management. You do not score extra points for each font that you introduce in a document. Choose one for the text in your web site and use it consistently.
- Use the HTMLTidy utility to strip out the garbage, like fonts and bad code. It has lots of other tricks. It should be in every web developer's toolkit.
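The two search and replace operations described in the tips above (removing empty paragraphs and stripping align="justify") can also be run as a script when there are many files. This is a rough Python sketch, not a full HTML parser: it assumes the empty paragraphs contain only whitespace or the &nbsp; entity, and the function name clean_converted_html is made up for this example. A dedicated tool such as HTML Tidy will be sturdier on messy input.

```python
import re

# A <p> element containing nothing but whitespace or &nbsp; entities,
# plus any whitespace that follows it.
EMPTY_PARAGRAPH = re.compile(r"<p\b[^>]*>(?:\s|&nbsp;)*</p>\s*", re.IGNORECASE)

# An align="justify" attribute with double, single, or no quotes.
JUSTIFY_ATTR = re.compile(r"""\s+align\s*=\s*(["']?)justify\1""", re.IGNORECASE)


def clean_converted_html(markup: str) -> str:
    """Drop empty spacer paragraphs and align="justify" attributes."""
    markup = EMPTY_PARAGRAPH.sub("", markup)
    markup = JUSTIFY_ATTR.sub("", markup)
    return markup


if __name__ == "__main__":
    converted = (
        '<p align="justify">First paragraph.</p>\n'
        "<p>&nbsp;</p>\n"
        "<p ALIGN=justify>Second paragraph.</p>"
    )
    print(clean_converted_html(converted))
```

Keeping each pattern separate makes it easy to add or drop a rule once you see what your particular word processor leaves behind.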
All this may seem anal retentive. It may take 5-15 minutes, depending on the complexity of your original document, to cruise through your web page and eliminate the non-standard or extraneous coding. Is it worth it? We have seen page size reduced by 20-30 percent by applying these methods. That may seem a minor gain, but speed is everything in keeping users on your site. For large pages, you may save 10-15 seconds on the download. If your pages are taking more than 30 seconds to load, you should consider dividing the original document into smaller web pages.
New: Word Cleaner is a web-based utility that will clean out proprietary tags from documents that originated in Word. This comes highly recommended.
Although some programs below are quite ingenious, most stick too closely to the printed page paradigm: one page of a text document will be converted into a separate web page. A 10-page document will get converted into 10 separate pages with a navigation scheme tying them together. You don't necessarily want that because it can unnecessarily interrupt the flow of ideas and outline. It may even break up a paragraph unnecessarily. We have tried a few of them without being satisfied with any.
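To see how a 20-30 percent reduction turns into seconds, the arithmetic can be sketched in a few lines of Python. The figures below are assumptions chosen for illustration (a 200 KB page and a 33,600 bps modem running at its nominal rate); real connections add latency and protocol overhead, so treat the result as a rough estimate.

```python
def download_seconds(page_bytes: int, link_bits_per_second: int) -> float:
    """Estimate transfer time at the nominal line rate (no latency or overhead)."""
    return page_bytes * 8 / link_bits_per_second


def seconds_saved(page_bytes: int, reduction: float, link_bits_per_second: int) -> float:
    """Estimate time saved when the page shrinks by the given fraction."""
    return download_seconds(page_bytes, link_bits_per_second) * reduction


if __name__ == "__main__":
    page_bytes = 200 * 1024  # assumed page size: 200 KB
    modem_bps = 33_600       # assumed modem speed

    # About 49 seconds for the original page under these assumptions.
    print(f"{download_seconds(page_bytes, modem_bps):.1f} s before clean-up")
    # A 25 percent reduction saves about 12 seconds.
    print(f"{seconds_saved(page_bytes, 0.25, modem_bps):.1f} s saved")
```

Under these assumptions the original page is already well past the 30-second mark, which is the point at which splitting the document becomes worth considering.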
- ConvertZonde's CZdoc2htm ($120) and CZDoc Converter ($160)
- PDF995 Suite needs a free plugin, Omniformat, to export to 75 file formats, including HTML. Some features may require MS Word 2000. Costs $19.95 or $29.95 for additional plug-ins
- goBCL ::: BCL Magellan $199
- Adobe Acrobat: free online service, PDF to HTML
- SoftInterface Convert Doc $470. Can be used as a command-line application to convert documents in batch.
Just Word (or other word processing formats) to HTML
- Aurelia Reporter, a printer driver that converts documents to HTML. Cost - $49.00. Slick concept and implementation, reasonably priced, but too literal in maintaining the page format.
- Mambosoft's WordCleaner - $99. Allows batch conversions.
- WebConvert - $179. It does not allow much customized output beyond its own style sheets and format.
- SolutionSoft's Word-to-Web 2.5 $299
- Eon Solutions EasyHTML £120
- Reworx $199 Standard version; $399 Pro version. The Pro version gives wider choices of table of content formats.
- Verity's Keyview or Verity Export talk to sales about price (usually means a high-end product).
- HTML Transit $5000 - with this price, you will have to have industrial volumes of documents. It seems to be more a content management and publication system, rather than document conversion per se. It is, however, a well-regarded product, working with 225 file formats.
Open Source Solutions
- AbiWord - open source word processor that can handle MS Word docs and export to HTML
- Doc2PDF open source
- wvWare - open source, but last update was 2001 Online Demo
Web Directory Listings
- W3C Word Processor filters section is actually now dated, but good for historical reference.
- dmoz.org's Converters section is more useful and current.