Using Regular Expressions in Web Work

I've been digging into regular expressions as a tool to help me convert a massive number of pages. Regular expressions are essentially patterns of symbols or values, rather than specific strings of characters, that are used in find-and-replace operations and as input to text manipulation procedures. The idea is simple, but it requires you to think differently. It's like learning a foreign language: vocabulary, syntax, rules and a kind of mind-meld. See Wikipedia's definition of the term.

The problem with ordinary find and replace operations is that you need an exact match. If a phrase is slightly off because of a misspelling, a capitalized letter or an extra space, it is missed. Regular expressions overcome this shortcoming by letting you identify specific patterns, for instance, a date (10/08/03), an e-mail address (john@e-mail.com) or a URL.
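
As a rough illustration of the idea, here is a minimal Python sketch. The pattern names and the sample text are my own assumptions, and the patterns are deliberately loose: real dates and addresses vary far more than this.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not complete validators).
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")          # e.g. 10/08/03
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")     # e.g. john@e-mail.com
# Tolerates capitalization, extra spaces and an optional plural "s".
PHRASE = re.compile(r"regular\s+expressions?", re.IGNORECASE)

def find_dates(text):
    """Return every substring that looks like a slash-separated date."""
    return DATE.findall(text)

def find_emails(text):
    """Return every substring that looks like an e-mail address."""
    return EMAIL.findall(text)

def find_phrase(text):
    """Return every spelling of the phrase, whatever its case or spacing."""
    return PHRASE.findall(text)

sample = "Updated 10/08/03 and 1/2/2004. Contact john@e-mail.com or Jane.Doe@Example.org."
print(find_dates(sample))    # ['10/08/03', '1/2/2004']
print(find_emails(sample))   # ['john@e-mail.com', 'Jane.Doe@Example.org']
print(find_phrase("Regular  Expressions and regular expression"))
```

An exact-match search for "regular expressions" would miss both spellings in the last line, while the pattern catches them.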

For a start, I'm reading the second edition of the O'Reilly book Mastering Regular Expressions by Jeffrey E. F. Friedl. Books can be comforting when tackling intimidating topics. When Friedl first wrote his book on regex in 1996, he had a narrow choice of languages. By the second edition in 2002, the picture had changed dramatically. Be sure to download the sample code for each chapter from his website or from O'Reilly. Also note that there is an errata page, which is especially important when dealing with code. The book is a kind of security blanket and probably overkill for what I need to do, but it has impressed my colleagues at work to see it lying on my desk during lunch break. O'Reilly also has a pocket reference book by Tony Stubblebine. A sample chapter is available. Because of its strength in open source languages, O'Reilly offers a wealth of resources on regular expressions, as demonstrated by this simple site search, which returned nearly 1,500 results (October 2003).

For a succinct read, try Dorothea Salo's Brief Guide to Regular Expressions. She writes from the viewpoint of preparing documents for publication, not as a programmer. For novices, it can't be beat.

Terms: In addition to regex, regular expressions are also frequently called regexp or RE. You will see a lot of references to grep or its variations (egrep, meaning extended grep; agrep; ngrep, etc.) in the literature. That term comes from Unix and stands for "global regular expression print," after the g/re/p command in the ed editor. Many of the search and replace functions in modern Windows applications and scripting languages that allude to this name are trying to duplicate grep in Unix.

I'm just scratching the surface. I am reading through Friedl a couple of pages a day, so this page will be a kind of scratch pad for my learning experience. There are some references on the Web that date from 1996 and earlier. I have generally refrained from listing them because I can't be sure they're still being maintained and updated.

General Resources and References

For a more philosophical view, Andy Oram's article, Marshall McLuhan vs. Marshalling Regular Expressions, examines the importance of regular expressions in the digital realm.

Dual Approaches

There are two ways to use regular expressions. First, use an application that accepts regex in its search and replace function; many applications offer this more robust kind of search among their features. Second, pick one of the scripting or programming languages that have regex functions.

Applications

I have several programs that already have regex available within their features: HomeSite, Dreamweaver, TextPad and HTML-Kit. Many text processors include regex. You can also throw in Unix utilities, like vi and Emacs (for geeks). You can also add the utilities that I list above. The issue is that each may have its own implementation. Minor differences can cause problems when you make major changes on multiple pages. Always test your search and replace operations on backup copies until you get the hang of a particular implementation.
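
If you do have a scripting language at hand, a dry run is another way to see what a pattern would change before touching any files. The following Python sketch is only an illustration; the function name preview_replace and the sample page are my own inventions, and other regex implementations may treat the same pattern differently.

```python
import re

def preview_replace(pattern, replacement, text):
    """Return (line_number, before, after) for each line the substitution would change.

    Nothing is written anywhere; this only reports the would-be changes.
    """
    regex = re.compile(pattern)
    changes = []
    for number, line in enumerate(text.splitlines(), start=1):
        new_line = regex.sub(replacement, line)
        if new_line != line:
            changes.append((number, line, new_line))
    return changes

# Hypothetical page fragment: turn <b>/<B> tags into <strong>.
page = "<B>Title</B>\n<p>No bold here</p>\n<b>more</b>"
for number, before, after in preview_replace(r"<(/?)[bB]>", r"<\1strong>", page):
    print(f"line {number}: {before}  ->  {after}")
```

Only lines 1 and 3 of the sample are reported, which makes it easy to spot a pattern that matches more (or less) than intended.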

I am focusing on Windows here, but Mac and Unix both have applications and utilities that use regex. Since I am not familiar with those platforms, I am going to leave them to others. See Grep.Extracts.de for more info.

The advantage of this method is that the application handles the mechanics of identifying a match and replacing it. However, if you want to do something more elaborate with the text, you're out of luck.

Scripting Languages

Perl 5 has become the standard for regular expressions, and it has made its mark on the Web. Version 5, which came out in 1994, was the turning point in getting regular expressions out of the professional programmer's or Unix geek's hands and into the web developer's tool kit. Perl is currently at version 5.8. Once programmers saw the usefulness of regex for text manipulation, it spread to practically every modern programming language: JavaScript, ECMAScript, VBScript, Visual Basic, PHP, Python, Ruby, C (in its many flavors), Java, and .NET. Usually, you will need to have a web server up and running to use these methods. The Web has created an environment in which text manipulation is a necessity for serious enterprises and applications. Perl is still the most mature product, but others are catching up. Because of the differences between the flavors of regex, you may have better luck looking at specific references for each language (PHP, Python, etc.). See the listings above.

The advantage of this approach is that regex can be paired with the scripting power of the language. It also usually means that you are not confined to one operating system: a Perl regular expression is going to work on Windows, Mac or Unix. See this table for differences.
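
To show what pairing regex with scripting can look like, here is a small Python sketch in which each match is handed to a function that computes the replacement, something a plain search and replace box cannot do. The names, the month/day/year reading of the date and the century cutoff are all assumptions made for the example.

```python
import re

# Matches dates such as 10/08/03 (two-digit year only, for this sketch).
DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b")

def expand_date(match):
    """Rewrite one matched date as YYYY-MM-DD."""
    # Assumption: the order is month/day/year.
    month, day, year = match.groups()
    # Assumption: years 00-49 mean 20xx and 50-99 mean 19xx.
    century = "20" if int(year) < 50 else "19"
    return f"{century}{year}-{int(month):02d}-{int(day):02d}"

def normalize_dates(text):
    """Replace every matched date using the function above."""
    return DATE.sub(expand_date, text)

print(normalize_dates("Posted 10/08/03, revised 1/5/99."))
# Posted 2003-10-08, revised 1999-01-05.
```

The pattern finds the text, and the language decides what to do with it; the same division of labor applies in Perl, PHP and the other languages listed above, though the syntax differs.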