Using Regular Expressions in Web Work
I've been digging into regular expressions as a tool to help me convert large numbers of pages. Regular expressions are essentially patterns of symbols or values, rather than specific strings of characters, that are used in Find/Replace operations and as input to text manipulation procedures. The idea is simple, but it requires you to think differently. It's like learning a foreign language -- vocabulary, syntax, rules and a kind of mind-meld. See Wikipedia's definition of the term.
The problem with ordinary find and replace operations is that you need an exact match. If a phrase is slightly off because of a misspelling, a capitalized letter or an extra space, it is missed. Regular expressions overcome this shortcoming by letting you identify specific patterns, for instance, a date (10/08/03), an e-mail address (firstname.lastname@example.org) or a URL.
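To make the idea of matching a pattern rather than an exact string concrete, here is a minimal sketch in Python's `re` module. The patterns and the sample text are my own simplified assumptions: the date pattern does not check that the date is valid, and the e-mail pattern covers only common address forms.

```python
import re

# A phrase that tolerates extra spaces, capitalization and a trailing "s".
PHRASE_RE = re.compile(r"regular\s+expressions?", re.IGNORECASE)

# A numeric date such as 10/08/03 (shape only, no calendar validation).
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

# A deliberately simplified e-mail pattern (assumption: real addresses allow more).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

sample = (
    "Regular  expressions beat a plain search for regular expression. "
    "Updated 10/08/03 and 1/2/2004. "
    "Contact firstname.lastname@example.org for details."
)

phrases = PHRASE_RE.findall(sample)
dates = DATE_RE.findall(sample)
emails = EMAIL_RE.findall(sample)

print(phrases)  # both spellings of the phrase
print(dates)    # both dates
print(emails)   # the one address
```

An exact-match search for "regular expressions" would miss both occurrences in the sample: the first has a capital letter and a double space, and the second is singular.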
For a start, I'm reading the second edition of the O'Reilly book Mastering Regular Expressions by Jeffrey E. F. Friedl. When Friedl first wrote his book on regex in 1996, he had a narrow choice of languages. By the second edition in 2002, the picture had changed dramatically. Be sure to download the sample code for each chapter from his website or from O'Reilly. Also note there is an errata page, which is especially important when dealing with code. The book is probably overkill for most people, and for what I need to do, but reading it is a kind of security blanket, and it has impressed my colleagues at work to see it lying on my desk during lunch break. O'Reilly also has a pocket reference book by Tony Stubblebine. A sample chapter is available. Because of its strength in open source languages, O'Reilly offers a wealth of resources on regular expressions, as demonstrated by this simple site search, which found nearly 1500 results (October 2003).
For a succinct read, try Dorothea Salo's Brief Guide to Regular Expressions. She writes from the viewpoint of preparing documents for publication, not as a programmer. For novices, it can't be beat.
Terms: In addition to regex, regular expressions are also frequently called regexp or RE. You will see a lot of references to grep or its variations (egrep, meaning extended grep; agrep; ngrep, etc.) in the literature. That term comes from Unix, where it derives from the ed editor command g/re/p and stands for "global regular expression print." Many of the search and replace functions in modern Windows applications and scripting languages that allude to this name are trying to duplicate grep in Unix.
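As a rough illustration of what grep does (print every line that contains a match), here is a small Python sketch. The function name `grep` and its parameters are my own for this example, and Python's `re` syntax is not identical to the syntax of Unix grep.

```python
import re


def grep(pattern, lines, ignore_case=False):
    """Yield (line_number, line) for each line that contains a match."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    for number, line in enumerate(lines, start=1):
        if compiled.search(line):
            yield number, line.rstrip("\n")


text = [
    "grep prints matching lines\n",
    "nothing to see here\n",
    "egrep means extended GREP\n",
]

# \b restricts the match to the whole word, so "egrep" alone would not count.
hits = list(grep(r"\bgrep\b", text, ignore_case=True))

for number, line in hits:
    print(f"{number}: {line}")
```

The third line matches only because of the trailing "GREP" combined with `ignore_case=True`; the word "egrep" itself does not satisfy the word-boundary pattern.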
I'm just scratching the surface. I am reading through Friedl a couple of pages a day, so this page will be a kind of scratch pad for my learning experience. There are some references on the Web that date from 1996 and earlier. I have generally refrained from listing them because I can't be sure they're still being maintained and updated.
General Resources and References
- Stephen Ramsay's Using Regular Expressions ::: A Tao of RegEx ::: Resources on the Web ::: Five Habits for Successful Regular Expressions ::: 12 Reasons to Learn and Use Regular Expressions ::: PCUpdate RegEx Intro ::: Regular Expressions Explained by Jan Borsodi ::: Lecture Notes ::: Chris Spruck Regular Expression Basics on Evolt.org
- Sites: Zvon Reference ::: RegEx Library ::: RegularExpressions.info
- Tutorials: Dev Shed tutorial ::: Intro to Regular Expressions
- Learning tools: Visual RegEx ::: RegEx Coach ::: RexExplorer
- Web Directory: Dmoz listing ::: DevDaily ::: Zeal
- Marzie's RegEx Tester ::: Using RegEx ::: Grep.Extracts - Resources on Regular Expressions ::: Regular Expressions resources on the Web
- Textism: Using Grep to Find Caps ::: Word HTML Cleaner This utility is a perfect example of the power of regular expressions because it takes all the garbage that Word and other MS applications stick into HTML files and cleans it to the bone. It's a PHP script that accomplishes all this work.
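To give a feel for the kind of cleanup a tool like the Word HTML Cleaner performs, here is a small Python sketch. It is not the Textism script (which is written in PHP, and whose rules I have not examined). The function name and the three rules are my own assumptions about typical Word residue, and regex-based HTML cleanup of this kind is fragile on unusual markup.

```python
import re


def clean_word_html(html):
    """Strip a few kinds of Word-generated clutter (illustrative rules only)."""
    # Drop Office namespace tags such as <o:p> and </o:p>, keeping their contents.
    html = re.sub(r"</?o:[^>]*>", "", html)
    # Drop Word-generated class attributes, quoted or not (class=MsoNormal).
    html = re.sub(r'\s+class="?Mso\w*"?', "", html)
    # Drop double-quoted inline style attributes.
    html = re.sub(r'\s+style="[^"]*"', "", html)
    return html


dirty = '<p class=MsoNormal style="margin:0in">Hello<o:p></o:p></p>'
clean = clean_word_html(dirty)
print(clean)
```

Each rule is a single search and replace, which is why a script made of a few dozen such rules can do so much work.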
For a more philosophical view, Andy Oram's article, Marshall McLuhan vs. Marshalling Regular Expressions, examines the importance of regular expressions in the digital realm.
There are two ways that you can use regular expressions. First, use an application that accepts regex in its search and replace function. Many applications offer regex as a more robust alternative to plain search and replace. Second, pick one of the scripting or programming languages that have regex functions.
I have several programs that already have regex available within their features: HomeSite, DreamWeaver, Textpad and HTMLKit. Many text processors include regex. You can also throw in Unix utilities, like Vi and Emacs (for geeks), and the utilities that I list above. The issue is that each may have its own implementation. Minor differences can cause problems when you make major changes on multiple pages. Always test your search and replace operations on backup copies until you get the hang of a particular implementation.
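If you script your replacements instead, the same advice about backups applies. Below is a minimal Python sketch of one way to follow it: count the matches first without writing anything, and copy the original to a backup before changing it. The function name, the `dry_run` parameter and the `.bak` suffix are my own choices for this illustration.

```python
import re
import shutil
from pathlib import Path


def replace_in_file(path, pattern, replacement, dry_run=True):
    """Apply a regex replacement to one file; return the number of substitutions.

    With dry_run=True nothing is written. Otherwise the original file is
    first copied to <name>.bak, then overwritten with the new text.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    new_text, count = re.subn(pattern, replacement, text)
    if count and not dry_run:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(new_text, encoding="utf-8")
    return count
```

A typical session would call `replace_in_file("page.html", r"(?i)<(/?)b>", r"<\1strong>")` first to see how many tags would change, then repeat the call with `dry_run=False` once the count looks right. The file name here is hypothetical.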
Several text processors and web editors can handle regular expressions. I am focusing on Windows here, but Mac and Unix both have applications and utilities that use regex. Since I am not familiar with these fields, I am going to leave them to others. See Grep.Extracts.de for more info.
The advantage of this method is that the application handles the mechanics of identifying a match and replacing it. However, if you want to do something more elaborate with the text, you're out of luck.
- HomeSite: Macromedia help ::: Extended Replace ::: Regex plugin
- Windows utilities: RegExTools ::: Funduc's Search and Replace 4.5 ::: Abacre's Advanced Search and Replace ::: Grep for Windows ::: Wingrep and explanation of how to use it ::: Hurricane Search ::: AGrep ::: GNU utilities for Win32 ::: Search+Replace ::: Astrogrep ::: Agent Ransack ::: All of these are downloads, as shareware or freeware. I have not been able to examine any of these utilities, just a quick look at their websites. Others can be found at DMoz.org
- Interesting options: I am actively examining three utilities. As with all implementations of regex, each utility has its own way of implementing features and has a steep learning curve:
- BK ReplaceEM »» Examples ::: Using Bill Klein's BK ReplaceEM for Windows for mass website HTML text changes ::: I am using BK ReplaceEM on several tasks. Interesting features, including the range expressions. Range Replacements are lazy in that they will take the first and shortest string of text that matches your range, as opposed to regular expressions, which are greedy, taking the most text possible. However, ReplaceEM does not support all regex metacharacters. There has been no update since 1999. Advanced edit feature allows you to insert text and ReplaceEM converts it to a regex. Free, donations accepted.
- XReplace32. It has an 81 page manual in PDF format. Strong point is an intuitive interface similar to Windows Explorer. Has a prompted mode that allows you to step through an action. Allows scheduling of actions and creation of macros. Sold through Vestris Software, but no update since mid-2001. Costs $39.
- PowerGREP This is the most expensive, but also the most feature-rich, stand-alone utility. JGSoft - Just Great Software provides a tremendous amount of documentation (100 page PDF manual), examples and support. Allows you to undo actions. It has a tabbed interface that focuses on specific functions: Search, Replace, File Finder, Collect Info, Sequence (multiple actions), Results and Undo History. All this helps to get over the hump in applying regex and getting a handle on the program. Comes with 15 regex examples for web page maintenance. The company's text editor, EditpadPro, also supports regex. Like most utilities of this kind, it states that it is Perl compatible, but here the documentation gives ample proof that the author follows through on that claim. Also see the JGSoft-backed Regular-expressions.info for additional information. Jan Goyvaerts, the creative force behind the company, is actively developing the product; the latest version (2.1.0) dates from January 2004. PowerGREP also searches binary files, like Word and Adobe Acrobat docs, but does not make replacements. Costs $100 - but worth it because Goyvaerts goes the extra mile in giving you an excellent tool, plus documentation and examples that make it productive.
- Apache: Zytrax RegEx
- Frontier Regex Extensions
- bradchoate.com MovableType
- CodeAuditor scan websites or project directories for code that violates your standard rules. Uses regex to make rules. Requires .NET Framework.
- Regex-Enabled Text Processors: TextPad »» Forums »» Macros »» RegEx ::: UltraEdit ::: Boxer Text Editor ::: NoteTab Pro ::: EditpadPro
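The greedy versus lazy distinction mentioned under BK ReplaceEM above is easy to see in a short test. This sketch uses Python's `re` module, where `*` is greedy by default and `*?` is its lazy counterpart. Other tools may spell the lazy form differently or not offer it at all, so treat this as an illustration of the concept and not of any utility in the list.

```python
import re

html = "<b>first</b> and <b>second</b>"

# Greedy: .* grabs as much as possible, so one match spans both tags.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the first closing tag it can, giving two short matches.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)
print(lazy)
```

In a replace operation the difference matters a great deal: the greedy pattern would wipe out the words " and " between the two bold phrases along with the tags.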
The advantage of this second approach is that regex can be paired with the scripting power of the language. It also usually means that you are not confined to one operating system -- a Perl regular expression is going to work on Windows, Apple or Unix. See this table for differences.
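Here is a small Python sketch of what pairing regex with a scripting language can look like: the replacement is computed by a function rather than being fixed text. The assumptions are mine: dates are in month/day/year order, two-digit years belong to the 2000s, and the sample sentence is invented.

```python
import re

# Month/day/two-digit-year, each part captured as a group.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b")


def to_iso(match):
    """Rewrite one matched date as YYYY-MM-DD (assumes years 2000-2099)."""
    month, day, year = (int(group) for group in match.groups())
    return f"{2000 + year:04d}-{month:02d}-{day:02d}"


# re.sub calls to_iso once per match and inserts whatever it returns.
converted = DATE_RE.sub(to_iso, "Posted 10/08/03, revised 1/15/04.")
print(converted)
```

A plain search and replace box cannot do this kind of arithmetic and reformatting, because the replacement text depends on what was matched.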
- Unix: Unix Review.com Cameron Laird's Regular Expression column is not always about regex. The Limits of Regular Expressions ::: CMU ::: Haddock Directory ::: POSIX extended regex ::: PCRE - Perl Compatible Regular Expressions ::: ^txt2regex$ a Regular Expression "wizard", all written with bash2 builtins, that converts human sentences to regexes through a simple interface
- PERL: Perldocs.com Regular Expressions ::: RegEx Power ::: Multiple Search/Replace Operations on Multiple Recursive Files ::: Steve Litt's PERLs of Wisdom: PERL Regular Expressions ::: WebBlazonry - PERL ::: Test RegEx ::: A Regex Introduction
- PHP: PREG ::: RegEx tester ::: RegExEditor - a module for PHPEditor ::: Quanetic.com Regex Tester
- Python: The Gospel ::: Text Processing in Python ::: InformIT Intro to Regular Expressions ::: Learning to Use Regular Expressions ::: A.M. Kuchling Regular Expression HOWTO Also available at ActiveState Programmer Network ::: Python.FAQTs ::: Rozenberg's Visual RegExp ::: Kodos Python GUI utility for creating, testing and debugging regular expressions for the Python programming language ::: Regular Expressions and Text Processing
- Java: Sources ::: RegEx in Java with lots of docs, tutorial and a Online Tester ::: GNU RegEx class ::: Test drive regex in JDK 1.4
- Microsoft: Microsoft has implemented regex in many of its languages. Rather than list them all here, just search for references on Search MSDN ::: Microsoft Beefs Up VBScript with Regular Expressions
- .NET: Regular Expressions as a Language ::: ISerializable Introduction to regular expressions and Practical Parsing Using Groups in Regular Expressions ::: Ultrapico Expresso for Microsoft Windows .NET and requires the .NET Framework ::: Radsoft Free Regular Expression Designer (Requires the .NET Framework) »» Syntax in C and .NET ::: RegEx Builder requires .NET Framework ::: Dan Appleman's book, Regular Expressions with .NET