Perl and Regular Expressions

You can buy the Perl and Regular Expressions Quick Start Workbook from the following Amazon sites:

Amazon.com, Amazon.co.uk, Amazon.ca, Amazon.de, Amazon.in, Amazon.fr, Amazon.es, Amazon.it, Amazon.nl, Amazon.co.jp, Amazon.com.br, Amazon.com.mx, Amazon.com.au

Perl and Regular Expressions

Perl is a programming language, and regular expressions provide a way to search for patterns in text strings. Many languages support regular expressions, but Perl's implementation is widely regarded as one of the most capable. Together, Perl and regular expressions provide a very powerful way of manipulating text-based files such as HTML and XML files.

Perl and Regular Expressions Quick Start Workbook

The Perl and Regular Expressions Quick Start Workbook teaches you how to use regular expressions in Perl scripts, which can then be used to process text files. You might need to do this in order to:

  • merge two or more files together,
  • extract data from several files in order to produce a report,
  • fix the formatting in a set of files,
  • convert files from one format to another,
  • provide the back-end processing for a Windows application, and so on.

The book is written using a tutorial-style approach, with the aim of getting you writing scripts as quickly as possible. Although there is of course a certain amount of theory contained in the book, the objective is to learn by doing rather than by reading through chapter after chapter of reference material.

Note: If you want to download the various scripts, text files, and applications that are used in the book, you need to access this website from a PC to make the download links available.

Use the links in the following sections to download the files for each chapter. You'll need to enter a username and password, which you can get from the last page in the quick start workbook itself. If you have any problems downloading the files, please contact us and we will try to help.

Chapters 1 and 2 - Creating a Perl Development Environment

In Chapters 1 and 2 we set up a Perl development environment, which basically means downloading and installing a Perl interpreter and deciding which text editor to use. The number of editors on the market is huge, with loads of free and low-cost options available. To develop Perl scripts, any text editor will do; even Notepad is sufficient. As far as a Perl interpreter is concerned, in the book we use Strawberry Perl, although the Perl distribution from ActiveState will work just as well.

At the end of the day, it doesn't really matter which Perl development environment you go for so long as it suits you. After all, it's how good your scripts are rather than the tools you use to create them that's important.

Chapter 3 - Working with Files

This chapter provides a gentle introduction to how Perl handles files and gives us our first opportunity to have a look at a couple of simple scripts. We learn about filehandles, pathnames, and how to get Perl to display a meaningful error message if something goes wrong when we are trying to process a file.

Chapter 4 - Pattern Matching and Regular Expressions

In many ways, this is where the fun starts. Although you won't be an expert in regular expressions at the end of working through this chapter, you will (hopefully) have an appreciation of what they're about, and will have enough knowledge to start building useful regular expressions yourself.

This chapter also introduces the regExTester application, which lets you check whether a particular regular expression matches a pattern in a string. regExTester provides a bit of light relief to, hopefully, make regular expression development a bit more fun.

Chapter 5 - Processing Multiple Files

Earlier scripts given in the book have concentrated on entering one or more filenames at the command prompt. Although this is an acceptable approach where only a few files need to be processed, it is not feasible to do this where more than, say, four or five files need to be processed. In this chapter we look at how to process all the files of a particular type in the current folder, or files of differing types located in multiple folders.

Chapter 6 - Running a Perl Script from a Windows Application

In this chapter we look at how a Perl script can be run from a simple Windows application. Many 'real-life' commercial software products use a combination of programming languages in order to deliver a solution to customers, and this chapter provides a very simple introduction to this approach.

Chapter 7 - Integrating a Perl Script into a Text Editor

In this chapter we look at how to integrate a Perl script directly into a text editor so it can be run on the file that is currently open in the editor. Although we use TextPad as the example editor, the principles covered in this chapter could be applied to many of the text/HTML editors currently available. Many editors include built-in regular expression functionality, whereby a particular regular expression or series of expressions can be run on a file, but integrating a complete script that can be run from a menu command is something that some people might find useful.


Perl and Regular Expressions Primer

Like many web content authors, over the past few years I've had many occasions when I've needed to clean up a bunch of HTML files that have been generated by a word processor or publishing package. Initially, I used to clean up the files manually, opening each one in turn, and making the same set of updates to each one. This works fine when you only have a few files to fix, but when you have hundreds or even thousands to do, you can very quickly be looking at weeks or even months of work. A few years ago someone put me on to the idea of using Perl and regular expressions to perform this 'cleaning up' process.

The Goal

When converting documents into HTML the goal is always to achieve a seamless conversion from the source document (for example, a word processor document) to HTML. The last thing you need is for your content authors to be spending hours, or even days, fixing untidy HTML code after it has been converted.

Many applications offer excellent tools for converting documents to HTML and, in combination with a well designed cascading style sheet (CSS), can often produce perfect results. Sometimes though, there are little bits of HTML code that are a bit messy, normally caused by authors not applying paragraph tags or styles correctly in the source document.

Why Perl?

Perl is a good language to use for this task because it is excellent at processing text files, which, let's face it, is all HTML files are. Perl is also the de facto standard for the use of regular expressions, which you can use to search for, and replace or change, bits of text or code in a file.

What is Perl?

Perl (often expanded as Practical Extraction and Report Language) is a general purpose programming language, which means it can be used for most tasks that other programming languages handle. Having said that, Perl is very good at doing certain things, and not so good at others. Although you could do it, you wouldn't normally develop a user interface in Perl as it would be much easier to use a language like Visual Basic to do this. What Perl is really good at is processing text. This makes it a great choice for manipulating HTML files.

What is a Regular Expression?

A regular expression is a string that describes a search pattern, according to certain syntax rules. Regular expressions are not unique to Perl (many languages, including JavaScript and PHP, can use them), but they are built into Perl's core syntax, which makes them particularly convenient to use.
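As a minimal sketch of what a search pattern looks like in practice, the following tests a few strings for an opening <h1> tag. The has_h1 subroutine and the sample strings are invented for illustration and are not taken from the workbook.

```perl
use strict;
use warnings;

# Return 1 if $text contains an opening <h1> tag (with or without
# attributes, in upper or lower case), otherwise 0.
sub has_h1 {
    my ($text) = @_;
    return $text =~ /<h1\b[^>]*>/i ? 1 : 0;
}

for my $sample ('<h1>Introduction</h1>', '<p>Welcome</p>', '<H1 class="big">Summary</H1>') {
    my $result = has_h1($sample) ? 'match' : 'no match';
    printf "%-9s %s\n", $result, $sample;
}
```

The pattern sits between the two slashes; the trailing i makes the match case-insensitive.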

Processing a Single HTML File

In this part we'll develop a Perl script to process an HTML file.

Suppose we have the following HTML file, called file1.htm:

<html>
<head><title>Sample HTML File</title>
<link rel="stylesheet" type="text/css" href="style.css"></head>
<body>
<h1>Introduction</h1>
<p>Welcome to the world of Perl and regular expressions</p>
<h2>Programming Languages</h2>
<table border="1" width="400">
<tr><th colspan="2">Programming Languages</th></tr>
<tr><td>Language</td><td>Typical use</td></tr>
<tr><td>JavaScript</td><td>Client-side scripts</td></tr>
<tr><td>Perl</td><td>Processing HTML files</td></tr>
<tr><td>PHP</td><td>Server-side scripts</td></tr>
</table>
<h1>Summary</h1>
<p>JavaScript, Perl, and PHP are all interpreted programming languages.</p>
</body>
</html>

Now imagine that we want to change both occurrences of <h1>heading</h1> to <h1 class="big">heading</h1>. Not a major change and something that could be easily done manually or by doing a couple of simple search and replace operations. But we're just getting started here.

To do this, we could use the following Perl script (script1.pl):

1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = <IN>) {
4 $line =~ s/<h1>/<h1 class="big">/;
5 (print OUT $line);
6 }
7 close (IN);
8 close (OUT);

Note: You don't need to enter the line numbers. I've included them simply so that I can reference individual lines in the script.

Let's look at what the script does.

Line 1
In this line file1.htm is opened so that it can be processed by the script. In order to process the file, Perl uses something called a filehandle, which provides a kind of link between the script and the operating system, containing information about the file that is being processed. I've called this "opening" filehandle 'IN', but I could have used anything within reason. Filehandles are normally in capitals.

Line 2
This line creates a new file called new_file1.htm, which is written to by using another filehandle, OUT. The '>' just before the filename indicates that the file will be written to.

Line 3
This line sets up a loop in which each line in file1.htm will be examined individually.

Line 4
This is the regular expression. It searches for one occurrence of <h1> on each line of file1.htm and, if it finds one, changes it to <h1 class="big">.

Looking at Line 4 in more detail:

$line - This is a variable that contains a line of text. It gets modified if the substitution is successful.
=~ is the binding operator. It applies the substitution to $line rather than to Perl's default variable.
s is the substitution operator.
<h1> is what needs to be substituted (replaced).
<h1 class="big"> is what <h1> has to be changed to.
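To see the substitution on its own, here is a small sketch that applies it to a single string. The subroutine names and the sample line are made up for illustration. It also shows the /g modifier, which is not used in script1.pl, for the case where a line contains more than one <h1>.

```perl
use strict;
use warnings;

# Replace only the first <h1> in $line (no /g modifier, as in script1.pl).
sub replace_first {
    my ($line) = @_;
    my $count = ($line =~ s/<h1>/<h1 class="big">/) || 0;
    return ($line, $count);
}

# Replace every <h1> in $line by adding the /g (global) modifier.
sub replace_all {
    my ($line) = @_;
    my $count = ($line =~ s/<h1>/<h1 class="big">/g) || 0;
    return ($line, $count);
}

my $sample = '<h1>One</h1><h1>Two</h1>';
my ($first_text, $first_count) = replace_first($sample);
my ($all_text,   $all_count)   = replace_all($sample);
print "first only ($first_count): $first_text\n";
print "global     ($all_count): $all_text\n";
```

The substitution operator returns the number of replacements it made, which is a handy way to check whether a line was changed.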

Line 5
This line takes the contents of the $line variable and, via the OUT file handle, writes the line to new_file1.htm.

Line 6
This line closes the 'while' loop. The loop is repeated until all the lines in file1.htm have been examined.

Lines 7 and 8
These two lines close the two file handles that have been used in the script. If you left out these two lines the script would still work, but it's good programming practice to close file handles, which frees up the file handle name so that it can be used, for example, for another file.
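For comparison, here is a sketch of the same steps written in a stricter style: 'use strict' and 'use warnings', the three-argument form of open, lexical filehandles instead of IN and OUT, and an error message if a file cannot be opened. The subroutine name add_class_to_h1 and taking the two filenames from the command line are my own assumptions, not something the original script does.

```perl
use strict;
use warnings;

# Copy $in_path to $out_path, adding class="big" to the first <h1> on each line.
sub add_class_to_h1 {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        $line =~ s/<h1>/<h1 class="big">/;
        print {$out} $line;
    }
    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}

# Run only when both filenames are supplied on the command line.
add_class_to_h1(@ARGV) if @ARGV == 2;
```

Saved under a name of your choice, it could be run as, for example, perl yourscript.pl file1.htm new_file1.htm.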

Running the Script

As the purpose of this primer is to explain how to use regular expressions to process HTML files, and not necessarily how to use Perl, I don't want to dwell for too long on how to run Perl scripts. Suffice to say that you can run them in various ways, for example, from within an MS-DOS window.

(The location of the Perl interpreter will need to be in your PATH statement so that you can run Perl scripts from any location on your computer and not just from within the directory where the interpreter (perl.exe) itself is installed.)

So, to run our script we could open an MS-DOS window and navigate to the location where the script and the HTML file are located. To keep life simple I've assumed that these two files are in the same folder (or directory). The command to run the script is:

C:>perl script1.pl

If the script does work (and in theory it should), a new file (new_file1.htm) is created in the same folder as file1.htm. If you open the file you'll see that the two lines that contained <h1> tags have been modified so that they now read <h1 class="big">.

Processing Multiple Files

Above, we developed a Perl script (script1.pl) to process a single HTML file. In this section we'll look at how to process multiple files.

script1.pl has one major drawback, making it unusable in real terms: the name of the web page (HTML file) that the script processes is hard coded into the script itself. For the script to be useful, we need to be able to run it on any web page. Changing the script so that it can do this is fairly straightforward.

Below, I've given two scripts: script1.pl, which was our original script from earlier in this article, and script2.pl, which is a new script that will process a list of files.

script1.pl

1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = <IN>) {
4 $line =~ s/<h1>/<h1 class="big">/;
5 (print OUT $line);
6 }
7 close (IN);
8 close (OUT);

script2.pl

1 foreach $file (@ARGV) {
2 rename $file, "$file.bak";
3 open (IN, "<$file.bak");
4 open (OUT, ">$file");
5 while ($line = <IN>) {
6 $line =~ s/<h1>/<h1 class="big">/;
7 (print OUT $line);
8 }
9 close IN;
10 close OUT;
11 }

Before looking at each line of the script in detail, let's just quickly establish what script2.pl does. It processes one or more files entered at the command line prompt (for example, the MS-DOS prompt). For each file entered, the script first makes a backup copy and then changes the first occurrence of <h1> on each line to <h1 class="big">.

A few quick definitions:

Variable
A temporary storage place for a value. In the above script, $file is a variable. The filename file1.htm, which will be entered at the command line prompt, is a value that will be temporarily stored in that variable when the script is run.

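Array
A storage place for a list of values. In the above script, @ARGV is an array that holds the filenames entered at the command line prompt.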
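As a small illustration of the difference, the following sketch stores one filename in a variable and three in an array, then builds the backup name for each. The backup_names subroutine is hypothetical and only there to show a list going in and a list coming out.

```perl
use strict;
use warnings;

# A variable ($) holds one value; an array (@) holds a list of values.
my $file  = 'file1.htm';
my @files = ('file1.htm', 'file2.htm', 'file3.htm');

# Return the backup name for each filename passed in.
sub backup_names {
    my @names = @_;
    return map { $_ . '.bak' } @names;
}

my @backups = backup_names(@files);
print "Variable: $file\n";
print "Array:    @files\n";
print "Backups:  @backups\n";
```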

Let's take a look at each line of script2.pl.

Line 1
This line enables one or more files to be entered at the command line and processed by the script. We only have one file, file1.htm, so when we run the script we'll only enter one file to be processed.

Line 2
This line renames each file by adding .bak to its name, so the original file becomes the backup. So, for file1.htm, the backup file would be file1.htm.bak.

Line 3
This line opens a filehandle for the file being processed (see earlier in this article for a description of filehandles).

Line 4
This line opens another filehandle, but this time for the output from the script.

Note: file1.htm.bak will contain the contents of the file from before the script is run. file1.htm will contain the updated contents, that's to say, the output from the script.

Line 5
This line sets up a loop in which each line in the input file (the file being processed) will be examined individually.

Line 6
This is the regular expression. It searches for one occurrence of <h1> on each line of the input file and, if it finds one, changes it to <h1 class="big">.

Line 7
This line takes the contents of the $line variable and, via the OUT file handle, writes the line to the output file.

Line 8
This line closes the 'while' loop. The loop is repeated until all the lines in the file currently being processed have been examined.

Lines 9 and 10
These two lines close the two file handles that have been used in the script.

Line 11
This line closes the 'foreach' loop. The loop is repeated until all the files entered at the command line prompt have been processed.

Running the script

To run the script, at the command line type:

C:>perl script2.pl file1.htm

If the script executes successfully, a new file should be created called file1.htm.bak, which is a backup of the original file (ie before it was processed). A new version of file1.htm should also have been produced, containing the modified <h1> tag.
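The following is a sketch of the same backup-then-rewrite steps with explicit error checks added, so that a failed rename or open stops the script with a message rather than carrying on silently. The process_file subroutine name and the wording of the messages are assumptions of mine.

```perl
use strict;
use warnings;

# Rename $file to "$file.bak", then write an updated copy under the original name.
sub process_file {
    my ($file) = @_;
    my $backup = "$file.bak";
    rename($file, $backup) or die "Cannot rename $file to $backup: $!";
    open(my $in,  '<', $backup) or die "Cannot read $backup: $!";
    open(my $out, '>', $file)   or die "Cannot write $file: $!";
    while (my $line = <$in>) {
        $line =~ s/<h1>/<h1 class="big">/;
        print {$out} $line;
    }
    close($in);
    close($out) or die "Cannot close $file: $!";
    return $backup;
}

# Process every filename given on the command line.
process_file($_) for @ARGV;
```

Checking the result of rename matters here, because if the rename fails there is no backup to read from.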

Processing all the HTML Files in the Current Directory (Folder)

script2.pl, from above, enabled us to enter filenames at the command prompt:

c:>perl script2.pl file1.htm file2.htm file3.htm

Although this script enables us to process as many files as we want to, the drawback is that all the filenames need to be manually typed in. This is fine if you only want to process a few files, but if you've got hundreds or thousands to process, this approach would not be feasible.

script2.pl (repeated from above)

1 foreach $file (@ARGV) {
2 rename $file, "$file.bak";
3 open (IN, "<$file.bak");
4 open (OUT, ">$file");
5 while ($line = <IN>) {
6 $line =~ s/<h1>/<h1 class="big">/;
7 (print OUT $line);
8 }
9 close IN;
10 close OUT;
11 }

In script2.pl, it is line 1 that enables us to enter filenames at the command prompt. script3.pl, which is listed below, provides us with a way to process all the HTML files (that have a .htm extension) in the current directory/folder. This is the directory where all the files to be processed, and the script itself, are located.

script3.pl

1 opendir(DIR, ".") or die "can't opendir: $!";
2 @allfiles = grep (/\.htm$/i, readdir DIR);
3 closedir(DIR);
4 foreach $file (@allfiles) {
5 rename $file, "$file.bak";
6 open (IN, "<$file.bak");
7 open (OUT, ">$file");
8 while ($line = <IN>) {
9 $line =~ s/<h1>/<h1 class="big">/;
10 (print OUT $line);
11 }
12 close IN;
13 close OUT;
14 }

The only difference between script2.pl and script3.pl is the first few lines. Let's look at the new lines in script3.pl.

Line 1
Opens the current directory (signified by a dot ".") for processing. It is given a directory handle of DIR. If the directory cannot be opened, an error message is displayed.

Line 2
This line reads in the names of all the .htm files in the directory, and puts them in an array called @allfiles. In Perl, a '@' indicates an array, and a '$' indicates a scalar variable. A scalar variable stores a single value, whereas an array stores a list of values.

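grep is a built-in Perl function, named after the search command from the UNIX world. Here it keeps only those directory entries whose names match the regular expression.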

Line 3
This line closes the DIR directory handle.
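The directory-reading steps can also be sketched as a small subroutine. In this version the dot in the pattern is escaped (\.) so that it matches a literal dot rather than any character, and the names are sorted. The htm_files_in name is hypothetical.

```perl
use strict;
use warnings;

# Return the names in $dir that end in ".htm" (case-insensitive), sorted.
sub htm_files_in {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @names = grep { /\.htm\z/i } readdir($dh);
    closedir($dh);
    my @sorted = sort @names;
    return @sorted;
}

# List the .htm files in the current directory.
print "$_\n" for htm_files_in('.');
```

Because the pattern is anchored at the end of the name, backup files such as file1.htm.bak are not picked up on a second run.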

Running the script

c:>perl script3.pl

Processing Files in Different Directories (Folders)

script3.pl enabled us to process all the .htm files in the current directory. Sometimes, however, you might need to process files that are located in different directories. script4.pl, listed below, will do this.

script4.pl

1 @allfiles=glob("file1.htm directory1/subdirectory1/*.shtm directory2/*.htm");
2 foreach $file (@allfiles) {
3 rename $file, "$file.bak";
4 open (IN, "<$file.bak");
5 open (OUT, ">$file");
6 while ($line = <IN>) {
7 $line =~ s/<h1>/<h1 class="big">/;
8 (print OUT $line);
9 }
10 close IN;
11 close OUT;
12 }

The only new line here is line 1, which uses the glob function to search through specified directories and files. Firstly, it searches for file1.htm in the current directory, then it searches for all files ending in .shtm in directory1/subdirectory1, and then for all files ending in .htm in directory2. The asterisk (*) is a wildcard that matches any sequence of characters, so *.htm matches any filename ending in .htm.
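To experiment with glob on its own, a sketch along these lines prints whatever the patterns expand to without changing any files. The matching_files subroutine is a made-up helper. Note that, as far as I'm aware, a pattern with no wildcard (such as file1.htm) is returned even if no such file exists, so a real script may want to check with -e before processing.

```perl
use strict;
use warnings;

# Expand several patterns in a single glob call, as line 1 of script4.pl does.
# Patterns are joined with spaces, so they must not contain spaces themselves.
sub matching_files {
    my (@patterns) = @_;
    return glob(join ' ', @patterns);
}

# Print the expansion of the three patterns used in script4.pl.
print "$_\n" for matching_files(
    'file1.htm',
    'directory1/subdirectory1/*.shtm',
    'directory2/*.htm',
);
```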

Running the script

c:>perl script4.pl