Thursday, April 26, 2007

Basic Regular Expressions with grep


In a previous post, How to Search Logs Using grep Part 1, I talked about some basic concepts such as:
  • basic pipe concepts
    • cat file.txt | sort
  • basic grep usage
    • cat file.txt | grep pattern
  • chaining input from one instance of grep to another
    • cat file.txt | grep pattern1 | grep pattern2
  • inverse grepping
    • cat file.txt | grep -v pattern
Have a look at that post if you want more on the theory and practice behind the basic use of grep. For this installment, we will look at regular expression patterns and some other techniques that will allow you to quickly set up extremely accurate patterns on the fly for just about anything you'd like to match. So, let's get right down to it!

Regular Expressions

Regular expressions, or "regex" for short, are a huge topic with a pile of books all their own. For our purposes, though, we can do a lot with just the basics. The idea is to use a special syntax to represent characters or groups of characters in a line. Regex can be considered line-oriented, just like grep, so they're perfect for each other. Here are some basic regex patterns.


^        Matches the beginning of the line, before the first character
$        Matches the end of the line, after the last character
.        Matches a single character. Any character at all
.*       Matches any number of characters
a        Matches the letter a, for example
[xyz]    Matches one x, one y or one z
[xyz]*   Matches any number of x, y and z characters. "xyzzy" would be matched, for instance.

If you've never used regex before, the info above is going to be pretty confusing. Don't worry about it now, we'll just jump right into some examples using grep.
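As a quick taste of the table above, here is a minimal sketch that pipes two made-up words into grep. The words "xyzzy" and "plugh" are sample input chosen only for illustration.

```shell
# Two sample words, one per line (made up for this illustration).
words=$(printf 'xyzzy\nplugh\n')

# ^[xyz]*$ : the whole line consists only of x, y and z characters.
only_xyz=$(printf '%s\n' "$words" | grep '^[xyz]*$')
echo "$only_xyz"        # xyzzy

# ^.l : any single character, then the letter l, at the start of the line.
dot_match=$(printf '%s\n' "$words" | grep '^.l')
echo "$dot_match"       # plugh

# Without anchors, [xyz]* matches every line, since "any number" includes zero.
unanchored=$(printf '%s\n' "$words" | grep -c '[xyz]*')
echo "$unanchored"      # 2
```

The last command is the one to remember: a pattern ending in * can match nothing at all, so it usually needs something around it to be useful.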

Basic Regex with grep

For these examples, we will use a file named data.txt containing the following lines:

alpha is 1st. Nothing comes before alpha
beta is 2nd. Beta comes after alpha
gamma is 3rd. Gamma comes after beta
delta is 4th. Delta comes after gamma
epsilon is 5th. comes after delta
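
To follow along, the sample file can be recreated with a here-document. This is just one convenient way to do it; the scratch directory is an assumption of this sketch, and any directory you can write to will do.

```shell
# Work in a scratch directory (an assumption of this sketch).
workdir=$(mktemp -d) && cd "$workdir"

# The quoted 'EOF' keeps the shell from expanding anything inside the block.
cat > data.txt <<'EOF'
alpha is 1st. Nothing comes before alpha
beta is 2nd. Beta comes after alpha
gamma is 3rd. Gamma comes after beta
delta is 4th. Delta comes after gamma
epsilon is 5th. comes after delta
EOF

# Sanity check: count the lines in the file (tr strips any padding from wc).
line_count=$(wc -l < data.txt | tr -d ' ')
echo "data.txt has $line_count lines"   # data.txt has 5 lines
```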

When I first introduced grep, I suggested that you do something like this:

cat data.txt | grep pattern

grep can, in fact, read a file on its own without the need to pipe data into it. This is a shortcut, since grep is so commonly used on files. Here's the syntax we will be using:

grep 'pattern' filename

Pretty simple. We will be using single quotes around patterns from now on. Unix/Linux shells can misinterpret parts of our patterns if we use all kinds of special characters in them, so this is the easiest way to tell the shell to leave them alone and pass them as a chunk to grep for it to deal with.
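
To see why the quotes matter, here is a small sketch. It assumes a scratch directory containing a file named alpha.txt (a hypothetical name used only for this demonstration) and uses echo to reveal what the shell would actually hand to a command.

```shell
# Work in a scratch directory so the demo file does not clutter anything.
workdir=$(mktemp -d) && cd "$workdir"
touch alpha.txt              # hypothetical file, only here to trigger expansion

# Unquoted: the shell expands a* to matching filenames before the command runs.
unquoted=$(echo a*)
# Quoted: the pattern is passed through untouched.
quoted=$(echo 'a*')

echo "unquoted: $unquoted"   # unquoted: alpha.txt
echo "quoted:   $quoted"     # quoted:   a*
```

Had that been grep instead of echo, the unquoted version would have searched for "alpha.txt" rather than the pattern we typed.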

Let's get started on some basic regex patterns. Suppose we would like to search our file for any lines beginning with the word "alpha". Let's try it without regex.

grep 'alpha' data.txt
alpha is 1st. Nothing comes before alpha
beta is 2nd. Beta comes after alpha

Problem! We got back two lines, since the word "alpha" is in both of them. Let's use a basic regular expression to be more specific.

grep -e '^alpha' data.txt
alpha is 1st. Nothing comes before alpha

That works. We used the ^ to signify that we want to match the beginning of the line, followed by "alpha". So this can be read as "Match the beginning of the line, then the letters alpha". (The -e option simply tells grep that the next argument is the pattern; with a single pattern it is optional.) It's also interesting to note here that we are telling it to match 5 characters, a l p h a, not necessarily the word "alpha". More on that in a bit.
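
The five-characters point can be seen with a line that merely begins with those letters. A minimal sketch, using two made-up input lines rather than data.txt:

```shell
# Two sample lines; the second begins with "alphabet", not the word "alpha".
sample=$(printf 'alpha is 1st\nalphabet soup\n')

# ^alpha matches both: each line starts with the characters a l p h a.
loose=$(printf '%s\n' "$sample" | grep -c -e '^alpha')
echo "$loose"     # 2

# Adding a literal space after alpha requires the word to end there.
strict=$(printf '%s\n' "$sample" | grep -e '^alpha ')
echo "$strict"    # alpha is 1st
```

The trailing space is only one way to pin the word down, and it assumes the word is always followed by a space.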

Let's use the same pattern again, but modify it to illustrate the '.' as a replacement for a single character.

grep -e '^.lpha' data.txt
alpha is 1st. Nothing comes before alpha

This can be read as "Match the beginning of the line, any single character at all, then the letters lpha". So, it would match alpha, Alpha, blpha, 6lpha, etc. Any single character at all in that position would be a valid match.
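
Those cases can be checked directly. The sketch below feeds grep the four variants just mentioned, plus two made-up lines ("lpha" and "alph") that should not match:

```shell
# Sample lines covering the cases described above, plus two non-matches.
sample=$(printf 'alpha\nAlpha\nblpha\n6lpha\nlpha\nalph\n')

# ^.lpha needs exactly one character (any character) before "lpha".
matches=$(printf '%s\n' "$sample" | grep -e '^.lpha')
echo "$matches"
# alpha
# Alpha
# blpha
# 6lpha
```

Note that "lpha" on its own is rejected: the "." insists on one character being there, it does not mean "optional".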

Let's work with the "." a bit more, since it's simple but flexible. If we combine the "." with the "*", then we can build the expression for "Match any number of any characters". Example:

grep -e '^gamma.*beta$' data.txt
gamma is 3rd. Gamma comes after beta

This can be read as "Match the beginning of the line, the letters gamma, any number of characters, then the letters beta, then the end of the line". Since we're now working with a complete regular expression describing a full line, it's good form to tell grep where the end of the line is by using '$'. It's not strictly necessary here, but it's a good habit to think in these terms.
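
Here is a minimal sketch of a case where the '$' does change the result. The two input lines are made up for the illustration and are not part of data.txt.

```shell
# The second sample line has more text after "beta".
sample=$(printf 'gamma comes after beta\ngamma comes after beta, before delta\n')

# Without $, "beta" may appear anywhere after "gamma".
without_anchor=$(printf '%s\n' "$sample" | grep -c -e '^gamma.*beta')
echo "$without_anchor"   # 2

# With $, "beta" must be the last thing on the line.
with_anchor=$(printf '%s\n' "$sample" | grep -e '^gamma.*beta$')
echo "$with_anchor"      # gamma comes after beta
```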

So, we've actually done something new here... we've matched two terms we were looking for using one single expression. We've also told it the order in which the terms appear and where they are in the line of text. Another example to really illustrate what we've done so far:

grep -e '^gamma.*Ga..a .om.s a..er be.*$' data.txt
gamma is 3rd. Gamma comes after beta

So, that looks pretty complicated, and I won't begin to try to explain each item matched. What is important to note is that the "." matches any single character, and the ".*" matches a series of any characters.

Let's try something new. The "[]", or bracket, syntax allows us to specify a group of characters that we may want to match. It does a bit more, but we'll get into that later. For now, let's see an example of the bracket syntax.

grep -e '^.*[123].*$' data.txt
alpha is 1st. Nothing comes before alpha
beta is 2nd. Beta comes after alpha
gamma is 3rd. Gamma comes after beta

This expression can be read as "Match the beginning of the line, any number of characters, any single character from the set of 1,2,3, any number of characters, then the end of the line". This is an extremely useful syntax as it allows you to create patterns like this:

[aA]pril [456789]th

This would match "April 5th", "april 9th", "april 6th", "April 8th" and so on.
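
A quick way to try this pattern is to pipe a few dates into grep. The dates below are made up for this sketch; three of them are there to show what does not match.

```shell
# Sample dates (made up for this illustration).
dates=$(printf 'April 5th\napril 9th\nApril 3rd\nAPRIL 6th\nApril 10th\n')

# [aA] accepts either case for the first letter only; [456789] accepts one digit.
matches=$(printf '%s\n' "$dates" | grep -e '[aA]pril [456789]th')
echo "$matches"
# April 5th
# april 9th
```

"APRIL 6th" is rejected because only the first letter is allowed to vary in case, and "April 10th" is rejected because the character after the space must be one of 4 through 9.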

Sometimes it can get tedious, however, if you're trying to match things that fall into ranges: 0-9, a-z, etc. Conveniently enough, the bracket syntax understands ranges! Here's an example.

grep -e '^.*[2-5].*$' data.txt
beta is 2nd. Beta comes after alpha
gamma is 3rd. Gamma comes after beta
delta is 4th. Delta comes after gamma
epsilon is 5th. comes after delta

So, the [2-5] here simply says "Match any single character in the range 2 through 5". Extremely useful. The bracket syntax can also support the following:

[a-z]          any lowercase letter a through z
[A-Z]          same, but an uppercase letter
[0-9]          any digit 0 through 9
[a-zA-Z]       any letter, capital or otherwise
[a-zA-Z0-9]    any digit or letter
[0-9.]         any digit OR a period
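
A short sketch of a couple of these in action, using made-up input lines. In some locales, ranges such as [a-z] can behave unexpectedly, so the sketch sets LC_ALL=C as a precaution; that setting is an assumption of this example, not something the patterns require.

```shell
# Use the plain C locale so ranges like [a-z] mean exactly the ASCII letters.
LC_ALL=C
export LC_ALL

# Sample lines (made up for this illustration).
sample=$(printf '192.168.0.1\nlocalhost\n3.14\nv2.0\n')

# One character from [0-9.], then any number more, spanning the whole line.
numeric=$(printf '%s\n' "$sample" | grep -e '^[0-9.][0-9.]*$')
echo "$numeric"
# 192.168.0.1
# 3.14

# Count the lines that start with a lowercase letter.
lower=$(printf '%s\n' "$sample" | grep -c -e '^[a-z]')
echo "$lower"     # 2
```

Writing the bracket twice, as in [0-9.][0-9.]*, is a simple way to say "one or more" using only the pieces covered so far.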

Since the goal here is to leave you with a basic knowledge of the most useful aspects of regex, we'll stop here. Basic use of regex is one of the key skills that will allow you to easily swim through log files and other line-oriented data very quickly. If you find yourself needing to scan logs on a regular basis, internalizing some basic regex will make your life a lot easier.

In an upcoming post, I will discuss how to use this as it relates to searching and processing logs.

3 comments:

Anonymous said...

Are you sure you haven't broken the magicians mode of ethics? I don't think they would be happy that someone produced such a short, concise and truly excellent introduction to this subject matter. The fact that I just got a short script to work based purly on this intro is a great commendation on its clarity - mainly because I suck in bash ;-). Really good job!!

Anonymous said...

Excellent one....so simple, short & clear. You saved me from boring 'man' pages ! ;-)

Anonymous said...

You should explain the -e option.