Thursday, April 26, 2007

Basic Regular Expressions with grep



In a previous post, How to Search Logs Using grep Part 1, I talked about some basic concepts such as:
  • basic pipe concepts
    • cat file.txt | sort
  • basic grep usage
    • cat file.txt | grep pattern
  • chaining input from one instance of grep to another
    • cat file.txt | grep pattern1 | grep pattern2
  • inverse grepping
    • cat file.txt | grep -v pattern
Have a look at this post if you want more on the theory and practice behind the basic use of grep. For this installment, we will look at regular expression patterns and some other techniques that will allow you to quickly set up extremely accurate patterns on the fly for just about anything you'd like to match. So, let's get right down to it!

Regular Expressions

Regular expressions, or "regex" for short, are a huge topic with a pile of books all their own. For our purposes, though, we can do a lot with just the basics. The idea is to use a special syntax to represent characters or groups of characters in a line. Regex matching can be considered line-oriented, just like grep, so they're perfect for each other. Here are some basic regex patterns.


^       Matches the beginning of the line, before the first character
$       Matches the end of the line, after the last character
.       Matches a single character. Any character at all
.*      Matches any number of characters (including none)
a       Matches the letter a, for example
[xyz]   Matches one x, one y or one z
[xyz]*  Matches any number of x, y, and z characters. "xyzzy" would be matched, for instance.

If you've never used regex before, the info above is going to be pretty confusing. Don't worry about it now; we'll just jump right into some examples using grep.

Basic Regex with grep

For these examples, we will use a file named data.txt containing the following lines:

alpha is 1st. Nothing comes before alpha
beta is 2nd. Beta comes after alpha
gamma is 3rd. Gamma comes after beta
delta is 4th. Delta comes after gamma
epsilon is 5th. comes after delta

When I first introduced grep, I suggested that you do something like this:

cat data.txt | grep pattern

grep can, in fact, read a file on its own without the need to pipe data into it. This is a shortcut since grep is so commonly used on files. Here's the syntax we will be using:

grep 'pattern' filename

Pretty simple. We will be using single quotes around patterns from now on. Unix/Linux shells can misinterpret parts of our patterns when they contain special characters, so single quotes are the easiest way to tell the shell to leave the pattern alone and pass it as one chunk to grep for it to deal with.
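To see why the quotes matter, here is a small sketch. The scratch directory and the file names alpha.txt and apple.txt are made up for the demo, and it uses echo rather than grep so the shell's behavior is easy to see on its own.

```shell
# Work in a scratch directory so the demo is predictable.
# (alpha.txt and apple.txt are hypothetical file names.)
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1
touch alpha.txt apple.txt

# Unquoted: the shell expands a* into matching file names
# before the command ever sees it.
unquoted=$(echo a*)

# Single-quoted: the pattern reaches the command untouched.
quoted=$(echo 'a*')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

The same thing happens to an unquoted grep pattern: grep would receive file names instead of the pattern you typed.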

Let's get started on some basic regex patterns. Suppose we would like to search our file for any lines beginning with the word "alpha". Let's try it without regex.

grep 'alpha' data.txt
alpha is 1st. Nothing comes before alpha
beta is 2nd. Beta comes after alpha

Problem! We got back two lines since the word "alpha" is in both of them. Let's use a basic regular expression to be more specific.

grep -e '^alpha' data.txt
alpha is 1st. Nothing comes before alpha

That works. We used the ^ to signify that we want to match the beginning of the line, followed by "alpha". So this can be read as "Match the beginning of the line, then the letters alpha". It's also interesting to note here that we are telling it to match 5 characters, a l p h a, not necessarily the word "alpha". More on that in a bit.
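Here is a quick sketch of that "5 characters, not the word" point. The sample lines are made up for illustration and are not part of data.txt.

```shell
# Hypothetical sample lines (not from data.txt).
sample='alpha is 1st
alphabet soup
beta follows alpha'

# ^alpha matches any line that starts with the characters a-l-p-h-a,
# whether or not they form the whole word "alpha".
matches=$(printf '%s\n' "$sample" | grep -e '^alpha')

printf '%s\n' "$matches"
```

Both "alpha is 1st" and "alphabet soup" come back, because each begins with those five characters. The third line is skipped because its "alpha" is not at the start of the line.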

Let's use the same pattern again, but modify it to illustrate the '.' as a replacement for a single character.

grep -e '^.lpha' data.txt
alpha is 1st. Nothing comes before alpha

This can be read as "Match the beginning of the line, any single character at all, then the letters lpha". So, it would match alpha, Alpha, blpha, 6lpha, etc. Any single character at all in that position would be a valid match.

Let's work with the "." a bit more, since it's simple but flexible. If we combine the "." with the "*", then we can build the expression for "Match any number of any characters". Example:

grep -e '^gamma.*beta$' data.txt
gamma is 3rd. Gamma comes after beta

This can be read as "Match the beginning of the line, the letters gamma, any number of characters, then the letters beta, then the end of the line". Since we're now working with a complete regular expression describing a full line, it's good form to tell grep where the end of the line is by using '$'. It's not strictly necessary here, but it's a good habit to think in these terms.
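For a case where the '$' does change the result, here is a sketch with two made-up lines. The second one is hypothetical and does not appear in data.txt.

```shell
# Two sample lines; only the first one ends with "beta".
sample='gamma is 3rd. Gamma comes after beta
gamma comes after beta, but before delta'

# Without $, "beta" may appear anywhere after "gamma".
without_anchor=$(printf '%s\n' "$sample" | grep -e '^gamma.*beta')

# With $, "beta" must be the last thing on the line.
with_anchor=$(printf '%s\n' "$sample" | grep -e '^gamma.*beta$')

echo "--- without \$:"
printf '%s\n' "$without_anchor"
echo "--- with \$:"
printf '%s\n' "$with_anchor"
```

The unanchored pattern returns both lines, while the anchored one returns only the line that actually ends in "beta".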

So, we've actually done something new here... we've matched two terms we were looking for using one single expression. We've also told it the order in which the terms appear and where they are in the line of text. Another example to really illustrate what we've done so far:

grep -e '^gamma.*Ga..a .om.s a..er be.*$' data.txt
gamma is 3rd. Gamma comes after beta

So, that looks pretty complicated, and I won't begin to try to explain each item matched. What is important to note is that the "." matches any single character, and the ".*" matches a series of any characters.

Let's try something new. The "[]", or bracket, syntax allows us to specify a group of characters that we may possibly want to match. It does a bit more, but we'll get into that later. For now, let's see an example of the bracket syntax.

grep -e '^.*[123].*$' data.txt
alpha is 1st. Nothing comes before alpha
beta is 2nd. Beta comes after alpha
gamma is 3rd. Gamma comes after beta

This expression can be read as "Match the beginning of the line, any number of characters, any single character from the set of 1,2,3, any number of characters, then the end of the line". This is an extremely useful syntax as it allows you to create patterns like this:

[aA]pril [456789]th

This would match "April 5th", "april 9th", "april 6th", "April 8th" and so on.
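Here is that pattern run through grep as a sketch. The sample dates are made up to show what does and does not match.

```shell
# Hypothetical sample input, one date per line.
sample='April 5th
april 9th
April 3rd
april 10th
APRIL 6th'

# [aA] accepts either case for the first letter only;
# [456789] accepts exactly one of those digits right before "th".
matches=$(printf '%s\n' "$sample" | grep -e '[aA]pril [456789]th')

printf '%s\n' "$matches"
```

Only "April 5th" and "april 9th" come back. "April 3rd" has the wrong digit and suffix, "april 10th" has a 1 where the pattern wants a digit from the set, and "APRIL 6th" fails because only the first letter is allowed to be uppercase.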

Sometimes it can get tedious, however, if you're trying to match things that fall into ranges: 0-9, a-z, etc. Conveniently enough, the bracket syntax understands ranges! Here's an example.

grep -e '^.*[2-5].*$' data.txt
beta is 2nd. Beta comes after alpha
gamma is 3rd. Gamma comes after beta
delta is 4th. Delta comes after gamma
epsilon is 5th. comes after delta

So, the [2-5] here simply says "Match any single character that ranges between 2 and 5". Extremely useful. The bracket syntax can also support the following:

[a-z]        any lowercase letter a through z
[A-Z]        same, but an uppercase letter
[0-9]        any digit zero through 9
[a-zA-Z]     any letter, capital or otherwise
[a-zA-Z0-9]  any digit or letter
[0-9.]       any digit OR a period
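A short sketch of a few of these in action, using made-up sample lines. One assumption to flag: how ranges like [a-z] treat uppercase letters can depend on your locale settings, so this sketch sets LC_ALL=C for each grep to get plain ASCII ranges.

```shell
# Hypothetical sample lines (not from data.txt).
sample='alpha is 1st
Beta is 2nd
3 is a number
.5 is a fraction'

# Lines that start with a lowercase letter.
lower_start=$(printf '%s\n' "$sample" | LC_ALL=C grep -e '^[a-z]')

# Lines that start with an uppercase letter.
upper_start=$(printf '%s\n' "$sample" | LC_ALL=C grep -e '^[A-Z]')

# Lines that start with a digit or a period. Inside brackets,
# the "." is a literal period, not "any character".
digit_or_dot=$(printf '%s\n' "$sample" | LC_ALL=C grep -e '^[0-9.]')

echo "lowercase start: $lower_start"
echo "uppercase start: $upper_start"
echo "digit or period start:"
printf '%s\n' "$digit_or_dot"
```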

Since the goal here is to leave you with a basic knowledge of the most useful aspects of regex, we'll stop here. Basic use of regex is one of the key skills that will allow you to easily swim through log files and other line-oriented data very quickly. If you find yourself needing to scan logs on a regular basis, internalizing some basic regex will make your life a lot easier.

In an upcoming post, I will discuss how to use this as it relates to searching and processing logs.

Sunday, April 01, 2007

How to Search Logs Using grep, Part 1



Here is something that I could write a book about.. or a few good chapters on. grep is one of the key tools in the traditional Unix arsenal for tearing through text files and finding exactly what you want very, very quickly. It doesn't take long to master if you have the right tools.

First, you're going to need to understand how to use pipes. If you aren't familiar with pipes or don't use them regularly, it would definitely be worth your while to dig into this. If folks would be interested in a complete pipe tutorial here, or know of a good one online, please comment. That said, I will give a short overview.

Suppose we have a text file called "data.txt" with the following contents:

delta is 4th
alpha is 1st
gamma is 3rd
beta is 2nd

The following command would display the contents of that file in your terminal:

cat data.txt
delta is 4th
alpha is 1st
gamma is 3rd
beta is 2nd

What "cat data.txt" did was really read the file, line by line, and output it to your terminal, line by line. Yes, line by line, is the key term here. The term for "output to terminal" is standard out. We will use that going forward.

Suppose I wanted to do something useful to this data. I can combine the "cat" command with the "sort" command. What "sort" does is read, line by line, everything you give it until it detects the end of the file. Suppose we type the following command:

sort

It just sits there, doing nothing. It's waiting for some data to come in on the terminal (or, better termed, standard input). That's pretty useless most of the time! But remember that "cat" will read a file and send it, line by line, to the terminal? Well, using a pipe, we can take those lines from cat and feed them into sort.

cat data.txt | sort
alpha is 1st
beta is 2nd
gamma is 3rd
delta is 4th

Now we have something useful! What we have done above can be described by this statement: "Take the output of cat data.txt and pipe it through sort". Many, many commands in Unix (or Linux/MacOS/etc) will act like "sort" did and accept input line by line. By stringing together commands that print output to the terminal and commands that read from the terminal, you can do some very powerful things. grep is one of those commands.


grepping

Now that we have the basics of pipes squared away, we can get into some more interesting and useful stuff. grep can be described as a program that reads from standard input, tests each line against a pattern, and writes to standard output the lines that match this pattern. It can do a lot more, but this is a good working definition to start. Here's an example:

cat data.txt | grep gamma
gamma is 3rd

What we've done here is tell "cat" to read every line of the file "data.txt" and pipe it into grep. grep took each line that came in and checked to see if the pattern "gamma" appeared on that line. When it did, it displayed the line. What happens if no lines match the pattern?

cat data.txt | grep epsilon


grep only outputs the lines that match. If no lines match, then nothing is sent to standard output.
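Even when grep prints nothing, it still tells the shell whether it found anything through its exit status: 0 when at least one line matched, 1 when none did. Here is a sketch using the same four lines as data.txt, fed in from a variable so it runs anywhere. The -q option just keeps grep quiet so that only the exit status is used.

```shell
# Same contents as data.txt, held in a variable for the demo.
sample='delta is 4th
alpha is 1st
gamma is 3rd
beta is 2nd'

# grep -q prints nothing; "if" looks only at the exit status.
if printf '%s\n' "$sample" | grep -q 'gamma'; then
    found_gamma=yes
else
    found_gamma=no
fi

if printf '%s\n' "$sample" | grep -q 'epsilon'; then
    found_epsilon=yes
else
    found_epsilon=no
fi

echo "gamma found:   $found_gamma"
echo "epsilon found: $found_epsilon"
```

This is handy in scripts where you only care whether a pattern shows up, not which lines contain it.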

Note that grep reads in lines from standard input and outputs lines to standard output. That means it can be both a consumer and a provider of lines for other commands that can process standard input. That is huge... More on that later.

Let's try a more complex example with the same file.

cat data.txt | grep l
delta is 4th
alpha is 1st

Great, we matched every line with an "l" (the letter l) in it and displayed it to standard output. Looks like it's out of order, though, so let's sort it after it comes out of grep.

cat data.txt | grep l | sort
alpha is 1st
delta is 4th

So we had "cat" read data.txt line by line, piped it through grep looking for "l" and piped the results through sort. You can chain commands like this indefinitely as long as they're reading from standard in and outputting to standard out.

Let's try something else:

cat data.txt | grep l | grep p
alpha is 1st

grep can read another grep's output!

Let's work on some logs now. Suppose I have an apache log where I'd like to see all of the lines that match a hit to a certain URL. Let's try this:


cat /var/log/httpd/access.log | grep "GET /signup.jsp"
4.2.2.1 - - [01/Apr/2007:18:19:45 -0700] "GET /signup.jsp HTTP/1.1" 200 4664 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3"
10.1.1.1 - - [01/Apr/2007:18:22:48 -0700] "GET /signup.jsp HTTP/1.1" 200 4664 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11"
192.168.0.1 - - [01/Apr/2007:18:23:08 -0700] "GET /signup.jsp HTTP/1.1" 200 4664 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11"


Great. Now we have searched the entire log and pulled out only the hits to that particular URL. What if I wanted to know who came in on a Mac?

cat /var/log/httpd/access.log | grep "GET /signup.jsp" | grep "Mac OS X"
4.2.2.1 - - [01/Apr/2007:18:19:45 -0700] "GET /signup.jsp HTTP/1.1" 200 4664 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3"



That covers basic grepping. To review, you can chain as many grep commands as you like. This allows you to filter the output of one grep command with a more specific pattern.

grep has some more useful options as well:

grep -v pattern

The -v option will search for "pattern" and show you the lines that DON'T match. This is useful for ignoring lines. For example, suppose you wanted to see all the hits to the signup.jsp page on your website that did NOT come from your company's firewall (say it's 4.2.2.1 for the sake of argument).

cat /var/log/httpd/access.log | grep "GET /signup.jsp" | grep -v 4.2.2.1
10.1.1.1 - - [01/Apr/2007:18:22:48 -0700] "GET /signup.jsp HTTP/1.1" 200 4664 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11"
192.168.0.1 - - [01/Apr/2007:18:23:08 -0700] "GET /signup.jsp HTTP/1.1" 200 4664 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.11) Gecko/20070312 Firefox/1.5.0.11"
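One thing to watch for with the pattern 4.2.2.1 above: grep treats each "." as "any single character", so the pattern is looser than it looks. Here is a sketch with abbreviated, made-up log lines (the address 4.212.1.5 is hypothetical) comparing the loose pattern against one that escapes the dots and anchors the address to the start of the line.

```shell
# Abbreviated, hypothetical log lines for illustration only.
log='4.2.2.1 - - "GET /signup.jsp HTTP/1.1" 200
4.212.1.5 - - "GET /signup.jsp HTTP/1.1" 200
10.1.1.1 - - "GET /signup.jsp HTTP/1.1" 200'

# Loose: each "." matches any character, so "4.212.1" also fits
# the pattern and the second line gets dropped by accident.
loose=$(printf '%s\n' "$log" | grep -v 4.2.2.1)

# Strict: "\." is a literal period, ^ pins the match to the start
# of the line, and the trailing space ends the address.
strict=$(printf '%s\n' "$log" | grep -v '^4\.2\.2\.1 ')

echo "--- loose pattern keeps:"
printf '%s\n' "$loose"
echo "--- strict pattern keeps:"
printf '%s\n' "$strict"
```

The loose version keeps only the 10.1.1.1 line, while the strict version keeps both the 4.212.1.5 and 10.1.1.1 lines.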



Just for fun, let's use the "wc", or word count, command.

cat /var/log/httpd/access.log | grep "GET /signup.jsp" | grep -v 4.2.2.1 | wc -l
2


So, we catted our access.log, piped it through grep for our signup URL, piped those results through grep to filter out lines containing our IP address, and piped that through word count to show the number of lines in the result. We got two log lines that matched.
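As one of those shortcuts, grep can do the counting itself with the -c option, which prints the number of matching lines instead of the lines themselves. A sketch with abbreviated, made-up log lines standing in for access.log:

```shell
# Abbreviated, hypothetical log lines for illustration only.
log='4.2.2.1 - - "GET /signup.jsp HTTP/1.1" 200
10.1.1.1 - - "GET /signup.jsp HTTP/1.1" 200
192.168.0.1 - - "GET /signup.jsp HTTP/1.1" 200
10.1.1.1 - - "GET /index.html HTTP/1.1" 200'

# The long way: pipe the final result through wc -l.
# (tr strips the padding some wc implementations add.)
piped_count=$(printf '%s\n' "$log" | grep "GET /signup.jsp" | grep -v 4.2.2.1 | wc -l | tr -d ' ')

# The shortcut: let the last grep count with -c.
direct_count=$(printf '%s\n' "$log" | grep "GET /signup.jsp" | grep -v -c 4.2.2.1)

echo "wc -l says:   $piped_count"
echo "grep -c says: $direct_count"
```

Both approaches report the same number, so which one to use is mostly a matter of taste.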

This really is the tip of the iceberg for grep and what it can do for you in processing your logs. I will follow up with part two in the coming days where I will cover more complex patterns and some shortcuts. There are easier ways to do all of these examples, but this should help you to understand how it works and give you the tools to start using it today.