Your first steps with regular expressions - the essentials

When you search for something at the Linux command line, you’ll typically come across two different types of search patterns.

First, there are the patterns you probably already use intuitively for file operations, as in cp *.pdf /tmp, which means “copy all files ending with .pdf to /tmp”. These patterns are known as filename globbing. You typically use them when you want to address multiple files at once on the command line.

The second type of pattern you’ll come across is the regular expression. At first glance, regular expressions might look similar to filename globbing, but they operate very differently. And more importantly, they’re incredibly powerful for searching.

For example, let’s look at a regular expression for matching email addresses:

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

And here’s one for matching an IPv4 address:

\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b

A very prominent tool you will typically use in combination with regular expressions is grep, so I’ll use it for the illustrations here. (For an introduction to using grep see Searching with grep: The Essentials).

Over time you’ll find that many other Linux command-line tools, such as find, sed and awk, just to name a few, can also leverage the power of regular expressions.

From the examples above you can guess that regular expressions can quickly become quite complex. But don’t worry - as a starting point you can achieve quite a lot with just a few basics. And these basics are typically referred to as “Basic Regular Expressions (BRE)” or simply standard regular expressions.

I typically use the phrase “standard regular expressions” in courses and conversations as a contrast to the much more complex “Extended Regular Expressions (ERE)”.

The most important patterns of standard regular expressions

Let’s dive into the use of standard regular expressions with an example:

Let’s say you want to search through the file “/etc/passwd” (this file contains the locally defined users of a system) for the user named “max”.

To do this search with grep, you first need to know the layout of the file, which is really straightforward: every single line describes the properties of a user in seven fields, and these fields are separated by colons (“:”).

Here are a few possible lines from “/etc/passwd” for illustration.

...
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false
...

Search for something at the beginning of a line

If we are now searching for the defined user “max”, we could simply try the following:

grep max /etc/passwd

But because the user “test1” has a home directory that contains the phrase “max” too (have a look at the 6th field), we need to specify exactly what we are looking for: the phrase “max” at the very beginning of a line (the first field contains the username), followed by a colon.

To describe this, we use a special pattern from the regular expressions - the “caret” character ^. This character is the marker for the start of the line.

^ marks the start of a line.

With this in mind, the usage of grep with the regular expression to search for the line defining the user “max” would look like this:

grep '^max:' /etc/passwd

Hint: Because regular expressions often contain characters that may be interpreted by the shell itself, it is common practice to surround the whole expression with single quotation marks ' ... '.
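To try this without touching the real file, the following sketch builds a small sample file from the example lines above and runs both searches against it. The temporary file and the variable names (passwd_sample, unanchored_count, anchored_match) are made up for this illustration.

```shell
# Hypothetical sample data standing in for /etc/passwd.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false
EOF

# Unanchored: counts every line that contains "max" anywhere.
unanchored_count=$(grep -c 'max' "$passwd_sample")

# Anchored: only the line whose first field is exactly "max".
anchored_match=$(grep '^max:' "$passwd_sample")

echo "lines containing max: $unanchored_count"
echo "line for user max:    $anchored_match"

rm -f "$passwd_sample"
```

The unanchored search also counts the “test1” line because of its home directory, while the anchored pattern returns only the line for the user “max”.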

Search for something at the end of a line

Sometimes, you’ll need to search for something at the end of a line. For this, you can use the special character $ as the marker for the end of a line in a regular expression.

$ marks the end of a line.

For example, if we want to search the sample data from above for the keyword “false” at the end of a line, we could use the following command:

grep 'false$' /etc/passwd

Here, grep searches for the string “false” followed directly by the end of a line.
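As a small sketch of why the anchor matters, the sample below adds an invented user “falsetto” whose line contains “false” but does not end with it. The file, the user and the variable names are assumptions for this example only.

```shell
# Hypothetical sample data; the "falsetto" user is invented for this example.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false
falsetto:x:1102:1102:Falsetto:/home/falsetto:/bin/bash
EOF

# Without the anchor, "false" is found anywhere in the line.
anywhere=$(grep 'false' "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -)

# With the anchor, only lines that end in "false" qualify.
at_end=$(grep 'false$' "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -)

echo "contains false:  $anywhere"
echo "ends with false: $at_end"

rm -f "$passwd_sample"
```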

Using wildcards in regular expressions

Often, you won’t know the exact phrase you’re looking for, but you know roughly “how it looks”. In these situations, you’ll need to use wildcards in your regular expressions.

Search for an instance of any character

Let’s start with the “any character” wildcard - the “.” (a single dot). This pattern stands for exactly one character, no matter what character it is.

. is the placeholder for any character.

If we type

grep '^r..t:' /etc/passwd

then this command line will find every line

starting with an “r”,
followed by any two characters,
followed by the sequence “t:”

For example, it would match lines that start with “rest:”, “rent:”, “rant:”, or the line starting with “root:”.
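Here is a minimal sketch of that behaviour against invented sample lines. The users “rest”, “rt” and “robot” and the variable names are hypothetical.

```shell
# Hypothetical sample data with usernames of different lengths.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
rest:x:1201:1201:Rest:/home/rest:/bin/bash
rt:x:1202:1202:Rt:/home/rt:/bin/bash
robot:x:1203:1203:Robot:/home/robot:/bin/bash
EOF

# "r", exactly two arbitrary characters, then "t:".
matched_users=$(grep '^r..t:' "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -)

echo "matched: $matched_users"

rm -f "$passwd_sample"
```

Each dot consumes exactly one character, so “rt” (too short) and “robot” (too long) fall through.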

Search for a character that’s in a given set

The second wildcard also stands for a single character, but this time you specify which characters are allowed at that position.

The pattern for this type of wildcard is the set of allowed characters surrounded by square brackets.

[abc] is the placeholder for a single character from the given set a,b,c

The following example searches for lines that start with the character ‘t’, ‘m’, or ‘r’:

grep '^[tmr]' /etc/passwd

This pattern matches all three example lines from above together with the line containing the user “root”. So the output could look like this:

root:x:0:0:root:/root:/bin/bash
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false

Search for a character that’s not in a given set

Sometimes you need to formulate your pattern the other way around: You want to search for a character that is not in a given set.

As an example, let’s see if we can find lines in “/etc/passwd” that do not start with a lowercase “t”, “m” or “r”. For this, we can negate the character set we used above by adding a caret (the “^” character) as the first character of the set.

[^abc] is the placeholder for a single character not from the given set a,b,c

So, if grep '^[tmr]' /etc/passwd gives you every line starting with a “t”, an “m”, or an “r”, then the following will give you all the other lines:

grep '^[^tmr]' /etc/passwd

This looks a bit weird at first, because the “^” in front of the square brackets describes the beginning of a line, but as the first character within the brackets it negates the set.
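The two searches can be compared side by side with a short sketch like the following. The sample file and the variable names are assumptions; the lines are taken from the examples in this article.

```shell
# Hypothetical sample data standing in for /etc/passwd.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
lp:x:37:37:lp:/tmp:/bin/false
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false
EOF

# First character is one of t, m, r.
in_set=$(grep '^[tmr]' "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -)

# First character is anything except t, m, r.
not_in_set=$(grep '^[^tmr]' "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -)

echo "starts with t, m or r: $in_set"
echo "starts with other:     $not_in_set"

rm -f "$passwd_sample"
```

Note that the negated set still needs one character to match, so a completely empty line would show up in neither result.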

Example task: find all lines with four-character user names in “/etc/passwd”

To give you a more concrete example for using these wildcards, let’s extract all lines containing four-character user names from “/etc/passwd”. (And yes - in the real world out there, there is often an urgent need to search for four-character usernames ;-) )

We could simply start with the following:

grep '^....:' /etc/passwd

This searches for all lines starting with any four characters followed by a colon.

If you try this at home, then chances are high that you will get lines describing two-character usernames too. You could get for instance the following line (among others):

lp:x:37:37:lp:/tmp:/bin/false

And I think it’s obvious why: the character “:” is also matched by the “.” - not only the letters are.

The next approach could be to search for four characters from a given set:

grep '^[a-z][a-z][a-z][a-z]:' /etc/passwd

As you can see, you can not only put single characters between the brackets but also use ranges, with a “-” sign between the start and the end character of the range.

At first glance this approach solves the given task. But on second thought, you will realize that there could be more four-character usernames defined on a system. Just think of capital letters or digits that may be included in a username.

We could simply address this by just adding these characters to our character sets:

grep '^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]:' /etc/passwd

BUT … aren’t there other characters allowed in usernames too? The dot, the underscore, the … Ok - let’s read the documentation and add them all.

STOP! Let’s do it in a simpler way!

We do not need to know all the allowed characters within a username. All we need to know is that one single character simply cannot be a component of a username - the colon “:”. Otherwise, the file format with the “:” as a separator wouldn’t work, would it?

So let’s simply do the search with four negated character sets that each exclude only the colon:

grep '^[^:][^:][^:][^:]:' /etc/passwd

Oh yes - this looks nerdy ;o)

But it will give us exactly what we are looking for: Every line, starting with four characters that are not a colon, followed by a colon. Aka - all lines with four-character usernames.
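To compare the four attempts directly, here is a sketch that runs each pattern against the same sample file. The users “Max1” and “j.do”, the helper function usernames_matching and the variable names are invented for this illustration. LC_ALL=C is set only so that the ranges behave the same in every locale.

```shell
# Hypothetical sample data with a mix of username styles.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
lp:x:37:37:lp:/tmp:/bin/false
tux:x:1099:1099:Tux:/home/tux:/bin/bash
Max1:x:1100:1100:Max:/home/max:/bin/bash
j.do:x:1101:1101:J. Doe:/home/j.do:/bin/bash
EOF

# Hypothetical helper: print the usernames of all lines matching a pattern.
usernames_matching() {
    LC_ALL=C grep -e "$1" "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -
}

any_four=$(usernames_matching '^....:')                      # dot also matches ":"
lower_four=$(usernames_matching '^[a-z][a-z][a-z][a-z]:')    # lowercase only
alnum_four=$(usernames_matching '^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]:')
not_colon_four=$(usernames_matching '^[^:][^:][^:][^:]:')    # anything but ":"

echo "any four characters:  $any_four"
echo "four lowercase:       $lower_four"
echo "four alphanumerics:   $alnum_four"
echo "four non-colons:      $not_colon_four"

rm -f "$passwd_sample"
```

Only the last pattern finds all three four-character usernames without also picking up “lp”.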

How about any character any number of times?

If you want to include a pattern that matches any character any number of times in your expression, then you need to combine two things: First the pattern for “any character” you already know - the dot (“.”) and second, a repetition operator.

A repetition operator specifies how many times the preceding element (for example a single character, the dot, or a character set) must be matched.

One simple-to-use repetition operator is the asterisk (“*”).

This operator specifies that the preceding element may repeat any number of times - including zero times.

For instance,

grep '^ro*t:' /etc/passwd

would find any line that starts with an “r”, followed by zero or more occurrences of an “o”, directly followed by “t:”.

This would match for instance the following lines:

rt:....
rot:....
root:....
rooot:....

And if we combine the . for any character with the repetition operator * to .*, then we will get the pattern for any character any number of times.

.* is the pattern for any character any number of times.
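The difference between o* and .* can be sketched as follows. The sample users and variable names are made up for the example.

```shell
# Hypothetical sample data; all users except root and tux are invented.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
rt:x:1301:1301::/home/rt:/bin/bash
rot:x:1302:1302::/home/rot:/bin/bash
root:x:0:0:root:/root:/bin/bash
rooot:x:1303:1303::/home/rooot:/bin/bash
rat:x:1304:1304::/home/rat:/bin/bash
tux:x:1099:1099:Tux:/home/tux:/bin/bash
EOF

# "r", zero or more "o", then "t:".
star_o=$(grep '^ro*t:' "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -)

# "r", zero or more arbitrary characters, then "t:".
dot_star=$(grep '^r.*t:' "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -)

echo "ro*t: $star_o"
echo "r.*t: $dot_star"

rm -f "$passwd_sample"
```

Keep in mind that .* also matches colons, so in a file like “/etc/passwd” it can reach across field boundaries.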

Two other useful repetition operators

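The two other useful repetition operators I want to show you here consist of two characters. Both are GNU extensions to the basic regular expressions: they work with GNU grep, but POSIX does not guarantee them for every grep implementation.

“\+” … this operator specifies that the preceding element must be seen at least once.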

So the regular expression ^ro\+t: wouldn’t match the line starting with “rt:”, but it would match all the other lines from above.

“\?” … this operator specifies that the preceding element is optional. It can be seen zero times or exactly once.

If we use this for the regular expression as in ^ro\?t:, then this would match only lines starting with “rt:” or with “rot:”

Hint: The repetition operators can of course be combined with any other pattern you use. For instance, the pattern [0-9]\+ matches any sequence of one or more digits.
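A short sketch of both operators, assuming GNU grep (for \+, \? and the -o option, which prints only the matching part of a line). The sample users and variable names are hypothetical.

```shell
# Hypothetical sample data; all users except root are invented.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
rt:x:1301:1301::/home/rt:/bin/bash
rot:x:1302:1302::/home/rot:/bin/bash
root:x:0:0:root:/root:/bin/bash
rooot:x:1303:1303::/home/rooot:/bin/bash
EOF

# At least one "o" between "r" and "t:".
one_or_more=$(grep '^ro\+t:' "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -)

# At most one "o" between "r" and "t:".
zero_or_one=$(grep '^ro\?t:' "$passwd_sample" | cut -d: -f1 | paste -s -d ' ' -)

# Combined with a character set: every run of digits in the input.
digit_runs=$(printf 'uid=1100 gid=42\n' | grep -o '[0-9]\+' | paste -s -d ' ' -)

echo "ro\\+t: $one_or_more"
echo "ro\\?t: $zero_or_one"
echo "digit runs: $digit_runs"

rm -f "$passwd_sample"
```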

Search for special characters

There will be times when you need to search directly for the special characters we used above to create our patterns. But how can we use these characters in regular expressions without them being misinterpreted? Perhaps you guessed it: we need to escape them!

You can do this by simply adding a backslash (“\”) in front of the special character.

Escape special characters in a regular expression by preceding them with a backslash (“\”).

Let’s pretend that we want to search for the top-level domain “.com” in the file “access.log”. To avoid having the tool (here grep) interpret the dot as “any character”, we prepend it with a backslash to remove this special meaning:

grep '\.com' access.log

Now let’s pretend we want to search for the phrase “[2024\11\11]” in the same file. (This is an often used format for dates in log files.)

Here we need to escape three special characters to avoid confusion. Do you see them all?

The first two are easy: The opening and the closing square brackets.

The third one is the backslash itself: we don’t want it to escape the two digits in the string “[2024\11\11]”. Instead we really want to search for a literal backslash at the given position, so we need to escape the backslash too.

And to escape it, we again simply add a backslash in front of it, so that it becomes a double backslash (“\\”).

So if we escape all of them, the two brackets and the backslashes, our command line would look like this:

grep '\[2024\\11\\11\]' access.log

Hint: If you use backslashes to escape special characters in regular expressions on the command line, surround the regular expression with single quotation marks. If you forget this, the backslashes will be interpreted by the shell as escape markers, and the tool interpreting the regular expression won’t see them at all. (See “If your shell always get’s you wrong - quote the right way”.)
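Both points, the escaped dot and the effect of the quotes, can be checked with a sketch like this. The log lines, host names and variable names are invented for the illustration.

```shell
# Hypothetical log lines; the host names are made up.
access_log=$(mktemp)
cat > "$access_log" <<'EOF'
GET http://example.com/index.html
GET http://telecom-news.org/start
EOF

# Unescaped dot: "any character" + "com" also matches the "ecom" in "telecom".
unescaped_count=$(grep -c '.com' "$access_log")

# Escaped dot: only a literal ".com" matches.
escaped_count=$(grep -c '\.com' "$access_log")

# What a command actually receives: without quotes the shell removes the backslash.
unquoted_arg=$(printf '%s\n' \.com)
quoted_arg=$(printf '%s\n' '\.com')

echo "lines matching .com:  $unescaped_count"
echo "lines matching \\.com: $escaped_count"
echo "unquoted argument: $unquoted_arg"
echo "quoted argument:   $quoted_arg"

rm -f "$access_log"
```

printf is used here only to make the argument visible; grep would receive exactly the same strings.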

How about the two examples from the beginning

At the beginning I gave you two examples of complex-looking regular expressions. Here they are again:

Searching for an email address:

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

Searching for an IP address:

\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b

I hope you can now recognize a few patterns we discussed above in these expressions. But there are a few more included I haven’t discussed here. For instance the following:

\b … this marks a word boundary
{3} … this is a repetition marker that requires the preceding pattern to occur exactly three times
(var1|var2|var3) … this is a pattern that matches one of the given variants “var1”, “var2” or “var3”.

The last two have one thing in common: written like this, they are part of the extended regular expressions (and there are many more). The word boundary \b is a GNU extension that GNU grep accepts in basic and extended expressions alike.

The use of these extended regular expressions will be covered in a follow-up article.

But if you want to explore them on your own, be aware that the grep command, by default, only interprets the standard or basic regular expressions.

So if you want to use parts of the extended regular expressions in your pattern, you need to call grep with the added command-line switch -E. The older egrep command does the same, but recent versions of GNU grep mark it as obsolescent and print a warning.

To use extended regular expressions with grep, use grep -E ... (or the older egrep).
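The effect of the switch can be sketched as follows, assuming GNU grep (for \b and the -o option). The sample file, the log lines fed in via printf and the variable names are made up; the IP pattern is the one from the beginning of this article.

```shell
# Hypothetical sample data standing in for /etc/passwd.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false
EOF

# Basic mode: "(", "|" and ")" are ordinary characters, so nothing matches.
# "|| true" only keeps the non-zero exit status of grep from stopping a script.
basic_count=$(grep -c '^(tux|max):' "$passwd_sample" || true)

# Extended mode: the same pattern now means "tux or max".
extended_count=$(grep -E -c '^(tux|max):' "$passwd_sample")

# The IP address pattern from the beginning, printing only the matched part.
ipv4='\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b'
found_ips=$(printf '%s\n' 'client 192.168.1.10 connected' 'client 999.1.1.1 rejected' \
    | grep -E -o "$ipv4" | paste -s -d ' ' -)

echo "basic mode matches:    $basic_count"
echo "extended mode matches: $extended_count"
echo "addresses found:       $found_ips"

rm -f "$passwd_sample"
```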

Here is what to do next

If you followed me through this article, you have certainly realized that knowing some internals of how things work at the Linux command line can save you a lot of time and frustration.

And sometimes it’s just fun to leverage these powerful mechanics.

If you wanna know more about such “internal mechanisms” of the Linux command line, explained especially for Linux beginners, have a look at “The Linux Beginners Framework”.

In this framework I guide you through 5 simple steps to feel comfortable at the Linux command line.

This framework comes as a free pdf and you can get it here.

Wanna take an unfair advantage?

When it comes to working on the Linux command line - at the end of the day it is always about knowing the right tool for the right task.

And it is about knowing the tools that are most certainly available on the Linux system you are currently on.

To give you all the tools for your day-to-day work at the Linux command line, I have created “The ShellToolbox”.

This book gives you everything

from the very basic commands, through
everything you need for working with files and filesystems,
managing processes,
managing users and permissions, through
software management,
hardware analysis and
simple shell-scripting to the tools you need for
doing simple “networking stuff”.

Everything in one single, easy to read book. With explanations and example calls for illustration.

If you are interested, go to shelltoolbox.com and have a look (as long as it is available).
