Your first steps with regular expressions - the essentials
When you search for something at the Linux command line, you’ll typically come across two different types of search patterns.
First, there are the ones you probably already use intuitively for file operations, like in cp *.pdf /tmp - which means “copy all files ending with .pdf to /tmp”. These patterns are known as filename globbing. You typically use them when you want to address multiple files at once on the command line.
The second type of pattern you’ll come across is the regular expression. At first glance, regular expressions might look similar to filename globbing, but they operate very differently. And more importantly, they’re incredibly powerful for searching.
For example, let’s look at a regular expression for matching email addresses:
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
And here’s one for matching any valid IPv4 address:
\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b
A very prominent tool you will typically use in combination with regular expressions is grep. Therefore, I’ll use it here for the illustrations. (For an introduction to using grep, see Searching with grep: The Essentials.)
But over time you’ll find that many other Linux command-line tools, such as find, sed, and awk, just to name a few, can also leverage the power of regular expressions.
From the examples above you can guess that regular expressions can quickly become quite complex. But don’t worry - as a starting point you can achieve quite a lot with just a few basics. And these basics are typically referred to as “Basic Regular Expressions (BRE)” or simply standard regular expressions.
I typically use the phrase “standard regular expressions” in courses and conversations as a contrast to the much more complex “Extended Regular Expressions (ERE)”.
The most important patterns of standard regular expressions
Let’s dive into the use of standard regular expressions with an example:
Let’s say you want to search through the file “/etc/passwd” (this file contains the locally defined users of a system) for the user named “max”.
To do this search with the command grep, you first need to know the layout of the file, which is really straightforward: Every single line describes the properties of a user in seven fields, and these fields are separated by colons (“:”).
Here are a few possible lines from “/etc/passwd” for illustration.
...
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false
...
Search for something at the beginning of a line
If we are now searching for the defined user “max”, we could simply try the following:
grep max /etc/passwd
But because the user “test1” has a defined home directory that contains the phrase “max” too (have a look into the 6th field), we need to specify what exactly we are looking for: The phrase “max” exactly at the beginning of a line (the first field contains the username), followed by a colon.
To describe this, we use a special pattern from the regular expressions - the “caret” character ^. This character is the marker for the start of the line.
^ marks the start of a line.
With this in mind, the usage of grep with the regular expression to search for the line defining the user “max” would look like this:
grep '^max:' /etc/passwd
Hint: Because regular expressions often contain characters that may be interpreted by the shell itself, it is common practice to surround the whole expression with single quotation marks ' ... '.
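If you want to try this without touching your real “/etc/passwd”, here is a minimal sketch. It assumes a throwaway file created with mktemp that holds the three sample lines from above; the variable names are made up for illustration.

```shell
# Throwaway copy of the sample lines (the file name comes from mktemp)
sample=$(mktemp)
cat > "$sample" <<'EOF'
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false
EOF

# Unanchored search: counts every line that contains "max" somewhere
unanchored_count=$(grep -c 'max' "$sample")

# Anchored search: only the line whose first field is exactly "max"
anchored_match=$(grep '^max:' "$sample")

echo "lines containing max: $unanchored_count"
echo "line defining max:    $anchored_match"

rm -f "$sample"
```

The unanchored search also counts the line of “test1”, because its home directory contains “max”. The anchored search returns only the line that defines the user “max”.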
Search for something at the end of a line
Sometimes, you’ll need to search for something at the end of a line. For this, you can use the special character $ as the marker for the end of a line in a regular expression.
$ marks the end of a line.
For example, if we want to search the sample data from above for the keyword “false” at the end of a line, we could use the following command:
grep 'false$' /etc/passwd
Here, grep searches for the string “false” followed directly by the end of a line.
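To see what the $ really buys you, here is a small sketch. The second input line describes a made-up user “falseuser” that I added only for this illustration; it contains “false”, but not at the end of the line.

```shell
# Two sample lines; "falseuser" is a hypothetical account for this demo
sample_lines='test1:x:1101:1101:Testuser 1:/home/max:/bin/false
falseuser:x:1102:1102:Example:/home/falseuser:/bin/bash'

# Without the anchor, both lines contain "false" somewhere
anywhere_count=$(printf '%s\n' "$sample_lines" | grep -c 'false')

# With the anchor, only the line that ends in "false" is left
at_end=$(printf '%s\n' "$sample_lines" | grep 'false$')

echo "lines containing false: $anywhere_count"
echo "line ending in false:   $at_end"
```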
Using wildcards in regular expressions
Often, you won’t know the exact phrase you’re looking for, but you know roughly “how it looks”. In these situations, you’ll need to use wildcards in your regular expressions.
Search for an instance of any character
Let’s start with the “any character” wildcard - the “.” (a single dot). This pattern stands for exactly one character, no matter what character it is.
. is the placeholder for any single character.
If we type
grep '^r..t:' /etc/passwd
then this command line will find every line
- starting with an “r”,
- followed by any two characters,
- followed by the sequence “t:”
For example, it would match lines that start with “rest:”, “rent:”, “rant:”, or the line starting with “root:”.
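Here is a quick way to check this yourself. The input lines are shortened, made-up examples (not real passwd entries), fed to grep through a pipe.

```shell
# Four hypothetical lines; only those with exactly two characters
# between the leading "r" and the "t:" should survive
matches=$(printf '%s\n' 'root:x:0' 'rest:x:1' 'rt:x:2' 'robot:x:3' | grep '^r..t:')

echo "$matches"
```

The lines starting with “rt:” and “robot:” are dropped: the first has no characters between “r” and “t:”, the second has too many.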
Search for a character that’s in a given set
The second wildcard also describes a single character, but this time it restricts which characters are allowed at that position.
The pattern for this type of wildcard is the set of allowed characters surrounded by square brackets.
[abc] is the placeholder for a single character from the given set a, b, c.
The following example searches for lines that start with the character ‘t’, ‘m’, or ‘r’:
grep '^[tmr]' /etc/passwd
This pattern matches all three example lines from above together with the line containing the user “root”. So the output could look like this:
root:x:0:0:root:/root:/bin/bash
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false
Search for a character that’s not in a given set
Sometimes you need to formulate your pattern the other way around: You want to search for a character that is not in a given set.
As an example, let’s see if we can find lines in “/etc/passwd” that do not start with a lowercase “t”, “m”, or “r”. For this, we can negate the character set we used above by adding a caret (the “^” character) as the first character of the set.
[^abc] is the placeholder for a single character not from the given set a, b, c.
So, if grep '^[tmr]' /etc/passwd gives you every line starting with a “t”, an “m”, or an “r”, then the following will give you all the other lines:
grep '^[^tmr]' /etc/passwd
This looks a bit weird at first, because the “^” in front of the square brackets describes the beginning of a line, but as the first character within the brackets it negates the set.
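The following sketch puts the set and its negation side by side. It assumes six sample lines; the “daemon” and “lp” entries are typical but only illustrative, and the variable names are mine.

```shell
# Illustrative sample: the lines from above plus two typical system accounts
sample_lines='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
lp:x:37:37:lp:/tmp:/bin/false
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false'

# Lines whose first character is t, m, or r
in_set_count=$(printf '%s\n' "$sample_lines" | grep -c '^[tmr]')

# Usernames of the lines whose first character is anything else
other_users=$(printf '%s\n' "$sample_lines" | grep '^[^tmr]' | cut -d: -f1)

echo "starting with t, m or r: $in_set_count"
echo "all the others:"
echo "$other_users"
```

One caveat: an empty line contains no character at all, so it matches neither of the two patterns.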
Example task: find all lines with four-character user names in “/etc/passwd”
To give you a more concrete example for using these wildcards, let’s extract all lines containing four-character user names from “/etc/passwd”. (And yes - in the real world out there, there is often an urgent need to search for four-character usernames ;-) )
We could simply start with the following:
grep '^....:' /etc/passwd
This searches for all lines starting with any four characters followed by a colon.
If you try this at home, then chances are high that you will get lines describing two-character usernames too. You could get for instance the following line (among others):
lp:x:37:37:lp:/tmp:/bin/false
And I think it’s obvious why: the character “:” is also matched by the “.” - not only the letters are.
The next approach could be to search for four characters from a given set:
grep '^[a-z][a-z][a-z][a-z]:' /etc/passwd
As you can see, you can not only insert single characters between the brackets, but you can also use ranges with a “-” sign between the start and the end character of the needed range.
At first glance, this approach solves the given task. But on second thought, you will realize that there could be more four-character usernames defined on a system. Just think of capital letters or digits that may be included in a username.
We could simply address this by just adding these characters to our character sets:
grep '^[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]:' /etc/passwd
BUT … aren’t there other characters allowed in usernames too? The dot, the underscore, the … Ok - let’s read the documentation and add them all.
STOP! Let’s do it in a simpler way!
We do not need to know all the allowed characters within a username. All we need to know is that one single character simply cannot be a component of a username - the colon “:”. Otherwise, the file format with the “:” as a separator wouldn’t work, would it?
So let’s simply do the search with four negated character sets that contain only the colon:
grep '^[^:][^:][^:][^:]:' /etc/passwd
Oh yes - this looks nerdy ;o)
But it will give us exactly what we are looking for: Every line, starting with four characters that are not a colon, followed by a colon. Aka - all lines with four-character usernames.
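To compare the three approaches, here is a small sketch. It assumes a made-up sample in which “j_do” is a hypothetical username containing an underscore; cut and paste are only used to print the usernames on one line.

```shell
# Illustrative sample; "j_do" is a hypothetical four-character username
sample_lines='root:x:0:0:root:/root:/bin/bash
lp:x:37:37:lp:/tmp:/bin/false
tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false
j_do:x:1102:1102:J Doe:/home/j_do:/bin/bash'

# Helper: print the usernames of all lines matching the given pattern
users_matching() {
    printf '%s\n' "$sample_lines" | grep "$1" | cut -d: -f1 | paste -sd, -
}

any_four=$(users_matching '^....:')                      # the dot also matches ":"
lower_four=$(users_matching '^[a-z][a-z][a-z][a-z]:')    # misses the underscore
not_colon=$(users_matching '^[^:][^:][^:][^:]:')         # four non-colon characters

echo "any four characters:   $any_four"
echo "four lowercase:        $lower_four"
echo "four non-colon chars:  $not_colon"
```

The first pattern wrongly includes “lp”, the second one misses “j_do”, and only the negated sets return exactly the four-character usernames.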
How about any character any number of times?
If you want to include a pattern that matches any character any number of times in your expression, then you need to combine two things: First the pattern for “any character” you already know - the dot (“.”) and second, a repetition operator.
A repetition operator specifies how many times the preceding pattern or phrase must be matched.
One simple-to-use repetition operator is the asterisk (“*”).
This operator specifies that the preceding phrase may repeat any number of times - including zero times.
For instance, the command
grep '^ro*t:' /etc/passwd
would find any line that starts with an “r”, followed by zero or more occurrences of an “o”, directly followed by “t:”.
This would match for instance the following lines:
rt:....
rot:....
root:....
rooot:....
And if we combine the . for any character with the repetition operator * to .*, then we will get the pattern for any character any number of times.
.* is the pattern for any character any number of times.
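A typical use of .* is to bridge the part of a line you don’t care about. The sketch below assumes the three sample lines from the beginning; the two searches are my own examples, not a fixed recipe.

```shell
sample_lines='tux:x:1099:1099:Tux:/home/tux:/bin/bash
max:x:1100:1100:Max:/home/max:/bin/bash
test1:x:1101:1101:Testuser 1:/home/max:/bin/false'

# Usernames starting with "t" whose line ends in "bash",
# with anything at all in between
t_with_bash=$(printf '%s\n' "$sample_lines" | grep '^t.*bash$' | cut -d: -f1)

# Usernames of all lines that have "/home/max" as a complete field:
# "[^:]*" is the username, ".*" skips the fields in between
home_max=$(printf '%s\n' "$sample_lines" | grep '^[^:]*:.*:/home/max:' | cut -d: -f1 | paste -sd, -)

echo "t... with bash:      $t_with_bash"
echo "home dir /home/max:  $home_max"
```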
Two other useful repetition operators
The two other useful repetition operators I want to show you here consist of two characters each. (In basic regular expressions, both are GNU extensions: GNU grep understands them, but other grep implementations may not.)
“\+” … this operator specifies that the preceding phrase must be seen at least once.
So the regular expression ^ro\+t: wouldn’t match the line starting with “rt:”, but it would match all the other lines from above.
“\?” … this operator specifies that the preceding phrase is optional. It can be seen zero times or exactly once.
If we use this in a regular expression as in ^ro\?t:, then this would match only lines starting with “rt:” or with “rot:”.
Hint: The repetition operators can of course be combined with any other pattern you use. For instance, the pattern [0-9]\+ matches any sequence of one or more digits.
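Here are the three repetition operators side by side, as a sketch that assumes GNU grep (because of \+ and \?). The input lines are the made-up “rt:” to “rooot:” examples from above, and the text with the number 4711 is invented for the demo.

```shell
# The four example lines from above
ro_lines='rt:....
rot:....
root:....
rooot:....'

star_count=$(printf '%s\n' "$ro_lines" | grep -c '^ro*t:')      # zero or more "o"
plus_count=$(printf '%s\n' "$ro_lines" | grep -c '^ro\+t:')     # at least one "o"
optional_count=$(printf '%s\n' "$ro_lines" | grep -c '^ro\?t:') # zero or one "o"

# -o prints only the matching part: here the first run of digits
digits=$(echo 'ticket 4711 closed' | grep -o '[0-9]\+')

echo "*  matches $star_count lines"
echo "\\+ matches $plus_count lines"
echo "\\? matches $optional_count lines"
echo "digits found: $digits"
```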
Search for special characters
There will be times when you need to search directly for the special characters we used above to create our patterns. But how can we use these characters in regular expressions without them being misinterpreted? Perhaps you guessed it: we need to escape them!
You can do this by simply adding a backslash (“\”) in front of the special character.
Escape special characters in a regular expression by preceding them with a backslash (“\”).
Let’s pretend that we want to search for the top-level domain “.com” in the file “access.log”. To avoid having the tool (here grep) interpret the dot as “any character”, we prefix it with a backslash to remove this special meaning:
grep '\.com' access.log
Or now - let’s pretend we want to search for the phrase “[2024\11\11]” in the same file. (This is an often used format for dates in log files.)
Here we need to escape three special characters to avoid confusion. Do you see them all?
The first two are easy: The opening and the closing square brackets.
The third one is the backslash itself: we don’t want to escape the two digits that follow it in the string “[2024\11\11]”; we really want to search for a literal backslash at that position. So we need to escape the backslash too.
And to escape it, we again simply add a backslash in front of it, so that it becomes a double backslash (“\\”).
So if we escape all of them, the two brackets and the backslashes, our command line would look like this:
grep '\[2024\\11\\11\]' access.log
Hint: If you use backslashes to escape special characters in regular expressions on the command line, surround the regular expression with single quotation marks. If you forget this, the backslashes will be interpreted by the shell as escape markers, and the tool interpreting the regular expression won’t see them at all. (see If your shell always get’s you wrong - quote the right way )
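The effect of escaping is easy to verify with a small sketch. It assumes a throwaway log file with three invented lines (created with mktemp, so no real “access.log” is needed).

```shell
# Throwaway log file with made-up content
log=$(mktemp)
cat > "$log" <<'EOF'
GET http://example.com/index
GET /telecom/status
[2024\11\11] service started
EOF

# Unescaped dot: also matches "ecom" in "telecom"
unescaped_dot=$(grep -c '.com' "$log")

# Escaped dot: only a literal ".com" matches
escaped_dot=$(grep -c '\.com' "$log")

# Escaped brackets and backslashes: matches the literal date string
date_hits=$(grep -c '\[2024\\11\\11\]' "$log")

echo "'.com'  matches $unescaped_dot lines"
echo "'\\.com' matches $escaped_dot line"
echo "date pattern matches $date_hits line"

rm -f "$log"
```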
How about the two examples from the beginning?
At the beginning I gave you two examples of complex-looking regular expressions. Here they are again:
Searching for an email address:
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
Searching for an IP address:
\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b
I hope you can now recognize a few patterns we discussed above in these expressions. But there are a few more included I haven’t discussed here. For instance the following:
\b … this marks a word boundary
{3} … this is a repetition marker that requires the preceding pattern to be repeated exactly three times
(var1|var2|var3) … this is a pattern that matches one of the given variants “var1”, “var2” or “var3”.
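The repetition marker {3} and the variants (var1|var2|var3) have one thing in common: written like this, without backslashes, they are part of the extended regular expressions (and there are many more). The word boundary \b is a GNU extension that GNU grep accepts in both basic and extended regular expressions.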
The use of these extended regular expressions will be covered in a follow-up article.
But if you want to explore them on your own, be aware that the grep command, by default, only interprets the standard or basic regular expressions.
So if you want to use parts of extended regular expressions in your pattern, you need to call the command grep with the added command-line switch -E. The older egrep command does the same, but it is considered obsolescent, and recent versions of GNU grep print a warning when you use it.
To use extended regular expressions with grep, use grep -E ... or, on systems that still provide it, egrep instead.
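If you want to see the two expressions from the beginning at work, here is a minimal sketch. It assumes GNU grep (for \b and the -o option, which prints only the matching part); the email address and the IP address in the input are invented.

```shell
email_re='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
ipv4_re='\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b'

mail_line='contact jane.doe@example.org for details'
ip_line='login from 192.168.1.10 on port 22'

# With -E the patterns are read as extended regular expressions
found_email=$(echo "$mail_line" | grep -E -o "$email_re")
found_ip=$(echo "$ip_line" | grep -E -o "$ipv4_re")

# Without -E, "+" is taken literally, so nothing matches here
# ("|| true" only hides grep's non-zero exit status for "no match")
basic_hits=$(echo "$mail_line" | grep -c "$email_re" || true)

echo "email: $found_email"
echo "ip:    $found_ip"
echo "matches without -E: $basic_hits"
```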