How to extract strings matching a given search pattern
The other day I was asked how to extract strings matching a given search pattern from a file or data stream.
The person who asked had to implement a simple broken-link checker for a website. Therefore, he wanted to extract all the URLs referenced on the website and then check them for availability.
Another use case could be to extract all IP addresses from a given file, or all timestamps or dates - and only them - from a server's logfile.
I think you get the point.
As long as we are able to describe the string we are looking for as a regular expression, we can simply extract it with grep.
Oh - yes. You are absolutely right: if we simply search with grep in a file or data stream, we usually get the entire lines containing the matching string (as “grep root /etc/passwd” gives us all lines from /etc/passwd containing the string “root”).
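Just to illustrate this (the exact lines depend on your system), a call like
grep root /etc/passwd
typically prints complete lines such as:
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin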
BUT … did you know about grep's option “-o”, which prints only the matching strings and not the whole lines?
And this is exactly the little trick I want to point out in this post:
If you use grep to find strings matching a regular expression, you can use the “-o” command-line switch to get only the matching strings instead of the whole lines.
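Applied to the example above, the same search with “-o” gives us just the matches - one per line (the exact output again depends on your system):
grep -o root /etc/passwd
root
root
root
...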
So - that’s all for today - really.
But if - and only if - you are curious and want some examples - read on.
As a first simple example - let’s extract all the numbers we can find in the file /etc/passwd:
grep -o "[0-9]\+" /etc/passwd
This command will give you - line by line - all the numbers (group ids, user ids) from /etc/passwd. “[0-9]\+” is the regular expression for the strings we are looking for and stands for any sequence of one or more digits.
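On a typical system, the beginning of the output might look like this (the exact numbers depend on your /etc/passwd, of course):
0
0
1
1
2
2
...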
Ok - and what about extracting all URLs from a website?
Well - just to simplify our regular expressions for this example - let's first assume a link in the website's source code always looks like this:
<a ... href="URL" ...>
where “…” stands for possible additional attributes of the link.
If we have downloaded the website to a file (for instance with “curl https://dilbert.com > dilbert.com”), we can extract all the URLs in 3 simple steps:
first: extract all the “<a …” components of the links
grep -o '<a[^>]*' dilbert.com
Here we extract all strings starting with “<a”, followed by as many characters as possible that are not the closing “>”.
This will give us many output-lines of the following type:
...
<a class="nav-main-link" href="https://dilbert.com/"
...
second: extract the href="URL" part of the link
Now we take the output from the first grep command and filter it through the next grep. We want to extract only the part href="URL".
grep -o '<a[^>]*' dilbert.com | grep -o 'href="[^"]*'
(extract “href=” followed by as many characters as possible that are not the closing quotation mark)
The output-lines now look like the following:
...
href="https://dilbert.com/
...
third: extract the URL-part
As the last step, we extract everything that isn't a quotation mark (") until the end of the line:
grep -o '<a[^>]*' dilbert.com | grep -o 'href="[^"]*' | grep -o '[^"]*$'
Woohoo - this now looks really nerdy! But it gives us exactly the URLs we want:
...
https://dilbert.com/
...
goal achieved!
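By the way: if you want to go one step further toward the broken-link checker mentioned at the beginning, here is a minimal sketch of the idea. (Checking availability via curl's HTTP status code is just one possible approach, and it only works for absolute URLs - relative links would have to be prefixed with the base URL first.)
grep -o '<a[^>]*' dilbert.com | grep -o 'href="[^"]*' | grep -o '[^"]*$' |
while read -r url; do
  # fetch only the HTTP status code (-s: silent, -o /dev/null: discard the page itself)
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  # everything from 400 upwards means the link is broken (or at least not reachable)
  [ "$status" -ge 400 ] && echo "broken: $url ($status)"
done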
-robert
Wanna gain an unfair advantage?
When it comes to working on the Linux command line, at the end of the day it is always about knowing the right tool for the right task.
And it is about knowing the tools that are almost certainly available on the Linux system you are currently working on.
To give you all the tools for your day-to-day work at the Linux command line, I have created “The ShellToolbox”.
This book gives you everything
- from the very basic commands, through
- everything you need for working with files and filesystems,
- managing processes,
- managing users and permissions,
- software management,
- hardware analyses and
- simple shell-scripting, to
- the tools you need for doing simple “networking stuff”.
Everything in one single, easy-to-read book, with explanations and example calls for illustration.
If you are interested, go to shelltoolbox.com and have a look (as long as it is available).