Extract regular expression group match using grep or sed

2012-05-09
I've been looking for a GNU/Linux command that matches lines in a text file using a regular expression and extracts a group match for that expression (a sub-match within parentheses). It would be really handy if the grep command supported this as an output option, but unfortunately it does not. However, there are at least two ways to approach this problem. I'll show you one with grep and one with sed.

So, here is the situation. I have a text file where I want to extract a sub-match for lines that match the whole regular expression. As an example, we can take the output from mysqldump which is the sql statements to recreate a database. It can look like this:

CREATE TABLE `offercalc_fields` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `price` double NOT NULL,
  `offer_slug` varchar(255) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Now I’d like to match all lines that contain “CREATE TABLE” and then extract the string between ` and `. A regular expression with group matching is easy:

CREATE TABLE `(.+)`

Getting just the matching lines is easy with grep, assuming the text file is named “dump.sql”:

grep -Po 'CREATE TABLE `(.+)`' dump.sql

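The grep options mean:

-P interprets the pattern as a Perl-compatible regular expression (PCRE).
-o prints only the matched part of each line instead of the whole line.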

However, this still outputs too much. “CREATE TABLE” and the enclosing ` and ` characters will still be included in the output from grep. Here it would be really handy if you could specify a group match to the -o option. Like “-o1” to output the first group match. Unfortunately this is not supported.
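To make the problem concrete, here is a minimal sketch that runs the command against a shortened copy of the dump above. The temporary file and the variable names are my own choices for illustration, and it assumes a GNU grep built with PCRE support.

```shell
# Shortened, illustrative stand-in for the mysqldump output shown above
dump=$(mktemp)
cat > "$dump" <<'EOF'
CREATE TABLE `offercalc_fields` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
EOF

# -o prints the whole match, so the keyword and the backticks are included
whole_match=$(grep -Po 'CREATE TABLE `(.+)`' "$dump")
printf '%s\n' "$whole_match"
# prints: CREATE TABLE `offercalc_fields`

rm -f "$dump"
```

The parentheses in the pattern have no effect on what -o prints, which is why the group alone cannot be selected this way.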

So how to solve this problem?

Use look-ahead and look-behind with grep

My first solution is to put everything before the group match inside a look-behind and everything after the group match inside a look-ahead. The downside is that a look-behind in grep's Perl-compatible regular expressions generally has to have a fixed length, so unbounded quantifiers such as “+” or “*” are not allowed inside it. A look-ahead has no such restriction.

Look-behind in regular expressions looks like “(?<=Hello)” where “Hello” is the string that must come immediately before the match. Look-ahead looks like “(?=Hello)” where “Hello” is the string that must come immediately after the match. So the solution is:

grep -Po '(?<=CREATE TABLE `).+(?=`)' dump.sql

Now grep only prints the matching part of the line and with look-ahead/behind I’ve managed to exclude the strings before and after my target from the match.
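Here is a small sketch of that command on a sample file with two tables. The second table name, “offercalc_offers”, is made up for the example, and a GNU grep with PCRE support is assumed. The last command is a variation that is not part of the solution above: PCRE's “\K” discards everything matched so far, which avoids the look-behind entirely.

```shell
# Illustrative sample; the second table name is hypothetical
dump=$(mktemp)
cat > "$dump" <<'EOF'
CREATE TABLE `offercalc_fields` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `offercalc_offers` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
EOF

# Look-behind and look-ahead keep the surrounding text out of the match
tables=$(grep -Po '(?<=CREATE TABLE `).+(?=`)' "$dump")
printf '%s\n' "$tables"
# prints:
# offercalc_fields
# offercalc_offers

# Variation (an assumption of mine, not from the text above):
# \K drops "CREATE TABLE `" from the reported match
tables_k=$(grep -Po 'CREATE TABLE `\K[^`]+' "$dump")

rm -f "$dump"
```

Both commands should give the same two names for this input. The “\K” form also works when the text before the target has a variable length.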

Use substitute and print commands inside of sed

Another command that can help me is sed. Sed knows about group matches in the context of the substitute command inside of sed. Combining this with the print command in sed, I get the same result with the following command:

sed -nr 's/CREATE TABLE `(.+)`.*/\1/p' dump.sql

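The regular expression is easier to write here since we are not constrained to look-ahead/behind. If you don’t know the sed command, these are the parts:

-n suppresses the automatic printing of every input line.
-r enables extended regular expressions, so the group parentheses need no backslashes.
s/regexp/replacement/ substitutes the text matched by the regular expression with the replacement.
\1 in the replacement stands for the first group match.
p at the end of the substitute command prints the line, but only if a substitution was made.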

Note that since it is a substitution operation (and not an extract operation), I need to make sure that the regular expression matches the whole line by ending it with “.*”. Otherwise any text before or after the regular expression would still be included in the output.

That means that I’m also assuming that “CREATE TABLE” is at the beginning of the line. I could improve the command by being explicit about this using “^”:

sed -nr 's/^CREATE TABLE `(.+)`.*/\1/p' dump.sql
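The difference between the two sed commands can be seen in a short sketch. The sample file and its commented-out “old_name” line are invented for the illustration, and GNU sed is assumed because of the -r option.

```shell
# Illustrative sample; the "-- CREATE TABLE" comment line is hypothetical
dump=$(mktemp)
cat > "$dump" <<'EOF'
-- CREATE TABLE `old_name` (
CREATE TABLE `offercalc_fields` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
EOF

# Without ^ the text before the match survives the substitution
unanchored=$(sed -nr 's/CREATE TABLE `(.+)`.*/\1/p' "$dump")
printf '%s\n' "$unanchored"
# prints:
# -- old_name
# offercalc_fields

# With ^ only lines that start with CREATE TABLE are rewritten and printed
anchored=$(sed -nr 's/^CREATE TABLE `(.+)`.*/\1/p' "$dump")
printf '%s\n' "$anchored"
# prints: offercalc_fields

rm -f "$dump"
```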

Conclusion

Which solution is the best depends on the original text lines. I think that grep is easier if there is a lot more before and/or after the group match that might be harder to match against in sed (since sed must match the whole line in the regular expression). But for all other cases, I think the sed solution looks a little cleaner when I don’t have to use the look-ahead/behind expressions.

Please feel free to give feedback in the comments if there are even better ways to do this in Bash with standard GNU commands.

5 Responses to “Extract regular expression group match using grep or sed”

  1. Jason says:

    Hi, this was really useful for grepping some key analytics from our year’s worth of Apache logs. BUT I found that you CAN put regex’s inside of lookahead’s and lookbehind’s. In my example below, I just needed the last 2 regex parts of the filepath from :-

    “GET /mydocs/myfiles/abcd/folder/123/456.txt”

    grep -Po '(?<=GET /mydocs/myfiles/[a-z0-9]{4}/folder/)[0-9]{1,6}/[0-9]{1,6}(?=\.txt)' access.log

    …which returns:-

    "123/456"

    Thanks again, Jason

  2. VJ says:

    Great article! Thanks!!

  3. Mohan says:

    Can the same regex be used to extract pattern matches from multiple lines?

Twitter: @mikeplate