So, here is the situation. I have a text file, and for every line that matches a regular expression I want to extract just a sub-match of that expression. As an example, we can take the output from mysqldump, which consists of the SQL statements needed to recreate a database. It can look like this:
CREATE TABLE `offercalc_fields` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `price` double NOT NULL,
  `offer_slug` varchar(255) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now I’d like to match all lines that contain “CREATE TABLE” and then extract the string between the two backticks. A regular expression with group matching is easy:
CREATE TABLE `(.+)`
Getting just the matching lines is easy with grep, assuming the text file is named “dump.sql”:
grep -Po 'CREATE TABLE `(.+)`' dump.sql
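The grep options mean: “-P” interprets the pattern as a Perl-compatible regular expression, and “-o” prints only the part of each line that matched instead of the whole line.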
However, this still outputs too much. “CREATE TABLE” and the enclosing ` and ` characters will still be included in the output from grep. Here it would be really handy if you could specify a group match to the -o option. Like “-o1” to output the first group match. Unfortunately this is not supported.
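To see the problem concretely, here is a small sketch that runs the command against a throwaway file. The sample contents and the temporary file are made up for illustration, and it assumes a GNU grep built with PCRE support for “-P”.

```shell
# Build a small sample dump in a temporary file (hypothetical data).
dump=$(mktemp)
cat > "$dump" <<'EOF'
DROP TABLE IF EXISTS `offercalc_fields`;
CREATE TABLE `offercalc_fields` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
EOF

# -o prints the whole match, so the keyword and the backticks come along.
whole_match=$(grep -Po 'CREATE TABLE `(.+)`' "$dump")
printf '%s\n' "$whole_match"

rm -f "$dump"
```

This should print CREATE TABLE `offercalc_fields`, keyword and backticks included, rather than just the table name.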
So how to solve this problem?
My first solution is to put everything before the group match inside a look-behind and everything after the group match inside a look-ahead. The downside is that a look-behind in the PCRE engine used by “grep -P” cannot contain unbounded quantifiers such as “+” or “*”, so in practice the text before the group has to be a more or less constant string. A look-ahead has no such restriction.
A look-behind looks like “(?<=Hello)” and a look-ahead looks like “(?=Hello)”, where “Hello” is the text that must appear directly before or directly after the match, respectively. So the solution is:
grep -Po '(?<=CREATE TABLE `).+(?=`)' dump.sql
Now grep only prints the matching part of the line and with look-ahead/behind I’ve managed to exclude the strings before and after my target from the match.
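As a sketch of how this behaves, and of one way around the restriction on what a look-behind may contain, the following compares the look-around command with a variant that uses the PCRE escape “\K”, which drops everything matched so far from the reported match. The sample data, including the “IF NOT EXISTS” line, is hypothetical, and using “[^`]+” instead of “.+” is my own substitution so the match stops at the first closing backtick. It assumes GNU grep with “-P” support.

```shell
# Hypothetical sample data: one plain CREATE TABLE and one with a longer prefix.
dump=$(mktemp)
cat > "$dump" <<'EOF'
CREATE TABLE `offercalc_fields` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `offercalc_offers` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT
) ENGINE=InnoDB;
EOF

# Look-behind/look-ahead: only finds names right after the fixed prefix.
lookaround=$(grep -Po '(?<=CREATE TABLE `).+(?=`)' "$dump")

# \K discards what was matched before it, so the prefix may vary in length
# (here an optional "IF NOT EXISTS ").
with_k=$(grep -Po 'CREATE TABLE (IF NOT EXISTS )?`\K[^`]+' "$dump")

printf 'look-around:\n%s\n' "$lookaround"
printf 'with K escape:\n%s\n' "$with_k"

rm -f "$dump"
```

With this input the look-around version finds only offercalc_fields, while the “\K” version also picks up offercalc_offers.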
Another command that can help me is sed. Sed knows about group matches in the context of the substitute command inside of sed. Combining this with the print command in sed, I get the same result with the following command:
sed -nr 's/CREATE TABLE `(.+)`.*/\1/p' dump.sql
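The regular expression is easier to write here since we are not constrained to look-ahead/behind. If you don’t know sed, the parts of the command are: “-n” suppresses the automatic printing of every line, “-r” enables extended regular expressions (so the group parentheses need no backslashes), “s/regex/\1/” replaces the text matched by the regular expression with the first group, and the trailing “p” prints the line only if a substitution was made.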
Note that since it is a substitution operation (and not an extract operation), I need to make sure that the regular expression matches the whole line by ending it with “.*”. Otherwise any text before or after the regular expression would still be included in the output.
That means that I’m also assuming that “CREATE TABLE” is at the beginning of the line. I could improve the command by being explicit about this using “^”:
sed -nr 's/^CREATE TABLE `(.+)`.*/\1/p' dump.sql
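The difference the “^” anchor makes can be sketched with a sample file that contains a commented-out statement. The file contents are invented for illustration, and the sketch spells the option as “-E”, which GNU sed accepts as a synonym for “-r”.

```shell
# Hypothetical sample data with a commented-out CREATE TABLE line.
dump=$(mktemp)
cat > "$dump" <<'EOF'
-- CREATE TABLE `commented_out` (
CREATE TABLE `offercalc_fields` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
EOF

# Unanchored: the "-- " prefix is outside the match, so it stays in the output.
unanchored=$(sed -nE 's/CREATE TABLE `(.+)`.*/\1/p' "$dump")

# Anchored: the commented-out line no longer matches and is not printed.
anchored=$(sed -nE 's/^CREATE TABLE `(.+)`.*/\1/p' "$dump")

printf 'unanchored:\n%s\n' "$unanchored"
printf 'anchored:\n%s\n' "$anchored"

rm -f "$dump"
```

The unanchored command prints “-- commented_out” followed by “offercalc_fields”, while the anchored one prints only “offercalc_fields”.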
Which solution is best depends on the input lines. I think grep is easier when there is a lot of text before and/or after the group match that would be hard to match in sed (since the sed expression must cover the whole line). For all other cases, I think the sed solution looks a little cleaner because I don’t have to use the look-ahead/behind expressions.
Please feel free to give feedback in the comments if there are even better ways to do this in Bash with standard GNU commands.