Regular Expressions




  1. What are regular expressions?
    1. A pattern matching notation used by ed, sed, awk, vi, and other programs.
    2. A regular expression is composed of regular and meta characters that define one or more strings.


  2. General rules for regular expressions.
    1. A regular expression always matches the longest string possible starting as far left as possible.
    2. A regular character always represents itself. Alphabetic and numeric characters are regular characters.
    3. A meta character does not represent itself unless it is immediately preceded by a backslash or quoted. Meta characters expand to a pattern. No regular expression will match a newline.
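The rules above can be tried out with a short sketch like the following. The sample strings are made up for illustration, and sed's s command is used only as a convenient way to show what a pattern matched.

```shell
# Leftmost-longest: a* could match an empty string, but at the first
# position it takes the longest run of a's available.
longest=$(echo 'aaa bbb' | sed 's/a*/X/')
echo "$longest"          # X bbb

# An unescaped . is a meta character and matches any character;
# \. matches only a literal period.
any=$(printf 'a.c\nabc\n' | grep -c 'a.c')
literal=$(printf 'a.c\nabc\n' | grep -c 'a\.c')
echo "$any $literal"     # 2 1
```

The patterns are kept in single quotes so the shell passes the backslash through to grep unchanged.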


  3. Meta Characters
    .
    matches any single character. So b.ll will match ball, bell, or bull.
    [ ]
    defines a character class. [Tt]alk will match either Talk or talk. A range may also be specified. [0-9] matches any digit.
    ^
    inverts a character class when used in the [ ] notation. Accordingly [^0-9] matches any character that is not a digit.
    *
    matches zero or more occurrences of the preceding character. For example, b* will match any sequence of zero or more b's.
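As a rough illustration of the meta characters listed above, each can be fed a few sample lines through grep. The input words are invented for this sketch, and the variable names are arbitrary.

```shell
# . stands for any one character between b and ll.
dot=$(printf 'ball\nbell\nbowl\nbull\n' | grep 'b.ll')
echo "$dot"              # ball, bell, bull (one per line)

# [Tt] accepts either case for the first letter.
class=$(printf 'Talk\ntalk\nwalk\n' | grep '[Tt]alk')
echo "$class"            # Talk, talk

# [^0-9] requires at least one non-digit character on the line.
negated=$(printf '2000\nyear 2000\n' | grep '[^0-9]')
echo "$negated"          # year 2000

# b* allows zero or more b's between a and c.
star=$(printf 'ac\nabc\nabbbc\nadc\n' | grep 'ab*c')
echo "$star"             # ac, abc, abbbc
```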


  4. Anchors.
    Anchors are used to denote a position in a line.
    $
    represents the end of line only when it is the last character in the regular expression. money$ will match lines that end with money.
    ^
    represents the beginning of line only when it is the first character in the regular expression. ^func will find any line that begins with func. Do not confuse ^ outside of [ ] with ^ inside of [ ].
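A small sketch of both anchors, using made-up input lines, might look like this:

```shell
# money$ : the line must end with "money".
end=$(printf 'money\nmoney talks\nno money\n' | grep 'money$')
echo "$end"              # money, no money (one per line)

# ^func : the line must begin with "func".
start=$(printf 'function f\nmy func\n' | grep '^func')
echo "$start"            # function f

# Using both anchors matches only lines that consist of the pattern alone.
whole=$(printf 'ac\nabc\nabcd\n' | grep -c '^abc$')
echo "$whole"            # 1
```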


  5. grep
    The grep command searches for the pattern specified by the Pattern parameter and writes each matching line to standard output. The patterns are regular expressions in the style of ed. The grep command uses a compact non-deterministic algorithm.
    1. Finding simple text.
      To locate all lines that contain the word title in the html files, use the following. (Note the -i option, which makes the search case insensitive.)
      $ grep -i title *.html
      a010.html:<title> Unix Syllabus</title>
      a020.html:<title> Unix Assignments</title>
      a030.html:<title> Unix History</title>
      a040.html:<title> OS and Shell</title>
      a045.html:<title> Unix Assignments</title>
      a050.html:<title>Command Overview</title>
      
    2. Using anchors.
      To locate only the lines that start with < and pass them to wc (word count) to count them, use the following.
      $ grep '^<' *.html | wc -l
          2030
      
    3. Looking for a pattern.
      To find lines that contain only a single html tag, that is, lines that start with < and end with > with no other > in between, use the following pattern search. The -n option gives the line number where the pattern occurs.
      $ grep -n  '^<[^>]*>$' a010.html
      1:<html>
      2:<head>
      4:</head>
      5:<body bgcolor="#FFFFFF">
      6:<center>
      9:<br>
      11:<br>
      13:<br>
      15:<br>
      16:</B>
      17:</Center>
      18:<hr>
      19:<B>
      20:<Center>
      22:</center>
      23:</b>
      24:<br>
      25:<OL>
      27:<br>
      28:<PRE>
      89:</pre>
      90:<br>
      165:<table border=1 center width=70%>
      166:<tr>
      168:<TR>
      186:</table>
      187:</body>
      188:</html>
      


  6. Replacement strings
    Replacement strings are used by vi and sed. The special character & can be used in a replacement string to represent the matched text.

    So s/[Tt]ruck/red &/ in sed will find the first truck or Truck on each line and replace it with red followed by the match.
    $ grep -i truck test
    I drove the truck to town.
    Trucks are great.
    
    $ sed 's/[tT]ruck/red &/' test
    I drove the red truck to town.
    red Trucks are great.
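Two further uses of & can be sketched as follows. The input strings are invented for illustration. The second command relies on the common sed behaviour that a backslash in front of & in the replacement yields a literal ampersand.

```shell
# & inserts the matched text, here wrapped in parentheses.
wrapped=$(echo 'order 42 shipped' | sed 's/[0-9][0-9]*/(&)/')
echo "$wrapped"          # order (42) shipped

# \& in the replacement is a literal ampersand, not the match.
ampersand=$(echo 'Tom and Jerry' | sed 's/and/\&/')
echo "$ampersand"        # Tom & Jerry
```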
    


  7. Looking for meta characters.
    If you need to look for a meta character like $ you will need to escape it. Placing the escape character (a backslash) in front of a meta character makes it a regular character, as in \$.
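A brief sketch of searching for literal meta characters, with sample lines made up for the purpose:

```shell
# \$ matches a literal dollar sign. The single quotes stop the shell
# from interpreting the $ or the backslash before grep sees them.
dollar=$(printf 'cost $5\ncost 5\n' | grep 'cost \$5')
echo "$dollar"           # cost $5

# \* matches a literal asterisk instead of repeating the 2.
asterisk=$(printf '2*3\n23\n' | grep '2\*3')
echo "$asterisk"         # 2*3
```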

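  8. Ranges.
    Repetition ranges can be specified with \{ \} constructs, which apply to the single character or character class that precedes them.
    \{number\}
    says the preceding item must occur exactly this many times. For example, [ab]\{3\}.
    \{number,number\}
    specifies the allowable range of repetitions for a match. Like a.\{3,5\}
    \{number,\}
    specifies a minimum number of repetitions to match. Like [ab]\{5,\}.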
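The three \{ \} forms can be compared on the same input with a sketch like the one below. The sample lines are invented, and the ^ and $ anchors are added so that the counts apply to the whole line rather than to any substring.

```shell
data=$(printf 'ab\naba\nababa\nabababa\n')

# Exactly three characters, each an a or a b.
exact=$(printf '%s\n' "$data" | grep '^[ab]\{3\}$')
echo "$exact"            # aba

# Between three and five such characters.
range=$(printf '%s\n' "$data" | grep '^[ab]\{3,5\}$')
echo "$range"            # aba, ababa (one per line)

# Five or more such characters.
atleast=$(printf '%s\n' "$data" | grep '^[ab]\{5,\}$')
echo "$atleast"          # ababa, abababa
```

Without the anchors, [ab]\{3\} would also match the longer lines, since each of them contains a run of three such characters.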


  9. Checkpoint
    Given the following file
    $ cat grepdata
    a
    a a
    a bb c
    a bbb c
    abbc aaa
    abbbc bbb
    abc aaaca
    ac
    aaabbbccc
    abbccc
    

    Which lines are matched by each grep command?
    1. grep 'a$' grepdata
    2. grep ' b' grepdata
    3. grep 'ab*c' grepdata
    4. grep 'abb*c' grepdata
    5. grep '[ab]c' grepdata
    6. grep 'a..b' grepdata
    7. grep 'a[^a-z]' grepdata
    8. grep -v 'b' grepdata
    9. grep 'b\{3\}' grepdata
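To check answers by running the commands, the data file can be recreated with something like the following. This sketch assumes the file holds exactly the ten lines listed above and works in a scratch directory made with mktemp (an assumption, so that no existing file is overwritten).

```shell
# Work in a scratch directory so no existing grepdata file is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Recreate the checkpoint file; the quoted EOF keeps the text literal.
cat > grepdata <<'EOF'
a
a a
a bb c
a bbb c
abbc aaa
abbbc bbb
abc aaaca
ac
aaabbbccc
abbccc
EOF

# Confirm the file has the expected number of lines.
lines=$(wc -l < grepdata)
echo "$lines"

# Any command from the list can now be run here, for example:
#   grep 'a$' grepdata
```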

    © Allan Kochis Last revision 1/7/2000