Lab 7


Regular expression: From Wikipedia, the free encyclopedia
In computing, regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) provide a concise and flexible means for identifying text of interest, such as particular characters, words, or patterns of characters. Regular expressions are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. For example, Perl and Tcl have a powerful regular expression engine built directly into their syntax. Several utilities provided by Unix distributions—including the editor ed and the filter grep—were the first to popularize the concept of regular expressions.
7.1 Exercise – Fill in the following chart:

In these notes we concentrate on POSIX regular expressions using egrep.
Wildcards:
Assume we have a directory with the following contents:

Using wildcards:

7.2 The Grep Family
The UNIX grep utility marked the birth of global regular expression print (GREP) tools. Searching for patterns in text is an important operation in a number of domains, including program comprehension and software maintenance, structured text databases, indexing file systems, and searching natural language texts. Such a wide range of uses inspired the development of variations of the original UNIX grep. These variations range from adding new features, to employing faster algorithms, to changing the behaviour of pattern matching and printing. This survey presents all the major developments in global regular expression print tools, namely the UNIX grep family, the GNU grep family, agrep, cgrep, sgrep, nrgrep, and Perl regular expressions.
Taken from man grep:

7.3 Regexes: Some Examples
We start with several simple examples. Assume we have a file fruits:

Matching characters “strings” by example:

Metacharacters: Metacharacters are characters that have ‘special’ meaning. Here are the metacharacters that are defined.

. Matches any single character.
* “character*” matches that character zero or more times.
+ “character+” matches that character one or more times. Pay careful attention to the difference between * and +: * matches zero or more times, so whatever is being repeated may not be present at all, while + requires at least one occurrence. For example, ca+t will match “cat” (1 “a”) and “caaat” (3 “a”s), but won’t match “ct”.
? “character?” matches that character either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either “homebrew” or “home-brew”.
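
The four metacharacters above can be tried out with a short session like the following. This is only a sketch: the file words and its contents are made up for illustration and are not part of the lab files, and grep -E is used as the standard equivalent of egrep.

```shell
# Hypothetical sample file, one word per line (not part of the lab files).
tmp=$(mktemp -d)
printf '%s\n' ct cat caaat homebrew home-brew > "$tmp/words"

# * : zero or more "a", so ct, cat and caaat all match
grep -E 'ca*t' "$tmp/words"

# + : one or more "a", so ct no longer matches
grep -E 'ca+t' "$tmp/words"

# ? : the hyphen is optional, so both spellings match
grep -E 'home-?brew' "$tmp/words"

# . : exactly one character between c and t, so only cat matches
grep -E 'c.t' "$tmp/words"

rm -rf "$tmp"
```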

Examples:

If you wish to search for a literal metacharacter, it must be escaped by preceding it with a backslash “\”. As an example, let’s assume we have a file such as:

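Suppose we wish to find “209.204.146.22”. egrep ‘209.204.146.22’ ip will NOT work as intended: an unescaped “.” matches any character, so the regex also matches strings such as “209x204y146z22”. We must escape each “.” character: egrep ‘209\.204\.146\.22’ ip.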
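
A small sketch of the difference, assuming a file ip with the made-up contents below (the real file from the lab is not reproduced here); grep -E stands in for egrep.

```shell
# Hypothetical contents for the file "ip"; only the first line is the address we want.
tmp=$(mktemp -d)
printf '%s\n' 209.204.146.22 209-204-146-22 10.0.0.1 > "$tmp/ip"

# Unescaped dots match ANY character, so the line with hyphens matches too.
grep -E '209.204.146.22' "$tmp/ip"

# Escaped dots match only a literal ".", so only the real address is printed.
grep -E '209\.204\.146\.22' "$tmp/ip"

rm -rf "$tmp"
```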

Anchors:
Using ^ and $ you can force a regex to match only at the start or end of a line, respectively.
^ Match at the start of a line
$ Match at the end of a line

The regex ^p fails to match both apple and grape, since neither starts with a ‘p’. The fact that they contain a ‘p’ elsewhere is irrelevant. Similarly, the regex e$ only matches apple, orange and grape. In the same way, ^cat matches only those lines that start with cat, and cat$ only matches lines ending with cat.
Mind the quotes though! In most shells, the dollar-sign has a special meaning. By putting the regex in single-quotes (not double-quotes or back-quotes), the regex will be protected from the shell, so to speak. It’s generally a good idea to single-quote your regexes.
Moving on, ^cat$ only matches lines that contain exactly cat. You can find empty lines in a similar way with ^$. If you’re having trouble understanding that last one, just apply the definitions. The regex basically says: “Match a start-of-line, followed by an end-of-line”.
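
The anchors can be sketched as follows. The contents of fruits used here are an assumption chosen to agree with the text (the original listing is not reproduced), with one empty line added to show ^$; grep -E stands in for egrep.

```shell
# Hypothetical contents for "fruits"; the last line is deliberately empty.
tmp=$(mktemp -d)
printf '%s\n' apple orange pear grape plum '' > "$tmp/fruits"

# ^p : only lines that START with p (pear, plum), not apple or grape
grep -E '^p' "$tmp/fruits"

# e$ : only lines that END with e (apple, orange, grape)
grep -E 'e$' "$tmp/fruits"

# ^pear$ : the whole line must be exactly "pear"
grep -E '^pear$' "$tmp/fruits"

# ^$ : empty lines; -c prints the number of matching lines instead of the lines
grep -E -c '^$' "$tmp/fruits"

rm -rf "$tmp"
```

Note that every regex is single-quoted, so the shell never sees the $ as the start of a variable name.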
Word Boundaries
A lot of regex implementations offer the ability to use word anchors. As you saw, a regex like cat not only finds the word cat, but also all those cases where cat is “hidden” in other, longer words. In such cases you can use the start-of-word and end-of-word anchors, \< and \>, respectively. So if you were looking only for occurrences of the word cat, you could use the regex \<cat\>.

In this context, the term “word” should be taken lightly; every combination of letters, upper and lower case, the underscore ( _ ) and digits counts as a word when dealing with word boundary anchors.
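
A short sketch of the word anchors, assuming GNU grep (\< and \> are extensions rather than part of POSIX) and a made-up file animals:

```shell
# Hypothetical sample file (not part of the lab files).
tmp=$(mktemp -d)
printf '%s\n' cat concatenate 'the cat sat' catalog bobcat cat_food > "$tmp/animals"

# Plain cat: matches all six lines, including the "hidden" occurrences.
grep -E -c 'cat' "$tmp/animals"

# \<cat\> : cat as a whole word only ("cat" and "the cat sat").
# cat_food does not match because the underscore counts as a word character.
grep -E '\<cat\>' "$tmp/animals"

# \<cat : words that begin with cat (cat, the cat sat, catalog, cat_food)
grep -E '\<cat' "$tmp/animals"

# cat\> : words that end with cat (cat, the cat sat, bobcat)
grep -E 'cat\>' "$tmp/animals"

rm -rf "$tmp"
```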
Character Classes: [ … ]
The [ and ] metacharacters are used to define character classes. Suppose for instance that you’re trying to find both cake and coke. In that case you can use the regex c[ao]ke.
A way to find a hexadecimal digit would be [0123456789abcdefABCDEF]. This quickly becomes impractical though. Fortunately you can use a hyphen to specify a range: [0-9] means any digit from 0 to 9.
More than one range in a character class is also allowed: [0-9a-fA-F].
Another use of the ^ character:
You can also specify a negated character class by placing a caret (^) directly after the opening bracket: [^…]. This inverts the sense of the character class: [^0-9] matches any character but digits.
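
Character classes, ranges and negation can be sketched like this. The file samples and its contents are invented for illustration; grep -E stands in for egrep.

```shell
# Hypothetical sample file (not part of the lab files).
tmp=$(mktemp -d)
printf '%s\n' cake coke cuke ff00 DEADBEEF 12345 > "$tmp/samples"

# c[ao]ke : cake or coke, but not cuke
grep -E 'c[ao]ke' "$tmp/samples"

# ^[0-9a-fA-F]+$ : lines made up ONLY of hexadecimal digits
# (ff00, DEADBEEF and 12345; cake fails because k is not a hex digit)
grep -E '^[0-9a-fA-F]+$' "$tmp/samples"

# [^0-9] : lines containing at least one character that is not a digit,
# so every line except 12345
grep -E '[^0-9]' "$tmp/samples"

rm -rf "$tmp"
```

The last command shows a common surprise: [^0-9] does not mean "lines without digits", it means "lines with at least one non-digit", which is why ff00 is still printed.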
Some Grouping “short cuts”
Some special sequences beginning with “\” represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace. These shortcuts come from Perl-style regex engines (Perl, Python and others) and are not part of POSIX regular expressions. GNU grep accepts \w, \W, \s and \S as extensions, but egrep does not understand \d or \D; use [0-9] or the POSIX classes described below instead. The following predefined special sequences are commonly available:
\d Matches any decimal digit; this is equivalent to the class [0-9].
\D Matches any non-digit character; this is equivalent to the class [^0-9].
\s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].
\W Matches any character that is not alphanumeric or an underscore; this is equivalent to the class [^a-zA-Z0-9_].
In Perl-style engines these sequences can also be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or “,” or “.”. In a POSIX bracket expression the backslash is not special, so with egrep write [[:space:],.] instead.
Other Grouping shortcuts POSIX
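POSIX regexes offer the following named character classes. They are only valid inside a bracket expression, so the class of digits is written [[:digit:]], not [:digit:].
Character Class   Meaning
[:alnum:]         Alphanumeric characters (letters and digits)
[:cntrl:]         Control characters
[:lower:]         Lower case letters
[:space:]         Whitespace (space, tab, newline, carriage return, form feed, vertical tab)
[:alpha:]         Letters
[:digit:]         Digits
[:print:]         Printable characters, including space
[:upper:]         Upper case letters
[:blank:]         Space and tab
[:graph:]         Visible characters (printable characters, excluding space)
[:punct:]         Punctuation
[:xdigit:]        Hexadecimal digits (0-9, a-f, A-F)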
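
A brief sketch of the POSIX classes in use. Each class name sits inside an outer pair of brackets, which is why the patterns contain [[ and ]]. The file lines and its contents are made up for illustration; grep -E stands in for egrep.

```shell
# Hypothetical sample file (not part of the lab files).
tmp=$(mktemp -d)
printf '%s\n' abc123 HELLO 'hello world' 42 'wait, what?' > "$tmp/lines"

# Lines consisting only of digits (42)
grep -E '^[[:digit:]]+$' "$tmp/lines"

# Lines consisting only of upper case letters (HELLO)
grep -E '^[[:upper:]]+$' "$tmp/lines"

# Lines consisting only of letters and digits (abc123, HELLO, 42)
grep -E '^[[:alnum:]]+$' "$tmp/lines"

# Lines containing whitespace ("hello world" and "wait, what?")
grep -E '[[:space:]]' "$tmp/lines"

# Lines containing punctuation ("wait, what?")
grep -E '[[:punct:]]' "$tmp/lines"

rm -rf "$tmp"
```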
Quantifiers
Using quantifiers, you can specify how often a character, character class or group may or must be repeated in sequence. The general form is {min,max}.
An example is the regex bo{1,2}t, which matches both bot and boot. To match any sequence of three to five vowels, you can use [aeiou]{3,5}. Or you can use a quantifier to make something optional: finds{0,1} matches find and finds. This case occurs often enough to justify an abbreviation: the regex finds? is effectively identical to the previous.
Important to notice at this point is that a quantifier only applies to the item that precedes it. The question mark in the above regex only applies to s, not to the entire finds.
If you want to match something a certain number of times you can set the minimum equal to the maximum: ^-{80,80}$ matches lines that consist of exactly eighty dashes.
It’s also allowed to leave out the upper bound: a{5,} will match any run of at least five letters ‘a’. The case of “one or more” (e.g. a{1,}) occurs much more frequently, though. That’s why this form has an abbreviation, the +: a+ and a{1,} are effectively equivalent.
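
The {min,max} forms can be sketched as follows, using a made-up file words (not part of the lab files) and grep -E in place of egrep:

```shell
# Hypothetical sample file, one word per line.
tmp=$(mktemp -d)
printf '%s\n' bt bot boot booot find finds beautiful > "$tmp/words"

# o{1,2} : one or two o's between b and t (bot, boot), not bt or booot
grep -E 'bo{1,2}t' "$tmp/words"

# s{0,1} and s? are equivalent: both print find and finds
grep -E 'finds{0,1}' "$tmp/words"
grep -E 'finds?' "$tmp/words"

# {3,5} applied to a character class: three to five vowels in a row
# (booot has "ooo", beautiful has "eau")
grep -E '[aeiou]{3,5}' "$tmp/words"

# {2,} : no upper bound, two or more o's (boot, booot)
grep -E 'bo{2,}t' "$tmp/words"

# {3} : shorthand for {3,3}, exactly three o's (booot)
grep -E '^bo{3}t$' "$tmp/words"

rm -rf "$tmp"
```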
7.3 Exercise:
Using egrep (or grep), find the answers to the following. All questions refer to the science.txt file (posted on SLATE):
In how many lines does the word ‘science’ appear in the file? __________________
Command _________________________________________________
In how many lines does the word ‘Science’ appear in the file? __________________
Command _________________________________________________
In how many lines does the word ‘science’ or ‘Science’ appear in the file? ___________
Command _________________________________________________
How many lines contain exactly three vowels (‘aeiou’) in sequence? ______________
Command _________________________________________________
In how many lines does the string ‘at’ appear in the file? _______________
Command _________________________________________________
How many lines contain a word that begins with ‘at’? ________________
Command _________________________________________________
How many lines contain a word that ends with ‘at’? ________________
Command _________________________________________________
In how many lines does the string ‘cat’ appear in the file? ____________
Command _________________________________________________
In how many lines does the word ‘cat’ appear in the file? ___________
Command _________________________________________________
In how many lines does a year appear in the file? Example 1960, 2001 etc. _________
Command _________________________________________________
How many lines end with a ‘.’ period? ________________
Command _________________________________________________
SYST 13416 The Linux Operating System