Regular Expressions – (Regex) – Regular Expression

Regular Expressions

"Regular expression" was originally a term borrowed from automata theory in theoretical computer science. Broadly, it refers to a pattern that a substring must match.

The comic should have already given you an idea of what regular expressions could be useful for. It should not be surprising that many programming languages, text processing tools, data validation tools and search engines make extensive use of them.

The key idea is that a regular expression is a pattern which matches a set of target strings.

\w+@\w+\.(com|org|net|in) is a regex that matches many simple email addresses ending in .com, .org, .net or .in.

Regular Expressions Concepts

Regex syntax varies from language to language. Here, we will examine Perl-style regex, since most other flavors are variations on it.

Before we dive into the syntax, these are the kinds of things that the patterns consist of:

  • Literals: The simplest things to match. A literal matches exactly itself, such as an a or a 1.
  • Meta characters: Characters or escape sequences that carry a special meaning instead of matching themselves. For example, \d matches any digit.
  • Vertical Bar: The | acts as a boolean OR. It matches any one of the alternatives it separates.
  • Quantifiers: They specify how many times the preceding pattern must be matched.
  • Grouping and Capturing: Parentheses can be used to group parts of the regex or to capture parts of the match for later use.
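
The five building blocks above can be tried out in a few lines. This is a minimal sketch that assumes Python's built-in re module, whose syntax is close to the Perl style discussed here; the sample strings and variable names are made up for illustration.

```python
import re

# Literal: the characters "cat" match themselves.
literal = re.search(r"cat", "concatenate")

# Meta character: \d stands for any digit.
digit = re.search(r"\d", "room 7")

# Vertical bar: match either alternative.
either = re.findall(r"cat|dog", "cat, dog, bird")

# Quantifier: {3} asks for three digits in a row.
three_digits = re.search(r"\d{3}", "call 555 now")

# Grouping and capturing: each pair of parentheses keeps its part of the match.
date = re.search(r"(\d{4})-(\d{2})", "released 2018-06")

print(literal.group())               # cat
print(digit.group())                 # 7
print(either)                        # ['cat', 'dog']
print(three_digits.group())          # 555
print(date.group(1), date.group(2))  # 2018 06
```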

Regular Expression Syntax

Let’s look at what the meta characters do in a little more detail.

Meta character    Description
^                 Start of a string
$                 End of a string
\t                Tab
\n                Newline
\r                Carriage return
\s                Any whitespace character
\S                Any non-whitespace character
\d                Any digit
\D                Any non-digit
\w                Any word character (letter, digit or underscore)
\W                Any non-word character
\b                A word boundary (a position, not a character)
\B                A position that is not a word boundary
.                 Any single character, usually barring a newline

By the way, if you want to match a metacharacter literally, you need to use \ to escape it. For example, \. would just match the . character.
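
To see a few of these meta characters in action, here is a small sketch. It assumes Python's re module as a stand-in for Perl-style regex; the sample text and variable names are illustrative only.

```python
import re

text = "Order 66 shipped.\tTotal: 3.50"

digits = re.findall(r"\d", text)    # every single digit
words = re.findall(r"\w+", text)    # runs of word characters
spaces = re.findall(r"\s", text)    # spaces and the tab

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_order = re.search(r"^Order", text) is not None
ends_with_50 = re.search(r"50$", text) is not None

# \b matches a position between a word character and a non-word character.
whole_word = re.findall(r"\bcat\b", "cat concat cats")

# An unescaped . matches any character; \. matches only a literal dot.
any_char = re.findall(r".", "a.b")
literal_dot = re.findall(r"\.", "a.b")

print(digits)       # ['6', '6', '3', '5', '0']
print(words)        # ['Order', '66', 'shipped', 'Total', '3', '50']
print(spaces)       # [' ', ' ', '\t', ' ']
print(whole_word)   # ['cat']
print(any_char)     # ['a', '.', 'b']
print(literal_dot)  # ['.']
```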

Now, let us look at the constructs that make patterns more flexible.

Expression    Meaning
[abc]         Matches any one of a, b or c
[^abc]        Matches any character other than a, b or c
[a-d]         Matches any one character in the range a-d
a*            Matches a zero or more times
a?            Matches a zero or one time
a+            Matches a one or more times
a|b           Matches either a or b
a{3}          Matches exactly 3 of a
a{3,}         Matches 3 or more of a
a{3,5}        Matches 3, 4 or 5 of a (inclusive range)
( )           Captures everything inside the parentheses
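
The expressions in this table can be checked one at a time. The sketch below again assumes Python's re module; matches is a hypothetical helper introduced only for this illustration, and it tests whether a whole string fits a pattern.

```python
import re

def matches(pattern, s):
    """Hypothetical helper: True when the whole string s fits the pattern."""
    return re.fullmatch(pattern, s) is not None

# Character classes pick one character from a set.
in_set = re.findall(r"[abc]", "cabbage")
not_in_set = re.findall(r"[^abc]", "cabbage")
in_range = re.findall(r"[a-d]", "abcdef")

print(in_set)      # ['c', 'a', 'b', 'b', 'a']
print(not_in_set)  # ['g', 'e']
print(in_range)    # ['a', 'b', 'c', 'd']

# Quantifiers control how many repetitions are allowed.
print(matches(r"a*", ""))            # True: zero repetitions are fine
print(matches(r"a+", ""))            # False: at least one a is required
print(matches(r"a?", "aa"))          # False: at most one a is allowed
print(matches(r"a{3}", "aaa"))       # True
print(matches(r"a{3,}", "aaaaaa"))   # True
print(matches(r"a{3,5}", "aaaaaa"))  # False: six is outside 3 to 5

# Parentheses capture the text matched inside them.
pair = re.search(r"(\w+)@(\w+)", "mail bob@example now")
print(pair.group(1), pair.group(2))  # bob example
```
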
Example:

We are now ready to explain why \w+@\w+\.(com|org|net|in) does what it claims.

First, what should an email address look like? It should have a structure like user@domain.extension.

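The user and domain parts each consist of one or more letters, digits or underscores. So, we use \w+ for each of them, with a literal @ in between.

The \. matches the literal dot before the extension. We then restrict the extension to org, com, net or in by putting these alternatives in a group separated by |.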
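
As a quick check of this walkthrough, the pattern can be run against a few sample addresses. This is a sketch under assumptions: it uses Python's re module, the addresses are invented, and extension_of is a hypothetical helper. It calls fullmatch so that the entire string has to fit the pattern, which is a choice made for this example.

```python
import re

# The pattern from the text, compiled once for reuse.
EMAIL = re.compile(r"\w+@\w+\.(com|org|net|in)")

def extension_of(address):
    """Hypothetical helper: return the captured extension, or None if no match."""
    m = EMAIL.fullmatch(address)
    return m.group(1) if m else None

print(extension_of("alice@example.com"))       # com
print(extension_of("bob_99@site.in"))          # in
print(extension_of("carol@example.edu"))       # None: edu is not an allowed extension
print(extension_of("first.last@example.com"))  # None: \w does not match the dot
```

The last case shows the limit of the pattern: because \w+ cannot match a dot or a hyphen, addresses with those characters in the user or domain part are rejected.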
