A colleague is looking to learn regex so I decided to put it up here. This is the first in a series of regex-related articles. In the parts to follow, we’re going to be using our regular expressions to learn other topics relevant to the Internet while simultaneously expanding on regex knowledge.
Regex Quick Start
- A vertical bar |, sometimes known as a pipe, is an “OR” operator in regular expressions. It indicates that the pattern to the left OR to the right is acceptable. On the keyboard, this key is typically the “capital”, or “shifted version”, of the backslash key.
- Parentheses are round brackets () that select a section of a regular expression. They work much as they do in mathematics, where they change the order of operations. The “selection” can also be referenced elsewhere, though we’ll cover that later.
- A dot . is a special character which matches any character other than “newline” characters.
- A backslash \ is sometimes called an escape character. There are many characters, such as the parentheses and the vertical bar above, that have special meanings. The backslash is used to take away (escape) those special meanings. For example, to specify an actual full stop, you need to escape the special dot character: \.
- A star * is an operator which indicates that the preceding character (or parenthesised selection) can be repeated zero or more times. This is often combined with the dot to produce a “wildcard” pattern, which matches a string of any length: .*
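To make the five rules concrete, here is a minimal sketch using Python’s re module. The choice of Python, the helper name matches and the sample words are assumptions for illustration only; the rules themselves behave the same way in most regex flavours.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Rule 1: the vertical bar accepts the pattern on its left OR on its right.
print(matches(r"cat|dog", "dog"))      # True

# Rule 2: parentheses limit what the bar applies to.
print(matches(r"(cat|dog)s", "cats"))  # True
print(matches(r"cat|dogs", "cats"))    # False: without the parentheses, "s" belongs only to "dog"

# Rule 3: the dot matches any single character except a newline.
print(matches(r"c.t", "cut"))          # True
print(matches(r"c.t", "c\nt"))         # False

# Rule 4: the backslash escapes the dot, leaving a literal full stop.
print(matches(r"co\.za", "co.za"))     # True
print(matches(r"co\.za", "coXza"))     # False

# Rule 5: the star repeats the preceding item zero or more times.
print(matches(r"go*gle", "ggle"))      # True (zero repetitions of "o")
print(matches(r"go*gle", "google"))    # True
print(matches(r".*", "anything"))      # True (the "wildcard")
```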
Using the vertical bar and the escape character
Here we have a number of example web addresses:
http://dogma.swiftspirit.co.za/
http://swiftspirit.co.za/
http://google.com/
If I want a pattern that matches them all, I could use the vertical bar (rule 1 above) to separate them. I’d then also need to use the escape character (rule 4 above) for the full stops:
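The pattern itself is not reproduced in this text, so what follows is a sketch of what such a pattern could look like, tried out with Python’s re module. The exact pattern text, the names urls and long_pattern, and the use of Python are assumptions for illustration.

```python
import re

# The three example addresses from above.
urls = [
    "http://dogma.swiftspirit.co.za/",
    "http://swiftspirit.co.za/",
    "http://google.com/",
]

# Assumed reconstruction: each full address is separated by a vertical bar,
# and every full stop is escaped with a backslash.
long_pattern = re.compile(
    r"http://dogma\.swiftspirit\.co\.za/"
    r"|http://swiftspirit\.co\.za/"
    r"|http://google\.com/"
)

for url in urls:
    # fullmatch requires the entire address to fit one of the alternatives.
    print(url, long_pattern.fullmatch(url) is not None)
```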
The above will match every URI in my example; however, it isn’t particularly efficient or elegant.
The same can be achieved with the following, shorter, regex, utilising the parentheses from rule 2 above:
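Again, the original pattern is not reproduced in this text. One plausible shorter form, checked with Python’s re module, is sketched below; the pattern text and the names urls and short_pattern are assumptions for illustration.

```python
import re

urls = [
    "http://dogma.swiftspirit.co.za/",
    "http://swiftspirit.co.za/",
    "http://google.com/",
]

# Assumed reconstruction: only the host names differ, so only they go
# inside the parentheses, separated by vertical bars.
short_pattern = re.compile(
    r"http://(dogma\.swiftspirit\.co\.za|swiftspirit\.co\.za|google\.com)/"
)

for url in urls:
    print(url, short_pattern.fullmatch(url) is not None)
```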
Notice that I’ve simply placed common items, for example “http://” and “/”, outside the selections.
What if you want to match text that might be anything?
Maybe there is more that you still want to match. For example, suppose I add http://swiftspirit.co.za/downloads/ and still want the pattern to match that, or any other URL under my web site. Or maybe it’s okay for http://anything.google.com/anything to match as well. We can use the special dot and star characters from rules 3 and 5 above:
Simply adding that any character can appear zero or more times lets a lot more match without adding too much to the regex’s complexity:
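The pattern is not reproduced in this text, so here is a sketch of one pattern that fits the description, tried out with Python’s re module. The pattern text and the names wildcard_pattern and samples are assumptions for illustration.

```python
import re

# Assumed reconstruction:
#   (|.*\.)  in front of google\.com allows nothing, or anything ending in a literal dot
#   .*       after the final slash allows any path, including an empty one
wildcard_pattern = re.compile(
    r"http://(dogma\.swiftspirit\.co\.za|swiftspirit\.co\.za|(|.*\.)google\.com)/.*"
)

samples = [
    "http://swiftspirit.co.za/",
    "http://swiftspirit.co.za/downloads/",
    "http://google.com/",
    "http://anything.google.com/anything",
    "http://notgoogle.com/",  # should NOT match: no dot before "google"
]

for url in samples:
    print(url, wildcard_pattern.fullmatch(url) is not None)
```

Bear in mind that the dot also matches slashes, so .* is very permissive; a stricter pattern would restrict which characters may appear before google.com.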
The group (|.*\.) goes in front of the google.com pattern. The vertical bar indicates two options here: either blank (so we end up with just google.com), or .*\. (so we have anything followed by a literal dot and that is then followed by google.com).
Practice makes perfect
We can already see from the last example that a regular expression can very quickly become complex. As with any language, reading a regular expression easily takes practice. If you have a relevant need to work with regular expressions, you’re in for a treat.
Part 2 should be due in a few days. I’ll be doing some very indirect “work”, demonstrating how to flush DNS caches in a variety of systems.