Email Validation
Validating an email address is fairly easy. You first have to match the local part (the part before the @). If you try to follow the official RFC for what's allowed there, you're going to have a bad time. But honestly, most mail servers don't even support all of those characters. Thus, we'll make life much easier by providing an expression that validates sane local parts.
So, a normal local part can consist of either digits, letters, or one of these characters: "._%+-". Each of these can exist multiple times in an email address.
Then, of course, the @ symbol follows. Without that it's not an email address. After that, a valid domain or subdomain must follow. These may include digits, letters, as well as dots or dashes.
The last part of that domain will be the top level domain, e.g. .com or .email. Those can only be letters, and are at least two characters long. Honestly, that top level domain does not have to exist, since addresses like root@localhost are valid as well, but those won't work for online services.
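Additionally, we treat email addresses as case insensitive. Strictly speaking, only the domain part is guaranteed to be, but most mail servers ignore case in the local part as well.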
SRL Syntax:
begin with any of (digit, letter, one of "._%+-") once or more, literally "@", any of (digit, letter, one of ".-") once or more, literally ".", letter at least 2 times, must end, case insensitive
As you can see, the query pretty much explains itself. The only thing to mention may be the begin with and must end parts. These define the start and the end of the string to test. If those parts did not exist, addresses like [email protected] would match as well.
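To see the effect of begin with and must end outside of SRL, here is a minimal sketch using Python's re module. The pattern is a hand-written translation of the query above, not the expression SRL generates, and the function names and the sample address are made up for illustration.

```python
import re

# Hand-written translation of the SRL query (an assumption, not SRL's own output).
EMAIL = re.compile(r"^[0-9a-z._%+-]+@[0-9a-z.-]+\.[a-z]{2,}$", re.IGNORECASE)

# The same pattern without "begin with" (^) and "must end" ($).
EMAIL_UNANCHORED = re.compile(r"[0-9a-z._%+-]+@[0-9a-z.-]+\.[a-z]{2,}", re.IGNORECASE)


def is_valid_email(address: str) -> bool:
    """Return True if the whole string looks like a sane email address."""
    return EMAIL.search(address) is not None


def contains_email(text: str) -> bool:
    """Return True if an address-like substring appears anywhere in the text."""
    return EMAIL_UNANCHORED.search(text) is not None


print(is_valid_email("sample@example.com"))                 # True
print(is_valid_email("root@localhost"))                     # False, no top level domain
print(is_valid_email("junk sample@example.com more junk"))  # False, anchors reject it
print(contains_email("junk sample@example.com more junk"))  # True, no anchors
```

Without the anchors, the pattern is happy as soon as it finds an address somewhere in the input, which is fine for searching but not for validation.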
URL Validation and Matching
Let's take a look at URLs. For example, this one:
https://domain.com:1234/a/path?query=param
They are pretty complex, if you think about it. They may contain a protocol (https://), a domain (domain.com), a port (1234), a path (/a/path), and even parameters (query=param). Let's have a look at an SRL query to filter those parts.
SRL Syntax:
begin with capture (letter once or more) as "protocol", literally "://", capture ( letter once or more, any of (letter, literally ".") once or more, letter at least 2 times ) as "domain", (literally ":", capture (digit once or more) as "port") optional, capture (literally "/", anything never or more) as "path" until (any of (literally "?", must end)), literally "?" optional, capture (anything never or more) as "parameters" optional, must end, case insensitive
Okay, that's huge and complicated. But which would you rather debug and maintain? The SRL above, or the resulting regular expression:
/^(?<protocol>[a-z]+)(?::\/\/)(?<domain>[a-z]+(?:[a-z]|(?:\.))+[a-z]{2,})(?:(?::)(?<port>[0-9]+))?(?<path>(?:\/).*?)(?:(?:\?)|$)(?:\?)?(?<parameters>.*)?$/i
Let's dive into it. We'll start easy and just match some letters followed by a ://. Those will be captured as protocol using a capture group. Then a new capture group follows, which first contains at least one letter, since a domain name never starts with a dot. After that, either dots or letters are allowed, since we want to match subdomains, domains and top level domains.
After the domain part, a colon followed by a port may show up. Since a port is never required in a valid URL, that part is completely optional. The same goes for the parameters at the end: they are not required, but valid if they exist. The path is the exception. As written, the query expects at least a single / after the domain or port.
A path can be pretty much anything following the /, except a question mark, since this will start the parameter section. Since we can't rely on the parameter section existing, we'll match anything up to either the end of the string or the ?.
Now, if a ? is present, everything that follows will be the parameters, so we'll capture all of it until the end of the string.
Try matching the SRL Query against the test URL provided. Feel free to tinker with either the URL or the query just to see all the results possible. Learning by doing!
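If you'd like to tinker outside of the SRL playground, the walkthrough above can be sketched with Python's re module. This is a hand-written port of the regular expression shown earlier, using Python's (?P<name>...) syntax for named groups. The function name parse_url is made up for this example.

```python
import re
from typing import Optional

# Port of the generated expression to Python's named-group syntax.
URL = re.compile(
    r"^(?P<protocol>[a-z]+)://"            # letters, then ://
    r"(?P<domain>[a-z]+[a-z.]+[a-z]{2,})"  # starts with a letter, ends with a TLD
    r"(?::(?P<port>[0-9]+))?"              # optional :port
    r"(?P<path>/.*?)(?:\?|$)"              # path up to a ? or the end of the string
    r"\??(?P<parameters>.*)?$",            # whatever is left are the parameters
    re.IGNORECASE,
)


def parse_url(url: str) -> Optional[dict]:
    """Return the captured parts as a dict, or None if the URL does not match."""
    match = URL.match(url)
    return match.groupdict() if match else None


parts = parse_url("https://domain.com:1234/a/path?query=param")
print(parts["protocol"], parts["domain"], parts["port"])  # https domain.com 1234
print(parts["path"], parts["parameters"])                 # /a/path query=param
```

Groups that did not take part in the match, such as a missing port, come back as None. A URL without any path, like https://domain.com, is rejected by this pattern because the path group is not optional.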
Password Validation
A password should be secure. It should be at least 8 characters long and contain an uppercase letter, a lowercase letter, a special character and a digit. To be honest, that isn't an easy task to achieve using regular expressions, since every arrangement must be valid, and everything must be invalid if some required character is missing.
In SRL, the query looks like the one shown below. The expression will only match if all conditions are met. After that, we can simply match anything, since we already made sure that the required characters are contained. But of course, it should be at least 8 characters long.
In the if followed by section, you may notice the anything never or more. It looks a bit odd, but it is necessary, since the pointer for the "followed by" criteria is at position zero. If you removed it, the query would expect the character in that statement right at the start. But since it doesn't matter where these characters are, we have to explicitly allow anything before them.
one of will match any of the given characters. Since we're in a string, we have to escape the quote in the middle, as well as the last backslash, with a backslash.
SRL Syntax:
if followed by (anything never or more, letter), if followed by (anything never or more, uppercase letter), if followed by (anything never or more, digit), if followed by (anything never or more, one of "!@#$%^&*[]\"';:_-<>., =+/\\"), anything at least 8 times
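The same idea can be sketched in Python, where each if followed by becomes a positive lookahead. This is a hand-written translation, not SRL's generated expression. The names SPECIAL, PASSWORD and is_strong_password are made up for this example, and re.escape is used so the special characters don't have to be escaped by hand.

```python
import re

# The characters from the "one of" part of the query.
SPECIAL = "!@#$%^&*[]\"';:_-<>., =+/\\"

PASSWORD = re.compile(
    r"(?=.*[a-z])"                         # a lowercase letter somewhere
    r"(?=.*[A-Z])"                         # an uppercase letter somewhere
    r"(?=.*[0-9])"                         # a digit somewhere
    + "(?=.*[" + re.escape(SPECIAL) + "])"  # a special character somewhere
    + r".{8,}"                             # at least 8 characters in total
)

# Without the leading .* the lookahead only inspects position zero.
DIGIT_AT_START_ONLY = re.compile(r"(?=[0-9]).{8,}")


def is_strong_password(candidate: str) -> bool:
    """Return True if all lookahead conditions hold from the start of the string."""
    return PASSWORD.match(candidate) is not None


print(is_strong_password("Passw0rd!"))                     # True
print(is_strong_password("password"))                      # False
print(DIGIT_AT_START_ONLY.match("abcdefg1") is not None)   # False, digit is not first
print(DIGIT_AT_START_ONLY.match("1abcdefg") is not None)   # True
```

The last two lines show why anything never or more is needed: a lookahead does not move through the string on its own, so without it the required character would have to be the very first one.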
Trimming Whitespaces
Removing unnecessary whitespaces is a common task in a programmer's life. Some languages provide dedicated tools for the job, and some don't. However, regular expressions are able to do that quite quickly. To replace all whitespaces, we first have to capture them. After that, replacing them is easy.
SRL Syntax:
capture (whitespace once or more)
The matches aren't very interesting, to be honest. But it matches. Whitespaces are boring and we have to live with it. Nevertheless, the query used to capture the whitespaces is simple but effective.
Now, assuming you're using PHP, let's replace the whitespaces with a -. Of course, you can use any string you like for that, and even callbacks are supported.
$srl = new SRL("capture (whitespace once or more)");
echo $srl->replace("-", " a text with\nwhitespaces and\ttabs");
And the result will then be:
-a-text-with-whitespaces-and-tabs
Note that the expression replaced every run of one or more whitespaces with just one hyphen, including the tab (\t) and the new line character (\n).
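If you're not on PHP, the same replacement works with a plain regular expression. Here is a small sketch using Python's re.sub, with \s+ as a hand-written equivalent of whitespace once or more. The function names are made up for this example.

```python
import re

WHITESPACE = re.compile(r"\s+")  # one or more whitespace characters


def replace_whitespace(text: str, replacement: str = "-") -> str:
    """Replace every run of whitespace with a single replacement string."""
    return WHITESPACE.sub(replacement, text)


def mark_whitespace_lengths(text: str) -> str:
    """Callback variant: replace each run with its length in brackets."""
    return WHITESPACE.sub(lambda match: "[" + str(len(match.group())) + "]", text)


print(replace_whitespace(" a text with\nwhitespaces and\ttabs"))
# -a-text-with-whitespaces-and-tabs
print(mark_whitespace_lengths("a  b\tc"))
# a[2]b[1]c
```

The second function shows the callback style: instead of a fixed string, re.sub accepts a function that receives each match and returns the text to insert.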
Lookaround - Conditional Regular Expressions
Let's have a peek at lookarounds and why we need them. Let's say we're trying to match the last occurrence of something, and only the last:
This example contains 3 numbers. 2 should not match. Only 1 should.
Using lookarounds, this is quite easy. Think of lookarounds as some sort of if statement.
A lookaround can either be a lookahead or a lookbehind. Each kind can then be positive or negative. But don't worry. It sounds more complicated than it actually is. This would be a positive lookahead: match this, if that pattern follows
A negative lookahead would look like this: match this, if that pattern does not follow
I guess you get the idea. Positive: if it does follow; negative: if it doesn't. Lookbehinds are equally simple. Instead of checking what follows, they check what already occurred before the current position. In both cases, the text inside the lookaround is only tested and does not become part of the match.
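To make the four kinds concrete, here is a small sketch in plain regular expression syntax, run through Python's re module. The price list and the variable names are made up for this example.

```python
import re

PRICES = "10 USD, 20 EUR, 30 USD"

# Positive lookahead: a number, if " USD" follows.
followed_by_usd = re.findall(r"\d+(?= USD)", PRICES)

# Negative lookahead: a whole number, if " USD" does not follow.
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", PRICES)

# Positive lookbehind: a number, if "EUR, " comes right before it.
preceded_by_eur = re.findall(r"(?<=EUR, )\d+", PRICES)

# Negative lookbehind: a whole number, if ", " does not come right before it.
not_preceded_by_comma = re.findall(r"(?<!, )\b\d+", PRICES)

print(followed_by_usd)        # ['10', '30']
print(not_followed_by_usd)    # ['20']
print(preceded_by_eur)        # ['30']
print(not_preceded_by_comma)  # ['10']
```

None of the results contain "USD" or "EUR". The lookarounds only decide whether a number counts, they never add anything to the match. The \b word boundaries in the negative variants keep the engine from settling for just a part of a number.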
Going back to our example, we basically want to match any number, but only if no other number follows. Let's try that:
SRL Syntax:
capture (digit) if not followed by (anything once or more, digit)
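For comparison, the query can be translated by hand into a negative lookahead and tried with Python's re module. This is a sketch of the same idea, not the expression SRL generates, and the function name last_digit is made up for this example.

```python
import re
from typing import Optional

TEXT = "This example contains 3 numbers. 2 should not match. Only 1 should."

# A digit that is not followed by "anything once or more" and another digit.
LAST_DIGIT = re.compile(r"([0-9])(?!.+[0-9])")


def last_digit(text: str) -> Optional[str]:
    """Return the captured digit that has no other digit after it, or None."""
    match = LAST_DIGIT.search(text)
    return match.group(1) if match else None


print(last_digit(TEXT))           # 1
print(LAST_DIGIT.findall(TEXT))   # ['1']
```

The 3 and the 2 are rejected because the lookahead finds a later digit for each of them. Only the 1 has nothing but non-digits after it.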
Thank You!
Thanks for reading through! If you've got any other examples that are worth sharing, feel free to open an issue on GitHub.