Why You Shouldn’t Validate Emails with Regex

Foo, a programmer working at FooBar Inc., is happily working on a registration form, when he think to himself,

“Hmm. I should probably check if the email is valid.”

Somewhere along the line, Foo connect the words “email” and “valid” with “regular expressions”.

But as the great Jamie Zawinski once stated,

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

 

Implementing Regular Expressions to Validate Email Addresses


 

Now, for the sake of demonstration, let’s look into the thought processes of Foo.

First, Foo thinks, “Well, what makes an email address valid?”

He quickly scribbles down a list.

  1. Email must contain an “@” symbol.
  2. Email must contain a “.” symbol.
  3. Email must not contain spaces.
  4. Email can have the character sets [a-zA-z]
  5. Email can have the number sets [0-9]

Foo also knows that the order must always go as follows :

  1. Word
  2. “@” symbol
  3. “.” symbol
  4. Domain Name

Okay. Seems simple enough, right?

Foo writes the following regex :

^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+

And it works flawlessly! At first, anyway.
It turns out that a regex statement like this won’t account for emails like “henry.dang@henrydangprg.com” or “henry-dang@henrydangprg.com”

The solution? As long as it’s [az-A-Z0-9._], it should be fine, up until the “@” symbol.

^[a-zA-Z0-9.-]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+

So far so good! Looks like everything’s working. But hold on, your friend tries to sign up with his email, “henry.dang@henry.dang.prg.com”, which is an absolutely valid email. Now you have to compensate for multiple periods after the “@” symbol!

But even after Foo fixes that, he realizes that his input could be infinitely large if someone strung together infinite periods! (EX : henry.dang@henry.henry.henry.henry.henry.henry.henry… and so on)

But wait! There’s more! The user could have a plus sign in his email, but not in the domain. And wait! What about foreign characters? And apostrophes? And all those other wild characters that hardly anyone would use in an email, but would still be valid if they really wanted to use it?

The fact of the matter is that Foo will most likely be unable to account for every single possible letter and combination. There are simply too many, and attempting to match all of them will simply bring in angry customers who have some esoteric symbol in their email that you failed to account for.

 

So What’s the Solution?


 

The solution is simple. Don’t use regex. It is not the right tool for the job. All you need to do is check if the user has an “@” and a “.” in their email. Anything else is extraneous and will lead to some user emailing you about how they can’t register with their email.

There is no point in attempting to check the infinite possibilities for an email. Send the user a confirmation email. If their email is valid, they will receive an email. If not, then their email was invalid, and they should change it.

For the stubborn or curious user, there is a solution available here. Clocking in at a whopping 6424 characters, this monstrous and unreadable regular expression is the last thing you want to use in your code.