How To Create A Regex In Javascript

How JavaScript works: regular expressions (RegExp)

This is post # 27 of the series, dedicated to exploring JavaScript and its building components. In the process of identifying and describing the core elements, we also share some rules of thumb we use when building SessionStack, a JavaScript application that needs to be robust and high-performing to help companies optimize the digital experience of their users.

Searching, matching, and aggregating are an important part of our daily activity on the web. For instance, while you're browsing around or googling some keywords, there's a lot of searching going on. To make searching and matching less daunting and more precise, popular editors like Notepad and Sublime use regular expressions to support search and replace. So, when you hit CTRL + F while using these editors, you can search for and match texts of your choice.

Aside from searching, you can perform input validation with regular expressions. For example, you can check if the PIN entered by the user is all numeric or if a password entered has special characters, etc. What most developers love about RegExp is how transferable the knowledge of RegExp is. For instance, RegExp written in JavaScript closely resembles RegExp written in Python.

In this article, I'll explain what RegExp in JavaScript is, its importance, special characters, how to create and write them efficiently, main use-cases, and its different properties and methods.

What Are Regular Expressions

A regular expression is a sequence of characters that defines a search pattern for matching character combinations in strings. In JavaScript, regular expressions are also objects.

RegExp makes searching and matching of strings easier and faster. For example, in search engines, logs, editors, etc. there's a need to filter/match texts easily and efficiently. This is where the RegExp patterns come in, defining search patterns with a sequence of characters.

Importance of Regular Expressions

Information is an integral part of an increasing number of industries due to the accelerated digital transformation. In this section, we'll be looking at why regular expressions are important and how they're useful in data management.

Searching/Matching of Strings

Most developers who use regular expressions use them to search and match Strings. RegExp allows you to look for a text within a larger body of text. When you test for a text with RegExp, you get true if the text is found and false if it isn't. When you match a text against a group of texts, you get an array with the text that matches your pattern.

Input Validation

Input validation is an important feature for most software developers. You want a PIN entered by users to contain only digits, and emails to follow the expected format, such as name@xx.com. To do this, most developers make use of regular expressions.

Let's look at the example below of RegExp to validate the user's input, to ensure that their input contains only numbers:
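A minimal sketch of this check, assuming the PIN arrives as a String stored in num (the pattern name numericOnly and the sample values are illustrative, not the original listing):

```javascript
// ^ and $ anchor the pattern so that every character must be a digit.
const numericOnly = /^\d+$/;

const num = "1234";
console.log(numericOnly.test(num)); // true

// Any non-digit character makes the test fail.
console.log(numericOnly.test("12a4")); // false
```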

The code above will print true because num consists only of digits from 0–9. However, if we change the value of num to a String that contains any non-digit character, our output will be false.

Web Scraping

Web scraping involves the extraction of data from websites. With RegExp, developers can easily perform this task. For instance, developers can extract substrings from Strings by pointing to a webpage and extracting data that matches their pattern.

Data Wrangling

There's more that you can do with data retrieved from a webpage. For instance, you can evaluate and arrange data from the web into the desired format for proper decision-making. With RegExp, you can aggregate and map data to use it for analytics purposes.

Information from data wrangling can be stored for future purposes so that retrieving it becomes easier.

How to Create RegExp Objects in JavaScript

Regular expressions in JavaScript are RegExp objects. Now that we have a better understanding of what regular expressions are, let's look at how to create them in JavaScript.

Literal Notation

Literal notation is one method of creating RegExp objects in JavaScript. This method involves the use of RegExp literal syntax. RegExp literal notation involves the enclosing of your expression in slashes / without the use of quotation marks.

A regular expression literal is compiled when the script is loaded, so literal notation is the better choice when the pattern will remain constant. For instance, a literal used inside a loop isn't recompiled on each iteration. The flip side is that a literal can't be assembled from values that are only known at runtime; for patterns that change, use the constructor function described later in this section.

The code below shows the syntax for using the literal notation when creating JavaScript regular expressions:

          let re = /hello/;

Let's look at a simple expression with the literal notation that'll look for an exact match in a String. This will match the String, performing case sensitive search:
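A short sketch of that search, using the String Hello Studytonight mentioned in the text (the variable names are illustrative):

```javascript
// Literal notation: the pattern sits between two slashes.
const re = /hello/;

// test() reports whether the pattern occurs in the String.
const found = re.test("Hello Studytonight");
console.log(found); // false
```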

Running this search returns false, because hello isn't equal to Hello in a case-sensitive search. The search looks for hello in the String Hello Studytonight.

We can perform a case insensitive search with the i flag that will ignore case sensitivity. Let's look at the example we did above with an i flag:
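The same search with the i flag appended after the closing slash might look like this (a sketch, with illustrative variable names):

```javascript
// The i flag makes the match ignore letter case.
const re = /hello/i;

const found = re.test("Hello Studytonight");
console.log(found); // true
```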

This time, the program outputs true because we aren't performing a case-sensitive search. Therefore, hello matches Hello.

Constructor Function

Another way developers can create regular expressions in JavaScript is with the RegExp constructor. This method takes the pattern as a String argument, with the flags as an optional second String argument. From ECMAScript 6, the constructor can also take a regular expression literal as the pattern together with a flags argument.

It is advisable to use the constructor function when creating regular expressions whose pattern will change during runtime, for instance, when the pattern is built from user input or from values produced in a loop. The syntax to create JavaScript regular expressions with the constructor function is shown in the code below:
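A sketch of the constructor forms, followed by a pattern assembled at runtime. The value userWord is a hypothetical stand-in for user input; real input containing special characters would need escaping before being used as a pattern.

```javascript
// General form: new RegExp(pattern[, flags])
const fromString = new RegExp("hello");        // pattern given as a String
const withFlags = new RegExp("hello", "i");    // pattern plus flags
const fromLiteral = new RegExp(/hello/, "i");  // literal plus flags (ES6 and later)

// Hypothetical runtime value: the pattern isn't known until the script runs.
const userWord = "Studytonight";
const dynamic = new RegExp(userWord, "i");
console.log(dynamic.test("hello studytonight")); // true
```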

Just like our example from literal notation, we'll be creating a case sensitive search with the RegExp constructor function:
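A sketch of that case-sensitive search, assuming the same sample String as in the literal-notation example:

```javascript
// The pattern is passed as a String; no slashes are needed.
const re = new RegExp("hello");

const found = re.test("Hello Studytonight");
console.log(found); // false
```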

Because our example above is running a case-sensitive search, we'll get false as our output. Next, we'll add the i flag to our function argument, to ignore case sensitivity in the search.
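With the flag passed as the second argument, the sketch becomes:

```javascript
// The second argument carries the flags, here i for case-insensitive.
const re = new RegExp("hello", "i");

const found = re.test("Hello Studytonight");
console.log(found); // true
```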

Now, our code will output true since we're ignoring case sensitivity in our search.

Regular Expression Methods

There are two main methods for regular expressions which are exec() and test(). However, there are other methods of String that are used for regular expressions, such as match(), matchAll(), replace(), replaceAll(), search() and split(). In this section, we'll explore the different methods that can be used for JavaScript regular expressions:

exec()

This method executes a search and returns an array of results, or null if there is no match. With the global flag g, it can be called repeatedly to iterate over multiple matches in a string of text. For instance, we'll look at the example below with and without iteration utilizing the exec() method.
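A sketch of both cases, assuming the sample String used elsewhere in this article and the pattern spa:

```javascript
const str = "in a space of time spark";

// Without iteration: a single call returns the first match only.
const first = /spa/.exec(str);
console.log(first[0], first.index); // spa 5

// With iteration: a global regex remembers its position in lastIndex,
// so each call continues after the previous match until null is returned.
const globalRegex = /spa/g;
const indexes = [];
let match;
while ((match = globalRegex.exec(str)) !== null) {
  indexes.push(match.index);
}
console.log(indexes); // [5, 19]
```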

Notice that without iteration, we get the index of the first match only. However, with iteration, we get the results of all matches.

test()

This RegExp method searches for a match between a regular expression and a String. It returns true if a match is found and false otherwise. With this method, you can also use the global flag g. Let's look at an example, to search for a regular expression in a String, with and without the global flag g.
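A sketch using the names regex, globalRegex, and str that the discussion refers to:

```javascript
const str = "in a space of time spark";

const regex = /spa/;        // no global flag
const globalRegex = /spa/g; // global flag

const plainResult = regex.test(str);
const globalResult = globalRegex.test(str);
console.log(plainResult);  // true
console.log(globalResult); // true
```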

From the example above, regex.test(str) and globalRegex.test(str) both output true, because the expression spa can be found in the String "in a space of time spark", i.e. in space and spark.

However, the global flag allows us to iterate in our search to determine how many times spa is present in the String. Through the lastIndex property, we can also determine the positions where spa is found in our String. This can't be achieved without the global flag, as the test() method will then only determine whether our expression (spa) is present, not taking into account if it occurred once or multiple times. The code below explains this better:
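A sketch of that iteration. Each successful test() on a global regex moves lastIndex to just past the match, so the start position can be derived from it:

```javascript
const str = "in a space of time spark";
const globalRegex = /spa/g;

let count = 0;
const startIndexes = [];
while (globalRegex.test(str)) {
  count += 1;
  // lastIndex points just past the match; subtract the match length.
  startIndexes.push(globalRegex.lastIndex - "spa".length);
}

console.log(count);        // 2
console.log(startIndexes); // [5, 19]
```

Once test() returns false, lastIndex is reset to 0, so the same regex can be reused from the start.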

The syntax for the test() method is test(str), where str is the String we'll be matching our regular expression against. This method returns a Boolean (true or false), unlike the search() method, which returns the index of a match or -1 if a match isn't found.

match()

The match() method is a String method that accepts a regular expression. With this method, we retrieve the result of matching a regular expression against a String. Instead of returning true or false, this method outputs an array of results matching our regular expression. Let's look at an example that will match capital letters in our String. We'll be using the global flag so that every capital letter is matched, not just the first one.
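A sketch of that search, assuming an illustrative sentence:

```javascript
const sentence = "Hello Studytonight, Welcome To JavaScript";

// [A-Z] matches one capital letter; g collects every occurrence.
const capitals = sentence.match(/[A-Z]/g);
console.log(capitals); // ["H", "S", "W", "T", "J", "S"]

// Without g, only the first match is returned, with extra properties.
const firstOnly = sentence.match(/[A-Z]/);
console.log(firstOnly[0], firstOnly.index); // H 0
```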

The syntax for the match() method is match(regexp), where regexp is our regular expression object. If you call match() without a parameter, you'll get an array containing one empty String: [""]. If you don't use the global flag with this method, you'll get the same result as the exec() method: the first match and its capturing groups, with additional properties like groups, index, and input.

matchAll()

The matchAll() method must be called with the global flag. The difference between this method and the match() method is that it returns an iterator over all matches, including their capturing groups. In the match() method, no capturing groups are returned with the g flag. Without the g flag in the match() method, only the first match is returned, along with its capturing groups.

The use of the g flag is important with the matchAll() method; otherwise, you'll get a TypeError. Let's look at the same example from our match() method, this time with the matchAll() method.
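A sketch of the capital-letter search with matchAll(). The sentence and the optional capturing group (a) are assumptions added so that there is a group to report:

```javascript
const sentence = "Hello Studytonight, Welcome To JavaScript";
const pattern = /[A-Z](a)?/g;

// match() with g returns the full matches only; groups are dropped.
console.log(sentence.match(pattern));
// ["H", "S", "W", "T", "Ja", "S"]

// matchAll() returns an iterator; each result carries its groups and index.
const results = Array.from(sentence.matchAll(pattern));
const withGroup = results
  .filter((m) => m[1] !== undefined)
  .map((m) => [m[0], m[1], m.index]);
console.log(withGroup); // [["Ja", "a", 31]]
```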

From the example above, we can see that capturing group "a" is returned. This wasn't returned in our match() method example. The matchAll() syntax is the same as the match() syntax, except that match() is replaced with matchAll().

replace()

If you want to not only search and match but also replace Strings, the replace() method will do the job. The pattern can either be a String or a RegExp. For example, we can replace texts in a String with our pattern as shown below:
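A sketch of such a replacement. The names p and result follow the discussion; the sentence itself is illustrative:

```javascript
const p = "The girl next door waved at another girl";

// Without the g flag, only the first match is replaced.
const result = p.replace(/girl/, "boy");

console.log(result); // "The boy next door waved at another girl"
console.log(p);      // "The girl next door waved at another girl"
```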

From our example above, we can see that the initial String p doesn't change; replace() returns a new String, which is our result. Also, the second "girl" isn't replaced, because only the first match is substituted. The syntax for the replace() method is shown below:

          // for RegExp pattern
          replace(regexp, newSubstr)
          replace(regexp, replacerFunction)
          // for String pattern
          replace(substr, newSubstr)
          replace(substr, replacerFunction)

replaceAll()

replaceAll() is useful if you want to change all occurrences of a pattern in your text. With the replace() method, a String pattern or a RegExp without the g flag replaces only the first occurrence. However, replaceAll() will substitute all occurrences, not just the first one; when the pattern is a RegExp, it must have the g flag. Let's look at the same example from our replace() method. The replaceAll() syntax is the same as the replace() syntax, except that replace() is changed to replaceAll():
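A sketch with an illustrative sentence, assuming a runtime that supports replaceAll() (ES2021 and later):

```javascript
const p = "The girl next door waved at another girl";

// A RegExp passed to replaceAll() must carry the g flag.
const result = p.replaceAll(/girl/g, "boy");
console.log(result); // "The boy next door waved at another boy"

// A String pattern works too and also replaces every occurrence.
const fromString = p.replaceAll("girl", "boy");
console.log(fromString === result); // true
```

Calling replaceAll() with a RegExp that lacks the g flag throws a TypeError.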

search()

The search method search() is used to perform a search for a match between a regular expression and a String. This method doesn't output true or false or an array of the result. Instead, it outputs a number, showing the index of the first match.

For instance, the example below outputs 4 which is the index of the first capital letter "S".
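A sketch that produces this output, assuming an illustrative String whose first capital letter sits at index 4:

```javascript
const text = "hey SessionStack";

// search() returns the index of the first match.
const position = text.search(/[A-Z]/);
console.log(position); // 4

// When nothing matches, search() returns -1.
const missing = text.search(/\d/);
console.log(missing); // -1
```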

The syntax for the search() method is:

          search(regex) //where regex is a regular expression.

split()

We can extract substrings from our String with the split method. What this method does, is divide Strings into substrings according to our pattern. Then, it'll return an array containing all of the substrings. We can divide Strings into words, characters, etc. with the split method.

The syntax for the split() method is shown below.

          split()
          split(separator)
          split(separator, limit)

Here, separator describes where each split should occur. The separator can either be a String or a RegExp. You can also pass a limit as an argument. The limit caps the number of substrings to be included in the array. For example, if the limit is specified as 0, an empty array [] will be returned.
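A sketch of split() with a RegExp separator and a limit; the sample list is illustrative:

```javascript
const list = "one, two,three ;four";

// Split on a comma or semicolon, swallowing any surrounding whitespace.
const separator = /\s*[,;]\s*/;

const parts = list.split(separator);
console.log(parts); // ["one", "two", "three", "four"]

// The limit caps how many substrings are returned.
const firstTwo = list.split(separator, 2);
console.log(firstTwo); // ["one", "two"]

const none = list.split(separator, 0);
console.log(none); // []
```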

Writing RegExp Patterns

In JavaScript, you can write RegExp patterns using simple patterns, special characters, and flags. In this section, we'll explore the different ways to write regular expressions while focusing on simple patterns, special characters, and flags.

Simple Patterns

Sometimes, when searching for a text, you'll want to get an exact match. For instance, if you search for "fry" in the sentence "Blessing makes good fries by frying fr yosh's potatoes.", you won't want to get results such as "fr yosh's" or "fries"; you'd want only the exact sequence of characters, which appears in "frying". This is what simple patterns are all about. With simple patterns, you create patterns to get an exact match. They mostly consist of characters only.

A quick example is the code below. It allows us to search for an exact match of the String "fry".
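A sketch of that search against the sentence from the text:

```javascript
const sentence = "Blessing makes good fries by frying fr yosh's potatoes.";

// A simple pattern: the characters f, r, y in exactly that order.
const match = sentence.match(/fry/);

console.log(match[0]);    // "fry"
console.log(match.index); // 29, the start of "frying"
```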

Special Characters

Searching sometimes doesn't have to be exact. For instance, we may want to make a search using a range. You may want to search for the letters a to c, regardless of whether there's whitespace in between them in a String. To do this, developers need to use special characters. Special characters for RegExp in JavaScript fall into the following categories: Assertions, Character classes, Groups and Ranges, Quantifiers, and Unicode property escapes. Let's look at how to use special characters in these categories.

Assertions

Assertions in RegExp denote pattern boundaries. With assertions, you can indicate the beginning and end of words. You can also write patterns for a match, using expressions like: look ahead, look behind, and conditions.

For boundary type assertions, you can use characters like ^, $, \b or \B.

^: This character is used for matching the beginning of input. If the multiline flag m is set, it also matches immediately after a line break.

$: This character matches the end of input. If the multiline flag m is set, it also matches immediately before a line break.

\b: This matches a word boundary, that is, a position where a word character is not followed or not preceded by another word character.

\B: This matches a non-word boundary, that is, a position where the previous and next characters are of the same type: both are word characters or both are non-word characters. For example, it doesn't match between a letter and a whitespace character.

For expressions like look ahead and look behind, use the following syntax:

x(?=y): This syntax is for the look ahead assertion. It will match x only if it is followed by y. Replace x and y with the values of your choice to perform the assertion. For instance, /Man(?=Money)/ will match "Man" only if it is followed by "Money".

x(?!y): This syntax is for the negative look ahead assertion. It'll match x only if it's not followed by y. For instance, /Man(?!Money)/ will match "Man" only if it's not followed by "Money".

(?<=y)x: This syntax is for the look behind assertion. It will match x only if it is preceded by y. For instance, /(?<=Money)Man/ will match "Man" only if it is preceded by "Money".

(?<!y)x: This syntax is for the negative look behind assertion. It'll match x only if it's not preceded by y. For instance, /(?<!Money)Man/ will match "Man" only if it's not preceded by "Money".

Let's look at the example below with special characters we've discussed and Assertion.
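A sketch exercising the boundary and lookaround assertions. The sample Strings and the helper indexesOf are illustrative assumptions:

```javascript
// Hypothetical helper: start indexes of every match of a global regex.
const indexesOf = (re, s) => Array.from(s.matchAll(re), (m) => m.index);

const text = "ManMoney MoneyMan Manpower";
const followedByMoney = indexesOf(/Man(?=Money)/g, text);     // [0]
const notFollowedByMoney = indexesOf(/Man(?!Money)/g, text);  // [14, 18]
const precededByMoney = indexesOf(/(?<=Money)Man/g, text);    // [14]
const notPrecededByMoney = indexesOf(/(?<!Money)Man/g, text); // [0, 18]

// ^ and $ anchor to the input; with the m flag they anchor to each line.
const startsWithMan = /^Man/.test(text);     // true
const endsWithPower = /power$/.test(text);   // true
const lineStart = /^b/m.test("a\nb");        // true
const lineStartNoFlag = /^b/.test("a\nb");   // false

// \b needs a word boundary on that side; \B needs the opposite.
const pets = "a cat concatenates";
const wholeWord = indexesOf(/\bcat\b/g, pets);  // [2]
const insideWord = indexesOf(/\Bcat\B/g, pets); // [9]

console.log(followedByMoney, notFollowedByMoney);
console.log(wholeWord, insideWord);
```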

Character Classes

Character classes are used to distinguish different kinds of characters from each other. For instance, you can distinguish letters from digits with character classes. Let's look at special characters with character classes and how they work.

\d: This matches a digit, i.e. a character from 0 to 9. You can use /\d/ or /[0-9]/ to match digits.

\D: This is used to match any character that's not a digit. /\D/ is equivalent to /[^0-9]/.

\w: This is used for matching alphanumeric characters from the basic Latin alphabet, including the underscore. /\w/ is equivalent to /[A-Za-z0-9_]/.

\W: This is used to match any character that is not a word character from the basic Latin alphabet. /\W/ is equivalent to /[^A-Za-z0-9_]/.

\s: This is used to match a single whitespace character, i.e. a space, tab, form-feed, line-feed, and other Unicode spaces. /\s/ is equivalent to [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].

.: The dot matches any single character except line terminators like \n, \r, \u2028, and \u2029.

[\b]: This is used for matching a backspace.

\0: This matches a null character.

\xhh: This matches the character with the code hh (two hexadecimal digits).

\uhhhh: This matches a UTF-16 code unit with the value hhhh (four hexadecimal digits).

\cX: This is used to match a control character using the caret notation, where X is a letter from A to Z.

There are a bunch of other special characters like \t, \r, \n, \v, \f which match a horizontal tab, carriage return, line feed, vertical tab, and form feed respectively. Now, let's look at a simple example showing these special character usage in character classes:
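A sketch applying several character classes to an illustrative String (note the tab character written as \t inside it):

```javascript
const order = "Order #42: 3 apples,\t7 pears";

const numbers = order.match(/\d+/g);     // runs of digits
const words = order.match(/\w+/g);       // runs of word characters
const leadingText = order.match(/\D+/);  // first run of non-digits
const tokens = order.split(/\s+/);       // split on spaces and the tab

console.log(numbers);        // ["42", "3", "7"]
console.log(words);          // ["Order", "42", "3", "apples", "7", "pears"]
console.log(leadingText[0]); // "Order #"
console.log(tokens);         // ["Order", "#42:", "3", "apples,", "7", "pears"]

// The dot doesn't match a line feed; \x41 and \u0042 are "A" and "B".
const dotSkipsNewline = /a.c/.test("a\nc");   // false
const escapes = /\x41\u0042/.test("AB");      // true
const hasTab = /\t/.test(order);              // true
```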

Groups and Ranges

If you want to group expression characters or specify ranges there are special characters that do just that. Let's look at them:

x|y: This syntax is used for matching either x or y. For instance, the expression man|woman will match either "man" or "woman" in a String.

[xbz]: This is used for matching any character enclosed in the bracket. For instance, [xbz] will match "x", "b", or "z" in a String.

[a-c]: This is used to match any character from the range of characters enclosed in the bracket. For instance, [a-c] will match "a", "b", and "c". However, if the hyphen is at the beginning or end of the bracket, it is taken as a normal character. Therefore, [-ac] will match the hyphen in "non-profit".

[^xyz]: This will match any character that isn't enclosed in the bracket. For example, [^xyz] won't match the "z" and "y" in "Lazy" but it'll match "L" and "a".

[^a-c]: This will match anything not included in the range of characters that is enclosed in the bracket. For example, [^a-c] won't match the "b" and "a" in "bank" but it'll match "n" and "k".

(x): This is a capturing group. It matches x and remembers the match for later use. For example, /(family)/ matches and remembers "family" in "make family familiar". The remembered text can then be reused, for instance as $1 in the replacement String passed to replace().

\n: This syntax is a backreference to the last substring matching capturing group number n in the regular expression, where n is a positive integer.

\k<Name>: This syntax is a backreference to the last substring matching the named capturing group specified by <Name>.

(?<Name>x): This syntax is for named capturing groups. It matches x and stores it on the groups property of the returned matches under the name specified by <Name>.

(?:x): This is for non-capturing groups. In this case, the pattern matches x but doesn't remember the match. Therefore, you can't recall the matched substring from the resulting array.

We've discussed the different special characters that can be used for groups and ranges. Now, let's see them in action.
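A sketch built from the examples in the list above; the date String and the replacement format are illustrative assumptions:

```javascript
// Alternation and character sets
const people = "man and woman".match(/man|woman/g); // ["man", "woman"]
const hyphen = "non-profit".match(/[-ac]/g);         // ["-"]
const notXyz = "Lazy".match(/[^xyz]/g);              // ["L", "a"]
const notAtoC = "bank".match(/[^a-c]/g);             // ["n", "k"]

// Capturing group reused in the replacement as $1
const marked = "make family familiar".replace(/(family)/, "[$1]");
console.log(marked); // "make [family] familiar"

// Backreferences: \1 by number, \k<ch> by name; both find a doubled letter
const doubled = /(\w)\1/.exec("hello")[0];             // "ll"
const doubledNamed = /(?<ch>\w)\k<ch>/.test("hello");  // true

// Named capturing groups land on the groups property
const date = "2024-05-17".match(/(?<year>\d{4})-(?<month>\d{2})/);
console.log(date.groups.year, date.groups.month); // 2024 05

// A non-capturing group adds no entry to the result array
const capturing = "abab".match(/(ab)+/);      // ["abab", "ab"]
const nonCapturing = "abab".match(/(?:ab)+/); // ["abab"]
console.log(capturing.length, nonCapturing.length); // 2 1
```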

Quantifiers

When matching characters, sometimes you'll want to specify the number of expressions or characters to match. Quantifiers allow developers to indicate the number of expressions or characters they'll want to match. Let's look at special characters that serve as quantifiers in regular expressions.

x*: This syntax is used for matching the preceding item x zero or more times. For example, /bo*/ matches the b in bird and nothing in goat.

x+: This syntax matches the preceding item x one or more times. /x+/ is equivalent to /x{1,}/.

x?: This syntax will match the preceding item x zero or one time.

x{n}: This syntax matches exactly n occurrences of the preceding item x, where n is a positive integer.

x{n,}: Instead of matching exactly n occurrences of the preceding item x, this matches n or more occurrences, where n is a positive integer.

x{n,m}: Where n is zero or a positive integer, m is a positive integer, and m is at least n, this syntax matches at least n and at most m occurrences of the preceding item x.

By default, quantifiers like * and + try to match as much as possible in the String. Therefore, they're termed greedy. Placing the ? character after a quantifier, as in *? or +?, makes it non-greedy, so it stops at the shortest possible match.

Let's look at an example that shows how quantifiers can be used in regular expressions:
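A sketch of each quantifier, including the greedy versus non-greedy difference; all sample Strings are illustrative:

```javascript
// * : zero or more o's after b
const bMatches = "boooo bird goat".match(/bo*/g); // ["boooo", "b"]

// ? : the u is optional
const american = /^colou?r$/.test("color");  // true
const british = /^colou?r$/.test("colour");  // true

// {n}, {n,} and {n,m}
const exactlyFour = /^\d{4}$/.test("1234");      // true
const fiveIsTooMany = /^\d{4}$/.test("12345");   // false
const atLeastTwo = /^\d{2,}$/.test("7");         // false
const twoToFour = /^\d{2,4}$/.test("123");       // true

// Greedy + takes as much as it can; +? stops at the first closing bracket.
const html = "<b>bold</b> and <i>italic</i>";
const greedy = html.match(/<.+>/)[0];
const lazy = html.match(/<.+?>/)[0];
console.log(greedy); // "<b>bold</b> and <i>italic</i>"
console.log(lazy);   // "<b>"
```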

Unicode Property Escapes

You can match characters based on their Unicode properties. With Unicode property escapes, you can match emojis, punctuation, letters from specific languages, or scripts. Regular expressions that use Unicode property escapes must have the u flag. Also, you can write Unicode properties for binary and non-binary values.

Let's look at the syntax for writing Unicode property escapes:
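A sketch of the main forms, written as regex literals so they can be run directly; the specific properties chosen here are only examples:

```javascript
// \p{Property}: a binary property
const binaryProperty = /\p{Alphabetic}/u;

// \p{Value}: shorthand for a General_Category value (Lu = uppercase letter)
const generalCategory = /\p{Lu}/u;

// \p{Property=Value}: a non-binary property with a value
const nonBinary = /\p{Script=Greek}/u;

// \P{...}: the negated form, anything without the property
const negated = /\P{Lu}/u;

console.log(generalCategory.test("A"), negated.test("A")); // true false
```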

The example below shows how to use the Unicode property escape in regular expressions:
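A sketch on an illustrative String that mixes Latin and Greek letters. The Greek letters are written as escapes so the source stays ASCII:

```javascript
// "Hi, αβγ and Ω!" with the Greek letters written as \u escapes
const sample = "Hi, \u03B1\u03B2\u03B3 and \u03A9!";

// Runs of Greek-script characters
const greek = sample.match(/\p{Script=Greek}+/gu);

// Uppercase letters in any script
const uppercase = sample.match(/\p{Lu}/gu);

// Punctuation characters
const punctuation = sample.match(/\p{P}/gu);

console.log(greek.length);     // 2
console.log(uppercase.length); // 2
console.log(punctuation);      // [",", "!"]
```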

Flags

JavaScript regular expressions support several flags; seven of them are covered here. These flags enhance regular expression patterns. For instance, the i flag is used for case-insensitive searches. You can use flags alone or together, and they are included as part of a regular expression. Let's look at these flags and what they're used for:

d: This flag is used for generating start and end indices for substring matches.

g: The g flag is used to indicate global search.

i: This flag is for case-insensitive search. If you want to perform searches without enforcing case sensitivity, use this flag.

m: This flag is used for performing multi-line search; it makes ^ and $ match at the start and end of each line.

u: This flag indicates Unicode; it treats a pattern as a sequence of Unicode code points.

y: This is used to perform a sticky search, which matches only starting at the position given by the lastIndex property.

s: The s flag allows the dot . character to match newline characters.

To use flags with regular expressions, use the syntax below.

          // for literal notation
          var re = /pattern/flags;
          // or
          // for constructor function
          var re = new RegExp('pattern', 'flags');
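A sketch showing several of these flags at work. The sample Strings are illustrative, and the d flag assumes a runtime with ES2022 support:

```javascript
// g and i combined: every match, ignoring case
const greetings = "Hello hello".match(/hello/gi); // ["Hello", "hello"]

// m: ^ and $ match at each line, not only at the ends of the input
const lines = "one\ntwo".match(/^\w+$/gm);  // ["one", "two"]
const noLines = "one\ntwo".match(/^\w+$/g); // null

// s: the dot also matches a line feed
const dotAll = /a.c/s.test("a\nc"); // true

// y: the match must start exactly at lastIndex
const sticky = /foo/y;
sticky.lastIndex = 3;
const atThree = sticky.test("barfoo"); // true
sticky.lastIndex = 0;
const atZero = sticky.test("barfoo");  // false

// d: start and end indices for the match and each capturing group
const indexed = /b(c)/d.exec("abcd");
console.log(indexed.indices[0], indexed.indices[1]); // [1, 3] [2, 3]

// Flags passed to the constructor end up on the flags property
console.log(new RegExp("hello", "gi").flags); // "gi"
```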

Regular Expressions, When Not to Use Them

So far, we've explored regular expressions, how they work in JavaScript and why we should use them. However, there are situations where it's best to use other tools instead of RegExp. It is a bad practice to use regular expressions in the following scenarios:

Parsing HTML with RegExp isn't a good practice, because HTML isn't a regular language. In general, source code is not a regular language and shouldn't be parsed with RegExp.
It's better to parse a URL's path and query parameters with a dedicated tool, such as a built-in URL parser, than with RegExp. This is because you can't get tokenized output with RegExp.
Although developers can use RegExp to find or validate emails, things can get really complicated when attempting to do so.
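For the URL case, a minimal sketch of the parser-based alternative, assuming a runtime that provides the standard URL class (browsers and Node.js do). The address is a made-up example:

```javascript
// Hypothetical address used only for illustration.
const url = new URL("https://example.com/products/42?color=red&size=m");

// The parser hands back each part already separated.
console.log(url.hostname); // "example.com"
console.log(url.pathname); // "/products/42"

// Query parameters are exposed as key/value pairs.
const color = url.searchParams.get("color");
const params = Array.from(url.searchParams.entries());
console.log(color);  // "red"
console.log(params); // [["color", "red"], ["size", "m"]]
```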

Although the knowledge of RegExp is transferable, it isn't something you learn in a single day. If you want to get a deeper understanding of RegExp, this documentation is a great fit for you. Also, there are amazing tools for RegExp in JavaScript like RegExr, Regex tester, and Regex visualizer.

Once a developer has built up a deeper knowledge of regular expressions, they can make better decisions about when to apply them. There are many real-life examples where regular expressions are the optimal approach for a certain problem. And this doesn't apply only to JavaScript and the front-end world but to the backend world as well.
In order for SessionStack to generate a pixel-perfect replay of user journeys as videos, it has to process the collected data from the browser such as DOM changes, user interactions, JavaScript exceptions, stack traces, network requests, debug messages, and CSS files. The processing of the CSS files utilizes regular expressions in order to be efficient and scalable.
The videos can then be used to optimize product workflows, reproduce bugs, or see where users are stuck.

There is a free trial if you'd like to give SessionStack a try.

If you missed the previous chapters of the series, you can find them here:

An overview of the engine, the runtime, and the call stack
Inside Google's V8 engine + 5 tips on how to write optimized code
Memory management + how to handle 4 common memory leaks
The event loop and the rise of Async programming + 5 ways to better coding with async/await
Deep dive into WebSockets and HTTP/2 with SSE + how to pick the right path
A comparison with WebAssembly + why in certain cases it's better to use it over JavaScript
The building blocks of Web Workers + 5 cases when you should use them
Service Workers, their life-cycle, and use cases
The mechanics of Web Push Notifications
Tracking changes in the DOM using MutationObserver
The rendering engine and tips to optimize its performance
Inside the Networking Layer + How to Optimize Its Performance and Security
Under the hood of CSS and JS animations + how to optimize their performance
Parsing, Abstract Syntax Trees (ASTs) + 5 tips on how to minimize parse time
The internals of classes and inheritance + transpiling in Babel and TypeScript
Storage engines + how to choose the proper storage API
The internals of Shadow DOM + how to build self-contained components
WebRTC and the mechanics of peer to peer connectivity
Under the hood of custom elements + Best practices on building reusable components
Exceptions + best practices for synchronous and asynchronous code
5 types of XSS attacks + tips on preventing them
CSRF attacks + 7 mitigation strategies
Iterators + tips on gaining advanced control over generators
Cryptography + how to deal with man-in-the-middle (MITM) attacks
Functional style and how it compares to other approaches
Three types of polymorphism