Python Regex Backreferences

Summary: in this tutorial, you’ll learn about Python regex backreferences and how to apply them effectively.

Introduction to the Python regex backreferences

Backreferences are like variables in Python. A backreference allows you to reference a capturing group within a regular expression, and it matches the same text that the group matched.

The following shows the syntax of a backreference:

\N

Alternatively, in the replacement string of the sub() function, you can use the following syntax:

\g<N>

In this syntax, N is 1, 2, 3, etc., and represents the number of the corresponding capturing group.

Note that \g<0> refers to the entire match, which has the same value as match.group(0). Like \g<N>, it is only valid in a replacement string, not in the pattern itself.
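As a small illustration of the \g<N> form, the sketch below uses it in replacement strings passed to re.sub(). The sample strings and variable names are made up for this example.

```python
import re

# r'\g<1>0' is group 1 followed by a literal '0'.
# Writing r'\10' instead would be read as a reference to group 10.
padded = re.sub(r'(\d+)', r'\g<1>0', 'id 42')
print(padded)  # id 420

# \g<0> inserts the entire match, like match.group(0).
wrapped = re.sub(r'\w+', r'[\g<0>]', 'hello world')
print(wrapped)  # [hello] [world]
```

The \g<N> form is mainly useful when a digit follows the group reference directly, as in the first call.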

Suppose you have a string with the duplicate word Python like this:

s = 'Python Python is awesome'

And you want to remove the duplicate word (Python) so that the resulting string is:

Python is awesome

To do that, you can use a regular expression with a backreference.

First, match a word with one or more characters followed by one or more spaces:

r'\w+\s+'

Second, create a capturing group that contains only the word characters:

r'(\w+)\s+'

Third, create a backreference that references the first capturing group:

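r'(\w+)\s+\1'

In this pattern, \1 is a backreference that references the (\w+) capturing group. The pattern must be a raw string (with the r prefix), otherwise Python interprets '\1' as an escape sequence instead of passing it to the regex engine.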
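The following sketch shows why a pattern with a backreference is usually written as a raw string. The variable names are chosen for this example only.

```python
import re

s = 'Python Python is awesome'

# With the r prefix, the regex engine receives \1 as a backreference.
raw_match = re.search(r'(\w+)\s+\1', s)
print(raw_match.group())  # Python Python

# Without the r prefix, Python turns '\1' into the character chr(1)
# before the re module ever sees it, so the backreference is lost.
plain_match = re.search('(\\w+)\\s+\1', s)
print(plain_match)  # None
```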

Finally, replace the entire match with the first capturing group using the sub() function from the re module:

import re

s = 'Python Python is awesome'

new_s = re.sub(r'(\w+)\s+\1', r'\1', s)

print(new_s)

Output:

Python is awesome
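One caveat: the pattern (\w+)\s+\1 also matches when the second occurrence is only the start of a longer word, so 'the theory' would become 'theory'. A minimal sketch of a stricter variant, assuming you want whole-word duplicates only, adds \b word boundaries. The helper name remove_duplicate_words is hypothetical.

```python
import re

def remove_duplicate_words(text):
    """Collapse runs of the same whole word into a single occurrence."""
    # \b anchors both the captured word and each repeat at word boundaries,
    # so a repeat cannot be just the prefix of a longer word.
    return re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)

print(remove_duplicate_words('Python Python is awesome'))  # Python is awesome
print(remove_duplicate_words('the theory'))                # the theory
```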

More Python regex backreference examples

Let’s take some more examples of using backreferences.

1) Using Python regex backreferences to get text inside quotes

Suppose you want to get the text within double quotes:

"This is regex backreference example"

Or single quotes:

'This is regex backreference example'

But not a mix of single and double quotes. The following should not match:

'not match"

To do this, you may use the following pattern:

'[\'"](.*?)[\'"]'

However, this pattern also matches text that starts with a single quote (') and ends with a double quote ("), or vice versa. For example:

import re

s = '"Python\'s awesome". She said'
pattern = '[\'"].*?[\'"]'

match = re.search(pattern, s)

print(match.group(0))

It returns "Python' rather than "Python's awesome":

"Python'

To fix it, you can use a backreference:

r'([\'"]).*?\1'

The backreference \1 refers to the first capturing group, which captures the opening quote. So if the match starts with a single quote, \1 matches only a single quote. And if it starts with a double quote, \1 matches only a double quote.

For example:

import re

s = '"Python\'s awesome". She said'
pattern = r'([\'"])(.*?)\1'

match = re.search(pattern, s)
print(match.group())

Output:

"Python's awesome"
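The match above still includes the surrounding quotes. To get only the text inside them, you can read the second capturing group. The sketch below does this for every quoted fragment in a made-up sample string.

```python
import re

s = 'He said "hi" and she said \'bye\''
pattern = r'([\'"])(.*?)\1'

# group(1) is the quote character, group(2) is the text between the quotes.
inner_texts = [m.group(2) for m in re.finditer(pattern, s)]
print(inner_texts)  # ['hi', 'bye']
```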

2) Using Python regex backreferences to find words that have at least one consecutive repeated character

The following example uses a backreference to find words that have at least one consecutive repeated character:

import re

words = ['apple', 'orange', 'strawberry']
pattern = r'\b\w*(\w)\1\w*\b'

results = [w for w in words if re.search(pattern, w)]

print(results)

Output:

['apple', 'strawberry']
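The same pattern can also scan a whole sentence. The following sketch, using a made-up sentence, reports each matching word together with the character captured by the group.

```python
import re

text = 'A yellow kitten runs off'
pattern = r'\b\w*(\w)\1\w*\b'

# group(0) is the whole word, group(1) is the repeated character.
doubled = [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]
print(doubled)  # [('yellow', 'l'), ('kitten', 't'), ('off', 'f')]
```

If a word contains more than one doubled character, group(1) holds only one of them, because a capturing group keeps a single value per match.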

Summary

  • Use a backreference \N to reference the capturing group N in a regular expression.