Summary: in this tutorial, you’ll learn about Python regex backreferences and how to apply them effectively.
Introduction to the Python regex backreferences
Backreferences work like variables in Python: they let you refer back to the text matched by a capturing group within a regular expression.
The following shows the syntax of a backreference:
\N
In a replacement string, such as the one passed to re.sub(), you can alternatively use the following syntax:
\g<N>
In this syntax, N can be 1, 2, 3, etc., and represents the corresponding capturing group. The \g<N> form is not valid inside a pattern.
Note that \g<0> refers to the entire match, which has the same value as match.group(0).
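As a minimal sketch of the two syntaxes, the following uses \1 inside the pattern, and \g<1> and \g<0> inside the replacement string. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing a doubled word.
text = 'hello hello world'

# \1 inside the pattern must match the same text that group 1 captured.
pattern = r'(\w+) \1'

# \g<1> inside the replacement inserts the text captured by group 1.
deduplicated = re.sub(pattern, r'\g<1>', text)

# \g<0> inside the replacement inserts the entire match.
bracketed = re.sub(pattern, r'[\g<0>]', text)

print(deduplicated)  # hello world
print(bracketed)     # [hello hello] world
```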
Suppose you have a string with the duplicate word Python like this:
s = 'Python Python is awesome'
And you want to remove the duplicate word (Python) so that the resulting string will be:
Python is awesome
To do that, you can use a regular expression with a backreference.
First, match a word with one or more word characters followed by one or more whitespace characters:
r'\w+\s+'
Second, create a capturing group that contains only the word characters:
r'(\w+)\s+'
Third, create a backreference that references the first capturing group:
r'(\w+)\s+\1'
In this pattern, \1 is a backreference that references the (\w+) capturing group. The r prefix makes the pattern a raw string. Without it, Python would interpret \1 as the character with code 1 rather than passing a backreference to the regex engine.
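To see what the finished pattern captures, and what happens when the raw-string prefix is left off, here is a small sketch. The variable names are illustrative only.

```python
import re

s = 'Python Python is awesome'

# Raw string: the regex engine receives the backreference \1.
match = re.search(r'(\w+)\s+\1', s)
print(match.group(0))  # Python Python
print(match.group(1))  # Python

# Without the r prefix, Python turns '\1' into the character with code 1
# before the regex engine sees it, so the pattern contains no backreference.
# (\\w and \\s are escaped here only to keep this non-raw string valid.)
no_match = re.search('(\\w+)\\s+\1', s)
print(no_match)  # None
```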
Finally, replace the entire match with the first capturing group using the sub() function from the re module:
import re
s = 'Python Python is awesome'
# \1 in the pattern matches the repeated word; \1 in the replacement keeps one copy
new_s = re.sub(r'(\w+)\s+\1', r'\1', s)
print(new_s)
Output:
Python is awesome
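One caveat: without word boundaries, the pattern can also match inside words. In 'This is', for example, the is at the end of This followed by the word is counts as a repeat. The sketch below shows one possible refinement, which is not part of the original pattern. It assumes you want whole-word matching, so it adds \b anchors, and it uses a repeated group to collapse runs of three or more copies.

```python
import re

s = 'This is a test test test'

# Original pattern: also matches the "is" at the end of "This",
# and removes only one extra copy of "test".
naive = re.sub(r'(\w+)\s+\1', r'\1', s)

# Assumed refinement: \b restricts matches to whole words, and (?:\s+\1\b)+
# consumes every extra copy of the captured word.
strict = re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', s)

print(naive)   # This a test test
print(strict)  # This is a test
```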
More Python regex backreference examples
Let’s take some more examples of using backreferences.
1) Using Python regex backreferences to get text inside quotes
Suppose you want to get the text within double quotes:
"This is regex backreference example"
Or single quotes:
'This is regex backreference example'
But not a mix of single and double quotes. The following should not match:
'not match"
As a first attempt, you might use the following pattern:
'[\'"](.*?)[\'"]'
However, this pattern also matches text that starts with a single quote (') and ends with a double quote ("), or vice versa. For example:
import re
s = '"Python\'s awesome". She said'
pattern = '[\'"].*?[\'"]'
match = re.search(pattern, s)
print(match.group(0))
It returns "Python' rather than "Python's awesome":
"Python'
To fix it, you can use a backreference:
r'([\'"]).*?\1'
The backreference \1 refers to the first capturing group. So if the text starts with a single quote, \1 will match a single quote. And if it starts with a double quote, \1 will match a double quote.
For example:
import re
s = '"Python\'s awesome". She said'
pattern = r'([\'"])(.*?)\1'
match = re.search(pattern, s)
print(match.group())
Output:
"Python's awesome"
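Because the pattern in this example wraps the quoted text in a second capturing group, group(2) returns the text without the surrounding quotes. The following sketch, which uses a made-up sample string, collects every quoted fragment with finditer():

```python
import re

# Hypothetical sample mixing both quote styles.
s = '''He said "it's fine" and she replied 'say "hi" to him'.'''

pattern = r'([\'"])(.*?)\1'

# Group 1 is the opening quote; group 2 is the text between the quotes.
quoted = [m.group(2) for m in re.finditer(pattern, s)]
print(quoted)  # ["it's fine", 'say "hi" to him']
```

Each fragment ends only at a quote of the same kind that opened it, so the apostrophe in it's and the inner double quotes around hi do not end their matches early.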
2) Using Python regex backreferences to find words that have at least one consecutive repeated character
The following example uses a backreference to find words that have at least one consecutive repeated character:
import re
words = ['apple', 'orange', 'strawberry']
pattern = r'\b\w*(\w)\1\w*\b'
results = [w for w in words if re.search(pattern, w)]
print(results)
Output:
['apple', 'strawberry']
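The captured group also tells you which character is repeated. As a small sketch building on the same word list and pattern (the doubled dictionary is an illustrative name), match.group(1) returns the doubled character found in each word:

```python
import re

words = ['apple', 'orange', 'strawberry']
pattern = r'\b\w*(\w)\1\w*\b'

# Map each matching word to the character captured by group 1.
doubled = {}
for w in words:
    match = re.search(pattern, w)
    if match:
        doubled[w] = match.group(1)

print(doubled)  # {'apple': 'p', 'strawberry': 'r'}
```

Because the leading \w* is greedy, a word with several doubled pairs reports the last pair rather than the first.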
Summary
- Use a backreference \N to reference capturing group N in a regular expression.