Summary: in this tutorial, you’ll learn about Python regex backreferences and how to apply them effectively.
Introduction to Python regex backreferences
Backreferences are like variables in Python. A backreference lets you refer back to the text matched by an earlier capturing group within the same regular expression.
The following shows the syntax of a backreference:
\N
In a replacement string, such as the one passed to re.sub(), you can also use the following syntax:
\g<N>
In both syntaxes, N can be 1, 2, 3, and so on, and identifies the corresponding capturing group by number.
Note that \g<0> refers to the entire match, which has the same value as match.group(0).
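As a minimal sketch of the two forms, the snippet below uses \1 inside a pattern and \g<1> and \g<0> inside a replacement string. The sample string and variable names are made up for illustration. Note that Python's re module accepts \g<N> only in replacement strings, not in the pattern itself.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = 'hello hello world'

# \1 inside the pattern must match the same text that group 1 captured.
match = re.search(r'(\w+) \1', text)
print(match.group(0))  # hello hello
print(match.group(1))  # hello

# \g<1> in the replacement string inserts the text captured by group 1.
collapsed = re.sub(r'(\w+) \1', r'\g<1>', text)
print(collapsed)  # hello world

# \g<0> in the replacement string inserts the entire match.
bracketed = re.sub(r'(\w+) \1', r'[\g<0>]', text)
print(bracketed)  # [hello hello] world
```

The \g<N> form is mainly useful when a digit follows the reference: \g<1>0 means group 1 followed by a literal 0, whereas \10 would be read as a reference to group 10.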
Suppose you have a string with the duplicate word Python like this:
s = 'Python Python is awesome'
And you want to remove the duplicate word (Python) so that the resulting string will be:
Python is awesome
To do that, you can use a regular expression with a backreference.
First, match a word with one or more characters followed by one or more whitespace characters:
r'\w+\s+'
Second, create a capturing group that contains only the word characters:
r'(\w+)\s+'
Third, create a backreference that references the first capturing group:
r'(\w+)\s+\1'
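In this pattern, \1 is a backreference that references the (\w+) capturing group. Write the pattern as a raw string (with the r prefix). Otherwise, Python interprets '\1' as an octal escape for the character with code 1 before the regex engine ever sees it.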
Finally, replace the entire match with the first capturing group using the sub() function from the re module:
import re
s = 'Python Python is awesome'
new_s = re.sub(r'(\w+)\s+\1', r'\1', s)
print(new_s)
Output:
Python is awesome
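One caveat, offered as a hedged sketch rather than as part of the original example: because \w+ can start in the middle of a word, the pattern above may also match a repeated fragment, such as the "is" at the end of "this" followed by the word "is". Anchoring both ends with \b (a word boundary) is one common way to restrict the match to whole words. The bounded pattern and the sample strings below are assumptions made for illustration.

```python
import re

pattern = r'(\w+)\s+\1'        # pattern from the example above
bounded = r'\b(\w+)\s+\1\b'    # assumed variation with word boundaries

# The 'is' at the end of 'this', followed by ' is', counts as a repeat.
naive_result = re.sub(pattern, r'\1', 'this is a test')
print(naive_result)    # this a test

# With word boundaries, only whole repeated words are collapsed.
bounded_result = re.sub(bounded, r'\1', 'this is a test')
print(bounded_result)  # this is a test

# A genuine duplicate word is still removed.
dedup_result = re.sub(bounded, r'\1', 'Python Python is awesome')
print(dedup_result)    # Python is awesome
```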
More Python regex backreference examples
Let’s take some more examples of using backreferences.
1) Using Python regex backreferences to get text inside quotes
Suppose you want to get the text within double quotes:
"This is regex backreference example"
Or within single quotes:
'This is regex backreference example'
But not within a mix of single and double quotes. The following should not match:
'not match"
To do this, you might first try the following pattern:
'[\'"](.*?)[\'"]'
However, this pattern also matches text that starts with a single quote (') and ends with a double quote ("), or vice versa. For example:
import re
s = '"Python\'s awesome". She said'
pattern = '[\'"].*?[\'"]'
match = re.search(pattern, s)
print(match.group(0))
It returns "Python' instead of "Python's awesome", because the lazy .*? stops at the first quote character of either kind:
"Python'
To fix it, you can use a backreference:
r'([\'"]).*?\1'
The backreference \1 refers to the first capturing group, which holds the opening quote. So if the quoted text starts with a single quote, \1 matches only a single quote. And if it starts with a double quote, \1 matches only a double quote.
For example:
import re
s = '"Python\'s awesome". She said'
pattern = r'([\'"])(.*?)\1'
match = re.search(pattern, s)
print(match.group())
Output:
"Python's awesome"
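The pattern above also has a second capturing group, (.*?), which holds the text between the quotes. As a small sketch, the following loop applies the same pattern to three sample strings (the second and third are made up for illustration) and collects group 2, or None when the quotes are mixed and nothing matches.

```python
import re

pattern = r'([\'"])(.*?)\1'

samples = [
    '"Python\'s awesome". She said',  # double quotes around an apostrophe
    "'single quoted'",                 # single quotes
    '\'not match"',                    # mixed quotes
]

inner_texts = []
for s in samples:
    match = re.search(pattern, s)
    # Group 1 is the opening quote; group 2 is the text between the quotes.
    inner_texts.append(match.group(2) if match else None)

print(inner_texts)
# ["Python's awesome", 'single quoted', None]
```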
2) Using Python regex backreferences to find words that have at least one consecutive repeated character
The following example uses a backreference to find words that have at least one consecutive repeated character:
import re
words = ['apple', 'orange', 'strawberry']
pattern = r'\b\w*(\w)\1\w*\b'
results = [w for w in words if re.search(pattern, w)]
print(results)
Output:
['apple', 'strawberry']
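To see which characters are repeated, a shorter pattern is enough. The sketch below is a variation on the example above, not a replacement for it: it searches each word for (\w)\1 and records the doubled pair. Keep in mind that \w also matches digits and underscores, so a string such as "a__b" would count as well.

```python
import re

words = ['apple', 'orange', 'strawberry']

# (\w) captures one word character; \1 requires the same character again.
doubled = {}
for w in words:
    match = re.search(r'(\w)\1', w)
    if match:
        doubled[w] = match.group(0)

print(doubled)  # {'apple': 'pp', 'strawberry': 'rr'}
```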
Summary
- Use a backreference \N to reference capturing group N in a regular expression.