Understand re module
The regular expressions library re is a built-in library in Python.
First we should understand the pattern. For a 10-digit phone number, you can write the pattern as r"\d\d\d\d\d\d\d\d\d\d". Here \d matches a single digit; the backslash marks d as a special sequence rather than the literal letter d. Since the digit repeats, you can simplify this to r"\d{10}". The {10} part is called a quantifier.
Check the code below, which uses search:
import re
string = 'Customer phone numer is 8888777666'
pattern = r"\d{10}"  # raw string, so the backslash reaches the regex engine unchanged
match = re.search(pattern, string)
print(match)
match2 = re.search("pattern", string)
print(match2)
When the search succeeds, printing the match object shows the matched text and its location (span) in the string. When nothing matches, search returns None, so match2 is None.
Output for the above code
<re.Match object; span=(24, 34), match='8888777666'>
None
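Because search returns None when there is no match, it is safer to check the result before calling methods on it. Here is a minimal sketch of that check; the helper name find_phone and the second sample string are assumptions made for illustration.

```python
import re

def find_phone(text):
    """Return the first 10-digit run in text, or None if there is none."""
    match = re.search(r"\d{10}", text)
    if match is None:
        # No match: calling match.group() here would raise AttributeError
        return None
    return match.group()

found = find_phone('Customer phone numer is 8888777666')
missing = find_phone('No phone number here')
print(found)
print(missing)
```

The match object also offers span(), start() and end() if you need the position rather than the text.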
search returns only the first successful match. To find all matches, use findall.
import re
string = 'Customer phone numer is 8888777666 and alternate is 2233445566'
pattern = r"\d{10}"
match = re.findall(pattern, string)
print(match)
match2 = re.findall("pattern", string)
print(match2)
findall returns a list of the matched strings, or an empty list when nothing matches. Output for the above code is
['8888777666', '2233445566']
[]
Similarly, finditer returns an iterator that yields a match object for each match.
import re
string = 'Customer phone numer is 8888777666 and alternate is 2233445566'
pattern = r"\d{10}"
for match in re.finditer(pattern, string):
    print(match)
    print(match.start())
Output is
<re.Match object; span=(24, 34), match='8888777666'>
24
<re.Match object; span=(52, 62), match='2233445566'>
52
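Since each item produced by finditer is a full match object, you can collect positions as well as text. This is a small sketch using the same sample string; the variable name spans is chosen only for illustration.

```python
import re

string = 'Customer phone numer is 8888777666 and alternate is 2233445566'

# Collect (start, end) positions for every 10-digit run
spans = [(m.start(), m.end()) for m in re.finditer(r"\d{10}", string)]
print(spans)
```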
Now we understand the basic regular expression methods search, findall and finditer for searching patterns in a string. Next we go through more complex patterns.
Patterns
Here is a list of commonly used patterns
Pattern | Description | Example Pattern Code | Example Match |
---|---|---|---|
\d | A digit | file_\d\d | file_66 |
\D | A non-digit | file_\D | file_x |
\w | An alphanumeric character or underscore | \w+ | Hello123 |
\W | A non-alphanumeric character | \W+ | #$+ |
\s | A whitespace character | \s+ | the space in Hello World |
\S | A non-whitespace character | \S+ | HelloWorld |
. | Any character except newline | py...n | python, py123n |
^ | Start of a string | ^Hello | Hello, Hello World |
$ | End of a string | World$ | Hello World |
[abc] | Any one of a, b, or c | [aeiou] | e, o |
[0-9] | Any digit from 0 to 9 | [0-9]+ | 123, 456 |
[^0-9] | Any character except digits | [^0-9]+ | abcXYZ |
(abc) | A group (captures) | (\d{2}) | 12 (captured) |
a* | Zero or more 'a's | a* | '', 'a', 'aa' |
a+ | One or more 'a's | a+ | 'a', 'aa' |
a? | Zero or one 'a' | a? | '', 'a' |
a{3} | Exactly 3 'a's | a{3} | 'aaa' |
a{3,5} | Between 3 and 5 'a's | a{3,5} | 'aaa', 'aaaaa' |
a{3,} | 3 or more 'a's | a{3,} | 'aaa', 'aaaaaa' |
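
A few rows of the table can be tried out directly. The snippet below is only a sketch; the sample text and variable names are assumptions made for illustration.

```python
import re

text = "file_66 Hello123"

digits = re.findall(r"\d", text)            # each digit separately
words = re.findall(r"\w+", text)            # runs of letters, digits and underscore
vowels = re.findall(r"[aeiou]", "Hello")    # any one of the listed characters
starts = re.search(r"^file", text) is not None   # anchored at the start
ends = re.search(r"123$", text) is not None      # anchored at the end

# fullmatch requires the whole string to fit the pattern
four_ok = re.fullmatch(r"a{3,5}", "aaaa") is not None
two_ok = re.fullmatch(r"a{3,5}", "aa") is not None

print(digits, words, vowels, starts, ends, four_ok, two_ok)
```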
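You can group parts of a pattern with parentheses (), like (\d{3}), and then read each group from the match object. The example below also precompiles the pattern with the compile method.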
import re
string = 'Customer phone numer is 8888-777-666 and alternate number is 2233445566'
pattern = re.compile(r'(\d{4})-(\d{3})-(\d{3})')
match = re.search(pattern, string)
print(match)
print(match.group(1))
print(match.group(2))
print(match.group(3))
Output is
<re.Match object; span=(24, 36), match='8888-777-666'>
8888
777
666
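A compiled pattern also has its own search method, and groups() returns all captured groups at once. The following is a sketch under those assumptions; the names phone_pattern, parts and joined are illustrative.

```python
import re

phone_pattern = re.compile(r'(\d{4})-(\d{3})-(\d{3})')
match = phone_pattern.search('Customer phone numer is 8888-777-666')

parts = ()
whole = None
if match:
    parts = match.groups()   # all captured groups as a tuple
    whole = match.group(0)   # group 0 is the entire match

joined = ''.join(parts)      # the number without hyphens
print(parts, whole, joined)
```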
More Regular Expressions
OR-ing using |
Using | you can express a logical OR between alternatives, for example searching for John or George with John|George
import re
print(re.search('John|George', 'John and George came yesterday'))
print(re.search('John|George', 'John alone came yesterday'))
print(re.search('John|George', 'George alone came yesterday'))
print(re.search('John|George', 'None came yesterday'))
Output is
<re.Match object; span=(0, 4), match='John'>
<re.Match object; span=(0, 4), match='John'>
<re.Match object; span=(0, 6), match='George'>
None
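search stops at the first alternative it finds, so the first string above reports only John. To collect every alternative that occurs, findall can be used instead, and parentheses limit the alternation to part of a larger pattern. A brief sketch, with variable names chosen for illustration:

```python
import re

# findall reports every occurrence of either name
names = re.findall('John|George', 'John and George came yesterday')
print(names)

# Parentheses restrict the alternation to the name, followed by literal text
grouped = re.search(r'(John|George) came', 'George came yesterday')
who = grouped.group(1) if grouped else None
print(who)
```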
Wildcard (., *, + )
import re
string = 'Customer phone nubmer is 8888-777-666 and alternate number is 2233445566'
# Without wildcard
print(re.findall('er', string))
# with wildcard \w+
print(re.findall(r'\w+er', string))
See the difference in output with and without a wildcard
['er', 'er', 'er', 'er']
['Customer', 'nubmer', 'alter', 'number']
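The heading also lists . and *, which the example above does not exercise. The sketch below contrasts the three on a made-up sample string (the string and variable names are assumptions for illustration).

```python
import re

samples = 'ct cat caat c-t'

dot = re.findall(r'c.t', samples)    # exactly one character between c and t
star = re.findall(r'ca*t', samples)  # zero or more 'a' between c and t
plus = re.findall(r'ca+t', samples)  # one or more 'a' between c and t

print(dot)
print(star)
print(plus)
```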
See some complex patterns
import re
string = 'Customer phone nubmer is 8888-777-666 and alternate number is 2233445566'
# find non-digit characters
print(re.findall(r'[^\d]', string))
# The pattern above checks each character separately, hence the long list of characters.
# Adding the + quantifier joins consecutive non-digit characters into longer runs
print(re.findall(r'[^\d]+', string))
Output is
['C', 'u', 's', 't', 'o', 'm', 'e', 'r', ' ', 'p', 'h', 'o', 'n', 'e', ' ', 'n', 'u', 'b', 'm', 'e', 'r', ' ', 'i', 's', ' ', '-', '-', ' ', 'a', 'n', 'd', ' ', 'a', 'l', 't', 'e', 'r', 'n', 'a', 't', 'e', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ' ', 'i', 's', ' ']
['Customer phone nubmer is ', '-', '-', ' and alternate number is ']
Here is another example of exclusion. The pattern [^!.?,]+ matches runs of characters that are not one of the listed punctuation marks.
import re
print(re.findall('[^!.?,]+','Hi! How are you? I am doing good.'))
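The exclusion example splits the text at the listed punctuation marks, and each piece keeps its leading space. As a small sketch, the pieces can be tidied with strip; the names sentence, pieces and cleaned are illustrative.

```python
import re

sentence = 'Hi! How are you? I am doing good.'

# Runs of characters that are not !, ., ? or ,
pieces = re.findall('[^!.?,]+', sentence)

# Remove the whitespace left over at the edges of each piece
cleaned = [piece.strip() for piece in pieces]

print(pieces)
print(cleaned)
```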