regex


Notebook Creator: Salah Ahmed

Regex

A regular expression, or regex, is a search string that describes a pattern to match in text. It is a powerful tool to have in your toolbelt.

Resources

anytime you want to deal with regular expressions, test each expression you're attempting with this tool

matching numbers/letters/whitespace

  • \w matches a word character (letter, digit, or underscore)
  • \s matches a whitespace character (\n, \t, single space character)
  • \d matches digits 0-9
  • . matches any character except newline

matching outside of the sets

  • \W matches "not word"
  • \S matches "not space char"
  • \D matches "not digit"
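
A small sketch of the negated classes in action. The sample string and variable names are made up for illustration; `re.findall` returns every non-overlapping match as a list.

```python
import re

# Hypothetical sample text, chosen only to exercise each class.
sample = 'pin 42!'

non_word = re.findall(r'\W', sample)   # not a letter, digit, or underscore
non_space = re.findall(r'\S', sample)  # not whitespace
non_digit = re.findall(r'\D', sample)  # not a digit 0-9

print(non_word)
print(non_space)
print(non_digit)
```
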
In [8]:
# re module in python deals with regular expressions
import re
pattern = r'i l\wve to l\wve'
word_one = 'i love to live'
word_two = 'i live to love'
mismatch = 'i l-ve to love'

re_pattern = re.compile(pattern)
print(re_pattern.match(word_one))
print(re_pattern.match(word_two))
print(re_pattern.match(mismatch))
<_sre.SRE_Match object at 0x7f4f29795238>
<_sre.SRE_Match object at 0x7f4f29795238>
None
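
One thing worth knowing about the cell above: `match` only requires the pattern to start at the beginning of the string, not to cover all of it. A minimal sketch comparing `match`, `search`, and `fullmatch` (the last one assumes Python 3.4 or newer); the sample strings are made up for illustration.

```python
import re

p = re.compile(r'i l\wve to l\wve')

# match: the pattern must begin at index 0, trailing text is allowed
trailing = p.match('i love to live forever')
# match: fails when the pattern begins later in the string
leading = p.match('well, i love to live')
# search: scans the string for the first occurrence anywhere
found = p.search('well, i love to live')
# fullmatch: the entire string has to match the pattern
whole = p.fullmatch('i love to live forever')

print(trailing is not None, leading is not None,
      found is not None, whole is not None)
```
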

match a phone number ex: 718.777.7777

escaping '.' with '\' ensures it matches a literal '.' instead of any character
In [10]:
# raw string (r'...') so Python passes the backslashes to re unchanged
pattern = r'\d\d\d\.\d\d\d\.\d\d\d\d'
phone_one = '718.777.3143'
mismatch = '83s.382.sa32'

re_pattern = re.compile(pattern)
print(re_pattern.match(phone_one))
print(re_pattern.match(mismatch))
<_sre.SRE_Match object at 0x7f4f29795510>
None
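
To see why the escape matters, here is a small sketch comparing the escaped and unescaped patterns against a made-up string that uses 'x' as a separator. It also shows `re.escape`, which escapes special characters in a literal string for you.

```python
import re

unescaped = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')   # '.' means any character
escaped = re.compile(r'\d\d\d\.\d\d\d\.\d\d\d\d')   # '\.' means a literal dot

not_a_phone = '718x777x3143'  # hypothetical input with the wrong separator

loose = unescaped.match(not_a_phone)   # matches, because '.' accepts 'x'
strict = escaped.match(not_a_phone)    # None, because 'x' is not a dot

# re.escape builds the escaped form of a literal string
literal = re.escape('718.777.3143')

print(loose is not None, strict is not None)
print(literal)
```
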

Example: matching 4 word characters and a 4-digit pin, separated by a whitespace character

"""\w\w\w\w\s\d\d\d\d"""
In [16]:
pattern = r"\w\w\w\w\s\d\d\d\d"
word_one = 'abcd 0321'
# \w matches digits (and underscore) as well as letters
word_two = 'abc3 0321'
mismatch = 'a-dvd afd-3'

re_pattern = re.compile(pattern)
print(re_pattern.match(word_one))
print(re_pattern.match(word_two))
print(re_pattern.match(mismatch))
<_sre.SRE_Match object at 0x7f4f29795510>
<_sre.SRE_Match object at 0x7f4f29795510>
None

matching any character with '.'

In [18]:
pattern = "my favorite character is ."
word_one = 'my favorite character is x'
word_two = 'my favorite character is ?'
word_three = 'my favorite character is 3'
word_four = 'my favorite character is  '
re_pattern = re.compile(pattern)

print(re_pattern.match(word_one))
print(re_pattern.match(word_two))
print(re_pattern.match(word_three))
print(re_pattern.match(word_four))
<_sre.SRE_Match object at 0x7f4f29795988>
<_sre.SRE_Match object at 0x7f4f29795988>
<_sre.SRE_Match object at 0x7f4f29795988>
<_sre.SRE_Match object at 0x7f4f29795988>
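
The one character '.' does not match by default is a newline. A minimal sketch of that exception, and of the `re.DOTALL` flag that removes it; the input string is made up for illustration.

```python
import re

source = r'my favorite character is .'
with_newline = 'my favorite character is \n'  # hypothetical input

# by default '.' refuses to match '\n'
default = re.compile(source).match(with_newline)
# with re.DOTALL, '.' matches any character including '\n'
dotall = re.compile(source, re.DOTALL).match(with_newline)

print(default is not None, dotall is not None)
```
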

character sets

character sets are sets of characters enclosed in [] brackets that match any single character in the set.

note that most special characters lose their special meaning inside a set and are matched literally; the exceptions are ^ at the start (negation), - between two characters (range), ] and \

  • [abc] any of a, b, or c
  • [^abc] not a, b, or c
  • [a-g] any character from a to g

"""[abcdefghijklmnopqrstuvwxyz0123456789'.,/!]"""

shortcut:

"""[a-z0-9'.,/!]"""

is equivalent to matching any lowercase letter from a to z, or any digit from 0 to 9, or any of the characters '.,/!
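
A short sketch of that set, its negation, and a literal '.' inside brackets. The sample text and the trailing `+` (one or more repetitions) are additions for illustration only.

```python
import re

allowed = re.compile(r"[a-z0-9'.,/!]+")   # runs of characters from the set
negated = re.compile(r"[^a-z0-9'.,/!]")   # any single character NOT in the set

text = "it's 9/10, ok!"  # hypothetical sample

runs = allowed.findall(text)
outside = negated.findall(text)

# inside a set, '.' is a literal dot and no longer means "any character"
dot_vs_x = re.match(r'[.]', 'x')
dot_vs_dot = re.match(r'[.]', '.')

print(runs)
print(outside)
print(dot_vs_x is not None, dot_vs_dot is not None)
```
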

In [19]:
pattern = "i love lock[es]"
word_one = 'i love locke'
word_two = 'i love locks'
word_three = 'i love locka'
re_pattern = re.compile(pattern)

print(re_pattern.match(word_one))
print(re_pattern.match(word_two))
print(re_pattern.match(word_three))
<_sre.SRE_Match object at 0x7f4f29795b28>
<_sre.SRE_Match object at 0x7f4f29795b28>
None

Quantifiers and Alternations

  • {n,m} matches from n to m repetitions of the previous pattern/group (no spaces inside the braces)

    • {n} is equivalent to {n,n}
    • {n,} matches n or more repetitions (no upper limit)
      """\w{4}\s\d{4}"""
      is equivalent to 
      """\w\w\w\w\s\d\d\d\d"""
      
  • + is equivalent to {1,}

  • * is equivalent to {0,}
  • ? is equivalent to {0,1}
  • a+? a{2,}? adding ? after a quantifier makes it lazy: match as few repetitions as possible
  • ab|cd match ab or cd
    "(ab|ef)an"
    matches "aban" or "efan"
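
A minimal sketch of greedy versus lazy quantifiers, plus a pitfall with the brace syntax: Python's `re` only treats braces as a quantifier when they are written without spaces. The sample strings are made up for illustration.

```python
import re

text = 'aaaa'  # hypothetical sample

greedy = re.match(r'a+', text).group()            # takes as many as possible
lazy = re.match(r'a+?', text).group()             # takes as few as possible
lazy_min_two = re.match(r'a{2,}?', text).group()  # at least two, as few as possible

# '?' on its own makes the previous item optional
optional = re.match(r'colou?r', 'color').group()

# with a space inside the braces the text is treated literally,
# so this pattern does not mean "two to three a's"
spaced = re.match(r'a{2, 3}', 'aaa')

print(greedy, lazy, lazy_min_two, optional, spaced)
```
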
    
In [22]:
pattern = r'\d{3}\.\d{3}\.\d{4}'
phone = '718.777.7777'
phone_two = '718.777.7d77'


p = re.compile(pattern)
print(p.match(phone))
print(p.match(phone_two))
<_sre.SRE_Match object at 0x7f4f29795e68>
None
In [24]:
pattern = "(iv|eth)an"
word = "ivan"
word_two = "ethan"

p = re.compile(pattern)
print(p.match(word))
print(p.match(word_two))
<_sre.SRE_Match object at 0x7f4f300b5dc8>
<_sre.SRE_Match object at 0x7f4f300b5dc8>

Groups

groups are subsets of patterns that you might want to reference again if you are interested in a certain subset of the string rather than the entire match

groups can be "captured" with parentheses

"""(\w{4}) pin:(\d{4})"""

this captures the first four characters of a matched pattern of type

"""\w\w\w\w pin:\d\d\d\d"""

as well as the final four digits
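
A quick sketch of that pattern with two capture groups. The input string is made up; `group(0)` is the whole match and `group(1)`, `group(2)` are the captured pieces in order.

```python
import re

p = re.compile(r'(\w{4}) pin:(\d{4})')
m = p.match('abcd pin:1234')  # hypothetical input

whole = m.group(0)   # the entire matched text
name = m.group(1)    # first group: the four word characters
pin = m.group(2)     # second group: the four digits
both = m.groups()    # all captured groups as a tuple

print(whole)
print(name, pin)
print(both)
```
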

In [34]:
# capture id of students
pattern = r'\w+?\s\w+?: (\d{4})'
s_one = 'salah ahmed: 1823'
s_two = 'jon stewart: 8421'
s_three = 'john mulaney: 3824'

p = re.compile(pattern)
print(p.match(s_one).groups())
print(p.match(s_two).groups())
print(p.match(s_three).groups())
('1823',)
('8421',)
('3824',)

lookarounds

  • lookahead
    • q(?=u) matches q if it is followed by a u (doesn't match the u)
    • q(?!u) matches q if it is not followed by a u
In [36]:
pattern = '(salah)(?=!)'
w_one = 'salah?'
w_two = 'salah!'

p = re.compile(pattern)
print(p.match(w_one))
print(p.match(w_two))
None
<_sre.SRE_Match object at 0x7f4f297ad6c0>
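
The cell above covers the positive lookahead. A minimal sketch of the negative form q(?!u), and of the fact that a lookahead checks the next character without including it in the match; the sample words are made up for illustration.

```python
import re

no_u = re.compile(r'q(?!u)')

# 'q' at the end of the word is not followed by 'u', so it matches
in_iraq = no_u.search('iraq')
# the only 'q' here is followed by 'u', so nothing matches
in_quit = no_u.search('quit')

# the lookahead asserts '!' is next but leaves it out of the match
m = re.match(r'salah(?=!)', 'salah!')

print(in_iraq.span(), in_quit)
print(m.group(), m.end())
```
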

Anchors

  • ^ matches start of string
  • $ matches end of string
  • \b matches word boundary (either start or end)
  • \B matches not word boundary (inside word, not beginning or end)
In [43]:
# "\bport\b"
# matches
# "port"
# but not
# "opportunity"
# and
# "\Bport\B"
# matches
# "opportunity"
# but not
# "port"
pattern = r"\Bport\B"
p = re.compile(pattern)

word = """port"""
match = """opportunity"""

print(p.search(word))
print(p.search(match))
None
<_sre.SRE_Match object at 0x7f4f297ae850>
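
For the remaining anchors, a small sketch of ^, $ and \b with `search`. The sample words are made up for illustration.

```python
import re

# ^ pins the pattern to the start of the string
starts = re.search(r'^port', 'portal')
not_start = re.search(r'^port', 'airport')

# $ pins the pattern to the end of the string
ends = re.search(r'port$', 'airport')

# \b requires a word boundary on each side of 'port'
whole_word = re.search(r'\bport\b', 'the port is open')
inside_word = re.search(r'\bport\b', 'opportunity')

print(starts is not None, not_start is not None, ends is not None)
print(whole_word is not None, inside_word is not None)
```
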