Python Regex

When preparing for this year's Advent of Code I am trying to improve my regex skills. A very useful function in Python's built-in re module is re.sub.
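
As a quick reminder (a small sketch of my own, not part of the problem itself): re.sub accepts either a plain replacement string or a callable that is given each re.Match and returns the replacement text.

import re

# Replacement as a string: every run of digits becomes '#'.
print(re.sub(r'[0-9]+', '#', 'a1b22c333'))                                # a#b#c#

# Replacement as a callable: double each matched number instead.
print(re.sub(r'[0-9]+', lambda m: str(int(m.group()) * 2), 'a1b22c333'))  # a2b44c666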

Find and evaluate simple expressions in a string

Let's imagine we want to handle a case like this:

text = 'AA+1D-1'    # shall be evaluated to ABC.
text = 'z-26'       # shall be evaluated to a

Transform the text by evaluating the expressions in it. Some characters are left as they are, while those followed by a simple expression like -1 or +12 shall be shifted accordingly.

Here is an initial version that solves this problem.

import re
text = 'AA+1D-1'

# It can be solved using re.sub and instead of providing a str pattern for replacement we instead
# provide a function that takes a match and returns a string.

def resolve(match: re.Match) -> str:
    match = match.group()
    if '+' in match:
        a, num = match.split('+')
        num = int(num)
    else:
        a, num = match.split('-')
        num = -int(num)

    return chr(ord(a) + num)

# replace said pattern with whatever the resolve function returns given the match object.
print(re.sub(r'([a-zA-Z][\+-][0-9]+)', resolve, text))

It works, but I was unhappy with how, inside the resolve function, I went back to using split instead of regex. Let's try to use regex all the way, and see if we can reuse the work done in the re.sub call.

import re

text = 'AA+1D-1'

def resolve(match: re.Match) -> str:
    a, sign, offset = match.groups()
    return eval(f"chr(ord('{a}'){sign}{offset})")

print(re.sub(r'([a-zA-Z])([\+-])([0-9]+)', resolve, text))

Instead of one large group in the regex, it's now split into 3 parts which we can access directly inside resolve.

Then inside resolve I had some fun when evaluating the expression, but that is not important here.
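
As an aside, if you would rather avoid eval here, the same resolve can be written with the operator module; a small sketch of my own:

import operator
import re

OPS = {'+': operator.add, '-': operator.sub}

def resolve(match: re.Match) -> str:
    a, sign, offset = match.groups()
    return chr(OPS[sign](ord(a), int(offset)))

print(re.sub(r'([a-zA-Z])([+-])([0-9]+)', resolve, 'AA+1D-1'))  # ABC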

Resolve the position, simple

Let's play around with a similar problem. In Advent of Code it is common to need to work with coordinates in a grid.

Let's start light with simply being able to add two positions like this:

text = '(5,5)+(1,2)'  # shall be evaluated to (6,7)
text = '(5,5)-(6,5)'  # shall be evaluated to (-1,0)

One way of solving it is shown below.

import re

text = '(5,5)+(1,2)'  # shall be evaluated to (6,7)

def resolve(match: re.Match) -> str:
    x,y, op, xx,yy = match.groups()
    return str(eval(f"({x}{op}{xx},{y}{op}{yy})"))

position=r'\(([-\+]?[0-9]+),([-\+]?[0-9]+)\)'
print(re.sub(f"{position}([\+-]){position}", resolve, text))

We have a pattern to get the two parts of a position, and then we repeat it and allow for a + or - operator in between.
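
To see what the composed pattern captures, it can help to match it once and look at the groups; a quick check of my own (using a cleaned-up version of the same pattern):

import re

position = r'\(([-+]?[0-9]+),([-+]?[0-9]+)\)'
m = re.match(rf"{position}([+-]){position}", '(5,5)+(1,2)')
print(m.groups())   # ('5', '5', '+', '1', '2')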

It seems reasonable to extract the part that can resolve positions separately so that it can be reused, since I know I want to do more stuff with positions.

import re

text = '(5,5)+(1,2)'  # shall be evaluated to (6,7)

def resolve_pos(pos: str) -> tuple:
    return tuple(map(int,re.match(r'([-\+]?[0-9]+),([-\+]?[0-9]+)', pos).groups()))

def resolve(match: re.Match) -> str:
    a, op, b = match.groups()
    x,y   = resolve_pos(a)
    xx,yy = resolve_pos(b)
    return str(eval(f"({x}{op}{xx},{y}{op}{yy})"))

position=r'\(([-\+]?[0-9]+,[-\+]?[0-9]+)\)'
print(re.sub(f"{position}([\+-]){position}", resolve, text))

Again, the Python details besides the regex are not too important here. The ugly one-liner is fine for a quick Advent of Code challenge but not for production.
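
For reference, a more verbose variant of resolve_pos (my own spelling-out, not the version used here) would look something like this:

import re

def resolve_pos(pos: str) -> tuple:
    # Same parsing as the one-liner, just spelled out step by step.
    m = re.match(r'([-+]?[0-9]+),([-+]?[0-9]+)', pos)
    x, y = m.groups()
    return int(x), int(y)

print(resolve_pos('5,5'))  # (5, 5)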

In this case the representation for positions is the same as the one used in Python, so the resolve_pos code can be replaced with eval like this:

import re

text = '(5,5)+(1,2)'  # shall be evaluated to (6,7)

def resolve_pos(pos: str) -> tuple:
    return eval(pos)

def resolve(match: re.Match) -> str:
    a, op, b = match.groups()
    x,y   = resolve_pos(a)
    xx,yy = resolve_pos(b)
    return str(eval(f"({x}{op}{xx},{y}{op}{yy})"))

position=r'\(([-\+]?[0-9]+,[-\+]?[0-9]+)\)'
print(re.sub(f"{position}([\+-]){position}", resolve, text))

I am still keeping the resolve_pos function as it's more readable, but the internals have been replaced with a simple eval.
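
If the eval feels too permissive, ast.literal_eval is a drop-in alternative for this case (my own suggestion, not part of the original solution); it only accepts Python literals such as numbers and tuples:

import ast

def resolve_pos(pos: str) -> tuple:
    # literal_eval parses the literal '5,5' into (5, 5),
    # but refuses arbitrary expressions.
    return ast.literal_eval(pos)

print(resolve_pos('5,5'))  # (5, 5)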

Resolve the position, handle constants

Let's add some more cases that we need to handle, one at a time.

# These 2 are the same as before.
text = '(5,5)+(1,2)'  # shall be evaluated to (6,7)
text = '(5,5)-(6,5)'  # shall be evaluated to (-1,0)

text = '(3,-1)-2'      # shall be evaluated to (1,-3)
text = '(3,-1)+2'      # shall be evaluated to (5,1)

So we need to be able to add a constant to both the x and y coordinates of the position.

import re
from typing import Iterable

text = '(3,-1)-2'      # shall be evaluated to (1,-3)
# text = '(3,-1)+2'      # shall be evaluated to (5,1)
# text = '2+(3,-1)'      # shall be evaluated to (5,1)

def resolve_value(value: str) -> tuple:
    result = eval(value)
    # If the value is a constant, duplicate it so addition and subtraction work as required.
    if not isinstance(result, Iterable):
        return (result, result)
    return result

def resolve(match: re.Match) -> str:
    a, op, b = match.groups()
    x,y = resolve_value(a)
    xx,yy = resolve_value(b)
    return str(eval(f"({x}{op}{xx},{y}{op}{yy})"))

constant=r'[-\+]?[0-9]+'
position=rf'\({constant},{constant}\)'
value=rf'({position}|{constant})'
pattern=f"{value}([\+-]){value}"
print(re.sub(pattern, resolve, text))

Here I broke out the various parts of the regex. Now it feels like we are defining a programming language, which we kind of are; a DSL anyway.

Resolve the position, handle new operator

Now let's add a new operator.

text = '(3,-1)*2'      # shall be evaluated to (6,-2)

It turned out to be trivial to implement, though I had to update the regex patterns a bit as I realized that:

  1. I did not actually need to escape + and - inside []. And in fact, in the plain (non-raw) f-strings used earlier, a single \ is not a proper escape anyway; that would have needed \\ or a raw string.
  2. *, on the other hand, is a special character, so I escape it to be safe (a small sanity check of these rules follows after this list).
  3. We won't have a + at the start of a constant, so I changed the definition to allow a leading minus but not a plus.
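
A quick way to double check the character class rules is to ask the re module directly; inside [], + and * are literal, and - is literal only when escaped or placed first/last:

import re

print(re.findall(r'[+\-\*]', '1+2-3*4'))  # ['+', '-', '*']
print(re.findall(r'[-+*]',   '1+2-3*4'))  # ['+', '-', '*'] (no escapes needed at all)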

Here we have the updated code:

import re
from typing import Iterable

text = '(3,-1)*2'      # shall be evaluated to (6,-2)

def resolve_value(value: str) -> tuple:
    result = eval(value)
    # If the value is a constant, duplicate it so addition and subtraction work as required.
    if not isinstance(result, Iterable):
        return (result, result)
    return result

def resolve(match: re.Match) -> str:
    a, op, b = match.groups()
    x,y = resolve_value(a)
    xx,yy = resolve_value(b)
    return str(eval(f"({x}{op}{xx},{y}{op}{yy})"))

constant=r'-?[0-9]+'
position=rf'\({constant},{constant}\)'
value=rf'({position}|{constant})'
operation=r'([+\-\*])'  # Matches +, - or *; the - is escaped so it is not read as a range.

pattern=rf"{value}{operation}{value}"
print(re.sub(pattern, resolve, text))

Resolve the position, handle larger dimensions

Next it would be fun to handle arbitrarily large dimensions, or at least some larger fixed size like 3 or 5. This feels like something we might need to do in a follow-up challenge in Advent of Code.

text = '(1,1,1)-2'      # shall be evaluated to (-1,-1,-1)
text = '(1,1,1)+2'      # shall be evaluated to (3,3,3)
text = '(1,1,1)*2'      # shall be evaluated to (2,2,2)

And we still need to handle all the previous cases as well!

The solution below is a good try, but the new parts do not really use regex. Which in general is not a bad thing at all, quite the opposite. But right now I'm trying to force myself to use regex more (than necessary) to get out of my comfort zone and maybe find solutions that are hidden gems.

Anyway, I now create a Position object that handles the operator logic for me. There are some ugly hacks with eval, but hey, this is just for fun.

import re
from typing import Iterable
from itertools import repeat

text = '(1,1,1)*2'      # shall be evaluated to (2,2,2)

class Position(tuple):
    def __new__(cls, *coords):
        return super().__new__(cls, tuple(coords))

    def __str__(self):
        return f"Position({', '.join(map(str,self))})"

    def unify_other(other):  # Called as Position.unify_other(...), hence no self parameter.
        """If other is a constant repeat it. Otherwise no change."""
        is_constant = lambda other: not isinstance(other, Position)
        if is_constant(other):
            other = repeat(other)

        return other

    def __add__(self, other):
        """Supports adding two equal length positions or adding with a constant."""
        return Position(*[a + b for a,b in zip(self, Position.unify_other(other))])

    def __sub__(self, other):
        """Supports subtracting two equal length positions or adding with a constant."""
        return Position(*[a - b for a,b in zip(self, Position.unify_other(other))])

    def __mul__(self, other):
        """Supports multiplying two equal length positions or adding with a constant."""
        return Position(*[a * b for a,b in zip(self, Position.unify_other(other))])


def resolve_value(value: str) -> str:
    coords = eval(value)
    if isinstance(coords, Iterable):
        return Position(*coords)
    else:
        constant = coords
        return constant

def resolve(match: re.Match) -> str:
    a, op, b = match.groups()
    a = resolve_value(a)
    b = resolve_value(b)

    exp = f"{a} {op} {b}"
    return str(eval(exp))

constant=r'-?[0-9]+'

position_2d=rf'\({constant},{constant}\)'
position_3d=rf'\({constant},{constant},{constant}\)'
position=rf'{position_2d}|{position_3d}'

value=rf'({position}|{constant})'

operation=r'([+\-\*])'

pattern=rf"{value}{operation}{value}"
print(re.sub(pattern, resolve, text))

This solution however silently handles positions of unequal length like this: (1,1,1)+(2,2) = (3,3), which I don't like. Since this is about positions, missing values should mean that the value is 0 in the missing dimensions (or, since we only handle 2d and 3d, in the one missing dimension).

So let's add a new rule for this: missing dimensions shall be treated as 0.

text = '(1,1,1)+(2,2)' # shall be evaluated to (3,3,1)

I have implemented it in a quite general way with minimal changes, because I know that I want to expand this to larger dimensions too.
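
The trick is itertools.zip_longest, which pads the shorter sequence with a fill value; a tiny illustration of the rule:

from itertools import zip_longest

print([a + b for a, b in zip_longest((1, 1, 1), (2, 2), fillvalue=0)])  # [3, 3, 1]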

import re
from typing import Iterable
from itertools import repeat, zip_longest

text = '(1,1,1)+(2,2)' # shall be evaluated to (3,3,1)

class Position(tuple):
    def __new__(cls, *coords):
        return super().__new__(cls, tuple(coords))

    def __str__(self):
        return f"Position({', '.join(map(str,self))})"

    def unify_other(self, other):
        """If other is a constant repeat it. Otherwise no change."""
        is_constant = lambda other: not isinstance(other, Position)
        if is_constant(other):
            other = repeat(other, len(self))

        return other

    def __add__(self, other):
        """Supports adding:
         * Adding two positions (0 is used for missing dimensions of the shorter one)
         * Adding with a constant."""
        return Position(*[a + b for a,b in zip_longest(self, self.unify_other(other), fillvalue=0)])

    def __sub__(self, other):
        """Supports subtracting:
         * Subtracting two positions (0 is used for missing dimensions of the shorter one)
         * Subtracting with a constant."""
        return Position(*[a - b for a,b in zip_longest(self, self.unify_other(other), fillvalue=0)])

    def __mul__(self, other):
        """Supports multiplying:
         * Multiplying two positions (0 is used for missing dimensions of the shorter one)
         * Multiplying with a constant."""
        return Position(*[a * b for a,b in zip_longest(self, self.unify_other(other), fillvalue=0)])


def resolve_value(value: str) -> str:
    coords = eval(value)
    if isinstance(coords, Iterable):
        return Position(*coords)
    else:
        constant = coords
        return constant

def resolve(match: re.Match) -> str:
    a, op, b = match.groups()
    a = resolve_value(a)
    b = resolve_value(b)

    exp = f"{a} {op} {b}"
    return str(eval(exp))

constant=r'-?[0-9]+'

position_2d=rf'\({constant},{constant}\)'
position_3d=rf'\({constant},{constant},{constant}\)'
position=rf'{position_2d}|{position_3d}'

value=rf'({position}|{constant})'

operation=r'([+\-\*])'

pattern=rf"{value}{operation}{value}"
print(re.sub(pattern, resolve, text))
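
As a quick regression check of my own (reusing the pattern and resolve defined just above), the earlier examples still come out as expected:

for case in ['(5,5)+(1,2)', '(5,5)-(6,5)', '(3,-1)-2', '(1,1,1)*2', '(1,1,1)+(2,2)']:
    print(case, '->', re.sub(pattern, resolve, case))
# (5,5)+(1,2) -> Position(6, 7)
# (5,5)-(6,5) -> Position(-1, 0)
# (3,-1)-2 -> Position(1, -3)
# (1,1,1)*2 -> Position(2, 2, 2)
# (1,1,1)+(2,2) -> Position(3, 3, 1)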

Support for n-dimensional points

Now let's assume the positions can have an arbitrary number of dimensions.

text = '(1,)+(0,0,1,0,0)'      # shall be evaluated to (1,0,1,0,0)

The change here is that the regex for positions now allows for an arbitrary number of coordinates. I moved the Position class to its own module for brevity, as I'm not making any changes to it.
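
To convince myself the new position pattern really accepts any number of coordinates, including the single-element '(1,)' form, a quick check of my own helps:

import re

constant = r'-?[0-9]+'
position = rf'\((?:{constant})(?:, ?{constant})*,?\)'

for s in ['(1,)', '(5,5)', '(1, 2, 3)', '(-1,0,0,0,5)']:
    print(s, bool(re.fullmatch(position, s)))  # all True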

import re
from typing import Iterable
from position import Position

text = '(1,)+(0,0,1,0,0)'      # shall be evaluated to (1,0,1,0,0)

def resolve_value(value: str) -> str:
    coords = eval(value)
    if isinstance(coords, Iterable):
        return Position(*coords)
    else:
        constant = coords
        return constant

def resolve(match: re.Match) -> str:
    a, op, b = match.groups()
    a = resolve_value(a)
    b = resolve_value(b)

    exp = f"{a} {op} {b}"
    return str(eval(exp))

constant=r'-?[0-9]+'
position=rf'\((?:{constant})(?:, ?{constant})*,?\)'
value=rf'({position}|{constant})'
operation=r'([+\-\*])'

pattern=rf"{value}{operation}{value}"
print(re.sub(pattern, resolve, text))

The regex is difficult to work with, so let's see if I can use named capture groups and regex verbose mode to make things simpler to read.

import re
from typing import Iterable, Tuple
from position import Position

text = '(1,)+(0,0,1,0,0)'      # shall be evaluated to (1,0,1,0,0)

def tokens(text) -> Iterable[Tuple[str, str]]:
    """Tokenize a given string."""
    # Inspired by https://twitter.com/nedbat/status/1545772259402338304
    token_patterns = r"""(?xm)
        (?P<const>  -?[0-9]+                    )|
        (?P<pos>    \(-?[0-9]+(,-?[0-9]+)*,?\)  )|
        (?P<op>     ([+\-\*])                   )|
        (           \#.*$                       )| # Ignore python comments.
        (           \s+                         )  # Ignore whitespace.
        """

    for match in re.finditer(token_patterns, text):
        if match.lastgroup:
            yield (match.lastgroup, match[0])

def resolve(text):
    parts = ''
    for kind, text in tokens(text):
        if kind == 'pos': text = str(Position(*eval(text)))
        parts += text
    return eval(parts)

print(resolve(text))

The new tokens function was inspired by Ned Batchelder.

It's now easier to see and debug what is going on.
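
For example, printing the token stream (with the tokens function above in scope) shows exactly what the regex picked out:

print(list(tokens('(1,)+(0,0,1,0,0)')))
# [('pos', '(1,)'), ('op', '+'), ('pos', '(0,0,1,0,0)')]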

Support for chained expressions

With the addition of the tokens function in the previous solution we can now handle these types of expressions.

text = '(1,)+(0,1)+(0,0,1)'    # shall be evaluated to (1,1,1)

It’s however not yet possible to use () to change precedence. So lets fix that.

text = '(1,)+(0,1)+(0,0,1)'    # shall be evaluated to (1,1,1)
text = '(0,0,1)+2+5'           # shall be evaluated to (7,7,8)
text = '(0,0,1)+(1+1)*2'       # shall be evaluated to (4,4,5)
text = '((0,0,1)+(0,1,1))*2'   # shall be evaluated to (0,2,4)

import re
from typing import Iterable, Tuple
from position import Position

text = '(1,)+(0,1)+(0,0,1)'    # shall be evaluated to (1,1,1)
text = '(0,0,1)+2+5'           # shall be evaluated to (7,7,8)
text = '(0,0,1)+(1+1)*2'       # shall be evaluated to (4,4,5)
text = '((0,0,1)+(0,1,1))*2'   # shall be evaluated to (0,2,4)

def tokens(text) -> Iterable[Tuple[str, str]]:
    """Tokenize a given string."""
    # Inspired by https://twitter.com/nedbat/status/1545772259402338304
    token_patterns = r"""(?xm)
        (?P<const>  -?[0-9]+                    )|
        (?P<pos>    \(-?[0-9]+(,-?[0-9]+)*,?\)  )|
        (?P<op>     ([+\-\*])                   )|
        (?P<expr>   \(.*\)                      )| # Expressions are anything surrounded by () that is not a position.
        (           \#.*$                       )| # Ignore python comments.
        (           \s+                         )  # Ignore whitespace.
        """

    for match in re.finditer(token_patterns, text):
        if match.lastgroup:
            yield (match.lastgroup, match[0])

def resolve(text):
    parts = ''
    for kind, text in tokens(text):
        if kind == 'pos':    text = str(Position(*eval(text)))
        elif kind == 'expr': text = str(resolve(text[1:-1])) # Remove outer () and resolve inner part recursively.
        parts += text
    return eval(parts)

print(resolve(text))

It works, but I would like to refactor it a bit. The tokens and the changes needed for each are separated, yet they conceptually have a 1:1 mapping, so it would be nice if the code could reflect that.

import re
from typing import Callable, Iterable, Tuple
from position import Position

text = '((0,0,1)+(0,1,1))*2'   # shall be evaluated to (0,2,4)

def tokens(text) -> Iterable[Tuple[str, str]]:
    """Tokenize a given string."""
    # Inspired by https://twitter.com/nedbat/status/1545772259402338304
    token_patterns = r"""(?xm)
        (?P<const>  -?[0-9]+                    )|
        (?P<pos>    \(-?[0-9]+(,-?[0-9]+)*,?\)  )|
        (?P<op>     ([+\-\*])                   )|
        (?P<expr>   \(.*\)                      )| # Expressions are anything surrounded by () that is not a position.
        (           \#.*$                       )| # Ignore python comments.
        (           \s+                         )  # Ignore whitespace.
        """

    for match in re.finditer(token_patterns, text):
        if match.lastgroup:
            yield (match.lastgroup, match[0])

def token_to_action(kind) -> Callable[[str], str]:
    """Some tokens require the matched string to change.
    Examples:
        (1,1) -> Position(1,1)      # pos type
        (1+2) -> 3                  # expr type
    """
    # Map between type of token and what to do with them.
    actions = {
        'pos':  lambda s: str(Position(*eval(s))),  # (1,1) -> Position(1,1)
        'expr': lambda s: str(resolve(s[1:-1])),    # Remove outer () and resolve inner part recursively
        '_':    lambda s: s                         # Everything else is passed through unchanged.
        }
    get_action = lambda kind: actions.get(kind, actions['_'])
    return get_action(kind)

def resolve(text):
    return eval(' '.join(token_to_action(kind)(s) for kind, s in tokens(text)))

print(resolve(text))

Here they are still separated, but it is a step in the right direction. What I really want is a single dictionary where each named pattern is a key and the action is the value.

# Add this to the Position class in position.py.
# It ensures that 2*(1,1) == (1,1)*2 (without it, 2 * Position(1,1)
# would fall back to plain tuple repetition).

class Position(tuple):
    ...

    # Add to the bottom.
    def __rmul__(self, other):
        return self * other

Finally, the language itself is now a bit hard to read when nesting, since () is used both for positions and for precedence. Let's use [] for positions instead. This also allows us to match tokens with behaviour in a new way.

import re
from position import Position

text = '[1,]+[0,1]+[0,0,1]'    # shall be evaluated to [1,1,1]
text = '[0,0,1]+2+5'           # shall be evaluated to [7,7,8]
text = '[0,0,1]+(1+1)*2'       # shall be evaluated to [4,4,5]
text = '([0,0,1]+[0,1,1])*2'   # shall be evaluated to [0,2,4]

def resolve(text) -> str:
    token_action = {
        # Remove outer () and resolve inner part recursively
        r'(?P<expr>\(.*\))':                     lambda m: str(resolve(m[0][1:-1])),
        # (1,1) -> Position(1,1)
        r'(?P<pos> \[-?[0-9]+(,-?[0-9]+)*,?\])': lambda m: str(Position(*eval(m[0]))),
    }

    for pattern, f in token_action.items():
        text = re.sub(r'(?xm)'+pattern, f, text)
    return eval(text)

print(resolve(text))

Though I'm not happy about this. Yes, the token patterns and the changes associated with them are defined on the same line, but doing multiple passes with sub is not great, as you might get weird side effects if you are not careful.

Here is something similar but maybe less prone to error, since we are using finditer again, which ensures that the individual patterns don't overlap.

import re
from typing import Iterable
from position import Position

text = '([0,0,1]+[0,1,1])*2'   # shall be evaluated to [0,2,4]

def tokens(text) -> Iterable[str]:
    """Tokenize a given string and resolve each token to its replacement text."""
    # Inspired by https://twitter.com/nedbat/status/1545772259402338304
    patterns = {
        'expr':  (r'(?P<expr>  \(.*\)                     )', lambda m: str(resolve(m[1:-1]))),
        'pos':   (r'(?P<pos>   \[-?[0-9]+(,-?[0-9]+)*,?\] )', lambda m: str(Position(*eval(m)))),
        'const': (r'(?P<const> -?[0-9]+                   )', lambda m: str(m)),
        'op':    (r'(?P<op>    ([+\-\*])                  )', lambda m: str(m)),
        '_':     (r'(.*                                   )', lambda m: str(m)),
    }
    token_patterns = r'(?xm)'+'|'.join([p for p,f in patterns.values()])

    for match in re.finditer(token_patterns, text):
        if match.lastgroup:
            kind = match.lastgroup
            _,f = patterns[kind]
            yield f(match[0])

def resolve(text):
    return eval(' '.join(tokens(text)))

print(resolve(text))
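
With the Position class from earlier in scope, the other example strings from above resolve the same way:

for case in ['[1,]+[0,1]+[0,0,1]', '[0,0,1]+2+5', '[0,0,1]+(1+1)*2']:
    print(case, '->', resolve(case))
# [1,]+[0,1]+[0,0,1] -> Position(1, 1, 1)
# [0,0,1]+2+5 -> Position(7, 7, 8)
# [0,0,1]+(1+1)*2 -> Position(4, 4, 5)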

Well, this was fun experimenting. That's all for now.

Written on November 20, 2022