Find and replace in Python

Like ctrl-f, but for Python.
Python
Data Processing
Author

Thadryan

Published

March 12, 2018

Using a dict to perform find-and-replace tasks makes a lot of sense: dicts are simple to use, and they let us store each target alongside its desired replacement in one spot. There are a few hidden traps to be aware of, but they’re easily avoided, and with just a little forethought we can be back on track reaping the benefits.

Find and Replace Deluxe:

How to use a Python dict for find/replace functions

We’re going to go through several examples of find-and-replace tasks of varying complexity. We’ll also run into some potential hiccups and see how to avoid them. To check whether our efforts are working, we’ll use a simple string that reads as a coherent message if the code works and is hard to read if it doesn’t.

Simple character-character substitution

Replacing one character with another is pretty straightforward using a function. This function takes the string and the dict as arguments, iterates through the string’s characters, and makes the replacements. You’ll notice it has an “if” statement. This is to make sure that the character exists in our dict, or else we get a KeyError. We can’t replace something with its pair from the dict if it isn’t in the dict, and, if you think about it, that’s exactly what we’d be trying to do if we iterated over every letter and looked it up in the dict regardless of whether or not it’s there.

Note: I’m going to comment every line on this one to make sure we’re on the same page as this function will serve as the basis of all the ones coming after it. After that, I’ll just highlight what is different about the function.

# here is our target string, contaminated with "X"s
s = "HEREXISXAXSAMPLEXSTRING"

# dictionary pairs "X" with "-"
d = {"X":"-"}

# define a function that takes a string and a dict
def find_replace(string, dictionary):
    # iterate over each character in the string
    for item in string:
        # is the character one of the dict's keys?
        if item in dictionary.keys():
            # look up its pair and replace
            string = string.replace(item, dictionary[item])
    # return updated string
    return string

# call the function
find_replace(s,d)
'HERE-IS-A-SAMPLE-STRING'

We can see that all the “X”s have been replaced by “-”, their paired value.
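
To see why the “if” matters, here is a minimal sketch of the KeyError it guards against, along with one possible alternative. The name find_replace_get is made up for illustration and is not part of the function above; it assumes every key is a single character and uses dict.get with the character itself as the fallback.

```python
s = "HEREXISXAXSAMPLEXSTRING"
d = {"X": "-"}

# looking up a character that is not a key raises KeyError
try:
    d["H"]
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

# hypothetical alternative: fall back to the character itself when it has no pair
def find_replace_get(string, dictionary):
    return "".join(dictionary.get(char, char) for char in string)

result = find_replace_get(s, d)
print(missing_key_raised)  # True
print(result)              # HERE-IS-A-SAMPLE-STRING
```

Either approach avoids the error; the “if” version is the one the rest of this post builds on.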

Replacing multi-character values

This follows the same basic pattern, but it iterates over the dict’s keys instead of the string’s characters and uses a simple regex function, re.sub, to make the swap. We will get rid of “ABC”s instead of “X”s. We import the re library and modify our string and dict for the test.

import re

s = "HEREABCISABCAABCSAMPLEABCSTRING"

d = {"ABC":"-"}

def find_replace_multi(string, dictionary):
    for item in dictionary.keys():
        # sub item for item's paired value in string
        string = re.sub(item, dictionary[item], string)
    return string

find_replace_multi(s, d)
'HERE-IS-A-SAMPLE-STRING'
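
One thing to keep in mind: re.sub treats each key as a regex pattern, not as literal text. That’s fine for “ABC”, but a key such as “.” would match every character. The sketch below shows one way to handle that, assuming we always want keys matched literally: wrap each key in re.escape. The function name find_replace_multi_literal and the dotted test string are made up for this illustration.

```python
import re

s = "HERE.IS.A.SAMPLE.STRING"
d = {".": "-"}

# as a regex, "." matches any character, so every character becomes "-"
unescaped = re.sub(".", d["."], s)

# hypothetical variant: escape each key so it is matched as literal text
def find_replace_multi_literal(string, dictionary):
    for item in dictionary.keys():
        string = re.sub(re.escape(item), dictionary[item], string)
    return string

escaped = find_replace_multi_literal(s, d)
print(unescaped == "-" * len(s))  # True
print(escaped)                    # HERE-IS-A-SAMPLE-STRING
```

If the keys never contain regex special characters, as in the examples in this post, the escaping makes no difference.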

Replacing single and multi character patterns…oh wait, crap…

There is one final kink in this, however. What would happen if one of the single-character values we wanted to replace also occurs within the multi-character string?

# the third "ABC" has been replaced with just a "C"
s = "HEREABCISABCACSAMPLEABCSTRING"

# ...which we see 
find_replace_multi(s,d)
'HERE-IS-ACSAMPLE-STRING'

Alas! It seems we have a variation. We assumed all the problematic strings were “ABC”s, but one is just a “C”. No problem, right? We just update the dictionary to get rid of “C” as well. Not quite. Here is the thing that goes wrong with these types of functions: the substitutions happen in whatever order the dict hands us its keys, not in an order we chose. So we might replace the “C” in every “ABC” first thing (leaving just the “AB” behind), then when we look for “ABC”s to sub, there won’t be any. Our program would end without ever matching an “ABC”, and we would still have the “AB”s.

For another example, imagine we are replacing “Hello” with 1, and “He” with 2. Our string could be “He Said Hello”. If “He” gets changed first, we would have “2 Said 2llo”. When we go to look for “Hello”? There is no “Hello”.
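
Here is a small sketch of that He/Hello example. It chains str.replace calls rather than using the function above, purely so that the order of the two substitutions is spelled out explicitly; the variable names are made up for illustration.

```python
s = "He Said Hello"

# short key first: "He" also eats the start of "Hello"
short_first = s.replace("He", "2").replace("Hello", "1")

# long key first: "Hello" is handled before "He" can touch it
long_first = s.replace("Hello", "1").replace("He", "2")

print(short_first)  # 2 Said 2llo
print(long_first)   # 2 Said 1
```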

So what do we do?

My solution for this problem is to make sure we iterate through the dictionary keys from LARGEST by length to SMALLEST by length to ensure that we don’t replace any pieces of them by accident. Using the He/Hello example, we substitute the “Hello” first so none of its substrings (“He”) can interfere. We can do this by using the sorted() function with key = len and the reverse = True option (it goes smallest to largest by default). Check it.
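
As a quick illustration of what sorted() does with those arguments, here is a minimal sketch using the He/Hello keys. The dict and variable names are just for this example.

```python
d = {"He": "2", "Hello": "1"}

# default: shortest key first
default_order = sorted(d.keys(), key=len)

# reverse=True: longest key first, which is the order we want
longest_first = sorted(d.keys(), key=len, reverse=True)

print(default_order)  # ['He', 'Hello']
print(longest_first)  # ['Hello', 'He']
```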

s = "HEREABCISABCACSAMPLEABCSTRING"

d = {"C":"-", "ABC":"-"}

def find_replace_multi_ordered(string, dictionary):
    # sort keys by length, in reverse order
    for item in sorted(dictionary.keys(), key = len, reverse = True):
        string = re.sub(item, dictionary[item], string)
    return string

find_replace_multi_ordered(s, d)
'HERE-IS-A-SAMPLE-STRING'

Looks as if we are on the right track. Because we replaced the biggest strings first (“ABC”), and then the smallest (“C”), we ensured the “C” in “ABC” didn’t get messed with. I’ll write one more test to be sure. To confirm that the largest is going first, and not simply that “ABC” is going before “C”, we’ll add another value to the dict (the rest of the code stays the same). I’ll use “CSAMPLEABC”, because the only way for that to be replaced is if it goes before “C” and “ABC”, as BOTH of those strings are in it.

s = "HEREABCISABCACSAMPLEABCSTRING"

d = {"C":"-", "ABC":"-", "CSAMPLEABC":"-:)-"}

def find_replace_multi_ordered(string, dictionary):
    # sort keys by length, in reverse order
    for item in sorted(dictionary.keys(), key = len, reverse = True):
        string = re.sub(item, dictionary[item], string)
    return string

find_replace_multi_ordered(s, d)
'HERE-IS-A-:)-STRING'

And there we have it! Our final function. It’s worth noting that, while I made the initial character-character find-and-replace function to work us up to this one, the final function will work for simple character-character substitutions too, so we only need the one, final version for all the tasks here. I’d advise doing so because it has error avoidance built into the iteration: it loops over the dict’s own keys, so every lookup is guaranteed to succeed and the “if” statement is no longer needed.
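
As a quick check of that last claim, here is the final function run against the “X” string and dict from the first example. The function is repeated so the snippet runs on its own; the variable name single is made up for illustration.

```python
import re

def find_replace_multi_ordered(string, dictionary):
    # sort keys by length, in reverse order
    for item in sorted(dictionary.keys(), key=len, reverse=True):
        string = re.sub(item, dictionary[item], string)
    return string

# single-character substitution from the first example
single = find_replace_multi_ordered("HEREXISXAXSAMPLEXSTRING", {"X": "-"})
print(single)  # HERE-IS-A-SAMPLE-STRING
```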