How to handle '\\' as a delimiter in Python strings

I am having to load and handle strings from an external data source that are delimited by double backslash. I need to return the sorted and de-duplicated delimited string e.g.

input_string = 'bananas\\apples\\pears\\apples\\bananas\\pears'

the return string should be:

'apples\\bananas\\pears'

This works:

input_string = 'bananas\\apples\\pears\\apples\\bananas\\pears'
distinct_items = set(x.strip() for x in input_string.split('\\'))
print('\\'.join(sorted(distinct_items)))

but this doesn’t:

def dedupe_and_sort(input_string, delimiter='\\'):
    ''' get a list item that contains a delimited string, dedupe and sort it and pass it back '''
    distinct_items = set(x.strip() for x in input_string.split(delimiter))
    return (delimiter.join(sorted(distinct_items)))

What is the correct way to handle the ‘\\’ as a delimiter within the function dedupe_and_sort?

What incorrect result do you get? Here’s what I see:

In [1]: def dedupe_and_sort(input_string, delimiter='\\'):
   ...:     ''' get a list item that contains a delimited string, dedupe and sort it and pass it back '''
   ...:     distinct_items = set(x.strip() for x in input_string.split(delimiter))
   ...:     return (delimiter.join(sorted(distinct_items)))
   ...: 

In [2]: dedupe_and_sort('bananas\\apples\\pears\\apples\\bananas\\pears')
Out[2]: 'apples\\bananas\\pears'

Does the double-backslash already exist in the file? Because the code examples are using a single backslash as a delimiter.

single_backslash = '\\'

How many backslashes do you think are actually in this string?

There are five, not ten.

If you actually see bananas\\apples\\pears\\apples\\bananas\\pears when you, for example, save the data from the “external data source” to a text file and open it in a text editor, then that source data contains ten backslashes. But the input_string you are using to test only uses a single backslash in between each fruit name. If you .write it to a text file and open that text file in a text editor, you will only see five backslashes. The \\ that you type in the source code is special syntax that means a single backslash in the actual string. This is because the first \ has special meaning: it begins an escape sequence that lets you put things into the string that are hard or impossible to type, would mess up the formatting of the code, or would prevent Python from understanding the code (e.g. so that you can put a quote symbol into the string, and Python can know that it’s part of the string instead of the syntax to say where the string ends).

You need to make sure of what the delimiter actually is, in the actual data that you have. The code that you show works fine for the test data that you show.
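One way to check this for yourself is to count the backslashes rather than trust how the string is displayed. A minimal sketch, using only the test string from the question:

```python
# The test string from the question, typed exactly as it appears in the source code.
input_string = 'bananas\\apples\\pears\\apples\\bananas\\pears'

# Each '\\' typed in the source is ONE backslash in the resulting string.
print(len('\\'))                 # 1
print(input_string.count('\\'))  # 5
print(input_string)              # bananas\apples\pears\apples\bananas\pears

# repr() shows the string the way you would type it as a literal,
# which is why the REPL output displays doubled backslashes.
print(repr(input_string))        # 'bananas\\apples\\pears\\apples\\bananas\\pears'
```

Running the same `count` on a value actually read from the external source should tell you whether the delimiter there is one backslash or two.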

In the SQL table there’s basically field stuffing going on, and the string literal in the table column would be bananas apples pears apples bananas pears with a double backslash delimiting each. So I’m 100% certain that every delimited record read in from the table has a double backslash in the string as a delimiter between each word where a field has been stuffed with multiple values.

Thus, I’m reading in a column value from a SQL query; it needs to be deduped and sorted, then written back with two backslashes, just the way it was retrieved from the table.

Various calls and their results:

Based on that, it seems like you understand the difference between "\\\\" (two backslashes, escaped) and "\\" (just one). So where’s the confusion?

In the first example, you are splitting a string with single backslashes, and joining it with doubled ones. So you get the expected output. In the second and third you are splitting and joining with single backslashes.

To clarify further: the string you have defined as input_string doesn’t have the doubled backslashes in it. If you print it out, you will see that.
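If the values coming back from the table really do contain two backslashes between items, then the delimiter passed to the function has to be two backslashes as well, which is written '\\\\' in a regular literal. A sketch under that assumption; `column_value` is a made-up stand-in for a value read from the query, not real data:

```python
def dedupe_and_sort(input_string, delimiter='\\\\'):
    """Split on the delimiter, strip whitespace, dedupe, sort, and re-join."""
    distinct_items = set(x.strip() for x in input_string.split(delimiter))
    return delimiter.join(sorted(distinct_items))

# Hypothetical column value. The r prefix keeps both backslashes of each pair,
# so this string really contains ten backslashes.
column_value = r'bananas\\apples\\pears\\apples\\bananas\\pears'

result = dedupe_and_sort(column_value)
print(result)  # apples\\bananas\\pears
```

Here print shows the actual content, so the two backslashes between each item are really in the result.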

Of course, silly me getting confused with visual presentation vs underlying content. Thx. In hindsight the answer is pretty obvious, sorry for wasting your time, but appreciate the feedback.

Surprisingly, no-one has pointed out that raw-string literals are useful when typing strings in your code that contain backslashes (but don’t need to represent special characters like newline). The r prefix basically turns off the backslash-escape convention, so there is less visual confusion.

Then the string you get looks more like the string you type. r"\\" is a string of two backslashes. It is commonly used for Windows file paths and regular expressions.
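A quick comparison of the two spellings, as a sketch; the path below is made up purely for illustration:

```python
escaped = '\\\\'  # regular literal: four typed backslashes -> two in the string
raw = r'\\'       # raw literal: two typed backslashes -> two in the string

print(escaped == raw)  # True
print(len(raw))        # 2

# A Windows-style path written both ways (hypothetical path).
print('C:\\Users\\me' == r'C:\Users\me')  # True
```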

Do keep in mind that raw string literals do still use backslash to prevent a quote that comes immediately after from terminating the string - but they don’t “escape” the quote when doing so. Thus:

>>> r'\''
"\\'"
>>> print(r'\'')
\'

The string contains a backslash and a single-quote. The backslash in the raw string literal syntax allows the ' to be part of the string, but is still a backslash. This happens because Python considers the backslash when it tokenizes the code, to figure out where the string ends; then it determines what the string actually contains on a later pass, taking the r prefix into account.

For the same reason, a raw string literal cannot end with an odd number of backslashes:
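A minimal sketch of this, passing the source text to compile() so that the invalid literals can be shown without breaking the script. The helper name `is_valid_literal` is just for illustration:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

# Each argument is the source text of a raw string literal,
# itself wrapped in a double-quoted raw string.
print(is_valid_literal(r"r'\\'"))   # True:  ends with two backslashes
print(is_valid_literal(r"r'\'"))    # False: ends with one backslash
print(is_valid_literal(r"r'\\\'"))  # False: ends with three backslashes

# One common workaround: append the final backslash from a regular literal.
folder = r'C:\some\dir' + '\\'
print(folder)  # C:\some\dir\
```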

And, of course, the resulting string cannot contain the closing quote character without also having a backslash in front of that quote.

This explains a lot, thank you! I had a problem with this recently and couldn’t understand what the rules were for backslashes inside raw strings. In my case I ended up doing silly stuff like this:

_esc = "\x5c"

# later…
some_pattern = f"{_esc}"

Backslashes can be nasty. You can find a good explanation here:
