How to remove or replace unwanted characters from text – using CLEAN, TRIM & SUBSTITUTE

CLEAN, TRIM & SUBSTITUTE all help remove unwanted characters from text in Excel, but are used to achieve distinct outcomes. TRIM is designed to work with unwanted spaces, whereas CLEAN tackles most unwanted non-printing ASCII characters. SUBSTITUTE is more general but can be used to target specific problem characters.

Continue reading to understand what these functions can do and in which situations each should be used.

We will also see what non-printing characters are and how they differ from spacing character and why computers have them and how they might appear in our spreadsheets.

When do we get non-printing characters in Excel?

Have you ever tried to type a Tab into a string in Excel? Well if you’ve tried, the default operation is to execute the cell value or function and move the cursor the the next cell. It’s therefore very unlikely that non standard text characters will be introduced by typing directly into Excel.

There are a number of computer characters which are non printable, so how do they get into our data?

The real culprit for introducing unwanted characters comes from data imported into Excel from an external file, process or connection.

Therefore if you have a spreadsheet which uses data sourced externally, you are likely, eventually, to find unexpected characters hiding in your data you’ll want to remove.

Windows characters

Up until Windows NT when Microsoft adopted Unicode, Character Encoding in Excel relied on the extended ASCII character sets described by Windows Code Pages. The user encoding would access default encoding of the computer’s ‘locale’ when reading and writing text data, which for western languages Windows used its code page 1252.

The first half of Window’s code pages were based on the original 7-bit encoding defined by ASCII in the 1960’s, which allowed for 128 characters and hardware instruction codes, such as Carriage Return, Escape and Tab, to be mapped to binary values.

ASCII, rendered in Excel with CHAR function

NOTE: Excel renders some control characters as squares or rectangle symbols. Other non-printing characters have no visible output.

Windows Code Page 1252

The code characters 128-255 extended the earlier ASCII character set, using 8-bit encoding (commonly referred to as ANSI or ASCII Extended encodings) to provide addition symbols like £, , and non English accented western letters such as ñ, ß. Window’s own code pages were just one of many ANSI extensions at the time, and itself an augmented version based on ISO-8859.

ASCII Extended (Windows 1252), rendered in Excel with CHAR function

NOTE: Where as ASCII is a standard for the values 0-127, typically on Windows PCs the installed ANSI codepage (code numbers 128-255) will be Windows-1252 for western languages.

So what is the difference between TRIM & CLEAN

TRIM and CLEAN both remove specific characters from text.

  • TRIM deals with spaces (ASCII character code 32) at the beginning and end of text, and removes duplicate spaces between words, whereas
  • CLEAN removes non printing ASCII character codes in the range 0-31 inclusive, from text.

TRIM function

Example

Using Notepad++‘s Show All Characters option we can set up our example with various distributions of spaces in text. We’ll also include a Tab in one example to see what happens.

TIP: Notepad++ is a very useful tool for inspecting data you have especially to check for the presence of errant characters.

Example data containing leading, trailing and duplicate spaces, and a Tab character

Spaces are indicated by orange dots and the tab symbol by an orange arrow, we will copy this data and paste it to Excel.

Note: CR and LF are ‘Carriage Return’ and ‘Line Feed’ visualisations in Notepad++, these are removed as the rows are converted to Cell rows in Excel.

We’ll use the LEN function to count all characters in the text including the spaces and Tab character.

=LEN(text)

In Excel, we add the following

  1. Column A: We paste our test data from Notepad++ (above),
  2. Column B: We use the LEN function to count both printing and non printing characters in the Column A input data,
  3. Column C: We apply TRIM to our input data (Column A)
  4. Column D: We count the characters remaining after using the TRIM function on our input data, from Column C, with a second LEN function.

NOTE: The Tab character is a not visible in the cell, but its presence can be seen in the formula bar as indicated by the red arrow and yellow highlighted area, and the character count.

Using the TRIM function on our test data

  1. Row 2: TRIM has no effect on space between with only a single space between words
  2. Row 3: TRIM removes the leading space, which results in one fewer characters in the output
  3. Row 4: TRIM removes the trailing space, which results in one fewer characters in the output
  4. Row 5: TRIM removes the duplicate multiple spaces between words, which results in one fewer characters in the output
  5. Row 6: TRIM has no effect on tab between and the Tab character remains between the words, the character count remains the same.

Result

Therefore, we have shown how the TRIM function:

  • removes all excess spaces around text, except for single spaces between words.
  • The TRIM function does not remove the tab character.

CLEAN function

Example

Using the same input text we used previously with the TRIM function, the Tab character (ASCII char 09 – Horizontal Tab to give it its official name) is removed and the character count is reduced by one character.

CLEAN is used to remove any of the control characters in the ASCII range 0-31.

Checking the output in Notepad++ we can see that the spaces remain unaffected as indicated by our character counts before and after the CLEAN function had been applied, whereas the Horizontal Tab has been removed.

Limitations of TRIM and CLEAN in string parsing

CLEAN and TRIM deal with ASCII characters 0-31, and 32 respectively (in different ways).

  • CLEAN removes all characters 0-31,
  • TRIM removes leading and trailing (before and after printable characters) spaces (character 32), and removes duplicate spaces between words.

That leaves us with having to deal separately with any ASCII code values not in the range 0-32 where they occur. For example, there are additional nonprinting characters not removed by CLEAN (values 127, 129, 141, 143, 144, and 157).

Luckily, most issues faced by users will be resolved by CLEAN or TRIM or a combination of both, which can be use together in a single nested formula:

=TRIM(CLEAN(text))

But we can identify and remove the other code values, if necessary, in other ways, either via the SUBSTITUTE function or using VBA.

Spacing characters

Apart from the space character described by the ASCII Code 32 character. Excel however can display other characters as ‘spaces’ which are in fact not a normal space character.

When processing or referencing text, we will usually want to remove excess space characters or other spacing characters such as the Horizontal Tab (character code 9) and the Non-break space character (character code 160).

The non-break space character is often used in Word Processing as a space which is intended to prevent an automatic line break at its position. It is often rendered as the ° symbol in word processing packages. Depending on the context, it may be more appropriate to replace the non-break space character by a space character when text parsing, rather than being removed entirely.

We can remove the Tab character (along with other control characters) with Excel’s CLEAN function, but this does not work with the non-break space character. We’ll find out how to deal with non-break spaces later.

Non-Printing characters

We have already seen that ASCII and UNICODE contain control characters. Non printing characters are any character in a text string which would not be printed on screen or printer when processed.

In many cases, these will have no affect for what the users see on screen, but they will affect text analysis or comparison.

Text containing a non-printing character will affect the count of characters observed in the text (or string). Any comparison between text containing non-printing characters, and text without non-printing characters will not equate.

Unexpected non-printing characters could effect not just a logical check of equivalence in Excel, but the ability of referencing type functions to work as intended, eg. MATCH or VLOOKUP.

Apart from control characters 0-31, and 127 in ASCII, other non-printable characters include numbers 129141143144, and 157. However it is unlikely that you will encounter these in text you receive. Control characters are the most likely non printing characters found when parsing external text.

How to remove specific character codes

We can remove specific characters, printable or otherwise via the SUBSTITUTION function. The SUBSTITUTION function replaces specified text in a given string with new text (or no text), 1 or more times.

First we take a look at how the SUBSTITUTE function works, then we’ll see how to remove a specific character

SUBSTITUTE function

=SUBSTITUTE(text, old_text, new_text, [instance_num])
  1. The first argument is the text to be parsed (or processed),
  2. The old_text is the character or text we wish to change,
  3. The new_text is the character or text we wish to use instead,
  4. instance_num
    1. is the number of times the old_text is substituted in the text. or
    2. by omitting the instance_num argument completely all possible substitutions will be made.

Replacing specific characters

  • We can pass in text to the old_text value or,
  • We can pass in non-printing characters using the CHAR function into the old_text argument.

Removing text

  • Where we merely wish to remove text we can pass in no text to the new_text argument using 2 empty double quotes “”

Example

Removing standard characters

SUBSTITUTE can be used in its standard way to replace or remove a single character, or a group of characters, from text.

Replacing a single character

In this example, we state the text, the text to find and change, ‘f’ and the new text to used ‘F’. On this occasion we will not limit the number of substitutions, so we omit the 4th argument instance_num, although, with just one instance, the result would be no different

=SUBSTITUTE("The quick brown fox", "f" , "F")

our “f” is capitalised in the output

The quick brown Fox

Replacing multiple characters

=SUBSTITUTE("The quick brown fox", "brown" , "red")

Our fox has changed color

The quick red fox

Limiting the number of replacements

The input text now has a duplicate word ‘brown’

=SUBSTITUTE("The brown brown fox", "brown" , "quick", 1)

The 4th argument is used to limit the number of substitutions to ‘1’ occurrence. Resulting in only the first “brown” being substituted.

The quick brown fox

Removing unwanted words or characters

So to remove explicitly every “s” character from text we would pass old_text as “s”, new_text as “”, and omit the instance number.

=SUBSTITUTE("She sell sea shells by the sea shore", "s", "")

All the little s’s are gone, but the capital ‘S’ remains.

She ell ea hell by the ea hore

NOTE: SUBSTITUTE is case sensitive

Replacing unwanted words or characters

We also previously mentioned the non-break space character. Depending on the context, it may be more appropriate to replace the non-break space character by a standard space character, rather than removing the character.

In which case we can update our SUBSTITUTE function accordingly

=SUBSTITUTE(A1, CHAR(160), " ")

or more explicitly refer to the space character in the Window-1252 ASCII/ANSI table

=SUBSTITUTE(A1, CHAR(160), CHAR(32))

TIP: CHAR(32) is the space character, but we can just use ” ” (with a space between the double quotes) for simplicity

Putting it all together

If you want to tackle the majority of issues with imported text we can combine all of our functions – assuming our text is in cell A1

  1. replace non-brake space with a space,
  2. remove all control characters,
  3. remove excess spaces
=TRIM(CLEAN(SUBSTITUTE(A1,CHAR(160)," ")))

Summary

In this article we saw how:

  • We can represent standard western language characters and symbols from a up to 265 of ASCII / ANSI codes. The ANSI version is locale specific (specific to your computer setup).
  • Historically, some of the codes are non printing control codes, used to control printer hardware.
  • There are many more codes now available with UNICODE.
  • US-ASCII & UNICODE share the same numbering for the first 128 characters. And ISO 8859-1 and UNICODE share the same 256 character values.
  • The LEN function can count both printable and non printable characters
  • These ASCII / ANSI / Windows 1252 values can be converted to characters using the CHAR function.
  • We can remove excess spaces around text with the TRIM function
  • We use CLEAN to remove non-printing control characters, i.e. any with ASCII values 1-31 inclusive
  • We can replace or remove unwanted text, or specific characters, using SUBSTITUTE, 1 or more times.

Related Topics

030

Leave a Reply