How Do You Deal With The "special" Characters That MS Word Adds?

I'm wondering how you clean the special characters that MS Word adds, such as em and en dashes and curly quotes? I often find myself copying content from clients from Word and pasting

Solution 1:

With regard to clients pasting text copied from Word into textareas:

The most reliable way to get the browser to send you text in a particular encoding (and so, hopefully, do any conversion from Windows-1252, or whatever Word uses, for you) is to add the accept-charset="..." attribute to all your <form>s. For example:

<form ... accept-charset="UTF-8">
   ...
</form>

Most browsers will obey that and convert any "Word-specific" characters to the specified character set before the text reaches your website.

Once invalid text gets to your website, there's very little you can do to fix it reliably, so it's best to simply check all input for being valid in whatever character set you use, and discard any requests that have invalid text. This is necessary even with accept-charset, because undoubtedly there are some clients out there that will ignore it.
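The validation step described above could be sketched roughly as follows, assuming the site uses UTF-8 throughout. The helper names is_valid_utf8 and all_valid_utf8 are hypothetical, not part of PHP; the check relies on preg_match() returning false when the /u modifier is given a subject that is not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true only if $value is well-formed UTF-8.
function is_valid_utf8(string $value): bool
{
    // With the /u modifier, PCRE validates the subject as UTF-8 first.
    // preg_match() returns false (not 0) when that validation fails.
    return preg_match('//u', $value) === 1;
}

// Hypothetical helper: true when every string value in a (possibly nested)
// input array is valid UTF-8. Array keys are not checked in this sketch.
function all_valid_utf8(array $input): bool
{
    foreach ($input as $value) {
        if (is_array($value)) {
            if (!all_valid_utf8($value)) {
                return false;
            }
        } elseif (is_string($value) && !is_valid_utf8($value)) {
            return false;
        }
    }
    return true;
}

// Possible usage at the top of a form handler:
// if (!all_valid_utf8($_POST)) { http_response_code(400); exit; }
```

Rejecting the whole request, rather than trying to repair it, matches the advice above: the bytes alone do not say which encoding the client actually used.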


Solution 2:

You can use preg_replace to strip every non-ASCII character from your string. Note that this removes Word's curly quotes and dashes, but also any other non-ASCII text, such as accented letters:

 $str = preg_replace('/[^\x00-\x7F]+/', '', $str);
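If losing the characters entirely is too blunt, a gentler variant is to map the most common Word characters to ASCII look-alikes before (or instead of) stripping. This is only a sketch: the function name word_chars_to_ascii is hypothetical, the list of characters is a small illustrative selection rather than everything Word can produce, the input is assumed to be UTF-8, and the "\u{...}" escapes need PHP 7 or later.

```php
<?php
// Hypothetical helper: replace common Word typography with ASCII equivalents.
// Assumes $text is UTF-8; characters not in the map are left untouched.
function word_chars_to_ascii(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quotation mark
        "\u{2019}" => "'",   // right single quotation mark / apostrophe
        "\u{201C}" => '"',   // left double quotation mark
        "\u{201D}" => '"',   // right double quotation mark
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // horizontal ellipsis
        "\u{00A0}" => ' ',   // non-breaking space
    ];

    // strtr() with an array replaces each key with its value in one pass.
    return strtr($text, $map);
}
```

Unlike the blanket preg_replace above, this leaves other non-ASCII text (for example accented letters) as it was.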

Solution 3:

Make sure you specify an encoding everywhere, and use UTF-8; then those "special" characters should survive just fine. But once they have gone through an encoding that can't represent them, the information about which character each one originally was is lost, so it can't be repaired (except in some specific, though probably very common, cases such as a mix-up between Cp1252 and ISO-8859-1).
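For the Cp1252 versus ISO-8859-1 case, one possible repair is to decode the bytes as Cp1252 and re-encode them as UTF-8. The sketch below assumes the input really is single-byte Cp1252 text (not already UTF-8) and that PHP's iconv extension is available; the function name cp1252_to_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: interpret $bytes as Windows-1252 (Cp1252) and
// return the same text encoded as UTF-8, or null if iconv cannot convert it.
function cp1252_to_utf8(string $bytes): ?string
{
    // iconv() returns false (and raises a notice, suppressed here)
    // when it meets a byte it cannot convert.
    $converted = @iconv('CP1252', 'UTF-8', $bytes);

    return $converted === false ? null : $converted;
}
```

For example, the Cp1252 bytes 0x93 and 0x94 come out as the curly double quotes U+201C and U+201D. Running this on text that is already UTF-8 would corrupt it, so it only makes sense where the source encoding is known.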


Solution 4:

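You might try the Demoroniser, a Perl script by John Walker that cleans up the nonstandard characters in text and HTML produced by Microsoft applications.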


Solution 5:

Make sure Word is configured to use UTF-8 for "Save As..." HTML.

This setting is under Options > Word Options > Advanced > Web Options > Encoding.

