
Convert to Plain Text (With HTML Entities)

Mar 6, 2009
Category:Programming HTML 

I work a lot with user-submitted data, and it usually arrives as a Microsoft Word document, either as a .doc file or copied and pasted from Word into forms or emails. The problem is that Microsoft Word likes to change some of the characters in your document to "smart" characters, most noticeably the curly double quotation marks and apostrophes, the long dashes that replace hyphens, and the single-character ellipsis that replaces three dots. Although this possibly makes the document look nicer (does it?), it is most annoying because these characters often do not display properly in HTML, typically when the page's character encoding does not match the text's, and you end up with funny question marks and random characters.

I have created this simple form to strip out all of the Microsoft Word smart quotes and other weird or invisible characters that show up wrong in HTML, and replace them with their standard ASCII equivalents. You can also optionally encode the HTML entities at the same time. The results can be displayed on the page or downloaded.
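For anyone who wants to do the same thing in their own code, here is a minimal sketch in Python of the kind of replacement described above. It is not the code behind the original form: the character list in SMART_TO_ASCII and the function name to_plain_text are my own assumptions for illustration, and a real cleanup would probably need a longer list.

```python
import html

# Hypothetical mapping for illustration; the original form's exact
# character list is not known. Keys are single Unicode characters,
# values are their ASCII stand-ins (None means "delete the character").
SMART_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
    "\u200b": None,   # zero-width space (invisible)
    "\ufeff": None,   # byte order mark (invisible)
}

_TABLE = str.maketrans(SMART_TO_ASCII)


def to_plain_text(text, encode_entities=False):
    """Replace Word-style smart characters with ASCII equivalents.

    If encode_entities is True, also escape &, <, > and quotes as
    HTML entities using the standard library's html.escape.
    """
    cleaned = text.translate(_TABLE)
    if encode_entities:
        cleaned = html.escape(cleaned, quote=True)
    return cleaned
```

Calling to_plain_text(pasted_text) gives back the ASCII version, and passing encode_entities=True additionally escapes the result so it is safe to drop into an HTML page. The replacement runs before the escaping so that the straight quotes produced in the first step are encoded as well.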

-- UPDATE --
This is now old and broken, so I have removed it. A much better way is to use the "Paste from Word" feature of TinyMCE or another web-based WYSIWYG editor.
