This article solves the following challenge:
How to convert Word to HTML inside any application
Although HTML is not a regular language, the easiest solution involves regular expressions. Copying from a Word document already yields HTML code. The problem is that this HTML is very dirty and full of bad and invalid formatting that only Word itself or other specialized software can process.
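As a rough illustration of where that markup comes from, assuming the application runs in a browser: a paste event exposes the clipboard's HTML flavor through `clipboardData.getData('text/html')`. The helper name `getPastedHtml` is made up for this sketch, and the fallback to plain text is an assumption rather than part of the original recipe.

```javascript
// Hypothetical helper: returns the HTML flavor of a paste event,
// falling back to plain text when no HTML is available.
function getPastedHtml(event) {
  var data = event.clipboardData;
  if (!data) {
    return '';
  }
  return data.getData('text/html') || data.getData('text/plain') || '';
}

// Browser-only wiring, guarded so the snippet also loads outside a browser.
if (typeof document !== 'undefined') {
  document.addEventListener('paste', function (event) {
    var input = getPastedHtml(event);
    console.log(input.length + ' characters of pasted markup');
  });
}
```

Other environments (desktop toolkits, server-side imports) obtain the markup differently, but the result is the same kind of string to be cleaned.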
The following regular expressions translate Word input into HTML output:
// 0. remove all empty p's (paragraphs holding only whitespace or &nbsp;)
var output = input.replace(/<p[^>]*>(\s|&nbsp;)*<\/p>/gi, '');
// 1. remove line breaks / Mso classes
var stringStripper = /(\n|\r| class=(")?Mso[a-zA-Z]+(")?)/g;
output = output.replace(stringStripper, '');
// 2. strip Word generated HTML comments
var commentStripper = new RegExp('<!--(.*?)-->', 'g');
output = output.replace(commentStripper, '');
// 3. remove tags, leave content if any
var tagStripper = new RegExp('<(/)*(meta|link|\\?xml:|st1:|o:)(.*?)>', 'gi');
output = output.replace(tagStripper, '');
// 4. remove these tags together with everything in between, e.g. <script>...</script>
var badTags = ['script', 'applet', 'embed', 'noframes', 'noscript'];
for (var i = 0; i < badTags.length; i++) {
  tagStripper = new RegExp('<' + badTags[i] + '.*?' + badTags[i] + '(.*?)>', 'gi');
  output = output.replace(tagStripper, '');
}
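The steps above can be wrapped into a single function and tried on a small sample. This is a minimal sketch, not a complete sanitizer: the function name `cleanWordHtml` and the sample string are invented for illustration, and the patterns used for step 0 (paragraphs containing only whitespace or `&nbsp;`) and step 2 (`<!-- ... -->` comments) are assumptions about what those steps are meant to match.

```javascript
// Hypothetical wrapper around the cleanup steps.
function cleanWordHtml(input) {
  // 0. drop paragraphs that contain only whitespace or &nbsp; (assumed pattern)
  var output = input.replace(/<p[^>]*>(\s|&nbsp;)*<\/p>/gi, '');
  // 1. remove line breaks and Mso* class attributes
  output = output.replace(/(\n|\r| class=(")?Mso[a-zA-Z]+(")?)/g, '');
  // 2. strip HTML comments, including Word's conditional ones (assumed pattern)
  output = output.replace(/<!--[\s\S]*?-->/g, '');
  // 3. remove Office-specific tags but keep their content
  output = output.replace(/<(\/)*(meta|link|\?xml:|st1:|o:)(.*?)>/gi, '');
  // 4. remove these tags together with everything between them
  var badTags = ['script', 'applet', 'embed', 'noframes', 'noscript'];
  for (var i = 0; i < badTags.length; i++) {
    var tagStripper = new RegExp('<' + badTags[i] + '.*?' + badTags[i] + '(.*?)>', 'gi');
    output = output.replace(tagStripper, '');
  }
  return output;
}

// Made-up sample resembling markup pasted from Word.
var sample = '<p class=MsoNormal>Hello <o:p></o:p>world</p>\n' +
  '<p>&nbsp;</p><!--[if gte mso 9]>junk<![endif]-->' +
  '<script>alert(1)</script>';

console.log(cleanWordHtml(sample)); // prints "<p>Hello world</p>"
```

Keeping the steps in one function makes the order explicit: the line breaks are removed before the comment and tag patterns run, so those patterns never have to match across lines.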