How to convert Word to HTML inside any application

The challenge is to translate from one format to another, namely from Word to HTML, while also preventing XSS (cross-site scripting). Automatic translation is always difficult for a developer, because different standards support different levels of complexity, so a fully satisfactory translation may be impossible. On top of that, the translator has to be usable from any application, whatever programming language it is written in. It has to run fast even on large Word inputs with moderate styling and a lot of content (about 3 DIN A4 pages of text already counts as large). It does not have to translate every possible Word formatting feature perfectly; it is enough to cover basic formatting such as bold, italics, text color, text size and links. Finally, because HTML is far more powerful, and potentially more harmful, than Word text, the translator has to remove possible cross-site scripting attacks contained in the text submitted for translation. It is not necessary to close every small hole that may exist, as doing so would strip out all formatting.
1 answer

Using regular expressions

This article solves the following challenge: 

How to convert Word to HTML inside any application

Although HTML is not a regular language, the easiest solution involves regular expressions. Copying from a Word document already yields HTML code. The problem is that this HTML is very dirty and full of invalid, Word-specific formatting that only Word itself or other specialized software can process properly.
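To illustrate where that HTML comes from, here is a minimal sketch that assumes the text is pasted into a web page. The helper name getPastedHtml is hypothetical. It relies on the standard clipboardData.getData('text/html') call of a paste event, and other host applications would need their own way of obtaining the clipboard content.

```javascript
// Hypothetical helper: returns the HTML flavour of a paste event,
// or an empty string if the event carries no clipboard data.
function getPastedHtml(event) {
  var data = event.clipboardData;
  return data ? data.getData('text/html') : '';
}

// Possible wiring in a browser. The guard keeps the snippet loadable
// in environments without a DOM.
if (typeof document !== 'undefined') {
  document.addEventListener('paste', function (event) {
    var input = getPastedHtml(event);
    console.log(input.length + ' characters of Word HTML received');
  });
}
```

The returned string is the kind of input the regular expressions in this answer operate on.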
The following regular expressions translate a Word input into an HTML output:

// 0. remove bare <p> and </p> tags
//    (the opening-tag alternative is garbled in the source; '<p>' is assumed here)
var output = input.replace(/(<p>|<\/p>)/g, '');

// 1. remove line breaks / Mso classes
var stringStripper = /(\n|\r| class=(")?Mso[a-zA-Z]+(")?)/g;
output = output.replace(stringStripper, '');
// 2. strip Word generated HTML comments
var commentStripper = new RegExp('<!--(.*?)-->', 'g');
output = output.replace(commentStripper, '');
// 3. remove tags, leave content if any
var tagStripper = new RegExp('<(/)*(meta|link|\\?xml:|st1:|o:)(.*?)>', 'gi');
output = output.replace(tagStripper, '');
// 4. remove everything in between and including dangerous tags, e.g. '<script ...>...</script>'
var badTags = ['script', 'applet', 'embed', 'noframes', 'noscript'];
for (var i = 0; i < badTags.length; i++) {
    tagStripper = new RegExp('<' + badTags[i] + '.*?' + badTags[i] + '(.*?)>', 'gi');
    output = output.replace(tagStripper, '');
}
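As a rough sketch of how steps 1 to 4 could be packaged for reuse, the function below wraps them and runs them on a small sample. The name cleanWordHtml and the sample string are made up for illustration. The comment pattern in step 2 is an assumption based on that step's description, and step 0 is left out so that paragraph tags survive.

```javascript
// Hypothetical wrapper around steps 1 to 4 of the answer.
function cleanWordHtml(input) {
  // 1. Remove line breaks and Word's "Mso..." class attributes.
  var output = input.replace(/(\n|\r| class=(")?Mso[a-zA-Z]+(")?)/g, '');
  // 2. Strip HTML comments (assumed pattern; Word emits many of them).
  output = output.replace(/<!--(.*?)-->/g, '');
  // 3. Remove Office-specific tags but keep the text between them.
  output = output.replace(/<(\/)*(meta|link|\?xml:|st1:|o:)(.*?)>/gi, '');
  // 4. Remove dangerous elements together with everything inside them.
  var badTags = ['script', 'applet', 'embed', 'noframes', 'noscript'];
  for (var i = 0; i < badTags.length; i++) {
    var re = new RegExp('<' + badTags[i] + '.*?' + badTags[i] + '(.*?)>', 'gi');
    output = output.replace(re, '');
  }
  return output;
}

// Made-up sample resembling Word clipboard HTML.
var sample =
  '<p class=MsoNormal><b>Hello</b> <o:p></o:p>world</p>\n' +
  '<!--[if gte mso 9]><xml></xml><![endif]-->' +
  '<script>alert(1)</script>';

console.log(cleanWordHtml(sample)); // prints <p><b>Hello</b> world</p>
```

Basic formatting such as the bold tag passes through unchanged. Note that these steps do not touch event-handler attributes such as onclick or javascript: URLs in links, which matches the challenge's allowance that not every small hole has to be closed.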
