Extracting textual information from RTF documents

To index the office documents of a large company and let a web crawler gather the necessary information, a tool is needed that extracts text from non-plain-text sources such as PDF, DOCX, ODT, RTF, etc. Another requirement is to use PHP without third-party tools such as antiword or xpdf, or at least without OLE under Windows. The reason for this requirement is that OLE, for example, is incredibly slow, even if it can solve the task. A further reason is to have an independent solution that relies on no existing tools and does not depend on the platform used. Here the task is to study the Rich Text Format (RTF), whose specification has grown to more than 300 pages by the current version 1.9.1, which certainly does not make the format easier to parse.

Use PHP to extract textual information from RTF documents

Opening a document of this format in a plain text viewer shows unreadable content such as:

Calibri;}{\f1\fnil\fcharset0 Calibri;}}
\b0\fs22\'c1\'e5\'eb\'e5\'e5\'f2 \'ef\'e0\'f0\'f3\'f1 \'ee\'e4\'e8\'ed\'ee\'ea\'ee\'e9

See the attachments for a more detailed screenshot.

So what we see is a plain-text format built from single-byte characters, which is good, because it means the information will not be that difficult to extract. In general, RTF consists of control words that can be grouped into nested sets (groups). A control word begins with a backslash ('\'), and a group is delimited by curly braces ('{' and '}'). A control word is a sequence of letters a-z that may be followed by a numeric parameter. Besides control words there are control symbols: a backslash followed by a single non-alphanumeric ASCII character.

So a sequence like \rtf1\ansi\ansicpg1251 splits into three control words: rtf with parameter 1 (the major format version), ansi (the current character set) and ansicpg with parameter 1251 (the Windows-1251 code page).
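
As an illustration of this splitting, here is a minimal sketch in PHP. The function name rtfControlWords is hypothetical (it does not come from the attached source), and the regular expression covers only plain control words with an optional numeric parameter. Control symbols such as \'hh and escaped backslashes are deliberately ignored here.

```php
<?php
// Hypothetical helper: split a run of RTF control words into
// word/parameter pairs. Not a full RTF tokenizer.
function rtfControlWords(string $rtf): array
{
    $words = [];
    // Backslash, letters, optional signed number, optional space delimiter.
    preg_match_all('/\\\\([a-zA-Z]+)(-?\d+)? ?/', $rtf, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $words[] = [
            'word'  => $m[1],
            // A control word without a number gets a null parameter.
            'param' => ($m[2] ?? '') !== '' ? (int) $m[2] : null,
        ];
    }
    return $words;
}

print_r(rtfControlWords('\rtf1\ansi\ansicpg1251'));
```

For the sequence above this yields three entries: rtf with parameter 1, ansi with no parameter, and ansicpg with parameter 1251.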

Nested groups define the scope of control words: everything set inside {} applies to that group and to all of its child groups. To keep track of the control words currently in effect, a stack is needed: push an element on each opening brace and pop it on each closing brace. Some control words can be switched off not only by the closing brace } but also by repeating them with the parameter 0. For example: This is \b bold \b0 text.
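
The stack idea can be sketched as follows. This is only an illustration under simplifying assumptions: the function rtfBoldRuns is hypothetical, it tracks nothing but the \b flag, and it treats every other control word as something to skip. Control symbols such as \'hh are not handled.

```php
<?php
// Hypothetical helper: split RTF into text runs and record whether
// \b (bold) was in effect for each run.
function rtfBoldRuns(string $rtf): array
{
    $stack = [['bold' => false]]; // one state entry per open group
    $runs = [];
    $len = strlen($rtf);
    for ($i = 0; $i < $len; $i++) {
        $ch = $rtf[$i];
        $top = count($stack) - 1;
        if ($ch === '{') {
            // A new group inherits the state of its parent.
            $stack[] = $stack[$top];
        } elseif ($ch === '}') {
            // Leaving a group restores the parent's state.
            if ($top > 0) {
                array_pop($stack);
            }
        } elseif ($ch === '\\' && preg_match('/\G\\\\([a-z]+)(-?\d+)? ?/', $rtf, $m, 0, $i)) {
            if ($m[1] === 'b') {
                // \b and \b1 switch bold on, \b0 switches it off.
                $stack[$top]['bold'] = ($m[2] ?? '') !== '0';
            }
            $i += strlen($m[0]) - 1; // jump past the whole control word
        } else {
            // Plain text: extend the current run or start a new one.
            $bold = $stack[$top]['bold'];
            $last = count($runs) - 1;
            if ($last >= 0 && $runs[$last]['bold'] === $bold) {
                $runs[$last]['text'] .= $ch;
            } else {
                $runs[] = ['text' => $ch, 'bold' => $bold];
            }
        }
    }
    return $runs;
}

print_r(rtfBoldRuns('{This is \b bold \b0 text.}'));
```

Copying the parent state on '{' is what makes a change inside a child group disappear again at the matching '}'.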

Since the default character set of RTF is ANSI, English text is stored as is, without any special handling. But I am interested in a more general approach that handles at least Russian and, even better, Unicode. RTF can encode characters outside the ASCII range (codes 128 and above), interpreted according to the current code page (\ansicpg), of course. A special sequence is used for this: \'hh, where hh is the hexadecimal code of the character in that code page. Unicode characters are encoded as the sequence \uN, where N is the decimal code of the Unicode character (written as a signed 16-bit number, so codes above 32767 appear as negative values).
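
A minimal sketch of decoding both kinds of escape into UTF-8 is shown below. Both function names are hypothetical, the code assumes the iconv extension is available, and the default code page CP1251 is only an assumption that matches the \ansicpg1251 example. Surrogate pairs are not handled.

```php
<?php
// Hypothetical helper: decode the hh part of \'hh, one byte in the
// document code page, into UTF-8. Requires the iconv extension.
function rtfHexByteToUtf8(string $hh, string $codepage = 'CP1251'): string
{
    return (string) iconv($codepage, 'UTF-8', chr(hexdec($hh)));
}

// Hypothetical helper: decode the N of \uN into UTF-8.
function rtfUnicodeToUtf8(int $n): string
{
    if ($n < 0) {
        $n += 65536; // negative values stand for codes above 32767
    }
    // pack('n') produces the 16-bit big-endian form that UCS-2BE expects.
    return (string) iconv('UCS-2BE', 'UTF-8', pack('n', $n));
}

// The first word of the sample line: \'c1\'e5\'eb\'e5\'e5\'f2
$word = '';
foreach (['c1', 'e5', 'eb', 'e5', 'e5', 'f2'] as $hh) {
    $word .= rtfHexByteToUtf8($hh);
}
echo $word, "\n";                  // Белеет
echo rtfUnicodeToUtf8(1041), "\n"; // Б
```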

However, testing showed that Unicode is not as easy as it seems. The problem is that RTF has another control word, \ucN, which is tightly coupled with Unicode. The thing is that RTF takes great care to stay compatible with the old standards and systems that may be used to read the RTF file. For example, a PC with Windows 3.11 will not be able to read Unicode. To let it show at least something in this case, every Unicode character encoded with the control word \u can be followed by several fallback characters, which are displayed when the RTF viewer cannot show or recognize the Unicode character.

Because of this, most current word processors put '?' after each \u control word as the character to show instead of the Unicode character in the worst case. But other variations are also possible, like \u915FValue. The control word \ucN tells how many fallback characters follow each \u control word. So if \uc1 appears before the Unicode data, a Unicode-aware reader must skip one character after each \u control word.
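
The skipping rule can be sketched like this. The function rtfUnicodeText is hypothetical and understands only \ucN and \uN; everything else is copied through unchanged. It assumes the iconv extension, starts with a skip count of 1 (the default in the RTF specification), and counts a \'hh escape as a single fallback character.

```php
<?php
// Hypothetical helper: decode \uN sequences and drop their fallback
// characters according to the most recent \ucN.
function rtfUnicodeText(string $rtf): string
{
    $out = '';
    $uc = 1; // number of fallback characters after each \u
    $len = strlen($rtf);
    $i = 0;
    while ($i < $len) {
        if (preg_match('/\G\\\\uc(\d+) ?/', $rtf, $m, 0, $i)) {
            $uc = (int) $m[1];
            $i += strlen($m[0]);
        } elseif (preg_match('/\G\\\\u(-?\d+) ?/', $rtf, $m, 0, $i)) {
            $n = (int) $m[1];
            if ($n < 0) {
                $n += 65536; // signed 16-bit value
            }
            $out .= (string) iconv('UCS-2BE', 'UTF-8', pack('n', $n));
            $i += strlen($m[0]);
            // Skip the fallback characters meant for old readers.
            for ($k = 0; $k < $uc && $i < $len; $k++) {
                if (preg_match('/\G\\\\\'[0-9a-fA-F]{2}/', $rtf, $f, 0, $i)) {
                    $i += strlen($f[0]); // a \'hh escape counts as one
                } else {
                    $i++;
                }
            }
        } else {
            $out .= $rtf[$i];
            $i++;
        }
    }
    return $out;
}

echo rtfUnicodeText('\uc1\u1041?\u1077?'), "\n";
```

With \uc1 each '?' is dropped, so only the two decoded Unicode characters remain in the output.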

Having mined this information from the format specification, we can write the code. The source code is provided in the attachments.

The suggested algorithm works with most RTF documents, but there are some ways to enhance it if you like. For example, it would be good to cut all non-textual data; my implementation cuts out only fonts, colors, themes, binary data and everything marked as "don't read me if you can't" (\*). Another good option would be to parse the character set and code page in order to decode \'hh sequences better.
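
One possible way to cut such groups is sketched below. This is not the attached implementation: the function rtfStripGroups is hypothetical, and the default list of control words (fonttbl, colortbl, themedata, pict) is only an illustrative assumption. A group is dropped when it opens with \* or with one of the listed words, and brace counting finds where it ends.

```php
<?php
// Hypothetical helper: remove whole groups that carry no document text.
function rtfStripGroups(
    string $rtf,
    array $skip = ['fonttbl', 'colortbl', 'themedata', 'pict']
): string {
    // Matches "{\*" or "{\word" for any word in the skip list.
    $pattern = '/\G\{\\\\(?:\*|(?:' . implode('|', $skip) . ')(?![a-z]))/';
    $out = '';
    $len = strlen($rtf);
    $i = 0;
    while ($i < $len) {
        if ($rtf[$i] === '{' && preg_match($pattern, $rtf, $m, 0, $i)) {
            // Walk to the brace that closes this group.
            $depth = 0;
            do {
                $ch = $rtf[$i];
                if ($ch === '\\') {
                    $i++; // skip the escaped character, e.g. \{ or \}
                } elseif ($ch === '{') {
                    $depth++;
                } elseif ($ch === '}') {
                    $depth--;
                }
                $i++;
            } while ($depth > 0 && $i < $len);
        } else {
            $out .= $rtf[$i];
            $i++;
        }
    }
    return $out;
}

echo rtfStripGroups('{\rtf1{\fonttbl{\f0 Calibri;}}{\*\generator X;}Hello}'), "\n";
```

Raw binary data introduced by \bin is not covered by this sketch, since it can contain unbalanced braces and needs length-based skipping.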

[1] www.latex2rtf.sourceforge.net/RTF-Spec-1.0.txt

[2] www.microsoft.com/downloads/details.aspx?FamilyId=DD422B8D-FF06-4207-B47...