Extracting textual information from DOCX and ODT documents

<p>To index the office documents of a large company and let a web crawler gather the necessary information, a tool is needed that extracts text from non-plain-text sources such as PDF, DOCX, ODT, and RTF.</p><p>Another requirement is to use PHP without third-party tools such as antiword or xpdf, or at the very least without OLE under Windows. The reason is that OLE, for example, is incredibly slow even when it can solve the task. A further reason is to have an independent solution that does not rely on any existing tool or on the platform in use.</p> <p>The task here is to study the Office Open XML format, known as Microsoft's DOCX, and a similar format, OpenDocument Format, known as ODT, from the ODF Alliance.</p> <p>The first format, Office Open XML (DOCX), can be a real problem even if you simply work with a document management system inside one company. Because the format is not compatible with older versions of Microsoft Word, receiving such a document from a client is a problem when you only have something like Microsoft Office 2003. The ability to extract the important text from such a document without purchasing a new Office licence would therefore be useful.</p> <p>The same applies to the ODT format, although that problem is easier to solve: OpenOffice is open-source software, so if you receive a document in that format you only have to download the free software. Still, indexing software cannot install every application it might need, so this task really matters.</p>
1 answer

Use PHP to extract textual information from DOCX and ODT documents

Actually, this task turned out to be less difficult than I first thought. To make it more realistic, I worked with Russian-language documents, i.e. with CP1251.

At first I tried opening the documents in the plain Lister viewer. The screenshot of the experiment is provided in the attachments. Both files, ODT and DOCX, look like binary files from this point of view. But the most important detail to notice (and it is quite a small one) is the letters "PK" at the beginning of the data. This means that both files are, deep down, just ZIP archives whose extension has been changed to odt or docx.
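
The "PK" check can also be done in code before trying to unpack anything. The following is only a minimal sketch: the function name looksLikeZip is hypothetical, and it assumes the file starts with the usual ZIP local file header signature ("PK" followed by the bytes 0x03 0x04), which is what a non-empty DOCX or ODT file normally begins with.

```php
<?php
// Hypothetical helper (not from the original attachments): returns true
// when the file begins with the ZIP local file header signature.
function looksLikeZip($path)
{
    // Open in binary mode; suppress the warning for unreadable files.
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    // Read only the first four bytes of the file.
    $signature = fread($handle, 4);
    fclose($handle);

    // "PK" followed by 0x03 0x04 marks the first entry of a ZIP archive.
    return $signature === "PK\x03\x04";
}
```

A crawler could call such a check first and skip files that fail it, instead of relying on the file extension alone.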

If we open either file in Total Commander using Ctrl+PageDown (Open element under the cursor), we get structured content with some folders and XML documents. The screenshot of the experiment is provided in the attachments.

The content we need is located in the file content.xml (in ODT) and word/document.xml (in DOCX).
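
Reading that file out of the archive could look roughly like this. It is a sketch rather than the attached source: the function name readMainXml is hypothetical, and it simply tries both entry names mentioned above using the standard ZipArchive class.

```php
<?php
// Hypothetical helper (not from the original attachments): returns the raw
// XML of the main document part, or false if it cannot be read.
function readMainXml($path)
{
    // Entry names used by ODT and DOCX respectively.
    $candidates = array('content.xml', 'word/document.xml');

    $zip = new ZipArchive();
    // open() returns true on success and an error code otherwise.
    if ($zip->open($path) !== true) {
        return false;
    }

    $xml = false;
    foreach ($candidates as $name) {
        // getFromName() returns the entry contents, or false if absent.
        $xml = $zip->getFromName($name);
        if ($xml !== false) {
            break;
        }
    }

    $zip->close();
    return $xml;
}
```

The same function handles both formats because it does not look at the file extension, only at which entry is present in the archive.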

So, to extract textual information from ODT or DOCX files, we have to use the standard ZipArchive class and a few functions that work with it.
The source code is provided in the attachments. The solution works under PHP 5.2+ and requires php_zip.dll on Windows or the --enable-zip configure option on Linux. If ZipArchive is unavailable (for example, an old PHP version or missing libraries), one could use PclZip (http://www.phpconcept.net/pclzip/index.en.php).
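
Once the XML has been read from the archive, it still has to be reduced to plain text. The sketch below shows one possible way and is not necessarily what the attached source does: the function name xmlToPlainText is hypothetical, it assumes the common w: and text: namespace prefixes, and it uses ENT_XML1, which requires PHP 5.4 or later.

```php
<?php
// Hypothetical helper (not from the original attachments): turns the XML of
// content.xml or word/document.xml into plain UTF-8 text.
function xmlToPlainText($xml)
{
    // Put a newline after each paragraph or heading end tag
    // (</w:p> in DOCX, </text:p> and </text:h> in ODT).
    $xml = preg_replace('#</(w:p|text:p|text:h)>#', '$0' . "\n", $xml);

    // Turn explicit line breaks into newlines.
    $xml = preg_replace('#<(w:br|text:line-break)\b[^>]*/>#', "\n", $xml);

    // Drop all remaining markup, keeping only the character data.
    $text = strip_tags($xml);

    // Decode XML entities such as &amp; back into characters.
    return trim(html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8'));
}
```

The XML inside these archives is normally UTF-8, so if the index expects CP1251, the result can be converted afterwards, for example with iconv('UTF-8', 'CP1251//IGNORE', $text), assuming the iconv extension is available.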

References:
www.msdn.microsoft.com/en-us/library/aa338205.aspx
http://www.i-rs.ru/content/download/1447/8162/file/OpenDocument-v1.0-os.pdf