docx

Extracting textual information from DOCX and ODT documents

To index the office documents of a large company, and to let a web crawler gather the information it needs, a tool is required that extracts text from non-textual sources such as PDF, DOCX, ODT and RTF.

A further requirement is to use PHP without third-party tools such as antiword or xpdf, and without OLE under Windows. One reason is that OLE, for example, is incredibly slow even when it can solve the task. Another is to have an independent solution that relies on no existing tool and does not depend on the platform used.

The task here is to study two formats: Office Open XML, also known as Microsoft's DOCX, and a similar format, the OpenDocument text format (ODT) backed by the ODF Alliance.

The first format, Office Open XML (DOCX), can be a real problem even if you simply work with a document management system inside one company. Because the format is not compatible with old versions of Microsoft Word, receiving such a document is a problem for anyone who has something like Microsoft Office 2003. Being able to extract the important text from such a document without purchasing a new Office licence would therefore be useful.

The same applies to the ODT format, although there the problem is much easier to solve: OpenOffice is open-source software, so if you receive a document in this format you only have to download the free software. Indexing software, however, cannot install every necessary application, so the task remains important.
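As a rough illustration of the task, here is a minimal sketch in PHP. It relies on the fact that both DOCX and ODT files are ZIP archives whose main text lives in an XML part (`word/document.xml` for DOCX, `content.xml` for ODT). The function names `office_xml_to_text` and `extract_office_text` are hypothetical, and the sketch assumes the `zip` and `dom` PHP extensions are enabled. It is one possible approach under those assumptions, not necessarily the method the article goes on to describe.

```php
<?php
// Sketch only: helper names are hypothetical; requires the zip and dom extensions.

/**
 * Convert the main XML part of a DOCX or ODT file to plain text,
 * one line per paragraph (or heading, for ODT).
 */
function office_xml_to_text(string $xml): string
{
    $doc = new DOMDocument();
    if (!@$doc->loadXML($xml)) {
        return '';
    }
    $xpath = new DOMXPath($doc);
    // Namespace of WordprocessingML (DOCX body markup).
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');
    // Namespace of OpenDocument text markup (ODT body markup).
    $xpath->registerNamespace('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    $lines = [];
    // Paragraphs are w:p in DOCX and text:p / text:h in ODT.
    // Note: paragraphs nested inside other paragraphs (e.g. text boxes)
    // are emitted twice by this simple query.
    foreach ($xpath->query('//w:p | //text:p | //text:h') as $paragraph) {
        $lines[] = $paragraph->textContent;
    }
    return implode("\n", $lines);
}

/**
 * Open a DOCX or ODT file as a ZIP archive and return its text.
 */
function extract_office_text(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $xml = false;
    // Try the DOCX part first, then the ODT part.
    foreach (['word/document.xml', 'content.xml'] as $part) {
        $xml = $zip->getFromName($part);
        if ($xml !== false) {
            break;
        }
    }
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("No DOCX or ODT text part found in $path");
    }
    return office_xml_to_text($xml);
}
```

With a file at hand (the name `report.docx` is only a placeholder), usage would be `echo extract_office_text('report.docx');`. Splitting the work into two functions keeps the XML handling separate from the archive access, so the former can be checked on its own.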


Submitted by Dmitriy Solomakhin on Thu, 11/19/2009 - 23:57
