corpus

DDC - a corpus query system with DS-capabilities

DDC is a robust scalable corpus query system.
http://sourceforge.net/projects/ddc-concordance/
It takes as input a collection of texts and creates multiple indexes that allow fast answers even to complex queries, e.g.:
(adjective or article), followed by one to x nouns (matching /regexp/), followed by a verb ending in "nt", all within the next 5 words.
Even for collections of several hundred million words, the results are returned within one second, together with the exact number of hits.
The result is returned as a paginated list of hits with the matching text and surrounding context.
Variable sorting and filtering options are available (depending on the annotation level of the input data).
All this is nice but not particularly special, as there are other corpus query systems with similar capabilities.
The presumably unique feature of this system is its capability for distributed service. That is, it is possible to run multiple servers which act as peers: the user's query is sent by the "asked" server to its peers, processed there simultaneously, and the partial results are sent back to the "asked" server and merged there, providing the user with the combined result. The distribution of the system is thus completely transparent to the user and incurs next to no time penalty. Especially with regard to the optional sorting, this is a remarkable display of performance.
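The scatter-gather merge described above can be sketched as follows. This is an illustrative Python sketch under assumed data shapes (each peer returning a hit count plus an already-sorted hit list), not DDC's actual implementation:

```python
import heapq

def merge_partial_results(partial_results):
    """Merge already-sorted hit lists from several peer servers.

    Each peer is assumed to return (total_hit_count, sorted_hits); the
    coordinating server sums the counts and lazily merges the sorted
    lists, so producing one page of results stays cheap regardless of
    the number of peers.  (Illustrative sketch only, not DDC code.)
    """
    total = sum(count for count, _ in partial_results)
    # Hits are assumed to be (sort_key, text) tuples, pre-sorted per peer.
    merged = heapq.merge(*(hits for _, hits in partial_results))
    return total, merged

# Example: two peers answering the same query over their own shards.
peer_a = (2, [(3, "hit from A"), (9, "another from A")])
peer_b = (1, [(5, "hit from B")])
total, merged = merge_partial_results([peer_a, peer_b])
print(total)          # exact number of hits across all peers
print(list(merged))   # one globally sorted result list
```

Because `heapq.merge` only compares the heads of the per-peer lists, the optional global sorting costs little more than the local sorts the peers already performed.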
Perhaps one aspect worth pointing out with respect to the distribution is the fact that the communication between the server and its peers is based on sockets exchanging an as-simple-as-can-be protocol.
This fact is also exploited by the later-developed APIs in Perl and Python, which do little more than translate the user's query into a protocol-conformant string, send it via socket, and output the incoming result (which is available optionally as plain text, HTML, or self-defined XML).
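A client along these lines can be sketched in a few lines of Python. The wire format used here (newline-terminated request, reply read until EOF) and the toy stand-in server are assumptions for illustration only, not DDC's actual protocol:

```python
import socket
import threading

def query_server(host, port, query):
    """Send a query string over a socket and return the server's reply.

    Minimal sketch of what the Perl/Python client APIs do; the framing
    (newline-terminated request, reply until EOF) is assumed, not DDC's
    real wire format.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(query.encode("utf-8") + b"\n")
        sock.shutdown(socket.SHUT_WR)          # signal end of request
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")

# Toy stand-in server that returns a fixed "result" for any query,
# so the sketch is runnable without a real DDC installation.
def _toy_server(srv):
    conn, _ = srv.accept()
    with conn:
        conn.makefile().readline()             # consume the query line
        conn.sendall(b"1 hit: ...matching text...")

srv = socket.create_server(("127.0.0.1", 0))
threading.Thread(target=_toy_server, args=(srv,), daemon=True).start()
reply = query_server("127.0.0.1", srv.getsockname()[1], "example query")
print(reply)
srv.close()
```

Since the protocol is just strings over a socket, a client in any language needs only this much machinery, which is exactly why the Perl and Python APIs stay thin.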

Taggings:

Distributed corpus

Corpora are large (fixed) collections of texts. As opposed to search engines, they often provide non-web data (sometimes quite old, digitized with huge effort) and allow complex linguistic queries with semantic filtering on this annotated, enriched data. They are nowadays nothing unusual; many university computational-linguistics institutes provide theirs online for free. The problem is (as so often) the heterogeneity of these data sets: almost every corpus provider works on its own solution, which means both its own data format and its own query format. With the number and size of these data collections growing, there is a growing need for cooperation. Ideally, multiple (freely selectable) corpora should be accessible to the user with a single query. This is constrained by the fact that the data already exists (often in some "home-made" format), so the providers are usually not willing to switch to another format, and because of copyright restrictions the data usually cannot be given away, i.e. cannot be concentrated on one central server (in one application). So one has to solve the problems of heterogeneous formats and of distributed data. The user data likewise calls for some distributed authentication and user-management system.
Subscribe to corpus