computer linguistics

Distributed corpus

Corpora are large, fixed collections of texts. Unlike search engines, they often provide non-web data (sometimes quite old and digitized with great effort) and allow complex linguistic queries with semantic filtering over annotated, enriched data. Corpora are no longer unusual: many university computational linguistics institutes offer theirs online for free. The problem, as so often, is the heterogeneity of these data sets. Almost every corpus provider works on its own solution, which means its own data format and its own query format. As these collections grow, so does the need for cooperation. Ideally, a user should be able to access multiple, freely selectable corpora with a single query. There are two constraints. First, the data already exists, often in some "home-made" format, so providers are usually unwilling to switch to another format. Second, copyright restrictions usually mean the data cannot be given away, i.e. it cannot be gathered on one central server (in one application). So one has to solve both the problem of heterogeneous formats and the problem of distributed data. The user data is distributed as well, which creates the need for a distributed authentication and user management system.
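To make the idea of "one query, many corpora" a bit more concrete, here is a minimal Python sketch of the adapter approach. Everything in it is an assumption for illustration: the class names, the two storage formats and the `federated_search` function are hypothetical and do not describe any existing corpus system. Each provider keeps its own format and only exposes a `search` method that returns hits in a shared representation. In a real setup the adapters would call remote services instead of reading in-memory lists, and authentication is left out entirely.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Hit:
    corpus: str  # name of the corpus the match came from
    text: str    # matching sentence in a common, plain-text form


class CorpusAdapter(Protocol):
    """What every provider has to offer, whatever its internal format."""
    name: str

    def search(self, word: str) -> list[Hit]: ...


class TokenListCorpus:
    """Hypothetical provider A: sentences stored as lists of tokens."""
    name = "corpus_a"

    def __init__(self, sentences: list[list[str]]) -> None:
        self._sentences = sentences

    def search(self, word: str) -> list[Hit]:
        return [Hit(self.name, " ".join(s)) for s in self._sentences if word in s]


class TaggedStringCorpus:
    """Hypothetical provider B: lines of 'token/TAG' pairs."""
    name = "corpus_b"

    def __init__(self, lines: list[str]) -> None:
        self._lines = lines

    def search(self, word: str) -> list[Hit]:
        hits = []
        for line in self._lines:
            # Strip the tag from each 'token/TAG' item to get plain tokens.
            tokens = [item.rsplit("/", 1)[0] for item in line.split()]
            if word in tokens:
                hits.append(Hit(self.name, " ".join(tokens)))
        return hits


def federated_search(
    word: str, adapters: list[CorpusAdapter], selected: set[str]
) -> list[Hit]:
    """Send one query to every selected corpus and merge the results."""
    hits: list[Hit] = []
    for adapter in adapters:
        if adapter.name in selected:
            hits.extend(adapter.search(word))
    return hits


if __name__ == "__main__":
    adapters = [
        TokenListCorpus([["the", "old", "house"], ["a", "new", "car"]]),
        TaggedStringCorpus(["an/DT old/JJ book/NN", "the/DT cat/NN"]),
    ]
    for hit in federated_search("old", adapters, {"corpus_a", "corpus_b"}):
        print(hit.corpus, "|", hit.text)
```

The data stays with each provider in its original format; only the query and the normalized hits cross the boundary, which is the property the copyright constraint asks for.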