pdf java algorithm

Parsing hyphenated words in PDF

While developing a program that parses elements of PDF files, I had to implement “dehyphenation”: a function that merges words that were hyphenated across a line break. The naive approach is to look for a hyphen in the last position of each line on a page and join it with the first word of the next line that contains text. This approach fails in two cases. The first is when the hyphen is the last character on a page, so the rest of the word is not on that page at all. The second is a column page layout, where the matching candidate can sit above the text object (for example, at the top of the next column) rather than below it.
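As a rough illustration of the naive approach only, here is a minimal sketch. It assumes the page text has already been extracted into a list of lines in reading order; the class name `NaiveDehyphenator` and the method `dehyphenate` are hypothetical and not taken from any PDF library. It reproduces the limitations described above: a hyphen at the end of the page is left unmerged, and column layouts are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a sketch of the naive approach, not a complete solution.
public class NaiveDehyphenator {

    /**
     * Joins a word split by a trailing hyphen with the first word of the
     * next line that contains text. Input is assumed to be the lines of a
     * single page, already in reading order.
     */
    public static List<String> dehyphenate(List<String> lines) {
        List<String> result = new ArrayList<>();
        String pending = null; // line whose trailing hyphen was removed, awaiting its continuation

        for (String raw : lines) {
            String line = raw.trim();

            if (pending != null) {
                if (line.isEmpty()) {
                    continue; // keep looking for the next line containing text
                }
                // Move the first word of this line onto the pending line.
                String[] parts = line.split("\\s+", 2);
                result.add(pending + parts[0]);
                pending = null;
                line = parts.length > 1 ? parts[1] : "";
                if (line.isEmpty()) {
                    continue;
                }
            }

            if (line.length() > 1 && line.endsWith("-")) {
                pending = line.substring(0, line.length() - 1);
            } else {
                result.add(line);
            }
        }

        // Hyphen at the very end of the page: no candidate on this page,
        // so the line is kept unchanged, hyphen included.
        if (pending != null) {
            result.add(pending + "-");
        }
        return result;
    }
}
```

For the input lines `"to search for hy-"` and `"phens at the end"`, this yields `"to search for hyphens"` followed by `"at the end"`. Note that the sketch also merges words whose hyphen is genuine, such as a compound broken at its hyphen, which is another reason the naive approach is insufficient.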