Parsing hyphenated words in PDF

While developing a program that parses elements of PDF files, I was tasked with implementing “dehyphenation”: a function that merges words hyphenated across line breaks. The naive approach is to look for a hyphen in the very last position of each line on a page and join it with the first word of the next line that contains text. This approach misses not only the case where the hyphen is the last character of a page, but also column layouts, where the matching candidate can sit above the text object.
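To make the naive approach concrete, here is a minimal sketch in Python. It is only an illustration, not the code from my program: the name `dehyphenate` is made up, and it assumes the text lines have already been extracted from the PDF and put into reading order, with all pages concatenated into one list. Under that assumption a hyphen at the end of a page is handled like any other line ending, but the column layout problem is not solved, because ordering the lines correctly is exactly the hard part there.

```python
def dehyphenate(lines):
    """Naively merge words split by a trailing hyphen across text lines.

    Assumption: `lines` is a list of strings already in reading order
    (all pages concatenated). Lines without text are skipped.
    """
    merged = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore lines that contain no text
        if merged and merged[-1].endswith("-"):
            # Previous line ended with a hyphen: attach the first word
            # of this line to it and drop the hyphen.
            parts = text.split(maxsplit=1)
            merged[-1] = merged[-1][:-1] + parts[0]
            if len(parts) > 1:
                merged.append(parts[1])
        else:
            merged.append(text)
    return merged


sample = ["This is a hyphen-", "", "ated word in a docu-", "ment."]
print(dehyphenate(sample))
```

Note that this sketch cannot tell a line-break hyphen from a real one, so a compound such as "well-" / "known" would wrongly become "wellknown". Deciding that would probably need a dictionary lookup or a similar check.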
1 answer

I honestly did not understand your solution. Aren't there any third-party libraries that can help in this regard?

Comments

I also have trouble understanding your solution. Maybe you could make it clearer.

Daria Piacun - Mon, 12/10/2018 - 09:29