HypheNN-de: German Hyphenation with Neural Networks
May 14, 2017
These days, if you run into a problem that you can’t solve with traditional programming techniques, what do you do? You use neural networks.
Recently I got interested in software-based hyphenation for German words. Now, that’s a solved problem, right? Except it ain’t that easy. There are a couple of approaches which work really well - for the English language. Let’s dive in:
A widely used approach is to work with patterns (as LaTeX does, for example). This works for German words too in many cases. The trouble is that the German language allows for compound words such as Nahrungsmittelunverträglichkeit (food intolerance) or Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz (beef labeling supervision duties delegation law; yes, it's a real word). Let’s take the former and hyphenate it with a pattern-based approach:
>>> import pyphen
>>> dic = pyphen.Pyphen(lang='de_DE')
>>> dic.inserted('Nahrungsmittelunverträglichkeit')
'Nah-rungs-mit-te-lun-ver-träg-lich-keit'
This works almost flawlessly, except that it hyphenates mittelunverträglich as mit-te-lun-ver… where it should be mit-tel-un-ver… What patterns can’t express is that German compound words have to be hyphenated according to the original single words while accounting for different sorts of prefixes and suffixes. So, could we try to model the German grammar rules instead?
That approach works well for German words, but there are downsides. The conceptual problem is that the grammar rules have to be maintained manually, which is a lot of work if the official grammar rules change. The more practical problem is that there’s no easily accessible open-source implementation of this approach. This leads us to:
After having done some research on this topic, I got curious whether neural networks would be a good fit for this problem. Turns out there is some research on hyphenation with neural networks from the 90s with acceptable results. For example, http://www.fi.muni.cz/usr/sojka/papers/nnw.pdf discusses neural networks for Czech hyphenation and http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.7099&rep=rep1&type=pdf uses them for German hyphenation. For instance, the paper Fritzke, B., & Nasahl, C. (1991), A neural network that learns to do hyphenation, reports 96% accuracy for German words when training with 1000 words. Can I beat that?
Following the paper, I implemented a rather simple neural network. It uses one-hot encoding to represent the input character set and only considers 8 characters at once. It has a hidden layer consisting of 41 nodes and a single output node that indicates whether the word can be hyphenated at the current position. For instance, the word Nahrungsmittelunverträglichkeit is processed as follows:
WINDOW               OUTPUT
_ _ n a | h r u n    0.0
_ n a h | r u n g    0.99
n a h r | u n g s    0.0
a h r u | n g s m    0.0
h r u n | g s m i    0.0
r u n g | s m i t    0.0
u n g s | m i t t    1.0
...
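To make the input encoding and the network shape more concrete, here is a minimal pure-Python sketch of one forward pass. The 8-character window, the 41 hidden nodes and the single output come from the description above. Everything else is an assumption for illustration: the alphabet (a-z, umlauts, ß and "_" as padding), the sigmoid activations, and the random, untrained weights. It is not the code from the actual project.

```python
import math
import random
import string

# Assumed alphabet: "_" for padding, a-z, plus the German umlauts and ß.
ALPHABET = "_" + string.ascii_lowercase + "äöüß"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
WINDOW_SIZE = 8    # from the text: 8 characters at once
HIDDEN_NODES = 41  # from the text: hidden layer with 41 nodes


def encode_window(window):
    """Concatenate one one-hot vector per character of the window."""
    encoded = []
    for ch in window:
        vec = [0.0] * len(ALPHABET)
        vec[CHAR_INDEX[ch]] = 1.0
        encoded.extend(vec)
    return encoded


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, hidden_weights, hidden_bias, output_weights, output_bias):
    """One hidden layer, one output node; sigmoid activations are assumed."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_bias)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Random, untrained weights, only to show the shapes involved.
rng = random.Random(0)
n_inputs = WINDOW_SIZE * len(ALPHABET)
hidden_weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                  for _ in range(HIDDEN_NODES)]
hidden_bias = [0.0] * HIDDEN_NODES
output_weights = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_NODES)]
output_bias = 0.0

features = encode_window("_nahrung")
score = forward(features, hidden_weights, hidden_bias, output_weights, output_bias)
print(len(features))  # 8 characters * 31 symbols = 248 inputs
```

With untrained weights the score is meaningless. After training, it would be read as the confidence that a hyphen may go in the middle of the window.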
To hyphenate a word, we slide the 8-character window over the whole word and, for every position, ask the neural network whether a hyphen may be inserted there. That way, we can compute the hyphenation of the complete word:
>>> predict('Nahrungsmittelunverträglichkeit')
'Nah·rungs·mit·tel·un·ver·träg·lich·keit'
We did it! Nahrungsmittelunverträglichkeit is hyphenated correctly!
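The sliding-window step can be sketched roughly like this. The helper names (windows, hyphenate, toy_predictor), the 0.5 threshold and the stand-in predictor are hypothetical and only there for illustration. A real version would call the trained network instead of the toy lookup.

```python
def windows(word, left=4, right=4, pad="_"):
    """Yield (position, window) for every gap between two characters.

    Position p is the gap between word[p - 1] and word[p]. The window holds
    `left` characters before the gap and `right` characters after it.
    """
    padded = pad * left + word.lower() + pad * right
    for pos in range(1, len(word)):
        yield pos, padded[pos:pos + left + right]


def hyphenate(word, predict_gap, threshold=0.5, separator="·"):
    """Insert `separator` wherever predict_gap(window) reaches the threshold."""
    breaks = {pos for pos, window in windows(word)
              if predict_gap(window) >= threshold}
    parts = []
    for i, ch in enumerate(word):
        if i in breaks:
            parts.append(separator)
        parts.append(ch)
    return "".join(parts)


def toy_predictor(window):
    """Stand-in for the trained network: it knows exactly one break point."""
    return 0.99 if window == "_nahrung" else 0.0


print(hyphenate("Nahrung", toy_predictor))  # Nah·rung
```

Lowercasing is only applied to the window contents, so the original capitalization of the word survives in the output.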
Training & Validation
To get a decent training set, I built a small scraper that extracts hyphenation definitions from a complete Wiktionary dump. That way I was able to get a dataset of roughly 370,000 words, 150,000 of which I used for training while the rest was used for validation. Training ran on my Quadro K2000 and took 6 minutes. For comparison, the paper I followed used a training set of 1000 words, and training took 44 minutes back in 1991. Crazy!
How accurate is this network? I’m not an expert, but from my testing it seems like the network achieves an accuracy of 99.2%, beating the original paper by roughly 3 percentage points! Not bad for a 250-line neural network with only a little parameter tuning.
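Each hyphenated word then has to be turned into per-position training samples. Here is a rough sketch of that step, assuming the scraped entries mark break points with "·" (the separator also used in the predict output above) and that every gap between two letters becomes one sample. The function name training_pairs and the exact sample format are assumptions, not the project's actual code.

```python
def training_pairs(hyphenated, separator="·", left=4, right=4, pad="_"):
    """Turn e.g. 'Nah·rung' into (window, label) pairs, one per gap."""
    word = hyphenated.replace(separator, "").lower()

    # Record at which gap positions of the plain word a separator stood.
    breaks = set()
    pos = 0
    for ch in hyphenated:
        if ch == separator:
            breaks.add(pos)
        else:
            pos += 1

    padded = pad * left + word + pad * right
    return [
        (padded[p:p + left + right], 1.0 if p in breaks else 0.0)
        for p in range(1, len(word))
    ]


for window, label in training_pairs("Nah·rung"):
    print(window, label)
```

A 7-letter word yields 6 samples this way, most of them negative, so a large word list turns into a much larger set of labeled windows.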
What I learned
- Neural networks are fun. I didn’t expect to get results that good with that little fine-tuning and tinkering. I basically followed the original paper and made a few small modifications based on tips from the internet.
- German hyphenation is hard to get right. It’s easy to get a solution that almost works – when using traditional programming techniques.
- Neural networks + hyphenation = ❤️. Applying neural networks to hyphenation seems to be a good place to get started with artificial intelligence and machine learning. You’ll learn about input encodings, different neural network architectures and more. Why don’t you try to build a model for your own native language?
The source code for the model as well as the dataset scraper and the trained model weights are open-sourced at https://github.com/msiemens/HypheNN-de.