In this paper, we present statistical models for
morphological disambiguation in Turkish. Turkish
poses a challenge for statistical models because
its productive derivational morphology makes the
potential tag set very large. We propose
to handle this by breaking up the morphosyntactic
tags into inflectional groups, each of which
contains the inflectional features of one (intermediate)
derived form. Our statistical models score the
probability of each morphosyntactic tag by considering
statistics over the individual inflectional groups
in a trigram model. Of the three models that we have developed and tested, the simplest one,
which ignores the local morphotactics within words, performs
best. Our best trigram model achieves
93.95% accuracy on our test data when all the
morphosyntactic and semantic features must be correct. If we
are just interested in syntactically relevant features
and ignore a very small set of semantic features, then
the accuracy increases to 95.07%.
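
The idea of scoring a tag through its inflectional groups can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the `^DB` boundary marker, the example feature names, the choice to condition each group on the final groups of the two preceding tags, and the add-one smoothing. It is not a reconstruction of any of the three models.

```python
import math
from collections import Counter

DB = "^DB"      # assumed marker separating inflectional groups inside a tag
START = "<s>"   # padding symbol for the start of a sentence


def split_igs(tag):
    """Split a morphosyntactic tag into its inflectional groups (IGs)."""
    return [ig.strip("+") for ig in tag.split(DB)]


def train(tagged_sentences):
    """Count IGs given the final IGs of the two preceding tags.

    tagged_sentences: list of sentences, each a list of tag strings.
    Returns (event_counts, context_counts, vocabulary).
    """
    event_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for sentence in tagged_sentences:
        finals = [START, START]
        for tag in sentence:
            igs = split_igs(tag)
            ctx = (finals[-2], finals[-1])
            for ig in igs:
                # Illustrative assumption: IGs of one tag are treated as
                # independent of each other given the context.
                event_counts[(ctx, ig)] += 1
                context_counts[ctx] += 1
                vocab.add(ig)
            finals.append(igs[-1])
    return event_counts, context_counts, vocab


def log_score(tag, prev2, prev1, model):
    """Log-probability of a tag given the final IGs of the two previous tags.

    Uses add-one smoothing (an illustrative choice) so that unseen
    IGs receive a small non-zero probability.
    """
    event_counts, context_counts, vocab = model
    ctx = (prev2, prev1)
    size = len(vocab) + 1  # one extra slot for unseen IGs
    return sum(
        math.log((event_counts[(ctx, ig)] + 1) / (context_counts[ctx] + size))
        for ig in split_igs(tag)
    )
```

Because statistics are collected over inflectional groups rather than whole tags, a tag never seen in training can still be scored as long as its groups have been seen, which is the motivation for the decomposition.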
Publisher Statement
Published in Proceedings of The 18th International Conference on Computational Linguistics, August 2000, Saarbrücken, Germany