PhD Thesis


In the thesis, we suggest that linguistic similarity is one of the fundamental ideas in Natural Language Processing (NLP) and argue that it should be studied comprehensively and as distinct from the techniques used to estimate similarity. Specific kinds of linguistic similarity have been studied earlier, but a comprehensive study of the generalized idea of linguistic similarity has been lacking. We show that most problems in NLP can (in part or in full) be formulated in terms of estimation and maximization of different kinds of linguistic similarity. We have conducted experiments on a wide variety of such problems and have tried to advance the state of the art in solving them.

We categorize the different kinds of linguistic similarity into three broad categories: surface, functional and deep. These three roughly correspond to orthographic/phonetic, morphosyntactic/syntactic and semantic similarity, respectively. We propose a Computational Phonetic Model of Scripts that can be used for better estimation of surface similarity, and we have successfully used it for several applications. We also propose a computational model of phonetic space that can estimate surface similarity in statistical terms. We introduce several new similarity measures and extensions to existing techniques to solve NLP problems. For functional and deep similarity, too, we have worked on several problems with some success.

To help in the study of generalized linguistic similarity, we try to build a comprehensive typology of linguistic similarity. We suggest that similarity can be formulated as a triple consisting of the degree, the directness and the relevance. The typological specification of linguistic similarity can also be given as a triple, which has values along three axes: depth, linguality and granularity. The three kinds of similarities mentioned earlier are differentiated along the depth axis. Along the linguality axis, similarity can be monolingual or crosslingual and along the granularity axis, it can vary from letter or phoneme to corpus and to the language itself.
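The two triples described above can be pictured as small record types. The following is only an illustrative sketch, not a structure taken from the thesis: the class names `Similarity`, `SimilarityType`, `Depth` and `Linguality` are hypothetical, and representing degree, directness and relevance as floats in [0, 1] is an assumption made here for concreteness.

```python
from dataclasses import dataclass
from enum import Enum


class Depth(Enum):
    """Depth axis: the three broad kinds of similarity."""
    SURFACE = "surface"        # roughly orthographic/phonetic
    FUNCTIONAL = "functional"  # roughly morphosyntactic/syntactic
    DEEP = "deep"              # roughly semantic


class Linguality(Enum):
    """Linguality axis."""
    MONOLINGUAL = "monolingual"
    CROSSLINGUAL = "crosslingual"


@dataclass(frozen=True)
class SimilarityType:
    """Typological specification: (depth, linguality, granularity)."""
    depth: Depth
    linguality: Linguality
    # Granularity ranges from letter or phoneme up to corpus and language;
    # a free-form string is used here because the full scale is not listed.
    granularity: str


@dataclass(frozen=True)
class Similarity:
    """Similarity as a triple: (degree, directness, relevance).

    Assumption: each component is a float in [0, 1]. The thesis may
    define these components differently.
    """
    degree: float
    directness: float
    relevance: float


# Example: crosslingual surface similarity measured between words.
word_level = SimilarityType(Depth.SURFACE, Linguality.CROSSLINGUAL, "word")
estimate = Similarity(degree=0.8, directness=1.0, relevance=0.5)
```

Keeping the typological specification separate from the estimated value mirrors the distinction the text draws between what kind of similarity is meant and how similar two items are.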

Applications of surface similarity that we have worked on include spell checking, approximate string search, transliteration, letter to phoneme conversion, Indian language to Indian language Crosslingual Information Retrieval (CLIR), English to Indian language CLIR, calculation of language distances, generation of phylogenetic trees of languages and estimation of the cost of adaptation of resources from a resource rich language to a close but resource poor language.
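To make the idea of surface similarity concrete for an application such as approximate string search, here is a minimal sketch that uses normalized Levenshtein edit distance. This is a generic stand-in chosen for illustration; it is not the Computational Phonetic Model of Scripts proposed in the thesis, and the function names and the threshold value are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity score in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


def approximate_search(query, vocabulary, threshold=0.7):
    """Return (word, score) pairs scoring at least `threshold`, best first."""
    scored = [(word, surface_similarity(query, word)) for word in vocabulary]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: -pair[1])


results = approximate_search("color", ["colour", "collar", "dolor", "cat"])
```

A purely orthographic measure like this one treats every character substitution as equally costly. A phonetically informed model could instead assign lower cost to substitutions between similar sounds.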

Applications of functional similarity covered in the thesis include translation of multi-word number expressions, grammar induction from annotated treebank data and grouping of morphological and spelling variations of the same (root) word.

In the case of deep similarity, we have worked on language identification (monolingual and multilingual), sentence alignment, word alignment, translation of tense, aspect and modality markers, the use of an association measure to initialize a weakly supervised dependency parsing algorithm, and transfer of annotation from a resource rich language to a resource poor language.

One of the proposals made in the thesis is that text processing in general, and morphological processing in particular, should be based on a more logical unit than a lexeme. Morphological information is often distributed over more than one lexeme and different languages do this in very different ways. The proposed unit, called Extra Lexical Unit (ELU), which has a close parallel in the Paninian notion of samasta pada, would be consistent across languages. For the same reason, it should be more suitable for crosslingual processing, including for estimation of similarity of units such as sentences.

[Download the thesis]
