  • Quality estimation for machine translation
  • Modelling and application of linguistic similarity
  • Computational modelling and processing of tense, aspect, modality and temporal information in natural languages
  • Aligning parallel corpora
  • Language and encoding identification
  • A computational phonetic model for Indian language scripts, based on the highly phonetic and well-organized nature of Brahmi-based scripts. It is being used to build applications such as a spell checker, a cognate identifier and a transliteration tool for Indian languages.
  • Phonetic processing of text: transliteration, letter-to-phoneme conversion, model of phonetic space
  • Crosslingual information retrieval
  • Building GUI-based interfaces for corpus annotation in Java
  • Building APIs for language resources like dictionaries and corpora
  • A multi-purpose editor specialized for NLP and Indian languages
  • Etc.

Note: Most of the above, along with several other components such as APIs for n-gram modelling, corpus compilation, find/replace/extract tools for corpora, a file splitter and a tree viewer, have been integrated into Sanchay, a small open-source Java-based platform for NLP that focuses especially on Indian languages. Parts of Sanchay are already being used by many people working with South Asian languages. Some other people are now also contributing to the development of some Sanchay modules.

The last formal release of Sanchay (version 0.4.1) is available for download here. The latest builds are usually put here. You can also contact me.

So They Say

We should have a great many fewer disputes in the world if words were taken for what they are, the signs of our ideas only, and not for things themselves.

— John Locke