Projects

  • Quality estimation for machine translation
  • Modelling and application of linguistic similarity
  • Computational modelling and processing of tense, aspect, modality, and temporal information in natural languages
  • Aligning parallel corpora
  • Language and encoding identification
  • A computational phonetic model for Indian language scripts, based on the highly phonetic and well-organized nature of Brahmi-based scripts. It is being used to build applications for Indian languages such as a spell checker, a cognate identifier, and a transliteration tool.
  • Phonetic processing of text: transliteration, letter-to-phoneme conversion, model of phonetic space
  • Crosslingual information retrieval
  • Building GUI-based interfaces for corpus annotation in Java
  • Building APIs for language resources like dictionaries and corpora
  • A multi-purpose editor specialized for NLP and Indian languages
  • Etc.

Note: Most of the above, along with several other components (APIs for N-gram modelling, corpus compilation, find/replace/extract tools for corpora, a file splitter, a tree viewer, etc.), have been integrated into Sanchay, a small open-source Java-based platform for NLP with a particular focus on Indian languages. Parts of Sanchay are already being used by many people working with South Asian languages, and others have begun contributing to the development of some Sanchay modules.

The last formal release of Sanchay (version 0.4.1) is available for download here. The latest builds are usually posted here. You can also contact me.

So They Say

Words ought to be a little wild, for they are the assaults of thought on the unthinking.

— John Maynard Keynes