• Project funded by MeitY
  • Project duration: Mar 2022 to Mar 2025


The goal of this project is to develop a system for spoken term detection (STD), i.e., to search for a spoken query term in an audio database. The performance of an STD system depends critically on the representation of the speech signal. The acoustic waveforms, obtained by measuring sound pressure levels, of a word uttered by two different speakers look completely different, yet they carry the same linguistic information. A good feature representation for STD should therefore be speech-specific and, at the same time, robust to speaker and channel variability. We are developing two kinds of STD systems:

  • Language dependent (works for a specific language)
  • Language agnostic (works for any language)
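
A classical baseline for this search task is query-by-example STD: slide a window over the database's feature sequence and score each segment against the query with dynamic time warping (DTW), which tolerates differences in speaking rate. The sketch below is a minimal, illustrative implementation (not the project's actual system); the function names `dtw_cost` and `search`, the cosine distance, and the fixed-window scan are all assumptions for the example.

```python
import numpy as np

def dtw_cost(query, segment):
    """Length-normalized DTW alignment cost between two feature
    sequences (frames x dims), using cosine distance per frame pair.
    Illustrative baseline only, not the project's method."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    s = segment / (np.linalg.norm(segment, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - q @ s.T                      # pairwise cosine distances
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)     # accumulated-cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)                # normalize by path length

def search(query, database, window, hop):
    """Scan the database features with a sliding window; return the
    (start frame, cost) of the best-matching segment."""
    best_start, best_cost = 0, np.inf
    for start in range(0, len(database) - window + 1, hop):
        c = dtw_cost(query, database[start:start + window])
        if c < best_cost:
            best_start, best_cost = start, c
    return best_start, best_cost

# Toy usage: plant the query inside random "database" features
# and verify the search locates it.
rng = np.random.default_rng(0)
query = rng.standard_normal((20, 13))         # e.g., 13-dim frame features
database = rng.standard_normal((200, 13))
database[100:120] = query                      # plant an exact match
start, cost = search(query, database, window=20, hop=5)
```

The quality of `dist` here is exactly where feature representation matters: speaker-robust features make matching segments cheap to align and mismatched ones expensive, regardless of who spoke the query.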


  • Adhiraj Banerjee and Vipul Arora, “Wav2Tok: Deep Sequence Tokenizer for Audio Retrieval” in ICLR, 2023.
  • Adhiraj Banerjee and Vipul Arora, “Enc-Dec RNN Acoustic Word Embeddings learned via Pairwise Prediction” in INTERSPEECH, 2023.