In speaker diarisation, one of the most popular methods is to model each speaker with a Gaussian mixture model and to assign frames to speakers with the help of a hidden Markov model. There are two main kinds of clustering strategy. The first, by far the most popular, is called bottom-up: the algorithm starts by splitting the full audio content into a succession of clusters and progressively merges redundant clusters until each cluster corresponds to a real speaker. The second, called top-down, starts with a single cluster for all the audio data and splits it iteratively until the number of clusters equals the number of speakers. A 2010 review can be found at [1].
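The bottom-up strategy described above can be illustrated with a minimal sketch. It is not a real diarisation system: as a loudly stated assumption, clusters are compared by the Euclidean distance between their centroids rather than by Gaussian mixture model likelihoods, and there is no hidden Markov model resegmentation. The function name `bottom_up_diarise` and the parameters `chunk_size` and `stop_distance` are hypothetical, chosen only for this illustration.

```python
from itertools import combinations
from math import dist


def centroid(frames):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def bottom_up_diarise(features, chunk_size=4, stop_distance=1.0):
    """Toy bottom-up clustering; returns one cluster label per frame.

    Assumption: clusters are compared by Euclidean distance between
    centroids, not by GMM likelihoods as in real systems.
    """
    # Over-segment: cut the frame sequence into short uniform chunks.
    clusters = [
        list(range(start, min(start + chunk_size, len(features))))
        for start in range(0, len(features), chunk_size)
    ]

    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        centroids = [centroid([features[i] for i in c]) for c in clusters]
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: dist(centroids[pair[0]], centroids[pair[1]]),
        )
        # Stop when even the closest pair looks like two different speakers.
        if dist(centroids[a], centroids[b]) > stop_distance:
            break
        # Merge cluster b into cluster a (a < b, so index a stays valid).
        clusters[a].extend(clusters.pop(b))

    # Turn the final clusters into a per-frame label sequence.
    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Synthetic 2-D "features": one speaker near (0, 0), another near (5, 5).
frames = [[0.0, 0.1]] * 4 + [[5.0, 5.1]] * 4 + [[0.1, 0.0]] * 4
labels = bottom_up_diarise(frames)
```

On this synthetic input the first and last chunks are merged into one cluster and the middle chunk stays separate, so the frames receive the labels 0, 1, 0 in blocks of four. A top-down variant would invert the loop: start from a single cluster and split it until a stopping criterion is met.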
More recently, speaker diarisation has been performed with neural networks, leveraging large-scale GPU computing and methodological developments in deep learning.[6]
There are some open-source initiatives for speaker diarisation (in alphabetical order):
Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiqing; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). "The Speed Submission to DIHARD II: Contributions & Lessons Learned". arXiv:1911.02388 [eess.AS].
Zhu, Xuan; Barras, Claude; Meignier, Sylvain; Gauvain, Jean-Luc. "Improved speaker diarization using speaker identification". Retrieved 2012-01-25. http://www.limsi.fr/Rapports/RS2005/chm/tlp/tlp1/index.html
Kotti, Margarita; Moschou, Vassiliki; Kotropoulos, Constantine. "Speaker Segmentation and Clustering" (PDF). Retrieved 2012-01-25. http://poseidon.csd.auth.gr/papers/PUBLISHED/JOURNAL/pdf/Kotti08a.pdf
"Rich Transcription Evaluation Project". NIST. Retrieved 2012-01-25. http://www.itl.nist.gov/iad/mig/tests/rt/
"Awesome Speaker Diarization". awesome-diarization. Retrieved 2024-09-17. https://wq2012.github.io/awesome-diarization/
Park, Tae Jin; Kanda, Naoyuki; Dimitriadis, Dimitrios; Han, Kyu J.; Watanabe, Shinji; Narayanan, Shrikanth (2021-11-26). "A Review of Speaker Diarization: Recent Advances with Deep Learning". arXiv:2101.09624 [eess.AS].