Speaker diarisation

<h2 id="main-types-of-diarisation-systems">Main types of diarisation systems</h2>
<p>One of the most popular approaches to speaker diarisation models each speaker with a <a href="/facts/Mixture_model/WWe6TJlg">Gaussian mixture model</a> and assigns frames to speakers with the help of a <a href="/facts/Hidden_Markov_Model/ur1zTAhP">Hidden Markov Model</a>. There are two main clustering strategies. The first, by far the most popular, is called bottom-up: the algorithm starts by splitting the full audio content into a succession of clusters and then progressively merges redundant clusters until each cluster corresponds to a real speaker. The second, called <a href="http://www.eurecom.fr/util/publidownload.fr.htm?id=3000">top-down</a>, starts with a single cluster for all the audio data and splits it iteratively until the number of clusters equals the number of speakers.
A 2010 review can be found at <a href="http://www.icsi.berkeley.edu/~fractor/papers/friedland_146.pdf">[1]</a>.
</p><p>More recently, speaker diarisation has been performed with <a href="/facts/Artificial_neural_network/6V1jMlkx">neural networks</a>, leveraging large-scale <a href="/facts/Graphics_processing_unit/PTK1RQVp">GPU</a> computing and methodological developments in <a href="/facts/Deep_learning/JLuwD3ea">deep learning</a>.<a class="footnote-ref" id="fnref:6" href="#fn:6"><sup>6</sup></a>
</p>
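<p>As a rough illustration of the bottom-up strategy described above, the following Python sketch over-segments a feature sequence into short uniform segments and then repeatedly merges the two most similar clusters. It is a simplified example resting on several assumptions that are not part of the text above: each cluster is modelled by a single diagonal-covariance Gaussian rather than a full Gaussian mixture model, the Hidden Markov Model realignment step is omitted, and merging stops according to the Bayesian information criterion (BIC), a common choice that is used here only for illustration. The function names, the segment length and the synthetic two-speaker features are all hypothetical.</p>

```python
import numpy as np


def gaussian_log_det(frames):
    """Log-determinant of a diagonal covariance fitted to the frames."""
    var = frames.var(axis=0) + 1e-6  # variance floor keeps the log finite
    return float(np.sum(np.log(var)))


def delta_bic(a, b, penalty_weight=1.0):
    """BIC change for keeping a and b separate (negative values favour merging)."""
    merged = np.vstack([a, b])
    n, dim = merged.shape
    # Log-likelihood gained by modelling a and b separately instead of jointly.
    gain = 0.5 * (n * gaussian_log_det(merged)
                  - len(a) * gaussian_log_det(a)
                  - len(b) * gaussian_log_det(b))
    # Cost of the extra parameters (mean and diagonal variance) of a second model.
    penalty = 0.5 * penalty_weight * (2 * dim) * np.log(n)
    return gain - penalty


def bottom_up_diarisation(features, segment_len=50, penalty_weight=1.0):
    """Over-segment the frames, then merge the closest clusters until none qualify."""
    # Start from many short, uniform segments: one initial cluster per segment.
    clusters = [features[i:i + segment_len]
                for i in range(0, len(features), segment_len)]
    owners = [[k] for k in range(len(clusters))]  # initial segment indices per cluster

    while len(clusters) > 1:
        # Find the pair of clusters whose merge is most strongly supported by BIC.
        score, i, j = min(
            ((delta_bic(clusters[i], clusters[j], penalty_weight), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if score >= 0:  # no remaining merge is supported, so stop
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        owners[i].extend(owners[j])
        del clusters[j], owners[j]

    # Map every initial segment to the index of its final cluster.
    labels = np.empty(sum(len(o) for o in owners), dtype=int)
    for label, segs in enumerate(owners):
        labels[segs] = label
    return labels


# Synthetic stand-in for acoustic features: speaker A, then speaker B, then A again.
rng = np.random.default_rng(0)
speaker_a1 = rng.normal(0.0, 1.0, size=(200, 4))
speaker_b = rng.normal(5.0, 1.0, size=(200, 4))
speaker_a2 = rng.normal(0.0, 1.0, size=(100, 4))
features = np.vstack([speaker_a1, speaker_b, speaker_a2])

labels = bottom_up_diarisation(features)
print(labels)  # one cluster index per 50-frame segment
```

<p>A top-down variant would run the same comparison in the opposite direction: it would begin with one cluster holding all frames and keep splitting while a split is supported by the criterion. Real systems typically work on acoustic features such as MFCCs and realign segment boundaries after each merge, which this sketch does not attempt.</p>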
<h2 id="open-source-speaker-diarisation-software">Open source speaker diarisation software</h2>
<p>Several open source initiatives exist for speaker diarisation:
</p>
<ul><li>ALIZE Speaker Diarization (last repository update: July 2016; last release: February 2013, version: 3.0): the ALIZE Diarization System was developed at the University of Avignon; release 2.0 is available <a href="http://alize.univ-avignon.fr/svn/LIA_RAL/branches/2.0/LIA_SpkSeg/">[2]</a>.</li>
<li>Audioseg (last repository update: May 2014; last release: January 2010, version: 1.2): AudioSeg is a toolkit dedicated to audio segmentation and classification of audio streams. <a href="http://gforge.inria.fr/projects/audioseg">[3]</a>.</li>
<li>pyannote.audio (last repository update: August 2022, last release: July 2022, version: 2.0): pyannote.audio is an open-source toolkit written in Python for speaker diarization. <a href="https://github.com/pyannote/pyannote-audio">[4]</a>.</li>
<li>pyAudioAnalysis (last repository update: September 2022): Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications <a href="https://github.com/tyiannak/pyAudioAnalysis">[5]</a></li>
<li>SHoUT (last update: December 2010; version: 0.3): SHoUT is a software package developed at the University of Twente to aid speech recognition research. SHoUT is a Dutch acronym for <i>Speech Recognition Research at the University of Twente</i>. <a href="http://shout-toolkit.sourceforge.net/">[6]</a></li>
<li>LIUM SpkDiarization (last release: September 2013, version: 8.4.1): LIUM_SpkDiarization tool <a href="https://projets-lium.univ-lemans.fr/spkdiarization/">[7]</a>.</li></ul>

<h2 id="bibliography">Bibliography</h2>
<ul><li>Anguera, Xavier (2012). <a href="http://www.eurecom.fr/publication/3152">"Speaker diarization: A review of recent research"</a>. <i>IEEE Transactions on Audio, Speech, and Language Processing</i>. 20 (2). IEEE/ACM Transactions on Audio, Speech, and Language Processing: 356–370. <a href="/facts/CiteSeerX_(identifier)/SceDmd3c">CiteSeerX</a> <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.470.6149">10.1.1.470.6149</a>. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1109%2FTASL.2011.2125954">10.1109/TASL.2011.2125954</a>. <a href="/facts/ISSN_(identifier)/DPAflDvU">ISSN</a> <a href="https://search.worldcat.org/issn/1558-7916">1558-7916</a>. <a href="/facts/S2CID_(identifier)/ldJsHa2Y">S2CID</a> <a href="https://api.semanticscholar.org/CorpusID:206602044">206602044</a>.</li>
<li>Beigi, Homayoon (2011). <a href="https://www.springer.com/computer/image+processing/book/978-0-387-77591-3"><i>Fundamentals of Speaker Recognition</i></a>. New York: Springer. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-0-387-77591-3.</li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiqing; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). "The Speed Submission to DIHARD II: Contributions &amp; Lessons Learned". <a href="/wiki/ArXiv_(identifier)" target="_blank">arXiv</a>:1911.02388 [eess.AS]. <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Zhu, Xuan; Barras, Claude; Meignier, Sylvain; Gauvain, Jean-Luc. "Improved speaker diarization using speaker identification". Retrieved 2012-01-25. <a href="http://www.limsi.fr/Rapports/RS2005/chm/tlp/tlp1/index.html" target="_blank">http://www.limsi.fr/Rapports/RS2005/chm/tlp/tlp1/index.html</a> <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>Kotti, Margarita; Moschou, Vassiliki; Kotropoulos, Constantine. "Speaker Segmentation and Clustering" (PDF). Retrieved 2012-01-25. <a href="http://poseidon.csd.auth.gr/papers/PUBLISHED/JOURNAL/pdf/Kotti08a.pdf" target="_blank">http://poseidon.csd.auth.gr/papers/PUBLISHED/JOURNAL/pdf/Kotti08a.pdf</a> <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>"Rich Transcription Evaluation Project". NIST. Retrieved 2012-01-25. <a href="http://www.itl.nist.gov/iad/mig/tests/rt/" target="_blank">http://www.itl.nist.gov/iad/mig/tests/rt/</a> <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
<li id="fn:5"><p>"Awesome Speaker Diarization". awesome-diarization. Retrieved 2024-09-17. <a href="https://wq2012.github.io/awesome-diarization/" target="_blank">https://wq2012.github.io/awesome-diarization/</a> <a href="#fnref:5" class="footnote-back-ref">↩</a></p></li>
<li id="fn:6"><p>Park, Tae Jin; Kanda, Naoyuki; Dimitriadis, Dimitrios; Han, Kyu J.; Watanabe, Shinji; Narayanan, Shrikanth (2021-11-26). "A Review of Speaker Diarization: Recent Advances with Deep Learning". <a href="/wiki/ArXiv_(identifier)" target="_blank">arXiv</a>:2101.09624 [eess.AS]. <a href="#fnref:6" class="footnote-back-ref">↩</a></p></li>
</ol>
