The current state of the art in processing images, video, speech, and audio uses a form of machine learning called deep learning, in which computers learn by example. Similar to the way humans solve problems by drawing on knowledge accumulated from many experiences, deep learning involves “convolutional neural networks” that are pre-trained on massive archives of images, containing everything from dogs to cars. Using a method called transfer learning, the computer models can then apply that training to something new, such as visualizations of whale calls.
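The idea behind transfer learning can be illustrated with a toy sketch. The article does not name any particular network or library, so everything below is an assumption for illustration: a fixed random projection stands in for the frozen, pre-trained convolutional layers, the data are synthetic, and only a small new classification "head" is trained on top.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained feature extractor: in real transfer learning
# these weights come from a network trained on a huge image archive and are
# kept frozen; here a fixed random projection plays that role (assumption).
W_frozen = rng.normal(size=(64, 16))

def extract_features(x):
    """Map a 64-dim input 'image' to 16 frozen features (ReLU activation)."""
    return np.maximum(x @ W_frozen, 0.0)

# Tiny synthetic task standing in for "whale call" vs. "no whale call":
# class-1 inputs are shifted in their first 8 dimensions (made-up data).
X = rng.normal(size=(200, 64))
y = (rng.random(200) < 0.5).astype(float)
X[y == 1, :8] += 3.0

feats = extract_features(X)
# Standardize features so plain gradient descent behaves well.
feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

# Only the new classification head is trained (logistic regression by
# gradient descent); the "pre-trained" extractor above stays untouched.
w = np.zeros(16)
b = 0.0
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))   # predicted probabilities
    grad = p - y                                  # logistic-loss gradient
    w -= 0.2 * feats.T @ grad / len(y)
    b -= 0.2 * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(feats @ w + b))) > 0.5).astype(float)
accuracy = (pred == y).mean()
```

The point of the sketch is the division of labor: the expensive general-purpose feature extractor is reused as-is, and only the small task-specific layer is fit to the new data.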
The computer “learning” can be supervised, in which humans provide labeled examples so that the system learns to recognize certain patterns, or unsupervised, in which the system finds patterns in the data on its own. In both cases, the sound is presented as a visual representation, the spectrogram.
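A spectrogram is built by slicing the audio into short overlapping frames and taking a Fourier transform of each one, producing an image with time on one axis and frequency on the other. The article does not specify an implementation, so the following is a minimal numpy sketch under assumed parameters (frame length, hop size, a synthetic upward-sweeping tone loosely imitating a rising whale call), not real whale data.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=256, hop=128):
    """Magnitude short-time Fourier transform: the 'image' a CNN sees.

    Returns (times, freqs, mags), where mags has one row per frequency
    bin and one column per analysis frame.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, axis=1)).T          # freq bins x frames
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    return times, freqs, mags

# Synthetic "call": a tone sweeping from 200 Hz up to 800 Hz over one
# second (illustrative stand-in for a rising whale vocalization).
sr = 8000
t = np.arange(0, 1.0, 1.0 / sr)
sweep = np.sin(2 * np.pi * (200 * t + 300 * t**2))

times, freqs, mags = spectrogram(sweep, sr)
# The brightest frequency bin in each frame traces the rising pitch,
# which is exactly the visual pattern a network can learn to recognize.
peak_freqs = freqs[mags.argmax(axis=0)]
```

Plotted as an image, `mags` shows a bright diagonal streak, and it is this kind of two-dimensional pattern, rather than the raw waveform, that the image-trained networks described above are given as input.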