Placeholder Image

Subtitles section Play video

  • This episode of Real Engineering is brought to you by Brilliant, a problem solving website

  • that teaches you to think like an engineer.

  • Introduction: Opening, scene in a pub listening to a song and opening the shazam app.

  • Maybe be tricky to film.

  • What you just witnessed was the Shazam app recognising a song in a noisy environment,

  • and proceeding to find a match for it among the millions of songs in its servers database.

  • For most this probably seems like a trivial task.

  • Our brains can identify songs incredibly quickly from a young age, but the pathways in your

  • brain that allow you to identify a song quickly are incredibly complex.

  • Often times you simply need to hear just a few chords to know exactly what song is about

  • to play, that jolt of excitement when you can hear a DJ fading in the baseline of your

  • favourite song.

  • A simple combination of tones in a specific order allow you to identify a song from the

  • thousands of other songs you have heard in your life in an instant, but coding a computer

  • do the same thing is an incredible challenge.

  • A computer does not have an intuitive understanding of music.

  • A computer can only compare songs to other songs in its database, looking for a match

  • by comparison.

  • It is a problem akin to finding a needle in a haystack, where you can only find the needle

  • by looking at a picture of a needle and comparing it to each individual straw, comparing it's

  • length and colour until you finally find the needle.

  • To create a software capable of doing this task quickly poses a very interesting coding

  • challenge and the solution the engineers at Shazam came up with gives us some interesting

  • insight into how our own brains work.

  • A study by the Manchester Museum of Science and Industry tested 12,000 people's ability

  • to recognise a song.

  • They created an interactive game to search for the most recognisable songs, where they

  • would play the hook of 1000 best selling songs and recorded the time required to identify

  • them.[1]

  • Can you identify this song with just 2.3 seconds of the hook?

  • That was the Spice Girls song Wannabe, which ranks highest with a recognition time averaging

  • just 2.3 seconds, and that's including the reaction time required to hit the button.

  • Our brains are hardwired for this kind of pattern recognition.

  • In a world where recognising the sound of an approaching threat meant life or death,

  • we have evolved incredibly efficient ways of categorizing and accessing historical data

  • like this.

  • Our brain does not take the sound and compare it to every sound we have ever heard like

  • a computer, the specific combination of chords in progression simply activates specific neurons

  • that unlock that historical data.

  • What if the chords were played by a different instrument?

  • Would we recognise the song as quickly?

  • Those same 2.3 seconds played on a guitar sounds like this.

  • The notes are exactly the same, but they don't sound the exactly the same.

  • We even know intuitively what instrument is playing.

  • Why is that?

  • This is called the timbre of a note and different instruments have different timbres.

  • Pianos and guitars are examples of harmonic instruments and when they produce a note,

  • they aren't just producing a pure note of a single frequency.

  • Each note is a combination of multiple frequencies all related to the base note, the fundamental

  • frequency.

  • These are called overtones, and they are simply multiples of the base frequency.

  • Each instrument has a unique combination and evolution of these overtones that give it

  • that unique sound.

  • Again, it's quite easy for our brains to distinguish between a piano and a guitar,

  • but we need a way to quantify these characteristics for a computer to recognise, and this is where

  • the spectrogram comes in.

  • A spectrogram is a visual representation of sound.

  • It's a 3D graph with time on the x-axis, frequency on the y-axis, and the amplitude

  • of the frequency, or in other words the loudness, on the z-axis, which is often represented

  • by a colour.

  • This 3D graph is something a computer can absolutely recognise and store as data, but

  • there is huge amount of data within a spectrogram like this, and the more data there is the

  • more computation time is required to find a match.

  • So the first step in reducing computation time is reducing the data required to classify

  • a song.

  • Shazam uses something they call a fingerprint, where they transform these spectrograms into

  • something that looks like star map.

  • [2] Here each star represents the strongest frequencies at particular times.

  • Doing this, we have not only reduced our graph from 3 dimensions down to 2, but have drastically

  • reduced the amount of data points on the graph.

  • This is the first vital part of Shazam's technology.

  • Every single song in Shazams database is stored in a fingerprint like this.

  • When you open your phone and hit that Shazam button, the app accesses your microphone and

  • begins to create its own fingerprint of the sound waves it receives.

  • This ingenious method also helps the shazam app to filter out noise because it only creates

  • data points for stand-out frequencies.

  • Once the app has created a fingerprint of your audio, it then sends it to the shazam

  • servers where the recognition part of the process begins.

  • This is where things get difficult.

  • Let's look at a simplified song fingerprint, and a recorded fingerprint to see why.

  • The recorded fingerprint is only a short recording of the song, in our example we have just 3

  • possible frequencies, and each recorded fingerprint will have just 3 time points.

  • If we want to check the first 3 time points in the song for a match we first check the

  • 3 frequencies, then we move onto the next time point and check the 3 possible frequencies

  • again, and do the same for the final time point.

  • If we find a match, that is 9 operations required to find a match, but obviously that isn't

  • likely.

  • We then need to do those nine operations for every time point in the song, or perhaps every

  • time point in Shazams massive music archive, this obviously is going to take a lot of computation

  • time.

  • This is not how Shazam looks for a match.

  • First Shazam categorises fingerprints in a clever way.

  • We don't search to see if a note exists in a song, we search to see if several notes

  • exist separated by a particular time, just as brain does.

  • This becomes our searchable address for a hash table.

  • Hashes and hash functions are an incredibly useful technique that appear everywhere in

  • computer science.

  • Hash functions can be found in search algorithms used by Google, to make sure files are downloaded

  • correctly, and are the backbone of crypto currencies like bitcoin.

  • [3]

  • A hash function takes a varying length of input and produces a fixed length output,

  • called a hash.

  • In practice, the input can be anything from a short piece of text, like a password, to

  • a long data stream like an entire movie.

  • Consider a library of books.

  • We want to store each book on a shelf so we can find it later, and we know we'll have

  • the title of the book when we're searching for it.

  • We can use a hash function to decide which shelf to put a book on, using the title of

  • the book as the input and producing a shelf number as an output.

  • The first goal of a hash function is to produce outputs that are uniformly distributed...In

  • our library, we want the books to be spread evenly across the shelves, so no shelf in

  • particular will end up full of books, leaving others almost empty.

  • The second goal of a hash function is that it should reduce collisions.

  • A collision is when two different inputs produce the same output hash.

  • In our case, a collision results in two or more books on the same shelf.

  • If our library only has two shelves, collisions will be really common, no matter what hash

  • function you use.

  • If our library had a billion shelves, a good hash function will mean collisions will be

  • rare.

  • Another goal of a hash function is that it should be quick to calculate.

  • If our library has millions of books, we don't want to take too long figuring out which shelf

  • each one needs to be on.

  • A simple hash function might be to take the title of a book, and group them on shelves

  • alphabetically.

  • This would be really quick to calculate, but it would result in a lot of collisions, with

  • many books on the same shelf, and wouldn't be very well distributed.

  • Think about how many book titles begin with the wordTHE”, compared to how many book

  • titles start with the letter “Z”.

  • An alternative might be to take the position of each letter in the alphabet and sum up

  • the letters in the book title.

  • We could then divide that number by the number of shelves we have and take the remainder

  • as the shelf number to store the book on.

  • This would still be fairly fast to calculate, and would prevent all the books with titles

  • starting with the wordTHEbeing stored on the same shelf.

  • Now imagine, instead of book titles, our hash function takes data from our two frequencies

  • separated by particular time as an input, t, and produces a number between 1 and...say

  • 1 billion.

  • First, we go through our database of songs and calculate the hash number for each anchor

  • point.

  • Songs will contain multiple anchor points, which will allow us to categorise short snippets

  • of songs by the frequency of the anchor point, the frequency the following point and the

  • time between them.

  • And just like the library, we store each anchor point in order by the hash.

  • These addresses' are also categorised with song IDs and time stamps within the song in

  • a secondary hash table, allowing us to search for matching songs.

  • This makes it much faster to locate our matches, and to find our song we will require multiple

  • matching anchor points.

  • This ingenious method of song recognition allowed Shazam to be sold for 400 million

  • dollars to Apple, and help you figure out just what that catchy song is.

  • This is a very simplified view of how the programming of Shazam works, but I have linked

  • my research materials below if you would like to read more into the process.

  • Or if you would like to begin learning more about programming and build a solid foundation

  • of understanding, then you could take this course on Computer Science Fundamentals on

  • Brilliant which will start with an intro to algorithms and gradually build you up to more

  • complex ideas like data types and structures, by the end of this course you will have discovered

  • algorithms that can be used to extract answers from data, and when you are finished you can

  • take the follow up course in computer science algorithms.

  • This is just one of many courses on Brilliant, with more courses due to released soon on

  • things like automotive engineering and Python Coding.

  • If I have inspired you and you want to educate yourself, then go to brilliant.org/RealEngineering

  • and sign up for free.And the first 73 people that go to that link will get 20% off the

  • annual Premium subscription.

  • As always thanks for watching and thank you to all my Patreon supporters.

  • If you would like to see more from me the links to my instagram, twitter, subreddit

  • and discord server are below.

This episode of Real Engineering is brought to you by Brilliant, a problem solving website

Subtitles and vocabulary

Click the word to look it up Click the word to find further inforamtion about it