It's like I can reverse engineer music now? creepy... (⁀⊙﹏☉⁀)
I recently got a guitar and found out that most guitar learning tools cost money, often as subscriptions. It took me ~2 months to learn how to play by ear through trial and error, so I made this program for myself: a shell script that uses Demucs, Whisper, ffmpeg and some tinkering to produce one .mkv file containing the original version, a vocal karaoke version, a guitar karaoke version and a backing track, along with automatic lyrics. The .mkv file can be played on any device that has VLC or a similar media player installed.
Ever since I got my guitar, I've been paying more attention to the instruments than to the song as a whole. After a long time (I didn't like the song earlier), I listened to this song → Blur - Song 2. I felt like I could try playing along and get better at guitar. That's when I remembered what demucs could do and started working on this project.
yt-dlp is the tool I used to download the song from YouTube. YouTube's best-quality audio usually comes in a .webm container, so I downloaded the song in that format.
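Here is a rough sketch of what that download step could look like, written as a Python helper that only builds the yt-dlp argument list. The function name, the output template and the URL are placeholders I made up for illustration; `-f` and `-o` are standard yt-dlp options.

```python
import shlex


def ytdlp_audio_command(url: str, out_dir: str = ".") -> list[str]:
    """Build a yt-dlp call that grabs the best audio-only .webm stream."""
    return [
        "yt-dlp",
        "-f", "bestaudio[ext=webm]",           # best audio stream in a webm container
        "-o", f"{out_dir}/%(title)s.%(ext)s",  # name the file after the video title
        url,
    ]


if __name__ == "__main__":
    # Placeholder URL, only to show the resulting command line.
    print(shlex.join(ytdlp_audio_command("https://example.com/watch?v=ID", "songs")))
```

To actually download, the list can be handed to `subprocess.run(cmd, check=True)`.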
I used ranger, a terminal-based file manager, as the UI to choose the song.
Since demucs uses ffmpeg in the background and converts any file to .wav, I did no conversion after selecting the file.
Next, I initialised the necessary variables and directories.
Splitting the song into individual instruments
I used demucs, by Meta, to split the song into various stems. I was mainly interested in two of its models:
htdemucs_6s → [6s - 6 stems] This is the most versatile demucs model. It can split a song into six stems: vocals, drums, bass, guitar, piano and other.
htdemucs_ft → [ft - Fine Tuned] This one is arguably the better model in terms of raw accuracy as it is a bag of 4 models, but it misses one crucial thing I am interested in → the guitar stem...
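As a small sketch of running both models: the helper below builds the demucs command for a given model and works out where each stem should land. The function names and the `separated` output folder are my own placeholders; `-n` (model name) and `-o` (output folder) are demucs CLI options, and demucs writes stems to `<out>/<model>/<track name>/<stem>.wav`.

```python
from pathlib import Path

# Stems produced by each model.
STEMS = {
    "htdemucs_6s": ("vocals", "drums", "bass", "guitar", "piano", "other"),
    "htdemucs_ft": ("vocals", "drums", "bass", "other"),
}


def demucs_command(model: str, song: str, out_dir: str = "separated") -> list[str]:
    """Build the demucs call for one model."""
    return ["demucs", "-n", model, "-o", out_dir, song]


def stem_path(model: str, song: str, stem: str, out_dir: str = "separated") -> Path:
    """Where demucs is expected to write a given stem."""
    if stem not in STEMS[model]:
        raise ValueError(f"{model} has no '{stem}' stem")
    return Path(out_dir) / model / Path(song).stem / f"{stem}.wav"


if __name__ == "__main__":
    for model in STEMS:
        print(" ".join(demucs_command(model, "song.webm")))
    print(stem_path("htdemucs_6s", "song.webm", "guitar"))
```

Asking for `stem_path("htdemucs_ft", song, "guitar")` raises an error, which is exactly the limitation described above.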
I was in a fix about which model to use. That's when I got an idea. I tested it out with multiple songs and came up with a set of stem equations. For the Guitar Karaoke track, I used bass.wav, drums.wav and vocals.wav from htdemucs_ft and other.wav, piano.wav from htdemucs_6s.
I was so excited when I figured this out that I got up to get a glass of water (rare event)
I followed the same stem selection process for Vocals Karaoke and Backing Track.
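The stem selection can be sketched as a small recipe table plus one ffmpeg mixing command. Only the guitar karaoke recipe is taken from the text above; the other two recipes are my guess at what "the same process" means, and the function name is hypothetical. The `amix` option `normalize=0` (plain summing instead of scaling the inputs down) is an assumption that needs a reasonably recent ffmpeg build.

```python
# (model, stem) pairs to sum for each output track.
RECIPES = {
    # From the post: everything except the guitar.
    "guitar_karaoke": [("htdemucs_ft", "bass"), ("htdemucs_ft", "drums"),
                       ("htdemucs_ft", "vocals"), ("htdemucs_6s", "other"),
                       ("htdemucs_6s", "piano")],
    # Assumption: everything except the vocals.
    "vocals_karaoke": [("htdemucs_ft", "bass"), ("htdemucs_ft", "drums"),
                       ("htdemucs_ft", "other")],
    # Assumption: neither vocals nor guitar.
    "backing_track": [("htdemucs_ft", "bass"), ("htdemucs_ft", "drums"),
                      ("htdemucs_6s", "other"), ("htdemucs_6s", "piano")],
}


def mix_command(stem_files: list[str], output: str, bitrate: str = "320k") -> list[str]:
    """Build an ffmpeg call that sums the given stems into one AAC file."""
    cmd = ["ffmpeg", "-y"]
    for path in stem_files:
        cmd += ["-i", path]
    n = len(stem_files)
    inputs = "".join(f"[{i}:a]" for i in range(n))
    graph = f"{inputs}amix=inputs={n}:normalize=0[mix]"
    cmd += ["-filter_complex", graph, "-map", "[mix]",
            "-c:a", "aac", "-b:a", bitrate, output]
    return cmd


if __name__ == "__main__":
    files = [f"separated/{model}/song/{stem}.wav"
             for model, stem in RECIPES["guitar_karaoke"]]
    print(mix_command(files, "guitar_karaoke.m4a"))
```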
AAC (.m4a) is the only thing popularized by Apple I can accept. Oh, and maybe iPods.
I used -b:a 320k and .m4a because AAC audio at 320 kbps is almost the same quality as a lossless .wav file, at roughly 1/5th the size.
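A quick back-of-the-envelope check of that size claim, assuming the .wav is CD-style 44.1 kHz, 16-bit stereo and ignoring container overhead (both are assumptions, and the function names are just for illustration):

```python
def pcm_bitrate_kbps(sample_rate: int = 44_100, bit_depth: int = 16, channels: int = 2) -> float:
    """Bitrate of uncompressed PCM audio in kbps."""
    return sample_rate * bit_depth * channels / 1000


def size_mb(bitrate_kbps: float, seconds: float) -> float:
    """Approximate file size in megabytes for a constant bitrate."""
    return bitrate_kbps * 1000 * seconds / 8 / 1_000_000


if __name__ == "__main__":
    wav_kbps = pcm_bitrate_kbps()            # about 1411 kbps
    four_minutes = 4 * 60
    print(f"wav: {size_mb(wav_kbps, four_minutes):.1f} MB")   # about 42.3 MB
    print(f"aac: {size_mb(320, four_minutes):.1f} MB")        # about 9.6 MB
    print(f"ratio: {320 / wav_kbps:.2f}")                     # about 0.23
```

So for a 4-minute song the AAC file comes out a little over 1/5th of the .wav size.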
I initially thought I might come across problems, but then I thought of this → htdemucs_ft[ other.wav ] = htdemucs_6s[ other.wav + guitar.wav + piano.wav ]. It must be foolproof, keeping the performance of the models aside.
Combined, the two models take a little over a minute for a standard 4-minute song.
Auto-Generate Lyrics
Since we now have vocals.wav, we can directly use whisper, by OpenAI, to auto-generate the lyrics. The results are not very promising, but are okay for now.
Once the venv setup was done, I used the cli to get the lyrics for a song using the following code snippet.
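A call along these lines would do the job. This is a sketch as a Python argument list, not the exact command from my script: the model size, the language and the function name are placeholder assumptions, while `--model`, `--language`, `--output_format` and `--output_dir` are options of the whisper CLI.

```python
def whisper_command(audio: str, model: str = "medium",
                    language: str = "en", out_dir: str = ".") -> list[str]:
    """Build a whisper CLI call that transcribes `audio` into an .srt file."""
    return [
        "whisper", audio,
        "--model", model,            # placeholder model size
        "--language", language,      # skip language auto-detection
        "--output_format", "srt",    # write subtitles with timestamps
        "--output_dir", out_dir,
    ]


if __name__ == "__main__":
    print(" ".join(whisper_command("vocals.wav", out_dir="lyrics")))
```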
It produced a .srt file. It's just a simple text file with the timestamps of the subtitles or, in our case, the lyrics.
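To show what that text file looks like, here is a tiny sketch that writes cues in SRT form: a running number, a `start --> end` line with `HH:MM:SS,mmm` timestamps, the text, then a blank line. The function names and the sample lyric are made up for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Turn (start, end, text) cues into the contents of an .srt file."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.5, "la la la"), (1.5, 3.0, "na na na")]))
```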
Tch (ᗒᗣᗕ)՞ What a bummer
I tried to use whisper's fork → whisperX as it can highlight each word at the exact time it was said in the audio, like how Karaoke actually works. Unfortunately, my GPU is too old and does not support the CUDA version required for using whisperX. I also tried using an older version of whisperX but then got into dependency hell. I decided I will work on the auto-generated lyrics later and moved on, for now.
1 week later: I tried using whisper.cpp and it worked surprisingly well. In fact, I was able to use the best whisper model, which would OOM when used in the Python runtime. I was also able to set up per-word (karaoke lyrics) subtitles and it worked well for some songs. But I found another problem: hallucinations. Yes, even with whisper's best model, and no matter what kind of tuning I did, be it the VAD (Voice Activity Detection) threshold or the max tokens for the context prompt. Every song needed its own tuning and its own prompt.
Idea for future upgrades: Fuzzy match the real lyrics with the lyrics generated by the ASR program (Whisper, but I was considering Vosk, as I have already tested its capability with fuzzy matching in another project). This way, we will be clear of all hallucinations and the program will only have to match the timing.
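One possible shape for that idea, as a minimal sketch and not something I have built yet: align the ASR words against the real lyric words with Python's `difflib.SequenceMatcher`, keep the ASR timestamps, and keep the real words. The function name, the input format and the sample words are all assumptions.

```python
from difflib import SequenceMatcher


def align_lyrics(asr_words, real_words):
    """Copy ASR timestamps onto the real lyric words.

    asr_words:  list of (word, start_seconds) from the ASR program
    real_words: list of words from the real lyrics
    Returns a list of (real_word, start_seconds or None).
    """
    def norm(word):
        return word.lower().strip(".,!?\"'")

    a = [norm(w) for w, _ in asr_words]
    b = [norm(w) for w in real_words]
    timed = [(w, None) for w in real_words]
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes():
        # "equal": ASR heard the word correctly.
        # "replace" of the same length: ASR misheard it, so trust the real word
        # but keep the timing. Extra ASR words ("delete") are simply dropped.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for k in range(i2 - i1):
                timed[j1 + k] = (real_words[j1 + k], asr_words[i1 + k][1])
    return timed


if __name__ == "__main__":
    asr = [("hello", 0.0), ("thanks", 0.3), ("whirled", 0.6), ("again", 1.2)]
    print(align_lyrics(asr, ["Hello", "world", "again"]))
```

Real lyric words that get no match keep `None` as their time and would need interpolation from their neighbours.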
Now that the audio files and the auto-generated lyrics are ready, we can add all of the produced files into a single .mkv container.
Trivia: A .mkv container, also called Matroska Video, is one of the most versatile video formats. It can contain multiple video streams, subtitles and audio tracks. This is the format we used 15+ years ago to watch Hollywood movies that had subtitles in almost every language.
I used ffmpeg, once again, to combine all of the .m4a audio tracks and lyrics .srt file.
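Roughly, the muxing step can be sketched like this: every audio file and the subtitle file become ffmpeg inputs, each one is mapped into the output, the audio is copied without re-encoding, and each track gets a title so the player shows a readable name. The function name, file names and track titles are placeholders, not my exact script.

```python
def mux_command(tracks: list[tuple[str, str]], subtitles: str, output: str) -> list[str]:
    """Build an ffmpeg call that packs audio tracks and lyrics into one .mkv.

    tracks: list of (audio_path, track_title)
    """
    cmd = ["ffmpeg", "-y"]
    for path, _ in tracks:
        cmd += ["-i", path]
    cmd += ["-i", subtitles]
    for i in range(len(tracks)):
        cmd += ["-map", f"{i}:a"]               # audio from input i
    cmd += ["-map", f"{len(tracks)}:s"]         # subtitles from the last input
    cmd += ["-c:a", "copy", "-c:s", "srt"]      # no audio re-encoding
    for i, (_, title) in enumerate(tracks):
        cmd += [f"-metadata:s:a:{i}", f"title={title}"]
    cmd.append(output)
    return cmd


if __name__ == "__main__":
    tracks = [("original.m4a", "Original"),
              ("vocals_karaoke.m4a", "Vocals Karaoke"),
              ("guitar_karaoke.m4a", "Guitar Karaoke"),
              ("backing_track.m4a", "Backing Track")]
    print(mux_command(tracks, "lyrics.srt", "song.mkv"))
```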
I should load these to my iPod touch once it finishes charging... It's been charging for a month now (⋟﹏⋞)
Once this is finished, there is some cleanup to do. And then, we can play the video on any media player. For example, I use VLC Media Player here.
We can see the metadata of the file above. This file can be played on a phone, a laptop, etc. Any device supporting VLC Media Player or a similar app will be able to have the Karaoke experience.
Custom Player
Now that the file can be played anywhere, I thought of making a custom Karaoke Player using PyGame.
It took a while, but it was ready: a custom player where every instrument reacts in its own way and can be enabled or disabled. Here is the video.
I then proceeded to generate .ass (Advanced SubStation Alpha) subtitles, which can make the lyrics move and have effects in a more dynamic way. The lyrics' border gets thicker with bass, shakes with drums and shows a chromatic aberration effect with guitar.
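As a simplified illustration of the bass effect only, here is one way the mapping could work: measure the loudness (RMS) of the bass stem over a short window, scale it to a border width, and write it into the line with the ASS `\bord` override tag. The helper names, the 1 to 6 width range and the assumption that samples are already normalized to -1..1 are mine; this is a sketch of the idea, not the generator I actually wrote.

```python
import math


def rms(samples: list[float]) -> float:
    """Root-mean-square loudness of a window of samples in the range -1..1."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def border_for_level(level: float, min_bord: float = 1.0, max_bord: float = 6.0) -> float:
    """Map a loudness level (0..1) to a subtitle border width."""
    level = max(0.0, min(1.0, level))
    return round(min_bord + (max_bord - min_bord) * level, 1)


def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp (H:MM:SS.cc)."""
    cs = round(seconds * 100)
    h, cs = divmod(cs, 360_000)
    m, cs = divmod(cs, 6_000)
    s, cs = divmod(cs, 100)
    return f"{h}:{m:02}:{s:02}.{cs:02}"


def dialogue(start: float, end: float, text: str, bass_level: float) -> str:
    """One ASS Dialogue line whose border width follows the bass level."""
    bord = border_for_level(bass_level)
    return (f"Dialogue: 0,{ass_time(start)},{ass_time(end)},Default,,0,0,0,,"
            f"{{\\bord{bord}}}{text}")


if __name__ == "__main__":
    window = [0.5, -0.5, 0.5, -0.5]   # made-up bass samples
    print(dialogue(0.0, 1.0, "la la la", rms(window)))
```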
Honestly, I like the .mkv container better. It has 5 audio tracks, reactive subtitles and can be viewed on any device, even my TV.