The company I work for has thousands of lectures available in a video-on-demand library. All of these videos begin with a title card that is displayed for 10-20 seconds before the hour-long lecture begins. During this time, a high-pitched tone plays. Way back in the 1980s, tones like this were used to help set audio levels in the studio and for broadcast, but they really aren’t of any use now on the web. In fact, they can be downright annoying! I decided to write a bit of software to clean them up and make the user experience a bit more enjoyable.
Each video starts with 10-20 seconds of tone, followed by about 5 seconds of silence, and then the beginning of the video. Because these scene changes were done live by hand during recording, the timing is a bit different in every case, so there is no hard rule to follow. I needed a way to detect when the title card segment was complete. I could have analyzed the video frames themselves, but these were not always consistent. I decided to analyze the audio instead to find my breaks. First, I took several videos from the library and used ffmpeg to extract the audio as a PCM wav. Fortunately, the command-line defaults do this without any additional switches.
ffmpeg -i video.mp4 output.wav
The beginning of a typical audio track looked like this (screenshot from Audacity).

Fortunately, while pitch is complicated to calculate in digital audio, amplitude is very easy. For each 16-bit (2 byte) sample, you have a value between -32768 and +32767. Zero is complete silence and 32K is blow-your-ears-out loud. So all we have to do is read through the bytes of the wav file and keep track of how loud stuff is, and we should be able to easily tell when the intro tone disappears and a few seconds of silence occur.
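As a minimal sketch of what "reading the amplitude" means for a single sample: two bytes are combined, low byte first, into a signed 16-bit value, and the absolute value is taken. The class and method names here (`AmplitudeDemo`, `ReadAmplitude`) are made up for illustration, and the sketch assumes 16-bit little-endian PCM as described above.

```csharp
using System;

public static class AmplitudeDemo
{
    // Reads one signed 16-bit little-endian PCM sample starting at `offset`
    // and returns its absolute value (0 = silence, 32768 = largest possible).
    public static int ReadAmplitude(byte[] data, int offset)
    {
        // Combine low byte and high byte explicitly, so the result does not
        // depend on the byte order of the machine running the code.
        short sample = unchecked((short)(data[offset] | (data[offset + 1] << 8)));

        // Widen to int before Math.Abs: the absolute value of -32768
        // does not fit in a short.
        return Math.Abs((int)sample);
    }
}
```

For example, the bytes `0xFF 0x7F` decode to +32767, and `0x00 0x80` decode to -32768, which comes back as an amplitude of 32768.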
I first tried simply probing at intervals (say, every 1000 samples), but that led to occasional anomalies. Then I switched to finding the mean of an entire second's worth of audio. Finally, I switched to finding the median amplitude, as this gave me an even more accurate reading.
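One likely reason the median reads more cleanly (an assumption on my part, not something measured here) is that a single click or pop in an otherwise quiet window drags the mean up a lot, while the median barely moves. A small sketch with made-up numbers, using a hypothetical `WindowStats` helper:

```csharp
using System;
using System.Linq;

public static class WindowStats
{
    // Arithmetic mean of the amplitudes in one window.
    public static double Mean(int[] amplitudes) => amplitudes.Average();

    // Median of the amplitudes in one window. For an even number of
    // values this returns the upper of the two middle values.
    public static int Median(int[] amplitudes)
    {
        int[] sorted = (int[])amplitudes.Clone(); // leave the caller's array untouched
        Array.Sort(sorted);
        return sorted[sorted.Length / 2];
    }
}
```

With the illustrative window `{ 20, 22, 25, 21, 30000 }` (near silence plus one loud click), `Mean` returns 6017.6 while `Median` returns 22.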
using System;
using System.Collections.Generic;
using System.IO;

class Program
{
    const int sampleRate = 48000;
    const int channels = 2;
    const int bytesPerSingleChannelSample = 2;

    static void Main(string[] args)
    {
        byte[] data = File.ReadAllBytes(args[0]);
        // Assumes the canonical 44-byte PCM wav header; a file with extra chunks needs real header parsing
        int head = 44;
        int sampleCount = 1;
        List<int> sampleBuffer = new List<int>();
        // Stop after reading 5 Megs of data - that is plenty. Each sample needs 2 bytes, hence Length - 2.
        while (head <= (data.Length - 2) && head < 5000000)
        {
            // Widen to int before Math.Abs; Math.Abs on a short of -32768 throws an OverflowException
            int sampleAmplitude = Math.Abs((int)BitConverter.ToInt16(data, head));
            sampleBuffer.Add(sampleAmplitude);
            // After one second of samples is collected in the buffer, print out their median amplitude and then clear the buffer
            if (sampleBuffer.Count >= sampleRate)
            {
                Console.WriteLine(SamplesToSeconds(sampleCount).ToString("0") + ": " + GetMedian(sampleBuffer).ToString());
                sampleBuffer.Clear();
            }
            sampleCount++;
            // Advance the reading head to the next sample, skipping the second channel if it exists.
            // We only need to check the left channel of the stereo to simplify things
            head = head + (bytesPerSingleChannelSample * channels);
        }
    }

    public static int GetMedian(List<int> ints)
    {
        int[] temp = ints.ToArray();
        Array.Sort(temp);
        int middleIndex = temp.Length / 2; // Integer division already rounds down
        return temp[middleIndex];
    }

    public static float SamplesToSeconds(int samples)
    {
        // Cast first so this is a true floating-point division
        return (float)samples / sampleRate;
    }
}
The output of the program for the same wav file is shown below.

You can easily see the cut to silence as the median amplitude drops from 6000+ to only 22 (practically dead silence) at the 17-second mark. A few seconds later, the video begins and some intro music fades in and values go back up.
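Spotting that drop by eye is easy; doing it automatically could look something like the sketch below. It scans the per-second median values for the first quiet second that follows at least one loud one. The `SilenceFinder` name and the threshold value are assumptions for illustration, and a real threshold would need tuning against actual recordings.

```csharp
using System;
using System.Collections.Generic;

public static class SilenceFinder
{
    // Returns the 0-based index of the first window whose median amplitude
    // is below `threshold` after at least one window at or above it.
    // Returns -1 if no such window exists.
    public static int FirstSilentSecond(IReadOnlyList<int> medians, int threshold)
    {
        bool sawTone = false;
        for (int i = 0; i < medians.Count; i++)
        {
            if (medians[i] >= threshold)
            {
                sawTone = true;   // still inside the tone (or other loud audio)
            }
            else if (sawTone)
            {
                return i;         // first quiet window after the tone
            }
        }
        return -1;
    }
}
```

With made-up medians of `{ 6100, 6050, 6120, 22, 18, 900 }` and a threshold of 100, the method returns 3, the first quiet window after the tone. Requiring a loud window first keeps a recording that happens to open with a moment of silence from being flagged immediately.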
Even though I’m wasting memory by reading in the entire large (500 meg) raw audio file, the application still takes only a couple of seconds to run. At this point, I can write zeros (silence) back over everything from the beginning of the file to the location of the first second of silence. Additionally, when muxing the file back into the MP4 with ffmpeg, I can trim extra material from the front of the recording to make the length of the title card consistent. Scripting all of that is a job for tomorrow though.
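The zero-writing step might look roughly like this. It is only a sketch of the idea, not the finished script: `ToneEraser` and its parameters are hypothetical names, and it assumes the sample data starts right after a fixed-size header and that the silent second has already been located.

```csharp
using System;

public static class ToneEraser
{
    // Overwrites the sample data with zeros (silence) from the end of the
    // header up to, but not including, the given second. The header bytes
    // themselves are left untouched.
    public static void SilenceUntil(byte[] wav, int headerBytes, int seconds,
                                    int sampleRate, int channels, int bytesPerSample)
    {
        long bytesPerSecond = (long)sampleRate * channels * bytesPerSample;

        // Never run past the end of the buffer.
        long end = Math.Min(wav.Length, headerBytes + bytesPerSecond * seconds);

        if (end > headerBytes)
        {
            Array.Clear(wav, headerBytes, (int)(end - headerBytes));
        }
    }
}
```

For 48 kHz, 16-bit stereo audio, `SilenceUntil(data, 44, 17, 48000, 2, 2)` would zero out 17 × 192,000 bytes after the header, covering both channels. The modified array could then be written back out with `File.WriteAllBytes`.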
For anyone looking to understand the PCM wav format, this is probably the easiest walkthrough to read.