The Case of the Extra 40ms

Last modified on December 16, 2020

Netflix Technology Blog

By: John Blair, Netflix Partner Engineering

The Netflix application runs on hundreds of smart TVs, streaming sticks and pay TV set top boxes. The role of a Partner Engineer at Netflix is to help device manufacturers launch the Netflix application on their devices. In this article we talk about one particularly difficult issue that blocked the launch of a device in Europe.

Towards the end of 2017, I was on a conference call to discuss an issue with the Netflix application on a new set top box. The box was a new Android TV device with 4k playback, based on Android Open Source Project (AOSP) version 5.0, aka “Lollipop”. I had been at Netflix for a few years, and had shipped multiple devices, but this was my first Android TV device.

All four players involved in the device were on the call: there was the large European pay TV company (the operator) launching the device, the contractor integrating the set-top-box firmware (the integrator), the system-on-a-chip provider (the chip vendor), and myself (Netflix).

The integrator and Netflix had already completed the rigorous Netflix certification process, but during the TV operator’s internal trial an executive at the company reported a serious issue: Netflix playback on his device was “stuttering”, i.e. video would play for a very short time, then pause, then start again, then pause. It didn’t happen all the time, but would reliably start to happen within a few days of powering on the box. They supplied a video and it looked terrible.

The device integrator had found a way to reproduce the problem: repeatedly start Netflix, start playback, then return to the device UI. They supplied a script to automate the process. Sometimes it took as long as five minutes, but the script would always reliably reproduce the bug.

Meanwhile, a field engineer for the chip vendor had diagnosed the root cause: Netflix’s Android TV application, called Ninja, was not delivering audio data quickly enough. The stuttering was caused by buffer starvation in the device audio pipeline. Playback stopped when the decoder waited for Ninja to deliver more of the audio stream, then resumed once more data arrived. The integrator, the chip vendor and the operator all thought the issue was identified and their message to me was clear: Netflix, you have a bug in your application, and you need to fix it. I could hear the stress in the voices from the operator. Their device was late and running over budget and they expected results from me.

I was skeptical. The same Ninja application runs on millions of Android TV devices, including smart TVs and other set top boxes. If there was a bug in Ninja, why was it only happening on this device?

I started by reproducing the issue myself using the script provided by the integrator. I contacted my counterpart at the chip vendor and asked if he’d seen anything like this before (he hadn’t). Next I started reading the Ninja source code. I wanted to find the precise code that delivers the audio data. I recognized a lot, but I started to lose the plot in the playback code and I needed help.

I walked upstairs and found the engineer who wrote the audio and video pipeline in Ninja, and he gave me a guided tour of the code. I spent some quality time with the source code myself to understand its working parts, adding my own logging to confirm my understanding. The Netflix application is complex, but at its simplest it streams data from a Netflix server, buffers several seconds worth of video and audio data on the device, then delivers video and audio frames one-at-a-time to the device’s playback hardware.

A diagram showing content downloaded to a device into a streaming buffer, then copied into the device decode buffer.

Figure 1: Device Playback Pipeline (simplified)

Let’s take a moment to talk about the audio/video pipeline in the Netflix application. Everything up until the “decoder buffer” is the same on every set top box and smart TV, but moving the A/V data into the device’s decoder buffer is a device-specific routine running in its own thread. This routine’s job is to keep the decoder buffer full by calling a Netflix provided API which provides the next frame of audio or video data. In Ninja, this job is performed by an Android Thread. There is a simple state machine and some logic to handle different play states, but under normal playback the thread copies one frame of data into the Android playback API, then tells the thread scheduler to wait 15 ms and invoke the handler again. When you create an Android thread, you can request that the thread be run repeatedly, as if in a loop, but it is the Android Thread scheduler that calls the handler, not your own application.
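The pattern described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Ninja's code: `Scheduler`, `make_handler` and the frame names are hypothetical, and time is simulated rather than measured. The point it illustrates is that the handler only copies one frame and requests a delay; when the next call actually happens is up to the scheduler.

```python
from collections import deque

WAIT_MS = 15  # delay the handler requests between invocations (from the article)


class Scheduler:
    """Hypothetical stand-in for the platform scheduler.

    It, not the handler, decides when the next invocation happens.
    This toy version honors the requested delay exactly.
    """

    def __init__(self):
        self.now_ms = 0
        self.calls = []  # simulated timestamps at which the handler was invoked

    def run(self, handler, iterations):
        for _ in range(iterations):
            self.calls.append(self.now_ms)
            requested_delay = handler()
            self.now_ms += requested_delay


def make_handler(streaming_buffer, decoder_buffer):
    """Build a handler that moves one frame per invocation."""

    def handler():
        if streaming_buffer:
            decoder_buffer.append(streaming_buffer.popleft())
        return WAIT_MS  # ask to be invoked again after 15 ms

    return handler


streaming = deque(f"frame-{i}" for i in range(5))
decoder = []
sched = Scheduler()
sched.run(make_handler(streaming, decoder), iterations=5)
print(sched.calls)
print(decoder)
```

Because the scheduler owns the clock in this model, any extra delay it introduced would show up as wider gaps in `sched.calls` without any change to the handler itself.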

To play a 60fps video, the highest frame rate available in the Netflix catalog, the device must render a new frame every 16.66 ms, so checking for a new sample every 15 ms is just fast enough to stay ahead of any video stream Netflix can provide. Because the integrator had identified the audio stream as the problem, I zeroed in on the specific thread handler that was delivering audio samples to the Android audio service.
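The timing budget is simple arithmetic, and a short check makes the margin explicit. The function name and constants below are illustrative only; the 60 fps and 15 ms figures come from the text.

```python
def frame_interval_ms(fps):
    """Time between frames, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


POLL_MS = 15  # how often the handler asks to be invoked (from the article)

interval = frame_interval_ms(60)
margin = interval - POLL_MS

print(f"frame interval: {interval:.2f} ms")
print(f"margin per poll: {margin:.2f} ms")
print("poll keeps up:", POLL_MS < interval)
```

The exact interval is 1000/60 = 16.666... ms, which the article writes as 16.66 ms. Either way the margin is under 2 ms per invocation, so the scheme has little slack for any additional delay between calls.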

I wanted to answer this question: where is the extra time? I assumed some function invoked by the handler would be the culprit, so I sprinkled log messages throughout the handler, assuming the guilty code would be apparent. What was soon apparent was that there was nothing in the handler that was misbehaving, and the handler was running in a few milliseconds even when playback was stuttering.

In the end, I focused on three numbers: the rate of data transfer, the time when the handler was invoked and the time when the handler passed control back to Android. I wrote a script to parse the log output, and made the graph below which gave me the answer.
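The article does not show the script or the log format, so the following is only a sketch of what such a parser might look like. The log line format (`t=<ms> handler enter` / `t=<ms> handler exit bytes=<n>`), the function names and the sample data are all assumptions made for illustration.

```python
import re

# Hypothetical log format; the real Ninja log lines are not shown in the article.
LINE_RE = re.compile(r"t=(\d+) handler (enter|exit)(?: bytes=(\d+))?")


def parse_log(lines):
    """Pair each 'enter' line with the following 'exit' line."""
    records = []
    enter_ms = None
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # ignore unrelated log output
        t = int(match.group(1))
        if match.group(2) == "enter":
            enter_ms = t
        elif enter_ms is not None:
            records.append(
                {"enter_ms": enter_ms, "exit_ms": t, "bytes": int(match.group(3) or 0)}
            )
            enter_ms = None
    return records


def summarize(records):
    """Derive per-call handler time, time between invocations, and throughput."""
    rows = []
    for prev, cur in zip(records, records[1:]):
        interval = cur["enter_ms"] - prev["enter_ms"]
        rows.append(
            {
                "handler_ms": cur["exit_ms"] - cur["enter_ms"],
                "interval_ms": interval,
                "bytes_per_ms": cur["bytes"] / interval if interval > 0 else 0.0,
            }
        )
    return rows


sample = [
    "t=0 handler enter",
    "t=2 handler exit bytes=675",
    "unrelated log line",
    "t=15 handler enter",
    "t=17 handler exit bytes=675",
    "t=30 handler enter",
    "t=32 handler exit bytes=675",
]
rows = summarize(parse_log(sample))
for row in rows:
    print(row)
```

Keeping the time spent inside the handler separate from the time between invocations is the useful part of this design: the two can then be plotted as separate series alongside throughput.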

A graph showing time spent in the thread handler and audio data throughput.

Figure 2: Visualizing Audio Throughput and Thread Handler Timing

The orange line is the rate at which data moved from the streaming buffer into the Android audio system, in bytes/millisecond. You can see three distinct behaviors in this chart:

  1. The two tall, spiky parts where the data rate reaches 500 bytes/ms. This phase is buffering, before playback starts. The handler is copying data as fast as it can.
  2. The region in the middle is normal playback. Audio data is moved at about 45 bytes/ms.
  3. The stuttering region is on the right, when audio data is moving at closer to 10 bytes/ms. This is not fast enough to maintain playback.
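The three regions can be told apart by throughput alone. Here is a minimal sketch that labels a sample relative to the normal playback rate; the 45 bytes/ms figure is from the chart description, while the `classify` function and its 50% tolerance band are assumptions chosen only for illustration.

```python
NORMAL_RATE = 45.0  # bytes/ms during normal playback (from the article)


def classify(rate_bytes_per_ms, normal=NORMAL_RATE, tolerance=0.5):
    """Label a throughput sample; the tolerance band is an illustrative assumption."""
    if rate_bytes_per_ms > normal * (1 + tolerance):
        return "buffering"
    if rate_bytes_per_ms < normal * (1 - tolerance):
        return "stuttering"
    return "normal"


for rate in (500, 45, 10):
    share = rate / NORMAL_RATE
    print(f"{rate} bytes/ms: {classify(rate)} ({share:.0%} of the normal rate)")
```

At roughly 10 bytes/ms the handler is delivering less than a quarter of the normal playback rate, which is consistent with the decoder running out of audio data.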

The unavoidable conclusion: the orange line confirms what the chip vendor’s engineer reported. Ninja is not delivering audio data quickly enough.

To understand why, let’s see what story the yellow and gray lines tell.

The yellow line shows the time spent in the handler routine itself, calculated from timestamps recorded at the top and the bottom of the handler. In both the normal and stuttering playback regions, the time spent in the handler was the same: about 2 ms. The spikes show instances when the runtime was slower due to time spent on other tasks on the device.
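Recording timestamps at the top and bottom of a routine is easy to sketch. The wrapper below is a generic illustration rather than the logging actually added to Ninja; `timed` and `deliver_audio_frame` are hypothetical names.

```python
import time


def timed(handler, log):
    """Wrap handler so each call appends (enter_ms, exit_ms, elapsed_ms) to log."""

    def wrapper(*args, **kwargs):
        enter = time.monotonic()  # timestamp at the top of the handler
        try:
            return handler(*args, **kwargs)
        finally:
            exit_ = time.monotonic()  # timestamp at the bottom of the handler
            log.append((enter * 1000.0, exit_ * 1000.0, (exit_ - enter) * 1000.0))

    return wrapper


def deliver_audio_frame():
    # Hypothetical stand-in for the real work of copying one audio frame.
    return sum(range(1000))


log = []
timed_handler = timed(deliver_audio_frame, log)
timed_handler()
enter_ms, exit_ms, elapsed_ms = log[0]
print(f"handler took {elapsed_ms:.3f} ms")
```

A monotonic clock is used so that the measured duration cannot go negative if the wall clock is adjusted. Note that this measures only the time inside the handler, not the gap between one invocation and the next.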

The gray line, the time between calls invoki
