Surviving Twitch Plays Pokemon: Adventures in Twitch Chat Engineering [TwitchCon 2016 Panel]

For the complete list of panels, check out the main TwitchCon 2016 panel recap post.

VoD link: https://www.twitch.tv/twitchconfrankerz/v/92636123?t=03h13m46s
FrankerZ Theater Day Three
Moderator: IrelandofEndor

  • Some quick facts about the state of Twitch chat in February 2014:
    • Over 800k concurrent users
    • Tens of billions of daily messages
    • ~10 chat servers
    • 2 engineers
    • 19 AMP Energy drinks per day
  • What are chat servers?
    • Many machines, each one running many programs
    • Different kinds of programs that do different things
    • Chat servers/programs talk to each other; this is the hard part!
  • So a message is not simply sent to one server that then displays it; it passes through multiple servers before it is displayed.
  • Public reaction to Twitch Plays Pokemon was mainly positive; media outlets described it as ‘mesmerizing’, ‘miraculous’ and ‘beautiful chaos’.
  • A quote from the BBC: “Some users are hailing the development as part of internet history.”
  • Twitch Plays Pokemon timeline:
    • TPP hits 5k concurrent viewers
    • TPP hits 20k
    • Twitch takes preemptive action
    • TPP is moved onto an event cluster
    • TPP hits 50k; this is where the problems start. Messages are not being delivered
    • Twitch chat engineers are rushed to the office and start investigating
  • Chat servers are organised in several clusters. If one cluster fails, the others can pick up its load. You can also give a cluster a specific purpose (for instance: use one cluster for a League of Legends tournament).
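The cluster idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cluster names, the `DEDICATED` mapping and the `pick_cluster` helper are hypothetical and were not shown in the talk. It shows a channel pinned to a dedicated cluster, with the remaining healthy clusters picking up the work if that cluster is down.

```python
import zlib

# Hypothetical cluster names; the talk does not list the real ones.
CLUSTERS = ["main-1", "main-2", "event"]

# Channels pinned to a dedicated cluster (illustrative mapping only).
DEDICATED = {"twitchplayspokemon": "event"}


def pick_cluster(channel, healthy):
    """Return the cluster that should serve `channel`.

    `healthy` is the set of cluster names that are currently up.
    """
    pinned = DEDICATED.get(channel)
    if pinned in healthy:
        return pinned
    # Shared clusters: healthy ones that are not reserved for pinned channels.
    reserved = set(DEDICATED.values())
    shared = [c for c in CLUSTERS if c in healthy and c not in reserved]
    # If only reserved clusters are left, fall back to any healthy cluster.
    candidates = shared or [c for c in CLUSTERS if c in healthy]
    if not candidates:
        raise RuntimeError("no healthy chat cluster available")
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return candidates[zlib.crc32(channel.encode("utf-8")) % len(candidates)]
```

With all clusters healthy, `pick_cluster("twitchplayspokemon", {"main-1", "main-2", "event"})` returns the pinned `"event"` cluster; if `"event"` is removed from the healthy set, the channel lands on one of the shared clusters instead.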
  • A couple of tools that were being used to debug the problems:
    • Server logs: text file that describes ongoing behavior of a program
    • Dashboards: graphs that show the health of the servers and how they behave over time
    • Linux command line: programs that allow you to dig deep into the state of a machine, process log files etc.
  • Debugging principle: Start investigating the software closest to the user.
  • The server all the users connect to is called the Edge server. It’s called the Edge server because it connects the internet to the internal services of Twitch. It sends and receives messages to and from users.
  • User info is stored in a database. A database runs on hard drives, which makes its storage capacity large but its access slow. Twitch also uses cache servers to store info temporarily. These servers keep data in memory, which makes them very quick but unsuitable for storing data for long periods of time.
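The database-plus-cache split described above is commonly implemented as a "cache-aside" read. The sketch below is a minimal illustration, not Twitch's implementation: `TTLCache`, `load_user` and the `db_lookup` callback are hypothetical names, and the expiry time stands in for the "temporary" nature of cached data.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # stale entry: drop it
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self.clock() + self.ttl, value)


def load_user(user_id, cache, db_lookup):
    """Cache-aside read: try fast memory first, fall back to the slow database."""
    user = cache.get(user_id)
    if user is None:
        user = db_lookup(user_id)  # slow path (hypothetical database call)
        cache.set(user_id, user)
    return user
```

Repeated calls to `load_user` for the same user only reach the database once per expiry window, which is what keeps the slow storage out of the hot path.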
  • After a lot of digging in server logs they found that a cache server was not being used optimally. It had 16 CPUs but only one of them was being used. The solution was to run a cache for every single CPU and distribute the info across them. It worked but did not have as big an impact as was hoped.
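Spreading data over one cache per CPU boils down to routing each key to a fixed shard. The sketch below assumes a simple hash-modulo scheme; the talk does not say how keys were actually distributed, and `ShardedCache` is a hypothetical name. Plain dictionaries stand in for what would be separate cache processes in a real deployment.

```python
import zlib


class ShardedCache:
    """Splits keys across `num_shards` independent caches (one per CPU, say)."""

    def __init__(self, num_shards):
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        # Each dict stands in for a separate cache process.
        self.shards = [{} for _ in range(num_shards)]

    def shard_index(self, key):
        # A stable hash, so the same key always lands on the same shard.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def get(self, key):
        return self.shards[self.shard_index(key)].get(key)

    def set(self, key, value):
        self.shards[self.shard_index(key)][key] = value


cache = ShardedCache(num_shards=16)
for n in range(1000):
    cache.set(f"user{n}", n)
```

Because every key maps to exactly one shard, the shards never need to coordinate with each other, which is what lets each one run on its own CPU.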
  • So one problem was fixed, but it was not the only one, so the search continued.
  • Next, Linux commands showed that many connections were being opened but never used. The test chat server turned out to be responsible for opening all these connections. Once that server was turned off, users could connect to chat and send messages again. Problem fixed!
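The recap does not say which Linux commands were used. As one possible approach, the sketch below counts TCP connections per remote host from text in the column layout printed by `ss -tan` (State, Recv-Q, Send-Q, local address, peer address). The `SAMPLE` data and the `connections_per_peer` helper are made up for illustration.

```python
from collections import Counter

# Made-up sample in the column layout of `ss -tan`; not real Twitch data.
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB  0      0      10.0.0.5:6667      10.0.0.9:51234
ESTAB  0      0      10.0.0.5:6667      10.0.0.9:51235
ESTAB  0      0      10.0.0.5:6667      10.0.0.7:40000
LISTEN 0      128    0.0.0.0:6667       0.0.0.0:*
"""


def connections_per_peer(ss_output):
    """Count connections per (peer host, TCP state) pair."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "State":
            continue  # skip the header and malformed lines
        state, peer = fields[0], fields[4]
        host = peer.rsplit(":", 1)[0]  # strip the port
        counts[(host, state)] += 1
    return counts


top = connections_per_peer(SAMPLE).most_common(1)
```

On a live machine the same function could be fed `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`. A single host holding an unusually large share of the connections is the kind of signal that pointed to the test chat server.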
  • The lessons they learned from this scenario are:
    • Better logging/instrumentation needed to be added to make debugging easier
    • Generate fake traffic to force the servers to handle more load than real usage requires
    • Be better at using diagnostic programs and keep knowledge up to date
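The fake-traffic lesson can be illustrated with a tiny in-process load test. This is a toy sketch, not the tooling Twitch built: `fake_client`, `chat_server` and `run_load_test` are hypothetical names, and a bounded `asyncio.Queue` stands in for a chat server with limited capacity. The counters double as very basic instrumentation.

```python
import asyncio


async def fake_client(client_id, queue, messages, stats):
    """Send `messages` fake chat lines; count the ones the server cannot accept."""
    for n in range(messages):
        try:
            queue.put_nowait(f"client{client_id}: msg {n}")
            stats["sent"] += 1
        except asyncio.QueueFull:
            stats["dropped"] += 1
        await asyncio.sleep(0)  # yield so other clients and the server can run


async def chat_server(queue, stats):
    """Stand-in for a chat server: take messages off the queue and count them."""
    while True:
        await queue.get()
        stats["delivered"] += 1
        queue.task_done()


async def run_load_test(clients, messages_each, queue_size):
    stats = {"sent": 0, "dropped": 0, "delivered": 0}
    queue = asyncio.Queue(maxsize=queue_size)
    server = asyncio.create_task(chat_server(queue, stats))
    await asyncio.gather(
        *(fake_client(i, queue, messages_each, stats) for i in range(clients))
    )
    await queue.join()  # wait until every accepted message has been handled
    server.cancel()
    await asyncio.gather(server, return_exceptions=True)
    return stats


stats = asyncio.run(run_load_test(clients=50, messages_each=20, queue_size=10))
```

Raising `clients` or shrinking `queue_size` pushes the simulated server past its capacity, and the `dropped` counter shows how many messages were lost along the way.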
  • What have they done since:
    • Started using Amazon servers
    • Moved to a better infrastructure
    • Changed the chat code from Python to Go
    • Improved reliability by changing the architecture
