We're Sorry: Schoology Site Performance

Contributed By

Jeremy Friedman

Founder and CEO at Schoology

Posted in Schoology | October 01, 2014

At Schoology, we provide millions of users all over the world with simple and efficient ways to enhance teaching and learning. Over the past few years we’ve grown from a name people had barely heard of and often mispronounced, to a leading learning platform, still often mispronounced. Today nearly 7 million people rely on Schoology, so it’s imperative that our cloud service is always available and reliable.

The last extended outage we suffered was in 2012. It lasted only a few hours during peak time at the start of the school year, but it felt like an eternity. After that incident we added resources, improved our processes, monitoring, and alerting, and our uptime has averaged 99.9% ever since (roughly 43 minutes of downtime per month).

We understand that all of you—faculty, students, and parents—place an enormous amount of trust in us, and we could not be more grateful for your support. That said, over the past two days we experienced our worst site performance in Schoology history, and for that, we sincerely apologize.

We strive to always be transparent, but on Tuesday a perfect storm of events occurred and our status page (hosted on a completely separate network) also failed. We take full responsibility for the issues that followed and the inevitable frustration that you felt.

In addition to offering our sincerest apologies, I’d like to address what happened, what we’ve done in the short-term to resolve issues, and what we’re doing in the longer term to prevent issues like this from happening ever again.

The Short Version

This past weekend, our hosting providers made emergency upgrades to their networks in response to a security vulnerability that was recently discovered. We worked through the night on Sunday into Monday to monitor and maintain stability. On Monday, we noticed major performance issues that we believed were caused by the emergency maintenance.

To restore the site, we expanded our server capacity; however, this masked a larger problem with our content delivery network. During peak time on Tuesday the underlying issue became apparent and escalated to the point where users were again unable to access Schoology. Coincidentally, a separate issue arose with the way we connect to our file servers, which prevented users from uploading or downloading files and kept a number of internal services from working properly. We were able to resolve all of these issues and are working hard to prevent them from happening again.

Our status page, which is hosted on completely separate infrastructure, was ill-prepared for the sustained volume of requests it received during this unprecedented period of site inaccessibility. We have since made the status page more resilient by putting it behind a CDN, which allows it to scale quickly to handle large spikes in traffic, although we hope this is an improvement that will never be needed.

The Long Version (Warning: Lots of Technical Details and Jargon)

There were two primary issues that led to the sustained downtime:

  • Akamai was not caching any of our assets.
  • The token handshake necessary for us to upload and download files from our file system on Rackspace began to time out and block our web servers from serving other requests.

The root cause of the first issue was a web server upgrade we completed more than a week ago, when we switched from Apache to nginx, which manages connections more efficiently. We did not notice the issue during our initial rollout and testing because assets already cached by Akamai were still valid for up to 7 days. For this reason, the rollout appeared successful for almost a week before we started experiencing symptoms.
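As an illustration of the kind of check that would have caught this sooner, here is a minimal sketch (not our actual tooling) that fetches a bundle and verifies the response advertises a lifetime a CDN such as Akamai can honor. The bundle URL, the expected lifetime, and the assumption that the regression is visible in the Cache-Control header are all placeholders for illustration.

    import urllib.request

    # Illustrative only: the bundle URL and expected lifetime are placeholders,
    # not our production configuration.
    BUNDLE_URL = "https://example.schoology.com/bundles/site.min.css"
    MIN_MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # expect roughly a week of cacheability

    def bundle_is_cacheable(url: str) -> bool:
        """Fetch a CSS/JS bundle and verify it advertises a cacheable lifetime."""
        with urllib.request.urlopen(url, timeout=10) as response:
            cache_control = response.headers.get("Cache-Control", "")
        # A CDN will generally not cache responses marked private/no-store,
        # or responses that advertise no max-age at all.
        if "no-store" in cache_control or "private" in cache_control:
            return False
        for directive in cache_control.split(","):
            directive = directive.strip()
            if directive.startswith("max-age="):
                return int(directive.split("=", 1)[1]) >= MIN_MAX_AGE_SECONDS
        return False

    if __name__ == "__main__":
        if not bundle_is_cacheable(BUNDLE_URL):
            raise SystemExit("bundle is no longer CDN-cacheable -- alert on this")

A check like this, run after every deployment, would have flagged the regression on day one instead of waiting for the cached copies to expire.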

The root cause of the second issue has yet to be determined, but we are continuing to work with Rackspace for further information. In the meantime, we have a solution in place to prevent this issue from impacting performance in the future.
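As a rough illustration of this kind of resilience (a simplified sketch, not our production code): put a hard timeout on the token handshake and fall back to a recently cached token, so a slow or failed response from the storage API cannot tie up web server processes. The endpoint, credentials, payload, and timeout values below are placeholders.

    import json
    import time
    import urllib.request

    # Placeholders for illustration only -- not our real endpoint or credentials.
    AUTH_URL = "https://auth.example.com/tokens"
    AUTH_PAYLOAD = {"username": "example", "api_key": "example"}
    HANDSHAKE_TIMEOUT = 2.0   # seconds: fail fast instead of blocking a web worker
    TOKEN_TTL = 15 * 60       # reuse a token for 15 minutes before refreshing

    _cached_token = None
    _cached_at = 0.0

    def get_storage_token() -> str:
        """Return an auth token, preferring a cached one so a slow or failed
        handshake cannot block the request that needs it."""
        global _cached_token, _cached_at
        if _cached_token and time.time() - _cached_at < TOKEN_TTL:
            return _cached_token
        request = urllib.request.Request(
            AUTH_URL,
            data=json.dumps(AUTH_PAYLOAD).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        try:
            with urllib.request.urlopen(request, timeout=HANDSHAKE_TIMEOUT) as response:
                _cached_token = json.load(response)["token"]
                _cached_at = time.time()
        except (OSError, ValueError, KeyError):
            if _cached_token is None:
                raise  # no token at all: surface the error to the caller
            # Otherwise fall back to the stale token; the next call retries the handshake.
        return _cached_token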

Below is a timeline of events and actions:

Sunday, Sept. 28
Rackspace and Amazon Web Services, our hosting providers, had to make an emergency upgrade to all servers on their networks to address a recently discovered security vulnerability, XSA-108 (http://xenbits.xen.org/xsa/).

Monday, Sept. 29
We experienced a networking disruption from 8:57 AM until 11:38 AM EDT (UTC-4), during which time users reported slow page loads and occasional timeouts.

  • Our team initially attributed this to caches that had not yet been fully repopulated after the reboots, which caused one of our JavaScript/CSS caching services to saturate its network interface.
  • As a note, in order to optimize page loads, we aggregate and bundle JavaScript and CSS into a single minified file, which reduces the number of network requests and allows easy browser caching (a rough sketch of this technique appears just after this list).
  • In an attempt to alleviate the issue, CSS/JS aggregation was disabled; however, this caused our firewall cluster to exceed the preset limits we keep in place as a precaution (traffic was more than double the normal amount), which prevented us from deploying further code fixes.
  • This event came directly after emergency reboots related to the server vulnerability (XSA-108) that affected a large number of cloud providers, including Rackspace and Amazon Web Services. Increasing the size of that caching service allowed the site to operate normally. 
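As referenced above, here is a rough sketch of the aggregation technique: many small CSS or JS files are concatenated into a single bundle whose filename embeds a hash of its contents, so browsers and the CDN can cache it aggressively and a new name appears automatically whenever the contents change. This is an illustration of the general idea, not our actual build code, and minification is omitted for brevity.

    import hashlib
    from pathlib import Path

    def build_bundle(source_dir: str, pattern: str, output_dir: str) -> Path:
        """Concatenate all files matching `pattern` into a single bundle whose
        name embeds a hash of its contents, so it can be cached indefinitely."""
        sources = sorted(Path(source_dir).glob(pattern))
        combined = "\n".join(p.read_text(encoding="utf-8") for p in sources)
        digest = hashlib.sha1(combined.encode("utf-8")).hexdigest()[:12]
        suffix = sources[0].suffix if sources else ".css"
        out = Path(output_dir) / f"bundle-{digest}{suffix}"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(combined, encoding="utf-8")
        return out

    # Example: build_bundle("static/css", "*.css", "static/bundles") produces
    # something like static/bundles/bundle-3f9c2a7d81be.css, which a page can
    # reference instead of dozens of individual files.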

Tuesday, Sept. 30
Schoology experienced a similar networking disruption from 8:50 AM until 3:25 PM EDT (UTC-4).

  • The firewall cluster was once again nearing its limit; however, unlike the previous day, the caching cluster was handling the traffic without any problems.
  • Rackspace reported that the large increase in traffic was affecting other customers, so they rate-limited our networking infrastructure to about 10% of its normal capacity, leading to further page load failures.
  • At the same time, the token handshake necessary for us to upload and download files from our file system began to time out and block our web servers from serving other requests.
  • Uploads and downloads were temporarily disabled to allow other requests to load properly; however, this provided little relief because our networking infrastructure had been rate-limited at around the same time.
  • The product and engineering team looked into increasing the efficiency of the local browser cache in an attempt to decrease the amount of network traffic between our servers and our clients.
  • After speaking with our CDN provider, Akamai, we determined that the CSS/JS files and bundles our servers deliver to the CDN were not being cached properly, which explained the large increase in traffic over the previous two days. The regression was introduced by our web server switchover from Apache to nginx the previous week, and it surfaced only when the JS/CSS bundles already cached on the CDN began to expire over the weekend.
  • We deployed a fix to restore proper caching, and pages began loading normally at around 3:25 PM.
  • We continue to see issues with file uploads and downloads that we are actively working with Rackspace to resolve.
  • In the meantime, we have a workaround in place that allows users to resume file-related operations. While the exact cause of the issue has yet to be determined, we have also built contingency plans to ensure that if the problem arises again, there will be little to no impact on our users.

Short-Term Actions

In the immediate term, we worked to identify and resolve all outstanding issues, along with any residual issues related to them.

  • Content Caching: We deployed changes so that CSS/JS bundles are once again cached properly by the CDN.
  • File System Token Issues: For the Rackspace file system issue, we deployed changes that add resilience around the token handshake so that it cannot impact site performance in the future.
  • Status Page Performance: We have made the status page more resilient by putting it behind a CDN, which allows it to scale quickly to handle large spikes in traffic.

Long-Term Actions

Our engineering team has developed a list of improvements and enhancements to investigate and implement across our backend services to further increase availability, ranging from code changes, to new monitors and alerts, to additional hardware. We have an incredibly talented team focused on building a best-in-class infrastructure.

Will It Happen Again?

Our team remains focused and dedicated to providing you with the best possible experience. I wish I could guarantee that we will never have an issue again, but that is simply not possible for any company. What I can guarantee is that the specific issues that occurred over the past two days will not be able to impact us again.

Additionally, the changes and enhancements we have put in place over the last two days will help prevent other issues from occurring and will make our platform more resilient if a new one arises. Even with the best team and the best hardware, a problem that affects site performance will inevitably arise at some point. That said, we will work constantly to stay as close to 100% uptime as possible, and when issues do occur we will do our best to keep everyone informed immediately and resolve them as quickly as possible.

Once again, I apologize to all of our users who experienced slow page loads, timeouts, or data delays. I know that it’s not acceptable, and so does our team. We are committed to being the reliable service that you have grown to love over the past few years, and for those of you who are new to Schoology, we look forward to becoming the system you love.

Thank you all for your support! Please feel free to reach out if you have any questions, concerns, or suggestions.

Sincerely,

Jeremy Friedman
CEO, Schoology
