By Rodney Ellis, Synthesis Service Manager
With Black Friday around the corner, most, if not all, e-commerce businesses and retailers will feel anxiety and dread in the weeks leading up to the day, asking: “Will the site or systems stay up? Will it perform under the additional load?” and “If it goes down, how fast can we recover?”
Where to start?
We know that things can go wrong and, at some point, will go wrong. What can be done to prevent this from affecting customer experience and, at the end of the day, sales and business confidence?
A good monitoring foundation and accurate metric thresholds need to be in place. By monitoring the critical metrics, a lot of the potential bottlenecks and issues can be identified before they become a problem. For example, unchecked problems with database usage, CPU/memory utilisation, disk/network performance or endpoint latency, or errors surfacing in application logs, can have a cascading effect that eventually causes downtime.
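As a rough illustration of what "accurate metric thresholds" can mean in practice, the sketch below classifies a sample of metrics as ok, warning or critical. The metric names and threshold values are hypothetical placeholders, not recommendations; real values depend on your own baseline load.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Hypothetical metric names and limits, chosen only for illustration.
THRESHOLDS: Dict[str, Threshold] = {
    "cpu_percent": Threshold(warning=70.0, critical=90.0),
    "memory_percent": Threshold(warning=75.0, critical=90.0),
    "disk_iops_percent": Threshold(warning=70.0, critical=85.0),
    "endpoint_latency_ms": Threshold(warning=500.0, critical=2000.0),
}


def evaluate(sample: Dict[str, float]) -> Dict[str, str]:
    """Return a severity for every metric in the sample that has a threshold."""
    severities: Dict[str, str] = {}
    for name, value in sample.items():
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # metrics without a configured threshold are ignored
        if value >= limit.critical:
            severities[name] = "critical"
        elif value >= limit.warning:
            severities[name] = "warning"
        else:
            severities[name] = "ok"
    return severities
```

A warning level below the critical level gives the team time to react before a bottleneck turns into an outage.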
A good architecture design and decoupling of services are needed to decrease the blast radius of outages, be they from infrastructure or application-related issues.
Using multiple availability zones or data centres for application hosting, with automatic failover, decreases the risk of an infrastructure outage affecting availability, and converting services to microservices orchestrated by Kubernetes can increase service availability.
Partnering with a cloud provider like AWS solves several of the potential issues that co-located or on-premises infrastructure is unable to handle, or cannot handle easily.
The ability to rapidly scale resources up or down, either manually or by using AWS autoscaling triggers that execute on predefined metric thresholds, can provide a seamless and predictable experience for your customers.
Synthesis Managed Services offers this to its retail and e-commerce partners by providing a group of DevOps and system engineers who actively monitor and test the environment during the day and react within seconds to resolve problems as they arise.
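To make the idea of scaling on predefined metric thresholds concrete, here is a minimal sketch of a step-based scaling decision. It is not AWS's actual algorithm or API; the `ScalingPolicy` fields, the CPU thresholds and the instance limits are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    # All values below are hypothetical examples, not recommendations.
    scale_out_above: float = 70.0  # average CPU % that triggers adding capacity
    scale_in_below: float = 30.0   # average CPU % that triggers removing capacity
    step: int = 2                  # instances added or removed per decision
    min_instances: int = 2
    max_instances: int = 20


def desired_capacity(current: int, avg_cpu: float, policy: ScalingPolicy) -> int:
    """Return the instance count to aim for, given the current average CPU."""
    if avg_cpu > policy.scale_out_above:
        target = current + policy.step
    elif avg_cpu < policy.scale_in_below:
        target = current - policy.step
    else:
        target = current  # inside the comfortable band: do nothing
    # Never go below the floor or above the ceiling.
    return max(policy.min_instances, min(policy.max_instances, target))
```

For example, `desired_capacity(4, 85.0, ScalingPolicy())` asks for 6 instances, while the minimum and maximum keep a noisy metric from scaling the fleet to zero or without bound.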
By using these methodologies and best practices, we have been able to provide our partners with a predictable business outcome during Black Friday, knowing that no matter how many customers sign up or buy toasters, there will be no impact on business as usual.
When the load hits and things go wrong
Usually, the first reaction from the support team is that it’s the developers’ fault, and from the developers, that it’s the infrastructure. With good monitoring in place, finding the root cause of the problem is a lot easier, and this eliminates the blame game.
We have seen issues ranging from disk IOPS constraints and API limits to high CPU load and even memory leaks in a service.
Usually, around 11pm on the Thursday, users start logging in and refreshing your site continuously. Some of them have become quite crafty, using web scraping tools to poll for deals and keywords, which can lead to hundreds if not thousands of web requests per second.
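One way monitoring can surface this kind of polling is to count requests per client over a short sliding window. The sketch below is only an illustration: the `RequestRateTracker` class, the client identifier and the limits are hypothetical, and it assumes timestamps arrive in non-decreasing order for each client.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RequestRateTracker:
    """Flags clients that exceed max_requests within window_seconds."""

    def __init__(self, window_seconds: float = 1.0, max_requests: int = 50) -> None:
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one request; return True if the client is over the limit."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard requests that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while hits and hits[0] <= cutoff:
            hits.popleft()
        return len(hits) > self.max_requests
```

What to do with a flagged client (alert, throttle or simply watch) is a business decision; the sketch only shows how the signal could be produced.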
The teams need to be on standby and ready to deal with any issue that might arise during the day.
When the dust settles
After the day has come and gone, take a moment to go through all the issues experienced, and come up with remediation plans. Usually, the first Black Friday is the most error-prone, but if you take the lessons from the day and implement solutions for them, the next Black Friday will go a lot smoother.
A couple of tips to prepare for the day:
- Monitoring – Make sure all the critical metrics and endpoints are monitored.
- Service autoscaling – Make sure your services and web servers can scale.
- No change Friday – Try to make sure a stable release of the services and sites is up for that week.
Side note: Do not panic. With good preparation, a good team can mitigate most, if not all, interruptions to business, and when an issue pops up, they will be able to deal with it quickly.