A few years ago, after an altogether unsatisfactory “dalliance” with FastHosts as my webhosting provider, I decided that Heart Internet were a better bet. All has been excellent until this last month or so. Recently I have been treated to a couple of prolonged service problems. Today’s has been the worst example – not only have all my sites been down due to a UPS problem but so has the entire email service to them.
What went wrong?
At first Heart blamed the problem on a DDoS (Distributed Denial of Service) attack, but then suddenly changed the reason to an unexplained problem during scheduled maintenance!
The maintenance was supposed to be the replacement of a “Voltage Monitor” in their Uninterruptible Power Supply (UPS). OK, that sounds fine, but why on earth was the work done during normal working hours? As an experienced UPS engineer, I know that any decent UPS system would be made resilient with an N+1 or N+2 configuration. That means that, as a minimum, the system would comprise enough UPS modules to support the load with one or two modules out of service.
As an example, let’s assume the entire load was about 1000kVA (approx 1000kW, taking the power factor as close to 1). A sensibly configured UPS would be 5 units of 250kVA (this is the minimum, N+1; there could easily be 6 modules, N+2), with one or two of them online but acting as standby should any other fail. This also allows for one or two modules to be taken out of service for maintenance should that be required.
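To put numbers on that, here is a rough Python sketch of the sizing arithmetic. It uses only the example figures above (1000kVA load, 250kVA modules), not Heart's real figures, which I don't know. The function name modules_required is made up for illustration.

```python
import math

def modules_required(load_kva, module_kva, redundancy):
    """Number of UPS modules for an N+redundancy configuration (illustrative only)."""
    # N is the number of modules needed just to carry the load
    n = math.ceil(load_kva / module_kva)
    # Add the spare modules on top of N
    return n + redundancy

print(modules_required(1000, 250, 1))  # N+1 -> 5
print(modules_required(1000, 250, 2))  # N+2 -> 6
```

With these example numbers N is 4, so N+1 gives 5 modules and N+2 gives 6.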
It seems today’s failure happened when the entire system was switched to bypass (meaning servers were supplied by the mains) while the “Voltage Monitor” was attended to. That should never have been considered. A properly configured/specified UPS should have been able to remove one or two modules from service without compromising the entire system.
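The same point can be checked with a few lines of Python. This is only a sketch under the assumed example figures (1000kVA load, 250kVA modules), and load_still_protected is a hypothetical helper name, not anything from a real UPS management tool.

```python
def load_still_protected(load_kva, module_kva, installed, out_of_service):
    """True if the modules left in service can carry the load without going to bypass."""
    remaining = installed - out_of_service
    return remaining * module_kva >= load_kva

# N+1 system (5 modules): one module can come out for maintenance
print(load_still_protected(1000, 250, installed=5, out_of_service=1))  # True
# N+1 system: two modules out is one too many
print(load_still_protected(1000, 250, installed=5, out_of_service=2))  # False
# N+2 system (6 modules): two modules can come out
print(load_still_protected(1000, 250, installed=6, out_of_service=2))  # True
```

As long as the answer is True, there is no reason to put the whole system on bypass just to work on one module.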
That of course assumes that the faulty item was not in the overall system monitoring component. If it was, the work should still not have been done during normal working hours; it should have been carried out as “planned maintenance” out of hours, with customers warned in advance.
As it happened, the entire system appears to have been compromised during normal hours. Something interrupted the mains supply while the UPS was in bypass and thus the entire Heart Datacentre lost supply. Completely unacceptable under any circumstances!
How to solve the UPS problem?
A message to Heart Internet: contact me if you want to know how to configure a proper UPS system and, if this kind of failure happens again, don’t bullshit your customers.