====== Network Outage Troubleshooting ======

Author: Patrick Gary

===== Introduction =====

This wiki entry steps through the troubleshooting process for a generic network outage at the Sacramento data center (QTS).

===== Network Outage Reported =====

  * Team reports “Sacramento” is unreachable
    * What does that mean? What’s not reachable? By what method?
    * Confirmed that the team means the QA systems on the blade enclosure, reached by hostname (DNS query)
  * Test ping/ssh to qadb1.err (or equivalent)
    * Failure
  * Test ping/ssh to atlas.err
    * Failure
  * Test ping to 10.91.100.101 (atlas)
    * Success
  * The report concerns systems on the blade enclosure. Hostnames fail while atlas answers by IP, so the results indicate that the DNS server is down.

===== DNS issue identified =====

  * What is reachable by IP?
    * Non-blade servers are reachable (Atlas, Hyperion, etc.).
    * Blade servers are not reachable.
  * Results indicate that the problem extends to all blade systems. Other servers are not affected, which points to a problem with the blade enclosure only.

===== Blade enclosure issue identified =====

  * Attempt to log in to the primary blade management interface to check for hardware issues: [[https://10.91.100.200/|HP BladeSystem Onboard Administrator (Primary)]]
    * Failure (404, login prompt does not load)
  * Attempt to log in to the secondary blade management interface: [[https://10.91.100.201|HP BladeSystem Onboard Administrator (Failover)]]
    * Failure (404, login prompt does not load)
  * Attempt to contact a blade via its out-of-band management interface (iLO), using the IPs defined in the tracking spreadsheet: [[https://docs.google.com/spreadsheets/d/1XzNoaDJMCtDmTHEgi3KXLtUqe5VHDHM00L6BZ0xEdT8/edit?usp=sharing|Google Doc]]
    * Failure (404, login prompt does not load)
  * Results indicate that the blade enclosure as a whole is not reachable, so the issue is likely a networking problem external to the enclosure itself.
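The checks above can be sketched as a small Python script. This is only an illustration, not a tool that exists at the site: the function names are hypothetical, a TCP connection attempt stands in for ping (ICMP normally needs elevated privileges), and it assumes SSH (port 22) on atlas and HTTPS (port 443) on the Onboard Administrator interfaces. The hostname and IP addresses are the ones listed on this page.

```python
import socket

# Addresses taken from this page.
ATLAS_NAME = "atlas.err"
ATLAS_IP = "10.91.100.101"
BLADE_OA_IPS = ["10.91.100.200", "10.91.100.201"]

# Hypothetical result labels used by diagnose().
OUT_OF_SCOPE = "non-blade server unreachable by IP: not the scenario on this page"
DNS_ONLY = "names fail but IPs answer: DNS issue"
ENCLOSURE_DOWN = "blade management unreachable: suspect the network in front of the enclosure"
NO_FAULT = "no fault found by these checks"


def resolves(hostname):
    """Return True if the local resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def tcp_reachable(ip, port, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(name_ok, atlas_ip_ok, blade_mgmt_ok):
    """Map the three check results onto the conclusions drawn on this page."""
    if not atlas_ip_ok:
        return OUT_OF_SCOPE
    if not blade_mgmt_ok:
        return ENCLOSURE_DOWN
    if not name_ok:
        return DNS_ONLY
    return NO_FAULT


def run_checks():
    """Run the live checks (requires VPN access to the site) and diagnose."""
    name_ok = resolves(ATLAS_NAME)
    atlas_ip_ok = tcp_reachable(ATLAS_IP, 22)
    blade_mgmt_ok = any(tcp_reachable(ip, 443) for ip in BLADE_OA_IPS)
    return diagnose(name_ok, atlas_ip_ok, blade_mgmt_ok)
```

Calling ''print(run_checks())'' from a machine on the VPN performs the live checks. In the outage described here (hostnames failing, atlas reachable by IP, both management interfaces unreachable), ''diagnose'' returns the enclosure-unreachable result.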
===== Internal networking issue identified =====

  * Confirmed that the entire BladeSystem is unreachable
    * Indicates that the issue lies between the inbound network interface (10.91.100.1) and the secondary network devices that connect to the blade enclosure.
  * A switch sits between the blade enclosure and the VPN router. Given the previous tests, all signs point to this device being out of service. A visual inspection is required to determine whether it is powered down or otherwise inactive.
  * Call QTS for a visual inspection of the switch.
    * QTS engineer confirms that all lights on the switch are off (it’s dead).
    * QTS engineer unplugs and re-plugs the switch to power-cycle it.
  * After a few minutes and a local DNS cache reset, the blade servers are reachable again and DNS queries resume functioning.

===== Contact QTS =====

  * Call 866.239.5000
    * Username: David MeGee/Patrick Gary
    * Password: n******1
  * Provide the server name, public address, or IP address of the affected server
  * The information may be updated at https://docs.google.com/spreadsheets/d/1AZOq9ztKS7kELWx4e0pRbd09NfASfTLHGMIGCXOYdKM/edit?usp=sharing

 --- //[[patrick.gary@errigal.com|Patrick Gary]] 2018/05/04 20:17//