SNMP & Traps in a Nutshell

SNMP stands for Simple Network Management Protocol. As the name suggests, it is a standard protocol designed for managing devices. Wikipedia has a brief description of the theory here. One of the most important things to note about SNMP, at least for the traps we rely on, is that it is a fire-and-forget protocol. There is no handshake to verify that anything has actually been received! It is therefore very important that monitoring has minimal downtime, because the devices being monitored get no feedback when nothing is receiving their messages.

Traps

The part of the protocol we're most interested in is traps. These are how we monitor devices for alarms and issues. The devices we monitor are all connected to our servers over private networks. When a device experiences an issue, it sends us a small UDP packet called a trap. This shifts the burden away from the monitoring software: we don't have to constantly poll devices for faults (though we also do that to some extent), because the devices send us alarms when they experience faults.
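To make the fire-and-forget behaviour concrete, here is a minimal Python sketch. It sends a plain UDP datagram, not a real encoded SNMP trap, and the function name, payload and port 50162 are made up for illustration (real traps conventionally go to UDP port 162). The point is only that the send succeeds whether or not anything is listening.

```python
import socket


def send_datagram(payload: bytes, host: str, port: int) -> int:
    """Send one UDP datagram and return the number of bytes handed to the OS.

    There is no handshake and no acknowledgement: the call returns as soon
    as the datagram is queued, whether or not anything is listening.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical payload and port; a real trap is a BER-encoded SNMP PDU.
    sent = send_datagram(b"not a real trap", "127.0.0.1", 50162)
    print(f"sent {sent} bytes, with no way of knowing whether anyone received them")
```

The sender gets the same result with or without a receiver, which is exactly why a monitoring outage goes unnoticed by the devices.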

Traps are simple data objects. They are identified by an object identifier (OID), which is a series of numbers separated by dots. An example OID is .1.3.6.1.4.1.33582.1.1.2.6.4.1.

Now that's not terribly informative. What does it mean? We use a file called a management information base (MIB) to translate it. A MIB is really just a text file with a specific standard structure that defines what the OID values mean. The MIB Browser tool made by iReasoning is very useful for exploring and understanding MIBs.

MIBs contain a tree structure that is used to translate the numbers of the OID into something sensible. But how do we know which MIB translates each trap? The long number in the middle of the OID (33582 in the example above) is the enterprise number. It signifies which company made the device or software the trap is being sent from, and therefore who you can get the MIB from. These numbers are registered with IANA and there's a list here.
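Vendor-specific OIDs sit under the prefix 1.3.6.1.4.1 (iso.org.dod.internet.private.enterprises), and the number immediately after that prefix is the enterprise number. A minimal Python sketch of pulling it out of an OID string follows; the function name is hypothetical and not part of our codebase.

```python
from typing import Optional

# iso.org.dod.internet.private.enterprises
ENTERPRISES_PREFIX = (1, 3, 6, 1, 4, 1)


def enterprise_number(oid: str) -> Optional[int]:
    """Return the enterprise number embedded in an OID.

    Returns None if the OID is not under the enterprises subtree.
    """
    parts = tuple(int(p) for p in oid.strip(".").split("."))
    n = len(ENTERPRISES_PREFIX)
    if len(parts) > n and parts[:n] == ENTERPRISES_PREFIX:
        return parts[n]
    return None


print(enterprise_number(".1.3.6.1.4.1.33582.1.1.2.6.4.1"))  # 33582
```

The resulting number is what you look up in the IANA list to find the vendor.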

If you look up which company is identified by 33582, you'll find it's us! This is one of our own internal traps. The MIB file to translate it is ERRIGAL-INTERNAL-SYSTEM-MIB.mib. With this we can translate the OID and find that this trap is called errigalMonitoredCarrierDeviceMissingAlarm. It's a trap we send to our own systems to signify that we have received an alarm from a device that doesn't exist on our system yet. It's worth your time to bug one of the developers to send you a copy of the ERRIGAL-INTERNAL-SYSTEM-MIB and to verify this yourself with the iReasoning MIB Browser, so that you can get used to the structure of MIBs.

Varbinds

But a trap that only tells us a device is missing isn't very useful on its own. What's missing? Traps also contain variable bindings (varbinds). Varbinds, like traps themselves, have an OID that identifies their name. They also have a value. A varbind's value can be a number, a string, null or another OID (though in practice they're mostly strings and numbers). Varbind OIDs are translated using the same MIB. The varbinds for our device missing alarm could look something like this:

Name OID                        Name translated           Value
.1.3.6.1.4.1.33582.1.1.1.1.0    alarmNotificationStatus   information
.1.3.6.1.4.1.33582.1.1.1.2.0    hostName                  VD_WD_PeoplesPark_03
.1.3.6.1.4.1.33582.1.1.1.3.0    equipmentIdentifier       VD_WD_PeoplesPark_03-RM_20
.1.3.6.1.4.1.33582.1.1.1.4.0    equipmentType             MOBILE_ACCESS
.1.3.6.1.4.1.33582.1.1.1.5.0    customerName              VODA
.1.3.6.1.4.1.33582.1.1.1.6.0    alarmIdentifier           btscRfSwOn
.1.3.6.1.4.1.33582.1.1.1.7.0    alarmDescription          btscRfSwOn
.1.3.6.1.4.1.33582.1.1.1.8.0    ipAddress                 9.9.9.9

The varbinds each have a specific meaning that is usually documented within the MIB itself. In this case they signify that we received a btscRfSwOn alarm from IP address 9.9.9.9, for a piece of equipment called VD_WD_PeoplesPark_03-RM_20 attached to host VD_WD_PeoplesPark_03, which we have processed as being part of the MOBILE_ACCESS technology and belonging to customer VODA. If not all of that information makes sense yet, don't worry about it too much. Just know that each type of trap will have different varbinds with different values that have to be interpreted by our software. We take all this information and put it into a standardized format using trap rules.
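As a rough illustration of what translating varbinds involves, here is a minimal Python sketch. The lookup dictionary is hand-copied from the example table above and stands in for a real MIB lookup; the function name and the output shape are assumptions for illustration, not our actual trap rule format.

```python
# Hand-copied from the example table; a real system would derive this from the MIB.
VARBIND_NAMES = {
    ".1.3.6.1.4.1.33582.1.1.1.1.0": "alarmNotificationStatus",
    ".1.3.6.1.4.1.33582.1.1.1.2.0": "hostName",
    ".1.3.6.1.4.1.33582.1.1.1.3.0": "equipmentIdentifier",
    ".1.3.6.1.4.1.33582.1.1.1.4.0": "equipmentType",
    ".1.3.6.1.4.1.33582.1.1.1.5.0": "customerName",
    ".1.3.6.1.4.1.33582.1.1.1.6.0": "alarmIdentifier",
    ".1.3.6.1.4.1.33582.1.1.1.7.0": "alarmDescription",
    ".1.3.6.1.4.1.33582.1.1.1.8.0": "ipAddress",
}


def translate_varbinds(varbinds):
    """Map (oid, value) pairs to a {name: value} dict.

    OIDs missing from the lookup table are kept under their raw OID so
    that no information is silently dropped.
    """
    return {VARBIND_NAMES.get(oid, oid): value for oid, value in varbinds}


received = [
    (".1.3.6.1.4.1.33582.1.1.1.2.0", "VD_WD_PeoplesPark_03"),
    (".1.3.6.1.4.1.33582.1.1.1.5.0", "VODA"),
    (".1.3.6.1.4.1.33582.1.1.1.6.0", "btscRfSwOn"),
]
alarm = translate_varbinds(received)
print(alarm["hostName"], alarm["customerName"], alarm["alarmIdentifier"])
```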

Resolving Exceptions

StaleObjectStateException ("Missing Unit Alarm Received" trap example, ticket creation turned off)

I added a new entry, "Missing Unit Alarm Received", to active_alarm_exclusion_criteria.
I think alarm sync tried to clear the device missing alarm at the same time as a device missing trap was sent to SnmpManager.
Before the device missing alarm was cleared, its repeat count went up, so the version number (the version column in the active_alarm table) was incremented.
The process clearing the device missing alarm still held an active_alarm instance with the previous version number.
Hence the StaleObjectStateException: Row was updated or deleted by another transaction
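
The suspected race can be reproduced in miniature. The following Python sketch is a toy in-memory simulation of a version-checked update, assuming the version column works as an optimistic lock; it is not Hibernate code, and the class names, field names and alarm key are hypothetical.

```python
class StaleObjectError(Exception):
    """Stand-in for Hibernate's StaleObjectStateException."""


class ActiveAlarmTable:
    """Toy in-memory table with a version column, keyed by alarm id."""

    def __init__(self):
        self.rows = {}

    def insert(self, alarm_id, **fields):
        self.rows[alarm_id] = {"version": 0, **fields}

    def load(self, alarm_id):
        # Return a detached copy, like an entity read inside a transaction.
        return dict(self.rows[alarm_id])

    def update(self, alarm_id, loaded, **changes):
        # Refuse the write if the row changed since `loaded` was read.
        current = self.rows.get(alarm_id)
        if current is None or current["version"] != loaded["version"]:
            raise StaleObjectError("Row was updated or deleted by another transaction")
        current.update(changes)
        current["version"] += 1


table = ActiveAlarmTable()
table.insert("missing-unit", repeat_count=1, cleared=False)

for_clear = table.load("missing-unit")   # alarm sync reads version 0
for_repeat = table.load("missing-unit")  # incoming trap also reads version 0

table.update("missing-unit", for_repeat, repeat_count=2)  # version becomes 1

try:
    table.update("missing-unit", for_clear, cleared=True)  # still holds version 0
except StaleObjectError as exc:
    print(exc)
```

In this sketch the second writer would have to reload the row and retry to succeed; whether that is the right fix for alarm sync is a separate question.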