Netcon
 
Questions
 ·  What is Netcon?
 ·  Why write Netcon?
 ·  What is the architecture of Netcon?
 ·  What are the other basic features of Netcon?
 ·  How does Netcon work?
 ·  What is there left to work on?
 ·  What other monitoring programs are available?

Answers
 ·  What is Netcon?
  Netcon is an operational machine and service monitoring tool. It allows you to set up monitoring for machine parameters such as CPU and disk usage, as well as services such as HTTP and MySQL. When the data reported for these services meets a set of predetermined triggers, the people responsible for those services can be notified.

 ·  Why write Netcon?
 

 ·  What is the architecture of Netcon?
  The Netcon server uses a uniform and extensible metric naming scheme. The same history that records the CPU usage of a machine over time is used to record the duration of a trigger or failure -- so either can use the same display and graphing capabilities. This makes it easy to build up many layers in the system. For example, once Netcon is told the user impact of an incident, it can report on the user-perceived impact of failures over time.
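To make the idea of one history for every kind of metric concrete, here is a minimal sketch. The `History` class, its methods, and the slash-separated metric names are all assumptions made for illustration; Netcon's actual naming scheme and storage are not described here.

```python
from collections import defaultdict


class History:
    """One store for every kind of metric (hypothetical, in-memory only)."""

    def __init__(self):
        # metric name -> list of (timestamp, value) samples
        self._samples = defaultdict(list)

    def record(self, name, timestamp, value):
        self._samples[name].append((timestamp, value))

    def series(self, name):
        """Return the samples for one metric, ready for display or graphing."""
        return list(self._samples[name])

    def names(self, prefix):
        """List every metric whose name starts with the given prefix."""
        return sorted(n for n in self._samples if n.startswith(prefix))


history = History()
# A raw machine metric (the name format is an assumption).
history.record("web1/cpu/usage_pct", 0, 35.0)
history.record("web1/cpu/usage_pct", 60, 97.0)
# A derived metric: how long a trigger stayed active, stored the same way.
history.record("web1/trigger/cpu_high/duration_sec", 60, 120.0)

for name in history.names("web1/"):
    print(name, history.series(name))
```

Because both kinds of data go through the same `record` and `series` calls, anything that can graph CPU usage can graph trigger durations without special cases.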

Netcon's basic architecture borrows my favorite features from other tools. Like QOS, it has a lightweight data-collection agent which is deployed as needed to query data, and which can be easily extended with application-specific collectors. Like Netsaint/Nagios, it has an SQL backend database and a configuration and information browsing UI. Like some larger commercial counterparts, it is configured from the Netcon web user interface. This makes it easy to configure, and because the configuration is stored in the database, it is easy to write scripts which modify configuration without fear of breaking a big configuration file.

 ·  What are the other basic features of Netcon?
 
  • data is stored in a MySQL database
  • monitoring is performed by a lightweight data-collection client
  • configuration data about what to monitor is administered centrally
  • custom data-collection clients can be written by extending the Netcon data-collection agent in Python, or by merely speaking the Netcon HTTP protocol
  • clients can (optionally) save and report data for disconnected periods
  • hierarchical redundant trigger suppression
  • services are specified in role-groups and applied to a set of machines
  • trend analysis for triggers (e.g. trigger when a value will be reached in less than 8 hours)
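As an illustration of the trend-analysis feature in the list above, the sketch below fits a straight line through recent samples and estimates when a threshold would be reached. The least-squares fit and the names `seconds_until` and `trend_trigger` are assumptions made for this example; Netcon's own trend method may differ.

```python
def seconds_until(samples, threshold):
    """Estimate seconds until a linear trend through samples hits threshold.

    samples is a list of (timestamp_seconds, value) pairs, oldest first.
    Returns 0.0 if the latest value is already at or above the threshold,
    and None if there is too little data or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    last_t, last_v = samples[-1]
    if last_v >= threshold:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    t_hit = (threshold - intercept) / slope
    return max(0.0, t_hit - last_t)


def trend_trigger(samples, threshold, horizon_seconds=8 * 3600):
    """Fire when the threshold is projected to be reached within the horizon."""
    eta = seconds_until(samples, threshold)
    return eta is not None and eta < horizon_seconds


# Disk usage rising 5 points per hour, with a threshold of 95 percent.
rising = [(0, 70.0), (3600, 75.0), (7200, 80.0)]
print(trend_trigger(rising, 95.0))

# Usage falling: the trend never reaches the threshold.
falling = [(0, 80.0), (3600, 75.0)]
print(trend_trigger(falling, 95.0))
```

A trend trigger like this can warn before a plain threshold trigger would, at the cost of false alarms when the recent samples are noisy.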

 ·  How does Netcon work?
 One way to understand Netcon is to consider the flow of monitored data through the system. Here is a description of the cycle of data collection through an incident notification and resolution.

  1. netcon server startup
  2. netcon client startup
    1. check in with server to get configuration
    2. begin monitoring, periodically reporting data to server
  3. netcon server accepts reported data from many clients
    1. for each piece of data, update the 'current' state of that service
    2. roll previous data into 'history'
  4. netcon server periodically checks for errors
    1. load all triggers and check against 'current' state
    2. record any trigger state changes
    3. for any triggered errors, add them to the active incident, creating one if necessary
  5. netcon server periodically handles notifications
    1. iterate through active incidents and make sure currently active users are watching these incidents
    2. iterate over watched incidents for each user, and generate notifications (user can choose a single email, or a single email per incident)
    3. deactivate incidents which have been resolved and which have passed their 'watch' period without any activity.
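Steps 3 and 4 of the cycle above can be sketched in a few lines. This is an in-memory illustration only: the `Server` class, its attribute names, and the use of plain predicates as triggers are assumptions, and the real server keeps this state in its database. Notification handling (step 5) is left out.

```python
class Server:
    """Hypothetical in-memory stand-in for the server's data handling."""

    def __init__(self, triggers):
        self.triggers = triggers     # service name -> predicate on a value
        self.current = {}            # service name -> (timestamp, value)
        self.history = []            # (service, timestamp, value) tuples
        self.trigger_state = {}      # service name -> is the trigger firing?
        self.state_changes = []      # (service, fired) transitions
        self.active_incident = None  # created on the first triggered error

    def accept(self, service, timestamp, value):
        """Step 3: roll the previous value into history, update 'current'."""
        if service in self.current:
            old_t, old_v = self.current[service]
            self.history.append((service, old_t, old_v))
        self.current[service] = (timestamp, value)

    def check_errors(self):
        """Step 4: check triggers against 'current' and update the incident."""
        for service, is_error in self.triggers.items():
            if service not in self.current:
                continue
            fired = is_error(self.current[service][1])
            if fired != self.trigger_state.get(service, False):
                self.state_changes.append((service, fired))
                self.trigger_state[service] = fired
            if fired:
                if self.active_incident is None:
                    self.active_incident = {"failures": set()}
                self.active_incident["failures"].add(service)


server = Server({
    "web1/cpu": lambda v: v > 90,
    "web1/disk": lambda v: v > 95,
})
server.accept("web1/cpu", 0, 40.0)
server.accept("web1/cpu", 60, 97.0)
server.accept("web1/disk", 60, 50.0)
server.check_errors()
print(server.state_changes, server.active_incident)
```

Grouping every triggered error under one active incident is what keeps the number of notifications small: a user hears about the incident, not about each failing check.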
When the user receives a notification, that notification will indicate the severity of the incident and the number of failures present on that incident. By visiting the web interface, the user can check the detailed information reported on the incident, as well as add notes to the incident.

When the problem is resolved, the user must acknowledge and resolve the incident before it will be cleared. When acknowledging, the user can indicate the user-perceived result of the failure (degraded-performance, degraded-functionality, inaccessibility), as well as the length of time this incident should be watched for. After the watch timeout has expired, Netcon will clear the incident and make it part of the incident history.
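The acknowledge, resolve, and watch steps can be read as a small state machine. The sketch below is one possible reading: the `Incident` class and its fields are hypothetical, and the rule that new activity restarts the watch clock is an assumption based on the "without any activity" wording in the notification step.

```python
from dataclasses import dataclass
from typing import Optional

# The three user-perceived results named in the text.
IMPACTS = {"degraded-performance", "degraded-functionality", "inaccessibility"}


@dataclass
class Incident:
    opened_at: float
    impact: Optional[str] = None
    watch_seconds: Optional[float] = None
    acknowledged: bool = False
    resolved: bool = False
    last_activity: float = 0.0

    def __post_init__(self):
        self.last_activity = self.opened_at

    def acknowledge(self, now, impact, watch_seconds):
        """Record the user-perceived impact and how long to keep watching."""
        if impact not in IMPACTS:
            raise ValueError(f"unknown impact: {impact}")
        self.impact = impact
        self.watch_seconds = watch_seconds
        self.acknowledged = True
        self.last_activity = now

    def resolve(self, now):
        if not self.acknowledged:
            raise RuntimeError("acknowledge the incident before resolving it")
        self.resolved = True
        self.last_activity = now

    def record_failure(self, now):
        """New activity restarts the watch clock (an assumption)."""
        self.last_activity = now

    def can_clear(self, now):
        """True once resolved and quiet for the whole watch period."""
        return (
            self.acknowledged
            and self.resolved
            and now - self.last_activity >= self.watch_seconds
        )


incident = Incident(opened_at=0)
incident.acknowledge(100, "degraded-performance", watch_seconds=3600)
incident.resolve(200)
print(incident.can_clear(1000))  # still inside the watch period
print(incident.can_clear(3800))  # watch period has passed quietly
```

Once `can_clear` returns true, the server would move the incident out of the active set and into the incident history.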

 ·  What is there left to work on?
 Download the source and check out the TODO.txt file!

 ·  What other monitoring programs are available?
  Here is a list of other free monitoring programs that I've looked at. They are all good packages, all with unique strengths. Below I've tried to summarize the biggest difference with Netcon, and the strongest selling point of each tool.
  • Nagios : (formerly Netsaint) Nagios is a centralized network/host monitoring tool, while Netcon is a decentralized network/host and application-level monitoring tool. Running Nagios data-collectors on multiple machines requires installing all of Nagios on those machines, unlike Netcon, which has lightweight data-collection agents. Nagios configuration is performed in data files, while Netcon configuration can be done completely from the HTML UI, or by writing to a MySQL database. Nagios has a fairly extensive UI for viewing host data, including nice graphs, flapping detection, and trends. Nagios is more difficult to set up and configure.
  • QOS : QOS is a decentralized data collection and error notification system. Netcon borrows much of its design from QOS, including Python, decentralized lightweight agents, and data history storage. Netcon stores all configuration and history in a MySQL database, while QOS uses flat files and Python config files. QOS uses a long-running async I/O master which is constantly connected to long-running agents, while Netcon (currently) uses agents which periodically connect to master code in Apache CGIs. Master datakeeping and notifications happen via a periodic server run via cron or as a daemon. Netcon adds an online configuration UI for changing collections and triggers, a visualization UI for history, and an incident creation and tracking model to minimize pages and organize required action.
  • mon
  • Big Sister
  • NMIS : a centralized SNMP data collection server with notification and graphing based on RRDTool
  • remstats : uses client collectors and a central server with RRDTool
  • other tools based on RRDTool
There are also many commercial packages available.

 
Copyright © 2003 - David Jeske