Questions
Answers
| · |
What is Netcon? |
| |
Netcon is an operational machine and service monitoring tool. It
allows you to set up monitoring for machine parameters such as CPU and
disk usage, as well as services such as HTTP and MySQL. When the
data reported for these services matches a set of predetermined
triggers, the people responsible for those services can be notified.
|
| · |
Why write Netcon? |
| |
|
| · |
What is the architecture of Netcon? |
| |
The Netcon server uses a uniform and extendable metric naming
scheme. The same history that records the CPU usage of a machine
over time is used to record the duration of a trigger or
failure -- allowing either to leverage display or graphing capabilities.
This makes it easy to build up many layers in the system. For example,
once you tell Netcon the user impact of an incident, it can
report on the user-perceived impact of failures over time.
Netcon's basic architecture borrows my favorite features from other
tools. Like QOS, it has a lightweight data-collection agent which
is deployed as needed to query data, and which can be easily
extended with application-specific collectors. Like Netsaint/Nagios,
it has an SQL backend database and a configuration and information
browsing UI. Like some larger commercial counterparts, configuration
is performed from the Netcon web user-interface. This makes it
easy to configure, and because the configuration is stored in the
database, it is easy to write scripts which modify
configuration without fear of breaking a big configuration file.
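As a rough illustration of the uniform naming idea, here is a minimal Python sketch in which a CPU measurement and a trigger duration go into the same history store. The "machine/source/metric" naming convention, the History class, and the metric names are all hypothetical; this FAQ does not describe Netcon's real naming scheme or schema.

```python
from collections import defaultdict


def metric_name(machine, source, metric):
    # Hypothetical convention: "<machine>/<source>/<metric>".
    return "/".join((machine, source, metric))


class History:
    """One store for every kind of sample, keyed only by metric name."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, name, timestamp, value):
        self._samples[name].append((timestamp, value))

    def series(self, name):
        # Return a copy so callers cannot modify the stored history.
        return list(self._samples[name])


history = History()
# An ordinary machine measurement...
history.record(metric_name("web1", "cpu", "percent_used"), 1000, 42.0)
history.record(metric_name("web1", "cpu", "percent_used"), 1060, 97.5)
# ...and the duration of a trigger, stored in exactly the same way.
history.record(metric_name("web1", "trigger", "cpu_high_seconds"), 1060, 60)
```

Because both kinds of data come back from the same series() call, any display or graphing code written for one works for the other.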
|
| · |
What are the other basic features of Netcon? |
| |
- data is stored in a MySQL database
- monitoring is performed by a lightweight data-collection client
- configuration data about what to monitor is administered centrally
- custom data-collection clients can be written by extending
the Netcon data-collection agent in Python, or by merely speaking
the Netcon HTTP protocol
- clients can (optionally) save and report data for disconnected periods
- hierarchical redundant trigger suppression
- services are specified in role-groups and applied to a set of machines
- trend analysis for triggers (e.g. trigger when a value will be reached in
less than 8 hours)
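To make the trend-analysis item concrete, here is a minimal Python sketch of one way such a trigger could work: fit a straight line to recent samples by least squares and estimate when the value will reach a threshold. The function names, the sample format, and the linear fit are assumptions made for illustration, not a description of Netcon's actual algorithm.

```python
def seconds_until(samples, threshold):
    """Estimate seconds until a linear trend through `samples` hits `threshold`.

    samples: list of (timestamp_seconds, value), oldest first.
    Returns None when there are too few samples or the trend is not
    rising toward the threshold.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None
    # Least-squares slope, in value units per second.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom
    _, last_v = samples[-1]
    if last_v >= threshold:
        return 0.0
    if slope <= 0:
        return None
    # Extrapolate from the most recent value using the fitted slope.
    return (threshold - last_v) / slope


def trend_trigger(samples, threshold, horizon_seconds=8 * 3600):
    """True when the threshold is expected to be reached within the horizon."""
    eta = seconds_until(samples, threshold)
    return eta is not None and eta < horizon_seconds
```

For example, disk usage growing by 10 percentage points an hour from 70% would be flagged against a 100% threshold, since it is about three hours away from full.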
|
| · |
How does Netcon work? |
| | One way to understand Netcon is to consider the flow of monitored data
through the system. Here is a description of the cycle of data
collection through an incident notification and resolution.
- netcon server startup
- netcon client startup
  - check in with server to get configuration
  - begin monitoring, periodically reporting data to server
- netcon server accepts reported data from many clients
  - for each piece of data, update the 'current' state of that service
  - roll previous data into 'history'
- netcon server periodically checks for errors
  - load all triggers and check against 'current' state
  - record any trigger state changes
  - for any triggered errors, add them to the active incident,
    creating one if necessary
- netcon server periodically handles notifications
  - iterate through active incidents, make sure currently
    active users are watching these incidents
  - iterate over watched incidents for each user, and
    generate notifications (user can choose a single email,
    or a single email per incident)
  - deactivate incidents which have been resolved and which
    have passed their 'watch' period without any activity
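The server side of the cycle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Server class, its in-memory dictionaries, and the shape of the triggers argument are all hypothetical (the real server keeps its state in MySQL), and the notification step is left out.

```python
class Server:
    def __init__(self, triggers):
        # Hypothetical shape: {trigger_name: (service_name, predicate)}
        self.triggers = triggers
        self.current = {}            # service -> (timestamp, value)
        self.history = []            # (service, timestamp, value)
        self.trigger_state = {}      # trigger_name -> bool
        self.state_changes = []      # (trigger_name, new_state)
        self.active_incident = None  # set of triggered names, or None

    def accept(self, service, timestamp, value):
        """Update the 'current' state and roll the previous sample into 'history'."""
        previous = self.current.get(service)
        if previous is not None:
            self.history.append((service,) + previous)
        self.current[service] = (timestamp, value)

    def check_errors(self):
        """Check every trigger against 'current' and record state changes."""
        for name, (service, predicate) in self.triggers.items():
            if service not in self.current:
                continue
            fired = predicate(self.current[service][1])
            if fired != self.trigger_state.get(name, False):
                self.state_changes.append((name, fired))
            self.trigger_state[name] = fired
            if fired:
                # Add the error to the active incident, creating one if necessary.
                if self.active_incident is None:
                    self.active_incident = set()
                self.active_incident.add(name)


server = Server({"cpu_high": ("web1/cpu", lambda value: value > 90)})
server.accept("web1/cpu", 0, 50)
server.check_errors()   # nothing triggered yet
server.accept("web1/cpu", 60, 95)
server.check_errors()   # cpu_high fires and opens an incident
```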
When the user receives a notification, that notification will indicate
the severity of the incident, and the number of failures present on
that incident. By visiting the web-interface, the user can check the
detailed information reported on the incident, as well as add notes to
the incident.
When the problem is resolved, the user must acknowledge and resolve
the incident before it will be cleared. When acknowledging, the user
can indicate the user-perceived result of the failure
(degraded-performance, degraded-functionality, inaccessibility), as
well as the length of time this incident should be watched for. After
the watch timeout has expired, Netcon will clear the incident and make
it part of the incident history.
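The acknowledge, resolve, and watch steps can be pictured as a small state object. The Python sketch below is a guess at the shape of that logic, not Netcon's code: the Incident class, its method names, and the rule that new failure activity restarts the watch period are assumptions based on the description above.

```python
IMPACTS = ("degraded-performance", "degraded-functionality", "inaccessibility")


class Incident:
    def __init__(self):
        self.acknowledged = False
        self.resolved = False
        self.impact = None
        self.watch_seconds = None
        self.last_activity = None
        self.cleared = False

    def acknowledge(self, now, impact, watch_seconds):
        """Record the user-perceived impact and how long to keep watching."""
        if impact not in IMPACTS:
            raise ValueError("unknown impact: %r" % (impact,))
        self.acknowledged = True
        self.impact = impact
        self.watch_seconds = watch_seconds
        self.last_activity = now

    def resolve(self, now):
        self.resolved = True
        self.last_activity = now

    def record_failure(self, now):
        # Assumption: new activity restarts the watch period.
        self.last_activity = now

    def maybe_clear(self, now):
        """Clear only when acknowledged, resolved, and quiet for the watch period."""
        if (self.acknowledged and self.resolved
                and now - self.last_activity >= self.watch_seconds):
            self.cleared = True
        return self.cleared
```

A cleared incident would then be moved into the incident history.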
|
| · |
What is there left to work on? |
| | Download the source and check out the TODO.txt file!
|
| · |
What other monitoring programs are available? |
| |
Here is a list of other free monitoring programs that I've looked
at. They are all good packages, each with unique strengths. Below I've
tried to summarize each tool's biggest difference from Netcon and its
strongest selling point.
- Nagios : (formerly
Netsaint) Nagios is a centralized network/host monitoring tool, while
Netcon is a decentralized network/host and application-level
monitoring tool. Running Nagios data-collectors on multiple machines
requires installing all of Nagios on those machines, unlike Netcon,
which has lightweight data-collection agents. Nagios configuration
is performed in data files, while Netcon configuration can be done
completely from the HTML UI, or by writing to a MySQL database. Nagios
has a fairly extensive UI for viewing host data, including nice
graphs, flapping detection, and trends. Nagios is more difficult to
set up and configure.
- QOS : QOS is a
decentralized data collection and error notification system. Netcon
borrows much of its design from QOS, including Python, decentralized
lightweight agents, and data history storage. Netcon stores all
configuration and history in a MySQL database, while QOS uses
flat files and Python config files. QOS uses a long-running async
I/O master which is constantly connected to long-running agents,
while Netcon (currently) uses agents which periodically
connect to master code in Apache CGIs. Master datakeeping and
notifications happen via a periodic server run via cron or as a daemon.
Netcon adds an online configuration UI for changing collections
and triggers, a visualization UI for history, and an incident
creation and tracking model to minimize pages and organize required action.
- mon
- Big Sister
- NMIS : a centralized SNMP data collection server with notification
and graphing based on RRDTool
- remstats : uses client collectors and a central server with RRDTool
- other tools based on RRDTool
There are also many commercial packages available. Here are a few of
them:
Here are some other links which are useful as well:
|
|