Concepts

Osmius is a monitoring tool: it lets us follow the status of many different elements at the same time and be informed of changes in that status.

According to the dictionary, to monitor means:

To observe, by means of special devices, the course of one or more physiological or other parameters in order to detect possible anomalies.

Once we understand what a monitoring system is and the terms associated with it, we can take full advantage of a system like Osmius and align it with business targets. A mature organization keeps inventories of network elements, servers and applications, knows how the applications affect the business processes, and knows who is functionally and technically responsible for each service. This is not always the case. When the information is not structured, a monitoring system does not improve the overall performance of the company or of particular services, no matter how much time and money has been spent on it. Such cases are frustrating: the monitoring consoles end up showing so many alarms and incidents that they stop being useful.

It is therefore important to spend time becoming familiar with the general concepts of monitoring. That way we can design the project well from the start, avoiding problems, optimizing our investment and resources, and coordinating monitoring with the needs of the business, whatever its nature.

We first present a quick summary and then go through each of the concepts used in Osmius in particular and in monitoring systems in general.

Summary

The Base

If we start from the most technical level, we are interested in monitoring a set of variables for each of the elements that make up the information systems infrastructure. This infrastructure may include thousands of elements.

In Osmius, these elements are called INSTANCES. Not all instances are the same: we will have instances of type “Linux Server”, “Router ACME”, “PostgreSQL” database, and so on. The most obvious way to organize our inventories is therefore by TYPE OF INSTANCE, that is, the number of Linux instances, Windows instances, etc.

From the monitoring point of view, types of instance differ in the variables we can ask them for. We can ask an instance of type “Linux Server” for the percentage of CPU load, the number of remote users connected, or the number of free megabytes in the data directory, whereas we will ask an instance of type “MySQL Database” for the number of queries per second or the number of tables open at a given moment. Each variable that we can ask a type of instance for is called a TYPE OF EVENT.

When we query an Instance about a Type of Event we obtain an EVENT. Once the monitoring system is up and running, we will therefore be receiving events, each belonging to a Type of Event, with values from each of the Instances in our system. Depending on the value of the event for a given Instance, we will be more or less happy with the result. If the event that reports the temperature of the data processing centre room says 21 degrees Celsius we can stay calm, but if it says 51 degrees Celsius we have a problem.

Thresholds are therefore defined for every Type of Event and each Instance in order to determine the STATUS of the event. For example, if the temperature goes over 25 we want a Warning (orange), if it goes over 30 we want a Critical alarm (red), and if it is below 25 everything remains calm (OK, green).

Color    Status
Green    OK or informative
Orange   Warning
Red      Alarm or Critical
Grey     Unknown or Error

In the same way, every instance has an INSTANCE STATUS, which depends on the status of the active events of that instance. Osmius takes the status of an instance to be the most critical status among its active events.
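
The two rules above (thresholds give each event a status, and the instance takes the worst status of its active events) can be sketched in a few lines of Python. This is only an illustration, not Osmius code: the names `Status`, `classify` and `instance_status` are hypothetical, the Error/Unknown status is left out, a value exactly on a threshold is assumed to stay in the lower status, and an instance with no active events is assumed to be OK.

```python
from enum import IntEnum


class Status(IntEnum):
    """Event and instance statuses, ordered from least to most critical."""
    OK = 0        # green
    WARNING = 1   # orange
    CRITICAL = 2  # red


def classify(value, warning_above, critical_above):
    """Map a numeric event value to a status using two upper thresholds."""
    if value > critical_above:
        return Status.CRITICAL
    if value > warning_above:
        return Status.WARNING
    return Status.OK


def instance_status(active_event_statuses):
    """Return the most critical status among the active events.

    Assumption: an instance with no active events is reported as OK.
    """
    return max(active_event_statuses, default=Status.OK)


# Temperature thresholds from the example: warning above 25, critical above 30.
readings = [21, 27, 51]
statuses = [classify(t, warning_above=25, critical_above=30) for t in readings]
print([s.name for s in statuses])      # ['OK', 'WARNING', 'CRITICAL']
print(instance_status(statuses).name)  # CRITICAL
```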

An ACTIVE EVENT is one that has not been invalidated or acknowledged by a later event or by human action. If a critical CPU usage event arrives for a server and five minutes later the same event arrives with an acceptable percentage, the first event is no longer active. When an event is no longer active, it is said to have been moved to the HISTORIC. It is normally Osmius itself (its correlation engine) that moves events from the Active store to the Historic store.
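
As a rough sketch of this behavior (not the actual correlation engine), the Python below keeps one active event per instance and type of event. A later event of the same type replaces the earlier one, which moves to the historic store. Following the glossary example, an event that arrives with OK status is assumed to go straight to the historic as well. `Event`, `EventStore` and `receive` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    instance: str
    event_type: str
    value: float
    status: str  # "OK", "WARNING" or "CRITICAL"


class EventStore:
    """Toy model of the Active and Historic event stores."""

    def __init__(self):
        self.active = {}    # (instance, event_type) -> Event
        self.historic = []  # events that are no longer active

    def receive(self, event):
        key = (event.instance, event.event_type)
        # A later event of the same type invalidates the earlier one.
        previous = self.active.pop(key, None)
        if previous is not None:
            self.historic.append(previous)
        # Assumption: an OK event needs no attention, so it is archived too.
        if event.status == "OK":
            self.historic.append(event)
        else:
            self.active[key] = event


store = EventStore()
store.receive(Event("LINUX01", "cpu_usage", 99.0, "CRITICAL"))
store.receive(Event("LINUX01", "cpu_usage", 35.0, "OK"))
print(len(store.active), len(store.historic))  # 0 2
```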

Each Type of Event that we want to monitor on an Instance has a PERFORMANCE PERIOD associated with it: we want to know the available memory of the client database every five minutes, and we want to check that the company’s product sales page is working every 3 minutes. We normally group instances and ask each group for the same set of variables or types of events, setting for each one the Performance Period and the limits that define the OK, Warning and Critical statuses. To make this easier, Osmius lets us group these settings into TEMPLATES.
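
One way to picture a template is as a reusable list of checks, each with its period and limits, applied to a group of instances. The sketch below is an assumption about the shape of such data, not the Osmius configuration format. The class names, event names, thresholds and the instance name LINUX02 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventCheck:
    event_type: str
    period_seconds: int    # how often the variable is requested
    warning_above: float   # hypothetical limit for Warning
    critical_above: float  # hypothetical limit for Critical


@dataclass(frozen=True)
class Template:
    name: str
    checks: tuple


def apply_template(template, instance_names):
    """Give every instance in the group the same set of checks."""
    return {name: template.checks for name in instance_names}


# Hypothetical template for Linux servers; all values are illustrative only.
linux_template = Template(
    name="Linux Server (standard)",
    checks=(
        EventCheck("cpu_usage_percent", 300, 80.0, 95.0),
        EventCheck("remote_users_connected", 180, 20.0, 50.0),
    ),
)

plan = apply_template(linux_template, ["LINUX01", "LINUX02"])
for instance, checks in plan.items():
    for check in checks:
        print(instance, check.event_type, f"every {check.period_seconds}s")
```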

The Business

Up until now, we have concentrated on the elements connected to the network, from the lowest level to the highest, from network elements to Web applications. But where is the business-oriented view?

It is in the SERVICES. A Service is a group of instances that together offer a set of functionalities to users. For example, we can say that the company Intranet is based on the Linux server LINUX01, the database MYSQL01, the Windows server WIN01, the Exchange server EXCH00, the portal accessible at http://ACME.local/intranet/ and the Human Resources SAP system SAP_RRHH01. If any of these elements is down or unavailable, we can no longer provide the service called Intranet.

[Figure: Different points of view]

There is also a SERVICE STATUS, which depends on the status of the instances that form the service. Osmius calculates it as the most critical status among the instances of that service. An Instance can belong to more than one Service. In addition, there is the AVAILABILITY of a SERVICE: a service is considered available when all its instances are available; otherwise the service is not available. For each instance and each service, Osmius stores a history of when its status and its availability changed. We can see how the status of an instance evolved over the last week, or how the availability of a service changed on the first day of last month.
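
The two service-level calculations are independent of each other, which a small sketch makes visible: a service can be Critical and still available. The code is illustrative only. `Instance`, `service_status` and `service_available` are hypothetical names, and the statuses assigned to the Intranet instances are invented for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class Instance:
    name: str
    status: Status
    available: bool


def service_status(instances):
    """Most critical status among the instances of the service."""
    return max((i.status for i in instances), default=Status.OK)


def service_available(instances):
    """A service is available only if every one of its instances is."""
    return all(i.available for i in instances)


intranet = [
    Instance("LINUX01", Status.OK, True),
    Instance("MYSQL01", Status.WARNING, True),
    Instance("EXCH00", Status.CRITICAL, True),  # critical, but still available
]
print(service_status(intranet).name)  # CRITICAL
print(service_available(intranet))    # True
```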

Every Service has an associated Service Level Agreement or SLA. Osmius defines the SLA as monthly targets, expressed as percentages, for availability and for the time that a service has to be OK (green). If a service is very important for the business, we should ask for something like 99.99% (a little over four minutes of unavailability at most per month). Three or four SLAs are normally defined for a complete installation (for example Gold, Silver and Bronze).

Through the reports and what we call Data Mining, we can check the degree of compliance with the different SLAs and follow the evolution of services and instances. Osmius also provides reports that present the information more clearly and precisely: inventories, number of events, most problematic instances, most active services, etc.
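
An availability target translates into a monthly downtime budget by simple arithmetic: the allowed downtime is the fraction (100 - target) / 100 of the minutes in the month. The sketch below assumes a 30-day month (43,200 minutes). The percentages attached to the Gold, Silver and Bronze names are hypothetical examples, not values defined by Osmius.

```python
def allowed_downtime_minutes(target_percent, days_in_month=30):
    """Minutes per month a service may be unavailable and still meet its target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - target_percent) / 100


# Hypothetical SLA levels; the percentages are examples only.
sla_levels = {"Gold": 99.99, "Silver": 99.9, "Bronze": 99.0}

for name, target in sla_levels.items():
    minutes = allowed_downtime_minutes(target)
    print(f"{name}: {target}% -> {minutes:.2f} minutes per month")
```

Under these assumptions a 99.99% target leaves about 4.3 minutes per month, 99.9% leaves about 43 minutes, and 99% leaves 432 minutes (7.2 hours).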

The Supervision

Osmius takes care of keeping the views of active elements as clean as possible and of calculating and recording the status and availability of all the monitored Instances and Services. Even so, depending on the relative importance of the systems, different USER PROFILES will be interested in the information provided by Osmius. OPERATORS are the technicians who supervise the system constantly. They review the events and their criticality and carry out actions, which have usually been written up as procedures to make them easier to apply. Professional installations use a team of Operators that covers the services 24×7.

ADMINISTRATORS normally come into action when an operator informs them of a problem that has not been solved automatically or by applying a procedure. An Administrator is in charge of reviewing the events of the instances and services for which they are responsible, creates procedures for the operators to use, and sets the periods and limits of the events to ensure that everything under their responsibility performs correctly.

Osmius also offers the VIEWER, or read-only user, profile. Viewers are normally the people responsible for services, who want and need to know how the status evolves and whether the SLAs are being met or missed, without going into excessive technical detail. They cannot change setups or technical parameters, but they can supervise the system and use the Osmius Events Management Control Panel.

Apart from the Events Management Control Panel, Osmius offers other ways of being informed about what is happening in the system. Some users, for example the Company Director, may not have time to watch a screen to see what is happening to the applications. This is what NOTIFICATIONS are for. With Notifications, the users themselves decide which information they want to subscribe to and how they want to receive it. For example: “I want to subscribe to availability changes of the Service ‘Product Sales through Internet’ (the most important one for the business), so that when it goes down or recovers I receive an email during office hours and an SMS outside office hours”. In this way, access to the information is opened up in the way each user wants.
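
The subscription in that example boils down to choosing a delivery channel from the time at which the availability change happens. A minimal sketch follows, assuming office hours of 09:00 to 18:00 from Monday to Friday. Those hours and the function name `channel_for` are assumptions made for the illustration, not Osmius settings.

```python
from datetime import datetime, time

# Assumed office hours; a real subscription would make these configurable.
OFFICE_START = time(9, 0)
OFFICE_END = time(18, 0)


def channel_for(when):
    """Return "email" during office hours and "sms" at any other time."""
    is_weekday = when.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = OFFICE_START <= when.time() < OFFICE_END
    return "email" if is_weekday and in_hours else "sms"


print(channel_for(datetime(2011, 7, 21, 13, 16)))  # Thursday afternoon: email
print(channel_for(datetime(2011, 7, 21, 22, 0)))   # Thursday night: sms
print(channel_for(datetime(2011, 7, 23, 11, 0)))   # Saturday morning: sms
```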

Glossary

  • Type of Instance: Every Instance that can be monitored has an associated type, which lets us classify it and distinguish the capabilities of the different types. A Linux server is not the same as a PostgreSQL database or a temperature sensor, so we cannot ask them for the same variables. E.g.: from a server we can ask for CPU usage, free disk capacity or percentage of memory in use; from a database we will ask for the number of open sessions or the data cache hit ratio; and from a sensor we will ask for pressure and temperature.
  • Instance: Any element that can be monitored, that is, any element that we periodically ask for the value of one or more variables. E.g.: we ask a server in our infrastructure for its ping response time or its CPU usage. An Apache server, a MySQL database, a document management application, a DPC temperature sensor, a kilowatt meter or a fire alarm would also be instances.
  • Type of Event: Each of the different variables that we can ask for from a specific type of instance. Every event has a specific type of event associated with it: an event with a value of 80% CPU use is of type “percentage of CPU used”. E.g.: for the type of instance “Linux Server” we can ask for the following types of events: CPU load, percentage of used memory, network output in Kbytes or uptime in seconds.
  • Event: A message that we get from an instance or monitored element. An event answers a question and returns a numeric value and a text; from the returned value, its criticality is calculated as informative (green), “warning” (orange) or “critical” (red). This criticality propagates to the status of the instance the event comes from, so if there is a red active event, the associated instance also becomes red or critical.
  • User Defined Events: An event that is not predefined in the system and that is created by the user. To configure the monitoring of such events from the console, Osmius offers five possibilities:
    1. Scripts: monitoring an event by executing a script.
    2. SNMP: monitoring an event via an OID of a MIB.
    3. Alerts: monitoring an event applying logical rules to existing events.
    4. Statistics: monitoring an event performing statistical operations on existing events.
    5. Web probes: monitoring an event via a web transaction created with a web browser.
  • Criticality or Status of an Event: Every event has an associated criticality, which can take one of these values/colors: OK or informative/Green, Warning/Orange, Critical/Red, Error or unknown/Grey. The criticality is calculated from the numeric value of the event and the limits assigned in the setup.
  • Status of an Instance: The status of an instance depends on the criticality (OK, Warning, Critical, Error) of the active events associated with the instance. Osmius takes the status of an instance to be the worst status among its active events.
    E.g.: if the active events include one Critical (red) event, 40 warnings and 10 errors, the status of the instance will be Critical (red). Osmius keeps track of the historic status changes of each instance in order to draw graphs showing, for example, the percentage of time spent in each status.
  • Availability of an Instance: The availability of an instance depends on the criticality (OK, Warning, Critical, Error) of its active events and on whether each event is classified as an availability indicator for that type of instance. An instance can receive a critical event without ceasing to be available. In Osmius, the expert who designs the events for a type of instance is responsible for selecting the events that affect the availability of the instances.
    E.g.: the event that indicates the availability of an instance of type Linux server is the uptime (whether the machine has been running for less than N seconds); receiving critical events of CPU load at 99% does not change the availability of that server. All this behavior can be configured.
  • Active and Non-Active Events: Active events are those that are currently affecting the monitored instances and have not been recognized (acknowledged) by a system operator or administrator. When a user recognizes an event, it becomes part of the Historic Events for later consultation and no longer affects the monitored instances. Most events are moved to the Historic automatically by the Osmius correlation engine.
  • Automatic Recognition and Event Correlation: Most events are recognized automatically by the Osmius correlation engine.
    E.g.: if an instance has a single active event in critical status (red) and the same event arrives with a value that makes its status informative (green), the system moves both events to the historic and the instance is left in OK status (green). The idea is to simplify system management and avoid repetition or useless information on the active events screen.
  • Repeated Events and Correlation: To make incidents easier to read and interpret, another aspect of correlation is to accumulate repeated events, those with the same instance, same type of event and same criticality, into one single event. E.g.: if every 60 minutes we receive a critical CPU usage event from the firewall host, at the end of the day we will have one event with a counter of 24 instead of 24 events with nearly the same information.
  • Service: A service should be a distinct entity within the business processes or the organization of a business. One way of measuring the maturity of a company is to assess how clearly the services provided by its departments are defined. Osmius uses the concept of service as a logical grouping of instances that together provide a certain functionality.
    E.g.: the intranet service in our company is formed by the following instances: Windows server, Exchange server, Linux server and contacts MySQL database. If any of these instances stops being available or moves to a critical status, it will influence the service, affecting its availability and its criticality.
  • Service Level Agreement or SLA: In Osmius, every service has associated goals for its availability and for the percentage of time it needs to be in OK status (green). These goals are defined in the Service Level Agreement, so we can group the business services or processes according to their relative importance for the organization. E.g.: the Intranet service could be unavailable for 8 hours a month without causing too many problems, but the sales catalogue page cannot be down for more than 5 minutes, or the losses will exceed the cost of the service provider and the 24×7 operations provider. It is important to have three or more SLAs so that importance is graded rather than flat, and so that we can be sure we attend to the critical things first when an incident happens.
  • Intervention: An intervention is a scheduled stop, agreed by all the parties involved, during which certain services will not be operational for a period of time. Unavailability during the intervention does not count against the fulfillment of the SLAs of those services.
  • Agents: An Agent is a software process that runs on a machine and is capable of obtaining Events from a certain Type of Instance in order to monitor its status. We say that the type of an Agent is the same as that of the Instances it is capable of monitoring. We will therefore have Linux, HP-UX, Windows, MySQL, MSQL, HTTP, etc. Agents. An updated list of the Agents offered with the different Osmius distributions can be consulted in the Osmius Official Documentation.
    It is important to note that an Agent can only be executed through the Master Agent on which it depends and which controls its operation. Although some Agents must run on the same machine as the Instance we want to monitor (for example, those that monitor its operating system), an Agent is normally able to monitor “anything” connected to the same network.
    Agents can be in different statuses (colors): Green (Started), Grey (Stopped) and Red (Error). Error means that the real status of the Agent is not the desired one; this can happen because of an error (the Agent has been stopped or started without a direct order from the control panel) or because the status of the Master Agent on which it depends is not the desired one.
    Each Agent has an associated setup file, which it reads each time its Master Agent starts or restarts it, and which specifies its execution parameters and the instances it must monitor or stop monitoring.
  • Master Agents: A Master Agent is a software process that runs on a machine and is capable of managing (that is, starting, stopping, changing the setup of, etc.) the Agents that depend on it, which always run on the same machine, and of receiving all the events they obtain and sending them to the Central Server. Master Agents are the only processes that communicate with the Central Server, both to send it events and to receive orders, in the form of tasks, for the setup and management of their Agents.
    A Master Agent is defined by the name of the host where it runs and by its IP address, so only one Master Agent can run on a given machine. Each Master Agent also has a unique Master Agent code that identifies it internally in the system. The code is assigned automatically when the Master Agent is deployed (see Deployment of a New Master Agent) and is kept even if the Master Agent is restarted manually (see Restart the Master Agent).
    Master Agents can be in different statuses (colors): Green (Started), Orange (Paused), Grey (Stopped) and Red (Error). Error means that the real status of the Master Agent is not the desired one; this can happen because of an error (it has been stopped or started without a direct order from the control panel) or because it has pending tasks to execute. Each Master Agent has an associated setup file, which it reads every time it is started or restarted, and which specifies its execution parameters and the Agents that must be started or stopped.
  • Tasks: Any operation to be carried out in the Osmius infrastructure generates a Task in the Server. Each Task is always associated with exactly one Master Agent. The Task Administrator of the Server processes tasks periodically and sends the appropriate command to the Master Agent that must execute it. The Master Agent executes it locally (on the machine where it is running), producing the expected result on itself or on its Agents. All tasks are based on updating setup files and restarting processes, which lets us cover all the system functionality with a small set of tasks: Update of the Agent Set Up, Update of the Master Agent Set Up, Restart of the Master Agent, Pause a Master Agent, Stop a Master Agent, Consult the Status of the Master Agent and Consult the Status of the Agent.
  • Central Server: The Central Server consists of a set of processes that run together on one machine and centralize all the actions and all the monitored data:
    1. Events Administrator: receives all the events from each of the Master Agents deployed across the infrastructure, correlates them and stores them in a Central Database.
    2. Tasks Administrator: processes the tasks periodically, sending the pertinent command to the Master Agent and waiting for its response, which it uses to update the result of the task. The time that passes between a user giving an order from the console and the associated Master Agent executing the resulting task depends, among other things, on the setup of the Task Administrator.
    3. Infrastructure Status Administrator: periodically verifies the status of each Master Agent to check that it matches what was ordered from the control panel, that is, whether it is stopped when it should be running or vice versa. If there is an error, it raises a warning and asks the Master Agent to correct it if possible. While there are pending tasks for a Master Agent, its status will not be OK until they have been processed and the Master Agent has executed them.
    4. Data Mining Administrator: periodically calculates the global mark of the system and all the data aggregations needed for later analysis.
    5. Notifications: periodically processes the changes in services, instances, SLAs, marks, etc., and sends notifications to the subscribed users by email, SMS, etc.
  • Control Panel: The graphical interface from which we can carry out all actions in Osmius and, in particular, monitor the entire infrastructure.
  • Dependencies: In Osmius we can define instances that depend on other instances. An example would be an O.S. instance and a web server instance running on that O.S. If we acknowledge an availability event from the O.S. instance, all the events from the web server instance are acknowledged automatically.
  • Propagation Rule: We can define propagation rules (or rules not to propagate the instance state). An example would be a Tomcat cluster: the state of the service changes only if all the instances of the cluster are unavailable.
  • Actions: Osmius can run manual actions from the console when an event is received. An action executes a script, created by the administrator, on the machine of the Master Agent that monitored the event. The administrator can create generic actions (for any event) or actions for a particular type of instance (for all events of that type of instance) by selecting the “script” to execute.
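
The Propagation Rule entry describes a different way of combining availability: for a cluster, one surviving member is enough. The sketch below contrasts that with the default "all instances must be available" rule. It is a simplified model with hypothetical function names and does not reflect how Osmius stores or evaluates its rules. An empty cluster is treated as unavailable here, which is an assumption.

```python
def cluster_available(member_availability):
    """A cluster is down only when every one of its members is down."""
    return any(member_availability)


def service_available(standalone_availability, clusters):
    """Combine stand-alone instances (all must be up) with clusters.

    standalone_availability: list of booleans, one per ordinary instance.
    clusters: list of lists of booleans, one inner list per cluster.
    """
    return all(standalone_availability) and all(
        cluster_available(members) for members in clusters
    )


# One database instance plus a three-node Tomcat cluster with two nodes down.
print(service_available([True], [[False, True, False]]))   # True
# The same service once the last Tomcat node also goes down.
print(service_available([True], [[False, False, False]]))  # False
```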
 
en/usuario/conceptos.txt · Last modified: 2011/07/21 13:16 by jesus.pancorbo
 