Application monitoring is an important part of any project, but it rarely gets enough attention before the project goes live. Once the project is live, the lack of proper monitoring shows up as downtime: support staff may not know that the application is having problems or is not working at all.

Discussion of application monitoring should start early, at the latest when deployment details are being worked out. Some applications require specific scripts, tools or authorizations, and an early discussion leaves the team better placed to deal with delays in putting the monitoring in place.

This document gives a basic introduction to the challenges of application monitoring, the types of monitoring, and best practices that can be followed to ensure high availability of live systems.

Challenges in Application Monitoring:
The following are some of the challenges faced today in application monitoring:

1. Proactive Monitoring: Proactive monitoring means watching system and application health and taking corrective action when a parameter reaches a threshold level. The threshold is the level at which the application is not yet showing deterioration but can deteriorate if corrective action is not taken. The biggest challenge is gathering the statistics needed to work out the thresholds and deciding which parameters and processes need to be monitored. Applications that interact directly with customers, such as e-commerce, banking and other online applications, need to be monitored proactively so that problems are detected before they affect the end user.

2. Complexity and Number of Applications: An application becomes more complex when it has a global user base. It has to support multiple languages, cultures and currencies. It may have multiple instances located in different regions of the world, and these may use different time or logging formats. To monitor a global application effectively one has to understand the application instances, their interconnectivity and the flow between them, coordinate with regional teams, and in most cases depend on regional teams to monitor the application.

3. Shared Systems: Several applications often share one system in order to use the full capacity of the hardware, and this brings its own set of challenges. On a single-application system it is easier to track resources such as memory, CPU, disk and network bandwidth, but in a shared environment one application may consume the resources and others suffer through no fault of their own. Sometimes the application owners cannot be contacted to take corrective action.

4. Clustered Systems: To avoid a single point of failure, applications are hosted in a clustered environment with a number of machines in different networks and locations. From a monitoring perspective this makes it harder to keep track of request and failure logs and of memory, CPU, network and disk resources, because one has to look at the logs and resources of every machine in the cluster just to isolate the one that is performing badly.

5. Limited Logging in the Production Environment: Because transaction volumes are very high and the application code has already been through performance, reliability and quality assurance cycles, the code in production is generally configured for minimal logging. This can lead to a situation where the actual indicator of a problem does not show up in the logs. The logs may not show the error message until the logging level is increased.

6. Custom Logging in Production: Logging in an online production environment can at most be raised to the highest level the code provides. When logging and other debugging methods give no clue to a particular problem, specially instrumented code has to be developed and deployed to capture the error events. The instrumented code has to be deployed in production because the problem could not be reproduced under test conditions. Deploying custom code in production calls for application downtime, which may not be acceptable to the application owners and business groups involved, and it requires considerable effort from the support team to maintain. The custom code may also be overwritten by the next release.
 

Types of Monitoring for Applications:
Applications are monitored at several points simultaneously to ensure their availability. Monitoring as a whole falls under the following categories:

1. Health Monitoring: As a proactive step, application health has to be monitored constantly so that any issue is addressed before it becomes serious. In a simple arrangement, health monitoring consists of taking a snapshot of system and application parameters and comparing it with standard benchmarks. For example, if a transaction is known to take around one second to complete, we can monitor this response time and set up alerts if it increases. Automated monitoring of health parameters is the best way to ensure high availability of an application environment.
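
The snapshot-versus-benchmark comparison described above can be sketched in a few lines of Python. This is only an illustration: the parameter names, the baseline figures and the 50 percent tolerance are assumptions made for the example, not values taken from any particular system.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    baseline: float
    observed: float


def compare_to_baseline(snapshot, baseline, tolerance=0.5):
    """Return the parameters whose observed value exceeds baseline * (1 + tolerance)."""
    findings = []
    for name, expected in baseline.items():
        observed = snapshot.get(name)
        if observed is None:
            continue  # parameter was not collected in this snapshot
        if observed > expected * (1 + tolerance):
            findings.append(Finding(name, expected, observed))
    return findings


# Hypothetical numbers: the transaction normally takes about one second.
baseline = {"response_time_s": 1.0, "cpu_percent": 40.0}
snapshot = {"response_time_s": 1.8, "cpu_percent": 35.0}
for finding in compare_to_baseline(snapshot, baseline):
    print(f"ALERT {finding.name}: observed {finding.observed}, baseline {finding.baseline}")
```

In practice the baseline would come from statistics gathered while the application is healthy, and the alert would go to the support team rather than to standard output.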

2. Error Monitoring: Errors in any application can harm the user experience. An error condition can cause a user request to fail outright or can cause unexpected errors such as timeouts or a failure to submit or display the requested data. Errors arise either from software problems, relating to the application code, web server, application server or database server, or from hardware issues relating to memory, CPU, disk space or the network.

These two types of errors are monitored differently. Application errors are mostly monitored by analyzing the application, web server and application server logs, understanding the error message, and using it to find the nature of the problem. For example, an application may stop processing new requests, and the log files may show the likely reason, such as a shortage of CPU, memory or network bandwidth, or poor database performance. Monitoring requirements and monitoring tools can be designed by studying the application documentation, architecture, platform, error messages and so on.
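
As a rough sketch of log-based error monitoring, the Python fragment below counts repeated error messages so that the most frequent ones stand out. The log layout ("date time LEVEL message") and the sample lines are assumptions for illustration; a real log format will need its own pattern.

```python
import re
from collections import Counter

# Assumed log layout: "<date> <time> <LEVEL> <message>"; adjust to the real format.
LINE_PATTERN = re.compile(r"^\S+ \S+ (?P<level>ERROR|FATAL|WARN) (?P<message>.*)$")


def summarize_errors(lines):
    """Count log messages by (level, message) so repeated errors stand out."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            counts[(match.group("level"), match.group("message"))] += 1
    return counts


# Made-up sample lines standing in for an application log.
sample = [
    "2024-01-01 10:00:01 INFO request served",
    "2024-01-01 10:00:02 ERROR connection pool exhausted",
    "2024-01-01 10:00:03 ERROR connection pool exhausted",
    "2024-01-01 10:00:04 WARN slow query",
]
for (level, message), count in summarize_errors(sample).most_common():
    print(count, level, message)
```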

Hardware monitoring is done using the standard tools and commands available for the particular hardware. Every operating system has tools and commands to monitor memory, CPU and disk usage, but to monitor and report on these resources on a regular basis custom scripts can be written that are independent of the application code.

3. Performance Monitoring: Application performance is critical to a good user experience. An application that responds to user requests in a reasonable amount of time makes a good impression on users, whereas one that takes many seconds or minutes to respond will cause users to abandon it. Application performance derives from the application code and the supporting hardware. The code must be capable of handling at least the desired number of user requests, and the hardware must provide the necessary memory and processing capacity.

Application performance can be monitored through the application access time, the request processing time and the times reported for various transactions in the application logs. While the application logs provide some data about processing time, the actual user experience can be simulated by sending requests to the application from different locations and measuring the response time in real time.

4. Configuration Monitoring: Application releases and operating system changes can affect the hardware and software configuration of a machine. It is very important to monitor the configuration so that no undocumented or untested configuration element creeps in. Every configuration change needs to be documented, and the configuration monitored for unauthorized changes. The best way to monitor configuration is through a change control process in which a change is submitted, approved and then implemented. The change control process keeps a record of all changes and allows the persons responsible for the applications to monitor them.
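
A change control process can be complemented by an automated check for unauthorized changes. One possible approach, sketched here in Python as an assumption rather than a prescribed method, is to record a checksum of each configuration file when a change is approved and compare it with the current checksum later. The file name and contents are hypothetical.

```python
import hashlib
import os
import tempfile


def fingerprint(paths):
    """Map each path to the SHA-256 digest of its contents."""
    digests = {}
    for path in paths:
        with open(path, "rb") as handle:
            digests[path] = hashlib.sha256(handle.read()).hexdigest()
    return digests


def changed_files(approved, current):
    """Return the paths whose digest differs from the approved record."""
    return sorted(path for path in approved if current.get(path) != approved[path])


# Demonstration with a throwaway file standing in for a real configuration file.
with tempfile.TemporaryDirectory() as workdir:
    config_path = os.path.join(workdir, "app.conf")
    with open(config_path, "w") as handle:
        handle.write("max_connections=100\n")
    approved = fingerprint([config_path])  # recorded when the change is approved

    with open(config_path, "w") as handle:
        handle.write("max_connections=500\n")  # an edit made outside change control
    drifted = changed_files(approved, fingerprint([config_path]))
    print("files changed since approval:", len(drifted))
```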

5. Security Monitoring: In today’s global scenario it is very important to monitor applications for security. Security monitoring involves ensuring that the latest security patches are applied to application servers, web servers and database servers. Software companies frequently issue security warnings for their products, and these warnings should be studied carefully and acted upon to ensure compliance and protection against attackers. At any given time the software versions in use should be reviewed to see whether they pose a security threat, and updated to newer, secure versions where they do.

Some companies have security teams that constantly monitor hardware and software for possible security breaches and send out recommendations, but in general the support team should subscribe to the newsletters from software companies that report the latest security threats.

Best Practices for Application Monitoring:
Systems can fail for various reasons related to hardware, the operating system, the network or the applications themselves. Sometimes systems and applications fail despite good efforts. Although one cannot guarantee that these components are always available, there are some best practices that can be followed to ensure high availability of applications:

1. Plan Early: If a new application or software component is going live and needs monitoring, it is better to get involved in the early architecture and design discussions to get an overview of what is coming. This gives time to think through and implement the monitoring solution by the time it is required. In many cases this helps, because the monitoring solution may not be straightforward and may require additional resources and effort.

2. Monitor Proactively: Don’t let a system or application go down and use its failure as the starting point for corrective action. Monitor systems and applications proactively for symptoms of problems so that corrective action can be initiated before they fail. Proactive monitoring can be achieved by monitoring threshold values for resource utilization, such as CPU, memory and network bandwidth, and for application health parameters. If the system crosses a threshold, a system health check has to be performed, which includes finding the running processes, checking memory utilization by each process, examining application logs and so on. A proactive health check and corrective action can avoid a system or application crash.

3. Balance the Load: Load balancers distribute the load across the servers that can handle it. If one server is heavily loaded or down, the load balancer can automatically direct traffic to the healthy servers. This operation is transparent to users, who will not notice the difference. Load balancers can be hardware or software based and, if not already present, should be introduced for a high-transaction application.

4. Cluster the Servers: Clustering removes the single point of failure by providing multiple points for request processing. If one server is down because of hardware failure, network failure or heavy load on its resources, requests are sent to and processed by the other members of the cluster.

5. Create a Recovery Plan: To avoid delay, online applications should have a well-documented and tested recovery plan. The plan should cover the steps and checklists to be followed in the event of an application failure. A simple example would be to test the failover feature of a server and observe the total number of failed requests and the time taken to fail over, which gives an estimate of when an alternate server will be up. Having a plan at the time of failure avoids wasting time looking for alternatives.

6. Deploy Application Code from a Trusted and Tested Source: Application code should be released from a trusted and tested source such as a version control system or a staging or quality assurance environment. No code should be released that contains changes from outside the trusted source, to which only authorized persons have access. Releasing code in this way gives the development teams the opportunity to reproduce any code problems and examine the code base itself.

7. Create a Service Level Agreement: A written service level agreement emphasizes the need for and scope of monitoring. It gives the support team its monitoring requirements and gives the business groups a standard against which to measure application availability. The document states the expected time to respond to and fix issues, so teams can work in advance on a recovery plan that meets the service level agreement.

8. Use Good Hardware: Hardware that is proven to be reliable in the industry should be used for the production environment. All additional components, cabling and so on should be of a high standard to avoid problems caused by hardware failures. Replacement components should match the exact specifications of the originals. The hardware should be covered by a support arrangement with the manufacturer or another company that can supply components and troubleshooting expertise in case of a failure.

9. Seek Professional Help: If your application is mission critical and affects customers and revenue, it is not sufficient to rely on home-grown monitoring solutions; you should seek professional advice from companies that already do monitoring for other companies. Besides monitoring applications, these companies can provide reports on response time, downtime, uptime and so on, which may be helpful in maintaining the application and planning its resources.

 

Implementation:
To implement effective application monitoring one has to understand the nature of the application and what exactly it is trying to do. This does not require full knowledge of the application code, but the basic flow of information should be clear.

1. Uptime Monitoring: In this type of monitoring, applications are checked to see whether they are up and running. A simple monitor can be set up by checking the server URLs or server processes. The limitation of this type is that it can tell whether an application is up but not whether it can process transactions.
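
A URL-based uptime check might look like the following Python sketch, which uses only the standard library. The function names, the example URLs and the rule that any 2xx or 3xx status counts as "up" are assumptions for illustration.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection or timeout


def check_uptime(urls, fetch=http_status):
    """Map each URL to True if it answered with a 2xx/3xx status, else False."""
    results = {}
    for url in urls:
        status = fetch(url)
        results[url] = status is not None and 200 <= status < 400
    return results


# Usage with a stand-in fetcher so the example runs without network access;
# in real use, omit `fetch` and pass the server URLs to be watched.
fake_statuses = {"http://app.example/health": 200, "http://app.example/down": None}
print(check_uptime(fake_statuses, fetch=fake_statuses.get))
```

As the text notes, a passing check of this kind shows only that the server answers, not that it can complete a transaction.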

2. Transaction Monitoring: Transaction-based applications are best monitored using a transaction monitor. If the application involves submitting a form and displaying a success message, the same behavior can be simulated with scripts and the result captured to determine whether the transaction succeeded. The script can run the transactions at regular intervals and send alerts if something fails.

This can be used effectively in proactive monitoring if the application returns the transaction processing time or some other status that can be quantified. The transaction completion time or status can be monitored and compared with the expected values. If a transaction takes too long, one can look at the application logs to find the problem and take corrective action to avoid a crash.
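
A minimal sketch of such a transaction probe in Python is shown below. The `submit_test_form` function is a hypothetical stand-in for whatever script really submits the form and checks for the success message, and the one-second expected time is an assumed figure.

```python
import time


def run_probe(transaction, expected_seconds, clock=time.perf_counter):
    """Run one synthetic transaction and classify the outcome.

    Returns (status, elapsed), where status is "ok", "slow" or "failed".
    """
    start = clock()
    try:
        succeeded = bool(transaction())
    except Exception:
        succeeded = False  # any exception counts as a failed transaction
    elapsed = clock() - start
    if not succeeded:
        return "failed", elapsed
    if elapsed > expected_seconds:
        return "slow", elapsed
    return "ok", elapsed


def submit_test_form():
    """Hypothetical stand-in for submitting a form and seeing the success message."""
    time.sleep(0.01)
    return True


status, elapsed = run_probe(submit_test_form, expected_seconds=1.0)
print(status, round(elapsed, 3))
```

A scheduler such as cron could run the probe at regular intervals; a "slow" result is the early warning, while "failed" means users are probably already affected.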

3. Data File Monitoring: In some application environments transactions happen offline, with data travelling from one point to another as files; for example, businesses send their daily sales data to head office every night in the form of a data file. This type of flow can be monitored by constantly watching the various drop and pickup points for the data files. At frequent intervals, counts can be taken at the drop and pickup points to ensure the files are moving properly.

This also provides a means to monitor the flow proactively: the problem becomes known as soon as files first start to accumulate at a drop point, and the system can be kept from clogging by looking into the cause of the accumulation.
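
The counting described above could be scripted along the following lines. This Python sketch checks a single drop directory; the file limit, the age limit and the file names are illustrative assumptions, and a real deployment would point it at the actual drop and pickup directories.

```python
import os
import tempfile
import time


def stale_files(directory, max_age_seconds, now=None):
    """Return the names of regular files in directory older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry.name)
    return sorted(stale)


def check_drop_point(directory, max_files, max_age_seconds, now=None):
    """Return a list of warnings; an empty list means the flow looks healthy."""
    warnings = []
    count = sum(1 for entry in os.scandir(directory) if entry.is_file())
    if count > max_files:
        warnings.append(f"{count} files waiting (limit {max_files})")
    old = stale_files(directory, max_age_seconds, now)
    if old:
        warnings.append(f"{len(old)} files not picked up within {max_age_seconds}s")
    return warnings


# Demonstration with a temporary directory standing in for a drop point.
with tempfile.TemporaryDirectory() as drop_point:
    for number in range(3):
        with open(os.path.join(drop_point, f"sales_{number}.dat"), "w") as handle:
            handle.write("sample\n")
    print(check_drop_point(drop_point, max_files=2, max_age_seconds=3600))
```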

4. Database Monitoring: Applications use databases, and a database should be monitored for both its uptime state and its transaction state. The uptime state is easy to monitor: by checking some key processes we can determine whether the database is up. To monitor the transactional health of a database, monitoring transactions such as creating and updating records can be run, and the time taken by each transaction and its final status noted.

When the transactions start to fail we know that the database has a problem, but for proactive monitoring we can watch the time taken to complete each transaction. In most cases an overloaded system shows higher transaction times, and that gives a vital clue to look for the problem area in the database and correct it before the database goes down.
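
The idea of timed monitoring transactions can be illustrated with Python's built-in sqlite3 module. SQLite is used here only so the sketch is self-contained; the table name `monitor_probe` and the half-second limit are assumptions, and a production probe would use the driver and a dedicated table for the real database.

```python
import sqlite3
import time


def probe_database(connection):
    """Run create/update/delete probes and return {step: seconds}.

    Any database error propagates to the caller, which signals a failed probe.
    """
    timings = {}
    cursor = connection.cursor()
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS monitor_probe (id INTEGER PRIMARY KEY, note TEXT)"
    )
    steps = [
        ("create", "INSERT INTO monitor_probe (id, note) VALUES (1, 'probe')"),
        ("update", "UPDATE monitor_probe SET note = 'updated' WHERE id = 1"),
        ("delete", "DELETE FROM monitor_probe WHERE id = 1"),
    ]
    for name, statement in steps:
        start = time.perf_counter()
        cursor.execute(statement)
        connection.commit()
        timings[name] = time.perf_counter() - start
    return timings


def slow_steps(timings, limit_seconds):
    """Return the names of probe steps that took longer than limit_seconds."""
    return sorted(name for name, seconds in timings.items() if seconds > limit_seconds)


connection = sqlite3.connect(":memory:")
timings = probe_database(connection)
print("slow steps:", slow_steps(timings, limit_seconds=0.5))
connection.close()
```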

5. Resource Monitoring: Monitoring CPU, memory, network and disk is as important as the types above. Constantly monitoring system resources can prevent application and operating system slowdowns and crashes.

If CPU or memory usage reaches its peak, the application can go into a hung state. If the disk is full, applications can crash right away because they may be unable to write logs and other data to disk.

Network bandwidth over-utilization can also cause an application to crash, as request queues start building up because of the slow network.

All of these resources offer quantitative measurements and can be monitored with scripts built on existing system utilities. For proactive monitoring, threshold values can be set for each resource, and on reaching a threshold one can investigate the cause of the over-utilization.
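
As a small sketch of threshold-based resource monitoring, the Python fragment below reads disk usage and, where the operating system provides it, the one-minute load average, then reports any reading at or above its threshold. The reading names and threshold values are assumptions for illustration, not recommendations, and memory and network readings would need additional utilities.

```python
import os
import shutil


def collect_resources(path="/"):
    """Collect a few resource readings using only the standard library."""
    readings = {}
    usage = shutil.disk_usage(path)
    if usage.total:
        readings["disk_percent"] = 100.0 * usage.used / usage.total
    if hasattr(os, "getloadavg"):  # not available on Windows
        try:
            load_1min = os.getloadavg()[0]
            readings["load_per_cpu"] = load_1min / (os.cpu_count() or 1)
        except OSError:
            pass  # load average could not be obtained on this system
    return readings


def over_threshold(readings, thresholds):
    """Return {name: (value, limit)} for every reading at or above its threshold."""
    return {
        name: (value, thresholds[name])
        for name, value in readings.items()
        if name in thresholds and value >= thresholds[name]
    }


# Threshold values here are illustrative assumptions.
THRESHOLDS = {"disk_percent": 85.0, "load_per_cpu": 0.9}
for name, (value, limit) in over_threshold(collect_resources(), THRESHOLDS).items():
    print(f"investigate {name}: {value:.1f} (threshold {limit})")
```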