Sunday, February 17, 2013

Getting started with MMStats in your application

I have been working on a large Python/Django based web application that has a web-server component and a job-scheduling component. We recently transitioned from "Early-Deployment" to "General-Deployment" of our application, and management instantly wanted to know how many people were logging in, how many records were being created and when.

The first solution: Logging

Logging is the first solution to this problem. We've had logs since the beginning, using Python's logging library. This is indispensable, and we'd be lost without our logs. So the first thing we tried when the boss wanted data was to perform special log entries that could be parsed out by a log-mining script. This lead to log-pollution, ugly logging code, and a significant time-delay between events occurring and getting reporting data.

The second solution: Email

Next, we fired email events when certain important events occurred. Fire an email to a distribution list whenever a traceback event is trapped. Fire another email whenever a record has invalid data. This quickly lead to bloated inboxes and very poor metrics on how many important events occurred and when. It also lead to developers getting desensitized to "normal errors."

The correct solution: mmstats

We loaded the mmstats package into our application a month ago with the intention of trying to transition to it gradually. Today I had a chance to kick the tires. This will enable us to log exactly the kind of data we want to see, and get an understanding for how this data is changing over time.

In our application, we have an automated process that reaches out to Cisco routers to make configuration changes. Each router configuration job is dispatched into its own thread, and I wanted to start gathering statistics on the amount of time each command to the router took and how many commands were issued to each router.

We have a class which encapsulates all the interaction with the router called RouterConnection:

 class RouterConnection(AbstractConnection):
   "Connection to a router"  
   def __init__(self, router_config, mongodb):  
     AbstractConnection.__init__(self, router_config)  
     self.mongodb = mongodb  
     self.connection = None  
     self.enable_prompt = "#"  
     self.disable_prompt = ">"  
     self.config_prompt = "\(config.*\)#"  
     ...   

Next, the trick is to instrument this class with mmstats, so the first step is to create another class in the same module that keeps track of the items I care about:

 class RouterStats(mmstats.MmStats):  
   wait_for_prompt_timer = mmstats.TimerField(label="wait_for_prompt")  
   try_to_read_lines_timer = mmstats.TimerField(label="try_to_read_lines")  
   line_timeout_counter = mmstats.CounterField(label="line_timeout")  
   test_counter = mmstats.CounterField(label="test_counter")  
   to_router_counter = mmstats.CounterField(label="to_router")  

There's the new monitoring class... add a line to instantiate it within the RouterConnection class...
 class RouterConnection(AbstractConnection): # pylint: disable=R0902  
   "Connection to a router"  
   def __init__(self, router_config, mongodb):  
     AbstractConnection.__init__(self, router_config)  
     self.mongodb = mongodb  
     self.connection = None  
     self.enable_prompt = "#"  
     self.disable_prompt = ">"  
     self.config_prompt = "\(config.*\)#"  
     self.stats = RouterStats(  
       label_prefix='router.stats.',  
       filename="/tmp/mmstats-%s" % router_config.name  
     ) 

Since we'll be instantiating several of these objects, I am customizing the filename based on the name of the router being connected to. Next, whenever there are items of interest in the code, use the stat collector. For example:
     while '\n' not in read_buffer:  
       try:  
         new_buffer = self.connection.read_nonblocking(timeout=MIN_LINE_TIMEOUT)  
         read_buffer += new_buffer  
       except pexpect.TIMEOUT:  
         self.stats.line_timeout_counter.inc()  
and...
     while not (found_index == exit_index or self.exit_interact):  
       try:  
         with self.stats.try_to_read_lines_timer:  
           read_lines = self._try_to_read_lines()  
       except pexpect.EOF as p_eof:  
         if eof_action:  
           eof_action()  
         else:  
           raise p_eof  

Reading the data that you've collected