Tracking the Use of our Tools: Strategies that Helped

As an internal IT service provider to NetApp engineering, we need to track who’s using our tools and services. This lets us make sound business decisions about where to invest our limited time and energy, and it generally enables us to have insightful and productive conversations with our user community (the dev and QA engineers who fuel innovation at NetApp).

We’ve developed a tool that does this for us. It’s called “Elmo”. Elmo receives feeds from over 200 internal tools and services — everything from our bug tracking system, to our source control system, to our internal engineering cloud. These feeds identify who used the tool, when they used it, how long it took, what they did, and so on. Elmo tallies up the counts, and from them provides reports that show whether any given tool’s user base is growing or shrinking, which organizations within the company are the top consumers, which groups use certain tools proportionally more than others, and so on.

Elmo gets a reasonable volume of traffic (on the order of 150-300 messages per second). To accommodate the load while maintaining the ability to generate reports in a timely fashion, we’ve adopted some strategies that have general applicability.

Add a Level of Indirection

Rather than write feed data directly to the database as it comes in, Elmo writes it to a message queue. This helps in a few ways:

  1. Writing to a message queue has lower latency than writing to our database (which is centrally hosted, and off box). Since we use a RESTful API (over HTTP) to receive these feeds, the net effect is that tools spend less time waiting for Elmo to accept each bundle of feed data.
  2. Using a message queue allows us to continue to receive data even when the database is down. This in turn makes our DBA happy, as he has the opportunity to schedule maintenance windows.
  3. Using a message queue enables us to write to the database more efficiently, because we can group the data and send it in batches (a sketch follows this list).
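
As a sketch of that third point: a consumer process can drain a batch of messages from the queue and write them to the database in a single round trip. The “events” table, its columns, and the sample values here are illustrative, not our actual schema:

    -- One INSERT for a whole batch of queued feed messages, rather than
    -- one round trip per message. (Illustrative table and column names.)
    INSERT INTO events (app, username, action, event_time, duration_ms)
    VALUES
        ('perforce', 'jdoe',   'login',          '2013-07-07 10:32:59',  41),
        ('perforce', 'jdoe',   'sync workspace', '2013-07-07 10:33:14', 380),
        ('perforce', 'asmith', 'login',          '2013-07-07 10:33:20',  52);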

There are other ways to add levels of indirection as well. For example, we could use a DNS CNAME to identify the host on which the message queue lives, rather than the machine’s primary hostname (or IP address). We don’t actually employ this technique, but if we did, it would let us seamlessly move our message queue (or queues) to other machines, which in turn would let us, for example, upgrade the message queue software without incurring any application-level downtime.

Aggregate to Optimize for Reporting

Our feeds consist primarily of “events”: user John Doe used the tool called “perforce” at 10:32:59 today, to perform the “login” action, and it took 41 milliseconds. When John Doe subsequently performs a “sync workspace” action at 10:33:14, another event is fired off to Elmo. And so on.

In order to report on all of this data efficiently, we aggregate it into containers corresponding to the various settings and knobs offered in our reports. Fundamentally, we maintain one aggregated data table per combination of values from these two “dimensions” (if you will):

  • dimension 1: user, host, action, hour, user+host, user+action, and user+hour
  • dimension 2: day, week, month, and quarter

In the end, then, we have one table that has counts grouped by user tallied on a daily basis (dimension 1 = user, dimension 2 = day), another grouped by user tallied on a weekly basis (dimension 1 = user, dimension 2 = week), and so on all the way down to a table that has counts grouped by user and hour tallied on a quarterly basis (dimension 1 = user+hour, dimension 2 = quarter). Of course, the application/tool identifier is also always part of the grouping.
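
As a sketch, the “grouped by user, tallied daily” table might look something like this (the name and column types are assumptions, not our actual schema):

    -- Aggregate table for dimension 1 = user, dimension 2 = day.
    -- (Illustrative definition only.)
    CREATE TABLE agg_user_day (
        app_id    INT         NOT NULL,              -- which tool/service
        username  VARCHAR(64) NOT NULL,              -- dimension 1: user
        date      DATE        NOT NULL,              -- dimension 2: day
        count_all INT         NOT NULL DEFAULT 0,    -- number of uses
        PRIMARY KEY (app_id, username, date)
    );

The “user+hour by quarter” table looks much the same, with an hour-of-day column added to the key and the date holding the last day of the quarter (more on that in the appendix).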

In this way, we don’t have to wade through hundreds of thousands (or even millions) of raw event records in order to generate a report. We can go to the granularity-appropriate aggregate, and build reports from orders of magnitude less data.

Let the Database do the Heavy Lifting

To do this aggregation naively, we might iterate through event records and, for each, determine whether a corresponding aggregate record exists. If so, we would update it; if not, we would create it. This would mean at least two round trips to the database for each event record. For a few thousand event records, this works. For tens of millions, it’s not ideal.

To get around this, we employ a two-stage aggregation pipeline that lets us leverage the database for all of the intensive data manipulation. The flow is:

Tool -> Event Data -> First-Level Aggregates (FLA) -> Second-Level Aggregates (SLA)

Or, more concretely:

  1. Tools send event data to Elmo continuously (24x7x365). This event data is for specific points in time, down to the hour, minute, and second of the day.
  2. On a continuous basis, chunks of event data are turned into FLA records, which are at the granularity of a single day. So, for example, if John Doe uses perforce once in the morning and once at night, the end result is either one or two FLA records, depending on whether the event records are “near” each other in the incoming stream of event data: if they are, they’ll be aggregated at the same time, into the same FLA record; if not, they’ll be aggregated in separate passes and will wind up in two different FLA records. Either way, the combined “count” (i.e., the number of times John Doe is reported to have used perforce) is 2. (A sketch of this step follows the list.)
  3. On a slightly less frequent but still continuous basis, we aggregate FLA records into SLA records. However, while FLA records are for a particular day (a date, really), SLA records are for particular time periods (a date, a week, a month, or a quarter), and there’s one (and only one) SLA record for every unique combination of dimensions (from the previous discussion) and time scale.
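
One way to express that second step in SQL is an INSERT ... SELECT that groups a chunk of raw events by user and day. The table and column names below are illustrative, and our actual statements differ in the details:

    -- Roll one chunk of raw events up into daily, per-user FLA records.
    -- Each pass writes its own FLA rows; a later pass over the same
    -- user/day simply adds more rows, which the SLA stage merges.
    INSERT INTO fla_user_day (app_id, username, date, count_all)
    SELECT app_id, username, DATE(event_time), COUNT(*)
    FROM   events
    WHERE  id BETWEEN @chunk_start AND @chunk_end    -- this pass's chunk
    GROUP  BY app_id, username, DATE(event_time);

This is why John Doe’s morning and evening activity can land in one FLA row or two, but the counts always add up to the same total.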

Example: John Doe uses perforce throughout the 7th of July, and then again on the 30th of July, and the 2nd of August. His events are turned into:

  • One or more FLA records for July 7th.
  • One or more FLA records for July 30th.
  • One or more FLA records for August 2nd.

and subsequently:

  • One (and only one) “daily” SLA record for July 7th.
  • One (and only one) “weekly” SLA record for the week in which July 7th falls.
  • One (and only one) “monthly” SLA record for the entire month of July.
  • One (and only one) “quarterly” SLA record for the quarter that ends in September.

And so on.

Why is this helpful?

If John Doe is a prolific user of perforce (maybe he uses it from automated scripts, to the tune of 1,000 accesses per day), then a quarter’s worth of John’s activity results in 90,000 event records. Using raw event records to tally up John’s quarterly usage of perforce would require the database to scan 90,000 records. To do it using the SLA record would require just one query, to fetch a single record.

It’s true that scanning 90,000 records isn’t a big deal for a typical database. However, if you want to run a report that shows the usage for thousands of perforce users over an entire quarter, SLA records let you fetch one record per user rather than scan 90,000 per user.
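
Concretely, the difference is between the two queries sketched below (the table names, app_id, and dates are illustrative, as in the earlier sketches):

    -- Quarterly usage from raw events: scans on the order of 90,000 rows
    -- per prolific user.
    SELECT username, COUNT(*)
    FROM   events
    WHERE  app_id = 42
      AND  event_time >= '2013-07-01' AND event_time < '2013-10-01'
    GROUP  BY username;

    -- The same report from the quarterly SLA table: one row per user.
    -- (Quarters are stored as their last day; see the appendix.)
    SELECT username, count_all
    FROM   q_archive_user
    WHERE  app_id = 42
      AND  date = '2013-09-30';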

Summary

Use layers of abstraction and indirection to limit the impact on end users when service components need maintenance or go down. Organize your data, through aggregation in our case, for quick and efficient access. Finally, let your tools do what they do best.

Appendix: A Useful Aggregation Technique

The conversion of FLA records into SLA records is done with a variant of the SQL “INSERT” statement, INSERT ... ON DUPLICATE KEY UPDATE, that handles inserts and updates (“upserts”) seamlessly. This feature doesn’t appear to be used as widely as its usefulness would suggest.

The basic idea is as follows. This particular example is for the aggregation of user+hour data into the quarterly table. Ignore the extra columns like “interface” and “count_error”.
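
A sketch of such a statement follows. The target table and the columns called out below match the discussion; the FLA table name, the exact date arithmetic, and the ID placeholders are illustrative:

    -- Fold a range of FLA records into the quarterly user+hour table.
    INSERT INTO q_archive_user_hour
        (app_id, interface, username, hour, date, count_all, count_error)
    SELECT
        app_id,
        interface,
        username,
        hour,
        -- Store the last day of the quarter, so every FLA record from a
        -- given quarter maps onto the same target row (see point 1 below).
        LAST_DAY(MAKEDATE(YEAR(date), 1)
                 + INTERVAL (QUARTER(date) * 3 - 1) MONTH) AS quarter_end,
        SUM(count_all),
        SUM(count_error)
    FROM  fla_user_hour                          -- illustrative FLA table name
    WHERE id BETWEEN @START_ID AND @END_ID       -- the chunk of FLA records to process
    GROUP BY app_id, interface, username, hour, quarter_end
    ON DUPLICATE KEY UPDATE
        count_all   = count_all   + VALUES(count_all),
        count_error = count_error + VALUES(count_error);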

Of interest:

  1. This is the quarterly table, so we want the date that we store in the database to be the last day of the quarter. That way, no matter how many times John Doe uses perforce within a quarter’s time — and no matter when the data about his activity is sent to Elmo — only a single record is ever created in the “q_archive_user_hour” table. Also of note: we don’t read the source record from code, translate it, and write it back out. Instead, we let the database transform it on the fly.
  2. The “ON DUPLICATE KEY” clause handles the case where a target record already exists. The INSERT happens unless there’s a duplicate key (where the key is the combination of app_id, interface, username, and hour of day). If there’s a duplicate key, the existing target record is merely updated according to the UPDATE clause: the target record’s “count_all” field is set to its existing “count_all” value plus the new data’s “count_all” value. Again, nothing is read into code, massaged, and written back. It all happens in the database, and it all happens in a single pass.
  3. START_ID and END_ID identify the range of FLA records to aggregate. The beauty of this approach is that FLA records can be processed in chunks — in parallel, even — and the net result is still just a single record per combination of primary key values.
John Klassa
John has worked in the industry for over 25 years, with a focus on tools that help the business automate and streamline its internal processes. His current focus is NetApp’s private engineering cloud.
