RVM and Ruby 1.9 to test logstash grok patterns on Fedora/CentOS

I’ve been working on a personal project with Logstash lately, and it relies relatively heavily on grok filters for matching text and extracting matched parts. Today, I’ve been parsing syslog from Puppet to extract various metrics and timings, which will then be passed on from Logstash to Etsy’s statsd and then to graphite for display. Unfortunately, a few of my patterns are showing the “_grokparsefailure” tag and I just can’t seem to find the problem.

The logstash wiki provides a page on Testing your Grok patterns, as does Sean Laurent on his blog: Testing Logstash grok filters. Unfortunately, I work in a CentOS/RHEL shop, and we’re decidedly not a Ruby shop. Our Logstash install is using the monolithic/standalone Java JAR. We run Puppet, which is currently under ruby 1.8.7, and the jls-grok rubygem requires ruby 1.9. There’s no way I’d feel safe installing 1.9 on any of our machines, as they all run (and require) Puppet. So, I found out about RVM, the Ruby Version Manager, which allows you to run and switch between multiple ruby versions, and all of it is installed on a per-user basis. So, I created a new user on my Fedora 16 desktop called “rvmtest” and went about the process of setting up what’s needed to test grok patterns in the user’s local environment. I imagine this would work similarly under CentOS or RHEL, but the following is only tested on Fedora 16. If you have any issues, you should probably refer back to the RVM documentation.

  1. Create the isolated user, just to be extra careful. Log in as that user.
  2. As per Installing RVM: curl https://raw.github.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable
  3. edit your ~/.bashrc and add:
    [[ -s "$HOME/.rvm/scripts/rvm" ]] && . "$HOME/.rvm/scripts/rvm"
    [[ -r $rvm_path/scripts/completion ]] && . $rvm_path/scripts/completion

    The first line sets up RVM for your sessions, and the second sources in tab-completion for the rvm command.

  4. source .bashrc
  5. If you’re interested, you can see a list of all known rubies with: rvm list known
  6. Install Ruby (MRI) 1.9.2: rvm install 1.9.2
  7. “switch” to that ruby: rvm use 1.9.2 and confirm it by running ruby -v
  8. Make it the default ruby for us: rvm use 1.9.2 --default
  9. Create a “gemset” (set of rubygems for our environment): rvm gemset create groktest
  10. Use it, and set it as default: rvm use 1.9.2@groktest --default
  11. for grok testing, gem install jls-grok
  12. check that it’s there: gem list
  13. Download Logstash’s default grok patterns from github
  14. You should now be ready to test some grok patterns.
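Before moving on, a quick sanity check of the environment can save some head-scratching. The following is only a minimal sketch: the helper names ruby_new_enough? and grok_available? are made up for illustration, and it assumes (as the script further down does) that the jls-grok gem is loaded with require 'grok-pure'.

```ruby
require 'rubygems'

# Hypothetical helper: true when the interpreter is at least Ruby 1.9,
# which is what the jls-grok gem needs.
def ruby_new_enough?(version = RUBY_VERSION)
  Gem::Version.new(version) >= Gem::Version.new("1.9")
end

# Hypothetical helper: true when the library installed by
# `gem install jls-grok` can be loaded in the current gemset.
def grok_available?
  require 'grok-pure'
  true
rescue LoadError
  false
end

puts "ruby #{RUBY_VERSION}: #{ruby_new_enough? ? 'ok' : 'too old for jls-grok'}"
puts "grok-pure: #{grok_available? ? 'loadable' : 'not found (is the gemset active?)'}"
```

If the second line reports "not found", the most likely cause is that the shell is not using the 1.9.2@groktest gemset.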

While the two howtos linked above use irb to interactively test the patterns, I prefer something easier to move to production, more reliable, and more repeatable. The following quick little ruby script takes text to match against on STDIN (log files, messages, etc.) and prints the matches to STDOUT. The script is based on test.rb from jordansissel’s ruby-grok. Note one important thing here: I couldn’t get the shebang (#!) to work with anything other than the explicit path to my RVM ruby install (the output of which ruby), so you’ll need to manually update this yourself.

#!/home/rvmtest/.rvm/rubies/ruby-1.9.2-p320/bin/ruby
# ^ adjust to the absolute path printed by `which ruby` in your RVM environment
 
require 'rubygems'
require 'grok-pure'
require 'pp'
 
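# Load the pattern definitions downloaded from Logstash (step 13)
grok = Grok.new
grok.add_patterns_from_file("grok-patterns")
 
# The grok pattern under test
pattern = 'your_grok_pattern_here'
grok.compile(pattern)
puts "PATTERN: #{pattern}"
 
# Try the pattern against every line read from STDIN
while a = gets
  puts "IN: #{a}"
  match = grok.match(a)
  if match
    puts "MATCH:"
    pp match.captures
  else
    puts "No Match."
  end
end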

Here’s an example using a pattern to capture information from custom syslog messages triggered by updating puppet configs. Here are some sample messages:

[rvmtest@jantmanwork ~]$ cat puppet.log
Updated 2 files in puppet svn (environment prod) to revision 754
Updated 3 files in puppet svn (environment prod) to revision 756
Updated 1 files in puppet svn (environment prod) to revision 757

And the pattern that I use:

Updated%{SPACE}%{NUMBER:puppet_svn_num_files}%{SPACE}files%{SPACE}in%{SPACE}puppet%{SPACE}svn%{SPACE}\(environment%{SPACE}%{WORD:puppet_svn_env}\)%{SPACE}to%{SPACE}revision%{SPACE}%{NUMBER:puppet_svn_revision}

And the output of the script:

[rvmtest@jantmanwork ~]$ cat puppet.log | ./puppet-update-test.rb 
PATTERN: Updated%{SPACE}%{NUMBER:puppet_svn_num_files}%{SPACE}files%{SPACE}in%{SPACE}puppet%{SPACE}svn%{SPACE}\(environment%{SPACE}%{WORD:puppet_svn_env}\)%{SPACE}to%{SPACE}revision%{SPACE}%{NUMBER:puppet_svn_revision}
IN: Updated 2 files in puppet svn (environment prod) to revision 754
MATCH:
{"SPACE"=>[" ", " ", " ", " ", " ", " ", " ", " ", " ", " "],
 "NUMBER:puppet_svn_num_files"=>["2"],
 "BASE10NUM"=>["2", "754"],
 "WORD:puppet_svn_env"=>["prod"],
 "NUMBER:puppet_svn_revision"=>["754"]}
IN: Updated 3 files in puppet svn (environment prod) to revision 756
MATCH:
{"SPACE"=>[" ", " ", " ", " ", " ", " ", " ", " ", " ", " "],
 "NUMBER:puppet_svn_num_files"=>["3"],
 "BASE10NUM"=>["3", "756"],
 "WORD:puppet_svn_env"=>["prod"],
 "NUMBER:puppet_svn_revision"=>["756"]}
IN: Updated 1 files in puppet svn (environment prod) to revision 757
MATCH:
{"SPACE"=>[" ", " ", " ", " ", " ", " ", " ", " ", " ", " "],
 "NUMBER:puppet_svn_num_files"=>["1"],
 "BASE10NUM"=>["1", "757"],
 "WORD:puppet_svn_env"=>["prod"],
 "NUMBER:puppet_svn_revision"=>["757"]}

Hopefully this will make the process a bit simpler for someone else…
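For anyone curious what grok is doing with a pattern like the one above, here is a rough, dependency-free sketch of the idea: each %{NAME:field} reference is swapped for a regular expression wrapped in a named capture group. This is not the jls-grok implementation. The three pattern definitions are simplified stand-ins for the real entries in the grok-patterns file, and expand_grok and grok_fields are hypothetical names.

```ruby
# Simplified stand-ins for a few grok-patterns entries (assumption: the real
# definitions are more elaborate and can reference other patterns).
GROK_PATTERNS = {
  "SPACE"  => '\s*',
  "NUMBER" => '[+-]?\d+(?:\.\d+)?',
  "WORD"   => '\b\w+\b'
}

# Replace %{NAME} with a non-capturing group and %{NAME:field} with a
# named capture group, producing plain regular expression source.
def expand_grok(pattern)
  pattern.gsub(/%\{(\w+)(?::(\w+))?\}/) do
    name, field = $1, $2
    body = GROK_PATTERNS.fetch(name)
    field ? "(?<#{field}>#{body})" : "(?:#{body})"
  end
end

# Match one line and return { field => value }, or nil when nothing matches.
def grok_fields(pattern, line)
  m = Regexp.new(expand_grok(pattern)).match(line)
  m ? Hash[m.names.zip(m.captures)] : nil
end

PUPPET_SVN_PATTERN = 'Updated%{SPACE}%{NUMBER:puppet_svn_num_files}%{SPACE}files%{SPACE}in%{SPACE}puppet%{SPACE}svn%{SPACE}\(environment%{SPACE}%{WORD:puppet_svn_env}\)%{SPACE}to%{SPACE}revision%{SPACE}%{NUMBER:puppet_svn_revision}'

p grok_fields(PUPPET_SVN_PATTERN, "Updated 2 files in puppet svn (environment prod) to revision 754")
```

Unlike the real library, this sketch only returns the named fields and drops the unnamed SPACE and BASE10NUM captures, so it is useful for understanding a pattern but not a substitute for testing against jls-grok itself.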

Piwik Web Analytics, and some unfortunate stats about my blog

Back in March when I selected a new template for this blog, I posted that I was looking into open source self-hosted web analytics tools to replace Google Analytics. There were a few reasons for this; most importantly, it started from a discussion with some privacy-conscious coworkers, who said that they use NoScript and specifically block Google from tracking them (which also breaks Google Analytics). This was a serious issue for me, as I no longer process server-side logs but relied solely on Google Analytics for traffic information. So, I decided to try something other than Google and ended up settling on Piwik as my solution. I will say, in full disclosure, that the amount of information Piwik gives is a bit scary; I can watch users navigate this blog in realtime, and even the initial dashboard page gives a list of the most recent visitors, with their IP address, country of origin, browser, OS, and the pages they visited. However, my decision was made on two main points: first, that I wanted something which could use server-side PHP to log visits (albeit with a lot less information) of people who had JavaScript or tracking disabled, and second, that if someone is going to have such amazingly detailed information on my visitors, it should be me, so I can ensure that I’m the only person who has access to it and that it isn’t used for the wrong purposes.

Aside: The only revenue I get from this site is through Google AdSense, which isn’t a whole lot given the low traffic (certainly not enough to pay for the hosting). Other than that, I keep this blog to try and share my knowledge with others, and hope that someone else can find the solution to their problem here instead of doing the work that I did. So, I find analytics very helpful; I check my stats now and then, go back and update or add to the most popular posts, and try to write relevant posts if it seems like a lot of people are finding their way here for something slightly different than the actual post they landed on. Unfortunately, that last point isn’t as easy since Google switched to HTTPS Search for logged-in users on October 18th, 2011 – I can no longer use Piwik to see the search keywords that got Google users to my site. Luckily, these are still available through Google Webmaster Tools (via Traffic -> Search Queries on the left menu), though it adds an additional step and removes some of my motivation to check regularly and make sure people are getting useful content. Also, perhaps most importantly, it doesn’t let me associate a search query with other stats like time on page, so even if one search query was very popular, I have no way of knowing whether all those people actually read the page, or took one look at it and left.

I really like Piwik. I don’t use most of it terribly often, but it gives me a nice overview visits graph on the WordPress dashboard (via the WP-Piwik plugin), infinitely detailed information (most of which I haven’t even looked at) in the Piwik web interface, and nightly email reports of visits to the site. It also supports multiple sites, so I have it on my ancient wiki, my Redmine instance, and even ViewVC. I’d highly recommend it; it’s full-featured (beyond anything I can even comprehend, really).

I was recently looking through the stats for this blog, and came by some unfortunate, though not surprising, trends. Below is the graph of visits per day, from April 1, 2012 through today (August 26, 2012):

blog visits chart

  1. It’s probably not terribly unusual for a site with as much technical content as mine (and mostly professional stuff, not just for hobbyists), but my weekend traffic is usually a full 50% lower than weekday traffic. This can also be seen in the graph of visits by visitor’s local time, which is decidedly biased towards the 9am-5pm window:
    blog visits chart by visitor local time
    I guess there’s nothing I can really do about that, and it just gives me a nice maintenance window at 4am on Sunday mornings :)
  2. Looking at the overall graph, there also appears to be quite a bit of oscillation of the average visits over time. It’s nothing terribly large, but at a guess, I’d attribute it to my sporadic posting.
  3. Though it’s not visible in these graphs, this site has an 80% bounce rate (the percent of visitors that viewed only one page and then left the site). I guess that’s also not terribly unusual for a site with mostly how-to information on a wide variety of topics.
  4. To add a little more information to some of the previous items, here is the chart of my Feedburner RSS/Atom feed, since I started using Feedburner in February. The number of subscribers is in green, and the reach (number of people who actually clicked through to a post) is in blue:
    Feedburner stats
    This is a clear indication of something even stronger than the “bounce rate”: the apparently high number of people who subscribe to and then unsubscribe from my feed (if these stats are accurate). To me, this is an even stronger indication that what I really need to do is post useful content on a more regular basis – I have a tendency to blog in spurts, and either start a draft and never finish it, or write a few posts and set them to “pending” status with the intent of publishing them over a few days… and then forget the last part.

Puppet facter facts for syslog daemon type and version, symantec netbackup

I have a few more custom facts that I’ve added to my puppet-facter-facts github repository:

  • syslog_bin, syslog_type, and syslog_version – tell the absolute path to the running syslog binary, its short name (basename), and its version as a string. They currently only know about /sbin/syslogd and /sbin/rsyslogd.
  • has_netbackup – tests for presence of the /usr/openv/netbackup/bin directory, created by installation of Symantec Netbackup. Useful for making generation of include/exclude files conditional on having NetBackup installed.
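
To give a feel for how small facts like these can be, here is a rough sketch of the two simplest checks described above. It is not the code from the repository: the helper names netbackup_installed? and syslog_type_for are hypothetical, and the fact is only registered when the facter library happens to be installed.

```ruby
# Hypothetical helper: true when the NetBackup bin directory exists.
def netbackup_installed?(dir = "/usr/openv/netbackup/bin")
  File.directory?(dir)
end

# Hypothetical helper: short name of a syslog binary,
# e.g. "/sbin/rsyslogd" becomes "rsyslogd".
def syslog_type_for(bin_path)
  File.basename(bin_path)
end

begin
  require 'facter'
  # Register the result as a fact; the value is the string "true" or "false".
  Facter.add("has_netbackup") do
    setcode { netbackup_installed?.to_s }
  end
rescue LoadError
  # Facter is not installed here; the helpers above can still be called directly.
end

puts "has_netbackup: #{netbackup_installed?}"
puts "syslog_type for /sbin/rsyslogd: #{syslog_type_for('/sbin/rsyslogd')}"
```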

Hopefully some of these will be of use to someone else as well.

Puppet facter fact for all applied classes, returned as a CSV list

I’m unfortunately stuck, at least for the time being, using flat-file manifests to configure my puppet nodes. Without an ENC, it’s pretty difficult to get a good overview of what classes are used on each node, and what nodes use a given class. I know I could write up a simple web tool to do this (unfortunately, given my limited Ruby knowledge, it would have to be in PHP or Perl, not a real modification to Dashboard in Ruby). But where to get the data from?

After some research, I found a puppet fact for puppet classes on Matthew Nicholson’s Coffee & Beer blog. It parses /var/lib/puppet/classes.txt and returns the list of classes found as a JSON array. Great base, but I wanted something easier, that would be more easily parsed from its direct storage in MySQL. My modification to his code is only a few characters; I dropped out the JSON require, and return the classes as a CSV list. This lets me do easy LIKE '%,classname,%' SELECTs in MySQL, and also gives me the fact value stored in the puppet DB, so I can build a separate tool around that data. Thanks, Matt.

#
# facter fact for puppet classes on node, pulled from /var/lib/puppet/classes.txt
# from <http://sjoeboo.github.com/blog/2012/07/31/updated-puppet-facts-for-puppet-classes/>
#
 
require 'facter'
begin
        Facter.hostname
        Facter.fqdn
rescue
        Facter.loadfacts()
end
hostname = Facter.value('hostname')
fqdn = Facter.value('fqdn')
 
classes_txt = "/var/lib/puppet/classes.txt"
 
if File.exist?(classes_txt) then
        f = File.new(classes_txt)
        classes = Array.new()
        f.readlines.each do |line|
                line = line.chomp.to_s
                line = line.sub(" ","_")
                classes.push(line)
        end
        classes.delete("settings")
        classes.delete("#{hostname}")
        classes.delete("#{fqdn}")
        Facter.add("puppet_classes_csv") do
                setcode do
                        classes.join(",")
                end
        end
end
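
One thing worth checking when consuming this fact: the value is joined with commas only between the classes, so the first and last class are not surrounded by commas on both sides. The sketch below (the method names and the sample value are hypothetical) shows the difference between an exact membership test, a naive '%,name,%' style substring test, and the same test against a comma-padded value.

```ruby
# Exact membership test on the comma-separated fact value.
def class_in_csv?(csv, class_name)
  csv.split(",").include?(class_name)
end

# Mimics what a SQL LIKE '%,name,%' sees. With pad = true the value is
# wrapped in commas first, so the first and last entries can match too.
def like_match?(csv, class_name, pad = false)
  haystack = pad ? ",#{csv}," : csv
  haystack.include?(",#{class_name},")
end

csv = "apache,ntp,ssh"  # hypothetical fact value

puts class_in_csv?(csv, "apache")      # exact test finds the first class
puts like_match?(csv, "ntp")           # a middle entry matches either way
puts like_match?(csv, "apache")        # first entry is missed without padding
puts like_match?(csv, "apache", true)  # padding fixes that
```

On the MySQL side, the equivalent of the padded test would be something along the lines of CONCAT(',', value, ',') LIKE '%,classname,%', where value stands in for whatever column holds the fact.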

All of my facts are now available in a GitHub repository: https://github.com/jantman/puppet-facter-facts.

Puppet facter fact for last applied configuration version

For anyone else who sets the Puppet config_version parameter to return the current SVN or Git version of your configuration, here’s a fact that grabs that version (by parsing the cached YAML catalog) and sets it as a fact called “catalog_config_version”. It can then be used for sanity-checking your nodes, looking up via the Inventory Service, or you can display it in the Dashboard using my patch: Patch to Puppet Dashboard 1.2.10 to show arbitrary facts in the main node table.

#
# facter fact for last applied config version, skeleton from /var/lib/puppet/client_yaml/catalog/fqdn.yaml
#
 
require 'puppet'
require 'yaml'
require 'facter'
 
localconfig = ARGV[0] || "#{Puppet[:clientyamldir]}/catalog/#{ Facter.fqdn }.yaml"
 
unless File.exist?(localconfig)
  puts("Can't find #{ Facter.fqdn }.yaml")
  exit 1
end
 
lc = File.read(localconfig)
 
begin
  pup = Marshal.load(lc)
rescue TypeError
  pup = YAML.load(lc)
rescue Exception => e
  raise
end
 
if pup.class == Puppet::Resource::Catalog
        Facter.add("catalog_config_version") do
                setcode do
                        pup.version
                end
        end
else
        Facter.add("catalog_config_version") do
                setcode do
                        "unknown"
                end
        end
end

All of my facts are now available in a GitHub repository: https://github.com/jantman/puppet-facter-facts.

Setting emacs zone-mode based on path

At work, we do a fair amount of DNS updates. Our zone files are stored in subversion, and are named according to the domain (with no .zone extension). It’s a real pain when updating a few (or a few dozen) zones in Emacs, since I have to remember to “M-x zone-mode” so the serial gets automatically updated. Here’s a lisp snippet to put in your .emacs file that will set zone-mode for all files in any path matching the regex svn/named/zones-internal. I deliberately made it a relative path (or, really, any path containing that) so it would work for all of my team’s workstations, no matter where we have the svn repo checked out:

(add-to-list 'auto-mode-alist '("svn/named/zones-internal/" . zone-mode))

Many thanks to taylanub on #emacs on irc.freenode.net for helping with this.

Patch to Puppet Dashboard 1.2.10 to show arbitrary facts in the main node table

We use Puppet Dashboard at work to view the status of our puppet nodes. While it’s very handy, there’s one feature I really wanted: the ability to show the value of arbitrary puppet facts in the main node table on the home page. Specifically, the facts we use for environment (we have eng/dev, qa, prod, and test puppet environments), zone (physical location) and last applied configuration version. I’m not terribly experienced with Ruby, but I managed to muddle my way through a working patch to do this, along with options in the settings file to enable it and configure the facts. You’ll need to restart dashboard (or your web server) to change the facts, of course. The commit is currently available on github, but it doesn’t strictly follow the puppet-dashboard contributing checklist so I may have to redo it.

Here’s a screenshot:

Dashboard screenshot after patch

And here’s what the configuration section added to settings.yml looks like:

# Enables display of arbitrary node facts in "home" page node table, between node name and latest report time
enable_home_facts: true
 
# If enable_home_facts is true, the fact names and column headings to display. Simply repeat the following two line pairs
# as needed:
#- name: 'factname'
#  heading: 'heading text'
home_facts: 
- name: 'environment'
  heading: 'Env'
- name: 'zone'
  heading: 'Zone'
- name: 'catalog_config_version'
  heading: 'Cfg Ver'
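
As a rough illustration of how a section like this can be consumed, the sketch below parses the same YAML and turns it into a list of (fact name, column heading) pairs. It is not the code from the patch; home_fact_columns is a hypothetical helper, and the YAML is inlined only to keep the example self-contained.

```ruby
require 'yaml'

# The settings from above, inlined so the example runs on its own.
SETTINGS_YAML = <<-EOS
enable_home_facts: true
home_facts:
- name: 'environment'
  heading: 'Env'
- name: 'zone'
  heading: 'Zone'
- name: 'catalog_config_version'
  heading: 'Cfg Ver'
EOS

# Hypothetical helper: returns [fact_name, heading] pairs, or an empty
# list when the feature is switched off or no facts are configured.
def home_fact_columns(settings)
  return [] unless settings["enable_home_facts"]
  (settings["home_facts"] || []).map { |f| [f["name"], f["heading"]] }
end

settings = YAML.load(SETTINGS_YAML)
home_fact_columns(settings).each do |name, heading|
  puts "column '#{heading}' shows fact '#{name}'"
end
```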

If I feel really adventurous, I’d like to implement my other big wish, some sort of pop-up list of links, based on arbitrary facts (mainly hostname and fqdn) for each node – something where I can mouse over the node name/table cell, and see links (static URLs with node name/fqdn/other facts plugged in) to things like Nagios/Icinga, our backup system, etc.

Workflow for contributing to GitHub projects

Lately I’ve been contributing to some open source projects hosted on github. I’m pretty new to git, and the process is a bit confusing for beginners. So, here’s a sample workflow, based on The Foreman’s foreman github repository. Note that I’m developing against the “develop” branch of that repository, not the master, so that throws in a little difference that isn’t documented in most introductions. To throw in another wrench, I maintain a branch with the code that I’m currently actually using (i.e. the application code that I have checked out on the production server), called “jantman”. This is more or less composed of the upstream “develop” branch, with all of my finished (but not yet merged in the upstream) topic branches. I’m pretty sure all this is correct, but honestly, I’m still new enough at git that I can’t make any promises. Unfortunately, I haven’t had the time to really learn git, and I also can’t find a simple enough tutorial that covers all this…

  1. Fork the original repository through the GitHub interface.
  2. On your machine, clone your fork:
    git clone git@github.com:username/reponame.git && cd reponame
  3. Make sure you’ve set up automatic upstream tracking for new branches:
    git config --global branch.autosetupmerge true
  4. Add your upstream repo:
    git remote add upstream git://github.com/upstream_user/upstream_repo.git
  5. Fetch it and initialize any submodules:
    git fetch upstream && git submodule update --init
  6. Check the current branch (git branch, let’s assume it’s called “develop”) and rebase to its upstream:
    git rebase upstream/develop develop
  7. Create my “jantman” branch, which will be the upstream “develop”, plus my finished work merged into it:
    git checkout -b jantman origin/develop
  8. Create a topic branch to do some work:
    git checkout -b NewBranchName jantman
  9. Periodically, push the topic branch to github:
    git push origin NewBranchName
  10. If you commit to this branch from another computer (or someone else commits to it), periodically update your local tracking branch:
    git pull origin NewBranchName
  11. Periodically, you want to pull in the upstream changes:
    1. switch to the develop branch:
      git checkout develop
    2. grab the latest version of the upstream git repo:
      git fetch upstream
    3. rebase develop to mirror the upstream develop branch:
      git rebase upstream/develop develop
    4. switch to our personal branch:
      git checkout jantman
    5. rebase our personal branch onto develop (pull all the new commits from develop into our personal branch):
      git rebase develop jantman
    6. If we want those new upstream changes to continue down to our topic branches:
      git rebase develop topicBranchName
  12. When we’re done with a topic branch, we want to merge it into our “personal” branch:
    git checkout jantman; git merge --squash topicBranchName

    and then commit:

    git commit

    The --squash will squash all the history of that branch down to one commit. This is generally easier for integration into upstream, and assuming the topic branch was created for a single feature or bug, should be logical.

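  13. If we’re sure we don’t need it anymore, delete the topic branch from our local machine (after a squash merge git may report that the branch is not fully merged, in which case -D forces the deletion):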
    git branch -d topicBranchName

    and from github:

    git push origin --delete topicBranchName
  14. Finally, make sure we push our “personal” branch back to origin:
    git push origin jantman
  15. Assuming all went well, you’ll see the new commit on github, and have a nice pull request button.
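
Since the periodic "pull in upstream changes" routine from step 11 is the part that gets repeated most, it can be handy to generate the command sequence rather than retype it. The following is just a sketch: upstream_sync_commands is a hypothetical helper, it assumes the remote is called upstream as in step 4, and it only prints the commands instead of running them.

```ruby
# Build the command list for step 11 without executing anything.
# personal_branch: the long-lived personal branch (e.g. "jantman")
# topic_branches:  topic branches that should also pick up upstream changes
# base:            the upstream development branch being followed
def upstream_sync_commands(personal_branch, topic_branches = [], base = "develop")
  cmds = [
    "git checkout #{base}",
    "git fetch upstream",
    "git rebase upstream/#{base} #{base}",
    "git checkout #{personal_branch}",
    "git rebase #{base} #{personal_branch}"
  ]
  topic_branches.each { |topic| cmds << "git rebase #{base} #{topic}" }
  cmds
end

puts upstream_sync_commands("jantman", ["topicBranchName"])
```

Printing rather than executing is deliberate: a rebase can stop on conflicts, so it seems safer to review the list and run the commands one at a time.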

References:

  1. Contribute — Doctrine-Project
  2. Github – Quicksilver Wiki
  3. Contributor Workflow with Github · carmaa/inception Wiki
  4. Help.GitHub – Fork A Repo

Easily comparing a bunch of files in one directory

So I pulled a specific configuration file (rsyslog.conf) off of a LOT of hosts. I’m going to be managing it with Puppet, but before I do, I need to know what’s out there already lest it get overwritten. I used pssh with cat and an output directory to grab the file from all 30 servers in question. Now, I’ve got a directory with 30 files in it, and I need to figure out how many different files (by contents) there are, and which ones differ.

find . -type f -exec md5sum '{}' \; | sort | uniq -d -w 36

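This will check the contents of each file by MD5 checksum and, for each group of files with identical contents, print out the (lexicographically) first file in the group along with its MD5 sum. Because of the -d flag, a file whose contents are unique is not printed at all. You can also strip off the uniq command, and see the full list sorted by md5.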
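
If you would rather see every group, including files that have no twin, along with a count per group, the same idea can be sketched in a few lines of Ruby. group_by_md5 is a hypothetical helper, and the host names and file contents in the demo are made up so the example can run on its own.

```ruby
require 'digest/md5'
require 'find'
require 'tmpdir'

# Group every regular file under dir by the MD5 of its contents.
# Returns { md5_hex => [paths in lexicographic order] }.
def group_by_md5(dir)
  files = []
  Find.find(dir) { |path| files << path if File.file?(path) }
  files.sort.group_by { |path| Digest::MD5.file(path).hexdigest }
end

# Demo on throwaway data (hypothetical host names and contents).
samples = {
  "host1" => "*.* /var/log/messages\n",
  "host2" => "*.* /var/log/messages\n",
  "host3" => "*.* @loghost\n"
}

Dir.mktmpdir do |dir|
  samples.each do |name, body|
    File.open(File.join(dir, name), "w") { |f| f.write(body) }
  end

  group_by_md5(dir).each do |sum, paths|
    names = paths.map { |path| File.basename(path) }.join(", ")
    puts "#{paths.size} file(s) with md5 #{sum}: #{names}"
  end
end
```

For a real run, replace the demo at the bottom with a call such as group_by_md5(".") from inside the directory holding the collected files.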

A GUI alternative would be to use fslint, which is a graphical tool that can (among other things) display a list of the duplicate files within a path or set of paths.

Dear Mom and Dad – or, a book about what I actually do

I’ve followed Tom Limoncelli’s blog for quite a while; his books The Practice of System and Network Administration and Time Management for System Administrators were infinitely helpful in the early days of my professional life, and are among the few (literally, 4 or 5) books that live on my desk. His insight and information into the soft skills of SA work – time management, hiring, working in teams, etc. – is not only excellent, but also all too rare in a largely technical field.

Anyway, Tom posted the below article to his blog about a book that recently came out, “Taming Information Technology: Lessons from Studies of System Administrators” by Eser Kandogan, Paul Maglio, Eben Haber and John Bailey. I haven’t read the book yet, and at $56, it’s going to be a while before my book budget recovers enough to justify it. But going on what I’ve read from Tom and others, I want it. Not only do I want to read it, but I want to pass it around to my parents and in-laws and everyone else who has asked what I do for a living, and I found myself at a loss for a less-than-6-hour-long explanation. So, here’s what Tom wrote on it:

Dear Mom And Dad,

Many times I’ve tried to explain to you what I do for a living. “Computer system administrator” or “sysadmin” is a career that is difficult to explain and I’m sure my attempts have left you even more confused. I have good news. Oxford University Press has just published a book by 4 scientists who video taped sysadmins doing their job, analysed what they do, and explains it to the non-computer person. They do it by telling compelling stories of sysadmins at work plus they give interesting analysis with great insight.

Why did they do this? Because businesses depend on technology more and more and that means they depend on sysadmins more and more. Yet most CEOs don’t understand what we do. The scientists made some interesting discoveries: that our jobs are high-stress, high-risk, and highly collaborative. We invent our own tools, often on the spot, to solve complex problems. We are men and women of every age group. It is a career unlike any other. These are things that most people don’t know about our profession. The book is very engaging: Some of the chapters read like the opening scene of “Indiana Jones”; others like “Gorillas in the Mist.” Kandogan, Maglio, Haber and Bailey have put together a very serious, scientific book with care and compassion.

I’m not one of the sysadmins they studied but every story they tell reminds me of real experiences I have had.

I hope you enjoy reading this book. I know I did.

Pre-order it here: http://www.amazon.com/dp/0195374126/tomontime-20

Sincerely your son,
Tom

P.S. In all seriousness, I read a preview copy of this book and highly recommend it to others. You may have seen the authors speak at Usenix LISA or LOPSA PICC conferences where they showed clips of the video tapes they made. The book conveys the same stories, plus many more, with interesting analysis. If you think that the profession of system administration would benefit from non-sysadmins better understanding what we do, I highly recommend you pre-order this book and share it. You can pre-order it here: “Taming Information Technology: Lessons from Studies of System Administrators” by Eser Kandogan, Paul Maglio, Eben Haber and John Bailey

More about the book here: http://everythingsysadmin.com/2012/07/kandogan.html

If you have any interest, I encourage you to go out and buy the book. If you know someone who’s an SA, you should buy them the book. If you can justify any sort of book budget at work, you should buy the book. And while you’re at it, if you haven’t read Tom’s other books, you should buy those too. You might be in the unfortunate position – like I am – of probably never being able to implement most of his suggestions at work, but at least you’ll be aware of them…