Histograms in R

Post date: Sep 18, 2011 6:50:54 PM

I was playing with time-series histograms in R. Here are a few quick notes. Assume that the file to read has rows of the form (date, response_time_ms), for example:

20100923123456.789 777

You can read this with

> data <- read.table("myfile.dat", colClasses=c("character", "numeric"))

This forces the timestamp to be read as a character string, which makes the next steps easier. Now give the columns names, convert the timestamp column to date-time objects, and convert the response time from milliseconds to seconds.

> names(data) = c("ts", "rt")

> data$ts = as.POSIXct(data$ts, tz="GMT", format="%Y%m%d%H%M%OS")

> data$rt = data$rt / 1000
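If you want to try these steps without a file on disk, here is a minimal sketch. The three rows are made up for illustration, and it assumes a version of R whose read.table() has the text= argument.

```r
# Hypothetical sample rows in the (timestamp, response_time_ms) layout
sample_lines <- c(
  "20100923123456.789 777",
  "20100923123510.250 1250",
  "20100923124501.000 90"
)

# text= lets read.table() parse strings instead of a file
data <- read.table(text = sample_lines,
                   colClasses = c("character", "numeric"))
names(data) <- c("ts", "rt")

# %OS accepts fractional seconds on input
data$ts <- as.POSIXct(data$ts, tz = "GMT", format = "%Y%m%d%H%M%OS")
data$rt <- data$rt / 1000   # milliseconds -> seconds

str(data)

# The fractional part of each timestamp survives the conversion
as.numeric(data$ts) %% 1
```

Reading the timestamp as a string matters here: the format string relies on the fixed-width digits, which would not be preserved if the column were parsed as a number.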

For the rest of this I am only going to deal with the timestamp (data$ts). I could attach(data) so I didn't have to write data$ all the time, but this time I'll keep it simple.

> hist(data$ts, "hours", freq=TRUE)

This plots the histogram using hourly buckets. The available units are c("secs", "mins", "hours", "days", "weeks", "months", "years", "quarters"). I also switched to frequency (the count in each bucket) with freq=TRUE. By default you get density, i.e. the bars are scaled to show proportions rather than raw counts.
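To see what the hourly buckets contain without drawing anything, you can ask hist() for the histogram object instead. A small sketch with made-up timestamps (the variable names ts and h are mine):

```r
# Hypothetical events: three in the 12:00 hour, one at 13:30, two after 14:00
ts <- as.POSIXct(c("2010-09-23 12:05:00", "2010-09-23 12:20:00",
                   "2010-09-23 12:55:00", "2010-09-23 13:30:00",
                   "2010-09-23 14:10:00", "2010-09-23 14:40:00"),
                 tz = "GMT")

# plot = FALSE returns the histogram object without opening a plot
h <- hist(ts, "hours", plot = FALSE)

# Bucket edges, converted back from seconds to readable times
as.POSIXct(h$breaks, origin = "1970-01-01", tz = "GMT")

# Number of events in each hourly bucket
h$counts
```

The counts here are the heights that freq=TRUE puts on the y axis.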

As it turns out, you can also replace "hours" with a number of buckets (R calls them breaks) to get roughly that many bars; as far as I can tell R treats the number as a suggestion and picks round break points. Or you can pass a vector of break points, and they don't have to be equally spaced. Specifying each break point is tedious, so I wrote a function, break.sec(x, sec), that creates breaks at a fixed number of seconds across the input data. Pass your time data and the number of seconds to round to and it returns a vector of times.

> break.sec(data$ts, 300)

[1] "2010-09-29 13:20:00 GMT" "2010-09-29 13:25:00 GMT"

[3] "2010-09-29 13:30:00 GMT" "2010-09-29 13:35:00 GMT"

etc

You can pass this to hist():

> hist(data$ts, break.sec(data$ts, 300))
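The breaks do not have to come from break.sec(). As mentioned earlier, any vector of times works, even with unequal spacing, as long as it covers the whole range of the data. A sketch with made-up timestamps and hand-picked break points:

```r
# Hypothetical events between 12:00 and 15:00 GMT
ts <- as.POSIXct(c("2010-09-23 12:05:00", "2010-09-23 12:20:00",
                   "2010-09-23 12:55:00", "2010-09-23 13:30:00",
                   "2010-09-23 14:10:00", "2010-09-23 14:40:00"),
                 tz = "GMT")

# Hand-written breaks: one narrow 10-minute bucket, then two wide ones
uneven <- as.POSIXct(c("2010-09-23 12:00:00", "2010-09-23 12:10:00",
                       "2010-09-23 14:00:00", "2010-09-23 15:00:00"),
                     tz = "GMT")

# plot = FALSE just returns the counts per bucket
h_uneven <- hist(ts, uneven, plot = FALSE)
h_uneven$counts
```

With unequal buckets the raw counts are harder to compare by eye, which is one reason to prefer evenly spaced breaks for anything you plan to show to other people.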

You can also change the time zone used for the x axis by converting the breaks to a POSIXlt object in the zone you want (POSIXlt really understands tz).

> hist(data$ts, as.POSIXlt(break.sec(data$ts, 3600), tz="MST7MDT"))

Note that there are two POSIX time classes. POSIXct holds the number of seconds since the epoch (as a double); it carries a time zone attribute, but that only affects how the value is printed, not the stored number. The other is POSIXlt, which is like a C struct tm, with separate fields for sec, min, hour, ..., isdst. Its fields do depend on its time zone.
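The difference is easy to see at the console. A quick sketch, assuming your system's time zone database knows the zone name "MST7MDT" used above (the variable names are mine):

```r
ct <- as.POSIXct("2010-09-29 13:20:00", tz = "GMT")

# POSIXct is one number: seconds since 1970-01-01 00:00:00 UTC
as.numeric(ct)

# POSIXlt splits the same instant into calendar fields for a given zone
lt_gmt <- as.POSIXlt(ct, tz = "GMT")
lt_mtn <- as.POSIXlt(ct, tz = "MST7MDT")

lt_gmt$hour    # hour of day in GMT
lt_mtn$hour    # hour of day in US Mountain time (daylight time on this date)
lt_mtn$isdst   # positive when daylight saving time is in effect

# Converting back gives the same instant, whatever zone the fields were in
same_instant <- as.numeric(as.POSIXct(lt_mtn)) == as.numeric(ct)
same_instant
```

This is why converting the breaks to POSIXlt in another zone relabels the axis without moving any of the data.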

To use the function, put it in a file (e.g. break.sec.r) and run source("break.sec.r"). Over time I'll collect some of these and put them in a library so they can all be loaded in one command.

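break.sec <- function(x, sec) {
    # x: POSIXct or POSIXlt times; sec: bucket width in seconds
    if (!inherits(x, "POSIXt")) stop("x must be a POSIXct or POSIXlt object")
    # round the earliest time down to a multiple of sec since the epoch
    start <- as.numeric(as.POSIXct(min(x, na.rm = TRUE)))
    start <- as.POSIXct(floor(start / sec) * sec, tz = "GMT",
                        origin = ISOdatetime(1970, 1, 1, 0, 0, 0, tz = "GMT"))
    maxx <- max(x, na.rm = TRUE)
    # generate breaks past the latest time, then keep just enough to cover it
    incr <- sec - 0.01
    breaks <- seq(start, maxx + incr, by = sec)
    breaks[seq_len(1L + max(which(breaks < maxx)))]
}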
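Here is a quick way to convince yourself the function does what it says. This sketch restates break.sec() in compact form so it runs on its own, then checks it against a few made-up timestamps (ts, b and the checks are mine, not part of the function):

```r
# Compact restatement of break.sec() so this block is self-contained
break.sec <- function(x, sec) {
  if (!inherits(x, "POSIXt")) stop("x must be a POSIXct or POSIXlt object")
  start <- floor(as.numeric(as.POSIXct(min(x, na.rm = TRUE))) / sec) * sec
  start <- as.POSIXct(start, tz = "GMT", origin = "1970-01-01")
  maxx <- max(x, na.rm = TRUE)
  breaks <- seq(start, maxx + (sec - 0.01), by = sec)
  breaks[seq_len(1L + max(which(breaks < maxx)))]
}

# Hypothetical events spread over about twelve minutes
ts <- as.POSIXct("2010-09-29 13:21:07", tz = "GMT") +
  c(0, 45, 200, 410, 415, 700)

b <- break.sec(ts, 300)
b

# Every break falls on a multiple of 300 seconds
aligned <- all(as.numeric(b) %% 300 == 0)

# The breaks enclose all of the data, which hist() requires
covered <- min(b) <= min(ts) && max(ts) <= max(b)

# Events per 5-minute bucket, without drawing
counts <- hist(ts, b, plot = FALSE)$counts
counts
```

Because the start is floored relative to the epoch, the buckets line up on round clock times (13:20, 13:25, ...) no matter when the first event happened.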

And here is a way to hack the Y axis. For example, if you use 5-minute (300-second) buckets, you probably still want the scale to be events/sec or events/hour so it is easier for the viewer to understand, but hist() will label it as events per 5 minutes. R's plotting functions, hist() included, leave the x/y extents of the plot region in the graphics parameter 'usr' (see par("usr")). If you suppress the y axis during the initial draw call, scale the y values in usr, and then draw the y axis, you can scale it to anything you want. First, a function to scale the parameters:

rescale.y <- function(f) {
    u <- par("usr")
    u[3] <- u[3] * f
    u[4] <- u[4] * f
    par(usr = u)
}
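To see what rescale.y() does to the plot coordinates, here is a small sketch on an ordinary scatter plot. It draws to a null PDF device so it runs without a display; the data and the names before and after are made up for the demonstration.

```r
# Same function as above, repeated so this block runs on its own
rescale.y <- function(f) {
  u <- par("usr")
  u[3:4] <- u[3:4] * f      # scale only the y extents
  par(usr = u)
}

pdf(NULL)                    # null device: nothing is written to disk
plot(1:10, (1:10) * 5, yaxt = "n", ylab = "scaled units")

before <- par("usr")         # c(x1, x2, y1, y2) of the plot region
rescale.y(2)
after <- par("usr")

axis(2)                      # tick labels now follow the doubled extents
rbind(before, after)
dev.off()
```

The points themselves do not move, because they were already drawn. Only the coordinate system that axis() uses for its labels has changed, and that is exactly the trick.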

Now create the histogram with half-hour buckets and a y axis indicating hourly counts. Each bucket covers half an hour, so scaling by 2 turns events per bucket into events per hour.

> hist(data$ts, as.POSIXlt(break.sec(data$ts, 1800), tz="MST7MDT"), freq=TRUE,

col="blue", main="Event Count", ylab="Events per Hour",

xlab="Time of Day (MST7MDT)", yaxt='n')

> rescale.y(2)

> axis(2) # draw the left y axis

> grid(nx=NA,ny=NULL)

And you will get something like:
