Wednesday, October 13, 2010

Data Science, Science Data

The latest Nature News has an article titled Computational Science: ... Error, describing the increasing difficulty scientists face as the computer programming required for research becomes more and more complex. I face a similar problem in my work. Much of what I do requires a fair amount of database management as a precursor to the analysis. Most of this is basic SAS programming, with some tricky bits here and there, but it is all within reach of my programming skills. Most researchers don't have my programming skills (and, with a BS in Comp Sci, many statisticians don't have my programming skills either), which is one of the reasons they might come in for statistical help in the first place. Database management is an important skill for an applied statistician, but it is not my primary skill ("Dammit Jim, I'm a statistician, not a bricklayer!").

The point of this is not to blow my own horn, but to note that I have a set of skills for managing databases that is nearly independent of my statistical knowledge. The Nature News article points out the problems with programming skills, but the same problem exists with database skills: some researchers don't understand the basics of recording data in an organized manner, and disorganized data can lead to as many problems as disorganized programming.

It is not too unusual for researchers to bring me data (typically in a spreadsheet), and sometimes I spot specific problems that point to errors in how they collected and recorded the data. This is fairly important, because if the data is wrong then my analysis will be too. Sometimes I can fix these errors for them; other times I have to have the researcher fix the problems, because making the correction requires medical knowledge and familiarity with (or access to) the original data source. Once these bugs have been ironed out, all is well and I do my statistical thing.
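The obvious problems usually surface in a first pass of simple checks. As a minimal sketch of what that pass looks like in SAS (the dataset STUDY and the variables SEX, AGE, and WEIGHT_KG are hypothetical names, not from any real project):

```sas
/* A minimal sketch of first-pass data checks; the dataset STUDY
   and the variables SEX, AGE, and WEIGHT_KG are hypothetical. */

/* Frequency tables expose impossible or inconsistent codes,
   e.g. SEX recorded as 'M', 'F', 'm', and '2' in one column. */
proc freq data=study;
  tables sex / missing;
run;

/* Simple summaries flag out-of-range values, such as a
   negative age or a 700 kg adult. */
proc means data=study n nmiss min max;
  var age weight_kg;
run;

/* List the suspect records so the researcher can check them
   against the original data source. */
proc print data=study;
  where age < 0 or age > 110 or weight_kg > 300;
run;
```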

There is another sort of error, though, and it is much more subtle: errors in the data that don't really look like errors. When someone brings me their data and there is nothing obviously wrong, I probably won't question it, and will proceed with the analysis. There are some common ways this can happen: cut-and-paste errors, "bad" sorts that scramble the data, inconsistency in data entry, all simple mistakes. Sometimes evidence of these errors shows up during my database management prep work or during the analysis itself, and obviously if a mistake is found, it gets fixed. However, if my experience with finding errors in the late stages of analysis is any indicator, it seems likely that some of these errors are never found. The "garbage in, garbage out" principle applies, and some of the analyses I've produced were likely garbage, because the data was garbage.
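A few cheap, defensive checks can surface some of these quiet errors before the analysis starts. Here is a hedged sketch, again with hypothetical names (STUDY, the key fields SUBJECT_ID and VISIT, and two treatment-code fields TRT_FORM1 and TRT_FORM2), of the kind of thing I mean:

```sas
/* Sketch of checks for "quiet" errors; all dataset and
   variable names here are hypothetical. */

/* A bad sort or a cut-and-paste slip often duplicates or drops
   key values. NODUPKEY with DUPOUT= siphons the duplicates into
   a separate dataset instead of silently keeping them. */
proc sort data=study out=study_clean dupout=study_dups nodupkey;
  by subject_id visit;
run;

/* If STUDY_DUPS has any rows, something upstream went wrong. */
proc print data=study_dups;
  title 'Records with duplicated SUBJECT_ID / VISIT keys';
run;

/* Cross-tabulating two fields that must agree (e.g. a treatment
   code entered on two different forms) exposes inconsistent data
   entry: any off-diagonal cell is a record to investigate. */
proc freq data=study_clean;
  tables trt_form1*trt_form2 / missing norow nocol nopercent;
run;
```

The point is not these particular checks, but that duplicate keys and disagreeing fields are symptoms worth testing for mechanically rather than trusting the eye to catch.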

The good news is that this sort of error is unlikely to contribute much to the larger body of scientific knowledge. By the nature of statistics (and with an assumption of some randomness), these subtle errors are unlikely to produce significant results, less likely to be in agreement with other published studies, and certainly unlikely to be verified by follow-up studies. The bad news is that some simple, perhaps even careless, mistakes can ruin months or even years of research effort.

Finally, this brings me to that other set of skills: teaching. Whenever I have the opportunity to work with people who are starting off on new research projects, I try to teach the basic data skills, the do's and don'ts, to help them get good data and do good research. Not everyone is interested in spreadsheets and databases, but it is not too hard to convince researchers that a little extra effort up front to get good data will pay dividends down the road when it comes to publications. It certainly pays me dividends when it comes to actually doing the statistical analysis (my primary skill) rather than spending hours (or days, or weeks) trying to track down what went wrong with the data, or unknowingly analyzing junk data.

3 comments:

  1. +1 for the bricklayer allusion. :)

  2. Ooooooouuuuuugh, you gave me the shivers. Spreadsheets, that is heresy to me.

    I often get students bringing data where the experiment was never really designed; it's kind of "let's sample any way we can think of, put the whole thing into a spreadsheet, and prove ... prove what?" And then the explanations are so hard to follow. Shouldn't they first have a precise question, or one as precise as possible, then set controls and variable(s), then ... take a whole course on experimental design?

  3. I don't mind the spreadsheets so much, but they do present a variety of opportunities to mutilate data (that's a topic for another day). Like your students, I occasionally get people showing up at my office and asking, in so many words, "I've got my data! What do I do now?"

    *HEAD* <---> *DESK*
    *SMACK* *SMACK* *SMACK*

    Fortunately that doesn't happen too much any more, largely because we've gotten the word out that design help is available. It's been a long haul getting to this point though.
