
Valid Data

Obtaining valid data is not easy. Data is measured information, and the measurements are made with instruments: hardware designed to operate in a particular way. These instruments must be calibrated for precision and accuracy, the procedures for using them must be appropriate for the context in which they operate, and the measurements must be translated into parameters that humans can understand. A great deal of work goes into producing valid information in a format we understand, and much of that work we either know little about or take for granted. I'll go into all of these things in more detail later on, but for now I'll assume that the available data are in good shape.

As a preliminary to the data presentation effort, I want to make the point that pretty much all of my data analysis is done with the 'R' computer language. I've come to love it, mostly because of its flexibility, its support by a world-wide community of users and developers, and its very active development effort, which pretty much guarantees an up-to-date collection of tools. It has a bit of a learning curve, but there is a lot of support on the Internet, and almost anything you want to do can be found by Googling for it. I highly recommend learning to work with it. It rivals Python for scientific computing and is used heavily in academia as well as by large commercial companies. Don't be fooled by the phrase "statistical computing environment": it is much, much more than that. For example, it can solve all kinds of differential equations, and one can run simulations with it. I've done the Deep Creek Lake bathymetry with it; check it out here. Above all, it's free. To get you going, several resources are listed below in items 1-4.
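
To give a taste of R beyond statistics, here is a minimal sketch of solving a simple differential equation with the deSolve package. The equation, the decay constant, and the initial value are made up purely for illustration.

# Sketch: solve dy/dt = -k*y with the deSolve package.
# The constant k and the initial value are made up for illustration.
library(deSolve)

decay <- function(t, y, parms) {
  list(-parms["k"] * y)                 # return the rate of change as a list
}

times <- seq(0, 10, by = 0.1)           # times at which to report the solution
out   <- ode(y = c(y = 1), times = times, func = decay, parms = c(k = 0.5))

plot(out)                               # deSolve supplies a plot method for the result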

Data Extraction

Much of the data collected on this website resulted from processing other data sources. One large source is Brookfield itself. Unfortunately, their data are not currently available in digital form and must be extracted from monthly and annual reports. I typically work with their PDF files, which, unfortunately again, are quirky entities that each have to be processed separately.
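
When a report PDF allows text extraction at all, the copying can also be scripted. Below is a minimal sketch using the pdftools package; the file name is hypothetical, and the raw text will still need the cleanup described next.

# Sketch: pull the raw text out of a (hypothetical) monthly report PDF.
# Requires the pdftools package; the extracted text still needs cleanup.
library(pdftools)

pages <- pdf_text("monthly_report_2016_09.pdf")   # one character string per page
lines <- unlist(strsplit(pages, "\n"))            # split pages into individual lines

writeLines(lines, "monthly_report_2016_09.txt")   # save the raw text for cleanup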

I start by simply trying to copy and paste the data from the PDF file into a text editor. Sometimes this works very well and the data come across clean. Some hand editing may be required to remove artifacts introduced where the data cross from one page to the next.

If that is not possible, I have found that the best approach is to make a JPG of each page and use OCR to convert it to text. This often introduces translation errors that must be cleaned up. Usually this is easiest to do by hand, using some of the clever features that come with modern text processors; if the defect is very regular, I write a small R script to clean the data.
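
For a regular defect, the cleanup script can be as small as a few substitution calls. The sketch below shows the idea; the specific OCR mistakes and file names are only examples.

# Sketch: clean a few typical, regular OCR defects from the extracted text.
# The substitutions shown are examples; each document needs its own set.
raw <- readLines("monthly_report_2016_09.txt")

clean <- gsub("O(?=[0-9])", "0", raw, perl = TRUE)      # letter O misread as zero before digits
clean <- gsub(",(?=[0-9]{3})", "", clean, perl = TRUE)  # drop thousands separators in numbers
clean <- trimws(clean)                                  # strip leading and trailing blanks
clean <- clean[clean != ""]                             # drop empty lines

writeLines(clean, "monthly_report_2016_09_clean.txt")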

Occasionally, when documents are old and have been copied and copied again, a complete manual transcription is required. Fortunately, this does not occur very often.

The next step is to graph the data, as scatter plots, line plots, or polygons. This process often brings out a few errors in the data, which are then corrected.
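
A quick plot is usually enough to make a bad value stand out. A minimal sketch, assuming the cleaned file has a date column and a lake-level column (the column names are made up):

# Sketch: quick line plot with points to spot suspect values.
# The column names "date" and "level_ft" are made up for illustration.
dat <- read.table("monthly_report_2016_09_clean.txt",
                  header = TRUE, stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

plot(dat$date, dat$level_ft, type = "l",
     xlab = "Date", ylab = "Lake level (ft)")
points(dat$date, dat$level_ft, pch = 20)   # overlay points so outliers are easy to spot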

The final result is data in the form of a text file that anyone can read with any text editor or computer language.
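
Writing the checked data back out as plain text takes a single call in R; a sketch, with a hypothetical file name:

# Sketch: save the corrected data as a tab-separated plain text file.
write.table(dat, "deep_creek_levels.txt",
            row.names = FALSE, quote = FALSE, sep = "\t")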


Resources

1. What is R?
2. Beginner's guide to R: Introduction
3. R Tutorial
4. R-Studio



PLV: 9/16/2016
Updated: 10/17/2017
