TSDI - Unit 4 - Expert Panel #2

Video Transcript

Hello, I'm Hollylynne Lee, and I'm here with my colleague and friend Webster West. Webster is a professor of statistics here at North Carolina State University, and we're going to be talking with him today about some of his experiences teaching at the introductory statistics level, particularly using a tool that he has developed called StatCrunch. He's been very involved in thinking about the design of curriculum materials, as well as the use of technology tools, throughout his career, and we're going to benefit from that experience and hear a little bit from him.

Webster, you are the developer of a software tool called StatCrunch, and I know that you have

thought very carefully about how to design and use your software in doing statistics and

developing curriculum materials around that. One of the things I love, as I've mentioned, is that I've been in your Intro to Statistics courses and I see you using this almost on a daily basis with your students. They come to class with any mobile device, whether it's a laptop, a phone, or a tablet, and they're able to interact with this software in that format, both analyzing data you have given them and entering data in different tasks where you're collecting data from them. I just love seeing that in that Intro to Statistics course. I have opened one of the datasets that you use, and we're going to take a look at how you engage your students with this dataset and the ways that you can use a tool like StatCrunch to actually have students think about interesting questions, tools for analysis, and interpretations they might do with a given dataset. Tell me a little bit about this dataset.

This dataset is one that I am personally interested in, because at one point in my life I lived close

to Lake Travis. This is a dataset about Lake Travis, which is outside of Austin, Texas, in the

suburbs of Austin, Texas. It’s a beautiful lake. It’s in the hill country of Texas, very nice blue

water and has a limestone bottom, so it’s a beautiful spot. The only issue with Lake Travis is that

it’s a variable depth lake. There’s a dam on one end of the lake and water’s released from the

lake to provide water for irrigation for farmers downstream. It’s an interesting lake, because you

never know how much water’s going to be there. When it’s there it’s nice, but when it’s not it

can look a little rough around the edges. You see more and more of the shoreline and things like

that. It's an interesting place, and as I said I lived there, close by, and was interested in the data relative to the lake. I encourage my students to do this type of thing as well. You can find relevant data sources about the world very close to you pretty easily now with the internet.

Here in North Carolina we might look at Lake Jordan.

We could, that's right. I don't know if it's variable depth; it's probably variable, but not quite as variable as Lake Travis. We'll see.

 Teaching Statistics Through Data Investigations – Spring 2015 Page 1

Tell me a little bit about the numbers that I am seeing in this dataset, and what was actually

measured.

This is an interesting dataset, first of all because of its format. When you look at the data here, a naive student would initially think, oh, there are 13 variables here.

Because I see 13 columns.

13 columns, right. That’s one of the things you always want to discuss with your students, how

many variables have actually been measured. In this case it's really only three variables. What we have here are monthly measurements, taken over the course of each year, January through December, of the water level of Lake Travis, and that's feet above sea level that you see here. These measurements are generally in the 600 or so range. We have the data broken

out in a very specific way here, we’ve got the year in the first column. We’ve got 1943 to 2011. I

haven’t updated this dataset since I moved from the area, I guess I lost interest. We have year in

that column that’s one of the variables, then we have the month of the measurement, but the

data’s broken out in a different way here. The months are spread across the next 12 columns,

January through December. An alternate format would be to have month as the title of the

column and have January through December repeated many, many times. Then have the actual

measurement for the height of the lake, the level of the lake, in a third column. This is a very common format that you see. One reason I wanted to discuss this dataset today is because it's the most common format of data that is accessible on the internet. Seeing it

presented in this way is also very common.
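The wide-versus-long distinction Webster describes can be sketched in code. Below is a minimal Python sketch, not StatCrunch itself, of reshaping a wide table (one column per month) into long format (year, month, level); the column names and values are invented for illustration:

```python
# Sketch: reshaping "wide" lake-level data (one column per month) into
# "long" format (year, month, level). The sample values are made up.

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def wide_to_long(rows):
    """rows: list of dicts like {"Year": 1943, "Jan": 672.1, ...}.
    Returns a list of (year, month, level) tuples."""
    long_rows = []
    for row in rows:
        for month in MONTHS:
            long_rows.append((row["Year"], month, row[month]))
    return long_rows

# Tiny illustrative table (not the real Lake Travis data).
wide = [{"Year": 1943, **{m: 670.0 + i for i, m in enumerate(MONTHS)}}]
long_data = wide_to_long(wide)
print(len(long_data))   # 12 rows: one per month
print(long_data[0])     # (1943, 'Jan', 670.0)
```

In the long format, "month" becomes an explicit variable, which is what makes the three-variable structure of the data visible.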

I have gone to different weather sites, or to NOAA, trying to look at sunset data or things like that, and this is the type of format that I see. If you think about what kind of software package you have access to, and the format of that data, I think you're bringing up really good points. You have to think about: what is being measured? How is this formatted? How am I going to deal with this format in the analysis that I am going to do?

That's right. It's an interesting dataset because of the types of statistical topics you can discuss as well. Technically it's a time series, so many of my colleagues in the Statistics Department would

apply some very advanced time series techniques with this data, but even at the introductory

level you can do some very interesting things and we’ll touch on some of those concepts that I

think are very enlightening for the students.

Well I hope you’re going to tell us a little more about that. What kinds of questions would you

pose for your students to answer, or do you think students might pose with this dataset?

What are some good questions? It’s what I was talking about before, ask good questions. I would

put this out to my class, showing them this on the screen, what are some good questions here? A


natural question based on the format of the data: how does the water level vary over the course of the year? If you're thinking about visiting this area for a vacation or something like that, when am I likely to have water? If I'm going to use a boat or something like that, then you want to visit

when there is more water in the lake. That would be an obvious question. To quantify that a little

more you might want to say when is the water level typically at its highest point? Touching on

that idea of typical. What’s a typical measurement in January, a typical level in February, March

and so on. Which one of those typical levels is the highest?

I’m looking at 12 typical values and trying to think about which of those 12, which of those

months am I going to have on average the highest water level?

That's absolutely right. Then of course, in addition to that, I was once told that that's 10% of your job, if you can teach people about comparing typical values in datasets. Maybe the more

interesting question is can we quantify the variability? How much variability is in the water level

in January? And how do we compare that with February, March and so on. The variability across

those months would be something you’d want to understand as well. From a practical

perspective if you’re going to visit the lake, sure you want to go when the typical water level is

high, but you also don’t want to see a lot of variability in the water level in that particular month,

because that could mean you don’t know.

You don’t know.

You'd like to visit when the water level is typically high, with very small variation, if possible.

Those are the types of questions I would hope to elicit from my students based on a very short

introduction from me.

If you showed them some pictures of people enjoying the lake when there is lots of water versus

people not being able to enjoy the lake.

Actually I have a slide somewhere that has it full and not so full. Unfortunately there’s been a

drought in Texas for the last several years and the lake has really suffered from that drought.

That's actually what got me interested in this particular dataset: is the water level that we're seeing today unusual, or is it something we've seen maybe in the past?

Webster, show us some things you might do or have students do when they are investigating

some of these questions that you just posed for us.

You have to think about when I introduce this dataset, at what point in the semester, and I do it very early on, in descriptive statistics, after we've talked about things like histograms, box plots, and things of that nature: basic summary statistics, mean, median, and measures of variation like standard deviation and interquartile range. One of the things I hope to teach my students, in terms of working with this very common data format, is: what's a go-to way you could answer those


questions? What would be the standard thing you should always think about with this particular

data format? I really like the idea of doing a box plot here. I would like to do 12 box plots, one

for each of the 12 months here. We’re going to do a box plot for the data in the January column,

a box plot for the data in the February column, March and all the way on over to December. The

best part about box plots is that they're comparative in nature; it's really what they were built for. We

want to look at all 12 and see if we can answer those questions.

I want to see the distribution in December, how water levels have been varying.

Right, same for the other 11 months, and we want to be able to compare those distributions. I

like the box plot here, so let’s do a graph, go down to box plot. You can see of course StatCrunch

does all the standard graphics for categorical and quantitative variables, even some more

advanced ones we might take a look at here in a second. Let’s choose the box plot option and I’ll

select all of these months January through December. Now you think about how you want to

draw the box plot. I'm a five-number-summary kind of guy: Min, Q1, Median, Q3, Max. You can

do what I call a modified box plot as well, but I don’t typically start with that. I look at the

features of the data first with the five number summary, and then I come back if necessary and

look at the modified version. We can look at that in a second. We’ve got our 12 columns

selected, we click compute and here are our 12 box plots. When you look at this picture there are

some interesting features. One of the things you can clearly do is try to compare the typical

values; when is it at its highest? When is it at its lowest? If you think about those levels, how

much variation is around those levels across the months? Real quickly we can just move across.

One of the things you can do in StatCrunch is you can put your cursor over a particular graphical

object, in this case one of the box plots and you can get some information about it.
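The five-number summary behind each of those box plots (Min, Q1, Median, Q3, Max) can be sketched in Python; the January values below are invented, not the real Lake Travis data:

```python
# Sketch of the five-number summary that a box plot is built from.
import statistics

def five_number_summary(values):
    # "Inclusive" quartiles interpolate within the observed data,
    # one common convention among several for computing Q1/Q3.
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Invented January lake levels (feet above sea level).
january = [618.0, 640.0, 665.0, 674.0, 681.0, 690.0, 700.0]
print(five_number_summary(january))
```

Note that software packages differ in their quartile conventions, so StatCrunch's numbers may differ slightly from this sketch for the same data.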

There I see some of your five number summary.

There you see it. The median, if we’re going to use that as our measure of typical as we would

with the box plot, would be 674 in January. We can move across here, a little higher, 693 in February. Moving across, we don't have to look at all of these, but it continues to increase a little

bit in March, April we see a smidge of a drop down.

One of the things that my eye was immediately drawn to was how long the first 25% of our data

is.

That's right, that's where variation is important. Is there a great month to visit where you're

guaranteed to have a lot of water in the lake? Summer months for any sort of lake activity would

be your optimal time to visit, but you see what happens in the summer as things heat up in Texas,

quite hot in Austin, the water evaporates faster and they let more out of the dam.

For irrigation.


For irrigation then you see it dip there in September. When that weather pattern changes it starts

to go back up again. So that tempting time to visit is tricky, and the variation that we see here, that really long first quartile for each of these months, indicates that there are many times that the water

level is very low.

Right, I can see some of those minimums dipping down below 620.

Yeah, down below 620. When the water level is at 620 in Lake Travis, in that vicinity you have a lot of the inlet areas that are completely dry. In many cases you couldn't even put a boat in the

water in that situation. In fact, in my experience anything below 640 is dreadfully low in terms of

the lake level.

What I love is that I'm hearing you tie this conversation about how you're interpreting these box plots directly back to the context, and directly back to thinking about why we might be seeing some of the patterns that we're seeing.

Right, that's important; assignable causes are important. In this case you can understand the

behavior at the lake when you think about it. Even if you look at these medians there’s not a

huge amount of variation relative to the total variation that you see in each of the months. There

is a lot of variation in each month. The middle 50% of the data shown by the box is relatively

consistent I would say across the 12 months, the 12 box plots and so is the range of the data. It’s

relatively consistent, but you picked up on a very good point. We've got a very long left tail; in the data for each of the months we see this very long left tail. In fact, when you see that long a segment at either end of the box plot, in this case the first quartile in each of these box plots is very long, then you might think about going back and seeing whether we have outliers responsible for that or whether it's consistent behavior. There's a lot of data in each of these segments.

Right, you had mentioned that there were times when there was a drought. I'm wondering, were there particular years that were kind of contributing to that?

Let’s go back and edit our inputs. This is one of the most important things in data analysis,

because too often students expect: I do this, this, and this, and then interpret the output.

Do this, this, I'm done. A linear process.

Right, and it's never the case when you're analyzing data. It's almost always the situation that when you take a first look at the data, you've got to go back and, based on that look, modify the way that you're approaching the data.

This comes from your Tukey background in thinking about exploratory data analysis, doesn’t it?


I’m sure it does. Let’s go back over here and now use fences to identify outliers, so turn on that

option and see if we have that one drought year or something that’s causing those very low

values. Let’s go back and re-compute our box plots. The problem is really probably not one year

of drought. I'd say historically, looking at this data from 1943-2011, there are many times

when we’ve had some very low values. All the outliers in this case are on the low end of the

data, none on the high end unfortunately. If you think about it I can tell you why none of the

values are above 700 roughly, because that’s the level of the dam. There’s only so much water

you can get into the lake until it starts spilling out. That’s really not the exact height of the dam,

but it’s related to the height of the dam, the capacity if you will.
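The fences Webster turns on follow Tukey's 1.5*IQR rule: points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged as outliers. A minimal Python sketch with invented levels:

```python
# Sketch of Tukey's fence rule for flagging outliers in a box plot.
import statistics

def tukey_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep any point that falls outside the fences.
    return [v for v in values if v < lo or v > hi]

# Invented lake levels; one very low drought-like reading.
levels = [615.0, 670.0, 672.0, 675.0, 678.0, 680.0, 683.0]
print(tukey_outliers(levels))   # the very low reading is flagged
```

Here all the flagged points sit on the low side, which mirrors what the transcript describes: the dam caps the lake on the high end, so the outliers show up only at low water levels.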

Right, so understanding the reasonable range that one can be expecting given that context, like

it's unlikely that we're going to see something above 700 because of how the lake and the dam are constructed.

That's right, because you are pushing the capacity. We don't have any of the outliers on the high end, and that's why: you're capped in that direction. Unfortunately for people who like

Lake Travis there are many times where it can get extremely low. We’re managed on the high

side, but not on the low side. I think there are some political efforts to change that, but at the time

that we're viewing the data, that's not the case. One of the things you can do now: we're wondering if there's a single year that is responsible for those lower quartiles here. In StatCrunch

you can also highlight these values, so let’s pick some of the lowest ones. Let’s try this one in

August. So that low value was in 1951.

Now what I’m seeing in pink are all the 1951 data values.

That's right. We highlighted that point, and that turns on the row in that dataset.

Absolutely this dynamic link.

All the values that are a part of that row will be highlighted. In fact what we’re doing here is

highlighting that same point in all of the box plots. You'll see that the 1951 measurements qualify as outliers across all the months. That's clearly an issue there in 1951; especially in the summer, the June, July, and August minimums are all in 1951.

That is just dreadful.

In fact, even the most recent data that’s not even in this dataset is not quite at that low level.

There are times in history where it's been even lower than what we're currently seeing at Lake

Travis.

In recent years.


This is a very interesting thing, and a very important part of analyzing data now: all the interactive graphics that are there. You can use them for pedagogical purposes, because they help students understand the connection between the data and the graphs. I think oftentimes students think that graphs appear magically.

The software did it.

Somehow, "I don't know exactly what it is, but I'm going to tell you what the median is." When

you start brushing a graph, interacting with a graph, that draws those connections. We've now actually taken one of these points and said, oh yeah, this is where it is in the data. We can

turn off that highlighting by clicking clear. Maybe we want to look at a few other years here.

Select two this time. Now we've got 1952, so it was a bad couple of years, ’51 and ’52, then 1964.

In fact, we’ll see that one of these years, you see two of the dots highlighted in January,

February, March, April, May, June, July, August, but then here.

Yeah I see one of those years in September it got a lot better.

Yeah in September it got a lot better. It was even up above the third quartile.

What year was that? I see that whenever you're pointing to that point, a little pop-up is telling me about that dataset, about that data point.

About that point, yeah let’s see here.

Row 10.

Row 10, so that was in 1952. If you look at this data as a complete time series you’ll see that

occasionally you get a monsoon. Maybe it's hurricane-type weather in the Gulf of Mexico, but

something generates a massive amount of rainfall and things can change very quickly. If you

don’t like the level of Lake Travis today, just wait. It typically takes a massive rain event to get

you out of a drought situation like there was in 1951 and 1952. 1964 though, you’ll see that’s

row 22 here, it was consistently low.

Like it was in 1951.

Like it was in 1951, but we got some good rain sometime in September of 1952 that really brought the level back. That's illustrating one of the nice ways you can interact with the data. That's something that is very important pedagogically. In terms of real-world data analysis, like I said, it's a critical skill; analyzing lots of datasets myself, that's something I have to have.

Let's go back to your original question: what month would you like to visit Lake Travis? What kind of response would you be anticipating? What kinds of claims would you hope your students would be making, and what would they be considering in their final interpretation?

I would go back to the idea that I put forward, and hopefully work the classroom discussion in that direction. We want to go when the level is typically on the higher side, with the least amount of variability. One of the issues we see here, though, is that it's very variable, and the variability is roughly consistent across the 12 months, so the best we could do in terms of making our decision, it looks like probably May and June would be some of the best times to visit.

Especially after you were showing some of the outliers there, March looked pretty good and May

and June.

The water’s colder in March.

Oh that’s true.

If you factor in the temperature as well as the level I would suggest May or June being some of

your best months. This is a graphical way you can look at this particular dataset. I always start

with graphs and I encourage my students to do the same. I think bad decisions are made

statistically speaking if you don’t look at a graph first. There is no analysis that I can ever

imagine doing that should not be motivated by a graph. That’s something I really try to ingrain in

my students: graph first, then compute some numbers later. We might want to actually look at

these numbers in a table now. Let me just show you how to do that real quickly.

I think that should be a general rule of thumb: not only graph first, but integrate your reasoning about the graphs and the numerical summaries, so that you're reasoning both from the statistical-measure point of view and from what you're seeing in the distributions whenever you graph them.

That’s right. I’m going to go ahead and turn off the highlighting again here clicking the clear

button. Now let’s compute some summary statistics. All the numbers are under the stat menu,

things like means, medians, t tests, regression and things of that nature. Summary statistics for

columns here, once again I’m going to select all 12 of these. We’re going to click compute. Now

we get a nice table of output. One of the things we've added recently, which is nice for classroom usage, is the ability to hover over rows and have those rows be highlighted.

It’s difficult if you’re displaying this in a classroom.

Right, displaying in the classroom is very difficult, so if you want to have the ability to focus, having this row-highlighting ability is nice. It's based on my own classroom experience. One of

the things we’ve added, if you want to think of when the median level is the highest and things

like that you can now sort this table very easily by clicking on these arrows.


Oh that’s nice.

Before I do that let me just point out that StatCrunch understands how certain things should be

ordered like January, February; it recognizes those as months and will order them correctly.

Rather than alphabetical.

That’s right, even if we didn’t select them in this order it would produce them in this order. If we

want to break out of the standard ordering and sort things based on median or mean level or

anything like that, there we go. Now we’ve got order increasing, if we click again we can get

order decreasing. The median level was in fact highest in June, as we saw in the box plots.
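Sorting the summary table by median, as Webster does with the column arrows, amounts to ranking the months by their medians. A Python sketch with invented per-month samples (chosen so the ordering happens to match the one discussed):

```python
# Sketch: rank months by their median level, highest first,
# mirroring StatCrunch's sortable summary table.
import statistics

# Invented lake levels for three months.
month_levels = {
    "Mar": [668.0, 675.0, 681.0],
    "May": [670.0, 682.0, 688.0],
    "Jun": [672.0, 684.0, 690.0],
}
medians = {m: statistics.median(v) for m, v in month_levels.items()}
# Descending sort on the median, like clicking the sort arrow twice.
ranked = sorted(medians, key=medians.get, reverse=True)
print(ranked)   # ['Jun', 'May', 'Mar']
```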

June, May, March.

Those were your three months.

Yeah it was.

We've identified June as the high median level. If you think about your measures of spread, standard deviation, interquartile range, and things like that, we can easily see that the standard deviation here is very consistent. One thing that is not produced by default in StatCrunch is the interquartile range, another way to measure variability, in fact my preferred way in most cases.

Especially if you’re using box plots.

That's right, if you're going to motivate these statistics using the box plot as the graphic, we can go down and include it; you have several options here of other statistics that you can compute. Let's go down, and there we go. Let's see if we can get rid of that one.

There you go.

We now have our IQR; again, we saw consistency in the box plots, and in fact the numbers here are quite consistent as well. That's something I think is nice, to go from the graphic to the numbers; that makes these numbers mean a bit more. We saw that the variation was roughly consistent, and when we actually compare these numbers we see that in the values that are here.
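The IQR (Q3 - Q1) that gets added to the table can be sketched in Python; the two invented months below happen to have similar IQRs, mirroring the consistency seen in the real data:

```python
# Sketch of comparing spread across months with the IQR (Q3 - Q1).
import statistics

def iqr(values):
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Invented lake levels for two months.
months = {
    "Jan": [640.0, 665.0, 674.0, 681.0, 695.0],
    "Jun": [650.0, 676.0, 684.0, 692.0, 700.0],
}
iqrs = {m: iqr(v) for m, v in months.items()}
print(iqrs)   # similar IQRs suggest roughly consistent variability
```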

Okay, that's your basic table of summary statistics; we now know how to do box plots and summary statistics in StatCrunch.

For any kind of interpretation and claim I might make, I would hopefully be reasoning with the graphics as well as from the statistics that have been computed for us.

Absolutely. I think that graphics are the first step and numbers are the second, and interpreting them correctly matters for both, really, the graphics and the numbers. Another thing I like about this dataset is that it's a natural for thinking about correlation as well. Since we do have measurements that are paired together within the year, we can start thinking about what type of correlation there is between the water level in January and February. It's important to have students think about this question before you do anything with the software. I always ask my students what they expect, then let's think about what actually is.

Prediction first.

Were your expectations met or did we learn something new here? If you pose this type of

question to a class, most people are going to think well if it was low in January it will be low in

February, typically. What they're describing, if we map that to the idea of association, is a positive association. That's what most people would expect to see: if it's high in

January it will probably be high in February. You want to think about that, motivate that idea

before you ever even look at a scatterplot. Let’s take a quick look at a scatterplot, let’s actually

look at January and February. We choose the scatterplot under the graph menu; let's look at the

January measurements as X and February as Y. Let’s go ahead and click compute. In fact it’s

exactly what we expected: we see a very strong positive association here. There are some outliers here, right? Some of these outliers, if we had a lot of time, we could actually discuss a little bit and identify when they occurred. Use the arrows here to navigate down to the selected rows. That was actually 2010, an outlier. If you

think about what was going on at that particular point that’s when the January measurement was

low and the February measurement was high, higher than you would have expected. The same is

true for this other point here, which we can select.

It actually looks like a lot of the ones that are outside of the general cluster of points seem to have that pattern, where it's low in January and higher in February.

That’s right and that probably has to do with weather patterns. There’s some moisture, some rain

that comes in that raises the level for February. Overall, what we expect is a strong positive

association, maybe with a couple of outliers that can impact things here. With January and February

we saw what we expected to see, so would the same be true with all the pair-wise associations?

January and February they’re close together in time, so it’s likely they would be positively

associated, but how about January and March or January and June or January.

And anything.

January to anything. We've got to remember that December is actually close to January, so about as far from January as we can get is June or July; then we're actually getting towards the end of the

year. Let's think about how we would do that. I like something called a pairs plot. This is something I started showing my students in class in the last couple of years, since I've come to NC State. With datasets like this, this is one of the examples that we've used for the pairs plot. The


beauty of the pairs plot is you can select all of the data January through December, click

compute.

Oh wow.

Now we get a matrix of scatterplots. All the axis information is there, and in stat ed we always like to see things labeled on the axes. This is a situation where we're trying to put a lot of information on the screen at one time; if you think about it, we have 132 scatterplots here. The one we were

looking at before was with January on the X axis and February was on the Y axis. You can

actually see here the two little points that we were looking at, identified as outliers in that

picture. This is really a triangular matrix of scatterplots, lower or upper you can look at either

one.

They’re all symmetric.

Do we see any information here that would suggest that the level of correlation decreases over

the year? I think if we take January and compare it to some of the other months: with January and February we saw a very strong association. You can come on down and see

January and March in the next plot, and I think even there you can already see a little more

variation about the linear trend.

I’m seeing a larger clump.

Same thing in April and it’s much larger than in May. June, July it’s probably at its peak or

maybe even August. In September we see a lot of variation and even January to December

there’s a lot of variation. We do have to remember here that that’s the December at the end of

that particular year, not the December just before that January; that is 11 months apart. But yes, we do see that as you move away from January, the correlation of the other water levels with that January measurement is dissipating quickly.

We have all these plots and looking at the different correlations where would we go from here?

The next step would be to compute correlations, so let’s hop over here in the stat menu where the

numbers are located. Under summary stats we’ll choose correlation. I’m going to select all the

columns, and we’re going to get the corresponding correlation matrix to the scatterplot matrix,

the pairs plot we were just looking at. Let’s go ahead and click compute. Now we see it, by

default it’s only a lower triangular matrix, but we can now quantify the correlations that we saw

back over here in the scatterplots; January and February were in fact very highly correlated, a 0.978 correlation coefficient. As we move through the year it gets lower and lower, so in December,

the January measurement is.

That corresponds to the scatterplot down here.


January, December scatter plot. I think we’ve got them a bit out of order now. That’s the

corresponding correlation matrix.
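The pairwise correlations in that matrix are Pearson correlation coefficients. A Python sketch with invented January and February levels:

```python
# Sketch of Pearson's r for one pair of months in the correlation matrix.
import statistics

def pearson_r(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented paired yearly levels for January and February.
jan = [660.0, 670.0, 680.0, 690.0]
feb = [662.0, 671.0, 683.0, 689.0]
print(round(pearson_r(jan, feb), 3))  # close to 1: strong positive association
```

Python 3.10+ also provides `statistics.correlation`, which computes the same quantity directly.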

That’s a nice use of this dataset.

It's a great way, once again: graphs, numbers, graphs, numbers, drawing connections between the two. That's a really quick run-through of things you can do in StatCrunch, and with this particular data these are some of the things I do in my own class. There are lots of other things you can do. It's a rich dataset: you can do basic box plots and things of that nature, and you can talk about correlation; very few datasets allow you that. This particular format is one I use for that purpose quite a bit.
