Exploring & Describing Data


 Distributions

Before we begin summarizing, we need a "pre-summary". A frequency distribution is a powerful tool for conceptualizing variables. Essentially, a frequency distribution is a mathematical function that is associated with a particular variable. It captures two important characteristics of a variable: the observed values and the frequency of those values (i.e., how many times each value occurred).
(NOTE: I could have said "observable" rather than "observed". We will generalize the concept of a frequency distribution to include not just those particular values observed in our data set, but also other values that we might have observed. For example, we did observe a 7340, but might also have observed 7350.)

You can think of a frequency distribution as a function: plug in x (the value) and get out y (the frequency). It's often useful to think of, and represent, frequency distributions as tables that list the observed values along with their frequencies. But for large data sets especially, such a table isn't practical, and so we need to summarize the distribution.
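To make the "function" and "table" views concrete, here is a minimal sketch in Python. The income values are made up for illustration (they are not taken from the census data set), and the helper name `frequency` is hypothetical.

```python
from collections import Counter

# Hypothetical income values, invented for illustration only.
incomes = [7340, 7350, 7340, 12000, 7340, 12000]

# Counter maps each observed value to the number of times it occurred.
freq = Counter(incomes)

def frequency(x):
    """Return how many times the value x was observed (0 if never)."""
    return freq[x]

# The same distribution shown as a table, sorted by value.
for value in sorted(freq):
    print(value, freq[value])
```

Calling `frequency(x)` is the "plug in x, get out y" view; the printed rows are the table view. With thousands of distinct values the table quickly becomes unwieldy, which is why summaries are needed.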
This is probably a good place to reflect on the fact that not all variables are alike. It's useful to distinguish the types of values that a variable might have. There are two major divisions, and each can be broken into smaller divisions.
[Students sometimes have a hard time identifying these types of variables. It's worth spending some time on this -- but not too much.]
The major division is quantitative vs. qualitative. In other words, numerical vs. categorical. "Numerical" means the values are numbers. "Categorical" means the values are categories. For example, the variable "hair color" will probably have categorical values like "brown", "black", "green". Our income variable is numerical.
But be careful: sometimes numbers are used to "mark" categories. For example, our census data set records "sex" as "F" or "M". But it could just as well have used "1" and "0". Strictly speaking, these are numbers. But they are being used as if they were categories.
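One way to see that such codes are really categories is to translate them back to labels and count them, rather than doing arithmetic on them. The sketch below assumes a hypothetical coding (1 for "F", 0 for "M") and made-up observations; neither comes from the census data set.

```python
from collections import Counter

# Hypothetical coding scheme: 1 and 0 stand in for "F" and "M".
SEX_LABELS = {1: "F", 0: "M"}

# Made-up coded observations, for illustration only.
coded_sex = [1, 0, 0, 1, 1]

# Translate the codes back to labels, then count each category.
labels = [SEX_LABELS[code] for code in coded_sex]
counts = Counter(labels)
print(counts["F"], counts["M"])
```

Swapping which sex gets 1 and which gets 0 would change nothing about the data, which is a good sign that the numbers are only acting as names.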
Some people find it easy to get lost in further classifying variables. (For example, we could distinguish between "ordered" categories such as "low income", "medium income" and "high income" and unordered categories such as our hair color variable.) But the only remaining distinction that is important to grasp is that between discrete and continuous numerical values.
This distinction is a bit abstract. At first blush, it's simply the distinction between "counting" numbers and real numbers. What makes it slightly more complicated is the fact that we usually categorize variables as discrete or continuous not based on the values we saw, but instead on the values we might have seen. For example, we might have recorded people's heights as 60.0", 61.1", 61.2", 61.5", etc. Now you might argue that this is discrete: because we rounded off to the first decimal place, we could "count" the possible values: 60.0, 60.1, 60.2, etc. But the *concept* of height is a continuous one. People don't grow in tenths of inches. Growth is a continuous process. Thus, "height" is really a continuous variable. By necessity our measurements are discrete, but we still speak of height as continuous.
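A small sketch of how measurement turns a continuous quantity into discrete recorded values. The "true" heights below are invented for illustration; in practice we never get to see them, only the rounded measurements.

```python
# Hypothetical "true" heights in inches, invented for illustration.
true_heights = [60.04, 61.12, 61.18, 61.49]

# Recording to one decimal place maps every height onto a countable
# grid of possible values: ..., 60.0, 60.1, 60.2, ...
recorded = [round(h, 1) for h in true_heights]
print(recorded)
```

The recorded list can only contain multiples of a tenth of an inch, but the variable being measured could take any value in between, which is why we still call height continuous.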