Questioning Data

A Textbook by Joy Sebesta

Chapter 1: Data

What is data?

As in all branches of mathematics, it is important to define our terms. The Cambridge Dictionary defines data as information (especially facts or numbers) collected to be examined, considered, and used to help with decision-making, or as information in an electronic form that can be stored and used by a computer.

For our purposes, the first definition is the more accurate. In statistics, we collect and analyze data in the form of information. We collect data on subjects, which are also called observational units, cases, or experimental units. A subject is the person or thing that we are studying and collecting data on. The information that we collect from our subjects can be broken down into characteristics or numerical quantities, which we call variables. A variable assumes different values that can be either categorical or quantitative.

What types of data are there?

Categorical variables have data that can be separated into groups. Each piece of data can fit into one and only one group. For example, if our population is students under the age of 17, the student could be in elementary school, middle school or high school at a given point in time. Other categorical variables include eye color, political party affiliation, and favorite car manufacturer.

Name at least 3 other examples of categorical data.

Numerical variables are called quantitative. Suppose we want to find the average height of students enrolled in a middle school: this data would be numerical in nature. However, not all numerical data is quantitative. For data to be considered quantitative, we must be able to do calculations on it. For example, we need to be able to find the mean of the data set. The mean is more commonly known as the average. It is found by taking the sum of all the data values, then dividing by the number of data values that there are.

Quantitative data can also be discrete or continuous. Discrete data is numerical data that is made up of whole numbers or integers. Its values can be counted one at a time. Discrete data could be the number of times you roll a die before you get a 5, or it could be the number of students in our class who prefer Dunkin coffee.

Continuous data involves precise measurements. This data can take on any value that falls within a range of values. For example, the heights of elementary school students might range from 42 to 50 inches and could be any value in between those two numbers. You could have one student who is 42.0159 inches tall and another who is 49.25698 inches tall. Continuous data has an infinite number of values that it could take on in that range. Because those values fill the whole range with no gaps between them, they cannot be counted one at a time; they are measured instead.

You can read more about and find other examples of discrete and continuous data at this website https://www.indeed.com/career-advice/career-development/what-is-discrete-data

Sometimes numerical data is not quantitative. Jersey numbers for a sports team are also numerical, as are student ID numbers. Does it make sense to average the jersey numbers of the NY Giants? Or find the median of all of the Student ID numbers of everyone in our class? Of course, the answer is no! This is a case where numerical data is just categorical data in disguise. The jersey numbers and student ID numbers are just another representation of the person’s name. Can you think of other examples where numerical data is just categorical data in disguise?

Name two examples of numerical data that are categorical data in disguise.

Name at least 3 examples of quantitative data.

Are there other ways to organize data?

Besides categorical and quantitative, data can be arranged according to the following groupings called Levels of Measurement.

The Nominal Level of Measurement is identified by data that is categorical in nature and is made up of names, labels, and categories. These are groupings such as your favorite color, sports team, or genre of music.

The Ordinal Level of Measurement can be used to classify data that can be put into some kind of order. Course grades are an excellent example. You can earn an A, B, C, D or F in a college course, and then we can put those grades in order from lowest to highest or highest to lowest. Another required identifier for the ordinal level of measurement is that the differences between the data values either can’t be found or don’t make sense when they are found. It wouldn't make sense to find the difference between an A and a C (A minus C). A pain level from 0 to 10 is another example: pain level five minus pain level two doesn’t make sense.

The Interval Level of Measurement is identified by data that can be arranged in order and whose differences can be found and have meaning. However, there is no natural zero starting point where none of the quantity is present. Examples include a temperature scale like Fahrenheit or Celsius, or SAT scores, which range from 200 to 800. In these cases, 0℃ or 0℉ doesn’t mean there is no temperature, and you can’t receive an SAT score of zero. (phew!)

The Ratio Level of Measurement encompasses all three previous levels of measurement and also has a natural zero. Data at the ratio level of measurement can be arranged in order, have differences that can be found (and are meaningful), have a zero starting point where none of the quantity is present, and allow ratios to be calculated. For example, six students in the class recorded their earnings for last week: $0, $500, $375, $624.50, $1249, and $750. Putting these in order from least to greatest, we have: $0, $375, $500, $624.50, $750, and $1249. Here $0 means this person earned no money last week. The difference between two values is meaningful: $1249 is $499 more than $750. We can also find ratios between two values: $624.50 is half of $1249.

The pain level scale diagram below shows the pain scale several ways.

Which levels of measurement are represented in the diagram?

https://www.mibluesperspectives.com/stories/for-you/how-pain-is-measured

Search the internet and find another diagram with several levels of measurement. Copy and paste your image and then report what levels of measurement it uses.

How do we get data?

First, we must determine the population that we are interested in studying and come up with a research question. The population is the entire group of subjects that we are interested in studying. The research question should be about a parameter. The parameter is an unknown measurement of some variable in the population that we are trying to find. Research questions range from very simple to complicated. It all depends on what your interests are or what questions you have that you want to use statistics to answer.

Simple research questions might be: Do all college students prefer online synchronous or in-person classes? Or, what proportion of all M&M's are blue? A more complicated question might be: On average, how much has sea level changed in the past 20 years along the New York City waterfront near Battery Park and along the Brooklyn waterfront at Breezy Point? Once you have chosen your research question and the population you are interested in studying, the next step is to take a sample.

We take a sample from the population that we are studying and use that sample to gather data. The sample is a subgroup that we choose from the population. As a guideline, the sample size should be approximately 10% of the total population. The data we collect could come from a variety of sources. It could come from responses to a survey, information gathered by doing an observational study, or by completing a randomized experiment. We call the measurement that we get from a sample a statistic. A statistic is a known value that comes from the sample. If our variable is quantitative, we can calculate the statistic. If the variable is categorical, the statistic is usually a proportion or percentage.

It turns out that the way we collect data can affect how we can use it. The goal of collecting a sample is to be able to generalize the statistical information we find in the sample to the larger population so that we can estimate the population parameter. There are several ways to collect data from a population.

Use the internet to find a dictionary definition of generalize or generalization. Write down what it means.

Why would generalizing what we find in a sample to the population be important?

Are some methods of data collection better than others?

The most effective way to gather sample data that is generalizable is by using random sampling. Random sampling describes a process where we number every member of the population and use those numbers to choose our sample. This process creates a simple random sample, where every member of the population has an equal chance of being selected to be in the sample.

107	203	190	8	17	284	147	201	33	289
131	109	298	156	154	213	69	57	185	132
173	38	289	40	228	264	215	229	54	26

Suppose we have a population of 300 adults in the United States who we want to draw our sample from. First, each member of our population needs to be numbered. This means that each potential subject will have a number that identifies it ranging from 1 to 300. Since we have a population of 300, we should have a sample of approximately 30 subjects. We then generate 30 random numbers to know which individuals to choose from the population. Random numbers can be generated at websites like random.org or calculator.net. You can also ask Siri to generate random numbers! Then, we find the subject that matches each random number and record all the relevant data on that subject. We keep repeating this until all 30 random numbers have been matched to their respective subject. This gives us a random sample. Any statistical information that we gather from this sample can be generalized to the population, Adults in the United States.

Figure 3 Thirty Random Numbers generated at Random.org

Figure Sample of 30 Adults in United States

Figure Population of 300 Adults in the United States

If the population is small, you could also create a random sample using the lottery method. The lottery method is done by putting the names of the subjects, or the objects themselves, in a container and choosing one at a time without looking, until you have the appropriate sample size.
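The numbering procedure for the population of 300 adults can also be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample` and the `seed` argument are our own choices, and the seed is there only so that the example can be repeated.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct subject numbers chosen from 1..population_size."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    # random.sample draws without replacement, so no subject is chosen twice
    return rng.sample(range(1, population_size + 1), sample_size)

chosen = simple_random_sample(population_size=300, sample_size=30, seed=1)
print(sorted(chosen))
```

Drawing without replacement works like the lottery method: once a number has been pulled out of the container, it cannot be pulled again.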

What causes errors in our data?

If we take multiple samples from the same population, each sample will have a slightly different statistic associated with it. The small differences between the sample statistics are called sampling error. This is normal, and we expect it to happen. We can reduce the size of the sampling error by using a larger sample. If errors can be good, this is a good error.
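A small simulation can make sampling error visible. The sketch below uses a made-up population of 10,000 heights (the values, the seed, and the names `sample_means`, `small`, and `large` are all assumptions for illustration) and compares how much the sample mean varies for small and large samples.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical population: 10,000 heights in inches (made-up values)
population = [rng.gauss(66, 4) for _ in range(10_000)]

def sample_means(sample_size, repetitions=200):
    """Take repeated simple random samples and return each sample's mean."""
    return [statistics.mean(rng.sample(population, sample_size))
            for _ in range(repetitions)]

small = sample_means(10)
large = sample_means(1000)

# Spread of the sample means: how much the statistic varies between samples
spread_small = max(small) - min(small)
spread_large = max(large) - min(large)
print(f"n = 10:   sample means vary over {spread_small:.2f} inches")
print(f"n = 1000: sample means vary over {spread_large:.2f} inches")
```

The exact numbers depend on the random draws, but the sample means from the larger samples should sit much closer together.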

There are other types of error that can happen when we take samples. Non-sampling error happens when a mistake is made while we are collecting or recording the data. This can cause the data from the sample to be very different from what is happening in the population. It could be that the data was entered incorrectly, an incorrect calculation was made, or a subject reported false information in their survey. Another type of sampling error happens when we don’t take a random sample. A nonrandom sampling error occurs when we take a sample without using the random sampling technique described above.

If we use a nonrandom sampling technique, we create statistics that we can’t generalize. These types of data collection result in what is called sampling bias. If we have a sample that is biased it means that the data we have collected won’t be representative of the population that it was chosen from. It could create an overestimate or underestimate of the population parameter that we are trying to find. Remember, the parameter is an unknown value that represents some attribute of a population.

Look up a definition of representative in an online dictionary and record it.

Why would the sample being representative of the population be important?

Convenience samples and internet polls are two examples of nonrandom sampling. A convenience sample uses data that is easy to get. For example, you could ask a question of the first 50 people who walk by you on the street. Keep in mind that this is absolutely not a random sample. Random does not mean accidental or haphazard. For our purposes, random means that the subjects are chosen by a deliberate process, such as the numbering method described above, that gives every member of the population an equal chance of being selected. Internet polls are another way to collect data by having subjects answer questions. This creates self-selection bias. Self-selection bias is when the subjects themselves decide whether to be included in the study or not.

Non-response bias is when people drop out of a research study or don’t respond to a survey question or the entire survey. This type of bias often happens when people are asked about topics that are sensitive to them. Some people may hesitate to answer questions about income, alcohol intake, drug use, or sexual preference.

Social desirability bias is when survey respondents answer questions based on what society deems important, not on what they actually do. In a survey of travelers at an airport, more than 72% of women and 52% of men said they always wash their hands after using the bathroom. However, researchers observed a much lower percentage of both women and men washing their hands!

Why would it be important to have random samples? If a sample is biased, should we use it?

Are there other things we can do to collect data?

The researchers who observed people in the airport restroom were conducting an observational study. In an observational study, we just observe the subjects and collect data without interacting with them. Another way to do an observational study is to take data that are already available and use them. For example, if our research question is “Are high school seniors taller now than they were in the 1980s?”, we would gather student records from randomly selected schools from the 1980s and now, and compare the heights of randomly selected students from each school.

We can also answer research questions by doing randomized experiments. These experiments compare two variables and look for relationships between them. The first variable is the explanatory variable and the second is the response variable. The explanatory variable is the independent variable. The researcher manipulates this variable and then observes the dependent variable, or response variable, to see if there are any changes in it.

To complete a randomized experiment, the subjects are first randomly selected to participate in the experiment. Once the sample is chosen, the participants are randomly assigned to groups. The treatment group has the explanatory variable manipulated and the control group does not. Randomized experiments are often used in the health care industry to determine if a new drug is more effective than the old drug. In this case, the explanatory variable would be the new drug that is being tested, and it would be given to the treatment group. The control group would be given the old drug, and at the end of the experiment the results would be compared.

If the drug being tested is brand new and not a new formulation of an older drug, the treatment group would be given the new drug and the control group would not receive any treatment. At the end of the experiment, the results from each group are compared. Randomized experiments allow us to conclude that the changes in the explanatory variable cause the changes in the response variable.
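The random assignment step can be sketched in Python as well. The participant IDs, the function name `assign_groups`, and the even split into two groups are assumptions made for this illustration; a real experiment might use a different design.

```python
import random

def assign_groups(participants, seed=None):
    """Randomly split participants into a treatment group and a control group."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

participants = [f"P{i:02d}" for i in range(1, 21)]  # hypothetical participant IDs
treatment, control = assign_groups(participants, seed=42)
print("Treatment:", treatment)
print("Control:  ", control)
```

Because chance alone decides who lands in each group, the two groups should be similar in every way except for the treatment.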

Use the diagram below and describe the randomized experiment that is being done.

Randomized Experiment phases by Unknown Author is licensed under CC BY-SA-NC

What is the explanatory variable?

What is the response variable?

What treatment is being used?

Is there a control group?

Why is it important to randomize the experiment?

Chapter 2: Organization

How do we organize data once we have it?

Once we have chosen our random sample, we use descriptive statistics to summarize and display organized data. Descriptive statistics could be a single statistic, like the proportion of people who disagree with a certain statement, a table that shows the number of times the subjects answered each option of a question in our survey or a bar graph that summarizes the responses to a particular survey question. The tools we use to organize the data are dependent upon what type of data we are trying to organize.

What if we have data that is based on one categorical variable?

If our variable is a single categorical variable, like the number of people who answered yes to a survey question, we can find the proportion to summarize that data.

Suppose that we asked a survey question about the number and type of pets that students in our class have. There are a total of 37 students in the class, and we find that 12 students have cats, 15 students have dogs, and 10 students don’t have any pets at all.

We can summarize this data as follows:

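The proportion of students who have cats is \(\frac{12}{37} \approx 0.324\).

The proportion of students who have dogs is \(\frac{15}{37} \approx 0.405\).

In general, a proportion is \(\frac{\text{number of subjects in the category}}{\text{total number of subjects}}\).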

What is the proportion of students who don’t have any pets?

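Since this data comes from a population, the symbol we use for it is \(p\). (Some textbooks also use \(\pi\).) Then \(p_{\text{cats}} = \frac{12}{37}\) and \(p_{\text{dogs}} = \frac{15}{37}\).

Find \(p_{\text{no pets}}\).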

What happens when you add up the three proportions from our data? (A similar thing will happen when you add the decimal equivalent for each of the three proportions.)

Why is the data from a population and not a sample?

What procedure should we follow to take a random sample from our class?

How many students would be in the sample?

What is the variable? Is it quantitative or categorical? How do you know?

This data could also be summarized in a frequency distribution table. A frequency distribution table displays how many times a specific event happens. A frequency distribution table of the pet data is shown below.

*Type of Pet*	*Number of Students*
Cat	12
Dog	15
No Pets	10
Total:	37

A relative frequency table displays the same information as a frequency table and adds another column for the relative frequency. The relative frequency is the number of times a particular event happened divided by the total number of events. This can also be changed to a percentage.

*Type of Pet*	*Number of Students*	*Relative Frequency*
Cat	12	12/37 ≈ 32.43%
Dog	15	15/37 ≈ 40.54%
No Pets	10	10/37 ≈ 27.03%
Total:	37	100%
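The relative frequency column can be computed directly from the counts. The sketch below uses the pet counts from our class survey; the variable names are our own.

```python
pet_counts = {"Cat": 12, "Dog": 15, "No Pets": 10}  # counts from the class survey

total = sum(pet_counts.values())
# Relative frequency = count in the category divided by the total count
relative_frequency = {pet: count / total for pet, count in pet_counts.items()}

for pet, count in pet_counts.items():
    share = relative_frequency[pet]
    print(f"{pet:8} {count:3d}  {share:.4f}  {share:.2%}")
print(f"{'Total:':8} {total:3d}  {sum(relative_frequency.values()):.4f}")
```

Each row prints the count, the relative frequency as a decimal, and the same value as a percentage.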

How do you change a decimal to a percentage?

Look up the definition of percentage. What is a percentage?

Are there other ways to organize categorical data?

Another way we can organize categorical data is in a graph called a bar chart. In a bar chart, each bar shows the frequency or number of responses in each category. The bar chart below shows our Pet data. We could also organize this categorical data using a pie chart. Pie charts are graphs that are circular. Each sector of the pie chart represents a certain proportion of the data. You can create a bar chart or a pie chart with a spreadsheet program like Google Sheets or Microsoft Excel.

This link to the Microsoft website will show you how to create a pie chart in Excel, Word, and PowerPoint. https://support.microsoft.com/en-gb/office/add-a-pie-chart-1a5f08ae-ba40-46f2-9ed0-ff84873b7863

If you prefer Google Sheets, this video shows you how to create a pie chart. It also explains some of the differences between a pie chart and a bar chart. https://www.youtube.com/watch?v=sVz-5Sm2Y-Q

Compare the pie chart with the Relative Frequency table of the Pet data. What do you notice?

What modifications could we make to the pie chart to make it clearer?

What if we have quantitative data?

Quantitative data is usually organized in graphs or charts called histograms. Histograms are graphs of frequency distributions for quantitative data. The histogram below shows us the frequency distribution of the height of a sample of 31 Cherry trees.

Each bar in the graph is called a bin. Each bin contains trees that have a range of heights. Each tree can only be put into one bin. The first bin contains trees that range from 60 to exactly 65 feet in height. The second bin contains trees that have heights greater than 65 feet up to those that are equal to 70 feet. The third bin contains trees that have heights greater than 70 feet ranging up to exactly 75 feet. The fourth bin contains trees that have heights greater than 75 feet ranging up to exactly 80 feet.
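The rule that a height exactly on a boundary belongs to the lower bin can be written as a short Python function. This sketch covers only the four bins described so far; the names `UPPER_EDGES`, `LOWEST_HEIGHT`, and `bin_number` are our own.

```python
from bisect import bisect_left

# Upper edges of the four bins described in the text (feet)
UPPER_EDGES = [65, 70, 75, 80]
LOWEST_HEIGHT = 60

def bin_number(height):
    """Return the bin (1 to 4) for a height, treating each upper edge as inclusive."""
    if not LOWEST_HEIGHT <= height <= UPPER_EDGES[-1]:
        raise ValueError("height falls outside the four bins described")
    # bisect_left counts how many upper edges are strictly below the height
    return bisect_left(UPPER_EDGES, height) + 1

print(bin_number(70))    # exactly 70 feet stays in the second bin
print(bin_number(70.5))  # just above 70 feet moves to the third bin
```

Making the upper edge inclusive is what guarantees that each tree lands in one and only one bin.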

What range of heights do the fifth and sixth bin contain?

How do we know there were 31 trees in the sample?

How many trees are between 70 and 75 feet tall?

Why is height quantitative?

Which bin number would a tree that is 65 feet tall go in?

Which bin number would a tree that is 65.2 feet tall go in?

Create a set of data that would have the same histogram as shown above.

Having created a set of data to match the histogram, what disadvantage does using a histogram create?

What does the shape of the data mean?

When we graph our data in a histogram it can take on a variety of shapes. Some data is normally distributed. Data that is normally distributed has a symmetrical bell shape. This data has a few low values that have a low frequency. Then, the values in a normal distribution increase until they reach a peak frequency and then they begin to decrease to low frequency again. Our Cherry Tree Height data is approximately normally distributed. The blue line above the histogram bins illustrates the symmetric bell curve.

https://commons.wikimedia.org/wiki/File:Normal-data.jpg

Other data is skewed. The skew can be in the negative direction or the positive direction. Skewed data is also bell shaped, but it is not symmetrical. Data that has a negative skew has a data value that is much lower than the rest of the data values, which pulls the tail of the distribution off to the left. Data that has a positive skew has a data value that is much higher than the rest of the data values, which pulls the tail of the distribution off to the right.

https://commons.wikimedia.org/wiki/File:Negative_and_positive_skew_diagrams_(English).svg

The data values that are much lower or much higher than the rest of the data are called outliers. We will learn a way to determine whether a data value is an outlier later in this section.

Is there a way to visually organize quantitative data?

Another type of chart that we can use to organize data is a Box and Whisker Plot. The Box and Whisker Plot of the Cherry Tree data is shown below. We can learn the shape of the data from a Box and Whisker Plot as well as how much variation the data values have. Variation is how far the data values are away from each other.

The maximum and minimum data values are indicated by the short lines at the end of the “whiskers”. You can see from the Box and Whisker Plot that the data varies from about 60 feet to a little less than 90 feet. This measurement is called the range. The range is the maximum data value minus the minimum data value. The actual data shows us that the minimum data value is 60 feet, and the maximum data value is 89 feet. This makes our range \(89 - 60 = 29\) feet.

The “box” piece of the plot shows us three important values from our data set, the quartiles. Quartiles break the data into sections, with 25% of the data being in each section.

The line closest to the minimum data value is called the first quartile. The symbol for the first quartile is \(Q_1\). The value at the first quartile tells us the cutoff for the lowest 25% of our data. For the Cherry Tree data set, \(Q_1 \approx 71\) feet. This means that 25% of the tree heights in our data set are below 71 feet. The next line in the box is the value of the second quartile. This value also has another more commonly used name, the median. The median is the data value that is in the exact middle of the data set, so 50% of the data is below the median. For the Cherry Tree data, the median is about 77 feet. The final line in the box is the value for the third quartile, \(Q_3\); 75% of the data values are below this value. The value for \(Q_3\) is about 79 feet.

The box and whisker plot is also a visual way of displaying the five number summary of the data set. The five number summary consists of the minimum, the first quartile, the median, the third quartile, and the maximum.

What shape does the Cherry Tree data have? Choose one of the options below.

Normal

Skewed Left

Skewed Right

You can use a spreadsheet program or a TI84 calculator to find the five number summary for any quantitative data set. Using Microsoft Excel, the five number summary for the Cherry Tree Data is in the following table. The commands used to get the five number summary are at the right of the table. This gives us exact values, rather than the approximations that we got from the box and whisker plot.

Minimum	60	=quartile($B$2:$B$32, 0)
Q1	71.15	=quartile($B$2:$B$32, 1)
Med	76	=quartile($B$2:$B$32, 2)
Q3	79.15	=quartile($B$2:$B$32, 3)
Maximum	89	=quartile($B$2:$B$32, 4)

The five number summary is often recorded as

(60, 71.15, 76, 79.15, 89)
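If you prefer a programming language to a spreadsheet, the five number summary can also be found with Python's built-in statistics module. The data below are made-up heights, not the Cherry Tree data, and the function name `five_number_summary` is our own. We assume here that the "inclusive" method is the one that matches the spreadsheet quartile command, since both interpolate between neighboring data values.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the smallest and largest values as the 0% and
    # 100% points and interpolates between neighboring data values
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical heights in feet, NOT the Cherry Tree data
heights = [60, 65, 70, 72, 75, 78, 80, 85, 89]
print(five_number_summary(heights))
```

Different programs use slightly different quartile rules, so small differences between tools are normal.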

If you’d like to draw a box and whisker plot by hand you can follow the instructions at this website: https://www.wikihow.com/Make-a-Box-and-Whisker-Plot

How do we quantify outliers?

Earlier, we said that outliers were data values that are far away from the rest of our data. How far away does a data value have to be from the rest of the data to be considered an outlier? To calculate this, we must first find the Interquartile Range, or IQR. The IQR is \(Q_3 - Q_1\). For our Cherry Tree Height data, that would be \(79.15 - 71.15 = 8\) feet. To determine if we have an outlier, there are two formulas.

Low outliers are found using \(Q_1 - 1.5 \times \text{IQR}\).

High outliers are found using \(Q_3 + 1.5 \times \text{IQR}\).

Low outliers would be below \(71.15 - 1.5(8) = 59.15\) feet.

High outliers would be above \(79.15 + 1.5(8) = 91.15\) feet.
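The outlier cutoffs can be checked with a short Python function. This sketch assumes the commonly used rule that a value is an outlier when it lies more than 1.5 times the IQR below the first quartile or above the third quartile; the function name `outlier_fences` is our own.

```python
def outlier_fences(q1, q3):
    """Return the (low, high) cutoffs; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Exact quartiles of the Cherry Tree data from the spreadsheet table
low, high = outlier_fences(q1=71.15, q3=79.15)
print(f"Low outliers are below {low:.2f} feet; high outliers are above {high:.2f} feet")

# The minimum (60) and maximum (89) of the Cherry Tree data both fall inside the cutoffs
print(low <= 60 and 89 <= high)
```

Since even the smallest and largest tree heights are inside the cutoffs, this rule finds no outliers in the Cherry Tree data.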

Use the data set you created to match the histogram and find the range of your data.

Use the data set you created to match the histogram and find the five number summary, then create the box and whisker plot. (You can use a spreadsheet program or draw it by hand)

Use the data set you created and do the calculations to find if there are any outliers.

Are there other important values in quantitative data?

The mean, median and mode are also called measures of central tendency.

The box and whisker plot and five number summary shows us important information about our data. One of those values is the center of the data. There are several ways we can determine the center of the data. One is the median, the exact center of the data when the data are put in order from least to greatest (or greatest to least). The symbol for the median is If you have an odd number of data values, the median is the value in the center. If you have an even number of data values, the median is the mean of the two middle values.

For example, suppose we have the following set of sample data.

The data is in order and there are an odd number of values, so we can just choose the number in the middle. The median, . Now, suppose we have this set of data.

To find the median, we must put the data in order. Then, because there are an even number of values, we must find the mean of the middle two values.

The median, .

The other way to determine the center of the data is to find the mean. The mean is found by adding up all the data values and dividing by the total number of data values in the data set. The symbol for the mean of a sample is \(\bar{x}\). The symbol for the mean of a population is the Greek letter mu, \(\mu\).

The mean for this set of sample data, would be found by adding up all the data values and then dividing by 6.

Another measure of central tendency is the mode. The mode is the value that happens with the most frequency in the data.

The mode in this data set is because it happens twice.
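All three measures of center can be checked with Python's built-in statistics module. The two data sets below are hypothetical and were chosen only for illustration: one has an odd number of values, and the other has six values with one value that happens twice.

```python
import statistics

# Hypothetical sample data, chosen only for illustration
odd_sample = [3, 5, 7, 8, 12]     # five values, already in order
even_sample = [4, 8, 1, 9, 8, 6]  # six values, not in order

print(statistics.median(odd_sample))   # the middle value of the ordered data
print(statistics.median(even_sample))  # the mean of the two middle values, 6 and 8
print(statistics.mean(even_sample))    # the sum of the values (36) divided by 6
print(statistics.mode(even_sample))    # 8 happens twice, more than any other value
```

Notice that the median function puts the data in order for us before it looks for the middle.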

In a normal distribution, the mean, median and mode are all the same value.

Measures of Central Tendency | Definition, Formula & Examples - Video & Lesson Transcript | Study.com

Create a data set with an even number of data values and find the mean, median and mode. What shape is your data?

Create a data set with an odd number of data values and find the mean, median and mode. What shape is your data?

Create a data set that would result in a positive skew.

Create a data set that would result in a negative skew.

Are there other ways to find the variation in quantitative data?

One way to find the variation in the data is to calculate the range (maximum – minimum). Another way is to calculate the standard deviation. The standard deviation is roughly the average distance of the data values from the mean of the data. The symbol for the standard deviation of a sample is \(s\). The symbol for the standard deviation of a population is the lower-case Greek letter sigma, \(\sigma\). A low standard deviation tells us that the data does not have a lot of variation and is grouped close to the mean, like the red portion of the graph below. A high standard deviation tells us that there is a lot of variation in the data, and the values are spread out away from the mean, like the blue portion of the graph.

https://upload.wikimedia.org/wikipedia/commons/f/f9/Comparison_standard_deviations.svg

The data that is graphed in blue above has a standard deviation of 50. It has a minimum of approximately -10 and a maximum of 230. The data that is graphed in red has a standard deviation of 10. It has a minimum of 60 and a maximum of approximately 125. A quick way to estimate the standard deviation is to find the range and divide that by 4. For the blue data, the estimate is \(\frac{230 - (-10)}{4} = 60\). For the red data, the estimate is \(\frac{125 - 60}{4} = 16.25\).

These values are not exact, but they quickly give us a good idea of what the standard deviation is. We should also only compare two sample standard deviations when the sample means are about the same. In this case, the center of both data sets is 100, so a comparison would be appropriate.

The actual calculations for the standard deviation of a sample are completed using this formula:

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

Let’s calculate the standard deviation of this sample data.

22, 22, 26 & 24

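*Step 1:* Compute the mean: \(\bar{x} = \frac{22 + 22 + 26 + 24}{4} = 23.5\).
*Step 2:* Subtract the mean from each individual sample value: \(-1.5,\ -1.5,\ 2.5,\ 0.5\).
*Step 3:* Square each of the deviations obtained in Step 2: \(2.25,\ 2.25,\ 6.25,\ 0.25\).
*Step 4:* Add all of the squares obtained in Step 3: \(2.25 + 2.25 + 6.25 + 0.25 = 11\).
*Step 5:* Divide the total from Step 4 by \(n - 1\), which is one less than the number of values in the sample: \(\frac{11}{3} \approx 3.6667\).
*Step 6:* Find the square root of the result in Step 5: \(s = \sqrt{3.6667} \approx 1.9149\).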

This means that the standard deviation of our data is 1.9149. On average, each datum is about 1.91 units away from the mean. (Datum is the singular form of data.) The good news is that a spreadsheet program or a TI84 calculator will perform all these calculations for you. You can also calculate a population standard deviation using a spreadsheet program or TI84 calculator.
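The six steps can be followed line by line in Python, and the result can be compared with the built-in functions. The variable names are our own; `statistics.stdev` computes the sample standard deviation and `statistics.pstdev` computes the population version.

```python
import math
import statistics

sample = [22, 22, 26, 24]

mean = sum(sample) / len(sample)          # Step 1: compute the mean
deviations = [x - mean for x in sample]   # Step 2: subtract the mean from each value
squares = [d ** 2 for d in deviations]    # Step 3: square each deviation
total = sum(squares)                      # Step 4: add the squares
variance = total / (len(sample) - 1)      # Step 5: divide by n - 1
s = math.sqrt(variance)                   # Step 6: take the square root

print(round(s, 4))
# The standard library gives the same sample standard deviation
print(round(statistics.stdev(sample), 4))
# The population standard deviation divides by n instead of n - 1
print(round(statistics.pstdev(sample), 4))
```

The population value comes out a little smaller because it divides by 4 instead of 3.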

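The symbol for the standard deviation of a population is \(\sigma\). This is the formula for the standard deviation of a population, where \(\mu\) is the population mean and \(N\) is the number of values in the population:

\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]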

Thankfully, this calculation is performed by calculators and spreadsheet programs.

Now that we have some tools to help us organize our data, we can begin to analyze it. There are several important tools for analyzing data that we will discuss in upcoming chapters.

Chapter 3: Variation & Unusual Statistics

How do I know if my sample statistic is unusual?

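Once we know the value of the standard deviation of the data, we can use it to find which data values are unusual. For our 22, 22, 26 & 24 data we found that \(s = 1.9149\). To find the minimum and maximum usual values, we also need to calculate the mean of our sample, \(\bar{x}\):

\[ \bar{x} = \frac{22 + 22 + 26 + 24}{4} = 23.5 \]

Now that we have the mean and standard deviation we can calculate where the unusual values would begin.

\[ \text{minimum usual value} = \bar{x} - 2s = 23.5 - 2(1.9149) = 19.6702 \]

\[ \text{maximum usual value} = \bar{x} + 2s = 23.5 + 2(1.9149) = 27.3298 \]

This means that any value below 19.6702 or above 27.3298 would be considered unusual.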
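The same limits can be found with a short Python function. This sketch assumes the cutoffs sit two standard deviations on either side of the mean, which reproduces the limits 19.6702 and 27.3298 quoted above. The function names `usual_range` and `is_unusual` are our own.

```python
def usual_range(mean, standard_deviation):
    """Return (minimum usual value, maximum usual value): the mean -/+ 2 standard deviations."""
    return mean - 2 * standard_deviation, mean + 2 * standard_deviation

def is_unusual(value, mean, standard_deviation):
    """Return True when the value falls outside the usual range."""
    low, high = usual_range(mean, standard_deviation)
    return value < low or value > high

# Mean and (rounded) standard deviation of the 22, 22, 26 & 24 data
low, high = usual_range(mean=23.5, standard_deviation=1.9149)
print(round(low, 4), round(high, 4))
print(is_unusual(28, 23.5, 1.9149))  # 28 is above the maximum usual value
```

You can reuse `usual_range` in the exercises below after you add a new number to the data set and recalculate the mean and standard deviation.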

Add one number to the data set that is much lower than 22. Then perform the calculations for the mean and standard deviation again.

What is your minimum usual value now?

Add one number to the data set that is much higher than 26. Then perform the calculations for the mean and standard deviation again.

What is your maximum usual value now?

This brings us to the Empirical Rule. The Empirical Rule states that for data that is normally distributed, 68% of data will fall within one standard deviation of the mean, 95% of data will fall within 2 standard deviations of the mean, and 99.7% of data will fall within 3 standard deviations of the mean.

https://upload.wikimedia.org/wikipedia/commons/d/df/08fig-empirical.png

Since about 95% of the data falls within 2 standard deviations of the mean, any value that is more than 2 standard deviations below the mean or more than 2 standard deviations above the mean is considered unusual.

Another way to measure the variation in the data is to compute the variance. The variance is the standard deviation squared. For our 22, 22, 24, 26 data the variance would be $s^2 = (1.9149)^2 \approx 3.6667$.
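As a quick check, the relationship between variance and standard deviation can be sketched in Python. This assumes the sample versions of both measures from the standard `statistics` module; the variable names are illustrative.

```python
import math
import statistics

data = [22, 22, 24, 26]

sample_variance = statistics.variance(data)  # divides by n - 1
sample_sd = statistics.stdev(data)

print(round(sample_variance, 4))                      # 3.6667
print(math.isclose(sample_variance, sample_sd ** 2))  # True
```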

What does normally distributed mean? Describe data that is normally distributed.

Calculate the variance for each of the data sets you created above.

Indeed.com has a good summary of the measures of variation and the Excel commands to calculate them. There is also a listing of careers that use these important measures. https://www.indeed.com/career-advice/career-development/measures-of-variation

What if we want to compare two things that don’t seem related?

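We can use a standardized statistic. The z-score tells us how many standard deviations a data value is away from the mean of the data set. We can calculate the z-score using the formula:

$z = \dfrac{x - \mu}{\sigma}$, where $x$ is the data value, $\mu$ is the mean, and $\sigma$ is the standard deviation.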

Suppose you take a physics test, and the mean of your class is 78 with a standard deviation of 2.8. You scored 84. Your friend takes a psychology test with a class mean of 86 and a standard deviation of 1.4. They scored 92. You both are 6 points above the mean, but is one score better than the other? We can use the z-score to find out.

For the physics test score of 84, $z = \dfrac{84 - 78}{2.8} \approx 2.14$. For the psychology test score of 92, $z = \dfrac{92 - 86}{1.4} \approx 4.29$.

The physics test score is 2.14 standard deviations above the mean. The psychology test score is 4.29 standard deviations above the mean.
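The two z-scores can be reproduced with a few lines of Python. This is a sketch only; the function name `z_score` and its parameter names are invented for this example.

```python
def z_score(value, mean, sd):
    """How many standard deviations `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

physics = z_score(84, 78, 2.8)
psychology = z_score(92, 86, 1.4)

print(round(physics, 2), round(psychology, 2))  # 2.14 4.29
```

Both scores are 6 points above their class means, but the smaller standard deviation in the psychology class makes that score stand out more.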

Figure: the standard normal distribution with the z-scores 2.14 and 4.29 marked.

https://commons.wikimedia.org/wiki/File:The_Normal_Distribution.svg

I’ve drawn the z-scores for both tests on the standard normal distribution above. You can see that a z-score of 4.29 is farther away from the mean of the distribution, $\mu = 0$. We learned earlier that any data values that are more than 2 standard deviations away from the mean are unusual. Looking at the z-scores for the test scores, we can see that they are both unusual. However, the psychology test score is 4.29 standard deviations above the mean, which is less likely to happen than the score that is 2.14 standard deviations above the mean. The farther the z-score is from zero (in absolute value), the more unusual the data value is.

z-scores can also be negative. This would indicate that the data value is below the mean. Suppose you take a physics test and the mean of your class is 78 with a standard deviation of 2.8. You score a 71. Your friend takes a psychology test with a mean of 86 and a standard deviation of 1.4. They score a 79.

Calculate the z-score for each of these tests.

Which one is more unusual?

What is the z-score for a physics test score of 78?

What is the z-score for a psychology test of 86?

Why do you think that happened?

Variation and knowing whether your sample statistic is unusual are important parts of analyzing data. In order to fully understand data analysis, we need to talk about probability first. We will do that in the next chapter.

Chapter 4: Probability and Data Analysis

What does probability have to do with data analysis?

Probability is the proportion of times a certain result happens in a very large set of results. Suppose you flip a coin. The probability the coin will land on heads is 1 out of 2, or $\frac{1}{2}$, but if you flip a coin three times, you don’t get heads exactly $\frac{1}{2}$ of the time. You may not even get something close to $\frac{1}{2}$. Even if you flip it 15 times, you still won’t get exactly $\frac{1}{2}$. We talk about probability in terms of events. The coin landing on heads is the event we are interested in. We record probability like this:

$P(\text{heads}) = \frac{1}{2}$

Try this coin flip simulator. Start with 15 tosses.

When you tossed the coin 15 times, what was the probability of heads as a decimal?

Then try 200 tosses. What is the probability of heads now?

How many tosses do you have to do to get the probability value of exactly $\frac{1}{2}$?

What do you notice about the graph that is created when you do the simulation?

The point of this exercise is that probability is about the long run. If we flip the coin thousands of times, the proportion of heads will get closer and closer to 0.5, the decimal equivalent of $\frac{1}{2}$. This is due to the Law of Large Numbers. The Law of Large Numbers says that as the event (coin toss, dice roll, etc.) is repeated over and over, the proportion of times the event occurs in the sample tends to approach the actual probability for the population.
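The long-run behavior can also be explored with a small simulation. The sketch below is not the simulator linked in this chapter; it assumes a fair coin modeled with Python's standard `random` module, and the function name `proportion_of_heads` and the `seed` parameter are made up for this example.

```python
import random

def proportion_of_heads(tosses, seed=None):
    """Simulate `tosses` fair coin flips and return the proportion that land heads."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

# Small samples can land far from 0.5; large samples tend to settle near it.
for n in (15, 200, 10_000, 100_000):
    print(n, proportion_of_heads(n, seed=1))
```

Try changing the seed. The proportions for 15 tosses jump around, while the proportions for 100,000 tosses stay close to 0.5.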

If you look at probability values as fractions or their decimal equivalents, they range from 0 to 1. If you are using percentages, then they range from 0% to 100%. If the probability of a certain event is zero, that means there is no chance it will happen. For example, the probability that all cats will fly tomorrow is zero. If the probability of an event is one, that means it is certain to happen. For example, the probability that the sun will rise tomorrow is one. We call an event unlikely if the probability that it will happen is 0.05 (5%) or less.

Let’s use a single six-sided die to learn more about probability. A single die has the numbers one through six on its sides.

https://commons.wikimedia.org/wiki/File:Dice-faces_32x32.jpg

The probability that the die lands on five is 1 out of 6, or $\frac{1}{6}$. This is because there is one five and there are six possible outcomes when you roll the die. So, $P(5) = \frac{1}{6}$.

What is the probability that the die lands on two?

Change the fraction to the decimal equivalent.

Change the decimal equivalent to a percentage.

What is the probability that the die lands on seven?

What if we had two dice? What would that change? First, we wouldn’t just have the outcomes of the roll of one die to consider. We would have to consider the outcomes of the rolls of both dice. That means that the number of outcomes is going to increase from six. How many outcomes would there be? We could list a few. We could get a 1 on Die #1 and a 5 on Die #2. We will put that in the table below as 1, 5. We could also get a 3 on Die #1 and a 1 on Die #2. We will put that in the table below as 3, 1.

Is rolling a five on Die #1 and a one on Die #2 the same as rolling a one on Die #1 and a five on Die #2? Justify your answer.

Die #1, Die #2	Die #1, Die #2	Die #1, Die #2	Die #1, Die #2	Die #1, Die #2	Die #1, Die #2
1, 1	1, 2	1, 3	1, 4	1, 5	1, 6
2, 1
3, 1
4, 1
5, 1
6, 1			6, 4

Complete the table.

How many outcomes are there now?

You should have found 36 different outcomes. Since we are using two dice, we are now looking at the probability of a sum. For example, if you roll the dice and get a 5 on Die #1 and a 3 on Die #2, the sum of the dice is 8. That means the probability of getting a sum of eight is $\frac{1}{36}$. But wait. Is that right? Are there other ways we could get an eight if we roll two dice? We could get a 6 on Die #1 and a 2 on Die #2, or a 3 on Die #1 and a 5 on Die #2, or a 2 on Die #1 and a 6 on Die #2, or a 4 on both dice. The roll of each die is considered an independent event. Independent events are events where the outcome of one event doesn’t affect the outcome of the other event. This means that the roll of Die #1 doesn’t affect the outcome of the roll of Die #2.

Let’s look at the table of possible outcomes for rolling two dice and see how many ways there are to get a sum of 8.

Die #1, Die #2	Die #1, Die #2	Die #1, Die #2	Die #1, Die #2	Die #1, Die #2	Die #1, Die #2
1, 1 1+1=2	1, 2 1+2=3	1, 3 1+3=4	1, 4 1+4=5	1, 5 1+5=6	1, 6 1+6=7
2, 1 2+1=3	2, 2 2+2=4	2, 3 2+3=5	2, 4 2+4=6	2, 5 2+5=7	2, 6 2+6=8
3, 1 3+1=4	3, 2 3+2=5	3, 3 3+3=6	3, 4 3+4=7	3, 5 3+5=8	3, 6 3+6=9
4, 1 4+1=5	4, 2 4+2=6	4, 3 4+3=7	4, 4 4+4=8	4, 5 4+5=9	4, 6 4+6=10
5, 1 5+1=6	5, 2 5+2=7	5, 3 5+3=8	5, 4 5+4=9	5, 5 5+5=10	5, 6 5+6=11
6, 1 6+1=7	6, 2 6+2=8	6, 3 6+3=9	6, 4 6+4=10	6, 5 6+5=11	6, 6 6+6=12

There are 5 ways to get a sum of 8. That means $P(\text{sum of 8}) = \frac{5}{36}$. The image below shows another way to count the number of ways you can get a certain sum. For example, there is only one way to get a sum of 12, so $P(\text{sum of 12}) = \frac{1}{36}$.
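Counting the outcomes by hand works well for two dice, and the same counting can be sketched in Python. This is one possible approach, assuming two fair six-sided dice; the names `outcomes`, `ways`, and `p_sum` are invented for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every ordered outcome (Die #1, Die #2): (1, 1), (1, 2), ..., (6, 6).
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes produce each sum.
ways = Counter(die1 + die2 for die1, die2 in outcomes)

def p_sum(total):
    """Probability that the two dice add up to `total`."""
    return Fraction(ways[total], len(outcomes))

print(len(outcomes))      # 36
print(ways[8], p_sum(8))  # 5 5/36
print(p_sum(12))          # 1/36
```

Because (5, 3) and (3, 5) are stored as separate outcomes, the code counts them separately, just as the table does.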

https://commons.wikimedia.org/wiki/File:Dice_Distribution_(bar).svg

What if we wanted to find the probability of getting a sum of 8 or a sum of 3? We know there are 5 possible ways to get a sum of 8. How many possible ways are there to get a sum of 3? Looking at the image above, we can see that there are only 2. You could get a 2 on Die #1 and a 1 on Die #2, or a 1 on Die #1 and a 2 on Die #2. That makes the probability of getting a sum of 8 or a sum of 3 equal to 7 out of 36, or $\frac{5}{36} + \frac{2}{36} = \frac{7}{36}$.

This demonstrates the Addition Rule. The Addition Rule says that the probability that Event A (sum of 8) or Event B (sum of 3) or BOTH will occur in a single trial is the sum of the probabilities of each event, minus the probability that both occur at the same time. On one roll of two dice, there is no way that we can get a sum of 8 and a sum of 3. The two events cannot occur at the same time, so there is nothing to subtract. Events that cannot occur at the same time are called mutually exclusive or disjoint. There is no overlap between the two events. (This is not the same thing as independent events, which are events that do not affect each other.) The Venn diagram below illustrates this.

Figure: Venn diagram with two circles that do not overlap, labeled Sum of 3 and Sum of 8.

https://commons.wikimedia.org/wiki/File:Intersecci%C3%B3nABvacia.svg

This is the probability equation notation for the Addition Rule:

$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

We want to be careful to count each outcome only once. In other words, find the number of ways Event A can occur, then find the number of ways Event B can occur, and add them. Then subtract the number of ways they can both happen at the same time.
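The counting described above (add the ways for each event, then subtract the ways both can happen) can be checked with a short Python sketch. It assumes two fair six-sided dice; the helper names `prob`, `sum_is_8`, and `sum_is_3` are invented for this example.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes for two dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a function that returns True or False for an outcome."""
    favorable = [outcome for outcome in outcomes if event(outcome)]
    return Fraction(len(favorable), len(outcomes))

def sum_is_8(outcome):
    return sum(outcome) == 8

def sum_is_3(outcome):
    return sum(outcome) == 3

p_a = prob(sum_is_8)
p_b = prob(sum_is_3)
p_both = prob(lambda o: sum_is_8(o) and sum_is_3(o))
p_either = prob(lambda o: sum_is_8(o) or sum_is_3(o))

print(p_both)              # 0
print(p_a + p_b - p_both)  # 7/36
print(p_either)            # 7/36
```

The last two lines agree: counting the "or" outcomes directly gives the same answer as adding the two probabilities and subtracting the overlap, which is zero here.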

The next example will use the $P(A \text{ and } B)$ part of the equation. But first, here is some information about what is contained in a deck of 52 playing cards (minus the 2 jokers).

In that deck of cards there are 52 total cards: 26 red cards and 26 black cards.

There are four suits: Clubs (♣), Diamonds (♦), Hearts (♥), and Spades (♠).

There are 13 cards in each suit: A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, and K.

https://commons.wikimedia.org/wiki/File:Svg-cards-2.0.svg

What is the probability that you draw a king or a club from a standard deck of 52 playing cards? These events aren’t mutually exclusive, because the card you draw can be a king and a club at the same time. There are 4 kings out of the 52 cards and there are 13 clubs out of the 52 cards. However, the king of clubs is in both groups, so we have to subtract it once so we don’t count it twice: $P(\text{king or club}) = \frac{4}{52} + \frac{13}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}$. Notice how the king of clubs is in the overlap between the two circles in the Venn diagram.
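The king-or-club count can be verified by building the deck in Python. This is a sketch under the assumption of a standard 52-card deck without jokers; the names `ranks`, `suits`, `deck`, `kings`, and `clubs` are made up for this example.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]

# Each card is a (rank, suit) pair; 13 ranks x 4 suits = 52 cards.
deck = list(product(ranks, suits))

kings = {card for card in deck if card[0] == "K"}
clubs = {card for card in deck if card[1] == "clubs"}
overlap = kings & clubs  # only the king of clubs

# Addition Rule: add both counts, then subtract the overlap once.
p_king_or_club = Fraction(len(kings) + len(clubs) - len(overlap), len(deck))

print(len(kings), len(clubs), len(overlap))  # 4 13 1
print(p_king_or_club)                        # 4/13
```

`Fraction` reduces 16/52 to 4/13 automatically. Changing the two set definitions is one way to check your answers to the questions that follow.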

What is the probability of drawing a Jack or a red card?

What is the probability of drawing a black or 1 card?

What is the probability of drawing a red card or a black card?

Descriptive Statistics

~~Histogram~~
~~Box and Whisker Plot~~
Line Graphs
~~Frequency Table~~
~~Relative Frequency Table~~
Two Way Tables
~~Normal Distribution~~

~~Now that our data is organized, how do we analyze it?~~

~~Shape,~~
~~Center~~
~~Spread --mean, median, mode, standard deviation.~~

Analysis

What does probability have to do with data analysis?

Introduce probability here…

How do we know if our sample data is different from the population?

~~Usual and unusual values~~
~~Empirical Rule - 95% of data is within 2 SD of the mean of the distribution~~
~~Z-scores~~
Correlation and causation

Sampling Distribution

Normal Distributions

Standard Normal Distributions

Student t Distributions

Regression to analyze a data set

2 Quantitative Variables

Explanatory Variable - Independent Variable
Response Variable - Dependent Variable
Correlation

Scavenger Hunt: Find a data set with

A negative correlation
A negative slope and an intercept that makes sense.
A negative slope and an intercept that doesn’t make sense.
A positive correlation
A positive slope and an intercept that makes sense.
A positive slope and an intercept that doesn’t make sense.

When can we conclude causation?

When the experiment is designed using random assignment.

Determining point Estimates

Confidence Intervals

Hypothesis Testing & Type I and Type II Errors

Null Hypothesis
Alternative Hypothesis
p value

One Tailed Test

Two Tailed Test

Chi-square Tests

Non-parametric tests

One Categorical Variable

One Quantitative Variable

Revisit z-scores –introduce t-distribution

2 Categorical Variables

Analysis and interpretation of different distributions (graphs) of data

Sampling Distributions

Normal Distribution

Student t-distribution

Skewed Distributions

Using Linear Regression to analyze a data set

Determining Point Estimates

Calculating and Analyzing Confidence Intervals

Calculating Standardized Statistics

z-scores

t-scores

Sources:

The Cambridge Dictionary: https://dictionary.cambridge.org/us/dictionary/english/data

ANOVA: https://www.qualtrics.com/experience-management/research/anova/

Coin flip Simulator: https://digitalfirst.bfwpub.com/stats_applet/stats_applet_10_prob.html

Definitions:

Categorical data is data that can be separated into groups.

Quantitative data is data that is numerical in nature and on which it makes sense to do calculations.
