Chapter 04 Picturing Data With Tables and Charts | Chapter 4: Picturing Data With Tables and Charts

Clear-Sighted Statistics

Chapter 4: Picturing Data with Tables and Charts

Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency….Graphics reveal data. Indeed graphics can be more precise and revealing than conventional statistical computations.1

-- Edward R. Tufte

I. Introduction

You will recall that descriptive statistics are the techniques for generating quantitative summaries of data while inferential statistics is the science of making inferences about a population using sample data. In this chapter, we turn to the task of descriptive statistics: Describing and interpreting the meaning of data. Descriptive statistics is sometimes referred to as exploratory data analysis or EDA. John W. Tukey, who taught statistics at Princeton University and worked at Bell Laboratories, compared EDA to detective work and inferential statistics, or confirmatory data analysis, to a jury trial.2 In Chapter 5: Statistical Measures will introduce the quantitative measures used in both descriptive and inferential statistics.

After completing this chapter, you will:

• Know the characteristics of frequency tables.

• Know how to construct a frequency table.

• Understand that by using data visualization—charts—the meaning of the data can be communicated quickly.

• Know the common ways frequency tables are charted: Column charts and pie charts.

• Know the characteristics of the four types of frequency distributions: frequency distributions, relative frequency distributions, cumulative frequency distributions, and cumulative relative frequency distributions.

• Know how to construct the four types of frequency distributions.

• Know the common ways frequency distributions are charted: Histograms, frequency polygons, ogives, and Pareto charts.

• Understand what is meant by the terms: Symmetrical, bi-modal, right-skew, leftskew, and asymmetrical distributions.

• Know how fools and knaves misuse charts.

There are a number of Excel files that you should download:

• Chapter04_Exercises.xlsx

• Chapter04_FreqDist.xlsx

• Chapter04_M&Ms_Colors.xlsx

II. What Are Tables?

Tables are used to organize data. Spreadsheet programs, like Microsoft Excel, are based on tables. Tables consist of columns and rows. Columns are vertical. They go up and down just like the columns on the White House shown in Figure 1. Rows, on the other hand, are horizontal. They go sideways or left to right.

Figure 1: Columns on the White House Facade

Columns on the White House in Washington, DC.

Two of the types of tables used in descriptive statistics are frequency tables and frequency distributions. These tables summarize data by counting the number of occurrences—frequencies—of specific categories of variables that occur in the data. Both organize data into mutually exclusive classes. Classes are often called categories, buckets, or bins. In the context of tables, mutually exclusive means that each datum fits in only one class. The categories included in a frequency table or frequency distribution must also be collectively exhaustive. This means that there must be a class for every datum.

III. Frequency Tables

A frequency table is the grouping of qualitative data into mutually exclusive classes. Below in Table 1 is an example. The data can also be found on the Microsoft Excel spreadsheet titled Chapter04_M&Ms_Colors.xlsx. The data are in the worksheet labelled Raw Data_M&Ms. Please note: An Excel file is called a workbook. Workbooks may have one or more pages, which are called worksheets.

For this little frequency table, a package of M&Ms was purchased from a vending machine. When the bag was opened and the contents poured into a bowl, two things were apparent:

1. M&Ms come in six colors: Blue, brown, green, orange, red, and yellow.

2. The six colors do not have an equal number of pieces.

You can construct a frequency table by hand. Simply sort the candies by color and then count the number of pieces in each sorted pile. The number of candies for each color are the frequencies. This is very easy to do when there are a small number of variables. When the data contain a large number of variables, using Microsoft Excel saves time. Of course, you would have to enter the data into Excel.

Chapter04_M&Ms_Colors.xlsx has the necessary Excel formulas you need to know to construct a frequency table. Here are the Excel formulas used to create the frequency table shown below in Figure 2:

Figure 2: Microsoft Excel’s Frequency Table Formulas

Microsoft Excel's
Frequency Table Formulas

When you complete the counting process, your frequency table will look like the one shown in Table 1. Please note: The bag of M&Ms you might buy may have a slightly different distribution of colors.

Table 1: A Simple Frequency Table

Source: M&Ms Package Purchase by the author

The “Color” column contains qualitative data. The “Frequency” column is quantitative data. It is a ratio scale measurement because there is a real zero in this scale. We can, therefore, state that this bag of M&Ms contains twice as many yellow pieces (10) as blue (5).

You can add a third column to show the relative frequencies of the six M&Ms colors. Relative frequencies show the percentage of each class represented in the total. The sum or the relative frequencies, therefore, must equal 100 percent.

Table 2: A Relative Frequency Table

Source: M&Ms Packaged Purchase by the author

The frequency table and the relative frequency table can be shown on one table, as shown in Table 2 above, or in two separate tables: One for the simple frequency table and one for the relative frequency table.

Based on this frequency table, we know that two colors, orange and red, account for over 50 percent of the candies in the bag; three colors, orange, red, and yellow account for over 68 percent of the candies.

IV. Charting Frequency Tables

There are two basic ways to chart frequency tables: Column or bar charts and Pie charts. We will use Microsoft Excel to create these charts.

A. Column and Bar Charts

Column and bar charts do the same thing. Both display rectangles where the length of the rectangles are proportional to the value of the data. The rectangles of column charts are vertical, they go up and down. Hence the name column chart. Bar charts have horizontal rectangles that go left to right. Column charts are used more often than bar charts. Bar charts are used when the chart has many categories or the category names are long.

A column chart for the M&Ms color frequency table is shown below in Figure 3 and can be found in Chapter04_M&Ms_Colors.xlsx. All charts are supposed to telegraph the meaning of the data quicker than a table. This chart achieves this goal. The height of the rectangles quickly communicates that the most common colors of M&Ms are orange, red, and yellow while green brown, and blue are the least common colors.

Figure 3: Column Chart – M&Ms by Color

The Y-Axis must cross the X-Axis at zero.

Source: M&Ms Package Purchased by the author

To see how this chart was created, open Chapter04_M&Ms_Colors.xlsx, and click on the worksheet titled “Sorted.” To create a column chart, highlight cells A2:B7. Select Insert from the menu, and then Chart and Column. Excel will draw a column chart that looks like the one shown in Figure 4.

Figure 4: Basic Column Chart. Source: M&Ms package purchased by the author

Basic Excel Column Chart.

Once you have this rough column chart, you can start to format it to suit your taste. Rather than take you through the steps used to format the chart shown in Figure 4, see if you can create a chart that looks like it. The best way to learn Excel is to dive in and see what it can do. If you make a mistake, do not worry. You can always download a fresh copy of Chapter04_M&Ms_Colors.xlsx and try again.

Please note: The horizontal axis, the axis with the names of the colors, is called the X-Axis. The Y-Axis is the vertical axis on the left. The Y-Axis indicates the frequency of the number of pieces of each color. The strict rule with column charts is that the value of the Y-Axis at the point where it intersects with the X-Axis must equal zero. See Figure 5. If this is not the case, the chart is biased. Later in this chapter, we will see how dishonest commentators—knaves—distort the impressions left by column charts by not having the Y-Axis cross the X-Axis at zero.

Figure 5: The Y-Axis Must Cross the X-Axis at Zero

The Y-Axis must cross the X-Axis at zero.

Source: M&Ms Package Purchased by the author

Generally, column charts are used more often than bar charts. Bar charts are used when the category labels are very long or whenever there are 12 or more categories.

B. Pie Charts

Pie charts show a divided circle of the relationship of the “parts” to the “whole.” Most people are familiar with this type of chart.

Figure 6 shows a pie chart for our M&Ms Color frequency table:

Figure 6: Pie Chart - M&Ms Color Distribution

Pie Chart: Distribution of M&Ms Colors

Source: M&Ms Package Purchased by the author

Pie charts are circles with each “slice” representing the relative frequency of each category. Introductory textbooks may suggest using pie charts. Most statisticians, however, do not like pie charts. Here is a remark commonly attributed to John Wilder Tukey: “There is no data that can be displayed in a pie chart that cannot be displayed better in some other type of chart.”3 Data visualization guru, Edward Tufte writes,

A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare qualities located in a spatial disarray both within and between pies….Given their low data-density and failure to order numbers along a visual dimension, pie charts should never be used.4

The essence of the experts’ critique of pie charts is that people have difficulty gauging the size of the slices by comparing the angles at the center of the pie chart. It is far easier for people to compare the height of columns in a column chart. Follow the advice of giants like Tukey and Tufte. Do not use pie charts. But, if you insist on using them, be aware of the following points:

1. The sum of all the slices must equal 100 percent. If not, your pie chart does not offer a good representation of the data.

2. Do not use 3-D effects. These distort the chart. In fact, avoid using 3-D effects when drawing any chart.

3. Consider showing the biggest slice at the top of the chart this would be 12 o’clock if your pie chart were a clock. The remaining slices should be shown in order of their magnitude.

4. Keep your pie chart simple. Do not include a legend. Use data labels that show the category labels and relative frequencies. You could also add the frequencies.

5. Too many categories makes your pie chart difficult to interpret. Pie charts work best when there are fewer than eight categories.

To prove Edward Tufte’s point than the only thing worse that a pie chart is several of them, look at two pie charts comparing the share of market for athletic shoes in 2010 and 2013 in Figure 7:

Figure 7: Two Pie Chart Comparing Top Athletic Footwear on Teens in 2018
Comparing data using two pie charts

Using two pie charts to compare two data sets is ineffective because it is hard to determine the magnitude of the changes by comparing the size of the slices. A simple table would be much clearer. If you want to chart these data, use a stacked column chart like the one in shown in Figure 8:

Figure 8: Stack Column Chart Comparing Top Athletic Footwear on Teens in 2018
Comparing data using a Stacked Column Chart

A stacked column chart is a much clearer way to compare to time periods than two pie charts because it is easier to compare the changes in market share.

V. Frequency Distributions

Frequency distributions group “raw” data into mutually exclusive quantitative categories, classes, buckets, or bins. These tables count the number of variables, occurrences, or frequencies of the variables for each class. There are four kinds of frequency distributions:

1. Frequency Distributions (it does not have a special name),

2. Relative Frequency Distributions,

3. Cumulative Frequency Distributions, and

4. Cumulative Relative Frequency Distributions.

Here are the basic steps for constructing and interpreting a frequency distribution:

Step 1: Collect the raw data. Raw data is unorganized or unprocessed data.

Step 2: Decide on the number of classes.

Step 3: Determine the class interval or class width.

Step 4: Select the lower limit of the first class. Calculate the class mid-points, and the lower and upper limit of each class. Complete the four frequency distributions.

Step 5: Determine whether to present the data using tables or charts.

Step 6: Comment on the findings.

Constructing frequency distributions: Figure 9 shows raw data.

Figure 9: The Raw Data.

A sample of 50 Dewey, Cheatem, and Howe expense reports.

The human resources department of the law firm Dewey, Cheatem, and Howe is putting together an analysis of dinner and taxi fare expenses for clerical staff who work on cases after 9 pm. Here is a sample of 50 expense reports for the first two weeks of June 2019. Construct a frequency distribution, relative frequency distribution, and the two cumulative frequency distributions. This data along with the frequency distributions and charts are included in the Microsoft Excel file, Chapter04_FreqDist.xlsx.

As previously stated, frequency distributions record the frequencies for each class. The relative frequency distribution records the proportion of the total number of observations for each class. The two cumulative frequency distributions show the sum of the observations in a class plus all of the classes below it. The different kinds of frequency distributions can be presented in a single table or four tables. Table 3 shows the four frequency distributions for the Dewey, Cheatem, and Howe data on a single table:

Table 3: 4 Frequency Distribution on 1 Table.s

The frequency distributions show that the $140 < $150 class has the highest frequency, 13 or 26 percent of the 50 cases in the sample. The three classes in the middle of the distribution: $130 < $140, $140 < $150, and $150 < $160, account for 33 of the 50 observations, or 62% of the data. The data, therefore, are concentrated in the middle of the frequency distribution.

There are seven things you must remember when constructing frequency distributions:

1. Classes must be mutually exclusive. Each variable fits in only one class.

2. The classes must be collectively exhaustive. There is one class for each variable.

3. The first and last class of a frequency distribution must include at least one variable. If these classes have no observations, they should not be included in the frequency distribution.

4. The class intervals or class widths must be the same for all classes. In our example, every class has the same interval: 10.

5. Frequency distribution are easy to understand when the class intervals can be divided by 2, 5, or 10 and the result is a whole number.

6. Too many classes makes charting the frequency distribution difficult.

7. Too few classes may hide the shape of the data.

Here are detailed instructions for constructing frequency distributions:

A. Determine the number of classes:

A handy, but not mandatory, guideline used by many analysts to determine the number of classes is the “2 to the k” rule.” The formula for the 2 to the k rule is: 2k > n.

We read this formula as “two to the k power (where k is the number of categories) is greater than n (number of observations).” Remember the “>” sign means “greater than.” If our data had 30 observations, we would need to raise 2 to the fifth power to get a number greater than 30: 25 = 32, found by 2 x 2 x 2 x 2 x 2 = 32. We would use at least five classes when we construct the frequency distribution. If we had 32 observations, we would have to use 6 categories because 25 = 32 and 32 is not greater than 32.

Given that our data have 50 observations (n = 50), using the 2 to the k rule, suggests using at least 6 classes, 26 = 64. But, a seventh class may be warranted. By the way, 27 = 128, found by 2 x 2 x 2 x 2 x 2 x 2 x 2 = 128. Remember: Never use fewer classes than suggested by the 2 to the k rule. Usually using the smallest number of classes that meets the 2 to the k rule is fine. Sometimes, however, adding an additional class makes the frequency distribution easier to understand.

It is easy to calculate power or exponents using Microsoft Excel. In fact, there are two methods. The first method is the caret, ^, operator. To find 2 raised to the fifth power, use the following formula in Excel: = 2^5. Excel will return the answer, 32. The 2 in the formula can be replaced with any number or cell address that contains a number. The second, and more powerful method, is the POWER function. The syntax of the POWER function is “=POWER(Number,Power). “Number” stands for the base number or cell address with the base number. “Power” is the exponent to which the base number is to be raised. To calculate 2 raised to the fifth power, the formula would be =POWER(2,5). The power function is powerful because when power is entered as a fraction, it can be used to calculate square roots, cube roots, fourth roots, and so forth. Remember: The numerator of this fraction must always be 1.

B. Determining the class interval

The class interval is often called the class width. The class interval is the distance between the lower class limit of a class and the lower class limit of the next class. As stated above, all classes intervals in a frequency distribution must have the same width.

Here is the formula for calculating class intervals:

Equation 1: Determining The Class Interval

Where: “i” equals the class interval,

“H” equals the largest value in the raw data,

“L” equals the smallest value in the raw data, and

“k” equals the number of classes.

The greater than or equal sign, ≥, suggests that the result of this equation may have to be rounded up. Never round down. Doing so violates the formula for class intervals and may result in a frequency distribution that is not collectively exhaustive.

We have already decided to use either six or seven classes, with the Dewey, Cheatem, and Howe data. The largest value is $179 and the smallest is $111. Here are the class intervals using six classes:

Equation 2: Class Interval Calculation With 6 Classes

The value for the class interval is 11.33. But, counting by 11.33 is difficult. We round up the interval to 12.

What would happen if we decided to use seven classes instead of six classes? Here is the class interval using seven classes:

Equation 3: Class Interval Calculation With 7 Classes

Counting by 9.71s is also difficult. We round up the interval to 10.

So, what is better, six classes with a class interval of 12, or seven classes with an interval of 10? This is a matter of judgment. But, given the fact that our decimal numbers system is based on units of ten, most people find that counting by 10s is easier than counting by 12s. Using seven classes, therefore, would be better than using six.

Excel offers two functions that are very helpful for determining class intervals. The MAX function returns the largest value in the data. The MIN function returns the lowest value in the data.

Equation 4: Excel’s Min and Max Functions

=MAX(Cell range)

=MIN(Cell range)

The following Excel formula returns the RANGE, which is the difference between the highest and lowest numbers in the data:

Equation 5: Excel’s Formula for the Range

= MAX(cell range)-MIN(cell range).

In Chapter 5, Statistical Measures, we will review the range along with other common statistical measures.

C. Setting the lower limit of the first class

The lower limit or boundary of the first class must contain the smallest value in the data, $111. We could, therefore, choose either $110 or $111. That decision is another judgment call. Using $110 as the lower limit of the first class will make the frequency distribution a little easier to understand.

D. Write the class boundaries for the seven classes

Here are the class boundaries for the seven classes with class intervals of ten. The first class is “$110 less than $120,” the second is “$120 less than $130” and so forth. The interval for the first class can be written as $110 < $120 or $110 up to $120. It is better to use $110 < $120 because it is shorter and clearer. Here are the seven classes for our frequency distribution:

Table 4: The Boundaries for the Seven Classes

Check the seven classes. Do they all have the same interval? Yes, $10. The class interval us the result of subtracting the lower limit of a class from the lower limit of the next highest class, $120 - $110 = $10. Do all the data fit in the seven categories? Yes. The smallest number in our data, $111, fits in the first class. The largest number, $179, fits in the last class. Are there gaps between any of the seven classes? No. Based on these facts, a frequency distribution constructed using these class boundaries is collectively exhaustive because every variable would fit in the frequency distribution. You should also note the classes do not overlap. The seven classes, therefore, are mutually exclusive, means no variable fits is more than one class.

There are two variables that equal $130. Which class do these variables belong? Do they belong in $120 < $130 or in $130 < $140? Clearly the answer is that $130 goes in the $130 < $140 class and not the $120 < $130 class. This is because $130 is not included in the $120 < $130 class.

The class mid-point of the first class is $115, found by taking the average of the lower class limit of the first class and second class: ($120 + $110)/2 = $115. To find the class mid-points for the remaining classes, simply add the class interval, $10, to the class mid-point of the first class and repeat this process for the remaining classes.

E. Tally the frequencies for each of the seven classes

This step involves counting all the variables for each of the seven classes. You can do this using paper and pencil or Microsoft Excel. Excel, of course, will be faster.

Using paper and pencil, write down the seven categories. As shown in Figure 10, create a “tally” column to the right of the category column.

Figure 10: Tally for the $110 < $120 This is fictious data

Tally for the $110 < $120 class

Going through each variable draw a line through each value between $110 and less than $120. Record the a for each variables in the tally column. Repeat this step for each of the remaining categories..

F. Count the Tally Marks

Table 5 shows what the frequency distribution should look like The next step is to count the tally marks. Once you completed this step, you will have completed the basic frequency distribution.

Table 5: Frequency Distribution. This is fictious data

G. Complete the Relative and Cumulative Frequency distributions

There are still three more frequency distributions to complete: The relative frequency distribution and the cumulative frequency distribution. Table 6 shows the frequency distribution with the added relative frequency column.

Table 6: Relative Frequency Distribution.

Calculating the relative frequencies is easy. The relative frequency of the first class is 8%, found by 8 ÷ 50 = 0.08 = 8%. We repeat this process for the six remaining classes.

The third and fourth frequency distributions are the cumulative frequency distributions: The Cumulative Frequency Distribution and the Relative Cumulative Frequency Distribution. See Table 7.

Table 7: All four Frequency Distributions.

Calculating the cumulative frequencies is easy. For the first class, $110 < $120, the cumulative frequency is the same as the class’ frequency. In this case, 4. The same holds for the cumulative relative frequency, which is 8%. For the remaining classes, add the class frequency to the sum of the cumulative frequencies of all the previous classes.

Creating the four frequency distributions using paper and a pencil is not difficult, but is tedious and prone to mistakes especially when working with large data sets. You can speed up this process using Microsoft Excel. There are two ways to do this with Excel: Using the Histogram function in the Analysis ToolPak, or just using Excel’s built-in functions.

Let’s look at the Analysis ToolPak first. The Analysis Toolpak is a very handy Excel plugin that comes with Microsoft Excel for conducting statistical analysis. Here are the steps to construct a frequency distribution using the AnalysisToolPak:

Step 1: Import the raw data into Microsoft Excel. The data must be in a single column. Please note: The data, shown in Table 7, are organized in five columns. The AnalysisToolPak cannot use this data. The data in a single column can be found on the “Creating a FD in Excel” worksheet in the Chapter04_FreqDist.xlsx workbook. The data is shown in cells A2:A51, with a label ($) in cell A1.

Step 2: Type the upper limits for each category into a different column. Be aware that in Excel the classes are called “bins.” The upper class limits are in cells C2:C8, with the label, Upper Limits, in C1.

Step 3: In another column, type the names of the seven categories. The category names are in cells D2:D8, with the label, Class, in cell D1.

Step 4: Click on the Data or Tools tab in Excel (depending on whether you are using a Macintosh or Windows computer), then click on “Data Analysis.”

Step 5: Click on Histogram. Yes, Histogram, a common chart for graphing frequency distributions. The Excel programmers often give functions odd names. One wonders why Microsoft did not call this function Frequency Distribution. See Figure 11.

Figure 11: Analysis ToolPak Selection Box

AnalysisTookPak Selection Box

Step 6: The Histogram Output Options box will appear. Enter the Input Range, Bin Range, and Output Options. See Figure 12.

Figure 12: Histogram Output Options

AnalysisToolPack: Histogram Options

For the Input Range: Enter the range mentioned in Step 1: A1:A51. To do this, click on the space for the input range and drag the mouse from cells A1 through A51. Excel will add the dollar signs, which indicate that the address is an absolute address. For the bin range use the cell addresses entered in Step 2, C2:C8. As both the Input Range and Bin Range have labels, check the labels box. For the Output option, select cell G1, which will place the results in this worksheet starting at this cell. Leave the Pareto (sorted histogram) and Chart Output boxes unchecked for now. The output is shown in Figure 13.

Figure 13: Analysis ToolPak Histogram Output

Analysis Toolpak: Histogram ToolPak

The Analysis Toolpak is fast. It quickly counts the cell frequencies and calculates the cumulative relative frequencies. But, it does have serious drawbacks. It creates a “bin,” or category called “More,” which is useless. Unfortunately, Excel uses this bin when it charts the data. What is worse, you cannot remove this extra bin from the chart Excel draws. Given these problems, you have two options:

1. Use the seven rows with data in the Frequency and Cumulative % columns, and add the relative frequency and cumulative frequency columns to complete the four frequency distributions.

2. Create all four frequency distributions using the SUM and COUNTIF functions.

Most people learn the SUM function, which adds cells in a designated range of cells, the first time they use Excel. The COUNTIF function is less well known despite the fact that it is extremely useful. The syntax for the COUNTIF function is =COUNTIF(range,criteria). The range are the cells that contain the data. The criteria is what you want to count. The criteria is set within quotation marks.

To see how to create all four frequency distributions, open the Chapter04_FreqDist.xlsx workbook and go to the FreqDist-7 worksheet. All four frequency distributions are in the cell range H1:L9. The COUNTIF function counts the frequencies and cumulative frequencies. The relative frequencies and cumulative relative frequencies are calculated by dividing the class value into the total. The total number of observations in Cell I9 uses the SUM function, =SUM(I2:I8). In Figure 14 are all the formulas used to create these frequency distributions.

Figure 14: Excel’s Frequency Distribution Functions

Microsoft Excel Frequency Distribution formula

Once you complete the four frequency distributions, you can describe the shape of the data or you can chart the frequency distributions. Charting the frequency distributions will give you a graphic display of the shape of the data.

VI. Charting Frequency Distributions

There are four basic ways to chart frequency distributions:

1. Histograms,

2. Frequency Polygons,

3. Ogives, and

4. Pareto charts.

A. Histograms

Histograms are a special form of column chart. The class mid-points or class names are shown on the X-Axis. These mid-points are quantitative variables. The height of the columns, as measured by the scale on the Y-Axis, represents the number of observations, or frequencies, for each class. Because there are no gaps between the classes, the columns should touch each other. In this sense, histograms are different than the column charts used to graph frequency tables. Figure 15 shows the histogram for our Dewey, Cheatem, and Howe frequency distribution:

Figure 15: Histogram. This is fictious data

Histogram

The advantage of a good chart is that it quickly communicates the meaning of the data. This particular histogram makes the following facts clear: The data have a very symmetrical distribution. The middle category, $140 < $150 has the highest frequency. The frequencies for the three categories below $140 < $150 show increasing frequencies. The three categories above $140 < $150 show decreasing frequencies.

Three ways to construct a histogram using Microsoft Excel:

Here is the first method. It uses the histogram charting function that Microsoft introduced a few years ago. The user must use the raw data, not the frequency distribution, and the data must be in a single column. Figure 16 shows the histogram of the Dewey, Cheatem, and Howe data created with the Histogram chart function. It is different than the histogram shown above because Excel determines how to draw the classes. This is unacceptable. While it is possible to use seven classes instead of five; it is difficult, perhaps impossible, for the analyst to alter the category labels and set the lower limit of the first class. This technique, therefore, should never be used. The analyst, not the computer program, should control how the chart looks.

Figure 16: Histogram Drawn With Excel’s Histogram Function

Histogram using Excel's histogram function

The second method uses the Analysis Toolpak’s Histogram function. To draw a histogram, launch the Analysis Toolpak and select Histogram. Make certain the Chart Output box is selected. Then click OK.

Figure 17: Make Certain the Chart Output Box is Selected

Check Chart Output

Excel draws the histogram shown in Figure 18.

Figure 18: Histogram Drawn Using the Analysis ToolPak

Histogram drawn using the Analysis Toolpak

This method has a host of serious problems:

1. There are gaps between the category columns when there should be none. This, however, problem can be fixed.

2. Excel added a space for an eight column, “More.” There is no data for this column. Worse still, users cannot get rid of this column in the histogram.

3. It has a legend, “Frequency,” which is an unnecessary element as we are charting only one variable. This legend can be removed.

4. The X-Axis shows the upper class limits. It should show the class names. Fortunately, this problem can be fixed.

Given these problems, do not use this method of drawing histograms.

The preferred method—a method that lets the user control what the histogram looks like—is to draw a simple column chart. Once you have a column chart, close the gaps between the columns, change the category names if necessary, and make other stylistic refinements. This takes a few minutes, but it is time well spent.

To change the gap between the columns, double-click any of the columns. The following dialogue box comes up. Change the Gap Width to zero as shown in Figure 19.

Figure 19: Gap Width Dialog Box

Gap Width Dialogue Box

The next step is to change the names of the categories. Click the Select Data box on the “ribbon” (a strip of buttons above the work area) shown in Figure 20.

Figure 20: Select Data Box

Select Data button

Another box will open. Place the cursor in the Horizontal (Category) axis labels box that highlight the cells with your category names. Click OK. Excel will change the category names.

Figure 21: Horizontal (Category) axis label box

s Horizontal (Category) axis labels box

At this point, you are ready to format your chart by adding a chart title, data values, axis labels, and the like.

2. Frequency Polygon

A frequency polygon is a type of line chart that displays the same information as a histogram. The mid-points of the classes or class names are on the X-Axis. These mid-points are quantitative variables. The number of observations for each class are on the Y-Axis. There is a marker at the interception of the class mid-points and the number of observations shown on the Y-Axis. Lines connect the markers.

The decision to use either a histogram or a frequency polygon is a matter of judgment. Because these charts communicate the same information, do not use both. Histograms are more widely used than frequency polygons. This is probably because histograms make a more striking visual impact than frequency polygons. The frequency polygon for our Dewey, Cheatem, and Howe frequency distribution is shown in Figure 22.

Figure 22: Frequency Polygon.

Frequency Polygon

To create a frequency polygon, highlight the appropriate cells. Select Insert from the menu, then Chart and Line. Excel will draw a line chart. Modify the line chart using the same steps used to modify the histogram.

3. Ogive

The third type of chart for a frequency distribution is called an ogive or cumulative frequency polygon. The ogive for the Dewey, Cheatem, and Howe data is shown in Figure 23.

Figure 23: Ogive or Cumulative Frequency Polygon.

Ogive or Cumulative Frequency Polygon

Ogives look like line charts, but they are not. With Excel, you cannot draw a line chart with the two Y-Axes that ogives require. The primary Y-Axis on the left shows the cumulative frequency. The secondary Y-Axis on the right shows the cumulative relative frequency. The X-Axis shows the upper class limits.

Ogives provide a lot of useful information. The top 20 percent of expenses (80 to 100 percent) represent expense reports of $160 or more. Ten people had expense reports this high, found by looking at the 50 – 40 range on the primary Y-Axis. See Figure 24.

Figure 24: Top 20 Percent of Expense Report.

Top 20%

The bottom 20 percent of expense reports, of which there are ten, were for $130 or less. See Figure 25.

Figure 25: Bottom 20 Percent of Expense Reports.

Bottom 20 percent

To create an ogive, highlight the cells in three columns, the upper class limits, the cumulative frequencies, and cumulative relative frequencies. Go to Insert, Chart, X Y Scatter. Excel creates the chart shown in Figure 26.

Figure 26: Ogive - Step One.

Ogive: Step One

This chart does not look like an ogive. Several things must be done. First, series 2 must be placed on the secondary Y-Axis. Then the scales on both axes must be adjusted. To create the secondary Y-Axis, double click on one of the series 2 dots. Figure 27 shows dialog box that will appear. Select the option of Plot Series on Secondary Axis.

Figure 27: Secondary Axis Dialog Box

Secondary Axis Dialog Box

The chart, shown in Figure 28, looks more like an ogive.

Figure 28: Ogive - Step Two.

Ogive: Step Two

The next steps are to adjust the scales on all three axes, connect the dots with a line, remove the legend, add the axis labels, and complete the formatting to suit your taste.

4. Pareto Charts

Pareto charts are sometimes called sorted histograms. They are not always covered in introductory statistics textbooks perhaps because Pareto charts are infrequently used. A Pareto chart actually combines a histogram with the category frequencies sorted from largest to smallest and a line chart showing the cumulative relative frequency. Like an ogive, the Pareto chart has two Y-Axes. The primary Y-Axis on the left shows the frequencies. The secondary Y-Axis on the right shows the cumulative relative frequencies. Figure 29 shows an example of a Pareto chart that uses fictional data. On the X-Axis are five categories of unit sales in millions of units. On the first Y-Axis are dollar sales in millions. The second Y-Axis shows the cumulative relative frequencies for dollar sales:

Figure 29: Pareto Chart

Pareto or 80/20 chart

With our Pareto chart, the top two categories, 1.50 < 2.75 million units and 7.00 < 8.00 million units, are the top 20 percent. These categories account for 80 percent (actually 81 percent) of the dollar sales.

Pareto charts are used to highlight the Pareto principle, also known as the 80/20 rule or the law of the vital few. This principle is named after Vilfredo Pareto (1848 – 1923) an Italian-French social scientist who made important contributions to economics, sociology, political science, and moral philosophy. The fundamental idea of the Pareto principal is that data are not evenly distributed. In his 1896, Cours d’economie politique (A Course in Political Economy), Pareto showed that roughly 20 percent of the Italian population owned approximately 80 percent of Italian land. In management, the 80/20 rule is often used: “80 percent of a company’s sales comes from 20 percent of its customers.” The 20 percent is considered the vital few. The 80/20 rule applies in many areas:

• 20 percent of customers contribute 80 percent of sales (or profits).

• 20 percent of drivers cause 80 percent of traffic accidents.

• 20 percent of factories contribute 80 percent of air pollution.

When considering the 80/20 rule, remember it is not a fixed rule. It is merely a rough guide.5 The two numbers need not add to 100 percent. The 20 percent could create 80 percent of the results, but it could also create 90 percent, 75 percent, 50 percent, or even 30 percent.

Creating Pareto charts: Just like histograms, Excel has three ways of creating Pareto charts. You can use the Pareto chart command or the Histogram command in the Analysis ToolPak. Just as with the Histogram command, these methods are awful. Excel’s older charting methods are much better because they offer more control. Follow the steps to create an ogive, but instead of inserting an X Y Scatter chart, insert a column chart. Turn the second series into a line chart. Format the chart to suit your taste.

VII. Symmetrical and Asymmetrical Distributions

Data can have many different “shapes.” When we construct frequency tables or frequency distributions or chart data from these tables, we comment on the shape of the data. In later chapters, we shall see that the shape of the data is a critical issue for inferential statistics. Data can be symmetrical, skewed to the right, skewed to the left, or asymmetrical.

A. Symmetrical Shape

Figure 30 shows a histogram for a symmetrical distribution. Note that the data peaks at the center category with the right side of the distribution a mirror image of the left.

Figure 30: Symmetrical Distribution

The shape of this distribution approximates what is called a normal distribution. The normal distribution is a central focus of inferential statistics. Normal distributions will be described in detail in Chapter 9: Continuous Probability Distributions – Normal Distributions.

Figure 31 shows a symmetrical distribution that is not “normal.” It is bimodal, which is to say that it has two modes. The mode is the most frequent or common value for variables in the data. We will review the mode in detail in Chapter 5: Statistical Measures.

Figure 31: Symmetrical Bimodal Distribution

Symmetrical, Bi-Modal Distribution

B. Right Skewed Distributions

Figure 32 shows a right-skewed distribution, also known as positively skewed data. It has a long “tail” on the right side of the histogram.

Figure 32: Right or Positive Skewed Distribution

Right or Positive Skewed Distribution

C. Left Skewed Distributions

Left-skewed distribution, also known as a negatively skewed distribution, has a long “tail” on the left side of the histogram. See Figure 33.

Figure 33: Left or Negative Skewed Distribution

Left or negatively skewed distribution

D. Asymmetrical Distributions

Finally, a distribution can be asymmetrical, as shown in Figure 34. Asymmetrical distribution varies irregularly.

Figure 34: Asymmetrical Distribution

Asymmetrial Distribution

VIII. How Knaves and Fool Misuse Charts

Knaves are unprincipled, untrustworthy, and dishonest people. Fools are ignorant people who are easily tricked. Both knaves and fools distort data, and in doing so, they misrepresent data. As Robert B. Reich, a Chancellor’s Professor of Public Policy at the University of California at Berkeley and former Secretary of Labor, writes: “There are two kinds of liars—fools and knaves. Fools lie because they don’t know the truth. Knaves lie because they intend to mislead.”6

We will now play a little game called “Knave or Fool.” You will be shown a misleading chart, table, or graphic that appeared in the media. Determine whether the people who created these graphics are fools who do not know better, or knaves who deliberately distort the data to suit their ends. Of course, there is a third option. The charts could present the data accurately. Let’s review a few examples of misleading graphics.

The Misuse of Pie Charts by CBS News

CBS News has been surveying people on the use of marijuana. Here are two pie charts that CBS News aired. These charts are highly misleading. The first reports the results on a survey question that asks Americans whether they have ever tried marijuana.

Figure 35: CBS News – Americans Who Have Tried Marijuana

Misuse of a pie chart by CBSNews, Americas who have tried marijuanna.

Source: http://www.painting-with-numbers.com/blog/getting-high-on-bad-data-visualization/

The pie chart shown in Figure 35 has three sectors: 51 percent today, 43 percent last year, and 34 percent in 1997.

Clearly this chart is due to foolishness, not treachery. Here is why:

1. You cannot present longitudinal data, meaning data that shows changes over time, with a pie chart.

2. The slices add up to 128 percent. The sum of the pie chart slices must never exceed 100 percent.

CBS News’ second misuse of a pie chart comes from a poll taken in April 2017. Respondents were asked whether they thought marijuana or alcohol is more harmful. The pie chart used by CBS News is shown in Figure 36:

Figure 36: CBS News – Which is More Harmful: Marijuana or Alcohol?

Which is more harmful: Marijuana or Alcohol.

Source: https://www.cbsnews.com/news/support-for-marijuana-legalization-at-all-time-high/

While this pie chart is better than the previous one because it focuses on a single time period. But, the slices in this pie chart shown in Figure 36 add to only 97%. The difference is too big to be explained by rounding error. What is missing? We do not know. Again, the short-comings of this chart suggest foolishness, not knavery.

A Bizarre Table Used by Fox News

Figure 37 shows an opinion poll reported on Fox News.

Figure 37: Rasmussen Reports Poll

Rasmussen Poll on Global Warming

Source: https://www.mediamatters.org/blog/2009/12/08/fox-news-fiddles-with-climate-change-polling/157839

The source of the data is Rasmussen Reports, a polling organization whose results are often biased in favor of Republicans and conservative causes. The question this poll asked is: “Did scientists falsify research to support their theories on Global Warming?” Leaving aside for the moment whether the question posed is a leading question, the table offers three responses: Somewhat likely (59%), Very Likely (35%), and Not Very Likely (26%). This is clearly foolishness because the results add up to 120 percent. Of course, if you believe that this is a leading question, we would have to add that this is also knavery. The question asked is, in fact, a leading question because it suggests that scientists are falsifying results. We must conclude: Knaves can also be fools.

A Bizarre Pie Chart Used by Fox News (Figure 38

In November 2009, Fox News was focusing on the 2012 Republican presidential primaries when it aired a graphic on the results of an Opinion Dynamic poll on the leading Republican candidates: Sarah Palin, Mitt Romney, and Mike Huckabee.

Figure 38: Opinion Dynamics Presidential Poll on Fox News

Fox News: 2012 Presidential Run, GOP candidates.

Source: https://flowingdata.com/2009/11/26/fox-news-makes-the-best-pie-chart-ever/

The announcer read that Governor Sarah Palin was in the lead while the following pie chart appeared on the screen. The sum of the slices equals 193 percent! As you know, the proportions for all of the slices must equal 100 percent. This mistake is due to a sophomoric misunderstanding of the data. The actual question asked respondents to name their first and second choices for the Republican presidential nominee. Pie charts cannot be used for this kind of data. A table would have been much better than a pie chart. Clearly this is another foolish chart.

The Old Truncated Y-Axis Trick on Fox News

Figure 39 shows a very misleading column chart on the enrollment in Obamacare that aired on Fox News.

Figure 39: Fox News Distorted Obamacare Chart

Fox News: distorted Obamacare Enrollment chart started airing on March 27, 2013.

https://www.mediamatters.org/blog/2014/03/31/dishonest-fox-charts-obamacare-enrollment-editi/198679

On the right, the tall column shows the enrollment goal of 7,066,000 as of March 31st. The chart shown in Figure 39 has a short column showing that the program has achieved an enrollment of 6,000,000 as of March 27th. The shortfall is 1,066,000. But, the chart exaggerates this difference because the Y-Axis does not cross the X-Axis at zero. The enrollment appears to be one one-third of the goal. This is a classic example of chart knavery. Four days after the chart first appeared, Fox News corrected this chart on April 1, 2014. See Figure 40 for this correction. Of course, this corrected chart did not show the massive surge in enrollments during the last four days in March when the White House announced the final Affordable Care enrollment figure of 7,041,000.6 By the way, the March 31, 2014 goal of 7,066,000 enrollments was not the goal of the Obama administration. It was a prediction of the Congressional Budget Office.7

Figure 40: Corrected Fox News chart

Fox News corrected their distorted Obamacare enrollment chart on April 1, 2013.

https://www.politico.com/blogs/media/2014/04/fox-news-corrects-obamacare-chart-186120

This gambit for distorting column charts, bar charts, and line chartes is called truncating the Y-Axis. It is a clear signal of knavery. As mediamatters.org reports, truncating the Y-Axis is a favorite tool of Fox News.8 Figure 41 shows two column charts with truncated Y-Axes that aired on Fox News.

Figure 41: Fox News’ Column Charts with Truncated Y-Axes

Fox News: Truncated Y-Axes on Column Charts

Source: https://www.mediamatters.org/blog/2012/08/09/today-in-dishonest-fox-charts-government-aid-ed/189223 https://www.mediamatters.org/blog/2013/04/05/fox-news-newest-dishonest-chart-immigration-enf/193507

A Fox News graphic that is more like propaganda than information

Here is our last example of distorted graphics. The graphic shown in Figure 42 deals with job creation during the first 100 days or two months (62 days) for the last four presidents. The graphic claims both time periods. The take-away from this graphic is that President Trump with 317,000 new jobs is a much better jobs creator than the last three presidents. His immediate predecessor, President Obama, was a job destroyer, with a loss of 1,525,000 jobs. Is this graphic truthful? Is it foolishness? Is it knavery?

Figure 42: Fox News on Job Creation During a President’s First 100 Days

100 Days of Job Creation for four U.S. Presidents.

Source: https://www.reddit.com/r/facepalm/comments/68m3v9/fox_news_called_out_for_comparing_the_jobs/ https://me.me/i/100-days-job-creation-jobs-added-during-potus-first-2-13463426

Fox News forgot recent history: When President Obama came to power in January 2009, the country was suffering from the recession of 2008, sometimes called the Great Recession, which began during George W. Bush’s presidency. This was the most severe economic crisis since the Great Depression of the 1930s. Under President Obama, the economy recovered relatively quickly. President Trump inherited a booming economy. President Obama did not. The number of jobs created or lost in the first two months or 100 days of a president’s administration may say more about the state of the economy the president inherited than it does about the new president’s economic policies.

Clearly we must be aware that knaves and fools distort data. We must be prepared to examine news reports critically.

VII. Exercises

Here are five problems for you to solve. The raw data can be found in the Excel file Chapter04_Exercises.xlsx.

Exercise 1:

Here is a sample of checks from the Metro Diner from this past week:

Table 8: Sample of Checks from Metro Dinner

1. How many classes would you recommend?

2. What class interval would you suggest?

3. Construct a frequency distribution, relative frequency distribution, and a cumulative frequency distribution

4. Draw a histogram, frequency polygon, ogive, and Pareto chart

5. Comment on the “shape” of the data

Exercise 2:

There are approximately 13,500 taxicab medallions in NYC. The medallion is fixed to a taxicab’s hood. Here is a sample of 50 taxicabs listing the car model:

Table 9: Sample of 50 taxicabs

1. Construct a frequency table of the approved taxicab models in the sample.

2. Present the frequency table as a chart.

3. Comment on your findings

Exercise 3

According to the College Board, college students spend $1,240 on books and supplies a year.

Here are the results of a survey of students’ expenditures for the 2018-19 academic year.

Table 10: Student Expenditures on Books and Supplies

Summarize this data with frequency distributions.

1. How many classes would you recommend?

2. What class interval would you suggest?

3. Construct a frequency distribution, relative frequency distribution, and a cumulative frequency distribution?

4. Draw a histogram, frequency polygon, ogive, and Pareto chart

5. Comment on the “shape” of the data

Exercise 4

Here are the survey results on the color of people’s eyes.

Table 11: People’s Eye Color

1. Construct a frequency table on eye color

2. Present the frequency table as a column chart

3. Comment of your findings

Exercise 5

Table 7 lists the number of traffic fatalities per million inhabitants for 28 European countries. Create a chart for this data using Microsoft Excel.

Table 12: 2018 – Road Accident Fatalities per Million Inhabitants in 28 European Countries.

Source: European Commission

Except where otherwise noted, Clear-Sighted Statistics is licensed under a
Creative Commons License. You are free to share derivatives of this work for
non-commercial purposes only. Please attribute this work to Edward Volchok.

References

Edward R. Tufte, The Visual Display of Quantitative Information, Second Edition, (Cheshire, CT: Graphics Press, 2001), p. 13.

2 John W. Tukey, Exploratory Data Analysis, (Reading, MA: Addison-Wesley, 1972), pp. 1-3.

3 “The Infamous Pie Chart: History, Pros, Cons, and Best Practices.” .” https://infogram.com/blog/the-infamous-pie-chart-history-pros-cons-and-best-practices/.

4 Edward R. Tufte, The Visual Display of Quantitative Information, Second Edition. (Cheshire, CT: Graphics Press, 2002), p. 178.

55 Richard Koch. The 80/20 Principle, Third Edition. (New York: Doubleday, 2008).

6 Robert Reich, “Fools and Knave.” November 25, 2017. . https://robertreich.org/post/167878608210.

67 Hadas Gold. “Fox News Corrects Obamacare Chart,” Politico. April 1, 2014. https://www.politico.com/blogs/media/2014/04/fox-news-corrects-obamacare-chart-186120. Edward-Isaac Dovere and Jennifer Epstein. “Obama: ‘Armageddon’ Hasn’t Arrived,” Politico, April 1, 2014. https://www.politico.com/story/2014/04/obamacare-enrollment-deadline-special-enrollment-105239.

78 Melanie Arter. “Sebelius Dismisses 7M Sign-Up Goal for Obamacare.” CNSnews.com, February 25, 2014. https://www.cnsnews.com/news/article/melanie-hunter/sebelius-dismisses-7m-sign-goal-obamacare. CNS news is a politically conservative news and commentary website founded by L. Brent Bozell III in 1998.

-19 “Fox News Wallows in Subterfuge,” April 2, 2014, Mediamatters.org. https://www.theprogressivesinfluence.com/2014/04/fox-news-wallows-in-subterfuge.html.

Show the following:

Adjust appearance:

Notes

Annotate