ECO 2150

Descriptive Statistics

2022

1   Introduction

Statistics that summarize numerical information

❼ Later in this course and in ECO 2151, we will learn a variety of tests and procedures that can be applied to numerical data

❼ BUT: before applying such procedures, it is a good idea to just look at one’s data

❼ Examination of the data can reveal a variety of “stylized facts,” and important anomalies

that your economic theory needs to explain

❼ Graphs are one way to explore properties of data

❼ Another way to summarize numerical data is to compute various summary measures, known as descriptive statistics

Aspects of data to summarize

❼ Two basic types of descriptive statistics:

1. measures of central tendency

2. measures of dispersion

❼ Will look at both types of measure

❼ Note: some statistical theory underlies measures we will look at and explains why they are

used

– Statisticians have examined their properties

❼ For now, will take statisticians’ word for it that these measures have good properties


What is a statistic?

A statistic is defined as follows:

Definition. A statistic is any function of the sample information.

In contrast:

Definition. A parameter is  a numerical measure that describes  a specific  characteristic  of a population.

Descriptive statistics are estimates of unobservable population parameters, based on sample data

Population versus sample

Recall:

❼ A population is the complete set of all items in which the investigator is interested

❼ A sample is a subset of a population

❼ An observation is simply one element of the sample or population


Notation: Let $x_i$ be the value of variable $x$ for observation $i$

❼ Population consists of $x_1, \dots, x_N$

❼ Sample consists of a set of observations $x_1, \dots, x_n$, with $n < N$


2   The Summation Operator

Summation operator

Before continuing, need to review/introduce the summation operator:


Definition.

$$\sum_{i=1}^{n} X_i = \sum_{i} X_i = X_1 + X_2 + \cdots + X_n$$

Properties of $\sum$

1. Let $k$ be a constant. Then

$$\sum_{i=1}^{n} k = nk.$$

❼ IMPORTANT: When dealing with summations, a constant is anything that does not

depend on i, the index of the summation

2. Suppose again that $k$ is a constant. Then

$$\sum_{i=1}^{n} k X_i = k \sum_{i=1}^{n} X_i.$$


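More properties of $\sum$

3. Let both $a$ and $b$ be constants. Then

$$\sum_{i=1}^{n} (a + bX_i) = na + b \sum_{i=1}^{n} X_i.$$

4. Sum of a sum of terms:

$$\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i.$$

5. Product of two sums:

$$\left(\sum_{i=1}^{m} X_i\right)\left(\sum_{j=1}^{n} Y_j\right) = \sum_{i=1}^{m} \sum_{j=1}^{n} X_i Y_j.$$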
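The five summation properties can be checked numerically. The sketch below uses small made-up lists X and Y and arbitrary constants a, b, and k (none of these values come from the notes); integers are used so that the comparisons are exact.

```python
# Hypothetical data and constants, chosen only to illustrate the identities.
X = [2, 5, 7, 1]
Y = [3, 4, 6, 8]
a, b, k = 10, 3, 4
n = len(X)

# Property 1: summing the constant k over n terms gives n*k.
prop1 = sum(k for _ in range(n)) == n * k

# Property 2: a constant factor can be moved outside the sum.
prop2 = sum(k * x for x in X) == k * sum(X)

# Property 3: sum of (a + b*X_i) equals n*a + b * sum of X_i.
prop3 = sum(a + b * x for x in X) == n * a + b * sum(X)

# Property 4: the sum of (X_i + Y_i) splits into two separate sums.
prop4 = sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)

# Property 5: a product of two sums equals the double sum of all products.
prop5 = sum(X) * sum(Y) == sum(x * y for x in X for y in Y)

print(prop1, prop2, prop3, prop4, prop5)
```

Note that in property 5 the two sums may run over different numbers of terms (m and n); the lists here happen to have the same length.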

3   Measures of Central Tendency

What are measures of central tendency?

❼ Measures of central tendency are methods of defining the centre or middle of a set of numbers

❼ Three commonly used measures of central tendency:

1. Mean

2. Median

3. Mode

❼ Will look at each measure in turn

The Mean

Definition.  The mean of a set of numerical observations is the sum of the data values divided by the number of observations; that is, their average. Mathematically, we can write this as

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \quad \text{for the population;}$$

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{for the sample.}$$


The Mean

❼ Also known as the arithmetic mean

❼ Mathematically, the two formulae are the same – main difference is that in one case we use

N to indicate the number of observations, while in the other case we use n

❼ Also use a different variable name on the left-hand side

❼ Conventional in statistics to use the Greek letter µ to represent population mean of a variable

❼ Also a convention to represent the sample mean of a variable by placing a bar over the name of the variable, e.g., $\bar{x}$, $\bar{y}$

❼ May add a subscript to µ to indicate which variable it pertains to, e.g., $\mu_X$, $\mu_Y$

❼ Usually will be computing the sample mean

The Median

Definition.  The median of a set  of observations  is  the  middle  observation if the  number of observations is  odd;  it is the  average  of the middle pair if the number of observations is  even. Alternatively, the median can be defined as the value such that 50% of the observations lie above it and 50% of the observations lie below it.

To find the median:

1. Order the observations in either ascending or descending order.

2. If the number of observations is odd, the median will be the value of observation (n + 1)/2 (the middle observation).

3. If n is even, the median will be the average of the values of observations n/2 and (n + 2)/2 (the two middle observations).
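The three steps above can be sketched as a short function. The function name and the test values are illustrative; the only subtlety is that the notes number observations from 1, while Python lists are indexed from 0.

```python
def median(values):
    """Median found by sorting and picking the middle observation(s)."""
    ordered = sorted(values)  # step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # step 2: observation (n + 1)/2, shifted to a 0-based index
        return ordered[(n + 1) // 2 - 1]
    # step 3: average of observations n/2 and (n + 2)/2
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2


odd_case = median([3, 1, 2])       # middle of 1, 2, 3
even_case = median([4, 1, 3, 2])   # average of 2 and 3
print(odd_case, even_case)
```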

The Mode

Definition.  The mode of a set of observations is the value that occurs most frequently.


❼ Some data sets do not have a mode

❼ Some data sets have more than one mode

The Geometric Mean

In calculation of growth rates or return on assets, analysts often use the geometric mean

Definition. The geometric mean of a set of values is the nth root of the product of the n values:

$$\bar{x}_g = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n}.$$

Take natural log of geometric mean:

$$\ln \bar{x}_g = \frac{1}{n} \sum_{i=1}^{n} \ln x_i$$


The Geometric Mean Rate of Return

❼ Geometric mean rate of return is used to compute average percentage return of investment

over time:

$$\bar{r}_g = (x_1 x_2 \cdots x_n)^{1/n} - 1$$

❼ Geometric mean and geometric mean rate of return take compounding into account
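As a rough illustration of how compounding enters, the sketch below computes the geometric mean and the geometric mean rate of return for three made-up yearly growth factors. It assumes each value is a gross return (1 plus the rate of return); that interpretation and the numbers themselves are assumptions for the example, not data from the notes.

```python
import math

# Hypothetical gross returns (1 + rate) for three years: +10%, -5%, +20%.
growth_factors = [1.10, 0.95, 1.20]
n = len(growth_factors)

# Geometric mean: the n-th root of the product of the n values.
geometric_mean = math.prod(growth_factors) ** (1 / n)

# Equivalent log form: ln(geometric mean) = (1/n) * sum of ln(x_i).
log_form = math.exp(sum(math.log(x) for x in growth_factors) / n)

# Geometric mean rate of return.
mean_rate = geometric_mean - 1

# Compounding at mean_rate for n periods reproduces the total growth.
compounded = (1 + mean_rate) ** n

# The arithmetic mean rate, shown for comparison only.
arithmetic_rate = sum(growth_factors) / n - 1

print(round(mean_rate, 4), round(arithmetic_rate, 4))
```

The geometric mean rate comes out below the arithmetic mean rate here, which is the usual pattern when returns vary from period to period.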

THIS IS ALL THE ATTENTION WE WILL PAY TO THE GEOMETRIC MEAN IN THIS

COURSE!

Example: Measures of central tendency

Suppose that you have a sample of final exam grades for 10 students:

Obs. No.   Grade (%)
1          71
2          85
3          44
4          66
5          71
6          95
7          56
8          78
9          81
10         79


Find the mean, median, and mode.
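One way to check hand calculations for this example is Python's standard `statistics` module. This is a sketch for verification, not the method the course expects you to use; the variable names are illustrative.

```python
import statistics

# Final exam grades (%) for the 10 students in the example.
grades = [71, 85, 44, 66, 71, 95, 56, 78, 81, 79]

mean_grade = statistics.mean(grades)      # 726 / 10 = 72.6
median_grade = statistics.median(grades)  # average of 71 and 78 = 74.5
mode_grade = statistics.mode(grades)      # 71 is the only repeated value

print(mean_grade, median_grade, mode_grade)
```

With n = 10 (even), the median is the average of the 5th and 6th ordered observations, which is why it need not equal any observed grade.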


4   Measures of Dispersion

Measures of Dispersion

❼ Measures of dispersion look at how spread out the data are

– Are observations clustered closely around the central value, or do they lie far apart from each other?

❼ We will look at four measures of dispersion

The Range

Definition.  The range of a set of data is the difference between the values of the largest and the smallest observations; that is,

$$\text{Range} = x_{\text{MAX}} - x_{\text{MIN}}.$$

❼ Range is simplest and most straightforward measure of spread of data

❼ PROBLEM: range doesn’t distinguish between situations where there are only one or two

outliers, and situations where the observations are fairly evenly distributed over the range


The Range and Outliers

❼ An outlier is an observation that is very different from most of the other observations in the

data set

❼ Intuitively, would think there is less dispersion if observations are evenly distributed over

range

❼ Outliers are given too much weight by range

Deviations from the mean

❼ Two alternative measures of dispersion are based on deviations from the mean

– Deviations from mean are differences $x_i - \bar{x}$ (sample) or $x_i - \mu$ (population)

❼ Deviations themselves are not that helpful – positive deviations cancel out negative ones

when they are summed

❼ Measures are

1. The mean absolute deviation

2. The standard deviation

Mean Absolute Deviation

Definition. The mean absolute deviation of a set of observations is the average of the absolute deviations.

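$$\text{MAD} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \mu| \quad \text{for the population;}$$

$$\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \quad \text{for the sample.}$$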

Mean Absolute Deviation

❼ Taking absolute values before summing ensures that positive and negative deviations do not

cancel each other out

❼ MAD is used less frequently in practice, primarily because absolute values are relatively

difficult to work with mathematically – this makes it more difficult to derive the properties of the MAD

– The absolute value function is continuous but not differentiable at zero


The Variance

Definition. Let $x_1, x_2, \dots, x_n$ be a sample of n observations.  The sample variance, denoted $s^2$, is defined as follows:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

If by chance we had data for the entire population, we could compute the population variance, $\sigma^2$:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2.$$


The Standard Deviation

Definition.  The standard deviation is simply the square root of the variance:

$$s = \sqrt{s^2} \quad \text{(sample)},$$

$$\sigma = \sqrt{\sigma^2} \quad \text{(population)}.$$

The Variance and the Standard Deviation

❼ Squaring deviations is another means of ensuring that positive and negative deviations do

not cancel each other out

❼ Squaring the deviations attaches a higher weight to large deviations than taking the absolute

value does

❼ Note that the formulae used to compute the population and sample variances are not mathematically the same (will discuss why later)

– For population, divide by number of observations

– For sample, divide by the number of observations less one

Units of measurement of descriptive statistics

❼ Units of measurement of MAD, standard deviation, range, mean, median, and mode are all

the same as those of original variable

– Units of measurement of the variance are the original units squared

❼ Be careful with units!


The Coefficient of Variation

❼ Problem with MAD and variance:  cannot be used to compare the degree of variation of

variables that are measured in different units

– Both are sensitive to units of measurement

❼ Coefficient of variation is a unit-free measure of dispersion

Definition.  The coefficient of variation is the standard deviation divided by the mean; that is,

$$CV = \frac{\sigma}{\mu} \quad \text{(population)},$$

$$CV = \frac{s}{\bar{x}} \quad \text{(sample)}.$$

The Coefficient of Variation

❼ CV is a measure of relative, not absolute, dispersion

❼ Can be used to compare dispersion relative to the mean of variables measured in different

units, or with very different means and variances

❼ Invariant to scaling of the data because scaling will simply adjust both the numerator and

the denominator by the same factor

❼ Measures of relative dispersion are not better than measures of absolute dispersion; they are

just different

– Income inequality!
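The invariance to scaling can be seen in a small sketch. The heights below are made-up values, recorded once in centimetres and once in metres; the function name is illustrative, and the sample standard deviation is used.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical heights, in centimetres and then rescaled to metres.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

sd_cm = statistics.stdev(heights_cm)
sd_m = statistics.stdev(heights_m)

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

# The standard deviation changes by the scale factor; the CV does not.
print(round(sd_cm, 3), round(sd_m, 5))
print(round(cv_cm, 4), round(cv_m, 4))
```

Rescaling divides both the standard deviation and the mean by 100, so the ratio is unchanged.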

Example: Measures of dispersion

Suppose that you have a sample of final exam grades for 10 students:

Obs. No.   Grade (%)
1          71
2          85
3          44
4          66
5          71
6          95
7          56
8          78
9          81
10         79


Find the range, variance, standard deviation, mean absolute deviation, and coefficient of variation.
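A sketch for checking the hand calculations follows. It treats the grades as a sample, so the variance divides by n − 1 while the mean absolute deviation divides by n; the variable names are illustrative.

```python
import math

# Final exam grades (%) for the 10 students in the example.
grades = [71, 85, 44, 66, 71, 95, 56, 78, 81, 79]
n = len(grades)
mean = sum(grades) / n  # 72.6

# Range: largest observation minus smallest observation.
data_range = max(grades) - min(grades)  # 95 - 44 = 51

# Sample variance: squared deviations summed, divided by n - 1.
variance = sum((x - mean) ** 2 for x in grades) / (n - 1)  # about 215.38

# Sample standard deviation: square root of the variance.
std_dev = math.sqrt(variance)  # about 14.68

# Mean absolute deviation: absolute deviations summed, divided by n.
mad = sum(abs(x - mean) for x in grades) / n  # about 11.0

# Coefficient of variation: standard deviation relative to the mean.
cv = std_dev / mean  # about 0.202

print(data_range, round(variance, 2), round(std_dev, 2), round(mad, 2), round(cv, 3))
```

The variance is in squared percentage points, while the range, standard deviation, and MAD are in percentage points; the coefficient of variation has no units.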


Use of the Standard Deviation

❼ Standard deviation can be used in at least two ways:

1. To compare the degree of variability of two data sets (of variables measured in similar units)

2. To construct an interval that contains a specified proportion of the population data

❼ Also used in hypothesis testing, but will not get to that for some time

❼ To construct an interval containing a given proportion of the population data, can use

Chebychev’s Theorem


Chebychev’s Theorem

Theorem. Chebychev’s Theorem: For any population with mean µ and standard deviation σ, the percent of observations that lie within the interval $[\mu \pm k\sigma]$ is

$$\text{at least } 100\left[1 - \frac{1}{k^2}\right]\%,$$

where $k > 1$ is the number of standard deviations.


In other words, we can construct an interval in which a certain proportion of the population lies within a certain number of standard deviations of the mean

Chebychev’s Theorem: Examples

❼ Example: Choose k = 1.5. Then Chebychev’s Theorem implies that at least

$$100\left[1 - \frac{1}{1.5^2}\right]\% = 55.6\%$$

of the population members lie within 1.5 standard deviations of the mean

❼ Can also construct an estimate of the interval using the sample mean and sample standard

deviation

❼ Example: Exam grades for 10 students

– $\bar{x} = 72.6$ and $s = 14.68$

– Therefore, the approximate boundaries of the interval containing at least 55.6% of the population are $\bar{x} + 1.5s = 94.62$ and $\bar{x} - 1.5s = 50.58$
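The bound and the interval can be sketched in a few lines. The function names are illustrative. The theorem is stated for a population mean and standard deviation, so plugging in the sample values 72.6 and 14.68 gives only an estimate of the interval.

```python
def chebyshev_min_percent(k):
    """Lower bound, in percent, on observations within k standard deviations."""
    if k <= 1:
        raise ValueError("the bound is stated for k > 1")
    return 100 * (1 - 1 / k**2)


def k_sd_interval(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd


grades = [71, 85, 44, 66, 71, 95, 56, 78, 81, 79]
x_bar, s = 72.6, 14.68  # sample mean and standard deviation from the example

bound = chebyshev_min_percent(1.5)          # about 55.6
lower, upper = k_sd_interval(x_bar, s, 1.5)  # about 50.58 and 94.62

# Share of the sample grades that actually fall inside the interval.
share_inside = 100 * sum(lower <= g <= upper for g in grades) / len(grades)

print(round(bound, 1), round(lower, 2), round(upper, 2), share_inside)
```

In this sample 8 of the 10 grades fall inside the interval, comfortably above the guaranteed minimum, which shows how conservative the bound can be.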


More precise intervals

❼ Chebychev’s Theorem applies to ALL populations, regardless of what the actual distribution

of the data is

❼ Given information about distribution of observations in the population, could construct more

precise intervals

❼ In real world, many large populations have a symmetric, bell-shaped distribution

❼ For such populations, we can specify more precise intervals based on an “Empirical Rule”

– Will learn later where the empirical rule comes from

Empirical Rule

Rule. For large populations with a symmetric, bell-shaped distribution,

❼ approximately 68% of the population members will lie within the interval µ ± σ;

❼ approximately 95% of the population members will lie within the interval µ ± 2σ;

❼ approximately all of the population members will lie within the interval µ ± 3σ.
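The three percentages can be reproduced under one specific assumption: that the bell-shaped distribution is exactly normal. The notes have not introduced the normal distribution yet, so the sketch below is a preview under that assumption, using `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Assumption: the symmetric, bell-shaped population is exactly normal.
# The shares below do not depend on the particular mean and standard
# deviation, so the standard normal (mean 0, sd 1) is used.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```

For k = 2, Chebychev’s Theorem guarantees only at least 75%, whereas the normal assumption gives about 95%; knowing the shape of the distribution is what makes the interval more precise.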

5   Numerical Summary of Grouped Data

Grouped Data

❼ Sometimes data for a large sample are made available to researchers in grouped form

– For example:  Statistics Canada often reports age and income as categorical variables in Public Use Microdata Files (PUMFs)

❼ In such cases, not possible to apply the formulae that we have seen for the mean, variance,

etc., since we do not have individual data

❼ How, then, do we construct summary statistics?

❼ Solution is to make use of the frequency distribution for the variable

Mean and Variance of Grouped Data: Population

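Definition. Suppose that one has data grouped into K classes, with frequencies $f_1, f_2, \dots, f_K$. Let the midpoints of the range of each class be $m_1, m_2, \dots, m_K$. Then for a population of N observations, where $N = \sum_{j=1}^{K} f_j$, the mean is

$$\tilde{\mu} = \frac{\sum_{j=1}^{K} f_j m_j}{N} = \frac{1}{N} \sum_{j=1}^{K} f_j m_j$$

and the variance is

$$\tilde{\sigma}^2 = \frac{1}{N} \sum_{j=1}^{K} f_j (m_j - \tilde{\mu})^2.$$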


Note on Notation

❼ Note that $\tilde{\mu}$, not µ, is used for the population mean when the data are grouped

❼ This is because the grouped data formula can only approximate the true population mean

– Only exception is special case where each interval contains only one numerical value (i.e., xi  is the same for everyone in interval)

❼ For similar reasons, use $\tilde{\sigma}^2$ instead of $\sigma^2$ for the population variance for grouped data

Mean and Variance of Grouped Data: Sample

Definition. Suppose that one has data grouped into K classes, with frequencies $f_1, f_2, \dots, f_K$. Let the midpoints of the range of each class be $m_1, m_2, \dots, m_K$. Then for a sample of n observations, where $n = \sum_{j=1}^{K} f_j$, the mean is

$$\bar{x} = \frac{1}{n} \sum_{j=1}^{K} f_j m_j$$

and the variance is

$$s^2 = \frac{1}{n-1} \sum_{j=1}^{K} f_j (m_j - \bar{x})^2.$$

Weighted Mean

Note that mean for grouped data is a weighted mean

❼ Weights are actually relative frequencies: $f_j / n$

❼ Relative frequency weights must sum to 1

Median of Grouped Data

❼ Can easily determine which class contains the median value by examining the frequency

distribution

❼ But: cannot directly observe the median

❼ Should you wish to do so, you can make use of this formula for estimating the value of a particular observation:

Rule. Estimating the Value of Observation i in class j: Suppose that class j contains $f_j$ observations, and let L be the lower boundary and U the upper boundary of class j. If these observations were to be arranged in ascending order, the value of the ith observation is estimated to be

$$L + i \, \frac{(U - L)}{f_j}, \quad i = 1, \dots, f_j.$$
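A sketch of how the rule could be used to estimate a median from a frequency table follows. It assumes the estimate takes the form L + i(U − L)/f_j, which spreads the f_j observations evenly over the class; the function names are illustrative, and the class boundaries and counts are those of the study-time example in these notes.

```python
def estimate_observation(lower, upper, freq, i):
    """Estimate the i-th smallest value (1-based) within one class.

    Assumes the form L + i * (U - L) / f_j.
    """
    if not 1 <= i <= freq:
        raise ValueError("i must lie between 1 and the class frequency")
    return lower + i * (upper - lower) / freq


def estimate_at_position(classes, position):
    """Estimate the observation at a given 1-based rank in the whole data set."""
    cumulative = 0
    for lower, upper, freq in classes:
        if position <= cumulative + freq:
            # The rank falls in this class; convert it to a within-class index.
            return estimate_observation(lower, upper, freq, position - cumulative)
        cumulative += freq
    raise ValueError("position exceeds the number of observations")


def estimate_median(classes):
    """Median of grouped data: middle rank, or average of the middle pair."""
    n = sum(freq for _, _, freq in classes)
    if n % 2 == 1:
        return estimate_at_position(classes, (n + 1) // 2)
    low = estimate_at_position(classes, n // 2)
    high = estimate_at_position(classes, (n + 2) // 2)
    return (low + high) / 2


# (lower boundary, upper boundary, frequency) for each class.
study_time = [(0, 4, 3), (4, 8, 7), (8, 12, 8), (12, 16, 5), (16, 20, 2)]

median_estimate = estimate_median(study_time)
print(median_estimate)
```

With 25 observations the median is the 13th; ten observations lie below the class from 8 to 12, so it is the 3rd of the 8 observations in that class.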


The Modal Class

❼ When the data are grouped, it is impossible to determine the mode

❼ Instead, we need to define a new concept: the modal class

Definition.  The modal class is the class with the highest frequency.

Example: Question 2.31 of Newbold et al.  (2013)

For a random sample of 25 students from a very large university, the accompanying table shows the amount of time (in hours) spent studying for final exams.


Study time (hours)   Number of students
0 < 4                3
4 < 8                7
8 < 12               8
12 < 16              5
16 < 20              2


a. Estimate the sample mean study time.

b. Estimate the sample standard deviation.
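A sketch for checking both parts is given below. It applies the grouped-data formulas for a sample: each class is represented by its midpoint, the mean weights midpoints by frequency, and the variance divides by n − 1. The variable names are illustrative.

```python
import math

# (lower boundary, upper boundary, frequency) for each study-time class.
classes = [(0, 4, 3), (4, 8, 7), (8, 12, 8), (12, 16, 5), (16, 20, 2)]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]  # 2, 6, 10, 14, 18
freqs = [freq for _, _, freq in classes]
n = sum(freqs)  # 25 students

# a. Grouped sample mean: frequency-weighted average of the midpoints.
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n  # 234 / 25 = 9.36

# b. Grouped sample variance and standard deviation (divide by n - 1).
grouped_var = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(freqs, midpoints)
) / (n - 1)  # about 20.91
grouped_sd = math.sqrt(grouped_var)  # about 4.57

print(round(grouped_mean, 2), round(grouped_sd, 2))
```

These are estimates rather than exact values, because the midpoints stand in for the unobserved individual study times.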


6   Relationships Between Variables

Measuring Relationships Between Variables

❼ Earlier, saw that scatter plots can reveal relationships between variables

❼ Sometimes need a measure of the direction and/or strength of such relationships

❼ Two related measures of the linear relationship between two variables can be computed:

1. Covariance

2. Correlation coefficient

Population Covariance

Definition.  The population covariance between two variables x and y is

$$\text{cov}(x, y) = \sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$$

where xi  and yi  are the observed values of the variables, µx  and µy  are the population means, and N is the size of the population.


Sample Covariance

Definition.  The sample covariance between two variables x and y is

$$\text{cov}(x, y) = s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where $x_i$ and $y_i$ are the observed values of the variables, $\bar{x}$ and $\bar{y}$ are the sample means, and n is the size of the sample.



Interpretation of covariance

❼ Sign of the covariance tells us about the direction of the relationship between the two vari-

ables

– Positive sign implies an upward-sloping relationship

– Negative sign implies a downward-sloping relationship

❼ But: covariance does not really tell us how strong the relationship is because it is sensitive

to units of measurement

Correlation Coefficient

Definition. The population correlation coefficient between two variables x and y is given by

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

The sample correlation coefficient between two variables x and y is given by

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$


Note: This measure is also known as the Pearson correlation coefficient.

Interpretation of correlation coefficient

❼ Correlation coefficient always lies between -1 and 1

❼ If correlation coefficient equals -1, have a perfect negative relationship (a downward-sloping

straight line)

❼ If correlation coefficient equals 1, have a perfect positive relationship (an upward-sloping

straight line)

❼ The closer |r| or |ρ| is to 1, the stronger the linear relationship

❼ Value of 0 implies no linear relationship
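The sketch below computes the sample covariance and the sample correlation coefficient directly from their definitions (dividing by n − 1). The study-hours and grade figures are made up for illustration, and the function names are not from the notes. It also shows why covariance alone does not measure strength: converting hours to minutes changes the covariance but not the correlation.

```python
import math


def sample_cov(x, y):
    """Sample covariance: cross-products of deviations, divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length, n >= 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_corr(x, y):
    """Sample correlation: covariance divided by the two standard deviations."""
    s_x = math.sqrt(sample_cov(x, x))  # the covariance of x with itself is s_x^2
    s_y = math.sqrt(sample_cov(y, y))
    return sample_cov(x, y) / (s_x * s_y)


# Hypothetical data: hours studied and exam grade (%) for five students.
hours = [2, 4, 6, 8, 10]
grades = [55, 60, 70, 78, 92]
minutes = [60 * h for h in hours]  # same information, different units

cov_hours = sample_cov(hours, grades)
cov_minutes = sample_cov(minutes, grades)
r_hours = sample_corr(hours, grades)
r_minutes = sample_corr(minutes, grades)

# An exact straight line with positive slope gives a correlation of 1.
r_line = sample_corr(hours, [2 * h + 1 for h in hours])

print(cov_hours, cov_minutes, round(r_hours, 3), round(r_minutes, 3))
```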