Chapter 6 Descriptive Statistics
6.1 Introduction to Statistics
6.1.1 Some Basic Terminologies Used in Statistics
i Population
- The set of all possible elements in the universe of interest to the researcher
ii Sample
- A Sample is a subset (a portion or part) of the population of interest
- The sample must be a representative of the population of interest
iii Element
- An element is an entity or object on which information is collected.
- Eg: Student, household, farm, company, tomato plant
iv Variable
- A variable is a feature or characteristic which takes different ‘values’ or categories for different elements (items/subjects/individuals)
- Eg: Gender of client, brand of mobile phones, risk level, number of emails received per day, age of client, income of client
v Data
- Data are measurements or facts that are collected from a statistical unit/entity of interest
- We collect data on variables
- Data are raw numbers or facts that must be processed (analysed) to get useful information.
We get information from data.
Eg:
Variable: Age (in years) of client
Data: 21, 45, 18, 32, 30, 22, 23, 27
Information:
The mean age is 27.25 years
The minimum age is 18 years
The range of ages is 18-45
The percentage of clients below 25 years of age: 50%
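The step from data to information can be sketched in a few lines of Python. This is only an illustration using the eight ages listed above; the variable names are chosen here for the example.

```python
# Ages (in years) of the eight clients from the example above
ages = [21, 45, 18, 32, 30, 22, 23, 27]

# Information derived from the raw data
mean_age = sum(ages) / len(ages)
min_age = min(ages)
max_age = max(ages)
pct_below_25 = 100 * sum(1 for a in ages if a < 25) / len(ages)

print(f"Mean age: {mean_age} years")
print(f"Minimum age: {min_age} years")
print(f"Ages range from {min_age} to {max_age}")
print(f"Clients below 25 years: {pct_below_25}%")
```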
vi Statistic
- Characteristic of a sample
- A value calculated from sample data
vii Parameter
- Characteristic of a population
- A value calculated from population data
viii Census
- When a researcher gathers data from the whole population for a given measurement, it is called a census
ix Sampling
- When a researcher gathers data from a sample of the population for a given measurement, it is called sampling
- The process of selecting a sample is also called sampling
Why take a sample instead of studying every member of the population ?
- Prohibitive cost of census
- Destruction of item being studied may be required
- Not possible to test or inspect all members of a population being studied.
6.1.2 Branches of Statistics
i Descriptive Statistics
- Descriptive statistics consists of organizing, summarizing and presenting data in an informative way.
- The main purpose of descriptive statistics is to provide an overview of the data collected.
- Descriptive statistics describes the data collected through frequency tables, graphs and summary measures (mean, variance, quartiles, etc.).
ii Inferential Statistics
- In inferential statistics sample data are used to draw inferences (i.e. derive conclusions) or make predictions about the populations from which the sample has been taken.
- This includes methods used to make decisions, estimates, predictions or generalizations about a population based on a sample.
- This includes point estimations, interval estimation, test of hypotheses, regression analysis, time series analysis, multivariate analysis, etc.
6.1.3 Types of Variables
6.1.3.1 Qualitative / Quantitative Variables
i Qualitative variable (Categorical variable)
- The characteristic is a quality.
- The data are categories.
- They cannot be given meaningful numerical values.
- However, the categories may be given numerical labels.
- Qualitative variables are sometimes referred to as categorical variables.
- Eg:
Gender:
Age group:
Education level:
A/L stream:
Degree type:
Hair colour:
FIT student batch:
Undergraduate level:
Grade that you can obtain for CM 1110/ CM1130
ii Quantitative variable
- The characteristic is a quantity
- The data are numbers
- Quantitative data require numeric values that indicate how much or how many.
- They are obtained by counting or measuring with some scale
- Eg:
Number of family members:
Number of emails received per day:
Weight of a student:
Age:
Credit balance in the SIM card:
Time remaining in class:
Temperature:
Marks
6.1.3.2 Discrete/ Continuous Variables
- Quantitative variables can be classified as either discrete or continuous.
i Discrete Variables
- Quantitative
- Usually the data are obtained by counting
- There are impossible values between consecutive possible values (for example, a family cannot have 2.5 members)
- Eg:
Number of family members:
Number of emails received per day:
ii Continuous Variables
- Quantitative
- Usually, the data are obtained by measuring with a scale
- There are no impossible values between any two possible values.(any value between any two possible values is also a possible value)
- i.e a continuous variable can take any value within a specified range.
- Eg:
Weight of a student:
Age:
Credit balance in the SIM card:
Time remaining in class:
Temperature:
Marks
6.1.4 Scales of Measurements
- There are four levels of measurement: nominal, ordinal, interval and ratio.
- Each level has its own rules and restrictions.
- Different levels of measurement contain different amounts of information about whatever the data are measuring.
i Nominal Scale
- Qualitative
- No order or ranking in categories.
- These categories have to be mutually exclusive, i.e. it should not be possible to place an individual or object in more than one category
- The name of a category can be substituted by a number, but it will be a mere label and have no numerical meaning
ii Ordinal Scale
- Qualitative
- Categories can be ordered or ranked
- A name of a category can be substituted by a number, but such a sequence does not indicate absolute quantities.
- The difference between any two numbers on the scale is not numerically meaningful.
- It cannot be assumed that the differences between adjacent numbers on the scale are equal.
iii Interval Scale
- Quantitative
- Data can be ordered or ranked
- There is no absolute zero point. Zero is only an arbitrary point against which other values can be compared
- The difference between two numbers is a meaningful numerical value
- The ratio of two numbers is not a meaningful numerical value.
iv Ratio Scale
- Quantitative
- Highest level of measurement
- There exists an absolute zero point (it has a true zero point)
- Ratio between different measurements is meaningful
6.2 Presentation of Data
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
Here’s a quick summary of our variables:
Variable Name | Description |
---|---|
PassengerID | Passenger ID (just a row number, so obviously not useful for prediction) |
Survived | Survived (1) or died (0) |
Pclass | Passenger class (first, second or third) |
Name | Passenger name |
Sex | Passenger sex (gender) |
Age | Passenger age |
SibSp | Number of siblings/spouses aboard |
Parch | Number of parents/children aboard |
Ticket | Ticket number |
Fare | Fare |
Cabin | Cabin |
Embarked | Port of embarkation (S = Southampton, C = Cherbourg, Q = Queenstown) |
6.2.1 Tabular Presentations of Data
Raw Data
- Raw data are collected data that have not been organized numerically
- Eg: Passenger age
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
## [1] 22 38 26 35 35 NA 54 2 27 14 4 58 20 39 14 55 2 NA 31 NA 35 34 15 28 8
## [26] 38 NA 19 NA NA 40 NA NA 66 28 42 NA 21 18 14
An array
- An array is an arrangement of raw numerical data in ascending or descending order of magnitude.
- Eg: Passenger age
## [1] 2 2 4 8 14 14 14 15 18 19 20 21 22 26 27 28 28 31 34 35 35 35 38 38 39
## [26] 40 42 54 55 58 66
Frequency Table (Frequency Distributions)
- A frequency table (frequency distribution) is a listing of the values a variable takes in a data set, along with how often (frequently) each value occurs
- Frequency can be recorded as either
- frequency or count: the number of times a value occurs, or
- percentage frequency: the percentage of times a value occurs
- Percentage frequency can be calculated as,
\[\text{Percentage frequency} = \frac{\text{Frequency of the value}}{\text{Total number of observations}} \times 100\%\]
- The objectives of constructing a frequency table are as follows
- to organize the data in a meaningful manner
- to determine the nature or shape of the distribution
- to draw charts and graphs for the presentation of data
- to facilitate computational procedures for measures of average and spread
- to make comparisons between different data sets
- There are two basic types of frequency tables
- Simple frequency tables (Ungrouped frequency distribution)
- Grouped frequency distribution
6.2.1.1 Simple frequency table (Ungrouped frequency distribution)
- Each possible value or category is taken as a class
- More suitable for
- Qualitative variables
- Discrete variables
- Sometimes constructed for continuous variables when there is a small number of possible values between the minimum and maximum.
Examples:
CASE I:
Example 1
The native countries of 56 students from a certain education institute are as follows:
## [1] "SL" "BD" "SL" "SL" "SL" "SL" "IN" "SL" "SL" "SL" "BD" "SL" "SL" "SL" "IN"
## [16] "SL" "SL" "BD" "SL" "SL" "SL" "SL" "SL" "SL" "SL" "SL" "SL" "MD" "SL" "SL"
## [31] "SL" "SL" "SL" "SL" "PK" "MD" "PK" "SL" "SL" "SL" "SL" "SL" "PK" "MD" "SL"
## [46] "SL" "SL" "SL" "SL" "SL" "SL" "SL" "SL" "SL" "MD" "MD"
BD- Bangladesh, IN-India, MD-Maldives, PK-Pakistan, SL- Sri Lanka
Construct a frequency table
## Native Country Count Percentage (%)
## Bangladesh 3 5.357
## India 2 3.571
## Maldives 5 8.929
## Pakistan 3 5.357
## Sri Lanka 43 76.786
## Total 56 100.000
CASE II:
Example 2
The grades of 29 students for Statistics are as follows:
## [1] "B" "C" "B" "D" "B" "C" "C" "A" "B" "C" "C" "B" "E" "B" "B" "D" "D" "F" "B"
## [20] "D" "D" "A" "B" "A" "B" "C" "E" "A" "A"
Construct a frequency table
## Grade Count Percentage (%)
## A 5 17.241
## B 10 34.483
## C 6 20.690
## D 5 17.241
## E 2 6.897
## F 1 3.448
## Total 29 100.000
CASE III:
Example 3
The number of family members of a sample of undergraduates of Batch 19 are as follows:
## [1] 7 5 3 4 5 4 3 6 4 4 5 2 7 4 5 6 4 4 3 5
Construct a frequency table
## # A tibble: 6 x 3
## `Number of family members` Count `Percentage (%)`
## <dbl> <int> <dbl>
## 1 2 1 5
## 2 3 3 15
## 3 4 7 35
## 4 5 5 25
## 5 6 2 10
## 6 7 2 10
CASE IV:
Example 4
The ages (in years) of a sample of undergraduates of Batch 19 are as follows:
## [1] 21 22 22 23 22 24 24 23 21 22 23 22 22 23 21 21 22 23 22 23
Construct a frequency table
## # A tibble: 4 x 3
## `Age (years)` Count `Percentage (%)`
## <dbl> <int> <dbl>
## 1 21 4 20
## 2 22 8 40
## 3 23 6 30
## 4 24 2 10
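A simple frequency table like the ones above can be sketched with Python's standard library. The helper name `frequency_table` is hypothetical, and the data are the ages from Example 4; this is one possible construction, not necessarily how the tables above were produced.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, percentage) rows, sorted by value."""
    counts = Counter(values)          # count how often each value occurs
    n = len(values)                   # total number of observations
    return [(v, c, round(100 * c / n, 3)) for v, c in sorted(counts.items())]

# Ages (in years) from Example 4
ages = [21, 22, 22, 23, 22, 24, 24, 23, 21, 22,
        23, 22, 22, 23, 21, 21, 22, 23, 22, 23]

table = frequency_table(ages)
print(f"{'Age':>5} {'Count':>5} {'%':>8}")
for value, count, pct in table:
    print(f"{value:>5} {count:>5} {pct:>8.1f}")
```

The same function works for qualitative data such as country codes or grades, because `Counter` accepts any hashable values.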
6.2.1.2 Grouped frequency distribution
- A grouped frequency distribution (table) is obtained by constructing classes (or intervals) for the data and then listing the corresponding number of values in each interval.
- Suitable for quantitative variables with large number of possible values in the range of data.
- Note that when items have been grouped in this way, their individual values are lost.
- When studying about frequency distributions it is very important to know the meaning of the following terms
i Class intervals
- In a frequency distribution the total range of the observations is divided into a number of classes. These are called class intervals
- Eg: Class intervals: 10-14, 15-19, 20-24, …, 40-44
ii Class limits
- Class limits are the smallest and largest data values that can fall into a given class.
- In the class interval 10-14, the end numbers, 10 and 14, are called class limits
- The smaller number (10) is the lower class limit
- The larger number (14) is the upper class limit
iii Class boundaries
- Class boundaries are obtained by adding the upper limit of one class interval to the lower limit of the next-higher class interval and dividing by 2.
- Class boundaries are also called True class limits
- Class boundaries should not coincide with actual observations
Class interval | Class boundaries |
---|---|
10 - 14 | 9.5 – 14.5 |
15 - 19 | 14.5 – 19.5 |
20 - 24 | 19.5 – 24.5 |
25 - 29 | 24.5 – 29.5 |
30 - 34 | 29.5 – 34.5 |
35 - 39 | 34.5 – 39.5 |
40 - 44 | 39.5 – 44.5 |
iv The size or width of a class interval
- The size or width of a class interval is the difference between the lower and upper class boundaries
- It is also referred to as the class width, class size, or class length
- Eg: The class width for the class 10-14 is = 14.5-9.5 = 5
v The class mark ( Midpoint of the class)
- Midpoint of the class
- Also called the class midpoint
- \(\text{Midpoint of the class} = \frac{\text{Lower limit} + \text{Upper limit}}{2}\)
or
- \(\text{Midpoint of the class} = \frac{\text{Lower boundary} + \text{Upper boundary}}{2}\)
vi Open class intervals
A class interval that, at least theoretically, has either no upper class limit or no lower class limit indicated is called an open class interval.
For example, referring to age groups of individuals, the class interval “65 years and over” is an open class interval.
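The definitions of class boundaries, class width and class mark can be checked with a short sketch. The function name `class_summary` is hypothetical, and the code assumes that adjacent classes are separated by the same gap (1 unit for classes such as 10-14 and 15-19), so each boundary lies half a gap beyond the corresponding limit.

```python
def class_summary(limits):
    """limits: list of (lower, upper) class limits in ascending order.

    Assumes the gap between one class's upper limit and the next
    class's lower limit is the same for every pair of classes.
    """
    gap = limits[1][0] - limits[0][1]        # e.g. 15 - 14 = 1
    rows = []
    for lower, upper in limits:
        lb = lower - gap / 2                 # lower class boundary
        ub = upper + gap / 2                 # upper class boundary
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lb, ub),
            "width": ub - lb,                # difference between boundaries
            "midpoint": (lower + upper) / 2, # class mark
        })
    return rows

# Class intervals 10-14, 15-19, ..., 40-44
limits = [(lo, lo + 4) for lo in range(10, 45, 5)]
rows = class_summary(limits)
for r in rows:
    print(r["limits"], r["boundaries"], r["width"], r["midpoint"])
```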
Rules and Practices for constructing grouped frequency tables
- Every data value should be in an interval
- The intervals should be mutually exclusive
- The classes of the distribution must be arrayed in size order.
- It is recommended that the number of classes be not less than 5 and not greater than 15.
- The following formula is often used to determine the number of classes: If n is the number of observations, then
\[\text{Number of classes} = \sqrt{n}\]
\[\text{Width of the class interval} = \frac{\text{Range}}{\sqrt{n}}= \frac{\text{Max}-\text{Min}}{\sqrt{n}}\]
- Data should be represented within classes having limits which the data can attain
- Classes should be continuous
- By convention, the beginning of the interval is given the appropriate exact value, rather than the end.
Eg: intervals of 0-49, 50-99, 100-149 would be preferred over the intervals 1-50, 51-100, 101-150, etc.
- The number of observations falling into each category or class interval (the class frequency) can be easily found using tally marks.
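The rules above can be sketched on the 31 recorded passenger ages from the ordered array in Section 6.2.1. Rounding the number of classes and the width up to whole numbers, and starting the first class at 0, are assumptions made for this illustration; the counting step also assumes whole-number data so that the class limits leave no gaps.

```python
import math

# Ordered array of the 31 non-missing passenger ages shown earlier
ages = [2, 2, 4, 8, 14, 14, 14, 15, 18, 19, 20, 21, 22, 26, 27, 28,
        28, 31, 34, 35, 35, 35, 38, 38, 39, 40, 42, 54, 55, 58, 66]

n = len(ages)
k = math.ceil(math.sqrt(n))                                # number of classes, rounded up
width = math.ceil((max(ages) - min(ages)) / math.sqrt(n))  # class width, rounded up

start = 0  # assumption: begin at a round value at or below the minimum
classes = [(start + i * width, start + (i + 1) * width - 1) for i in range(k)]

# Class frequency: number of observations whose value lies within the class limits
counts = [sum(1 for a in ages if lo <= a <= hi) for lo, hi in classes]

for (lo, hi), c in zip(classes, counts):
    print(f"{lo:>3} - {hi:<3} | {c}")
```

Every observation falls into exactly one class, so the class frequencies add up to the number of observations.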
Examples:
In a grouped frequency distribution, class intervals can be constructed in different ways
Example 1
Class interval | Number of students |
---|---|
10 - 14 | 4 |
15 - 19 | 5 |
20 - 24 | 11 |
25 - 29 | 9 |
30 - 34 | 6 |
35 - 39 | 3 |
40 - 44 | 2 |
Example 2
Salary | Number of employees |
---|---|
0 – 1999 | 1 |
2000 – 3999 | 31 |
4000 – 5999 | 18 |
6000 – 7999 | 4 |
8000 – 9999 | 2 |
10000 - 11999 | 1 |
12000 – 13999 | 0 |
14000 – 15999 | 0 |
16000 – 17999 | 1 |
18000 – 19999 | 1 |
20000 – 21999 | 1 |
Salary | Number of employees |
---|---|
0 – 1999 | 1 |
2000 – 3999 | 31 |
4000 – 5999 | 18 |
6000 – 7999 | 4 |
8000 – 9999 | 2 |
10000 - 15999 | 1 |
16000 – 21999 | 3 |
Total | 60 |
Example 3
Salary | Number of employees |
---|---|
Less than 2000 | 1 |
2000 – 2999 | 11 |
3000 – 3999 | 20 |
4000 – 5999 | 18 |
6000 – 9999 | 6 |
Greater than or equal to 10000 | 4 |
Total | 60 |
6.2.1.3 Two-way frequency table
- Cross tabulation, Cross classification table, Contingency table, Two-way table
- Displays the relationship between two or more qualitative (categorical) variables, which may be nominal or ordinal
## # A tibble: 2 x 4
## Survived First Second Third
## <chr> <dbl> <dbl> <dbl>
## 1 died 80 97 372
## 2 Survived 136 87 119
## # A tibble: 2 x 4
## Survived First Second Third
## <chr> <dbl> <dbl> <dbl>
## 1 died 0.37 0.53 0.76
## 2 Survived 0.63 0.47 0.24
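The second table follows from the first by dividing each count by its column (passenger class) total. A minimal sketch of that step, with the counts typed in by hand from the table above:

```python
# Counts from the two-way table above: outcome by passenger class
counts = {
    "died":     {"First": 80,  "Second": 97, "Third": 372},
    "Survived": {"First": 136, "Second": 87, "Third": 119},
}
classes = ["First", "Second", "Third"]

# Total number of passengers in each class (column totals)
col_totals = {c: sum(row[c] for row in counts.values()) for c in classes}

# Proportion of each outcome within each passenger class
proportions = {
    outcome: {c: round(row[c] / col_totals[c], 2) for c in classes}
    for outcome, row in counts.items()
}

for outcome, row in proportions.items():
    print(outcome, row)
```

Dividing by row totals instead would answer a different question: what share of those who died (or survived) travelled in each class.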
6.2.2 Graphic Presentations of Data
- A diagram is a visual form for presentation of statistical data.
- The form of the diagram varies according to the nature of the data
6.2.2.1 Describing Qualitative Data
- Bar chart / Pie chart
- Suitable for
- Qualitative variables (nominal or ordinal)
- Discrete variables (when the number of bars or number of different values is small)
I Bar Chart
- A bar graph uses bars to represent discrete categories of data
- It can be drawn on either a horizontal (more common) or a vertical base
- A rectangle of equal width is drawn for each category
- The height (in vertical bar chart) or the length (in horizontal bar chart) of the rectangle is equal to the category’s frequency or percentage.
i Simple Bar Chart
- Only one categorical variable can be presented
- Often used in conjunction with simple frequency tables
- The bars do not touch each other
- The gaps between adjacent bars are equal in width
Passenger class | Count | Percentage |
---|---|---|
First | 216 | 24.242 |
Second | 184 | 20.651 |
Third | 491 | 55.107 |
ii Component Bar Chart
- Also called a sub-divided bar chart or stacked bar chart
- Used to compare two or more qualitative variables (nominal or ordinal)
- Often used in conjunction with two way tables
- Start by drawing a simple bar chart with the total figures.
- The bars are then divided into the component parts
- Can be drawn on absolute figures or percentages
- The various components should be kept in the same order in each bar
- To distinguish different components from one another, different colours or shades can be used
Survived | First | Second | Third |
---|---|---|---|
died | 80 | 97 | 372 |
Survived | 136 | 87 | 119 |
Percentage component bar chart
- When a sub-divided bar chart is drawn on a percentage basis it is called a percentage bar chart
- The various components are expressed as percentages of the total
- All bars are equal in height
Survived | First | Second | Third |
---|---|---|---|
died | 0.3703704 | 0.5271739 | 0.7576375 |
Survived | 0.6296296 | 0.4728261 | 0.2423625 |
iii Multiple Bar Chart
- Also called a compound bar chart or cluster bar chart
- Used to compare two or more qualitative variables (nominal or ordinal)
- Often used in conjunction with two way tables
- These bar charts are drawn side by side
Survived | First | Second | Third |
---|---|---|---|
died | 37.04 | 52.72 | 75.76 |
Survived | 62.96 | 47.28 | 24.24 |
6.2.2.2 Describing Quantitative Data
- Histogram/ Dot plot / Box plot/ Scatter plot
II Histogram
- Histogram looks similar to bar chart since it also has bars.
- However, it is different from a bar chart in a number of aspects.
- One main difference is that in the histogram, the bars are drawn attached to each other; there are no gaps between bars like in a bar chart.
- Histogram is used to show the shape of the distribution of a continuous variable.
- However, the histogram is also used for discrete variables when the data are grouped into class intervals.
- In a histogram, the area of a bar should be proportional to the frequency of the corresponding class.
- If all the bars have the same width, then the height of a bar can represent the frequency.
- The bar corresponding to a class interval should be drawn from the lower class boundary to the upper class boundary. In this way there will be no gaps between the bars.
Example: The marks (out of 50) of a group of students are recorded in the accompanying table. Draw a histogram for the data.
Marks | Number of students |
---|---|
10 - 14 | 4 |
15 - 19 | 5 |
20 - 24 | 11 |
25 - 29 | 9 |
30 - 34 | 6 |
35 - 39 | 3 |
40 - 44 | 2 |
Total | 40 |
Example 2
III Frequency polygon
- If the mid-points of the tops of adjacent blocks in a histogram are joined by straight lines, a frequency polygon is produced.
- This is done under the assumption that the frequencies in a class-interval are evenly distributed throughout the class
Example: The marks (out of 50) of a group of students are recorded in the accompanying table. Draw a frequency polygon for the data.
Marks | Number of students |
---|---|
10 - 14 | 4 |
15 - 19 | 5 |
20 - 24 | 11 |
25 - 29 | 9 |
30 - 34 | 6 |
35 - 39 | 3 |
40 - 44 | 2 |
Total | 40 |
IV Frequency curve
- A frequency curve is drawn by smoothing the frequency polygon.
- It is smoothed in such a way that sharp turns are avoided
Example: The marks (out of 50) of a group of students are recorded in the accompanying table. Draw a frequency curve for the data.
Marks | Number of students |
---|---|
10 - 14 | 4 |
15 - 19 | 5 |
20 - 24 | 11 |
25 - 29 | 9 |
30 - 34 | 6 |
35 - 39 | 3 |
40 - 44 | 2 |
Total | 40 |
Frequency curves arising in practice take on certain characteristic shapes, as described below.
- The symmetrical or bell shaped frequency curves are characterized by the fact that observations equidistant from the central maximum have the same frequency. An important example is the normal curve.
- In the moderately asymmetrical or skewed frequency curves the tail of the curve to one side of the central maximum is longer than that to the other. If the longer tail occurs to the right, the curve is said to be skewed to the right or to have positive skewness. If the reverse is true, the curve is said to be skewed to the left or to have negative skewness.
- In a J shaped or reverse J shaped curve a maximum occurs at one end.
- A U shaped frequency curve has maxima at both ends.
- A bimodal frequency curve has two maxima. These appear as two distinct peaks (local maxima) in the frequency curve. When the two modes are unequal the larger mode is known as the major mode and the other as the minor mode.
- A multimodal frequency curve has more than two maxima.
V Dot Plot
- A dot plot is a method of presenting data which gives a rough but rapid visual appreciation of the way in which the data are distributed
- It consists of a horizontal line marked out with divisions of the scale on which the variable is being measured
- This graph can be used to represent numerical data only.
VI Box plot (Box and whisker plot)
- Box plot is also a useful method of representing the behavior of a data set or comparing two or more data sets.
- A box plot is constructed from five statistics of the data set: the smallest value, Q1, the median, Q3 and the largest value.
Example:
Construct a box plot for the following data set (Marks of students)
\[\text{52, 88, 56, 79, 72, 91, 85, 88, 68, 63, 76, 73, 86, 95, 12, 69}\]
Xmin = 12 Xmax = 95 Q1 =64.25 Q2 = Median = 74.5 Q3 = 87.5
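The five values quoted above can be reproduced with a short sketch. It assumes the \((n+1)\) position rule with linear interpolation between neighbouring items when the position is not a whole number; the function name `value_at_fraction` is hypothetical.

```python
def value_at_fraction(sorted_data, p):
    """Value at position p*(n+1) (counting from 1) in an ordered array,
    interpolating linearly between the two neighbouring items."""
    n = len(sorted_data)
    pos = p * (n + 1)
    lower = int(pos)              # whole part of the position
    frac = pos - lower            # fractional part of the position
    if lower < 1:                 # position falls before the first item
        return sorted_data[0]
    if lower >= n:                # position falls at or after the last item
        return sorted_data[-1]
    below = sorted_data[lower - 1]
    above = sorted_data[lower]
    return below + frac * (above - below)

marks = [52, 88, 56, 79, 72, 91, 85, 88, 68, 63, 76, 73, 86, 95, 12, 69]
ordered = sorted(marks)

five_numbers = {
    "Xmin": ordered[0],
    "Q1": value_at_fraction(ordered, 0.25),
    "Median": value_at_fraction(ordered, 0.50),
    "Q3": value_at_fraction(ordered, 0.75),
    "Xmax": ordered[-1],
}
print(five_numbers)
```

Other software may use a different interpolation convention and give slightly different quartiles for the same data. Python's `statistics.quantiles` with its default `method='exclusive'` follows the same \((n+1)\) convention.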
VII Violin plot
- A violin plot is a method of plotting quantitative data.
- It is similar to a box plot, with the addition of a rotated kernel density plot on each side.
- Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
6.3 Summary Measures
Although frequency distributions serve a useful purpose, there are many situations that require other types of data summarization.
What we need in many instances is the ability to summarize the data by means of a single number called a descriptive measure.
Descriptive measures may be computed from the data of a sample or the data of a population. To distinguish between them we have the following definitions.
Definitions
A descriptive measure computed from the data of a sample is called a statistic.
A descriptive measure computed from the data of a population is called a parameter.
6.3.1 Measures of Central Tendency
Measures of central tendency yield information about the center, or middle part, of a group of numbers.
Eg: Mode, Median, Arithmetic Mean, Geometric mean, Harmonic Mean, Quadratic Mean, Quartiles, Deciles, and Percentiles
6.3.1.1 Mode
The mode is the most frequently occurring value in a set of data.
Organizing the data into an ordered array (an ordering of the numbers from smallest to largest) helps to locate the mode.
A series having only one mode is called unimodal.
In the case of a tie for the most frequently occurring value, two modes are listed. Then the data are said to be bimodal.
If a set of data is not exactly bimodal but contains two values that are more dominant than others, some researchers take the liberty of referring to the data set as bimodal even without an exact tie for the mode.
Data sets with more than two modes are referred to as multimodal.
The mode is an appropriate measure of central tendency for nominal-level data.
The mode can be used to determine which category occurs most frequently.
For ungrouped data
Example 01: Find mode of the following datasets
Dataset 1: 12, 14, 10, 8, 6, 8, 15, 8
Dataset 2: 40, 44, 57, 48, 78
Dataset 3: 42, 45, 55, 50, 45, 40, 55, 45, 52, 55, 54
For grouped frequency data
Example 02: Find mode of the following data
Marks | Number of students |
---|---|
20 | 8 |
30 | 10 |
40 | 16 |
50 | 8 |
60 | 5 |
70 | 3 |
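The two examples can be worked through with a small sketch. The helper `modes` is hypothetical, and it adopts one common convention as an assumption: when every value occurs equally often (as in Dataset 2), the data set is reported as having no mode.

```python
from collections import Counter

def modes(values):
    """Return all most frequent values, or [] when every distinct value
    occurs equally often (treated here as 'no mode')."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

dataset1 = [12, 14, 10, 8, 6, 8, 15, 8]
dataset2 = [40, 44, 57, 48, 78]
dataset3 = [42, 45, 55, 50, 45, 40, 55, 45, 52, 55, 54]

print(modes(dataset1))   # one mode: unimodal
print(modes(dataset2))   # no repeated value: no mode
print(modes(dataset3))   # two modes: bimodal

# Example 02: in a frequency table the mode is the value with the highest frequency
marks_freq = {20: 8, 30: 10, 40: 16, 50: 8, 60: 5, 70: 3}
table_mode = max(marks_freq, key=marks_freq.get)
print(table_mode)
```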
- Advantages and disadvantages of mode
Advantages
- Easy to understand
- Easy to calculate
- Not affected by extreme values in the dataset
- Good for qualitative data
Disadvantages
- Not suitable for further mathematical calculations
- There may be more than one mode for a given dataset
- It is not based upon all the observations
- In some cases, we may not be able to find a mode for a given dataset
6.3.1.2 Median
The median is the middle value in an ordered array of numbers.
The median divides the series into two equal parts.
The following steps are used to determine the median.
STEP 1: Arrange the observations in an ordered data array.
STEP 2: For an odd number of terms, find the middle term of the ordered array. It is the median.
STEP 3: For an even number of terms, find the arithmetic mean of the middle two terms. This arithmetic mean is the median.
\[Median = \text{the }(\frac{n+1}{2})\text{th item in the data array} \]
- The level of data measurement must be at least ordinal for a median to be meaningful.
Example 1: Find the median of the dataset 1, 8, 6, 3, 2
Example 2: Find the median of the dataset 8, 9, 1, 2, 14, 12
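The three steps can be written as a short function and applied to both examples. This is a minimal sketch; the function simply follows the steps listed above.

```python
def median(values):
    """Median following the steps above: sort, then take the middle item
    (odd n) or the mean of the two middle items (even n)."""
    ordered = sorted(values)              # STEP 1: ordered array
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                        # STEP 2: odd number of terms
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # STEP 3: even number of terms

example1 = [1, 8, 6, 3, 2]         # ordered: 1, 2, 3, 6, 8
example2 = [8, 9, 1, 2, 14, 12]    # ordered: 1, 2, 8, 9, 12, 14

print(median(example1))
print(median(example2))
```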
- Advantages and disadvantages of median
Advantages
- Simple to understand
- Easy to calculate
- Not affected by extreme values in the dataset
- Can be calculated even for qualitative variables (ordinal scale data)
Disadvantages
- It is not based upon all the observations
6.3.1.3 Arithmetic Mean
The arithmetic mean (usually called mean) is the sum of all observations divided by the total number of observations.
Population Mean
The population mean is represented by the Greek letter mu (\(\mu\)).
Let \(N\) be the number of terms in the population.
\[\mu = \frac{\sum{x}}{N}=\frac{x_1+x_2+x_3+...+x_N}{N}\]
Sample Mean
- The sample mean is represented by \(\bar{x}\)
- Let \(n\) be the number of terms in the sample.
\[\bar{x} = \frac{\sum{x}}{n}=\frac{x_1+x_2+x_3+...+x_n}{n}\]
It is inappropriate to use the mean to analyse data that are not at least interval level in measurement.
Example 1: Calculate the mean from the following data
Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Marks | 40 | 50 | 53 | 78 | 58 | 60 | 73 | 35 | 43 | 48 |
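As a worked sketch for Example 1, treating the ten students as a sample, the marks are summed and divided by their count:

\[\bar{x} = \frac{40+50+53+78+58+60+73+35+43+48}{10} = \frac{538}{10} = 53.8\]

If the ten students were instead the whole population of interest, the same value would be written as \(\mu\).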
- Advantages and disadvantages of arithmetic mean
Advantages
- Simple to understand
- Easy to calculate
- Based on all the observations
- Well defined
- Unique
- Can be used in further calculation
Disadvantages
- Can be affected by extreme values in the dataset
- May lead to false conclusions
- Only applicable to quantitative data (not applicable to qualitative data)
Empirical relationship between mean, mode, median
- In the case of a symmetrical distribution, mean, median and mode coincide \((mean = median = mode)\)
- For a moderately asymmetrical distribution, the following relationship exists \(Mean - Mode = 3(Mean - Median)\)
Choice between mean and median
- The mean is very sensitive to outliers. The median is not sensitive to outliers
- When there are outliers in a data set, median is more appropriate than mean
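A small sketch can illustrate this sensitivity using the 16 marks from the box plot example in Section 6.2.2.2. Treating the unusually low mark of 12 as an outlier is an assumption made for the illustration.

```python
from statistics import mean, median

marks = [52, 88, 56, 79, 72, 91, 85, 88, 68, 63, 76, 73, 86, 95, 12, 69]
without_outlier = [m for m in marks if m != 12]   # drop the unusually low mark

mean_all, median_all = mean(marks), median(marks)
mean_trim, median_trim = mean(without_outlier), median(without_outlier)

print(f"With 12:    mean = {mean_all:.2f}, median = {median_all}")
print(f"Without 12: mean = {mean_trim:.2f}, median = {median_trim}")
print(f"Shift in mean:   {mean_trim - mean_all:.2f}")
print(f"Shift in median: {median_trim - median_all:.2f}")
```

Removing the single low mark moves the mean by about 4 marks but the median by only 1.5, which is why the median is preferred when outliers are present.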
6.3.1.4 Quartiles, Deciles and Percentiles
- Median divides the data set into two equal parts.
- There are other values which divide the data set into a number of equal parts
- Those are Quartiles, Deciles and Percentiles
(a) Quartiles (Q) – Quartiles divide an array into four equal parts
\[Q_i = \text{the }\frac{i}{4}(n+1)\text{th item in the data array} \]
(b) Deciles (D) – Deciles divide an array into ten equal parts
\[D_i = \text{the }\frac{i}{10}(n+1)\text{th item in the data array} \]
(c) Percentiles (P) – Percentiles divide an array into 100 equal parts
\[P_i = \text{the }\frac{i}{100}(n+1)\text{th item in the data array} \]
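As an illustrative sketch, the position formulas can be applied to the 16 marks from the box plot example in Section 6.2.2.2, sorted in ascending order: 12, 52, 56, 63, 68, 69, 72, 73, 76, 79, 85, 86, 88, 88, 91, 95. Interpolating linearly between neighbouring items when the position is not a whole number is an assumption here; it is the convention that reproduces the quartiles quoted in that example.

\[D_3 = \text{the }\frac{3}{10}(16+1) = 5.1\text{th item} = 68 + 0.1(69-68) = 68.1\]

\[P_{90} = \text{the }\frac{90}{100}(16+1) = 15.3\text{th item} = 91 + 0.3(95-91) = 92.2\]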
6.3.2 Measures of Variability
Measures of central tendency yield information about particular points of a data set.
However, business researchers can use another group of analytic tools to describe a set of data.
These tools are measures of variability, which describe the spread or the dispersion of a set of data.
Using measures of variability in conjunction with measures of central tendency makes possible a more complete numerical description of the data.
This section focuses on six measures of variability for ungrouped data: range, interquartile range, variance, standard deviation, z score and coefficient of variation.
6.3.2.1 Range
- The range is the difference between the largest value of a data set and the smallest value.
\[Range = Maximum - Minimum\]
One important use of the range is in quality assurance, where the range is used to construct control charts
Advantages and disadvantages of range
Advantages
- Easy to understand and calculate
Disadvantages
- Considers only the highest and lowest values of the data and fails to take account of any other observations in the dataset
- Heavily influenced by extreme values
6.3.2.2 Interquartile Range (IQR)
We use the interquartile range (IQR) to measure the spread of data around the median (M).
The interquartile range is the range of values between the first and third quartile.
Essentially it is the range of the middle 50% of the data and is determined by computing the value of \(Q_3 - Q_1\).
The interquartile range is especially useful in situations where data users are more interested in values towards the middle and less interested in extremes.
The interquartile range is used in the construction of box and whisker plots.
By eliminating the lowest 25% and the highest 25% of the items in a series, we are left with the central 50% , which are ordinarily free of extreme values.
Advantages
- Easy to understand and calculate
- Not influenced by extreme values
Disadvantages
- Ignores the first \(25\%\) and the last \(25\%\) of the dataset
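As a brief worked sketch, take the 16 marks from the box plot example in Section 6.2.2.2, for which \(X_{min} = 12\), \(X_{max} = 95\), \(Q_1 = 64.25\) and \(Q_3 = 87.5\):

\[Range = 95 - 12 = 83\]

\[IQR = Q_3 - Q_1 = 87.5 - 64.25 = 23.25\]

The single low mark of 12 stretches the range to 83, while the IQR depends only on the middle half of the data and is unaffected by it.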
6.3.2.3 Variance and Standard Deviation
- To measure the spread of data around the mean, we use the standard deviation (S).
- The variance and standard deviation are two very popular measures of dispersion.
- These measures are not meaningful unless the data are at least interval-level data.
- Their formulations are categorized into whether to evaluate from a population or from a sample.
NOTE
Sum of deviations from the arithmetic mean is always zero. \[{\sum{(x-\mu)} = 0}\]
Because positive and negative deviations always cancel out, the sum of deviations cannot serve as a measure of variability, so alternatives such as squared deviations are needed.
6.3.2.3.1 Variance
The variance is the average of the squared deviations about the mean for a set of numbers.
The population variance is denoted by \(\sigma^2\)
\[\sigma^2 = \frac{\sum(x-\mu)^2}{N}\]
The sum of the squared deviations about the mean of a set of values is called the sum of squares of \(x\) and is sometimes abbreviated as \(SS_x\).
Because the variance is computed from squared deviations, the final result is expressed in terms of squared units of measurements.
Statistics measured in squared units are problematic to interpret.
6.3.2.3.2 Standard Deviation
The standard deviation is the square root of the variance.
The population standard deviation is denoted by \(\sigma\)
\[\sigma = \sqrt{\frac{\sum(x-\mu)^2}{N}}=\sqrt{\sigma^2}\]
- One feature of standard deviation that distinguishes it from a variance is that the standard deviation is expressed in the same units as the raw data, whereas the variance is expressed in those units squared.
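The definitions above can be traced step by step in a short sketch. The ten marks from the arithmetic mean example are treated here as a complete population purely for illustration; that is an assumption, not something stated in the example.

```python
import math

# Ten marks, treated as a population for this illustration
marks = [40, 50, 53, 78, 58, 60, 73, 35, 43, 48]

N = len(marks)
mu = sum(marks) / N                        # population mean
deviations = [x - mu for x in marks]       # deviations from the mean

sum_dev = sum(deviations)                  # zero, apart from floating-point rounding
ss_x = sum(d ** 2 for d in deviations)     # sum of squares of x
variance = ss_x / N                        # population variance (squared units)
std_dev = math.sqrt(variance)              # population standard deviation (original units)

print(f"mean = {mu}")
print(f"sum of deviations = {sum_dev:.10f}")
print(f"SS_x = {ss_x:.2f}")
print(f"variance = {variance:.2f} (marks squared)")
print(f"standard deviation = {std_dev:.2f} (marks)")
```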
Advantages
- Based on all the observations
- Since this is based on arithmetic mean, it has all the merits of it
- The most important and widely used measure of dispersion
Disadvantages
- Not easy to understand and difficult to calculate
- Gives more weight to extreme values, because the deviations are squared
6.3.2.4 Empirical Rule
The empirical rule is an important rule of thumb that is used to state the approximate percentage of values that lie within a given number of standard deviations from the mean of a set of data if the data are normally distributed.
The empirical rule is used only for three numbers of standard deviations: \(1\sigma\), \(2\sigma\), \(3\sigma\).
Distance from the mean | Values within distance |
---|---|
\(\mu\pm1\sigma\) | \(68\%\) |
\(\mu\pm2\sigma\) | \(95\%\) |
\(\mu\pm3\sigma\) | \(99.7\%\) |
- If a set of data is normally distributed, or bell shaped, approximately \(68\%\) of the data values are within one standard deviation of the mean, \(95\%\) are within two standard deviations, and almost \(100\%\) are within three standard deviations.
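The percentages in the empirical rule can be checked against the normal distribution itself. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is hypothetical and introduced only for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    # Prints roughly 68.3, 95.4 and 99.7 percent.
    print(k, round(100 * share_within(k), 1))
```

Because the proportion depends only on the number of standard deviations, the same percentages apply to any normal distribution, whatever its mean and standard deviation.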
6.3.2.5 Population versus sample variance and standard deviations
The sample variance is denoted by \(s^2\) and the sample standard deviation by \(s\).
The main use for sample variances and standard deviations is as estimators of population variances and standard deviations.
Thus, computation of the sample variance and standard deviation differs slightly from computation of the population variance and standard deviation.
Both the sample variance and sample standard deviation use \(n-1\) in the denominator instead of \(n\), because using \(n\) in the denominator of a sample variance results in a statistic that tends to underestimate the population variance.
While a discussion of the properties of a good estimator is beyond the scope of this course, one such property is being unbiased.
Using \(n\) in the denominator of the sample variance makes it a biased estimator, whereas using \(n-1\) makes it an unbiased estimator, which is a desirable property in inferential statistics.
Sample variance
\[s^2 = \frac{\sum{(x-\bar{x})^2}}{n-1}\]
Sample standard deviation
\[s = \sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}}= \sqrt{s^2}\]
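To see the effect of the denominator, the sketch below applies both versions to the eight client ages from Section 6.1.1 using Python's standard `statistics` module. Treating the same ages once as a population and once as a sample is an assumption made purely for comparison.

```python
import statistics

# Ages (in years) of eight clients, from the Data example in Section 6.1.1.
ages = [21, 45, 18, 32, 30, 22, 23, 27]

pop_var = statistics.pvariance(ages)  # divides by N (population formula)
samp_var = statistics.variance(ages)  # divides by n - 1 (sample formula)
samp_sd = statistics.stdev(ages)      # square root of the sample variance

print(pop_var)   # 64.4375
print(samp_var)  # about 73.64, larger because the denominator is 7 rather than 8
print(samp_sd)   # about 8.58
```

The sum of squared deviations is the same in both cases (515.5); only the denominator changes.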
6.3.2.6 Computational formulas for variance and standard deviation
An alternative method of computing variance and standard deviation, sometimes referred to as the computational method or shortcut method, is available.
Algebraically,
\[\sum{(x-\mu)^2}= \sum{x^2}- \frac{(\sum x)^2}{N}\]
and
\[\sum{(x-\bar{x})^2}= \sum{x^2}- \frac{(\sum x)^2}{n}\]
- Substituting these equivalent expressions into the original formulas for variance and standard deviation yields the following computational formulas.
Computational formula for population variance and standard deviation
\[\sigma^2 = \frac{\sum{x^2}- \frac{(\sum x)^2}{N}}{N}\]
\[\sigma = \sqrt{\sigma^2}\]
Computational formula for sample variance and standard deviation
\[s^2 = \frac{\sum{x^2}- \frac{(\sum x)^2}{n}}{n-1}\]
\[s = \sqrt{s^2}\]
- For situations in which the mean is already computed or is given, alternative forms of these formulas are:
\[\sigma^2 = \frac{\sum{x^2}-N\mu^2}{N}\]
\[s^2 = \frac{\sum{x^2}-n(\bar{x})^2}{n-1}\]
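The equivalence of the definitional and computational forms can be checked numerically. This is a small sketch using the client ages from Section 6.1.1; the variable names are illustrative only.

```python
# Ages (in years) of eight clients, from the Data example in Section 6.1.1.
ages = [21, 45, 18, 32, 30, 22, 23, 27]

n = len(ages)
sum_x = sum(ages)                    # 218
sum_x_sq = sum(x * x for x in ages)  # 6456
mean = sum_x / n                     # 27.25

# Sum of squares computed three ways.
ss_definitional = sum((x - mean) ** 2 for x in ages)
ss_computational = sum_x_sq - sum_x ** 2 / n
ss_from_mean = sum_x_sq - n * mean ** 2

print(ss_definitional, ss_computational, ss_from_mean)  # all equal 515.5

# Variances from the computational form.
population_variance = ss_computational / n    # 64.4375
sample_variance = ss_computational / (n - 1)  # about 73.64
```

The computational form needs only the running totals \(\sum x\) and \(\sum x^2\), which is why it is convenient for hand calculation.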
6.3.2.7 Coefficient of variation
In general, for two variables measured with the same units (eg: two groups of people both weighed in kg), the group with the larger variance and standard deviation has more variability among its scores.
The unit of measure affects the size of the variance.
The same population weights, expressed in grams rather than kg, would have a larger variance and standard deviation.
The coefficient of variation, a measure of relative variability, gets around this difficulty and makes it possible to compare variability across variables measured in different units.
The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage and is denoted CV.
\[CV = \frac{\sigma}{\mu}(100)\]
Using median and quartile deviation
\[CV = \frac{\frac{Q_3 - Q_1}{2}}{Median}(100)\]
The coefficient of variation essentially is a relative comparison of a standard deviation to its mean.
The coefficient of variation can be useful in comparing standard deviations that have been computed from data with different means.
The choice of whether to use a coefficient of variation or raw standard deviations to compare multiple standard deviations is a matter of preference.
The coefficient of variation also provides an optional method of interpreting the value of a standard deviation.
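The following sketch illustrates the point about units. The five weights are hypothetical values invented for this example, and the function name `coefficient_of_variation` is likewise an assumption. It uses the population formula \(CV = \frac{\sigma}{\mu}(100)\).

```python
import statistics

# Hypothetical weights of five people, in kilograms (invented for illustration).
weights_kg = [60, 72, 55, 80, 68]
# The same weights expressed in grams.
weights_g = [w * 1000 for w in weights_kg]


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


sd_kg = statistics.pstdev(weights_kg)  # about 8.81 kg
sd_g = statistics.pstdev(weights_g)    # about 8809 g, 1000 times larger

cv_kg = coefficient_of_variation(weights_kg)
cv_g = coefficient_of_variation(weights_g)

print(sd_kg, sd_g)
print(cv_kg, cv_g)  # both about 13.1 percent
```

The standard deviation changes with the unit, while the coefficient of variation stays the same, which is what makes it suitable for comparing variables measured in different units.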
6.3.3 Measures of Shape
Measures of shape are tools that can be used to describe the shape of a distribution of data.
In this section, we examine two measures of shape: skewness and kurtosis.
6.3.3.1 Skewness
A distribution of data in which the right half is a mirror image of the left half is said to be symmetrical.
One example of a symmetrical distribution is the normal distribution, or bell curve.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry.
Measures of asymmetry are called measures of skewness.
The skewed portion is the long, thin part of the curve.
Skewness and the relationship of the mean, median and mode
The concept of skewness helps us to understand the relationship between the mean, median and mode.
In a unimodal distribution (distribution with a single peak or mode) that is skewed, the mode is the apex (high point) of the curve and the median is the middle value.
The mean tends to be located toward the tail of the distribution, because the mean is affected by all values, including the extreme ones.
A bell-shaped or normal distribution with the mean, median and mode all at the centre of the distribution has no skewness.
6.3.3.1.1 Pearsonian coefficient of skewness
- This coefficient compares the mean and median in light of the magnitude of the standard deviation
\[S_k = \frac{3(\mu-M_d)}{\sigma}\]
where \(S_k\) = coefficient of skewness, \(M_d\) = median
Note that if the distribution is symmetrical, the mean and median are the same value and hence the coefficient of skewness is equal to zero.
If the value of \(S_k\) is positive, the distribution is positively skewed.
If the value of \(S_k\) is negative, the distribution is negatively skewed.
The greater the magnitude of \(S_k\), the more skewed is the distribution.
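As a brief sketch of how the coefficient is computed, the snippet below applies the formula to the client ages from Section 6.1.1, treating them as a population (an assumption for illustration). The function name `pearson_skewness` is hypothetical.

```python
import statistics

# Ages (in years) of eight clients, from the Data example in Section 6.1.1.
ages = [21, 45, 18, 32, 30, 22, 23, 27]


def pearson_skewness(values):
    """Pearsonian coefficient of skewness: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sigma = statistics.pstdev(values)
    return 3 * (mean - median) / sigma


sk_ages = pearson_skewness(ages)
print(sk_ages)  # about 0.84: mean (27.25) exceeds median (25), so positively skewed

# A symmetrical data set has equal mean and median, so the coefficient is zero.
sk_symmetric = pearson_skewness([1, 2, 3, 4, 5])
print(sk_symmetric)  # 0.0
```

The single large value of 45 pulls the mean above the median, which produces the positive coefficient.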
6.3.3.2 Kurtosis
Kurtosis describes the amount of peakedness of a distribution.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
Distributions that are high and thin are referred to as leptokurtic distributions.
Distributions that are flat and spread out are referred to as platykurtic distributions.
Between the above two types are distributions that are more ‘normal’ in shape, referred to as mesokurtic distributions.
6.3.4 Measures of Association
- Many times in business it is important to explore the relationship between two numerical variables.
- Measures of association are statistics that yield information about the relatedness of numerical variables.
- In this section, we discuss only one measure of association, correlation, and do so only for two numerical variables.
6.3.4.1 Scatter plot
- A scatter plot is a two-dimensional graph of pairs of points from two numerical variables.
- In a quantitative bi-variate dataset, we have an \((x,y)\) pair for each sampling unit, where \(x\) denotes the independent variable and \(y\) denotes the dependent variable.
- Each \((x,y)\) pair can be considered as a point on the Cartesian plane.
- A scatter plot is a plot of all the \((x,y)\) pairs in the dataset.
- The purpose of a scatter plot is to illustrate any relationship between two quantitative variables.
- If the variables are related, the scatter plot shows what kind of relationship it is: linear or nonlinear.
- If the relationship is linear, the scatter plot will show whether it is negative or positive.
Example: Edgar Anderson’s Iris Data
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
6.3.4.2 Correlation
- Correlation is a measure of the degree of relatedness of two or more variables.
- Several measures of correlation are available, the selection of which depends mostly on the level of data being analysed.
- Ideally, researchers would like to calculate \(\rho\), the population coefficient of correlation.
- However, because researchers virtually always deal with sample data, this section introduces a widely used sample coefficient of correlation, \(r\).
- This measure is applicable only if both variables being analysed have at least an interval level of data
Pearson product-moment correlation coefficient (\(r\))
- The statistic \(r\) is the Pearson product-moment correlation coefficient, named after Karl Pearson (1857 - 1936).
- The term \(r\) is a measure of the linear correlation of two variables.
- It is a number that ranges from \(-1\) through 0 to \(+1\), representing the strength of the linear relationship between the variables.
- An \(r\) value of \(+1\) denotes a perfect linear positive relationship between two variables.
- An \(r\) value of \(-1\) denotes a perfect linear negative relationship between two variables, which indicates an inverse relationship between two variables: as one variable gets larger, the other gets smaller.
- An \(r\) value of 0 means no linear relationship is present between the two variables.
\[ r = \frac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2\sum(y-\bar{y})^2}}\]
\[ r = \frac{\sum{xy} - \frac{\sum x\sum y}{n}}{\sqrt{\left[\sum{x^2}- \frac{(\sum{x})^2}{n}\right]\left[\sum y^2-\frac{(\sum y)^2}{n}\right]}}\]
- Examples: Following figure shows five different degrees of correlation:
NOTE
- When \(r=0\), it signifies there is no linear relationship between the two variables. (There can be a non-linear relationship, Figure (e))
- Figure (e): There is a very strong curvilinear relationship. But there is no linear relationship.
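The definitional formula for \(r\) can be sketched in a few lines of Python. All data values below are hypothetical and chosen only to produce the three reference cases \(r=+1\), \(r=-1\) and \(r=0\); the function name `pearson_r` is also an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: y increases by a fixed amount for every unit increase in x.
r_positive = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Hypothetical data: y decreases by a fixed amount for every unit increase in x.
r_negative = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])

# Hypothetical curvilinear data: y = x squared on values of x symmetric about zero.
xs = [-2, -1, 0, 1, 2]
r_curved = pearson_r(xs, [x ** 2 for x in xs])

print(r_positive)  # 1.0
print(r_negative)  # -1.0
print(r_curved)    # 0.0, although y is completely determined by x
```

The last case mirrors the curvilinear situation described in the note: \(r\) measures only the linear part of a relationship, so a value of 0 does not rule out a strong non-linear one.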
6.4 Tutorial
Chapter 6: Descriptive Statistics
- A power company in Sri Lanka designs and manufactures power distribution switchboards for hospitals, bridges, airports, highways and water treatment plants. The company’s marketing director wants to determine client satisfaction with its products and services. He developed a questionnaire that yields a satisfaction score between 0 and 100 for participant responses. A random sample of 50 of the company’s 1000 clients is asked to complete a satisfaction survey. The satisfaction scores for the 50 participants are averaged to produce a mean satisfaction score.
- What is the objective of this study?
- What is the population for this study?
- What is the sample for this study?
- What is the statistic for this study?
- What would be a parameter for this study?
- Determine the type and the scale of measurement of the following variables
- The time required to produce an item on an assembly line
- The number of litres of milk a family drinks in a week
- The ranking of 50 students in your class after their overall performance has been designated as excellent, good, satisfactory or poor
- The telephone area code of clients in Australia
- The age of each of your batch mates
- The sales at the local pizza restaurant each month
- A student index number
- The response time of emergency services
- The number of tickets sold at a ticket counter on any given day
- Monthly maximum air temperature
The grades of 30 students for Statistics are as follows:
B, C, B, D, B, C, C, A , B, C,
C, B, E, B, B, D, D, F, B, D,
D, A, B, A, B, C, E, A, A, E
- Construct a frequency table with suitable values (eg: absolute frequency, relative frequency, cumulative frequency, cumulative relative frequency)
The gender type of a group of 10 students are as follows:
Male, Female, Female, Male, Female, Female, Male, Male, Female, Female
- Construct a frequency table with suitable values (eg: absolute frequency, relative frequency, cumulative frequency, cumulative relative frequency )
- In a study conducted to investigate the relationship between delivery time and computer-assisted ordering, data were gathered from a sample of 40 firms, of which 16 use computer-assisted ordering and 24 do not. Furthermore, past data are used to categorize each firm’s delivery times as below the industry average, equal to the industry average, or above the industry average. The results obtained are given in the table below.
Computer Assisted Ordering | Delivery Time Below Industry Average | Delivery Time Equal to Industry Average | Delivery Time Above industry Average | Row Total |
---|---|---|---|---|
No | 4 | 12 | 8 | 24 |
Yes | 10 | 4 | 2 | 16 |
Total | 14 | 16 | 10 | 40 |
- For each row and column total, calculate the corresponding row or column percentage.
- For each cell, calculate the corresponding cell, row, and column percentages.
- Carry out graphical analysis to investigate the relationship between delivery-time performance and computer-assisted ordering.
- What conclusions can be made about the nature of the relationship?
- Giving suitable examples, distinguish between a multiple bar chart and a component bar chart.
- Write down the advantages and disadvantages of each chart and when each should be used.
- In a study conducted to investigate the effect of wearing helmets while riding motorcycles on reducing head injuries, the following data were gathered by investigating 151 motorcycle accidents reported during the last 10-month period. Out of the 151 cases investigated, only 102 people were wearing helmets properly at the time of the accident. Out of them, only 9 got severe head injuries. The rest got either minor head injuries or no head injuries at all. Among the 49 who were not wearing helmets at the time of the accident, 15 got severe head injuries and the rest got minor head injuries or no head injuries at all.
Use an appropriate table to represent the above data and state what conclusions can be drawn from it
Carry out a suitable graphical analysis to investigate the effect of wearing helmets in riding motorcycles to reduce the head injuries.
- The table below shows a frequency distribution of the weekly wages of 65 employees at the ABC Company. With reference to this table,
Wages (in $) | Number of Employees |
---|---|
250.00 - 259.99 | 8 |
260.00 - 269.99 | 10 |
270.00 - 279.99 | 16 |
280.00 - 289.99 | 14 |
290.00 - 299.99 | 10 |
300.00 - 309.99 | 5 |
310.00 - 319.99 | 2 |
Total | 65 |
- Construct a histogram
- Construct a frequency polygon
- Construct a frequency curve
- State what conclusions can be drawn from the above graphical analysis
- The contributions of the agriculture, industrial, and service sectors to the Gross Domestic Product (GDP) for each province in Sri Lanka are given in the table below. It is required to investigate whether the contributions vary from province to province. What kind of graph would you suggest for representing the data to serve this purpose? Justify your answer. Sketch the proposed graph.
- Marks of 16 students are given below.
\[\text{52, 88, 56, 79, 72, 91, 85, 88, 68, 63, 76, 73, 86, 95, 12, 69}\]
- Find the quartiles of the distribution and interpret the values
- Construct a box plot for the data set
- Are there outliers in the data set?
- Listed below, ordered from smallest to largest, is the time in days the customers take to pay their invoices.
\[13, 13, 13, 20, 26, 27, 31, 34, 34, 34, 35, 35, 36, 37, 38, 41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82\]
- Determine the median
- Determine the first and third quartiles
- Determine the 2nd decile and the 8th decile
- Determine the 67th percentile
- What are outliers?
- Which of the following is/are unaffected by outliers? Underline the correct answer/ answers.
- Mean
- Median
- Mode
- Range
- Standard deviation
- Inter-quartile range
- Some summary measures of a variable are given below. Descriptive Statistics:
N | Mean | StDev | Minimum | Q1 | Median | Q3 | Maximum |
---|---|---|---|---|---|---|---|
50 | 5.389 | 0.997 | 3.150 | 4.658 | 5.350 | 6.020 | 8.600 |
- Are there outliers in this data set? YES/NO Justify your answer using a box plot
The mean and the standard deviation of 250 observations of a variable X are 62.1 and 4.3 respectively. However, in a re-scrutinizing process, it was found that the observations 72 and 81 were incorrectly recorded as 92 and 87. Find the correct mean and standard deviation of the data set.
The weekly sales from a sample of AB company were organized into a frequency distribution. The mean of weekly sales was computed to be $105,900, the median $105,000, and the mode $104,500
- Sketch the sales in the form of a smoothed frequency polygon. Note the location of the mean, median and mode on the X-axis
- Is the distribution symmetrical, positively skewed, or negatively skewed? Explain.
- Compare the variation of the annual incomes of executives with the variation of the incomes of unskilled employees. The sample information is given below.
Type | \(\bar{x}\) | \(SD\) |
---|---|---|
Executives | 500,000 | 50,000 |
Unskilled employees | 22,000 | 2,200 |
- Marks for the course module AB and the corresponding lecture attendance of 10 students are given below
Student Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Marks in course module AB | 84 | 51 | 91 | 60 | 68 | 89 | 98 | 58 | 53 | 47 |
Lecture attendance | 13 | 6 | 15 | 4 | 12 | 14 | 15 | 10 | 11 | 6 |
- Draw a scatter plot for the above data
- Calculate the coefficient of correlation
- Comment on the relationship between the marks and the lecture attendance
References
Black, K., Asafu-Adjaye, J., Khan, N., Perera, N., Edwards, P., & Harris, M. (2007). Australasian business statistics. John Wiley & Sons.
Spiegel, M. R., & Stephens, L. J. (2017). Schaum’s outline of statistics. McGraw Hill Professional.