This book presents the methodology and applications of a range of important topics in statistics, and is designed for graduate students in Statistics and Biostatistics and for medical researchers. Illustrations and more than ninety exercises with solutions are presented. They are constructed from research findings published in medical journals, summary reports of the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO), and practical situations. The illustrations and exercises are related to topics such as immunization, obesity, hypertension, lipid levels, diet and exercise, the harmful effects of smoking and air pollution, and the benefits of a gluten‐free diet. This book can be recommended for a one‐ or two‐semester graduate‐level course for students studying Statistics, Biostatistics, Epidemiology and Health Sciences. It will also be useful as a companion for medical researchers and research‐oriented physicians.
Number of pages: 416
Topics for illustrations, examples and exercises
List of abbreviations
1 Statistical measures
1.2 Mean, mode and median
1.3 Variance and standard deviation
1.4 Quartiles, deciles and percentiles
1.5 Skewness and kurtosis
1.6 Frequency distributions
1.7 Covariance and correlation
1.8 Joint frequency distribution
1.9 Linear transformation of the observations
1.10 Linear combinations of two sets of observations
2 Probability, random variable, expected value and variance
2.2 Events and probabilities
2.3 Mutually exclusive events
2.4 Independent and dependent events
2.5 Addition of probabilities
2.6 Bayes' theorem
2.7 Random variables and probability distributions
2.8 Expected value, variance and standard deviation
2.9 Moments of a distribution
3 Odds ratios, relative risk, sensitivity, specificity and the ROC curve
3.2 Odds ratio
3.3 Relative risk
3.4 Sensitivity and specificity
3.5 The receiver operating characteristic (ROC) curve
4 Probability distributions, expectations, variances and correlation
4.2 Probability distribution of a discrete random variable
4.3 Discrete distributions
4.4 Continuous distributions
4.5 Joint distribution of two discrete random variables
4.6 Bivariate normal distribution
5 Means, standard errors and confidence limits
5.2 Expectation, variance and standard error (S.E.) of the sample mean
5.3 Estimation of the variance and standard error
5.4 Confidence limits for the mean
5.5 Estimator and confidence limits for the difference of two means
5.6 Approximate confidence limits for the difference of two means
5.7 Matched samples and paired comparisons
5.8 Confidence limits for the variance
5.9 Confidence limits for the ratio of two variances
5.10 Least squares and maximum likelihood methods of estimation
6 Proportions, odds ratios and relative risks: Estimation and confidence limits
6.2 A single proportion
6.3 Confidence limits for the proportion
6.4 Difference of two proportions or percentages
6.5 Combining proportions from independent samples
6.6 More than two classes or categories
6.7 Odds ratio
6.8 Relative risk
7 Tests of hypotheses: Means and variances
7.2 Principal steps for the tests of a hypothesis
7.3 Right‐sided alternative, test statistic and critical region
7.4 Left‐sided alternative and the critical region
7.5 Two‐sided alternative, critical region and the
7.6 Difference between two means: Variances known
7.7 Matched samples and paired comparison
7.8 Test for the variance
7.9 Test for the equality of two variances
7.10 Homogeneity of variances
8 Tests of hypotheses: Proportions and percentages
8.1 A single proportion
8.2 Right‐sided alternative
8.3 Left‐sided alternative
8.4 Two‐sided alternative
8.5 Difference of two proportions
8.6 Specified difference of two proportions
8.7 Equality of two or more proportions
8.8 A common proportion
9 The Chisquare statistic
9.2 The test statistic
9.3 Test of goodness of fit
9.4 Test of independence: (
9.5 Test of independence: (2x2) classification
10 Regression and correlation
10.2 The regression model: One independent variable
10.3 Regression on two independent variables
10.4 Multiple regression: The least squares estimation
10.5 Indicator variables
10.6 Regression through the origin
10.7 Estimation of trends
10.8 Logistic regression and the odds ratio
10.9 Weighted Least Squares (WLS) estimator
10.11 Further topics in regression
11 Analysis of variance and covariance: Designs of experiments
11.2 One‐way classification: Balanced design
11.3 One‐way random effects model: Balanced design
11.4 Inference for the variance components and the mean
11.5 One‐way classification: Unbalanced design and fixed effects
11.6 Unbalanced one‐way classification: Random effects
11.7 Intraclass correlation
11.8 Analysis of covariance: The balanced design
11.9 Analysis of covariance: Unbalanced design
11.10 Randomized blocks
11.11 Repeated measures design
11.12 Latin squares
11.13 Cross‐over design
11.14 Two‐way cross‐classification
11.15 Missing observations in the designs of experiments
12 Meta‐analysis
12.2 Illustrations of large‐scale studies
12.3 Fixed effects model for combining the estimates
12.4 Random effects model for combining the estimates
12.5 Alternative estimators for
12.6 Tests of hypotheses and confidence limits for the variance components
13 Survival analysis
13.2 Survival and hazard functions
13.3 Kaplan‐Meier Product‐Limit estimator
13.4 Standard error of
) and confidence limits for
13.5 Confidence limits for
) with the right‐censored observations
13.6 Log‐Rank test for the equality of two survival distributions
13.7 Cox's proportional hazard model
Appendix A13 Expected value and variance of
and confidence limits for
14 Nonparametric statistics
14.2 Spearman’s rank correlation coefficient
14.3 The Sign test
14.4 Wilcoxon (1945) Matched‐pairs Signed‐ranks test
14.5 Wilcoxon's test for the equality of the distributions of two non‐normal populations with unpaired sample observations
14.6 McNemar’s (1955) matched pair test for two proportions
14.7 Cochran's (1950) Q‐test for the difference of three or more matched proportions
14.8 Kruskal‐Wallis one‐way ANOVA test by ranks
15 Further topics
15.2 Bonferroni inequality and the Joint Confidence Region
15.3 Least significant difference (LSD) for a pair of treatment effects
15.4 Tukey's studentized range test
15.5 Scheffe's simultaneous confidence intervals
15.6 Bootstrap confidence intervals
15.7 Transformations for the ANOVA
Solutions to exercises
End User License Agreement
Table 1.1 Heights (cm), weights (kg) and BMIs of twenty‐year old boys.
Table 1.2 Summary figures for the heights, weights and BMIs of the 20 boys in Table 1.1.
Table 1.3 Frequency distribution of the heights of the 20 boys in Table 1.1.
Table 1.4 Age and weight of n = 200 adults.
Table 3.1 Smoking and hypertension.
Table 3.2 Heights (feet), weights (pounds) and relative risks for the (25–59) years‐old‐persons.
Table 3.3 Results of a diagnostic test.
Table 4.1 Joint probability distribution of age and weight of the 200 adults in Table 1.4.
Table 6.1 HDL of 400 obese and 200 nonobese adults.
Table 7.1 Power of the test for the mean of the LDL with the right‐sided alternative when
Table 7.2 Power of the t‐test for the mean of the LDL with estimated variance: Right‐sided alternative.
Table 7.3 Power of the test for the mean of the HDL with the left‐sided alternative when
Table 7.4 Power for the t‐test for the mean of the HDL with the left‐sided alternative when
; estimated variance.
Table 7.5 Power of the test for the mean of the LDL with the two‐sided alternative;
Table 7.6 Power for the two‐side alternative for the mean of the LDL with estimated variance.
Table 9.1 Observed and expected frequencies of the normal distribution of the 400 weights in 15 classes.
Table 10.1 ANOVA for the Significance of the Regression with one independent variable.
Table 10.2 Hypotheses, Test Statistics and Confidence Limits; regression on one independent variable.
Table 10.3 ANOVA for the significance of the regression on two independent variables.
Table 10.4 Hypotheses, Test Statistics and Confidence Limits for the regression on two independent variables.
Table 10.5 ANOVA for the significance of the regression on p independent variables.
Table 10.6 Estimators for the regression through the origin.
Table 11.1 ANOVA for the equality of the treatment effects. One‐way classification.
Table 11.2 Sums of Squares and Cross Products for the Analysis of Covariance.
Table 11.3 Expected values of the mean squares. Randomized blocks design.
Table 11.4 Lipid levels (LDL, HDL) of four age groups after administering four treatments and a control. Randomized blocks design.
Table 11.5 SBPs (top) and DBPs (bottom) of three patients observed over four time intervals. Repeated measures.
Table 11.6 Percentage decrease of the SBP Latin Square design.
Table 11.7 Reduction of the SBP for three treatments. Cross‐over design.
Table 11.8 SBPs for the two‐way weight × fitness program classifications.
Table 11.9 ANOVA for the row and column effects. Unbalanced two‐way Cross classification.
Table 11.10 ANOVA for the row and column effects. Unbalanced two‐way Cross classification with interaction.
Table 12.1 Average decreases of the SBPs after a medical treatment for 15 groups of adults. Means (
) and standard deviations (
Table 13.1 Uncensored observations of Example 13.1. Survival times
: (5, 8, 12, 18, 24, 24, 24, 35, 42, 50).
Table 13.2 Survival times
Table T1.1 Heights (cm), Weights (kg) and BMI of 20
Table T1.2 Heights (cm), Weights (kg) and BMI's of 20
Table T1.3 Heights (cm), Weights (kg) and BMI's of 20
year old boys.
Table T1.4 Heights (cm), Weights (kg) and BMI's of 20
Table T1.5 Immunization Coverage (percentage) among 1‐year olds in 2004 in the countries of the world.
Table T4.1 US Population in 2006 in millions.
Table T4.2 Number of physicians in the U.S. per 10,000 civilian population.
Table T4.3 Lowest per capita healthcare expenditures in 2006 at the average exchange rate (US$); 150 countries.
Table T5 Expenditure on health in 2006. Averages (top) and standard deviations (bottom).
Table T10.1 (a) Physical and diagnostic measurements of n = 20 adults.
Table T10.1 (b) Variances and Covariances of the measurements.
Table T10.1 (c) Correlations of the measurements.
Table T10.1 (d) Corrected Sums of squares (SS) and Sums of Products (SP) of the measurements.
Table T10.2 Weight and LDL of two groups of adults (
= 11); first group followed by the second.
Table T10.3 Percent of the overweight (includes obesity, BMI ≥ 25) 20–74 years old persons.
Table T10.4 Personal Healthcare Expenditure.
Table T11.1 Age and weight of three groups of size
=10 adults each. (for balanced one‐way).
Table T11.2 Systolic blood pressures (mmHg) of the three groups of the adults in Table 11 before and after an exercise program; balanced case.
Table T11.3 Age (years) and weight (lbs) of
= 30 adults in three age groups of sizes
= 13 and
Table T11.4 Systolic Blood Pressures (mmHg) for the adults in Table T11.3 before and after an exercise program.
Figure 1.1 Stem and leaf display of the heights of the twenty boys. Leaf unit = 1.0. The median class has (6) observations. The cumulative number of observations below and above the median class are (2, 4, 9) and (5, 2).
Figure 1.2 Box and whiskers plot of the heights of boys in Table 1.1, obtained from Minitab. The middle line of the box is the median Q2. The bottom and top lines are the first and third quartiles Q1 and Q3. The tips of the vertical line, whiskers, are the upper and lower limits.
Figure 1.3 Histogram of the distribution of the heights of the boys in Table 1.3 obtained from Minitab.
Figure 1.4 Plot of the weights of the twenty‐year‐old boys on their heights from the observations in Table 1.1.
Figure 4.1 Probabilities under the standard normal distribution; (
) percent of the area is between
, and (α/2) percent above Z
Figure 4.2 Normal distribution fitted to the heights in Table 1.1 of the 20 twenty‐year old boys.
Figure 4.3 Normal probability plot of the heights of the twenty boys in Table 1.1.
Figure 4.4 Number of physicians per 10,000 civilian population in the U.S. in 2006 as presented in Table T4.2.
Figure 4.5 Lowest per capita healthcare expenditure at the average exchange rate (US$); 135 countries; 2006.
Figure 7.1 Probabilities of Type I and II errors for the right‐sided alternative.
Figure 7.2 Power of the test for the mean with the right‐sided alternative when the standard deviation is known.
Figure 7.3 Power of the test for the mean with the left‐sided alternative when the standard deviation is known.
Figure 7.4 Power for the test for the mean with the two‐sided alternative.
Figure 10.1 Regression of LDL on weight.
Figure 10.2 Expenditure for all persons in the United States over the nine periods from 1960 to 2010.
Figure 10.3 Logistic regression for the increase of LDL with weight.
Figure 13.1 Survival times t
: (5, 8, 12, 18, 24, 24, 24, 35, 42, 50) of Example 13.1.
Poduri S.R.S. Rao
Professor of Statistics, University of Rochester, Rochester, New York, USA
This edition first published 2017. © 2017 John Wiley & Sons, Ltd
Registered Office: John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Cataloging‐in‐Publication Data
Names: Rao, Poduri S.R.S., author.
Title: Statistical methodologies with medical applications / Poduri S.R.S. Rao.
Description: Chichester, West Sussex, United Kingdom ; Hoboken : John Wiley & Sons Inc., 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2016022669 | ISBN 9781119258490 (cloth) | ISBN 9781119258483 (Adobe PDF) | ISBN 9781119258520 (epub)
Subjects: MESH: Statistics as Topic
Classification: LCC RA409 | NLM WA 950 | DDC 610.2/1–dc23
LC record available at https://lccn.loc.gov/2016022669
A catalogue record for this book is available from the British Library.
Cover Image: Gun2becontinued/Gettyimages
To my grandchildren Asha, Sita, Maya and Wyatt
Heights, weights and BMI (Body Mass Index) of sixteen and twenty‐year‐old boys from growth charts
Immunization coverage of one‐year‐olds: Measles, DTP3 and HEP B3 from WHO reports
Medical insurance for children
Sudden Infant Death Syndrome (SIDS)
Population growth rates and fertility
Age, family size, income and health insurance
Healthcare expenditure in Africa, Asia and Europe
Vaccination for flu for different age groups
Emergency department visits for cold symptoms, injuries and other reasons.
Overweight and obesity
Trends of adult obesity
BMI and mortality
Smoking, heart disease and cancer risk
Air pollution and cancer risk
Hypertension, systolic and diastolic blood pressures (SBP, DBP) of males and females.
Cholesterol levels: LDL and HDL
Effects of overweight on LDL
Low‐dose aspirin and reduction of certain types of cancer
Celiac disease and the benefits of gluten‐free diet
Statins and the reduction of LDL
Exercise and its benefits for blood pressure levels
Weight loss with diets of combinations of low and high‐levels of fatty acids and protein
Medical rehabilitation of stroke patients
Functional independence measures of stroke patients from medical rehabilitation
Sources: Reports of WHO, CDC, U.S. Health Statistics; Journal of the American Medical Association; New England Journal of Medicine; and other published literature.
Statistical analysis, evaluation and inference are essential for every type of medical study and clinical experiment. Physicians and medical clinics and laboratories routinely record the blood pressures, cholesterol levels and other relevant diagnostic measurements of patients. Clinical experiments evaluate and compare the effects of medical treatments and procedures. Medical journals report the research findings on the relative risks and odds ratios related to hypertension, abnormal cholesterol levels, obesity, harmful effects of smoking habits and excessive alcohol consumption and similar topics.
Estimation of the means, standard deviations, proportions, odds ratios, relative risks and related statistical measures of health‐related characteristics are of importance for the above types of medical studies. Evaluation of the errors of estimation, ascertaining the confidence limits for the population characteristics of interest, tests of hypotheses and statistical inference, and Chisquare tests for independence and association of categorical variables are important aspects of many medical studies and clinical experiments. Statistical inference is employed, for instance, to assess the relationship between obesity and hypertension and the association between air pollution and bronchial problems. A variety of similar problems require statistical investigations and inference. Regression analysis is widely used to determine the relationship of clinical outcomes and physical attributes. In several clinical investigations, correlations between diagnostic observations are examined to search for the causal factors. Analysis of Variance and Covariance procedures are extensively employed to examine the differences between the effects of medical treatments. All the above types of statistical methods, procedures and techniques required for medical studies, research and evaluations are presented in the following chapters. Topics such as the Meta‐analysis, Survival Analysis and Hazard Ratios, and nonparametric statistics are also included.
Following the descriptive statistical measures in the first chapter, definitions of probability, odds ratios and relative risk appear in Chapters 2 and 3. Binomial, normal, Chisquare and related probability distributions essential for the statistical methods and applications are presented in Chapter 4. Estimation of the means, variances, proportions and percentages, odds ratios and relative risks, Standard Errors (S.E.) of the estimators and confidence intervals appear in Chapters 5 and 6. Tests of hypotheses of means, proportions and variances, p‐values, power of a test, and the sample size required for a specified power are the topics for Chapters 7 and 8. The Chisquare tests for goodness of fit and independence are presented in Chapter 9. Linear, multiple and logistic regressions and correlation are the topics for Chapter 10. Chapter 11 presents the Analysis of Variance (ANOVA) and Covariance procedures, Randomized blocks, Latin square designs, fixed and random effects models, and two‐way cross‐classification with and without interaction. Meta‐analysis and Survival Analysis in Chapters 12 and 13 are followed by the nonparametric statistics in Chapter 14. The final chapter contains topics in ANOVA and tests of hypotheses including the Simultaneous Confidence Intervals and Bootstrap Confidence Intervals.
Examples, illustrations and exercises with solutions are presented in each chapter. They are constructed from the observations of practical situations, research studies appearing in The New England Journal of Medicine (NEJM), Journal of the American Medical Association (JAMA), The Lancet and other medical journals, and the summaries presented in the Health Statistics of the Centers for Disease Control and Prevention (CDC) in the United States and the World Health Organization (WHO). They are related to a variety of medical topics of general interest including the following: (a) heights, weights and Body Mass Index (BMI) of ten‐to‐twenty‐year‐old boys and girls; (b) immunization of children; (c) overweight, obesity, hypertension and high cholesterol levels of adults; (d) benefits of fat‐free and gluten‐free diets and exercise; and (e) healthcare expenditures and medical insurance.
BMI is the ratio of the weight in kilograms to the square of the height in meters. A person is considered to be of normal weight if the BMI is at least 18.5 but less than 25, overweight if it is at least 25 but less than 30, and obese if it is 30 or more. For the blood cholesterol levels of adults, LDL less than 100 mg/dL and HDL higher than 40 mg/dL are considered optimal. Systolic and diastolic blood pressures (SBP and DBP) of 120/80 mmHg are considered desirable. Illustrations, examples and exercises throughout the chapters are related to these medical measurements and other health‐related topics. Readily available software programs in Excel, Minitab and R are utilized for the solutions of the illustrations, examples and exercises.
The various topics in these chapters are presented at the level of comprehension of the students pursuing statistics, biostatistics, medicine, biological, physical and natural sciences and epidemiological studies. Each topic is illustrated through examples. More than one hundred exercises with solutions are included. This book can be recommended for a one‐semester or two‐quarter course for the above types of students, and also for self‐study. One or two semesters of training in the principles and applications of statistical methods provides adequate preparation to pursue the different topics. The various statistical methods for medical studies presented in this manuscript can also be of interest to clinicians, physicians, and medical students and residents.
I would like to thank the editor, Ms. Kathryn Sharples, for her interest in this project. Thanks to Charles Heckler, Kevin Rader and Nicholas Zaino for their expert reviews of the manuscript. Thanks also to Sarah Briscoe, Isabelle Weir and Patricia Digiorgio for their assistance in assembling the manuscript on the word processor. Special thanks to my wife and daughter, Drs. K.R. Poduri, MD and Ann Hug Poduri, MD, MPH for sharing their medical expertise in selecting the various topics and illustrations throughout the chapters.
Poduri S.R.S. Rao
Professor of Statistics, University of Rochester
WHO: World Health Organization
CDC: Centers for Disease Control and Prevention
LDL: Low Density Lipoprotein
HDL: High Density Lipoprotein
LDL and HDL are measures of cholesterol levels in units of milligrams per deciliter (mg/dL).
SBP: Systolic Blood Pressure
DBP: Diastolic Blood Pressure
SBP and DBP are measures of pressure in the blood vessels in units of millimeters of mercury (mmHg).
BMI: Body Mass Index
Medical professionals, hospitals and healthcare centers record heights, weights and other relevant physical measurements of patients along with their blood pressures, cholesterol levels and similar diagnostic measurements. The Centers for Disease Control and Prevention (CDC) in the United States, the World Health Organization (WHO) and several other national and international organizations record and analyze various aspects of the healthcare status of citizens of all age groups. Epidemiological studies and surveys collect and analyze health‐related information on people around the globe. Clinical trials and experiments are conducted for the development of effective and improved medical treatments.
Statistical measures are utilized to analyze the various diagnostic measurements as well as the outcomes of clinical experiments. The mean, mode and median described in the following sections locate the centers of the distributions of the above types of observations. The variance, standard deviation (S.D.) and the related coefficient of variation (C.V.) are the measures of dispersion of a set of observations. The quartiles, deciles and percentiles divide the data respectively into four, ten and one hundred equal parts. The skewness coefficient exhibits the departure of the data from symmetry, and the kurtosis coefficient its peakedness. The measurements on the heights, weights and Body Mass Indexes (BMIs) of a sample of twenty‐year‐old boys obtained from the Chart Tables of the CDC (2008) are presented in Table 1.1. These measurements for the ten‐ and sixteen‐year‐old boys and girls are presented in Appendix Tables T1.1–T1.4.
Table 1.1 Heights (cm), weights (kg) and BMIs of twenty‐year old boys.
BMI = Weight/(Height)$^2$.
The diagnostic measurements of a sample of n individuals can be represented by $x_i$, $i = 1, 2, \ldots, n$. Their mean or average is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{1.1}$$
For the heights of the boys in Table 1.1, the mean becomes $\bar{x} = 175.2$ cm. Similarly, the mean of their weights is 73.1 kg. For the BMI, which is Weight/(Height)$^2$, the mean becomes 23.59.
The mode is the observation occurring more frequently than the remaining observations. For the heights of the boys, it is 176 cm. The median is the middle value of the ordered observations. If the number of observations n is odd, it is the $((n + 1)/2)$th observation. If n is an even number, it is the average of the (n/2)th and the next observation. Both the mode and median of the twenty heights of the boys in Table 1.1 are equal to 176 cm, which is slightly larger than the mean of 175.2 cm.
The mean, mode and median locate the center of the observations. The mean is also known as the first moment $m_1$ of the observations. For healthcare policies, for instance, it is important to examine the average amount of the medical expenditures incurred by families of different sizes or specified ranges of income. At the same time, useful information is provided by the median and modal values of their expenditures. Figure 1.1 is the Stem and Leaf display of the heights in Table 1.1. The cumulative numbers of observations below and above the median appear in the first column. The second and third columns are the stems, with the attached leaves.
Figure 1.1 Stem and leaf display of the heights of the twenty boys. Leaf unit = 1.0. The median class has (6) observations. The cumulative number of observations below and above the median class are (2, 4, 9) and (5, 2).
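The three measures of location described above can be sketched in a few lines of Python. This is only an illustration: the list `heights` is made up (the observations of Table 1.1 are not reproduced here), and the helper `median_by_rule` is a hypothetical name that follows the odd/even rule stated in the text.

```python
from statistics import mean, median, multimode

# Hypothetical heights (cm), made up for illustration; NOT the data of Table 1.1.
heights = [165, 168, 170, 171, 172, 174, 176, 176, 176, 178, 180, 183]


def median_by_rule(x):
    """Median by the rule in the text: the ((n + 1)/2)th ordered observation
    for odd n, the average of the (n/2)th and the next one for even n."""
    xs = sorted(x)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]          # positions are 1-based in the text
    return (xs[n // 2 - 1] + xs[n // 2]) / 2


x_bar = mean(heights)              # sample mean
x_med = median_by_rule(heights)    # agrees with statistics.median
x_modes = multimode(heights)       # a list, since ties are possible

print(f"mean = {x_bar:.2f} cm, median = {x_med} cm, mode(s) = {x_modes}")
```

`multimode` returns every value tied for the highest frequency, which avoids an arbitrary choice when a sample has more than one mode.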
The variance is a measure of the dispersion among the observations, and it is given by
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. \tag{1.2}$$
The divisor (n – 1) in this expression represents the degrees of freedom (d.f.). If (n – 1) of the observations and the sum or mean of the n observations are known, the remaining observation is automatically determined. The expression in (1.2) can also be written as $s^2 = \sum_{i \neq j}(x_i - x_j)^2 / [2n(n-1)]$, which is one‐half of the average of the squared differences over the n(n – 1) ordered pairs of the observations. The standard deviation (S.D.) is given by s, the positive square root of the variance. The second central moment of the observations, $m_2 = \sum_{i=1}^{n}(x_i - \bar{x})^2/n$, is the same as $(n-1)s^2/n$. For the twenty heights of boys in Table 1.1, $s^2 = 51.33$ and $m_2 = 48.76$. The standard deviation becomes $s = \sqrt{51.33} = 7.16$ cm.
The unit of measurement is attached to both the mean and standard deviation; kg for weight and cm for height. It is kg/(meter‐squared) for the BMI. The coefficient of variation (C.V.) is the ratio of the standard deviation to the mean and is devoid of the unit of measurement of the observations. The mean, variance, standard deviation and C.V. for the above three characteristics for the 20 boys in Table 1.1 are presented in Table 1.2.
Table 1.2 Summary figures for the heights, weights and BMIs of the 20 boys in Table 1.1.
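As a rough check on the measures of dispersion above, the following Python sketch computes the sample variance with the (n – 1) divisor, verifies that it agrees with the pairwise‐difference form, and derives the second central moment, the standard deviation and the C.V. The data are made up for illustration and are not those of Table 1.1; the function names are hypothetical.

```python
from itertools import permutations
from math import isclose, sqrt

# Hypothetical heights (cm), made up for illustration; NOT the data of Table 1.1.
x = [165, 168, 170, 171, 172, 174, 176, 176, 176, 178, 180, 183]


def sample_variance(x):
    """Sum of squared deviations from the mean divided by (n - 1)."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((xi - x_bar) ** 2 for xi in x) / (n - 1)


def pairwise_variance(x):
    """Same quantity from the squared differences of all n(n - 1) ordered pairs."""
    n = len(x)
    total = sum((xi - xj) ** 2 for xi, xj in permutations(x, 2))
    return total / (2 * n * (n - 1))


n = len(x)
s2 = sample_variance(x)
m2 = (n - 1) * s2 / n          # second central moment (divisor n)
s = sqrt(s2)                   # standard deviation, same unit as x
cv = s / (sum(x) / n)          # coefficient of variation, unit-free

assert isclose(s2, pairwise_variance(x))
print(f"s^2 = {s2:.2f}, m2 = {m2:.2f}, s = {s:.2f}, C.V. = {100 * cv:.2f}%")
```

The pairwise form needs no mean at all, which makes it a convenient independent check, although it takes on the order of n² operations.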
Any set of data can be arranged in an ascending order and divided into four parts with one quarter of the observations in each part. Twenty‐five percent of the observations are below the first quartile Q1 and 75 percent above. Similarly, half the number of observations are below the median, which is the second quartile Q2, and half above. Three‐quarters of the observations are below the third quartile Q3 and one‐fourth above. As seen in Section 1.2, the median of the heights in Table 1.1 is 176 cm. The average of the fifth and sixth observations is 171 cm, which is the first quartile. Similarly, the third quartile is 179 cm, which is the average of the fifteenth and sixteenth observations. The box and whiskers plot in Figure 1.2 presents the positions of these quartiles.
Figure 1.2 Box and whiskers plot of the heights of boys in Table 1.1, obtained from Minitab. The middle line of the box is the median $Q_2$. The bottom and top lines are the first and third quartiles $Q_1$ and $Q_3$. The tips of the vertical line, whiskers, are the upper and lower limits $Q_3 + 1.5(Q_3 - Q_1)$ and $Q_1 - 1.5(Q_3 - Q_1)$.
Ten percent of the observations are below the first decile and 90 percent above. Ninety percent of the observations are below the ninth decile and 10 percent above. One percent of the observations are below the first percentile and 99 percent above. Similarly, 99 percent of the observations are below the ninety‐ninth percentile and 1 percent above.
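The quartile rule used above for n = 20 (the average of the fifth and sixth ordered observations for the first quartile, and of the fifteenth and sixteenth for the third) amounts to taking the medians of the lower and upper halves of the ordered data. A minimal Python sketch follows, under two assumptions: the data are made up rather than taken from Table 1.1, and for odd n the middle observation is left out of both halves, which is only one of several conventions in use.

```python
from statistics import median, quantiles

# Hypothetical heights (cm) of 20 individuals; NOT the data of Table 1.1.
heights = [160, 163, 166, 168, 169, 172, 173, 174, 175, 176,
           176, 176, 177, 178, 179, 179, 181, 183, 186, 190]


def quartiles(x):
    """Q1, Q2, Q3 as the medians of the lower half, all data and upper half."""
    xs = sorted(x)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Assumption: for odd n the middle observation belongs to neither half.
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return median(lower), median(xs), median(upper)


q1, q2, q3 = quartiles(heights)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")

# Deciles and percentiles from the standard library; it interpolates between
# ordered observations, so its cut points can differ slightly from the rule above.
deciles = quantiles(heights, n=10)        # 9 cut points
percentiles = quantiles(heights, n=100)   # 99 cut points
```

Software packages differ in how they interpolate quantiles, so small discrepancies between this rule and the output of a given package are to be expected.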
Physical or diagnostic measurements $x_i$, $i = 1, 2, \ldots, n$, of a group of individuals may not be symmetrically distributed about their mean. The third central moment, $m_3 = \sum_{i=1}^{n}(x_i - \bar{x})^3/n$, will be zero if the observations are symmetrically distributed about the mean. It will be positive if the observations are skewed to the right and negative if they are skewed to the left. For the symmetrically distributed observations, the third, fifth, seventh and all the odd central moments will be zero. The Pearsonian coefficient of skewness is given by $K_1 = m_3/m_2^{3/2}$, which does not depend on the unit of measurement of the observations, unlike $m_2$ and $m_3$. For any set of observations symmetrically distributed about its mean, $m_3 = 0$ and hence $K_1 = 0$. For the positively skewed observations, $m_3$ and $K_1$ are positive. For the negatively skewed observations, they are negative. For the heights of the boys in Table 1.1, $m_3$ and $K_1$ are negative and small in magnitude; these heights are slightly negatively skewed.
The fourth central moment of the observations, $m_4 = \sum_{i=1}^{n}(x_i - \bar{x})^4/n$, becomes large as the distribution of the observations becomes peaked and small as it becomes flat. The Pearsonian coefficient of kurtosis is given by $K_2 = m_4/m_2^2$, which does not depend on the unit of measurement. For the normal distribution, which is extensively employed for statistical analysis and inference, the fourth central moment is three times the square of the variance, and hence $K_2 = 3$. For the observations on all the three characteristics in Table 1.1, the fourth moments are large, as seen from Table 1.2, but $K_2$ is smaller than three.
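The skewness and kurtosis coefficients can be sketched in Python as ratios of central moments. The sketch assumes the moment‐ratio forms K1 = m3 / m2^(3/2) and K2 = m4 / m2^2, which are consistent with the sign behaviour and the normal‐distribution value of three described above; the data and function names are hypothetical.

```python
def central_moment(x, r):
    """r-th central moment with divisor n: sum((x_i - mean)^r) / n."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((xi - x_bar) ** r for xi in x) / n


def skewness_kurtosis(x):
    """Return (K1, K2), assuming K1 = m3 / m2**1.5 and K2 = m4 / m2**2."""
    m2 = central_moment(x, 2)
    m3 = central_moment(x, 3)
    m4 = central_moment(x, 4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Made-up samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]

k1_sym, k2_sym = skewness_kurtosis(symmetric)
k1_right, _ = skewness_kurtosis(right_skewed)
print(f"symmetric: K1 = {k1_sym:.3f}, K2 = {k2_sym:.3f}")
print(f"right-skewed: K1 = {k1_right:.3f}")
```

Both coefficients are unchanged when every observation is multiplied by a positive constant, which is a quick way to confirm that they do not depend on the unit of measurement.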
Any set of clinical measurements or medical observations can be classified into a convenient number of groups and presented as the frequency distribution. The CDC, National Center for Health Statistics (NCHS) and other organizations present various health‐related measurements of the U.S. population in the form of summary tables. These measurements are obtained from periodic or continual surveys of the population in the country and also from the administrative medical records of the population. They are arranged according to age groups, education, income levels, male‐female classification and other characteristics of interest. Similar summary figures are presented by the WHO and healthcare organizations throughout the world. For the sake of illustration, the twenty heights of the boys in Table 1.1 are arranged in Table 1.3 into seven classes of the same width of five, and displayed as the histogram in Figure 1.3.
Table 1.3 Frequency distribution of the heights of the 20 boys in Table 1.1.
Figure 1.3 Histogram of the distribution of the heights of the boys in Table 1.3 obtained from Minitab.
In general, the n observations can be divided into k classes with $n_i$ observations in the ith class, $i = (1, 2, \ldots, k)$. The mid-values of the classes can be denoted by $(x_1, x_2, \ldots, x_k)$.
With the above notation, the mean of the n observations becomes
$$\bar{x} = \sum_{i=1}^{k} n_i x_i / n = \sum_{i=1}^{k} f_i x_i, \qquad (1.3)$$
where $f_i = n_i/n$ is the relative frequency in the ith class and $\sum_{i=1}^{k} f_i = 1$. The mean of the heights is obtained from Table 1.3 and (1.3).
Since the 20 observations are grouped, this mean differs slightly from the actual value of 175.2 cm.
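For the grouped data, the second moment becomes
$$m_2 = \sum_{i=1}^{k} n_i (x_i - \bar{x})^2 / n = \sum_{i=1}^{k} f_i (x_i - \bar{x})^2. \qquad (1.4)$$
Now, $s^2 = n m_2/(n - 1)$. The values of $m_2$ and $s^2$ obtained from (1.4) for the heights of the boys differ from the actual values 48.76 and 51.33 as a result of the grouping. From the grouped data, the third and fourth central moments are obtained from $m_3 = \sum_{i=1}^{k} f_i (x_i - \bar{x})^3$ and $m_4 = \sum_{i=1}^{k} f_i (x_i - \bar{x})^4$. In general, the rth central moment for the grouped data is given by $m_r = \sum_{i=1}^{k} f_i (x_i - \bar{x})^r$.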
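The grouped-data computations can be sketched in Python as follows. The class mid-values and frequencies are hypothetical (seven classes of width five, n = 20), since they are not the entries of Table 1.3, and the function name `grouped_moments` is chosen for this illustration. The sketch assumes $f_i = n_i/n$, central moments $m_r = \sum f_i (x_i - \bar{x})^r$ and $s^2 = n m_2/(n - 1)$.

```python
def grouped_moments(mid_values, counts):
    """Mean and central moments of grouped data from mid-values and counts."""
    n = sum(counts)
    rel_freq = [c / n for c in counts]  # relative frequencies f_i
    mean = sum(f * x for f, x in zip(rel_freq, mid_values))

    def moment(r):
        # r-th central moment: sum of f_i * (x_i - mean)**r
        return sum(f * (x - mean) ** r for f, x in zip(rel_freq, mid_values))

    m2 = moment(2)
    return {
        "n": n,
        "mean": mean,
        "m2": m2,
        "s2": n * m2 / (n - 1),  # sample variance with divisor (n - 1)
        "m3": moment(3),
        "m4": moment(4),
    }


# Hypothetical frequency distribution (NOT Table 1.3).
mid_values = [162.5, 167.5, 172.5, 177.5, 182.5, 187.5, 192.5]
counts = [1, 2, 4, 6, 4, 2, 1]

result = grouped_moments(mid_values, counts)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Because every observation in a class is replaced by the class mid-value, the figures from such a computation generally differ a little from those of the ungrouped observations.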
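The heights and weights of the 20 boys in Table 1.1 can be denoted by $(x_i, y_i)$, $i = (1, 2, \ldots, n)$. With the subscripts (x, y) for these characteristics, as presented in Table 1.2, the standard deviations of these characteristics are $s_x$ and $s_y$. Their covariance is given by
$$s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})/(n - 1). \qquad (1.5)$$
It is the sum of the cross-products of the deviations of $(x_i, y_i)$ from their means divided by (n – 1). It can also be expressed as $s_{xy} = \left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)/(n - 1)$. The sample correlation coefficient of (x, y) is
$$r = s_{xy}/(s_x s_y). \qquad (1.6)$$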
The correlation will be positive if y increases with x and negative if y decreases as x increases. In general, the covariance can be positive or negative. It can range from a large negative value to a large positive value, and the units of measurement of both x and y are attached to it. The correlation coefficient, however, ranges from –1 to +1, and it is free of the units of measurement of the two characteristics. If x and y increase or decrease together, their covariance and correlation will be positive; otherwise they will be negative. If x and y are not linearly related, $s_{xy}$ and $r$ will be zero or close to zero. For the heights and weights of the twenty-year-old boys in Table 1.1, $s_{xy}$ and $r$ are found from (1.5), (1.6) and Table 1.2. In this case, these two characteristics are highly positively correlated, as expected. Figure 1.4 displays the relationship of the weights and heights of the twenty boys in Table 1.1.
Figure 1.4 Plot of the weights of the twenty-year-old boys on their heights from the observations in Table 1.1.
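As an illustration of the covariance and correlation, the Python sketch below computes the sample covariance in two equivalent ways, with the divisor (n – 1), and the correlation as the covariance divided by the product of the standard deviations. The heights and weights are hypothetical values, not those of Table 1.1, and the function names are chosen for this example.

```python
import math


def covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by (n - 1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)


def covariance_shortcut(xs, ys):
    """Equivalent form: (sum of x*y - n * xbar * ybar) / (n - 1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / (n - 1)


def correlation(xs, ys):
    """Sample correlation coefficient r = s_xy / (s_x * s_y)."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)


# Hypothetical heights (cm) and weights (kg), NOT the data of Table 1.1.
heights_cm = [165, 170, 172, 175, 178, 180, 183, 188]
weights_kg = [58, 63, 66, 70, 72, 77, 80, 86]

s_xy = covariance(heights_cm, weights_kg)
r = correlation(heights_cm, weights_kg)

# Changing the unit of x changes the covariance but not the correlation.
heights_in = [h / 2.54 for h in heights_cm]
s_xy_in = covariance(heights_in, weights_kg)
r_in = correlation(heights_in, weights_kg)

print(f"s_xy = {s_xy:.3f} (cm, kg), {s_xy_in:.3f} (inches, kg)")
print(f"r = {r:.4f} in either unit")
```

The covariance carries the units of both characteristics, so it shrinks by the factor 2.54 when the heights are expressed in inches, while the correlation stays the same.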
When the number of observations on two variables (x, y) is not small, they can be grouped into a joint frequency distribution. National and international organizations present health-related characteristics in this form. For the sake of illustration, the age (x) and weight (y) of a sample of adults, classified into rows and columns, are presented in Table 1.4.
Table 1.4 Age (x) and weight (y) of n = 200 adults.
With the first and second subscripts $i = (1, 2, \ldots, r)$ and $j = (1, 2, \ldots, c)$ representing the rows and columns respectively, the cell in the ith row and jth column consists of $n_{ij}$ adults. The total numbers of observations in the ith row and jth column respectively are $n_{i.} = \sum_{j=1}^{c} n_{ij}$ and $n_{.j} = \sum_{i=1}^{r} n_{ij}$. The overall sample size becomes $n = \sum_{i=1}^{r} n_{i.} = \sum_{j=1}^{c} n_{.j}$. The row and column totals are the marginal totals. They provide the frequency distributions of the age and weight respectively. The means, variances and standard deviations for the row and column classifications are obtained from these distributions as described in Section 1.6. With the mid-values $(x_1, x_2, \ldots, x_r)$ of the row classification and $(y_1, y_2, \ldots, y_c)$ of the column classification, the covariance of (x, y) is obtained from
$$s_{xy} = \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} (x_i - \bar{x})(y_j - \bar{y})/(n - 1). \qquad (1.7)$$
The correlation coefficient is found from $r = s_{xy}/(s_x s_y)$.
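From Table 1.4, the mean, variance and standard deviation of the age, $(\bar{x}, s_x^2, s_x)$, are obtained from the row totals.
Similarly, for the weight, $(\bar{y}, s_y^2, s_y)$ are obtained from the column totals, and the covariance $s_{xy}$ is obtained from (1.7).
The resulting correlation of age and weight, $r = s_{xy}/(s_x s_y)$, is not very high.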
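The computations for a joint frequency distribution can be sketched in Python as follows. The 3 × 3 table of counts and the mid-values are hypothetical (they are not the entries of Table 1.4, although the counts also add to n = 200), the divisor (n – 1) is assumed for the variances and the covariance, and the function name `joint_summary` is chosen for this illustration.

```python
import math


def joint_summary(x_mid, y_mid, counts):
    """Means, variances, covariance and correlation from a joint frequency table.

    counts[i][j] is the number of observations in row i (x) and column j (y).
    """
    row_totals = [sum(row) for row in counts]        # marginal totals of x
    col_totals = [sum(col) for col in zip(*counts)]  # marginal totals of y
    n = sum(row_totals)

    xbar = sum(t * x for t, x in zip(row_totals, x_mid)) / n
    ybar = sum(t * y for t, y in zip(col_totals, y_mid)) / n
    sx2 = sum(t * (x - xbar) ** 2 for t, x in zip(row_totals, x_mid)) / (n - 1)
    sy2 = sum(t * (y - ybar) ** 2 for t, y in zip(col_totals, y_mid)) / (n - 1)
    sxy = sum(
        counts[i][j] * (x_mid[i] - xbar) * (y_mid[j] - ybar)
        for i in range(len(x_mid))
        for j in range(len(y_mid))
    ) / (n - 1)
    return {
        "n": n, "xbar": xbar, "ybar": ybar,
        "sx2": sx2, "sy2": sy2, "sxy": sxy,
        "r": sxy / math.sqrt(sx2 * sy2),
    }


# Hypothetical table: rows are age classes, columns are weight classes.
age_mid = [30, 50, 70]
weight_mid = [60, 80, 100]
counts = [
    [30, 25, 10],
    [20, 40, 20],
    [10, 25, 20],
]

summary = joint_summary(age_mid, weight_mid, counts)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The marginal totals alone determine the means and variances; only the covariance requires the individual cell counts.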
For computations, it may be convenient to transform the data first. For instance, we may subtract 170 from each of the heights in Table 1.1 and divide the result by 10. The new observations now become $u_i = (x_i - 170)/10$. We may also first divide each height by 100 and then subtract 5. Now, $u_i = x_i/100 - 5$. In either case, the new observations take the form $u_i = a x_i + b$, where (a, b) are positive or negative constants. The mean of the transformed observations becomes
$$\bar{u} = a\bar{x} + b.$$
Their variance becomes
$$s_u^2 = a^2 s_x^2.$$
With the above type of transformation, computations for $\bar{u}$ and $s_u^2$ become simple. Now, $\bar{x}$ is obtained from $(\bar{u} - b)/a$ and $s_x^2$ from $s_u^2/a^2$. Note that adding or subtracting a constant displaces the mean, but it has no effect on the variance. Multiplying $x_i$ by the constant a results in multiplying its variance by $a^2$, and the standard deviation by $|a|$.
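As found earlier, the average of the heights of the twenty boys is 175.2 cm. To convert $x_i$ in cm to $y_i$ in inches, $y_i = x_i/2.54$. Now, the average height is $\bar{y} = 175.2/2.54 = 68.98$ inches, or close to 5 feet 9 inches. With the variance $s_x^2 = 51.33$ of the heights in cm, the variance becomes $s_y^2 = 51.33/(2.54)^2 = 7.96$ and $s_y = 2.82$ inches.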
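The effect of a linear transformation can be verified with a short Python sketch. It assumes the form $u_i = a x_i + b$ with a = 0.1 and b = –17, which corresponds to subtracting 170 and dividing by 10, and it uses hypothetical heights rather than the observations of Table 1.1. The last lines apply the same rule to the summary figures 175.2 cm and 51.33 quoted in the text, taking 1 inch = 2.54 cm.

```python
import math
import statistics

# Hypothetical heights in cm (NOT the observations of Table 1.1).
heights = [162, 168, 171, 174, 176, 179, 183, 190]

# u = (x - 170) / 10 written in the form u = a * x + b.
a, b = 0.1, -17.0
u = [a * x + b for x in heights]

# Recover the mean and variance of x from those of u.
mean_x_recovered = (statistics.mean(u) - b) / a
var_x_recovered = statistics.variance(u) / a ** 2  # divisor (n - 1)

print(f"mean of x: direct {statistics.mean(heights):.4f}, "
      f"recovered {mean_x_recovered:.4f}")
print(f"variance of x: direct {statistics.variance(heights):.4f}, "
      f"recovered {var_x_recovered:.4f}")

# Conversion of the summary figures from cm to inches (a = 1/2.54, b = 0).
mean_inches = 175.2 / 2.54
var_inches = 51.33 / 2.54 ** 2
sd_inches = math.sqrt(var_inches)
print(f"inches: mean {mean_inches:.2f}, variance {var_inches:.2f}, "
      f"S.D. {sd_inches:.2f}")
```

The recovered figures agree with the direct ones up to rounding error, which is the reason such transformations are safe to use for simplifying hand computations.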
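Consider the gains in weights $(x_i, y_i)$, $i = (1, 2, \ldots, n)$, of a sample of n adults on two occasions. The total $t_i = x_i + y_i$, the difference $d_i = y_i - x_i$, or a weighted combination $a x_i + b y_i$ with specified constants (a, b) may be of interest. The mean and variance of $t_i$ are
$$\bar{t} = \bar{x} + \bar{y} \quad \text{and} \quad V(t_i) = s_x^2 + s_y^2 + 2 s_{xy},$$
where $(s_x^2, s_y^2)$ are the variances and $s_{xy}$ the covariance of x and y. The standard deviations of x and y are $s_x$ and $s_y$, and the sample correlation is $r = s_{xy}/(s_x s_y)$. The standard deviation $s_t$ of $t_i$ is the positive square root of $V(t_i)$.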
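Similarly, the mean and variance of $d_i$ are $\bar{d} = \bar{y} - \bar{x}$ and $V(d_i) = s_x^2 + s_y^2 - 2 s_{xy}$, and the standard deviation $s_d$ of $d_i$ is the positive square root of $V(d_i)$. If $u_i = a x_i + b y_i + c$, where (a, b, c) are constants, $\bar{u} = a\bar{x} + b\bar{y} + c$ and $V(u_i) = a^2 s_x^2 + b^2 s_y^2 + 2ab\,s_{xy}$. The standard deviation $s_u$ of $u_i$ is obtained from the square root of $V(u_i)$.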
For an illustration, consider the gains in weights (lbs) $(x_i, y_i)$ of five adults on two occasions: (5, 10), (10, 5), (10, 10), (5, –5), and (5, 10); the fourth adult lost 5 lbs on the second occasion.
From these observations, the mean, variance and standard deviation of $x_i$ are (7, 7.5, 2.74). The corresponding figures for $y_i$ are (6, 42.5, 6.52). The covariance and correlation are $s_{xy} = 3.75$ and $r = 0.21$. The mean, variance and standard deviation of $t_i$ and $d_i$ respectively become (13, 57.5, 7.58) and (–1, 42.5, 6.52). With $a = 1/4$, $b = 3/4$ and $c = -5$, the mean, variance, and standard deviation of $u_i$ become (1.25, 25.78, 5.08).
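The figures of this illustration can be reproduced with the Python sketch below, which compares the formula-based means and variances of the total, difference and weighted combination with those computed directly from the five pairs. Two points are assumptions of this sketch rather than statements of the text: the difference is taken as $d_i = y_i - x_i$, which matches the reported mean of –1, and the constants a = 1/4, b = 3/4, c = –5 are chosen because they reproduce a mean of 1.25 and a standard deviation of 5.08 for the weighted combination.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    """Sample covariance with divisor (n - 1); cov(xs, xs) is the variance."""
    xbar, ybar = mean(xs), mean(ys)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Gains in weight (lbs) on two occasions, as listed in the text.
pairs = [(5, 10), (10, 5), (10, 10), (5, -5), (5, 10)]
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]

sx2, sy2, sxy = cov(x, x), cov(y, y), cov(x, y)
r = sxy / math.sqrt(sx2 * sy2)

a, b, c = 0.25, 0.75, -5.0  # assumed constants, see the lead-in
t = [xi + yi for xi, yi in pairs]
d = [yi - xi for xi, yi in pairs]  # assumed direction of the difference
u = [a * xi + b * yi + c for xi, yi in pairs]

# (mean, variance) from the formulas for linear combinations.
formula = {
    "t": (mean(x) + mean(y), sx2 + sy2 + 2 * sxy),
    "d": (mean(y) - mean(x), sx2 + sy2 - 2 * sxy),
    "u": (a * mean(x) + b * mean(y) + c,
          a * a * sx2 + b * b * sy2 + 2 * a * b * sxy),
}
# (mean, variance) computed directly from the combined observations.
direct = {name: (mean(v), cov(v, v)) for name, v in (("t", t), ("d", d), ("u", u))}

print(f"s_xy = {sxy:.2f}, r = {r:.2f}")
for name in ("t", "d", "u"):
    m, v = formula[name]
    print(f"{name}: mean {m:.2f}, variance {v:.2f}, S.D. {math.sqrt(v):.2f}")
```

The formula-based and direct results coincide, which illustrates that only the means, variances and covariance of x and y are needed to summarize any linear combination of them.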
Find the summary figures for the 20 ten‐year old boys and girls in
(a) Find the means and standard deviations of the three characteristics for the sixteen‐year‐old boys and girls in
. (b) Find the means and S.D.s for the heights with grouping.
The basic principles of probability are essential for the development of statistical theory, inference and applications. Probabilities of mutually exclusive, independent and dependent events are described in the following sections. Bayes' theorem is illustrated through an example. General definitions of a probability distribution, expected value, variance and moments of a random variable are presented.
Clinically examining the difference between the effects of two or more medical treatments and evaluating the benefits of different diets for weight reduction or hypertension control are two illustrations of experiments. The outcome of an experiment can be a success or failure, Event A and its complement Event B. For instance, an exercise program may increase the HDL of a person by less than 10, or by more than 10 mg/dL. These two events can be denoted by A and its complement B. In a random sample of 100 persons participating in the exercise program, HDL may increase by less than 10 mg/dL for 40 persons and more than 10 mg/dL for the remaining. Thus (4/10)th or 40 percent of the outcomes are in favor of the event A and (6/10