Research: The systematic and replicable investigation of a question/problem.

The research process is often referred to as the scientific method. The language in which the scientific method is discussed is in need of transformation, but it can and should apply to all forms of inquiry. The scientific method takes a linear approach to problem solving and typically entails:

Define and delimit the problem

Formulate the hypothesis

Gather Data

Analyze and interpret findings

Qualitative Research: The nature of the ‘data’ is the distinguishing characteristic. In qualitative research, no summary or reduction of the data to a numerical representation is made.

Quantitative Research: With quantitative research, descriptive and/or inferential statistics are used to summarize data and infer from a sample something about the population the sample represents.

Research need not be entirely one or the other. In fact, a combination will often yield a richer and more comprehensive examination of a question.

Qualitative and Quantitative research are not polar opposites with completely different sets of techniques and approaches to inquiry. They exist along a continuum commonly framed in terms of the amount of control or manipulation present.

The advantage of a quantitative approach is that it is possible to measure the reactions of many people to a limited set of questions, thus facilitating comparison and statistical aggregation of data. A broad, generalizable set of findings results.

The advantage of a qualitative approach is that a wealth of detailed information about a specific event is produced. This increases understanding of the cases and situations studied but reduces generalizability.

Good research begins with a good and well-articulated question. This will help you decide what type of research and data you need to examine.

INDEPENDENT VARIABLE: The variable manipulated by the experimenter. A broader definition would be: any variable that is assumed to produce an effect on, or be related to, a behavior of interest.

LEVELS OF AN INDEPENDENT VARIABLE: The various values or groupings of values of an independent variable. Ex: a study is conducted to determine the effect of room temperature on performance. If the experimenter tests the subject at 70, 80, and 90 degrees, there is one independent variable - room temperature - with three levels.

DEPENDENT VARIABLE: The behavior or characteristic observed or analyzed by the researcher, generally with regard to how the independent variable(s) affected or were related to it.

TYPE OF DEPENDENT VARIABLE: In empirical research, the dependent variable is quantified in some way. Statistical analysis is carried out on the numerical values of the dependent variable. The three basic types are score data (ratio, interval), ordered data (ordinal), and frequency data (categorical).

- Interval/ratio scaled data: Generally requires relatively precise measuring instruments and an understanding of the behavior being measured. The data are considered continuous - you can measure to finer and finer degrees if you choose to. The statistical techniques (parametric) developed to analyze score data make rather stringent assumptions about the nature of the scores.
- Ordinal scaled data: Used when reliable interval/ratio scaled data cannot be (or is not) obtained, but the information can be ranked from high to low along the dimension of interest. In some cases, a researcher may convert score data to ranks because it is believed that the measuring instrument was not precise enough to trust the numerical scores, or that the assumptions underlying a statistical test for continuous data would be badly violated by the data. Statistical tests designed for use with ordered data generally do not make stringent assumptions about the nature of the underlying distributions and hence are more conservative than those designed for score data.
- Categorical (nominal) data: Each subject is classified into a particular category. The frequency of occurrence of subjects in each category typically provides the data from which statistical analysis is done.

Selection of Descriptive Statistics to Summarize Data

| Level of Measurement | Applicable Statistics |
| --- | --- |
| Nominal/Categorical | Percentages, Mode |
| Ordinal | Percentages, Mode, Median* |
| Interval | Mean, Median, Mode, Standard Deviation, Range, Percentiles, Z scores |
| Ratio | Mean, Median, Mode, Standard Deviation, Range, Percentiles, Z scores |

*Note: Use of the median for ordinal data should be applied only in situations where the underlying variable can be considered continuous and the numbers do not simply represent a few discrete categories.
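The choices in the table above can be sketched in a few lines of Python; the reaction-time and color values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical interval/ratio data: reaction times in seconds
scores = np.array([1.2, 1.5, 1.1, 1.9, 1.4, 1.6, 1.3, 1.8])

mean = scores.mean()
median = np.median(scores)
sd = scores.std(ddof=1)                   # sample standard deviation
value_range = scores.max() - scores.min()
p25, p75 = np.percentile(scores, [25, 75])
z_scores = (scores - mean) / sd           # standardized (Z) scores

# Hypothetical nominal data: only percentages and the mode apply
colors = ["red", "blue", "red", "green", "red", "blue"]
mode = max(set(colors), key=colors.count)
pct_red = 100 * colors.count("red") / len(colors)
```

Note that the full set of statistics is computed only for the interval/ratio data; for the nominal data, a mean or median of category labels would be meaningless.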

Selection of Inferential Statistics to Summarize Data

| Level of Measurement | Test Needed | Applicable Statistics |
| --- | --- | --- |
| Ordinal | Differences | Mann-Whitney, Kruskal-Wallis, Friedman |
| Interval/ratio | Differences | t-tests, ANOVAs |
| Nominal/Ordinal | Relationships | Chi-squared |
| Interval/ratio | Relationships | Correlation, Regression |
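As a sketch of how these test choices map onto code using scipy (the group scores and contingency counts are simulated, not from any real study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(70, 5, 30)  # simulated scores, condition A
group_b = rng.normal(75, 5, 30)  # simulated scores, condition B

# Interval/ratio data, question of differences -> t-test
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Ordinal (or assumption-violating) data, differences -> Mann-Whitney
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)

# Nominal data, question of relationships -> chi-squared
counts = np.array([[20, 10],   # hypothetical yes/no counts, group 1
                   [12, 18]])  # hypothetical yes/no counts, group 2
chi2, chi_p, dof, expected = stats.chi2_contingency(counts)

# Interval/ratio data, question of relationships -> correlation
r, r_p = stats.pearsonr(group_a, group_b)
```

The point is the mapping, not the numbers: the same research question ("differences" vs. "relationships") leads to different tests depending on the level of measurement of the dependent variable.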

Participant Observation: When the behaviors of individuals are of interest, observation is an appropriate data collection method. A participant observer is one fully engaged in the activity of the group under study. Depending on the needs of the study, the observer's identity is sometimes concealed. An advantage of this is that the observer is typically better able to interact with others in a more normal fashion. In addition, concealment enhances the probability that the actions of those observed will be more natural and decreases the chance of the researcher affecting the event/activity under observation. Downsides: concealment raises ethical questions, and the researcher's lens may be clouded by participation.

Unobtrusive Observation: Often undertaken in a public/natural setting, where those being observed are unlikely to know they are part of a research project. Advantage: the researcher does not influence events. Downside: unable to collect detailed information on social circumstances, subjects' backgrounds, personal characteristics, or demographics - data are limited to what can be observed without any interaction.

Content Analysis: The study of recorded communication - text, audio, visual. Sources depend on study’s goals. Advantage: can collect info without influence; others can examine data to verify interpretation/results. Downside: difficult to draw conclusions about social issues from recorded sources. Inferences must be limited to the nature and sources of recorded data.

Historical Assessment: Similar to content analysis, however, it typically includes a broader range of data sources - interviews, examination of relics, text, artifacts, geography, archeological data.

Personal Interviews: In unstructured interviewing, general topics are identified prior to the interview; however, much of the interview is guided by respondent comments and the researcher's questions about those comments. In structured interviewing, an interview guide is prepared, pretested, and carefully followed during data collection. Structured interviewing ensures equivalent information is obtained across respondents but limits the breadth/range of responses. Unstructured interviewing can produce very rich, broad-ranging information, but it may not summarize well. Overall, the benefit of face-to-face interviewing is that questions can be clarified, and information about the respondent as well as the environment and context can be recorded (e.g., body language).

Telephone interviewing: Same as personal interviewing, but with an additional downside: individuals may be reluctant to reveal information over the telephone.

Written surveys: Self administered mail surveys permit inclusion of a greater number of respondents across a wider geographic area - and at a lower cost. Downside: response rate can be quite low. Most effective when well structured and focused.

Active interaction: This refers to the collection of performance data from individuals and typically involves exposing a group to some experimental condition, training, treatment, etc. One group may be measured twice (or more) or two groups (exposed to varying conditions, training,...) may be measured once (or more). Advantage: a great deal of ‘control’ possible which strengthens ability to draw inferences. Downside: interaction changes people.

Before enrolling participants in an experiment, the investigator should be genuinely uncertain of the outcome. In other words, a true null hypothesis should exist at the outset.

The investigator must consider how adverse events will be handled; who will provide care for a participant injured in a study and who will pay for that care are important considerations.

Government/institutions typically have definitions around misconduct. In addition, there are many activities commonly considered unethical.

Central to all design & analysis concerns are ethical considerations with respect to

- Treatment of participants. The primary concern of the investigator should be the safety of the research participant.
- IRB review.
- Institutional Review Boards for the Protection of Human Subjects (IRBs) have been established at most institutions that undertake research with humans. These committees are made up of scientists, clinical faculty, and administrators who review research according to the procedures set out in Federal Regulations.
- If your research is part of a routine educational experience, or if your participants will remain completely anonymous (with no identifying code to link them to their identity), you may apply to the IRB for a certificate of exemption.
- A study may also qualify for "expedited review" if an IRB reviewer determines that it meets assessment criteria for minimal risk, and involves only procedures that are commonly done. A study that qualifies for expedited review is still held to the same standards used in full board review, but the approval process may take less time.

- Deception.
- Occasionally exploring your area of interest fully may require misleading your participants about the subject of your study. For example, home plate strike zone study. The IRB will review any proposal that suggests using deception or misrepresentation very carefully. They will require an in-depth justification of why the deception is necessary for the study and the steps you will take to safeguard your participants.

- Informed Consent.
- Federal regulations state: "no investigator may involve a human being as a subject in research covered by these regulations unless the investigator has obtained the legally effective informed consent of the subject or the subject's legally authorized representative." For informed consent to be valid, these principles apply:
- Disclosure: The potential participant must be informed as fully as possible of the nature and purpose of the research, the procedures to be used, the expected benefits to the participant and/or society, the potential of reasonably foreseeable risks, stresses, and discomforts, and alternatives to participating in the research. Document must make clear who to contact with questions/concerns.
- Understanding: The participant must understand what has been explained and must be given the opportunity to ask questions and have them answered by one of the investigators.
- The participant's consent to participate in the research must be voluntary, free of any coercion or promises of benefits unlikely to result from participation.
- Competence: The participant must be competent to give consent. In some cases a surrogate is acceptable.
- Consent: The potential human subject must authorize his/her participation in the research study, preferably in writing

- Privacy and confidentiality for subjects are critical.

- Manipulation of data
- Adding complexity for no legitimate reason should be avoided.
- Transforming raw data to other levels of measurement is acceptable provided sound reasons apply - e.g., converting age to age groups, or collapsing categories when cells have too few cases.
- Never acceptable: fabricating, falsifying, or misrepresenting research data.

- Rigorous error checking
- Simple tools available to detect many errors
- Frequency distribution tables
- Crosstab tables
- Graphs – for outliers

- Accuracy in reporting
- Care should be taken not to compromise external validity of the research.

- Conclusions must be grounded in data.
- Design and implementation of protocols should take into consideration threats to internal validity.

Because researchers often conduct their research on narrowly defined problems, an important task in the evaluation of research is to judge whether a researcher has defined the problem too narrowly to make an important contribution to the advancement of knowledge.

Remember, all methods of observation (data collection) are flawed. There is no perfect way to observe a given variable. An evaluator must ask: to what extent is the method likely to produce valid and reliable data given the purpose/context framed by the researcher?

The most common sampling flaw is the use of a convenience sample or voluntary responses - e.g., a mailed survey. Where self-selection is an issue, the evaluator must first look for the author's acknowledgment of the problem and their perspective on its effect, and second consider whether or not the problem is great enough to invalidate or obscure the findings.

Even what seems like a straightforward analysis can be flawed. For qualitative research, the evaluator must consider the extent to which the design and data collection protocols are likely to produce data with minimal variations in interpretation. For quantitative research, the evaluator must first consider the evidence regarding the extent to which the data (dependent variable) are reliable and valid, second the match between the analysis conducted and the research question, and third the appropriateness of the analyses conducted in the context of data type, assumptions, and the points around which decisions are made (number of tests, p values, alpha).

Details are important in research articles. The evaluator should examine whether or not enough detail is present to fully understand what was said and done to participants as well as how the data was constructed.

No research provides ‘proof’ of anything. When an author writes ‘research proves...’, you are reading work from a weak or questionable source. This alone is enough to call an entire piece of work into question.

Titles should be sufficiently specific

Titles should not describe results

Titles should not pose a simple yes/no question, e.g., do boys and girls differ in upper body strength?

If two part titles are employed BOTH parts should contain specific/important information about the study

If the title is framed around the main analytical question, it is desirable to have the IV and DV in the title, e.g., The relationship between cholesterol level and exercise frequency.

If a narrowly delimited sample is used, it is desirable to include a reference to the population in the title.

Titles should not imply causality unless the analytical techniques employed are appropriate for drawing this type of inference. Words that imply causality: effect, influence, impact, ...

Titles should not use acronyms or jargon.

Purpose of the study should be stated or clearly implied.

A snapshot of the methodology should be given.

Full titles of instruments should not be used unless the purpose of the study is to evaluate the reliability and validity of data from the instrument(s)

Highlights of results should be included, but brevity should not result in a misrepresentation of findings.

References to implications or future research do not belong in the abstract.

The intro should lead in by identifying one or more problems without a lot of extraneous verbiage. Ideally, the first sentence provides a concise statement of the problem and a reference to support the statement.

The importance of the problem should be made clear. Include implications of current research. The point is - has the author made the case?

Unless chronology (e.g., historical research) is of overriding importance, the intro should be developed around topics (not references).

Key terms should be defined as they come up in the intro.

While the author's opinion may be brought into the intro, it must be clearly identified as opinion. Any factual statement requires a source for support, e.g., "incidence of injury has increased in recent years...."

The intro should lead the reader smoothly into a wrap-up paragraph with the study's purpose, research questions, and the reason the study was undertaken.

Underlying theory should be adequately described.

Structure: Depending on the nature of the end product (thesis vs. journal article), the breadth of the opening will vary. In either case, the structure is that of a funnel, and the entire section should take the reader down a logical path that ends with a restatement of the purpose, but now in the context of the foundation set by previous work.

Researchers should be selective in the literature review. Long lists suggest the work has not been scrutinized.

When results vary across studies, the author should identify for the reader which they deem more dependable and why.

Current research must be included, but not at the expense of relevant work, regardless of publication date.

Opinion, when it surfaces, should be clearly communicated as such.

Must use primary sources predominantly.

The methods section of a research paper needs to be meticulously inclusive. Someone not connected with the study should be able to replicate your work just by reading your methods section.

- Instrument development (including reliability & validity information)
- Pilot Study

- Sampling
- Consider sources of invalidity

- Data collection protocol(s)
- Pilot & Main Study
- Consider sources of invalidity

- Statistical analysis
- Descriptive Statistics
- Inferential Statistics - main question
- Descriptive/Inferential Statistics - related questions
- Review Analytical Flow Charts

Length: Technical reports and journal articles may have page limits, and within them authors should convey the critical components others would need to replicate the work. In contrast, a thesis should take as much space as needed and be as meticulously inclusive as possible. Someone not connected with the study should be able to replicate your work just by reading your methods section.

Structure/Content: The following is one recommended structure. The order may vary, however the content should be present.

Instrument development (including reliability & validity information)

- Process thorough?
- Instrument pilot tested (with validity & reliability examined appropriately)

Sampling

- Will the procedure utilized produce a representative sample that inferences can reasonably be made from?
- To what extent is bias likely to be present? Common techniques: random, pseudo-random, stratified random, proportional stratified random. If optimal sampling is not achieved, the author should explicitly identify the implications/limitations.
- Is the sample size adequate? How was the target established?

Data collection protocol(s)

- Should be thorough and ‘tight’ enough to reduce the likelihood of compromising internal validity. Where the collection and/or recording of data has a subjective component, evidence that objectivity/reliability were assessed is essential.

Research design

- If groups were formed, were they equivalent at the start? Were individuals randomly assigned?
- Where multiple tasks/events were involved, was order balanced?
- Does the design indicate every effort was made to minimize sources of invalidity?

Statistical analysis of data - must include how you will assess validity & reliability of dependent variable(s) (and independent variable(s) for relationship study).

- Descriptive Statistics.
- Appropriate for data type?
- Validity & reliability of data examined? Appropriately?
- Contributes to understanding of sample or problem?

- Inferential statistics addressing main problem.
- Appropriate for data type?
- Appropriate for question under examination?
- Should be the least complex it can be and still provide insight to the question(s) being examined.
- Assumptions checked?
- Both statistical and practical significance reported?

- Analyses pertaining to related problems.
- Appropriate for data type?
- Appropriate for question under examination?

Sampling - The selection of a sample is one of the keys to limiting the problems of internal and external validity and reliability of the research.

Identify Population

- Be sure to delimit (specify characteristics - breadth/depth) population.
- To indicate that you are sampling something other than the entire population, use descriptive words like selected, representative, typical, certain, a random sample of, --- . Say precisely what you mean.

Once the population you are interested in has been clearly defined, a strategy for drawing a sample from that population is needed. The sample should be carefully chosen so that all the characteristics present in the total population appear in the sample in the same proportions. In addition, it is important to be clear in the problem statement that you are taking a representative sample for your study.

- Researchers must use language that makes it clear in what way they have delimited the population in order to obtain a sample for study.
- Being representative is as important as being large.

Issues to consider when determining sample size:

- Sampling error is inversely related to sample size. The larger the sample size, the smaller the sampling error and the greater the likelihood that the sample is representative of the population
- Sample size should be greater when variability within the population increases; larger samples are needed when the population is heterogeneous.
- When subgroup comparisons are planned, the overall sample size needs to be large enough that the subgroups can support meaningful comparisons.
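The first point above - that sampling error is inversely related to sample size - can be demonstrated with a quick simulation. The population below is synthetic; the expected standard error of the mean is sigma/sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(100, 15, 100_000)  # synthetic population, sd ~ 15

def sampling_error(n, draws=2000):
    """Variability of sample means across many samples of size n."""
    samples = rng.choice(population, size=(draws, n))
    return samples.mean(axis=1).std()

se_small = sampling_error(25)   # roughly 15 / sqrt(25)  = 3.0
se_large = sampling_error(400)  # roughly 15 / sqrt(400) = 0.75
```

Quadrupling precision (cutting the standard error to a quarter) requires sixteen times the sample size, which is why "bigger" yields diminishing returns.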

Bias: Any influence, condition, or set of conditions which singly or together cause distortion of the data from what would have been obtained by pure chance. With this definition, any factor that impairs the randomness of the sample would be considered bias. Bias due to inadequate sampling impairs external validity.

Bias due to inadequate sampling can be a major problem for example in survey research. As the project is conceived the sample should be carefully chosen so that the researcher is able to see all the characteristics of the population in the sample.

Sampling Strategies: There are four major sampling techniques:

- Simple random sampling: The population is generally a homogeneous mass of individual units. Example: a quantity of flower seeds of a particular variety from which random samples are selected for testing their germination quality.
- Simple stratified sampling: The population consists of definite strata, each of which is distinctly different, but the units within a stratum are as homogeneous as possible. Example: a particular town whose total population consists of three types (strata) of citizens: Caucasian, African American, and Mexican American.
- Proportional stratified sampling: The population contains definite strata with differing characteristics, and each stratum has a proportionate ratio, in terms of number of members, to every other stratum. Example: a community in which the total population consists of individuals whose religious affiliations are Catholic (25%), Protestant (50%), Jewish (15%), and unaffiliated (10%).
- Cluster sampling: The population consists of clusters whose cluster characteristics are similar yet whose unit characteristics are as heterogeneous as possible. Example: a survey of travelers using the nation's 20 leading air terminals could be done by cluster sampling; air terminals are similar in atmosphere, purpose, design, etc., yet the passengers who use them differ widely in individual characteristics: age, gender, national origin, philosophies and beliefs, socioeconomic status, and so forth.
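The simple random and proportional stratified techniques can be sketched as follows; the 1,000-person sampling frame is invented to mirror the hypothetical community above (25/50/15/10 percent affiliations):

```python
import random

random.seed(1)

# Hypothetical sampling frame: (person_id, religious affiliation)
frame = ([(i, "Catholic") for i in range(250)] +
         [(i, "Protestant") for i in range(250, 750)] +
         [(i, "Jewish") for i in range(750, 900)] +
         [(i, "Unaffiliated") for i in range(900, 1000)])

# Simple random sampling: every unit has an equal chance
simple = random.sample(frame, 100)

# Proportional stratified sampling: random sample within each
# stratum, sized by the stratum's share of the population
strata = {}
for unit in frame:
    strata.setdefault(unit[1], []).append(unit)

n_total = 100
proportional = []
for members in strata.values():
    k = round(len(members) / len(frame) * n_total)
    proportional += random.sample(members, k)
```

With a sample of 100, the proportional stratified draw contains exactly 25 Catholic, 50 Protestant, 15 Jewish, and 10 unaffiliated units; the simple random draw will only approximate those proportions.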

One step is common to each of the four techniques above: randomization.

A random sample is a subset of observations drawn from a given population in such a way that each observation contained in the population has an equal chance of being included in the sample. In practice, samples seldom meet this criterion for randomness, but they are treated as random if no systematic bias exists that might be expected to invalidate the generalizations based on the sample.

When the focus is on the differences across the strata or subgroups, non-proportional stratified random sampling should be used to select samples of the same size in each stratum.

If it is more important to have a representative sample, then a proportional stratified sampling process should be used to select samples of sizes that are representative of those in the population.

**Random Selection & Random Assignment**

The random selection of subjects is employed to obtain a representative sample of the population. This enhances external validity (generalizing results) and internal validity (results not confounded by sources of invalidity related to bias). The reason to employ random assignment of subjects to treatment groups is to enhance the likelihood that the groups are equivalent at the start.
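Random assignment can be as simple as shuffling the enrolled participants and splitting the list; the participant IDs here are made up for illustration:

```python
import random

random.seed(7)

participants = [f"P{i:02d}" for i in range(1, 21)]  # 20 enrolled subjects

# Shuffle so each subject has an equal chance of either condition,
# then split the list into two equal groups
random.shuffle(participants)
treatment = participants[:10]
control = participants[10:]
```

Note the distinction: random *selection* happens when the sample is drawn from the population; random *assignment* happens afterward, when the selected subjects are allocated to conditions.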

Reminder: sample size is directly related to ‘power’ - the probability of correctly rejecting the null hypothesis. Therefore, it is important that researchers determine sample size from the perspective of power. Software is available to help determine sample size, so there is no reason not to.
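To give a rough sense of how power drives sample size: for a two-group comparison, one common normal-approximation formula is n per group = 2((z_{1-α/2} + z_{1-β})/d)², where d is the standardized effect size. A minimal sketch (dedicated power software or exact t-based methods will give slightly larger values):

```python
import math
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-group comparison with standardized effect size d."""
    z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = .05
    z_beta = norm.ppf(power)           # ~0.84 for power = .80
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

n = n_per_group(0.5)  # medium effect size -> 63 per group
```

Halving the effect size you want to detect roughly quadruples the required sample, which is why the expected effect must be specified before the study, not after.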

Note on sample size: a well selected and controlled small sample is better than a poorly selected and poorly controlled large sample. Size alone is not the key to good research.

Summary Information on Sampling

The quality of the methods employed to collect data is another key to limiting the problems of internal and external validity and reliability of the research.

For the thesis format this section must be meticulously detailed. Absolutely every piece of information related to collecting the data must be included. In an article format it typically must be tighter since there will be page limits. Specificity with respect to EVERY variable you collect data on is necessary.

Reminder: The methods section of a research paper needs to be meticulously inclusive. Someone not connected with the study should be able to replicate the work just by reading the methods section.

When data is collected via survey, information on the development of the instrument (including reliability & validity information) is critical.

The data collection protocol(s) and selection of a sample are the keys to limiting the problems of internal and external validity and reliability of the research.

- Internal validity: Extent to which results can be attributed to "treatment".
- External validity: Extent to which results can be generalized. External validity is examined qualitatively by scrutinizing the sampling scheme employed.

Sources of Invalidity

- Rosenthal effect: Self-fulfilling prophecy - you get what you expect. Best to do a double-blind study when this is a potential source of invalidity.
- Halo effect: The general effect of a good or bad feeling you have about a person. In observational designs this may be a particular problem. Best to use a checklist and verify the reliability of the instrument and those collecting data.
- Demand characteristics: Allowing subjects to know what the goals are. Deception (of an ethical nature) may be needed to avoid this source of invalidity.
- Volunteer effect: Volunteers may be fundamentally different from the overall population you are trying to generalize to.
- Instrumentation effect: Changes in instruments can be mistaken for changes in subjects.
- Pre-testing effect: Subjects can be changed or learning can take place during a pre-test which could affect results.
- Time: Over a length of time, maturation may have more of an impact than the independent variable. Also, major events can affect subjects' behaviors and/or opinions.
- Hawthorne effect: When the giving of attention rather than the independent variable is the cause of observed differences/relationships.

Reliability of the research: essentially, this refers to the replicability of the research. The reliability of the research is assessed qualitatively by scrutinizing the design and methodology employed in the research.

Reliability of the research hinges on the thoroughness of the data collection protocol in addition to obtaining a representative sample.

To clarify: we are now talking about the reliability & validity of the DATA.

Concerned primarily with the dependent variable. The instrument used to quantify the dependent variable should be examined for its ability to produce valid data (the ability to truly measure what it is supposed to). Valid data is clean and relevant. If the instrument is a well-known one with work already in place establishing the validity of data produced by it, it may be enough to cite a reference where validity was examined and show that the same protocol was followed in your study on similar subjects.

Depending on the type and purpose of data collection, validity can be examined from one or more of several perspectives: content, concurrent, predictive, and construct. Validity of the dependent variable can be assessed qualitatively or quantitatively (the quantitative approaches use an interclass coefficient, i.e., correlation):

Qualitatively

- Content/logical validity - 'expert' review

Quantitatively

- Concurrent validity – correlation
- Predictive validity – correlation
- Construct validity - multi-trait/multi-method procedure (correlations); factor analysis

When measures are found to be valid for one purpose they will not necessarily be valid for another purpose. Validity also may not be generalizable across groups with varying characteristics.

Content/logical validity (assessed qualitatively) - expect authors to

1. Clearly define what was measured.

2. State all procedures used to gather measures.

3. Have had an "expert" assess whether or not the instrument/test is measuring what you think it is.

Content validity (assessed quantitatively) Ex: survey research - expect authors to

1. Pilot test the survey

2. Conduct a factor analysis of survey results

3. Revise based on analysis

4. Administer survey and conduct another factor analysis

Criterion-related validity (predictive and concurrent) - Compare measures from your dependent variable with measures from a criterion (expert, another test, etc.) of the same skill/knowledge.

Concurrent validity (assessed quantitatively) - expect authors to

1. Gather x [dependent variable] and y measures from a large group

2. Compute an appropriate correlation coefficient

3. If the correlation is > .80 for positively correlated variables or < -.80 for inversely related variables, the measure (x) is said to have good concurrent validity

Predictive validity (assessed quantitatively) - expect authors to

1. Gather measures using their instrument (x) and measures on the variable(s) they are trying to predict (y)

2. Compute an appropriate correlation coefficient

3. If the correlation is > .80 for positively correlated variables or < -.80 for inversely related variables, the measure (x) is said to have good predictive validity

4. Follow up with estimation of the SEE - a band placed around a predicted score to quantify prediction error.
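The correlation check and the SEE follow-up can be sketched like this; the instrument and criterion scores below are fabricated purely to illustrate the computation:

```python
import numpy as np

# Fabricated illustration data: instrument scores (x) vs. criterion (y)
x = np.array([12, 15, 11, 18, 14, 16, 13, 17, 10, 19], dtype=float)
y = np.array([25, 30, 24, 37, 29, 33, 26, 35, 22, 38], dtype=float)

r = np.corrcoef(x, y)[0, 1]
good_validity = r > 0.80   # the > .80 rule of thumb from the text

# Standard error of estimate: the band placed around a predicted
# score to quantify prediction error
see = y.std(ddof=1) * np.sqrt(1 - r ** 2)
```

The higher the correlation, the narrower the SEE band; with r near 1 the prediction error approaches zero, and with r near 0 it approaches the full standard deviation of the criterion.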

Construct validity (assessed quantitatively)

A construct is an intangible characteristic. When you want to measure a construct such as anxiety, competitiveness, etc., you have no direct means to do so. Therefore indirect methods need to be employed. To then estimate the validity of the indirect measures (as reflections of the construct you're interested in) you record a pattern of correlations between the indirect measure(s) and other similar and dissimilar measures. Your hope is that the pattern reveals high correlations with similar measures (convergent validity) and low correlations with different measures (divergent/discriminant validity).

Expect authors to employ one of two techniques used to quantitatively assess construct validity

- Multi-trait multi-method matrix

- factor analysis.

Reliability is concerned primarily with the dependent variable. The instrument used to quantify the dependent variable should be examined for its ability to produce reliable data (accuracy of measures reflected in consistency).

Reliability of the dependent variable can be assessed quantitatively using an intraclass coefficient:

1. Coefficient alpha

2. Intraclass R

Reliability of Scores (Norm-referenced Reliability)

Data is reliable when there is little or no measurement error (when scores are accurate). So the key to reliability is minimizing measurement error (highly unlikely to ever eliminate).

When analyzing research, look for sources of measurement error that may have a negative impact on reliability:

- Measuring device/test
- Test administrator
- Temporary effects (warm-up, practice)
- Test length
- Factors that represent sources of invalidity in the research

The reliability of measures is typically assessed in one of two ways:

- Internal consistency - By examining precision and consistency of test scores throughout one administration of a test.
- Stability - By examining precision and consistency of test scores over time. (test-retest)

An intraclass coefficient is needed to examine the reliability of data. The two common statistics used are the intraclass R and coefficient alpha.
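Coefficient alpha can be computed directly from item-level data. A minimal plain-Python sketch with hypothetical survey responses:

```python
# Hypothetical item scores: rows = respondents, columns = test items.
scores = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 5],
]

def variance(values):
    # Sample variance (n - 1 in the denominator).
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

k = len(scores[0])  # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
total_var = variance([sum(row) for row in scores])

# Coefficient (Cronbach's) alpha: higher values indicate more
# internally consistent (reliable) scores.
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"alpha = {alpha:.3f}")
```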

It is possible to have reliable data that is invalid. Data/information that is valid, on the other hand, should also be reliable. So, reliability does not ensure validity.

You are at all times interested in the reliability & validity of both the research and the data when analyzing the quality of research.

Examining the reliability & validity of the research is done by scrutinizing the design, sampling, and data collection protocols. As a reader, you should not assume that, because no mention is made by the author(s), no threats to internal/external validity or reliability existed. Conversely, when such threats are mentioned, that should not necessarily cause you to question the quality of the research.

Examining the reliability & validity of the data is done by scrutinizing the data collection process and statistics used to assess validity and reliability of the data representing the dependent variable.

Reliability and validity of the data (examined statistically) should be reported in a research paper.

- Descriptive Statistics
- Inferential Statistics - main question
- Descriptive/Inferential Statistics - related questions

Measurement Issues

The place to start is with how to classify data - the scale the data appropriately belongs on will affect analysis decisions.

Measurement Scales

Categorical/nominal scale: Used to measure discrete variables that can be classified by two or more mutually exclusive categories.

Ex: Gender is a categorically scaled variable with two categories: male & female. The scale scores (0, 1) have no numerical meaning.

Data at this level of measurement can be summarized by:

Frequency distribution tables

Crosstabulation tables

Charts/graphs

Ordinal scale: Used to measure discrete variables that are categorical in nature and can be ordered (meaningfully).

Ex: Undergraduate class is an ordinally scaled variable with four meaningfully ordered categories: freshman, sophomore, junior, senior. The scale scores (1, 2, 3, 4) have meaning in that juniors have completed more units than sophomores, who have completed more than freshmen . . .

Another example is Likert-scaled items: e.g., strongly agree ---- strongly disagree

Data at this level of measurement can be summarized by:

Frequency distribution tables

Crosstabulation tables

Charts/graphs

There is a tendency to want to jump to the presentation of central tendency and variability at this level of measurement. You should not: the data is not yet continuous (measured to finer degrees).

In survey research, some make the argument that the underlying scale is continuous, however the data is clearly ordinal.

The reasonable exception is when you generate a factor score from several Likert-scaled items. The combined set of several items approaches a continuum, and it is now more meaningful and less misleading to summarize factor scores with measures of central tendency and variability.

Interval scale: Used to measure continuous variables that are ordinal in nature and result in values that represent actual and equal differences in the variable measured.

Ex: Temperature is an interval scaled variable with meaningfully ordered categories (hot, cold) that can be measured (scale has a constant unit of measurement) to finer and finer degrees given appropriate instrumentation.

Data at this level of measurement can be summarized by:

Charts/graphs

Central Tendency & Variability

Correlation

Data is now considered continuous and measures of central tendency and variability are an excellent way to summarize descriptive information on subjects’ characteristics at this level of measurement.

Ratio scale: Used to measure continuous variables that have a true zero, implying total lack of the attribute/property being measured.

Ex: Weight is a ratio scaled variable with meaningfully ordered categories (heavy, light) that can be measured to finer and finer degrees that also has a true rather than arbitrary zero.

Data at this level of measurement can be summarized by:

Charts/graphs

Central Tendency & Variability

Correlation

Depending on level of measurement summary information should be provided on

Participant Demographics

Participant Demographics by subgroup

All other variables relevant to the question under study

All other variables relevant to the question under study by subgroup

Descriptive Statistics

- Frequency Distribution Tables - Percentages
- Crosstabulation Tables - Percentages
- Central Tendency - Mean, Median, Mode
- Variability - Standard Deviation, Range
- Correlation

Category | Frequency | Percent |

High | 15 | 17% |

Medium | 30 | 33% |

Low | 45 | 50% |

When reporting percentages, authors should report the underlying frequencies, because percentages alone can be misleading.

| College A | College B |

Number of Students | 150 | 350 |

Sport Philosophy Students | 12 (8%) | 15 (4%) |

Crosstabulation Tables

For example, if you have data on dominant hand and gender and want to know what percentage of females in a group are left handed, you could crosstabulate the two:

| Left Handed | Right Handed |

Male | 10 (33%) | 20 (67%) |

Female | 15 (33%) | 30 (67%) |

Author should make sure the direction for the total matches the text explanation.
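A sketch of how such a crosstabulation with row percentages might be computed, using hypothetical handedness observations that match the table above:

```python
from collections import Counter

# Hypothetical (gender, handedness) observations for a 2x2 crosstab.
data = ([("M", "Left")] * 10 + [("M", "Right")] * 20 +
        [("F", "Left")] * 15 + [("F", "Right")] * 30)

counts = Counter(data)
for gender in ("M", "F"):
    row_total = counts[(gender, "Left")] + counts[(gender, "Right")]
    for hand in ("Left", "Right"):
        pct = 100 * counts[(gender, hand)] / row_total
        print(f"{gender} {hand}: {counts[(gender, hand)]} ({pct:.0f}%)")
```

Note that the percentages are computed across rows here; this direction is exactly what the text explanation must match.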

Central Tendency

Provides a measure of where scores tend to center. Most commonly reported is the mean; however, it is NOT a representation of the center when the distribution is skewed. The median should be reported in that instance.

Data may be severely misrepresented when an inappropriate measure of central tendency is reported.

Data should be at least interval scaled when using the median or mean.

Responses to individual Likert scaled items are not interval scaled.

Variability

The companion to central tendency. Provides a measure of the spread of scores. Should always be reported with measures of central tendency.

Correlation

Provides a measure of the strength of the relationship between two variables. Selection of a correlation coefficient depends on the variable type:

Two continuous: Pearson Product Moment Correlation

Two true dichotomous: Phi

Two ordinal: Kendall's Tau

One continuous; one true dichotomous: Point Biserial

General Interpretation:

Negative | Interpretation | Positive |

-.8 to -1.0 | High/strong | +.8 to 1.0 |

-.6 to -.79 | Moderately high | +.6 to .79 |

-.4 to -.59 | Moderate | +.4 to .59 |

-.2 to -.39 | Low | +.2 to .39 |

0 to -.19 | Little/no relationship | 0 to +.19 |
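The general interpretation scale can be expressed as a small helper function. The labels follow the table above; the handling of exact boundary values is an assumption:

```python
def interpret_r(r):
    # Map the absolute value of a correlation coefficient onto the
    # general interpretation scale.
    a = abs(r)
    if a >= 0.8:
        return "high/strong"
    if a >= 0.6:
        return "moderately high"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "low"
    return "little or no relationship"

print(interpret_r(0.85), interpret_r(-0.45), interpret_r(0.1))
```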

When interested in differences or change over time for one group or between groups a number of designs are applicable. The most frequently used designs can be collapsed into two broad types: true experimental and quasi-experimental.

True experimental designs: these designs all have in common the fact that the groups are randomly formed. The advantage associated with this feature is that it permits the assumption that the groups were equivalent at the beginning of the research, which provides control over sources of invalidity based on non-equivalency of groups.

The control is of course not inherent in the design. The researcher must still work with the groups in such a way that nothing happens to one group (other than the treatment) that does not happen to the other, that scores on the dependent measure do not vary as a result of instrumentation problems, and that the loss of subjects is not different between the groups.

This design requires the formation of at least two groups. One group will receive the ‘experimental treatment’; the other will not. The group not receiving the treatment is commonly referred to as the control group.

This design allows the researcher to test for significant differences between the control and experimental group after the experimental group has received the treatment. An independent t-test or one-way analysis of variance (ANOVA) may be used to statistically test the null hypothesis that

H0: µ1 = µ2.

In this design there is one independent variable and one dependent variable. When there are 2 levels of the independent variable either a t-test or one-way ANOVA can be used. When there are 3 or more levels of the independent variable then the one-way ANOVA must be used. For example, when there are 3 levels of the independent variable the null hypothesis is:

H0: µ1 = µ2 = µ3.

In this expanded design there is still one independent variable (now with 3 levels) and one dependent variable. The independent variable is still groups or treatment condition and the dependent variable is again the variable under study.
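A sketch of testing H0: µ1 = µ2 = µ3 with a one-way ANOVA, assuming Python with SciPy and hypothetical scores for three randomly formed groups:

```python
from scipy import stats

# Hypothetical scores for three groups (3 levels of the IV).
g1 = [24, 26, 23, 25, 27]
g2 = [30, 31, 29, 32, 30]
g3 = [24, 25, 26, 24, 25]

# One-way ANOVA testing H0: mu1 = mu2 = mu3.
f_stat, p_value = stats.f_oneway(g1, g2, g3)

alpha = 0.05
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```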

Essentially an extension of the randomized-groups design, this design has more than one independent variable and just one dependent variable. This design requires the formation of a group for every combination (of every level) of the two or more independent variables.

This design allows the researcher to test for significant differences as a function of each independent variable separately (main effects) and in combination (interaction). A two-way ANOVA would be used to statistically test the null hypothesis that H0: µ1 = µ2 = µ3 = ... for the first independent variable, that µ1 = µ2 = µ3 = ... for the second independent variable, and that the interaction is not significant.

The ‘jargon’ commonly associated with a factorial design looks like:

a 2X2 ANOVA .....

This is communicating that there are two levels of the first independent variable and two levels of the 2nd independent variable. The language used to talk about the results would be the main effect for the first IV, the main effect for the 2nd IV, and the interaction.

Variation of factorial design: When one or more of the independent variables is a categorical variable, such as gender, where individuals cannot be randomly assigned to the levels, you have a factorial design that no longer qualifies completely as a true experimental design. It is, however, used quite frequently and is quite appropriate when the topic under study calls for the examination of characteristics that people cannot be ‘assigned’ to.

In its simplest form, this design requires the formation of two groups. One group will receive the ‘experimental treatment’; the other will not. The group not receiving the treatment is still referred to as the control group.

Consider a dietary seminar intended to change eating habits particularly with respect to consumption of fat.

Group 1 | Pre Test | Seminar | Post Test |

Group 2 | Pre Test | (no seminar) | Post Test |

In this example there are two independent variables and one dependent variable. In the situation depicted above there are two levels of each independent variable. The first independent variable is group or treatment condition (two levels - experimental/group 1 & control/group 2). The second independent variable is test (two levels - pretest & posttest). The dependent variable is grams of fat consumed.

The repeated measures design is a variation of the completely randomized design though not considered a true experimental design. Instead of using different groups of subjects, only one group of subjects is formed and all subjects are measured/tested multiple times. There is no control group.

This design allows the researcher to test for significant differences produced by the treatment - are the means across repeated measures different. A repeated measures ANOVA is the recommended analytical procedure. With this approach you have one independent variable and one dependent variable.

As an example, assume that a researcher wants to know whether or not mean scores on a measure of exercise satisfaction change depending on the environment runners exercise in. To answer this, the researcher obtains measures of exercise satisfaction from subjects after they run in an urban setting, the countryside, an indoor track, and an outdoor track. The dependent variable is exercise satisfaction and the independent variable is exercise environment.

The major advantage of this design over the completely randomized design is that fewer subjects are required. In addition, very often increased statistical power is gained because the random variability of a single subject from one measure to the next is usually much less than the variability introduced by measuring and comparing different subjects. The major disadvantage is that there may be carry-over effects from one treatment/testing to the next. In addition, subjects might become progressively more proficient at performing the criterion task and show an improvement in performance more attributable to learning than the treatment.

Regardless of the design, tests of significance should be followed by an examination of practical significance.

When interested in the relationship between/among variables, there are no design designations like ‘factorial’. The design in this situation is equated with the analytical technique to be employed. Even without design names, good researchers communicate clearly what the independent and dependent variables were and how the strength of the relationship was tested. In addition, an examination of practical significance is essential.

The null hypothesis under examination with a relationship question is:

H0: ρ = 0

To examine whether or not there is a statistically significant difference in means on some dependent variable (continuous) as a function of some independent variable (categorical) you can use the t-test when you have just two levels of the independent variable (ex: gender)

Independent t-test Statistical Procedure for testing H0: µ1 = µ2 when the two levels of the independent variable are not related.

Dependent t-test Statistical Procedure for testing H0: µ1 = µ2 when the two measures of the dependent variable are related. For example, when one group of subjects is tested twice, the two scores are related.

There are distributional assumptions associated with parametric statistics such as the t-test and ANOVAs. The most basic are:

- Homogeneity of Variance: Are the spread of scores associated with each mean similar
- Normality: Is the shape of the distribution of scores around each mean normal.

Authors should convey to readers the results of checking assumptions. If the assumptions are violated, then the non-parametric equivalent should be used.

Assessing statistical significance. Following analyses using a t-test, you could compare the t statistic to an appropriate table of critical values. Information needed is alpha and df:

n1+n2-2 (independent)

N –1 (dependent)

If the t statistic > critical value, you can reject your null hypothesis. Most frequently, however, authors have used software to give them a p value to compare to the alpha they’ve chosen. If the p value < the alpha, you can reject the null hypothesis. REMEMBER: if multiple tests are done, alpha should be adjusted before the comparison is done.

Note: the p value can be considered the probability that the findings are due to chance (sampling error).
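Both routes to the decision (critical value and p value) can be sketched for an independent t-test, assuming SciPy and hypothetical data:

```python
from scipy import stats

# Hypothetical scores for two independent groups.
g1 = [12, 14, 11, 13, 15, 12]
g2 = [16, 18, 17, 19, 16, 17]

t_stat, p_value = stats.ttest_ind(g1, g2)
df = len(g1) + len(g2) - 2                  # n1 + n2 - 2 (independent)
alpha = 0.05
critical = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value

print(f"t = {t_stat:.2f}, df = {df}, critical = {critical:.2f}, p = {p_value:.4f}")
# Both routes lead to the same decision:
print("Reject H0" if abs(t_stat) > critical else "Fail to reject H0")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```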

Assessing practical significance. Remember, the above ‘test’ tells you whether there's a statistically significant difference, not whether the difference is of any practical importance. Therefore, it's important for authors to take the next step and examine practical significance by calculating a statistic such as omega squared - the proportion of total variance that can be explained by the independent variable. Another useful measure is an effect size.

Effect Size. Infrequently reported, but this statistic is very valuable when it comes to interpreting results. It conveys the size of the effect observed in a way that permits interpretation of the practical significance of the results.

For a differences study:

Interpretation:

.30 Small effect

.50 Moderate effect

.80 Large effect
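One common effect size for a differences study is the standardized mean difference (Cohen's d: mean difference divided by the pooled standard deviation). The notes give only the interpretation thresholds, so the formula and the summary statistics below are supplied here as an illustration:

```python
import math

# Hypothetical summary statistics for two groups.
m1, s1, n1 = 72.0, 8.0, 30
m2, s2, n2 = 66.0, 9.0, 30

# Pooled standard deviation, then the standardized mean difference.
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (m1 - m2) / sp
print(f"d = {d:.2f}")  # by the scale above, a moderate-to-large effect
```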

To examine whether or not there is a statistically significant difference in means on some dependent variable (continuous) as a function of some independent variable (categorical) you can use the F test from an ANOVA table when you have two or more levels of the independent variable (ex: 3 training protocols)

Statistical Procedure for testing H0: µ1 = µ2 = ... when the two or more levels of the independent variable are not related.

There are distributional assumptions associated with parametric statistics such as the t-test and ANOVAs. The most basic are:

- Homogeneity of Variance: Are the spread of scores associated with each mean similar
- Normality: Is the shape of the distribution of scores around each mean normal.

Authors should convey to readers the results of checking assumptions. If the assumptions are violated, then the non-parametric equivalent should be used.

Assessing statistical significance. Following analyses using an F test, you could compare the F statistic to an appropriate table of critical values. Information needed is alpha and df:

K-1; N-K

If your F statistic > critical value, you can reject your null hypothesis. Most frequently, however, authors have used software to give them a p value to compare to the alpha they’ve chosen. If the p value < the alpha, you can reject the null hypothesis. REMEMBER: if multiple tests are done, alpha should be adjusted before the comparison is done.

Assessing practical significance. Remember, the above ‘test’ tells you whether there's a statistically significant difference, not whether the difference is of any practical importance. Therefore, it's important for authors to take the next step and examine practical significance by calculating a statistic such as eta squared - the proportion of total variance that can be explained by the independent variable. Another useful measure is an effect size.
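Eta squared is the ratio of the between-groups sum of squares to the total sum of squares. A plain-Python sketch with hypothetical data:

```python
# Hypothetical scores for three treatment groups.
groups = [
    [24, 26, 23, 25, 27],
    [30, 31, 29, 32, 30],
    [24, 25, 26, 24, 25],
]

all_scores = [s for g in groups for s in g]
grand_mean = sum(all_scores) / len(all_scores)

# Eta squared = SS_between / SS_total: the proportion of total variance
# explained by the independent variable (group membership).
ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

eta_sq = ss_between / ss_total
print(f"eta squared = {eta_sq:.3f}")
```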

Effect Size. Infrequently reported, but this statistic is very valuable when it comes to interpreting results. It conveys the size of the effect observed in a way that permits interpretation of the practical significance of the results.

For a differences study:

You now have two independent variables and one dependent variable. The two-way ANOVA provides information on three null hypotheses:

A difference in the dependent variable due to the 1st independent variable

A difference in the dependent variable due to the 2nd independent variable.

A difference in the dependent variable due to the interaction of the two independent variables.

Assumptions - Homogeneity of Variance, Normality

Assessing Statistical Significance

Take a look at the p values for each of the main effects and the interaction. If the p value < the alpha, you can reject the null hypothesis. REMEMBER: since multiple tests are done, alpha should be divided by 3 before the comparison is done.
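The alpha adjustment can be sketched as follows (the p values are hypothetical):

```python
# Three hypothetical p values: two main effects and the interaction.
alpha = 0.05
p_values = {"main effect IV1": 0.010, "main effect IV2": 0.030, "interaction": 0.200}

# Divide alpha by the number of tests before comparing.
adjusted_alpha = alpha / len(p_values)
for test, p in p_values.items():
    decision = "reject H0" if p < adjusted_alpha else "fail to reject H0"
    print(f"{test}: p = {p:.3f} vs {adjusted_alpha:.4f} -> {decision}")
```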

Assessing practical significance. Remember, the above ‘test’ tells you whether there's a statistically significant difference, not whether the difference is of any practical importance. Therefore, it's important for authors to take the next step and examine practical significance by calculating a statistic such as eta squared - the proportion of total variance that can be explained by the independent variable.

Statistical Procedure for testing H0: µ1 = µ2 = ... when the two or more measures of the dependent variable are related. For example, when one group of subjects is tested two or more times, the two scores are related.

Assumptions:

Repeated Measures at least interval scaled

Sphericity

Assessing statistical significance. Following analyses using an F test, you could compare the F statistic to an appropriate table of critical values. Information needed is alpha and df:

K-1; (K-1)(N-1)

If your F statistic > critical value, you can reject your null hypothesis. Most frequently, however, authors have used software to give them a p value to compare to the alpha they’ve chosen. If the p value < the alpha, you can reject the null hypothesis. REMEMBER: if multiple tests are done, alpha should be adjusted before the comparison is done.

Assessing practical significance. Remember, the above ‘test’ tells you whether there's a statistically significant difference, not whether the difference is of any practical importance. Therefore, it's important for authors to take the next step and examine practical significance by calculating a statistic such as eta squared - the proportion of total variance that can be explained by the independent variable.

Mann Whitney: This statistic is the non-parametric equivalent to the independent t-test. There are no distributional assumptions to meet. This statistic tests for a difference in two medians and should be used when the underlying distribution can be considered continuous.

Wilcoxon: This statistic is the non-parametric equivalent to the dependent t-test. There are no distributional assumptions to meet. This statistic tests for a difference in two related medians and should be used when the underlying distribution can be considered continuous.

Kruskal-Wallis: This statistic is the non-parametric equivalent to the one-way ANOVA. There are no distributional assumptions to meet. This statistic tests for a difference in two or more medians and should be used when the underlying distribution can be considered continuous.
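Usage of the three non-parametric equivalents, assuming SciPy and hypothetical ratings (for the Wilcoxon call, g1 and g2 are reused as related pre/post scores from one group):

```python
from scipy import stats

# Hypothetical skewed ratings.
g1 = [3, 5, 4, 6, 2, 5]
g2 = [7, 8, 6, 9, 7, 8]
g3 = [4, 5, 3, 6, 4, 5]

u_stat, p_mw = stats.mannwhitneyu(g1, g2)   # equivalent of independent t-test
h_stat, p_kw = stats.kruskal(g1, g2, g3)    # equivalent of one-way ANOVA
w_stat, p_w = stats.wilcoxon(g1, g2)        # equivalent of dependent t-test

print(f"Mann-Whitney p = {p_mw:.4f}")
print(f"Kruskal-Wallis p = {p_kw:.4f}")
print(f"Wilcoxon p = {p_w:.4f}")
```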

Assumptions for non-parametric (differences) tests

1. Samples were drawn at random from the population under consideration.

2. Variable(s) under study have underlying continuity.

Statistical Significance: Correlation

Practical Significance: Coefficient of Determination

Pearson Product Moment Correlation. When examining the null hypothesis ρ = 0, it is important to remember that the reliability of the research should be considered. In this setting, this is a matter of considering the reliability of the correlation coefficient. Said another way, the question becomes: if the study is repeated, would the coefficient be similar? The answer rests in an examination of the sample size and the variability of scores.

A restriction in the range of scores (sampling; subgroups) can drastically affect the correlation coefficient. Interpretation must take into consideration the variability of the scores.

Assumptions:

Linearity: a straight line can be drawn through the points on a scatterplot

Data for both x and y at least interval scaled

Assessing statistical significance. Following analyses using a PPMC, you could compare the PPMC statistic to an appropriate table of critical values. Information needed is alpha and df:

N-2

If the PPMC statistic > critical value, you can reject your null hypothesis. Most frequently, however, authors have used software to give them a p value to compare to the alpha they’ve chosen. If the p value < the alpha, you can reject the null hypothesis. REMEMBER: if multiple tests are done, alpha should be adjusted before the comparison is done.

Assessing practical significance. Remember, the above ‘test’ tells you whether there's a statistically significant relationship, not whether the relationship is of any practical importance. Therefore, it's important for authors to take the next step and examine practical significance by calculating a statistic such as the coefficient of determination - r2 - the proportion of total variance that can be explained by the independent variable.

This is the most common approach to prediction problems when you have one dependent variable and multiple independent variables.

When used as a data reduction tool, the process can be viewed as a step by step consideration of which variables in combination with each other are most strongly correlated with the dependent variable.

Assumptions - Regression

Linearity: a straight line can be drawn through the points on a scatterplot

Homoscedasticity: Y values at each x similar in variability

Dependent variable at least interval scaled

Multicollinearity: relationship among the independent variables

Note: The distributional assumptions are likely to be violated when:

1. N small

2. Growth is present. Variance tends to increase with age.

3. Observations/trials truncated or insufficient practice given. Pattern may be curvilinear.

Hypothesis testing for significant regression

H0: b = 0

Assessing statistical significance. Following analyses based on the analysis of variance procedure, you could compare the F statistic to an appropriate table of critical values. Information needed is alpha and df:

K; N-K-1

If the F statistic > critical value, you can reject your null hypothesis. Most frequently, however, authors have used software to give them a p value to compare to the alpha they’ve chosen. If the p value < the alpha, you can reject the null hypothesis. REMEMBER: if multiple tests are done, alpha should be adjusted before the comparison is done.

Assessing practical significance

Remember, the above ‘test’ tells you whether there's a statistically significant relationship, not whether the relationship is of any practical importance. Therefore, it's important for authors to take the next step and examine practical significance by calculating a statistic such as the coefficient of determination - r2 - the proportion of total variance that can be explained by the independent variable(s). Another useful measure is the effect size.
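For the simple one-predictor case, testing H0: b = 0 and reporting r2 can be sketched with SciPy's linregress (hypothetical data; multiple regression would require a different routine):

```python
from scipy import stats

# Hypothetical predictor (x) and dependent variable (y).
x = [2, 4, 6, 8, 10, 12, 14, 16]
y = [5, 9, 11, 16, 19, 24, 26, 31]

result = stats.linregress(x, y)
r_squared = result.rvalue ** 2   # coefficient of determination

alpha = 0.05
print(f"b = {result.slope:.2f}, p = {result.pvalue:.4f}, r^2 = {r_squared:.3f}")
print("Reject H0: b = 0" if result.pvalue < alpha else "Fail to reject H0")
```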

Relationships - Non-parametric

Statistical Significance: Chi-squared

Practical Significance: Cramer’s V & Phi

The statistic that will test for the presence of a relationship between two categorical variables (though it can also be used on ordinal data) is the chi-square statistic. The null hypothesis under examination is:

ρxy = 0

This is read as: the correlation between x and y is zero. Another way to say this is that the variables x and y are independent. In fact the χ2 statistic is commonly referred to as the chi square test of independence.

Assumptions

The expected frequency in all cells is at least 5.

Data must be random samples from multinomial distributions.

Assessing statistical significance. Following analyses using the χ2 statistic, you could compare the χ2 statistic to an appropriate table of critical values. Information needed is alpha and df:

df = (R-1)(C-1)

Where R = # of rows, and C = # of columns in a cross-tabulation table.

If the χ2 statistic > critical value, you can reject your null hypothesis. Most frequently, however, authors have used software to give them a p value to compare to the alpha they’ve chosen. If the p value < the alpha, you can reject the null hypothesis. REMEMBER: if multiple tests are done, alpha should be adjusted before the comparison is done.

Assessing practical significance. Remember, the above ‘test’ tells you whether there's a statistically significant relationship, not whether the relationship is of any practical importance. Therefore, it's important for authors to take the next step and examine practical significance by calculating an effect size statistic such as Phi or Cramer’s V.

For a relationship study, the effect size is the correlation coefficient (Phi, Cramer’s V). These statistics convey the strength of the relationship between the two categorical variables. Interpretation:

.30 Small

.50 Moderate

.80 Large
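The chi-square test of independence followed by Cramer's V can be sketched with SciPy (the 2x2 table is hypothetical; for a 2x2 table Cramer's V equals Phi):

```python
import math
from scipy.stats import chi2_contingency

# Hypothetical 2x2 cross-tabulation of two categorical variables.
table = [[30, 10],
         [15, 35]]

chi2, p, dof, expected = chi2_contingency(table)

# Cramer's V: effect size for the strength of the relationship
# between the two categorical variables.
n = sum(sum(row) for row in table)
k = min(len(table), len(table[0]))
cramers_v = math.sqrt(chi2 / (n * (k - 1)))

print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}, V = {cramers_v:.2f}")
```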

Best to use descriptive statistics to examine related questions so as not to diminish power.

Selection of Descriptive Statistics to Summarize Data

Level of Measurement | Applicable Statistics |

Nominal/Categorical | Percentages, Mode |

Ordinal | Percentages, Mode, Median* |

Interval | Mean, Median, Mode, Standard Deviation, Range, Percentiles, Z scores, Correlation |

Ratio | Mean, Median, Mode, Standard Deviation, Range, Percentiles, Z scores, Correlation |

*Note: Use of the median for ordinal data should be applied only in situations where the underlying variable can be considered continuous and the numbers do not simply represent a few discrete categories.

Summarizing Data

- Descriptive Statistics
- Frequency Distribution Tables - Percentages
- Crosstabulation Tables - Percentages
- Central Tendency - Mean, Median, Mode
- Variability - Standard Deviation, Range
- Correlation

- Graphs
- Continuous data: Frequency polygon or Histogram

- Discrete data: Bar Chart; Pie Chart

General Principles

3/4 Rule

Label axes for correct interpretation

Begin the vertical axis with the value zero.

To show trend, several points along the way have to be depicted if interpretation is to be sound

Only depict one aspect of a problem

Should not employ cumulative charts

- Inferential Statistics
- Differences - parametric
- t-test - Independent & Dependent; omega squared
- ANOVA - one-way, two-way, repeated measures; eta squared
- Assumptions - Homogeneity of Variance, Normality

- Differences - Non-parametric
- Mann Whitney
- Kruskal-Wallis
- Wilcoxon
- No distributional Assumptions

- Relationships - Parametric
- Correlation; Coefficient of Determination
- Regression; Coefficient of Determination
- Assumptions - homoscedasticity, linearity; multicollinearity

- Relationships - Non-parametric
- Chi-squared; Cramer’s V & Phi
- No distributional Assumptions

Practical Significance

- Effect Size
- Coefficient of Determination
- Omega Squared
- Eta Squared

Distinction to reinforce: Continuous variables are ones at least interval scaled - call for use of parametric statistics. Discrete variables are ones that are categorical or ordinal in nature and call for use of non-parametric statistics.

Depending on level of measurement, statistical testing is an appropriate process for examining the main research question.

Null Hypothesis for differences: H0: µ1 = µ2

Null Hypothesis for Relationships: H0: ρ = 0

The decision as to whether or not a parametric test is appropriate is tied to whether or not the assumptions are met.

Work is not complete until an assessment of practical significance is done (minimally, an effect size).

Relationships: Coefficient of Determination

Differences: Eta squared (or omega squared)

Structure/Content: Entire text should be cohesive and follow a logical path that generates confidence in the findings. The following is one recommended structure. The order may vary, however the content should be present and should match the information conveyed in the analysis portion(s) of the methods section.

- Descriptive information [Note: Where relevant, tables should be used to summarize large volumes of data and text should highlight important elements]
- On sample
- Should summarize the personal and demographic information that helps the reader understand the nature of the research participants

- On relevant variables
- Should summarize pertinent variables that show interesting patterns and/or provide insight into the main question.

- On variables by subgroup(s)
- Should convey additional insights that contribute to an understanding of the nature of the research participants and/or the problem under examination.

- Psychometric properties of data
- As appropriate, results of examinations of objectivity, reliability, and validity of data should be provided.

- Analyses related to main problem
- Should provide concise reporting of results from check of assumptions
- Should provide clear results of hypothesis testing including:
- Statistic
- Degrees of freedom
- P value
- Table (e.g. ANOVA) when appropriate

- Examination of practical significance
- Analyses connected with related problems
- Statistical examination - not recommended
- Note: Remember that each statistical test requires that you divide your alpha to adjust for the increased chance of making a type I error. Therefore, it is wise to limit the number of statistical tests conducted.
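The alpha-splitting in the note above is the Bonferroni adjustment. A minimal illustration (the five-test count is arbitrary), including the family-wise type I error rate you would face with no adjustment:

```python
# Bonferroni: divide alpha by the number of tests so the family-wise
# type I error rate stays at roughly the original alpha.
alpha = 0.05
n_tests = 5

alpha_per_test = alpha / n_tests

# Family-wise error rate with NO adjustment, assuming independent tests:
# the chance of at least one false rejection across all five tests.
familywise_unadjusted = 1 - (1 - alpha) ** n_tests
```

With five unadjusted tests the chance of at least one type I error is already over 20%, which is why limiting the number of tests (or adjusting alpha) matters.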

- Descriptive Statistics
- Should directly provide information pertaining to related question
- Statistics reported must be appropriate for the data type summarized.

A tight summary of purpose and results should be included.

Methodological limitations should be clarified.

Results should be cast in light of the literature cited in the introduction/review of literature.

New literature should not be brought into the discussion.

Implications and/or recommended action in light of findings should be drawn out in this section.

Next steps likely to extend or clarify research presented should be suggested.

Any speculations must be clearly identified as such. There should be no doubt as to whether the discussion of results is data based or conjecture.

Hypothesis testing involves examination of a statistically expressed hypothesis. The statistical expression is referred to as the null hypothesis. It is called null because the expression when completed implies no difference or relationship depending on the problem being examined.

You can think of hypothesis testing as trying to see if your results are unusual enough so that they would not even be expected by chance.

Key Elements

- Type I error = Incorrectly deciding to reject the null hypothesis; that is, rejecting a true null hypothesis.
- Type II error = Incorrectly deciding not to reject the null hypothesis; that is, failing to reject a false null hypothesis.
- α = The level of risk an experimenter is willing to take of rejecting a true null hypothesis. Often called the level of significance, this value is used in establishing a critical value around which decisions (reject or not reject null) are made. It is also common to define alpha as the probability of incorrectly rejecting a true null hypothesis or the probability of making a type I error.
- β = The level of risk (not under direct control of an experimenter) of failing to reject a false null hypothesis. It is also common to define beta as the probability of making a type II error. Power = 1 - β. The probability of correctly rejecting a false null hypothesis.

- Effect Size = Magnitude of the effect of an independent variable on a dependent variable. For a differences study this is conveyed as the difference between means in standard deviation units. For a relationships study, the effect size is the correlation coefficient.
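The standardized mean difference described above is commonly computed as Cohen's d. A sketch assuming NumPy, with made-up scores:

```python
import numpy as np

def cohens_d(a, b):
    """Difference between means in pooled standard deviation units."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical scores for two groups.
group1 = [52, 55, 58, 61, 49, 54]
group2 = [48, 50, 47, 52, 45, 51]
d = cohens_d(group1, group2)
```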

Interpretation guidelines for a standardized mean difference (Cohen's conventions):

.20 small

.50 moderate

.80 large

It is important to recognize that the term effect size can refer to many different measures, each with its own interpretation guidelines. So, in written work always specify the value reported (e.g. eta squared) rather than just 'effect size'.

In practice you will never know whether or not you've made a poor decision (a type I or type II error), but you can (a) set the probability of a type I error when you select your alpha, and (b) determine beta (by estimating power) to estimate the probability of a type II error. Note: since sample size is directly related to power (and so tied to beta), studies with small samples will fail to find statistically significant results even when true differences or relationships exist.
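The fixed type I error rate can be seen by simulation: draw both groups from the same population (so the null is true) and count how often the test rejects anyway. A sketch assuming SciPy; sample sizes, the seed, and the simulation count are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_sims = 2000

# Simulate many studies in which the null hypothesis is true
# (both samples come from the same population).
false_rejections = 0
for _ in range(n_sims):
    a = rng.normal(size=25)
    b = rng.normal(size=25)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_rejections += 1

# Proportion of wrongful rejections; should land near alpha.
type_i_rate = false_rejections / n_sims
```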

Power - is the probability of correctly rejecting a false null hypothesis.

Ideally, power should be considered when planning a study, not after it is over. Knowing what you would like power to be you can determine (using software or power charts) what your sample size should be.

If power is not considered at the start of a study it should be estimated at the end, particularly when non-significant results arise.

Sample size is closely tied to power. True differences/relationships go unnoticed without enough subjects. On the other hand, trivial differences/relationships can be statistically significant with large sample sizes.

Another factor affecting power is measurement precision. As precision increases, power increases.

- If you decrease alpha (more stringent), power will decrease, so beta will increase.

- As you increase sample size, power increases, so beta decreases.

- As you enhance measurement precision, error variance shrinks, so beta decreases and power increases (alpha stays wherever you set it).

- As effect size increases, beta decreases, so power increases.
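The link between sample size and power can be verified with a small Monte Carlo estimate. A sketch assuming SciPy; the effect size, group sizes, seed, and simulation count are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def estimate_power(n, effect=0.5, alpha=0.05, sims=1000):
    """Monte Carlo power estimate for a two-sample t test:
    proportion of simulated studies that correctly reject a false null."""
    rejections = 0
    for _ in range(sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)  # true difference exists
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / sims

power_small = estimate_power(n=20)
power_large = estimate_power(n=80)  # larger n -> higher power
```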

[Figures omitted: null and alternative distributions for a two-tailed test under four conditions - alpha = .05 (baseline); alpha = .05 with increased effect size; alpha = .01 (more stringent); alpha = .05 with a large sample.]

To determine sample size in the context of power, or to determine power at the completion of a study, use the G*Power software.
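If G*Power is not at hand, an approximate per-group sample size for a two-group comparison can be computed from the normal approximation. A sketch assuming SciPy; exact tools such as G*Power give slightly different numbers because they use the noncentral t distribution:

```python
from scipy.stats import norm

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample, two-tailed t test:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

# Medium standardized effect (d = .50), alpha = .05, power = .80:
n_needed = sample_size_per_group(0.5)  # roughly 63 per group after rounding up
```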

**Hypothesis testing: Assessing Statistical Significance**

To determine whether or not you have a statistically significant finding, you either:

- Compare the test statistic to a critical value, or

- Compare the p value (the probability of obtaining a result at least this extreme by chance if the null is true) to the alpha set by the study.
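The two decision rules are equivalent: the test statistic exceeds the critical value exactly when the p value falls below alpha. A sketch assuming SciPy, with a hypothetical observed t statistic and degrees of freedom:

```python
from scipy import stats

alpha = 0.05
df = 58                  # e.g. two groups of 30: n1 + n2 - 2
t_observed = 2.30        # hypothetical test statistic

# Rule 1: compare the statistic to the two-tailed critical value.
t_critical = stats.t.ppf(1 - alpha / 2, df)
reject_by_critical = abs(t_observed) > t_critical

# Rule 2: compare the p value to alpha.
p_value = 2 * stats.t.sf(abs(t_observed), df)
reject_by_p = p_value < alpha

# Both rules always reach the same decision.
```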

A common misconception is that failure to reject the null is evidence that the null is true. This is simply not the case - hypothesis testing provides no statistical evidence in support of the null. In fact, even with power = .80, beta is .20, which means a 20% chance of making a type II error.