by Lawrence W. Sherman and Denise Gottfredson

The challenge of this review is to make a substantial contribution to science in a very short time frame. The US Congress only allowed the Attorney General nine months for a comprehensive scientific review of crime prevention programs. While the time frame may be appropriate for the legislative process, it is virtually unprecedented in science. Similar reviews performed for the Congress by panels of the National Academy of Sciences typically last several years, and even those are usually much narrower in the scope of the research questions considered. Few scientists would expect a team of senior scholars and graduate students to even locate copies of all relevant evaluations of crime prevention programs in the available time frame, let alone to code them carefully and reliably for the purpose of a systematic review.

Our approach strikes a compromise between breadth and depth, without any compromise in scientific integrity. It attempts to rely as much as possible on other recently completed reviews of the literature we find generally reliable. Given the limited time to undertake intensive review of primary evaluation research, we reserve that method for only the highest priority program areas. This appendix provides a rationale for that strategy and the criteria employed for setting the priorities.

This review makes hard choices at four levels of analysis. One is the level of institutional settings, the rationale for which is described in Chapter Two. The next level is the choice of high and medium priority program areas of crime prevention within each setting. The third level, found within each high priority program area, is the kinds of evidence most worth relying on in assessing the effectiveness of that kind of program. The fourth level, found within each medium priority program, is how much to rely on secondary reviews of the literature, in which the federal government has recently invested millions of dollars. What follows next describes the process and criteria for making the last three of those choices. Figure 1 provides a flow chart illustrating the various steps in the process which lead us to our conclusions and recommendations in this review.

Figure 1

Diagram of Assessment Process


Sorting Programs and Practices

Our standard method for each chapter begins with a review of the literature, culminating in a list of all categories of crime prevention programs and practices in that institutional setting. We annotate each category on the list with definitions and illustrations. We circulate the list and its annotations among colleagues in our own department, and nationally-known experts chosen for the purpose. Based on their feedback, we revise the categories or definitions.

Unit of Analysis. Note that the basic unit of analysis we employ throughout is the category of program or practice that is conceptually coherent and subject to evaluation. Basic institutional "practices" such as police patrols, and innovative "programs" experimentally added to institutional practice such as home nurse visitation, can both be defined as analytic categories of factors potentially affecting crime prevention. The list of such categories need not distinguish programs and practices, since the two are often merged in actual practice (as in the COPS program). What our method attempts is a theoretically clear definition of the independent variable hypothesized to produce crime prevention consequences.

Evidence and Resources. In the process of constructing the list of program and practice categories, we develop a rough estimate of both the volume and quality of the scientific evidence on the effectiveness of that category. We also acquire some data on the level of resources that government and private sources, especially the Office of Justice Programs, are investing in programs in each category.

Based on these data, we sort categories according to the available evaluation evidence. If evidence is available, we retain the category for potential discussion and analysis in the chapter. For programs and practices with no available evidence, we sort on the basis of public (or in some cases private) resources expended on that category. If the resources are substantial, we retain the category for potential discussion in the chapter. Categories with neither evidence available nor substantial resources are set aside at that point. For the categories with evidence, regardless of resource levels, we engage in one process. For programs without evidence, we engage in another.

Categories With Evidence

The first step for categories with evidence is to consult recent secondary reviews wherever possible. The reviews themselves can be evaluated on the basis of their reliability, based on such criteria as the level of detail they provide and the completeness and accuracy of the discussion of primary sources known to our review team. If a secondary review is deemed reliable, we rely on it for a determination of the strength of the evidence about the category. If there is no secondary review available for a category, we conduct a primary evidence analysis.

Primary Evidence Analysis. For the selected categories subjected to full assessment of primary evidence, the chapter research team identifies every primary evaluation it can locate from all sources. These evaluations are reviewed for a basic in-versus-out decision. The "in" decision is based solely on whether the study reports data on outcomes measuring crime or risk and protective factors in relation to the program being evaluated. Studies which contain only process measures are excluded from any data collection other than citation data and the reason for exclusion. Studies which include outcome measures are coded using an instrument (shown at the end of this appendix) adapted for this study from an instrument designed for the National Structured Evaluation of Drug Abuse Programs, which previous research has found to produce acceptable inter-rater reliability (Center for Substance Abuse, 1995). A codesheet for each study is prepared by at least one graduate student and reviewed by a faculty member. An overall rating of methodological rigor for each study is obtained from item #8 in Section II of this codesheet.
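Agreement between a graduate student's codes and the faculty reviewer's can be quantified with a chance-corrected statistic. The sketch below uses Cohen's kappa purely as an illustration; the instrument's published reliability analysis may well have used a different measure, and the rater data here are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical codes
    (e.g., the 1-5 methodological rigor ratings from Section II, item 8)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters coded independently.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical rigor scores for ten studies, coded by two raters:
student = [3, 4, 2, 5, 3, 3, 1, 4, 5, 2]
faculty = [3, 4, 2, 5, 3, 4, 1, 4, 5, 2]
print(round(cohens_kappa(student, faculty), 3))
```

Values near 1 indicate near-perfect agreement; values near 0 indicate agreement no better than chance.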

Standard Measures of Effect Size. Given the variation in the reporting of effects discussed above, our method does not impose a standard definition of effect size across all analyses. Some chapters report effects in terms of statistical significance as well as percentage differences or percentage reductions in rates of offending. Chapters for which measures of criminal offending are rare (e.g., the school chapter) attempt to translate effects for the array of possible outcomes into standardized effect sizes using, to the extent possible, Cohen's d, defined as the treatment group mean minus the control group mean, divided by the pooled groups' standard deviation. Information on both the statistical significance and the magnitude of the effects is coded for each study using Section III of the codesheet. The decision about which metric to use for reporting the magnitude of the effect is left to the senior author of each chapter.
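The Cohen's d computation just defined can be sketched directly from the definition. This is a minimal illustration with made-up offense counts; the chapters' actual computations may differ in detail (e.g., in how the pooled standard deviation is estimated).

```python
import math

def cohens_d(treatment, control):
    """Treatment group mean minus control group mean,
    divided by the pooled standard deviation of the two groups."""
    n1, n2 = len(treatment), len(control)
    m1, m2 = sum(treatment) / n1, sum(control) / n2
    # Sample variances (n-1 denominator), then the pooled SD.
    var1 = sum((x - m1) ** 2 for x in treatment) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in control) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical offenses per subject; a negative d means the treatment
# group showed less of the problem behavior than the control group.
treatment = [2, 3, 1, 2, 4, 2]
control = [4, 5, 3, 4, 6, 4]
print(round(cohens_d(treatment, control), 2))
```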

Integration of Evidence. The end product of the analysis of empirical evidence contains a range of findings with respect to effectiveness. In the interests of clarity of presentation for policy analysis purposes, we organize the presentation of material in each chapter by the content of the findings, rather than the priority of the program. The content is defined by both the strength of the scientific evidence and the strength (and direction) of the program effects. We ultimately report on four categories of effectiveness.

Program categories are sorted into these effectiveness categories using the following rules:

Works (1): At least two studies with methodological rigor greater than or equal to "3" reporting significance tests have found crime prevention effects for the program condition, and where effect sizes are available, the effect is at least one-tenth of one standard deviation (e.g., effect size = .1) better than the effects for the control condition, and the preponderance of the evidence supports the same conclusion.

Doesn't Work (2): At least two studies with methodological rigor greater than or equal to "3" reporting significance tests have found no effect favoring the program condition, and the preponderance of other evidence supports the same conclusion.

Promising (3): At least one study with methodological rigor greater than or equal to "3" reporting conventional significance levels has found crime prevention effects for the program condition, and where the effect size is available, the effects are at least one-tenth of one standard deviation better than the effects for the control condition OR the preponderance of evidence favors the program.

Don't Know (4): Categories with empirical evidence which do not fit one of the above are included in this residual category.
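The four-way sort above can be expressed as a decision rule. The sketch below is our own formalization of the criteria, not code from the review; the study fields (`rigor`, `significant_benefit`, `effect_size`) and the `preponderance` argument are hypothetical names standing in for the codesheet items and for the chapter author's judgment of the remaining evidence.

```python
def classify(studies, preponderance):
    """Sort a program category into one of the four effectiveness labels.

    Each study is a dict with:
      rigor               - methodological rigor score (1-5)
      significant_benefit - True if a significance test favored the program
      effect_size         - Cohen's d, or None if unavailable
    preponderance: 'favors', 'against', or 'mixed' -- the reviewer's
    reading of the rest of the evidence."""

    def supportive(s):  # rigorous study showing a crime prevention effect
        return (s["rigor"] >= 3 and s["significant_benefit"]
                and (s["effect_size"] is None or s["effect_size"] >= 0.1))

    def null(s):  # rigorous study showing no effect favoring the program
        return s["rigor"] >= 3 and not s["significant_benefit"]

    n_support = sum(supportive(s) for s in studies)
    n_null = sum(null(s) for s in studies)
    if n_support >= 2 and preponderance == "favors":
        return "Works"
    if n_null >= 2 and preponderance != "favors":
        return "Doesn't Work"
    if n_support >= 1 or preponderance == "favors":
        return "Promising"
    return "Don't Know"

studies = [
    {"rigor": 4, "significant_benefit": True, "effect_size": 0.25},
    {"rigor": 3, "significant_benefit": True, "effect_size": None},
    {"rigor": 2, "significant_benefit": False, "effect_size": None},
]
print(classify(studies, "favors"))
```

Note that the text leaves some ambiguity (e.g., exactly how the "Promising" effect-size clause combines with the preponderance clause); this sketch resolves it one plausible way.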

We must, of course, be extremely careful in labeling any program category as highly certain to be good or bad in its effects. Yet we must also be clear enough to make our conclusions useful, no matter how much we anticipate that science is always provisional and that new research may soon revise our conclusions. The large number of programs to be reviewed almost guarantees that some have strong evidence of extreme effects, both positive and negative. Yet most extreme results are from single, unreplicated studies. It is just as important not to conclude too much from a single negative result as from a single positive one. A single evaluation, even with strong evidence, cannot be assumed to generalize to all or most other settings. The primary objective of identifying promising results should be to foster replications, and the more promising the results, the greater the need for replication. Where there is substantial consistency of evidence in one direction, the senior author of each chapter makes a judgment and defends it in the text.

The code sheet employed in primary coding is attached. In cases of secondary analysis, effect sizes have been estimated in some chapters where sufficient detail about research design has been provided in the secondary review. This method is not without some risk of error, but it is more informative, given time constraints, than the alternative of not ranking studies reported in detail in secondary reviews. Where the level of detail reported in secondary reviews about the primary source research design is too low, the studies are reported as unranked on the scientific methods score.

Categories Without Evidence

For categories without evidence, the chapter conclusions reflect some theoretically based assessments. The report makes less use of such assessments than it might, given the concern for potential bias in assessments. Where theoretical rationales are embodied in conclusions, they are usually well-supported by empirical evidence, such as the concentration of serious youth violence in urban areas of concentrated poverty.

Code Book for Methodological Rigor and Effect Size Computation

At least one code sheet is filled out for each study. This scheme assumes the existence of a database for each setting containing the identifying information for the publication and the codes for program category. The document number will be used to merge information from this codesheet with the existing information.

I. Identifying Information and Funding Source

1. Document #_________ [Note: document numbers are four-digit numbers, beginning with the setting code in #3 below]

2. First author last name: _____________________________

3. Institutional setting (circle one):

(1) Family

(2) School

(3) Community

(4) Labor Market

(5) Place

(6) Police

(7) Courts/Corrections

4. Type of publication (circle one):

(1) Peer-Reviewed Journal

(2) Other publication (e.g., book; book chapter; published gov't report)

(3) Unpublished (e.g., technical report, convention paper)

5. Type of funding (circle all that apply -- usually found in the acknowledgment note)

(1) OJP (Office of Justice Programs, including National Institute of Justice; Bureau of Justice Statistics; Bureau of Justice Assistance; Office of Juvenile Justice and Delinquency Prevention; Office for Victims of Crime)

(2) Department of Education (including National Institute of Education; Office of Educational Research and Improvement; Office of Elementary and Secondary Education)

(3) Department of Health and Human Services (including National Institute on Drug Abuse; National Institute on Alcohol Abuse and Alcoholism; National Institute of Mental Health; Center for Substance Abuse Prevention; Center for Substance Abuse Treatment; Centers for Disease Control and Prevention; Administration for Children, Youth and Families)

(4) Other Federal (e.g., Corporation for National Service; Housing and Urban Development; U.S. Dept. of Labor; Dept. of the Interior; Office of National Drug Control Policy)

(5) Other

(6) Not specified

6. Identification of MODULES for coding: Number of different "modules" included in report. _______

Some studies include sub-studies for which evaluation results are reported separately. These sub-studies often use different methodologies and must be summarized separately. A "module" is defined as a study component for which a distinct treatment group is identified and separate evaluation results are reported. If a study reports on four different trials of an intervention, each trial having a different treatment group, the study has four modules. If a study includes several different program activities but these are all applied to one group of individuals, the study has only one module. If the study identifies two different groups, one of which receives certain program components and the other of which receives different components, AND the results for each of the two groups are reported separately as though they were two different programs, the study has two modules. An example of the latter is when an evaluation of a school-level intervention involving one set of interventions is reported in the same article as an evaluation of a different set of interventions targeting high-risk youths in the same schools. If more than one treatment group is involved, more than one module will generally be involved unless the study uses a factorial design.

Code remaining questions for each module.

II. Methodological Rigor

1. Sample size

Fill in the number of cases for each separate unit of analysis used in the module. Report only for those units for which separate analysis is conducted. Report the sum of the treatment and comparison units actually analyzed. If n's differ for different analyses, record the range of n's used:





blocks, cities, states, or other geographical units________


other collectivity (specify)__________

2. Presence of comparison group(s)

1=No comparison group present

2=Separate comparison group present, but non-randomly constituted and limited (e.g., only demographic variables) or no information on pre-treatment equivalence of groups

3=Separate comparison group present but non-randomly constituted; extensive information provided on pre-treatment equivalence of groups; obvious group differences on important variables

4=Separate comparison group present; extensive information provided on pre-treatment equivalence of groups; only minor group differences evident

5=Random assignment to comparison and treatment groups; differences between groups are not greater than expected by chance; units for random assignment match units for analysis

Note: Sometimes random assignment takes place at a different level than the analysis. For example, classes or schools are randomly assigned to conditions, but students are the unit of analysis. These cases should not be treated as random assignments.

3. Use of control variables to account for initial group differences

1=No use of control variables to adjust for initial group differences

3=Control variables used, but many possible relevant differences uncontrolled

5=Most relevant initial differences (e.g., differences on a pre-treatment measure of the dependent variable or variables highly associated with the dependent variable) between groups controlled statistically OR random assignment to groups resulted in no initial differences.

4. Variable measurement

1=No systematic reproducible approach to variable measurement is employed

2=No indication of how study variables were constructed or obtained

3=Some attention to constructing or obtaining high quality measures, but reliability not demonstrated

4=Variables developed or selected with some consideration of use in prior studies and reliability of measurement; reliability reported; not all measures demonstrated to be reliable

5= Careful selection of relevant variables considering their prior use and reliability demonstrated for all or most of the measures

5. Control for effects of attrition from study

1=Attrition from treatment or control group is greater than 50% and no attempt is made to determine the effects of attrition on the outcome measures.

2=No accounting given of cases that dropped out of study or attrition from treatment or control group is moderate and no attempt is made to determine the effects of attrition on the outcome measures.

3=Differences between study participants (both treatment and comparison) who were present at the pre-test and absent at the post-test are identified and discussed.

4=Differences between study participants (both treatment and comparison) who were present at the pre-test and absent at the post-test are identified and discussed; possible differential attrition between treatment and comparison groups is discussed.

5=Careful statistical controls for the effects of attrition are employed, or attrition is shown to be minimal; threat of differential attrition for treatment and comparison groups is addressed adequately.

Note: Attrition is loss from the initial sample or population identified as the treatment group or the comparison group. Sometimes attrition occurs even before a pre-test is administered.

6. Post-treatment measurement period

Length of time from end of treatment to last follow-up (in months)______

7. Use of statistical significance tests

0=No statistical tests or effect sizes

1=Statistical tests used or effect sizes computed

8. Overall evaluation methodology

1=No reliance or confidence should be placed on the results of this evaluation because of the number and type of serious shortcoming(s) in the methodology employed

3=Methodology rigorous in some respects, weak in others

5=Methodology rigorous in almost all respects

Note: Key elements in your rating of overall methodology should be:

Control of extraneous variables: Have the influences of independent variables extraneous to the purposes of the study been minimized (usually through random assignment to conditions, matching treatment and comparison groups carefully, or statistically controlling for extraneous variables)?

Minimization of error variance: Are the measures relatively free of error?

Sufficiency of power to detect meaningful differences: The power of a test is the probability that a false null hypothesis will be rejected in favor of a true alternative hypothesis. The smaller the anticipated effects of the intervention, the larger the sample size must be in order to decrease the sampling variation relative to the size of the effect to be detected.
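To give a rough sense of the sample sizes this implies, power for a two-sided two-sample comparison of means can be approximated with a normal approximation. This sketch is ours, not a formula from the report; `power_two_group` is a hypothetical helper name.

```python
from statistics import NormalDist

def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for a
    standardized mean difference d with n subjects per group."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standard error of d is roughly sqrt(2/n), so the noncentrality
    # parameter is d * sqrt(n/2).
    ncp = d * (n_per_group / 2) ** 0.5
    # The opposite-tail rejection probability is negligible for d > 0.
    return z.cdf(ncp - z_crit)

# The d = .1 threshold used in the "Works" rule requires very large
# samples: 100 per group is badly underpowered, 1600 per group is not.
print(round(power_two_group(0.1, 100), 2))
print(round(power_two_group(0.1, 1600), 2))
```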

III. Identification of Outcome Measures

Complete this section only if at least one comparison is available in the study (e.g., treatment vs. control or pre-post)

Check here if no comparison is available_____

1. Level of criminal involvement of targeted population (circle one):

(A) General population, unspecified criminal involvement

(B) High-risk population (but not known to be involved in criminal justice system)

(C) Known to be not yet involved in criminal justice system

(D) Known to be involved in criminal justice system

2. Measure of Problem Behavior (circle all that apply):

(A) Crime, delinquency, theft, violence, illegal acts of aggression

(1) participation (e.g., yes/no for a specific period; proportion of population arrested in a specific period)

(2) frequency (e.g., number of crimes for a specific period; number of arrests per unit population for a specific period)

(3) seriousness level of crime (e.g., a scale ranging from status offense to murder, rape, and arson)

(4) variety (e.g., number of different crimes admitted)

(B) Alcohol and other drug use

(1) participation (e.g., yes/no for a specific period; proportion of population used in a specific period)

(2) frequency (e.g., number of times used in a specific period; number of uses per unit population for a specific period)

(3) variety (number of different substances used)

(C) Victimization experience

(1) participation (e.g., yes/no experienced victimization in specific time frame; number of victims per unit population in specific time frame)

(2) frequency (e.g., number of times victimized in specific time period; number of victimizations per unit population in specific time period)

(3) seriousness (e.g., harm done)

(4) variety (number of different victimization experiences)

(D) Rebellious behavior, anti-social behavior, aggressive behavior, defiance of authority, disrespect for others (include suspension and expulsion unless it is specifically for behaviors in categories A or B)

(E) General problem behavior (combination of different behaviors above)

3. Risk or protective factors examined in study. Circle all risk/protective factors measured as outcomes of the intervention.


(A) Employment

(B) School dropout

(C) Truancy or school tardiness

(D) Association with delinquent peers

(E) School academic performance (e.g., grade promotion, school grades, academic achievement test scores, schoolwork or homework completion)

(F) Educational attainment (except dropout by persons required by law to attend school)

(G) Disposition to self-control, impulsiveness, or recklessness

(H) Social competencies or skills

(I) Conscientiousness, belief in conventional rules or moral character, dutifulness

(J) Intentions to engage in or abstain from ATOD use, delinquent behavior, crime

(K) Commitment to education or work

(L) Caring about/Attachment to school, work, or prosocial others


(M) Parental supervision

(N) Family or parental behavior management practices

Community or School Environment:

(O) Rules, norms, expectations for behavior

(P) Availability of weapons or drugs

(Q) Community or organizational capacity for self-management (e.g., morale, leadership)

(R) Community unemployment rate

(S) Community disorganization (e.g., divorce rate, female-headed households)

IV. Program Effects and Effect Size Computation

Choose up to three outcome measures from III.2 above for summary of program effects and effect size computation. Base your selection on the following criteria: (a) the measure is used in all or most of the studies in a category (facilitating comparison of outcomes within the category) and (b) reporting of statistics for the outcome is extensive. Fill in the numbers and letters from III.2 above as well as the exact names of the measures:

A) ________ _________________________________________________

B) ________ _________________________________________________

C) ________ _________________________________________________

Outcome Measure _____ (A,B, or C):

Answer all questions for which relevant statistics are provided in the study. Fill in "NA" if statistic is not available.

1. Base rate: This is the level of the problem behavior for the treatment group prior to the treatment, or for the comparison group.

___ Pre-treatment for treatment group:


Standard deviation_________

Time period covered, in months (e.g., 12 months, 24 months)___________

___ Comparison group:


Standard deviation_________

Time period covered, in months (e.g., 12 months, 24 months)___________

2. Post-treatment measurement period. This is the time elapsed from pre-test or beginning of intervention to post-test measurement (in months):________

3. Effect size.

For pre to post comparison for the treatment group: _________

For pre to post comparison for the comparison group: ________

For post-treatment comparison of treatment and comparison group: _______

4. Means and standard deviations or proportion (for rates) for the outcome measure for the treatment and comparison groups

Treatment group mean or proportion:________

Comparison group mean or proportion:________

Treatment group

standard deviation:_________ or

variance:__________
Comparison group

standard deviation:_________ or

variance:__________
Pooled standard deviation ____________ or

variance__________ for treatment and comparison groups

5. Means and standard deviations or proportion (for rates) for the pre- and post measures for the treatment group

Post-treatment group mean or proportion:________

Pre-treatment group mean or proportion:________

Post-treatment group

standard deviation:_________ or

variance:__________
Pre-treatment group

standard deviation:_________ or

variance:__________
Pooled standard deviation ____________ or

variance___________ for pre- and post-measures

6. Pearson correlation between measure of experimental status (treatment/control) and outcome measure: _________
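With a 0/1 treatment indicator, the Pearson correlation requested in item 6 is the point-biserial correlation and can be computed directly. The data here are hypothetical; a negative sign means treated cases show less of the problem outcome.

```python
import math

def point_biserial(treatment_flags, outcomes):
    """Pearson correlation between a 0/1 treatment indicator and an
    outcome measure (the point-biserial correlation)."""
    n = len(outcomes)
    mx = sum(treatment_flags) / n
    my = sum(outcomes) / n
    cov = sum((x - mx) * (y - my)
              for x, y in zip(treatment_flags, outcomes)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in treatment_flags) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in outcomes) / n)
    return cov / (sx * sy)

# 1 = treatment, 0 = comparison; outcomes are hypothetical offense counts.
flags = [1, 1, 1, 0, 0, 0]
offenses = [2, 3, 1, 4, 5, 3]
print(round(point_biserial(flags, offenses), 2))
```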

7. Statistical test used for assessing probability that difference (between treatment and comparison groups or pre- to post) is due to chance.

(A) Chi-square statistic (with one degree of freedom, i.e., from a 2x2 table):

chi-square value______

Total study sample size________

Exact 2-tailed p-level:__________

Nominal significance level (2-tailed; circle one):

p<.05: Yes or No

p<.01: Yes or No

(B) t statistic (for difference between means):

t value________

degrees of freedom________

or sample size for each condition:

Treatment group or post-condition "n":________

Comparison group or pre-condition "n":________

Exact 2-tailed p-level:__________

Nominal significance level (2-tailed; circle one):

p<.05: Yes or No

p<.01: Yes or No



(C) F statistic:

F value________

degrees of freedom in the numerator:_______

degrees of freedom in the denominator:________


eta squared:_______

sum of squares in the numerator:_________

sum of squares in the denominator:________

number of cases in each condition:

treatment group or post-condition "n":________

comparison group or pre-condition "n":________

Exact 2-tailed p-level:__________

Nominal significance level (2-tailed; circle one):

p<.05: Yes or No

p<.01: Yes or No

(D) Other statistical test used:

Name of test:________________________________________

Exact 2-tailed p-level:__________

Nominal significance level (2-tailed; circle one):

p<.05: Yes or No

p<.01: Yes or No

8. Direction of effect:

For treatment/comparison group designs: (Check one)

Treatment group has less problem behavior at post-test than comparison group:____

Comparison group has less problem behavior at post-test than treatment group:____

No difference exists between groups at post-test:____

For pre-post designs: (Check one)

Post-level of problem behavior is lower than pre-level:____

Pre-level of problem behavior is lower than post-level:____

No difference exists between pre- and post-measures:____
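Where a primary study reports only a test statistic, the quantities collected in this section support standard approximate conversions to Cohen's d. These are the usual meta-analytic formulas, offered as a sketch rather than the report's own procedure; the function names are ours.

```python
import math

def d_from_t(t, n1, n2):
    """Approximate d from an independent-samples t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)

def d_from_chi2(chi2, n):
    """Approximate d from a 2x2 chi-square via the phi coefficient."""
    phi = math.sqrt(chi2 / n)
    return 2 * phi / math.sqrt(1 - phi ** 2)

def d_from_r(r):
    """Approximate d from a treatment-outcome Pearson correlation."""
    return 2 * r / math.sqrt(1 - r ** 2)

# e.g., a study reporting only t = 2.0 with 50 subjects per group:
print(round(d_from_t(2.0, 50, 50), 2))
```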

Repeat Section IV, questions 1-8, for each additional selected outcome measure.

School-based Programs:

Also code:

Grade level of treatment group:

(Check more than one if the group is split fairly evenly across the levels below)

____ Mostly early elementary (K-3)

____ Mostly upper elementary (4-5)

____ Mostly middle school (6-8)

____ Mostly high school (9-12)

Range of grade levels included in treatment group:_____ to ______