Brigham Young University
BYU ScholarsArchive
Theses and Dissertations
2006-07-21
Holistic Scoring of ESL Essays Using Linguistic Maturity Attributes
Ronald Millett
Brigham Young University - Provo
Follow this and additional works at: https://scholarsarchive.byu.edu/etd
Part of the Linguistics Commons
BYU ScholarsArchive Citation
Millett, Ronald, "Holistic Scoring of ESL Essays Using Linguistic Maturity Attributes" (2006). Theses and Dissertations. 762.
https://scholarsarchive.byu.edu/etd/762
This Thesis is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion
in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please
AUTOMATIC HOLISTIC SCORING OF ESL ESSAYS USING
LINGUISTIC MATURITY ATTRIBUTES
by
Ronald P. Millett
A thesis submitted to the faculty of
Brigham Young University
in partial fulfillment of the requirements for the degree of
Master of Arts
Department of Linguistics and English Language
Brigham Young University
August 2006
Copyright © 2006 Ronald P. Millett
All Rights Reserved
BRIGHAM YOUNG UNIVERSITY
GRADUATE COMMITTEE APPROVAL
of a thesis submitted by
Ronald P. Millett
This thesis has been read by each member of the following graduate
committee and by majority vote has been found to be satisfactory.
________________________ ______________________________________
Date Deryle W. Lonsdale, Chair
________________________ ______________________________________
Date C. Ray Graham
________________________ ______________________________________
Date Diane Strong-Krause
BRIGHAM YOUNG UNIVERSITY
As chair of the candidate's graduate committee, I have read the thesis of Ronald P.
Millett in its final form and have found that (1) its format, citations and bibliographical
style are consistent and acceptable and fulfill university and department style
requirements; (2) its illustrative materials including figures, tables, and charts are in
place; and (3) the final manuscript is satisfactory to the graduate committee and is ready
for submission to the university library.
________________________ _______________________________________
Date Deryle W. Lonsdale
Chair, Graduate Committee
Accepted for the Department _______________________________________
John S. Robertson
Associate Chair, Department of Linguistics and
English Language
Accepted for the College ________________________________________
Gregory Clark,
Associate Dean, College of Humanities
ABSTRACT
AUTOMATIC HOLISTIC SCORING OF ESL ESSAYS USING
LINGUISTIC MATURITY ATTRIBUTES
Ronald P. Millett
Department of Linguistics and English Language
Master of Arts
Automated scoring of essays has been a research topic for some time in
computational linguistics studies. Only recently have the particular challenges of
automatic holistic scoring of ESL essays with their high grammatical, spelling and other
error rates been a topic of research. This thesis evaluates the effectiveness of using
statistical measures of linguistic maturity to predict holistic scores for ESL essays using
several techniques. Selected linguistic attributes include parts of speech, part-of-speech
patterns, vocabulary density, and sentence and essay lengths.
With customized algorithms based on multivariable regression analysis as well
as memory-based machine learning, holistic scores for test essays were predicted within
±1.0 of the human judges' scores an average of 90% of the time. This level of
prediction is an improvement over the 66% prediction level attained in a previous study
using customized algorithms.
ACKNOWLEDGEMENTS
My dear wife, Rhonda, and my six children, Ron, Barbara, Olga, Preston, Tanya
and Tyler, have been very tolerant of five years' worth of off-hours schooling to obtain this
degree. Rhonda always encourages me to try to continue to improve myself and I greatly
appreciate that.
My life and career changed forever when I first became acquainted with Eldon
Lytle in an honors linguistics class at BYU in 1971. He is a great mentor and friend and
his linguistic insights are exceptional. He is a pioneer in the area of a more intuitive
alternative (Junction Grammar) to standard linguistics theories, statistical linguistic data
collection, grammar checking, machine-assisted translation and attribute matching. I
appreciate very much his allowing me to use the WordMap program to make this study.
Deryle Lonsdale has provided the inspiration to earn this degree and an example
of both breadth and depth of understanding across the field of computational linguistics.
He has been patient with my slow pace of progress and, through coauthoring a paper with
me and giving help above and beyond with this thesis, has helped me gain skills to be more
credible in both my software programming and the linguistics field.
John Robertson and Alan Manning provided encouragement and help in the thesis and
paper writing classes. Alan Melby, my longtime friend and associate from the days of
the research for his Ph.D. dissertation, gave encouragement to me throughout these five
years of study. Diane Strong-Krause has patiently waited for a draft of this thesis to
finally be delivered and was a coauthor with Deryle Lonsdale of the paper that inspired
the thesis topic. My research paper for Ray Graham's language acquisition class in 2003
was the direct precursor of this thesis and provided a beginning step into the feasibility of
this kind of study.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1.0 INTRODUCTION
2.0 REVIEW OF LITERATURE
2.1 SURFACE FEATURE ANALYSIS: PROJECT ESSAY GRADE (PEG)
2.1.1 PEG as a Model System for Essay Analysis and Prediction
2.1.2 PEG Results and Analysis
2.2 GRAMMAR CHECKING PROGRAM APPLIED TO SCORING RESEARCH
2.3 LARGE-SCALE COMMERCIAL APPLICATION OF ESSAY SCORING: E-RATER
2.4 LATENT SEMANTIC ANALYSIS (LSA) BAG OF WORDS APPROACH
2.5 HYBRID SYSTEMS: BAG OF WORDS PLUS RULES
2.6 EXEMPLAR-DRIVEN MEMORY-BASED SYSTEMS
2.7 PROBLEMS WITH EXISTING APPROACHES
2.8 SPECIAL NEEDS OF ESL STUDENTS
2.9 HIERARCHY OF CHALLENGING SYNTACTIC STRUCTURES
2.10 ENGLISH GRAMMAR CHECKER TO ASSIST ESL
2.11 EVALUATION USING SYNTACTIC PARSER AND CUSTOMIZED ALGORITHM
2.12 IMPLICATIONS OF CURRENT RESEARCH ON SCORING OF ESL ESSAYS
3.0 METHODOLOGY
3.1 CORPUS SELECTION
3.2 PREPARATION OF ESSAYS
3.3 GENERATION OF WORDMAP LINGUISTIC ATTRIBUTES
3.4 EXTRACTION AND SELECTION OF WORDMAP ATTRIBUTES
3.4.1 Analysis Using Built-in WordMap Comparator Module
3.4.2 Correlation Analysis for Attributes for Training Set
3.4.3 Essay Length Attribute
3.4.4 CTA-Gain: Part of Speech Summary Attribute
3.4.5 PTA-Gain: Part of Speech Pattern Summary Attribute
3.4.6 SLA-seg: Sentence Density
3.4.7 WDA-Ratio: Vocabulary Summary Attribute
3.5 TRAINING ESSAYS AND TEST ESSAYS IN CORPUS
3.6 SINGLE AND MULTIPLE VARIABLE REGRESSION ANALYSIS
3.7 CUSTOM EXCEL SPREADSHEETS - PREDICTING AND EVALUATING SCORES
3.8 CUSTOM ALGORITHM DEVELOPMENT
3.9 MEMORY-BASED MACHINE LEARNING
3.9.1 TiMBL Memory-Based Language Processing System
3.9.2 Processing Options in TiMBL
3.9.3 TiMBL ESL Data formats
4.0 RESULTS
4.1 ALGORITHM-BASED PREDICTIONS
4.1.1 Self-Prediction of Training Set - Linear Formula - Each Attribute
4.1.2 Self-Prediction of Training Set - Linear Formula - All Five Attributes
4.1.3 Prediction of Test Set Scores - Linear Formula - Each Attribute
4.1.4 Prediction of Test Set Scores - Linear Formula - All Five Attributes
4.1.5 Prediction of Test Set Scores with Customized Algorithm
4.2 MEMORY-BASED LEARNING PREDICTIONS
4.2.1 CTA and PTA Individual Attributes Training Set Self-Tests
4.2.2 CTA and PTA Individual Attributes Test Sets Predictions
4.2.3 TiMBL Predictions using Five Selected Summary Attributes
5.0 CONCLUSIONS
5.1 FUTURE RESEARCH
REFERENCES
LIST OF TABLES
Table 2-1: Highest positive and negatively correlated PEG surface features
Table 2-2: Hierarchy of difficult syntactic structures for ESL students
Table 3-1: Corpus of ESL essays for training and testing
Table 3-2: Spreadsheet extract for student name, ID and essay scores
Table 3-3: Cross reference file between WordMap files and holistic scores
Table 3-4: Attribute groups and examples from WordMap III text analysis
Table 3-5: Extract of custom .csv export file from WordMap for W 2002 essay #1
Table 3-6: Correlations for certain WordMap attributes with holistic scores
Table 3-7: CTA POS individual attributes and correlation with holistic score
Table 3-8: PTA trigram POS patterns and correlation with holistic score
Table 3-9: Sample linear formulas derived from multivariable regression analysis
Table 3-10: Score prediction and evaluation spreadsheet CTA-Gain self-test
Table 3-11: Spreadsheet formulas for evaluating predicted scores accuracy
Table 3-12: Spreadsheet with the 77 used CTA subattributes ready for TiMBL
Table 3-13: Spreadsheet with the 84 PTA subattributes ready for TiMBL
Table 4-1: Holistic score self-test prediction for individual attribute linear equations
Table 4-2: Holistic score self-test prediction five attribute linear equation
Table 4-3: Holistic score prediction for CHK-Length attribute linear equation
Table 4-4: Holistic score prediction for CTA-Gain attribute linear equation
Table 4-5: Holistic score prediction for PTA-Gain attribute linear equation
Table 4-6: Holistic score prediction for SLA-seg attribute linear equation
Table 4-7: Holistic score prediction for WDA-Ratio attribute linear equation
Table 4-8: Holistic score prediction for five attribute linear equation
Table 4-9: Comparison of individual and combined five attribute formulas
Table 4-10: Self-prediction of Winter 2002 essays using custom algorithm
Table 4-11: Holistic score prediction for four test sets using customized algorithm
Table 4-12: TiMBL self-test predictions for CTA and PTA subattributes
Table 4-13: TiMBL predictions for CTA and PTA subattributes using W 1992 set
Table 4-14: Holistic score prediction - 5 attributes for test data sets using TiMBL
Table 4-15: Holistic score prediction average for 5 attributes with three methods
LIST OF FIGURES
Figure 2-1: Block diagram PEG system
Figure 3-1: Essay extract from raw text information file for individual student
Figure 3-2: WordMap rough draft essay file ready for analysis
Figure 3-3: Default comparator module operation in WordMap
Figure 3-4: CHK-Length correlated with holistic score for training essays
Figure 3-5: CTA-Gain correlated with holistic scores from judges for essays
Figure 3-6: PTA-Gain correlated with holistic scores from judges for essays
Figure 3-7: SLA-seg correlated with holistic scores from judges for essays
Figure 3-8: WDA-Ratio correlated with holistic scores from judges for essays
Figure 3-9: Holistic score analysis and prediction process for ESL Essays
Figure 3-10: Basic formula with added custom additions
1.0 Introduction
Research into automatic holistic scoring of essays has many practical as well as
academic applications. Just as machine translation attempts to simulate the technical and
intuitive processes of human language translation, the automatic scoring of essays
attempts to capture elements of the complex intelligent processes of assigning a quality
rating to a writing sample. Various statistical measurements have been shown to be able
to closely predict a teacher's or judge's holistic score without human intervention.
One of the better known examples of automatic essay analysis is the Educational
Testing Service's E-Rater system that is used to help judges grade the nationally-used
Scholastic Aptitude Test (SAT) Essay exams at the high school senior level of writing
and the Graduate Management Admission Test (GMAT) Analytical Writing Assessment
(AWA) exam (Burstein, Leacock and Swartz, 2001, p. 3). Exam grading that was
formerly performed by three judges is now performed by one judge and the E-Rater
program, or an additional backup judge if the first judge and the computer do not agree
on the initial score.
This thesis evaluates the effectiveness of using statistical measures of linguistic
maturity to predict holistic scores for ESL essays using several techniques. The research
question to be evaluated is whether computer-generated holistic scoring of ESL essays
can achieve a level of accuracy so that this process could become a useful and efficient
tool for ESL teachers and students.
Automatic essay analysis techniques have only recently been applied to the area
of scoring writing samples from students of English as a Second Language (ESL). The
error rates of the ESL student often leave little text in the composition that a computer
analysis program could process successfully. Spelling errors, grammatical errors and
short or endless sentences are all problems that are accentuated for these students.
Lonsdale and Strong-Krause (2003) used variables from the publicly available
Link Grammar program to drive their customized prediction algorithm trained on 60 ESL
essays and tested on 240 additional essays. Their report that correct scores could be
predicted 66% of the time on test data within a standard tolerance of ±1.0 rating points
gives encouragement to see if other techniques can also be applied successfully to ESL
essays.
This study uses the same corpus of ESL essays that the Lonsdale/Strong-Krause
study analyzed and tested. Linguistic data for the current study was generated by
Linguistic Technology Inc.'s WordMap III grammar checking and linguistic analysis
package. The linguistic maturity attributes were selected for this study by computing
correlation values for all the attributes and selecting attributes from different linguistic
attribute groupings. Standard single and multivariable regression techniques were used
by this study to derive customized prediction algorithms that were compared against
predictions made by the memory-based learning TiMBL system for the corpus test sets.
The most successful algorithms predicted holistic scores within ±1.0 rating point
accuracy 96% of the time.
2.0 Review of Literature
This chapter will review the literature for written English essay grading in general
and the application to English as a Second Language studies in particular.
2.1 Surface Feature Analysis: Project Essay Grade (PEG)
One of the earliest systems to attempt to automatically score essays was
Project Essay Grade (PEG), developed by Ellis Page in 1968 (Chung, 1997). This system
concentrated on surface features of the essay. Various features were analyzed and
correlated with essay scores. These variables included average sentence length, length of
essay in words, number of commas, apostrophes, question marks, relative pronouns and
subordinating conjunctions and other connectives.
Page correlated the various collected variables with the human ratings. The
variables with the highest positive correlation with essay scores were word length,
number of commas, and essay length. Table 2-1 shows these highly correlated features
and their correlation value r.
PEG Feature                          Correlation (r)
Standard Deviation of Word Length     0.53
Average Word Length                   0.51
Number of Commas                      0.34
Essay Length (words)                  0.32
Number of Prepositions                0.25
Number of Dashes                      0.22
Number of Uncommon Words             -0.48
Number of Apostrophes                -0.23
Number of Spelling Errors            -0.21
Table 2-1: Highest positive and negatively correlated PEG surface features (Chung, 1997, table 1)
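As an illustration of how a correlation value r like those in Table 2-1 is obtained, the following is a minimal sketch of the Pearson correlation between one surface feature and the human scores. The function name and the five sample essays are hypothetical and are not taken from the PEG studies.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: comma counts and human holistic scores for five essays.
comma_counts = [2, 5, 3, 9, 7]
human_scores = [1, 3, 2, 5, 4]
r = pearson_r(comma_counts, human_scores)
print(f"r = {r:.2f}")
```

A value near +1 or -1 marks a feature worth keeping as a predictor; a value near 0 marks one that carries little information about the score.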
2.1.1 PEG as a Model System for Essay Analysis and Prediction
The PEG system is a good example of how a holistic score essay prediction
system works. First, a group of training essays were collected and assigned a grade by
multiple graders. In the initial PEG study, four graders were used to determine a holistic
score of overall quality of the essay. The electronic versions of the essays were analyzed
and features were collected. Correlation and regression analysis determined which
variables were most closely correlated with the essay holistic score. Multiple variable
regression techniques resulted in a prediction equation with the independent variables
being the features and the dependent variable the predicted holistic score. The equation
also included coefficients that gave appropriate weightings to the various features.
Figure 2-1 shows a block diagram of the PEG essay grading system. Two sets of
essays are needed for the system to work: the training essays used to develop the
prediction algorithm, and the validation essays used to try out that prediction algorithm on
new, unseen essays. Human evaluation of the training essays gives the dependent variable
for the equation. Human evaluation of the validation essays is used to compare the
automatically generated predicted score with the score assigned by a human judge.
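The train-then-validate process described above can be sketched as follows. This is only an illustrative least-squares fit, not the PEG implementation: the function names, the two features (essay length and comma count), and the synthetic scores are all assumptions. The toy scores were constructed to be exactly linear in the features so that the result is easy to check.

```python
def fit_linear(features, scores):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w1, w2, ...]. Intended for small illustrative data only.
    """
    rows = [[1.0] + [float(v) for v in f] for f in features]  # intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, scores))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot solve")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(weights, feature_row):
    """Apply the fitted equation: intercept + sum(weight * feature)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


# Hypothetical training essays: (essay length in words, number of commas).
training_features = [(100, 2), (150, 5), (200, 3), (250, 9), (300, 7)]
training_scores = [2.2, 3.0, 3.3, 4.4, 4.7]  # synthetic human holistic scores

weights = fit_linear(training_features, training_scores)
predicted = predict(weights, (180, 4))  # an unseen validation essay
print(weights, predicted)
```

The fitted weights play the role of the weighting numbers in the prediction equation, and the prediction for the unseen essay is what would be compared against the human judge's score during validation.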
Figure 2-1: Block diagram of the PEG system (Chung, 1997, Figure 1). Two groups of essays are
needed: one group for training and an open-ended group of essays to be evaluated (the validation essays).
The validation procedure that compares the accuracy of the predictions is contained inside the
dashed line.
2.1.2 PEG Results and Analysis
The results of even the early PEG system were fairly good. On a test of 276
essays by 8th to 12th grade students in 1964 using 31 variables, the predicted scores
correlated with the score of the human judge at r = 0.50. Later studies in 1994 using 20
variables showed an even better correlation of r = 0.66 with the human
judges' scores. These positive correlations were as high as the correlations among the
individual judges' own scores. Note that this is a correlation measurement and not a
measurement of the percentage of predictions that fall close to the correct score. That
percentage measurement, introduced later in this chapter, is the current standard for
measuring predicted score accuracy in this research area.
Strengths of this approach include the excellent correlation with human scores,
the simple computational approach based on lexical and other easily accessible
features (such as number of commas), and its straightforward methodology. Weaknesses
include its limitation to purely surface features that do not include grammatical syntax,
meaning, or context. The system must be recalibrated for each new application, and the
essays are only scored relative to other essays of exactly the same type.
2.2 Grammar Checking Program Applied to Scoring Research
Near the beginning of the personal computer era in 1981, one of the first
published software packages that incorporated composition evaluation measures was the
WordMAP® program from Linguistics Technologies, Inc. (Lytle, 1986, 1993). This
program was likely the first commercial grammar checker for the IBM Personal
Computer environment and, in the author's opinion, still equals or exceeds in accuracy
most commercially available programs now included in modern word processors. The
author of this thesis was on an evaluation team in 1992 at WordPerfect where WordMap
was rated as one of the top two programs for grammar checking being evaluated at the
time.
Roberts (1983) conducted an initial study using WordMAP that used vocabulary-based
statistical measurements to validate earlier studies on author identification in the
Book of Mormon. Subsequent research on student essays with the Heber school system
in Utah eventually led Lytle to receive a contract for four years of consulting work with
the Educational Testing Service, the company responsible for the SAT test. They studied
whether high school senior SAT essays could be scored effectively by computer,
matching the holistic scores given by a highly trained panel of three judges (Breland and
Lytle, 1990). This paper indicated that good predictions of writing ability can be made
without the use of human readers. WordMAP was able to predict ratings with a
correlation of 0.82, compared with a correlation of 0.74 among the three judges themselves.
Independently of ETS, Lytle continued his research and developed further refinements of
his WordMap program and continued studies with the Lincoln County school district in
Nevada (Lytle and Matthews, 1986).
WordMap was designed to not only collect statistics that might be used to predict
scores, but to also provide feedback to the student on how his writing could be improved.
By producing a grammar checking program marketed to individuals as well as schools or
educational research institutions such as ETS, WordMap was promoted as a tool that
could be used constantly during a student's education, analyzing, tracking and giving
suggestions for improvement at all stages of a student's writing skills development.
Section 3.3 of this thesis provides more information on WordMap and its
capabilities and application to this study.
2.3 Large-Scale Commercial Application of Essay Scoring: E-Rater
The Educational Testing Service introduced a software product called E-Rater
(Rudner and Gagne, 2001) that aims at evaluating the writing proficiency of high school
students as they prepare to enter college. E-Rater, available since 1997, expanded on the
research between ETS and Linguistic Technologies in the 1986-1990 period that showed
a high correlation of grammatical and vocabulary-based variables with the holistic essay
ratings. This package is currently being used extensively and is perhaps the highest-
profile example of extensive use of automated essay scoring in a commercial
environment.
Burstein, Leacock and Swartz (2001) reported that since its implementation in
1999, over 750,000 Graduate Management Admission Tests (GMAT) have been graded
with E-Rater, showing agreement within 1.0 point of a single judge's score (on a holistic
scale of 1 to 6) 97% of the time. When the two scores are not within the 1.0
threshold, another judge is called in to provide another human score for the essay.
The standard of measuring predicted score accuracy within 1.0 point of the
judge's score means that if the judge's score is 5 on a 5-point scale, then a 4 or a 5 would
be within 1.0 point of the score, but 1, 2, or 3 would not. If the judge's score is 3 on a
5-point scale, then a 2, 3, or 4 would be judged as within 1.0 of the judge's score.
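The within-1.0 standard can be expressed as a short calculation. The sketch below is illustrative only; the function names and the sample score lists are hypothetical.

```python
def within_tolerance(predicted, judged, tolerance=1.0):
    """True when the predicted score is within `tolerance` of the judge's score."""
    return abs(predicted - judged) <= tolerance


def agreement_rate(predicted_scores, judged_scores, tolerance=1.0):
    """Fraction of essays whose predicted score is within tolerance of the judge's."""
    if len(predicted_scores) != len(judged_scores) or not predicted_scores:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(
        within_tolerance(p, j, tolerance)
        for p, j in zip(predicted_scores, judged_scores)
    )
    return hits / len(predicted_scores)


# Hypothetical predictions against judges' scores on a 5-point scale.
predicted_scores = [4, 5, 3, 1]
judged_scores = [5, 5, 5, 3]
rate = agreement_rate(predicted_scores, judged_scores)
print(f"{rate:.0%} of predictions within 1.0 of the judge's score")
```

In this toy data the first two predictions count as agreement (4 and 5 against a judged 5), while the last two do not (3 against 5, and 1 against 3).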
The E-Rater program has also been evaluated for testing of college level essays
for non-native speakers of English for the Test of Written English (TWE) exam (Burstein
and Chodorow, 1999). Although there were significant differences between the scores of
native English speakers and native Chinese, Arabic or Spanish speakers writing in
English, the E-Rater system, which is tuned for each topic using about 52 syntactic,
discourse and other analysis variables, was able to predict the holistic scores exactly or
with an adjacent score (within 1.0 of judged score) about 92% of the time.
2.4 Latent Semantic Analysis (LSA) Bag of Words Approach
Another approach to holistic scoring of essays involves using Latent Semantic
Analysis and Indexing (LSA and LSI). The overall LSI approach has been used as the
basis of text analysis since 1989. More recently, LSI, developed by Thomas Landauer
(Landauer, 2003), was released in 1997 as a product called Intelligent Essay Assessor (IEA).
LSI is a bag-of-words process in which word frequency vectors are
reduced to 200-400 dimensions of semantic space. Used initially for information
retrieval as an improvement on simple word frequency vectors, it also provides a
very impressive correlation with human rating scores. An important strength of the LSA
approach is that words are grouped into the semantic index dimensions to avoid
mismatches on similar words. A weakness is that, like PEG, syntactic and grammar
variables are missing; everything in the LSI approach is in the vocabulary (Rudner, 2001).
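To make the bag-of-words idea concrete, the sketch below builds raw word frequency vectors and compares two texts by cosine similarity. It deliberately covers only this starting point: the reduction to a few hundred semantic dimensions (usually done with a singular value decomposition) is omitted, and the function names and sample sentences are hypothetical rather than part of IEA.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Bag-of-words frequency vector: word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


first = word_vector("The cat chased the dog")
second = word_vector("The dog chased the cat")
unrelated = word_vector("Essays need careful revision")

print(cosine_similarity(first, second))     # same words, different order
print(cosine_similarity(first, unrelated))  # no shared words
```

The first two sentences have opposite meanings but identical vectors, which shows the weakness noted above: with only vocabulary available, syntactic information is lost.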
Recent work by Kanejiha (2003) has centered on adding some syntactic
information to the LSI approach by including the part-of-speech category of the word
preceding the word being indexed. This approach, called Syntactically Enhanced Latent
Semantic Analysis (SELSA), has been shown to outperform a straight LSA
approach in evaluating essay answers to basic computer science questions.
2.5 Hybrid Systems: Bag of Words Plus Rules
Rosé (2003) has developed the hybrid CarmelTC system, an approach that begins to combine
the advantages of the bag-of-words LSI and PEG systems with a rule-based component.
This system combines a rule learning approach, using features
extracted from a syntactic analysis, with a Naive Bayes analysis of the text. The
comparison study reported in that paper tested automatic grading of
answers to qualitative physics test questions. The results showed that this hybrid approach
was significantly better than a straight LSI or Naive Bayes approach using the same data.
2.6 Exemplar-Driven Memory-Based Systems
Exemplar-driven methods are also beginning to be used for essay grading.
Machine learning techniques are being applied to analyze word frequencies and
proximity to other similar words in multidimensional space (Chodorow and Leacock,
2000). By accessing a 30 million word corpus indexed by every word and every context,
this system, called ALEK, was able to compare student essay answers on the TOEFL
(Test of English as a Foreign Language) exam with this database to find determiner and
agreement errors such as “a desks” because of the low frequency of this
collocation in the database. This approach compared favorably with the Word97
grammar checker included with Microsoft Word in a limited set of error environments.
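The idea of flagging a collocation because it is rare in a reference corpus can be sketched with simple bigram counts. This is a simplification under stated assumptions: it uses raw counts over a three-sentence toy corpus, not ALEK's actual statistic or its 30 million word corpus, and all names are hypothetical.

```python
from collections import Counter


def bigrams(tokens):
    """Adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))


def build_bigram_counts(corpus_sentences):
    """Count every adjacent word pair in the reference corpus."""
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(bigrams(sentence.lower().split()))
    return counts


def flag_rare_bigrams(sentence, counts, min_count=1):
    """Return the word pairs seen fewer than `min_count` times in the corpus."""
    return [bg for bg in bigrams(sentence.lower().split()) if counts[bg] < min_count]


# Toy stand-in for a large corpus of well-formed English.
reference_corpus = [
    "he bought a desk",
    "she moved the desks",
    "a desk is here",
]
counts = build_bigram_counts(reference_corpus)
suspects = flag_rare_bigrams("he bought a desks", counts)
print(suspects)
```

With a corpus this small almost any new sentence would be flagged, so the threshold only becomes meaningful when the reference counts come from a very large body of text.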
2.7 Problems with Existing Approaches
Criticisms of these existing approaches include theoretical as well as practical
issues. For example, “many systems exhibit an inherent Achilles heel since it is possible
to trick them into evaluating a nonsensical text purely by reverse-engineering the scoring
mechanism and designing a text that corresponds to the criteria” (Lonsdale and Strong-
Krause, 2003). Other problems include costly development of specialized data and rules
for training the system for a specific environment and the need to expand the system by
hand to build a new model.
2.8 Special Needs of ESL Students
Only recently have efforts in automatic scoring of essays emphasized the special
requirements of essays written by ESL students. These essays, especially for those with
low proficiency in English, often are made up of ill-formed sentences. In addition to
having difficulty in linking grammatical sentences together properly, ESL students
struggle with problems of spelling and vocabulary, which together make it challenging
to duplicate manual grading with a computer-generated score.
In her 1994 thesis, Susan Ingle addressed the correlation of holistic scores for
different levels of ESL papers with various objective measurements. She examined 193 essays
written by native speakers of Japanese and Spanish who were learning English at the Brigham
Young University (BYU) English Language Center. Ingle found that
essay length was correlated with holistic scores an amazing 58% of the time. She added
other manually generated variables such as mean error-free T-unit length, percent of
subordinated clauses per T-unit, percentage error-free T-units and mean T-unit length. A
T-unit consists of a main clause plus its subordinate clauses. The result of these five
variables was a correct prediction of 75% of the holistic scores.
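The T-unit measures listed above can be computed once each T-unit has been annotated by hand. The sketch below assumes such annotations are already available; the record layout, the field names, and the reading of the subordination measure as an average number of subordinate clauses per T-unit are assumptions, not Ingle's definitions.

```python
from dataclasses import dataclass


@dataclass
class TUnit:
    """One manually annotated T-unit: a main clause plus its subordinate clauses."""
    word_count: int
    subordinate_clauses: int
    error_free: bool


def t_unit_measures(units):
    """Summary measures over the annotated T-units of one essay."""
    if not units:
        raise ValueError("an essay needs at least one T-unit")
    error_free = [u for u in units if u.error_free]
    return {
        "mean_t_unit_length": sum(u.word_count for u in units) / len(units),
        "percent_error_free": 100 * len(error_free) / len(units),
        "mean_error_free_length": (
            sum(u.word_count for u in error_free) / len(error_free)
            if error_free else 0.0
        ),
        "subordinate_clauses_per_t_unit": (
            sum(u.subordinate_clauses for u in units) / len(units)
        ),
    }


# Hypothetical annotations for a four T-unit essay.
essay = [
    TUnit(word_count=10, subordinate_clauses=1, error_free=True),
    TUnit(word_count=6, subordinate_clauses=0, error_free=False),
    TUnit(word_count=14, subordinate_clauses=2, error_free=True),
    TUnit(word_count=10, subordinate_clauses=1, error_free=False),
]
measures = t_unit_measures(essay)
print(measures)
```

Together with essay length, values like these would serve as the independent variables for predicting the holistic score.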
2.9 Hierarchy of Challenging Syntactic Structures
Another thesis at BYU by Xinyou Zhang (1994) investigated a hierarchy of
syntactic structures that were especially difficult for ESL students to use correctly. Table
2-2 shows this hierarchy, with the higher numbers being more difficult for ESL students.
Detecting these structures and creating a variable attribute for them for an automated
study would be a valuable extension of Zhang's work, which evaluated a limited number of
essays manually to derive the hierarchy.
2.10 English Grammar Checker to Assist ESL
Park (1997) examined the specific problems and benefits of using an English
grammar checker to assist ESL students. This program was based on a Combinatory
Categorial Grammar parser written in Prolog. It includes grammatical mistakes as
ungrammatical variations of the constituents that can be related to given lexical entries in
a categorial lexicon. A training set of essays is required to train the grammar checker
which can then be used with a test set.
Number  Syntactic Structure Name  Example
1.      Noun Clause               I believe that the Book of Mormon is true.
2.      Adverb Clause             It rained today because a storm came in.
3.      Relative Clause           The cat that the dog chased got away.
4.      Appositive                Reagan, our President From 1981 until 1989,
5.      Participial               He hobbled along, swinging his cane.
6.      Absolutes                 He stood alone, his hands tied behind his back.
Table 2-2: Hierarchy of syntactic structures determined to be increasingly difficult for ESL students (Zhang, 1994, pp. 39-40)
The errors are tailored for the ESL environment. For example, even though the
sentence "I want leave" is technically correct since "leave" can be used as a noun, in an
ESL essay it is much more likely that the writer has omitted the "to" and meant to say
"I want to leave."
At the 2003 conference of the North American Chapter of the Association for
Computational Linguistics (NAACL), an interesting paper analyzed a Swedish grammar
checker named Granska being used by second-language learners of Swedish. Ola Knutsson
(2003) compared essay scores before and after the use of the grammar checker. Granska
detected 38% of errors in texts that argued a particular point of view. One sidelight of
this work was the detection of essays where large quantities of text were copied from
other sources. However, this study was limited to essays from only eight students. This
work seems to be a preliminary study and much further work is needed before any real
conclusions can be drawn.
2.11 Evaluation using Syntactic Parser and Customized Algorithm
Lonsdale and Strong-Krause (2003) studied the effectiveness of an approach to
grading ESL essays using data derived from the syntactic Link Grammar Parser. This
flexible and robust parser links pairs of words with relationship pointers rather than
constructing strict parse trees. They developed a corpus of 301 novice and intermediate
ESL essays. This corpus was analyzed by the Link Grammar parser and then a Perl script
assigned a five-point holistic score. Between 60% and 69% of the automatically generated
ratings were consistent with the manually generated scores within 1.0 rating point. The
system often overrated essays that received a score of 1 or 2 from judges and underrated
essays that were scored high by the judges but contained many run-on sentences. This
ongoing research at BYU continues to refine the link grammar approach and is being
expanded to combine the link grammar measurements with other variables, which
could result in an even better hybrid system.
2.12 Implications of Current Research on Scoring of ESL Essays
The special requirements of ESL essays, especially at the lower beginning levels,
are just now beginning to be addressed in current research studies. The success that has
been achieved in other areas of essay evaluation lends encouragement to further efforts.
Basic principles that have proved successful include having a robust essay parser, using a
variety of attributes, and investigating different methods for holistic score generation
from that varied data.
3.0 Methodology
This chapter details the methodology that is used for this study. This thesis
research uses several commercially or publicly available software packages to generate
the linguistic maturity attributes, analyze and derive prediction methods for the holistic
scores on the training set of essays, and evaluate the validity of those predictions on test
essays. Several custom programs and templates were written for this thesis to tie
the data processing steps of these packages together or to perform functions not
included in them.
The WordMap III program was used to generate linguistic maturity attributes, which
were extracted by a program for essay analysis into one file for each data set. These
attribute files were imported into Microsoft Excel for database storage and sorting. The
WINKS v. 4.8 statistical package provided subsequent single and multivariable statistical
regression analysis of these attributes, correlations of single selected attributes with the
judges' scores, and scatter graphs showing linear correlation. Microsoft Excel® was then
used for template spreadsheets to predict and evaluate the algorithmic formulas. Finally,
several different settings of a machine learning program called TiMBL were used for
prediction of scores using Excel spreadsheets and customized programming.
Perl, Microsoft QuickBasic and customized spreadsheet programs were also
developed for this thesis to perform the following functions:
(1) Extraction of essays and holistic scores from raw student evaluation files.
(2) Creation of individual writing sample text files in WordMap input text format plus
cross reference files for each subset of a student essay corpus.
(3) Extraction of all WordMap attributes after linguistic analysis of individual essay files
plus merging these attributes with essay identification and holistic scores into a combined
comma-separated values (CSV) file for each of five groups of essays in the corpus.
(4) Correlation analysis for all attributes on a set of essays for key attribute identification.
(5) Customized spreadsheet creation using linear prediction and customized algorithms to
predict holistic scores and evaluate the accuracy of the predicted scores.
(6) Customized spreadsheet for individual components of individual parts of speech and
POS patterns plus Perl merge routine to prepare binary or numeric versions of these
sparse arrays for TiMBL analysis.
(7) Customized spreadsheet creation for evaluating machine learned predictions of
holistic scores for multiple numeric variables as well as sparse array analysis.
3.1 Corpus Selection
This thesis uses the same corpus of ESL essays used in Lonsdale and
Strong-Krause's (2003) study. A parallel study, using the same training and testing data,
can facilitate a better comparison of the scoring algorithms between separate research
efforts. The essays were collected in the English Language Center at BYU as part of the
normal ESL classes. These intensive English learning courses included students ranging
from novice to intermediate. Table 3-1 shows the five semesters used as data for the
study and the number of student essays from each.
Semester # Students
Winter 2002 60
Winter 1992 72
Fall 2001 72
Winter 2001 30
Summer 2001 43
Table 3-1: ESL essays for training and testing
The study used the 60 essays from the winter 2002 semester for training; testing
of the algorithm was performed on the other four groups of essays. The 277 essays
consisted of about 45,000 words and 3,100 sentences. Each essay had an average of 165
words in 11.2 sentences, for an average of 14.75 words per sentence. A holistic score of
1 to 5 was given to each essay by two judges. The five scoring levels are described
as follows in the Lonsdale/Strong-Krause paper:
1. Demonstrates limited ability to write English words and sentences. Sentences
and paragraphs may be incomplete and difficult to follow.
2. Writes a simple paragraph with a fair control of basic, not complex sentences
structures. Errors occur in almost all sentences except for the most basic,
formula-type (memorized) structures. Little detail is present.
3. Writes a fairly long paragraph with relatively simple sentence structures.
Personal experiences and some emotions can be expressed, but much detail is
missing. Frequent errors in grammar and word use make reading somewhat
burdensome.
4. Writes long groups of paragraphs with some complex sentence patterns. Some
transitions are used effectively. Vocabulary is broadening, but some wrong word
use. Grammar errors may detract from meaning. Some ideas are supported with
detail. Some notion of an introduction and conclusion is included.
5. Writes complex thoughts using complex sentence patterns with effective
transitions between ideas and sentences. Errors in grammar exist but do not
obscure meaning. A variety of advanced vocabulary words are used but some
wrong use occurs, including problems with prepositions and articles. Ideas are
clearly supported with details. Effective introduction and conclusion are
included.
Here is a sample essay that shows some of the challenges of automatically scoring
these essays.
Iwork really hard and occacionally I don’t have time for have
fun whith mt friens but i don’t mind becausse i knew ,when i grow
up i will have a profesion and have a good job and i will be very
happy.
In the Lonsdale/Strong-Krause study, the importance of the Link Grammar
Parser's robustness was emphasized: the parser must be able to keep running and trying
to link entities together syntactically even in the midst of recurring errors. It is for
this same reason that the WordMap program has been used in the current study. Instead of
having its parse structures just get worse and worse as errors accumulate, a robust
grammar checking program keeps retrying, guessing at misspellings, and even repairing
structures based on its best guesses, so that it produces data even in the midst of near
chaos in the essay.
There were some anomalies in the data that should be mentioned as applied to this
study. The original number of essays in the 2003 study was 301. About 3% of the
essays were too small to be processed effectively in the WordMap system. An additional
2% of the essays had errors in WordMap processing that prevented the generation of an
attribute file. For the three 2001 data sets, some of the essays, about 7%, were missing
from the archived files saved from the 2003 study. In spite of these difficulties, the 277
essays that were used in this study were determined to contain a similar variety of holistic
scores as the 2003 study.
3.2 Preparation of Essays
Perl programs were written for this thesis to extract and prepare the ESL essays
for analysis by the WordMap system. The files for the corpus included spreadsheet data
enumerating student name, student ID number, and holistic scores from the two judges
for the essay, mixed with additional testing information. The actual essay text,
referenced by name and/or ID number was contained in merged summary files for two of
the five corpus sets, and in individual files named by student ID number along with other
class final testing evaluations for reading, writing and grammar for the other three sets.
Each essay received a file name consistent with the corpus set and a number for
the essay within that set (e.g. TF01-01 for Fall 2001 test set, first essay). The essay was
formatted into WordMap rough draft input format, and a summary file was generated for
each set in the corpus to cross-reference the judges' scores with this essay. Table 3-2
shows an extract from one of the spreadsheet files with scoring information. Figure 3-1
shows a sample essay from the raw class evaluation file. Figure 3-2 shows a sample
essay file ready for WordMap analysis. Table 3-3 shows an extract of a log file to cross-
reference the WordMap files with the essay scores for later analysis. All ID numbers
have been randomized from the original IDs.
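The preparation step described above can be sketched roughly as follows. This is a minimal Python illustration, not the Perl code written for the thesis. The dot-commands (.NM, .SP, .TL, .PP, .END) are simply copied from the layout shown in Figure 3-2 without any claim about their exact meaning in WordMap, and the EssayRecord type and all function names are hypothetical.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class EssayRecord:
    """One essay plus the two judges' holistic scores (hypothetical container)."""
    student: str
    judge1: float
    judge2: float
    text: str

    @property
    def average(self) -> float:
        return (self.judge1 + self.judge2) / 2


def essay_file_name(set_code: str, index: int) -> str:
    """Build a name such as 'TF1-08.rgh' from a corpus-set code and essay number."""
    return f"{set_code}-{index:02d}.rgh"


def to_rough_draft(file_name: str, rec: EssayRecord, width: int = 56) -> str:
    """Wrap the essay text in the dot-command layout seen in Figure 3-2."""
    scores = ", ".join(f"{s:g}" for s in (rec.judge1, rec.judge2, rec.average))
    body = textwrap.fill(" ".join(rec.text.split()), width=width)
    lines = [
        f".NM {file_name}",
        ".SP 3",
        f".TL//student = Student={rec.student}, Scores = {scores}",
        ".PP",
        body,
        ".END",
    ]
    return "\n".join(lines) + "\n"


def cross_reference_row(file_name: str, rec: EssayRecord) -> str:
    """One tab-separated log line tying a WordMap file back to its scores."""
    return f"{file_name}\t{rec.judge1:g}\t{rec.judge2:g}\t{rec.average:g}"
```

Keeping the file naming and the cross-reference row in separate helpers mirrors the two outputs described in the text: one input file per essay and one summary log per corpus set.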
3.3 Generation of WordMap Linguistic Attributes
WordMap was selected to generate the linguistic attributes for several reasons.
First, the program is based on a grammar checking engine which has the ability to
recover from error conditions in order to analyze an essay. The Lonsdale and Strong-
Krause study showed how important a resilient parser is for analysis of these essays.
Second, WordMap provided a range of attributes that spanned all the way from
vocabulary and surface features such as average word lengths to very detailed syntactic
and stylistic attributes such as part of speech trigrams. Third, although WordMap is a
proprietary system that is licensed for a fee, the system was provided to BYU at no cost
for ongoing research into analysis of ESL essays in this thesis and for future research.
Since WordMap III is a DOS system and the program had been used in limited computer
environments, part of this thesis work involved getting the system to run on a Windows
XP-based computer.
The individual files for each essay were processed by the WordMap system. The
system collects a wide variety of statistics in various attribute groups. Table 3-4 shows
these groups, with a description of each group and the name and description of an example
attribute in that group.
Semester  Last Name  First Name  Student Number  Judge#1  Judge#2  Average
F01       Last-13    First-13    999-99-6363     2        2        2
F01       Last-14    First-14    999-99-7474     3        4        3.5
F01       Last-15    First-15    999-99-9696     3        3        3
Table 3-2: Spreadsheet extract for student name, ID and essay scores from two judges
WordMap data file Judge 1 Score Judge 2 Score Average Score
TW1-06.rgh 3 3 3
TW1-07.rgh 1 1 1
TW1-08.rgh 2 2 2
TW1-09.rgh 1 2 1.5
Table 3-3: Cross reference file between WordMap files and holistic scores.
Abbr.  Group Description                 Example      Example Description
ALL    Over All composite index          SYN          Syntax composite
CTA    Parts-of-speech (POS) attributes  vc+ed        Past tense verb
DEN    Word redundancy attributes        Den-200[xx   New words any category xx per 200 words
FTR    Syntactic-types attributes        psv          Passive
CHK    Grammar & style attributes        [FRAG]       Sentence fragment
IDL    Idiom List                        high_school  Combined meaning
LBL    Pattern label attributes          nSV-Lbls     Number of clauses
LNA    Length-of-words attributes        lna-07       Word length 7 chars
LSS    Variety of language index         PTA-Gain     PTA section % patterns
LST    Function type attributes          !DET         Determiners
MOR    Morphological attributes          a°W          Decodable prefixes??
MRK    Punctuation use attributes        commas       Number of commas
PNC    Punctuation skills index          [+_]         Syntactic gap
PTA    Text POS 3-grams                  nc-ccn-nc    Conjoined class nouns
SGA    Sentence length attributes        SGA-Median   Median sentence length
SLA    Select category attributes        sla-aj+nc    Adj + class noun
SYN    Syntactic skills index            av+vc+ing    Adverb + verb participle
VOC    Vocabulary specific index         VOC-WDS      Non function words??
WDA    Vocabulary attributes             there        Specific function word
Table 3-4: Attribute groups and examples from WordMap III text analysis (Lytle, 1986,
“File Types,” p. 4)
Figure 3-1: Essay extract from raw text information file for individual student. File is named
using student ID number, which has been randomized for this figure.
File Name: 999-99-8787.xx
Last-8, First-8
Date: 9/4/01
Test Started: 9:17 AM
LISTENING TEST
================================================
S Item# L R AN Logit Est. SE
L 05701 1 1 CR -1.87 0.00 1.00
L 00502 3 1 CR -1.07 0.00 1.00
Grammar TEST
================================================
S Item# L R AN Logit Est. SE
L 1.019 1 1 CR -3.23 1.00
L 1.009 2 1 CR -.94 1.00
Reading Test
================================================
S Item# L R AN Logit Est. SE
L 06302 2 0 D3 -.90 0.00 1.00
L 06301 3 1 CR -.59 0.00 1.00
WRITING TEST
====================================================
Last-8, First-8
999-99-8787
Write as much as you can about THE MOST IMPORTANT CLASS
YOU HAVE EVER TAKEN.
I didn't like study English before, but when I was high
school, I met a great teacher. He had been studying
English. I took his class when I was second grade of
high school. This class was impotrant for me. There are
some reasons.
I could be learned a lot of things by him. How
wonderful that we can talk by English. I think teacher
is most important when we take class. I'm very gald
that I could take his English class. Now I'm trying
study English again!
Figure 3-2: WordMap rough draft essay file ready for analysis from a level 4 essay
.NM TF1-08.rgh
.SP 3
.TL//student = Student=Last-8, First-8, Scores = 3, 3, 3
.PP
I didn't like study English before, but when I was high
school, I met a great teacher. He had been studying
English. I took his class when I was second grade of
high school. This class was impotrant for me. There are
some reasons. First, he has a good personarity, and
everyone likes him. I respected him, and I tried to
hard study English, because I wanted to he thinks me a
good student.
Second, because his story was very interesting, I liked
to hear that he was speaking not only English, but also
another stories. Then he taught us how we can enjoy
study English. If I didn't take his class, I wouldn't
like English forever. I could be learned a lot of things
by him. How wonderful that we can talk by English. I
think teacher is most important when we take class. I'm
very gald that I could take his English class. Now I'm
trying study English again!
.END
For example, the CHK group contains an attribute giving the length of the essay in
words. The CTA and PTA groups contain attributes summarizing syntactic variety in
the essay. The SLA group contains sentence density information and the WDA
group contains vocabulary variety attributes.
After processing an essay text file, WordMap then writes the complete set of
attributes into a custom format database file (.rnd extension) consisting of the attribute
group, the attribute, the raw frequency of the attribute in the sample text and the
percentage that this frequency represents of all of the attributes in the group.
As an example of how varied these attributes are, for the 60 essays of the first
subset of the corpus WordMap generated a total of 19,289 attribute values. A total of 963
different attributes were defined, though most of them are not defined for every essay.
Each essay had an average of 321 attributes generated.
3.4 Extraction and Selection of WordMap Attributes
Because the existing export functions of the WordMap program provided limited
flexibility for extracting attributes, a customized program was developed to extract
the complete dataset of all attributes for a subset of the corpus into a standard
comma-separated values (CSV) format. Data from the WordMap file was cross-referenced with
its essay identification number and the holistic score category that was determined by the
judges for the essay. Table 3-5 shows an extract of a portion of the .csv export file
showing attributes for one of the training essays.
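As a rough illustration of this export step, the Python sketch below writes one CSV row per attribute and repeats the essay file name and holistic score on every row, as in Table 3-5. It is not the program used in the thesis. Because the .rnd layout is a custom WordMap format that is not specified here, the sketch assumes the records have already been read into (group, attribute, raw frequency, percent frequency) tuples; the function name is hypothetical.

```python
import csv
import io


def export_attributes(essay_file, score, records, out):
    """Append one CSV row per attribute record for a single essay.

    records: iterable of (group, attribute, raw_freq, pct_freq) tuples that
    are assumed to have been read from the essay's .rnd file already.
    """
    writer = csv.writer(out)
    for group, attribute, raw_freq, pct_freq in records:
        writer.writerow([essay_file, score, group, attribute, raw_freq, pct_freq])


# A few records copied from the first training essay (see Table 3-5).
sample = [
    ("CTA", "nc", 19, 12.10191),
    ("WDA", "the", 6, 3.97351),
    ("CHK", "[THAT]", 3, 1.886792),
]
buffer = io.StringIO()
export_attributes("TR-1.rnd", 3, sample, buffer)
rows = buffer.getvalue().splitlines()
```

A "long" layout like this, with one attribute per row, copes naturally with the fact that most attributes are not defined for every essay.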
There is a built-in comparator module in WordMap to compare essays to a set of
standard essays such as the sample essays representing the five holistic scoring groups. It
was discovered that the default groupings of attributes in the comparator module resulted
in poor predictions for a subset of the essays. Correlation analysis of the extracted
attributes led to selection of several key attributes and attribute groups to create the
custom algorithms and assemble the data for training and testing for the memory-based
machine learning system.
The selected attributes ended up being summary attributes that have the highest or
nearly the highest positive or negative correlation with the holistic scores for one of the
sets of essays. Many other attributes were considered before these were selected.
WordMap generates extensive error flags as it runs the grammar checking modules, but
the individual flags were not highly correlated with the scores. For example, run-on
sentences were only correlated with holistic scores at r = -0.068. Stylistic analysis
flags, such as passive use, were higher at r = 0.323, but even that value was nowhere
near the r = 0.75 and higher level of the highest summary attributes.
File      Score  WM Grp  Attrib       Raw Freq   % Freq Analysis
TR-1.rnd  3      CTA     nc           19         12.10191
TR-1.rnd  3      CTA     nmprn        15         9.55414
TR-1.rnd  3      CTA     seg          15         9.55414
TR-1.rnd  3      CTA     p            13         8.280255
TR-1.rnd  3      WDA     the          6          3.97351
TR-1.rnd  3      WDA     and          6          3.97351
TR-1.rnd  3      WDA     to           6          3.97351
TR-1.rnd  3      WDA     in           5          3.311258
TR-1.rnd  3      WDA     of           3          1.986755
TR-1.rnd  3      IDL     there_'s     3          2.054795
TR-1.rnd  3      IDL     a_few        1          0.6849315
TR-1.rnd  3      IDL     high_school  1          0.6849315
TR-1.rnd  3      CHK     [THAT]       3          1.886792
TR-1.rnd  3      CHK     [+_]         1          0.6289308
TR-1.rnd  3      CHK     [Prep]       2          1.257862
TR-1.rnd  3      LNA     lna-04       18         13.95349
TR-1.rnd  3      LNA     lna-05       14         10.85271
TR-1.rnd  3      ALL     ALL-VOC      -13.65885  -4.907377
TR-1.rnd  3      ALL     ALL-PNC      -6.161773  -2.388628
TR-1.rnd  3      ALL     ALL-GMR      -4.419491  -0.3614517
TR-1.rnd  3      ALL     ALL-SYN      -57.50528  -2.319609
Table 3-5: Extract of custom .csv export file from WordMap .rnd format for ESL essay #1 from
the 2002 corpus subset. The analysis of this 159 word essay included 349 different attribute
values.
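The percentage column of Table 3-5 can be checked with a little arithmetic, since each percentage is the attribute's raw frequency divided by the total for its group. The Python sketch below is only a worked check under that reading; the group total is not listed in the extract, so it is inferred from one row, and the function names are hypothetical.

```python
def implied_group_total(raw_freq, pct_freq):
    """Infer the group total from one row: pct = 100 * raw / total."""
    return round(100.0 * raw_freq / pct_freq)


def percent_of_group(raw_freq, group_total):
    """Percentage that a raw frequency represents within its attribute group."""
    return 100.0 * raw_freq / group_total


# CTA rows of Table 3-5: nc has raw frequency 19 and percentage 12.10191.
cta_total = implied_group_total(19, 12.10191)

# The other CTA rows should then follow from the same total.
nmprn_pct = percent_of_group(15, cta_total)   # table lists 9.55414
p_pct = percent_of_group(13, cta_total)       # table lists 8.280255
```

The same check on the WDA rows (raw frequency 6 at 3.97351 percent) implies a different group total, which is consistent with the percentages being computed within each attribute group rather than over the whole essay.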
3.4.1 Analysis Using Built-in WordMap Comparator Module
There were several characteristics of the ESL essay corpus and the current
WordMap system that led to the decision to extract all of the attributes from the .rnd
data files to perform analysis outside of WordMap. To analyze the essays using the
comparator module, each essay was grouped with other essays with the same holistic
score in the same file, resulting in five standard files. Instructions for the comparator
module indicated that these standard files should be about the same size. However, in the
Winter 2002 files being processed, this corpus subset ranged from only 2,800 words for
level one essays to 19,300 for level three essays.
Figure 3-3 details the basic function of the built-in WordMap comparator system
(Lytle, 1986, Comparator section). Note that the selected single essay is compared,
entire attribute group by entire attribute group, to the five standard attribute files
containing all of the attribute values for a holistic score level. The final comparison
result of a test essay with the five files is the accumulated score of selected groups.
A few essays were used to try out the self-prediction capability of the default
comparator module for various groups. Groups that were tried included the CTA, CHK
and FTR groups (see Table 3-4). For these initial randomly selected essays, the module
predicted only 3 of 10 scores within ±1.0. This is far below the baseline of simply
assigning every essay a holistic score of 3, the score given to 42% of the Winter 2002
essays.
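The baseline comparison can be made concrete with a short Python sketch. The score distribution below is invented for illustration (the actual distribution of the 60 Winter 2002 scores is not listed here); it is chosen only so that a score of 3 covers 25 of 60 essays, or about 42%. The function name is hypothetical.

```python
from collections import Counter


def majority_baseline(scores):
    """Return the most common score and the share of essays that received it."""
    label, count = Counter(scores).most_common(1)[0]
    return label, count / len(scores)


# Hypothetical distribution of 60 holistic scores, not the thesis data.
scores = [1] * 6 + [2] * 14 + [3] * 25 + [4] * 11 + [5] * 4
baseline_label, baseline_share = majority_baseline(scores)

# Hit rate reported for the comparator trial: 3 of 10 essays.
comparator_rate = 3 / 10
beats_baseline = comparator_rate > baseline_share
```

A predictor that cannot beat "always guess the most common score" adds no information, which is the point of the comparison in the text.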
Figure 3-3: Default comparator module operation in WordMap.
Given the comparator's limited predictive capability for ESL scores in the sample
corpus, it was expected that selecting and analyzing the entire set of attributes separately
outside of the WordMap environment via different techniques would improve on
this matching accuracy. The built-in WordMap variable extraction module was
supplemented by Perl and QuickBasic programs written for this study in order to extract
all of the WordMap attributes to a single file containing the attributes for all of the essays
in a subset of the corpus such as the Winter 2002 essays.
3.4.2 Correlation Analysis for Attributes
After extracting all of the WordMap attribute data for the Winter 2002 essays, it
soon became apparent why the built-in comparator module had difficulty with this data.
Lytle (2005, p. 214, Table 1) had listed some attributes that he had found to be highly
correlated with linguistic maturity studying students learning English as their native
language in elementary and secondary schools. At first a few of these attributes were
analyzed by the statistics package (WINKS) for correlation analysis, one variable at a
time.
Attributes that had been shown by Lytle to have the best correlation values for
elementary and secondary school native English students were found to be much less
significant for the ESL essays. For example, the nmprn personal pronoun attribute had
been shown over the years to have a high negative correlation value of r = -0.919 with
linguistic maturity ratings. For the 60 ESL Winter 2002 essays, however, the value for r
was only -0.193. Lytle found punctuation marks frequency to be correlated with
linguistic maturity with a value of r = 0.671, but for the ESL essays, that value was only r
= 0.066. With these individual attributes much less significant for the ESL essays than
the native English essays, the summation of these attributes (e.g. the whole CTA group)
could be expected to have lower prediction ability for holistic scores.
A Perl program was written for this thesis to compute the correlation of each of
the nearly 1,000 attributes with the average of the judges' holistic score values.
Table 3-6 shows several of the top positively and negatively correlated attributes. These
highly correlated attributes became the ones used in this study for analysis and prediction
of holistic scores of the ESL essays. For further explanations about the attribute groups,
see Table 3-4. These attributes are described in detail next.
Group                   Attribute Name  Correlation (r)
CHK: Grammar            CHK-Length       0.778
CTA: Parts of Speech    CTA-Gain         0.772
PTA: POS Trigrams       PTA-Gain         0.772
SLA: Select Categories  SLA-seg         -0.548
WDA: Vocabulary         WDA-Ratio        0.766
Table 3-6: Correlations for WordMap attributes with the judges'
holistic scores for the set of 60 Winter 2002 ESL essays.
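The correlation screening can be sketched as follows. This is a minimal Python illustration of ranking attributes by the absolute value of Pearson's r against the average judges' score, not the Perl program written for the thesis. The toy data are invented, and treating an attribute that is absent from an essay as 0 is an assumption made by the caller.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient; None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def rank_attributes(table, scores):
    """Sort attributes by |r| with the scores, strongest first.

    table: {attribute name: [one value per essay]}, in the same essay order
    as scores. Constant columns are skipped because r is undefined for them.
    """
    ranked = []
    for name, values in table.items():
        r = pearson_r(values, scores)
        if r is not None:
            ranked.append((name, r))
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)


# Invented toy data for five essays (not taken from the corpus).
avg_scores = [1, 2, 3, 4, 5]
toy_table = {
    "length": [50, 90, 160, 200, 260],
    "seg": [12, 9, 7, 6, 4],
    "flat": [1, 1, 1, 1, 1],
}
ranking = rank_attributes(toy_table, avg_scores)
```

Sorting by absolute value keeps strongly negative attributes, such as SLA-seg in Table 3-6, near the top of the list alongside the positive ones.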
3.4.3 Essay Length Attribute
Longer ESL essays tend to receive higher holistic scores. These essays were written
under a 30-minute time constraint and a longer essay seems to be one indication of greater
writing skill. The correlation value r for this variable, designated CHK-Length, is 0.778,
making it the most highly correlated linguistic attribute in the set of essays. The length
of the 60 Winter 2002 essays averaged 177 words, ranging from a low of 13 words to a high
of 352 words. Figure 3-4 shows a graph of this attribute compared with the average
holistic score along with the linear correlation.
Figure 3-4: CHK-Length correlated with holistic score for training essays.
3.4.4 CTA-Gain: Part of Speech Summary Attribute
Two key WordMap summary attributes that have never been used outside the
system for predictions ended up being tied for second place (r = 0.772) just slightly
behind CHK-Length in their correlation with the holistic scores of the 60 Winter 2002
ESL essays. The first of these two attributes is called CTA-Gain and represents the
percentage of the 96 WordMap specialized part of speech categories that occur in the
writing sample. An example of one of these categories is aj+est for a superlative
adjective, e.g. fastest.
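Read as a coverage measure, CTA-Gain can be illustrated with a few lines of Python. This is a sketch under that reading, not WordMap's own calculation. The real inventory has 96 categories that are not listed here, so the four-tag inventory below is a toy stand-in built from tags mentioned in this chapter, and the function name is hypothetical.

```python
def category_gain(observed_tags, inventory):
    """Percentage of the inventory's categories that occur at least once."""
    inventory = set(inventory)
    present = inventory & set(observed_tags)
    return 100.0 * len(present) / len(inventory)


# Toy inventory (the real one has 96 categories).
toy_inventory = ["nc", "nmprn", "aj+est", "ccn"]

# Tags seen in a hypothetical essay; repeats and tags outside the
# inventory do not change the result.
essay_tags = ["nc", "ccn", "nc", "nmr"]
toy_gain = category_gain(essay_tags, toy_inventory)
```

Because each category counts once no matter how often it occurs, the measure rewards variety of categories rather than their frequency.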
Many of the attributes are specified using descriptions based on Lytle's Junction
Grammar theory of language (Lytle 1979b, 1986, 2006). An example is the definite
article "the", which is coded as nmr for its part of speech attribute in WordMap. In
Junction Grammar terminology, a definite article is a nominal (n) modalizer (m) that
triggers a retrieval (r) of an existing semantic entity in the memory space.
Table 3-7 shows several of the CTA individual attributes with their individual
correlation with holistic scores, sample sparseness and frequencies. Some are very
sparse, such as the aj+er comparative found in only 7% of the training essays. The
"typical %" is the percentage of an essay's words that belong to this category; a value
of 0.5 would be one such word in 200. The "high %" is the highest percentage
found in any essay for this feature; a value of 4.3 would mean that over 1 in every 25
words in an essay has this part of speech classification. It is very interesting that the
summary attribute (CTA-Gain) is highly correlated with the holistic scores but that the
individual attributes comprising this summary correlate much less.
POS    Description             % essays w/ attr  Typical % in essay  High % in essay  Example              Correlation (r) with judges' score
qav    quantifier for adverb   16                0.5                 4.3              very highly favored  -0.186
aj+er  comparative             7                 0.6                 0.9              smaller               0.157
ccn    conjunction             98                3.5                 9.5              and                   0.190
mrk    punctuation             94                4                   16               !                     0.217
Table 3-7: CTA part of speech individual attributes and correlation values
with the judges' holistic scores using the 60 training essays.
Figure 3-5 shows a graph of this attribute compared with the average holistic score for
the Winter 2002 essays.
Figure 3-5: CTA-Gain correlated with holistic scores from judges for essays
3.4.5 PTA-Gain: Part of Speech Pattern Summary Attribute
The second key WordMap-specific variable that is tied for second in its
correlation with the holistic scores of the 60 ESL Winter 2002 essays is called PTA-Gain
(r = 0.772). This summary attribute is a simple percentage of how many of the 84
different types of specialized part-of-speech (POS) trigram patterns occur in the text.
Lytle selected these patterns from a large database of competent student writings
(Lytle, 2006). An example of one of these patterns is aj-nc-seg, which is an adjective
followed by a common noun and an end-of-sentence marker, e.g. "red ball." at the end of
"John kicked the red ball." Note that some of these patterns include punctuation such as
end-of-sentence punctuation. Correlation was also calculated for the individual pattern
attributes, whose values are sometimes sparse. Table 3-8 shows a sample of some of these
attributes and their individual r correlation values using the frequency percentage values
for the 60 training essays.
Trigram POS Pattern  % of essays with attribute  Typical % in essay  High % in essay  Example                   Correlation (r) with judges' score
nc-ccn-nc            32                          0.70                2.0              boys and girls             0.006
aj-nc-seg            60                          1.08                3.3              red ball <end of sent>    -0.207
vc-p-nmr             23                          0.33                1.4              went onto the              0.053
p-nmps-nc            62                          1.1                 2.1              on my roof                 0.403
seg-nmprn-vc         82                          2.2                 7.6              <end of sent> Henry went  -0.482
Table 3-8: PTA trigram POS patterns and correlation values for the percentage of their
use to judges' holistic scores. (Lytle, 1986, “Lists,” p. 21)
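The trigram idea behind PTA-Gain can be sketched in Python as a sliding window over a sequence of POS tags. The tag sequence below is an illustrative assumption pieced together from the examples in Table 3-8 (it is not WordMap output), the five-pattern inventory is a toy stand-in for the real list of 84 patterns, and the function names are hypothetical.

```python
def pos_trigrams(tags):
    """Join every run of three consecutive tags into a pattern label."""
    return ["-".join(tags[i:i + 3]) for i in range(len(tags) - 2)]


def pattern_gain(tags, pattern_inventory):
    """Percentage of the inventory's trigram patterns that occur in the text."""
    inventory = set(pattern_inventory)
    found = inventory & set(pos_trigrams(tags))
    return 100.0 * len(found) / len(inventory)


# Assumed tagging of "<end of sent> Henry went onto the red ball <end of sent>".
tags = ["seg", "nmprn", "vc", "p", "nmr", "aj", "nc", "seg"]
trigrams = pos_trigrams(tags)

# Toy inventory: the five patterns shown in Table 3-8.
toy_patterns = ["nc-ccn-nc", "aj-nc-seg", "vc-p-nmr", "p-nmps-nc", "seg-nmprn-vc"]
toy_pta_gain = pattern_gain(tags, toy_patterns)
```

Including the sentence-boundary tag in the sequence is what allows patterns such as aj-nc-seg and seg-nmprn-vc to capture how sentences end and begin.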
Figure 3-6 shows a graph of the PTA-Gain attribute compared with the average
holistic score.
So far, it is interesting to note that the most highly correlated attributes cover
different dimensions of linguistic maturity. The essay length attribute is a
statistical attribute that is easily measured. The PTA-Gain and CTA-Gain summary
attributes and their individual components capture short and longer syntactic patterns
and contain some of the most interesting attribute information, reflecting the particular
point of view of the Junction Grammar theory.
Figure 3-6: PTA-Gain correlated with holistic scores from judges for essays.
As the current study continued, still more dimensions of attributes also seemed to
be highly correlated with the holistic scores of these training essays. It was hoped that
attributes of different kinds would complement each other in making
better quality predictions.
3.4.6 SLA-seg: Sentence Density
The attribute with the strongest negative correlation is the sentence
density attribute, or SLA-seg (r = -0.5475). This attribute captures WordMap's
calculation of the number of sentences per 100 words. Most essays with lots of simple
sentences are graded lower, but there are exceptions with essays that contain run-on
sentence after run-on sentence. The 60 Winter 2002 essays have an average of 6.3
sentences per 100 words, with a low of 0.76 sentences per hundred (an extreme example
of run-on sentences), and a high of 12.1 sentences per hundred words.
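As a worked illustration of this measure, the Python sketch below computes sentences per 100 words from two counts. How WordMap itself segments sentences and counts words is not described here, so the function is only an assumption about the arithmetic, and its name is hypothetical.

```python
def sentences_per_100_words(n_sentences, n_words):
    """Sentence density: number of sentences per 100 words of text."""
    return 100.0 * n_sentences / n_words


# Corpus-wide averages quoted in Section 3.1: 11.2 sentences in 165 words.
corpus_density = sentences_per_100_words(11.2, 165)

# A density near the reported low of 0.76 corresponds, for instance, to a
# single sentence boundary in roughly 131 words (an illustrative figure).
run_on_density = sentences_per_100_words(1, 131)
```

The measure is the inverse of average sentence length scaled by 100, so very long run-on sentences push it toward zero while strings of short simple sentences push it up.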
Figure 3-7 shows a graph of this attribute compared with the average holistic
score.
Figure 3-7: SLA-seg correlated with holistic scores from judges for essays.
3.4.7 WDA-Ratio: Vocabulary Summary Attribute
After selecting two summary attributes that reflect the part of speech categories
and patterns, plus another two with the essay length and sentence density, another
attribute was selected that reflects the vocabulary richness of the writing in the essay.
This attribute is called WDA-Ratio and represents the density of content words compared
to a fixed list of function words in the essay (Lytle, 1986). Its correlation is calculated at
r = 0.766, and it is ranked number 7 out of the total of 961 attributes. Figure 3-8 shows a
graph of this attribute compared with the average holistic score.
Figure 3-8: WDA-Ratio correlated with holistic scores from judges for essays.
3.5 Training Essays and Test Essays in Corpus
The validity of a scientific hypothesis depends on its ability to make new
predictions that can be verified by observations. The standard method of running tests
and evaluating the effectiveness of predictive algorithms is followed in this study. The
first step was to train the system on a sample set of data called a training set. Attribute
analysis of the training set essays using statistical packages or manual analysis helped
develop a predictive algorithm analyzing selected attributes to predict the holistic scores.
The algorithm was then further tested and refined by testing how well it predicted the
scores of the individual essays in the training set itself. The algorithm was then deemed
ready to predict the holistic scores of a new set of unseen test essays that played no part
in the development of the algorithms.
In this study, the training set consists of 60 ESL essays from the Winter 2002
classes at the BYU English Language Center. Once the necessary programming pieces
had been developed to collect the data from its original formats, analyze it in the
WordMap system and extract the data with all the variables for all of the 60 training
essays, the task of analyzing the data began, along with creating the methods,
algorithms, and selected data examples used to predict holistic scores. The other four
groups of essays were reserved for the testing phase to evaluate the effectiveness of the
predicted scores.
Figure 3-9 shows the overall flow of information for this study. The ESL training
and test set essays are processed through identical steps to generate and extract their
linguistic maturity attributes (four modules above the dashed line). Once the attributes
from the training set are available, the data can be analyzed and individual attributes
selected. The selected attributes are used to derive a customized algorithm that uses the
attributes to predict a holistic score. The algorithm can then be used to predict holistic
scores for the test essays that were not consulted during the development of the prediction
algorithms. After predicting scores for the test sets, their accuracy can be evaluated.
Up to this point we have discussed the analysis of the Winter 2002 essays (i.e. the
training set) and the extraction of attributes followed by the process of selecting attributes
that will be used to predict holistic scores. The derivation of the predictive algorithms or
settings for memory-based processing will now be considered.
Figure 3-9: Overview of holistic score analysis and prediction process for ESL Essays in this
thesis. Above the dashed line is the data preparation phase of the research. Below the dashed
line is the analysis and prediction portion of the research.
3.6 Single and Multiple Variable Regression Analysis
Up to this point, the essays were extracted from their raw document files and
analyzed by the WordMap program. The resulting attributes were then extracted into a
single file that holds the scores and attributes for all of the essays in the training set. At
this point enough information has been collected and organized to derive formulas that
can then predict the holistic score for a new essay based on one or more of the extracted
variables.
The average judged holistic score and the five extracted linguistic maturity
summary attributes were imported into a standard statistical package. This allowed the
single and multiple regression statistical analysis for deriving linear equations using the
attributes and thus for predicting holistic scores. Various combinations of the attributes
were used. Table 3-9 shows several of the derived formulas using various combinations
of the selected attributes.
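For the single-variable case, the kind of linear equation produced by regression analysis can be sketched with the closed-form least-squares solution. The thesis used the WINKS statistical package for this step. The function below is only an illustrative stand-in, and its name and sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of Score = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical attribute values and judged scores lying exactly on a line.
attribute_values = [10.0, 20.0, 30.0, 40.0]
judged_scores = [1.5, 2.5, 3.5, 4.5]
intercept, slope = fit_line(attribute_values, judged_scores)
```

Fitting the real training data would require the full set of 60 attribute values and average scores, which are not reproduced in this section.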
Table 3-9: Sample linear formulas derived from WINKS 4.80 multivariable regression analysis
for holistic score prediction using various combinations of the selected WordMap attributes.
3.7 Custom Excel Spreadsheets for Prediction and Evaluation of Scores
Returning to the basic information flow outline of Figure 3-9, as the linguistic
maturity attributes were selected and analyzed for this study, various prediction
algorithms were derived. The prediction algorithms were refined and tested by
comparing how well they predict the scores of the 60 essays in the training set, also
called a self-test of the training data. This refinement cycle was supported by custom
Excel spreadsheets to calculate predicted scores and evaluate how closely the scores
relate to the scores assigned by the judges.
CHK-Length PTA-Gain CTA-Gain SLA-seg WDA-Ratio   Linear Regression Formula
x                                                Score = -0.4748375 + 0.0916602 x
x y                                              Score = 0.0342 x + 0.0436 y + 0.23
x y z                                            Score = 0.02486 x + 0.02586 y + 0.1318 z + 1.21
x y z w v                                        Score = 0.0030298 x + 0.0150636 y + 0.0375125 z
                                                         - 0.0437892 w - 0.1720084 v + 0.66971
Table 3-10 shows the basic visible format of the spreadsheet used to test the
prediction algorithm for the CTA-Gain variable by itself against the training data.
The file name and ID numbers allow cross-reference to the actual essay in the
appropriate corpus subset of essays. The two judges' scores, which are usually given in
whole numbers, are recorded. The average of the two scores becomes a half-point score
when the judges do not agree on the score but are within one point of each other.
      A            B    C    D     E          F           G       H       I        J
                                              Predicted   Round           Within   Within
 1    File         J1   J2   Ave   CTA-Gain   Score       0.5     Exact   0.5      1.0
 2    TR-01.rnd    2    3    2.5   37.1134    2.926984    3       1       1        1
 3    TR-02.rnd    2    1    1.5   25.7732    1.887539    2       1       1        1
 4    TR-03.rnd    2    3    2.5   32.98969   2.549004    2.5     1       1        1
 5    TR-04.rnd    2    2    2     26.80412   1.982034    2       1       1        1
 6    TR-05.rnd    1    1    1     18.5567    1.226073    1       1       1        1
 7    TR-06.rnd    3    3    3     44.3299    3.58845     3.5     0       1        1
 ...
 58   TR-57.rnd    1    2    1.5   21.64948   1.509558    1.5     1       1        1
 59   TR-58.rnd    3    3    3     36.08247   2.832489    3       1       1        1
 60   TR-59.rnd    2    2    2     28.86598   2.171024    2       1       1        1
 61   TR-60.rnd    1    1    1     19.58763   1.320569    1.5     0       1        1
 62
 63   Totals                                                      26      50       57
      % of 60 essays                                              43.33%  83.33%   95.00%
Table 3-10: Score prediction and evaluation spreadsheet CTA-Gain self test training set
The CTA-Gain numbers are the actual attribute values from WordMap. This
particular algorithm using CTA-Gain for prediction of the holistic score is encoded in the
spreadsheet as =-0.4748375 + E2 * 0.0916602 for cell F2. The predicted
score is rounded to half points using the formula =INT(2 * F2 + 0.5)/2 for cell
G2. This formula rounds 2.93 to 3. Three tests are made to compare the
predicted score with the judges' scores. Predicted scores that match the judges' scores
are counted first, then scores that are within 0.5 of the judges' scores and finally scores
that are within 1.0 of the judges' scores. In the cases where the judges agree on the
holistic score, this evaluation is straightforward. But when the judges disagree, and an average
score is derived, the criterion is expanded to agreement with either of the judges, or the
average score between the two.
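The two spreadsheet formulas quoted above translate directly into a short sketch. The coefficients and the rounding rule come from the text, while the function names are illustrative. Excel's INT rounds down, which math.floor reproduces here.

```python
import math

INTERCEPT = -0.4748375
SLOPE = 0.0916602

def predict_score(cta_gain: float) -> float:
    """Spreadsheet formula =-0.4748375 + E2 * 0.0916602."""
    return INTERCEPT + SLOPE * cta_gain

def round_to_half(score: float) -> float:
    """Spreadsheet formula =INT(2 * F2 + 0.5)/2, rounding to the nearest half point."""
    return math.floor(2 * score + 0.5) / 2

# CTA-Gain values for essays TR-01, TR-06 and TR-60 from Table 3-10.
results = [(cta, predict_score(cta), round_to_half(predict_score(cta)))
           for cta in (37.1134, 44.3299, 19.58763)]
```

Doubling the score before rounding and halving it afterwards is what turns whole-number rounding into half-point rounding.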
In the current example for essay number 1, one judge scored the essay at 2 and
the other at 3, with the average being 2.5. The predicted score of 2.93 was rounded to
3.0, which exactly agrees with judge number 2 and so qualifies for an exact match, a
within-0.5 match and a within-1.0 match evaluation score.
For essay number 6, both judges agreed that it should be scored at 3.0. The
predicted score was 3.59, which was rounded to 3.5. A score of 3.5 does not count as an
exact match with 3.0, but it is within 0.5 of 3.0 and receives an evaluation point for being
within 0.5 and for being within 1.0 of the judges' holistic scores.
The spreadsheet formulas used to test for matching results are shown in Table 3-11.
1   Exact match after rounding   =IF(OR(G3=C3,G3=D3,G3=E3),1,0)
2   Match within 0.5             =IF(OR(ABS(G3-C3) <= 0.5, ABS(G3-D3) <= 0.5),1,0)
3   Match within 1.0             =IF(OR(ABS(G4-C4) <= 1, ABS(G4-D4) <= 1),1,0)
Table 3-11: Excel spreadsheet formulas for evaluating predicted scores vs. judges scores
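The three match tests can be sketched as one function that follows the logic described in the text: an exact match against either judge or their average, and the two tolerance tests against either judge. The function name and return format are illustrative and are not taken from the spreadsheet.

```python
def evaluate(rounded: float, judge1: float, judge2: float):
    """Return (exact, within_half, within_one) as 1/0 flags."""
    average = (judge1 + judge2) / 2
    # Exact match: agreement with either judge or with their average.
    exact = int(rounded in (judge1, judge2, average))
    # Tolerance tests: agreement with either judge within the given margin.
    within_half = int(abs(rounded - judge1) <= 0.5 or abs(rounded - judge2) <= 0.5)
    within_one = int(abs(rounded - judge1) <= 1.0 or abs(rounded - judge2) <= 1.0)
    return exact, within_half, within_one

# Essay 1 (judges 2 and 3, rounded prediction 3.0) and
# essay 6 (judges 3 and 3, rounded prediction 3.5) from the worked examples.
essay_1 = evaluate(3.0, 2, 3)
essay_6 = evaluate(3.5, 3, 3)
```

Summing each flag over all 60 essays and dividing by 60 gives the percentages in the totals rows of Table 3-10.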
3.8 Custom Algorithm Development
An alternative to a complicated multi-attribute linear formula for predicting the
holistic score is a hybrid system with some linear components and some custom
logical components. In early versions of this thesis research, which used a rounding
formula to produce whole numbers for holistic scores instead of allowing half-point
values, the linear regression formula using CTA-Gain, PTA-Gain and WDA-Ratio was
supplemented with custom formulas using those three attributes plus the essay length
(CHK-Length) attribute. The second formula used the essay length to override the
regression analysis formula to better predict the level-one scores. The third formula used
a combination of PTA-Gain, CTA-Gain and WDA-Ratio values to better predict the
level-five scores, which were not well represented in the training set.
The complicated algorithm is documented in pseudocode in Figure 3-10. This
formula uses four of the five selected linguistic maturity variables. SLA-seg, sentence
density, is not included.
Formula 1:
    Score = 0.18904 + CTA-Gain * 0.0418584 +
            PTA-Gain * 0.0188846 + WDA-Gain * 0.7209701;
Formula 2:
    if (Score <= 2 && CHK-Length < 100) then
        Score = 1;
Formula 3:
    if (CTA-Gain > 40 && PTA-Gain > 40
        && (CTA-Gain + PTA-Gain) > 90 &&
        WDA-Ratio > 1) then Score = 5;
Figure 3-10: Basic formula derived from regression analysis with added custom
conditions from manual analysis of training set data.
3.9 Memory-based Machine Learning
A contrasting approach to linear multivariable regression or deriving custom
algorithms for prediction of holistic scores is to use the emerging technology of
memory-based learning that is being successfully applied to many linguistic processing
tasks (Daelemans and van den Bosch, 2005). This approach is particularly useful in
today's world of annotated text databases and extensive linguistic corpora available for
computer processing.
Memory-based learning methods make predictions using the linguistic text,
phonetic, syntactic or semantic data directly, instead of relying on hand-crafted or
computer-generated rules. This approach has relevance to psycholinguistic studies that
support a single memory and cognitive system for both regular and irregular linguistic
forms, as opposed to a rule-based system for regular forms with an exception list for
irregular forms.
This empirical or inductive approach to natural language processing (NLP) is
contrasted with the rationalist or deductive knowledge-based approach that in the
past has dominated this subfield of artificial intelligence research. Rules and decision
trees are often hand-crafted or computer-generated from the source data that forms the
basis for the language processing system. Memory-based language processing (MBLP)
is presented as a lazy learning approach that is contrasted with the eager learning
approach of formal rules. The eager systems try to abstract the data and filter out the
exceptional behaviors, whereas the lazy systems keep all of the data in memory for
processing as needed at retrieval time. A lazy system exhibits rule-like behavior when
certain paths through the memorized examples are followed over and over again, like an
oft-taken path in the countryside.
Data is entered into the MBLP system by combining a classifier or outcome with
selected encoded features. These features can be specific words, orthographic or
phonetic word representations, syntactic or semantic word categories, etc.
3.9.1 TiMBL Memory-Based Language Processing System
One of the most popular and sophisticated implementations of memory-based
learning comes from the ILK research group at Tilburg University in the Netherlands in a
system called TiMBL, an acronym for Tilburg Memory-Based Learner. This
memory-based machine learning program has been applied to linguistic tasks as varied as
German plural formation, Spanish word stress patterns and orthographic-to-phonemic
conversions (Daelemans and van den Bosch, 2005).
There are several reasons to select TiMBL for the memory-based learning tests
for this thesis. First, this system is widely used in research and is publicly available at no
cost as an open-source program. Second, TiMBL has many flexible options that enable
different statistical methods for nearest-neighbor calculations and rankings that allow for
customized applications to each data environment. Third, continuous numeric values
were required for the analysis in this study and that capability is supported by the TiMBL
system. Fourth, the machine learning libraries and toolkits used for many machine
learning systems require additional programming to implement whereas TiMBL can run
directly from input files with program parameter switches to control its operation.
3.9.2 Processing Options in TiMBL
TiMBL input data files are comma-separated values (CSV) files. For this thesis
a TiMBL input file contains one row of data for each ESL essay, holding that essay's
attributes. Each of the columns corresponds to a particular attribute across all of
the essays. The last column of each row is reserved for the outcome, the holistic score
for the essay. A single training file can be used for self-tests: TiMBL trains itself using
the training file and then, for the self-test, removes the current row from its memory set
(leave_one_out option) and tries to predict the holistic score based on the other essays'
attributes. This procedure prevents each row from identifying itself in the data as the
closest match.
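The leave-one-out procedure can be sketched independently of TiMBL. In the sketch below, the function names, the simple 1-nearest-neighbor predictor and the data rows are all hypothetical. They only illustrate why the current row must be held out of memory before it is predicted.

```python
def leave_one_out(rows, predict):
    """rows: list of (features, outcome); predict(memory, features) -> outcome."""
    predictions = []
    for i, (features, _outcome) in enumerate(rows):
        memory = rows[:i] + rows[i + 1:]  # hold out the row being predicted
        predictions.append(predict(memory, features))
    return predictions

def nearest_neighbor(memory, features):
    """Return the outcome of the single closest row (summed absolute differences)."""
    best = min(memory, key=lambda row: sum(abs(a - b) for a, b in zip(row[0], features)))
    return best[1]

# Hypothetical one-attribute rows: ([attribute value], holistic score).
rows = [([10.0], 1), ([12.0], 1), ([30.0], 3), ([33.0], 3)]
predicted = leave_one_out(rows, nearest_neighbor)
```

Without the hold-out step, every row would find itself at distance zero and the self-test would trivially report perfect accuracy.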
For the cases involving both a training set and a test set, two files are passed to the
TiMBL system. The system trains itself on the first file and then uses that data to predict
the test data set, one row at a time. After a nearest-neighbor analysis has been made and
a predicted outcome assigned, that outcome is compared against the last column of the
test set to report on how accurate the prediction was in the output report.
TiMBL includes a wide variety of options as program switches to vary the
statistical methods used for predicting the outcome or classification using the variable
values. A few of these settings were used for this thesis.
1. Varying the nearest neighbor number
The default number of nearest neighbors that are retrieved can be varied. The
author tried various settings and finally selected a setting of five nearest neighbors
(switch = -k5).
2. Varying the weighting of the distance measurements
Various criteria could determine how nearest neighbors decide the classification
of the predicted outcome. One approach is to have majority voting. Another is to have
the votes weighted according to the inverse of the distance formula (switch = -dID).
Both settings were tried and the latter one yielded the best results for discrete attribute
analysis of the CTA and PTA subattributes.
3. Discrete variable values vs. numeric variable values
The five selected linguistic maturity attributes were numeric variables as were the
PTA and CTA subattributes with their percentage use number. Numeric mode was used,
selected by the switch -mN.
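The effect of these settings can be illustrated with a small nearest-neighbor sketch that contrasts majority voting with inverse-distance weighting on numeric features. It is not TiMBL's implementation: the range-scaled distance, the function name and the data are assumptions, and TiMBL's own metrics and feature weighting are not reproduced.

```python
from collections import defaultdict

def knn_predict(memory, features, k=5, inverse_distance=True):
    """memory: list of (feature_list, outcome). Returns the winning outcome."""
    # Scale each numeric feature by its range in memory so none dominates.
    ranges = []
    for j in range(len(features)):
        column = [row[0][j] for row in memory]
        ranges.append((max(column) - min(column)) or 1.0)

    def distance(a, b):
        return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges))

    neighbors = sorted(memory, key=lambda row: distance(row[0], features))[:k]
    votes = defaultdict(float)
    for feats, outcome in neighbors:
        if inverse_distance:
            # Closer neighbors receive larger votes.
            votes[outcome] += 1.0 / (distance(feats, features) + 1e-9)
        else:
            votes[outcome] += 1.0  # simple majority voting
    return max(votes, key=votes.get)

# Hypothetical memory: one very close neighbor scored 3, two farther ones scored 2.
memory = [([5.1], 3), ([8.0], 2), ([8.5], 2), ([20.0], 1)]
majority = knn_predict(memory, [5.0], k=3, inverse_distance=False)
weighted = knn_predict(memory, [5.0], k=3, inverse_distance=True)
```

In this example majority voting follows the two farther neighbors, while inverse-distance weighting lets the single very close neighbor decide the outcome.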
3.9.3 TiMBL ESL Data formats
The TiMBL approach seemed to be most appropriate for this work in three areas:
(1) To provide another approach for holistic score prediction for the five
selected summary attributes. This approach uses numeric values for the attributes instead
of discrete variable values.
(2) To provide analysis and holistic score prediction based on the sparse PTA
and CTA subattributes that are summed up to make the PTA-Gain and CTA-Gain
attributes. The first approach uses discrete binary values indicating whether the
subattribute is used or not in the essay.
(3) To provide analysis and holistic score prediction for sparse PTA and CTA
subattributes using numeric values indicating the percentages of these subattributes in the
essay.
For the discrete attribute analysis (2) a customized Perl script was used to extract
out these often very sparse data elements and format them for the TiMBL system with the
variable values only indicating the presence (1) or absence (0) of the variable in the
essay. Table 3-12 shows extracts from the data file for the 77 out of 96 possible CTA
attributes used in the 60 training essays which sum together using existence or non-
existence to make the CTA-Gain variable. The actual file for use with the TiMBL
program does not contain the first row with CTA individual attribute number or the first
column with essay numbers.
In the table, each row represents a set of data for a single training essay. Each
column represents a subattribute of CTA, such as aj+er (comparative), that is defined as a
variable for TiMBL. The frequency and percentage use of that attribute are not used in
this file, only the fact that it exists, which is represented by a 1 (present) or a 0 (absent).
Recording only the existence of the attribute requires discrete variable value settings
instead of continuous-range numeric variable value settings in TiMBL. The last column
is the outcome, the holistic score given by the judges, which becomes TiMBL's outcome
variable for training the system.
Essay   1  2  3  4  5  6  7  8  9  10  ...  71  72  73  74  75  76  77  Score
5       1  0  0  0  1  1  0  0  0  0   ...  0   0   0   1   0   0   0   1
8       1  0  0  0  0  0  0  0  0  0   ...  0   0   0   1   0   0   0   1
12      1  0  0  1  0  0  0  0  0  0   ...  1   0   1   0   1   0   0   1
18      1  0  0  1  1  0  0  0  0  0   ...  0   0   0   1   0   0   0   1
30      1  0  0  0  0  0  0  0  0  0   ...  0   0   0   1   0   0   0   1
34      1  0  0  0  0  0  1  0  0  0   ...  1   0   1   1   0   0   0   1
39      1  0  0  0  0  0  0  0  0  0   ...  1   1   1   1   0   1   0   1
42      1  0  0  0  0  0  1  0  0  0   ...  1   0   1   1   0   0   0   1
43      1  0  0  0  0  0  0  0  0  0   ...  0   1   0   1   0   0   0   1
55      1  0  0  0  0  0  0  0  0  0   ...  1   0   1   1   0   1   0   1
60      0  0  0  0  0  0  0  0  0  0   ...  1   0   1   1   0   0   0   1
2       1  0  0  1  1  0  0  0  0  0   ...  0   1   1   1   0   0   0   2
4       1  0  0  0  0  0  0  0  0  0   ...  0   0   1   1   0   0   0   2
15      1  0  0  1  0  0  1  0  0  0   ...  1   0   1   1   1   0   0   2
Table 3-12: Spreadsheet with the 77 used CTA subattributes in the training set ready for use by
TiMBL as a 60 row training set file.
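The thesis used a customized Perl script for this formatting step, and that script is not shown. As a rough illustration only, the presence/absence encoding could be sketched in Python as follows. The function name, the input structure and every subattribute name other than aj+er are hypothetical.

```python
def presence_rows(essays, attribute_names):
    """essays: list of (usage, score), where usage maps a subattribute name
    to its percentage of use. Returns one comma-separated line per essay."""
    lines = []
    for usage, score in essays:
        # 1 if the subattribute occurs in the essay at all, otherwise 0.
        flags = ["1" if usage.get(name, 0) > 0 else "0" for name in attribute_names]
        lines.append(",".join(flags + [str(score)]))  # outcome goes in the last column
    return lines

names = ["aj+er", "hypothetical-a", "hypothetical-b"]
essays = [
    ({"aj+er": 1.234568}, 1),
    ({"hypothetical-a": 0.5, "hypothetical-b": 2.0}, 2),
]
lines = presence_rows(essays, names)
```

As in the actual TiMBL file, the output carries no header row and no essay-number column.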
For the numeric attribute analysis (3), a customized Perl script was used to extract
these often very sparse data elements and format them for the TiMBL system, with the
attribute values indicating the percentage of the usage of that attribute (e.g. 1.234568).
Table 3-13 shows extracts from the data file for the 84 possible PTA attributes used by
the 60 training essays, which sum together using existence or non-existence to make the
PTA-Gain attribute. In this case the data values represent the percentage of the
occurrence of a certain part-of-speech pattern, such as nc-ccn-nc (conjoined class nouns),
as the numeric value to use in predictions. This example illustrates the fact that portions
of this CTA and PTA subattribute space are very sparse.
Essay #   1          2   3          ...   82         83   84   Score
5         0          0   0          ...   0          0    0    1
8         0          0   0          ...   0          0    0    1
12        0          0   0          ...   0          0    0    1
18        0          0   0          ...   0          0    0    1
30        0          0   0          ...   0          0    0    1
34        0          0   0          ...   1.234568   0    0    1
39        0          0   0          ...   0          0    0    1
42        0          0   0          ...   0          0    0    1
43        0          0   0          ...   0          0    0    1
55        0          0   0          ...   0          0    0    1
60        0          0   0          ...   0          0    0    1
2         0          0   0          ...   1.242236   0    0    2
4         0          0   0          ...   0          0    0    2
15        1.694915   0   0.847458   ...   0          0    0    2
Table 3-13: Extract of part of the PTA-Gain subattributes frequency percentage array prepared as
an input file for TiMBL in numeric analysis mode.
4.0 Results
This chapter details the results of using the methodology described in chapter 3.
The results are divided into two main sections: (1) Algorithm-based predictions and
(2) Memory-based learning predictions.
4.1 Algorithm-Based Predictions
Only the summary attributes were used in the predictions that were derived by
algorithms. The attributes used were CHK-Length, CTA-Gain, PTA-Gain, SLA-seg and
WDA-Ratio as previously discussed.
4.1.1 Self-Prediction of Training Set with a Linear Formula for Each Attribute
Each of the five summary attributes had a similarly high correlation with the
training set holistic scores assigned by the judges. The self-prediction tests reflected this
high correlation. A linear formula was derived for each single variable via regression
analysis by the WINKS statistical program. Table 4-1 displays the predictions for the 60
essays of the training set compared with the judges' scores. This test shows how well the
algorithms predict the score of the essays in the training set itself, or self-prediction. The
algorithm for each attribute derived by regression analysis using the WINKS 4.80
statistical program is included in the table.
These results provide a baseline of how well the regression formula works on the
data that was used to generate it. The results reported in this thesis contain not only the
default standard of how many predicted scores were within 1.0 holistic score point of the
judges' scores, but also predictions that exactly match the judges' scores and those that
are within 0.5 of the judges' scores. Considering that the default reporting of being
within 1.0 of a judge's score of 3 would allow both 2 and 4 as possible predicted
scores, the exact prediction and the within-0.5 prediction reporting add a much more
stringent level of prediction to this kind of study.
Attribute    Exact Prediction   Within 0.5   Within 1.0   Regression Analysis Algorithm
CHK-Length   43.33%             88.33%       98.33%       1.0618116 + CHK-Length * 0.0079894
CTA-Gain     55.00%             83.33%       95.00%       -0.4748375 + CTA-Gain * 0.0916602
PTA-Gain     38.33%             81.67%       95.00%       0.8684423 + PTA-Gain * 0.0558809
SLA-seg      45.00%             66.67%       91.67%       3.3918714 + SLA-seg * -0.1452239
WDA-Ratio    43.33%             86.67%       95.00%       1.2406975 + WDA-Ratio * 2.2475918
Average      46.00%             81.33%       95.00%       N/A
Table 4-1: Holistic score self-test prediction compared with judges score for individual attribute
linear equations (Winter 2002).
4.1.2 Self-Prediction of Training Set with a Linear Formula for All Five Attributes
The individual formulas using each attribute provide very good predictions on the
training data. The combination formula for all five attributes provides a higher self-
prediction level for each category than the average of the five selected attributes
individually. These results are detailed in Table 4-2.
Exact Prediction   Within 0.5   Within 1.0   Regression Analysis Algorithm
46.67%             85.00%       100.00%      0.6697 + CTA-Gain * 0.0375125
                                             + PTA-Gain * 0.0150636
                                             + CHK-Length * 0.0030298
                                             + WDA-Ratio * -0.1720084
                                             + SLA-seg * -0.0437892
Table 4-2: Holistic score self-test prediction compared with judges score for one linear equation
containing all five variables (Winter 2002).
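The five-attribute equation in Table 4-2 can be written as a small function for experimentation. The coefficients are copied from the table. The function name, argument names and sample attribute values are hypothetical.

```python
def predict_five(chk_length, cta_gain, pta_gain, wda_ratio, sla_seg):
    """Five-attribute linear formula from Table 4-2 (before any rounding)."""
    return (0.6697
            + cta_gain * 0.0375125
            + pta_gain * 0.0150636
            + chk_length * 0.0030298
            + wda_ratio * -0.1720084
            + sla_seg * -0.0437892)

# Hypothetical attribute values for a single essay.
score = predict_five(chk_length=200, cta_gain=30, pta_gain=30,
                     wda_ratio=0.5, sla_seg=6.3)
```

The raw value would then be rounded to the nearest half point before comparison with the judges' scores, as in the single-attribute spreadsheets.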
4.1.3 Prediction of Test Set Scores with a Linear Formula for Each Attribute
Each single-attribute linear formula derived from the Winter 2002 training data
was applied to the previously unseen test data, comprising the Winter 1992, Fall 2001,
Winter 2001 and Summer 2001 test sets. The algorithms were listed previously in Table
4-1. Table 4-3 shows how well the CHK-Length essay length linear regression equation,
trained only with the Winter 2002 training essays, predicts the scores for the four test
sets. Note that throughout these results, the test predictions for the Winter 1992 essays
are lower than those for the 2001 essays. That difference may be due to the fact that
all training was done on the Winter 2002 data set: more standardization and
consistency in scoring by judges is assumed between 2002 and 2001 than between 2002
and 1992. Nevertheless, it is noteworthy that these levels of prediction accuracy were
obtained on the 1992 essays after training only on the 2002 essays.
Test Set Exact Prediction Within 0.5 Within 1.0
Winter 1992 25.00% 62.50% 79.17%
Fall 2001 44.16% 76.62% 90.91%
Winter 2001 26.67% 63.33% 86.67%
Summer 2001 51.16% 79.07% 97.67%
Average 36.75% 70.38% 88.61%
Table 4-3: Holistic score prediction for test sets compared with judges score for individual
attribute linear equation for CHK-Length essay length attribute.
Table 4-4 shows how well the CTA-Gain part-of-speech summary attribute linear
regression equation, trained only with the Winter 2002 training essays, predicts the
scores for the four test sets.
Test Set Exact Prediction Within 0.5 Within 1.0
Winter 1992 23.61% 54.17% 81.94%
Fall 2001 53.25% 85.71% 90.91%
Winter 2001 26.67% 63.33% 86.67%
Summer 2001 51.16% 79.07% 95.35%
Average 38.67% 79.07% 95.35%
Table 4-4: Holistic score prediction for test sets compared with judges score for individual
attribute linear equation for the CTA-Gain attribute.
Table 4-5 shows how well the PTA-Gain part-of-speech pattern summary attribute
linear regression equation, trained only with the Winter 2002 essays, predicts the
scores for the four test sets.
Test Set Exact Prediction Within 0.5 Within 1.0
Winter 1992 26.39% 66.67% 80.56%
Fall 2001 49.35% 87.01% 96.10%
Winter 2001 16.67% 53.33% 86.67%
Summer 2001 39.53% 76.74% 88.37%
Average 32.99% 70.94% 87.93%
Table 4-5: Holistic score prediction compared with judges score for individual attribute linear
equation for the PTA-Gain attribute.
Table 4-6 shows how well the SLA-seg sentence density attribute linear regression
equation, trained only with the Winter 2002 essays, predicts the scores for the four
test sets.
Test Set Exact Prediction Within 0.5 Within 1.0
Winter 1992 33.33% 61.11% 77.78%
Fall 2001 38.96% 67.53% 89.61%
Winter 2001 30.00% 50.00% 80.00%
Summer 2001 44.19% 55.81% 79.07%
Average 36.62% 55.81% 79.07%
Table 4-6: Holistic score prediction compared with judges score for individual attribute linear
equation for the SLA-seg attribute.
Table 4-7 shows how well the WDA-Ratio vocabulary richness attribute linear
regression equation, trained only with the Winter 2002 training essays, predicts the
scores for the four test sets.
Test Set Exact Prediction Within 0.5 Within 1.0
Winter 1992 20.83% 66.67% 80.56%
Fall 2001 45.45% 75.32% 90.91%
Winter 2001 26.67% 56.67% 90.00%
Summer 2001 41.86% 79.07% 100.00%
Average 33.70% 69.43% 90.37%
Table 4-7: Holistic score prediction compared with judges score for individual attribute linear
equation for the WDA-Ratio attribute.
4.1.4 Prediction of Test Set Scores with a Linear Formula for All Five Attributes
By combining all five of the selected summary attributes and deriving a multiple
variable regression linear prediction formula, it was hoped that the results would be better
than the individual attributes. The algorithm was previously shown in Table 4-2. Table
4-8 shows the correct prediction percentages using this combined formula on the four test
data sets.
Test Set Exact Prediction Within 0.5 Within 1.0
Winter 1992 22.22% 62.50% 81.94%
Fall 2001 53.25% 84.42% 93.51%
Winter 2001 26.67% 60.00% 83.33%
Summer 2001 51.16% 79.07% 95.35%
Average 38.33% 71.50% 88.53%
Table 4-8: Holistic score prediction compared with judges score for individual attribute linear
equation including all five selected summary attributes.
Table 4-9 compares the average prediction values over the four test sets for the
five attributes separately with the single algorithm putting together all five attributes.
The combined algorithm overall is an improvement over the individual single attributes
algorithms. The average of the averages row averages the first five rows of the table.
Summary Attribute                      Exact Prediction   Within 0.5   Within 1.0
CHK-Length                             36.75%             70.38%       88.61%
CTA-Gain                               38.67%             79.07%       95.35%
PTA-Gain                               32.99%             70.94%       87.93%
SLA-seg                                36.62%             55.81%       79.07%
WDA-Ratio                              33.70%             69.43%       90.37%
Average of single Attribute Averages   35.75%             69.12%       88.27%
All Five Attributes                    38.33%             71.50%       88.53%
Table 4-9: Comparison of individual attribute linear regression algorithms with a combined five
variable algorithm on predicting test essay scores.
4.1.5 Prediction of Test Set Scores with Customized Algorithm
Even with highly accurate holistic score prediction values for these algorithms,
single and multivariable, it was noticed that the predictions were less reliable at the low
and high ends of the scoring scale. A linear formula was derived for the three
variables CTA-Gain, PTA-Gain and WDA-Ratio. Adjustments were then made to the
derived score with custom formulas, developed for this thesis by inspection of the training
set data, that use the variables to better predict the holistic scores for levels 1 and 5. In
this algorithm, SLA-seg, the sentence density attribute, was not used. Level 5 was not
well represented in the training set sample, and this custom algorithm made adjustments
to better predict that level.
Table 4-10 shows the self-test predictions using this custom algorithm on the
training set data for the Winter 2002 essays. The linear regression algorithm using
CTA-Gain, PTA-Gain and WDA-Ratio is overridden by other formulas using the first
predicted score and attribute values. First, Formula #1 is used to derive a score based on
the three attributes. Then, Formula #2 is applied: if the score from the first formula is
less than or equal to two, and if the essay length attribute (CHK-Length) is less than 100,
the score computed by the linear regression algorithm is overridden and set to 1.
Finally, Formula #3 is applied: if CTA-Gain and PTA-Gain are both greater than 40,
their sum is greater than 90, and WDA-Ratio is greater than 1, the predicted score is
overridden and set to five. The table values evaluate the predictions for the training
essays themselves from this set of computed and customized formulas.
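The three formulas can be sketched as a single function that applies them in the order described. The coefficients and thresholds come from the thesis formulas. The function and argument names are illustrative, and the use of WDA-Ratio as the third term of Formula 1 is an assumption based on the surrounding text (the formula listing itself labels that term WDA-Gain).

```python
def custom_score(cta_gain, pta_gain, wda_ratio, chk_length):
    """Apply Formula 1, then the Formula 2 and Formula 3 overrides."""
    # Formula 1: three-attribute linear regression.
    # Assumption: the third term uses WDA-Ratio (listed as WDA-Gain in the figure).
    score = (0.18904 + cta_gain * 0.0418584
             + pta_gain * 0.0188846 + wda_ratio * 0.7209701)
    # Formula 2: short essays with low predicted scores are set to level 1.
    if score <= 2 and chk_length < 100:
        score = 1
    # Formula 3: strong CTA/PTA/WDA values are set to level 5.
    if (cta_gain > 40 and pta_gain > 40
            and (cta_gain + pta_gain) > 90 and wda_ratio > 1):
        score = 5
    return score

# Hypothetical attribute values illustrating each branch.
short_essay = custom_score(cta_gain=20, pta_gain=10, wda_ratio=0.5, chk_length=80)
strong_essay = custom_score(cta_gain=50, pta_gain=45, wda_ratio=1.2, chk_length=400)
typical_essay = custom_score(cta_gain=30, pta_gain=30, wda_ratio=0.8, chk_length=250)
```

Because the overrides run after Formula 1, an essay that triggers neither condition keeps its regression score unchanged.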
Table 4-11 shows the test set predictions for the four test sets of essays that use
this same set of formulas trained on the training set.
Exact Prediction   Within 0.5   Within 1.0   Custom Algorithm Description
51.67%             78.33%       95.00%       Formula 1:
                                                 Score = 0.18904 +
                                                     CTA-Gain * 0.0418584 +
                                                     PTA-Gain * 0.0188846 +
                                                     WDA-Gain * 0.7209701;
                                             Formula 2:
                                                 if (Score <= 2 && CHK-Length < 100) then
                                                     Score = 1;
                                             Formula 3:
                                                 if (CTA-Gain > 40 && PTA-Gain > 40
                                                     && (CTA-Gain + PTA-Gain) > 90 &&
                                                     WDA-Ratio > 1) then
                                                     Score = 5;
Table 4-10: Self-prediction of Winter 2002 essays using custom algorithm derived from training
data.
Test Set Exact Prediction Within 0.5 Within 1.0
Winter 1992 23.61% 59.72% 81.94%
Fall 2001 55.84% 76.62% 94.81%
Winter 2001 46.67% 60.00% 83.33%
Summer 2001 46.51% 74.42% 97.67%
Average 43.15% 67.69% 89.44%
Table 4-11: Prediction of holistic scores for four test sets using customized algorithm
derived from the training data (Table 4-10 formulas)
4.2 Memory-based Learning Predictions
Finally, TiMBL machine learning was used for two sets of predictions. The first
set predicted holistic scores based on the individual components of the CTA and PTA
linguistic maturity subattribute groups, in both numeric and discrete modes. The second
set predicted holistic scores using the numeric values of the five selected summary
attributes.
4.2.1 CTA and PTA Individual Attributes Training Set Self-Tests
The CTA-Gain and PTA-Gain attributes consisted of a count of the individual
parts of speech or part-of-speech patterns that were used by the writer in the essays. The
individual components, or subattributes of PTA-Gain or CTA-Gain, consist of a sparse
array of attribute values. Two separate modes were used to see if TiMBL could correctly
predict the holistic scores based on the training set of these sparse arrays. One of these
modes was for just the presence of the subattributes and the other was for the actual
percentage for the frequencies of the use of these subattributes.
Table 4-12 shows the self-prediction runs for the subattributes of the CTA and
PTA linguistic maturity groups. The training set is the Winter 2002 essays and each
essay in the training set is being tested. This is a self-prediction test using the
leave_one_out TiMBL option to see how well the training data could be predicted.
TiMBL options for the discrete runs were -k5 -a0, using the standard nearest-neighbor
matching algorithm IB1 and 5 nearest neighbors. TiMBL options for the numeric runs
were -k5 -a0 -mN to select numeric mode with 5 nearest neighbors and the standard
matching algorithm.
Attribute Exact Prediction Within 0.5 Within 1.0
CTA Discrete 41.67% 68.33% 88.33%
CTA Numeric 60.00% 73.33% 95.00%
PTA Discrete 53.33% 61.67% 85.00%
PTA Numeric 56.67% 60.00% 93.33%
Table 4-12: TiMBL self-test predictions of correct holistic scores and how they agree with the
judges scores using discrete and percentage subattributes of the CTA and PTA linguistic
maturity statistical groups (Winter 2002).
4.2.2 CTA and PTA Individual Attributes Test Sets Predictions
Near the end of this study, Perl programs and Excel spreadsheets were developed
to create training and test data to test the CTA and PTA subattribute score predictions for
the test sets. Unlike the training set self-prediction test, which could be run using a .csv
dump from a spreadsheet, the individual-named subattributes had to be matched and
aligned between the two data sets in order for TiMBL to use the proper variables for its
training and predictions. Not every attribute used in the training set was used by the test
set and vice versa.
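One way to handle the alignment problem described above is to build a shared column list from the union of subattribute names in both sets and fill missing values with zero. This is only a sketch of that idea: the thesis does not state how its Perl programs resolved unmatched attributes, and all names and data below are hypothetical.

```python
def align_columns(train, test):
    """train/test: lists of (usage, score), where usage maps a subattribute
    name to its value. Returns (names, train_rows, test_rows) with identical
    column order in both sets and the score in the last column."""
    # Assumption: use the union of names so that no attribute is dropped.
    names = sorted({name for usage, _score in train + test for name in usage})

    def to_rows(essays):
        return [[usage.get(name, 0.0) for name in names] + [score]
                for usage, score in essays]

    return names, to_rows(train), to_rows(test)

# Hypothetical essays: the sets share one subattribute and each has one of its own.
train = [({"a": 1.5, "shared": 2.0}, 1)]
test = [({"b": 0.5, "shared": 1.0}, 3)]
names, train_rows, test_rows = align_columns(train, test)
```

An alternative design would keep only the columns seen in the training set, since a memory-based learner has no stored examples for attributes that appear only in the test set.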
Initial runs similar to the self-tests described above, using the Winter 2002
training set and the Winter 1992 test set, are shown in Table 4-13. The results seem
disappointingly low after such good results in the self-test for the training set. The results
need to be reviewed to make sure that an alignment error was not made in the Perl
program. The current program requires modifications to work with each data set, and
preparing and running a TiMBL test trained on the Winter 2002 training set for a new
test data set is currently a slow process with manual and automatic editing steps. This
process was not needed for the self-test of the training set on itself.
Attribute       Exact Prediction   Within 0.5     Within 1.0
CTA Discrete    20.55%             27.40%         54.79%
CTA Numeric     24.66%             38.36%         63.01%
PTA Discrete    Under review       Under review   Under review
PTA Numeric     12.33%             13.70%         52.05%
Table 4-13: CTA and PTA subattributes predictions using TiMBL for training set Winter 2002
and test set Winter 1992.
4.2.3 TiMBL Predictions using Five Selected Summary Attributes
The most exciting result to report from this research is how well the TiMBL
program predicted holistic scores based on the five selected linguistic maturity
summary attributes CHK-Length, CTA-Gain, PTA-Gain, SLA-seg, and WDA-Ratio.
TiMBL was run with the settings -k5 -a0 -mN: 5 nearest neighbors, the standard
IB1 algorithm, and numeric mode for variable values. The results for the four test
sets are shown in Table 4-14.
Test Set       Exact Prediction   Within 0.5   Within 1.0
Winter 1992    41.67%             47.22%       87.50%
Fall 2001      70.13%             70.13%       98.70%
Winter 2001    46.67%             53.33%       96.67%
Summer 2001    62.79%             65.12%       100.00%
Average        55.31%             58.95%       95.72%
Table 4-14: Holistic score prediction for the four test data sets compared with judges' scores using
memory-based learning with all five selected summary attributes.
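The Average row of Table 4-14 agrees with an unweighted mean of the four test-set percentages, in which each test set counts equally regardless of how many essays it contains. The short check below recomputes those means from the table values. The variable names are illustrative, and the 0.01 tolerance allows for rounding in the reported figures.

```python
# Percentages copied from Table 4-14: (exact, within 0.5, within 1.0).
results = {
    "Winter 1992": (41.67, 47.22, 87.50),
    "Fall 2001": (70.13, 70.13, 98.70),
    "Winter 2001": (46.67, 53.33, 96.67),
    "Summer 2001": (62.79, 65.12, 100.00),
}
reported = (55.31, 58.95, 95.72)

# Unweighted mean of each column across the four test sets.
averages = tuple(sum(column) / len(column) for column in zip(*results.values()))

# True when every recomputed mean matches the reported Average row.
matches_table = all(abs(a - r) < 0.01 for a, r in zip(averages, reported))
```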
Table 4-15 compares the TiMBL prediction percentages using all five attributes
with the earlier approaches that apply linear prediction formulas from regression
analysis to all five attributes. Exact matches are clearly much better with the
TiMBL five-attribute memory-based predictions than with all other approaches (18 percentage
points higher), and the TiMBL matches within 1.0 of the judges' scores are above all the
algorithm-based approaches, averaging seven percentage points above. The TiMBL
predictions within 0.5 points of the judges' scores, on the other hand, are lower than
those of the linear algorithms for all but one of the linear formulas. Overall, however,
the predictions within 1.0 of the judges' scores, together with the very accurate exact
match scores, seem to indicate a clear winner for memory-based machine learning as a
reliable and flexible retrieval mechanism for prediction of linguistic maturity scores for
ESL essays.
Analysis Method                           Exact Prediction   Within 0.5   Within 1.0
Linear equation for all five attributes   38.33%             71.50%       88.53%
Customized algorithm                      43.15%             67.69%       89.44%
TiMBL memory-based learning               55.31%             58.95%       95.72%
Table 4-15: Holistic score prediction averages for the four test data sets compared with judges'
scores using all five selected attributes using the regression algorithms, customized algorithms and
memory-based learning using TiMBL.
5.0 Conclusions
Both algorithmic and machine learning implementations of holistic score
prediction for ESL essays using linguistic maturity attributes were part of this study.
After analysis of the data, one of the conclusions of this study is that prediction of
holistic scores for ESL essays using linguistic maturity attributes can provide a level of
accuracy, compared with human judges' scores, that has not been demonstrated before
using either of these methods. The best earlier level of 66% agreement within 1.0 point
of the judges' holistic scores has been consistently beaten by the 80% and 90% levels
achieved in this study, even reaching 100% for one of the test data sets.
It was very interesting to observe that the best attributes to select for prediction of
holistic scores ended up being high-level summary attributes that emphasized the writer's
strengths rather than the writer's mistakes. For example, the variety of part-of-speech use
attribute is much more highly correlated with holistic scores than a count of run-on
sentences or the density of the errors in the essay. It was also interesting that attributes
from different attribute groups were all highly correlated with the holistic scores. This
would seem to support the intuition that the holistic score is formulated from
several kinds of input, including the quantity of writing, variety of syntactic
structures, and variety of vocabulary.
Returning to the research question at the beginning of the thesis: Can
computer-generated holistic scoring of ESL essays achieve a level of accuracy such that this
process could become a useful and efficient tool for ESL teachers and students? The
E-Rater system, at a 97% level of accuracy, is considered useful and efficient
for college-level essay tests. By that standard, the 96% level of prediction (within 1.0
point of the judges' scores) from the TiMBL system on ESL
essays would seem to achieve that useful and efficient rating level.
One question that follows is what role such a system might play in the
administration of an ESL program. The current study provides predictions on a set of
essays that could be used to place students in different ESL classes but does not study the
more subtle differences between students at a single level (e.g. at level three). At least at
the student placement level in the ESL classes, this study would seem to demonstrate a
capability that could be useful in an ESL program.
Another important conclusion is that the memory-based predictions were
markedly more accurate than the pure algorithmic or custom formula predictions, especially
on exact score predictions. A possible reason is that the regression algorithms smooth the
predictive formulas, so exact predictions are rarer, whereas the nearest-neighbor
search in TiMBL works on the actual data, allowing individual comparisons and distance
checking to find the closest match among a set of actual data items.
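The smoothing effect suggested here can be illustrated with a toy comparison. The data below are invented so that scores cluster at two levels. They do not come from the study and do not reproduce its results. A least-squares line fitted to one hypothetical attribute yields continuous values that must be rounded to the half-point scale, while a leave-one-out nearest-neighbor lookup returns a score that actually occurs in the data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def round_to_half(value):
    """Round a continuous prediction to the nearest half-point score."""
    return round(value * 2) / 2


def nearest_neighbour_score(xs, ys, i):
    """Score of the closest other item (leave-one-out, k = 1)."""
    others = [j for j in range(len(xs)) if j != i]
    closest = min(others, key=lambda j: abs(xs[j] - xs[i]))
    return ys[closest]


# Hypothetical attribute values and judges' scores.
xs = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
ys = [2.0, 2.0, 2.0, 4.0, 4.0, 4.0]

slope, intercept = fit_line(xs, ys)
regression_exact = sum(
    round_to_half(slope * x + intercept) == y for x, y in zip(xs, ys)
)
neighbour_exact = sum(
    nearest_neighbour_score(xs, ys, i) == ys[i] for i in range(len(xs))
)
```

On this toy data the fitted line passes between the two clusters, so most of its rounded predictions land half a point away from the judges' score, while the nearest-neighbor lookup recovers each cluster's score exactly. Real essay data are far noisier, so this shows only the mechanism, not the size of the effect.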
5.1 Future Research
There is considerable future research that could follow from the above
conclusions. Since a variety of attributes has been shown to provide good predictive
power, other attributes should be investigated. This study began to investigate the
individual PTA and CTA subattributes. Further research could determine if these sparse
attributes, individually or in combination, can predict holistic scores.
The ETS E-Rater program uses about 50 attributes to distinguish a 1-6 score on
the SAT Essay exam for high school seniors. A topic of future research could determine
whether a WordMap-based system could distinguish between writing levels for an ESL
class at a given level. If such a successful prediction could be proved, then such a system
could be a useful tool for the ESL teacher to use as part of classroom instruction.
Another topic would be to see if writing improvement flags already generated by
WordMap for the essay could also be useful to the ESL teacher or student along with the
holistic score that is trained on the teacher’s grading sample essays and/or the standard
grading sample essays for the entire ESL program.
The BYU English Language Center regularly collects modest amounts of
evaluation data. Not only could a larger corpus of data be used for a future study, but
other portions of the evaluation, such as reading level and grammar usage understanding,
could be compared against the linguistic maturity attributes and holistic scores as well.
The current WordMap system runs in DOS mode and has some incompatibilities
with current PC operating systems. The Linux version of
WordMap currently under development is more stable than the version running in a DOS
box and should be investigated, along with programming updates that adapt the current
code base to avoid system calls that have compatibility problems. The various programs
developed for this thesis should also be better integrated to allow for more
extensive research studies.
REFERENCES
Hunter M. Breland and Eldon G. Lytle, 1990. Computer-Assisted Writing Skill
Assessment, Annual meeting of the American Educational Research Association
and the National Council on Measurement in Education, Boston.
Jill Burnstein, Claudia Leacock and Richard Swartz, 2001, Automated Evaluation of
Essays and Short Answers, Proceedings for 5th Computer Assisted Assessment
(CAA) Conference, A two-day conference for developers and practitioners of
CAA in higher education. July 2001, Loughborough University, Leicestershire,
UK.
Jill Burnstein and Martin Chodorow, 1999, Automated Essay Scoring for Nonnative
English Speakers, In Proceedings of the ACL99 Workshop on Computer-Mediated
Language Assessment and Evaluation of Natural Language
Processing. College Park, MD.
Gregory K. Chung and Harold E. O'Neil, Jr., 1997. Methodological Approaches to
Online Scoring of Essays. Center for the Study of Evaluation, University of
California, Los Angeles.
Walter Daelemans and Antal van den Bosch, 2005, Memory-Based Language
Processing, Studies in Natural Language Processing, Cambridge University Press.
Walter Daelemans, Jakub Zavrel, Ko van der Sloot and Antal van den Bosch, 2004,
TiMBL: Tilburg Memory-Based Learner Reference Guide version 5.1, ILK
Technical Report ILK 04-02, Tilburg University CNTS - Language Technology
Group, University of Antwerp, The Netherlands.
Susan Scott Ingle, 1994. Using Objective Criteria to Evaluate Proficiency in ESL
Writing. Brigham Young University Thesis.
Dharmandra Kanejiha, Arun Kumar and Surendra Prasad, 2003. Automatic Evaluation
of Students' Answers using Syntactically Enhanced LSA. In Proceedings of the
NAACL 2003 Workshop. Edmonton, Alberta, Canada. Association for
Computational Linguistics. pp. 53-60.
Ola Knutsson, Teresa Cerrato Pargman and Kerstin Severinson Eklundh, 2003.
Transforming Grammar Checking Technology into a Learning Environment for
Second Language Writing. In Proceedings of the NAACL 2003 Workshop.
Edmonton, Alberta, Canada. Association for Computational Linguistics. pp. 38-45.
Thomas K. Landauer, 2003. Pasteur's Quadrant, Computational Linguistics, LSA,
Education. In Proceedings of the NAACL 2003 Workshop. Edmonton, Alberta,
Canada. Association for Computational Linguistics. pp. 46-52.
Deryle Lonsdale and Strong-Krause, Diane, 2003. Automated Rating of ESL Essays. In
Proceedings of the NAACL 2003 Workshop. Edmonton, Alberta, Canada.
Association for Computational Linguistics. pp. 61-67.
Eldon G. Lytle, 1975, Junction Grammar Analysis of Quantifiers, Implementation
Guide, (BYU Translation Sciences Institute).
Eldon G. Lytle, 1977, Evolution of Junction Grammar, Junction Theory and
Application, v. 1. no. 1, (Provo, Utah: BYU Translation Sciences Institute).
www.junction-grammar.com.
Eldon G. Lytle, 1979a, Doing More with Structure, Junction Theory and Application,
v. 2. no. 2, (Provo, Utah: BYU Translation Sciences Institute).
Eldon G. Lytle, 1979b, Junction Grammar: Theory and Application, Sixth LACUS
Forum, Columbia, SC: Hornbeam Press, Incorporated, pp. 305-343.
Eldon G. Lytle, 1986, WordMAP® User's Guide, Linguistic Technologies, Inc., Pioche,
Nevada.
Eldon G. Lytle, 1993. Grammar Check Handbook: WordMAP® Version 4.10.
Linguistic Technologies, Inc. Pioche, Nevada.
Eldon G. Lytle, 2005, LANGUAGE in Capital Letters: Unity in Nature, beta E-Book at
url = www.language-icl.com.
Eldon G. Lytle, 2006, WordMAP CTA and PTA Attribute List for WordMAP III,
Linguistic Technologies, Inc., Pioche, Nevada.
Eldon G. Lytle and Nelson C. Matthews, 1986. Field Test of the WordMAP Writing Aids
System. Lincoln County School District, Panaca, Nevada.
Jong C. Park, Martha Palmer, and Gay Washburn. 1997. An English grammar checker
as a writing aid for students of English as a Second Language. In Proceedings of
the Conference of Applied Natural Language Processing (ANLP). Washington,
DC.
Carolyn P. Rosé, Antonio Roque, Durnisizwe Bhembe, Kurt Vanlehn, 2003. A Hybrid
Text Classification Approach for analysis of Student Essays. In Proceedings of
the NAACL 2003 Workshop. Edmonton, Alberta, Canada. Association for
Computational Linguistics. pp. 68-75.
Brian C. Roberts, 1983. Stylometry and Wordprints: A Book of Mormon reevaluation.
Master's thesis, Brigham Young University Statistics Department.
Laurence Rudner and Phil Gagne, 2001. An overview of three approaches to scoring
written essays by computer. Practical Assessment, Research and Evaluation,
7(26).
Xinyou Zhang, 1994. The Order of Difficulty of Six Types of Sentence-Combining
Structures for ESL Students. Brigham Young University Thesis.