Holistic Scoring of ESL Essays Using Linguistic Maturity Attributes

Brigham Young University

BYU ScholarsArchive

Theses and Dissertations

2006-07-21

Holistic Scoring of ESL Essays Using Linguistic Maturity Attributes

Ronald Millett

Brigham Young University - Provo

Follow this and additional works at: https://scholarsarchive.byu.edu/etd

Part of the Linguistics Commons

BYU ScholarsArchive Citation

Millett, Ronald, "Holistic Scoring of ESL Essays Using Linguistic Maturity Attributes" (2006). Theses and Dissertations. 762.

https://scholarsarchive.byu.edu/etd/762

This Thesis is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion

in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please

contact [email protected], [email protected].

AUTOMATIC HOLISTIC SCORING OF ESL ESSAYS USING

LINGUISTIC MATURITY ATTRIBUTES

Ronald P. Millett

A thesis submitted to the faculty of

Brigham Young University

in partial fulfillment of the requirements for the degree of

Master of Arts

Department of Linguistics and English Language

Brigham Young University

August 2006

BRIGHAM YOUNG UNIVERSITY

GRADUATE COMMITTEE APPROVAL

of a thesis submitted by

Ronald P. Millett

This thesis has been read by each member of the following graduate

committee and by majority vote has been found to be satisfactory.

________________________ ______________________________________

Date Deryle W. Lonsdale, Chair

________________________ ______________________________________

Date C. Ray Graham

________________________ ______________________________________

Date Diane Strong-Krause

BRIGHAM YOUNG UNIVERSITY

As chair of the candidate’s graduate committee, I have read the thesis of Ronald P.

Millett in its final form and have found that (1) its format, citations and bibliographical

style are consistent and acceptable and fulfill university and department style

requirements; (2) its illustrative materials including figures, tables, and charts are in

place; and (3) the final manuscript is satisfactory to the graduate committee and is ready

for submission to the university library.

________________________ _______________________________________

Date Deryle W. Lonsdale

Chair, Graduate Committee

Accepted for the Department _______________________________________

John S. Robertson

Associate Chair, Department of Linguistics and

English Language

Accepted for the College ________________________________________

Gregory Clark,

Associate Dean, College of Humanities

ABSTRACT

AUTOMATIC HOLISTIC SCORING OF ESL ESSAYS USING

LINGUISTIC MATURITY ATTRIBUTES

Ronald P. Millett

Department of Linguistics and English Language

Master of Arts

Automated scoring of essays has been a research topic for some time in

computational linguistics studies. Only recently have the particular challenges of

automatic holistic scoring of ESL essays with their high grammatical, spelling and other

error rates been a topic of research. This thesis evaluates the effectiveness of using

statistical measures of linguistic maturity to predict holistic scores for ESL essays using

several techniques. Selected linguistic attributes include parts of speech, part-of-speech

patterns, vocabulary density, and sentence and essay lengths.

Using customized algorithms based on multivariable regression analysis as well

as memory-based machine learning, holistic scores for test essays were predicted within

±1.0 scoring level of the human judges’ scores an average of 90% of the time. This level

of prediction is an improvement over the 66% prediction level attained in a previous

study using customized algorithms.

ACKNOWLEDGEMENTS

My dear wife, Rhonda, and my six children, Ron, Barbara, Olga, Preston, Tanya

and Tyler have been very tolerant of five years’ worth of off-hours schooling to obtain this

degree. Rhonda always encourages me to try to continue to improve myself and I greatly

appreciate that.

My life and career changed forever when I first became acquainted with Eldon

Lytle in an honors linguistics class at BYU in 1971. He is a great mentor and friend and

his linguistic insights are exceptional. He is a pioneer in the area of a more intuitive

alternative (Junction Grammar) to standard linguistics theories, statistical linguistic data

collection, grammar checking, machine-assisted translation and attribute matching. I

appreciate very much his allowing me to use the WordMap program to make this study.

Deryle Lonsdale has provided the inspiration to earn this degree and an example

of both breadth and depth of understanding across the field of computational linguistics.

He has been patient with my slow pace of progress and through coauthoring a paper with

me and “above and beyond” help with this thesis has helped me gain skills to be more

credible in both my software programming and in the linguistics field.

John Robertson and Alan Manning provided encouragement and help in the thesis and

paper writing classes. Alan Melby, my longtime friend and associate from the days of

the research for his Ph.D. dissertation, gave encouragement to me throughout these five

years of study. Diane Strong-Krause has patiently waited for a draft of this thesis to

finally be delivered and was a coauthor with Deryle Lonsdale of the paper that inspired

the thesis topic. My research paper for Ray Graham’s language acquisition class in 2003

was the direct precursor of this thesis and provided a beginning step into the feasibility of

this kind of study.


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1.0 INTRODUCTION

2.0 REVIEW OF LITERATURE

2.1 SURFACE FEATURE ANALYSIS: PROJECT ESSAY GRADE (PEG)

2.1.1 PEG as a Model System for Essay Analysis and Prediction

2.1.2 PEG Results and Analysis

2.2 GRAMMAR CHECKING PROGRAM APPLIED TO SCORING RESEARCH

2.3 LARGE-SCALE COMMERCIAL APPLICATION OF ESSAY SCORING: E-RATER

2.4 LATENT SEMANTIC ANALYSIS (LSA) BAG OF WORDS APPROACH

2.5 HYBRID SYSTEMS: BAG OF WORDS PLUS RULES

2.6 EXEMPLAR-DRIVEN MEMORY-BASED SYSTEMS

2.7 PROBLEMS WITH EXISTING APPROACHES

2.8 SPECIAL NEEDS OF ESL STUDENTS

2.9 HIERARCHY OF CHALLENGING SYNTACTIC STRUCTURES

2.10 ENGLISH GRAMMAR CHECKER TO ASSIST ESL

2.11 EVALUATION USING SYNTACTIC PARSER AND CUSTOMIZED ALGORITHM

2.12 IMPLICATIONS OF CURRENT RESEARCH ON SCORING OF ESL ESSAYS

3.0 METHODOLOGY

3.1 CORPUS SELECTION

3.2 PREPARATION OF ESSAYS

3.3 GENERATION OF WORDMAP LINGUISTIC ATTRIBUTES

3.4 EXTRACTION AND SELECTION OF WORDMAP ATTRIBUTES

3.4.1 Analysis Using Built-in WordMap Comparator Module

3.4.2 Correlation Analysis for Attributes for Training Set

3.4.3 Essay Length Attribute

3.4.4 CTA-Gain: Part of Speech Summary Attribute

3.4.5 PTA-Gain: Part of Speech Pattern Summary Attribute

3.4.6 SLA-seg: Sentence Density

3.4.7 WDA-Ratio: Vocabulary Summary Attribute

3.5 TRAINING ESSAYS AND TEST ESSAYS IN CORPUS

3.6 SINGLE AND MULTIPLE VARIABLE REGRESSION ANALYSIS

3.7 CUSTOM EXCEL SPREADSHEETS - PREDICTING AND EVALUATING SCORES

3.8 CUSTOM ALGORITHM DEVELOPMENT

3.9 MEMORY-BASED MACHINE LEARNING

3.9.1 TiMBL Memory-Based Language Processing System

3.9.2 Processing Options in TiMBL

3.9.3 TiMBL ESL Data formats

4.0 RESULTS

4.1 ALGORITHM-BASED PREDICTIONS

4.1.1 Self-Prediction of Training Set - Linear Formula - Each Attribute

4.1.2 Self-Prediction of Training Set - Linear Formula - All Five Attributes

4.1.3 Prediction of Test Set Scores - Linear Formula - Each Attribute

4.1.4 Prediction of Test Set Scores - Linear Formula - All Five Attributes

4.1.5 Prediction of Test Set Scores with Customized Algorithm

4.2 MEMORY-BASED LEARNING PREDICTIONS

4.2.1 CTA and PTA Individual Attributes Training Set Self-Tests

4.2.2 CTA and PTA Individual Attributes Test Sets Predictions

4.2.3 TiMBL Predictions using Five Selected Summary Attributes

5.0 CONCLUSIONS

5.1 FUTURE RESEARCH

REFERENCES

LIST OF TABLES

Table 2-1: Highest positive and negatively correlated PEG surface features

Table 2-2: Hierarchy of difficult syntactic structures for ESL students

Table 3-1: Corpus of ESL essays for training and testing

Table 3-2: Spreadsheet extract for student name, ID and essay scores

Table 3-3: Cross reference file between WordMap files and holistic scores

Table 3-4: Attribute groups and examples from WordMap III text analysis

Table 3-5: Extract of custom .csv export file from WordMap for W 2002 essay #1

Table 3-6: Correlations for certain WordMap attributes with holistic scores

Table 3-7: CTA POS individual attributes and correlation with holistic score

Table 3-8: PTA trigram POS patterns and correlation with holistic score

Table 3-9: Sample linear formulas derived from multivariable regression analysis

Table 3-10: Score prediction and evaluation spreadsheet CTA-Gain self-test

Table 3-11: Spreadsheet formulas for evaluating predicted scores accuracy

Table 3-12: Spreadsheet with the 77 used CTA subattributes ready for TiMBL

Table 3-13: Spreadsheet with the 84 PTA subattributes ready for TiMBL

Table 4-1: Holistic score self-test prediction for individual attribute linear equations

Table 4-2: Holistic score self-test prediction five attribute linear equation

Table 4-3: Holistic score prediction for CHK-Length attribute linear equation

Table 4-4: Holistic score prediction for CTA-Gain attribute linear equation

Table 4-5: Holistic score prediction for PTA-Gain attribute linear equation

Table 4-6: Holistic score prediction for SLA-seg attribute linear equation

Table 4-7: Holistic score prediction for WDA-Ratio attribute linear equation

Table 4-8: Holistic score prediction for five attribute linear equation

Table 4-9: Comparison of individual and combined five attribute formulas

Table 4-10: Self-prediction of Winter 2002 essays using custom algorithm

Table 4-11: Holistic score prediction for four test sets using customized algorithm

Table 4-12: TiMBL self-test predictions for CTA and PTA subattributes

Table 4-13: TiMBL predictions for CTA and PTA subattributes using W 1992 set

Table 4-14: Holistic score prediction - 5 attributes for test data sets using TiMBL

Table 4-15: Holistic score prediction average for 5 attributes with three methods

LIST OF FIGURES

Figure 2-1: Block diagram PEG system

Figure 3-1: Essay extract from raw text information file for individual student

Figure 3-2: WordMap rough draft essay file ready for analysis

Figure 3-3: Default comparator module operation in WordMap

Figure 3-4: CHK-Length correlated with holistic score for training essays

Figure 3-5: CTA-Gain correlated with holistic scores from judges for essays

Figure 3-6: PTA-Gain correlated with holistic scores from judges for essays

Figure 3-7: SLA-seg correlated with holistic scores from judges for essays

Figure 3-8: WDA-Ratio correlated with holistic scores from judges for essays

Figure 3-9: Holistic score analysis and prediction process for ESL Essays

Figure 3-10: Basic formula with added custom additions

1.0 Introduction

Research into automatic holistic scoring of essays has many practical as well as

academic applications. Just as machine translation attempts to simulate the technical and

intuitive processes of human language translation, the automatic scoring of essays

attempts to capture elements of the complex intelligent processes of assigning a quality

rating to a writing sample. Various statistical measurements have been shown to be able

to closely predict a teacher’s or judge’s holistic score without human intervention.

One of the better known examples of automatic essay analysis is the Educational

Testing Service’s E-Rater system that is used to help judges grade the nationally-used

Scholastic Aptitude Test (SAT) Essay exams at the high school senior level of writing

and the Graduate Management Admissions Test (GMAT) Analytical Writing Assessment

(AWA) exam (Burstein, Leacock and Swartz, 2001, p. 3). Exam grading that was

formerly performed by three judges is now performed by one judge and the E-Rater

program, or an additional backup judge if the first judge and the computer do not agree

on the initial score.

This thesis evaluates the effectiveness of using statistical measures of linguistic

maturity to predict holistic scores for ESL essays using several techniques. The research

question to be evaluated is whether computer-generated holistic scoring of ESL essays

can achieve a level of accuracy so that this process could become a useful and efficient

tool for ESL teachers and students.

Automatic essay analysis techniques have only recently been applied to the area

of scoring writing samples from students of English as a Second Language (ESL). The

error rates of the ESL student often leave little text in the composition that a computer

analysis program could process successfully. Spelling errors, grammatical errors and

short or endless sentences are all problems that are accentuated for these students.

Lonsdale and Strong-Krause (2003) used variables from the publicly available

Link Grammar program to drive their customized prediction algorithm trained on 60 ESL

essays and tested on 240 additional essays. Their report that correct scores could be

predicted 66% of the time on test data within a standard tolerance of ±1.0 rating points

gives encouragement to see if other techniques can also be applied successfully to ESL

essays.

This study uses the same corpus of ESL essays that the Lonsdale/Strong-Krause

study analyzed and tested. Linguistic data for the current study was generated by

Linguistic Technology Inc.’s WordMap III grammar checking and linguistic analysis

package. The linguistic maturity attributes were selected for this study by computing

correlation values for all the attributes and selecting attributes from different linguistic

attribute groupings. Standard single and multivariable regression techniques were used

by this study to derive customized prediction algorithms that were compared against

predictions made by the memory-based learning TiMBL system for the corpus test sets.

The most successful algorithms predicted holistic scores within ±1.0 rating point

accuracy 96% of the time.

2.0 Review of Literature

This chapter will review the literature for written English essay grading in general

and the application to English as a Second Language studies in particular.

2.1 Surface Feature Analysis: Project Essay Grade (PEG)

One of the earliest systems to attempt to automatically score essays was called the

Project Essay Grade (PEG) developed by Ellis Page in 1968 (Chung, 1997). This system

concentrated on surface features of the essay. Various features were analyzed and

correlated with essay scores. These variables included average sentence length, length of

essay in words, number of commas, apostrophes, question marks, relative pronouns and

subordinating conjunctions and other connectives.

Page correlated the various collected variables with the human ratings. The

variables with the highest positive correlation with essay scores were word length,

number of commas, and essay length. Table 2-1 shows these highly correlated features

and their correlation value r.

PEG Feature                           Correlation (r)
Standard Deviation of Word Length      0.53
Average Word Length                    0.51
Number of Commas                       0.34
Essay Length (words)                   0.32
Number of Prepositions                 0.25
Number of Dashes                       0.22
Number of Uncommon Words              -0.48
Number of Apostrophes                 -0.23
Number of Spelling Errors             -0.21

Table 2-1: Highest positive and negatively correlated PEG surface features (Chung, 1997, table 1)
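The correlation value r reported for each feature is the standard Pearson coefficient. As a minimal sketch of how such a value could be computed for a single surface feature, the following Python function is offered for illustration only. The function name and the sample comma counts and scores are hypothetical and are not taken from the PEG studies.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: comma counts and human holistic scores for five essays.
comma_counts = [3, 7, 5, 12, 9]
human_scores = [2, 3, 3, 5, 4]
print(round(pearson_r(comma_counts, human_scores), 2))
```

A value near +1 or -1 marks a feature that rises or falls consistently with the human score, while a value near 0 marks a feature with little linear relationship to it.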

2.1.1 PEG as a Model System for Essay Analysis and Prediction

The PEG system is a good example of how a holistic score essay prediction

system works. First, a group of training essays were collected and assigned a grade by

multiple graders. In the initial PEG study, four graders were used to determine a holistic

score of overall quality of the essay. The electronic versions of the essays were analyzed

and features were collected. Correlation and regression analysis determined which

variables were most closely correlated with the essay holistic score. Multiple variable

regression techniques resulted in a prediction equation with the independent variables

being the features and the dependent variable the predicted holistic score. The equation

also included numbers to give appropriate weightings to the various features in the

algorithm.

Figure 2-1 shows a block diagram of the PEG essay grading system. Two sets of

essays are needed for the system to work: the training essays used to develop the

prediction algorithm, and the evaluation essays to try out that prediction algorithm on

new unseen essays. Human evaluation of the training essays gives the dependent variable

for the equation. Human evaluation of the validation essays is used to compare the

automatically generated predicted score with the score assigned by a human judge.

Figure 2-1: Block diagram PEG system (Chung, 1997, Figure 1). Two groups of essays are

needed, one group for training and the open-ended group to be evaluated or validation essays.

The validation procedure to compare the accuracy of the predictions is contained inside the

dashed line.
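The training and validation process described above can be sketched in a few lines of Python. This is an illustrative least-squares fit solved through the normal equations, not the PEG implementation. The two features (average sentence length and essay length in hundreds of words), the training values, and all function names are assumptions made for the example. The training scores were constructed from a known formula so that the recovered weights can be checked.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_model(features, scores):
    """Return least-squares weights [intercept, w1, w2, ...]."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, feature_row):
    """Apply the prediction equation to one essay's feature values."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


# Hypothetical training essays: (average sentence length, length in 100s of words).
training_features = [(10, 1.0), (12, 2.0), (15, 2.5), (18, 4.0), (20, 3.0)]
# Human scores built from score = 1 + 0.1*x1 + 0.5*x2 for illustration.
training_scores = [2.5, 3.2, 3.75, 4.8, 4.5]

weights = fit_linear_model(training_features, training_scores)
validation_essay = (14, 2.0)  # hypothetical unseen essay
print(weights, predict(weights, validation_essay))
```

In a real study the predicted score for each validation essay would then be compared with the score assigned by a human judge, as shown inside the dashed line of the block diagram.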

2.1.2 PEG Results and Analysis

The results of even the early PEG system were fairly good. On a test of 276

essays by 8th to 12th grade students in 1964 using 31 variables, the predicted scores

correlated with the score of the human judge at r = 0.50. Later studies in 1994 using 20

variables showed an even better correlation of r = 0.66 with the human judges’ scores.

These positive correlations were as high as the correlations among the individual

judges’ scores. Note that this is a correlation measurement and not the percentage of

predicted scores that fall close to the judges’ scores. That percentage, introduced later

in this chapter (Section 2.3), is the current standard for measuring predicted score

accuracy in this research area.

Strengths of this approach include the excellent correlation with human scores,

the simple computational approach correlated to lexical and other easily accessible

features (such as number of commas) and its straightforward methodology. Weaknesses

include its limitation to purely surface features that do not include grammatical syntax,

meaning, or context. The system must be recalibrated for each new application and the

essays are only scored relative to other essays of the same exact type.

2.2 Grammar Checking Program Applied to Scoring Research

Near the beginning of the personal computer era in 1981, one of the first

published software packages that incorporated composition evaluation measures was the

WordMAP® program from Linguistics Technologies, Inc. (Lytle, 1986, 1993). This

program was likely the first commercial grammar checker for the IBM Personal

Computer environment and in the author’s opinion still equals or exceeds in accuracy

most commercially available programs now included in modern word processors. The

author of this thesis was on an evaluation team in 1992 at WordPerfect where WordMap

was rated as one of the top two programs for grammar checking being evaluated at the

time.

Roberts (1983) conducted an initial study using WordMAP that used vocabulary

based statistical measurements to validate earlier studies on author identification in the

Book of Mormon. Subsequent research on student essays with the Heber school system

in Utah eventually led Lytle to receive a contract for four years of consulting work with

the Educational Testing Service, the company responsible for the SAT test. They studied

whether high school senior SAT essays could be scored effectively by computer,

matching the holistic scores given by a highly trained panel of three judges (Breland and

Lytle, 1990). This paper indicated that “good predictions of writing ability can be made

without the use of human readers.” WordMAP was able to predict ratings with a

correlation of 0.82 compared to three judges among themselves at a correlation of 0.74.

Independently of ETS, Lytle continued his research and developed further refinements of

his WordMap program and continued studies with the Lincoln County school district in

Nevada (Lytle and Matthews, 1986).

WordMap was designed to not only collect statistics that might be used to predict

scores, but to also provide feedback to the student on how his writing could be improved.

By producing a grammar checking program marketed to individuals as well as schools or

educational research institutions such as ETS, WordMap was promoted as a tool that

could be used constantly during a student’s education, analyzing, tracking and giving

suggestions for improvement at all stages of a student’s writing skills development.

Section 3.3 of this thesis provides more information on WordMap and its

capabilities and application to this study.

2.3 Large-Scale Commercial Application of Essay Scoring: E-Rater

The Educational Testing Service introduced a software product called E-Rater

(Rudner and Gagne, 2001) that aims at evaluating the writing proficiency of high school

students as they prepare to enter college. E-Rater, available since 1997, expanded on the

research between ETS and Linguistic Technologies in the 1986-1990 period that showed

a high correlation of grammatical and vocabulary-based variables with the holistic essay

ratings. This package is currently being used extensively and is perhaps the highest-

profile example of extensive use of automated essay scoring in a commercial

environment.

Burstein, Leacock and Swartz (2001) reported that since its implementation in

1999, over 750,000 Graduate Management Admission Tests (GMAT) have been graded

with E-Rater showing agreement rates within 1.0 point of a single judge’s score (1-6

possible holistic score) 97% of the time. When the agreement rate is not within the 1.0

threshold, another judge is called in to provide another human score for the essay.

The standard of measuring predicted score accuracy within 1.0 point of the

judge’s score means that if the judge’s score is 5 on a 5 point scale, that a 4 or a 5 would

be within 1.0 point of the score, but 1, 2, or 3 would not. If the judge’s score is 3 on a 5

point scale, then a 2, 3, or 4 would be judged as within 1.0 of the judge’s score.
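The ±1.0 agreement measure described above can be expressed as a short Python sketch. The function names and the sample score lists are hypothetical and serve only to illustrate the calculation.

```python
def within_tolerance(predicted, judged, tolerance=1.0):
    """True when the predicted score is within the tolerance of the judge's score."""
    return abs(predicted - judged) <= tolerance


def agreement_rate(predicted_scores, judged_scores, tolerance=1.0):
    """Fraction of essays whose predicted score is within the tolerance."""
    if len(predicted_scores) != len(judged_scores) or not judged_scores:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(
        within_tolerance(p, j, tolerance)
        for p, j in zip(predicted_scores, judged_scores)
    )
    return hits / len(judged_scores)


# The examples from the text: a judge's score of 5 accepts 4 or 5 but not 3.
print(within_tolerance(4, 5), within_tolerance(3, 5))

# Hypothetical scores for four essays; three of the four fall within 1.0 point.
judged = [5, 3, 4, 2]
predicted = [4, 3, 2, 2]
print(agreement_rate(predicted, judged))  # 0.75
```

Under the adjudication procedure described for E-Rater, an essay for which `within_tolerance` is false would be the case in which another human judge is called in.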

The E-Rater program has also been evaluated for testing of college level essays

for non-native speakers of English for the Test of Written English (TWE) exam (Burstein

and Chodorow, 1999). Although there were significant differences between the scores of

native English speakers and native Chinese, Arabic or Spanish speakers writing in

English, the E-Rater system, which is tuned for each topic using about 52 syntactic,

discourse and other analysis variables, was able to predict the holistic scores exactly or

with an adjacent score (within 1.0 of judged score) about 92% of the time.

2.4 Latent Semantic Analysis (LSA) Bag of Words Approach

Another approach to holistic scoring of essays involves using Latent Semantic

Analysis and Indexing (LSA and LSI). The overall LSI approach has been used as the

basis of text analysis since 1989. More recently, in 1997, LSI, developed by Thomas

Landauer (Landauer, 2003), was released as a product called Intelligent Essay Assessor

(IEA). LSI is a “bag of words” process in which word frequency vectors are

reduced to 200-400 dimensions of semantic space. Used initially for information

retrieval as an improvement over simple word frequency vectors, it also provides a

very impressive correlation with human rating scores. An important strength of the LSA

approach is that words are grouped into the semantic index dimensions to avoid

mismatches on similar words. A weakness is that, like PEG, syntactic and grammar

variables are missing; everything in the LSI approach is in the vocabulary (Rudner, 2001).

Recent work by Kanejiha (2003) has centered on adding some syntactic

information to the LSI approach by including the part of speech category for the word

previous to the word being indexed. This approach, called Semantically Enhanced Latent

Semantic Analysis (SELSA), has been shown to exceed the ratings of a straight LSA

approach on evaluating essay answers to basic computer science questions.

2.5 Hybrid Systems: Bag of Words Plus Rules

Rosé (2003) has developed the “hybrid” CarmelTC system, an approach that begins

to combine the advantages of “bag of words” systems such as LSI and PEG with a

rule-based component. This system combines a rule learning approach, using features

extracted from a syntactic analysis, with a Naive Bayes analysis of the text. The

comparison study reported in that paper tested the automatic grading of

answers to qualitative physics test questions. The results showed that this hybrid approach

was significantly better than a straight LSI or Naive Bayes approach using the same data.

2.6 Exemplar-Driven Memory-Based Systems

Exemplar-driven methods are also beginning to be used for essay grading.

Machine learning techniques are being applied to analyze word frequencies and

proximity to other similar words in multiple dimensional space (Chodorow and Leacock,

2000). By accessing a 30 million word corpus indexed by every word and every context,

this system, called ALEK, was able to compare student essay answers on the TOEFL

(Test of English as a Foreign Language) exam with this database to find determiner and

agreement grammatical errors such as “a desks” because of the low frequency of this

collocation in the database. This approach compared favorably with the Word97

grammar checker included with Microsoft Word® in a limited set of error environments.
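The general idea of flagging a collocation because it is rare in a reference corpus can be illustrated with a deliberately simplified Python sketch. This is not the ALEK algorithm, whose details are not given here. It only counts adjacent word pairs in a tiny hypothetical reference corpus and reports pairs in a student sentence that fall below an assumed count threshold. All names, sentences, and the threshold are assumptions.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across a list of reference sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def rare_bigrams(sentence, counts, min_count=1):
    """Return word pairs in the sentence seen fewer than min_count times."""
    tokens = sentence.lower().split()
    return [pair for pair in zip(tokens, tokens[1:]) if counts[pair] < min_count]


# Hypothetical stand-in for a large reference corpus.
reference_corpus = [
    "a desk is in the room",
    "the desks are in the room",
    "she bought a desk",
    "the desks are new",
]
counts = bigram_counts(reference_corpus)

# The pair ("a", "desks") never occurs in the reference corpus, so it is flagged.
print(rare_bigrams("she bought a desks", counts))
```

With a corpus of realistic size, a raw count threshold would flag many legitimate but uncommon pairs, so a practical system would need a more careful statistical comparison than this sketch provides.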

2.7 Problems with Existing Approaches

Criticisms of these existing approaches include theoretical as well as practical

issues. “For example, many systems exhibit an inherent Achilles’ heel since it is possible

to trick them into evaluating a nonsensical text purely by reverse-engineering the scoring

mechanism and designing a text that corresponds to the criteria” (Lonsdale and Strong-

Krause, 2003). Other problems include costly development of specialized data and rules

for training the system for a specific environment and the need to expand the system by

hand to build a new model.

2.8 Special Needs of ESL Students

Only recently have efforts in automatic scoring of essays emphasized the special

requirements of essays written by ESL students. These essays, especially for those with

low proficiency in English, often are made up of ill-formed sentences. In addition to

having difficulty in linking grammatical sentences together properly, ESL students

struggle with problems of spelling and vocabulary, which provide a challenging

environment in which to attempt to duplicate manual grading with a computer-generated score.

In her 1994 thesis, Susan Ingle addressed the correlation of holistic scores for

different levels of ESL papers with various objective measurements. 193 essays written

by native speakers of Japanese and Spanish who were learning English at the Brigham

Young University (BYU) English Language Center were examined. Ingle found that

essay length was correlated with holistic scores an amazing 58% of the time. She added

other manually generated variables such as mean error-free T-unit length, percent of

subordinated clauses per T-unit, percentage error-free T-units and mean T-unit length. A

T-unit consists of a main clause plus its subordinate clauses. The result of these five

variables was a correct prediction of 75% of the holistic scores.

2.9 Hierarchy of Challenging Syntactic Structures

Another thesis at BYU by Xinyou Zhang (1994) investigated a hierarchy of

syntactic structures that were especially difficult for ESL students to use correctly. Table

2-2 shows this hierarchy, with the higher numbers being more difficult for ESL students.

Detecting these structures and creating a variable attribute for them for an automated

study would be a great extension of Zhang’s work that evaluated a limited number of

essays manually to derive the hierarchy.

2.10 English Grammar Checker to Assist ESL

Park (1997) examined the specific problems and benefits of using an English

grammar checker to assist ESL students. This program was based on a Combinatory

Categorial Grammar parser written in Prolog. “It includes grammatical mistakes as

ungrammatical variations of the constituents that can be related to given lexical entries in

a categorial lexicon.” A training set of essays is required to train the grammar checker

which can then be used with a test set.

Number  Syntactic Structure Name  Example
1.      Noun Clause               I believe that the Book of Mormon is true.
2.      Adverb Clause             It rained today because a storm came in.
3.      Relative Clause           The cat that the dog chased got away.
4.      Appositive                Reagan, our President from 1981 until 1989, …
5.      Participial               He hobbled along, swinging his cane.
6.      Absolutes                 He stood alone, his hands tied behind his back.

Table 2-2: Hierarchy of syntactic structures determined to be increasingly difficult for ESL

students (Zhang, 1994, pp. 39-40)

The errors are tailored for the ESL environment. For example, even though the

sentence I want leave is technically correct since leave can be used as a noun, in an ESL

essay, it is much more likely that the writer has omitted the “to” and meant to say I want

to leave.

At the 2003 conference of the North American Chapter of the Association for

Computational Linguistics (NAACL), an interesting paper analyzed a Swedish grammar checker

named Granska being used by second-language learners of Swedish. Ola Knutsson

(2003) compared essay scores before and after the use of the grammar checker. Granska

detected 38% of errors in texts that argued a particular point of view. One sidelight of

this work was the detection of essays where large quantities of text were copied from

other sources. However, this study was limited to essays from only eight students. This

work seems to be a preliminary study and much further work is needed before any real

conclusions can be drawn.

2.11 Evaluation using Syntactic Parser and Customized Algorithm

Lonsdale and Strong-Krause (2003) studied the effectiveness of an approach to

grading ESL essays using data derived from the syntactic Link Grammar Parser. This

flexible and robust parser links pairs of words with relationship pointers rather than

constructing strict parse trees. They developed a corpus of 301 novice and intermediate

ESL essays. This corpus was analyzed by the Link Grammar parser and then a Perl script

assigned a five-point holistic score. Between 60% and 69% of the automatically generated ratings

were consistent with the manually generated scores within 1.0 rating point. The system

often overrated essays that received a score of 1 or 2 from judges and would under-score

essays that were scored high by the judges but contained many run-on sentences. This

ongoing research at BYU continues to refine the link grammar approach and is being

expanded to combine the link grammar measurements with other variables that

could result in an even better hybrid system.
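The agreement figure reported for that study, the share of automatic ratings falling within 1.0 rating point of the manual score, can be made concrete with a small calculation. The Python sketch below is illustrative only and is not the Perl script used in the 2003 study; the function name and the sample scores are hypothetical.

```python
def agreement_within(predicted, judged, tolerance=1.0):
    """Percentage of essays whose predicted score lies within
    `tolerance` rating points of the judges' score."""
    if len(predicted) != len(judged):
        raise ValueError("score lists must have the same length")
    if not predicted:
        raise ValueError("no scores supplied")
    hits = sum(1 for p, j in zip(predicted, judged) if abs(p - j) <= tolerance)
    return 100.0 * hits / len(predicted)


# Hypothetical scores for five essays (not data from the study).
predicted = [3, 4, 1, 5, 2]
judged = [2.5, 2.0, 1.5, 4.0, 4.0]
print(agreement_within(predicted, judged))  # 60.0
```

Setting `tolerance=0` gives the stricter exact-agreement rate instead.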

2.12 Implications of Current Research on Scoring of ESL Essays

The special requirements of ESL essays, especially at the lower beginning levels,

are just now beginning to be addressed in current research studies. The success that has

been achieved in other areas of essay evaluation lends encouragement to further efforts.

Basic principles that have proved successful include having a robust essay parser, using a

variety of attributes, and investigating different methods for holistic score generation

from that varied data.

3.0 Methodology

This chapter details the methodology that is used for this study. This thesis

research uses several commercially or publicly available software packages to generate

the linguistic maturity attributes, analyze and derive prediction methods for the holistic

scores on the training set of essays, and evaluate the validity of those predictions on test

essays. Several custom programs and templates were written for this thesis to tie

the data processing steps of these packages together or to perform functions not

included in them.

The WordMap III program was used to generate linguistic maturity attributes that

were extracted by a program for essay analysis into one file for each data set. These

attribute files were imported into Microsoft Excel for data base storage and sorting. The

WINKS v. 4.8 statistical package provided subsequent single and multivariable statistical

regression analysis of these attributes, correlations for single selected attributes with the

judges’ scores and scatter graphs showing linear correlation. Microsoft Excel® was then

used for template spreadsheets to predict and evaluate the algorithmic formulas. Finally,

several different settings of a machine learning program called TiMBL were used for

prediction of scores using Excel spreadsheets and customized programming.

Perl, Microsoft QuickBasic and customized spreadsheet programming were also

developed for this thesis to perform the following functions:

(1) Extraction of essays and holistic scores from raw student evaluation files.

(2) Creation of individual writing sample text files in WordMap input text format plus

cross reference files for each subset of a student essay corpus.

(3) Extraction of all WordMap attributes after linguistic analysis of individual essay files

plus merging these attributes with essay identification and holistic scores into a comma-

separated variable combined file for each of five groups of essays in the corpus.

(4) Correlation analysis for all attributes on a set of essays for key attribute identification.

(5) Customized spreadsheet creation using linear prediction and customized algorithms to

predict holistic scores and evaluate the accuracy of the predicted scores.

(6) Customized spreadsheet for individual components of individual parts of speech and

POS patterns plus Perl merge routine to prepare binary or numeric versions of these

sparse arrays for TiMBL analysis.

(7) Customized spreadsheet creation for evaluating machine learned predictions of

holistic scores for multiple numeric variables as well as sparse array analysis.

3.1 Corpus Selection

This thesis uses the same corpus of ESL essays used by Lonsdale and Strong-

Krause’s (2003) study. A parallel study, using the same training and testing data, can

facilitate a better comparison of the scoring algorithms between separate research efforts.

The essays were collected in the English Language Center at BYU as part of the normal

ESL classes. These intensive English learning courses included students ranging from

novice to intermediate. Table 3-1 shows the number of student essays from each of

the five semesters that were used as data for the study.

Semester # Students

Winter 2002 60

Winter 1992 72

Fall 2001 72

Winter 2001 30

Summer 2001 43

Table 3-1: ESL essays for training and testing

The study used the 60 essays from the Winter 2002 semester for training; testing

of the algorithm was performed on the other four groups of essays. The 277 essays

consisted of about 45,000 words and 3,100 sentences. Each essay had an average of 165

words in 11.2 sentences, for an average of 14.75 words per sentence. A holistic score of

1 to 5 was given to each essay by two judges. The description of the five scoring levels is

as follows from the Lonsdale/Strong-Krause paper:

1. Demonstrates limited ability to write English words and sentences. Sentences

and paragraphs may be incomplete and difficult to follow.

2. Writes a simple paragraph with a fair control of basic, not complex sentences

structures. Errors occur in almost all sentences except for the most basic,

formula-type (memorized) structures. Little detail is present.

3. Writes a fairly long paragraph with relatively simple sentence structures.

Personal experiences and some emotions can be expressed, but much detail is

missing. Frequent errors in grammar and word use make reading somewhat

burdensome.

4. Writes long groups of paragraphs with some complex sentence patterns. Some

transitions are used effectively. Vocabulary is broadening, but some wrong word

use. Grammar errors may detract from meaning. Some ideas are supported with

detail. Some notion of an introduction and conclusion is included.

5. Writes complex thoughts using complex sentence patterns with effective

transitions between ideas and sentences. Errors in grammar exist but do not

obscure meaning. A variety of advanced vocabulary words are used but some

wrong use occurs, including problems with prepositions and articles. Ideas are

clearly supported with details. Effective introduction and conclusion are

included.

Here is a sample essay that shows some of the challenges of automatically scoring

these essays.

Iwork really hard and occacionally I don’t have time for have

fun whith mt friens but i don’t mind becausse i knew ,when i grow

up i will have a profesion and have a good job and i will be very

happy.

The Lonsdale/Strong-Krause study emphasized the importance of the Link Grammar

Parser’s robustness: its ability to keep running and trying to link entities

together syntactically even in the midst of recurring errors. It is for this same reason that

the WordMap program has been used in the current study. Instead of having its parse

structures just get worse and worse as errors accumulate, a robust grammar checking

program keeps retrying, guessing at misspellings, even repairing structures based on its

best guesses to come out with data even in the midst of near chaos in the essay.

There were some anomalies in the data that should be mentioned as applied to this

study. The original number of essays in the 2003 study was 301. About 3% of the

essays were too small to be processed effectively in the WordMap system. An additional

2% of the essays had errors in WordMap processing that prevented the generation of an

attribute file. For the three 2001 data sets, some of the essays, about 7%, were missing

from the archived files saved from the 2003 study. In spite of these difficulties, the 277

essays that were used in this study were determined to contain a similar variety of holistic

scores as the 2003 study.

3.2 Preparation of Essays

Perl programs were written for this thesis to extract and prepare the ESL essays

for analysis by the WordMap system. The files for the corpus included spreadsheet data

enumerating student name, student ID number, and holistic scores from the two judges

for the essay, mixed with additional testing information. The actual essay text,

referenced by name and/or ID number was contained in merged summary files for two of

the five corpus sets, and in individual files named by student ID number along with other

class final testing evaluations for reading, writing and grammar for the other three sets.

Each essay received a file name consistent with the corpus set and a number for

the essay within that set (e.g. TF01-01 for Fall 2001 test set, first essay). The essay was

formatted into WordMap rough draft input format, and a summary file was generated for

each set in the corpus to cross reference the judges’ scores with this essay. Table 3-2

shows an extract from one of the spreadsheet files with scoring information. Figure 3-1

shows a sample essay from the raw class evaluation file. Figure 3-2 shows a sample

essay file ready for WordMap analysis. Table 3-3 shows an extract of a log file to cross-

reference the WordMap files with the essay scores for later analysis. All ID numbers

have been randomized from the original ID’s.
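The file preparation step can be sketched as follows. This Python fragment is a minimal illustration rather than the Perl program written for the thesis: it reproduces only the layout visible in Figure 3-2 (the `.NM`, `.SP 3`, `.TL`, `.PP` and `.END` lines), whose exact meaning to WordMap is not documented here, and the function names and line width are assumptions.

```python
import textwrap


def essay_file_name(set_code, index):
    """Build a name such as 'TF01-01.rgh' from a corpus-set code and essay number."""
    return f"{set_code}-{index:02d}.rgh"


def to_rough_draft(file_name, student, scores, essay_text, width=56):
    """Wrap one essay in the directive lines shown in Figure 3-2.

    `scores` holds judge 1, judge 2 and their average; the directive
    lines are copied from the figure, not from WordMap documentation.
    """
    header = [
        f".NM {file_name}",
        ".SP 3",
        f".TL//student = Student={student}, Scores = "
        + ", ".join(str(s) for s in scores),
        ".PP",
    ]
    # Collapse line breaks and repeated spaces, then re-wrap the text.
    body = textwrap.wrap(" ".join(essay_text.split()), width=width)
    return "\n".join(header + body + [".END"]) + "\n"


name = essay_file_name("TF01", 1)
print(to_rough_draft(name, "Last-8, First-8", [3, 3, 3],
                     "I didn't like study English before, but when I was "
                     "high school, I met a great teacher."))
```

Keeping the judges' scores in the title line means that each essay file carries its own cross reference, in addition to the separate log file.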

3.3 Generation of WordMap Linguistic Attributes

WordMap was selected to generate the linguistic attributes for several reasons.

First, the program is based on a grammar checking engine which has the ability to

recover from error conditions in order to analyze an essay. The Lonsdale and Strong-

Krause study showed how important a resilient parser is for analysis of these essays.

Second, WordMap provided a range of attributes that spanned all the way from

vocabulary and surface features such as average word lengths to very detailed syntactic

and stylistic attributes such as part of speech trigrams. Third, although WordMap is a

proprietary system that is licensed for a fee, the system was provided to BYU at no cost

for ongoing research into analysis of ESL essays in this thesis and for future research.

Since WordMap III is a DOS system and the program had been used in limited computer

environments, part of this thesis work involved getting the system to run on a Windows

XP-based computer.

The individual files for each essay were processed by the WordMap system. The

system collects a wide variety of statistics in various attribute groups. Table 3-4 shows

these groups with their descriptions, along with the name and description of an example

attribute from each group.

Semester Last Name First Name Student Number Judge#1 Judge#2 Average

F01 Last-13 First-13 999-99-6363 2

F01 Last-14 First-14 999-99-7474 3

3.5

F01 Last-15 First-15 999-99-9696 3

Table 3-2: Spreadsheet extract for student name, ID and essay scores from two judges

WordMap data file Judge 1 Score Judge 2 Score Average Score

TW1-06.rgh 3 3 3

TW1-07.rgh 1 1 1

TW1-08.rgh 2 2 2

TW1-09.rgh 1 2 1.5

Table 3-3: Cross reference file between WordMap files and holistic scores.

Abbr.  Group Description                 Example      Example Description
ALL    Over All composite index          SYN          Syntax composite
CTA    Parts-of-speech (POS) attributes  vc+ed        Past tense verb
DEN    Word redundancy attributes        Den-200[xx   New words any category
                                         xx per 200 words
FTR    Syntactic-types attributes        psv          Passive
CHK    Grammar & style attributes        [FRAG]       Sentence fragment
IDL    Idiom List                        high_school  Combined meaning
LBL    Pattern label attributes          nSV-Lbls     Number of clauses
LNA    Length-of-words attributes        lna-07       Word length 7 chars
LSS    Variety of language index         PTA-Gain     PTA section % patterns
LST    Function type attributes          !DET         Determiners
MOR    Morphological attributes          a°W          Decodable prefixes??
MRK    Punctuation use attributes        commas       Number of commas
PNC    Punctuation skills index          [+_]         Syntactic gap
PTA    Text POS 3-grams                  nc-ccn-nc    Conjoined class nouns
SGA    Sentence length attributes        SGA-Median   Median sentence length
SLA    Select category attributes        sla-aj+nc    Adj + class noun
SYN    Syntactic skills index            av+vc+ing    Adverb + verb participle
VOC    Vocabulary specific index         VOC-WDS      Non function words??
WDA    Vocabulary attributes             there        Specific function word

Table 3-4: Attribute groups and examples from WordMap III text analysis (Lytle, 1986,

“File Types”, p. 4)

Figure 3-1: Essay extract from raw text information file for individual student. File is named

using student ID number, which has been randomized for this figure.

File Name: 999-99-8787.xx

Last-8, First-8

Date: 9/4/01

Test Started: 9:17 AM

…

LISTENING TEST

================================================

S Item# L R AN Logit Est. SE

L 05701 1 1 CR -1.87 0.00 1.00

L 00502 3 1 CR -1.07 0.00 1.00

…

Grammar TEST

================================================

S Item# L R AN Logit Est. SE

L 1.019 1 1 CR -3.23 1.00

L 1.009 2 1 CR -.94 1.00

…

Reading Test

================================================

S Item# L R AN Logit Est. SE

L 06302 2 0 D3 -.90 0.00 1.00

L 06301 3 1 CR -.59 0.00 1.00

…

WRITING TEST

====================================================

Last-8, First-8

999-99-8787

Write as much as you can about THE MOST IMPORTANT CLASS

YOU HAVE EVER TAKEN.

I didn't like study English before, but when I was high

school, I met a great teacher. He had been studying

English. I took his class when I was second grade of

high school. This class was impotrant for me. There are

some reasons.

…

I could be learned a lot of things by him. How

wonderful that we can talk by English. I think teacher

is most important when we take class. I'm very gald

that I could take his English class. Now I'm trying

study English again!

Figure 3-2: WordMap rough draft essay file ready for analysis from a level 4 essay

For example, the “CHK” group contains an attribute giving the length in words of the

essay. The “CTA” and “PTA” groups contain attributes summarizing syntactic variety in

the essay. The “SLA” group contains sentence density information and the “WDA”

group contains vocabulary variety attributes.

After processing an essay text file, WordMap then writes the complete set of

attributes into a custom format data base file (.rnd extension) consisting of the attribute

group, the attribute, the raw frequency of the attribute in the sample text and the

percentage that this frequency represents of all of the attributes in the group.

As an example of how varied these attributes are, for the 60 essays of the first

subset of the corpus WordMap generated a total of 19,289 attribute values. 963 different

attributes were defined though most of them are not defined for every essay. Each essay

had an average of 321 attributes generated.

.NM TF1-08.rgh
.SP 3
.TL//student = Student=Last-8, First-8, Scores = 3, 3, 3
.PP
I didn't like study English before, but when I was high
school, I met a great teacher. He had been studying
English. I took his class when I was second grade of
high school. This class was impotrant for me. There are
some reasons. First, he has a good personarity, and
everyone likes him. I respected him, and I tried to
hard study English, because I wanted to he thinks me a
good student.
Second, because his story was very interesting, I liked
to hear that he was speaking not only English, but also
another stories. Then he taught us how we can enjoy
study English. If I didn't take his class, I wouldn't
like English forever. I could be learned a lot of things
by him. How wonderful that we can talk by English. I
think teacher is most important when we take class. I'm
very gald that I could take his English class. Now I'm
trying study English again!
.END

3.4 Extraction and Selection of WordMap Attributes

Because the existing export functions of the WordMap program provided limited

flexibility for extracting attributes, a customized program was developed to extract

the complete dataset of all attributes for a subset of the corpus into a standard comma

separated variable (CSV) format. Data from the WordMap file was cross-referenced with

its essay identification number and the holistic score category that was determined by the

judges for the essay. Table 3-5 shows an extract of a portion of the .csv export file

showing attributes for one of the training essays.
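To show how such an export might be consumed downstream, the following Python sketch reads CSV rows into one attribute table per essay. It is an illustration under stated assumptions, not the program written for the thesis: the column headers (`File`, `Score`, `Group`, `Attrib`, `RawFreq`, `PctFreq`) are hypothetical names, and the two sample rows use values from the Table 3-5 extract.

```python
import csv
import io
from collections import defaultdict

# Two rows in an assumed header layout; the values come from Table 3-5.
SAMPLE = """File,Score,Group,Attrib,RawFreq,PctFreq
TR-1.rnd,3,CTA,nc,19,12.10191
TR-1.rnd,3,WDA,the,6,3.97351
"""


def load_attributes(csv_text):
    """Map each essay file to its score and a
    {(group, attribute): percent frequency} table."""
    essays = defaultdict(lambda: {"score": None, "attributes": {}})
    for row in csv.DictReader(io.StringIO(csv_text)):
        essay = essays[row["File"]]
        essay["score"] = float(row["Score"])
        essay["attributes"][(row["Group"], row["Attrib"])] = float(row["PctFreq"])
    return dict(essays)


data = load_attributes(SAMPLE)
print(data["TR-1.rnd"]["attributes"][("CTA", "nc")])  # 12.10191
```

Keying each value by both group and attribute keeps attributes with the same short name in different groups from overwriting one another.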

There is a built-in comparator module in WordMap to compare essays to a set of

standard essays such as the sample essays representing the five holistic scoring groups. It

was discovered that the default groupings of attributes in the comparator module resulted

in poor predictions for a subset of the essays. Correlation analysis of the extracted

attributes led to selection of several key attributes and attribute groups to create the

custom algorithms and assemble the data for training and testing for the memory-based

machine learning system.

The selected attributes ended up being summary attributes that have the highest or

nearly the highest positive or negative correlation with the holistic scores for one of the

sets of essays. Many other attributes were considered before these were selected.

WordMap generates extensive error flags as it runs the grammar checking modules, but

the individual flags were not highly correlated with the scores. For example, run-on

sentences were only correlated with holistic scores at r = -0.068. Stylistic analysis flags,

such as passive use, correlated more strongly at r = 0.323, but even this was nowhere

near the r = 0.75 and higher level of the highest summary attributes.

File      Score  WM Grp  Attrib       Raw Freq  % Freq     Analysis
TR-1.rnd  3      CTA     nc           19        12.10191
TR-1.rnd  3      CTA     nmprn        15        9.55414
TR-1.rnd  3      CTA     seg          15        9.55414
TR-1.rnd  3      CTA     p            13        8.280255
TR-1.rnd  3      WDA     the          6         3.97351
TR-1.rnd  3      WDA     and          6         3.97351
TR-1.rnd  3      WDA     to           6         3.97351
TR-1.rnd  3      WDA     in           5         3.311258
TR-1.rnd  3      WDA     of           3         1.986755
TR-1.rnd  3      IDL     there_'s     3         2.054795
TR-1.rnd  3      IDL     a_few        1         0.6849315
TR-1.rnd  3      IDL     high_school  1         0.6849315
TR-1.rnd  3      CHK     [THAT]       3         1.886792
TR-1.rnd  3      CHK     [+_]         1         0.6289308
TR-1.rnd  3      CHK     [Prep]       2         1.257862
TR-1.rnd  3      LNA     lna-04       18        13.95349
TR-1.rnd  3      LNA     lna-05       14        10.85271
TR-1.rnd  3      ALL     ALL-VOC-13.65885 -4.907377
TR-1.rnd  3      ALL     ALL-PNC-6.161773 -2.388628
TR-1.rnd  3      ALL     ALL-GMR-4.419491 -0.3614517
TR-1.rnd  3      ALL     ALL-SYN-57.50528 -2.319609

Table 3-5: Extract of custom .csv export file from WordMap .rnd format for ESL essay #1 from

the 2002 corpus subset. The analysis of this 159 word essay included 349 different attribute

values.

3.4.1 Analysis Using Built-in WordMap Comparator Module

There were several characteristics of the ESL essay corpus and the current

WordMap system that led to the decision to extract out all of the attributes from the .rnd

data files to perform analysis outside of WordMap. To analyze the essays using the

comparator module, each essay was grouped with other essays with the same holistic

score in the same file, resulting in five standard files. Instructions for the comparator

module indicated that these standard files should be about the same size. However, the

standard files built from the Winter 2002 subset ranged from only 2,800 words for

level one essays to 19,300 for level three essays.

Figure 3-3 details the basic function of the built-in WordMap comparator system

(Lytle, 1986, “Comparator” section). Note that the selected single essay is compared,

entire attribute group by entire attribute group, to the five standard attribute files

containing all of the attribute values for a holistic score level. The final comparison

result of a test essay with the five files is the accumulated score of selected groups.

A few essays were used to try out the self-prediction capability of the default

comparator module for various groups. Groups that were tried included the CTA, CHK

and FTR groups (see Table 3-4). For these randomly selected essays, the comparator predicted only 3

of 10 scores within ±1.0 in these initial tests. This is far below the baseline of always

predicting a holistic score of 3, the score assigned to 42% of the Winter 2002 essays.
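The 42% baseline corresponds to always guessing the most frequent score in the set. A minimal Python sketch of that calculation follows; the function name and the ten sample scores are hypothetical and are not the Winter 2002 data.

```python
from collections import Counter


def majority_baseline(scores):
    """Return the most common score and the percentage of essays
    that received it, i.e. the accuracy of always guessing that score."""
    if not scores:
        raise ValueError("no scores supplied")
    score, count = Counter(scores).most_common(1)[0]
    return score, 100.0 * count / len(scores)


# Ten hypothetical holistic scores (not data from the corpus).
print(majority_baseline([3, 3, 2, 4, 3, 1, 5, 2, 3, 4]))  # (3, 40.0)
```

Any prediction method worth keeping should beat this figure on the same essays.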

Figure 3-3: Default comparator module operation in WordMap.

Given the comparator’s limited predictive capability for ESL scores in the sample

corpus, it was expected that selecting and analyzing the entire set of attributes separately

outside of the WordMap environment via different techniques would further improve on

this matching accuracy. The built-in WordMap variable extraction module was

supplemented by Perl and QuickBasic programs written for this study in order to extract

all of the WordMap attributes to a single file containing the attributes for all of the essays

in a subset of the corpus such as the Winter 2002 essays.

3.4.2 Correlation Analysis for Attributes

After extracting all of the WordMap attribute data for the Winter 2002 essays, it

soon became apparent why the built-in comparator module had difficulty with this data.

Lytle (2005, p. 214, Table 1) had listed some attributes that he had found to be highly

correlated with linguistic maturity in studies of students learning English as their native

language in elementary and secondary schools. At first a few of these attributes were

analyzed by the statistics package (WINKS) for correlation analysis, one variable at a

time.

Attributes that had been shown by Lytle to have the best correlation values for

elementary and secondary school native English students were found to be much less

significant for the ESL essays. For example, the nmprn personal pronoun attribute had

been shown over the years to have a high negative correlation value of r = -0.919 with

linguistic maturity ratings. For the 60 ESL Winter 2002 essays, however, the value for r

was only -0.193. Lytle found punctuation marks frequency to be correlated with

linguistic maturity with a value of r = 0.671, but for the ESL essays, that value was only r

= 0.066. With these individual attributes much less significant for the ESL essays than

the native English essays, the summation of these attributes (e.g. the whole CTA group)

could be expected to have lower prediction ability for holistic scores.

It was decided to write a Perl program for this thesis to analyze the correlation

values for the nearly 1,000 attributes with the average of the judges’ holistic score values.

Table 3-6 shows several of the top positively and negatively correlated attributes. These

highly correlated attributes became the ones used in this study for analysis and prediction

of holistic scores of the ESL essays. For further explanations about the attribute groups,

see Table 3-4. These attributes will next be described in detail.
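The correlation screening described above can be sketched in a few lines of Python. This is an illustration, not the Perl program written for the thesis: it computes Pearson's r for every attribute against the judges' average score and ranks attributes by absolute correlation. Treating an attribute that is missing from an essay as 0, and reporting 0 for a constant column, are assumptions made here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: r is undefined, reported as 0 here
    return sxy / math.sqrt(sxx * syy)


def rank_attributes(essays, scores):
    """essays: list of {attribute name: value} dicts, one per essay.
    scores: the judges' average holistic score for each essay.
    Returns (name, r) pairs, strongest absolute correlation first."""
    names = sorted({name for essay in essays for name in essay})
    ranked = []
    for name in names:
        # Assumption: an attribute absent from an essay counts as 0.
        column = [essay.get(name, 0.0) for essay in essays]
        ranked.append((name, pearson_r(column, scores)))
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked
```

Sorting by absolute value keeps strongly negative attributes, such as a sentence density measure, near the top of the list alongside the positive ones.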

Group                   Attribute Name  Correlation (r)
CHK: Grammar            CHK-Length       0.778
CTA: Parts of Speech    CTA-Gain         0.772
PTA: POS Trigrams       PTA-Gain         0.772
SLA: Select Categories  SLA-seg         -0.548
WDA: Vocabulary         WDA-Ratio        0.766

Table 3-6: Correlations for WordMap attributes with the judges’

holistic scores for the 60 ESL essays from the Winter 2002 set.

3.4.3 Essay Length Attribute

The ESL essays that are longer are positively correlated with a higher holistic

score. These essays were written under a 30-minute time constraint and a longer essay

seems to be one indication of greater writing skill. The correlation value r for this

variable, designated CHK-Length, is 0.778 and this was the most highly correlated

linguistic attribute in the set of essays. The length of the 60 Winter 2002 essays averaged

177 words, going from a low of 13 words to a high of 352 words. Figure 3-4 shows a

graph of this attribute compared with the average holistic score along with the linear

correlation.
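A linear relationship of this kind can be turned into a simple single-variable predictor. The Python sketch below fits a least-squares line to (length, score) pairs and clamps the prediction to the 1 to 5 scoring range. It is only an illustration of the idea: the function names are hypothetical, the data points are invented, and no coefficients from this study are implied.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def predict_score(length, slope, intercept, low=1.0, high=5.0):
    """Predict a holistic score from essay length, clamped to [low, high]."""
    return min(high, max(low, slope * length + intercept))


# Invented (length in words, holistic score) pairs for illustration.
lengths = [50, 100, 150, 200]
scores = [1, 2, 3, 4]
slope, intercept = fit_line(lengths, scores)
print(predict_score(177, slope, intercept))
```

Clamping matters at the extremes: an unusually long essay would otherwise be assigned a score above the top of the scale.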

Figure 3-4: CHK-Length correlated with holistic score for training essays.

3.4.4 CTA-Gain: Part of Speech Summary Attribute

Two key WordMap summary attributes that have never been used outside the

system for predictions ended up being tied for second place (r = 0.772) just slightly

behind CHK-Length in their correlation with the holistic scores of the 60 Winter 2002

ESL essays. The first of these two attributes is called CTA-Gain and represents the

percentage of the 96 WordMap specialized part of speech categories that occur in the

writing sample. An example of one of these categories is aj+est for a superlative

adjective, e.g. “fastest.”
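The definition of CTA-Gain given above amounts to a coverage percentage over a fixed inventory of categories. The Python sketch below shows that calculation under stated assumptions: the function name is hypothetical, and the eight-tag inventory is a stand-in assembled from tags mentioned in this chapter, not WordMap's actual list of 96 categories.

```python
def gain(observed_tags, inventory):
    """Percentage of the inventory's categories that occur at least
    once among the tags observed in a writing sample."""
    inventory = set(inventory)
    if not inventory:
        raise ValueError("inventory is empty")
    return 100.0 * len(inventory & set(observed_tags)) / len(inventory)


# Stand-in inventory (the real one has 96 categories).
INVENTORY = ["aj+est", "aj+er", "ccn", "nmr", "qav", "mrk", "nc", "vc+ed"]

# Tags observed in a hypothetical essay; repeats and unknown tags are ignored.
print(gain(["nc", "ccn", "nc", "nmr", "xyz"], INVENTORY))  # 37.5
```

Because only the presence of a category counts, the measure rewards variety rather than frequency, which may help explain why it behaves differently from the individual category percentages.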

Many of the attributes are specified using descriptions based on Lytle’s Junction

Grammar theory of language (Lytle 1979b, 1986, 2006). An example is the definite

article the, which is coded as nmr for its part of speech attribute in WordMap. In

Junction Grammar terminology, a definite article is a nominal (n) modalizer (m) that

triggers a retrieval (r) of an existing semantic entity in the memory space.

Table 3-7 shows several of the CTA individual attributes with their individual

correlation with holistic scores, sample sparseness and frequencies. Some are very

sparse, such as the aj+er comparative found in only 7% of the training essays. The

typical % is the percentage of an essay’s words that carry this tag; a value of 0.5

means one such word in every 200. The high % in essay is the highest percentage

found in any essay for this feature; a value of 4.3 means that more than 1 in every 25 words in that

essay carries this part of speech classification. It is very interesting that this

summary attribute (CTA-Gain) is highly correlated with the holistic scores but that the

individual attributes comprising this summary correlate much less.

POS    Description            % essays  Typical %  High %    Example              Correlation (r)
                              w/ attr   in essay   in essay                       with judges’ score
qav    quantifier for adverb  16        0.5        4.3       very highly favored  -0.186
aj+er  comparative            7         0.6        0.9       smaller               0.157
ccn    conjunction            98        3.5        9.5       and                   0.190
mrk    punctuation            94        4          16        !                     0.217

Table 3-7: CTA part of speech individual attributes and correlation values

with the judges’ holistic scores using the 60 training essays.

Figure 3-5 shows a graph of this attribute compared with the average holistic score for

the Winter 2002 essays.

Figure 3-5: CTA-Gain correlated with holistic scores from judges for essays

3.4.5 PTA-Gain: Part of Speech Pattern Summary Attribute

The second key WordMap-specific variable that is tied for second in its

correlation with the holistic scores of the 60 ESL Winter 2002 essays is called PTA-Gain

(r = 0.772). This summary attribute is a simple percentage of how many of the 84

different types of specialized part-of-speech (POS) trigram patterns occur in the text.

Lytle selected these patterns from a large database of competent student writings

(Lytle, 2006). An example of one of these patterns is aj-nc-seg which is an adjective

followed by a common noun and an end of sentence marker, e.g. “John kicked the red

ball.” Note that some of these patterns include punctuation such as end of sentence

punctuation. Correlation of these sometimes sparse values was also done on the

individual pattern attributes. Table 3-8 shows a sample of some of these attributes and

their individual r correlation values using the frequency percentage values for the 60 training

essays.
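The pattern-based summary can be sketched in the same way as the category-based one. The Python fragment below slides a three-tag window over a tagged text and reports what percentage of a pattern list is matched. It is an illustration only: the function names are hypothetical, the five patterns are those listed in Table 3-8 rather than the full set of 84, and the tag sequence is an invented example built from tags that appear in that table.

```python
def pos_trigrams(tags):
    """All consecutive three-tag patterns, joined with hyphens."""
    return ["-".join(tags[i:i + 3]) for i in range(len(tags) - 2)]


def pattern_gain(tags, patterns):
    """Percentage of the listed trigram patterns found in the tag sequence."""
    patterns = set(patterns)
    if not patterns:
        raise ValueError("pattern list is empty")
    return 100.0 * len(patterns & set(pos_trigrams(tags))) / len(patterns)


# The five patterns shown in Table 3-8 (the full list has 84).
PATTERNS = ["nc-ccn-nc", "aj-nc-seg", "vc-p-nmr", "p-nmps-nc", "seg-nmprn-vc"]

# Invented tag sequence; 'seg' marks a sentence boundary.
tags = ["seg", "nmprn", "vc", "p", "nmr", "aj", "nc", "seg"]
print(pos_trigrams(tags))
print(pattern_gain(tags, PATTERNS))  # 60.0
```

Including the sentence boundary marker in the tag stream is what allows patterns such as aj-nc-seg to capture how sentences begin and end.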

Trigram POS   % of essays     Typical %  High %    Example                   Correlation (r)
Pattern       with attribute  in essay   in essay                            with judges’ score
nc-ccn-nc     32              0.70       2.0       boys and girls             0.006
aj-nc-seg     60              1.08       3.3       red ball <end of sent>    -0.207
vc-p-nmr      23              0.33       1.4       went onto the              0.053
p-nmps-nc     62              1.1        2.1       on my roof                 0.403
seg-nmprn-vc  82              2.2        7.6       <end of sent> Henry went  -0.482

Table 3-8: PTA trigram POS patterns and correlation values for the percentage of their

use to judges’ holistic scores. (Lytle, 1986, “Lists,” p. 21)

Figure 3-6 shows a graph of the PTA-Gain attribute compared with the average

holistic score.

So far, it is interesting to note that the most highly correlated attributes cover

different dimensions of linguistic maturity. The essay length attribute is a

statistical attribute that is easily measured. The PTA-Gain and

CTA-Gain summary attributes and their individual components capture short and

longer syntactic patterns and contain some of the most interesting attribute information,

reflecting the particular point of view of the Junction Grammar theory.

Figure 3-6: PTA-Gain correlated with holistic scores from judges for essays.

As the current study continued, still more dimensions of attributes also seemed to

be highly correlated with the holistic scores of these training essays. It was hoped that by

selecting attributes of different kinds, they would complement each other in making

better quality predictions.

3.4.6 SLA-seg: Sentence Density

The attribute with the strongest negative correlation is the sentence

density attribute, or SLA-seg (r = -0.5475). This attribute captures WordMap’s

calculation of the number of sentences per 100 words. Most essays with lots of simple

sentences are graded lower, but there are exceptions with essays that contain run-on

sentence after run-on sentence. The 60 Winter 2002 essays have an average of 6.3

sentences per 100 words, with a low of 0.76 sentences per hundred (an extreme example

of run-on sentences), and a high of 12.1 sentences per hundred words.
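A sentence density figure of this kind is easy to approximate. The Python sketch below counts sentences per 100 words using a naive split on terminal punctuation; WordMap's own sentence segmentation is not reproduced here, so the function is a stand-in whose name and splitting rule are assumptions.

```python
import re


def sentences_per_100_words(text):
    """Approximate sentence density: sentences per 100 words.

    Sentences are delimited naively by '.', '!' or '?'; text with no
    terminal punctuation counts as a single sentence."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return 100.0 * len(sentences) / len(words)


sample = ("I met a great teacher. He had been studying English. "
          "This class was important for me!")
print(sentences_per_100_words(sample))  # 18.75
```

A long run-on passage with no terminal punctuation yields a very low value, which is the pattern behind the low extreme noted above.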

Figure 3-7 shows a graph of this attribute compared with the average holistic

score.

Figure 3-7: SLA-seg correlated with holistic scores from judges for essays.

3.4.7 WDA-Ratio: Vocabulary Summary Attribute

After selecting two summary attributes that reflect the part of speech categories

and patterns, plus another two with the essay length and sentence density, another

attribute was selected that reflects the vocabulary richness of the writing in the essay.

This attribute is called WDA-Ratio and represents the density of content words compared

to a fixed list of function words in the essay (Lytle, 1986). Its correlation is calculated at

r = 0.766, and it is ranked number 7 out of the total of 961 attributes. Figure 3-8 shows a

graph of this attribute compared with the average holistic score.

Figure 3-8: WDA-Ratio correlated with holistic scores from judges for essays.

3.5 Training Essays and Test Essays in Corpus

The validity of a scientific hypothesis depends on its ability to make new

predictions that can be verified by observations. The standard method of running tests

and evaluating the effectiveness of predictive algorithms is followed in this study. The

first step was to train the system on a sample set of data called a training set. Attribute

analysis of the training set essays using statistical packages or manual analysis helped

develop a predictive algorithm analyzing selected attributes to predict the holistic scores.

The algorithm was then further tested and refined by testing how well it predicted the

scores of the individual essays in the training set itself. The algorithm was then deemed

ready to predict the holistic scores of a new set of unseen test essays that played no part

in the development of the algorithms.

In this study, the training set consists of 60 ESL essays from the Winter 2002

classes at the BYU English Language Center. Once the necessary programming pieces

had been developed to collect the data from its original formats, analyze it in the

WordMap system and extract the data with all the variables for all of the 60 training

essays, the task of analyzing the data began, along with creating methods, algorithms, and

selected data examples to predict holistic scores. The other four groups of essays were

reserved for the testing phase to evaluate the effectiveness of the predicted scores.

Figure 3-9 shows the overall flow of information for this study. The ESL training

and test set essays are processed through identical steps to generate and extract out their

linguistic maturity attributes (four modules above the dotted line). Armed with the

attributes from the training set, this data can be analyzed and individual attributes

selected. The selected attributes are used to derive a customized algorithm that uses the

attributes to predict a holistic score. The algorithm can then be used to predict holistic

scores for the test essays that were not consulted during the development of the prediction

algorithms. After predicting scores for the test sets, their accuracy can be evaluated.

Up to this point we have discussed the analysis of the Winter 2002 essays (i.e. the

training set) and the extraction of attributes followed by the process of selecting attributes

that will be used to predict holistic scores. The derivation of the predictive algorithms or

settings for memory-based processing will now be considered.

Figure 3-9: Overview of holistic score analysis and prediction process for ESL Essays in this

thesis. Above the dashed line is the data preparation phase of the research. Below the dashed

line is the analysis and prediction portion of the research.

3.6 Single and Multiple Variable Regression Analysis

Up to this point the essays have been extracted from their raw document files and

analyzed by the WordMap program, and the resulting attributes have been extracted

into a file that holds the judges’ scores and the attributes for every essay in the training set. At this

point enough information has been collected and organized to derive formulas that can

then predict the holistic score for a new essay based on one or more of the extracted

variables.

The average judged holistic score and the five extracted linguistic maturity

summary attributes were imported into a standard statistical package. This allowed

single and multiple regression analysis to derive linear equations that use the

attributes to predict holistic scores. Various combinations of the attributes

were used. Table 3-9 shows several of the derived formulas using various combinations

of the selected attributes.

Table 3-9: Sample linear formulas derived from WINKS 4.80 multivariable regression analysis

for holistic score prediction using various combinations of the selected WordMap attributes.
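
As an illustration of how a single-variable linear formula of this kind is obtained, the following Python sketch computes an ordinary least-squares fit by hand. It is not the WINKS procedure used in the thesis; the function name and the data points are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the cross-products of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical attribute values and scores lying exactly on Score = 1 + 0.05 * x.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [1.5, 2.0, 2.5, 3.0]
intercept, slope = fit_line(xs, ys)
```

With real essay data the points do not lie on a line, and the fitted intercept and slope are the values that minimize the squared prediction error over the training essays.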

3.7 Custom Excel Spreadsheets for Prediction and Evaluation of Scores

Returning to the basic information flow outline of Figure 3-9, as the linguistic

maturity attributes were selected and analyzed for this study, various prediction

algorithms were derived. The prediction algorithms were refined and tested by

comparing how well they predict the scores of the 60 essays in the training set, also

called a self-test of the training data. This refinement cycle was supported by custom

Excel spreadsheets to calculate predicted scores and evaluate how closely the scores

relate to the scores assigned by the judges.

CHK-Length PTA-Gain CTA-Gain SLA-seg WDA-Ratio   Linear Regression Formula
x                                                Score = -0.4748375 + 0.0916602 x
x y                                              Score = 0.0342 x + 0.0436 y + 0.23
x y z                                            Score = 0.02486 x + 0.02586 y + 0.1318 z + 1.21
x y z w v                                        Score = 0.0030298 x + 0.0150636 y + 0.0375125 z - 0.0437892 w - 0.1720084 v + 0.66971

Table 3-10 shows the basic visible format of the spreadsheet used to test the

prediction algorithm for the CTA-Gain variable by itself against the training data.

The file name and ID numbers allow the cross reference to the actual essay in the

appropriate corpus subset of essays. The two judges’ scores, which are usually

whole numbers, are recorded. The average of the two scores becomes a half-point score when

the judges do not agree on the score but are within one point of each other.

    A               B    C    D     E          F           G          H       I           J
1   File            J1   J2   Ave   CTA-Gain   Predicted   Round 0.5  Exact   Within 0.5  Within 1.0
                                               Score
    TR-01.rnd                 2.5   37.1134    2.926984    3          1       1
    TR-02.rnd                 1.5   25.7732    1.887539    2          1       1
    TR-03.rnd                 2.5   32.98969   2.549004    2.5        1       1
    TR-04.rnd                       26.80412   1.982034    2          1       1
    TR-05.rnd                       18.5567    1.226073    1          1       1
    TR-06.rnd                       44.3299    3.58845     3.5        0       1
    …
    TR-57.rnd                 1.5   21.64948   1.509558    1.5        1       1
    TR-58.rnd                       36.08247   2.832489    3          1       1
    TR-59.rnd                       28.86598   2.171024    2          1       1
    TR-60.rnd                       19.58763   1.320569    1.5        0       1
63  Totals                                                            26      50
    % of 60 essays                                                    43.33%  83.33%      95.00%

Table 3-10: Score prediction and evaluation spreadsheet CTA-Gain self test – training set

The CTA-Gain numbers are the actual attribute values from WordMap. This

particular algorithm using CTA-Gain for prediction of the holistic score is encoded in the

spreadsheet as =-0.4748375 + E2 * 0.0916602 for cell F2. The predicted

score is rounded to half points using the formula =INT(2 * F2 + 0.5)/2 for cell

G2. This formula rounds 2.93 to 3. Three tests are made to compare the

predicted score with the judges’ score. Predicted scores that match the judges’ scores are

counted first, then scores that are within 0.5 of the judges’ scores and finally scores that

are within 1.0 of the judges’ scores. In the cases where the judges agree on the holistic

score, this evaluation is straightforward. But when the judges disagree, and an average

score is derived, the criterion is expanded to agreement with either of the judges, or the

average score between the two.

In the current example for essay number 1, one judge scored the essay at 2 and

the other at 3, with the average being 2.5. The predicted score of 2.93 was rounded to

3.0, which exactly agrees with judge number 2 and so qualifies as an exact match, a within

0.5 match and a within 1.0 match.

For essay number 6, both judges agreed that it should be scored at 3.0. The

predicted score was 3.59, which was rounded to 3.5. This does not count as an exact match

with 3.0, but 3.5 is within 0.5 of 3.0 and receives an evaluation point for being within 0.5

and for being within 1.0 of the judges’ holistic scores.

In the spreadsheet, the formulas used to test for matching results are shown in Table 3-11.

1   Exact match after rounding   =IF(OR(G3=C3,G3=D3,G3=E3),1,0)
2   Match within 0.5             =IF(OR(ABS(G3-C3) <= 0.5, ABS(G3-D3) <= 0.5),1,0)
3   Match within 1.0             =IF(OR(ABS(G4-C4) <= 1, ABS(G4-D4) <= 1),1,0)

Table 3-11: Excel spreadsheet formulas for evaluating predicted scores vs. judges scores
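
The spreadsheet logic described in this section can be restated as a short Python sketch. The coefficients and the rounding arithmetic are taken from the text; the function names are hypothetical, and the match tests follow the prose criterion (agreement with either judge or with their average) rather than the exact cell references of Table 3-11.

```python
import math

INTERCEPT = -0.4748375  # CTA-Gain single-attribute formula quoted in the text
SLOPE = 0.0916602

def predict(cta_gain):
    """Predicted holistic score from the CTA-Gain attribute alone."""
    return INTERCEPT + SLOPE * cta_gain

def round_half(score):
    """Round to the nearest half point, as in =INT(2 * F2 + 0.5)/2."""
    return math.floor(2 * score + 0.5) / 2

def evaluate(rounded, judge1, judge2):
    """Return (exact, within 0.5, within 1.0) flags as 0 or 1."""
    targets = (judge1, judge2, (judge1 + judge2) / 2)
    exact = int(any(rounded == t for t in targets))
    within_half = int(any(abs(rounded - t) <= 0.5 for t in targets))
    within_one = int(any(abs(rounded - t) <= 1.0 for t in targets))
    return exact, within_half, within_one

# Essay 1 from the text: CTA-Gain 37.1134, judges scored 2 and 3.
essay1 = evaluate(round_half(predict(37.1134)), 2, 3)
# Essay 6 from the text: CTA-Gain 44.3299, both judges scored 3.
essay6 = evaluate(round_half(predict(44.3299)), 3, 3)
```

For these two essays the sketch gives the same outcomes as the worked examples: essay 1 counts in all three categories, and essay 6 counts as within 0.5 and within 1.0 but not as an exact match.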

3.8 Custom Algorithm Development

An alternative to a complicated multi-attribute linear formula for predicting the

holistic score is to make a hybrid system with some linear components and some custom

logical components. In early versions of this thesis research, which used a rounding formula to

produce whole numbers for holistic scores instead of allowing half-point values, the

linear regression formula using CTA-Gain, PTA-Gain and WDA-Ratio was

supplemented with custom formulas using those three attributes plus the essay length

(CHK-Length) attribute. The second formula used the essay length to override the regression

analysis formula and better predict the level-one scores. The third formula used a

combination of PTA-Gain, CTA-Gain and WDA-Ratio values to better predict

the level-five scores, which were not well represented in the training set.

The complicated algorithm is documented in pseudocode in Figure 3-10. This

formula uses four of the five selected linguistic maturity variables. SLA-seg, sentence

density, is not included.

Formula 1:
    Score = 0.18904 + CTA-Gain * 0.0418584 +
            PTA-Gain * 0.0188846 + WDA-Ratio * 0.7209701;

Formula 2:
    if (Score <= 2 && CHK-Length < 100) then
        Score = 1;

Formula 3:
    if (CTA-Gain > 40 && PTA-Gain > 40
        && (CTA-Gain + PTA-Gain) > 90 &&
        WDA-Ratio > 1) then Score = 5;

Figure 3-10: Basic formula derived from regression analysis with added custom

conditions from manual analysis of training set data.

3.9 Memory-based Machine Learning

A contrasting approach to linear multivariable regression or deriving custom

algorithms for prediction of holistic scores is to use the emerging technology of

memory-based learning that is being successfully applied to many linguistic processing tasks

(Daelemans and van den Bosch, 2005). This approach is particularly useful in today’s

world of annotated text databases and extensive linguistic corpora available for computer

processing.

Memory-based learning methods make predictions using the linguistic text,

phonetic, syntactic or semantic data directly, instead of relying on hand-crafted or

computer-generated rules. This approach has relevance to psycholinguistic studies that

support a single memory and cognitive system for both regular and irregular linguistic

forms, as opposed to a rule-based system for regular forms with an exception list for

irregular forms.

This empirical or inductive approach to natural language processing (NLP) is

contrasted with the “rationalist” or “deductive” knowledge-based approach that in the

past has dominated this subfield of artificial intelligence research. Rules and decision

trees are often hand-crafted or computer-generated from the source data that forms the

basis for the language processing system. Memory-based language processing (MBLP)

is presented as a lazy learning approach that is contrasted with the eager learning

approach of formal rules. The eager systems try to abstract the data and filter out the

exceptional behaviors, where the lazy systems keep all of the data in memory for

processing as needed at retrieval time. A lazy system exhibits rule-like behavior when

certain paths through the memorized examples are followed over and over again, like an

oft-taken path in the countryside.

Data is entered into the MBLP system by combining a classifier or outcome with

selected encoded features. These features can be specific words, orthographic or

phonetic word representations, syntactic or semantic word categories, etc.

3.9.1 TiMBL Memory-Based Language Processing System

One of the most popular and sophisticated implementations of memory-based

learning comes from the ILK research group at Tilburg University in the Netherlands in a

system called TiMBL, an acronym for Tilburg Memory-Based Learner. This

memory-based machine learning program has been applied to linguistic tasks as varied as German

plural formation, Spanish word stress patterns and orthographic-to-phonemic conversions

(Daelemans and van den Bosch, 2005).

There are several reasons to select TiMBL for the memory-based learning tests

for this thesis. First, this system is widely used in research and is publicly available at no

cost as an open-source program. Second, TiMBL has many flexible options that enable

different statistical methods for nearest-neighbor calculations and rankings that allow for

customized applications to each data environment. Third, continuous numeric values

were required for the analysis in this study and that capability is supported by the TiMBL

system. Fourth, the machine learning libraries and toolkits used for many machine

learning systems require additional programming to implement whereas TiMBL can run

directly from input files with program parameter switches to control its operation.

3.9.2 Processing Options in TiMBL

TiMBL input data files are comma-separated values (CSV) files. For this thesis

a TiMBL input file contains several rows, one row of data containing the attributes for

each ESL essay. Each of the columns corresponds to a particular attribute across all of

the essays. The last column of each row is reserved for the outcome, the holistic score

for the essay. A single training file can be used for self-tests: TiMBL trains itself using

the training file and then for the self-test removes the current row from its memory set

(“leave_one_out” option) and then tries to predict the holistic score based on the other essays’

attributes. This procedure prevents each row from identifying itself in the data as the

closest match.

For the cases involving both a training set and a test set, two files are passed to the

TiMBL system. The system trains itself on the first file and then uses that data to predict

the test data set, one row at a time. After a nearest-neighbor analysis has been made and

a predicted outcome assigned, that outcome is compared against the last column of the

test set to report on how accurate the prediction was in the output report.

TiMBL includes a wide variety of options as program switches to vary the

statistical methods used for predicting the outcome or classification using the variable

values. A few of these settings were used for this thesis.

1. Varying the nearest neighbor number

The default number of nearest neighbors that are retrieved can be varied. The

author tried various settings and finally selected a setting of five nearest neighbors

(switch = -k5).

2. Varying the weighting of the distance measurements

Various criteria could determine how nearest neighbors decide the classification

of the predicted outcome. One approach is to have majority voting. Another is to have

the votes weighted according to the inverse of the distance formula (switch = -dID).

Both settings were tried and the latter one yielded the best results for discrete attribute

analysis of the CTA and PTA subattributes.

3. Discrete variable values vs. numeric variable values

The five selected linguistic maturity attributes were numeric variables as were the

PTA and CTA subattributes with their percentage use number. Numeric mode was used,

selected by the switch -mN.
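
The combination of a fixed number of nearest neighbors, inverse-distance vote weighting and leave-one-out self-testing can be sketched in Python as follows. This is a plain Euclidean nearest-neighbor illustration, not TiMBL's implementation: TiMBL's own distance metrics, feature weighting and treatment of k may differ, and all names and data values below are hypothetical.

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=5):
    """train: list of (features, label). Inverse-distance weighted vote."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        distance = math.dist(features, query)
        # Closer neighbors get larger votes; the constant avoids division by zero.
        votes[label] += 1.0 / (distance + 1e-9)
    return max(votes, key=votes.get)

def leave_one_out_accuracy(rows, k=5):
    """Predict each row from all the other rows and report the hit rate."""
    hits = 0
    for i, (features, label) in enumerate(rows):
        rest = rows[:i] + rows[i + 1:]  # the row may not vote for itself
        if knn_predict(rest, features, k) == label:
            hits += 1
    return hits / len(rows)

# Made-up two-attribute essays with holistic scores 1 and 3.
rows = [
    ([1.0, 1.0], 1), ([1.1, 0.9], 1), ([0.9, 1.2], 1),
    ([5.0, 5.0], 3), ([5.2, 4.9], 3), ([4.8, 5.1], 3),
]
accuracy = leave_one_out_accuracy(rows, k=2)
```

Removing the current row before prediction is what prevents each essay from finding itself as its own closest match.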

3.9.3 TiMBL ESL Data formats

The TiMBL approach seemed to be most appropriate for this work in three areas:

(1) To provide another approach for holistic score prediction for the five

selected summary attributes. This approach uses numeric values for the attributes instead

of discrete variable values.

(2) To provide analysis and holistic score prediction based on the sparse PTA

and CTA subattributes that are summed up to make the PTA-Gain and CTA-Gain

attributes. The first approach uses discrete binary values indicating whether the

subattribute is used or not in the essay.

(3) To provide analysis and holistic score prediction for sparse PTA and CTA

subattributes using numeric values indicating the percentages of these subattributes in the

essay.

For the discrete attribute analysis (2) a customized Perl script was used to extract

out these often very sparse data elements and format them for the TiMBL system with the

variable values only indicating the presence (1) or absence (0) of the variable in the

essay. Table 3-12 shows extracts from the data file for the 77 out of 96 possible CTA

attributes used in the 60 training essays which sum together using existence or non-

existence to make the CTA-Gain variable. The actual file for use with the TiMBL

program does not contain the first row with CTA individual attribute number or the first

column with essay numbers.

In the table, each row represents a set of data for a single training essay. Each

column represents a subattribute of CTA such as aj+er (comparative) that is defined as a

variable for TiMBL. The frequency and percentage use of that attribute is not used in

this file, only the fact that it exists which is represented by a “1” or a “0.” The existence

of the attribute only requires discrete variable value settings instead of continuous range

numeric variable value settings in TiMBL. The last column is the outcome that the

judges gave as a holistic score which becomes TiMBL’s outcome variable that it will use

to train the system.

Essay   1  2  3  4  5  6  7  8  9  10  …  71 72 73 74 75 76 77  Score
5       1  0  0  0  1  1  0  0  0  0   …  0  0  0  1  0  0  0   1
8       1  0  0  0  0  0  0  0  0  0   …  0  0  0  1  0  0  0   1
12      1  0  0  1  0  0  0  0  0  0   …  1  0  1  0  1  0  0   1
18      1  0  0  1  1  0  0  0  0  0   …  0  0  0  1  0  0  0   1
30      1  0  0  0  0  0  0  0  0  0   …  0  0  0  1  0  0  0   1
34      1  0  0  0  0  0  1  0  0  0   …  1  0  1  1  0  0  0   1
39      1  0  0  0  0  0  0  0  0  0   …  1  1  1  1  0  1  0   1
42      1  0  0  0  0  0  1  0  0  0   …  1  0  1  1  0  0  0   1
43      1  0  0  0  0  0  0  0  0  0   …  0  1  0  1  0  0  0   1
55      1  0  0  0  0  0  0  0  0  0   …  1  0  1  1  0  1  0   1
60      0  0  0  0  0  0  0  0  0  0   …  1  0  1  1  0  0  0   1
2       1  0  0  1  1  0  0  0  0  0   …  0  1  1  1  0  0  0   2
4       1  0  0  0  0  0  0  0  0  0   …  0  0  1  1  0  0  0   2
15      1  0  0  1  0  0  1  0  0  0   …  1  0  1  1  1  0  0   2

Table 3-12: Spreadsheet with the 77 used CTA subattributes in the training set ready for use by

TiMBL as a 60 row training set file.
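
The two encodings of the sparse subattributes described in this section, presence or absence versus percentage of use, can be illustrated with a small Python sketch (the thesis itself used a Perl script, which is not reproduced here). The function name, the subattribute names other than aj+er, and the choice of denominator for the percentage (the total count of all subattributes in the essay) are assumptions for illustration.

```python
def encode_essay(counts, feature_names, score, mode="discrete"):
    """Build one comma-separated row: one value per subattribute, then the score."""
    total = sum(counts.values())
    values = []
    for name in feature_names:
        n = counts.get(name, 0)
        if mode == "discrete":
            # Presence (1) or absence (0) of the subattribute in the essay.
            values.append("1" if n > 0 else "0")
        else:
            # Assumed percentage: share of this subattribute among all counted ones.
            pct = 100.0 * n / total if total else 0.0
            values.append(f"{pct:.6f}".rstrip("0").rstrip("."))
    return ",".join(values + [str(score)])

# "sub_b" and "sub_c" are hypothetical placeholder subattribute names.
names = ["aj+er", "sub_b", "sub_c"]
counts = {"aj+er": 1, "sub_c": 3}
discrete_row = encode_essay(counts, names, 2, mode="discrete")
numeric_row = encode_essay(counts, names, 2, mode="numeric")
```

Every essay must be encoded against the same ordered list of subattribute names so that each column means the same thing in every row.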

For the numeric attribute analysis (3), a customized Perl script was used to extract out these often very sparse data

elements and format them for the TiMBL system with the attribute values indicating the

percentage of the usage of that attribute (e.g. 1.234568). Table 3-13 shows extracts from

the data file for the 84 possible PTA attributes used by the 60 training essays which sum

together using existence or non-existence to make the PTA-Gain attribute. In this case

data values represent the percentage of the occurrence of a certain part-of-speech pattern

such as nc-ccn-nc (conjoined class nouns) as the numeric value to use in predictions. This

example illustrates the fact that portions of this CTA and PTA subattribute space are very

sparse.

Essay #   1         2   3         …   82        83  84  Score
5         0         0   0         …   0         0   0   1
8         0         0   0         …   0         0   0   1
12        0         0   0         …   0         0   0   1
18        0         0   0         …   0         0   0   1
30        0         0   0         …   0         0   0   1
34        0         0   0         …   1.234568  0   0   1
39        0         0   0         …   0         0   0   1
42        0         0   0         …   0         0   0   1
43        0         0   0         …   0         0   0   1
55        0         0   0         …   0         0   0   1
60        0         0   0         …   0         0   0   1
2         0         0   0         …   1.242236  0   0   2
4         0         0   0         …   0         0   0   2
15        1.694915  0   0.847458  …   0         0   0   2

Table 3-13: Extract of part of the PTA-Gain subattributes frequency percentage array prepared as

an input file for TiMBL in numeric analysis mode.

4.0 Results

This chapter details the results of using the methodology described in chapter 3.

The results are divided into two main sections: (1) Algorithm-based predictions and

(2) Memory-based learning predictions.

4.1 Algorithm-Based Predictions

Only the summary attributes were used in the predictions that were derived by

algorithms. The attributes used were CHK-Length, CTA-Gain, PTA-Gain, SLA-seg and

WDA-Ratio as previously discussed.

4.1.1 Self-Prediction of Training Set with a Linear Formula for Each Attribute

Each of the five summary attributes had a similarly strong correlation with the

training set holistic scores assigned by the judges. The self-prediction tests reflected this

high correlation. A linear formula was derived for each single variable via regression

analysis by the WINKS statistical program. Table 4-1 displays the predictions for the 60

essays of the training set compared with the judges’ scores. This test shows how well the

algorithms predict the score of the essays in the training set itself, or self-prediction. The

algorithm for each attribute derived by regression analysis using the WINKS 4.80

statistical program is included in the table.

These results provide a baseline of how well the regression formula works on the

data that was used to generate it. The results reported in this thesis contain not only the

default standard of how many predicted scores were within 1.0 holistic score point of the

judges’ scores, but also predictions that exactly match the judges’ scores and those that

are within 0.5 of the judges’ scores. When it is considered that the default reporting of

being within 1.0 for a judge’s score of 3 would allow both 2 and 4 as possible predicted

scores, the exact prediction and the “within 0.5” prediction reporting adds a much more

stringent level of prediction to this kind of study.

Attribute    Exact Prediction  Within 0.5  Within 1.0  Regression Analysis Algorithm
CHK-Length   43.33%            88.33%      98.33%      1.0618116 + CHK-Length * 0.0079894
CTA-Gain     55.00%            83.33%      95.00%      -0.4748375 + CTA-Gain * 0.0916602
PTA-Gain     38.33%            81.67%      95.00%      0.8684423 + PTA-Gain * 0.0558809
SLA-seg      45.00%            66.67%      91.67%      3.3918714 + SLA-seg * -0.1452239
WDA-Ratio    43.33%            86.67%      95.00%      1.2406975 + WDA-Ratio * 2.2475918
Average      46.00%            81.33%      95.00%      N/A

Table 4-1: Holistic score self-test prediction compared with judges’ score for individual attribute

linear equations (Winter 2002).

4.1.2 Self-Prediction of Training Set with a Linear Formula for All Five Attributes

The individual formulas using each attribute provide very good predictions on the

training data. The combination formula for all five attributes provides a higher self-

prediction level for each category than the average of the five selected attributes

individually. These results are detailed in Table 4-2.

Exact Prediction  Within 0.5  Within 1.0  Regression Analysis Algorithm
46.67%            85.00%      100.00%     0.6697 + CTA-Gain * 0.0375125
                                          + PTA-Gain * 0.0150636
                                          + CHK-Length * 0.0030298
                                          + WDA-Ratio * -0.1720084
                                          + SLA-seg * -0.0437892

Table 4-2: Holistic score self-test prediction compared with judges’ score for one linear equation

containing all five variables (Winter 2002).

4.1.3 Prediction of Test Set Scores with a Linear Formula for Each Attribute

Each single attribute linear formula derived from the training data for Winter

2002 was applied to the sight unseen test data including the Winter 1992, Fall 2001,

Winter 2001 and Summer 2001 test sets. The algorithms were listed previously in Table

4-1. Table 4-3 shows the CHK-Length essay length variable linear regression equation

trained only with the Winter 2002 training essays and how well it predicts the scores for

the four test sets. Note that throughout these results, the test predictions for the Winter

1992 essays are lower than those for the 2001 essays. That difference may be due to the fact that

all training has been done on the Winter 2002 data set. More standardization and

consistency in scoring by judges is assumed between 2002 and 2001 than between 2002

and 1992. But, nevertheless, it is very interesting to be able to get these levels of

accurate predictions on the 1992 essays after only being trained on the 2002 essays.

Test Set Exact Prediction Within 0.5 Within 1.0

Winter 1992 25.00% 62.50% 79.17%

Fall 2001 44.16% 76.62% 90.91%

Winter 2001 26.67% 63.33% 86.67%

Summer 2001 51.16% 79.07% 97.67%

Average 36.75% 70.38% 88.61%

Table 4-3: Holistic score prediction for test sets compared with judges’ score for individual

attribute linear equation for CHK-Length essay length attribute.
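
The Average row of Table 4-3 appears to be the simple unweighted mean of the four test-set percentages, that is, it is not weighted by the number of essays in each test set. The following Python sketch, with a hypothetical helper name, reproduces those averages to within rounding.

```python
def column_means(rows):
    """Unweighted mean of each column across the test sets."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

# Percentages from Table 4-3: (exact, within 0.5, within 1.0).
table_4_3 = [
    (25.00, 62.50, 79.17),  # Winter 1992
    (44.16, 76.62, 90.91),  # Fall 2001
    (26.67, 63.33, 86.67),  # Winter 2001
    (51.16, 79.07, 97.67),  # Summer 2001
]
means = column_means(table_4_3)
```

Because the test sets differ in size, an average weighted by essay count would generally give somewhat different figures.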

Table 4-4 shows the CTA-Gain part of speech summary attribute linear regression

equation trained only with the Winter 2002 training essays and how well it predicts the

scores for the four test sets.

Test Set Exact Prediction Within 0.5 Within 1.0

Winter 1992 23.61% 54.17% 81.94%

Fall 2001 53.25% 85.71% 90.91%

Winter 2001 26.67% 63.33% 86.67%

Summer 2001 51.16% 79.07% 95.35%

Average 38.67% 79.07% 95.35%

Table 4-4: Holistic score prediction for test sets compared with judges’ score for individual

attribute linear equation for the CTA-Gain attribute.

Table 4-5 shows the PTA-Gain part-of-speech pattern summary attribute linear

regression equation trained only with the Winter 2002 essays and how well it predicts the

scores for the four test sets.

Test Set Exact Prediction Within 0.5 Within 1.0

Winter 1992 26.39% 66.67% 80.56%

Fall 2001 49.35% 87.01% 96.10%

Winter 2001 16.67% 53.33% 86.67%

Summer 2001 39.53% 76.74% 88.37%

Average 32.99% 70.94% 87.93%

Table 4-5: Holistic score prediction compared with judges’ score for individual attribute linear

equation for the PTA-Gain attribute.

Table 4-6 shows the SLA-seg sentence density attribute linear regression equation

trained only with the Winter 2002 essays and how well it predicts the scores for the four

test sets.

Test Set Exact Prediction Within 0.5 Within 1.0

Winter 1992 33.33% 61.11% 77.78%

Fall 2001 38.96% 67.53% 89.61%

Winter 2001 30.00% 50.00% 80.00%

Summer 2001 44.19% 55.81% 79.07%

Average 36.62% 55.81% 79.07%

Table 4-6: Holistic score prediction compared with judges’ score for individual attribute linear

equation for the SLA-seg attribute.

Table 4-7 shows the WDA-Ratio vocabulary richness attribute linear regression

equation trained only with the Winter 2002 training essays and how well it predicts the

scores for the four test sets.

Test Set Exact Prediction Within 0.5 Within 1.0

Winter 1992 20.83% 66.67% 80.56%

Fall 2001 45.45% 75.32% 90.91%

Winter 2001 26.67% 56.67% 90.00%

Summer 2001 41.86% 79.07% 100.00%

Average 33.70% 69.43% 90.37%

Table 4-7: Holistic score prediction compared with judges’ score for individual attribute linear

equation for the WDA-Ratio attribute.

4.1.4 Prediction of Test Set Scores with a Linear Formula for All Five Attributes

By combining all five of the selected summary attributes and deriving a multiple

variable regression linear prediction formula, it was hoped that the results would be better

than the individual attributes. The algorithm was previously shown in Table 4-2. Table

4-8 shows the correct prediction percentages using this combined formula on the four test

data sets.

Test Set Exact Prediction Within 0.5 Within 1.0

Winter 1992 22.22% 62.50% 81.94%

Fall 2001 53.25% 84.42% 93.51%

Winter 2001 26.67% 60.00% 83.33%

Summer 2001 51.16% 79.07% 95.35%

Average 38.33% 71.50% 88.53%

Table 4-8: Holistic score prediction compared with judges’ score for the linear

equation including all five selected summary attributes.

Table 4-9 compares the average prediction values over the four test sets for the

five attributes separately with the single algorithm putting together all five attributes.

The combined algorithm is, on average, an improvement over the individual single-attribute

algorithms. The “average of the averages” row averages the first five rows of the table.

Summary Attribute                      Exact Prediction  Within 0.5  Within 1.0
CHK-Length                             36.75%            70.38%      88.61%
CTA-Gain                               38.67%            79.07%      95.35%
PTA-Gain                               32.99%            70.94%      87.93%
SLA-seg                                36.62%            55.81%      79.07%
WDA-Ratio                              33.70%            69.43%      90.37%
Average of single Attribute Averages   35.75%            69.12%      88.27%
All Five Attributes                    38.33%            71.50%      88.53%

Table 4-9: Comparison of individual attribute linear regression algorithms with a combined five

variable algorithm on predicting test essay scores.

4.1.5 Prediction of Test Set Scores with Customized Algorithm

Even with highly accurate holistic score prediction values for these algorithms,

single and multivariable, it was noticed that the predictions were less reliable at the

low and high ends of the scoring scale. A linear formula was derived for the three

variables CTA-Gain, PTA-Gain and WDA-Ratio. Adjustments were then made to the

derived score with custom formulas developed for this thesis by inspection of the training

set data using the variables to better predict the holistic scores for levels 1 and 5. In this

algorithm, SLA-seg, the sentence density attribute, was not used. Level 5 was not well

represented in the training set sample and this custom algorithm made adjustments to

better predict that level.

Table 4-10 shows the self-test predictions using this custom algorithm on the

training set data for the Winter 2002 essays. The linear regression algorithm using CTA-

Gain, PTA-Gain and WDA-Ratio is overridden by other formulas using the first

predicted score and attribute values. First, Formula #1 is used to derive a score based on

the three attributes. Then, Formula #2 is applied. If the score from this first formula is

less than or equal to two, and if the essay length attribute (CHK-Length) is less than 100,

then we override the score computed by the linear regression algorithm and set the score

to 1. Finally, Formula #3 is applied: if CTA-Gain and PTA-Gain are both greater than

40, their sum is greater than 90, and WDA-Ratio is greater than 1, we override the predicted score and set it to

five. The table values evaluate the predictions for the training essays themselves from

this set of computed and customized formulas.
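
For readers who prefer executable code, the three formulas can be restated as the Python sketch below. It follows the pseudocode of Figure 3-10 and Table 4-10, including the WDA-Ratio > 1 condition in Formula 3; the function name is hypothetical, the term the pseudocode labels WDA-Gain is assumed to mean the WDA-Ratio attribute, and the final rounding step is left out.

```python
def custom_score(cta_gain, pta_gain, wda_ratio, chk_length):
    """Hybrid score: linear regression plus two override rules."""
    # Formula 1: three-attribute linear regression.
    score = (0.18904 + cta_gain * 0.0418584
             + pta_gain * 0.0188846 + wda_ratio * 0.7209701)
    # Formula 2: a short essay with a low regression score drops to level 1.
    if score <= 2 and chk_length < 100:
        score = 1
    # Formula 3: strong part-of-speech and vocabulary values jump to level 5.
    if (cta_gain > 40 and pta_gain > 40
            and cta_gain + pta_gain > 90 and wda_ratio > 1):
        score = 5
    return score

# Made-up attribute values chosen to exercise each branch.
short_essay = custom_score(20, 20, 0.5, 80)     # Formula 2 applies
strong_essay = custom_score(50, 45, 1.2, 300)   # Formula 3 applies
middle_essay = custom_score(30, 30, 0.8, 200)   # Formula 1 only
```

The overrides act only at the two ends of the scale, so essays in the middle range keep the score given by the regression formula.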

Table 4-11 shows the test set predictions for the four test sets of essays that use

this same set of formulas trained on the training set.

Exact Prediction  Within 0.5  Within 1.0
51.67%            78.33%      95.00%

Custom Algorithm Description

Formula 1:
    Score = 0.18904 +
            CTA-Gain * 0.0418584 +
            PTA-Gain * 0.0188846 +
            WDA-Ratio * 0.7209701;

Formula 2:
    if (Score <= 2 && CHK-Length < 100) then
        Score = 1;

Formula 3:
    if (CTA-Gain > 40 && PTA-Gain > 40
        && (CTA-Gain + PTA-Gain) > 90 &&
        WDA-Ratio > 1) then
        Score = 5;

Table 4-10: Self-prediction of Winter 2002 essays using custom algorithm derived from training

data.

Test Set Exact Prediction Within 0.5 Within 1.0

Winter 1992 23.61% 59.72% 81.94%

Fall 2001 55.84% 76.62% 94.81%

Winter 2001 46.67% 60.00% 83.33%

Summer 2001 46.51% 74.42% 97.67%

Average 43.15% 67.69% 89.44%

Table 4-11: Prediction of holistic scores for four test sets using customized algorithm

derived from the training data (Table 4-10 formulas)

4.2 Memory-based Learning Predictions

Finally, TiMBL machine learning was used for two sets of

predictions. The first study predicted holistic scores based on individual components of

the CTA and PTA linguistic maturity subattribute groups, both in numeric and discrete

modes. The second study predicted holistic scores using the numeric values of the five

selected summary attributes.

4.2.1 CTA and PTA Individual Attributes Training Set Self-Tests

The CTA-Gain and PTA-Gain attributes consisted of a count of the individual

parts of speech or part-of-speech patterns that were used by the writer in the essays. The

individual components, or subattributes of PTA-Gain or CTA-Gain, consist of a sparse

array of attribute values. Two separate modes were used to see if TiMBL could correctly

predict the holistic scores based on the training set of these sparse arrays. One of these

modes was for just the presence of the subattributes and the other was for the actual

percentage for the frequencies of the use of these subattributes.

Table 4-12 shows the self-prediction runs for the subattributes of the CTA and

PTA linguistic maturity groups. The training set is the Winter 2002 essays and each

essay in the training set is being tested. This is a self-prediction test using the

“leave_one_out” TiMBL option to see how well the training data could be predicted.

TiMBL options for the discrete runs were “-k5 -a0” using the standard nearest-neighbor

matching algorithm IB1 and 5 nearest neighbors. TiMBL options for the numeric runs

were “-k5 -a0 -mN” to select numeric mode with 5 nearest neighbors and the standard

matching algorithm.

Attribute       Exact Prediction   Within 0.5   Within 1.0
CTA Discrete    41.67%             68.33%       88.33%
CTA Numeric     60.00%             73.33%       95.00%
PTA Discrete    53.33%             61.67%       85.00%
PTA Numeric     56.67%             60.00%       93.33%

Table 4-12: TiMBL self-test predictions of correct holistic scores and how they agree with the judges’ scores using discrete and percentage subattributes of the CTA and PTA linguistic maturity statistical groups (Winter 2002).
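The leave-one-out self-test described above can be sketched in a few lines. This is a simplified stand-in for illustration, not TiMBL's IB1 implementation: it uses an unweighted, range-scaled numeric distance and a majority vote among the k nearest training items, whereas TiMBL differs in details such as feature weighting and the handling of equidistant neighbors. All names and data values are hypothetical.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Predict a score for `query` by majority vote among the k nearest training items."""
    n_feat = len(query)
    lo = [min(row[i] for row in train_x) for i in range(n_feat)]
    hi = [max(row[i] for row in train_x) for i in range(n_feat)]

    def distance(a, b):
        # Sum of per-feature differences, each scaled by that feature's range.
        total = 0.0
        for i in range(n_feat):
            span = hi[i] - lo[i]
            if span > 0:
                total += abs(a[i] - b[i]) / span
        return total

    ranked = sorted(range(len(train_x)), key=lambda j: distance(train_x[j], query))
    votes = Counter(train_y[j] for j in ranked[:k])
    # On a tied vote, the score seen first (the nearer neighbor) wins.
    return votes.most_common(1)[0][0]


def leave_one_out(xs, ys, k=5):
    """Predict each item from all the other items in the same set."""
    predictions = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        predictions.append(knn_predict(rest_x, rest_y, xs[i], k))
    return predictions


# Hypothetical two-feature essays with judges' scores (illustration only).
xs = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [5.0, 40.0], [5.2, 42.0], [4.8, 39.0]]
ys = [2.0, 2.0, 2.0, 4.0, 4.0, 4.0]
loo_predictions = leave_one_out(xs, ys, k=3)
print(loo_predictions)
```

Holding each essay out of its own training data matters here: without it, every essay's nearest neighbor would be itself and the self-test would trivially report perfect agreement.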

4.2.2 CTA and PTA Individual Attributes Test Sets Predictions

Near the end of this study, Perl programs and Excel spreadsheets were developed to create training and test data for testing the CTA and PTA subattribute score predictions on the test sets. Unlike the training set self-prediction test, which could be run using a .csv dump from a spreadsheet, the individually named subattributes had to be matched and aligned between the two data sets so that TiMBL would use the proper variables for its training and predictions. Not every attribute that occurred in the training set occurred in the test set, and vice versa.
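The alignment step can be sketched as follows. This is not the Perl program used in the study; it is a minimal Python illustration that assumes each essay is available as a mapping from subattribute name to value and that a subattribute missing from an essay can be filled with zero. Using the union of the names from both sets as the shared column order is also an assumption. The subattribute names are hypothetical.

```python
def align_features(train_rows, test_rows, fill=0.0):
    """Put training and test essays into one shared, ordered set of columns.

    Each row is a dict mapping a subattribute name to its value for one essay.
    Returns the column names plus the aligned training and test vectors.
    """
    # Shared column order: every name seen in either data set, sorted.
    names = sorted({name for row in train_rows + test_rows for name in row})

    def to_vector(row):
        # A subattribute absent from this essay gets the fill value.
        return [row.get(name, fill) for name in names]

    return names, [to_vector(r) for r in train_rows], [to_vector(r) for r in test_rows]


# Hypothetical essays: "ADJ N" occurs only in training, "PREP N" only in test.
train = [{"N": 60.0, "ADJ N": 40.0}]
test = [{"N": 50.0, "PREP N": 50.0}]
names, train_vectors, test_vectors = align_features(train, test)
print(names)
print(train_vectors)
print(test_vectors)
```

With a fixed column order shared by both sets, the value in position i always refers to the same subattribute, which is the property a misaligned file would violate.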

Initial runs similar to the self-tests described above, using the Winter 2002 training set and the Winter 1992 test set, are shown in Table 4-13. The results seem disappointingly low after such good results in the self-test for the training set. They need to be reviewed to make sure that an alignment error was not made in the Perl program. The current program requires modifications to work with each data set, and preparing and running a TiMBL test trained on the Winter 2002 training set for a new test data set is currently a slow process with manual and automatic editing steps. This process was not needed for the self-test of the training set on itself.

Attribute       Exact Prediction   Within 0.5     Within 1.0
CTA Discrete    20.55%             27.40%         54.79%
CTA Numeric     24.66%             38.36%         63.01%
PTA Discrete    Under review       Under review   Under review
PTA Numeric     12.33%             13.70%         52.05%

Table 4-13: CTA and PTA subattributes predictions using TiMBL for training set Winter 2002 and test set Winter 1992.

4.2.3 TiMBL Predictions using Five Selected Summary Attributes

The most exciting result to report from this research is how well the TiMBL program predicted holistic scores based on the five selected linguistic maturity summary attributes: CHK-Length, CTA-Gain, PTA-Gain, SLA-seg, and WDA-Ratio. The TiMBL settings “-k5 -a0 -mN” were used: 5 nearest neighbors, the standard IB1 algorithm, and numeric mode for the variable values. The results for the four test sets are shown in Table 4-14.

Test Set       Exact Prediction   Within 0.5   Within 1.0
Winter 1992    41.67%             47.22%       87.50%
Fall 2001      70.13%             70.13%       98.70%
Winter 2001    46.67%             53.33%       96.67%
Summer 2001    62.79%             65.12%       100.00%
Average        55.31%             58.95%       95.72%

Table 4-14: Holistic score prediction for the four test data sets compared with judges’ score using memory-based learning with all five selected summary attributes

Table 4-15 compares the TiMBL prediction percentages using all five attributes with the previous algorithms that use all five attributes, including the linear prediction formulas from regression analysis. Exact matches are clearly much better with the TiMBL five-attribute memory-based predictions than with all other approaches (18 percentage points higher), and the TiMBL matches within 1.0 of the judges’ scores are above all the algorithm-based approaches, averaging seven percentage points above. The TiMBL predictions within 0.5 points of the judges’ scores, on the other hand, are lower than those of all but one of the linear formulas. Overall, however, the predictions within 1.0 of the judges’ scores, together with the very accurate exact-match scores, seem to indicate a clear winner: memory-based machine learning is a reliable and flexible retrieval mechanism for predicting linguistic maturity scores for ESL essays.

Analysis Method                            Exact Prediction   Within 0.5   Within 1.0
Linear equation for all five attributes    38.33%             71.50%       88.53%
Customized algorithm                       43.15%             67.69%       89.44%
TiMBL memory-based learning                55.31%             58.95%       95.72%

Table 4-15: Holistic score prediction averages for the four test data sets compared with judges’ score using all five selected attributes using the regression algorithms, customized algorithms and memory-based learning using TiMBL.

5.0 Conclusions

Both algorithmic and machine learning implementations of holistic score prediction for ESL essays using linguistic maturity attributes were part of this study. One conclusion from the analysis of the data is that, with both of these methods, prediction of holistic scores for ESL essays using linguistic maturity attributes can reach a level of agreement with human judges’ scores that has not been demonstrated before. The best earlier level of 66% agreement within 1.0 point of the judges’ holistic scores has been consistently beaten by the 80% and 90% levels achieved in this study, which reached 100% for one of the test data sets.

It was very interesting to observe that the best attributes for predicting holistic scores turned out to be high-level summary attributes that emphasize the writer’s strengths rather than the writer’s mistakes. For example, the attribute measuring variety of part-of-speech use is much more highly correlated with holistic scores than a count of run-on sentences or the density of errors in the essay. It was also interesting that attributes from different attribute groups were all highly correlated with the holistic scores. This would seem to support the intuition that the holistic score is formulated from several kinds of input, including the quantity of writing, the variety of syntactic structures, and the variety of vocabulary.

Returning to the research question at the beginning of the thesis: Can computer-generated holistic scoring of ESL essays achieve a level of accuracy such that this process could become a useful and efficient tool for ESL teachers and students? Given that the 97% level of accuracy of the E-Rater system is considered useful and efficient for college-level essay tests, the 96% level of prediction within 1.0 point of the judges’ scores from the TiMBL system on ESL essays would seem to reach that useful and efficient level.

One question that follows is what role such a system might play in the

administration of an ESL program. The current study provides predictions on a set of

essays that could be used to place students in different ESL classes but does not study the

more subtle differences between students at a single level (e.g. at level three). At least at

the student placement level in the ESL classes, this study would seem to demonstrate a

capability that could be useful in an ESL program.

Another important conclusion is that the memory-based predictions were considerably more accurate than the pure algorithmic or custom formula predictions, especially on exact score predictions. A possible reason is that the regression algorithms smooth the predictive formulas, so exact predictions are rarer, whereas the nearest-neighbor search in TiMBL works on the actual data, allowing individual comparisons and distance checking to find the closest match among a set of actual data items.

5.1 Future Research

There is considerable future research that could follow from the above

conclusions. Since a variety of attributes has been shown to provide good predictive

power, other attributes should be investigated. This study began to investigate the

individual PTA and CTA subattributes. Further research could determine if these sparse

attributes, individually or in combination, can predict holistic scores.

The ETS E-Rater program uses about 50 attributes to distinguish a 1-6 score on

the SAT Essay exam for high school seniors. A topic of future research could determine

whether a WordMap-based system could distinguish between writing levels for an ESL

class at a given level. If such a successful prediction could be proved, then such a system

could be a useful tool for the ESL teacher to use as part of classroom instruction.

Another topic would be to see if writing improvement flags already generated by

WordMap for the essay could also be useful to the ESL teacher or student along with the

holistic score that is trained on the teacher’s grading sample essays and/or the standard

grading sample essays for the entire ESL program.

The BYU English Language Center regularly collects modest amounts of

evaluation data. Not only could a larger corpus of data be used for a future study, but

other portions of the evaluation, such as reading level and grammar usage understanding,

could be compared against the linguistic maturity attributes and holistic scores as well.

The current WordMap system runs in DOS mode and has some incompatibilities with current PC systems. The Linux version of WordMap currently under development is more stable than the DOS version running in “DOS box” mode and should be investigated, along with programming updates that adapt the current code base to avoid system calls with compatibility problems. The various programs developed for this thesis should also be better integrated to allow for more extensive research studies.

REFERENCES

Hunter M. Breland and Eldon G. Lytle, 1990. “Computer-Assisted Writing Skill

Assessment,” Annual meeting of the American Educational Research Association

and the National Council on Measurement in Education, Boston.

Jill Burstein, Claudia Leacock and Richard Swartz, 2001, “Automated Evaluation of Essays and Short Answers,” Proceedings of the 5th Computer Assisted Assessment (CAA) Conference, A two-day conference for developers and practitioners of CAA in higher education. July 2001, Loughborough University, Leicestershire, UK.

Jill Burstein and Martin Chodorow, 1999, “Automated Essay Scoring for Nonnative English Speakers,” In Proceedings of the ACL99 Workshop on Computer-Mediated Language Assessment and Evaluation of Natural Language Processing. College Park, MD.

Gregory K. Chung and Harold E. O’Neil, Jr., 1997. Methodological Approaches to

Online Scoring of Essays. Center for the Study of Evaluation, University of

California, Los Angeles.

Walter Daelemans and Antal van den Bosch, 2005, Memory-Based Language

Processing, Studies in Natural Language Processing, Cambridge University Press.

Walter Daelemans, Jakub Zavrel, Ko van der Sloot and Antal van den Bosch, 2004,

TiMBL: Tilburg Memory-Based Learner Reference Guide version 5.1, ILK

Technical Report – ILK 04-02, Tilburg University CNTS - Language Technology

Group, University of Antwerp, The Netherlands.

Susan Scott Ingle, 1994. Using Objective Criteria to Evaluate Proficiency in ESL

Writing. Brigham Young University Thesis.

Dharmendra Kanejiya, Arun Kumar and Surendra Prasad, 2003. Automatic Evaluation of Students’ Answers using Syntactically Enhanced LSA. In Proceedings of the NAACL 2003 Workshop. Edmonton, Alberta, Canada. Association for Computational Linguistics. pp. 53-60.

Ola Knutsson, Teresa Cerratto Pargman and Kerstin Severinson Eklundh, 2003. Transforming Grammar Checking Technology into a Learning Environment for Second Language Writing. In Proceedings of the NAACL 2003 Workshop. Edmonton, Alberta, Canada. Association for Computational Linguistics. pp. 38-45.

Thomas K. Landauer, 2003. Pasteur’s Quadrant, Computational Linguistics, LSA,

Education. In Proceedings of the NAACL 2003 Workshop. Edmonton, Alberta,

Canada. Association for Computational Linguistics. pp. 46-52.

Deryle Lonsdale and Strong-Krause, Diane, 2003. Automated Rating of ESL Essays. In

Proceedings of the NAACL 2003 Workshop. Edmonton, Alberta, Canada.

Association for Computational Linguistics. pp. 61-67.

Eldon G. Lytle, 1975, “Junction Grammar Analysis of Quantifiers,” Implementation

Guide, (BYU Translation Sciences Institute).

Eldon G. Lytle, 1977, “Evolution of Junction Grammar,” Junction Theory and

Application, v. 1. no. 1, (Provo, Utah: BYU Translation Sciences Institute).

www.junction-grammar.com.

Eldon G. Lytle, 1979a, “Doing More with Structure,” Junction Theory and Application,

v. 2. no. 2, (Provo, Utah: BYU Translation Sciences Institute).

Eldon G. Lytle, 1979b, “Junction Grammar: Theory and Application,” Sixth LACUS

Forum, Columbia, SC: Hornbeam Press, Incorporated, pp. 305-343.

Eldon G. Lytle, 1986, WordMAP® User’s Guide, Linguistic Technologies, Inc., Pioche,

Nevada.

Eldon G. Lytle, 1993. Grammar Check Handbook: WordMAP® Version 4.10.

Linguistic Technologies, Inc. Pioche, Nevada.

Eldon G. Lytle, 2005, LANGUAGE in Capital Letters: Unity in Nature, beta E-Book at

url = www.language-icl.com.

Eldon G. Lytle, 2006, WordMAP CTA and PTA Attribute List for WordMAP III, Linguistic Technologies, Inc., Pioche, Nevada.

Eldon G. Lytle and Nelson C. Matthews, 1986. Field Test of the WordMAP Writing Aids

System. Lincoln County School District, Panaca, Nevada.

Jong C. Park, Martha Palmer, and Gay Washburn. 1997. “An English grammar checker

as a writing aid for students of English as a Second Language.” In Proceedings of

the Conference of Applied Natural Language Processing (ANLP). Washington,

DC.

Carolyn P. Rosé, Antonio Roque, Dumisizwe Bhembe, Kurt VanLehn, 2003. A Hybrid Text Classification Approach for Analysis of Student Essays. In Proceedings of the NAACL 2003 Workshop. Edmonton, Alberta, Canada. Association for Computational Linguistics. pp. 68-75.

Brian C. Roberts, 1983. Stylometry and Wordprints: A Book of Mormon reevaluation.

Master’s thesis, Brigham Young University Statistics Department.

Laurence Rudner and Phil Gagne, 2001. An overview of three approaches to scoring

written essays by computer. Practical Assessment, Research and Evaluation,

7(26).

Xinyou Zhang, 1994. The Order of Difficulty of Six Types of Sentence-Combining

Structures for ESL Students. Brigham Young University Thesis.