Re: [Edu-sig] Python Programming: Procedural Online Test

In a message of Sun, 04 Dec 2005 11:32:27 PST, Scott David Daniels writes:
I wrote:
... keeping people at 80% correct is a great rule-of-thumb goal ...
To elaborate on the statement above a bit, we did drill-and-practice teaching (and had students loving it). The value of the 80% is for maximal learning. Something like 50% is the best for measurement theory (but discourages the student drastically). In graduate school I had one instructor who tried to target his tests to get 50% as the average mark. It was incredibly discouraging for most of the students (I eventually came to be OK with it, but it took half the course).
<snip>

'Discouraging' misses the mark. The University of Toronto has professors who like to test to 50% as well. And it causes suicides among undergraduates who are first exposed to this, unless there is adequate preparation. This is incredibly _dangerous_ stuff.

Laura

Hello Laura,

That's better than the Abstract Algebra class I took as an undergraduate. The highest score on Test 1 was 19%; I got 6%! I retook the class from another teacher and topped the class. I liked the subject so much that I took the second semester just for fun. Testing and teaching strategies make a tremendous difference.
-- Best regards, Chuck

One of the main reasons I decided to use an Item Response Theory (IRT) framework is that the testing platform, once fully operational, will not give students questions that are either too easy or too difficult for them, thus reducing anxiety and boredom for low and high ability students, respectively. In other words, high ability students will be challenged with more difficult questions, and low ability students will receive questions that are challenging but matched to their ability. Each score is on the same scale, although some students will not receive the same questions. This is the beautiful thing! That is the concept of adaptive or tailored testing being implemented in the Python Programming: Procedural Online Test (http://www.adaptiveassessmentservices.com).

After reading the comment on 50% being optimal for measurement theory, I have to say that about 90 years ago that was considered best practice, because it maximized item/test variance and so spread out the distribution of scores. This is primarily a World War I and II convention from developing selection tests, namely the Alpha and Beta, used to place conscripts in appropriate combat roles. Those two tests are the predecessors of the SAT administered by the Educational Testing Service, the organization where most of the war psychologists who developed Alpha and Beta went after WW II. Because of their influence in selecting recruits, who then received money after the war to go to college in the form of the GI Bill, these measurement specialists (psychometricians) did the same thing for ETS with the SAT, screening the same cohort for placement in colleges and universities around America. These psychologists had a strong influence on what constituted good practice in standardized testing. Accordingly, the practice of targeting 50% became well entrenched.

Later, in the early 1950s, IRT came on the scene as an alternative to classical test theory. It has some great theoretical and practical advantages over the earlier approach of selecting items that roughly half of examinees answer correctly, but the computing technology was not available then to implement the theory. It wasn't until the advent of the PC in the late 70s and early 80s that psychometricians like me were motivated to begin implementing IRT; once again, the armed services were at the forefront of that development in the late 70s.

It will take another decade or so to break the hold that Classical Test Theory has on measurement, and I expect students' test anxiety to remain high in the interim. But as more and more people begin to realize the benefits of IRT, especially computer adaptive testing, over CTT, it will no longer be an issue of which approach should be used to administer and score tests.
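For the curious, here is a minimal Python sketch of a one-parameter (Rasch) item response function. It is only an illustration of the idea, not the model or code actually used by the test platform, and the numbers are made up.

# A minimal sketch of a one-parameter (Rasch) item response function.
# The probability of a correct answer depends only on the gap between
# ability (theta) and item difficulty (b), which is what lets different
# sets of items report scores on the same theta scale.
import math

def p_correct(theta, b):
    """Chance that an examinee of ability theta gets an item of
    difficulty b right under the Rasch / 1PL model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

for theta in (-1.5, 0.0, 1.5):
    for b in (-1.0, 0.0, 1.0):
        print("theta %+.1f, b %+.1f: P(correct) = %.2f"
              % (theta, b, p_correct(theta, b)))

Because the probability depends only on theta minus the item's difficulty, items calibrated onto the same scale allow two examinees to be scored comparably even if they answered different questions.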

On 5 Dec 2005, at 7:50 AM, damon bryant wrote:
One of the main reasons I decided to use an Item Response Theory (IRT) framework was that the testing platform, once fully operational, will not give students questions that are either too easy or too difficult for them, thus reducing anxiety and boredom for low and high ability students, respectively. In other words, high ability students will be challenged with more difficult questions and low ability students will receive questions that are challenging but matched to their ability.
So far so good...
Each score is on the same scale, although some students will not receive the same questions. This is the beautiful thing!
I'd like to respectfully disagree. I'm afraid that would cause more harm than good. One side of student evaluation is to give feedback *for* the students. That is a relative measure: his/her performance against his/her peers.

If I understood correctly, the proposal is to give a "hard"-A for some and an "easy"-A for others, so everybody has A's (A == 'good score'). Is that it? That sounds like sweeping the dirt under the carpet. Students will know. We have to prepare them to tackle failure as well as success.

I do not mean such efforts are not worthy, quite the reverse. But I strongly disagree with an adaptive scale. There should be a single scale for the whole spectrum of tests. If some students excel, their results must show this; likewise, if some students perform poorly, that should not be hidden from them. Give them a goal and the means to pursue their goal.

If I got your proposal all wrong, I apologize ;o)

best regards,
Senra

Rodrigo Senra -- rsenra @ acm.org -- http://rodrigo.senra.nom.br

Could it be argued that the goal be for all students to score 100% on the desired content?
--
Scott J. Durkin -- Computer Science, Preston Junior High
sdurkin@psdschools.org -- http://staffweb.psdschools.org/sdurkin -- 970.419.7358

[ Scott Durkin ]:
Could it be argued that the goal be for all students to score 100% on the desired content?
That is precisely my goal when I design exams. No success so far ;o)

[ Damon Bryant ]:
No, students are not receiving a hard A or an easy A. I make no classifications such as those you propose. My point is that questions are placed on the same scale as the ability being measured (called a theta scale). Grades may be mapped to the scale, though a hard A or an easy A will not be assigned under the conditions described above.
Because all questions in the item bank have been linked, two students can take the same computer adaptive test but have no items in common between the two administrations. However, scores are on the same scale.
Thank you for taking the trouble to explain it further.

Big hug,
Senra

Rodrigo Senra -- rsenra @ acm.org -- http://rodrigo.senra.nom.br

Could it be argued that the goal be for all students to score 100% on the desired content?
I would argue that it should be one of the goals in designing and implementing a training program. The test could have a different purpose. What we have all experienced in teaching students is that ability is distributed; more than likely that distribution is normal, for whatever reason, and the variation of scores within the distribution can be tight (e.g., SAT quantitative scores at Rice) or loose (e.g., SAT quantitative scores at a junior college, assuming that the SAT is a requirement).

Psychological tests and measures can give us an indication of where students stand in a distribution (norm-referenced testing) or of where each student's achievement level sits relative to some absolute performance criterion (criterion-referenced testing) before, during, or after training. In other words, it depends on the purpose of the test, which is determined before the test is designed and is a major point in evaluating its validity, that is, its accuracy in doing what it purports to do.

Damon
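As a small, made-up illustration of those two purposes (the peer-group thetas and the cut score below are assumptions for the example, not real data):

# A small sketch contrasting the two reporting purposes mentioned above:
# a norm-referenced percentile rank (where a student stands among peers)
# versus a criterion-referenced pass/fail against a fixed cut score.
THETA_CUT = 0.5   # assumed mastery threshold on the theta scale

def percentile_rank(theta, class_thetas):
    """Norm-referenced: percent of the peer group at or below this theta."""
    below = sum(1 for t in class_thetas if t <= theta)
    return 100.0 * below / len(class_thetas)

def mastery(theta, cut=THETA_CUT):
    """Criterion-referenced: did the student clear the cut score?"""
    return "pass" if theta >= cut else "not yet"

class_thetas = [-1.2, -0.6, -0.1, 0.2, 0.4, 0.9, 1.3, 2.0]
for theta in (-0.6, 0.4, 1.3):
    print("theta %+.1f: percentile %5.1f, criterion %s"
          % (theta, percentile_rank(theta, class_thetas), mastery(theta)))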

Damon,

Thank you for your thoughtful response. In terms of the Python tests, I as well would hope that all my students (13- to 15-year-olds) could answer questions based on the content shared - kind of in the spirit of the Computing for All / Core Knowledge approach (NoChildLeftBehind-ish? - not playing "gotcha", but "here is the information we expect you to know; do you know it? can you apply it?"), along with opportunities for students to display and be recognized for comprehension and ability above and beyond what was expressly expected within the standard curriculum - as you indicated with the phrase "training program" in the first paragraph of your response.

As for the assessment of the distributed-ability issues (primarily your second paragraph), I will definitely leave that to the educational psychologists and to whatever is being measured - perhaps things beyond the curriculum.

Thanks again,
Scott

Hi Rodrigo!
If I understood correctly the proposal is to give a "hard"-A for some and an "easy"-A for others, so everybody have A's (A=='good score'). Is that it?
No, students are not receiving a hard A or an easy A. I make no classifications such as those you propose. My point is that questions are placed on the same scale as the ability being measured (called a theta scale). Grades may be mapped to the scale, though a hard A or an easy A will not be assigned under the conditions described above.

Because all questions in the item bank have been linked, two students can take the same computer adaptive test but have no items in common between the two administrations; however, their scores are on the same scale. Research has shown that even low ability students, despite their performance, prefer computer adaptive tests over static fixed-length tests. Adaptive testing has also been shown to lower test anxiety while serving the same purpose as fixed-length linear tests, in that educators are able to extract the same level of information about student achievement or aptitude without banging a student's head against questions that he/she may have a very low probability of getting correct. The high ability students, instead of being bored, receive questions on the higher end of the theta scale that are appropriately matched to their ability, to challenge them.
That sounds like sweeping the dirt under the carpet. Students will know. We have to prepare them to tackle failure as well as success.
In fact, computer adaptive tests are designed to administer items to a person of a SPECIFIC ability that will yield a 50/50 chance of correctly responding. For example, suppose there are two examinees: Examinee A has a true theta of -1.5, and Examinee B has a true theta of 1.5 (the theta scale has a typical range of -3 to 3). There is a question that has been mapped to the theta scale with a difficulty value of 1.5; how we estimate this is beyond our discussion, but it is relatively easy to do with Python. The item is appropriately matched for Examinee B because s/he has approximately a 50% chance of getting it right - not a very high or very low chance of getting it correct, but an equiprobable opportunity of either a success or a failure. According to sampling theory, with multiple administrations of this item to a population of persons with a theta of 1.5, there will be approximately equal numbers of successes and failures on this item, because the odds of getting it correct vs. incorrect are equal. However, with multiple administrations of this same item to a population of examinees with a theta of -1.5, which is substantially lower than 1.5, there will be far more failures than successes.

Adaptive test algorithms seek to maximize information about examinees by estimating their ability and searching for questions in the item bank that match their ability levels, thus providing a 50/50 chance of getting each item right. This is very different from administering a test where the professor seeks an average score of 50%, because there the low ability students get the vast majority of questions wrong, which could potentially increase anxiety, decrease self-efficacy, and lower the chance of acquiring information in subsequent teaching sessions (Bandura, self-regulation). Adaptive testing mitigates the psychological influences of testing on examinees by seeking to provide equal opportunities for both high and low ability students to experience success and failure to the same degree, by giving them items that are appropriately matched to their skill level.

This is the aspect of adaptive testing that is attractive to me. It may not solve the problem, but it is a way of using technology to move in the right direction. I hope this is a better explanation than what I provided earlier.
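Here is a toy Python sketch of such an adaptive loop under a one-parameter (Rasch) model. The item bank, the crude step-size ability update, and the simulated examinees are all invented for illustration; this is not the platform's actual estimation code.

# Toy computer-adaptive loop under a one-parameter (Rasch) model.
import math
import random

def p_correct(theta, b):
    """Rasch model: chance of success for ability theta on difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def next_item(theta_hat, bank, used):
    """Pick the unused item whose difficulty is closest to the current
    ability estimate; under the Rasch model that is the most informative
    item and gives roughly a 50/50 chance of success."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return min(candidates, key=lambda i: abs(bank[i] - theta_hat))

def simulate_cat(true_theta, bank, n_items=10, step=0.8):
    theta_hat = 0.0
    used = set()
    for _ in range(n_items):
        i = next_item(theta_hat, bank, used)
        used.add(i)
        correct = random.random() < p_correct(true_theta, bank[i])
        # Crude update: move the estimate up on a success, down on a
        # failure, shrinking the step as evidence accumulates.
        theta_hat += step if correct else -step
        step *= 0.75
    return theta_hat

random.seed(42)
bank = [b / 10.0 for b in range(-30, 31, 2)]   # difficulties -3.0 .. +3.0
for true_theta in (-1.5, 0.0, 1.5):
    print("true theta %+.1f -> estimated %+.2f"
          % (true_theta, simulate_cat(true_theta, bank)))

A real CAT would use maximum-likelihood or Bayesian scoring rather than the step update above, but even this crude version shows item selection tracking the running ability estimate rather than a fixed difficulty.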

damon bryant wrote:
.... The item is appropriately matched for Examinee B because s/he has approximately a 50% chance of getting it right - not a very high or very low chance of getting it correct, but an equiprobable opportunity of either a success or a failure. ...
Two comments:

(1) You may find that targeting a higher probability of a correct response gives a better subjective experience without significantly increasing the test length required to be confident of the score.

(2) You should track each question's history vs. the final score for the test-taker. This practice can help validate your scoring, as well as help you weed out mis-scored questions.

--Scott David Daniels Scott.Daniels@Acm.Org
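A small sketch of the check in (2), assuming per-item pass/fail results are logged alongside each examinee's final score; the response data and scores below are invented for illustration.

# Correlate each question's pass/fail record with the test-takers'
# final scores (a point-biserial correlation). Items with near-zero or
# negative values are candidates for review as possibly mis-keyed or
# mis-scored.
from statistics import mean, pstdev

def point_biserial(item_responses, final_scores):
    """Correlation between a 0/1 item-response vector and final scores."""
    paired = list(zip(item_responses, final_scores))
    passed = [s for r, s in paired if r == 1]
    failed = [s for r, s in paired if r == 0]
    if not passed or not failed:
        return 0.0   # item shows no variation; nothing to correlate
    p = len(passed) / len(paired)
    return ((mean(passed) - mean(failed)) / pstdev(final_scores)
            * (p * (1 - p)) ** 0.5)

responses = [1, 1, 0, 1, 0, 0, 1, 1]          # one item, eight examinees
finals    = [82, 90, 55, 74, 61, 48, 88, 70]  # their final test scores
print("point-biserial = %.2f" % point_biserial(responses, finals))

For (1), a nearest-difficulty selection rule can simply target items somewhat below the running ability estimate; under a Rasch model, an item about 0.85 theta units easier than the examinee gives roughly a 70% chance of success.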
participants (6):
- Chuck Allison
- damon bryant
- Laura Creighton
- Rodrigo Senra
- Scott David Daniels
- Scott Durkin