One of five major sources of validity evidence (as outlined by the Standards)
Does the content on the test represent the targeted content domain?
Are specific areas missing?
Are specific areas over-represented?
Operationally, evidence is often gathered through alignment studies.
Judgments made by panels of experts (educators).
Do the test items align with the content standards?
Extend content-related validity evidence through the use of text mining
What thematic topics are represented in the content standards?
How do individual items map on to these topics (if at all)?
What is the overall coverage of the topics across test items?
Corpus of words split into documents
Latent variables (topics) estimated from word co-occurrence
Each document is a mixture of topics
Each topic is a mixture of words
The fundamental idea
Train a model on the content standards to estimate the latent topics represented therein.
Apply the model to the test items to estimate which topics the items represent (based on the text within the item).
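The two-step idea above can be sketched with an off-the-shelf LDA implementation (scikit-learn here, purely as an illustration; the standards and item texts below are invented placeholders, not actual NGSS performance expectations or test items):

```python
# Minimal sketch: train LDA on the standards, then apply it to item text.
# All texts are hypothetical placeholders for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

standards = [
    "analyze and interpret data on organisms interacting in ecosystems",
    "use evidence to construct explanations of energy transfer in earth systems",
    "develop models to describe how genetic information is passed to offspring",
]
items = [
    "which diagram best shows energy transfer between the two objects",
]

# Step 1: estimate the latent topics from the content standards.
vectorizer = CountVectorizer(stop_words="english")
X_standards = vectorizer.fit_transform(standards)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X_standards)

# Step 2: apply the trained model to the item text; each row is the
# item's estimated mixture over the standards-derived topics.
X_items = vectorizer.transform(items)
item_topics = lda.transform(X_items)
print(item_topics.round(2))  # one row per item; each row sums to 1
```

The key design point is that the vocabulary and topics come from the standards alone; items are only scored against that trained model, never used to fit it.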
Science NGSS Performance Expectations
Grade 8 statewide Alternate Assessment based on Alternate Achievement Standards (AA-AAS)
Designed for students with the most significant cognitive disabilities.
1% reporting cap
Reduced in depth, breadth, and complexity
Topics estimated using Latent Dirichlet Allocation
Four methods evaluated to determine optimal number of topics (2-25 evaluated)
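As an illustration of one such quantitative criterion, a hedged sketch using model perplexity (lower is better); the corpus, the narrow 2-5 search range, and the single criterion are simplifications of the actual study, which compared four methods over 2-25 topics:

```python
# Hypothetical sketch of choosing the number of topics by perplexity.
# A toy corpus and a narrow range are used here for brevity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "analyze data on organisms and ecosystems",
    "energy transfer in earth systems",
    "genetic information passed to offspring",
    "use evidence to support scientific claims",
    "design solutions to scientific problems",
    "model energy flow in physical systems",
]
X = CountVectorizer(stop_words="english").fit_transform(corpus)

perplexity = {}
for k in range(2, 6):  # the study searched k = 2 through 25
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X)
    # Held-out text is preferable in practice; training text is used
    # here only to keep the sketch self-contained.
    perplexity[k] = lda.perplexity(X)

best_k = min(perplexity, key=perplexity.get)
```

In practice, different criteria often disagree on the optimal k, which is why a smaller candidate range is handed to content experts for substantive review.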
Smaller range of topics evaluated by two science content experts for substantive meaning
3-6 topics evaluated for substantive meaning
5-topic solution independently arrived upon
| Topic | Substantive Label |
|---|---|
| 1 | Analyzing data and using evidence to understand organisms and systems |
| 2 | Using scientific evidence to understand Earth systems |
| 3 | Energy |
| 4 | Genetic information |
| 5 | Scientific solutions |
Content validity is a critical component of "overall evaluative judgment" (Messick, 1995) of the validity of test-based inferences
Particularly important within standards-based educational systems
Text-modeling may serve as an additional source of evidence (triangulation)
May be useful as a diagnostic tool
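For example, one hypothetical diagnostic rule would flag items whose estimated topic mixture is diffuse, i.e., no dominant standards-derived topic; the mixtures and the 0.5 cutoff below are purely illustrative:

```python
import numpy as np

# Hypothetical item-by-topic mixtures (rows sum to 1), as produced by
# a topic model trained on the standards and applied to item text.
item_topics = np.array([
    [0.90, 0.05, 0.05],  # clearly dominated by topic 1
    [0.40, 0.35, 0.25],  # diffuse: candidate for expert review
])

# Flag items with no topic probability above an (arbitrary) 0.5 cutoff.
flagged = np.where(item_topics.max(axis=1) < 0.5)[0]
print(flagged.tolist())  # → [1]
```

A flag like this would not imply a bad item, only one whose text does not map cleanly to any one topic and so merits a closer look by content experts.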
Results depend upon chosen topic model - different models may lead to different inferences
Our model is preliminary but publicly available; consensus from the field could help inform models that are useful and provide stronger validity evidence.
Our application was in Science with an AA-AAS
What if an item has no text?
Text-modeling could perhaps be used to help "flag" items for further investigation
Alternative ML procedures (e.g., image recognition) may help
Text mining procedures may provide an additional source of evidence
Evidence could be used diagnostically
Topic modeling itself may be useful in understanding the topics represented in either the standards or a given test, independent of the linkage between the two.