-
Notifications
You must be signed in to change notification settings - Fork 0
/
SMLP_CYH.Rmd
136 lines (107 loc) · 5.34 KB
/
SMLP_CYH.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
---
title: "SMLP Task 1: Replicate Hsieh et al. (upcoming)"
author: "Cheng-Yu Hsieh"
output: md_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(lme4)
library(lmerTest)
library(glmmTMB)
library(tidyverse)
```
# Data Description
The data were mostly derived from Chinese Lexicon Project of lexical decision (Tse et al., 2017, https://doi.org/10.3758/s13428-016-0810-5) and naming (Tse et al., 2022, https://osf.io/vwnps) performance on traditional two-character compound words. In this analysis, we focused on data of the lexical decision task, where participants were asked to judge if the compound word presented was a real word or a nonword. Using this dataset, we try to replicate key findings of Tse et al. (2017, 2022) by comparing the difference between lexical decision and naming latencies.
```{r load data from github, echo=TRUE}
gh.link <- "https://raw.githubusercontent.com/cyhsieh-psy/SMLP_exercise/main/Chinese_Lexicon_Project_SMLP.csv"
df_raw <- read.csv(gh.link, sep = ",", header = TRUE)
```
## Code book (of the relevant vaiables)
1. word: two-character Chinese word
2. Corr: number of correct trials
3. Err: number of incorrect trials
4. Acc: accuracy of trials
5. RT: Mean reaction time for each word computed across participants (raw)
6. zRT: Mean of the standardized reaction time for each word computed across participants
7. Subtlex_raw_W, Subtlex_raw_C1, Subtlex_raw_C2: raw frequency count of whole word, first character and second (derived from SUBTLEX-CH)
8. Subtlex_CD_W, Subtlex_CD_C1, Subtlex_CD_C2: raw contextual diversity count of whole word, first character and second (derived from SUBTLEX)
9. Google_freq_W, Google_freq_C1, Google_freq_C2: raw frequency count of whole word, first character and second (derived from Google)
10. C1_ID, C2_ID: a label for each character at first position and at second position
11. neighborhood_C1, neighborhood_C2: neighborhood size of first character and second character
12. FSC_C1, FSC_C2: family semantic consistency of first character and second character
13. nomeaning_C1, nomeaning_C2: number of meanings of first character and second character
# Model Comparison/Selection
```{r model of RT}
# deleting items with missing values
df_test <- filter(df_raw,
Acc >= 0.7,
Google_freq_W > 0,
Google_freq_C1 > 0,
Google_freq_C2 > 0,
Subtlex_raw_W > 0,
Subtlex_raw_C1 > 0,
Subtlex_raw_C2 > 0,
Subtlex_CD_W > 0)
# Models with different sources of frequency count
mod1 <- lmer(zRT~scale(log(Google_freq_W),scale=F)+
scale(log(Google_freq_C1),scale=F)+
scale(log(Google_freq_C2),scale=F)+
(1 | C1_ID) + (1 | C2_ID),
df_test, REML = F)
mod2 <- lmer(zRT~scale(log(Subtlex_raw_W),scale=F)+
scale(log(Subtlex_raw_C1),scale=F)+
scale(log(Subtlex_raw_C2),scale=F)+
(1 | C1_ID) + (1 | C2_ID),
df_test, REML = F)
mod3 <- lmer(zRT~scale(log(Subtlex_CD_W),scale=F)+
scale(log(Subtlex_CD_C1),scale=F)+
scale(log(Subtlex_CD_C2),scale=F)+
(1 | C1_ID) + (1 | C2_ID),
df_test, REML = F)
AIC(mod1, mod2, mod3)
BIC(mod1, mod2, mod3)
```
# Analysing lexical decision latencies using lmer function
```{r model of RT}
# deleting items with missing values
df <- filter(df_raw,
Acc >= 0.7,
Subtlex_CD_W > 0,
Subtlex_CD_C1 > 0,
Subtlex_CD_C2 > 0,
neighborhood_C1 > 0,
neighborhood_C2 > 0,
FSC_C1 > 0,
FSC_C2 > 0)
# model specification
model.word.RT <- lmer(zRT~scale(log(Subtlex_CD_W),scale=F)+
scale(log(neighborhood_C1),scale=F)+
scale(FSC_C1,scale=F)+
scale(nomeaning_C1,scale=F)+
scale(log(neighborhood_C2),scale=F)+
scale(FSC_C2,scale=F)+
scale(nomeaning_C2,scale=F)+
(1 | C1_ID) + (1 | C2_ID),
data = df, REML = TRUE)
summary(model.word.RT)
```
# Analysing lexical decision accuracy using glmmTMB function
```{r model of accuracy}
y <- cbind(df$Corr, df$Err)
word.Acc <- glmmTMB(y ~ scale(log(Subtlex_CD_W),scale=F)+
scale(log(neighborhood_C1),scale=F)+
scale(FSC_C1,scale=F)+
scale(nomeaning_C1,scale=F)+
scale(log(neighborhood_C2),scale=F)+
scale(FSC_C2,scale=F)+
scale(nomeaning_C2,scale=F)+
(1 | C1_ID) + (1 | C2_ID),
family = betabinomial,
data = df, REML=TRUE)
summary(word.Acc)
```
# Analysis/modelling issues:
1. For model comparison analysis, are there any other approaches better than comparing the AIC/BIC across different models?
2. How to choose the appropriate model for accuracy data? Shall we input number of correct and incorrect trials for binomial models? Or is it possible to input the percentage of accuracy (0%-100%) as a continuous variable? How to deal with overdispersion?
3. What is the best effect size(s) reported for (G)LMM?
4. It takes long time to estimate the model for accuracy data. Are there ways to optimise and speed up the process?