Cross Validation
Using a TrainedRotation, we can perform k-fold cross-validation on an ENA model
First, we'll load a few libraries:
using EpistemicNetworkAnalysis
using DataFrames
using Statistics
using GLM
Second, we'll load our data and prepare our model config. We'll be using the FormulaRotation
example from the ICQE23 workshop:
data = loadExample("transitions")
deriveAnyCode!(data, :BODY, :Changes, :Mood, :Oily, :Dysphoria, :Cry)
deriveAnyCode!(data, :REFLECT, :Identity, :Longing, :Dream, :Childhood, :Family, :Name, :Letter, :Doubt, :Religion)
deriveAnyCode!(data, :LEARN, :WWW, :Experiment, :Recipe)
deriveAnyCode!(data, :PROGRESS, :Strangers, :Passed, :Out, :Affirmation)
data[!, :All] .= "All"
codes = [:DoseTracking, :SkippedDose, :Happy, :NonHappy, :Sweets, :BODY, :REFLECT, :LEARN, :PROGRESS]
conversations = [:All]
units = [:Date]
rotation = FormulaRotation(
LinearModel, @formula(y ~ 1 + Day), 2, nothing
)
Now we can start setting up our cross-validation. We'll give each row a random number from 1 to 5, setting us up for a 5-fold cross-validation.
k_folds = 5
data[!, :Fold] .= rand(1:k_folds, nrow(data))
Then, we'll iterate. We'll create a trainmodel
with a unitFilter
, using the logic row.Fold != i
to select all units except our hold out set. After that, we'll create a testmodel
with the opposite unitFilter
and rotate it using TrainedRotation(trainmodel)
. That will project our hold out units into our trained embedding. The last thing we'll do in this loop is grab a statistic to add to a results
list:
results = Real[]
for i in 1:k_folds
trainmodel = ENAModel(
data, codes, conversations, units,
windowSize=4,
recenterEmpty=true,
rotateBy=rotation,
unitFilter=(row)->(row.Fold != i)
)
testmodel = ENAModel(
data, codes, conversations, units,
windowSize=4,
recenterEmpty=true,
rotateBy=TrainedRotation(trainmodel),
unitFilter=(row)->(row.Fold == i)
)
result = testmodel.embedding[1, :Formula_AdjR2]
push!(results, result)
end
Finally, we'll display the results and their mean:
println(results)
println(mean(results))
Real[0.6894140464377976, 0.7097994282786444, 0.7075890910659113, 0.7022475568515051, 0.7685992278547673]
0.7155298700977252
Putting it all together, here is a helper function you should be able to drop-in and apply to your own data:
# Helper
function kfoldcv(wholemodel, k_folds, statistic)
results = Real[]
wholemodel.data[!, :Fold] .= rand(1:k_folds, nrow(data))
for i in 1:k_folds
trainmodel = ENAModel(
wholemodel,
unitFilter=(row)->(row.Fold != i)
)
testmodel = ENAModel(
wholemodel,
rotateBy=TrainedRotation(trainmodel),
unitFilter=(row)->(row.Fold == i)
)
result = testmodel.embedding[1, statistic]
push!(results, result)
end
return results
end
# Example usage
wholemodel = ENAModel(
data, codes, conversations, units,
windowSize=4,
recenterEmpty=true,
rotateBy=rotation
)
results = kfoldcv(wholemodel, 5, :Formula_AdjR2)
println(results)
println(mean(results))
Real[0.7008250664900608, 0.6915942868055835, 0.7373538629006076, 0.6995486693104294, 0.7066236614254947]
0.7071891093864353