Title: eatDB: A Spreadsheet Interface to Relational Data Bases

Authors: Benjamin Becker

Affiliation: Humboldt University Berlin

Abstract:
Educational large-scale assessments often collect a substantial amount of
data. The imputation of missing responses and the estimation of person
parameters via plausible values increase the data volume even further. Storing
this kind of hierarchical data (e.g., pupils nested in classes/schools,
imputations nested in persons) in a common two-dimensional spreadsheet format
(e.g., as .csv, .sav, or .RData) is very inefficient. R needs enough
working memory to hold the complete two-dimensional data, even if only a
subset of the data is to be used for analysis. Therefore, data sets
from educational large-scale assessments like PISA, PIAAC, or the German
Bildungstrend often cannot be loaded into R on common hardware setups. In
other areas, relational data bases are often used to store similar kinds of
hierarchical data tidily (Wickham, 2014) or, in relational data base terms,
normalized, which optimizes storage efficiency and allows easier and
more efficient querying of the data. However, these relational data base
management systems (RDBMS) are rarely used in the educational large-scale
assessment context, probably partly because they require users to learn
SQL. R interfaces like dplyr (Wickham, François, Henry, & Müller,
2018) exist, but the initial data base creation and later joining of data
frames are still rather cumbersome. The R package eatDB is meant to bridge
this gap. It provides a simple R interface for creating data bases and for
extracting data from data bases created via eatDB. It utilizes SQLite3 (SQLite
Development Team, 2018), the R driver RSQLite (Müller, Wickham, James, &
Falcon, 2018) and the R driver framework DBI (R Special Interest Group on
Databases, Wickham, & Müller, 2018). Extracting data from large hierarchical
data sets becomes substantially faster and more memory-efficient. The package
performs exhaustive checks to guarantee the integrity of the data base. In
my presentation, I would like to briefly introduce the ideas behind eatDB,
show how they are implemented, and demonstrate how eatDB can be used in
practice. Furthermore, I would like to show some small benchmark examples
illustrating the reduction in working memory use and the increase in
efficiency that this approach yields compared to common alternatives.
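
The storage idea described above can be sketched directly with DBI and RSQLite. This is only a minimal illustration under stated assumptions, not eatDB's own interface: the table names, column names, and toy values are hypothetical, and the data base lives in memory rather than in a file. Person-level data and imputation-level data are kept in separate tables linked by a key, and only the rows needed for an analysis are joined and loaded into R.

```r
# Minimal sketch (NOT the eatDB API): normalized storage in SQLite via DBI/RSQLite.
# All table names, column names, and values are hypothetical toy data.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Person-level table: one row per pupil.
pupils <- data.frame(
  id_pupil  = 1:3,
  id_school = c(10L, 10L, 20L),
  gender    = c("f", "m", "f")
)

# Imputation-level table: five plausible values per pupil, keyed by id_pupil.
pvs <- data.frame(
  id_pupil   = rep(1:3, each = 5),
  imputation = rep(1:5, times = 3),
  pv_math    = seq(450, by = 10, length.out = 15)
)

dbWriteTable(con, "pupils", pupils)
dbWriteTable(con, "pvs", pvs)

# Join and load only the rows that are needed: plausible values for school 10.
school10 <- dbGetQuery(con, "
  SELECT p.id_pupil, p.gender, v.imputation, v.pv_math
  FROM pupils AS p
  JOIN pvs AS v ON v.id_pupil = p.id_pupil
  WHERE p.id_school = 10
")

dbDisconnect(con)
```

In this layout, pupil-level variables are stored once per pupil instead of once per imputation, and R only has to hold the joined subset in working memory rather than the complete flat data set.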

References:
Müller, K., Wickham, H., James, D. A., & Falcon, S. (2018). RSQLite: 'SQLite'
interface for R [Computer software manual]. Retrieved from
https://CRAN.R-project.org/package=RSQLite (R package version 2.1.1)

R Special Interest Group on Databases, Wickham, H., & Müller, K. (2018). DBI:
R database interface [Computer software manual]. Retrieved from
https://CRAN.R-project.org/package=DBI (R package version 1.0.0)

SQLite Development Team. (2018). SQLite [Computer software manual]. Retrieved
from https://www.sqlite.org/index.html (Version 3.26.0)

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10).
Retrieved from http://www.jstatsoft.org/v59/i10/

Wickham, H., François, R., Henry, L., & Müller, K. (2018). dplyr: A grammar of
data manipulation [Computer software manual]. Retrieved from
https://CRAN.R-project.org/package=dplyr (R package version 0.7.8)