# distribglm: Distributed Computing Model from a Synced Folder

The goal of distribglm is to provide an example of a distributed computing model run from a synced folder. Here, “synced folder” usually means Dropbox or Box Sync.
## Installation

You can install the released version of distribglm from CRAN with:
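``` r
install.packages("distribglm")
```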
## Example

This is a basic example which shows you how to solve a common problem:
To set up a model, run `setup_model.R`. Currently, it has only been tested on logistic regression. Running the setup ensures the correct folders exist in the synced folder. Once the folders are synced, each hospital/site should use the `secondary.R` file.
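As a rough illustration, a setup script might create the model's folder in the synced folder and save the model specification where every site can read it. The folder layout, file names, and column names below are assumptions for the sketch, not the package's actual conventions:

``` r
# Hypothetical sketch of what a setup script could do; the real
# setup_model.R may organize things differently.
synced_folder <- "~/Dropbox/Projects/distributed_model"
model_name    <- "logistic_example"

model_dir <- file.path(synced_folder, model_name)
dir.create(model_dir, recursive = TRUE, showWarnings = FALSE)

# Save the model specification so every site can read it.
spec <- list(
  formula = y ~ age + sex,  # column names must match at every site
  family  = binomial()      # only logistic regression has been tested
)
saveRDS(spec, file.path(model_dir, "formula.rds"))
```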
For each site, a few things need to be specified. First, the `site_name` needs to be set in the document; this can be something like `site1` or something like `my_hospital_name`. Second, the site must specify the `synced_folder` path. Though this folder is synced across the sites and the computing site, it may live in different sub-folders on each computer, for example `~/Dropbox/Projects/distributed_model`, `C:/My Documents/distributed_model`, etc.
Additionally, the site may need to specify `all_site_names`, the names of all the sites computing this model. This requirement could be relaxed, but it is currently there so that each site can see what is going on: whether it is waiting for another site to finish computing, or for the computing site to update the beta coefficients.
Third, the site needs to specify the `data_source`. In the cases above, we use a CSV for the data. The important thing is that this data can sit in a secure place: it is used for computation but is never placed in the synced folder. Lastly, the site needs to specify the `model_name` to indicate which model is being analyzed. This could be relaxed: once a model is finalized/converged, all iterations and estimates could be zipped up and the formula removed, so that when the site code checks and finds the formula no longer exists, it exits. A configuration sketch is shown below.
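Putting these together, a site's configuration block in `secondary.R` might look like the following; all values are placeholders, not required names:

``` r
# Illustrative site configuration; every value here is a placeholder.
site_name      <- "site1"
all_site_names <- c("site1", "site2", "site3")
synced_folder  <- "~/Dropbox/Projects/distributed_model"  # may differ per machine
model_name     <- "logistic_example"
data_source    <- "/secure/local/path/site1_data.csv"     # kept OUT of the synced folder

site_data <- read.csv(data_source)
```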
All sites need the data structured so that the column names match exactly across sites for the formula specification to work. An initial data-check step could be added as well.
The computing/master site first needs to set up the model as above. (Perhaps all site names should be specified here too?) Similar to the sites, the computing site needs to specify the `model_name` indicating which model is being analyzed, the `synced_folder` path, and `all_site_names`. That should be all that is needed to start the process and run it until the model converges; a minimal configuration is sketched below.
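A minimal computing-site configuration, mirroring the site setup above (values again placeholders):

``` r
# Illustrative computing/master-site configuration.
model_name     <- "logistic_example"
synced_folder  <- "~/Dropbox/Projects/distributed_model"
all_site_names <- c("site1", "site2", "site3")
```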
Currently, the process that happens is as follows:
1. Each site runs `secondary.R`, and the model design matrix X is created from the site's data.
2. Xβ is computed, which is ŷ, and the residual ε̂ = y − ŷ is formed.
3. δ̂ = Xᵀε̂ is computed. Each site returns δ̂ and the sample size at that site so that they can be combined into δ̄, which is used as the vector of gradients to update β.
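Below is a minimal sketch of this per-site step for logistic regression, along with one plausible combination step at the computing site. The function names are illustrative, not the actual distribglm interface; for logistic regression, ŷ is taken as the inverse logit of Xβ:

``` r
# Sketch of the per-site gradient computation for logistic regression.
# Names here are illustrative, not the distribglm API.
site_gradient <- function(formula, data, beta) {
  mf <- model.frame(formula, data = data)
  X  <- model.matrix(formula, data = data)   # design matrix X
  y  <- model.response(mf)                   # outcome y
  eta   <- drop(X %*% beta)                  # linear predictor X * beta
  y_hat <- 1 / (1 + exp(-eta))               # fitted values: inverse logit
  eps   <- y - y_hat                         # residuals eps = y - y_hat
  delta <- drop(crossprod(X, eps))           # delta = t(X) %*% eps
  list(delta = delta, n = nrow(X))           # each site returns delta and n
}

# One plausible way the computing site could form delta_bar: sum the site
# gradients and divide by the total sample size across sites.
combine_gradients <- function(site_results) {
  n_total <- sum(vapply(site_results, function(r) r$n, numeric(1)))
  Reduce(`+`, lapply(site_results, function(r) r$delta)) / n_total
}
```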