MicroPEM data cleaning process developed through the CHAI project

M. Salmon, S. Vakacherla, C. Mila, J. Marshall, C. Tonne



Continuous personal monitoring of PM2.5 in the CHAI project resulted in 261 MicroPEM files corresponding to 250 measurement sessions of 24 hours. Data were collected using 5 distinct MicroPEMs. Since there is neither central RTI MicroPEM documentation nor an agreed-on, harmonized data cleaning process, we developed this framework. In the literature, details about data cleaning were too scarce to allow reproducibility. In this document we explain the issues we encountered, present the framework we used for data cleaning, and provide some details about our code.

Brief overview of issues encountered in MicroPEM data:

Definition of the data cleaning framework

Removal of entire files.

Files were removed from the analysis if

The criterion on post-sampling flow rate led to the exclusion of 5 files from the analysis; the other criteria led in total to the exclusion of 28 files. Our final table therefore contains data from 228 of the 261 files.

Correction/removal of individual data points in time series.

Identification of negative baseline shifts

We identified baseline shifts based on the following:

Files identified as having a negative baseline shift were corrected by adding the opposite of the file's minimum value to all values, so that the minimum became zero.
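The correction described above can be sketched in a few lines of R. This is a minimal illustration and not the project code: the function name correct_negative_baseline() and the example values are hypothetical, and the function assumes that the file has already been flagged as having a negative baseline shift.

```r
# Shift a nephelometer time series so that its minimum becomes zero.
# Assumption: the series comes from a file already flagged as having
# a negative baseline shift; flagging itself is not shown here.
correct_negative_baseline <- function(pm) {
  # Subtracting the (negative) minimum is the same as adding its opposite
  pm - min(pm, na.rm = TRUE)
}

# Synthetic values for illustration only, not CHAI data
shifted <- c(-12, -8, 3, 20, NA)
corrected <- correct_negative_baseline(shifted)
print(corrected)
```

Missing values stay missing, and all other values are shifted by the same amount, so differences between data points are preserved.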

Gravimetric correction

To calculate the coefficients for the gravimetric correction, we used 53 MicroPEM files that 1) had been colocated with SKC pumps and 2) had no detected negative baseline shift. We fitted a linear regression of SKC gravimetric mean concentration against nephelometer mean concentration (after all previous corrections) and used the resulting coefficients to correct all MicroPEM files. The resulting correction was corrected = 21.3996 + 0.4658 × nephelometer. We did not have enough colocated files to derive device-specific or season-specific corrections.
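As an illustration of this step, the following sketch fits such a regression with lm() and applies the coefficients. It is not the project code: the data frame colocation, its column names, its values, and the helper apply_gravimetric_correction() are all hypothetical. Only the two coefficients in the last call are taken from the text above.

```r
# Synthetic colocation data for illustration only (not CHAI data):
# one row per colocated file, with mean concentrations per file
colocation <- data.frame(
  nephelometer_mean = c(40, 80, 120, 160, 200),
  gravimetric_mean  = c(41, 58, 77, 96, 115)
)

# Regress gravimetric mean on nephelometer mean
fit <- lm(gravimetric_mean ~ nephelometer_mean, data = colocation)
intercept <- unname(coef(fit)[1])
slope <- unname(coef(fit)[2])

# Apply a linear correction to nephelometer values
apply_gravimetric_correction <- function(nephelometer, intercept, slope) {
  intercept + slope * nephelometer
}

# Correction of a nephelometer value of 100 with the coefficients
# reported in this document (21.3996 and 0.4658)
reported <- apply_gravimetric_correction(100, intercept = 21.3996, slope = 0.4658)
print(reported)
```

With the reported coefficients, a nephelometer value of 100 is corrected to 21.3996 + 0.4658 × 100 = 67.9796.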

Note that in our final table we kept

Technical details about R implementation of the data cleaning framework

We cannot share our code but provide some details.

The first step of the data cleaning was to transform all MicroPEM files into two data.frames, one with settings and one with measurements, using rtimicropem::batch_convert(). Since in our study the filename was informative (it contained participant ID, date, etc.), we parsed the filenames using our internal package. The table with settings was joined with information from the logsheet, such as pre- and post-sampling flow rates.
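The join with the logsheet could look like the sketch below, which uses base R's merge(). The column names (filename, device_id, pre_flow_rate, post_flow_rate) and all values are hypothetical and were chosen for illustration; they are not the actual column names of the settings table or of our logsheet.

```r
# Synthetic settings table, one row per MicroPEM file (hypothetical columns)
settings <- data.frame(
  filename = c("file_a.csv", "file_b.csv"),
  device_id = c("MP1", "MP2"),
  stringsAsFactors = FALSE
)

# Synthetic logsheet with flow rates noted in the field (hypothetical columns)
logsheet <- data.frame(
  filename = c("file_a.csv", "file_b.csv"),
  pre_flow_rate = c(0.50, 0.51),
  post_flow_rate = c(0.49, 0.40),
  stringsAsFactors = FALSE
)

# Left join: keep every file from the settings table,
# even if it has no logsheet entry
settings_with_log <- merge(settings, logsheet, by = "filename", all.x = TRUE)
print(settings_with_log)
```

Keeping all rows of the settings table makes files without a logsheet entry visible as missing values instead of silently dropping them.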

Then we created an SQLite database with the measurements, indexed on the filename.
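A database of this kind can be created with the DBI and RSQLite packages, as in the sketch below. This is an assumption about tooling, not a description of our code: the table name, the index name, the column names and the values are hypothetical, and the database lives in memory here instead of in a file.

```r
library(DBI)

# Synthetic measurements for illustration only (not CHAI data)
measurements <- data.frame(
  filename = rep(c("file_a.csv", "file_b.csv"), each = 3),
  datetime = rep(c("2015-01-01 10:00:00",
                   "2015-01-01 10:00:10",
                   "2015-01-01 10:00:20"), times = 2),
  pm25 = c(35, 36, 40, 80, 82, 79),
  stringsAsFactors = FALSE
)

# In-memory database; pass a file path instead to keep it on disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "measurements", measurements)

# Index on filename so that the data of one file can be fetched quickly
dbExecute(con, "CREATE INDEX idx_measurements_filename ON measurements (filename)")

# Fetch the measurements of a single file
one_file <- dbGetQuery(
  con,
  "SELECT * FROM measurements WHERE filename = ?",
  params = list("file_a.csv")
)
dbDisconnect(con)
print(one_file)
```

The index matters because later steps operate file by file, so each step only needs to read the rows of one file.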

Because of the huge number of files we had to process, we defined functions operating on participant-days/files, and used batchtools to parallelize the process.
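The general pattern with batchtools is sketched below. It is not our code: summarise_file() is a hypothetical placeholder for a function operating on one file, and the filenames are invented. Without further configuration batchtools runs the jobs sequentially in the current session; a parallel backend has to be chosen explicitly, as noted in the comments.

```r
library(batchtools)

# Hypothetical per-file function; real cleaning steps would go here
summarise_file <- function(filename) {
  data.frame(
    filename = filename,
    n_char = nchar(filename),
    stringsAsFactors = FALSE
  )
}

# Invented filenames for illustration
filenames <- c("file_a.csv", "file_b.csv", "file_c.csv")

# Temporary registry (file.dir = NA); use a real directory to keep results
reg <- makeRegistry(file.dir = NA, seed = 1)

# To run in parallel on one machine, a backend can be set, for example:
# reg$cluster.functions <- makeClusterFunctionsSocket(2)

# One job per file
batchMap(fun = summarise_file, filename = filenames, reg = reg)
submitJobs(reg = reg)
waitForJobs(reg = reg)

# Collect the per-file results into one table
results <- reduceResultsList(reg = reg)
summary_table <- do.call(rbind, results)
print(summary_table)
```

Defining one job per file means that a failure in one file does not stop the others, and failed jobs can be inspected and resubmitted individually.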


We hope that this vignette provided a useful insight into MicroPEM data cleaning. It probably does not cover all possible MicroPEM issues, and not every project will experience every issue. If you have any questions or want to start a discussion, please open a GitHub issue or contact the maintainer of the package via maelle dot salmon at yahoo dot se.