# Introduction

Continuous personal monitoring of PM2.5 in the CHAI project resulted in 261 MicroPEM files corresponding to 250 measurements sessions of 24 hours. Data were collected using 5 distinct MicroPEMs. Since there is no central RTI MicroPEM documentation nor agreed-on harmonized data cleaning process, we developed this framework. In the literature, details about data cleaning were too scarce to allow reproducibility. In this document we explain the issues we encountered, we present the framework we used for data cleaning, and provide some details about our code.

Brief overview of issues encountered in MicroPEM data:

• Temperature sensitivity (zero decreasing with increasing temperature), apparently random baseline shifts.

• Negative relative humidity for a few points, leading to aberrant rh-corrected nephelometer values.

• Strange variations of internal parameters such as flow or orifice pressure, indicating the flow was not constant. For one MicroPEM that finally broke, the inlet pressure of some files indicated the device had been sucking air from the device instead of air from the environment.

• The internal flowmeter was not always reliable. Sometimes it wasn’t logging at all, sometimes it was logging 0.15 instead of 0.5. Because of the unreliable performance form the internal flowmeter, data from an external flowmeter pre and post-sampling is very valuable.

• There were holes in some filters for gravimetric measurements. Filter weights were also very variable and sometimes unrealistic (extremely high or small). We hypothesized that some filters might have contained small holes not visible to the eye. Therefore, we did not use MicroPEM filters for gravimetric correction in our study. Instead, we used colocated gravimetric measurements using an SKC pump fitted with BGI cyclone.

# Definition of the data cleaning framework

## Removal of entire files.

Files were removed from the analysis if

• the post-sampling flow rate as measured by an external flowmeter was either 1) non available 2) higher than 120% of the pre-sampling flow rate.

• there were too few, e.g. 0 or 10, measures in total. Such files actually indicated a MicroPEM failure.

• Relative humidity was not logged at all, which happened to one of our files, since relative humidity is used to correct nephelometer vales

• visual checks, for our MicroPEMs MP724 (which presented many negative values) and MP718 (which was later sent back to RTI because of malfunctions), defined the file as problematic. All visual checks were performed by at least two team members.

• For MP718 we eliminated files where inlet pressure was so low that it indicated the MicroPEM had been sucking air from inside the device.

• For MP724 we eliminated files with an abrupt baseline shift, or where inlet pressure wasn’t constantly increasing.

The criterion on post-sampling flow rate led to the exclusion of 5 files from the analysis, the other criteria led in total to the exclusion of 28 files. Therefore our final table contains data from 228 files.

## Correction/removal of individual data points in time series.

• Only values consistent with the start and stop time of monitoring as recorded on an external log sheet were retained: 1) in some cases there were measurements from a previous session in the file 2) when downloading data later on the same day sometimes by mistake the MicroPEM was turned on again and a few measures were produced.

• We detected nephelometer high outliers and removed them thanks to a custom function adapted from forecast::tsoutlier. We copy the function later in this vignette.

• Nephelometer values corresponding to negative relative humidity were set to missing.

• Nephelometer values corresponding to a internal flow or orifice pressure (orifice pressure is indicative of internal flow which is useful in cases where internal flow wasn’t logged properly) outside of the range between 0.9xmedian in the file and 1.1xmedian in the file were set to missing.

• We applied a temperature correction for 3 MicroPEMs where we had noticed a zero drift with increasing temperature. 4 MicroPEMs ran with a HEPA filter attached to zero the device: 2 were found to be temperature sensitive. For the MicroPEM that was already sent back to RTI because it was broken, MP718, we used a file where the MicroPEM had been sucking air from the inside of the device (which we know because inlet pressure was so low) over a range of temperatures to develop the temperature correction. A device-specific temperature correction was derived (zero_t = a + bxtemperature) from the reference datasets and then applied to all files from each of these three MicroPEMs. For information here are the coefficients we get for the three MicroPEMs.

• MicroPEM 718 temperature sensitivity from 36ºC. a=27.573813, b=-1.074154

• MicroPEM 724 temperature sensitivity from 35ºC. a=67.411151, b=-1.996988

• MicroPEM 813 temperature sensitivity from 36ºC. a=33.6162432, b=-0.9616226

## Identification of negative baseline shifts

We identified baseline shifts based on the following:

• Changes in values measured with a HEPA filter attached to inleft immediately pre and post sampling in the field. We observed changes in zero over time (e.g. if zeroed a few hours before sampling) and during sampling. Zero drift could be positive or negative, with no relation to other parameters e.g. temperature.

• Files with a high proportion of negative nephelometer values. We identified files with a negative baseline shift as those where the mean was changed by more than 5% if holding off negative values. The criterion was met by 31 files.

Files identified as having negative baseline shift were corrected as follows: add the opposite of the minimum to all values so that the minimum is zero.

## Gravimetric correction

For calculating coefficients for gravimetic correction we used 53 MicroPEM files which 1) had been colocated with SKC pumps and 2) had no detected negative baseline shift. We made a linear regression of SKC gravimetric mean concentration vs. nephelometer mean concentration (with all previous corrections) and used the resulting coefficients to correct all MicroPEM files. The obtained correction was corrected = 21.3996 + 0.4658 x nephelometer. We did not have enough files from colocation to have a specific device / season correction.

Note that in our final table we kept

• the “RH corrected nephelometer” values as provided by MicroPEMs.

• the values having received all corrections except the one for negative baseline shifts.

• the values having received all corrections.

# Technical details about R implementation of the data cleaning framework

We cannot share our code but provide some details.

The first step of the data cleaning was to transform all MicroPEM files into two data.frames using rtimicropem::batch_convert(). Since in our study the filename was informative (it contained participant ID, date, etc.) we parsed the filenames using our internal package. The table with settings was joined with information from the logsheet such as pre and post-sampling flow rate.

Then we created a sqlite database with the measurements, in which the filename was an index.

Because of the huge number of files we had to process, we defined functions operating on participant-days/files, and used batchtools to parallelize the process.

# Conclusion

We hope that this vignette provided an useful insight into MicroPEM data cleaning. It probably does not cover all possible MicroPEM issues, and each project will not experience all issues. If you have any question or want to start a discussion please open a Github issue or contact the maintainer of the package via maelle dot salmon at yahoo dot se.