Getting your event log in the right format

Gert Janssenswillen

2/12/2015

The goal of this vignette is to illustrate how event data can be preprocessed in R to create an eventlog object. Two different approaches are discussed: importing an event log from an XES file, and importing an event log in CSV format.

Import event log from XES

A very easy way to create event logs in R is to import an event log stored in XES format. As an example, we take the event log of municipality 1 of the BPI Challenge 2015, which can be found at the Process Mining Data Repository. To follow this vignette, store the data somewhere on your local machine.

Once you have the data on your machine, you can pass its location to the eventlog_from_xes function. Alternatively, calling this function without any arguments, as is done below, opens a dialog box that lets you navigate to the event log.

data <- eventlog_from_xes()
data
## Event log consisting of:
## 52217 events
## 1099 traces
## 1199 cases
## 398 activities
## 52217 activity instances
## 
## # A tibble: 52,217 × 15
##    case_concept.name event_question  event_dateFinished
##                <chr>          <chr>               <chr>
## 1           10009138          EMPTY 2014-04-14 00:00:00
## 2           10009138          False 2014-04-14 00:00:00
## 3           10009138          EMPTY 2014-04-14 00:00:00
## 4           10009138           True 2014-04-14 00:00:00
## 5           10009138          EMPTY 2014-04-14 00:00:00
## 6           10009138          EMPTY 2014-04-14 00:00:00
## 7           10009138          EMPTY 2014-04-14 00:00:00
## 8           10009138          False 2014-04-14 00:00:00
## 9           10009138          False 2014-04-14 00:00:00
## 10          10009138          EMPTY 2014-04-14 00:00:00
## # ... with 52,207 more rows, and 12 more variables: event_dueDate <chr>,
## #   event_action_code <chr>, event_activityNameEN <chr>,
## #   event_planned <chr>, event_time.timestamp <chr>,
## #   event_monitoringResource <chr>, event_org.resource <chr>,
## #   event_activityNameNL <chr>, event_concept.name <chr>,
## #   event_lifecycle.transition <chr>, event_dateStop <chr>,
## #   activity_instance <dbl>

Printing the event log, stored in the object data, immediately shows that the object is of the class eventlog. The eventlog_from_xes function also handles the following things:

In this example, all events refer to the same lifecycle transition, i.e. complete.

table(data$event_lifecycle.transition)
## 
## complete 
##    52217

As a result, each event corresponds to a separate activity instance. Thus, there are as many activity instances as there are events.

n_events(data)
## [1] 52217
n_activity_instances(data)
## [1] 52217
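
The relation between events and activity instances can be illustrated without the package. The following is a minimal base-R sketch on a made-up toy log (the data frame and its column names are assumptions for illustration, not part of the BPI data): when every event has the lifecycle transition complete, counting rows and counting distinct activity instance identifiers gives the same number.

```r
# Toy log (made-up values): one row per event, every event is a "complete".
toy_log <- data.frame(
  case = c("c1", "c1", "c2"),
  activity = c("A", "B", "A"),
  activity_instance = c(1, 2, 3),
  lifecycle = c("complete", "complete", "complete"),
  stringsAsFactors = FALSE
)

# Number of events: one per row.
n_events_toy <- nrow(toy_log)

# Number of activity instances: one per distinct instance identifier.
n_instances_toy <- length(unique(toy_log$activity_instance))

n_events_toy == n_instances_toy
```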

The event log classifiers are initialized as follows:

case_id(data)
activity_id(data)
activity_instance_id(data)
lifecycle_id(data)
timestamp(data)
## [1] "case_concept.name"
## [1] "event_concept.name"
## [1] "activity_instance"
## [1] "event_lifecycle.transition"
## [1] "event_time.timestamp"

The only preprocessing step that needs to be done is to convert the timestamps to objects of the POSIXct class. This can be done using the lubridate package and by looking at the format the timestamps are in.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
data[1:4,timestamp(data)]
## # A tibble: 4 × 1
##        event_time.timestamp
##                       <chr>
## 1 2014-04-11T00:00:00+02:00
## 2 2014-04-14T00:00:00+02:00
## 3 2014-04-14T00:00:00+02:00
## 4 2014-04-14T00:00:00+02:00
data$event_time.timestamp <- ymd_hms(data$event_time.timestamp)
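
Note that the raw timestamps carry a UTC offset (+02:00), and that ymd_hms returns times in UTC by default, so midnight local time ends up on the previous day at 22:00 UTC. As a rough illustration of what the conversion involves, here is a base-R sketch that parses the same format without lubridate. It is an alternative shown for illustration only, not the method used in this vignette; the variable names are assumptions.

```r
# Two timestamps in the format shown above.
raw <- c("2014-04-11T00:00:00+02:00", "2014-04-14T00:00:00+02:00")

# strptime's %z expects an offset such as +0200, so drop the colon first.
no_colon <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", raw)

# Parse to POSIXct; the offset is applied and the result is expressed in UTC.
parsed <- as.POSIXct(no_colon, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")

format(parsed, "%Y-%m-%d %H:%M:%S")
```

If the local wall-clock time matters for the analysis, the time zone of the result has to be set explicitly after parsing.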

Note that case attributes can be extracted from an XES file using the function case_attributes_from_xes.

Import from csv file

Alternatively, the event log might be stored in a CSV file. More information on importing CSV files can be found in ?read.csv or in the documentation of the readr package. An example of an event log imported from a CSV file has been included under the name csv_example.

data("csv_example", package = "edeaR")
head(csv_example)
##   CASE ACTIVITY            COMPLETE               START
## 1  CA1        A 2015-01-03 01:23:45 2015-01-01 01:23:45
## 2  CA1        B 2015-01-04 01:23:45 2015-01-03 01:23:45
## 3  CA1        C 2015-01-07 01:23:45 2015-01-05 01:23:45
## 4  CA1        D 2015-01-07 01:23:45 2015-01-06 01:23:45
## 5  CA1        E 2015-01-09 01:23:45 2015-01-07 01:23:45
## 6 CA10        B 2015-01-12 01:23:45 2015-01-11 01:23:45
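
For readers starting from an actual file, the import step might look like the sketch below. To keep it self-contained, the CSV content is given inline through the text argument of read.csv; with a real file, a path would be passed instead. The two rows are copied from the output above, and the object names are assumptions.

```r
# Inline CSV with the same layout as csv_example (first two rows only).
csv_text <- "CASE,ACTIVITY,COMPLETE,START
CA1,A,2015-01-03 01:23:45,2015-01-01 01:23:45
CA1,B,2015-01-04 01:23:45,2015-01-03 01:23:45"

# Keep strings as character so the timestamps can be converted later on.
raw_log <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(raw_log)
```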

In this example, it can be seen that each row is in fact an activity instance, bearing multiple timestamps, i.e. both a complete and a start timestamp. The following steps are required in order to convert this data.frame to an event log.

  1. Create an activity instance classifier, which has a unique value in each row.
  2. Reshape the data frame so that each row is an event.
  3. Convert the values of the lifecycle transition to their standard values.
  4. Convert the timestamps to POSIXct objects.
  5. Create an eventlog object.

Creating an activity instance classifier

csv_example$ACTIVITY_INSTANCE <- 1:nrow(csv_example)

Reshaping the data

This can easily be done using the tidyr package. See ?gather for more information.

library(tidyr)
csv_example <- gather(csv_example, LIFECYCLE, TIMESTAMP, -CASE, -ACTIVITY, -ACTIVITY_INSTANCE)
head(csv_example)
##    CASE ACTIVITY ACTIVITY_INSTANCE LIFECYCLE           TIMESTAMP
## 1   CA1        A                 1     START 2015-01-01 01:23:45
## 2   CA1        B                 2     START 2015-01-03 01:23:45
## 3   CA1        C                 3     START 2015-01-05 01:23:45
## 4   CA1        D                 4     START 2015-01-06 01:23:45
## 5   CA1        E                 5     START 2015-01-07 01:23:45
## 6 CA365        A                 6     START 2015-01-01 01:23:45
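
To make explicit what the reshaping does, the same wide-to-long transformation can be sketched in base R on a small data frame. This is not the tidyr approach used above, only an illustration of the idea: every timestamp column becomes its own block of rows, labelled with the column name. The helper stack_one is hypothetical.

```r
# Toy wide-format data: the first two rows of csv_example as printed earlier.
wide <- data.frame(
  CASE = c("CA1", "CA1"),
  ACTIVITY = c("A", "B"),
  COMPLETE = c("2015-01-03 01:23:45", "2015-01-04 01:23:45"),
  START = c("2015-01-01 01:23:45", "2015-01-03 01:23:45"),
  stringsAsFactors = FALSE
)
wide$ACTIVITY_INSTANCE <- seq_len(nrow(wide))

# Columns that identify an activity instance and are repeated for each event.
id_cols <- c("CASE", "ACTIVITY", "ACTIVITY_INSTANCE")

# Hypothetical helper: turn one timestamp column into a block of event rows.
stack_one <- function(column) {
  data.frame(wide[id_cols],
             LIFECYCLE = column,
             TIMESTAMP = wide[[column]],
             stringsAsFactors = FALSE)
}

# One row per event: first all start events, then all complete events.
long <- rbind(stack_one("START"), stack_one("COMPLETE"))
long
```

Each activity instance now appears twice, once per lifecycle transition, which is why the number of rows doubles.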

Converting the lifecycle values

By converting this column to a factor, its levels can easily be changed.

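# State the levels explicitly so that each label is paired with the intended value.
csv_example$LIFECYCLE <- factor(csv_example$LIFECYCLE, levels = c("START","COMPLETE"), labels = c("start","complete"))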
head(csv_example)
##    CASE ACTIVITY ACTIVITY_INSTANCE LIFECYCLE           TIMESTAMP
## 1   CA1        A                 1     start 2015-01-01 01:23:45
## 2   CA1        B                 2     start 2015-01-03 01:23:45
## 3   CA1        C                 3     start 2015-01-05 01:23:45
## 4   CA1        D                 4     start 2015-01-06 01:23:45
## 5   CA1        E                 5     start 2015-01-07 01:23:45
## 6 CA365        A                 6     start 2015-01-01 01:23:45
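
One detail deserves care when relabelling through factor: labels are matched to levels by position, and when a character vector is converted without an explicit levels argument, the levels are sorted alphabetically. The small base-R sketch below (toy vector, assumed names) shows the default order and a relabelling that does not depend on it.

```r
# Toy lifecycle column as it might look after reshaping.
lifecycle <- c("START", "COMPLETE", "START")

# Without explicit levels, the levels are sorted: COMPLETE comes before START.
default_levels <- levels(factor(lifecycle))

# Stating the levels makes the pairing with the labels unambiguous.
recoded <- factor(lifecycle,
                  levels = c("START", "COMPLETE"),
                  labels = c("start", "complete"))

as.character(recoded)
```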

Converting the timestamps

Using lubridate, as before.

csv_example$TIMESTAMP <- ymd_hms(csv_example$TIMESTAMP)

Creating an eventlog object

log <- eventlog(eventlog = csv_example, 
                case_id = "CASE",
                activity_id = "ACTIVITY", 
                activity_instance_id = "ACTIVITY_INSTANCE", 
                lifecycle_id = "LIFECYCLE", 
                timestamp = "TIMESTAMP")
## Warning in eventlog(eventlog = csv_example, case_id = "CASE", activity_id =
## "ACTIVITY", : No resource identifier provided nor found. Set to default: NA
log
## Event log consisting of:
## 12766 events
## 9 traces
## 1000 cases
## 6 activities
## 6383 activity instances
## 
## # A tibble: 12,766 × 5
##      CASE ACTIVITY ACTIVITY_INSTANCE LIFECYCLE           TIMESTAMP
##    <fctr>   <fctr>             <int>    <fctr>              <dttm>
## 1     CA1        A                 1     start 2015-01-01 01:23:45
## 2     CA1        B                 2     start 2015-01-03 01:23:45
## 3     CA1        C                 3     start 2015-01-05 01:23:45
## 4     CA1        D                 4     start 2015-01-06 01:23:45
## 5     CA1        E                 5     start 2015-01-07 01:23:45
## 6   CA365        A                 6     start 2015-01-01 01:23:45
## 7   CA365        B                 7     start 2015-01-03 01:23:45
## 8   CA365        C                 8     start 2015-01-05 01:23:45
## 9   CA365        D                 9     start 2015-01-06 01:23:45
## 10  CA365        E                10     start 2015-01-07 01:23:45
## # ... with 12,756 more rows
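
In the resulting log there are exactly twice as many events as activity instances (12766 versus 6383), because every instance has one start and one complete event. Before building an event log from your own data, it can be useful to verify this kind of structure. The following base-R sketch does so on a made-up toy data frame; check_event_data is a hypothetical helper, not a function of the package.

```r
# Toy long-format data (made-up values): one row per event.
events <- data.frame(
  CASE = c("CA1", "CA1", "CA1", "CA1"),
  ACTIVITY = c("A", "B", "A", "B"),
  ACTIVITY_INSTANCE = c(1L, 2L, 1L, 2L),
  LIFECYCLE = factor(c("start", "start", "complete", "complete"),
                     levels = c("start", "complete")),
  TIMESTAMP = as.POSIXct(c("2015-01-01 01:23:45", "2015-01-03 01:23:45",
                           "2015-01-03 01:23:45", "2015-01-04 01:23:45"),
                         tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stops with an error if a basic assumption is violated.
check_event_data <- function(df) {
  required <- c("CASE", "ACTIVITY", "ACTIVITY_INSTANCE", "LIFECYCLE", "TIMESTAMP")
  stopifnot(all(required %in% names(df)))
  # Timestamps must already be converted and must not be missing.
  stopifnot(inherits(df$TIMESTAMP, "POSIXct"), !anyNA(df$TIMESTAMP))
  # Every activity instance should have exactly one event per lifecycle value.
  counts <- table(df$ACTIVITY_INSTANCE, df$LIFECYCLE)
  stopifnot(all(counts == 1))
  invisible(TRUE)
}

check_event_data(events)

n_events_toy <- nrow(events)
n_instances_toy <- length(unique(events$ACTIVITY_INSTANCE))
n_events_toy == 2 * n_instances_toy
```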