Given the sheer number of news sources in the digital age (e.g., newspapers, blogs, social media), it has become difficult to determine where news is first introduced and how it diffuses across sources. RNewsflow provides tools for analyzing content homogeneity and diffusion patterns using computational text analysis. The content of news messages is compared using techniques from the field of information retrieval, similar to plagiarism detection. By using a sliding window approach to only compare messages within a given time distance, many sources can be compared over long periods of time. Furthermore, the package introduces an approach for analyzing the news similarity data as a network, and includes various functions to analyze and visualize this network.


The news diffusion process in the digital age involves many interdependent sources, ranging from news agencies and traditional newspapers to blogs and people on social media (Meraz 2011; Paterson 2005; Pew Research Center 2010). We offer the RNewsflow R package as a toolkit to analyze the homogeneity and diffusion of news content using computational text analysis. This analysis consists of two steps. First, techniques from the field of information retrieval are used to measure the similarity of news messages (e.g., addressing the same event, containing identical phrases). Second, the temporal order in which messages are published is used to analyze consistent patterns in who follows whom.

The main contribution of this package lies in the specialized application of document similarity measures for the purpose of comparing news messages. News is a special type of information in the sense that it has a time dimension—it quickly loses its relevance. Therefore, we are often only interested in the similarity of news documents that occurred within a short time distance. By restricting document comparisons to a given time distance, it becomes possible to compare many documents over a long period of time, but available document comparison software generally does not have this feature. We therefore offer the newsflow.compare function, which compares documents using a sliding window over time. In addition, this package offers tools to aggregate, analyze and visualize the document similarity data. By covering all steps of the text and data analysis in a single R package, it becomes easier to develop and share methods, and analyses become more transparent and replicable.

The primary intended audience of this package is scholars and professionals in fields where the impact of news on society is a prime factor, such as journalism, political communication and public relations (Baum and Groeling 2008; Boczkowski and De Santos 2007; Ragas 2014). To what extent the content of certain sources is homogeneous or diverse has implications for central theories of media effects, such as agenda-setting and the spiral of silence (Bennett and Iyengar 2008; Blumler and Kavanagh 1999). Identifying patterns in how news travels from the initial source to the eventual audience is important for understanding who the most influential “gatekeepers” are (Shoemaker and Vos 2009). Furthermore, the document similarity data enables one to study news values (Galtung and Ruge 1965) by analyzing what elements of news predict their diffusion rate and patterns.

The package is designed to appeal to scholars and professionals without prior experience in computational text analysis. This vignette covers the entire chain from processing raw data—written text with source and date information—to analyzing and visualizing the output. It points to relevant software within and outside of R for pre-processing written texts, and demonstrates how to use the core functions of this package. For more advanced users there are additional functions and parameters to support versatility. RNewsflow is completely open-source to promote active involvement in developing and evaluating the methodology. The source code is available on GitHub: https://github.com/kasperwelbers/RNewsflow. All data used in this vignette is included in the package for easy replication.

The structure of this vignette is as follows. The first part discusses the data preparation. There are several choices to be made here that determine on what grounds the content of documents is compared. The second part shows how the core function of this package, newsflow.compare, is used to calculate document similarities for many documents over time. The third part demonstrates functions for exploring, aggregating and visualizing the document similarity data. Finally, conclusions regarding the current version of the package and future directions are discussed.

Preparing the data

To analyze content homogeneity and news diffusion using computational text analysis, we need to know who said what at what time. Thus, our data consists of messages in text form, including meta information for the source and publication date of each message. The texts furthermore need to be pre-processed and represented as a document-term matrix (DTM). This section first discusses techniques and refers to existing software to pre-process texts and create the DTM, and reflects on how certain choices influence the analysis. Second, it shows how the source and date information should be organized. Third, it discusses several ways to filter and weight the DTM based on word statistics.

Pre-processing texts and creating the DTM

A DTM is a matrix in which rows represent documents, columns represent terms, and cells indicate how often each term occurred in each document. This is referred to as a bag of words representation of texts, since we ignore the order of words. This simplified representation makes analysis much easier and less computationally demanding, and as research has shown: "a simple list of words, which we call unigrams, is often sufficient to convey the general meaning of a text" (Grimmer and Stewart 2013: 6).

As input for this package, the DTM has to be in the DocumentTermMatrix class of the tm package (Feinerer, Hornik, and Meyer 2008). This is a popular R package for text mining, or computational text analysis, which also contains functions to create a DTM based on raw text in various formats. We also recommend the RTextTools package (Jurka et al. 2013), which has the create_matrix function that wraps various tm functions for creating a DTM into a single convenient function. For example, see the following DTM, created with the create_matrix function.

doc1 = 'Socrates is human'
doc2 = 'Humans are mortal'
doc3 = 'Therefore, Socrates is mortal'
dtm = RTextTools::create_matrix(
  textColumns = c(doc1,doc2,doc3), 
  minWordLength = 1, removeStopwords = FALSE)

rownames(dtm) = paste('Document', 1:nrow(dtm))
##             Terms
## Docs         are human humans is mortal socrates therefore
##   Document 1   0     1      0  1      0        1         0
##   Document 2   1     0      1  0      1        0         0
##   Document 3   0     0      0  1      1        1         1

Based on the representation of texts in the DTM, the similarity of documents can be calculated as the similarity of row vectors. However, as seen in the example, this approach has certain pitfalls if we simply look at all words in their literal form. For instance, the words "human" and "humans" are given separate columns, despite having largely the same meaning. As a result, the similarity of the first two texts is not recognized. Also, the words "are" and "is" do not have any substantial meaning, so they are not informative and can be misleading for the calculation of document similarity. There are various techniques to filter and transform words that can be used to remedy these issues. In addition, we can use these techniques to steer on what grounds documents are compared.
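To make the vector comparison concrete, here is a minimal base R sketch that computes the cosine similarity between the row vectors of Document 1 and Document 3, copied from the example DTM above:

```r
# Row vectors of Document 1 and Document 3 from the example DTM
# (columns: are, human, humans, is, mortal, socrates, therefore)
doc1_vec = c(0, 1, 0, 1, 0, 1, 0)
doc3_vec = c(0, 0, 0, 1, 1, 1, 1)

# Cosine similarity: dot product divided by the product of the vector norms
sum(doc1_vec * doc3_vec) /
  (sqrt(sum(doc1_vec^2)) * sqrt(sum(doc3_vec^2)))
# the shared terms "is" and "socrates" yield a similarity of about 0.58
```

Note that the uninformative shared term "is" inflates this score, which illustrates why filtering and weighting the DTM matters.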

All of the mentioned techniques except lemmatization and part-of-speech tagging are available in the tm package and in the create_matrix function of the RTextTools package. For lemmatization and part-of-speech tagging there are several freely available grammar parsers, such as CoreNLP for English (Manning et al. 2014) and Frog for Dutch (Van den Bosch et al. 2007).

For this vignette, data is used that has been preprocessed with the Dutch grammar parser Frog. The data is based on a recent study on the influence of a Dutch news agency on the print and online editions of Dutch newspapers in political news coverage (Author citation, forthcoming). The terms have been lemmatized, and only the nouns and proper names of the headline and first five sentences (that generally contain the essential who, what and where of news) are used. By focusing on these elements, the analysis in this study focused on whether documents address the same events. This data is also made available in the package as demo data, in which the actual nouns and proper names have been substituted with indexed word types (e.g., the third organization in the vocabulary is referred to as organization.3, the 20th noun as noun.20).
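Assuming the demo data is included as data objects in the package, it can be made available as follows. The object names dtm and meta are used throughout this vignette, but whether an explicit data() call is needed, and the exact object names, are assumptions that may differ between package versions.

```r
library(RNewsflow)

# Load the demo data included in the package
# (the object names dtm and meta are assumed here)
data(dtm, package = 'RNewsflow')
data(meta, package = 'RNewsflow')

# View a small slice of the demo DTM
as.matrix(dtm[1:3, 1:5])
```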


##           Terms
## Docs       person.688 noun.1516 location.119 organization.323 person.493
##   35573532          0         0            0                0          0
##   35573536          0         0            0                0          0
##   35573539          0         0            0                0          0

Organizing document meta information

In addition to the DTM, we need meta information for each document. This should be structured as a data frame with at least two columns. The first column contains the document names, and should match with the rownames (i.e. document names) of the DTM. The second column contains the publication date of the document, which should be in the Date or POSIXct class. For easy compatibility with the functions offered in this package, it is recommended to label these columns "document_id" and "date". Any additional columns in the meta data.frame can be used in the analysis to aggregate and visualize results. Here it is recommended to have at least a column containing the source of the document, labeled "source".

##   document_id                date     source sourcetype
## 1    35573532 2013-06-01 06:00:00 Print NP 2   Print NP
## 2    35573536 2013-06-01 06:00:00 Print NP 2   Print NP
## 3    35573539 2013-06-01 06:00:00 Print NP 2   Print NP
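For data that is not yet in this format, a meta data.frame can be constructed along the following lines. This is an illustrative sketch; the dates and source values are made-up placeholders and should come from your own data.

```r
# Sketch: constructing a meta data.frame for your own documents.
# The dates and source values here are illustrative placeholders.
my_meta = data.frame(
  document_id = rownames(dtm),
  date = as.POSIXct('2013-06-01 06:00:00', tz = 'GMT') +
         3600 * seq_len(nrow(dtm)),
  source = 'Example source',
  stringsAsFactors = FALSE)
```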

Using word statistics to filter and weight the DTM

As a final step in the data preparation, we can filter and weight words based on word statistics, such as how often a word occurred. Since we are analyzing news diffusion, a particularly interesting characteristic of words is the distribution of their use over time. To focus the comparison of documents on words that indicate new events, we can filter out words that are evenly used over time. To calculate this, we offer the term.day.dist function.

tdd = term.day.dist(dtm, meta)
tdd[sample(1:nrow(tdd), 4),] # show 4 random rows
##            term freq doc.freq days.n days.pct days.entropy
## 3742 person.179   18       18      5   0.1667         4.67
## 4612  noun.1665    7        6      3   0.1000         2.94
## 4771  noun.2412    4        4      2   0.0667         1.75
## 2301  noun.4105    1        1      1   0.0333         1.00
##      days.entropy.norm
## 3742            0.1558
## 4612            0.0981
## 4771            0.0585
## 2301            0.0333

Of particular interest is the days.entropy score, which is the entropy of the distribution of a word's occurrences over days. This tells us whether the occurrence of a word over time is evenly distributed (high entropy) or concentrated (low entropy). The maximum value for the entropy score is the total number of days in the data (in case of a uniform distribution). The days.entropy.norm score normalizes the entropy by dividing it by the total number of days. By selecting the terms with low entropy scores, the DTM can be filtered by using the selected terms as a column selection.
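To illustrate, the entropy score behaves like the exponential of the Shannon entropy of a term's frequency distribution over days, which for a uniform distribution over n days equals n. The exact formula used by term.day.dist is an assumption here; this sketch only conveys the intuition:

```r
# Sketch: perplexity-style entropy of a term's frequency over days.
# Whether this is the exact formula used by term.day.dist is an assumption.
entropy_score = function(freq_per_day) {
  p = freq_per_day / sum(freq_per_day)  # proportion of occurrences per day
  p = p[p > 0]                          # ignore days without occurrences
  exp(-sum(p * log(p)))
}

entropy_score(c(5, 5, 5, 5, 5))  # evenly spread over 5 days: maximum score of 5
entropy_score(c(21, 1, 1, 1, 1)) # concentrated on one day: score of about 1.9
```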

select_terms = tdd$term[tdd$days.entropy.norm <= 0.3]
dtm = dtm[,select_terms]

In addition to deleting terms, we should also weight terms. Turney explains that "the idea of weighting is to give more weight to surprising events and less weight to expected events", which is important because "surprising events, if shared by two vectors, are more discriminative of the similarity between the vectors than less surprising events" (Turney 2002: 156). Thus, we want to give more weight to rare words than to common words. A classic weighting scheme and recommended standard in information retrieval is the term frequency-inverse document frequency (tf.idf) (Sparck Jones 1972; Monroe, Colaresi, and Quinn 2008). This and other weighting schemes can easily be applied using the tm package, for instance using the weightTfIdf function.

dtm = weightTfIdf(dtm)
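As a sketch of what this weighting does for a single DTM cell: by default, tm's weightTfIdf divides the term frequency by the document length and multiplies it by a base-2 log of the inverse document frequency (see ?weightTfIdf for the exact behavior of your tm version):

```r
# Sketch: tf-idf for one cell of a DTM (values are illustrative)
tf  = 2 / 10         # the term occurs 2 times in a document of 10 terms
idf = log2(100 / 4)  # 4 out of 100 documents contain the term
tf * idf             # about 0.93: rare terms receive more weight
```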

Calculating document similarities

Given a DTM and corresponding document meta data, the document similarities over time can be calculated with the newsflow.compare function. The calculation of document similarities is performed using a vector space model (Salton, Wong, and Yang 1975; Salton and Harman 2003) approach, but with a sliding window over time to only compare documents that occur within a given time distance. The function has two main data inputs: the DTM and a data.frame with meta information. The meta data.frame should have a column containing document ids that match the rownames of the DTM (i.e. documents) and should have a column indicating the publication time. By default these columns should be labeled "document_id" and "date", but the column labels can also be set using the id.var and date.var parameters. Any other columns (in our case "source" and "sourcetype") will automatically be included as document meta information in the output.

Furthermore, three parameters are of particular importance. The hour.window parameter determines the time window in hours within which each document is compared to other documents. The argument is a vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(0, 36) will compare each document to all documents within the next 36 hours. The measure parameter, which determines what measure for similarity is used, defaults to cosine similarity. This is a commonly used measure, which indicates similarity as a score between 0 (no similarity) and 1 (identical). The min.similarity parameter is used to ignore all document pairs below a certain similarity score. In the current example we use a minimum similarity of 0.4, because a validity test in the paper from which the current data is taken found this to be a good threshold for finding documents that address the same events. Whether or not a threshold should be used and what the value should be depends on the goal of the analysis and the data. We recommend a manual validation to verify that the similarity threshold matches human interpretation of whether or not two documents are similar, given a pre-defined interpretation of similarity (e.g., same event, same theme).

g = newsflow.compare(dtm, meta,
                             hour.window = c(0,36), 
                             min.similarity = 0.4)

The output g is a network, or graph, in the format of the igraph package (Csardi and Nepusz 2006). The vertices (or nodes) of this network represent documents, and the date and source of each document are stored as vertex attributes. The edges (or ties) represent the similarity of documents, and the similarity score and time difference are stored as edge attributes. To avoid confusion, keep in mind that from hereon when we talk about vertices or a vertex we are talking about documents, and that edges represent document pairs. An advantage of using a network format is that it combines this data in an efficient way, without copying the document meta information for each edge. This network forms the basis for all the analysis functions offered in this package.

A full understanding of the igraph package is not required to use the current package, but one does need some basic understanding of the functions for viewing and extracting the document/vertex and edge attributes. First, vertex and edge attributes cannot directly be extracted using the $ symbol, but require the functions V() and E() to be used, for vertex and edge attributes, respectively. These can be used as follows.

vertex.sourcetype = V(g)$sourcetype
edge.hourdiff = E(g)$hourdiff

head(vertex.sourcetype)
head(edge.hourdiff)

## [1] "Newsagency" "Online NP"  "Newsagency" "Newsagency" "Online NP" 
## [6] "Online NP"
## [1]  1.32  1.12  2.08  1.40  4.50 10.50

Alternatively, all vertex and edge attributes can be viewed or extracted with the get.data.frame function.

v = get.data.frame(g, 'vertices')
e = get.data.frame(g, 'edges')

head(v, 3)
head(e, 3)

##              name                date      source sourcetype
## 97360803 97360803 2013-06-03 16:54:00  Newsagency Newsagency
## 35734376 35734376 2013-06-03 18:13:00 Online NP 1  Online NP
## 97361657 97361657 2013-06-05 09:21:00  Newsagency Newsagency
##       from       to weight hourdiff
## 1 97360803 35734376      1     1.32
## 2 97361657 36022422      1     1.12
## 3 97361740 36043529      1     2.08

The weight attribute of the edges represents the similarity score. The hourdiff attribute represents the time difference in hours between two documents, indicating how long after the "from" article the "to" article was published. A histogram can provide a good first indication of this data.

hist(E(g)$hourdiff, main='Time distance of document pairs', 
     xlab = 'Time difference in hours', breaks = 150)

In the histogram we see that most document pairs with a similarity score above the threshold are about an hour apart (1 on the x-axis). This is mainly because the online newspapers often follow the news agency within a very short amount of time. As time distance increases, the number of document pairs decreases, which makes sense because news gets old fast, so news diffusion should occur within a limited time after publication.

Tailoring the document comparison window

If news diffuses from one source to another, then the time difference cannot be zero, since the source that follows needs time to edit and publish the news. This delay period can also differ between sources. Websites can adopt news within minutes, but print newspapers have a long delay between going to press and being delivered, meaning that there is a period of several hours before publication during which influence is not possible. Thus, we have to adjust the window for document pairs. To make it more convenient to adjust and inspect the window settings for different sources, we offer the filter.window and show.window functions.

The filter.window function can be used to filter the document pairs (i.e. edges) using the hour.window parameter, which works identical to the hour.window parameter in the newsflow.compare function. In addition, the from.vertices and to.vertices parameters can be used to select the vertices (i.e. documents) for which this filter is applied. This makes it easy to tailor the window for different sources, or source types. For the current data, we first set the minimum time distance for all document pairs to 0.1 hours. Then, we set a minimum time distance of 6 hours for document pairs where the to document is from a print newspaper.

# set window for all vertices
g = filter.window(g,  hour.window = c(0.1, 36))

# set window for print newspapers
g = filter.window(g,  hour.window = c(6, 36), 
           to.vertices = V(g)$sourcetype == 'Print NP')

For all sources the window has now first been adjusted so that a document can only match a document that occurred at least 0.1 hours later. For print newspapers, this is then set to 6 hours. With the show.window function we can view the actual window in the data. This function aggregates the edges for all combinations of attributes specified in from.attribute and to.attribute, and shows the minimum and maximum hour difference for each combination. We set the to.attribute parameter to "source", and leave the from.attribute parameter empty. This way, we get the window for edges from any document towards each source.

show.window(g, to.attribute = 'source')
##           from          to window.left window.right
## 1 any document  Newsagency       0.150         35.4
## 2 any document Online NP 1       0.100         35.8
## 3 any document Online NP 2       0.117         36.0
## 4 any document  Print NP 1       7.017         36.0
## 5 any document  Print NP 2       7.350         35.1

We see here that edges towards "Print NP 2" have a time distance of at least 7.35 hours and at most 35.1 hours. This falls within the defined window of at least 6 and at most 36 hours.

Analyzing the document similarity network

Before we aggregate the network, it can be informative to look at the individual sub-components. If a threshold for document similarity is used, then there should be multiple disconnected components of documents that are only similar to each other. With the current data, these components tend to reflect documents that address the same or related events. Decomposing the network can be done with the decompose.graph() function from the igraph package.

g_subcomps = decompose.graph(g)

The demo data with the current settings has 685 sub-components. To visualize these components, we offer the document.network.plot function. This function draws a network where nodes (i.e. documents) are positioned based on their date (x-axis) and source (y-axis).

gs = g_subcomps[[2]] # select the second sub-component
document.network.plot(gs)

The visualization shows that a news message was first published by the news agency on June 5th at 9:21 AM. Soon after this messages was adopted by two online newspapers, and the next day the print newspaper followed. The grayscale and width of the edges also show that the online newspaper messages were more similar (i.e. thick and black) to the news agency message compared to the print newspaper message.

By default, the "source" attribute is used for the y-axis, but this can be changed to other document attributes using the source.attribute parameter. If a DTM is also provided, the visualization will include a word cloud with the most frequent words of these documents.

document.network.plot(gs, source.attribute = 'sourcetype', 
                      dtm = dtm)

These visualizations and the corresponding subcomponents help us to qualitatively investigate specific cases. This also helps to evaluate whether the document similarity measures are valid given the goal of the analysis. Furthermore, they illustrate how we can analyze homogeneity and news diffusion patterns. For each source we can count what proportion of its publications is similar to earlier publications by specific other sources. We can also analyze the average time between publications.

Another useful application is that we can use them to see whether certain transformations of the network might be required. Depending on the purpose of the analysis it can be relevant to add or delete certain edges. For instance, in the previous visualizations we see that the print newspaper message matched both the newsagency message and two online newspaper messages. If we are specifically interested in who the original source of the message is, then it makes sense to only count the edge to the newsagency. Here we demonstrate the only.first.match function, which transforms the network so that a document only has an edge to the earliest dated document it matches within the specified time window.

gs_onlyfirst = only.first.match(gs)

Aggregating the document similarity network

This package offers the network.aggregate function as a versatile way to aggregate the edges of the document similarity network based on the vertex attributes (i.e. the document meta information). The first argument is the network (in the igraph class). The second argument, for the by parameter, is a character vector to indicate one or more vertex attributes based on which the edges are aggregated. Optionally, the by characteristics can also be specified separately for by.from and by.to. This gives flexible control over the data, for instance to aggregate from sources to sourcetypes, or to aggregate scores for each source per month.

By default, the function returns the number of edges, as well as the number of nodes that is connected for both the from and to group. The number of nodes that is connected is only relevant if a threshold for similarity (edge weight) is used, so that whether or not an edge exists indicates whether or not two documents are similar. In addition, if an edge.attribute is given, this attribute will be aggregated using the function specified in agg.FUN. For the following example we include this to analyze the median of the hourdiff attribute.

g.agg = network.aggregate(g, by='source', 
                          edge.attribute='hourdiff', agg.FUN=median)

e = get.data.frame(g.agg, 'edges')
head(e)
##          from          to edges agg.hourdiff from.V from.Vprop to.V
## 1  Newsagency  Newsagency   114         8.49     94     0.1575   98
## 2  Print NP 1  Newsagency    19         8.33     15     0.1020   18
## 3 Online NP 1  Newsagency   103         4.88     81     0.1753   92
## 4 Online NP 2  Newsagency    67         6.78     58     0.1391   61
## 5  Print NP 2  Newsagency    10        10.75     10     0.0763   10
## 6  Newsagency Online NP 1   434         1.27    380     0.6365  406
##   to.Vprop
## 1   0.1642
## 2   0.0302
## 3   0.1541
## 4   0.1022
## 5   0.0168
## 6   0.8788

In the edges of the aggregated network there are six scores for each edge. The edges attribute counts the number of edges from the from group to the to group. For example, we see that Newsagency documents have 434 edges to later published Online NP 1 documents. The agg.hourdiff attribute shows that the median of the hourdiff attribute of these 434 edges is 1.27 (1 hour and 16 minutes). In addition to the edges, we can look at the number of messages (i.e. vertices) in the from group that matched with at least one message in the to group. This is given by the from.V attribute, which shows here that 380 Newsagency documents matched with a later published Online NP 1 document. This is also given as a proportion of all vertices/documents in the from group, as the from.Vprop attribute. Substantively, the from.Vprop score thus indicates that 63.65% of the political news messages in Newsagency are similar or identical to later published Online NP 1 messages.

Alternatively, we can also look at the to.V and to.Vprop scores to focus on the number and proportion of messages in the to group that match with at least one message in the from group. Here we see, for instance, that 87.88% of the political news messages in Online NP 1 are similar or identical to previously published Newsagency messages. The from.Vprop and to.Vprop scores thus give different and complementary measures for the influence of Newsagency on Online NP 1.

Inspecting and visualizing results

The final step is to interpret the aggregated network data, or to prepare this data for use in a different analysis. We already saw that the network data can be transformed to a common data.frame with the get.data.frame function. Alternatively, igraph has the get.adjacency function to return the values for one edge attribute as an adjacency matrix, with the vertices in the rows and columns. This is a good way to present the scores of the aggregated network.

adj.m = get.adjacency(g.agg, attr= 'to.Vprop', sparse = FALSE)
round(adj.m, 2) # round on 2 decimals
##             Newsagency Print NP 1 Online NP 1 Online NP 2 Print NP 2
## Newsagency        0.16       0.24        0.88        0.82       0.34
## Print NP 1        0.03       0.03        0.05        0.04       0.02
## Online NP 1       0.15       0.22        0.11        0.33       0.31
## Online NP 2       0.10       0.18        0.26        0.08       0.31
## Print NP 2        0.02       0.01        0.03        0.07       0.01

Alternatively, an intuitive way to present the aggregated network is by visualizing it. For this we can use the visualization features of the igraph package. To make this more convenient, we also offer the directed.network.plot function. This is a wrapper for the plot.igraph function, which makes it easier to use different edge weight attributes and to use an edge threshold, and has default plotting parameters to work well for directed graphs with edge labels. For this example we use the to.Vprop edge attribute with a threshold of 0.2.

directed.network.plot(g.agg, weight.var = 'to.Vprop',
                       weight.thres = 0.2)

For illustration, we can now see how the results change if we transform the network with the only.first.match function, which only counts edges to the first source that published a document.

g2 = only.first.match(g)
g2.agg = network.aggregate(g2, by='source', 
                           edge.attribute='hourdiff', agg.FUN=median)

directed.network.plot(g2.agg, weight.var = 'to.Vprop',
                       weight.thres = 0.2)

The first network is much denser than the second. In particular, we see stronger edges between the print and online editions of the newspapers. In the second network only the edges from the news agency towards the print and online newspapers remain, meaning that edges between newspapers all score less than 0.2. This implies that many of the edges between newspapers that could be observed in the first network resulted from cases where both newspapers adopted the same news agency articles.

Note that the second network is not better per se. It is possible that the initial source of a message is not the direct source. For example, a blog might not have access to the news agency feed, and therefore only receive news agency messages if they are published by another source. Thus, the most suitable approach depends on the purpose of the analysis. One of the goals of creating this package is to facilitate a platform for scholars and professionals to develop best practices for different occasions.

Alternative applications of this package

There are several alternative applications of the functions offered in this package that are not covered in this vignette. Here we briefly point out some of the more useful alternatives.

In the network.aggregate function it is possible to use different vertex attributes to aggregate the edges for from and to nodes. A particularly interesting application of this feature is to use the publication date in the aggregation. For instance, with the following settings, we can get the proportion of matched documents per day. Here we also demonstrate the return.df parameter in the network.aggregate function. This way the results are returned as a single data.frame in which relevant edge and vertex information is merged.

V(g)$day = format(as.Date(V(g)$date), '%Y-%m-%d')
agg.perday = network.aggregate(g, 
              by.from='sourcetype', by.to=c('sourcetype', 'day'), 
              edge.attribute='hourdiff', agg.FUN=median, 
              return.df=TRUE)

head(agg.perday[agg.perday$to.sourcetype == 'Online NP',  
                c('from.sourcetype', 'to.sourcetype', 'to.day','to.Vprop')])
##    from.sourcetype to.sourcetype     to.day to.Vprop
## 64        Print NP     Online NP 2013-06-01    0.263
## 65      Newsagency     Online NP 2013-06-01    0.789
## 66       Online NP     Online NP 2013-06-01    0.368
## 67      Newsagency     Online NP 2013-06-02    0.778
## 68       Online NP     Online NP 2013-06-02    0.111
## 69      Newsagency     Online NP 2013-06-03    0.864

Looking at the to.Vprop score, we see that on 2013-06-01, 78.9% of Online NP messages matched with previously published Newsagency messages. This way, the aggregated document similarity data can be analyzed as a time series, for instance to analyze whether seasonal effects or extra-media developments such as election campaigns affect content homogeneity or intermedia dynamics. The return.df feature is convenient for this purpose, because it directly matches all the vertex and edge attributes (as opposed to the get.data.frame function).
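For example, these daily proportions can be plotted as a simple time series. This is a sketch using base R graphics, assuming agg.perday was created with return.df=TRUE and has the columns shown above:

```r
# Sketch: plot the daily proportion of Online NP messages that match
# previously published Newsagency messages
sub = agg.perday[agg.perday$from.sourcetype == 'Newsagency' &
                 agg.perday$to.sourcetype == 'Online NP', ]
plot(as.Date(sub$to.day), sub$to.Vprop, type = 'l',
     xlab = 'Day', ylab = 'Proportion of matched messages')
```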

Another useful application of this feature is to only aggregate the by.to group, by using the document name in the by.from argument. This way, the data in the by.to group is aggregated for each individual message. In the following example we use this to see for each individual message whether it matches with later published messages in each sourcetype. This could for instance be used to analyze whether certain document level variables (e.g., news factors, sensationalism) make a message more likely to be adopted by other news outlets. We set the edge.attribute to “weight” and agg.FUN to max, so that for each document we can see how strong the strongest match with each source was.

agg.perdoc = network.aggregate(g, 
                  by.from='name', by.to='sourcetype', 
                  edge.attribute='weight', agg.FUN=max,
                  return.df=TRUE)

docXsource = xtabs(agg.weight ~ from.name + to.sourcetype, 
                   agg.perdoc, sparse = FALSE)
head(docXsource)
##            to.sourcetype
## from.name   Newsagency Online NP Print NP
##   147908507       0.40     0.425        0
##   148662598       0.00     0.767        0
##   150454094       0.52     0.528        0
##   150454102       0.00     0.895        0
##   150454110       0.00     0.402        0
##   150454167       0.00     0.996        0

Finally, note that we have now only compared documents with future documents. We thereby focus the analysis on news diffusion. To focus on content homogeneity, each document can be compared to both past and future documents. By measuring content homogeneity aggregated over time, patterns such as longitudinal trends can be analyzed.
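As a sketch, this can be done by passing a symmetric time window to newsflow.compare, so that documents published shortly before as well as shortly after each document are included in the comparison (this assumes, as an illustration, that the lower bound of hour.window may be negative; dtm and meta are the objects prepared earlier in this vignette).

```r
# Compare each document to both past and future documents by using a
# symmetric time window (a negative lower bound looks into the past).
g_hom = newsflow.compare(dtm, meta, hour.window = c(-36, 36), 
                         min.similarity = 0.4)
```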

Conclusion and future improvements

We have demonstrated how the RNewsflow package can be used to perform a many-to-many comparison of documents. The primary focus and most important feature of this package is the newsflow.compare function. This function compares all documents that occur within a given time distance, which makes it computationally feasible for longitudinal data. Using this data, we can analyze to what extent different sources publish the same content and whether there are consistent patterns in who follows whom. The secondary focus of this package is to provide functions to organize, visualize and analyze the document similarity data within R. By enabling both the document comparison and the analysis to be performed within R, this type of analysis becomes more accessible to scholars less versed in computational text analysis, and it becomes easier to share and replicate methods at a very detailed level.

The data input required for this analysis consists solely of textual documents and their corresponding publication date and source. Since no human coding is required, the package enables large-scale comparative and longitudinal studies. Although the demonstration in this vignette used a moderately sized dataset, the newsflow.compare function can handle much larger data and is fast, thanks to the excellent sparse matrix multiplication algorithm of the Matrix package.
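To illustrate why sparse matrix multiplication makes this fast, the core of a vector space model comparison can be sketched using the Matrix package alone: after L2-normalizing the rows of a DTM, a single sparse matrix product yields all pairwise cosine similarities. The toy DTM below is hypothetical and not part of the package.

```r
library(Matrix)

# Toy DTM: 3 documents x 4 terms, stored as a sparse matrix
dtm_toy = Matrix(c(1,0,2,0,
                   0,1,1,0,
                   1,1,0,1), nrow=3, byrow=TRUE, sparse=TRUE)
rownames(dtm_toy) = c('d1','d2','d3')

# L2-normalize the rows, so that one sparse matrix product
# gives the cosine similarity for every pair of documents
dtm_norm = dtm_toy / sqrt(rowSums(dtm_toy^2))
sim = tcrossprod(dtm_norm)  # document-by-document cosine similarity
round(as.matrix(sim), 3)
```

In newsflow.compare, the same principle is applied within a sliding time window, so that only document pairs within the given time distance are computed and stored.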

The validity of the method presented here relies on several factors, most importantly the type of data, the pre-processing and DTM preparation steps, and the similarity threshold. It is thus recommended to use a gold standard, based on human interpretation, to test its validity. In a recent empirical study (author citation, forthcoming) we obtained good performance for whether documents addressed the same events and whether they contained identical phrases by determining thresholds based on a manually coded gold standard. Also, by using a news website that consistently referred to a news agency by name, we confirmed that we could reliably identify news agency influence for this format. Naturally, there are always limitations to the accuracy with which influence in news diffusion can be measured using content analysis alone. However, we generally do not have access to other reliable information. The advantage of a content-analysis-based approach is that it can be used on a large scale and over long periods of time, without relying on the often equally inaccurate self-reports of news workers. The current package contributes to the toolkit for this type of analysis.
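For instance, given a set of document pairs that were manually coded as related or not, the performance of a given similarity threshold can be checked with standard precision and recall scores. The vectors below are hypothetical and only serve to illustrate the calculation.

```r
# Hypothetical gold standard: human judgment of whether document pairs match
gold = c(TRUE, TRUE, FALSE, FALSE, TRUE)
# Computed similarity scores for the same document pairs
sim  = c(0.62, 0.45, 0.30, 0.55, 0.70)

threshold = 0.4
pred = sim >= threshold
precision = sum(pred & gold) / sum(pred)  # proportion of matches that are correct: 0.75
recall    = sum(pred & gold) / sum(gold)  # proportion of true matches that are found: 1
```

By repeating this for a range of thresholds, a threshold can be chosen that balances precision and recall for the data at hand.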

Our goal is to continue developing this package as a specialized toolkit for analyzing the homogeneity and diffusion of news content. First of all, additional approaches for measuring whether documents are related will be added. Currently only a vector space model approach for calculating document similarity is implemented. For future versions alternative approaches such as language modeling will also be explored. In particular, we want to add measures to express the relation of documents over time in terms of probability and information gain. This would also allow us to define a more formal criterion for whether or not a relation exists, other than using a constant threshold for document similarity. Secondly, new methods for analyzing and visualizing the network data will be explored. In particular, methods will be implemented for analyzing patterns beyond dyadic ties between news outlets, building on techniques from the field of network analysis. To promote the involvement of other scholars and professionals in this development, the package is published entirely open-source. The source code is hosted on GitHub—https://github.com/kasperwelbers/RNewsflow.

Practical code example


# Prepare DTM and meta data

tdd = term.day.dist(dtm, meta)
dtm = dtm[,tdd$term[tdd$days.entropy.norm <= 0.3]]

dtm = weightTfIdf(dtm)

# Prepare document similarity network
g = newsflow.compare(dtm, meta, hour.window = c(0.1,36), 
                     min.similarity = 0.4)
g = filter.window(g, hour.window = c(6, 36), 
                  to.vertices = V(g)$sourcetype == 'Print NP')

show.window(g, to.attribute = 'source')

g_subcomps = decompose.graph(g)
document.network.plot(g_subcomps[[2]], dtm=dtm)

# Aggregate network 
g.agg = network.aggregate(g, by='source', 
                          edge.attribute='hourdiff', agg.FUN=median)

get.adjacency(g.agg, attr='to.Vprop')
directed.network.plot(g.agg, weight.var = 'to.Vprop')

g2 = only.first.match(g)
g2.agg = network.aggregate(g2, by='source', 
                           edge.attribute='hourdiff', agg.FUN=median)

get.adjacency(g2.agg, attr='to.Vprop')
directed.network.plot(g2.agg, weight.var = 'to.Vprop')

References


Baum, Matthew A, and Tim Groeling. 2008. “New Media and the Polarization of American Political Discourse.” Political Communication 25 (4). Taylor & Francis: 345–65.

Bennett, W Lance, and Shanto Iyengar. 2008. “A New Era of Minimal Effects? The Changing Foundations of Political Communication.” Journal of Communication 58 (4). Wiley Online Library: 707–31.

Blumler, Jay G, and Dennis Kavanagh. 1999. “The Third Age of Political Communication: Influences and Features.” Political Communication 16 (3). Taylor & Francis: 209–30.

Boczkowski, Pablo J, and Martin De Santos. 2007. “When More Media Equals Less News: Patterns of Content Homogenization in Argentina’s Leading Print and Online Newspapers.” Political Communication 24 (2). Taylor & Francis: 167–80.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal, Complex Systems 1695 (5): 1–9.

Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software 25 (5): 1–54. http://www.jstatsoft.org/v25/i05.

Galtung, Johan, and Mari Holmboe Ruge. 1965. “The Structure of Foreign News the Presentation of the Congo, Cuba and Cyprus Crises in Four Norwegian Newspapers.” Journal of Peace Research 2 (1). Sage Publications: 64–90.

Grimmer, Justin, and Brandon M Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3). SPM-PMSAPSA: 267–97.

Jurka, Timothy P, Loren Collingwood, Amber E Boydstun, Emiliano Grossman, and Wouter van Atteveldt. 2013. “RTextTools: A Supervised Learning Package for Text Classification.” The R Journal 5 (1): 6–12.

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. “The Stanford CoreNLP Natural Language Processing Toolkit.” In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.

Meraz, Sharon. 2011. “Using Time Series Analysis to Measure Intermedia Agenda-Setting Influence in Traditional Media and Political Blog Networks.” Journalism & Mass Communication Quarterly 88 (1). SAGE Publications: 176–94.

Monroe, Burt L, Michael P Colaresi, and Kevin M Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16 (4). SPM-PMSAPSA: 372–403.

Paterson, Chris. 2005. “News Agency Dominance in International News on the Internet.” Converging Media, Diverging Politics. Lanham, MD: Lexington Books, 145–63.

Pew Research Center. 2010. “New Media Old Media: How Blogs and Social Media Agendas Relate.” Available at: http://www.journalism.org/node/20621.

Ragas, Matthew W. 2014. “Intermedia Agenda Setting in Business News Coverage.” Communication and Language Analysis in the Public Sphere. IGI Global, 335.

Salton, Gerard, and Donna Harman. 2003. Information Retrieval. John Wiley & Sons Ltd.

Salton, Gerard, Anita Wong, and Chung-Shu Yang. 1975. “A Vector Space Model for Automatic Indexing.” Communications of the ACM 18 (11). ACM: 613–20.

Shoemaker, Pamela J, and Timothy Vos. 2009. Gatekeeping Theory. Routledge.

Sparck Jones, Karen. 1972. “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” Journal of Documentation 28 (1). MCB UP Ltd: 11–21.

Turney, Peter D. 2002. “Thumbs up or Thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews.” In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL’02), 417–24.

Van den Bosch, A, G.J. Busser, W. Daelemans, and S. Canisius. 2007. “An Efficient Memory-Based Morphosyntactic Tagger and Parser for Dutch.” Leuven.

  1. R is an open-source statistical software package (https://www.r-project.org/).

  2. Stemming and lemmatization are both techniques for reducing words to their root, or more specifically their stem and lemma. This is used to group different forms of the same word together. Without going into specifics, lemmatization is computationally much more demanding, but generally gives better results. Especially for richly inflected languages such as German or Dutch, it is highly recommended to use lemmatization instead of stemming.

  3. Part-of-speech tagging is a technique that identifies types of words, such as verbs, nouns and adjectives.

  4. To create a DTM based on externally pre-processed documents, one can use the base function xtabs to create a sparse matrix and the as.DocumentTermMatrix function from the tm package to transform this matrix to the DocumentTermMatrix class.

  5. “document_id”, “date” and “source” are the default labels used in several functions to interpret the meta information. Note that these defaults can always be changed using the function parameters. Using the default labels only serves as a convenience.

  6. Note that this is also a good automatic approach for filtering out stopwords, boilerplate words, and word forms such as articles and common verbs.

  7. Alternatively, the current version also supports percentage.from and percentage.to, which are measures giving the percentage of overlapping term occurrences. These are asymmetric measures: the overlap percentage of document 1 to document 2 can differ from the overlap percentage of document 2 to document 1. Therefore, from and to indicate whether the percentage is based on the document from which we compare or the document to which we compare. When using an asymmetric measure, the direction should be carefully taken into account in the analysis.

  8. If the DTM contains negative values, the cosine similarity score can range from -1 to 1. Note that for the other similarity measures there can be no negative values in the DTM.

  9. If data about document similarities is imported, then the document.network function can be used to create this network. This way the functions of this package for aggregating and visualizing the network can still be used.

  10. If there are multiple earliest dated documents (that is, having the same publication date) then edges to all earliest dated documents are kept.

  11. Note that the edges score is always equal to or higher than the from.matched score, since one document can match with multiple other documents.

  12. The from.Vprop cannot be interpreted in the same way, because it gives the proportion of all messages in the from group. If one is interested in the from.Vprop score per day, the by.from and by.to arguments in network.aggregate need to be switched. Note that one should not simply aggregate both by.from and by.to by date, because then only documents that both occurred on this date will be aggregated.