We start by loading the required packages. mediacloudr is used to download an article based on an article id received from the Media Cloud Topic Mapper. Furthermore, mediacloudr provides a function to extract social media metadata from HTML documents. httr is used to turn R into an HTTP client to download and process the requested article page. We use xml2 to parse the HTML document, i.e. make it readable for R, and rvest to find elements of interest within the HTML document.
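The following lines load these packages, assuming all four are already installed.
# load the required packages
library(mediacloudr)
library(httr)
library(xml2)
library(rvest)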
In the first step, we request the article with the id 1126843780. It is important to add the upper case L to the number to turn the numeric type into an integer type; otherwise the function will throw an error. The article was selected with the help of the Media Cloud Topic Mapper online tool. If you create an account, you can create and analyze your own topics.
# define media id as integer
story_id <- 1126843780L
# download article
article <- get_story(story_id = story_id)
The USA Today news article comes with a URL which we can use to download the complete article using the httr package. We use the GET function to download the article. Afterwards, we extract the website content using the content function. It is important to provide the type argument to extract the text only. Otherwise, the function tries to guess the type and will automatically parse the content based on the content-type HTTP header. The author of the httr package suggests manually parsing the content. In this case, we use the read_html function provided by the xml2 package.
# download article
response <- GET(article$url[1])
# extract article html
html_document <- content(response, type = "text", encoding = "UTF-8")
# parse website
parsed_html <- read_html(html_document)
After parsing the response into an R-readable format, we extract the actual body of the article. To do so, we use the html_nodes function to find the HTML tags defined in the css argument. A useful open source tool to find the corresponding tags or CSS classes is the SelectorGadget. Alternatively, you can use the developer tools of your browser. The html_text function provides us with a character vector in which each element contains a paragraph of the article. We use the paste function to merge the paragraphs into one continuous text. We could then analyze the text using different metrics such as word frequencies or sentiment analysis; a short sketch follows the code below.
# extract article body
article_body_nodes <- html_nodes(x = parsed_html, css = ".content-well div p")
article_body <- html_text(x = article_body_nodes)
# paste character vector to one text
article_body <- paste(article_body, collapse = " ")
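As a minimal sketch of such an analysis, the following base R lines count the most frequent words in the article body. A real analysis would typically remove stop words and punctuation more carefully first.
# split the article body into lower-case words and count their frequencies
words <- unlist(strsplit(tolower(article_body), "\\W+"))
words <- words[words != ""]
word_freq <- sort(table(words), decreasing = TRUE)
# inspect the ten most frequent words
head(word_freq, 10)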
In the last step, we extract the social media metadata from the article. Social media metadata are shown when the article URL is shared on social media. The article representation usually includes a heading, a summary, and a small image/thumbnail. The extract_meta_data function expects a raw HTML document and returns Open Graph (a standard introduced by Facebook), Twitter, and native metadata.
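A minimal sketch of this step is shown below. The raw HTML document is passed as the first argument; the field names used to access the result (og_title) and the article title column (article$title) are assumptions about the returned structures, so it is worth inspecting the output with names() first.
# extract social media metadata from the raw HTML document
meta_data <- extract_meta_data(html_document)
# inspect the available fields
names(meta_data)
# compare the Open Graph title (assumed field name) with the title provided by Media Cloud
meta_data$og_title
article$title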
Open Graph Title: “ICE drops off migrants at Phoenix bus station”
Article Title (provided by mediacloud.org): “Arizona churches working to help migrants are ‘at capacity’ or ‘tapped out on resources’”
The metadata can be compared to the original content of the article. A short analysis reveals that USA Today chose a different heading to advertise the article on Facebook. Larger analyses can use quantitative tools such as string similarity measures, as provided by the stringdist package.
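As a brief illustration, assuming the stringdist package is installed, the two headings quoted above can be compared with a normalized similarity score.
# compare the two headings with the Jaro-Winkler similarity
# (1 means identical strings, 0 means completely different)
library(stringdist)
og_title <- "ICE drops off migrants at Phoenix bus station"
article_title <- "Arizona churches working to help migrants are 'at capacity' or 'tapped out on resources'"
stringsim(og_title, article_title, method = "jw")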