Collecting, Analyzing and Presenting data about the participation in #ilovefs day


Documentation: collecto.R



The Script

The R script documented here has a modular structure. Two sections handle loading the packages necessary for the process and, at the end, exporting the aggregated data into usable formats. The remaining sections each handle one specific data source (e.g. Twitter, Mastodon, …). While the Package-Section is obviously necessary for the remaining sections (depending on which ones you actually want to use), as is the Export-Section for actually using the data in other applications, scripts or by other people, you can cherry-pick between the Datasource-Sections. These can be used independently of each other and in no particular order. Keep that in mind if you only want to analyze a single data source.

As a side note, the script is written to keep the collected data as anonymous as possible. However, because we deal with a rather small sample and because of the nature of social media, it is in most cases still possible to track down each specific user in the resulting data. While the time and date of a posting are mostly unproblematic, other fields - such as the posting's text or the account name - can still identify individual users.


Packages

As of writing this script and its documentation, three scraper-packages are being used:

The twitteR package

twitteR has rather extensive documentation as well as “in-R-manuals”. Simply enter ??twitteR into the R console or look up a specific function with ?function, replacing function with its actual name. twitteR has several useful functions to scrape Twitter data, most of which however apply to the Twitter account in use - which in our case is not necessary. The Twitter-Section uses only three functions, which will be discussed individually later on:

  setup_twitter_oauth() # authentication
  searchTwitter()       # searching Twitter for a particular string
  strip_retweets()      # exclude Retweets in the results

As a side note: I had to install the httr package - a dependency of twitteR - from the repositories of my distribution of choice, as the one provided by CRAN would not compile for some reason. So if you run into a similar issue, look for something like r-cran-httr in your package manager.

The Rfacebook package

Attention: I tried to set up a Facebook account just for this purpose, but their registration process is rather tedious and honestly ridiculous. Keep your phone number or credit card nearby, as well as a photo of your face. I cannot accept these kinds of intrusion, even for the purpose of this data analysis. If you already have a Facebook account, you can however use that one to receive the API access tokens and use the Rfacebook package described in this section. I did not, so the process described here is hypothetical and I do not actually know the structure of each function’s output.

Rfacebook documents its internal and external functions fairly well, too. The focus of the package does not quite align with the purpose we have in mind here (it concentrates on metrics for site administrators and on analyzing specific people’s actions), but for lack of alternatives we can still use it to some extent. Unfortunately, the functions of this package have very generic names and thus may conflict with functions from other packages. Here is a little tip to prevent the usage of the wrong function in R: prefix the function you want to use with the name of the package and a double colon. In the case of the getShares() function, this would result in Rfacebook::getShares(). The functions we are interested in, and which will be discussed later on as well, are:

  fbOAuth()           # authentication / generating an auth-token
  getCommentReplies() # replies to a comment on a post
  getGroup()          # retrieve information from a public group
  getPage()           # retrieve information from a public page
  getPost()           # retrieve information from a public post (incl. comments)
  getReactions()      # retrieve reactions to a single or multiple posts
  getShares()         # retrieve list of shares of a post
  getUsers()          # retrieve information about poster
  searchFacebook()    # search public posts with a certain string [deprecated] 
  searchPages()       # search public pages that mention a certain string

The same side note as for twitteR applies here: httr is a dependency of Rfacebook as well, so if the version provided by CRAN does not compile, look for something like r-cran-httr in your package manager.

The mastodon package

The good thing about Mastodon is that searches are not restricted to a single Mastodon instance, or to Mastodon at all. If your instance has enough outbound connections (so make sure you choose a very active and inter-communicative one), you are able to search not only Mastodon instances, but also GNUsocial, Pump.io and other compatible social media instances. Luckily, this also applies to the mastodon package. Unfortunately, mastodon for R is documented very poorly, if at all. This puts us in the uncomfortable position of having to figure out what the outputs of each function actually mean. Those are not properly labeled either, so this is a task of trial and error and a lot of guessing. If you have time and dedication, feel free to document it properly and open a pull request on the project’s GitHub page. The relevant results that we use in our script are listed in the Mastodon-Section of this documentation. Again, just like with the Rfacebook package, the function names are very generic and thus it is a good idea to prefix them with mastodon:: to prevent the use of a wrong function from another package (e.g. login() becomes mastodon::login()). From the long list of functions in this package, we only need two for our analysis:

  login()       # authentication / generating an auth-token
  get_hashtag() # search the fediverse for posts that include a specific hashtag

Note: as this package is not hosted on CRAN but on GitHub, you cannot install it with install.packages() like the other packages. The easiest way is to install it with install_github() from the devtools package. In order to use install_github() without loading the library (as we only need it this one time), you can prefix it with its package name. Installing and loading the mastodon package would look like this:

  install.packages("devtools")
  devtools::install_github(repo = "ThomasChln/mastodon")
  library("mastodon")

Also note that devtools requires the development files of libssl (on Debian-based systems, for example, the libssl-dev package) to be installed on your machine.


Twitter

Authenticate

As the package in use here needs access to the Twitter API, we first need the “Consumer Key”, “Consumer Secret”, “Access Token” and “Access Token Secret”, all of which you can request from apps.twitter.com. Of course, you need a Twitter account for this (staff may ask for the FSFE’s account).

The authentication can be done in two ways:

  1. via manual input. The R-Console will prompt you to enter the credentials by typing them in.
  2. via a plain text file with the saved credentials. This .txt file has a very specific structure which you have to follow. You can find an example file in the examples folder.

The first line of the credential file contains the labels. These have to be in the same order as the credentials themselves in the line below. The labels as well as the credentials are each separated by a single semicolon ;. Storing the credentials in plain text surely is not optimal, but it is the easiest way to get the information into our R session. This should not be too critical if your disk is encrypted.
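
As a sketch of how such a file could be read into the R-session (the file name and the column labels used here are placeholders, not necessarily the ones from the example file):

  # read the one-line, semicolon-separated credential file as plain strings
  twitter_auth_data <- read.table(file = "twitter_credentials.txt", header = TRUE,
                                  sep = ";", colClasses = "character")

  # assign each column to the variables used further below
  twitter_consumerkey <- twitter_auth_data$consumerkey
  twitter_consumerpri <- twitter_auth_data$consumerprivate
  twitter_tokenaccess <- twitter_auth_data$tokenaccess
  twitter_tokensecret <- twitter_auth_data$tokensecret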

Next, we request our OAuth token with setup_twitter_oauth(). This function is a wrapper around httr, which will also store the token in a local file, so make sure not to leak it by making that file public. The OAuth token can not only be used to scrape information from Twitter, it also grants write access, so it can be used to manipulate the affiliated Twitter account or to interact with Twitter in any other way.

The function used to authenticate takes all four of our credential keys as arguments, which in this script are stored in the variables twitter_consumerkey, twitter_consumerpri, twitter_tokenaccess and twitter_tokensecret:

  setup_twitter_oauth(consumer_key = twitter_consumerkey,
                      consumer_secret = twitter_consumerpri,
                      access_token = twitter_tokenaccess,
                      access_secret = twitter_tokensecret)

Scraping Tweets

Once we have an OAuth token, we can start looking for the desired tweets to collect. For this we use the searchTwitter() function. All functions in the twitteR package access the file created by the auth function mentioned before, so there is no need to enter the token as an argument. The arguments we do need are:

  • the string to search for, in this case ilovefs. This will not only include things like “ilovefs18”, “ilovefs2018”, “ILoveFS”, etc. but also hashtags like “#ilovefs”
  • the date from which on we want to search. It is worth noting that the API is limited in that it can only go back a few months. So if you want to look for results from a year ago, you are out of luck. This date has to be in the form “YYYY-MM-DD”. For our purpose, it makes sense to set it to either 2018-01-01 or 2018-02-01 to also catch people promoting the campaign in advance
  • the date until which we want to search. This one also has to be in the form “YYYY-MM-DD”. This argument usually only makes sense if you analyze events in the past. For our purpose, we can set it either to the present or a future date
  • the maximum number of tweets to be aggregated. This number is only relevant for search terms that get a lot of coverage on Twitter (e.g. trending hashtags). For our purpose, we can safely set it to a number that is much higher than the anticipated participation in the campaign, like 999999999, so we get ALL tweets containing our specified string
  • the order-type for the search. Again, this only makes sense for searches where we do not want each and every single tweet. In our case, set it to anything, for example recent

We save the result of this command in the variable twitter_tw_dirty. The dirty stands for an “unclean” result, still containing retweets. The resulting code is:

  twitter_tw_dirty <- searchTwitter(search = "ilovefs",
                                    since = "2018-01-01",
                                    until = "2018-12-31",
                                    n = 999999999,
                                    resultType = "recent")

The next step is to clean this data and remove the retweets (they are listed in the “dirty” data as normal tweets as well), as those are not needed for our purpose. We can still extract the number of retweets of each posting later on; who retweeted is not important. We provide three arguments to the function strip_retweets():

  • the list() item containing our scraped tweets. As shown above, we saved this to the variable twitter_tw_dirty
  • whether we also want to remove “manual retweets”, which is someone literally copy-and-pasting the text of a tweet. This is up for debate, but personally I would say that this should be kept in, as this is what a lot of “share this site” buttons on websites do. This is still participation and should thus be included in the results
  • whether we want to remove “modified tweets”, which probably means “quoted” ones? Either way, if in doubt we want to keep them in. We can still remove them if we later find out they are in fact retweets.

The result is saved to the variable twitter_tw, now containing only clean data:

  twitter_tw <- strip_retweets(tweets = twitter_tw_dirty,
                               strip_manual = FALSE,
                               strip_mt = FALSE)

Stripping out data

The list() item resulting from the searchTwitter() function has a logical, but rather inconvenient structure. The list() contains a lower list() for each Tweet scraped. Those lower list() items contain variables for each property, as shown by the illustration below:

  twitter_tw
    |
    |- [LIST 1]
    |    |- text = "This is my tweet about #ilovefs https://fsfe.org"
    |    |- ...
    |    |- favoriteCount = 21
    |    |- ...
    |    |- created = "2018-02-14 13:52:59"
    |    |- ...
    |    |- statusSource = "<a href='/download/android'>Twitter for Android</a>"
    |    |- screenName = "fsfe"
    |    |- retweetCount = 9
    |    |- ....
    |    |- urls [LIST]
    |    |    |- expanded = "https://fsfe.org"
    |    |    '- ...
    |    '- ...
    |
    |- [LIST 2]
    |    |- ...
    |    '- ...
    |
    '- ...

The inconvenience of this structure stems from the fact that we need to use for-loops in order to run through each lower list() item and extract its variables individually.

For the sake of keeping this short, this documentation only explains the extraction of a single variable, namely the client used to post a tweet. First we create a new, empty vector() item called twitter_client with the “combine” command (or c() for short). Usually you do not have to pre-define empty vectors in R; they are created automatically when you assign them a value, as we have done multiple times before. You only need to pre-define one if you want to address a specific location in that vector, say skipping the first value and filling in the second. We do it like this here, as we want the resulting vector() item to have the same order as the list():

  twitter_client <- c()

The for-loop has to count up from 1 to the length of the list() item. So if we scraped four tweets, the for-loop has to count 1 2 3 4:

  for(i in c(1:length(twitter_tw))){
    ...
  }

Next, we check whether the desired variable in the lower list() item is set. R does not have a specific way of checking whether a variable is set or not; however, if a variable exists but is empty, its length is zero. Thus, if we want to check whether a variable is set, we can simply check its length. In particular, here we check if the vector statusSource within the i-th lower list of twitter_tw has a length greater than zero:

  if(length(twitter_tw[[i]]$statusSource) > 0){
    ...
  } else {
    ...
  }

Finally, we can extract the value we are after - the statusSource vector. We assign it to the i-th position in the previously defined vector() item twitter_client, if the previously mentioned if-statement is true. As a little hack, we specifically assign it as a character item with the as.character() function. This may not always be necessary, but sometimes wrong values will be assigned if the source variable is a factor() - I won’t go in-depth on that matter here. Just a word of caution: always check your variables before continuing. If the if-statement above is false, we instead assign NA, meaning “Not Available”:

  twitter_client[i] <- as.character(twitter_tw[[i]]$statusSource)
  ...
  twitter_client[i] <- NA

Sometimes, as is the case with twitter_client, the extracted string contains things that we do not need or want, so we use regex to get rid of them.

If you are not familiar with regex, I highly recommend regexr.com to learn how to use it. It also contains a nifty cheat-sheet.

Official Twitter clients include the download URL besides the name of the client. It is safe to assume that most other clients do the same, so we can clean up the string with two simple sub() commands (short for “substitute”). As arguments, we give it the pattern it should substitute, the replacement string (in our case, this string is empty) and the string that this should happen to - here twitter_client. We assign the result to the same variable again, overriding its previous value:

  twitter_client <- sub(pattern = ".*\">", replace = "", x = twitter_client)
  twitter_client <- sub(pattern = "</a>", replace = "", x = twitter_client)

All combined, this looks like this:

  twitter_client <- c()
  for(i in 1:length(twitter_tw)){
    if(length(twitter_tw[[i]]$statusSource) > 0){
      twitter_client[i] <- as.character(twitter_tw[[i]]$statusSource)
    } else {
      twitter_client[i] <- NA
    }
  }
  twitter_client <- sub(pattern = ".*\">", replace = "", x = twitter_client)
  twitter_client <- sub(pattern = "</a>", replace = "", x = twitter_client)

All other values are handled in a similar fashion. Some of them need smaller fixes afterwards, just like the removal of URLs in twitter_client.

Creating the finished dataset

After we have scraped all desired tweets and extracted the relevant information from them, it makes sense to combine the individual variables into a dataset, which can be easily handled, exported and reused. It also makes sense to have relatively short variable names within such a dataset. During the data collecting process, we used a twitter_ prefix in front of each variable to be sure we use the correct variables, all coming from our Twitter scraper. We do not need this prefix in a data.frame() item, as its name itself already eliminates the risk of using the wrong variables.

Additionally, we still need to split up the twitter_timedate variable, which currently contains the point in time of the tweet in the form YYYY-MM-DD HH:MM:SS. For this, we again use regex and the function sub(). As sub() only replaces the first instance of the pattern given to it, we need to use gsub() (for global substitution) if there are multiple occurrences of a given pattern.
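
As a quick illustration of the difference (not part of the script):

  sub(pattern = "-", replace = ".", x = "2018-02-14")   # "2018.02-14" - only the first match
  gsub(pattern = "-", replace = ".", x = "2018-02-14")  # "2018.02.14" - all matches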

We also give some of the variables a new “mode”, for example transferring them from a character() item (a string) to a factor() item, making them an ordinal or nominal variable. This especially makes sense for the number of retweets and favorites.

The results are seven separate variables, which in a second step can be combined into a data.frame() item:

  time <- sub(pattern = ".* ", x = twitter_timedate, replace = "")
  time <- as.numeric(gsub(pattern = ":", x = time, replace = ""))
  date <- sub(pattern = " .*", x = twitter_timedate, replace = "")
  date <- as.numeric(gsub(pattern = "-", x = date, replace = ""))
  retw <- as.factor(twitter_rts)
  favs <- as.factor(twitter_fav)
  link <- as.character(twitter_url)
  text <- as.character(twitter_txt)
  clnt <- as.character(twitter_client)

When combining these variables into a data.frame(), we first need to create a matrix from them by binding them as columns of said matrix with the cbind() command. The result can be used by the data.frame() function to create such an item. We label this dataset twitter, making clear what source of data we are dealing with:

  twitter <- data.frame(cbind(date, time, retw, favs, text, link, clnt))

Often during this process, all variables within the data.frame() item are transformed into factor() variables, which is not what we want for most of them. Usually, when working with variables within a data.frame(), you have to prefix the variable with the name of the data.frame and a dollar sign, meaning that you want to access that variable within that data.frame(). This would make the process of changing the mode quite tedious for each variable:

  twitter$text <- as.numeric(as.character(twitter$text))

Instead, we can use the within() function, using the twitter dataset as one argument and the expression of what we want to do within this dataset as another:

  twitter <- within(data = twitter,
		    expr = {
			    date <- as.numeric(as.character(date))
			    time <- as.numeric(as.character(time))
			    text <- as.character(text)
			    link <- as.character(link)
			   })

The expression as.numeric(as.character(...)) in some of these assignments is due to issues when transforming factor() variables into numeric() variables directly, as mentioned before. First transforming them into a character() (string), which can then be transformed into a numeric() value without risk, is a little hack.
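
A small illustration of that pitfall (not part of the script):

  x <- factor(c("21", "9"))
  as.numeric(x)                # 1 2 - the internal level indices, not the values
  as.numeric(as.character(x))  # 21 9 - the actual values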

The dataset is now finished and contains every aspect we want to analyze later on. You can skip down to the Exporting-Section to read about how to export the data, so it can be used outside your current R-Session.


Fediverse

Authenticate

In the mastodon package, authentication works similarly to the twitteR package. You still need an account on any Mastodon instance you like, but you do not have to create API credentials on the website. Instead, it can all be handled from within R.

However, this comes with a different kind of complication: your login credentials have to be saved as plain text variables in your R session, and if you want to go the comfortable way of saving these in an “auth file”, as we did with Twitter, this comes with an additional risk.

You can mitigate that risk if you use encrypted storage space - which I would highly recommend either way. If you haven’t encrypted your entire hard drive, you may take a look at this wiki article about eCryptfs.

Either way, you have two ways of inserting your credentials into the R-session:

  1. via manual input. The R-Console will prompt you to enter the credentials by typing them in.
  2. via a plain text file with the saved credentials. This .txt file has a very specific structure which you have to follow. You can find an example file in the examples folder.

The first line of the credential file contains the labels. These have to be in the same order as the credentials themselves in the line below. The labels as well as the credentials are each separated by a single semicolon ;. As mentioned before, storing your login in plain text is a risk that you have to deal with somehow - ideally with encryption.

If we have loaded our login credentials into the variables mastodon_auth_insta, mastodon_auth_login and mastodon_auth_passw, we can request our API access token with the package’s login() function, which takes these three values as arguments. Again, the name of the function is very generic and may overlap with functions in other packages, so it is a good idea to prefix it with the package name and a double colon. This is the case for all functions in this package; I will not mention it again, but we should keep doing it regardless. We store the resulting list in the variable mastodon_auth:

  mastodon_auth <- mastodon::login(instance = mastodon_auth_insta,
				   user = mastodon_auth_login,
				   pass = mastodon_auth_passw)

Scraping Toots and Postings

Once we have successfully got our access token, we can start collecting postings containing our desired string. Contrary to Twitter, Mastodon does not allow searching for a string contained in posts; however, we can search for hashtags with the get_hashtag() function. This function needs four arguments:

  • our previously generated access token mastodon_auth
  • a string containing the hashtag we want to search for. In our case, ilovefs makes the most sense. You could however make the argument that we should also search for ilfs. Things like “#ilovefs18” or “#ilovefs2018” are covered either way
  • whether we want to only search on the local instance (the instance your account is registered on). Of course we set this one to FALSE, as we want to search the entire fediverse, including Mastodon-, GNUsocial- and Pump.io- instances
  • the maximum number of postings we want to collect. As with the twitteR package, we can set this to a very high number, but this may need some consideration in the future. Generally, the fediverse takes free software much more seriously than other social media platforms. Right now it is still fairly young, but as it gets older (and grows in users), the number of participants in the “I love Free Software Day” may rise quite dramatically. So you could try a lower number for this argument and take a look at the dates of the postings to get a feeling for how high this number should be

The result is saved to the variable mastodon_toot:

  mastodon_toot <- mastodon::get_hashtag(token = mastodon_auth,
				         hashtag = "ilovefs",
				         local = FALSE,
				         n = 100)

Stripping out data

Unfortunately, as of writing this script and documentation, the mastodon package itself has very poor documentation. For instance, there is no explanation of the variables in the list resulting from the get_hashtag() function. Because of the structure of this list() item, the values are not properly labeled either. With the help of the names() function from R’s base package, I could however identify all variables:

  names(mastodon_toot)

Additionally, the structure of the resulting list() item has a great advantage over the results of the twitteR package: it is very easy to extract the data, as it already has the same structure that we use ourselves, as illustrated below:

  mastodon_toot
    |
    |- ...
    |- created_at = "2018-01-22T10:44:53", "2018-01-22T10:45:10", ...
    |- ...
    |- visibility = "public", "public", ...
    |- language = "en", "en", ...
    |- uri = "tag:quitter.no,2018-01-22:noticeID=0000000000001:objectType=note", ...
    |- content = "<3 Opensource! #ilovefs", "FREE SOFTWARE!1eleven #ilovefs", ...
    |- url = "quitter.no/status/0000000000001", "quitter.no/status/0000000000002", ...
    |- reblogs_count = "9", "1", ...
    |- favourites_count = "53", "3", ...
    |- ...
    |- account [LIST]
    |     |- [LIST 1]
    |     |     |- ...
    |     |     |- username = "linux-beginner-for-a-day"
    |     |     '- ...
    |     |
    |     |- [LIST 2]
    |     |     |- ...
    |     |     |- username = "C12yp70_H4X012_1337-420"
    |     |     '- ...
    |     |
    |     '- ...
    |- media_attachments [LIST]
    |     |- [LIST 1]
    |     |     |- ...
    |     |     |- remote_url = "https://quitter.no/media/ilovefs-banner.png"
    |     |     '- ...
    |     |
    |     |- [LIST 2]
    |     |     |- ...
    |     |     |- remote_url = ""
    |     |     '- ...
    |     |
    |     '- ...
    '- ...

Because of this, we can oftentimes do a basic assignment, like this:

  mastodon_lang <- mastodon_toot[[8]]

However, in cases such as the time of the posting, we need to use sub(), gsub() and as.numeric() to extract the data we want (in this case, splitting time and date into separate, numeric variables). We do something similar for the uri variable in the list to extract the name of the instance, as sketched below.
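
A sketch of how the date/time split could look, assuming the created_at entry can be addressed by name (otherwise pick it by its numeric position, as done with mastodon_toot[[8]] above) and assuming the format shown in the illustration:

  # "2018-01-22T10:44:53" -> split at the "T" into a numeric date and a numeric time
  mastodon_timedate <- mastodon_toot$created_at

  mastodon_date <- as.numeric(gsub(pattern = "-", replace = "",
                                   x = sub(pattern = "T.*", replace = "", x = mastodon_timedate)))
  mastodon_time <- as.numeric(gsub(pattern = ":", replace = "",
                                   x = sub(pattern = ".*T", replace = "", x = mastodon_timedate)))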

URLs and hashtags are in HTML format in the posting text, so we need to get rid of this without removing anything else. If you do not understand the regex here, make sure to check out regexr.com:

  mastodon_txt <- gsub(pattern = "<.*?>", x = mastodon_toot[[10]], replacement = "")

Besides that, we should also try to identify bots, which are very common in the fediverse and post about things like “Trending Hashtags”. Of course, this is problematic for us, as it most definitely cannot be considered participation. We can sort out bots either by their account-id or by their name. I went for the name in this case, as there may be more “TrendingBots” scattered throughout the fediverse. For this, we need to go through each “lower list” containing the account information and note down which ones are bots and which are not. If we identify a poster as a bot, we give the variable mastodon_bot the value TRUE for this position and FALSE if it is not a bot. Just like when extracting information from the lower list() items in the twitteR package, we first need to create an empty vector() item:

  mastodon_bot <- c()

Next, it will be filled with the help of a for-loop, which has to count up from 1 to the length of the mastodon_pers list() item (the account information from mastodon_toot):

  for(i in 1:length(mastodon_pers)){
    ...
  }

Within this for-loop, we need to check whether or not that account is a bot. As described above, for the sake of simplicity and because the only bot that comes to mind is the “TrendingBot”, we do it with a simple if-statement:

  if(mastodon_pers[[i]]$username == "TrendingBot"){
    ...
  } else {
    ...
  }

Note: you can check for multiple bot names at once, either by extending the condition with a logical “or”, or by matching against a regex that combines the names with “|”, as sketched below.
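
A sketch of the regex variant (the second bot name is only a made-up example):

  if(grepl(pattern = "TrendingBot|AnotherBot", x = mastodon_pers[[i]]$username)){
    mastodon_bot[i] <- TRUE
  } else {
    mastodon_bot[i] <- FALSE
  }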

As mentioned above, if the statement is true, we set the mastodon_bot variable at this position to TRUE, and to FALSE if it is not.

All put together, we have:

  mastodon_bot <- c()
  for(i in 1:length(mastodon_pers)){
    if(mastodon_pers[[i]]$username == "TrendingBot"){
      mastodon_bot[i] <- TRUE
    } else {
      mastodon_bot[i] <- FALSE
    }
  }

Creating the finished dataset

Even once we have scraped all the information, we are still dealing with “dirty” data here. We have already identified bots, but haven’t removed them yet. We also didn’t set a date range within which we want to collect data. Additionally, we should sort out “private” postings, as we want to publish our data and should not leak the thoughts of someone who clearly does not want them to be public. However, it is to be expected that there is hardly anyone who

  • a) white-listed your account to see their private postings
  • b) posts about #ilovefs in a private post

Nonetheless, we should keep it in mind.

To identify posts to be excluded, we can simply use the which() function in conjunction with a condition for each attribute and bind the results together with the c() (or “combine”) function. Here we can include the previously identified bots, as well as the condition that the “date” has to be lower than (i.e. before) a certain numeric value in the form “YYYYMMDD”. Lastly, we exclude everything that is not marked as “public”:

  mastodon_exclude <- c(which(mastodon_bot),
		        which(mastodon_date < 20180101),
		        which(mastodon_priv != "public"))

Before we create the data.frame() item, we can drop all mastodon_ prefixes from the variables, as the name of the dataset itself already makes clear what the source of the data is. We also strip out the posts we do not want in there, whose positions are listed in the mastodon_exclude variable:

  date <- mastodon_date[-mastodon_exclude]
  time <- mastodon_time[-mastodon_exclude]
  lang <- mastodon_lang[-mastodon_exclude]
  inst <- mastodon_insta[-mastodon_exclude]
  text <- mastodon_txt[-mastodon_exclude]
  link <- mastodon_url[-mastodon_exclude]
  favs <- mastodon_fav[-mastodon_exclude]
  imag <- mastodon_img[-mastodon_exclude]

As before with the Twitter data, we combine these newly created variables into a data.frame() item by first turning them into a matrix - binding the vectors as columns with cbind() - and then turning that into the finished dataset called mastodon with data.frame():

  mastodon <- data.frame(cbind(date, time, lang, inst, text, link, favs, imag))

As this usually re-defines the variables as factor(), we use within() again to give them the correct mode:

  mastodon <- within(data = mastodon,
                     expr = {
                             date <- as.numeric(as.character(date))
                             time <- as.numeric(as.character(time))
                             text <- as.character(text)
                             link <- as.character(link)
                            })

The dataset can now be exported. Skip down to the Exporting-Section to learn how.


Exporting Datasets

There are several reasons why we want to export our data:

  1. to keep a backup / an archive. As we have seen in the Twitter-Section, the social media sites do not always enable us to collect a full back-log of what has been posted in the past. If we want to analyze our data at a later point in time, or if we want to compare several points in time with one another, it makes sense to have an archive and preferably a backup to prevent data loss
  2. to use the data outside your current R-session. The variables only live for as long as your R-session is running. As soon as you close it, all is gone (except if you agree to save to an image, which actually does the very same thing we are doing here). So it makes sense to export the data, which can then be imported and worked with again later.
  3. to enable other people to analyze and work with the data. Obviously, this is an important one for us. We want to share our results and the data we used, so other people can learn from it and so our analysis is transparent.

In order to fully enable anyone to use the data, whatever software they are using, we export in three common and easily readable formats: .RData, .csv and .txt. The latter is the simplest one and can be read by literally any text editor. Each string in there is enclosed by quotes " and separated by a single space in a table layout. The .csv format is very similar, though the separation is done with a symbol - in this case a comma ,. This format is not only readable by all text editors (because it is pure text), it can also be read by spreadsheet applications like LibreOffice Calc. The disadvantage of both formats is that they can only hold items with the same “labels”, so we need to create multiple export files, one per data source. Also, when importing, you often have to redefine each variable’s mode again.

Lastly, we also export as .RData, R’s very own format. Since R is free software, I would suspect that most statistics software can read this format, but I do not actually know that for a fact. However, it certainly is the easiest to work with in R, as you can include as many variables and datasets as you want and the modes of each variable stay intact. .RData is a binary format and cannot be read by text editors or non-specialized software.
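
As a sketch of how the exported files could be read back in during a later session (the file names and timestamps here are only examples):

  # restores the "twitter" and "mastodon" datasets with their modes intact
  load(file = "./data/ilovefs-all_2018-02-20_12-00-00.RData")

  # re-importing the text-based exports; the modes may need to be re-set afterwards
  twitter_txt <- read.table(file = "./data/ilovefs-twitter_2018-02-20_12-00-00.txt", header = TRUE)
  twitter_csv <- read.csv(file = "./data/ilovefs-twitter_2018-02-20_12-00-00.csv")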

In order to have an easily navigable archive, we should not only label the output files with the source of the data, but also with the date when they were collected. For this, we first need the current time/date, which R provides with the Sys.time() function. We want to bring it into a format suitable for file names, like “YYYY-MM-DD_HH-MM-SS”, which we can do with sub() and gsub() respectively:

  time_of_saving <- sub(x = Sys.time(), pattern = " CET", replace = "")
  time_of_saving <- sub(x = time_of_saving, pattern = " ", replace = "_")
  time_of_saving <- gsub(x = time_of_saving, pattern = ":", replace = "-")

Next, we construct the save path we want the data to be exported to, for which we can use paste0(). For example, to save the .RData file, we want to export into the data/ folder, into the file ilovefs-all_YYYY-MM-DD_HH-MM-SS.RData:

  save_path <- paste0("./data/ilovefs-all_", time_of_saving, ".RData")

Note: using paste() instead of paste0() would insert a space between the strings, which we do not want here.
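
A quick illustration (not part of the script):

  paste("a", "b")   # "a b" - separated by a space
  paste0("a", "b")  # "ab"  - no separator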

We follow a similar approach for the individual .txt files, also adding the name of the source into the file name (as they will only hold one data source each). For example:

  save_path_twitter_t <- paste0("./data/ilovefs-twitter_", time_of_saving, ".txt")

Lastly, we need to actually export the data, which we can do with:

  save()        # for .RData
  write.table() # for .txt
  write.csv()   # for .csv

All three functions take the data as an argument, as well as the previously defined file path. In the case of save(), where we export multiple datasets, their names need to be collected in a vector() item with the c() function first:


  save(list = c("twitter", "mastodon"), file = save_path)

  write.table(mastodon, file = save_path_fed_t)

  write.csv(twitter, file = save_path_twitter_c)

Once this is done, we can safely close our R-session, as we have just archived all data for later use and for other people to join in!