Collecting, Analyzing and Presenting data about the participation in #ilovefs day
# Documentation: [collecto.R](../collecto.R)
## Table of Contents
* [General information about the Script](#the-script)
* [Packages used and the Package section](#packages)
* [The twitteR package](#the-twitter-package)
* [The mastodon package](#the-mastodon-package)
* [The RedditExtractoR package](#the-redditextractor-package)
* [Collecting from Twitter](#twitter)
* [Collecting from the Fediverse](#fediverse)
* [Collecting from Reddit](#reddit)
* [Exporting the Datasets](#exporting-datasets)
* * *
## The Script
The R script documented here has a modular structure. It contains two sections
that handle loading the packages necessary for the process and exporting the
aggregated data into usable formats at the end. The remaining sections each
handle one specific data source (e.g. Twitter, Mastodon, Reddit). While the
[Package-Section](#packages) is obviously necessary for the remaining sections
(depending on which ones you actually want to use), as is the
[Export-Section](#exporting-datasets) for actually using the data in other
applications, scripts or by other people, you can cherry-pick among the
data-source sections ([Twitter](#twitter), [Fediverse](#fediverse),
[Reddit](#reddit)). These can be used independently of each other and in no
particular order. Keep that in mind if you only want to analyze a single
source.
As a side note, the script is written to keep the collected data as anonymous
as possible. However, because we deal with a rather small sample and because of
the nature of social media, it is in most cases still possible to track down
each specific user in the resulting data. As we only access public postings, it
is safe to assume that people want their posts to be seen anyway, so this is not
as problematic as it may seem. Nevertheless, we should still treat the data with
care and avoid leaking metadata wherever possible.
* * *
## Packages
As of writing this script and its documentation, three scraper-packages are
being used:
* [twitteR](https://cran.r-project.org/package=twitteR) (Version 1.1.9)
* [mastodon](https://github.com/ThomasChln/mastodon) (Commit [a6815b6](https://github.com/ThomasChln/mastodon/commit/a6815b6fb626960ffa02bd407b8f05d84bd0f549))
* [RedditExtractoR](https://cran.r-project.org/package=RedditExtractoR) (Version 2.0.2)
### The twitteR package
twitteR has a rather extensive
[documentation](https://cran.r-project.org/web/packages/twitteR/twitteR.pdf) as
well as "in-R-manuals". Simply enter `??twitteR` into the R console or look up
a specific function with `?function`, replacing `function` with its actual name.
twitteR has several useful functions to scrape Twitter data, most of which
however apply to the Twitter account in use - which in our case is not
necessary. The [Twitter-Section](#twitter) uses only three functions, which will
be discussed individually later on:
```
setup_twitter_oauth() # authentication
searchTwitter()       # searching Twitter for a particular string
strip_retweets()      # exclude retweets from the results
```
As a side note: I had to install the
[httr-package](https://cran.r-project.org/web/packages/httr/index.html) - a
dependency of twitteR - from the repositories of my distribution of choice, as
the one provided by CRAN would not compile for some reason. So if you run into a
similar issue, look for something like `r-cran-httr` in your package manager.
### The mastodon package
The good thing about Mastodon is that searches are not restricted to a single
Mastodon instance or even to Mastodon at all. If your instance has enough
outbound connections (so make sure you choose a very active and
inter-communicative one), you are able to search not only Mastodon instances,
but also GNUsocial, Pump.io and other compatible social media instances.
Luckily, this also applies to the mastodon package. Unfortunately, mastodon for
R is documented
[very poorly, if at all](https://github.com/ThomasChln/mastodon/blob/a6815b6fb626960ffa02bd407b8f05d84bd0f549/README.md).
This puts us in the uncomfortable position of having to figure out what the
outputs of each function actually mean. Those are not properly labeled either,
so this is a task of trial and error and a lot of guessing. If you have time and
dedication, feel free to document it properly and open a pull request on the
[project's Github page](https://github.com/ThomasChln/mastodon). The relevant
results that we use in our script are listed in the
[Fediverse-Section](#fediverse) of this documentation. The function names in
this package are very generic, so it is a good idea to prefix them with
`mastodon::` to prevent accidentally using a function of the same name from
another package (e.g. `login()` becomes `mastodon::login()`).
From the long list of functions in this package, we only need two for our
analysis:
```
login()       # authentication / generating an auth-token
get_hashtag() # search the fediverse for posts that include a specific hashtag
```
Note:
as this package is not hosted on CRAN but on Github, you cannot install it with
`install.packages()` like the other packages. The easiest way is to install it
with `install_github()` from the `devtools` package. In order to use
`install_github()` without loading the library (as we only need it this one
time), you can prefix it with its package name.
Installing and loading the mastodon package would look like this:
```
install.packages("devtools")
devtools::install_github(repo = "ThomasChln/mastodon")
library("mastodon")
```
Also note that `devtools` requires the development files of *libssl* to be
installed on your machine.
### The RedditExtractoR package
RedditExtractoR has a rather extensive
[documentation](https://cran.r-project.org/web/packages/RedditExtractoR/RedditExtractoR.pdf)
but no general "in-R-manual". You can however look up a specific function within
the package by entering `?function` into the R prompt, replacing `function` with
its actual name. RedditExtractoR has several useful functions to scrape
Reddit posts or create fancy graphs from them. In our case, we only need two
very basic functions that will be discussed in the [Reddit-Section](#reddit)
later on:
```
reddit_urls()    # searching Reddit for a particular string
reddit_content() # scrape data of an individual post
```
You may have noticed that there is no "authenticate" command within this
package. As of now, the Reddit API does not require authentication, as all posts
are for general consumption anyway. This may or may not change in the future,
so keep an eye on this.
* * *
## Twitter
### Authenticate
As the package in use here needs access to the Twitter API, what we first need
are the "Consumer Key", "Consumer Secret", "Access Token" and "Access Token
Secret", all of which you can order from
[apps.twitter.com](https://apps.twitter.com/). Of course, you need a
Twitter account for this (staff may ask for the FSFE's account).
The authentication can be done in two ways:
1. via manual input. The R console will prompt you to enter the credentials by
   typing them in.
2. via a plain text file with the saved credentials. This `.txt` file has a very
   specific structure which you have to follow. You can find an example file in
   the examples folder.
The first line of the credential file contains the *labels*. These have to be in
the same order as the *credentials* themselves on the line below. The *labels*
as well as the *credentials* are each separated by a single semicolon `;`.
Storing the credentials in plain text is surely not optimal, but it is the
easiest way to get the information into our R session. This should not be too
critical if your disk is encrypted.
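As a minimal sketch of how such a file could be read into the four variables
used below (the file name and the label names here are assumptions and may
differ from the example file shipped with the script):
```
# Sketch only: file name and column labels are assumptions
twitter_credentials <- read.table(file = "twitter_credentials.txt",
                                  header = TRUE,
                                  sep = ";",
                                  colClasses = "character")
twitter_consumerkey <- twitter_credentials$consumerkey
twitter_consumerpri <- twitter_credentials$consumerprivate
twitter_tokenaccess <- twitter_credentials$accesstoken
twitter_tokensecret <- twitter_credentials$accesssecret
```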
Next, we order our oauth token with `setup_twitter_oauth()`. This function is a
wrapper for httr, which will also store this token in a local file, so make sure
to **not leak it by making that file public**. The oauth token can not only be
used to scrape information from Twitter, it also grants write access, so it can
be used to manipulate the affiliated Twitter account or interact with Twitter in
any other way.
The function used to authenticate takes all four of our credential keys as
arguments, which in this script are stored in the `twitter_consumerkey
twitter_consumerpri twitter_tokenaccess twitter_tokensecret` variables:
```
setup_twitter_oauth(consumer_key    = twitter_consumerkey,
                    consumer_secret = twitter_consumerpri,
                    access_token    = twitter_tokenaccess,
                    access_secret   = twitter_tokensecret)
```
### Scraping Tweets
Once we have an oauth token, we can start looking for the tweets we want to
collect. For this we use the `searchTwitter()` function. All functions in the
`twitteR` package access the file created by the auth-function mentioned before,
so there is no need to pass the token as an argument. The arguments we do need
are:
* the string to search for, in this case `ilovefs`. This will not only match
  strings like "ilovefs18", "ilovefs2018", "ILoveFS", etc. but also hashtags
  like "#ilovefs"
* the date from which on we want to search. It is worth noting that the API is
  limited in that it can only go back a few months, so if you want to look for
  results from a year ago, you are out of luck. This date has to be in the form
  "YYYY-MM-DD". For our purpose, it makes sense to set it to either
  `2018-01-01` or `2018-02-01` to also catch people promoting the campaign
  in advance
* the date until which we want to search. This also has to be in the form
  "YYYY-MM-DD". This argument usually only makes sense if you analyze events in
  the past. For our purpose, we can set it either to the present or a future
  date
* the maximum number of tweets to be aggregated. This number is only useful for
  search terms that get a lot of coverage on Twitter (e.g. trending hashtags).
  For our purpose, we can safely set it to a number that is much higher than the
  anticipated participation in the campaign, like `9999999999`, so we get ALL
  tweets containing our specified string
* the order-type for the search. Again, this only makes sense for searches where
  we do not want each and every single tweet. In our case, set it to anything,
  for example `recent`
We save the result of this command in the variable `twitter_tw_dirty`. The
*dirty* stands for an "unclean" result that still contains retweets. The
resulting code is:
```
twitter_tw_dirty <- searchTwitter(search     = "ilovefs",
                                  since      = "2018-01-01",
                                  until      = "2018-12-31",
                                  n          = 999999999,
                                  resultType = "recent")
```
The next step is to clean this data and remove retweets (they are listed in the
"dirty" data as normal tweets as well), as those are not necessary for our use.
We can still extract the number of retweets of each posting later on; who
retweeted it is not important. We provide three arguments to the function
`strip_retweets()`:
* the `list()` item containing our scraped tweets. As shown above, we saved this
  to the variable `twitter_tw_dirty`
* whether we also want to remove "manual retweets", meaning someone literally
  copy-and-pasting the text of a tweet. This is up for debate, but personally I
  would say that these should be kept in, as this is what a lot of "share this
  site" buttons on websites do. This is still participation and should thus be
  included in the results
* whether we want to remove "modified tweets", which *probably* means "quoted"
  ones? Either way, if in doubt we want to keep them in. We can still remove
  them if we later find out they are in fact retweets.
The result is saved to the variable `twitter_tw`, now containing only clean
data:
```
twitter_tw <- strip_retweets(tweets       = twitter_tw_dirty,
                             strip_manual = FALSE,
                             strip_mt     = FALSE)
```
### Stripping out data
The `list()` item resulting from the `searchTwitter()` function has a logical
but rather inconvenient structure. The `list()` contains a lower `list()` for
each tweet scraped. Those lower `list()` items contain variables for each
property, as shown by the illustration below:
```
twitter_tw
|
|- [LIST 1]
| |- text = "This is my tweet about #ilovefs https://fsfe.org"
| |- ...
| |- favoriteCount = 21
| |- ...
| |- created = "2018-02-14 13:52:59"
| |- ...
| |- statusSource = "<a href='/download/android'>Twitter for Android</a>"
| |- screenName = "fsfe"
| |- retweetCount = 9
| |- ...
| |- urls [LIST]
| | |- expanded = "https://fsfe.org"
| | '- ...
| '- ...
|
|- [LIST 2]
| |- ...
| '- ...
|
'- ...
```
The inconvenience of this structure stems from the fact that we need to use a
for-loop to run through each lower `list()` item and extract its variables
individually.
For the sake of keeping this short, this documentation only explains the
extraction of a single value, namely the client used to post a tweet. All
other information is extracted in a very similar fashion.
Firstly, we create a new, empty `vector()` item called `twitter_client` with the
"combine" command (or `c()` for short). Usually you do not have to pre-define
empty vectors in R, as they are created automatically when you assign them a
value, as we have done multiple times before. You only need to pre-define one if
you want to address a specific *location* in that vector, say skipping the first
value and filling in the second. We do it like this here, as we want the
resulting `vector()` item to have the same order as the `list()`:
```
twitter_client <- c()
```
The for-loop has to count up from 1 to the length of the `list()` item. So if we
scraped four tweets, the for-loop has to count `1 2 3 4`:
```
for(i in c(1:length(twitter_tw))){
  ...
}
```
Next, we check if the desired variable in the lower `list()` item is set. R does
not have a specific way of checking whether a variable is set or not; however,
if a variable exists but is empty, its length is zero. Thus if we want to check
whether a variable is set, we can simply check its length. In particular, here
we check if the vector `statusSource` within the `i`-th lower list of
`twitter_tw` has a length greater than zero:
```
if(length(twitter_tw[[i]]$statusSource) > 0){
  ...
} else {
  ...
}
```
Finally, we can extract the value we are after - the `statusSource` vector. We
assign it to the `i`-th position in the previously defined `vector()` item
`twitter_client`, if the previously mentioned if-statement is true. As a little
*hack* here, we **specifically** assign it as a character item with the
`as.character()` function. This may not always be necessary, but sometimes wrong
values will be assigned if the source variable is a `factor()`; I won't go
in-depth on that matter here. Just a word of caution: **always check your
variables before continuing**. If the if-statement above is false, we instead
assign `NA`, meaning "Not Available":
```
twitter_client[i] <- as.character(twitter_tw[[i]]$statusSource)
...
twitter_client[i] <- NA
```
Sometimes, as is the case with `twitter_client`, the extracted string contains
things that we do not need or want. So we use regex to get rid of them.
*If you are not familiar with regex, I highly recommend
[regexr.com](https://regexr.com/) to learn how to use it. It also contains a
nifty cheat-sheet.*
Official Twitter clients include the download URL besides the name of the
client. It is safe to assume that most other clients do the same, so we can
clean up the string with two simple `sub()` commands (meaning "substitute"). As
arguments, we give it the pattern it should substitute, the replacement string
(in our case, this string is empty) and the string that this should happen to -
here `twitter_client`. We assign both to the same variable again, overriding its
previous value:
```
twitter_client <- sub(pattern = ".*\">", replace = "", x = twitter_client)
twitter_client <- sub(pattern = "</a>", replace = "", x = twitter_client)
```
Putting it all together, this looks similar to this:
```
twitter_client <- c()
for(i in 1:length(twitter_tw)){
  if(length(twitter_tw[[i]]$statusSource) > 0){
    twitter_client[i] <- as.character(twitter_tw[[i]]$statusSource)
  } else {
    twitter_client[i] <- NA
  }
}
twitter_client <- sub(pattern = ".*\">", replace = "", x = twitter_client)
twitter_client <- sub(pattern = "</a>", replace = "", x = twitter_client)
```
All other values are handled in a similar fashion. Some of them need some
smaller fixes afterwards, just like the removal of URLs in `twitter_client`.
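For illustration, here is a sketch of how the favorite and retweet counts
(used later as `twitter_fav` and `twitter_rts`) could be extracted with the same
pattern; the actual script may structure this slightly differently:
```
# Sketch: extract favorite and retweet counts, following the structure
# illustrated above
twitter_fav <- c()
twitter_rts <- c()
for(i in 1:length(twitter_tw)){
  if(length(twitter_tw[[i]]$favoriteCount) > 0){
    twitter_fav[i] <- as.numeric(twitter_tw[[i]]$favoriteCount)
  } else {
    twitter_fav[i] <- NA
  }
  if(length(twitter_tw[[i]]$retweetCount) > 0){
    twitter_rts[i] <- as.numeric(twitter_tw[[i]]$retweetCount)
  } else {
    twitter_rts[i] <- NA
  }
}
```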
### Creating the finished dataset
After we have scraped all desired tweets and extracted the relevant information
from them, it makes sense to combine the individual variables into a dataset,
which can be easily handled, exported and reused. It also makes sense to have
relatively short variable names within such a dataset. During the data
collection process, we used a `twitter_` prefix in front of each variable, so we
could be sure to use the correct variables, all coming from our Twitter scraper.
We do not need this prefix within a `data.frame()` item, as its name alone
already eliminates the risk of using the wrong variables.
Additionally, we still need to split up the `twitter_timedate` variable, which
currently contains the point in time of the tweet in the form
`YYYY-MM-DD HH:MM:SS`. For this, we again use regex and the function `sub()`.
As `sub()` only replaces the first instance of the pattern given to it, if we
have multiple occurrences of a given pattern, we need to use `gsub()` (for
global substitution).
We also give some of the variables a new "mode", for example transferring them
from a `character()` item (a string) over to a `factor()` item, making them an
ordinal or nominal variable. This makes sense especially for the number of
retweets and favorites.
The results are seven discrete variables, which in a second step can be combined
into a `data.frame()` item:
```
time <- sub(pattern = ".* ", x = twitter_timedate, replace = "")
time <- as.numeric(gsub(pattern = ":", x = time, replace = ""))
date <- sub(pattern = " .*", x = twitter_timedate, replace = "")
date <- as.numeric(gsub(pattern = "-", x = date, replace = ""))
retw <- as.factor(twitter_rts)
favs <- as.factor(twitter_fav)
link <- as.character(twitter_url)
text <- as.character(twitter_txt)
clnt <- as.character(twitter_client)
```
When combining these variables into a `data.frame()`, we first need to create
a matrix from them by *binding* these variables as columns of said matrix with
the `cbind()` command. The result can be passed to the `data.frame()` function
to create such an item. We label this dataset `twitter`, making it clear what
source of data we are dealing with:
```
twitter <- data.frame(cbind(date, time, retw, favs, text, link, clnt))
```
Often during that process, all variables within the `data.frame()` item are
transformed into `factor()` variables, which is not what we want for most of
them.
Usually, when working with variables within a `data.frame()`, you have to prefix
the variable with the name of the `data.frame` and a dollar sign, meaning that
you want to access that variable **within** that `data.frame()`. This would
make the process of changing the mode quite tedious for each variable:
```
twitter$date <- as.numeric(as.character(twitter$date))
```
Instead, we can use the `within()` function, using the `twitter` dataset as one
argument and the expression of what we want to do *within* this dataset as
another:
```
twitter <- within(data = twitter,
                  expr = {
                    date <- as.numeric(as.character(date))
                    time <- as.numeric(as.character(time))
                    text <- as.character(text)
                    link <- as.character(link)
                  })
```
The `as.numeric(as.character(...))` expression in some of these assignments is
due to the issues with transforming `factor()` variables into `numeric()`
variables directly, as mentioned before. First transforming them into a
`character()` (string), which can then be transformed into a `numeric()` value
without risk, is a little *hack*.
The dataset is now finished and contains every aspect we want to analyze later
on. You can skip down to the [Exporting-Section](#exporting-datasets) to read
about how to export the data, so it can be used outside your current R session.
* * *
## Fediverse
### Authenticate
In the mastodon package, authentication works similarly to the twitteR package.
You still need an account on a Mastodon instance of your choice, but you do not
have to create API credentials on the website. Instead, it can all be handled
from within R.
However, this comes with a different kind of complication: your login
credentials have to be saved as plain text variables in your R session, and if
you want to go the comfortable way of saving these in an "auth file", as we did
with Twitter, this comes with an additional risk.
You can mitigate that risk if you use an encrypted storage space - which I
would highly recommend either way. If you haven't encrypted your entire
hard drive, you may take a look at this wiki article about
[eCryptfs](https://help.ubuntu.com/community/EncryptedPrivateDirectory).
Either way, you have two ways of getting your credentials into the R session:
1. via manual input. The R console will prompt you to enter the credentials by
   typing them in.
2. via a plain text file with the saved credentials. This `.txt` file has a very
   specific structure which you have to follow. You can find an example file in
   the examples folder.
The first line of the credential file contains the *labels*. These have to be in
the same order as the *credentials* themselves on the line below. The *labels*
as well as the *credentials* are each separated by a single semicolon `;`. As
mentioned before, **storing your login as plain text is a risk that you have to
deal with somehow**, ideally with encryption.
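As with Twitter, here is a minimal sketch of reading such a file (the file name
and label names are assumptions, not necessarily those of the example file):
```
# Sketch only: file name and column labels are assumptions
mastodon_credentials <- read.table(file = "mastodon_credentials.txt",
                                   header = TRUE,
                                   sep = ";",
                                   colClasses = "character")
mastodon_auth_insta <- mastodon_credentials$instance
mastodon_auth_login <- mastodon_credentials$login
mastodon_auth_passw <- mastodon_credentials$password
```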
If we loaded our login credentials into the variables
`mastodon_auth_insta mastodon_auth_login mastodon_auth_passw`, we can *order*
our API access token with the package's `login()` function, which takes these
three values as arguments. Again, the name of the function is very generic and
may overlap with functions in other packages, so it is a good idea to prefix it
with the package name and a double colon. This is the case for all functions in
this package, so I will not mention it again, but we should keep doing it
regardless. We store the resulting list in the variable `mastodon_auth`:
```
mastodon_auth <- mastodon::login(instance = mastodon_auth_insta,
                                 user     = mastodon_auth_login,
                                 pass     = mastodon_auth_passw)
```
### Scraping Toots and Postings
Once we have successfully got our access token, we can start collecting postings
containing our desired string. Contrary to Twitter, Mastodon does not allow
searching for an arbitrary string contained in posts; however, we can search for
hashtags with the `get_hashtag()` function. This one needs four arguments:
* our previously generated access token `mastodon_auth`
* a string containing the hashtag we want to search for. In our case, `ilovefs`
  makes the most sense. You can however make the argument that we should
  **also** search for `ilfs`. Things like "#ilovefs18" or "#ilovefs2018"
  *should* be covered, however
* whether we want to only search on the local instance (the instance your
  account is registered on). Of course we set this one to `FALSE`, as we want to
  search the entire fediverse, including Mastodon, GNUsocial and
  Pump.io instances
* the maximum number of postings we want to collect. As in the `twitteR`
  package, we can set this to a very high number, but this may need some
  consideration in the future. Generally, the fediverse is much more serious
  about free software than other social media. Right now, it is still
  fairly young, but as it gets older (and grows in users), the number of
  participants in the "I love Free Software Day" may rise quite dramatically. So
  you could try out a lower number for this argument and take a look at the
  dates of the postings to get a feeling for how high this number should be
The result is saved to the variable `mastodon_toot`:
```
mastodon_toot <- mastodon::get_hashtag(token   = mastodon_auth,
                                       hashtag = "ilovefs",
                                       local   = FALSE,
                                       n       = 100)
```
### Stripping out data
Unfortunately, as of writing this script and documentation, the `mastodon`
package itself has very poor documentation. For instance, there is no
explanation of the variables in the list returned by the `get_hashtag()`
function. Because of the structure of this `list()` item, there are no labels
either. With the help of the `names()` function from R's base package, I could
however identify all variables:
```
names(mastodon_toot)
```
Additionally, the structure of the resulting `list()` item has a great advantage
over the results in the `twitteR` package: it is very easy to extract the data,
as it already has the same structure that we use as well, as illustrated below:
```
mastodon_toot
|
|- ...
|- created_at = "2018-01-22T10:44:53", "2018-01-22T10:45:10", ...
|- ...
|- visibility = "public", "public", ...
|- language = "en", "en", ...
|- uri = "tag:quitter.no,2018-01-22:noticeID=0000000000001:objectType=note", ...
|- content = "<3 Opensource! #ilovefs", "FREE SOFTWARE!1eleven #ilovefs", ...
|- url = "quitter.no/status/0000000000001", "quitter.no/status/0000000000002", ...
|- reblogs_count = "9", "1", ...
|- favourites_count = "53", "3", ...
|- ...
|- account [LIST]
| |- [LIST 1]
| | |- ...
| | |- username = "linux-beginner-for-a-day"
| | '- ...
| |
| |- [LIST 2]
| | |- ...
| | |- username = "C12yp70_H4X012_1337-420"
| | '- ...
| |
| '- ...
|- media_attachments [LIST]
| |- [LIST 1]
| | |- ...
| | |- remote_url = "https://quitter.no/media/ilovefs-banner.png"
| | '- ...
| |
| |- [LIST 2]
| | |- ...
| | |- remote_url = ""
| | '- ...
| |
| '- ...
'- ...
```
Because of this, we can often simply use a basic assignment, like this:
```
mastodon_lang <- mastodon_toot[[8]]
```
However, in some cases, such as the time of the posting, we need to use `sub()`,
`gsub()` and `as.numeric()` to extract the data we want (in this case, splitting
time and date into separate, numeric variables). We do something similar for the
`uri` variable in the list to extract the name of the instance.
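A sketch of what this could look like; addressing the elements by name is an
assumption here (a numeric index, as in the `mastodon_lang` example above, may
be needed instead), and the `uri` handling assumes the format shown in the
illustration:
```
# Split the ISO timestamps ("2018-01-22T10:44:53") into numeric date and time
mastodon_timedate <- mastodon_toot$created_at
mastodon_date <- as.numeric(gsub(pattern = "-", replacement = "",
                                 x = sub(pattern = "T.*", replacement = "",
                                         x = mastodon_timedate)))
mastodon_time <- as.numeric(gsub(pattern = ":", replacement = "",
                                 x = sub(pattern = ".*T", replacement = "",
                                         x = mastodon_timedate)))
# Extract the instance name from the uri field ("tag:quitter.no,...")
mastodon_insta <- sub(pattern = ",.*", replacement = "",
                      x = sub(pattern = "tag:", replacement = "",
                              x = mastodon_toot$uri))
```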
URLs and hashtags are embedded as HTML in the posting text, so we need to get
rid of this without removing anything else from it. If you do not understand the
regex here, make sure to check out [regexr.com](https://regexr.com/):
```
mastodon_txt <- gsub(pattern = "<.*?>", x = mastodon_toot[[10]], replacement = "")
```
Besides that, we should also try to identify bots, which are very common in the
fediverse and post about things like "trending hashtags". Of course, this is
problematic for us, as such postings most definitely cannot be considered
participation. We can sort bots out either by their account ID or by their name.
I went for the name in this case, as there may be more "TrendingBots" scattered
throughout the fediverse. For this, we need to go through each "lower list"
containing the account information and note down which accounts are bots and
which are not.
If we identify a poster as a bot, we give the variable `mastodon_bot` the value
`TRUE` at this position and `FALSE` if it is not a bot. Just like when
extracting information from the lower `list()` items in the `twitteR` package,
we first need to create an empty `vector()` item:
```
mastodon_bot <- c()
```
Next, it will be filled with the help of a for-loop. It has to count up from 1
to the length of the `mastodon_pers` `list()` item (the account information
extracted from the scraped toots):
```
for(i in 1:length(mastodon_pers)){
  ...
}
```
Within this for-loop, we need to check whether or not an account is a bot. As
described above, for the sake of simplicity and because the only bot that comes
to mind is the "TrendingBot", we do it with a simple if-statement:
```
if(mastodon_pers[[i]]$username == "TrendingBot"){
  ...
} else {
  ...
}
```
*Note: you can check for multiple bot names by adding "|" (or), followed by
another bot name, to the statement.*
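For example, the extended condition could look like this ("AnotherBot" is just a
hypothetical placeholder, not a real account):
```
if(mastodon_pers[[i]]$username == "TrendingBot" |
   mastodon_pers[[i]]$username == "AnotherBot"){  # "AnotherBot" is a placeholder
  ...
}
```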
As mentioned above, if the statement is true, we set the `mastodon_bot` variable
at this position to `TRUE`, and to `FALSE` if it is not.
All put together, we have:
```
mastodon_bot <- c()
for(i in 1:length(mastodon_pers)){
  if(mastodon_pers[[i]]$username == "TrendingBot"){
    mastodon_bot[i] <- TRUE
  } else {
    mastodon_bot[i] <- FALSE
  }
}
```
### Creating the finished dataset
Even after scraping all this information, we are still dealing with "dirty" data
here. We have already identified bots, but haven't removed them yet. We also
haven't set a date range within which we want to collect data. Additionally, we
should sort out "private" postings, as we want to publish our data and should
not leak the thoughts of someone who clearly does not want them to be public.
However, it is to be expected that there is hardly anyone who
* a) white-listed your account to see their private postings
* b) posts about #ilovefs in a private post
We should keep it in mind regardless.
To identify posts to be excluded, we can simply use the `which()` function in
conjunction with a condition for each attribute and bind the results together
with the `c()` (or "combine") function. Here we include the previously
identified bots and the condition that the "date" has to be lower than (before)
a certain numeric value in the form "YYYYMMDD". Lastly, we exclude everything
that is not marked as "public":
```
mastodon_exclude <- c(which(mastodon_bot),
                      which(mastodon_date < 20180101),
                      which(mastodon_priv != "public"))
```
Before we create the `data.frame()` item, we can drop the `mastodon_` prefix
from the variables, as the name of the dataset itself already makes clear what
the source of the data is. We also strip out the posts we don't want in there,
whose positions are listed in the `mastodon_exclude` variable:
```
date <- mastodon_date[-mastodon_exclude]
time <- mastodon_time[-mastodon_exclude]
lang <- mastodon_lang[-mastodon_exclude]
inst <- mastodon_insta[-mastodon_exclude]
text <- mastodon_txt[-mastodon_exclude]
link <- mastodon_url[-mastodon_exclude]
favs <- mastodon_fav[-mastodon_exclude]
imag <- mastodon_img[-mastodon_exclude]
```
As before with the Twitter data, we combine these newly created variables into
a `data.frame()` item by first turning them into a matrix, binding the vectors
as columns with `cbind()`, and then turning that into the finished dataset
called `mastodon` with `data.frame()`:
```
mastodon <- data.frame(cbind(date, time, lang, inst, text, link, favs, imag))
```
As this usually re-defines the variables as `factor()`, we use `within()`
again to give them the correct mode:
```
mastodon <- within(data = mastodon, expr = {
                     date <- as.numeric(as.character(date))
                     time <- as.numeric(as.character(time))
                     text <- as.character(text)
                     link <- as.character(link)
                   })
```
The dataset can now be exported. Skip down to the
[Exporting-Section](#exporting-datasets) to learn how.
* * *
## Reddit
*RedditExtractoR (or actually Reddit) doesn't currently require you to
authenticate. So you can get right into scraping!*
### Scraping Posts
There are multiple ways of searching for a certain string. Optionally, you can
determine which subreddit you want to search in. In most cases, it makes sense
to search in all of them. To search for a particular string, we use the
`reddit_urls()` function. It takes one mandatory and five optional arguments:
* the string we want to search for (I am not certain whether this includes the
  content / text of the actual posts or only titles). In our case, **ilovefs**
  should work just fine, as this is the name of the campaign and probably what
  people will use in their posts
* the subreddits we want to search in. There is no real reason to limit this in
  the case of the ILoveFS campaign, but it may make sense in other cases. If not
  needed, this argument can be commented out with a `#`
* the minimum number of comments a post should have in order to be included. As
  we want all posts regardless of their popularity, we should set this to `0`
* how many pages of posts the result should include. Here the same applies as
  before: we want all posts, so we set this to a very high number like `99999`
* the sort order of the results. This doesn't really matter, as we try to scrape
  all posts containing our search string. You can most likely leave it out or
  set it to `new`
* the wait time between API requests. The minimum (API limit) is 2 seconds, but
  if you want to be sure, set it to a slightly higher value
The result is saved to the variable `reddit_post_dirty`, where the *dirty*
stands for the fact that we haven't yet sorted out posts older than this
year's event:
```
reddit_post_dirty <- reddit_urls(search_terms = "ilovefs",
                                 #subreddit = "freesoftware linux opensource",
                                 cn_threshold = 0,
                                 page_threshold = 99999,
                                 sort_by = "new",
                                 wait_time = 5)
```
### Stripping out data
The data from the `RedditExtractoR` package comes in an easily usable
`data.frame()` output. Its structure is illustrated below:
```
reddit_post
|
|- date = "14-02-17", "13-02-17", ...
|- num_comments = "23", "15", ...
|- title = "Why I love Opensource #ilovefs", "Please participate in ILoveFS", ...
|- subreddit = "opensource", "linux", ...
'- URL = "https://www.reddit.com/r/opensource/comments/dhfiu/", ...
```
Firstly, we should exclude all postings from previous years. For this, we simply
trim the `date` variable (format is "DD-MM-YY") within the `data.frame()` to
only display the year and keep only those posts from the current year. We save
the result in the `reddit_post` variable:
```
reddit_searchinyear <- 18
reddit_post_year <- gsub(x = reddit_post_dirty$date,
                         pattern = "\\d.-\\d.-",
                         replace = "")
reddit_post <- reddit_post_dirty[which(reddit_post_year == reddit_searchinyear),]
```
To ease the handling of this process, the year we want to search in is assigned
to the variable `reddit_searchinyear` in a "YY" format first (here: "18" for
"2018"). We use `gsub()` to trim the date to just display the year and use
`which()` to determine which post's year is equal to `reddit_searchinyear`.
Afterwards, we can use a single for-loop to extract all relevant variables. We
simply create an empty `vector()` for each variable:
```
comt <- c()
subr <- c()
ttle <- c()
date <- c()
rurl <- c()
```
We then fill the appropriate position in each vector with the corresponding
value, doing this for each scraped post:
```
for(i in c(1:length(reddit_post$URL))){
  comt[i] <- reddit_post$num_comments[i]
  ttle[i] <- reddit_post$title[i]
  rurl[i] <- reddit_post$URL[i]
  date[i] <- gsub(x = reddit_post$date[i], pattern = "-", replace = "")
  subr[i] <- reddit_post$subreddit[i]
  ...
}
```
However, not all of the relevant data is contained in the `reddit_post` dataset.
We need another function from the `RedditExtractoR` package, called
`reddit_content()`, which is able to also give us the score, text and linked-to
website of a post. As an argument, this function only needs the URL of a post,
which is contained in our previously mentioned `data.frame()`:
```
reddit_content <- reddit_content(URL = reddit_post$URL[1])
```
The resulting variable `reddit_content` is another `data.frame()` with a similar
structure as the previously used `reddit_post`:
```
reddit_content
|
|- ...
|- num_comments = "20"
|- ...
|- post_score = "15"
|- ...
|- post_text = "I really do love this software because..."
|- link = "https://cran.r-project.org"
'- ...
```
Since we need to do this for every single post, we can include this in our
for-loop. Because we call the function with only one post-URL at a time, we can
set the wait time between requests to zero. However, the for-loop will still
call the function multiple times, so we should make sure it actually waits
before doing so, or else we will hit the API timeout. We can do this with the
`Sys.sleep()` function. Everything put together:
```
comt <- c()
subr <- c()
ptns <- c()
ttle <- c()
text <- c()
link <- c()
date <- c()
rurl <- c()
for(i in c(1:length(reddit_post$URL))){
  comt[i] <- reddit_post$num_comments[i]
  ttle[i] <- reddit_post$title[i]
  rurl[i] <- reddit_post$URL[i]
  date[i] <- gsub(x = reddit_post$date[i], pattern = "-", replace = "")
  subr[i] <- reddit_post$subreddit[i]
  Sys.sleep(2)
  reddit_content <- reddit_content(URL = reddit_post$URL[i], wait_time = 0)
  ptns[i] <- reddit_content$post_score
  text[i] <- reddit_content$post_text
  link[i] <- reddit_content$link
}
```
### Creating the finished dataset
As we do not really need to *filter* anything out (we have already done so with
the dates before), we can directly bind our variables into a `data.frame()`. As
with the other data sources (e.g. [Twitter](#twitter)), we create a matrix with
the `cbind()` function, which can be turned into the finished dataset with
`data.frame()`, assigning it to the variable `reddit`:
```
reddit <- data.frame(cbind(date, rurl, link, text, ttle, ptns, subr, comt))
```
This usually re-defines every single variable within the dataset as `factor()`,
so we use the `within()` function to change their mode:
```
reddit <- within(data = reddit, expr = {
                   date <- as.numeric(as.character(date))
                   rurl <- as.character(rurl)
                   link <- as.character(link)
                   text <- as.character(text)
                   ttle <- as.character(ttle)
                   ptns <- as.numeric(as.character(ptns))
                   subr <- as.character(subr)
                   comt <- as.numeric(as.character(comt))
                 })
```
The dataset can now be exported. Skip down to the
[Exporting-Section](#exporting-datasets) to learn how.
* * *
## Exporting Datasets
There are several reasons why we want to export our data:
1. to keep a backup / an archive. As we have seen in the
   [Twitter-Section](#twitter), the social media sites do not always enable us
   to collect a full back-log of what has been posted in the past. If we want to
   analyze our data at a later point in time, or if we want to compare several
   points in time with one another, it makes sense to have an archive and
   preferably a backup to prevent data loss
2. to use the data outside your current R-session. The variables only live for
   as long as your R-session is running. As soon as you close it, all is gone
   (except if you agree to save the workspace to an image, which actually does
   the very same thing we are doing here). So it makes sense to export the data,
   which can then be imported and worked with again later.
3. to enable other users to analyze and work with the data. Obviously, this is
   an important one for us. We **do** want to share our results and the data we
   used, so other people can learn from them and so our analysis is transparent.
In order to fully enable anyone to use the data, whatever software he or she is
using, we export it in three common and easily readable formats:
`.RData .csv .txt`. The latter is the simplest one and can be read by
literally **any** text editor. Each string in there is enclosed by quotes `"`
and separated with a single space in a table layout. The `.csv` format is very
similar, though the separation is done with a symbol - in this case a comma `,`.
This format is not only readable by all text editors (because it is pure text),
it can also be read by spreadsheet applications like LibreOffice Calc. The
disadvantage of both formats is that they can only hold items with the same
"labels", so we need to create a separate export file for each data source.
Also, when importing, you often have to redefine each variable's mode again.
Lastly, we also export as `.RData`, R's very own format. Since R is free
software, I would suspect that most statistics software can read this format,
but I do not actually know that for a fact. However, it certainly is the easiest
to work with in R, as you can include as many variables and datasets as you want
and the modes of each variable stay intact. `.RData` is a binary format and
cannot be read by text editors or non-specialized software.
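As a quick sketch, re-importing the exported files in a later R session could
look like this (the file names here are placeholders following the naming scheme
described below; variable modes may still need to be redefined as mentioned
above):
```
load("data/ilovefs-all_2018-03-01_12-00-00.RData")                      # .RData
twitter  <- read.table("data/ilovefs-twitter_2018-03-01_12-00-00.txt")  # .txt
mastodon <- read.csv("data/ilovefs-fediverse_2018-03-01_12-00-00.csv",
                     row.names = 1)                                     # .csv
```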
In order to have an easily navigable archive, we should not only label the
output files with the source of the data, but also with the date when they were
collected. For this, we first need the current time/date, which R provides with
the `Sys.time()` function. We want to bring it into a format suitable for
file names, like "YYYY-MM-DD_HH-MM-SS", which we can do with `sub()` and `gsub()`
respectively:
```
time_of_saving <- sub(x = Sys.time(), pattern = " CET", replace = "")
time_of_saving <- sub(x = time_of_saving, pattern = " ", replace = "_")
time_of_saving <- gsub(x = time_of_saving, pattern = ":", replace = "-")
```
Next, we construct the save-path we want the data to be exported to, for which
we can use `paste0()`. For example, to save the `.RData` file, we want to export
to the `data/` folder:
```
save_path <- paste0("./data/ilovefs-all_", time_of_saving, ".RData")
```
*Note: using `paste()` instead of `paste0()` would insert a space between the
strings, which we do not want here.*
We follow a similar approach for the individual `.txt` files, also adding the
name of the source to the filename (as they will only hold one data source
each). For example:
```
save_path_twitter_t <- paste0("./data/ilovefs-twitter_", time_of_saving, ".txt")
```
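The remaining path variables referenced in the example at the end of this
section can be built the same way; the exact file names used here are an
assumption:
```
# File names are assumptions; only the variable names match the final example
save_path_twitter_c <- paste0("./data/ilovefs-twitter_", time_of_saving, ".csv")
save_path_fed_t     <- paste0("./data/ilovefs-fediverse_", time_of_saving, ".txt")
```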
Lastly, we need to actually export the data, which we can do with:
```
save()        # for .RData
write.table() # for .txt
write.csv()   # for .csv
```
All three functions take the data as an argument, as well as the previously
defined file path. In the case of `save()`, where we export multiple datasets,
their names need to be collected in a `vector()` item with the `c()` function
first:
```
save(list = c("twitter", "mastodon"), file = save_path)
write.table(mastodon, file = save_path_fed_t)
write.csv(twitter, file = save_path_twitter_c)
```
**If this is done, we can safely close our R-Session, as we just archived all
data for later use or for other people to join in!**