
Edit: adapt documentation to the rtweet package

janwey committed 1 year ago · parent commit 4770baddfb
2 changed files with 142 additions and 201 deletions
  1. docs/collector.md (+142, -201)
  2. docs/collector.pdf (BIN)

docs/collector.md (+142, -201)

@@ -4,7 +4,7 @@
 
 * [General information about the Script](#the-script)
 * [Packages used and the Package section](#packages)
-    * [The twittR package](#the-twitter-package)
+    * [The rtweet package](#the-twitter-package)
     * [The curl and rjson packages](#the-curl-and-rjson-packages)
     * [The RedditExtractoR package](#the-redditextractor-package)
 * [Collecting from Twitter](#twitter)
@@ -41,29 +41,31 @@ much care and do not leak meta-data if possible.
 As of writing this script and its documentation, two platform-specific and two
 general scraper-packages are being used:
 
-* [twitteR](https://cran.r-project.org/package=twitteR) (Version 1.1.9)
+* [rtweet](https://cran.r-project.org/package=rtweet) (Version 0.6.0)
 * [curl](https://cran.r-project.org/package=curl) (Version 3.1)
 * [rjson](https://cran.r-project.org/package=rjson) (Version 0.2.15)
 * [RedditExtractoR](https://cran.r-project.org/package=RedditExtractoR) (Version 2.0.2)
 
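If any of these packages are missing from your R installation, they can be
installed from CRAN and loaded in the usual way; this is only a minimal sketch
and not part of the collector script itself:
```
  install.packages(c("rtweet", "curl", "rjson", "RedditExtractoR"))
  library(rtweet)
```
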
-### The twitteR package
-twitteR has a rather extensive
-[documentation](https://cran.r-project.org/web/packages/twitteR/twitteR.pdf) as
-well as "in-R-manuals". Simply enter `??twitteR` into the R console or look up
+### The rtweet package
+rtweet has a rather extensive
+[documentation](https://cran.r-project.org/web/packages/rtweet/rtweet.pdf) as
+well as "in-R-manuals". Simply enter `??rtweet` into the R console or look up
 a specific function with `?function`, replacing `function` with its actual name.
-twitteR has several useful function to scrape Twitter-Data, most of which
-however apply to the Twitter account in use - which in our case is not
-necessary. The [Twitter-Section](#twitter) uses only three functions, which will
-be discussed individually, later on:
+It is the successor of the previously used
+[twitteR package](https://cran.r-project.org/package=twitteR) with a lot of
+improvements and fewer restrictions regarding the Twitter-API. rtweet has several
+useful functions to scrape Twitter-Data, most of which, however, apply to the
+Twitter account in use - which in our case is not necessary. The
+[Twitter-Section](#twitter) uses only two functions, which will be discussed
+individually later on:
 ```
-  setup_twitter_oauth() # authentication
-  searchTwitter()       # searching Twitter for a particular string
-  strip_retweets()      # exclude Retweets in the results
+  create_token()  # authentication
+  search_tweets() # searching Twitter for a particular string
 ```
 
 As a side note, I had to install the
 [httr-package](https://cran.r-project.org/package=httr) - a dependency of
-twitteR - from the Repositories of my distribution of choice, as the one
+rtweet - from the repositories of my distribution of choice, as the one
 provided by CRAN would not compile for some reason. So if you run into a similar
 issue, look for something like `r-cran-httr` in your package manager.
 
@@ -116,18 +118,22 @@ so keep an eye on this.
 
 ### Authenticate
 As the package in use here needs access to the Twitter-API, what we first need
-are the "Consumer Key", "Consumer Secret", "Access Token" and "Access Token
-Secret", all of which you can order from
-[apps.twitter.com](https://apps.twitter.com/). Of course, you need a
-Twitter-Account for this (staff may ask for the FSFE's Account).
+are the "Consumer Key", "Consumer Secret" and our "App Name", all of which you
+can order from [apps.twitter.com](https://apps.twitter.com/). Of course, you
+need a Twitter-Account for this (staff may ask for the FSFE's Account).
 
 The authentication can be done in two ways:
 
 1. via manual input. The R-Console will prompt you to enter the credentials by
-   typing them in.
+   typing them in. Going this route excludes the option of running the script
+   automatically.
 2. via a plain text file with the saved credentials. This `.txt` file has a very
    specific structure which you have to follow. You can find an example file in
-   the examples folder.
+   the examples folder. Going this route can potentially be a security risk for
+   the Twitter-account in use, as the `.txt` file is stored in plain text. The
+   problem can be mitigated if your hard drive is encrypted. *It may also be
+   possible to implement decryption of a file via GnuPG with the `system()`
+   command. However, this has not been implemented in this script (yet)*.
 
 The first line of the credential-file contains the *labels*. These have to be in
 the same order as the *credentials* themselves in the line below. The *labels*
@@ -136,243 +142,171 @@ Storing the credentials in plain text surely is not optimal, but the easiest way
 to get the information into our R-Session. This should not be too critical if
 your disk is encrypted.
 
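To illustrate how such a credential file could be read into the R-Session, here
is a minimal sketch; the file name and the label names used below are assumptions
for this example and not taken from the script:
```
  # assumed layout: first line holds the labels, second line the credentials
  twitter_credentials <- read.table(file = "twitter_credentials.txt",
                                    header = TRUE,
                                    stringsAsFactors = FALSE)
  twitter_appname     <- twitter_credentials$appname
  twitter_consumerkey <- twitter_credentials$consumerkey
  twitter_consumerpri <- twitter_credentials$consumerpri
```
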
-Next, we order our oauth token with `setup_twitter_oauth()`. This function is a
-wrapper for httr, which will also store this token in a local file, so make sure
-to **not leak those by making the file public**. The oauth token can not only be
-used to scrape information from Twitter, it also grants write-access, so can be
-used to manipulate the affiliated Twitter-Account or interact with Twitter in
-any other way.
+Next, we create the oauth token with `create_token()`. The oauth token can not
+only be used to scrape information from Twitter, it also grants write-access, so
+it can be used to manipulate the affiliated Twitter-Account or interact with
+Twitter in any other way.
 
-The function used to authenticate takes all of our 4 credential-keys as
-arguments, which in this script are stored in the `twitter_consumerkey
-twitter_consumerpri twitter_tokenaccess twitter_tokensecret` variables:
+The function used to authenticate takes the consumer-key and consumer-secret as
+well as the name of the app you registered on the Twitter developer page
+earlier as arguments, which in this script are stored in the `twitter_consumerkey
+twitter_consumerpri twitter_appname` variables:
 ```
-  setup_twitter_oauth(consumer_key = twitter_consumerkey,
-                      consumer_secret = twitter_consumerpri,
-                      access_token = twitter_tokenaccess,
-                      access_secret = twitter_tokensecret)
+  twitter_token <- create_token(app = twitter_appname,
+                                consumer_key = twitter_consumerkey,
+                                consumer_secret = twitter_consumerpri)
 ```
 
 ### Scraping Tweets
 Once we have an oauth token, we can already start looking for desired tweets to
-collect. For this we use the `searchTwitter()` function. All functions in the
-`twittR` package access the file created by the auth-function mentioned before,
-so there is no need to enter this as argument. What arguments we do need are:
+collect. For this we use the `search_tweets()` function. All functions in the
+`rtweet` package access the token via environment variables, so make sure to
+create it before use and do not overwrite it afterwards. The arguments we need
+to pass to the function are:
 
 * the string to search for, in this case `ilovefs`. This will not only include
   things like "ilovefs18", "ilovefs2018", "ILoveFS", etc. but also hashtags like
   "#ilovefs"
-* the date from which on we want to search. It is worth noting, that the API is
-  limited in that it can only go back a few months. So if you want to look for
-  results from a year ago, you have bad luck. This date has to be in the form of
-  "YYYY-MM-DD". For our purpose, it makes sense to set it to either
-  `2018-01-01` or `2018-02-01` to also catch people promoting the campaign
-  in advance
-* the date until which we want to search. This one also has to be in the form of
-  "YYYY-MM-DD". This argument usually only makes sense, if you analyze events in
-  the past. For our purpose, we can set it either to the present or future date
 * the maximum number of tweets to be aggregated. This number is only useful for
   search-terms that get a lot of coverage on Twitter (e.g. trending hashtags).
   For our purpose, we can safely set it to a number that is much higher than the
   anticipated participation in the campaign, like `9999999999`, so we get ALL
   tweets containing our specified string
-* the order-type for the search. Again, this only makes sense for searches where
-  we do not want each and every single tweet. In our case, set it to anything,
-  for example `recent`
+* whether we want to include retweets in our data as well (we do not, in this
+  case, so set it to `FALSE`)
 
-We save the result of this command in the variable `twitter_tw_dirty`. The
-*dirty* stands for an "unclean" result, still containing retweets. The resulting
+We save the result of this command in the variable `twitter_tw`. The resulting
 code is:
 ```
-  twitter_tw_dirty <- searchTwitter(search = "ilovefs",
-                                    since = "2018-01-01",
-                                    until = "2018-12-31",
-                                    n = 999999999,
-                                    resultType = "recent")
-```
-
-The next step is to clean this data and remove retweets (they are listed in the
-"dirty" data as normal tweets as well), as those are not necessary for use. We
-can still extract the number of retweets of each posting later on, who retweeted
-is not important. We provide three arguments to the function `strip_retweets()`:
-
-* the `list()` item containing our scraped tweets. As shown above, we saved this
-  to the variable `twitter_tw_dirty`
-* whether we want to also remove "manual rewteets", which is someone literally
-  copy-and-pasting the text of a tweet. This is up to debate, but personally I
-  would say, that this should be kept in as this is what a lot of "share this
-  site" buttons on websites do. This is still participation and should thus be
-  included in the results
-* whether we want to remove "modified tweets", which *probably* means "quoted"
-  ones? Either way, if in doubt we want to keep it in. We can still remove it,
-  if we later find out it is in fact a retweet.
-
-The result is saved to the variable `twitter_tw`, now containing only clean
-data:
-```
-  twitter_tw <- strip_retweets(tweets = twitter_tw_dirty,
-                               strip_manual = FALSE,
-                               strip_mt = FALSE)
+  twitter_tw <- search_tweets(q = "#ilovefs",
+                              n = 9999,
+                              include_rts = FALSE)
 ```
 
 ### Stripping out data
-The `list()` item resulting from the `searchTwitter()` function has a logical,
-but rather inconvenient structure. The `list()` contains a lower `list()` for
-each Tweet scraped. Those lower `list()` items contain variables for each
-property, as shown by the illustration below:
+Most of the resulting data can be extracted with simple assignment; only a few
+characteristics are organized within `list()` items in the `data.frame()`. The
+structure of the dataset is as follows:
 ```
   twitter_tw
     |
-    |- [LIST 1]
-    |    |- text = "This is my tweet about #ilovefs https://fsfe.org"
-    |    |- ...
-    |    |- favoriteCount = 21
-    |    |- ...
-    |    |- created = "2018-02-14 13:52:59"
-    |    |- ...
-    |    |- statusSource = "<a href='/download/android'>Twitter for Android</a>"
-    |    |- screenName = "fsfe"
-    |    |- retweetCount = 9
-    |    |- ....
-    |    |- urls [LIST]
-    |    |    |- expanded = "https://fsfe.org"
-    |    |    '- ...
+    |- ...
+    |- text = "Yay! #ilovefs", "Today is #ilovefs, celebrate!", ...
+    |- ...
+    |- screen_name = "user123", "fsfe", ...
+    |- ...
+    |- source = "Twidere", "Tweetdeck", ...
+    |- ...
+    |- favorite_count = 8, 11, ...
+    |- retweet_count = 2, 7, ...
+    |- ...
+    |- lang = "en", "en", ...
+    |- ...
+    |- created_at = "14-02-2018 10:01 CET", "14-02-2018 10:01 CET", ...
+    |- ...
+    |- urls_expanded_url
+    |    |- NA, "https://ilovefs.org", ...
     |    '- ...
     |
-    |- [LIST 2]
-    |    |- ...
+    |- media_expanded_url
+    |    |- NA, NA, ...
     |    '- ...
     |
     '- ...
 ```
 
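For the plain columns, "simple assignment" means nothing more than picking the
column out of the `data.frame()`; a minimal sketch using the column names shown
above (the variable names on the left are just examples):
```
  text <- twitter_tw$text
  user <- twitter_tw$screen_name
  favs <- twitter_tw$favorite_count
  retw <- twitter_tw$retweet_count
```
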
-The inconvenience about this structure stems from that we need to use a for-loop
-in order to run through each lower `list()` item and extract its variables
+The inconvenience about the `list()` structure is that we need to use a
+for-loop in order to run through each `list()` item and extract its variables
 individually.
 
 For the sake of keeping this short, this documentation only explains the
-extraction of a single argument, namely the Client used to post a Tweet. All
+extraction of a single argument, namely the Client used to post a Tweet, and the
+extraction of one of the items in the `list()`, namely the media-URL. All
 other information is scraped in a very similar fashion.
 
-Firstly we create a new, empty `vector()` item called `twitter_client` with the
-"combine" command (or `c()` for short). Usually you do not have to pre-define
-empty vectors in R, but it will be created automatically if you assign it a
-value, has we've done before multiple times. You only need to pre-define it, if
-you want to address a specific *location* in that vector, say skipping the first
-value and filling in the second. We do it like this here, as we want the
-resulting `vector()` item to have the same order as the `list()`:
-```
-  twitter_client <- c()
+For the media-URL, firstly we create a new, empty `vector()` item called `murl`
+with the `vector()` command. Usually you do not have to pre-define empty vectors
+in R, but they will be created automatically if you assign them a value, as we've
+done before multiple times. You only need to pre-define a vector if you want to
+address a specific *location* in it, say skipping the first value and
+filling in the second. We do it like this here, as we want the resulting
+`vector()` item to have the same order as the original dataset. In theory, you
+could also use the combine command `c()`; however, `vector()` gives you the option
+to pre-define the mode (numeric, character, factor, ...) as well as the length of
+the variable. The variable `twitter_number`, which we create just before, simply
+contains the number of all tweets, so we know how long the `murl` vector has to
+be:
+```
+  twitter_number <- length(twitter_tw$text)
+  ...
+  murl <- vector(mode = "character", length = twitter_number)
 ```
 
-The for-loop has to count up from 1 to as long as the `list()`
-item is. So if we scraped four Tweets, the for-loop has to count `1 2 3 4`:
+The for-loop has to count up from 1 to as many tweets as we scraped. We already
+saved this value into the `twitter_number` variable. So if we scraped four
+Tweets, the for-loop has to count `1 2 3 4`:
 ```
-  for(i in c(1:length(twitter_tw))){
+  for(i in c(1:twitter_number)){
    ...
  }
 ```
 
-Next, we check if the desired variable in the lower `list()` item is set.
-However, R does not have a specific way of checking whether a variable is set or
-not. However, if a variable exists, but is empty, its length is zero. Thus if we
-want to check if a variable is set or not, we can simply check its length. In
-particular, here we check if the vector `statusSource` within the `i`-th lower
-list of `twitter_tw` has a length greater than zero:
+Next, we simply assign the first value of the corresponding `list()` item to our
+pre-defined vector. To be precise, we assign it to the current location in the
+vector. You could also check first whether this value exists at all; however,
+the `rtweet` package sets `NA`, meaning "Not Available", if that value
+is missing, which is fine for our purpose:
 ```
-  if(length(twitter_tw[[i]]$statusSource) > 0){
-    ...
-  } else {
-    ...
-  }
+  murl[i] <- twitter_tw$media_expanded_url[[i]][1]
 ```
 
-Finally, we can extract the value we are after - the `statusSource` vector. We
-assign it to the `i`-th position in the previously defined `vector()` item
-`twitter_client`, if the previously mentioned if-statement is true. As a little
-*hack* here, we **specifically** assign it as a character-item with the
-`as.character()` function. This may not always be necessary, but sometimes wrong
-values will be assigned, if the source-variable is a `factor()`, but I won't go
-in-depth on that matter here. Just a word of caution: **always check your
-variables before continuing**. If the if-statement above is false, we instead
-assign `NA`, meaning "Not Available"
+All combined, this looks like this:
 ```
-  twitter_client[i] <- as.character(twitter_tw[[i]]$statusSource)
+  twitter_number <- length(twitter_tw$text)
+  clnt <- twitter_tw$source
   ...
-  twitter_client[i] <- NA
+
+  murl <- vector(mode = "character", length = twitter_number)
+  for(i in 1:twitter_number){
+    ...
+    murl[i] <- twitter_tw$media_expanded_url[[i]][1]
+  }
 ```
 
-Sometimes, as it is the case with `twitter_client`, the extracted string
-contains things that we do not need or want. So we used regex to get rid of it.
+Sometimes, as is the case with `fdat`, the "full date" variable, the
+extracted string contains things that we do not need or want. So we use regex to
+get rid of it.
 
 *If you are not familiar with regex, I highly recommend
 [regexr.com](https://regexr.com/) to learn how to use it. It also contains a
 nifty cheat-sheet.*
-
-Official Twitter-Clients include the download URL, besides the name of the
-client. It's safe to assume, that most other clients do the same, so we can
-clean up the string with two simple `sub()` commands (meaning "substitude"). As
-arguments, we give it the pattern it should substitude, as well as the
-replacement string (in our case, this string is empty / none) and the string
-that this should happen to - here `twitter_client`. We assign both to the same
-variable again, overriding its previous value:
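The `fdat` ("full date") variable used below is assumed to hold the `created_at`
timestamp of each tweet, converted to a character string in the same
simple-assignment fashion as the other plain columns; a minimal sketch:
```
  # assumed origin of fdat: the created_at column as a character string
  fdat <- as.character(twitter_tw$created_at)
```
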
 ```
-  twitter_client <- sub(pattern = ".*\">", replace = "", x = twitter_client)
-  twitter_client <- sub(pattern = "</a>", replace = "", x = twitter_client)
+  time <- sub(pattern = ".* ", x = fdat, replace = "")
+  time <- gsub(pattern = ":", x = time, replace = "")
+  date <- sub(pattern = " .*", x = fdat, replace = "")
+  date <- gsub(pattern = "-", x = date, replace = "")
 ```
 
-All combined together, this looks similar to this:
+This step is particularly useful if we only want tweets in our dataset that were
+posted during a specific time period. Using the `which()` command, we can figure
+out the position of each tweet in our dataset that was posted prior to
+or after a certain date (or time, if you wish). The `date` and `time` variables
+we created before are values describing the date/time of the tweet as
+`YYYYMMDD` and `HHMMSS` respectively. As soon as we have found out which positions
+fit the criteria (before February 10th and after February 16th, for example), we
+can eliminate all of these tweets from our dataset:
 ```
-  twitter_client <- c()
-  for(i in 1:length(twitter_tw)){
-    if(length(twitter_tw[[i]]$statusSource) > 0){
-      twitter_client[i] <- as.character(twitter_tw[[i]]$statusSource)
-    } else {
-      twitter_client[i] <- NA
-    }
-  }
-  twitter_client <- sub(pattern = ".*\">", replace = "", x = twitter_client)
-  twitter_client <- sub(pattern = "</a>", replace = "", x = twitter_client)
+  twitter_exclude <- which(as.numeric(date) > 20180216 | as.numeric(date) < 20180210)
+  date <- date[-twitter_exclude]
+  ...
+  user <- user[-twitter_exclude]
 ```
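One caveat the script does not handle: if no tweet falls outside the chosen
period, `which()` returns an empty vector, and negative indexing with an empty
vector would drop *all* elements instead of none. A small guard around the
subsetting, shown here only as a sketch, avoids that:
```
  # only drop elements if there is actually something to exclude
  if(length(twitter_exclude) > 0){
    date <- date[-twitter_exclude]
    user <- user[-twitter_exclude]
  }
```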
 
-All other values are handled in a similar fashion. Some of those need some
-smaller fixes afterwards, just like the removal of URLs in`twitter_client`.
-
 ### Creating the finished dataset
 After we scraped all desired tweets and extracted the relevant information from
 it, it makes sense to combine the individual variables into a dataset, which can
 be easily handled, exported and reused. It also makes sense to have relatively
-short variable-names within such dataset. During the data collecting process, we
-used a `twitter_` prefix in front of each variable, so we are sure we use the
-correct variables, all coming from our Twitter-scraper. We do not need to do
-in a `data.frame()` item, as its name itself already eliminates the risk of
-using wrong variables.
-
-Additionally, we still need to split up the `twitter_timedate` variable, which
-currently contains the point of time of the tweet in the form of
-`YYYY-MM-DD HH:MM:SS`. For this, we again use regex and the function `sub()`.
-As `sub()` only replaces the first instance of the pattern given to it, if we
-have multiple occasions of a given pattern, we need to use `gsub()` (for global
-substitute).
-
-We also give some of the variables a new "mode", for example transferring them
-from a `character()` item (a string) over to a `factor()` item, making them an
-ordinal or nominal variable. This makes especially sense for the number of
-retweets and favorites.
-
-The results are seven discrete variables, which in a second step can be combined
-into a `data.frame()` item:
-```
-  time <- sub(pattern = ".* ", x = twitter_timedate, replace = "")
-  time <- as.numeric(gsub(pattern = ":", x = time, replace = ""))
-  date <- sub(pattern = " .*", x = twitter_timedate, replace = "")
-  date <- as.numeric(gsub(pattern = "-", x = date, replace = ""))
-  retw <- as.factor(twitter_rts)
-  favs <- as.factor(twitter_fav)
-  link <- as.character(twitter_url)
-  text <- as.character(twitter_txt)
-  clnt <- as.character(twitter_client)
-```
+short variable-names within such a dataset.
 
 When combining these variables into a `data.frame()`, we first need to create
 a matrix from them, by *binding* these variables as columns of said matrix with
@@ -380,7 +314,8 @@ the `cbind()` command. The result can be used by the `data.frame()` function to
 create such an item. We label this dataset `twitter`, making it clear what source
 of data we are dealing with:
 ```
-  twitter <- data.frame(cbind(date, time, retw, favs, text, link, clnt))
+  twitter <- data.frame(cbind(date, time, fdat, retw, favs, text,
+                              lang, murl, link, clnt, user))
 ```
 
 Often during that process, all variables within the `data.frame()` item are
@@ -398,13 +333,19 @@ Instead, we can use the `within()` function, using the `twitter` dataset as one
 argument and the expression of what we want to do *within* this dataset as
 another:
 ```
-  twitter <- within(data = twitter,
-		    expr = {
-			    date <- as.numeric(as.character(date))
-			    time <- as.numeric(as.character(time))
-			    text <- as.character(text)
-			    link <- as.character(link)
-			   })
+  twitter <- within(data = twitter, expr = {
+                      date <- as.character(date)
+                      time <- as.character(time)
+                      fdat <- as.character(fdat)
+                      retw <- as.character(retw)
+                      favs <- as.character(favs)
+                      text <- as.character(text)
+                      link <- as.character(link)
+                      murl <- as.character(murl)
+                      lang <- as.character(lang)
+                      clnt <- as.character(clnt)
+                      user <- as.character(user)
+                    })
 ```
 
 The expression `as.numeric(as.character(...))` in some of these assignments are

docs/collector.pdf (BIN)

