
Edit: adapt documentation to the rtweet package

pull/15/head
janwey 1 year ago
commit 4770baddfb
2 changed files with 142 additions and 201 deletions
1. docs/collector.md (+142, -201)
2. docs/collector.pdf (BIN)

docs/collector.md

@@ -4,7 +4,7 @@

* [General information about the Script](#the-script)
* [Packages used and the Package section](#packages)
* [The rtweet package](#the-rtweet-package)
* [The curl and rjson packages](#the-curl-and-rjson-packages)
* [The RedditExtractoR package](#the-redditextractor-package)
* [Collecting from Twitter](#twitter)
@@ -41,29 +41,31 @@ much care and do not leak meta-data if possible.
As of writing this script and its documentation, two platform-specific and two
general scraper-packages are being used:

* [rtweet](https://cran.r-project.org/package=rtweet) (Version 0.6.0)
* [curl](https://cran.r-project.org/package=curl) (Version 3.1)
* [rjson](https://cran.r-project.org/package=rjson) (Version 0.2.15)
* [RedditExtractoR](https://cran.r-project.org/package=RedditExtractoR) (Version 2.0.2)
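
The script's package section presumably just loads these four libraries; a
minimal sketch of what that could look like (the exact order and any extra
helper packages are assumptions):
```
# Load the scraper packages used throughout the collector script
library(rtweet)           # Twitter scraping
library(curl)             # downloading raw data over HTTP
library(rjson)            # parsing JSON responses
library(RedditExtractoR)  # Reddit scraping
```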

### The rtweet package
rtweet has a rather extensive
[documentation](https://cran.r-project.org/web/packages/rtweet/rtweet.pdf) as
well as "in-R-manuals". Simply enter `??rtweet` into the R console or look up
a specific function with `?function`, replacing `function` with its actual name.
twitteR has several useful function to scrape Twitter-Data, most of which
however apply to the Twitter account in use - which in our case is not
necessary. The [Twitter-Section](#twitter) uses only three functions, which will
be discussed individually, later on:
It is the successor of the previously used
[twitteR package](https://cran.r-project.org/package=twitteR), with a lot of
improvements and fewer restrictions regarding the Twitter-API. rtweet has several
useful functions to scrape Twitter-Data, most of which, however, apply to the
Twitter account in use - which in our case is not necessary. The
[Twitter-Section](#twitter) uses only two functions, which will be discussed
individually later on:
```
create_token() # authentication
search_tweets() # searching Twitter for a particular string
```

As a side note: I had to install the
[httr package](https://cran.r-project.org/package=httr) - a dependency of
rtweet - from the repositories of my distribution of choice, as the one
provided by CRAN would not compile for some reason. So if you run into a similar
issue, look for something like `r-cran-httr` in your package manager.
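
For reference, a hedged sketch of how the installation might look; the
Debian-style package name is an assumption, adjust it for your distribution:
```
# Inside R: install rtweet (and, normally, its dependencies) from CRAN
install.packages("rtweet")

# If httr fails to compile from CRAN, install the distribution's package
# instead (e.g. "r-cran-httr" on Debian-based systems) and re-run the
# install.packages() call above.
```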

@@ -116,18 +118,22 @@ so keep an eye on this.

### Authenticate
As the package in use here needs access to the Twitter-API, what we first need
are the "Consumer Key", "Consumer Secret", "Access Token" and "Access Token
Secret", all of which you can order from
[apps.twitter.com](https://apps.twitter.com/). Of course, you need a
Twitter-Account for this (staff may ask for the FSFE's Account).
are the "Consumer Key", "Consumer Secret" and our "App Name", all of which you
can order from [apps.twitter.com](https://apps.twitter.com/). Of course, you
need a Twitter-Account for this (staff may ask for the FSFE's Account).

The authentication can be done in two ways:

1. via manual input. The R-Console will prompt you to enter the credentials by
typing them in. Going this route will exclude the option to run the script
automatically.
2. via a plain text file with the saved credentials. This `.txt` file has a very
specific structure which you have to follow. You can find an example file in
the examples folder. Going this route can potentially be a security risk for
the Twitter-Account in use, as the `.txt` file is stored in plain text. The
problem can be mitigated if your hard drive is encrypted. *It may also be
possible to implement decryption of a file via GnuPG with the `system()`
command (see the sketch below). However, this has not been implemented in this
script (yet)*.

The first line of the credential-file contains the *labels*. These have to be in
the same order as the *credentials* themselves in the line below. The *labels*
@@ -136,243 +142,171 @@ Storing the credentials in plain text surely is not optimal, but the easiest way
to get the information into our R-Session. This should not be too critical, if
your disk is encrypted.
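
A minimal sketch of how such a credential file might be read; the file name and
label names here are hypothetical, the authoritative layout is the example file
in the examples folder:
```
# Hypothetical credential file "twitter_credentials.txt":
# first line = labels, second line = the credentials themselves, e.g.
#
#   app consumer_key consumer_secret
#   ilovefs-collector xxxxxxxxxxxxxxxxx yyyyyyyyyyyyyyyyyyyyyyyyy
twitter_credentials <- read.table(file = "twitter_credentials.txt",
                                  header = TRUE,
                                  stringsAsFactors = FALSE)

twitter_appname     <- twitter_credentials$app
twitter_consumerkey <- twitter_credentials$consumer_key
twitter_consumerpri <- twitter_credentials$consumer_secret

# Not implemented in the script (yet): reading a GnuPG-encrypted file via
# system() instead of keeping the credentials in plain text, e.g.
# read.table(text = system("gpg -d twitter_credentials.txt.gpg", intern = TRUE),
#            header = TRUE, stringsAsFactors = FALSE)
```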

Next, we create the oauth token with `create_token()`. The oauth token can not
only be used to scrape information from Twitter, it also grants write-access, so
it can be used to manipulate the affiliated Twitter-Account or interact with
Twitter in any other way.

The function used to authenticate takes the consumer-key and consumer-secret, as
well as the name of the app you registered on the Twitter developer page before,
as arguments. In this script, these are stored in the `twitter_consumerkey`,
`twitter_consumerpri` and `twitter_appname` variables:
```
twitter_token <- create_token(app = twitter_appname,
                              consumer_key = twitter_consumerkey,
                              consumer_secret = twitter_consumerpri)
```

### Scraping Tweets
Once we have an oauth token, we can already start looking for desired tweets to
collect. For this we use the `search_tweets()` function. All functions in the
`rtweet` package access the token via environment variables, so make sure to
create it before use and do not overwrite it afterwards. The arguments we need
to forward to the function are:

* the string to search for, in this case `ilovefs`. This will not only include
things like "ilovefs18", "ilovefs2018", "ILoveFS", etc., but also hashtags like
"#ilovefs"
* the date from which on we want to search. It is worth noting that the API is
limited in that it can only go back a few months. So if you want to look for
results from a year ago, you are out of luck. This date has to be in the form of
"YYYY-MM-DD". For our purpose, it makes sense to set it to either
`2018-01-01` or `2018-02-01` to also catch people promoting the campaign
in advance
* the date until which we want to search. This one also has to be in the form of
"YYYY-MM-DD". This argument usually only makes sense if you analyze events in
the past. For our purpose, we can set it either to the present or a future date
* the maximum number of tweets to be aggregated. This number is only useful for
search-terms that get a lot of coverage on Twitter (e.g. trending hashtags).
For our purpose, we can safely set it to a number that is much higher than the
anticipated participation in the campaign, like `9999999999`, so we get ALL
tweets containing our specified string
* the order-type for the search. Again, this only makes sense for searches where
we do not want each and every single tweet. In our case, set it to anything,
for example `recent`
* whether we want to include retweets in our data as well (we do not, in this
case, so set it to `FALSE`)

We save the result of this command in the variable `twitter_tw`. The resulting
code is:
```
twitter_tw <- search_tweets(q = "#ilovefs",
                            n = 9999,
                            include_rts = FALSE)
```

### Stripping out data
Most of the resulting data can be extracted with simple assignment; only a few
characteristics are organized within `list()` items in the `data.frame()`. The
structure of the dataset is as follows:
```
twitter_tw
|
|- text = "Yay! #ilovefs", "Today is #ilovefs, celebrate!", ...
|- ...
|- screen_name = "user123", "fsfe", ...
|- ...
|- source = "Twidere", "Tweetdeck", ...
|- ...
|- favorite_count = 8, 11, ...
|- retweet_count = 2, 7, ...
|- ...
|- lang = "en", "en", ...
|- ...
|- created_at = "14-02-2018 10:01 CET", "14-02-2018 10:01 CET", ...
|- ...
|- urls_expanded_url
|  |- NA, "https://ilovefs.org", ...
|  '- ...
|
|- media_expanded_url
|  |- NA, NA, ...
|  '- ...
|
'- ...
```
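
For the plain columns, the "simple assignment" mentioned above might look like
the following sketch; the short variable names match the ones used later in this
document:
```
# Plain columns can be copied into short-named variables directly
text <- twitter_tw$text            # the tweet itself
user <- twitter_tw$screen_name     # account that posted it
clnt <- twitter_tw$source          # client used for posting
favs <- twitter_tw$favorite_count  # number of favourites
retw <- twitter_tw$retweet_count   # number of retweets
lang <- twitter_tw$lang            # language of the tweet
fdat <- twitter_tw$created_at      # full date/time of the tweet
```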

The inconvenience about the `list()` structure stems from the fact that we need
to use a for-loop in order to run through each `list()` item and extract its
variables individually.

For the sake of keeping this short, this documentation only explains the
extraction of a single argument, namely the Client used to post a Tweet, and the
extraction of one of the items in the `list()`, namely the media-URL. All
other information is scraped in a very similar fashion.

For the media-URL, we first create a new, empty `vector()` item called `murl`
with the `vector()` command. Usually you do not have to pre-define empty vectors
in R, as a vector is created automatically as soon as you assign it a value, as
we have done before multiple times. You only need to pre-define it if you want to
address a specific *location* in that vector, say skipping the first value and
filling in the second. We do it like this here, as we want the resulting
`vector()` item to have the same order as the original dataset. In theory, you
could also use the combine command `c()`, however `vector()` gives you the option
to pre-define the mode (numeric, character, factor, ...) as well as the length of
the variable. The variable `twitter_number`, which we create before that, simply
contains the number of all tweets, so we know how long the `murl` vector has to
be:
```
twitter_number <- length(twitter_tw$text)
...
murl <- vector(mode = "character", length = twitter_number)
```

The for-loop has to count up from 1 to as many tweets as we scraped. We already
saved this value into the `twitter_number` variable. So if we scraped four
Tweets, the for-loop has to count `1 2 3 4`:
```
for(i in c(1:twitter_number)){
...
}
```

Next, we simply assign the first value of the according `list()` item to our
pre-defined vector. To be precise, we assign it to the current location in the
vector. You could also check first whether this value exists at all, however the
`rtweet` package sets `NA`, meaning "Not Available", if that value is missing,
which is fine for our purpose:
```
murl[i] <- twitter_tw$media_expanded_url[[i]][1]
```

All combined, this looks like this:
```
twitter_number <- length(twitter_tw$text)
clnt <- twitter_tw$source
...
murl <- vector(mode = "character", length = twitter_number)
for(i in 1:twitter_number){
  ...
  murl[i] <- twitter_tw$media_expanded_url[[i]][1]
}
```

Sometimes, as is the case with `fdat`, the "full date" variable, the extracted
string contains things that we do not need or want, so we use regex to get rid
of them.

*If you are not familiar with regex, I highly recommend
[regexr.com](https://regexr.com/) to learn how to use it. It also contains a
nifty cheat-sheet.*

```
time <- sub(pattern = ".* ", x = fdat, replace = "")
time <- gsub(pattern = ":", x = time, replace = "")
date <- sub(pattern = " .*", x = fdat, replace = "")
date <- gsub(pattern = "-", x = date, replace = "")
```

This step is particularly useful if we only want tweets in our dataset that were
posted during a specific time-period. Using the `which()` command, we can figure
out the positions of each tweet in our dataset that was posted prior to or after
a certain date (or time, if you wish). The `date` and `time` variables we created
before are numeric values describing the date/time of the tweet as `YYYYMMDD`
and `HHMMSS` respectively. As soon as we have found out which positions fit the
criteria (before February 10th or after February 16th, for example), we can
eliminate all of these tweets from our dataset:
```
twitter_exclude <- which(as.numeric(date) > 20180216 | as.numeric(date) < 20180210)
date <- date[-twitter_exclude]
...
user <- user[-twitter_exclude]
```

All other values are handled in a similar fashion. Some of those need some
smaller fixes afterwards, just like the removal of URLs in `twitter_client`.
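
The URL removal mentioned here stems from the `twitteR`-based version of this
script, where the client string still contained the HTML anchor tag of the
client's download page. If the `source` column returned by rtweet should still
contain such markup, the same cleanup could be applied to `clnt` (a sketch based
on the earlier version of the script):
```
# Strip a surrounding HTML anchor tag, keeping only the client's name
clnt <- sub(pattern = ".*\">", replace = "", x = clnt)
clnt <- sub(pattern = "</a>", replace = "", x = clnt)
```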

### Creating the finished dataset
After we scraped all desired tweets and extracted the relevant information from
them, it makes sense to combine the individual variables into a dataset, which
can be easily handled, exported and reused. It also makes sense to have relatively
short variable-names within such dataset.

When combining these variables into a `data.frame()`, we first need to create
a matrix from them, by *binding* these variables as columns of said matrix with
@@ -380,7 +314,8 @@ the `cbind()` command. The result can be used by the `data.frame()` function to
create such an item. We label this dataset `twitter`, making it clear what source
of data we are dealing with:
```
twitter <- data.frame(cbind(date, time, fdat, retw, favs, text,
                            lang, murl, link, clnt, user))
```

Often during that process, all variables within the `data.frame()` item are
@@ -398,13 +333,19 @@ Instead, we can use the `within()` function, using the `twitter` dataset as one
argument and the expression of what we want to do *within* this dataset as
another:
```
twitter <- within(data = twitter, expr = {
  date <- as.character(date);
  time <- as.character(time);
  fdat <- as.character(fdat);
  retw <- as.character(retw);
  favs <- as.character(favs);
  text <- as.character(text);
  link <- as.character(link);
  murl <- as.character(murl);
  lang <- as.character(lang);
  clnt <- as.character(clnt);
  user <- as.character(user);
})
```

The expression `as.numeric(as.character(...))` in some of these assignments are
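
The `as.numeric(as.character(...))` idiom is needed because calling
`as.numeric()` directly on a `factor()` returns the internal level codes rather
than the original values; a tiny illustration (not part of the script):
```
retw_factor <- as.factor(c("12", "7", "12"))
as.numeric(retw_factor)                # 1 2 1   -- level codes, not the counts
as.numeric(as.character(retw_factor))  # 12 7 12 -- the actual values
```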

docs/collector.pdf (BIN)

