
Add: Documentation - Packages used - Twitter Collector

pull/1/head
janwey 1 year ago
parent commit b29c3fdff8

+ 3
- 0
docs/README.md

@@ -0,0 +1,3 @@
# Documentation

* **[collecto.R](./collector.md)**: the collector/scraper for data from different social media sources

+ 449
- 0
docs/collector.md

@@ -0,0 +1,449 @@
# Documentation: [collecto.R](../collecto.R)

## Table of Contents

* [General information about the Script](#the-script)
* [Packages used and the Package section](#packages)
  * [The twitteR package](#the-twitter-package)
* [The Rfacebook package](#the-rfacebook-package)
* [The Mastodon package](#the-mastodon-package)
* [Twitter](#twitter)


* * *

## The Script
The R script documented here has a modular structure. It is divided into two
sections that handle loading the packages necessary for the process and
exporting the aggregated data into usable formats at the end. The remaining
sections each handle one specific data source (e.g. Twitter, Mastodon, ...).
While the [Package-Section](#packages) is obviously necessary for the remaining
sections (depending on which ones you actually want to use), as is the
[Export-Section](#exporting) for actually using the data in other applications,
scripts or by other people, you can cherry-pick between the
[Datasource-Sections](#datasources). These can be used independently and in no
particular order relative to one another. Keep that in mind if you only want to
analyze *X*.

As a side note, the script is written to keep the collected data as anonymous
as possible. However, because we deal with a rather small sample and because of
the nature of social media, it is in most cases still possible to track down
each specific user in the resulting data. While the time and date of a posting
are mostly unproblematic on their own, in combination with the other collected
properties they can still identify an individual user.

* * *

## Packages
As of writing this script and its documentation, three scraper-packages are
being used:

* [twitteR](https://cran.r-project.org/package=twitteR) (Version 1.1.9)
* [Rfacebook](https://cran.r-project.org/package=Rfacebook) (Version 0.6.15)
* [mastodon](https://github.com/ThomasChln/mastodon) (Commit [a6815b6](https://github.com/ThomasChln/mastodon/commit/a6815b6fb626960ffa02bd407b8f05d84bd0f549))

### The twitteR package
twitteR has a rather extensive
[documentation](https://cran.r-project.org/web/packages/twitteR/twitteR.pdf) as
well as "in-R-manuals". Simply enter `??twitteR` into the R console or look up
a specific function with `?function`, replacing `function` with its actual name.
twitteR has several useful functions to scrape Twitter data, most of which
however apply to the Twitter account in use - which in our case is not
necessary. The [Twitter-Section](#twitter) uses only three functions, which
will be discussed individually later on:
```
setup_twitter_oauth() # authentication
searchTwitter() # searching Twitter for a particular string
strip_retweets() # exclude Retweets in the results
```
As a side note: I had to install the
[httr-package](https://cran.r-project.org/web/packages/httr/index.html) - a
dependency of twitteR - from the repositories of my distribution of choice, as
the one provided by CRAN would not compile for some reason. So if you run into
a similar issue, look for something like `r-cran-httr` in your package manager.
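
For completeness, installing and loading twitteR itself works the usual CRAN
way (a minimal sketch; the same applies to Rfacebook below):
```
install.packages("twitteR")
library("twitteR")
```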

### The Rfacebook package
_**Attention:**
I tried to set up a Facebook account just for this purpose, but their
registration process is rather tedious and honestly ridiculous. Keep your phone
number or credit card nearby, as well as a photo of your face. I cannot accept
these kinds of intrusion, even for the purpose of this data analysis. If you
already have a Facebook account, you can however use that one to receive the
API access tokens and use the *Rfacebook* package described in this section. I
did not, so the process described here is theoretical and I do not actually
know the structure of each function's output._

Rfacebook
[documents its internal and external functions](https://cran.r-project.org/web/packages/Rfacebook/Rfacebook.pdf)
fairly well, too. The focus of the package does not quite align with the
purpose we have in mind here (it concentrates on metrics for site
administrators and on analyzing specific people's actions), but due to a lack
of alternatives we can still use it to some extent.
Unfortunately, the functions of this package have very generic names and thus
may conflict with functions from other packages. Here is a little tip to
prevent the usage of the wrong function in R: prefix the function you want to
use with the name of the package and a double colon. In the case of the
`getShares()` function, this would result in `Rfacebook::getShares()`. The
functions we are interested in, and which will be discussed later on, are:
```
fbOAuth() # authentication / generating an auth-token
getCommentReplies() # replies to a comment on a post
getGroup() # retrieve information from a public group
getPage() # retrieve information from a public page
getPost() # retrieve information from a public post (incl. comments)
getReactions() # retrieve reactions to a single or multiple posts
getShares() # retrieve list of shares of a post
getUsers() # retrieve information about poster
searchFacebook() # search public posts with a certain string [deprecated]
searchPages() # search public pages that mention a certain string
```
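To illustrate the prefixed calls, here is a minimal, untested sketch of
generating a token and fetching a public page; the `app_id`, `app_secret` and
page name are placeholders, not values used by the script:
```
# untested sketch: app_id / app_secret are placeholders for a registered app
facebook_token <- Rfacebook::fbOAuth(app_id     = "123456789",
                                     app_secret = "1a2b3c4d5e6f")

# retrieve the 100 most recent posts of a public page
fsfe_page <- Rfacebook::getPage(page = "fsfe", token = facebook_token, n = 100)
```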
As a side note: I had to install the
[httr-package](https://cran.r-project.org/web/packages/httr/index.html) - a
dependency of Rfacebook - from the repositories of my distribution of choice,
as the one provided by CRAN would not compile for some reason. So if you run
into a similar issue, look for something like `r-cran-httr` in your package
manager.

### The mastodon package
The good thing about Mastodon is, that searches are not restricted to a single
Mastodon-Instance or to Mastodon at all. If your Instance has enough outbound
connections (so make sure you chose a very active and inter-communicative one),
you are able to not only search Mastodon-Instances, but also GNUsocial, Pump.io
and other compatible Social Media instances. Luckily, this also applies to the
mastodon-package. Unfortunately, mastodon for R is documented
[very poorly, if at all](https://github.com/ThomasChln/mastodon/blob/a6815b6fb626960ffa02bd407b8f05d84bd0f549/README.md).
This brings us in the uncomfortable position, that we need to figure out, what
the outputs of each function actually mean. Those are not properly labeled
either, so this is a task of trial'n'error and a lot of guessing. If you have
time and dedication, feel free to document it properly and open a pull-request
on the [project's Github page](https://github.com/ThomasChln/mastodon). The
relevant results that we use in our script are listed in the
[Mastodon-Section](#mastodon) of this documentation. Again, just like with the
Rfacebook package, the function-names are very generic and thus it is a good
idea to prefix them with `mastodon::` to prevent the use of a wrong function
from another package (eg.: `login()` becomes `mastodon::login()`).
From the long list of functions in this package, we only need two for our
analysis:
```
login() # authentication / generating an auth-token
get_hashtag() # search the fediverse for posts that include a specific hashtag
```
Note:
as this package is not hosted on CRAN but on GitHub, you can not install it
with `install.packages()` like the other packages. The easiest way is to
install it with `install_github()` from the `devtools` package. In order to use
`install_github()` without loading the library (as we only need it this one
time), you can prefix it with its package name.
Installing and loading the mastodon package would look like this:
```
install.packages("devtools")
devtools::install_github(repo = "ThomasChln/mastodon")
library("mastodon")
```
Also note that `devtools` requires the development files of *libssl* to be
installed on your machine.
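
Once installed, the two functions could be used roughly like this. This is a
hedged sketch: the instance URL, credentials and argument names are assumptions
on my part, as the package is barely documented; the actual calls are covered
in the [Mastodon-Section](#mastodon):
```
# rough sketch: instance, credentials and argument names are assumptions
mastodon_token <- mastodon::login(instance = "https://mastodon.social",
                                  user     = "user@example.com",
                                  pass     = "VerySecretPassword")

# search the fediverse for posts containing the "ilovefs" hashtag
mastodon_toots <- mastodon::get_hashtag(token   = mastodon_token,
                                        hashtag = "ilovefs",
                                        local   = FALSE,
                                        n       = 100)
```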

* * *

## Twitter

### Authenticate
As the package in use here needs access to the Twitter API, what we first need
are the "Consumer Key", "Consumer Secret", "Access Token" and "Access Token
Secret", all of which you can obtain from
[apps.twitter.com](https://apps.twitter.com/). Of course, you need a
Twitter account for this (staff may ask for the FSFE's account).

The authentication can be done in two ways:

1. via manual input. The R-Console will prompt you to enter the credentials by
typing them in.
2. via a plain text file with the saved credentials. This `.txt` file has a very
specific structure which you have to follow. You can find an example file in
the examples folder.

The first line of the credential file contains the *labels*. These have to be
in the same order as the *credentials* themselves in the line below. The
*labels* as well as the *credentials* are each separated by a single semicolon
`;`. Storing the credentials in plain text surely is not optimal, but it is the
easiest way to get the information into our R session. This should not be too
critical if your disk is encrypted.
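
To sketch what such a file and its import could look like (the labels, column
order and exact reading mechanism shown here are assumptions, not necessarily
what the script does):
```
# structure of examples/twitter_api_example.txt (labels and keys are made up):
#   consumerkey;consumersecret;accesstoken;tokensecret
#   AbCdEf123;GhIjKl456;MnOpQr789;StUvWx012

twitter_api_keys <- read.table(file = "examples/twitter_api_example.txt",
                               header = TRUE,
                               sep = ";",
                               colClasses = "character")

# assuming the order: consumer key, consumer secret, access token, token secret
twitter_consumerkey <- twitter_api_keys[1, 1]
twitter_consumerpri <- twitter_api_keys[1, 2]
twitter_tokenaccess <- twitter_api_keys[1, 3]
twitter_tokensecret <- twitter_api_keys[1, 4]
```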

Next, we request our OAuth token with `setup_twitter_oauth()`. This function is
a wrapper for httr, which will also store this token in a local file, so make
sure to **not leak it by making the file public**. The OAuth token can not only
be used to scrape information from Twitter, it also grants write access, so it
could be used to manipulate the affiliated Twitter account or interact with
Twitter in any other way.

The function used to authenticate takes all four of our credential keys as
arguments, which in this script are stored in the variables
`twitter_consumerkey`, `twitter_consumerpri`, `twitter_tokenaccess` and
`twitter_tokensecret`:
```
setup_twitter_oauth(consumer_key    = twitter_consumerkey,
                    consumer_secret = twitter_consumerpri,
                    access_token    = twitter_tokenaccess,
                    access_secret   = twitter_tokensecret)
```

### Scraping Tweets
Once we have an oauth token, we can already start looking for desired tweets to
collect. For this we use the `searchTwitter()` function. All functions in the
`twittR` package access the file created by the auth-function mentioned before,
so there is no need to enter this as argument. What arguments we do need are:

* the string to search for, in this case `ilovefs`. This will not only include
things like "ilovefs18", "ilovefs2018", "ILoveFS", etc but also hashtags like
"#ilovefs"
* the date from which on we want to search. It is worth noting, that the API is
limited in that it can only go back a few months. So if you want to look for
results from a year ago, you have bad luck. This date has to be in the form of
"YYYY-MM-DD". For our purpose, it makes sense to set it to either
`2018-01-01` or `2018-02-01` to also catch people promoting the campaign
in advance
* the date until which we want to search. This one also has to be in the form of
"YYYY-MM-DD". This argument usually only makes sense, if you analyze events in
the past. For our purpose, we can set it either to the present or future date
* the maximum number of tweets to be aggregated. This number is only useful for
search-terms that get a lot of coverage on twitter (eg.: trending hashtags).
For our purpose, we can safely set it to a number that is much higher than the
anticipated participation in the campaign, like `9999999999` so we get ALL
tweets containing our specified string
* the order-type for the search. Again, this only makes sense for searches where
we do not want each and every single tweet. In our case, set it to anything,
for example `recent`

We save the result of this command in the variable `twitter_tw_dirty`. The
*dirty* stands for an "unclean" result, still containing retweets. The resulting
code is:
```
twitter_tw_dirty <- searchTwitter(search     = "ilovefs",
                                  since      = "2018-01-01",
                                  until      = "2018-12-31",
                                  n          = 999999999,
                                  resultType = "recent")
```

The next step is to clean this data and remove retweets (they are listed in the
"dirty" data as normal tweets as well), as those are not necessary for us. We
can still extract the number of retweets of each posting later on; who
retweeted is not important. We provide three arguments to the function
`strip_retweets()`:

* the `list()` item containing our scraped tweets. As shown above, we saved
  this to the variable `twitter_tw_dirty`
* whether we also want to remove "manual retweets", meaning someone literally
  copy-and-pasting the text of a tweet. This is up for debate, but personally I
  would say that these should be kept in, as this is what a lot of "share this
  site" buttons on websites do. This is still participation and should thus be
  included in the results
* whether we want to remove "modified tweets", which *probably* means "quoted"
  ones? Either way, if in doubt we want to keep them in. We can still remove
  them later, if we find out they are in fact retweets.

The result is saved to the variable `twitter_tw`, now containing only clean
data:
```
twitter_tw <- strip_retweets(tweets       = twitter_tw_dirty,
                             strip_manual = FALSE,
                             strip_mt     = FALSE)
```

### Stripping out data
The `list()` item resulting from the `searchTwitter()` function has a logical,
but rather inconvenient structure. The `list()` contains a lower `list()` for
each Tweet scraped. Those lower `list()` items contain variables for each
property, as shown by the illustration below:
```
twitter_tw
|
|- [[1]]
| |- text = "This is my tweet about #ilovefs https://fsfe.org"
| |- ...
| |- favoriteCount = 21
| |- ...
| |- created = "2018-02-14 13:52:59"
| |- ...
| |- statusSource = "<a href='/download/android'>Twitter for Android</a>"
| |- screenName = "fsfe"
| |- retweetCount = 9
| |- ....
| |- urls [LIST]
| | |- expanded = "https://fsfe.org"
| | '- ...
| '- ...
|
|- [[2]]
| |- ...
| '- ...
|
'- ...
```

The inconvenience of this structure stems from the fact that we need to use
for-loops in order to run through each lower `list()` item and extract its
variables individually.

For the sake of keeping this short, this documentation only explains the
extraction of a single property, namely the client used to post a tweet.
First, we create a new, empty `vector()` item called `twitter_client` with the
"combine" command (or `c()` for short). Usually you do not have to pre-define
empty vectors in R; they are created automatically when you assign a value to
them, as we have done multiple times before. You only need to pre-define a
vector if you want to address a specific *location* in it, say skipping the
first value and filling in the second. We do it like this here, as we want the
resulting `vector()` item to have the same order as the `list()`:
```
twitter_client <- c()
```

The for-loop has to count up from 1 to the length of the `list()` item. So if
we scraped four tweets, the for-loop has to count `1 2 3 4`:
```
for(i in c(1:length(twitter_tw))){
...
}
```

Next, we check whether the desired variable in the lower `list()` item is set.
R does not have a specific way of checking whether a variable is set or not;
however, if a variable exists but is empty, its length is zero. Thus, if we
want to check whether a variable is set, we can simply check its length. In
particular, here we check whether the vector `statusSource` within the `i`-th
lower list of `twitter_tw` has a length greater than zero:
```
if(length(twitter_tw[[i]]$statusSource) > 0){
...
} else {
...
}
```

Finally, we can extract the value we are after - the `statusSource` vector. We
assign it to the `i`-th position in the previously defined `vector()` item
`twitter_client`, if the previously mentioned if-statement is true. As a little
*hack* here, we **specifically** assign it as a character item with the
`as.character()` function. This may not always be necessary, but sometimes
wrong values will be assigned if the source variable is a `factor()`; I won't
go in-depth on that matter here. Just a word of caution: **always check your
variables before continuing**. If the if-statement above is false, we instead
assign `NA`, meaning "Not Available":
```
twitter_client[i] <- as.character(twitter_tw[[i]]$statusSource)
...
twitter_client[i] <- NA
```

Sometimes, as is the case with `twitter_client`, the extracted string contains
things that we do not need or want, so we use regex to get rid of them.

*If you are not familiar with regex, I highly recommend
[regexr.com](https://regexr.com/) to learn how to use it. It also contains a
nifty cheat-sheet.*

Official Twitter clients include the download URL besides the name of the
client. It is safe to assume that most other clients do the same, so we can
clean up the string with two simple `sub()` commands (meaning "substitute"). As
arguments, we give it the pattern it should substitute, the replacement string
(in our case, an empty string) and the string that this should happen to -
here `twitter_client`. We assign both results to the same variable again,
overriding its previous value:
```
twitter_client <- sub(pattern = ".*\">", replace = "", x = twitter_client)
twitter_client <- sub(pattern = "</a>", replace = "", x = twitter_client)
```

All combined, this looks like the following:
```
twitter_client <- c()
for(i in 1:length(twitter_tw)){
  if(length(twitter_tw[[i]]$statusSource) > 0){
    twitter_client[i] <- as.character(twitter_tw[[i]]$statusSource)
  } else {
    twitter_client[i] <- NA
  }
}
twitter_client <- sub(pattern = ".*\">", replace = "", x = twitter_client)
twitter_client <- sub(pattern = "</a>", replace = "", x = twitter_client)
```

All other values are handled in a similar fashion. Some of them need smaller
fixes afterwards, just like the removal of URLs in `twitter_client`.

### Creating the finished dataset
After we have scraped all desired tweets and extracted the relevant information
from them, it makes sense to combine the individual variables into a dataset,
which can be easily handled, exported and reused. It also makes sense to have
relatively short variable names within such a dataset. During the data
collection process, we used a `twitter_` prefix in front of each variable, so
we could be sure to use the correct variables, all coming from our Twitter
scraper. We do not need to do this within a `data.frame()` item, as its name
alone already eliminates the risk of using the wrong variables.

Additionally, we still need to split up the `twitter_timedate` variable, which
currently contains the point in time of the tweet in the form
`YYYY-MM-DD HH:MM:SS`. For this, we again use regex and the function `sub()`.
As `sub()` only replaces the first instance of the pattern given to it, we need
to use `gsub()` (for global substitution) if we have multiple occurrences of a
given pattern.

We also give some of the variables a new "mode", for example transforming them
from a `character()` item (a string) into a `factor()` item, making them an
ordinal or nominal variable. This makes sense especially for the number of
retweets and favorites.

The results are seven discrete variables, which in a second step can be combined
into a `data.frame()` item:
```
time <- sub(pattern = ".* ", x = twitter_timedate, replace = "")
time <- as.numeric(gsub(pattern = ":", x = time, replace = ""))
date <- sub(pattern = " .*", x = twitter_timedate, replace = "")
date <- as.numeric(gsub(pattern = "-", x = date, replace = ""))
retw <- as.factor(twitter_rts)
favs <- as.factor(twitter_fav)
link <- as.character(twitter_url)
text <- as.character(twitter_txt)
clit <- as.character(twitter_client)
```

When combining these variables into a `data.frame()`, we first need to create
a matrix from them by *binding* these variables as columns of said matrix with
the `cbind()` command. The result can be passed to the `data.frame()` function
to create such an item. We label this dataset `twitter`, making it clear what
source of data we are dealing with:
```
twitter <- data.frame(cbind(date, time, retw, favs, text, link, clit))
```

Often during that process, all variables within the `data.frame()` item are
transformed into `factor()` variables, which is not what we want for most of
them.
Usually, when working with variables within a `data.frame()`, you have to
prefix the variable with the name of the `data.frame()` and a dollar sign,
meaning that you want to access that variable **within** that `data.frame()`.
This would make the process of changing the mode quite tedious for each
variable:
```
twitter$text <- as.numeric(as.character(twitter$text))
```

Instead, we can use the `within()` function, using the `twitter` dataset as one
argument and the expression of what we want to do *within* this dataset as
another:
```
twitter <- within(data = twitter,
                  expr = {
                    date <- as.numeric(as.character(date))
                    time <- as.numeric(as.character(time))
                    text <- as.character(text)
                    link <- as.character(link)
                  })
```

The expression `as.numeric(as.character(...))` in some of these assignments is
due to issues that arise when transforming `factor()` variables into
`numeric()` variables directly, as mentioned before. First transforming them
into a `character()` (string), which then can be transformed into a `numeric()`
value without risk, is a little *hack*.
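
A quick illustration of the problem with made-up values: converting a
`factor()` directly returns its internal level codes rather than the printed
values, while going through `character()` first preserves them:
```
as.numeric(factor(c("10", "2", "700")))
# [1] 1 2 3             <- level codes, not the actual numbers
as.numeric(as.character(factor(c("10", "2", "700"))))
# [1]  10   2 700       <- the values we actually want
```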

The dataset is now finished and contains every aspect we want to analyze later
on. You can skip down to the [Exporting-Section](#exporting-datasets) to read
about how to export the data, so it can be used outside your current R session.

* * *

BIN
docs/collector.pdf


facebook_api_example.txt → examples/facebook_api_example.txt


fediverse_mastodon_api_example.txt → examples/fediverse_mastodon_api_example.txt


twitter_api_example.txt → examples/twitter_api_example.txt

