
Edit: Updated Documentation to reflect the new Fediverse-Section of the script

pull/5/head
janwey 1 year ago
parent
commit
13568bdc51
3 changed files with 173 additions and 255 deletions
  1. collecto.R (+2 / -2)
  2. docs/collector.md (+171 / -253)
  3. docs/collector.pdf (BIN)

collecto.R (+2 / -2)

@@ -272,7 +272,7 @@ reto <- c()
favs <- c()
murl <- c()
acct <- c()
for(i in 1:10){
for(i in 1:9999999999){
if(i == 1){
mastodon_instance <- "https://mastodon.social"
mastodon_hashtag <- "ilovefs"
@@ -329,7 +329,7 @@ inst <- sub(pattern = "https:\\/\\/", x = inst, replacement = "")
inst <- sub(pattern = "\\/.*", x = inst, replacement = "")

### Only include Toots from this year
mastodon_exclude <- which(date < 20180101)
mastodon_exclude <- which(date < 20180201)
date <- date[-mastodon_exclude]
time <- time[-mastodon_exclude]
lang <- lang[-mastodon_exclude]

docs/collector.md (+171 / -253)

@@ -5,7 +5,7 @@
* [General information about the Script](#the-script)
* [Packages used and the Package section](#packages)
* [The twitteR package](#the-twitter-package)
* [The Mastodon package](#the-mastodon-package)
* [The curl and rjson packages](#the-curl-and-rjson-packages)
* [The RedditExtractoR package](#the-redditextractor-package)
* [Collecting from Twitter](#twitter)
* [Collecting from the Fediverse](#fediverse)
@@ -38,11 +38,12 @@ much care and do not leak meta-data if possible.
* * *

## Packages
As of writing this script and its documentation, three scraper-packages are
being used:
As of writing this script and its documentation, two platform-specific and two
general scraper-packages are being used:

* [twitteR](https://cran.r-project.org/package=twitteR) (Version 1.1.9)
* [mastodon](https://github.com/ThomasChln/mastodon) (Commit [a6815b6](https://github.com/ThomasChln/mastodon/commit/a6815b6fb626960ffa02bd407b8f05d84bd0f549))
* [curl](https://cran.r-project.org/package=curl) (Version 3.1)
* [rjson](https://cran.r-project.org/package=rjson) (Version 0.2.15)
* [RedditExtractoR](https://cran.r-project.org/package=RedditExtractoR) (Version 2.0.2)
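
All four packages of the new setup (twitteR, curl, rjson and RedditExtractoR)
are available on CRAN, so unlike the previously used mastodon package they can
be installed the usual way, for example:
```
install.packages(c("twitteR", "curl", "rjson", "RedditExtractoR"))
```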

### The twitteR package
@@ -61,51 +62,34 @@ be discussed individually, later on:
```

As a side-note, I had to install the
[httr-package](https://cran.r-project.org/web/packages/httr/index.html) - a
dependency of twitteR - from the Repositories of my distribution of choice, as
the one provided by CRAN would not compile for some reason. So if you run into a
similar issue, look for something like `r-cran-httr` in your packagemanager.
[httr-package](https://cran.r-project.org/package=httr) - a dependency of
twitteR - from the Repositories of my distribution of choice, as the one
provided by CRAN would not compile for some reason. So if you run into a similar
issue, look for something like `r-cran-httr` in your package manager.

### The mastodon package
### The curl and rjson packages
The good thing about Mastodon is that searches are not restricted to a single
Mastodon-Instance or to Mastodon at all. If your Instance has enough outbound
connections (so make sure you choose a very active and inter-communicative one),
you are able to not only search Mastodon-Instances, but also GNUsocial, Pump.io
and other compatible Social Media instances. Luckily, this also applies to the
mastodon-package. Unfortunately, mastodon for R is documented
[very poorly, if at all](https://github.com/ThomasChln/mastodon/blob/a6815b6fb626960ffa02bd407b8f05d84bd0f549/README.md).
This brings us in the uncomfortable position, that we need to figure out, what
the outputs of each function actually mean. Those are not properly labeled
either, so this is a task of trial'n'error and a lot of guessing. If you have
time and dedication, feel free to document it properly and open a pull-request
on the [project's Github page](https://github.com/ThomasChln/mastodon). The
relevant results that we use in our script are listed in the
[Mastodon-Section](#mastodon) of this documentation. Again, just like with the
Rfacebook package, the function-names are very generic and thus it is a good
idea to prefix them with `mastodon::` to prevent the use of a wrong function
from another package (eg.: `login()` becomes `mastodon::login()`).
From the long list of functions in this package, we only need two for our
analysis:
```
login() # authentication / generating an auth-token
get_hashtag() # search the fediverse for posts that include a specific hashtag
```

Note:
as this package is not hosted on CRAN but on github, you can not install it with
`install.packages()` like the other packages. The easiest way is to install it
with `install_github()` from the `devtools` package. In order to use
`install_github()` without loading the library (as we only need it for this one
time), you can prefix it with its package name.
Installing and loading the mastodon package would look like this:
```
install.packages("devtools")
devtools::install_github(repo = "ThomasChln/mastodon")
library("mastodon")
```

Also note, that `devtools` requires the development files of *libssl* to be
installed on your machine.
and other compatible Social Media instances. Previously, we used a specialized
package for scraping the Fediverse, simply called
[mastodon](https://github.com/ThomasChln/mastodon), however it proved to be
unreliable, poorly documented and probably even unmaintained. Luckily, Mastodon
as an open-source platform also has a very open API we can access with simple
tools like [curl](https://cran.r-project.org/package=curl) and
[rjson](https://cran.r-project.org/package=rjson). Specifically, we use the
following functions of the `curl` package:
```
curl_fetch_memory() # fetch the API response (content and headers) into memory
parse_headers() # split the raw response headers into separate header lines
```

This will generate an output in a JSON format, which we can transform to a
`list()` item with a function from the `rjson` package:
```
fromJSON() # transform JSON to a list() item
```
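
Put together, fetching and decoding one page of results might look like this (a
minimal sketch with a hard-coded example URL; how the URL is actually
constructed and how paging works is explained in the [Fediverse section](#fediverse)
below):
```
library("curl")
library("rjson")

# fetch one page of the tag-timeline and decode the JSON body into a list()
mastodon_reqres <- curl_fetch_memory("https://mastodon.social/api/v1/timelines/tag/ilovefs?limit=40")
toots <- fromJSON(rawToChar(mastodon_reqres$content))
```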

### The RedditExtractoR package
RedditExtractoR has a rather extensive
@@ -367,7 +351,7 @@ using wrong variables.
Additionally, we still need to split up the `twitter_timedate` variable, which
currently contains the point of time of the tweet in the form of
`YYYY-MM-DD HH:MM:SS`. For this, we again use regex and the function `sub()`.
As `sub()` only replaces the first instance of the pattern given to it, if we
As `sub()` only replaces the first instance of the pattern given to it, if we
have multiple occurrences of a given pattern, we need to use `gsub()` (for global
substitute).
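
To illustrate the difference, a minimal sketch (not the literal code of the
script) of splitting such a timestamp could look like this:
```
# example timestamp in the form returned by twitteR
twitter_timedate <- "2018-02-14 09:15:42"

# sub() is enough here, as there is only one space to split on
twitter_date <- sub(pattern = " .*", x = twitter_timedate, replacement = "")  # "2018-02-14"
twitter_time <- sub(pattern = ".* ", x = twitter_timedate, replacement = "")  # "09:15:42"

# gsub() is needed to remove *all* separators before converting to numeric
twitter_date <- as.numeric(gsub(pattern = "-", x = twitter_date, replacement = ""))  # 20180214
twitter_time <- as.numeric(gsub(pattern = ":", x = twitter_time, replacement = ""))  # 91542
```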

@@ -436,255 +420,188 @@ about how to export the data, so it can be used outside your current R-Session.
* * *

## Fediverse
The
[Mastodon-API](https://github.com/tootsuite/documentation/blob/461a17603504811b786084176c65f31ae405802d/Using-the-API/API.md)
doesn't require authentication for public timelines or (hash-) tags. Since this
is exactly the data we want to aggregate, authentication is not needed here.

### Authenticate
In the Mastodon package, authentication works similar as in the twitteR package.
You still need an account on any Mastodon-Instance you like, but you do not have
to create API-Credentials on the website. Instead, it can all be handled from
within R.

However, this comes with a different kind of complication:
You login-credentials have to be saved as plain text variables in your R-session
and if you want to go the comfortable way of saving these in an "auth file", as
we did with Twitter, this comes with an additional risk.

You can mitigate that risk, if you use an encrypted storage space - which I
would highly recommend either way. If you haven't encrypted your entire
hard drive, you may take a look at this wiki article about
[encryptfs](https://help.ubuntu.com/community/EncryptedPrivateDirectory).

Either way, you have two ways of inserting your credentials into the R-session:

1. via manual input. The R-Console will prompt you to enter the credentials by
typing them in.
2. via a plain text file with the saved credentials. This `.txt` file has a very
specific structure which you have to follow. You can find an example file in
the examples folder.

The first line of the credential-file contains the *labels*. These have to be in
the same order as the *credentials* themselves in the line below. The *labels*
as well as the *credentials* are each separated by a single semi-colon `;`. As
mentioned before, **storing your login as plain text is a risk that you have to
deal with somehow**. Ideally with encryption.

If we loaded our login-credentials into the variables
`mastodon_auth_insta mastodon_auth_login mastodon_auth_passw`, we can *order*
our API access token with the package's `login()` function, which takes these
three values as arguments. Again, the name of the function is very generic and
may overlap with function in other packages. So it is a good idea to prefix it
with the package name and a double colon. This is the case for all functions in
this package, so I will not further mention it, but we should continue doing it
regardless. We store the resulting list into the variable `mastodon_auth`:
### Scraping Toots and Postings
Contrary to Twitter, Mastodon does not allow searching for a string contained in
posts; however, we can search for hashtags through the tag-timeline in the
[API](https://github.com/tootsuite/documentation/blob/461a17603504811b786084176c65f31ae405802d/Using-the-API/API.md#timelines).
For this we have to construct a URL for the API-call (keep an eye on changes to
the API and adapt accordingly) in the following form:
```
mastodon_auth <- mastodon::login(instance = mastodon_auth_insta,
user = mastodon_auth_login,
pass = mastodon_auth_passw)
https://DOMAIN.OF.INSTANCE/api/v1/timelines/tag/SEARCHTERM
```

### Scraping Toots and Postings
Once we successfully got our access token, we can start collecting postings
containing our desired string. Contrary to Twitter, Mastodon does not allow to
search for a string contained in posts, however we can search for hashtags with
the `get_hashtag()` function. This one needs four arguments:

* our previously generated access token `mastodon_auth`
* a string containing the hashtag we want to search for. In our case, `ilovefs`
would make most sense. You can however make the argument, that we should
**also** search for `ilfs`. Things like "#ilovefs18" or "#ilovefs2018"
*should* be covered, however
* whether we want to only search on the local instance (the instance your
account is registered on). Of course we set this one to `FALSE`, as we want to
search the entire fediverse, including Mastodon-, GNUsocial- and
Pump.io- instances
* the maximum number of postings we want to collect. As in the `twitteR`
package, we can set this to a very high number, but this may need some
consideration in the future. Generally, the fediverse is much more serious
about free software than other social media types. Right now, it is still
fairly young, but as it gets older (and grows in users), the number of
participants in the "I love Free Software Day" may rise quite dramatically. So
you could try out a lower number for this argument and take a look at the
dates of posting to get a feeling of how high this number should be

The result is saved to the variable `mastodon_toot`:
```
mastodon_toot <- mastodon::get_hashtag(token = mastodon_auth,
hashtag = "ilovefs",
local = FALSE,
n = 100)
```
Additionally, you can add `?limit=40` at the end of the URL to raise the results
from 20 to 40 posts. For the search term it makes sense to use our official
hashtag for the "I Love Free Software" Campaign: *ilovefs*.

### Stripping out data
Unfortunately, as of writing this script and documentation, the `mastodon`
package has very poor documentation itself. For instance, there is no
explanation of the variables in the resulting list of the `get_hastag()`
function. Because of the structure of this `list()` item, there are no labels
either. With the help of the `names()` of R's base-package, I could however
identify all variables:
In R, you can easily construct this with the `paste0()` function (the `paste()`
function will introduce spaces between the arguments, which we obviously do not
want):
```
names(mastodon_toot)
mastodon_instance <- "https://mastodon.social"
mastodon_hashtag <- "ilovefs"
mastodon_url <- paste0(mastodon_instance,
                       "/api/v1/timelines/tag/",
                       mastodon_hashtag,
                       "?limit=40")
```
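
Printing the variable shows the assembled API-call URL:
```
mastodon_url
# [1] "https://mastodon.social/api/v1/timelines/tag/ilovefs?limit=40"
```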

Additionally, the structure of the resulting `list()` item has a great advantage
over the results in the `twitteR` package: It is very easy to extract the data,
as it already has the same structure that we use as well, as illustrated below:
Next, we use the `curl_fetch_memory()` function to fetch the data from our
Mastodon instance. The result of this is raw data, not readable by humans. In
order to translate this into a readable format, we use `rawToChar()` from the R
base package. This readable format is actually
[JSON](https://de.wikipedia.org/wiki/JavaScript_Object_Notation), which can be
easily transformed into a `list()` item with the `fromJSON()` function. All
three functions put together, we have something like this:
```
mastodon_toot
|
|- ...
|- created_at = "2018-01-22T10:44:53", "2018-01-22T10:45:10", ...
|- ...
|- visibility = "public", "public", ...
|- language = "en", "en", ...
|- uri = "tag:quitter.no,2018-01-22:noticeID=0000000000001:objectType=note", ...
|- content = "<3 Opensource! #ilovefs", "FREE SOFTWARE!1eleven #ilovefs", ...
|- url = "quitter.no/status/0000000000001", "quitter.no/status0000000000002", ...
|- reblogs_count = "9", "1", ...
|- favourites_count = "53", "3", ...
|- ...
|- account [LIST]
| |- [LIST 1]
| | |- ...
| | |- username = "linux-beginner-for-a-day"
| | '- ...
| |
| |- [LIST 2]
| | |- ...
| | |- username = "C12yp70_H4X012_1337-420"
| | '- ...
| |
| '- ...
|- media_attachements [LIST]
| |- [LIST 1]
| | |- ...
| | |- remote_url = "https://quitter.no/media/ilovefs-banner.png"
| | '- ...
| |
| |- [LIST 2]
| | |- ...
| | |- username = ""
| | '- ...
| |
| '- ...
'- ...

mastodon_reqres <- curl_fetch_memory(mastodon_url)
mastodon_rawjson <- rawToChar(mastodon_reqres$content)
toots <- fromJSON(mastodon_rawjson)
```

Because of this, we can often times to a basic assignment, like this:
`toots` is our resulting `list()` item.
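
A quick, hypothetical inspection of the result might look like this (the field
names follow Mastodon's status format; the values will of course differ):
```
length(toots)                # number of toots returned by this API-call (at most 40)
toots[[1]]$created_at        # timestamp of the first toot
toots[[1]]$account$username  # account name of its author
```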

Another issue is that the Mastodon-API currently caps at 40 toots per call.
However, we want much more than only the last 40, so we need to make several
API-calls, specifying the *"range"*. This is set with the `max_id=` parameter
within the URL. The "ID" is the
[unique identifier of each status/post](https://github.com/tootsuite/documentation/blob/461a17603504811b786084176c65f31ae405802d/Using-the-API/API.md#status).
You can have several parameters by dividing them with the `&` character, which
will look similar to this:
```
mastodon_lang <- mastodon_toot[[8]]
https://DOMAIN.OF.INSTANCE/api/v1/timelines/tag/SEARCHTERM?limit=40&max_id=IDNUMBER
```

However, in such cases as the time of the posting, we need to use `sub()`,
`gsub()` and `as.numeric()` to extract the data we want (in this case, splitting
time and date into single, numeric variables). We do something similar for the
`uri` variable in the list to extract the name of the instance.

URLs and hashtags have a HTML-format in the posting-text, so we need to get rid
of this, without removing anything else from it. If you do not understand the
regex here, make sure to check out [regexr.com](https://regexr.com/):
Luckily, we do not have to find out the ID manually. The header of the API
response saved into the `mastodon_reqres` variable also lists the "*next page*"
of results, so we can simply grab this with the `parse_headers()` function from
the `curl` package and use some regex to strip it out:
```
mastodon_txt <- gsub(pattern = "<.*?>", x = mastodon_toot[[10]], replacement = "")
mastodon_lheader <- parse_headers(mastodon_reqres$headers)[11]
mastodon_next <- sub(x = mastodon_lheader, pattern = ".*link: <", replacement = "")
mastodon_url <- sub(x = mastodon_next, pattern = ">; rel=\"next\".*", replacement = "")
```
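
To see what these three lines do, here is an illustration with a made-up header
value (the real instance and ID numbers will differ); the two `sub()` calls
strip everything except the URL marked as `rel="next"`:
```
# hypothetical "link" response-header, as returned by parse_headers()
mastodon_lheader <- paste0("link: <https://mastodon.social/api/v1/timelines/tag/ilovefs",
                           "?limit=40&max_id=99524642>; rel=\"next\", ",
                           "<https://mastodon.social/api/v1/timelines/tag/ilovefs",
                           "?limit=40&min_id=99540982>; rel=\"prev\"")

mastodon_next <- sub(x = mastodon_lheader, pattern = ".*link: <", replacement = "")
mastodon_url <- sub(x = mastodon_next, pattern = ">; rel=\"next\".*", replacement = "")
# mastodon_url now points to the next page of results
```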

Besides that, we should also try to identify bots, which are very common in the
fediverse and post about things like "Trending Hashtags". Of course, this is
problematic for us, as this most definitely can not be considered participation.
We can either sort bots out by their account-id or name. I went for the name in
this case, as there may be more "TrendingBots" scattered throughout the
fediverse. For this, we need to go through each "lower list" containing the
account information and noting down, which ones are bots and which are not.
If we identify a poster as a bot, we give the variable `mastodon_bot` the value
`TRUE` for this position and `FALSE` if this is not a bot. Just like extracting
information from the lower `list()` items in the `twitteR` package, we first
need to create an empty `vector()` item:
```
mastodon_bot <- c()
*If you are not familiar with regex, I highly recommend
[regexr.com](https://regexr.com/) to learn how to use it. It also contains a
nifty cheat-sheet.*


If this returns a valid result (if the `toots` variable is set), we forward it to
the [extraction function](#extraction-function) called `mastodon.fetchdata()`,
which is defined earlier in the script. This returns a `data.frame()` item
containing all relevant variables **of the current "page"**. If we continuously
bind them together in a for-loop, we finally receive multiple vectors of all
toots ever posted with the (hash-) tag *#ilovefs*:
```
if(length(toots) > 0){
  tmp_mastodon_df <- mastodon.fetchdata(data = toots)
  datetime <- c(datetime, as.character(tmp_mastodon_df$tmp_datetime))
  lang <- c(lang, as.character(tmp_mastodon_df$tmp_lang))
  inst <- c(inst, as.character(tmp_mastodon_df$tmp_inst))
  link <- c(link, as.character(tmp_mastodon_df$tmp_link))
  text <- c(text, as.character(tmp_mastodon_df$tmp_text))
  reto <- c(reto, as.character(tmp_mastodon_df$tmp_reto))
  favs <- c(favs, as.character(tmp_mastodon_df$tmp_favs))
  murl <- c(murl, as.character(tmp_mastodon_df$tmp_murl))
  acct <- c(acct, as.character(tmp_mastodon_df$tmp_acct))
} else {
  break
}
```

Next, it will be filled with the help of a for-loop. It has to count up from 1
to as long as the `mastodon_pers` `list()` item is:
As of writing this documentation, the cap of the for-loop is set to 9999999999,
which most likely will never be reached. However, the loop will always stop as
soon as the `toots` variable doesn't contain meaningful content anymore (see the
*break* command in the code above).
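
Put together, the whole polling loop might look roughly like this (a condensed
sketch; the real script initialises one collector vector per variable and is
shown in full in collecto.R, and the sketch assumes the `mastodon.fetchdata()`
helper described below is already defined):
```
library("curl")
library("rjson")

# collector vectors (one per extracted variable)
datetime <- c()
lang <- c()
# ...

for(i in 1:9999999999){

  # on the first run, construct the initial API-call URL
  if(i == 1){
    mastodon_instance <- "https://mastodon.social"
    mastodon_hashtag <- "ilovefs"
    mastodon_url <- paste0(mastodon_instance,
                           "/api/v1/timelines/tag/",
                           mastodon_hashtag,
                           "?limit=40")
  }

  # fetch the current page and decode the JSON body into a list()
  mastodon_reqres <- curl_fetch_memory(mastodon_url)
  mastodon_rawjson <- rawToChar(mastodon_reqres$content)
  toots <- fromJSON(mastodon_rawjson)

  # extract the data of this page, or stop as soon as a page comes back empty
  if(length(toots) > 0){
    tmp_mastodon_df <- mastodon.fetchdata(data = toots)
    datetime <- c(datetime, as.character(tmp_mastodon_df$tmp_datetime))
    lang <- c(lang, as.character(tmp_mastodon_df$tmp_lang))
    # ... the remaining vectors are appended in the same way
  } else {
    break
  }

  # grab the URL of the next page from the "link" response-header
  mastodon_lheader <- parse_headers(mastodon_reqres$headers)[11]
  mastodon_next <- sub(x = mastodon_lheader, pattern = ".*link: <", replacement = "")
  mastodon_url <- sub(x = mastodon_next, pattern = ">; rel=\"next\".*", replacement = "")
}
```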

Once extracted, some of the data has to be reworked, reformatted or changed in
some way. We use regex for this as well. For the sake of simplicity, the example
below only shows the cleaning of the `text` variable. Other variables are
treated in a similar fashion:
```
for(i in 1:length(mastodon_pers)){
...
}
text <- gsub(pattern = "<.*?>", x = text, replacement = "")
text <- gsub(pattern = " ", x = text, replacement = "")
```
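
The `created_at` timestamps are handled in a similar way; as a sketch (assuming
timestamps of the form `2018-01-22T10:44:53`), they could be split into the
numeric `date` and `time` values used below like this:
```
date <- sub(pattern = "T.*", x = datetime, replacement = "")   # "2018-01-22"
time <- sub(pattern = ".*T", x = datetime, replacement = "")   # "10:44:53"
date <- as.numeric(gsub(pattern = "-", x = date, replacement = ""))  # 20180122
time <- as.numeric(gsub(pattern = ":", x = time, replacement = ""))  # 104453
```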

Within this for-loop, we need to check whether or not that account is a bot. As
described above, for the sake of simplicity and because the only bot that comes
to mind is the "TrendingBot", we do it with a simple if-statement:
Additionally, posts that are too old have to be removed (usually, setting the
oldest date to January 01 of the current year works fine; February may be fine
as well). The format of the date should be `YYYYMMDD` and a `numeric()` value:
```
if(mastodon_pers[[i]]$username == "TrendingBot"){
...
} else {
...
}
mastodon_exclude <- which(date < 20180101)
date <- date[-mastodon_exclude]
time <- time[-mastodon_exclude]
lang <- lang[-mastodon_exclude]
inst <- inst[-mastodon_exclude]
text <- text[-mastodon_exclude]
link <- link[-mastodon_exclude]
reto <- reto[-mastodon_exclude]
favs <- favs[-mastodon_exclude]
murl <- murl[-mastodon_exclude]
acct <- acct[-mastodon_exclude]
```

*Note: you can use multiple Bot-names by adding "|" (or) followed by another
botname to the statement.*
### Extraction Function
The extraction function `mastodon.fetchdata()` has to be defined before it is
called, hence it is the first chunk of code in the fediverse-section of the
script. Its only argument is the extracted data in a `list()` format (which we
saved in the variable `toots`). For each post/toot in the `list()` item, the
function will extract:

As mentioned above, if the statement is true, we set the `mastodon_bot` variable
at this position as `TRUE` and as `FALSE` if it is not.
* date & time of the post
* language of the post (currently only differentiates between English/Japanese)
* the instance of the poster/tooter
* the URL of the post
* the actual content/text of the post
* the number of boosts/shares/retweets
* the number of favorites
* the URL of the attached image (NA, if no image is attached)
* the account of the poster (instance & username)

All put together, we have:
```
mastodon_bot <- c()
for(i in 1:length(mastodon_pers)){
if(mastodon_pers[[i]]$username == "TrendingBot"){
mastodon_bot[i] <- TRUE
} else {
mastodon_bot[i] <- FALSE
}
}
```
mastodon.fetchdata <- function(data){

### Creating the finished dataset
...

If we scraped all information, we are still dealing with "dirty" data, here. We
already identified bots, but haven't removed them yet. We also didn't set a
date-range within which we want to collect data. Additionally, we should also
sort out "private" posting, as we want to publish our data and should not leak
someone's thoughts who clearly don't wants them to be public. However it is to
be expected, that there is close to no person who
for(i in 1:length(data)){

* a) white-listed your account to see their private postings
* b) posts about #ilovefs in a private post
#### Time and Date of Toot
if(length(data[[i]]$created_at) > 0){
  tmp_datetime[i] <- data[[i]]$created_at
} else {
  # insert empty value, if it does not exist
  tmp_datetime[i] <- NA
}

However, we should keep it in mind regardless.
...

To identify posts to be excluded, we can simply use the `which()` function in
conjunction with a condition for each attribute and bind them together with the
`c()` (or "combine") function. Here we can include the previously identified
bots, and the condition, that the "date" has to be lower than (before) a certain
numeric value in the form of "YYYYMMDD". Lastly, we exlclude everything that
is not marked as "public":
```
mastodon_exclude <- c(which(mastodon_bot),
which(mastodon_date < 20180101),
which(mastodon_priv != "public"))
```
}

Before we create the `data.frame()` item, we can drop all `mastodon_` prefixes
from the variables, as the name of the dataset itself makes already clear, what
the source of the data is. We can also strip out the posts we don't want in
there and which positions are listed in the `mastodon_exclude` variable:
```
date <- mastodon_date[-mastodon_exclude]
time <- mastodon_time[-mastodon_exclude]
lang <- mastodon_lang[-mastodon_exclude]
inst <- mastodon_insta[-mastodon_exclude]
text <- mastodon_txt[-mastodon_exclude]
link <- mastodon_url[-mastodon_exclude]
favs <- mastodon_fav[-mastodon_exclude]
imag <- mastodon_img[-mastodon_exclude]
return(data.frame(cbind(tmp_datetime,
                        tmp_lang,
                        tmp_inst,
                        tmp_text,
                        tmp_link,
                        tmp_reto,
                        tmp_favs,
                        tmp_murl,
                        tmp_acct)))
}
```
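
The remaining variables are collected with the same pattern inside the function;
for example, a favourites-chunk analogous to the `created_at` chunk shown above
would look roughly like this:
```
#### Favourites of the Toot
if(length(data[[i]]$favourites_count) > 0){
  tmp_favs[i] <- data[[i]]$favourites_count
} else {
  # insert empty value, if it does not exist
  tmp_favs[i] <- NA
}
```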

### Creating the finished dataset

As before with the Twitter-data, we combine these newly created variables into
a `data.frame()` item: first we bind the vectors together as columns of a matrix
with `cbind()`, then we turn that matrix into the finished dataset called
`mastodon` with `data.frame()`:
```
mastodon <- data.frame(cbind(date, time, lang, inst, text, link, favs, imag))
mastodon <- data.frame(cbind(date, time, lang, inst, text, link, reto, favs, murl, acct))
```

As this usually re-defines the variables as `factor()`, we will use `within()`
@@ -695,6 +612,7 @@ mastodon <- within(data = mastodon, expr = {
time <- as.numeric(as.character(time));
text <- as.character(text);
link <- as.character(link);
murl <- as.character(murl);
})
```
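
After this conversion, a quick check (not part of the script) confirms that the
finished dataset has the expected shape:
```
str(mastodon)   # overview of the columns and their types
nrow(mastodon)  # number of collected #ilovefs toots after filtering
```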


docs/collector.pdf (BIN)

