
Merge branch 'Mastodon-docs' of janwey/ilfs-data into master

pull/2/head
janwey committed 1 year ago
commit b733ccc2b9
3 changed files with 249 additions and 18 deletions:

1. collecto.R (+5, -13)
2. docs/collector.md (+244, -5)
3. docs/collector.pdf (BIN)

collecto.R (+5, -13)

@@ -193,11 +193,6 @@ twitter <- within(data = twitter, expr = {

### Authenticate to the Fediverse (here: Mastodon)

# Note -------------------------------------------------------------------------
# It is sub-optimal to use clear-text credentials for the authentification
# process, but the mastodon-package does not (yet) support oath
# ------------------------------------------------------------------------------

#### Manual input (uncomment if needed)
#mastodon_auth_insta <- readline("[Mastodon] Enter your Instance-URL.")
#mastodon_auth_login <- readline("[Mastodon] Enter your registered mail.")
@@ -247,13 +242,8 @@ mastodon_toot <- mastodon::get_hashtag(token = mastodon_auth,
# 23.
# ------------------------------------------------------------------------------

### Sort out non-public posts
mastodon_priv <- which(mastodon_toot[[7]] != "public")
if(length(mastodon_priv) > 0){
  for(i in 1:length(mastodon_toot)){
    mastodon_toot[[i]] <- mastodon_toot[[i]][-c(mastodon_priv)]
  }
}
### public and non-public posts
mastodon_priv <- mastodon_toot[[7]]

### Time of post
#### date (as numeric value)
@@ -306,7 +296,9 @@ for(i in 1:length(mastodon_toot[[20]])){

### Cleaning data (removal of excluded posts)
mastodon_exclude <- c(which(mastodon_bot),
                      which(mastodon_date < 20180101))
                      which(mastodon_date < 20180101),
                      which(mastodon_priv != "public"))

date <- mastodon_date[-mastodon_exclude]
time <- mastodon_time[-mastodon_exclude]
lang <- mastodon_lang[-mastodon_exclude]

docs/collector.md (+244, -5)

@@ -8,6 +8,7 @@
* [The Rfacebook package](#the-rfacebook-package)
* [The Mastodon package](#the-mastodon-package)
* [Twitter](#twitter)
* [Fediverse](#fediverse)


* * *
@@ -254,7 +255,7 @@ property, as shown by the illustration below:
```
twitter_tw
|
|- [[1]]
|- [LIST 1]
| |- text = "This is my tweet about #ilovefs https://fsfe.org"
| |- ...
| |- favoriteCount = 21
@@ -270,7 +271,7 @@ property, as shown by the illustration below:
| | '- ...
| '- ...
|
|- [[2]]
|- [LIST 2]
| |- ...
| '- ...
|
@@ -400,7 +401,7 @@ into a `data.frame()` item:
favs <- as.factor(twitter_fav)
link <- as.character(twitter_url)
text <- as.character(twitter_txt)
clit <- as.character(twitter_client)
clnt <- as.character(twitter_client)
```

When combining these variables into a `data.frame()`, we first need to create
@@ -409,7 +410,7 @@ the `cbind()` command. The result can be used by the `data.frame()` function to
create such an item. We label this dataset `twitter`, making it clear what
source of data we are dealing with:
```
twitter <- data.frame(cbind(date, time, retw, favs, text, link, clit))
twitter <- data.frame(cbind(date, time, retw, favs, text, link, clnt))
```

Often during that process, all variables within the `data.frame()` item are
@@ -442,8 +443,246 @@ variables directly, as mentioned before. First transforming them into a
`character()` (string), which then can be transformed into a `numeric()` value
without risks, is a little *hack*.
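
To illustrate that *hack* with a standalone example (not part of the script
itself): applying `as.numeric()` directly to a `factor()` returns the factor's
internal integer codes, while converting to `character()` first recovers the
original values:
```
f <- factor(c("20180214", "20180213"))
as.numeric(f)                 # 2 1  - the internal level codes
as.numeric(as.character(f))   # 20180214 20180213 - the actual values
```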

The dataset is not finished and contains every aspect we want to analyze later
The dataset is now finished and contains every aspect we want to analyze later
on. You can skip down to the [Exporting-Section](#exporting-datasets) to read
about how to export the data, so it can be used outside your current R-Session.

* * *

## Fediverse

### Authenticate
In the Mastodon package, authentication works similarly to the twitteR package.
You still need an account on a Mastodon-Instance of your choice, but you do not
have to create API-Credentials on the website. Instead, it can all be handled
from within R.

However, this comes with a different kind of complication:
Your login-credentials have to be saved as plain-text variables in your
R-session, and if you want to go the comfortable way of saving them in an
"auth file", as we did with Twitter, this comes with an additional risk.

You can mitigate that risk by using encrypted storage - which I would highly
recommend either way. If you haven't encrypted your entire hard drive, you may
take a look at this wiki article about
[encryptfs](https://help.ubuntu.com/community/EncryptedPrivateDirectory).

You have two ways of inserting your credentials into the R-session:

1. via manual input. The R-Console will prompt you to enter the credentials by
typing them in (see the sketch right after this list).
2. via a plain text file with the saved credentials. This `.txt` file has a very
specific structure which you have to follow; an illustration follows below. You
can find an example file in the examples folder.
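
For the manual route, a minimal sketch using R's `readline()` function,
mirroring the commented-out prompts in `collecto.R` (the password prompt is an
assumption following the same pattern):
```
mastodon_auth_insta <- readline("[Mastodon] Enter your Instance-URL.")
mastodon_auth_login <- readline("[Mastodon] Enter your registered mail.")
mastodon_auth_passw <- readline("[Mastodon] Enter your password.")
```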

The first line of the credential-file contains the *labels*. These have to be in
the same order as the *credentials* themselves in the line below. The *labels*
as well as the *credentials* are each separated by a single semicolon `;`. As
mentioned before, **storing your login as plain text is a risk that you have to
deal with somehow** - ideally with encryption.
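
A hypothetical credential-file following that structure (the file name, labels
and values here are made up for illustration), together with one way of
reading it:
```
# content of "mastodon_auth.txt":
#   instance;login;password
#   social.example.org;jane@example.org;s3cr3t

mastodon_auth_txt   <- read.table(file = "mastodon_auth.txt", header = TRUE,
                                  sep = ";", colClasses = "character")
mastodon_auth_insta <- mastodon_auth_txt$instance
mastodon_auth_login <- mastodon_auth_txt$login
mastodon_auth_passw <- mastodon_auth_txt$password
```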

Once we have loaded our login-credentials into the variables
`mastodon_auth_insta`, `mastodon_auth_login` and `mastodon_auth_passw`, we can
request our API access token with the package's `login()` function, which takes
these three values as arguments. Again, the name of the function is very
generic and may overlap with functions in other packages, so it is a good idea
to prefix it with the package name and a double colon. This applies to all
functions in this package, so I will not mention it again, but we should keep
doing it regardless. We store the resulting list in the variable
`mastodon_auth`:
```
mastodon_auth <- mastodon::login(instance = mastodon_auth_insta,
                                 user = mastodon_auth_login,
                                 pass = mastodon_auth_passw)
```

### Scraping Toots and Postings
Once we have successfully obtained our access token, we can start collecting
postings containing our desired string. Unlike Twitter, Mastodon does not allow
searching for an arbitrary string contained in posts; however, we can search
for hashtags with the `get_hashtag()` function. It needs four arguments:

* our previously generated access token `mastodon_auth`
* a string containing the hashtag we want to search for. In our case, `ilovefs`
makes most sense. You could, however, argue that we should **also** search for
`ilfs`; variants like "#ilovefs18" or "#ilovefs2018" *should* already be
covered either way
* whether we want to search only on the local instance (the instance your
account is registered on). Of course we set this one to `FALSE`, as we want to
search the entire fediverse, including Mastodon-, GNUsocial- and
Pump.io-instances
* the maximum number of postings we want to collect. As in the `twitteR`
package, we can set this to a very high number, but this may need some
consideration in the future. Generally, the fediverse takes free software much
more seriously than other social networks. Right now, it is still fairly
young, but as it gets older (and grows in users), the number of participants
in the "I love Free Software Day" may rise quite dramatically. So you could
try out a lower number for this argument and take a look at the dates of the
postings to get a feeling for how high this number should be

The result is saved to the variable `mastodon_toot`:
```
mastodon_toot <- mastodon::get_hashtag(token = mastodon_auth,
                                       hashtag = "ilovefs",
                                       local = FALSE,
                                       n = 100)
```

### Stripping out data
Unfortunately, as of writing this script and documentation, the `mastodon`
package itself is very poorly documented. For instance, there is no
explanation of the variables in the list returned by the `get_hashtag()`
function. Because of the structure of this `list()` item, there are no obvious
labels either. With the help of the `names()` function from R's base-package,
I could however identify all variables:
```
names(mastodon_toot)
```
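
If you need the position of one of these variables for the index-based
assignments used below, `which()` on the `names()` output does the trick (a
sketch; "language" is one of the labels identified this way):
```
which(names(mastodon_toot) == "language")
```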

Additionally, the structure of the resulting `list()` item has a great advantage
over the results in the `twitteR` package: it is very easy to extract the data,
as it already has the same structure we use ourselves, as illustrated below:
```
mastodon_toot
|
|- ...
|- created_at = "2018-01-22T10:44:53", "2018-01-22T10:45:10", ...
|- ...
|- visibility = "public", "public", ...
|- language = "en", "en", ...
|- uri = "tag:quitter.no,2018-01-22:noticeID=0000000000001:objectType=note", ...
|- content = "<3 Opensource! #ilovefs", "FREE SOFTWARE!1eleven #ilovefs", ...
|- url = "quitter.no/status/0000000000001", "quitter.no/status0000000000002", ...
|- reblogs_count = "9", "1", ...
|- favourites_count = "53", "3", ...
|- ...
|- account [LIST]
| |- [LIST 1]
| | |- ...
| | |- username = "linux-beginner-for-a-day"
| | '- ...
| |
| |- [LIST 2]
| | |- ...
| | |- username = "C12yp70_H4X012_1337-420"
| | '- ...
| |
| '- ...
|- media_attachements [LIST]
| |- [LIST 1]
| | |- ...
| | |- remote_url = "https://quitter.no/media/ilovefs-banner.png"
| | '- ...
| |
| |- [LIST 2]
| | |- ...
| | |- remote_url = ""
| | '- ...
| |
| '- ...
'- ...

```

Because of this, we can often do a basic assignment, like this:
```
mastodon_lang <- mastodon_toot[[8]]
```

However, in cases such as the time of the posting, we need to use `sub()`,
`gsub()` and `as.numeric()` to extract the data we want (in this case,
splitting the timestamp into separate, numeric date and time variables). We do
something similar with the `uri` variable in the list to extract the name of
the instance.
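
A sketch of how that split can look, given the "YYYY-MM-DDTHH:MM:SS" format of
the `created_at` entries shown in the tree above (the `$created_at` and `$uri`
labels are taken from the `names()` output; adjust them if your version of the
package labels the list differently):
```
#### date (as numeric value in the form YYYYMMDD)
mastodon_date <- as.numeric(gsub(pattern = "-", replacement = "",
                                 x = sub(pattern = "T.*", replacement = "",
                                         x = mastodon_toot$created_at)))

#### time (as numeric value in the form HHMMSS)
mastodon_time <- as.numeric(gsub(pattern = ":", replacement = "",
                                 x = sub(pattern = ".*T", replacement = "",
                                         x = mastodon_toot$created_at)))

#### instance name, extracted from the uri
#### (e.g. "tag:quitter.no,2018-01-22:..." becomes "quitter.no")
mastodon_insta <- sub(pattern = "tag:", replacement = "",
                      x = sub(pattern = ",.*", replacement = "",
                              x = mastodon_toot$uri))
```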

URLs and hashtags are embedded as HTML in the posting-text, so we need to strip
the markup without removing anything else. If you do not understand the regex
here, make sure to check out [regexr.com](https://regexr.com/):
```
mastodon_txt <- gsub(pattern = "<.*?>", x = mastodon_toot[[10]], replacement = "")
```

Besides that, we should also try to identify bots, which are very common in the
fediverse and post about things like "Trending Hashtags". Of course, this is
problematic for us, as such posts most definitely cannot be considered
participation. We can sort bots out either by their account-id or by their
name. I went for the name in this case, as there may be multiple "TrendingBot"
accounts scattered throughout the fediverse. For this, we need to go through
each "lower list" containing the account information and note down which ones
are bots and which are not. If we identify a poster as a bot, we give the
variable `mastodon_bot` the value `TRUE` for this position and `FALSE`
otherwise. Just like when extracting information from the lower `list()` items
in the `twitteR` package, we first need to create an empty `vector()` item and
fill it with the help of a for-loop:
```
#### the account information sits in the "account" sub-list (see tree above)
mastodon_pers <- mastodon_toot$account

mastodon_bot <- c()
for(i in 1:length(mastodon_pers)){
  if(mastodon_pers[[i]]$username == "TrendingBot"){
    mastodon_bot[i] <- TRUE
  } else {
    mastodon_bot[i] <- FALSE
  }
}
```
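
As a quick sanity check after the loop, `sum(mastodon_bot)` gives the number of
postings flagged as bot-posts, since each `TRUE` counts as `1`.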

### Creating the finished dataset

Even once we have scraped all information, we are still dealing with "dirty"
data here. We have already identified bots, but haven't removed them yet. We
also haven't set a date-range within which we want to collect data.
Additionally, we should sort out "private" postings, as we want to publish our
data and should not leak the thoughts of someone who clearly does not want
them to be public. Admittedly, it is to be expected that there is close to no
person who

* a) white-listed your account to see their private postings
* b) posts about #ilovefs in a private post

but we should keep it in mind regardless.

To identify posts to be excluded, we can simply use the `which()` function in
conjunction with a condition for each attribute and bind the results together
with the `c()` (or "combine") function. Here we include the previously
identified bots and the condition that the "date" has to be lower than
(before) a certain numeric value in the form of "YYYYMMDD". Lastly, we exclude
everything that is not marked as "public":
```
mastodon_exclude <- c(which(mastodon_bot),
                      which(mastodon_date < 20180101),
                      which(mastodon_priv != "public"))
```

Before we create the `data.frame()` item, we can drop the `mastodon_` prefix
from the variables, as the name of the dataset itself already makes clear what
the source of the data is. We can also strip out the posts we don't want in
there, whose positions are listed in the `mastodon_exclude` variable:
```
date <- mastodon_date[-mastodon_exclude]
time <- mastodon_time[-mastodon_exclude]
lang <- mastodon_lang[-mastodon_exclude]
inst <- mastodon_insta[-mastodon_exclude]
text <- mastodon_txt[-mastodon_exclude]
link <- mastodon_url[-mastodon_exclude]
favs <- mastodon_fav[-mastodon_exclude]
imag <- mastodon_img[-mastodon_exclude]
```

As before with the Twitter-data, we combine these newly created variables into
a `data.frame()` item: we first turn them into a matrix by binding the vectors
as columns with `cbind()`, then turn that into the finished dataset called
`mastodon` with `data.frame()`:
```
mastodon <- data.frame(cbind(date, time, lang, inst, text, link, favs, imag))
```

As this usually re-defines the variables as `factor()`, we will use `within()`
again to give them the correct mode:
```
mastodon <- within(data = mastodon, expr = {
  date <- as.numeric(as.character(date));
  time <- as.numeric(as.character(time));
  text <- as.character(text);
  link <- as.character(link);
})
```

The dataset can now be exported. Skip down to the
[Exporting-Section](#exporting-datasets) to learn how.
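
As a small preview of what that section covers, an export could look roughly
like this (the file name is made up; CSV is just one of several sensible
formats):
```
write.csv(x = mastodon, file = "ilovefs-mastodon.csv")
```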

* * *

docs/collector.pdf (BIN)

