
Add: Documentation for the Reddit Section of the Collector

pull/2/head
janwey, 1 year ago
commit 490c2eefd1
3 changed files with 267 additions and 10 deletions

  1. collecto.R (+71, -1)
  2. docs/collector.md (+196, -9)
  3. docs/collector.pdf (BIN)

collecto.R (+71, -1)

@@ -16,6 +16,10 @@ install.packages("devtools")
# requires libssl-dev
devtools::install_github("ThomasChln/mastodon")
library("mastodon")

### Reddit
install.packages("RedditExtractoR")
library("RedditExtractoR")
# }}}

## Twitter Collector {{{ ----
@@ -261,6 +265,66 @@ mastodon <- within(data = mastodon, expr = {
})
# }}}

## Reddit Collector {{{ ----

### Authentication at Reddit
# no authentication necessary, hence we can directly start scraping

### Get posts on Reddit
reddit_post_dirty <- reddit_urls(search_terms = "ilovefs",
                                 #subreddit = "freesoftware linux opensource",
                                 cn_threshold = 0,
                                 page_threshold = 99999,
                                 sort_by = "new",
                                 wait_time = 5)

### Only use posts from the current year
reddit_searchinyear <- 18 # has to have format "YY", e.g. "18" for "2018"
reddit_post_year <- gsub(x = reddit_post_dirty$date,
                         pattern = "\\d.-\\d.-",
                         replace = "")
reddit_post <- reddit_post_dirty[which(reddit_post_year == reddit_searchinyear),]

### Extracting relevant variables
comt <- c() # Comments / Replies
subr <- c() # Subreddit
ptns <- c() # Points / Score
ttle <- c() # Title
text <- c() # Text / Content
link <- c() # Linked to Website
date <- c() # Date
rurl <- c() # Reddit-URL of post
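# fill the vectors post by post; per-post details (score, text, link) come from reddit_content()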
for(i in c(1:length(reddit_post$URL))){
comt[i] <- reddit_post$num_comments[i]
ttle[i] <- reddit_post$title[i]
rurl[i] <- reddit_post$URL[i]
date[i] <- gsub(x = reddit_post$date[i], pattern = "-", replace = "")
subr[i] <- reddit_post$subreddit[i]

reddit_content <- reddit_content(URL = reddit_post$URL[i])
ptns[i] <- reddit_content$post_score
text[i] <- reddit_content$post_text
link[i] <- reddit_content$link
}

### Creating dataframe
reddit <- data.frame(cbind(date, rurl, link, text, ttle, ptns, subr, comt))

#### Clean-Up
rm(list = c("date", "rurl", "link", "text", "ttle", "ptns", "subr", "comt"))

reddit <- within(data = reddit, expr = {
date <- as.numeric(as.character(date));
rurl <- as.character(rurl);
link <- as.character(link);
text <- as.character(text);
ttle <- as.character(ttle);
ptns <- as.numeric(as.character(ptns));
subr <- as.character(subr);
comt <- as.numeric(as.character(comt));
})
# }}}

### Exporting data {{{ ----

time_of_saving <- sub(x = Sys.time(), pattern = " CET", replace = "")
@@ -269,7 +333,7 @@ time_of_saving <- gsub(x = time_of_saving, pattern = ":", replace = "-")

#### RData
save_path <- paste0("./data/ilovefs-all_", time_of_saving, ".RData")
save(list = c("twitter", "mastodon"), file = save_path)
save(list = c("twitter", "mastodon", "reddit"), file = save_path)

#### Text
##### Fediverse
@@ -278,6 +342,9 @@ write.table(mastodon, file = save_path_fed_t)
##### Twitter
save_path_twitter_t <- paste0("./data/ilovefs-twitter_", time_of_saving, ".txt")
write.table(twitter, file = save_path_twitter_t)
##### Reddit
save_path_reddit_t <- paste0("./data/ilovefs-reddit_", time_of_saving, ".txt")
write.table(reddit, file = save_path_reddit_t)

#### CSV
##### Fediverse
@@ -286,4 +353,7 @@ write.csv(mastodon, file = save_path_fed_c)
##### Twitter
save_path_twitter_c <- paste0("./data/ilovefs-twitter_", time_of_saving, ".csv")
write.csv(twitter, file = save_path_twitter_c)
##### Reddit
save_path_reddit_c <- paste0("./data/ilovefs-reddit_", time_of_saving, ".csv")
write.csv(reddit, file = save_path_reddit_c)
# }}}

docs/collector.md (+196, -9)

@@ -6,8 +6,10 @@
* [Packages used and the Package section](#packages)
* [The twittR package](#the-twitter-package)
* [The Mastodon package](#the-mastodon-package)
* [The RedditExtractoR package](#the-redditextractor-package)
* [Collecting from Twitter](#twitter)
* [Collecting from the Fediverse](#fediverse)
* [Collecting from Reddit](#reddit)
* [Exporting the Datasets](#exporting-datasets)


@@ -15,11 +17,11 @@

## The Script
The R script documented here has a modular structure. It is divided into 2
-sections that handle loading the section necessary for the process and exporting
-the aggregated data into usable formats in the end. The remaining sections
-handle one specific data source each (eg.: Twitter, Mastodon, ...). While the
-[Package-Section](#packages) is obviously necessary for the remaining sections
-(depending on which ones you actually want to use) as well as the
sections that handle loading the packages necessary for the process and
exporting the aggregated data into usable formats in the end. The remaining
sections handle one specific data source each (e.g. Twitter, Mastodon, Reddit).
While the [Package-Section](#packages) is obviously necessary for the remaining
sections (depending on which ones you actually want to use) as well as the
[Export-Section](#exporting) for actually using the data in other applications,
scripts or by other people, you can cherry-pick between the
[Datasource-Sections](#datasources). These can be used independently and in no
@@ -28,8 +30,10 @@ particular order to another. Keep that in mind, if you only want to analyze *X*.
As a side-note, the script is written to keep the data collected as anonymous as
possible; however, because we deal with a rather small sample and because of the
nature of social media, it is in most cases still possible to track down each
-specific user in the resulting data. While time and date of the posting are
-mostly unproblematic
specific user in the resulting data. As we only access public postings, it is
safe to assume that people want their posts to be seen anyway, so it is not as
problematic as it may seem. Nevertheless, we should still treat the data with
care and avoid leaking metadata where possible.

* * *

@@ -39,6 +43,7 @@ being used:

* [twitteR](https://cran.r-project.org/package=twitteR) (Version 1.1.9)
* [mastodon](https://github.com/ThomasChln/mastodon) (Commit [a6815b6](https://github.com/ThomasChln/mastodon/commit/a6815b6fb626960ffa02bd407b8f05d84bd0f549))
* [RedditExtractoR](https://cran.r-project.org/package=RedditExtractoR) (Version 2.0.2)

### The twitteR package
twitteR has a rather extensive
@@ -99,6 +104,24 @@ Installing and loading the mastodon package would look like this:
Also note, that `devtools` requires the development files of *libssl* to be
installed on your machine.

### The RedditExtractoR package
RedditExtractoR has a rather extensive
[documentation](https://cran.r-project.org/web/packages/RedditExtractoR/RedditExtractoR.pdf)
but no general "in-R-manual". You can however look up a specific function within
the package by entering `?function`into the R-promt, replacing `function` with
its actual name. RedditExtractoR has several useful function to scrape
Reddit-Posts or create fancy graphs from them. In our case, we only need two
very basic functions that will be discussed in the [Reddit-Section](#reddit)
later on:
```
reddit_urls() # searching Reddit for a particular string
reddit_content() # scrape data of an individual post
```
You may have noticed that there is no "authenticate" command within this
package. As of now, the Reddit API does not require authentication, as all posts
are meant for general consumption anyway. This may or may not change in the
future, so keep an eye on it.
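
For completeness, installing and loading the package works just like with the
other packages above; this mirrors the corresponding lines in `collecto.R`:
```
install.packages("RedditExtractoR")
library("RedditExtractoR")
```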

* * *

## Twitter
@@ -673,6 +696,170 @@ The dataset can now be exported. Skip down to the

* * *

## Reddit
*RedditExtractoR (or actually Reddit) doesn't currently require you to
authenticate. So you can get right into scraping!*

### Scraping Posts
To search for a particular string, we use the `reddit_urls()` function.
Optionally, you can limit the search to certain subreddits, but in most cases it
makes sense to search in all of them. The function takes one mandatory and five
optional arguments:

* the string we want to search for (I am not certain whether this includes the
content/text of the actual posts or only the titles). In our case, **ilovefs**
should work just fine, as this is the name of the campaign and probably what
people will use in their posts
* the subreddits we want to search in. There is no real reason to limit this in
the case of the ILoveFS-Campaign, but it may make sense in other cases. If not
needed, this argument can be commented out with a `#`
* the minimum number of comments a post should have in order to be included. As
we want all posts regardless of their popularity, we should set this to `0`
* how many pages of posts the result should include. The same applies as before:
we want all posts, so we set this to a very high number like `99999`
* the sort order of the results. This doesn't really matter, as we try to scrape
all posts containing our search string. You can most likely leave it out or
set it to `new`
* the wait time between API requests, in seconds. The minimum (API limit) is 2
seconds, but if you want to be safe, set it slightly higher

The result is saved to the variable `reddit_post_dirty`, where *dirty* indicates
that we have not yet filtered out posts older than this year's event:
```
reddit_post_dirty <- reddit_urls(search_terms = "ilovefs",
                                 #subreddit = "freesoftware linux opensource",
                                 cn_threshold = 0,
                                 page_threshold = 99999,
                                 sort_by = "new",
                                 wait_time = 5)
```

### Stripping out data
The data from the `RedditExtractoR` package comes in an easily usable
`data.frame()` output. Its structure is illustrated below:
```
reddit_post
|
|- date = "14-02-17", "13-02-17", ...
|- num_comments = "23", "15", ...
|- title = "Why I love Opensource #ilovefs", "Please participate in ILoveFS", ...
|- subreddit = "opensource", "linux", ...
'- URL = "https://www.reddit.com/r/opensource/comments/dhfiu/", ...
```

First, we should exclude all postings from previous years. For this, we simply
trim the `date` variable (format: "DD-MM-YY") within the `data.frame()` so that
it only shows the year, and keep only those posts from the current year. We save
the result in the `reddit_post` variable:
```
reddit_searchinyear <- 18
reddit_post_year <- gsub(x = reddit_post_dirty$date,
                         pattern = "\\d.-\\d.-",
                         replace = "")
reddit_post <- reddit_post_dirty[which(reddit_post_year == reddit_searchinyear),]
```
To ease the handling of this process, the year we want to search in is first
assigned to the variable `reddit_searchinyear` in "YY" format (here: "18" for
"2018"). We use `gsub()` to trim the date down to just the year and use
`which()` to determine which posts' year equals `reddit_searchinyear`.
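
As a quick illustration (the date value here is made up for the example), the
`gsub()` pattern strips the day and month, leaving only the two-digit year:
```
gsub(x = "14-02-18", pattern = "\\d.-\\d.-", replace = "")
# [1] "18"
```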

Afterwards, we can use a single for-loop to extract all relevant variables. We
simply create an empty `vector()` for each variable:
```
comt <- c()
subr <- c()
ttle <- c()
date <- c()
rurl <- c()
```
And fill the appropriate position on the vector with the corresponding value.
We do this for each scraped post:
```
for(i in c(1:length(reddit_post$URL))){
comt[i] <- reddit_post$num_comments[i]
ttle[i] <- reddit_post$title[i]
rurl[i] <- reddit_post$URL[i]
date[i] <- gsub(x = reddit_post$date[i], pattern = "-", replace = "")
subr[i] <- reddit_post$subreddit[i]
...
}
```
However, not all of the relevant data is contained in the `reddit_post` dataset.
We need another function from the `RedditExtractoR` package, called
`reddit_content()`, which also gives us the score, text and linked-to website of
the post. As its only argument, this function needs the URL of a post, which is
contained in the previously mentioned `data.frame()`:
```
reddit_content <- reddit_content(URL = reddit_post$URL[1])
```
The resulting variable `reddit_content` is another `data.frame()` with a similar
structure as the previously used `reddit_post`:
```
reddit_content
|
|- ...
|- num_comments = "20"
|- ...
|- post_score = "15"
|- ...
|- post_text = "I really do love this software because..."
|- link = "https://cran.r-project.org"
'- ...
```
Since we need to do this for every single post, we can include it in our
for-loop. Putting everything together:
```
comt <- c()
subr <- c()
ptns <- c()
ttle <- c()
text <- c()
link <- c()
date <- c()
rurl <- c()
for(i in c(1:length(reddit_post$URL))){
comt[i] <- reddit_post$num_comments[i]
ttle[i] <- reddit_post$title[i]
rurl[i] <- reddit_post$URL[i]
date[i] <- gsub(x = reddit_post$date[i], pattern = "-", replace = "")
subr[i] <- reddit_post$subreddit[i]

reddit_content <- reddit_content(URL = reddit_post$URL[i])
ptns[i] <- reddit_content$post_score
text[i] <- reddit_content$post_text
link[i] <- reddit_content$link
}
```

### Creating the finished dataset
As we do not really need to *filter* anything out (we have already done so with
the dates before), we can directly bind our variables into a `data.frame()`. As
with the other data sources (e.g. [Twitter](#twitter)), we create a matrix with
the `cbind()` function, which can be turned into the finished dataset with
`data.frame()`, assigning it to the variable `reddit`:
```
reddit <- data.frame(cbind(date, rurl, link, text, ttle, ptns, subr, comt))
```
This usually re-defines every single variable within the dataset as `factor()`,
so we use the `within()` function to change their mode:
```
reddit <- within(data = reddit, expr = {
date <- as.numeric(as.character(date));
rurl <- as.character(rurl);
link <- as.character(link);
text <- as.character(text);
ttle <- as.character(ttle);
ptns <- as.numeric(as.character(ptns));
subr <- as.character(subr);
comt <- as.numeric(as.character(comt));
})
```
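If you want to double-check the result, a quick optional look at the column
modes can be taken with base R's `str()` (not part of the script itself):
```
str(reddit)
```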
The dataset can now be exported. Skip down to the
[Exporting-Section](#exporting-datasets) to learn how.

* * *

## Exporting Datasets
There are several reasons why we want to export our data:

@@ -724,14 +911,14 @@ respectively:

Next, we construct the save path we want the data to be exported to, for which
we can use `paste0()`. For example, to save the `.RData` file, we want to export to
-the data/` folder into the file `ilovefs-all_YYYY-MM-DD_HH-MM-SS.RData`:
the `data/` folder:
```
save_path <- paste0("./data/ilovefs-all_", time_of_saving, ".RData")
```
*Note: using `paste()` instead of `paste0()` will create a space between the
strings, which we do not want here.*
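
A minimal illustration of the difference (the file name is just an example):
```
paste("./data/", "example.RData")   # "./data/ example.RData" -- unwanted space
paste0("./data/", "example.RData")  # "./data/example.RData"
```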

-We follow a similar approch for the individual `.txt`files, also adding the
We follow a similar approach for the individual `.txt` files, also adding the
name of the source into the filename (as they will only hold one data source
each). For example:
```

docs/collector.pdf (BIN)

