
Edit: Formatting fixes, update documentation for the Twitter-Performance-Fix

pull/3/head
janwey 1 year ago
Parent commit: 4ff78b8a1d
2 changed files with 17 additions and 2 deletions
  1. docs/collector.md (+17, -2)
  2. docs/collector.pdf (BIN)

docs/collector.md (+17, -2)

@@ -59,6 +59,7 @@ be discussed individually, later on:
searchTwitter() # searching Twitter for a particular string
strip_retweets() # exclude Retweets in the results
```

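For orientation, a rough usage sketch of these two functions (the search string and result count are purely illustrative, and authentication, for example via twitteR's `setup_twitter_oauth()`, is assumed to have happened beforehand):

```
# illustrative only: search for a hashtag and drop Retweets from the results
twitter_search <- searchTwitter(searchString = "#ilovefs", n = 100)
twitter_search <- strip_retweets(twitter_search)
```
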
As a side note, I had to install the
[httr-package](https://cran.r-project.org/web/packages/httr/index.html) - a
dependency of twitteR - from the repositories of my distribution of choice, as
@@ -89,6 +90,7 @@ analysis:
login() # authentication / generating an auth-token
get_hashtag() # search the fediverse for posts that include a specific hashtag
```

Note:
as this package is not hosted on CRAN but on GitHub, you cannot install it with
`install.packages()` like the other packages. The easiest way is to install it
@@ -101,6 +103,7 @@ Installing and loading the mastodon package would look like this:
devtools::install_github(repo = "ThomasChln/mastodon")
library("mastodon")
```

Also note that `devtools` requires the development files of *libssl* to be
installed on your machine.

@@ -117,6 +120,7 @@ later on:
reddit_urls() # searching Reddit for a particular string
reddit_content() # scrape data of an individual post
```

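As a rough sketch of how such a search could look (the search term is illustrative, and the parameter name `search_terms` is an assumption, not confirmed by this document):

```
# illustrative only: search Reddit for posts mentioning a term
reddit_post_dirty <- reddit_urls(search_terms = "ilovefs")
```

The result is stored here in `reddit_post_dirty`, the variable name used further down in this document.
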
You may have noticed that there is no "authenticate" command within this
package. As of now, the Reddit API does not require authentication, as all posts
are for general consumption anyway. This may or may not change in the future,
@@ -258,12 +262,14 @@ property, as shown by the illustration below:
'- ...
```

-The inconvenience about this structure stems from that we need to use for-loops
+The inconvenience about this structure stems from the fact that we need to use a for-loop
in order to run through each lower `list()` item and extract its variables
individually.

For the sake of keeping this short, this documentation only explains the
-extraction of a single argument, namely the Client used to post a Tweet.
+extraction of a single argument, namely the Client used to post a Tweet. All
+other information is scraped in a very similar fashion.
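
As a rough sketch of what such a loop could look like (assuming the search results are stored in a list called `twitter_search`, a name chosen here purely for illustration, and that each status object exposes the posting client in its `statusSource` field):

```
# sketch only: 'twitter_search' stands for the list returned by searchTwitter()
twitter_client <- c()
for (i in 1:length(twitter_search)) {
  # statusSource holds the client string of the i-th Tweet
  twitter_client[i] <- twitter_search[[i]]$statusSource
}
```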

Firstly, we create a new, empty `vector()` item called `twitter_client` with the
"combine" command (or `c()` for short). Usually you do not have to pre-define
empty vectors in R, but it will be created automatically if you assign it a
@@ -613,6 +619,7 @@ to mind is the "TrendingBot", we do it with a simple if-statement:
...
}
```

*Note: you can use multiple bot names by adding "|" (or) followed by another
bot name to the statement.*
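
One way such a filter could be expressed (sketch only; `account_name` is a hypothetical variable holding the account that wrote the post, and the second bot name is made up):

```
# sketch only: skip posts written by known bots
if (!grepl("TrendingBot|AnotherBot", account_name)) {
  # keep the post, it was not written by a known bot
}
```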

@@ -759,6 +766,7 @@ result in the `reddit_post` variable:
replace = "")
reddit_post <- reddit_post_dirty[which(reddit_post_year == reddit_searchfromyear),]
```

To ease the handling of this process, the year we want to search in is assigned
to the variable `reddit_searchinyear` in a "YY" format first (here: "18" for
"2018"). We use `gsub()` to trim the date to just display the year and use
@@ -773,6 +781,7 @@ simply create an empty `vector()` for each variable:
date <- c()
rurl <- c()
```

We then fill the appropriate position in each vector with the corresponding value.
We do this for each scraped post:
```
@@ -785,6 +794,7 @@ We do this for each scraped post:
...
}
```

However, not all of the relevant data is contained in the `reddit_post` dataset.
We need another function from the `RedditExtractoR` package, called
`reddit_content()`, which is also able to give us the score, text and linked-to
@@ -793,6 +803,7 @@ which is contained in our previously mentioned `data.frame()`:
```
reddit_content <- reddit_content(URL = reddit_post$URL[1])
```

The resulting variable `reddit_content` is another `data.frame()` with a
structure similar to the previously used `reddit_post`:
```
@@ -807,6 +818,7 @@ structure as the previously used `reddit_post`:
|- link = "https://cran.r-project.org"
'- ...
```

Since we need to do this for every single post, we can include this in our
for-loop. Because we call the function with only one post-URL at a time, we can
set the wait time between requests to zero. However, the for-loop will call the
@@ -845,6 +857,7 @@ the `cbind()` function, which can be turned into the finished dataset with
```
reddit <- data.frame(cbind(date, rurl, link, text, ttle, ptns, subr, comt))
```

This usually re-defines every single variable within the dataset as `factor()`,
so we use the `within()` function to change their mode:
```
@@ -859,6 +872,7 @@ so we use the `within()` function to change their mode:
comt <- as.numeric(as.character(comt));
})
```

The dataset can now be exported. Skip down to the
[Exporting-Section](#exporting-datasets) to learn how.

@@ -919,6 +933,7 @@ the `data/` folder:
```
save_path <- paste0("./data/ilovefs-all_", time_of_saving, ".RData")
```

*Note: using `paste()` instead of `paste0()` would insert a space between the
strings, which we do not want here.*
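
A quick illustration of the difference (the date string is just an example):

```
paste("./data/ilovefs-all_", "2018-02-14", ".RData")
# [1] "./data/ilovefs-all_ 2018-02-14 .RData"
paste0("./data/ilovefs-all_", "2018-02-14", ".RData")
# [1] "./data/ilovefs-all_2018-02-14.RData"
```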


docs/collector.pdf (BIN)

