Edit: Formatting fixes, update documentation for the Twitter-Performance-Fix

janwey, 1 year ago
2 changed files with 17 additions and 2 deletions
docs/collector.md

@@ -59,6 +59,7 @@ be discussed individually, later on:
searchTwitter() # searching Twitter for a particular string
strip_retweets() # exclude Retweets in the results

As a side note, I had to install the
[httr-package](https://cran.r-project.org/web/packages/httr/index.html), a
dependency of twitteR, from the repositories of my distribution of choice, as
@@ -89,6 +90,7 @@ analysis:
login() # authentication / generating an auth-token
get_hashtag() # search the fediverse for posts that include a specific hashtag

As this package is not hosted on CRAN but on GitHub, you cannot install it with
`install.packages()` like the other packages. The easiest way is to install it
@@ -101,6 +103,7 @@ Installing and loading the mastodon package would look like this:
devtools::install_github(repo = "ThomasChln/mastodon")

Also note that `devtools` requires the development files of *libssl* to be
installed on your machine.

@@ -117,6 +120,7 @@ later on:
reddit_urls() # searching Reddit for a particular string
reddit_content() # scrape data of an individual post

You may have noticed that there is no "authenticate" command within this
package. As of now, the Reddit-API does not require authentication, as all posts
are for general consumption anyway. This may or may not change in the future,
@@ -258,12 +262,14 @@ property, as shown by the illustration below:
'- ...

-The inconvenience about this structure stems from that we need to use for-loops
+The inconvenience of this structure stems from the fact that we need to use a for-loop
in order to run through each lower `list()` item and extract its variables

For the sake of keeping this short, this documentation only explains the
-extraction of a single argument, namely the Client used to post a Tweet.
+extraction of a single argument, namely the Client used to post a Tweet. All
+other information is scraped in a very similar fashion.
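
As a rough illustration of the loop described here, the sketch below uses a plain nested `list()` in place of real search results. The variable `twitter_tw`, the field name `statusSource` and the sample values are assumptions made for this example only; the objects actually returned by `searchTwitter()` may be structured differently.

```r
# Hypothetical stand-in for the search results: a list of lists, each
# holding the properties of one tweet (field names are assumed).
twitter_tw <- list(
  list(text = "I love Free Software!", statusSource = "Twitter Web Client"),
  list(text = "Happy #ilovefs day",    statusSource = "Twidere")
)

# Pre-define an empty vector, then fill one position per list item.
twitter_client <- c()
for (i in seq_along(twitter_tw)) {
  twitter_client[i] <- twitter_tw[[i]]$statusSource
}

print(twitter_client)
```

Using `seq_along()` rather than `1:length()` keeps the loop from running at all when the list is empty.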

First, we create a new, empty `vector()` item called `twitter_client` with the
"combine" command (or `c()` for short). Usually you do not have to pre-define
empty vectors in R, as one will be created automatically if you assign it a
@@ -613,6 +619,7 @@ to mind is the "TrendingBot", we do it with a simple if-statement:

*Note: you can match multiple bot names by adding "|" (or) followed by another
bot name to the statement.*
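
To make the "|" note concrete, here is a minimal sketch. The original if-statement is not shown in this excerpt, so the sketch uses a vectorised `grepl()` call instead; the vector contents and the second bot name, `ExampleBot`, are hypothetical.

```r
# Hypothetical client names; only "TrendingBot" is taken from the text.
twitter_client <- c("Twitter Web Client", "TrendingBot", "ExampleBot", "Twidere")

# "|" in a regular expression means "or", so this matches either bot name.
bot_pattern <- "TrendingBot|ExampleBot"
is_bot <- grepl(pattern = bot_pattern, x = twitter_client)

# Keep only the entries that did not match any bot name.
twitter_client_clean <- twitter_client[!is_bot]
print(twitter_client_clean)
```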

@@ -759,6 +766,7 @@ result in the `reddit_post` variable:
replace = "")
reddit_post <- reddit_post_dirty[which(reddit_post_year == reddit_searchinyear),]

To ease the handling of this process, the year we want to search in is assigned
to the variable `reddit_searchinyear` in a "YY" format first (here: "18" for
"2018"). We use `gsub()` to trim the date to just display the year and use
@@ -773,6 +781,7 @@ simply create an empty `vector()` for each variable:
date <- c()
rurl <- c()

We then fill the appropriate position in each vector with the corresponding value.
We do this for each scraped post:
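
A minimal sketch of this fill-by-position pattern, assuming a small stand-in `data.frame()` with `date` and `URL` columns. The sample values are invented for illustration and the real `reddit_post` dataset may contain more columns.

```r
# Hypothetical stand-in for the filtered search results.
reddit_post <- data.frame(
  date = c("14-02-18", "15-02-18"),
  URL  = c("https://example.org/post1", "https://example.org/post2"),
  stringsAsFactors = FALSE
)

# One empty vector per variable we want to collect.
date <- c()
rurl <- c()

# Fill position i of every vector with the value from post i.
for (i in seq_len(nrow(reddit_post))) {
  date[i] <- reddit_post$date[i]
  rurl[i] <- reddit_post$URL[i]
}
```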
@@ -785,6 +794,7 @@ We do this for each scraped post:

However, not all of the relevant data is contained in the `reddit_post` dataset.
We need another function from the `RedditExtractoR` package, called
`reddit_content()`, which can also give us the score, text and linked-to
@@ -793,6 +803,7 @@ which is contained in our previously mentioned `data.frame()`:
reddit_content <- reddit_content(URL = reddit_post$URL[1])

The resulting variable `reddit_content` is another `data.frame()` with a
structure similar to that of the previously used `reddit_post`:
@@ -807,6 +818,7 @@ structure as the previously used `reddit_post`:
|- link = "https://cran.r-project.org"
'- ...

Since we need to do this for every single post, we can include this in our
for-loop. Because we call the function with only one post-URL at a time, we can
set the wait time between requests to zero. However, the for-loop will call the
@@ -845,6 +857,7 @@ the `cbind()` function, which can be turned into the finished dataset with
reddit <- data.frame(cbind(date, rurl, link, text, ttle, ptns, subr, comt))

This usually re-defines every single variable within the dataset as `factor()`,
so we use the `within()` function to change their mode:
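
The conversion can be sketched on a reduced, hypothetical dataset with three of the columns. `stringsAsFactors = TRUE` is set explicitly here as an assumption, so that the columns become factors regardless of the default in the installed R version.

```r
# Hypothetical two-post dataset; cbind() coerces every value to character.
ttle <- c("First post", "Second post")
ptns <- c(12, 3)
comt <- c(4, 0)
reddit <- data.frame(cbind(ttle, ptns, comt), stringsAsFactors = TRUE)

# Convert each column back to the mode we want. Going through
# as.character() first avoids getting the factor's internal level codes.
reddit <- within(reddit, {
  ttle <- as.character(ttle)
  ptns <- as.numeric(as.character(ptns))
  comt <- as.numeric(as.character(comt))
})

str(reddit)
```

Calling `as.numeric()` directly on a factor would return the level indices rather than the original numbers, which is why the intermediate `as.character()` step matters.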
@@ -859,6 +872,7 @@ so we use the `within()` function to change their mode:
comt <- as.numeric(as.character(comt));

The dataset can now be exported. Skip down to the
[Exporting-Section](#exporting-datasets) to learn how.

@@ -919,6 +933,7 @@ the `data/` folder:
save_path <- paste0("./data/ilovefs-all_", time_of_saving, ".RData")

*Note: using `paste()` instead of `paste0()` will insert a space between the
strings, which we do not want here.*
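
The difference is easy to see side by side. The value assigned to `time_of_saving` below is a made-up timestamp used only for this example.

```r
# Hypothetical timestamp; the documentation builds time_of_saving elsewhere.
time_of_saving <- "2018-02-14_12-00-00"

# paste0() joins the pieces without any separator.
with_paste0 <- paste0("./data/ilovefs-all_", time_of_saving, ".RData")

# paste() uses sep = " " by default and therefore inserts spaces.
with_paste <- paste("./data/ilovefs-all_", time_of_saving, ".RData")

print(with_paste0)  # "./data/ilovefs-all_2018-02-14_12-00-00.RData"
print(with_paste)   # "./data/ilovefs-all_ 2018-02-14_12-00-00 .RData"
```

`paste(..., sep = "")` produces the same result as `paste0()`.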

docs/collector.pdf (binary file)