
Edit: Formatting fixes, update documentation for the Twitter-Performance-Fix

janwey 1 year ago
commit 4ff78b8a1d
2 changed files with 17 additions and 2 deletions
  1. docs/collector.md (+17 -2)
  2. docs/collector.pdf (BIN)

docs/collector.md (+17 -2)

@@ -59,6 +59,7 @@ be discussed individually, later on:
59 59
   searchTwitter()       # searching Twitter for a particular string
60 60
   strip_retweets()      # exclude Retweets in the results
61 61
 ```
62
+
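As a rough sketch of how these two calls work together (the search string, result count and variable name are made-up examples, and a working OAuth setup is assumed beforehand):
```
  # sketch only: example values, prior authentication assumed
  twitter_searchresult <- searchTwitter(searchString = "#ilovefs", n = 100)
  twitter_searchresult <- strip_retweets(tweets = twitter_searchresult)
```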
62 63
 As a side-note: I had to install the
63 64
 [httr-package](https://cran.r-project.org/web/packages/httr/index.html) - a
64 65
 dependency of twitteR - from the repositories of my distribution of choice, as
@@ -89,6 +90,7 @@ analysis:
89 90
   login()       # authentication / generating an auth-token
90 91
   get_hashtag() # search the fediverse for posts that include a specific hashtag
91 92
 ```
93
+
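A sketch of how the two calls could fit together (instance URL, credentials, hashtag and variable names are placeholders, not values from this project):
```
  # sketch only: placeholder instance and credentials
  mastodon_auth <- login(instance = "https://mastodon.social",
                         user = "user@example.com",
                         pass = "verysecurepassword")
  mastodon_posts <- get_hashtag(token = mastodon_auth,
                                hashtag = "ilovefs",
                                local = FALSE,
                                n = 100)
```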
92 94
 Note:
93 95
 as this package is not hosted on CRAN but on GitHub, you cannot install it with
94 96
 `install.packages()` like the other packages. The easiest way is to install it
@@ -101,6 +103,7 @@ Installing and loading the mastodon package would look like this:
101 103
   devtools::install_github(repo = "ThomasChln/mastodon")
102 104
   library("mastodon")
103 105
 ```
106
+
104 107
 Also note that `devtools` requires the development files of *libssl* to be
105 108
 installed on your machine.
106 109
 
@@ -117,6 +120,7 @@ later on:
117 120
   reddit_urls()	        # searching Reddit for a particular string
118 121
  reddit_content()      # scrape data of an individual post
119 122
 ```
123
+
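A sketch of what a search could look like (the search term and paging values are examples; the function takes further filter arguments that are omitted here):
```
  # sketch only: example search term and paging values
  reddit_post_dirty <- reddit_urls(search_terms = "ilovefs",
                                   page_threshold = 1,
                                   wait_time = 2)
```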
120 124
 You may have noticed that there is no "authenticate" command within this
121 125
 package. As of now, the Reddit-API does not require authentication, as all posts
122 126
 are for general consumption anyway. This may or may not change in the future,
@@ -258,12 +262,14 @@ property, as shown by the illustration below:
258 262
     '- ...
259 263
 ```
260 264
 
261
-The inconvenience about this structure stems from that we need to use for-loops
265
+The inconvenience of this structure is that we need to use a for-loop
262 266
 in order to run through each lower `list()` item and extract its variables
263 267
 individually.
264 268
 
265 269
 For the sake of keeping this short, this documentation only explains the
266
-extraction of a single argument, namely the Client used to post a Tweet.
270
+extraction of a single argument, namely the Client used to post a Tweet. All
271
+other information is scraped in a very similar fashion.
272
+
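As a sketch of the overall pattern before the step-by-step walkthrough (the search-result variable is a made-up name; `statusSource` is the field of twitteR's status objects that holds the client string):
```
  # sketch only: variable names assumed, statusSource holds the client
  twitter_client <- c()
  for(i in c(1:length(twitter_searchresult))){
    twitter_client[i] <- twitter_searchresult[[i]]$statusSource
  }
```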
267 273
 Firstly, we create a new, empty `vector()` item called `twitter_client` with the
268 274
 "combine" command (or `c()` for short). Usually you do not have to pre-define
269 275
 empty vectors in R, since one will be created automatically if you assign it a
@@ -613,6 +619,7 @@ to mind is the "TrendingBot", we do it with a simple if-statement:
613 619
     ...
614 620
   }
615 621
 ```
622
+
616 623
 *Note: you can use multiple Bot-names by adding "|" (or) followed by another
617 624
 botname to the statement.*
618 625
 
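For example, extending the check to a second, made-up bot account could look like this (the `mastodon_user` variable name is assumed here):
```
  # sketch only: "OtherBot" stands in for any further bot name
  if(mastodon_user[i] == "TrendingBot" | mastodon_user[i] == "OtherBot"){
    ...
  }
```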
@@ -759,6 +766,7 @@ result in the `reddit_post` variable:
759 766
 			   replace = "")
760 767
   reddit_post <- reddit_post_dirty[which(reddit_post_year == reddit_searchfromyear),]
761 768
 ```
769
+
762 770
 To ease the handling of this process, the year we want to search in is assigned
763 771
 to the variable `reddit_searchfromyear` in a "YY" format first (here: "18" for
764 772
 "2018"). We use `gsub()` to trim the date to just display the year and use
@@ -773,6 +781,7 @@ simply create an empty `vector()` for each variable:
773 781
   date <- c()
774 782
   rurl <- c()
775 783
 ```
784
+
776 785
 We then fill the appropriate position in each vector with the corresponding value.
777 786
 We do this for each scraped post:
778 787
 ```
@@ -785,6 +794,7 @@ We do this for each scraped post:
785 794
     ...
786 795
   }
787 796
 ```
797
+
788 798
 However, not all of the relevant data is contained in the `reddit_post` dataset.
789 799
 We need another function from the `RedditExtractoR` package, called
790 800
 `reddit_content()`, which also gives us the score, text and linked-to
@@ -793,6 +803,7 @@ which is contained in our previously mentioned `data.frame()`:
793 803
 ```
794 804
   reddit_content <- reddit_content(URL = reddit_post$URL[1])
795 805
 ```
806
+
796 807
 The resulting variable `reddit_content` is another `data.frame()` with a similar
797 808
 structure to the previously used `reddit_post`:
798 809
 ```
@@ -807,6 +818,7 @@ structure to the previously used `reddit_post`:
807 818
     |- link = "https://cran.r-project.org"
808 819
     '- ...
809 820
 ```
821
+
810 822
 Since we need to do this for every single post, we can include this in our
811 823
 for-loop. Because we call the function with only one post-URL at a time, we can
812 824
 set the wait time between requests to zero. However, the for-loop will call the
@@ -845,6 +857,7 @@ the `cbind()` function, which can be turned into the finished dataset with
845 857
 ```
846 858
   reddit <- data.frame(cbind(date, rurl, link, text, ttle, ptns, subr, comt))
847 859
 ```
860
+
848 861
 This usually re-defines every single variable within the dataset as `factor()`,
849 862
 so we use the `within()` function to change their mode:
850 863
 ```
@@ -859,6 +872,7 @@ so we use the `within()` function to change their mode:
859 872
 		   comt <- as.numeric(as.character(comt));
860 873
 		  })
861 874
 ```
875
+
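The reason for the `as.numeric(as.character(...))` chain is that calling `as.numeric()` on a `factor()` directly returns the internal level codes instead of the printed values:
```
  factor_example <- factor(c("10", "20", "20"))
  as.numeric(factor_example)                # 1 2 2  <- level codes
  as.numeric(as.character(factor_example))  # 10 20 20
```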
862 876
 The dataset can now be exported. Skip down to the
863 877
 [Exporting-Section](#exporting-datasets) to learn how.
864 878
 
@@ -919,6 +933,7 @@ the `data/` folder:
919 933
 ```
920 934
   save_path <- paste0("./data/ilovefs-all_", time_of_saving, ".RData")
921 935
 ```
936
+
922 937
 *Note: using `paste()` instead of `paste0()` will create a space between each
923 938
 string, which we do not want here.*
924 939
 
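To illustrate the difference:
```
  paste("./data/", "ilovefs-all")   # "./data/ ilovefs-all"  <- unwanted space
  paste0("./data/", "ilovefs-all")  # "./data/ilovefs-all"
```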

docs/collector.pdf (BIN)

