
Edit: Updated Documentation to reflect the new Fediverse-Section of the script

janwey, 1 year ago
parent commit 13568bdc51
3 changed files with 173 additions and 255 deletions
1. collecto.R (+2 / -2)
2. docs/collector.md (+171 / -253)
3. docs/collector.pdf (BIN)

collecto.R (+2 / -2)

@@ -272,7 +272,7 @@ reto <- c()
 favs <- c()
 murl <- c()
 acct <- c()
-for(i in 1:10){
+for(i in 1:9999999999){
   if(i == 1){
     mastodon_instance <- "https://mastodon.social"
     mastodon_hashtag <- "ilovefs"
@@ -329,7 +329,7 @@ inst <- sub(pattern = "https:\\/\\/", x = inst, replacement = "")
 inst <- sub(pattern = "\\/.*", x = inst, replacement = "")
 
 ### Only include Toots from this year
-mastodon_exclude <- which(date < 20180101)
+mastodon_exclude <- which(date < 20180201)
 date <- date[-mastodon_exclude]
 time <- time[-mastodon_exclude]
 lang <- lang[-mastodon_exclude]

docs/collector.md (+171 / -253)

@@ -5,7 +5,7 @@
 * [General information about the Script](#the-script)
 * [Packages used and the Package section](#packages)
     * [The twitteR package](#the-twitter-package)
-    * [The Mastodon package](#the-mastodon-package)
+    * [The curl and rjson packages](#the-curl-and-rjson-packages)
     * [The RedditExtractoR package](#the-redditextractor-package)
 * [Collecting from Twitter](#twitter)
 * [Collecting from the Fediverse](#fediverse)
@@ -38,11 +38,12 @@ much care and do not leak meta-data if possible.
 * * *
 
 ## Packages
-As of writing this script and its documentation, three scraper-packages are
-being used:
+As of writing this script and its documentation, two platform-specific and two
+general-purpose scraper packages are being used:
 
 * [twitteR](https://cran.r-project.org/package=twitteR) (Version 1.1.9)
-* [mastodon](https://github.com/ThomasChln/mastodon) (Commit [a6815b6](https://github.com/ThomasChln/mastodon/commit/a6815b6fb626960ffa02bd407b8f05d84bd0f549))
+* [curl](https://cran.r-project.org/package=curl) (Version 3.1)
+* [rjson](https://cran.r-project.org/package=rjson) (Version 0.2.15)
 * [RedditExtractoR](https://cran.r-project.org/package=RedditExtractoR) (Version 2.0.2)
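+
+Since both curl and rjson are hosted on CRAN (unlike the previously used
+mastodon package), installing and loading them is straightforward; a minimal
+sketch:
+```
+  install.packages(c("curl", "rjson"))
+  library("curl")
+  library("rjson")
+```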
 
 ### The twitteR package
@@ -61,51 +62,34 @@ be discussed individually, later on:
 ```
 
 As a side-note: I had to install the
-[httr-package](https://cran.r-project.org/web/packages/httr/index.html) - a
-dependency of twitteR - from the Repositories of my distribution of choice, as
-the one provided by CRAN would not compile for some reason. So if you run into a
-similar issue, look for something like `r-cran-httr` in your packagemanager.
+[httr-package](https://cran.r-project.org/package=httr) - a dependency of
+twitteR - from the repositories of my distribution of choice, as the one
+provided by CRAN would not compile for some reason. So if you run into a similar
+issue, look for something like `r-cran-httr` in your package manager.
 
-### The mastodon package
+### The curl and rjson packages
 The good thing about Mastodon is that searches are not restricted to a single
 Mastodon-Instance or to Mastodon at all. If your Instance has enough outbound
 connections (so make sure you choose a very active and inter-communicative one),
 you are able to not only search Mastodon-Instances, but also GNUsocial, Pump.io
-and other compatible Social Media instances. Luckily, this also applies to the
-mastodon-package. Unfortunately, mastodon for R is documented
-[very poorly, if at all](https://github.com/ThomasChln/mastodon/blob/a6815b6fb626960ffa02bd407b8f05d84bd0f549/README.md).
-This brings us in the uncomfortable position, that we need to figure out, what
-the outputs of each function actually mean. Those are not properly labeled
-either, so this is a task of trial'n'error and a lot of guessing. If you have
-time and dedication, feel free to document it properly and open a pull-request
-on the [project's Github page](https://github.com/ThomasChln/mastodon). The
-relevant results that we use in our script are listed in the
-[Mastodon-Section](#mastodon) of this documentation. Again, just like with the
-Rfacebook package, the function-names are very generic and thus it is a good
-idea to prefix them with `mastodon::` to prevent the use of a wrong function
-from another package (eg.: `login()` becomes `mastodon::login()`).
-From the long list of functions in this package, we only need two for our
-analysis:
-```
-  login()       # authentication / generating an auth-token
-  get_hashtag() # search the fediverse for posts that include a specific hashtag
-```
-
-Note:
-as this package is not hosted on CRAN but on github, you can not install it with
-`install.packages()` like the other packages. The easiest way is to install it
-with `install_github()` from the `devtools` package. In order to use
-`install_github()` without loading the library (as we only need it for this one
-time), you can prefix it with its package name. 
-Installing and loading the mastodon package would look like this:
-```
-  install.packages("devtools")
-  devtools::install_github(repo = "ThomasChln/mastodon")
-  library("mastodon")
-```
-
-Also note, that `devtools` requires the development files of *libssl* to be
-installed on your machine.
+and other compatible Social Media instances. Previously, we used a specialized
+package for scraping the Fediverse, simply called
+[mastodon](https://github.com/ThomasChln/mastodon); however, it proved to be
+unreliable, poorly documented and probably even unmaintained. Luckily, Mastodon
+as an open-source platform also has an open API we can access with simple tools
+like [curl](https://cran.r-project.org/package=curl) and
+[rjson](https://cran.r-project.org/package=rjson). Specifically, we use the
+following functions of the `curl` package:
+```
+  curl_fetch_memory() # fetch the full HTTP response (headers and body) into memory
+  parse_headers()     # split the raw HTTP headers into readable strings
+```
+
+This will generate output in JSON format, which we can transform into a
+`list()` item with a function from the `rjson` package:
+```
+  fromJSON() # transform JSON to a list() item
+```
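+
+For illustration, a minimal sketch of what `fromJSON()` returns (the JSON
+string here is made up, not real API output):
+```
+  library("rjson")
+  toy <- fromJSON('[{"content": "<3 #ilovefs", "favourites_count": 3}]')
+  toy[[1]]$favourites_count # a list() of posts; this element is 3
+```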
 
 ### The RedditExtractoR package
 RedditExtractoR has a rather extensive
@@ -367,7 +351,7 @@ using wrong variables.
 Additionally, we still need to split up the `twitter_timedate` variable, which
 currently contains the point of time of the tweet in the form of
 `YYYY-MM-DD HH:MM:SS`. For this, we again use regex and the function `sub()`.
-As `sub()` only replaces the first instance of the pattern given to it, if we
+As `sub()` only replaces the first instance of the pattern given to it, if we
 have multiple occurrences of a given pattern, we need to use `gsub()` (for global
 substitute).
 
@@ -436,255 +420,188 @@ about how to export the data, so it can be used outside your current R-Session.
 * * *
 
 ## Fediverse
+The
+[Mastodon-API](https://github.com/tootsuite/documentation/blob/461a17603504811b786084176c65f31ae405802d/Using-the-API/API.md)
+doesn't require authentication for public timelines or (hash-) tags. Since this
+is exactly the data we want to aggregate, authentication is not needed here.
 
-### Authenticate
-In the Mastodon package, authentication works similar as in the twitteR package.
-You still need an account on any Mastodon-Instance you like, but you do not have
-to create API-Credentials on the website. Instead, it can all be handled from
-within R.
-
-However, this comes with a different kind of complication:
-You login-credentials have to be saved as plain text variables in your R-session
-and if you want to go the comfortable way of saving these in an "auth file", as
-we did with Twitter, this comes with an additional risk.
-
-You can mitigate that risk, if you use an encrypted storage space - which I
-would highly recommend either way. If you haven't encrypted your entire
-hard drive, you may take a look at this wiki article about
-[encryptfs](https://help.ubuntu.com/community/EncryptedPrivateDirectory).
-
-Either way, you have two ways of inserting your credentials into the R-session:
-
-1. via manual input. The R-Console will prompt you to enter the credentials by
-   typing them in.
-2. via a plain text file with the saved credentials. This `.txt` file has a very
-   specific structure which you have to follow. You can find an example file in
-   the examples folder.
-
-The first line of the credential-file contains the *labels*. These have to be in
-the same order as the *credentials* themselves in the line below. The *labels*
-as well as the *credentials* are each separated by a single semi-colon `;`. As
-mentioned before, **storing your login as plain text is a risk that you have to
-deal with somehow**. Ideally with encryption.
-
-If we loaded our login-credentials into the variables
-`mastodon_auth_insta mastodon_auth_login mastodon_auth_passw`, we can *order*
-our API access token with the package's `login()` function, which takes these
-three values as arguments. Again, the name of the function is very generic and
-may overlap with function in other packages. So it is a good idea to prefix it
-with the package name and a double colon. This is the case for all functions in
-this package, so I will not further mention it, but we should continue doing it
-regardless. We store the resulting list into the variable `mastodon_auth`:
+### Scraping Toots and Postings
+Contrary to Twitter, Mastodon does not allow searching for a string contained in
+posts; however, we can search for hashtags through the tag-timeline in the
+[API](https://github.com/tootsuite/documentation/blob/461a17603504811b786084176c65f31ae405802d/Using-the-API/API.md#timelines).
+For this we have to construct a URL for the API-call (keep an eye on changes to
+the API and adapt accordingly) in the following form:
 ```
-  mastodon_auth <- mastodon::login(instance = mastodon_auth_insta,
-                                   user = mastodon_auth_login,
-                                   pass = mastodon_auth_passw)
+  https://DOMAIN.OF.INSTANCE/api/v1/timelines/tag/SEARCHTERM
 ```
 
-### Scraping Toots and Postings
-Once we successfully got our access token, we can start collecting postings
-containing our desired string. Contrary to Twitter, Mastodon does not allow to
-search for a string contained in posts, however we can search for hashtags with
-the `get_hashtag()` function. This one needs four arguments:
-
-* our previously generated access token `mastodon_auth`
-* a string containing the hashtag we want to search for. In our case, `ilovefs`
-  would make most sense. You can however make the argument, that we should
-  **also** search for `ilfs`. Things like "#ilovefs18" or "#ilovefs2018"
-  *should* be covered, however
-* whether we want to only search on the local instance (the instance your
-  account is registered on). Of course we set this one to `FALSE`, as we want to
-  search the entire fediverse, including Mastodon-, GNUsocial- and
-  Pump.io- instances
-* the maximum number of postings we want to collect. As in the `twitteR`
-  package, we can set this to a very high number, but this may need some
-  consideration in the future. Generally, the fediverse is much more serious
-  about free software than other social media types. Right now, it is still
-  fairly young, but as it gets older (and grows in users), the number of
-  participants in the "I love Free Software Day" may rise quite dramatically. So
-  you could try out a lower number for this argument and take a look at the
-  dates of posting to get a feeling of how high this number should be
-
-The result is saved to the variable `mastodon_toot`:
-```
-  mastodon_toot <- mastodon::get_hashtag(token = mastodon_auth,
-                                         hashtag = "ilovefs",
-                                         local = FALSE,
-                                         n = 100)
-```
+Additionally, you can add `?limit=40` at the end of the URL to raise the results
+from 20 to 40 posts. For the search term it makes sense to use our official
+hashtag for the "I Love Free Software" Campaign: *ilovefs*.
 
-### Stripping out data
-Unfortunately, as of writing this script and documentation, the `mastodon`
-package has very poor documentation itself. For instance, there is no
-explanation of the variables in the resulting list of the `get_hastag()`
-function. Because of the structure of this `list()` item, there are no labels
-either. With the help of the `names()` of R's base-package, I could however
-identify all variables:
+In R, you can easily construct this with the `paste0()` function (the `paste()`
+function will introduce spaces between the arguments, which we obviously do not
+want):
 ```
-  names(mastodon_toot)
+  mastodon_instance <- "https://mastodon.social"
+  mastodon_hashtag <- "ilovefs"
+  mastodon_url <- paste0(mastodon_instance,
+                         "/api/v1/timelines/tag/",
+                         mastodon_hashtag,
+                         "?limit=40")
 ```
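+
+For clarity, with these values the constructed string reads:
+```
+  https://mastodon.social/api/v1/timelines/tag/ilovefs?limit=40
+```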
 
-Additionally, the structure of the resulting `list()` item has a great advantage
-over the results in the `twitteR` package: It is very easy to extract the data,
-as it already has the same structure that we use as well, as illustrated below:
+Next, we use the `curl_fetch_memory()` function to fetch the data from our
+Mastodon instance. The result of this is raw data, not readable by humans. In
+order to translate this into a readable format, we use `rawToChar()` from the R
+base package. This readable format is actually
+[JSON](https://de.wikipedia.org/wiki/JavaScript_Object_Notation), which can be
+easily transformed into a `list()` item with the `fromJSON()` function. All
+three functions put together, we have something like this:
 ```
-  mastodon_toot
-    |
-    |- ...
-    |- created_at = "2018-01-22T10:44:53", "2018-01-22T10:45:10", ...
-    |- ...
-    |- visibility = "public", "public", ...
-    |- language = "en", "en", ...
-    |- uri = "tag:quitter.no,2018-01-22:noticeID=0000000000001:objectType=note", ...
-    |- content = "<3 Opensource! #ilovefs", "FREE SOFTWARE!1eleven #ilovefs", ...
-    |- url = "quitter.no/status/0000000000001", "quitter.no/status0000000000002", ...
-    |- reblogs_count = "9", "1", ...
-    |- favourites_count = "53", "3", ...
-    |- ...
-    |- account [LIST]
-    |     |- [LIST 1]
-    |     |     |- ...
-    |     |     |- username = "linux-beginner-for-a-day"
-    |     |     '- ...
-    |     |
-    |     |- [LIST 2]
-    |     |     |- ...
-    |     |     |- username = "C12yp70_H4X012_1337-420"
-    |     |     '- ...
-    |     |
-    |     '- ...
-    |- media_attachements [LIST]
-    |     |- [LIST 1]
-    |     |     |- ...
-    |     |     |- remote_url = "https://quitter.no/media/ilovefs-banner.png"
-    |     |     '- ...
-    |     |
-    |     |- [LIST 2]
-    |     |     |- ...
-    |     |     |- username = ""
-    |     |     '- ...
-    |     |
-    |     '- ...
-    '- ...
-
+  mastodon_reqres <- curl_fetch_memory(mastodon_url)
+  mastodon_rawjson <- rawToChar(mastodon_reqres$content)
+  toots <- fromJSON(mastodon_rawjson)
 ```
 
-Because of this, we can often times to a basic assignment, like this:
+`toots` is our resulting `list()` item.
+
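+For illustration, individual fields can be accessed like this (a minimal
+sketch; the field names follow the Mastodon status format shown earlier, the
+values are made up):
+```
+  toots[[1]]$created_at        # e.g. "2018-01-22T10:44:53"
+  toots[[1]]$account$username  # name of the posting account
+  toots[[1]]$favourites_count  # number of favourites
+```
+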
+Another issue is that the Mastodon-API currently caps at 40 toots per request.
+However, we want much more than only the last 40, so we need to make several
+API-calls, specifying the *"range"*. This is set with the `max_id=` parameter
+within the URL. The "ID" is the
+[unique identifier of each status/post](https://github.com/tootsuite/documentation/blob/461a17603504811b786084176c65f31ae405802d/Using-the-API/API.md#status).
+You can have several parameters by dividing them with the `&` character, which
+will look similar to this:
 ```
-  mastodon_lang <- mastodon_toot[[8]]
+  https://DOMAIN.OF.INSTANCE/api/v1/timelines/tag/SEARCHTERM?limit=40&max_id=IDNUMBER
 ```
 
-However, in such cases as the time of the posting, we need to use `sub()`,
-`gsub()` and `as.numeric()` to extract the data we want (in this case, splitting
-time and date into single, numeric variables). We do something similar for the
-`uri` variable in the list to extract the name of the instance.
-
-URLs and hashtags have a HTML-format in the posting-text, so we need to get rid
-of this, without removing anything else from it. If you do not understand the
-regex here, make sure to check out [regexr.com](https://regexr.com/):
+Luckily, we do not have to find out the ID manually. The header of the API
+response saved into the `mastodon_reqres` variable also lists the "*next page*"
+of results, so we can simply grab this with the `parse_headers()` function from
+the `curl` package and use some regex to strip it out:
 ```
-  mastodon_txt <- gsub(pattern = "<.*?>", x = mastodon_toot[[10]], replacement = "")
+  mastodon_lheader <- parse_headers(mastodon_reqres$headers)[11]
+  mastodon_next <- sub(x = mastodon_lheader, pattern = ".*link: <", replacement = "")
+  mastodon_url <- sub(x = mastodon_next, pattern = ">; rel=\"next\".*", replacement = "")
 ```
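+
+For illustration, the header entry we grab looks roughly like this (a made-up
+example; the index `[11]` assumes the link-header is the eleventh header line,
+which may differ between instances and API versions):
+```
+  link: <https://mastodon.social/api/v1/timelines/tag/ilovefs?limit=40&max_id=99999999>; rel="next", <...>; rel="prev"
+```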
 
-Besides that, we should also try to identify bots, which are very common in the
-fediverse and post about things like "Trending Hashtags". Of course, this is
-problematic for us, as this most definitely can not be considered participation.
-We can either sort bots out by their account-id or name. I went for the name in
-this case, as there may be more "TrendingBots" scattered throughout the
-fediverse. For this, we need to go through each "lower list" containing the
-account information and noting down, which ones are bots and which are not.
-If we identify a poster as a bot, we give the variable `mastodon_bot` the value
-`TRUE` for this position and `FALSE` if this is not a bot. Just like extracting
-information from the lower `list()` items in the `twitteR` package, we first
-need to create an empty `vector()` item:
-```
-  mastodon_bot <- c()
+*If you are not familiar with regex, I highly recommend
+[regexr.com](https://regexr.com/) to learn how to use it. It also contains a
+nifty cheat-sheet.*
+
+
+If this returns a valid result (if the `toots` variable is set), we forward it
+to the [extraction function](#extraction-function) called `mastodon.fetchdata()`,
+which is defined earlier in the script. This returns a `data.frame()` item,
+containing all relevant variables **of the current "page"**. If we continuously
+bind them together in a for-loop, we finally receive multiple vectors of all
+toots ever posted with the (hash-) tag *#ilovefs*:
+```
+  if(length(toots) > 0){
+    tmp_mastodon_df <- mastodon.fetchdata(data = toots)
+    datetime <- c(datetime, as.character(tmp_mastodon_df$tmp_datetime))
+    lang <- c(lang, as.character(tmp_mastodon_df$tmp_lang))
+    inst <- c(inst, as.character(tmp_mastodon_df$tmp_inst))
+    link <- c(link, as.character(tmp_mastodon_df$tmp_link))
+    text <- c(text, as.character(tmp_mastodon_df$tmp_text))
+    reto <- c(reto, as.character(tmp_mastodon_df$tmp_reto))
+    favs <- c(favs, as.character(tmp_mastodon_df$tmp_favs))
+    murl <- c(murl, as.character(tmp_mastodon_df$tmp_murl))
+    acct <- c(acct, as.character(tmp_mastodon_df$tmp_acct))
+  } else {
+    break
+  }
 ```
 
-Next, it will be filled with the help of a for-loop. It has to count up from 1
-to as long as the `mastodon_pers` `list()` item is:
+As of writing this documentation, the cap of the for-loop is set to 9999999999,
+which most likely will never be reached. However, the loop will always stop as
+soon as the `toots` variable doesn't contain meaningful content anymore (see the
+*break* command in the code above).
+
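+To make the flow explicit, here is a condensed sketch of the whole loop
+(assembled from the snippets above; not a verbatim copy of the script):
+```
+  for(i in 1:9999999999){
+    if(i == 1){
+      # first iteration: construct the initial tag-timeline URL
+      mastodon_instance <- "https://mastodon.social"
+      mastodon_hashtag <- "ilovefs"
+      mastodon_url <- paste0(mastodon_instance,
+                             "/api/v1/timelines/tag/",
+                             mastodon_hashtag,
+                             "?limit=40")
+    }
+    mastodon_reqres <- curl_fetch_memory(mastodon_url)
+    toots <- fromJSON(rawToChar(mastodon_reqres$content))
+    if(length(toots) > 0){
+      # ... extract and bind the vectors via mastodon.fetchdata() (see above) ...
+    } else {
+      break
+    }
+    # follow the "next page" link from the response headers
+    mastodon_lheader <- parse_headers(mastodon_reqres$headers)[11]
+    mastodon_next <- sub(x = mastodon_lheader, pattern = ".*link: <", replacement = "")
+    mastodon_url <- sub(x = mastodon_next, pattern = ">; rel=\"next\".*", replacement = "")
+  }
+```
+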
+When extracted, some of the data has to be cleaned, reformatted or changed in
+some way. We use regex for this as well. For the sake of simplicity, the example
+below only shows the cleaning of the `text` variable. Other variables are
+treated in a similar fashion:
 ```
-  for(i in 1:length(mastodon_pers)){
-    ...
-  }
+text <- gsub(pattern = "<.*?>", x = text, replacement = "") # strip HTML tags
+text <- gsub(pattern = "  ", x = text, replacement = "")    # remove leftover double spaces
 ```
 
-Within this for-loop, we need to check whether or not that account is a bot. As
-described above, for the sake of simplicity and because the only bot that comes
-to mind is the "TrendingBot", we do it with a simple if-statement:
+Additionally, posts that are too old have to be removed (usually, setting the
+oldest date to January 01 of the current year works fine; February may be fine
+as well). The format of the date should be `YYYYMMDD` and a `numeric()` value:
 ```
-  if(mastodon_pers[[i]]$username == "TrendingBot"){
-    ...
-  } else {
-    ...
-  }
+mastodon_exclude <- which(date < 20180101)
+date <- date[-mastodon_exclude]
+time <- time[-mastodon_exclude]
+lang <- lang[-mastodon_exclude]
+inst <- inst[-mastodon_exclude]
+text <- text[-mastodon_exclude]
+link <- link[-mastodon_exclude]
+reto <- reto[-mastodon_exclude]
+favs <- favs[-mastodon_exclude]
+murl <- murl[-mastodon_exclude]
+acct <- acct[-mastodon_exclude]
 ```
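+
+For illustration, one way to derive the numeric date from the `created_at`
+timestamps (a sketch based on the `datetime` vector from the loop above; the
+script's actual regex may differ):
+```
+  date <- as.numeric(gsub(pattern = "-|T.*", x = datetime, replacement = ""))
+```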
 
-*Note: you can use multiple Bot-names by adding "|" (or) followed by another
-botname to the statement.*
+### Extraction Function
+The extraction function `mastodon.fetchdata()` has to be defined prior to
+running it (obviously), hence it is the first chunk of code in the
+fediverse-section of the script. As its only argument, it takes the extracted
+data in a `list()` format (which we saved in the variable `toots`). For each
+post/toot in the `list()` item, the function will extract:
 
-As mentioned above, if the statement is true, we set the `mastodon_bot` variable
-at this position as `TRUE` and as `FALSE` if it is not.
+* date & time of the post
+* language of the post (currently only differentiates between English/Japanese)
+* the instance of the poster/tooter
+* the URL of the post
+* the actual content/text of the post
+* the number of boosts/shares/retweets
+* the number of favorites
+* the URL of the attached image (NA, if no image is attached)
+* the account of the poster (instance & username)
 
-All put together, we have:
-```
-  mastodon_bot <- c()
-  for(i in 1:length(mastodon_pers)){
-    if(mastodon_pers[[i]]$username == "TrendingBot"){
-      mastodon_bot[i] <- TRUE
-    } else {
-      mastodon_bot[i] <- FALSE
-    }
-  }
 ```
+  mastodon.fetchdata <- function(data){
 
-### Creating the finished dataset
+  ...
 
-If we scraped all information, we are still dealing with "dirty" data, here. We
-already identified bots, but haven't removed them yet. We also didn't set a
-date-range within which we want to collect data. Additionally, we should also
-sort out "private" posting, as we want to publish our data and should not leak
-someone's thoughts who clearly don't wants them to be public. However it is to
-be expected, that there is close to no person who
+  for(i in 1:length(data)){
 
-* a) white-listed your account to see their private postings
-* b) posts about #ilovefs in a private post
+    #### Time and Date of Toot
+    if(length(data[[i]]$created_at) > 0){
+      tmp_datetime[i] <- data[[i]]$created_at
+    } else {
+      # insert empty value, if it does not exist
+      tmp_datetime[i] <- NA
+    }
 
-However, we should keep it in mind regardless.
+    ...
 
-To identify posts to be excluded, we can simply use the `which()` function in
-conjunction with a condition for each attribute and bind them together with the
-`c()` (or "combine") function. Here we can include the previously identified
-bots, and the condition, that the "date" has to be lower than (before) a certain
-numeric value in the form of "YYYYMMDD". Lastly, we exlclude everything that
-is not marked as "public":
-```
-  mastodon_exclude <- c(which(mastodon_bot),
-                        which(mastodon_date < 20180101),
-                        which(mastodon_priv != "public"))
-```
+  }
 
-Before we create the `data.frame()` item, we can drop all `mastodon_` prefixes
-from the variables, as the name of the dataset itself makes already clear, what
-the source of the data is. We can also strip out the posts we don't want in
-there and which positions are listed in the `mastodon_exclude` variable:
-```
-  date <- mastodon_date[-mastodon_exclude]
-  time <- mastodon_time[-mastodon_exclude]
-  lang <- mastodon_lang[-mastodon_exclude]
-  inst <- mastodon_insta[-mastodon_exclude]
-  text <- mastodon_txt[-mastodon_exclude]
-  link <- mastodon_url[-mastodon_exclude]
-  favs <- mastodon_fav[-mastodon_exclude]
-  imag <- mastodon_img[-mastodon_exclude]
+  return(data.frame(cbind(tmp_datetime,
+                          tmp_lang,
+                          tmp_inst,
+                          tmp_text,
+                          tmp_link,
+                          tmp_reto,
+                          tmp_favs,
+                          tmp_murl,
+                          tmp_acct)))
+  }
 ```
 
+### Creating the finished dataset
+
 As before with the Twitter-data, we combine these newly created vectors into a
 `data.frame()` item: we first turn them into a matrix by binding them as columns
 with `cbind()` and then turn that into the finished dataset called `mastodon`
 with `data.frame()`:
 ```
-mastodon <- data.frame(cbind(date, time, lang, inst, text, link, favs, imag))
+mastodon <- data.frame(cbind(date, time, lang, inst, text, link, reto, favs, murl, acct))
 ```
 
 As this usually re-defines the variables as `factor()`, we will use `within()`
@@ -695,6 +612,7 @@ mastodon <- within(data = mastodon, expr = {
                      time <- as.numeric(as.character(time));
                      text <- as.character(text);
                      link <- as.character(link);
+                     murl <- as.character(murl);
                   })
 ```
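+
+As a quick sanity check after the conversion, `str()` lists each column of the
+finished dataset together with its class:
+```
+  str(mastodon)
+```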
 

docs/collector.pdf (BIN)

