Edit: (WIP) documentation for plotte.R

pull/15/head
janwey 1 year ago
parent
commit
e98fbe13d9
4 changed files with 100 additions and 2 deletions
  1. docs/README.md (+1, -0)
  2. docs/plotter.md (+99, -0)
  3. docs/plotter.pdf (binary)
  4. todo.txt (+0, -2)

docs/README.md (+1, -0)

@@ -1,3 +1,4 @@
# Documentation

* **[collecto.R](./collector.md)** the collector/scraper for data from different social media sources
* **[plotte.R](./plotter.md)** the visualizer script for the collected data

docs/plotter.md (+99, -0)

@@ -0,0 +1,99 @@
# Documentation: [plotte.R](../plotte.R)

## Table of Contents
* [General information about the script](#the-script)
* [Packages used](#packages)
* [The ggplot2 package](#the-ggplot2-package)
* [The gridExtra package](#the-gridextra-package)
* [Participation by Platform](#participation-by-platform)

* * *

## The Script
The R script documented here is not structured as strictly as the
**Collecto.R** script. Each section

## Packages
At the time of writing this script and its documentation, we only use two packages:
* [ggplot2](https://cran.r-project.org/package=ggplot2) (Version 2.2.1)
* [gridExtra](https://cran.r-project.org/package=gridExtra) (Version 2.3)

### The ggplot2 package
ggplot2 is a powerhouse of a package when it comes to data visualisation. Our
usage is rather basic and limited, but it still produces much more elegant
graphics than R's default `plot()` command, which we will also use at some
point in this script. From the ggplot2 package, we combine the following
functions:
```
ggplot() # initialize the actual plot
aes() # create "aesthetic" mappings in the plot object
geom_histogram() # declare the histogram style of the plot
scale_x_datetime() # position scale for date and time
scale_y_continuous() # position scale for continuous data / index
ggtitle() # set a title for the plot
scale_fill_gradient() # give the visualized data a gradient color
```
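As a rough illustration of how these functions fit together, the following sketch builds a histogram from made-up timestamps. The data frame `toy`, its column `time`, the bin count, the axis labels and the colors are all assumptions chosen for illustration; they are not taken from plotte.R.

```r
library(ggplot2)

# Hypothetical data: 200 random timestamps within a single day
set.seed(1)
toy <- data.frame(
  time = as.POSIXct("2018-01-01 00:00:00", tz = "UTC") + runif(200, 0, 24 * 3600)
)

p <- ggplot(data = toy, aes(x = time)) +
  # `..count..` is the per-bin count computed by the histogram stat;
  # newer ggplot2 releases prefer `after_stat(count)` instead
  geom_histogram(aes(fill = ..count..), bins = 24) +
  scale_x_datetime(date_breaks = "6 hours", date_labels = "%H:%M") +
  scale_y_continuous(name = "number of posts") +
  ggtitle("Posts over time (toy data)") +
  scale_fill_gradient(low = "grey70", high = "steelblue")
```

Printing `p` (or calling `print(p)`) draws the plot on the active graphics device.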

### The gridExtra package
gridExtra is only used to arrange several plots produced by the `ggplot2`
package next to each other, as this does not work with the `par()` function
commonly used in conjunction with R's default `plot()`. So, we only need
gridExtra for ggplot2 objects:
```
grid.arrange() # arrange two ggplot2 objects in a grid
```
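A minimal sketch of such an arrangement is shown below. It assumes both packages are installed and uses R's built-in `mtcars` data set as a stand-in; the plot objects `p1` and `p2` are hypothetical and do not appear in plotte.R.

```r
library(ggplot2)
library(gridExtra)

# Two hypothetical ggplot2 objects built from the built-in `mtcars` data
p1 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
p2 <- ggplot(mtcars, aes(x = hp)) + geom_histogram(bins = 10)

# Draw both plots next to each other in a single row
grid.arrange(p1, p2, ncol = 2)
```

With `ncol = 2` the plots are placed side by side; `nrow = 2` would stack them instead.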

* * *

## Participation by Platform
The *by-platform graphic* actually consists of two plots, arranged next to each
other. The first simply divides the collected data into the two categories
"Twitter" and "Fediverse". This split is especially easy, since the data we
collected already comes as two separate datasets, one per platform. Knowing
this, we can simply create a factor variable `platform`, which contains the
string `twitter` exactly as many times as we have tweets, and the string
`fediverse` as many times as we have Fediverse posts. For this, we use the
`rep()` (repeat) and `factor()` functions. The corresponding code looks like
this:
```
twitter_number <- rep(x = "twitter", times = length(twitter$text))
fediver_number <- rep(x = "fediverse", times = length(mastodon$text))
platform <- factor(c(twitter_number, fediver_number),
levels = c("fediverse", "twitter"))
```
This data can now be visualized in a barplot later on.
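
To see what the resulting factor contains, the same steps can be run on small stand-in data. The two data frames below are hypothetical replacements for the scraped `twitter` and `mastodon` datasets, with three and two posts respectively.

```r
# Hypothetical stand-ins for the scraped datasets
twitter  <- data.frame(text = c("a", "b", "c"), stringsAsFactors = FALSE)
mastodon <- data.frame(text = c("d", "e"), stringsAsFactors = FALSE)

twitter_number <- rep(x = "twitter", times = length(twitter$text))
fediver_number <- rep(x = "fediverse", times = length(mastodon$text))
platform <- factor(c(twitter_number, fediver_number),
                   levels = c("fediverse", "twitter"))

# Count the posts per platform: 2 for fediverse, 3 for twitter
table(platform)
```

Setting `levels` explicitly fixes the order of the bars, so "fediverse" always comes first regardless of the order of the input.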

The second plot separates all Fediverse data by individual instance. Our
scraped data contains the account name of each poster, which usually includes
the instance domain as well, for example: `fsfe@status.fsfe.org`.

In order to only extract the domains of the instances, we use the `sub()`
function in conjunction with regex and save the results into the `instances`
variable:
```
instances <- sub(x = as.character(mastodon$acct), pattern = ".*\\@", replacement = "")
```
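
The effect of this substitution can be checked on a few made-up account names. The vector `acct` below is a hypothetical stand-in for `mastodon$acct`.

```r
# Hypothetical account names as they might appear in `mastodon$acct`
acct <- c("fsfe@status.fsfe.org", "alice@example.social", "fsfe")

# ".*@" greedily matches everything up to and including the last "@",
# so only the domain part remains after the replacement
domains <- sub(pattern = ".*@", replacement = "", x = acct)
domains
# "status.fsfe.org" "example.social" "fsfe"
```

Note that a name without an `@` is returned unchanged, because the pattern does not match it.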

However, for all accounts on the instance the data was scraped from (in this
case [mastodon.social](https://mastodon.social)), only the username is
displayed, not the domain of the instance. For example: `fsfe`.

In order to catch these as well, we use the `grep()` function to look for all
strings that do not contain an `@` symbol and save their positions in a variable
(here: `msoc`). The `invert = TRUE` argument makes sure that we get exactly
those accounts that do **not** contain the searched pattern:
```
msoc <- grep(x = as.character(mastodon$acct), pattern = "@", invert = TRUE)
```
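
As a small illustration of what `invert = TRUE` returns, the sketch below runs the same kind of search on made-up account names. The names `acct` and `local_idx` are hypothetical and only used here.

```r
# Hypothetical account names; entries 2 and 4 have no "@"
acct <- c("fsfe@status.fsfe.org", "fsfe", "alice@example.social", "bob")

# Positions of all entries that do NOT contain "@"
local_idx <- grep(pattern = "@", x = acct, invert = TRUE)
acct[local_idx]

# The same positions, obtained by negating a logical match
alt_idx <- which(!grepl("@", acct))
```

Both approaches give positions 2 and 4 here; `grep()` with `invert = TRUE` is simply the more compact form.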

Now we can replace the entries at these positions in the `instances` variable
with the domain of the instance we scraped our data from. Afterwards, we
convert the `instances` variable to a factor with `as.factor()`:
```
instances[msoc] <- "mastodon.social"
instances <- as.factor(instances)
```
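
Put together, the instance extraction can be tried end to end on toy data. The data frame below is a hypothetical stand-in for the scraped `mastodon` dataset, and it assumes (as in the text) that the data was scraped from mastodon.social.

```r
# Hypothetical stand-in for the scraped Fediverse data
mastodon <- data.frame(acct = c("fsfe@status.fsfe.org", "fsfe", "bob"),
                       stringsAsFactors = FALSE)

# Strip everything up to and including the "@"
instances <- sub(x = as.character(mastodon$acct), pattern = ".*@",
                 replacement = "")

# Accounts without "@" live on the instance the data was scraped from
msoc <- grep(x = as.character(mastodon$acct), pattern = "@", invert = TRUE)
instances[msoc] <- "mastodon.social"
instances <- as.factor(instances)

# Number of posts per instance
table(instances)
```

For this toy input, two posts are attributed to `mastodon.social` and one to `status.fsfe.org`.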

Finally, we can start plotting. For this we use the default `plot()` function, as
well as `legend()` to provide some extra information needed to understand the
graphic. Our first plot uses the previously constructed `platform` variable as
input. Since this is a factor variable, R will automatically create a barplot
from this data. For the color, we use red for the Fediverse and blue for Twitter.
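
A minimal sketch of such a plot is given below, using a small made-up `platform` factor. The title, axis label and legend position are assumptions for illustration and may differ from what plotte.R uses.

```r
# Hypothetical factor with three tweets and two Fediverse posts
platform <- factor(c(rep("twitter", 3), rep("fediverse", 2)),
                   levels = c("fediverse", "twitter"))

# plot() on a factor draws a barplot of the counts per level;
# colors are given in level order: fediverse = red, twitter = blue
plot(platform, col = c("red", "blue"),
     main = "Participation by platform (toy data)",
     ylab = "number of posts")

# Explain the colors in a legend
legend("topleft", legend = levels(platform), fill = c("red", "blue"))
```

Because the colors are matched to the bars by position, keeping the factor levels in a fixed order ensures the legend and the bars agree.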

docs/plotter.pdf (binary)


todo.txt (+0, -2)

@@ -1,2 +0,0 @@
# TODO
- replace regex of date/time in collecto.R with strptime() https://stackoverflow.com/questions/15838548/parsing-iso8601-date-and-time-format-in-r
