
Add: documenting the data export

pull/2/head
janwey 1 year ago
parent
commit
01658836b4
2 changed files with 91 additions and 3 deletions
  1. docs/collector.md (+91, −3)
  2. docs/collector.pdf (BIN)

docs/collector.md (+91, −3)

@@ -7,8 +7,9 @@
* [The twittR package](#the-twitter-package)
* [The Rfacebook package](#the-rfacebook-package)
* [The Mastodon package](#the-mastodon-package)
* [Collecting from Twitter](#twitter)
* [Collecting from the Fediverse](#fediverse)
* [Exporting the Datasets](#exporting-datasets)


* * *
@@ -712,6 +713,93 @@ mastodon <- within(data = mastodon, expr = {
```

The dataset can now be exported. Skip down to the
[Exporting-Section](#exporting-datasets) to learn how.

* * *

## Exporting Datasets
There are several reasons why we want to export our data:

1. to keep a backup / an archive. As we have seen in the
[Twitter-Section](#twitter), the social media sites do not always enable us
to collect a full back-log of what has been posted in the past. If we want to
analyze our data at a later point in time or if we want to compare several
points in time to one another, it makes sense to have an archive and
preferably a backup to prevent data loss
2. to use the data outside your current R-session. The variables only live for
as long as your R-session is running. As soon as you close it, all is gone
(unless you agree to save the workspace to an image, which does essentially
the same thing we are doing here). So it makes sense to export the data,
which can then be imported and worked with again later.
3. to enable other users to analyze and work with the data. Obviously, this is
an important one for us. We **do** want to share our results and the data we
used for them, so other people can learn from it and our analysis stays
transparent.

In order to fully enable anyone to use the data, whatever software they are
using, we export in three common and easily readable formats:
`.RData`, `.csv` and `.txt`. The last of these is the simplest and can be read
by literally **any** text-editor. Each string in there is enclosed in quotes `"`
and separated by a single space in a table layout. The `.csv` format is very
similar, though the separation is done with a symbol - in this case a comma `,`.
This format is not only readable by all text-editors (because it is pure text),
it can also be read by spreadsheet applications like LibreOffice Calc. The
disadvantage of both formats is that they can only hold items with the same
"labels", so we need to create a separate export file for each data source.
Also, when importing, you often have to redefine each variable's mode again.
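
As a purely illustrative sketch of that last point: re-importing one of the
`.csv` exports in a fresh session could look like the following (the file name
and the column names are assumptions, not something defined in this document):
```
# hypothetical re-import of an earlier .csv export
twitter <- read.csv(file = "./data/ilovefs-twitter_2018-02-20_10-00-00.csv")

# .csv does not preserve modes, so they have to be set again by hand
twitter$text <- as.character(twitter$text)
twitter$date <- as.numeric(twitter$date)
```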

Lastly, we also export as `.RData`, R's very own format. Since R is free
software, I would suspect that most statistics software can read this format,
but I do not actually know that for a fact. However, it is certainly the
easiest format to work with in R, as you can include as many variables and
datasets as you want and the modes of each variable stay intact. `.RData` is a
binary format and cannot be read by text-editors or other non-specialized
software.
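
To sketch what that looks like in practice (the file name below is only an
example): restoring such an export in a later session is a single call to
`load()`, which brings back all saved datasets with their modes intact:
```
# restores the objects stored in the file (e.g. twitter, mastodon)
load(file = "./data/ilovefs-all_2018-02-20_10-00-00.RData")
```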

In order to have an easily navigable archive, we should label the output files
not only with the source of the data, but also with the date when it was
collected. For this, we first need the current time/date, which R provides with
the `Sys.time()` function. We want to bring it into a format suitable for
file names, like "YYYY-MM-DD_HH-MM-SS", which we can do with `sub()` and
`gsub()`:
```
# strip the timezone, then make the timestamp file-name friendly
time_of_saving <- sub(x = Sys.time(), pattern = " CET", replacement = "")
time_of_saving <- sub(x = time_of_saving, pattern = " ", replacement = "_")
time_of_saving <- gsub(x = time_of_saving, pattern = ":", replacement = "-")
```
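
As a side note - and as an alternative sketch, not what the script above does -
the same file-name-friendly timestamp can usually be produced in one step with
base R's `format()`:
```
# equivalent one-liner using a format string
time_of_saving <- format(Sys.time(), format = "%Y-%m-%d_%H-%M-%S")
```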

Next, we assemble the path we want the data to be exported to, for which we
can use `paste0()`. For example, to save the `.RData` file, we want to export
into the `data/` folder, into the file `ilovefs-all_YYYY-MM-DD_HH-MM-SS.RData`:
```
save_path <- paste0("./data/ilovefs-all_", time_of_saving, ".RData")
```
*Note: using `paste()` instead of `paste0()` will create a space between the
strings, which we do not want here.*
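
To illustrate the difference (this is just a throwaway example, not part of the
export script):
```
paste0("./data/ilovefs-all_", "2018-02-20", ".RData")
# [1] "./data/ilovefs-all_2018-02-20.RData"
paste("./data/ilovefs-all_", "2018-02-20", ".RData")
# [1] "./data/ilovefs-all_ 2018-02-20 .RData"
```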

We follow a similar approach for the individual `.txt` and `.csv` files, also
adding the name of the source to the filename (as they will only hold one data
source each). For example:
```
save_path_twitter_t <- paste0("./data/ilovefs-twitter_", time_of_saving, ".txt")
```
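
The two other paths referenced in the export calls below, `save_path_fed_t` and
`save_path_twitter_c`, are not defined in this hunk; they would presumably
follow the same pattern (the exact file names here are an assumption):
```
# assumed counterparts to save_path_twitter_t, same naming scheme
save_path_fed_t     <- paste0("./data/ilovefs-fediverse_", time_of_saving, ".txt")
save_path_twitter_c <- paste0("./data/ilovefs-twitter_", time_of_saving, ".csv")
```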

Lastly, we need to actually export the data, which we can do with:
```
save() # for .RData
write.table() # for .txt
write.csv() # for .csv
```

All three functions take the data as an argument, as well as the previously
defined file path. In the case of `save()`, where we export multiple datasets
at once, their names need to be collected in a vector with the `c()` function
first:
```

# export both datasets together as .RData
save(list = c("twitter", "mastodon"), file = save_path)

# export the individual datasets as plain text and csv
write.table(mastodon, file = save_path_fed_t)
write.csv(twitter, file = save_path_twitter_c)
```
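
For reference, the defaults of `write.table()` already produce the layout
described earlier (quoted strings, separated by a single space), so a more
explicit but equivalent call would be something along these lines:
```
# explicit version of the .txt export; quote and sep are the defaults anyway
write.table(mastodon, file = save_path_fed_t, quote = TRUE, sep = " ")
```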

**Once this is done, we can safely close our R-session, as we have just
archived all the data for later use or for other people to join in!**

docs/collector.pdf (BIN)

