R-Ladies Melbourne second anniversary

To celebrate our second birthday 🎂 we wanted to bring out the diversity of our community. That’s why we asked all our members and all the R-Ladies Melbourne followers to send us their favourite R tip and share it in a 5-minute presentation at our event. The lightning talks were followed by a panel discussion about R workflows and how to communicate the results of a project. Below are some highlights from the night and all the beautiful tips that we managed to collect!

Have fun 😉!!









Visualisation

Having fun with base::plot()!

Author: Soroor Zadeh

par(bg = "black")                                      # black background
plot((-2)^as.complex(seq(0, 7, 0.03)),                 # a complex spiral: Re is plotted against Im
     pch = 21, bg = c(2, 3),                           # filled points with alternating fill colours
     xlab = "", ylab = "", xaxt = "n", yaxt = "n")     # no axis labels or ticks
text(-30, 40, "Happy Anniversary! \n R-Ladies Melbourne", cex = 2, col = "#88398A")




Violin plots with overlaid boxplots, coloured by group: position_dodge()

Author: Anna Quaglieri

Whenever I need to compare continuous variables across groups, this has become my favourite way to go! A violin plot plus a boxplot lets me see both the quantiles and the overall density of the distribution, which can often be missed with boxplots alone.

library(ggplot2)

data <- data.frame(Gene = rep(c("Gene1", "Gene2", "Gene3", "Gene4"), each = 46),
                   Counts = log2(rbinom(n = 46 * 4, size = 1000, prob = 0.3)),
                   CBF = sample(x = c("A", "B"), size = 46 * 4, replace = TRUE))

dodge <- position_dodge(width = 1)

ggplot(data, aes(x = Gene, y = Counts, fill = CBF)) +
  geom_violin(trim = FALSE, position = dodge) +
  geom_boxplot(width = 0.1, position = dodge, show.legend = FALSE) +
  facet_wrap(~Gene, scales = "free_x") +
  labs(y = "log2Counts") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 0))




DT package for interactive tables

Author: Sepideh Foroutan

library(DT)
library(reshape2) # to get the "tips" dataset

data("tips")

datatable(tips, filter = "top", options = list(pageLength = 8))  %>%   ## Bold some numbers:
  formatStyle('total_bill', 
    fontWeight = styleInterval(18, c('normal', 'bold'))) %>%  ## show colour bar
  formatStyle('tip', 
    background = styleColorBar(tips$tip, 'mediumpurple'),
    backgroundSize = '100% 95%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center') %>%   ## transform values
  formatStyle('sex', 
    transform = 'rotateX(-45deg) rotateY(-30deg) rotateZ(-50deg)',
    backgroundColor = styleEqual(unique(tips$sex), c('lightblue', 'lightseagreen'))) %>%  ## colour value/background
  formatStyle('size', 
    color = styleInterval(c(2, 4), c('blue', 'black', 'red')), 
    backgroundColor = styleInterval(c(2, 4), c('white', 'gray', 'gray50')))




Different ways of plotting your data with ggplot() using geom_histogram() and geom_density_ridges()

Author: Marie Trussart

Here I used ggd, a data frame extracted from clustering and expression datasets; this is how I defined it:

# melt() comes from reshape2; cell_clustering and expr are the author's clustering and expression objects
ggd <- melt(data.frame(cluster = cell_clustering, expr), id.vars = "cluster", value.name = "expression")

Example of the data:

Screenshot showing the first six rows of the data used to generate the plots below.


library(ggplot2)
library(ggridges)   # provides geom_density_ridges() and theme_ridges()

### Density distributions
ggplot() +
  geom_density_ridges(data = ggd, aes(x = expression, y = cluster), alpha = 0.3) +
  scale_x_continuous(expand = c(0.01, 0)) +
  scale_y_discrete(expand = c(0.01, 0)) +
  theme_ridges() +
  theme(axis.text = element_text(size = 7),
        strip.text = element_text(size = 7))
ggsave("Fig1.pdf")

### Histogram distributions by facet
ggplot(data = ggd, aes(x = expression)) +
  facet_wrap(~cluster, scales = 'free_y') +
  geom_histogram() +
  theme(strip.text.x = element_text(size = 6),
        strip.text.y = element_text(size = 6),
        axis.text = element_text(size = 8),
        axis.text.x = element_text(size = 8),
        axis.title = element_text(size = 8, face = "bold"))
ggsave("Fig2.pdf")

### All the histogram distributions on the same plot
ggplot(data = ggd[grep("DG19", ggd$cluster), ], aes(x = expression, fill = cluster)) +
  geom_histogram()
ggsave("Fig3.pdf")
Figure 1. geom_density_ridges()

Figure 2. geom_histogram() and facet_wrap()

Figure 3. Another example of geom_histogram(), colouring by classes.




Data visualisation via volcano plots

Author: Erika Duan

As a wet-lab immunologist, most of my job involves trying to find and then illustrate meaningful patterns from large biological datasets.

We obtain a lot of data from RNA sequencing experiments. These are experiments which look at how many mRNA molecules (i.e. message signals) are found in an object and how these signals differ in quantity across multiple objects.

We often analyse datasets with changes across >10,000 signals between >=2 different objects. A volcano plot is one way we visualise all statistically significant versus non-significant differences in one graph.

A typical data analysis pipeline

  1. A large matrix is obtained, containing the number of signals ‘counted’ per signal type per object. Each row contains a unique signal ID (i.e. in my case a unique gene ID) and each column contains all the signal counts for one single object. The researcher also has additional information about each object (i.e. object classification categories like object type, timepoint, batch etc.). This is very important for downstream RNAseq analysis, but not required for this analysis.

  2. A minimal information threshold is set (i.e. minimal signal count per signal > 1 for at least 1 object). An awesome statistical package, in my case DESeq2 (https://bioconductor.org/packages/release/bioc/html/DESeq2.html), is then used to test whether any signals are differentially expressed between different objects (a minimal sketch of this step is shown after this list).

  3. Data visualisation of all statistically significant versus non-significant signals between at least two objects, with the aim of highlighting any new or particularly interesting biological patterns.
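
As a rough guide, steps 1 and 2 might look like the sketch below. This is a minimal sketch rather than my exact pipeline: the objects counts (the signal matrix) and coldata (the object information table, with a Sample.type column) are illustrative names, not provided data.

# Minimal sketch of steps 1-2; `counts` and `coldata` are illustrative objects, not provided data
library(DESeq2)

keep <- rowSums(counts > 1) >= 1                        # minimal information threshold
dds  <- DESeqDataSetFromMatrix(countData = counts[keep, ],
                               colData   = coldata,
                               design    = ~ Sample.type)
dds  <- DESeq(dds)                                      # test for differentially expressed signals
res  <- results(dds, contrast = c("Sample.type", "A", "B"))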

Here, a volcano plot is used to depict:

  • how many signals are differentially expressed (using a statistical cut-off),
  • and by how much (i.e. signal fold change),
  • between two objects tested.

Drawing volcano plots with ggplot2

A results output file can be created in DESeq2, e.g. using results(dds, contrast=c("Sample.type", "A", "B")), and converted into a dataframe.

For convenience, I have provided a fake results output called AvsB_results.csv for use (i.e. a dataframe containing all signal differences between objects A and B). Since we will be using both dplyr and ggplot2, I always find it more convenient to load the tidyverse package.
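
For completeness, here is a hedged sketch of that conversion, assuming res is the object returned by results() above (the full DESeq2 output has more columns than the three kept in AvsB_results.csv):

AvsB_results <- as.data.frame(res)              # coerce the DESeq2 results object to a data frame
AvsB_results$Symbol <- rownames(AvsB_results)   # keep the gene IDs as a column
write.csv(AvsB_results, "AvsB_results.csv", row.names = FALSE)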

library("tidyverse")
library("ggrepel") # We will also need this package for the final labelling of data points. 

We start with our dataset of interest.

Note that for the volcano plot, you only need three columns of information:

  1. Gene symbol (aka unique signal ID)
  2. Log2(fold change) (aka how much the level of each signal in A differs from B by)
  3. Padj (the adjusted P-value or statistical likelihood for whether a signal in A is not different to that of B)
AvsB_results <- read.csv("How_R_You_R-LadiesMelbourne_code_and_tips_data/AvsB_results.csv", header = T, stringsAsFactors = F)
str(AvsB_results) # The dataframe contains the 3 columns of info described above. 
## 'data.frame':    600 obs. of  3 variables:
##  $ log2FoldChange: num  3.804 2.104 1.804 1.309 0.525 ...
##  $ padj          : num  1.24e-13 7.29e-08 2.30e-05 1.69e-03 1.71e-03 ...
##  $ Symbol        : chr  "Ep300" "Nemf" "Atad2b" "Rft1" ...

A simple volcano plot depicts:

  • Along its x-axis: log2(fold change)
  • Along its y-axis: -log10(padj)

Note that the y-axis is depicted as -log10(padj), which allows the data points (i.e. volcano spray) to project upwards as the absolute value along the x axis increases. Graphically, this is more intuitive to visualise.

simple_vp <- ggplot(AvsB_results, aes(x = log2FoldChange,
                         y = -log10(padj))) + 
  geom_point() # A simple volcano plot is created.

simple_vp

This plot is too plain as objects of interest do not easily jump out at us.
A good volcano plot will highlight all the signals (represented by individual data points) which are significantly different between A vs B.
In this case, we would be interested in highlighting genes which have a padj <= 0.05 (or a -log10(padj) >= 1.30103), my chosen statistical cut-off. I would also be interested in highlighting genes which additionally have a log2 fold change <= -1 or >= 1 (i.e. signals which are at least 2-fold bigger or smaller in A vs B).

I can now define these quadrants using:

simple_vp + 
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") + # horizontal dashed line
  geom_vline(xintercept = c(-1,1), linetype = "dashed") # vertical dashed line

The top-left quadrant contains all signals that are significantly decreased in A vs B, and the top-right quadrant contains genes that are significantly increased in A vs B. The remaining genes are not significantly different and hence much less interesting to me.

The next thing we can therefore do is to highlight these three different groups of signals.
To do this, I return to my original dataframe and use the dplyr::mutate function.

AvsB_results <- mutate(AvsB_results,
                       AvsB_type = ifelse(is.na(padj)|padj > 0.05|abs(log2FoldChange) < 1, "ns", 
                         ifelse(log2FoldChange <= -1, "down",
                                "up"))) # creates a new column called AvsB_type, with signals classified as "ns", "down" or "up"

group_by(AvsB_results, AvsB_type) %>%
  summarize(Counts = n()) # counts how many signals are present in each category
## # A tibble: 3 x 2
##   AvsB_type Counts
##   <chr>      <int>
## 1 down           3
## 2 ns           591
## 3 up             6

Now that AvsB_type can segregate each signal based on whether it is ‘up’, ‘down’ or ‘ns’ (i.e. non-significant), I can colour these three signal types differently (and/or change their size/transparency to make different points stand out more versus less).

cols <- c("up" = "#ffad73", "down" = "#26b3ff", "ns" = "grey") 
sizes <- c("up" = 3, "down" = 3, "ns" = 1) 
alphas <- c("up" = 1, "down" = 1, "ns" = 0.5)

ggplot(AvsB_results, aes(x = log2FoldChange,
                         y = -log10(padj))) +
  geom_point(aes(colour = AvsB_type, #specify point colour by AvsB_type
                 size = AvsB_type, #specify point size by AvsB_type
                 alpha = AvsB_type)) + #specify point transparency by AvsB_type
  scale_color_manual(values = cols) +
  scale_size_manual(values = sizes) +
  scale_alpha_manual(values = alphas) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") + 
  geom_vline(xintercept = c(-1,1), linetype = "dashed") 

This is great! But there is still one final nifty trick!

As a biologist, I often get >100s of genes which are significantly increased or decreased between two objects. To examine whether interesting patterns (interconnected signals) exist within these 100 genes, I run them through gene over-representation databases like this one.

Interesting_pathway <- c("Nemf", "Rft1", "Atp5h") # An external database identifies an interesting signal network!

We would like to highlight these particular signals, by representing them in a different (darker) colour and also by labelling each individual point of interest.

ggplot(AvsB_results, aes(x = log2FoldChange,
                         y = -log10(padj))) +
  geom_point(aes(colour = AvsB_type,
                 size = AvsB_type,
                 alpha = AvsB_type)) +
  scale_color_manual(values = cols) +
  scale_size_manual(values = sizes) +
  scale_alpha_manual(values = alphas) +
  scale_x_continuous(limits = c(-4, 4)) + # changing the x-axis to make my volcano plot symmetrical
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") + 
  geom_vline(xintercept = c(-1,1), linetype = "dashed") +
  geom_text_repel(data = AvsB_results %>% 
                    filter(Symbol %in% Interesting_pathway), # labels only genes in the interesting pathway
                  aes(label = Symbol),
                  size = 3.5,
                  color = "black",
                  nudge_x = 0.3, nudge_y = 0.1) + 
  geom_point(data = AvsB_results %>%
               filter(Symbol %in% Interesting_pathway), # adds new points for only genes in the interesting pathway
             color = "#d91933",
             size = 2) +
  theme_classic() + # creates a white background
  theme(panel.border = element_rect(colour = "black", fill=NA, size= 0.5)) # creates a plot border

Voilà! Enjoy your volcano plot (and remember, there are lots of graphical modifiers you can use to tailor how you visualise your data, as long as your methods are logical and reasonable)!

Development notes

Chuanxin Liu devised the elegant strategy for labelling all signal types as ‘up’, ‘ns’ or ‘down’ and the code for the labelling of specific signal data points.

Create multiple plots using for loop / Code Snippets

Author: Ivy Lin

Have a look at Ivy’s R tip published on RPubs http://rpubs.com/IvyLin/R-tips or go through her code below.

Tip 1 : Create multiple plots using for loop

Demo dataset: Telco Customer Churn data, source: https://www.kaggle.com/blastchar/telco-customer-churn

library(readr)
library(dplyr)
library(ggplot2)

When we want to visually explore a dataset with many variables…
Is there any efficient way to create multiple plots at the same time?

# import data (read.csv defaults to stringsAsFactors = TRUE)
customer <- read.csv("How_R_You_R-LadiesMelbourne_code_and_tips_data/WA_Fn-UseC_-Telco-Customer-Churn.csv",stringsAsFactors = T)
str(customer)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
  • Data wrangling
# Make ID back to correct format
customer$customerID <- as.character(customer$customerID)

#reorder
customer <- customer[c(1,21,2:20)]

# scan missing value
colSums(is.na(customer))
##       customerID            Churn           gender    SeniorCitizen 
##                0                0                0                0 
##          Partner       Dependents           tenure     PhoneService 
##                0                0                0                0 
##    MultipleLines  InternetService   OnlineSecurity     OnlineBackup 
##                0                0                0                0 
## DeviceProtection      TechSupport      StreamingTV  StreamingMovies 
##                0                0                0                0 
##         Contract PaperlessBilling    PaymentMethod   MonthlyCharges 
##                0                0                0                0 
##     TotalCharges 
##               11
#observation with missing values 
missing <- filter(customer, is.na(customer$TotalCharges) == TRUE )

# since TotalCharges is roughly equal to tenure * MonthlyCharges, I replace the missing values accordingly.
customer_m <- customer %>% mutate(TotalCharges = ifelse(is.na(customer$TotalCharges), customer$MonthlyCharges*customer$tenure, TotalCharges) )
  • Exploratory analysis of different variables

Tip: use a for-loop to create multiple plots in just one command.
Requirement: the variables need to be of the same data type.

variables <- list(  'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService','OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling','PaymentMethod' )

for (i in variables){
  plot <- ggplot(customer_m, aes_string(x = i, fill = "Churn")) +   # Churn is already a factor
    geom_bar(position = "stack") +
    scale_fill_discrete(name = "churn")
  print(plot)
}

  • Pros: An efficient way to do exploratory plots.
  • Cons: You cannot adjust individual plots (but see the workaround sketched below).
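
One possible workaround for that con (an addition to Ivy's tip, not part of it) is to store the plots in a named list instead of printing them, so individual plots can still be tweaked afterwards:

plots <- list()
for (i in variables){
  plots[[i]] <- ggplot(customer_m, aes_string(x = i, fill = "Churn")) +
    geom_bar(position = "stack") +
    scale_fill_discrete(name = "churn")
}
plots[["gender"]] + coord_flip()   # e.g. adjust just one of the plots afterwards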

Tip 2: Code Snippet:

Just type the snippet name, e.g. fun or apply, and press Tab.
Or view the built-in snippets from Tools > Global Options > Code > Edit Snippets, where you can also create your own snippets.

# example for snippet "for"
for () {
  
}

for (i in apple) {
  
}
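
For reference, a custom snippet definition added via Tools > Global Options > Code > Edit Snippets looks like the (hypothetical) lib snippet below; note that the snippet body must be indented with a tab:

# snippet definitions live in the snippets file, not in an R script
snippet lib
	library(${1:package})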




Dealing with strings and factors

Matching strings

Author: Saskia Freytag

Find Saskia’s fabulous slides about all the ways to match strings in R here.
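
As a tiny taster (an illustrative addition, not taken from the slides), base R and stringr both offer pattern-matching helpers:

library(stringr)

x <- c("R-Ladies", "Melbourne", "rstats")
grepl("ladies", x, ignore.case = TRUE)   # base R: does each element match the pattern?
sub("^R-", "", x)                        # base R: replace a matched pattern
str_detect(x, "stats")                   # stringr equivalent of grepl()
str_replace(x, "^R-", "")                # stringr equivalent of sub()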




library(forcats): the incredible things you can do with your R factors!

Author: Anna Quaglieri

library(forcats)
library(datasets.load)
library(dplyr)
library(ggplot2)
library(DT)
library(cowplot)
datatable(ChickWeight)
# Let's make diet a factor
ChickWeight$Diet <- factor(ChickWeight$Diet,levels=c(1,2,3,4),labels = c("Diet1","Diet2","Diet3","Diet4"))
  • forcats::fct_relevel: reorder manually

By default, factor levels are ordered alphabetically.

table(ChickWeight$Diet)
## 
## Diet1 Diet2 Diet3 Diet4 
##   220   120   120   118
p1=ChickWeight %>% 
ggplot(aes(x=Diet,y=weight)) + geom_boxplot() + ggtitle("Initial")+ theme_bw()

p2=ChickWeight %>% 
  mutate(Diet = fct_relevel(Diet,"Diet2","Diet3","Diet1","Diet4")) %>%
ggplot(aes(x=Diet,y=weight)) + geom_boxplot() +  ggtitle("After fct_relevel(Diet)") + theme_bw()

plot_grid(p1,p2)

  • forcats::fct_infreq: reorder by factor frequency
# let's sample out some rows
sample_chick <- ChickWeight[sample(nrow(ChickWeight),size=100),]

table(sample_chick$Diet)
## 
## Diet1 Diet2 Diet3 Diet4 
##    45    14    26    15
p1=sample_chick %>% 
  ggplot(aes(x=Diet)) + geom_bar() + ggtitle("Initial")+ theme_bw()


table(fct_infreq(sample_chick$Diet))
## 
## Diet1 Diet3 Diet4 Diet2 
##    45    26    15    14
p2=sample_chick %>% 
  mutate(Diet = fct_infreq(Diet)) %>%
ggplot(aes(x=Diet)) + geom_bar() + ggtitle("After fct_infreq(Diet)") + theme_bw()

plot_grid(p1,p2)

  • forcats::fct_reorder: reorder by values of another variable
# chickwts  Chicken Weights by Feed Type
datatable(chickwts)
class(chickwts$feed)
## [1] "factor"
p1=chickwts %>% 
  ggplot(aes(x=feed,y=weight,fill=feed)) + geom_boxplot() + ggtitle("Initial")+ theme_bw() + theme(legend.position = "bottom")

p2=chickwts %>% 
  mutate(feed = fct_reorder(feed,weight)) %>%
  ggplot(aes(x=feed,y=weight,fill=feed)) + geom_boxplot() + ggtitle("After fct_reorder(feed,weight)") + theme_bw()+ theme(legend.position = "bottom")

plot_grid(p1,p2)




Data manipulation


mutate_at() and summarise_at()

Author: Lucy Liu

A few weeks ago I learnt about mutate_at() and summarise_at().

Mutate

The well known mutate() lets you do something like this:

# load package
library(tidyverse)
iris %>%
  mutate(newcol = Sepal.Length * 10) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species newcol
## 1          5.1         3.5          1.4         0.2  setosa     51
## 2          4.9         3.0          1.4         0.2  setosa     49
## 3          4.7         3.2          1.3         0.2  setosa     47
## 4          4.6         3.1          1.5         0.2  setosa     46
## 5          5.0         3.6          1.4         0.2  setosa     50
## 6          5.4         3.9          1.7         0.4  setosa     54

mutate_at() is used to perform a function on several columns at once. The syntax goes like this:

  • Tell it which columns you want to ‘transform’. You can use vars() to do this. vars() understands the same specifications as select() e.g. -c(col), starts_with(), contains().
  • Tell it the function you want to perform.

When you only want to perform 1 function, you can get it to just replace the old columns:

iris %>%
  mutate_at(vars(starts_with("Petal")), log) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5    0.3364722  -1.6094379  setosa
## 2          4.9         3.0    0.3364722  -1.6094379  setosa
## 3          4.7         3.2    0.2623643  -1.6094379  setosa
## 4          4.6         3.1    0.4054651  -1.6094379  setosa
## 5          5.0         3.6    0.3364722  -1.6094379  setosa
## 6          5.4         3.9    0.5306283  -0.9162907  setosa

Here the columns Petal.Length and Petal.Width are now logs of the old columns.

If instead you wanted to add new columns to the end, use funs():

iris %>%
  mutate_at(vars(starts_with("Petal")), 
            funs(log = log(.))) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##   Petal.Length_log Petal.Width_log
## 1        0.3364722      -1.6094379
## 2        0.3364722      -1.6094379
## 3        0.2623643      -1.6094379
## 4        0.4054651      -1.6094379
## 5        0.3364722      -1.6094379
## 6        0.5306283      -0.9162907

Note that we now need to use the . notation: the . refers to the data in each selected column.

You can also perform several functions:

iris %>%
  mutate_at(vars("Petal.Width"), funs(
    norm = ./mean(.),
    log = log(.)
  )) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species      norm
## 1          5.1         3.5          1.4         0.2  setosa 0.1667593
## 2          4.9         3.0          1.4         0.2  setosa 0.1667593
## 3          4.7         3.2          1.3         0.2  setosa 0.1667593
## 4          4.6         3.1          1.5         0.2  setosa 0.1667593
## 5          5.0         3.6          1.4         0.2  setosa 0.1667593
## 6          5.4         3.9          1.7         0.4  setosa 0.3335186
##          log
## 1 -1.6094379
## 2 -1.6094379
## 3 -1.6094379
## 4 -1.6094379
## 5 -1.6094379
## 6 -0.9162907

Summarise

summarise_at() works similarly:

iris %>%
  group_by(Species) %>%
  summarise_at(vars(starts_with("Petal")), 
               funs(mean = mean(.),
                    median = median(.),
                    sd = sd(.)))
## # A tibble: 3 x 7
##   Species Petal.Length_me… Petal.Width_mean Petal.Length_me…
##   <fct>              <dbl>            <dbl>            <dbl>
## 1 setosa              1.46            0.246             1.5 
## 2 versic…             4.26            1.33              4.35
## 3 virgin…             5.55            2.03              5.55
## # ... with 3 more variables: Petal.Width_median <dbl>,
## #   Petal.Length_sd <dbl>, Petal.Width_sd <dbl>

You select columns using vars() and use funs() to tell it what function you want to perform.




Manipulate a list to transform it to a data frame

Author: Maria Prokofieva

# needs purrr (map_df) and dplyr (mutate); `yourlist` is assumed to be a character vector of file paths
test <- map_df(yourlist, ~ data_frame(size = paste(file.size(.x), collapse = " ")) %>% mutate(filename = .x))




Accessing company information from Company House (UK)

Author: Maria Prokofieva

Companies House https://beta.companieshouse.gov.uk/ is the United Kingdom’s registrar of companies. As a member of the Public Data Group, they make their company-related data available for public use via their API https://developer.companieshouse.gov.uk/api/docs/

You can search company-related information using HTTP requests, save it and use it for your analysis. You can either search one company or a group of companies. They are very generous with the information they provide.

API key authentication

The Companies House API works with authentication credentials that are sent with each request.

To get an API key, you need to set up an application and register it with the Companies House Developer Hub as an API Key application. This can be done here: https://developer.companieshouse.gov.uk/developer/applications

This will allocate a unique key to the application which can be sent with any GET request for a public resource served by the Companies House API.

In this example we are searching for information about two companies with registration numbers 05141488 and 09202639.

options(stringsAsFactors = FALSE)
library(knitr)
library(httr)
library(jsonlite)
library(data.table)
library(RCurl)
library(purrr)

#company overview page - List of companies
pages = list()
dataFrame = list()

companies<-c("05141488","09202639")

companyList <- paste("https://api.companieshouse.gov.uk/company/", companies, sep="")

for(u in companyList) {
  
  pages[[u]] = GET(u, authenticate("YOUR_API_KEY", ""))  # replace with your own API key
  
  cont <- content(pages[[u]], as = "parsed", type = "application/json")

# explicit conversion to data frame
  dataFrame[[u]] <- data.frame(cont)
}

Now we have a dataframe with A LOT of information, which we do not really need in full, so select only the entries you do need.

#select elements from lists

dataFrameSelected<-lapply(dataFrame, `[`, c('company_number',
                                            'date_of_creation',
                                            'type',
                                            'company_name'))



#convert selected to dataframe
dataFrameCompanyOverview = do.call(rbind, dataFrameSelected)

#dataFrameCompanyOverview

kable(dataFrameCompanyOverview, format = "html", caption = "Company data from CompanyHouse")
Company data from CompanyHouse

                                                      company_number  date_of_creation  type  company_name
https://api.companieshouse.gov.uk/company/05141488    05141488        2004-06-01        ltd   SPECIALISED CAMERA SERVICES LIMITED
https://api.companieshouse.gov.uk/company/09202639    09202639        2014-09-03        ltd   SELL MY LIVESTOCK UK LTD




Where to get help with your R question?

Author: Maëlle Salmon

“Imagine you have an R question…” Check out Maëlle’s super comprehensive blog post Where to get help with your R question? about how to make your search as efficient and targeted as possible!




How to get all the functions within a package

Author: Emi Tanaka

Below you can find all the functions in dplyr (minus the internal hidden ones); there are 245 in total.

library(dplyr)
ls("package:dplyr")
##   [1] "%>%"                "add_count"          "add_count_"        
##   [4] "add_row"            "add_rownames"       "add_tally"         
##   [7] "add_tally_"         "all_equal"          "all_vars"          
##  [10] "anti_join"          "any_vars"           "arrange"           
##  [13] "arrange_"           "arrange_all"        "arrange_at"        
##  [16] "arrange_if"         "as_data_frame"      "as_tibble"         
##  [19] "as.tbl"             "as.tbl_cube"        "auto_copy"         
##  [22] "band_instruments"   "band_instruments2"  "band_members"      
##  [25] "bench_tbls"         "between"            "bind_cols"         
##  [28] "bind_rows"          "case_when"          "changes"           
##  [31] "check_dbplyr"       "coalesce"           "collapse"          
##  [34] "collect"            "combine"            "common_by"         
##  [37] "compare_tbls"       "compare_tbls2"      "compute"           
##  [40] "contains"           "copy_to"            "count"             
##  [43] "count_"             "cumall"             "cumany"            
##  [46] "cume_dist"          "cummean"            "current_vars"      
##  [49] "data_frame"         "data_frame_"        "db_analyze"        
##  [52] "db_begin"           "db_commit"          "db_create_index"   
##  [55] "db_create_indexes"  "db_create_table"    "db_data_type"      
##  [58] "db_desc"            "db_drop_table"      "db_explain"        
##  [61] "db_has_table"       "db_insert_into"     "db_list_tables"    
##  [64] "db_query_fields"    "db_query_rows"      "db_rollback"       
##  [67] "db_save_query"      "db_write_table"     "dense_rank"        
##  [70] "desc"               "dim_desc"           "distinct"          
##  [73] "distinct_"          "do"                 "do_"               
##  [76] "dr_dplyr"           "ends_with"          "enexpr"            
##  [79] "enexprs"            "enquo"              "enquos"            
##  [82] "ensym"              "ensyms"             "eval_tbls"         
##  [85] "eval_tbls2"         "everything"         "explain"           
##  [88] "expr"               "failwith"           "filter"            
##  [91] "filter_"            "filter_all"         "filter_at"         
##  [94] "filter_if"          "first"              "frame_data"        
##  [97] "full_join"          "funs"               "funs_"             
## [100] "glimpse"            "group_by"           "group_by_"         
## [103] "group_by_all"       "group_by_at"        "group_by_if"       
## [106] "group_by_prepare"   "group_indices"      "group_indices_"    
## [109] "group_size"         "group_vars"         "grouped_df"        
## [112] "groups"             "id"                 "ident"             
## [115] "if_else"            "inner_join"         "intersect"         
## [118] "is_grouped_df"      "is.grouped_df"      "is.src"            
## [121] "is.tbl"             "lag"                "last"              
## [124] "lead"               "left_join"          "location"          
## [127] "lst"                "lst_"               "make_tbl"          
## [130] "matches"            "min_rank"           "mutate"            
## [133] "mutate_"            "mutate_all"         "mutate_at"         
## [136] "mutate_each"        "mutate_each_"       "mutate_if"         
## [139] "n"                  "n_distinct"         "n_groups"          
## [142] "na_if"              "nasa"               "near"              
## [145] "nth"                "ntile"              "num_range"         
## [148] "one_of"             "order_by"           "percent_rank"      
## [151] "progress_estimated" "pull"               "quo"               
## [154] "quo_name"           "quos"               "rbind_all"         
## [157] "rbind_list"         "recode"             "recode_factor"     
## [160] "rename"             "rename_"            "rename_all"        
## [163] "rename_at"          "rename_if"          "rename_vars"       
## [166] "rename_vars_"       "right_join"         "row_number"        
## [169] "rowwise"            "same_src"           "sample_frac"       
## [172] "sample_n"           "select"             "select_"           
## [175] "select_all"         "select_at"          "select_if"         
## [178] "select_var"         "select_vars"        "select_vars_"      
## [181] "semi_join"          "setdiff"            "setequal"          
## [184] "show_query"         "slice"              "slice_"            
## [187] "sql"                "sql_escape_ident"   "sql_escape_string" 
## [190] "sql_join"           "sql_select"         "sql_semi_join"     
## [193] "sql_set_op"         "sql_subquery"       "sql_translate_env" 
## [196] "src"                "src_df"             "src_local"         
## [199] "src_mysql"          "src_postgres"       "src_sqlite"        
## [202] "src_tbls"           "starts_with"        "starwars"          
## [205] "storms"             "summarise"          "summarise_"        
## [208] "summarise_all"      "summarise_at"       "summarise_each"    
## [211] "summarise_each_"    "summarise_if"       "summarize"         
## [214] "summarize_"         "summarize_all"      "summarize_at"      
## [217] "summarize_each"     "summarize_each_"    "summarize_if"      
## [220] "sym"                "syms"               "tally"             
## [223] "tally_"             "tbl"                "tbl_cube"          
## [226] "tbl_df"             "tbl_nongroup_vars"  "tbl_sum"           
## [229] "tbl_vars"           "tibble"             "top_n"             
## [232] "transmute"          "transmute_"         "transmute_all"     
## [235] "transmute_at"       "transmute_if"       "tribble"           
## [238] "trunc_mat"          "type_sum"           "ungroup"           
## [241] "union"              "union_all"          "vars"              
## [244] "with_order"         "wrap_dbplyr_obj"
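
The total quoted above can be reproduced directly (a small addition, not in the original code):

length(ls("package:dplyr"))   # 245 exported objects with the dplyr version attached here (0.7.6)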




sessionInfo()

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] RCurl_1.95-4.11     bitops_1.0-6        data.table_1.11.6  
##  [4] jsonlite_1.5        httr_1.3.1          cowplot_0.9.3      
##  [7] datasets.load_0.3.0 bindrcpp_0.2.2      ggrepel_0.8.0      
## [10] forcats_0.3.0       stringr_1.3.1       dplyr_0.7.6        
## [13] purrr_0.2.5         readr_1.1.1         tidyr_0.8.1        
## [16] tibble_1.4.2        tidyverse_1.2.1     reshape2_1.4.3     
## [19] DT_0.4              ggplot2_3.0.0       knitr_1.20         
## [22] icon_0.1.0          emo_0.0.0.9000      png_0.1-7          
## [25] magick_1.9         
## 
## loaded via a namespace (and not attached):
##  [1] modelr_0.1.2     shiny_1.1.0      assertthat_0.2.0 highr_0.7       
##  [5] cellranger_1.1.0 yaml_2.2.0       pillar_1.3.0     backports_1.1.2 
##  [9] lattice_0.20-35  glue_1.3.0       digest_0.6.15    promises_1.0.1  
## [13] rvest_0.3.2      colorspace_1.3-2 htmltools_0.3.6  httpuv_1.4.5    
## [17] plyr_1.8.4       pkgconfig_2.0.2  broom_0.5.0      haven_1.1.2     
## [21] xtable_1.8-2     scales_1.0.0     later_0.7.3      withr_2.1.2     
## [25] lazyeval_0.2.1   cli_1.0.0        magrittr_1.5     crayon_1.3.4    
## [29] readxl_1.1.0     mime_0.5         evaluate_0.11    fansi_0.3.0     
## [33] nlme_3.1-137     xml2_1.2.0       tools_3.5.1      hms_0.4.2       
## [37] munsell_0.5.0    compiler_3.5.1   rlang_0.2.2      rstudioapi_0.7  
## [41] htmlwidgets_1.2  crosstalk_1.0.0  miniUI_0.1.1.1   labeling_0.3    
## [45] rmarkdown_1.10   gtable_0.2.0     curl_3.2         R6_2.2.2        
## [49] lubridate_1.7.4  utf8_1.1.4       bindr_0.1.1      rprojroot_1.3-2 
## [53] stringi_1.2.4    Rcpp_0.12.18     tidyselect_0.2.4
 

A work by R-Ladies Melbourne