To celebrate our second birthday 🎂 we wanted to bring out the diversity of our community. That’s why we asked all our members and all the R-Ladies Melbourne followers to send us their favourite R tip and share it in a 5-minute presentation at our event. The lightning talks were followed by a panel discussion about R workflows and how to communicate the results of a project. Below are some highlights from the night and all the beautiful tips that we managed to collect!
Have fun 😉!!
@annaquagli on the summary of our last year : This is a great achievement that we reached more than 1000 members in Sep! Thanks to all the members and sponsors! pic.twitter.com/45nbHoJELz
— R-Ladies Melbourne (@RLadiesMelb) October 17, 2018
The first & most important part of presenting your findings is telling your audience why they should care about the results. If you don’t have their attention, it doesn’t matter how pretty your graphs are.
— David M for Murder (@frostickle) October 17, 2018
-@nikkirubinstein on how to give a talk @RLadiesMelb #Rladies #howRYou pic.twitter.com/llQLlBENMM
Fantastic way to celebrate the second anniversary of @RLadiesMelb! Nice baking @trashystats 🎂 #rstats celebration 🎊 pic.twitter.com/eFVQCpIZE4
— Nikki Rubinstein (@nikkirubinstein) October 17, 2018
base::plot()
Author: Soroor Zadeh
par(bg = "black")                       # black background
plot((-2)^as.complex(seq(0, 7, 0.03)),  # complex spiral: real vs imaginary parts
     pch = 21, bg = c(2, 3), xlab = "", ylab = "", xaxt = "n", yaxt = "n")
text(-30, 40, "Happy Anniversary! \n R-Ladies Melbourne", cex = 2, col = "#88398A")
position_dodge()
Author: Anna Quaglieri
Whenever I have to compare continuous variables, this has become my favourite way to go! A violin plot combined with a boxplot lets me see both the quantiles and the overall density distribution, which can often be missed with boxplots alone.
library(ggplot2)
data <- data.frame(Gene = rep(c("Gene1", "Gene2", "Gene3", "Gene4"), each = 46),
                   Counts = log2(rbinom(n = 46*4, size = 1000, prob = 0.3)),
                   CBF = sample(x = c("A", "B"), size = 46*4, replace = TRUE))
dodge <- position_dodge(width = 1)
ggplot(data, aes(x = Gene, y = Counts, fill = CBF)) +
  theme_bw() + theme(axis.text.x = element_text(angle = 0)) +
  geom_violin(trim = FALSE, position = dodge) +
  geom_boxplot(width = .1, position = dodge, show.legend = FALSE) +
  labs(y = "log2Counts") + facet_wrap(~Gene, scales = "free_x")
DT package for interactive tables
Author: Sepideh Foroutan
library(DT)
library(reshape2) # to get the "tips" dataset
data("tips")
datatable(tips, filter = "top", options = list(pageLength = 8)) %>% ## Bold some numbers:
formatStyle('total_bill',
fontWeight = styleInterval(18, c('normal', 'bold'))) %>% ## show colour bar
formatStyle('tip',
background = styleColorBar(tips$tip, 'mediumpurple'),
backgroundSize = '100% 95%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center') %>% ## transform values
formatStyle('sex',
transform = 'rotateX(-45deg) rotateY(-30deg) rotateZ(-50deg)',
backgroundColor = styleEqual(unique(tips$sex), c('lightblue', 'lightseagreen'))) %>% ## colour value/background
formatStyle('size',
color = styleInterval(c(2, 4), c('blue', 'black', 'red')),
backgroundColor = styleInterval(c(2, 4), c('white', 'gray', 'gray50')))
ggplot() using geom_histogram() and geom_density_ridges()
Author: Marie Trussart
Here I used ggd, a data frame extracted from my clustering and expression datasets, which I define as follows:
ggd <- melt(data.frame(cluster = cell_clustering, expr),id.vars = "cluster", value.name = "expression")
Example of the data:
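cell_clustering and expr come from Marie’s own clustering and expression data, so here is a small hypothetical stand-in (marker names, cluster labels and values are invented) that builds a ggd of the same shape and makes the plots below reproducible:
library(reshape2)   # melt()
set.seed(1)
# fake expression matrix: 500 cells x 5 markers, plus a cluster label for each cell
expr <- matrix(rnorm(500 * 5), ncol = 5,
               dimnames = list(NULL, paste0("Marker", 1:5)))
cell_clustering <- factor(sample(paste0("DG", 1:20), size = 500, replace = TRUE))
ggd <- melt(data.frame(cluster = cell_clustering, expr),
            id.vars = "cluster", value.name = "expression")
head(ggd)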
### Density distributions
library(ggplot2)
library(ggridges)   # geom_density_ridges() and theme_ridges()
ggplot() +
  geom_density_ridges(data = ggd, aes(x = expression, y = cluster), alpha = 0.3) +
  scale_x_continuous(expand = c(0.01, 0)) +
  scale_y_discrete(expand = c(0.01, 0)) +
  theme_ridges() +
  theme(axis.text = element_text(size = 7),
        strip.text = element_text(size = 7))
ggsave("Fig1.pdf")
## Histogram distribution by facet
ggplot(data = ggd, aes(x = expression)) + facet_wrap(~cluster, scales = 'free_y') +
geom_histogram()+
theme(strip.text.x = element_text(size=6),
strip.text.y = element_text(size=6),
axis.text = element_text( size = 8 ),
axis.text.x = element_text( size = 8 ),
axis.title = element_text( size = 8, face = "bold"))
ggsave("Fig2.pdf")
### All the histogram distributions on the same plot
ggplot(data = ggd[grep("DG19", ggd$cluster),], aes(x = expression,fill=cluster)) +
geom_histogram()
ggsave("Fig3.pdf")
Author: Erika Duan
As a wet-lab immunologist, most of my job involves trying to find and then illustrate meaningful patterns from large biological datasets.
We obtain a lot of data from RNA sequencing experiments. These are experiments which look at how many mRNA molecules (i.e. message signals) are found in an object and how these signals differ in quantity across multiple objects.
We often analyse datasets with changes across >10,000 signals between >=2 different objects. A volcano plot is one way we visualise all statistically significant versus non-significant differences in one graph.
A large matrix is obtained, containing the number of signals ‘counted’ per signal type per object. Each row contains a unique signal ID (i.e. in my case a unique gene ID) and each column contains all the signal counts for one single object. The researcher also has additional information about each object (i.e. object classification categories like object type, timepoint, batch etc.). This is very important for downstream RNAseq analysis, but not required for this analysis.
A minimal information threshold is set (i.e. minimal signal count per signal > 1 for at least 1 object). An awesome statistical package, in my case DESeq2 (https://bioconductor.org/packages/release/bioc/html/DESeq2.html), is then used to test whether any signals are differentially expressed between different objects.
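To make this step concrete, here is a minimal sketch of what the DESeq2 calls can look like. The objects count_matrix and sample_info are hypothetical stand-ins for your own count matrix and object metadata; Sample.type matches the contrast used further below.
library(DESeq2)

# count_matrix: matrix of signal counts (rows = unique signal/gene IDs, columns = objects)
# sample_info: data frame of object classifications, including a Sample.type column ("A" or "B")
dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                              colData = sample_info,
                              design = ~ Sample.type)

# minimal information threshold: signal count > 1 for at least 1 object
dds <- dds[rowSums(counts(dds) > 1) >= 1, ]

# test for differential expression and extract the A vs B results table
dds <- DESeq(dds)
AvsB <- results(dds, contrast = c("Sample.type", "A", "B"))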
The aim is then to visualise all statistically significant versus non-significant signals between at least two objects, with the goal of highlighting any new or particularly interesting biological patterns. Here, a volcano plot built with ggplot2 is used to depict all of these differences in one graph.
A results output file can be created in DESeq2, i.e. using results(dds, contrast = c("Sample.type", "A", "B")), and converted into a dataframe.
For convenience, I have provided a fake results output called AvsB_results.csv for use (i.e. a dataframe containing all signal differences between object A versus object B). Since we will be using both dplyr and ggplot2, I always find it more convenient to load the tidyverse package.
library("tidyverse")
library("ggrepel") # We will also need this package for the final labelling of data points.
We start with our dataset of interest.
Note that for the volcano plot, you only need three columns of information: the log2 fold change (log2FoldChange), the adjusted p-value (padj) and the gene symbol (Symbol).
AvsB_results <- read.csv("How_R_You_R-LadiesMelbourne_code_and_tips_data/AvsB_results.csv", header = T, stringsAsFactors = F)
str(AvsB_results) # The dataframe contains the 3 columns of info described above.
## 'data.frame': 600 obs. of 3 variables:
## $ log2FoldChange: num 3.804 2.104 1.804 1.309 0.525 ...
## $ padj : num 1.24e-13 7.29e-08 2.30e-05 1.69e-03 1.71e-03 ...
## $ Symbol : chr "Ep300" "Nemf" "Atad2b" "Rft1" ...
A simple volcano plot depicts the log2 fold change of each signal on the x-axis and its adjusted p-value on the y-axis.
Note that the y-axis is depicted as -log10(padj), which allows the data points (i.e. volcano spray) to project upwards as the absolute value along the x axis increases. Graphically, this is more intuitive to visualise.
simple_vp <- ggplot(AvsB_results, aes(x = log2FoldChange,
y = -log10(padj))) +
geom_point() # A simple volcano plot is created.
simple_vp
This plot is too plain as objects of interest do not easily jump out at us.
A good volcano plot will highlight all the signals (represented by individual data points) which are significantly different between A vs B.
In this case, we would be interested in highlighting genes which have a padj <= 0.05 (or a -log10(padj) >= 1.30103), my chosen statistical cut-off. I would also be interested in highlighting genes which additionally have a log2 fold change <= -1 or >= 1 (i.e. signals which are at least 2-fold bigger or smaller in A vs B).
I can now define these quadrants using:
simple_vp +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") + # horizontal dashed line
geom_vline(xintercept = c(-1,1), linetype = "dashed") # vertical dashed line
The top-left quadrant contains all signals that are significantly decreased in A vs B, and the top-right quadrant contains genes that are significantly increased in A vs B. The remaining genes are not significantly different and hence much less interesting to me.
The next thing we can therefore do is to highlight these three different groups of signals.
To do this, I return to my original dataframe and use the dplyr::mutate function.
AvsB_results <- mutate(AvsB_results,
AvsB_type = ifelse(is.na(padj)|padj > 0.05|abs(log2FoldChange) < 1, "ns",
ifelse(log2FoldChange <= -1, "down",
"up"))) # creates a new column called AvsB_type, with signals classified as "ns", "down" or "up"
group_by(AvsB_results, AvsB_type) %>%
summarize(Counts = n()) # counts how many signals are present in each category
## # A tibble: 3 x 2
## AvsB_type Counts
## <chr> <int>
## 1 down 3
## 2 ns 591
## 3 up 6
Now that AvsB_type can segregate each signal based on whether it is ‘up’, ‘down’ or ‘ns’ (i.e. non-significant), I can colour these three signal types differently (and/or change their size/transparency to make different points stand out more versus less).
cols <- c("up" = "#ffad73", "down" = "#26b3ff", "ns" = "grey")
sizes <- c("up" = 3, "down" = 3, "ns" = 1)
alphas <- c("up" = 1, "down" = 1, "ns" = 0.5)
ggplot(AvsB_results, aes(x = log2FoldChange,
y = -log10(padj))) +
geom_point(aes(colour = AvsB_type, #specify point colour by AvsB_type
size = AvsB_type, #specify point size by AvsB_type
alpha = AvsB_type)) + #specify point transparency by AvsB_type
scale_color_manual(values = cols) +
scale_size_manual(values = sizes) +
scale_alpha_manual(values = alphas) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1,1), linetype = "dashed")
This is great! But there is still one final nifty trick!
As a biologist, I often get hundreds of genes which are significantly increased or decreased between two objects. To examine whether interesting patterns (interconnected signals) exist within these genes, I run them through gene over-representation databases like this one.
Interesting_pathway <- c("Nemf", "Rft1", "Atp5h") # An external database identifies an interesting signal network!
We would like to highlight these particular signals, by representing them in a different (darker) colour and also by labelling each individual point of interest.
ggplot(AvsB_results, aes(x = log2FoldChange,
y = -log10(padj))) +
geom_point(aes(colour = AvsB_type,
size = AvsB_type,
alpha = AvsB_type)) +
scale_color_manual(values = cols) +
scale_size_manual(values = sizes) +
scale_alpha_manual(values = alphas) +
scale_x_continuous(limits = c(-4, 4)) + # changing the x-axis to make my volcano plot symmetrical
geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
geom_vline(xintercept = c(-1,1), linetype = "dashed") +
geom_text_repel(data = AvsB_results %>%
filter(Symbol %in% Interesting_pathway), # labels only genes in the interesting pathway
aes(label = Symbol),
size = 3.5,
color = "black",
nudge_x = 0.3, nudge_y = 0.1) +
geom_point(data = AvsB_results %>%
filter(Symbol %in% Interesting_pathway), # adds new points for only genes in the interesting pathway
color = "#d91933",
size = 2) +
theme_classic() + # creates a white background
theme(panel.border = element_rect(colour = "black", fill=NA, size= 0.5)) # creates a plot border
Voilà! Enjoy your volcano plot (and remember, there are lots of graphical modifiers you can use to tweak how your data are visualised, as long as your choices are logical and reasonable)!
Chuanxin Liu devised the elegant strategy for labelling all signal types as ‘up’, ‘ns’ or ‘down’ and the code for the labelling of specific signal data points.
Author: Ivy Lin
Have a look at Ivy’s R tip published on RPubs http://rpubs.com/IvyLin/R-tips or go through her code below.
Demo dataset: Telco Customer Churn data, source: https://www.kaggle.com/blastchar/telco-customer-churn
library(readr)
library(dplyr)
library(ggplot2)
When we want to visually explore a dataset with many variables…
Is there any efficient way to create multiple plots at the same time?
# import data (default stringsAsFactors = TRUE)
customer <- read.csv("How_R_You_R-LadiesMelbourne_code_and_tips_data/WA_Fn-UseC_-Telco-Customer-Churn.csv",stringsAsFactors = T)
str(customer)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
# Make ID back to correct format
customer$customerID <- as.character(customer$customerID)
#reorder
customer <- customer[c(1,21,2:20)]
# scan missing value
colSums(is.na(customer))
## customerID Churn gender SeniorCitizen
## 0 0 0 0
## Partner Dependents tenure PhoneService
## 0 0 0 0
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 0 0 0 0
## DeviceProtection TechSupport StreamingTV StreamingMovies
## 0 0 0 0
## Contract PaperlessBilling PaymentMethod MonthlyCharges
## 0 0 0 0
## TotalCharges
## 11
#observation with missing values
missing <- filter(customer, is.na(customer$TotalCharges) == TRUE )
# since TotalCharges is roughly equal to tenure * MonthlyCharges, I replace the missing values accordingly.
customer_m <- customer %>% mutate(TotalCharges = ifelse(is.na(customer$TotalCharges), customer$MonthlyCharges*customer$tenure, TotalCharges) )
Tip: use a for-loop to create multiple plots in just one command.
Requirement: the variables need to be of the same data type.
variables <- list( 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService','OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling','PaymentMethod' )
for (i in variables){
plot <- ggplot(customer_m, aes_string(x = i, fill = as.factor(customer_m$Churn)))+
geom_bar( position = "stack")+ scale_fill_discrete(name = "churn")
print(plot)
}
Just type snippet plus the function name, e.g. snippet fun or snippet apply.
Or view the built-in snippets from Tools > Global Options > Code > Edit Snippets, where you can also create your own snippets.
# example: the 'for' snippet expands to a loop skeleton
for () {
}
# which you can then fill in, e.g.
for (i in apple) {
}
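For reference, a custom snippet definition in the snippet editor looks roughly like the sketch below (the hello snippet and its body are made up; body lines are indented with a tab and ${1:...} marks a placeholder you Tab through):
snippet hello
	message("Hello, ${1:name}!")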
Author: Saskia Freytag
Find Saskia’s fabulous slides about all the ways to match strings in R here.
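The slides themselves are not reproduced here, but as a quick, hedged reminder of the kind of tools they cover, base R and stringr both ship string-matching helpers, for example:
library(stringr)

fruits <- c("apple", "banana", "blueberry")

grepl("berry", fruits)            # base R: logical match for each element
sub("berry", "jam", fruits)       # base R: replace the first match
str_detect(fruits, "^b")          # stringr: elements starting with "b"
str_extract(fruits, "[aeiou]+")   # stringr: first run of vowels in each element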
library(forcats): the incredible things you can do with your R factors!
Author: Anna Quaglieri
library(forcats)
library(datasets.load)
library(dplyr)
library(ggplot2)
library(DT)
library(cowplot)
datatable((ChickWeight))
# Let's make diet a factor
ChickWeight$Diet <- factor(ChickWeight$Diet,levels=c(1,2,3,4),labels = c("Diet1","Diet2","Diet3","Diet4"))
forcats::fct_relevel: reorder manually
The default factor ordering is alphabetical.
table(ChickWeight$Diet)
##
## Diet1 Diet2 Diet3 Diet4
## 220 120 120 118
p1=ChickWeight %>%
ggplot(aes(x=Diet,y=weight)) + geom_boxplot() + ggtitle("Initial")+ theme_bw()
p2=ChickWeight %>%
mutate(Diet = fct_relevel(Diet,"Diet2","Diet3","Diet1","Diet4")) %>%
ggplot(aes(x=Diet,y=weight)) + geom_boxplot() + ggtitle("After fct_relevel(Diet)") + theme_bw()
plot_grid(p1,p2)
forcats::fct_infreq: reorder by factor frequency
# let's sample some rows
sample_chick <- ChickWeight[sample(nrow(ChickWeight),size=100),]
table(sample_chick$Diet)
##
## Diet1 Diet2 Diet3 Diet4
## 45 14 26 15
p1=sample_chick %>%
ggplot(aes(x=Diet)) + geom_bar() + ggtitle("Initial")+ theme_bw()
table(fct_infreq(sample_chick$Diet))
##
## Diet1 Diet3 Diet4 Diet2
## 45 26 15 14
p2=sample_chick %>%
mutate(Diet = fct_infreq(Diet)) %>%
ggplot(aes(x=Diet)) + geom_bar() + ggtitle("After fct_infreq(Diet)") + theme_bw()
plot_grid(p1,p2)
forcats::fct_reorder: reorder by the values of another variable
# chickwts: Chicken Weights by Feed Type
datatable(chickwts)
class(chickwts$feed)
## [1] "factor"
p1=chickwts %>%
ggplot(aes(x=feed,y=weight,fill=feed)) + geom_boxplot() + ggtitle("Initial")+ theme_bw() + theme(legend.position = "bottom")
p2=chickwts %>%
mutate(feed = fct_reorder(feed,weight)) %>%
ggplot(aes(x=feed,y=weight,fill=feed)) + geom_boxplot() + ggtitle("After fct_reorder(feed,weight)") + theme_bw()+ theme(legend.position = "bottom")
plot_grid(p1,p2)
mutate_at() and summarise_at()
Author: Lucy Liu
A few weeks ago I learnt about mutate_at() and summarise_at().
The well-known mutate() lets you do something like this:
# load package
library(tidyverse)
iris %>%
mutate(newcol = Sepal.Length * 10) %>%
head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species newcol
## 1 5.1 3.5 1.4 0.2 setosa 51
## 2 4.9 3.0 1.4 0.2 setosa 49
## 3 4.7 3.2 1.3 0.2 setosa 47
## 4 4.6 3.1 1.5 0.2 setosa 46
## 5 5.0 3.6 1.4 0.2 setosa 50
## 6 5.4 3.9 1.7 0.4 setosa 54
mutate_at() is used to perform a function on several columns at once. The syntax goes like this: you select the columns with vars(), which understands the same specifications as select(), e.g. -c(col), starts_with(), contains(). When you only want to perform one function, you can get it to just replace the old columns:
iris %>%
mutate_at(vars(starts_with("Petal")), log) %>%
head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 0.3364722 -1.6094379 setosa
## 2 4.9 3.0 0.3364722 -1.6094379 setosa
## 3 4.7 3.2 0.2623643 -1.6094379 setosa
## 4 4.6 3.1 0.4054651 -1.6094379 setosa
## 5 5.0 3.6 0.3364722 -1.6094379 setosa
## 6 5.4 3.9 0.5306283 -0.9162907 setosa
Here the columns Petal.Length and Petal.Width are now logs of the old columns.
If instead you wanted to add new columns to the end, use funs():
iris %>%
mutate_at(vars(starts_with("Petal")),
funs(log = log(.))) %>%
head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Petal.Length_log Petal.Width_log
## 1 0.3364722 -1.6094379
## 2 0.3364722 -1.6094379
## 3 0.2623643 -1.6094379
## 4 0.4054651 -1.6094379
## 5 0.3364722 -1.6094379
## 6 0.5306283 -0.9162907
Note that we now need to use the . notation. The . simply refers to the data in the selected column.
You can also perform several functions:
iris %>%
mutate_at(vars("Petal.Width"), funs(
norm = ./mean(.),
log = log(.)
)) %>%
head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species norm
## 1 5.1 3.5 1.4 0.2 setosa 0.1667593
## 2 4.9 3.0 1.4 0.2 setosa 0.1667593
## 3 4.7 3.2 1.3 0.2 setosa 0.1667593
## 4 4.6 3.1 1.5 0.2 setosa 0.1667593
## 5 5.0 3.6 1.4 0.2 setosa 0.1667593
## 6 5.4 3.9 1.7 0.4 setosa 0.3335186
## log
## 1 -1.6094379
## 2 -1.6094379
## 3 -1.6094379
## 4 -1.6094379
## 5 -1.6094379
## 6 -0.9162907
summarise_at() works similarly:
iris %>%
group_by(Species) %>%
summarise_at(vars(starts_with("Petal")),
funs(mean = mean(.),
median = median(.),
sd = sd(.)))
## # A tibble: 3 x 7
## Species Petal.Length_me… Petal.Width_mean Petal.Length_me…
## <fct> <dbl> <dbl> <dbl>
## 1 setosa 1.46 0.246 1.5
## 2 versic… 4.26 1.33 4.35
## 3 virgin… 5.55 2.03 5.55
## # ... with 3 more variables: Petal.Width_median <dbl>,
## # Petal.Length_sd <dbl>, Petal.Width_sd <dbl>
You select columns using vars() and use funs() to tell it what function you want to perform.
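As a small extra sketch (the column choice and rescaling are mine, not from Lucy’s talk), the other vars() helpers mentioned above, such as contains(), work in exactly the same way:
library(dplyr)

iris %>%
  mutate_at(vars(contains("Width")),   # selects Sepal.Width and Petal.Width
            funs(mm = . * 10)) %>%     # cm to mm, added as new *_mm columns
  head()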
Author: Maria Prokofieva
test <- map_df(yourlist, ~ data_frame(size = paste(file.size(.x), collapse=" ")) %>% mutate(filename = .x))
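As a hedged usage example (the CSV pattern is just an illustration), yourlist could be built with list.files():
library(purrr)
library(dplyr)   # data_frame() and mutate()

# collect the size of every CSV file in the working directory
yourlist <- list.files(pattern = "\\.csv$", full.names = TRUE)
test <- map_df(yourlist, ~ data_frame(size = paste(file.size(.x), collapse = " ")) %>% mutate(filename = .x))
test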
Author: Maria Prokofieva
Companies House https://beta.companieshouse.gov.uk/ is the United Kingdom’s registrar of companies. As a member of the Public Data Group, it makes its company-related data available for public use via its API https://developer.companieshouse.gov.uk/api/docs/
You can search company-related information using HTTP requests, save it and use it for your analysis. You can search either one company or a group of companies. They are very generous with the information they provide.
The Companies House API works with authentication credentials that are sent with each request.
To get an API key, you need to set up an application and register it with the Companies House Developer Hub as an API Key application. This can be done here: https://developer.companieshouse.gov.uk/developer/applications
This will allocate a unique key to the application which can be sent with any GET request for a public resource served by the Companies House API.
In this example we are searching for information about two companies, with registration numbers 05141488 and 09202639.
options(stringsAsFactors = FALSE)
library(knitr)
library(httr)
library(jsonlite)
library(data.table)
library(RCurl)
library(purrr)
#company overview page - List of companies
pages = list()
dataFrame = list()
companies<-c("05141488","09202639")
companyList <- paste("https://api.companieshouse.gov.uk/company/", companies, sep="")
for(u in companyList) {
pages[[u]] = GET(u, authenticate("YOUR_API_KEY", ""))  # replace YOUR_API_KEY with your own Companies House API key
cont <- content(pages[[u]], as = "parsed", type = "application/json")
# explicit conversion to data frame
dataFrame[[u]] <- data.frame(cont)
}
Now we have a dataframe with A LOT of information, most of which we do not really need, so we select only the entries we do need.
#select elements from lists
dataFrameSelected<-lapply(dataFrame, `[`, c('company_number',
'date_of_creation',
'type',
'company_name'))
#convert selected to dataframe
dataFrameCompanyOverview = do.call(rbind, dataFrameSelected)
#dataFrameCompanyOverview
kable(dataFrameCompanyOverview, format = "html", caption = "Company data from CompanyHouse")
| | company_number | date_of_creation | type | company_name |
|---|---|---|---|---|
| https://api.companieshouse.gov.uk/company/05141488 | 05141488 | 2004-06-01 | ltd | SPECIALISED CAMERA SERVICES LIMITED |
| https://api.companieshouse.gov.uk/company/09202639 | 09202639 | 2014-09-03 | ltd | SELL MY LIVESTOCK UK LTD |
Where to get help with your R question?
Author: Maëlle Salmon
“Imagine you have an R question…” Check out Maëlle’s super comprehensive blog post Where to get help with your R question? about how to make your search as efficient and targeted as possible!
Author: Emi Tanaka
Below you can find all the functions in dplyr (minus the internal hidden ones). There are in total 245 functions in dplyr.
library(dplyr)
ls("package:dplyr")
## [1] "%>%" "add_count" "add_count_"
## [4] "add_row" "add_rownames" "add_tally"
## [7] "add_tally_" "all_equal" "all_vars"
## [10] "anti_join" "any_vars" "arrange"
## [13] "arrange_" "arrange_all" "arrange_at"
## [16] "arrange_if" "as_data_frame" "as_tibble"
## [19] "as.tbl" "as.tbl_cube" "auto_copy"
## [22] "band_instruments" "band_instruments2" "band_members"
## [25] "bench_tbls" "between" "bind_cols"
## [28] "bind_rows" "case_when" "changes"
## [31] "check_dbplyr" "coalesce" "collapse"
## [34] "collect" "combine" "common_by"
## [37] "compare_tbls" "compare_tbls2" "compute"
## [40] "contains" "copy_to" "count"
## [43] "count_" "cumall" "cumany"
## [46] "cume_dist" "cummean" "current_vars"
## [49] "data_frame" "data_frame_" "db_analyze"
## [52] "db_begin" "db_commit" "db_create_index"
## [55] "db_create_indexes" "db_create_table" "db_data_type"
## [58] "db_desc" "db_drop_table" "db_explain"
## [61] "db_has_table" "db_insert_into" "db_list_tables"
## [64] "db_query_fields" "db_query_rows" "db_rollback"
## [67] "db_save_query" "db_write_table" "dense_rank"
## [70] "desc" "dim_desc" "distinct"
## [73] "distinct_" "do" "do_"
## [76] "dr_dplyr" "ends_with" "enexpr"
## [79] "enexprs" "enquo" "enquos"
## [82] "ensym" "ensyms" "eval_tbls"
## [85] "eval_tbls2" "everything" "explain"
## [88] "expr" "failwith" "filter"
## [91] "filter_" "filter_all" "filter_at"
## [94] "filter_if" "first" "frame_data"
## [97] "full_join" "funs" "funs_"
## [100] "glimpse" "group_by" "group_by_"
## [103] "group_by_all" "group_by_at" "group_by_if"
## [106] "group_by_prepare" "group_indices" "group_indices_"
## [109] "group_size" "group_vars" "grouped_df"
## [112] "groups" "id" "ident"
## [115] "if_else" "inner_join" "intersect"
## [118] "is_grouped_df" "is.grouped_df" "is.src"
## [121] "is.tbl" "lag" "last"
## [124] "lead" "left_join" "location"
## [127] "lst" "lst_" "make_tbl"
## [130] "matches" "min_rank" "mutate"
## [133] "mutate_" "mutate_all" "mutate_at"
## [136] "mutate_each" "mutate_each_" "mutate_if"
## [139] "n" "n_distinct" "n_groups"
## [142] "na_if" "nasa" "near"
## [145] "nth" "ntile" "num_range"
## [148] "one_of" "order_by" "percent_rank"
## [151] "progress_estimated" "pull" "quo"
## [154] "quo_name" "quos" "rbind_all"
## [157] "rbind_list" "recode" "recode_factor"
## [160] "rename" "rename_" "rename_all"
## [163] "rename_at" "rename_if" "rename_vars"
## [166] "rename_vars_" "right_join" "row_number"
## [169] "rowwise" "same_src" "sample_frac"
## [172] "sample_n" "select" "select_"
## [175] "select_all" "select_at" "select_if"
## [178] "select_var" "select_vars" "select_vars_"
## [181] "semi_join" "setdiff" "setequal"
## [184] "show_query" "slice" "slice_"
## [187] "sql" "sql_escape_ident" "sql_escape_string"
## [190] "sql_join" "sql_select" "sql_semi_join"
## [193] "sql_set_op" "sql_subquery" "sql_translate_env"
## [196] "src" "src_df" "src_local"
## [199] "src_mysql" "src_postgres" "src_sqlite"
## [202] "src_tbls" "starts_with" "starwars"
## [205] "storms" "summarise" "summarise_"
## [208] "summarise_all" "summarise_at" "summarise_each"
## [211] "summarise_each_" "summarise_if" "summarize"
## [214] "summarize_" "summarize_all" "summarize_at"
## [217] "summarize_each" "summarize_each_" "summarize_if"
## [220] "sym" "syms" "tally"
## [223] "tally_" "tbl" "tbl_cube"
## [226] "tbl_df" "tbl_nongroup_vars" "tbl_sum"
## [229] "tbl_vars" "tibble" "top_n"
## [232] "transmute" "transmute_" "transmute_all"
## [235] "transmute_at" "transmute_if" "tribble"
## [238] "trunc_mat" "type_sum" "ungroup"
## [241] "union" "union_all" "vars"
## [244] "with_order" "wrap_dbplyr_obj"
sessionInfo()
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] RCurl_1.95-4.11 bitops_1.0-6 data.table_1.11.6
## [4] jsonlite_1.5 httr_1.3.1 cowplot_0.9.3
## [7] datasets.load_0.3.0 bindrcpp_0.2.2 ggrepel_0.8.0
## [10] forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6
## [13] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1
## [16] tibble_1.4.2 tidyverse_1.2.1 reshape2_1.4.3
## [19] DT_0.4 ggplot2_3.0.0 knitr_1.20
## [22] icon_0.1.0 emo_0.0.0.9000 png_0.1-7
## [25] magick_1.9
##
## loaded via a namespace (and not attached):
## [1] modelr_0.1.2 shiny_1.1.0 assertthat_0.2.0 highr_0.7
## [5] cellranger_1.1.0 yaml_2.2.0 pillar_1.3.0 backports_1.1.2
## [9] lattice_0.20-35 glue_1.3.0 digest_0.6.15 promises_1.0.1
## [13] rvest_0.3.2 colorspace_1.3-2 htmltools_0.3.6 httpuv_1.4.5
## [17] plyr_1.8.4 pkgconfig_2.0.2 broom_0.5.0 haven_1.1.2
## [21] xtable_1.8-2 scales_1.0.0 later_0.7.3 withr_2.1.2
## [25] lazyeval_0.2.1 cli_1.0.0 magrittr_1.5 crayon_1.3.4
## [29] readxl_1.1.0 mime_0.5 evaluate_0.11 fansi_0.3.0
## [33] nlme_3.1-137 xml2_1.2.0 tools_3.5.1 hms_0.4.2
## [37] munsell_0.5.0 compiler_3.5.1 rlang_0.2.2 rstudioapi_0.7
## [41] htmlwidgets_1.2 crosstalk_1.0.0 miniUI_0.1.1.1 labeling_0.3
## [45] rmarkdown_1.10 gtable_0.2.0 curl_3.2 R6_2.2.2
## [49] lubridate_1.7.4 utf8_1.1.4 bindr_0.1.1 rprojroot_1.3-2
## [53] stringi_1.2.4 Rcpp_0.12.18 tidyselect_0.2.4
A work by R-Ladies Melbourne