I often want to compare distributions that have different numbers of data points. This is fairly easy to do with R and ggplot2.
Step 1. Get data from first and second sources.
> e = read.table("ensembl_last_exon_distance.txt", header=TRUE)
> r = read.table("refgene_last_exon_distance.txt", header=TRUE)
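If you don't have the input files, here is a minimal sketch of what read.table with header=TRUE hands back. The one-column file layout and the temporary file are my assumptions for illustration; only the column name ensembl and the three values come from the session in this post.

```r
# Hypothetical stand-in for ensembl_last_exon_distance.txt: the real file's
# layout is assumed to be a header line followed by one distance per line.
tmp <- tempfile(fileext = ".txt")
writeLines(c("ensembl", "6157", "18815", "43723"), tmp)

# header = TRUE turns the first line into the column name
e_demo <- read.table(tmp, header = TRUE)

names(e_demo)     # column name taken from the header line
nrow(e_demo)      # number of data rows
e_demo$ensembl    # the distances as a numeric vector

unlink(tmp)       # remove the temporary file
```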
Step 2. The two data sources have different column headers, so I can't rbind them directly. Even if I could, I wouldn't be able to tell which value came from which source once they shared a column. So I copy each data set into a new data.frame with a common header (Distance), then add a second column (DataSource) that labels where each row came from.
> ensemblu = data.frame(Distance = (e$ensembl))
> refgeneu = data.frame(Distance = (r$refgene))
> ensemblu$DataSource = 'ensembl'
> refgeneu$DataSource = 'refgene'
> head(refgeneu)
  Distance DataSource
1    71914    refgene
2   259289    refgene
3    24759    refgene
4     8520    refgene
5   103292    refgene
6   148873    refgene
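To see why the renaming matters, here is a small sketch with toy data frames. The names a, b, a2, b2 and ab are made up for illustration, and the few values are borrowed from the output in this post. rbind on data frames matches columns by name, so mismatched headers raise an error, while a shared header goes through.

```r
# Toy stand-ins for e and r (hypothetical; only a few rows each)
a <- data.frame(ensembl = c(6157, 18815, 43723))
b <- data.frame(refgene = c(71914, 259289))

# Different column names: rbind refuses to combine them
failed <- inherits(try(rbind(a, b), silent = TRUE), "try-error")
failed

# Same column name plus a label column: rbind works
a2 <- data.frame(Distance = a$ensembl, DataSource = "ensembl")
b2 <- data.frame(Distance = b$refgene, DataSource = "refgene")
ab <- rbind(a2, b2)
dim(ab)
```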
Step 3. Combine the two data sets. The head and tail show that both now sit in a single Distance column. I will use the DataSource column when plotting to distinguish the data sets visually.
> both = rbind(ensemblu,refgeneu)
> head(both)
  Distance DataSource
1     6157    ensembl
2    18815    ensembl
3    43723    ensembl
4    48196    ensembl
5    31755    ensembl
6    93981    ensembl
> tail(both)
      Distance DataSource
23037    42503    refgene
23038    26796    refgene
23039    34782    refgene
23040    18100    refgene
23041     6066    refgene
23042     7635    refgene
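Since the whole point is that the two sources have different numbers of points, a quick sanity check after combining is to count rows per label. This is a sketch on toy data (both_demo is a hypothetical stand-in built from a few values shown in this post); on the real data the equivalent call would be table(both$DataSource).

```r
# Hypothetical miniature of the combined data frame
both_demo <- rbind(
  data.frame(Distance = c(6157, 18815, 43723), DataSource = "ensembl"),
  data.frame(Distance = c(71914, 259289),      DataSource = "refgene")
)

# How many rows came from each source
counts <- table(both_demo$DataSource)
counts

# A per-source summary, here the median distance
med <- tapply(both_demo$Distance, both_demo$DataSource, median)
med
```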
Step 4. Plot. You can plot a plain histogram of the values, a log-transformed one, one of rnorm draws, etc. very easily.
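> library(ggplot2)
> # geom_histogram bins a continuous variable; in current ggplot2, geom_bar counts each distinct value instead
> ggplot(both, aes(Distance, fill=DataSource)) + geom_histogram(alpha=0.5)
> ggplot(both, aes(log(Distance), fill=DataSource)) + geom_histogram(alpha=0.5)
> # rnorm() uses only the length of Distance, so this plots random normal draws, not the data
> ggplot(both, aes(rnorm(Distance), fill=DataSource)) + geom_histogram(alpha=0.5)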
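When the two sources differ a lot in size, raw counts can make the smaller one hard to read. One option, which is not part of the session above, is to plot densities so that each source is scaled to unit area. Here is a sketch on simulated data; the log-normal parameters, the sample sizes and the name both_sim are arbitrary assumptions, not the real exon distances.

```r
library(ggplot2)

set.seed(1)  # make the simulated data reproducible

# Hypothetical data: two sources with very different numbers of points
both_sim <- rbind(
  data.frame(Distance = rlnorm(2000, meanlog = 10,   sdlog = 1),
             DataSource = "ensembl"),
  data.frame(Distance = rlnorm(500,  meanlog = 10.5, sdlog = 1),
             DataSource = "refgene")
)

# geom_density scales each group to unit area, so groups of different
# sizes are directly comparable; alpha keeps the overlap visible
p <- ggplot(both_sim, aes(log(Distance), fill = DataSource)) +
  geom_density(alpha = 0.5)
p
```

On the real data, the same call with both in place of both_sim should give the equivalent picture.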
For more details, go here.