
R-TreeBASE Tutorial

My treebase package is now up on the CRAN repository. (Source code is up; the binaries should appear soon.) Here are a few introductory examples to illustrate some of the functionality of the package. Thanks in part to new data deposition requirements at journals such as Evolution, Am Nat, and Sys Bio, and to data management plan requirements from NSF, I hope the package will become increasingly useful for teaching by replicating results and for meta-analyses that can be automatically updated as the repository grows. Please contact me with any bugs or requests (or post in the issue tracker).

Basic tree and metadata queries

Downloading trees by different queries: by author, taxon, & study. More options are described in the help file.

[code lang="R"]
both <- search_treebase("Ronquist or Huelsenbeck", by=c("author", "author"))
dolphins <- search_treebase('"Delphinus"', by="taxon", max_trees=5)
studies <- search_treebase("2377", by="id.study")
Near <- search_treebase("Near", "author", branch_lengths=TRUE)
[/code]

We can query the metadata record directly. For instance, we can plot the growth of TreeBASE submissions by publication date:

[code lang="R"]
all <- search_metadata("", by="all")
dates <- sapply(all, function(x) as.numeric(x$date))
hist(dates, main="TreeBase growth", xlab="Year")
[/code]

(This query could also take a date range).
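For example, here is a minimal sketch of a date-restricted query. This assumes search_metadata() accepts dates through the by argument, e.g. by="until", as described in its help file:

[code lang="R"]
# hedged example: only metadata records deposited up to a given date
early <- search_metadata("2009-03-19", by="until")
[/code]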

How do the weeklies do on submissions to TreeBASE? We construct this query in a way that gives us back the indices of the matches, so we can then grab those trees directly. Run the scripts yourself to see if the counts have changed!

[code lang="R"]
nature <- sapply(all, function(x) length(grep("Nature", x$publisher))>0)
science <- sapply(all, function(x) length(grep("^Science$", x$publisher))>0)
> sum(nature)
[1] 14
> sum(science)
[1] 14
[/code]

Now get me all of those treebase trees that have appeared in Nature.

[code lang="R"]
s <- get_study_id(all[nature])
nature_trees <- sapply(s, function(x) search_treebase(x, "id.study"))
[/code]

Which authors have the most submissions?

[code lang="R"]
authors <- sapply(all, function(x){
index <- grep( "creator", names(x))
x[index] })
a <- as.factor(unlist(authors))
> head(summary(a))
Crous, Pedro W. Wingfield, Michael J. Groenewald, Johannes Z.
88 68 58
Donoghue, Michael J. Takamatsu, Susumu Wingfield, Brenda D.
39 36 35
[/code]

Replicating results

A nice paper by Derryberry et al. appeared in Evolution recently on diversification in ovenbirds and woodcreepers [cite]10.1111/j.1558-5646.2011.01374.x[/cite]. The article mentions that the tree is on Treebase, so let’s see if we can replicate their diversification rate analysis:

Let’s grab the trees by that author and make sure we have the right one:

[code lang="R"]
require(treebase)
tree <- search_treebase("Derryberry", "author")[[1]]
metadata(tree$S.id)
plot(tree)
[/code]

(figure: the Derryberry et al. tree, plotted above)

They fit a variety of diversification-rate models available in the laser package, which they compare by AIC.

[code lang="R"]
require(laser)
tt <- branching.times(tree)
models <- list(pb = pureBirth(tt),
bdfit = bd(tt),
y2r = yule2rate(tt), # yule model with single shift pt
ddl = DDL(tt), # linear, diversity-dependent
ddx = DDX(tt), # exponential, diversity-dependent
sv = fitSPVAR(tt), # vary speciation in time
ev = fitEXVAR(tt), # vary extinction in time
bv = fitBOTHVAR(tt)# vary both
)
aics <- sapply(models, function(x) x$aic)
# show the winning model
> models[which.min(aics)]
$y2r
          LH      st1        r1         r2       aic
276 505.0685 1.171871 0.1426537 0.05372305 -1004.137
[/code]

Yup, their result agrees with our analysis. Using the extensive toolset for diversification rates in R, we could now rather easily check whether these results hold up under newer methods such as TreePar.
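For instance, here is a rough sketch with TreePar. The parameter choices are illustrative assumptions on my part (complete sampling, at most one rate shift, a one-unit grid of candidate shift times), not the Derryberry et al. analysis; check ?bd.shifts.optim for the exact interface.

[code lang="R"]
require(TreePar)
# TreePar wants speciation times sorted from the present backwards
x <- sort(branching.times(tree), decreasing=TRUE)
# sampling=c(1,1) assumes complete sampling and allows up to one shift;
# grid/start/end place the candidate shift times (illustrative settings only)
shifts <- bd.shifts.optim(x, sampling=c(1,1), grid=1, start=0, end=max(x))
[/code]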

Meta-Analysis

Of course one of the more interesting challenges of having an automated interface is the ability to perform meta-analyses across the set of available phylogenies in TreeBASE. As a simple proof of principle, let's check all the phylogenies in TreeBASE to see whether a birth-death model or a Yule model fits each of them better.

We’ll create two simple functions to help with this analysis. While these can be provided by the treebase package, I’ve included them here to illustrate that the real flexibility comes from being able to create custom functions. ((These are primarily illustrative; I hope users and developers will create their own. In a proper analysis we would want a few additional checks.))

[code lang="R"]
timetree <- function(tree){
check.na <- try(sum(is.na(tree$edge.length))>0)
if(is(check.na, "try-error") || check.na) # || short-circuits if the try failed
NULL
else
try( chronoMPL(multi2di(tree)) )
}
drop_errors <- function(tr){
tt <- tr[!sapply(tr, is.null)]
tt <- tt[!sapply(tt, function(x) is(x, "try-error"))]
print(paste("dropped", length(tr)-length(tt), "trees"))
tt
}
require(laser)
pick_branching_model <- function(tree){
m1 <- try(pureBirth(branching.times(tree)))
m2 <- try(bd(branching.times(tree)))
as.logical(try(m1$aic < m2$aic)) # TRUE when the Yule (pure birth) model has the lower AIC
}
[/code]

Return only the TreeBASE trees that have branch lengths. This has to download every tree in TreeBASE, so it will take a while. Good thing we don't have to do it by hand.

[code lang="R"]
all <- search_treebase("Consensus", "type.tree", branch_lengths=TRUE)
tt <- drop_errors(sapply(all, timetree))
is_yule <- sapply(tt, pick_branching_model)
table(is_yule)
[/code]
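A one-liner turns that table into the headline number (na.rm drops any trees where a fit failed and returned NA):

[code lang="R"]
# fraction of time-calibrated trees where the Yule model wins on AIC
mean(is_yule, na.rm=TRUE)
[/code]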

(cross-posted from my notebook)

FishBASE from R

In a lab known for its quality data collection, high-speed video style, writing the weekly blog post can be a bit of a challenge for the local code monkey. That's right, no videos today. But lucky for me, even this group can still make good use of publicly available data. Like the other day, when one of our newer graduate students wanted to do some data-mining of the information available on FishBASE. Now I'm sure there are plenty of reasons to grumble about it, but there's quite an impressive bit of information available on FishBASE, most of it decently referenced, if by no means comprehensive. And getting that kind of information quickly and easily is rapidly becoming a skill every researcher needs. So here's a quick tutorial on the tools to do this.

Okay, so what are we talking about? Let’s start with an example fishbase page on Nile tilapia:

Okay, so there's a bunch of information we could start copying down before going on to the next fish. Sounds pretty tedious for 30,000 species… We could get an html copy of this page, but that's got all this messy formatting we'd have to dispense with. Luckily, we scroll down a little farther and we see the magic words:

Download XML

The summary page takes us to just what we were looking for:

To the human eye this might not look very useful, but to a computer, it's just what we wanted. ((Well, not actually. A RESTful interface to this data would be better, where we could query by the different categories, finding all fish of a certain family or diet, but we'll manage just fine from here, it just might be a bit slower.)) It's XML – eXtensible Mark-up Language: meaning it's all the stuff on the previous page, marked up with all these helpful tags so that a computer can make sense of the document. ((Actually, not everything on the html page. Between the broken links, there's tons of information like the life history tool, diet data references, etc. that somehow hasn't made it into the XML summary sheet.))

To process this, we'll fire up our favorite language and start scripting. To access the internet from R we'll use the command-line url browser, curl, available in the RCurl package. We'll process the XML with the XML package, so let's load that too. Install these from CRAN if you don't have them:

[source lang="r"]
require(XML)
require(RCurl)
[/source]

Click on that XML link to the summary page, and note how the url is assembled: http://fishbase.sinica.edu.tw/maintenance/FB/showXML.php?identifier=FB-2&ProviderDbase=03

The thing to note here is what follows the php?. There's something called an identifier, which is set equal to the value FB-2; 2 is the id number of the Nile tilapia. Change that to 3 (leave the rest the same) and you get the African perch.

We can grab the content of this page in R and parse the XML using these two packages:

[source lang="r"]
fish.id <- 2
url <- paste("http://fishbase.sinica.edu.tw/",
"maintenance/FB/showXML.php?identifier=FB-",
fish.id, "&ProviderDbase=03", sep="")
tt <- getURLContent(url, followlocation=TRUE)
doc <- xmlParse(tt)
[/source]

We create the url with a given fish id. Then the getURLContent line is the equivalent of pointing your browser to that url. Because the page is XML, we then read or parse the page with xmlParse, creating an R object representation of all that XML. We are ready to rock and roll.

If you look closely at the XML, you'll see all these tags placed around the information we want, like the scientific names, identifying what they are. For instance, we see lines that look like this:

[xml] <dwc:Family>Cichlidae</dwc:Family>
<dwc:Genus>Oreochromis</dwc:Genus>
[/xml]

Clearly these are telling us the family and the genus of this creature. The computer just has to look for the dwc:Family tag (this whole entry is called a node) and that's the family. The dwc stands for Darwin Core, which is a big language, or ontology, created explicitly for taxonomic applications. It tells the computer that "Family", in Darwin-Core speak, means precisely the taxonomic sense of the word. The computer can go and look up the dwc "namespace" given at the beginning of this document to learn exactly what that word means.

To grab this text using R, we simply ask for the value of the node named dwc:Family:

[source lang="r"]
Family <- xmlValue(getNodeSet(doc, "//dwc:Family")[[1]])
Family
[/source]

There are two commands combined in that line. getNodeSet() is the first, getting any nodes that have the name dwc:Family anywhere (the // means anywhere) in the document. The [[1]] indicates that we want the first node it finds (since there's only one of these in the entire document). Specifying which node we want makes use of the xpath syntax, a powerful way of navigating XML which we'll use more later.

The xmlValue command just extracts the contents of that node, now that we’ve found it. If we ask R for the value it got for Family, it says Cichlidae, just as expected.
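The same one-liner works for any node with a unique name; for instance, the genus tag we saw above:

[source lang="r"]
# same pattern as Family; dwc:Genus appears once in this document
Genus <- xmlValue(getNodeSet(doc, "//dwc:Genus")[[1]])
Genus # "Oreochromis"
[/source]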

That was easy. We can do the same for Genus, Scientific Name, etc. As we go down the XML document, though, we see some of the information we want isn't given a uniquely labeled node.
For instance, the information for size, habitat, and distribution all sits under nodes called dc:description. A closer look shows these nodes are under "parent" nodes, all called dataObject. If we look at the other "child" nodes of these dataObjects, we see they also have a node called dc:identifier, like this:

[xml] <dataObject>
<dc:identifier>FB-Size-2</dc:identifier>
… other stuff ….
<dc:description> Text We need </dc:description>
… other stuff ….
</dataObject>
[/xml]

That identifier node tells us that this dataObject has the size information. We can grab the content of the dc:description by first finding the identifier that says FB-Size-2, going up to its parent dataObject, and asking that dataObject for its child node called description. Like this:

[source lang="r"]
size_node <- getNodeSet(doc, paste("//dc:identifier[string(.) = 'FB-Size-",
fish.id, "']/..", sep=""))
size <- xmlValue( size_node[[1]][["description"]] )
[/source]

Okay, so that looks crazy complicated — guess we got in a bit deep. See that mention of xpath above? A wee xpath tutorial will explain most of what's going on here. It's pretty simple if you take it slow. Using these kinds of tricks, we can extract just the information we need from the XML.

Meanwhile, if you want the fast-track option, I’ve put this together in a little R package called rfishbase. The package has a function called fishbase() which extracts various bits of information using these calls. Once you get the hang of it you can modify it pretty easily. I do a little extra processing to get numbers from text using Regular Expressions, a sorta desperate but powerful option when everything isn’t nicely labeled XML.
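For a flavor of that regular-expression step (the pattern below is illustrative, not the exact one rfishbase uses):

[source lang="r"]
# pull a number out of free text, e.g. a maximum age in years
txt <- "max. reported age: 9 years"
as.numeric(sub(".*age: *([0-9.]+) *years.*", "\\1", txt)) # 9
[/source]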

Using this function we query a fish id number and get back the data in a nice R list, which we can happily manipulate. Here’s a quick demo:

[source lang="r"]
require(rfishbase)

## a general function to loop over all fish ids to get data
getData <- function(fish.ids){
suppressWarnings(
lapply(fish.ids, function(i) try(fishbase(i)))
)
}

# A function to extract the ages, handling missing values
get.ages <- function(fish.data){
  sapply(fish.data, function(out){
    if(is(out, "try-error"))
      NA
    else
      out$size_values[["age"]]
  })
}

# Process all the XML first, then extract the ages
fish.data <- getData(2:500)
yr <- get.ages(fish.data)

# Plot data
hist(yr, breaks=40, main="Age Distribution", xlab="age (years)")
[/source]

Note we take a bit of care to avoid the missing values using the try() function. Here are the results:

Kinda cool, isn’t it?

Showcasing the latest phylogenetic methods: AUTEUR

While high-speed fish feeding videos may be the signature of the lab, dig a bit deeper and you’ll find a wealth of comparative phylogenetic methods sneaking in.  It’s a natural union — expert functional morphology is the key to good comparative methods, just as phylogenies hold the key to untangling the evolutionary origins of that morphology.  The lab’s own former graduate, Brian O’Meara, made a revolutionary step forward in the land of phylogenetic methods when he unveiled Brownie in 2006, allowing researchers to identify major shifts in trait diversification rates across the tree.  This work spurred not only a flood of empirical applications but also methodological innovations, such as Liam’s brownie-lite, and today’s focus: Jon Eastman et al.‘s auteur package.

Auteur, short for “Accommodating uncertainty in trait evolution using R,” is the grown-up Bayesian RJMCMC version of that original idea in Brownie.  Diversification rates can change along the phylogenetic tree — only this time, you don’t have to specify where those changes could have occurred, or how many there may have been — auteur simply tries them all.

If you want the details, definitely go read the paper — it's all there, clear and thorough.  Meanwhile, what we really want to do is take it out for a test drive.

The package isn't up on CRAN yet, so grab the development version from Jon's github page.  Put that package in a working directory and fire up R in that directory.  Let's go for a spin.

[sourcecode language="R"]
install.packages("auteur_0.11.0612.tar.gz", repos=NULL)
library(auteur)

[/sourcecode]

Great, the package installed and loaded successfully. Looks like Jon's put all 73 functions into the NAMESPACE, but it's not hard to guess which one is the right one to start with: rjmcmc.bm.  Yeah, that looks good.  It has a nice help file, with — praise the fish — example code.  Looks like we're gonna run a simulation where we know the answer, and see how it does:

[sourcecode language="R"]

#############
## generate tree
n=24
while(1) {
phy=prunelastsplit(birthdeath.tree(b=1,d=0,taxa.stop=n+1))
phy$tip.label=paste("sp",1:n,sep="")
rphy=reorder(phy,"pruningwise")

# find an internal edge
anc=get.desc.of.node(Ntip(phy)+1,phy)
branches=phy$edge[,2]
branches=branches[branches>Ntip(phy) & branches!=anc]
branch=branches[sample(1:length(branches),1)]
desc=get.descendants.of.node(branch,phy)
if(length(desc)>=4) break()
}
rphy=phy
rphy$edge.length[match(desc,phy$edge[,2])]=phy$edge.length[match(desc,phy$edge[,2])]*64

e=numeric(nrow(phy$edge))
e[match(c(branch,desc),phy$edge[,2])]=1
cols=c("red","gray")
dev.new()
plot(phy,edge.col=ifelse(e==1,cols[1],cols[2]), edge.width=2)
mtext("expected pattern of rates")

#############
## simulate data on the 'rate-shifted' tree
dat=rTraitCont(phy=rphy, model="BM", sigma=sqrt(0.1))

[/sourcecode]

That creates this beautiful example tree (sorry, no random-generator seed, so your results may vary, but that's ok):
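If you want the exact tree shown here to be reproducible, set R's random seed before running the simulation block above:

[sourcecode language="R"]
set.seed(1) # any fixed seed makes the simulated tree and data repeatable
[/sourcecode]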


Okay, so that’s the target, showing where the shift occurred.  Note the last line got us some data on this tree.  We’re ready to run the software.  It looks super easy:

[sourcecode language="R"]
## run two short reversible-jump Markov chains
r=paste(sample(letters,9,replace=TRUE),collapse="")
lapply(1:2, function(x)
rjmcmc.bm(phy=phy, dat=dat, ngen=10000, sample.freq=10,
prob.mergesplit=0.1, simplestart=TRUE, prop.width=1,
fileBase=paste(r,x,sep=".")))
[/sourcecode]

The data go in as "phy" and "dat", just as expected.  We won't worry about the optional parameters that follow for the moment.  Note that because we use lapply to run multiple chains, it would be super easy to run this on multiple processors.
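As a minimal sketch of that parallel version (this assumes the parallel package, whose mclapply() forks processes and so runs the chains on separate cores on unix-alikes):

[sourcecode language="R"]
library(parallel)
# run the two chains on two cores instead of sequentially
mclapply(1:2, function(x)
rjmcmc.bm(phy=phy, dat=dat, ngen=10000, sample.freq=10,
prob.mergesplit=0.1, simplestart=TRUE, prop.width=1,
fileBase=paste(r,x,sep=".")),
mc.cores=2)
[/sourcecode]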

Note that Jon’s creating a bunch of directories to store parameters, etc.  This can be important for MCMC methods where chains get too cumbersome to handle in memory.  Enough technical rambling, let’s merge and load those files in now, and plot what we got:

[sourcecode language="R"]
# collect directories
dirs=dir("./",pattern=paste("BM",r,sep="."))
pool.rjmcmcsamples(base.dirs=dirs, lab=r)

## view contents of .rda
load(paste(paste(r,"combined.rjmcmc",sep="."),paste(r,"posteriorsamples.rda",sep="."),sep="/"))
print(head(posteriorsamples$rates))
print(head(posteriorsamples$rate.shifts))

## plot Markov sampled rates
dev.new()
shifts.plot(phy=phy, base.dir=paste(r,"combined.rjmcmc",sep="."), burnin=0.5, legend=TRUE, edge.width=2)

# clean-up: unlink those directories
unlink(dir(pattern=paste(r)),recursive=TRUE)
[/sourcecode]

Not only is that a beautiful plot, but it’s nailed the shift in species 12-16.  How’d your example do?

Auteur comes with three beautiful large data sets described in the paper.  Check them out, but expect longer run times than our simple example!

[sourcecode language="R"]

data(chelonia)
# take a look at this data
> chelonia
$phy
Phylogenetic tree with 226 tips and 225 internal nodes.

Tip labels:
Elseya_latisternum, Chelodina_longicollis, Phrynops_gibbus, Acanthochelys_radiolata, Acanthochelys_macrocephala, Acanthochelys_pallidipectoris, …

Rooted; includes branch lengths.

$dat
Pelomedusa_subrufa                   Pelusios_williamsi
2.995732                             3.218876

dat <- chelonia$dat
phy <- chelonia$phy
## ready to run as above

[/sourcecode]

Thanks Jon and the rest of the Harmon Lab for a fantastic package. This is really just the tip of the iceberg, but it should help get you started. See the paper for a good example of the posterior analyses requisite after running any kind of MCMC, or stay tuned for a later post.
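As a tiny example of such a check, using the posteriorsamples object we loaded above (a data frame of branch-specific rates per posterior sample, assuming the layout shown by the print(head(...)) calls):

[sourcecode language="R"]
# trace of the across-tree mean rate; a flat, well-mixed trace after
# burnin is the first sanity check for any MCMC run
mean.rate <- rowMeans(posteriorsamples$rates)
plot(mean.rate, type="l", xlab="posterior sample", ylab="mean rate")
[/sourcecode]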
