LAB 2,3,4 – Niche Modeling, Thresholding, and some Challenges for those who finish everything else early.

Lab for February 7th – Niche modeling week 1

Couple up front issues:  You are going to need a lot of disk space today and it really helps to have DIVA-GIS or another GIS at your disposal for layer processing.  Also, this lab will go quickly if your computer is fast and much more slowly on older boxes.

After further thought, I have decided to move some of the more difficult post-analysis operations into next week’s lab.  For next week, we will do thresholding, and choosing which thresholds to apply, and how to choose non-standard threshold amount, along with how to interpret Receiver-Operator characteristics and area under the curve values.  We will also do projecting and how to interpret projected results, clamping, etc.    We will be projecting backwards onto two different Last Glacial Maximum climate models and reconciling the final outputs from the two runs with the two different models.  We will also be forward projecting onto doubleCO2 conditions at 2090.  FUN!   Finally, we will do some other operations like determining elevation or other characteristics of suitable or unsuitable areas in the final model outputs.

The American Museum of Natural History has put together a very good site concerning the theory and practice of niche modeling.   Creating niche models and species distribution models is a three step process:  1.)  Determine input data sources and data types; 2.) Apply a modeling algorithm (or algorithms) to the input data; 3.)  Determine final output views and assess predictive performance.  We are going to try doing these steps in labs. 

PART 1.  Determine input data sources and data types. Please start by skimming this: http://biodiversityinformatics.amnh.org/index.php?section_id=105&content_id=344 

a.       We will be using species occurrence datasets that either you already have or that you can easily collect from the Global Biodiversity Information Facility.  Your input point occurrence spreadsheet must contain a taxon name in column A, longitude (remember in North America, our longitude is negative) in Column B, and latitude in Column C.  For the purposes of the lab, we are going to restrict ourselves to terrestrial systems.  If you want to run marine or freshwater experiments down the road, see me.

b.      For the purposes of today’s lab, we will restrict our input data to worldclim layers in ASCII grid format.    Files are here: http://www.worldclim.org/current.htm (remember to download the generic formats) – and please select the bioclim layers not the monthly layers (although you could use those in your models too). 

c.         To use the layers, you need to do some work.   The file format of the download is the “.bil” format.  To get these into ASCII grids, extract the zip file and then load the .bil files into DIVA-GIS .  To do this, you need to import grid files.   Go to Data->Import to GridFile -> Multiple Files.  Select the radio button labeled .bil.  Then select all  the files from the folder you extracted – just “select all” and then hit APPLY.   You can now look at the layers that are in DIVA native format.  Go ahead and click “Layer->Add Layer” and select one of the new bioclim layers in .grd to load into DIVA.  Just to see it show up.  You are almost there.  The Maxent format for layers is ASCII GRID format.  So now you just need to go to Data -> Export to Grid File -> Multiple Files and select the ESRI ASCII radio button.  Load all the .grd files (DIVA native format; don’t try loading the original .bil files!) and then hit APPLY to convert them to the ESRI ASCII format.   Move the  ESRI ASCII (.asc ) files you just created to a subdirectory where they are on their own.   I am also going to post worldclim layers to the website for download, to help speed the process along.  We are going to use the bioclim summary variables for the modeling.  Your last lab lists each bioclim variable and what they represent, so refer back to the lab for that info.

d.      For our niche modeling experiments, we will use the whole world climate layers.  Once you are familiar with running these experiments, you can cut the extent of the modeling to a smaller region.  To do so, you just need to create a stack from the .GRD files you already created in DIVA, decide an extent, and cut the stack to that extent.  Then export them back out into ESRI ASCII format like above.  VOILA!  This will make the experiments run faster, and for taxa that you know have small ranges, a worldwide experiment doesn’t make a ton of sense.

PART 2.  Upload your input data to a modeling algorithm/computer program and run a niche modeling experiment.  There are multiple approaches to niche modeling.   Please read a bit about this here:  http://biodiversityinformatics.amnh.org/index.php?section_id=105&content_id=345

For a variety of reasons, the most important being ease of use and reporting quality, we are only going to run Maxent modeling in class today.   I have used GARP relatively extensively, and GARP has some very nice features, but it is also deeply frustrating to get GARP up and running and it is more difficult to assess model performance using GARP for novice users.  HOWEVER, you are encouraged to download GARP and see what it can do (if you have a PC and are somewhat masochistic). 

e.       Make sure you have the Java Run Time Environment installed on your computer.  You need it to get Maxent up and running. 

f.        Go here: http://www.cs.princeton.edu/~schapire/maxent/

g.       Download the application and download the tutorial (http://www.cs.princeton.edu/~schapire/maxent/tutorial/tutorial.doc) and tutorial data (http://www.cs.princeton.edu/~schapire/maxent/tutorial/tutorial-data.zip).  Note that layers are included in the tutorial distribution.  Import one of them into DIVA-GIS to determine the extent of the layers.  In this case, you should discover that the GIS layers in the tutorial are for South America.  PLEASE RUN THROUGH THE TUTORIAL IN ITS ENTIRETY EXCEPT FOR PROJECTING.   We will use the Project function extensively next week.  THE TUTORIAL SHOULD TAKE ABOUT 45 minutes to an hour (so this is a long lab!)

h.      You want to run your own experiments today.  To do so, format your species occurrence file as a .csv file (eg. comma separated ) by using the Save As feature in Excel and selecting .csv file.

i.         Load the Bioclim data you downloaded previously.    Just select the directory where the bioclim variables reside under “environmental layers”.

j.        Load your own occurrence data under Samples.   Check and make sure the areas under the file loading prompts look reasonable (should see 19 selectable layers under the “environmental layers” import area and a species name with a check next to it under “samples”

k.       The tutorial does a good job explaining some of the options for running experiments.  I usually use a 25% random test percentage.  This makes 25% of the records “testing records”.  The other 75% are training.  If you don’t select “random seed” the same records are used for testing and training each run.  If you do select “random seed” then different records are used for testing and training each run, which will affect results a bit run to run (different training point sets could for example include different outliers that might affect results).  The other “settings” features are probably fine as they are. 

l.         Make sure the “create response curves” and “make pictures” check boxes are checked.

m.    For reasons I will come back to later, for now set the output format to “culmulative” and the output file type to .asc (this is ASCII Grid format). 

n.      Choose an output directory where all your outputs from the modeling experiment will be saved.

o.      Hit “Run”

p.      Depending on your dataset size and your computer, an experiment at global extent will take something on the order of 10-25 minutes.

q.      While the experiment is running, grab some food and please look over this about assessing model performance: http://biodiversityinformatics.amnh.org/index.php?section_id=105&content_id=346

PART 3 .  Post-modeling experiment work

1)       Given time constraints, we won’t do certain operations quite yet.  The first thing to do is look at web output of the modeling results in the output folder.   You should see an Internet Explorer or Firefox document in the output folder.  Open it up and scroll down to the results picture.  If you see only blue colors, something went wrong.  You should see a range of colors from red to blue.  These colors represent a scale from high regularization values to very low.  A high regularization value is red and represents the likelihood that the environmental – based on the input GIS layers – is suitable for the species.   The picture generated from Maxent is just to be used for “did it work” instant gratification. 

2)      The next task is importing the output .asc file back into DIVA-GIS for further examination.    This is actually very simple.  Just go ahead and import the .asc file into DIVA-GIS just like you did for the .bil files, but make sure the radio button is selected for ESRI ASCII for import.  You can just import the single file instead of multiple files.

 

 

New stuff last week, Feb 13th – Niche modeling week 2

 

Once the model is opened in DIVA, you can look at the overall outputs with other layers, etc etc.  Be especially sure to include the original shapefile of your occurrence points; this is really useful for getting a sense of how the thresholding works.

3)      You want to take the continuous probability values output from Maxent and make a binary state prediction, defining suitable and unsuitable.  To do this, first determine which threshold you want to apply.  Examine the HTML file output from your modeling run and look at the table right underneath the ROC curves.   This table lists different thresholds and values.   The “minimum training presence” threshold defines the smallest possible range of suitable habitat that also includes all the training data points used in model construction.  The 10th percentile training presence would allow 10% of the training records to be in unsuitable areas.  Another approach is to look at the ROC curves and ask the following:  As we change the threshold values in Maxent, we make the models more specific (less area predicted), which means more occurrences are likely to be falsely labeled as occurring in unsuitable areas (decreasing sensitivity).  At the same time, the proportion of observed absences correctly predicted increases (increasing specificity).  These are plotted on the first graph of your outputs as a red curve and dark blue curve.  The point where those lines cross is equal sensitivity and specificity, and that culmulative threshold value is what you use to create a binary output.

4)      So, write down a value for a couple different thresholds from your output file.  Go into DIVA-GIS, and double click the niche model output layer.   This opens the layer properties tab.   First reduce the number of “rows” to 2 by clicking the big red minus sign under “Add or Remove Rows”.  Once you have two rows, go to the top pull down menu and select “Manual” and then simply type in O to X (where X is the threshold value you wrote down from step D) in the From and To box respectively for the first row.  You can also click on the color to change it to something you like more.  Then for the second row, type in X to 100 in the From and To box respectively.  Hit Ok.  You will then basically have effectively thresholded the layer, at least visually, in DIVA.  You can also use the Grid -> Reclass function to make values 0 to X equal to 0 (unsuitable) and X to 100 equal to 1 (suitable).  This does change your values in the grid file permanently from their original probability value to 0 or 1.

 

New Stuff Week of February 20th, Niche Modeling Week 3

PART I, LAB 3 – Preparing the data for projecting onto past or future conditions

This week you are going to work on projecting niches from the present onto climate landscapes of the past and future.  In order to do this, you must prepare such layers for use.   Because you should get practice dealing with layer formatting issues, I am not going to simply provide you with the layers.    Here are some steps:

(1)     You are strong encouraged to download the layers before lab.

(2)    Climate layers in the past, at 2.5 minutes resolution, for 18kya: http://biogeo.berkeley.edu/worldclim1_4/grid/pst/21k/wc_2_5m_CCSM_21k_bio.zip

http://biogeo.berkeley.edu/worldclim1_4/grid/pst/21k/wc_2_5m_MIROC3.2_21k_bio.zip

Note:  these are two separate models of climate conditions at the Last  Glacial Maximum.

(3)    Climate layers in the future, at multiple resolution, for ~2090CE

http://www.worldclim.org/future.htm  (note the download times for very high resolution future layers is going to be “long”)

(4)    As before, you need to verify that these are in ESRI ASCII format.  So you need to use DIVA-GIS to do any conversions.

(5)    Make sure you know the grid-size of your layers for projection.  You can check this in DIVA-GIS by right-clicking a grid layer, selecting properties and looking at the “info” tab.  You should see  x,y min and max (defining the extent of the layer) and cell size.

(6)    Set up separate directories for the MICROC and CCSM Last Glacial Maximum BIO19 variables and for the DoubleCO2 model outputs.  That is, make sure you have three separate  directories for CCSM, MICROC and DoubleCO2, each containing the 19 bioclimatic variable raster layers. 

PART II – LAB 3.  Performing projections

Performing projections is really easy.  Set up your model run the same way you always do.  Make sure you are using Maxent3.0 or later.

(1)    Note at the bottom of the Maxent application is a text box labeled “Projection Layers Directory/File”. 

(2)    Click the browse button and point to the location of the directory containing either the MICROC, CCSM or DoubleCO2 bio19 layers. 

(3)    Run the experiment.  During the runs, you will notice that there is a final step at the end, creating two projected .asc GIS layers.  The first layer is the actual projection onto the climate landscape in the past or future.  The second layer is the clamping layer.   These will show up in your output directory as “<taxonname>_<projectiondirectoryname>.asc”, (for example as “microtusmontanus_microc32.asc”) if the .csv file is named microtusmontanus and the projection directory is named microc32.

(4)    Clamping.  What is it?  One problem with projecting onto a past of future climate landscape is that there are climate conditions in the past and future that have no analogs today.  For example, some regions may be colder and wetter than any existing spot on today’s landscape.    These conditions are outside the range represented in the training data.  Projections onto such conditions are suspect,  and for many applications, you'll want to cookie-cutter them out.   Clamping “does that” for you, by determining which areas in the climate landscape are outside the range represented by the training data, and showing that to you visually.  Also, the maxent3.0 algorithm will downweight those areas in the final “suitability” prediction so they are less likely to be labeled “suitable”. 

(5)    Both the projection and clamping output pictures are in the output .html file.

(6)    You can load the .asc files into DIVA-GIS and examine them just like the “now” outputs that you have examined previously.

(7)    You can use “now” outputs to determine thresholds and apply them to the past or future.  So you can convert your probability function (from 0-100) to binary (suitable/unsuitable) in the past or future as well as present.

(8)    What assumptions are you making when projecting a niche into the past or future?

PART III – Lab 3.  Reconciling models

(1)    For the past suitable habitat predictions, you should run Maxent twice – once for MICROC and once for CCSM.  For these runs, do not click random seed, because using different training and testing subsets will change the thresholding a bit and make it difficult to reconcile model outputs.  Using the same training and testing dataset will mean that thresholds for both models will be the same – useful!

(2)    To combine the results of the models into one consensus model,  load both layers into DIVA-GIS

(3)    Select both grids (using shift-select of the grids on the left panel of DIVA)

(4)    Go to Grid -> Overlay  and make sure the right grids are selected in the “First” and “Second” text box.

(5)    Do the operation “add” and select a resulting grid.

(6)    Now threshold the layers by adding doubling the threshold you are already using for suitability (minimum training presence, or specificity/sensitivity balance, whatever).  So if your threshold is 11.5 for each layer, set the properties of the new layer to have two rows (0-23 equals red, 23-200 equals green).    This effectively combines the layers together.

 

THIS WEEK’S CHALLENGES.  NOW EVEN MORE CHALLENGING.

Now that you have models showing  past, present and future suitable habitats, can you: 

(a)    Find out the elevations of the suitable habitat in the past, present or future?

(i)      Hints:  load the outputs into DIVA, reclass the grid to be 0 for unsuitable and 1 for suitable, export the file as a shapefile of points (changes the midpoint of the grid to a point with the value the grid cell would have), and then extract values by point as you did in Lab 1.    Also, worldwide outputs at high resolution generate A LOT of data, potentially millions of datapoints.  Therefore you should cut the environmental layers to a smaller extent if your taxon (or taxa) is not worldwide in distribution.  Also, think about aggregating your grid to make the resolution more coarse --- if you aggregate a grid using the “mean” value,  and a factor of two, it will calculate mean values for a coarser grid using two adjacent grids.  Try it out and check properties of the new grid for cell sizes compared to the original grid.

(b)   Determine amount of area lost or gained from the past to the present or from present to the future

(i)       Hints:  load the models for past, present and future into DIVA.  Reclass suitable and unsuitable to 0 and 1.  Subtract the layers from each other using GRID -> Overlay -> Subtract function.   Or you can do the above steps to export suitable or unsuitable as a textfile and then do the math in Excel. 

                                                                                      

LAST WEEK’S CHALLENGES!

For those of you with extra time, here are a couple challenges to try:

b)      Explore the following datasets that might be included in a niche modeling experiment

i)        Soil layers:  http://www.grid.unep.ch/data/data.php?category=lithosphere

ii)       Hydrology/Slope/Aspect: http://eros.usgs.gov/products/elevation/gtopo30/hydro/index.html

iii)     Land cover/Land Use data: http://glcf.umiacs.umd.edu/data/

iv)     Global aridity (good luck with Google)

c)       Calculate a 5% minimum presence threshold.  This value is not listed in the table in the HTML output so you need to use a different methodology.  Here is what you do:  First, find an output file that has samplePredictions in the name of the file.  Open this in Excel.  The output is the value in terms of maxent prediction for each occurrence.  Sort the file by by test or train  and then by culmulative prediction in Excel.  Count up how many training points you have in total.  Take that number and multiply by .05.  Round the number to the nearest whole.  So if you have 100 training points, for example, that is 5 records.  If you have 109, that is 5, and 111, that is six.   Now count down that many records in the training dataset in your sorted spreadsheet.  So if it was 100 training records, look at the fifth training record and its cumulative prediction.  That value is the 5% minimum training presence value for thresholding.  Think about why this “makes sense”.

d)      Talk to me (and work on) your project due Feb. 27th.