LAB 3 – Niche Modeling, Thresholding, and some Challenges for those who finish everything else early.

Couple up front issues:  You are going to need a lot of disk space today and it really helps to have DIVA-GIS or another GIS at your disposal for layer processing.  Also, this lab will go quickly if your computer is fast and much more slowly on older boxes.

After further thought, I have decided to move some of the more difficult post-analysis operations into next week’s lab.  For next week, we will do thresholding, and choosing which thresholds to apply, and how to choose non-standard threshold amount, along with how to interpret Receiver-Operator characteristics and area under the curve values.  We will also do projecting and how to interpret projected results, clamping, etc.    We will be projecting backwards onto two different Last Glacial Maximum climate models and reconciling the final outputs from the two runs with the two different models.  We will also be forward projecting onto doubleCO2 conditions at 2090.  FUN!   Finally, we will do some other operations like determining elevation or other characteristics of suitable or unsuitable areas in the final model outputs.

The American Museum of Natural History has put together a very good site concerning the theory and practice of niche modeling.   Creating niche models and species distribution models is a three step process:  1.)  Determine input data sources and data types; 2.) Apply a modeling algorithm (or algorithms) to the input data; 3.)  Determine final output views and assess predictive performance.  We are going to try doing these steps in labs. 

PART 1.  Determine input data sources and data types. Please start by skimming this: http://biodiversityinformatics.amnh.org/index.php?section_id=105&content_id=344 

a.       We will be using species occurrence datasets that either you already have or that you can easily collect from the Global Biodiversity Information Facility.  Your input point occurrence spreadsheet must contain a taxon name in column A, longitude (remember in North America, our longitude is negative) in Column B, and latitude in Column C.  For the purposes of the lab, we are going to restrict ourselves to terrestrial systems.  If you want to run marine or freshwater experiments down the road, see me.

b.      For the purposes of today’s lab, we will restrict our input data to worldclim layers in ASCII grid format.    Files are here: http://www.worldclim.org/current.htm (remember to download the generic formats) – and please select the bioclim layers not the monthly layers (although you could use those in your models too). 

c.         To use the layers, you need to do some work.   The file format of the download is the “.bil” format.  To get these into ASCII grids, extract the zip file and then load the .bil files into DIVA-GIS .  To do this, you need to import grid files.   Go to Data->Import to GridFile -> Multiple Files.  Select the radio button labeled .bil.  Then select all  the files from the folder you extracted – just “select all” and then hit APPLY.   You can now look at the layers that are in DIVA native format.  Go ahead and click “Layer->Add Layer” and select one of the new bioclim layers in .grd to load into DIVA.  Just to see it show up.  You are almost there.  The Maxent format for layers is ASCII GRID format.  So now you just need to go to Data -> Export to Grid File -> Multiple Files and select the ESRI ASCII radio button.  Load all the .grd files (DIVA native format; don’t try loading the original .bil files!) and then hit APPLY to convert them to the ESRI ASCII format.   Move the  ESRI ASCII (.asc ) files you just created to a subdirectory where they are on their own.   I am also going to post worldclim layers to the website for download, to help speed the process along.  We are going to use the bioclim summary variables for the modeling.  Your last lab lists each bioclim variable and what they represent, so refer back to the lab for that info.

d.      For our niche modeling experiments, we will use the whole world climate layers.  Once you are familiar with running these experiments, you can cut the extent of the modeling to a smaller region.  To do so, you just need to create a stack from the .GRD files you already created in DIVA, decide an extent, and cut the stack to that extent.  Then export them back out into ESRI ASCII format like above.  VOILA!  This will make the experiments run faster, and for taxa that you know have small ranges, a worldwide experiment doesn’t make a ton of sense.

PART 2.  Upload your input data to a modeling algorithm/computer program and run a niche modeling experiment.  There are multiple approaches to niche modeling.   Please read a bit about this here:  http://biodiversityinformatics.amnh.org/index.php?section_id=105&content_id=345

For a variety of reasons, the most important being ease of use and reporting quality, we are only going to run Maxent modeling in class today.   I have used GARP relatively extensively, and GARP has some very nice features, but it is also deeply frustrating to get GARP up and running and it is more difficult to assess model performance using GARP for novice users.  HOWEVER, you are encouraged to download GARP and see what it can do (if you have a PC and are somewhat masochistic). 

e.       Make sure you have the Java Run Time Environment installed on your computer.  You need it to get Maxent up and running. 

f.        Go here: http://www.cs.princeton.edu/~schapire/maxent/

g.       Download the application and download the tutorial (http://www.cs.princeton.edu/~schapire/maxent/tutorial/tutorial.doc) and tutorial data (http://www.cs.princeton.edu/~schapire/maxent/tutorial/tutorial-data.zip).  Note that layers are included in the tutorial distribution.  Import one of them into DIVA-GIS to determine the extent of the layers.  In this case, you should discover that the GIS layers in the tutorial are for South America.  PLEASE RUN THROUGH THE TUTORIAL IN ITS ENTIRETY EXCEPT FOR PROJECTING.   We will use the Project function extensively next week.  THE TUTORIAL SHOULD TAKE ABOUT 45 minutes to an hour (so this is a long lab!)

h.      You want to run your own experiments today.  To do so, format your species occurrence file as a .csv file (eg. comma separated ) by using the Save As feature in Excel and selecting .csv file.

i.         Load the Bioclim data you downloaded previously.    Just select the directory where the bioclim variables reside under “environmental layers”.

j.        Load your own occurrence data under Samples.   Check and make sure the areas under the file loading prompts look reasonable (should see 19 selectable layers under the “environmental layers” import area and a species name with a check next to it under “samples”

k.       The tutorial does a good job explaining some of the options for running experiments.  I usually use a 25% random test percentage.  This makes 25% of the records “testing records”.  The other 75% are training.  If you don’t select “random seed” the same records are used for testing and training each run.  If you do select “random seed” then different records are used for testing and training each run, which will affect results a bit run to run (different training point sets could for example include different outliers that might affect results).  The other “settings” features are probably fine as they are. 

l.         Make sure the “create response curves” and “make pictures” check boxes are checked.

m.    For reasons I will come back to later, for now set the output format to “culmulative” and the output file type to .asc (this is ASCII Grid format). 

n.      Choose an output directory where all your outputs from the modeling experiment will be saved.

o.      Hit “Run”

p.      Depending on your dataset size and your computer, an experiment at global extent will take something on the order of 10-25 minutes.

q.      While the experiment is running, grab some food and please look over this about assessing model performance: http://biodiversityinformatics.amnh.org/index.php?section_id=105&content_id=346

 

PART 3 .  Post-modeling experiment work

1)       Given time constraints, we won’t do certain operations quite yet.  The first thing to do is look at web output of the modeling results in the output folder.   You should see an Internet Explorer or Firefox document in the output folder.  Open it up and scroll down to the results picture.  If you see only blue colors, something went wrong.  You should see a range of colors from red to blue.  These colors represent a scale from high regularization values to very low.  A high regularization value is red and represents the likelihood that the environmental – based on the input GIS layers – is suitable for the species.   The picture generated from Maxent is just to be used for “did it work” instant gratification. 

2)      The next task is importing the output .asc file back into DIVA-GIS for further examination.    This is actually very simple.  Just go ahead and import the .asc file into DIVA-GIS just like you did for the .bil files, but make sure the radio button is selected for ESRI ASCII for import.  You can just import the single file instead of multiple files.

 

New stuff this week_________________________________________________

3)      Once the model is opened in DIVA, you can look at the overall outputs with other layers, etc etc.  Be especially sure to include the original shapefile of your occurrence points; this is really useful for getting a sense of how the thresholding works.

4)      You want to take the continuous probability values output from Maxent and make a binary state prediction, defining suitable and unsuitable.  To do this, first determine which threshold you want to apply.  Examine the HTML file output from your modeling run and look at the table right underneath the ROC curves.   This table lists different thresholds and values.   The “minimum training presence” threshold defines the smallest possible range of suitable habitat that also includes all the training data points used in model construction.  The 10th percentile training presence would allow 10% of the training records to be in unsuitable areas.  Another approach is to look at the ROC curves and ask the following:  As we change the threshold values in Maxent, we make the models more specific (less area predicted), which means more occurrences are likely to be falsely labeled as occurring in unsuitable areas (decreasing sensitivity).  At the same time, the proportion of observed absences correctly predicted increases (increasing specificity).  These are plotted on the first graph of your outputs as a red curve and dark blue curve.  The point where those lines cross is equal sensitivity and specificity, and that culmulative threshold value is what you use to create a binary output.

5)      So, write down a value for a couple different thresholds from your output file.  Go into DIVA-GIS, and double click the niche model output layer.   This opens the layer properties tab.   First reduce the number of “rows” to 2 by clicking the big red minus sign under “Add or Remove Rows”.  Once you have two rows, go to the top pull down menu and select “Manual” and then simply type in O to X (where X is the threshold value you wrote down from step D) in the From and To box respectively for the first row.  You can also click on the color to change it to something you like more.  Then for the second row, type in X to 100 in the From and To box respectively.  Hit Ok.  You will then basically have effectively thresholded the layer, at least visually, in DIVA.  You can also use the Grid -> Reclass function to make values 0 to X equal to 0 (unsuitable) and X to 100 equal to 1 (suitable).  This does change your values in the grid file permanently from their original probability value to 0 or 1.

 

PART 4.  CHALLENGES!

For those of you with extra time, here are a couple challenges to try:

a)      Explore the following datasets that might be included in a niche modeling experiment

i)        Soil layers:  http://www.grid.unep.ch/data/data.php?category=lithosphere

ii)       Hydrology/Slope/Aspect: http://eros.usgs.gov/products/elevation/gtopo30/hydro/index.html

iii)     Land cover/Land Use data: http://glcf.umiacs.umd.edu/data/

iv)     Global aridity (good luck with Google)

b)      Calculate a 5% minimum presence threshold.  This value is not listed in the table in the HTML output so you need to use a different methodology.  Here is what you do:  First, find an output file that has samplePredictions in the name of the file.  Open this in Excel.  The output is the value in terms of maxent prediction for each occurrence.  Sort the file by by test or train  and then by culmulative prediction in Excel.  Count up how many training points you have in total.  Take that number and multiply by .05.  Round the number to the nearest whole.  So if you have 100 training points, for example, that is 5 records.  If you have 109, that is 5, and 111, that is six.   Now count down that many records in the training dataset in your sorted spreadsheet.  So if it was 100 training records, look at the fifth training record and its cumulative prediction.  That value is the 5% minimum training presence value for thresholding.  Think about why this “makes sense”.

c)       Talk to me (and work on) your project due Feb. 27th.