CEREO R

Intro to R

This week was our introduction to R. The lecture was a brief overview of what R is and how to use it, with the goal of making the newer R users in our community literate in R syntax. For this session, users needed to have downloaded both R and RStudio.

First we discussed what R is, along with some great resources that people can use to learn R. The first resource comes from the Software Carpentry team: https://swcarpentry.github.io/r-novice-gapminder/. The second is from one of our fellow R group members, Rachel Olsson, and is an introduction for new users designed for one of her labs, but applicable here. Make sure to download both the Lab1 Walkthrough and the Floral_diversity dataset.

The R script for our session (in .txt), along with the notes we added to it in class: IntroRScript
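If you want a taste of the syntax before opening the script, here is a minimal, generic sketch of the kind of basics an intro session covers (this is not the session script itself):

# assign values with <- and inspect them by typing the name
x <- c(2, 4, 6)        # a numeric vector
mean(x)                # functions take arguments in parentheses

# data frames hold tabular data; mtcars ships with R
head(mtcars)           # first six rows
summary(mtcars$mpg)    # summary statistics for one column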

Package Intro: Multivariate Time Series – MAR models

This week Dr. Steve Katz will discuss multivariate time series analysis using the MARSS package. There is some supplementary material for this talk:

Packages needed: MAR1 and MARSS

An example of using MAR1 and MARSS on ecological data: R demo supplement 20130305

The package user guide to help orient you with the MARSS package

https://cran.r-project.org/web/packages/MARSS/vignettes/UserGuide.pdf
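If you want to try MARSS before the session, here is a minimal, generic sketch of fitting a simple state-space model to simulated data (this is not part of the supplement above):

library(MARSS)

# simulate a short univariate time series: a random walk plus observation noise
set.seed(1)
y <- matrix(cumsum(rnorm(50)) + rnorm(50, sd = 0.5), nrow = 1)

# MARSS() expects an n x T matrix; the default model is a simple
# state-space form whose parameters are estimated by maximum likelihood
fit <- MARSS(y)
summary(fit)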

Research Profile: GLMM and Predictions

Tomorrow PhD candidate Zoe Hanley will discuss generalized linear mixed models (GLMMs) in R and making prediction maps for wolf distribution. The necessary packages are listed below, with a minimal model sketch after the list:

library(glmmADMB) #Generalized Linear Mixed Modeling (GLMMs). Includes zero-inflated distributions.
#Use download instructions from: http://glmmadmb.r-forge.r-project.org/
library(graphics) #temporal autocorrelation graphs
library(lattice) #PACK vs. YEAR graphs
library(bbmle) #AIC table
library(plyr) #create cross-validation progress bar
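Here is a minimal sketch of the kind of call glmmADMB uses. The data frame and variable names here are hypothetical stand-ins, not Zoe's data:

library(glmmADMB)

# hypothetical example data: counts per pack per year
wolves <- data.frame(count = rpois(40, 2),
                     year  = rep(2008:2011, each = 10),
                     pack  = factor(rep(1:10, times = 4)))

# a zero-inflated GLMM with a random intercept for each pack
fit <- glmmadmb(count ~ year + (1 | pack), data = wolves,
                family = "poisson", zeroInflation = TRUE)
summary(fit)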

The data and script can be found below:

RGroup_MAP

RGroup_GLMM

HanleyRProfile


Packrat Package – managing package versions

This week CEREO’s Stephanie Labou introduced us to the packrat package. Packrat is a relatively new package that supports collaboration and keeps code working by maintaining and standardizing the package versions used in a project. Depending on their level of experience, R users may not have run into this issue before, but it is a persistent problem with the R system. Due to the dynamic and open nature of the software, changes and improvements to packages can tweak the way certain functions interact, making old code buggy or obsolete. Packrat is an attempt to control for this.

Packrat, in essence, creates a large zip file with all of the libraries and settings used for a project. Users then send this entire file to their collaborators, who load packages and libraries from that file. This ensures that the package versions used are the same across all collaborators. Within packrat, each folder is essentially its own project with its own packages – packrat folders are created within the working directory when the creation command is called.

The first step in using packrat is to create, or “bundle,” your libraries. This is shown in the script below. In addition, the script below uses the “::” syntax to call commands. The double colon is a way to specify exactly which package’s command is being used. This matters because some packages have commands with the same name – whichever package is loaded last masks the identically named commands from the other. This is why people’s scripts sometimes have notes about the loading order of packages; a minimal illustration follows the script link below.

PackratBundling
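As a standalone illustration of the “::” point (separate from Stephanie’s script), consider plyr and dplyr, which both define a summarise() function:

library(plyr)
library(dplyr)   # loaded last, so its summarise() masks plyr's

# be explicit about which package's function you mean:
dplyr::summarise(mtcars, mean_mpg = mean(mpg))
plyr::summarise(mtcars, mean_mpg = mean(mpg))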

Once packrat has bundled the libraries for a project, you can send the entire file to a collaborator. To re-create the project on your own computer, open a brand new R session and then follow the script below, which will unbundle the packrat file created in the script above:

PackratUnbundling
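The linked scripts are the real walkthrough; for orientation, the core packrat commands look roughly like this (the paths and file name here are hypothetical):

# in the project you want to share:
packrat::init("~/my_project")   # snapshot the project's packages
packrat::bundle()               # write everything into a single bundle file

# on the collaborator's machine, in a fresh R session:
packrat::unbundle("my_project-bundle.tar.gz", where = "~/projects")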

Once you are working within a packrat session there are some useful commands to know. One is sessionInfo(), which shows which version of R you are running and which package versions are loaded. There is also a way to install older versions of packages – this is useful if you want to create a new packrat project but realize your current packages are too new. Information on how to do that can be found here.
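For example (the package name and version below are arbitrary; install_version() from the devtools package is one common way to get an older release):

sessionInfo()   # R version, attached packages, and their versions

# install a specific older release of a package from CRAN's archive
library(devtools)
install_version("ggplot2", version = "1.0.1")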

Additionally, the scripts provided by Stephanie do an excellent job of annotating, or commenting on, the code. This is especially important when working with collaborators, but it also matters when working solo, since it makes issues easier to troubleshoot. Good annotations can help users determine whether issues are code issues, are package related (and can therefore be addressed with packrat), or are (rarely) issues with versions of R. R version errors are harder to fix and are not addressed by the packrat package. But! As Dr. Katz said during this session: “there is a long conversation to be had about strategies in programming for another time.”

Enjoy packrat!

PCA and Atmospheric Research

Today Tsengel Nergui showed us how she used Principal Component Analysis (PCA) in her atmospheric research. The script and data provided show an excellent example of PCA application. Tsengel discusses not only the interpretation of the results, but also some of the standardization one can do prior to PCA.

In the discussion portion of the session we talked about how a conceptual understanding of PCA can be broken into two philosophies: calculating the eigenvalues or focusing on the dissimilarity matrix. Both lead to the same place, but some researchers may find one or the other strategy more compelling. PCA, and indeed other multivariate approaches in R, are very clearly explained in Manly’s Multivariate Statistical Methods: A Primer. The 4th edition has a website that includes example data and script for R. Another good resource is the R package vegan.
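As a generic illustration of the eigenvalue view (not Tsengel’s script), base R’s prcomp() does the decomposition, and the squared standard deviations it returns are the eigenvalues:

# PCA on a built-in dataset, standardizing each variable first
pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)   # proportion of variance explained by each component
pca$sdev^2     # the eigenvalues
head(pca$x)    # the scores (data projected onto the components)
biplot(pca)    # quick look at loadings and scores together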

In addition to discussing PCA, we also discussed loading JPEGs in R. This is very simple to do with the jpeg package; a minimal sketch follows the package list below.

This talk will require the following packages:

library(stats)
library(plyr)    # plyr must be loaded before dplyr
library(dplyr)
library(ggplot2)
library(jpeg)    # readJPEG()
# rasterImage() comes with the base graphics package
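Here is the minimal JPEG-loading sketch promised above (the file name is hypothetical):

library(jpeg)

img <- readJPEG("photo.jpg")           # hypothetical file in the working directory
plot(0:1, 0:1, type = "n", axes = FALSE,
     xlab = "", ylab = "")             # set up an empty plotting canvas
rasterImage(img, 0, 0, 1, 1)           # draw the image onto it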

Necessary script and data below:

Rsession_MixedBag2_tsengel

BEL116_hourly_O3_met_2012Summer


High Performance Computing and R – WSU’s Kamiak Cluster

This week we had a guest speaker, Jeff White from IT, who discussed accessing the Kamiak High Performance Computer on campus (slides can be found here). We also discussed creating .csv files and getting that data into R.

Kamiak is a computer that may be accessed by any student with approved access. Access can be set up by contacting CIRC, the Center for Institutional Research Computing, which runs Kamiak, through their Service Desk. You will need to make an account first, and your adviser or project PI will need to vouch for you.

Kamiak is a large computer, or “cluster” of smaller computers, which work in tandem. Kamiak is a Linux system – what that means functionally is that you access it through what is called the “secure shell”, or ssh. This is an interface that communicates with the computer remotely: you load it up on your personal computer and can then run programs and software on Kamiak. It is not a point-and-click system, but one driven by typed commands, in this case Linux shell commands. Information on how to install or open ssh software on your own computer can be found here: https://hpc.wsu.edu/users-guide/terminal-ssh/.

Once you have ssh running, and an active Kamiak account, you log into the computer using your WSU credentials. There are a vast number of commands you can use to communicate with the computer – here is a good resource for learning Linux in general, which covers both the “secure shell” and how to write scripts to run programs: http://linuxcommand.org/. From Jeff’s lecture there were a number of quick commands that he used, which I have summarized below and on our Resources page.

On Kamiak, the primary way of managing “jobs” (programs the computer is running) is through a scheduling software called Slurm. The following commands all start with an “s” because they are Slurm-specific commands – they are not generic Linux commands, though in many cases those work too. For more information see the entire Training PDF.

sinfo #shows what CPUs are available to use
sbatch #submits a job script. Example: sbatch myjob.sh
scontrol #shows detailed information about jobs. Example: scontrol show job 345
scancel #cancels jobs. Example, to cancel job number 345: scancel 345
sq #shows all of your running or pending jobs (Slurm's standard command is squeue)

#Other commands
idev #opens up an interactive session to run programs without writing a .sh script and submitting it to the computer
cat slurm-345.out #prints the output file of a specific job, in this example job number 345

In general, Kamiak and Linux systems work like this: you write a “script” – basically a set of commands for the computer to carry out on its own – then submit that script to the computer and look at the results afterward. These script files are .sh files and can be written in any of a number of programs called text editors. A basic, relatively simple one, which can be opened on Kamiak with the “vim” command, is vim. Once you have written the instructions into your .sh file, you move the file, and any associated data, to Kamiak and tell Kamiak to run it. Kamiak will run it as commanded and save the output wherever you have directed it to. A great example of running a file, and of a simple Kamiak .sh script, can be found on the Kamiak website here.

There are a few different ways to move files to and from Kamiak. For Mac or Linux users it is relatively easy, as there are built-in programs that transfer files. For Windows users a great program is WinSCP, which can be used either through the command line (i.e., the code) or through a point-and-click interface. All of these programs work the same way: you first connect to Kamiak from your computer, then move files, then disconnect.

Here is an example of creating a .csv file in R and then moving it to Kamiak, on a Windows computer. Mac and Linux users will have a similar experience.

Creating the File
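The screenshots walk through each step; the R side of this first step looks roughly like this (file and column names are hypothetical):

# build a small data frame and write it out as a .csv
dat <- data.frame(site = c("A", "B", "C"), value = c(1.2, 3.4, 5.6))
write.csv(dat, "example_data.csv", row.names = FALSE)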

Connecting to Kamiak to transfer the file using WinScp

Connecting to Kamiak, note the name of Kamiak and the port number.

Moving the file

Using R on Kamiak

When using R on Kamiak it is important to create a default space in your own home directory for packages to install to. Our own Tung Nguyen has created a guide for us on the Kamiak website: https://hpc.wsu.edu/r/
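The guide at that link is the authority; the general idea, in rough R terms, is the following (the library path here is an assumption for illustration, not necessarily what the guide uses):

# create a personal library folder in your home directory (hypothetical path)
dir.create("~/R/libs", recursive = TRUE, showWarnings = FALSE)

# tell R to look there first, then install packages into it
.libPaths(c("~/R/libs", .libPaths()))
install.packages("dplyr", lib = "~/R/libs")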


High Resolution Graphics, Memory Issues, and WSU security

Today in our troubleshooting session we addressed exporting high-resolution figures from RStudio. Because RStudio’s export dialog does not allow for a resolution increase, we use code that relies on R’s built-in graphics devices, which also works great for those who just use native R rather than RStudio. The script does not require any packages, as it uses base R functions, but it does require the working directory to be the desired destination folder – otherwise you’ll have to search your hard drive for the file! The script for that is here: ExportingHighResolutionFigures
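The linked script is the full version; the core pattern with base R’s png() device looks like this (the file name and resolution are just examples):

# open a device that writes a 300 dpi PNG, draw the plot, then close it
png("figure1.png", width = 6, height = 4, units = "in", res = 300)
plot(mpg ~ wt, data = mtcars)
dev.off()   # the file is written to the working directory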

We also discussed memory issues in R. R is not very streamlined in its memory use, so there are a few tricks we can use to help it run more efficiently. The first is the ls() function, which lists the products, or objects, you currently have stored in your R session. The more objects you have, the more memory you are using. The ls() command shows the same objects that are easily viewed in RStudio’s Environment panel on the top right.

If you would like to remove any objects from the environment you can use the rm() command. Place the name of the object you would like to remove within the parentheses and it will be deleted from the environment.

An additional tool for reclaiming memory is the gc() command. “gc” stands for “garbage collector” and, while it doesn’t delete any objects, it releases memory associated with deleted or altered objects.
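Putting those three together in a short, generic example:

big <- rnorm(1e7)                       # create a large object
ls()                                    # "big" now shows up in the session
print(object.size(big), units = "MB")   # see how much memory it uses

rm(big)                                 # remove it from the environment
gc()                                    # ask the garbage collector to release the memory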

Lastly, we discussed a persistent issue at WSU with accessing data from external sources. The current workaround, if you are using a Windows OS, is to tell R which internet method it needs to use. R’s default is currently not working for some data retrieval, so using Internet Explorer’s internet settings is necessary (WSU’s security allows Explorer to get data). The code to do this is setInternet2(TRUE).


Bringing in Data and Publicly Available Data Packages

This week we discussed how we bring in data, forms of data, good sources for help, and some packages that pull in publicly available data.

First of all, we talked about RStudio (https://www.rstudio.com/). RStudio is a great interface for using R, and in addition it allows for some point-and-click methods of bringing in data. The “Import Dataset” button in the top-right pane of the RStudio interface allows you to input data from either a local file on your computer or a connection to the internet.

Now, data can also be brought in through code. A good resource for ways to import specific types of data is this Quick-R page: http://www.statmethods.net/input/importingdata.html. The most common data type that people work with is .csv files, which are imported using the read.csv() command. If you want to read an Excel file you need the “xlsx” package.
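For example (the file names here are hypothetical):

dat <- read.csv("my_data.csv")   # read a comma-separated file into a data frame
head(dat)

library(xlsx)                    # reading Excel files (requires Java)
xl <- read.xlsx("my_data.xlsx", sheetIndex = 1)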

If you want to read data from a website, which the point-and-click method in RStudio lets you do, there are many ways to do it. Two common ways are using the “RCurl” package or the “data.table” package. Examples of that code are below. Remember, to use a package you first need to have the package installed (“buying the book”) and then use the library() command to load it (“taking the book off the shelf”).

library(RCurl)
myfile <- getURL("https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/data/bycatch.csv", ssl.verifyhost = FALSE, ssl.verifypeer = FALSE)
bycatch <- read.csv(text = myfile) #getURL() returns the file as raw text, so parse it into a data frame

library(data.table)
mydat <- fread("http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat")
head(mydat)

Some packages that we discussed which make use of publicly available data are:

Out of these packages, EcoRetriever is the hardest to install. You must first install the Retriever program from http://www.data-retriever.org/, then install the ecoretriever package. This allows you to program queries of the data available at data-retriever.org.

An example of using one of these packages, the dataRetrieval package, which is automatically loaded through the package “EGRET”, can be found here: r_for_hydrology_script. This script is from R Working Group contributor Tung Nguyen.

In addition to those packages there were questions about Economic and Social Science data sources. Here are some packages or resources that I tracked down which have data specific for those fields: