In this example, the COPASutils package will be used to read in, process, and analyze data resulting from a Genome Wide Association Study (GWAS) using Caenorhabditis elegans nematodes and a COPAS BIOSORT large particle flow cytometer. This example assumes that the COPASutils package, with all necessary dependencies, is installed on your local machine. To install the COPASutils package, you can use the command install.packages("COPASutils").
Note: Some of the commands below will produce warnings when executed on a local machine. The warnings describe issues such as missing or strange values fed to the underlying functions and should not alarm the user. Warnings have been suppressed in this vignette for the sake of aesthetics.
The COPASutils package will first be loaded so that we can utilize the associated functions and example data:
library(COPASutils)
Note: All of the raw .txt files referenced below are available at https://github.com/AndersenLab/COPASutils-data
We can now read in the plate data from one of the example data sets. In this GWAS experimental design, each plate is set up with three worms sorted to each well in every other column. An example data set, called "plateData1.txt", that illustrates this step of the experiment is available from the Andersen Laboratory website (andersenlab.org). This text file is a representative example of raw data collected from a Union Biometrica BIOSORT large particle flow cytometer with ReFLx sampler. We will first read this file in and save it to a data frame called setupRaw:
setupRaw <- readPlate("plateData1.txt")
If we look at the head of this data frame we can begin to step through the different data that are output by the COPAS machine:
head(setupRaw)
## Source: local data frame [6 x 15]
## Groups: row, col
##
## row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red object call50
## 1 A 1 6 341 206 0 42 57 44 0.6041 0.12317 0.16716 0.12903 1.0000 object
## 2 A 1 6 345 226 62 57 61 44 0.6551 0.16522 0.17681 0.12754 1.0000 object
## 3 A 1 2 53 29 421 10 4 5 0.5472 0.18868 0.07547 0.09434 0.9987 object
## 4 A 1 6 341 185 452 60 64 47 0.5425 0.17595 0.18768 0.13783 1.0000 object
## 5 A 1 0 487 193 858 47 47 38 0.3963 0.09651 0.09651 0.07803 1.0000 object
## 6 A 1 0 58 17 920 3 3 4 0.2931 0.05172 0.05172 0.06897 0.9989 object
We can see that the COPAS machine groups the output data primarily by row and column of a 96-well plate, represented by the columns row and col in the above data frame. Each row in the data frame represents the readings for a single object. Working through the columns left to right, we see that the sorter returns the sort status of the object in the sort column (for our purposes, we are only concerned with instances where sort = 6, as these worms were sorted to the respective wells in the target plate; for a complete explanation of sort status codes, please consult the machine's user manual). We then see TOF, which stands for "time of flight," a measure of the length of the object in microns. Next is EXT, or "extinction," a measure of the optical density of the object. Following this is the time column, which represents the relative time from the first object sorted in each well; the first object to pass through the flow cell for each well is therefore assigned a time of 0. Next are the peak height values for each of the fluorescence channels, in the green, yellow, and red columns. The next four columns contain the normalized EXT, green, yellow, and red values (each value divided by TOF). Lastly, the columns object and call50 represent data returned from the support vector machine (SVM) that probabilistically determines whether each "object" is actually an object (cell, worm, etc.) or a bubble. This feature is useful if the experiment, like ours, requires the bubble trap hardware to be bypassed. It is important to note that the SVM was trained on data from our own machine, so its efficacy should be tested against your own data before you rely on this feature. The call50 column displays "object" if the probability of being an object (the object column) is greater than 0.5 and "bubble" otherwise.
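The two derived quantities described above can be sketched in a few lines of base R. This is only an illustration of the stated rules (a normalized value is the raw value divided by TOF, and call50 thresholds the SVM probability at 0.5), not the package's internal code, and the values below are toy numbers:

```r
# Toy readings, named after the columns described above
TOF <- c(341, 53, 487)
EXT <- c(206, 29, 17)
object_prob <- c(1.0000, 0.9987, 0.3757)  # hypothetical SVM probabilities

norm.EXT <- EXT / TOF                      # normalized value = value / TOF
call50 <- ifelse(object_prob > 0.5, "object", "bubble")
call50
```

Only the third toy object falls below the 0.5 probability cutoff, so it alone is labeled a bubble.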
If you would like to remove the last two columns (i.e., read in the data without the help of the SVM), you can set the SVM argument to FALSE, as below:
setupRaw2 <- readPlate("plateData1.txt", SVM=FALSE)
head(setupRaw2)
## Source: local data frame [6 x 13]
## Groups: row, col
##
## row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red
## 1 A 1 6 341 206 0 42 57 44 0.6041 0.12317 0.16716 0.12903
## 2 A 1 6 345 226 62 57 61 44 0.6551 0.16522 0.17681 0.12754
## 3 A 1 2 53 29 421 10 4 5 0.5472 0.18868 0.07547 0.09434
## 4 A 1 6 341 185 452 60 64 47 0.5425 0.17595 0.18768 0.13783
## 5 A 1 0 487 193 858 47 47 38 0.3963 0.09651 0.09651 0.07803
## 6 A 1 0 58 17 920 3 3 4 0.2931 0.05172 0.05172 0.06897
We can also set cutoffs for minimum and maximum time of flight and extinction values as such:
setupRaw3 <- readPlate("plateData1.txt", tofmin=60, tofmax=1000, extmin=50, extmax=500)
head(setupRaw3)
## Source: local data frame [6 x 15]
## Groups: row, col
##
## row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red object call50
## 1 A 1 6 341 206 0 42 57 44 0.6041 0.12317 0.16716 0.12903 1 object
## 2 A 1 6 345 226 62 57 61 44 0.6551 0.16522 0.17681 0.12754 1 object
## 3 A 1 6 341 185 452 60 64 47 0.5425 0.17595 0.18768 0.13783 1 object
## 4 A 1 0 487 193 858 47 47 38 0.3963 0.09651 0.09651 0.07803 1 object
## 5 A 1 0 257 248 2012 66 77 60 0.9650 0.25681 0.29961 0.23346 1 object
## 6 A 1 0 469 259 2199 63 70 57 0.5522 0.13433 0.14925 0.12154 1 object
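These cutoffs presumably act as simple range filters on the raw readings; their effect can be mimicked in base R on a toy data frame (a sketch, not the package's internal implementation):

```r
# Toy raw readings; objects outside the TOF/EXT ranges are dropped
raw <- data.frame(TOF = c(341, 53, 487, 58), EXT = c(206, 29, 193, 17))
kept <- subset(raw, TOF >= 60 & TOF <= 1000 & EXT >= 50 & EXT <= 500)
nrow(kept)  # 2 of the 4 toy objects fall inside both ranges
```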
Lastly, we can read in a data file from a BioSorter machine, which utilizes the LP Sampler instead of the BIOSORT machine's ReFLx module. We can adjust the way in which the file is read to compensate for this difference by setting the reflx parameter to FALSE:
biosorterData <- readPlate("BioSorter.txt", reflx=FALSE)
head(biosorterData)
## Source: local data frame [6 x 15]
## Groups: row, col
##
## row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red object call50
## 1 A 1 0 480 356.0 0 2.70 1.65 2.33 0.7417 0.005625 0.003438 0.004854 1.0000 object
## 2 A 1 0 138 29.9 95 0.51 0.39 0.41 0.2167 0.003696 0.002826 0.002971 0.9999 object
## 3 A 1 0 40 1.2 931 0.11 0.11 0.11 0.0300 0.002750 0.002750 0.002750 0.3757 bubble
## 4 A 1 0 280 71.0 973 8.25 1.06 3.47 0.2536 0.029464 0.003786 0.012393 1.0000 object
## 5 A 1 0 264 61.6 1125 11.80 1.14 2.65 0.2333 0.044697 0.004318 0.010038 1.0000 object
## 6 A 1 0 492 55.8 1195 11.00 1.85 6.03 0.1134 0.022358 0.003760 0.012256 1.0000 object
Now that we have read our raw data into R using the readPlate function, we probably want to summarize the data by well. For instance, in the GWAS experiment we are examining, three worms should have been sorted into each well in every other column, and we can check the sorting results against this design. To summarize the data we can use the summarizePlate function:
setupSummarized <- summarizePlate(setupRaw)
colnames(setupSummarized)
## [1] "row" "col" "n" "n.sorted" "mean.TOF"
## [6] "median.TOF" "mean.EXT" "median.EXT" "mean.red" "median.red"
## [11] "mean.green" "median.green" "mean.yellow" "median.yellow" "mean.norm.EXT"
## [16] "median.norm.EXT" "mean.norm.red" "median.norm.red" "mean.norm.green" "median.norm.green"
## [21] "mean.norm.yellow" "median.norm.yellow"
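The kind of by-well summary that summarizePlate performs can be sketched with base R's aggregate on toy data (the real function computes many more statistics per well):

```r
# Toy raw data: two wells (A1 and A2) with two objects each
raw <- data.frame(row = c("A", "A", "A", "A"),
                  col = c(1, 1, 2, 2),
                  TOF = c(341, 345, 53, 58))
# One row of summary statistics per (row, col) well
aggregate(TOF ~ row + col, data = raw,
          FUN = function(x) c(mean = mean(x), median = median(x)))
```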
We now see that we have many more trait variables, many of which describe the distribution of the originally measured values by well. We can get an even more complete picture by adding in extra quantiles (the quantiles argument), log-transformed values of EXT and the fluorescence channels (the log argument), and the minimum and maximum of each trait (the ends argument), as below:
setupSummarized2 <- summarizePlate(setupRaw, quantiles=TRUE, log=TRUE, ends=TRUE)
colnames(setupSummarized2)
## [1] "row" "col" "n" "n.sorted"
## [5] "mean.TOF" "min.TOF" "q10.TOF" "q25.TOF"
## [9] "median.TOF" "q75.TOF" "q90.TOF" "max.TOF"
## [13] "mean.EXT" "min.EXT" "q10.EXT" "q25.EXT"
## [17] "median.EXT" "q75.EXT" "q90.EXT" "max.EXT"
## [21] "mean.red" "min.red" "q10.red" "q25.red"
## [25] "median.red" "q75.red" "q90.red" "max.red"
## [29] "mean.green" "min.green" "q10.green" "q25.green"
## [33] "median.green" "q75.green" "q90.green" "max.green"
## [37] "mean.yellow" "min.yellow" "q10.yellow" "q25.yellow"
## [41] "median.yellow" "q75.yellow" "q90.yellow" "max.yellow"
## [45] "mean.norm.EXT" "min.norm.EXT" "q10.norm.EXT" "q25.norm.EXT"
## [49] "median.norm.EXT" "q75.norm.EXT" "q90.norm.EXT" "max.norm.EXT"
## [53] "mean.norm.red" "min.norm.red" "q10.norm.red" "q25.norm.red"
## [57] "median.norm.red" "q75.norm.red" "q90.norm.red" "max.norm.red"
## [61] "mean.norm.green" "min.norm.green" "q10.norm.green" "q25.norm.green"
## [65] "median.norm.green" "q75.norm.green" "q90.norm.green" "max.norm.green"
## [69] "mean.norm.yellow" "min.norm.yellow" "q10.norm.yellow" "q25.norm.yellow"
## [73] "median.norm.yellow" "q75.norm.yellow" "q90.norm.yellow" "max.norm.yellow"
## [77] "mean.log.EXT" "min.log.EXT" "q10.log.EXT" "q25.log.EXT"
## [81] "median.log.EXT" "q75.log.EXT" "q90.log.EXT" "max.log.EXT"
## [85] "mean.log.red" "min.log.red" "q10.log.red" "q25.log.red"
## [89] "median.log.red" "q75.log.red" "q90.log.red" "max.log.red"
## [93] "mean.log.green" "min.log.green" "q10.log.green" "q25.log.green"
## [97] "median.log.green" "q75.log.green" "q90.log.green" "max.log.green"
## [101] "mean.log.yellow" "min.log.yellow" "q10.log.yellow" "q25.log.yellow"
## [105] "median.log.yellow" "q75.log.yellow" "q90.log.yellow" "max.log.yellow"
## [109] "mean.log.norm.EXT" "min.log.norm.EXT" "q10.log.norm.EXT" "q25.log.norm.EXT"
## [113] "median.log.norm.EXT" "q75.log.norm.EXT" "q90.log.norm.EXT" "max.log.norm.EXT"
## [117] "mean.log.norm.red" "min.log.norm.red" "q10.log.norm.red" "q25.log.norm.red"
## [121] "median.log.norm.red" "q75.log.norm.red" "q90.log.norm.red" "max.log.norm.red"
## [125] "mean.log.norm.green" "min.log.norm.green" "q10.log.norm.green" "q25.log.norm.green"
## [129] "median.log.norm.green" "q75.log.norm.green" "q90.log.norm.green" "max.log.norm.green"
## [133] "mean.log.norm.yellow" "min.log.norm.yellow" "q10.log.norm.yellow" "q25.log.norm.yellow"
## [137] "median.log.norm.yellow" "q75.log.norm.yellow" "q90.log.norm.yellow" "max.log.norm.yellow"
We now have a great deal of information describing, in detail, the summary statistics of each of the measured traits. Again, each subset of new trait values can be removed by leaving each of the optional parameters (quantiles, log, and ends) set to FALSE. Each trait follows a specific naming system, wherein any mathematical transformation applied to the original data is prepended to the trait name. For instance, the mean of all of the time-of-flight data (TOF) for each well can be found in the column mean.TOF. If we wanted the 25th quantile of the log-transformed extinction data (EXT), we could find it in the column named q25.log.EXT. All of the naming conventions are outlined in the table below:
Statistic | Abbreviation | Example
---|---|---
mean | mean | mean.TOF
median | median | median.EXT
minimum | min | min.yellow
maximum | max | max.green
normalized to time of flight | norm | mean.norm.red
quantiles | qXX | q25.red
log transformed data | log | mean.log.red
Some statistics, such as the normalized and log-transformed values, are calculated before the data are summarized and, as such, have distributions of their own. For these traits, all mean, median, min, max, and quantile data are available by stringing together names as in the table above.
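Because the prefixes chain, the full set of column names can be generated mechanically; the sketch below builds a few of them with base R (purely an illustration of the naming scheme, not how the package constructs its columns):

```r
# Chain statistic prefixes onto base measurements to form trait names
stats <- c("mean", "median", "q25")
bases <- c("TOF", "log.EXT")
trait_names <- as.vector(outer(stats, bases, paste, sep = "."))
trait_names  # includes "mean.TOF" and "q25.log.EXT"
```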
In the experiment we are examining here, no worms should have been sorted into the wells in the even columns, so the data we see in these columns are the result of background debris or the accidental placement of worms into those wells. We now want to remove these wells before we continue with our analysis. Included in the package is the removeWells function, which does exactly what its name implies:
setupSummarized3 <- removeWells(setupSummarized, c("A2", "A4", "A6")) # All wells you want to remove need to be included here
head(setupSummarized3[,1:10])
## Source: local data frame [6 x 10]
## Groups:
##
## row col n n.sorted mean.TOF median.TOF mean.EXT median.EXT mean.red median.red
## 1 A 1 24 3 342.8 341 241.5 230.5 64.92 49.5
## 2 A 2 NA NA NA NA NA NA NA NA
## 3 A 3 70 3 307.2 310 190.9 209.5 62.61 52.0
## 4 A 4 NA NA NA NA NA NA NA NA
## 5 A 5 69 3 323.9 346 203.1 215.0 47.00 48.0
## 6 A 6 NA NA NA NA NA NA NA NA
The removeWells function takes as input a summarized plate data frame, as well as a vector of strings corresponding to the wells to be removed, and returns the data frame with all phenotype data in those wells set to NA. The optional drop parameter specifies whether to "NA out" the trait data or to drop those rows from the data frame entirely:
setupSummarized4 <- removeWells(setupSummarized, c("A2", "A4", "A6"), drop=TRUE)
head(setupSummarized4[,1:10])
## Source: local data frame [6 x 10]
## Groups:
##
## row col n n.sorted mean.TOF median.TOF mean.EXT median.EXT mean.red median.red
## 1 A 1 24 3 342.8 341 241.5 230.5 64.92 49.5
## 3 A 3 70 3 307.2 310 190.9 209.5 62.61 52.0
## 5 A 5 69 3 323.9 346 203.1 215.0 47.00 48.0
## 7 A 7 64 3 301.4 285 185.6 202.0 54.28 46.5
## 8 A 8 2 2 245.0 245 211.5 211.5 43.50 43.5
## 9 A 9 82 3 324.0 356 190.7 205.5 44.24 47.0
The converse of the removeWells function is the fillWells function. This function fills the trait data for any wells missing from the selected data frame with NAs. To fill the wells back in from our example above, where the rows were dropped from the data frame, we could use the following command:
setupSummarized5 <- fillWells(setupSummarized4)
head(setupSummarized5[,1:10])
## row col n n.sorted mean.TOF median.TOF mean.EXT median.EXT mean.red median.red
## 1 A 1 24 3 342.8 341 241.5 230.5 64.92 49.5
## 2 A 2 NA NA NA NA NA NA NA NA
## 3 A 3 70 3 307.2 310 190.9 209.5 62.61 52.0
## 4 A 4 NA NA NA NA NA NA NA NA
## 5 A 5 69 3 323.9 346 203.1 215.0 47.00 48.0
## 6 A 6 NA NA NA NA NA NA NA NA
One issue when working with 96-well plates is that of placement within the plate. Edge wells, being exposed to more direct dry air currents, may experience greater evaporation, which may have an effect on different traits in those wells. To test this hypothesis, we can utilize the edgeEffect function:
edgeEffect(setupSummarized, "n")
## Warning: cannot compute exact p-value with ties
## [1] 0.3775
This function takes a summarized plate data frame and the name of the trait to test. In this instance, we tested our setup plate for any effect with respect to the number of worms in each well. The function splits the plate population into wells found on the perimeter of the plate and those found on the interior, then performs a Wilcoxon rank-sum test between the two populations for the specified trait and returns the resulting p-value. Since the returned p-value is greater than our significance threshold of 0.05, we fail to reject the null hypothesis that the two populations are drawn from the same distribution. If we want to simultaneously test all traits, we can simply not specify a trait; a data frame of all traits and associated p-values will be returned.
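The perimeter-versus-interior split described above, followed by a Wilcoxon rank-sum test, can be sketched in base R on simulated data (this is the assumed behavior of edgeEffect, not the package's own code, and the trait values are simulated):

```r
set.seed(1)
plate <- expand.grid(row = LETTERS[1:8], col = 1:12)  # 96 wells
plate$n <- rpois(nrow(plate), 50)                     # toy trait values
# Perimeter wells: first/last row or first/last column
edge <- with(plate, row %in% c("A", "H") | col %in% c(1, 12))
sum(edge)  # 36 perimeter wells on a 96-well plate
suppressWarnings(wilcox.test(plate$n[edge], plate$n[!edge]))$p.value
```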
Now that we have access to both the unsummarized and summarized data, we would like to visualize the results from this plate. The plate in this example was set up with worms in every other column. To confirm that large populations only exist in every other column, we will plot a heat map of the plate representing the population present in each well using the plotTrait function:
plotTrait(setupSummarized, "n")
We can see that the larger populations of worms are, in fact, present only in every other column. We can also see wells that are missing data points, as they are colored gray and lack the numerical annotation. The returned plot is a ggplot2 object and as such can be manipulated using standard ggplot2 functions and grammar; this extensibility is true for all of the following plotting functions as well. For instance, we can add a title as such:
plotTrait(setupSummarized, "n") + ggtitle("Example Heatmap")
We are not simply limited to heat maps, however. By plotting the raw data, we can get a better feel for the distributions of the traits. We can plot a histogram of the values in each well:
plotTrait(setupRaw, "TOF", type="hist")
Or we can plot a scatter plot between two traits:
plotTrait(setupRaw, "TOF", "EXT", type="scatter")
We may also want to compare how traits differ across plates. Included in the package is a function called plotCompare that accomplishes this task. Here, we'll compare the distributions of the time-of-flight values between the data in setupRaw and a new plate called scoreRaw. These two plates are entered as a list in the first argument to the function. Then, we specify the trait to be compared (TOF in this case). Finally, we enter a vector of the plate names as the optional third argument:
scoreRaw <- plateData2
plotCompare(list(setupRaw, scoreRaw), "TOF", plateNames=c("Setup", "Score"))
We can see that side-by-side box plots are plotted for the values in each well. Likewise, we can compare the summarized values between plates by feeding in summarized plate data:
scoreSummarized <- summarizePlate(scoreRaw)
plotCompare(list(setupSummarized, scoreSummarized), "mean.TOF", plateNames=c("Setup", "Score"))
In addition, we can check for correlation between traits both within and across plates using a function called plotCorMatrix. This function plots a correlation heat map between all of the traits either within or between summarized plates. Here, we will examine correlations between traits within the summarized setup data:
plotCorMatrix(setupSummarized)
In the above matrix we can see that some traits are positively correlated, some are negatively correlated, and some are completely uncorrelated. We can also examine these patterns between plates as such:
plotCorMatrix(setupSummarized, scoreSummarized)
We can now see an instance where all values of a trait (n.sorted) were equivalent and therefore the correlation could not be calculated. Tiles for these traits are all set to gray.
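The gray tiles arise because a correlation involving a zero-variance trait is undefined; base R's cor shows the same behavior on toy data:

```r
# Toy summarized traits; n.sorted is constant across wells
df <- data.frame(mean.TOF = c(300, 320, 310),
                 mean.EXT = c(200, 210, 205),
                 n.sorted = c(3, 3, 3))
# cor() returns NA (with a warning) when a column has zero variance
suppressWarnings(cor(df))["n.sorted", "mean.TOF"]
```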
If we now transition to some new data, representing an experiment in which drug dose-response curves were measured, we can utilize the last plotting function in the package, plotDR. We will first load the example data set, called doseData, into the variable dosesRaw. We will then summarize these data, filling in the strains with a 96-element vector corresponding to the names of the strains across the plate, row-wise.
dosesRaw <- doseData
strains <- rep(c("Strain 1", "Strain 2", "Strain 3", "Strain 4"), each=6, times=4)
dosesSummarized <- summarizePlate(dosesRaw, strains)
doses <- rep(c(0,2.5,5,10,20,NA), times=16)
plotDR(dosesSummarized, dosages=doses, trait="mean.TOF")
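Both layout vectors must contain exactly one element per well of the 96-well plate, ordered row-wise; that alignment can be checked directly:

```r
# Rebuild the layout vectors from above and confirm they cover 96 wells
strains <- rep(c("Strain 1", "Strain 2", "Strain 3", "Strain 4"),
               each = 6, times = 4)
doses <- rep(c(0, 2.5, 5, 10, 20, NA), times = 16)
c(length(strains), length(doses))  # both must equal 96
```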
We can see that we now need to include the strains vector when summarizing the data, as well as a dosages vector when plotting the dose response. We also might like to see how the strains vary across every trait. We can generate a list of ggplot2 objects using the plotDR_allTraits function:
plotList <- plotDR_allTraits(dosesSummarized, dosages=doses)
We can even access each plot by name using the scheme below:
plotList$median.red
plotList$mean.EXT
The COPASutils package provides quick, streamlined tools for reading, processing, and plotting data resulting from COPAS platform machines. Here, we have analyzed data from several different experiments. Of course, analysis pipelines for the data will change from project to project, and the pipeline described here may not be the best fit for your data. COPASutils provides a general and flexible work flow for COPAS data and should be easily adaptable to your specific project.