In this example, the COPASutils package will be used to read in, process, and analyze data resulting from a Genome Wide Association Study (GWAS) using *Caenorhabditis elegans* nematodes and a COPAS BIOSORT large particle flow cytometer. This example assumes that the COPASutils package, with all necessary dependencies, is installed on your local machine. To install the COPASutils package, you can use the command `install.packages("COPASutils")`.

Note: Some of the commands below will produce warnings when executed on a local machine. The warnings describe issues such as missing or strange values fed to the underlying functions and should not alarm the user. Warnings have been suppressed in this vignette for the sake of aesthetics.
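If you would like to silence such warnings in your own scripts, base R's `suppressWarnings` can wrap any of the calls below. A minimal illustration with a generic expression (not a COPASutils call):

```r
# Wrap any call in suppressWarnings() to silence warning messages;
# as.numeric("abc") normally warns that NAs were introduced by coercion
x <- suppressWarnings(as.numeric("abc"))
is.na(x)  # TRUE, and no warning is printed
```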

The COPASutils package will first be loaded so that we can utilize the associated functions and example data:

`library(COPASutils)`

Note: All of the raw .txt files referenced below are available at https://github.com/AndersenLab/COPASutils-data

We can now read in the plate data from one of the example data sets. In this GWAS experimental design, each plate is set up with three worms sorted to each well in every other column. An example data set, called “plateData1.txt”, that illustrates this step of the experiment is available from the Andersen Laboratory website (andersenlab.org). This text file is a representative example of raw data collected from a Union Biometrica BIOSORT large particle flow cytometer with ReFLx sampler. We will first read this file in and save it to a data frame called `setupRaw`:

`setupRaw <- readPlate("plateData1.txt")`

If we look at the head of this data frame we can begin to step through the different data that are output by the COPAS machine:

`head(setupRaw)`

```
## Source: local data frame [6 x 15]
## Groups: row, col
##
## row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red object call50
## 1 A 1 6 341 206 0 42 57 44 0.6041 0.12317 0.16716 0.12903 1.0000 object
## 2 A 1 6 345 226 62 57 61 44 0.6551 0.16522 0.17681 0.12754 1.0000 object
## 3 A 1 2 53 29 421 10 4 5 0.5472 0.18868 0.07547 0.09434 0.9987 object
## 4 A 1 6 341 185 452 60 64 47 0.5425 0.17595 0.18768 0.13783 1.0000 object
## 5 A 1 0 487 193 858 47 47 38 0.3963 0.09651 0.09651 0.07803 1.0000 object
## 6 A 1 0 58 17 920 3 3 4 0.2931 0.05172 0.05172 0.06897 0.9989 object
```

We can see that the COPAS machine groups the output data primarily by row and column of a 96-well plate, represented by the columns `row` and `col` in the above data frame. Each row in the data frame represents the readings for a single object. Working through the columns left to right, we see that the sorter returns the sort status of the object in the `sort` column (for our purposes, we are only concerned with instances where sort = 6, as these worms were sorted to the respective wells in the target plate; for a complete explanation of sort status codes, please consult the machine’s user manual). We then see `TOF`, which stands for “time of flight”, a measure of the length of the object in microns. Next is `EXT`, or “extinction”, a measure of the optical density of the object. Following this is the `time` column, which represents the relative time from the first object sorted in each well; the first object to pass through the flow cell for each well is therefore assigned a time of 0. Next are the peak height values for each of the fluorescence channels, indicated by the `green`, `yellow`, and `red` columns. The next four columns contain the normalized EXT, green, yellow, and red values (value/`TOF`). Lastly, the columns `object` and `call50` represent data returned from the support vector machine (SVM) that probabilistically determines whether each “object” is actually an object (cell, worm, etc.) or a bubble. This feature is useful if the experiment, like ours, requires the bubble trap hardware to be bypassed. It is important to note that the SVM was trained on data from our own machine and that its efficacy should be tested against a user’s own data before this feature is relied upon. The `call50` column displays “object” if the probability of being an object (`object`) is greater than 0.5 or “bubble” otherwise.
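As a quick illustration of how these columns might be used downstream, the sketch below filters to rows that were both sorted to the target plate (sort = 6) and called real objects by the SVM, and recomputes a normalized value by hand. The tiny data frame here is synthetic stand-in data, not output from `readPlate`:

```r
# Synthetic stand-in for a few rows of setupRaw, mimicking the columns above
setupRaw <- data.frame(
  row = "A", col = 1,
  sort = c(6, 6, 2, 0),
  TOF = c(341, 345, 53, 487),
  EXT = c(206, 226, 29, 193),
  call50 = c("object", "object", "object", "bubble"),
  stringsAsFactors = FALSE
)

# Keep only objects sorted to the target well (sort status 6)
# that the SVM called real objects rather than bubbles
sorted <- subset(setupRaw, sort == 6 & call50 == "object")

# The normalized columns are simply the raw value divided by TOF,
# e.g. norm.EXT = EXT / TOF
norm.EXT <- sorted$EXT / sorted$TOF
```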

If you would like to remove the last two columns (*i.e.* read in the data without the help of the SVM), you can set the `SVM` argument to `FALSE`, as below:

```
setupRaw2 <- readPlate("plateData1.txt", SVM=FALSE)
head(setupRaw2)
```

```
## Source: local data frame [6 x 13]
## Groups: row, col
##
## row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red
## 1 A 1 6 341 206 0 42 57 44 0.6041 0.12317 0.16716 0.12903
## 2 A 1 6 345 226 62 57 61 44 0.6551 0.16522 0.17681 0.12754
## 3 A 1 2 53 29 421 10 4 5 0.5472 0.18868 0.07547 0.09434
## 4 A 1 6 341 185 452 60 64 47 0.5425 0.17595 0.18768 0.13783
## 5 A 1 0 487 193 858 47 47 38 0.3963 0.09651 0.09651 0.07803
## 6 A 1 0 58 17 920 3 3 4 0.2931 0.05172 0.05172 0.06897
```

We can also set cutoffs for minimum and maximum time of flight and extinction values as such:

```
setupRaw3 <- readPlate("plateData1.txt", tofmin=60, tofmax=1000, extmin=50, extmax=500)
head(setupRaw3)
```

```
## Source: local data frame [6 x 15]
## Groups: row, col
##
## row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red object call50
## 1 A 1 6 341 206 0 42 57 44 0.6041 0.12317 0.16716 0.12903 1 object
## 2 A 1 6 345 226 62 57 61 44 0.6551 0.16522 0.17681 0.12754 1 object
## 3 A 1 6 341 185 452 60 64 47 0.5425 0.17595 0.18768 0.13783 1 object
## 4 A 1 0 487 193 858 47 47 38 0.3963 0.09651 0.09651 0.07803 1 object
## 5 A 1 0 257 248 2012 66 77 60 0.9650 0.25681 0.29961 0.23346 1 object
## 6 A 1 0 469 259 2199 63 70 57 0.5522 0.13433 0.14925 0.12154 1 object
```

Lastly, we can read in a data file from a BioSorter machine, which utilizes the LP Sampler instead of the BIOSORT machine’s ReFLx module. We can adjust the way in which the file is read to compensate for this difference by setting the `reflx` parameter to `FALSE`:

```
biosorterData <- readPlate("BioSorter.txt", reflx=FALSE)
head(biosorterData)
```

```
## Source: local data frame [6 x 15]
## Groups: row, col
##
## row col sort TOF EXT time green yellow red norm.EXT norm.green norm.yellow norm.red object call50
## 1 A 1 0 480 356.0 0 2.70 1.65 2.33 0.7417 0.005625 0.003438 0.004854 1.0000 object
## 2 A 1 0 138 29.9 95 0.51 0.39 0.41 0.2167 0.003696 0.002826 0.002971 0.9999 object
## 3 A 1 0 40 1.2 931 0.11 0.11 0.11 0.0300 0.002750 0.002750 0.002750 0.3757 bubble
## 4 A 1 0 280 71.0 973 8.25 1.06 3.47 0.2536 0.029464 0.003786 0.012393 1.0000 object
## 5 A 1 0 264 61.6 1125 11.80 1.14 2.65 0.2333 0.044697 0.004318 0.010038 1.0000 object
## 6 A 1 0 492 55.8 1195 11.00 1.85 6.03 0.1134 0.022358 0.003760 0.012256 1.0000 object
```


Now that we have read our raw data into R using the `readPlate` function, we probably want to summarize the data by well. For instance, in the GWAS experiment whose data we are examining, three worms should have been sorted into each well in every other column, and we can check this against the sort data. To summarize the data, we can use the `summarizePlate` function:

```
setupSummarized <- summarizePlate(setupRaw)
colnames(setupSummarized)
```

```
## [1] "row" "col" "n" "n.sorted" "mean.TOF"
## [6] "median.TOF" "mean.EXT" "median.EXT" "mean.red" "median.red"
## [11] "mean.green" "median.green" "mean.yellow" "median.yellow" "mean.norm.EXT"
## [16] "median.norm.EXT" "mean.norm.red" "median.norm.red" "mean.norm.green" "median.norm.green"
## [21] "mean.norm.yellow" "median.norm.yellow"
```

We now see that we have many more trait variables, many of which describe the distribution of the originally measured values by well. We can get an even more complete picture by adding in extra quantiles (the `quantiles` argument), log-transformed values of EXT and the fluorescence channels (the `log` argument), and the minimum and maximum of each trait (the `ends` argument), as below:

```
setupSummarized2 <- summarizePlate(setupRaw, quantiles=TRUE, log=TRUE, ends=TRUE)
colnames(setupSummarized2)
```

```
## [1] "row" "col" "n" "n.sorted"
## [5] "mean.TOF" "min.TOF" "q10.TOF" "q25.TOF"
## [9] "median.TOF" "q75.TOF" "q90.TOF" "max.TOF"
## [13] "mean.EXT" "min.EXT" "q10.EXT" "q25.EXT"
## [17] "median.EXT" "q75.EXT" "q90.EXT" "max.EXT"
## [21] "mean.red" "min.red" "q10.red" "q25.red"
## [25] "median.red" "q75.red" "q90.red" "max.red"
## [29] "mean.green" "min.green" "q10.green" "q25.green"
## [33] "median.green" "q75.green" "q90.green" "max.green"
## [37] "mean.yellow" "min.yellow" "q10.yellow" "q25.yellow"
## [41] "median.yellow" "q75.yellow" "q90.yellow" "max.yellow"
## [45] "mean.norm.EXT" "min.norm.EXT" "q10.norm.EXT" "q25.norm.EXT"
## [49] "median.norm.EXT" "q75.norm.EXT" "q90.norm.EXT" "max.norm.EXT"
## [53] "mean.norm.red" "min.norm.red" "q10.norm.red" "q25.norm.red"
## [57] "median.norm.red" "q75.norm.red" "q90.norm.red" "max.norm.red"
## [61] "mean.norm.green" "min.norm.green" "q10.norm.green" "q25.norm.green"
## [65] "median.norm.green" "q75.norm.green" "q90.norm.green" "max.norm.green"
## [69] "mean.norm.yellow" "min.norm.yellow" "q10.norm.yellow" "q25.norm.yellow"
## [73] "median.norm.yellow" "q75.norm.yellow" "q90.norm.yellow" "max.norm.yellow"
## [77] "mean.log.EXT" "min.log.EXT" "q10.log.EXT" "q25.log.EXT"
## [81] "median.log.EXT" "q75.log.EXT" "q90.log.EXT" "max.log.EXT"
## [85] "mean.log.red" "min.log.red" "q10.log.red" "q25.log.red"
## [89] "median.log.red" "q75.log.red" "q90.log.red" "max.log.red"
## [93] "mean.log.green" "min.log.green" "q10.log.green" "q25.log.green"
## [97] "median.log.green" "q75.log.green" "q90.log.green" "max.log.green"
## [101] "mean.log.yellow" "min.log.yellow" "q10.log.yellow" "q25.log.yellow"
## [105] "median.log.yellow" "q75.log.yellow" "q90.log.yellow" "max.log.yellow"
## [109] "mean.log.norm.EXT" "min.log.norm.EXT" "q10.log.norm.EXT" "q25.log.norm.EXT"
## [113] "median.log.norm.EXT" "q75.log.norm.EXT" "q90.log.norm.EXT" "max.log.norm.EXT"
## [117] "mean.log.norm.red" "min.log.norm.red" "q10.log.norm.red" "q25.log.norm.red"
## [121] "median.log.norm.red" "q75.log.norm.red" "q90.log.norm.red" "max.log.norm.red"
## [125] "mean.log.norm.green" "min.log.norm.green" "q10.log.norm.green" "q25.log.norm.green"
## [129] "median.log.norm.green" "q75.log.norm.green" "q90.log.norm.green" "max.log.norm.green"
## [133] "mean.log.norm.yellow" "min.log.norm.yellow" "q10.log.norm.yellow" "q25.log.norm.yellow"
## [137] "median.log.norm.yellow" "q75.log.norm.yellow" "q90.log.norm.yellow" "max.log.norm.yellow"
```

We now have a great deal of information describing, in detail, the summary statistics of each of the measured traits. Again, each subset of new trait values can be removed by leaving the corresponding optional parameter (`quantiles`, `log`, or `ends`) set to `FALSE`. Each trait follows a specific naming system, wherein any mathematical transformation applied to the original data is added at the beginning of the trait name. For instance, the mean of all of the time of flight data (TOF) for each well can be found in the column `mean.TOF`. If we wanted the column corresponding to the 25th quantile of the log-transformed extinction data (EXT), we could find it in the column named `q25.log.EXT`. All of the naming conventions are outlined in the table below:

Statistic | Abbreviation | Example |
---|---|---|
mean | mean | mean.TOF |
median | median | median.EXT |
minimum | min | min.yellow |
maximum | max | max.green |
normalized to time of flight | norm | mean.norm.red |
quantiles | qXX | q25.red |
log transformed data | log | mean.log.red |

Some statistics, such as the normalized and log-transformed values, are calculated before the data are summarized and, as such, have distributions of their own. For these traits, all mean, median, min, max, and quantile data are available by stringing together names as in the table above.
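The naming scheme can be applied mechanically. As a sketch (this helper is illustrative only, not part of COPASutils), a trait column name can be assembled from its parts, with the transformation order matching the column listing above (`log` before `norm`):

```r
# Build a trait column name from its parts: statistic, optional
# transformations (log before norm), and the base trait
trait_name <- function(stat, base, log = FALSE, norm = FALSE) {
  parts <- c(stat, if (log) "log", if (norm) "norm", base)
  paste(parts, collapse = ".")
}

trait_name("q25", "EXT", log = TRUE)    # "q25.log.EXT"
trait_name("mean", "red", norm = TRUE)  # "mean.norm.red"
```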

In the experiment we are examining here, no worms should have been sorted into the wells in the even columns. Therefore, the data we see in these columns are the result of background debris or the accidental placement of worms into the wells. We now want to remove these wells before we continue with our analysis. Included in the package is the `removeWells` function, which does exactly what its name implies:

```
setupSummarized3 <- removeWells(setupSummarized, c("A2", "A4", "A6")) # All wells you want to remove need to be included here
head(setupSummarized3[,1:10])
```

```
## Source: local data frame [6 x 10]
## Groups:
##
## row col n n.sorted mean.TOF median.TOF mean.EXT median.EXT mean.red median.red
## 1 A 1 24 3 342.8 341 241.5 230.5 64.92 49.5
## 2 A 2 NA NA NA NA NA NA NA NA
## 3 A 3 70 3 307.2 310 190.9 209.5 62.61 52.0
## 4 A 4 NA NA NA NA NA NA NA NA
## 5 A 5 69 3 323.9 346 203.1 215.0 47.00 48.0
## 6 A 6 NA NA NA NA NA NA NA NA
```

The `removeWells` function takes as input a summarized plate data frame, as well as a vector of strings corresponding to wells to be removed, and returns the data frame with the phenotype data for those wells set to `NA`. An optional `drop` parameter specifies whether to “NA out” trait data or drop those rows from the frame entirely:

```
setupSummarized4 <- removeWells(setupSummarized, c("A2", "A4", "A6"), drop=TRUE)
head(setupSummarized4[,1:10])
```

```
## Source: local data frame [6 x 10]
## Groups:
##
## row col n n.sorted mean.TOF median.TOF mean.EXT median.EXT mean.red median.red
## 1 A 1 24 3 342.8 341 241.5 230.5 64.92 49.5
## 3 A 3 70 3 307.2 310 190.9 209.5 62.61 52.0
## 5 A 5 69 3 323.9 346 203.1 215.0 47.00 48.0
## 7 A 7 64 3 301.4 285 185.6 202.0 54.28 46.5
## 8 A 8 2 2 245.0 245 211.5 211.5 43.50 43.5
## 9 A 9 82 3 324.0 356 190.7 205.5 44.24 47.0
```

The converse of the `removeWells` function is the `fillWells` function, which fills the trait data for any wells missing from the selected data frame with `NA`s. To fill the wells back in from our example above, where the rows were dropped from the data frame, we could use the following command:

```
setupSummarized5 <- fillWells(setupSummarized4)
head(setupSummarized5[,1:10])
```

```
## row col n n.sorted mean.TOF median.TOF mean.EXT median.EXT mean.red median.red
## 1 A 1 24 3 342.8 341 241.5 230.5 64.92 49.5
## 2 A 2 NA NA NA NA NA NA NA NA
## 3 A 3 70 3 307.2 310 190.9 209.5 62.61 52.0
## 4 A 4 NA NA NA NA NA NA NA NA
## 5 A 5 69 3 323.9 346 203.1 215.0 47.00 48.0
## 6 A 6 NA NA NA NA NA NA NA NA
```

One issue when working with 96-well plates is that of placement within the plate. Edge wells, being exposed to more direct dry air currents, may experience greater evaporation, which may have an effect on different traits in those wells. To test this hypothesis, we can utilize the `edgeEffect` function:

`edgeEffect(setupSummarized, "n")`

`## Warning: cannot compute exact p-value with ties`

`## [1] 0.3775`

This function takes a summarized plate data frame and the name of the trait to test. In this instance, we tested our setup plate for any effect with respect to the number of worms in each well. The function splits the plate population into wells found on the perimeter of the plate and those found on the interior, then performs a Wilcoxon rank-sum test between the two populations for the specified trait. The resultant p-value is returned. Since the returned p-value (0.3775) is greater than our significance threshold of 0.05, we fail to reject the null hypothesis that the two populations are drawn from the same distribution. If we want to simultaneously test all traits, we can simply not specify a trait to be tested; a data frame of all traits and associated p-values will be returned.
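The underlying comparison can be sketched in base R. The snippet below is a simplified stand-in for `edgeEffect`, using synthetic per-well values; classifying edge wells as those in rows A/H or columns 1/12 is our assumption about a standard 96-well perimeter, not taken from the package source:

```r
# Classify the 96 wells of a plate as edge (perimeter) or interior;
# edge wells are assumed to be those in rows A/H or columns 1/12
wells <- expand.grid(row = LETTERS[1:8], col = 1:12)
wells$edge <- wells$row %in% c("A", "H") | wells$col %in% c(1, 12)

# Synthetic per-well trait values (e.g. worm counts), drawn from
# one distribution so no edge effect is expected
set.seed(1)
wells$n <- rnorm(nrow(wells), mean = 70, sd = 5)

# Wilcoxon rank-sum test between the edge and interior populations
p <- wilcox.test(n ~ edge, data = wells)$p.value
```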

Now that we have access to both the unsummarized and summarized data, we would like to visualize the results from this plate. The plate in this example was set up with worms in every other column. To confirm that large populations exist only in every other column, we will plot a heat map of the plate representing the population present in each well using the `plotTrait` function:

`plotTrait(setupSummarized, "n")`

We can see that the larger populations of worms are, in fact, only present in every other well, row-wise. We can also see wells that are missing data points, as they are colored gray and are missing the numerical annotation. The returned plot is a ggplot2 object and as such can be manipulated using standard ggplot2 functions and grammar. This extensibility is true for all of the following plotting functions as well. For instance, we can add a title as such:

`plotTrait(setupSummarized, "n") + ggtitle("Example Heatmap")`

We are not simply limited to heat maps, however. By plotting the raw data, we can get a better feel for the distributions of the traits. We can plot a histogram of the values in each well:

`plotTrait(setupRaw, "TOF", type="hist")`