1 SAMSA2

1.1 RefSeq database

Non-redundant set of reference proteins (predicted+verified)
Contains species + protein name
27G, 68M sequences

1.2 Subsystems database

SEED is a categorization system which organizes gene functional categories into a hierarchy (4 levels)
7.6G, 2.9M sequences

2 Data

2016 Lake Champlain algal bloom samples
Metatranscriptomes (2 X 100bp Illumina HiSeq)
3 sampling site: PRM (15 samples), St1 (27 samples), St2 (23 samples)
130 fastq samples (79G of fastq.gz data)

3 Running the pipeline

Compute Canada (Cedar)
I ran the pipeline seperately for each sampling site
48 CPUs, ~15hr X 3 sampling sites

4 Results

4.1 Merging

## [1] "50.95 % of sequences were merged"

Does merging make sense? Depends? on size selection of amplicons sequenced (should check)
I now also keep unmerged reads (unmerged reads are concatenated and added to a single file. The forward read and the reverse (complement) read are concatenated with a string of 20 Ns in the middle: This is done through a new R script entitled: combining_umerged.R)

4.2 Trimmomatic

## [1] "11.21 % of sequences were dropped"

Note: I now trim out all illumina adaptors, which wasn’t done before…
Note: I’ve double checked that contaminating adaptors are removed using fastqc.

4.3 Ribosomes

## [1] "8.104 % of sequences were ribosomes and removed"

## [1] "Three samples had a high % of ribosomes removed"

##                                                                    V1
## 1 HI.4752.003.NEBNext_Index_13.000576_ChampSt1-20160915-WatPhotz_RNAa
## 2 HI.4752.003.NEBNext_Index_14.000577_ChampSt1-20160915-WatPhotz_RNAb
## 3 HI.4752.003.NEBNext_Index_15.000578_ChampSt1-20160915-WatPhotz_RNAc
##      V9
## 1 59.60
## 2 63.94
## 3 66.96

4.4 Other stats

## [1] "431147: mininum nb sequence annotated per sample (refseq)"

## [1] "31847464: maximum nb sequence annotated per sample (refseq)"

## [1] "2049000: mean nb sequence annotated per sample (refseq)"

## [1] "108955: mininum nb sequence annotated per sample (subsys)"

## [1] "16733558: maximum nb sequence annotated per sample (subsys)"

## [1] "1001000: mean nb sequence annotated per sample (subsys)"

## [1] "0.7181: Total Number of paired-end reads (billions)"

## [1] "0.15: Total Number of sequences after cleaning+merging (billions)"

## [1] "93.37% : percentage of sequences annotated (refseq)"

## [1] "45.62% : percentage of sequences annotated (subsys)"

Note: Diamond threshold (default=0.001: kinda high, but mUst keep high, given that these are short reads, even if merged…)…
Note: threshold may not be that important, given that we are more interested in how things change over time, not necessary precisely identify a species…

4.5 RefSeq annotations - species

Here I only show spp. that are present >5% in at least one of the samples
We can see the Dolichospermum bloom
What is going in the St1 samples on Sept 15th?
Inediibacterium massiliense and Blautia massiliensis are human fecal bacteria…

4.5.1 Environmental data (N,P,Chl)

Huge peak in values on Sept 9th, 2016.

4.5.2 I. massiliense

All I. massiliense 20160915_St1 sequences match to a single hypothetical (i.e predicted) protein (WP_053955528.1) in refseq.
e-values ~e-20 (because sequences are small + sequence identity ~90%)

4.5.2.1 I. massiliense genome BLAST

Download I. massiliense genome.
BLASTn sequences (for now only 100k sequences) against it.
20160915_St1: hits to a single region of ~150bp (sequence identity ~90%, 6206 hits/100k sequences)
20160601_St1: hits to a single region of ~150bp (sequence identity ~90%, 31 hits/100k sequences)
Conclusion: sequences may NOT be I. massiliense, but something close and ~200X more abundant in these samples

4.5.3 B. massiliensis

All B. massiliensis 20160915_St1 sequences match to a single hypothetical (i.e predicted) protein (WP_059086336.1) in refseq
e-values ~e-10 (because sequences are small + %identity ~90%)

4.5.3.1 B. massiliensis genome BLAST

Download B. massiliensis genome (3.5 Mbp)
BLASTn sequences (for now only 100k sequences) against it.
Sequences BLAST to several regions of the genome
20160915_St1: 34801 hits/100k sequences to several genomic regions
20160601_St1: 188 hits/100k sequences

Same conclusion: may not be B. massiliensis, but something close and much more abundant in these samples (~200X).

4.5.4 Are the B. massiliensis & I. massiliense genomic regions the same?

samtools faidx Inediibacterium_massiliense_genome.fasta
samtools faidx Inediibacterium_massiliense_genome.fasta NZ_LN876587.1:1637150-1637400 >I_massiliense_contig1.fasta

samtools faidx Inediibacterium_massiliense_genome.fasta
samtools faidx Blautia_massiliensis_GD8_NZ_LN913006_WGS.fasta NZ_LN913006.1:2359000-2360200 >B_massiliensis_contig1.fasta

Yes, if we align the genomics regions for which the sequences matched to (with MUSCLE), the are almost the same (~99% identity of ~150bp)

4.6 RefSeq annotations - genera Fold change

Here I only show spp. that are present >5% in at least one of the samples
Replicated (a, b & c) were merged.
We can see the Dolichospermum bloom much better. In addition to the Microcystis + Anabaena sp. blooms

4.7 RefSeq annotations - gene names

Here I only show spp. that are present >1% in at least one of the samples.

4.8 Subseq annotations - functions

Here I only show spp. that are present >1% in at least one of the samples (for level 2:4 >-.5% for level 1).

4.9 Principal Coordinate Analyses

Note that the Sept 9th 2016 St1 sites were removed, because they skew the ordinations too much.
Most of the variation is explained by Axis 1. Samples get further away as time goes by.

4.10 alpha diversity over time

alpha

4.11 Cyanotoxic bacteria

Looking at the expression of 342 cyanotoxic genes (as Identified by Olga).
Cyanotoxic genes represent ~1% of all genes expressed (Note that graph below depicts total counts).
I grouped genes together according to their name.
Cyanotoxic genes rise similary to what was identified in metatranscriptomics.

4.12 Metagenomic_Champlain_short_term

I’ve ran the pipeline on these samples (they are all St1 according to Nicolas).
Species were grouped by genera to simplify.

4.13 Metagenomics and metatranscriptomics tell a different story.

First, you can see the difference in the bloom composition between the 2015 and 2016 metagenomics datasets (first 2 panels). In 2015, it was dominated by Microcystis. In 2016, it was dominated by Dolichospermum, but in September Microcystis start to show up and dominate.
In 2016, for which we had metagenomics vs metatranscriptomics at similar dates, the Dolichospermum story is similar, but the Microcystis are almost absent from metatranscriptomics. This means that they are there (according to metagenomics), but not active (according to metatranscriptomics).
Overall, the metatranscriptomics is more diverse, while the metagenomics is largely dominated by cyanobacteria. I will compare alpha diversity between both datasets to explicitely test that.

4.14 Metagenomic_Champlain_long_term

Species were grouped by genera to simplify it (e.g.: There were a lot of different Flavobacterium species)
Altough this dataset is not comparable to the metatranscriptome directly, if I extrapolate it seems like there is little correlation btwn metat and metag.: This is pretty cool and collaborates was we see with the short term metagenomics dataset.

5 To do (lab meeting Oct 19th):

Check the Microcystis in the database and in the list of annotated genes?
There are 67425 Microcystis gene products (63.5k are for M. aeruginosa) in the dataset
This is a very good representation
Get a Microcystis protein set from Olga?
Done, but not sure how usefull, given that it is actually well represented in the dataset
Check illumina primer/adaptor sequences (find them on nanuq), if they may cause contamination (esp. for the 09-15 samples)?
Done. Refiltered samples as there was indeed some contamination
Didn’t change results very much
Check sequence similarity bwtn the 2 fecal species proteins
done. There are almost the same…
Theoretically, you should get a could corellation bwtn annotation to Rubisco + Photosystem and chlorophyll content. I should verify this (simple correlation or can also perfom a more complex RDA)
Not done yet, but I suspect correlation is not great… Why: I don’t know
Check annotatin at the other three functionnal levels (right know, I only look at the first one…)
Done at all 4 levels
eggnog annotation Not done
I can do a PCA to look at sample clustering based on communities
Done using phyloseq (Sept 15th samples were removed because they skew the ordinations so much…)
I can do an RDA to see how certain samples may correlate with environmental data
Not done, as environmental variables are being updated now
Look at alpha + beta diversity
alpha done: not much going on
metagenome corellated with metatranscriptome In progress: this is probably the most promising analysis
Check oceans metag vs . metat papers for ideas/references
fold changes for functions… maybe but not super exciting

6 Notes from Jan 11th 2019 lab meeting

Rpackage taxize database for species names synonyms
Check toxin results and how it correlates with the metat. expression
check expression of Dolicho and the metat. cyanotoxin expression (decrease in Dolicho around sept 15th: do we see more toxin production then…)
phages are one of the driver of blooms (phage kills dolicho, releases toxin in the water…)
metaphlan2

ATRAPP transcriptomics

January 29, 2019