1 SAMSA2

1.1 RefSeq database

  • Non-redundant set of reference proteins (predicted+verified)
  • Contains species + protein name
  • 27G, 68M sequences

1.2 Subsystems database

  • SEED is a categorization system which organizes gene functional categories into a hierarchy (4 levels)
  • 7.6G, 2.9M sequences

2 Data

3 Running the pipeline

4 Results

4.1 Merging

## [1] "50.95 % of sequences were merged"
  • Does merging make sense? Depends? on size selection of amplicons sequenced (should check)
  • I now also keep unmerged reads (unmerged reads are concatenated and added to a single file. The forward read and the reverse (complement) read are concatenated with a string of 20 Ns in the middle: This is done through a new R script entitled: combining_umerged.R)

4.2 Trimmomatic

## [1] "11.21 % of sequences were dropped"
  • Note: I now trim out all illumina adaptors, which wasn’t done before…
  • Note: I’ve double checked that contaminating adaptors are removed using fastqc.

4.3 Ribosomes

## [1] "8.104 % of sequences were ribosomes and removed"
## [1] "Three samples had a high % of ribosomes removed"
##                                                                    V1
## 1 HI.4752.003.NEBNext_Index_13.000576_ChampSt1-20160915-WatPhotz_RNAa
## 2 HI.4752.003.NEBNext_Index_14.000577_ChampSt1-20160915-WatPhotz_RNAb
## 3 HI.4752.003.NEBNext_Index_15.000578_ChampSt1-20160915-WatPhotz_RNAc
##      V9
## 1 59.60
## 2 63.94
## 3 66.96

4.4 Other stats

## [1] "431147: mininum nb sequence annotated per sample (refseq)"
## [1] "31847464: maximum nb sequence annotated per sample (refseq)"
## [1] "2049000: mean nb sequence annotated per sample (refseq)"
## [1] "108955: mininum nb sequence annotated per sample (subsys)"
## [1] "16733558: maximum nb sequence annotated per sample (subsys)"
## [1] "1001000: mean nb sequence annotated per sample (subsys)"
## [1] "0.7181: Total Number of paired-end reads (billions)"
## [1] "0.15: Total Number of sequences after cleaning+merging (billions)"
## [1] "93.37% : percentage of sequences annotated (refseq)"
## [1] "45.62% : percentage of sequences annotated (subsys)"
  • Note: Diamond threshold (default=0.001: kinda high, but mUst keep high, given that these are short reads, even if merged…)…
  • Note: threshold may not be that important, given that we are more interested in how things change over time, not necessary precisely identify a species…

4.5 RefSeq annotations - species

 
 

  • Here I only show spp. that are present >5% in at least one of the samples
  • We can see the Dolichospermum bloom
  • What is going in the St1 samples on Sept 15th?
  • Inediibacterium massiliense and Blautia massiliensis are human fecal bacteria…

4.5.1 Environmental data (N,P,Chl)

  • Huge peak in values on Sept 9th, 2016.
     
     

4.5.2 I. massiliense

  • All I. massiliense 20160915_St1 sequences match to a single hypothetical (i.e predicted) protein (WP_053955528.1) in refseq.
  • e-values ~e-20 (because sequences are small + sequence identity ~90%)

4.5.2.1 I. massiliense genome BLAST

  • Download I. massiliense genome.
  • BLASTn sequences (for now only 100k sequences) against it.
  • 20160915_St1: hits to a single region of ~150bp (sequence identity ~90%, 6206 hits/100k sequences)
  • 20160601_St1: hits to a single region of ~150bp (sequence identity ~90%, 31 hits/100k sequences)
     
  • Conclusion: sequences may NOT be I. massiliense, but something close and ~200X more abundant in these samples

4.5.3 B. massiliensis

  • All B. massiliensis 20160915_St1 sequences match to a single hypothetical (i.e predicted) protein (WP_059086336.1) in refseq
  • e-values ~e-10 (because sequences are small + %identity ~90%)

4.5.3.1 B. massiliensis genome BLAST

  • Download B. massiliensis genome (3.5 Mbp)
  • BLASTn sequences (for now only 100k sequences) against it.
  • Sequences BLAST to several regions of the genome
  • 20160915_St1: 34801 hits/100k sequences to several genomic regions
  • 20160601_St1: 188 hits/100k sequences

  • Same conclusion: may not be B. massiliensis, but something close and much more abundant in these samples (~200X).
     

4.5.4 Are the B. massiliensis & I. massiliense genomic regions the same?

samtools faidx Inediibacterium_massiliense_genome.fasta
samtools faidx Inediibacterium_massiliense_genome.fasta NZ_LN876587.1:1637150-1637400 >I_massiliense_contig1.fasta

samtools faidx Inediibacterium_massiliense_genome.fasta
samtools faidx Blautia_massiliensis_GD8_NZ_LN913006_WGS.fasta NZ_LN913006.1:2359000-2360200 >B_massiliensis_contig1.fasta

  • Yes, if we align the genomics regions for which the sequences matched to (with MUSCLE), the are almost the same (~99% identity of ~150bp)

4.6 RefSeq annotations - genera Fold change

 
 

  • Here I only show spp. that are present >5% in at least one of the samples
  • Replicated (a, b & c) were merged.
  • We can see the Dolichospermum bloom much better. In addition to the Microcystis + Anabaena sp. blooms

4.7 RefSeq annotations - gene names

  • Here I only show spp. that are present >1% in at least one of the samples.

4.8 Subseq annotations - functions

  • Here I only show spp. that are present >1% in at least one of the samples (for level 2:4 >-.5% for level 1).
     

4.9 Principal Coordinate Analyses

  • Note that the Sept 9th 2016 St1 sites were removed, because they skew the ordinations too much.
  • Most of the variation is explained by Axis 1. Samples get further away as time goes by.

4.10 alpha diversity over time

  • alpha

4.11 Cyanotoxic bacteria

  • Looking at the expression of 342 cyanotoxic genes (as Identified by Olga).
  • Cyanotoxic genes represent ~1% of all genes expressed (Note that graph below depicts total counts).
  • I grouped genes together according to their name.
  • Cyanotoxic genes rise similary to what was identified in metatranscriptomics.

4.12 Metagenomic_Champlain_short_term

  • I’ve ran the pipeline on these samples (they are all St1 according to Nicolas).
  • Species were grouped by genera to simplify.

4.13 Metagenomics and metatranscriptomics tell a different story.

  • First, you can see the difference in the bloom composition between the 2015 and 2016 metagenomics datasets (first 2 panels). In 2015, it was dominated by Microcystis. In 2016, it was dominated by Dolichospermum, but in September Microcystis start to show up and dominate.
  • In 2016, for which we had metagenomics vs metatranscriptomics at similar dates, the Dolichospermum story is similar, but the Microcystis are almost absent from metatranscriptomics. This means that they are there (according to metagenomics), but not active (according to metatranscriptomics).
  • Overall, the metatranscriptomics is more diverse, while the metagenomics is largely dominated by cyanobacteria. I will compare alpha diversity between both datasets to explicitely test that.

4.14 Metagenomic_Champlain_long_term

  • Species were grouped by genera to simplify it (e.g.: There were a lot of different Flavobacterium species)
  • Altough this dataset is not comparable to the metatranscriptome directly, if I extrapolate it seems like there is little correlation btwn metat and metag.: This is pretty cool and collaborates was we see with the short term metagenomics dataset.

5 To do (lab meeting Oct 19th):

6 Notes from Jan 11th 2019 lab meeting