README.md

# About

This project provides a workflow to extract microorganism taxa, habitats, and phenotypes from texts using NCBI taxonomy and OntoBiotope ontology. It processes data from CIRMS, GenBank, DSMZ, and PubMed and enriches the [Omnicrobe database](https://omnicrobe.migale.inrae.fr/) with food microbe flora.

The workflow includes six text-mining pipelines using [AlvisNLP](https://bibliome.github.io/alvisnlp/) and other tools, executed via Snakemake. 

The followings steps are provided to run the workflow on the Migale facility (SGE cluster / Linux OS (Ubuntu)). You must know how to use [AlviNLP](https://bibliome.github.io/alvisnlp/), [Snakemake](https://snakemake.readthedocs.io), and the [SGE queuing system](http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html).

> **_NOTE:_** Adaptations are required for other environments. <br/>
> See additional documentation [here](docs/README.md).

## Install

### **0.** clone the project

```
git clone https://forgemia.inra.fr/omnicrobe/text-mining-workflow.git
cd text-mining-workflow/
```

### **1.** set the AlvisnLP Singularity image 

```
cd softwares/
ln -s /work_projet/bibliome/singularity/alvisnlp-0.10.1.sif alvisnlp.sif # change alvisnlp version if required
```

### **2.** create the global conda env for snakemake

```
conda env create -f softwares/envs/snakemake-5.13.0-env.yaml 
conda activate snakemake-5.13.0-env
```

### **3.** clone [obo-utils](https://github.com/Bibliome/obo-utils)

```
cd softwares/
git clone https://github.com/Bibliome/obo-utils.git
```

### **4.** clone and install [alvisir](https://github.com/Bibliome/alvisir)


```
cd softwares/
mkdir -p alvisir-install
git clone https://github.com/Bibliome/alvisir.git
cd alvisir/
mvn clean package
./install.sh ../alvisir-install
cd ..
rm -rf alvisir
```

##  Set configs


### **1.** set the path to `Alvisnlp Singularity Image` in config file `config/config.yaml`

```
## alvisnlp singlarity image
SINGULARITY_IMG: "softwares/alvisnlp.sif"
```

### **2.** set the path to `obo-utils` in config file `config/config.yaml`

```
## obo-utils home
OBO_UTILS : "softwares/obo-utils"
```

### **3.** set the path to `obo-utils` in config file `config/config.yaml`

```
## alvisir home
ALVISIR_HOME : "softwares/alvisir-install"
```

### **4.** add supertaxo 

You need to install and run the [taxonomy pipelines](https://forgemia.inra.fr/omnicrobe/extended-microorganisms-taxonomy) to create the extented taxo, then copy the following result files in folder `ancillaries/extended-microorganisms-taxonomy/`

```
extended-microorganisms-taxonomy/output/bacdive-match \
extended-microorganisms-taxonomy/output/bacdive-match/bacdive-to-taxid.txt \
extended-microorganisms-taxonomy/output/taxa+id_full.trie \
extended-microorganisms-taxonomy/output/taxa+id_microorganisms.txt \
extended-microorganisms-taxonomy/output/taxid_microorganisms.txt \
extended-microorganisms-taxonomy/output/bacdive-strains \
extended-microorganisms-taxonomy/output/ncbi-taxonomy/names.dmp \
extended-microorganisms-taxonomy/output/taxa+id_full.txt \
extended-microorganisms-taxonomy/output/taxid_full.txt ancextended-microorganisms-taxonomy
```

## Run

### **step 1.** ` Get Pubmed Abstract` 

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile get-pubmed-abstracts.snakefile \
--cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_**  The corpus splitted into batches `corpora/pubmed/batches`, its is required for next steps <br/>
>  Execution time ~ 7 hours 

### **step 2.** ` Get EPMC FullTexts` 

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile get-epmc-fulltexts.snakefile \
--cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_** The corpus splitted into batches `corpora/epmc/batches`, its is required for next steps  <br/>
> **_NOTE:_** Execution time ~ ?


### **step 3.** `preprocess Ontobiotope` <!--to analyze the ontologies, cut the desired branches and produce the tomap models and lexicon to be used in the next steps. -->
   
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile preprocess-ontology.snakefile \
--cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_**  check snakefile for outputs, they are required for next steps  <br/>
> Execution time ~ 40 mn

### **step 4.1** `process Pubmed Data` <!--to extracts microorganisms, habitats of texts from Pubmed. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs 80 \
--snakefile process-pubmed.snakefile \
--cluster "qsub -v PYTHONPATH='' -l mem_free=36G  -V -cwd -e log/ -o log/ -q long.q,maiage.q,short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_**  results: `corpora/florilege/pubmed/PubMed-*.txt`  <br/>
> Execution time ~ 1h per batch for ~3000 batches

### **step 4.2** `process EPMC Data` <!--to extracts microorganisms, habitats of texts from epmc fulltexts. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs 80 \
--snakefile process-epmc.snakefile \
--cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_**  results: `corpora/florilege/epmc/EPMC-*.txt`  <br/>
> Execution time ~ ?

### **step 4.3** `process CIRM data` <!--to extract microorganisms, habitats of texts from CIRM. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile process-cirm.snakefile \
--cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_**  results: `corpora/florilege/cirm/cirm-*-results.txt`  <br/>
> Execution time ~ ?

### **step 4.4** `process GenBank data` <!--to extract microorganisms, habitats of texts from GenBank.-->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile process-genbank.snakefile \
--cluster "qsub -v PYTHONPATH='' -l mem_free=36G  -V -cwd -e log/ -o log/ -q long.q,maiage.q,short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_**  results: `corpora/florilege/genbank/genbank-results.txt`  <br/>
> Execution time ~ ?

### **step 4.5.** `process DSMZ data` <!--to extract microorganisms, habitats of texts from DSMZ. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile process-dsmz.snakefile \
--cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_**  results: `corpora/florilege/dsmz/dsmz-results.txt`  <br/>
> Execution time ~ ?

## Run eval process

### **step 1.** `evaluate with BioNLP-OST` <!--to extract microorganisms, habitats of texts from DSMZ. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile evaluate-with-bionlp-ost.snakefile \
--cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_** scores: `corpora/florilege/eval/new/BB19-*-eval.json`  <br/>
> Execution time ~ 15 mn

### **step 2.** `compare pubmed results` <!--to extract microorganisms, habitats of texts from DSMZ. -->

add respectively new results and old results into `corpora/florilege/compare/pubmed/new` and `corpora/florilege/compare/pubmed/old` 

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile compare-results.snakefile --config NEW_RESULT_FOLDER = 'corpora/florilege/compare/pubmed/new' OLD_RESULT_FOLDER = 'corpora/florilege/compare/pubmed/old' MATCH_RESULT_FOLDER='corpora/florilege/compare/pubmed' \
--cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_** results: `corpora/florilege/compare/pubmed/*.rankdiff.txt`  <br/>

### **step 3.** `compare epmc results` <!--to extract microorganisms, habitats of texts from DSMZ. -->

add respectively new results and old results into `corpora/florilege/compare/epmc/new` and `corpora/florilege/compare/epmc/old` 

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile compare-results.snakefile --config NEW_RESULT_FOLDER='corpora/florilege/compare/epmc/new' OLD_RESULT_FOLDER='corpora/florilege/compare/epmc/old' MATCH_RESULT_FOLDER='corpora/florilege/compare/epmc' \
--cluster "qsub -v PYTHONPATH=''  -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```

> **_NOTE:_** results: `corpora/florilege/compare/epmc/*.rankdiff.txt`  <br/>

## Update alvisir instances

### **step 1.** 

copy the data for omnicrobe alvisir-dev into `/work_projet/bibliome/wservice-configs/omnicrobe-dev_alvisir/`


```
cd corpora/florilege/alvisir
cp -r index /work_projet/bibliome/wservice-configs/omnicrobe-dev_alvisir/
cp -r expander /work_projet/bibliome/wservice-configs/omnicrobe-dev_alvisir/
cp -r *.json BioNLP-OST+EnovFood-Phenotype.json /work_projet/bibliome/wservice-configs/omnicrobe-dev_alvisir/resources/
```

> **_NOTE:_** results: make sure to save the existing data before copying for recovering purpose.<br/> The update can be seing here https://bibliome.migale.inrae.fr/omnicrobe-dev/alvisir/webapi/search

### **step 2.** 

copy the data for omnicrobe alvisir into `/work_projet/bibliome/wservice-configs/omnicrobe_alvisir/` 

```
cd corpora/florilege/alvisir
cp -r index /work_projet/bibliome/wservice-configs/omnicrobe_alvisir/
cp -r expander /work_projet/bibliome/wservice-configs/omnicrobe_alvisir/
cp -r *.json BioNLP-OST+EnovFood-Phenotype.json /work_projet/bibliome/wservice-configs/omnicrobe_alvisir/resources/
```

> **_NOTE:_** results: make sure to save the existing data before copying for recovering purpose. <br/>The update can be seing here https://bibliome.migale.inrae.fr/omnicrobe/alvisir/webapi/search