# About

This project provides a workflow to extract microorganism taxa, habitats, and phenotypes from texts using the NCBI taxonomy and the OntoBiotope ontology. It processes data from CIRM, GenBank, DSMZ, and PubMed, and enriches the [Omnicrobe database](https://omnicrobe.migale.inrae.fr/) with food microbe flora.

The workflow includes six text-mining pipelines built with [AlvisNLP](https://bibliome.github.io/alvisnlp/) and other tools, executed via Snakemake.

The following steps describe how to run the workflow on the Migale facility (SGE cluster, Linux/Ubuntu). You should know how to use [AlvisNLP](https://bibliome.github.io/alvisnlp/), [Snakemake](https://snakemake.readthedocs.io), and the [SGE queuing system](http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html).

> **_NOTE:_** Adaptations may be needed for other environments. See additional documentation [here](docs/README.md).

## Install

### **0.** Clone the project

```
git clone https://forgemia.inra.fr/omnicrobe/text-mining-workflow.git
cd text-mining-workflow/
```

### **1.** Set the AlvisNLP Singularity image

```
cd softwares/
ln -s /work_projet/bibliome/singularity/alvisnlp-0.10.1.sif alvisnlp.sif # change the alvisnlp version if required
```

### **2.** Create the global conda environment for Snakemake

```
conda env create -f softwares/envs/snakemake-5.13.0-env.yaml
conda activate snakemake-5.13.0-env
```

### **3.** Clone [obo-utils](https://github.com/Bibliome/obo-utils)

```
cd softwares/
git clone https://github.com/Bibliome/obo-utils.git
```

### **4.** Clone and install [alvisir](https://github.com/Bibliome/alvisir)

```
cd softwares/
mkdir -p alvisir-install
git clone https://github.com/Bibliome/alvisir.git
cd alvisir/
mvn clean package
./install.sh ../alvisir-install
cd ..
```
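Before cleaning up and moving on to configuration, a quick sanity check can confirm that the pieces installed above are in place. This is a sketch, run from the repository root; adjust the names if your versions differ:

```shell
# verify that each dependency set up in the install steps exists;
# prints OK or MISSING for each expected path
for path in softwares/alvisnlp.sif softwares/obo-utils softwares/alvisir-install; do
  if [ -e "$path" ]; then
    echo "OK      $path"
  else
    echo "MISSING $path"
  fi
done
```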
```
rm -rf alvisir
```

## Set configs

### **1.** Set the path to the AlvisNLP Singularity image in the config file `config/config.yaml`

```
## alvisnlp singularity image
SINGULARITY_IMG: "softwares/alvisnlp.sif"
```

### **2.** Set the path to `obo-utils` in the config file `config/config.yaml`

```
## obo-utils home
OBO_UTILS : "softwares/obo-utils"
```

### **3.** Set the path to `alvisir` in the config file `config/config.yaml`

```
## alvisir home
ALVISIR_HOME : "softwares/alvisir-install"
```

### **4.** Add the supertaxonomy

Install and run the [taxonomy pipelines](https://forgemia.inra.fr/omnicrobe/extended-microorganisms-taxonomy), then copy the following result files into the folder `ancillaries/extended-microorganisms-taxonomy/`:

```
extended-microorganisms-taxonomy/output/bacdive-match
extended-microorganisms-taxonomy/output/bacdive-match/bacdive-to-taxid.txt
extended-microorganisms-taxonomy/output/taxa+id_full.trie
extended-microorganisms-taxonomy/output/taxa+id_microorganisms.txt
extended-microorganisms-taxonomy/output/taxid_microorganisms.txt
extended-microorganisms-taxonomy/output/bacdive-strains
extended-microorganisms-taxonomy/output/ncbi-taxonomy/names.dmp
extended-microorganisms-taxonomy/output/taxa+id_full.txt
extended-microorganisms-taxonomy/output/taxid_full.txt
```

## Run

### **step 1.** `Get PubMed abstracts`

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile get-pubmed-abstracts.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

> **_NOTE:_** The corpus, split into batches, is stored in `corpora/pubmed/`; it is required for the next steps.

> **_NOTE:_** Execution time: ~7 hours.

### **step 2.** `Get EPMC fulltexts`

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile get-epmc-fulltexts.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

> **_NOTE:_** The corpus, split into batches, is stored in `corpora/epmc/`; it is required for the next steps.

> **_NOTE:_** Execution time: ~7 hours.

### **step 3.** `Preprocess OntoBiotope`

<!--to analyze the ontologies, cut the desired branches, and produce the ToMap models and lexicon used in the next steps. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile preprocess-ontology.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

> **_NOTE:_** Check the snakefile for its outputs; they are required for the next steps.

> **_NOTE:_** Execution time: ~40 min.

### **step 3.1.** `Process PubMed data`

<!--to extract microorganisms and habitats from PubMed texts. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs 80 \
  --snakefile process_PubMed_corpus.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

### **step 3.2.** `Process CIRM data`

<!--to extract microorganisms and habitats from CIRM texts.
-->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile process_CIRM_corpus.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

### **step 3.3.** `Process GenBank data`

<!--to extract microorganisms and habitats from GenBank texts. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile process_GenBank_corpus.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

### **step 3.4.** `Process DSMZ data`

<!--to extract microorganisms and habitats from DSMZ texts. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile process_DSMZ_corpus.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

### Run the evaluation process

### **step 4.** `Evaluate with BioNLP-OST`

<!--to evaluate the extraction pipelines against the BioNLP-OST reference corpora. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile evaluate_with_BioNLP-OST.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

> **_NOTE:_** The scores are available in `corpora/florilege/eval/new/BB19-*-eval.json`.

> **_NOTE:_** Execution time: ~15 min.
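All of the run steps above share the same Snakemake invocation and differ only in the snakefile name (and, for PubMed, the job count). A small helper function can factor this out. This is a sketch under the same cluster assumptions as above (`short.q` queue, 2 threads per job); the `submit` helper and its `DRY_RUN` switch are illustrative additions, not part of the repository:

```shell
# submit SNAKEFILE [JOBS] -- run one workflow snakefile with the shared
# cluster options used throughout this README.
# Set DRY_RUN=1 to print the resulting command instead of executing it.
submit() {
  snakefile="$1"
  jobs="${2:-80}"
  cmd="snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs $jobs --snakefile $snakefile --cluster \"qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2\" --restart-times 4 all"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"
  else
    # eval is needed so the quoted --cluster argument stays one word
    eval "$cmd"
  fi
}

# example: show the command that step 3.1 would run, without executing it
DRY_RUN=1 submit process_PubMed_corpus.snakefile 80
```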