This project provides a workflow to extract microorganism taxa, habitats, and phenotypes from texts using the NCBI Taxonomy and the OntoBiotope ontology. It processes data from CIRM, GenBank, DSMZ, and PubMed and enriches the [Omnicrobe database](https://omnicrobe.migale.inrae.fr/) with food microbe flora.
The workflow includes six text-mining pipelines using [AlvisNLP](https://bibliome.github.io/alvisnlp/) and other tools, executed via Snakemake.
The following steps show how to run the workflow on the Migale facility (SGE cluster, Linux OS (Ubuntu)). You must know how to use [AlvisNLP](https://bibliome.github.io/alvisnlp/), [Snakemake](https://snakemake.readthedocs.io), and the [SGE queuing system](http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html).
> **_NOTE:_** Adaptations are required for other environments. <br/>
> See additional documentation [here](docs/README.md).
### **1.** clone the repository and link the AlvisNLP singularity image
```
git clone https://forgemia.inra.fr/omnicrobe/text-mining-workflow.git
cd text-mining-workflow/
```
```
cd softwares/
ln -s /work_projet/bibliome/singularity/alvisnlp-0.10.1.sif alvisnlp.sif # change alvisnlp version if required
```
### **2.** create the global conda env for snakemake
```
conda env create -f softwares/envs/snakemake-5.13.0-env.yaml
conda activate snakemake-5.13.0-env
```
### **3.** clone [obo-utils](https://github.com/Bibliome/obo-utils)
```
cd softwares/
git clone https://github.com/Bibliome/obo-utils.git
```
### **4.** clone and install [alvisir](https://github.com/Bibliome/alvisir)
```
cd softwares/
mkdir -p alvisir-install
git clone https://github.com/Bibliome/alvisir.git
cd alvisir/
mvn clean package
./install.sh ../alvisir-install
cd ..
rm -rf alvisir
```
### **1.** set the path to `Alvisnlp Singularity Image` in config file `config/config.yaml`
```
## alvisnlp singularity image
SINGULARITY_IMG: "softwares/alvisnlp.sif"
```
### **2.** set the path to `obo-utils` in config file `config/config.yaml`
```
## obo-utils home
## obo-utils home
OBO_UTILS: "softwares/obo-utils"
```
### **3.** set the path to `alvisir` in config file `config/config.yaml`
```
## alvisir home
ALVISIR_HOME: "softwares/alvisir-install"
```
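Before launching any pipeline, it can be worth checking that the three configured paths actually exist. A minimal sketch (the `check_path` helper is ours, not part of the repository):

```shell
# Hypothetical helper: report whether each configured path exists.
check_path() {
  if [ -e "$1" ]; then echo "ok: $1"; else echo "MISSING: $1"; fi
}
check_path softwares/alvisnlp.sif
check_path softwares/obo-utils
check_path softwares/alvisir-install
```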
You need to install and run the [taxonomy pipelines](https://forgemia.inra.fr/omnicrobe/extended-microorganisms-taxonomy) to create the extended taxonomy, then copy the following result files into the folder `ancillaries/extended-microorganisms-taxonomy/`:
```
cp -r \
  extended-microorganisms-taxonomy/output/bacdive-match \
  extended-microorganisms-taxonomy/output/bacdive-match/bacdive-to-taxid.txt \
  extended-microorganisms-taxonomy/output/taxa+id_full.trie \
  extended-microorganisms-taxonomy/output/taxa+id_microorganisms.txt \
  extended-microorganisms-taxonomy/output/taxid_microorganisms.txt \
  extended-microorganisms-taxonomy/output/bacdive-strains \
  extended-microorganisms-taxonomy/output/ncbi-taxonomy/names.dmp \
  extended-microorganisms-taxonomy/output/taxa+id_full.txt \
  extended-microorganisms-taxonomy/output/taxid_full.txt \
  ancillaries/extended-microorganisms-taxonomy/
```
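A quick way to confirm that every required file landed in `ancillaries/extended-microorganisms-taxonomy/`; the `check_required_files` helper below is a sketch, not part of the workflow, and no output means everything is in place:

```shell
# Hypothetical helper: list any expected file that is missing under a directory.
check_required_files() {
  dir="$1"; shift
  for f in "$@"; do
    [ -e "$dir/$f" ] || echo "missing: $dir/$f"
  done
}
check_required_files ancillaries/extended-microorganisms-taxonomy \
  bacdive-match/bacdive-to-taxid.txt bacdive-strains ncbi-taxonomy/names.dmp \
  taxa+id_full.trie taxa+id_full.txt taxa+id_microorganisms.txt \
  taxid_microorganisms.txt taxid_full.txt
```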
## Run
### **step 1.** `Get PubMed Abstracts`
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile get-pubmed-abstracts.snakefile \
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** The corpus is split into batches in `corpora/pubmed/batches`; these are required for the next steps <br/>
> Execution time ~ 7 hours
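To get a rough idea of how the corpus was batched, you can count the entries in the batch folder (the `count_batches` helper is our own sketch):

```shell
# Hypothetical helper: count the entries directly under a batch directory.
count_batches() { find "$1" -mindepth 1 -maxdepth 1 | wc -l; }
count_batches corpora/pubmed/batches
```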
### **step 2.** `Get EPMC Full Texts`
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile get-epmc-fulltexts.snakefile \
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** The corpus is split into batches in `corpora/epmc/batches`; these are required for the next steps <br/>
### **step 3.** `preprocess OntoBiotope` <!-- analyze the ontologies, cut the desired branches, and produce the ToMap models and lexicon used in the next steps. -->
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile preprocess-ontology.snakefile \
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** check the snakefile for outputs; they are required for the next steps <br/>
### **step 4.1** `process PubMed data` <!-- extract microorganisms and habitats from PubMed texts. -->
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs 80 \
--cluster "qsub -v PYTHONPATH='' -l mem_free=36G -V -cwd -e log/ -o log/ -q long.q,maiage.q,short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/pubmed/PubMed-*.txt` <br/>
### **step 4.2** `process EPMC data` <!-- extract microorganisms and habitats from EPMC full texts. -->
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs 80 \
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> Execution time ~ ?
### **step 4.3** `process CIRM data` <!-- extract microorganisms and habitats from CIRM texts. -->
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/cirm/cirm-*-results.txt` <br/>
### **step 4.4** `process GenBank data` <!-- extract microorganisms and habitats from GenBank texts. -->
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--cluster "qsub -v PYTHONPATH='' -l mem_free=36G -V -cwd -e log/ -o log/ -q long.q,maiage.q,short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/genbank/genbank-results.txt` <br/>
### **step 4.5** `process DSMZ data` <!-- extract microorganisms and habitats from DSMZ texts. -->
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/dsmz/dsmz-results.txt` <br/>
## Evaluate
### **step 1.** `evaluate with BioNLP-OST` <!-- evaluate the extracted entities against the BioNLP-OST reference. -->
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** scores: `corpora/florilege/eval/new/BB19-*-eval.json` <br/>
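To skim the scores, the JSON files can be pretty-printed. A sketch, assuming `python3` is available and using the path from the note above:

```shell
# Pretty-print each evaluation file; skip silently if none were produced yet.
for f in corpora/florilege/eval/new/BB19-*-eval.json; do
  [ -e "$f" ] || continue
  echo "== $f"
  python3 -m json.tool "$f"
done
```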
### **step 2.** `compare pubmed results` <!-- compare new and old PubMed results. -->
Put the new results into `corpora/florilege/compare/pubmed/new` and the old results into `corpora/florilege/compare/pubmed/old`.
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile compare-results.snakefile --config NEW_RESULT_FOLDER='corpora/florilege/compare/pubmed/new' OLD_RESULT_FOLDER='corpora/florilege/compare/pubmed/old' MATCH_RESULT_FOLDER='corpora/florilege/compare/pubmed' \
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/compare/pubmed/*.rankdiff.txt` <br/>
### **step 3.** `compare epmc results` <!-- compare new and old EPMC results. -->
Put the new results into `corpora/florilege/compare/epmc/new` and the old results into `corpora/florilege/compare/epmc/old`.
```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
--snakefile compare-results.snakefile --config NEW_RESULT_FOLDER='corpora/florilege/compare/epmc/new' OLD_RESULT_FOLDER='corpora/florilege/compare/epmc/old' MATCH_RESULT_FOLDER='corpora/florilege/compare/epmc' \
--cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
--restart-times 4 all
```
> **_NOTE:_** results: `corpora/florilege/compare/epmc/*.rankdiff.txt` <br/>
## Update AlvisIR instances
### **step 1.**
copy the data for omnicrobe alvisir-dev into `/work_projet/bibliome/wservice-configs/omnicrobe-dev_alvisir/`
```
cd corpora/florilege/alvisir
cp -r index /work_projet/bibliome/wservice-configs/omnicrobe-dev_alvisir/
cp -r expander /work_projet/bibliome/wservice-configs/omnicrobe-dev_alvisir/
cp -r *.json BioNLP-OST+EnovFood-Phenotype.json /work_projet/bibliome/wservice-configs/omnicrobe-dev_alvisir/resources/
```
> **_NOTE:_** make sure to back up the existing data before copying, so it can be restored if needed.<br/> The update can be seen here: https://bibliome.migale.inrae.fr/omnicrobe-dev/alvisir/webapi/search
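The backup advised in the note above could be scripted like this (`backup_dir` is a hypothetical helper, not part of the repository):

```shell
# Hypothetical helper: copy a directory to a timestamped .bak sibling
# and print the name of the backup on success.
backup_dir() {
  d="$1"
  stamp=$(date +%Y%m%d-%H%M%S)
  [ -d "$d" ] || { echo "no such directory: $d" >&2; return 1; }
  cp -a "$d" "${d}.bak-${stamp}" && echo "${d}.bak-${stamp}"
}
# e.g. on the cluster:
# backup_dir /work_projet/bibliome/wservice-configs/omnicrobe-dev_alvisir
```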
### **step 2.**
copy the data for omnicrobe alvisir into `/work_projet/bibliome/wservice-configs/omnicrobe_alvisir/`
```
cd corpora/florilege/alvisir
cp -r index /work_projet/bibliome/wservice-configs/omnicrobe_alvisir/
cp -r expander /work_projet/bibliome/wservice-configs/omnicrobe_alvisir/
cp -r *.json BioNLP-OST+EnovFood-Phenotype.json /work_projet/bibliome/wservice-configs/omnicrobe_alvisir/resources/
```
> **_NOTE:_** make sure to back up the existing data before copying, so it can be restored if needed. <br/>The update can be seen here: https://bibliome.migale.inrae.fr/omnicrobe/alvisir/webapi/search