# About

This project provides a workflow to extract microorganism taxa, habitats, and phenotypes from texts using the NCBI taxonomy and the OntoBiotope ontology. It processes data from CIRM, GenBank, DSMZ, and PubMed, and enriches the [Omnicrobe database](https://omnicrobe.migale.inrae.fr/) with food microbe flora.

The workflow includes six text-mining pipelines built with [AlvisNLP](https://bibliome.github.io/alvisnlp/) and other tools, executed via Snakemake.

The following steps describe how to run the workflow on the Migale facility (SGE cluster, Linux/Ubuntu). You should know how to use [AlvisNLP](https://bibliome.github.io/alvisnlp/), [Snakemake](https://snakemake.readthedocs.io), and the [SGE queuing system](http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html).

> **_NOTE:_** Adaptations may be needed for other environments. See additional documentation [here](docs/README.md).

## Install

### **0.** Clone the project

```
git clone https://forgemia.inra.fr/omnicrobe/text-mining-workflow.git
cd text-mining-workflow/
```

### **1.** Set the AlvisNLP Singularity image

```
cd softwares/
ln -s /work_projet/bibliome/singularity/alvisnlp-0.10.1.sif alvisnlp.sif # change the alvisnlp version if required
```

### **2.** Create the global conda environment for Snakemake

```
conda env create -f softwares/envs/snakemake-5.13.0-env.yaml
conda activate snakemake-5.13.0-env
```

### **3.** Clone [obo-utils](https://github.com/Bibliome/obo-utils)

```
cd softwares/
git clone https://github.com/Bibliome/obo-utils.git
```

### **4.** Clone and install [alvisir](https://github.com/Bibliome/alvisir)

```
cd softwares/
mkdir -p alvisir-install
git clone https://github.com/Bibliome/alvisir.git
cd alvisir/
mvn clean package
./install.sh ../alvisir-install
cd ..
```
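Before cleaning up and moving on to configuration, a quick sanity check can confirm that the pieces installed above are in place. This is a sketch, run from the repository root; adjust the names if your versions differ:

```shell
# verify that each dependency set up in the install steps exists;
# prints OK or MISSING for each expected path
for path in softwares/alvisnlp.sif softwares/obo-utils softwares/alvisir-install; do
  if [ -e "$path" ]; then
    echo "OK      $path"
  else
    echo "MISSING $path"
  fi
done
```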
```
rm -rf alvisir
```

## Set configs

### **1.** Set the path to the AlvisNLP Singularity image in the config file `config/config.yaml`

```
## alvisnlp singularity image
SINGULARITY_IMG: "softwares/alvisnlp.sif"
```

### **2.** Set the path to `obo-utils` in the config file `config/config.yaml`

```
## obo-utils home
OBO_UTILS : "softwares/obo-utils"
```

### **3.** Set the path to `alvisir` in the config file `config/config.yaml`

```
## alvisir home
ALVISIR_HOME : "softwares/alvisir-install"
```

### **4.** Add the supertaxonomy

Install and run the [taxonomy pipelines](https://forgemia.inra.fr/omnicrobe/extended-microorganisms-taxonomy), then copy the following result files into the folder `ancillaries/extended-microorganisms-taxonomy/`:

```
extended-microorganisms-taxonomy/output/bacdive-match
extended-microorganisms-taxonomy/output/bacdive-match/bacdive-to-taxid.txt
extended-microorganisms-taxonomy/output/taxa+id_full.trie
extended-microorganisms-taxonomy/output/taxa+id_microorganisms.txt
extended-microorganisms-taxonomy/output/taxid_microorganisms.txt
extended-microorganisms-taxonomy/output/bacdive-strains
extended-microorganisms-taxonomy/output/ncbi-taxonomy/names.dmp
extended-microorganisms-taxonomy/output/taxa+id_full.txt
extended-microorganisms-taxonomy/output/taxid_full.txt
```

## Run

### **step 1.** `Get PubMed abstracts`

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile get-pubmed-abstracts.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

> **_NOTE:_** The corpus, split into batches, is stored in `corpora/pubmed/`; it is required for the next steps.

> **_NOTE:_** Execution time: ~7 hours.

### **step 2.** `Get EPMC fulltexts`

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile get-epmc-fulltexts.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

> **_NOTE:_** The corpus, split into batches, is stored in `corpora/epmc/`; it is required for the next steps.

> **_NOTE:_** Execution time: ~7 hours.

### **step 3.** `Preprocess OntoBiotope`

<!--to analyze the ontologies, cut the desired branches, and produce the ToMap models and lexicon used in the next steps. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile preprocess-ontology.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

> **_NOTE:_** Check the snakefile for its outputs; they are required for the next steps.

> **_NOTE:_** Execution time: ~40 min.

### **step 3.1.** `Process PubMed data`

<!--to extract microorganisms and habitats from PubMed texts. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs 80 \
  --snakefile process_PubMed_corpus.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

### **step 3.2.** `Process CIRM data`

<!--to extract microorganisms and habitats from CIRM texts.
-->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile process_CIRM_corpus.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

### **step 3.3.** `Process GenBank data`

<!--to extract microorganisms and habitats from GenBank texts. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile process_GenBank_corpus.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

### **step 3.4.** `Process DSMZ data`

<!--to extract microorganisms and habitats from DSMZ texts. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile process_DSMZ_corpus.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

### Run the evaluation process

### **step 4.** `Evaluate with BioNLP-OST`

<!--to evaluate the extraction pipelines against the BioNLP-OST reference corpora. -->

```
snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs \
  --snakefile evaluate_with_BioNLP-OST.snakefile \
  --cluster "qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2" \
  --restart-times 4 all
```

> **_NOTE:_** The scores are available in `corpora/florilege/eval/new/BB19-*-eval.json`.

> **_NOTE:_** Execution time: ~15 min.
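All of the run steps above share the same Snakemake invocation and differ only in the snakefile name (and, for PubMed, the job count). A small helper function can factor this out. This is a sketch under the same cluster assumptions as above (`short.q` queue, 2 threads per job); the `submit` helper and its `DRY_RUN` switch are illustrative additions, not part of the repository:

```shell
# submit SNAKEFILE [JOBS] -- run one workflow snakefile with the shared
# cluster options used throughout this README.
# Set DRY_RUN=1 to print the resulting command instead of executing it.
submit() {
  snakefile="$1"
  jobs="${2:-80}"
  cmd="snakemake --nolock --verbose --printshellcmds --use-singularity --use-conda --reason --latency-wait 30 --jobs $jobs --snakefile $snakefile --cluster \"qsub -v PYTHONPATH='' -V -cwd -e log/ -o log/ -q short.q -pe thread 2\" --restart-times 4 all"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"
  else
    # eval is needed so the quoted --cluster argument stays one word
    eval "$cmd"
  fi
}

# example: show the command that step 3.1 would run, without executing it
DRY_RUN=1 submit process_PubMed_corpus.snakefile 80
```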