☄️ molfeat-hype

Overview

molfeat-hype is an extension of molfeat that investigates how well embeddings from various LLMs, trained without explicit molecular context, perform for molecular modeling. It leverages some of the most hyped LLMs in NLP to answer the following question:

Is it necessary to pretrain/finetune LLMs on molecular context to obtain good molecular representations?

To find an answer to this question, check out the benchmarks.

Spoiler: YES! Understanding molecular context/structure/properties is key to building good molecular featurizers.

LLMs:

molfeat-hype supports two types of LLM embeddings:

  1. Classic Embeddings: These are classical embeddings provided by foundation models (or any LLM). The models available in this tool include OpenAI's openai/text-embedding-ada-002, llama, and several embedding models accessible through sentence-transformers.

  2. Instruction-based Embeddings: These are models that have been trained to follow instructions (thus acting like ChatGPT) or are conditional models that require a prompt.

    • Prompt-based instruction: A model (like ChatGPT: openai/gpt-3.5-turbo) is asked to act as an all-knowing AI assistant for drug discovery and to provide the best molecular representation for the input list of molecules. The representation is then parsed from the chat agent's output.
    • Conditional embeddings: A model trained for conditional text embeddings takes an instruction as additional input. Here, the embedding is the model's underlying representation of the molecule, conditioned on the instruction it received. For more information, see instructor-embedding.
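The prompt-based approach depends on recovering numeric vectors from free-form chat text, which is fragile. The sketch below illustrates one way this could be done and is not molfeat-hype's actual implementation: the prompt wording, the helper names `build_prompt` and `parse_vectors`, and the bracketed-list output format are all assumptions made for illustration.

```python
import json
import re
from typing import List

# Hypothetical prompt wording; the prompt molfeat-hype really uses may differ.
PROMPT_TEMPLATE = (
    "You are an all-knowing AI assistant for drug discovery. "
    "For each molecule below, return its representation as a bracketed "
    "list of {n_dim} numbers, one list per line.\n{molecules}"
)

# Matches a flat bracketed list such as [0.1, -2, 3e-4].
_VECTOR_RE = re.compile(r"\[[^\[\]]*\]")


def build_prompt(smiles: List[str], n_dim: int = 4) -> str:
    """Fill the (assumed) prompt template with one SMILES string per line."""
    return PROMPT_TEMPLATE.format(n_dim=n_dim, molecules="\n".join(smiles))


def parse_vectors(reply: str, n_dim: int) -> List[List[float]]:
    """Return every well-formed numeric list of length n_dim found in reply.

    Lists that are not valid JSON, have the wrong length, or contain
    non-numeric entries are skipped instead of raising an error.
    """
    vectors = []
    for candidate in _VECTOR_RE.findall(reply):
        try:
            values = json.loads(candidate)
        except json.JSONDecodeError:
            continue  # e.g. "[a, b]" is not a numeric list
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if len(values) == n_dim and is_numeric:
            vectors.append([float(v) for v in values])
    return vectors


if __name__ == "__main__":
    # A made-up reply standing in for real chat agent output.
    reply = "Sure! CCO: [0.1, 0.2, -0.3, 1]\nc1ccccc1: [1, 2, 3]"
    print(build_prompt(["CCO", "c1ccccc1"]))
    print(parse_vectors(reply, n_dim=4))
```

In this made-up reply only the first list has the requested length, so the second molecule ends up without a vector. A chat model gives no guarantee about the shape or stability of its answer, so a parser of this kind has to tolerate missing or malformed entries.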

Installation

You can install molfeat-hype using pip. A conda installation is planned.

pip install molfeat-hype

molfeat-hype mostly depends on molfeat and langchain. For a complete list of dependencies, please see the env.yml file.

Acknowledgements

Check out the following projects that made molfeat-hype possible:

  • Please refer to the langchain documentation for any questions related to langchain.

Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether in the form of new features, improved infrastructure, or better documentation. For detailed information on how to contribute, see our contribution guide.

Disclaimer

This repository contains an experimental investigation of LLM embeddings for molecules. Please note that the consistency and usefulness of the returned molecular embeddings are not guaranteed. This project is meant for fun and exploratory purposes only and should not be used as a demonstration of LLM capabilities for molecular embeddings. Any statements made in this repository are the opinions