Getting Started
%load_ext autoreload
%autoreload 2
# avoid annoying tokenizer warnings
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import datamol as dm
smiles = dm.freesolv()["smiles"].values[:5]
smiles
array(['CN(C)C(=O)c1ccc(cc1)OC', 'CS(=O)(=O)Cl', 'CC(C)C=C', 'CCc1cnccn1', 'CCCCCCCO'], dtype=object)
In these examples, we explore the various embeddings provided by the molfeat-hype plugin of molfeat. We are interested in understanding and assessing how well Large Language Models (LLMs) that have NOT been trained or fine-tuned on any particular molecular context perform at molecular featurization.
Classic Embeddings
Classic embeddings are the embeddings returned directly by an LLM, as opposed to the instruction-based models covered later.
from molfeat_hype.trans.llm_embeddings import LLMTransformer
Using the OpenAI API for embeddings
embedder = LLMTransformer(kind="openai/text-embedding-ada-002")
out = embedder(smiles)
out.shape
(5, 1536)
len(embedder)
1536
# the cache should have this molecule
len(embedder.precompute_cache.get("CCCCCCCO"))
1536
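As a quick sanity check that these features behave like embeddings, we can look at their pairwise cosine similarities. This is a minimal sketch using plain scikit-learn on the out array computed above; nothing here is part of the molfeat-hype API.
# Minimal sketch: pairwise cosine similarity between the 5 OpenAI embeddings above.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(np.asarray(out, dtype=float))
sim.round(2)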
Using the Sentence-Transformers models
embedder = LLMTransformer(kind="sentence-transformers/all-mpnet-base-v2")
out = embedder(smiles)
out.shape
(5, 768)
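For quick inspection, it can help to label the feature matrix with the input SMILES. A minimal sketch using pandas on the out array from the cell above:
# Minimal sketch: wrap the features in a DataFrame indexed by SMILES for inspection.
import numpy as np
import pandas as pd
features = pd.DataFrame(np.asarray(out, dtype=float), index=smiles)
features.iloc[:, :5]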
Using quantized llama weights with llama.cpp
# Path to the quantized llama weights (ggml format).
# The original llama weights can be requested from Meta (unofficial torrent/IPFS/direct
# downloads of the original weights also circulate online). Once you have the weights,
# you can quantize them yourself with the tools shipped with llama.cpp.
llama_quantized_model_path = "/Users/manu/Code/llama.cpp/models/7B/ggml-model-q4_0.bin"
embedder = LLMTransformer(kind="llama.cpp", quantized_model_path=llama_quantized_model_path)
out = embedder(smiles)
out.shape
llama.cpp: loading model from /Users/manu/Code/llama.cpp/models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 59.11 KB
llama_model_load_internal: mem required = 5809.32 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size = 1024.00 MB
(5, 4096)
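Local llama.cpp inference is the slow part of this featurizer, so it is worth persisting the computed features between runs. A minimal sketch with plain NumPy (the file name is arbitrary):
# Minimal sketch: save the llama.cpp features to disk and reload them,
# so they don't need to be recomputed on every run.
import numpy as np
np.save("llama_freesolv_subset.npy", np.asarray(out, dtype=np.float32))
cached = np.load("llama_freesolv_subset.npy")
cached.shape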
Instruction-based models
molfeat_hype provides two types of instruction-based models for molecule embeddings:
Prompt-based instruction: a ChatGPT model is prompted to act as an all-knowing assistant for drug discovery, returning the best molecular representation for the input list of molecules. The representation is parsed from the chat agent's output.
Conditional embedding: a model trained for conditional text embeddings that takes instructions as part of its input. The embedding is the model's underlying representation of the molecule, conditioned on the instructions it received. For more information, see instructor-embedding.
Using the ChatGPT embeddings
from molfeat_hype.trans.llm_instruct_embeddings import InstructLLMTransformer
# this should fail if the model does not understand the prompt
embedder = InstructLLMTransformer(kind="openai/chatgpt", embedding_size=16)
out = embedder(smiles)
out.shape
(5, 16)
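Because this representation is parsed from the chat agent's free-form reply, a quick shape and finiteness check before using it downstream is cheap insurance. A minimal sketch with plain NumPy:
# Minimal sketch: verify the parsed chat output has the requested size and
# contains only finite values before feeding it to a model.
import numpy as np
arr = np.asarray(out, dtype=float)
assert arr.shape == (len(smiles), 16)  # matches embedding_size=16 requested above
assert np.isfinite(arr).all()          # no NaN/inf introduced by parsing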
Using the instructor embeddings
# this should fail if the model does not understand the prompt
# we recommend the instructor-large model
embedder = InstructLLMTransformer(kind="hkunlp/instructor-large")
out = embedder(smiles)
out.shape
load INSTRUCTOR_Transformer max_seq_length 512
(5, 768)
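Any embedder from this notebook can be dropped into the same downstream evaluation loop to compare featurizers on equal footing. The sketch below cross-validates a ridge regression on a slightly larger FreeSolv slice using the instructor embeddings; it assumes dm.freesolv() exposes an "expt" column with the experimental hydration free energies (adjust if your datamol version differs), and 50 molecules is only meant to keep the example fast.
# Minimal sketch: cross-validated downstream check; any of the embedders above can
# be swapped in. Assumes an "expt" column in dm.freesolv() (hydration free energy).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

df = dm.freesolv().iloc[:50]
X = np.asarray(embedder(df["smiles"].values), dtype=float)
y = df["expt"].values
cross_val_score(Ridge(), X, y, cv=5).mean()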