Getting Started
%load_ext autoreload
%autoreload 2
# avoid annoying tokenizer warnings
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import datamol as dm
smiles = dm.freesolv()["smiles"].values[:5]
smiles
array(['CN(C)C(=O)c1ccc(cc1)OC', 'CS(=O)(=O)Cl', 'CC(C)C=C', 'CCc1cnccn1', 'CCCCCCCO'], dtype=object)
In these examples, we explore the various embeddings provided by the molfeat-hype plugin of molfeat. We are interested in understanding and assessing how well Large Language Models (LLMs) that have NOT been trained or fine-tuned on any particular molecular context perform at molecular featurization.
Classic Embeddings
Classic embeddings are the embeddings returned directly by an LLM, as opposed to the instruction-based models covered later.
from molfeat_hype.trans.llm_embeddings import LLMTransformer
Using the OpenAI API for embeddings
embedder = LLMTransformer(kind="openai/text-embedding-ada-002")
out = embedder(smiles)
out.shape
(5, 1536)
len(embedder)
1536
# the cache should have this molecule
len(embedder.precompute_cache.get("CCCCCCCO"))
1536
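As a quick sanity check that these features behave like embeddings, we can look at their pairwise cosine similarities. This is a minimal sketch using plain scikit-learn on the out array computed above; nothing here is part of the molfeat-hype API.
# Minimal sketch: pairwise cosine similarity between the 5 OpenAI embeddings above.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(np.asarray(out, dtype=float))
sim.round(2)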
Using the Sentence-Transformers models
embedder = LLMTransformer(kind="sentence-transformers/all-mpnet-base-v2")
out = embedder(smiles)
out.shape
(5, 768)
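For quick inspection, it can help to label the feature matrix with the input SMILES. A minimal sketch using pandas on the out array from the cell above:
# Minimal sketch: wrap the features in a DataFrame indexed by SMILES for inspection.
import numpy as np
import pandas as pd
features = pd.DataFrame(np.asarray(out, dtype=float), index=smiles)
features.iloc[:, :5]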
Using quantized llama weights with llama.cpp
# Path to the quantized llama weights (ggml format).
# The original llama weights can be requested from Meta (unofficial torrent/IPFS/direct
# downloads of the original weights also circulate online). Once you have the weights,
# you can quantize them yourself with the tools shipped with llama.cpp.
llama_quantized_model_path = "/Users/manu/Code/llama.cpp/models/7B/ggml-model-q4_0.bin"
embedder = LLMTransformer(kind="llama.cpp", quantized_model_path=llama_quantized_model_path)
out = embedder(smiles)
out.shape
llama.cpp: loading model from /Users/manu/Code/llama.cpp/models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 59.11 KB
llama_model_load_internal: mem required = 5809.32 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size = 1024.00 MB
(5, 4096)
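Local llama.cpp inference is the slow part of this featurizer, so it is worth persisting the computed features between runs. A minimal sketch with plain NumPy (the file name is arbitrary):
# Minimal sketch: save the llama.cpp features to disk and reload them,
# so they don't need to be recomputed on every run.
import numpy as np
np.save("llama_freesolv_subset.npy", np.asarray(out, dtype=np.float32))
cached = np.load("llama_freesolv_subset.npy")
cached.shape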
Instruction-based models
molfeat_hype provides two types of instruction-based models for molecule embeddings:
Prompt-based instruction: a ChatGPT model is prompted to act as an all-knowing assistant for drug discovery, returning the best molecular representation for the input list of molecules. The representation is parsed from the chat agent's output.
Conditional embedding: a model trained for conditional text embeddings that takes instructions as part of its input. The embedding is the model's underlying representation of the molecule, conditioned on the instructions it received. For more information, see instructor-embedding.
Using the ChatGPT embeddings
from molfeat_hype.trans.llm_instruct_embeddings import InstructLLMTransformer
# this should fail if the model does not understand the prompt
embedder = InstructLLMTransformer(kind="openai/chatgpt", embedding_size=16)
out = embedder(smiles)
out.shape
(5, 16)
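Because this representation is parsed from the chat agent's free-form reply, a quick shape and finiteness check before using it downstream is cheap insurance. A minimal sketch with plain NumPy:
# Minimal sketch: verify the parsed chat output has the requested size and
# contains only finite values before feeding it to a model.
import numpy as np
arr = np.asarray(out, dtype=float)
assert arr.shape == (len(smiles), 16)  # matches embedding_size=16 requested above
assert np.isfinite(arr).all()          # no NaN/inf introduced by parsing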
Using the instructor embeddings
# this should fail if the model does not understand the prompt
# we recommend the instructor-large model
embedder = InstructLLMTransformer(kind="hkunlp/instructor-large")
out = embedder(smiles)
out.shape
load INSTRUCTOR_Transformer max_seq_length 512
(5, 768)
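Any embedder from this notebook can be dropped into the same downstream evaluation loop to compare featurizers on equal footing. The sketch below cross-validates a ridge regression on a slightly larger FreeSolv slice using the instructor embeddings; it assumes dm.freesolv() exposes an "expt" column with the experimental hydration free energies (adjust if your datamol version differs), and 50 molecules is only meant to keep the example fast.
# Minimal sketch: cross-validated downstream check; any of the embedders above can
# be swapped in. Assumes an "expt" column in dm.freesolv() (hydration free energy).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

df = dm.freesolv().iloc[:50]
X = np.asarray(embedder(df["smiles"].values), dtype=float)
y = df["expt"].values
cross_val_score(Ridge(), X, y, cv=5).mean()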