LLM embeddings
Classical Embeddings
This section covers the classical embedding of a text object (here, a molecule in SMILES format).
LLMTransformer
Bases: PretrainedMolTransformer
Large Language Model Embeddings Transformer for molecules. This transformer embeds molecules using available Large Language Models (LLMs) through LangChain. Note that these LLMs have no molecular context: they were not trained on molecules or on any specific molecular task, only on a large corpus of text.
Caching computation
LLMs can be computationally expensive, and even financially expensive if you use the OpenAI embeddings. To avoid recomputing the embeddings of the same molecules, we recommend using a molfeat Cache object. By default, an in-memory cache (DataCache) is used, but other caching systems can be explored.
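As an illustrative sketch only: the snippet below assumes molfeat exposes `DataCache` under `molfeat.utils.cache` and that `precompute_cache` also accepts a cache instance in place of a boolean.

```python
from molfeat.utils.cache import DataCache  # assumed import path for molfeat's in-memory cache

from molfeat_hype.trans.llm_embeddings import LLMTransformer

# One shared in-memory cache: molecules embedded once are not re-sent to the model.
cache = DataCache(name="llm_embedding_cache")

transformer = LLMTransformer(
    kind="sentence-transformers/all-MiniLM-L6-v2",
    precompute_cache=cache,  # assumption: a cache object is accepted in place of True
)
```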
Using OpenAI Embeddings
If you are using the OpenAI embeddings, you need to provide an `openai_api_key` argument or define one through the environment variable `OPENAI_API_KEY`. Note that only the `text-embedding-ada-002` model is supported. Refer to OpenAI's documentation for more information.
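A minimal sketch of the OpenAI setup (the key handling below is illustrative; in practice you would export `OPENAI_API_KEY` outside your code rather than hardcode anything):

```python
import os

from molfeat_hype.trans.llm_embeddings import LLMTransformer

# Either pass the key explicitly ...
transformer = LLMTransformer(
    kind="openai/text-embedding-ada-002",
    openai_api_key=os.environ.get("OPENAI_API_KEY"),
)
# ... or rely on the OPENAI_API_KEY environment variable being set and omit the argument.
```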
Using LLAMA Embeddings
The Llama embeddings are provided via the Python bindings of llama.cpp. We do not provide the path to the quantized Llama model; however, it is easy to find one online, as some people have shared torrent/IPFS/direct download links to the Llama weights, which you can then quantize yourself.
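A sketch of the llama.cpp setup; the model path below is hypothetical and stands in for whatever quantized checkpoint you obtained yourself:

```python
from molfeat_hype.trans.llm_embeddings import LLMTransformer

# Point the transformer at a locally stored, quantized Llama checkpoint.
transformer = LLMTransformer(
    kind="llama.cpp",
    quantized_model_path="/path/to/quantized-llama-model.bin",  # hypothetical path
)
```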
Using Sentence Transformer Embeddings
The sentence transformer embeddings are based on the SentenceTransformers package.
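Since these models run locally and need neither an API key nor a model path, instantiation is the simplest case:

```python
from molfeat_hype.trans.llm_embeddings import LLMTransformer

# Any of the supported sentence-transformers checkpoints can be used as `kind`.
transformer = LLMTransformer(kind="sentence-transformers/all-mpnet-base-v2")
```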
Source code in molfeat_hype/trans/llm_embeddings.py
Attributes:

- `SUPPORTED_EMBEDDINGS = ['openai/text-embedding-ada-002', 'sentence-transformers/all-MiniLM-L6-v2', 'sentence-transformers/all-mpnet-base-v2', 'llama.cpp']` (class attribute)
- `kind = kind` (instance attribute)
- `model = LlamaCppEmbeddings(model_path=quantized_model_path, **params)` (instance attribute)
- `standardize = standardize` (instance attribute)
__init__(kind: Union[str, LangChainEmbeddings], standardize=True, precompute_cache=True, n_jobs=0, dtype=float, openai_api_key=None, quantized_model_path=None, parallel_kwargs=None, **params)
Instantiate an LLM Embeddings transformer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `kind` | `Union[str, LangChainEmbeddings]` | Kind of LLM to use. Supported LLMs are accessible through the `SUPPORTED_EMBEDDINGS` attribute. Here are a few: `"openai/text-embedding-ada-002"`, `"sentence-transformers/all-MiniLM-L6-v2"`, `"sentence-transformers/all-mpnet-base-v2"`, `"llama.cpp"`. You can also provide any model hosted on Hugging Face that computes embeddings. | |
| `standardize` | `bool` | If True, standardize SMILES before embedding. | `True` |
| `precompute_cache` | `bool` | If True, add a cache to avoid recomputing embeddings for the same molecules. | `True` |
| `n_jobs` | `int` | Number of jobs to use for preprocessing SMILES. | `0` |
| `dtype` | | Data type to use for the returned embeddings. | `float` |
| `openai_api_key` | `Optional[str]` | OpenAI API key to use. If None, it will be read from the environment variable `OPENAI_API_KEY`. | `None` |
| `quantized_model_path` | `Optional[str]` | Path to the quantized model for llama.cpp. If None, it will be read from the environment variable `quantized_MODEL_PATH`. | `None` |
| `**params` | | Parameters to pass to the LLM embeddings. See the LangChain documentation. | `{}` |
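Putting these arguments together, here is an end-to-end sketch; it assumes the usual molfeat convention that a transformer instance can be called directly on a list of SMILES to get a feature matrix:

```python
import numpy as np

from molfeat_hype.trans.llm_embeddings import LLMTransformer

smiles = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]

transformer = LLMTransformer(
    kind="sentence-transformers/all-MiniLM-L6-v2",
    standardize=True,   # standardize SMILES before embedding
    n_jobs=0,           # parallelism for SMILES preprocessing
    dtype=np.float32,   # dtype of the returned feature matrix
)

# Assumption: molfeat transformers are callable on a list of SMILES.
embeddings = transformer(smiles)
print(embeddings.shape)  # (3, embedding_dim)
```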
Source code in molfeat_hype/trans/llm_embeddings.py
Instruct-aware Embeddings
This section corresponds to models that accept instructions to compute the embedding of an input molecule.
DEFAULT_SYSTEM_PROMPT = "MolAssistant is a powerful large language model trained by the Open Drug Discovery consortium, \nspecifically designed to assist with various tasks related to drug discovery. Its capabilities range from providing \nmolecular representation to designing new molecules with specific optimization objectives in mind. Being pre-trained \non the largest corpus of chemical and biological data, MolAssistant has a deep understanding of chemical structures \nand extensive biological knowledge. It can generate both human-like text and numerical outputs, providing concise and \naccurate responses that are coherent with the topic and instructions given.\n\nMolAssistant is constantly learning and improving, which allows it to process and understand vast amounts of chemical \nand biological data, and it can comprehend any molecular structure. The model's knowledge enables it to provide valuable \ninsights and information for a wide range of tasks, including molecular search, QSAR model building, and molecular generation,\nmaking it an indispensable tool for advancing the science of molecular modeling and design in drug discovery.\n\n{history}\nHuman: {human_input}\nMolAssistant:"
MODEL_EMBEDDING_INSTRUCTIONS = {'instructor': 'Represent the following molecule for {context}:', 'openai': 'I want you to provide {dimension} dimensional numerical vector at a precision of {precision} as a representation of molecules in the SMILES format that you will receive as input.\n You should first start by understanding the chemical structure and electronic properties of the input molecules before generating the {dimension}-dimensional representation for the following task: {context}.\n To obtain the output, I will provide you with either a single SMILES command or a list of SMILES commands, and you will reply with the most accurate and informative {dimension}-dimensional representation in a json parseable format where the keys are the molecules and the values their representations.\n When generating the output, please ensure that the format is consistent with the task and the instruction given. Do not write explanations. Do not type anything else unless I instruct you to do so. \n In case of any invalid or unrecognized SMILES inputs, please provide a suitable error message. My first molecule is c1ccccc1.'}
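To inspect the exact prompt a model receives, you can render these templates yourself. The snippet below assumes `MODEL_EMBEDDING_INSTRUCTIONS` is importable from `molfeat_hype.trans.llm_instruct_embeddings`, the source file documented on this page:

```python
from molfeat_hype.trans.llm_instruct_embeddings import MODEL_EMBEDDING_INSTRUCTIONS

# The "instructor" template only needs a task context ...
instructor_prompt = MODEL_EMBEDDING_INSTRUCTIONS["instructor"].format(context="modelling")

# ... while the "openai" chat template also fixes the output dimension and precision.
openai_prompt = MODEL_EMBEDDING_INSTRUCTIONS["openai"].format(
    context="modelling", dimension=32, precision=5
)

print(instructor_prompt)  # Represent the following molecule for modelling:
```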
InstructLLMTransformer
Bases: PretrainedMolTransformer
Instruction-following Large Language Model Embeddings Transformer for molecules. This transformer embeds molecules using available LLMs through LangChain. Note that these LLMs have no molecular context: they were not trained on molecules or on any specific molecular task, only on a large corpus of text.
Source code in molfeat_hype/trans/llm_instruct_embeddings.py
Attributes:

- `SUPPORTED_EMBEDDINGS = ['hkunlp/instructor-large', 'hkunlp/instructor-base', 'openai/gpt-3.5-turbo', 'openai/gpt-4', 'openai/chatgpt']` (class attribute)
- `batch_size = batch_size` (instance attribute)
- `context = context or 'modelling'` (instance attribute)
- `conv_buffer_size = conv_buffer_size` (instance attribute)
- `conv_max_tokens = conv_max_tokens` (instance attribute)
- `embedding_size = embedding_size` (instance attribute)
- `kind = kind` (instance attribute)
- `model = None` (instance attribute)
- `precision = precision` (instance attribute)
- `standardize = standardize` (instance attribute)
- `system_prompt = system_prompt or DEFAULT_SYSTEM_PROMPT` (instance attribute)
__init__(kind: Union[str, LangChainEmbeddings], embedding_size=32, context='modelling', standardize=True, precompute_cache=True, n_jobs=0, conv_buffer_size=10, conv_max_tokens=None, dtype=float, openai_api_key=None, precision=5, batch_size=None, system_prompt=None, parallel_kwargs=None, **params)
Instantiate an instruction-following LLM transformer for molecular embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `kind` | `Union[str, LangChainEmbeddings]` | Type or name of the model to use for embeddings. | |
| `embedding_size` | `int` | Size of the embeddings to return for chat-like models. | `32` |
| `context` | `Optional[str]` | Context to give to the prompt for returning the results. The default, "modelling", is the context for the modelling instructions. | `'modelling'` |
| `standardize` | `bool` | If True, standardize SMILES before embedding. | `True` |
| `precompute_cache` | `bool` | If True, add a cache to avoid recomputing embeddings for the same molecules. | `True` |
| `n_jobs` | `int` | Number of jobs to use for preprocessing SMILES. | `0` |
| `conv_buffer_size` | `int` | Conversation buffer size, so the assistant can remember previous conversations and context when generating features. | `10` |
| `conv_max_tokens` | `Optional[int]` | Maximum number of tokens to use for the conversation context. If None, no token limit is applied. | `None` |
| `dtype` | | Data type to use for the returned embeddings. | `float` |
| `openai_api_key` | `Optional[str]` | OpenAI API key to use. If None, it will be read from the environment variable `OPENAI_API_KEY`. | `None` |
| `precision` | `int` | Float precision of the output vector. | `5` |
| `batch_size` | `Optional[int]` | Batch size to use for embedding molecules. If None, no batching is used. | `None` |
| `system_prompt` | `Optional[str]` | System prompt to use for chat-like models. If None, the default prompt is used. | `None` |
| `**params` | | Parameters to pass to the LLM embeddings. See the LangChain documentation. | `{}` |
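A sketch of both flavors, using only arguments from the table above (the callable usage at the end assumes the usual molfeat transformer convention):

```python
from molfeat_hype.trans.llm_instruct_embeddings import InstructLLMTransformer

# Instructor models condition the embedding on the task context directly.
instructor = InstructLLMTransformer(
    kind="hkunlp/instructor-large",
    context="modelling",
)

# Chat models parse numerical vectors out of the model's replies, so the
# output dimension and float precision must be fixed up front.
chat = InstructLLMTransformer(
    kind="openai/gpt-3.5-turbo",
    embedding_size=32,
    precision=5,
    conv_buffer_size=10,
)

features = instructor(["CCO", "c1ccccc1"])  # assumption: callable on a list of SMILES
```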
Source code in molfeat_hype/trans/llm_instruct_embeddings.py
__len__()
Get the length of the featurizer
Source code in molfeat_hype/trans/llm_instruct_embeddings.py
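A usage sketch, assuming (not confirmed by this page) that the reported length corresponds to the embedding dimension:

```python
from molfeat_hype.trans.llm_instruct_embeddings import InstructLLMTransformer

transformer = InstructLLMTransformer(kind="openai/gpt-3.5-turbo", embedding_size=32)
print(len(transformer))  # assumption: reports the embedding size, i.e. 32
```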