
LLM embeddings

Classical Embeddings

This section covers the classical embedding of a text object (here, a molecule in SMILES format).

LLMTransformer

Bases: PretrainedMolTransformer

Large Language Model Embeddings Transformer for molecules. This transformer embeds molecules using available Large Language Models (LLMs) through LangChain. Note that these LLMs have no molecular context: they were not trained on molecules or on any specific molecular task, only on a large corpus of text.

Caching computation

LLMs can be computationally expensive and even financially expensive if you use the OpenAI embeddings. To avoid recomputing the embeddings for the same molecules, we recommend using a molfeat Cache object. By default, an in-memory cache (DataCache) is used, but other caching systems can be explored.
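A minimal sketch of wiring in a cache, assuming molfeat exposes DataCache in molfeat.utils.cache and that precompute_cache accepts a cache instance (as it does in molfeat's PretrainedMolTransformer):

import datamol as dm
from molfeat.utils.cache import DataCache
from molfeat_hype.trans.llm_embeddings import LLMTransformer

cache = DataCache(name="llm_embeddings")  # in-memory cache keyed by molecule
transformer = LLMTransformer(
    kind="sentence-transformers/all-MiniLM-L6-v2",
    precompute_cache=cache,
)
feats = transformer(["CCO", "c1ccccc1", "CCO"])  # the repeated "CCO" is served from the cache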

Using OpenAI Embeddings

If you are using the OpenAI embeddings, you need to provide an openai_api_key argument or define one through the environment variable OPENAI_API_KEY. Please note that only the text-embedding-ada-002 model is supported. Refer to OpenAI's documentation for more information.
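A minimal sketch (the key shown is a placeholder; calls to the OpenAI API are billed per token):

import os
from molfeat_hype.trans.llm_embeddings import LLMTransformer

os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; or pass openai_api_key=... directly
transformer = LLMTransformer(kind="openai/text-embedding-ada-002")
embeddings = transformer(["CCO", "c1ccccc1"])  # text-embedding-ada-002 returns 1536-d vectors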

Using LLAMA Embeddings

The Llama embeddings are provided via the Python bindings of llama.cpp. We do not provide the quantized Llama weights; however, they are easy to find online. People have shared torrent/IPFS/direct download links to the Llama weights, which you can then quantize yourself.
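A minimal sketch, assuming you already have a quantized model file on disk (the path below is a placeholder):

from molfeat_hype.trans.llm_embeddings import LLMTransformer

transformer = LLMTransformer(
    kind="llama.cpp",
    quantized_model_path="/path/to/ggml-model-q4_0.bin",  # placeholder; or set QUANT_MODEL_PATH
)
embeddings = transformer(["CCO", "c1ccccc1"])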

Using Sentence Transformer Embeddings

The sentence transformer embeddings are based on the SentenceTransformers package.
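A minimal sketch; the weights are downloaded from the Hugging Face Hub on first use and run locally, so no API key is required:

from molfeat_hype.trans.llm_embeddings import LLMTransformer

transformer = LLMTransformer(kind="sentence-transformers/all-MiniLM-L6-v2")
embeddings = transformer(["CCO", "c1ccccc1"])  # all-MiniLM-L6-v2 produces 384-d vectors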

Source code in molfeat_hype/trans/llm_embeddings.py
class LLMTransformer(PretrainedMolTransformer):
    """
    Large Language Model Embeddings Transformer for molecules.
    This transformer embeds molecules using available Large Language Models (LLMs) through LangChain.
    Note that these LLMs have no molecular context: they were not trained on molecules
    or on any specific molecular task, only on a large corpus of text.


    !!! warning "Caching computation"
        LLMs can be computationally expensive and even financially expensive if you use the OpenAI embeddings.
        To avoid recomputing the embeddings for the same molecules, we recommend using a molfeat Cache object.
        By default, an in-memory cache (DataCache) is used, but other caching systems can be explored.


    ??? note "Using OpenAI Embeddings"
        If you are using the OpenAI embeddings, you need to provide an `openai_api_key` argument or define one through the environment variable `OPENAI_API_KEY`.
        Please note that only the `text-embedding-ada-002` model is supported.
        Refer to OpenAI's documentation for more information.

    ??? note "Using LLAMA Embeddings"
        The Llama embeddings are provided via the Python bindings of `llama.cpp`.
        We do not provide the quantized Llama weights; however, they are easy to find online.
        People have shared torrent/IPFS/direct download links to the Llama weights, which you can then quantize yourself.

    ??? note "Using Sentence Transformer Embeddings"
        The sentence transformer embeddings are based on the SentenceTransformers package.

    """

    SUPPORTED_EMBEDDINGS = [
        "openai/text-embedding-ada-002",
        "sentence-transformers/all-MiniLM-L6-v2",
        "sentence-transformers/all-mpnet-base-v2",
        "llama.cpp",
    ]

    def __init__(
        self,
        kind: Union[str, LangChainEmbeddings],
        standardize: bool = True,
        precompute_cache: bool = True,
        n_jobs: int = 0,
        dtype=float,
        openai_api_key: Optional[str] = None,
        quantized_model_path: Optional[str] = None,
        parallel_kwargs: Optional[dict] = None,
        **params,
    ):
        """Instantiate a LLM Embeddings transformer

        Args:
            kind: kind of LLM to use. Supported LLMs are accessible through the SUPPORTED_EMBEDDINGS attribute. Here are a few:
                - "openai/text-embedding-ada-002"
                - "sentence-transformers/all-MiniLM-L6-v2"
                - "sentence-transformers/all-mpnet-base-v2"
                - "llama.cpp"
                You can also provide any embedding model hosted on the Hugging Face Hub.
            standardize: if True, standardize smiles before embedding
            precompute_cache: if True, add a cache to cache the embeddings for the same molecules.
            n_jobs: number of jobs to use for preprocessing smiles.
            dtype: data type to use for the embeddings return type
            openai_api_key: openai api key to use. If None, will try to get it from the environment variable OPENAI_API_KEY
            quantized_model_path: path to the quantized model for llama.cpp. If None, will try to get it from the environment variable QUANT_MODEL_PATH
            parallel_kwargs: optional keyword arguments controlling the parallelization of the smiles preprocessing
            **params: parameters to pass to the LLM embeddings. See the langchain documentation
        """

        self.kind = kind
        self.model = None
        self.standardize = standardize
        if isinstance(kind, str):
            if not kind.startswith(tuple(self.SUPPORTED_EMBEDDINGS)):
                logger.warning(f"Model {kind} is not in the supported list, checking the Hugging Face Hub.")
                on_hgf = requests.get(f"https://huggingface.co/{kind}")
                try:
                    on_hgf.raise_for_status()
                except requests.HTTPError as err:
                    raise ValueError(
                        f"Unknown LLM type {kind} requested. Supported models are {self.SUPPORTED_EMBEDDINGS}"
                    ) from err
            if kind.startswith("openai/"):
                if openai_api_key is None:
                    openai_api_key = os.environ.get("OPENAI_API_KEY")
                self.model = OpenAIEmbeddings(
                    model=kind.replace("openai/", ""), openai_api_key=openai_api_key, **params
                )
            elif kind.startswith("llama.cpp"):
                if quantized_model_path is None:
                    quantized_model_path = os.environ.get("QUANT_MODEL_PATH")
                if quantized_model_path is not None and dm.fs.exists(quantized_model_path):
                    create_symlink(quantized_model_path, kind)
                else:
                    model_base_name = os.path.splitext(os.path.basename(quantized_model_path))[0]
                    quantized_model_path = dm.fs.glob(dm.fs.join(CACHE_DIR, f"{model_base_name}*"))
                    if len(quantized_model_path) == 0:
                        raise ValueError(
                            f"Could not find the quantized model {model_base_name} anywhere, including in the cache dir {CACHE_DIR}"
                        )
                    quantized_model_path = quantized_model_path[0]
                # silence llama.cpp's verbose console output while the model loads
                with contextlib.redirect_stdout(None):
                    with contextlib.redirect_stderr(None):
                        # enforce a context window of at least 1024 tokens
                        n_ctx = max(params.get("n_ctx", 1024), 1024)
                        params["n_ctx"] = n_ctx
                        self.model = LlamaCppEmbeddings(model_path=quantized_model_path, **params)
                        self.model.client.verbose = False
            else:
                self.model = HuggingFaceEmbeddings(
                    model_name=kind, model_kwargs=params, cache_folder=CACHE_DIR
                )
        super().__init__(
            precompute_cache=precompute_cache,
            n_jobs=n_jobs,
            dtype=dtype,
            device="cpu",
            parallel_kwargs=parallel_kwargs,
            **params,
        )

    def _convert(self, inputs: List[Union[str, dm.Mol]], **kwargs):
        """Convert the list of input molecules into the proper format for embeddings

        Args:
            inputs: list of input molecules
            **kwargs: additional keyword arguments for API consistency

        """
        self._preload()
        parallel_kwargs = copy.deepcopy(getattr(self, "parallel_kwargs", {}))
        parallel_kwargs["n_jobs"] = self.n_jobs
        return convert_smiles(inputs, parallel_kwargs, standardize=self.standardize)

    def _embed(self, smiles: List[str], **kwargs):
        """This function takes a list of smiles or molecules and return the featurization
        corresponding to the inputs.
        In `transform` and `_transform`, this function is called after calling `_convert`

        Args:
            smiles: input smiles
            **kwargs: additional keyword arguments for API consistency
        """
        return self.model.embed_documents(smiles)

SUPPORTED_EMBEDDINGS = ['openai/text-embedding-ada-002', 'sentence-transformers/all-MiniLM-L6-v2', 'sentence-transformers/all-mpnet-base-v2', 'llama.cpp'] class-attribute instance-attribute

kind = kind instance-attribute

model = LlamaCppEmbeddings(model_path=quantized_model_path, **params) instance-attribute

standardize = standardize instance-attribute

__init__(kind, standardize=True, precompute_cache=True, n_jobs=0, dtype=float, openai_api_key=None, quantized_model_path=None, parallel_kwargs=None, **params)

Instantiate an LLM Embeddings transformer

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| kind | Union[str, LangChainEmbeddings] | kind of LLM to use. Supported LLMs are listed in the SUPPORTED_EMBEDDINGS attribute: "openai/text-embedding-ada-002", "sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "llama.cpp". You can also provide any embedding model hosted on the Hugging Face Hub. | required |
| standardize | bool | if True, standardize smiles before embedding | True |
| precompute_cache | bool | if True, add a cache to avoid recomputing embeddings for the same molecules | True |
| n_jobs | int | number of jobs to use for preprocessing smiles | 0 |
| dtype | | data type of the returned embeddings | float |
| openai_api_key | Optional[str] | OpenAI API key. If None, read from the environment variable OPENAI_API_KEY | None |
| quantized_model_path | Optional[str] | path to the quantized model for llama.cpp. If None, read from the environment variable QUANT_MODEL_PATH | None |
| parallel_kwargs | Optional[dict] | optional keyword arguments controlling the parallelization of the preprocessing | None |
| **params | | parameters to pass to the LLM embeddings. See the langchain documentation | {} |

Instruct-aware Embeddings

This section covers models that accept instructions to compute the embedding of an input molecule.

DEFAULT_SYSTEM_PROMPT = "MolAssistant is a powerful large language model trained by the Open Drug Discovery consortium, \nspecifically designed to assist with various tasks related to drug discovery. Its capabilities range from providing \nmolecular representation to designing new molecules with specific optimization objectives in mind. Being pre-trained \non the largest corpus of chemical and biological data, MolAssistant has a deep understanding of chemical structures \nand extensive biological knowledge. It can generate both human-like text and numerical outputs, providing concise and \naccurate responses that are coherent with the topic and instructions given.\n\nMolAssistant is constantly learning and improving, which allows it to process and understand vast amounts of chemical \nand biological data, and it can comprehend any molecular structure. The model's knowledge enables it to provide valuable \ninsights and information for a wide range of tasks, including molecular search, QSAR model building, and molecular generation,\nmaking it an indispensable tool for advancing the science of molecular modeling and design in drug discovery.\n\n{history}\nHuman: {human_input}\nMolAssistant:" module-attribute

MODEL_EMBEDDING_INSTRUCTIONS = {'instructor': 'Represent the following molecule for {context}:', 'openai': 'I want you to provide {dimension} dimensional numerical vector at a precision of {precision} as a representation of molecules in the SMILES format that you will receive as input.\n You should first start by understanding the chemical structure and electronic properties of the input molecules before generating the {dimension}-dimensional representation for the following task: {context}.\n To obtain the output, I will provide you with either a single SMILES command or a list of SMILES commands, and you will reply with the most accurate and informative {dimension}-dimensional representation in a json parseable format where the keys are the molecules and the values their representations.\n When generating the output, please ensure that the format is consistent with the task and the instruction given. Do not write explanations. Do not type anything else unless I instruct you to do so. \n In case of any invalid or unrecognized SMILES inputs, please provide a suitable error message. My first molecule is c1ccccc1.'} module-attribute
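For reference, the transformer fills these templates with plain str.format, mirroring the calls made in InstructLLMTransformer.__init__ (the values below are the defaults):

from molfeat_hype.trans.llm_instruct_embeddings import MODEL_EMBEDDING_INSTRUCTIONS

# instruction prepended to each molecule for the instructor models
instructor_prompt = MODEL_EMBEDDING_INSTRUCTIONS["instructor"].format(context="modelling")
# -> "Represent the following molecule for modelling:"

# first message sent to the chat models, asking for JSON-formatted vectors
openai_prompt = MODEL_EMBEDDING_INSTRUCTIONS["openai"].format(
    dimension=32, precision=5, context="modelling"
)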

InstructLLMTransformer

Bases: PretrainedMolTransformer

Instruction-following Large Language Model Embeddings Transformer for molecules. This transformer embeds molecules using available LLMs through LangChain. Note that these LLMs have no molecular context: they were not trained on molecules or on any specific molecular task, only on a large corpus of text.
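A minimal sketch of how this transformer can be instantiated; the chat-based kinds require an OpenAI key, the "hkunlp/instructor-*" kinds run locally, and the parameter values below are illustrative:

from molfeat_hype.trans.llm_instruct_embeddings import InstructLLMTransformer

transformer = InstructLLMTransformer(
    kind="openai/chatgpt",   # alias for "openai/gpt-3.5-turbo"
    embedding_size=32,       # dimension the assistant is asked to return
    context="modelling",     # task context injected into the instructions
)
embeddings = transformer(["CCO", "c1ccccc1"])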

Source code in molfeat_hype/trans/llm_instruct_embeddings.py
class InstructLLMTransformer(PretrainedMolTransformer):
    """
    Instruction-following Large Language Model Embeddings Transformer for molecules. This transformer embeds molecules using available LLMs through LangChain.
    Note that these LLMs have no molecular context: they were not trained on molecules or on any specific molecular task.
    They are just trained on a large corpus of text.
    """

    SUPPORTED_EMBEDDINGS = [
        "hkunlp/instructor-large",
        "hkunlp/instructor-base",
        "openai/gpt-3.5-turbo",
        "openai/gpt-4",
        "openai/chatgpt",  # alias for "openai/gpt-3.5-turbo"
    ]

    def __init__(
        self,
        kind: Union[str, LangChainEmbeddings],
        embedding_size: int = 32,
        context: Optional[str] = "modelling",
        standardize: bool = True,
        precompute_cache: bool = True,
        n_jobs: int = 0,
        conv_buffer_size: int = 10,
        conv_max_tokens: Optional[int] = None,
        dtype=float,
        openai_api_key: Optional[str] = None,
        precision: int = 5,
        batch_size: Optional[int] = None,
        system_prompt: Optional[str] = None,
        parallel_kwargs: Optional[dict] = None,
        **params,
    ):
        """Instantiate an instruction following LLM transformer for molecular embeddings

        Args:
            kind: type or name of the model to use for embeddings
            embedding_size: size of the embeddings to return for chat-like models
            context: context to give to the prompt for returning the results. Default is "modelling" which is the context for the modelling instructions.
            standardize: if True, standardize smiles before embedding
            precompute_cache: if True, add a cache to cache the embeddings for the same molecules.
            n_jobs: number of jobs to use for preprocessing smiles.
            conv_buffer_size: conversation buffer size, so the assistant can remember previous conversations and context when generating features.
            conv_max_tokens: maximum number of tokens to use for the conversation context. If None, no token limit is applied.
            dtype: data type to use for the embeddings return type
            openai_api_key: openai api key to use. If None, will try to get it from the environment variable OPENAI_API_KEY
            precision: float precision of the output vector
            batch_size: batch size to use for embedding molecules. If None, the batch size is derived from the expected token count.
            system_prompt: system prompt to use for chat-like models. If None, will use the default prompt.
            parallel_kwargs: optional keyword arguments controlling the parallelization of the preprocessing
            **params: parameters to pass to the LLM embeddings. See the langchain documentation
        """

        self.kind = kind
        self.model = None
        self.standardize = standardize
        self.context = context or "modelling"
        self.embedding_size = embedding_size
        self.precision = precision
        self.batch_size = batch_size
        self.conv_max_tokens = conv_max_tokens
        self.conv_buffer_size = conv_buffer_size
        self._length = None
        self.system_prompt = system_prompt or DEFAULT_SYSTEM_PROMPT
        params.setdefault("temperature", 0.8)
        if isinstance(kind, str):
            if not (
                kind.startswith(tuple(self.SUPPORTED_EMBEDDINGS)) or kind.startswith("openai/")
            ):
                raise ValueError(
                    f"Unknown LLM type {kind} requested. Supported models are {self.SUPPORTED_EMBEDDINGS}"
                )
            if kind.startswith("openai/"):
                if kind == "openai/chatgpt":
                    kind = "openai/gpt-3.5-turbo"
                if openai_api_key is None:
                    openai_api_key = os.environ.get("OPENAI_API_KEY")
                prompt = PromptTemplate(
                    input_variables=["history", "human_input"], template=self.system_prompt
                )
                llm = ChatOpenAI(
                    model_name=kind.replace("openai/", ""),
                    openai_api_key=openai_api_key,
                    **params,
                )
                self.model = LLMChain(
                    llm=llm,
                    prompt=prompt,
                    verbose=False,
                    memory=EmbeddingConversationMemory(
                        k=self.conv_buffer_size,
                        ai_prefix="MolAssistant",
                        llm=llm,
                        max_token_limit=self.conv_max_tokens,
                    ),
                )
                output = self.model.predict(
                    human_input=MODEL_EMBEDDING_INSTRUCTIONS["openai"].format(
                        dimension=embedding_size, precision=self.precision, context=self.context
                    )
                )
                embeddings = json.loads(output)
                if "c1ccccc1" not in embeddings:
                    raise ValueError(
                        "Model is not able to understand the prompt. Please select a different model"
                    )
                assert (
                    len(embeddings["c1ccccc1"]) == embedding_size
                ), "Model cannot return the correct embedding size."
                self._length = embedding_size

            elif kind.startswith("llama.cpp") or kind.startswith("gpt4all"):
                raise ValueError(f"{kind} is not yet supported, because or how slow they are.")
            else:
            # embedding models do not accept a temperature; remove the key
                params.pop("temperature", None)
                self.model = HuggingFaceInstructEmbeddings(
                    model_name=kind,
                    embed_instruction=MODEL_EMBEDDING_INSTRUCTIONS["instructor"].format(
                        context=self.context
                    ),
                    model_kwargs=params,
                    cache_folder=CACHE_DIR,
                )
        super().__init__(
            precompute_cache=precompute_cache,
            n_jobs=n_jobs,
            dtype=dtype,
            device="cpu",
            parallel_kwargs=parallel_kwargs,
            **params,
        )

    def __len__(self):
        """Get the length of the featurizer"""
        if self._length is not None:
            return self._length
        return super().__len__()

    def _convert(self, inputs: List[Union[str, dm.Mol]], **kwargs):
        """Convert the list of input molecules into the proper format for embeddings"""
        self._preload()
        parallel_kwargs = copy.deepcopy(getattr(self, "parallel_kwargs", {}))
        parallel_kwargs["n_jobs"] = self.n_jobs
        return convert_smiles(inputs, parallel_kwargs, standardize=self.standardize)

    def _embed(self, smiles: List[str], **kwargs):
        """_embed takes a list of smiles or molecules and return the featurization
        corresponding to the inputs.  In `transform` and `_transform`, this function is
        called after calling `_convert`

        Args:
            smiles: input smiles
        """
        if isinstance(self.model, LangChainEmbeddings):
            return self.model.embed_documents(smiles)
        # chat-based embedding path: estimate the request's total token count
        # from the expected number of characters in the inputs and outputs
        expected_tokens = (self.embedding_size * (self.precision + 5) + 4) * len(smiles) + sum(
            len(x) + 5 for x in smiles
        )
        # we split the smiles into batches to stay under the maximum token count
        if not self.batch_size:
            maximum_tokens = self.model.llm.max_tokens or self.model.memory.max_token_limit or 2000
            n_splits = max(1, int(np.ceil(expected_tokens / maximum_tokens)))
        else:
            n_splits = max(1, int(np.ceil(len(smiles) / self.batch_size)))
        data = {}
        for batch in tqdm(np.array_split(smiles, n_splits), desc="Batch embedding", leave=False):
            json_output = self.model.predict(human_input=" ,".join(batch))
            batch_data = json.loads(json_output)
            # EN: surprisingly, ChatGPT can return a randomized version of a SMILES
            data.update({dm.unique_id(k.strip()): v for k, v in batch_data.items()})
        missed_embedding = np.full_like(list(data.values())[0], np.nan)
        data = [data.get(dm.unique_id(sm), missed_embedding) for sm in smiles]
        return data
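For intuition, the character-based token estimate used in _embed above can be reproduced by hand with illustrative values:

embedding_size, precision = 32, 5
smiles = ["CCO", "c1ccccc1"]

# each output vector costs about embedding_size * (precision + 5) characters (+4 overhead),
# and each input SMILES costs its length (+5 overhead)
expected_tokens = (embedding_size * (precision + 5) + 4) * len(smiles) + sum(
    len(x) + 5 for x in smiles
)
print(expected_tokens)  # 669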

SUPPORTED_EMBEDDINGS = ['hkunlp/instructor-large', 'hkunlp/instructor-base', 'openai/gpt-3.5-turbo', 'openai/gpt-4', 'openai/chatgpt'] class-attribute instance-attribute

batch_size = batch_size instance-attribute

context = context or 'modelling' instance-attribute

conv_buffer_size = conv_buffer_size instance-attribute

conv_max_tokens = conv_max_tokens instance-attribute

embedding_size = embedding_size instance-attribute

kind = kind instance-attribute

model = None instance-attribute

precision = precision instance-attribute

standardize = standardize instance-attribute

system_prompt = system_prompt or DEFAULT_SYSTEM_PROMPT instance-attribute

__init__(kind, embedding_size=32, context='modelling', standardize=True, precompute_cache=True, n_jobs=0, conv_buffer_size=10, conv_max_tokens=None, dtype=float, openai_api_key=None, precision=5, batch_size=None, system_prompt=None, parallel_kwargs=None, **params)

Instantiate an instruction-following LLM transformer for molecular embeddings

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| kind | Union[str, LangChainEmbeddings] | type or name of the model to use for embeddings | required |
| embedding_size | int | size of the embeddings to return for chat-like models | 32 |
| context | Optional[str] | context to give to the prompt for returning the results. Default is "modelling", the context for the modelling instructions | 'modelling' |
| standardize | bool | if True, standardize smiles before embedding | True |
| precompute_cache | bool | if True, add a cache to avoid recomputing embeddings for the same molecules | True |
| n_jobs | int | number of jobs to use for preprocessing smiles | 0 |
| conv_buffer_size | int | conversation buffer size, so the assistant can remember previous conversations and context when generating features | 10 |
| conv_max_tokens | Optional[int] | maximum number of tokens to use for the conversation context. If None, no token limit is applied | None |
| dtype | | data type of the returned embeddings | float |
| openai_api_key | Optional[str] | OpenAI API key. If None, read from the environment variable OPENAI_API_KEY | None |
| precision | int | float precision of the output vector | 5 |
| batch_size | Optional[int] | batch size to use for embedding molecules. If None, the batch size is derived from the expected token count | None |
| system_prompt | Optional[str] | system prompt to use for chat-like models. If None, the default prompt is used | None |
| parallel_kwargs | Optional[dict] | optional keyword arguments controlling the parallelization of the preprocessing | None |
| **params | | parameters to pass to the LLM embeddings. See the langchain documentation | {} |

__len__()

Get the length of the featurizer
