
LLM embeddings

Classical Embeddings

This section covers the classical embedding of a text object (here, a molecule in SMILES format).

LLMTransformer

Bases: PretrainedMolTransformer

Large Language Model Embeddings Transformer for molecules. This transformer embeds molecules using available Large Language Models (LLMs) through LangChain. Note that these LLMs have no molecular context: they were not trained on molecules or on any specific molecular task, only on a large corpus of text.

Caching computation

LLMs can be computationally expensive and even financially expensive if you use the OpenAI embeddings. To avoid recomputing the embeddings for the same molecules, we recommend using a molfeat Cache object. By default, an in-memory cache (DataCache) is used, but other caching systems can be explored.
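A minimal sketch of wiring in a cache, assuming molfeat exposes DataCache in molfeat.utils.cache and that precompute_cache accepts a cache instance (as it does in molfeat's PretrainedMolTransformer):

import datamol as dm
from molfeat.utils.cache import DataCache
from molfeat_hype.trans.llm_embeddings import LLMTransformer

cache = DataCache(name="llm_embeddings")  # in-memory cache keyed by molecule
transformer = LLMTransformer(
    kind="sentence-transformers/all-MiniLM-L6-v2",
    precompute_cache=cache,
)
feats = transformer(["CCO", "c1ccccc1", "CCO"])  # the repeated "CCO" is served from the cache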

Using OpenAI Embeddings

If you are using the OpenAI embeddings, you need to provide an openai_api_key argument or define one through the environment variable OPENAI_API_KEY. Please note that only the text-embedding-ada-002 model is supported. Refer to OpenAI's documentation for more information.
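A minimal sketch (the key shown is a placeholder; calls to the OpenAI API are billed per token):

import os
from molfeat_hype.trans.llm_embeddings import LLMTransformer

os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; or pass openai_api_key=... directly
transformer = LLMTransformer(kind="openai/text-embedding-ada-002")
embeddings = transformer(["CCO", "c1ccccc1"])  # text-embedding-ada-002 returns 1536-d vectors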

Using LLAMA Embeddings

The Llama embeddings are provided via the Python bindings of llama.cpp. We do not provide the quantized Llama weights; however, they are easy to find online. People have shared torrent/IPFS/direct download links to the Llama weights, which you can then quantize yourself.
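A minimal sketch, assuming you already have a quantized model file on disk (the path below is a placeholder):

from molfeat_hype.trans.llm_embeddings import LLMTransformer

transformer = LLMTransformer(
    kind="llama.cpp",
    quantized_model_path="/path/to/ggml-model-q4_0.bin",  # placeholder; or set QUANT_MODEL_PATH
)
embeddings = transformer(["CCO", "c1ccccc1"])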

Using Sentence Transformer Embeddings

The sentence transformer embeddings are based on the SentenceTransformers package.
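A minimal sketch; the weights are downloaded from the Hugging Face Hub on first use and run locally, so no API key is required:

from molfeat_hype.trans.llm_embeddings import LLMTransformer

transformer = LLMTransformer(kind="sentence-transformers/all-MiniLM-L6-v2")
embeddings = transformer(["CCO", "c1ccccc1"])  # all-MiniLM-L6-v2 produces 384-d vectors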

Source code in molfeat_hype/trans/llm_embeddings.py
class LLMTransformer(PretrainedMolTransformer):
    """
    Large Language Model Embeddings Transformer for molecules.
    This transformer embeds molecules using available Large Language Models (LLMs) through LangChain.
    Note that these LLMs have no molecular context: they were not trained on molecules
    or on any specific molecular task, only on a large corpus of text.


    !!! warning "Caching computation"
        LLMs can be computationally expensive and even financially expensive if you use the OpenAI embeddings.
        To avoid recomputing the embeddings for the same molecules, we recommend using a molfeat Cache object.
        By default, an in-memory cache (DataCache) is used, but other caching systems can be explored.


    ??? note "Using OpenAI Embeddings"
        If you are using the OpenAI embeddings, you need to provide an `openai_api_key` argument or define one through the environment variable `OPENAI_API_KEY`.
        Please note that only the `text-embedding-ada-002` model is supported.
        Refer to OpenAI's documentation for more information.

    ??? note "Using LLAMA Embeddings"
        The Llama embeddings are provided via the Python bindings of `llama.cpp`.
        We do not provide the quantized Llama weights; however, they are easy to find online.
        People have shared torrent/IPFS/direct download links to the Llama weights, which you can then quantize yourself.

    ??? note "Using Sentence Transformer Embeddings"
        The sentence transformer embeddings are based on the SentenceTransformers package.

    """

    SUPPORTED_EMBEDDINGS = [
        "openai/text-embedding-ada-002",
        "sentence-transformers/all-MiniLM-L6-v2",
        "sentence-transformers/all-mpnet-base-v2",
        "llama.cpp",
    ]

    def __init__(
        self,
        kind: Union[str, LangChainEmbeddings],
        standardize: bool = True,
        precompute_cache: bool = True,
        n_jobs: int = 0,
        dtype=float,
        openai_api_key: Optional[str] = None,
        quantized_model_path: Optional[str] = None,
        parallel_kwargs: Optional[dict] = None,
        **params,
    ):
        """Instantiate a LLM Embeddings transformer

        Args:
            kind: kind of LLM to use. Supported LLMs are accessible through the SUPPORTED_EMBEDDINGS attribute. Here are a few:
                - "openai/text-embedding-ada-002"
                - "sentence-transformers/all-MiniLM-L6-v2"
                - "sentence-transformers/all-mpnet-base-v2"
                - "llama.cpp"
                You can also provide any embedding model hosted on the Hugging Face Hub.
            standardize: if True, standardize smiles before embedding
            precompute_cache: if True, add a cache to cache the embeddings for the same molecules.
            n_jobs: number of jobs to use for preprocessing smiles.
            dtype: data type to use for the embeddings return type
            openai_api_key: openai api key to use. If None, will try to get it from the environment variable OPENAI_API_KEY
            quantized_model_path: path to the quantized model for llama.cpp. If None, will try to get it from the environment variable QUANT_MODEL_PATH
            parallel_kwargs: optional keyword arguments controlling the parallelization of the smiles preprocessing
            **params: parameters to pass to the LLM embeddings. See the langchain documentation
        """

        self.kind = kind
        self.model = None
        self.standardize = standardize
        if isinstance(kind, str):
            if not kind.startswith(tuple(self.SUPPORTED_EMBEDDINGS)):
                logger.warning(f"Model {kind} is not in the supported list, checking the Hugging Face Hub.")
                on_hgf = requests.get(f"https://huggingface.co/{kind}")
                try:
                    on_hgf.raise_for_status()
                except requests.HTTPError as err:
                    raise ValueError(
                        f"Unknown LLM type {kind} requested. Supported models are {self.SUPPORTED_EMBEDDINGS}"
                    ) from err
            if kind.startswith("openai/"):
                if openai_api_key is None:
                    openai_api_key = os.environ.get("OPENAI_API_KEY")
                self.model = OpenAIEmbeddings(
                    model=kind.replace("openai/", ""), openai_api_key=openai_api_key, **params
                )
            elif kind.startswith("llama.cpp"):
                if quantized_model_path is None:
                    quantized_model_path = os.environ.get("QUANT_MODEL_PATH")
                if quantized_model_path is not None and dm.fs.exists(quantized_model_path):
                    create_symlink(quantized_model_path, kind)
                else:
                    model_base_name = os.path.splitext(os.path.basename(quantized_model_path))[0]
                    quantized_model_path = dm.fs.glob(dm.fs.join(CACHE_DIR, f"{model_base_name}*"))
                    if len(quantized_model_path) == 0:
                        raise ValueError(
                            f"Could not find the quantized model {model_base_name} anywhere, including in the cache dir {CACHE_DIR}"
                        )
                    quantized_model_path = quantized_model_path[0]
                # silence llama.cpp's verbose console output while the model loads
                with contextlib.redirect_stdout(None):
                    with contextlib.redirect_stderr(None):
                        # enforce a context window of at least 1024 tokens
                        n_ctx = max(params.get("n_ctx", 1024), 1024)
                        params["n_ctx"] = n_ctx
                        self.model = LlamaCppEmbeddings(model_path=quantized_model_path, **params)
                        self.model.client.verbose = False
            else:
                self.model = HuggingFaceEmbeddings(
                    model_name=kind, model_kwargs=params, cache_folder=CACHE_DIR
                )
        super().__init__(
            precompute_cache=precompute_cache,
            n_jobs=n_jobs,
            dtype=dtype,
            device="cpu",
            parallel_kwargs=parallel_kwargs,
            **params,
        )

    def _convert(self, inputs: List[Union[str, dm.Mol]], **kwargs):
        """Convert the list of input molecules into the proper format for embeddings

        Args:
            inputs: list of input molecules
            **kwargs: additional keyword arguments for API consistency

        """
        self._preload()
        parallel_kwargs = copy.deepcopy(getattr(self, "parallel_kwargs", {}))
        parallel_kwargs["n_jobs"] = self.n_jobs
        return convert_smiles(inputs, parallel_kwargs, standardize=self.standardize)

    def _embed(self, smiles: List[str], **kwargs):
        """This function takes a list of smiles or molecules and return the featurization
        corresponding to the inputs.
        In `transform` and `_transform`, this function is called after calling `_convert`

        Args:
            smiles: input smiles
            **kwargs: additional keyword arguments for API consistency
        """
        return self.model.embed_documents(smiles)

SUPPORTED_EMBEDDINGS = ['openai/text-embedding-ada-002', 'sentence-transformers/all-MiniLM-L6-v2', 'sentence-transformers/all-mpnet-base-v2', 'llama.cpp'] class-attribute instance-attribute

kind = kind instance-attribute

model = LlamaCppEmbeddings(model_path=quantized_model_path, **params) instance-attribute

standardize = standardize instance-attribute

__init__(kind, standardize=True, precompute_cache=True, n_jobs=0, dtype=float, openai_api_key=None, quantized_model_path=None, parallel_kwargs=None, **params)

Instantiate an LLM Embeddings transformer

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| kind | Union[str, LangChainEmbeddings] | kind of LLM to use. Supported LLMs are listed in the SUPPORTED_EMBEDDINGS attribute: "openai/text-embedding-ada-002", "sentence-transformers/all-MiniLM-L6-v2", "sentence-transformers/all-mpnet-base-v2", "llama.cpp". You can also provide any embedding model hosted on the Hugging Face Hub. | required |
| standardize | bool | if True, standardize smiles before embedding | True |
| precompute_cache | bool | if True, add a cache to avoid recomputing embeddings for the same molecules | True |
| n_jobs | int | number of jobs to use for preprocessing smiles | 0 |
| dtype | | data type of the returned embeddings | float |
| openai_api_key | Optional[str] | OpenAI API key. If None, read from the environment variable OPENAI_API_KEY | None |
| quantized_model_path | Optional[str] | path to the quantized model for llama.cpp. If None, read from the environment variable QUANT_MODEL_PATH | None |
| parallel_kwargs | Optional[dict] | optional keyword arguments controlling the parallelization of the preprocessing | None |
| **params | | parameters to pass to the LLM embeddings. See the langchain documentation | {} |

Instruct-aware Embeddings

This section covers models that accept instructions to compute the embedding of an input molecule.

DEFAULT_SYSTEM_PROMPT = "MolAssistant is a powerful large language model trained by the Open Drug Discovery consortium, \nspecifically designed to assist with various tasks related to drug discovery. Its capabilities range from providing \nmolecular representation to designing new molecules with specific optimization objectives in mind. Being pre-trained \non the largest corpus of chemical and biological data, MolAssistant has a deep understanding of chemical structures \nand extensive biological knowledge. It can generate both human-like text and numerical outputs, providing concise and \naccurate responses that are coherent with the topic and instructions given.\n\nMolAssistant is constantly learning and improving, which allows it to process and understand vast amounts of chemical \nand biological data, and it can comprehend any molecular structure. The model's knowledge enables it to provide valuable \ninsights and information for a wide range of tasks, including molecular search, QSAR model building, and molecular generation,\nmaking it an indispensable tool for advancing the science of molecular modeling and design in drug discovery.\n\n{history}\nHuman: {human_input}\nMolAssistant:" module-attribute

MODEL_EMBEDDING_INSTRUCTIONS = {'instructor': 'Represent the following molecule for {context}:', 'openai': 'I want you to provide {dimension} dimensional numerical vector at a precision of {precision} as a representation of molecules in the SMILES format that you will receive as input.\n You should first start by understanding the chemical structure and electronic properties of the input molecules before generating the {dimension}-dimensional representation for the following task: {context}.\n To obtain the output, I will provide you with either a single SMILES command or a list of SMILES commands, and you will reply with the most accurate and informative {dimension}-dimensional representation in a json parseable format where the keys are the molecules and the values their representations.\n When generating the output, please ensure that the format is consistent with the task and the instruction given. Do not write explanations. Do not type anything else unless I instruct you to do so. \n In case of any invalid or unrecognized SMILES inputs, please provide a suitable error message. My first molecule is c1ccccc1.'} module-attribute
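For reference, the transformer fills these templates with plain str.format, mirroring the calls made in InstructLLMTransformer.__init__ (the values below are the defaults):

from molfeat_hype.trans.llm_instruct_embeddings import MODEL_EMBEDDING_INSTRUCTIONS

# instruction prepended to each molecule for the instructor models
instructor_prompt = MODEL_EMBEDDING_INSTRUCTIONS["instructor"].format(context="modelling")
# -> "Represent the following molecule for modelling:"

# first message sent to the chat models, asking for JSON-formatted vectors
openai_prompt = MODEL_EMBEDDING_INSTRUCTIONS["openai"].format(
    dimension=32, precision=5, context="modelling"
)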

InstructLLMTransformer

Bases: PretrainedMolTransformer

Instruction-following Large Language Model Embeddings Transformer for molecules. This transformer embeds molecules using available LLMs through LangChain. Note that these LLMs have no molecular context: they were not trained on molecules or on any specific molecular task, only on a large corpus of text.
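A minimal sketch of how this transformer can be instantiated; the chat-based kinds require an OpenAI key, the "hkunlp/instructor-*" kinds run locally, and the parameter values below are illustrative:

from molfeat_hype.trans.llm_instruct_embeddings import InstructLLMTransformer

transformer = InstructLLMTransformer(
    kind="openai/chatgpt",   # alias for "openai/gpt-3.5-turbo"
    embedding_size=32,       # dimension the assistant is asked to return
    context="modelling",     # task context injected into the instructions
)
embeddings = transformer(["CCO", "c1ccccc1"])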

Source code in molfeat_hype/trans/llm_instruct_embeddings.py
class InstructLLMTransformer(PretrainedMolTransformer):
    """
    Instruction-following Large Language Model Embeddings Transformer for molecules. This transformer embeds molecules using available LLMs through LangChain.
    Note that these LLMs have no molecular context: they were not trained on molecules or on any specific molecular task.
    They are just trained on a large corpus of text.
    """

    SUPPORTED_EMBEDDINGS = [
        "hkunlp/instructor-large",
        "hkunlp/instructor-base",
        "openai/gpt-3.5-turbo",
        "openai/gpt-4",
        "openai/chatgpt",  # alias for "openai/gpt-3.5-turbo"
    ]

    def __init__(
        self,
        kind: Union[str, LangChainEmbeddings],
        embedding_size: int = 32,
        context: Optional[str] = "modelling",
        standardize: bool = True,
        precompute_cache: bool = True,
        n_jobs: int = 0,
        conv_buffer_size: int = 10,
        conv_max_tokens: Optional[int] = None,
        dtype=float,
        openai_api_key: Optional[str] = None,
        precision: int = 5,
        batch_size: Optional[int] = None,
        system_prompt: Optional[str] = None,
        parallel_kwargs: Optional[dict] = None,
        **params,
    ):
        """Instantiate an instruction following LLM transformer for molecular embeddings

        Args:
            kind: type or name of the model to use for embeddings
            embedding_size: size of the embeddings to return for chat-like models
            context: context to give to the prompt for returning the results. Default is "modelling" which is the context for the modelling instructions.
            standardize: if True, standardize smiles before embedding
            precompute_cache: if True, add a cache to cache the embeddings for the same molecules.
            n_jobs: number of jobs to use for preprocessing smiles.
            conv_buffer_size: conversation buffer size, so the assistant can remember previous conversations and context when generating features.
            conv_max_tokens: maximum number of tokens to use for the conversation context. If None, no token limit is applied.
            dtype: data type to use for the embeddings return type
            openai_api_key: openai api key to use. If None, will try to get it from the environment variable OPENAI_API_KEY
            precision: float precision of the output vector
            batch_size: batch size to use for embedding molecules. If None, the batch size is derived from the expected token count.
            system_prompt: system prompt to use for chat-like models. If None, will use the default prompt.
            parallel_kwargs: optional keyword arguments controlling the parallelization of the preprocessing
            **params: parameters to pass to the LLM embeddings. See the langchain documentation
        """

        self.kind = kind
        self.model = None
        self.standardize = standardize
        self.context = context or "modelling"
        self.embedding_size = embedding_size
        self.precision = precision
        self.batch_size = batch_size
        self.conv_max_tokens = conv_max_tokens
        self.conv_buffer_size = conv_buffer_size
        self._length = None
        self.system_prompt = system_prompt or DEFAULT_SYSTEM_PROMPT
        params.setdefault("temperature", 0.8)
        if isinstance(kind, str):
            if not (
                kind.startswith(tuple(self.SUPPORTED_EMBEDDINGS)) or kind.startswith("openai/")
            ):
                raise ValueError(
                    f"Unknown LLM type {kind} requested. Supported models are {self.SUPPORTED_EMBEDDINGS}"
                )
            if kind.startswith("openai/"):
                if kind == "openai/chatgpt":
                    kind = "openai/gpt-3.5-turbo"
                if openai_api_key is None:
                    openai_api_key = os.environ.get("OPENAI_API_KEY")
                prompt = PromptTemplate(
                    input_variables=["history", "human_input"], template=self.system_prompt
                )
                llm = ChatOpenAI(
                    model_name=kind.replace("openai/", ""),
                    openai_api_key=openai_api_key,
                    **params,
                )
                self.model = LLMChain(
                    llm=llm,
                    prompt=prompt,
                    verbose=False,
                    memory=EmbeddingConversationMemory(
                        k=self.conv_buffer_size,
                        ai_prefix="MolAssistant",
                        llm=llm,
                        max_token_limit=self.conv_max_tokens,
                    ),
                )
                output = self.model.predict(
                    human_input=MODEL_EMBEDDING_INSTRUCTIONS["openai"].format(
                        dimension=embedding_size, precision=self.precision, context=self.context
                    )
                )
                embeddings = json.loads(output)
                if "c1ccccc1" not in embeddings:
                    raise ValueError(
                        "Model is not able to understand the prompt. Please select a different model"
                    )
                assert (
                    len(embeddings["c1ccccc1"]) == embedding_size
                ), "Model cannot return the correct embedding size."
                self._length = embedding_size

            elif kind.startswith("llama.cpp") or kind.startswith("gpt4all"):
                raise ValueError(f"{kind} is not yet supported, because or how slow they are.")
            else:
            # embedding models do not accept a temperature; remove the key
                params.pop("temperature", None)
                self.model = HuggingFaceInstructEmbeddings(
                    model_name=kind,
                    embed_instruction=MODEL_EMBEDDING_INSTRUCTIONS["instructor"].format(
                        context=self.context
                    ),
                    model_kwargs=params,
                    cache_folder=CACHE_DIR,
                )
        super().__init__(
            precompute_cache=precompute_cache,
            n_jobs=n_jobs,
            dtype=dtype,
            device="cpu",
            parallel_kwargs=parallel_kwargs,
            **params,
        )

    def __len__(self):
        """Get the length of the featurizer"""
        if self._length is not None:
            return self._length
        return super().__len__()

    def _convert(self, inputs: List[Union[str, dm.Mol]], **kwargs):
        """Convert the list of input molecules into the proper format for embeddings"""
        self._preload()
        parallel_kwargs = copy.deepcopy(getattr(self, "parallel_kwargs", {}))
        parallel_kwargs["n_jobs"] = self.n_jobs
        return convert_smiles(inputs, parallel_kwargs, standardize=self.standardize)

    def _embed(self, smiles: List[str], **kwargs):
        """_embed takes a list of smiles or molecules and return the featurization
        corresponding to the inputs.  In `transform` and `_transform`, this function is
        called after calling `_convert`

        Args:
            smiles: input smiles
        """
        if isinstance(self.model, LangChainEmbeddings):
            return self.model.embed_documents(smiles)
        # chat-based embedding path: estimate the request's total token count
        # from the expected number of characters in the inputs and outputs
        expected_tokens = (self.embedding_size * (self.precision + 5) + 4) * len(smiles) + sum(
            len(x) + 5 for x in smiles
        )
        # we split the smiles into batches to stay under the maximum token count
        if not self.batch_size:
            maximum_tokens = self.model.llm.max_tokens or self.model.memory.max_token_limit or 2000
            n_splits = max(1, int(np.ceil(expected_tokens / maximum_tokens)))
        else:
            n_splits = max(1, int(np.ceil(len(smiles) / self.batch_size)))
        data = {}
        for batch in tqdm(np.array_split(smiles, n_splits), desc="Batch embedding", leave=False):
            json_output = self.model.predict(human_input=" ,".join(batch))
            batch_data = json.loads(json_output)
            # EN: surprisingly, ChatGPT can return a randomized version of a SMILES
            data.update({dm.unique_id(k.strip()): v for k, v in batch_data.items()})
        missed_embedding = np.full_like(list(data.values())[0], np.nan)
        data = [data.get(dm.unique_id(sm), missed_embedding) for sm in smiles]
        return data
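For intuition, the character-based token estimate used in _embed above can be reproduced by hand with illustrative values:

embedding_size, precision = 32, 5
smiles = ["CCO", "c1ccccc1"]

# each output vector costs about embedding_size * (precision + 5) characters (+4 overhead),
# and each input SMILES costs its length (+5 overhead)
expected_tokens = (embedding_size * (precision + 5) + 4) * len(smiles) + sum(
    len(x) + 5 for x in smiles
)
print(expected_tokens)  # 669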

SUPPORTED_EMBEDDINGS = ['hkunlp/instructor-large', 'hkunlp/instructor-base', 'openai/gpt-3.5-turbo', 'openai/gpt-4', 'openai/chatgpt'] class-attribute instance-attribute

batch_size = batch_size instance-attribute

context = context or 'modelling' instance-attribute

conv_buffer_size = conv_buffer_size instance-attribute

conv_max_tokens = conv_max_tokens instance-attribute

embedding_size = embedding_size instance-attribute

kind = kind instance-attribute

model = None instance-attribute

precision = precision instance-attribute

standardize = standardize instance-attribute

system_prompt = system_prompt or DEFAULT_SYSTEM_PROMPT instance-attribute

__init__(kind, embedding_size=32, context='modelling', standardize=True, precompute_cache=True, n_jobs=0, conv_buffer_size=10, conv_max_tokens=None, dtype=float, openai_api_key=None, precision=5, batch_size=None, system_prompt=None, parallel_kwargs=None, **params)

Instantiate an instruction-following LLM transformer for molecular embeddings

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| kind | Union[str, LangChainEmbeddings] | type or name of the model to use for embeddings | required |
| embedding_size | int | size of the embeddings to return for chat-like models | 32 |
| context | Optional[str] | context to give to the prompt for returning the results. Default is "modelling", the context for the modelling instructions | 'modelling' |
| standardize | bool | if True, standardize smiles before embedding | True |
| precompute_cache | bool | if True, add a cache to avoid recomputing embeddings for the same molecules | True |
| n_jobs | int | number of jobs to use for preprocessing smiles | 0 |
| conv_buffer_size | int | conversation buffer size, so the assistant can remember previous conversations and context when generating features | 10 |
| conv_max_tokens | Optional[int] | maximum number of tokens to use for the conversation context. If None, no token limit is applied | None |
| dtype | | data type of the returned embeddings | float |
| openai_api_key | Optional[str] | OpenAI API key. If None, read from the environment variable OPENAI_API_KEY | None |
| precision | int | float precision of the output vector | 5 |
| batch_size | Optional[int] | batch size to use for embedding molecules. If None, the batch size is derived from the expected token count | None |
| system_prompt | Optional[str] | system prompt to use for chat-like models. If None, the default prompt is used | None |
| parallel_kwargs | Optional[dict] | optional keyword arguments controlling the parallelization of the preprocessing | None |
| **params | | parameters to pass to the LLM embeddings. See the langchain documentation | {} |

__len__()

Get the length of the featurizer
