This repository is an example "recipe" for using AWS Bedrock agents and knowledgebases to extract information from the biomedical literature.
This was a pilot experiment, so several steps - creating the agent, the knowledgebase, and the lambda function - were done manually and have not yet been converted into Infrastructure as Code.
I also want to mention that all of this is changing very quickly, and it's likely that some steps I've included here are outdated or now handled natively by Bedrock. For example, the lambda function was written to work around a known bug in S3 metadata retrieval for agents interacting with knowledgebases. It's quite possible that bug has been fixed by now and that directly connecting the Bedrock Agent to your knowledgebase would work just fine.
Our group at Sage Bionetworks maintains a data portal (nfdataportal.org) for neurofibromatosis (NF), a family of rare diseases. One portion of this portal is dedicated to experimental tools (tools.nf.synapse.org) - cell lines, animal models, plasmids, etc. - relevant to NF. We've built a database with lots of interesting information about these resources, but we were struggling to efficiently extract "observations" about these tools from the scientific literature. For example, "mouse model X develops tumor type Y at N months of age on average." These snippets are important to understand, especially if you are using these tools to model particular aspects of NF (such as tumors).
In late 2023 we started experimenting with ChatGPT, ScholarAI, and "manual" RAG, where we provided a publication and extracted information about a tool - such as a mouse model or a cell line - from that publication. You can read more about that here: https://doi.org/10.21428/4f83582b.bfc5e9cb
It worked well, but it was slow, and required a lot of user input. Therefore, we recently started exploring AWS Bedrock as a platform to scale this up. Overall, our goal was to take all of the publications (at least, the open access ones - only 400 or so) mentioned in our database and use an LLM to extract useful tidbits about these research tools from the pubs. Here, we used Anthropic Claude 3 on Bedrock to build an agent and to "read" our publications for us.
The first step was to build a knowledgebase. There's extensive documentation on AWS Bedrock knowledgebases, so I won't describe that here, but I will describe how I retrieved the information (pdfs) to put into the knowledgebase. Unpaywall is AMAZING! I hacked together some R code to use the Unpaywall API and find legal open access copies of as many of the database publications (we just had DOIs) as possible. I ended up with about 400 pdfs. I also generated metadata files (https://aws.amazon.com/blogs/machine-learning/knowledge-bases-for-amazon-bedrock-now-supports-metadata-filtering-to-improve-retrieval-accuracy/) for each PDF so that we could trace each extracted "observation" back to a particular DOI and cite it appropriately. Then you just upload to an S3 bucket and follow the AWS docs to create a vector database/knowledgebase from the PDFs.
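For reference, the retrieval step looked roughly like the following sketch. I've translated it to Python here for consistency with the rest of this writeup (the original was R); the email address, output directory, and filename scheme are placeholders, but the Unpaywall endpoint and the .metadata.json sidecar convention come from their respective docs.

import json
import requests

UNPAYWALL = "https://api.unpaywall.org/v2/{doi}?email={email}"

def fetch_oa_pdf(doi, email, out_dir="nftc_pdfs"):
    """Look up a DOI on Unpaywall and download the best open-access PDF, if any."""
    resp = requests.get(UNPAYWALL.format(doi=doi, email=email), timeout=30)
    resp.raise_for_status()
    oa = resp.json().get("best_oa_location") or {}
    pdf_url = oa.get("url_for_pdf")
    if not pdf_url:
        return None  # no legal open access copy found
    pdf_path = "{}/nftc_{}.pdf".format(out_dir, doi.replace("/", "_"))
    with open(pdf_path, "wb") as f:
        f.write(requests.get(pdf_url, timeout=60).content)
    # Bedrock knowledgebases pick up a sidecar <file>.metadata.json;
    # the "doi" attribute is what lets us cite each observation later.
    with open(pdf_path + ".metadata.json", "w") as f:
        json.dump({"metadataAttributes": {"doi": "https://doi.org/" + doi}}, f)
    return pdf_path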
Unfortunately, when I was working on this there was a bug that prevented retrieval of S3 object metadata in agents. Therefore, I had to define a custom lambda function that queries the knowledgebase and returns the results, including the S3 object metadata. The lambda directory is a modification of the blank-python sample app from the aws-lambda-developer-guide: https://github.com/awsdocs/aws-lambda-developer-guide/tree/main/sample-apps/blank-python. The actual function the lambda executes when called by the agent can be found in the lambda/function directory. Hopefully this bug is close to resolution or already resolved by the time you read this, in which case no lambda function is necessary.
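At its core, the function just calls the Retrieve API and passes the chunks plus their metadata back to the agent. Here's a minimal sketch, assuming an API-schema-based action group with a single searchQuery parameter; the knowledgebase ID is a placeholder, and the exact response envelope the agent expects may have changed, so check the current Bedrock docs.

import json
import boto3

kb_client = boto3.client("bedrock-agent-runtime")

def lambda_handler(event, context):
    # The agent passes the search string as an action-group parameter.
    query = next(p["value"] for p in event["parameters"] if p["name"] == "searchQuery")
    resp = kb_client.retrieve(
        knowledgeBaseId="KB_ID_HERE",  # placeholder
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 10}},
    )
    # Unlike the (then-buggy) built-in integration, this keeps each chunk's
    # S3 metadata - including the "doi" attribute - alongside its text.
    results = [
        {"text": r["content"]["text"], "metadata": r.get("metadata", {})}
        for r in resp["retrievalResults"]
    ]
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event.get("apiPath", "/search"),
            "httpMethod": event.get("httpMethod", "GET"),
            "httpStatusCode": 200,
            "responseBody": {"application/json": {"body": json.dumps(results)}},
        },
    }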
This is my simple script to orchestrate the agent. Prior to running it, you need to create and configure an agent on Bedrock. In my case, I gave it some basic system instructions and configured it to recognize experimental tool "resource" names and trigger the lambda function above to query the knowledgebase for relevant information. This was by far the most challenging part of the process, because it required a ton of iteration and experimentation to get the LLM behaving the way I wanted. It also seems like it would be challenging to port to infrastructure as code - as of now, you really have to do a lot of manual iteration, testing, and refinement, and then build a deployment template after the fact (which I have not done).
In brief, this script cycles through a table of data from our database and, for each row, builds a prompt that gets passed to the agent. With some handling for different kinds of unexpected responses, the script determines whether the agent is providing "observations"; if so, it prompts the agent for more. This loop repeats until the agent returns results indicating it has exhausted the available information, at which point the loop breaks and moves on to the next resource. A sketch of the loop is below.
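This is a minimal sketch of that loop, assuming a pandas table of resources and two hypothetical helpers that aren't in the repo under these names: build_query (essentially the prompt-construction snippet shown further down) and parse_observations (shown at the end). The agent and alias IDs are placeholders.

import uuid
import boto3

agent_client = boto3.client("bedrock-agent-runtime")

def ask_agent(prompt, session_id):
    """Invoke the Bedrock agent and concatenate its streamed response chunks."""
    resp = agent_client.invoke_agent(
        agentId="AGENT_ID",      # placeholders
        agentAliasId="ALIAS_ID",
        sessionId=session_id,
        inputText=prompt,
    )
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in resp["completion"] if "chunk" in event
    )

all_observations = []
for _, row in resources.iterrows():  # resources: pandas DataFrame from our database
    session_id = str(uuid.uuid4())   # one session per resource keeps context together
    reply = ask_agent(build_query(row), session_id)
    while True:
        observations = parse_observations(reply)
        if not observations:  # agent returned [null]: nothing left to extract
            break
        all_observations.extend(observations)
        reply = ask_agent("Please extract more observations.", session_id)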
After this is all done, we format it into a table and push it back into the database. To see what some of the results look like, check out this pig model of NF1 in our database: https://nf.synapse.org/Explore/Tools/DetailsPage/Observations?resourceId=57625a6e-c039-418e-8e22-db464f8aa827
I spent a lot of time refining and testing this, and I'm sure it could be improved further. I found the Anthropic prompting guide very helpful: https://docs.anthropic.com/en/docs/prompt-engineering
The prompt looks something like this:
query = "Please extract a comprehensive set of highly-accurate observations about '{}'".format(row['resourceName'])
if row['resourceType']:
query += ", a {}".format(row['resourceType'])
if row['synonyms']:
query += ", also known as {}".format(row['synonyms'])
if row['resourceId']:
query += ", resourceId: {}".format(row['resourceId'])
if isinstance(row['rrid'], str) and row['rrid'] != 'nan':
query += ", RRID:{}. ".format(row["rrid"])
print(query)
query += '''. The RRID might not be mentioned in the search results. Also, the RRID is not the same as the resourceId. The resourceId is an etag and will be provided in the query. Most importantly, PLEASE be sure that any observations extracted are relevant to the named resource. False negatives (i.e. missing an observation) are acceptable for now, false positives (i.e. observations attributed to the wrong resource or DOI) are not acceptable. Do not invent synonyms for cell lines or animal models that I have not explicitly provided, with the few exceptions I mention later in these instructions. Please be ABSOLUTELY SURE that the observation matches the resource. For example, if a cell line like SK-MEL-238 is queried, and the search results mention SK-MEL-2 or SK-MEL-131, these are probably not observations about SK-MEL-238. Or, if the search results do not explicitly mention the resource (e.g. STS-26T, or SZ-NF4 are in the query but not in the search results), then those search results probably do not contain relevant observations and should be ignored. Or, sometimes, author initials or other acronyms can be confused for a resource (e.g. cell line SZ-NF4 and author initials SZ). On the other hand, sometimes papers may mention the full name of the resource once and then refer to it thereafter using an abbreviation, particularly in the case of animal models (e.g. B6;129S2-Trp53tm1Tyj Nf1tm1Tyj/J is also known as NPcis). In that specific instance, it is OK to extract observations that do not have a perfect name match with the query resource. Similarly, sometimes there are minor differences in punctuation, spacing, or capitalization (e.g. FTC133 vs FTC-133, YST1 vs YST-1, or U87-MG vs U87MG vs U87 MG or sNF94.3 vs SNF94.3, many other examples exist); these should be treated as identical resources. If a resourceName or synonym is extremely generic - for example, Nf1+/- or NF1-mut or NF1-null or similar - do not include it in the knowledgebase search and do not extract observations about it, because it is possible that the search results are talking about a different animal model or cell line. DO NOT include observations where the focus is methodology, acknowledgements, ethics, culture conditions, or quality control (e.g. <example>the cell lines were sequenced with whole genome sequencing</example>, or <example>the cell lines were acquired from...</example>, or <example>The mouse genotypes were verified by PCR.</example>, or <example>The mice were evaluated twice daily.</example> or <example>the cell line was confirmed to be negative for mycoplasma contamination</example>). We are only interested in observations that are data-driven and scientific in origin. DO NOT include observations that do not match the input resource name or describe a different cell line or mouse model. Be absolutely sure that your extracted observations are accurate for a particular resource. It is not acceptable to hallucinate or make up observations. Please be sure to retrieve the "doi" portion of your response from the metadata associated with the chunk from which the observation was extracted. DO NOT make up a DOI. DO NOT respond in any format other than the requested JSON format. If a value is missing (for example, if the observationTime is not applicable), fill it in with "" to make sure it is valid JSON.
If you do not find any relevant observations for the query resource in the search results, or there is nothing to extract, simply return [null]; do not extract anything in the "observation" format. DO NOT include any preamble to the JSON or text after the JSON. The JSON portion of your response must be valid JSON, readable in Python by the json library. Be sure to wrap the JSON portion of your response in <json_response> </json_response> tags. The observations you extract should be summarized and succinct, but we are interested in all scientific observations about the resource; even if they are complex or jargon-heavy topics, please still extract them. Here are some examples: <example_1> For the query "NF1OPG, an Animal Model, resource ID 76ff3bea-5a2c-4d9c-b3c4-513842c11af4", given the search result: <example_search_result> "retrievedReferences": [{"content": {"text": "Nf1OPG mice with optic glioma tumors consistently developed preneoplastic lesions by 3 months of age that progressed to optic gliomas over the next 3 to 6 months. By 7–9 months of age, 100% of mice had symptomatic optic glioma and required euthanasia due to progressive neurological symptoms."}, "location": {"s3Location": {"uri": "s3://nf-tools-database-publications/nftc_pdfs/nftc_10.1158_1078-0432.CCR-13-1740.pdf"}, "type": "S3"}, "metadata": {"x-amz-bedrock-kb-source-uri": "s3://nf-tools-database-publications/nftc_pdfs/nftc_10.1158_1078-0432.CCR-13-1740.pdf", "doi": ["https://doi.org/10.1158/1078-0432.CCR-13-1740"]}} </example_search_result> <example_response> <json_response> [{"resourceId":"76ff3bea-5a2c-4d9c-b3c4-513842c11af4","resourceName":"NF1OPG","resourceType":["Animal Model"],"observationText":"In the NF1OPG mouse model, preneoplastic lesions consistently developed by 3 months of age and progressed to symptomatic optic gliomas requiring euthanasia by 7-9 months due to neurological symptoms in 100% of mice.","observationType":["Tumor progression","Neurological symptoms"],"observationPhase":"juvenile","observationTime":3,"observationTimeUnits":"months","doi":"https://doi.org/10.1158/1078-0432.CCR-13-1740"}] </json_response> </example_response> </example_1> <example_2> For the query "NF1 flox/flox; GFAP-Cre, an Animal Model, resource ID d2173c46-0d4d-4b79-bcdf-ceb6d05b5a3f", given the search result: <example_search_result> "retrievedReferences": [{"content": {"text": "Mice with astroglial inactivation of the Nf1 tumor suppressor gene (Nf1 flox/flox; GFAP-Cre mice) developed low-grade astrocytomas with 100% penetrance.
These low-grade gliomas were detected as early as 3 months of age, and the mice exhibited progressive neurological dysfunction with advanced age."}, "location": {"s3Location": {"uri": "s3://nf-tools-database-publications/nftc_pdfs/nftc_10.1158_0008-5472.CAN-05-0677.pdf"}, "type": "S3"}, "metadata": {"x-amz-bedrock-kb-source-uri": "s3://nf-tools-database-publications/nftc_pdfs/nftc_10.1158_0008-5472.CAN-05-0677.pdf", "doi": ["https://doi.org/10.1158/0008-5472.CAN-05-0677"]}} </example_search_result> <example_response> <json_response> [{"resourceId":"d2173c46-0d4d-4b79-bcdf-ceb6d05b5a3f","resourceName":"NF1 flox/flox; GFAP-Cre","resourceType":["Animal Model"],"observationText":"These mice developed low-grade astrocytomas with 100% penetrance starting as early as 3 months of age, exhibiting progressive neurological dysfunction with increasing age.","observationType":["Tumor incidence","Neurological symptoms"],"observationPhase":"juvenile","observationTime":3,"observationTimeUnits":"months","doi":"https://doi.org/10.1158/0008-5472.CAN-05-0677"}] </json_response> </example_response> </example_2> <example_3> For the query "T265, a Cell Line, resourceId 6419dd0d-1937-4ecf-bf01-876632ae0f54", given the search result: <example_search_result> "retrievedReferences": [{"content": {"text": "Then we performed a human STR authentication analysis to identify any possible cross-contamination or misidentification among cell lines of human origin (Table S1). All STR profiles matched the STR profiles published in Cellosaurus and ATCC when available. However, in this process, we identified the same STR profile for ST88-14 and T265 cell lines (Data S2) in all ST88-14- and T265-related samples provided by different laboratories. To find out which cell line was misidentified we analyzed the oldest ST88-14 and T265 stored vials in their original labs and more conclusively, the primary tumor from which the ST88-14 cell line was isolated (Data S2). We identified the ST88-14 cell line as the genuine cell line for that STR profile, NF1 germline (c.1649dupT) mutation and somatic copy number alteration landscape, and dismissed the use of the T265 cell line, which we assume was misidentified at some point after its establishment and expansion."}, "location": {"s3Location": {"uri": "s3://nf-tools-database-publications/nftc_pdfs/nftc_10.1016.j.isci.2023.106096.pdf"}, "type": "S3"}, "metadata": {"x-amz-bedrock-kb-source-uri": "s3://nf-tools-database-publications/nftc_pdfs/nftc_10.1016.j.isci.2023.106096.pdf", "doi": ["https://www.doi.org/10.1016/j.isci.2023.106096"]}} </example_search_result> <example_response> <json_response> [{"resourceId":"6419dd0d-1937-4ecf-bf01-876632ae0f54","resourceName":"T265","resourceType":["Cell Line"],"observationText":"The T265 cell line was discarded as it exhibited the same STR profile as the ST88-14 and its matched primary MPNST, suggesting it may not be a distinct cell line but rather a duplicate or misidentified version of ST88-14.","observationType":["Cell line identity"],"observationPhase":"","observationTime":"","observationTimeUnits":"","doi":"https://www.doi.org/10.1016/j.isci.2023.106096"}] </json_response> </example_response> </example_3> Note that example_3 could also be a valid search result for ST88-14. <example_4> For the query "NF90-8, a Cell Line, resourceId 0f404e70-2acf-4877-bcd5-6da81d9fa41e", given the search result: <example_search_result> "retrievedReferences": [{"content": {"text": "The functional impact of small variants in oncogenes and TSGs was also moderate.
We identified some MPNST-related genes inactivated by pathogenic SNVs (Figure 4B and Table S3). In addition to germline NF1 mutations, somatic mutations also affected NF1, as well as other genes including TP53, PRC2 genes, and PTEN. Remarkably, we did not identify gain-of-function mutations in oncogenes, except a BRAF V600E mutation in the STS-26T cell line. In contrast, we identified gains in genomic regions containing receptors, especially a highly gained region containing PDGFRA and KIT in two NF1-related cell lines (S462 and NF90-8) (Figure 4B). The most frequently inactivated gene in our set of cell lines was CDKN2A, a known bottleneck for MPNST development.19,20 The fact that this gene was inactivated by a point mutation only in one cell line, exemplifies the relatively low functional impact of small variants compared to structural variants in MPNST initiation."}, "location": {"s3Location": {"uri": "s3://nf-tools-database-publications/nftc_pdfs/nftc_10.1016.j.isci.2023.106096.pdf"}, "type": "S3"}, "metadata": {"x-amz-bedrock-kb-source-uri": "s3://nf-tools-database-publications/nftc_pdfs/nftc_10.1016.j.isci.2023.106096.pdf", "doi": ["https://www.doi.org/10.1016/j.isci.2023.106096"]}} </example_search_result> <example_response> <json_response> [{"resourceId":"0f404e70-2acf-4877-bcd5-6da81d9fa41e","resourceName":"NF90-8","resourceType":["Cell Line"],"observationText":"The NF90-8 cell line had a highly gained region in chromosome 4 containing the PDGFRA and KIT receptors.","observationType":["Genomics"],"observationPhase":"","observationTime":"","observationTimeUnits":"","doi":"https://www.doi.org/10.1016/j.isci.2023.106096"}] </json_response> </example_response> </example_4>'''
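Finally, to close the loop: pulling the observations back out of the agent's reply just means honoring the <json_response> tag and [null] conventions requested above. A minimal version of the hypothetical parse_observations helper used in the orchestration sketch:

import json
import re

def parse_observations(reply):
    """Pull the JSON array out of <json_response> tags; [] means nothing found."""
    match = re.search(r"<json_response>(.*?)</json_response>", reply, re.DOTALL)
    if not match:
        return []  # unexpected response shape; treat as no observations
    payload = json.loads(match.group(1).strip())
    # The prompt asks for [null] when there is nothing left to extract.
    return [obs for obs in payload if obs is not None]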