Dr. Melissa Haendel: The Monarch Initiative: Semantic Phenotyping for Disease Diagnosis & Discovery

>>Good morning everyone. Thank you for joining us today for the NCI CBIIT Speaker Series. I’m Tony Kerlavage, the head of Cancer Informatics here at CBIIT. Just a couple of items to mention. Today’s presentation is being recorded and will be available via the CBIIT website at cbitt.nci.nih.gov. You can find information about future speakers on our site and by following us on Twitter at NCI_NCIP. Today I have the pleasure of welcoming Dr. Melissa Haendel, the Director of the Ontology Development Group within the Oregon Health and Science University Library and the Department of Medical Informatics and Epidemiology. Dr. Haendel earned her PhD in neuroscience at the University of Wisconsin-Madison and a BA in chemistry from Reed College. She is also the principal investigator of the Monarch Initiative. The title of her presentation today is “The Monarch Initiative: Semantic Phenotyping for Disease Diagnosis and Discovery.” And with that I’ll turn the floor over to Dr. Haendel.
>>Well, thank you very much for inviting me to speak this morning. Today, as my title indicates, I’m going to be talking about phenotyping, and I should mention that this work is being used in the context of collaborations going on in the Global Alliance for Genomics and Health and with my [inaudible] group as part of PhenomeCentral and the Matchmaker Exchange to exchange patient phenotype and genotype data for maximizing disease discovery globally. I’m not really going to talk about that work, but the work I’ll talk about today is really the underpinnings by which we are aiming to share those patient data to help support disease diagnoses. So today I’m going to talk in particular about the computable phenotypic profile. How do we deeply see the [inaudible] to aid disease discovery? How are we using this phenotype data in combination with exome analysis for disease diagnosis?
And then, based on my background in extensive use of different types of model organisms, I have a deep interest in better leveraging that organism data, and so I’ll talk a little bit about how we might use that data and cross the species divide to help use those model data for disease diagnoses. Then I’ll talk a little bit about some work that we’re doing to have a better understanding of what constitutes good phenotyping in the case of rare and undiagnosed diseases. How does a clinician know that they’ve done enough in order to support disease diagnosis? And then because this is an NCI talk, I’m going to talk about some work that we’re doing right now to help represent chronological considerations in the way in which we phenotype. And this is really ongoing work, but I would welcome questions from the audience about how this might be applied in the cancer realm, with which I’m really not that familiar. So many of you will recognize these patients. If you look carefully I think it will take you less than a second to know that these are Down syndrome patients. But this is actually much harder for a computer. A computer needs to know the things that are indicative of Down syndrome, and you know, as a human clinician or even a person on the street, we don’t have to go through the process of listing out such things as flat back of head and many loops on fingertips and intestinal blockage. These are not things that we have to recognize. It’s really sort of an intuitive recognition because it’s a very common syndrome and we’ve come to recognize it, but the computer needs to have this list. It needs a list, and it needs to know what an abnormal pelvis is and what an enlarged colon is. And so this list I will define as a phenotypic profile. And this phenotypic profile is the unit of operations for much of the work that I’ll be talking about.
So in most clinical phenotyping contexts, EHRs provide for, you know, free text representations of these phenotypic profiles. And these things can be listed in many different ways: you might use an abbreviation such as df to mean dysmorphic features. You might just write dysmorphic. You might be more specific and talk about dysmorphic facies, which is better. And so while free text is the currency of exchange for a clinician, it becomes a bit impenetrable for a computer to sort out. This is somewhat improved by the use of clinical symptom checkboxes, which really give us a list that the computer can operate on. But these types of checklists, while quite effective in the context of more common diagnoses and for following patient progress over time, don’t do too much for us when we’re dealing with rare and undiagnosed disease, because they simply don’t specify enough details. As it turns out, six percent of the general population suffers from a rare disorder, meaning one that affects approximately 200,000 patients or fewer. And six percent of the patients contacting the NIH Office of Rare Disorders do not have a diagnosis whatsoever. So that’s actually quite a lot of patients that we’re talking about who are falling through the cracks of the normal clinical phenotyping scenario. So for the yet-to-be diagnosed patient, is it a known disorder that was not recognized during a prior evaluation, or is it some kind of atypical presentation of a known disorder? Could it be a combination of several disorders, or is it actually a novel, unreported disorder? And these are the questions that clinicians have to ask themselves if they’re evaluating patients that have yet to be diagnosed. So the challenge then for genetic diseases, which is what I’m focused on here, is how do we interpret the disease candidates?
We have a set of phenotypes, and if we’ve done a good job we hopefully have a nice long list of very descriptive phenotypes; I’ll talk more about how we achieve that later. We’ve potentially done whole genome sequencing or whole exome sequencing and have some information about the genotype. If we’re lucky we also have parental genotyping that’s been performed. There is information about the patient’s history and the environment that they’ve been exposed to, whether it be interactions with potential toxins, lifestyle issues such as diet, or drug therapies that have been tried for their particular disorder over time. And this all goes into a black box along with a bunch of sorting that happens with a variety of algorithms that we throw at the problem. That includes sorting out our candidate genes based on pathogenicity, the frequency of the variant’s occurrence, protein interactions of those candidate variants, gene expression changes that occur, relationships between the candidates and gene networks, the use of epigenomics and metabolomics data, and pretty much every other omics data that you might be able to throw at this problem. And out pops the other end a list of candidates for functional validation. And so the question then is, what’s in the box? How can we better prioritize these variants, and how do we compare these different methods? So fundamentally, if we just go to the phenotype problem and come back to our Down syndrome example, we have a patient that has a set of phenotypes such as flat back of the head or hypotonia, and the problem is really about matching that patient to a known disease. In this case this disease has been described as having an abnormal skull morphology and a decreased muscle mass. And so the question is, does this patient have the disease or not? And the phenotypes, which you’ll see indicated by the colors, and I’ll be using these types of notation throughout the talk, are not exact matches.
And so if the list of phenotypes doesn’t exactly match and it’s not a very common disease that we’re familiar with it becomes a very difficult problem to understand if the phenotypic constellation is the same in the patient as it is in a described disease and, therefore, making differential diagnosis very difficult. So in these cases what often happens and still happens is a lot of searching goes on to look for those specific phenotypes that are observed in the patient. And this goes on in OMIM, in Orphanet, in PubMed. In very many places one might take those, list the phenotypes and start searching for case studies or models that have those particular phenotypes in common with the patient. But this is a really ineffective strategy, and in fact, if you go to OMIM and this is the case in many places and search for such things that human being would see as being similar such as large bone, enlarged bone, big bone, huge bones, massive bones, hyperplastic bones, etc. you get actually very different results when you search for these things even though a human knows that these things are really fairly synonymous, the computer does not. And so the question is then how do we improve the situation for searching. And this is why I would claim that you know phenotyping is actually the final frontier. And what we really want to understand is how do we perform more accurate and precise phenotyping to support a greater understanding of genotype, the relationships between genotype and phenotype and the potential effects of environment whether they be retrospective in terms of causes for a different disease or intended therapies or interventions and how those might affect the phenotypic outcome. You simply need to have a better understanding of how to represent phenotype data similar to what we’ve done for the genotype data over the past many years. So one approach, and there are many. 
The one that I’m going to talk about today is really, you know, a semantic approach for representing phenotypes. If, instead of thinking of a list of phenotypes, we think of the phenotypes that we have for our Down syndrome patient as a graph, we can actually do some very interesting things in terms of the way that we compare these phenotypic profiles between patients and known diseases or across patients. And that’s really the subject of the work with the Matchmaker Exchange: to try to find other patients in the world that have the same disease, the same extremely rare disease. So in this case we have a graph that shows things classified by color. If we think about the different colors as being different anatomical systems, and each node, which is indicated by a circle, as being a different phenotype term with specific relationships to the others, we can think about a given patient or a given disease as having a collection of these things represented. And so over here on the left these are represented in terms of a typing hierarchy: you might think of an abnormal motor neuron morphology as being a subtype of an abnormal central nervous system morphology, which is in turn a subtype of abnormal nervous system morphology. And so in that way we have this nice graph structure of the relationships between these different phenotypes, and we can also include all those synonyms that we were lacking in our prior free-text type of search. So this helps us really sort of organize those phenotype profiles into a graph structure. And so here’s what it would look like if you had a particular disease. You would have a set of annotations, a collection if you will of nodes in this graph. So your list is now a collection of nodes in a graph, and those are indicated here with the darker circles. So this might be a nervous system phenotype. This might be a heart phenotype. This might be a skin phenotype, etc.
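As an illustration of the idea described above, here is a minimal sketch of how a free-text list could be turned into a collection of nodes in a phenotype graph. It is not the Human Phenotype Ontology's actual data model: the `Term` class, the term labels, the synonym lists, and the helper names `build_index` and `to_profile` are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One node in a toy phenotype graph (hypothetical structure)."""
    label: str
    parents: list = field(default_factory=list)   # more general terms
    synonyms: list = field(default_factory=list)  # alternative wordings


# Illustrative terms only; labels and synonyms are invented for the sketch.
TERMS = [
    Term("abnormal nervous system morphology"),
    Term("abnormal central nervous system morphology",
         parents=["abnormal nervous system morphology"]),
    Term("abnormal motor neuron morphology",
         parents=["abnormal central nervous system morphology"]),
    Term("enlarged bone",
         synonyms=["large bone", "big bone", "huge bone",
                   "massive bone", "hyperplastic bone"]),
]


def build_index(terms):
    """Map every label and synonym (lower-cased) to its canonical label."""
    index = {}
    for term in terms:
        for name in [term.label, *term.synonyms]:
            index[name.lower()] = term.label
    return index


INDEX = build_index(TERMS)


def to_profile(free_text_items):
    """Turn free-text phenotype descriptions into a set of graph nodes.

    Returns the matched canonical labels and the items that had no match.
    """
    matched, unmatched = set(), []
    for item in free_text_items:
        key = item.strip().lower()
        if key in INDEX:
            matched.add(INDEX[key])
        else:
            unmatched.append(item)
    return matched, unmatched


profile, leftovers = to_profile(
    ["Big bone", "massive bone", "abnormal motor neuron morphology", "df"]
)
print(profile)    # the two canonical nodes
print(leftovers)  # ['df'] has no entry in this toy index
```

Because both "Big bone" and "massive bone" resolve to the same node, the profile holds one term for them, which is the behavior the plain text search lacked.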
And in doing so, if this node is a phenotype for the patient or the disease, it also means that all the things that subsume it are true. So, if this is a nervous system phenotype, this more specific motor neuron phenotype would indicate that it’s true all the way up to here, as being a nervous system phenotype. That’s actually a very powerful tool that we can use to help compare patients to diseases. So going back to our Down syndrome case again, here we have an example of a set of phenotypes that are all nodes in a graph coming from the different anatomical system branches. And these nodes in that graph are all annotated to a particular disease, in this case Down syndrome, which then, if it’s known, can be related to the genetic basis of the disease, in this case Chromosome 21 Trisomy. And so we have basically created these sets of annotations to nodes in this graph for all the known Mendelian diseases. But because of that we can now do what...
>>Hey Melissa, can you hold on one second please. We’ve just lost the screen momentarily here.
>>Sure. Okay, go ahead.
>>Okay, so once we have all of this data for Mendelian diseases encoded in these graph structures as sets of annotations of phenotypic profiles to the graph, we can actually do a really neat comparison between any given patient and that corpus of disease annotation. And so here with a slightly hokey animation I’m going to try to show you what that looks like. On the left over here we have a patient, which is indicated by a set of nodes in the graph. These are their phenotypes, and as indicated by the colors going up to the roots, that means that in this case this very specific nervous system phenotype is a type of nervous system phenotype. So let’s say that’s the motor neuron phenotype. And so what we want to do is compare this patient with this collection of nodes against every other known disease that we have in the corpus. And so you can look just by eye and see that there’s clearly some overlap.
In this patient there’s some green, some orange, some red, some gray, some yellow, and we have the same kind of thing over here, but they’re different phenotypes. They’re not exactly the same, and if we were doing simply a list-based approach, this phenotype over here would not match this phenotype over here. But they can be related based on the thing that subsumes them, right? This node here is actually the thing that the patient has in common with disease X over here. And so as it turns out, in this case the patient is more similar to disease X than disease Y, and we can actually see what that overlap looks like if we try to align the graphs. And this is really what the algorithms that we’re developing do: they aim to align the graphs and look at where the overlap happens. And then if we look at the overlap you can see now that that’s the node in common between the patient and the disease. And so the patient has these nodes indicated, these more specific phenotypes, where disease X has this phenotype. But this is the node that’s in common. We do this across the whole set of phenotypes, for every phenotype that’s been annotated to the patient and to the disease. And using graph algorithms we can find the most similar disease to the patient based upon how well the graph matches. And this can also be done for patients against patients, as indicated earlier. So to do this we’ve been developing the Human Phenotype Ontology, which is really focused on representation of human phenotypes that are then used to annotate patients, disorders, genotypes, genes and sequence variants for humans. And we had our Nucleic Acids Research update paper come out recently. You can basically represent in a very deep way the types of phenotypes that we compare using this ontology. And this ontology is really a logically defined set of terms that are used not only to create these annotations but also for interoperability purposes, and I’ll come back to that later.
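The comparison described above, where two different specific phenotypes still match through the node that subsumes them, can be sketched as follows. This is only a toy under stated assumptions: the graph, the profiles for the patient, disease X, and disease Y, and the function names are invented, and a plain Jaccard overlap of ancestor closures stands in for whatever scoring the real graph algorithms use.

```python
# Toy "is a subtype of" graph: each term lists its more general parents.
PARENTS = {
    "phenotype": [],
    "nervous system phenotype": ["phenotype"],
    "motor neuron phenotype": ["nervous system phenotype"],
    "sensory neuron phenotype": ["nervous system phenotype"],
    "heart phenotype": ["phenotype"],
    "heart valve phenotype": ["heart phenotype"],
    "skin phenotype": ["phenotype"],
}


def closure(term):
    """Return the term together with every term that subsumes it."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def profile_closure(profile):
    """Union of the closures of all terms in a phenotypic profile."""
    result = set()
    for term in profile:
        result |= closure(term)
    return result


def similarity(profile_a, profile_b):
    """Jaccard overlap of the two closures (an assumed stand-in measure)."""
    a, b = profile_closure(profile_a), profile_closure(profile_b)
    return len(a & b) / len(a | b)


def best_match(patient, diseases):
    """Name of the disease whose profile overlaps the patient's the most."""
    return max(diseases, key=lambda name: similarity(patient, diseases[name]))


patient = {"motor neuron phenotype", "heart valve phenotype"}
disease_x = {"sensory neuron phenotype", "heart valve phenotype"}
disease_y = {"skin phenotype"}

# A list-based comparison only sees the one identical term.
print(patient & disease_x)
# The graph-based comparison also credits the shared nervous system ancestor.
print(best_match(patient, {"disease X": disease_x, "disease Y": disease_y}))
```

In this toy, the motor neuron and sensory neuron phenotypes never match as strings, yet they contribute to the score through "nervous system phenotype", the node the patient and disease X have in common.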
So a lot of people have asked, why then do we actually need yet another clinical vocabulary? Do we not have enough clinical vocabularies already? Well, the answer is we actually do need another clinical vocabulary, because when you look at the UMLS or SNOMED or ICD and you think about the types of phenotypes that we see in these rare and undiagnosed patients, they’re simply not covered by these vocabularies. And there was a really nice evaluation of this presented at ISMB on phenotype day this past year by Winnenburg and Bodenreider, where they actually took the Human Phenotype Ontology and compared it against the UMLS, simply based on string matching and looking at the cross-references between the terms. And the Human Phenotype Ontology has almost 12,000 terms or so now. And basically with the entirety of the UMLS we only have about 50% coverage based purely on string matching, which doesn’t necessarily mean that all of those are even in good alignment, and only 36% based on cross-references. The reality is probably somewhere in between. So we have actually a lot of content in the phenotype ontology that’s simply not represented by current clinical vocabularies that are used in the more mainstream context of EHRs. So I’d like to switch gears now and talk a little bit about how we’re using these graph comparisons in the context of disease diagnosis in a tool called Phenix. And there’s a link to the Phenix tool there and a paper that came out just this fall in “Science Translational Medicine,” and the citation is at the bottom. And so here what we’re doing is we take whole exome sequencing, and this whole exome sequencing is really intended to be as clinically relevant as possible, and we’re using a target panel of 2700 or so known Mendelian disease genes.
So we’re not doing an analysis on the whole exome, but really targeting these particular genes, doing the classic and now standard filtering for removing off-target and common variants that are not known to be causal. And so in this case we come out with fewer stars, which are the candidate variants in our exomes, which are represented by the black boxes. And then we apply a series of filters for, you know, the inheritance pattern, looking for de novo mutations, and then we get a set of candidates out the bottom. And in this case, there are still quite a lot of candidates. But if we combine this with the phenotype matching algorithm that I showed you before, where we take the patient’s data and match it against all the known Mendelian disease-gene profiles from HGMD, ClinVar, OMIM and Orphanet, we can then examine which ones of these candidates are most phenotypically similar based upon those known disease genes, and then combine that data with the candidates that have come out of the exome filtering methods, and in the end come up with a better prioritized list of candidates. And I’m not going to go into all the details of the methods here, but just wanted to make the point that when we do this with some control patients, and here we had 52 patients with diagnosed rare diseases, we actually did really quite well, and the Phenix tool ranks these diagnosed patients really very high. For dominant mutations the average rank for these known diagnosed patients was 1.7, which is really fantastic. For recessives the average rank was five. We didn’t do quite as well there, and for X-linked it was 1.8. So the use of that phenotype data then really greatly improves the prioritization, and you can read in the paper more specifically the details of the methods and also the benchmarking and comparison that we’ve done to support a better understanding of how well this is working, because obviously we’re talking about a very small n here.
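The talk does not give the scoring formula, so the following is only a sketch of the general idea of combining a variant-based score with a phenotype-similarity score to reorder candidates. The gene names, the scores, the equal weighting, and the function `rank_candidates` are all hypothetical.

```python
def rank_candidates(variant_scores, phenotype_scores, weight=0.5):
    """Order candidate genes by a weighted mix of two scores in [0, 1].

    weight=1.0 uses the variant score alone; weight=0.5 gives the two
    sources equal say. Genes with no phenotype score are treated as 0.0.
    """
    combined = {}
    for gene, variant_score in variant_scores.items():
        phenotype_score = phenotype_scores.get(gene, 0.0)
        combined[gene] = weight * variant_score + (1 - weight) * phenotype_score
    return sorted(combined, key=lambda gene: combined[gene], reverse=True)


# Hypothetical candidates that survived the exome filtering step.
variant_scores = {"GENE_A": 0.9, "GENE_B": 0.85, "GENE_C": 0.8}
# Hypothetical similarity of the patient's profile to each gene's known disease.
phenotype_scores = {"GENE_A": 0.1, "GENE_B": 0.2, "GENE_C": 0.95}

print(rank_candidates(variant_scores, phenotype_scores, weight=1.0))
print(rank_candidates(variant_scores, phenotype_scores))
```

With the variant score alone the three genes are nearly tied and GENE_C comes last; once the phenotype match is mixed in, GENE_C moves to the top. A weighted average is just one simple way to combine the two signals.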
And in this case we actually took 40 undiagnosed patients and ran them through the same prioritization method, using both the targeted exome analysis and the deep phenotyping, and were able to diagnose 11 of our 40 patients with known Mendelian disorders. And here’s an example of one of them. Basically in this case these bolded phenotypes, so global developmental delay, delayed speech and language development, feeding difficulties, flat nose, etc., were the things that actually helped with that phenotypic alignment of the graph for this patient with a known disease. And so that was pretty fantastic, because these patients had been seen by a series of expert clinicians already and were at the end of a long diagnostic journey. And so being able to just simply use the graph matching algorithm really helped diagnose these patients, because they had atypical presentations or slight variations on known disorders. And what about those other patients that we can’t solve? What do we do with them? And how do we understand rare disease etiology and discover treatments? Because even for these patients that we have a diagnosis for, we’d like to learn more about how these variants are changing the physiology of the patient and how we can better understand the mechanism of their involvement in the disorder to target therapies. The game doesn’t end with the diagnosis, although certainly for the patients that receive a diagnosis and have been investigated by the Undiagnosed Diseases Program it’s extremely important to those families, and so they are very grateful for a diagnosis. But we want to better understand the biology so that we can actually find therapies for these patients, and then also generally better understand the network of proteins that are involved and the etiology of these diseases.
And in some cases that might lead to a greater understanding of even more common diseases and, therefore, better treatment strategies there as well. So, of course, I’m going to say that we can use model organisms to help us better understand the etiology as well as to target those patients that are still undiagnosed. So in this example we have a very large fish. It has a double mutation in two different genes. It’s been annotated by the zebrafish database [inaudible] as having increased weight, adipose tissue volume change, and glucose homeostasis altered. Then we have a very large mouse, which the MGI database has annotated with increased food intake, hypoglycemia, and insulin resistance. And then in the human we have this mutation in the ALMS1 gene, which we know to be causal for obesity and diabetes and insulin resistance. And in these cases we actually know the relationship between each gene. But the question that we have when we want to try to associate these things is, what do we do when we don’t actually know the genetic basis of the human disease? How can we actually use the information that we have about the phenotype data for these models, and their relationships to known genotypes in these models, to help inform what this question mark might be? When you look at the human coding genome and you look at how much phenotype data we have from the various sources, and in this case we’ve just looked at the GWAS catalog, ClinVar and OMIM, we only have phenotypic coverage for less than 40% of the coding human genome. That leaves a lot of the coding genome for which we don’t know the phenotypic consequences of mutations. However, when you look at various model organisms, including mouse, rat, fruit fly, zebrafish and C. elegans, we see that they also have a similar coverage in terms of coding mutations.
So if you take the human coding genome and take all of the known orthologs for those coding genes in each one of the individual species and ask the question, do we have any phenotype information about the gene if it’s been mutated, we don’t actually have that great a coverage in any one organism either. What’s really interesting is when you combine them. When we merge them all together we actually have incredible coverage, because it turns out that in the individual species we understand something about the phenotypic consequences of mutations for different ones of those human coding genes. So we go up to almost 80% coverage of the human coding genome if we combine data from all of the different known model organisms. Another interesting observation that we’ve made, and this is just showing data from human, mouse and fish, is that when we classify the phenotypic data by either disease or genotype, shown on the left, based upon which anatomical system the phenotype data is in, we see very big differences between the different models. So, for example, we have a great many more nervous system phenotypes in the mouse than in human or zebrafish, but it’s interesting that in the zebrafish we actually have a lot of head and neck phenotypes. In this case it’s because there’s been an enormous amount of work done on craniofacial development in the zebrafish. Now you might ask the question, is that because these different organisms have different benefits for study? That is most certainly the case. The zebrafish are transparent early during development. So we’re going to be looking at types of phenotypes that one can observe during those timeframes, because that’s what they’re really good for. We have incredible information about metabolism and the immune system in the mouse, because the mouse is particularly good for those types of studies.
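The coverage argument made above, that no single organism covers much of the human coding genome but the combination does, amounts to taking a set union over orthologs. Here is a toy sketch of that arithmetic. The ten gene identifiers and the per-source sets are invented and do not reflect the real coverage figures mentioned in the talk.

```python
# Hypothetical universe of ten human coding genes.
human_genes = {f"G{i}" for i in range(1, 11)}

# Hypothetical sets of human genes (or their orthologs) that have any
# phenotype annotation in each source.
with_phenotype = {
    "human": {"G1", "G2", "G3"},
    "mouse": {"G2", "G4", "G5"},
    "zebrafish": {"G5", "G6"},
    "fly": {"G7"},
}


def coverage(covered, universe):
    """Fraction of the universe that appears in the covered set."""
    return len(covered & universe) / len(universe)


per_source = {
    name: coverage(genes, human_genes) for name, genes in with_phenotype.items()
}
combined = coverage(set().union(*with_phenotype.values()), human_genes)

for name, value in per_source.items():
    print(f"{name}: {value:.0%}")
print(f"combined: {combined:.0%}")
```

Each source covers at most three of the ten toy genes, while the union covers seven, because the sources mostly describe different genes.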
I would also conjecture that maybe there are also some life history differences between these different organisms that lead to simply different phenotypic consequences of mutation in the different organisms, which may be observable in one organism for any given ortholog and might not be in another. And so again, I think it’s important to consider the fact that we actually understand different aspects of how genotypes relate to phenotypes from these different organisms. And use of all the data together can help us get a better handle on mechanism. So we have a terrible problem though, and that is that the clinical and model phenotypes are described incredibly differently. A clinician describing Parkinson’s Disease might describe the patient as having resting tremors, whereas a model organism researcher would describe a similar phenotype as a stereotypic behavior. This is not a problem that can be solved by text matching. It’s a conceptual alignment problem. Similarly the patient might have constipation, whereas the model organism biologist might indicate that as decreased gut peristalsis. So there’s an incredible problem then in trying to correlate the phenotypes across the species divide. And this is indicated by the fact that we actually have different vocabularies for describing these phenotypes in the different organisms. In the human we have the human development ontology, and we also have the Foundational Model of Anatomy over here for adult anatomy. So we don’t even in the human agree upon how we can describe anatomical structures during development versus in the adult. And there are many other vocabularies that are not listed here that also describe human anatomy. We have the mouse anatomy. We have the mammalian phenotype anatomy ontology, and the list goes on and on and on and on. And each of these has different relationships that are represented.
Some are based on partonomies, so the lung is part of the lower respiratory tract, whereas others are based on developmental hierarchies, so the lung develops from the lung bud, and so on. So we also have a problem of relationship alignment between the concepts in these vocabularies, as well as the overall content from the different organisms and the different source vocabularies. So, to try to address this problem, a number of years ago we started working on an ontology that we jokingly refer to as the Uber ontology, which would subsume all of these ontologies and help us better align the anatomical and phenotypic concepts across species, but would also leverage the relationship types that were represented in those source ontologies to help us build an Uber structure that would then contain the best of everything that each of these particular source vocabularies had to offer, while also fixing them and feeding back to them in the process. And so in this case you can see the example where the mouse lung alveolus and the human pulmonary alveolus are both subtypes of the Uber alveolus of lung. And this class is defined logically, and I’m not going to talk too much about the logic here, but it’s defined as being part of the alveolar tract. It’s defined as being present only in the mammalian taxa, and it’s a subclass of the alveolus, which is an organ part, etc. And so this whole semantic framework is built such that we can integrate the anatomical and phenotypic relationships and concepts from the different organisms and across different stages of development. And this work has been published a few times here. Another problem is that each organism represents the genotype to phenotype associations very differently. In human we might have a GWAS study that’s related to a SNP profile that’s related to an allele that’s then related to genes, and so those phenotypes down here can be attached, you know, up here, or they can be attached down here.
In the case of C. elegans, the phenotype is attached to an allele; in the fruit fly it’s attached to a set of alleles; in the fish it’s attached to the full genotype; and in the frog it’s attached to the gene. And so from each different organism and each different source the phenotype associations are made to a different component of the genotype. And so we’ve built a small model that helps us integrate these different genotypic parts with the phenotype associations, and we are currently in the process of building a web application that will be publicly available for recording, in a standardized format, the phenotype data and its association to the full genotypes, to validate the relationships and to help address this problem prospectively with a better representation of genotype to phenotype data. And so, in the process of integrating all of these data we’ve collected quite the corpus from a number of different sources, some of which are shown here, in the Monarch system. And in this case I’m just showing some of the commonly used model organisms, mouse, fish, fly and worm, but also many different human sources. So we’re taking all the genotype phenotype data from these different sources and integrating those according to the same unified semantic model for phenotype representation, and addressing the genotype to phenotype association problems that I showed on the prior [inaudible], to help support a relationship between phenotypic annotations and the full genotype, as well as being able to walk up and down the genotype information, so whether the phenotype is associated with a particular allele versus a set of alleles versus the full genotype, including any background variation that might occur. And so we basically have a large collection of genotype phenotype data, including animal QTLs from different livestock organisms.
We have genotype phenotype data coming from some of the well genotyped fancier pets, such as dog and horse, really trying to take advantage of the fact that there’s a whole lot of genotype phenotype data out there. We just need to collect it, align it, integrate it and make it available for our comparison algorithms. So in collaboration with the Undiagnosed Diseases Program, and I’m not going to talk too much about this work, but I just wanted to show an example, we’ve developed a tool that’s akin to Phenix but also leverages the model data as well as other types of data that I’ll talk about in a second. It does not target only a particular exome panel; using all of the exome we can prioritize the variants using the same sort of method where we used the phenotype comparisons, but in this case it is not only against human diseases that we’ve annotated but against the corpus of all of that data that I showed you on the prior slide. So in this case we have a candidate diagnosis for this patient that presented with a series of phenotypes, only four of which are shown here, such as osteopenia, short stature, neonatal hypoglycemia, and chronic acidosis. And in this case the candidate in the SMS gene is based upon a match to the phenotypic profile from a mouse that has been annotated with decreased bone mineral density, increased body length, decreased circulating glucose levels, and decreased circulating potassium levels, and you can see what the subsuming phenotypes are that were in common. So when we match those graphs, these ones in the middle, the green ones, are the things that told us that this was, you know, the most similar phenotypic profile. So, for osteopenia, decreased bone mineral density, which is the same as this term, is the one that’s in common, and similarly decreased circulating glucose level is the subsuming phenotype for neonatal hypoglycemia and decreased circulating glucose level.
And so in this way we’re actually using the model organism data to help us diagnose these patients, and then either another family is identified and/or functional validation is being done in additional model organisms to validate these candidates. Here’s another example, using interactome walking to help support the phenotypic match. In this case, for this patient, again using the Exomiser tool, we have a match of this patient to a known human disorder, but the candidate that we had in this case, MED23, was not the match for this syndrome, which is related to the NIPBL gene. As it turns out, NIPBL is in a protein interaction network with MED8, which interacts with MED23, and so because of these protein interactions we can prioritize this MED23 gene based upon the phenotypic similarity of the patient to this known disease that has a mutation in the NIPBL gene, knowing that mutations in interacting proteins can lead to similar phenotypic effects. And so by using this tool, we’re in the process now of finalizing this study, but we’ve just got some preliminary data for 11 families. We’ve analyzed these cases with Exomiser, and so here we show whether we use variant alone, the phenotyping alone, known disease-gene associations, and then whether or not we use Mendelian filtering based on the trio analysis. How well does the system do in terms of giving us the right candidate as the top candidate? And so if you use variant alone you really only get a few of them in the top 10. And as we combine these different facets of the analysis, so, you know, adding phenotypes to the variants, we get much better. And when we bring them all together, the known candidates come out at the top.
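The interactome step in the MED23 example can be sketched as follows. This is a deliberately simplified stand-in that decays a neighbor's phenotype score by hop count; it is not the network method used by the tool described in the talk, and the score of 0.9, the decay factor and the hop limit are all assumed values.

```python
from collections import deque

# Interaction chain from the example above: NIPBL - MED8 - MED23.
INTERACTIONS = {
    "NIPBL": {"MED8"},
    "MED8": {"NIPBL", "MED23"},
    "MED23": {"MED8"},
}

# Hypothetical phenotype-similarity score for the gene with a known disease.
DIRECT_SCORE = {"NIPBL": 0.9}


def network_score(gene, decay=0.5, max_hops=2):
    """Best phenotype score reachable within max_hops, decayed once per hop."""
    best = DIRECT_SCORE.get(gene, 0.0)
    seen = {gene}
    queue = deque([(gene, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor in INTERACTIONS.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                inherited = DIRECT_SCORE.get(neighbor, 0.0) * decay ** (hops + 1)
                best = max(best, inherited)
                queue.append((neighbor, hops + 1))
    return best
```

Under these assumptions MED23 has no direct phenotype evidence but inherits a reduced score from NIPBL two hops away, which is enough to lift it above genes with no network support at all.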
For six of our 11 families, the correct variant was the top candidate, and in these ones it was in the top five. And for all of them it was in the top 10. So we believe that combining genotype, phenotype, protein interaction data and inheritance models together can actually provide us with that prioritization for these undiagnosed patients. So today we have four families that have a diagnosis, including one novel disease-gene association discovered for York platelet syndrome and then one gene. We have strong candidates identified for 19 families that are undergoing functional validation through mouse and zebrafish modeling, and we’re now going through several hundred more cases using the tool and this cross-species phenotype data. So hopefully we’ll be done soon and we’ll be able to publish that result, and I would definitely look forward to any conversations with people doing similar work to help us validate these candidates. So I’d like to switch gears now and talk about how the clinician knows they’ve provided enough phenotyping, because this is of obvious importance. How many annotations, how many different categories, how many within each? Here we have an example for Schwartz-Jampel syndrome, and in this case you can see the phenotypic profile has over 100 phenotypes in many different anatomical systems. So what we’ve done then is to create a variety of derived diseases that are either more general, or are missing aspects of breadth, or where we’ve introduced noise, and then asked the question: are these derived diseases more similar to the original one, are they more similar to a different disease, or are they not distinguishable beyond random? So we’re just trying to get a handle on, you know, if we take known phenotypic profiles and remove bits of them, how far do we go until it’s no longer most similar to its parent disease?
And so this is really coming back to the idea of finding the most similar phenotypic profile that I showed earlier, where instead of looking at patients versus diseases we’re now looking at known disease phenotypic profiles and removing bits of information from them to see how similar [inaudible] to their own parents. And so that’s what this looks like. And so here we’re removing by categories. So again, I talked about this maybe being the nervous system, so minus the green would be minus the nervous system. Here might be minus the skeletal system, and so on, and so we removed one anatomical system at a time. And so this is what this looks like, and this is just again a few of the phenotypes. So here’s our skeletal case, where we’ve removed all the skeletal phenotypes, versus random, where we’ve removed the same amount of information content from the graph randomly, which is based upon a graph metric that we’re not going to get into. And what we find is that the case with the skeletal information removed is 86% similar to the parent. And randomly it’s about 91% similar, which is actually quite good. And so when we do this analysis for every disease in our corpus, for every Mendelian disease, which is about 7000 plus, we find that semantic similarity between the phenotypic profiles is actually quite robust in the face of missing information. So if we’re missing a whole anatomical system, we can still identify the right disease, even when that information is missing, for about 92% of the derived diseases. So then we do the same thing by lifting. So what if we’re just not very specific about our phenotypes? And I mentioned earlier how important that is.
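The category-removal experiment can be sketched as below. This is only an illustration: the talk's similarity is based on an information-content graph metric that is not specified, so the sketch swaps in Jaccard overlap of ancestor closures, and the ontology and the two disease profiles are made up.

```python
# Toy ontology (hypothetical): term -> parent terms.
PARENTS = {
    "phenotypic abnormality": [],
    "nervous system": ["phenotypic abnormality"],
    "skeletal system": ["phenotypic abnormality"],
    "muscle": ["phenotypic abnormality"],
    "seizure": ["nervous system"],
    "short stature": ["skeletal system"],
    "joint contracture": ["skeletal system"],
    "scoliosis": ["skeletal system"],
    "myotonia": ["muscle"],
}

# Made-up disease profiles used only for this sketch.
DISEASES = {
    "disease A": {"myotonia", "short stature", "joint contracture", "seizure"},
    "disease B": {"scoliosis", "seizure"},
}


def ancestors(term):
    """Return the term plus everything above it in the graph."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def closure(profile):
    """All terms in the profile together with all of their ancestors."""
    result = set()
    for term in profile:
        result |= ancestors(term)
    return result


def similarity(a, b):
    """Jaccard overlap of the two closures (a stand-in for the real metric)."""
    ca, cb = closure(a), closure(b)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


def drop_system(profile, system):
    """Derive a profile with every phenotype under one anatomical system removed."""
    return {t for t in profile if system not in ancestors(t)}


def best_match(profile):
    """Name of the known disease whose profile is most similar."""
    return max(sorted(DISEASES), key=lambda name: similarity(profile, DISEASES[name]))


derived = drop_system(DISEASES["disease A"], "skeletal system")
```

In this toy setup the derived profile loses both skeletal terms yet still matches its parent better than the other disease, which mirrors the robustness to missing categories described above.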
And as it turns out it’s actually very, very important to deeply [inaudible], because that’s where much of the information comes from. Our results here show that it’s actually quite sensitive, and it doesn’t take very many levels up to start seeing a reduction in the specificity of being able to match the phenotypic profile. So at the end of the day the bottom line is we can be missing a whole category, but we need to be very specific when we do describe the phenotypes. And all of this work has been implemented in a five-star rating system to help the clinicians annotate better, which is shown both on our own Monarch website disease pages and in the PhenoTips clinical annotation tool, and the link is here, so that clinicians can actually recognize when the phenotyping is sufficiently distinct from the corpus to help us utilize the full spectrum of both human and [inaudible] organism data to help diagnose the patient. And so that makes for happy clinicians, and actually we found that the data that has come in since we implemented the five-star system has been greatly improved. So before I finish up, I wanted to come back to some work that we’re doing on the algorithms that I think is particularly relevant to the cancer community, which is that of chronology. So here, if we take a patient who is 40 years old and has lower back pain, motor weakness and unpleasant muscle twitching, and this is a construction worker by the way, and another patient, a 65-year-old lawyer who has unpleasant muscle twitching, stumbling and leg weakness, you might think about these patients as being somewhat similar in their phenotypic profiles. But patient one is actually diagnosed with degenerative disk disease with an L3 nerve root problem causing muscle weakness, and a more recent onset of benign fasciculation syndrome.
And this is based upon the chronology of the phenotypic presentation: the lower back pain started quite some time ago, 15 years ago; five years ago, motor weakness; two years ago, unpleasant muscle twitching. And so it’s likely that this particular disorder is related to his profession as a construction worker, which is very different from this patient, who a year ago started having unpleasant muscle twitching and most recently was starting to stumble and have leg weakness. And at 65 years old this patient was diagnosed with Lou Gehrig’s disease. And so based upon the onset of the phenotypes, the age at which they occur, as well as the temporal ordering of these phenotypes, clinicians have, you know, utilized this information to specify the particulars of the diagnosis. We would like to encode this as well into our phenotypic matching algorithm, which looks something like this. So if we have a patient on the top who presents with purple and then blue and then green, and then later orange is added, the phenotypes in their particular graph get added over time. The question is then how do we utilize this information, this chronological ordering and timing of the information, to help us with the matching algorithm. And so this is work that’s being implemented into the algorithm now. I think it’s particularly relevant to the cancer problem, in particular because of the relationships between drugs and chemicals and the patient’s phenotypes over time. And so this is all information that we are now encoding into the algorithm, because it’s really not just the genotype that we want to relate to the phenotypes, but the genotype plus any environmental exposures over time, which would include diet or exposure to chemicals as well as any kind of drug intervention.
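One way the temporal ordering could feed into matching is sketched below. This is an assumption about how ordering might be scored, not the algorithm under development: it simply measures how often two profiles agree on which of two shared phenotypes came first. The onset times for patient one come from the example above; the 0.25 years for patient two's weakness is a hypothetical value standing in for "most recently", and mapping "motor weakness" and "leg weakness" to one shared term is also an assumption.

```python
from itertools import combinations


def onset_order_concordance(a, b):
    """Fraction of shared phenotype pairs whose onset order agrees.

    a and b map phenotype -> years since onset. Returns None when fewer
    than two phenotypes are shared. Tied onsets count as disagreement.
    """
    shared = sorted(set(a) & set(b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return None
    agree = sum(1 for x, y in pairs if (a[x] - a[y]) * (b[x] - b[y]) > 0)
    return agree / len(pairs)


# Years since onset; "weakness" is an assumed shared subsuming term.
patient_one = {"lower back pain": 15, "weakness": 5, "muscle twitching": 2}
patient_two = {"muscle twitching": 1, "weakness": 0.25}  # 0.25 is hypothetical
```

For these two patients the shared phenotypes appear in opposite order (weakness before twitching in one, twitching before weakness in the other), so the concordance is zero even though the unordered profiles overlap.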
And so by adding the ability to operate on genotype plus environment over time, in the context of, you know, non-Mendelian diseases or in the case of cancer, I think this will be particularly relevant in how these algorithms can take that information into account. Similarly, we need to add negation to the algorithm. So we might say the patient does not have this particular gray phenotype over here, and this particular disease does not have this phenotype over here. What that actually means is it’s a very powerful statement to say that there is not a phenotype, because it means that anything that’s below that is also absent. So if you say that there’s no nervous system phenotype, you mean that any type of [inaudible] in that phenotype is not present. And so the graph really looks like this and this for this particular patient and disease. And so we’re utilizing absences in that graph in the matching algorithm as well. So that’s a lot to take into account and it’s obviously a lot of work, but we would be delighted to try to come up with some cancer use cases where we have particular phenotypic outcomes in response to different treatments over time, to help us understand how to better build these algorithms for these phenotypic profile matching cases. And so in conclusion, phenotypic data can be represented using ontologies or semantics for improved comparisons within and across species. And for known disease-gene associations, the human phenotype data that we already have is actually pretty effective at prioritizing variants if we use this graph-based approach. For unknown disease-gene associations, the use of model organisms really greatly improves the variant prioritization, as does basically looking at all the scientific data that we can get our hands on.
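The point about negation, that excluding a term also excludes everything below it, can be sketched as a simple check. The ontology and term names are hypothetical, and how the real algorithm weighs such conflicts is not described in the talk.

```python
# Toy ontology (hypothetical): term -> parent terms.
PARENTS = {
    "phenotypic abnormality": [],
    "abnormality of the nervous system": ["phenotypic abnormality"],
    "abnormality of the skeletal system": ["phenotypic abnormality"],
    "seizure": ["abnormality of the nervous system"],
    "short stature": ["abnormality of the skeletal system"],
}


def ancestors(term):
    """Return the term plus everything above it in the graph."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def conflicts(observed, negated):
    """Observed phenotypes that fall at or below any explicitly negated term."""
    return sorted(t for t in observed if ancestors(t) & set(negated))
```

So a disease annotated as having no nervous system abnormality conflicts with a patient who has seizures, even though "seizure" itself was never explicitly negated.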
And phenotype breadth is recommended to buffer a lack of information, but in particular very specific phenotyping is necessary to ensure the quality of the matches, especially across species. So again, the future work is to add additional variables to our semantic similarity algorithm, which include negation, environment and chronology, and then to validate existing kind of a model for recapitulation of the diseases that we’ve identified, and further characterization of organism phenotypes, adding many more non-model organisms to the analysis and pulling that content in. And I should mention that this work was funded by an R24 from the Office of the Director as well as multiple contracts with the Undiagnosed Diseases Program, and it’s a large international team and I’d like to thank all of these people. But, in particular, Peter Robinson at Charité, who really leads the work on the CDex tool and the Human Phenotype Ontology. Cindy Smith, who works on the Mammalian Phenotype Ontology for mouse. Mike Brudno, whose group developed the PhenoTips tool, without which we’d never get the human data. Damian Smedley at Sanger, who has actually implemented the phenotype matching algorithms using the model organism data and [inaudible] in the Exomiser tool. Harry Hochheiser, who does our visualization work. Maryann Martone, who leads the Neuroscience Information Framework, the underpinnings of which are used for our platform. Chris Mungall, who developed the phenotype matching algorithm. Nicole Washington, who did all the informatics efforts on what constitutes a minimum phenotype. Everyone at the UDP, without whom we wouldn’t have the use cases to study this work, and my team at OHSU, as well as Tudor Groza, who does a lot of the text matching to help us understand better what content is missing from our ontology. And so with that I’ll close and take some questions. Thanks very much.>>Okay, let’s thank Dr. Haendel for a terrific presentation.
[applause] We have time for a few questions, so we’ll open the floor. If you’re in the room please use the microphone in front of you, and if you’re on the WebEx indicate with a raised hand on the dashboard and we’ll unmute your line. >>Excellent presentation. I’m just curious about the portion of your presentation where you mentioned that only 11 out of, I don’t quite remember the number of patients, that didn’t really categorize. You focused mostly on the exome mutations. Have you looked at other variations, you know, things like copy number variation, epigenomics and perhaps protein modifications? I mean, do you think that’s something that will happen in the future? >>That’s a really great question, and I think, you know, at ASHG this year it seems that pretty much every talk I went to was about how do we decide if something is clinically relevant or clinically actionable? And I think right now one of the issues is how do we do this in such a way that it can be as inexpensive as possible, so that we’re not throwing the whole battery of approaches at any given family at once, but rather taking a more narrow approach at first and taking what comes out. And so for those 11 patients out of 40, we took the approach of only doing exome sequencing on a targeted set of exomes, which makes it a lot less expensive and a lot more clinically relevant in terms of being more able to implement that in the clinic. Of course, if we do full exome sequencing or whole genome sequencing we have a plethora of other things that we can try to do. And so for those patients in the 40 that were not diagnosed, that would be the next step that one would take. We would then apply Exomiser and use full exomes and see if we come up with anything. There we’re using the protein interaction data to help support that. We’re working now on use of regulatory mutations and seeing how those can be annotated some more to help comparison.
So, there’s certainly a lot of other kinds of things, not just that we’re doing but that many people are doing, to help support those types of analyses. But that’s, you know, much more of a research project at that point than something that could happen in the clinic.>>Other questions in the room or online? Melissa, let me ask you a question. You did a perfect job of demonstrating the power of graph representation and graph algorithms, in particular when combined with model organism phenotypes. Can you talk a little about what your assessment is of the current state of the Uberon ontology and what you see as the biggest challenges still facing you in that area? >>Well, okay, I’m not sure if this is a politically correct thing to say, but I don’t think the Uberon ontology has ever really received any funding. It’s entirely a volunteer effort. So that’s definitely one of the issues, that most of the work on these standards is really a labor of love that people have put in because they know they need it to help support their data representation, integration and analysis. But it’s not something that’s considered research unto itself, so it’s not a very fundable activity. So maybe that’s not the type of answer that you were looking for. But that’s definitely an issue. More biologically speaking, I think one of the issues is that we really need to help support a more robust platform for the way in which these ontologies are developed. So for example, in the Human Phenotype Ontology, an abnormality of the nervous system would be defined based on an anatomical structure, that’s the nervous system, and so each individual term is defined in terms of terms from other ontologies, and it’s all a large interoperable semantic framework. But we don’t really have very good tools for managing the process by which any one group can contribute.
So, for example, if the zebrafish group decides to go and really expand their representation of the [inaudible], you know, how do we better make sure that that’s consistent with the representation in the Human Phenotype Ontology and the Uberon anatomy ontology and so forth? And so we have built some tools to help support quality assurance checking across these different ontologies, but there’s not really a great way for us to get more citizen involvement, if you will, and make sure that things are developed consistently. So we could be a lot more efficient in the way that we develop these things if we had better tools. So I see that as one of the biggest challenges. Also, while we have quality developmental processes represented for some organisms, we’re quite lacking in certain areas or in certain other organisms. And so those are also gaps that we’re trying to fill, making sure that we understand better the developmental progression of the anatomy in each of the species and each of the systems.>>Yeah, great, thanks. Your answer was perfectly appropriate, and I think pointing out that this is a labor of love is really very important. I mean, we know how much of a challenge it is even in the clinical trial space or in the clinical area, where trying to arrive at common terminology for EHRs is such a challenge. Now when you’re talking across many different communities working on different model organisms, and pointing out how important it is to have easy-to-use tools and data management systems that can keep everything in sync, it seems almost like an insurmountable task. But hopefully we’re going to keep plugging away at that and come to some conclusion. Okay, it sounds like there was another question but I think you answered it. Any others here? I think we’re just about out of time anyway, but, nope, no more questions. Well, thanks again Melissa, terrific presentation.
We hope everybody can join us for our next presentation, which will be on February 4th. Ann Christine Catlin and Marietta Harrison from Purdue University will be discussing HUBzero and the NCIP Hub. Once again let’s thank Dr. Haendel for this very exciting presentation. [applause]
>>Thanks very much and please feel free to contact me if you’re interested in this work and you can reach me on Twitter at one of these two addresses or our new website at monarchinitiatives.org.>>Great, thank you.>>Thanks, thanks very much.
