[music playing] >> Mary Engler: Well, welcome back from break, and I'm delighted to introduce -- after such an incredible morning with such great speakers -- our next speaker, Dr. Bonnie Westra, who'll be presenting Big Data Analytics for Healthcare. Dr. Westra is director of the Center for Nursing Informatics and associate professor in the School of Nursing at the University of Minnesota. She works to improve the exchange and use of electronic health data. Her important work aims to help older adults remain in their communities and live healthy lives. Dr. Westra is committed to using nursing and health data to support improved patient outcomes, as well as to developing the next generation of nurse informaticists -- informatistatcians. [laughter] Okay. Please join me in a warm welcome for Dr. Westra. [applause] >> Bonnie Westra: Is it potato or potato [laughs]? [laughter] So, I am just absolutely thrilled to be here, and this is an amazing audience. It's grown since last year, so this is great. So, today what I'd like to do is to relate the importance of big data in healthcare to what we're talking about today; identify some of the critical steps to make data useful -- when you think of electronic health record data, or secondary use of existing data, there is a lot that has to be done to make it usable for purposes of research; look at some of the principles of big data analytics; and then talk about some examples of the science, and you'll hear a lot more about that in more depth during the week. So, when we think about big data science, it's really the application of mathematical algorithms to large data sets to infer probabilities for prediction. That's the very simple definition. You'll hear a number of other definitions as you go through the week as well. And the purpose is really to find novel patterns in data to enable data-driven decisions.
I think as we continue to progress with big data science, we won't only find novel patterns; in fact, we'll be able to do much more in terms of testing hypotheses. One of my students was at a big data conference that Mayo was putting on in Minnesota, and one of the things they're starting to do now is to replicate clinical trials using big data, and in some cases they're able to come up with results that are 95 percent similar to having done the clinical trials themselves. So we're going to see a real shift in the use of big data in the future. So when I think about big data analytics, what this picture's really portraying is that big data analytics exists on a continuum of clinical translational science from T1 to T4, where there's foundational work that needs to be done, but we actually need to apply the results in clinical practice and to learn from clinical practice so that it then informs foundational science again. When you look at the middle of this picture, what it's really showing is that this is what nursing is about. If you look at the ANA's scope and standards of practice and the social policy statement, nursing is really about protecting and promoting health and alleviating suffering. So when we think about big data science in nursing, that's really our area of expertise. And what you see on the bottom of this graph is that it's really about how we move from data -- you know, we don't lack data; we lack information and knowledge -- so it's really about how we transform data into information and into knowledge, and then the wise use of that information within practice itself. We were doing a conference back in Minnesota on big data, and I happened to run into this graphic, and it's like, how fast is data growing nowadays? What you can see is that data flows so fast that the total accumulation in the past two years is a zettabyte.
And I'm like, "Well, what is a zettabyte?" A zettabyte is a one with 21 zeroes after it. And what you can see is that the amount of data we've accumulated in the last two years equals all the total information in the last century. So the rate of growth of data is getting to be huge. Data by itself, though, isn't sufficient. It really needs to be transformed into information and knowledge. Well, when we think about healthcare, we can see that the definition is that it's a large volume, but it might not be large volume. When you think about genomics, sometimes it's not a large volume, but it's very complex data. And as we think about getting beyond genomics and where we're at, it's really looking at the whole variety of data sources, and it's the integration of multiple datasets that we're really running into now. And it's data that accumulates over time, so it's ever changing, and the speed of it is ever changing. What you can see in the right-hand corner here is that as we think about the new health sciences and data sources, genomics is a really critical piece, but there's also the electronic health record, patient portals, social media, drug research, test results, all the monitoring and sensing technology, and more recently, geocoding. So as we think about geocoding, it's really the ability to pinpoint the latitude and longitude of where patients exist. It's a more precise way of looking at the geographical setting in which patients exist, and there's a lot of secondary data around geocodes that can give us background information about neighborhoods, including things like financial class and education. Now it doesn't mean that it always applies to me, because I might be an odd person in a neighborhood, but it gives us more background information that we may not be able to get from other resources.
So, big data is really about volume, velocity, and veracity, as Dr. Grady pointed out earlier today. Now, as we think about big data: 10 years ago, when I went to the University of Minnesota, my dean, Connie Delaney [phonetic sp], had talked about doing data mining, and I thought, "Oh, that sounds really interesting." Because I was in the software business before, and our whole goal was to collect data in a standardized way that could be reused for purposes of research and quality improvement. I just didn't know what to do with it once I got it. And so I've had the fortune to work with data miners. We have a large computer science department that is internationally known for its data mining, and a lot of that work was funded primarily by the National Science Foundation at that time, because it was really about methodologies. Well, now we're starting to see big data science being funded much more in the mainstream: NIH, the CTSAs, et cetera, are all working on how we fund the knowledge and the new methodologies that we need in terms of big data science. So, an example of some of the big data science that is funded already today is the CTSAs. There are 61-plus CTSAs, Clinical and Translational Science Awards, across the country, and the goal is to be able to share methodologies, to have clinical data repositories and clinical data warehouses, and then to begin to say, "How do we do some research that goes across these CTSAs? How do we collaborate together?" Or look at PCORnet. PCORnet is another example. There are 11 clinical data research networks -- this may have increased by now -- as well as 18 patient-powered research networks.
We happen to participate in one that has 10 different academic and healthcare systems working together, and it means that for our data warehouse we have to have a common data model, with common data standards and common data queries, in order to be able to do research such as what we're looking at with ALS, obesity, and breast cancer. And wouldn't it be nice if we could look at some of the signs and symptoms that nurses are interested in, in addition to looking at specific kinds of diseases? When we look at some of the work of Optum Health as well as other insurance companies, they're really beginning to amass large datasets. Optum Labs happens to have 140 million lives from claims data, and they're adding in 40 million lives from electronic health records, so that provides really large data sets for us to be able to ask some questions in ways that we haven't been able to before. I'm excited about reuse of existing data, and hopefully some of that enthusiasm will rub off on you today, because it's really a great opportunity. Now, in order to use large data sources, what that means is that we need a common data model, we need standardized coding of data, and we need standardized queries. What I mean by that is that if we don't ask about the same variables, and we don't collect or code the data in the same ways, it makes it hard for us to do comparisons across software vendors or health systems or academic institutions. And with the PCORI grant, for instance, we're actually looking at how we do common queries, so that if we've got the common models, we can write a query and share it with others to be able to pull data out from multiple health systems in a similar way. So I'm going to talk about what I mean by that a little bit more and show you examples of how we have to be thinking in nursing about this, as well as thinking interprofessionally.
So when you look at PCORnet, they started with common data model version one, then they went to version two, and now version three is being worked on at this time. You can see in the top left-hand corner we have conditions, which might be patient-reported conditions as well as healthcare provider conditions, but you can also see down in the left-hand corner that there are also diagnoses. Diagnoses carry ICD-9 coding, and ICD-10 as that unfolds. Now, when you think about your science, notice: where is the data that you want for your science, and is it represented in this common data model? I would suggest that there are many types of data in the common data model that are important to all of us as we think about where we're going, whether it's demographics or medications or, you know, what are the kinds of diseases that people have? And there's also something missing as we move forward. So, before I get to what's missing, one of the things that I want to point out as critical is that in order for PCORI or NCATS or any of these other organizations to be able to do queries across multiple institutions, they have to have data standards. And so when we look at demographics, for instance, OMB is the standard that we use. When we look at medications, it's RxNorm. Laboratory is coded with LOINC. Procedures are coded with CPT, HCPCS, or ICD-9/ICD-10 codes. We also have diagnoses that have ICD-9/ICD-10 but, in addition, SNOMED CT codes as another type of standard. And when we look at vital status, we're looking at the CDC standard for vital status, and with vital signs they're using LOINC. So LOINC started as laboratory data. It's expanded to include types of documents, and it has also expanded now to include a lot of clinical assessments.
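To make the idea of standardized coding concrete, here is a minimal sketch, in Python, of a single encounter expressed with the code systems just listed. This is not any actual PCORnet schema; the field layout is an illustrative assumption, though the example codes come from the real terminologies.

```python
# A hypothetical encounter expressed with standard code systems.
# The record shape is invented for illustration only.
encounter = {
    "medication": {"system": "RxNorm",    "code": "197361"},   # an oral tablet
    "lab":        {"system": "LOINC",     "code": "2160-0"},   # serum creatinine
    "procedure":  {"system": "CPT",       "code": "99213"},    # office visit
    "diagnosis":  {"system": "SNOMED CT", "code": "38341003"}, # hypertensive disorder
}

def systems_used(enc):
    """List which code systems this record draws on."""
    return sorted({field["system"] for field in enc.values()})

print(systems_used(encounter))  # ['CPT', 'LOINC', 'RxNorm', 'SNOMED CT']
```

Because every institution codes the same kind of field with the same system, a query written once (say, "all patients with SNOMED CT diagnosis 38341003") means the same thing at every site.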
So you're going to find the MDS used in nursing homes, OASIS that's used in homecare, you'll see things like the Braden Scale or the Morse Fall Scale, and we're expanding more types of assessments that are important to nurses in the LOINC coding. It also, by the way, includes the nursing management minimum dataset; the announcement just came out this week that we've finished updating the variables and they've been coded in LOINC, so if you wanted to look at the work of Linda Aiken, for instance, you'd find standard codes that can be used across multiple settings. So, our vision of what we want to see in terms of clinical data repositories that are critical for nurses is this: when we look at clinical data, we need to expand it to include the nursing management minimum dataset. What that means is we need to look at nursing diagnoses, nursing interventions, nursing outcomes, acuity, and we also have to take a look at a national identifier for nurses. By the way, every registered nurse can apply for an NPI, the National Provider Identifier, so that we could track nurses across settings just like we do physicians or advanced practice nurses; it's available for any RN to apply for. So, when we extend what data's available: what if we added in the interventions that nurses do? The additional kinds of assessments that nurses do? That data is really critical for us to be able to do big data science. What you can also see is that there's management data -- oftentimes we think of that as claims data -- but management data needs to go beyond that when we start talking about standardized units. Like, if I see a patient in an ICU, does it matter? And how do we even name ICUs? Or psychiatric units? At Mayo we used to call it 3 Mary Brigh. Well, how generalizable is that?
So there are ways to be able to generalize the naming of units, and that actually builds off of the NDNQI database. And then when we look at the workforce in nursing, Linda Aiken's work, I think, is just stellar in terms of really trying to understand the things we know about nurses, because they affect patient outcomes, and they also affect our nursing workforce outcomes as well. So our clinical data repositories need to expand to include additional data that's sensitive to nurses and nursing practice, and they also need to go across the continuum of care. Now, at the University of Minnesota, we have a CTSA award, and our partner is Fairview Health Systems. And so you can see here that as we built our clinical data repository, we have a variety of different kinds of data about patients and about encounters that we have available to reuse for purposes of research. You can bet that the students I have in the doctoral program are all being trained to be big data researchers. It's like, "Stick with me, kid, because this is the way we're going." So they use this, but they also use some of the tumor registries or transplant registries as other data sources as well. And this data's available for cohort discovery, recruitment, observational studies, and predictive analytics. Now, when you look at what's actually in there and we characterize that data, we basically have over 2 million patients just in this one data repository, and we have about 4 billion rows of unique data, so we don't lack data. What's important to take a look at is: what is the biggest piece of the pie here? It's flow sheet data. And what is flow sheet data? >> Female Speaker: [inaudible] >> Bonnie Westra: Yeah, it's primarily nursing data, but it's also interprofessional, so PT, OT, speech and language, dieticians, social workers; there's specialized data collection for, like, radiation oncology and that kind of stuff. But a lot of it is nurse-sensitive data.
So one of the things that we've been doing as part of our CTSI or CTSA award is looking at what we call extended clinical data, and developing a process to standardize how we move from the raw data, mapping the flow sheet data to clinical data models. These clinical data models will then become generalizable across institutions; the actual mapping to the flow sheet IDs will be unique to each institution. One of the reasons this is important: I was just working on our pain clinical data model this last weekend, trying to get ready to move it into a tool we call i2b2, and we had something like 364 unique IDs for the way we collect pain data, and those 364 unique IDs actually represented something like 36 concepts. And when you do a pain rating on a scale of 0 to 10, we had 54 different flow sheet IDs that are a pain rating of 0 to 10. Why don't we have one? So, what that means is that we have a concept in our clinical data model called pain rating, specifically 0 to 10. We also have the FLACC and the Wong-Baker and, you know, every other pain rating scale possible in the system. But it means that we have to identify a topic, like pain. We have to identify the concepts that are associated with it. Then we have to look at how we map our flow sheets to those concepts. We then present it to our group in an iterative process for validation before we can actually make it useful for researchers. So we now have a standardized process that we've been able to develop, and now we're moving into trying to develop open source software, so that if you wanted to come play with us and you said, "I like the model you're using and I want to use it, and let's see if we can do some comparative effectiveness research," it's something that can be shared with others.
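The flow sheet mapping step described above can be sketched in a few lines of Python. This is a toy example: the flow sheet IDs and concept names here are invented, since real mappings are site-specific.

```python
# Many site-specific flow sheet row IDs collapse onto one shared concept.
# All IDs and labels below are hypothetical.
FLOWSHEET_TO_CONCEPT = {
    "fs_304001": "pain_rating_0_10",   # "Pain score" on one unit's flow sheet
    "fs_304177": "pain_rating_0_10",   # "Pain intensity (0-10)" on another
    "fs_512233": "pain_rating_0_10",   # a legacy row, same 0-10 scale
    "fs_610045": "pain_rating_flacc",  # FLACC behavioral scale
}

def to_concepts(rows):
    """Rewrite raw (flowsheet_id, value) rows onto shared concept names;
    unmapped IDs are dropped so they can be flagged for review."""
    return [(FLOWSHEET_TO_CONCEPT[fid], val)
            for fid, val in rows if fid in FLOWSHEET_TO_CONCEPT]

raw = [("fs_304001", 7), ("fs_304177", 6), ("fs_999999", 3)]
print(to_concepts(raw))  # [('pain_rating_0_10', 7), ('pain_rating_0_10', 6)]
```

Once each institution maps its own IDs this way, a query written against "pain rating 0 to 10" works everywhere, even though the raw flow sheet IDs differ from site to site.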
And that's part of the nature of the CTSA awards: we develop things that can be used across sites so everybody doesn't have to do it independently. So here are examples of some of the clinical data models that we've been developing. For behavioral health, we have somebody who's a specialist in that area who's working on a couple of models. Most of them are physiological at this point, and we started that way because of another project we're working on. But one of the things we started with internally is we said, "What are the quality metrics that we're having to report out that are sensitive to nursing?" So when you're looking at prevention of falls, prevention of pain, CAUTI, VTE, and one other I can't think of right now, we really tried to take a look at what those things are that are really sensitive to nursing practice, and then how do we build our data models so they can be used for quality improvement but also for purposes of research? If we do certain things at a certain point in time, does it really matter? And then we've extended it to some other areas based on what the most frequent kinds of measures are that might be important to nurse researchers to work with. Now, one of the things the CTSAs do is many of them use a tool called i2b2, and i2b2 can do many things, but one of the first things it does is provide you with de-identified counts of how many patients you have that meet certain criteria, so that if you're going to submit a grant, you would know whether you had enough patients to actually potentially recruit. One of the things that is missing out of it is almost everything that's in flow sheets. So, Judy Warren and colleagues proposed an example of what it would look like in i2b2 if we added in some of the kinds of measures we're looking at, like review of systems or some of the clinical quality measures.
So we're in the process of really looking at a whole methodology for how to move that flow sheet data from the data models into i2b2, so that anybody could say, "Oh, I'd like to study, you know, prevention of pressure ulcers. How many stage four pressure ulcers do we actually have, and what kind of treatments are they getting, and does it matter?" And so that's an example of how this tool will be used. Now, in order to make data useful, it also has to be coded. So remember the slide I showed you that showed we're using RxNorm and we're using LOINC and we're using OMB and we're using CDC codes? Well, when we look at what code sets should be used for standardizing the data that's not already covered by those standards, you'll see that the American Nurses Association has actually recognized 12 terminologies or datasets, and they're done recognizing new ones. Now it's just a matter of continuing to keep them up to date. And so the ANA just came out with a new position statement, "Inclusion of Recognized Terminologies Supporting Nursing Practice within Electronic Health Records and Other Health Information Technology Solutions." What that means is they say in that new paper that all healthcare settings should use some type of standardized terminology within their electronic health records to represent nursing data. That makes it reusable for purposes of quality improvement and comparative effectiveness research. However, when it is stored within clinical data repositories, or when we're looking at interoperability across systems, then SNOMED CT is the standard that would be used for nursing diagnoses. So you might use the Omaha System or NANDA or CCC or any of these, but it has to be mapped to SNOMED CT, so that if I'm using the Omaha System and you're using ICNP, they can actually talk to each other where they have comparable terms.
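A minimal sketch of what that mapping to SNOMED CT buys you, assuming two sites with hypothetical local labels: 38341003 (hypertensive disorder) is a real SNOMED CT concept, but the local terms and the simple dictionary maps are illustrative assumptions, not any vendor's actual crosswalk.

```python
# Two sites document the same problem in different ANA-recognized
# terminologies; each maps its local term to a SNOMED CT concept code.
SITE_A_TERMS = {"Omaha: Circulation": "38341003"}   # hypothetical Omaha System label
SITE_B_TERMS = {"ICNP: Hypertension": "38341003"}   # hypothetical ICNP label

def comparable(term_a, term_b):
    """Two local terms are comparable if both map to the same SNOMED CT code."""
    code_a = SITE_A_TERMS.get(term_a)
    code_b = SITE_B_TERMS.get(term_b)
    return code_a is not None and code_a == code_b

print(comparable("Omaha: Circulation", "ICNP: Hypertension"))  # True
```

The comparison never happens on the local labels themselves, only on the shared SNOMED CT code, which is exactly why the mapping step is required before cross-system research.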
What the ANA has also recommended is that nursing interventions, while there are many standardized terminologies, actually use SNOMED CT for information exchange and for building your data warehouses if you're using different systems that you want to do research with. Nursing outcomes would be coded with SNOMED CT, sometimes maybe LOINC, and assessments with LOINC, and I won't go into all the details underneath that, because it's more complicated than that: sometimes the answers are LOINC and sometimes they're SNOMED CT, depending. So there's a lot that goes on behind the scenes, but this is really important because -- and this actually comes off of the ONC recommendations for interoperability for clinical quality measures; that's how these standards actually came about -- it's consistent with federal policy when we're doing this. So, ANA, it's on their website. The URL was so long that we had permission just to put it on our website and give you a short URL. So if you want to learn more about it, the URL is listed down here. Another effort that's going on, in addition to some of the foundational work we're doing through the CTSA, is a whole group headed by Susan Matheny that's about how we build out an assessment framework and very specific coding for the kinds of questions that we ask for physiological measures. So when we look at the LOINC assessment framework, we start first with physiological measures, and then there are other things shown in orange, called the future domains, that also have to look at what the assessment criteria are that are documented in electronic health records and need standardized code sets.
So there's a group that Susan Matheny is heading up that includes software vendors and different healthcare systems, people with EHRs that aren't the same EHRs, and they're pulling together a minimum set of assessment questions and getting standardized codes for those questions, which were just submitted to LOINC, I think at the end of June, for final coding and distribution in the next release of LOINC. And this group is continuing on to build out additional criteria for assessment, so that we have comparable standards across different systems. Now, I mentioned the nursing management minimum dataset. This was actually developed back in about 1997, recognized by the American Nurses Association, and has just been updated for two out of the three areas. So in the environment you can see the types of data elements that are included -- and these are very high-level data elements; there's a lot of detail underneath them -- and you can see nursing resources. Now, when this was updated, we harmonized it with every standard we could possibly find. A lot of it has been NDNQI, the National Database of Nursing Quality Indicators, but it's also been harmonized with every other standard we could find, so that we wouldn't have inconsistent standards for these types of variables. Also, if you've followed the Future of Nursing work from the IOM report and the Robert Wood Johnson Foundation, it matches the workforce data that they're trying to collect through the state boards of nursing. So again, if you're collecting data for one reason, you can in fact use it for multiple reasons when you're using a standard across the country. So, there is a reference here.
You can go to LOINC.org, and if you look under news you'll see the release that came out this last week about this, and you'll also see that if you go to the University of Minnesota website, the implementation guide is available that gives you all of the details that you never wanted to know but need if you're actually going to standardize your data. So, the point of all this is that when you think about using big data and you want to do nursing research, it's really critical that we think about all of our multiple data sources, whether it's the electronic health record or, if you're thinking about the nursing management minimum dataset, for instance, scheduling data and HR data. And that doesn't even begin to get into all the device data and the personal data contributed by patients. So that's additional data; think about what it's going to take to standardize that as well. It won't be on my plate, but many of you might want to actually do that, because it's a really good way to begin to move forward. So the message that I wanted to leave you with on that is: there's lots of data. When we think about nursing research, we are at the very beginning of starting to say: What data? How do we standardize that data? And how do we store and retrieve that data in ways that let us do comparative effectiveness research, or some of the big data science, with it? Just one example, which I'm not going to cover today but will talk a little bit about tomorrow: we're pulling data out of electronic health records to try to understand which patients are likely to have sepsis. And then there's the sepsis bundle: if you do certain types of evidence-based practice quickly and on time, you can actually prevent complications. Well, we're pulling out electronic health record data, and guess what? This is really interesting.
We got an NSF grant to do this, and so we said, "Well, we're going to look at evidence-based practice guidelines, nurses and physicians." Well, guess what? The evidence-based practice guidelines for nurses aren't really being used. And so we're having to figure out how you would find the data. Not because nurses aren't doing a good job; the guideline software just wasn't used in the way we thought. So then we said, "Well, we'll look at certain data elements, and then we're also going to look at physician guidelines: are they being used?" So, in order to know if you did something in a timely manner, you have to know when somebody suspected that sepsis began. Do you know where that's located? Maybe in a physician's note. And so the best way to find out if patients are likely to develop sepsis is nurses' vital signs and the flow sheet data. And so consistent documentation in those flow sheet data becomes really critical. And then, if they're being followed and adjusted, you have to understand things like fluid balance, cognitive status, your laboratory data as well as the vital sign data that goes with that, and lots of other stuff. So this EHR data is critical in terms of being able to really look at how we prevent complications. So I'm going to move now into more of the analytics. When we think about analytics, there is a book -- it's free online, and this is not an advertisement for them, but it was one that changed my life -- called "The Fourth Paradigm," and it really talks about how we move into data-intensive scientific discovery. And one of the things that I think is really interesting is: how many of you have ever read a fiction book called "The Time Keeper"? It is really a fun book.
The thing that's fun about it is it talks about how, before people knew time existed, nobody had picked up the observational pattern; then, thousands of years ago, somebody basically said, "Oh, there is this repetitious thing called time." It then goes on to talk about the consequences for us, of how we want more of it, you know? And so it's not always a good thing to discover things. But our first science was really about observations, really trying to understand: what do we notice? What's the empirical data? We then moved into thinking about a theoretical branch: what are our models? How do we increase the generalizability of our science? From there we've moved, in the last few decades, into the computational branch, which is really about how we simulate complex phenomena. And now we're moving into data exploration, or something that's called e-Science. So you can hear the term big data, or big data science; e-Science is another term that's used for that. When you look at that, what you can see is that we have data that's being captured by all kinds of instruments. We have data that's processed by software, and we have information and knowledge that's stored in computers. And so what we really have to do is look at how we analyze data from these files and these databases to come up with new knowledge. And it requires new ways of thinking, and it requires new tools and new methods as we move forward. So, foundational to big data science is algorithms and artificial intelligence. How do we take a look at if this, then that; if this, then that? It requires structured data, you know, so that we can develop these algorithms to be able to come to conclusions. Now, machines are much faster at processing these algorithms than the human mind is, and they can process much more complex ones. So our big data science is really about the use of algorithms that are able to process data in really rapid ways. It's what we call semi-artificial.
Not totally like you just throw it in there and it does it and gives you the answer; there's a lot more to it than that. So there are some principles about big data science that are important, and one of those principles is: let the data speak. What that means is -- take CAUTI, which is one of the subjects one of my students is working on. She's really trying to understand: we have these guidelines for preventing catheter-associated urinary tract infections, so if we follow the guidelines, why aren't we doing any better? And what's missing is we probably don't have the right data that we're looking at. So she's actually combining some of the management data along with the clinical data to try to say: are there certain units? Are there certain types of staffing? How does staff satisfaction play into all of this? What's the experience? What's the education? What's the certification, the background? And so she is throwing in more types of data and then trying to let the data speak in terms of, you know, does this provide us any new insights that we can think about? Another principle is to repurpose existing data. So once you have data -- 80 percent of big data science is the data preparation; I think it's closer to 90 -- it takes forever to get the data set up, because it's not like you're collecting new data with a standardized instrument that has all this validity and reliability behind it, so there's a lot of data preparation and transformation that needs to go on. So once you've got that done and you understand the data and the metadata -- that is, the context, the meaning, the background: why do we collect this? What does it actually mean? Give me the context of this -- then we can understand: How is it collected? Why was it collected? What are the strengths of it? What are the limitations?
When I first started in this, I worked in homecare software. There wasn't anything I didn't know about OASIS, because I learned a ton by making every mistake, working with everybody I could, and understanding it thoroughly. When I went to work with big health system data, I was a novice all over again. So once I get a good dataset set up, believe me, I'm going to be working with that forever. And so you'll see some examples of that tomorrow in a different talk. So in big data science, another thing that we have to think about is N equals all versus sampling. So it's not necessarily about random sampling; it's really about, once you've got all the data, how does that affect your assumptions about what you're doing in science? And there's another principle called correlations versus causality. So, you know, randomized clinical trials are trying to understand the why. Why did this happen? And what we're trying to understand when we've got big data is, you know, what's the frequency with which certain things occur? What's the sensitivity? What's the specificity? How do we understand the probabilities that go with it? And so we're oftentimes looking at correlations versus trying to look at causation. Big data's messy. I've had a chance to work with our CTSI database, where they've done a lot of cleanup and standardization, and then I've worked with the raw data, same software vendor. I've certainly learned that once you have the data and you clean it up, it really makes a difference. And will it ever be perfect? Absolutely not. But we think our instruments are perfect, you know? And they're actually not either. So there is a certain probability that things occur, and when you get a large enough dataset, it really makes a difference in how you work with the data. And then there's also a concept called data storage location.
So, there are some people that think you should put all the world's data into a central database and work with it, and then there are others that do something called federated data queries. Federated data queries are where, like with our PCORI grant, everybody has their own data. It's modeled in the same way, and so we can send our queries out to be able to do big data research without having all the data in the same pot at the same time. Another thing that's really critical is big data is a team sport. I can't say that enough. If you ask me all the mathematical foundation for the kind of research we're doing, I'm not the one that can tell you that. I work with these computer science guys that have very strong mathematical backgrounds, and I get educated every day I work with them. And I also know from experience that they really don't understand clinical. And so, you know, when we had a variable for gender, they were going to take male and do male/not male, female/not female. And it's like, you only have two answers in the database, so why do we need four answers [laughs], you know, for this? But that's just a simple thing. They don't understand, like, you know, what's a CVP, for instance. I have to actually look some of that up now too as I'm getting further away from clinical, but it's really trying to understand: you need a domain specialist. You need a data scientist. A data scientist is an expert in databases, machine learning, statistics, and visualization. And you need an informatician. So how do you standardize and translate the data to information and knowledge? So, you know, understanding all that database stuff and the terminology stuff is really important. As I said, 80 percent is preprocessing of the data. And then there's a whole thing called dimension reduction and transformed use of data. So, one of my students said, "Well, I want to use ICD-9 codes, so I'll ask for those." And I'm like, "What are you going to do with them?"
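The federated-query idea described above can be sketched simply: each site runs the same query against its own data and returns only aggregates, so the raw records never leave the site. The site data and field names below are invented for illustration.

```python
# Hypothetical sketch of a federated data query: every site holds its own
# records (modeled the same way), runs the shared query locally, and
# returns only aggregate counts to a coordinator. All data is invented.

site_a = [{"age": 72, "cauti": True}, {"age": 65, "cauti": False}]
site_b = [{"age": 80, "cauti": True}, {"age": 58, "cauti": True}]

def local_query(records):
    """Run the shared query locally; only aggregates leave the site."""
    return {"n": len(records),
            "cauti_cases": sum(1 for r in records if r["cauti"])}

# The coordinator combines the per-site aggregates, never the raw rows.
results = [local_query(site) for site in (site_a, site_b)]
total_n = sum(r["n"] for r in results)
total_cases = sum(r["cauti_cases"] for r in results)
```

This only works because the sites agreed on a common data model up front, which is why the talk keeps returning to data standards.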
And so she finally got down to: what I really need to understand is that certain diseases predispose people to having CAUTI. And so I only need to be able to aggregate them at a very high level -- and so it means you have to know your whole ICD-9 structure and be able to go up to immunosuppressive drugs, for instance, or other diseases that predispose you to getting infections, or a previous history of infections. So you don't want 13,000 ICD-9 codes. You really want high-level categories. So it's learning how to use the data, how to transform the data. A lot of times we have many questions that represent the same thing, so do you create a scale? If the assumption for your data model is that you need binary data, how do you do your data cuts? You know? So with OASIS data we use "no problem or little problem" versus "moderate to severe problem" because we need a binary variable. And so it's that kind of stuff that you need to do. And then there are all kinds of ways of asking, how do you understand the strength of your answers? You can quantify uncertainties, so you're looking at things like accuracy, precision, recall, trying to understand sensitivity and specificity, using AUCs to try and understand the strength of your models. So I'm going to quickly go through just a few examples of how we're now moving into using some of these types of analysis and some of the newer methods of being able to analyze data. So, one is natural language processing. Another is visualization, and a third is data mining. What I'm not going to do is address genomics. I wouldn't touch that one; it's not my forte. So, natural language processing -- another name for it is text mining. And, as we take a look at this, five percent of our data is really structured data and most is not structured data. So we really need to think about how we deal with that unstructured data, because it has a lot of value within it.
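The two transformations just described -- rolling thousands of ICD-9 codes up to a few high-level categories, and cutting an ordinal item into a binary variable -- can be sketched as below. The code-to-category map is a tiny invented example, not a real ICD-9 grouper.

```python
# Hypothetical sketch of dimension reduction: map individual ICD-9 codes
# to a handful of high-level categories, and cut an ordinal severity
# score into a binary variable, OASIS-style. The mapping is invented.

ICD9_CATEGORY = {
    "042":   "immunosuppression",   # HIV disease
    "599.0": "urinary_infection",   # urinary tract infection
    "250.0": "diabetes",
}

def rollup(codes):
    """Reduce raw ICD-9 codes to high-level categories, dropping unknowns."""
    return {ICD9_CATEGORY[c] for c in codes if c in ICD9_CATEGORY}

def binary_cut(severity):
    """Cut an ordinal scale: 0-1 (no/little problem) vs 2+ (moderate/severe)."""
    return 1 if severity >= 2 else 0

cats = rollup(["599.0", "250.0", "V58.69"])  # unknown code is dropped
flag = binary_cut(3)
```

A real analysis would use the full ICD-9 hierarchy (or a published grouper) rather than a hand-built map, but the shape of the transformation is the same.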
But an NLP can actually help us create structured data from unstructured data so that we can then use that data more effectively. So it really uses computer-based linguistics and artificial intelligence to be able to identify and extract information, and free-text data is really the source. Think of nurses' notes, for instance. The goal is to create useful data across the various sites and to be able to get structured data for knowledge discovery. And there are very specific criteria for trustworthiness. When I did my doctoral program and we wanted to do qualitative research -- that was many years ago -- people were a lot like, well, that sounds like foo foo. [laughter] Well, now there are really trustworthiness criteria, and there are trustworthiness criteria for data mining as well. So, how many of you have heard of Watson? Yeah, so when you think about Watson, Watson was initially tested with Jeopardy, you know? And finally it beat human beings. So now IBM is actually moving into how can we use that for purposes of healthcare? And how do we begin to harness the algorithmic potential of Watson? So Watson is really an opportunity to begin to think about big data science, and do you know how they're training it? They're doing almost a kind of think-aloud with physicians. Like, how do you make decisions? You know, they're reviewing the literature to see what's in the literature. We need some nurses feeding data into Watson so that we can get other kinds of data in addition. But Watson uses natural language processing to then create structured data to do the algorithms. So when you think about another example, how many have heard of Google Flu Trends? Yeah, so with Google Flu Trends, one of the things is how do you mine data on the Internet? What kinds of things are people actually searching for that are about flu? What are the symptoms of flu?
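At its very simplest, creating structured data from free text looks like the sketch below: matching a note against a known symptom vocabulary. The symptom list and note are invented, and real clinical NLP needs much more (negation detection, abbreviation expansion, concept normalization) than keyword matching.

```python
import re

# Hypothetical sketch of the simplest form of text mining: turning a
# free-text nursing note into a structured set of symptom flags by
# matching against a small invented vocabulary.

SYMPTOMS = {"edema", "dyspnea", "fatigue", "orthopnea"}

def extract_symptoms(note):
    """Return the known symptom terms mentioned anywhere in a note."""
    words = set(re.findall(r"[a-z]+", note.lower()))
    return SYMPTOMS & words

note = "Pt reports fatigue and mild dyspnea on exertion; no edema noted."
found = extract_symptoms(note)
```

Note that naive matching flags "edema" here even though the note says "no edema noted" -- exactly the kind of error that makes genuine NLP, with negation handling, necessary.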
What are the medications you take for managing the symptoms of flu? And what they found is that Google Flu Trends could actually predict a flu epidemic before the CDC could, because it was based on patients searching their symptoms, and based on that, they could see that there was this trend emerging. Now, when they actually looked at who had flu, the reported flu versus the Google trends, the CDC outdid Google, but it pointed to an emerging trend that was occurring. And actually, what we're seeing now is we're doing some of that kind of mining of data with pharmaceutical reports, looking for adverse events. And so we're using the FDA's adverse event reporting system, and what they're finding is that as they look at the combinations of different drugs that people are taking, they're beginning to see where adverse events are occurring through combinations of different drugs that previously weren't known. So when you think about it: we do these clinical trials, we get our drugs out on the market. After the drug's out on the market, it's like, how do they actually work in the real population? And I think Eric's presentation earlier, with that new graphic that just came out in Nature, showed that one out of 10 or one out of 15 people actually benefits; the question is, how many people get harmed? And how do we know what combination of drugs could actually cause harm? So there's some really interesting stuff that's going on with mining data and looking at combinations to try to understand, are there things we just don't know? So another area is looking at novel associative diagnoses. When I first read this I was like, "I don't get it." And what it is, is that we're really trying to understand what kinds of meaningful diseases co-occur together that we previously didn't know about. So an example is obesity and hypertension. That's a real common one. We know that those two go together frequently. But how many combinations of diseases that we just don't understand go together?
So there's a team of researchers that compared literature mining with clinical data mining, and what they did is, with this massive dataset, they looked at all the ICD-9 codes. So this person has these three or five or 14 diagnoses that all co-occur together, and they asked, "What do we see in the literature about which diagnoses co-occur together?" Because they thought that they could validate commonly known ones, which they could, and they could discover new ones that needed further investigation. Well, when they looked at that, they found there's very little overlap between the diagnoses in the clinical dataset and in the literature. So the question is: is it that the methodology needs to be improved? Is it that we only know the tip of the iceberg of what kinds of things co-occur together? Can we gain new insights about new combinations that frequently co-occur together that can help us predict problems that people have and try to get ahead of them? Another example is early detection of heart failure. So there was a study that was done -- and I won't pronounce the name of this person and the team -- and what they were really trying to do is determine whether automated analytics on encounter notes in the electronic health record might enable the differentiation of subjects who would ultimately be diagnosed with heart failure. So if you look at the signs and symptoms that people are getting, can you begin to see early on that this person's going to be moving into heart failure, or that their heart failure might actually be worsening? So that you can anticipate and try to prevent problems, and make sure that the right treatment is being done? So they used novel tools for text mining notes for early symptoms, and then they compared patients who did and did not get heart failure. The good news is they found that they could detect heart failure early.
The bad news is people who didn't get heart failure also had some of those symptoms. So again, we're at the beginning of this kind of science, and it really needs to be refined so that we can get better specificity and sensitivity in these algorithms that we're developing for predicting. Now, visualization is another type of tool, and so you think about, how do we understand massive amounts of information? There are a lot of different tools for helping us to quickly see what is going on, and these are just examples of visualization -- don't worry about the details. But what you can see is there was a study done by Lee [phonetic sp] and colleagues where they were trying to understand older adults and their patterns of wellness from point A to eight weeks later. But what they were really trying to do in this study is to say, what kind of way can you visualize holistic health? Do you visualize holistic health, and the change in holistic health over these eight weeks, by using a stacked bar graph, you know, or one of the other types of devices? And then they had focus groups and they tried to say, "What do you think about this?" You know, "How well does that help you to process the information?" And so it helped them to be able to think about it -- it's really a cognitive science kind of background: how do people process information? What kinds of colors, how much contrast, what shapes and designs help people process information? So this is kind of an emerging area where we're really trying to understand patterns related to different phenomena. Karen Munson, for instance, one of my colleagues, has been looking at this with public health data, and she's looking at what are the patterns of care for maternal-child health patients?
Moms who have a lot of support needs from public health nurses -- and are there individual signatures of nurses in how they provide care, and are certain patterns more effective, and with what subgroups of patients are those patterns more effective? So she's using visualization, more like this stream graphic over on the top left side here, to look at signatures of nursing practice over time. So one of the things I find is that as we're doing data mining, the genetic algorithms are increasing in their accuracy and their abilities. So think about the financial market. I don't know about you, but I came back from a trip to Taiwan one time, went to purchase something at RadioShack, and my credit card was declined. And I'm like, "What do you mean my credit card's declined?" And they said, "It's declined." And so I'd used it in Taiwan. What I didn't know is that was an unusual pattern for me, and they happened to pick it up, and they said, "Were you in Taiwan?" And I'm like, "Yeah, I was in Taiwan." They said, "Okay, fine. We'll enable your card again." Well, it used to be that they would do a 25 percent sample of all the transactions to be able to pick up these abnormal patterns to try to look for fraud. Now they can actually process 100 percent of transactions with fairly good accuracy. So if they can do that with bank transactions, why can't we do that with EHR data? And part of it is they have nice, structured data [laughs], you know? Compared to what we're using. So data mining is really about how you look at a data repository, select out the type of data you want, do preprocessing on that data, which is 80 percent of the work, do transformation -- so creating scales or looking at levels of granularity. But then it uses some different kinds of algorithms and different analytic methods. So up until I got to data mining on this graphic, we're really talking about traditional research in many ways.
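The credit-card story above is classic anomaly detection: flag a transaction that falls far outside a customer's usual pattern. A minimal sketch, with invented amounts and a simple z-score rule (real fraud systems use far richer models), looks like this:

```python
import statistics

# Hypothetical sketch of fraud-style anomaly detection: flag a charge
# that sits many standard deviations away from a customer's usual
# spending. Transaction amounts and the threshold are invented.

history = [42.0, 38.5, 51.0, 45.2, 40.1, 47.9, 44.3, 39.6]

def is_anomalous(amount, history, threshold=3.0):
    """Flag an amount more than `threshold` std devs from the mean."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(amount - mean) / sd > threshold

usual = is_anomalous(46.0, history)      # typical purchase
overseas = is_anomalous(890.0, history)  # the Taiwan-style outlier
```

The same pattern-versus-baseline idea is what makes screening 100 percent of transactions (or, potentially, EHR events) feasible: the per-record check is cheap once the baseline is computed.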
But when we get to data mining, we're then looking at all kinds of different algorithms that get run, that are semi-automated, that can do a lot of the processing that we have to do manually in traditional statistical analysis. And in order to come up with results, the next step is critical. We can come up with lots of really weird results. I can't remember the one that Eric showed earlier, or maybe Patricia Grady did, when she said, you know, "Diapers and candy bars," or something like that. But whatever it was, it doesn't make sense, and so we really have to make sure that we're using our domain knowledge in order to see, is this actually clinically interpretable as we move forward? So, data mining is also known as knowledge discovery in databases. It's automated or semi-automated processing of data using very strong mathematical formulas, and there are absolutely ways of being able to look at the trustworthiness of the results. A lot of it is sensitivity, specificity, recall, accuracy, precision. There's also something called the false discovery rate, which is another way of checking the validity of what you're finding. And there are lots of different methods, so some of those methods are association rule learning, there's clustering analysis, there's classification like decision trees, and many new methods that are emerging constantly. So it's not like you can say data mining is just data mining. It's like saying quantitative analysis, you know? So it's lots of different methods of being able to do this. I think an example of data mining is the fusion of big data and little babies.
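Association rule learning, the first method named above, rests on two simple measures: support (how often a combination occurs at all) and confidence (how often the consequent appears given the antecedent). A minimal sketch over invented diagnosis sets, using the obesity/hypertension example from earlier:

```python
# Hypothetical sketch of association rule learning: compute support and
# confidence for the rule {obesity} -> {hypertension} over a tiny set of
# invented patient diagnosis profiles.

patients = [
    {"obesity", "hypertension", "diabetes"},
    {"obesity", "hypertension"},
    {"hypertension"},
    {"obesity"},
    {"diabetes"},
]

def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) for antecedent -> consequent."""
    n = len(transactions)
    has_ante = sum(1 for t in transactions if antecedent <= t)
    has_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return has_both / n, has_both / has_ante

support, confidence = rule_stats({"obesity"}, {"hypertension"}, patients)
```

Real algorithms such as Apriori search the space of all candidate rules efficiently, but every candidate is still scored with these same two quantities.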
So there was actually a study that was done looking at all the sensor data in a NICU, trying to understand who's likely to develop infections, and by capturing continuous data from multiple machines, they were able to pick up who was going to run into trouble 24 hours earlier than the usual methods, and to head it off with the NICU babies. So it has very practical applications. Another example is looking at type 2 diabetes risk assessment -- not just with association rules, but now we're moving into newer methods of looking at time series along with association rules, trying to see patterns over time, and how those patterns over time and the rules you can create from the data will predict who's likely to run into problems. And so some of the work that George Simon [phonetic sp] has done with his group has really looked at survival association rules, and they substantially outperform the Framingham score in terms of being able to look at the development of complications. So, in conclusion, big data are readily available. We don't lack data. The information infrastructure is critical for big data analytics. One of my colleagues I've done research with said, "I just keep hoping one of these days you can just throw it all in the pot and something will happen." [laughter] And it's like, that is not what big data analysis is about. There are rules just like there are for qualitative research or quantitative research. And the analytic methods are now becoming mainstream. So 10 years ago it would be really hard to get data mining studies funded unless you went to the NSF. Now that's getting to be more and more mainstream. As a matter of fact, if you look in nursing journals and you look for nurses who are doing data mining, you won't find a lot out there yet. So it's still really at the beginning, but at least we're starting to get some funding available now for doing it.
So, one of the implications out of this that we really need to be thinking about is how are we training our students, the emerging scientists? How are we training ourselves here today? But how are we training the emerging scientists to really be prepared to do this kind of science of big data analysis, and the newer methods that need to be done? How do we think about integrating nurses into existing interprofessional research teams? So, I don't know about you, but how many nurses do you know that are on CTSAs doing the data mining with nursing data as part of the data warehouse? Or on PCORI grants where they're building out, you know, some of the signs and symptoms that nurses are interested in, and the interventions, in addition to the interprofessional data? And so it's really important that we take a look at making sure that we're including nurse-sensitive data as part of interprofessional data, and that means that we really need to be paying attention to the data standards, you know? So that we are collecting consistent data in consistent ways with consistent coding, so we can do the consistent queries to be able to really play in the big data science arena. So with that, I'll stop and see if you have any questions. I think we have one minute [laughs]. [applause] We have a question over here. Okay, so the question is how do you find colleagues, like in computer science, who can really help you? Well, I tell you, I was really ignorant when I started. I actually worked with somebody from the University of Pennsylvania the first time I did it because I didn't know any data miners at the University of Minnesota. And then I got talking with colleagues who said, "Oh, do you know so-and-so who knows so-and-so?" And then I started actually paying attention to what's being published at the University of Minnesota.
It turns out that Vipin Kumar, who's head of the computer science department, is actually one of the best internationally known computer scientists. Actually, he and Michael Steinbach, one of my research partners, have their own book published on data mining, for the class that my students take along with the computer science students. So, one: start by looking at some of the publications coming out of your university. It's the first place to start to figure out if you have anybody around who can do data mining. And I just didn't even know to think about that when I first started. So it's a good way to start. Part of it is paying attention to -- if you go to AMIA, for instance, there's a whole strong track of data miners that have their own working group at AMIA. Also, there are a lot of data mining conferences going on, and so if you just start searching -- I mean, personally, I would do "data mining" and "University of Minnesota" in Google, and that's a really fast way of finding out who's doing that, as another strategy to try to find partners. And they were thrilled to death, believe me, to get hooked up with people in healthcare because they knew big data was an emerging area. They just knew that they didn't know it, and I didn't know what they knew, so together it made a good partnership. Okay, thank you. [applause] >> Mary Engler: Thank you, Dr. Westra, that was just wonderful. [music playing]
NINR Big Data Boot Camp Part 3: Big Data Analytics for Healthcare - Dr. Bonnie Westra (posted 2015/11/26)