Notes from module 10 of the Interprofessional Health Informatics course I’m working on (plus side reading that I did to fill in some blanks/learn more about some things mentioned in the course).

- Big Data = data sets too large to manipulate with traditional database methods
- there are terabytes and terabytes of patient data being collected
- data is also collected from instruments, devices, sensors, social media, mobile technologies, etc.
- see notes re: eScience
- methods: data mining, data visualization – helps you generate hypotheses from the data
- neural networks – e.g., can help with pattern recognition
- 3 big projects:

- Exploring and Understanding Adverse Drug Reactions project
- computer system to detect ADEs (adverse drug events)
- EHRs in 4 EU countries
- analyzing the EHRs for signals, combos of drugs & AEs

- Exploring the Frontier of EHR Surveillance: The Case of Postop Complications
- data mining

- MetroHealth
- replicated a Norwegian study of heart disease risk by data mining EHRs
- 3 months (vs. 13 years) and gave more precise results

- eScience Challenges
- how do we codify and represent our knowledge?
- ontologies provide common scheme on how to organize, reorganize, and share data

- digital infrastructure for data capture, standardized data elements and transfer, data analysis, dissemination, and research funding are all needed
- ontology-based search – allowed by structured data

## Data Visualization

- these 4 sets of data (which I Googled and found are known as “Anscombe’s quartet“) have identical summary statistics (i.e., the mean and variance of x are the same for all 4 sets, and the mean and variance of y are the same for all 4 sets)

| x (I) | y (I) | x (II) | y (II) | x (III) | y (III) | x (IV) | y (IV) |
|-------|-------|--------|--------|---------|---------|--------|--------|
| 10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
| 8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
| 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
| 9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
| 11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
| 14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
| 6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
| 4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
| 12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
| 7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
| 5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |

- for all 4 sets:
- mean of x = 9
- mean of y = 7.50
- variance of x = 11
- variance of y ≈ 4.12
- correlation between x and y = 0.816
- linear regression equation: y = 3.00 + 0.500x

- so they are “statistically identical” but when you graph them, you see they are quite different!
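The “statistically identical” claim is easy to check. A minimal sketch using only Python’s standard library (Python here is my choice for illustration, not something the course used), run on dataset I from the table above; the other three sets give the same results to 2–3 decimal places:

```python
# Verify the shared summary statistics of Anscombe's quartet (dataset I),
# using only the Python standard library.
from statistics import mean, variance

# Dataset I from the table above
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

# Least-squares regression line: slope = cov(x, y) / var(x)
mx, my = mean(x), mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

print(round(mean(x), 2), round(mean(y), 2))          # 9.0 7.5
print(round(variance(x), 2), round(variance(y), 2))  # 11.0 4.13
print(round(pearson(x, y), 3))                       # 0.816
print(round(slope, 3), round(intercept, 2))          # 0.5 3.0
```

The numbers match the list above, yet plotting the four sets shows a line, a curve, an outlier-driven fit, and a vertical cluster.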

“Anscombe’s quartet 3” by Anscombe.svg: Schutz; derivative work (label using subscripts): Avenue – Anscombe.svg. Licensed under CC BY-SA 3.0 via Commons.

- data visualization isn’t new:
- e.g., John Snow’s cholera map; Florence Nightingale did lots of graphs

- reading visualizations:
- Perception: the low-level activity of sensing the visual aspects of a display
- Cognition: the higher-level process of interpreting the display and translating it into meaning
- The challenge: using what we know about perception and cognition to make visualizations better

- research has been done on which things we can perceive more quickly (e.g., comparing things along a 2D line is quicker than comparing areas of a shape which is quicker than comparing volume of a 3D object)
- cognitive burden – how hard is it for us to interpret the data (e.g., extract values, compare values, detect trends)
- when creating data visualizations, we should make the images easiest to perceive and we should match the method of visualization with its purpose
- e.g., if you need to extract an exact value, use a table; but if you need to detect a trend in the data, use a line graph (if you want to get an exact value, it’s harder to do from a graph)

- key point: there isn’t one best way to display data – it depends on the purpose
- some tasks may require combinations
- many published guidelines with different aims
- persuasive graphs
- statistical graphs

- there is no general theory of data visualization
- suggested practices (general):
- for value extraction: table
- for proportions: pie charts, stacked bar charts
- for value comparison: bar charts, line graphs, scatterplots
- trend detection: line graphs
- use the design that minimizes the cognitive burden for the task at hand

- 3 questions to ask when designing a visualization:
- who is the intended audience?
- what is the goal? (e.g., exploration, education, persuasion)
- what are the data composed of, statistically? (e.g., continuous, categorical, time series)

- in addition to the graphs we commonly use (line graph, bar chart), there are some other types of graphs:
- a **streamgraph**: “a type of stacked area graph which is displaced around a central axis, resulting in a flowing, organic shape.” (Wikipedia)

“LastGraph example” by Psychonaut – Own work. Licensed under CC0 via Commons.
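The “displaced around a central axis” part is just geometry: stack the series as in an area chart, then shift the whole stack down by half its total so it is centred on y = 0 (the simple “silhouette” baseline; real streamgraphs also smooth the layers). A minimal Python sketch with made-up sample data:

```python
# Streamgraph geometry sketch: stack layers, then centre the stack on y = 0
# (the "silhouette" baseline). Layer values are made-up example data.

layers = [            # one list per series, one value per x position
    [2.0, 4.0, 3.0],
    [1.0, 2.0, 5.0],
    [3.0, 1.0, 2.0],
]

def stream_boundaries(layers):
    """Return the (bottom, top) curves of each layer, centred on y = 0."""
    n = len(layers[0])
    totals = [sum(layer[i] for layer in layers) for i in range(n)]
    bottom = [-t / 2 for t in totals]  # displace the stack around the axis
    bounds = []
    for layer in layers:
        top = [b + v for b, v in zip(bottom, layer)]
        bounds.append((bottom, top))
        bottom = top
    return bounds

bounds = stream_boundaries(layers)
# The stream is symmetric about 0: the lowest bottom mirrors the highest top.
print(bounds[0][0])   # [-3.0, -3.5, -5.0]
print(bounds[-1][1])  # [3.0, 3.5, 5.0]
```

Feeding these boundary curves to any area-fill routine (e.g., matplotlib’s `fill_between`) produces the flowing shape; matplotlib’s `stackplot` can also do this directly with `baseline='wiggle'`.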

- a **sunburst graph**: “used to visualize hierarchical data, depicted by concentric circles” (Wikipedia)

“Disk usage (Boabab)” by w:Baobab (software). Licensed under CC0 via Commons.
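The layout idea behind a sunburst is also simple: each node gets an angular span proportional to its value, children subdivide their parent’s span, and depth in the hierarchy becomes the ring. A minimal Python sketch with a made-up tree (think folders and disk usage, as in the Baobab example):

```python
# Sunburst layout sketch: angular spans proportional to value, rings by depth.
# The tree is made-up example data: (name, value, children).

tree = ("root", 100, [
    ("a", 60, [("a1", 40, []), ("a2", 20, [])]),
    ("b", 40, []),
])

def sunburst_spans(node, start=0.0, sweep=360.0, depth=0, out=None):
    """Map each node name to (start_angle, sweep_angle, ring_depth)."""
    if out is None:
        out = {}
    name, value, children = node
    out[name] = (start, sweep, depth)
    angle = start
    for child in children:
        child_sweep = sweep * child[1] / value  # proportional subdivision
        sunburst_spans(child, angle, child_sweep, depth + 1, out)
        angle += child_sweep
    return out

spans = sunburst_spans(tree)
print(spans["root"])  # (0.0, 360.0, 0)  full inner circle
print(spans["a"])     # (0.0, 216.0, 1)  60/100 of the circle
print(spans["a1"])    # (0.0, 144.0, 2)  40/60 of a's 216 degrees
print(spans["b"])     # (216.0, 144.0, 1)
```

Drawing each triple as an annular wedge (inner/outer radius set by `ring_depth`) gives the concentric-circles picture.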

- you can do a lot with simple tools like Excel, but there is also more advanced software for even cooler things:
- e.g., Tableau: commercial software (quite expensive, requires some training) that makes data visualization easy; it uses knowledge of data visualization best practices to suggest what to do
- e.g., ggplot2 for R (free): builds on R’s basic graphing and lets you produce very sophisticated graphs more easily
- d3js.org (free): a JavaScript library for manipulating documents based on data; lets you make your graphics interactive and plug them right into your website

And with that, I’ve completed my first ever Coursera course!