Large streams of data, mostly unlabeled.
Machine learning is an approach to fitting models to data. How does it work? Take the raw data, hypothesize a model, and use a learning algorithm to set the model parameters to match the data.
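That fit-a-model loop can be made concrete with a toy sketch (the linear model and synthetic data here are invented for illustration): hypothesize a model class, then let a learning algorithm pick the parameters.

```python
import numpy as np

# Toy "raw data": noisy samples from an unknown linear process.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

# Hypothesized model: y = a*x + b (a model class with two free parameters).
# Learning algorithm: ordinary least squares picks the parameters
# that best match the data.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(a, b)  # should land close to the true parameters 2.0 and 1.0
```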
What makes a good machine learning algorithm?
- Performance guarantees (statistical consistency and finite-sample bounds)
- Real-world sensors, data, resources (high-dimensional, large-scale, ...)
For many types of dynamical systems, learning is provably intractable. You must choose the right class of model, or else all bets are off!
- Spectral Learning approaches to machine learning
- Topology: Encompasses the global shape of the data, and the relations between data points or groups within the global structure
- Google PageRank algorithm
- Example: Cosmic Crystallography
- Torus universe (zero curvature)
- Spherical universe (positive curvature)
- Other universe (negative curvature)
- Data: Hyperspectral Imagery
- Gradient Flow Algorithm
- for each data point, identify the neighbor with the highest density (an arrow points from that point to that neighbor)
- follow the arrows to identify clusters
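The two steps above can be sketched directly; this is a hedged toy version (the crude ball-count density estimate and the radius are my own choices, not necessarily what the speaker used):

```python
import numpy as np

def gradient_flow_clusters(points, radius=1.0):
    """Toy gradient-flow clustering: each point gets an arrow to its
    highest-density neighbor; following arrows to their endpoints
    assigns each point a cluster."""
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    density = (dist < radius).sum(axis=1)  # crude density: neighbors in a ball

    # Step 1: arrow from each point to its neighbor with the highest density.
    parent = np.arange(n)
    for i in range(n):
        nbrs = np.where((dist[i] < radius) & (np.arange(n) != i))[0]
        if len(nbrs) and density[nbrs].max() > density[i]:
            parent[i] = nbrs[np.argmax(density[nbrs])]

    # Step 2: follow the arrows until they stop; the endpoint labels the cluster.
    labels = np.empty(n, dtype=int)
    for i in range(n):
        j = i
        while parent[j] != j:
            j = parent[j]
        labels[i] = j
    return labels

# Two well-separated blobs: points in different blobs never share an arrow,
# so they end up in different clusters.
pts = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
                 np.random.default_rng(2).normal(5, 0.3, (20, 2))])
labels = gradient_flow_clusters(pts)
print(len(set(labels)))
```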
Found an interesting paper by Nicolas Christin and his group at CMU, available here. The authors examine the encrypted passwords across the entire university and run algorithms to guess them. They break down the results by demographics, along with how many attempts it took to guess each password. What's interesting? Check out Figure 1! Business students have the most guessable passwords, while computer science students have the least. I encourage all to check out this paper, or at least browse through the graphs!
Interesting points from the talk
- Drugs in different countries have different names, so they had to do matching
- Use the Jaccard distance to find related pharmacies
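The Jaccard distance between two sets is 1 minus the size of their intersection over the size of their union; a minimal sketch (the pharmacy inventories are hypothetical):

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|: 0 for identical sets, 1 for disjoint ones."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical drug-name inventories: the more names two pharmacies share,
# the smaller the distance, so related pharmacies cluster together.
pharmacy_1 = {"sildenafil", "tadalafil", "tramadol"}
pharmacy_2 = {"sildenafil", "tadalafil", "alprazolam"}
print(jaccard_distance(pharmacy_1, pharmacy_2))  # 1 - 2/4 = 0.5
```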
Interesting points to look into for research:
- spinglass clustering algorithm
- visualizations for spinglass
SIE Colloquium by Matthew Gerber, Research Assistant Professor in the Systems and Information Engineering Department.
The PTL group has 2 faculty, 10 grad students, and collaborators at the health system.
Predicting crime using Twitter:
- Conventional warfare had easily identified forces and open conflict with direct attacks (clear friends/enemies). The US has no conventional military peers; instead, the US is dealing with asymmetric warfare (asymmetry in size, power, funding, influence). Our enemies have tactical advantages.
- Monitoring via hot-spot maps
- Problems: very specific to the area you're studying, and it's retrospective. You can't take yesterday's model and predict in a different place today.
- Overview of the approach
- Gather information on potential crime correlates (Incident Layer, Grid Layer, Demographic Layer, Spatial Layer). Ex: near a military outpost? A religious site? Income levels, ethnic tension, and prior history (each on a different layer). Want to take this information and create a statistical model.
- Text presents a problem: unstructured text abounds. Short tweets like this should be helpful: "The second blast was caused by a motorcycle bomb targeting a minibus in the Domeez area in the south of the city." That needs to be read by a human or an automated approach (this talk).
- Automatically integrate unstructured text: add some new layers from the previous model (Twitter Layer, Newswire Layer, ...).
- He's looking at tweets from the Chicago area (collected in the basement of Olsson--time, text, etc.). A few topics: 1) flight (0.54), plane (0.2), terminal (0.11), ...; 2) shopping (0.39), buy (--), ...
- Mapping these topics to a heat map of Chicago. Can see where certain things are being talked about.
- Unsupervised topic modeling
- Latent Dirichlet allocation (Blei et al 2003)
- A generative story (with 2 topics): outside of the documents live topics, which we can generate. Do a similar thing with the documents: draw from a Dirichlet distribution to produce a distribution over topics that the document consists of. To generate a word, pick a topic from that distribution, then pick a word from that topic. Generate the document by repeating this process.
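That generative story can be written down in a few lines; a toy sketch with two made-up topics (the vocabulary, topic-word probabilities, and Dirichlet parameters are all invented for illustration, loosely echoing the flight/shopping topics above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two topics: each is a distribution over the vocabulary.
vocab = ["flight", "plane", "terminal", "shopping", "buy", "mall"]
topics = np.array([
    [0.5, 0.3, 0.2, 0.0, 0.0, 0.0],   # a "travel" topic
    [0.0, 0.0, 0.0, 0.4, 0.4, 0.2],   # a "shopping" topic
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    # Draw this document's topic mixture from a Dirichlet distribution.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)          # pick a topic
        words.append(rng.choice(vocab, p=topics[z]))  # pick a word from it
    return words

doc = generate_document(8)
print(doc)
```

LDA inference runs this story in reverse: given only the documents, it recovers the topics and each document's topic mixture.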
- Gather tweets from a neighborhood, tokenize and filter words, identify topic probabilities by LDA, compute the probability of crime as a function of those topic probabilities. The question is what that function should be.
- Find the beta coefficients that give the best function
- Establish training window (1/1/13-1/31/13)
- Lay down non-crime points
- Lay down crime points from training window
- Compute topic neighborhoods
- Compile training data (using a Kernel Density Estimate (?) that adds historical data to the model)
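"Find the beta coefficients that give the best function" reads like logistic regression over the topic probabilities; here is a hedged toy sketch under that assumption (the data, labels, and coefficients are synthetic, and the talk's actual model may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: each row holds the LDA topic probabilities for
# the tweets around one location; the label marks whether a crime occurred.
n = 2000
topics = rng.dirichlet((1.0, 1.0, 1.0), size=n)
true_beta = np.array([4.0, -1.0, -3.0])  # invented ground-truth coefficients
p = 1.0 / (1.0 + np.exp(-(topics @ true_beta)))
labels = (rng.random(n) < p).astype(float)

# Fit the beta of P(crime | topics) = logistic(beta . topics)
# by gradient ascent on the log-likelihood.
beta = np.zeros(3)
for _ in range(2000):
    pred = 1.0 / (1.0 + np.exp(-(topics @ beta)))
    beta += 1.0 * topics.T @ (labels - pred) / n

print(beta)  # should roughly recover the invented coefficients
```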
- Want to find the smallest place boundaries with the highest crime levels.
- Do people actually talk about crime on Twitter? (That's the big question--but gangs do trash-talk about their crimes, etc.)
- Baseline for comparison was the kernel density estimation (based on past, where is crime likely to occur?)
- They do better with the Twitter topic model + KDE than with KDE alone for certain crime types: prostitution, battery.
- They do worse for other crime types: homicide, liquor law violations.
- AUC improvement for 22 of 25 crime types, with average peak improvement of 11 points
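The KDE baseline mentioned above is easy to sketch by hand: smooth past incident locations into a "where is crime likely?" surface. A toy version with Gaussian kernels (the bandwidth and incident coordinates are invented):

```python
import numpy as np

def kde_surface(crime_points, grid, bandwidth=1.0):
    """Gaussian kernel density estimate over a grid: each past incident
    contributes a bump, and the bumps sum into a crime heat map."""
    diffs = grid[:, None, :] - crime_points[None, :, :]
    sq = (diffs ** 2).sum(axis=2)
    return np.exp(-sq / (2 * bandwidth ** 2)).mean(axis=1)

# Toy past incidents clustered near (2, 2) on a 2D "city" grid.
incidents = np.array([[2.0, 2.0], [2.2, 1.9], [1.8, 2.1], [7.0, 7.0]])
xs, ys = np.meshgrid(np.linspace(0, 8, 9), np.linspace(0, 8, 9))
grid = np.column_stack([xs.ravel(), ys.ravel()])
density = kde_surface(incidents, grid)

# The hottest grid cell should sit on the incident cluster.
hottest = grid[np.argmax(density)]
print(hottest)
```

This is purely retrospective, which is exactly the limitation the Twitter layers are meant to address.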
- Clinical Practice Guidelines
- Want to formalize using natural language processing
- Sentences have a specific order: they're using NLP and parsing English sentences. (concern: context sensitivity of English)
- Want to annotate the text with semantic labels (not XML, though).
- Precision: about 28% for temporal identifiers; others average around 50%, with the top around 75-80%
- Warning: need to make sure that fully automated isn't used alone, as there could be things that automated analysis would miss that could be life-threatening.
- The big picture
- Want to get structured information from unstructured text data through Natural Language Processing