Mininook

Musings on Christianity, Politics, and Computer Science Geekery

Category: Big Data

Large streams of data, mostly unlabeled.

The machine learning approach is to fit models to data. How does it work? Take the raw data, hypothesize a model, and use a learning algorithm to get the model parameters to match the data.
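A minimal sketch of that loop, not taken from the talk: the "raw data" is synthetic, the hypothesized model is a straight line, and the learning algorithm is plain gradient descent on squared error. The names `fit_line`, `true_a`, and `true_b` are made up for illustration.

```python
import random

def fit_line(xs, ys, lr=0.01, steps=5000):
    """Fit y ~ a*x + b by gradient descent on mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to a and b.
        grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Synthetic data from a known line plus noise (assumed values).
random.seed(0)
true_a, true_b = 2.0, -1.0
xs = [i / 10 for i in range(50)]
ys = [true_a * x + true_b + random.gauss(0, 0.1) for x in xs]

a_hat, b_hat = fit_line(xs, ys)
print(a_hat, b_hat)
```

The learned `a_hat` and `b_hat` should land close to the true parameters, which is the toy version of $\theta \approx \theta^*$.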

What makes a good machine learning algorithm?

• Performance guarantees: $\theta \approx \theta^*$ (statistical consistency and finite sample bounds)
• Real-world sensors, data, resources (high-dimensional, large-scale, ...)

For many types of dynamical systems, learning is provably intractable. You must choose the right class of model, or else all bets are off!

Look into:

• Spectral Learning approaches to machine learning
• Topology: Encompasses the global shape of the data, and the relations between data points or groups within the global structure
• Example: Cosmic Crystallography
• Torus universe (zero curvature)
• Spherical universe (positive curvature)
• Other universe (negative curvature)
• Data: Hyperspectral Imagery
• identify neighbor with highest density for each data point (arrow points from that point to that particular neighbor)
• gives a data field
• follow the arrows to identify clusters
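The arrow-following idea above can be sketched roughly as follows. This is my own toy version, not the speaker's code: density is a simple count of points within a radius, and `cluster_by_arrows` and the sample points are hypothetical.

```python
import math

def cluster_by_arrows(points, radius=1.0):
    """Label each point by the density peak reached by following arrows."""
    n = len(points)
    # Crude density estimate: number of points within `radius` (self included).
    dens = [sum(1 for q in points if math.dist(p, q) <= radius) for p in points]

    # Each point's arrow goes to the densest point in its neighbourhood,
    # possibly itself. Ties go to the lowest index, so arrows cannot cycle.
    arrow = []
    for i in range(n):
        neighbours = [j for j in range(n)
                      if math.dist(points[i], points[j]) <= radius]
        arrow.append(max(neighbours, key=lambda j: (dens[j], -j)))

    # Follow the arrows until reaching a point that points to itself.
    labels = []
    for i in range(n):
        j = i
        while arrow[j] != j:
            j = arrow[j]
        labels.append(j)
    return labels

points = [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5), (0.25, 0.25),
          (5, 5), (5.5, 5), (5, 5.5), (5.5, 5.5), (5.25, 5.25)]
labels = cluster_by_arrows(points)
print(labels)
```

Points that end at the same peak share a label, so the two well-separated groups come out as two clusters.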

people.rit.edu/wfbsma/data/NINJA_MAIN_self_test_refl_RX.img.html

Interesting points from the talk

• Drugs in different countries have different names, so they had to do matching
• Use the Jaccard distance to find related pharmacies
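For reference, the Jaccard distance between two sets is one minus the size of their intersection over the size of their union. A small sketch, assuming each pharmacy is represented by the set of drugs it carries (that representation and the drug names are my assumption, not something stated in the talk):

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; 0 means identical, 1 means disjoint."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)

# Hypothetical inventories.
pharmacy_a = {"aspirin", "ibuprofen", "amoxicillin"}
pharmacy_b = {"ibuprofen", "amoxicillin", "metformin"}
d = jaccard_distance(pharmacy_a, pharmacy_b)
print(d)  # 2 shared items out of 4 distinct items -> 0.5
```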

Interesting points to look into for research:

• spinglass clustering algorithm
• visualizations for spinglass

https://www.andrew.cmu.edu/user/nicolasc/

SIE Colloquium by Matthew Gerber, Research Assistant Professor in the Systems and Information Engineering Department.

The PTL group has 2 faculty, 10 grad students, and collaborators at the health system.

• Conventional warfare had easily identified forces and open conflict with direct attacks (friends/enemies). The US has no conventional military peers. The US is dealing with asymmetric warfare (asymmetry in size, power, funding, influence). Our enemies have tactical advantages.
• Monitoring via hot-spot maps
1. Problems: very specific to the area you're studying, and it's retrospective. Can't take yesterday's model and predict on a different place today.
• Overview of the approach
1. Gather information on potential crime correlates (Incident Layer, Grid Layer, Demographic Layer, Spatial Layer). Ex: near a military outpost? A religious site? Income levels, ethnic tension, and prior history (each on a different layer). Want to take this information and create a statistical model.
2. Text poses a problem: unstructured text abounds. Short tweets like this should be helpful: "The second blast was caused by a motorcycle bomb targeting a minibus in the Domeez area in the south of the city." That needs to be read by a human or by an automated approach (this talk).
3. Automatically integrate unstructured text: add some new layers from the previous model (Twitter Layer, Newswire Layer, ...).
• He's looking at tweets from the Chicago area (collecting in the basement of Olsson: time, text, etc.). A few topics: 1) flight (0.54), plane (0.2), terminal (0.11), ...; 2) shopping (0.39), buy (--), ...
1. Mapping these $n$ topics to heat map of Chicago. Can see where certain things are being talked about.
2. Unsupervised topic modeling
• Latent Dirichlet allocation (Blei et al 2003)
• A generative story (2 topics). Topics live outside of the documents, and we can generate them. Do a similar thing for each document: draw from a Dirichlet distribution to get the distribution of topics that the document consists of. To generate a word, pick a topic from that distribution, then pick a word from that topic. Repeat this process to generate the whole document.
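The generative story can be sketched with only the standard library. This is an illustration of the sampling process, not an LDA fitting routine; the vocabulary, the Dirichlet parameters, and the helper names `sample_dirichlet` and `generate_document` are all made up here.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocab, alpha, n_words, rng):
    """Generate one document: topic mixture first, then word by word."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        words.append(rng.choices(vocab, weights=topics[z])[0])  # pick a word
    return theta, words

rng = random.Random(0)
vocab = ["flight", "plane", "terminal", "shopping", "buy", "store"]  # hypothetical
# Two topics, each a distribution over the vocabulary, also Dirichlet draws.
topics = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(2)]
theta, doc = generate_document(topics, vocab, alpha=[1.0, 1.0], n_words=8, rng=rng)
print(theta, doc)
```

Fitting LDA runs this story in reverse: given only the words, infer the topics and each document's mixture.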
• Gather tweets from a neighborhood, tokenize and filter words, identify topic probabilities by LDA, compute probability of crime $P(Crime) = F(0.15,0.74,...,f_n)$. The question: what is $F$?
1. The logistic function: $\frac{1}{1+e^{-\left(\beta_0 + \sum_{b=1}^n \beta_b f_b(p)\right)}}$.
2. Find the beta coefficients that give the best function
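A rough sketch of finding the beta coefficients for a logistic model whose exponent is $\beta_0$ plus a weighted sum of the features, using gradient ascent on the log-likelihood. The talk did not say which fitting procedure was used, so that choice, the toy topic probabilities, and the function names are assumptions.

```python
import math

def predict(betas, features):
    """Logistic function of beta_0 plus the weighted sum of the features."""
    z = betas[0] + sum(b * f for b, f in zip(betas[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, lr=0.5, steps=2000):
    """Gradient ascent on the average log-likelihood."""
    betas = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(betas)
        for feats, label in zip(X, y):
            err = label - predict(betas, feats)
            grad[0] += err  # intercept term
            for k, f in enumerate(feats, start=1):
                grad[k] += err * f
        betas = [b + lr * g / n for b, g in zip(betas, grad)]
    return betas

# Hypothetical topic probabilities per location; 1 = crime point, 0 = non-crime point.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]
betas = fit(X, y)
print(predict(betas, [0.9, 0.1]), predict(betas, [0.1, 0.9]))
```

On this toy data the first topic ends up with a positive coefficient and the second with a negative one, so locations dominated by the first topic get the higher predicted probability.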
• Training
• Establish training window (1/1/13-1/31/13)
• Lay down non-crime points
• Lay down crime points from training window
• Compute topic neighborhoods
• Compile training data (use Kernel Density Estimate (?) that adds historical data to the model)
• Evaluation
• Want to find the smallest place boundaries with the highest crime levels.
• Do people actually talk about crime on Twitter? (That's the big question, but gangs do trash-talk about their crimes, etc.)
• Baseline for comparison was the kernel density estimation (based on past, where is crime likely to occur?)
• The Twitter model + KDE does better than KDE alone for certain crime types: prostitution, battery.
• It does worse for other crime types: homicide, liquor law violations.
• AUC improvement for 22 of 25 crime types, with average peak improvement of 11 points
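For context on the KDE baseline, here is a minimal two-dimensional Gaussian kernel density estimate over past incident locations. The kernel choice, the bandwidth, and the coordinates are assumptions for illustration; the talk did not specify them.

```python
import math

def kde(query, incidents, bandwidth=1.0):
    """Gaussian kernel density estimate at `query` from past incident points."""
    if not incidents:
        return 0.0
    norm = 1.0 / (2 * math.pi * bandwidth ** 2 * len(incidents))
    total = 0.0
    for x, y in incidents:
        sq_dist = (query[0] - x) ** 2 + (query[1] - y) ** 2
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    return norm * total

# Hypothetical past incidents clustered near the origin.
past = [(0.0, 0.0), (0.2, -0.1), (-0.3, 0.4), (0.1, 0.3)]
near = kde((0.0, 0.0), past)
far = kde((5.0, 5.0), past)
print(near, far)
```

The estimate is high near where past incidents cluster and falls off with distance, which is why a KDE can only say "crime is likely where it has already happened".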
• Clinical Practice Guidelines
• Want to formalize using natural language processing
• Sentences have a specific order: they're using NLP and parsing English sentences. (concern: context sensitivity of English)
• Want to annotate the text with semantic labels (not XML, though).
• Precision: about 28% for temporal identifiers; other label types average around 50%, with the top around 75-80%
• Warning: fully automated analysis shouldn't be used alone, as it could miss things that are life-threatening.
• The big picture
• Want to get structured information from unstructured text data through Natural Language Processing