Mininook

Musings on Christianity, Politics, and Computer Science Geekery

Category: Computer Science

Christin: Seeking a fix: Measuring, analyzing and disrupting unlicensed online drug sales

Interesting points from the talk

  • Drugs have different names in different countries, so the names had to be matched up
  • Use the Jaccard distance to find related pharmacies
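As a reminder of what the Jaccard distance measures, here is a minimal sketch. It assumes each pharmacy is represented as a set of drug names; my notes don't record exactly what the talk compared, and the pharmacy names and inventories below are made up for illustration.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical inventories, not data from the talk.
pharmacy_a = {"sildenafil", "tadalafil", "tramadol"}
pharmacy_b = {"sildenafil", "tadalafil", "zolpidem"}
pharmacy_c = {"amoxicillin"}

# 2 shared drugs out of 4 distinct ones -> distance 0.5
print(jaccard_distance(pharmacy_a, pharmacy_b))
# Nothing shared -> distance 1.0
print(jaccard_distance(pharmacy_a, pharmacy_c))
```

A smaller distance means more overlap, so pharmacies with small pairwise distances are candidates for being related.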

Interesting points to look into for research:

  • spinglass clustering algorithm
  • visualizations for spinglass

https://www.andrew.cmu.edu/user/nicolasc/

Gerber: Natural Language Processing for Predictive Technology

SIE Colloquium by Matthew Gerber, Research Assistant Professor in the Systems and Information Engineering Department.

The PTL group has 2 faculty, 10 grad students, and collaborators at the health system.

Predicting crime using Twitter:

  • Conventional warfare had easily identified forces and open conflict with direct attacks (friends/enemies). The US has no conventional military peers. The US is dealing with asymmetric warfare (asymmetry in size, power, funding, influence). Our enemies have tactical advantages.
  • Monitoring via hot-spot maps
    1. Problems: very specific to the area you're studying, and it's retrospective. Can't take yesterday's model and predict on a different place today.
  • Overview of the approach
    1. Gather information on potential crime correlates (Incident Layer, Grid Layer, Demographic Layer, Spatial Layer). Ex: near a military outpost? A religious site? Income levels, ethnic tension, and prior history (each on a different layer). Want to take this information and create a statistical model.
    2. Text poses a problem: unstructured text abounds. These short tweets should be helpful: "The second blast was caused by a motorcycle bomb targeting a minibus in the Domeez area in the south of the city." That needs to be read by a human or by an automated approach (this talk).
    3. Automatically integrate unstructured text: add some new layers from the previous model (Twitter Layer, Newswire Layer, ...).
  • He's looking at tweets from the Chicago area (collecting in the basement of Olsson: time, text, etc.). A few topics: 1) flight (0.54), plane (0.2), terminal (0.11), ...; 2) shopping (0.39), buy (--), ...
    1. Map these n topics onto a heat map of Chicago to see where certain things are being talked about.
    2. Unsupervised topic modeling
      • Latent Dirichlet allocation (Blei et al 2003)
      • A generative story (2 topics). Topics live outside the documents, and each topic (a distribution over words) can itself be generated. Do a similar thing for each document: draw from a Dirichlet distribution to produce another distribution, the mix of topics the document consists of. To generate a word, pick a topic from that mix and then pick a word from that topic. Repeat this process to generate the whole document.
      • Gather tweets from a neighborhood, tokenize and filter words, identify topic probabilities by LDA, compute the probability of crime P(Crime) = F(0.15, 0.74, ..., f_n). The question: what is F?
        1. \frac{1}{1+e^{-\left(\beta_0 + \sum_{b=1}^n \beta_b f_b(p)\right)}}.
        2. Find the beta coefficients that give the best fit
      • Training
        • Establish training window (1/1/13-1/31/13)
        • Lay down non-crime points
        • Lay down crime points from training window
        • Compute topic neighborhoods
        • Compile training data (use Kernel Density Estimate (?) that adds historical data to the model)
      • Evaluation
        • Want to find the smallest areas that capture the highest crime levels.
        • Do people actually talk about crime on Twitter? That's the big question, but gangs do trash-talk about their crimes, etc.
        • Baseline for comparison was the kernel density estimate (based on the past, where is crime likely to occur?)
        • The Twitter model + KDE does better than KDE alone for certain crime types: prostitution, battery.
        • It does worse for other crime types: homicide, liquor law violations.
        • AUC improvement for 22 of 25 crime types, with average peak improvement of 11 points
  • Clinical Practice Guidelines
    • Want to formalize using natural language processing
    • Sentences have a specific order: they're using NLP and parsing English sentences. (concern: context sensitivity of English)
    • Want to annotate the text with semantic labels (not XML, though).
    • Precision: temporal identifiers are identified only 28% of the time; others average around 50%, with the top around 75-80%
    • Warning: fully automated analysis shouldn't be used alone, as it could miss things that are life-threatening.
  • The big picture
    • Want to get structured information from unstructured text data through Natural Language Processing
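To make the P(Crime) = F(...) step from the notes above concrete, here is a minimal sketch. It assumes F is the standard logistic-regression form (an intercept plus a weighted sum of the topic probabilities, passed through the logistic function). The function name and the coefficient values are made up for illustration; in the talk the betas are fit from training data.

```python
import math


def crime_probability(topic_probs, beta0, betas):
    """Logistic model: 1 / (1 + exp(-(beta0 + sum_b beta_b * f_b)))."""
    if len(topic_probs) != len(betas):
        raise ValueError("need exactly one coefficient per topic probability")
    z = beta0 + sum(b * f for b, f in zip(betas, topic_probs))
    return 1.0 / (1.0 + math.exp(-z))


# Topic probabilities for one neighborhood (the 0.15 and 0.74 from the notes)
# with hypothetical coefficients.
p = crime_probability([0.15, 0.74], beta0=-1.0, betas=[2.0, 0.5])
print(round(p, 3))
```

With all coefficients at zero the model returns 0.5 everywhere, so whatever predictive power there is comes entirely from the fitted betas.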

Command Line Master

Wanted to post the craziest command line script I've used in a long time. I used it to convert names listed in XML tags in EAC-CPF records into filenames to copy.

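# Pull the <relationEntry> names out of the records, turn each name into a
# filename, and copy that file into eac_data/.
grep -h -o -P "<relationEntry>(.*?)</relationEntry>" *.xml |
  sed -e 's/<[a-zA-Z0-9\/\+]*>//g' |   # strip the surrounding tags
  awk '{print tolower($0)}' |          # lowercase the name
  sed -e 's/[ ,.:]\+/-/g' |            # runs of space , . : become one hyphen
  sed -e 's/$/cr.xml/' |               # append the filename suffix
  while read -r x ; do cp "/data/production/data/$x" eac_data/. ; done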
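To see what the pipeline does to a single name, here is a rough Python equivalent of the sed/awk steps (without the copy). The function name and the sample entry are made up for illustration, and this is only a sketch of the same transformation, not what I actually ran.

```python
import re


def relation_entry_to_filename(line: str) -> str:
    """Mirror the shell pipeline for one grep match."""
    name = re.sub(r"<[a-zA-Z0-9/+\\]*>", "", line)  # strip the tags
    name = name.lower()                             # lowercase
    name = re.sub(r"[ ,.:]+", "-", name)            # separator runs -> hyphen
    return name + "cr.xml"                          # append the suffix


# Hypothetical entry, not taken from a real record.
print(relation_entry_to_filename("<relationEntry>Smith, John</relationEntry>"))
# smith-johncr.xml
```

Note that a name ending in a period picks up a trailing hyphen before the suffix, which is the same thing the shell version does.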

© 2017 Mininook
