I use command-line shortcuts all the time for various tasks, and I often have to look the tricks up again whenever I need to do something remotely fancy. Here are some of my most-used hints:
- To strip the leading spaces and tabs from each line of text on standard input (so feed the input through a pipe), this sed command works well (the `\t` escape is understood by GNU sed; other sed implementations may need a literal tab character instead):
sed -e 's/^[ \t]*//'
- To reformat XML/HTML files so that line breaks inside tags are removed:
xmllint --format --noblanks infile.xml > outfile.xml
Large streams of data, mostly unlabeled.
Machine learning approach to fit models to data. How does it work? Take the raw data, hypothesize a model, use a learning algorithm to get the model parameters to match the data.
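As a toy illustration of that loop (raw data, hypothesized model, learning algorithm that fits the parameters), here is a minimal sketch that fits a straight line by ordinary least squares. The data points and the choice of a linear model are made up for illustration; they are not from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Raw data: made-up noisy points lying near y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

# Hypothesized model: a straight line; learned parameters: slope and intercept
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

On this data the fitted slope and intercept come out at roughly 1.97 and 1.06, close to the line the points were generated around.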
What makes a good machine learning algorithm?
- Performance guarantees (statistical consistency and finite-sample bounds)
- Real-world sensors, data, resources (high-dimensional, large-scale, ...)
For many types of dynamical systems, learning is provably intractable. You must choose the right class of model, or else all bets are off!
- Spectral Learning approaches to machine learning
- Topology: Encompasses the global shape of the data, and the relations between data points or groups within the global structure
- Google PageRank algorithm
- Example: Cosmic Crystallography
- Torus universe (zero curvature)
- Spherical universe (positive curvature)
- Other universe (negative curvature)
- Data: Hyperspectral Imagery
- Gradient Flow Algorithm
- identify neighbor with highest density for each data point (arrow points from that point to that particular neighbor)
- follow the arrows to identify clusters
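The two gradient flow steps above can be sketched in a few lines. This is only a rough sketch under my own assumptions: density is estimated as the number of points within a fixed `radius`, neighbors are the points within that same radius, and ties are broken by index so the arrows can never form a cycle. The talk did not specify any of these details, and the sample points are made up.

```python
import math


def gradient_flow_clusters(points, radius):
    """Label each point with the index of the density peak its arrows lead to."""
    n = len(points)
    # Neighbors of each point: everything within `radius`, including itself
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    # Assumed density estimate: number of neighbors within the radius
    density = [len(nb) for nb in neighbors]
    # Arrow: each point points to its highest-density neighbor
    # (ties go to the lower index, which rules out cycles)
    arrow = [max(neighbors[i], key=lambda j: (density[j], -j)) for i in range(n)]
    labels = []
    for start in range(n):
        current = start
        # Follow the arrows until reaching a point that points to itself
        while arrow[current] != current:
            current = arrow[current]
        labels.append(current)
    return labels


# Two made-up blobs: four corners plus a center point each
points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
]
labels = gradient_flow_clusters(points, radius=1.1)
print(labels)
```

Here each blob's center has the most neighbors, so every point in the first blob ends up labeled 4 and every point in the second blob ends up labeled 9.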
Found an interesting paper by Nicolas Christin and his group at CMU, available here. The authors examine the encrypted passwords across the entire university and run algorithms to guess them. They break the results down by demographics, along with how many attempts it took to guess each password. What's interesting? Check out Figure 1! Business students have the most guessable passwords, while Computer Science students have the least guessable. I encourage everyone to check out this paper, or at least browse through the graphs!
Interesting points from the talk
- Drugs in different countries have different names, so they had to do matching
- Use the Jaccard distance to find related pharmacies
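For reference, the Jaccard distance between two sets is one minus the size of their intersection divided by the size of their union. Below is a minimal sketch; the pharmacy names and drug inventories are made up for illustration, and I am assuming each pharmacy is represented as the set of drugs it lists, which the talk did not spell out.

```python
def jaccard_distance(a, b):
    """Return 1 - |A intersect B| / |A union B| (0.0 for identical sets, 1.0 for disjoint ones)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets are treated as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical inventories, after drug names have been matched across countries
pharmacy_a = {"aspirin", "ibuprofen", "amoxicillin"}
pharmacy_b = {"ibuprofen", "amoxicillin", "loratadine"}

# Two shared drugs out of four distinct ones gives a distance of 0.5
print(jaccard_distance(pharmacy_a, pharmacy_b))
```

A small distance means two pharmacies list nearly the same drugs, which is one plausible signal that they are related.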
Interesting points to look into for research:
- spinglass clustering algorithm
- visualizations for spinglass