Musings on Christianity, Politics, and Computer Science Geekery

Category: Computer Science

Quick Multi-File Rename in Linux

Call it the little things, but I'm usually excited when I find a new tool that speeds things up.  I accidentally ran a set of experiments that produced multiple files that were named incorrectly.  That is, I wrote files as aq*-female-1_day-*_event*.csv but I was actually supposed to write the files as aq*-female-edt-*_event*.csv.  These files fit into a larger set of results with varying time lengths (edt, month, year, etc), so I needed to change dozens of files to fix my mistake.

Cue the head scratching.  Do I rename each individually?  Do I write a bash or PHP script to perform the renaming, perhaps taking time away from other things? No! I found out there's a tool for that built into the Linux command line environment: rename.

The Perl-based rename command takes a Perl regular expression and a set of files, and renames each file according to that expression.  (Note that some distributions instead ship the util-linux rename, which uses a different, simpler syntax.)  So, in this way, I could rename dozens of files with one command:

> rename 's/(.*)-1_day-(.*)/\1-edt-\2/' *.csv
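If the Perl rename isn't installed, a plain bash loop with parameter substitution does the same job.  A minimal sketch, using a hypothetical filename matching the pattern above:

```shell
# Fallback sketch when the Perl rename tool isn't available.
# The sample filename here is hypothetical.
touch aq1-female-1_day-2_event3.csv          # one mis-named file
for f in aq*-female-1_day-*_event*.csv; do
  mv "$f" "${f/-1_day-/-edt-}"               # swap -1_day- for -edt-
done
ls aq*-female-edt-*_event*.csv               # now correctly named
```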


Net Neutrality

Earlier today, I came up with this analogy for the internet and net neutrality:

Let’s say UPS owns I-95, FedEx owns I-64, and Joe’s Shipping owns I-295 and 288. With net neutrality, anyone could drive on any interstate, including UPS trucks on 64, without additional cost. There is some negotiation between companies for the interchanges between 64, 95, and 295.

Without net neutrality, it becomes much more problematic. FedEx could charge a fee for UPS trucks on 64, and vice versa. Joe’s Shipping is such a small company that they may not be able to afford new charges to make deliveries using 64 or 95; therefore, they end up needing to use back roads, which hurts delivery speed.  Unable to compete with UPS and FedEx speeds on the faster routes, they fold and sell 295 and 288 to the other companies.

Perhaps you own a store along 64 but depend on a supplier from the DC area for your products. If the supplier is a small company, one of you would have to absorb the additional shipping fees from both shipping companies to use both roads to get the delivery to you (i.e., the UPS delivery fee plus the fee UPS pays FedEx to use 64). But if the supplier is already a big player, like Amazon or Walmart, they will likely have a second warehouse off 64, so they can still offer the lower FedEx-only shipping charges.  Therefore, small suppliers can't compete with already-established large corporations.

And, what would be even worse: what if UPS and FedEx owned their own supply companies? Then perhaps you buy their products and shipping, because they charge anyone else extra to use either of their roads.

And that’s where we are today. Comcast and Verizon own large swaths of the internet and its interconnections, and they produce content (TV, movies, websites, etc). AT&T, which also owns portions of the internet, is trying to acquire Time Warner, including its production companies.

So, that should be terrifying. Even if they are transparent about how much they charge, it’s still not neutral. There aren’t enough back-channels to help all content get everywhere.

Now, I know you might be thinking "well, I pay for Ting, Google Fiber, [insert your good company here] internet, so they won't play favorites with content."  But, it's not just about them; the internet is a very deep and complex network.  At its base is a backbone controlled by multiple different companies, some that you may have never heard of.   Your web content may pass through a few different companies on top of the one that you actually pay for internet access.  Without net neutrality, any one of them along the way has the ability to stop or slow your data or charge a fee.

There are a couple of tools you can use to test out the internet for yourself and see which companies you'll need to deal with to do rather mundane things online: traceroute and whois.  They're freely available in Terminal (macOS) and any Linux shell, and Windows ships a similar tool, tracert, in the Command Prompt.

Example Usage: My website

Let's take a look at getting to my site from my in-laws' house.  From the terminal, we execute the traceroute command, which provides us with the following response:

traceroute to (, 100 hops max, 60 byte packets
1 gateway ( 2.745 ms 3.198 ms 4.602 ms
2 ( 25.460 ms 25.820 ms 26.449 ms
3 ( 25.772 ms 25.879 ms 27.830 ms
4 ( 27.624 ms 27.535 ms 27.511 ms
5 ( 34.516 ms 34.505 ms 34.451 ms
6 ( 36.430 ms 20.432 ms 24.491 ms
7 ( 27.858 ms 27.145 ms 27.138 ms
8 ( 33.523 ms 33.494 ms 32.625 ms
9 ( 33.410 ms 36.523 ms 38.295 ms
10 ( 37.674 ms 36.776 ms 39.484 ms
11 ( 39.854 ms 40.663 ms 40.865 ms
12 ( 26.594 ms 27.676 ms 28.477 ms
13 ( 22.865 ms 23.073 ms 23.237 ms

This list shows all the steps between my laptop and my website.  You'll notice it's backwards; that is, these are the steps to my website, not from it.  However, the website data will take roughly the same path back to my laptop.  Let's unpack this a little:

  • Step 1 is the gateway, i.e. the router in the house that my laptop connects to on wifi.  If your first entry starts with 10 or 192.168, then that is a local network and likely your router.
  • Steps 2-8 are all routers or computers at Comcast.  The hostnames in steps 3 and 5-7 specifically tell us so, and we see my request going from Palmyra to Charlottesville to Ashburn.
  • Steps 9-11 are all routers or computers at MCI Communications (remember them?  Well, they're actually Verizon now).  They don't advertise that fact here, but I'll show you how to get that information in a minute.
  • Steps 12-13 are computers at DreamHost, where my website resides.
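As a side note, you can summarize a trace like this yourself.  Here's a small sketch (using made-up sample hops, not the real trace above) that averages the three round-trip samples traceroute prints for each hop:

```shell
# Average the three round-trip samples per hop of a saved traceroute.
# The hops below are made-up sample data, not a real trace.
cat > trace.txt <<'EOF'
1 gateway (192.168.1.1) 2.745 ms 3.198 ms 4.602 ms
2 isp-router (10.0.0.1) 25.460 ms 25.820 ms 26.449 ms
EOF
awk '{ s = 0; n = 0
       for (i = 1; i < NF; i++)                 # a field is a sample
         if ($(i+1) == "ms") { s += $i; n++ }   # if followed by "ms"
       if (n) printf "hop %s: %.1f ms avg\n", $1, s / n }' trace.txt
```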

How do I know that step 9 is Verizon?  Our second command will give us that information: by running whois on the address from step 9, we get a response from a registrar that details the owner of that particular address.  In this case, the important part is:

Organization: MCI Communications Services, Inc. d/b/a Verizon Business (MCICS)

In an age without net neutrality, my site could be slowed down by either Comcast or Verizon, even though my website is hosted at DreamHost.  You'll see mocked-up images of "plans" that speculate about paying extra for the "news websites" package or the "streaming video sites" package, but the actual case is more complicated than that.  My in-laws could pay Comcast extra for the "personal websites" package, but that won't affect Verizon's handling of my website data.

This is a simple example because it is likely that DreamHost pays Verizon for internet access and my in-laws pay Comcast, but there are cases in which the internet traffic will pass through an intermediary company.  I encourage you to go forth and test this out.  You'll find companies like Fox News that pay a company called Akamai, which provides those "warehouses" from my analogy--places on your network that may use only your internet provider to deliver faster responses.  You'll see companies like Level3 that you may have never heard of.

When you're done, and you're convinced something needs to be done, there are a few things you can do to try to influence what's happening at the FCC:

  1. Call your representatives in Congress and ask them to support net neutrality.  (Don't email, call.  Someone has to take your call.)
  2. Comment with the FCC.  They are supposed to take these into account when making the decision.
  3. Vote in 2018.

Quick Bar-Chart of disk usage

Today I was in search of a command that I had used a long time ago, but I ran into a much more interesting one instead.  At the time, I must have needed to discover which files were the largest disk hogs and whether there was a long tail (i.e., how many of the 3.7M files in this directory--not my fault, by the way--were inconsequential).  That brings us to this wonderful "one-line" command:

find /dir/ -name "*.xml" -exec du -s {} \; | perl -n -e 'if (/^(\d+)\s+(.*)/) { $h{$2} = $1; if ($max < $1) { $max = $1; } if (length($2) > $maxfname) { $maxfname = length($2); } } END { map { $barlen = ($h{$_} / $max) * 50; $bar = "*" x $barlen; printf ("%" . $maxfname . "s" . "(%5d): %s", $_, $h{$_}, $bar); print "\n"; } sort { $h{$b} <=> $h{$a} } keys %h }' 2> /dev/null > report.txt

Specifically, this finds every XML file in the dir directory and uses the Linux du command to get each file's size.  That list of sizes and filenames is piped to a hacky Perl script that pulls out each size, creates a horizontal histogram bar scaled to the maximum size (capped at 50 *s wide), and prints the list sorted from max to min.  Lastly, the output is saved to report.txt.

That's quite a quick and dirty trick, but produces a nice command-line output like this:

/dir/w6bz9whg.xml(36560): **************************************************
/dir/w6km312r.xml(31772): *******************************************
/dir/w68d03gz.xml(27728): *************************************
/dir/w6vt5fhv.xml(27076): *************************************
/dir/w6m07v80.xml(17420): ***********************
/dir/w68m0zj8.xml(15276): ********************
/dir/w6mq7qpz.xml(15052): ********************
/dir/w6vq30tq.xml(13808): ******************
/dir/w6tb51hr.xml(13160): *****************


Command Line Tricks

So, I'm always using command line shortcuts to do various tasks, and I often have to look up the tricks every time I need to do something remotely fancy.  Here are some of my most-used helpful hints:

  • To remove the leading spaces and tabs from each line of text on standard in (so use with a pipe for the input), this sed command will work well:
    sed -e 's/^[ \t]*//'
  • Reformatting XML/HTML files so that line returns inside tags are removed:
    xmllint --format --noblanks infile.xml > outfile.xml
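To see the sed trick in action, feed it some sample indented text on a pipe (this relies on GNU sed understanding \t inside the bracket expression; BSD/macOS sed wants a literal tab there):

```shell
# Quick check of the sed trick on piped input.  Note: GNU sed
# accepts \t in the bracket expression; BSD sed wants a real tab.
printf '   hello\n\tworld\n' | sed -e 's/^[ \t]*//'
```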

Boots: New Machine Learning Approaches to Modeling Dynamical Systems

Large streams of data, mostly unlabeled.

Machine learning approach to fit models to data. How does it work? Take the raw data, hypothesize a model, use a learning algorithm to get the model parameters to match the data.

What makes a good machine learning algorithm?

  • Performance guarantees: \theta \approx \theta^* (statistical consistency and finite sample bounds)
  • Real-world sensors, data, resources (high-dimensional, large-scale, ...)

For many types of dynamical systems, learning is provably intractable. You must choose the right class of model, or else all bets are off!

Look into:

  • Spectral Learning approaches to machine learning

© 2022 Mininook
