Simple exercises with grep, sed and awk in org-mode

Posted on
tutorial, emacs, orgmode, programming

For text processing, I had never bothered to learn classic Unix tools such as sed and awk, because I can always use Python's regular expression library. The syntax of sed and awk just appeared to be too arcane to me. However, recently I realize that for many simple ad-hoc tasks, even writing a Python script is too much overhead. This motivated me to learn to use regular expressions directly in the command line.

It also gave me an excuse to play with literate programming in org-mode. I love mixing narratives with code in notebook environments, but with Jupyter and Mathematica notebooks, I have to program in Python, R, or Mathematica. Org-mode has the advantage of supporting many programming languages. Furthermore, code blocks in an org-mode notebook can be written in more than one languages, which makes it perfect for teaching and learning programming languages. Another useful feature is that instead of loading data from an external file, a code block can read from data blocks (and even from the results of other code blocks) in the same notebook. It's very convenient to be able to manipulate data and code in a self-contained document.

If you are reading this on GitHub or on my blog, note that you are viewing a rendering of the notebook, which hides the markups used to construct the code blocks. On GitHub, you can click on the "Raw" button to view the source code. More usefully, you can download the source code and open it in Emacs. You can then move the cursor to one of the code blocks, and evaluate it with C-c C-c.

Preparations

By default, org-mode in Emacs only allows the evaluation of LISP code blocks. For interacting with the Unix shell and Python, I inserted the following to my Emacs configuration script:

  ;; Do not ask for confirmation when evaluation a block
  (setq org-confirm-babel-evaluate nil)

  (org-babel-do-load-languages
   'org-babel-load-languages
   '((emacs-lisp . t)
     (shell . t)
     (python . t)))

Running unix shell commands in org-mode

Let's start with this small chunk of text:

apple orange User24 banana User300 User5s cherry strawberry
kiwi User65

To store it in a code block, I used the #+BEGIN_EXAMPLE and #+END_EXAMPLE markup. I gave it a name (source_text) with #+NAME:, so that other code blocks can refer to it. To learn more about the mechanism for passing information around among code blocks, jherrlin has posted several very useful blog entries.

The following is a shell script block. I use the :stdin source_text directive to send the text to STDIO. To operate on STDIO from a shell script, the easiest way is to read from the /dev/stdin file:

cat /dev/stdin
apple orange User24 banana User300 User5s cherry strawberry

The result above shows that we can indeed read from the source_text block. Note that the format that the result is displayed in is specified by the shell script block. I used :results output to tell org-mode to display the raw output of the shell script, instead of formatting it as a table (which is the default behavior). The :exports both directive was used so that when the notebook is exported to HTML, the code and the result are both exported (by default, only the code is exported).

Simple pattern matching with grep

How to extract all the user IDs (User24, User300, User5, =User65=) from the string? In Python, it's just:

  import re
  res = re.findall(r'User\d+', str)
  for r in res:
    print(r)
User24
User300
User5
User65

To do the same in the Unix shell, grep is the closest to Python's re.findall(). It took me a couple of minutes to figure out that I had to use the extended version of grep, egrep, because the + pattern is unavailable in grep:

cat /dev/stdin | egrep -o 'User\d+'
User24
User300
User5
User65

The -o option (o stands for "only") is important, because it tells (e)grep to print only the substring that matches the regular expression pattern. Interestingly, the matched substrings are displayed in individual lines. This behavior will come in handy later.

Without the -o flag, (e)grep prints the entire line if there is a match in the line. This behavior is useful for identifying lines in a file that matches the pattern, but it's not useful for extracting useful pieces in a line.

cat /dev/stdin | egrep 'User\d+'
apple orange User24 banana User300 User5s cherry strawberry
kiwi User65

I also tried to solve this problem with sed, but it didn't seem to be work. sed is good at editing words in a document, but it is not good at extraction.

Simple pattern matching with awk

This was the first time that I used awk. I was hoping for an elegant one-liner (what was I thinking?), but it turned out that for each line ($0), I had to loop through all matches, print out the match, and then update $0 to become the rest of the line. Yuck! If I have to do this, I might as well use Python.

  {
      while (match($0, "User[0-9]+")) {
          print substr($0, RSTART, RLENGTH);
          $0 = substr($0, RSTART + RLENGTH);
      }
  }
User24
User300
User5
User65

Substring extraction with grep and sed

Let's make the problem a little harder. Consider this chunk of text:

apple orange User24.txt banana User300s User5.text cherry strawberry
kiwi User65.gif banana User31.text

Some of the substrings that begin with User are filenames (e.g., User24.txt). Form the filenames, I want to extract the parts before the extensions (e.g., the User24 part of User24.txt). Note that the length of the extension is not constant.

With Python's re.findall(), this can easily be done by creating a capture group in the pattern with parentheses:

  import re
  res=re.findall(r'(User\d+).[a-z]{3,4}', str)
  for r in res:
    print(r)
User24
User5
User65
User31

With grep, it's easy to pick up the filenames…

cat /dev/stdin | egrep -o 'User\d+.[a-z]{3,4}'
User24.txt
User5.text
User65.gif
User31.text

… but it is not easy to take them apart. That's where sed comes in! I couldn't solve this problem with sed alone, because sed is not good at picking up multiple matches in the same line. But since grep very helpfully puts each match in is own line, it's perfect for sed. It's time to try the famous sed command s/ ("substitute"):

  sedre='(User[0-9]+).[a-z]{3,4}'
  action='\1'
  cat /dev/stdin | egrep -o "$sedre" | sed -n -E "s/$sedre/$action/p"
User24
User5
User65
User31

It's a nice one-liner, bit there are a some details to unpack:

  1. The syntax of the s/ command is s/pattern/action/options. I defined two variables (sedre and action) to make this structure more obvious, but it wasn't necessary.

  2. I asked sed to print only the lines that matched. This isn't necessary because every line from egrep should match, but it's useful for debugging. This is done with the -n flag, and with the p ("print") option in last part of the /s command.

  3. I turned on the -E flag, to use extended regular expressions.

  4. I created a capture group in the pattern sedre with parentheses. Normally, the parentheses need to be escaped (i.e., \( and \)), but with the -E flag on, they shouldn't be escaped.

  5. In the action part of the s/ command, I used \1 to refer to the first (and only) group in the pattern.

I was planning to solve this with awk, because awk is supposed to be good at extracting bits and pieces of information from texts. But awk is really designed for processing tabular data. For unstructured texts, awk is worse than Python, so I decided not to bother.