Tutorial WS22: Foundations in Data Engineering

Here you find my tutorial notes for my tutorial in "Foundations in Data Engineering".

Tutorial 1 notes.

Introduction

  • Michael Brauweiler

If you plan to keep coming to this tutorial session:

  • Send email to [michael.brauweiler@posteo.de]
  • Thursday: Use subject [FDE T-7].
  • Friday: Use subject [FDE T-9].

Exercise 1: Building Blocks

  • man
  • cat
  • comm
  • join
  • sort
  • head / tail
  • uniq
  • wc
  • seq
  • grep
  • awk
  • sed

man

Interface to the system reference manuals.

man man

cat

Concatenate files and print on the standard output.

cat [OPTION]... [FILE]...

cat random-names.txt
tac random-names.txt

comm

Compares two sorted files line by line.

comm [OPTION]... FILE1 FILE2

cat random-names.txt |
    shuf |
    head |
    sort > meeting1.txt

cat random-names.txt |
    shuf |
    head |
    sort > meeting2.txt

comm meeting1.txt meeting2.txt

join

Join lines of two files on a common sorted field.

join [OPTION]... FILE1 FILE2

cat random-names.txt |
    cut -d " " -f1 |
    cat -n > first_names.txt

cat random-names.txt |
    cut -d " " -f2 |
    cat -n > last_names.txt

join first_names.txt last_names.txt

sort

Sort lines of text files by a key/column.

sort [OPTION]... [FILE]...

sort random-names.txt

head / tail

Output the first part/last part of files.

head [OPTION]... [FILE]...

head random-names.txt
tail random-names.txt

uniq

Report or omit (subsequent) repeated lines.

uniq [OPTION]... [INPUT [OUTPUT]]

cat random-names.txt \
    <(head random-names.txt) |
    sort > repeated-names.txt
uniq repeated-names.txt

wc

Print newline, word, and byte counts for each file.

wc [OPTION]... [FILE]...

wc random-names.txt
# Counts only newline characters
wc -l random-names.txt
wc -w random-names.txt
echo -n "😀" | wc -c
echo -n "😀" | wc -m

seq

Print a sequence of numbers.

seq [OPTION]... LAST seq [OPTION]... FIRST LAST seq [OPTION]... FIRST INCREMENT LAST

seq 10
seq 5 10
seq -10 2 10

grep

Print lines matching a pattern.

grep [OPTION...] PATTERNS [FILE...]

grep -E '(.).+ \1.+' random-names.txt

sed

Stream editor for filtering and transforming text.

sed [OPTION]... {script} [input-file]...

cat random-names.txt |
    sed -E 's/(.+) (.+)/\2 \1/'

awk

Pattern scanning and processing language.

awk '{print $2,$1}' random-names.txt

Exercise 2: Combining Building Blocks

There are three ways to combine programs:

  • Pipelines
  • Lists
  • Control Constructs

Pipelines

A sequence of one or more commands separated by one of the control operators | or |&.

[time [-p]] [ ! ] command [ [| | |&] command ... ]

  • command1 | command2 forwards standard output of command1 to standard input of command2.
  • command1 |& command2 forwards standard output and standard error of command1 to standard input of command2.
  • Same as command1 2>&1 | command2

Lists

Sequence of one or more pipelines combined by operators:

  • separated by ;, &, &&, or ||.
  • optionally terminated by ;, & <newline>

Separation operators

  • ; - executes sequentially
  • & - executes in background
  • && - AND
  • || - OR

Example: AND / OR

# command2 is only executed if command1 returns successfully (status zero)
command1 && command2
command1 || command2

Example: Process Substitution

command1 <(command2) <(command3)

Control Constructs

  • if
  • for
  • while
for num in {1..100} ; do
  out=""
  if ! (( $num % 3)) ; then out="Fizz"; fi
  if ! (( $num % 5)) ; then out="${out}Buzz"; fi
  echo ${out:-$num}
done

Exercise 3: Application

  • Text datasets
  • Record/Tuples
  • Gnu utils not used for nested datastructures like XML or JSON.
cut -d' ' -f2 random-names.txt
awk -F" " '{ print $2 }' random-names.txt

Exercise 4: Pager

[...] helps the user get the output one page at a time [...]

  • less
  • more
  • (head)
  • (tail)

Exercise 5: Regex

Algebraic notation for characterizing a set of strings.

  • Deterministic classifier for searching or substitutions.

Quick Overview

  • Disjunction of letters
  • Disjunction of words
  • Ranges
  • Negation
  • Quantifiers
  • Anchors

Disjunction of letters

  • [wW]: w or W

Matches woodchuck or Woodchuck.

Disjunction of words

yours|mine: yours or mine a|b|c: same as [abc] smil(ey|ies): smiley or smilies

Ranges

  • [a-z]: lowercase letter
  • [A-Z]: uppercase letters
  • [0-9]: numbers

[a-z5-9]

  • matches
  • beans
  • 9000
  • but not
  • Hello World
  • 1337

Negation

  • [^A-Z]: not an uppercase letter
  • [^Ss]: neither S or s

Quantifiers

  • colou?r: color or colour
  • o*h!: h! or oh! or ooh! ...
  • o+h!: oh! or ooh! ...
  • a{3,5}h: aaah or aaaah or aaaaah
  • beg.n: begin or begun or beg3n

Anchors

  • ^: start of line
  • $: end of line
  • \b: word boundary

  • \bthe\b

  • matches the world
  • but not other

Resources

Task: Build a email address matching regex. Should match addresses in mails_match.txt not mails_dont_match.txt.

Solution

grep -E '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' mails_match.txt |
    wc -l

Exercise 6: Regex Groups

Regex has the concept of groups:

  • capturing groups (...)
  • non-capturing groups (?:...)

Task 1: Use non-capturing groups to match mac addresses in random-mac.txt. Task 2: Use capturing groups to swap first name and last name in random-names.txt.

Solution: Non-Capturing Groups

cat random-mac.txt |
    grep -E '(?:[0-9A-F]{2}:){5}[0-9A-F]{2}'

Solution: Capturing Groups

cat random-names.txt |
    sed -E 's/(.+) (.+)/\2, \1/'

cat random-names.txt |
    awk '{print $2", "$1}'

cat random-names.txt |
    perl -n -e'/^([A-z]+) ([A-z]+)$/ && print $2, ", ",$1, "\n"'