Tutorial WS22: Foundations in Data Engineering

Here you find my tutorial notes for my tutorial in "Foundations in Data Engineering".

Tutorial 1 notes.

Introduction

Michael Brauweiler

If you plan to keep coming to this tutorial session:
Send email to [michael.brauweiler@posteo.de]
Thursday: Use subject [FDE T-7].
Friday: Use subject [FDE T-9].

Exercise 1: Building Blocks

man
cat
comm
join
sort
head / tail
uniq
wc
seq
grep
awk
sed

man

Interface to the system reference manuals.

man man

cat

Concatenate files and print on the standard output.

cat [OPTION]... [FILE]...

cat random-names.txt
tac random-names.txt

comm

Compares two sorted files line by line.

comm [OPTION]... FILE1 FILE2

cat random-names.txt |
    shuf |
    head |
    sort > meeting1.txt

cat random-names.txt |
    shuf |
    head |
    sort > meeting2.txt

comm meeting1.txt meeting2.txt

join

Join lines of two files on a common sorted field.

join [OPTION]... FILE1 FILE2

cat random-names.txt |
    cut -d " " -f1 |
    cat -n > first_names.txt

cat random-names.txt |
    cut -d " " -f2 |
    cat -n > last_names.txt

join first_names.txt last_names.txt

sort

Sort lines of text files by a key/column.

sort [OPTION]... [FILE]...

sort random-names.txt

head / tail

Output the first part/last part of files.

head [OPTION]... [FILE]...

head random-names.txt
tail random-names.txt

uniq

Report or omit (subsequent) repeated lines.

uniq [OPTION]... [INPUT [OUTPUT]]

cat random-names.txt \
    <(head random-names.txt) |
    sort > repeated-names.txt
uniq repeated-names.txt

wc

Print newline, word, and byte counts for each file.

wc [OPTION]... [FILE]...

wc random-names.txt
# Counts only newline characters
wc -l random-names.txt
wc -w random-names.txt
echo -n "😀" | wc -c
echo -n "😀" | wc -m

seq

Print a sequence of numbers.

seq [OPTION]... LAST seq [OPTION]... FIRST LAST seq [OPTION]... FIRST INCREMENT LAST

seq 10
seq 5 10
seq -10 2 10

grep

Print lines matching a pattern.

grep [OPTION...] PATTERNS [FILE...]

grep -E '(.).+ \1.+' random-names.txt

sed

Stream editor for filtering and transforming text.

sed [OPTION]... {script} [input-file]...

cat random-names.txt |
    sed -E 's/(.+) (.+)/\2 \1/'

awk

Pattern scanning and processing language.

awk '{print $2,$1}' random-names.txt

Exercise 2: Combining Building Blocks

There are three ways to combine programs:

Pipelines
Lists
Control Constructs

Pipelines

A sequence of one or more commands separated by one of the control operators | or |&.

[time [-p]] [ ! ] command [ [| | |&] command ... ]

command1 | command2 forwards standard output of command1 to standard input of command2.
command1 |& command2 forwards standard output and standard error of command1 to standard input of command2.
Same as command1 2>&1 | command2

Lists

Sequence of one or more pipelines combined by operators:

separated by ;, &, &&, or ||.
optionally terminated by ;, & <newline>

Separation operators

; - executes sequentially
& - executes in background
&& - AND
|| - OR

Example: AND / OR

# command2 is only executed if command1 returns successfully (status zero)
command1 && command2
command1 || command2

Example: Process Substitution

command1 <(command2) <(command3)

Control Constructs

if
for
while

for num in {1..100} ; do
  out=""
  if ! (( $num % 3)) ; then out="Fizz"; fi
  if ! (( $num % 5)) ; then out="${out}Buzz"; fi
  echo ${out:-$num}
done

Exercise 3: Application

Text datasets
Record/Tuples
Gnu utils not used for nested datastructures like XML or JSON.

cut -d' ' -f2 random-names.txt
awk -F" " '{ print $2 }' random-names.txt

Exercise 4: Pager

[...] helps the user get the output one page at a time [...]

less
more
(head)
(tail)

Exercise 5: Regex

Algebraic notation for characterizing a set of strings.

Deterministic classifier for searching or substitutions.

Quick Overview

Disjunction of letters
Disjunction of words
Ranges
Negation
Quantifiers
Anchors

Disjunction of letters

[wW]: w or W

Matches woodchuck or Woodchuck.

Disjunction of words

yours|mine: yours or mine a|b|c: same as [abc] smil(ey|ies): smiley or smilies

Ranges

[a-z]: lowercase letter
[A-Z]: uppercase letters
[0-9]: numbers

[a-z5-9]

matches
beans
9000
but not
Hello World
1337

Negation

[^A-Z]: not an uppercase letter
[^Ss]: neither S or s

Quantifiers

colou?r: color or colour
o*h!: h! or oh! or ooh! ...
o+h!: oh! or ooh! ...
a{3,5}h: aaah or aaaah or aaaaah
beg.n: begin or begun or beg3n

Anchors

^: start of line
$: end of line
\b: word boundary
\bthe\b
matches the world
but not other

Resources

Task: Build a email address matching regex. Should match addresses in mails_match.txt not mails_dont_match.txt.

Solution

grep -E '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' mails_match.txt |
    wc -l

Exercise 6: Regex Groups

Regex has the concept of groups:

capturing groups (...)
non-capturing groups (?:...)

Task 1: Use non-capturing groups to match mac addresses in random-mac.txt. Task 2: Use capturing groups to swap first name and last name in random-names.txt.

Solution: Non-Capturing Groups

cat random-mac.txt |
    grep -E '(?:[0-9A-F]{2}:){5}[0-9A-F]{2}'

Solution: Capturing Groups

cat random-names.txt |
    sed -E 's/(.+) (.+)/\2, \1/'

cat random-names.txt |
    awk '{print $2", "$1}'

cat random-names.txt |
    perl -n -e'/^([A-z]+) ([A-z]+)$/ && print $2, ", ",$1, "\n"'