CSEP 590: Data Compression
Autumn 2007

Lecture 1:
• Course Policies
• Introduction to Data Compression
• Entropy
• Variable Length Codes
Instructors
• Instructor
– Richard Ladner
– ladner@cs.washington.edu
– 206 543-9347
• TA
– Rahul Vanam
– rahulv@u.washington.edu
Helpful Knowledge
• Algorithm Design and Analysis
• Probability
Resources
• Textbook
– Khalid Sayood, Introduction to Data Compression, 
Third Edition, Morgan Kaufmann Publishers, 2006. 
• Course Web Page
– http://www.cs.washington.edu/csep590a
• Papers and Sections from Books
• Discussion Board
– For discussion
Engagement by Students
• Weekly Assignments
– Understand compression methodology
– Due in class on Fridays (except midterm Friday)
– No late assignments accepted except with prior 
approval
• Programming Projects
– Bi-level arithmetic coder and decoder.
– Build code and experiment
Final Exam and Grading
• 6:30-8:20 p.m. Thursday, Dec. 13, 2007 
• Percentages
– Weekly assignments (50%) 
– Project (20%) 
– Final exam (30%) 
Logistics
• I will be gone the week of October 15th. We'll need to have a make-up class.
• There is no class Thanksgiving week, November 19th.
• We have some guest speakers toward the 
end of the quarter.
Basic Data Compression Concepts
[Diagram: original x → Encoder → compressed y → Decoder → decompressed x̂]
• Lossless compression
– Also called entropy coding, reversible coding.
– The decompressed data equals the original: x̂ = x.
• Lossy compression
– Also called irreversible coding.
– The decompressed data approximates the original: x̂ ≠ x.
• Compression ratio = |x| / |y|
– |x| is the number of bits in x.
Why Compress
• Conserve storage space
• Reduce time for transmission
– Faster to encode, send, then decode than to send 
the original
• Progressive transmission
– Some compression techniques allow us to send the most important bits first, so we can get a low-resolution version of some data before getting the high-fidelity version
• Reduce computation
– Use less data to achieve an approximate answer
Braille
• System for reading text by feeling raised dots on paper (or on electronic displays). Invented in the 1820s by Louis Braille, a blind Frenchman.
[Figure: Braille cells for the letters a, b, c, z; for the whole words and, the, with, mother; and for the letter groups th, gh, ch]
Braille Example
Clear text:
Call me Ishmael. Some years ago -- never mind how long precisely -- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. (238 characters)
Grade 2 Braille in ASCII:
,call me ,i%mael4 ,``s ye>s ago -- n``e m9d h[ l;g precisely -- hav+ ll or no m``oy 9 my purse1 & no?+ ``picul> 6 9t]e/ me on %ore1 ,i ?``| ,i wd sail ab a ll & see ! wat]y ``p ( ! _w4 (203 characters)
Compression ratio = 238/203 ≈ 1.17
Lossless Compression
• Data is not lost - the original is really needed.
– text compression
– compression of computer binary files
• Compression ratio typically no better than 4:1 for 
lossless compression on many kinds of files.
• Statistical Techniques
– Huffman coding
– Arithmetic coding
– Golomb coding
• Dictionary techniques
– LZW, LZ77 
– Sequitur 
– Burrows-Wheeler Method
• Standards - Morse code, Braille, Unix compress, gzip, 
zip, bzip, GIF, JBIG, Lossless JPEG
Lossy Compression 
• Data is lost, but not too much.
– audio
– video
– still images, medical images, photographs
• Compression ratios of 10:1 often yield quite 
high fidelity results.
• Major techniques include
– Vector Quantization
– Wavelets
– Block transforms
– Standards - JPEG, JPEG2000, MPEG 2, H.264 
Why is Data Compression Possible
• Most data from nature has redundancy
– There is more data than the actual information 
contained in the data.
– Squeezing out the excess data amounts to 
compression.
– However, unsqueezing is necessary to be able to 
figure out what the data means.
• Information theory is needed to understand 
the limits of compression and give clues on 
how to compress well.
What is Information
• Analog data
– Also called continuous data
– Represented by real numbers (or complex numbers)
• Digital data
– Finite set of symbols {a1, a2, ..., am}
– All data represented as sequences (strings) over the symbol set.
– Example: {a,b,c,d,r}   abracadabra
– Digital data can be an approximation to analog data
Symbols
• Roman alphabet plus punctuation
• ASCII - 256 symbols
• Binary - {0,1}
– 0 and 1 are called bits
– All digital information can be represented efficiently in binary
– Fixed-length representation of {a,b,c,d} at 2 bits per symbol:

symbol  a   b   c   d
binary  00  01  10  11
Exercise - How Many Bits Per Symbol?
• Suppose we have n symbols. How many bits (as a function of n) are needed to represent a symbol in binary? (See the sketch below.)
– First try n a power of 2.
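A fixed-length code needs ceil(log2(n)) bits per symbol. A minimal Python sketch of this (my illustration; the function name is made up, not from the course):

    import math

    def fixed_length_bits(n: int) -> int:
        """Bits needed to give each of n symbols its own fixed-length binary codeword."""
        if n < 1:
            raise ValueError("need at least one symbol")
        return max(1, math.ceil(math.log2(n)))

    # Powers of two use the logarithm exactly; other alphabet sizes round up,
    # leaving some codewords unused.
    for n in (2, 4, 8, 5):
        print(n, "symbols ->", fixed_length_bits(n), "bits per symbol")
    # 2 -> 1, 4 -> 2, 8 -> 3, 5 -> 3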
Discussion: Non-Powers of Two
• Can we do better than a fixed-length representation when n is not a power of two?
Information Theory
• Developed by Shannon in the 1940s and 50s
• Attempts to explain the limits of communication 
using probability theory.
• Example: Suppose English text is being sent
– It is much more likely to receive an “e” than a “z”.
– In some sense “z” has more information than “e”.
First-order Information
• Suppose we are given symbols {a1, a2, ..., am}.
• P(ai) = probability of symbol ai occurring in the absence of any other information.
P(a1) + P(a2) + ... + P(am) = 1
• inf(ai) = log2(1/P(ai)) is the information of ai, in bits.
[Figure: plot of y = -log2(x) for 0 < x < 1; a symbol's information grows as its probability shrinks]
Example
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8
– inf(a) = log2(8) = 3
– inf(b) = log2(4) = 2
– inf(c) = log2(8/5) ≈ 0.678
• Receiving an “a” carries more information than receiving a “b” or “c” (see the sketch below).
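A small Python sketch of the definition (my illustration, not course code), reproducing the three values above:

    import math

    def inf(p: float) -> float:
        """Self-information, in bits, of a symbol with probability p."""
        return math.log2(1 / p)

    for name, p in (("a", 1/8), ("b", 1/4), ("c", 5/8)):
        print(f"inf({name}) = {inf(p):.3f} bits")
    # inf(a) = 3.000, inf(b) = 2.000, inf(c) = 0.678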
First Order Entropy
• The first order entropy is defined for a probability distribution over symbols {a1, a2, ..., am}:

H = Σ (i = 1 to m) P(ai) · log2(1/P(ai))

• H is the average number of bits required to code up a symbol, given all we know is the probability distribution of the symbols.
• H is the Shannon lower bound on the average number of bits to code a symbol in this “source model”.
• Stronger models of entropy include context.
Entropy Examples
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8:
– H = (1/8)·3 + (1/4)·2 + (5/8)·0.678 ≈ 1.3 bits/symbol
• {a, b, c} with P(a) = P(b) = P(c) = 1/3 (worst case):
– H = 3 · (1/3) · log2(3) ≈ 1.6 bits/symbol
• Note that a standard fixed-length code takes 2 bits per symbol (a sketch of the entropy computation follows):

symbol       a   b   c
binary code  00  01  10
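A minimal Python sketch of the entropy formula (again my illustration), confirming both examples; the zero-probability guard anticipates the extreme case on the next slide:

    import math

    def entropy(probs) -> float:
        """First-order entropy H = sum of p * log2(1/p), skipping zero-probability symbols."""
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    print(entropy([1/8, 1/4, 5/8]))  # ~1.30 bits/symbol
    print(entropy([1/3, 1/3, 1/3]))  # ~1.58 bits/symbol
    print(entropy([1, 0, 0]))        # 0.0 -- the "extreme case" on the next slide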
An Extreme Case
• {a, b, c} with P(a) = 1, P(b) = 0, P(c) = 0
– H = ?
Entropy Curve
• Suppose we have two symbols with probabilities x and 1-x, respectively.
[Figure: plot of the entropy -(x·log2(x) + (1-x)·log2(1-x)) against the probability x of the first symbol; maximum entropy of 1 bit at x = 0.5]
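The curve itself is easy to trace numerically; a short sketch using the same formula (my addition, not from the slides):

    import math

    def binary_entropy(x: float) -> float:
        """Entropy of a two-symbol source with probabilities x and 1-x."""
        if x in (0.0, 1.0):
            return 0.0
        return -(x * math.log2(x) + (1 - x) * math.log2(1 - x))

    for x in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"x = {x:.1f}  H = {binary_entropy(x):.3f} bits")
    # symmetric around x = 0.5, where H peaks at 1 bit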
A Simple Prefix Code
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8.
• A prefix code is defined by a binary tree.
• Prefix code property
– no codeword is a prefix of another

code: a = 00, b = 01, c = 1
[Figure: binary tree. From the root, the 1-edge leads to leaf c; the 0-edge leads to an internal node whose 0-edge leads to leaf a and whose 1-edge leads to leaf b]

input:  c c a  b  c c b  c c c
output: 1 1 00 01 1 1 01 1 1 1
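A minimal Python sketch of encoding with this code table (my illustration; the slides define the code, not this function):

    # Code table from the slide above.
    CODE = {"a": "00", "b": "01", "c": "1"}

    def encode(message: str) -> str:
        """Concatenate codewords; the prefix property makes the result uniquely decodable."""
        return "".join(CODE[sym] for sym in message)

    print(encode("ccabccbccc"))  # 1100011101111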
Binary Tree Terminology
[Figure: a binary tree with the root, a leaf, and an internal node labeled]
1. Each node, except the root, has a unique parent.
2. Each internal node has exactly two children.
Decoding a Prefix Code
[Figure: the same prefix-code tree: a = 00, b = 01, c = 1]

repeat
   start at root of tree
   repeat
      if read bit = 1 then go right
      else go left
   until node is a leaf
   report leaf
until end of the code

Example input: 11000111100
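A runnable counterpart to the pseudocode, assuming a nested-dict tree representation (my sketch, not the course's implementation):

    # Tree from the slide: go left on 0, right on 1; leaves hold symbols.
    TREE = {"0": {"0": "a", "1": "b"}, "1": "c"}

    def decode(bits: str) -> str:
        """Walk the tree bit by bit; emit a symbol at each leaf and restart at the root."""
        out, node = [], TREE
        for bit in bits:
            node = node[bit]
            if isinstance(node, str):  # reached a leaf
                out.append(node)
                node = TREE            # start again at the root
        return "".join(out)

    print(decode("11000111100"))  # ccabccca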
Tracing the example: 1 → c, 1 → c, 00 → a, 01 → b, 1 → c, 1 → c, 1 → c, 00 → a, so 11000111100 decodes to ccabccca.