CSEP 590 Data Compression, Autumn 2007
Lecture 1: Course Policies; Introduction to Data Compression; Entropy; Variable Length Codes

Slide 2: Instructors
• Instructor: Richard Ladner, ladner@cs.washington.edu, 206 543-9347
• TA: Rahul Vanam, rahulv@u.washington.edu

Slide 3: Helpful Knowledge
• Algorithm Design and Analysis
• Probability

Slide 4: Resources
• Text Book: Khalid Sayood, Introduction to Data Compression, Third Edition, Morgan Kaufmann Publishers, 2006.
• Course Web Page: http://www.cs.washington.edu/csep590a
• Papers and Sections from Books
• Discussion Board

Slide 5: Engagement by Students
• Weekly Assignments
  – Understand compression methodology
  – Due in class on Fridays (except midterm Friday)
  – No late assignments accepted except with prior approval
• Programming Projects
  – Bi-level arithmetic coder and decoder
  – Build code and experiment

Slide 6: Final Exam and Grading
• Final exam: 6:30-8:20 p.m. Thursday, Dec. 13, 2007
• Percentages
  – Weekly assignments (50%)
  – Project (20%)
  – Final exam (30%)
Slide 7: Logistics
• I will be gone the week of October 15th. We'll need to have a make-up class.
• There is no class Thanksgiving week (November 19th).
• We have some guest speakers toward the end of the quarter.

Slide 8: Basic Data Compression Concepts
• [Diagram: original x -> Encoder -> compressed y -> Decoder -> decompressed x̂]
• Lossless compression: also called entropy coding or reversible coding; x̂ = x.
• Lossy compression: also called irreversible coding; x̂ ≠ x.
• Compression ratio = |x| / |y|, where |x| is the number of bits in x.

Slide 9: Why Compress
• Conserve storage space
• Reduce time for transmission
  – Faster to encode, send, then decode than to send the original
• Progressive transmission
  – Some compression techniques allow us to send the most important bits first, so we can get a low-resolution version of some data before getting the high-fidelity version
• Reduce computation
  – Use less data to achieve an approximate answer

Slide 10: Braille
• A system for reading text by feeling raised dots on paper (or on electronic displays). Invented in the 1820s by Louis Braille, a blind Frenchman.
• [Figure: Braille cells for a, b, c, z and contractions for "and", "the", "with", "mother", "th", "gh", "ch"]

Slide 11: Braille Example
• Clear text (238 characters):
  Call me Ishmael. Some years ago -- never mind how long precisely -- having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
• Grade 2 Braille in ASCII (203 characters):
  ,call me ,i%mael4 ,``s ye>s ago -- n``e m9d h[ l;g precisely -- hav+ ll or no m``oy 9 my purse1 & no?+ ``picul> 6 9t]e/ me on %ore1 ,i ?``| ,i wd sail ab a ll & see ! wat]y ``p ( ! _w4
• Compression ratio = 238/203 = 1.17

Slide 12: Lossless Compression
• Data is not lost; the original is really needed.
  – text compression
  – compression of computer binary files
• Compression ratio typically no better than 4:1 for lossless compression on many kinds of files.
• Statistical techniques: Huffman coding, arithmetic coding, Golomb coding
• Dictionary techniques: LZW, LZ77, Sequitur, Burrows-Wheeler Method
• Standards: Morse code, Braille, Unix compress, gzip, zip, bzip, GIF, JBIG, Lossless JPEG
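The compression ratio defined above can be checked against the Braille example. A minimal sketch (character counts stand in for bit counts, as in the slide; the function name is my own):

```python
# Compression ratio = |x| / |y|: size of the original over size of the
# compressed data. Here character counts substitute for bit counts,
# matching the Braille example (238 clear-text vs. 203 Braille characters).
def compression_ratio(original_size, compressed_size):
    return original_size / compressed_size

print(round(compression_ratio(238, 203), 2))  # 1.17, as on the slide
```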
Slide 13: Lossy Compression
• Data is lost, but not too much.
  – audio
  – video
  – still images, medical images, photographs
• Compression ratios of 10:1 often yield quite high-fidelity results.
• Major techniques: vector quantization, wavelets, block transforms
• Standards: JPEG, JPEG2000, MPEG-2, H.264

Slide 14: Why is Data Compression Possible?
• Most data from nature has redundancy.
  – There is more data than the actual information contained in the data.
  – Squeezing out the excess data amounts to compression.
  – However, unsqueezing is necessary to be able to figure out what the data means.
• Information theory is needed to understand the limits of compression and to give clues on how to compress well.

Slide 15: What is Information?
• Analog data
  – Also called continuous data
  – Represented by real numbers (or complex numbers)
• Digital data
  – Finite set of symbols {a1, a2, ..., am}
  – All data represented as sequences (strings) over the symbol set.
  – Example: over {a,b,c,d,r}, the string abracadabra
  – Digital data can be an approximation to analog data

Slide 16: Symbols
• Roman alphabet plus punctuation
• ASCII: 256 symbols
• Binary: {0,1}
  – 0 and 1 are called bits
  – All digital information can be represented efficiently in binary
  – {a,b,c,d} fixed-length representation, 2 bits per symbol:

    symbol  binary
    a       00
    b       01
    c       10
    d       11

Slide 17: Exercise - How Many Bits Per Symbol?
• Suppose we have n symbols. How many bits (as a function of n) are needed to represent a symbol in binary?
  – First try n a power of 2.

Slide 18: Discussion: Non-Powers of Two
• Can we do better than a fixed-length representation when n is not a power of two?
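The exercise above has a standard answer for fixed-length codes: n symbols need ceil(log2(n)) bits each. A small sketch of that calculation:

```python
import math

# Fixed-length binary codes: n symbols need ceil(log2(n)) bits each.
# For n a power of 2 this is exact (4 symbols -> 2 bits, as in the
# {a,b,c,d} table above); otherwise some codewords go unused.
def bits_per_symbol(n):
    return math.ceil(math.log2(n))

print(bits_per_symbol(4))    # 2
print(bits_per_symbol(5))    # 3 (one codeword wasted: 8 patterns, 5 symbols)
print(bits_per_symbol(256))  # 8 (an ASCII-sized alphabet)
```

The wasted codewords for non-powers of two are exactly what motivates the variable-length codes discussed on the following slides.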
Slide 19: Information Theory
• Developed by Shannon in the 1940s and 50s
• Attempts to explain the limits of communication using probability theory.
• Example: suppose English text is being sent.
  – It is much more likely to receive an "e" than a "z".
  – In some sense "z" has more information than "e".

Slide 20: First-Order Information
• [Plot: y = -log(x) for x in (0,1)]
• Suppose we are given symbols {a1, a2, ..., am}.
• P(ai) = probability of symbol ai occurring in the absence of any other information; P(a1) + P(a2) + ... + P(am) = 1.
• inf(ai) = log2(1/P(ai)) is the information of ai in bits.

Slide 21: Example
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8
  – inf(a) = log2(8) = 3
  – inf(b) = log2(4) = 2
  – inf(c) = log2(8/5) = .678
• Receiving an "a" has more information than receiving a "b" or "c".

Slide 22: First-Order Entropy
• The first-order entropy is defined for a probability distribution over symbols {a1, a2, ..., am}:
    H = Σ_{i=1}^{m} P(ai) · log2(1/P(ai))

• H is the average number of bits required to code up a symbol, given that all we know is the probability distribution of the symbols.
• H is the Shannon lower bound on the average number of bits to code a symbol in this "source model".
• Stronger models of entropy include context.

Slide 23: Entropy Examples
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8
  – H = (1/8)(3) + (1/4)(2) + (5/8)(.678) = 1.3 bits/symbol
• {a, b, c} with P(a) = P(b) = P(c) = 1/3 (worst case)
  – H = 3 · (1/3) · log2(3) ≈ 1.6 bits/symbol
• Note that a standard fixed-length code takes 2 bits per symbol:

    symbol  binary code
    a       00
    b       01
    c       10

Slide 24: An Extreme Case
• {a, b, c} with P(a) = 1, P(b) = 0, P(c) = 0
  – H = ?
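The entropy examples above can be reproduced directly from the definition. A minimal sketch (the function name is my own; zero-probability symbols are skipped, following the usual convention that 0 · log2(1/0) is taken as 0):

```python
import math

# First-order entropy H = sum_i P(a_i) * log2(1/P(a_i)): the Shannon
# lower bound on average bits per symbol under this source model.
def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(round(entropy([1/8, 1/4, 5/8]), 2))  # 1.3 bits/symbol
print(round(entropy([1/3, 1/3, 1/3]), 2))  # 1.58 bits/symbol (worst case)
print(entropy([1.0, 0.0, 0.0]))            # 0.0 -- the extreme case
```

This also answers the extreme case on Slide 24: a source that always emits the same symbol carries no information, so H = 0.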
Slide 25: Entropy Curve
• Suppose we have two symbols with probabilities x and 1-x, respectively.
• [Plot: binary entropy -(x log x + (1-x) log(1-x)) versus x; maximum entropy at x = .5]

Slide 26: A Simple Prefix Code
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8.
• A prefix code is defined by a binary tree.
• Prefix code property: no codeword is a prefix of another.
• [Binary tree: the root's left (0) child is an internal node with leaves a (0) and b (1); the root's right (1) child is the leaf c]
• Code: a = 00, b = 01, c = 1
• Example: input ccabccbccc, output 1 1 00 01 1 1 01 1 1 1

Slide 27: Binary Tree Terminology
• [Diagram: root, internal nodes, leaves]
1. Each node, except the root, has a unique parent.
2. Each internal node has exactly two children.

Slides 28-30: Decoding a Prefix Code
• Algorithm:
    repeat
        start at root of tree
        repeat
            if read bit = 1 then go right else go left
        until node is a leaf
        report leaf
    until end of the code
• Example input: 11000111100; the first decoded symbol is c.
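The decoding algorithm above can be sketched directly. This is a minimal implementation under my own representation (a leaf is a symbol string; an internal node is a (left, right) pair), using the code tree from Slide 26:

```python
# Decoding a prefix code by walking the code tree, as in the slides:
# start at the root, go right on 1 and left on 0 until reaching a leaf,
# report the leaf's symbol, and restart at the root.
# The tree below encodes a = 00, b = 01, c = 1.
tree = (("a", "b"), "c")

def decode(bits, tree):
    out = []
    node = tree
    for bit in bits:
        node = node[1] if bit == "1" else node[0]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = tree            # restart at the root
    return "".join(out)

print(decode("11000111100", tree))    # ccabccca
print(decode("1100011101111", tree))  # ccabccbccc, the Slide 26 example
```

Decoding never needs separators between codewords: the prefix property guarantees that the walk reaches exactly one leaf per codeword.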
