CIS265/506
Storage Basics






Hard disks come in several interfaces and formats.
Storage Capacity is measured in gigabytes (GB).
Bandwidth determines how fast data can be moved to or from storage. It is
measured in MB/s, with both sustained and burst rates for reads and writes.
Access Time is measured in ms and consists of seek time (the head moving across the
platter), rotational latency (the time it takes the platter to rotate to the correct
position) and block transfer time (the time to read or write a block). In general, higher
RPM, smaller platters and more platters all make for faster access.
Mean Time Between Failures (MTBF) is the average number of hours of operation
before a drive fails.
Interface is the protocol that the drive uses to communicate with the PC.
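As a rough worked example of how the access-time components add up, the sketch below computes average rotational latency from the RPM figure and adds a seek time and a block transfer time. It assumes the usual convention that the platter must turn half a revolution on average. The class name, method names, and the 9 ms seek and 0.1 ms transfer figures are illustrative assumptions, not values from the slides.

```java
class DiskAccessTime {
    // One revolution takes 60,000 ms / RPM; on average the wanted block
    // is half a revolution away from the head (assumed convention).
    static double avgRotationalLatencyMs(int rpm) {
        double msPerRevolution = 60_000.0 / rpm;
        return msPerRevolution / 2.0;
    }

    // Access time = seek time + rotational latency + block transfer time.
    static double accessTimeMs(double seekMs, int rpm, double transferMs) {
        return seekMs + avgRotationalLatencyMs(rpm) + transferMs;
    }

    public static void main(String[] args) {
        // Hypothetical drive: 9 ms average seek, 0.1 ms block transfer.
        for (int rpm : new int[] {5400, 7200, 15000}) {
            System.out.printf("%d RPM: latency %.2f ms, access %.2f ms%n",
                    rpm, avgRotationalLatencyMs(rpm), accessTimeMs(9.0, rpm, 0.1));
        }
    }
}
```

With these assumed figures, raising the RPM shortens only the rotational part of the total, which is why seek time also matters.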
Terminology





Heads are the read/write ‘needles’ that access the drive. In general there are 2 per
platter.
Spindle is what the drive platters spin on.
Platter is a magnetically coated disk that resembles a record and stores numerous 0s
and 1s. A drive may have multiple platters stacked on top of one another (typically 20
GB a platter for IDE and 18 GB a platter for SCSI).
Tracks and Cylinders (multi-platter tracks) are positional descriptors assigned to each
“ring” of a disk.
Sector is another positional descriptor of the disk: a pie-shaped slice of the disk
that contains many blocks.
Terminology
 Blocks are the combined position of sector and track numbers and typically store 512
to 4096 bytes each. Blocks are separated by inter-block gaps, which serve as
“speed bumps” so that the drive knows where blocks begin and end. Blocks can be
combined into contiguous, logically addressable units called clusters.
 Hardware Address consists of block, sector and track numbers.
Why do we care?
 Hard drive performance is measured in milliseconds (ms), while your computer
processes information in nanoseconds (ns).
 A millisecond is a million nanoseconds, so a hard drive access can be many thousands
of times slower than a CPU operation.
 Because the disk is the bottleneck, any speedup in hard drive access can yield a serious
speedup in machine performance.
From “Data Structures for Java”
William H. Ford
William R. Topp
Chapter 23
File Compression
Binary Files
 Files are either text files or binary files. Java deals with files by creating a byte
stream that connects the file and the application.
 Binary files can be handled with the DataInputStream and DataOutputStream classes.
Binary Files (continued)
 A data input stream lets an application read primitive Java data types from an
underlying input stream in a machine-independent way.
 A data output stream lets an application write primitive Java data types to an output
stream in a portable way. An application can then use a data input stream to read the
data back in.
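A minimal sketch of that round trip, assuming an in-memory byte array in place of a real file so the example is self-contained. The class name, method names, and the sample record (an int, a double, and a string) are illustrative choices, not taken from the text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class DataStreamDemo {
    // Writes three primitive-style values in a fixed, portable binary layout.
    static byte[] writeRecord(int id, double price, String name) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(id);        // 4 bytes
            out.writeDouble(price);  // 8 bytes
            out.writeUTF(name);      // 2-byte length, then the encoded characters
        }
        return bytes.toByteArray();
    }

    // Reads the values back in exactly the order they were written.
    static String readRecord(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int id = in.readInt();
            double price = in.readDouble();
            String name = in.readUTF();
            return id + " " + price + " " + name;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = writeRecord(42, 19.5, "widget");
        System.out.println(data.length + " bytes: " + readRecord(data));
    }
}
```

The reader must know the order and types of the values, because the stream stores only the raw bytes and no type information.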
File Compression
 Lossless compression loses no data and is used for data backup.
File Compression (continued)
 Lossy compression is used for applications like sound and video compression and
causes minor loss of data.
File Compression (continued)
 The compression ratio is the ratio of the number of bits in the original data to the
number of bits in the compressed image. For instance, if a data file contains 500,000
bytes and the compressed data contains 100,000 bytes, the compression ratio is 5:1.
Huffman Compression
 Huffman compression relies on counting the number of occurrences of each 8-bit byte
in the data and generating a sequence of optimal binary codes called prefix codes.
 The Huffman algorithm is an example of a greedy algorithm. A greedy algorithm
makes the locally optimal choice at each step in the hope of creating an optimal solution
to the entire problem.
Huffman Compression (continued)
 The algorithm generates a table that contains the frequency of occurrence of each
byte in the file. Using these frequencies, the algorithm assigns each byte a string of
bits known as its bit code and writes the bit code to the compressed image in place of
the original byte.
 Compression occurs if, on average, each 8-bit char in a file is replaced by a shorter bit
sequence.
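The frequency table described above can be sketched as a 256-entry array indexed by byte value. The class and method names are hypothetical, and the input is assumed to be already loaded into a byte array.

```java
import java.nio.charset.StandardCharsets;

class ByteFrequency {
    // Returns a table where entry b is the number of times byte value b occurs.
    static long[] count(byte[] data) {
        long[] freq = new long[256];
        for (byte b : data) {
            // Java bytes are signed; the mask maps -128..-1 onto 128..255.
            freq[b & 0xFF]++;
        }
        return freq;
    }

    public static void main(String[] args) {
        long[] freq = count("abracadabra".getBytes(StandardCharsets.US_ASCII));
        for (int b = 0; b < freq.length; b++) {
            if (freq[b] > 0) {
                System.out.println((char) b + ": " + freq[b]);
            }
        }
    }
}
```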
Huffman Compression (continued)
 Use a binary tree to represent bit codes.
A left edge is a 0 and a right edge is a 1. Each interior node specifies a frequency
count, and each leaf node holds a character and its frequency.
Huffman Compression (continued)
 Each data byte occurs only in a leaf node, so no bit code is a prefix of any other
bit code. Such codes are called prefix codes.
 A full binary tree is one in which each interior node has two children.
 By converting the tree to a full tree, we can generate better bit codes for our example.
Huffman Compression (continued)
 To compress a file, replace each char by its prefix code. To uncompress, follow the bit
code bit-by-bit from the root of the tree to the corresponding character. Write the
character to the uncompressed file, then start again at the root.
 Good compression involves choosing an optimal tree. It can be shown that the
optimal bit codes for a file are always represented by a full tree.
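To illustrate the compress and uncompress steps, here is a small sketch using a hand-built full tree with the codes a = 0, b = 10, c = 11 (left edge 0, right edge 1). The tree, the class and method names, and the use of a String of '0' and '1' characters are assumptions made for readability; a real compressor would pack the bits into bytes.

```java
import java.util.Map;

class PrefixCodeDemo {
    static final class Node {
        final char symbol;       // meaningful only in a leaf
        final Node left, right;  // both null in a leaf

        Node(char symbol) { this.symbol = symbol; this.left = null; this.right = null; }
        Node(Node left, Node right) { this.symbol = '\0'; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Hypothetical full tree: 'a' is the left child of the root,
    // 'b' and 'c' hang under the right child.
    static Node sampleTree() {
        return new Node(new Node('a'), new Node(new Node('b'), new Node('c')));
    }

    static Map<Character, String> sampleCodes() {
        return Map.of('a', "0", 'b', "10", 'c', "11");
    }

    // Compress: replace each char by its prefix code.
    static String compress(String text, Map<Character, String> codes) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            bits.append(codes.get(c));
        }
        return bits.toString();
    }

    // Uncompress: walk from the root one bit at a time; on reaching a leaf,
    // emit its character and return to the root.
    static String uncompress(String bits, Node root) {
        StringBuilder text = new StringBuilder();
        Node current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.isLeaf()) {
                text.append(current.symbol);
                current = root;
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String bits = compress("abca", sampleCodes());
        System.out.println(bits + " -> " + uncompress(bits, sampleTree()));  // prints "010110 -> abca"
    }
}
```

No separator is needed between codes: because every character sits in a leaf, the walk always knows when one code has ended.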
Huffman Compression (continued)
 A Huffman tree generates optimal prefix codes: no other prefix code for the same byte
frequencies produces fewer bits in the compressed image.
Building a Huffman Tree
 For each of the n distinct bytes in a file, assign the byte and its frequency to a tree
node, and insert the node into a minimum priority queue ordered by frequency.
Building a Huffman Tree (continued)
 Remove the two elements with the lowest frequencies, x and y, from the priority queue,
and attach them as children of a node whose frequency is the sum of the frequencies of
its children. Insert the resulting node into the priority queue.
 In a loop, perform this action n-1 times. Each loop iteration creates one of the n-1
interior nodes of the full tree. The single node left in the queue is the root.
Building a Huffman Tree (continued)
 With a minimum priority queue, the least frequently occurring characters are merged
first and end up deepest in the tree, so they have longer bit codes, and the more
frequently occurring characters have shorter bit codes.
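The construction steps above can be sketched with java.util.PriorityQueue as the minimum priority queue. This is an illustration under stated assumptions, not the textbook's code: the class name, the Node layout, the use of -1 to mark interior nodes, and the String bit codes are all choices made here for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class HuffmanSketch {
    // A leaf holds a byte value (0..255); an interior node uses symbol -1.
    static final class Node implements Comparable<Node> {
        final int symbol;
        final long freq;
        final Node left, right;

        Node(int symbol, long freq, Node left, Node right) {
            this.symbol = symbol;
            this.freq = freq;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }

        @Override
        public int compareTo(Node other) { return Long.compare(freq, other.freq); }
    }

    // Builds the tree from a 256-entry frequency table.
    static Node buildTree(long[] freq) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (int b = 0; b < freq.length; b++) {
            if (freq[b] > 0) queue.add(new Node(b, freq[b], null, null));
        }
        if (queue.isEmpty()) throw new IllegalArgumentException("empty input");
        while (queue.size() > 1) {          // runs n-1 times for n leaves
            Node x = queue.remove();        // lowest frequency
            Node y = queue.remove();        // next lowest
            queue.add(new Node(-1, x.freq + y.freq, x, y));
        }
        return queue.remove();              // the root
    }

    // Maps each byte value to its bit code: left edge 0, right edge 1.
    static Map<Integer, String> codes(Node root) {
        Map<Integer, String> table = new TreeMap<>();
        collect(root, "", table);
        return table;
    }

    private static void collect(Node node, String prefix, Map<Integer, String> table) {
        if (node.isLeaf()) {
            // A file with one distinct byte has a leaf as root; give it code "0".
            table.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        collect(node.left, prefix + "0", table);
        collect(node.right, prefix + "1", table);
    }

    public static void main(String[] args) {
        long[] freq = new long[256];
        for (byte b : "abracadabra".getBytes(StandardCharsets.US_ASCII)) {
            freq[b & 0xFF]++;
        }
        codes(buildTree(freq)).forEach((symbol, code) ->
                System.out.println((char) symbol.intValue() + " -> " + code));
    }
}
```

Ties between equal frequencies may be broken in different ways, so the exact codes can vary, but the total encoded length does not: for "abracadabra" it is 23 bits, against 88 bits for the 11 original bytes.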
Huffman Tree
 Review pages 415-422 in your text for code and additional information
Serialization
 A persistent object can exist apart from the executing program and can be stored in a
file.
 Serialization involves storing and retrieving objects from an external file.
 The classes ObjectOutputStream and ObjectInputStream are used for serialization.
Serialization (continued)
 Assume anObject is an instance of a class that implements the Serializable interface.
Serialization (continued)
 Deserializing an Object.
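The code from the original slides is not reproduced in this transcript, so the following is only a sketch of what serializing and deserializing anObject might look like. The Point class, the save and load method names, and the temporary file are hypothetical; ObjectOutputStream.writeObject and ObjectInputStream.readObject are the standard calls.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

class SerializationDemo {
    // Hypothetical persistent class; it must implement Serializable.
    static final class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;

        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Serializing: write the object to an external file.
    static void save(Object anObject, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(anObject);
        }
    }

    // Deserializing: read the object back; the caller casts to the expected type.
    static Object load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Path file = Files.createTempFile("point", ".ser");
        save(new Point(3, 4), file);
        Point copy = (Point) load(file);
        System.out.println(copy.x + ", " + copy.y);
        Files.delete(file);
    }
}
```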