|
Post by Progenitor A on Jan 29, 2011 21:00:03 GMT 1
I think I follow you. It is the changes in a binary message that tell us there has been a problem in transmission, not the actual content or meaning of the message to an end user. So in the example of 20 0's there are no changes and therefore no information in the context we are discussing. And, as you have pointed out, it is not necessary to send 20 0's when a much more compressed form can be used to convey the same meaning. Am I on the right track naymissus?

Yes, I think you are following me. I was concerned that I was not explaining things well. And this business of information in changes is what we (perhaps unconsciously) look for in messages. Take these two messages, of equal length (66 characters), where one has lots of changes and the other does not.

66-character message, many changes:
FriendsRomanscountrymenlendmeyourearsIcometoburyCaesarnottopraisehim

66-character message, zero changes:
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

It is (I hope) self-evident that the message with many changes has a far higher information content than the message with no changes. Shannon simply analyses and formalises this intuitive fact.
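To make the point concrete, here is a minimal sketch (my own illustration, Python standard library only) that runs both messages through a general-purpose compressor. The compressed size is a rough proxy for information content: the repetitive message shrinks dramatically, while the varied one barely shrinks at all (it may even grow slightly because of format overhead).

```python
# Compare how well the two messages above compress (zlib, standard library).
import zlib

varied = b"FriendsRomanscountrymenlendmeyourearsIcometoburyCaesarnottopraisehim"
repeated = b"e" * 66

for label, msg in [("varied", varied), ("repeated", repeated)]:
    compressed = zlib.compress(msg, level=9)
    print(f"{label}: {len(msg)} bytes -> {len(compressed)} bytes compressed")
```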
|
|
|
Post by abacus9900 on Jan 29, 2011 22:08:08 GMT 1
Right. So what in the world was STA banging on about? And what did she mean by ensembles and coin tossing?
|
|
|
Post by Progenitor A on Jan 29, 2011 22:28:00 GMT 1
An ensemble is the complete set of different binary messages that N digits can carry. So if we take a binary message length of 8 digits (N = 8), then there are 2^8 = 256 different messages. In general, for a binary message length of N there are 2^N different messages, and in the case we are considering, where N = 8, all 2^8 = 256 of these messages together form an ENSEMBLE.

Only STA can explain what she means, and why it differs from Dr Pierce, Professor of Engineering, CalTech.
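A quick Python sketch of the idea (illustrative only, standard library): enumerate the ensemble for N = 8 and count it.

```python
# The ensemble for N binary digits: all 2**N distinct messages.
from itertools import product

N = 8
ensemble = ["".join(bits) for bits in product("01", repeat=N)]
print(len(ensemble))   # 256, i.e. 2**8
print(ensemble[:3])    # ['00000000', '00000001', '00000010']
```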
|
|
|
Post by Progenitor A on Jan 30, 2011 13:51:04 GMT 1
Consequences of Shannon's information theory (which are largely common sense, I think, but Shannon put them on a firm mathematical basis) can be seen by looking at redundancy in signals. Redundancy is just what it says: digits (in a binary system) that are unnecessary and can be discarded. By discarding unnecessary digits we can make more efficient use of a communication channel.
We will look at 8-bit messages and include two control bits that ensure that dispensing with redundant bits does not cause ambiguity (X = any binary value):

control bits: XX   8-bit message: 11111111

Now Shannon tells us that a string of all 1's contains no information, therefore there is no point in sending it. We simply have to inform the far end that the missing bits that would have been in the message are all 1's. We can set the control bits so that the value 00 means 'all 1's',
and then all that is transmitted is the control bits 00. When the receiving end receives 00 it automatically puts eight 1's into the message.
If the message to be sent is XX 00000000, again Shannon tells us that this string of 0's has zero information content and need not be sent.
The control bits value 01 might indicate 'all 0's', and then all that must be sent is the control bits 01. The receiving end interprets this as 'all 0's' and fills in eight 0's.
We can go further: take this sequence of 8-bit messages:
XX 10110110 XX 10110110 XX 10110110
It is evident that the second and third messages are the same as the first and hence are redundant.
We can arrange the control bits so that 10 means 'the same as before', so we can send: XX 10110110 10 10
And when the far end receives 10 it simply repeats the information received in the previous 8 bits.
The control bits can also carry many other control functions, such as telling the far end that the previously received 8 bits contained an error and asking it to re-transmit those 8 bits. For these redundancy-rejection systems to work the transmitting and receiving ends must be synchronised, but I am sure that you are aware of that. A toy version of the scheme is sketched below.
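Here is a minimal Python sketch of the toy scheme just described. It is illustrative only, not a real protocol, and the value 11 = 'a literal 8-bit block follows' is my own addition (the post above leaves the literal case as XX) so that the toy code is unambiguous to decode.

```python
# Toy run-suppression codec using two control bits:
# '00' = eight 1's, '01' = eight 0's, '10' = repeat the previous block,
# '11' = a literal 8-bit block follows (hypothetical, added for decodability).
def encode(blocks):
    out, prev = [], None
    for b in blocks:
        if b == "11111111":
            out.append("00")
        elif b == "00000000":
            out.append("01")
        elif b == prev:
            out.append("10")
        else:
            out.append("11" + b)
        prev = b
    return out

def decode(codes):
    out, prev = [], None
    for c in codes:
        if c == "00":
            b = "11111111"
        elif c == "01":
            b = "00000000"
        elif c == "10":
            b = prev
        else:
            b = c[2:]  # strip the '11' control prefix to recover the literal block
        out.append(b)
        prev = b
    return out

msgs = ["10110110", "10110110", "10110110", "11111111"]
coded = encode(msgs)
print(coded)                  # ['1110110110', '10', '10', '00']
assert decode(coded) == msgs  # round-trips without ambiguity
```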
We can continue by looking at error detection and correction techniques if you wish.
|
|
|
Post by abacus9900 on Jan 30, 2011 15:56:21 GMT 1
Yes, I am following this very closely, but are not the control bits very limited in what they can perform if there are just two of them? Two bits only allow 4 possible codes to be sent.
|
|
|
Post by Progenitor A on Jan 30, 2011 16:36:46 GMT 1
Yes, you are right. I limited the control bits to keep things simple. In real communications systems that have a bad, noisy channel, such as the GSM or 3G mobile phone systems, 16 to 32 control bits are commonplace.

The trouble with large numbers of control bits is that the overhead is then high and cuts down on the 'payload' of information that can be transmitted over a limited-capacity communication channel. To reduce the overhead proportion, the message length carrying the information must be large - in the case of GSM and 3G, for 16 control bits each message length is typically about 108-256 bits. That also creates a problem, however, because if an error is detected in the information that is transmitted, the normal reaction is to request a re-transmission of the message that has errors in it. If the message is very long, a very long re-transmission is necessary, substantially reducing the real rate of information transfer.

(Note that when a voice signal is being transmitted, no request for retransmission of errored blocks is made, for the simple reason that the customer would not tolerate the delays while the errors were corrected, so errored blocks are simply discarded, or may be passed to the customer if the errors are not too bad. To compensate for that, the error protection for GSM and 3G voice transmissions is quite amazing. It is so good that 2/5 of the information can be lost and the customer doesn't realise anything is wrong!)
|
|
|
Post by abacus9900 on Jan 30, 2011 17:18:31 GMT 1
So, essentially, using only 16 or more control bits, information can be encoded in a way that substantially compresses the raw data. Is this what happens when data is sent over the net using file compression? The trick is, I suppose, to find the optimum number of control bits that will deliver sufficient information within a reasonable time-frame. Is this what Shannon worked out?
I suppose in voice transmissions most people are clever enough to compensate for the errors, in many cases.
|
|
|
Post by Progenitor A on Jan 30, 2011 18:33:08 GMT 1
There are normally two algorithms acting on data that is to be sent over a communication channel.

First there is the compression algorithm, which has its own set of control bits and rules (and of course the algorithm used must be agreed at both ends - for files on the Internet, WinZip is the most common; for real-time Internet applications a mutually agreed algorithm must be established).

When the data has been compressed (there will not be much redundancy left in the message), it is then sent to an encoder and an error detection/correction algorithm (again, both ends of the communication link must agree which encoding and error detection and correction protocols are to be used). These protocols can be very involved, with, for example, the ability to detect and correct errors in incoming data and, if the errors cannot be corrected, to ask the other end to re-transmit the whole message, part of the message or selected bits of the message. If there are lots of errors in the received signal, the receiving end can request the sending end to add redundancy to the message to make it more reliable (but slower).

There may be an intermediate process of encryption of the message so that anyone intercepting it cannot understand what is being sent.

And certainly Shannon was concerned with getting the maximum amount of data across a channel with the minimum error. He showed mathematically that it could be done; he left it to others to design systems which achieved it.

The human ear and brain are quite remarkable: if tones are missing (in music, for example, where cheap radios cannot reproduce low and high tones properly), the brain will 'fill in' the missing bits. The same goes for telephone conversations: the brain can 'filter out' some noise and concentrate on the wanted signal, again 'filling in' bits that are not there. That phenomenon is sometimes called the 'cocktail party effect': when you are having a conversation at a cocktail party with lots of other loud conversations going on around you, your ear and brain 'filter out' the unwanted conversations and concentrate on the wanted one. As you get older this ear-brain facility weakens, and it can become impossible to hold a conversation in such a noisy environment.
|
|
|
Post by abacus9900 on Jan 30, 2011 20:29:39 GMT 1
Fascinating.
Is much the same method used in OS's to write/read data to and from hard drives?
|
|
|
Post by Progenitor A on Jan 30, 2011 21:40:11 GMT 1
I do not really have any deep knowledge of the internal workings of computers. I do know that the internal connections a computer OS uses form a nearly 'perfect' communication channel, and hence there is probably no need for a PC to have complex redundancy minimisation and encoding for error detection and correction. If, within a PC, encoding for error detection and correction were used, the speed of the PC would be considerably slower, to no effect if the communication 'channel' were perfect.

As to compression within a computer, I do not really know. Compression of the information within a PC would in principle speed the computer up, but quite frankly I do not know whether that is done.

When it comes to communications between computers over a kilometre or hundreds of km, where the communication channel is less reliable and more noisy, then all the techniques we have spoken of definitely are used. Telecommunications is a hybrid word meaning 'communications over a distance', and Shannon was concerned with the science of transferring information over a channel which, because of its length, or distance, was limited in capacity (bits per second) and subject to noise.
|
|
|
Post by speakertoanimals on Feb 1, 2011 13:54:21 GMT 1
Shannon didn't!
Again, I refer you back to page one of the Shannon paper.
In the first part, where he SPECIFICALLY defines information content, he defines it as (minus) the logarithm of the probability. NOTHING to do with pattern (or not) within the string itself. Just the probability of one particular message compared to the ensemble of all possible messages.
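As a minimal sketch of that page-one definition (my own illustration, Python standard library): for a fair, independent binary source every N-digit message has probability 1/2^N, so every message, patterned or not, carries exactly N bits.

```python
# Shannon information content of an event with probability p is -log2(p).
from math import log2

def information_bits(p):
    return -log2(p)

N = 20
p_any_message = 1 / 2**N                 # fair, independent bits
print(information_bits(p_any_message))   # 20.0 -- same for '000...0' as for any other string
```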
To give a quote from a random Google search:
Now when he moves on from page one, he considers specific problems such as sending information down noiseless and noisy communication channels. He also considers cases where (unlike the initial example) there are correlations in the signals sent. These all lessen the actual amount of information in the message. An example is English, where even with some missing letters, knowing it was English you could have a stab at predicting them -- this is the correlation, and why the information in a string of English is LESS than the information in a random string of letters (where, since it is random, you CANNOT predict any missing letters).
WHY does this matter? Because the proper development and understanding of the subject means you have to start from the purely random, noiseless case before you can attempt real messages on noisy communication channels.
Now this stuff about CHANGES -- for real signals (such as telephony), there are substantial correlations and substantial periods of silence. These patterns mean there is less actual information in there than at first sight. So, for example, with substantial periods of silence, anyone can see that it may be cheaper to send 'silence starts'/'silence ends' rather than sit there and transmit all the silence.
But this is NOT the same case as a string of random heads and tails, where there MAY happen to be a run of heads. Indeed, in this case you can show mathematically that trying to do it by instead just sending a change marker when it changes from heads to tails won't gain you anything.
The point being, in the original post I was talking about Shannon information (page 1 of his paper, the definition of Shannon information that Shannon gave!), NOT the results for real-world messages, which aren't strings of random bits sent over noisy channels.
Other posters seem to be only able to handle these latter cases, but obviously have no idea where the results come from.
In this case, yes, but NOT because the Shannon information of a string of zeros is always and identically zero, but because for these particular types of messages WITH CORRELATIONS (ie redundancy), it turns out that the cheapest way to encode is start/end.
This is NOT the same as saying a string of 0's from an uncorrelated process has zero information content -- as I keep pointing out, the information of ANY string from such a process is the SAME.
For other processes (with correlations), the computations are different, which is where the engineers seemingly step in and treat the other stuff as unnecessary preliminaries -- maybe because the later stuff is intuitive, and the earlier stuff ISN'T. Hence those poor engineers who can't go beyond intuition to abstraction start part-way along the journey.....................
Which also brings me to Shannon entropy, which a poster also got wrong. The Shannon entropy relates to the SOURCE, and the ensemble of all possible messages.
Now the initial random source case supposed we knew beforehand the probability distribution of events from the source, hence the probability of any possible message.
But suppose we don't know that! Let's suppose, for example, we have random coin tosses (to make it simpler), but don't know whether or not the coin is weighted, so heads may be a bit more frequent than tails.
Now life gets a bit more complicated -- we first need a sample to try and estimate the relative frequency.
So, we grab a string of 10 tosses -- and use that. We hence use the probability of heads in that particular message to compute the entropy of the source. We are NOT computing the entropy of the single message (where a previous poster got confused!), but using the probability of heads in that message as an estimate of the probability of heads from the source, and hence estimating the entropy of the source.
In this case, we can see without too much effort that it's going to be a lousy estimate! 10 tosses isn't enough! Indeed, you can compute (Bayes' theorem and all that) the probability that any given probability of heads would generate the actual message, and the estimated probability usually just takes the value that gives the maximum probability of generating the message -- except the distribution for 10 tosses will be jolly flat, and a whole range of probabilities of heads will give almost as high a probability of generating that message! Hence our estimate of the actual head probability is lousy!
To give an example, suppose we actually have heads prob 1/4, tails prob 3/4.
I then take 2 tosses, and get 1 head. The naive estimate says 1 head out of 2, so prob heads is 1/2. The probability of 1 head out of 2, with heads probability p, is actually 2p(1-p) (order not specified). If you plot this, you find the maximum is indeed at p=1/2 (our estimate), whereas at the actual value (p=1/4) it is 3/8. Large, but not as big as for p=1/2 (the incorrect value!).
It is fairly obvious what the problem is -- we need a BIGGER sample, because we know that as our sample size increases, the error on the sample mean decreases. Hence we need a LOT of tosses to make a decent estimate of the rate at which heads appear. Once we have a decent estimate, we then use that to compute the source entropy, even if the actual computation makes it look like the entropy of the MESSAGE to the uninitiated.
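A minimal sketch of that sampling point (my own illustration, Python standard library, seeded for repeatability): with a weighted coin where P(heads) = 1/4, the sample frequency of heads is a poor estimate from a few tosses and a good one from many.

```python
# Estimate of P(heads) improves as the sample grows.
import random

random.seed(1)
TRUE_P = 0.25

for n in (2, 10, 1_000, 1_000_000):
    heads = sum(random.random() < TRUE_P for _ in range(n))
    print(f"{n:>9} tosses: estimated P(heads) = {heads / n:.4f} (true {TRUE_P})")
```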
Now we see where the confusion comes in -- we have an UNKNOWN source. We need a long enough sample message in order to make a decent stab at estimating the properties of the source (such as the frequency of various characters, and whether or not there is correlation). Once we have that, we estimate the source entropy; hence we have an estimate, based on one particular LONG message, of what the average information content of any message from that source will be.
Suppose we take a second long message -- a previous poster computed the entropy of that message as well, and mixed it up with the information content of that message! Not right! If we have a second message, that gives us a second estimate of the source entropy, which may be different from the first. That is all - the source has remained the same; we just have two estimates of its parameters. The parameters and entropy belong to the source, and we are trying to estimate them; they do NOT belong to the message, which is just a sample.
So, back to our zeros! If I have a source that gives me 10 zeros, it is POSSIBLE (but not that likely) that the source actually produces random bits with equal probability. If I ask the same source for a million bits, and again get a million zeros -- then I can very reasonably infer that the source isn't doing anything BUT produce zeros, hence contains no information -- NOT because a string of zeros contains no information per se, but because of the estimate of the probability of zero based on that string.
Suppose now I have a million tosses, heads and tails equally likely and no correlation, yet within that million I find a string of a hundred zeros. Does that contain no information? NO -- it contains just as much information as any other 100-digit string from that source. Looked at in this way: take the million-digit string and break it up into 100-digit sub-strings. The million zeros gives only ONE such 100-string, hence that string carries no info. In the random case, there are MANY other possible 100-digit strings (the ensemble of all possible 100-digit messages, or at least 10,000 random samples from that ensemble!). Hence, comparing the 100-zero case with all the others, I make a DIFFERENT judgement as to the information content of the 100-zero string, because I am making a DIFFERENT estimate of the probability of zero for that source.
All of this sounds complicated, and it is a bit, so I can see why the intuitive approach seems so user-friendly -- just it has the problem that, although it works for engineers with applications to consider, it totally gets wrong the BASICS of the subject. You need to start at page one of the paper, not jump straight into the results of later sections which accord with your intuition, and hence give people the impression they don't need to bother with mastering the previous intro sections.
|
|
|
Post by speakertoanimals on Feb 1, 2011 15:30:54 GMT 1
And for this I have been described as a brainless moron, which seems a little unfair................
Now let's tackle the case of changes -- the important point to note is that transmitting only changes ONLY works for signals with correlations or long periods of silence.
Let's take a fair coin again: P(heads) = P(tails) = 1/2.
According to Shannon, any string of N tosses occurs with probability 1/2^N, hence has information content (-log P) of N bits. No surprise.
Now suppose we say -- I don't care about heads or tails, just different or same. Hence I dump 1 bit for the first toss, then, rather than H or T, say 'change' or 'no change'. How often does a change occur? Easy to see it is 1/2 again; hence this string gives N-1 change/no-change symbols, hence (N-1) bits as before, discarding the initial toss (so we can't distinguish between HTTH and THHT).
But it has the SAME information content (less the first toss) as sending heads or tails -- there is no cheaper way to code it or compress it! Because every toss is 1 bit of information.
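A minimal sketch of that mapping (my own illustration, Python standard library, seeded for repeatability): map a fair-coin string to its change/no-change string. Changes also occur with probability 1/2, so the mapped string is just as incompressible -- N-1 bits either way, plus the first toss we discarded.

```python
# Re-coding tosses as changes gains nothing for an uncorrelated fair coin.
import random

random.seed(2)
tosses = "".join(random.choice("HT") for _ in range(20))
changes = "".join("C" if a != b else "S" for a, b in zip(tosses, tosses[1:]))
print(tosses)                              # a string of H's and T's
print(changes)                             # 19 symbols, C and S roughly equally likely
print(changes.count("C") / len(changes))   # close to 0.5 for long strings
```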
But now suppose we have a telephone message, with long periods of silence -- the correlations going on here make it a DIFFERENT case, where tricks such as 'send only changes' will work, but NOT because a string of all the same really does have zero information content; it just happens to in this special case.
Looking at stuff on the web, there is actually quite a bit of confusion between Shannon entropy of a source, and Shannon entropy of a message.
Strictly speaking, the Shannon entropy of a source that produces random symbols i with probability p_i is given by the expectation value of -log p. So, in a long run of N symbols, event i occurs N p_i times (on average), and hence contributes an amount p_i times -log p_i to the expectation value per symbol. Summing over i (all possible symbols) gives the result:

E = -\sum_i p_i \log p_i
The complication comes in when you use the frequency of a symbol i within a particular message as an ESTIMATE of p_i for the source, so that the real expectation value (the average over all messages, hence the actual probability for the source) gets replaced by the SAMPLE value of the entropy (ie computed for one particular message). Which leads to the mistaken impression that entropy refers to a message, whereas in fact it refers to the source, and is the AVERAGE information content of all possible messages from that source.
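A minimal sketch of the formula above (my own illustration, Python standard library): the entropy belongs to the SOURCE and is computed from the source's true symbol probabilities p_i, not from counts in any one message.

```python
# H = -sum_i p_i log2 p_i, in bits per symbol.
from math import log2

def source_entropy(probs):
    # Terms with p = 0 or p = 1 contribute nothing, so skip them.
    return float(-sum(p * log2(p) for p in probs if 0 < p < 1))

print(source_entropy([0.5, 0.5]))    # 1.0 bit/symbol: a fair coin
print(source_entropy([0.25, 0.75]))  # ~0.811 bits/symbol: a weighted coin
print(source_entropy([1.0]))         # 0.0: a source that only ever emits one symbol
```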
Indeed, other stuff I have found seems to concur that in some papers, usage gets a bit tangled:
The latter case is the one I would refer to as making an ESTIMATE of the symbol frequencies for the source. This is in accord with Shannon, who talked about the entropy of the source, and finding the best possible way, on average, to transmit ANY possible message from a given fixed source, not just the one we happen to have at the moment. In my earlier post, you can see why using a LONG sample message to estimate source properties leads to this short-cutting and possible confusion.
But just because it seems common to refer to the entropy of a message, doesn't make it RIGHT, just common!
Perhaps this is where the confusion comes in?
Let's go back to Shannon. Shannon says that if an event occurs with probability p_i, the optimum codeword length for encoding the occurrence of i (phew!) is -log p_i.
BUT this codeword can only be used if we know p_i a priori, have worked out our codebook accordingly, and GIVEN it to the receiver and sender of messages. Then they know how to code and decode messages.
So, with those frequencies and that codebook, suppose I now have a message in which symbol i occurs N p_i times (exactly as predicted). Then the coded message length is given by:

L = \sum_i N p_i (-\log p_i)
Now we can see how the confusion occurs. If I compute the p_i's for a particular message (supposing they ARE the actual probabilities for the source, instead of just an estimate based on ONE message), then the entropy of that MESSAGE (ie of the source, but using the estimated probabilities from that one message rather than the true source probabilities) gives an optimum coded-message length that may be less than the actual message length (and gives ZERO for a repeated string!). Should we worry? No, because what we have actually done is confuse estimated probabilities with actual probabilities. If, BEFORE I SENT the message, I used the above estimates to compute the optimum codebook, it would be optimal for that ONE message, but not for the next message I sent (my estimates are actually only estimates!), and I'd have to send a different codebook for EACH message using this scheme!
Which is not what is usually intended, since if we are going to send a whole damn codebook with every message, then we really should include that cost as well -- but that is in effect Rissanen's Minimum Description Length stuff, not pure Shannon.
Hence, by confusing a particular string of repeated zeros (sample probability of zero is one) with the source probability (the actual probability of zero need not be one), you confuse the information content of a message, and the average information content of messages from a source (the entropy of the source), with the 'entropy of a message' and an information content based on these possibly not very good estimates. Hence the oft-repeated '000 contains no information', which is wrong, and even when not totally wrong, usually grossly misleading..............
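A minimal sketch of that confusion in code (my own illustration, Python standard library): compute the 'entropy' of a single message from its own symbol counts, and the coded length it implies. For a repeated string the sample probability of the one symbol is 1, so the implied coded length is zero -- which says something about the estimate, not about the true information content of the string.

```python
# Optimum coded length implied by a message's OWN symbol frequencies.
from collections import Counter
from math import log2

def sample_coded_length(msg):
    counts = Counter(msg)
    n = len(msg)
    # A symbol filling the whole message has sample probability 1 and
    # contributes -log2(1) = 0 bits, so skip it.
    return float(sum(c * -log2(c / n) for c in counts.values() if c < n))

print(sample_coded_length("0000000000"))   # 0.0 bits -- the estimate, not the truth
print(sample_coded_length("0110100110"))   # 10.0 bits (5 zeros, 5 ones)
```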
But I'll continue looking into it, and blame the engineers and those who are a little less than scrupulous in their teaching material............
|
|
|
Post by abacus9900 on Feb 1, 2011 18:49:39 GMT 1
Well, all you're really saying is that the system is designed to work out the probable coded bytes based on an agreed formula.
For example, in English, the phrase: "Mary had a little lamb, its fleece was white as snow" would still be recognized if seen as "M*ry h*d a litt*e l**b, i*s fl*ece w*s **ite as *now."
In other words, the mathematical probabilities ensure that the inbuilt entropy is accommodated.
|
|
|
Post by Progenitor A on Feb 1, 2011 19:25:01 GMT 1
Quite. The entropy is concerned with the mathematical probability (or uncertainty) of the elements of a message.

If the probability of a binary message is 1, the entropy is 0 - there is no uncertainty, everything is the same, there are no changes, and there is no information to transmit. If the probability is 1, then it is a certainty what the message contains and it carries no information worth transmitting.

If the probability of a message is low, the uncertainty is high, everything is different, there is no 'sameness', there are many changes, and there is lots of information to transmit.

It is all quite simple really (once Shannon has shown the way). For example, you and I could work out the probability of errors in sending the 8-bit message 10101010 (entropy 1) over a communication channel that had a total entropy of only 0.8. Useful stuff.
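A minimal sketch of the certainty point (my own illustration, Python standard library): the binary entropy function H(p) is 0 when the outcome is certain (p = 0 or 1) and peaks at 1 bit when p = 1/2.

```python
# Binary entropy: H(p) = -p log2 p - (1-p) log2 (1-p).
from math import log2

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty, nothing to transmit
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (1.0, 0.9, 0.5):
    print(f"H({p}) = {binary_entropy(p):.3f} bits")
```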
|
|
|
Post by abacus9900 on Feb 1, 2011 20:03:13 GMT 1
The trouble is STA is no help in providing a basic framework onto which ordinary people can build in order to develop their understanding. Again, you cannot expect people to begin learning any topic if you are always going to start at an advanced level. Why does she keep doing this? It really does not serve much purpose.
|
|