<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">Besides the high chance of false positives, what makes this method (and the problem it tries to solve) so difficult is that binary files may contain large amounts of what is considered text, and text files may contain pieces of binary data.<br>
For example, consider a Windows executable file: much of the data in such a file is considered binary data, but there are defined sections where strings and text resources are stored. Any heuristic algorithm like the one mentioned will be insufficient in such cases.<br>
Although I can't think of a situation offhand where the opposite may be true (binary data embedded in what is considered to be a text file), I'm pretty sure such a situation exists.<br>
</div></blockquote>
<br>
One could consider PDF to be such a format (text with embedded binary data).<div class="HOEnZb"><div class="h5"><br></div></div></blockquote><div><br></div><div style>RTF is another example. </div></div></div></div>