Fiehn Lab - Statsoft Statistica Beware of 101

The 101 in Statsoft Statistca (and Dataminer v8) a reoccurring fact of grievance.

or Beware of 101!

First we have to keep in mind that Statistica comes from the 16-bit area and I am following the program since several years and I am also a heavy user of Statistica. The core is based on old but working thread safe code and undergoes heavy testing. One of the reasons that Statistica is not using multi core CPUs is that this code just can not be transferred to multi core or multi CPU (SMP) applications. That is a very sad fact. Only parts of the software (Neural Networks) are multithreaded. At least someone at Statsoft read: The free lunch is over - A Fundamental Turn Toward Concurrency in Software. Now I can live with that and many other small annoyances (see ) among them the bug that parts of the multivariate analysis package did not accept more than 265 variables until recently or invariant variables are not allowed and no statistics can be calculated. Furthermore all the people and support staff I had contact with are really nice and truly interested in solving those problems. But one O one is a historic relic deeply embedded within the code which Statsoft can not get rid off.

So there is one bug which is really a pain in the brain, the magic number 101. How is that possible and why such a lengthy and long winded article? Because 101 it the evil number within Statistica. Yes 101 is the 666 of Statistica.

To understand and explain that we have to look into the Helpfile. The Helpfile is in principle the extended version of the really excellent online Statistica TextBook. It says in Notes on Text Labels and Text Values:

How many unique text labels or text values can you have? You can have a practically unlimited number of unique text label/numeric value associations. Numeric values begin at 101 if you enter the data directly into a spreadsheet. This minimizes the likelihood of creating a numeric association that conflicts with a number elsewhere in the variable.

How do I create a text label/numeric association? To assign a text label to a specific numeric value, enter the text into the Text Labels Editor. If you do not care about the specific numeric values, but want to enter a text label and let STATISTICA assign the respective numeric value automatically, then simply enter your text string into a variable (in the spreadsheet) with the data type Double, Integer, or Byte. (Remember, the data type Text has no numeric associations.) STATISTICA creates the association automatically, beginning at101.

Export HTML performance warning. Start numeric values that are automatically assigned to Text Labels with. Enter the value you want assigned to the first text label in the Start numeric values that are automatically assigned to Text Labels with box. By default, this value is 101. The range of valid numeric values assigned to text labels is -32,768 to 32,767, with the exception of -999999998 (the default missing data code); if the maximum numeric value is reached and negative values have not yet been used, negative numeric values, starting with -32,768, are assigned.

What are the implications? If you copy/paste values from the workhorse EXCEL and you have a text in there (column headers or other text variables) it will create random text in Statistica at position 101 and later calculate with that. If you enter text into a double column the new Statistica v8 will ask "The data you entered is text..." Which is good, because in older versions it would just ignore that and the mess would begin.

Please be sure to check all you numbers when any message box occurs with the words Text and Variables. That is a new warning box since Statistica version 8, all the older Statistica versions just ignored that. It is extremely important to understand that once you copy/paste numbers with text from any application to a Statistica worksheet you are in danger. If you press Import with text labels, the text will be positioned at case number 101. So all text headers will show up at case 1 and case number 102 (101+1). In the second case the text labels will be removed from case 1 and case 102 so no text in the whole sheet. Copy/Paste from Excel to Statistica requires to Paste with Headers. Make sure no text is pasted into number variables.

Statistica is not EXCEL, because in Statistica every column or variable has to be of four types: Double, Text, Integer, Byte. That is a must and different from the additional data display format (Scientific, with comma with dots and so on). By default a variable is double (64-bit IEEE floating point notation). You can use text in Statistica (as categorial values) but text should be in single columns and not mixed with numbers. The common practice of multiple table headers is not allowed in Statistica. It is important to understand once there is text in a column this table is poisoned and can not be restored.

The problem is that unlike in Excel or OpenOffice Calc the text values will code at position 101 and they also will be used in calculations. So sorting numbers from 1 to 200 will result in text exactly at position 101 (instead on the top or bottom).

Now this picture can not be explained. The reason is, that it works like this since the evil 101 was introduced into Statistica.
What happened? Because the PCA did not work with invariant values and NIPALS PCA had a bug (in CRM) I had to use the basic statistics tools and calculate stdev and variance. That requires transposing the whole matrix and later sorting according to variance to remove the values with zero variance. in order to know which values are needed I had to A) introduce a column with variable numbers and a column with the variance itself. As a lazy guy I named it XXX because that sheet will go to trash anyway (A good practice in Statistica is always to create copy sheets, because many actions can not be undone and to create backups of worksheets, instead of growing them to the hundreds of megabytes).

Now suddenly the xxx header shows up at evil position 101. Even worse the following value which is my column ID 101 is now also named 101. Even worse things will happen when you try the change the text in the header (after sorting everything according to the case names, hence creating an original sheet). Now every new text entered at position one (the top position) in the same column or in the next column is added to position 101 and below (see picture below).

The extremely evil thing happens, that after you detect the 101 dilemma, because nobody told you so and you created all the nice statistics with life sciences data and matrix sizes in the 10000x100000 that there is text at position 101. The numbers go 97,98,99,100,text,text,text,text,105,106,107. Ok. Very easy, just delete the text and put in 101, 102, 103, 104 as numbers. No remember - beware of 101 - once there is text it can not be replaced. Text at one O one (101) means it is too late. You can pull numbers down, the text will stay there, you enter new numbers, the text comes back. The variable definition is still set to double. Now the double evil thingy is that you can clear the text with the same number (try 97,97,97) and then even save it. But after you fill in running numbers or clean the positions the text will appear again (we will see later its a feature not a bug - just kidding its a bug). But that bug position can be change from 101 to any number.

Problems during import of large data files can occur if there is text imported together with numbers. Statistica can import pretty well large datasets 10000x10000 or larger, but the evil 101 will hit when text is mixed with numbers. See the graphics below, at position 101 the column header was introduced as text. So after import of external data each field has to be validated or one has to perform a text search so no text is within the spreadsheet.

The 101 value can be actually changed to any number (I wonder did people have only sheets with 99 values, so 101 was far away?). The 101 can be changed in the option menu. I dont know what to set there 0? I have 0 very often. 10 Million, 10 billions? I have millions of cases, but billions certainly not (due to the 32-bit memory problem, Statistica does not support large memory systems, yes Statistica can be installed on 64-bit Windows but it can only use 2047 MByte RAM for each calculation, if you have more RAM (32 GBytes) it does not matter.

As I described before this is within Statistica since the very beginning and I sent my comments on the to the support already years ago; The only solution is not to enter text in any of the columns if it is used as double type for calculations. Do not mix text and numbers. Statistica can use text without problems (as categorial variables or for a-priori calculations) and calculates without problems. But even if Statsoft insists that this is not a problem (I agree - once you know 101) it is a big annoyance and very counterproductive and even buggy. Especially the case that the text can not be removed is really bad.

Why do I care? Because I like to work with Statistica. It is very easy to use, has 20k options, is user friendly, GUI driven but 101 gave me gray hair until I figured out what the problem was (my fault of course).

Attachement: statistica-beware-of-101.sta (15 kBytes) and copy-paste-statistica-101-problem.zip (16 kBytes)

Statsoft Statistica beware of 101