You can change the unit of analysis from individuals to households by aggregating the data based on the value of the household ID variable. For example, you could create two new variables in the original data file that contain the number of people in the household and the per capita income for the household (total income divided by the number of people in the household).

Figure: Aggregate summary data added to original data.

## Weighting Data

The WEIGHT command simulates case replication by treating each case as if it were actually the number of cases indicated by the value of the weight variable. You can use a weight variable to adjust the distribution of cases to more accurately reflect the larger population or to simulate raw data from aggregated data.
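For the household aggregation described at the start of this passage, the following is a minimal sketch in Python rather than SPSS command syntax. The records, household IDs, and income values are hypothetical and exist only to illustrate adding household-level summaries back onto person-level rows.

```python
from collections import defaultdict

# Hypothetical person-level records: (household_id, income).
people = [
    (101, 30000.0),
    (101, 20000.0),
    (102, 45000.0),
    (103, 10000.0),
    (103, 0.0),
    (103, 50000.0),
]

# First pass: accumulate household size and total income per household ID.
size = defaultdict(int)
total_income = defaultdict(float)
for hh_id, income in people:
    size[hh_id] += 1
    total_income[hh_id] += income

# Second pass: attach the two summary values to every original row,
# mirroring "aggregate summary data added to original data".
augmented = [
    (hh_id, income, size[hh_id], total_income[hh_id] / size[hh_id])
    for hh_id, income in people
]
```

Each row keeps its own income and gains the household size and per capita income, so the unit of analysis can be switched later by keeping one row per household.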

You can compute and apply a weight variable to simulate this distribution. Note: In this example, the weight values have been calculated in a manner that does not alter the total number of cases.
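One common way to compute weights that leave the total number of cases unchanged is to divide each group's target share by its share in the sample. The sketch below is in Python rather than SPSS syntax, and the group names, sample counts, and target shares are assumptions made for illustration; they are not the values from the original example.

```python
# Hypothetical sample counts and target population shares.
sample_counts = {"male": 60, "female": 40}
population_share = {"male": 0.5, "female": 0.5}

n = sum(sample_counts.values())

# Weight = target share / observed share. Because the target shares sum to 1,
# the weighted number of cases equals the original number of cases.
weights = {
    group: population_share[group] / (count / n)
    for group, count in sample_counts.items()
}

# Weighted counts: each case counts as `weight` cases.
weighted_counts = {
    group: count * weights[group] for group, count in sample_counts.items()
}
weighted_n = sum(weighted_counts.values())
```

Here the over-represented group receives a weight below 1 and the under-represented group a weight above 1, while the weighted total stays at the original sample size.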


If the weighted number of cases exceeds the original number of cases, tests of significance are inflated; if it is smaller, they are deflated. More flexible and reliable weighting techniques are available in the Complex Samples add-on module.

To simulate raw data from a table of aggregated counts, read the data into SPSS using rows, columns, and cell counts as variables; then use the cell count variable as a weight variable.


For example, 1, 2, 35 indicates that the value in the first row, second column is 35. The total row and column are not included. In this example, the value labels are the row and column labels from the original table.

Figure: Crosstabulation and significance tests for reconstructed table.
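The idea of treating a cell count as a weight can be sketched in Python by expanding each (row, column, count) triple into that many simulated cases. Only the triple 1, 2, 35 comes from the text; the other three cells are hypothetical values added to complete a 2 x 2 table.

```python
from collections import Counter

# (row, column, cell count). Only (1, 2, 35) is from the text;
# the remaining cells are hypothetical.
cells = [
    (1, 1, 20),
    (1, 2, 35),
    (2, 1, 30),
    (2, 2, 15),
]

# Weighting by the count is equivalent to replicating each cell `count` times.
cases = [(row, col) for row, col, count in cells for _ in range(count)]

# Rebuild the crosstabulation and its margins from the simulated raw cases.
table = Counter(cases)
row_totals = Counter(row for row, _ in cases)
col_totals = Counter(col for _, col in cases)
```

The totals are derived from the simulated cases, which is why the total row and column do not need to be entered as data.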


## Changing File Structure

SPSS expects data to be organized in a certain way, and different types of analysis may require different data structures. Since your original data can come from many different sources, the data may require some reorganization before you can create the reports or analyses that you want.

You can use the FLIP command to create a new data file in which the rows and columns in the original data file are transposed so that cases (rows) become variables and variables (columns) become cases.
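Transposing rows and columns can be illustrated with a small Python sketch. This is an analogy to what flipping a file does, not the FLIP command itself, and the variable names and values are hypothetical.

```python
# Hypothetical data: each inner list is one case (row).
variable_names = ["id", "age", "income"]
cases = [
    [1, 34, 52000],
    [2, 45, 61000],
]

# zip(*cases) regroups the values column by column, so each original
# variable becomes one row; the variable name is kept as a row label.
flipped = [
    [name, *column] for name, column in zip(variable_names, zip(*cases))
]
```

After the transpose there is one row per original variable and one value column per original case.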

So what do you do with an Excel file in which cases are recorded in the columns and variables are recorded in the rows? For example, what if the Excel file looks like the one in the figure? Since the first cell in the first row is blank, it is assigned a default variable name of V1. The case data is not important at this point, and the fewer variables you create when flipping the file, the less time and resources it takes.

Note: The generated command does not end with a period.

Sometimes you may need to restructure your data in a slightly more complex manner than simply flipping rows and columns. If a data file contains groups of related cases, you may not be able to use the appropriate statistical techniques (for example, Paired Samples T Test or Repeated Measures GLM) because the data are not organized in the fashion those techniques require.

Cases in the same household represent related observations, not independent observations, and we want to restructure the data file so that each group of related cases is one case in the restructured file and new variables are created to contain the related observations. Index variables can be either string or numeric.

Numeric index values must be non-missing, positive integers; string index values cannot be blank. By default, a period is used as the separator in the new variable names. You can use any characters that are allowed in a valid variable name (which means the character cannot be a space). The previous example turned related cases into related variables for use with statistical techniques that compare and contrast related samples.
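Turning groups of related cases into one case with several variables can be sketched in Python as follows. The field names `household`, `index`, and `income` are hypothetical, and building the new names as root, separator, index value is an assumption about the naming pattern rather than a statement of the exact SPSS behavior.

```python
# Hypothetical long-format data: one row per person within a household.
long_rows = [
    {"household": 101, "index": 1, "income": 30000},
    {"household": 101, "index": 2, "income": 20000},
    {"household": 102, "index": 1, "income": 45000},
]

SEPARATOR = "."  # a period, matching the default mentioned in the text

wide = {}
for row in long_rows:
    idx = row["index"]
    # Numeric index values must be non-missing, positive integers.
    if not isinstance(idx, int) or idx <= 0:
        raise ValueError(f"invalid index value: {idx!r}")
    case = wide.setdefault(row["household"], {"household": row["household"]})
    # One new variable per index value, e.g. income.1, income.2.
    case[f"income{SEPARATOR}{idx}"] = row["income"]

wide_rows = list(wide.values())
```

Households with fewer members simply lack the higher-numbered variables in this sketch; a real restructured file would hold missing values there instead.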

But sometimes you may need to do the exact opposite: convert variables that represent unrelated observations to cases.

Example: A simple Excel file contains two columns of information: income for males and income for females.

There is no known or assumed relationship between male and female values that are recorded in the same row; the two columns represent independent (unrelated) observations, and we want to create cases (rows) from the columns (variables) and create a new variable that indicates the gender for each case. A value of 1 indicates that the new case came from the original male income column, and a value of 2 indicates that the new case came from the original female income column.
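The conversion of two income columns into cases with a gender indicator can be sketched in Python. The column names and income values are hypothetical, and skipping blank cells is an assumption made here to keep the example short.

```python
# Hypothetical spreadsheet columns; None stands for a blank cell.
columns = {
    "male_income": [50000, 42000, None],
    "female_income": [48000, 55000, 61000],
}

# 1 = case came from the male column, 2 = case came from the female column.
gender_codes = {"male_income": 1, "female_income": 2}

# Each non-blank cell becomes its own case (row).
cases = [
    {"gender": gender_codes[name], "income": value}
    for name, values in columns.items()
    for value in values
    if value is not None
]
```

The result has a single income variable plus the gender variable, which is the layout that procedures comparing independent groups expect.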

Example: In this example, the original data contain separate variables for two measures taken at three separate times for each case. This is the correct data structure for most procedures that compare related observations, with one important exception: Linear Mixed Models (available in the Advanced Statistics add-on module) requires a data structure in which related observations are recorded as separate cases.

## Transforming Data Values

In an ideal situation, your raw data are perfectly suitable for the reports and analyses that you need. Unfortunately, this is rarely the case. Preliminary analysis may reveal inconvenient coding schemes or coding errors, or data transformations may be required in order to coax out the true relationship between variables.

You can perform data transformations ranging from simple tasks, such as collapsing categories for reports, to more advanced tasks, such as creating new variables based on complex equations and conditional statements. For example, questionnaires often use a combination of high-low and low-high rankings.

For reporting and analysis purposes, you probably want these all coded in a consistent manner.
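Making mixed high-low and low-high rankings consistent usually means reversing the items that run the other way. The Python sketch below assumes a 1 to 5 scale; the item names and the choice of which item is reversed are hypothetical.

```python
LOW, HIGH = 1, 5  # assumed scale endpoints

# Hypothetical responses for one case; q2 is assumed to be coded in the
# opposite direction from the other items.
responses = {"q1": 2, "q2": 5, "q3": 1}
reversed_items = {"q2"}

# Reversing a value on a LOW..HIGH scale: LOW + HIGH - value (5 -> 1, 4 -> 2, ...).
recoded = {
    name: (LOW + HIGH - value) if name in reversed_items else value
    for name, value in responses.items()
}
```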


Creating a small number of discrete categories from a continuous scale variable is sometimes referred to as banding. For example, you can recode salary data into a few salary range categories. Although it is not difficult to write command syntax to band a scale variable into range categories, we recommend that you try the Visual Bander, available on the Transform menu, because it can help you make the best recoding choices by showing the actual distribution of values and where your selected category boundaries occur in the distribution.

It also provides a number of different banding methods and can automatically generate descriptive labels for the banded categories. The vertical lines indicate the banded category divisions for the specified range groupings.
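Banding a scale variable into range categories can be sketched in Python with the standard `bisect` module. The cut points are hypothetical, and the rule that each band includes its upper bound is an assumption chosen for this example; missing values are kept out of every band.

```python
from bisect import bisect_left

# Hypothetical upper bounds; each band includes its own upper bound.
upper_bounds = [25000, 50000, 75000]


def band(salary):
    """Return the 1-based band number, or None for a missing value."""
    if salary is None:
        return None  # keep missing values out of every band
    return bisect_left(upper_bounds, salary) + 1


def label(band_number):
    """Build a descriptive label for a band number."""
    if band_number == 1:
        return f"<= {upper_bounds[0]}"
    if band_number == len(upper_bounds) + 1:
        return f"> {upper_bounds[-1]}"
    return f"> {upper_bounds[band_number - 2]} to {upper_bounds[band_number - 1]}"
```

A salary of exactly 25000 falls in band 1, while 25000.01 falls in band 2, so no value can slip between two adjacent categories.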

You can use the Paste button in the Visual Bander to paste the command syntax for your selections into a command syntax window. Without this, user-missing values could be inadvertently combined into a non-missing category for the new variable. For example, THRU would not include a value of ...

You can perform simple numeric transformations using the standard programming language notation for addition, subtraction, multiplication, division, exponents, and so on.

In addition to simple arithmetic operators, you can also transform data with a wide variety of functions, including arithmetic and statistical functions. The divisor for the calculation of the mean is the number of non-missing values. For example, if only two of the variables have non-missing values for a particular case, the value of the computed variable is 2 for that case. Since no minimum number of non-missing values is specified for the MEAN function, a mean will be calculated and truncated as long as at least one of the variables has a non-missing value for that case.
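The handling of missing values described here can be sketched in Python, with `None` standing in for a missing value. The helper names `n_valid` and `mean_valid` and the `min_valid` parameter are hypothetical; they only illustrate the rule that the divisor is the number of non-missing values.

```python
import math


def n_valid(values):
    """Count the non-missing values."""
    return sum(v is not None for v in values)


def mean_valid(values, min_valid=1):
    """Mean over non-missing values; None if fewer than min_valid are present."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical case with one missing value out of three variables.
case = [4, None, 7]

# The divisor is 2, not 3, so the mean is 11 / 2 = 5.5, truncated to 5.
truncated_mean = math.trunc(mean_valid(case))
```

Requiring more valid values (for example `min_valid=3`) makes the result missing for this case instead of basing the mean on only two values.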

Figure: Variables computed with arithmetic and statistical functions.

Random value and distribution functions generate random values based on the specified type of distribution and parameters, such as mean, standard deviation, or maximum value. For example, RV.NORMAL(50, 25) returns a random value from a normal distribution with a mean of 50 and a standard deviation of 25.

Figure: Histograms of randomly generated values for different distributions.
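As a rough Python analogy to generating random values from a specified distribution, the sketch below uses the standard `random` module. The seed, the sample size, and every parameter other than the mean of 50 and standard deviation of 25 are assumptions chosen for illustration.

```python
import random

rng = random.Random(12345)  # fixed seed so the sketch is repeatable
N = 1000  # hypothetical number of cases

# Normal distribution with mean 50 and standard deviation 25.
normal_values = [rng.gauss(50, 25) for _ in range(N)]

# Uniform distribution between 0 and 100 (assumed bounds).
uniform_values = [rng.uniform(0, 100) for _ in range(N)]

# Bernoulli distribution with an assumed success probability of 0.3.
bernoulli_values = [1 if rng.random() < 0.3 else 0 for _ in range(N)]

# Weibull distribution with assumed scale 1.0 and shape 1.5.
weibull_values = [rng.weibullvariate(1.0, 1.5) for _ in range(N)]
```

Plotting a histogram of each list would show the characteristic shape of the distribution it was drawn from.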

Random variable functions are available for a variety of distributions, including Bernoulli, Cauchy, Weibull, and others. Perhaps the most common problem with string values is inconsistent capitalization.


For example, you could combine three numeric variables for area code, exchange, and number into one string variable for telephone number with dashes between the values. Unlike new numeric variables, which can be created by transformation commands, you must define new string variables before using them in any transformations. Each argument can be a variable name, an expression, or a literal string enclosed in quotes. The value argument can be a variable name, a number, or an expression.

The format argument must be a valid numeric format. In this example, we use N format to support leading zeros in values.

Figure: Original numeric values and concatenated string values.

In addition to being able to combine strings, you can also take them apart.
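The telephone-number concatenation described above, with leading zeros preserved, can be sketched in Python using zero-padded format specifiers. The three numeric values are hypothetical, and the padding widths of 3, 3, and 4 digits are an assumption based on the nnn-nnn-nnnn pattern.

```python
# Hypothetical numeric parts; 42 needs leading zeros to become "0042".
area, exchange, number = 312, 555, 42

# Zero-padded conversion plays the role of a leading-zero numeric format,
# and the dashes are literal strings placed between the converted values.
telephone = f"{area:03d}-{exchange:03d}-{number:04d}"
```

Without the zero padding, the last segment would come out as "42" and the combined string would no longer have a fixed layout.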


For example, you could take apart a telephone number that is recorded as a string (because of the embedded dashes) and create three new numeric variables for area code, exchange, and number. If all of the values were in the form nnn-nnn-nnnn with no spaces, it would be fairly easy to extract each segment of the telephone number, but some of the values have leading spaces or spaces before and after the dashes. The value argument can be a variable name, a number expressed as a string in quotes, or an expression.

The format argument must be a valid numeric format; this format is used to determine the numeric value of the string. The value argument can be a variable name, an expression, or a literal string enclosed in quotes. The optional length argument is a number that specifies how many characters to read starting at the value specified on the position argument.

Without the length argument, the string is read from the specified starting position to the end of the string value.

The haystack argument can be a variable name or a literal string enclosed in quotes. The needle argument can be a literal string enclosed in quotes or an expression. Both arguments must evaluate to strings. The function returns a numeric value that represents the starting position of needle within haystack.

In the absence of a length argument, the remainder of the string value is read. The alternative method eliminates the need for a somewhat complicated expression to extract a substring from the middle of the string value: it uses a temporary variable and changes the value of that variable to the remaining portion(s) of the string value as each segment is extracted.
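The temporary-variable approach can be sketched in Python as follows. The function name `split_phone` is hypothetical, and Python's `str.find` is used as a stand-in for the search function described above; note that `str.find` returns a 0-based position and -1 when nothing is found, which may differ from the function in the text.

```python
def split_phone(raw):
    """Split a string like ' 312 - 555 -0042' into three integers."""
    telstr = raw.strip()  # temporary working copy without outer spaces

    dash = telstr.find("-")  # position of the first dash
    if dash == -1:
        raise ValueError(f"no dash found in {raw!r}")
    area = int(telstr[:dash].strip())
    telstr = telstr[dash + 1:]  # drop the area code and the first dash

    dash = telstr.find("-")  # now the dash between exchange and number
    if dash == -1:
        raise ValueError(f"only one dash found in {raw!r}")
    exchange = int(telstr[:dash].strip())

    # No length is given, so the remainder of the string is read.
    number = int(telstr[dash + 1:].strip())
    return area, exchange, number
```

Because the working copy shrinks after each segment is extracted, every search only has to look for the first dash in what remains, which avoids a nested expression for the middle segment.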

Since the area code and the original first dash have been removed from telstr, this is the position of the dash between the exchange and the number. When reading in data from text files or databases, the width of string variables is sometimes set higher than necessary.