pd.get_dummies
Time: 2024-05-10 09:21:20
pd.get_dummies is a function in the pandas library that converts categorical data into dummy (indicator) variables, a technique also known as one-hot encoding. It creates one new column for each unique category of a categorical variable and marks each row according to whether it belongs to that category. In pandas 2.0 and later the indicator columns are boolean (True/False) by default; pass dtype=int to get 1/0 values. This is useful for machine learning algorithms that require numerical input, because it converts non-numerical data into a numerical format.
For example, if a DataFrame has a categorical column "color" with three categories (red, green, and blue), pd.get_dummies replaces that column with three new columns called "color_red", "color_green", and "color_blue". Each row has a 1 (or True) in the column that corresponds to its color, and a 0 (or False) in the other two columns.
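The color example above can be sketched as follows. The DataFrame contents are made up for illustration, and dtype=int is passed so that the output uses 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical sample data: one categorical column with three categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# dtype=int requests 0/1 output; without it, pandas 2.0+ returns booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

# The original "color" column is replaced by one column per category,
# ordered alphabetically: color_blue, color_green, color_red.
print(encoded)
```

Each row of the result has exactly one 1 across the three new columns, since every row belongs to exactly one color.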
The syntax for pd.get_dummies is:
```
pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
```
- data: the input pandas DataFrame or Series
- columns: the list of column names in the DataFrame to encode. If not specified, all columns with object, string, or category dtype are encoded; numeric columns are left unchanged.
- prefix: the prefix to add to the column names of the new dummy variables
- prefix_sep: the separator to use between the prefix and the category value in each new column name (for example, the "_" in "color_red")
- dummy_na: whether to create an additional indicator column for missing values. If True, a column named like "column_name_nan" is added for every encoded column. If False (the default), missing values are ignored and the row gets 0 in all dummy columns.
- drop_first: whether to drop the first category level of each encoded variable, producing k-1 dummy columns from k categories. This avoids perfect multicollinearity among the dummy columns, which matters for models such as linear regression.
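To illustrate how these parameters interact, here is a minimal sketch. The data, the "size" column, the prefix "c", and the "-" separator are all illustrative choices, not anything prescribed by pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a categorical column with one missing value
# and a numeric column that should pass through unchanged.
df = pd.DataFrame({
    "color": ["red", "green", np.nan, "blue"],
    "size": [1, 2, 3, 4],
})

# Custom prefix and separator, plus an extra indicator column for NaN.
# Resulting columns: size, c-blue, c-green, c-red, c-nan
with_nan = pd.get_dummies(
    df, columns=["color"], prefix="c", prefix_sep="-", dummy_na=True, dtype=int
)

# Drop the first level ("blue", the first in sorted order): k-1 columns.
# Resulting columns: size, color_green, color_red
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(with_nan)
print(reduced)
```

Note that in `reduced` both the "blue" row and the row with the missing value are encoded as all zeros, so combining drop_first=True with missing data can make the two cases indistinguishable unless dummy_na=True is also set.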