pandas drop_duplicates
Date: 2023-10-09 10:08:24
The `drop_duplicates()` method in pandas removes duplicate rows from a DataFrame. By default, it compares all columns: a row counts as a duplicate only if it matches an earlier row in every column, and the first occurrence is kept. You can also pass a subset of columns so that only those columns are compared when identifying duplicates.
Syntax:
```
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
```
Parameters:
- `subset`: A column label or list of column labels to consider when identifying duplicates. If `None` (default), all columns are used.
- `keep`: Specifies which duplicates to keep. Possible values are `'first'` (default; keep the first occurrence), `'last'` (keep the last occurrence), and `False` (drop every row that has a duplicate, keeping none of them).
- `inplace`: If `True`, the original DataFrame is modified in place and the method returns `None`. If `False` (default), the original is left unchanged and a new DataFrame with duplicates removed is returned.
- `ignore_index`: If `True`, the resulting DataFrame is relabeled with a new index from 0 to n-1, where n is the number of remaining rows. If `False` (default), the remaining rows keep their original index labels.
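The effect of `subset` and `keep` is easiest to see side by side. The following is a minimal sketch; the sample data is hypothetical and chosen so that names repeat while ages differ.

```python
import pandas as pd

# Hypothetical sample data: names repeat, ages do not
df = pd.DataFrame({
    "Name": ["John", "John", "Mary", "Tom", "Mary"],
    "Age": [25, 26, 30, 35, 31],
})

# Only 'Name' is compared; the first row for each name is kept
first = df.drop_duplicates(subset=["Name"])

# Same comparison, but the last row for each name is kept
last = df.drop_duplicates(subset=["Name"], keep="last")

# keep=False drops every row whose 'Name' appears more than once
unique_only = df.drop_duplicates(subset=["Name"], keep=False)

print(first.index.tolist())        # [0, 2, 3]
print(last.index.tolist())         # [1, 3, 4]
print(unique_only.index.tolist())  # [3]
```

Without `subset`, none of these rows would be dropped, because no two rows match in both `Name` and `Age`.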
Example:
```
import pandas as pd
# Creating a DataFrame with duplicate rows
data = {'Name': ['John', 'John', 'Mary', 'Tom', 'Mary'],
        'Age': [25, 25, 30, 35, 30],
        'City': ['New York', 'New York', 'Los Angeles', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)
# Removing duplicate rows
df.drop_duplicates(inplace=True)
print(df)
```
Output:
```
   Name  Age         City
0  John   25     New York
2  Mary   30  Los Angeles
3   Tom   35      Chicago
```
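In the above example, `drop_duplicates()` compares all columns, so rows 1 and 4 are removed because they repeat rows 0 and 2. Because `inplace=True` is used, `df` itself is modified and the method returns `None`; without it, `df` would be left unchanged and a new DataFrame would be returned. The remaining rows keep their original index labels (0, 2, 3) because `ignore_index` defaults to `False`.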
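For comparison, here is a sketch of the same data handled without `inplace`, with `ignore_index=True` added so the result is relabeled from 0. It assumes pandas 1.0 or later, where `ignore_index` is available; the variable name `deduped` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "John", "Mary", "Tom", "Mary"],
    "Age": [25, 25, 30, 35, 30],
    "City": ["New York", "New York", "Los Angeles", "Chicago", "Los Angeles"],
})

# inplace=False (the default): df is untouched and a new DataFrame is returned
deduped = df.drop_duplicates(ignore_index=True)

print(len(df))                   # 5
print(deduped.index.tolist())    # [0, 1, 2]
print(deduped["Name"].tolist())  # ['John', 'Mary', 'Tom']
```

Assigning the result to a new name keeps the original data available, which tends to be easier to follow than mutating `df` in place.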