def infor(data): # 计算每个元素出现的概率 prob = pd.value_counts(data) / len(data) # 信息熵的公式 return sum(np.log2(prob) * prob * (-1))
时间: 2024-04-11 14:31:16 浏览: 12
这段代码是一个计算数据集的信息熵的函数。它首先计算每个元素在数据集中出现的概率,然后使用信息熵的公式计算并返回结果。信息熵是用来衡量数据集的不确定性和混乱程度的指标,数值越大表示数据集越不确定。在这段代码中,使用了pandas库的value_counts函数来计算每个元素出现的次数,然后除以数据集的总长度得到概率值。最后,通过对概率值进行求对数、乘以概率值本身再取负,再求和的方式计算信息熵。这个函数的输入参数是一个数据集。
相关问题
优化以下SQL,给出优化后的SQL:SELECT un.*, sta.*, CASE WHEN COALESCE(un.lot_time, 0) > COALESCE(sta.dock_time_limit, 0) THEN COALESCE(un.lot_time, 0) - COALESCE(sta.dock_time_limit, 0) ELSE 0 END AS lotDuration FROM unload_over_time un LEFT JOIN ( SELECT parking_lot_code, dock_time_limit FROM stall_infor WHERE data_state = 0 AND delete_state = 1 AND industry_park_code='YQWQML' ) sta ON un.parking_lot_code = sta.parking_lot_code INNER JOIN ( SELECT DISTINCT plate_number FROM supplier_info WHERE data_state = 0 AND delete_state = 1 AND arrived_atetime IS NOT NULL AND left_datetime IS NULL AND industry_park_code='YQWQML' ) sup ON un.plate_number = sup.plate_number WHERE un.data_state = 0 AND un.delete_state = 1 AND un.is_out_of_time = 1 AND un.industry_park_code='YQWQML' AND un.parking_lot_code IN ( SELECT DISTINCT parking_lot_code FROM unload_over_time WHERE data_state = 0 AND delete_state = 1 AND is_out_of_time = 1 AND industry_park_code='YQWQML' )
优化后的SQL如下:
```sql
SELECT un.*, sta.*,
CASE WHEN COALESCE(un.lot_time, 0) > COALESCE(sta.dock_time_limit, 0) THEN COALESCE(un.lot_time, 0) - COALESCE(sta.dock_time_limit, 0) ELSE 0 END AS lotDuration
FROM unload_over_time un
LEFT JOIN stall_infor sta ON un.parking_lot_code = sta.parking_lot_code
INNER JOIN supplier_info sup ON un.plate_number = sup.plate_number
WHERE un.data_state = 0
AND un.delete_state = 1
AND un.is_out_of_time = 1
AND un.industry_park_code='YQWQML'
AND EXISTS (
SELECT 1
FROM unload_over_time uot
WHERE uot.data_state = 0
AND uot.delete_state = 1
AND uot.is_out_of_time = 1
AND uot.industry_park_code='YQWQML'
AND uot.parking_lot_code = un.parking_lot_code
)
AND EXISTS (
SELECT 1
FROM stall_infor si
WHERE si.data_state = 0
AND si.delete_state = 1
AND si.industry_park_code='YQWQML'
AND si.parking_lot_code = un.parking_lot_code
)
AND EXISTS (
SELECT 1
FROM supplier_info sp
WHERE sp.data_state = 0
AND sp.delete_state = 1
AND sp.arrived_atetime IS NOT NULL
AND sp.left_datetime IS NULL
AND sp.industry_park_code='YQWQML'
AND sp.plate_number = un.plate_number
)
```
优化后的SQL做了以下几点改进:
1. 将子查询转化为 EXISTS 子句,避免使用 DISTINCT 关键字和 IN 子查询,提高查询性能。
2. 将 LEFT JOIN 子查询转化为直接的 LEFT JOIN 操作,避免嵌套查询,简化SQL结构。
3. 合并了相同的过滤条件,减少重复的条件判断。
优化后的SQL应该能够提高查询性能。但请注意,在实际应用中,最佳的优化方案还需要根据具体的数据量、索引等因素进行调整和测试。
import requests from bs4 import BeautifulSoup url = 'https://movie.douban.com/subject/30228394/' header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)\ AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'} response = requests.get(url=url, headers=header) soup = BeautifulSoup(response.text, 'html.parser') tv_infor = {} # 1.获取电视剧名称 name = soup.find(property="v:itemreviewed").string # 根据属性property="v: itemreviewed"查找 tv_infor['name'] = name # 将电影名称加到字典tv_infor中 # 2.获取导演 director = soup.find(rel="v: directedBy").string # 根据属性re1="v:directedBy“查找 tv_infor['director'] = director # 3.获取编剧 soup_list = soup. findAll(class_="attrs")[1].findAll('a') writers = [elem. string for elem in soup_list] tv_infor['writers'] = writers # 4.获取演员 soup_list = soup. findAll(rel="v:starring") actors = [elem. string for elem in soup_list] tv_infor['actors'] = actors # 5.获取类型 soup_list = soup. findAll(property="v: genre") tv_type = [elem. string for elem in soup_list] tv_infor['type'] = tv_type # 6.首播时间 release_date = soup.find(property="v: initialReleaseDate").string tv_infor['release_date'] = release_date # 7.豆瓣评分 rating = soup.find(property="v: average").string tv_infor['rating'] = rating # 8.参评人数 votes = soup.find(property="v: votes").string tv_infor['votes'] = votes print("电视剧《觉醒年代》相关信息如下:") for key, value in tv_infor.items(): print(key, ":", value)运行报错
这段代码是Python 2.x的语法,但在Python 3.x中会出现语法错误。你需要将所有的print语句修改为函数调用的格式,即在print后添加括号,像这样:
```python
import requests
from bs4 import BeautifulSoup
url = 'https://movie.douban.com/subject/30228394/'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)\ AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}
response = requests.get(url=url, headers=header)
soup = BeautifulSoup(response.text, 'html.parser')
tv_infor = {}
# 1.获取电视剧名称
name = soup.find(property="v:itemreviewed").string
tv_infor['name'] = name
# 2.获取导演
director = soup.find(rel="v: directedBy").string
tv_infor['director'] = director
# 3.获取编剧
soup_list = soup.findAll(class_="attrs")[1].findAll('a')
writers = [elem.string for elem in soup_list]
tv_infor['writers'] = writers
# 4.获取演员
soup_list = soup.findAll(rel="v:starring")
actors = [elem.string for elem in soup_list]
tv_infor['actors'] = actors
# 5.获取类型
soup_list = soup.findAll(property="v:genre")
tv_type = [elem.string for elem in soup_list]
tv_infor['type'] = tv_type
# 6.首播时间
release_date = soup.find(property="v:initialReleaseDate").string
tv_infor['release_date'] = release_date
# 7.豆瓣评分
rating = soup.find(property="v:average").string
tv_infor['rating'] = rating
# 8.参评人数
votes = soup.find(property="v:votes").string
tv_infor['votes'] = votes
print("电视剧《觉醒年代》相关信息如下:")
for key, value in tv_infor.items():
print(key, ":", value)
```
这样就可以正常运行了。