Use Python's `re.findall` to clean this dataset and generate an Excel file, with each movie's actors placed in separate cells.

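Sure. We can use Python's `re` module to parse and extract the data, then write it to an Excel file. Here is an example:

```python
import re

import pandas as pd

# Raw data
data = """
[raw data content]
"""

# Regular expression that matches one movie record and captures its fields
pattern = (
    r'\{"id": "(.*?)",.*?"name": "(.*?)",'
    r'.*?"directors": \[(.*?)\],.*?"writers": \[(.*?)\],'
    r'.*?"actors": \[(.*?)\],.*?"genres": \[(.*?)\],'
    r'.*?"countries": \[(.*?)\],.*?"languages": \[(.*?)\],'
    r'.*?"pubdates": \[(.*?)\],.*?"episodes": "(.*?)",'
    r'.*?"durations": \[(.*?)\],.*?"other_names": \[(.*?)\],'
    r'.*?"summary": "(.*?)",'
    r'.*?"rating": \{"average": "(.*?)",.*?"reviews_count": "(.*?)"\}\}'
)
matches = re.findall(pattern, data, re.DOTALL)

# Pattern for the "name" values nested inside directors/writers/actors
NAME_RE = r'"name": "(.*?)"'


def split_items(raw):
    """Split the body of a string array into a list of unquoted strings."""
    return [item.strip().replace('"', '') for item in raw.split(',')]


# Process the extracted data
movies = []
for match in matches:
    actors = [a.strip() for a in re.findall(NAME_RE, match[4])]
    movie = {
        'id': match[0],
        'name': match[1],
        'directors': [d.strip() for d in re.findall(NAME_RE, match[2])],
        'writers': [w.strip() for w in re.findall(NAME_RE, match[3])],
        'genres': split_items(match[5]),
        'countries': split_items(match[6]),
        'languages': split_items(match[7]),
        'pubdates': split_items(match[8]),
        'episodes': match[9],
        'durations': split_items(match[10]),
        'other_names': split_items(match[11]),
        'summary': match[12].strip(),
        'average_rating': match[13],
        'reviews_count': match[14],
    }
    # One column per actor (actor_1, actor_2, ...) so each actor gets its own cell
    for i, actor in enumerate(actors, start=1):
        movie[f'actor_{i}'] = actor
    movies.append(movie)

# Convert the data to a DataFrame; movies with fewer actors get empty cells
df = pd.DataFrame(movies)

# Write to an Excel file
df.to_excel('movie_info.xlsx', index=False)
```

### Explanation

1. **Regular expression**: `pattern` matches the information block of each movie.
2. **Extracting the data**: `re.findall` returns every match of the pattern, one tuple of captured groups per movie.
3. **Processing the data**: the loop walks over each match, then extracts and tidies the required fields.
4. **Converting to a DataFrame**: the tidied records are turned into a pandas DataFrame.
5. **Writing to Excel**: the DataFrame is written to an Excel file.

### Notes

- The `re.DOTALL` flag lets `.` match any character, including newlines.
- The nested `re.findall` calls pull the individual `name` values out of the directors, writers, and actors blocks.
- Actors are spread across the columns `actor_1`, `actor_2`, and so on. The other list-valued fields (directors, genres, and the rest) remain lists, one cell per field.
- Writing `.xlsx` files with `to_excel` requires an Excel writer engine such as `openpyxl` to be installed.
- The result is saved to an Excel file named `movie_info.xlsx`.

After running the code above, you get an Excel file containing all the movie information, with each actor of a movie in its own cell.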
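To make the "one actor per cell" layout concrete, here is a minimal sketch that uses only the standard library. The sample records and the helper name `actors_to_rows` are assumptions made for illustration. The real dataset is not shown in the post, so the sample is a simplified guess at its shape, with actors stored as nested objects that carry a `name` key.

```python
import re

# Hypothetical, simplified sample; the real dataset is not shown in the post.
sample = '''
{"id": "1", "name": "Movie A", "actors": [{"name": "Actor One"}, {"name": "Actor Two"}]}
{"id": "2", "name": "Movie B", "actors": [{"name": "Actor Three"}]}
'''


def actors_to_rows(text):
    """Return (header, rows) with one actor per cell, padded to equal width."""
    # Outer findall: capture the movie name and the raw body of its actors array
    records = re.findall(r'"name": "(.*?)",\s*"actors": \[(.*?)\]', text, re.DOTALL)
    # Inner findall: pull each actor name out of the captured array body
    parsed = [(name, re.findall(r'"name": "(.*?)"', block)) for name, block in records]
    # The widest actor list decides how many actor columns are needed
    width = max((len(actors) for _, actors in parsed), default=0)
    header = ['name'] + [f'actor_{i}' for i in range(1, width + 1)]
    # Pad shorter actor lists with empty strings so every row has the same length
    rows = [[name] + actors + [''] * (width - len(actors)) for name, actors in parsed]
    return header, rows


header, rows = actors_to_rows(sample)
print(header)
for row in rows:
    print(row)
```

Assuming pandas and an Excel writer engine are installed, `pd.DataFrame(rows, columns=header).to_excel('movie_info.xlsx', index=False)` would write these rows with each actor in a separate cell.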
