Python regex: removing \n and \u3000

You can use the `re.sub()` function to remove `\n` and `\u3000`:

    import re

    text = "hello\n\u3000\u3000world"
    clean_text = re.sub(r'[\n\u3000]', '', text)
    print(clean_text)

The output is:

    helloworld
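As a small sketch of alternatives to the character-class approach above: `str.translate` deletes a fixed set of characters without using a regex, and `\s` in a `str` pattern also matches the ideographic space `\u3000`, so it can be used when every kind of whitespace should go. The variable names below are illustrative only.

```python
import re

text = "hello\n\u3000\u3000world"

# Option 1: explicit character class, deletes only \n and \u3000.
via_regex = re.sub(r"[\n\u3000]", "", text)

# Option 2: str.translate with a deletion table, no regex needed.
delete_table = str.maketrans("", "", "\n\u3000")
via_translate = text.translate(delete_table)

# Option 3: \s matches all Unicode whitespace in str patterns,
# which includes \u3000 but also ordinary spaces and tabs.
via_whitespace = re.sub(r"\s+", "", text)

print(via_regex, via_translate, via_whitespace)
```

Option 3 is broader than the other two: it would also remove the spaces between words, so it only fits when no whitespace at all should survive.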

    # encoding=utf-8
    import nltk
    import json
    from nltk.corpus import stopwords
    import re

    eg_stop_words = set(stopwords.words('english'))
    sp_stop_words = set(stopwords.words('spanish'))
    all_stop_words = eg_stop_words.union(sp_stop_words)

    input_file_name = r'建模.txt'
    output_file_name = r'train.txt'

    # Open the output file
    with open(output_file_name, encoding='utf-8', mode='w') as out_file:
        # Open the input file and process each line
        with open(input_file_name, encoding='utf-8') as f:
            for idx, line in enumerate(f):
                print("Processing row {}".format(idx))
                if idx == 0:  # The first line holds the column names; skip it
                    print(line)
                    continue
                line = line.strip()
                sps = line.split("\t")  # Split the line on tabs into a list
                report_no = sps[0]
                target = sps[2]
                smses = sps[-1]
                smses = smses.strip('"')  # Strip the quotes around the SMS field
                smses = smses.replace('""', '"')  # Collapse doubled double quotes into one
                root = json.loads(smses)  # Parse the JSON-formatted SMS data
                msg = ""
                for item in root:  # Iterate over every message
                    body = item["body"]  # Get the message body
                    msg += body + "\n"  # Append the body to the accumulated msg
                text = re.sub(r'[^\w\s]', '', msg)  # Remove punctuation with a regex
                text = re.sub(r'http\S+', '', text)  # Remove links
                text = re.sub(r'\d+', '', text)  # Remove digits
                text = text.lower()
                words = text.split()
                filtered_words = [word for word in words if word not in all_stop_words]
                text = ' '.join(filtered_words)
                print(report_no + '\t' + target)
                msg = target + '\u0001' + text + '\n'
                out_file.write(msg)

Please help me rewrite this to use pandas.

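To do this with pandas, you can read the file with `pandas.read_csv()` and process the data through the resulting DataFrame. Here is example code:

```python
import csv
import json
import re

import pandas as pd
from nltk.corpus import stopwords

eg_stop_words = set(stopwords.words('english'))
sp_stop_words = set(stopwords.words('spanish'))
all_stop_words = eg_stop_words.union(sp_stop_words)

input_file_name = '建模.txt'
output_file_name = 'train.txt'

# read_csv consumes the header row, so no row has to be skipped by hand.
# QUOTE_NONE keeps the raw quotes, matching a plain split on tabs;
# keep_default_na=False keeps empty fields as empty strings instead of NaN.
df = pd.read_csv(input_file_name, encoding='utf-8', sep='\t', dtype=str,
                 quoting=csv.QUOTE_NONE, keep_default_na=False)


def clean_smses(smses):
    """Parse one raw SMS field and return its cleaned, stop-word-free text."""
    smses = smses.strip().strip('"')   # Strip the quotes around the field
    smses = smses.replace('""', '"')   # Collapse doubled double quotes into one
    root = json.loads(smses)           # Parse the JSON list of messages
    msg = "\n".join(item["body"] for item in root)
    text = re.sub(r'[^\w\s]', '', msg)   # Remove punctuation
    text = re.sub(r'http\S+', '', text)  # Remove links
    text = re.sub(r'\d+', '', text)      # Remove digits
    words = text.lower().split()
    return ' '.join(word for word in words if word not in all_stop_words)


# Columns are selected by position: the third is the target, the last is the SMS data.
targets = df.iloc[:, 2]
texts = df.iloc[:, -1].map(clean_smses)

# Write one "target\u0001text" record per line.
with open(output_file_name, encoding='utf-8', mode='w') as out_file:
    for target, text in zip(targets, texts):
        out_file.write(target + '\u0001' + text + '\n')
```

This code reads the file named '建模.txt', processes each row, and saves the results to 'train.txt' as one `target\u0001text` record per line. Before running it, you need to install the pandas and nltk libraries and download the NLTK stop-word corpus (for example with `nltk.download('stopwords')`).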
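To check the cleaning steps without the input file or the NLTK corpus, the following minimal sketch runs them on one hand-written field. The sample string, the `STOP_WORDS` set, and the function name `clean_smses` are assumptions made for illustration; the real stop-word list comes from `stopwords.words(...)`, and the real field format depends on the data file.

```python
import json
import re

# Hypothetical stand-in for the NLTK English + Spanish stop-word set.
STOP_WORDS = {"the", "is", "at", "el", "la"}


def clean_smses(raw, stop_words=STOP_WORDS):
    """Unescape a quoted JSON field, join the message bodies, and clean the text."""
    unescaped = raw.strip().strip('"').replace('""', '"')
    root = json.loads(unescaped)
    msg = "\n".join(item["body"] for item in root)
    text = re.sub(r"[^\w\s]", "", msg)   # Remove punctuation
    text = re.sub(r"http\S+", "", text)  # Remove links
    text = re.sub(r"\d+", "", text)      # Remove digits
    words = text.lower().split()
    return " ".join(w for w in words if w not in stop_words)


# An assumed field: outer quotes plus doubled inner quotes around a JSON list.
sample = '"[{""body"": ""The code is 1234!""}, {""body"": ""Visit http://example.com""}]"'
result = clean_smses(sample)
print(result)  # code visit
```

Because punctuation is removed before links, a URL is first collapsed into a single token starting with `http` and is then dropped by the second substitution.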
