Module 1 - Pandas Basics: 05. Data Import and Export

张建站

2026/5/13 9:12:15

10 min read

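05. Data Import and Export

1. Overview

In practice, data usually lives in files (CSV, Excel, JSON, and so on) or in databases. Pandas provides a rich set of I/O functions that make it easy to read data from many kinds of sources and to save processed results back to files.

```python
import pandas as pd
import numpy as np
```

2. Reading and Writing CSV Files

CSV (Comma-Separated Values) is the most widely used data exchange format. Each line is one record, and columns are separated by commas.

2.1 read_csv() - Reading a CSV file

```python
# Basic read
df = pd.read_csv('data.csv')

# Common parameters (listed together for reference; pass only the ones you need)
df = pd.read_csv(
    'data.csv',
    sep=',',                    # delimiter, comma by default
    header=0,                   # row to use as column names (0 = first row)
    names=['col1', 'col2'],     # custom column names
    index_col=0,                # column to use as the row index
    usecols=['col1', 'col3'],   # read only the listed columns
    dtype={'col1': 'int32'},    # data type per column
    parse_dates=['date_col'],   # columns to parse as dates
    nrows=100,                  # read only the first 100 rows
    skiprows=2,                 # skip the first 2 lines
    encoding='utf-8',           # file encoding
    na_values=['', 'NULL'],     # values to treat as missing
)
```

Parameter reference

| Parameter | Description | Example |
| --- | --- | --- |
| filepath_or_buffer | File path | 'data.csv' or './data/data.csv' |
| sep | Delimiter | ',' (CSV), '\t' (TSV) |
| header | Row that holds the column names | 0 (first row), None (no header row) |
| names | Custom column names | ['A', 'B', 'C'] |
| index_col | Column(s) to use as the row index | 0 (first column), ['ID'] |
| usecols | Columns to read | ['A', 'B'] |
| dtype | Column data types | {'A': 'int32'} |
| parse_dates | Columns to parse as dates | ['date'] |
| nrows | Number of rows to read | 1000 |
| skiprows | Lines to skip | 2 (skip the first 2 lines) |
| encoding | File encoding | 'utf-8', 'gbk' |
| na_values | Markers for missing values | ['', 'NULL', 'NA'] |

2.2 Practical example

```python
# Create a sample CSV file (Chinese column names and values are kept on purpose,
# since they are what makes the encoding choice matter)
sample_data = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    '姓名': ['张三', '李四', '王五', '赵六', '钱七'],
    '年龄': [25, 30, None, 28, 35],
    '工资': [5000, 8000, 6500, None, 9000],
    '入职日期': ['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-01', '2024-05-12'],
})
sample_data.to_csv('sample_data.csv', index=False, encoding='utf-8-sig')

# Read the CSV file
df = pd.read_csv('sample_data.csv')
print('Plain read:')
print(df)

# Use the ID column as the index
df = pd.read_csv('sample_data.csv', index_col=0)
print('\nID as index:')
print(df)

# Read only selected columns
df = pd.read_csv('sample_data.csv', usecols=['姓名', '年龄'])
print('\nOnly 姓名 and 年龄:')
print(df)

# Handle missing values
print('\nWith explicit missing-value markers:')
df = pd.read_csv('sample_data.csv', na_values=['None', ''])
print(df.isnull().sum())

# Parse dates
df = pd.read_csv('sample_data.csv', parse_dates=['入职日期'])
print('\nColumn dtypes after date parsing:')
print(df.dtypes)
```

2.3 to_csv() - Saving to CSV

```python
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
})

# Basic save
df.to_csv('output.csv', index=False)

# Common parameters
df.to_csv(
    'output.csv',
    index=False,             # do not write the row index
    sep=',',                 # delimiter
    encoding='utf-8-sig',    # encoding (avoids garbled Chinese when opened in Excel)
    header=True,             # write the column names
    columns=['A', 'B'],      # write only the listed columns
    na_rep='NULL',           # text to write for missing values
    float_format='%.2f',     # format for floating-point numbers
)
```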
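To see how `index_col`, `na_values`, and `parse_dates` interact with `na_rep` on the way back out, here is a minimal in-memory sketch. It uses `io.StringIO` instead of a file on disk, and the column names and the `unknown` marker are made up for illustration; they do not come from the tutorial's sample data.

```python
import io

import pandas as pd

# Hypothetical CSV content; "unknown" is a custom missing-value marker
csv_text = (
    "ID,name,age,joined\n"
    "1,Ann,25,2024-01-15\n"
    "2,Bob,unknown,2024-02-20\n"
    "3,Cy,31,2024-03-10\n"
)

# Read: use ID as the index, treat "unknown" as NaN, parse the date column
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="ID",
    na_values=["unknown"],
    parse_dates=["joined"],
)

missing_ages = int(df["age"].isna().sum())
joined_is_datetime = pd.api.types.is_datetime64_any_dtype(df["joined"])

# Write back to text, restoring the custom marker for missing values
buffer = io.StringIO()
df.to_csv(buffer, na_rep="unknown")
round_trip_text = buffer.getvalue()

print(df.dtypes)
print(round_trip_text)
```

Because one age is missing, pandas stores the `age` column as floating point, so the values written back may appear as `25.0` rather than `25`. If that matters, passing a nullable integer dtype such as `dtype={"age": "Int64"}` to `read_csv` is one option.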
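3. Reading and Writing Excel Files

Working with Excel files requires openpyxl (for .xlsx) or xlrd (for .xls).

```bash
pip install openpyxl xlrd
```

3.1 read_excel() - Reading an Excel file

```python
# Basic read
df = pd.read_excel('data.xlsx')

# Common parameters (listed together for reference; pass only the ones you need)
df = pd.read_excel(
    'data.xlsx',
    sheet_name='Sheet1',     # sheet name or index (0 = first sheet)
    header=0,                # row that holds the column names
    names=['A', 'B', 'C'],   # custom column names
    index_col=0,             # column to use as the row index
    usecols='A:C',           # column range to read
    dtype={'A': 'int32'},    # data types
    nrows=100,               # number of rows to read
    skiprows=2,              # lines to skip
)

# Read several sheets from the same workbook
xl = pd.ExcelFile('data.xlsx')
df1 = pd.read_excel(xl, sheet_name='Sheet1')
df2 = pd.read_excel(xl, sheet_name='Sheet2')

# Read all sheets at once (returns a dict of {sheet name: DataFrame})
dict_dfs = pd.read_excel('data.xlsx', sheet_name=None)
```

3.2 to_excel() - Saving to Excel

```python
# Basic save
df.to_excel('output.xlsx', index=False)

# Save to a named sheet
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)

# Save several sheets into one workbook
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)

# Control how values are written
df.to_excel(
    'output.xlsx',
    sheet_name='Sheet1',
    index=False,
    na_rep='NULL',
    float_format='%.2f',
)
```

4. Reading and Writing JSON Files

JSON is the data format most commonly used by Web APIs.

4.1 read_json() - Reading JSON

```python
from io import StringIO

# JSON string (wrapped in StringIO; passing a raw JSON string is deprecated in recent pandas)
json_str = '{"name": "张三", "age": 25, "city": "北京"}'
s = pd.read_json(StringIO(json_str), typ='series')  # returns a Series

# JSON file
df = pd.read_json('data.json')

# JSON that is already parsed into Python objects can go straight into DataFrame
# Form 1: records (one object per record)
data = [
    {'name': '张三', 'age': 25},
    {'name': '李四', 'age': 30},
]
df = pd.DataFrame(data)

# Form 2: columns
data = {
    'name': ['张三', '李四'],
    'age': [25, 30],
}
df = pd.DataFrame(data)
```

4.2 to_json() - Saving to JSON

```python
df = pd.DataFrame({
    'name': ['张三', '李四', '王五'],
    'age': [25, 30, 28],
    'city': ['北京', '上海', '广州'],
})

# Basic save
df.to_json('output.json', orient='records', force_ascii=False)

# The orient parameter
# 'records' : [{col: value}, ...]      (most common)
# 'index'   : {index: {col: value}}
# 'columns' : {col: {index: value}}
# 'values'  : two-dimensional array
# 'table'   : table format that includes a schema

# Keep Chinese characters readable and pretty-print the output
df.to_json('output.json', orient='records', force_ascii=False, indent=2)
```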
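The differences between the `orient` values are easiest to see side by side. The sketch below serializes one small, made-up DataFrame to JSON strings (calling `to_json()` without a path returns a string) and reads the `records` form back through `io.StringIO`.

```python
import io
import json

import pandas as pd

df = pd.DataFrame({"name": ["张三", "李四"], "age": [25, 30]})

# Calling to_json() without a path returns the JSON text instead of writing a file
records_json = df.to_json(orient="records", force_ascii=False)
columns_json = df.to_json(orient="columns", force_ascii=False)
values_json = df.to_json(orient="values", force_ascii=False)

for label, text in [
    ("records", records_json),
    ("columns", columns_json),
    ("values", values_json),
]:
    print(f"{label:8s} {text}")

# Round trip: read the records form back into a DataFrame
restored = pd.read_json(io.StringIO(records_json), orient="records")
print(restored)
```

Note that `records` and `values` drop the row index, so they suit data whose index carries no information. `columns` and `index` keep it, with the index labels turned into JSON object keys (strings).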
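5. Other Formats

5.1 HTML tables

```python
# Read HTML tables (returns a list of DataFrames, one per table found)
tables = pd.read_html('https://example.com/table.html')
df = tables[0]  # first table

# Save as HTML
df.to_html('output.html', index=False)
```

5.2 Parquet (efficient columnar storage)

Parquet is an efficient columnar storage format that is well suited to large datasets.

```bash
pip install pyarrow
```

```python
# Read Parquet
df = pd.read_parquet('data.parquet')

# Save as Parquet
df.to_parquet('output.parquet', index=False)
```

5.3 SQL databases

```python
from sqlalchemy import create_engine

# Create a database connection
engine = create_engine('sqlite:///database.db')

# Read from SQL
df = pd.read_sql('SELECT * FROM table_name', engine)
df = pd.read_sql_query('SELECT * FROM table_name', engine)
df = pd.read_sql_table('table_name', engine)

# Write to SQL
df.to_sql('table_name', engine, if_exists='replace', index=False)
# if_exists: 'fail', 'replace', 'append'
```

6. Tips for Large Files

6.1 Reading in chunks

When a file is too large to load into memory in one go, read it in chunks:

```python
# Read a CSV in chunks
chunk_size = 10000
chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk (here: keep only rows whose value is positive)
    processed_chunk = chunk[chunk['value'] > 0]
    chunks.append(processed_chunk)

# Combine the results
df = pd.concat(chunks, ignore_index=True)
```

6.2 Reading only the columns you need

```python
# Read only the listed columns to reduce memory use
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2', 'col3'])
```

6.3 Specifying data types

```python
# Specify column data types to reduce memory use
dtype_spec = {
    'col1': 'int32',     # smaller integer type
    'col2': 'float32',   # single-precision float
    'col3': 'category',  # categorical type
}
df = pd.read_csv('large_file.csv', dtype=dtype_spec)
```

7. Working with File Paths

7.1 Absolute and relative paths

```python
# Absolute path: starts from the root directory
df = pd.read_csv('/Users/username/data/file.csv')

# Relative path: starts from the current working directory
df = pd.read_csv('./data/file.csv')
df = pd.read_csv('../data/file.csv')

# Show the current working directory
import os
print(os.getcwd())

# Change the working directory
os.chdir('/path/to/directory')
```

7.2 Using pathlib (recommended)

```python
from pathlib import Path

# Build path objects
data_dir = Path('./data')
file_path = data_dir / 'file.csv'

# Read a file
df = pd.read_csv(file_path)

# Read a batch of files
csv_files = Path('./data').glob('*.csv')
dfs = [pd.read_csv(f) for f in csv_files]
```

8. File Encoding Issues

8.1 Common encodings

| Encoding | Description | Typical use |
| --- | --- | --- |
| utf-8 | International standard | Most cases |
| utf-8-sig | UTF-8 with BOM | CSV files saved from Excel |
| gbk / gb2312 | Simplified Chinese | Chinese-language Windows systems |
| latin1 | Western European | Handling mixed encodings |

8.2 Resolving encoding problems

```python
# Try a different encoding if the first one fails
try:
    df = pd.read_csv('data.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('data.csv', encoding='gbk')

# Detect the encoding automatically with chardet
import chardet

with open('data.csv', 'rb') as f:
    result = chardet.detect(f.read())

encoding = result['encoding']
df = pd.read_csv('data.csv', encoding=encoding)
```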
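The try/except fallback above can be generalized to a list of candidate encodings. The helper below is a minimal sketch: `read_csv_with_fallback` is a hypothetical name rather than a pandas function, and the demo writes a small GBK-encoded file to a temporary directory so that the first attempt (UTF-8) fails.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gbk"), **kwargs):
    """Try each encoding in order; return (DataFrame, encoding that worked)."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a GBK-encoded sample file; reading it as UTF-8 raises UnicodeDecodeError
    path = os.path.join(tmp_dir, "gbk_sample.csv")
    pd.DataFrame({"姓名": ["张三", "李四"], "年龄": [25, 30]}).to_csv(
        path, index=False, encoding="gbk"
    )
    df, used = read_csv_with_fallback(path)

print(used)
print(df)
```

Order matters here. Put strict encodings such as `utf-8` first, because permissive ones such as `latin1` accept any byte sequence and would silently produce garbled text instead of raising an error.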
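9. Complete Example: Data Import/Export Workflow

```python
# Create sample data
import os

import numpy as np
import pandas as pd

np.random.seed(42)

data = pd.DataFrame({
    'ID': range(1, 101),
    '姓名': [f'用户_{i}' for i in range(1, 101)],
    '年龄': np.random.randint(18, 65, 100),
    '收入': np.random.normal(8000, 2000, 100).round(0),
    '城市': np.random.choice(['北京', '上海', '广州', '深圳'], 100),
    '注册日期': pd.date_range('2024-01-01', periods=100, freq='D'),
})

print('=' * 60)
print('Data import/export example')
print('=' * 60)

# Make sure the output directory exists before writing to it
os.makedirs('output', exist_ok=True)

# 1. Save as CSV
data.to_csv('output/users.csv', index=False, encoding='utf-8-sig')
print('✓ Saved as CSV: output/users.csv')

# 2. Save as Excel (requires openpyxl)
try:
    data.to_excel('output/users.xlsx', index=False, sheet_name='用户数据')
    print('✓ Saved as Excel: output/users.xlsx')
except ImportError:
    print('✗ openpyxl is required: pip install openpyxl')

# 3. Save as JSON
data.to_json('output/users.json', orient='records', force_ascii=False, indent=2)
print('✓ Saved as JSON: output/users.json')

# 4. Read back and verify
print('\nRead-back check:')
df_csv = pd.read_csv('output/users.csv')
print(f'CSV read: {df_csv.shape[0]} rows')

# 5. Read part of the data
df_sample = pd.read_csv('output/users.csv', nrows=10)
print(f'First 10 rows only: {df_sample.shape}')

# 6. Read selected columns
df_selected = pd.read_csv('output/users.csv', usecols=['姓名', '年龄', '收入'])
print(f'Selected columns only: {df_selected.columns.tolist()}')
```

10. Summary

| Format | Read function | Write function | Typical use |
| --- | --- | --- | --- |
| CSV | read_csv() | to_csv() | General data exchange |
| Excel | read_excel() | to_excel() | Business reports |
| JSON | read_json() | to_json() | Web APIs |
| Parquet | read_parquet() | to_parquet() | Large-scale data storage |
| SQL | read_sql() | to_sql() | Database interaction |
| HTML | read_html() | to_html() | Web page display |

Common problems

| Problem | Solution |
| --- | --- |
| Garbled Chinese text | Use encoding='utf-8-sig' |
| File too large | Read in chunks with chunksize |
| Out of memory | Read only the columns you need with usecols |
| Excel errors | Install openpyxl |
| Path errors | Use an absolute path or check the working directory |

11. Module 1 Summary

Congratulations on completing Module 1: Pandas Basics! You have now covered:

✅ Introduction to Pandas and installation
✅ Series: the one-dimensional data structure
✅ DataFrame: the two-dimensional data structure
✅ Viewing and exploring data
✅ Data import and export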
