Analyzing Tens of Millions of Records in MongoDB
Part 1: Importing

Listing 1: read the CSV files and store them in the database

# -*- coding: UTF-8 -*-
'''
Created on 2013-10-20

@author: tyk
'''
from pymongo.connection import Connection
from time import time
import codecs
import csv
import os

rootdir = "2000W/"  # the directory to walk

def process_data():
    conn = Connection('localhost', 27017)  # get a connection
    ##conn.drop_database('guestHouse')
    db = conn.TYK
    guest = db.guestHouse

    guest_info = []
    for parent, dirnames, filenames in os.walk(rootdir):  # yields 1. the parent directory 2. all sub-directory names (without path) 3. all file names
        for filename in filenames:
            ErrorLine = []
            key_length = 0
            fullname = os.path.join(parent, filename)
            try:
                #with codecs.open(fullname, encoding='utf_8') as file:
                with codecs.open(fullname, encoding='utf_8_sig') as file:  # skip the BOM at the start of the UTF-8 file
                    keys = file.readline().split(',')  # read off the header line first
                    key_length = len(keys)
                    spamreader = csv.reader(file)  # read as CSV; each row comes back as a list, not a str
                    for line in spamreader:
                        if key_length != len(line):  # some rows are incomplete; record them
                            ErrorLine.append(line)
                        else:
                            each_info = {}
                            for i in range(1, len(keys)):  # skip the first field, Name; names are not stored in the database
                                each_info[keys[i]] = line[i]

                            guest_info.append(each_info)
                            if len(guest_info) == 10000:  # insert in batches of 10000
                                guest.insert(guest_info)
                                guest_info = []

            except Exception, e:
                print filename + "\t" + str(e)

            # write out all of this file's bad rows in one pass
            with open('ERR/' + os.path.splitext(filename)[0] + '-ERR.csv', 'w') as log_file:
                spamwriter = csv.writer(log_file)
                for line in ErrorLine:
                    spamwriter.writerow(line)
    # the last batch
    guest.insert(guest_info)

if __name__ == '__main__':
    start = time()
    process_data()
    stop = time()
    print(str(stop - start) + " seconds")

I fell asleep and the machine was shut down partway through, so how long the full import took I never found out ⊙﹏⊙b

Summary:

1. The files are UTF-8 encoded, so they can't simply be opened and read with a plain open().

2. The files are stored as CSV, so read them with the csv module; each row then comes back as a list rather than a str. Note that you can't just split each line on "," yourself: data shaped like "a,b,c", d cannot be parsed correctly that way.

3. For UTF-8 files that carry a BOM, read them with the 'utf_8_sig' encoding, which skips the BOM at the start of the file. If the BOM is not stripped, it gets stored in the database along with the data, producing keys that look like " XXX" (the illusion of a leading space). If such data has already made it into the database, the only fix is to rename the key:

db.guestHouse.update({}, {"$rename" : {" Name" : "Name"}}, false, true)

There is also another method found online (my attempt at it failed; the reason is presumably that the string has to be converted to bytes before the comparison, and I don't yet know how to do that conversion... see the sketch at the end of this post):

#with codecs.open(fullname, encoding='utf-8') as file:
with codecs.open(fullname, encoding='utf_8_sig') as file:
    keys = file.readline().split(',')
    if keys[0][:3] == codecs.BOM_UTF8:  # keys[0] would need converting to bytes before this comparison
        keys[0] = keys[0][3:]

Extension: today I discovered that MongoDB ships with its own import tool, mongoimport, which can load CSV files directly... so here is a quick test.

1. Import directly, without filtering out bad rows. The test uses the patent citation data from Hadoop: The Definitive Guide.

Test data:

"PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE","ASSIGNEE","ASSCODE","CLAIMS","NCLASS","CAT","SUBCAT","CMADE","CRECEIVE","RATIOCIT","GENERAL","ORIGINAL","FWDAPLAG","BCKGTLAG","SELFCTUB","SELFCTLB","SECDUPBD","SECDLWBD"
3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,
3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,
3070803,1963,1096,,"US",
"IL",,1,,2,6,63,,9,,0.3704,,,,,,,
3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,,
3070805,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,,
3070806,1963,1096,,"US","PA",,1,,2,6,63,,0,,,,,,,,,
3070807,1963,1096,,"US","OH",,1,,623,3,39,,3,,0.4444,,,,,,,
3070808,1963,1096,,"US","IA",,1,,623,3,39,,4,,0.375,,,,,,,
3070809,1963,1096,,,,1,,4,6,65,,0,,,,,,,,,

mongoimport -d TYK -c guest --type csv --file d:\text.csv --headerline

Eleven lines in all: the first is the header, followed by nine records. Record 3 is broken across two lines, and record 9 has its two middle values "US","AZ" removed. Under CSV rules the file should therefore parse as ten records.

Result:

> db.guest.find({}, {"PATENT" : 1, "_id" : 1})
{ "_id" : ObjectId("52692c2a0b082a1bbb727d86"), "PATENT" : 3070801 }
{ "_id" : ObjectId("52692c2a0b082a1bbb727d87"), "PATENT" : 3070802 }
{ "_id" : ObjectId("52692c2a0b082a1bbb727d88"), "PATENT" : 3070803 }
{ "_id" : ObjectId("52692c2a0b082a1bbb727d89"), "PATENT" : "IL" }
{ "_id" : ObjectId(
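A note on the failed BOM check above: codecs.open returns decoded text, so keys[0][:3] is three characters, while codecs.BOM_UTF8 is three raw bytes, and the two never compare equal. Below is a minimal sketch of the missing conversion under these assumptions: the file is opened with plain 'utf-8' (with 'utf_8_sig' the BOM would already be gone), and sample_path is only a placeholder.

# -*- coding: utf-8 -*-
# Sketch only: strip a UTF-8 BOM by hand when the file is opened with plain 'utf-8'.
# sample_path is a placeholder; substitute a real file from 2000W/.
import codecs

sample_path = '2000W/sample.csv'

with codecs.open(sample_path, encoding='utf-8') as f:  # not 'utf_8_sig', so the BOM survives decoding
    keys = f.readline().split(',')
    # After decoding, the BOM is the single character u'\ufeff', not the three bytes in codecs.BOM_UTF8.
    if keys[0].startswith(u'\ufeff'):
        keys[0] = keys[0][1:]
    # Equivalent check done on bytes, i.e. the conversion the snippet above was missing:
    # if keys[0].encode('utf-8')[:3] == codecs.BOM_UTF8:
    #     keys[0] = keys[0].encode('utf-8')[3:].decode('utf-8')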
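One more note on Listing 1: it was written against the old pymongo API (pymongo.connection.Connection and collection.insert), which has since been removed. A rough sketch of the same 10000-document batching on a current pymongo (MongoClient plus insert_many) is below; the documents passed in under __main__ are made-up stand-ins, and in practice they would come from the os.walk/csv loop of Listing 1.

# Sketch: the batching of Listing 1 rewritten for a modern pymongo,
# with MongoClient and insert_many in place of Connection and insert.
from pymongo import MongoClient

BATCH_SIZE = 10000

def import_docs(docs, mongo_uri='mongodb://localhost:27017'):
    """Insert an iterable of dicts into TYK.guestHouse in fixed-size batches."""
    client = MongoClient(mongo_uri)
    guest = client.TYK.guestHouse
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == BATCH_SIZE:
            guest.insert_many(batch, ordered=False)  # unordered: one bad document does not abort the whole batch
            batch = []
    if batch:                                        # flush the final partial batch
        guest.insert_many(batch, ordered=False)

if __name__ == '__main__':
    # Stand-in documents; each dict would really be one valid CSV row from Listing 1.
    import_docs([{'Gender': 'F', 'Address': 'example'}, {'Gender': 'M', 'Address': 'example'}])

Passing ordered=False keeps the spirit of the original approach: a single malformed document should not stop the rest of the batch from loading.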