Remove Duplicate Documents: MongoDB

We learnt how to create a unique key/index using the {unique: true} option with the ensureIndex() method. Now let's see how to create a unique key when duplicate entries/documents are already present in the collection.

Insert documents

> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "foo.test" }
 
 
> db.test.insert({name: "Satish", age: 27});
WriteResult({ "nInserted" : 1 })
 
> db.test.insert({name: "Kiran", age: 28});
WriteResult({ "nInserted" : 1 })
 
> db.test.insert({name: "Satish", age: 27});
WriteResult({ "nInserted" : 1 })

Here we have 3 documents. The first and the last documents have the same values for the “name” and “age” fields.

dropDups() To Remove Duplicate Documents: MongoDB



YouTube Link: https://www.youtube.com/watch?v=aQXdtDWKBiU [Watch the Video In Full Screen.]



Creating unique key on field “name”

> db.test.ensureIndex({name: 1}, {unique: true});
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "ok" : 0,
        "errmsg" : "E11000 duplicate key error index: foo.test.$name_1  dup key: { : \"Satish\" }",
        "code" : 11000
}

This raises an error, because the collection “test” already contains documents with duplicate values for “name”.

Create Unique Key by dropping random duplicate entries

> db.test.ensureIndex({name: 1}, {unique: true, dropDups: true});
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
 
> db.test.find();
{ "_id" : ObjectId("53d8f1268019dce2ce61eb86"), "name" : "Satish", "age" : 27 }
{ "_id" : ObjectId("53d8f12f8019dce2ce61eb87"), "name" : "Kiran", "age" : 28 }

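The dropDups option retains only one document for each duplicated value and permanently deletes all the other duplicate entries/documents. You do not get to choose which document survives.

Note: Since you cannot control which documents are deleted and they cannot be restored, be very careful when using the dropDups option. It was also removed in MongoDB 3.0, so the command above works only on older releases such as the 2.6 shell used in this series.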
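Before running an index build that deletes data, it can help to list which values are actually duplicated. The following is a minimal plain-JavaScript sketch, not MongoDB API: the helper name findDuplicateValues is hypothetical, and an in-memory array stands in for the “test” collection.

```javascript
// Hypothetical helper: an in-memory array stands in for the "test" collection.
// Returns every value of `field` that occurs more than once, with its count.
function findDuplicateValues(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    const value = doc[field];
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  const duplicates = [];
  for (const [value, count] of counts) {
    if (count > 1) duplicates.push({ value, count });
  }
  return duplicates;
}

// The three documents inserted at the start of this post.
const testDocs = [
  { name: "Satish", age: 27 },
  { name: "Kiran", age: 28 },
  { name: "Satish", age: 27 },
];

console.log(findDuplicateValues(testDocs, "name"));
```

With a list like this in hand, you can decide which duplicates to keep instead of leaving the choice to the index build.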

Creating Unique Key/index: MongoDB

We have learnt how to create a key/index so far. Today let's learn how to create a unique key/index in MongoDB.

Related Read:
ObjectId ( _id ) as Primary Key: MongoDB
index creation: MongoDB

foo: database name
name: collection name

Primary Key in MongoDB: _id

> db.name.insert({_id: 1, a: 1});
WriteResult({ "nInserted" : 1 })
 
> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "foo.name" }
 
> db.name.insert({_id: 1, a: 2});
WriteResult({
        "nInserted" : 0,
        "writeError" : {
                "code" : 11000,
                "errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: foo.name.$_id_  dup key: { : 1.0 }"
        }
})

Since “_id” is treated as the primary key in MongoDB, we can’t insert duplicate values into it. In the above case, we try to insert 1 as the value of “_id” twice. The second insert fails with an error reporting the value as a duplicate key.

Related Read:
DBMS Basics: Getting Started Guide
Primary Foreign Unique Keys, AUTO_INCREMENT: MySQL
Primary Key & Foreign Key Implementation: MySQL


YouTube Link: https://www.youtube.com/watch?v=QEy1IctH99w [Watch the Video In Full Screen.]



Creating Key/Index

> db.name.insert({a: 1});
WriteResult({ "nInserted" : 1 })
 
> db.name.find()
{ "_id" : ObjectId("53d8cadbbbfe6d81d0bcc364"), "a" : 1 }
{ "_id" : 1, "a" : 1 }
 
> db.name.ensureIndex({a: 1});
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "foo.name" }
{ "v" : 1, "key" : { "a" : 1 }, "name" : "a_1", "ns" : "foo.name" }

Here we create an index on field “a”.

Inserting duplicate values into key field

> db.name.insert({a: 1});
WriteResult({ "nInserted" : 1 })
 
> db.name.find()
{ "_id" : ObjectId("53d8cadbbbfe6d81d0bcc364"), "a" : 1 }
{ "_id" : 1, "a" : 1 }
{ "_id" : ObjectId("53d8cb4dbbfe6d81d0bcc365"), "a" : 1 }

The insert operation simply inserts the duplicate value into field “a” even though it has a key/index, because an ordinary index does not enforce uniqueness.

Removing documents and Key/Index on field “a”

> db.name.find()
{ "_id" : ObjectId("53d8cadbbbfe6d81d0bcc364"), "a" : 1 }
{ "_id" : 1, "a" : 1 }
{ "_id" : ObjectId("53d8cb4dbbfe6d81d0bcc365"), "a" : 1 }
 
> db.name.remove({a: 1});
WriteResult({ "nRemoved" : 3 })
 
> db.name.find()
 
> db.name.dropIndex({a: 1});
{ "nIndexesWas" : 2, "ok" : 1 }
 
> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "foo.name" }

Before implementing a unique key on field “a”, we first need to remove the duplicate entries present in our collection, or else index creation will throw an error. Here we also drop the existing index/key on “a”, so that we can create a unique key/index on “a”.

Creating Unique key/index on field “a”

> db.name.ensureIndex({a: 1}, {unique: true});
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}

To create a unique key/index, we use the ensureIndex() method. The first parameter is a document naming the field to be made the unique key along with its value, 1 or -1: 1 signifies ascending order, -1 signifies descending order. The second parameter, {unique: true}, specifies that the key/index must be unique, like that of “_id”.

Duplicate key error on our Unique Key!

> db.name.find()
{ "_id" : ObjectId("53d8cb85bbfe6d81d0bcc366"), "a" : 1 }
 
> db.name.insert({a: 1});
WriteResult({
        "nInserted" : 0,
        "writeError" : {
                "code" : 11000,
                "errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: foo.name.$a_1  dup key: { : 1.0 }"
        }
})

Now if we try to insert a duplicate value into field “a”, it throws a duplicate key error.
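The behaviour of a unique key can be pictured with a toy model in plain JavaScript. This is only a sketch under stated assumptions: the class name UniqueIndexedCollection is hypothetical, a Set stands in for the index, and none of this reflects how MongoDB stores indexes internally.

```javascript
// Toy model of a unique index: a Set of the key values already stored.
// This is NOT how MongoDB implements indexes; it only mirrors the behaviour.
class UniqueIndexedCollection {
  constructor(field) {
    this.field = field;   // the field carrying the unique key, e.g. "a"
    this.seen = new Set();
    this.docs = [];
  }

  insert(doc) {
    const key = doc[this.field];
    if (this.seen.has(key)) {
      // Mirrors the E11000 duplicate key error shown above.
      throw new Error("E11000 duplicate key error: dup key " + JSON.stringify(key));
    }
    this.seen.add(key);
    this.docs.push(doc);
  }
}

const coll = new UniqueIndexedCollection("a");
coll.insert({ a: 1 });

let rejected = false;
try {
  coll.insert({ a: 1 }); // same value for "a" again
} catch (err) {
  rejected = true;
  console.log(err.message);
}
```

The second insert is refused before the document is stored, so the collection still holds a single document.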

Multi-key Indexes and Arrays: MongoDB

We have learnt the basics of multi-key indexes in MongoDB. Let's look at an example to demonstrate multi-key indexing on arrays.

foo: database name
name: collection name

Insert a document

MongoDB shell version: 2.6.1
connecting to: test
 
> use foo
switched to db foo
 
> db.name.insert({a: 1, b: 2, c: 3});
WriteResult({ "nInserted" : 1 })

Here we insert {a: 1, b: 2, c: 3} into “name” collection.


YouTube Link: https://www.youtube.com/watch?v=VGHSmjVmnzs [Watch the Video In Full Screen.]



Basic Cursor

> db.name.find({a: 1, b: 2})
{ "_id" : ObjectId("53d8982b79142c385cddc607"), "a" : 1, "b" : 2, "c" : 3 }
 
> db.name.find({a: 1, b: 2}).explain()
{
        "cursor" : "BasicCursor",
        "isMultiKey" : false,
        "n" : 1,
        "nscannedObjects" : 1,
        "nscanned" : 1,
        "nscannedObjectsAllPlans" : 1,
        "nscannedAllPlans" : 1,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "server" : "Satish-PC:27017",
        "filterSet" : false
}

We find() the document using fields “a” and “b”. explain() reports a BasicCursor, because there is no index on these fields.

Related Read: index creation: MongoDB

Let's create an index on a and b

> db.name.ensureIndex({a: 1, b: 1});
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}

Previously there was only 1 index, on “_id”. Now there are 2 indexes: “_id” and “{a: 1, b: 1}”.

Btree Cursor with multi-key as false

> db.name.find({a: 1, b: 2}).explain()
{
        "cursor" : "BtreeCursor a_1_b_1",
        "isMultiKey" : false,
        "n" : 1,
        "nscannedObjects" : 1,
        "nscanned" : 1,
        "nscannedObjectsAllPlans" : 1,
        "nscannedAllPlans" : 1,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
                "a" : [
                        [
                                1,
                                1
                        ]
                ],
                "b" : [
                        [
                                2,
                                2
                        ]
                ]
        },
        "server" : "Satish-PC:27017",
        "filterSet" : false
}

After creating the index on “a” and “b”, chain the explain() method onto the same query. It now reports a BtreeCursor, which means the query uses the index.

Let's insert another document

> db.name.insert({a: [0, 1, 2], b: 2, c: 3});
WriteResult({ "nInserted" : 1 })

Let's insert an array as the value of field “a” and scalar values for “b” and “c”.

Btree Cursor with Multi-key true

> db.name.find({a: 1, b: 2})
{ "_id" : ObjectId("53d8982b79142c385cddc607"), 
  "a" : 1, "b" : 2, "c" : 3 }
{ "_id" : ObjectId("53d8986f79142c385cddc608"), 
  "a" : [ 0, 1, 2 ], "b" : 2, "c": 3 }
 
> db.name.find({a: 1, b: 2}).explain()
{
        "cursor" : "BtreeCursor a_1_b_1",
        "isMultiKey" : true,
        "n" : 2,
        "nscannedObjects" : 2,
        "nscanned" : 2,
        "nscannedObjectsAllPlans" : 2,
        "nscannedAllPlans" : 2,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 0,
        "indexBounds" : {
                "a" : [
                        [
                                1,
                                1
                        ]
                ],
                "b" : [
                        [
                                2,
                                2
                        ]
                ]
        },
        "server" : "Satish-PC:27017",
        "filterSet" : false
}

Now append the explain() method to our query: it reports a BtreeCursor with isMultiKey set to true. The MongoDB engine needs to pair every element of the array in field “a” with the scalar value of field “b”, creating one index entry per array element. Hence it uses multi-key indexing.
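The pairing of array elements with the scalar value can be sketched in plain JavaScript. This is an illustration only: the function indexEntries is hypothetical and simply lists the (a, b) pairs that a compound index on {a: 1, b: 1} would need for one document.

```javascript
// Sketch of how a compound index on {a: 1, b: 1} could expand one document
// into index entries when "a" holds an array. Illustrative only.
function indexEntries(doc) {
  const aValues = Array.isArray(doc.a) ? doc.a : [doc.a];
  return aValues.map((a) => [a, doc.b]);
}

const scalarDoc = { a: 1, b: 2, c: 3 };
const arrayDoc = { a: [0, 1, 2], b: 2, c: 3 };

console.log(indexEntries(scalarDoc)); // one entry
console.log(indexEntries(arrayDoc));  // one entry per element of "a"
```

The pair [1, 2] appears for both documents, which is why the query {a: 1, b: 2} matches both of them.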

Multi-Key Condition in MongoDB

> db.name.insert({a: [0, 1, 2], b: [3, 4], c: 3});
WriteResult({
        "nInserted" : 0,
        "writeError" : {
                "code" : 10088,
                "errmsg" : "insertDocument :: caused by :: 10088 cannot index parallel arrays [b] [a]"
        }
})

It’s difficult to index every combination of the array elements present in both the “a” and “b” fields. If both indexed fields hold arrays, the number of combinations multiplies and it gets complicated. Thus, MongoDB doesn’t allow both keys of a compound index to be arrays in the same document. At least one of them must be a scalar value.
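To see how quickly the combinations grow, here is a small plain-JavaScript sketch. Both function names are hypothetical, and the check only imitates the rule described above; it is not MongoDB's own validation code.

```javascript
// Illustrative check: a compound index may have at most one array-valued
// field per document. The function name is hypothetical.
function checkParallelArrays(doc, indexedFields) {
  const arrayFields = indexedFields.filter((f) => Array.isArray(doc[f]));
  if (arrayFields.length > 1) {
    throw new Error("cannot index parallel arrays [" + arrayFields.join("] [") + "]");
  }
  return arrayFields;
}

// Number of (a, b) combinations that would need indexing if it were allowed.
function combinationCount(doc, indexedFields) {
  return indexedFields.reduce(
    (total, f) => total * (Array.isArray(doc[f]) ? doc[f].length : 1),
    1
  );
}

const doc = { a: [0, 1, 2], b: [3, 4], c: 3 };
console.log(combinationCount(doc, ["a", "b"])); // 3 elements times 2 elements
```

Calling checkParallelArrays(doc, ["a", "b"]) on this document throws, while a document with an array in only one of the two fields passes.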

Get Index and Delete Index: MongoDB

We learnt the uses of having an index/key on our collection and how to create the index. Now, in this video tutorial, let's learn how to get the indexes of an individual collection and how to drop / remove / delete an index we’ve created.

Related Read: index creation: MongoDB

temp: Database name
no, another: collection names
We have 10 million documents inside the “no” collection.

Sample document

> use temp
switched to db temp
> show collections
no
system.indexes
 
> db.no.find({"student_id": {$lt: 3}}).pretty()
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cda"),
        "student_id" : 0,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cdb"),
        "student_id" : 1,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cdc"),
        "student_id" : 2,
        "name" : "Satish"
}

Fetch Index and Drop / remove Index: MongoDB



YouTube Link: https://www.youtube.com/watch?v=qYHIRWHS_5I [Watch the Video In Full Screen.]



We shall take a look at “system.indexes” collection

> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, 
           "name" : "_id_", "ns" : "temp.no" }
{ "v" : 1, "key" : { "student_id" : 1 }, 
           "name" : "student_id_1", "ns" : "temp.no" }

Create “another” collection

> db.another.insert({"name": "Satish", "age": 27});
WriteResult({ "nInserted" : 1 })
 
> show collections
another
no
system.indexes
 
> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, 
           "name" : "_id_", "ns" : "temp.no" }
{ "v" : 1, "key" : { "student_id" : 1 }, 
           "name" : "student_id_1", "ns" : "temp.no" }
{ "v" : 1, "key" : { "_id" : 1 }, 
           "name" : "_id_", "ns" : "temp.another" }

After creating the “another” collection, the MongoDB engine generates a default key on its “_id” field. “system.indexes” shows all the keys present in the database, for all the collections it has. This can get messy with a large number of collections, which is common in even slightly bigger projects.

To get index on individual collection

> db.another.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "temp.another"
        }
]
 
> db.no.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "temp.no"
        },
        {
                "v" : 1,
                "key" : {
                        "student_id" : 1
                },
                "name" : "student_id_1",
                "ns" : "temp.no"
        }
]

We can use the getIndexes() method to fetch the indexes / keys present on an individual collection.
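Conceptually, the per-collection view is the database-wide list narrowed down by the “ns” (namespace) field. A plain-JavaScript sketch of that idea follows; the helper indexesFor is hypothetical and works on an ordinary array copied from the output above, not on a live database.

```javascript
// Index documents copied from the system.indexes output above.
const systemIndexes = [
  { v: 1, key: { _id: 1 }, name: "_id_", ns: "temp.no" },
  { v: 1, key: { student_id: 1 }, name: "student_id_1", ns: "temp.no" },
  { v: 1, key: { _id: 1 }, name: "_id_", ns: "temp.another" },
];

// Hypothetical helper: keep only the indexes of one namespace ("db.collection").
function indexesFor(indexDocs, ns) {
  return indexDocs.filter((idx) => idx.ns === ns);
}

console.log(indexesFor(systemIndexes, "temp.another").map((idx) => idx.name));
console.log(indexesFor(systemIndexes, "temp.no").map((idx) => idx.name));
```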

Removing / deleting / dropping – index / key

> db.no.dropIndex({"student_id": 1});
{ "nIndexesWas" : 2, "ok" : 1 }
 
> db.no.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "temp.no"
        }
]

Use the dropIndex() method and pass in the same index object that was used while creating the index. This drops the index.
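dropIndex() also accepts the index name as a string instead of the key object. The names seen in this series (a_1, a_1_b_1, student_id_1) follow a visible pattern: each field name followed by its direction, joined by underscores. The helper below, defaultIndexName, is a hypothetical plain-JavaScript sketch that reproduces those names; it does not cover the “_id” index, whose name “_id_” is a special case.

```javascript
// Hypothetical helper reproducing the index names seen in this post,
// e.g. {student_id: 1} -> "student_id_1".
// The built-in _id index is named "_id_" and is not covered here.
function defaultIndexName(keySpec) {
  return Object.entries(keySpec)
    .map(([field, direction]) => field + "_" + direction)
    .join("_");
}

console.log(defaultIndexName({ student_id: 1 }));
console.log(defaultIndexName({ a: 1, b: 1 }));
```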

index creation: MongoDB

Let's learn to create an index and to optimize the database in MongoDB.

Creating “Database”: “temp”, “Collection”: “no”, and inserting 10 Million documents inside it

use temp
switched to db temp
 
// Insert documents with student_id 0 through 10000000
for (var i = 0; i <= 10000000; i++) {
    db.no.insert({"student_id": i, "name": "Satish"});
}

Since the Mongo shell is built on JavaScript, you can pass any valid JavaScript code to it. So we write a for loop and insert 10 million documents into the “no” collection.

MongoDB shell version: 2.6.1
connecting to: test
> show dbs
admin    (empty)
daily    0.078GB
local    0.078GB
nesting  0.078GB
school   0.078GB
temp     3.952GB
test     0.078GB
> use temp
switched to db temp
> show collections
no
system.indexes
> db.no.find().pretty()
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cda"),
        "student_id" : 0,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cdb"),
        "student_id" : 1,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cdc"),
        "student_id" : 2,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cdd"),
        "student_id" : 3,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cde"),
        "student_id" : 4,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cdf"),
        "student_id" : 5,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce0"),
        "student_id" : 6,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce1"),
        "student_id" : 7,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce2"),
        "student_id" : 8,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce3"),
        "student_id" : 9,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce4"),
        "student_id" : 10,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce5"),
        "student_id" : 11,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce6"),
        "student_id" : 12,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce7"),
        "student_id" : 13,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce8"),
        "student_id" : 14,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ce9"),
        "student_id" : 15,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cea"),
        "student_id" : 16,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ceb"),
        "student_id" : 17,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cec"),
        "student_id" : 18,
        "name" : "Satish"
}
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833ced"),
        "student_id" : 19,
        "name" : "Satish"
}
Type "it" for more
 
> it

The “no” collection has 10 million records, but the shell won’t fetch all of them at once, as that would take a lot of time and resources on your computer! So it only fetches 20 records at a time. You can iterate through the next 20 documents by using the command “it”.
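The idea of handing out results 20 at a time can be sketched with a JavaScript generator. This is an assumption-laden illustration: the function inBatches is hypothetical, 50 in-memory documents stand in for the 10 million, and each call to next() plays the role of typing “it”.

```javascript
// Sketch of batch-wise iteration: a generator hands out documents in groups,
// and each call to next() plays the role of typing "it" in the shell.
function* inBatches(docs, batchSize = 20) {
  for (let start = 0; start < docs.length; start += batchSize) {
    yield docs.slice(start, start + batchSize);
  }
}

// 50 stand-in documents instead of 10 million.
const docs = Array.from({ length: 50 }, (_, i) => ({ student_id: i, name: "Satish" }));

const cursor = inBatches(docs);
const first = cursor.next().value;   // student_id 0..19
const second = cursor.next().value;  // student_id 20..39
const third = cursor.next().value;   // student_id 40..49
console.log(first.length, second.length, third.length);
```

Only the batch that is asked for gets materialised, which is why the first screen of results appears quickly even on a very large collection.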


YouTube Link: https://www.youtube.com/watch?v=zK_mRyiNs-I [Watch the Video In Full Screen.]



> db.no.find({"student_id": 5}).pretty()
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cdf"),
        "student_id" : 5,
        "name" : "Satish"
}
 
 
> db.no.findOne({"student_id": 5});
{
        "_id" : ObjectId("53c9020abcdd1ea7fb833cdf"),
        "student_id" : 5,
        "name" : "Satish"
}
> db.no.find({"student_id": 5000000}).pretty()
{
        "_id" : ObjectId("53c90ca6bcdd1ea7fbcf881a"),
        "student_id" : 5000000,
        "name" : "Satish"
}

Without an index, the find() method scans through all the documents in the collection to find every match for the condition. So in the above case, find() scans through 10 million documents and hence returns the result slowly. The findOne() method, on the other hand, stops scanning the collection as soon as it finds the first matching document, so findOne() returns its result faster than find().
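The difference between the two scans can be counted in a plain-JavaScript sketch. The names scanAll and scanFirst are hypothetical, the array of 100,000 documents is a stand-in for the real collection, and the code only models an unindexed scan; it is not how the server is implemented.

```javascript
// Stand-in collection: 100,000 documents instead of 10 million.
const docs = Array.from({ length: 100000 }, (_, i) => ({ student_id: i }));

// Like an unindexed find(): every document is examined.
function scanAll(collection, studentId) {
  let scanned = 0;
  const matches = [];
  for (const doc of collection) {
    scanned++;
    if (doc.student_id === studentId) matches.push(doc);
  }
  return { matches, scanned };
}

// Like an unindexed findOne(): stop at the first match.
function scanFirst(collection, studentId) {
  let scanned = 0;
  for (const doc of collection) {
    scanned++;
    if (doc.student_id === studentId) return { match: doc, scanned };
  }
  return { match: null, scanned };
}

console.log(scanAll(docs, 5).scanned);   // examines the whole array
console.log(scanFirst(docs, 5).scanned); // stops after the sixth document
```

Note that the early exit only helps when the match sits near the start; for a document near the end, both scans do almost the same amount of work.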

Related Read:
Multi-key Index: MongoDB
index / key: MongoDB

Creating index

> show collections
no
system.indexes
 
> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "temp.no" }
 
> db.no.ensureIndex({"student_id": 1});
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1
}
 
> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, 
                     "name" : "_id_", "ns" : "temp.no" }
{ "v" : 1, "key" : { "student_id" : 1 }, 
                     "name" : "student_id_1", "ns" : "temp.no" }

We create an index on “student_id”. It takes a little time to create the index, as we have 10 million documents inside the “no” collection.

After creating the index on “student_id”, run the same command and you’ll get the results almost instantly; any remaining delay is too small to notice.
Why does it return results faster after creating an index on “student_id”? Watch this short video lesson to find out: index / key: MongoDB

> db.no.find({"student_id": 5000000}).pretty()
{
        "_id" : ObjectId("53c90ca6bcdd1ea7fbcf881a"),
        "student_id" : 5000000,
        "name" : "Satish"
}
> db.no.find({"student_id": 10000000}).pretty()
{
        "_id" : ObjectId("53c914adbcdd1ea7fb1bd35a"),
        "student_id" : 10000000,
        "name" : "Satish"
}
>

So queries/commands can be optimized by creating indexes on frequently accessed fields.
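Why a lookup through an index avoids the full scan can be pictured with a small plain-JavaScript sketch. It rests on loud assumptions: a Map stands in for the index (MongoDB uses a B-tree, as the BtreeCursor output suggests), the array of 100,000 documents stands in for the collection, and findByStudentId is a hypothetical helper.

```javascript
// Stand-in collection: 100,000 documents instead of 10 million.
const docs = Array.from({ length: 100000 }, (_, i) => ({ student_id: i, name: "Satish" }));

// Build the "index" once: student_id -> position of the document.
// A Map is an assumption made for brevity; it is not a B-tree.
const studentIdIndex = new Map();
docs.forEach((doc, position) => studentIdIndex.set(doc.student_id, position));

// Lookup through the index touches one document instead of scanning them all.
function findByStudentId(studentId) {
  const position = studentIdIndex.get(studentId);
  return position === undefined ? null : docs[position];
}

console.log(findByStudentId(50000));
```

Building the index costs one pass over the data up front, which matches the short wait seen when the index on “student_id” was created.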