Performance ideas (in-memory C# HashSet and Contains too slow)
I have the following code:
    private void LoadIntoMemory()
    {
        // Init the large HashSet
        HashSet<Document> hsAllDocuments = new HashSet<Document>();

        // Get the first rows from the database
        List<Document> docsList = Document.GetAllAboveDocId(0, 500000);

        // Load the objects into the dictionary
        foreach (Document d in docsList)
        {
            hsAllDocuments.Add(d);
        }

        Application["dicAllDocuments"] = hsAllDocuments;
    }

    private HashSet<Document> DocumentHits(HashSet<Document> hsRawHit, HashSet<Document> hsAllDocuments, string query, string[] queryArray)
    {
        int counter = 0;
        const int maxCount = 1000;

        foreach (Document d in hsAllDocuments)
        {
            // Headline
            if (d.Headline.Contains(query))
            {
                if (counter >= maxCount) break;
                hsRawHit.Add(d);
                counter++;
            }

            // Description
            if (d.Description.Contains(query))
            {
                if (counter >= maxCount) break;
                hsRawHit.Add(d);
                counter++;
            }

            // Split query word by word
            //string[] queryArray = query.Split(' ');
            if (queryArray.Count() > 1)
            {
                foreach (string q in queryArray)
                {
                    // Headline
                    if (d.Headline.Contains(q))
                    {
                        if (counter >= maxCount) break;
                        hsRawHit.Add(d);
                        counter++;
                    }

                    // Description
                    if (d.Description.Contains(q))
                    {
                        if (counter >= maxCount) break;
                        hsRawHit.Add(d);
                        counter++;
                    }
                }
            }
        }

        return hsRawHit;
    }
The first method loads the data into a HashSet (via Application, for later use). It runs fine, and it's totally OK that it's slow for what I'm doing.
This will be running on the 4.0 framework in C# (so I can't upgrade to the newer async stuff).
The DocumentHits method runs slowly on the current setup, even considering that it's all in memory. Can I speed this method up?
Examples would be awesome - thanks.
I see that you are using a HashSet, but you are not using any of its advantages, so you should just use a List instead.
What's taking time is looping through all the documents and looking for strings inside other strings, so you should try to eliminate as much of that as possible.
One possibility is to set up indexes of which documents contain which character pairs. If the query string contains "hello", you would be looking for documents that contain "he", "el", "ll" and "lo".
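For illustration, a minimal sketch of extracting those character pairs from a string could look like this (the complete example further down uses the same idea):

    // Extract all adjacent character pairs from a string.
    // "hello" yields "he", "el", "ll" and "lo".
    private static HashSet<string> GetPairs(string value)
    {
        HashSet<string> pairs = new HashSet<string>();
        for (int i = 1; i < value.Length; i++)
        {
            pairs.Add(value.Substring(i - 1, 2));
        }
        return pairs;
    }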
You could set up a Dictionary<string, List<int>> where the dictionary key is a character combination and the list contains the indexes of the documents in your document list. Setting up the dictionary will take some time, of course, but you can focus on the character combinations that are less common. If a character combination exists in 80% of the documents, it's pretty useless for eliminating documents, but if a character combination exists in only 2% of the documents, it has eliminated 98% of your work.
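As a rough sketch, building such an index over the document list from the question could look like this (assuming the Headline property is what you want to search, and using the GetPairs helper above; the full class below generalizes this with a Func<T, string> and also prunes oversized lists):

    // Map each character pair to the indexes of the documents whose
    // headline contains that pair. Because the documents are processed
    // in order, every list of indexes ends up sorted.
    Dictionary<string, List<int>> pairIndex = new Dictionary<string, List<int>>();
    for (int i = 0; i < docsList.Count; i++)
    {
        foreach (string pair in GetPairs(docsList[i].Headline))
        {
            List<int> list;
            if (!pairIndex.TryGetValue(pair, out list))
            {
                list = new List<int>();
                pairIndex.Add(pair, list);
            }
            list.Add(i);
        }
    }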
If you loop through the documents in the list and add the occurrences to the lists in the dictionary, the lists of indexes will already be sorted, which makes it easy to join the lists later on. When you add indexes to a list, you can throw the list away once it gets too large and you see that it would not be useful for eliminating documents. That way you will only be keeping the shorter lists, and they will not consume that much memory.
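Joining two of those sorted index lists is then a simple merge-style intersection, something along these lines (the full example below contains an equivalent method):

    // Intersect two sorted lists of document indexes.
    private static List<int> JoinLists(List<int> list1, List<int> list2)
    {
        List<int> result = new List<int>();
        int i1 = 0, i2 = 0;
        while (i1 < list1.Count && i2 < list2.Count)
        {
            if (list1[i1] == list2[i2]) { result.Add(list1[i1]); i1++; i2++; }
            else if (list1[i1] < list2[i2]) i1++;
            else i2++;
        }
        return result;
    }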
Edit:
I put together a small example:
    public class IndexElliminator<T>
    {
        private List<T> _items;
        private Dictionary<string, List<int>> _index;
        private Func<T, string> _getContent;

        private static HashSet<string> GetPairs(string value)
        {
            HashSet<string> pairs = new HashSet<string>();
            for (int i = 1; i < value.Length; i++)
            {
                pairs.Add(value.Substring(i - 1, 2));
            }
            return pairs;
        }

        public IndexElliminator(List<T> items, Func<T, string> getContent, int maxIndexSize)
        {
            _items = items;
            _getContent = getContent;
            _index = new Dictionary<string, List<int>>();
            for (int index = 0; index < _items.Count; index++)
            {
                T item = _items[index];
                foreach (string pair in GetPairs(_getContent(item)))
                {
                    List<int> list;
                    if (_index.TryGetValue(pair, out list))
                    {
                        if (list != null)
                        {
                            if (list.Count == maxIndexSize)
                            {
                                _index[pair] = null;
                            }
                            else
                            {
                                list.Add(index);
                            }
                        }
                    }
                    else
                    {
                        list = new List<int>();
                        list.Add(index);
                        _index.Add(pair, list);
                    }
                }
            }
        }

        private static List<int> JoinLists(List<int> list1, List<int> list2)
        {
            List<int> result = new List<int>();
            int i1 = 0, i2 = 0;
            while (i1 < list1.Count && i2 < list2.Count)
            {
                switch (Math.Sign(list1[i1].CompareTo(list2[i2])))
                {
                    case 0: result.Add(list1[i1]); i1++; i2++; break;
                    case -1: i1++; break;
                    case 1: i2++; break;
                }
            }
            return result;
        }

        public List<T> Find(string query)
        {
            HashSet<string> pairs = GetPairs(query);
            List<List<int>> indexes = new List<List<int>>();
            bool found = false;
            foreach (string pair in pairs)
            {
                List<int> list;
                if (_index.TryGetValue(pair, out list))
                {
                    found = true;
                    if (list != null)
                    {
                        indexes.Add(list);
                    }
                }
            }
            List<T> result = new List<T>();
            if (found && indexes.Count == 0)
            {
                indexes.Add(Enumerable.Range(0, _items.Count).ToList());
            }
            if (indexes.Count > 0)
            {
                while (indexes.Count > 1)
                {
                    indexes[indexes.Count - 2] = JoinLists(indexes[indexes.Count - 2], indexes[indexes.Count - 1]);
                    indexes.RemoveAt(indexes.Count - 1);
                }
                foreach (int index in indexes[0])
                {
                    if (_getContent(_items[index]).Contains(query))
                    {
                        result.Add(_items[index]);
                    }
                }
            }
            return result;
        }
    }
Test:
    List<string> items = new List<string> {
        "Hello world",
        "How are you",
        "What is this",
        "Can this be true",
        "Some phrases",
        "Words upon words",
        "What to do",
        "Where to go",
        "When is this",
        "How can this be",
        "Well above the margin",
        "Close to the center"
    };

    IndexElliminator<string> index = new IndexElliminator<string>(items, s => s, items.Count / 2);

    List<string> found = index.Find("this");
    foreach (string s in found) Console.WriteLine(s);
Output:
    What is this
    Can this be true
    When is this
    How can this be