I have to load in millions of rows of data from MySQL database into a C# application. It takes more than 5 minutes for the complete set of data to be loaded. Does it make sense to do multi-threading for my case? If yes, how should I do it? Should I divide the data into say 5 sets and do multi-threading from there?
thanks.

12 answers
The usual reason to use multithreading in this sort of scenario is to keep the GUI responsive in the meantime
Unless you have more than one processor or core, multithreading won't improve the time it takes to load the data into memory. However, if you have a quad-core machine, then you could divide the data into 4 sets.
BTW, is it really necessary for all the data to be in memory at one time? Depending on the row size and what else you're doing, there's a risk you may run out of memory when attempting to store millions of rows.
answered one year ago by:
17279
412
I agree to this (run out of memory), in this case I'd suggest that u should allow computer to do it's work. Just keep ur user updated with the work which is going on in the backend. I mean to say: Show the Progressbar.
for my application, assume that I have to load in that huge chunk (ten millions rows, i.e end=10e6) of data at one time, and I want to use multi-threading to speed up the process, how should I do it? I have tried this, but it seems not a correct way of doing it. Any code samples would be appreciated.
using System.Threading;
private void button1_Click(object sender, EventArgs e)
{
MultiThreadingTask();
}
private void MultiThreadingTask()
{
ThreadStart worker1 = new ThreadStart(WorkerThreadMethod);
Thread t1 = new Thread(worker1);
t1.Start();
LoadXY_values(0, 250000);
t1.Abort();
ThreadStart worker2 = new ThreadStart(WorkerThreadMethod);
Thread t2 = new Thread(worker2);
t2.Start();
LoadXY_values(250000, 500000);
t2.Abort();
}
static void WorkerThreadMethod()
{
}
private void LoadXY_values(Int64 start, Int64 end)
{
MySqlConnection myConn = new MySqlConnection(conn);
myConn.Open();
MySqlCommand MySqlCmd = new MySqlCommand();
MySqlCmd.Connection = myConn;
MySqlCmd = new MySqlCommand("SELECT x_values, y_values from actual_data WHERE x_values > '" + start + "' AND x_values < '" + end + "'" , myConn);
MySqlDataReader reader = MySqlCmd.ExecuteReader();
while (reader.Read())
{
x_values = Convert.ToInt64(reader["x_values"]);
y_values = Convert.ToInt16(reader["y_values"]);
listAll.Add(new XY(x_values, y_values));
}
reader.Close();
myConn.Close();
}
answered one year ago by:
494
Try it like this:
answered one year ago by:
17279
beside this, there is something you should take into consideration in case of loading this huge data in memory which is the location of the bottleneck. i think the bottleneck is not the processor here but i think the bottleneck is the hd. to be sure check the processor in the task mgr if its going high while reading and i doubt that it will go high thats ofc if the db is on the different server. if its on the same pc then you should get concerned with the pc hardware rather than your software.
answered one year ago by:
1556
I am not sure whether I am reading your code correctly.
If those 4 threads are in the FOR loop, isn't each thread has to wait for the previous thread to finish before it can execute? what then are the benefits of multi-threading?
Could you also explain
new Thread( () =>
The first time I come across => operator.
thanks.
answered one year ago by:
494
A thread is created and started for each of the four iterations of the for loop. A new thread doesn't wait for the previous thread to finish (you'd need to call Thread.Join() for that) - it starts straightaway.
An easy way to pass arguments to a new thread is to use a 'lambda expression' (requires C# 3.0 or 4.0). The argument we're passing to the thread's constructor (see below) is a lambda expression which can be thought of as a method which is written 'inline' rather than declared separately in the usual way. It is similar in concept to (but more concise than) an anonymous method in C# 2.0:
The part before the =>, enclosed in brackets unless there's exactly one, are the argument(s) to the method. In this case there are no arguments so the brackets are empty.
The part after the => is what the method does which is to call the LoadXY_values method with the appropriate arguments. As this is a void method, the lambda expression doesn't return a value.
The lambda expression works because it's convertible to a ThreadStart delegate which one would normally pass to the Thread's constructor - in other words it's a method which takes no parameters and has a void return type.
A characteristic of lambda expressions is that they capture 'outer' variables which are local variables or parameters passed by value to the method in which they're defined and which are used by the lambda expression itself. Also the captured variables are evaluated when the lambda expression is actually invoked not when they were captured.
This can sometimes be a nuisance (as here). The for loop control variable 'i' is an outer variable which we'd like to use in the lambda expression but, if we did so, then by the time the lambda expression is invoked the value 'i' might have changed to the next value. To get around this we assign the current value of 'i' to a temporary variable 'j' which the lambda expression captures instead. This solves the problem because a new temporary variable is created for each iteration of the loop.
An alternative to using lamda expressions to pass arguments to a thread is to use a ParameterizedThreadStart delegate. However, this can only take a single parameter of type object. If you want to pass multiple parameters, you therefore have to pack them into an array and then unpack them within the thread method itself which is a bit messy.
answered one year ago by:
17279
hi Vulpes,
thanks for the detailed explanation. I tried both methods individually ( LoadXY_values(0,total_rows) and MultiThreadingTask() ), but they both give me different answers. I do not know why. I think MultiThreadingTask() has some problems, as when it comes to different threads, the listAll variable would be different, compared when we compute it as a whole in LoadXY_values(0,total_rows)
For LoadXY_values(0,80000), it takes 55 seconds to complete. I am hoping that MultiThreadingTask() could minimize the time by at least half.
// Complete code
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using MySql.Data.MySqlClient;
using System.IO;
using Steema.TeeChart.Data;
using System.Threading;
struct XY
{
public double X;
public double Y;
public XY(double x, double y)
{
X = x;
Y = y;
}
public override string ToString()
{
return String.Format("x = {0}, y = {1}", X, Y);
}
}
namespace Timing_Error
{
public partial class Form1 : Form
{
string conn = "server=127.0.0.1;database=xy_timing_error;uid=root;password=xxxxxxxx";
Int32 x_values;
Int16 y_values;
double[] maxIntervals = null;
List<XY> listAll = new List<XY>();
List<XY> listMaxima = new List<XY>();
List<XY> listMinima = new List<XY>();
private XY[] arrayAll;
int total_rows = 80000; // supposed to be 10 millions
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
DateTime startTime = DateTime.Now;
timer1.Start();
LoadXY_values(0,total_rows);
//MultiThreadingTask();
timer1.Stop();
DateTime stopTime = DateTime.Now;
TimeSpan duration = stopTime - startTime;
textBox1.Text = duration.Seconds.ToString();
FindMaxMin();
}
private void MultiThreadingTask()
{
// create an array to hold all 10 million rows so threads always write to different parts of it
int total_threads = 4;
// arrayAll = new XY[total_rows];
// create and start 4 threads to load 2.5 million rows each
for (int i = 0; i < total_threads; i++)
{
Int32 j = i ; // create local variable so each lambda expression captures a different instance
new Thread(() => LoadXY_values(j * total_rows / total_threads, (j + 1) * total_rows / total_threads)).Start();
}
}
private void LoadXY_values(Int32 start, Int32 end)
{
arrayAll = new XY[end];
MySqlConnection myConn = new MySqlConnection(conn);
myConn.Open();
MySqlCommand MySqlCmd = new MySqlCommand();
MySqlCmd.Connection = myConn;
MySqlCmd = new MySqlCommand("SELECT x_values, y_values from actual_data WHERE x_values > '" + start + "' AND x_values < '" + end + "'" , myConn);
MySqlDataReader reader = MySqlCmd.ExecuteReader();
while (reader.Read())
{
x_values = Convert.ToInt32(reader["x_values"]);
y_values = Convert.ToInt16(reader["y_values"]);
arrayAll[(Int32)x_values] = new XY(x_values, y_values);
listAll = arrayAll.ToList();
}
reader.Close();
myConn.Close();
}
private void FindMaxMin()
{
FileStream fstream;
StreamWriter writer;
TextWriter OutText = Console.Out;
// remove points which have same y value as preceding point
for (int i = listAll.Count - 2; i >= 0; i--)
{
XY curr = listAll[i];
XY next = listAll[i + 1];
if (curr.Y == next.Y) listAll.RemoveAt(i + 1);
}
// find maxima (it can't be first or last point)
for (int i = 1; i < listAll.Count - 1; i++)
{
XY prev = listAll[i - 1];
XY curr = listAll[i];
XY next = listAll[i + 1];
if (IsMaximum(prev, curr, next))
{
listMaxima.Add(curr);
}
}
// find minima (it can't be first or last point)
for (int i = 1; i < listAll.Count - 1; i++)
{
XY prev = listAll[i - 1];
XY curr = listAll[i];
XY next = listAll[i + 1];
if (IsMinimum(prev, curr, next))
{
listMinima.Add(curr);
}
}
// convert lists to arrays
XY[] maxima = listMaxima.ToArray();
XY[] minima = listMinima.ToArray();
// calculate absolute intervals between successive x values of maxima
if (maxima.Length > 1)
{
maxIntervals = new double[maxima.Length - 1];
for (int i = 1; i < maxima.Length; i++)
{
maxIntervals[i - 1] = Math.Abs(maxima[i].X - maxima[i - 1].X);
}
}
// calculate absolute intervals between successive x values of mimima
double[] minIntervals = null;
if (minima.Length > 1)
{
minIntervals = new double[minima.Length - 1];
for (int i = 1; i < minima.Length; i++)
{
minIntervals[i - 1] = Math.Abs(minima[i].X - minima[i - 1].X);
}
}
try
{
fstream = new FileStream("./output.txt", FileMode.Create); // overwrite if exist
writer = new StreamWriter(fstream);
}
catch (Exception e)
{
Console.WriteLine("Cannot open output.txt for writing");
Console.WriteLine(e.Message);
return;
}
Console.SetOut(writer);
if (maxima.Length == 0)
{
Console.WriteLine("There are no maxima");
}
else
{
Console.WriteLine("Maxima are :\n");
foreach (XY xy in maxima)
{
Console.WriteLine(" {0}", xy);
}
}
if (maxIntervals != null)
{
Console.WriteLine("\nAbsolute intervals between maxima are :\n");
foreach (double interval in maxIntervals)
{
Console.WriteLine(" {0}", interval);
}
}
if (minima.Length == 0)
{
Console.WriteLine("\nThere are no minima");
}
else
{
Console.WriteLine("\nMinima are :\n");
foreach (XY xy in minima)
{
Console.WriteLine(" {0}", xy);
}
}
if (minIntervals != null)
{
Console.WriteLine("\nAbsolute intervals between minima are :\n");
foreach (double interval in minIntervals)
{
Console.WriteLine(" {0}", interval);
}
}
Console.SetOut(OutText);
writer.Close();
fstream.Close();
}
static bool IsMaximum(XY prev, XY curr, XY next)
{
if (curr.Y > prev.Y && curr.Y > next.Y) return true;
return false;
}
static bool IsMinimum(XY prev, XY curr, XY next)
{
if (curr.Y < prev.Y && curr.Y < next.Y) return true;
return false;
}
}
}
answered one year ago by:
494
You've got into a muddle here with your listAll and your arrayAll. Also when you're using multithreading, you don't want to stop the timer or to call the FindMaxMin() method until all the background threads have ended. This makes it awkward to accomodate both approaches in the same program.
What I'd do is to initialize arrayAll in button1_Click and convert it to listAll just before you call FindMaxMin():
Next I'd change MultiThreadingTask so that the main thread will block until all the background threads have finished:
Finally, you need to change LoadXY_values() to take account of the above changes:
answered one year ago by:
17279
thanks.
I try out both methods, LoadXY_values() and MultiThreadingTask(), for total_rows=8000000, both takes about the same time (59 seconds). I also tested with other total_rows numbers, seem to me that there is not much improvement in the time taken to execute for using MultiThreadingTask().
I actually comment out the last 2 lines, so that it is a fair comparison.
//listAll = arrayAll.ToList(); // convert to list here
//FindMaxMin();
answered one year ago by:
494
17279
Well, I notice that even using a single thread the time taken to load 8 million rows (the whole set?) is now much better than it was (5 minutes). Just as a matter of interest, how many cores do you have and which version of .NET are you using? .NET 4.0 has automatic data parallelization via PLINQ though it's unlikely to be appreciably faster than doing it yourself. There may also be a limitation in the MySql data provider which means that serving two concurrent readers on different parts of the same table is not much faster than serving a single reader on the whole table.
On Device Manager, I can see under Processors listed two Intel(R) Core(TM)2 CPU T7200 @ 2.00GHZ. when I type systeminfo at command prompt, it says:
OS Name: Microsoftr Windows VistaT Home Premium
OS Version: 6.0.6001 Service Pack 1 Build 6001
OS Manufacturer: Microsoft Corporation
OS Configuration: Standalone Workstation
OS Build Type: Multiprocessor Free
Registered Owner: ThiamHuat
System Type: X86-based PC
Processor(s): 1 Processor(s) Installed.
[01]: x64 Family 6 Model 15 Stepping 6 GenuineIntel ~ 996 Mhz
My .NET version is 3.5. and MySQL website says this:
MySQL is fully multi-threaded, and will make use of multiple CPUs, provided that the operating system supports them.
Now, I updated total rows to be 10 millions for the 4 cases, and the results is surprising. Multitasking does not seem to improve performance, but need more time. Both SQL Server 2008 and MySql are loaded with the same set of 10 millions rows of data.
Without MultiTasking:
using SQL Server 2008: total time taken to complete = 0 min 28 sec
using MySQL: total time taken to complete = 1 min 15 sec
With MultiTasking:
MultiTasking with SQL Server 2008: total time taken to complete = 0 min 34 sec
MultiTasking with MySQL: total time taken to complete = 1 min 18 sec
///////////////// code //////////////////////////////////////////////////////////
private void button1_Click(object sender, EventArgs e)
{
arrayAll = new XY[total_rows]; // initialize array here
DateTime startTime = DateTime.Now;
timer1.Start();
// LoadXY_values_MySql(0, total_rows);
// LoadXY_values_MSSqlServer2008(0, total_rows);
MultiThreadingTask_MySql();
// MultiThreadingTask_MSSqlServer2008();
timer1.Stop();
DateTime stopTime = DateTime.Now;
TimeSpan duration = stopTime - startTime;
textBox1.Text = String.Concat(duration.Minutes.ToString()," min ", duration.Seconds.ToString()," sec");
}
private void MultiThreadingTask_MSSqlServer2008()
{
int total_threads = 4;
// create and start 4 threads to load 2.5 million rows each
for (int i = 0; i < total_threads; i++)
{
Int32 j = i; // create local variable so each lambda expression captures a different instance
Thread t = new Thread(() => LoadXY_values_MSSqlServer2008(j * total_rows / total_threads, (j + 1) * total_rows / total_threads));
t.Start();
t.Join(); // blocks main thread until this thread finishes
}
}
private void MultiThreadingTask_MySql()
{
int total_threads = 4;
// create and start 4 threads to load 2.5 million rows each
for (int i = 0; i < total_threads; i++)
{
Int32 j = i; // create local variable so each lambda expression captures a different instance
Thread t = new Thread(() => LoadXY_values_MySql(j * total_rows / total_threads, (j + 1) * total_rows / total_threads));
t.Start();
t.Join(); // blocks main thread until this thread finishes
}
}
private void LoadXY_values_MySql(Int32 start, Int32 end)
{
MySqlConnection myConn = new MySqlConnection(conn);
myConn.Open();
MySqlCommand MySqlCmd = new MySqlCommand();
MySqlCmd.Connection = myConn;
MySqlCmd = new MySqlCommand("SELECT x_values, y_values from actual_data WHERE x_values > '" + start + "' AND x_values < '" + end + "'" , myConn);
MySqlDataReader reader = MySqlCmd.ExecuteReader();
while (reader.Read())
{
x_values = Convert.ToInt32(reader["x_values"]);
y_values = Convert.ToInt16(reader["y_values"]);
arrayAll[(Int32)x_values] = new XY(x_values, y_values);
}
reader.Close();
myConn.Close();
}
private void LoadXY_values_MSSqlServer2008(Int32 start, Int32 end)
{
string conn = "Data Source=127.0.0.1;Initial Catalog=XY_Timing_Error;Integrated Security=SSPI";
using (SqlConnection connection = new SqlConnection(conn))
{
SqlConnection myConn = new SqlConnection(conn);
myConn.Open();
SqlCommand SqlCmd = new SqlCommand();
SqlCmd.Connection = myConn;
SqlCmd = new SqlCommand("SELECT X_values, Y_values from XY_values WHERE X_values > '" + start + "' AND X_values < '" + end + "'", myConn);
SqlDataReader reader = SqlCmd.ExecuteReader();
while (reader.Read())
{
X_values = Convert.ToInt32(reader["X_values"]);
Y_values = Convert.ToInt16(reader["Y_values"]);
arrayAll[(Int32)X_values] = new XY(X_values, Y_values);
}
reader.Close();
myConn.Close();
}
}
answered one year ago by:
494
Although you appear to have two dual core processors, the results are similar to what you'd expect if you were trying to run multithreaded code on a single processor i.e. slower than the equivalent single threaded code because of the overhead of creating and managing multiple threads (each of which has its own 1MB stack) and 'context switching' between the threads.
It may therefore be that all four threads are being run on a single core of one of the processors. Now a problem here is that there doesn't appear to be a reliable method (prior to .NET 4.0 anyway) to force a thread to run on a particular core or processor, though you can try messing about with processor affinities and the like. Having said that, you'd generally expect the CLR and OS to schedule your threads to use the machine's capabilities as efficiently as possible and so the results are puzzling.
What you could try is to use the threadpool rather than creating your own threads. To do that you first need to change the MultithreadingTask_MySql() method (with similar changes to the MultiThreadingTask_MSSqlServer2008() method):
Secondly, you need to add this new method (and a similar new method for Sql Server 2008)
answered one year ago by:
17279
I have the following error:
Error 1 The call is ambiguous between the following methods or properties: 'Timing_Error.Form1.ThreadProc_MySql(object)' and 'Timing_Error.Form1.ThreadProc_MySql(object)' D:\Data\Timing_Error\Timing_Error\Form1.cs 105 39 Timing_Error
from the line
ThreadPool.QueueUserWorkItem(new WaitCallback(ThreadProc_MySql), state);
answered one year ago by:
494
17279
The error makes no sense because, even if you've inadvertently added two copies of the ThreadProc_MySql(object) method - perhaps forgetting to change the name when you were adding similar code for SQL Server 2008 - the compiler should give you a 'duplicate definition' error rather than an ambiguity error. However, as I can't think of anything else, I'd do a full search for ThreadProc_MySql within your project to see whether it throws up any evidence of duplication.