Wiki source code of Loading Musicbrainz in Elasticsearch
Last modified by Normann P. Nielsen on 2025/12/12 15:13
Show last authors
| author | version | line-number | content |
|---|---|---|---|
| 1 | I stumpled upon the [[INSERT INTO LOGSTASH SELECT DATA FROM DATABASE>>https://www.elastic.co/blog/logstash-jdbc-input-plugin]] article and decided to play around. | ||
| 2 | |||
| 3 | Installing PostgreSQL and loading the database and the data into Elasticsearch was just as described; the manual for MBSlave is very good. | ||
| 4 | |||
| 5 | Refer to the [[https://musicbrainz.org/doc/MusicBrainz_Database]] for more | ||
| 6 | |||
| 7 | |||
| 8 | {{success}}I decided to use elkserver3 and a new logstash on that one, to avoid messing elkserver1 up. | ||
| 9 | |||
| 10 | Is possible to have a different Logstash config on each server in the cluster. Also - I decided to name the Index "musicbrainz-%{+YYYY.MM.dd}" to have some control over the load and a possible cleanup afterwards.{{/success}} | ||
| 11 | |||
| 12 | |||
| 13 | |||
| 14 | |||
| 15 | My Input file: | ||
| 16 | |||
| 17 | |||
| 18 | {{code language="10-musicbrainz.conf"}} | ||
| 19 | input { | ||
| 20 | jdbc { | ||
| 21 | jdbc_driver_library => "/etc/logstash/postgresql-9.4.1212.jre6.jar" | ||
| 22 | jdbc_driver_class => "org.postgresql.Driver" | ||
| 23 | jdbc_connection_string => "jdbc:postgresql://localhost:5432/musicbrainz?user=musicbrainz&password=*******" | ||
| 24 | jdbc_user => "musicbrainz" | ||
| 25 | statement_filepath => "/etc/logstash/query.sql" | ||
| 26 | schedule => "0 15 * * *" | ||
| 27 | } | ||
| 28 | } | ||
| 29 | {{/code}} | ||
| 30 | |||
| 31 | Notice the Schedule - the Query runs one time each day at 15:00 - but the data is static, so its not nessesary....But the schedule makes sure I know when the Query is runned one time (only) | ||
| 32 | |||
| 33 | My Output file: | ||
| 34 | |||
| 35 | |||
| 36 | {{code}} | ||
| 37 | output { | ||
| 38 | |||
| 39 | elasticsearch | ||
| 40 | { | ||
| 41 | hosts => "localhost:9200" | ||
| 42 | sniffing => false | ||
| 43 | manage_template => false | ||
| 44 | index => "musicbrainz-%{+YYYY.MM.dd}" | ||
| 45 | } | ||
| 46 | } | ||
| 47 | {{/code}} | ||
| 48 | |||
| 49 | The load gave this in Kibana: | ||
| 50 | |||
| 51 | [[image:Screen Shot 2016-12-30 at 15.58.06.png||width="900"]] | ||
| 52 | |||
| 53 | And verifying the row count in PostgreSQL with: | ||
| 54 | |||
| 55 | |||
| 56 | {{code}} | ||
| 57 | SELECT count(*) AS Dummy FROM ( | ||
| 58 | SELECT | ||
| 59 | release_group.gid AS album_id, | ||
| 60 | release_group.type AS album_primary_type_id, | ||
| 61 | release_group_primary_type.name AS album_primary_type_name, | ||
| 62 | release.name AS release_name, | ||
| 63 | artist.name AS artist_name, | ||
| 64 | artist.gid AS artist_gid, | ||
| 65 | artist_credit.id AS artist_credit_id, | ||
| 66 | artist.type AS artist_type_id, | ||
| 67 | artist_type.name AS artist_type_name, | ||
| 68 | artist.begin_date_year artist_begin_date_year, | ||
| 69 | area.name AS artist_country_name, | ||
| 70 | release_country.date_year AS release_year, | ||
| 71 | release_country.date_month AS release_month, | ||
| 72 | release_country.date_day AS release_day | ||
| 73 | FROM | ||
| 74 | musicbrainz.artist | ||
| 75 | INNER JOIN musicbrainz.artist_credit_name | ||
| 76 | ON artist_credit_name.artist = artist.id | ||
| 77 | INNER JOIN musicbrainz.artist_credit | ||
| 78 | ON artist_credit.id = artist_credit_name.artist_credit | ||
| 79 | INNER JOIN musicbrainz.release_group | ||
| 80 | ON release_group.artist_credit = artist_credit.id | ||
| 81 | INNER JOIN musicbrainz.release | ||
| 82 | ON release.release_group = release_group.id | ||
| 83 | INNER JOIN musicbrainz.release_country | ||
| 84 | ON release.id = release_country.release | ||
| 85 | INNER JOIN musicbrainz.artist_type | ||
| 86 | ON artist.type = artist_type.id | ||
| 87 | INNER JOIN musicbrainz.area | ||
| 88 | ON artist.area = area.id | ||
| 89 | INNER JOIN musicbrainz.release_group_primary_type | ||
| 90 | ON release_group_primary_type.id = release_group.type | ||
| 91 | WHERE | ||
| 92 | ((release_country.date_year IS NOT NULL) AND | ||
| 93 | (release_country.date_month IS NOT NULL) AND | ||
| 94 | (release_country.date_day IS NOT NULL)) | ||
| 95 | ) As Dummy2 | ||
| 96 | {{/code}} | ||
| 97 | |||
| 98 | Gave: | ||
| 99 | |||
| 100 | |||
| 101 | {{code}} | ||
| 102 | dummy | ||
| 103 | -------- | ||
| 104 | 622527 | ||
| 105 | (1 row) | ||
| 106 | {{/code}} | ||
| 107 | |||
| 108 | Success - same row count :) | ||
| 109 | |||
| 110 | I do notice that some rows seems to be the same: | ||
| 111 | |||
| 112 | [[image:Screen Shot 2016-12-30 at 16.14.01.png||width="900"]] | ||
| 113 | |||
| 114 | Or not? The "album_id" is the same, but in one row the "release_year" differs from the two others.... | ||
| 115 | |||
| 116 | Running the SQL | ||
| 117 | |||
| 118 | |||
| 119 | {{code}} | ||
| 120 | SELECT Distinct * FROM ( | ||
| 121 | SELECT | ||
| 122 | release_group.gid AS album_id, | ||
| 123 | release_group.type AS album_primary_type_id, | ||
| 124 | release_group_primary_type.name AS album_primary_type_name, | ||
| 125 | release.name AS release_name, | ||
| 126 | artist.name AS artist_name, | ||
| 127 | artist.gid AS artist_gid, | ||
| 128 | artist_credit.id AS artist_credit_id, | ||
| 129 | artist.type AS artist_type_id, | ||
| 130 | artist_type.name AS artist_type_name, | ||
| 131 | artist.begin_date_year artist_begin_date_year, | ||
| 132 | area.name AS artist_country_name, | ||
| 133 | release_country.date_year AS release_year, | ||
| 134 | release_country.date_month AS release_month, | ||
| 135 | release_country.date_day AS release_day | ||
| 136 | FROM | ||
| 137 | musicbrainz.artist | ||
| 138 | INNER JOIN musicbrainz.artist_credit_name | ||
| 139 | ON artist_credit_name.artist = artist.id | ||
| 140 | INNER JOIN musicbrainz.artist_credit | ||
| 141 | ON artist_credit.id = artist_credit_name.artist_credit | ||
| 142 | INNER JOIN musicbrainz.release_group | ||
| 143 | ON release_group.artist_credit = artist_credit.id | ||
| 144 | INNER JOIN musicbrainz.release | ||
| 145 | ON release.release_group = release_group.id | ||
| 146 | INNER JOIN musicbrainz.release_country | ||
| 147 | ON release.id = release_country.release | ||
| 148 | INNER JOIN musicbrainz.artist_type | ||
| 149 | ON artist.type = artist_type.id | ||
| 150 | INNER JOIN musicbrainz.area | ||
| 151 | ON artist.area = area.id | ||
| 152 | INNER JOIN musicbrainz.release_group_primary_type | ||
| 153 | ON release_group_primary_type.id = release_group.type | ||
| 154 | WHERE | ||
| 155 | ((release_country.date_year IS NOT NULL) AND | ||
| 156 | (release_country.date_month IS NOT NULL) AND | ||
| 157 | (release_country.date_day IS NOT NULL)) | ||
| 158 | ) As Dummy2 | ||
| 159 | {{/code}} | ||
| 160 | |||
| 161 | Gave | ||
| 162 | |||
| 163 | |||
| 164 | {{code}} | ||
| 165 | 560155 rows | ||
| 166 | {{/code}} | ||
| 167 | |||
| 168 | So, there is a possible redundancy in the SQL provided from [[https://www.elastic.co/blog/logstash-jdbc-input-plugin]] | ||
| 169 | |||
| 170 | Reloading the data (after deleting the Index) gives: | ||
| 171 | |||
| 172 | [[image:Screen Shot 2016-12-30 at 17.08.50.png||width="900"]] | ||
| 173 | |||
| 174 | = Samples = | ||
| 175 | |||
| 176 | Here is a sample of all albums from "Denmark" with "Peter" in the Artist name: | ||
| 177 | |||
| 178 | [[image:Screen Shot 2016-12-30 at 16.20.58.png||width="900"]] | ||
| 179 | |||
| 180 | A few Visualizations.. | ||
| 181 | |||
| 182 | [[image:Screen Shot 2016-12-31 at 09.14.14.png||width="900"]] | ||
| 183 | |||
| 184 | = Whats Next.. = | ||
| 185 | |||
| 186 | Well, this is unfinished business...there are so much more data to combine...... | ||
| 187 | |||
| 188 | And another project could be parsing IMDB data ... [[http://www.imdb.com/interfaces]] |